Win16 Memory Management

I've recently been working with Classic Mac OS programming[0], and just that memory model (dealing with the lack of virtual memory by using opaque handles to memory that need to be locked when used) is painful enough[1]. Having to deal with segment addressing on top of that does not sound like fun. Thank god for the Motorola 68000!

[0]Made an AppleTalk chat client/server https://github.com/kalleboo/GlobalTalk-Chat

[1]The equivalent to HeapWalker I used was Metrowerks ZoneRanger, which was bundled with their compiler. It has a nice visualization of how fragmented the memory is https://bitbang.social/@kalleboo/116302075194704555

Sometimes I think that if it were the old days, I probably wouldn't have been able to program. These days we program on top of 64-bit virtual addresses, but how did developers do it back then?

Pretty good detail in this article! But what really surprises me is how some ideas just keep coming back.

When I wrote a binary translator, I ended up having to keep a translated return stack to optimize RET opcodes. That put me in exactly the same position as the Win16 kernel with regard to having to patch pointers (in case of Win16, just the segment part) on stack.

Of course I did not have the benefit of my guests calling a lock function, so I ended up having to run a garbage collection operation to determine which pointers are in use & take exceptions on now-invalidated segments. Lots of extra work that Windows didn't need: it's nice to be king :-)

In 1994 I was 2 years out of school. I'd written one windows shareware application and a whole lot of unix-y things. People were excited about the internet but most people didn't have access. Unix shell accounts via dialup were common though.

One day I was encouraged to write a Windows Sockets emulation layer for ordinary dial-up shell accounts like those offered by netcom. The idea was to allow the use of the recently released Mosaic browser without an actual internet connection. I figured sure, no problem. I'll use curl or some other tool in the shell account to do the actual fetching of URLs, transfer files over zmodem, and simulate all the tcp/ip calls in the DLL.

I couldn't even get started. The reason is that I couldn't understand how the different Windows applications could all share memory allocated at runtime in the winsock.dll.

I asked a highly experienced ex Microsoft person, and he just said what are you talking about. There's no API to allocate shared memory.

So I gave up. 6 months later someone else did it.

Around then I realized the truth: Windows 3.1 had no memory protection at all. Specifically all global variables in DLLs were shared by default. The hard part wasn't sharing memory among users of a DLL. If anything, the hard part was having good discipline to avoid sharing it.

Since I'd only used multiuser Unix in school, and I knew Windows supported multitasking (even if only the cooperative kind), I just couldn't wrap my head around the idea that a multitasking operating system could exist without memory protection.

If you think programming in Win16 (or whatever we want to call it) was hard, you should try teaching people to do it. I worked as a commercial trainer on C and Windows way back when - C and the Windows API were no bed of roses, but the different memory models were mind-numbing for us tutors and the poor punters, many of whom didn't know C!

Thank god for the 386.

> Exports are used for application code which is externally called.

This was the magic moment for me, learning Windows 3.0 programming. The idea that my program is no longer master of its world, but instead is just something that gets loaded and called by Windows.

I posted the same thing a few days ago:

https://news.ycombinator.com/item?id=48424862

I'll just stop posting on HN.

Good informative article.

Win16 programming was an important formative phase in my career. There is a lot of wisdom in old solutions to thorny problems, and knowing them often clues you in to how one may adapt them to today's problems. For example, when CPU+GPU programming appeared I immediately imagined CPU memory accessed with "near" pointers and GPU memory accessed with "far" pointers with a switch to a pseudo-segment register.

It also conditioned a programmer to learn about the various complexities involved and be careful in their programming, i.e. it taught you discipline. You understood your compiler, OS and hardware better and how to write code keeping them all in mind. For example, I often say my study of embedded programming started with Win16!

Another bit of cleverness was "thunking" between 16-bit and 32-bit code. Raymond Chen explains how it worked in "Why can’t you thunk between 32-bit and 64-bit Windows?" - https://devblogs.microsoft.com/oldnewthing/20081020-00/?p=20...

It wasn't really the processor architecture. Segmented addressing was actually fairly easy if the processor was used only in the way that protected mode was envisioned as working. As the headlined article observes, a lot of this stuff simply wasn't necessary in OS/2 1.x, even though that too had DLLs, callback window procedures, and the multiple tiny/small/medium/large/compact/huge memory models.

The differences were (a) that DOS+Windows was designed so that the same programs could run in both real mode, with overlaying, and 286 protected mode, with segmented virtual memory; and (b) that to really save on RAM DOS+Windows had ideas such as the data segments for DLLs being globally shared across all processes. These added all of the complications mentioned in the headlined article and more besides. It was the operating system, not the processor architecture.

Check the ID numbers (48410844 < 48424862) and bear in mind that Hacker News has this thing where sometimes submissions get re-cycled for attention. Yes, annoyingly it does seem to make the presented datestamps wrong.

I think it's just bad timing.

It is always a matter of luck.

People submit a lot of stuff all the time, and very few people go through "New", so a new submission probably has a very short lifetime before it is drowned by newer submissions.

A submission to survive most likely needs some initial push from non-organic voting.

It probably helps if you share your submission early with your colleagues and on other sites.

I've had this experience a few times so I don't post submissions anymore either (including one of my own articles being flagged despite over a hundred comments). I know people will say vote rigging doesn't happen on HN, but I think it's naïve to think any site on the internet is impervious to vote rigging.

Some of this was automatically handled by the compiler and wouldn't have been an issue. Current x86-64 ABIs, for instance, require function entry to use specific forms annotated by metadata, so that the stack can be walked for exception handling. Like the far entry here, this is invisible to most programmers -- the compiler does it for you.

Similarly, while locking and unlocking memory blocks is no longer generally a concern, most programs still deal with files, and graphics programs still have to call map/unmap functions to access graphics data. All the same tools apply -- helper functions/libraries, RAII, and leak/sanitizer tools to dynamically detect usage errors.

As someone who grew up coding after it was mostly 32-bit, I can't say this with certainty, but my gut feeling is that paradoxically you would have and it would've made you stronger.

It's easy when it's the only way to get things done. Think about how nobody who was learning programming before 2023 was seriously thinking "This would be so much easier if the computer wrote it all for me".

Memory mapping/bank switching was fairly common on 8-bit and 16-bit systems, where a small memory window was used to select different memory banks, allowing a program to access more memory in chunks.

Game consoles like NES, SNES and Game Boy had additional hardware built in the cartridge to support memory mapping/bank switching.

For PCs, EMS (expanded memory) provided a similar concept. It reserved a 64 kB window, divided into 16 kB pages, in the first 1 MB and allowed mapping up to 32 MB.
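As a rough illustration of the window idea, here is a minimal sketch in portable C that simulates a 64 kB frame of four 16 kB slots over a larger backing store. The `ems_sim` type and the `ems_*` function names are made up for this example; it models only the mapping arithmetic, not the real EMS driver interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EMS_PAGE_SIZE   16384u   /* one page: 16 kB */
#define EMS_FRAME_PAGES 4u       /* four slots make up the 64 kB window */

/* Hypothetical simulation state, not a real EMS structure. */
typedef struct {
    uint8_t *store;                     /* simulated expanded memory */
    size_t   logical_pages;             /* number of 16 kB pages in the store */
    size_t   mapping[EMS_FRAME_PAGES];  /* logical page visible in each slot */
} ems_sim;

/* Allocate a zeroed store of `pages` logical pages; all slots show page 0. */
static int ems_init(ems_sim *e, size_t pages)
{
    e->store = calloc(pages, EMS_PAGE_SIZE);
    if (e->store == NULL) return -1;
    e->logical_pages = pages;
    for (size_t i = 0; i < EMS_FRAME_PAGES; i++) e->mapping[i] = 0;
    return 0;
}

/* Make logical page `logical` visible in window slot `slot`. */
static void ems_map(ems_sim *e, size_t slot, size_t logical)
{
    assert(slot < EMS_FRAME_PAGES && logical < e->logical_pages);
    e->mapping[slot] = logical;
}

/* Translate a 16-bit offset inside the window into a pointer into the store. */
static uint8_t *ems_window(ems_sim *e, uint16_t offset)
{
    size_t slot = offset / EMS_PAGE_SIZE;
    return e->store + e->mapping[slot] * EMS_PAGE_SIZE + offset % EMS_PAGE_SIZE;
}

static void ems_free(ems_sim *e) { free(e->store); e->store = NULL; }
```

The program only ever uses 16-bit offsets into the window; which part of the larger memory it sees depends entirely on the current mapping, so data written through one slot reappears when the same logical page is mapped into another.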

16 bit programs used 16 bit addresses, generally speaking.

Even with 32bit systems where you’d want more than 4GB RAM, application software still had 32 bit addresses (and thus 4GB memory limit).

I think it was a lot more common for 8bit systems to allow for 16 bit addressing though.

It’s been a while though. So hopefully I’m not misremembering things.

I first found out about segmenting in 16 bit systems in 2016 by reading a lively explanation from an older edition of Duntemann's Assembly Language Step by Step (the newer editions focus largely on Linux and 32/64-bit systems).

Attention spans were longer.

You just had to live with the constraints.

It biased your selection of data structures and algorithms.

Max 64KB array size meant pointers to allocated structs and linked lists were much more popular back then versus 1 large array of structs.

The Win16 HANDLE memory allocation also meant you had to worry about how you handled structs which had pointers to other structs (a FAR ptr may not be a stable value, unless you locked the HANDLE for the duration of the allocation).
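One common way around unstable pointers is to link structs by index (or handle) instead of by address. The sketch below is only an analogy in portable C: a pool grown with `realloc` may move, much like a movable segment, yet the index links stay valid. The `node`, `list` and `list_*` names are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define NIL ((size_t)-1)   /* "no node" marker, playing the role of NULL */

typedef struct {
    int    value;
    size_t next;           /* index of the next node, not a pointer */
} node;

typedef struct {
    node  *nodes;          /* base address may change whenever the pool grows */
    size_t count, cap;
    size_t head;
} list;

static void list_init(list *l)
{
    l->nodes = NULL; l->count = 0; l->cap = 0; l->head = NIL;
}

/* Push a value at the front; returns 0 on success, -1 on allocation failure. */
static int list_push(list *l, int value)
{
    if (l->count == l->cap) {
        size_t cap = l->cap ? l->cap * 2 : 4;
        node *moved = realloc(l->nodes, cap * sizeof *moved);
        if (moved == NULL) return -1;
        l->nodes = moved;  /* the block may have moved; index links still hold */
        l->cap = cap;
    }
    l->nodes[l->count].value = value;
    l->nodes[l->count].next = l->head;
    l->head = l->count++;
    return 0;
}

/* Sum all values by following index links from the current base address. */
static long list_sum(const list *l)
{
    assert(l != NULL);
    long sum = 0;
    for (size_t i = l->head; i != NIL; i = l->nodes[i].next)
        sum += l->nodes[i].value;
    return sum;
}

static void list_free(list *l) { free(l->nodes); list_init(l); }
```

The trade-off is an extra base-plus-index computation on every access, in exchange for links that survive the block being relocated.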

Then you had to worry about stuff that no college programming book talked about (ignore the lack of error checking):

  char FAR *p;
  char FAR *mem = farmalloc(65536);

  for (p = &mem[65535]; p >= &mem[0]; p--) {
    dostuff(p);
  }

Welcome to an infinite loop...

You had to figure out so much on your own back then - and reinvent the wheel.

For me it is fascinating how today I can learn a foreign language, or how to code by interacting with the LLM.

I understood it as Windows developers had to manually deal with segment limitations since Windows supported running on pre-286 CPUs without protected mode (Wikipedia says Windows 1-3 all supported the 8088). OS/2 just made the 286 a minimum requirement so they could rely on a CPU with more modern features.

The 68k didn't come with an MMU like the 286 so MacOS couldn't rely on virtual memory like OS/2 did but at least the flat memory space meant you didn't have to juggle 64k segments

I think it'd be mixed.

I think the knowledge of underlying hardware is useful and good to know.

But also that sort of knowledge got dated pretty quickly in the early computer era. Further, the capabilities of things like optimizing compilers quickly got to a point where they'd outpace most hand written assembly. Today, it's basically just floating point operations where you can still do better than a compiler.

In the early days, you'd have the correct impression that the C compilers spat out utter garbage which was a lot slower than what you could hand craft. As optimization techniques got better and better, the work you did because the compiler was dumb ultimately would have gotten in the way.

Exactly. I'd argue that all those programming Gods are Gods because they went through that period. Whatever didn't kill them made them stronger. We should replicate that experience by deliberately writing in low level C and assembly for a few years.

> I think it was a lot more common for 8bit systems to allow for 16 bit addressing though.

The 6502 and Z80 could use 16 bit addressing to access up to 64 KB of memory. The 6502 had various other addressing systems, including iirc 8 bits, but none of them were wider than 16 bits.

You had to deal with two flavors of pointer, near and far. Far pointers came with a segment selector, for accessing more than 64k. Your choice of memory model influenced the defaults. You might use near pointers for internal references in a module, and far pointers for external references.

And the 32-bit 4GB limit was often really "just a bit under 2GB" depending on the hardware, OS, etc

Not really. 16-bit programs on x86 used 32-bit pointers (effectively 20-bit due to the segment mechanism).

8-bit microprocessors used 16-bit addresses.

  char FAR *p;
  char FAR *mem = farmalloc(65536);

  for (p = &mem[65535]; p >= &mem[0]; p--) {
    dostuff(p);
  }

Nice one.

To be fair to Windows, good C courses should still teach this, but I'm not sure if they do :-)

It's UB to set a pointer to before the first element of an array, or after the last element plus one. So, if it knows the call to farmalloc/malloc returns the start of an object, a modern C compiler on a modern architecture may, in principle, optimise the above to an infinite loop.

I've seen something similar on architectures (long ago) where a zero-bit-pattern pointer was a valid memory address you might actually access. Of course p-1 is not less than p when p is zero.
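For comparison, here is a sketch of two reverse-iteration idioms in standard C that never form a pointer before the start of the array, so they avoid both the far-pointer wraparound and the undefined behaviour mentioned above. The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Index form: `i-- > 0` compares first and then decrements, so the body
   sees i running from n-1 down to 0 and the loop stops cleanly. */
static void copy_reversed(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t out = 0;
    for (size_t i = n; i-- > 0; )
        dst[out++] = src[i];
}

/* Pointer form: start one past the end (which is allowed) and decrement
   before each use, so p never goes below mem. */
static long sum_backwards(const unsigned char *mem, size_t n)
{
    long sum = 0;
    for (const unsigned char *p = mem + n; p != mem; ) {
        --p;
        sum += *p;
    }
    return sum;
}
```

Both loops terminate by testing against the start of the array with `>` or `!=` before stepping, rather than relying on a `>=` test that can only fail after stepping out of bounds.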

I've been wondering about this lately. As a kid, I spent hour upon hour learning about computing: typing in Basic code from a magazine into a Commodore 64, playing with music on an Atari STe, learning my way around a DOS command line, dabbling with 3D modelling... just so much stuff that my own kids would never have the patience for.

I wonder if it's just that kids today (gods that makes me sound old!) are constantly surrounded by entertaining things to do - gaming, TV/films, music, social media.

I have been wondering how to train my 6-year-old son and myself to increase our attention spans.

Some rules are obvious -- cut off mobiles and pads completely (he doesn't have access to them, so that one is for me), sit in the library and study from books (I believe this is even possible for programming topics, as I can write on paper). Basically, cutting off everything electronic definitely helps -- even putting my phone in my bag improves productivity significantly.

But the problem is, my son is unruly. If I put him in the library, most likely he runs around and messes things up, and we end up leaving early without doing anything.

This has happened at least 4 times to my posts just last month.

But like real-mode Windows, the Macintosh OS was designed with small amounts of RAM. 32K limits pop up in various APIs. Handle memory allocation.

Not as much of a strait jacket as Windows segmented-memory programming, but compared to Unix, it did feel constricting.

Yes, outwith the idea of Family API programs (which couldn't use Presentation Manager and whatnot anyway) OS/2 1.x did target the 286 as a minimum. But that doesn't mean that DOS+Windows didn't use the features.

It did. It was bi-modal. There were at one point switches to the WIN command to tell it whether to come up in real mode or 286 protected mode. In the latter it definitely did use the features of protected mode.

It was the bi-modal nature that was the problem. Essentially, they had to design a whole layer that simulated when in real mode all of the load-on-demand stuff that the processor architecture supplied for free in 286 protected mode, and make it so that the thing would all work either way with no changes to applications.

Oh yeah. I had loads of 6502 and Z80 systems (still do in fact). Can’t believe I forgot about that!

Though in fairness, I do mostly now just use those systems to teach my kids BASIC

Well, most of the addressing modes of the Z80 used a 16-bit register pair (i.e. 0 to 64K-1 bytes) to address, the 6502 used a somewhat stranger set of addressing modes, but once again you could address 0 to 64K-1 bytes.

Psst! Let's blow their minds and tell them about the MC68008. (-:

3/1 split was common, though. Especially towards the end of 32-bit era.

I guess it was awkward to use languages higher level than assembly to write 16-bit programs that required more than 64KiB of memory. And also not quite portable, since they were all tied to the x86 CPU. Those were messy times I guess. A somewhat similar story was 32-bit PAE, where the CPU could address more than 4GiB of physical memory, but software was still 32-bit and virtual addresses were capped at 4GiB. Linus was right that you must have more virtual memory (preferably 10+ times more) than physical, otherwise you have to jump through hoops. https://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-ab...

I have been shielding my 6-year-old son from electronics, except 40 minutes of TV twice a week. I have no idea how to grow his patience and perseverance, though. He is like me, and I don't have a lot of patience to begin with, so I can't really guide him through some of the situations. We have been taking him to some activities as well as reading to him, but nothing really sticks.

I just hope eventually he loves reading and learns in a more traditional way instead of from laptops and pads.

I think that's actually a pretty accurate observation. I'm not a cognitive science expert, so I don't know the details, but there have been articles about 'popcorn brain' due to sustained attention issues, right? Personally, I use LLMs for coding quite often (in my environment, I'm often forced to use them). Compared to the past, when I use an LLM, the answers come immediately, so it seems harder to focus deeply than before. The generation younger than me, which is more focused on Shorts, probably has it even worse

I think it's an adaptation. Instead of living in a world with limited valuable information we're now living at the end of a firehose of never-ending near-useless information which has to be filtered at high speed.

That's correct - and I notice it in myself. There are just so many more things within reach at any point in time compared to our youth that it takes real effort to focus.

None of my college CS courses used programming languages that featured FAR pointers.

The above example would cause an infinite loop on Win16's seg:off far memory model, but compiling on Win32 would not cause an infinite loop.

The problem is that far pointer arithmetic only affects the offset, not the segment. Decrementing a zero offset just wraps around to 0xFFFF while the segment stays the same, so you go from mem[0] to mem[65535], not mem[-1].
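The wraparound can be simulated in portable C. This is a minimal sketch, assuming a 16:16 far pointer whose arithmetic and relational comparisons touch only the offset half; the `far_ptr` struct and helper names are hypothetical models, not real compiler types. Since the loop would never end, the simulation gives up after a caller-supplied limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 16:16 far pointer; not a real compiler type. */
typedef struct {
    uint16_t seg;
    uint16_t off;
} far_ptr;

/* Far-pointer decrement as described above: only the offset changes. */
static far_ptr far_dec(far_ptr p)
{
    p.off = (uint16_t)(p.off - 1u);   /* 0x0000 - 1 wraps to 0xFFFF */
    return p;
}

/* Relational comparison, assumed here to look at the offsets only. */
static int far_ge(far_ptr a, far_ptr b)
{
    return a.off >= b.off;
}

/* Run the loop from the thread against a block starting at `base` and
   count iterations, stopping after `limit` so the demo terminates. */
static unsigned long count_iterations(far_ptr base, unsigned long limit)
{
    assert(limit > 0);
    unsigned long n = 0;
    far_ptr p = base;
    p.off = (uint16_t)(base.off + 0xFFFFu);   /* p = &mem[65535] */
    while (far_ge(p, base) && n < limit) {
        n++;                                  /* stands in for dostuff(p) */
        p = far_dec(p);
    }
    return n;
}
```

With a block that starts at offset 0, every 16-bit offset is greater than or equal to the base offset, so the condition never becomes false and the count always reaches the limit, however large.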

> But the problem is, my son is unruly. If I put him in the library, most likely he runs around and messes things up, which ends up we leave early without doing anything.

Some potential ideas to explore. Take what you want, leave what you don't.

a) if you're training for attention span, make sure the target is appropriate and also within reach of your child.

b) have a plan for the visit: when I helped at a school library, classes for kids in your kid's age group would come in, the librarian would read them a story, then the kids would look for a book, check out at the desk and read (or look at the book anyway) quietly until the end of the visit. I think we'd get about 40 minutes for a visit. Most days, at least some of the kids would be getting antsy before it was time to go.

c) Plan around your kid's activity needs. Some kids will do long, still attention tasks better after doing some amount of physical activity. Some kids will do these kinds of things better after a meal. Some will do it better in the morning or the afternoon. Many kids will have a harder time if the library visit was a surprise. You know your kid, try to have your library visits when they're likely to work well. If he likes story time, try to visit when there's a story time available.

d) don't expect that you can both go to the library and work independently. You're going to the library with him, and he's going to need you to help him out for much of the time. But you might be able to find him a book together, then find you a book together, then sit down and read for a bit together.

e) if all you can get done is finding a book, no big deal. You can read at home too.

If a lion can figure out how to behave in the library, so can your kid ;) https://www.michelleknudsen.com/library_lion_77788.htm

> There were at one point switches to the WIN command to tell it whether to come up in real mode or 286 protected mode. In the latter it definitely did use the features of protected mode.

Windows 3.0’s WIN.COM supported:

/R for real mode (8086)

/S for standard mode (16-bit protected mode)

/E for 386 Enhanced Mode (32-bit virtual machine manager (VMM), running Windows in VM1, and DOS apps in VM2+)

Just curious how do you teach kids BASIC. I’m sure my 6 year old won’t sit tight.

"portable" used to mean "able to be ported" rather than the "comes automatically ported if you just change compiler options" that it means today

6 is pretty early to enjoy reading books, so I wouldn't worry.

We struggled to get our son into reading too, but he took straight away to comics, and from there he had a long stint with graphic novels (e.g. Percy Jackson, Artemis Fowl). You can get more mature graphic novels as they mature and progress, e.g. City of Dragons. And eventually he picked up an Alex Rider book, and hasn't stopped since. He's now how I remember myself as a kid - nose stuck in a book, completely engrossed!

This is a kind of knowledge base article which resulted from attempts to understand exactly how memory management works in 16-bit Windows. It is not exactly undocumented, but it is also not well documented; even before Windows 3.0 appeared, the assumption was that essentially all application developers were going to use a high-level language and their development tools would take care of the low-level details.

Furthermore, nearly all materials for beginning Windows developers focused on the more visible aspects of Windows programming, i.e. windows, icons, menus, and so on. Memory management was glossed over, even though it was absolutely critical to writing a solid Windows application any more complex than a Hello World program.

Windows 3.0 SDK HeapWalker memory analysis tool

The memory management details and mechanisms are rooted in the 8086 real mode history of Windows 1.x and 2.x, and much of the complexity persisted even when Windows only ran in protected mode starting with Windows 3.1.

Unless noted otherwise, in this article “Windows” refers to the 16-bit line of Microsoft products, not Windows NT.

Introduction to Windows Memory Management

The key to understanding Windows memory management is that from the very beginning, Windows was among other things a fancy overlay manager. For many years, Windows was too big for typical PCs of the time and needed some way to keep only the most active memory segments in physical RAM, with some mechanism to discard and reload less frequently needed segments on demand. Paging was obviously not used because there was no support for it in 8086 and 80286 systems (and before Windows 3.0, those were very nearly the entirety of the installed base).

In the simplest case of an application with one code segment and one data segment, the movable nature of Windows segments is almost entirely transparent. When the application is running, the CS (code) segment register points to the code segment and the DS (data) and SS (stack) segment registers point to the data segment. As long as the application only uses near calls/jumps within its code segment and near pointers to the data/stack segment, it does not care at all where exactly the segments are in memory, i.e. the actual values loaded into CS/DS/SS registers. Windows can move the segments around and everything will work fine.

But even beginning Windows programmers working through a Hello World style example very quickly start suspecting that life is not so simple in the land of 16-bit Windows. The window procedure must be declared as FAR PASCAL, which is fair enough given that it needs to conform to Windows calling conventions. But it also has to be exported from the application’s executable, otherwise the program won’t work properly. That is a concept entirely unfamiliar to non-Windows developers.

To help implement its memory management scheme, Windows adopted and extended the “New Executable” (NE) format first used by “DOS 4”, better known as Multitasking DOS 4.0 and significantly different from PC DOS and MS-DOS 4.0/4.01. Unlike the DOS MZ executable format where an application is effectively a single binary blob, the NE format is segment oriented and each segment is stored on disk separately. That gives Windows the ability to load (or reload) individual segments and move them around in memory.

The NE format also supports imports and exports. Imports are used when an application needs to call external code, such as the OS itself. Exports are used for application code which is externally called.
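To make the relationship between the two formats concrete: an NE file still begins with a DOS MZ stub, and the 32-bit little-endian field at offset 0x3C of that stub gives the file offset of the new header, which starts with the bytes "NE". The sketch below only locates the signature in an in-memory image and parses none of the segment, import or export tables; the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the file offset of the NE header, or -1 if the image is not
   an MZ stub followed by an "NE" signature. */
static long find_ne_header(const uint8_t *image, size_t size)
{
    assert(image != NULL);
    if (size < 0x40 || image[0] != 'M' || image[1] != 'Z')
        return -1;                           /* no DOS stub header */
    uint32_t off = read_le32(image + 0x3C);  /* offset of the new header */
    if (off > size || size - off < 2)
        return -1;                           /* header offset out of range */
    if (image[off] != 'N' || image[off + 1] != 'E')
        return -1;                           /* some other format, or plain MZ */
    return (long)off;
}
```

A loader would go on from that offset to read the segment table and the entry table, which is where the per-segment loading and the exports described above come from.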

A window procedure is one such externally called piece of code. It needs to be exported so that Windows can perform its magic on it. Said magic lets Windows fix up the window procedure prolog (entry sequence) so that it loads the application’s own data segment into the DS register.

Shifting Memory

Everything in Windows memory management revolves around segments, contiguous blocks of memory up to 64KB in size. In normal 8086 programming, each segment is identified by its segment address, which directly corresponds to its address in physical memory. Because most segments in Windows can be moved or discarded, they are instead identified by handles. A handle is a 16-bit value which should be considered opaque, even if it might actually be a simple index into some table.
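To make the link between a segment address and physical memory concrete, here is a minimal C sketch of the 8086 address calculation. The function names are made up for illustration and are not part of any Windows or compiler API.

```c
#include <assert.h>
#include <stdint.h>

/* Physical address an 8086 forms from a segment:offset pair.
   The segment counts 16-byte paragraphs; the result wraps at 1 MB
   because the 8086 has only 20 address lines. */
uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFu;
}

/* Segment value for a block that starts at the given physical address.
   A segment can only start on a paragraph boundary below 1 MB. */
uint16_t segment_for_linear(uint32_t linear)
{
    assert((linear & 0xFu) == 0 && linear < 0x100000u);
    return (uint16_t)(linear >> 4);
}
```

If a block is moved, its segment value changes while offsets within the block stay the same. That is why near pointers survive a move and the segment half of a far pointer does not.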

For programmers familiar with x86 protected mode, a Windows segment handle is a lot like a protected-mode selector: It is a 16-bit value which uniquely identifies a memory segment, but it is independent of the segment’s location in system memory. The similarity is not coincidental. Steve Wood, the designer of Windows 1.0 memory management, used the Intel 286 protected mode as inspiration [1] for the Windows memory manager (the 286 came out in 1982 and work on Windows started in 1983).

A handle refers to a memory segment regardless of where it is in memory, i.e. regardless of what its 8086 segment address is. The GlobalAlloc API allocates contiguous memory from the global heap (possibly more than 64K) and returns a segment handle.

Since the 8086 does not support protected mode, approximating protected-mode functionality takes quite a bit of extra work and discipline. Given that a handle is not a segment address, it can’t be used as the segment portion of a far 16:16 pointer. To address anything in another segment, an application needs to form a far pointer.

To that end, the application needs to call the GlobalLock API which returns a segment address and locks the segment in memory (increments its lock count). While locked, the segment won’t be moved and its segment address will stay valid.

Once it is done accessing memory in the segment, the application calls GlobalUnlock. That decrements the segment’s lock count and once the count drops to zero, the segment may be moved again.

Needless to say, after calling GlobalUnlock, the segment address returned by GlobalLock must be considered invalid. Note that this is a possible source of sneaky bugs—after calling GlobalUnlock, the segment most likely won’t move immediately. An application might erroneously access a previously locked segment after unlocking it and not cause any obvious harm.

Indeed Windows won’t move or discard a segment unless it has to, because it may well be used again. However, once segments are unlocked, Windows may move them around or discard them at any moment.
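The handle, lock and unlock discipline described above can be modeled in portable C. What follows is a toy sketch, not the Windows implementation: the toy_* names, the fixed-size arena and the bump allocation are all assumptions made for illustration, and the real GlobalAlloc takes flags and manages a far more elaborate heap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARENA_SIZE 256          /* bytes in the toy "global heap" */
#define MAX_BLOCKS 8

typedef struct {
    int      in_use;
    uint16_t offset;            /* current position; changes on compaction */
    uint16_t size;
    unsigned lock_count;        /* block may move only while this is zero */
} Block;

static uint8_t  arena[ARENA_SIZE];
static Block    blocks[MAX_BLOCKS];
static uint16_t arena_top;      /* first unused byte above all blocks */

/* Allocate at the top of the arena; returns a nonzero handle, or 0. */
int toy_alloc(uint16_t size)
{
    if (size == 0 || size > ARENA_SIZE - arena_top)
        return 0;
    for (int i = 0; i < MAX_BLOCKS; i++) {
        if (!blocks[i].in_use) {
            blocks[i].in_use = 1;
            blocks[i].offset = arena_top;
            blocks[i].size = size;
            blocks[i].lock_count = 0;
            arena_top = (uint16_t)(arena_top + size);
            return i + 1;       /* the handle is an index, not an address */
        }
    }
    return 0;
}

static Block *block_of(int handle)
{
    assert(handle >= 1 && handle <= MAX_BLOCKS && blocks[handle - 1].in_use);
    return &blocks[handle - 1];
}

/* Pin the block and return its current address. */
uint8_t *toy_lock(int handle)
{
    Block *b = block_of(handle);
    b->lock_count++;
    return arena + b->offset;
}

/* Drop one lock; at zero, the address from toy_lock may go stale. */
void toy_unlock(int handle)
{
    Block *b = block_of(handle);
    assert(b->lock_count > 0);
    b->lock_count--;
}

void toy_free(int handle)
{
    block_of(handle)->in_use = 0;   /* leaves a hole until compaction */
}

/* Slide unlocked blocks toward the bottom, lowest address first.
   Locked blocks stay where they are. */
void toy_compact(void)
{
    int done[MAX_BLOCKS] = {0};
    uint16_t dest = 0;
    for (;;) {
        int next = -1;
        for (int i = 0; i < MAX_BLOCKS; i++)
            if (blocks[i].in_use && !done[i] &&
                (next < 0 || blocks[i].offset < blocks[next].offset))
                next = i;
        if (next < 0)
            break;
        Block *b = &blocks[next];
        if (b->lock_count == 0 && b->offset != dest) {
            memmove(arena + dest, arena + b->offset, b->size);
            b->offset = dest;
        }
        dest = (uint16_t)(b->offset + b->size);
        done[next] = 1;
    }
    arena_top = dest;
}
```

In this model, a pointer kept after toy_unlock keeps "working" until the next toy_compact moves the block. That mirrors the sneaky bug described above: nothing fails at the moment of the mistake.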

Now let’s take a closer look at the possible segment types.

Segment Flags

Windows segments have several important attributes which determine how they’re treated by the Windows memory manager.

Segments can be fixed or movable. The names are clear enough; movable segments can be shuffled around by Windows as long as they’re not locked, while fixed segments stay in place. For example, segments which hold interrupt handler routines must be fixed so that interrupt vectors stay valid. Ideally most of an application’s code and data segments would be movable, giving Windows an opportunity to efficiently manage memory. The ability to move segments is necessary because freeing or discarding segments creates “holes” in memory, potentially quickly fragmenting memory. Windows needs to be able to compact segments by moving them in order to consolidate free memory into one or more larger chunks.

Segments can also be discardable or nondiscardable. Code segments are typically discardable because they aren’t writable. If an unused code segment is removed and later needed again, Windows can easily reload it from the original executable. The same is true of resources which are also read-only. Data segments, on the other hand, tend to be non-discardable because they’re usually writable and once they’re modified, they cannot just be reloaded from disk. That said, applications might allow writable data segments to be discardable if they are willing to re-create their contents in case the segment is needed again after having been discarded.

DLLs

Dynamic linking was not yet a widespread technique in the mid-1980s and Microsoft Windows was one of the first systems with support for dynamically linked libraries (DLLs), also called shared libraries. While some larger systems used dynamic linking since the 1970s, UNIX systems only started introducing shared libraries in the mid to late 1980s.

Windows DLLs are NE format images just like Windows applications, but DLLs are not applications. DLLs cannot be executed directly, only loaded and called into by other processes (tasks in Windows parlance). The bulk of Windows was in fact implemented as DLLs (KERNEL, USER, GDI).

DLLs export routines (entry points) that are callable by applications. Applications can be linked against DLLs at link time, with imports referring to DLL names and entry points. DLLs can also be loaded entirely dynamically, and their entry points can be queried by ordinal (number) or by name.

Note that unlike UNIX systems, Windows never had a global name space for dynamic symbol resolution. Symbols from DLLs were always imported first by module name and then by name or ordinal. The two-level name space takes slightly more effort to manage but avoids name collisions, such that if two DLLs export a symbol named Alloc, there is no confusion as to which one is needed because the module name distinguishes between the two. And of course without the two-level name space, imports by ordinal (which are slightly faster and consume less memory) would have been completely impractical.
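A small C sketch can illustrate the two-level lookup. The module names LIBA and LIBB, their export tables and the entry values are invented for the example, and the dense ordinal table is an assumption of this sketch rather than a description of the NE entry table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned    ordinal;    /* 1-based export number within its module */
    const char *name;
    unsigned    entry;      /* stand-in for the entry point address */
} Export;

typedef struct {
    const char   *name;
    const Export *exports;
    size_t        count;
} Module;

/* Two hypothetical modules that both export a symbol named Alloc. */
static const Export liba_exports[] = { {1, "Alloc", 0x1000}, {2, "Free",  0x1040} };
static const Export libb_exports[] = { {1, "Init",  0x2000}, {2, "Alloc", 0x2080} };

static const Module modules[] = {
    { "LIBA", liba_exports, 2 },
    { "LIBB", libb_exports, 2 },
};

static const Module *find_module(const char *name)
{
    for (size_t i = 0; i < sizeof modules / sizeof modules[0]; i++)
        if (strcmp(modules[i].name, name) == 0)
            return &modules[i];
    return NULL;
}

/* Level one: pick the module. Level two: search its names. 0 = not found. */
unsigned resolve_by_name(const char *module, const char *symbol)
{
    const Module *m = find_module(module);
    if (m == NULL)
        return 0;
    for (size_t i = 0; i < m->count; i++)
        if (strcmp(m->exports[i].name, symbol) == 0)
            return m->exports[i].entry;
    return 0;
}

/* With dense ordinals the second level is a plain array index,
   with no string comparisons and no name storage needed. */
unsigned resolve_by_ordinal(const char *module, unsigned ordinal)
{
    const Module *m = find_module(module);
    if (m == NULL || ordinal == 0 || ordinal > m->count)
        return 0;
    assert(m->exports[ordinal - 1].ordinal == ordinal);
    return m->exports[ordinal - 1].entry;
}
```

Because an ordinal only has meaning inside one module, import by ordinal presupposes that the module has been named first.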

One key difference between applications and DLLs that is relevant to Windows programming is that DLLs have no stack of their own and always run with the stack of their caller. Although DLLs almost always have their own data segment, it is different from the stack segment, i.e. SS != DS.

This difference means that DLLs must be built differently from applications. The compiler must be told to generate code for DLLs, or more specifically, told that it cannot assume DS and SS registers address the same memory.

In the early days of Windows, the prolog and epilog for DLL entry points were the same as the application prolog/epilog. Compiler writers eventually figured out that the prolog for applications can be simplified, because SS equals DS. But that is not the case for DLLs, and DLLs still need to use the old style “fat” prologs that the Windows module loader needs to patch up.

Secret Switches

Microsoft C supported Windows development from its earliest days, i.e. version 3.0 (earlier Microsoft C versions were rebranded third-party products; Microsoft C 3.0 was the first C compiler developed by Microsoft, initially for XENIX and DOS).

However, for many years, this support was almost secret. The Windows specific switches were completely omitted from compiler documentation, or they were listed but users were referred to the Windows SDK. That was the case up to and including Microsoft C 5.1, which documents the fact that the /Gw and /Aw switches exist, but does not explain what they do and how to use them, instead referring to the Windows SDK documentation. This perhaps neatly illustrates the somewhat incestuous relationship between the Windows development group and the Microsoft languages group.

Since Microsoft C 3.0 (1985), the compilers had the /Aw and /Gw switches (and also the /Au switch).

The /Aw switch is a memory model modifier and specifies that SS != DS, but DS should not be reloaded at function entry (because Windows takes care of that). The /Aw switch is meant to be used when generating DLLs.

The /Gw switch generates Windows prologs and epilogs for far functions. It is required for exported functions located in both applications and DLLs, and it is very much a Windows specialty.

Windows Prologs and Epilogs

So what exactly do those Windows specific function prologs and epilogs look like? Everything is spelled out in the CMACROS.INC file shipped with the Windows SDK. Unfortunately CMACROS.INC is a jumble of MASM conditionals, nearly impossible for humans to read. It’s much easier to see what code the C compiler produces, or what exactly assembly code using CMACROS.INC turns into.

Here’s what Microsoft C 3.0 generates, as shown by a listing file the compiler produces, with added comments:

PUBLIC	Proc

Proc PROC FAR
*** 000 1e push ds ; almost
*** 001 58 pop ax ; no-op
*** 002 90 xchg ax,ax ; NOP
*** 003 45 inc bp ; marker
*** 004 55 push bp ; save BP
*** 005 8b ec mov bp,sp
*** 007 1e push ds
*** 008 8e d8 mov ds,ax ; reload DS
; Line 4
*** 00a 8b 46 06 mov ax,[bp+6]
*** 00d 03 46 08 add ax,[bp+8]
*** 010 83 ed 02 sub bp,2
*** 013 8b e5 mov sp,bp
*** 015 1f pop ds
*** 016 5d pop bp ; restore BP
*** 017 4d dec bp ; recover value
*** 018 cb ret
Proc ENDP

First of all, note that the prolog seemingly spends a lot of instructions on doing very little real work. It pushes DS, moves it to AX, and then moves AX to DS after saving DS. It also increments BP before pushing it on the stack, and decrements it again after popping.

All in all, seemingly a lot of effort for nothing. But that’s actually the point: The Windows prolog and epilog code is meant to be harmless when it is not needed.

If the function is in fact exported from a Windows NE module, the Windows loader will patch the first three bytes to load the module’s default data segment into AX. Here’s what it looks like in SYMDEB, taken from a random GDI function:

_TEXT:SELECTOBJECT:
5BC1:1840 B80591         MOV    AX,9105
5BC1:1843 45             INC    BP
5BC1:1844 55             PUSH   BP
5BC1:1845 8BEC           MOV    BP,SP
5BC1:1847 1E             PUSH   DS
5BC1:1848 8ED8           MOV    DS,AX
5BC1:184A 83EC04         SUB    SP,+04

In the above case, 5BC1h is the GDI module’s _TEXT code segment, and 9105h is the default data segment of the GDI module.

The Windows memory manager keeps the prolog updated such that if the data segment moves, the exported functions that refer to it get fixed up again to point to the new address.
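The effect of the patch can be shown on a plain byte buffer. This is a sketch only: the function name is made up, and the check for the unpatched byte pattern is an assumption about how a loader might recognize a patchable prolog. Only the three-byte rewrite itself comes from the listings above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rewrite the first three prolog bytes to MOV AX,data_seg (B8 lo hi).
   Accepts the compiler-generated PUSH DS / POP AX / NOP sequence
   (1E 58 90) or an earlier patch (B8 xx xx), so the same routine can
   be applied again after the data segment has moved.
   Returns 1 if the prolog was patched, 0 if it was left alone. */
int patch_prolog(uint8_t *code, uint16_t data_seg)
{
    assert(code != NULL);
    int unpatched = code[0] == 0x1E && code[1] == 0x58 && code[2] == 0x90;
    int patched   = code[0] == 0xB8;
    if (!unpatched && !patched)
        return 0;
    code[0] = 0xB8;                       /* MOV AX,imm16 opcode */
    code[1] = (uint8_t)(data_seg & 0xFF); /* immediate, low byte first */
    code[2] = (uint8_t)(data_seg >> 8);
    return 1;
}
```

Applied to the Microsoft C 3.0 prolog bytes with a data segment of 9105h, this yields B8 05 91, matching the SYMDEB output. The remaining instructions (INC BP, PUSH BP and so on) are untouched, which is why the same compiled code works whether or not it is patched.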

Note that the NODATA keyword in a Windows .DEF file tells Windows not to patch the function prolog. This is necessary in situations where e.g. an exported entry point simply jumps to another exported function, or if the function has no need to access the data segment.

Now, what about that BP incrementing and decrementing? Windows depends on being able to walk the stack, and therefore applications and libraries must keep the stack frames in a format that Windows will understand.

When the Windows memory manager moves around segments, it must know whether they are referenced in stack frames that are already pushed on the stack. For example, if Windows tries to move a code segment that directly or indirectly called into the currently executing code, it has to either detect the situation and not move the segment, or move it and adjust the stack. What Windows cannot do is move the segment and leave the stack as is. The same is true for default data segments.

Non-default data segments are not a problem because they are either locked and cannot move, or are unlocked and therefore correctly written Windows applications do not keep any pointers into such segments.

Incrementing BP before pushing serves an important purpose: It tells Windows that the BP value was pushed by a far function, i.e. there will be both an offset and a segment on the stack. Obviously, for this scheme to work, stacks must always be word-aligned. Fortunately Windows ensures that they are aligned initially, and it takes some effort to misalign them (because there’s no easy way to push an odd number of bytes on the stack).
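The odd-BP convention can be illustrated with a simulated stack of 16-bit words. This is a sketch under stated assumptions: the function name is hypothetical, the frame chain is assumed to end with a saved BP of zero, and only return segments are patched. The real Windows stack walker has to handle more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk the chain of saved BP values and replace each far return segment
   equal to old_seg with new_seg. stack[] holds 16-bit words and bp is a
   byte offset into it. Frame layout, as set up by the prologs above:
     [bp+0] saved BP (odd if pushed by a far function)
     [bp+2] return offset
     [bp+4] return segment (far frames only)
   Returns the number of return addresses patched. */
int patch_far_returns(uint16_t *stack, size_t words, uint16_t bp,
                      uint16_t old_seg, uint16_t new_seg)
{
    int patched = 0;
    while (bp != 0) {
        /* Word alignment is what makes the low bit usable as a marker. */
        assert((bp & 1) == 0 && (size_t)bp / 2 + 2 < words);
        uint16_t saved_bp = stack[bp / 2];
        if (saved_bp & 1) {                     /* far frame */
            uint16_t *ret_seg = &stack[bp / 2 + 2];
            if (*ret_seg == old_seg) {
                *ret_seg = new_seg;
                patched++;
            }
        }
        bp = saved_bp & 0xFFFEu;                /* strip the marker bit */
    }
    return patched;
}
```

The marker matters because a near frame has no segment at [bp+4]. Whatever word sits there (an argument or a local) must not be touched, even if its value happens to equal the segment being moved.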

Comparison with OS/2

It is instructive to compare 16-bit Windows with 16-bit OS/2. The two systems were in many ways very close relatives. Both used the same executable format (NE) with only minor differences. Both used segment-based memory management. Both used the same development tools from Microsoft.

By virtue of using protected mode, OS/2 required less cooperation from the programmer. In protected mode, a segment selector was at the same time the equivalent of a Windows handle and a segment address. Programmers therefore did not need to bother with carefully locking and unlocking segments.

OS/2 applications also did not require any special prolog and epilog code for externally callable functions, and there was no need to explicitly export window procedures etc. from the NE module; there was also no equivalent of (and no need for) MakeProcInstance. In other words, the OS did not need to unwind application stacks, and it didn’t need to patch entry points.

Thanks to the 80286 memory management hardware, segments could be moved, discarded, and reloaded entirely behind an application’s back. There was no need for GlobalLock/GlobalUnlock, eliminating a source of programming errors.

Like Windows DLLs, OS/2 DLL entry points did need a special prolog to set the DS register to the DLL’s data segment, but on OS/2 no special support from the OS was needed. And of course OS/2 DLLs likewise had to be built with the /Aw switch or equivalent, indicating that SS != DS.

Overall, the 286 hardware did a lot of the heavy lifting, and memory management was less work (with less room for bugs) for both the OS and the programmer.

Testing

The Windows SDK provided tools designed to stress the Windows memory management. For example, errors related to incorrect segment locking/unlocking will not show up if there is no memory pressure and the mismanaged segment stays in place. Such bugs can remain hidden and in the worst case, only manifest under difficult-to-reproduce scenarios.

The SHAKER tool in the Windows 1.0 SDK was used to “shake” memory and force segments to be discarded and moved around. This was intended to stress the memory management and reveal memory management bugs which would remain dormant under typical conditions.

Shaker and HeapWalker tools in Windows 1.x SDK

Another tool was HEAPWALK, primarily a diagnostic utility capable of displaying the currently allocated segments and their owners. However, HEAPWALK was also able to allocate all available memory and free it up in 1K increments, simulating low memory conditions.

The Windows 3.0 SDK version of Shaker

Shaker and HeapWalker were still shipped with the Windows 3.0 SDK, not least because Windows 3.0 running in Real mode was minimally different from Windows 1.0 as far as memory management was concerned.

These tools were necessary because although the memory management in Windows was sophisticated, the hardware to back it was lacking (certainly before Windows 3.0 running in protected mode). Instead of letting the hardware catch errors like attempts to access unallocated memory, programmers had to use specialized tools to try and induce errors and hope that bugs would manifest in visible ways. This was not an exact science because in the 8086 architecture, every memory address was valid, and reads and writes always succeeded.

The Windows 3.1 SDK replaced the Shaker tool with Stress, a new utility which was designed to test application behavior under low-resource conditions — limited memory in various Windows internal heaps, running out of disk space, running out of file handles, etc.

The Windows 3.1 SDK Stress tool

Since Windows 3.1 only ran in protected mode, some of the earlier memory management issues were no longer applicable, but low-resource conditions were as relevant as ever.

Summary

16-bit Windows introduced a fairly sophisticated memory management system. Due to lack of hardware support, significant discipline was required on the part of application programmers. If the wrong compiler switches were used, or functions weren’t properly exported, or segments were not correctly locked and unlocked… all bets were off.

References

1. Peter Norton’s Windows 3.0 Power Programming Techniques, Peter Norton and Paul Yao, 1990, page 613.
