Finding NT Memory Leaks
When a system is out of memory, any program that tries to do work will run extremely slowly, because it has to use the swap file on disk in place of memory. For example, beta test versions of Microsoft's IIS have often leaked memory, so eventually the system would slow to a crawl, and all programs would have trouble running and would use an excessive amount of CPU time. In almost all cases, when ListManager uses all the CPU power of the machine, it is because some other program has been leaking memory and the NT system has no memory left.
A few people have asked us about memory leaks and long-term stability on NT, so we thought we'd share what we've found.
One thing we've found is that Windows NT is not very good at reporting which applications are leaking memory. Some very early versions of Lyris’s ListManager products slowly leaked memory over time, but none of NT's process-watching tools reported the "lyris" process as growing. Recent versions of Lyris products no longer leak memory; the Lyris programming staff checks for this carefully.
What we found out is that the important number on Windows NT is the "committed bytes" that Performance Monitor reports. This is the total amount of memory being used on your system by all programs, including the operating system.
When ListManager was leaking memory, we found that the "committed bytes" would rise steadily until NT stopped functioning reliably. When the "committed bytes" figure was very large, NT might report a "quota limit" error, new programs could be kept from starting, and new ListManager threads could be kept from starting. IIS might start reporting CGI and permission errors.
Since then, we've found that various other programs on NT do leak memory, and that the only technique we've found to detect this is to watch the "committed bytes". The leaked memory builds up over time, until NT runs out of memory and the system stops running reliably.
Our recommendation, then, for anyone who experiences trouble with NT stability over time, is to do this:
* Run "perfmon.exe".
* Create a chart graphing "memory/committed bytes".
* Slow the charting rate down to one update every 5000 seconds, so that long-term trends are visible.
* Keep perfmon running in a corner of the screen.
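The trend the chart is meant to reveal can also be checked numerically. The sketch below is only an illustration: it assumes you have written down "committed bytes" readings by hand (or exported them some other way) as pairs of time in seconds and bytes. The function names and the growth threshold of about 1 MB per day are our own assumptions, not anything perfmon provides.

```python
from typing import Sequence, Tuple

SECONDS_PER_DAY = 86400


def committed_bytes_slope(samples: Sequence[Tuple[float, float]]) -> float:
    """Return the least-squares slope of the readings, in bytes per second.

    samples is a sequence of (time_in_seconds, committed_bytes) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    return cov / var_t


def looks_like_leak(samples: Sequence[Tuple[float, float]],
                    min_growth_bytes_per_day: float = 1_000_000) -> bool:
    """Guess whether committed bytes are rising steadily.

    The threshold is an arbitrary, hypothetical default; pick one that
    suits the normal fluctuation on your own system.
    """
    growth_per_day = committed_bytes_slope(samples) * SECONDS_PER_DAY
    return growth_per_day >= min_growth_bytes_per_day


if __name__ == "__main__":
    # Made-up readings taken once a day for three days.
    readings = [(0, 40e6), (86400, 45e6), (172800, 50e6)]
    print(looks_like_leak(readings))
```

A fitted slope is used rather than a simple first-versus-last comparison so that a single spike during a busy period is less likely to be mistaken for a leak.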
If your "committed bytes" rises over time (over several days) and does not go back down, speed the charting rate back up to once per second, then shut down, one by one, each process or service running on your system. When the program causing the problem is terminated, you should see a big drop in your "committed bytes".
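If you note the "committed bytes" value just before and just after stopping each process, picking out the culprit is simple arithmetic. The following is a minimal sketch under that assumption; the function name and the service names in the example are hypothetical, and the readings are made up.

```python
from typing import List, Tuple


def biggest_drop(readings: List[Tuple[str, float, float]]) -> Tuple[str, float]:
    """Return (name, bytes_freed) for the shutdown that freed the most memory.

    readings is a list of (name, committed_before, committed_after) tuples,
    one per process or service that was stopped.
    """
    if not readings:
        raise ValueError("no readings supplied")
    name, before, after = max(readings, key=lambda r: r[1] - r[2])
    return name, before - after


if __name__ == "__main__":
    # Hypothetical notes taken while stopping services one at a time.
    notes = [
        ("service_a", 120e6, 118e6),
        ("service_b", 118e6, 60e6),
        ("service_c", 60e6, 59e6),
    ]
    name, freed = biggest_drop(notes)
    print(name, freed)
```

A small drop is expected from every process you stop, since each one uses some memory legitimately; the leaking one should stand out by a wide margin.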
If you cannot make the "committed bytes" go down, your problem might be with IIS, or with some other OS-integrated service that won't free its memory even when stopped. Try upgrading the service, applying a service pack, or swapping components (try a different web server for a few days, for example). Once the "committed bytes" stops going up over time, you have most likely fixed the problem.
In a chart of perfmon.exe running on clio.lyris.net (our test server), with 3 days' worth of data on-screen, we charted both the "committed bytes" of the system and the "private bytes" of the ListManager executable. In this chart, ListManager averages about 6 MB of RAM, occasionally spiking to 11 MB during heavily loaded times. The "committed bytes", the total amount of memory used on the system, stays around 40 MB.