The block device cache is causing kswapd thrashing, usually bringing the system to a halt. This problem has been reproduced on kernels as recent as 2.4.21.

In our application we deal with large (multi-GB) files on multi-CPU 4GB platforms. While handling these files, the block device cache allocates all remaining available memory (3.5G) up to the 4G physical limit. Once the block device cache has pegged the physical memory limit, it doesn't seem to manage its allocation of that memory well enough to prevent unnecessary page-swapping. Ultimately, thrashing takes over and the SYSTEM COMES TO A HALT.

After the application closes all files and exits, the cache maintains its allocation of this memory until either:

  1) the file is removed, or
  2) somebody requests more memory.

In the former case, used memory (top, /proc/meminfo) drops instantly to the amount used by all processes (sum of ps use). In the latter, memory use remains pegged and swapping typically remains a problem. There doesn't appear to be a timeout on the cache's allocation.

This problem is most noticeable when the (cached) files causing the problem are on a local disk.

Below is an example of a pseudo-idle system (only running 'du') which is affected by the thrashing problem. Both CPUs are 99% system, kswapd is at 99.9%, the load average exceeds 4 and is growing, and virtually all memory is consumed, although only 717,140K is reported to be used by "all" processes (using a sum of 'ps -aux' memory use).

      5:31pm  up 53 days, 11:28, 19 users,  load average: 4.64, 3.14, 2.14
    160 processes: 157 sleeping, 2 running, 0 zombie, 1 stopped
    CPU0 states:  0.1% user, 99.0% system,  0.0% nice,  0.2% idle
    CPU1 states:  0.1% user, 99.2% system,  0.0% nice,  0.0% idle
    Mem:  3928460K av, 3828808K used,   99652K free,       0K shrd,   26148K buff
    Swap: 4194224K av,  696384K used, 3497840K free                 2715008K cached

      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
        5 root      17   0     0    0     0 RW   99.9  0.0 218:52 kswapd

I have seen situations where the load average exceeds 12.0 (!), and others on a 4-CPU 64-bit 6GB machine (running 2.4.21[-4.EL]) where all four CPUs are at 100% system and the machine is page-swapping.

THIS PROBLEM IS READILY REPRODUCIBLE.

I have attached a tarball which includes the source to a file-seek test program (fst) and a simple memory reclamation program (reclaim).

fst can be used to generate large files (with seek behavior typical of our application, as seeking seems to aggravate the problem). When using fst (on a 4GB system), specify 'num_blks' to be 2,000,000 to 4,000,000, with mode = 1 (seek-updating enabled):

    fst 3000000 fst.out 1

This will create a file with 3,000,000 blocks of random size between 1 and 2048 bytes. Midway through creating fst.out, the block device cache should have allocated all of memory. If thrashing doesn't immediately occur, you can run multiple fst's to aggravate the problem.

reclaim can be used to illustrate that, with fst still running (and pegged), it is possible to manually reclaim/free the memory used by the block device cache, thereby eliminating the issues with kswapd, bdflush, kupdated, etc. But given that fst is still running, memory usage creeps back up, as expected.

This seems to be a fairly fundamental and substantial problem. Over time, rogue memory use by the block device cache simply creeps up and up toward the physical limit, and it becomes a problem more readily.

Can anyone provide a means to mitigate or eliminate this problem?
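
For anyone who wants to check the same accounting on their own machine, here is a rough sketch of how the "used" figure from the top output above splits into cache and everything else. It is only an illustration: the function names parse_meminfo and memory_breakdown are made up for this note, and the sample text is the Mem/Swap figures from the top output rearranged into the "Name: value kB" layout of /proc/meminfo, which is an assumption about how the numbers would appear there.

```python
import re


def parse_meminfo(text):
    """Return {field: kilobytes} for 'Name:  value kB' lines; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+):\s+(\d+)\s+kB\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def memory_breakdown(fields):
    """Split 'used' memory into buffers+cache and the remainder (all in kB)."""
    used = fields["MemTotal"] - fields["MemFree"]
    cache = fields["Buffers"] + fields["Cached"]
    return {"used_kb": used, "cache_kb": cache, "non_cache_kb": used - cache}


# Figures copied from the top output in this report, rewritten in
# /proc/meminfo style (the layout is an assumption, the numbers are not).
SAMPLE = """\
MemTotal:      3928460 kB
MemFree:         99652 kB
Buffers:         26148 kB
Cached:        2715008 kB
"""

if __name__ == "__main__":
    print(memory_breakdown(parse_meminfo(SAMPLE)))
```

With the figures above this gives 3,828,808K used, of which 2,741,156K is buffers plus cache, leaving 1,087,652K that is neither. On a live system the text of /proc/meminfo can be passed to parse_meminfo in place of SAMPLE.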

We've toyed with altering parameters to bdflush and the like, with no success.

Thanks for your help,

-chris

-----------------------------------------------------------------
 Chris M. Petersen                              cmp@synopsys.com
 Sr. R&D Engineer

 Synopsys, Inc.                                  o: 919.425.7342
 1101 Slater Road, Suite 300                     c: 919.349.6393
 Durham, NC 27703                                f: 919.425.7320
-----------------------------------------------------------------