The block device cache is causing kswapd thrashing, usually bringing
the system to a halt.

This problem has been reproduced on kernels as recent as 2.4.21.

In our application we deal with large (multi-GB) files on multi-CPU
4GB platforms.  While handling these files, the block device cache
allocates all remaining available memory (3.5G) up to the 4G
physical limit.

Once the block device cache has pegged the physical memory limit,
it doesn't seem to manage its allocation of that memory well enough
to prevent unnecessary page-swapping.  Ultimately, thrashing takes
over and the SYSTEM COMES TO A HALT.

After the application closes all files and exits, the cache maintains
its allocation of this memory until either: 1) the file is removed,
or 2) somebody requests more memory.  In the former case, used memory
(top, /proc/meminfo) drops instantly to the amount used by all
processes (sum of ps use).  In the latter, memory use remains pegged
and swapping typically remains a problem.  There doesn't appear to be
a timeout on the cache's allocation.

This problem is most noticeable when the (cached) files causing the
problem are on a local disk.

Below is an example of a pseudo-idle system (only running 'du')
which is affected by the thrashing problem.  Both CPUs are 99% system,
kswapd is at 99.9%, the load average exceeds 4 and is growing, and
virtually all memory is consumed, although only 717,140K is reported
to be used by "all" processes (using a sum of 'ps -aux' memory use).

  5:31pm  up 53 days, 11:28, 19 users,  load average: 4.64, 3.14, 2.14
160 processes: 157 sleeping, 2 running, 0 zombie, 1 stopped
CPU0 states:  0.1% user, 99.0% system,  0.0% nice,  0.2% idle
CPU1 states:  0.1% user, 99.2% system,  0.0% nice,  0.0% idle
Mem:  3928460K av, 3828808K used,   99652K free,       0K shrd,   26148K buff
Swap: 4194224K av,  696384K used, 3497840K free                 2715008K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
    5 root      17   0     0    0     0 RW   99.9  0.0 218:52 kswapd
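
The gap between "used" memory and what processes actually hold can
be checked from /proc/meminfo.  Here is a minimal sketch in C,
assuming the MemTotal, MemFree, Buffers and Cached lines are present;
the helper names meminfo_kb and used_excluding_cache_kb are
hypothetical and not part of the attached tarball.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the value (in kB) of a "Key:   12345 kB" line in
 * /proc/meminfo-style text, or -1 if no line starts with "Key:". */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen;
    const char *p;

    assert(text != NULL && key != NULL);
    klen = strlen(key);
    p = text;
    while (p != NULL && *p != '\0') {
        /* Match only at the start of a line, so "Cached" does not
         * pick up the "SwapCached:" line. */
        if (strncmp(p, key, klen) == 0 && p[klen] == ':')
            return strtol(p + klen + 1, NULL, 10);
        p = strchr(p, '\n');    /* advance to the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}

/* Memory still in use once free pages, buffers and cache are
 * subtracted from the total.  Returns -1 if a field is missing. */
long used_excluding_cache_kb(const char *text)
{
    long total   = meminfo_kb(text, "MemTotal");
    long free_kb = meminfo_kb(text, "MemFree");
    long buffers = meminfo_kb(text, "Buffers");
    long cached  = meminfo_kb(text, "Cached");

    if (total < 0 || free_kb < 0 || buffers < 0 || cached < 0)
        return -1;
    return total - free_kb - buffers - cached;
}
```

With the figures from the top output above (3928460K total, 99652K
free, 26148K buffers, 2715008K cached) this gives 1087652K; the
remainder of "used" memory is held by the buffers and cache.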

I have seen situations where the load average exceeds 12.0 (!), and
others on a 4-CPU 64-bit 6GB machine (running 2.4.21[-4.EL]) where
all four CPUs are at 100% system and the machine is page-swapping.

THIS PROBLEM IS READILY REPRODUCIBLE.

I have attached a tarball which includes the source to a file-seek
test program (fst) and a simple memory reclamation program (reclaim).

fst can be used to generate large files (with seek behavior typical
of our application, as seeking seems to aggravate the problem).  When
using fst (on a 4GB system), specify 'num_blks' to be 2,000,000 to
4,000,000, with mode = 1 (seek-updating enabled):

    fst 3000000 fst.out 1

This will create a file with 3,000,000 blocks of random size between
1 and 2048 bytes.  Midway through creating fst.out, the block device
cache should have allocated all of memory.  If thrashing doesn't occur
immediately, you can run multiple instances of fst to aggravate the
problem.
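
For readers without the tarball, here is a rough sketch of the kind of
workload described above.  It is not the fst source: the function
name write_blocks and the exact seek pattern (rewriting one byte at
the start of the previous block, then returning to the end of the
file) are assumptions made purely for illustration.

```c
#define _FILE_OFFSET_BITS 64  /* allow files over 2GB on 32-bit hosts */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_BLK 2048

/* Hypothetical stand-in for fst: append num_blks blocks of random
 * size (1..MAX_BLK bytes).  If seek_update is non-zero, seek back
 * after each block, overwrite one byte at the start of the previous
 * block, and seek forward to the end again.
 * Returns the final file size, or -1 on error. */
off_t write_blocks(const char *path, long num_blks, int seek_update)
{
    char buf[MAX_BLK];
    off_t prev = 0, end = 0;
    long i;
    int fd;

    assert(path != NULL && num_blks >= 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    memset(buf, 'x', sizeof buf);

    for (i = 0; i < num_blks; i++) {
        size_t len = (size_t)(rand() % MAX_BLK) + 1;

        /* Append the new block at the current end of the file. */
        if (write(fd, buf, len) != (ssize_t)len)
            goto fail;
        if (seek_update && i > 0) {
            /* Touch the previous block, then go back to the end. */
            if (lseek(fd, prev, SEEK_SET) < 0 ||
                write(fd, "u", 1) != 1 ||
                lseek(fd, end + (off_t)len, SEEK_SET) < 0)
                goto fail;
        }
        prev = end;           /* start of the block just written */
        end += (off_t)len;    /* new end of file */
    }
    close(fd);
    return end;

fail:
    close(fd);
    return -1;
}
```

Calling write_blocks("fst.out", 3000000, 1) would approximate the
command line above, producing a file of roughly 3GB on average.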

reclaim can be used to illustrate that, with fst still running (and
pegged), it is possible to manually reclaim/free the memory used by
the block device cache, thereby eliminating the issues with kswapd,
bdflush, kupdated, etc.  But given that fst's still running, memory
usage creeps back up, as expected.
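
One common way to force the kernel to give cache pages back is to
allocate a large region and touch every page in it.  The sketch below
shows that technique only; whether the attached reclaim program works
this way is an assumption, and the name touch_and_release is
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate `bytes` of memory, write one byte to every page so the
 * kernel has to back it with physical pages, then free it.
 * Returns the number of pages touched, or 0 if malloc failed. */
size_t touch_and_release(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t step, off, touched = 0;
    volatile char *mem;   /* volatile: keep the stores from being elided */

    assert(bytes > 0);
    step = page > 0 ? (size_t)page : 4096;   /* fall back to 4K pages */
    mem = malloc(bytes);
    if (mem == NULL)
        return 0;

    for (off = 0; off < bytes; off += step) {
        mem[off] = 1;     /* fault the page in */
        touched++;
    }
    free((void *)mem);
    return touched;
}
```

The size passed in would need to be close to the amount of memory held
by the cache for the effect to be visible in top or /proc/meminfo.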

This seems to be a fairly fundamental and substantial problem.  Over
time rogue memory use by the block device cache simply creeps up and
up toward the physical limit.  And it becomes a problem more readily.

Can anyone provide a means to mitigate or eliminate this problem?
We've toyed with altering parameters to bdflush and the like, with
no success.

Thanks for your help,
-chris

-----------------------------------------------------------------
Chris M. Petersen                                cmp@synopsys.com
Sr. R&D Engineer

Synopsys, Inc.                                    o: 919.425.7342
1101 Slater Road, Suite 300                       c: 919.349.6393
Durham, NC  27703                                 f: 919.425.7320
-----------------------------------------------------------------