public inbox for linux-kernel@vger.kernel.org
* Kernel 2.6.8.1: swap storm of death
@ 2004-08-22 13:27 Karl Vogel
  2004-08-22 13:33 ` Karl Vogel
  2004-08-22 18:49 ` Kernel 2.6.8.1: swap storm of death - 2.6.8.1-mm4 also karl.vogel
  0 siblings, 2 replies; 6+ messages in thread
From: Karl Vogel @ 2004-08-22 13:27 UTC (permalink / raw)
  To: linux-kernel

I can bring down my box by running a program that does a calloc() of 512 MB 
(the size of my RAM). The box starts to swap heavily and never 
recovers. The process that calloc's the memory gets OOM-killed (which 
is also strange, as I have 1 GB of free swap).

After the OOM kill, the shell where I started the calloc() program is alive 
but very slow. The box continues to swap and the other processes remain dead.

To gather some more statistics, I did the following:

- start 'vmstat 1|tee vmstat.txt' in 1 VT session.
- run expunge (= program that does calloc(512Mb)) in another VT.

The box freezes for some time. After a while expunge is OOM-killed; the vmstat 
on the other VT remains dead. A ping over the network is still possible and I 
can still start programs on the expunge VT, though it is slow as the disk is 
still thrashing.



The diagnostics can be found here:

* Kernel .config
  http://users.telenet.be/kvogel/config.txt

* expunge program
  http://users.telenet.be/kvogel/expunge.c

* vmstat 1  output while executing expunge (this freezes)
  http://users.telenet.be/kvogel/vmstat.txt

* vmstat in expunge VT after the OOM kill
  http://users.telenet.be/kvogel/vmstat-after-kill.txt

* /proc/slabinfo after OOM kill
  http://users.telenet.be/kvogel/slab.txt

* swapon -s
Filename                                Type            Size    Used    Priority
/dev/hda3                               partition       1044216 0       -1

* Kernel boot line:
       kernel /vmlinuz-2.6.8.1 ro root=/dev/compat/root elevator=cfq voluntary-preempt=3 preempt=1

Kernel was patched with voluntary-preempt-2.6.8.1-P7
syslogd & klogd weren't running and 'dmesg -n 1' was done beforehand.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death
  2004-08-22 13:27 Kernel 2.6.8.1: swap storm of death Karl Vogel
@ 2004-08-22 13:33 ` Karl Vogel
  2004-08-22 18:49 ` Kernel 2.6.8.1: swap storm of death - 2.6.8.1-mm4 also karl.vogel
  1 sibling, 0 replies; 6+ messages in thread
From: Karl Vogel @ 2004-08-22 13:33 UTC (permalink / raw)
  To: linux-kernel

On Sunday 22 August 2004 15:27, Karl Vogel wrote:

> The diagnostics can be found here:

Forgot one:

* ps ax - after OOM kill
  http://users.telenet.be/kvogel/ps.txt


* Re: Kernel 2.6.8.1: swap storm of death - 2.6.8.1-mm4 also
  2004-08-22 13:27 Kernel 2.6.8.1: swap storm of death Karl Vogel
  2004-08-22 13:33 ` Karl Vogel
@ 2004-08-22 18:49 ` karl.vogel
  2004-08-22 19:18   ` Kernel 2.6.8.1: swap storm of death - CFQ scheduler=culprit Karl Vogel
  1 sibling, 1 reply; 6+ messages in thread
From: karl.vogel @ 2004-08-22 18:49 UTC (permalink / raw)
  To: linux-kernel

I just tried to see whether I could trigger the same swap storm of death on
2.6.8.1-mm4. It appears I could :(

I will have another go at it with elevator=as and see if that makes
a difference.




* Re: Kernel 2.6.8.1: swap storm of death - CFQ scheduler=culprit
  2004-08-22 18:49 ` Kernel 2.6.8.1: swap storm of death - 2.6.8.1-mm4 also karl.vogel
@ 2004-08-22 19:18   ` Karl Vogel
  2004-08-23 14:12     ` Marcelo Tosatti
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Vogel @ 2004-08-22 19:18 UTC (permalink / raw)
  To: linux-kernel

When using elevator=as I'm unable to trigger the swap storm of death, so it
seems that the CFQ scheduler is to blame here.

With the AS scheduler, the system recovers in about 10 seconds; vmstat output
during that time:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 295632  40372  49400   87  278   324   303 1424   784  7  2 78 13
 0  0      0 295632  40372  49400    0    0     0     0 1210   648  3  1 96  0
 0  0      0 295632  40372  49400    0    0     0     0 1209   652  4  0 96  0
 2  0      0 112784  40372  49400    0    0     0     0 1204   630 23 34 43  0
 1  9 156236    788    264   8128   28 156220  3012 156228 3748  3655 11 31  0 59
 0 15 176656   2196    280   8664    0 20420   556 20436 1108   374  2  5  0 93
 0 17 205320    724    232   7960   28 28664   396 28664 1118   503  7 12  0 81
 2 12 217892   1812    252   8556  248 12584   864 12584 1495   318  2  7  0 91
 4 14 253268   2500    268   8728  188 35392   432 35392 1844   399  3  7  0 90
 0 13 255692   1188    288   9152  960 2424  1408  2424 1173  2215 10  5  0 85
 0  7 266140   2288    312   9276  604 10468   752 10468 1248   644  5  5  0 90
 0  7 190516 340636    348   9860 1400    0  2016     0 1294   817  4  8  0 88
 1  8 190516 339460    384  10844  552    0  1556     4 1241   642  3  1  0 96
 1  3 190516 337084    404  11968 1432    0  2576     4 1292   788  3  1  0 96
 0  6 190516 333892    420  13612 1844    0  3500     0 1343   850  5  2  0 93
 0  1 190516 333700    424  13848  480    0   720     0 1250   654  3  2  0 95
 0  1 190516 334468    424  13848  188    0   188     0 1224   589  3  2  0 95

With CFQ, processes got stuck in the 'D' state and never left it. See the URLs
in my initial post for diagnostics.



* Re: Kernel 2.6.8.1: swap storm of death - CFQ scheduler=culprit
  2004-08-22 19:18   ` Kernel 2.6.8.1: swap storm of death - CFQ scheduler=culprit Karl Vogel
@ 2004-08-23 14:12     ` Marcelo Tosatti
  2004-08-23 15:41       ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Marcelo Tosatti @ 2004-08-23 14:12 UTC (permalink / raw)
  To: Karl Vogel, axboe; +Cc: linux-kernel

On Sun, Aug 22, 2004 at 09:18:51PM +0200, Karl Vogel wrote:
> When using elevator=as I'm unable to trigger the swap storm of death, so it
> seems that the CFQ scheduler is to blame here.
> 
> With the AS scheduler, the system recovers in about 10 seconds; vmstat output
> during that time:
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  1  0      0 295632  40372  49400   87  278   324   303 1424   784  7  2 78 13
>  0  0      0 295632  40372  49400    0    0     0     0 1210   648  3  1 96  0
>  0  0      0 295632  40372  49400    0    0     0     0 1209   652  4  0 96  0
>  2  0      0 112784  40372  49400    0    0     0     0 1204   630 23 34 43  0
>  1  9 156236    788    264   8128   28 156220  3012 156228 3748  3655 11 31  0 59
>  0 15 176656   2196    280   8664    0 20420   556 20436 1108   374  2  5  0 93
>  0 17 205320    724    232   7960   28 28664   396 28664 1118   503  7 12  0 81
>  2 12 217892   1812    252   8556  248 12584   864 12584 1495   318  2  7  0 91
>  4 14 253268   2500    268   8728  188 35392   432 35392 1844   399  3  7  0 90
>  0 13 255692   1188    288   9152  960 2424  1408  2424 1173  2215 10  5  0 85
>  0  7 266140   2288    312   9276  604 10468   752 10468 1248   644  5  5  0 90
>  0  7 190516 340636    348   9860 1400    0  2016     0 1294   817  4  8  0 88
>  1  8 190516 339460    384  10844  552    0  1556     4 1241   642  3  1  0 96
>  1  3 190516 337084    404  11968 1432    0  2576     4 1292   788  3  1  0 96
>  0  6 190516 333892    420  13612 1844    0  3500     0 1343   850  5  2  0 93
>  0  1 190516 333700    424  13848  480    0   720     0 1250   654  3  2  0 95
>  0  1 190516 334468    424  13848  188    0   188     0 1224   589  3  2  0 95
> 
> With CFQ, processes got stuck in the 'D' state and never left it. See the URLs
> in my initial post for diagnostics.

I can confirm this on a 512MB box with 512MB swap (2.6.8-rc4). Using CFQ the
machine swaps out about 400 MB; with AS it swaps out about 30 MB.

That leads to allocation failures, etc.

CFQ allocates a huge number of bio/biovecs:

 cat /proc/slabinfo | grep bio
biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata    128    128      0
biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata 52     52      0
biovec-64            265    265    768    5    1 : tunables   54   27    0 : slabdata 53     53      0
biovec-16            260    260    192   20    1 : tunables  120   60    0 : slabdata 13     13      0
biovec-4             272    305     64   61    1 : tunables  120   60    0 : slabdata  5      5      0
biovec-1          121088 122040     16  226    1 : tunables  120   60    0 : slabdata    540    540      0
bio               121131 121573     64   61    1 : tunables  120   60    0 : slabdata   1992   1993      0


biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata 128    128      0
biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata  52     52      0
biovec-64            265    265    768    5    1 : tunables   54   27    0 : slabdata  53     53      0
biovec-16            258    260    192   20    1 : tunables  120   60    0 : slabdata  13     13      0
biovec-4             257    305     64   61    1 : tunables  120   60    0 : slabdata   5      5      0
biovec-1           66390  68026     16  226    1 : tunables  120   60    0 : slabdata 301    301      0
bio                66389  67222     64   61    1 : tunables  120   60    0 : slabdata   1102   1102      0

(These are freed later on, but they are the cause of the thrashing during the
swap I/O.)

While AS does:

[marcelo@yage marcelo]$ cat /proc/slabinfo | grep bio
biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata    128    128      0
biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata     52     52      0
biovec-64            260    260    768    5    1 : tunables   54   27    0 : slabdata     52     52      0
biovec-16            280    280    192   20    1 : tunables  120   60    0 : slabdata     14     14      0
biovec-4             264    305     64   61    1 : tunables  120   60    0 : slabdata      5      5      0
biovec-1            4478   5424     16  226    1 : tunables  120   60    0 : slabdata     24     24      0
bio                 4525   5002     64   61    1 : tunables  120   60    0 : slabdata     81     82      0


The odd thing is that the 400 MB swapped out is not reclaimed after exp (the
512 MB calloc program) exits. With AS almost all swapped-out memory is
reclaimed on exit.

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0 492828  13308    320   3716    0    0     0     0 1002     5  0  0 100  0


Jens, is this huge number of bio/biovec allocations expected with CFQ? It's
really, really bad.



* Re: Kernel 2.6.8.1: swap storm of death - CFQ scheduler=culprit
  2004-08-23 14:12     ` Marcelo Tosatti
@ 2004-08-23 15:41       ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2004-08-23 15:41 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Karl Vogel, linux-kernel

On Mon, Aug 23 2004, Marcelo Tosatti wrote:
> On Sun, Aug 22, 2004 at 09:18:51PM +0200, Karl Vogel wrote:
> > When using elevator=as I'm unable to trigger the swap storm of death, so it
> > seems that the CFQ scheduler is to blame here.
> > 
> > With the AS scheduler, the system recovers in about 10 seconds; vmstat output
> > during that time:
> > 
> > procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
> >  1  0      0 295632  40372  49400   87  278   324   303 1424   784  7  2 78 13
> >  0  0      0 295632  40372  49400    0    0     0     0 1210   648  3  1 96  0
> >  0  0      0 295632  40372  49400    0    0     0     0 1209   652  4  0 96  0
> >  2  0      0 112784  40372  49400    0    0     0     0 1204   630 23 34 43  0
> >  1  9 156236    788    264   8128   28 156220  3012 156228 3748  3655 11 31  0 59
> >  0 15 176656   2196    280   8664    0 20420   556 20436 1108   374  2  5  0 93
> >  0 17 205320    724    232   7960   28 28664   396 28664 1118   503  7 12  0 81
> >  2 12 217892   1812    252   8556  248 12584   864 12584 1495   318  2  7  0 91
> >  4 14 253268   2500    268   8728  188 35392   432 35392 1844   399  3  7  0 90
> >  0 13 255692   1188    288   9152  960 2424  1408  2424 1173  2215 10  5  0 85
> >  0  7 266140   2288    312   9276  604 10468   752 10468 1248   644  5  5  0 90
> >  0  7 190516 340636    348   9860 1400    0  2016     0 1294   817  4  8  0 88
> >  1  8 190516 339460    384  10844  552    0  1556     4 1241   642  3  1  0 96
> >  1  3 190516 337084    404  11968 1432    0  2576     4 1292   788  3  1  0 96
> >  0  6 190516 333892    420  13612 1844    0  3500     0 1343   850  5  2  0 93
> >  0  1 190516 333700    424  13848  480    0   720     0 1250   654  3  2  0 95
> >  0  1 190516 334468    424  13848  188    0   188     0 1224   589  3  2  0 95
> > 
> > With CFQ, processes got stuck in the 'D' state and never left it. See the URLs
> > in my initial post for diagnostics.
> 
> I can confirm this on a 512MB box with 512MB swap (2.6.8-rc4). Using CFQ the
> machine swaps out about 400 MB; with AS it swaps out about 30 MB.
> 
> That leads to allocation failures, etc.
> 
> CFQ allocates a huge number of bio/biovecs:
> 
>  cat /proc/slabinfo | grep bio
> biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata    128    128      0
> biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata 52     52      0
> biovec-64            265    265    768    5    1 : tunables   54   27    0 : slabdata 53     53      0
> biovec-16            260    260    192   20    1 : tunables  120   60    0 : slabdata 13     13      0
> biovec-4             272    305     64   61    1 : tunables  120   60    0 : slabdata  5      5      0
> biovec-1          121088 122040     16  226    1 : tunables  120   60    0 : slabdata    540    540      0
> bio               121131 121573     64   61    1 : tunables  120   60    0 : slabdata   1992   1993      0
> 
> 
> biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata 128    128      0
> biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata  52     52      0
> biovec-64            265    265    768    5    1 : tunables   54   27    0 : slabdata  53     53      0
> biovec-16            258    260    192   20    1 : tunables  120   60    0 : slabdata  13     13      0
> biovec-4             257    305     64   61    1 : tunables  120   60    0 : slabdata   5      5      0
> biovec-1           66390  68026     16  226    1 : tunables  120   60    0 : slabdata 301    301      0
> bio                66389  67222     64   61    1 : tunables  120   60    0 : slabdata   1102   1102      0
> 
> (These are freed later on, but they are the cause of the thrashing during the
> swap I/O.)
> 
> While AS does:
> 
> [marcelo@yage marcelo]$ cat /proc/slabinfo | grep bio
> biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata    128    128      0
> biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata     52     52      0
> biovec-64            260    260    768    5    1 : tunables   54   27    0 : slabdata     52     52      0
> biovec-16            280    280    192   20    1 : tunables  120   60    0 : slabdata     14     14      0
> biovec-4             264    305     64   61    1 : tunables  120   60    0 : slabdata      5      5      0
> biovec-1            4478   5424     16  226    1 : tunables  120   60    0 : slabdata     24     24      0
> bio                 4525   5002     64   61    1 : tunables  120   60    0 : slabdata     81     82      0
> 
> 
> The odd thing is that the 400 MB swapped out is not reclaimed after exp (the
> 512 MB calloc program) exits. With AS almost all swapped-out memory is
> reclaimed on exit.
> 
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  0  0 492828  13308    320   3716    0    0     0     0 1002     5  0  0 100  0
> 
> 
> Jens, is this huge number of bio/biovec allocations expected with CFQ? It's
> really, really bad.

Nope, it's not by design :-)

A test case would be nice; then I'll fix it as soon as possible. But
please retest with 2.6.8.1, Marcelo: 2.6.8-rc4 is missing an important
fix to ll_rw_blk that can easily cause this. The first report is for
2.6.8.1, though, so I'm more puzzled about that.

-- 
Jens Axboe


