Temporary lockup on loopback block device

* Temporary lockup on loopback block device
@ 2007-11-10 19:51 Mikulas Patocka
  2007-11-10 22:54 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2007-11-10 19:51 UTC (permalink / raw)
  To: linux-kernel

Hi

I am experiencing a transient lockup in 'D' state with a loopback device. It 
happens when a process writes to a filesystem on loopback with a command like
dd if=/dev/zero of=/s/fill bs=4k 

CPU is idle, disk is idle too, yet the dd process is waiting in 'D' in 
congestion_wait called from balance_dirty_pages.

After about 30 seconds, the lockup is gone and dd resumes, but it locks up 
soon again.

I added a printk to balance_dirty_pages:
printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, "
       "pages_written %d, write_chunk %d\n",
       nr_reclaimable, global_page_state(NR_WRITEBACK), dirty_thresh,
       pages_written, write_chunk);

and it shows this during the lockup:

wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
pages_written 1021, write_chunk 1522
wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
pages_written 1021, write_chunk 1522
wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
pages_written 1021, write_chunk 1522

What apparently happens:

writeback_inodes syncs inodes only on the given wbc->bdi, however 
balance_dirty_pages checks against global counts of dirty pages. So if 
there's nothing to sync on a given device, but there are other dirty pages 
so that the counts are over the limit, it will loop without doing any 
work.

To reproduce it, you need a totally idle machine (no GUI, etc.) -- if 
something writes to the backing device, it flushes the dirty pages 
generated by the loopback and the lockup is gone. If you add the printk, 
don't forget to stop klogd; otherwise logging would end the lockup.

The hotfix (that I verified to work) is to not set wbc->bdi, so that all 
devices are flushed ... but the code probably needs some redesign (i.e. 
either account per-device and flush per-device, or account-global and 
flush-global).

Mikulas

diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
--- ../x/linux-2.6.23.1/mm/page-writeback.c     2007-10-12 18:43:44.000000000 +0200
+++ mm/page-writeback.c 2007-11-10 20:32:43.000000000 +0100
@@ -214,7 +214,6 @@

	for (;;) {
		struct writeback_control wbc = {
-			.bdi            = bdi,
			.sync_mode      = WB_SYNC_NONE,
			.older_than_this = NULL,
			.nr_to_write    = write_chunk,

* Re: Temporary lockup on loopback block device
  2007-11-10 19:51 Temporary lockup on loopback block device Mikulas Patocka
@ 2007-11-10 22:54 ` Andrew Morton
  2007-11-10 23:02   ` Peter Zijlstra
  2007-11-11  0:33   ` Mikulas Patocka
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2007-11-10 22:54 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel, Peter Zijlstra, WU Fengguang

On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:

> Hi
> 
> I am experiencing a transient lockup in 'D' state with loopback device. It 
> happens when process writes to a filesystem in loopback with command like
> dd if=/dev/zero of=/s/fill bs=4k 
> 
> CPU is idle, disk is idle too, yet the dd process is waiting in 'D' in 
> congestion_wait called from balance_dirty_pages.
> 
> After about 30 seconds, the lockup is gone and dd resumes, but it locks up 
> soon again.
> 
> I added a printk to the balance_dirty_pages
> printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, 
> pages_written %d, write_chunk %d\n", nr_reclaimable, 
> global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, 
> write_chunk);
> 
> and it shows this during the lockup:
> 
> wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> pages_written 1021, write_chunk 1522
> wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> pages_written 1021, write_chunk 1522
> wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> pages_written 1021, write_chunk 1522
> 
> What apparently happens:
> 
> writeback_inodes syncs inodes only on the given wbc->bdi, however 
> balance_dirty_pages checks against global counts of dirty pages. So if 
> there's nothing to sync on a given device, but there are other dirty pages 
> so that the counts are over the limit, it will loop without doing any 
> work.
> 
> To reproduce it, you need totally idle machine (no GUI, etc.) -- if 
> something writes to the backing device, it flushes the dirty pages 
> generated by the loopback and the lockup is gone. If you add printk, don't 
> forget to stop klogd, otherwise logging would end the lockup.

erk.

> The hotfix (that I verified to work) is to not set wbc->bdi, so that all 
> devices are flushed ... but the code probably needs some redesign (i.e. 
> either account per-device and flush per-device, or account-global and 
> flush-global).
> 
> Mikulas
> 
> 
> diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> --- ../x/linux-2.6.23.1/mm/page-writeback.c     2007-10-12 18:43:44.000000000 +0200
> +++ mm/page-writeback.c 2007-11-10 20:32:43.000000000 +0100
> @@ -214,7 +214,6 @@
> 
> 	for (;;) {
> 		struct writeback_control wbc = {
> -			.bdi            = bdi,
> 			.sync_mode      = WB_SYNC_NONE,
> 			.older_than_this = NULL,
> 			.nr_to_write    = write_chunk,

Arguably we just have the wrong backing-device here, and what we should do
is to propagate the real backing device's pointer through up into the
filesystem.  There's machinery for this which things like DM stacks use.

I wonder if the post-2.6.23 changes happened to make this problem go away.


* Re: Temporary lockup on loopback block device
  2007-11-10 22:54 ` Andrew Morton
@ 2007-11-10 23:02   ` Peter Zijlstra
  2007-11-11  0:38     ` Mikulas Patocka
  2007-11-11  0:33   ` Mikulas Patocka
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2007-11-10 23:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel, WU Fengguang, Miklos Szeredi


On Sat, 2007-11-10 at 14:54 -0800, Andrew Morton wrote:
> On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> 
> > Hi
> > 
> > I am experiencing a transient lockup in 'D' state with loopback device. It 
> > happens when process writes to a filesystem in loopback with command like
> > dd if=/dev/zero of=/s/fill bs=4k 
> > 
> > CPU is idle, disk is idle too, yet the dd process is waiting in 'D' in 
> > congestion_wait called from balance_dirty_pages.
> > 
> > After about 30 seconds, the lockup is gone and dd resumes, but it locks up 
> > soon again.
> > 
> > I added a printk to the balance_dirty_pages
> > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, 
> > pages_written %d, write_chunk %d\n", nr_reclaimable, 
> > global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, 
> > write_chunk);
> > 
> > and it shows this during the lockup:
> > 
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > pages_written 1021, write_chunk 1522
> > 
> > What apparently happens:
> > 
> > writeback_inodes syncs inodes only on the given wbc->bdi, however 
> > balance_dirty_pages checks against global counts of dirty pages. So if 
> > there's nothing to sync on a given device, but there are other dirty pages 
> > so that the counts are over the limit, it will loop without doing any 
> > work.
> > 
> > To reproduce it, you need totally idle machine (no GUI, etc.) -- if 
> > something writes to the backing device, it flushes the dirty pages 
> > generated by the loopback and the lockup is gone. If you add printk, don't 
> > forget to stop klogd, otherwise logging would end the lockup.
> 
> erk.

known issue.

> > The hotfix (that I verified to work) is to not set wbc->bdi, so that all 
> > devices are flushed ... but the code probably needs some redesign (i.e. 
> > either account per-device and flush per-device, or account-global and 
> > flush-global).

.24 will have the per-device solution.

> > 
> > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > --- ../x/linux-2.6.23.1/mm/page-writeback.c     2007-10-12 18:43:44.000000000 +0200
> > +++ mm/page-writeback.c 2007-11-10 20:32:43.000000000 +0100
> > @@ -214,7 +214,6 @@
> > 
> > 	for (;;) {
> > 		struct writeback_control wbc = {
> > -			.bdi            = bdi,
> > 			.sync_mode      = WB_SYNC_NONE,
> > 			.older_than_this = NULL,
> > 			.nr_to_write    = write_chunk,
> 
> Arguably we just have the wrong backing-device here, and what we should do
> is to propagate the real backing device's pointer through up into the
> filesystem.  There's machinery for this which things like DM stacks use.
> 
> I wonder if the post-2.6.23 changes happened to make this problem go away.

The per BDI dirty stuff in 24 should make this work, I just checked and
loopback thingies seem to have their own BDI, so all should be well.



* Re: Temporary lockup on loopback block device
  2007-11-10 23:02   ` Peter Zijlstra
@ 2007-11-11  0:38     ` Mikulas Patocka
  2007-11-11  7:50       ` Miklos Szeredi
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2007-11-11  0:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel, WU Fengguang, Miklos Szeredi

> > Arguably we just have the wrong backing-device here, and what we should do
> > is to propagate the real backing device's pointer through up into the
> > filesystem.  There's machinery for this which things like DM stacks use.
> > 
> > I wonder if the post-2.6.23 changes happened to make this problem go away.
> 
> The per BDI dirty stuff in 24 should make this work, I just checked and
> loopback thingies seem to have their own BDI, so all should be well.

This is not only about loopback (I think the lockup can happen even 
without loopback) --- the main problem is:

Why are there over-limit dirty pages that no one is writing?

Mikulas

* Re: Temporary lockup on loopback block device
  2007-11-11  0:38     ` Mikulas Patocka
@ 2007-11-11  7:50       ` Miklos Szeredi
  2007-11-11 18:29         ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2007-11-11  7:50 UTC (permalink / raw)
  To: mikulas; +Cc: a.p.zijlstra, akpm, linux-kernel, wfg, miklos

> > > Arguably we just have the wrong backing-device here, and what we should do
> > > is to propagate the real backing device's pointer through up into the
> > > filesystem.  There's machinery for this which things like DM stacks use.
> > > 
> > > I wonder if the post-2.6.23 changes happened to make this problem go away.
> > 
> > The per BDI dirty stuff in 24 should make this work, I just checked and
> > loopback thingies seem to have their own BDI, so all should be well.
> 
> This is not only about loopback (I think the lockup can happen even 
> without loopback) --- the main problem is:
> 
> Why are there over-limit dirty pages that no one is writing?

Please do a sysrq-t, and cat /proc/vmstat during the hang.  Those
will show us what exactly is happening.

I've seen this type of hang many times, and I agree with Peter, that
it's probably about loopback, and is fixed in 2.6.24-rc.

Thanks,
Miklos



* Re: Temporary lockup on loopback block device
  2007-11-11  7:50       ` Miklos Szeredi
@ 2007-11-11 18:29         ` Mikulas Patocka
  2007-11-12 13:32           ` Miklos Szeredi
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2007-11-11 18:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, akpm, linux-kernel, wfg

> > Why are there over-limit dirty pages that no one is writing?
> 
> Please do a sysrq-t, and cat /proc/vmstat during the hang.  Those
> will show us what exactly is happening.

I did, and I posted the relevant information from my findings --- it looped in 
balance_dirty_pages.

> I've seen this type of hang many times, and I agree with Peter, that
> it's probably about loopback, and is fixed in 2.6.24-rc.

On 2.6.23 it could happen even without loopback --- loopback just made it 
happen very often. 2.6.24 seems ok.

Mikulas

> Thanks,
> Miklos
> 
> 

* Re: Temporary lockup on loopback block device
  2007-11-11 18:29         ` Mikulas Patocka
@ 2007-11-12 13:32           ` Miklos Szeredi
  2007-11-15 22:35             ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2007-11-12 13:32 UTC (permalink / raw)
  To: mikulas; +Cc: a.p.zijlstra, akpm, linux-kernel, wfg

> On 2.6.23 it could happen even without loopback

Let's focus on this point, because we already know how the lockup
happens _with_ loopback and any other kind of bdi stacking.

Can you describe the setup?  Or better still, can you reproduce it and
post the sysrq-t output?

Thanks,
Miklos

* Re: Temporary lockup on loopback block device
  2007-11-12 13:32           ` Miklos Szeredi
@ 2007-11-15 22:35             ` Mikulas Patocka
  0 siblings, 0 replies; 11+ messages in thread
From: Mikulas Patocka @ 2007-11-15 22:35 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, akpm, linux-kernel, wfg

> > On 2.6.23 it could happen even without loopback
> 
> Let's focus on this point, because we already know how the lockup
> happens _with_ loopback and any other kind of bdi stacking.
> 
> Can you describe the setup?  Or better still, can you reproduce it and
> post the sysrq-t output?

Hi

Here is the trace; the lockup is perfectly reproducible. It is a 128M machine, 
Pentium 2 300MHz, host filesystem ext2, loop filesystems ext2 and spadfs 
(both of them locked up). But the problem is really gone in 2.6.24, so I 
think there is no more need to investigate it.

Mikulas

Nov 10 19:34:45 gerlinda kernel: SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
Nov 10 19:34:53 gerlinda kernel: SysRq : Show Blocked State
Nov 10 19:34:53 gerlinda kernel:   task                PC stack   pid father
Nov 10 19:34:54 gerlinda kernel: dd            D 00000286     0  4603   2985
Nov 10 19:34:55 gerlinda kernel:        c580bcdc 00000086 c0308c20 00000286 00000286 c580bcec 002a4e87 00000000
Nov 10 19:34:55 gerlinda kernel:        c580bd10 c0284bba c580bd1c 00000000 c03775e0 c03775e0 002a4e87 c011d050
Nov 10 19:34:55 gerlinda kernel:        c117c030 c03771a0 00000064 c02f8eb4 c0283efe c580bd44 c0145ebc 00000000
Nov 10 19:34:55 gerlinda kernel: Call Trace:
Nov 10 19:34:55 gerlinda kernel:  [<c0284bba>] schedule_timeout+0x4a/0xc0
Nov 10 19:34:55 gerlinda kernel:  [<c011d050>] process_timeout+0x0/0x10
Nov 10 19:34:55 gerlinda kernel:  [<c0283efe>] io_schedule_timeout+0xe/0x20
Nov 10 19:34:55 gerlinda kernel:  [<c0145ebc>] congestion_wait+0x6c/0x90
Nov 10 19:34:55 gerlinda kernel:  [<c01274e0>] autoremove_wake_function+0x0/0x50
Nov 10 19:34:55 gerlinda kernel:  [<c014135f>] balance_dirty_pages_ratelimited_nr+0x11f/0x1e0
Nov 10 19:34:55 gerlinda kernel:  [<c013cb98>] generic_file_buffered_write+0x2f8/0x6f0
Nov 10 19:34:55 gerlinda kernel:  [<c01198b7>] irq_exit+0x47/0x70
Nov 10 19:34:55 gerlinda kernel:  [<c01049e7>] do_IRQ+0x47/0x80
Nov 10 19:34:55 gerlinda kernel:  [<c0102cbf>] common_interrupt+0x23/0x28
Nov 10 19:34:55 gerlinda kernel:  [<c013d1e3>] __generic_file_aio_write_nolock+0x253/0x540
Nov 10 19:34:55 gerlinda kernel:  [<c012a87b>] hrtimer_run_queues+0x6b/0x290
Nov 10 19:34:55 gerlinda kernel:  [<c013d526>] generic_file_aio_write+0x56/0xd0
Nov 10 19:34:55 gerlinda kernel:  [<c012ed9f>] tick_handle_periodic+0xf/0x70
Nov 10 19:34:55 gerlinda kernel:  [<c015a1d6>] do_sync_write+0xc6/0x110
Nov 10 19:34:55 gerlinda kernel:  [<c01274e0>] autoremove_wake_function+0x0/0x50
Nov 10 19:34:55 gerlinda kernel:  [<c01c604f>] clear_user+0x2f/0x50
Nov 10 19:34:55 gerlinda kernel:  [<c0120000>] ptrace_notify+0x30/0x90
Nov 10 19:34:55 gerlinda kernel:  [<c015aa56>] vfs_write+0xa6/0x140
Nov 10 19:34:55 gerlinda kernel:  [<c8926310>] SPADFS_FILE_WRITE+0x0/0x10 [spadfs]
Nov 10 19:34:55 gerlinda kernel:  [<c015b031>] sys_write+0x41/0x70
Nov 10 19:34:55 gerlinda kernel:  [<c0102b16>] syscall_call+0x7/0xb
Nov 10 19:34:55 gerlinda kernel:  =======================


> Thanks,
> Miklos
> 

* Re: Temporary lockup on loopback block device
  2007-11-10 22:54 ` Andrew Morton
  2007-11-10 23:02   ` Peter Zijlstra
@ 2007-11-11  0:33   ` Mikulas Patocka
  2007-11-11  3:56     ` Mikulas Patocka
  1 sibling, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2007-11-11  0:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Peter Zijlstra, WU Fengguang



On Sat, 10 Nov 2007, Andrew Morton wrote:

> On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> 
> > Hi
> > 
> > I am experiencing a transient lockup in 'D' state with loopback device. It 
> > happens when process writes to a filesystem in loopback with command like
> > dd if=/dev/zero of=/s/fill bs=4k 
> > 
> > CPU is idle, disk is idle too, yet the dd process is waiting in 'D' in 
> > congestion_wait called from balance_dirty_pages.
> > 
> > After about 30 seconds, the lockup is gone and dd resumes, but it locks up 
> > soon again.
> > 
> > I added a printk to the balance_dirty_pages
> > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, 
> > pages_written %d, write_chunk %d\n", nr_reclaimable, 
> > global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, 
> > write_chunk);
> > 
> > and it shows this during the lockup:
> > 
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > pages_written 1021, write_chunk 1522
> > 
> > What apparently happens:
> > 
> > writeback_inodes syncs inodes only on the given wbc->bdi, however 
> > balance_dirty_pages checks against global counts of dirty pages. So if 
> > there's nothing to sync on a given device, but there are other dirty pages 
> > so that the counts are over the limit, it will loop without doing any 
> > work.
> > 
> > To reproduce it, you need totally idle machine (no GUI, etc.) -- if 
> > something writes to the backing device, it flushes the dirty pages 
> > generated by the loopback and the lockup is gone. If you add printk, don't 
> > forget to stop klogd, otherwise logging would end the lockup.
> 
> erk.
> 
> > The hotfix (that I verified to work) is to not set wbc->bdi, so that all 
> > devices are flushed ... but the code probably needs some redesign (i.e. 
> > either account per-device and flush per-device, or account-global and 
> > flush-global).
> > 
> > Mikulas
> > 
> > 
> > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > --- ../x/linux-2.6.23.1/mm/page-writeback.c     2007-10-12 18:43:44.000000000 +0200
> > +++ mm/page-writeback.c 2007-11-10 20:32:43.000000000 +0100
> > @@ -214,7 +214,6 @@
> > 
> > 	for (;;) {
> > 		struct writeback_control wbc = {
> > -			.bdi            = bdi,
> > 			.sync_mode      = WB_SYNC_NONE,
> > 			.older_than_this = NULL,
> > 			.nr_to_write    = write_chunk,
> 
> Arguably we just have the wrong backing-device here, and what we should do
> is to propagate the real backing device's pointer through up into the
> filesystem.  There's machinery for this which things like DM stacks use.

If you change the loopback backing device, you just turn this nicely 
reproducible example into a subtle race condition that can happen whether 
you use loopback or not. Think about what happens when a different process 
dirties memory:

You have process "A" that dirtied a lot of pages on device "1" but has not 
started writing them.
You have process "B" that is trying to write to device "2", sees dirty 
page count over limit, but can't do anything about it, because it is only 
allowed to flush pages on device "2". --- so it endlessly loops.

If you want to keep the current flushing semantics, you have to audit 
the whole kernel to make sure that whenever some process sees an over-limit 
dirty page count, there is another process flushing the pages. Currently 
this is not true: the "dd" process sees an over-limit count, but no one is 
writing.

> I wonder if the post-2.6.23 changes happened to make this problem go away.

I will try 2.6.24-rc2, but I don't think the root cause of this went away. 
Maybe you just reduced the probability.

Mikulas

* Re: Temporary lockup on loopback block device
  2007-11-11  0:33   ` Mikulas Patocka
@ 2007-11-11  3:56     ` Mikulas Patocka
  2007-11-11  5:33       ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2007-11-11  3:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Peter Zijlstra, WU Fengguang

On Sun, 11 Nov 2007, Mikulas Patocka wrote:

> On Sat, 10 Nov 2007, Andrew Morton wrote:
> 
> > On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> > 
> > > Hi
> > > 
> > > I am experiencing a transient lockup in 'D' state with loopback device. It 
> > > happens when process writes to a filesystem in loopback with command like
> > > dd if=/dev/zero of=/s/fill bs=4k 
> > > 
> > > CPU is idle, disk is idle too, yet the dd process is waiting in 'D' in 
> > > congestion_wait called from balance_dirty_pages.
> > > 
> > > After about 30 seconds, the lockup is gone and dd resumes, but it locks up 
> > > soon again.
> > > 
> > > I added a printk to the balance_dirty_pages
> > > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, 
> > > pages_written %d, write_chunk %d\n", nr_reclaimable, 
> > > global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, 
> > > write_chunk);
> > > 
> > > and it shows this during the lockup:
> > > 
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > > pages_written 1021, write_chunk 1522
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > > pages_written 1021, write_chunk 1522
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > > pages_written 1021, write_chunk 1522
> > > 
> > > What apparently happens:
> > > 
> > > writeback_inodes syncs inodes only on the given wbc->bdi, however 
> > > balance_dirty_pages checks against global counts of dirty pages. So if 
> > > there's nothing to sync on a given device, but there are other dirty pages 
> > > so that the counts are over the limit, it will loop without doing any 
> > > work.
> > > 
> > > To reproduce it, you need totally idle machine (no GUI, etc.) -- if 
> > > something writes to the backing device, it flushes the dirty pages 
> > > generated by the loopback and the lockup is gone. If you add printk, don't 
> > > forget to stop klogd, otherwise logging would end the lockup.
> > 
> > erk.
> > 
> > > The hotfix (that I verified to work) is to not set wbc->bdi, so that all 
> > > devices are flushed ... but the code probably needs some redesign (i.e. 
> > > either account per-device and flush per-device, or account-global and 
> > > flush-global).
> > > 
> > > Mikulas
> > > 
> > > 
> > > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > > --- ../x/linux-2.6.23.1/mm/page-writeback.c     2007-10-12 18:43:44.000000000 +0200
> > > +++ mm/page-writeback.c 2007-11-10 20:32:43.000000000 +0100
> > > @@ -214,7 +214,6 @@
> > > 
> > > 	for (;;) {
> > > 		struct writeback_control wbc = {
> > > -			.bdi            = bdi,
> > > 			.sync_mode      = WB_SYNC_NONE,
> > > 			.older_than_this = NULL,
> > > 			.nr_to_write    = write_chunk,
> > 
> > Arguably we just have the wrong backing-device here, and what we should do
> > is to propagate the real backing device's pointer through up into the
> > filesystem.  There's machinery for this which things like DM stacks use.
> 
> If you change loopback backing-device, you just turn this nicely 
> reproducible example into a subtle race condition that can happen whenever 
> you use loopback or not. Think, what happens when different process 
> dirties memory:
> 
> You have process "A" that dirtied a lot of pages on device "1" but has not 
> started writing them.
> You have process "B" that is trying to write to device "2", sees dirty 
> page count over limit, but can't do anything about it, because it is only 
> allowed to flush pages on device "2". --- so it endlessly loops.
> 
> If you want to use the current flushing semantics, you just have to audit 
> the whole kernel to make sure that if some process sees over-limit dirty 
> page count, there is another process that is flushing the pages. Currently 
> it is not true, the "dd" process sees over-limit count, but there is 
> no-one writing.
> 
> > I wonder if the post-2.6.23 changes happened to make this problem go away.
> 
> I will try 2.6.24-rc2, but I don't think the root cause of this went away. 
> Maybe you just reduced probability.
> 
> Mikulas

So I compiled it and I don't see any more lock-ups. The writeback loop 
doesn't depend on any global page count, so the above scenario can't 
happen here. Good.

Mikulas

* Re: Temporary lockup on loopback block device
  2007-11-11  3:56     ` Mikulas Patocka
@ 2007-11-11  5:33       ` Mikulas Patocka
  0 siblings, 0 replies; 11+ messages in thread
From: Mikulas Patocka @ 2007-11-11  5:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Peter Zijlstra, WU Fengguang

> > > Arguably we just have the wrong backing-device here, and what we 
> > > should do is to propagate the real backing device's pointer through 
> > > up into the filesystem.  There's machinery for this which things 
> > > like DM stacks use.

Just thinking about the new implementation --- you shouldn't really 
propagate physical block device's backing_device into loopback device.

If you leave it as is (each loop device has its own backing store), you 
can nicely avoid the long-standing loopback deadlock coming from the fact 
that flushing one page on loopback device can generate several more dirty 
pages on the filesystem.

If you let the loopback device and the physical device share the same backing 
store, then it can go wild creating more and more dirty pages up to 
memory exhaustion. If you let them have different backing stores, that can't 
happen --- loopback flushing will just wait until the pages on the 
filesystem are written.

Mikulas

> So I compiled it and I don't see any more lock-ups. The writeback loop 
> doesn't depend on any global page count, so the above scenario can't 
> happen here. Good.
> 
> Mikulas
> 
