* Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file
       [not found] ` <20071023115620.GA5678@mail.ustc.edu.cn>
@ 2007-10-23 11:56 ` Fengguang Wu
  2007-10-23 14:10   ` Chris Mason
  2007-10-23 11:56 ` Fengguang Wu
  1 sibling, 1 reply; 39+ messages in thread
From: Fengguang Wu @ 2007-10-23 11:56 UTC
To: Peter Zijlstra
Cc: Maxim Levitsky, linux-kernel, Andrew Morton, Jeff Mahoney, reiserfs-dev, linux-fsdevel

On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
> [ adding reiserfs devs to the CC ]

Thank you.

This fix is kind of crude, even though it fixed Maxim's problem and
survived my stress testing with a lot of patching and kernel compiling.
I'd be glad to see better solutions.

Fengguang
---
reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

This is not a new problem in 2.6.23-git17.
2.6.22 and 2.6.23 are buggy in the same way.

Reiserfs could accumulate dirty sub-page-size files until umount time.
They cannot be synced to disk by pdflush routines or explicit `sync'
commands. Only `umount' can do the trick.

The direct cause is: the dirty page's PG_dirty is wrongly _cleared_.
Call trace:
 [<ffffffff8027e920>] cancel_dirty_page+0xd0/0xf0
 [<ffffffff8816d470>] :reiserfs:reiserfs_cut_from_item+0x660/0x710
 [<ffffffff8816d791>] :reiserfs:reiserfs_do_truncate+0x271/0x530
 [<ffffffff8815872d>] :reiserfs:reiserfs_truncate_file+0xfd/0x3b0
 [<ffffffff8815d3d0>] :reiserfs:reiserfs_file_release+0x1e0/0x340
 [<ffffffff802a187c>] __fput+0xcc/0x1b0
 [<ffffffff802a1ba6>] fput+0x16/0x20
 [<ffffffff8029e676>] filp_close+0x56/0x90
 [<ffffffff8029fe0d>] sys_close+0xad/0x110
 [<ffffffff8020c41e>] system_call+0x7e/0x83

Fix the bug by removing the cancel_dirty_page() call. Tests show that
it causes no bad behavior across various write sizes.

=== for the patient ===
Here are more detailed demonstrations of the problem.

1) The page has both PG_dirty(D) and PAGECACHE_TAG_DIRTY(d) after being
   written to; only PAGECACHE_TAG_DIRTY(d) remains after the file is
   closed.

------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# cat > /test/tiny
[T1] hi
[T2] root /home/wfg#

------------------------------ screen 1 ------------------------------
[T1] root /home/wfg# echo /test/tiny > /proc/filecache
[T1] root /home/wfg# cat /proc/filecache
     # file /test/tiny
     # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
     # idx   len     state   refcnt
     0       1       ___UD__Bd_      2
[T2] root /home/wfg# cat /proc/filecache
     # file /test/tiny
     # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
     # idx   len     state   refcnt
     0       1       ___U___Bd_      2

2) Note the non-zero 'cancelled_write_bytes' after /tmp/hi is copied.

------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# echo hi > /tmp/hi
[T1] root /home/wfg# cp /tmp/hi /dev/stdin /test
[T2] hi
[T3] root /home/wfg#

------------------------------ screen 1 ------------------------------
[T1] root /proc/4397# cd /proc/`pidof cp`
[T1] root /proc/4713# cat io
     rchar: 8396
     wchar: 3
     syscr: 20
     syscw: 1
     read_bytes: 0
     write_bytes: 20480
     cancelled_write_bytes: 4096
[T2] root /proc/4713# cat io
     rchar: 8399
     wchar: 6
     syscr: 21
     syscw: 2
     read_bytes: 0
     write_bytes: 24576
     cancelled_write_bytes: 4096

//Question: the 'write_bytes' is a bit more than expected ;-)

Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/reiserfs/stree.c |    3 ---
 1 file changed, 3 deletions(-)

--- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
+++ linux-2.6.24-git17/fs/reiserfs/stree.c
@@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
 				}
 				bh = next;
 			} while (bh != head);
-			if (PAGE_SIZE == bh->b_size) {
-				cancel_dirty_page(page, PAGE_CACHE_SIZE);
-			}
 		}
 	}
 }
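The state change in demonstration 1 (flag `D` cleared while tag `d` stays set) can be illustrated with a small userspace model. This is only a sketch, not kernel code, and it rests on one assumption that the thread itself does not spell out: that a writeback pass finds candidate pages through the dirty tag and then skips any page whose PG_dirty flag is already clear. `struct model_page` and the `model_*` functions are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page-cache page; NOT the kernel's struct page. */
struct model_page {
	bool pg_dirty;   /* models the PG_dirty page flag ("D" in /proc/filecache) */
	bool tag_dirty;  /* models PAGECACHE_TAG_DIRTY ("d" in /proc/filecache) */
	bool on_disk;    /* set once the model has "written" the page out */
};

/* Models a write into the page: flag and tag are both set. */
static void model_write(struct model_page *p)
{
	p->pg_dirty = true;
	p->tag_dirty = true;
	p->on_disk = false;
}

/* Models the call removed by the patch: the flag is cleared, the tag stays. */
static void model_cancel_dirty(struct model_page *p)
{
	p->pg_dirty = false;
}

/*
 * Models one writeback pass. ASSUMPTION: candidates are looked up by tag,
 * and a candidate whose flag is already clear is skipped without touching
 * its tag. Returns the number of pages written.
 */
static size_t model_writeback(struct model_page *pages, size_t n)
{
	size_t written = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct model_page *p = &pages[i];

		if (!p->tag_dirty)
			continue;	/* not a writeback candidate */
		if (!p->pg_dirty)
			continue;	/* flag lost: skipped, tag stays set */
		p->pg_dirty = false;
		p->tag_dirty = false;
		p->on_disk = true;
		written++;
	}
	return written;
}
```

In this model, calling `model_write()` followed by `model_cancel_dirty()` leaves a page that every later `model_writeback()` pass skips, which matches the `___U___Bd_` state shown above. Without the `model_cancel_dirty()` step the same page is written on the first pass.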
* Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file
  2007-10-23 11:56 ` [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file Fengguang Wu
@ 2007-10-23 14:10   ` Chris Mason
       [not found]     ` <20071023144014.GA6174@mail.ustc.edu.cn>
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2007-10-23 14:10 UTC
To: Fengguang Wu
Cc: Peter Zijlstra, Maxim Levitsky, linux-kernel, Andrew Morton, Jeff Mahoney, reiserfs-dev, linux-fsdevel

On Tue, 23 Oct 2007 19:56:20 +0800
Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
> > [ adding reiserfs devs to the CC ]
>
> Thank you.
>
> This fix is kind of crude - even when it fixed Maxim's problem, and
> survived my stress testing of a lot of patching and kernel compiling.
> I'd be glad to see better solutions.

This should be safe, reiserfs has the buffer heads themselves clean and
the page should get cleaned eventually. The cancel_dirty_page call was
just an optimization to be VM friendly.

-chris
[parent not found: <20071023144014.GA6174@mail.ustc.edu.cn>]
* Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file
       [not found]     ` <20071023144014.GA6174@mail.ustc.edu.cn>
@ 2007-10-23 14:40       ` Fengguang Wu
  2007-10-23 14:40       ` Fengguang Wu
  1 sibling, 0 replies; 39+ messages in thread
From: Fengguang Wu @ 2007-10-23 14:40 UTC
To: Chris Mason
Cc: Peter Zijlstra, Maxim Levitsky, linux-kernel, Andrew Morton, Jeff Mahoney, reiserfs-dev, linux-fsdevel

On Tue, Oct 23, 2007 at 10:10:53AM -0400, Chris Mason wrote:
> On Tue, 23 Oct 2007 19:56:20 +0800
> Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
>
> > On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
> > > [ adding reiserfs devs to the CC ]
> >
> > Thank you.
> >
> > This fix is kind of crude - even when it fixed Maxim's problem, and
> > survived my stress testing of a lot of patching and kernel compiling.
> > I'd be glad to see better solutions.
>
> This should be safe, reiserfs has the buffer heads themselves clean and
> the page should get cleaned eventually. The cancel_dirty_page call was
> just an optimization to be VM friendly.
>
> -chris

'chris' as in fs/reiserfs/{inode.c,namei.c}, and now in btrfs/*?
Nice to meet you ;-)

Fengguang
[parent not found: <393056632.00561@ustc.edu.cn>]
[parent not found: <200710221505.35397.maximlevitsky@gmail.com>]
[parent not found: <20071022131045.GA5357@mail.ustc.edu.cn>]
[parent not found: <393060478.03650@ustc.edu.cn>]
[parent not found: <64bb37e0710310822r5ca6b793p8fd97db2f72a8655@mail.gmail.com>]
[parent not found: <393903856.06449@ustc.edu.cn>]
[parent not found: <64bb37e0711011120i63cdfe3ci18995d57b6649a8@mail.gmail.com>]
[parent not found: <E1Inljm-0002DW-CL@localhost>]
* writeout stalls in current -git
       [not found] ` <E1Inljm-0002DW-CL@localhost>
@ 2007-11-02  1:54 ` Fengguang Wu
  2007-11-02  7:42   ` Torsten Kaiser
  2007-11-02  1:54 ` Fengguang Wu
  1 sibling, 1 reply; 39+ messages in thread
From: Fengguang Wu @ 2007-11-02 1:54 UTC
To: Torsten Kaiser
Cc: Maxim Levitsky, Peter Zijlstra, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel

On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > Each time I noticed this I was using emerge (package util from the
> > > gentoo distribution) to install/upgrade a package. The last step,
> > > where this hang occurred, is moving the prepared files from a tmpfs
> > > partition to the main xfs filesystem.
> > > The hangs were not fatal, after a few seconds everything resumed
> > > normally, so I was not able to capture a good image of what was
> > > happening.
> >
> > Thank you for the detailed report.
> >
> > How severe were the hangs? Only writeouts stalled, all apps stalled, or
> > cannot type and run new commands?
>
> Only writeout stalled. The emerge that was moving the files hung, but
> everything else worked normally.
> I was able to run new commands, like copying the /proc/meminfo.

But you mentioned in the next mail that `watch cat /proc/meminfo`
could also be blocked for some time - I guess at the same time emerge
was stalled?

> [snip]
> > > After this SysRq+W writeback resumed again. Possible that writing
> > > above into the syslog triggered that.
> >
> > Maybe. Are the log files on another disk/partition?
>
> No, everything was going to /
>
> What might be interesting is that doing
> cat /proc/meminfo >~/stall/meminfo did not resume the writeback. So
> there might be some threshold that was only broken with the additional
> write from syslog-ng. Or syslog-ng does some flushing, I don't know.
> (I'm using the

Have you tried explicit `sync`? ;-)

> syslog-ng package from gentoo:
> http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
>
> > > The source tmpfs is mounted with any special parameters, but the
> > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > disk RAID5 md.
> > > During the hang all CPUs were idle.
> >
> > No iowaits? ;-)
>
> No, I have a KSysGuard in my taskbar that showed no activity at all.
>
> OK, the subject does not match for my case, but there was also a tmpfs
> involved. And I found no thread with stalls on xfs. :-)

Do you mean it is actually related to tmpfs?

> > > The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
> > > interrupts per second because of the bttv driver. (But I was not using
> > > that device at this time.)
> > >
> > > I'm willing to test patches or provide more information, but lack
> > > a good testcase to trigger this on demand.
> >
> > Thank you. Maybe we can start by the applied debug patch :-)
>
> Will apply it and try to recreate this.
>
> Thanks for looking into it.

Thank you for the rich information, too :-)

Fengguang
* Re: writeout stalls in current -git
  2007-11-02  1:54 ` writeout stalls in current -git Fengguang Wu
@ 2007-11-02  7:42   ` Torsten Kaiser
       [not found]     ` <E1InrKN-0000MK-G5@localhost>
  0 siblings, 1 reply; 39+ messages in thread
From: Torsten Kaiser @ 2007-11-02 7:42 UTC
To: Fengguang Wu
Cc: Maxim Levitsky, Peter Zijlstra, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel

The Subject is still misleading, I'm using 2.6.23-mm1.

On 11/2/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > On 11/1/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > > Each time I noticed this I was using emerge (package util from the
> > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > partition to the main xfs filesystem.
> > > > The hangs were not fatal, after a few seconds everything resumed
> > > > normally, so I was not able to capture a good image of what was
> > > > happening.
> > >
> > > Thank you for the detailed report.
> > >
> > > How severe were the hangs? Only writeouts stalled, all apps stalled, or
> > > cannot type and run new commands?
> >
> > Only writeout stalled. The emerge that was moving the files hung, but
> > everything else worked normally.
> > I was able to run new commands, like copying the /proc/meminfo.
>
> But you mentioned in the next mail that `watch cat /proc/meminfo`
> could also be blocked for some time - I guess at the same time emerge
> was stalled?

The behavior was different on these stalls.
In the first report the writeout stopped completely and the emerge
stopped, but at that time a cat /proc/meminfo >~/stall/meminfo did
succeed and did not stall.
About the watch cat /proc/meminfo, I will write in the answer to the
other mail...

> > [snip]
> > > > After this SysRq+W writeback resumed again. Possible that writing
> > > > above into the syslog triggered that.
> > >
> > > Maybe. Are the log files on another disk/partition?
> >
> > No, everything was going to /
> >
> > What might be interesting is that doing
> > cat /proc/meminfo >~/stall/meminfo did not resume the writeback. So
> > there might be some threshold that was only broken with the additional
> > write from syslog-ng. Or syslog-ng does some flushing, I don't know.
> > (I'm using the
>
> Have you tried explicit `sync`? ;-)

No. I wanted to see what was stalled. So I started by collecting info
from /proc and then the SysRq+W. And after hitting SysRq the writeout
started to resume without any further action.

But I think I have seen a `sync` stall also. During another emerge I
noticed the system slowing down and wanted to use `sync` to speed up
the writeout. The result was that the writeout did not speed up
immediately, only after around a minute. The `sync` only returned at
that time.
Can writers starve `sync`?

> > syslog-ng package from gentoo:
> > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> >
> > > > The source tmpfs is mounted with any special parameters, but the
> > > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > > disk RAID5 md.
> > > > During the hang all CPUs were idle.
> > >
> > > No iowaits? ;-)
> >
> > No, I have a KSysGuard in my taskbar that showed no activity at all.
> >
> > OK, the subject does not match for my case, but there was also a tmpfs
> > involved. And I found no thread with stalls on xfs. :-)
>
> Do you mean it is actually related to tmpfs?

I don't know. It's just that I have seen tmpfs also redirtying inodes
in these logs, and the stalling emerge is moving files from tmpfs to
xfs.
It could be, but I don't know enough about tmpfs internals to really be sure.
I just wanted to mention that tmpfs is involved somehow.

Torsten
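Capturing /proc/meminfo during a stall, as described in the message above, can also be done from a small program that extracts just the counters of interest. The following is a minimal sketch, assuming the usual "Key:   value kB" line layout of /proc/meminfo; the helper names `meminfo_field_kb()` and `read_meminfo()` are hypothetical and are not part of any existing tool.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "key" (without the colon) in /proc/meminfo-style text and return
 * its value in kB, or -1 if no line starts with "key:".
 * Requiring the ':' right after the key keeps "Writeback" from matching
 * a longer key such as "WritebackTmp".
 */
static long meminfo_field_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line != NULL && *line != '\0') {
		if (strncmp(line, key, klen) == 0 && line[klen] == ':')
			return strtol(line + klen + 1, NULL, 10);
		line = strchr(line, '\n');	/* advance to the next line */
		if (line != NULL)
			line++;
	}
	return -1;
}

/* Read /proc/meminfo into buf (NUL-terminated). Returns 0 on success. */
static int read_meminfo(char *buf, size_t size)
{
	FILE *f;
	size_t n;

	assert(size > 0);
	f = fopen("/proc/meminfo", "r");
	if (f == NULL)
		return -1;
	n = fread(buf, 1, size - 1, f);
	fclose(f);
	buf[n] = '\0';
	return 0;
}
```

A caller would fill a buffer of a few kilobytes with `read_meminfo()` and then query `meminfo_field_kb(buf, "Dirty")` and `meminfo_field_kb(buf, "Writeback")` in a loop. Sampling from one already-running process may also be less disturbing to the system under test than starting a new `cat` for every sample, although that is a guess rather than something measured in this thread.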
[parent not found: <E1InrKN-0000MK-G5@localhost>]
* Re: writeout stalls in current -git
       [not found]     ` <E1InrKN-0000MK-G5@localhost>
@ 2007-11-02  7:52       ` Fengguang Wu
  2007-11-02 17:47         ` Torsten Kaiser
  2007-11-02  7:52       ` Fengguang Wu
  1 sibling, 1 reply; 39+ messages in thread
From: Fengguang Wu @ 2007-11-02 7:52 UTC
To: Torsten Kaiser
Cc: Maxim Levitsky, Peter Zijlstra, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel

On Fri, Nov 02, 2007 at 08:42:05AM +0100, Torsten Kaiser wrote:
> The Subject is still misleading, I'm using 2.6.23-mm1.
>
> On 11/2/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > > On 11/1/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > > > Each time I noticed this I was using emerge (package util from the
> > > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > > partition to the main xfs filesystem.
> > > > > The hangs were not fatal, after a few seconds everything resumed
> > > > > normally, so I was not able to capture a good image of what was
> > > > > happening.
> > > >
> > > > Thank you for the detailed report.
> > > >
> > > > How severe were the hangs? Only writeouts stalled, all apps stalled, or
> > > > cannot type and run new commands?
> > >
> > > Only writeout stalled. The emerge that was moving the files hung, but
> > > everything else worked normally.
> > > I was able to run new commands, like copying the /proc/meminfo.
> >
> > But you mentioned in the next mail that `watch cat /proc/meminfo`
> > could also be blocked for some time - I guess at the same time emerge
> > was stalled?
>
> The behavior was different on these stalls.
> In the first report the writeout stopped completely and the emerge
> stopped, but at that time a cat /proc/meminfo >~/stall/meminfo did
> succeed and did not stall.
> About the watch cat /proc/meminfo, I will write in the answer to the
> other mail...

OK.

> > > [snip]
> > > > > After this SysRq+W writeback resumed again. Possible that writing
> > > > > above into the syslog triggered that.
> > > >
> > > > Maybe. Are the log files on another disk/partition?
> > >
> > > No, everything was going to /
> > >
> > > What might be interesting is that doing
> > > cat /proc/meminfo >~/stall/meminfo did not resume the writeback. So
> > > there might be some threshold that was only broken with the additional
> > > write from syslog-ng. Or syslog-ng does some flushing, I don't know.
> > > (I'm using the
> >
> > Have you tried explicit `sync`? ;-)
>
> No. I wanted to see what was stalled. So I started by collecting info
> from /proc and then the SysRq+W. And after hitting SysRq the writeout
> started to resume without any further action.
>
> But I think I have seen a `sync` stall also. During another emerge I
> noticed the system slowing down and wanted to use `sync` to speed up
> the writeout. The result was that the writeout did not speed up
> immediately, only after around a minute. The `sync` only returned at
> that time.
> Can writers starve `sync`?

I guess the new debug printks will provide more hints on it.

> > > syslog-ng package from gentoo:
> > > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> > >
> > > > > The source tmpfs is mounted with any special parameters, but the
> > > > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > > > disk RAID5 md.
> > > > > During the hang all CPUs were idle.
> > > >
> > > > No iowaits? ;-)
> > >
> > > No, I have a KSysGuard in my taskbar that showed no activity at all.
> > >
> > > OK, the subject does not match for my case, but there was also a tmpfs
> > > involved. And I found no thread with stalls on xfs. :-)
> >
> > Do you mean it is actually related to tmpfs?
>
> I don't know. It's just that I have seen tmpfs also redirtying inodes
> in these logs, and the stalling emerge is moving files from tmpfs to
> xfs.
> It could be, but I don't know enough about tmpfs internals to really be sure.
> I just wanted to mention that tmpfs is involved somehow.

The requeue messages for tmpfs are not pleasant, but known to be fine ;-)

Fengguang
* Re: writeout stalls in current -git
  2007-11-02  7:52       ` Fengguang Wu
@ 2007-11-02 17:47         ` Torsten Kaiser
  0 siblings, 0 replies; 39+ messages in thread
From: Torsten Kaiser @ 2007-11-02 17:47 UTC
To: Fengguang Wu
Cc: Maxim Levitsky, Peter Zijlstra, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel

On 11/2/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> I guess the new debug printks will provide more hints on it.

The "throttle_vm_writeout" did not trigger for my new workload.
Except one (the first) "balance_dirty_pages" came from line 445, the
newly added.

But I found another workload that looks much more ... hmm ... 'mad'.

If I do an unmerge, the emerge program will read all files to
revalidate their checksums and then delete them. If I do this unmerge,
the progress of emerge will stall periodically for ~47 seconds. (Two
times I used a stopwatch to get this value. I think all other stalls
were identical; at least in KSysGuard they looked evenly spaced.)

What really counts as 'mad' is this output from vmstat 10:
 0  0      0 3639044    332 177420    0    0   292    20  101  618  1  1  98  0
 1  0      0 3624068    332 180628    0    0   323    22  137  663  5  2  93  0
 0  0      0 3602456    332 183972    0    0   301    23  159  641  9  3  87  2
-> this was emerge collecting its package database
 0  0      0 3600052    332 184264    0    0    19  7743  823 5543  3  8  89  0
 0  0      0 3599332    332 184280    0    0     1  2532  517 2341  1  2  97  0
-> normal removing, now the emerge stalls
 0  0      0 3599404    332 184280    0    0     0   551  323 1290  0  0  99  0
 0  0      0 3599648    332 184280    0    0     0   644  314 1222  0  1  99  0
 0  0      0 3599648    332 184284    0    0     0   569  296 1242  0  0  99  0
 0  0      0 3599868    332 184288    0    0     0  2362  320 2735  1  2  97  0
-> resumes for a short time, then stalls again
 0  0      0 3599488    332 184288    0    0     0   584  292 1395  0  0  99  0
 0  0      0 3600216    332 184288    0    0     0   550  301 1361  0  0  99  0
 0  0      0 3594176    332 184296    0    0     0   562  300 1373  2  1  97  0
 0  0      0 3594648    332 184296    0    0     0  1278  336 1881  1  1  98  0
 0  0      0 3594172    332 184308    0    0     1  2812  421 2840  1  4  95  0
-> and again
 0  0      0 3594296    332 184308    0    0     0   545  342 1283  0  0  99  0
 0  0      0 3594376    332 184308    0    0     0   561  319 1314  0  1  99  0
 0  0      0 3594340    332 184308    0    0     0   586  327 1258  0  1  99  0
 0  0      0 3594644    332 184308    0    0     0   498  248 1376  0  0  99  0
 0  0      0 3595116    332 184348    0    0     0  3519  565 3452  2  4  95  0
-> and again
 0  0      0 3595320    332 184348    0    0     0   483  284 1163  0  0  99  0
 3  0      0 3595444    332 184352    0    0     0   498  247 1173  3  0  97  0
 1  0      0 3585108    332 184600    0    0     0  1298  644 2394  1  1  98  0
 1  0      0 3588152    332 184608    0    0     0  3154  520 3221  2  4  94  0
-> and again
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy  id wa
 2  0      0 3588540    332 184608    0    0     0   574  268 1332  0  1  99  0
 1  0      0 3588744    332 184608    0    0     0   546  335 1289  0  0  99  0
 1  0      0 3588628    332 184608    0    0     0   638  348 1257  0  1  99  0
 1  0      0 3588952    332 184608    0    0     0   567  310 1226  0  1  99  0
 1  0      0 3603644    332 184972    0    0    59  2821  531 2419  3  4  91  1
 1  0      0 3649476    332 186272    0    0   370   395  380 1335  1  1  98  0
-> emerge finishes, and now the system goes 'mad'

The Dirty: line from /proc/meminfo stays at 8 or 12 kB, but the system
is writing like 'mad':
 1  0      0 3650616    332 186276    0    0     0   424  296 1126  0  1  99  0
 1  0      0 3650708    332 186276    0    0     0   418  249 1190  0  0  99  0
 1  0      0 3650716    332 186276    0    0     0   418  256 1151  0  1  99  0
 1  0      0 3650816    332 186276    0    0     0   420  257 1120  0  0  99  0
 1  0      0 3651132    332 186276    0    0     0   418  269 1145  0  0  99  0
 1  0      0 3651332    332 186280    0    0     0   419  294 1099  0  1  99  0
 1  0      0 3651732    332 186280    0    0     0   423  311 1072  0  1  99  0
 1  0      0 3652048    332 186280    0    0     0   400  317 1127  0  0  99  0
 1  0      0 3652024    332 186280    0    0     0   426  346 1066  0  1  99  0
 2  0      0 3652304    332 186280    0    0     0   425  357 1132  0  1  99  0
 2  0      0 3652652    332 186280    0    0     0   416  364 1184  0  0  99  0
 1  0      0 3652836    332 186280    0    0     0   413  397 1110  0  1  99  0
 1  0      0 3652852    332 186284    0    0     0   426  427 1290  0  1  99  0
 1  0      0 3652060    332 186420    0    0    14   404  421 1768  1  1  97  0
 1  0      0 3652904    332 186420    0    0     0   418  437 1792  1  1  98  0
 1  0      0 3653572    332 186420    0    0     0   410  442 1481  1  1  99  0
 2  0      0 3653872    332 186420    0    0     0   410  451 1206  0  1  99  0
 3  0      0 3654572    332 186420    0    0     0   414  479 1341  0  1  99  0
 1  0      0 3651720    332 189832    0    0   341   420  540 1600  1  1  98  1
 1  0      0 3653256    332 189832    0    0     0   411  499 1538  1  1  98  0
 1  0      0 3654268    332 189832    0    0     0   428  505 1281  0  1  99  0
 1  0      0 3655328    332 189832    0    0     0   394  532 1015  0  1  99  0
 2  0      0 3655804    332 189832    0    0     0   355  546  964  0  1  99  0
 1  0      0 3656804    332 189836    0    0     0   337  527  949  0  1  99  0
 1  0      0 3658020    332 189836    0    0     0   348  522  937  0  1  99  0
 1  0      0 3659992    332 189836    0    0     0   354  503 1078  0  1  99  0
 1  0      0 3660068    332 189836    0    0     0    69  341  356  0  0  99  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy  id wa
 3  0      0 3660208    332 189836    0    0     0    18  311  236  0  0  99  0
 2  0      0 3660028    332 189836    0    0     0     1  297  210  0  0 100  0
... until it stops. I tried this a second time, and the same happened
again. Neither SysRq+S nor `sync` will stop this after-finish writeout.

During the unmerges I had never seen more than 300 kB of dirty data,
but as watch only updated once every 2 seconds, that is not really a
hard limit, just what I was able to see.

There was nothing else accessing the disks; only kcryptd, md1_raid5,
pdflush and emerge showed up with minimal CPU time in top / atop.

Before/during emerge stall:
[ 360.920000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30759 global 2 0 0 wc __ tw 1023 sk 0
[ 364.910000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30759 global 2 0 0 wc __ tw 1023 sk 0
[ 369.530000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30759 global 2 0 0 wc __ tw 1024 sk 0
[ 374.560000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30386 global 3 0 0 wc __ tw 1024 sk 0
[ 379.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28684 global 3 0 0 wc __ tw 1024 sk 0
[ 384.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28684 global 3 0 0 wc __ tw 1024 sk 0
[ 389.660000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28684 global 3 0 0 wc __ tw 1024 sk 0
[ 394.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28684 global 3 0 0 wc _M tw 1023 sk 0
[ 394.620000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28683 global 3 0 0 wc __ tw 1023 sk 0
[ 399.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28683 global 2 0 0 wc __ tw 1023 sk 0
[ 404.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28683 global 2 0 0 wc __ tw 1024 sk 0

At this point the stall was definitely in progress, as I then hit SysRq+W:
SysRq : Show Blocked State
  task                        PC stack   pid father
xfssyncd      D 0000000000000000     0  1040      2
 ffff810006177b60 0000000000000046 0000000000000000 0000007000000001
 0000000000000c31 0000000000000000 ffffffff80819b00 ffffffff80819b00
 ffffffff80815f40 ffffffff80819b00 ffff810006177b20 ffff810006177b10
Call Trace:
 [<ffffffff805b16a7>] __down+0xa7/0x11e
 [<ffffffff8022da70>] default_wake_function+0x0/0x10
 [<ffffffff805b1325>] __down_failed+0x35/0x3a
 [<ffffffff8037528e>] xfs_buf_lock+0x3e/0x40
 [<ffffffff803773ce>] _xfs_buf_find+0x13e/0x240
 [<ffffffff8037753f>] xfs_buf_get_flags+0x6f/0x190
 [<ffffffff80377672>] xfs_buf_read_flags+0x12/0xa0
 [<ffffffff803687e4>] xfs_trans_read_buf+0x64/0x340
 [<ffffffff80352321>] xfs_itobp+0x81/0x1e0
 [<ffffffff8022da70>] default_wake_function+0x0/0x10
 [<ffffffff80354cce>] xfs_iflush+0xfe/0x520
 [<ffffffff8036d48f>] xfs_finish_reclaim+0x15f/0x1c0
 [<ffffffff8036d5bb>] xfs_finish_reclaim_all+0xcb/0xf0
 [<ffffffff8036b608>] xfs_syncsub+0x68/0x300
 [<ffffffff8037cbe7>] xfs_sync_worker+0x17/0x40
 [<ffffffff8037cea2>] xfssyncd+0x142/0x1d0
 [<ffffffff8037cd60>] xfssyncd+0x0/0x1d0
 [<ffffffff8024a32b>] kthread+0x4b/0x80
 [<ffffffff8020c9d8>] child_rip+0xa/0x12
 [<ffffffff80219bd0>] lapic_next_event+0x0/0x10
 [<ffffffff8024a2e0>] kthread+0x0/0x80
 [<ffffffff8020c9ce>] child_rip+0x0/0x12

emerge        D ffff81010901b308     0  6130   6116
 ffff81000c5939e8 0000000000000086 0000000000000000 ffff81000614ff80
 ffff8101089dd7f0 ffffffff8022d61c ffffffff80819b00 ffffffff80819b00
 ffffffff80815f40 ffffffff80819b00 0000000000000086 ffffffff8022d7f3
Call Trace:
 [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
 [<ffffffff8022d7f3>] try_to_wake_up+0x63/0x2e0
 [<ffffffff805b16a7>] __down+0xa7/0x11e
 [<ffffffff8022da70>] default_wake_function+0x0/0x10
 [<ffffffff805b1325>] __down_failed+0x35/0x3a
 [<ffffffff8037528e>] xfs_buf_lock+0x3e/0x40
 [<ffffffff803773ce>] _xfs_buf_find+0x13e/0x240
 [<ffffffff8037753f>] xfs_buf_get_flags+0x6f/0x190
 [<ffffffff80377672>] xfs_buf_read_flags+0x12/0xa0
 [<ffffffff803687e4>] xfs_trans_read_buf+0x64/0x340
 [<ffffffff80352321>] xfs_itobp+0x81/0x1e0
 [<ffffffff80375bae>] xfs_buf_rele+0x2e/0xd0
 [<ffffffff80354cce>] xfs_iflush+0xfe/0x520
 [<ffffffff803ae592>] __down_read_trylock+0x42/0x60
 [<ffffffff80355c42>] xfs_inode_item_push+0x12/0x20
 [<ffffffff80368207>] xfs_trans_push_ail+0x267/0x2b0
 [<ffffffff8035c702>] xfs_log_reserve+0x72/0x120
 [<ffffffff80366bb8>] xfs_trans_reserve+0xa8/0x210
 [<ffffffff803525fb>] xfs_itruncate_finish+0xfb/0x310
 [<ffffffff80372364>] xfs_inactive+0x364/0x490
 [<ffffffff8037c834>] xfs_fs_clear_inode+0xa4/0xf0
 [<ffffffff802a8736>] clear_inode+0x66/0x150
 [<ffffffff802a899c>] generic_delete_inode+0x12c/0x140
 [<ffffffff8029e93d>] do_unlinkat+0x14d/0x1e0
 [<ffffffff8020bbbe>] system_call+0x7e/0x83

Next debug outputs:
[ 410.310000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28685 global 4 0 0 wc __ tw 1024 sk 0
[ 414.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28685 global 4 0 0 wc __ tw 1024 sk 0
[ 419.620000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28137 global 4 0 0 wc __ tw 1024 sk 0
[ 424.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25243 global 4 0 0 wc __ tw 1024 sk 0
[ 429.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25243 global 4 0 0 wc _M tw 1021 sk 0
[ 429.640000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25240 global 4 0 0 wc __ tw 1023 sk 0
[ 434.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241 global 2 0 0 wc __ tw 1024 sk 0
[ 439.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241 global 2 0 0 wc __ tw 1024 sk 0
[ 444.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241 global 2 0 0 wc __ tw 1024 sk 0
[ 449.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241 global 2 0 0 wc __ tw 1024 sk 0
[ 455.840000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241 global 2 0 0 wc __ tw 1024 sk 0
[ 459.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241 global 2 0 0 wc __ tw 1022 sk 0
[ 464.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241 global 2 0 0 wc __ tw 1024 sk 0
[ 469.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241 global 2 0 0 wc __ tw 1024 sk 0
[ 475.040000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 22342 global 2 0 0 wc __ tw 1024 sk 0
[ 480.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772 global 2 0 0 wc __ tw 1024 sk 0
[ 485.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772 global 2 0 0 wc __ tw 1024 sk 0
[ 490.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772 global 2 0 0 wc __ tw 1022 sk 0
[ 495.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772 global 2 0 0 wc __ tw 1024 sk 0
[ 500.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21774 global 3 0 0 wc __ tw 1024 sk 0
[ 506.580000] 
mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21774 global 3 0 0 wc __ tw 1024 sk 0 [ 510.760000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21774 global 3 0 0 wc __ tw 1024 sk 0 [ 515.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21835 global 65 0 0 wc __ tw 1024 sk 0 [ 520.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21835 global 65 0 0 wc __ tw 1024 sk 0 [ 525.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21835 global 9 56 0 wc _M tw 961 sk 0 [ 525.080000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772 global 9 56 0 wc _M tw 1023 sk 0 [ 525.100000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21771 global 9 56 0 wc _M tw 1023 sk 0 [ 525.110000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21770 global 9 56 0 wc _M tw 1024 sk 0 [ 525.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21770 global 9 56 0 wc _M tw 1024 sk 0 [ 525.160000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21770 global 9 56 0 wc _M tw 1024 sk 0 [ 525.170000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21770 global 9 56 0 wc _M tw 1023 sk 0 [ 525.170000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21769 global 9 28 0 wc _M tw 1023 sk 0 [ 525.190000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21768 global 9 28 0 wc _M tw 1024 sk 0 [ 525.200000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21768 global 9 28 0 wc _M tw 1024 sk 0 [ 525.210000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21768 global 9 28 0 wc _M tw 1024 sk 0 [ 525.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21768 global 9 28 0 wc __ tw 1023 sk 0 [ 530.080000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 19499 global 2 0 0 wc __ tw 1024 sk 0 [ 535.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 2 0 0 wc __ tw 1024 sk 0 [ 540.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 2 0 0 wc __ tw 1024 sk 0 [ 545.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 2 0 0 
wc __ tw 1024 sk 0 [ 550.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 2 0 0 wc __ tw 1024 sk 0 [ 555.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 2 0 0 wc __ tw 1024 sk 0 [ 561.990000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 1 0 0 wc __ tw 1022 sk 0 [ 566.020000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 2 0 0 wc __ tw 1024 sk 0 [ 570.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 2 0 0 wc __ tw 1024 sk 0 [ 575.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676 global 2 0 0 wc __ tw 1024 sk 0 [ 580.170000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 8244 global 3 0 0 wc __ tw 1024 sk 0 [ 585.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 8695 global 8 0 0 wc __ tw 1024 sk 0 [ 590.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10161 global 8 0 0 wc __ tw 1024 sk 0 [ 595.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10161 global 8 0 0 wc _M tw 1020 sk 0 [ 595.240000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10157 global 8 0 0 wc __ tw 1023 sk 0 [ 600.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10159 global 6 0 0 wc __ tw 1024 sk 0 [ 605.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10159 global 6 0 0 wc __ tw 1024 sk 0 [ 610.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10159 global 6 0 0 wc __ tw 1024 sk 0 [ 615.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10159 global 6 0 0 wc __ tw 1020 sk 0 [ 620.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 625.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1023 sk 0 [ 630.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1023 sk 0 [ 635.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 640.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 
10155 global 2 0 0 wc __ tw 1024 sk 0 [ 645.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 650.350000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 655.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10156 global 3 0 0 wc __ tw 1024 sk 0 [ 660.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10156 global 3 0 0 wc _M tw 1023 sk 0 [ 660.300000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 3 0 0 wc _M tw 1023 sk 0 [ 660.310000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154 global 3 1 0 wc _M tw 1024 sk 0 [ 660.330000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154 global 3 1 0 wc _M tw 1024 sk 0 [ 660.350000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154 global 3 1 0 wc _M tw 1024 sk 0 [ 660.360000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154 global 3 1 0 wc _M tw 1024 sk 0 [ 660.370000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154 global 3 1 0 wc _M tw 1024 sk 0 [ 660.380000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154 global 3 1 0 wc __ tw 1023 sk 0 [ 665.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1023 sk 0 [ 670.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 675.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 680.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 685.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 690.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 695.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1023 sk 0 [ 700.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1023 sk 0 [ 705.320000] mm/page-writeback.c 661 
wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 710.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 [ 715.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155 global 2 0 0 wc __ tw 1024 sk 0 I'm not sure, when emerge was finished here... Secound unmerge: [ 1177.110000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 16604 global 2 0 0 wc __ tw 1023 sk 0 [ 1182.110000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 16604 global 2 0 0 wc __ tw 1024 sk 0 [ 1187.130000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 15310 global 2 0 0 wc __ tw 1024 sk 0 [ 1192.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13335 global 2 0 0 wc __ tw 1024 sk 0 -> SysRq+W during one emerge stall [ 1194.530000] SysRq : Show Blocked State [ 1194.530000] task PC stack pid father [ 1194.540000] xfssyncd D ffff8101065798f8 0 1040 2 [ 1194.540000] ffff810006177d28 0000000000000046 0000000000000000 ffff81010904ae80 [ 1194.550000] ffff81010904ae80 0000000000000001 ffffffff80819b00 ffffffff80819b00 [ 1194.560000] ffffffff80815f40 ffffffff80819b00 ffffffff8039d996 0000000000000000 [ 1194.570000] Call Trace: [ 1194.570000] [<ffffffff8039d996>] submit_bio+0x66/0xf0 [ 1194.570000] [<ffffffff80375952>] _xfs_buf_ioapply+0x222/0x320 [ 1194.580000] [<ffffffff805b16a7>] __down+0xa7/0x11e [ 1194.590000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 [ 1194.590000] [<ffffffff80376ad5>] xfs_buf_iostart+0x65/0x90 [ 1194.600000] [<ffffffff805b1325>] __down_failed+0x35/0x3a [ 1194.600000] [<ffffffff8034f34b>] xfs_iflock+0x1b/0x20 [ 1194.600000] [<ffffffff8036d4d0>] xfs_finish_reclaim+0x1a0/0x1c0 [ 1194.600000] [<ffffffff8036d5bb>] xfs_finish_reclaim_all+0xcb/0xf0 [ 1194.600000] [<ffffffff8036b608>] xfs_syncsub+0x68/0x300 [ 1194.600000] [<ffffffff8037cbe7>] xfs_sync_worker+0x17/0x40 [ 1194.600000] [<ffffffff8037cea2>] xfssyncd+0x142/0x1d0 [ 1194.600000] [<ffffffff8037cd60>] xfssyncd+0x0/0x1d0 [ 1194.600000] 
[<ffffffff8024a32b>] kthread+0x4b/0x80 [ 1194.600000] [<ffffffff8020c9d8>] child_rip+0xa/0x12 [ 1194.600000] [<ffffffff80219bd0>] lapic_next_event+0x0/0x10 [ 1194.600000] [<ffffffff8024a2e0>] kthread+0x0/0x80 [ 1194.600000] [<ffffffff8020c9ce>] child_rip+0x0/0x12 [ 1194.600000] [ 1194.600000] emerge D 0000000000000000 0 6742 6116 [ 1194.600000] ffff81000cc4d9e8 0000000000000086 0000000000000000 0000007000000001 [ 1194.600000] 0000000000000818 ffffffff00000000 ffffffff80819b00 ffffffff80819b00 [ 1194.600000] ffffffff80815f40 ffffffff80819b00 ffff81000cc4d9a8 ffff81000cc4d998 [ 1194.600000] Call Trace: [ 1194.600000] [<ffffffff805b16a7>] __down+0xa7/0x11e [ 1194.600000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 [ 1194.600000] [<ffffffff805b1325>] __down_failed+0x35/0x3a [ 1194.600000] [<ffffffff8037528e>] xfs_buf_lock+0x3e/0x40 [ 1194.600000] [<ffffffff803773ce>] _xfs_buf_find+0x13e/0x240 [ 1194.600000] [<ffffffff8037753f>] xfs_buf_get_flags+0x6f/0x190 [ 1194.600000] [<ffffffff80377672>] xfs_buf_read_flags+0x12/0xa0 [ 1194.600000] [<ffffffff803687e4>] xfs_trans_read_buf+0x64/0x340 [ 1194.600000] [<ffffffff80352321>] xfs_itobp+0x81/0x1e0 [ 1194.600000] [<ffffffff80375bae>] xfs_buf_rele+0x2e/0xd0 [ 1194.600000] [<ffffffff80354cce>] xfs_iflush+0xfe/0x520 [ 1194.600000] [<ffffffff803ae592>] __down_read_trylock+0x42/0x60 [ 1194.600000] [<ffffffff80355c42>] xfs_inode_item_push+0x12/0x20 [ 1194.600000] [<ffffffff80368207>] xfs_trans_push_ail+0x267/0x2b0 [ 1194.600000] [<ffffffff8035c702>] xfs_log_reserve+0x72/0x120 [ 1194.600000] [<ffffffff80366bb8>] xfs_trans_reserve+0xa8/0x210 [ 1194.600000] [<ffffffff803525fb>] xfs_itruncate_finish+0xfb/0x310 [ 1194.600000] [<ffffffff80372364>] xfs_inactive+0x364/0x490 [ 1194.600000] [<ffffffff8037c834>] xfs_fs_clear_inode+0xa4/0xf0 [ 1194.600000] [<ffffffff802a8736>] clear_inode+0x66/0x150 [ 1194.600000] [<ffffffff802a899c>] generic_delete_inode+0x12c/0x140 [ 1194.600000] [<ffffffff8029e93d>] do_unlinkat+0x14d/0x1e0 [ 
1194.600000] [<ffffffff8020bbbe>] system_call+0x7e/0x83 [ 1194.600000] [ 1197.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13337 global 4 0 0 wc __ tw 1024 sk 0 [ 1202.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13337 global 4 0 0 wc __ tw 1024 sk 0 [ 1207.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13337 global 4 0 0 wc _M tw 1021 sk 0 [ 1207.240000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13334 global 4 0 0 wc _M tw 1023 sk 0 [ 1207.260000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13333 global 4 0 0 wc __ tw 1023 sk 0 ... After emerge finished: [ 1322.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11163 global 3 0 0 wc _M tw 1022 sk 0 [ 1322.650000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11161 global 3 0 0 wc __ tw 1023 sk 0 [ 1327.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162 global 2 0 0 wc __ tw 1024 sk 0 [ 1332.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162 global 2 0 0 wc __ tw 1024 sk 0 [ 1337.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162 global 2 0 0 wc __ tw 1024 sk 0 [ 1342.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162 global 2 0 0 wc __ tw 1024 sk 0 [ 1347.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162 global 2 0 0 wc __ tw 1024 sk 0 -> After emerge finishes xfssyncd seems the only blocked process. Does this process do the continuing writeout? 
[ 1351.880000] SysRq : Show Blocked State [ 1351.880000] task PC stack pid father [ 1351.880000] xfssyncd D ffff810104f0f6f8 0 1040 2 [ 1351.880000] ffff810006177d28 0000000000000046 0000000000000000 ffff810101359380 [ 1351.880000] ffff810101359380 0000000000000001 ffffffff80819b00 ffffffff80819b00 [ 1351.880000] ffffffff80815f40 ffffffff80819b00 ffffffff8039d996 0000000000000000 [ 1351.880000] Call Trace: [ 1351.880000] [<ffffffff8039d996>] submit_bio+0x66/0xf0 [ 1351.880000] [<ffffffff80375952>] _xfs_buf_ioapply+0x222/0x320 [ 1351.880000] [<ffffffff805b16a7>] __down+0xa7/0x11e [ 1351.880000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 [ 1351.880000] [<ffffffff80376ad5>] xfs_buf_iostart+0x65/0x90 [ 1351.880000] [<ffffffff805b1325>] __down_failed+0x35/0x3a [ 1351.880000] [<ffffffff8034f34b>] xfs_iflock+0x1b/0x20 [ 1351.880000] [<ffffffff8036d4d0>] xfs_finish_reclaim+0x1a0/0x1c0 [ 1351.880000] [<ffffffff8036d5bb>] xfs_finish_reclaim_all+0xcb/0xf0 [ 1351.880000] [<ffffffff8036b608>] xfs_syncsub+0x68/0x300 [ 1351.880000] [<ffffffff8037cbe7>] xfs_sync_worker+0x17/0x40 [ 1351.880000] [<ffffffff8037cea2>] xfssyncd+0x142/0x1d0 [ 1351.880000] [<ffffffff8037cd60>] xfssyncd+0x0/0x1d0 [ 1351.880000] [<ffffffff8024a32b>] kthread+0x4b/0x80 [ 1351.880000] [<ffffffff8020c9d8>] child_rip+0xa/0x12 [ 1351.880000] [<ffffffff80219bd0>] lapic_next_event+0x0/0x10 [ 1351.880000] [<ffffffff8024a2e0>] kthread+0x0/0x80 [ 1351.880000] [<ffffffff8020c9ce>] child_rip+0x0/0x12 [ 1351.880000] [ 1352.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11163 global 3 0 0 wc __ tw 1024 sk 0 [ 1357.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11216 global 3 0 0 wc _M tw 1022 sk 0 [ 1357.650000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11214 global 3 0 0 wc _M tw 1023 sk 0 [ 1357.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11213 global 1 3 0 wc _M tw 1024 sk 0 [ 1357.690000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11213 global 1 3 0 wc _M tw 1024 
sk 0 [ 1357.700000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11213 global 1 3 0 wc __ tw 1023 sk 0 [ 1362.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11285 global 2 0 0 wc __ tw 1024 sk 0 [ 1367.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11289 global 2 0 0 wc __ tw 1024 sk 0 [ 1372.650000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11289 global 2 0 0 wc __ tw 1024 sk 0 -> Here I am trying SysRq+S to stop/finish the continued writeout of 8kB dirty data, but the disk where still working after that... [ 1375.860000] SysRq : Emergency Sync [ 1375.860000] mm/page-writeback.c 587 background_writeout: pdflush(284) 0 global 2 0 0 wc __ tw 1022 sk 0 [ 1375.960000] Emergency Sync complete [ 1377.650000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11288 global 1 0 0 wc __ tw 1024 sk 0 [ 1382.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11276 global 2 0 0 wc __ tw 1024 sk 0 [ 1387.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11276 global 2 0 0 wc __ tw 1024 sk 0 [ 1389.720000] mm/page-writeback.c 587 background_writeout: pdflush(285) 0 global 2 0 0 wc __ tw 1022 sk 0 [ 1392.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11277 global 1 0 0 wc __ tw 1024 sk 0 [ 1397.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11278 global 2 0 0 wc __ tw 1024 sk 0 [ 1402.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11278 global 2 0 0 wc __ tw 1024 sk 0 I also did a SysRq+T, but nothing interessing in it. 
All processes sleeping in schedule_timeout and other timer stuff, except emerge and xfssyncd in state D (similar calltrace to the SysRq+W) and md1_raid5: [ 495.640000] md1_raid5 D 0000000000000000 0 946 2 [ 495.640000] ffff810006145d20 0000000000000046 0000000000000000 00000000000000 [ 495.640000] 0000000000000010 ffffffff00000000 ffffffff80819b00 ffffffff80819b [ 495.640000] ffffffff80815f40 ffffffff80819b00 ffff810006145ce0 ffff810006145c [ 495.640000] Call Trace: [ 495.640000] [<ffffffff8039d996>] submit_bio+0x66/0xf0 [ 495.640000] [<ffffffff804c41e5>] md_super_wait+0xb5/0xd0 [ 495.640000] [<ffffffff8024a710>] autoremove_wake_function+0x0/0x30 [ 495.640000] [<ffffffff804ccb60>] bitmap_unplug+0x1b0/0x1c0 [ 495.640000] [<ffffffff804cab90>] md_thread+0x0/0x100 [ 495.640000] [<ffffffff804bf3d6>] raid5d+0xa6/0x490 [ 495.640000] [<ffffffff805b0197>] schedule_timeout+0x67/0xd0 [ 495.640000] [<ffffffff8023e740>] process_timeout+0x0/0x10 [ 495.640000] [<ffffffff805b018a>] schedule_timeout+0x5a/0xd0 [ 495.640000] [<ffffffff804cab90>] md_thread+0x0/0x100 [ 495.640000] [<ffffffff804cabc0>] md_thread+0x30/0x100 [ 495.640000] [<ffffffff8024a710>] autoremove_wake_function+0x0/0x30 [ 495.640000] [<ffffffff804cab90>] md_thread+0x0/0x100 [ 495.640000] [<ffffffff8024a32b>] kthread+0x4b/0x80 [ 495.640000] [<ffffffff8020c9d8>] child_rip+0xa/0x12 [ 495.640000] [<ffffffff8024a2e0>] kthread+0x0/0x80 [ 495.640000] [<ffffffff8020c9ce>] child_rip+0x0/0x12 The following processes where running: events/3 R running task 0 18 2 syslog-ng R running task 0 4616 1 X R running task 0 5814 5764 [snip] > > I don't know. It's just that I have seen tmpfs also redirtieing inodes > > in these logs and the stalling emerge is moving files from tmpfs to > > xfs. > > It could be, but I don't know enough about tmpfs internals to really be sure. > > I just wanted to mention, that tmpfs is involved somehow. > > The requeue messages for tmpfs are not pleasant, but known to be fine ;-) OK, didnt know that. 
But it makes sense. Dirty tmpfs inodes do not sound like a problem, but more like the normal case. ;-)

I will try the patch from Peter, see whether it solves the emerge/installing part, and post logs from that...

Torsten

^ permalink raw reply [flat|nested] 39+ messages in thread
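The debug lines in the logs above share one fixed layout, so the dirty and writeback counters can be pulled out and followed over time instead of being read by eye. The following is only a sketch, not part of the thread or of the debug patch. The field order follows the printk format in the debug patch posted in this thread (count, then NR_FILE_DIRTY, NR_WRITEBACK, NR_UNSTABLE_NFS, then the two `wc` flag characters). The meaning of `tw` and `sk` is not visible in the part of the patch shown here, so they are kept as opaque integers. The names `WritebackRecord`, `parse_line` and `dirty_series` are made up for this illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple

# One debug line looks like:
# [ 360.920000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30759 global 2 0 0 wc __ tw 1023 sk 0
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"(?P<file>\S+)\s+(?P<line>\d+)\s+(?P<func>\w+):\s+"
    r"(?P<comm>\S+?)\((?P<pid>\d+)\)\s+(?P<n>-?\d+)\s+"
    r"global\s+(?P<dirty>\d+)\s+(?P<writeback>\d+)\s+(?P<unstable>\d+)\s+"
    r"wc\s+(?P<congested>[C_])(?P<more_io>[M_])\s+"
    r"tw\s+(?P<tw>-?\d+)\s+sk\s+(?P<sk>-?\d+)"
)


class WritebackRecord(NamedTuple):
    timestamp: float   # seconds since boot, from the printk prefix
    func: str          # reporting function, e.g. wb_kupdate
    comm: str          # task name, e.g. pdflush
    pid: int
    n: int             # first counter after the task name
    dirty: int         # global NR_FILE_DIRTY pages
    writeback: int     # global NR_WRITEBACK pages
    unstable: int      # global NR_UNSTABLE_NFS pages
    congested: bool    # 'C' flag (encountered_congestion)
    more_io: bool      # 'M' flag (more_io)
    tw: int            # assumption: meaning not shown in the visible patch
    sk: int            # assumption: meaning not shown in the visible patch


def parse_line(text: str) -> Optional[WritebackRecord]:
    """Return a record for a debug line, or None for any other log line."""
    m = LINE_RE.search(text)
    if m is None:
        return None
    return WritebackRecord(
        timestamp=float(m.group("ts")),
        func=m.group("func"),
        comm=m.group("comm"),
        pid=int(m.group("pid")),
        n=int(m.group("n")),
        dirty=int(m.group("dirty")),
        writeback=int(m.group("writeback")),
        unstable=int(m.group("unstable")),
        congested=m.group("congested") == "C",
        more_io=m.group("more_io") == "M",
        tw=int(m.group("tw")),
        sk=int(m.group("sk")),
    )


def dirty_series(lines: Iterable[str]) -> List[Tuple[float, int, int]]:
    """Collect (timestamp, dirty, writeback) for every debug line found."""
    records = (parse_line(line) for line in lines)
    return [(r.timestamp, r.dirty, r.writeback) for r in records if r]
```

Feeding a saved dmesg file through `dirty_series` gives a list that can be plotted or diffed between runs. Lines that a mail client has wrapped (as in the quoted logs elsewhere in this thread) would have to be re-joined first, since the pattern expects one record per line.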
* Re: writeout stalls in current -git
  [not found] ` <E1InrKN-0000MK-G5@localhost>
  2007-11-02 7:52 ` Fengguang Wu
@ 2007-11-02 7:52 ` Fengguang Wu
  1 sibling, 0 replies; 39+ messages in thread
From: Fengguang Wu @ 2007-11-02 7:52 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Maxim Levitsky, Peter Zijlstra, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel

On Fri, Nov 02, 2007 at 08:42:05AM +0100, Torsten Kaiser wrote:
> The Subject is still misleading, I'm using 2.6.23-mm1.
>
> On 11/2/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > > On 11/1/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > > > Each time I noticed this I was using emerge (package util from the
> > > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > > partition to the main xfs filesystem.
> > > > > The hangs were not fatal, after a few seconds everything resumed
> > > > > normal, so I was not able to capture a good image of what was
> > > > > happening.
> > > >
> > > > Thank you for the detailed report.
> > > >
> > > > How severe were the hangs? Only writeouts stalled, all apps stalled, or
> > > > cannot type and run new commands?
> > >
> > > Only writeout stalled. The emerge that was moving the files hung, but
> > > everything else worked normally.
> > > I was able to run new commands, like copying the /proc/meminfo.
> >
> > But you mentioned in the next mail that `watch cat /proc/meminfo`
> > could also be blocked for some time - I guess in the same time emerge
> > was stalled?
>
> The behavior was different on these stalls.
> On first report the writeout stopped completely, the emerge stopped,
> but at that time a cat /proc/meminfo >~/stall/meminfo did succeed and
> not stall.
> About the watch cat /proc/meminfo, I will write in the answer to the
> other mail...

OK.

> > > [snip]
> > > > > After this SysRq+W writeback resumed again. Possible that writing
> > > > > above into the syslog triggered that.
> > > >
> > > > Maybe. Are the log files on another disk/partition?
> > >
> > > No, everything was going to /
> > >
> > > What might be interesting is, that doing cat /proc/meminfo
> > > >~/stall/meminfo did not resume the writeback. So there might be some
> > > threshold that only was broken with the additional write from
> > > syslog-ng. Or syslog-ng does some flushing, I don't know. (I'm using the
> >
> > Have you tried explicit `sync`? ;-)
>
> No. I wanted to see what is stalled. So I started by collecting info
> from /proc and then the SysRq+W. And after hitting SysRq the writeout
> started to resume without any further action.
>
> But I think I have seen a `sync` stall also. During another emerge I
> noticed the system slowing down and wanted to use `sync` to speed up
> the writeout. The result was that the writeout did not speed up
> immediately, only after around a minute. The `sync` only returned at
> that time.
> Can writers starve `sync`?

I guess the new debug printks will provide more hints on it.

> > > syslog-ng package from gentoo:
> > > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> > >
> > > > > The source tmpfs is mounted with any special parameters, but the
> > > > > target xfs filesystem resides on a dm-crypt device that is on top of a 3
> > > > > disk RAID5 md.
> > > > > During the hang all CPUs were idle.
> > > >
> > > > No iowaits? ;-)
> > >
> > > No, I have a KSysGuard in my taskbar that showed no activity at all.
> > >
> > > OK, the subject does not match for my case, but there was also a tmpfs
> > > involved. And I found no thread with stalls on xfs.
:-)
> >
> > Do you mean it is actually related to tmpfs?
>
> I don't know. It's just that I have seen tmpfs also redirtying inodes
> in these logs and the stalling emerge is moving files from tmpfs to
> xfs.
> It could be, but I don't know enough about tmpfs internals to really be sure.
> I just wanted to mention that tmpfs is involved somehow.

The requeue messages for tmpfs are not pleasant, but known to be fine ;-)

Fengguang

^ permalink raw reply [flat|nested] 39+ messages in thread
* writeout stalls in current -git
  [not found] ` <E1Inljm-0002DW-CL@localhost>
  2007-11-02 1:54 ` writeout stalls in current -git Fengguang Wu
@ 2007-11-02 1:54 ` Fengguang Wu
  1 sibling, 0 replies; 39+ messages in thread
From: Fengguang Wu @ 2007-11-02 1:54 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Maxim Levitsky, Peter Zijlstra, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel

On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > Each time I noticed this I was using emerge (package util from the
> > > gentoo distribution) to install/upgrade a package. The last step,
> > > where this hang occurred, is moving the prepared files from a tmpfs
> > > partition to the main xfs filesystem.
> > > The hangs were not fatal, after a few seconds everything resumed
> > > normal, so I was not able to capture a good image of what was
> > > happening.
> >
> > Thank you for the detailed report.
> >
> > How severe were the hangs? Only writeouts stalled, all apps stalled, or
> > cannot type and run new commands?
>
> Only writeout stalled. The emerge that was moving the files hung, but
> everything else worked normally.
> I was able to run new commands, like copying the /proc/meminfo.

But you mentioned in the next mail that `watch cat /proc/meminfo`
could also be blocked for some time - I guess in the same time emerge
was stalled?

> [snip]
> > > After this SysRq+W writeback resumed again. Possible that writing
> > > above into the syslog triggered that.
> >
> > Maybe. Are the log files on another disk/partition?
>
> No, everything was going to /
>
> What might be interesting is, that doing cat /proc/meminfo
> >~/stall/meminfo did not resume the writeback. So there might be some
> threshold that only was broken with the additional write from
> syslog-ng.
Or syslog-ng does some flushing, I don't know. (I'm using the

Have you tried explicit `sync`? ;-)

> syslog-ng package from gentoo:
> http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
>
> > > The source tmpfs is mounted with any special parameters, but the
> > > target xfs filesystem resides on a dm-crypt device that is on top of a 3
> > > disk RAID5 md.
> > > During the hang all CPUs were idle.
> >
> > No iowaits? ;-)
>
> No, I have a KSysGuard in my taskbar that showed no activity at all.
>
> OK, the subject does not match for my case, but there was also a tmpfs
> involved. And I found no thread with stalls on xfs. :-)

Do you mean it is actually related to tmpfs?

> > > The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
> > > interrupts per second because of the bttv driver. (But I was not using
> > > that device at this time.)
> > >
> > > I'm willing to test patches or provide more information, but lack
> > > a good testcase to trigger this on demand.
> >
> > Thank you. Maybe we can start by the applied debug patch :-)
>
> Will apply it and try to recreate this.
>
> Thanks for looking into it.

Thank you for the rich information, too :-)

Fengguang

^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <64bb37e0711011200n228e708eg255640388f83da22@mail.gmail.com>]
[parent not found: <E1InmAI-0003ME-2i@localhost>]
* Re: writeout stalls in current -git
  [not found] ` <E1InmAI-0003ME-2i@localhost>
@ 2007-11-02 2:21 ` Fengguang Wu
  2007-11-02 7:50 ` Torsten Kaiser
  2007-11-02 2:21 ` Fengguang Wu
  2007-11-02 10:15 ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Fengguang Wu @ 2007-11-02 2:21 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Maxim Levitsky, Peter Zijlstra, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 8466 bytes --]

On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Torsten Kaiser <just.for.lkml@googlemail.com> wrote:
> > On 11/1/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > > Thank you. Maybe we can start by the applied debug patch :-)
> >
> > Will apply it and try to recreate this.
>
> Patch applied, used emerge to install a 2.6.24-rc1 kernel.
>
> I had no complete stalls, but three times during the move from tmpfs
> to the main xfs the emerge got noticeably slower. There still was
> writeout happening, but as emerge prints out every file it has written
> during the pause not one file was processed.
>
> vmstat 10:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy  id wa
>  0  1      0 3146424    332 614768    0    0   134  1849  438 2515  3  4  91  2
>  0  0      0 3146644    332 614784    0    0     2  1628  507  646  0  2  85 13
>  0  0      0 3146868    332 614868    0    0     5  2359  527 1076  0  3  97  0
>  1  0      0 3144372    332 616148    0    0    96  2829  607 2666  2  5  92  0
> -> normal writeout
>  0  0      0 3140560    332 618144    0    0   152  2764  633 3308  3  6  91  0
>  0  0      0 3137332    332 619908    0    0   114  1801  588 2858  3  4  93  0
>  0  0      0 3136912    332 620136    0    0    20   827  393 1605  1  2  98  0
> -> first stall

'stall': vmstat's output stalls for some time, or emerge stalls for the
next several vmstat lines?
>  0  0      0 3137088    332 620136    0    0     0   557  339 1437  0  1  99  0
>  0  0      0 3137160    332 620136    0    0     0   642  310 1400  0  1  99  0
>  0  0      0 3136588    332 620172    0    0     6  2972  527 1195  0  3  80 16
>  0  0      0 3136276    332 620348    0    0    10  2668  558 1195  0  3  96  0
>  0  0      0 3135228    332 620424    0    0     8  2712  522 1311  0  4  96  0
>  0  0      0 3131740    332 621524    0    0    75  2935  559 2457  2  5  93  0
>  0  0      0 3128348    332 622972    0    0    85  1470  490 2607  3  4  93  0
>  0  0      0 3129292    332 622972    0    0     0   527  353 1398  0  1  99  0
> -> second longer stall
>  0  0      0 3128520    332 623028    0    0     6   488  249 1390  0  1  99  0
>  0  0      0 3128236    332 623028    0    0     0   482  222 1222  0  1  99  0
>  0  0      0 3128408    332 623028    0    0     0   585  269 1301  0  0  99  0
>  0  0      0 3128532    332 623028    0    0     0   610  262 1278  0  0  99  0
>  0  0      0 3128568    332 623028    0    0     0   636  345 1639  0  1  99  0
>  0  0      0 3129032    332 623040    0    0     1   664  337 1466  0  1  99  0
>  0  0      0 3129484    332 623040    0    0     0   658  300 1508  0  0 100  0
>  0  0      0 3129576    332 623040    0    0     0   562  271 1454  0  1  99  0
>  0  0      0 3129736    332 623040    0    0     0   627  278 1406  0  1  99  0
>  0  0      0 3129368    332 623040    0    0     0   507  274 1301  0  1  99  0
>  0  0      0 3129004    332 623040    0    0     0   444  211 1213  0  0  99  0
>  0  1      0 3127260    332 623040    0    0     0  1036  305 1242  0  1  95  4
>  0  0      0 3126280    332 623128    0    0     7  4241  555 1575  1  5  84 10
>  0  0      0 3124948    332 623232    0    0     6  4194  529 1505  1  4  95  0
>  0  0      0 3125228    332 624168    0    0    58  1966  586 1964  2  4  94  0
> -> emerge resumed to normal speed, without any intervention from my side
>  0  0      0 3120932    332 625904    0    0   112  1546  546 2565  3  4  93  0
>  0  0      0 3118012    332 627568    0    0   128  1542  612 2705  3  4  93  0

Interesting, the 'bo' never falls to zero.
> > >From syslog: > first stall: > [ 575.050000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47259 > global 610 0 0 wc __ tw 1023 sk 0 > [ 586.350000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50465 > global 6117 0 0 wc _M tw 967 sk 0 > [ 586.360000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50408 > global 6117 0 0 wc __ tw 1022 sk 0 > [ 599.900000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 53523 > global 11141 0 0 wc __ tw 1009 sk 0 > [ 635.780000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 59397 > global 12757 124 0 wc __ tw 0 sk 0 > [ 638.470000] mm/page-writeback.c 418 balance_dirty_pages: > emerge(6113) 1536 global 11405 51 0 wc __ tw 0 sk 0 > [ 638.820000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 58373 > global 11276 48 0 wc __ tw -1 sk 0 > [ 641.260000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 57348 > global 10565 100 0 wc __ tw 0 sk 0 > [ 643.980000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 56324 > global 9788 103 0 wc __ tw -1 sk 0 > [ 646.120000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 55299 > global 8912 6 0 wc __ tw 0 sk 0 > > second stall: > [ 664.040000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48117 > global 2864 81 0 wc _M tw -13 sk 0 > [ 664.400000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47080 > global 1995 137 0 wc _M tw 176 sk 0 > [ 664.510000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 46232 > global 1929 267 0 wc __ tw 880 sk 0 > cron[6927]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons ) > [ 809.560000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 49422 > global 19166 217 0 wc _M tw 380 sk 0 > [ 811.720000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48778 > global 17969 407 0 wc _M tw -4 sk 0 > [ 813.880000] mm/page-writeback.c 418 balance_dirty_pages: > emerge(6113) 1537 global 16592 233 0 wc _M tw -1 sk 0 > [ 814.710000] mm/page-writeback.c 418 balance_dirty_pages: find(6931) > 1537 global 16132 179 0 wc __ tw -1 sk 0 > [ 
814.720000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47750 > global 16040 271 0 wc _M tw -1 sk 0 > [ 815.040000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 46725 > global 15403 779 0 wc CM tw 324 sk 0 > > the third stall happend after the emerge was finished. There still was > ~120Mb of dirty data, but its writeout got much slower over several > seconds. > vmstat 10: > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 0 0 3096152 332 630424 0 0 81 1503 640 2771 5 4 91 0 > 0 0 0 3101024 332 631588 0 0 279 473 510 1281 5 2 92 1 > -> stall / slowdown starts > 0 0 0 3147924 332 632384 0 0 78 626 449 1384 0 1 99 0 > 1 0 0 3147940 332 632384 0 0 0 611 388 1387 0 1 99 0 > 0 1 0 3147576 332 632384 0 0 0 939 449 1432 0 1 99 0 > 0 0 0 3145476 332 632384 0 0 0 3592 644 925 0 4 93 3 > -> writeout resumes full speed > 0 0 0 3147232 332 632480 0 0 0 3108 678 1053 0 3 97 0 > 0 0 0 3146860 332 632480 0 0 0 2497 677 859 0 3 97 0 > 0 0 0 3146720 332 632480 0 0 0 2433 648 839 0 3 97 0 > 0 0 0 3147844 332 632484 0 0 0 2394 625 889 0 3 97 0 > 0 0 0 3148128 332 632484 0 0 0 2204 671 848 0 2 97 0 > > from syslog: > [ 848.070000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48084 > global 13805 0 0 wc _M tw 1008 sk 0 > [ 848.080000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48068 > global 13805 0 0 wc __ tw 1020 sk 0 > [ 884.090000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 61811 > global 30297 2 0 wc __ tw 862 sk 0 > [ 921.760000] mm/page-writeback.c 418 balance_dirty_pages: cat(7170) > 1541 global 28113 391 0 wc __ tw -5 sk 0 > -> that cat was probably my watch cat /proc/meminfo > -> during the stall there where no updates visible there > [ 922.190000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 76871 > global 27735 0 0 wc __ tw -5 sk 0 > [ 923.550000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 75842 > global 26688 106 0 wc _M tw -1 sk 0 > [ 924.940000] 
mm/page-writeback.c 655 wb_kupdate: pdflush(285) 74817 > global 25698 195 0 wc _M tw 0 sk 0 > > Apart from my normal kde desktop (no compiz) and the emerge the system was idle. Interestingly, no background_writeout() appears, but only balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't block the process. > If I see the complete stall again, I will post that too. Thank you, could you run it with the attached new debug patch? Fengguang [-- Attachment #2: writeback-debug.patch --] [-- Type: text/x-diff, Size: 2677 bytes --] --- mm/page-writeback.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) --- linux-2.6.24-git17.orig/mm/page-writeback.c +++ linux-2.6.24-git17/mm/page-writeback.c @@ -98,6 +98,26 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ +#define writeback_debug_report(n, wbc) do { \ + __writeback_debug_report(n, wbc, __FILE__, __LINE__, __FUNCTION__); \ +} while (0) + +void __writeback_debug_report(long n, struct writeback_control *wbc, + const char *file, int line, const char *func) +{ + printk(KERN_DEBUG "%s %d %s: %s(%d) %ld " + "global %lu %lu %lu " + "wc %c%c tw %ld sk %ld\n", + file, line, func, + current->comm, current->pid, n, + global_page_state(NR_FILE_DIRTY), + global_page_state(NR_WRITEBACK), + global_page_state(NR_UNSTABLE_NFS), + wbc->encountered_congestion ? 'C':'_', + wbc->more_io ? 
'M':'_', + wbc->nr_to_write, + wbc->pages_skipped); +} static void background_writeout(unsigned long _min_pages); @@ -395,6 +415,7 @@ static void balance_dirty_pages(struct a pages_written += write_chunk - wbc.nr_to_write; get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); + writeback_debug_report(pages_written, &wbc); } /* @@ -421,6 +442,7 @@ static void balance_dirty_pages(struct a break; /* We've done our duty */ congestion_wait(WRITE, HZ/10); + writeback_debug_report(-pages_written, &wbc); } if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh && @@ -515,6 +537,11 @@ void throttle_vm_writeout(gfp_t gfp_mask global_page_state(NR_WRITEBACK) <= dirty_thresh) break; congestion_wait(WRITE, HZ/10); + printk(KERN_DEBUG "throttle_vm_writeout: " + "congestion_wait on %lu+%lu > %lu\n", + global_page_state(NR_UNSTABLE_NFS), + global_page_state(NR_WRITEBACK), + dirty_thresh); /* * The caller might hold locks which can prevent IO completion @@ -557,6 +584,7 @@ static void background_writeout(unsigned wbc.pages_skipped = 0; writeback_inodes(&wbc); min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; + writeback_debug_report(min_pages, &wbc); if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { /* Wrote less than expected */ if (wbc.encountered_congestion || wbc.more_io) @@ -630,6 +658,7 @@ static void wb_kupdate(unsigned long arg wbc.encountered_congestion = 0; wbc.nr_to_write = MAX_WRITEBACK_PAGES; writeback_inodes(&wbc); + writeback_debug_report(nr_to_write, &wbc); if (wbc.nr_to_write > 0) { if (wbc.encountered_congestion || wbc.more_io) congestion_wait(WRITE, HZ/10); ^ permalink raw reply [flat|nested] 39+ messages in thread
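For readers following the syslog excerpts, the printk format in the debug patch fixes the meaning of every field in lines such as "mm/page-writeback.c 655 wb_kupdate: pdflush(285) 46725 global 15403 779 0 wc CM tw 324 sk 0". The following is only a minimal userspace sketch of how such a line could be split back into its fields. The names `struct wb_report` and `parse_wb_report` are hypothetical and not part of the patch, and the parser assumes the leading "[ timestamp]" has already been removed and that a mail-wrapped entry has been rejoined into one line.

```c
#include <stdio.h>

/* Hypothetical container for one writeback_debug_report() line. */
struct wb_report {
	char file[64];            /* __FILE__ */
	int line;                 /* __LINE__ */
	char func[64];            /* __FUNCTION__ */
	char comm[32];            /* current->comm */
	int pid;                  /* current->pid */
	long n;                   /* page count passed in by the caller */
	unsigned long dirty;      /* global_page_state(NR_FILE_DIRTY) */
	unsigned long writeback;  /* global_page_state(NR_WRITEBACK) */
	unsigned long unstable;   /* global_page_state(NR_UNSTABLE_NFS) */
	char congested;           /* 'C' if wbc->encountered_congestion, else '_' */
	char more_io;             /* 'M' if wbc->more_io, else '_' */
	long nr_to_write;         /* "tw": wbc->nr_to_write */
	long pages_skipped;       /* "sk": wbc->pages_skipped */
};

/*
 * Parse one report line (without the "[ time]" prefix).
 * Returns 1 if all 13 fields were read, 0 otherwise.
 */
static int parse_wb_report(const char *s, struct wb_report *r)
{
	int got = sscanf(s,
		"%63s %d %63[^:]: %31[^(](%d) %ld "
		"global %lu %lu %lu "
		"wc %c%c tw %ld sk %ld",
		r->file, &r->line, r->func, r->comm, &r->pid, &r->n,
		&r->dirty, &r->writeback, &r->unstable,
		&r->congested, &r->more_io,
		&r->nr_to_write, &r->pages_skipped);

	return got == 13;
}
```

Read this way, the last entry of the second stall reports 15403 dirty pages and 779 pages under writeback, with both the congestion and more_io flags set. Assuming 4 KiB pages, 15403 pages is roughly 60 MiB.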
* Re: writeout stalls in current -git
  2007-11-02  2:21 ` Fengguang Wu
@ 2007-11-02  7:50 ` Torsten Kaiser
  0 siblings, 0 replies; 39+ messages in thread
From: Torsten Kaiser @ 2007-11-02  7:50 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Maxim Levitsky, Peter Zijlstra, linux-kernel, Andrew Morton,
	David Chinner, linux-fsdevel

On 11/2/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
> > On 11/1/07, Torsten Kaiser <just.for.lkml@googlemail.com> wrote:
> > > On 11/1/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > > > Thank you. Maybe we can start by the applied debug patch :-)
> > >
> > > Will applied it and try to recreate this.
> >
> > Patch applied, used emerge to install a 2.6.24-rc1 kernel.
> >
> > I had no complete stalls, but three times during the move from tmpfs
> > to the main xfs the emerge got noticeable slower. There still was
> > writeout happening, but as emerge prints out every file it has written
> > during the pause not one file was processed.
> >
> > vmstat 10:
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 0 1 0 3146424 332 614768 0 0 134 1849 438 2515 3 4 91 2
> > 0 0 0 3146644 332 614784 0 0 2 1628 507 646 0 2 85 13
> > 0 0 0 3146868 332 614868 0 0 5 2359 527 1076 0 3 97 0
> > 1 0 0 3144372 332 616148 0 0 96 2829 607 2666 2 5 92 0
> > -> normal writeout
> > 0 0 0 3140560 332 618144 0 0 152 2764 633 3308 3 6 91 0
> > 0 0 0 3137332 332 619908 0 0 114 1801 588 2858 3 4 93 0
> > 0 0 0 3136912 332 620136 0 0 20 827 393 1605 1 2 98 0
> > -> first stall
>
> 'stall': vmstat's output stalls for some time, or emerge stalls for
> the next several vmstat lines?

emerge stalls. The vmstat did work normally.

> > 0 0 0 3137088 332 620136 0 0 0 557 339 1437 0 1 99 0
> > 0 0 0 3137160 332 620136 0 0 0 642 310 1400 0 1 99 0

So these last three lines indicate that for ~30 seconds the
writeout was much slower than normal.

> > 0 0 0 3136588 332 620172 0 0 6 2972 527 1195 0 3 80 16
> > 0 0 0 3136276 332 620348 0 0 10 2668 558 1195 0 3 96 0
> > 0 0 0 3135228 332 620424 0 0 8 2712 522 1311 0 4 96 0
> > 0 0 0 3131740 332 621524 0 0 75 2935 559 2457 2 5 93 0
> > 0 0 0 3128348 332 622972 0 0 85 1470 490 2607 3 4 93 0
> > 0 0 0 3129292 332 622972 0 0 0 527 353 1398 0 1 99 0
> > -> second longer stall
> > 0 0 0 3128520 332 623028 0 0 6 488 249 1390 0 1 99 0
> > 0 0 0 3128236 332 623028 0 0 0 482 222 1222 0 1 99 0
> > 0 0 0 3128408 332 623028 0 0 0 585 269 1301 0 0 99 0
> > 0 0 0 3128532 332 623028 0 0 0 610 262 1278 0 0 99 0
> > 0 0 0 3128568 332 623028 0 0 0 636 345 1639 0 1 99 0
> > 0 0 0 3129032 332 623040 0 0 1 664 337 1466 0 1 99 0
> > 0 0 0 3129484 332 623040 0 0 0 658 300 1508 0 0 100 0
> > 0 0 0 3129576 332 623040 0 0 0 562 271 1454 0 1 99 0
> > 0 0 0 3129736 332 623040 0 0 0 627 278 1406 0 1 99 0
> > 0 0 0 3129368 332 623040 0 0 0 507 274 1301 0 1 99 0
> > 0 0 0 3129004 332 623040 0 0 0 444 211 1213 0 0 99 0

The second time the slowdown was much longer.

> > 0 1 0 3127260 332 623040 0 0 0 1036 305 1242 0 1 95 4
> > 0 0 0 3126280 332 623128 0 0 7 4241 555 1575 1 5 84 10
> > 0 0 0 3124948 332 623232 0 0 6 4194 529 1505 1 4 95 0
> > 0 0 0 3125228 332 624168 0 0 58 1966 586 1964 2 4 94 0
> > -> emerge resumed to normal speed, without any intervention from my side
> > 0 0 0 3120932 332 625904 0 0 112 1546 546 2565 3 4 93 0
> > 0 0 0 3118012 332 627568 0 0 128 1542 612 2705 3 4 93 0
>
> Interesting, the 'bo' never falls to zero.

Yes, I was not able to recreate the complete stall from the first mail,
but even this slowdown does not look completely healthy.
I "hope" this is the same bug, as I seem to be able to trigger this
slowdown much more easily.

[snip logs]
>
> Interestingly, no background_writeout() appears, but only
> balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> block the process.

Yes, I noticed that too.
The only time I have seen background_writeout was during bootup and shutdown.

As for the stalled watch cat /proc/meminfo:
That happened on the third slowdown/stall, when emerge was already finished.

> > If I see the complete stall again, I will post that too.
>
> Thank you, could you run it with the attached new debug patch?

I will, but it will have to wait until the evening.

Torsten

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: writeout stalls in current -git
       [not found] ` <E1InmAI-0003ME-2i@localhost>
  2007-11-02  2:21   ` Fengguang Wu
  2007-11-02  2:21   ` Fengguang Wu
@ 2007-11-02 10:15   ` Peter Zijlstra
       [not found]     ` <E1IntqD-0001dK-OE@localhost>
  2007-11-02 19:22     ` Torsten Kaiser
  2 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2007-11-02 10:15 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Torsten Kaiser, Maxim Levitsky, linux-kernel, Andrew Morton,
	David Chinner, linux-fsdevel

On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:

> Interestingly, no background_writeout() appears, but only
> balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> block the process.

Yeah, the background threshold is not (yet) scaled. So it can happen
that the bdi_dirty limit is below the background limit.

I'm curious as to these stalls, though; I can't seem to think of
what goes wrong.. esp since most writeback seems to happen from pdflush.

(or I'm totally misreading it - quite possible as I'm still recovering
from a serious cold and not all the green stuff has yet figured out its
proper place wrt brain cells 'n stuff)

I still have this patch floating around:

---
Subject: mm: speed up writeback ramp-up on clean systems

We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page-writeback.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-09-28 10:08:33.937415368 +0200
+++ linux-2.6/mm/page-writeback.c	2007-09-28 10:54:26.018247516 +0200
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long bdi_nr_reclaimable;
-	long bdi_nr_writeback;
+	long nr_reclaimable, bdi_nr_reclaimable;
+	long nr_writeback, bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
 	long bdi_thresh;
@@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
 
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
+
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+					global_page_state(NR_UNSTABLE_NFS);
+		nr_writeback = global_page_state(NR_WRITEBACK);
+
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+
 		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 			break;
 
+		/*
+		 * Throttle it only when the background writeback cannot
+		 * catch-up. This avoids (excessively) small writeouts
+		 * when the bdi limits are ramping up.
+		 */
+		if (nr_reclaimable + nr_writeback <
+				(background_thresh + dirty_thresh) / 2)
+			break;
+
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 

^ permalink raw reply	[flat|nested] 39+ messages in thread
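The two early exits in the patched balance_dirty_pages() loop can be restated as a single predicate. This is only an illustrative userspace sketch of the condition described above, not kernel code: `struct dirty_state` and `should_throttle` are hypothetical names, and the fields simply mirror the local variables used in the patch.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters balance_dirty_pages() looks at. */
struct dirty_state {
	long bdi_nr_reclaimable;  /* bdi_stat(bdi, BDI_RECLAIMABLE) */
	long bdi_nr_writeback;    /* bdi_stat(bdi, BDI_WRITEBACK) */
	long bdi_thresh;          /* per-bdi share of the dirty limit */
	long nr_reclaimable;      /* NR_FILE_DIRTY + NR_UNSTABLE_NFS */
	long nr_writeback;        /* NR_WRITEBACK */
	long background_thresh;
	long dirty_thresh;
};

/*
 * Returns true when the dirtying task would go on to be throttled,
 * false when either "break" in the patched loop would be taken.
 */
static bool should_throttle(const struct dirty_state *s)
{
	/* Existing check: this bdi is still within its own limit. */
	if (s->bdi_nr_reclaimable + s->bdi_nr_writeback <= s->bdi_thresh)
		return false;

	/*
	 * New check from the patch: the system as a whole is below the
	 * midpoint of the background and dirty thresholds, so the bdi
	 * limit may be exceeded while it ramps up.
	 */
	if (s->nr_reclaimable + s->nr_writeback <
	    (s->background_thresh + s->dirty_thresh) / 2)
		return false;

	return true;
}
```

With made-up thresholds of 10000 and 20000 pages, the midpoint is 15000: a bdi that is over its own limit is left alone while the global count is 12100, and is throttled once the global count reaches 16100.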
[parent not found: <E1IntqD-0001dK-OE@localhost>]
* Re: writeout stalls in current -git
       [not found] ` <E1IntqD-0001dK-OE@localhost>
@ 2007-11-02 10:33   ` Fengguang Wu
  2007-11-05 23:57     ` Andrew Morton
  2007-11-02 10:33   ` Fengguang Wu
  1 sibling, 1 reply; 39+ messages in thread
From: Fengguang Wu @ 2007-11-02 10:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Torsten Kaiser, Maxim Levitsky, linux-kernel, Andrew Morton,
	David Chinner, linux-fsdevel

On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
>
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > block the process.
>
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.
>
> I'm curious though as to these stalls, though, I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

Me confused too. The new debug patch will confirm whether emerge is
waiting in balance_dirty_pages().

> (or I'm totally misreading it - quite a possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Do take care of yourself.

>
> I still have this patch floating around:

I think this patch is OK for 2.6.24 :-)

Reviewed-by: Fengguang Wu <wfg@mail.ustc.edu.cn>

>
> ---
> Subject: mm: speed up writeback ramp-up on clean systems
>
> We allow violation of bdi limits if there is a lot of room on the
> system. Once we hit half the total limit we start enforcing bdi limits
> and bdi ramp-up should happen. Doing it this way avoids many small
> writeouts on an otherwise idle system and should also speed up the
> ramp-up.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> ---
>  mm/page-writeback.c |   19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c	2007-09-28 10:08:33.937415368 +0200
> +++ linux-2.6/mm/page-writeback.c	2007-09-28 10:54:26.018247516 +0200
> @@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
>   */
>  static void balance_dirty_pages(struct address_space *mapping)
>  {
> -	long bdi_nr_reclaimable;
> -	long bdi_nr_writeback;
> +	long nr_reclaimable, bdi_nr_reclaimable;
> +	long nr_writeback, bdi_nr_writeback;
>  	long background_thresh;
>  	long dirty_thresh;
>  	long bdi_thresh;
> @@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
>
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
> +
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +					global_page_state(NR_UNSTABLE_NFS);
> +		nr_writeback = global_page_state(NR_WRITEBACK);
> +
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>  		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +
>  		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
>  			break;
>
> +		/*
> +		 * Throttle it only when the background writeback cannot
> +		 * catch-up. This avoids (excessively) small writeouts
> +		 * when the bdi limits are ramping up.
> +		 */
> +		if (nr_reclaimable + nr_writeback <
> +				(background_thresh + dirty_thresh) / 2)
> +			break;
> +
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: writeout stalls in current -git
  2007-11-02 10:33   ` Fengguang Wu
@ 2007-11-05 23:57     ` Andrew Morton
  2007-11-06 10:20       ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Morton @ 2007-11-05 23:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: peterz, just.for.lkml, maximlevitsky, linux-kernel, dgc, linux-fsdevel

On Fri, 2 Nov 2007 18:33:29 +0800
Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> > On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
> >
> > > Interestingly, no background_writeout() appears, but only
> > > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > > block the process.
> >
> > Yeah, the background threshold is not (yet) scaled. So it can happen
> > that the bdi_dirty limit is below the background limit.
> >
> > I'm curious though as to these stalls, though, I can't seem to think of
> > what goes wrong.. esp since most writeback seems to happen from pdflush.
>
> Me confused too. The new debug patch will confirm whether emerge is
> waiting in balance_dirty_pages().
>
> > (or I'm totally misreading it - quite a possible as I'm still recovering
> > from a serious cold and not all the green stuff has yet figured out its
> > proper place wrt brain cells 'n stuff)
>
> Do take care of yourself.
>
> >
> > I still have this patch floating around:
>
> I think this patch is OK for 2.6.24 :-)
>
> Reviewed-by: Fengguang Wu <wfg@mail.ustc.edu.cn>

I would prefer Tested-by: :(

> >
> > ---
> > Subject: mm: speed up writeback ramp-up on clean systems
> >
> > We allow violation of bdi limits if there is a lot of room on the
> > system. Once we hit half the total limit we start enforcing bdi limits
> > and bdi ramp-up should happen. Doing it this way avoids many small
> > writeouts on an otherwise idle system and should also speed up the
> > ramp-up.

Given the problems we're having in there I'm a bit reluctant to go tossing
hastily put together and inadequately tested stuff onto the fire. And
that's what this patch looks like to me.

Wanna convince me otherwise?

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: writeout stalls in current -git
  2007-11-05 23:57     ` Andrew Morton
@ 2007-11-06 10:20       ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2007-11-06 10:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Fengguang Wu, just.for.lkml, maximlevitsky, linux-kernel, dgc,
	linux-fsdevel

On Mon, 2007-11-05 at 15:57 -0800, Andrew Morton wrote:

> > > Subject: mm: speed up writeback ramp-up on clean systems
> > >
> > > We allow violation of bdi limits if there is a lot of room on the
> > > system. Once we hit half the total limit we start enforcing bdi limits
> > > and bdi ramp-up should happen. Doing it this way avoids many small
> > > writeouts on an otherwise idle system and should also speed up the
> > > ramp-up.
>
> Given the problems we're having in there I'm a bit reluctant to go tossing
> hastily put together and inadequately tested stuff onto the fire. And
> that's what this patch looks like to me.

Not really hastily, I think it was written before the stuff hit
mainline. Inadequately tested, perhaps; it's been in my and probably Wu's
kernels for a while. Granted that's not a lot of testing in the face of
those who have problems atm.

> Wanna convince me otherwise?

I'm perfectly happy with this patch earning its credits in -mm for a
while and maybe going in around -rc4 or something like that (hoping that
by then we've fixed these nagging issues).

Another patch I did come up with yesterday - not driven by any problems
in that area - could perhaps join this one on that path:

---
Subject: mm: bdi: tweak task dirty penalty

Penalizing heavy dirtiers with 1/8-th the total dirty limit might be rather
excessive on large memory machines. Use sqrt to scale it sub-linearly.

Update the comment while we're there.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page-writeback.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6-2/mm/page-writeback.c
===================================================================
--- linux-2.6-2.orig/mm/page-writeback.c
+++ linux-2.6-2/mm/page-writeback.c
@@ -213,17 +213,21 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * Task specific dirty limit:
  *
- * task specific dirty limit:
+ *   dirty -= 8 * sqrt(dirty) * p_{t}
  *
- *   dirty -= (dirty/8) * p_{t}
+ * Penalize tasks that dirty a lot of pages by lowering their dirty limit. This
+ * avoids infrequent dirtiers from getting stuck in this other guys dirty
+ * pages.
+ *
+ * Use a sub-linear function to scale the penalty, we only need a little room.
  */
 void task_dirty_limit(struct task_struct *tsk, long *pdirty)
 {
 	long numerator, denominator;
 	long dirty = *pdirty;
-	u64 inv = dirty >> 3;
+	u64 inv = 8*int_sqrt(dirty);
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;

^ permalink raw reply	[flat|nested] 39+ messages in thread
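To get a feel for the difference between the two penalty formulas in the comment above, here is a small userspace sketch that evaluates both. It is not the kernel function: `isqrt_ul`, `limit_linear` and `limit_sqrt` are hypothetical names, `isqrt_ul` is a simple stand-in for the kernel's int_sqrt(), and the division by `denominator` is assumed from p_{t} being the numerator/denominator fraction returned by task_dirties_fraction() (the quoted hunk ends before that step). Whatever else the real function does after the subtraction is not shown in the hunk and is left out here.

```c
#include <stdint.h>

/* Stand-in for the kernel's int_sqrt(): floor(sqrt(x)), written to avoid overflow. */
static unsigned long isqrt_ul(unsigned long x)
{
	unsigned long r = 0;

	/* (r + 1) <= x / (r + 1) is equivalent to (r + 1)^2 <= x for integers. */
	while (r + 1 <= x / (r + 1))
		r++;
	return r;
}

/* Old rule: dirty -= (dirty / 8) * p_t, with p_t = numerator / denominator. */
static long limit_linear(long dirty, long numerator, long denominator)
{
	uint64_t inv = (uint64_t)(dirty >> 3);

	inv *= (uint64_t)numerator;
	inv /= (uint64_t)denominator;
	return dirty - (long)inv;
}

/* New rule: dirty -= 8 * sqrt(dirty) * p_t. */
static long limit_sqrt(long dirty, long numerator, long denominator)
{
	uint64_t inv = 8 * (uint64_t)isqrt_ul((unsigned long)dirty);

	inv *= (uint64_t)numerator;
	inv /= (uint64_t)denominator;
	return dirty - (long)inv;
}
```

For a limit of 100000 pages and a task responsible for all recent dirtying (p_t = 1), the old rule lowers the task's limit by 12500 pages to 87500, while the sqrt rule lowers it by only 2528 pages to 97472. The two rules coincide at 4096 pages, where both subtract 512; below that size the sqrt rule subtracts more than the old one, which follows directly from dirty/8 = 8*sqrt(dirty) at sqrt(dirty) = 64.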
* Re: writeout stalls in current -git
       [not found] ` <E1IntqD-0001dK-OE@localhost>
  2007-11-02 10:33   ` Fengguang Wu
@ 2007-11-02 10:33   ` Fengguang Wu
  1 sibling, 0 replies; 39+ messages in thread
From: Fengguang Wu @ 2007-11-02 10:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Torsten Kaiser, Maxim Levitsky, linux-kernel, Andrew Morton,
	David Chinner, linux-fsdevel

On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
>
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > block the process.
>
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.
>
> I'm curious though as to these stalls, though, I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

Me confused too. The new debug patch will confirm whether emerge is
waiting in balance_dirty_pages().

> (or I'm totally misreading it - quite a possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Do take care of yourself.

>
> I still have this patch floating around:

I think this patch is OK for 2.6.24 :-)

Reviewed-by: Fengguang Wu <wfg@mail.ustc.edu.cn>

>
> ---
> Subject: mm: speed up writeback ramp-up on clean systems
>
> We allow violation of bdi limits if there is a lot of room on the
> system. Once we hit half the total limit we start enforcing bdi limits
> and bdi ramp-up should happen. Doing it this way avoids many small
> writeouts on an otherwise idle system and should also speed up the
> ramp-up.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> ---
>  mm/page-writeback.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c	2007-09-28 10:08:33.937415368 +0200
> +++ linux-2.6/mm/page-writeback.c	2007-09-28 10:54:26.018247516 +0200
> @@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
>   */
>  static void balance_dirty_pages(struct address_space *mapping)
>  {
> -	long bdi_nr_reclaimable;
> -	long bdi_nr_writeback;
> +	long nr_reclaimable, bdi_nr_reclaimable;
> +	long nr_writeback, bdi_nr_writeback;
>  	long background_thresh;
>  	long dirty_thresh;
>  	long bdi_thresh;
> @@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
>
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
> +
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +					global_page_state(NR_UNSTABLE_NFS);
> +		nr_writeback = global_page_state(NR_WRITEBACK);
> +
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>  		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +
>  		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
>  			break;
>
> +		/*
> +		 * Throttle it only when the background writeback cannot
> +		 * catch-up. This avoids (excessively) small writeouts
> +		 * when the bdi limits are ramping up.
> +		 */
> +		if (nr_reclaimable + nr_writeback <
> +				(background_thresh + dirty_thresh) / 2)
> +			break;
> +
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread
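[editor's sketch] The throttling decision that the quoted patch adds to the loop in balance_dirty_pages() can be written as a small standalone predicate. This is only an illustration under stated assumptions: the struct dirty_state and the function should_throttle() are hypothetical names, all values are plain page counts supplied by the caller, and the real function goes on to start writeback and sleep rather than just returning a flag.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical snapshot of the counters the patched loop looks at.
 * All fields are page counts.
 */
struct dirty_state {
	long nr_reclaimable;	 /* global: dirty + unstable NFS pages */
	long nr_writeback;	 /* global: pages under writeback */
	long bdi_nr_reclaimable; /* same two counters for this device */
	long bdi_nr_writeback;
	long background_thresh;
	long dirty_thresh;
	long bdi_thresh;
};

/* Returns true when the dirtying task would be throttled. */
static bool should_throttle(const struct dirty_state *s)
{
	assert(s->bdi_thresh >= 0 && s->dirty_thresh >= 0);

	/* Device is within its own limit: never throttle. */
	if (s->bdi_nr_reclaimable + s->bdi_nr_writeback <= s->bdi_thresh)
		return false;

	/*
	 * New check from the patch: the device is over its (possibly
	 * still ramping-up) limit, but the system as a whole is below
	 * the midpoint of the background and dirty thresholds.
	 */
	if (s->nr_reclaimable + s->nr_writeback <
	    (s->background_thresh + s->dirty_thresh) / 2)
		return false;

	return true;
}
```

For example, with background_thresh = 20000 and dirty_thresh = 40000 the midpoint is 30000 pages. A device that exceeds a small bdi_thresh of 100 pages is left alone while the system holds 1000 dirty pages, and is throttled once the global count reaches 36000.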
* Re: writeout stalls in current -git
  2007-11-02 10:15 ` Peter Zijlstra
       [not found]   ` <E1IntqD-0001dK-OE@localhost>
@ 2007-11-02 19:22   ` Torsten Kaiser
  2007-11-02 20:43     ` David Chinner
       [not found]     ` <E1IpKZ4-0004je-Lb@localhost>
  1 sibling, 2 replies; 39+ messages in thread
From: Torsten Kaiser @ 2007-11-02 19:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton,
	David Chinner, linux-fsdevel

On 11/2/07, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
>
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > block the process.
>
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.

I still have not seen a trigger of the "throttle_vm_writeout".
This time, installing 2.6.24-rc1 again did not even trigger any debug
output apart from the one in wb_kupdate.
But 300 MB of new files might still not trigger this with 4 GB of RAM.

I'm currently testing 2.6.23-mm1 with this patch and the second
writeback-debug patch.

> I'm curious though as to these stalls, though, I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

I also don't know. But looking at the time the system takes to write
out 8 kB, I'm starting to suspect that something is writing this out,
but not marking it clean... (Or redirtying it immediately?)

> (or I'm totally misreading it - quite a possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Get well soon!

> I still have this patch floating around:
>
> ---
> Subject: mm: speed up writeback ramp-up on clean systems

Applied, but it did not fix the stalls.

Here is the complete log from vmstat 10 and the syslog from an install
of vanilla 2.6.24-rc1.
(Please note: I installed the source of vanilla 2.6.24-rc1, but I am
still running 2.6.23-mm1!)
All lines marked [note] are my comments about what the system was
doing. Both logs are from the same run, so the notes should be more or
less in sync.
I used SysRq+L to insert the SysRq help text into the syslog as a
marker...

The visible effects are similar to the unmerge run, but the stalls
during the move only started later.
The same effect was visible again after emerge finished and almost all
dirty data had been written: I could still hear the disks and see the
HDD light flickering (mostly on) for much, much longer than it should
take to write 8 kB.

vmstat 10:
[note] emerge start
1 0 0 3668496 332 187748 0 0 0 29 39 491 3 0 96 0
1 0 0 3623940 332 188880 0 0 83 17 1724 3893 15 2 81 1
0 0 0 3559488 332 252432 0 0 1021 48 11719 4536 9 4 74 13
2 0 0 3482220 332 311916 0 0 70 60 93 3818 11 3 86 0
1 0 0 3289352 332 486932 0 0 2 35 33 11997 25 3 72 0
1 0 0 3174036 332 596412 0 0 10 33 35 3937 21 4 75 0
2 0 0 3215756 332 555292 0 0 6 28 85 742 12 12 76 0
2 0 0 3202128 332 559792 0 0 32 9 34 1566 31 1 68 0
2 0 0 3192804 332 568072 0 0 60 46 172 4206 30 2 67 1
3 0 0 3202424 332 572620 0 0 0 20 111 2223 27 1 72 0
1 0 0 3196112 332 578900 0 0 0 1649 149 2763 25 2 73 0
1 0 0 3190004 332 584956 0 0 0 17 110 2270 25 1 74 0
1 0 0 3183952 332 590840 0 0 0 11 104 2553 25 1 74 0
1 0 0 3176952 332 597068 0 0 0 2153 124 2886 25 2 72 0
1 0 0 3171044 332 602592 0 0 0 22 109 2580 26 1 73 0
1 0 0 3174896 332 605496 0 0 173 1441 312 2249 9 6 84 1
1 0 0 3165204 332 611856 0 0 569 3221 606 4236 4 7 87 1
0 0 0 3160856 332 613516 0 0 116 2281 570 3077 3 5 92 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3154712 332 615200 0 0 108 2166 528 3038 3 4 93 0
0 0 0 3156008 332 615420 0 0 18 1941 537 1015 0 2 97 0
0 0 0 3156652 332 615504 0 0 8 2232 547 900 0 2 98 0
0 0 0 3156748 332 615672 0 0 12 1932 537 947 0 2 98 0
0 0 0 3154720 332 615900 0 0 14 2204 584 1256 1 2 97 0
0 0 0 3154256 332 616060 0 0 10 2676 610 1317 1 3 96 0
1 0 0 3152488 332 616284 0 0 9 1994 573 1024 1 2 97 0
0 0 0 3152404 332 616408 0 0 4 2218 540 904 0 2 97 0
0 0 0 3151244 332 617156 0 0 44 2198 598 1921 2 4 94 0
0 0 0 3147224 332 618672 0 0 110 1802 644 2575 3 4 93 0
0 0 0 3144608 332 619824 0 0 80 1590 543 1900 2 4 95 0
0 0 0 3140768 332 621448 0 0 111 1758 657 2735 3 4 93 0
0 0 0 3140816 332 621896 0 0 26 801 531 1667 1 2 98 0
[note] first stall, SysRq+W
1 0 0 3127620 332 621896 0 0 0 640 490 1381 2 1 97 0
0 0 0 3127780 332 621900 0 0 0 627 475 1531 2 1 98 0
0 0 0 3127560 332 621900 0 0 0 587 464 1428 0 0 99 0
1 0 0 3126272 332 622460 0 0 32 945 556 1922 1 2 97 0
[note] installing resumes
0 0 0 3120860 332 624048 0 0 94 1950 785 2582 4 5 91 0
0 0 0 3117392 332 625200 0 0 76 1258 742 2217 2 3 95 0
[note] second stall
0 0 0 3118192 332 625200 0 0 0 617 559 1617 0 1 99 0
0 0 0 3118836 332 625200 0 0 0 603 550 1576 5 1 94 0
0 0 0 3118728 332 625200 0 0 0 682 601 1454 0 0 99 0
0 0 0 3118860 332 625200 0 0 0 653 557 1382 0 1 99 0
[note] installing resumes
1 0 0 3111356 332 624576 0 0 91 1277 789 2086 11 4 84 1
0 0 0 3149768 332 627792 0 0 322 504 655 1444 1 2 96 1
0 0 0 3150064 332 627792 0 0 0 559 623 1340 0 0 99 0
[note] emerge is finished, ~200 MB dirty data
0 0 0 3150220 332 627792 0 0 0 622 553 1553 2 1 97 0
0 0 0 3150456 332 627792 0 0 0 518 595 1315 0 1 99 0
0 0 0 3149380 332 627792 0 0 0 3759 801 1277 0 3 97 0
0 0 0 3148664 332 627840 0 0 0 3925 873 1500 0 4 96 0
0 0 0 3149672 332 627868 0 0 0 2476 800 1355 0 3 97 0
0 0 0 3148012 332 627872 0 0 0 2865 806 1235 0 3 97 0
0 0 0 3150496 332 627936 0 0 0 3074 847 1288 0 3 97 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3149568 332 627968 0 0 0 2238 751 1070 0 2 97 0
0 0 0 3150260 332 627988 0 0 0 872 607 1073 0 1 99 0
0 0 0 3150228 332 627988 0 0 0 1711 715 1214 0 2 98 0
0 0 0 3149300 332 627988 0 0 0 2195 752 1042 0 2 98 0
1 0 0 3150036 332 628032 0 0 0 2192 759 1118 0 2 97 0
0 0 0 3150868 332 628032 0 0 0 1035 639 1138 0 1 99 0
0 0 0 3150876 332 628068 0 0 0 1437 740 1153 0 1 98 0
0 0 0 3151152 332 628068 0 0 0 446 545 1381 0 0 100 0
0 0 0 3151212 332 628068 0 0 0 461 551 1412 2 0 98 0
[note] normal writeout finishes, ~116 kB dirty data left
1 0 0 3151088 332 628068 0 0 0 472 552 1468 0 0 99 0
0 0 0 3151260 332 628068 0 0 0 462 533 1369 0 0 100 0
0 0 0 3151296 332 628068 0 0 0 464 559 1325 0 0 100 0
0 0 0 3150992 332 628068 0 0 0 485 533 1350 0 0 100 0
0 0 0 3151092 332 628068 0 0 0 492 543 1378 0 0 100 0
[note] hit SysRq+W and SysRq+M
0 0 0 3150828 332 628076 0 0 0 430 541 1449 9 1 90 0
0 0 0 3150932 332 628076 0 0 0 459 535 1401 0 0 100 0
0 0 0 3151068 332 628076 0 0 0 465 536 1471 0 0 99 0
0 0 0 3151164 332 628076 0 0 0 453 525 1349 0 0 100 0
0 0 0 3151208 332 628076 0 0 0 474 530 1354 0 0 100 0
1 0 0 3151036 332 628076 0 0 0 449 506 1348 0 0 100 0
0 0 0 3151148 332 628076 0 0 0 476 520 1314 0 0 100 0
0 0 0 3151080 332 628076 0 0 0 467 521 1373 0 0 100 0
0 0 0 3151096 332 628076 0 0 0 464 521 1324 0 0 100 0
0 0 0 3151220 332 628076 0 0 0 461 548 1360 0 0 100 0
0 0 0 3151144 332 628076 0 0 0 417 480 1329 0 0 100 0
0 0 0 3150892 332 628076 0 0 0 492 543 1363 0 0 99 0
0 0 0 3151048 332 628076 0 0 0 436 515 1298 0 0 100 0
0 0 0 3151076 332 628076 0 0 0 434 513 1402 0 0 100 0
0 0 0 3151296 332 628076 0 0 0 430 508 1367 0 0 100 0
0 0 0 3150940 332 628076 0 0 0 472 527 1331 0 0 100 0
0 0 0 3151016 332 628076 0 0 0 472 527 1315 0 0 100 0
0 0 0 3151024 332 628076 0 0 0 227 409 703 0 0 100 0
0 0 0 3151272 332 628080 0 0 0 11 315 262 2 0 98 0
[note] writeout really finishes, disks go idle.
from syslog: [note] emerge started, this unpacks the kernel into a tmpfs, patches it to rc1, packs it into a tar.bz2 and then moves the files from the tmpfs to my main xfs root fs [ 322.230000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks [ 323.120000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20090 global 25 0 0 wc __ tw 1024 sk 0 [ 328.230000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20091 global 26 0 0 wc __ tw 1024 sk 0 [ 333.290000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20131 global 29 0 0 wc _M tw 1023 sk 0 [ 333.360000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20130 global 29 0 0 wc _M tw 1023 sk 0 [ 333.390000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20129 global 29 0 0 wc __ tw 1023 sk 0 [ 338.300000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20131 global 28 0 0 wc __ tw 1024 sk 0 [ 343.360000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20196 global 1 28 0 wc __ tw 1000 sk 0 [ 348.330000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20188 global 4 0 0 wc __ tw 1024 sk 0 [ 353.380000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 27417 global 4 0 0 wc __ tw 1024 sk 0 [ 358.380000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 31801 global 4 0 0 wc __ tw 1024 sk 0 [ 363.380000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 40783 global 4 0 0 wc __ tw 1021 sk 0 [ 368.460000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44080 global 1 0 0 wc __ tw 1023 sk 0 [ 373.460000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44085 global 1 0 0 wc __ tw 1024 sk 0 [ 378.460000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44631 global 1 0 0 wc __ tw 1024 sk 0 [ 383.510000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44709 global 1 0 0 wc __ tw 1024 sk 0 [note] around here the creation of the tar.bz2 started [ 388.520000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45134 global 426 
0 0 wc __ tw 1024 sk 0 [ 393.530000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45884 global 1148 0 0 wc __ tw 1024 sk 0 [ 398.530000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47002 global 2262 0 0 wc __ tw 1023 sk 0 [ 403.570000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47619 global 2888 0 0 wc __ tw 1024 sk 0 [ 408.570000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48276 global 3545 0 0 wc __ tw 1024 sk 0 [ 413.570000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48740 global 2997 1012 0 wc _M tw -1 sk 0 [ 413.570000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47715 global 2997 1012 0 wc _M tw 1024 sk 0 [ 413.580000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47715 global 1985 2024 0 wc _M tw -1 sk 0 [ 413.590000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46690 global 973 3036 0 wc _M tw -1 sk 0 [ 413.590000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45665 global 7 4002 0 wc __ tw 64 sk 0 [ 418.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45595 global 864 0 0 wc __ tw 1024 sk 0 [ 423.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46294 global 1563 0 0 wc __ tw 1024 sk 0 [ 428.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47036 global 2305 0 0 wc __ tw 1023 sk 0 [ 433.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47731 global 3000 0 0 wc __ tw 1024 sk 0 [ 438.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48525 global 3794 0 0 wc __ tw 1024 sk 0 [ 443.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49159 global 4428 0 0 wc __ tw 1024 sk 0 [note] around here the moving from the tmpfs to the xfs started [ 448.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50047 global 4304 1012 0 wc _M tw -1 sk 0 [ 448.640000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49022 global 3292 2024 0 wc _M tw -1 sk 0 [ 448.650000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47997 global 2234 3082 0 wc _M tw -1 sk 0 [ 448.650000] mm/page-writeback.c 
676 wb_kupdate: pdflush(285) 46972 global 1222 4094 0 wc _M tw -1 sk 0 [ 448.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45947 global 210 5106 0 wc _M tw -1 sk 0 [ 448.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44922 global 0 5336 0 wc __ tw 812 sk 0 [ 453.700000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45385 global 654 0 0 wc __ tw 1024 sk 0 [ 458.700000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45881 global 1150 0 0 wc _M tw 1023 sk 0 [ 458.790000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45880 global 1196 0 0 wc _M tw 1023 sk 0 [ 458.810000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45879 global 1196 0 0 wc __ tw 1023 sk 0 [ 463.840000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44729 global 0 0 0 wc __ tw 1024 sk 0 [ 468.860000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45653 global 869 0 0 wc __ tw 1024 sk 0 [ 473.880000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51262 global 6380 0 0 wc __ tw 1024 sk 0 [ 478.920000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56488 global 11523 0 0 wc __ tw 1024 sk 0 [ 485.260000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58839 global 13842 0 0 wc __ tw 1024 sk 0 [ 490.260000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 60796 global 15746 0 0 wc __ tw 1023 sk 0 [ 495.270000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64003 global 18907 0 0 wc __ tw 1023 sk 0 [ 502.330000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 67524 global 21467 336 0 wc _M tw -5 sk 0 [ 505.350000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 66495 global 20615 51 0 wc _M tw 0 sk 0 [ 508.140000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 65471 global 19727 213 0 wc _M tw -1 sk 0 [ 508.550000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64446 global 19483 336 0 wc _M tw 760 sk 0 [ 509.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64182 global 19470 94 0 wc __ tw 1012 sk 0 [ 514.190000] mm/page-writeback.c 676 wb_kupdate: 
pdflush(285) 65780 global 19665 172 0 wc __ tw -1 sk 0 [ 517.310000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64755 global 18827 14 0 wc __ tw -1 sk 0 [ 520.100000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 63730 global 17929 96 0 wc _M tw -13 sk 0 [ 522.560000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 62693 global 16937 167 0 wc _M tw -1 sk 0 [ 527.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 61668 global 16021 95 0 wc _M tw -6 sk 0 [ 530.460000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 60638 global 15115 52 0 wc _M tw -1 sk 0 [ 534.470000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 59613 global 14222 27 0 wc _M tw -4 sk 0 [ 537.760000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58585 global 13386 54 0 wc _M tw 0 sk 0 [ 541.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57561 global 12737 58 0 wc _M tw 281 sk 0 [ 541.090000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56818 global 12737 58 0 wc __ tw 1022 sk 0 [ 547.200000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858 global 12829 72 0 wc __ tw 0 sk 0 [ 550.480000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834 global 12017 62 0 wc __ tw 0 sk 0 [ 552.710000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810 global 11133 83 0 wc __ tw 0 sk 0 [ 558.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786 global 10470 33 0 wc _M tw 0 sk 0 [ 562.750000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762 global 10555 69 0 wc _M tw 0 sk 0 [ 565.150000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738 global 9562 498 0 wc _M tw -2 sk 0 [ 569.490000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712 global 8960 2 0 wc _M tw 0 sk 0 [ 572.910000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688 global 8088 205 0 wc _M tw -13 sk 0 [ 574.610000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651 global 7114 188 0 wc _M tw -1 sk 0 [ 584.270000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626 global 
14544 0 0 wc _M tw -1 sk 0 [ 593.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601 global 24583 736 0 wc _M tw -1 sk 0 [ 600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576 global 27004 6 0 wc _M tw 587 sk 0 [ 600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139 global 27004 6 0 wc __ tw 1014 sk 0 [note] first stall, the output from emerge stops, so it seems it can not start processing the next file until the stall ends [ 630.000000] SysRq : Emergency Sync [ 630.120000] Emergency Sync complete [ 632.850000] SysRq : Show Blocked State [ 632.850000] task PC stack pid father [ 632.850000] pdflush D ffff81000f091788 0 285 2 [ 632.850000] ffff810005d4da80 0000000000000046 0000000000000800 0000007000000001 [ 632.850000] ffff81000fd52400 ffffffff8022d61c ffffffff80819b00 ffffffff80819b00 [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810100316f98 0000000000000000 [ 632.850000] Call Trace: [ 632.850000] [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90 [ 632.850000] [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90 [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40 [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240 [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190 [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0 [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340 [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0 [ 632.850000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330 [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520 [ 632.850000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60 [ 632.850000] [<ffffffff8036ed49>] xfs_inode_flush+0x179/0x1b0 [ 632.850000] [<ffffffff8037ca8f>] xfs_fs_write_inode+0x2f/0x90 [ 632.850000] [<ffffffff802b3aac>] __writeback_single_inode+0x2ac/0x380 [ 
632.850000] [<ffffffff804d074e>] dm_table_any_congested+0x2e/0x80 [ 632.850000] [<ffffffff802b3f9d>] generic_sync_sb_inodes+0x20d/0x330 [ 632.850000] [<ffffffff802b4532>] writeback_inodes+0xa2/0xe0 [ 632.850000] [<ffffffff8026bfd6>] wb_kupdate+0xa6/0x140 [ 632.850000] [<ffffffff8026c4b0>] pdflush+0x0/0x1e0 [ 632.850000] [<ffffffff8026c5c0>] pdflush+0x110/0x1e0 [ 632.850000] [<ffffffff8026bf30>] wb_kupdate+0x0/0x140 [ 632.850000] [<ffffffff8024a32b>] kthread+0x4b/0x80 [ 632.850000] [<ffffffff8020c9d8>] child_rip+0xa/0x12 [ 632.850000] [<ffffffff8024a2e0>] kthread+0x0/0x80 [ 632.850000] [<ffffffff8020c9ce>] child_rip+0x0/0x12 [ 632.850000] [ 632.850000] emerge D 0000000000000000 0 6220 6129 [ 632.850000] ffff810103ced9f8 0000000000000086 0000000000000000 0000007000000001 [ 632.850000] ffff81000fd52cf8 ffffffff00000000 ffffffff80819b00 ffffffff80819b00 [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810103ced9b8 ffff810103ced9a8 [ 632.850000] Call Trace: [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40 [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240 [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190 [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0 [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340 [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0 [ 632.850000] [<ffffffff80375bee>] xfs_buf_rele+0x2e/0xd0 [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520 [ 632.850000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60 [ 632.850000] [<ffffffff80355c82>] xfs_inode_item_push+0x12/0x20 [ 632.850000] [<ffffffff80368247>] xfs_trans_push_ail+0x267/0x2b0 [ 632.850000] [<ffffffff8035c742>] xfs_log_reserve+0x72/0x120 [ 632.850000] [<ffffffff80366bf8>] xfs_trans_reserve+0xa8/0x210 [ 632.850000] [<ffffffff803731f2>] 
kmem_zone_zalloc+0x32/0x50 [ 632.850000] [<ffffffff8035263b>] xfs_itruncate_finish+0xfb/0x310 [ 632.850000] [<ffffffff8036daeb>] xfs_free_eofblocks+0x23b/0x280 [ 632.850000] [<ffffffff80371f93>] xfs_release+0x153/0x200 [ 632.850000] [<ffffffff80378010>] xfs_file_release+0x10/0x20 [ 632.850000] [<ffffffff80294251>] __fput+0xb1/0x220 [ 632.850000] [<ffffffff802910a4>] filp_close+0x54/0x90 [ 632.850000] [<ffffffff802929bf>] sys_close+0x9f/0x100 [ 632.850000] [<ffffffff8020bbbe>] system_call+0x7e/0x83 [ 632.850000] [ 662.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 73045 global 39157 0 0 wc __ tw 0 sk 0 [note] emerge resumed [ 664.030000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks [ 673.150000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 72021 global 44617 0 0 wc __ tw -3 sk 0 [note] emerge stalled again [ 693.930000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks [ 724.580000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 70994 global 48064 26 0 wc _M tw -5 sk 0 [note] emerge resumed again [ 724.710000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks [ 751.470000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 69965 global 47914 46 0 wc _M tw -1 sk 0 [note] emerge is finished, but 200Mb of dirty data remain [ 761.950000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks [ 775.520000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 68940 global 46911 414 0 wc _M tw 0 sk 0 [ 776.280000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 67916 global 45859 724 0 wc _M tw -2 sk 0 [ 777.370000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 66890 
global 44834 325 0 wc _M tw -10 sk 0 [ 778.450000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 65856 global 43828 242 0 wc _M tw -1 sk 0 [ 779.020000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64831 global 42807 484 0 wc _M tw -1 sk 0 [ 780.440000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 63806 global 41768 47 0 wc _M tw -7 sk 0 [ 781.560000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 62775 global 40730 445 0 wc _M tw 0 sk 0 [ 783.000000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 61751 global 39705 322 0 wc _M tw -3 sk 0 [ 785.140000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 60724 global 38732 310 0 wc _M tw -4 sk 0 [ 786.390000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 59696 global 37673 406 0 wc _M tw -6 sk 0 [ 787.310000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58666 global 36636 495 0 wc _M tw -9 sk 0 [ 787.720000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57633 global 35578 955 0 wc _M tw -1 sk 0 [ 789.100000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56608 global 34592 139 0 wc _M tw 0 sk 0 [ 790.400000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55584 global 33567 25 0 wc _M tw -3 sk 0 [ 791.780000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54557 global 32491 305 0 wc _M tw -11 sk 0 [ 793.790000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53522 global 31499 506 0 wc _M tw -5 sk 0 [ 796.680000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52493 global 30462 184 0 wc _M tw -3 sk 0 [ 798.930000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51466 global 29411 340 0 wc _M tw -11 sk 0 [ 800.330000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50431 global 28377 69 0 wc _M tw -4 sk 0 [ 803.900000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49403 global 27388 24 0 wc _M tw -2 sk 0 [ 805.600000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48377 global 26330 142 0 wc _M tw -6 sk 0 [ 807.740000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47347 global 
25295 138 0 wc _M tw -1 sk 0 [ 809.680000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46322 global 24296 268 0 wc _M tw -2 sk 0 [ 812.120000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45296 global 23269 81 0 wc _M tw -5 sk 0 [ 813.940000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44267 global 22249 303 0 wc _M tw -1 sk 0 [ 815.940000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 43242 global 21205 220 0 wc _M tw -9 sk 0 [ 817.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 42209 global 20174 87 0 wc _M tw -7 sk 0 [ 819.430000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 41178 global 19142 31 0 wc _M tw -5 sk 0 [ 820.360000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 40149 global 18113 316 0 wc _M tw -7 sk 0 [ 822.310000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 39118 global 17098 85 0 wc _M tw 0 sk 0 [ 824.680000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 38094 global 16064 168 0 wc _M tw 0 sk 0 [ 829.250000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 37070 global 15059 44 0 wc _M tw 0 sk 0 [ 832.300000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 36046 global 14001 89 0 wc _M tw -2 sk 0 [ 836.030000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 35020 global 13741 0 0 wc _M tw 760 sk 0 [ 836.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 34756 global 13649 92 0 wc _M tw 922 sk 0 [ 836.290000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 34654 global 13649 0 0 wc _M tw 1022 sk 0 [ 836.720000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 34652 global 13650 0 0 wc __ tw 1023 sk 0 [ 843.210000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 60278 global 12631 110 0 wc __ tw 0 sk 0 [ 845.380000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 59254 global 11590 72 0 wc __ tw -1 sk 0 [ 852.340000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58229 global 10566 56 0 wc __ tw -1 sk 0 [ 854.360000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57204 global 9551 103 0 wc __ 
tw 0 sk 0 [ 857.140000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56180 global 8529 33 0 wc __ tw 0 sk 0 [ 860.800000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55156 global 7480 509 0 wc _M tw -9 sk 0 [ 863.350000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54123 global 6443 343 0 wc _M tw -10 sk 0 [ 866.020000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53089 global 5420 215 0 wc _M tw 0 sk 0 [ 870.080000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52065 global 4393 104 0 wc _M tw 0 sk 0 [ 872.210000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51041 global 3385 334 0 wc _M tw -5 sk 0 [ 874.280000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50012 global 2343 234 0 wc _M tw 0 sk 0 [ 884.350000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48988 global 1330 52 0 wc _M tw -4 sk 0 [ 889.810000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47960 global 294 122 0 wc _M tw 0 sk 0 [note] the system is down to 116 kB of dirty data, but still writing back heavily [ 905.280000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks [note] after a while in this state I hit SysRq+W and SysRq+M to capture more state [ 967.770000] SysRq : Show Blocked State [ 967.770000] task PC stack pid father [ 967.770000] pdflush D ffff810080043640 0 285 2 [ 967.770000] ffff810005d4da80 0000000000000046 ffff810005d4da48 0000007000000001 [ 967.770000] 0000000000000400 0000000000000001 ffffffff80819b00 ffffffff80819b00 [ 967.770000] ffffffff80815f40 ffffffff80819b00 ffff810005d4da40 ffff810005d4da30 [ 967.770000] Call Trace: [ 967.770000] [<ffffffff805b16e7>] __down+0xa7/0x11e [ 967.770000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 [ 967.770000] [<ffffffff805b1365>] __down_failed+0x35/0x3a [ 967.770000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40 [ 967.770000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240 [ 967.770000] [<ffffffff8037757f>]
xfs_buf_get_flags+0x6f/0x190 [ 967.770000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0 [ 967.770000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340 [ 967.770000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0 [ 967.770000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330 [ 967.770000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520 [ 967.770000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60 [ 967.770000] [<ffffffff8036ed49>] xfs_inode_flush+0x179/0x1b0 [ 967.770000] [<ffffffff8037ca8f>] xfs_fs_write_inode+0x2f/0x90 [ 967.770000] [<ffffffff802b3aac>] __writeback_single_inode+0x2ac/0x380 [ 967.770000] [<ffffffff804d074e>] dm_table_any_congested+0x2e/0x80 [ 967.770000] [<ffffffff802b3f9d>] generic_sync_sb_inodes+0x20d/0x330 [ 967.770000] [<ffffffff802b4532>] writeback_inodes+0xa2/0xe0 [ 967.770000] [<ffffffff8026bfd6>] wb_kupdate+0xa6/0x140 [ 967.770000] [<ffffffff8026c4b0>] pdflush+0x0/0x1e0 [ 967.770000] [<ffffffff8026c5c0>] pdflush+0x110/0x1e0 [ 967.770000] [<ffffffff8026bf30>] wb_kupdate+0x0/0x140 [ 967.770000] [<ffffffff8024a32b>] kthread+0x4b/0x80 [ 967.770000] [<ffffffff8020c9d8>] child_rip+0xa/0x12 [ 967.770000] [<ffffffff8024a2e0>] kthread+0x0/0x80 [ 967.770000] [<ffffffff8020c9ce>] child_rip+0x0/0x12 [ 967.770000] [ 968.640000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46936 global 30 0 0 wc _M tw 757 sk 0 [ 968.670000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46669 global 2 28 0 wc __ tw 996 sk 0 [ 970.520000] SysRq : Show Memory [ 970.530000] Mem-info: [ 970.530000] Node 0 DMA per-cpu: [ 970.530000] CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 [ 970.540000] CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 [ 970.540000] CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 [ 970.540000] CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 [ 970.540000] Node 0 DMA32 per-cpu: [ 970.540000] CPU 0: Hot: hi: 186, btch: 31 usd: 66 Cold: hi: 62, btch: 15 usd: 15 [ 970.540000] CPU 1: Hot: hi: 186, 
btch: 31 usd: 159 Cold: hi: 62, btch: 15 usd: 17 [ 970.540000] CPU 2: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 [ 970.540000] CPU 3: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 [ 970.540000] Node 1 DMA32 per-cpu: [ 970.540000] CPU 0: Hot: hi: 186, btch: 31 usd: 28 Cold: hi: 62, btch: 15 usd: 0 [ 970.540000] CPU 1: Hot: hi: 186, btch: 31 usd: 47 Cold: hi: 62, btch: 15 usd: 0 [ 970.540000] CPU 2: Hot: hi: 186, btch: 31 usd: 155 Cold: hi: 62, btch: 15 usd: 12 [ 970.540000] CPU 3: Hot: hi: 186, btch: 31 usd: 183 Cold: hi: 62, btch: 15 usd: 3 [ 970.540000] Node 1 Normal per-cpu: [ 970.540000] CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 [ 970.540000] CPU 1: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0 [ 970.540000] CPU 2: Hot: hi: 186, btch: 31 usd: 118 Cold: hi: 62, btch: 15 usd: 19 [ 970.540000] CPU 3: Hot: hi: 186, btch: 31 usd: 163 Cold: hi: 62, btch: 15 usd: 13 [note] I do think, that /proc/meminfo also showed only 8kb of dirty remaining at this point, but I'm not 200% sure... [ 970.540000] Active:70883 inactive:117017 dirty:2 writeback:0 unstable:0 [ 970.540000] free:787733 slab:25286 mapped:12000 pagetables:2237 bounce:0 [ 970.540000] Node 0 DMA free:9448kB min:16kB low:20kB high:24kB active:0kB inactive:0kB present:8868kB pages_scanned:0 all_unreclaimable? no [ 970.540000] lowmem_reserve[]: 0 2004 2004 2004 [ 970.540000] Node 0 DMA32 free:1465640kB min:4040kB low:5048kB high:6060kB active:132340kB inactive:310048kB present:2052320kB pages_scanned:0 all_unreclaimable? no [ 970.540000] lowmem_reserve[]: 0 0 0 0 [ 970.540000] Node 1 DMA32 free:1476216kB min:3040kB low:3800kB high:4560kB active:3528kB inactive:41952kB present:1544000kB pages_scanned:0 all_unreclaimable? no [ 970.540000] lowmem_reserve[]: 0 0 505 505 [ 970.540000] Node 1 Normal free:199628kB min:1016kB low:1268kB high:1524kB active:147664kB inactive:116068kB present:517120kB pages_scanned:0 all_unreclaimable? 
no [ 970.540000] lowmem_reserve[]: 0 0 0 0 [ 970.540000] Node 0 DMA: 6*4kB 6*8kB 4*16kB 5*32kB 3*64kB 2*128kB 4*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 9448kB [ 970.540000] Node 0 DMA32: 158*4kB 66*8kB 30*16kB 22*32kB 10*64kB 7*128kB 6*256kB 4*512kB 6*1024kB 5*2048kB 352*4096kB = 1465640kB [ 970.540000] Node 1 DMA32: 866*4kB 446*8kB 228*16kB 122*32kB 50*64kB 32*128kB 23*256kB 17*512kB 16*1024kB 11*2048kB 342*4096kB = 1476216kB [ 970.540000] Node 1 Normal: 511*4kB 618*8kB 471*16kB 325*32kB 185*64kB 92*128kB 72*256kB 55*512kB 38*1024kB 26*2048kB 3*4096kB = 199580kB [ 970.540000] Swap cache: add 0, delete 0, find 0/0, race 0+0 [ 970.540000] Free swap = 9775416kB [ 970.540000] Total swap = 9775416kB [ 970.540000] Free swap: 9775416kB [ 970.570000] 1048576 pages of RAM [ 970.570000] 35174 reserved pages [ 970.570000] 149150 pages shared [ 970.570000] 0 pages swap cached [ 1137.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46642 global 1 0 0 wc _M tw 1022 sk 0 [ 1137.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640 global 1 0 0 wc __ tw 1022 sk 0 [ 1138.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640 global 1 0 0 wc __ tw 1024 sk 0 [ 1143.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640 global 1 0 0 wc __ tw 1024 sk 0 [ 1148.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640 global 1 0 0 wc __ tw 1024 sk 0 [note] finally the disks go idle [ 1149.020000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks [ 1153.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46641 global 2 0 0 wc __ tw 1024 sk 0 [ 1158.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46641 global 2 0 0 wc __ tw 1024 sk 0 [ 1163.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46641 global 2 0 0 wc __ tw 1024 sk 0 [ 1168.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46641 global 2 0 0 wc _M tw 1023 sk 0 [ 
1168.160000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640 global 2 0 0 wc _M tw 1023 sk 0 [ 1168.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46639 global 2 0 0 wc __ tw 1023 sk 0 [ 1173.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640 global 1 0 0 wc __ tw 1024 sk 0 [ 1178.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640 global 1 0 0 wc __ tw 1024 sk 0 [ 1183.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640 global 1 0 0 wc __ tw 1024 sk 0 Torsten
* Re: writeout stalls in current -git 2007-11-02 19:22 ` Torsten Kaiser @ 2007-11-02 20:43 ` David Chinner 2007-11-02 21:02 ` Torsten Kaiser 2007-11-04 11:19 ` Torsten Kaiser [not found] ` <E1IpKZ4-0004je-Lb@localhost> 1 sibling, 2 replies; 39+ messages in thread From: David Chinner @ 2007-11-02 20:43 UTC (permalink / raw) To: Torsten Kaiser Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel, xfs On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote: > [ 630.000000] SysRq : Emergency Sync > [ 630.120000] Emergency Sync complete > [ 632.850000] SysRq : Show Blocked State > [ 632.850000] task PC stack pid father > [ 632.850000] pdflush D ffff81000f091788 0 285 2 > [ 632.850000] ffff810005d4da80 0000000000000046 0000000000000800 > 0000007000000001 > [ 632.850000] ffff81000fd52400 ffffffff8022d61c ffffffff80819b00 > ffffffff80819b00 > [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810100316f98 > 0000000000000000 > [ 632.850000] Call Trace: > [ 632.850000] [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90 > [ 632.850000] [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90 > [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e > [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 > [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a > [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40 > [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240 > [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190 > [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0 > [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340 > [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0 > [ 632.850000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330 > [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520 That's stalled waiting on the inode cluster buffer lock. 
That implies that the inode cluster is already being written out and the inode has been redirtied during writeout. Does the kernel you are testing have the "flush inodes in ascending inode number order" patches applied? If so, can you remove that patch and see if the problem goes away? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
* Re: writeout stalls in current -git 2007-11-02 20:43 ` David Chinner @ 2007-11-02 21:02 ` Torsten Kaiser 2007-11-04 11:19 ` Torsten Kaiser 1 sibling, 0 replies; 39+ messages in thread From: Torsten Kaiser @ 2007-11-02 21:02 UTC (permalink / raw) To: David Chinner Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs On 11/2/07, David Chinner <dgc@sgi.com> wrote: > On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote: > > [ 630.000000] SysRq : Emergency Sync > > [ 630.120000] Emergency Sync complete > > [ 632.850000] SysRq : Show Blocked State > > [ 632.850000] task PC stack pid father > > [ 632.850000] pdflush D ffff81000f091788 0 285 2 > > [ 632.850000] ffff810005d4da80 0000000000000046 0000000000000800 > > 0000007000000001 > > [ 632.850000] ffff81000fd52400 ffffffff8022d61c ffffffff80819b00 > > ffffffff80819b00 > > [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810100316f98 > > 0000000000000000 > > [ 632.850000] Call Trace: > > [ 632.850000] [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90 > > [ 632.850000] [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90 > > [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e > > [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 > > [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a > > [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40 > > [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240 > > [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190 > > [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0 > > [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340 > > [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0 > > [ 632.850000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330 > > [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520 > > That's stalled waiting on the inode cluster buffer lock. 
That implies > that the inode cluster is already being written out and the inode has > been redirtied during writeout. > > Does the kernel you are testing have the "flush inodes in ascending > inode number order" patches applied? If so, can you remove that > patch and see if the problem goes away?

It's 2.6.23-mm1 with only some small fixes. In its broken-out directory I see:
git-xfs.patch
and
writeback-fix-periodic-superblock-dirty-inode-flushing.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-inode-lists-8.patch
writeback-introduce-writeback_controlmore_io-to-indicate-more-io.patch

I don't know if the patch you mentioned is part of that version of the mm-patchset.

Torsten
* Re: writeout stalls in current -git 2007-11-02 20:43 ` David Chinner 2007-11-02 21:02 ` Torsten Kaiser @ 2007-11-04 11:19 ` Torsten Kaiser 2007-11-05 1:45 ` David Chinner 1 sibling, 1 reply; 39+ messages in thread From: Torsten Kaiser @ 2007-11-04 11:19 UTC (permalink / raw) To: David Chinner Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs [-- Attachment #1: Type: text/plain, Size: 7465 bytes --] On 11/2/07, David Chinner <dgc@sgi.com> wrote: > That's stalled waiting on the inode cluster buffer lock. That implies > that the inode cluster is already being written out and the inode has > been redirtied during writeout. > > Does the kernel you are testing have the "flush inodes in ascending > inode number order" patches applied? If so, can you remove that > patch and see if the problem goes away? I can now confirm that I also see this with the current mainline git version. I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c plus the fix for the sg changes in ieee1394. Bisecting would be troublesome, as the sg changes prevent mainline from booting with my normal config or kill my network.
treogen ~ # vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
-> starting emerge
1 0 0 3627072 332 157724 0 0 97 13 41 189 2 2 94 2
0 0 0 3607240 332 163736 0 0 599 10 332 951 2 1 93 4
0 0 0 3601920 332 167592 0 0 380 2 218 870 1 1 98 0
0 0 0 3596356 332 171648 0 0 404 21 182 818 0 0 99 0
0 0 0 3579328 332 180436 0 0 878 12 147 912 1 1 97 2
0 0 0 3575376 332 182776 0 0 236 4 244 953 1 1 95 3
2 1 0 3571792 332 185084 0 0 232 7 256 1003 2 1 95 2
0 0 0 3564844 332 187364 0 0 228 605 246 1167 2 1 93 4
0 0 0 3562128 332 189784 0 0 230 4 527 1238 2 1 93 4
0 1 0 3558764 332 191964 0 0 216 24 438 1059 1 1 93 6
0 0 0 3555120 332 193868 0 0 199 36 406 959 0 0 92 8
0 0 0 3552008 332 195928 0 0 197 11 458 1023 1 1 90 8
0 0 0 3548728 332 197660 0 0 183 7 496 1086 1 1 90 8
0 0 0 3545560 332 199372 0 0 170 8 483 1017 1 1 90 9
0 1 0 3542124 332 201256 0 0 190 1 544 1137 1 1 88 10
1 0 0 3536924 332 203296 0 0 195 7 637 1209 2 1 89 8
1 1 0 3485096 332 249184 0 0 101 16 10372 4537 13 3 76 8
2 0 0 3442004 332 279728 0 0 1086 40 219 1349 7 3 87 4
-> emerge is done reading its package database
1 0 0 3254796 332 448636 0 0 0 27 128 8360 24 6 70 0
2 0 0 3143304 332 554016 0 0 47 33 213 4480 16 11 72 1
-> kernel unpacked
1 0 0 3125700 332 560416 0 0 1 20 122 1675 24 1 75 0
1 0 0 3117356 332 567968 0 0 0 674 157 2975 24 2 73 1
2 0 0 3111636 332 573736 0 0 0 1143 151 1924 23 1 75 1
2 0 0 3102836 332 581332 0 0 0 890 153 1330 24 1 75 0
1 0 0 3097236 332 587360 0 0 0 656 194 1593 24 1 74 0
1 0 0 3086824 332 595480 0 0 0 812 235 2657 25 1 74 0
-> tar.bz2 created, installing starts now
0 0 0 3091612 332 601024 0 0 82 708 499 2397 17 4 78 1
0 0 0 3086088 332 602180 0 0 69 2459 769 2237 3 4 88 6
0 0 0 3085916 332 602236 0 0 2 1752 693 949 1 2 96 1
0 0 0 3084544 332 603564 0 0 66 4057 1176 2850 3 6 91 0
0 0 0 3078780 332 605572 0 0 98 3194 1169 3288 5 6 89 0
0 0 0 3077940 332 605924 0 0 17 1139 823 1547 1 2 97 0
0 0 0 3078268 332 605924 0 0 0 888 807 1329 0 1 99 0
-> first short stall
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3077040 332 605924 0 0 0 1950 785 1495 0 2 89 8
0 0 0 3076588 332 605896 0 0 2 3807 925 2046 1 4 95 0
0 0 0 3076900 332 606052 0 0 11 2564 768 1471 1 3 95 1
0 0 0 3071584 332 607928 0 0 87 2499 1108 3433 4 6 90 0
-> second longer stall (emerge was not able to complete a single filemove until the 'resume' line)
0 0 0 3071592 332 607928 0 0 0 693 692 1289 0 0 99 0
0 0 0 3072584 332 607928 0 0 0 792 731 1507 0 1 99 0
0 0 0 3072840 332 607928 0 0 0 806 707 1521 0 1 99 0
0 0 0 3072724 332 607928 0 0 0 782 695 1372 0 0 99 0
0 0 0 3072972 332 607928 0 0 0 677 612 1301 0 0 99 0
0 0 0 3072772 332 607928 0 0 0 738 681 1352 1 1 99 0
0 0 0 3073020 332 607928 0 0 0 785 708 1328 0 1 99 0
0 0 0 3072896 332 607928 0 0 0 833 722 1383 0 0 99 0
-> emerge resumed
0 0 0 3069476 332 607972 0 0 2 4885 812 2062 1 4 90 5
1 0 0 3069648 332 608068 0 0 4 4658 833 2158 1 4 93 2
0 0 0 3064972 332 610364 0 0 106 2494 1095 3620 5 7 88 0
0 0 0 3057536 332 612444 0 0 86 2023 1012 3440 4 6 90 0
1 0 0 3054572 332 612368 0 0 102 1526 1024 2277 6 5 87 2
-> emerge finished, but still >100Mb of dirty data according to /proc/meminfo
0 0 0 3048548 332 615764 0 0 337 659 796 1000 3 1 96 0
0 0 0 3092100 332 615860 0 0 15 616 606 1040 1 0 99 0
0 0 0 3092148 332 615860 0 0 0 641 622 1085 0 0 99 0
0 0 0 3092528 332 615860 0 0 0 766 654 1055 1 1 99 0
-> slow writeout until here, might be fixed with Peter's patch to scale the background threshold
2 0 0 3090828 332 615860 0 0 0 1804 707 1215 0 2 98 0
0 0 0 3091056 332 615864 0 0 0 3877 831 2047 1 4 94 1
3 0 0 3090780 332 615864 0 0 0 2048 784 1154 1 2 97 1
0 0 0 3091096 332 615864 0 0 0 2690 751 1538 0 3 96 1
0 1 0 3091056 332 615864 0 0 0 2018 748 866 0 2 95 2
2 0 0 3092960 332 615864 0 0 0 2076 719 1118 0 2 97 0
-> writeout "done", /proc/meminfo showed 0kb of dirty data remaining
0 0 0 3093072 332 615864 0 0 0 645 646 1104 0 0 99 0
0 0 0 3093532 332 615864 0 0 0 726 658 1223 0 1 99 0
0 0 0 3093540 332 615864 0 0 0 801 699 1314 0 1 99 0
0 0 0 3093580 332 615864 0 0 0 783 738 1350 0 1 99 0
0 0 0 3093284 332 615920 0 0 6 746 655 1381 1 1 98 0
0 0 0 3092872 332 615920 0 0 0 862 703 1391 1 1 98 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3093224 332 615920 0 0 0 799 676 1394 0 0 99 0
0 0 0 3093304 332 615920 0 0 0 835 672 1514 1 1 98 0
0 0 0 3093476 332 615920 0 0 0 784 641 1404 1 1 98 0
0 0 0 3093264 332 615920 0 0 0 722 626 1483 1 1 99 0
0 0 0 3093476 332 615920 0 0 0 7 328 350 0 0 99 0
0 0 0 3093628 332 615920 0 0 0 11 332 407 0 0 99 0
-> disks finally go idle

Torsten

.config for 2.6.24-rc1+git attached
[-- Attachment #2: config.gz --] [-- Type: application/x-gzip, Size: 11732 bytes --]
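The annotations in the vmstat log above depend on reading the dirty-page counters from /proc/meminfo by hand between samples. A minimal sketch of how that lookup could be automated, assuming the usual "Key: value kB" layout of /proc/meminfo. The function names meminfo_kb and proc_meminfo_kb are hypothetical and are not part of any tool used in this thread:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the value in kB stored under `key`
 * (for example "Dirty" or "Writeback") in /proc/meminfo-style text,
 * or -1 if the key is not present or has no number after it.
 */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen;
    const char *line = text;

    assert(text != NULL && key != NULL);
    klen = strlen(key);

    while (line && *line) {
        /* The key must start the line and be followed directly by ':' */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        /* Advance to the character after the next newline, if any */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/*
 * Hypothetical wrapper: read the live /proc/meminfo and look up `key`.
 * Returns -1 if the file cannot be opened.
 */
long proc_meminfo_kb(const char *key)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_kb(buf, key);
}
```

Calling proc_meminfo_kb("Dirty") and proc_meminfo_kb("Writeback") once per vmstat interval would give the numbers that the "->" annotations quote from memory.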
* Re: writeout stalls in current -git 2007-11-04 11:19 ` Torsten Kaiser @ 2007-11-05 1:45 ` David Chinner 2007-11-05 7:01 ` Torsten Kaiser 2007-11-05 18:27 ` Torsten Kaiser 0 siblings, 2 replies; 39+ messages in thread From: David Chinner @ 2007-11-05 1:45 UTC (permalink / raw) To: Torsten Kaiser Cc: David Chinner, Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs [-- Attachment #1: Type: text/plain, Size: 1083 bytes --] On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote: > On 11/2/07, David Chinner <dgc@sgi.com> wrote: > > That's stalled waiting on the inode cluster buffer lock. That implies > > that the inode cluster is already being written out and the inode has > > been redirtied during writeout. > > > > Does the kernel you are testing have the "flush inodes in ascending > > inode number order" patches applied? If so, can you remove that > > patch and see if the problem goes away? > > I can now confirm, that I see this also with the current mainline-git-version > I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c > plus the fix for the sg changes in ieee1394. Ok, so it's probably a side effect of the writeback changes. Attached are two patches (two because one was in a separate patchset as a standalone change) that should prevent async writeback from blocking on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first. Can you see if this fixes the problem? Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group [-- Attachment #2: xfs-factor-inotobp --] [-- Type: text/plain, Size: 9595 bytes --] --- fs/xfs/xfs_inode.c | 283 ++++++++++++++++++++++++----------------------------- 1 file changed, 129 insertions(+), 154 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2007-09-12 15:41:22.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2007-09-13 08:57:06.395641940 +1000 @@ -124,6 +124,126 @@ xfs_inobp_check( #endif /* + * Simple wrapper for calling xfs_imap() that includes error + * and bounds checking + */ +STATIC int +xfs_ino_to_imap( + xfs_mount_t *mp, + xfs_trans_t *tp, + xfs_ino_t ino, + xfs_imap_t *imap, + uint imap_flags) +{ + int error; + + error = xfs_imap(mp, tp, ino, imap, imap_flags); + if (error) { + cmn_err(CE_WARN, "xfs_ino_to_imap: xfs_imap() returned an " + "error %d on %s. Returning error.", + error, mp->m_fsname); + return error; + } + + /* + * If the inode number maps to a block outside the bounds + * of the file system then return NULL rather than calling + * read_buf and panicing when we get an error from the + * driver. + */ + if ((imap->im_blkno + imap->im_len) > + XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) { + xfs_fs_cmn_err(CE_ALERT, mp, "xfs_ino_to_imap: " + "(imap->im_blkno (0x%llx) + imap->im_len (0x%llx)) > " + " XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) (0x%llx)", + (unsigned long long) imap->im_blkno, + (unsigned long long) imap->im_len, + XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)); + return XFS_ERROR(EINVAL); + } + return 0; +} + +/* + * Find the buffer associated with the given inode map + * We do basic validation checks on the buffer once it has been + * retrieved from disk. 
+ */ +STATIC int +xfs_imap_to_bp( + xfs_mount_t *mp, + xfs_trans_t *tp, + xfs_imap_t *imap, + xfs_buf_t **bpp, + uint buf_flags, + uint imap_flags) +{ + int error; + int i; + int ni; + xfs_buf_t *bp; + + error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno, + (int)imap->im_len, XFS_BUF_LOCK, &bp); + if (error) { + cmn_err(CE_WARN, "xfs_imap_to_bp: xfs_trans_read_buf()returned " + "an error %d on %s. Returning error.", + error, mp->m_fsname); + return error; + } + + /* + * Validate the magic number and version of every inode in the buffer + * (if DEBUG kernel) or the first inode in the buffer, otherwise. + */ +#ifdef DEBUG + ni = BBTOB(imap->im_len) >> mp->m_sb.sb_inodelog; +#else /* usual case */ + ni = 1; +#endif + + for (i = 0; i < ni; i++) { + int di_ok; + xfs_dinode_t *dip; + + dip = (xfs_dinode_t *)xfs_buf_offset(bp, + (i << mp->m_sb.sb_inodelog)); + di_ok = be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC && + XFS_DINODE_GOOD_VERSION(dip->di_core.di_version); + if (unlikely(XFS_TEST_ERROR(!di_ok, mp, + XFS_ERRTAG_ITOBP_INOTOBP, + XFS_RANDOM_ITOBP_INOTOBP))) { + if (imap_flags & XFS_IMAP_BULKSTAT) { + xfs_trans_brelse(tp, bp); + return XFS_ERROR(EINVAL); + } + XFS_CORRUPTION_ERROR("xfs_imap_to_bp", + XFS_ERRLEVEL_HIGH, mp, dip); +#ifdef DEBUG + cmn_err(CE_PANIC, + "Device %s - bad inode magic/vsn " + "daddr %lld #%d (magic=%x)", + XFS_BUFTARG_NAME(mp->m_ddev_targp), + (unsigned long long)imap->im_blkno, i, + be16_to_cpu(dip->di_core.di_magic)); +#endif + xfs_trans_brelse(tp, bp); + return XFS_ERROR(EFSCORRUPTED); + } + } + + xfs_inobp_check(mp, bp); + + /* + * Mark the buffer as an inode buffer now that it looks good + */ + XFS_BUF_SET_VTYPE(bp, B_FS_INO); + + *bpp = bp; + return 0; +} + +/* * This routine is called to map an inode number within a file * system to the buffer containing the on-disk version of the * inode. 
It returns a pointer to the buffer containing the @@ -145,72 +265,19 @@ xfs_inotobp( xfs_buf_t **bpp, int *offset) { - int di_ok; xfs_imap_t imap; xfs_buf_t *bp; int error; - xfs_dinode_t *dip; - /* - * Call the space management code to find the location of the - * inode on disk. - */ imap.im_blkno = 0; - error = xfs_imap(mp, tp, ino, &imap, XFS_IMAP_LOOKUP); - if (error != 0) { - cmn_err(CE_WARN, - "xfs_inotobp: xfs_imap() returned an " - "error %d on %s. Returning error.", error, mp->m_fsname); + error = xfs_ino_to_imap(mp, tp, ino, &imap, XFS_IMAP_LOOKUP); + if (error) return error; - } - - /* - * If the inode number maps to a block outside the bounds of the - * file system then return NULL rather than calling read_buf - * and panicing when we get an error from the driver. - */ - if ((imap.im_blkno + imap.im_len) > - XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) { - cmn_err(CE_WARN, - "xfs_inotobp: inode number (%llu + %d) maps to a block outside the bounds " - "of the file system %s. Returning EINVAL.", - (unsigned long long)imap.im_blkno, - imap.im_len, mp->m_fsname); - return XFS_ERROR(EINVAL); - } - - /* - * Read in the buffer. If tp is NULL, xfs_trans_read_buf() will - * default to just a read_buf() call. - */ - error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap.im_blkno, - (int)imap.im_len, XFS_BUF_LOCK, &bp); - if (error) { - cmn_err(CE_WARN, - "xfs_inotobp: xfs_trans_read_buf() returned an " - "error %d on %s. 
Returning error.", error, mp->m_fsname); + error = xfs_imap_to_bp(mp, tp, &imap, &bp, XFS_BUF_LOCK, 0); + if (error) return error; - } - dip = (xfs_dinode_t *)xfs_buf_offset(bp, 0); - di_ok = - be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC && - XFS_DINODE_GOOD_VERSION(dip->di_core.di_version); - if (unlikely(XFS_TEST_ERROR(!di_ok, mp, XFS_ERRTAG_ITOBP_INOTOBP, - XFS_RANDOM_ITOBP_INOTOBP))) { - XFS_CORRUPTION_ERROR("xfs_inotobp", XFS_ERRLEVEL_LOW, mp, dip); - xfs_trans_brelse(tp, bp); - cmn_err(CE_WARN, - "xfs_inotobp: XFS_TEST_ERROR() returned an " - "error on %s. Returning EFSCORRUPTED.", mp->m_fsname); - return XFS_ERROR(EFSCORRUPTED); - } - - xfs_inobp_check(mp, bp); - /* - * Set *dipp to point to the on-disk inode in the buffer. - */ *dipp = (xfs_dinode_t *)xfs_buf_offset(bp, imap.im_boffset); *bpp = bp; *offset = imap.im_boffset; @@ -251,41 +318,15 @@ xfs_itobp( xfs_imap_t imap; xfs_buf_t *bp; int error; - int i; - int ni; if (ip->i_blkno == (xfs_daddr_t)0) { - /* - * Call the space management code to find the location of the - * inode on disk. - */ imap.im_blkno = bno; - if ((error = xfs_imap(mp, tp, ip->i_ino, &imap, - XFS_IMAP_LOOKUP | imap_flags))) + error = xfs_ino_to_imap(mp, tp, ip->i_ino, &imap, + XFS_IMAP_LOOKUP | imap_flags); + if (error) return error; /* - * If the inode number maps to a block outside the bounds - * of the file system then return NULL rather than calling - * read_buf and panicing when we get an error from the - * driver. 
- */ - if ((imap.im_blkno + imap.im_len) > - XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) { -#ifdef DEBUG - xfs_fs_cmn_err(CE_ALERT, mp, "xfs_itobp: " - "(imap.im_blkno (0x%llx) " - "+ imap.im_len (0x%llx)) > " - " XFS_FSB_TO_BB(mp, " - "mp->m_sb.sb_dblocks) (0x%llx)", - (unsigned long long) imap.im_blkno, - (unsigned long long) imap.im_len, - XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)); -#endif /* DEBUG */ - return XFS_ERROR(EINVAL); - } - - /* * Fill in the fields in the inode that will be used to * map the inode to its buffer from now on. */ @@ -303,76 +344,10 @@ xfs_itobp( } ASSERT(bno == 0 || bno == imap.im_blkno); - /* - * Read in the buffer. If tp is NULL, xfs_trans_read_buf() will - * default to just a read_buf() call. - */ - error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap.im_blkno, - (int)imap.im_len, XFS_BUF_LOCK, &bp); - if (error) { -#ifdef DEBUG - xfs_fs_cmn_err(CE_ALERT, mp, "xfs_itobp: " - "xfs_trans_read_buf() returned error %d, " - "imap.im_blkno 0x%llx, imap.im_len 0x%llx", - error, (unsigned long long) imap.im_blkno, - (unsigned long long) imap.im_len); -#endif /* DEBUG */ + error = xfs_imap_to_bp(mp, tp, &imap, &bp, XFS_BUF_LOCK, imap_flags); + if (error) return error; - } - - /* - * Validate the magic number and version of every inode in the buffer - * (if DEBUG kernel) or the first inode in the buffer, otherwise. - * No validation is done here in userspace (xfs_repair). 
- */ -#if !defined(__KERNEL__) - ni = 0; -#elif defined(DEBUG) - ni = BBTOB(imap.im_len) >> mp->m_sb.sb_inodelog; -#else /* usual case */ - ni = 1; -#endif - - for (i = 0; i < ni; i++) { - int di_ok; - xfs_dinode_t *dip; - - dip = (xfs_dinode_t *)xfs_buf_offset(bp, - (i << mp->m_sb.sb_inodelog)); - di_ok = be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC && - XFS_DINODE_GOOD_VERSION(dip->di_core.di_version); - if (unlikely(XFS_TEST_ERROR(!di_ok, mp, - XFS_ERRTAG_ITOBP_INOTOBP, - XFS_RANDOM_ITOBP_INOTOBP))) { - if (imap_flags & XFS_IMAP_BULKSTAT) { - xfs_trans_brelse(tp, bp); - return XFS_ERROR(EINVAL); - } -#ifdef DEBUG - cmn_err(CE_ALERT, - "Device %s - bad inode magic/vsn " - "daddr %lld #%d (magic=%x)", - XFS_BUFTARG_NAME(mp->m_ddev_targp), - (unsigned long long)imap.im_blkno, i, - be16_to_cpu(dip->di_core.di_magic)); -#endif - XFS_CORRUPTION_ERROR("xfs_itobp", XFS_ERRLEVEL_HIGH, - mp, dip); - xfs_trans_brelse(tp, bp); - return XFS_ERROR(EFSCORRUPTED); - } - } - - xfs_inobp_check(mp, bp); - /* - * Mark the buffer as an inode buffer now that it looks good - */ - XFS_BUF_SET_VTYPE(bp, B_FS_INO); - - /* - * Set *dipp to point to the on-disk inode in the buffer. - */ *dipp = (xfs_dinode_t *)xfs_buf_offset(bp, imap.im_boffset); *bpp = bp; return 0; [-- Attachment #3: xfs-iflush-blocking-fix --] [-- Type: text/plain, Size: 6403 bytes --] --- fs/xfs/linux-2.6/xfs_super.c | 3 +- fs/xfs/linux-2.6/xfs_vnode.h | 5 --- fs/xfs/xfs_inode.c | 33 ++++++++++++++++--------- fs/xfs/xfs_inode.h | 7 +++-- fs/xfs/xfs_vnodeops.c | 55 +++++++++---------------------------------- 5 files changed, 41 insertions(+), 62 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2007-11-05 10:17:36.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2007-11-05 10:33:49.590268027 +1100 @@ -306,14 +306,15 @@ xfs_inotobp( * 0 for the disk block address. 
*/ int -xfs_itobp( +xfs_itobp_flags( xfs_mount_t *mp, xfs_trans_t *tp, xfs_inode_t *ip, xfs_dinode_t **dipp, xfs_buf_t **bpp, xfs_daddr_t bno, - uint imap_flags) + uint imap_flags, + uint buf_flags) { xfs_imap_t imap; xfs_buf_t *bp; @@ -344,10 +345,17 @@ xfs_itobp( } ASSERT(bno == 0 || bno == imap.im_blkno); - error = xfs_imap_to_bp(mp, tp, &imap, &bp, XFS_BUF_LOCK, imap_flags); + error = xfs_imap_to_bp(mp, tp, &imap, &bp, buf_flags, imap_flags); if (error) return error; + if (!bp) { + ASSERT(buf_flags & XFS_BUF_TRYLOCK); + ASSERT(tp == NULL); + *bpp = NULL; + return EAGAIN; + } + *dipp = (xfs_dinode_t *)xfs_buf_offset(bp, imap.im_boffset); *bpp = bp; return 0; @@ -3068,15 +3076,6 @@ xfs_iflush( } /* - * Get the buffer containing the on-disk inode. - */ - error = xfs_itobp(mp, NULL, ip, &dip, &bp, 0, 0); - if (error) { - xfs_ifunlock(ip); - return error; - } - - /* * Decide how buffer will be flushed out. This is done before * the call to xfs_iflush_int because this field is zeroed by it. */ @@ -3125,6 +3124,16 @@ xfs_iflush( } /* + * Get the buffer containing the on-disk inode. + */ + error = xfs_itobp_flags(mp, NULL, ip, &dip, &bp, 0, 0, + (flags == INT_ASYNC) ? XFS_BUF_TRYLOCK : XFS_BUF_LOCK); + if (error ||!bp) { + xfs_ifunlock(ip); + return error; + } + + /* * First flush out the inode that xfs_iflush was called with. */ error = xfs_iflush_int(ip, bp); Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.h 2007-11-02 13:44:46.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.h 2007-11-05 10:25:44.885153248 +1100 @@ -488,9 +488,12 @@ int xfs_finish_reclaim_all(struct xfs_m /* * xfs_inode.c prototypes. 
*/ -int xfs_itobp(struct xfs_mount *, struct xfs_trans *, +int xfs_itobp_flags(struct xfs_mount *, struct xfs_trans *, xfs_inode_t *, struct xfs_dinode **, struct xfs_buf **, - xfs_daddr_t, uint); + xfs_daddr_t, uint, uint); +#define xfs_itobp(mp, tp, ip, dipp, bpp, bno, iflags) \ + xfs_itobp_flags(mp, tp, ip, dipp, bpp, bno, iflags, XFS_BUF_LOCK) + int xfs_iread(struct xfs_mount *, struct xfs_trans *, xfs_ino_t, xfs_inode_t **, xfs_daddr_t, uint); int xfs_iread_extents(struct xfs_trans *, xfs_inode_t *, int); Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c 2007-11-02 13:44:50.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c 2007-11-05 10:39:05.969204451 +1100 @@ -840,7 +840,8 @@ xfs_fs_write_inode( struct inode *inode, int sync) { - int error = 0, flags = FLUSH_INODE; + int error = 0; + int flags = 0; xfs_itrace_entry(XFS_I(inode)); if (sync) { Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_vnode.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_vnode.h 2007-10-02 16:01:47.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_vnode.h 2007-11-05 10:40:49.103817818 +1100 @@ -73,12 +73,9 @@ typedef enum bhv_vrwlock { #define IO_INVIS 0x00020 /* don't update inode timestamps */ /* - * Flags for vop_iflush call + * Flags for xfs_inode_flush */ #define FLUSH_SYNC 1 /* wait for flush to complete */ -#define FLUSH_INODE 2 /* flush the inode itself */ -#define FLUSH_LOG 4 /* force the last log entry for - * this inode out to disk */ /* * Flush/Invalidate options for vop_toss/flush/flushinval_pages. 
Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2007-11-05 10:02:05.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2007-11-05 10:37:53.398623943 +1100 @@ -3556,29 +3556,6 @@ xfs_inode_flush( ((iip == NULL) || !(iip->ili_format.ilf_fields & XFS_ILOG_ALL))) return 0; - if (flags & FLUSH_LOG) { - if (iip && iip->ili_last_lsn) { - xlog_t *log = mp->m_log; - xfs_lsn_t sync_lsn; - int s, log_flags = XFS_LOG_FORCE; - - s = GRANT_LOCK(log); - sync_lsn = log->l_last_sync_lsn; - GRANT_UNLOCK(log, s); - - if ((XFS_LSN_CMP(iip->ili_last_lsn, sync_lsn) > 0)) { - if (flags & FLUSH_SYNC) - log_flags |= XFS_LOG_SYNC; - error = xfs_log_force(mp, iip->ili_last_lsn, log_flags); - if (error) - return error; - } - - if (ip->i_update_core == 0) - return 0; - } - } - /* * We make this non-blocking if the inode is contended, * return EAGAIN to indicate to the caller that they @@ -3586,30 +3563,22 @@ xfs_inode_flush( * blocking on inodes inside another operation right * now, they get caught later by xfs_sync. */ - if (flags & FLUSH_INODE) { - int flush_flags; - - if (flags & FLUSH_SYNC) { - xfs_ilock(ip, XFS_ILOCK_SHARED); - xfs_iflock(ip); - } else if (xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) { - if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip)) { - xfs_iunlock(ip, XFS_ILOCK_SHARED); - return EAGAIN; - } - } else { + if (flags & FLUSH_SYNC) { + xfs_ilock(ip, XFS_ILOCK_SHARED); + xfs_iflock(ip); + } else if (xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) { + if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip)) { + xfs_iunlock(ip, XFS_ILOCK_SHARED); return EAGAIN; } - - if (flags & FLUSH_SYNC) - flush_flags = XFS_IFLUSH_SYNC; - else - flush_flags = XFS_IFLUSH_ASYNC; - - error = xfs_iflush(ip, flush_flags); - xfs_iunlock(ip, XFS_ILOCK_SHARED); + } else { + return EAGAIN; } + error = xfs_iflush(ip, (flags & FLUSH_SYNC) ? 
XFS_IFLUSH_SYNC + : XFS_IFLUSH_ASYNC); + xfs_iunlock(ip, XFS_ILOCK_SHARED); + return error; }
* Re: writeout stalls in current -git 2007-11-05 1:45 ` David Chinner @ 2007-11-05 7:01 ` Torsten Kaiser 2007-11-05 18:27 ` Torsten Kaiser 1 sibling, 0 replies; 39+ messages in thread From: Torsten Kaiser @ 2007-11-05 7:01 UTC (permalink / raw) To: David Chinner Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs On 11/5/07, David Chinner <dgc@sgi.com> wrote: > On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote: > > I can now confirm, that I see this also with the current mainline-git-version > > I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c > > plus the fix for the sg changes in ieee1394. > > Ok, so it's probably a side effect of the writeback changes. > > Attached are two patches (two because one was in a separate patchset as > a standalone change) that should prevent async writeback from blocking > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first. > Can you see if this fixes the problem? Applied both patches against the kernel mentioned above. This blows up at boot: [ 80.807589] Filesystem "dm-0": Disabling barriers, not supported by the underlying device [ 80.820241] XFS mounting filesystem dm-0 [ 80.913144] ------------[ cut here ]------------ [ 80.914932] kernel BUG at drivers/md/raid5.c:143! 
[ 80.916751] invalid opcode: 0000 [1] SMP [ 80.918338] CPU 3 [ 80.919142] Modules linked in: [ 80.920345] Pid: 974, comm: md1_raid5 Not tainted 2.6.24-rc1 #3 [ 80.922628] RIP: 0010:[<ffffffff804b6ee4>] [<ffffffff804b6ee4>] __release_stripe+0x164/0x170 [ 80.925935] RSP: 0018:ffff8100060e7dd0 EFLAGS: 00010002 [ 80.927987] RAX: 0000000000000000 RBX: ffff81010141c288 RCX: 0000000000000000 [ 80.930738] RDX: 0000000000000000 RSI: ffff81010141c288 RDI: ffff810004fb3200 [ 80.933488] RBP: ffff810004fb3200 R08: 0000000000000000 R09: 0000000000000005 [ 80.936240] R10: 0000000000000e00 R11: ffffe200038465e8 R12: ffff81010141c298 [ 80.938990] R13: 0000000000000286 R14: ffff810004fb3330 R15: 0000000000000000 [ 80.941741] FS: 000000000060c870(0000) GS:ffff810100313700(0000) knlGS:0000000000000000 [ 80.944861] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [ 80.947080] CR2: 00007fff7b295000 CR3: 0000000101842000 CR4: 00000000000006e0 [ 80.949830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 80.952580] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 80.955332] Process md1_raid5 (pid: 974, threadinfo ffff8100060e6000, task ffff81000645c730) [ 80.958584] Stack: ffff81010141c288 00000000000001f4 ffff810004fb3200 ffffffff804b6f2d [ 80.961761] 00000000000001f4 ffff81010141c288 ffffffff804c8bd0 0000000000000000 [ 80.964681] ffff8100060e7ee8 ffffffff804bd094 ffff81000645c730 ffff8100060e7e70 [ 80.967518] Call Trace: [ 80.968558] [<ffffffff804b6f2d>] release_stripe+0x3d/0x60 [ 80.970677] [<ffffffff804c8bd0>] md_thread+0x0/0x100 [ 80.972629] [<ffffffff804bd094>] raid5d+0x344/0x450 [ 80.974549] [<ffffffff8023df10>] process_timeout+0x0/0x10 [ 80.976668] [<ffffffff805ae1ca>] schedule_timeout+0x5a/0xd0 [ 80.978855] [<ffffffff804c8bd0>] md_thread+0x0/0x100 [ 80.980807] [<ffffffff804c8c00>] md_thread+0x30/0x100 [ 80.982794] [<ffffffff80249f20>] autoremove_wake_function+0x0/0x30 [ 80.985214] [<ffffffff804c8bd0>] md_thread+0x0/0x100 [ 80.987167] 
[<ffffffff80249b3b>] kthread+0x4b/0x80 [ 80.989054] [<ffffffff8020c9c8>] child_rip+0xa/0x12 [ 80.990972] [<ffffffff80249af0>] kthread+0x0/0x80 [ 80.992824] [<ffffffff8020c9be>] child_rip+0x0/0x12 [ 80.994743] [ 80.995588] [ 80.995588] Code: 0f 0b eb fe 0f 1f 84 00 00 00 00 00 48 83 ec 28 48 89 5c 24 [ 80.999307] RIP [<ffffffff804b6ee4>] __release_stripe+0x164/0x170 [ 81.001711] RSP <ffff8100060e7dd0> Switching back to unpatched 2.6.23-mm1 boots successfully... Torsten ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: writeout stalls in current -git 2007-11-05 1:45 ` David Chinner 2007-11-05 7:01 ` Torsten Kaiser @ 2007-11-05 18:27 ` Torsten Kaiser 2007-11-06 4:25 ` David Chinner 1 sibling, 1 reply; 39+ messages in thread From: Torsten Kaiser @ 2007-11-05 18:27 UTC (permalink / raw) To: David Chinner Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs On 11/5/07, David Chinner <dgc@sgi.com> wrote: > Ok, so it's probably a side effect of the writeback changes. > > Attached are two patches (two because one was in a separate patchset as > a standalone change) that should prevent async writeback from blocking > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first. > Can you see if this fixes the problem? Now testing v2.6.24-rc1-650-gb55d1b1 plus the fix for the misapplied raid5 patch. Applying your two patches on top of that does not fix the stalls. vmstat 10 output from unmerging (uninstalling) a kernel: 1 0 0 3512188 332 192644 0 0 185 12 368 735 10 3 85 1 -> emerge starts to remove the kernel source files 3 0 0 3506624 332 192836 0 0 15 9825 2458 8307 7 12 81 0 0 0 0 3507212 332 192836 0 0 0 554 630 1233 0 1 99 0 0 0 0 3507292 332 192836 0 0 0 537 580 1328 0 1 99 0 0 0 0 3507168 332 192836 0 0 0 633 626 1380 0 1 99 0 0 0 0 3507116 332 192836 0 0 0 1510 768 2030 1 2 97 0 0 0 0 3507596 332 192836 0 0 0 524 540 1544 0 0 99 0 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 0 3507540 332 192836 0 0 0 489 551 1293 0 0 99 0 0 0 0 3507528 332 192836 0 0 0 527 510 1432 1 1 99 0 0 0 0 3508052 332 192840 0 0 0 2088 910 2964 2 3 95 0 0 0 0 3507888 332 192840 0 0 0 442 565 1383 1 1 99 0 0 0 0 3508704 332 192840 0 0 0 497 529 1479 0 0 99 0 0 0 0 3508704 332 192840 0 0 0 594 595 1458 0 0 99 0 0 0 0 3511492 332 192840 0 0 0 2381 1028 2941 2 3 95 0 0 0 0 3510684 332 192840 0 0 0 699 600 1390 0 0 99 0 0 0 0 3511636 332 192840 0 0 0 741 661 
1641 0 0 100 0 0 0 0 3524020 332 192840 0 0 0 2452 1080 3910 2 3 95 0 0 0 0 3524040 332 192844 0 0 0 530 617 1297 0 0 99 0 0 0 0 3524128 332 192844 0 0 0 812 674 1667 0 1 99 0 0 0 0 3527000 332 193672 0 0 339 721 754 1681 3 2 93 1 -> emerge is finished, no dirty or writeback data in /proc/meminfo 0 0 0 3571056 332 194768 0 0 111 639 632 1344 0 1 99 0 0 0 0 3571260 332 194768 0 0 0 757 688 1405 1 0 99 0 0 0 0 3571156 332 194768 0 0 0 753 641 1361 0 0 99 0 0 0 0 3571404 332 194768 0 0 0 766 653 1389 0 0 99 0 1 0 0 3571136 332 194768 0 0 6 764 669 1488 0 0 99 0 0 0 0 3571668 332 194824 0 0 0 764 657 1482 0 0 99 0 0 0 0 3571848 332 194824 0 0 0 673 659 1406 0 0 99 0 0 0 0 3571908 332 195052 0 0 22 753 638 1500 0 1 99 0 0 0 0 3573052 332 195052 0 0 0 765 631 1482 0 1 99 0 0 0 0 3574144 332 195052 0 0 0 771 640 1497 0 0 99 0 0 0 0 3573468 332 195052 0 0 0 458 485 1251 0 0 99 0 0 0 0 3574184 332 195052 0 0 0 427 474 1192 0 0 100 0 0 0 0 3575092 332 195052 0 0 0 461 482 1235 0 0 99 0 0 0 0 3576368 332 195056 0 0 0 582 556 1310 0 0 99 0 0 0 0 3579300 332 195056 0 0 0 695 571 1402 0 0 99 0 0 0 0 3580376 332 195056 0 0 0 417 568 906 0 0 99 0 0 0 0 3581212 332 195056 0 0 0 421 559 977 0 1 99 0 0 0 0 3583780 332 195060 0 0 0 494 555 1080 0 1 99 0 0 0 0 3584352 332 195060 0 0 0 99 347 559 0 0 99 0 0 0 0 3585232 332 195060 0 0 0 11 301 621 0 0 99 0 -> disks go idle. So these patches do not seem to be the source of these excessive disk writes... Torsten ^ permalink raw reply [flat|nested] 39+ messages in thread
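The write rate in the log above can be summarized by extracting the `bo` column from the `vmstat 10` output. The following is a minimal sketch, assuming the 16-column layout shown in the header line of the log; the helper names `parse_vmstat` and `mean_bo` are hypothetical and not part of any tool mentioned in this thread. The sample rows are copied from the log above.

```python
# Sketch: extract the "bo" (blocks out) column from vmstat output.
# Assumes the 16-column layout shown in the log's header line.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat(text):
    """Return one dict per data row; headers and annotations are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has exactly 16 fields, all non-negative integers.
        if len(fields) != len(VMSTAT_COLUMNS):
            continue
        if not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows


def mean_bo(rows):
    """Average of the bo column over the given samples (0.0 if empty)."""
    return sum(r["bo"] for r in rows) / len(rows) if rows else 0.0


# Rows taken from the log above, including a header and an annotation line.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 0 0 0 3507540 332 192836 0 0 0 489 551 1293 0 0 99 0
 0 0 0 3507528 332 192836 0 0 0 527 510 1432 1 1 99 0
-> disks go idle.
 0 0 0 3585232 332 195060 0 0 0 11 301 621 0 0 99 0
"""

rows = parse_vmstat(sample)
print(len(rows), "samples, mean bo =", round(mean_bo(rows), 1))
```

Feeding the rows logged after "emerge is finished" into `mean_bo` would give a single figure for how much writeout continues while /proc/meminfo reports no dirty data.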
* Re: writeout stalls in current -git 2007-11-05 18:27 ` Torsten Kaiser @ 2007-11-06 4:25 ` David Chinner 2007-11-06 7:10 ` Torsten Kaiser 2007-11-06 19:01 ` Peter Zijlstra 0 siblings, 2 replies; 39+ messages in thread From: David Chinner @ 2007-11-06 4:25 UTC (permalink / raw) To: Torsten Kaiser Cc: David Chinner, Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote: > On 11/5/07, David Chinner <dgc@sgi.com> wrote: > > Ok, so it's probably a side effect of the writeback changes. > > > > Attached are two patches (two because one was in a separate patchset as > > a standalone change) that should prevent async writeback from blocking > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first. > > Can you see if this fixes the problem? > > Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied raid5-patch > Applying your two patches ontop of that does not fix the stalls. So you are having RAID5 problems as well? I'm struggling to understand what could possibly have changed in XFS or writeback that would lead to stalls like this, esp. as you appear to be removing files when the stalls occur. Rather than vmstat, can you use something like iostat to show how busy your disks are? I.e., are we seeing RMW cycles in the raid5, or some such issue? Out of curiosity, what is the 'xfs_info <mtpt>' output for your filesystem? 
> vmstat 10 output from unmerging (uninstalling) a kernel: > 1 0 0 3512188 332 192644 0 0 185 12 368 735 10 3 85 1 > -> emerge starts to remove the kernel source files > 3 0 0 3506624 332 192836 0 0 15 9825 2458 8307 7 12 81 0 > 0 0 0 3507212 332 192836 0 0 0 554 630 1233 0 1 99 0 > 0 0 0 3507292 332 192836 0 0 0 537 580 1328 0 1 99 0 > 0 0 0 3507168 332 192836 0 0 0 633 626 1380 0 1 99 0 > 0 0 0 3507116 332 192836 0 0 0 1510 768 2030 1 2 97 0 > 0 0 0 3507596 332 192836 0 0 0 524 540 1544 0 0 99 0 > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 0 0 3507540 332 192836 0 0 0 489 551 1293 0 0 99 0 > 0 0 0 3507528 332 192836 0 0 0 527 510 1432 1 1 99 0 > 0 0 0 3508052 332 192840 0 0 0 2088 910 2964 2 3 95 0 > 0 0 0 3507888 332 192840 0 0 0 442 565 1383 1 1 99 0 > 0 0 0 3508704 332 192840 0 0 0 497 529 1479 0 0 99 0 > 0 0 0 3508704 332 192840 0 0 0 594 595 1458 0 0 99 0 > 0 0 0 3511492 332 192840 0 0 0 2381 1028 2941 2 3 95 0 > 0 0 0 3510684 332 192840 0 0 0 699 600 1390 0 0 99 0 > 0 0 0 3511636 332 192840 0 0 0 741 661 1641 0 0 100 0 > 0 0 0 3524020 332 192840 0 0 0 2452 1080 3910 2 3 95 0 > 0 0 0 3524040 332 192844 0 0 0 530 617 1297 0 0 99 0 > 0 0 0 3524128 332 192844 0 0 0 812 674 1667 0 1 99 0 > 0 0 0 3527000 332 193672 0 0 339 721 754 1681 3 2 93 1 > -> emerge is finished, no dirty or writeback data in /proc/meminfo At this point, can you run a "sync" and see how long that takes to complete? The only thing I can think of that would be written out after this point is inodes, but even then it seems to go on for a long, long time and it really doesn't seem like XFS is holding up the inode writes. Another option is to use blktrace/blkparse to determine which process is issuing this I/O. > 0 0 0 3583780 332 195060 0 0 0 494 555 1080 0 1 99 0 > 0 0 0 3584352 332 195060 0 0 0 99 347 559 0 0 99 0 > 0 0 0 3585232 332 195060 0 0 0 11 301 621 0 0 99 0 > -> disks go idle. 
> > So these patches do not seem to be the source of these excessive disk writes... Well, the patches I posted should prevent blocking in the places that it was seen, so if that does not stop the slowdowns then either the writeback code is not feeding us inodes fast enough or the block device below is having some kind of problem.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: writeout stalls in current -git 2007-11-06 4:25 ` David Chinner @ 2007-11-06 7:10 ` Torsten Kaiser 2007-11-06 19:01 ` Peter Zijlstra 1 sibling, 0 replies; 39+ messages in thread From: Torsten Kaiser @ 2007-11-06 7:10 UTC (permalink / raw) To: David Chinner Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs On 11/6/07, David Chinner <dgc@sgi.com> wrote: > On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote: > > On 11/5/07, David Chinner <dgc@sgi.com> wrote: > > > Ok, so it's probably a side effect of the writeback changes. > > > > > > Attached are two patches (two because one was in a separate patchset as > > > a standalone change) that should prevent async writeback from blocking > > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first. > > > Can you see if this fixes the problem? > > > > Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied raid5-patch > > Applying your two patches ontop of that does not fix the stalls. > > So you are having RAID5 problems as well? The first 2.6.24-rc1-git-kernel that I patched with your patches did not boot for me. (Oops sent in one of my previous mails.) But given that the stacktrace was not xfs related and I had seen this patch on the lkml, I tried to fix this Oops this way. I have not had any trouble with the RAID5 otherwise. > I'm struggling to understand what possible changed in XFS or writeback that > would lead to stalls like this, esp. as you appear to be removing files when > the stalls occur. Rather than vmstat, can you use something like iostat to > show how busy your disks are? i.e. are we seeing RMW cycles in the raid5 or > some such issue. Will do this this evening. > OOC, what is the 'xfs_info <mtpt>' output for your filesystem? 
meta-data=/dev/mapper/root isize=256 agcount=32, agsize=4731132 blks = sectsz=512 attr=1 data = bsize=4096 blocks=151396224, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal bsize=4096 blocks=32768, version=1 = sectsz=512 sunit=0 blks, lazy-count=0 realtime =none extsz=4096 blocks=0, rtextents=0 > > vmstat 10 output from unmerging (uninstalling) a kernel: > > 1 0 0 3512188 332 192644 0 0 185 12 368 735 10 3 85 1 > > -> emerge starts to remove the kernel source files > > 3 0 0 3506624 332 192836 0 0 15 9825 2458 8307 7 12 81 0 > > 0 0 0 3507212 332 192836 0 0 0 554 630 1233 0 1 99 0 > > 0 0 0 3507292 332 192836 0 0 0 537 580 1328 0 1 99 0 > > 0 0 0 3507168 332 192836 0 0 0 633 626 1380 0 1 99 0 > > 0 0 0 3507116 332 192836 0 0 0 1510 768 2030 1 2 97 0 > > 0 0 0 3507596 332 192836 0 0 0 524 540 1544 0 0 99 0 > > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > > r b swpd free buff cache si so bi bo in cs us sy id wa > > 0 0 0 3507540 332 192836 0 0 0 489 551 1293 0 0 99 0 > > 0 0 0 3507528 332 192836 0 0 0 527 510 1432 1 1 99 0 > > 0 0 0 3508052 332 192840 0 0 0 2088 910 2964 2 3 95 0 > > 0 0 0 3507888 332 192840 0 0 0 442 565 1383 1 1 99 0 > > 0 0 0 3508704 332 192840 0 0 0 497 529 1479 0 0 99 0 > > 0 0 0 3508704 332 192840 0 0 0 594 595 1458 0 0 99 0 > > 0 0 0 3511492 332 192840 0 0 0 2381 1028 2941 2 3 95 0 > > 0 0 0 3510684 332 192840 0 0 0 699 600 1390 0 0 99 0 > > 0 0 0 3511636 332 192840 0 0 0 741 661 1641 0 0 100 0 > > 0 0 0 3524020 332 192840 0 0 0 2452 1080 3910 2 3 95 0 > > 0 0 0 3524040 332 192844 0 0 0 530 617 1297 0 0 99 0 > > 0 0 0 3524128 332 192844 0 0 0 812 674 1667 0 1 99 0 > > 0 0 0 3527000 332 193672 0 0 339 721 754 1681 3 2 93 1 > > -> emerge is finished, no dirty or writeback data in /proc/meminfo > > At this point, can you run a "sync" and see how long that takes to > complete? 
Already tried that: http://lkml.org/lkml/2007/11/2/178 See the logs from the second unmerge in the second half of the mail. The sync did not stop this writeout, but returned immediately. > The only thing I can think that woul dbe written out after > this point is inodes, but even then it seems to go on for a long, > long time and it really doesn't seem like XFS is holding up the > inode writes. Yes, I completely agree that this is much too long. That's why I included the after-emerge-finished parts of the logs. But I still partly suspect xfs, because the xfssyncd shows up when I hit SysRq+W. > Another option is to use blktrace/blkparse to determine which process is > issuing this I/O. > > > 0 0 0 3583780 332 195060 0 0 0 494 555 1080 0 1 99 0 > > 0 0 0 3584352 332 195060 0 0 0 99 347 559 0 0 99 0 > > 0 0 0 3585232 332 195060 0 0 0 11 301 621 0 0 99 0 > > -> disks go idle. > > > > So these patches do not seem to be the source of these excessive disk writes... > > Well, the patches I posted should prevent blocking in the places that it > was seen, so if that does not stop the slowdowns then either the writeback > code is not feeding us inodes fast enough or the block device below is > having some kind of problem.... I don't think it's the block device, because reading/writing larger files does not seem to be troubled. It looks much more like an inode problem. For example, both installing and uninstalling kernel source trees show these stalls, but during uninstalling this is much more noticeable. But I agree that this might not be xfs specific, as this showed up at the same time as other people started reporting the 100% iowait bug. It could be that this is the same bug, and the differences between reiserfs and xfs might explain the iowait vs. idle. Or the fact that I don't see the 100% iowait is caused by something else on my system... Torsten ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: writeout stalls in current -git 2007-11-06 4:25 ` David Chinner 2007-11-06 7:10 ` Torsten Kaiser @ 2007-11-06 19:01 ` Peter Zijlstra 2007-11-06 20:26 ` Torsten Kaiser 1 sibling, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2007-11-06 19:01 UTC (permalink / raw) To: David Chinner Cc: Torsten Kaiser, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote: > I'm struggling to understand what possible changed in XFS or writeback that > would lead to stalls like this, esp. as you appear to be removing files when > the stalls occur. Just a crazy idea,.. Could there be a set_page_dirty() that doesn't have balance_dirty_pages() call near? For example modifying meta data in unlink? Such a situation could lead to an excess of dirty pages and the next call to balance_dirty_pages() would appear to stall, as it would desperately try to get below the limit again. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: writeout stalls in current -git 2007-11-06 19:01 ` Peter Zijlstra @ 2007-11-06 20:26 ` Torsten Kaiser 0 siblings, 0 replies; 39+ messages in thread From: Torsten Kaiser @ 2007-11-06 20:26 UTC (permalink / raw) To: Peter Zijlstra Cc: David Chinner, Fengguang Wu, Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel, xfs On 11/6/07, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote: > > > I'm struggling to understand what possible changed in XFS or writeback that > > would lead to stalls like this, esp. as you appear to be removing files when > > the stalls occur. > > Just a crazy idea,.. > > Could there be a set_page_dirty() that doesn't have > balance_dirty_pages() call near? For example modifying meta data in > unlink? > > Such a situation could lead to an excess of dirty pages and the next > call to balance_dirty_pages() would appear to stall, as it would > desperately try to get below the limit again. Only if accounting of the dirty pages is also broken. In the unmerge testcase I see most of the time only <200kb of dirty data in /proc/meminfo. The system has 4Gb of RAM so I'm not sure if it should ever be valid to stall even the emerge/install testcase. Torsten Now building a kernel with the skipped-pages-accounting-patch reverted... ^ permalink raw reply [flat|nested] 39+ messages in thread
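The dirty-data figures quoted from /proc/meminfo can be sampled with a few lines of script. This is a minimal sketch, assuming the usual `Key: value kB` layout of /proc/meminfo; the function name `dirty_and_writeback_kb` and the sample values below are hypothetical and only illustrate the parsing.

```python
# Sketch: read the Dirty and Writeback counters (in kB) from meminfo text.
def dirty_and_writeback_kb(meminfo_text):
    """Return {'Dirty': kB, 'Writeback': kB} for the keys that are present."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        # Exact key match, so e.g. "WritebackTmp" is not picked up.
        if key in ("Dirty", "Writeback") and parts:
            values[key] = int(parts[0])
    return values


# Illustrative input only; these numbers are made up, not taken from the thread.
sample = """\
MemTotal:        4045820 kB
Dirty:               180 kB
Writeback:             0 kB
WritebackTmp:          0 kB
"""

print(dirty_and_writeback_kb(sample))
```

On a live system the input would come from `open("/proc/meminfo").read()`; polling it once per second during an unmerge would show whether the dirty counter ever gets near the throttling limits.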
[parent not found: <E1IpKZ4-0004je-Lb@localhost>]
* Re: writeout stalls in current -git [not found] ` <E1IpKZ4-0004je-Lb@localhost> @ 2007-11-06 9:17 ` Fengguang Wu 2007-11-06 21:53 ` Torsten Kaiser 2007-11-06 9:17 ` Fengguang Wu 1 sibling, 1 reply; 39+ messages in thread From: Fengguang Wu @ 2007-11-06 9:17 UTC (permalink / raw) To: Torsten Kaiser Cc: Peter Zijlstra, Maxim Levitsky, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote: > [ 547.200000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858 > global 12829 72 0 wc __ tw 0 sk 0 > [ 550.480000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834 > global 12017 62 0 wc __ tw 0 sk 0 > [ 552.710000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810 > global 11133 83 0 wc __ tw 0 sk 0 > [ 558.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786 > global 10470 33 0 wc _M tw 0 sk 0 4s > [ 562.750000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762 > global 10555 69 0 wc _M tw 0 sk 0 3s > [ 565.150000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738 > global 9562 498 0 wc _M tw -2 sk 0 4s > [ 569.490000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712 > global 8960 2 0 wc _M tw 0 sk 0 3s > [ 572.910000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688 > global 8088 205 0 wc _M tw -13 sk 0 2s > [ 574.610000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651 > global 7114 188 0 wc _M tw -1 sk 0 10s > [ 584.270000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626 > global 14544 0 0 wc _M tw -1 sk 0 9s > [ 593.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601 > global 24583 736 0 wc _M tw -1 sk 0 7s > [ 600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576 > global 27004 6 0 wc _M tw 587 sk 0 > [ 600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139 > global 27004 6 0 wc __ tw 1014 sk 0 The above messages and the below 'D' state pdflush indicate that one single writeback_inodes(4MB) call takes a long 
time(up to 10s!) to complete. Let's try reverting the below patch with `patch -R`? It looks like the most relevant change - if it's not a low level bug. > [note] first stall, the output from emerge stops, so it seems it can > not start processing the next file until the stall ends > [ 630.000000] SysRq : Emergency Sync > [ 630.120000] Emergency Sync complete > [ 632.850000] SysRq : Show Blocked State > [ 632.850000] task PC stack pid father > [ 632.850000] pdflush D ffff81000f091788 0 285 2 > [ 632.850000] ffff810005d4da80 0000000000000046 0000000000000800 > 0000007000000001 > [ 632.850000] ffff81000fd52400 ffffffff8022d61c ffffffff80819b00 > ffffffff80819b00 > [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810100316f98 > 0000000000000000 > [ 632.850000] Call Trace: > [ 632.850000] [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90 > [ 632.850000] [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90 > [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e > [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 > [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a > [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40 > [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240 > [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190 > [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0 > [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340 > [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0 > [ 632.850000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330 > [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520 > [ 632.850000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60 > [ 632.850000] [<ffffffff8036ed49>] xfs_inode_flush+0x179/0x1b0 > [ 632.850000] [<ffffffff8037ca8f>] xfs_fs_write_inode+0x2f/0x90 > [ 632.850000] [<ffffffff802b3aac>] __writeback_single_inode+0x2ac/0x380 > [ 632.850000] [<ffffffff804d074e>] dm_table_any_congested+0x2e/0x80 > [ 632.850000] [<ffffffff802b3f9d>] 
generic_sync_sb_inodes+0x20d/0x330 > [ 632.850000] [<ffffffff802b4532>] writeback_inodes+0xa2/0xe0 > [ 632.850000] [<ffffffff8026bfd6>] wb_kupdate+0xa6/0x140 > [ 632.850000] [<ffffffff8026c4b0>] pdflush+0x0/0x1e0 > [ 632.850000] [<ffffffff8026c5c0>] pdflush+0x110/0x1e0 > [ 632.850000] [<ffffffff8026bf30>] wb_kupdate+0x0/0x140 > [ 632.850000] [<ffffffff8024a32b>] kthread+0x4b/0x80 > [ 632.850000] [<ffffffff8020c9d8>] child_rip+0xa/0x12 > [ 632.850000] [<ffffffff8024a2e0>] kthread+0x0/0x80 > [ 632.850000] [<ffffffff8020c9ce>] child_rip+0x0/0x12 > [ 632.850000] > [ 632.850000] emerge D 0000000000000000 0 6220 6129 > [ 632.850000] ffff810103ced9f8 0000000000000086 0000000000000000 > 0000007000000001 > [ 632.850000] ffff81000fd52cf8 ffffffff00000000 ffffffff80819b00 > ffffffff80819b00 > [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810103ced9b8 > ffff810103ced9a8 > [ 632.850000] Call Trace: > [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e > [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10 > [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a > [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40 > [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240 > [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190 > [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0 > [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340 > [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0 > [ 632.850000] [<ffffffff80375bee>] xfs_buf_rele+0x2e/0xd0 > [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520 > [ 632.850000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60 > [ 632.850000] [<ffffffff80355c82>] xfs_inode_item_push+0x12/0x20 > [ 632.850000] [<ffffffff80368247>] xfs_trans_push_ail+0x267/0x2b0 > [ 632.850000] [<ffffffff8035c742>] xfs_log_reserve+0x72/0x120 > [ 632.850000] [<ffffffff80366bf8>] xfs_trans_reserve+0xa8/0x210 > [ 632.850000] [<ffffffff803731f2>] kmem_zone_zalloc+0x32/0x50 > [ 632.850000] 
[<ffffffff8035263b>] xfs_itruncate_finish+0xfb/0x310 > [ 632.850000] [<ffffffff8036daeb>] xfs_free_eofblocks+0x23b/0x280 > [ 632.850000] [<ffffffff80371f93>] xfs_release+0x153/0x200 > [ 632.850000] [<ffffffff80378010>] xfs_file_release+0x10/0x20 > [ 632.850000] [<ffffffff80294251>] __fput+0xb1/0x220 > [ 632.850000] [<ffffffff802910a4>] filp_close+0x54/0x90 > [ 632.850000] [<ffffffff802929bf>] sys_close+0x9f/0x100 > [ 632.850000] [<ffffffff8020bbbe>] system_call+0x7e/0x83 > [ 632.850000] > [ 662.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 73045 > global 39157 0 0 wc __ tw 0 sk 0 > [note] emerge resumed > [ 664.030000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK > showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks > Unmount shoW-blocked-tasks ------------------------------------------------------ Subject: writeback: remove pages_skipped accounting in __block_write_full_page() From: Fengguang Wu <wfg@mail.ustc.edu.cn> Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug: > The following strange behavior can be observed: > > 1. large file is written > 2. after 30 seconds, nr_dirty goes down by 1024 > 3. then for some time (< 30 sec) nothing happens (disk idle) > 4. then nr_dirty again goes down by 1024 > 5. repeat from 3. until whole file is written > > So basically a 4Mbyte chunk of the file is written every 30 seconds. > I'm quite sure this is not the intended behavior. 
It can be produced by the following test scheme: # cat bin/test-writeback.sh grep nr_dirty /proc/vmstat echo 1 > /proc/sys/fs/inode_debug dd if=/dev/zero of=/var/x bs=1K count=204800& while true; do grep nr_dirty /proc/vmstat; sleep 1; done # bin/test-writeback.sh nr_dirty 19207 nr_dirty 19207 nr_dirty 30924 204800+0 records in 204800+0 records out 209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s nr_dirty 47150 nr_dirty 47141 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47205 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47215 nr_dirty 47216 nr_dirty 47216 nr_dirty 47216 nr_dirty 47154 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47134 nr_dirty 47134 nr_dirty 47135 nr_dirty 47135 nr_dirty 47135 nr_dirty 46097 <== -1038 nr_dirty 46098 nr_dirty 46098 nr_dirty 46098 [...] nr_dirty 46091 nr_dirty 46092 nr_dirty 46092 nr_dirty 45069 <== -1023 nr_dirty 45056 nr_dirty 45056 nr_dirty 45056 [...] nr_dirty 37822 nr_dirty 36799 <== -1023 [...] nr_dirty 36781 nr_dirty 35758 <== -1023 [...] nr_dirty 34708 nr_dirty 33672 <== -1024 [...] nr_dirty 33692 nr_dirty 32669 <== -1023 % ls -li /var/x 847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x % dmesg|grep 847824 # generated by a debug printk [ 529.263184] redirtied inode 847824 line 548 [ 564.250872] redirtied inode 847824 line 548 [ 594.272797] redirtied inode 847824 line 548 [ 629.231330] redirtied inode 847824 line 548 [ 659.224674] redirtied inode 847824 line 548 [ 689.219890] redirtied inode 847824 line 548 [ 724.226655] redirtied inode 847824 line 548 [ 759.198568] redirtied inode 847824 line 548 # line 548 in fs/fs-writeback.c: 543 if (wbc->pages_skipped != pages_skipped) { 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 
547 */ 548 redirty_tail(inode); 549 } More debugging shows that __block_write_full_page() never has the chance to call submit_bh() for that big dirty file: the buffer head is *clean*. So basically no page I/O is issued by __block_write_full_page(), hence pages_skipped goes up. Also the comment in generic_sync_sb_inodes(): 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 547 */ and the comment in __block_write_full_page(): 1713 /* 1714 * The page was marked dirty, but the buffers were 1715 * clean. Someone wrote them back by hand with 1716 * ll_rw_block/submit_bh. A rare case. 1717 */ do not quite agree with each other. The page writeback should be skipped for 'locked buffer', but here it is 'clean buffer'! This patch fixes this bug. That said, I'm not sure why __block_write_full_page() is called only to do nothing, or who actually issued the writeback for us. These are the two possible new behaviors after the patch: 1) pretty nice: wait 30s and write ALL:) 2) not so good: - during the dd: ~16M - after 30s: ~4M - after 5s: ~4M - after 5s: ~176M The next patch will fix case (2). Cc: David Chinner <dgc@sgi.com> Cc: Ken Chen <kenchen@google.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- fs/buffer.c | 1 - fs/xfs/linux-2.6/xfs_aops.c | 5 ++--- 2 files changed, 2 insertions(+), 4 deletions(-) diff -puN fs/buffer.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page fs/buffer.c --- a/fs/buffer.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page +++ a/fs/buffer.c @@ -1730,7 +1730,6 @@ done: * The page and buffer_heads can be released at any time from * here on. 
*/ - wbc->pages_skipped++; /* We didn't write this page */ } return err; diff -puN fs/xfs/linux-2.6/xfs_aops.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page fs/xfs/linux-2.6/xfs_aops.c --- a/fs/xfs/linux-2.6/xfs_aops.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page +++ a/fs/xfs/linux-2.6/xfs_aops.c @@ -402,10 +402,9 @@ xfs_start_page_writeback( clear_page_dirty_for_io(page); set_page_writeback(page); unlock_page(page); - if (!buffers) { + /* If no buffers on the page are to be written, finish it here */ + if (!buffers) end_page_writeback(page); - wbc->pages_skipped++; /* We didn't write this page */ - } } static inline int bio_add_buffer(struct bio *bio, struct buffer_head *bh) _ Patches currently in -mm which might be from wfg@mail.ustc.edu.cn are origin.patch ^ permalink raw reply [flat|nested] 39+ messages in thread
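The "4Mbyte chunk every 30 seconds" pattern described in the changelog can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate under stated assumptions: a 4 KiB page size, and one writeback pass roughly every 30 seconds (the dmesg timestamps in the changelog are 30 to 35 seconds apart). It gives an upper bound for the case where the whole file has to drain this way.

```python
# Back-of-the-envelope estimate for the "4 MB every 30 seconds" pattern.
PAGE_SIZE = 4096        # assumption: 4 KiB pages
PAGES_PER_PASS = 1024   # nr_dirty drops by about 1024 per pass in the log
PASS_INTERVAL_S = 30    # assumption: one pass per ~30 s (log shows 30-35 s)

file_bytes = 204800 * 1024                  # dd bs=1K count=204800
file_pages = file_bytes // PAGE_SIZE        # pages needed to hold the file
chunk_bytes = PAGES_PER_PASS * PAGE_SIZE    # data written per pass
passes = -(-file_pages // PAGES_PER_PASS)   # ceiling division
drain_seconds = passes * PASS_INTERVAL_S    # time to drain the whole file

print(f"file: {file_bytes} bytes = {file_pages} pages")
print(f"per pass: {chunk_bytes // (1024 * 1024)} MiB")
print(f"passes: {passes}, drain time: about {drain_seconds // 60} minutes")
```

Under these assumptions a file that dd wrote in under two seconds would take on the order of 25 minutes to reach the disk, which is why the redirty_tail() path triggered by pages_skipped matters.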
* Re: writeout stalls in current -git 2007-11-06 9:17 ` Fengguang Wu @ 2007-11-06 21:53 ` Torsten Kaiser 2007-11-06 23:31 ` David Chinner 0 siblings, 1 reply; 39+ messages in thread From: Torsten Kaiser @ 2007-11-06 21:53 UTC (permalink / raw) To: Fengguang Wu Cc: Peter Zijlstra, Maxim Levitsky, linux-kernel, Andrew Morton, David Chinner, linux-fsdevel On 11/6/07, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote: > ------------------------------------------------------ > Subject: writeback: remove pages_skipped accounting in __block_write_full_page() > From: Fengguang Wu <wfg@mail.ustc.edu.cn> > > Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug: [snip] > fs/buffer.c | 1 - > fs/xfs/linux-2.6/xfs_aops.c | 5 ++--- > 2 files changed, 2 insertions(+), 4 deletions(-) I have now tested v2.6.24-rc1-748-g2655e2c with the above patch reverted. It still stalls. On 11/6/07, David Chinner <dgc@sgi.com> wrote: > Rather than vmstat, can you use something like iostat to show how busy your > disks are? i.e. are we seeing RMW cycles in the raid5 or some such issue. Both "vmstat 10" and "iostat -x 10" output from this test: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 0 0 3700592 0 85424 0 0 31 83 108 244 2 1 95 1 -> emerge reads something, don't know for sure what... 1 0 0 3665352 0 87940 0 0 239 2 343 585 2 1 97 0 0 0 0 3657728 0 91228 0 0 322 35 445 833 0 0 99 0 1 0 0 3653136 0 94692 0 0 330 33 455 844 1 1 98 0 0 0 0 3646836 0 97720 0 0 289 3 422 751 1 1 98 0 0 0 0 3616468 0 99692 0 0 185 33 399 614 9 3 87 1 -> starts to remove the kernel tree 0 0 0 3610452 0 102592 0 0 138 3598 1398 3945 3 6 90 1 0 0 0 3607136 0 104548 0 0 2 5962 1919 6070 4 9 87 0 0 0 0 3606636 0 105080 0 0 0 1539 810 2200 1 2 97 0 -> first stall 28 sec. 
0 0 0 3606592 0 105292 0 0 0 698 679 1390 0 1 99 0 0 0 0 3606440 0 105532 0 0 0 658 690 1457 0 0 99 0 0 0 0 3606068 0 106128 0 0 1 1780 947 1982 1 3 96 0 -> second stall 24 sec. 0 0 0 3606036 0 106464 0 0 4 858 758 1457 0 1 98 0 0 0 0 3605380 0 106872 0 0 0 1173 807 1880 1 2 97 0 0 0 0 3605000 0 107748 0 0 1 2413 1103 2996 2 4 94 0 -> third stall 38 sec. 0 0 0 3604488 0 108472 0 0 45 897 748 1577 0 1 98 0 0 0 0 3604176 0 108764 0 0 0 824 752 1700 0 1 98 0 0 0 0 3604012 0 108988 0 0 0 660 643 1237 0 1 99 0 0 0 0 3608936 0 110120 0 0 1 3490 1232 3455 3 5 91 0 -> fourth stall 64 sec. 1 0 0 3609060 0 110296 0 0 0 568 669 1222 0 1 99 0 0 0 0 3609464 0 110496 0 0 0 604 638 1366 0 1 99 0 0 0 0 3609244 0 110740 0 0 0 844 714 1282 0 1 99 0 0 0 0 3609508 0 110912 0 0 0 552 584 1185 1 1 99 0 2 0 0 3609436 0 111132 0 0 0 658 643 1442 0 1 99 0 0 0 0 3609212 0 111348 0 0 0 714 637 1382 0 0 99 0 0 0 0 3619132 0 110492 0 0 130 1086 736 1870 4 3 91 2 0 0 0 3657016 0 115496 0 0 466 589 718 1367 1 1 98 0 -> emerge finishes, dirty data was <1Mb the whole time, now stays below 300kb (btrace running...) 
0 0 0 3657844 0 115660 0 0 0 564 635 1226 1 1 99 0 0 0 0 3658236 0 115840 0 0 0 582 600 1248 1 0 99 0 0 0 0 3658296 0 116012 0 0 0 566 606 1232 1 1 99 0 0 0 0 3657924 0 116212 0 0 0 688 596 1321 1 0 99 0 0 0 0 3658252 0 116416 0 0 0 631 642 1356 1 0 98 0 0 0 0 3658184 0 116592 0 0 0 566 575 1273 0 0 99 0 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 0 0 3658344 0 116772 0 0 0 649 606 1301 0 0 99 0 0 0 0 3658548 0 116976 0 0 0 617 624 1345 0 0 99 0 0 0 0 3659204 0 117160 0 0 0 550 576 1223 1 1 99 0 0 0 0 3659944 0 117344 0 0 0 620 583 1272 0 0 99 0 0 0 0 3660548 0 117540 0 0 0 605 611 1338 0 0 99 0 0 0 0 3661236 0 117732 0 0 0 582 569 1275 0 0 99 0 0 0 0 3662420 0 117888 0 0 0 590 571 1157 0 0 99 0 0 0 0 3664324 0 118068 0 0 0 566 553 1222 1 1 99 0 0 0 0 3665240 0 118168 0 0 0 401 574 862 0 0 99 0 0 0 0 3666984 0 118280 0 0 0 454 574 958 1 1 99 0 0 0 0 3668664 0 118400 0 0 0 396 559 946 0 0 99 0 0 0 0 3670628 0 118496 0 0 0 296 495 784 0 0 99 0 0 0 0 3671316 0 118496 0 0 0 36 334 307 0 0 99 0 -> disks go idle I also saved the btrace output, but that is even with bzip2 ~1.6Mb. Summary from btrace Total (253,0): Reads Queued: 5,385, 21,540KiB Writes Queued: 91,076, 362,640KiB Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 5,385, 21,540KiB Writes Completed: 91,076, 362,640KiB Read Merges: 0, 0KiB Write Merges: 0, 0KiB IO unplugs: 8,883 Timer unplugs: 0 Throughput (R/W): 38KiB/s / 654KiB/s Events (253,0): 201,805 entries Skips: 0 forward (0 - 0.0%) The last 20% of the btrace look more or less completely like this, no other programs do any IO... 
253,0 3 104626 526.293450729 974 C WS 79344288 + 8 [0] 253,0 3 104627 526.293455078 974 C WS 79344296 + 8 [0] 253,0 1 36469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd] 253,0 1 36470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd] 253,0 1 36471 444.523967430 1068 Q WS 117078784 + 8 [xfssyncd] 253,0 1 36472 444.523970097 1068 Q WS 117078792 + 8 [xfssyncd] 253,0 1 36473 444.548753821 1068 Q WS 117078784 + 8 [xfssyncd] 253,0 1 36474 444.548756324 1068 Q WS 117078792 + 8 [xfssyncd] 253,0 1 36475 444.553960214 1068 Q WS 195314144 + 8 [xfssyncd] 253,0 1 36476 444.553962765 1068 Q WS 195314152 + 8 [xfssyncd] 253,0 3 104628 526.310490373 974 C WS 154998480 + 8 [0] 253,0 3 104629 526.310490374 974 C WS 154998488 + 8 [0] 253,0 3 104630 526.310490386 974 C WS 154998480 + 8 [0] 253,0 3 104631 526.310490387 974 C WS 154998488 + 8 [0] 253,0 3 104632 526.310565814 974 C WS 117078784 + 8 [0] 253,0 3 104633 526.310570195 974 C WS 117078792 + 8 [0] 253,0 3 104634 526.313450024 974 C WS 117078784 + 8 [0] 253,0 3 104635 526.313454317 974 C WS 117078792 + 8 [0] 253,0 1 36477 444.583070774 1068 Q WS 195314144 + 8 [xfssyncd] 253,0 1 36478 444.583075517 1068 Q WS 195314152 + 8 [xfssyncd] 253,0 1 36479 444.583954077 1068 Q WS 233141680 + 8 [xfssyncd] 253,0 1 36480 444.583956804 1068 Q WS 233141688 + 8 [xfssyncd] 253,0 1 36481 444.619241615 1068 Q WS 233165296 + 8 [xfssyncd] 253,0 1 36482 444.619247992 1068 Q WS 233165304 + 8 [xfssyncd] 253,0 3 104636 526.320490406 974 C WS 195314144 + 8 [0] 253,0 3 104637 526.320490407 974 C WS 195314152 + 8 [0] 253,0 3 104638 526.320490419 974 C WS 195314144 + 8 [0] 253,0 3 104639 526.320490420 974 C WS 195314152 + 8 [0] 253,0 3 104640 526.348720498 974 C WS 233141680 + 8 [0] 253,0 3 104641 526.348724614 974 C WS 233141688 + 8 [0] 253,0 1 36483 444.643863141 1068 Q WS 272297440 + 8 [xfssyncd] 253,0 1 36484 444.643863143 1068 Q WS 272297448 + 8 [xfssyncd] 253,0 1 36485 444.675408559 1068 Q WS 272297440 + 8 [xfssyncd] 253,0 1 36486 444.675412236 1068 Q 
WS 272297448 + 8 [xfssyncd] iostat -x 10 output follows, each line from the above vmstat should correspond to one iostat output Linux 2.6.24-rc1 (treogen) 11/06/07 avg-cpu: %user %nice %system %iowait %steal %idle 2.27 0.00 1.13 1.41 0.00 95.18 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 14.46 34.81 7.63 15.11 176.60 418.68 26.17 0.53 23.07 5.59 12.71 sdb 14.54 34.60 7.51 14.91 176.01 415.36 26.38 0.43 19.29 5.00 11.20 sdc 14.62 34.50 7.55 15.29 177.12 417.73 26.04 0.47 20.42 5.31 12.12 md1 0.00 0.00 31.99 80.06 254.70 636.20 7.95 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.04 0.00 8.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 31.88 80.06 254.53 636.20 7.96 24.99 223.19 2.56 28.68 sdd 0.46 0.00 0.10 0.00 0.88 0.00 8.99 0.00 5.25 1.91 0.02 avg-cpu: %user %nice %system %iowait %steal %idle 1.85 0.00 0.55 0.19 0.00 97.41 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 16.80 0.00 4.00 0.80 166.40 11.20 37.00 0.07 13.96 12.29 5.90 sdb 19.40 0.00 4.50 0.70 191.20 10.40 38.77 0.07 13.27 10.77 5.60 sdc 18.20 0.10 6.50 1.30 197.60 14.50 27.19 0.11 13.85 11.41 8.90 md1 0.00 0.00 69.50 0.20 556.00 0.50 7.98 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 69.50 0.20 556.00 0.50 7.98 0.67 9.53 2.38 16.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.45 0.00 0.47 0.05 0.00 99.03 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 6.70 0.10 21.90 1.10 228.80 15.20 10.61 0.43 18.70 16.70 38.40 sdb 6.00 0.10 19.30 1.00 201.60 14.40 10.64 0.33 16.40 15.22 30.90 sdc 5.70 0.20 21.50 1.50 217.60 49.30 11.60 0.40 17.22 15.13 34.80 md1 0.00 0.00 81.10 0.70 648.80 5.60 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 81.10 0.70 648.80 5.60 8.00 1.61 19.73 12.11 99.10 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
avg-cpu: %user %nice %system %iowait %steal %idle 0.94 0.00 0.79 0.02 0.00 98.24 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 4.70 0.00 21.10 1.50 206.40 22.40 10.12 0.40 17.65 16.11 36.40 sdb 6.20 0.10 20.80 1.50 216.00 23.20 10.73 0.35 15.70 13.50 30.10 sdc 5.50 0.10 23.60 2.40 232.80 57.00 11.15 0.46 17.65 14.96 38.90 md1 0.00 0.00 81.80 0.40 654.40 2.50 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 81.80 0.40 654.40 2.50 7.99 1.55 18.84 11.98 98.50 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 1.63 0.00 1.09 0.00 0.00 97.28 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 5.30 0.00 16.60 0.20 175.20 3.20 10.62 0.34 20.30 18.93 31.80 sdb 6.50 0.00 19.00 0.20 204.80 3.20 10.83 0.35 18.39 17.29 33.20 sdc 5.60 0.00 17.50 0.30 184.80 4.00 10.61 0.34 19.33 18.48 32.90 md1 0.00 0.00 70.50 0.00 564.00 0.00 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 70.50 0.00 564.00 0.00 8.00 1.43 20.30 13.49 95.10 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 9.38 0.00 3.45 1.55 0.00 85.62 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 9.30 1.60 15.50 2.20 198.40 40.00 13.47 0.27 15.08 12.71 22.50 sdb 8.40 3.50 12.00 1.80 163.20 52.00 15.59 0.19 13.62 12.54 17.30 sdc 8.30 4.10 13.70 2.90 176.00 118.90 17.77 0.24 14.58 11.93 19.80 md1 0.00 0.00 61.00 5.50 488.00 42.30 7.97 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 61.00 5.50 488.00 42.30 7.97 1.19 17.80 8.00 53.20 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 3.26 0.00 8.36 0.07 0.00 88.30 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 207.00 
584.40 181.70 79.40 3109.60 5391.20 32.56 2.28 8.74 2.42 63.10 sdb 209.60 584.30 184.20 77.50 3150.40 5373.60 32.57 2.02 7.74 1.93 50.50 sdc 195.20 589.40 198.20 80.10 3147.20 5760.80 32.01 2.37 8.51 2.33 64.80 md1 0.00 0.00 12.00 1182.20 96.00 9456.20 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 12.00 1182.80 96.00 9461.00 8.00 61.79 51.70 0.83 99.20 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 3.24 0.00 6.90 0.02 0.00 89.84 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 203.60 541.60 163.40 84.60 2936.80 5101.60 32.41 5.66 22.74 3.05 75.70 sdb 201.10 533.20 165.90 83.50 2936.80 5028.00 31.94 5.23 20.77 2.61 65.20 sdc 201.00 540.30 164.50 89.50 2924.00 5346.30 32.56 5.77 22.71 3.00 76.30 md1 0.00 0.00 0.50 1115.30 4.00 8877.30 7.96 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.50 1114.70 4.00 8872.50 7.96 93.14 81.84 0.89 99.80 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 1.16 0.00 2.32 0.00 0.00 96.52 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 56.60 154.10 32.90 54.90 716.00 1726.40 27.82 0.45 5.26 2.79 24.50 sdb 57.80 161.00 35.60 57.10 747.20 1801.60 27.50 0.48 5.65 2.80 26.00 sdc 58.00 162.30 32.00 56.70 720.00 1808.10 28.50 0.44 4.88 2.82 25.00 md1 0.00 0.00 0.00 355.90 0.00 2842.30 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 355.90 0.00 2842.30 7.99 9.02 30.64 2.71 96.50 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.31 0.00 0.52 0.00 0.00 99.17 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.70 45.00 6.10 71.40 54.40 974.40 13.27 3.15 40.67 3.50 27.10 sdb 1.90 20.90 7.50 46.70 75.20 584.00 
12.16 1.64 30.18 3.69 20.00 sdc 1.80 35.50 6.80 62.40 68.80 1055.20 16.24 1.82 26.30 3.16 21.90 md1 0.00 0.00 0.00 135.20 0.00 1038.20 7.68 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 135.20 0.00 1038.20 7.68 14.41 106.61 7.32 99.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.50 0.00 0.36 0.00 0.00 99.14 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 2.10 48.90 3.10 64.00 41.60 952.00 14.81 0.64 9.60 2.91 19.50 sdb 2.60 26.20 3.90 40.80 52.00 584.80 14.25 0.52 11.59 3.31 14.80 sdc 2.20 55.60 3.40 72.90 44.00 1076.90 14.69 0.67 8.73 2.44 18.60 md1 0.00 0.00 0.00 144.80 0.00 1158.40 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 144.80 0.00 1158.40 8.00 5.74 39.59 6.88 99.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 1.46 0.00 3.01 0.00 0.00 95.53 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 61.10 183.80 39.10 84.70 801.60 2204.80 24.28 0.82 6.62 2.05 25.40 sdb 57.80 180.00 42.70 77.70 804.00 2113.60 24.23 0.92 7.66 1.87 22.50 sdc 57.40 182.70 41.60 85.80 792.80 2200.10 23.49 1.11 8.74 2.06 26.20 md1 0.00 0.00 1.20 438.50 9.60 3507.10 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 1.20 438.50 9.60 3507.10 8.00 15.63 35.55 2.26 99.40 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.43 0.00 1.07 0.21 0.00 98.29 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.90 22.00 7.50 42.20 75.20 557.60 12.73 2.54 51.07 4.47 22.20 sdb 1.10 58.70 6.50 82.00 60.80 1169.60 13.90 2.69 30.36 2.49 22.00 sdc 0.90 59.50 6.90 83.50 62.40 1409.60 16.28 3.13 34.67 2.78 25.10 md1 0.00 0.00 0.10 168.70 0.80 1305.80 7.74 0.00 0.00 
0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.10 168.70 0.80 1305.80 7.74 15.74 93.27 5.90 99.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 1.79 0.00 3.94 0.00 0.00 94.27 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 80.40 200.50 43.10 64.10 976.00 2158.40 29.24 0.39 3.64 1.73 18.50 sdb 77.30 232.60 44.80 93.80 968.80 2655.20 26.15 0.34 2.47 1.17 16.20 sdc 67.00 244.30 52.60 103.50 944.80 2826.50 24.16 0.39 2.45 1.13 17.60 md1 0.00 0.00 0.20 532.90 1.60 4260.50 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.20 533.70 1.60 4266.90 7.99 11.08 20.71 1.87 99.90 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 1.04 0.00 1.40 0.00 0.00 97.55 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 40.80 138.90 18.70 77.50 488.00 1768.00 23.45 0.46 4.76 1.57 15.10 sdb 41.30 115.90 19.80 58.80 496.80 1436.00 24.59 0.83 10.61 2.09 16.40 sdc 35.20 149.50 25.90 89.40 500.80 1952.10 21.27 1.01 8.77 1.59 18.30 md1 0.00 0.00 0.10 335.50 0.80 2681.70 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.10 334.70 0.80 2675.30 7.99 9.89 29.61 2.94 98.30 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.44 0.00 1.08 0.49 0.00 97.99 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 4.50 37.20 9.60 63.60 112.80 853.60 13.20 1.83 25.01 2.90 21.20 sdb 3.80 58.90 9.90 82.40 109.60 1177.60 13.95 1.68 18.15 2.36 21.80 sdc 3.90 49.30 9.20 72.70 104.80 1327.20 17.48 2.09 25.53 2.94 24.10 md1 0.00 0.00 11.20 176.20 89.60 1362.00 7.75 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 11.20 176.20 89.60 1362.00 7.75 10.78 57.52 
5.34 100.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.37 0.00 1.26 0.00 0.00 98.37 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.70 48.70 2.10 69.40 30.40 983.20 14.18 0.22 3.02 1.48 10.60 sdb 2.40 55.90 2.80 76.20 41.60 1095.20 14.39 0.24 3.05 1.28 10.10 sdc 1.20 57.50 1.50 80.70 21.60 1143.50 14.17 0.23 2.76 1.24 10.20 md1 0.00 0.00 0.00 186.70 0.00 1493.60 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 186.70 0.00 1493.60 8.00 2.63 14.10 5.33 99.50 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.18 0.00 0.60 0.02 0.00 99.19 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 3.50 51.10 5.10 68.70 68.80 1005.60 14.56 1.82 24.61 4.32 31.90 sdb 4.10 46.00 6.50 62.50 84.80 915.20 14.49 1.29 18.68 3.74 25.80 sdc 4.90 32.40 7.00 47.70 95.20 688.90 14.33 1.31 23.97 4.26 23.30 md1 0.00 0.00 0.00 145.00 0.00 1160.00 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 145.00 0.00 1160.00 8.00 12.36 85.27 6.59 95.50 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 3.26 0.00 5.29 0.00 0.00 91.45 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 136.10 375.40 101.70 75.40 1902.40 3672.00 31.48 0.57 3.24 1.87 33.20 sdb 150.60 372.90 88.50 77.50 1912.80 3664.80 33.60 0.57 3.43 1.90 31.60 sdc 141.30 388.80 95.20 88.40 1892.00 4198.40 33.17 0.69 3.76 1.98 36.40 md1 0.00 0.00 0.30 813.90 2.40 6509.20 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.30 813.90 2.40 6509.20 8.00 22.48 27.60 1.22 99.20 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.23 0.00 0.54 0.00 0.00 
99.22 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.60 33.50 7.80 55.20 75.20 759.20 13.24 2.85 45.32 4.52 28.50 sdb 1.40 36.20 7.40 58.30 70.40 805.60 13.33 3.34 50.84 4.35 28.60 sdc 1.10 32.20 7.10 53.40 65.60 733.90 13.21 3.20 52.89 4.60 27.80 md1 0.00 0.00 0.00 128.60 0.00 984.00 7.65 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 128.60 0.00 984.00 7.65 19.93 154.97 7.60 97.80 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.42 0.00 0.56 0.00 0.00 99.02 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.10 54.10 1.30 61.60 19.20 967.20 15.68 0.29 4.55 2.81 17.70 sdb 2.30 26.40 2.40 34.90 37.60 532.00 15.27 0.27 7.10 4.29 16.00 sdc 0.90 50.00 1.00 59.20 15.20 914.50 15.44 0.27 4.47 2.59 15.60 md1 0.00 0.00 0.00 134.80 0.00 1078.40 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 134.80 0.00 1078.40 8.00 2.40 17.86 7.31 98.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.45 0.00 0.66 0.00 0.00 98.88 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 4.70 56.50 7.40 75.90 96.80 1116.00 14.56 1.67 20.07 4.23 35.20 sdb 4.20 42.70 6.80 62.70 88.00 900.00 14.22 1.30 18.75 3.68 25.60 sdc 5.20 52.90 8.10 71.50 106.40 1168.80 16.02 1.73 21.68 4.48 35.70 md1 0.00 0.00 0.00 170.20 0.00 1361.60 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 170.20 0.00 1361.60 8.00 17.84 104.81 5.80 98.80 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.47 0.00 0.66 0.00 0.00 98.87 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.30 48.20 1.30 53.80 20.00 856.80 15.91 0.28 5.08 3.36 18.50 sdb 1.60 
45.40 1.60 51.50 25.60 816.00 15.85 0.28 5.24 3.15 16.70 sdc 1.60 41.80 1.70 47.60 26.40 755.70 15.86 0.28 5.58 3.39 16.70 md1 0.00 0.00 0.00 136.40 0.00 1091.20 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 136.40 0.00 1091.20 8.00 2.48 18.15 7.17 97.80 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.45 0.00 0.78 0.00 0.00 98.77 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.70 48.00 0.80 52.30 12.80 833.60 15.94 0.22 4.16 2.34 12.40 sdb 1.50 38.20 1.50 42.90 24.00 680.00 15.86 0.19 4.30 2.36 10.50 sdc 0.40 60.20 0.40 64.80 6.40 1030.50 15.90 0.29 4.51 2.55 16.60 md1 0.00 0.00 0.00 147.20 0.00 1177.60 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 147.20 0.00 1177.60 8.00 2.34 15.89 6.72 98.90 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.42 0.00 0.37 0.00 0.00 99.21 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.60 53.00 0.70 63.30 10.40 971.20 15.34 0.26 4.05 2.64 16.90 sdb 1.00 38.80 1.10 53.90 16.80 782.40 14.53 0.50 9.09 3.55 19.50 sdc 0.90 40.50 1.00 51.60 15.20 906.40 17.52 0.24 4.54 2.68 14.10 md1 0.00 0.00 0.00 142.60 0.00 1140.80 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 142.60 0.00 1140.80 8.00 2.33 16.33 6.90 98.40 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 4.39 0.00 3.08 1.74 0.00 90.79 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 42.10 103.40 17.80 63.40 478.40 1374.40 22.82 0.84 10.38 4.25 34.50 sdb 42.80 95.10 17.20 48.30 480.00 1189.60 25.49 0.45 6.90 3.97 26.00 sdc 45.90 100.60 18.60 57.50 516.00 1304.90 23.93 0.60 7.83 4.34 33.00 md1 0.00 0.00 47.60 252.40 380.80 
2017.10 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 47.60 252.40 380.80 2017.10 7.99 7.29 24.30 3.28 98.50 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.99 0.00 1.22 0.00 0.00 97.79 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 15.50 34.60 28.70 56.50 354.40 780.80 13.32 5.49 60.46 6.27 53.40 sdb 14.70 32.80 28.50 53.80 345.60 745.60 13.26 3.42 41.51 5.55 45.70 sdc 14.00 23.00 27.00 44.30 328.00 590.50 12.88 3.54 49.71 6.80 48.50 md1 0.00 0.00 101.40 119.50 811.20 912.60 7.80 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 101.40 119.50 811.20 912.60 7.80 22.75 32.05 4.51 99.70 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.70 0.00 0.68 0.00 0.00 98.62 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.50 35.50 2.20 44.10 29.60 682.40 15.38 0.29 13.59 3.82 17.70 sdb 1.50 42.50 2.10 51.90 28.80 799.20 15.33 0.29 5.39 3.20 17.30 sdc 1.60 36.90 1.90 47.70 28.00 908.80 18.89 0.29 5.87 3.55 17.60 md1 0.00 0.00 0.00 116.60 0.00 932.10 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 116.60 0.00 932.10 7.99 2.73 157.86 8.45 98.50 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.76 0.00 0.47 0.00 0.00 98.77 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.40 49.80 0.40 65.00 6.40 961.60 14.80 0.23 3.46 2.16 14.10 sdb 1.20 28.90 1.50 39.20 21.60 588.00 14.98 0.18 4.32 2.87 11.70 sdc 0.80 43.30 1.10 53.80 15.20 819.50 15.20 0.24 4.28 2.84 15.60 md1 0.00 0.00 0.00 131.80 0.00 1054.40 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 131.80 0.00 1054.40 8.00 2.00 15.14 7.48 
98.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.70 0.00 0.75 0.00 0.00 98.55 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.90 43.80 1.60 64.30 20.00 904.80 14.03 0.30 4.63 2.47 16.30 sdb 1.20 30.30 1.70 50.70 23.20 688.00 13.57 0.25 4.85 2.67 14.00 sdc 0.90 28.50 1.90 46.50 22.40 639.30 13.67 0.30 6.28 3.33 16.10 md1 0.00 0.00 0.00 124.40 0.00 994.00 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 124.40 0.00 994.00 7.99 2.19 17.60 7.97 99.10 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.74 0.00 0.34 0.00 0.00 98.92 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.90 50.40 1.20 62.40 16.80 942.40 15.08 0.22 3.49 1.76 11.20 sdb 0.30 53.50 0.40 64.50 5.60 984.00 15.25 0.18 2.82 1.48 9.60 sdc 1.60 34.30 2.00 47.20 28.80 801.60 16.88 0.25 5.04 2.60 12.80 md1 0.00 0.00 0.00 148.40 0.00 1185.80 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 148.40 0.00 1185.80 7.99 2.11 14.23 6.70 99.50 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 1.18 0.00 0.35 0.00 0.00 98.47 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.50 47.80 0.70 60.00 9.60 892.80 14.87 0.20 3.29 1.86 11.30 sdb 1.10 38.10 1.30 48.40 19.20 722.40 14.92 0.17 3.48 2.15 10.70 sdc 0.80 41.90 0.80 57.00 12.80 821.10 14.43 0.21 3.55 1.87 10.80 md1 0.00 0.00 0.00 140.80 0.00 1126.40 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 140.80 0.00 1126.40 8.00 1.98 14.06 7.03 99.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.38 0.00 0.38 0.00 0.00 99.25 Device: rrqm/s wrqm/s r/s w/s 
rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.40 42.10 1.50 55.20 23.20 808.80 14.67 0.18 3.26 1.94 11.00 sdb 0.70 39.90 0.80 52.10 12.00 766.40 14.71 0.20 3.71 2.04 10.80 sdc 1.20 38.40 1.40 49.20 20.80 730.50 14.85 0.22 4.45 2.29 11.60 md1 0.00 0.00 0.00 132.60 0.00 1060.80 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 132.60 0.00 1060.80 8.00 1.97 14.83 7.39 98.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.33 0.00 0.49 0.00 0.00 99.18 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.80 40.70 1.10 47.20 15.20 735.20 15.54 0.20 4.12 2.28 11.00 sdb 0.50 47.00 0.90 53.60 11.20 836.80 15.56 0.22 4.04 2.51 13.70 sdc 0.90 40.80 1.00 48.40 15.20 857.60 17.67 0.21 4.23 2.31 11.40 md1 0.00 0.00 0.00 132.60 0.00 1060.80 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 132.60 0.00 1060.80 8.00 1.97 14.85 7.44 98.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.40 0.00 0.40 0.00 0.00 99.20 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.90 42.90 1.00 53.60 15.20 800.80 14.95 0.21 3.75 1.96 10.70 sdb 1.10 42.10 1.10 50.80 17.60 772.00 15.21 0.18 3.56 2.02 10.50 sdc 0.70 50.20 0.80 60.20 12.00 911.50 15.14 0.20 3.25 1.90 11.60 md1 0.00 0.00 0.00 144.40 0.00 1155.20 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 144.40 0.00 1155.20 8.00 1.98 13.70 6.85 98.90 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.75 0.00 0.71 0.00 0.00 98.54 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.70 45.00 0.70 49.70 11.20 783.20 15.76 0.18 3.63 2.12 10.70 sdb 0.70 38.00 0.70 43.20 11.20 675.20 15.64 0.17 3.78 2.10 
9.20 sdc 1.00 40.20 1.00 46.20 16.00 716.10 15.51 0.20 4.13 2.27 10.70 md1 0.00 0.00 0.00 126.20 0.00 1009.60 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 126.20 0.00 1009.60 8.00 1.96 15.50 7.75 97.80 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.42 0.00 0.44 0.00 0.00 99.14 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.50 41.50 0.90 53.80 11.20 780.00 14.46 0.20 3.62 2.21 12.10 sdb 0.70 39.80 0.90 60.40 12.80 819.20 13.57 0.24 3.92 1.89 11.60 sdc 0.90 32.70 1.30 46.60 17.60 763.20 16.30 0.23 4.84 2.61 12.50 md1 0.00 0.00 0.00 133.80 0.00 1070.40 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 133.80 0.00 1070.40 8.00 1.97 14.73 7.37 98.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.40 0.00 0.35 0.00 0.00 99.25 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.80 43.20 1.00 55.50 14.40 809.60 14.58 0.20 3.52 1.98 11.20 sdb 0.60 45.20 0.80 56.30 11.20 832.00 14.77 0.20 3.43 1.89 10.80 sdc 1.10 39.80 1.10 53.70 17.60 767.30 14.32 0.20 3.72 2.04 11.20 md1 0.00 0.00 0.00 143.00 0.00 1144.00 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 143.00 0.00 1144.00 8.00 1.98 13.86 6.94 99.20 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.40 0.00 0.47 0.00 0.00 99.14 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.00 44.80 1.00 48.70 16.00 773.60 15.89 0.20 4.08 2.01 10.00 sdb 0.90 49.00 0.90 52.90 14.40 840.80 15.90 0.22 4.01 2.23 12.00 sdc 1.30 39.20 1.30 44.10 20.80 691.30 15.69 0.23 5.11 3.02 13.70 md1 0.00 0.00 0.00 134.40 0.00 1075.20 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 134.40 0.00 1075.20 8.00 1.95 14.51 7.25 97.40 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.45 0.00 0.45 0.00 0.00 99.10 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 1.40 35.30 1.50 44.00 23.20 679.20 15.44 0.24 5.34 3.21 14.60 sdb 0.30 51.10 0.50 59.00 6.40 925.60 15.66 0.24 4.12 2.69 16.00 sdc 1.10 29.50 1.40 39.20 20.00 705.60 17.87 0.24 5.94 3.52 14.30 md1 0.00 0.00 0.00 120.40 0.00 963.20 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 120.40 0.00 963.20 8.00 1.99 16.53 8.27 99.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.60 0.00 0.48 0.00 0.00 98.92 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.50 47.20 0.50 51.50 8.00 819.20 15.91 0.27 5.21 3.75 19.50 sdb 1.40 45.00 1.50 49.00 23.20 781.60 15.94 0.23 4.48 2.95 14.90 sdc 1.60 30.20 1.70 34.40 26.40 545.70 15.85 0.18 5.01 2.99 10.80 md1 0.00 0.00 0.00 123.00 0.00 984.00 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 123.00 0.00 984.00 8.00 1.96 15.92 7.95 97.80 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.50 0.00 0.47 0.00 0.00 99.03 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 5.10 21.60 6.30 36.80 91.20 548.00 14.83 0.46 10.56 8.12 35.00 sdb 6.10 20.80 7.10 35.90 105.60 534.40 14.88 0.38 8.79 6.67 28.70 sdc 3.80 22.80 4.70 38.70 68.00 572.10 14.75 0.43 9.86 7.26 31.50 md1 0.00 0.00 0.00 73.00 0.00 584.00 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 73.00 0.00 584.00 8.00 1.98 27.15 13.62 99.40 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait 
%steal %idle 0.78 0.00 0.60 0.00 0.00 98.62 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 5.50 27.10 6.10 37.10 92.80 569.60 15.33 0.39 9.03 6.18 26.70 sdb 7.20 23.60 8.10 33.50 122.40 513.60 15.29 0.33 7.96 5.84 24.30 sdc 7.00 25.80 7.90 35.70 119.20 628.80 17.16 0.42 9.59 7.02 30.60 md1 0.00 0.00 0.00 80.40 0.00 643.20 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 80.40 0.00 643.20 8.00 1.99 24.70 12.39 99.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.21 0.00 0.37 0.00 0.00 99.42 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 4.70 25.70 5.60 34.80 82.40 524.00 15.01 0.29 7.13 5.25 21.20 sdb 4.60 26.20 5.40 35.80 80.00 535.20 14.93 0.28 6.77 4.59 18.90 sdc 4.90 25.70 5.70 35.60 84.80 529.10 14.86 0.35 8.38 6.54 27.00 md1 0.00 0.00 0.00 84.60 0.00 676.80 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 84.60 0.00 676.80 8.00 1.98 23.43 11.69 98.90 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.30 0.00 0.28 0.00 0.00 99.41 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 2.20 17.20 2.40 23.40 36.80 357.60 15.29 0.20 7.83 5.97 15.40 sdb 2.10 15.20 2.50 20.70 36.80 320.00 15.38 0.16 7.03 4.91 11.40 sdc 3.90 13.00 4.20 19.30 64.80 290.50 15.12 0.26 10.94 8.30 19.50 md1 0.00 0.00 0.00 47.90 0.00 382.60 7.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 47.90 0.00 382.60 7.99 1.12 23.36 11.77 56.40 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 0.39 0.00 0.24 0.02 0.00 99.35 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.00 0.00 0.90 0.00 14.40 16.00 0.03 28.89 7.78 
0.70 sdb 0.00 0.00 0.00 0.90 0.00 14.40 16.00 0.03 31.11 7.78 0.70 sdc 0.00 0.10 0.00 1.30 0.00 64.80 49.85 0.06 44.62 10.77 1.40 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ^ permalink raw reply [flat|nested] 39+ messages in thread
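A quick sanity check on the btrace summary in the message above. This is only a sketch that uses the totals Torsten posted; the variable names and the 512-byte sector size are assumptions, not something stated in the thread. It suggests that nearly every write was a single 4 KiB request, which matches the "+ 8" sector counts in the trace excerpt and the avgrq-sz of about 8 reported for dm-0.

```python
# Totals copied from the "Total (253,0)" btrace summary in the message above.
writes_queued = 91_076
write_kib = 362_640
reads_queued = 5_385
read_kib = 21_540
write_throughput_kib_s = 654  # write side of "Throughput (R/W)"

# Assumption: sector counts are in the usual 512-byte units.
SECTOR_BYTES = 512


def avg_request_kib(total_kib, requests):
    """Average size of one request in KiB."""
    return total_kib / requests


avg_write_kib = avg_request_kib(write_kib, writes_queued)
avg_read_kib = avg_request_kib(read_kib, reads_queued)

# Convert the average write size to sectors to compare with "+ 8" in the trace.
avg_write_sectors = avg_write_kib * 1024 / SECTOR_BYTES

# Rough duration covered by the trace and the implied average write rate.
approx_trace_seconds = write_kib / write_throughput_kib_s
avg_writes_per_second = write_throughput_kib_s / avg_write_kib

print(f"average write size : {avg_write_kib:.2f} KiB ({avg_write_sectors:.2f} sectors)")
print(f"average read size  : {avg_read_kib:.2f} KiB")
print(f"trace duration     : about {approx_trace_seconds:.0f} s")
print(f"average write rate : about {avg_writes_per_second:.0f} requests/s")
```

The point of the arithmetic is that dm-0 reports close to 100% utilization while completing only small 4 KiB writes at a modest rate, so the device is busy without moving much data.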
* Re: writeout stalls in current -git
  2007-11-06 21:53 ` Torsten Kaiser
@ 2007-11-06 23:31 ` David Chinner
  2007-11-07  2:13 ` David Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: David Chinner @ 2007-11-06 23:31 UTC
  To: Torsten Kaiser
  Cc: Fengguang Wu, Peter Zijlstra, Maxim Levitsky, linux-kernel,
      Andrew Morton, David Chinner, linux-fsdevel

On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> On 11/6/07, David Chinner <dgc@sgi.com> wrote:
> > Rather than vmstat, can you use something like iostat to show how busy your
> > disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
>
> Both "vmstat 10" and "iostat -x 10" output from this test:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  2  0      0 3700592      0  85424    0    0    31    83  108  244  2  1 95  1
> -> emerge reads something, don't know for sure what...
>  1  0      0 3665352      0  87940    0    0   239     2  343  585  2  1 97  0
....
>
> The last 20% of the btrace look more or less completely like this, no
> other programs do any IO...
>
> 253,0    3   104626   526.293450729   974  C  WS 79344288 + 8 [0]
> 253,0    3   104627   526.293455078   974  C  WS 79344296 + 8 [0]
> 253,0    1    36469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
> 253,0    1    36470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
                                                ^^
Apparently we are doing synchronous writes. That would explain why
it is slow. We shouldn't be doing synchronous writes here. I'll see if
I can reproduce this.

<goes off and looks>

Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
look into this further and send a patch when I have something concrete.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: writeout stalls in current -git
  2007-11-06 23:31 ` David Chinner
@ 2007-11-07  2:13 ` David Chinner
  2007-11-07  7:15 ` Torsten Kaiser
  0 siblings, 1 reply; 39+ messages in thread
From: David Chinner @ 2007-11-07  2:13 UTC
  To: David Chinner
  Cc: Torsten Kaiser, Fengguang Wu, Peter Zijlstra, Maxim Levitsky,
      linux-kernel, Andrew Morton, linux-fsdevel

On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
> On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> > On 11/6/07, David Chinner <dgc@sgi.com> wrote:
> > > Rather than vmstat, can you use something like iostat to show how busy your
> > > disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
> >
> > Both "vmstat 10" and "iostat -x 10" output from this test:
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >  r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >  2  0      0 3700592      0  85424    0    0    31    83  108  244  2  1 95  1
> > -> emerge reads something, don't know for sure what...
> >  1  0      0 3665352      0  87940    0    0   239     2  343  585  2  1 97  0
> ....
> >
> > The last 20% of the btrace look more or less completely like this, no
> > other programs do any IO...
> >
> > 253,0    3   104626   526.293450729   974  C  WS 79344288 + 8 [0]
> > 253,0    3   104627   526.293455078   974  C  WS 79344296 + 8 [0]
> > 253,0    1    36469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
> > 253,0    1    36470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
>                                                 ^^
> Apparently we are doing synchronous writes. That would explain why
> it is slow. We shouldn't be doing synchronous writes here. I'll see if
> I can reproduce this.
>
> <goes off and looks>
>
> Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
> look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely so we are writing the
same block multiple times i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change as we haven't
changed anything in the reclaim code recently.....

/me scurries off to run some tests

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being
read in or allocated. This is a regression, introduced by this
mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_iget.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
 	icl = NULL;
 	if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
 							first_index, 1)) {
-		if ((iq->i_ino & mask) == first_index)
+		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
 			icl = iq->i_cluster;
 	}
 
* Re: writeout stalls in current -git
  2007-11-07  2:13 ` David Chinner
@ 2007-11-07  7:15 ` Torsten Kaiser
  2007-11-08  0:38 ` David Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Torsten Kaiser @ 2007-11-07  7:15 UTC
  To: David Chinner
  Cc: Fengguang Wu, Peter Zijlstra, Maxim Levitsky, linux-kernel,
      Andrew Morton, linux-fsdevel

On 11/7/07, David Chinner <dgc@sgi.com> wrote:
> Ok, so it's not synchronous writes that we are doing - we're just
> submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> The "synchronous" nature appears to be coming from higher level
> locking when reclaiming inodes (on the flush lock). It appears that
> inode write clustering is failing completely so we are writing the
> same block multiple times i.e. once for each inode in the cluster we
> have to write.

Works for me. The only remaining stalls are sub second and look
completely valid, considering the amount of files being removed.

iostat 10 from this test:
 3  0      0 3500192    332 204956    0    0   105  8512 1809 6473  6 10 83  1
 0  0      0 3500200    332 204576    0    0     0  4367 1355 3712  2  6 92  0
 2  0      0 3504264    332 203528    0    0     0  6805 1912 4967  4  8 88  0
 0  0      0 3511632    332 203528    0    0     0  2843  805 1791  2  4 94  0
 0  0      0 3516852    332 203516    0    0     0  3375  879 2712  3  5 93  0
 0  0      0 3530544    332 202668    0    0   186   776  488 1152  4  2 89  4
 0  0      0 3574788    332 204960    0    0   226   326  358  787  0  1 98  0
 0  0      0 3576820    332 204960    0    0     0   376  332  737  0  0 99  0
 0  0      0 3578432    332 204960    0    0     0   356  293  606  1  1 99  0
 0  0      0 3580192    332 204960    0    0     0   101  104  384  0  0 99  0

I'm pleased to note that this is now much faster again.
Thanks!

Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>

CC's please note: It looks like this was really a different problem
than the 100% iowait that was seen with reiserfs.
Also the one complete stall I have seen is probably something else.
But I have not been able to reproduce this again with -mm and have
never seen this on mainline, so I will just ignore that single event
until I see it again.

Torsten

> ---
>  fs/xfs/xfs_iget.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.000000000 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-11-07 13:08:42.534440675 +1100
> @@ -248,7 +248,7 @@ finish_inode:
>  	icl = NULL;
>  	if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
>  							first_index, 1)) {
> -		if ((iq->i_ino & mask) == first_index)
> +		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
>  			icl = iq->i_cluster;
>  	}
>
>
* Re: writeout stalls in current -git
  2007-11-07  7:15 ` Torsten Kaiser
@ 2007-11-08  0:38 ` David Chinner
  2007-11-20 13:16 ` Damien Wyart
  0 siblings, 1 reply; 39+ messages in thread
From: David Chinner @ 2007-11-08  0:38 UTC
  To: Torsten Kaiser
  Cc: David Chinner, Fengguang Wu, Peter Zijlstra, Maxim Levitsky,
      linux-kernel, Andrew Morton, linux-fsdevel

On Wed, Nov 07, 2007 at 08:15:06AM +0100, Torsten Kaiser wrote:
> On 11/7/07, David Chinner <dgc@sgi.com> wrote:
> > Ok, so it's not synchronous writes that we are doing - we're just
> > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > The "synchronous" nature appears to be coming from higher level
> > locking when reclaiming inodes (on the flush lock). It appears that
> > inode write clustering is failing completely so we are writing the
> > same block multiple times i.e. once for each inode in the cluster we
> > have to write.
>
> Works for me. The only remaining stalls are sub second and look
> completely valid, considering the amount of files being removed.
....
> Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>

Great - thanks for reporting the problem and testing the fix.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: writeout stalls in current -git
  2007-11-08  0:38 ` David Chinner
@ 2007-11-20 13:16 ` Damien Wyart
  2007-11-20 21:09 ` David Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Damien Wyart @ 2007-11-20 13:16 UTC
  To: David Chinner
  Cc: Torsten Kaiser, Fengguang Wu, Peter Zijlstra, Maxim Levitsky,
      linux-kernel, Andrew Morton, linux-fsdevel

Hello,

> > > Ok, so it's not synchronous writes that we are doing - we're just
> > > submitting bio's tagged as WRITE_SYNC to get the I/O issued
> > > quickly. The "synchronous" nature appears to be coming from higher
> > > level locking when reclaiming inodes (on the flush lock). It
> > > appears that inode write clustering is failing completely so we
> > > are writing the same block multiple times i.e. once for each inode
> > > in the cluster we have to write.

> > Works for me. The only remaining stalls are sub second and look
> > completely valid, considering the amount of files being removed.
> ....
> > Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>

* David Chinner <dgc@sgi.com> [2007-11-08 11:38]:
> Great - thanks for reporting the problem and testing the fix.

This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
Maybe the fix can wait for 2.6.25, but wanted to make sure...

-- 
Damien Wyart
* Re: writeout stalls in current -git
  2007-11-20 13:16 ` Damien Wyart
@ 2007-11-20 21:09 ` David Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: David Chinner @ 2007-11-20 21:09 UTC
  To: Damien Wyart
  Cc: David Chinner, Torsten Kaiser, Fengguang Wu, Peter Zijlstra,
      Maxim Levitsky, linux-kernel, Andrew Morton, linux-fsdevel

On Tue, Nov 20, 2007 at 02:16:17PM +0100, Damien Wyart wrote:
> Hello,
>
> > > > Ok, so it's not synchronous writes that we are doing - we're just
> > > > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > > > The "synchronous" nature appears to be coming from higher level
> > > > locking when reclaiming inodes (on the flush lock). It appears that
> > > > inode write clustering is failing completely so we are writing the
> > > > same block multiple times i.e. once for each inode in the cluster we
> > > > have to write.
>
> > > Works for me. The only remaining stalls are sub second and look
> > > completely valid, considering the amount of files being removed.
> > ....
> > > Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
>
> * David Chinner <dgc@sgi.com> [2007-11-08 11:38]:
> > Great - thanks for reporting the problem and testing the fix.
>
> This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
> Maybe the fix can wait for 2.6.25, but wanted to make sure...

The patch is in the XFS dev tree being QA'd, and we will push it to
2.6.24-rcX in the next few days.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: writeout stalls in current -git
       [not found] ` <E1IpKZ4-0004je-Lb@localhost>
  2007-11-06  9:17 ` Fengguang Wu
@ 2007-11-06  9:17 ` Fengguang Wu
  1 sibling, 0 replies; 39+ messages in thread
From: Fengguang Wu @ 2007-11-06  9:17 UTC
  To: Torsten Kaiser
  Cc: Peter Zijlstra, Maxim Levitsky, linux-kernel, Andrew Morton,
      David Chinner, linux-fsdevel

On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [  547.200000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858
> global 12829 72 0 wc __ tw 0 sk 0
> [  550.480000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834
> global 12017 62 0 wc __ tw 0 sk 0
> [  552.710000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810
> global 11133 83 0 wc __ tw 0 sk 0
> [  558.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786
> global 10470 33 0 wc _M tw 0 sk 0 4s
> [  562.750000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762
> global 10555 69 0 wc _M tw 0 sk 0 3s
> [  565.150000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738
> global 9562 498 0 wc _M tw -2 sk 0 4s
> [  569.490000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712
> global 8960 2 0 wc _M tw 0 sk 0 3s
> [  572.910000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688
> global 8088 205 0 wc _M tw -13 sk 0 2s
> [  574.610000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651
> global 7114 188 0 wc _M tw -1 sk 0 10s
> [  584.270000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626
> global 14544 0 0 wc _M tw -1 sk 0 9s
> [  593.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601
> global 24583 736 0 wc _M tw -1 sk 0 7s
> [  600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576
> global 27004 6 0 wc _M tw 587 sk 0
> [  600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139
> global 27004 6 0 wc __ tw 1014 sk 0

The above messages and the below 'D' state pdflush indicate that one
single writeback_inodes(4MB) call takes a long time (up to 10s!) to
complete.

Let's try reverting the below patch with `patch -R`? It looks like the
most relevant change - if it's not a low level bug.

> [note] first stall, the output from emerge stops, so it seems it can
> not start processing the next file until the stall ends
> [  630.000000] SysRq : Emergency Sync
> [  630.120000] Emergency Sync complete
> [  632.850000] SysRq : Show Blocked State
> [  632.850000]   task                        PC stack   pid father
> [  632.850000] pdflush       D ffff81000f091788     0   285      2
> [  632.850000]  ffff810005d4da80 0000000000000046 0000000000000800
> 0000007000000001
> [  632.850000]  ffff81000fd52400 ffffffff8022d61c ffffffff80819b00
> ffffffff80819b00
> [  632.850000]  ffffffff80815f40 ffffffff80819b00 ffff810100316f98
> 0000000000000000
> [  632.850000] Call Trace:
> [  632.850000]  [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
> [  632.850000]  [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> [  632.850000]  [<ffffffff805b16e7>] __down+0xa7/0x11e
> [  632.850000]  [<ffffffff8022da70>] default_wake_function+0x0/0x10
> [  632.850000]  [<ffffffff805b1365>] __down_failed+0x35/0x3a
> [  632.850000]  [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
> [  632.850000]  [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
> [  632.850000]  [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
> [  632.850000]  [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
> [  632.850000]  [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
> [  632.850000]  [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
> [  632.850000]  [<ffffffff8026b293>] write_cache_pages+0x123/0x330
> [  632.850000]  [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
> [  632.850000]  [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60
> [  632.850000]  [<ffffffff8036ed49>] xfs_inode_flush+0x179/0x1b0
> [  632.850000]  [<ffffffff8037ca8f>] xfs_fs_write_inode+0x2f/0x90
> [  632.850000]  [<ffffffff802b3aac>] __writeback_single_inode+0x2ac/0x380
> [  632.850000]  [<ffffffff804d074e>] dm_table_any_congested+0x2e/0x80
> [  632.850000]  [<ffffffff802b3f9d>] generic_sync_sb_inodes+0x20d/0x330
> [  632.850000]  [<ffffffff802b4532>] writeback_inodes+0xa2/0xe0
> [  632.850000]  [<ffffffff8026bfd6>] wb_kupdate+0xa6/0x140
> [  632.850000]  [<ffffffff8026c4b0>] pdflush+0x0/0x1e0
> [  632.850000]  [<ffffffff8026c5c0>] pdflush+0x110/0x1e0
> [  632.850000]  [<ffffffff8026bf30>] wb_kupdate+0x0/0x140
> [  632.850000]  [<ffffffff8024a32b>] kthread+0x4b/0x80
> [  632.850000]  [<ffffffff8020c9d8>] child_rip+0xa/0x12
> [  632.850000]  [<ffffffff8024a2e0>] kthread+0x0/0x80
> [  632.850000]  [<ffffffff8020c9ce>] child_rip+0x0/0x12
> [  632.850000]
> [  632.850000] emerge        D 0000000000000000     0  6220   6129
> [  632.850000]  ffff810103ced9f8 0000000000000086 0000000000000000
> 0000007000000001
> [  632.850000]  ffff81000fd52cf8 ffffffff00000000 ffffffff80819b00
> ffffffff80819b00
> [  632.850000]  ffffffff80815f40 ffffffff80819b00 ffff810103ced9b8
> ffff810103ced9a8
> [  632.850000] Call Trace:
> [  632.850000]  [<ffffffff805b16e7>] __down+0xa7/0x11e
> [  632.850000]  [<ffffffff8022da70>] default_wake_function+0x0/0x10
> [  632.850000]  [<ffffffff805b1365>] __down_failed+0x35/0x3a
> [  632.850000]  [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
> [  632.850000]  [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
> [  632.850000]  [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
> [  632.850000]  [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
> [  632.850000]  [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
> [  632.850000]  [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
> [  632.850000]  [<ffffffff80375bee>] xfs_buf_rele+0x2e/0xd0
> [  632.850000]  [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
> [  632.850000]  [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60
> [  632.850000]  [<ffffffff80355c82>] xfs_inode_item_push+0x12/0x20
> [  632.850000]  [<ffffffff80368247>] xfs_trans_push_ail+0x267/0x2b0
> [  632.850000]  [<ffffffff8035c742>] xfs_log_reserve+0x72/0x120
> [  632.850000]  [<ffffffff80366bf8>] xfs_trans_reserve+0xa8/0x210
> [  632.850000]  [<ffffffff803731f2>] kmem_zone_zalloc+0x32/0x50
> [  632.850000]  [<ffffffff8035263b>] xfs_itruncate_finish+0xfb/0x310
> [  632.850000]  [<ffffffff8036daeb>] xfs_free_eofblocks+0x23b/0x280
> [  632.850000]  [<ffffffff80371f93>] xfs_release+0x153/0x200
> [  632.850000]  [<ffffffff80378010>] xfs_file_release+0x10/0x20
> [  632.850000]  [<ffffffff80294251>] __fput+0xb1/0x220
> [  632.850000]  [<ffffffff802910a4>] filp_close+0x54/0x90
> [  632.850000]  [<ffffffff802929bf>] sys_close+0x9f/0x100
> [  632.850000]  [<ffffffff8020bbbe>] system_call+0x7e/0x83
> [  632.850000]
> [  662.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 73045
> global 39157 0 0 wc __ tw 0 sk 0
> [note] emerge resumed
> [  664.030000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
> showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
> Unmount shoW-blocked-tasks

------------------------------------------------------
Subject: writeback: remove pages_skipped accounting in __block_write_full_page()
From: Fengguang Wu <wfg@mail.ustc.edu.cn>

Miklos Szeredi <miklos@szeredi.hu> and I identified a writeback bug:

> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.

It can be reproduced by the following test scheme:

# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done

# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023

% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x

% dmesg|grep 847824 # generated by a debug printk
[  529.263184] redirtied inode 847824 line 548
[  564.250872] redirtied inode 847824 line 548
[  594.272797] redirtied inode 847824 line 548
[  629.231330] redirtied inode 847824 line 548
[  659.224674] redirtied inode 847824 line 548
[  689.219890] redirtied inode 847824 line 548
[  724.226655] redirtied inode 847824 line 548
[  759.198568] redirtied inode 847824 line 548

# line 548 in fs/fs-writeback.c:
543                 if (wbc->pages_skipped != pages_skipped) {
544                         /*
545                          * writeback is not making progress due to locked
546                          * buffers.  Skip this inode for now.
547                          */
548                         redirty_tail(inode);
549                 }

More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basically no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.

Also the comment in generic_sync_sb_inodes():

544                         /*
545                          * writeback is not making progress due to locked
546                          * buffers.  Skip this inode for now.
547                          */

and the comment in __block_write_full_page():

1713                 /*
1714                  * The page was marked dirty, but the buffers were
1715                  * clean.  Someone wrote them back by hand with
1716                  * ll_rw_block/submit_bh.  A rare case.
1717                  */

do not quite agree with each other. The page writeback should be skipped for
'locked buffer', but here it is 'clean buffer'!

This patch fixes this bug. Though I'm not sure why __block_write_full_page()
is called only to do nothing and who actually issued the writeback for us.

These are the two possible new behaviors after the patch:

1) pretty nice: wait 30s and write ALL:)
2) not so good:
	- during the dd:	~16M
	- after 30s:		~4M
	- after 5s:		~4M
	- after 5s:		~176M

The next patch will fix case (2).

Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/buffer.c                 |    1 -
 fs/xfs/linux-2.6/xfs_aops.c |    5 ++---
 2 files changed, 2 insertions(+), 4 deletions(-)

diff -puN fs/buffer.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page fs/buffer.c
--- a/fs/buffer.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page
+++ a/fs/buffer.c
@@ -1730,7 +1730,6 @@ done:
 		 * The page and buffer_heads can be released at any time from
 		 * here on.
 		 */
-		wbc->pages_skipped++;	/* We didn't write this page */
 	}
 	return err;
 
diff -puN fs/xfs/linux-2.6/xfs_aops.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page
+++ a/fs/xfs/linux-2.6/xfs_aops.c
@@ -402,10 +402,9 @@ xfs_start_page_writeback(
 		clear_page_dirty_for_io(page);
 	set_page_writeback(page);
 	unlock_page(page);
-	if (!buffers) {
+	/* If no buffers on the page are to be written, finish it here */
+	if (!buffers)
 		end_page_writeback(page);
-		wbc->pages_skipped++;	/* We didn't write this page */
-	}
 }
 
 static inline int bio_add_buffer(struct bio *bio, struct buffer_head *bh)
_

Patches currently in -mm which might be from wfg@mail.ustc.edu.cn are

origin.patch