* An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
@ 2007-09-28 6:32 Chakri n
2007-09-28 6:50 ` Andrew Morton
[not found] ` <20070929110454.GA29861@mail.ustc.edu.cn>
0 siblings, 2 replies; 46+ messages in thread
From: Chakri n @ 2007-09-28 6:32 UTC (permalink / raw)
To: akpm, linux-pm, lkml
Hi,
In my testing, an unresponsive file system can hang all I/O in the system.
This is not seen in 2.4.
I started 20 threads doing I/O on an NFS share. They are just doing 4K
writes in a loop.
Now I stop the NFS server hosting the NFS share and start a
"dd" process to write a file on a local EXT3 file system.
# dd if=/dev/zero of=/tmp/x count=1000
This process never progresses.
There is plenty of high memory available in the system, but this
process never progresses.
# free
             total       used       free     shared    buffers     cached
Mem:       3238004     609340    2628664          0      15136     551024
-/+ buffers/cache:      43180    3194824
Swap:      4096532          0    4096532
vmstat on the machine:
# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0 21      0 2628416  15152 551024    0    0     0     0   28  344  0  0  0 100  0
 0 21      0 2628416  15152 551024    0    0     0     0    8  340  0  0  0 100  0
 0 21      0 2628416  15152 551024    0    0     0     0   26  343  0  0  0 100  0
 0 21      0 2628416  15152 551024    0    0     0     0    8  341  0  0  0 100  0
 0 21      0 2628416  15152 551024    0    0     0     0   26  357  0  0  0 100  0
 0 21      0 2628416  15152 551024    0    0     0     0    8  325  0  0  0 100  0
 0 21      0 2628416  15152 551024    0    0     0     0   26  343  0  0  0 100  0
 0 21      0 2628416  15152 551024    0    0     0     0    8  325  0  0  0 100  0
The problem seems to be in balance_dirty_pages(), which calculates
dirty_thresh based only on ZONE_NORMAL. The same scenario works fine
in 2.4: the dd process finishes in no time.
NFS file systems can go offline for multiple reasons (a failed
switch, a filer, etc.), but that should not affect other file systems
on the machine.
Can this behavior be fenced? Can the buffer cache be tuned so that
other processes do not see the effect?
The following is the back trace of the processes:
--------------------------------------
PID: 3552 TASK: cb1fc610 CPU: 0 COMMAND: "dd"
#0 [f5c04c38] schedule at c0624a34
#1 [f5c04cac] schedule_timeout at c06250ee
#2 [f5c04cf0] io_schedule_timeout at c0624c15
#3 [f5c04d04] congestion_wait at c045eb7d
#4 [f5c04d28] balance_dirty_pages_ratelimited_nr at c045ab91
#5 [f5c04d7c] generic_file_buffered_write at c0457148
#6 [f5c04e10] __generic_file_aio_write_nolock at c04576e5
#7 [f5c04e84] generic_file_aio_write at c0457799
#8 [f5c04eb4] ext3_file_write at f8888fd7
#9 [f5c04ed0] do_sync_write at c0472e27
#10 [f5c04f7c] vfs_write at c0473689
#11 [f5c04f98] sys_write at c0473c95
#12 [f5c04fb4] sysenter_entry at c0404ddf
------------------------------------------
PID: 3091 TASK: cb1f0100 CPU: 1 COMMAND: "test"
#0 [f6050c10] schedule at c0624a34
#1 [f6050c84] schedule_timeout at c06250ee
#2 [f6050cc8] io_schedule_timeout at c0624c15
#3 [f6050cdc] congestion_wait at c045eb7d
#4 [f6050d00] balance_dirty_pages_ratelimited_nr at c045ab91
#5 [f6050d54] generic_file_buffered_write at c0457148
#6 [f6050de8] __generic_file_aio_write_nolock at c04576e5
#7 [f6050e40] enqueue_entity at c042131f
#8 [f6050e5c] generic_file_aio_write at c0457799
#9 [f6050e8c] nfs_file_write at f8f90cee
#10 [f6050e9c] getnstimeofday at c043d3f7
#11 [f6050ed0] do_sync_write at c0472e27
#12 [f6050f7c] vfs_write at c0473689
#13 [f6050f98] sys_write at c0473c95
#14 [f6050fb4] sysenter_entry at c0404ddf
Thanks
--Chakri
^ permalink raw reply [flat|nested] 46+ messages in thread

* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Andrew Morton @ 2007-09-28 6:50 UTC (permalink / raw)
To: Chakri n; +Cc: linux-pm, lkml, nfs, Peter Zijlstra

On Thu, 27 Sep 2007 23:32:36 -0700 "Chakri n" <chakriin5@gmail.com> wrote:

> Hi,
>
> In my testing, an unresponsive file system can hang all I/O in the system.
> This is not seen in 2.4.
>
> I started 20 threads doing I/O on an NFS share. They are just doing 4K
> writes in a loop.
>
> Now I stop the NFS server hosting the NFS share and start a
> "dd" process to write a file on a local EXT3 file system.
>
> # dd if=/dev/zero of=/tmp/x count=1000
>
> This process never progresses.

yup.

> There is plenty of high memory available in the system, but this
> process never progresses.
>
> ...
>
> The problem seems to be in balance_dirty_pages(), which calculates
> dirty_thresh based only on ZONE_NORMAL. The same scenario works fine
> in 2.4: the dd process finishes in no time.
> NFS file systems can go offline for multiple reasons (a failed
> switch, a filer, etc.), but that should not affect other file systems
> on the machine.
> Can this behavior be fenced? Can the buffer cache be tuned so that
> other processes do not see the effect?

It's unrelated to the actual value of dirty_thresh: if the machine fills up
with dirty (or unstable) NFS pages then eventually new writers will block
until that condition clears.

2.4 doesn't have this problem at low levels of dirty data because the 2.4
VFS/MM doesn't account for NFS pages at all.

I'm not sure what we can do about this from a design perspective, really.
We have data floating about in memory which we're not allowed to discard,
and if we allow it to increase without bound it will eventually either
wedge userspace _anyway_ or it will take the machine down, resulting in
data loss.

What it would be nice to do would be to write that data to local disk if
possible, then reclaim it. Perhaps David Howells' fscache code can do that
(or could be tweaked to do so).

If you really want to fill all memory with pages which are dirty against a
dead NFS server then you can manually increase
/proc/sys/vm/dirty_background_ratio and dirty_ratio - that should give you
the 2.4 behaviour.

<thinks>

Actually we perhaps could address this at the VFS level in another way.
Processes which are writing to the dead NFS server will eventually block in
balance_dirty_pages() once they've exceeded the memory limits and will
remain blocked until the server wakes up - that's the behaviour we want.

What we _don't_ want to happen is for other processes which are writing to
other, non-dead devices to get collaterally blocked. We have patches which
might fix that queued for 2.6.24. Peter?
* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Peter Zijlstra @ 2007-09-28 6:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: Chakri n, linux-pm, lkml, nfs

On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:

> What we _don't_ want to happen is for other processes which are writing to
> other, non-dead devices to get collaterally blocked. We have patches which
> might fix that queued for 2.6.24. Peter?

Nasty problem, don't do that :-)

But yeah, with per-BDI dirty limits we get stuck at whatever ratio that
NFS server/mount (?) has - which could be 100%. Other processes will
then work almost synchronously against their BDIs, but it should work.

[ They will lower the NFS BDI's ratio, but some fancy clipping code will
  limit the other BDIs' dirty limits so they do not exceed the total limit.
  And with all these NFS pages stuck, that will still be nothing. ]

* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Chakri n @ 2007-09-28 8:27 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-pm, lkml, nfs

Thanks. The BDI dirty limits sound like a good idea.

Is there already a patch for this, which I could try?

I believe it works like this:

Each BDI will have a limit. If the dirty_thresh exceeds the limit,
all the I/O on the block device will be synchronous.

So, if I have sda & an NFS mount, the dirty limit can be different for
each of them.

I can set the dirty limit for
- sda to be 90% and
- the NFS mount to be 50%.

So, if the dirty limit is greater than 50%, NFS works synchronously,
but sda can work asynchronously, till its dirty limit reaches 90%.

Thanks
--Chakri

On 9/27/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> But yeah, with per-BDI dirty limits we get stuck at whatever ratio that
> NFS server/mount (?) has - which could be 100%. Other processes will
> then work almost synchronously against their BDIs, but it should work.
* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Peter Zijlstra @ 2007-09-28 8:40 UTC (permalink / raw)
To: Chakri n; +Cc: Andrew Morton, linux-pm, lkml, nfs

[ please don't top-post! ]

On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:

> The BDI dirty limits sound like a good idea.
>
> Is there already a patch for this, which I could try?

v2.6.23-rc8-mm2

> I believe it works like this:
>
> Each BDI will have a limit. If the dirty_thresh exceeds the limit,
> all the I/O on the block device will be synchronous.
>
> So, if I have sda & an NFS mount, the dirty limit can be different for
> each of them.
>
> I can set the dirty limit for
> - sda to be 90% and
> - the NFS mount to be 50%.
>
> So, if the dirty limit is greater than 50%, NFS works synchronously,
> but sda can work asynchronously, till its dirty limit reaches 90%.

Not quite; the system determines the limit itself in an adaptive
fashion:

	bdi_limit = total_limit * p_bdi

where p is a fraction [0,1], determined by the relative writeout
speed of the current BDI vs all other BDIs.

So if you were to have 3 BDIs (sda, sdb and 1 nfs mount), and sda is
idle, and the nfs mount gets twice as much traffic as sdb, the ratios
will look like:

	p_sda: 0
	p_sdb: 1/3
	p_nfs: 2/3

Once the traffic exceeds the write speed of the device we build up a
backlog and stuff gets throttled, so these proportions converge to the
relative write speed of the BDIs when saturated with data.

So what can happen in your case is that the NFS mount is the only one
with traffic, so it will get a fraction of 1. If it then disconnects like
in your case, it will still have all of the dirty limit pinned for NFS.

However, other devices will at that moment try to maintain a limit of 0,
which ends up being similar to a sync mount.

So they'll not get stuck, but they will be slow.
* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Chakri n @ 2007-09-28 9:01 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-pm, lkml, nfs

Thanks for explaining the adaptive logic.

> However, other devices will at that moment try to maintain a limit of 0,
> which ends up being similar to a sync mount.
>
> So they'll not get stuck, but they will be slow.

Sync should be OK when the situation is bad like this and someone has
hijacked all the buffers.

But I see my simple dd to write 10 blocks on the local disk never
completes, even after 10 minutes:

[root@h46 ~]# dd if=/dev/zero of=/tmp/x count=10

I think the process is completely stuck and is not progressing at all.

Is something going wrong in the calculations where it does not fall
back to sync mode?

Thanks
--Chakri
* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Peter Zijlstra @ 2007-09-28 9:12 UTC (permalink / raw)
To: Chakri n; +Cc: Andrew Morton, linux-pm, lkml, nfs

On Fri, 2007-09-28 at 02:01 -0700, Chakri n wrote:
> But I see my simple dd to write 10 blocks on the local disk never
> completes, even after 10 minutes.
>
> I think the process is completely stuck and is not progressing at all.
>
> Is something going wrong in the calculations where it does not fall
> back to sync mode?

What kernel is that?

* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Chakri n @ 2007-09-28 9:20 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-pm, lkml, nfs

It's 2.6.23-rc6.

Thanks
--Chakri

On 9/28/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> What kernel is that?

* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Peter Zijlstra @ 2007-09-28 9:23 UTC (permalink / raw)
To: Chakri n; +Cc: Andrew Morton, linux-pm, lkml, nfs

[ and one copy for the list too ]

On Fri, 2007-09-28 at 02:20 -0700, Chakri n wrote:
> It's 2.6.23-rc6.

Could you try .23-rc8-mm2? It includes the per-bdi stuff.

* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Chakri n @ 2007-09-28 10:36 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-pm, lkml, nfs

It works on .23-rc8-mm2 without any problems. The "dd" process does not
hang any more.

Thanks for all the help.

Cheers
--Chakri

On 9/28/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Could you try .23-rc8-mm2? It includes the per-bdi stuff.
* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Jonathan Corbet @ 2007-09-28 13:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-pm, lkml, nfs, Peter Zijlstra, Chakri n

Andrew wrote:
> It's unrelated to the actual value of dirty_thresh: if the machine fills up
> with dirty (or unstable) NFS pages then eventually new writers will block
> until that condition clears.
>
> 2.4 doesn't have this problem at low levels of dirty data because the 2.4
> VFS/MM doesn't account for NFS pages at all.

Is it really NFS-related? I was trying to back up my 2.6.23-rc8 system
to an external USB drive the other day when something flaked and the
drive fell off the bus. That, too, was sufficient to wedge the entire
system, even though the only thing which needed the dead drive was one
rsync process. It's kind of a bummer to have to hit the reset button
after the failure of (what should be) a non-critical piece of hardware.

Not that I have a fix to propose...:)

jon

* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Peter Zijlstra @ 2007-09-28 13:35 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: Andrew Morton, linux-pm, lkml, nfs, Chakri n

On Fri, 2007-09-28 at 07:28 -0600, Jonathan Corbet wrote:
> Is it really NFS-related? I was trying to back up my 2.6.23-rc8 system
> to an external USB drive the other day when something flaked and the
> drive fell off the bus. That, too, was sufficient to wedge the entire
> system, even though the only thing which needed the dead drive was one
> rsync process. It's kind of a bummer to have to hit the reset button
> after the failure of (what should be) a non-critical piece of hardware.
>
> Not that I have a fix to propose...:)

The per-bdi work in -mm should make the system not drop dead.

Still, would a remove and re-insert of the USB media end up with the same
BDI? That is, would it be recognised as the same and resume the transfer?

Anyway, it would be grand (and dangerous) if we could provide a button
that would just kill off all outstanding pages against a dead device.

* Re: [linux-pm] Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Alan Stern @ 2007-09-28 16:45 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jonathan Corbet, Chakri n, Andrew Morton, nfs, linux-pm, lkml

On Fri, 28 Sep 2007, Peter Zijlstra wrote:
> The per-bdi work in -mm should make the system not drop dead.
>
> Still, would a remove and re-insert of the USB media end up with the same
> BDI? That is, would it be recognised as the same and resume the transfer?

Removal and replacement of the media might work. I have never tried it.
But Jon described removal of the device, not the media. Replacing the
device definitely will not work.

Alan Stern
* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Daniel Phillips @ 2007-09-29 1:27 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jonathan Corbet, Andrew Morton, linux-pm, lkml, nfs, Chakri n

On Friday 28 September 2007 06:35, Peter Zijlstra wrote:
> ...it would be grand (and dangerous) if we could provide a
> button that would just kill off all outstanding pages against a dead
> device.

Substitute "resources" for "pages" and you begin to get an idea of how
tricky that actually is.

That said, this is exactly what we have done with ddsnap, for the simple
reason that our users, now emboldened by being able to stop or terminate
the user space part, felt justified in expecting that the system continue
as if nothing had happened, and furthermore, be able to restart ddsnap
without a hiccup. (Otherwise known as a sysop's deity-given right to
kill.)

So this is what we do in the specific case of ddsnap:

* When we detect some nasty state change, such as our userspace control
daemon disappearing on us, we go poking around and explicitly release
every semaphore that the device driver could possibly wait on forever
(interestingly, they are all in our own driver except for the BKL, which
is just an artifact of device mapper not having gone over to
unlocked_ioctl, for no good reason that I know of).

* Then at the points where the driver falls through some lock thus
released, we check our "ready" flag, and if it indicates "busted",
proceed with whatever cleanup is needed at that point.

Does not sound like an approach one would expect to work reliably, does
it? But there just may be some general principle to be ferreted out here.
(Anyone who has ideas on how bits of this procedure could be abstracted,
please do not hesitate to step boldly forth into the limelight.)

Incidentally, only a small subset of locks needed special handling as
above. Most can be shown to have no way to block forever, short of an
outright bug. I shudder to think how much work it would be to bring every
driver in the kernel up to such a standard, particularly if user space
components are involved, as with USB. On the other hand, every driver
fixed is one less driver that sucks.

The next one to emerge from the pipeline will most likely be NBD, which
we have been working on in fits and starts for a while. Look for it to
morph into "ddbd", with cross-node distributed data awareness, in
addition to performing its current job without deadlocking.

Regards,
Daniel
* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Andrew Morton @ 2007-09-28 18:04 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-pm, lkml, nfs, Peter Zijlstra, Chakri n

On Fri, 28 Sep 2007 07:28:52 -0600 corbet@lwn.net (Jonathan Corbet) wrote:
> Is it really NFS-related? I was trying to back up my 2.6.23-rc8 system
> to an external USB drive the other day when something flaked and the
> drive fell off the bus. That, too, was sufficient to wedge the entire
> system, even though the only thing which needed the dead drive was one
> rsync process. It's kind of a bummer to have to hit the reset button
> after the failure of (what should be) a non-critical piece of hardware.

That's a USB bug, surely. What should happen is that the kernel attempts
writeback, gets an IO error and then your data gets lost.

* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Trond Myklebust @ 2007-09-28 17:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Chakri n, linux-pm, lkml, nfs, Peter Zijlstra

On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
> Actually we perhaps could address this at the VFS level in another way.
> Processes which are writing to the dead NFS server will eventually block in
> balance_dirty_pages() once they've exceeded the memory limits and will
> remain blocked until the server wakes up - that's the behaviour we want.
>
> What we _don't_ want to happen is for other processes which are writing to
> other, non-dead devices to get collaterally blocked. We have patches which
> might fix that queued for 2.6.24. Peter?

Do these patches also cause the memory reclaimers to steer clear of
devices that are congested (and stop waiting on a congested device if
they see that it remains congested for a long period of time)? Most of
the collateral blocking I see tends to happen in memory allocation...

Cheers
Trond
* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Andrew Morton @ 2007-09-28 18:49 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Chakri n, linux-pm, lkml, nfs, Peter Zijlstra

On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> Do these patches also cause the memory reclaimers to steer clear of
> devices that are congested (and stop waiting on a congested device if
> they see that it remains congested for a long period of time)? Most of
> the collateral blocking I see tends to happen in memory allocation...

No, they don't attempt to do that, but I suspect they put in place
infrastructure which could be used to improve direct-reclaimer latency.
In the throttle_vm_writeout() path, at least.

Do you know where the stalls are occurring? throttle_vm_writeout(), or
via direct calls to congestion_wait() from page_alloc.c and vmscan.c?
(Running sysrq-w five or ten times will probably be enough to determine
this.)

* Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
From: Peter Zijlstra @ 2007-09-28 18:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: Trond Myklebust, Chakri n, linux-pm, lkml, nfs

On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
> Do you know where the stalls are occurring? throttle_vm_writeout(), or
> via direct calls to congestion_wait() from page_alloc.c and vmscan.c?
> (Running sysrq-w five or ten times will probably be enough to determine
> this.)

Would it make sense to instrument congestion_wait() callsites with
vmstats?
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 18:48 ` Peter Zijlstra @ 2007-09-28 19:16 ` Andrew Morton 2007-10-02 13:36 ` Peter Zijlstra 0 siblings, 1 reply; 46+ messages in thread From: Andrew Morton @ 2007-09-28 19:16 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Trond Myklebust, Chakri n, linux-pm, lkml, nfs On Fri, 28 Sep 2007 20:48:59 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote: > > > Do you know where the stalls are occurring? throttle_vm_writeout(), or via > > direct calls to congestion_wait() from page_alloc.c and vmscan.c? (running > > sysrq-w five or ten times will probably be enough to determine this) > > would it make sense to instrument congestion_wait() callsites with > vmstats? Better than nothing, but it isn't a great fit: we'd need one vmstat counter per congestion_wait() callsite, and it's all rather specific to the kernel-of-the-day. taskstats delay accounting isn't useful either - it will aggregate all the schedule() callsites. profile=sleep is just about ideal for this, isn't it? I suspect that most people don't know it's there, or forgot about it. It could be that profile=sleep just tells us "you're spending a lot of time in io_schedule()" or congestion_wait(), so perhaps we need to teach it to go for walk up the stack somehow. But lockdep knows how to do that already so perhaps we (ie: you ;)) can bolt sleep instrumentation onto lockdep as we (ie you ;)) did with the lockstat stuff? (Searches for the lockstat documentation) Did we forget to do that? ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 19:16 ` Andrew Morton @ 2007-10-02 13:36 ` Peter Zijlstra 2007-10-02 15:42 ` Randy Dunlap 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2007-10-02 13:36 UTC (permalink / raw) To: Andrew Morton; +Cc: lkml, Zach Brown, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 7430 bytes --] On Fri, 2007-09-28 at 12:16 -0700, Andrew Morton wrote: > (Searches for the lockstat documentation) > > Did we forget to do that? yeah,... /me quickly whips up something Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- Documentation/lockstat.txt | 119 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) Index: linux-2.6/Documentation/lockstat.txt =================================================================== --- /dev/null +++ linux-2.6/Documentation/lockstat.txt @@ -0,0 +1,119 @@ + +LOCK STATISTICS + +- WHAT + +As the name suggests, it provides statistics on locks. + +- WHY + +Because things like lock contention can severely impact performance. + +- HOW + +Lockdep already has hooks in the lock functions and maps lock instances to +lock classes. We build on that. The graph below shows the relation between +the lock functions and the various hooks therein. + + __acquire + | + lock _____ + | \ + | __contended + | | + | <wait> + | _______/ + |/ + | + __acquired + | + . + <hold> + . 
+ | + __release + | + unlock + +lock, unlock - the regular lock functions +__* - the hooks +<> - states + +With these hooks we provide the following statistics: + + con-bounces - number of lock contention that involved x-cpu data + contentions - number of lock acquisitions that had to wait + wait time min - shortest (non 0) time we ever had to wait for a lock + max - longest time we ever had to wait for a lock + total - total time we spend waiting on this lock + acq-bounes - number of lock acquisitions that involved x-cpu data + acquisitions - number of times we took the lock + hold time min - shortest (non 0) time we ever held the lock + max - longest time we ever held the lock + total - total time this lock was held + +From these number various other statistics can be derived, such as: + + hold time average = hold time total / acquisitions + +These numbers are gathered per lock class, per read/write state (when +applicable). + +It also tracks (4) contention points per class. A contention point is a call +site that had to wait on lock acquisition. 
+ + - USAGE + +Look at the current lock statistics: + +(line numbers not part of actual output, done for clarity in the explanation below) + +# less /proc/lock_stat + +01 lock_stat version 0.2 +02 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +03 class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total +04 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +05 +06 &inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60 +07 &inode->i_data.tree_lock-R: 0 0 0.00 0.00 0.00 23302 231198 0.25 8.45 98023.38 +08 -------------------------- +09 &inode->i_data.tree_lock 0 [<ffffffff8027c08f>] add_to_page_cache+0x5f/0x190 +10 +11 ............................................................................................................................................................................................... +12 +13 dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24 +14 ----------- +15 dcache_lock 180 [<ffffffff802c0d7e>] sys_getcwd+0x11e/0x230 +16 dcache_lock 165 [<ffffffff802c002a>] d_alloc+0x15a/0x210 +17 dcache_lock 33 [<ffffffff8035818d>] _atomic_dec_and_lock+0x4d/0x70 +18 dcache_lock 1 [<ffffffff802beef8>] shrink_dcache_parent+0x18/0x130 + +This except shows the first two lock class statistics. Line 01 shows the output +version - each time the format changes this will be updated. Line 02-04 show +the header with column descriptions. Lines 05-10 and 13-18 show the actual +statistics. These statistics come in two parts; the actual stats separated by a +short separator (line 08, 14) from the contention points. 
+ +The first lock (05-10) is a read/write lock, and shows two lines above the +short separator. The contention points don't match the column descriptors, +they have two: contentions and [<IP>] symbol. + + +View the top contending locks: + +# grep : /proc/lock_stat | head + &inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60 + &inode->i_data.tree_lock-R: 0 0 0.00 0.00 0.00 23302 231198 0.25 8.45 98023.38 + dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24 + &inode->i_mutex: 161 286 18446744073709 62882.54 1244614.55 3653 20598 18446744073709 62318.60 1693822.74 + &zone->lru_lock: 94 94 0.53 7.33 92.10 4366 32690 0.29 59.81 16350.06 + &inode->i_data.i_mmap_lock: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44 + &q->__queue_lock: 48 50 0.52 31.62 86.31 774 13131 0.17 113.08 12277.52 + &rq->rq_lock_key: 43 47 0.74 68.50 170.63 3706 33929 0.22 107.99 17460.62 + &rq->rq_lock_key#2: 39 46 0.75 6.68 49.03 2979 32292 0.17 125.17 17137.63 + tasklist_lock-W: 15 15 1.45 10.87 32.70 1201 7390 0.58 62.55 13648.47 + +Clear the statistics: + +# echo 0 > /proc/lock_stat [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
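The derivation the patch above documents, "hold time average = hold time total / acquisitions", can be checked mechanically against the sample rows. A small Python sketch — the fixed column order (lock name, then ten numeric fields) is an assumption read off the sample output in the patch, and `hold_time_avg` is a hypothetical helper:

```python
# Compute the documented "hold time average" from /proc/lock_stat style rows.
# The two sample rows are copied from the patch text above; the column layout
# (con-bounces, contentions, waittime-min/max/total, acq-bounces,
#  acquisitions, holdtime-min/max/total) is assumed from that sample.

SAMPLE = """\
&inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60
dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24
"""

def hold_time_avg(line):
    """Return (lock name, holdtime-total / acquisitions) for one stats row."""
    name, rest = line.split(":", 1)
    fields = rest.split()
    acquisitions = int(fields[6])      # 7th numeric column: acquisitions
    holdtime_total = float(fields[9])  # 10th numeric column: holdtime-total
    return name, holdtime_total / acquisitions

for row in SAMPLE.splitlines():
    name, avg = hold_time_avg(row)
    print(f"{name}: avg hold time {avg:.4f}")  # prints one line per lock class
```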
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-10-02 13:36 ` Peter Zijlstra @ 2007-10-02 15:42 ` Randy Dunlap 2007-10-03 9:28 ` [PATCH] lockstat: documentation Peter Zijlstra 0 siblings, 1 reply; 46+ messages in thread From: Randy Dunlap @ 2007-10-02 15:42 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Andrew Morton, lkml, Zach Brown, Ingo Molnar On Tue, 02 Oct 2007 15:36:01 +0200 Peter Zijlstra wrote: > On Fri, 2007-09-28 at 12:16 -0700, Andrew Morton wrote: > > > (Searches for the lockstat documentation) > > > > Did we forget to do that? > > yeah,... > > /me quickly whips up something Thanks. Just some typos noted below. > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > Documentation/lockstat.txt | 119 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 119 insertions(+) > > Index: linux-2.6/Documentation/lockstat.txt > =================================================================== > --- /dev/null > +++ linux-2.6/Documentation/lockstat.txt > @@ -0,0 +1,119 @@ > + > +LOCK STATISTICS > + > +- WHAT > + > +As the name suggests, it provides statistics on locks. > + > +- WHY > + > +Because things like lock contention can severely impact performance. > + > +- HOW > + > +Lockdep already has hooks in the lock functions and maps lock instances to > +lock classes. We build on that. The graph below shows the relation between > +the lock functions and the various hooks therein. > + > + __acquire > + | > + lock _____ > + | \ > + | __contended > + | | > + | <wait> > + | _______/ > + |/ > + | > + __acquired > + | > + . > + <hold> > + . 
> + | > + __release > + | > + unlock > + > +lock, unlock - the regular lock functions > +__* - the hooks > +<> - states > + > +With these hooks we provide the following statistics: > + > + con-bounces - number of lock contention that involved x-cpu data > + contentions - number of lock acquisitions that had to wait > + wait time min - shortest (non 0) time we ever had to wait for a lock (non-0) > + max - longest time we ever had to wait for a lock > + total - total time we spend waiting on this lock > + acq-bounes - number of lock acquisitions that involved x-cpu data -bounces > + acquisitions - number of times we took the lock > + hold time min - shortest (non 0) time we ever held the lock (non-0) > + max - longest time we ever held the lock > + total - total time this lock was held > + > +From these number various other statistics can be derived, such as: > + > + hold time average = hold time total / acquisitions > + > +These numbers are gathered per lock class, per read/write state (when > +applicable). > + > +It also tracks (4) contention points per class. A contention point is a call > +site that had to wait on lock acquisition. > + > + - USAGE > + > +Look at the current lock statistics: > + > +(line numbers not part of actual output, done for clarity in the explanation below) > + > +# less /proc/lock_stat > + > +01 lock_stat version 0.2 > +02 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > +03 class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total > +04 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ... 
> +15 dcache_lock 180 [<ffffffff802c0d7e>] sys_getcwd+0x11e/0x230 > +16 dcache_lock 165 [<ffffffff802c002a>] d_alloc+0x15a/0x210 > +17 dcache_lock 33 [<ffffffff8035818d>] _atomic_dec_and_lock+0x4d/0x70 > +18 dcache_lock 1 [<ffffffff802beef8>] shrink_dcache_parent+0x18/0x130 > + > +This except shows the first two lock class statistics. Line 01 shows the output excerpt > +version - each time the format changes this will be updated. Line 02-04 show > +the header with column descriptions. Lines 05-10 and 13-18 show the actual > +statistics. These statistics come in two parts; the actual stats separated by a > +short separator (line 08, 14) from the contention points. > + > +The first lock (05-10) is a read/write lock, and shows two lines above the > +short separator. The contention points don't match the column descriptors, > +they have two: contentions and [<IP>] symbol. ... --- ~Randy ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH] lockstat: documentation 2007-10-02 15:42 ` Randy Dunlap @ 2007-10-03 9:28 ` Peter Zijlstra 2007-10-03 9:35 ` Ingo Molnar 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2007-10-03 9:28 UTC (permalink / raw) To: Randy Dunlap; +Cc: Andrew Morton, lkml, Zach Brown, Ingo Molnar Thanks Randy! update patch below. --- Subject: lockstat: documentation Provide some documentation for CONFIG_LOCK_STAT Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- Documentation/lockstat.txt | 120 +++++++++++++++++++++++++++++++++++++++++++++ lib/Kconfig.debug | 2 2 files changed, 122 insertions(+) Index: linux-2.6/Documentation/lockstat.txt =================================================================== --- /dev/null +++ linux-2.6/Documentation/lockstat.txt @@ -0,0 +1,120 @@ + +LOCK STATISTICS + +- WHAT + +As the name suggests, it provides statistics on locks. + +- WHY + +Because things like lock contention can severely impact performance. + +- HOW + +Lockdep already has hooks in the lock functions and maps lock instances to +lock classes. We build on that. The graph below shows the relation between +the lock functions and the various hooks therein. + + __acquire + | + lock _____ + | \ + | __contended + | | + | <wait> + | _______/ + |/ + | + __acquired + | + . + <hold> + . 
+ | + __release + | + unlock + +lock, unlock - the regular lock functions +__* - the hooks +<> - states + +With these hooks we provide the following statistics: + + con-bounces - number of lock contention that involved x-cpu data + contentions - number of lock acquisitions that had to wait + wait time min - shortest (non-0) time we ever had to wait for a lock + max - longest time we ever had to wait for a lock + total - total time we spend waiting on this lock + acq-bounces - number of lock acquisitions that involved x-cpu data + acquisitions - number of times we took the lock + hold time min - shortest (non-0) time we ever held the lock + max - longest time we ever held the lock + total - total time this lock was held + +From these number various other statistics can be derived, such as: + + hold time average = hold time total / acquisitions + +These numbers are gathered per lock class, per read/write state (when +applicable). + +It also tracks 4 contention points per class. A contention point is a call site +that had to wait on lock acquisition. 
+ + - USAGE + +Look at the current lock statistics: + +( line numbers not part of actual output, done for clarity in the explanation + below ) + +# less /proc/lock_stat + +01 lock_stat version 0.2 +02 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +03 class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total +04 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +05 +06 &inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60 +07 &inode->i_data.tree_lock-R: 0 0 0.00 0.00 0.00 23302 231198 0.25 8.45 98023.38 +08 -------------------------- +09 &inode->i_data.tree_lock 0 [<ffffffff8027c08f>] add_to_page_cache+0x5f/0x190 +10 +11 ............................................................................................................................................................................................... +12 +13 dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24 +14 ----------- +15 dcache_lock 180 [<ffffffff802c0d7e>] sys_getcwd+0x11e/0x230 +16 dcache_lock 165 [<ffffffff802c002a>] d_alloc+0x15a/0x210 +17 dcache_lock 33 [<ffffffff8035818d>] _atomic_dec_and_lock+0x4d/0x70 +18 dcache_lock 1 [<ffffffff802beef8>] shrink_dcache_parent+0x18/0x130 + +This excerpt shows the first two lock class statistics. Line 01 shows the +output version - each time the format changes this will be updated. Line 02-04 +show the header with column descriptions. Lines 05-10 and 13-18 show the actual +statistics. These statistics come in two parts; the actual stats separated by a +short separator (line 08, 14) from the contention points. 
+ +The first lock (05-10) is a read/write lock, and shows two lines above the +short separator. The contention points don't match the column descriptors, +they have two: contentions and [<IP>] symbol. + + +View the top contending locks: + +# grep : /proc/lock_stat | head + &inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60 + &inode->i_data.tree_lock-R: 0 0 0.00 0.00 0.00 23302 231198 0.25 8.45 98023.38 + dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24 + &inode->i_mutex: 161 286 18446744073709 62882.54 1244614.55 3653 20598 18446744073709 62318.60 1693822.74 + &zone->lru_lock: 94 94 0.53 7.33 92.10 4366 32690 0.29 59.81 16350.06 + &inode->i_data.i_mmap_lock: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44 + &q->__queue_lock: 48 50 0.52 31.62 86.31 774 13131 0.17 113.08 12277.52 + &rq->rq_lock_key: 43 47 0.74 68.50 170.63 3706 33929 0.22 107.99 17460.62 + &rq->rq_lock_key#2: 39 46 0.75 6.68 49.03 2979 32292 0.17 125.17 17137.63 + tasklist_lock-W: 15 15 1.45 10.87 32.70 1201 7390 0.58 62.55 13648.47 + +Clear the statistics: + +# echo 0 > /proc/lock_stat Index: linux-2.6/lib/Kconfig.debug =================================================================== --- linux-2.6.orig/lib/Kconfig.debug +++ linux-2.6/lib/Kconfig.debug @@ -309,6 +309,8 @@ config LOCK_STAT help This feature enables tracking lock contention points + For more details, see Documentation/lockstat.txt + config DEBUG_LOCKDEP bool "Lock dependency engine debugging" depends on DEBUG_KERNEL && LOCKDEP ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH] lockstat: documentation
  2007-10-03  9:28 ` [PATCH] lockstat: documentation Peter Zijlstra
@ 2007-10-03  9:35   ` Ingo Molnar
  0 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2007-10-03 9:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Randy Dunlap, Andrew Morton, lkml, Zach Brown

* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Thanks Randy!
>
> update patch below.
>
> ---
> Subject: lockstat: documentation
>
> Provide some documentation for CONFIG_LOCK_STAT
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 18:49 ` Andrew Morton 2007-09-28 18:48 ` Peter Zijlstra @ 2007-09-28 19:16 ` Trond Myklebust 2007-09-28 19:26 ` Andrew Morton 2007-09-29 1:51 ` KDB? Daniel Phillips 1 sibling, 2 replies; 46+ messages in thread From: Trond Myklebust @ 2007-09-28 19:16 UTC (permalink / raw) To: Andrew Morton; +Cc: Chakri n, linux-pm, lkml, nfs, Peter Zijlstra [-- Attachment #1: Type: text/plain, Size: 1007 bytes --] On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote: > On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > Do these patches also cause the memory reclaimers to steer clear of > > devices that are congested (and stop waiting on a congested device if > > they see that it remains congested for a long period of time)? Most of > > the collateral blocking I see tends to happen in memory allocation... > > > > No, they don't attempt to do that, but I suspect they put in place > infrastructure which could be used to improve direct-reclaimer latency. In > the throttle_vm_writeout() path, at least. > > Do you know where the stalls are occurring? throttle_vm_writeout(), or via > direct calls to congestion_wait() from page_alloc.c and vmscan.c? (running > sysrq-w five or ten times will probably be enough to determine this) Looking back, they were getting caught up in balance_dirty_pages_ratelimited() and friends. See the attached example... Cheers Trond [-- Attachment #2: Attached message - [NFS] NFS on loopback locks up entire system(2.6.23-rc6)? --] [-- Type: message/rfc822, Size: 9371 bytes --] From: "Chakri n" <chakriin5@gmail.com> To: nfs@lists.sourceforge.net, Trond.Myklebust@netapp.com, linux-kernel@vger.kernel.org Subject: [NFS] NFS on loopback locks up entire system(2.6.23-rc6)? 
Date: Thu, 20 Sep 2007 17:22:26 -0700 Message-ID: <92cbf19b0709201722k6265e647x31b7d25bc54b63a0@mail.gmail.com> Hi, I am testing NFS on loopback locks up entire system with 2.6.23-rc6 kernel. I have mounted a local ext3 partition using loopback NFS (version 3) and started my test program. The test program forks 20 threads allocates 10MB for each thread, writes & reads a file on the loopback NFS mount. After running for about 5 min, I cannot even login to the machine. Commands like ps etc, hang in a live session. The machine is a DELL 1950 with 4Gig of RAM, so there is plenty of RAM & CPU to play around and no other io/heavy processes are running on the system. vmstat output shows no buffers are actually getting transferred in or out and iowait is 100%. [root@h46 ~]# vmstat 1 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 0 24 116 110080 11132 3045664 0 0 0 0 28 345 0 1 0 99 0 0 24 116 110080 11132 3045664 0 0 0 0 5 329 0 0 0 100 0 0 24 116 110080 11132 3045664 0 0 0 0 26 336 0 0 0 100 0 0 24 116 110080 11132 3045664 0 0 0 0 8 335 0 0 0 100 0 0 24 116 110080 11132 3045664 0 0 0 0 26 352 0 0 0 100 0 0 24 116 110080 11132 3045664 0 0 0 0 8 351 0 0 0 100 0 0 24 116 110080 11132 3045664 0 0 0 0 23 358 0 1 0 99 0 0 24 116 110080 11132 3045664 0 0 0 0 10 350 0 0 0 100 0 0 24 116 110080 11132 3045664 0 0 0 0 26 363 0 0 0 100 0 0 24 116 110080 11132 3045664 0 0 0 0 8 346 0 1 0 99 0 0 24 116 110080 11132 3045664 0 0 0 0 26 360 0 0 0 100 0 0 24 116 110080 11140 3045656 0 0 8 0 11 345 0 0 0 100 0 0 24 116 110080 11140 3045664 0 0 0 0 27 355 0 0 2 97 0 0 24 116 110080 11140 3045664 0 0 0 0 9 330 0 0 0 100 0 0 24 116 110080 11140 3045664 0 0 0 0 26 358 0 0 0 100 0 The following is the backtrace of 1. one of the threads of my test program 2. nfsd daemon and 3. 
a generic command like pstree, after the machine hangs: ------------------------------------------------------------- crash> bt 3252 PID: 3252 TASK: f6f3c610 CPU: 0 COMMAND: "test" #0 [f6bdcc10] schedule at c0624a34 #1 [f6bdcc84] schedule_timeout at c06250ee #2 [f6bdccc8] io_schedule_timeout at c0624c15 #3 [f6bdccdc] congestion_wait at c045eb7d #4 [f6bdcd00] balance_dirty_pages_ratelimited_nr at c045ab91 #5 [f6bdcd54] generic_file_buffered_write at c0457148 #6 [f6bdcde8] __generic_file_aio_write_nolock at c04576e5 #7 [f6bdce40] try_to_wake_up at c042342b #8 [f6bdce5c] generic_file_aio_write at c0457799 #9 [f6bdce8c] nfs_file_write at f8c25cee #10 [f6bdced0] do_sync_write at c0472e27 #11 [f6bdcf7c] vfs_write at c0473689 #12 [f6bdcf98] sys_write at c0473c95 #13 [f6bdcfb4] sysenter_entry at c0404ddf EAX: 00000004 EBX: 00000013 ECX: a4966008 EDX: 00980000 DS: 007b ESI: 00980000 ES: 007b EDI: a4966008 SS: 007b ESP: a5ae6ec0 EBP: a5ae6ef0 CS: 0073 EIP: b7eed410 ERR: 00000004 EFLAGS: 00000246 crash> bt 3188 PID: 3188 TASK: f74c4000 CPU: 1 COMMAND: "nfsd" #0 [f6836c7c] schedule at c0624a34 #1 [f6836cf0] __mutex_lock_slowpath at c062543d #2 [f6836d0c] mutex_lock at c0625326 #3 [f6836d18] generic_file_aio_write at c0457784 #4 [f6836d48] ext3_file_write at f8888fd7 #5 [f6836d64] do_sync_readv_writev at c0472d1f #6 [f6836e08] do_readv_writev at c0473486 #7 [f6836e6c] vfs_writev at c047358e #8 [f6836e7c] nfsd_vfs_write at f8e7f8d7 #9 [f6836ee0] nfsd_write at f8e80139 #10 [f6836f10] nfsd3_proc_write at f8e86afd #11 [f6836f44] nfsd_dispatch at f8e7c20c #12 [f6836f6c] svc_process at f89c18e0 #13 [f6836fbc] nfsd at f8e7c794 #14 [f6836fe4] kernel_thread_helper at c0405a35 crash> ps|grep ps 234 2 3 cb194000 IN 0.0 0 0 [khpsbpkt] 520 2 0 f7e18c20 IN 0.0 0 0 [kpsmoused] 2859 1 2 f7f3cc20 IN 0.1 9600 2040 cupsd 3340 3310 0 f4a0f840 UN 0.0 4360 816 pstree 3343 3284 2 f4a0f230 UN 0.0 4212 944 ps crash> bt 3340 PID: 3340 TASK: f4a0f840 CPU: 0 COMMAND: "pstree" #0 [e856be30] schedule at 
c0624a34 #1 [e856bea4] rwsem_down_failed_common at c04df6c0 #2 [e856bec4] rwsem_down_read_failed at c0625c2a #3 [e856bedc] call_rwsem_down_read_failed at c0625c96 #4 [e856bee8] down_read at c043c21a #5 [e856bef0] access_process_vm at c0462039 #6 [e856bf38] proc_pid_cmdline at c04a1bbb #7 [e856bf58] proc_info_read at c04a2f41 #8 [e856bf7c] vfs_read at c04737db #9 [e856bf98] sys_read at c0473c2e #10 [e856bfb4] sysenter_entry at c0404ddf EAX: 00000003 EBX: 00000005 ECX: 0804dc58 EDX: 00000062 DS: 007b ESI: 00000cba ES: 007b EDI: 0804e0e0 SS: 007b ESP: bfa3afe8 EBP: bfa3d4f8 CS: 0073 EIP: b7f64410 ERR: 00000003 EFLAGS: 00000246 ---------------------------------------------------------- Any ideas what could potentially trigger this? Please let me know if you would like to get any other specific details. Thanks --Chakri ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2005. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 19:16 ` A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) Trond Myklebust @ 2007-09-28 19:26 ` Andrew Morton 2007-09-28 19:52 ` Trond Myklebust 2007-09-29 1:51 ` KDB? Daniel Phillips 1 sibling, 1 reply; 46+ messages in thread From: Andrew Morton @ 2007-09-28 19:26 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chakri n, linux-pm, lkml, nfs, Peter Zijlstra On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote: > > On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > Do these patches also cause the memory reclaimers to steer clear of > > > devices that are congested (and stop waiting on a congested device if > > > they see that it remains congested for a long period of time)? Most of > > > the collateral blocking I see tends to happen in memory allocation... > > > > > > > No, they don't attempt to do that, but I suspect they put in place > > infrastructure which could be used to improve direct-reclaimer latency. In > > the throttle_vm_writeout() path, at least. > > > > Do you know where the stalls are occurring? throttle_vm_writeout(), or via > > direct calls to congestion_wait() from page_alloc.c and vmscan.c? (running > > sysrq-w five or ten times will probably be enough to determine this) > > Looking back, they were getting caught up in > balance_dirty_pages_ratelimited() and friends. See the attached > example... that one is nfs-on-loopback, which is a special case, isn't it? NFS on loopback used to hang, but then we fixed it. It looks like we broke it again sometime in the intervening four years or so. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 19:26 ` Andrew Morton @ 2007-09-28 19:52 ` Trond Myklebust 2007-09-28 20:10 ` Andrew Morton 2007-09-28 20:24 ` Daniel Phillips 0 siblings, 2 replies; 46+ messages in thread From: Trond Myklebust @ 2007-09-28 19:52 UTC (permalink / raw) To: Andrew Morton; +Cc: Chakri n, linux-pm, lkml, nfs, Peter Zijlstra On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote: > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > Looking back, they were getting caught up in > > balance_dirty_pages_ratelimited() and friends. See the attached > > example... > > that one is nfs-on-loopback, which is a special case, isn't it? I'm not sure that the hang that is illustrated here is so special. It is an example of a bog-standard ext3 write, that ends up calling the NFS client, which is hanging. The fact that it happens to be hanging on the nfsd process is more or less irrelevant here: the same thing could happen to any other process in the case where we have an NFS server that is down. > NFS on loopback used to hang, but then we fixed it. It looks like we > broke it again sometime in the intervening four years or so. It has been quirky all through the 2.6.x series because of this issue. Cheers Trond ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 19:52 ` Trond Myklebust @ 2007-09-28 20:10 ` Andrew Morton 2007-09-28 20:32 ` Trond Myklebust 2007-09-28 20:24 ` Daniel Phillips 1 sibling, 1 reply; 46+ messages in thread From: Andrew Morton @ 2007-09-28 20:10 UTC (permalink / raw) To: Trond Myklebust; +Cc: chakriin5, linux-pm, linux-kernel, nfs, a.p.zijlstra On Fri, 28 Sep 2007 15:52:28 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote: > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > Looking back, they were getting caught up in > > > balance_dirty_pages_ratelimited() and friends. See the attached > > > example... > > > > that one is nfs-on-loopback, which is a special case, isn't it? > > I'm not sure that the hang that is illustrated here is so special. It is > an example of a bog-standard ext3 write, that ends up calling the NFS > client, which is hanging. The fact that it happens to be hanging on the > nfsd process is more or less irrelevant here: the same thing could > happen to any other process in the case where we have an NFS server that > is down. hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim? We should be able to fix that by marking the backing device as write-congested. That'll have small race windows, but it should be a 99.9% fix? ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 20:10 ` Andrew Morton @ 2007-09-28 20:32 ` Trond Myklebust 2007-09-28 20:43 ` Andrew Morton 0 siblings, 1 reply; 46+ messages in thread From: Trond Myklebust @ 2007-09-28 20:32 UTC (permalink / raw) To: Andrew Morton; +Cc: chakriin5, linux-pm, linux-kernel, nfs, a.p.zijlstra On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote: > On Fri, 28 Sep 2007 15:52:28 -0400 > Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote: > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > Looking back, they were getting caught up in > > > > balance_dirty_pages_ratelimited() and friends. See the attached > > > > example... > > > > > > that one is nfs-on-loopback, which is a special case, isn't it? > > > > I'm not sure that the hang that is illustrated here is so special. It is > > an example of a bog-standard ext3 write, that ends up calling the NFS > > client, which is hanging. The fact that it happens to be hanging on the > > nfsd process is more or less irrelevant here: the same thing could > > happen to any other process in the case where we have an NFS server that > > is down. > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim? > > We should be able to fix that by marking the backing device as > write-congested. That'll have small race windows, but it should be a 99.9% > fix? No. The problem would rather appear to be that we're doing per-backing_dev writeback (if I read sync_sb_inodes() correctly), but we're measuring variables which are global to the VM. The backing device that we are selecting may not be writing out any dirty pages, in which case, we're just spinning in balance_dirty_pages_ratelimited(). Should we therefore perhaps be looking at adding per-backing_dev stats too? 
Trond
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 20:32 ` Trond Myklebust @ 2007-09-28 20:43 ` Andrew Morton 2007-09-28 21:36 ` Chakri n 0 siblings, 1 reply; 46+ messages in thread From: Andrew Morton @ 2007-09-28 20:43 UTC (permalink / raw) To: Trond Myklebust; +Cc: chakriin5, linux-pm, linux-kernel, nfs, a.p.zijlstra On Fri, 28 Sep 2007 16:32:18 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote: > > On Fri, 28 Sep 2007 15:52:28 -0400 > > Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote: > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > > Looking back, they were getting caught up in > > > > > balance_dirty_pages_ratelimited() and friends. See the attached > > > > > example... > > > > > > > > that one is nfs-on-loopback, which is a special case, isn't it? > > > > > > I'm not sure that the hang that is illustrated here is so special. It is > > > an example of a bog-standard ext3 write, that ends up calling the NFS > > > client, which is hanging. The fact that it happens to be hanging on the > > > nfsd process is more or less irrelevant here: the same thing could > > > happen to any other process in the case where we have an NFS server that > > > is down. > > > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim? > > > > We should be able to fix that by marking the backing device as > > write-congested. That'll have small race windows, but it should be a 99.9% > > fix? > > No. The problem would rather appear to be that we're doing > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but > we're measuring variables which are global to the VM. 
The backing device > that we are selecting may not be writing out any dirty pages, in which > case, we're just spinning in balance_dirty_pages_ratelimited(). OK, so it's unrelated to page reclaim. > Should we therefore perhaps be looking at adding per-backing_dev stats > too? That's what mm-per-device-dirty-threshold.patch and friends are doing. Whether it works adequately is not really known at this time. Unfortunately kernel developers don't test -mm much. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 20:43 ` Andrew Morton @ 2007-09-28 21:36 ` Chakri n 2007-09-28 23:33 ` Chakri n 0 siblings, 1 reply; 46+ messages in thread From: Chakri n @ 2007-09-28 21:36 UTC (permalink / raw) To: Andrew Morton; +Cc: Trond Myklebust, linux-pm, linux-kernel, nfs, a.p.zijlstra Here is a the snapshot of vmstats when the problem happened. I believe this could help a little. crash> kmem -V NR_FREE_PAGES: 680853 NR_INACTIVE: 95380 NR_ACTIVE: 26891 NR_ANON_PAGES: 2507 NR_FILE_MAPPED: 1832 NR_FILE_PAGES: 119779 NR_FILE_DIRTY: 0 NR_WRITEBACK: 18272 NR_SLAB_RECLAIMABLE: 1305 NR_SLAB_UNRECLAIMABLE: 2085 NR_PAGETABLE: 123 NR_UNSTABLE_NFS: 0 NR_BOUNCE: 0 NR_VMSCAN_WRITE: 0 In my testing, I always saw the processes are waiting in balance_dirty_pages_ratelimited(), never in throttle_vm_writeout() path. But this could be because I have about 4Gig of memory in the system and plenty of mem is still available around. I will rerun the test limiting memory to 1024MB and lets see if it takes in any different path. Thanks --Chakri On 9/28/07, Andrew Morton <akpm@linux-foundation.org> wrote: > On Fri, 28 Sep 2007 16:32:18 -0400 > Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote: > > > On Fri, 28 Sep 2007 15:52:28 -0400 > > > Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote: > > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > > > Looking back, they were getting caught up in > > > > > > balance_dirty_pages_ratelimited() and friends. See the attached > > > > > > example... > > > > > > > > > > that one is nfs-on-loopback, which is a special case, isn't it? > > > > > > > > I'm not sure that the hang that is illustrated here is so special. 
It is > > > > an example of a bog-standard ext3 write, that ends up calling the NFS > > > > client, which is hanging. The fact that it happens to be hanging on the > > > > nfsd process is more or less irrelevant here: the same thing could > > > > happen to any other process in the case where we have an NFS server that > > > > is down. > > > > > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim? > > > > > > We should be able to fix that by marking the backing device as > > > write-congested. That'll have small race windows, but it should be a 99.9% > > > fix? > > > > No. The problem would rather appear to be that we're doing > > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but > > we're measuring variables which are global to the VM. The backing device > > that we are selecting may not be writing out any dirty pages, in which > > case, we're just spinning in balance_dirty_pages_ratelimited(). > > OK, so it's unrelated to page reclaim. > > > Should we therefore perhaps be looking at adding per-backing_dev stats > > too? > > That's what mm-per-device-dirty-threshold.patch and friends are doing. > Whether it works adequately is not really known at this time. > Unfortunately kernel developers don't test -mm much. > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 21:36 ` Chakri n @ 2007-09-28 23:33 ` Chakri n 0 siblings, 0 replies; 46+ messages in thread From: Chakri n @ 2007-09-28 23:33 UTC (permalink / raw) To: Andrew Morton; +Cc: Trond Myklebust, linux-pm, linux-kernel, nfs, a.p.zijlstra No change in behavior even in case of low memory systems. I confirmed it running on 1Gig machine. Thanks --Chakri On 9/28/07, Chakri n <chakriin5@gmail.com> wrote: > Here is a the snapshot of vmstats when the problem happened. I believe > this could help a little. > > crash> kmem -V > NR_FREE_PAGES: 680853 > NR_INACTIVE: 95380 > NR_ACTIVE: 26891 > NR_ANON_PAGES: 2507 > NR_FILE_MAPPED: 1832 > NR_FILE_PAGES: 119779 > NR_FILE_DIRTY: 0 > NR_WRITEBACK: 18272 > NR_SLAB_RECLAIMABLE: 1305 > NR_SLAB_UNRECLAIMABLE: 2085 > NR_PAGETABLE: 123 > NR_UNSTABLE_NFS: 0 > NR_BOUNCE: 0 > NR_VMSCAN_WRITE: 0 > > In my testing, I always saw the processes are waiting in > balance_dirty_pages_ratelimited(), never in throttle_vm_writeout() > path. > > But this could be because I have about 4Gig of memory in the system > and plenty of mem is still available around. > > I will rerun the test limiting memory to 1024MB and lets see if it > takes in any different path. > > Thanks > --Chakri > > > On 9/28/07, Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 28 Sep 2007 16:32:18 -0400 > > Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote: > > > > On Fri, 28 Sep 2007 15:52:28 -0400 > > > > Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > > > > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote: > > > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > > > > Looking back, they were getting caught up in > > > > > > > balance_dirty_pages_ratelimited() and friends. See the attached > > > > > > > example... 
> > > > > > > > > > > > that one is nfs-on-loopback, which is a special case, isn't it? > > > > > > > > > > I'm not sure that the hang that is illustrated here is so special. It is > > > > > an example of a bog-standard ext3 write, that ends up calling the NFS > > > > > client, which is hanging. The fact that it happens to be hanging on the > > > > > nfsd process is more or less irrelevant here: the same thing could > > > > > happen to any other process in the case where we have an NFS server that > > > > > is down. > > > > > > > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim? > > > > > > > > We should be able to fix that by marking the backing device as > > > > write-congested. That'll have small race windows, but it should be a 99.9% > > > > fix? > > > > > > No. The problem would rather appear to be that we're doing > > > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but > > > we're measuring variables which are global to the VM. The backing device > > > that we are selecting may not be writing out any dirty pages, in which > > > case, we're just spinning in balance_dirty_pages_ratelimited(). > > > > OK, so it's unrelated to page reclaim. > > > > > Should we therefore perhaps be looking at adding per-backing_dev stats > > > too? > > > > That's what mm-per-device-dirty-threshold.patch and friends are doing. > > Whether it works adequately is not really known at this time. > > Unfortunately kernel developers don't test -mm much. > > > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 19:52 ` Trond Myklebust 2007-09-28 20:10 ` Andrew Morton @ 2007-09-28 20:24 ` Daniel Phillips 1 sibling, 0 replies; 46+ messages in thread From: Daniel Phillips @ 2007-09-28 20:24 UTC (permalink / raw) To: Trond Myklebust Cc: Andrew Morton, Chakri n, linux-pm, lkml, nfs, Peter Zijlstra On Friday 28 September 2007 12:52, Trond Myklebust wrote: > I'm not sure that the hang that is illustrated here is so special. It > is an example of a bog-standard ext3 write, that ends up calling the > NFS client, which is hanging. The fact that it happens to be hanging > on the nfsd process is more or less irrelevant here: the same thing > could happen to any other process in the case where we have an NFS > server that is down. Hi Trond, Could you clarify what you meant by "calling the NFS client"? I don't see any direct call in the posted backtrace. Regards, Daniel ^ permalink raw reply [flat|nested] 46+ messages in thread
* KDB? 2007-09-28 19:16 ` A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) Trond Myklebust 2007-09-28 19:26 ` Andrew Morton @ 2007-09-29 1:51 ` Daniel Phillips 1 sibling, 0 replies; 46+ messages in thread From: Daniel Phillips @ 2007-09-29 1:51 UTC (permalink / raw) To: Trond Myklebust Cc: Andrew Morton, Chakri n, linux-pm, lkml, nfs, Peter Zijlstra On Friday 28 September 2007 12:16, Trond Myklebust wrote: > crash> bt 3188 crash> ps|grep ps Hey, that looks just like kdb! But I heard that kgdb is better in every way than kdb. Innocently yours, Daniel ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-28 6:50 ` Andrew Morton ` (2 preceding siblings ...) 2007-09-28 17:00 ` Trond Myklebust @ 2007-09-29 0:46 ` Daniel Phillips 3 siblings, 0 replies; 46+ messages in thread From: Daniel Phillips @ 2007-09-29 0:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Chakri n, linux-pm, lkml, nfs, Peter Zijlstra On Thursday 27 September 2007 23:50, Andrew Morton wrote: > Actually we perhaps could address this at the VFS level in another > way. Processes which are writing to the dead NFS server will > eventually block in balance_dirty_pages() once they've exceeded the > memory limits and will remain blocked until the server wakes up - > that's the behaviour we want. It is not necessary to restrict total dirty pages at all. Instead it is necessary to restrict total writeout in flight. This is evident from the fact that making progress is the one and only reason our kernel exists, and writeout is how we make progress clearing memory. In other words, if we guarantee the progress of writeout, we will live happily ever after and not have to sell the farm. The current situation has an eerily similar feeling to the VM instability in early 2.4, which was never solved until we convinced ourselves that the only way to deal with Moore's law as applied to number of memory pages was to implement positive control of swapout in the form of reverse mapping[1]. This time round, we need to add positive control of writeout in the form of rate limiting. I _think_ Peter is with me on this, and not only that, but between the too of us we already have patches for most of the subsystems that need it, and we have both been busy testing (different subsets of) these patches to destruction for the better part of a year. 
Anyway, to fix the immediate bug before the one true dirty_limit removal patch lands (promise) I think you are on the right track by noticing that balance_dirty_pages has to become aware of how congested the involved block device is, since blocking a writeout process on an underused block device is clearly a bad idea. Note how much this idea looks like rate limiting. [1] We lost the scent for a number of reasons, not least because the experimental implementation of reverse mapping at the time was buggy for reasons entirely unrelated to the reverse mapping itself. Regards, Daniel ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) [not found] ` <20070929110454.GA29861@mail.ustc.edu.cn> @ 2007-09-29 11:04 ` Fengguang Wu 2007-09-29 11:48 ` Peter Zijlstra 2007-10-01 15:57 ` Chuck Ebbert 0 siblings, 2 replies; 46+ messages in thread From: Fengguang Wu @ 2007-09-29 11:04 UTC (permalink / raw) To: Chakri n; +Cc: Peter Zijlstra, Krzysztof Oledzki, akpm, linux-pm, lkml On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote: > Hi, > > In my testing, a unresponsive file system can hang all I/O in the system. > This is not seen in 2.4. > > I started 20 threads doing I/O on a NFS share. They are just doing 4K > writes in a loop. > > Now I stop NFS server hosting the NFS share and start a > "dd" process to write a file on local EXT3 file system. > > # dd if=/dev/zero of=/tmp/x count=1000 > > This process never progresses. Peter, do you think this patch will help? === writeback: avoid possible balance_dirty_pages() lockup on light-load bdi On a busy-writing system, a writer could be hold up infinitely on a light-load device. It will be trying to sync more than enough dirty data. The problem case: 0. sda/nr_dirty >= dirty_limit; sdb/nr_dirty == 0 1. dd writes 32 pages on sdb 2. balance_dirty_pages() blocks dd, and tries to write 6MB. 3. it never gets there: there's only 128KB dirty data. 4. dd may be blocked for a loooong time as long as sda is overloaded Fix it by returning on 'zero dirty inodes' in the current bdi. (In fact there are slight differences between 'dirty inodes' and 'dirty pages'. But there is no available counters for 'dirty pages'.) 
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> --- mm/page-writeback.c | 3 +++ 1 file changed, 3 insertions(+) --- linux-2.6.22.orig/mm/page-writeback.c +++ linux-2.6.22/mm/page-writeback.c @@ -227,6 +227,9 @@ static void balance_dirty_pages(struct a if (nr_reclaimable + global_page_state(NR_WRITEBACK) <= dirty_thresh) break; + if (list_empty(&mapping->host->i_sb->s_dirty) && + list_empty(&mapping->host->i_sb->s_io)) + break; if (!dirty_exceeded) dirty_exceeded = 1; ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-29 11:04 ` Fengguang Wu @ 2007-09-29 11:48 ` Peter Zijlstra [not found] ` <20070929122842.GA5454@mail.ustc.edu.cn> 2007-10-01 15:57 ` Chuck Ebbert 1 sibling, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2007-09-29 11:48 UTC (permalink / raw) To: Fengguang Wu; +Cc: Chakri n, Krzysztof Oledzki, akpm, linux-pm, lkml On Sat, 2007-09-29 at 19:04 +0800, Fengguang Wu wrote: > On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote: > > Hi, > > > > In my testing, a unresponsive file system can hang all I/O in the system. > > This is not seen in 2.4. > > > > I started 20 threads doing I/O on a NFS share. They are just doing 4K > > writes in a loop. > > > > Now I stop NFS server hosting the NFS share and start a > > "dd" process to write a file on local EXT3 file system. > > > > # dd if=/dev/zero of=/tmp/x count=1000 > > > > This process never progresses. > > Peter, do you think this patch will help? In another sub-thread: > It's works on .23-rc8-mm2 with out any problems. > > "dd" process does not hang any more. > > Thanks for all the help. > > Cheers > --Chakri So the per-bdi dirty patches that are in -mm already fix the problem. > === > writeback: avoid possible balance_dirty_pages() lockup on light-load bdi > > On a busy-writing system, a writer could be hold up infinitely on a > light-load device. It will be trying to sync more than enough dirty data. > > The problem case: > > 0. sda/nr_dirty >= dirty_limit; > sdb/nr_dirty == 0 > 1. dd writes 32 pages on sdb > 2. balance_dirty_pages() blocks dd, and tries to write 6MB. > 3. it never gets there: there's only 128KB dirty data. > 4. dd may be blocked for a loooong time as long as sda is overloaded > > Fix it by returning on 'zero dirty inodes' in the current bdi. > (In fact there are slight differences between 'dirty inodes' and 'dirty pages'. 
> But there is no available counters for 'dirty pages'.) > > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> > --- > mm/page-writeback.c | 3 +++ > 1 file changed, 3 insertions(+) > > --- linux-2.6.22.orig/mm/page-writeback.c > +++ linux-2.6.22/mm/page-writeback.c > @@ -227,6 +227,9 @@ static void balance_dirty_pages(struct a > if (nr_reclaimable + global_page_state(NR_WRITEBACK) <= > dirty_thresh) > break; > + if (list_empty(&mapping->host->i_sb->s_dirty) && > + list_empty(&mapping->host->i_sb->s_io)) > + break; > > if (!dirty_exceeded) > dirty_exceeded = 1; > On the patch itself, not sure if it would have been enough. As soon as there is a single dirty inode on the list one would get caught in the same problem as before. That is, if NFS_dirty+NFS_unstable+NFS_writeback > dirty_limit this break won't fix it. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) [not found] ` <20070929122842.GA5454@mail.ustc.edu.cn> @ 2007-09-29 12:28 ` Fengguang Wu 2007-09-29 14:43 ` Peter Zijlstra 0 siblings, 1 reply; 46+ messages in thread From: Fengguang Wu @ 2007-09-29 12:28 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Chakri n, Krzysztof Oledzki, akpm, linux-pm, lkml On Sat, Sep 29, 2007 at 01:48:01PM +0200, Peter Zijlstra wrote: > > On Sat, 2007-09-29 at 19:04 +0800, Fengguang Wu wrote: > > On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote: > > > Hi, > > > > > > In my testing, a unresponsive file system can hang all I/O in the system. > > > This is not seen in 2.4. > > > > > > I started 20 threads doing I/O on a NFS share. They are just doing 4K > > > writes in a loop. > > > > > > Now I stop NFS server hosting the NFS share and start a > > > "dd" process to write a file on local EXT3 file system. > > > > > > # dd if=/dev/zero of=/tmp/x count=1000 > > > > > > This process never progresses. > > > > Peter, do you think this patch will help? > > In another sub-thread: > > > It's works on .23-rc8-mm2 with out any problems. > > > > "dd" process does not hang any more. > > > > Thanks for all the help. > > > > Cheers > > --Chakri > > So the per-bdi dirty patches that are in -mm already fix the problem. That's good. But still it could be a good candidate for 2.6.22.x or even 2.6.23. > > === > > writeback: avoid possible balance_dirty_pages() lockup on light-load bdi > > > > On a busy-writing system, a writer could be hold up infinitely on a > > light-load device. It will be trying to sync more than enough dirty data. > > > > The problem case: > > > > 0. sda/nr_dirty >= dirty_limit; > > sdb/nr_dirty == 0 > > 1. dd writes 32 pages on sdb > > 2. balance_dirty_pages() blocks dd, and tries to write 6MB. > > 3. it never gets there: there's only 128KB dirty data. > > 4. 
dd may be blocked for a loooong time as long as sda is overloaded > > > > Fix it by returning on 'zero dirty inodes' in the current bdi. > > (In fact there are slight differences between 'dirty inodes' and 'dirty pages'. > > But there is no available counters for 'dirty pages'.) > > > > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> > > --- > > mm/page-writeback.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > --- linux-2.6.22.orig/mm/page-writeback.c > > +++ linux-2.6.22/mm/page-writeback.c > > @@ -227,6 +227,9 @@ static void balance_dirty_pages(struct a > > if (nr_reclaimable + global_page_state(NR_WRITEBACK) <= > > dirty_thresh) > > break; > > + if (list_empty(&mapping->host->i_sb->s_dirty) && > > + list_empty(&mapping->host->i_sb->s_io)) > > + break; > > > > if (!dirty_exceeded) > > dirty_exceeded = 1; > > > > On the patch itself, not sure if it would have been enough. As soon as > there is a single dirty inode on the list one would get caught in the > same problem as before. That should not be a problem. Normally the few new dirty inodes will be all cleaned in one go and there are no more dirty inodes left(at least for a moment). Hmm, I guess the new 'break' should be moved immediately after writeback_inodes()... > That is, if NFS_dirty+NFS_unstable+NFS_writeback > dirty_limit this > break won't fix it. In fact this patch exactly targets at this condition. When NFS* < dirty_limit, Chakri won't see the lockup at all. The problem was, there are only two 'break's in the loop, and neither one evaluates to true for his dd command. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-29 12:28 ` Fengguang Wu @ 2007-09-29 14:43 ` Peter Zijlstra 0 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2007-09-29 14:43 UTC (permalink / raw) To: Fengguang Wu; +Cc: Chakri n, Krzysztof Oledzki, akpm, linux-pm, lkml On Sat, 2007-09-29 at 20:28 +0800, Fengguang Wu wrote: > On Sat, Sep 29, 2007 at 01:48:01PM +0200, Peter Zijlstra wrote: > > On the patch itself, not sure if it would have been enough. As soon as > > there is a single dirty inode on the list one would get caught in the > > same problem as before. > > That should not be a problem. Normally the few new dirty inodes will > be all cleaned in one go and there are no more dirty inodes left(at > least for a moment). Hmm, I guess the new 'break' should be moved > immediately after writeback_inodes()... > > > That is, if NFS_dirty+NFS_unstable+NFS_writeback > dirty_limit this > > break won't fix it. > > In fact this patch exactly targets at this condition. > When NFS* < dirty_limit, Chakri won't see the lockup at all. > The problem was, there are only two 'break's in the loop, and neither > one evaluates to true for his dd command. Yeah indeed, when put in the loop, after writeback_inodes() it makes sense. No idea what I was thinking, must be one of those days... :-/ ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) 2007-09-29 11:04 ` Fengguang Wu 2007-09-29 11:48 ` Peter Zijlstra @ 2007-10-01 15:57 ` Chuck Ebbert [not found] ` <20071002020040.GA5275@mail.ustc.edu.cn> 1 sibling, 1 reply; 46+ messages in thread From: Chuck Ebbert @ 2007-10-01 15:57 UTC (permalink / raw) To: Fengguang Wu Cc: Chakri n, Peter Zijlstra, Krzysztof Oledzki, akpm, linux-pm, lkml, richard kennedy On 09/29/2007 07:04 AM, Fengguang Wu wrote: > On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote: >> Hi, >> >> In my testing, a unresponsive file system can hang all I/O in the system. >> This is not seen in 2.4. >> >> I started 20 threads doing I/O on a NFS share. They are just doing 4K >> writes in a loop. >> >> Now I stop NFS server hosting the NFS share and start a >> "dd" process to write a file on local EXT3 file system. >> >> # dd if=/dev/zero of=/tmp/x count=1000 >> >> This process never progresses. > > Peter, do you think this patch will help? > > === > writeback: avoid possible balance_dirty_pages() lockup on light-load bdi > > On a busy-writing system, a writer could be hold up infinitely on a > light-load device. It will be trying to sync more than enough dirty data. > > The problem case: > > 0. sda/nr_dirty >= dirty_limit; > sdb/nr_dirty == 0 > 1. dd writes 32 pages on sdb > 2. balance_dirty_pages() blocks dd, and tries to write 6MB. > 3. it never gets there: there's only 128KB dirty data. > 4. dd may be blocked for a loooong time as long as sda is overloaded > > Fix it by returning on 'zero dirty inodes' in the current bdi. > (In fact there are slight differences between 'dirty inodes' and 'dirty pages'. > But there is no available counters for 'dirty pages'.) 
> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> > --- > mm/page-writeback.c | 3 +++ > 1 file changed, 3 insertions(+) > > --- linux-2.6.22.orig/mm/page-writeback.c > +++ linux-2.6.22/mm/page-writeback.c > @@ -227,6 +227,9 @@ static void balance_dirty_pages(struct a > if (nr_reclaimable + global_page_state(NR_WRITEBACK) <= > dirty_thresh) > break; > + if (list_empty(&mapping->host->i_sb->s_dirty) && > + list_empty(&mapping->host->i_sb->s_io)) > + break; > > if (!dirty_exceeded) > dirty_exceeded = 1; > This looks better than the other candidate to fix the problem. Are we going to fix 2.6.23 before release? Multiple people have reported this problem now... ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi [not found] ` <20071002020040.GA5275@mail.ustc.edu.cn> @ 2007-10-02 2:00 ` Fengguang Wu 2007-10-02 2:14 ` Andrew Morton 2007-10-03 12:46 ` richard kennedy 1 sibling, 1 reply; 46+ messages in thread From: Fengguang Wu @ 2007-10-02 2:00 UTC (permalink / raw) To: Chuck Ebbert Cc: Greg KH, Chakri n, Peter Zijlstra, Krzysztof Oledzki, akpm, linux-pm, lkml, richard kennedy On Mon, Oct 01, 2007 at 11:57:34AM -0400, Chuck Ebbert wrote: > On 09/29/2007 07:04 AM, Fengguang Wu wrote: > > On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote: > >> Hi, > >> > >> In my testing, a unresponsive file system can hang all I/O in the system. > >> This is not seen in 2.4. > >> > >> I started 20 threads doing I/O on a NFS share. They are just doing 4K > >> writes in a loop. > >> > >> Now I stop NFS server hosting the NFS share and start a > >> "dd" process to write a file on local EXT3 file system. > >> > >> # dd if=/dev/zero of=/tmp/x count=1000 > >> > >> This process never progresses. > > > > Peter, do you think this patch will help? > > > > === > > writeback: avoid possible balance_dirty_pages() lockup on light-load bdi [...] > This looks better than the other candidate to fix the problem. Are we going > to fix 2.6.23 before release? Multiple people have reported this problem now... (expecting real world confirmations...) Here is a new safer version. It's more ugly though. --- writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi On a busy-writing system, a writer could be hold up infinitely on a light-load device. It will be trying to sync more than available dirty data. The problem case: 0. sda/nr_dirty >= dirty_limit; sdb/nr_dirty == 0 1. dd writes 32 pages on sdb 2. balance_dirty_pages() blocks dd, and tries to write 6MB. 3. it never gets there: there's only 128KB dirty data. 4. 
dd may be blocked for a loooong time Fix it by returning on 'zero dirty inodes' in the current bdi. (In fact there are slight differences between 'dirty inodes' and 'dirty pages'. But there is no available counters for 'dirty pages'.) But the newly introduced 'break' could make the nr_writeback drift away above the dirty limit. The workaround is to limit the error under 1MB. Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> --- mm/page-writeback.c | 5 +++++ 1 file changed, 5 insertions(+) --- linux-2.6.22.orig/mm/page-writeback.c +++ linux-2.6.22/mm/page-writeback.c @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a pages_written += write_chunk - wbc.nr_to_write; if (pages_written >= write_chunk) break; /* We've done our duty */ + if (list_empty(&mapping->host->i_sb->s_dirty) && + list_empty(&mapping->host->i_sb->s_io) && + nr_reclaimable + global_page_state(NR_WRITEBACK) <= + dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT))) + break; } congestion_wait(WRITE, HZ/10); } ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi 2007-10-02 2:00 ` [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi Fengguang Wu @ 2007-10-02 2:14 ` Andrew Morton [not found] ` <20071002121327.GA5718@mail.ustc.edu.cn> 0 siblings, 1 reply; 46+ messages in thread From: Andrew Morton @ 2007-10-02 2:14 UTC (permalink / raw) To: Fengguang Wu Cc: Chuck Ebbert, Greg KH, Chakri n, Peter Zijlstra, Krzysztof Oledzki, linux-pm, lkml, richard kennedy On Tue, 2 Oct 2007 10:00:40 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote: > writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi > > On a busy-writing system, a writer could be hold up infinitely on a > light-load device. It will be trying to sync more than available dirty data. > > The problem case: > > 0. sda/nr_dirty >= dirty_limit; > sdb/nr_dirty == 0 > 1. dd writes 32 pages on sdb > 2. balance_dirty_pages() blocks dd, and tries to write 6MB. > 3. it never gets there: there's only 128KB dirty data. > 4. dd may be blocked for a loooong time Please quantify loooong. > Fix it by returning on 'zero dirty inodes' in the current bdi. > (In fact there are slight differences between 'dirty inodes' and 'dirty pages'. > But there is no available counters for 'dirty pages'.) > > But the newly introduced 'break' could make the nr_writeback drift away > above the dirty limit. The workaround is to limit the error under 1MB. I'm still not sure that we fully understand this yet. If the sdb writer is stuck in balance_dirty_pages() then all sda writers will be in balance_dirty_pages() too, madly writing stuff out to sda. And pdflush will be writing out sda as well. All this writeout to sda should release the sdb writer. Why isn't this happening? 
> Cc: Chuck Ebbert <cebbert@redhat.com> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> > --- > mm/page-writeback.c | 5 +++++ > 1 file changed, 5 insertions(+) > > --- linux-2.6.22.orig/mm/page-writeback.c > +++ linux-2.6.22/mm/page-writeback.c > @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a > pages_written += write_chunk - wbc.nr_to_write; > if (pages_written >= write_chunk) > break; /* We've done our duty */ > + if (list_empty(&mapping->host->i_sb->s_dirty) && > + list_empty(&mapping->host->i_sb->s_io) && > + nr_reclaimable + global_page_state(NR_WRITEBACK) <= > + dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT))) > + break; > } > congestion_wait(WRITE, HZ/10); > } Well that has a nice safetly net. Perhaps it could fail a bit later on, but that depends on why it's failing. How well tested was this? If we merge this for 2.6.23 then I expect that we'll immediately unmerge it for 2.6.24 because Peter's stuff fixes this problem by other means. Do we all agree with the above sentence? ^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <20071002121327.GA5718@mail.ustc.edu.cn>]
* Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
  [not found] ` <20071002121327.GA5718@mail.ustc.edu.cn>
@ 2007-10-02 12:13 ` Fengguang Wu
  [not found] ` <20071002132702.GA10967@mail.ustc.edu.cn>
  1 sibling, 0 replies; 46+ messages in thread
From: Fengguang Wu @ 2007-10-02 12:13 UTC (permalink / raw)
To: Andrew Morton
Cc: Chuck Ebbert, Greg KH, Chakri n, Peter Zijlstra, Krzysztof Oledzki,
	linux-pm, lkml, richard kennedy, Ingo Molnar

On Mon, Oct 01, 2007 at 07:14:57PM -0700, Andrew Morton wrote:
> On Tue, 2 Oct 2007 10:00:40 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
>
> > writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
> >
> > On a busy-writing system, a writer could be held up indefinitely on a
> > light-load device. It will be trying to sync more than the available dirty data.
> >
> > The problem case:
> >
> > 0. sda/nr_dirty >= dirty_limit;
> >    sdb/nr_dirty == 0
> > 1. dd writes 32 pages on sdb
> > 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> > 3. it never gets there: there's only 128KB of dirty data.
> > 4. dd may be blocked for a loooong time
>
> Please quantify loooong.

There are only two 'break' conditions in the loop:

1. nr_dirty + nr_unstable + nr_writeback < dirty_limit
   => *mostly* FALSE for a busy system
   => *always* FALSE in Chakri's stuck NFS case

2. nr_written >= 6MB
   For a light-load bdi:
   => *never* TRUE until many new writers arrive, contributing more
      dirty pages to sync
   => worse, those new writers will also get stuck here...

The obvious imbalance here is: each writer contributes only 32KB of new
dirty pages, but wants to consume (not necessarily available) 6MB.

So loooong = min(global-less-busy-time, bdi-many-new-writers-arrival-time).

> > Fix it by returning on 'zero dirty inodes' in the current bdi.
> > (In fact there are slight differences between 'dirty inodes' and 'dirty
> > pages', but there are no available counters for 'dirty pages'.)
> >
> > But the newly introduced 'break' could make nr_writeback drift away
> > above the dirty limit. The workaround is to limit the error to under 1MB.
>
> I'm still not sure that we fully understand this yet.
>
> If the sdb writer is stuck in balance_dirty_pages() then all sda writers
> will be in balance_dirty_pages() too, madly writing stuff out to sda. And
> pdflush will be writing out sda as well. All this writeout to sda should
> release the sdb writer.
>
> Why isn't this happening?

Your reasoning is right. The exact consequence is: the light-load sdb is
made as _unresponsive_ as the busy sda.

Hence Chakri's case: whenever NFS is stuck, every device gets stuck.

> > Cc: Chuck Ebbert <cebbert@redhat.com>
> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > ---
> >  mm/page-writeback.c |    5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > --- linux-2.6.22.orig/mm/page-writeback.c
> > +++ linux-2.6.22/mm/page-writeback.c
> > @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
> >  		pages_written += write_chunk - wbc.nr_to_write;
> >  		if (pages_written >= write_chunk)
> >  			break;		/* We've done our duty */
> > +		if (list_empty(&mapping->host->i_sb->s_dirty) &&
> > +		    list_empty(&mapping->host->i_sb->s_io) &&
> > +		    nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> > +		    dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> > +			break;
> >  		}
> >  		congestion_wait(WRITE, HZ/10);
> >  	}
>
> Well that has a nice safety net. Perhaps it could fail a bit later on,
> but that depends on why it's failing.

In theory, every CPU/parallel writer could contribute 8 pages of error.
Hence we get 1MB/32KB = 32 (CPUs/writers).

One more serious problem is, a busy writer could also drain all the
dirty pages and make (nr_writeback == dirty_limit+1MB). In that case, I
suspect the light-load sdb writer still has a good chance to make
progress (needs confirmation).

> How well tested was this?

Not well tested till now.
My system becomes unusable soon after starting the NFS write (even before
pulling the network plug). I'm seeing large latencies in
try_to_wake_up(). I hope Ingo can help with that.

> If we merge this for 2.6.23 then I expect that we'll immediately unmerge
> it for 2.6.24, because Peter's stuff fixes this problem by other means.
>
> Do we all agree with the above sentence?

Yeah, Peter and I were both aware of the timing. This patch is only meant
for 2.6.23 and 2.6.22.10.

Fengguang

^ permalink raw reply	[flat|nested] 46+ messages in thread
[parent not found: <20071002132702.GA10967@mail.ustc.edu.cn>]
* Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
  [not found] ` <20071002132702.GA10967@mail.ustc.edu.cn>
@ 2007-10-02 13:27 ` Fengguang Wu
  2007-10-02 18:35 ` Chuck Ebbert
  0 siblings, 1 reply; 46+ messages in thread
From: Fengguang Wu @ 2007-10-02 13:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Chuck Ebbert, Greg KH, Chakri n, Peter Zijlstra, Krzysztof Oledzki,
	linux-pm, lkml, richard kennedy, Ingo Molnar

On Tue, Oct 02, 2007 at 08:13:27PM +0800, Fengguang Wu wrote:
[...]
> One more serious problem is, a busy writer could also drain all the
> dirty pages and make (nr_writeback == dirty_limit+1MB). In that case,
> I suspect the light-load sdb writer still has a good chance to make
> progress (needs confirmation).

Well, it seems to be a really tricky issue without knowing the per-bdi
numbers. Maybe we could just encourage users to upgrade to 2.6.24...

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
  2007-10-02 13:27 ` Fengguang Wu
@ 2007-10-02 18:35 ` Chuck Ebbert
  0 siblings, 0 replies; 46+ messages in thread
From: Chuck Ebbert @ 2007-10-02 18:35 UTC (permalink / raw)
To: Fengguang Wu
Cc: Andrew Morton, Greg KH, Chakri n, Peter Zijlstra, Krzysztof Oledzki,
	linux-pm, lkml, richard kennedy, Ingo Molnar

On 10/02/2007 09:27 AM, Fengguang Wu wrote:
> On Tue, Oct 02, 2007 at 08:13:27PM +0800, Fengguang Wu wrote:
> [...]
>> One more serious problem is, a busy writer could also drain all the
>> dirty pages and make (nr_writeback == dirty_limit+1MB). In that case,
>> I suspect the light-load sdb writer still has a good chance to make
>> progress (needs confirmation).
>
> Well, it seems to be a really tricky issue without knowing the per-bdi
> numbers. Maybe we could just encourage users to upgrade to 2.6.24...
>

Yeah, and if that doesn't work there's always 2.6.25... :/

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
  [not found] ` <20071002020040.GA5275@mail.ustc.edu.cn>
  2007-10-02  2:00 ` [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi Fengguang Wu
@ 2007-10-03 12:46 ` richard kennedy
  [not found] ` <20071004015053.GA5789@mail.ustc.edu.cn>
  1 sibling, 1 reply; 46+ messages in thread
From: richard kennedy @ 2007-10-03 12:46 UTC (permalink / raw)
To: Fengguang Wu
Cc: Chuck Ebbert, Greg KH, Chakri n, Peter Zijlstra, Krzysztof Oledzki,
	akpm, linux-pm, lkml

On Tue, 2007-10-02 at 10:00 +0800, Fengguang Wu wrote:
> On Mon, Oct 01, 2007 at 11:57:34AM -0400, Chuck Ebbert wrote:
> > On 09/29/2007 07:04 AM, Fengguang Wu wrote:
...
> > (expecting real world confirmations...)
>
> Here is a new, safer version. It's more ugly though.
>
> ---
> writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
>
> On a busy-writing system, a writer could be held up indefinitely on a
> light-load device. It will be trying to sync more than the available dirty data.
>
> The problem case:
>
> 0. sda/nr_dirty >= dirty_limit;
>    sdb/nr_dirty == 0
> 1. dd writes 32 pages on sdb
> 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> 3. it never gets there: there's only 128KB of dirty data.
> 4. dd may be blocked for a loooong time
>
> Fix it by returning on 'zero dirty inodes' in the current bdi.
> (In fact there are slight differences between 'dirty inodes' and 'dirty
> pages', but there are no available counters for 'dirty pages'.)
>
> But the newly introduced 'break' could make nr_writeback drift away
> above the dirty limit. The workaround is to limit the error to under 1MB.
>
> Cc: Chuck Ebbert <cebbert@redhat.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> ---
>  mm/page-writeback.c |    5 +++++
>  1 file changed, 5 insertions(+)
>
> --- linux-2.6.22.orig/mm/page-writeback.c
> +++ linux-2.6.22/mm/page-writeback.c
> @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
>  		pages_written += write_chunk - wbc.nr_to_write;
>  		if (pages_written >= write_chunk)
>  			break;		/* We've done our duty */
> +		if (list_empty(&mapping->host->i_sb->s_dirty) &&
> +		    list_empty(&mapping->host->i_sb->s_io) &&
> +		    nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> +		    dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> +			break;
>  		}
>  		congestion_wait(WRITE, HZ/10);
>  	}

I've been testing 2.6.23-rc9 + this patch all morning but have just seen
a lockup. As usual it happened just after a large file copy finished and
while nr_dirty was still large. I'm sorry to say I didn't have a serial
console running, so I don't have any other info. I will try again and
see if I can capture some more data.

I did notice that at the beginning of my tests the dirty blocks were
written back more quickly than usual.

nr_dirty count after the copy finished, and then 60 seconds later:

	after copy	+60 seconds
	73520		0
	73533		0
	68554		1

but after several iterations of my testcase, just before the lockup:

	68560		57165
	71974		62896

which is about the same as an unpatched kernel.

Richard

^ permalink raw reply	[flat|nested] 46+ messages in thread
[parent not found: <20071004015053.GA5789@mail.ustc.edu.cn>]
* Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
  [not found] ` <20071004015053.GA5789@mail.ustc.edu.cn>
@ 2007-10-04  1:50 ` Fengguang Wu
  0 siblings, 0 replies; 46+ messages in thread
From: Fengguang Wu @ 2007-10-04  1:50 UTC (permalink / raw)
To: richard kennedy
Cc: Chuck Ebbert, Greg KH, Chakri n, Peter Zijlstra, Krzysztof Oledzki,
	akpm, linux-pm, lkml

On Wed, Oct 03, 2007 at 01:46:52PM +0100, richard kennedy wrote:
> On Tue, 2007-10-02 at 10:00 +0800, Fengguang Wu wrote:
> > ---
> >  mm/page-writeback.c |    5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > --- linux-2.6.22.orig/mm/page-writeback.c
> > +++ linux-2.6.22/mm/page-writeback.c
> > @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
> >  		pages_written += write_chunk - wbc.nr_to_write;
> >  		if (pages_written >= write_chunk)
> >  			break;		/* We've done our duty */
> > +		if (list_empty(&mapping->host->i_sb->s_dirty) &&
> > +		    list_empty(&mapping->host->i_sb->s_io) &&
> > +		    nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> > +		    dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> > +			break;
> >  		}
> >  		congestion_wait(WRITE, HZ/10);
> >  	}
>
> I've been testing 2.6.23-rc9 + this patch all morning but have just seen
> a lockup. As usual it happened just after a large file copy finished and
> while nr_dirty was still large. I'm sorry to say I didn't have a serial
> console running, so I don't have any other info. I will try again and
> see if I can capture some more data.
>
> I did notice that at the beginning of my tests the dirty blocks were
> written back more quickly than usual.
>
> nr_dirty count after the copy finished, and then 60 seconds later:
>
> 	after copy	+60 seconds
> 	73520		0
> 	73533		0
> 	68554		1
>
> but after several iterations of my testcase, just before the lockup:
>
> 	68560		57165
> 	71974		62896
>
> which is about the same as an unpatched kernel.

Hi Richard,

Thank you for the testing. However, my patch is kind of a duplicated
effort now.
I was taking the 'do it if simple' attitude. I can continue to improve it
if you really want it. Otherwise I'd recommend you test the coming
2.6.24-rc1, or backport the -mm writeback patches to 2.6.23 and test
them there. Peter has done a good job on it.

Fengguang

^ permalink raw reply	[flat|nested] 46+ messages in thread
end of thread, other threads:[~2007-10-04 1:51 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-09-28 6:32 A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) Chakri n
2007-09-28 6:50 ` Andrew Morton
2007-09-28 6:59 ` Peter Zijlstra
2007-09-28 8:27 ` Chakri n
2007-09-28 8:40 ` Peter Zijlstra
2007-09-28 9:01 ` Chakri n
2007-09-28 9:12 ` Peter Zijlstra
2007-09-28 9:20 ` Chakri n
2007-09-28 9:23 ` Peter Zijlstra
2007-09-28 10:36 ` Chakri n
2007-09-28 13:28 ` Jonathan Corbet
2007-09-28 13:35 ` Peter Zijlstra
2007-09-28 16:45 ` [linux-pm] " Alan Stern
2007-09-29 1:27 ` Daniel Phillips
2007-09-28 18:04 ` Andrew Morton
2007-09-28 17:00 ` Trond Myklebust
2007-09-28 18:49 ` Andrew Morton
2007-09-28 18:48 ` Peter Zijlstra
2007-09-28 19:16 ` Andrew Morton
2007-10-02 13:36 ` Peter Zijlstra
2007-10-02 15:42 ` Randy Dunlap
2007-10-03 9:28 ` [PATCH] lockstat: documentation Peter Zijlstra
2007-10-03 9:35 ` Ingo Molnar
2007-09-28 19:16 ` A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) Trond Myklebust
2007-09-28 19:26 ` Andrew Morton
2007-09-28 19:52 ` Trond Myklebust
2007-09-28 20:10 ` Andrew Morton
2007-09-28 20:32 ` Trond Myklebust
2007-09-28 20:43 ` Andrew Morton
2007-09-28 21:36 ` Chakri n
2007-09-28 23:33 ` Chakri n
2007-09-28 20:24 ` Daniel Phillips
2007-09-29 1:51 ` KDB? Daniel Phillips
2007-09-29 0:46 ` A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?) Daniel Phillips
[not found] ` <20070929110454.GA29861@mail.ustc.edu.cn>
2007-09-29 11:04 ` Fengguang Wu
2007-09-29 11:48 ` Peter Zijlstra
[not found] ` <20070929122842.GA5454@mail.ustc.edu.cn>
2007-09-29 12:28 ` Fengguang Wu
2007-09-29 14:43 ` Peter Zijlstra
2007-10-01 15:57 ` Chuck Ebbert
[not found] ` <20071002020040.GA5275@mail.ustc.edu.cn>
2007-10-02 2:00 ` [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi Fengguang Wu
2007-10-02 2:14 ` Andrew Morton
[not found] ` <20071002121327.GA5718@mail.ustc.edu.cn>
2007-10-02 12:13 ` Fengguang Wu
[not found] ` <20071002132702.GA10967@mail.ustc.edu.cn>
2007-10-02 13:27 ` Fengguang Wu
2007-10-02 18:35 ` Chuck Ebbert
2007-10-03 12:46 ` richard kennedy
[not found] ` <20071004015053.GA5789@mail.ustc.edu.cn>
2007-10-04 1:50 ` Fengguang Wu