* 2.6.14-rc2-mm1 - ext3 wedging up
@ 2005-09-22 19:59 Valdis.Kletnieks
  2005-09-23 0:36 ` Con Kolivas
  0 siblings, 1 reply; 22+ messages in thread
From: Valdis.Kletnieks @ 2005-09-22 19:59 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3412 bytes --]

Am seeing reproducible wedging up when writing large (20M+) files to an
ext3 file system. Oddly enough, if something *else* writes files to the
file system as well, it will unwedge for a while and make progress. Also,
a 'sync' command will relieve things temporarily - but after a few
megabytes it comes to a halt again. Looks like a borkage someplace not
causing it to actually finish pushing dirty file pages out - gkrellm
reports little/no disk activity in progress. File activity on *other*
filesystems continues unimpeded.

A representative sample sysrq-t output (doing an rpm2cpio | cpio -ivdm on
the FC4 kernel.src.rpm, lsof reports the file being extracted was
linux-2.6.13.tar.bz2).

[17187066.172000] cpio D C3A2EC8C 1928 9299 9144 9298 (NOTLB)
[17187066.172000] c3a2eca4 00000000 c011d897 c3a2ec8c cab666a0 cab66560 a830ef00 003d0f8b
[17187066.172000] 365c0400 00000000 00000282 c3a2ecac 001b7450 c3a2ece0 c3a2ecd0 c036e5bf
[17187066.172000] cb4c1f64 c04f43a8 001b7450 4b87ad6e c011e35a cab66560 c04f4120 00000019
[17187066.172000] Call Trace:
[17187066.172000] [<c036e5bf>] schedule_timeout+0x72/0x90
[17187066.172000] [<c036e532>] io_schedule_timeout+0xe/0x16
[17187066.172000] [<c02627bd>] blk_congestion_wait+0x53/0x68
[17187066.172000] [<c0139ec8>] balance_dirty_pages+0xe8/0x142
[17187066.172000] [<c0139fcf>] task_balance_dirty_pages+0xad/0xb6
[17187066.172000] [<c0139fe4>] balance_dirty_pages_ratelimited+0xc/0x92
[17187066.172000] [<c0136b2b>] generic_file_buffered_write+0x427/0x50f
[17187066.172000] [<c0136fb0>] __generic_file_aio_write_nolock+0x39d/0x3da
[17187066.172000] [<c01371e7>] generic_file_aio_write+0x62/0xb0
[17187066.172000] [<c01895ad>] ext3_file_write+0x1a/0x88
[17187066.172000] [<c014e9fc>] do_sync_write+0xb1/0xe6
[17187066.172000] [<c014eade>] vfs_write+0xad/0x156
[17187066.172000] [<c014ec22>] sys_write+0x3b/0x60
[17187066.172000] [<c01026b1>] syscall_call+0x7/0xb

/proc/meminfo says:

MemTotal:       255140 kB
MemFree:         11048 kB
Buffers:         17084 kB
Cached:          43020 kB
SwapCached:      23244 kB
Active:         200156 kB
Inactive:        16128 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       255140 kB
LowFree:         11048 kB
SwapTotal:     1052216 kB
SwapFree:       940176 kB
Dirty:              60 kB
Writeback:           4 kB
Mapped:         185012 kB
Slab:            19288 kB
CommitLimit:   1179784 kB
Committed_AS:   415788 kB
PageTables:       1368 kB
VmallocTotal:   777940 kB
VmallocUsed:     28268 kB
VmallocChunk:   747728 kB

Here I kept entering 'sync' in another window - each time, I'd see an
immediate read/write flurry on gkrellm for 1-3 seconds, and then nothing
until the next sync - then it would start moving again.

[~]2 l /usr/src/valdis/kern/linux-2.6.13.tar.bz2
1244 -rw------- 1 valdis valdis 1263104 Sep 22 15:48 /usr/src/valdis/kern/linux-2.6.13.tar.bz2
[~]2 sync
[~]2 l /usr/src/valdis/kern/linux-2.6.13.tar.bz2
7920 -rw------- 1 valdis valdis 8092672 Sep 22 15:51 /usr/src/valdis/kern/linux-2.6.13.tar.bz2
[~]2 sync
[~]2 l /usr/src/valdis/kern/linux-2.6.13.tar.bz2
9464 -rw------- 1 valdis valdis 9669120 Sep 22 15:52 /usr/src/valdis/kern/linux-2.6.13.tar.bz2
[~]2 sync
[~]2 l /usr/src/valdis/kern/linux-2.6.13.tar.bz2
11516 -rw------- 1 valdis valdis 11770880 Sep 22 15:52 /usr/src/valdis/kern/linux-2.6.13.tar.bz2

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-09-22 19:59 2.6.14-rc2-mm1 - ext3 wedging up Valdis.Kletnieks @ 2005-09-23 0:36 ` Con Kolivas 2005-09-23 7:20 ` Valdis.Kletnieks 0 siblings, 1 reply; 22+ messages in thread From: Con Kolivas @ 2005-09-23 0:36 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: linux-kernel On Fri, 23 Sep 2005 05:59, Valdis.Kletnieks@vt.edu wrote: > Am seeing reproducible wedging up when writing large (20M+) files to an > ext3 file system. Oddly enough, if something *else* writes files to the > file system as well, it will unwedge for a while and make progress. Also, > a 'sync' command will relieve things temporarily - but after a few > megabytes it comes to a halt again. Looks like a borkage someplace not > causing it to actually finish pushing dirty file pages out - gkrellm > reports little/no disk activity in progress. File activity on *other* > filesystems continues unimpeded. Could be the write throttling patches. Try backing these out (in this order I think): http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-rc2/2.6.14-rc2-mm1/broken-out/per-task-predictive-write-throttling-1-tweaks.patch http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-rc2/2.6.14-rc2-mm1/broken-out/per-task-predictive-write-throttling-1.patch Cheers, Con ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-09-23 0:36 ` Con Kolivas
@ 2005-09-23 7:20 ` Valdis.Kletnieks
  2005-09-23 8:45 ` Andrea Arcangeli
  ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Valdis.Kletnieks @ 2005-09-23 7:20 UTC (permalink / raw)
  To: Con Kolivas, Andrea Arcangeli; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3532 bytes --]

On Fri, 23 Sep 2005 10:36:16 +1000, Con Kolivas said:

(Adding Andrea to the To: list...)

> On Fri, 23 Sep 2005 05:59, Valdis.Kletnieks@vt.edu wrote:
> > Am seeing reproducible wedging up when writing large (20M+) files to an
> > ext3 file system. Oddly enough, if something *else* writes files to the
> > file system as well, it will unwedge for a while and make progress. Also,
> > a 'sync' command will relieve things temporarily - but after a few
> > megabytes it comes to a halt again. Looks like a borkage someplace not
> > causing it to actually finish pushing dirty file pages out - gkrellm
> > reports little/no disk activity in progress. File activity on *other*
> > filesystems continues unimpeded.
>
>
> Could be the write throttling patches.
>
> Try backing these out (in this order I think):
>
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-rc2/2.6.14-rc2-mm1/broken-out/per-task-predictive-write-throttling-1-tweaks.patch
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-rc2/2.6.14-rc2-mm1/broken-out/per-task-predictive-write-throttling-1.patch

Bingo. I haven't built a kernel with these excluded, but writing 0 to
/proc/sys/vm/dirty_ratio_centisecs fixes the problem, so I'm pretty sure
this is it.

(For the record, I've noticed the starvation issue that Andrea is trying to
address, where one process can lock out others, so I *do* think work is
needed here...)
Tuning/debugging info I gathered:

1) I was seeing 'future_pages' values averaging as high as 7K to 14K - is
this a "reasonable" number when writing a 38M file to a fairly sluggish
laptop disk (gkrellm showing 1M/sec to 3M/sec often, maybe 10M/sec if it's
a single fairly linear write)? Popular values that were seen on multiple
trials included 14364, 13851, and 13338 (though a few times, a wedge would
hit 14364, and a few seconds later would "burp" up a bunch of disk I/O and
drop to 7626 and stay there). The reproducibility of the numbers probably
says more about the fact that the system is otherwise basically idle at
3AM than anything else (so the number of available pages isn't bouncing
around due to other processes).

2) 'centiseconds' value of 0 disabled as designed. The default value of
'500' is *waaay* too high on my laptop - even 100 is consistently too
much. 40 was consistently low enough, 50 was usually OK, 75 was usually
*not* OK, but would sometimes "stutter" through and not completely grind
to a halt. I'm not sure exactly where the "knee" is, or if it moves under
higher load (my laptop works harder during the day in the office than at
2AM at home, usually).

The patch includes documentation:

+dirty_ratio_centisecs
+-----------------
+
+Throttle the I/O if the per-task writing bandwidth is high enough for
+the dirty_ratio to be reached in less than dirty_ratio_centisecs. This
+makes the write throttling per-process and avoids making too much
+memory dirty at the same time. Ideally in the future we should add
+some feedback from the backing_dev_info to know the max disk bandwidth.

I'm pretty convinced that for this patch to work, it *will* need feedback
from the actual (not max) disk bandwidth and possibly the actual amount of
RAM - what works on Andrea's 1G workstation with (presumably) a real disk
system is waay too much for 256M and a single laptop-class disk.
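The documented heuristic can be read as a simple rate comparison. What
follows is only a minimal user-space sketch of that reading, not the code
from the patch: should_throttle() and all of its parameters are
hypothetical names, and the sampling interval is an assumption made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the patch): decide whether a task
 * should be throttled under the rule "throttle if, at the task's current
 * write rate, the dirty limit would be reached in less than
 * dirty_ratio_centisecs".
 *
 * pages_dirtied       pages the task dirtied during the sampling interval
 * interval_centisecs  length of that interval in centiseconds (> 0)
 * dirty_now           pages currently dirty
 * dirty_limit         page count corresponding to dirty_ratio
 * centisecs           the dirty_ratio_centisecs tunable; 0 disables it
 */
bool should_throttle(uint64_t pages_dirtied, uint64_t interval_centisecs,
                     uint64_t dirty_now, uint64_t dirty_limit,
                     uint64_t centisecs)
{
    assert(interval_centisecs > 0);

    if (centisecs == 0)
        return false;               /* feature switched off */

    /* Pages that can still be dirtied before the limit is hit. */
    uint64_t headroom = dirty_limit > dirty_now ? dirty_limit - dirty_now : 0;

    /*
     * time_to_limit = headroom / (pages_dirtied / interval_centisecs).
     * Throttle when time_to_limit < centisecs; cross-multiplied here to
     * avoid dividing by a zero write rate. Assumes the products fit in
     * 64 bits, which holds for realistic page counts.
     */
    return pages_dirtied * centisecs > headroom * interval_centisecs;
}
```

Under this reading, a larger centisecs value throttles earlier: a task
dirtying 1000 pages per second with 3000 pages of headroom is 3 seconds
from the limit, so it is throttled with a setting of 500 but not with 40.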
For now, I'm leaving centisecs set to 40, and will see how that works - most of my "problem cases" involve an FTP on a 10/100mbit connection, so that will get tried tomorrow..... [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-09-23 7:20 ` Valdis.Kletnieks @ 2005-09-23 8:45 ` Andrea Arcangeli 2005-09-23 14:24 ` Dave Kleikamp 2005-09-23 9:45 ` Con Kolivas 2005-09-23 15:31 ` Andrea Arcangeli 2 siblings, 1 reply; 22+ messages in thread From: Andrea Arcangeli @ 2005-09-23 8:45 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: Con Kolivas, linux-kernel On Fri, Sep 23, 2005 at 03:20:33AM -0400, Valdis.Kletnieks@vt.edu wrote: > On Fri, 23 Sep 2005 10:36:16 +1000, Con Kolivas said: > > (Adding Andrea to the To: list...) > > > On Fri, 23 Sep 2005 05:59, Valdis.Kletnieks@vt.edu wrote: > > > Am seeing reproducible wedging up when writing large (20M+) files to an > > > ext3 file system. Oddly enough, if something *else* writes files to the > > > file system as well, it will unwedge for a while and make progress. Also, So you get a total hang? I guess there's a bug somewhere... > I'm pretty convinced that for this patch to work, it *will* need feedback from > the actual (not max) disk bandwidth and possibly the actual amount of RAM - > what works on Andrea's 1G workstation with (presumably) a real disk system > is waay too much for 256M and a single laptop-class disk. That's not the problem here if you get a total hang. This heuristic should only reduce the amount of dirty memory, it should never grind a task to a total hang, until some other task writes to the filesystem. The sysrq shows the task sleeping in blk_congestion_wait. > For now, I'm leaving centisecs set to 40, and will see how that works - most > of my "problem cases" involve an FTP on a 10/100mbit connection, so that will > get tried tomorrow..... You should leave it to 0 until I find the buglet that hangs the system. I'll have a look. One other thing to change is to call balance_dirty only when the dirty bit is toggled (so overwrites of dirty cache are not accounted, since they generate no additional I/O on disk). Thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-09-23 8:45 ` Andrea Arcangeli @ 2005-09-23 14:24 ` Dave Kleikamp 0 siblings, 0 replies; 22+ messages in thread From: Dave Kleikamp @ 2005-09-23 14:24 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel On Fri, 2005-09-23 at 10:45 +0200, Andrea Arcangeli wrote: > On Fri, Sep 23, 2005 at 03:20:33AM -0400, Valdis.Kletnieks@vt.edu wrote: > > On Fri, 23 Sep 2005 10:36:16 +1000, Con Kolivas said: > > > > (Adding Andrea to the To: list...) > > > > > On Fri, 23 Sep 2005 05:59, Valdis.Kletnieks@vt.edu wrote: > > > > Am seeing reproducible wedging up when writing large (20M+) files to an > > > > ext3 file system. Oddly enough, if something *else* writes files to the > > > > file system as well, it will unwedge for a while and make progress. Also, > > So you get a total hang? I guess there's a bug somewhere... I get a similar hang running fsx on a jfs file system. "echo 0 > /proc/sys/vm/dirty_ratio_centisecs" fixes it as well. Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-09-23 7:20 ` Valdis.Kletnieks
  2005-09-23 8:45 ` Andrea Arcangeli
@ 2005-09-23 9:45 ` Con Kolivas
  2005-09-23 15:31 ` Andrea Arcangeli
  2 siblings, 0 replies; 22+ messages in thread
From: Con Kolivas @ 2005-09-23 9:45 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Andrea Arcangeli, linux-kernel

On Fri, 23 Sep 2005 17:20, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 23 Sep 2005 10:36:16 +1000, Con Kolivas said:
>
> (Adding Andrea to the To: list...)
>
> > On Fri, 23 Sep 2005 05:59, Valdis.Kletnieks@vt.edu wrote:
> > > Am seeing reproducible wedging up when writing large (20M+) files to an
> > > ext3 file system. Oddly enough, if something *else* writes files to
> > > the file system as well, it will unwedge for a while and make progress.
> > > Also, a 'sync' command will relieve things temporarily - but after a
> > > few megabytes it comes to a halt again. Looks like a borkage someplace
> > > not causing it to actually finish pushing dirty file pages out -
> > > gkrellm reports little/no disk activity in progress. File activity on
> > > *other* filesystems continues unimpeded.
> >
> > Could be the write throttling patches.
> >
> > Try backing these out (in this order I think):
> >
> > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-rc2/2.6.14-rc2-mm1/broken-out/per-task-predictive-write-throttling-1-tweaks.patch
> > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-rc2/2.6.14-rc2-mm1/broken-out/per-task-predictive-write-throttling-1.patch
>
> Bingo. I haven't built a kernel with these excluded, but writing 0 to
> /proc/sys/vm/dirty_ratio_centisecs fixes the problem, so I'm pretty sure
> this is it.
>
> (For the record, I've noticed the starvation issue that Andrea is trying to
> address, where one process can lock out others, so I *do* think work is
> needed here...)

I don't disagree, which is why I was excited by this work as well.
Like all things in the kernel it always ends up being more complicated than the original plan, requiring reworking. So I do not remotely see this as a problem at this early stage. Cheers, Con ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-09-23 7:20 ` Valdis.Kletnieks
  2005-09-23 8:45 ` Andrea Arcangeli
  2005-09-23 9:45 ` Con Kolivas
@ 2005-09-23 15:31 ` Andrea Arcangeli
  2005-09-23 19:11 ` Valdis.Kletnieks
  2005-09-23 20:57 ` Dave Kleikamp
  2 siblings, 2 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2005-09-23 15:31 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Con Kolivas, linux-kernel

Hello,

Can you try this updated patch? I believe the blk_congestion_wait is
just wrong there, since there may be just one page being flushed. That
sounds like a longstanding bug, except it normally wouldn't trigger
because the dirty levels never go down near zero during heavy writes.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.14-rc1/per-task-predictive-write-throttling-3

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-09-23 15:31 ` Andrea Arcangeli
@ 2005-09-23 19:11 ` Valdis.Kletnieks
  2005-09-23 20:57 ` Dave Kleikamp
  1 sibling, 0 replies; 22+ messages in thread
From: Valdis.Kletnieks @ 2005-09-23 19:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Con Kolivas, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 616 bytes --]

On Fri, 23 Sep 2005 17:31:58 +0200, Andrea Arcangeli said:
> Hello,
>
> Can you try this updated patch? I believe the blk_congestion_wait is
> just wrong there, since there may be just one page being flushed. That
> sounds like a longstanding bug except it normally wouldn't trigger
> because the dirty levels never goes down near zero during heavy writes.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.14-rc1/per-task-predictive-write-throttling-3

Will do, although it may be Sunday night or Monday morning before I can
report back - just got handed a few higher-priority tasks...

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-09-23 15:31 ` Andrea Arcangeli 2005-09-23 19:11 ` Valdis.Kletnieks @ 2005-09-23 20:57 ` Dave Kleikamp 2005-09-23 20:59 ` Dave Kleikamp 1 sibling, 1 reply; 22+ messages in thread From: Dave Kleikamp @ 2005-09-23 20:57 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel On Fri, 2005-09-23 at 17:31 +0200, Andrea Arcangeli wrote: > Hello, > > Can you try this updated patch? I believe the blk_congestion_wait is > just wrong there, since there may be just one page being flushed. That > sounds like a longstanding bug except it normally wouldn't trigger > because the dirty levels never goes down near zero during heavy writes. fsx is now stuck in a loop somewhere, using 100% cpu. > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.14-rc1/per-task-predictive-write-throttling-3 -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-09-23 20:57 ` Dave Kleikamp @ 2005-09-23 20:59 ` Dave Kleikamp 2005-09-23 21:46 ` Dave Kleikamp 0 siblings, 1 reply; 22+ messages in thread From: Dave Kleikamp @ 2005-09-23 20:59 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel On Fri, 2005-09-23 at 15:57 -0500, Dave Kleikamp wrote: > On Fri, 2005-09-23 at 17:31 +0200, Andrea Arcangeli wrote: > > Hello, > > > > Can you try this updated patch? I believe the blk_congestion_wait is > > just wrong there, since there may be just one page being flushed. That > > sounds like a longstanding bug except it normally wouldn't trigger > > because the dirty levels never goes down near zero during heavy writes. > > fsx is now stuck in a loop somewhere, using 100% cpu. I hit send a little early. It eventually responded to a ^C. I'll try to get some more info. -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-09-23 20:59 ` Dave Kleikamp @ 2005-09-23 21:46 ` Dave Kleikamp 2005-09-26 8:14 ` Andrea Arcangeli 2005-09-28 22:38 ` Andrea Arcangeli 0 siblings, 2 replies; 22+ messages in thread From: Dave Kleikamp @ 2005-09-23 21:46 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel On Fri, 2005-09-23 at 15:59 -0500, Dave Kleikamp wrote: > On Fri, 2005-09-23 at 15:57 -0500, Dave Kleikamp wrote: > > On Fri, 2005-09-23 at 17:31 +0200, Andrea Arcangeli wrote: > > > Hello, > > > > > > Can you try this updated patch? I believe the blk_congestion_wait is > > > just wrong there, since there may be just one page being flushed. That > > > sounds like a longstanding bug except it normally wouldn't trigger > > > because the dirty levels never goes down near zero during heavy writes. > > > > fsx is now stuck in a loop somewhere, using 100% cpu. > > I hit send a little early. It eventually responded to a ^C. I'll try > to get some more info. I'd guess that it's spinning in balance_dirty_pages. /proc/<pid>/future_dirty is 25650 for fsx. It appears that nr_reclaimable is not going to zero for some reason. -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-09-23 21:46 ` Dave Kleikamp
@ 2005-09-26 8:14 ` Andrea Arcangeli
  2005-09-28 22:38 ` Andrea Arcangeli
  1 sibling, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2005-09-26 8:14 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel

On Fri, Sep 23, 2005 at 04:46:19PM -0500, Dave Kleikamp wrote:
> On Fri, 2005-09-23 at 15:59 -0500, Dave Kleikamp wrote:
> > On Fri, 2005-09-23 at 15:57 -0500, Dave Kleikamp wrote:
> > > On Fri, 2005-09-23 at 17:31 +0200, Andrea Arcangeli wrote:
> > > > Hello,
> > > >
> > > > Can you try this updated patch? I believe the blk_congestion_wait is
> > > > just wrong there, since there may be just one page being flushed. That
> > > > sounds like a longstanding bug except it normally wouldn't trigger
> > > > because the dirty levels never goes down near zero during heavy writes.
> > >
> > > fsx is now stuck in a loop somewhere, using 100% cpu.
> >
> > I hit send a little early. It eventually responded to a ^C. I'll try
> > to get some more info.
>
> I'd guess that it's spinning in balance_dirty_pages.
> /proc/<pid>/future_dirty is 25650 for fsx. It appears that

Ok, the good news is that this isn't a bug in the basic algorithm, but
just in the implementation of it.

> nr_reclaimable is not going to zero for some reason.

Exactly, the !nr_reclaimable check is what I thought would have
prevented an infinite loop from triggering...

Unfortunately I couldn't reproduce it on my laptop. I was working from the
laptop the whole last week (I even did a presentation with this patch
applied ;). I'll try to reproduce with fsx now.

Thanks for the help!

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-09-23 21:46 ` Dave Kleikamp
  2005-09-26 8:14 ` Andrea Arcangeli
@ 2005-09-28 22:38 ` Andrea Arcangeli
  2005-10-01 0:27 ` Dave Kleikamp
  1 sibling, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2005-09-28 22:38 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel

On Fri, Sep 23, 2005 at 04:46:19PM -0500, Dave Kleikamp wrote:
> On Fri, 2005-09-23 at 15:59 -0500, Dave Kleikamp wrote:
> I'd guess that it's spinning in balance_dirty_pages.
> /proc/<pid>/future_dirty is 25650 for fsx. It appears that
> nr_reclaimable is not going to zero for some reason.

Even if nr_reclaimable isn't going to zero, eventually the loop should
break out because pages_written must increase.

So this makes me think it might be the nr_unstable that destabilizes it,
and whatever it is, it is a bug in mainline as well, except it was well
hidden until now, because the dirty levels never approached zero during
heavy write-IO as can happen with this feature enabled.

Basically whatever we account as "reclaimable" must be _written_out_ and
accounted as well in the "pages_written", otherwise it'll just hang.
If there's a problem, it must be a longstanding one.

Can you try with this new patch that stops accounting "unstable" as
"reclaimable"? It should be possible to flush the dirty pages to disk, so
"nr_dirty" should be safe because they should always increase the
"pages_written". I'm not sure if this fixes it, but this at least rules
out nfs from the equation (perhaps nfs will never be accounted as
"pages_written" and that would be a possible explanation of the infinite
loop).

This new update also makes sure to never account rewrites (except for
reiserfs where it's more difficult to change the code for this).

I tried with fsx (no params) and couldn't reproduce any problem yet,
but I've no nfs workload involved in my test box.
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.14-rc1/per-task-predictive-write-throttling-4 thanks for the help! ^ permalink raw reply [flat|nested] 22+ messages in thread
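The termination argument in the message above (everything counted as
"reclaimable" must eventually show up in "pages_written") can be
illustrated with a toy user-space model. This is a sketch under stated
assumptions, not the kernel's balance_dirty_pages(): struct dirty_state,
balance_loop_terminates() and the max_iters cut-off that stands in for a
hang are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy accounting state; field names are hypothetical. */
struct dirty_state {
    long flushable;    /* dirty pages that writeback can actually reach */
    long unreachable;  /* pages counted as reclaimable but never written */
};

/*
 * Model of a throttling loop that exits either when nothing reclaimable
 * is left or when enough pages have been written. Returns true if the
 * loop exits on its own, false if it is still spinning after max_iters
 * rounds (the stand-in for a hung task).
 */
bool balance_loop_terminates(struct dirty_state s, long write_chunk,
                             long max_iters)
{
    long pages_written = 0;

    assert(write_chunk > 0 && max_iters > 0);

    for (long i = 0; i < max_iters; i++) {
        long nr_reclaimable = s.flushable + s.unreachable;

        if (nr_reclaimable == 0)
            return true;            /* nothing left to wait for */
        if (pages_written >= write_chunk)
            return true;            /* did our share of the writeback */

        /* Writeback can only make progress on pages it can reach. */
        long want = write_chunk - pages_written;
        long done = s.flushable < want ? s.flushable : want;

        s.flushable -= done;
        pages_written += done;
    }
    return false;
}
```

With only flushable pages the loop always ends. As soon as some pages are
counted but can never be written, and the flushable ones run out before
write_chunk is reached, neither exit condition can become true.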
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-09-28 22:38 ` Andrea Arcangeli @ 2005-10-01 0:27 ` Dave Kleikamp 2005-10-02 10:27 ` Andrea Arcangeli 0 siblings, 1 reply; 22+ messages in thread From: Dave Kleikamp @ 2005-10-01 0:27 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel On Thu, 2005-09-29 at 00:38 +0200, Andrea Arcangeli wrote: > On Fri, Sep 23, 2005 at 04:46:19PM -0500, Dave Kleikamp wrote: > > On Fri, 2005-09-23 at 15:59 -0500, Dave Kleikamp wrote: > > I'd guess that it's spinning in balance_dirty_pages. > > /proc/<pid>/future_dirty is 25650 for fsx. It appears that > > nr_reclaimable is not going to zero for some reason. I tracked down my problem to a bug in jfs. jfs is explicitly setting I_DIRTY in the i_state for a special inode that is preventing it from being put on the s_dirty list. This must have been something I did a long time ago when I was a newbie. I'm embarrassed I hadn't noticed it until now. I still had the problem even with your latest patch, but it's fixed with this patch to jfs. I haven't yet tried the jfs patch with the earlier versions of your patch to see if there is really a problem with them. I don't have anything to say about the original problem reported on ext3, since I only saw the problem on jfs. > Even if nr_reclaimable isn't going to zero, eventually the loop should > break out because pages_written must increase. > > So this make me think it might be the nr_unstable that destabilizes it, > and whatever it is, it is a bug in mainline as well, except it was well > hidden until now, because the dirty levels never approached zero during > heavy write-IO like it can happen with this feature enabled. > > Basically whatever we account as "reclaimable" must be _written_out_ and > accounted as well in the "pages_written" otherwise it'll just hang. > If there's a problem, it shall be a longstanding one. Yep. My bad. 
> Can you try with this new patch that stops accounting "unstable" as > "reclaimable". It should be possible to flush the dirty pages to disk so > "nr_dirty" should be safe because they should always increase the > "pages_written". I'm not sure if this fixes it, but this at least rule > out the nfs from the equation (perhaps nfs will never be accounted as > "pages_written" and that would be a possible explanation of the infinite > loop). > > This new update also makes sure to never account rewrites (except for > reiserfs where it's more difficult to change the code for this). > > I tried with fsx (no params) but I couldn't reproduce any problem yet, > but I've no nfs workload involved in my test box. > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.14-rc1/per-task-predictive-write-throttling-4 > > thanks for the help! JFS: jfs should not be playing with i_state jfs has been explicitly setting i_state |= I_DIRTY on a special inode. This prevented it from being put on the s_dirty list. Very stupid. 
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com> diff --git a/fs/jfs/jfs_dmap.c b/fs/jfs/jfs_dmap.c --- a/fs/jfs/jfs_dmap.c +++ b/fs/jfs/jfs_dmap.c @@ -305,7 +305,6 @@ int dbSync(struct inode *ipbmap) filemap_fdatawrite(ipbmap->i_mapping); filemap_fdatawait(ipbmap->i_mapping); - ipbmap->i_state |= I_DIRTY; diWriteSpecial(ipbmap, 0); return (0); diff --git a/fs/jfs/jfs_imap.c b/fs/jfs/jfs_imap.c --- a/fs/jfs/jfs_imap.c +++ b/fs/jfs/jfs_imap.c @@ -514,8 +514,6 @@ void diWriteSpecial(struct inode *ip, in ino_t inum = ip->i_ino; struct metapage *mp; - ip->i_state &= ~I_DIRTY; - if (secondary) address = addressPXD(&sbi->ait2) >> sbi->l2nbperpage; else diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c --- a/fs/jfs/jfs_txnmgr.c +++ b/fs/jfs/jfs_txnmgr.c @@ -2396,7 +2396,6 @@ static void txUpdateMap(struct tblock * */ if (tblk->xflag & COMMIT_CREATE) { diUpdatePMap(ipimap, tblk->ino, FALSE, tblk); - ipimap->i_state |= I_DIRTY; /* update persistent block allocation map * for the allocation of inode extent; */ @@ -2407,7 +2406,6 @@ static void txUpdateMap(struct tblock * } else if (tblk->xflag & COMMIT_DELETE) { ip = tblk->u.ip; diUpdatePMap(ipimap, ip->i_ino, TRUE, tblk); - ipimap->i_state |= I_DIRTY; iput(ip); } } -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-10-01 0:27 ` Dave Kleikamp
@ 2005-10-02 10:27 ` Andrea Arcangeli
  2005-10-02 10:32 ` Andrea Arcangeli
  ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2005-10-02 10:27 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel

On Fri, Sep 30, 2005 at 07:27:04PM -0500, Dave Kleikamp wrote:
> I tracked down my problem to a bug in jfs. jfs is explicitly setting

Ok great, this explains things, so perhaps my last hack attempt of not
accounting the unstable pages in the "nr_reclaimable" isn't needed.

What about Valdis, were you using jfs too along with ext3? If a single
fs has a bug the loop can happen (it could happen in mainline too,
except it was less likely to be visible there).

Note Valdis, your smtp server bounces back my emails.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-10-02 10:27 ` Andrea Arcangeli
@ 2005-10-02 10:32 ` Andrea Arcangeli
  2005-10-02 13:51 ` Dave Kleikamp
  2005-10-03 1:04 ` Valdis.Kletnieks
  2 siblings, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2005-10-02 10:32 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel

On Sun, Oct 02, 2005 at 12:27:26PM +0200, Andrea Arcangeli wrote:
> Note Valdis, your smtp server bounces back my emails.

Here we go again:

<Valdis.Kletnieks@vt.edu>: host smtp.vt.edu[198.82.161.8] said: 550 This
    domain is blacklisted,consult your postmaster (in reply to MAIL FROM
    command)

If you blacklist 0.0.0.0/0 as well you won't risk getting any more spam ;)

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up
  2005-10-02 10:27 ` Andrea Arcangeli
  2005-10-02 10:32 ` Andrea Arcangeli
@ 2005-10-02 13:51 ` Dave Kleikamp
  2005-10-03 18:06 ` Dave Kleikamp
  2005-10-03 1:04 ` Valdis.Kletnieks
  2 siblings, 1 reply; 22+ messages in thread
From: Dave Kleikamp @ 2005-10-02 13:51 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel

On Sun, 2005-10-02 at 12:27 +0200, Andrea Arcangeli wrote:
> On Fri, Sep 30, 2005 at 07:27:04PM -0500, Dave Kleikamp wrote:
> > I tracked down my problem to a bug in jfs. jfs is explicitly setting
>
> Ok great, this explains things, so perhaps my last hack attempt of not
> accounting the unstable pages in the "nr_reclaimable" isn't needed.

Maybe it is. I just retested the fixed jfs on 2.6.14-rc2-mm1, and I
still see the hang. I can probably debug it further on Monday if
necessary.

> What about Valdis, were you using jfs too along with ext3? If a single
> fs has a bug the loop can happen (it could happen in mainline too,
> except it was less likely to be visible there).
>
> Note Valdis, your smtp server bounces back my emails.
>
--
David Kleikamp
IBM Linux Technology Center

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-10-02 13:51 ` Dave Kleikamp @ 2005-10-03 18:06 ` Dave Kleikamp 2005-10-03 18:31 ` Valdis.Kletnieks 0 siblings, 1 reply; 22+ messages in thread From: Dave Kleikamp @ 2005-10-03 18:06 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1625 bytes --] On Sun, 2005-10-02 at 08:51 -0500, Dave Kleikamp wrote: > On Sun, 2005-10-02 at 12:27 +0200, Andrea Arcangeli wrote: > > On Fri, Sep 30, 2005 at 07:27:04PM -0500, Dave Kleikamp wrote: > > > I tracked down my problem to a bug in jfs. jfs is explicitly setting > > > > Ok great this explain things, so perhaps my last hack attempt of not > > accounting the unstable pages in the "nr_reclaimable" isn't needed. > > Maybe it is. I just retested the fixed jfs on 2.6.14-rc2-mm1, and I > still see the hang. I can probably debug it further on Monday if > necessary. I finally figured out what the problem was with jfs. There are really three things I ended up fixing, but the most important was that the reserved inodes that jfs uses for metadata were not in the inode hash. __mark_inode_dirty() fails to add the inode to the superblock's dirty list if hlist_unhashed() is true. Without being on the dirty list, the inode is not even looked at by writeback_inodes(). The other problems are that jfs explicitly sets I_DIRTY (I already reported that one) and that metadata_writepage may repeatedly redirty an inode that is waiting on journal I/O without initiating the journal I/O. > > What about Valids, were you using jfs too along with ext3? If a single > > fs has a bug the loop can happen (it could happen in mainline too, > > except it was less likely to be visible there). Unfortunately, this doesn't solve Valdis' problem, as he isn't using jfs. Valdis, do you have any other file systems mounted besides ext3? I wonder if another file system has a similar problem. 
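The dirty-list behaviour described above can be sketched as a small
user-space model. Everything here is hypothetical (struct toy_inode,
toy_mark_inode_dirty(), toy_writeback()); it only mimics the two effects
reported in this thread, namely that an unhashed inode never reaches the
dirty list and that an inode whose dirty flag was set by hand is skipped
as well. The early return for an already-dirty inode is an assumption
about how the real __mark_inode_dirty() behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the few bits of inode state that matter here. */
struct toy_inode {
    bool hashed;         /* models !hlist_unhashed(&inode->i_hash) */
    bool dirty;          /* models I_DIRTY being set in i_state */
    bool on_dirty_list;  /* models membership of the sb dirty list */
};

/* Model of marking an inode dirty (assumed behaviour, see lead-in). */
void toy_mark_inode_dirty(struct toy_inode *ino)
{
    assert(ino != NULL);

    if (ino->dirty)
        return;          /* flag already set: assumed to be listed */

    ino->dirty = true;
    if (!ino->hashed)
        return;          /* unhashed inodes are never put on the list */

    ino->on_dirty_list = true;
}

/* Writeback only visits listed inodes; returns how many it cleaned. */
size_t toy_writeback(struct toy_inode *inodes, size_t n)
{
    size_t cleaned = 0;

    assert(inodes != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!inodes[i].on_dirty_list)
            continue;    /* never looked at, stays dirty forever */
        inodes[i].dirty = false;
        inodes[i].on_dirty_list = false;
        cleaned++;
    }
    return cleaned;
}
```

In this model a hashed, clean inode is written back normally, while an
unhashed inode, or one whose dirty flag was set directly, keeps its dirty
data no matter how often writeback runs.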
-- David Kleikamp IBM Linux Technology Center [-- Attachment #2: jfs-i_hash.patch --] [-- Type: text/x-patch, Size: 3881 bytes --] JFS: make special inodes play nicely with page balancing This patch fixes up a few problems with jfs's reserved inodes. 1. There is no need for the jfs code setting the I_DIRTY bits in i_state. I am ashamed that the code ever did this, and surprised it hasn't been noticed until now. 2. Make sure special inodes are on an inode hash list. If the inodes are unhashed, __mark_inode_dirty will fail to put the inode on the superblock's dirty list, and the data will not be flushed under memory pressure. 3. Force writing journal data to disk when metapage_writepage is unable to write a metadata page due to pending journal I/O. Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com> diff -Nurp linux-2.6.14-rc2-mm1/fs/jfs/jfs_dmap.c linux/fs/jfs/jfs_dmap.c --- linux-2.6.14-rc2-mm1/fs/jfs/jfs_dmap.c 2005-09-22 18:06:45.000000000 -0500 +++ linux/fs/jfs/jfs_dmap.c 2005-10-02 14:22:12.000000000 -0500 @@ -305,7 +305,6 @@ int dbSync(struct inode *ipbmap) filemap_fdatawrite(ipbmap->i_mapping); filemap_fdatawait(ipbmap->i_mapping); - ipbmap->i_state |= I_DIRTY; diWriteSpecial(ipbmap, 0); return (0); diff -Nurp linux-2.6.14-rc2-mm1/fs/jfs/jfs_imap.c linux/fs/jfs/jfs_imap.c --- linux-2.6.14-rc2-mm1/fs/jfs/jfs_imap.c 2005-08-28 18:41:01.000000000 -0500 +++ linux/fs/jfs/jfs_imap.c 2005-10-03 13:01:53.000000000 -0500 @@ -57,6 +57,12 @@ #include "jfs_debug.h" /* + * __mark_inode_dirty expects inodes to be hashed. 
Since we don't want + * special inodes in the fileset inode space, we hash them to a dummy head + */ +static HLIST_HEAD(aggregate_hash); + +/* * imap locks */ /* iag free list lock */ @@ -491,6 +497,8 @@ struct inode *diReadSpecial(struct super /* release the page */ release_metapage(mp); + hlist_add_head(&ip->i_hash, &aggregate_hash); + return (ip); } @@ -514,8 +522,6 @@ void diWriteSpecial(struct inode *ip, in ino_t inum = ip->i_ino; struct metapage *mp; - ip->i_state &= ~I_DIRTY; - if (secondary) address = addressPXD(&sbi->ait2) >> sbi->l2nbperpage; else diff -Nurp linux-2.6.14-rc2-mm1/fs/jfs/jfs_metapage.c linux/fs/jfs/jfs_metapage.c --- linux-2.6.14-rc2-mm1/fs/jfs/jfs_metapage.c 2005-08-28 18:41:01.000000000 -0500 +++ linux/fs/jfs/jfs_metapage.c 2005-10-03 12:26:50.000000000 -0500 @@ -395,6 +395,12 @@ static int metapage_writepage(struct pag if (mp->nohomeok && !test_bit(META_forcewrite, &mp->flag)) { redirty = 1; + /* + * Make sure this page isn't blocked indefinitely. + * If the journal isn't undergoing I/O, push it + */ + if (mp->log && !(mp->log->cflag & logGC_PAGEOUT)) + jfs_flush_journal(mp->log, 0); continue; } diff -Nurp linux-2.6.14-rc2-mm1/fs/jfs/jfs_txnmgr.c linux/fs/jfs/jfs_txnmgr.c --- linux-2.6.14-rc2-mm1/fs/jfs/jfs_txnmgr.c 2005-09-22 18:06:45.000000000 -0500 +++ linux/fs/jfs/jfs_txnmgr.c 2005-10-02 14:22:12.000000000 -0500 @@ -2396,7 +2396,6 @@ static void txUpdateMap(struct tblock * */ if (tblk->xflag & COMMIT_CREATE) { diUpdatePMap(ipimap, tblk->ino, FALSE, tblk); - ipimap->i_state |= I_DIRTY; /* update persistent block allocation map * for the allocation of inode extent; */ @@ -2407,7 +2406,6 @@ static void txUpdateMap(struct tblock * } else if (tblk->xflag & COMMIT_DELETE) { ip = tblk->u.ip; diUpdatePMap(ipimap, ip->i_ino, TRUE, tblk); - ipimap->i_state |= I_DIRTY; iput(ip); } } diff -Nurp linux-2.6.14-rc2-mm1/fs/jfs/super.c linux/fs/jfs/super.c --- linux-2.6.14-rc2-mm1/fs/jfs/super.c 2005-09-22 18:05:56.000000000 -0500 +++ 
linux/fs/jfs/super.c 2005-10-03 10:54:41.000000000 -0500 @@ -442,6 +442,7 @@ static int jfs_fill_super(struct super_b inode->i_nlink = 1; inode->i_size = sb->s_bdev->bd_inode->i_size; inode->i_mapping->a_ops = &jfs_metapage_aops; + insert_inode_hash(inode); mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS); sbi->direct_inode = inode; ^ permalink raw reply [flat|nested] 22+ messages in thread
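Editorial sketch for item 2 of the patch description above (special inodes must be on an inode hash list). The following user-space C program is a toy model of the behaviour Dave describes, not kernel code: every toy_ name is hypothetical, the structures are reduced to the fields needed for the illustration, and the real __mark_inode_dirty() does considerably more. It only shows the shape of the problem: the inode's state is flagged dirty, but an unhashed inode is never moved to the list that writeback walks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for hlist_node / hlist_head (illustrative names only). */
struct toy_hnode {
    struct toy_hnode *next;
    struct toy_hnode **pprev;   /* NULL while the node is on no hash list */
};

struct toy_hhead {
    struct toy_hnode *first;
};

/* Toy inode: just enough state to show the dirty-list decision. */
struct toy_inode {
    struct toy_hnode i_hash;
    bool dirty;                 /* state flag says "dirty" */
    bool on_dirty_list;         /* visible to the writeback walk */
};

/* Models hlist_unhashed(): a node with no back-pointer is on no list. */
bool toy_unhashed(const struct toy_hnode *n)
{
    return n->pprev == NULL;
}

/* Models hlist_add_head(): push the node onto the front of a hash chain. */
void toy_hash_add(struct toy_hnode *n, struct toy_hhead *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Assumed simplification of the dirty-marking path described in the mail. */
void toy_mark_dirty(struct toy_inode *ip)
{
    assert(ip != NULL);
    ip->dirty = true;
    if (toy_unhashed(&ip->i_hash))
        return;                 /* unhashed: never reaches the dirty list */
    ip->on_dirty_list = true;
}

/* Returns true if a newly dirtied inode would be seen by writeback. */
bool toy_writeback_sees(bool hash_first)
{
    struct toy_hhead dummy_head = { NULL };   /* like a dummy hash head */
    struct toy_inode ip = { { NULL, NULL }, false, false };

    if (hash_first)
        toy_hash_add(&ip.i_hash, &dummy_head);
    toy_mark_dirty(&ip);
    return ip.on_dirty_list;
}
```

In this model, toy_writeback_sees(false) corresponds to the unpatched special inodes and toy_writeback_sees(true) to the patched ones, where hashing to a dummy head (as the patch does with aggregate_hash) is enough to make the inode eligible for the dirty list.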
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-10-03 18:06 ` Dave Kleikamp @ 2005-10-03 18:31 ` Valdis.Kletnieks 2005-10-10 17:15 ` Andrea Arcangeli 0 siblings, 1 reply; 22+ messages in thread From: Valdis.Kletnieks @ 2005-10-03 18:31 UTC (permalink / raw) To: Dave Kleikamp; +Cc: Andrea Arcangeli, Con Kolivas, linux-kernel [-- Attachment #1: Type: text/plain, Size: 279 bytes --] On Mon, 03 Oct 2005 13:06:08 CDT, Dave Kleikamp said: > Unfortunately, this doesn't solve Valdis' problem, as he isn't using > jfs. Valdis, do you have any other file systems mounted besides ext3? > I wonder if another file system has a similar problem. Nope, all ext3 here.. [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-10-03 18:31 ` Valdis.Kletnieks @ 2005-10-10 17:15 ` Andrea Arcangeli 2005-10-10 17:21 ` Dave Kleikamp 0 siblings, 1 reply; 22+ messages in thread From: Andrea Arcangeli @ 2005-10-10 17:15 UTC (permalink / raw) To: Valdis.Kletnieks; +Cc: Dave Kleikamp, Con Kolivas, linux-kernel Hello, So what's the status of this? Dave, can you still reproduce hangs with your last jfs fixes applied? Valdis, did you test my last patch (you can find it in my ftp area) that removes the unstable pages from the equation? All ext3 as local fs is ok, but do you use nfs for the networked fs? If not, can you post a way to reproduce the hang? Is it enough to boot with mem=256M? Thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-10-10 17:15 ` Andrea Arcangeli @ 2005-10-10 17:21 ` Dave Kleikamp 0 siblings, 0 replies; 22+ messages in thread From: Dave Kleikamp @ 2005-10-10 17:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Valdis.Kletnieks, Con Kolivas, linux-kernel On Mon, 2005-10-10 at 19:15 +0200, Andrea Arcangeli wrote: > Hello, > > So what's the status of this? Dave, can you still reproduce hangs with > your last jfs fixes applied? With my latest jfs patch, I was unable to reproduce the hang on an unmodified 2.6.14-rc2-mm1 kernel. > Valdis, did you test my last patch (you can find it in my ftp area) that > removes the unstable pages from the equation? All ext3 as local fs is ok, > but do you use nfs for the networked fs? If not, can you post a way to > reproduce the hang? Is it enough to boot with mem=256M? > > Thanks. > -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: 2.6.14-rc2-mm1 - ext3 wedging up 2005-10-02 10:27 ` Andrea Arcangeli 2005-10-02 10:32 ` Andrea Arcangeli 2005-10-02 13:51 ` Dave Kleikamp @ 2005-10-03 1:04 ` Valdis.Kletnieks 2 siblings, 0 replies; 22+ messages in thread From: Valdis.Kletnieks @ 2005-10-03 1:04 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Dave Kleikamp, Con Kolivas, linux-kernel [-- Attachment #1: Type: text/plain, Size: 699 bytes --] On Sun, 02 Oct 2005 12:27:26 +0200, Andrea Arcangeli said: > Ok great, this explains things, so perhaps my last hack attempt of not > accounting the unstable pages in the "nr_reclaimable" isn't needed. > > What about Valdis, were you using jfs too along with ext3? If a single > fs has a bug the loop can happen (it could happen in mainline too, > except it was less likely to be visible there). % zgrep -i jfs /proc/config.gz # CONFIG_JFS_FS is not set Sorry, this is an ext3-based system, no JFS here. Another (possibly unimportant) data point: I was seeing it with 256M of RAM, but after a recent upgrade to 768M, I'm not seeing it. Probably need to reboot with mem=256M to replicate now... [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread