* 2.6.24-rc6 reproducible raid5 hang @ 2007-12-27 17:06 dean gaudet 2007-12-27 17:39 ` dean gaudet 2007-12-27 19:52 ` Justin Piszcz 0 siblings, 2 replies; 30+ messages in thread From: dean gaudet @ 2007-12-27 17:06 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 1093 bytes --] hey neil -- remember that raid5 hang which me and only one or two others ever experienced and which was hard to reproduce? we were debugging it well over a year ago (that box has 400+ day uptime now so at least that long ago :) the workaround was to increase stripe_cache_size... i seem to have a way to reproduce something which looks much the same. setup: - 2.6.24-rc6 - system has 8GiB RAM but no swap - 8x750GB in a raid5 with one spare, chunksize 1024KiB. - mkfs.xfs default options - mount -o noatime - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440 that sequence hangs for me within 10 seconds... and i can unhang / rehang it by toggling between stripe_cache_size 256 and 1024. i detect the hang by watching "iostat -kx /dev/sd? 5". i've attached the kernel log where i dumped task and timer state while it was hung... note that you'll see at some point i did an xfs mount with external journal but it happens with internal journal as well. looks like it's using the raid456 module and async api. anyhow let me know if you need more info / have any suggestions. -dean [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 19281 bytes --] [-- Attachment #3: Type: APPLICATION/octet-stream, Size: 25339 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
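For reference, the stripe cache tunables mentioned above live under sysfs. A minimal sketch of the setup and the unhang/rehang toggle, assuming the array is /dev/md2 and is mounted at /mnt (the device names and mount point here are illustrative, taken from later messages in the thread):

  # roughly the setup described: 8 drives, one spare, 1024KiB chunk, default mkfs.xfs
  mdadm --create /dev/md2 --level=5 --chunk=1024 -n7 -x1 /dev/sd[a-h]1
  mkfs.xfs /dev/md2
  mount -o noatime /dev/md2 /mnt

  # generate the sequential write load and watch per-disk activity
  dd if=/dev/zero of=/mnt/foo bs=4k count=2621440 &
  iostat -kx /dev/sd? 5

  # once the writes stall, toggle the stripe cache to unhang / rehang the array
  cat /sys/block/md2/md/stripe_cache_active
  echo 1024 > /sys/block/md2/md/stripe_cache_size   # unhangs
  echo 256 > /sys/block/md2/md/stripe_cache_size    # back to the default; stalls again under load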
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-27 17:06 2.6.24-rc6 reproducible raid5 hang dean gaudet @ 2007-12-27 17:39 ` dean gaudet 2007-12-29 16:48 ` dean gaudet 2007-12-27 19:52 ` Justin Piszcz 1 sibling, 1 reply; 30+ messages in thread From: dean gaudet @ 2007-12-27 17:39 UTC (permalink / raw) To: linux-raid hmm this seems more serious... i just ran into it with chunksize 64KiB and while just untarring a bunch of linux kernels in parallel... increasing stripe_cache_size did the trick again. -dean On Thu, 27 Dec 2007, dean gaudet wrote: > hey neil -- remember that raid5 hang which me and only one or two others > ever experienced and which was hard to reproduce? we were debugging it > well over a year ago (that box has 400+ day uptime now so at least that > long ago :) the workaround was to increase stripe_cache_size... i seem to > have a way to reproduce something which looks much the same. > > setup: > > - 2.6.24-rc6 > - system has 8GiB RAM but no swap > - 8x750GB in a raid5 with one spare, chunksize 1024KiB. > - mkfs.xfs default options > - mount -o noatime > - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440 > > that sequence hangs for me within 10 seconds... and i can unhang / rehang > it by toggling between stripe_cache_size 256 and 1024. i detect the hang > by watching "iostat -kx /dev/sd? 5". > > i've attached the kernel log where i dumped task and timer state while it > was hung... note that you'll see at some point i did an xfs mount with > external journal but it happens with internal journal as well. > > looks like it's using the raid456 module and async api. > > anyhow let me know if you need more info / have any suggestions. > > -dean ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-27 17:39 ` dean gaudet @ 2007-12-29 16:48 ` dean gaudet 2007-12-29 20:47 ` Dan Williams 0 siblings, 1 reply; 30+ messages in thread From: dean gaudet @ 2007-12-29 16:48 UTC (permalink / raw) To: linux-raid hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size again -- it was only 905 active. as i recall the bug we were debugging a year+ ago the active was at the size when it would hang. so this is probably something new. anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to hit that limit too if i try harder :) btw what units are stripe_cache_size/active in? is the memory consumed equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * raid_disks * stripe_cache_active)? -dean On Thu, 27 Dec 2007, dean gaudet wrote: > hmm this seems more serious... i just ran into it with chunksize 64KiB and > while just untarring a bunch of linux kernels in parallel... increasing > stripe_cache_size did the trick again. > > -dean > > On Thu, 27 Dec 2007, dean gaudet wrote: > > > hey neil -- remember that raid5 hang which me and only one or two others > > ever experienced and which was hard to reproduce? we were debugging it > > well over a year ago (that box has 400+ day uptime now so at least that > > long ago :) the workaround was to increase stripe_cache_size... i seem to > > have a way to reproduce something which looks much the same. > > > > setup: > > > > - 2.6.24-rc6 > > - system has 8GiB RAM but no swap > > - 8x750GB in a raid5 with one spare, chunksize 1024KiB. > > - mkfs.xfs default options > > - mount -o noatime > > - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440 > > > > that sequence hangs for me within 10 seconds... and i can unhang / rehang > > it by toggling between stripe_cache_size 256 and 1024. i detect the hang > > by watching "iostat -kx /dev/sd? 5". > > > > i've attached the kernel log where i dumped task and timer state while it > > was hung... note that you'll see at some point i did an xfs mount with > > external journal but it happens with internal journal as well. > > > > looks like it's using the raid456 module and async api. > > > > anyhow let me know if you need more info / have any suggestions. > > > > -dean > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-29 16:48 ` dean gaudet @ 2007-12-29 20:47 ` Dan Williams 2007-12-29 20:58 ` dean gaudet 0 siblings, 1 reply; 30+ messages in thread From: Dan Williams @ 2007-12-29 20:47 UTC (permalink / raw) To: dean gaudet; +Cc: linux-raid On Dec 29, 2007 9:48 AM, dean gaudet <dean@arctic.org> wrote: > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on > the same 64k chunk array and had raised the stripe_cache_size to 1024... > and got a hang. this time i grabbed stripe_cache_active before bumping > the size again -- it was only 905 active. as i recall the bug we were > debugging a year+ ago the active was at the size when it would hang. so > this is probably something new. I believe I am seeing the same issue and am trying to track down whether XFS is doing something unexpected, i.e. I have not been able to reproduce the problem with EXT3. MD tries to increase throughput by letting some stripe work build up in batches. It looks like every time your system has hung it has been in the 'inactive_blocked' state i.e. > 3/4 of stripes active. This state should automatically clear... > > anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to > hit that limit too if i try harder :) Once you hang if 'stripe_cache_size' is increased such that stripe_cache_active < 3/4 * stripe_cache_size things will start flowing again. > > btw what units are stripe_cache_size/active in? is the memory consumed > equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * > raid_disks * stripe_cache_active)? > memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size > > -dean > -- Dan ^ permalink raw reply [flat|nested] 30+ messages in thread
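Plugging dean's 8-device array into the formula above gives a quick feel for the sizes involved; a back-of-the-envelope sketch, assuming 4KiB pages:

  # memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size
  page=4096; disks=8
  for scs in 256 1024 2048; do
      printf "stripe_cache_size=%-4s -> %2d MiB cache, inactive_blocked above %d active stripes\n" \
          "$scs" $(( page * disks * scs / 1048576 )) $(( scs * 3 / 4 ))
  done

At stripe_cache_size=1024 the 3/4 mark is 768 stripes, so the 905 active stripes reported above is well inside the 'inactive_blocked' region described here.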
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-29 20:47 ` Dan Williams @ 2007-12-29 20:58 ` dean gaudet 2007-12-29 21:50 ` Justin Piszcz 2007-12-29 22:06 ` Dan Williams 0 siblings, 2 replies; 30+ messages in thread From: dean gaudet @ 2007-12-29 20:58 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On Sat, 29 Dec 2007, Dan Williams wrote: > On Dec 29, 2007 9:48 AM, dean gaudet <dean@arctic.org> wrote: > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on > > the same 64k chunk array and had raised the stripe_cache_size to 1024... > > and got a hang. this time i grabbed stripe_cache_active before bumping > > the size again -- it was only 905 active. as i recall the bug we were > > debugging a year+ ago the active was at the size when it would hang. so > > this is probably something new. > > I believe I am seeing the same issue and am trying to track down > whether XFS is doing something unexpected, i.e. I have not been able > to reproduce the problem with EXT3. MD tries to increase throughput > by letting some stripe work build up in batches. It looks like every > time your system has hung it has been in the 'inactive_blocked' state > i.e. > 3/4 of stripes active. This state should automatically > clear... cool, glad you can reproduce it :) i have a bit more data... i'm seeing the same problem on debian's 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. i'm doing some more isolation but just grabbing kernels i have precompiled so far -- a 2.6.19.7 kernel doesn't show the problem, and early indications are a 2.6.21.7 kernel also doesn't have the problem but i'm giving it longer to show its head. i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just so we get the debian patches out of the way. i was tempted to blame async api because it's newish :) but according to the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async API, and it still hung, so async is probably not to blame. anyhow the test case i'm using is the dma_thrasher script i attached... it takes about an hour to give me confidence there's no problems so this will take a while. -dean ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-29 20:58 ` dean gaudet @ 2007-12-29 21:50 ` Justin Piszcz 2007-12-29 22:11 ` dean gaudet 2007-12-29 22:06 ` Dan Williams 1 sibling, 1 reply; 30+ messages in thread From: Justin Piszcz @ 2007-12-29 21:50 UTC (permalink / raw) To: dean gaudet; +Cc: Dan Williams, linux-raid On Sat, 29 Dec 2007, dean gaudet wrote: > On Sat, 29 Dec 2007, Dan Williams wrote: > >> On Dec 29, 2007 9:48 AM, dean gaudet <dean@arctic.org> wrote: >>> hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on >>> the same 64k chunk array and had raised the stripe_cache_size to 1024... >>> and got a hang. this time i grabbed stripe_cache_active before bumping >>> the size again -- it was only 905 active. as i recall the bug we were >>> debugging a year+ ago the active was at the size when it would hang. so >>> this is probably something new. >> >> I believe I am seeing the same issue and am trying to track down >> whether XFS is doing something unexpected, i.e. I have not been able >> to reproduce the problem with EXT3. MD tries to increase throughput >> by letting some stripe work build up in batches. It looks like every >> time your system has hung it has been in the 'inactive_blocked' state >> i.e. > 3/4 of stripes active. This state should automatically >> clear... > > cool, glad you can reproduce it :) > > i have a bit more data... i'm seeing the same problem on debian's > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. > > i'm doing some more isolation but just grabbing kernels i have precompiled > so far -- a 2.6.19.7 kernel doesn't show the problem, and early > indications are a 2.6.21.7 kernel also doesn't have the problem but i'm > giving it longer to show its head. > > i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just > so we get the debian patches out of the way. > > i was tempted to blame async api because it's newish :) but according to > the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async > API, and it still hung, so async is probably not to blame. > > anyhow the test case i'm using is the dma_thrasher script i attached... it > takes about an hour to give me confidence there's no problems so this will > take a while. > > -dean > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Dean, Curious btw what kind of filesystem size/raid type (5, but defaults I assume, nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache size/chunk size(s) are you using/testing with? The script you sent out earlier, you are able to reproduce it easily with 31 or so kernel tar decompressions? Justin. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-29 21:50 ` Justin Piszcz @ 2007-12-29 22:11 ` dean gaudet 2007-12-29 22:21 ` dean gaudet 0 siblings, 1 reply; 30+ messages in thread From: dean gaudet @ 2007-12-29 22:11 UTC (permalink / raw) To: Justin Piszcz; +Cc: Dan Williams, linux-raid On Sat, 29 Dec 2007, Justin Piszcz wrote: > Curious btw what kind of filesystem size/raid type (5, but defaults I assume, > nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache > size/chunk size(s) are you using/testing with? mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1 mkfs.xfs -f /dev/md2 otherwise defaults > The script you sent out earlier, you are able to reproduce it easily with 31 > or so kernel tar decompressions? not sure, the point of the script is to untar more than there is RAM. it happened with a single rsync running though -- 3.5M inodes from a remote box. it also happens with the single 10GB dd write... although i've been using the tar method for testing different kernel revs. -dean ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-29 22:11 ` dean gaudet @ 2007-12-29 22:21 ` dean gaudet 0 siblings, 0 replies; 30+ messages in thread From: dean gaudet @ 2007-12-29 22:21 UTC (permalink / raw) To: Justin Piszcz; +Cc: Dan Williams, linux-raid On Sat, 29 Dec 2007, dean gaudet wrote: > On Sat, 29 Dec 2007, Justin Piszcz wrote: > > > Curious btw what kind of filesystem size/raid type (5, but defaults I assume, > > nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache > > size/chunk size(s) are you using/testing with? > > mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1 > mkfs.xfs -f /dev/md2 > > otherwise defaults hmm i missed a few things, here's exactly how i created the array: mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 /dev/sd[a-h]1 it's reassembled automagically each reboot, but i do this each reboot: mkfs.xfs -f /dev/md2 mount -o noatime /dev/md2 /mnt/new ./dma_thrasher linux.tar.gz /mnt/new the --assume-clean and noatime probably make no difference though... on the bisection front it looks like it's new behaviour between 2.6.21.7 and 2.6.22.15 (stock kernels now, not debian). i've got to step out for a while, but i'll go at it again later, probably with git bisect unless someone has some cherry picked changes to suggest. -dean ^ permalink raw reply [flat|nested] 30+ messages in thread
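The bisection mentioned here can be driven by git; a rough sketch, assuming the regression window is mainline v2.6.21..v2.6.22 (the 2.6.21.7 and 2.6.22.15 point releases tested above sit on separate stable branches):

  cd linux-2.6
  git bisect start
  git bisect bad v2.6.22     # first release that shows the hang
  git bisect good v2.6.21    # last release that survives the workload
  # at each step: build, boot, run the dma_thrasher workload long enough to trust the result,
  # then mark it:
  git bisect good            # or: git bisect bad
  # when the culprit commit is identified:
  git bisect reset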
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-29 20:58 ` dean gaudet 2007-12-29 21:50 ` Justin Piszcz @ 2007-12-29 22:06 ` Dan Williams 2007-12-30 17:58 ` dean gaudet 1 sibling, 1 reply; 30+ messages in thread From: Dan Williams @ 2007-12-29 22:06 UTC (permalink / raw) To: dean gaudet; +Cc: linux-raid On Dec 29, 2007 1:58 PM, dean gaudet <dean@arctic.org> wrote: > On Sat, 29 Dec 2007, Dan Williams wrote: > > > On Dec 29, 2007 9:48 AM, dean gaudet <dean@arctic.org> wrote: > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on > > > the same 64k chunk array and had raised the stripe_cache_size to 1024... > > > and got a hang. this time i grabbed stripe_cache_active before bumping > > > the size again -- it was only 905 active. as i recall the bug we were > > > debugging a year+ ago the active was at the size when it would hang. so > > > this is probably something new. > > > > I believe I am seeing the same issue and am trying to track down > > whether XFS is doing something unexpected, i.e. I have not been able > > to reproduce the problem with EXT3. MD tries to increase throughput > > by letting some stripe work build up in batches. It looks like every > > time your system has hung it has been in the 'inactive_blocked' state > > i.e. > 3/4 of stripes active. This state should automatically > > clear... > > cool, glad you can reproduce it :) > > i have a bit more data... i'm seeing the same problem on debian's > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. > This is just brainstorming at this point, but it looks like xfs can submit more requests in the bi_end_io path such that it can lock itself out of the RAID array. The sequence that concerns me is: return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request-><hang> I need verify whether this path is actually triggering, but if we are in an inactive_blocked condition this new request will be put on a wait queue and we'll never get to the release_stripe() call after return_io(). It would be interesting to see if this is new XFS behavior in recent kernels. -- Dan ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-29 22:06 ` Dan Williams @ 2007-12-30 17:58 ` dean gaudet 2008-01-09 18:28 ` Dan Williams 0 siblings, 1 reply; 30+ messages in thread From: dean gaudet @ 2007-12-30 17:58 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 3103 bytes --] On Sat, 29 Dec 2007, Dan Williams wrote: > On Dec 29, 2007 1:58 PM, dean gaudet <dean@arctic.org> wrote: > > On Sat, 29 Dec 2007, Dan Williams wrote: > > > > > On Dec 29, 2007 9:48 AM, dean gaudet <dean@arctic.org> wrote: > > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on > > > > the same 64k chunk array and had raised the stripe_cache_size to 1024... > > > > and got a hang. this time i grabbed stripe_cache_active before bumping > > > > the size again -- it was only 905 active. as i recall the bug we were > > > > debugging a year+ ago the active was at the size when it would hang. so > > > > this is probably something new. > > > > > > I believe I am seeing the same issue and am trying to track down > > > whether XFS is doing something unexpected, i.e. I have not been able > > > to reproduce the problem with EXT3. MD tries to increase throughput > > > by letting some stripe work build up in batches. It looks like every > > > time your system has hung it has been in the 'inactive_blocked' state > > > i.e. > 3/4 of stripes active. This state should automatically > > > clear... > > > > cool, glad you can reproduce it :) > > > > i have a bit more data... i'm seeing the same problem on debian's > > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. > > > > This is just brainstorming at this point, but it looks like xfs can > submit more requests in the bi_end_io path such that it can lock > itself out of the RAID array. The sequence that concerns me is: > > return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request-><hang> > > I need verify whether this path is actually triggering, but if we are > in an inactive_blocked condition this new request will be put on a > wait queue and we'll never get to the release_stripe() call after > return_io(). It would be interesting to see if this is new XFS > behavior in recent kernels. i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 which was Neil's change in 2.6.22 for deferring generic_make_request until there's enough stack space for it. with my git tree sync'd to that commit my test cases fail in under 20 minutes uptime (i rebooted and tested 3x). sync'd to the commit previous to it i've got 8h of run-time now without the problem. this isn't definitive of course since it does seem to be timing dependent, but since all failures have occured much earlier than that for me so far i think this indicates this change is either the cause of the problem or exacerbates an existing raid5 problem. given that this problem looks like a very rare problem i saw with 2.6.18 (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an existing problem... not that i have evidence either way. i've attached a new kernel log with a hang at d89d87965d... and the reduced config file i was using for the bisect. hopefully the hang looks the same as what we were seeing at 2.6.24-rc6. let me know. 
-dean [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 13738 bytes --] [-- Attachment #3: Type: APPLICATION/octet-stream, Size: 7117 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
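Confirming a single suspect commit, as described above, only needs the commit and its immediate parent; a sketch using the id quoted in the message:

  # build and test the suspect commit, then its parent
  git checkout d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1    # hung in under 20 minutes, three times
  git checkout d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1^   # ran 8+ hours without hanging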
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-30 17:58 ` dean gaudet @ 2008-01-09 18:28 ` Dan Williams 2008-01-10 0:09 ` Neil Brown 0 siblings, 1 reply; 30+ messages in thread From: Dan Williams @ 2008-01-09 18:28 UTC (permalink / raw) To: dean gaudet; +Cc: linux-raid, neilb On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: > On Sat, 29 Dec 2007, Dan Williams wrote: > > > On Dec 29, 2007 1:58 PM, dean gaudet <dean@arctic.org> wrote: > > > On Sat, 29 Dec 2007, Dan Williams wrote: > > > > > > > On Dec 29, 2007 9:48 AM, dean gaudet <dean@arctic.org> wrote: > > > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on > > > > > the same 64k chunk array and had raised the stripe_cache_size to 1024... > > > > > and got a hang. this time i grabbed stripe_cache_active before bumping > > > > > the size again -- it was only 905 active. as i recall the bug we were > > > > > debugging a year+ ago the active was at the size when it would hang. so > > > > > this is probably something new. > > > > > > > > I believe I am seeing the same issue and am trying to track down > > > > whether XFS is doing something unexpected, i.e. I have not been able > > > > to reproduce the problem with EXT3. MD tries to increase throughput > > > > by letting some stripe work build up in batches. It looks like every > > > > time your system has hung it has been in the 'inactive_blocked' state > > > > i.e. > 3/4 of stripes active. This state should automatically > > > > clear... > > > > > > cool, glad you can reproduce it :) > > > > > > i have a bit more data... i'm seeing the same problem on debian's > > > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. > > > > > > > This is just brainstorming at this point, but it looks like xfs can > > submit more requests in the bi_end_io path such that it can lock > > itself out of the RAID array. The sequence that concerns me is: > > > > return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request-><hang> > > > > I need verify whether this path is actually triggering, but if we are > > in an inactive_blocked condition this new request will be put on a > > wait queue and we'll never get to the release_stripe() call after > > return_io(). It would be interesting to see if this is new XFS > > behavior in recent kernels. > > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > which was Neil's change in 2.6.22 for deferring generic_make_request > until there's enough stack space for it. > > with my git tree sync'd to that commit my test cases fail in under 20 > minutes uptime (i rebooted and tested 3x). sync'd to the commit previous > to it i've got 8h of run-time now without the problem. > > this isn't definitive of course since it does seem to be timing > dependent, but since all failures have occured much earlier than that > for me so far i think this indicates this change is either the cause of > the problem or exacerbates an existing raid5 problem. > > given that this problem looks like a very rare problem i saw with 2.6.18 > (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an > existing problem... not that i have evidence either way. > > i've attached a new kernel log with a hang at d89d87965d... and the > reduced config file i was using for the bisect. hopefully the hang > looks the same as what we were seeing at 2.6.24-rc6. let me know. 
> Dean could you try the below patch to see if it fixes your failure scenario? It passes my test case. Thanks, Dan -------> md: add generic_make_request_immed to prevent raid5 hang From: Dan Williams <dan.j.williams@intel.com> Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization by preventing recursive calls to generic_make_request. However the following conditions can cause raid5 to hang until 'stripe_cache_size' is increased: 1/ stripe_cache_active is N stripes away from the 'inactive_blocked' limit (3/4 * stripe_cache_size) 2/ a bio is submitted that requires M stripes to be processed where M > N 3/ stripes 1 through N are up-to-date and ready for immediate processing, i.e. no trip through raid5d required This results in the calling thread hanging while waiting for resources to process stripes N through M. This means we never return from make_request. All other raid5 users pile up in get_active_stripe. Increasing stripe_cache_size temporarily resolves the blockage by allowing the blocked make_request to return to generic_make_request. Another way to solve this is to move all i/o submission to raid5d context. Thanks to Dean Gaudet for bisecting this down to d89d8796. Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- block/ll_rw_blk.c | 16 +++++++++++++--- drivers/md/raid5.c | 4 ++-- include/linux/blkdev.h | 1 + 3 files changed, 16 insertions(+), 5 deletions(-) diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c index 8b91994..bff40c2 100644 --- a/block/ll_rw_blk.c +++ b/block/ll_rw_blk.c @@ -3287,16 +3287,26 @@ end_io: } /* - * We only want one ->make_request_fn to be active at a time, - * else stack usage with stacked devices could be a problem. + * In the general case we only want one ->make_request_fn to be active + * at a time, else stack usage with stacked devices could be a problem. * So use current->bio_{list,tail} to keep a list of requests * submited by a make_request_fn function. * current->bio_tail is also used as a flag to say if * generic_make_request is currently active in this task or not. * If it is NULL, then no make_request is active. If it is non-NULL, * then a make_request is active, and new requests should be added - * at the tail + * at the tail. + * However, some stacking drivers, like md-raid5, need to submit + * the bio without delay when it may not have the resources to + * complete its q->make_request_fn. generic_make_request_immed is + * provided for this explicit purpose. 
*/ +void generic_make_request_immed(struct bio *bio) +{ + __generic_make_request(bio); +} +EXPORT_SYMBOL(generic_make_request_immed); + void generic_make_request(struct bio *bio) { if (current->bio_tail) { diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index c857b5a..ffa2be4 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -450,7 +450,7 @@ static void ops_run_io(struct stripe_head *sh) test_bit(R5_ReWrite, &sh->dev[i].flags)) atomic_add(STRIPE_SECTORS, &rdev->corrected_errors); - generic_make_request(bi); + generic_make_request_immed(bi); } else { if (rw == WRITE) set_bit(STRIPE_DEGRADED, &sh->state); @@ -3124,7 +3124,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page) if (rw == WRITE && test_bit(R5_ReWrite, &sh->dev[i].flags)) atomic_add(STRIPE_SECTORS, &rdev->corrected_errors); - generic_make_request(bi); + generic_make_request_immed(bi); } else { if (rw == WRITE) set_bit(STRIPE_DEGRADED, &sh->state); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index d18ee67..774a3a0 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -642,6 +642,7 @@ extern int blk_register_queue(struct gendisk *disk); extern void blk_unregister_queue(struct gendisk *disk); extern void register_disk(struct gendisk *dev); extern void generic_make_request(struct bio *bio); +extern void generic_make_request_immed(struct bio *bio); extern void blk_put_request(struct request *); extern void __blk_put_request(struct request_queue *, struct request *); extern void blk_end_sync_rq(struct request *rq, int error); ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-09 18:28 ` Dan Williams @ 2008-01-10 0:09 ` Neil Brown 2008-01-10 3:07 ` Dan Williams ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Neil Brown @ 2008-01-10 0:09 UTC (permalink / raw) To: Dan Williams; +Cc: dean gaudet, linux-raid On Wednesday January 9, dan.j.williams@intel.com wrote: > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > > > which was Neil's change in 2.6.22 for deferring generic_make_request > > until there's enough stack space for it. > > > > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization > by preventing recursive calls to generic_make_request. However the > following conditions can cause raid5 to hang until 'stripe_cache_size' is > increased: > Thanks for pursuing this guys. That explanation certainly sounds very credible. The generic_make_request_immed is a good way to confirm that we have found the bug, but I don't like it as a long term solution, as it just reintroduced the problem that we were trying to solve with the problematic commit. As you say, we could arrange that all request submission happens in raid5d and I think this is the right way to proceed. However we can still take some of the work into the thread that is submitting the IO by calling "raid5d()" at the end of make_request, like this. Can you test it please? Does it seem reasonable? Thanks, NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/md.c | 2 +- ./drivers/md/raid5.c | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff .prev/drivers/md/md.c ./drivers/md/md.c --- .prev/drivers/md/md.c 2008-01-07 13:32:10.000000000 +1100 +++ ./drivers/md/md.c 2008-01-10 11:08:02.000000000 +1100 @@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev) if (mddev->ro) return; - if (signal_pending(current)) { + if (current == mddev->thread->tsk && signal_pending(current)) { if (mddev->pers->sync_request) { printk(KERN_INFO "md: %s in immediate safe mode\n", mdname(mddev)); diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c --- .prev/drivers/md/raid5.c 2008-01-07 13:32:10.000000000 +1100 +++ ./drivers/md/raid5.c 2008-01-10 11:06:54.000000000 +1100 @@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req } } +static void raid5d (mddev_t *mddev); static int make_request(struct request_queue *q, struct bio * bi) { @@ -3547,7 +3548,7 @@ static int make_request(struct request_q goto retry; } finish_wait(&conf->wait_for_overlap, &w); - handle_stripe(sh, NULL); + set_bit(STRIPE_HANDLE, &sh->state); release_stripe(sh); } else { /* cannot get stripe for read-ahead, just give-up */ @@ -3569,6 +3570,7 @@ static int make_request(struct request_q test_bit(BIO_UPTODATE, &bi->bi_flags) ? 0 : -EIO); } + raid5d(mddev); return 0; } ^ permalink raw reply [flat|nested] 30+ messages in thread
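For anyone wanting to test the inline patch above, it applies with one path component stripped; a rough sequence, assuming the mail body has been saved to a file (the filename is made up):

  cd linux-2.6.24-rc6
  patch -p1 --dry-run < raid5-defer-to-raid5d.patch   # hypothetical filename
  patch -p1 < raid5-defer-to-raid5d.patch
  make oldconfig && make -j4 bzImage modules && make modules_install install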
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-10 0:09 ` Neil Brown @ 2008-01-10 3:07 ` Dan Williams 2008-01-10 3:57 ` Neil Brown 2008-01-10 7:13 ` dean gaudet 2008-01-10 17:59 ` dean gaudet 2 siblings, 1 reply; 30+ messages in thread From: Dan Williams @ 2008-01-10 3:07 UTC (permalink / raw) To: Neil Brown; +Cc: dean gaudet, linux-raid On Jan 9, 2008 5:09 PM, Neil Brown <neilb@suse.de> wrote: > On Wednesday January 9, dan.j.williams@intel.com wrote: > > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: > > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > > > > > which was Neil's change in 2.6.22 for deferring generic_make_request > > > until there's enough stack space for it. > > > > > > > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization > > by preventing recursive calls to generic_make_request. However the > > following conditions can cause raid5 to hang until 'stripe_cache_size' is > > increased: > > > > Thanks for pursuing this guys. That explanation certainly sounds very > credible. > > The generic_make_request_immed is a good way to confirm that we have > found the bug, but I don't like it as a long term solution, as it > just reintroduced the problem that we were trying to solve with the > problematic commit. > > As you say, we could arrange that all request submission happens in > raid5d and I think this is the right way to proceed. However we can > still take some of the work into the thread that is submitting the > IO by calling "raid5d()" at the end of make_request, like this. > > Can you test it please? This passes my failure case. However, my test is different from Dean's in that I am using tiobench and the latest rev of my 'get_priority_stripe' patch. I believe the failure mechanism is the same, but it would be good to get confirmation from Dean. get_priority_stripe has the effect of increasing the frequency of make_request->handle_stripe->generic_make_request sequences. > Does it seem reasonable? What do you think about limiting the number of stripes the submitting thread handles to be equal to what it submitted? If I'm a stripe that only submits 1 stripe worth of work should I get stuck handling the rest of the cache? Regards, Dan ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-10 3:07 ` Dan Williams @ 2008-01-10 3:57 ` Neil Brown 2008-01-10 4:56 ` Dan Williams 2008-01-10 20:28 ` Bill Davidsen 0 siblings, 2 replies; 30+ messages in thread From: Neil Brown @ 2008-01-10 3:57 UTC (permalink / raw) To: Dan Williams; +Cc: dean gaudet, linux-raid On Wednesday January 9, dan.j.williams@intel.com wrote: > On Jan 9, 2008 5:09 PM, Neil Brown <neilb@suse.de> wrote: > > On Wednesday January 9, dan.j.williams@intel.com wrote: > > > > Can you test it please? > > This passes my failure case. Thanks! > > > Does it seem reasonable? > > What do you think about limiting the number of stripes the submitting > thread handles to be equal to what it submitted? If I'm a stripe that > only submits 1 stripe worth of work should I get stuck handling the > rest of the cache? Dunno.... Someone has to do the work, and leaving it all to raid5d means that it all gets done on one CPU. I expect that most of the time the queue of ready stripes is empty so make_request will mostly only handle it's own stripes anyway. The times that it handles other thread's stripes will probably balance out with the times that other threads handle this threads stripes. So I'm incline to leave it as "do as much work as is available to be done" as that is simplest. But I can probably be talked out of it with a convincing argument.... NeilBrown ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-10 3:57 ` Neil Brown @ 2008-01-10 4:56 ` Dan Williams 2008-01-10 20:28 ` Bill Davidsen 1 sibling, 0 replies; 30+ messages in thread From: Dan Williams @ 2008-01-10 4:56 UTC (permalink / raw) To: Neil Brown; +Cc: dean gaudet, linux-raid On Wed, 2008-01-09 at 20:57 -0700, Neil Brown wrote: > So I'm incline to leave it as "do as much work as is available to be > done" as that is simplest. But I can probably be talked out of it > with a convincing argument.... Well, in an age of CFS and CFQ it smacks of 'unfairness'. But does that trump KISS...? Probably not. -- Dan ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-10 3:57 ` Neil Brown 2008-01-10 4:56 ` Dan Williams @ 2008-01-10 20:28 ` Bill Davidsen 1 sibling, 0 replies; 30+ messages in thread From: Bill Davidsen @ 2008-01-10 20:28 UTC (permalink / raw) To: Neil Brown; +Cc: Dan Williams, dean gaudet, linux-raid Neil Brown wrote: > On Wednesday January 9, dan.j.williams@intel.com wrote: > >> On Jan 9, 2008 5:09 PM, Neil Brown <neilb@suse.de> wrote: >> >>> On Wednesday January 9, dan.j.williams@intel.com wrote: >>> >>> Can you test it please? >>> >> This passes my failure case. >> > > Thanks! > > >>> Does it seem reasonable? >>> >> What do you think about limiting the number of stripes the submitting >> thread handles to be equal to what it submitted? If I'm a stripe that >> only submits 1 stripe worth of work should I get stuck handling the >> rest of the cache? >> > > Dunno.... > Someone has to do the work, and leaving it all to raid5d means that it > all gets done on one CPU. > I expect that most of the time the queue of ready stripes is empty so > make_request will mostly only handle it's own stripes anyway. > The times that it handles other thread's stripes will probably balance > out with the times that other threads handle this threads stripes. > > So I'm incline to leave it as "do as much work as is available to be > done" as that is simplest. But I can probably be talked out of it > with a convincing argument.... > How about "it will perform better (defined as faster) during conditions of unusual i/o activity?" Is that a convincing argument to use your solution as offered? How about "complexity and maintainability are a zero-sum problem?" -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-10 0:09 ` Neil Brown 2008-01-10 3:07 ` Dan Williams @ 2008-01-10 7:13 ` dean gaudet 2008-01-10 18:49 ` Dan Williams 2008-01-10 17:59 ` dean gaudet 2 siblings, 1 reply; 30+ messages in thread From: dean gaudet @ 2008-01-10 7:13 UTC (permalink / raw) To: Neil Brown; +Cc: Dan Williams, linux-raid On Thu, 10 Jan 2008, Neil Brown wrote: > On Wednesday January 9, dan.j.williams@intel.com wrote: > > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: > > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > > > > > which was Neil's change in 2.6.22 for deferring generic_make_request > > > until there's enough stack space for it. > > > > > > > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization > > by preventing recursive calls to generic_make_request. However the > > following conditions can cause raid5 to hang until 'stripe_cache_size' is > > increased: > > > > Thanks for pursuing this guys. That explanation certainly sounds very > credible. > > The generic_make_request_immed is a good way to confirm that we have > found the bug, but I don't like it as a long term solution, as it > just reintroduced the problem that we were trying to solve with the > problematic commit. > > As you say, we could arrange that all request submission happens in > raid5d and I think this is the right way to proceed. However we can > still take some of the work into the thread that is submitting the > IO by calling "raid5d()" at the end of make_request, like this. > > Can you test it please? Does it seem reasonable? i've got this running now (against 2.6.24-rc6)... it has passed ~25 minutes of testing so far, which is a good sign. i'll report back tomorrow and hopefully we'll have survived 8h+ of testing. thanks! w.r.t. dan's cfq comments -- i really don't know the details, but does this mean cfq will misattribute the IO to the wrong user/process? or is it just a concern that CPU time will be spent on someone's IO? the latter is fine to me... the former seems sucky because with today's multicore systems CPU time seems cheap compared to IO. -dean ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-10 7:13 ` dean gaudet @ 2008-01-10 18:49 ` Dan Williams 2008-01-11 1:46 ` Neil Brown 0 siblings, 1 reply; 30+ messages in thread From: Dan Williams @ 2008-01-10 18:49 UTC (permalink / raw) To: dean gaudet; +Cc: Neil Brown, linux-raid On Jan 10, 2008 12:13 AM, dean gaudet <dean@arctic.org> wrote: > w.r.t. dan's cfq comments -- i really don't know the details, but does > this mean cfq will misattribute the IO to the wrong user/process? or is > it just a concern that CPU time will be spent on someone's IO? the latter > is fine to me... the former seems sucky because with today's multicore > systems CPU time seems cheap compared to IO. > I do not see this affecting the time slicing feature of cfq, because as Neil says the work has to get done at some point. If I give up some of my slice working on someone else's I/O chances are the favor will be returned in kind since the code does not discriminate. The io-priority capability of cfq currently does not work as advertised with current MD since the priority is tied to the current thread and the thread that actually submits the i/o on a stripe is non-deterministic. So I do not see this change making the situation any worse. In fact, it may make it a bit better since there is a higher chance for the thread submitting i/o to MD to do its own i/o to the backing disks. Reviewed-by: Dan Williams <dan.j.williams@intel.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-10 18:49 ` Dan Williams @ 2008-01-11 1:46 ` Neil Brown 2008-01-11 2:14 ` dean gaudet 0 siblings, 1 reply; 30+ messages in thread From: Neil Brown @ 2008-01-11 1:46 UTC (permalink / raw) To: Dan Williams; +Cc: dean gaudet, linux-raid On Thursday January 10, dan.j.williams@gmail.com wrote: > On Jan 10, 2008 12:13 AM, dean gaudet <dean@arctic.org> wrote: > > w.r.t. dan's cfq comments -- i really don't know the details, but does > > this mean cfq will misattribute the IO to the wrong user/process? or is > > it just a concern that CPU time will be spent on someone's IO? the latter > > is fine to me... the former seems sucky because with today's multicore > > systems CPU time seems cheap compared to IO. > > > > I do not see this affecting the time slicing feature of cfq, because > as Neil says the work has to get done at some point. If I give up > some of my slice working on someone else's I/O chances are the favor > will be returned in kind since the code does not discriminate. The > io-priority capability of cfq currently does not work as advertised > with current MD since the priority is tied to the current thread and > the thread that actually submits the i/o on a stripe is > non-deterministic. So I do not see this change making the situation > any worse. In fact, it may make it a bit better since there is a > higher chance for the thread submitting i/o to MD to do its own i/o to > the backing disks. > > Reviewed-by: Dan Williams <dan.j.williams@intel.com> Thanks. But I suspect you didn't test it with a bitmap :-) I ran the mdadm test suite and it hit a problem - easy enough to fix. I'll look out for any other possible related problem (due to raid5d running in different processes) and then submit it. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-11 1:46 ` Neil Brown @ 2008-01-11 2:14 ` dean gaudet 0 siblings, 0 replies; 30+ messages in thread From: dean gaudet @ 2008-01-11 2:14 UTC (permalink / raw) To: Neil Brown; +Cc: Dan Williams, linux-raid On Fri, 11 Jan 2008, Neil Brown wrote: > Thanks. > But I suspect you didn't test it with a bitmap :-) > I ran the mdadm test suite and it hit a problem - easy enough to fix. damn -- i "lost" my bitmap 'cause it was external and i didn't have things set up properly to pick it up after a reboot :) if you send an updated patch i'll give it another spin... -dean ^ permalink raw reply [flat|nested] 30+ messages in thread
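For reference, a file-backed (external) bitmap is easy to lose this way because its location is not recorded in the array itself; it has to be named when it is added and again at every assembly. A sketch with made-up paths, assuming the bitmap file lives on a filesystem outside the array:

  # add an external write-intent bitmap to the existing array
  mdadm --grow /dev/md2 --bitmap=/boot/md2-bitmap

  # and name it explicitly when assembling after a reboot
  mdadm --assemble /dev/md2 --bitmap=/boot/md2-bitmap /dev/sd[a-h]1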
* Re: 2.6.24-rc6 reproducible raid5 hang 2008-01-10 0:09 ` Neil Brown 2008-01-10 3:07 ` Dan Williams 2008-01-10 7:13 ` dean gaudet @ 2008-01-10 17:59 ` dean gaudet 2 siblings, 0 replies; 30+ messages in thread From: dean gaudet @ 2008-01-10 17:59 UTC (permalink / raw) To: Neil Brown; +Cc: Dan Williams, linux-raid On Thu, 10 Jan 2008, Neil Brown wrote: > On Wednesday January 9, dan.j.williams@intel.com wrote: > > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: > > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 > > > > > > which was Neil's change in 2.6.22 for deferring generic_make_request > > > until there's enough stack space for it. > > > > > > > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization > > by preventing recursive calls to generic_make_request. However the > > following conditions can cause raid5 to hang until 'stripe_cache_size' is > > increased: > > > > Thanks for pursuing this guys. That explanation certainly sounds very > credible. > > The generic_make_request_immed is a good way to confirm that we have > found the bug, but I don't like it as a long term solution, as it > just reintroduced the problem that we were trying to solve with the > problematic commit. > > As you say, we could arrange that all request submission happens in > raid5d and I think this is the right way to proceed. However we can > still take some of the work into the thread that is submitting the > IO by calling "raid5d()" at the end of make_request, like this. > > Can you test it please? Does it seem reasonable? > > Thanks, > NeilBrown > > > Signed-off-by: Neil Brown <neilb@suse.de> it has passed 11h of the untar/diff/rm linux.tar.gz workload... that's pretty good evidence it works for me. thanks! Tested-by: dean gaudet <dean@arctic.org> > > ### Diffstat output > ./drivers/md/md.c | 2 +- > ./drivers/md/raid5.c | 4 +++- > 2 files changed, 4 insertions(+), 2 deletions(-) > > diff .prev/drivers/md/md.c ./drivers/md/md.c > --- .prev/drivers/md/md.c 2008-01-07 13:32:10.000000000 +1100 > +++ ./drivers/md/md.c 2008-01-10 11:08:02.000000000 +1100 > @@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev) > if (mddev->ro) > return; > > - if (signal_pending(current)) { > + if (current == mddev->thread->tsk && signal_pending(current)) { > if (mddev->pers->sync_request) { > printk(KERN_INFO "md: %s in immediate safe mode\n", > mdname(mddev)); > > diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c > --- .prev/drivers/md/raid5.c 2008-01-07 13:32:10.000000000 +1100 > +++ ./drivers/md/raid5.c 2008-01-10 11:06:54.000000000 +1100 > @@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req > } > } > > +static void raid5d (mddev_t *mddev); > > static int make_request(struct request_queue *q, struct bio * bi) > { > @@ -3547,7 +3548,7 @@ static int make_request(struct request_q > goto retry; > } > finish_wait(&conf->wait_for_overlap, &w); > - handle_stripe(sh, NULL); > + set_bit(STRIPE_HANDLE, &sh->state); > release_stripe(sh); > } else { > /* cannot get stripe for read-ahead, just give-up */ > @@ -3569,6 +3570,7 @@ static int make_request(struct request_q > test_bit(BIO_UPTODATE, &bi->bi_flags) > ? 
0 : -EIO); > } > + raid5d(mddev); > return 0; > } > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-27 17:06 2.6.24-rc6 reproducible raid5 hang dean gaudet 2007-12-27 17:39 ` dean gaudet @ 2007-12-27 19:52 ` Justin Piszcz 2007-12-28 0:08 ` dean gaudet 1 sibling, 1 reply; 30+ messages in thread From: Justin Piszcz @ 2007-12-27 19:52 UTC (permalink / raw) To: dean gaudet; +Cc: linux-raid On Thu, 27 Dec 2007, dean gaudet wrote: > hey neil -- remember that raid5 hang which me and only one or two others > ever experienced and which was hard to reproduce? we were debugging it > well over a year ago (that box has 400+ day uptime now so at least that > long ago :) the workaround was to increase stripe_cache_size... i seem to > have a way to reproduce something which looks much the same. > > setup: > > - 2.6.24-rc6 > - system has 8GiB RAM but no swap > - 8x750GB in a raid5 with one spare, chunksize 1024KiB. > - mkfs.xfs default options > - mount -o noatime > - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440 > > that sequence hangs for me within 10 seconds... and i can unhang / rehang > it by toggling between stripe_cache_size 256 and 1024. i detect the hang > by watching "iostat -kx /dev/sd? 5". > > i've attached the kernel log where i dumped task and timer state while it > was hung... note that you'll see at some point i did an xfs mount with > external journal but it happens with internal journal as well. > > looks like it's using the raid456 module and async api. > > anyhow let me know if you need more info / have any suggestions. > > -dean With that high of a stripe size the stripe_cache_size needs to be greater than the default to handle it. Justin. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang 2007-12-27 19:52 ` Justin Piszcz @ 2007-12-28 0:08 ` dean gaudet 0 siblings, 0 replies; 30+ messages in thread From: dean gaudet @ 2007-12-28 0:08 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 474 bytes --] On Thu, 27 Dec 2007, Justin Piszcz wrote: > With that high of a stripe size the stripe_cache_size needs to be greater than > the default to handle it. i'd argue that any deadlock is a bug... regardless i'm still seeing deadlocks with the default chunk_size of 64k and stripe_cache_size of 256... in this case it's with a workload which is untarring 34 copies of the linux kernel at the same time. it's a variant of doug ledford's memtest, and i've attached it. -dean [-- Attachment #2: Type: TEXT/PLAIN, Size: 4046 bytes --] #!/usr/bin/perl # Copyright (c) 2007 dean gaudet <dean@arctic.org> # # Permission is hereby granted, free of charge, to any person obtaining a # copy of this software and associated documentation files (the "Software"), # to deal in the Software without restriction, including without limitation # the rights to use, copy, modify, merge, publish, distribute, sublicense, # and/or sell copies of the Software, and to permit persons to whom the # Software is furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included # in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL # THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR # OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, # ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR # OTHER DEALINGS IN THE SOFTWARE. # this idea shamelessly stolen from doug ledford use warnings; use strict; # ensure stdout is not buffered select(STDOUT); $| = 1; my $usage = "usage: $0 linux.tar.gz /path1 [/path2 ...]\n"; defined(my $tarball = shift) or die $usage; -f $tarball or die "$tarball does not exist or is not a file\n"; my @paths = @ARGV; $#paths >= 0 or die "$usage"; # determine size of uncompressed tarball open(GZIP, "-|") || exec "gzip", "--quiet", "--list", $tarball; my $line = <GZIP>; my ($tarball_size) = $line =~ m#^\s*\d+\s*(\d+)#; defined($tarball_size) or die "unexpected result from gzip --quiet --list $tarball\n"; close(GZIP); # determine amount of memory open(MEMINFO, "</proc/meminfo") or die "unable to open /proc/meminfo for read: $!\n"; my $total_mem; while (<MEMINFO>) { if (/^MemTotal:\s*(\d+)\s*kB/) { $total_mem = $1; last; } } defined($total_mem) or die "did not find MemTotal line in /proc/meminfo\n"; close(MEMINFO); $total_mem *= 1024; print "total memory: $total_mem\n"; print "uncompressed tarball: $tarball_size\n"; my $nr_simultaneous = int(1.2 * $total_mem / $tarball_size); print "nr simultaneous processes: $nr_simultaneous\n"; sub system_or_die { my @args = @_; system(@args); if ($? == 1) { my $msg = sprintf("%s failed to exec %s: $!\n", scalar(localtime), $args[0]); } elsif ($? & 127) { my $msg = sprintf("%s %s died with signal %d, %s coredump\n", scalar(localtime), $args[0], ($? & 127), ($? & 128) ? "with" : "without"); die $msg; } elsif (($? >> 8) != 0) { my $msg = sprintf("%s %s exited with non-zero exit code %d\n", scalar(localtime), $args[0], $? 
>> 8); die $msg; } } sub untar($) { mkdir($_[0]) or die localtime()." unable to mkdir($_[0]): $!\n"; system_or_die("tar", "-xzf", $tarball, "-C", $_[0]); } print localtime()." untarring golden copy\n"; my $golden = $paths[0]."/dma_tmp.$$.gold"; untar($golden); my $pass_no = 0; while (1) { print localtime()." pass $pass_no: extracting\n"; my @outputs; foreach my $n (1..$nr_simultaneous) { # treat paths in a round-robin manner my $dir = shift(@paths); push(@paths, $dir); $dir .= "/dma_tmp.$$.$n"; push(@outputs, $dir); my $pid = fork; defined($pid) or die localtime()." unable to fork: $!\n"; if ($pid == 0) { untar($dir); exit(0); } } # wait for the children while (wait != -1) {} print localtime()." pass $pass_no: diffing\n"; foreach my $dir (@outputs) { my $pid = fork; defined($pid) or die localtime()." unable to fork: $!\n"; if ($pid == 0) { system_or_die("diff", "-U", "3", "-rN", $golden, $dir); system_or_die("rm", "-fr", $dir); exit(0); } } # wait for the children while (wait != -1) {} ++$pass_no; } ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang
@ 2008-01-23 13:37 Tim Southerwood
  2008-01-23 17:43 ` Carlos Carvalho
  0 siblings, 1 reply; 30+ messages in thread
From: Tim Southerwood @ 2008-01-23 13:37 UTC (permalink / raw)
  To: linux-raid

Sorry if this breaks threaded mail readers, I only just subscribed to
the list so I don't have the original post to reply to.

I believe I'm having the same problem.

Regarding XFS on a raid5 md array:

Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and*
2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch.

Raid 5 configured across 4 x 500GB SATA disks (Nforce nv_sata driver,
Asus M2N-E mobo, Athlon X64, 4GB RAM). MD chunk size is 1024k. This is
allocated to an LVM2 PV, then sliced up. Taking one sample logical
volume of 150GB, I ran:

mkfs.xfs -d su=1024k,sw=3 -L vol_linux /dev/vg00/vol_linux

I then found that putting a high write load on that filesystem caused a
hang. High load could be as little as a single rsync of a mirror of
Ubuntu Gutsy (many tens of GB) from my old server to here. A hang would
typically happen within a few hours. I could generate hangs relatively
quickly by running xfs_fsr (the defragmenter) in parallel.

Trying the workaround of upping /sys/block/md1/md/stripe_cache_size to
4096 seems (fingers crossed) to have helped. I've been running the rsync
again, plus xfs_fsr, plus a few dd's of 11 GB to the same filesystem.

I also noticed that the write speed increased dramatically with a
bigger stripe_cache_size.

A more detailed analysis of the problem indicated that, after the hang:
I could log in; one CPU core was stuck in 100% IO wait; the other core
was usable, with care. So I managed to get a SysRq-T dump, and one place
the system appeared blocked was via this path:

[ 2039.466258] xfs_fsr       D 0000000000000000     0  7324   7308
[ 2039.466260]  ffff810119399858 0000000000000082 0000000000000000 0000000000000046
[ 2039.466263]  ffff810110d6c680 ffff8101102ba998 ffff8101102ba770 ffffffff8054e5e0
[ 2039.466265]  ffff8101102ba998 000000010014a1e6 ffffffffffffffff ffff810110ddcb30
[ 2039.466268] Call Trace:
[ 2039.466277]  [<ffffffff8808a26b>] :raid456:get_active_stripe+0x1cb/0x610
[ 2039.466282]  [<ffffffff80234000>] default_wake_function+0x0/0x10
[ 2039.466289]  [<ffffffff88090ff8>] :raid456:make_request+0x1f8/0x610
[ 2039.466293]  [<ffffffff80251c20>] autoremove_wake_function+0x0/0x30
[ 2039.466295]  [<ffffffff80331121>] __up_read+0x21/0xb0
[ 2039.466300]  [<ffffffff8031f336>] generic_make_request+0x1d6/0x3d0
[ 2039.466303]  [<ffffffff80280bad>] vm_normal_page+0x3d/0xc0
[ 2039.466307]  [<ffffffff8031f59f>] submit_bio+0x6f/0xf0
[ 2039.466311]  [<ffffffff802c98cc>] dio_bio_submit+0x5c/0x90
[ 2039.466313]  [<ffffffff802c9943>] dio_send_cur_page+0x43/0xa0
[ 2039.466316]  [<ffffffff802c99ee>] submit_page_section+0x4e/0x150
[ 2039.466319]  [<ffffffff802ca2e2>] __blockdev_direct_IO+0x742/0xb50
[ 2039.466342]  [<ffffffff8832e9a2>] :xfs:xfs_vm_direct_IO+0x182/0x190
[ 2039.466357]  [<ffffffff8832edb0>] :xfs:xfs_get_blocks_direct+0x0/0x20
[ 2039.466370]  [<ffffffff8832e350>] :xfs:xfs_end_io_direct+0x0/0x80
[ 2039.466375]  [<ffffffff80444fb5>] __wait_on_bit_lock+0x65/0x80
[ 2039.466380]  [<ffffffff80272883>] generic_file_direct_IO+0xe3/0x190
[ 2039.466385]  [<ffffffff802729a4>] generic_file_direct_write+0x74/0x150
[ 2039.466402]  [<ffffffff88336db2>] :xfs:xfs_write+0x492/0x8f0
[ 2039.466421]  [<ffffffff883099bc>] :xfs:xfs_iunlock+0x2c/0xb0
[ 2039.466437]  [<ffffffff88336866>] :xfs:xfs_read+0x186/0x240
[ 2039.466443]  [<ffffffff8029e5b9>] do_sync_write+0xd9/0x120
[ 2039.466448]  [<ffffffff80251c20>] autoremove_wake_function+0x0/0x30
[ 2039.466457]  [<ffffffff8029eead>] vfs_write+0xdd/0x190
[ 2039.466461]  [<ffffffff8029f5b3>] sys_write+0x53/0x90
[ 2039.466465]  [<ffffffff8020c29e>] system_call+0x7e/0x83

However, I'm of the opinion that the system should not deadlock, even if
tunable parameters are unfavourable.

I'm happy with the workaround (indeed the system performs better).
However, it will take me a week's worth of testing before I'm willing to
commission this as my new fileserver.

So, if there is anything anyone would like me to try, I'm happy to
volunteer as a guinea pig :) Yes, I can build and patch kernels, but I'm
not hot at debugging them, so if kernel core dumps or whatever are
needed, please point me at the right document or hint as to which
commands I need to read about.

Cheers

Tim
^ permalink raw reply	[flat|nested] 30+ messages in thread
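(For anyone wanting to try the same workaround, here is a minimal shell sketch of the tuning and mkfs geometry described above. It assumes this reporter's setup — md1, /dev/vg00/vol_linux, a 4-disk raid5 with a 1024k chunk — so substitute your own device names and layout.)

  # raise the raid5 stripe cache; the value counts cache entries, not bytes
  echo 4096 > /sys/block/md1/md/stripe_cache_size

  # see how much of the cache is currently in use
  cat /sys/block/md1/md/stripe_cache_active

  # XFS geometry for a 4-disk raid5 with 1024k chunks: stripe unit = chunk
  # size, stripe width = number of data disks (4 - 1 parity = 3)
  mkfs.xfs -d su=1024k,sw=3 -L vol_linux /dev/vg00/vol_linux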
* Re: 2.6.24-rc6 reproducible raid5 hang
  2008-01-23 13:37 Tim Southerwood
@ 2008-01-23 17:43 ` Carlos Carvalho
  2008-01-24 20:30   ` Tim Southerwood
  0 siblings, 1 reply; 30+ messages in thread
From: Carlos Carvalho @ 2008-01-23 17:43 UTC (permalink / raw)
  To: Tim Southerwood; +Cc: linux-raid

Tim Southerwood (ts@dionic.net) wrote on 23 January 2008 13:37:
 >Sorry if this breaks threaded mail readers, I only just subscribed to
 >the list so I don't have the original post to reply to.
 >
 >I believe I'm having the same problem.
 >
 >Regarding XFS on a raid5 md array:
 >
 >Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and*
 >2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch.

This has been corrected already, install Neil's patches. It worked for
several people under high stress, including us.
^ permalink raw reply	[flat|nested] 30+ messages in thread
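(Applying a raw patch by hand, as the follow-ups below describe doing, generally looks something like the sketch here. The patch filename is only a placeholder, not the actual patch referenced in this thread.)

  cd linux-2.6.24
  # check that the patch applies cleanly before touching the tree
  patch -p1 --dry-run < ../raid5-stripe-fix.patch
  patch -p1 < ../raid5-stripe-fix.patch
  # then rebuild and install the kernel as usual, e.g.:
  make oldconfig && make && make modules_install install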
* Re: 2.6.24-rc6 reproducible raid5 hang
  2008-01-23 17:43 ` Carlos Carvalho
@ 2008-01-24 20:30   ` Tim Southerwood
  2008-01-28 17:29     ` Tim Southerwood
  0 siblings, 1 reply; 30+ messages in thread
From: Tim Southerwood @ 2008-01-24 20:30 UTC (permalink / raw)
  To: linux-raid

Carlos Carvalho wrote:
> Tim Southerwood (ts@dionic.net) wrote on 23 January 2008 13:37:
>  >Sorry if this breaks threaded mail readers, I only just subscribed to
>  >the list so I don't have the original post to reply to.
>  >
>  >I believe I'm having the same problem.
>  >
>  >Regarding XFS on a raid5 md array:
>  >
>  >Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and*
>  >2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch.
>
> This has been corrected already, install Neil's patches. It worked for
> several people under high stress, including us.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hi

I just coerced the patch into 2.6.23.14, reset
/sys/block/md1/md/stripe_cache_size to the default (256) and rebooted.

I can confirm that after 2 hours of heavy bashing[1] the system has not
hung. Looks good - many thanks. But I will run with a stripe_cache_size
of 4096 in practice, as it improves write speed on my configuration by
about 2.5 times.

Cheers

Tim

[1] Rsync of >50GB to the raid plus xfs_fsr + dd of 11GB of /dev/zero to
the same filesystem.
^ permalink raw reply	[flat|nested] 30+ messages in thread
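(A rough sense of what a stripe_cache_size of 4096 costs in memory — a back-of-the-envelope estimate, not a figure from this thread, assuming 4 KiB pages and that the cache holds one page per member disk per entry:)

  # entries * page size * member disks, here for the 4-disk array above
  echo "$(( 4096 * 4096 * 4 / 1048576 )) MiB"    # -> 64 MiB

  # the sysfs setting does not survive a reboot, so it is typically
  # reapplied from a boot script such as rc.local:
  echo 4096 > /sys/block/md1/md/stripe_cache_size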
* Re: 2.6.24-rc6 reproducible raid5 hang
  2008-01-24 20:30 ` Tim Southerwood
@ 2008-01-28 17:29   ` Tim Southerwood
  2008-01-29 14:16     ` Carlos Carvalho
  0 siblings, 1 reply; 30+ messages in thread
From: Tim Southerwood @ 2008-01-28 17:29 UTC (permalink / raw)
  To: linux-raid

Subtitle: Patch to mainline yet?

Hi

I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
on my server.

Was that the correct thing to do, or did this issue get fixed in a
different way that I wouldn't have spotted? I had a look at the git logs
but it was not obvious - please pardon my ignorance, I'm not familiar
enough with the code.

Many thanks,

Tim

Tim Southerwood wrote:
> Carlos Carvalho wrote:
>> Tim Southerwood (ts@dionic.net) wrote on 23 January 2008 13:37:
>>  >Sorry if this breaks threaded mail readers, I only just subscribed to
>>  >the list so I don't have the original post to reply to.
>>  >
>>  >I believe I'm having the same problem.
>>  >
>>  >Regarding XFS on a raid5 md array:
>>  >
>>  >Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and*
>>  >2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch.
>>
>> This has been corrected already, install Neil's patches. It worked for
>> several people under high stress, including us.
>> -
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> Hi
>
> I just coerced the patch into 2.6.23.14, reset
> /sys/block/md1/md/stripe_cache_size to the default (256) and rebooted.
>
> I can confirm that after 2 hours of heavy bashing[1] the system has not
> hung. Looks good - many thanks. But I will run with a stripe_cache_size
> of 4096 in practice, as it improves write speed on my configuration by
> about 2.5 times.
>
> Cheers
>
> Tim
>
> [1] Rsync of >50GB to the raid plus xfs_fsr + dd of 11GB of /dev/zero to
> the same filesystem.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 30+ messages in thread
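(One way to check whether a fix has reached a given release, assuming a clone of the mainline kernel tree, is to walk the history of the raid5 code between the two release tags — a sketch, not the exact search used here:)

  # raid5/md changes merged between 2.6.23 and 2.6.24
  git log --pretty=oneline v2.6.23..v2.6.24 -- drivers/md/raid5.c drivers/md/md.c

  # or search commit messages for a keyword, case-insensitively
  git log -i --grep=stripe --pretty=oneline v2.6.23..v2.6.24 -- drivers/md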
* Re: 2.6.24-rc6 reproducible raid5 hang
  2008-01-28 17:29 ` Tim Southerwood
@ 2008-01-29 14:16   ` Carlos Carvalho
  2008-01-29 22:58     ` Bill Davidsen
  0 siblings, 1 reply; 30+ messages in thread
From: Carlos Carvalho @ 2008-01-29 14:16 UTC (permalink / raw)
  To: linux-raid

Tim Southerwood (ts@dionic.net) wrote on 28 January 2008 17:29:
 >Subtitle: Patch to mainline yet?
 >
 >Hi
 >
 >I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
 >on my server.

I applied all 4 pending patches to .24. It's been better than .22 and
.23... Unfortunately the bitmap and raid1 patches don't go into .22.16.
^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang
  2008-01-29 14:16 ` Carlos Carvalho
@ 2008-01-29 22:58   ` Bill Davidsen
  2008-02-14 10:13     ` Burkhard Carstens
  0 siblings, 1 reply; 30+ messages in thread
From: Bill Davidsen @ 2008-01-29 22:58 UTC (permalink / raw)
  To: Carlos Carvalho; +Cc: linux-raid, Neil Brown

Carlos Carvalho wrote:
> Tim Southerwood (ts@dionic.net) wrote on 28 January 2008 17:29:
>  >Subtitle: Patch to mainline yet?
>  >
>  >Hi
>  >
>  >I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
>  >on my server.
>
> I applied all 4 pending patches to .24. It's been better than .22 and
> .23... Unfortunately the bitmap and raid1 patches don't go into .22.16.

Neil, have these been sent up against 24-stable and 23-stable?

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark
^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: 2.6.24-rc6 reproducible raid5 hang
  2008-01-29 22:58 ` Bill Davidsen
@ 2008-02-14 10:13   ` Burkhard Carstens
  0 siblings, 0 replies; 30+ messages in thread
From: Burkhard Carstens @ 2008-02-14 10:13 UTC (permalink / raw)
  To: linux-raid

On Tuesday, 29 January 2008 23:58, Bill Davidsen wrote:
> Carlos Carvalho wrote:
> > Tim Southerwood (ts@dionic.net) wrote on 28 January 2008 17:29:
> >  >Subtitle: Patch to mainline yet?
> >  >
> >  >Hi
> >  >
> >  >I don't see evidence of Neil's patch in 2.6.24, so I applied it
> >  >by hand on my server.
> >
> > I applied all 4 pending patches to .24. It's been better than .22
> > and .23... Unfortunately the bitmap and raid1 patches don't go
> > into .22.16.
>
> Neil, have these been sent up against 24-stable and 23-stable?

... and .22-stable?

Also, is this an xfs-on-raid5 bug, or would it also happen with
ext3-on-raid5?

regards
 Burkhard
^ permalink raw reply	[flat|nested] 30+ messages in thread