* raid5 performance question
@ 2006-03-06 11:46 Raz Ben-Jehuda(caro)
2006-03-06 11:59 ` Gordon Henderson
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-03-06 11:46 UTC (permalink / raw)
To: Linux RAID Mailing List; +Cc: Neil Brown
Neil Hello .
I have a performance question.
I am using raid5 stripe size 1024K over 4 disks.
I am benchmarking it with an asynchronous tester.
This tester submits 100 IOs of size of 1024 K --> as the stripe size.
It reads raw io from the device, no file system is involved.
I am making the following comparsion:
1. Reading 4 disks at the same time using 1 MB buffer in random manner.
2. Reading 1 raid5 device using 1MB buffer in random manner.
I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from
4 disks, the raid5 device gives 35 MB/s .
it is like i am reading a single disk , but by looking at iostat i can
see that all
disks are active but with low throughput.
Any idea ?
Thank you.
--
Raz
^ permalink raw reply [flat|nested] 43+ messages in thread* Re: raid5 performance question 2006-03-06 11:46 raid5 performance question Raz Ben-Jehuda(caro) @ 2006-03-06 11:59 ` Gordon Henderson 2006-03-06 12:56 ` Raz Ben-Jehuda(caro) 2006-03-06 22:17 ` raid5 performance question Guy 2006-03-06 22:24 ` Neil Brown 2 siblings, 1 reply; 43+ messages in thread From: Gordon Henderson @ 2006-03-06 11:59 UTC (permalink / raw) To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List On Mon, 6 Mar 2006, Raz Ben-Jehuda(caro) wrote: > Neil Hello . > I have a performance question. > > I am using raid5 stripe size 1024K over 4 disks. > I am benchmarking it with an asynchronous tester. > This tester submits 100 IOs of size of 1024 K --> as the stripe size. > It reads raw io from the device, no file system is involved. > > I am making the following comparsion: > > 1. Reading 4 disks at the same time using 1 MB buffer in random manner. > 2. Reading 1 raid5 device using 1MB buffer in random manner. > > I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from > 4 disks, the raid5 device gives 35 MB/s . > it is like i am reading a single disk , but by looking at iostat i can > see that all > disks are active but with low throughput. > > Any idea ? Is this reading the block device direct, or via a filesystem? If the latter, what filesystem? If ext2/3 have you tried mkfs with a stride option? See: http://www.tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.11 Gordon ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 performance question 2006-03-06 11:59 ` Gordon Henderson @ 2006-03-06 12:56 ` Raz Ben-Jehuda(caro) 2006-03-13 22:36 ` raid5 hang on get_active_stripe Patrik Jonsson 0 siblings, 1 reply; 43+ messages in thread From: Raz Ben-Jehuda(caro) @ 2006-03-06 12:56 UTC (permalink / raw) To: Gordon Henderson; +Cc: Linux RAID Mailing List it reads raw. no filesystem whatsover. On 3/6/06, Gordon Henderson <gordon@drogon.net> wrote: > On Mon, 6 Mar 2006, Raz Ben-Jehuda(caro) wrote: > > > Neil Hello . > > I have a performance question. > > > > I am using raid5 stripe size 1024K over 4 disks. > > I am benchmarking it with an asynchronous tester. > > This tester submits 100 IOs of size of 1024 K --> as the stripe size. > > It reads raw io from the device, no file system is involved. > > > > I am making the following comparsion: > > > > 1. Reading 4 disks at the same time using 1 MB buffer in random manner. > > 2. Reading 1 raid5 device using 1MB buffer in random manner. > > > > I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from > > 4 disks, the raid5 device gives 35 MB/s . > > it is like i am reading a single disk , but by looking at iostat i can > > see that all > > disks are active but with low throughput. > > > > Any idea ? > > Is this reading the block device direct, or via a filesystem? If the > latter, what filesystem? > > If ext2/3 have you tried mkfs with a stride option? > > See: > http://www.tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.11 > > Gordon > -- Raz ^ permalink raw reply [flat|nested] 43+ messages in thread
* raid5 hang on get_active_stripe 2006-03-06 12:56 ` Raz Ben-Jehuda(caro) @ 2006-03-13 22:36 ` Patrik Jonsson 2006-03-13 23:17 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: Patrik Jonsson @ 2006-03-13 22:36 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 577 bytes --] Hi all, I just experienced some kind of lockup accessing my 8-drive raid5 (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but now processes that try to read the md device hang. ps tells me they are all sleeping in get_active_stripe. There is nothing in the syslog, and I can read from the individual drives fine with dd. mdadm says the state is "active". I'm not sure what to do now. Is it safe to try to reboot the system or could that cause the device to get corrupted if it's hung in the middle of some important operation? Ideas appreciated, /Patrik [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 254 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-03-13 22:36 ` raid5 hang on get_active_stripe Patrik Jonsson @ 2006-03-13 23:17 ` Neil Brown 2006-05-11 15:13 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-03-13 23:17 UTC (permalink / raw) To: Patrik Jonsson; +Cc: linux-raid On Monday March 13, patrik@ucolick.org wrote: > Hi all, > > I just experienced some kind of lockup accessing my 8-drive raid5 > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but > now processes that try to read the md device hang. ps tells me they are > all sleeping in get_active_stripe. There is nothing in the syslog, and I > can read from the individual drives fine with dd. mdadm says the state > is "active". Hmmm... That's sad. That's going to be very hard to track down. If you could echo t > /proc/sysrq-trigger and send me the dump that appears in the kernel log, I would appreciate it. I doubt it will be very helpful, but it is the best bet I can come up with. > > I'm not sure what to do now. Is it safe to try to reboot the system or > could that cause the device to get corrupted if it's hung in the middle > of some important operation? You could try increasing the size of the stripe cache echo 512 > /sys/block/mdX/md/stripe_cache_size (choose and appropriate 'X'). Maybe check the content of /sys/block/mdX/md/stripe_cache_active as well. Other than that, just reboot. The raid5 will do a resync, but the data should be fine. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-03-13 23:17 ` Neil Brown @ 2006-05-11 15:13 ` dean gaudet 2006-05-17 18:41 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-11 15:13 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 1927 bytes --] On Tue, 14 Mar 2006, Neil Brown wrote: > On Monday March 13, patrik@ucolick.org wrote: > > Hi all, > > > > I just experienced some kind of lockup accessing my 8-drive raid5 > > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but > > now processes that try to read the md device hang. ps tells me they are > > all sleeping in get_active_stripe. There is nothing in the syslog, and I > > can read from the individual drives fine with dd. mdadm says the state > > is "active". > > Hmmm... That's sad. That's going to be very hard to track down. > > If you could > echo t > /proc/sysrq-trigger > > and send me the dump that appears in the kernel log, I would > appreciate it. I doubt it will be very helpful, but it is the best > bet I can come up with. i seem to be running into this as well... it has happenned several times in the past three weeks. i attached the kernel log output... it's a debian 2.6.16 kernel, which is based mostly on 2.6.16.10. md4 : active raid5 sdd1[0] sde1[5](S) sdh1[4] sdg1[3] sdf1[2] sdc1[1] 1562834944 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU] bitmap: 3/187 pages [12KB], 1024KB chunk those drives are on 3w-xxxx (7508 controller). i'm using lvm2 and xfs as the filesystem (although i'm pretty sure an ext3 fs on another lv is hanging too -- but i forgot to check before i unwedged it). let me know if anything else is useful and i can try to catch it next time. > You could try increasing the size of the stripe cache > echo 512 > /sys/block/mdX/md/stripe_cache_size > (choose and appropriate 'X'). yeah that got things going again -- it took a minute or so maybe, i wasn't paying attention as to how fast things cleared up. > Maybe check the content of > /sys/block/mdX/md/stripe_cache_active > as well. next time i'll check this before i increase stripe_cache_size... it's 0 now, but the raid5 is working again... -dean [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 7944 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-11 15:13 ` dean gaudet @ 2006-05-17 18:41 ` dean gaudet 2006-05-19 3:46 ` Neil Brown 2006-05-23 2:33 ` Neil Brown 0 siblings, 2 replies; 43+ messages in thread From: dean gaudet @ 2006-05-17 18:41 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Thu, 11 May 2006, dean gaudet wrote: > On Tue, 14 Mar 2006, Neil Brown wrote: > > > On Monday March 13, patrik@ucolick.org wrote: > > > I just experienced some kind of lockup accessing my 8-drive raid5 > > > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but > > > now processes that try to read the md device hang. ps tells me they are > > > all sleeping in get_active_stripe. There is nothing in the syslog, and I > > > can read from the individual drives fine with dd. mdadm says the state > > > is "active". ... > > i seem to be running into this as well... it has happenned several times > in the past three weeks. i attached the kernel log output... it happenned again... same system as before... > > You could try increasing the size of the stripe cache > > echo 512 > /sys/block/mdX/md/stripe_cache_size > > (choose and appropriate 'X'). > > yeah that got things going again -- it took a minute or so maybe, i > wasn't paying attention as to how fast things cleared up. i tried 768 this time and it wasn't enough... 1024 did it again... > > > Maybe check the content of > > /sys/block/mdX/md/stripe_cache_active > > as well. > > next time i'll check this before i increase stripe_cache_size... it's > 0 now, but the raid5 is working again... here's a sequence of things i did... not sure if it helps: # cat /sys/block/md4/md/stripe_cache_active 435 # cat /sys/block/md4/md/stripe_cache_size 512 # echo 768 >/sys/block/md4/md/stripe_cache_size # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # echo 1024 >/sys/block/md4/md/stripe_cache_size # cat /sys/block/md4/md/stripe_cache_active 927 # cat /sys/block/md4/md/stripe_cache_active 151 # cat /sys/block/md4/md/stripe_cache_active 66 # cat /sys/block/md4/md/stripe_cache_active 2 # cat /sys/block/md4/md/stripe_cache_active 1 # cat /sys/block/md4/md/stripe_cache_active 0 # cat /sys/block/md4/md/stripe_cache_active 3 and it's OK again... except i'm going to lower the stripe_cache_size to 256 again because i'm not sure i want to keep having to double it each freeze :) let me know if you want the task dump output from this one too. -dean ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-17 18:41 ` dean gaudet @ 2006-05-19 3:46 ` Neil Brown 2006-05-23 2:33 ` Neil Brown 1 sibling, 0 replies; 43+ messages in thread From: Neil Brown @ 2006-05-19 3:46 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Wednesday May 17, dean@arctic.org wrote: > > let me know if you want the task dump output from this one too. > No thanks - I doubt it will containing anything helpful. I'll try to put some serious time into this next week - as soon as I get mdadm 2.5 out. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-17 18:41 ` dean gaudet 2006-05-19 3:46 ` Neil Brown @ 2006-05-23 2:33 ` Neil Brown 2006-05-26 16:07 ` dean gaudet 1 sibling, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-23 2:33 UTC (permalink / raw) To: dean gaudet; +Cc: Neil Brown, Patrik Jonsson, linux-raid [-- Attachment #1: message body text --] [-- Type: text/plain, Size: 1367 bytes --] On Wednesday May 17, dean@arctic.org wrote: > On Thu, 11 May 2006, dean gaudet wrote: > > > On Tue, 14 Mar 2006, Neil Brown wrote: > > > > > On Monday March 13, patrik@ucolick.org wrote: > > > > I just experienced some kind of lockup accessing my 8-drive raid5 > > > > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but > > > > now processes that try to read the md device hang. ps tells me they are > > > > all sleeping in get_active_stripe. There is nothing in the syslog, and I > > > > can read from the individual drives fine with dd. mdadm says the state > > > > is "active". > ... > > > > i seem to be running into this as well... it has happenned several times > > in the past three weeks. i attached the kernel log output... > > it happenned again... same system as before... > I've spent all morning looking at this and while I cannot see what is happening I did find a couple of small bugs, so that is good... I've attached three patches. The first fix two small bugs (I think). The last adds some extra information to /sys/block/mdX/md/stripe_cache_active They are against 2.6.16.11. If you could apply them and if the problem recurs, report the content of stripe_cache_active several times before and after changing it, just like you did last time, that might help throw some light on the situation. Thanks, NeilBrown [-- Attachment #2: 002MdR5PlugProblem --] [-- Type: text/plain, Size: 2331 bytes --] Status: ok Fix a plug/unplug race in raid5 When a device is unplugged, requests are moved from one or two (depending on whether a bitmap is in use) queues to the main request queue. So whenever requests are put on either of those queues, we should make sure the raid5 array is 'plugged'. However we don't. We currently plug the raid5 queue just before putting requests on queues, so there is room for a race. If something unplugs the queue at just the wrong time, requests will be left on the queue and nothing will want to unplug them. Normally something else will plug and unplug the queue fairly soon, but there is a risk that nothing will. 
Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid5.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-23 12:27:58.000000000 +1000 +++ ./drivers/md/raid5.c 2006-05-23 12:28:26.000000000 +1000 @@ -77,12 +77,14 @@ static void __release_stripe(raid5_conf_ if (atomic_read(&conf->active_stripes)==0) BUG(); if (test_bit(STRIPE_HANDLE, &sh->state)) { - if (test_bit(STRIPE_DELAYED, &sh->state)) + if (test_bit(STRIPE_DELAYED, &sh->state)) { list_add_tail(&sh->lru, &conf->delayed_list); - else if (test_bit(STRIPE_BIT_DELAY, &sh->state) && - conf->seq_write == sh->bm_seq) + blk_plug_device(conf->mddev->queue); + } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) && + conf->seq_write == sh->bm_seq) { list_add_tail(&sh->lru, &conf->bitmap_list); - else { + blk_plug_device(conf->mddev->queue); + } else { clear_bit(STRIPE_BIT_DELAY, &sh->state); list_add_tail(&sh->lru, &conf->handle_list); } @@ -1519,13 +1521,6 @@ static int raid5_issue_flush(request_que return ret; } -static inline void raid5_plug_device(raid5_conf_t *conf) -{ - spin_lock_irq(&conf->device_lock); - blk_plug_device(conf->mddev->queue); - spin_unlock_irq(&conf->device_lock); -} - static int make_request (request_queue_t *q, struct bio * bi) { mddev_t *mddev = q->queuedata; @@ -1577,7 +1572,6 @@ static int make_request (request_queue_t goto retry; } finish_wait(&conf->wait_for_overlap, &w); - raid5_plug_device(conf); handle_stripe(sh); release_stripe(sh); [-- Attachment #3: 003MdR5BitmapPlug --] [-- Type: text/plain, Size: 2966 bytes --] Status: ok Fix some small races in bitmap plugging in raid5. The comment gives more details, but I didn't quite have the sequencing write, so there was room for races to leave bits unset in the on-disk bitmap for short periods of time. Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid5.c | 30 +++++++++++++++++++++++++++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-23 12:28:26.000000000 +1000 +++ ./drivers/md/raid5.c 2006-05-23 12:28:53.000000000 +1000 @@ -15,6 +15,30 @@ * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ +/* + * BITMAP UNPLUGGING: + * + * The sequencing for updating the bitmap reliably is a little + * subtle (and I got it wrong the first time) so it deserves some + * explanation. + * + * We group bitmap updates into batches. Each batch has a number. + * We may write out several batches at once, but that isn't very important. + * conf->bm_write is the number of the last batch successfully written. + * conf->bm_flush is the number of the last batch that was closed to + * new additions. + * When we discover that we will need to write to any block in a stripe + * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq + * the number of the batch it will be in. This is bm_flush+1. + * When we are ready to do a write, if that batch hasn't been written yet, + * we plug the array and queue the stripe for later. + * When an unplug happens, we increment bm_flush, thus closing the current + * batch. + * When we notice that bm_flush > bm_write, we write out all pending updates + * to the bitmap, and advance bm_write to where bm_flush was. + * This may occasionally write a bit out twice, but is sure never to + * miss any bits. 
+ */ #include <linux/config.h> #include <linux/module.h> @@ -81,7 +105,7 @@ static void __release_stripe(raid5_conf_ list_add_tail(&sh->lru, &conf->delayed_list); blk_plug_device(conf->mddev->queue); } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) && - conf->seq_write == sh->bm_seq) { + sh->bm_seq - conf->seq_write > 0) { list_add_tail(&sh->lru, &conf->bitmap_list); blk_plug_device(conf->mddev->queue); } else { @@ -884,9 +908,9 @@ static int add_stripe_bio(struct stripe_ (unsigned long long)sh->sector, dd_idx); if (conf->mddev->bitmap && firstwrite) { - sh->bm_seq = conf->seq_write; bitmap_startwrite(conf->mddev->bitmap, sh->sector, STRIPE_SECTORS, 0); + sh->bm_seq = conf->seq_flush+1; set_bit(STRIPE_BIT_DELAY, &sh->state); } @@ -1692,7 +1716,7 @@ static void raid5d (mddev_t *mddev) while (1) { struct list_head *first; - if (conf->seq_flush - conf->seq_write > 0) { + if (conf->seq_flush != conf->seq_write) { int seq = conf->seq_flush; spin_unlock_irq(&conf->device_lock); bitmap_unplug(mddev->bitmap); [-- Attachment #4: 004MdR5Debug --] [-- Type: text/plain, Size: 1153 bytes --] Status: ok Add some debugging to track raid5 hang problem Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid5.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-23 12:28:53.000000000 +1000 +++ ./drivers/md/raid5.c 2006-05-23 12:29:05.000000000 +1000 @@ -1807,9 +1807,23 @@ static ssize_t stripe_cache_active_show(mddev_t *mddev, char *page) { raid5_conf_t *conf = mddev_to_conf(mddev); - if (conf) - return sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); - else + if (conf) { + int n; + int c1, c2; + struct list_head *l; + n = sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); + n += sprintf(page+n, "%d preread\n", atomic_read(&conf->preread_active_stripes)); + spin_lock_irq(&conf->device_lock); + c1=0; + list_for_each(l, &conf->bitmap_list) + c1++; + c2=0; + list_for_each(l, &conf->delayed_list) + c2++; + spin_unlock_irq(&conf->device_lock); + n += sprintf(page+n, "bitlist=%d delaylist=%d\n", c1, c2); + return n; + } else return 0; } ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-23 2:33 ` Neil Brown @ 2006-05-26 16:07 ` dean gaudet 2006-05-26 23:55 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-26 16:07 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Tue, 23 May 2006, Neil Brown wrote: > I've spent all morning looking at this and while I cannot see what is > happening I did find a couple of small bugs, so that is good... > > I've attached three patches. The first fix two small bugs (I think). > The last adds some extra information to > /sys/block/mdX/md/stripe_cache_active > > They are against 2.6.16.11. > > If you could apply them and if the problem recurs, report the content > of stripe_cache_active several times before and after changing it, > just like you did last time, that might help throw some light on the > situation. i applied them against 2.6.16.18 and two days later i got my first hang... below is the stripe_cache foo. thanks -dean neemlark:~# cd /sys/block/md4/md/ neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_size 256 neemlark:/sys/block/md4/md# echo 512 >stripe_cache_size neemlark:/sys/block/md4/md# cat stripe_cache_active 474 187 preread bitlist=0 delaylist=222 neemlark:/sys/block/md4/md# cat stripe_cache_active 438 222 preread bitlist=0 delaylist=72 neemlark:/sys/block/md4/md# cat stripe_cache_active 438 222 preread bitlist=0 delaylist=72 neemlark:/sys/block/md4/md# cat stripe_cache_active 469 222 preread bitlist=0 delaylist=72 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 72 preread bitlist=160 delaylist=103 neemlark:/sys/block/md4/md# cat stripe_cache_active 1 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 2 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 0 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 2 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# md4 : active raid5 sdd1[0] sde1[5](S) sdh1[4] sdg1[3] sdf1[2] sdc1[1] 1562834944 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU] bitmap: 10/187 pages [40KB], 1024KB chunk ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-26 16:07 ` dean gaudet @ 2006-05-26 23:55 ` Neil Brown 2006-05-27 0:28 ` dean gaudet 2006-05-27 15:42 ` dean gaudet 0 siblings, 2 replies; 43+ messages in thread From: Neil Brown @ 2006-05-26 23:55 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Friday May 26, dean@arctic.org wrote: > On Tue, 23 May 2006, Neil Brown wrote: > > i applied them against 2.6.16.18 and two days later i got my first hang... > below is the stripe_cache foo. > > thanks > -dean > > neemlark:~# cd /sys/block/md4/md/ > neemlark:/sys/block/md4/md# cat stripe_cache_active > 255 > 0 preread > bitlist=0 delaylist=255 > neemlark:/sys/block/md4/md# cat stripe_cache_active > 255 > 0 preread > bitlist=0 delaylist=255 > neemlark:/sys/block/md4/md# cat stripe_cache_active > 255 > 0 preread > bitlist=0 delaylist=255 Thanks. This narrows it down quite a bit... too much infact: I can now say for sure that this cannot possible happen :-) Two things that might be helpful: 1/ Do you have any other patches on 2.6.16.18 other than the 3 I sent you? If you do I'd like to see them, just in case. 2/ The message.gz you sent earlier with the echo t > /proc/sysrq-trigger trace in it didn't contain information about md4_raid5 - the controlling thread for that array. It must have missed out due to a buffer overflowing. Next time it happens, could you to get this trace again and see if you can find out what what md4_raid5 is going. Maybe do the 'echo t' several times. I think that you need a kernel recompile to make the dmesg buffer larger. Thanks for your patience - this must be very frustrating for you. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-26 23:55 ` Neil Brown @ 2006-05-27 0:28 ` dean gaudet 2006-05-27 15:42 ` dean gaudet 1 sibling, 0 replies; 43+ messages in thread From: dean gaudet @ 2006-05-27 0:28 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 2877 bytes --] On Sat, 27 May 2006, Neil Brown wrote: > On Friday May 26, dean@arctic.org wrote: > > On Tue, 23 May 2006, Neil Brown wrote: > > > > i applied them against 2.6.16.18 and two days later i got my first hang... > > below is the stripe_cache foo. > > > > thanks > > -dean > > > > neemlark:~# cd /sys/block/md4/md/ > > neemlark:/sys/block/md4/md# cat stripe_cache_active > > 255 > > 0 preread > > bitlist=0 delaylist=255 > > neemlark:/sys/block/md4/md# cat stripe_cache_active > > 255 > > 0 preread > > bitlist=0 delaylist=255 > > neemlark:/sys/block/md4/md# cat stripe_cache_active > > 255 > > 0 preread > > bitlist=0 delaylist=255 > > Thanks. This narrows it down quite a bit... too much infact: I can > now say for sure that this cannot possible happen :-) heheh. fwiw the box has traditionally been rock solid.. it's ancient though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 w/seagate 400GB disks... i really don't suspect the hardware all that much because the freeze seems to be rather consistent as to time of day (overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb going). unfortunately it doesn't happen every time... but every time i've unstuck the box i've noticed those processes going. other tidbits... md4 is a lvm2 PV ... there are two LVs, one with ext3 and one with xfs. > Two things that might be helpful: > 1/ Do you have any other patches on 2.6.16.18 other than the 3 I > sent you? If you do I'd like to see them, just in case. it was just 2.6.16.18 plus the 3 you sent... i attached the .config (it's rather full -- based off debian kernel .config). maybe there's a compiler bug: gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3) > 2/ The message.gz you sent earlier with the > echo t > /proc/sysrq-trigger > trace in it didn't contain information about md4_raid5 - the > controlling thread for that array. It must have missed out > due to a buffer overflowing. Next time it happens, could you > to get this trace again and see if you can find out what > what md4_raid5 is going. Maybe do the 'echo t' several times. > I think that you need a kernel recompile to make the dmesg > buffer larger. ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ... note that i'm going to include two more patches in this next kernel: http://lkml.org/lkml/2006/5/23/42 http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch the first was the Jens Axboe patch you mentioned here recently (for accounting with i/o barriers)... and the second gets rid of the tcp treason uncloaked messages. > Thanks for your patience - this must be very frustrating for you. fortunately i'm the primary user of this box... and the bug doesn't corrupt anything... and i can unstick it easily :) so it's not all that frustrating actually. -dean [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 17866 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-26 23:55 ` Neil Brown 2006-05-27 0:28 ` dean gaudet @ 2006-05-27 15:42 ` dean gaudet 2006-05-28 12:04 ` Neil Brown 1 sibling, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-27 15:42 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 2878 bytes --] On Sat, 27 May 2006, Neil Brown wrote: > Thanks. This narrows it down quite a bit... too much infact: I can > now say for sure that this cannot possible happen :-) > > 2/ The message.gz you sent earlier with the > echo t > /proc/sysrq-trigger > trace in it didn't contain information about md4_raid5 - the got another hang again this morning... full dmesg output attached. -dean neemlark:/sys/block/md4/md# cat stripe_cache_active 248 0 preread bitlist=0 delaylist=248 neemlark:/sys/block/md4/md# cat stripe_cache_active 248 0 preread bitlist=0 delaylist=248 neemlark:/sys/block/md4/md# cat stripe_cache_active 248 0 preread bitlist=0 delaylist=248 neemlark:/sys/block/md4/md# cat stripe_cache_size 256 neemlark:/sys/block/md4/md# echo 512 >!$ echo 512 >stripe_cache_size neemlark:/sys/block/md4/md# cat stripe_cache_active 511 254 preread bitlist=0 delaylist=199 neemlark:/sys/block/md4/md# cat stripe_cache_active 511 148 preread bitlist=0 delaylist=199 neemlark:/sys/block/md4/md# cat stripe_cache_active 435 95 preread bitlist=0 delaylist=199 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 11 preread bitlist=0 delaylist=327 neemlark:/sys/block/md4/md# cat stripe_cache_active 511 11 preread bitlist=0 delaylist=327 neemlark:/sys/block/md4/md# cat stripe_cache_active 494 359 preread bitlist=0 delaylist=127 neemlark:/sys/block/md4/md# cat stripe_cache_active 191 67 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 272 preread bitlist=0 delaylist=175 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 32 preread bitlist=0 delaylist=317 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 398 preread bitlist=0 delaylist=114 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 398 preread bitlist=0 delaylist=114 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 398 preread bitlist=0 delaylist=114 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 17 preread bitlist=0 delaylist=265 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 17 preread bitlist=0 delaylist=265 neemlark:/sys/block/md4/md# cat stripe_cache_active 442 124 preread bitlist=0 delaylist=3 neemlark:/sys/block/md4/md# cat stripe_cache_active 127 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 154 preread bitlist=0 delaylist=235 neemlark:/sys/block/md4/md# cat stripe_cache_active 389 321 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 162 preread bitlist=0 delaylist=133 neemlark:/sys/block/md4/md# cat stripe_cache_active 385 24 preread bitlist=0 delaylist=142 neemlark:/sys/block/md4/md# cat stripe_cache_active 109 3 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 0 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 0 0 preread bitlist=0 delaylist=0 [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 29294 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-27 15:42 ` dean gaudet @ 2006-05-28 12:04 ` Neil Brown 2006-05-29 20:56 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-28 12:04 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Saturday May 27, dean@arctic.org wrote: > On Sat, 27 May 2006, Neil Brown wrote: > > > Thanks. This narrows it down quite a bit... too much infact: I can > > now say for sure that this cannot possible happen :-) > > > > 2/ The message.gz you sent earlier with the > > echo t > /proc/sysrq-trigger > > trace in it didn't contain information about md4_raid5 - the > > got another hang again this morning... full dmesg output attached. > Thanks. Nothing surprising there, which maybe is a surprise itself... I'm still somewhat stumped by this. But given that it is nicely repeatable, I'm sure we can get there... The following patch adds some more tracing to raid5, and might fix a subtle bug in ll_rw_blk, though it is an incredible long shot that this could be affecting raid5 (if it is, I'll have to assume there is another bug somewhere). It certainly doesn't break ll_rw_blk. Whether it actually fixes something I'm not sure. If you could try with these on top of the previous patches I'd really appreciate it. When you read from ..../stripe_cache_active, it should trigger a (cryptic) kernel message within the next 15 seconds. If I could get the contents of that file and the kernel messages, that should help. Thanks heaps, NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./block/ll_rw_blk.c | 4 ++-- ./drivers/md/raid5.c | 18 ++++++++++++++++++ 2 files changed, 20 insertions(+), 2 deletions(-) diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c --- ./block/ll_rw_blk.c~current~ 2006-05-28 21:54:23.000000000 +1000 +++ ./block/ll_rw_blk.c 2006-05-28 21:55:17.000000000 +1000 @@ -874,7 +874,7 @@ static void __blk_queue_free_tags(reques } q->queue_tags = NULL; - q->queue_flags &= ~(1 << QUEUE_FLAG_QUEUED); + clear_bit(QUEUE_FLAG_QUEUED, &q->queue_flags); } /** @@ -963,7 +963,7 @@ int blk_queue_init_tags(request_queue_t * assign it, all done */ q->queue_tags = tags; - q->queue_flags |= (1 << QUEUE_FLAG_QUEUED); + set_bit(QUEUE_FLAG_QUEUED, &q->queue_flags); return 0; fail: kfree(tags); diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-27 09:17:10.000000000 +1000 +++ ./drivers/md/raid5.c 2006-05-28 21:56:56.000000000 +1000 @@ -1701,13 +1701,20 @@ static sector_t sync_request(mddev_t *md * During the scan, completed stripes are saved for us by the interrupt * handler, so that they will not have to wait for our next wakeup. 
*/ +static unsigned long trigger; + static void raid5d (mddev_t *mddev) { struct stripe_head *sh; raid5_conf_t *conf = mddev_to_conf(mddev); int handled; + int trace = 0; PRINTK("+++ raid5d active\n"); + if (test_and_clear_bit(0, &trigger)) + trace = 1; + if (trace) + printk("raid5d runs\n"); md_check_recovery(mddev); @@ -1725,6 +1732,13 @@ static void raid5d (mddev_t *mddev) activate_bit_delay(conf); } + if (trace) + printk(" le=%d, pas=%d, bqp=%d le=%d\n", + list_empty(&conf->handle_list), + atomic_read(&conf->preread_active_stripes), + blk_queue_plugged(mddev->queue), + list_empty(&conf->delayed_list)); + if (list_empty(&conf->handle_list) && atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD && !blk_queue_plugged(mddev->queue) && @@ -1756,6 +1770,8 @@ static void raid5d (mddev_t *mddev) unplug_slaves(mddev); PRINTK("--- raid5d inactive\n"); + if (trace) + printk("raid5d done\n"); } static ssize_t @@ -1813,6 +1829,7 @@ stripe_cache_active_show(mddev_t *mddev, struct list_head *l; n = sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); n += sprintf(page+n, "%d preread\n", atomic_read(&conf->preread_active_stripes)); + n += sprintf(page+n, "%splugged\n", blk_queue_plugged(mddev->queue)?"":"not "); spin_lock_irq(&conf->device_lock); c1=0; list_for_each(l, &conf->bitmap_list) @@ -1822,6 +1839,7 @@ stripe_cache_active_show(mddev_t *mddev, c2++; spin_unlock_irq(&conf->device_lock); n += sprintf(page+n, "bitlist=%d delaylist=%d\n", c1, c2); + trigger = 0xffff; return n; } else return 0; ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-28 12:04 ` Neil Brown @ 2006-05-29 20:56 ` dean gaudet 2006-05-29 23:51 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-29 20:56 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 5820 bytes --] On Sun, 28 May 2006, Neil Brown wrote: > The following patch adds some more tracing to raid5, and might fix a > subtle bug in ll_rw_blk, though it is an incredible long shot that > this could be affecting raid5 (if it is, I'll have to assume there is > another bug somewhere). It certainly doesn't break ll_rw_blk. > Whether it actually fixes something I'm not sure. > > If you could try with these on top of the previous patches I'd really > appreciate it. > > When you read from ..../stripe_cache_active, it should trigger a > (cryptic) kernel message within the next 15 seconds. If I could get > the contents of that file and the kernel messages, that should help. got the hang again... attached is the dmesg with the cryptic messages. i didn't think to grab the task dump this time though. hope there's a clue in this one :) but send me another patch if you need more data. -dean neemlark:/sys/block/md4/md# cat stripe_cache_size 256 neemlark:/sys/block/md4/md# cat stripe_cache_active 251 0 preread plugged bitlist=0 delaylist=251 neemlark:/sys/block/md4/md# cat stripe_cache_active 251 0 preread plugged bitlist=0 delaylist=251 neemlark:/sys/block/md4/md# echo 512 >stripe_cache_size neemlark:/sys/block/md4/md# cat stripe_cache_active 512 292 preread not plugged bitlist=0 delaylist=32 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 292 preread not plugged bitlist=0 delaylist=32 neemlark:/sys/block/md4/md# cat stripe_cache_active 445 0 preread not plugged bitlist=0 delaylist=73 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 413 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 13 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 493 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 487 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 405 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 1 preread not plugged bitlist=0 delaylist=28 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 84 preread not plugged bitlist=0 delaylist=69 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 69 preread not plugged bitlist=0 delaylist=56 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 41 preread not plugged bitlist=0 delaylist=38 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 10 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 453 3 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 14 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 477 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 476 0 preread not plugged bitlist=0 delaylist=0 
neemlark:/sys/block/md4/md# cat stripe_cache_active 486 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 384 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 387 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 462 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 448 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 501 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 476 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 416 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 386 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 434 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 406 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 447 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 501 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 448 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 416 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 416 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 380 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 1 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 353 0 preread not plugged bitlist=0 delaylist=0 [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 6698 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-29 20:56 ` dean gaudet @ 2006-05-29 23:51 ` Neil Brown 2006-05-31 0:15 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-29 23:51 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Monday May 29, dean@arctic.org wrote: > > hope there's a clue in this one :) but send me another patch if you need > more data. Thanks. This confirms that the device is 'plugged' - which I knew had to be the case, but equally knew that it couldn't be the case :-) Whenever the device gets plugged a 3msec timer is set and when the timer fires, the device gets unplugged. So it cannot possibly stay plugged for more than 3 msecs. Yet obviously it does. I don't think the timer code can be going wrong, as it is very widely used and if there was a problem I'm sure it would have been noticed by now. Besides I've checked it and it looks good - but that doesn't seem to prove anything :-( Another possibility is another processor doing q->queue_flags |= (1 << some_flag); at the same time that the timer does clear_bit(queue_plugged, &q->queue_flags); That could cause the clearing of the bit to be lost. But I don't think that happens, certainly not after the last patch I gave you. I now realise I should have got that cryptic printk to print the result of timer_pending(&mddev->queue->unplug_timer); but I'm fairly sure it would have said '0' which would leave me equally in the dark. Maybe you have bad memory with one bit that doesn't stay set (or clear) properly, and that bit happen to always line up with the QUEUE_FLAG_PLUGGED bit for this array.... Ok, that's impossible too, especially as Patrik reported the same problem! (stares at the code lots more, goes down several blind alleys...) Well.... maybe..... There does seem to be a small hole in the chain that leads from a queue being plugged to it be unplugged again. I'm not convinced that the race can actually be lost, but obviously something fairly unbelievable is happening... Could you try this patch please? On top of the rest. And if it doesn't fail in a couple of days, tell me how regularly the message kblockd_schedule_work failed gets printed. Thanks, NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./block/ll_rw_blk.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c --- ./block/ll_rw_blk.c~current~ 2006-05-30 09:48:02.000000000 +1000 +++ ./block/ll_rw_blk.c 2006-05-30 09:48:48.000000000 +1000 @@ -1636,7 +1636,11 @@ static void blk_unplug_timeout(unsigned { request_queue_t *q = (request_queue_t *)data; - kblockd_schedule_work(&q->unplug_work); + if (!kblockd_schedule_work(&q->unplug_work)) { + /* failed to schedule the work, try again later */ + printk("kblockd_schedule_work failed\n"); + mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); + } } /** ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-29 23:51 ` Neil Brown @ 2006-05-31 0:15 ` dean gaudet 2006-05-31 0:46 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-31 0:15 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Tue, 30 May 2006, Neil Brown wrote: > Could you try this patch please? On top of the rest. > And if it doesn't fail in a couple of days, tell me how regularly the > message > kblockd_schedule_work failed > gets printed. i'm running this patch now ... and just after reboot, no freeze yet, i've already seen a handful of these: May 30 17:05:09 localhost kernel: kblockd_schedule_work failed May 30 17:05:59 localhost kernel: kblockd_schedule_work failed May 30 17:08:16 localhost kernel: kblockd_schedule_work failed May 30 17:10:51 localhost kernel: kblockd_schedule_work failed May 30 17:11:51 localhost kernel: kblockd_schedule_work failed May 30 17:12:46 localhost kernel: kblockd_schedule_work failed May 30 17:14:14 localhost kernel: kblockd_schedule_work failed -dean > > Thanks, > NeilBrown > > > Signed-off-by: Neil Brown <neilb@suse.de> > > ### Diffstat output > ./block/ll_rw_blk.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c > --- ./block/ll_rw_blk.c~current~ 2006-05-30 09:48:02.000000000 +1000 > +++ ./block/ll_rw_blk.c 2006-05-30 09:48:48.000000000 +1000 > @@ -1636,7 +1636,11 @@ static void blk_unplug_timeout(unsigned > { > request_queue_t *q = (request_queue_t *)data; > > - kblockd_schedule_work(&q->unplug_work); > + if (!kblockd_schedule_work(&q->unplug_work)) { > + /* failed to schedule the work, try again later */ > + printk("kblockd_schedule_work failed\n"); > + mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); > + } > } > > /** > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-31 0:15 ` dean gaudet @ 2006-05-31 0:46 ` Neil Brown 2006-05-31 1:03 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-31 0:46 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Tuesday May 30, dean@arctic.org wrote: > On Tue, 30 May 2006, Neil Brown wrote: > > > Could you try this patch please? On top of the rest. > > And if it doesn't fail in a couple of days, tell me how regularly the > > message > > kblockd_schedule_work failed > > gets printed. > > i'm running this patch now ... and just after reboot, no freeze yet, i've > already seen a handful of these: > > May 30 17:05:09 localhost kernel: kblockd_schedule_work failed > May 30 17:05:59 localhost kernel: kblockd_schedule_work failed > May 30 17:08:16 localhost kernel: kblockd_schedule_work failed > May 30 17:10:51 localhost kernel: kblockd_schedule_work failed > May 30 17:11:51 localhost kernel: kblockd_schedule_work failed > May 30 17:12:46 localhost kernel: kblockd_schedule_work failed > May 30 17:14:14 localhost kernel: kblockd_schedule_work failed 1 every minute or so. That's probably more than I would have expected, but strongly lends evidence to the theory that this is the problem. I certainly wouldn't expect a failure every time kblockd_schedule_work failed (in the original code), but the fact that it does fail sometimes means there is a possible race which can cause the failure that experienced. So I am optimistic that the patch will have fixed the problem. Please let me know when you reach an uptime of 3 days. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-31 0:46 ` Neil Brown @ 2006-05-31 1:03 ` dean gaudet 2006-05-31 1:38 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-31 1:03 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Wed, 31 May 2006, Neil Brown wrote: > On Tuesday May 30, dean@arctic.org wrote: > > On Tue, 30 May 2006, Neil Brown wrote: > > > > > Could you try this patch please? On top of the rest. > > > And if it doesn't fail in a couple of days, tell me how regularly the > > > message > > > kblockd_schedule_work failed > > > gets printed. > > > > i'm running this patch now ... and just after reboot, no freeze yet, i've > > already seen a handful of these: > > > > May 30 17:05:09 localhost kernel: kblockd_schedule_work failed > > May 30 17:05:59 localhost kernel: kblockd_schedule_work failed > > May 30 17:08:16 localhost kernel: kblockd_schedule_work failed > > May 30 17:10:51 localhost kernel: kblockd_schedule_work failed > > May 30 17:11:51 localhost kernel: kblockd_schedule_work failed > > May 30 17:12:46 localhost kernel: kblockd_schedule_work failed > > May 30 17:14:14 localhost kernel: kblockd_schedule_work failed > > 1 every minute or so. That's probably more than I would have > expected, but strongly lends evidence to the theory that this is the > problem. actually i think the rate is higher... i'm not sure why, but klogd doesn't seem to keep up with it: root@neemlark:~# grep -c kblockd_schedule_work /var/log/messages 31 root@neemlark:~# dmesg | grep -c kblockd_schedule_work 8192 i don't have CONFIG_PRINTK_TIME=y ... so i can't read timestamps from dmesg. but cool! if the dmesg spam seems to be a problem i can just comment it out of the patch... i'll let you know if it freezes again. thanks -dean ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-31 1:03 ` dean gaudet @ 2006-05-31 1:38 ` Neil Brown 2006-05-31 1:46 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-31 1:38 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Tuesday May 30, dean@arctic.org wrote: > > actually i think the rate is higher... i'm not sure why, but klogd doesn't > seem to keep up with it: > > root@neemlark:~# grep -c kblockd_schedule_work /var/log/messages > 31 > root@neemlark:~# dmesg | grep -c kblockd_schedule_work > 8192 # grep 'last message repeated' /var/log/messages ?? Obviously even faster than I thought. I guess workqueue threads must take a while to get scheduled... I'm beginning to wonder if I really have found the bug after all :-( I'll look forward to the results either way. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-31 1:38 ` Neil Brown @ 2006-05-31 1:46 ` dean gaudet 2006-06-01 2:51 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-31 1:46 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Wed, 31 May 2006, Neil Brown wrote: > On Tuesday May 30, dean@arctic.org wrote: > > > > actually i think the rate is higher... i'm not sure why, but klogd doesn't > > seem to keep up with it: > > > > root@neemlark:~# grep -c kblockd_schedule_work /var/log/messages > > 31 > > root@neemlark:~# dmesg | grep -c kblockd_schedule_work > > 8192 > > # grep 'last message repeated' /var/log/messages > ?? um hi, of course :) the paste below is approximately correct. -dean root@neemlark:~# egrep 'kblockd_schedule_work|last message repeated' /var/log/messages May 30 17:05:09 localhost kernel: kblockd_schedule_work failed May 30 17:05:59 localhost kernel: kblockd_schedule_work failed May 30 17:08:16 localhost kernel: kblockd_schedule_work failed May 30 17:10:51 localhost kernel: kblockd_schedule_work failed May 30 17:11:51 localhost kernel: kblockd_schedule_work failed May 30 17:12:46 localhost kernel: kblockd_schedule_work failed May 30 17:12:56 localhost last message repeated 22 times May 30 17:14:14 localhost kernel: kblockd_schedule_work failed May 30 17:16:57 localhost kernel: kblockd_schedule_work failed May 30 17:17:00 localhost last message repeated 83 times May 30 17:17:02 localhost kernel: kblockd_schedule_work failed May 30 17:17:33 localhost last message repeated 950 times May 30 17:18:34 localhost last message repeated 2218 times May 30 17:19:35 localhost last message repeated 1581 times May 30 17:20:01 localhost last message repeated 579 times May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost last message repeated 23 times May 30 17:20:03 localhost kernel: kblockd_schedule_work failed May 30 17:20:34 localhost last message repeated 1058 times May 30 17:21:35 localhost last message repeated 2171 times May 30 17:22:36 localhost last message repeated 2305 times May 30 17:23:37 localhost last message repeated 2311 times May 30 17:24:38 localhost last message repeated 1993 times May 30 17:25:01 localhost last message repeated 702 times May 30 17:25:02 localhost kernel: kblockd_schedule_work failed May 30 17:25:02 localhost last message repeated 15 times May 30 17:25:02 localhost kernel: kblockd_schedule_work failed May 30 17:25:02 localhost last message repeated 12 times May 30 17:25:03 localhost kernel: kblockd_schedule_work failed May 30 17:25:34 localhost last message repeated 1061 times May 30 17:26:35 localhost last message repeated 2009 times May 30 17:27:36 localhost last message repeated 1941 times May 30 17:28:37 localhost last message repeated 2345 times May 30 17:29:38 localhost last message repeated 2367 times May 30 17:30:01 localhost last message repeated 870 times May 30 17:30:01 localhost kernel: kblockd_schedule_work failed May 30 17:30:01 localhost last message repeated 45 times May 30 17:30:02 localhost kernel: kblockd_schedule_work failed May 30 17:30:33 localhost last message repeated 1180 times May 30 17:31:34 localhost last message repeated 2062 times May 30 17:32:34 localhost last message repeated 2277 times May 30 17:32:36 localhost kernel: kblockd_schedule_work failed May 30 17:33:07 localhost last message repeated 1114 times May 
30 17:34:08 localhost last message repeated 2308 times May 30 17:35:01 localhost last message repeated 1941 times May 30 17:35:01 localhost kernel: kblockd_schedule_work failed May 30 17:35:02 localhost last message repeated 20 times May 30 17:35:02 localhost kernel: kblockd_schedule_work failed May 30 17:35:33 localhost last message repeated 1051 times May 30 17:36:34 localhost last message repeated 2002 times May 30 17:37:35 localhost last message repeated 1644 times May 30 17:38:36 localhost last message repeated 1731 times May 30 17:39:37 localhost last message repeated 1844 times May 30 17:40:01 localhost last message repeated 817 times May 30 17:40:02 localhost kernel: kblockd_schedule_work failed May 30 17:40:02 localhost last message repeated 39 times May 30 17:40:02 localhost kernel: kblockd_schedule_work failed May 30 17:40:02 localhost last message repeated 12 times May 30 17:40:03 localhost kernel: kblockd_schedule_work failed May 30 17:40:34 localhost last message repeated 1051 times May 30 17:41:35 localhost last message repeated 1576 times May 30 17:42:36 localhost last message repeated 2000 times May 30 17:43:37 localhost last message repeated 2058 times May 30 17:44:15 localhost last message repeated 1337 times May 30 17:44:15 localhost kernel: kblockd_schedule_work failed May 30 17:44:46 localhost last message repeated 1016 times May 30 17:45:01 localhost last message repeated 432 times May 30 17:45:02 localhost kernel: kblockd_schedule_work failed May 30 17:45:02 localhost kernel: kblockd_schedule_work failed May 30 17:45:33 localhost last message repeated 1229 times May 30 17:46:34 localhost last message repeated 2552 times May 30 17:47:36 localhost last message repeated 2126 times May 30 17:48:37 localhost last message repeated 2348 times May 30 17:49:38 localhost last message repeated 1843 times May 30 17:50:00 localhost last message repeated 796 times May 30 17:50:01 localhost kernel: kblockd_schedule_work failed May 30 17:50:32 localhost last message repeated 1238 times May 30 17:51:33 localhost last message repeated 2374 times May 30 17:52:34 localhost last message repeated 2366 times May 30 17:53:35 localhost last message repeated 2195 times May 30 17:54:36 localhost last message repeated 2345 times May 30 17:55:01 localhost last message repeated 790 times May 30 17:55:02 localhost kernel: kblockd_schedule_work failed May 30 17:55:02 localhost last message repeated 49 times May 30 17:55:02 localhost kernel: kblockd_schedule_work failed May 30 17:55:02 localhost last message repeated 13 times May 30 17:55:03 localhost kernel: kblockd_schedule_work failed May 30 17:55:34 localhost last message repeated 1070 times May 30 17:56:35 localhost last message repeated 2278 times May 30 17:57:37 localhost last message repeated 2459 times May 30 17:58:38 localhost last message repeated 2076 times May 30 17:59:39 localhost last message repeated 1719 times May 30 18:00:01 localhost last message repeated 654 times May 30 18:00:03 localhost kernel: kblockd_schedule_work failed May 30 18:00:34 localhost last message repeated 868 times May 30 18:01:35 localhost last message repeated 1765 times May 30 18:02:33 localhost last message repeated 1759 times May 30 18:02:35 localhost kernel: kblockd_schedule_work failed May 30 18:02:35 localhost last message repeated 9 times May 30 18:02:35 localhost kernel: kblockd_schedule_work failed May 30 18:02:35 localhost kernel: kblockd_schedule_work failed May 30 18:02:35 localhost last message repeated 9 times May 30 18:02:35 localhost kernel: 
kblockd_schedule_work failed May 30 18:02:35 localhost last message repeated 4 times May 30 18:02:36 localhost kernel: kblockd_schedule_work failed May 30 18:02:36 localhost last message repeated 4 times May 30 18:02:36 localhost kernel: kblockd_schedule_work failed May 30 18:02:36 localhost last message repeated 10 times May 30 18:02:36 localhost kernel: kblockd_schedule_work failed May 30 18:02:36 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost last message repeated 7 times May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:03:08 localhost last message repeated 842 times May 30 18:04:09 localhost last message repeated 1828 times May 30 18:04:15 localhost last message repeated 221 times May 30 18:04:15 localhost kernel: kblockd_schedule_work failed May 30 18:04:46 localhost last message repeated 1042 times May 30 18:05:00 localhost last message repeated 551 times May 30 18:05:01 localhost kernel: kblockd_schedule_work failed May 30 18:05:32 localhost last message repeated 919 times May 30 18:06:33 localhost last message repeated 2608 times May 30 18:07:34 localhost last message repeated 2412 times May 30 18:08:35 localhost last message repeated 2536 times May 30 18:09:36 localhost last message repeated 2441 times May 30 18:10:01 localhost last message repeated 812 times May 30 18:10:01 localhost kernel: kblockd_schedule_work failed May 30 18:10:01 localhost last message repeated 5 times May 30 18:10:01 localhost kernel: kblockd_schedule_work failed May 30 18:10:01 localhost last message repeated 16 times May 30 18:10:02 localhost kernel: kblockd_schedule_work failed May 30 18:10:33 localhost last message repeated 1311 times May 30 18:11:34 localhost last message repeated 2396 times May 30 18:11:45 localhost last message repeated 328 times May 30 18:11:47 localhost kernel: kblockd_schedule_work failed May 30 18:12:18 localhost last message repeated 1470 times May 30 18:13:19 localhost last message repeated 2457 times May 30 18:14:20 localhost last message repeated 2275 times May 30 18:15:01 localhost last message repeated 1577 times May 30 18:15:01 localhost kernel: kblockd_schedule_work failed May 30 18:15:01 localhost last message repeated 6 times May 30 18:15:02 localhost kernel: kblockd_schedule_work failed May 30 18:15:33 localhost last message repeated 1241 times May 30 18:16:37 localhost last message repeated 2487 times May 30 18:17:01 localhost last message repeated 826 times May 30 18:17:02 localhost kernel: kblockd_schedule_work failed May 30 18:17:33 localhost last message repeated 1386 times May 30 18:18:34 localhost last message repeated 2195 times May 30 18:19:29 localhost last message repeated 1021 times May 30 18:19:55 localhost last message repeated 62 times May 30 18:20:04 localhost kernel: kblockd_schedule_work failed May 30 18:20:39 localhost last message repeated 105 times May 30 18:21:34 localhost last message repeated 189 times May 30 18:22:45 localhost last message repeated 104 times May 30 18:23:40 localhost last message repeated 170 times May 30 18:24:15 localhost last message repeated 120 times May 30 18:24:45 localhost kernel: kblockd_schedule_work failed May 30 18:24:49 localhost last message repeated 13 
times May 30 18:25:19 localhost kernel: kblockd_schedule_work failed May 30 18:25:25 localhost last message repeated 138 times May 30 18:26:49 localhost last message repeated 113 times May 30 18:27:35 localhost last message repeated 339 times May 30 18:28:50 localhost last message repeated 108 times May 30 18:29:35 localhost last message repeated 143 times May 30 18:30:05 localhost kernel: kblockd_schedule_work failed May 30 18:30:40 localhost last message repeated 55 times May 30 18:31:50 localhost last message repeated 313 times May 30 18:32:30 localhost last message repeated 236 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost last message repeated 8 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost last message repeated 11 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost last message repeated 5 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost last message repeated 3 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:50 localhost kernel: kblockd_schedule_work failed May 30 18:33:15 localhost last message repeated 91 times May 30 18:34:15 localhost last message repeated 86 times May 30 18:34:50 localhost last message repeated 133 times May 30 18:35:20 localhost kernel: kblockd_schedule_work failed May 30 18:35:50 localhost last message repeated 245 times May 30 18:36:35 localhost last message repeated 190 times May 30 18:37:50 localhost last message repeated 372 times May 30 18:38:25 localhost last message repeated 27 times May 30 18:39:40 localhost last message repeated 188 times May 30 18:40:00 localhost last message repeated 129 times May 30 18:40:35 localhost kernel: kblockd_schedule_work failed May 30 18:41:10 localhost last message repeated 76 times May 30 18:42:20 localhost last message repeated 186 times May 30 18:43:00 localhost last message repeated 154 times ^ permalink raw reply [flat|nested] 43+ messages in thread
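The "last message repeated N times" lines carry the counts that klogd suppressed, so a rough way to total them up is a one-liner over the same log (path and message text exactly as in the paste above; the result is still only approximate, because the summaries themselves are emitted at intervals):

  egrep 'kblockd_schedule_work|last message repeated' /var/log/messages |
    awk '/last message repeated/ { n += $(NF-1); next }
         /kblockd_schedule_work/ { n++ }
         END { print n, "messages in total" }'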
* Re: raid5 hang on get_active_stripe 2006-05-31 1:46 ` dean gaudet @ 2006-06-01 2:51 ` Neil Brown 2006-06-02 7:10 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread
From: Neil Brown @ 2006-06-01 2:51 UTC (permalink / raw)
To: dean gaudet; +Cc: Patrik Jonsson, linux-raid

On Tuesday May 30, dean@arctic.org wrote:
> On Wed, 31 May 2006, Neil Brown wrote:
>
> > On Tuesday May 30, dean@arctic.org wrote:
> > >
> > > actually i think the rate is higher... i'm not sure why, but klogd doesn't
> > > seem to keep up with it:
> > >
> > > root@neemlark:~# grep -c kblockd_schedule_work /var/log/messages
> > > 31
> > > root@neemlark:~# dmesg | grep -c kblockd_schedule_work
> > > 8192
> >
> > # grep 'last message repeated' /var/log/messages
> > ??
>
> um hi, of course :) the paste below is approximately correct.
>

Ok, so thousands of times a second. Not what I expected at all...

So I read the code more carefully, and can see that patch was all wrong.
It might actually make the problem go away, but isn't really the right fix.
I think I will be able to suggest a fix which should work and be a
reasonable thing to do, but it still won't help me understand the problem.

I've got one more long-shot I would like to try first. If you could
backout that change to ll_rw_block, and apply this patch instead.
Then when it hangs, just cat the stripe_cache_active file and see if
that unplugs things or not (cat it a few times).

If it does, I think I am on to the problem. If it doesn't I'll give up
and give you a patch that should work around the problem in a
reasonable way.

Thanks,
NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c | 3 +++
 1 file changed, 3 insertions(+)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~ 2006-05-28 21:56:56.000000000 +1000
+++ ./drivers/md/raid5.c 2006-06-01 12:44:52.000000000 +1000
@@ -1840,6 +1840,9 @@ stripe_cache_active_show(mddev_t *mddev,
 		spin_unlock_irq(&conf->device_lock);
 		n += sprintf(page+n, "bitlist=%d delaylist=%d\n", c1, c2);
 		trigger = 0xffff;
+
+		/* incredible long shot... */
+		kblockd_flush();
 		return n;
 	} else
 		return 0;

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-01 2:51 ` Neil Brown @ 2006-06-02 7:10 ` dean gaudet 2006-06-02 7:27 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-06-02 7:10 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Thu, 1 Jun 2006, Neil Brown wrote: > I've got one more long-shot I would like to try first. If you could > backout that change to ll_rw_block, and apply this patch instead. > Then when it hangs, just cat the stripe_cache_active file and see if > that unplugs things or not (cat it a few times). nope that didn't unstick it... i had to raise stripe_cache_size (from 256 to 768... 512 wasn't enough)... -dean ^ permalink raw reply [flat|nested] 43+ messages in thread
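For reference, the stripe cache is resized through sysfs at run time, so the change dean describes looks something like the following (md0 stands in for the real array, and the memory arithmetic assumes a 4-disk array):

  cat /sys/block/md0/md/stripe_cache_size          # default is 256 entries
  echo 768 > /sys/block/md0/md/stripe_cache_size
  # each entry holds one 4K page per member device, so on a 4-disk array
  # 768 entries pin roughly 768 * 4K * 4 = 12 MB of kernel memory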
* Re: raid5 hang on get_active_stripe 2006-06-02 7:10 ` dean gaudet @ 2006-06-02 7:27 ` Neil Brown 2006-06-10 19:49 ` dean gaudet 2006-06-13 18:53 ` Bill Davidsen 0 siblings, 2 replies; 43+ messages in thread From: Neil Brown @ 2006-06-02 7:27 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Friday June 2, dean@arctic.org wrote: > On Thu, 1 Jun 2006, Neil Brown wrote: > > > I've got one more long-shot I would like to try first. If you could > > backout that change to ll_rw_block, and apply this patch instead. > > Then when it hangs, just cat the stripe_cache_active file and see if > > that unplugs things or not (cat it a few times). > > nope that didn't unstick it... i had to raise stripe_cache_size (from 256 > to 768... 512 wasn't enough)... > > -dean Ok, thanks. I still don't know what is really going on, but I'm 99.9863% sure this will fix it, and is a reasonable thing to do. (Yes, I lose a ';'. That is deliberate). Please let me know what this proves, and thanks again for your patience. NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid5.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-28 21:56:56.000000000 +1000 +++ ./drivers/md/raid5.c 2006-06-02 17:24:07.000000000 +1000 @@ -285,7 +285,7 @@ static struct stripe_head *get_active_st < (conf->max_nr_stripes *3/4) || !conf->inactive_blocked), conf->device_lock, - unplug_slaves(conf->mddev); + raid5_unplug_device(conf->mddev->queue) ); conf->inactive_blocked = 0; } else ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-02 7:27 ` Neil Brown @ 2006-06-10 19:49 ` dean gaudet 2006-06-13 18:53 ` Bill Davidsen 1 sibling, 0 replies; 43+ messages in thread From: dean gaudet @ 2006-06-10 19:49 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Fri, 2 Jun 2006, Neil Brown wrote: > On Friday June 2, dean@arctic.org wrote: > > On Thu, 1 Jun 2006, Neil Brown wrote: > > > > > I've got one more long-shot I would like to try first. If you could > > > backout that change to ll_rw_block, and apply this patch instead. > > > Then when it hangs, just cat the stripe_cache_active file and see if > > > that unplugs things or not (cat it a few times). > > > > nope that didn't unstick it... i had to raise stripe_cache_size (from 256 > > to 768... 512 wasn't enough)... > > > > -dean > > Ok, thanks. > I still don't know what is really going on, but I'm 99.9863% sure this > will fix it, and is a reasonable thing to do. > (Yes, I lose a ';'. That is deliberate). it's been running for a week now... and the freeze hasn't occured... it's possible the circumstances for reproducing it haven't occured again either, but i haven't really changed my disk usage behaviour so it's probably fixed. let me know if you come up with some other solution you'd like tested. thanks -dean > > Please let me know what this proves, and thanks again for your > patience. > > NeilBrown > > > Signed-off-by: Neil Brown <neilb@suse.de> > > ### Diffstat output > ./drivers/md/raid5.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c > --- ./drivers/md/raid5.c~current~ 2006-05-28 21:56:56.000000000 +1000 > +++ ./drivers/md/raid5.c 2006-06-02 17:24:07.000000000 +1000 > @@ -285,7 +285,7 @@ static struct stripe_head *get_active_st > < (conf->max_nr_stripes *3/4) > || !conf->inactive_blocked), > conf->device_lock, > - unplug_slaves(conf->mddev); > + raid5_unplug_device(conf->mddev->queue) > ); > conf->inactive_blocked = 0; > } else > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-02 7:27 ` Neil Brown 2006-06-10 19:49 ` dean gaudet @ 2006-06-13 18:53 ` Bill Davidsen 2006-06-13 19:05 ` dean gaudet 2006-06-13 23:13 ` Neil Brown 1 sibling, 2 replies; 43+ messages in thread From: Bill Davidsen @ 2006-06-13 18:53 UTC (permalink / raw) To: Neil Brown; +Cc: dean gaudet, Patrik Jonsson, linux-raid Neil Brown wrote: >On Friday June 2, dean@arctic.org wrote: > > >>On Thu, 1 Jun 2006, Neil Brown wrote: >> >> >> >>>I've got one more long-shot I would like to try first. If you could >>>backout that change to ll_rw_block, and apply this patch instead. >>>Then when it hangs, just cat the stripe_cache_active file and see if >>>that unplugs things or not (cat it a few times). >>> >>> >>nope that didn't unstick it... i had to raise stripe_cache_size (from 256 >>to 768... 512 wasn't enough)... >> >>-dean >> >> > >Ok, thanks. >I still don't know what is really going on, but I'm 99.9863% sure this >will fix it, and is a reasonable thing to do. >(Yes, I lose a ';'. That is deliberate). > >Please let me know what this proves, and thanks again for your >patience. > >NeilBrown > [...snip...] Will that fix be in 2.6.17? -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-13 18:53 ` Bill Davidsen @ 2006-06-13 19:05 ` dean gaudet 2006-06-13 23:13 ` Neil Brown 1 sibling, 0 replies; 43+ messages in thread From: dean gaudet @ 2006-06-13 19:05 UTC (permalink / raw) To: Bill Davidsen; +Cc: Neil Brown, Patrik Jonsson, linux-raid On Tue, 13 Jun 2006, Bill Davidsen wrote: > Neil Brown wrote: > > > On Friday June 2, dean@arctic.org wrote: > > > > > On Thu, 1 Jun 2006, Neil Brown wrote: > > > > > > > > > > I've got one more long-shot I would like to try first. If you could > > > > backout that change to ll_rw_block, and apply this patch instead. > > > > Then when it hangs, just cat the stripe_cache_active file and see if > > > > that unplugs things or not (cat it a few times). > > > > > > > nope that didn't unstick it... i had to raise stripe_cache_size (from 256 > > > to 768... 512 wasn't enough)... > > > > > > -dean > > > > > > > Ok, thanks. > > I still don't know what is really going on, but I'm 99.9863% sure this > > will fix it, and is a reasonable thing to do. > > (Yes, I lose a ';'. That is deliberate). > > > > Please let me know what this proves, and thanks again for your > > patience. > > > > NeilBrown > > > [...snip...] > > Will that fix be in 2.6.17? also -- is it appropriate enough for 2.6.16.x? -dean ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-13 18:53 ` Bill Davidsen 2006-06-13 19:05 ` dean gaudet @ 2006-06-13 23:13 ` Neil Brown 2006-10-07 23:25 ` Chris Allen 1 sibling, 1 reply; 43+ messages in thread
From: Neil Brown @ 2006-06-13 23:13 UTC (permalink / raw)
To: Bill Davidsen; +Cc: dean gaudet, Patrik Jonsson, linux-raid

On Tuesday June 13, davidsen@tmr.com wrote:
>
> Will that fix be in 2.6.17?
>

Probably not. We have had the last 'rc' twice and so I don't think
it is appropriate to submit the patch at this stage.
I probably will submit it for an early 2.6.17.x and for 2.6.16.y.

NeilBrown

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-13 23:13 ` Neil Brown @ 2006-10-07 23:25 ` Chris Allen 2006-10-09 11:03 ` Chris Allen 0 siblings, 1 reply; 43+ messages in thread From: Chris Allen @ 2006-10-07 23:25 UTC (permalink / raw) To: linux-raid Neil Brown wrote: > On Tuesday June 13, davidsen@tmr.com wrote: > >> Will that fix be in 2.6.17? >> >> > > Probably not. We have had the last 'rc' twice and I so I don't think > it is appropriate to submit the patch at this stage. > I probably will submit it for an early 2.6.17.x. and for 2.6.16.y. > > > What is the status of this? I've been experiencing exactly the same get_active_stripe lockup on a FC5 2.6.17-1.2187_FC5smp stock kernel. Curiously we have ten similar heavily loaded servers but only one of them experiences the problem. The problem happens consistently after 24 hours or so when I hammer the raid5 array over NFS, but I've never managed to trigger it with local access. I'd also say (anecdotally) that it only started happening since I added a bitmap to my array. As with the other poster, the lockup is released by increasing stripe_cache_size. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-10-07 23:25 ` Chris Allen @ 2006-10-09 11:03 ` Chris Allen 2006-10-09 23:06 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: Chris Allen @ 2006-10-09 11:03 UTC (permalink / raw) To: linux-raid Ok, after more testing, this lockup happens consistently when bitmaps are switched on and never when they are switched off. Ideas anybody? On Sun, Oct 08, 2006 at 12:25:46AM +0100, Chris Allen wrote: > > > Neil Brown wrote: > >On Tuesday June 13, davidsen@tmr.com wrote: > > > >>Will that fix be in 2.6.17? > >> > >> > > > >Probably not. We have had the last 'rc' twice and I so I don't think > >it is appropriate to submit the patch at this stage. > >I probably will submit it for an early 2.6.17.x. and for 2.6.16.y. > > > > > > > > What is the status of this? > > I've been experiencing exactly the same get_active_stripe lockup > on a FC5 2.6.17-1.2187_FC5smp stock kernel. Curiously we have ten > similar heavily loaded > servers but only one of them experiences the problem. The problem > happens consistently after 24 hours or so > when I hammer the raid5 array over NFS, but I've never managed to > trigger it with local access. I'd also > say (anecdotally) that it only started happening since I added a bitmap > to my array. > > > As with the other poster, the lockup is released by increasing > stripe_cache_size. > > > > > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 43+ messages in thread
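The bitmap correlation is easy to exercise on a live array, since a write-intent bitmap can be added and removed with mdadm while the array is running; a sketch, with md0 as a placeholder and an internal bitmap assumed:

  mdadm --grow /dev/md0 --bitmap=none       # drop the bitmap, re-run the NFS load
  mdadm --grow /dev/md0 --bitmap=internal   # put it back for the next run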
* Re: raid5 hang on get_active_stripe 2006-10-09 11:03 ` Chris Allen @ 2006-10-09 23:06 ` Neil Brown 2006-10-09 23:21 ` Chris Allen 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-10-09 23:06 UTC (permalink / raw) To: Chris Allen; +Cc: linux-raid On Monday October 9, chris@cjx.com wrote: > Ok, after more testing, this lockup happens consistently when > bitmaps are switched on and never when they are switched off. > > Ideas anybody? > No. I'm completely stumped. Which means it is probably something very obvious, but I keep looking in the wrong place :-) The interaction with bitmaps is interesting and might prove helpful. I'll have another look at the code and see if it opens up some possibilities. I've been working with Dean Gaudet who can reproduce it. He has been trying out patches that print out lots of debugging information when the problem occurs. This has narrowed it down a bit, but I'm still in the dark. It seems that blk_plug_device is setting a timer to go off in about 3msecs, but it gets deactivated without firing. This should only happen when the device gets unplugged, but the device remains plugged. Are you happy to try a kernel.org kernel with a few patches and a little shell script running? The net result is that it detects when there is a problem, prints out some trace information, and then gets array going again. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-10-09 23:06 ` Neil Brown @ 2006-10-09 23:21 ` Chris Allen 2006-10-10 0:08 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: Chris Allen @ 2006-10-09 23:21 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Neil Brown wrote: > On Monday October 9, chris@cjx.com wrote: > >> Ok, after more testing, this lockup happens consistently when >> bitmaps are switched on and never when they are switched off. >> >> > Are you happy to try a kernel.org kernel with a few patches and a > little shell script running? > The net result is that it detects when there is a problem, prints out > some trace information, and then gets array going again. > > > Very happy to. Let me know what you'd like me to do. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-10-09 23:21 ` Chris Allen @ 2006-10-10 0:08 ` Neil Brown 2006-10-10 10:27 ` Bas van Schaik 2006-11-14 10:28 ` Chris Allen 0 siblings, 2 replies; 43+ messages in thread From: Neil Brown @ 2006-10-10 0:08 UTC (permalink / raw) To: Chris Allen; +Cc: linux-raid On Tuesday October 10, chris@cjx.com wrote: > > Very happy to. Let me know what you'd like me to do. Cool thanks. At the end is a patch against 2.6.17.11, though it should apply against any later 2.6.17 kernel. Apply this and reboot. Then run while true do cat /sys/block/mdX/md/stripe_cache_active sleep 10 done > /dev/null (maybe write a little script or whatever). Leave this running. It effects the check for "has raid5 hung". Make sure to change "mdX" to whatever is appropriate. Occasionally look in the kernel logs for plug problem: if you find that, send me the surrounding text - there should be about a dozen lines following this one. Hopefully this will let me know which is last thing to happen: a plug or an unplug. If the last is a "plug", then the timer really should still be pending, but isn't (this is impossible). So I'll look more closely at that option. If the last is an "unplug", then the 'Plugged' flag should really be clear but it isn't (this is impossible). So I'll look more closely at that option. Dean is running this, but he only gets the hang every couple of weeks. If you get it more often, that would help me a lot. Thanks, NeilBrown diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c --- ./.patches/orig/block/ll_rw_blk.c 2006-08-21 09:52:46.000000000 +1000 +++ ./block/ll_rw_blk.c 2006-10-05 11:33:32.000000000 +1000 @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_ * This is called with interrupts off and no requests on the queue and * with the queue lock held. 
*/ +static atomic_t seq = ATOMIC_INIT(0); void blk_plug_device(request_queue_t *q) { WARN_ON(!irqs_disabled()); @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q) return; if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { + q->last_plug = jiffies; + q->plug_seq = atomic_read(&seq); + atomic_inc(&seq); mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); - } + } else + q->last_plug_skip = jiffies; + if (!timer_pending(&q->unplug_timer) && + !q->unplug_work.pending) + printk("Neither Timer or work are pending\n"); } EXPORT_SYMBOL(blk_plug_device); @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q) { WARN_ON(!irqs_disabled()); - if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) + if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { + q->last_unplug_skip = jiffies; return 0; + } del_timer(&q->unplug_timer); + q->last_unplug = jiffies; + q->unplug_seq = atomic_read(&seq); + atomic_inc(&seq); + if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) + printk("queue still (or again) plugged\n"); return 1; } @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc static void blk_unplug_work(void *data) { request_queue_t *q = data; - + q->last_unplug_work = jiffies; blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, q->rq.count[READ] + q->rq.count[WRITE]); @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, q->rq.count[READ] + q->rq.count[WRITE]); + q->last_unplug_timeout = jiffies; kblockd_schedule_work(&q->unplug_work); } diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c --- ./.patches/orig/drivers/md/raid1.c 2006-08-10 17:28:01.000000000 +1000 +++ ./drivers/md/raid1.c 2006-09-04 21:58:31.000000000 +1000 @@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev) d = conf->raid_disks; d--; rdev = conf->mirrors[d].rdev; - atomic_add(s, &rdev->corrected_errors); if (rdev && test_bit(In_sync, &rdev->flags)) { if (sync_page_io(rdev->bdev, @@ -1509,6 +1508,9 @@ static void raid1d(mddev_t *mddev) s<<9, conf->tmppage, READ) == 0) /* Well, this device is dead */ md_error(mddev, rdev); + else + atomic_add(s, &rdev->corrected_errors); + } } } else { diff ./.patches/orig/drivers/md/raid5.c ./drivers/md/raid5.c --- ./.patches/orig/drivers/md/raid5.c 2006-08-10 17:28:01.000000000 +1000 +++ ./drivers/md/raid5.c 2006-09-25 11:11:59.000000000 +1000 @@ -93,6 +93,8 @@ static void __release_stripe(raid5_conf_ md_wakeup_thread(conf->mddev->thread); } atomic_dec(&conf->active_stripes); + if (atomic_read(&conf->active_stripes) < conf->low_size) + conf->low_size = atomic_read(&conf->active_stripes); if (!test_bit(STRIPE_EXPANDING, &sh->state)) { list_add_tail(&sh->lru, &conf->inactive_list); wake_up(&conf->wait_for_stripe); @@ -1677,7 +1679,12 @@ static void raid5_unplug_device(request_ spin_lock_irqsave(&conf->device_lock, flags); + if (conf->dotrace) + printk(" trace unplug_device\n"); + if (blk_remove_plug(q)) { + if (conf->dotrace) + printk(" trace remove plug\n"); conf->seq_flush++; raid5_activate_delayed(conf); } @@ -1721,6 +1728,8 @@ static int raid5_issue_flush(request_que static inline void raid5_plug_device(raid5_conf_t *conf) { spin_lock_irq(&conf->device_lock); + if (conf->dotrace) + printk(" trace plug device\n"); blk_plug_device(conf->mddev->queue); spin_unlock_irq(&conf->device_lock); } @@ -2154,6 +2163,42 @@ static ssize_t stripe_cache_active_show(mddev_t *mddev, char *page) { raid5_conf_t *conf = mddev_to_conf(mddev); + + /* 
tracing */ + if (conf->dotrace) { + conf->dotrace = 0; + printk("end trace, force unplug\n"); + raid5_unplug_device(mddev->queue); + raid5_unplug_device(mddev->queue); + } else { + if (conf->last_size > conf->max_nr_stripes/2 && + conf->last_size <= conf->low_size) { + conf->dotrace = 1; + printk("plug problem: low_size=%d\n", conf->low_size); + printk(" blk_queue_plugged = %d\n", blk_queue_plugged(mddev->queue)); + printk(" flags = %lx\n", mddev->queue->queue_flags); + printk(" unplug_delay = %lu (HZ=%d)\n", mddev->queue->unplug_delay, HZ); + printk(" timer->expires %lu (jiffies %lu)\n", + mddev->queue->unplug_timer.expires, jiffies); + printk(" pending = %d\n", timer_pending(&mddev->queue->unplug_timer)); + printk(" wq_pending = %lx\n", mddev->queue->unplug_work.pending); + printk(" wq_entry empty? %d\n", list_empty(&mddev->queue->unplug_work.entry)); + printk(" unplug=%lu unplug_skip=%lu\n", + mddev->queue->last_unplug, + mddev->queue->last_unplug_skip); + printk(" plug=%lu plug_skip=%lu\n", + mddev->queue->last_plug, + mddev->queue->last_plug_skip); + printk(" unplug_work=%lu unplug_timeout=%lu\n", + mddev->queue->last_unplug_work, + mddev->queue->last_unplug_timeout); + printk(" unplug_seq=%lu plug_seq=%lu\n", + mddev->queue->unplug_seq, + mddev->queue->plug_seq); + } + conf->last_size = atomic_read(&conf->active_stripes); + conf->low_size = conf->last_size; + } if (conf) return sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); else diff ./.patches/orig/include/linux/blkdev.h ./include/linux/blkdev.h --- ./.patches/orig/include/linux/blkdev.h 2006-08-10 17:28:48.000000000 +1000 +++ ./include/linux/blkdev.h 2006-09-25 11:08:27.000000000 +1000 @@ -380,6 +380,9 @@ struct request_queue */ unsigned long queue_flags; + unsigned long last_unplug, last_unplug_skip, last_plug, last_plug_skip; + unsigned long last_unplug_work, last_unplug_timeout; + unsigned long unplug_seq, plug_seq; /* * protects queue structures from reentrancy. ->__queue_lock should * _never_ be used directly, it is queue private. always use diff ./.patches/orig/include/linux/raid/raid5.h ./include/linux/raid/raid5.h --- ./.patches/orig/include/linux/raid/raid5.h 2006-08-10 17:28:53.000000000 +1000 +++ ./include/linux/raid/raid5.h 2006-08-28 14:09:32.000000000 +1000 @@ -245,6 +245,10 @@ struct raid5_private_data { struct page *spare_page; /* Used when checking P/Q in raid6 */ + int dotrace; + int last_size; + int low_size; + /* * Free stripes pool */ ^ permalink raw reply [flat|nested] 43+ messages in thread
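Neil's polling loop is easily wrapped into a small script; a sketch, with md0 as a stand-in for the real array name and the "plug problem:" marker taken from the patch above:

  #!/bin/sh
  md=${1:-md0}
  while true
  do
      # reading this attribute is what triggers the hang check added by the patch
      cat /sys/block/$md/md/stripe_cache_active > /dev/null
      sleep 10
  done

When the marker does appear, something like "dmesg | grep -A 12 'plug problem:'" pulls out the dozen lines of trace that follow it.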
* Re: raid5 hang on get_active_stripe 2006-10-10 0:08 ` Neil Brown @ 2006-10-10 10:27 ` Bas van Schaik 2006-11-14 10:28 ` Chris Allen 1 sibling, 0 replies; 43+ messages in thread From: Bas van Schaik @ 2006-10-10 10:27 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Hi all, Neil Brown wrote: > On Tuesday October 10, chris@cjx.com wrote: > >> Very happy to. Let me know what you'd like me to do. >> > > Cool thanks. > (snip) > I don't know if it's useful information, but I'm encountering the same problem here, in a totally different situation. I'm using Peter Breuers ENBD (you probably know him, since he started a discussion about request retries with exponential timeouts and a communication channel to raid a while ago) to import a total of 12 devices from other machines to compose those disks into 3 arrays of RAID5. Those 3 arrays are combined in one VG with one LV, running CryptoLoop on top. Last, but not least, a ReiserFS is created on the loopback device. I'm using the Debian Etch stock 2.6.17-kernel, by the way. When doing a lot of I/O on the ReiserFS (like a "reiserfsck --rebuild-tree"), the machine suddenly gets stuck, I think after filling it's memory with buffers. I've been doing a lot of debugging with Peter, attached you'll find a "ps -axl" with a widened WCHAN column to see that some of the enbd-client processes get stuck in the RAID code. We've not been able find out how ENBD gets into the RAID code, but I don't think that's really relevant right now. Here's the relevant part of ps: ps ax -o f,uid,pid,ppid,pri,ni,vsz,rss,wchan:30,stat,tty,time,command (only the relevant rows) > F UID PID PPID PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND > (snip) > 5 0 26523 1 23 0 2140 1052 - Ss ? 00:00:00 enbd-client iss01 1300 -i iss01-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndi > 5 0 26540 1 23 0 2140 1048 get_active_stripe Ds ? 00:00:00 enbd-client iss04 1300 -i iss04-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndl > 5 0 26552 1 23 0 2140 1044 - Ss ? 00:00:00 enbd-client iss02 1200 -i iss02-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndf > 5 0 26556 1 23 0 2140 1048 - Ss ? 00:00:00 enbd-client iss01 1100 -i iss01-hda5 -n 2 -e -m -b 4096 -p 30 /dev/nda > 5 0 26561 1 23 0 2140 1052 get_active_stripe Ds ? 00:00:00 enbd-client iss02 1100 -i iss02-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndb > 5 0 26564 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss03 1200 -i iss03-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndg > 5 0 26568 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss04 1200 -i iss04-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndh > 5 0 26581 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss03 1100 -i iss03-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndc > 5 0 26590 1 23 0 2140 1048 - Ss ? 00:00:00 enbd-client iss01 1200 -i iss01-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/nde > 5 0 26606 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss02 1300 -i iss02-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndj > 5 0 26614 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss03 1300 -i iss03-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndk > 5 0 26616 1 23 0 2144 1056 - Ss ? 00:00:00 enbd-client iss04 1100 -i iss04-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndd > 5 0 26617 26523 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss01 1300 -i iss01-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndi > 5 0 26618 26523 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss01 1300 -i iss01-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndi > 5 0 26619 26540 24 0 2140 948 enbd_get_req S ? 00:00:01 enbd-client iss04 1300 -i iss04-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndl > 5 0 26620 26540 24 0 2140 948 enbd_get_req S ? 
00:00:01 enbd-client iss04 1300 -i iss04-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndl > 5 0 26621 26552 24 0 2140 948 get_active_stripe D ? 00:32:11 enbd-client iss02 1200 -i iss02-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndf > 5 0 26622 26552 24 0 2140 948 get_active_stripe D ? 00:32:18 enbd-client iss02 1200 -i iss02-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndf > 5 0 26623 26564 23 0 2144 956 enbd_get_req S ? 00:32:27 enbd-client iss03 1200 -i iss03-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndg > 5 0 26624 26564 24 0 2144 956 enbd_get_req S ? 00:32:37 enbd-client iss03 1200 -i iss03-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndg > 5 0 26625 26568 24 0 2144 956 enbd_get_req S ? 00:35:35 enbd-client iss04 1200 -i iss04-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndh > 5 0 26626 26561 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss02 1100 -i iss02-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndb > 5 0 26627 26561 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss02 1100 -i iss02-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndb > 5 0 26628 26568 24 0 2144 956 enbd_get_req S ? 00:35:37 enbd-client iss04 1200 -i iss04-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndh > 5 0 26629 26556 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss01 1100 -i iss01-hda5 -n 2 -e -m -b 4096 -p 30 /dev/nda > 5 0 26630 26556 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss01 1100 -i iss01-hda5 -n 2 -e -m -b 4096 -p 30 /dev/nda > 5 0 26631 26581 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss03 1100 -i iss03-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndc > 5 0 26632 26581 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss03 1100 -i iss03-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndc > 5 0 26633 26590 24 0 2140 952 enbd_get_req S ? 00:36:58 enbd-client iss01 1200 -i iss01-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/nde > 5 0 26634 26590 24 0 2140 952 enbd_get_req S ? 00:36:50 enbd-client iss01 1200 -i iss01-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/nde > 5 0 26635 26606 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss02 1300 -i iss02-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndj > 5 0 26636 26606 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss02 1300 -i iss02-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndj > 5 0 26637 26616 24 0 2144 952 enbd_get_req S ? 00:00:00 enbd-client iss04 1100 -i iss04-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndd > 5 0 26638 26616 23 0 2144 952 enbd_get_req S ? 00:00:00 enbd-client iss04 1100 -i iss04-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndd > 5 0 26639 26614 23 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss03 1300 -i iss03-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndk > 5 0 26640 26614 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss03 1300 -i iss03-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndk > I've tried this "reiserfsck --rebuild-tree" a couple of times, it keeps hanging at the same point when my memory gets filled with buffers. My assumption is, that reiserfs is writing out too fast, the network (ENBD) can't handle it and after a while there's no memory left for TCP buffers. I've solved this problem by editing /proc/sys/vm/min_free_kbytes to force the kernel to leave some memory for the TCP buffers and other interrupt handling. I'm not able to install a vanilla kernel with some patches, but I would be happy to provide some extra details about the crash if you want me to. I assume I can even reproduce it, on another cluster however, since I've recreated a (ext3) filesystem on the cluster we're talking about. Regards -- Bas van Schaik ^ permalink raw reply [flat|nested] 43+ messages in thread
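For completeness, the knob Bas adjusted is an ordinary sysctl; a sketch of the workaround (the 64 MB figure is only an example, Bas does not give his value):

  echo 65536 > /proc/sys/vm/min_free_kbytes   # keep ~64 MB free for atomic/network allocations
  # or equivalently:
  sysctl -w vm.min_free_kbytes=65536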
* Re: raid5 hang on get_active_stripe 2006-10-10 0:08 ` Neil Brown 2006-10-10 10:27 ` Bas van Schaik @ 2006-11-14 10:28 ` Chris Allen 2006-11-15 20:39 ` dean gaudet 1 sibling, 1 reply; 43+ messages in thread From: Chris Allen @ 2006-11-14 10:28 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid You probably guessed that no matter what I did, I never, ever saw the problem when your trace was installed. I'd guess at some obscure timing-related problem. I can still trigger it consistently with a vanilla 2.6.17_SMP though, but again only when bitmaps are turned on. Neil Brown wrote: > On Tuesday October 10, chris@cjx.com wrote: > >> Very happy to. Let me know what you'd like me to do. >> > > Cool thanks. > > At the end is a patch against 2.6.17.11, though it should apply against > any later 2.6.17 kernel. > Apply this and reboot. > > Then run > > while true > do cat /sys/block/mdX/md/stripe_cache_active > sleep 10 > done > /dev/null > > (maybe write a little script or whatever). Leave this running. It > effects the check for "has raid5 hung". Make sure to change "mdX" to > whatever is appropriate. > > Occasionally look in the kernel logs for > plug problem: > > if you find that, send me the surrounding text - there should be about > a dozen lines following this one. > > Hopefully this will let me know which is last thing to happen: a plug > or an unplug. > If the last is a "plug", then the timer really should still be > pending, but isn't (this is impossible). So I'll look more closely at > that option. > If the last is an "unplug", then the 'Plugged' flag should really be > clear but it isn't (this is impossible). So I'll look more closely at > that option. > > Dean is running this, but he only gets the hang every couple of > weeks. If you get it more often, that would help me a lot. > > Thanks, > NeilBrown > > > diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c > --- ./.patches/orig/block/ll_rw_blk.c 2006-08-21 09:52:46.000000000 +1000 > +++ ./block/ll_rw_blk.c 2006-10-05 11:33:32.000000000 +1000 > @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_ > * This is called with interrupts off and no requests on the queue and > * with the queue lock held. 
> */ > +static atomic_t seq = ATOMIC_INIT(0); > void blk_plug_device(request_queue_t *q) > { > WARN_ON(!irqs_disabled()); > @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q) > return; > > if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { > + q->last_plug = jiffies; > + q->plug_seq = atomic_read(&seq); > + atomic_inc(&seq); > mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); > blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); > - } > + } else > + q->last_plug_skip = jiffies; > + if (!timer_pending(&q->unplug_timer) && > + !q->unplug_work.pending) > + printk("Neither Timer or work are pending\n"); > } > > EXPORT_SYMBOL(blk_plug_device); > @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q) > { > WARN_ON(!irqs_disabled()); > > - if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) > + if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { > + q->last_unplug_skip = jiffies; > return 0; > + } > > del_timer(&q->unplug_timer); > + q->last_unplug = jiffies; > + q->unplug_seq = atomic_read(&seq); > + atomic_inc(&seq); > + if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) > + printk("queue still (or again) plugged\n"); > return 1; > } > > @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc > static void blk_unplug_work(void *data) > { > request_queue_t *q = data; > - > + q->last_unplug_work = jiffies; > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, > q->rq.count[READ] + q->rq.count[WRITE]); > > @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, > q->rq.count[READ] + q->rq.count[WRITE]); > > + q->last_unplug_timeout = jiffies; > kblockd_schedule_work(&q->unplug_work); > } > > > diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c > --- ./.patches/orig/drivers/md/raid1.c 2006-08-10 17:28:01.000000000 +1000 > +++ ./drivers/md/raid1.c 2006-09-04 21:58:31.000000000 +1000 > @@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev) > d = conf->raid_disks; > d--; > rdev = conf->mirrors[d].rdev; > - atomic_add(s, &rdev->corrected_errors); > if (rdev && > test_bit(In_sync, &rdev->flags)) { > if (sync_page_io(rdev->bdev, > @@ -1509,6 +1508,9 @@ static void raid1d(mddev_t *mddev) > s<<9, conf->tmppage, READ) == 0) > /* Well, this device is dead */ > md_error(mddev, rdev); > + else > + atomic_add(s, &rdev->corrected_errors); > + > } > } > } else { > > diff ./.patches/orig/drivers/md/raid5.c ./drivers/md/raid5.c > --- ./.patches/orig/drivers/md/raid5.c 2006-08-10 17:28:01.000000000 +1000 > +++ ./drivers/md/raid5.c 2006-09-25 11:11:59.000000000 +1000 > @@ -93,6 +93,8 @@ static void __release_stripe(raid5_conf_ > md_wakeup_thread(conf->mddev->thread); > } > atomic_dec(&conf->active_stripes); > + if (atomic_read(&conf->active_stripes) < conf->low_size) > + conf->low_size = atomic_read(&conf->active_stripes); > if (!test_bit(STRIPE_EXPANDING, &sh->state)) { > list_add_tail(&sh->lru, &conf->inactive_list); > wake_up(&conf->wait_for_stripe); > @@ -1677,7 +1679,12 @@ static void raid5_unplug_device(request_ > > spin_lock_irqsave(&conf->device_lock, flags); > > + if (conf->dotrace) > + printk(" trace unplug_device\n"); > + > if (blk_remove_plug(q)) { > + if (conf->dotrace) > + printk(" trace remove plug\n"); > conf->seq_flush++; > raid5_activate_delayed(conf); > } > @@ -1721,6 +1728,8 @@ static int raid5_issue_flush(request_que > static inline void raid5_plug_device(raid5_conf_t *conf) > { > spin_lock_irq(&conf->device_lock); > + if (conf->dotrace) > + printk(" trace plug 
device\n"); > blk_plug_device(conf->mddev->queue); > spin_unlock_irq(&conf->device_lock); > } > @@ -2154,6 +2163,42 @@ static ssize_t > stripe_cache_active_show(mddev_t *mddev, char *page) > { > raid5_conf_t *conf = mddev_to_conf(mddev); > + > + /* tracing */ > + if (conf->dotrace) { > + conf->dotrace = 0; > + printk("end trace, force unplug\n"); > + raid5_unplug_device(mddev->queue); > + raid5_unplug_device(mddev->queue); > + } else { > + if (conf->last_size > conf->max_nr_stripes/2 && > + conf->last_size <= conf->low_size) { > + conf->dotrace = 1; > + printk("plug problem: low_size=%d\n", conf->low_size); > + printk(" blk_queue_plugged = %d\n", blk_queue_plugged(mddev->queue)); > + printk(" flags = %lx\n", mddev->queue->queue_flags); > + printk(" unplug_delay = %lu (HZ=%d)\n", mddev->queue->unplug_delay, HZ); > + printk(" timer->expires %lu (jiffies %lu)\n", > + mddev->queue->unplug_timer.expires, jiffies); > + printk(" pending = %d\n", timer_pending(&mddev->queue->unplug_timer)); > + printk(" wq_pending = %lx\n", mddev->queue->unplug_work.pending); > + printk(" wq_entry empty? %d\n", list_empty(&mddev->queue->unplug_work.entry)); > + printk(" unplug=%lu unplug_skip=%lu\n", > + mddev->queue->last_unplug, > + mddev->queue->last_unplug_skip); > + printk(" plug=%lu plug_skip=%lu\n", > + mddev->queue->last_plug, > + mddev->queue->last_plug_skip); > + printk(" unplug_work=%lu unplug_timeout=%lu\n", > + mddev->queue->last_unplug_work, > + mddev->queue->last_unplug_timeout); > + printk(" unplug_seq=%lu plug_seq=%lu\n", > + mddev->queue->unplug_seq, > + mddev->queue->plug_seq); > + } > + conf->last_size = atomic_read(&conf->active_stripes); > + conf->low_size = conf->last_size; > + } > if (conf) > return sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); > else > > diff ./.patches/orig/include/linux/blkdev.h ./include/linux/blkdev.h > --- ./.patches/orig/include/linux/blkdev.h 2006-08-10 17:28:48.000000000 +1000 > +++ ./include/linux/blkdev.h 2006-09-25 11:08:27.000000000 +1000 > @@ -380,6 +380,9 @@ struct request_queue > */ > unsigned long queue_flags; > > + unsigned long last_unplug, last_unplug_skip, last_plug, last_plug_skip; > + unsigned long last_unplug_work, last_unplug_timeout; > + unsigned long unplug_seq, plug_seq; > /* > * protects queue structures from reentrancy. ->__queue_lock should > * _never_ be used directly, it is queue private. always use > > diff ./.patches/orig/include/linux/raid/raid5.h ./include/linux/raid/raid5.h > --- ./.patches/orig/include/linux/raid/raid5.h 2006-08-10 17:28:53.000000000 +1000 > +++ ./include/linux/raid/raid5.h 2006-08-28 14:09:32.000000000 +1000 > @@ -245,6 +245,10 @@ struct raid5_private_data { > > struct page *spare_page; /* Used when checking P/Q in raid6 */ > > + int dotrace; > + int last_size; > + int low_size; > + > /* > * Free stripes pool > */ > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-11-14 10:28 ` Chris Allen @ 2006-11-15 20:39 ` dean gaudet 0 siblings, 0 replies; 43+ messages in thread From: dean gaudet @ 2006-11-15 20:39 UTC (permalink / raw) To: Chris Allen; +Cc: Neil Brown, linux-raid and i haven't seen it either... neil do you think your latest patch was hiding the bug? 'cause there was an iteration of an earlier patch which didn't produce much spam in dmesg but the bug was still there, then there is the version below which spams dmesg a fair amount but i didn't see the bug in ~30 days. btw i've upgraded that box to 2.6.18.2 without the patch (it had some conflicts)... haven't seen the bug yet though (~10 days so far). hmm i wonder if i could reproduce it more rapidly if i lowered /sys/block/mdX/md/stripe_cache_size. i'll give that a go. -dean On Tue, 14 Nov 2006, Chris Allen wrote: > You probably guessed that no matter what I did, I never, ever saw the problem > when your > trace was installed. I'd guess at some obscure timing-related problem. I can > still trigger it > consistently with a vanilla 2.6.17_SMP though, but again only when bitmaps are > turned on. > > > > Neil Brown wrote: > > On Tuesday October 10, chris@cjx.com wrote: > > > > > Very happy to. Let me know what you'd like me to do. > > > > > > > Cool thanks. > > > > At the end is a patch against 2.6.17.11, though it should apply against > > any later 2.6.17 kernel. > > Apply this and reboot. > > > > Then run > > > > while true > > do cat /sys/block/mdX/md/stripe_cache_active > > sleep 10 > > done > /dev/null > > > > (maybe write a little script or whatever). Leave this running. It > > effects the check for "has raid5 hung". Make sure to change "mdX" to > > whatever is appropriate. > > > > Occasionally look in the kernel logs for > > plug problem: > > > > if you find that, send me the surrounding text - there should be about > > a dozen lines following this one. > > > > Hopefully this will let me know which is last thing to happen: a plug > > or an unplug. > > If the last is a "plug", then the timer really should still be > > pending, but isn't (this is impossible). So I'll look more closely at > > that option. > > If the last is an "unplug", then the 'Plugged' flag should really be > > clear but it isn't (this is impossible). So I'll look more closely at > > that option. > > > > Dean is running this, but he only gets the hang every couple of > > weeks. If you get it more often, that would help me a lot. > > > > Thanks, > > NeilBrown > > > > > > diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c > > --- ./.patches/orig/block/ll_rw_blk.c 2006-08-21 09:52:46.000000000 +1000 > > +++ ./block/ll_rw_blk.c 2006-10-05 11:33:32.000000000 +1000 > > @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_ > > * This is called with interrupts off and no requests on the queue and > > * with the queue lock held. 
> > */ > > +static atomic_t seq = ATOMIC_INIT(0); > > void blk_plug_device(request_queue_t *q) > > { > > WARN_ON(!irqs_disabled()); > > @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q) > > return; > > if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { > > + q->last_plug = jiffies; > > + q->plug_seq = atomic_read(&seq); > > + atomic_inc(&seq); > > mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); > > blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); > > - } > > + } else > > + q->last_plug_skip = jiffies; > > + if (!timer_pending(&q->unplug_timer) && > > + !q->unplug_work.pending) > > + printk("Neither Timer or work are pending\n"); > > } > > EXPORT_SYMBOL(blk_plug_device); > > @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q) > > { > > WARN_ON(!irqs_disabled()); > > - if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) > > + if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { > > + q->last_unplug_skip = jiffies; > > return 0; > > + } > > del_timer(&q->unplug_timer); > > + q->last_unplug = jiffies; > > + q->unplug_seq = atomic_read(&seq); > > + atomic_inc(&seq); > > + if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) > > + printk("queue still (or again) plugged\n"); > > return 1; > > } > > @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc > > static void blk_unplug_work(void *data) > > { > > request_queue_t *q = data; > > - > > + q->last_unplug_work = jiffies; > > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, > > q->rq.count[READ] + q->rq.count[WRITE]); > > @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned > > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, > > q->rq.count[READ] + q->rq.count[WRITE]); > > + q->last_unplug_timeout = jiffies; > > kblockd_schedule_work(&q->unplug_work); > > } > > > > diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c > > --- ./.patches/orig/drivers/md/raid1.c 2006-08-10 17:28:01.000000000 > > +1000 > > +++ ./drivers/md/raid1.c 2006-09-04 21:58:31.000000000 +1000 > > @@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev) > > d = conf->raid_disks; > > d--; > > rdev = conf->mirrors[d].rdev; > > - atomic_add(s, > > &rdev->corrected_errors); > > if (rdev && > > test_bit(In_sync, > > &rdev->flags)) { > > if > > (sync_page_io(rdev->bdev, > > @@ -1509,6 +1508,9 @@ static void raid1d(mddev_t *mddev) > > s<<9, > > conf->tmppage, READ) == 0) > > /* Well, this > > device is dead */ > > md_error(mddev, > > rdev); > > + else > > + atomic_add(s, > > &rdev->corrected_errors); > > + > > } > > } > > } else { > > > > diff ./.patches/orig/drivers/md/raid5.c ./drivers/md/raid5.c > > --- ./.patches/orig/drivers/md/raid5.c 2006-08-10 17:28:01.000000000 > > +1000 > > +++ ./drivers/md/raid5.c 2006-09-25 11:11:59.000000000 +1000 > > @@ -93,6 +93,8 @@ static void __release_stripe(raid5_conf_ > > md_wakeup_thread(conf->mddev->thread); > > } > > atomic_dec(&conf->active_stripes); > > + if (atomic_read(&conf->active_stripes) < > > conf->low_size) > > + conf->low_size = > > atomic_read(&conf->active_stripes); > > if (!test_bit(STRIPE_EXPANDING, &sh->state)) { > > list_add_tail(&sh->lru, &conf->inactive_list); > > wake_up(&conf->wait_for_stripe); > > @@ -1677,7 +1679,12 @@ static void raid5_unplug_device(request_ > > spin_lock_irqsave(&conf->device_lock, flags); > > + if (conf->dotrace) > > + printk(" trace unplug_device\n"); > > + > > if (blk_remove_plug(q)) { > > + if (conf->dotrace) > > + printk(" trace remove plug\n"); > > conf->seq_flush++; > > raid5_activate_delayed(conf); > > 
} > > @@ -1721,6 +1728,8 @@ static int raid5_issue_flush(request_que > > static inline void raid5_plug_device(raid5_conf_t *conf) > > { > > spin_lock_irq(&conf->device_lock); > > + if (conf->dotrace) > > + printk(" trace plug device\n"); > > blk_plug_device(conf->mddev->queue); > > spin_unlock_irq(&conf->device_lock); > > } > > @@ -2154,6 +2163,42 @@ static ssize_t > > stripe_cache_active_show(mddev_t *mddev, char *page) > > { > > raid5_conf_t *conf = mddev_to_conf(mddev); > > + > > + /* tracing */ > > + if (conf->dotrace) { > > + conf->dotrace = 0; > > + printk("end trace, force unplug\n"); > > + raid5_unplug_device(mddev->queue); > > + raid5_unplug_device(mddev->queue); > > + } else { > > + if (conf->last_size > conf->max_nr_stripes/2 && > > + conf->last_size <= conf->low_size) { > > + conf->dotrace = 1; > > + printk("plug problem: low_size=%d\n", conf->low_size); > > + printk(" blk_queue_plugged = %d\n", > > blk_queue_plugged(mddev->queue)); > > + printk(" flags = %lx\n", mddev->queue->queue_flags); > > + printk(" unplug_delay = %lu (HZ=%d)\n", > > mddev->queue->unplug_delay, HZ); > > + printk(" timer->expires %lu (jiffies %lu)\n", > > + mddev->queue->unplug_timer.expires, jiffies); > > + printk(" pending = %d\n", > > timer_pending(&mddev->queue->unplug_timer)); > > + printk(" wq_pending = %lx\n", > > mddev->queue->unplug_work.pending); > > + printk(" wq_entry empty? %d\n", > > list_empty(&mddev->queue->unplug_work.entry)); > > + printk(" unplug=%lu unplug_skip=%lu\n", > > + mddev->queue->last_unplug, > > + mddev->queue->last_unplug_skip); > > + printk(" plug=%lu plug_skip=%lu\n", > > + mddev->queue->last_plug, > > + mddev->queue->last_plug_skip); > > + printk(" unplug_work=%lu unplug_timeout=%lu\n", > > + mddev->queue->last_unplug_work, > > + mddev->queue->last_unplug_timeout); > > + printk(" unplug_seq=%lu plug_seq=%lu\n", > > + mddev->queue->unplug_seq, > > + mddev->queue->plug_seq); > > + } > > + conf->last_size = atomic_read(&conf->active_stripes); > > + conf->low_size = conf->last_size; > > + } > > if (conf) > > return sprintf(page, "%d\n", > > atomic_read(&conf->active_stripes)); > > else > > > > diff ./.patches/orig/include/linux/blkdev.h ./include/linux/blkdev.h > > --- ./.patches/orig/include/linux/blkdev.h 2006-08-10 17:28:48.000000000 > > +1000 > > +++ ./include/linux/blkdev.h 2006-09-25 11:08:27.000000000 +1000 > > @@ -380,6 +380,9 @@ struct request_queue > > */ > > unsigned long queue_flags; > > + unsigned long last_unplug, last_unplug_skip, last_plug, > > last_plug_skip; > > + unsigned long last_unplug_work, last_unplug_timeout; > > + unsigned long unplug_seq, plug_seq; > > /* > > * protects queue structures from reentrancy. ->__queue_lock should > > * _never_ be used directly, it is queue private. always use > > > > diff ./.patches/orig/include/linux/raid/raid5.h ./include/linux/raid/raid5.h > > --- ./.patches/orig/include/linux/raid/raid5.h 2006-08-10 > > 17:28:53.000000000 +1000 > > +++ ./include/linux/raid/raid5.h 2006-08-28 14:09:32.000000000 +1000 > > @@ -245,6 +245,10 @@ struct raid5_private_data { > > struct page *spare_page; /* Used when checking P/Q in > > raid6 */ > > + int dotrace; > > + int last_size; > > + int low_size; > > + > > /* > > * Free stripes pool > > */ > > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 43+ messages in thread
* RE: raid5 performance question 2006-03-06 11:46 raid5 performance question Raz Ben-Jehuda(caro) 2006-03-06 11:59 ` Gordon Henderson @ 2006-03-06 22:17 ` Guy 2006-03-06 22:24 ` Neil Brown 2 siblings, 0 replies; 43+ messages in thread From: Guy @ 2006-03-06 22:17 UTC (permalink / raw) To: 'Raz Ben-Jehuda(caro)', 'Linux RAID Mailing List' Cc: 'Neil Brown' Does test 1 have 4 processes? Does test 2 have 1 process? The number of testing processes should be the same in both tests. } -----Original Message----- } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid- } owner@vger.kernel.org] On Behalf Of Raz Ben-Jehuda(caro) } Sent: Monday, March 06, 2006 6:46 AM } To: Linux RAID Mailing List } Cc: Neil Brown } Subject: raid5 performance question } } Neil Hello . } I have a performance question. } } I am using raid5 stripe size 1024K over 4 disks. } I am benchmarking it with an asynchronous tester. } This tester submits 100 IOs of size of 1024 K --> as the stripe size. } It reads raw io from the device, no file system is involved. } } I am making the following comparsion: } } 1. Reading 4 disks at the same time using 1 MB buffer in random manner. } 2. Reading 1 raid5 device using 1MB buffer in random manner. } } I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s } from } 4 disks, the raid5 device gives 35 MB/s . } it is like i am reading a single disk , but by looking at iostat i can } see that all } disks are active but with low throughput. } } Any idea ? } } Thank you. } -- } Raz } - } To unsubscribe from this list: send the line "unsubscribe linux-raid" in } the body of a message to majordomo@vger.kernel.org } More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 43+ messages in thread
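Guy's point can be checked by using the same number of readers in both runs; a rough sketch with dd (device names, sizes and offsets are made up for illustration):

  # scenario 2 with four concurrent readers against the one raid5 device
  for skip in 0 100000 200000 300000
  do
      dd if=/dev/md1 of=/dev/null bs=1M count=1000 skip=$skip &
  done
  wait
  # scenario 1 would start the same four readers, one per member disk, instead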
* Re: raid5 performance question 2006-03-06 11:46 raid5 performance question Raz Ben-Jehuda(caro) 2006-03-06 11:59 ` Gordon Henderson 2006-03-06 22:17 ` raid5 performance question Guy @ 2006-03-06 22:24 ` Neil Brown 2006-03-07 8:40 ` Raz Ben-Jehuda(caro) 2006-03-08 6:45 ` thunder7 2 siblings, 2 replies; 43+ messages in thread
From: Neil Brown @ 2006-03-06 22:24 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List

On Monday March 6, raziebe@gmail.com wrote:
> Neil Hello .
> I have a performance question.
>
> I am using raid5 stripe size 1024K over 4 disks.

I assume you mean a chunksize of 1024K rather than a stripe size.
With a 4 disk array, the stripe size will be 3 times the chunksize,
and so could not possibly be 1024K.

> I am benchmarking it with an asynchronous tester.
> This tester submits 100 IOs of size of 1024 K --> as the stripe size.
> It reads raw io from the device, no file system is involved.
>
> I am making the following comparsion:
>
> 1. Reading 4 disks at the same time using 1 MB buffer in random manner.
> 2. Reading 1 raid5 device using 1MB buffer in random manner.

If your chunk size is 1MB, then you will need larger sequential reads
to get good throughput.

You can also try increasing the size of the stripe cache in
/sys/block/mdX/md/stripe_cache_size

The units are in pages (normally 4K) per device. The default is 256 which fits
only one stripe with a 1 Meg chunk size.

Try 1024 ?

NeilBrown

>
> I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from
> 4 disks, the raid5 device gives 35 MB/s .
> it is like i am reading a single disk , but by looking at iostat i can
> see that all
> disks are active but with low throughput.
>
> Any idea ?
>
> Thank you.
> --
> Raz
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 43+ messages in thread
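Working Neil's numbers through for this particular array (4 disks, 1024K chunk; md1 is the device name used later in the thread):

  # one full stripe = 3 data chunks + 1 parity chunk = 3 * 1024K of data
  # stripe_cache_size is counted in 4K pages per member device:
  #   256 (default) * 4K = 1M per device, i.e. about one chunk, one stripe
  #   1024          * 4K = 4M per device, i.e. room for about four stripes
  echo 1024 > /sys/block/md1/md/stripe_cache_size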
* Re: raid5 performance question
  2006-03-06 22:24 ` Neil Brown
@ 2006-03-07  8:40 ` Raz Ben-Jehuda(caro)
  2006-03-07 23:03 ` Neil Brown
  2006-03-08  6:45 ` thunder7
  1 sibling, 1 reply; 43+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-03-07 8:40 UTC (permalink / raw)
To: Neil Brown; +Cc: Linux RAID Mailing List

Neil,
what is the stripe_cache exactly?

First, here are some numbers:

Setting it to 1024 gives me 85 MB/s.
Setting it to 4096 gives me 105 MB/s.
Setting it to 8192 gives me 115 MB/s.

md.txt does not say much about it, only that it is the number of
entries.

Here are some tests I have made:

Test 1:
When I set the stripe_cache to zero and run
"dd if=/dev/md1 of=/dev/zero bs=1M count=100000 skip=630000"
I am getting 120 MB/s.
When I set the stripe cache to 4096 and issue the same command I am
getting 120 MB/s as well.

Test 2:
Here is what this tester does:

It opens N descriptors over a device.
It issues N IOs to the target and waits for the completion of each IO.
When an IO completes the tester has two choices:

1. Calculate a new random seek position over the target.

2. Move sequentially to the next position; that is, if one reads a 1MB
   buffer, the next position is current+1M.

I am using direct IO and asynchronous IO.

Option 1 simulates non-contiguous files; option 2 simulates contiguous
files.  The numbers above were made with option 2.
If I use option 1 I am getting 95 MB/s with stripe_cache_size=4096.

A single disk in this manner (option 1) gives ~28 MB/s.
A single disk in scenario 2 gives ~30 MB/s.

I understand that the IO distribution is something to talk about, but
I am submitting 250 IOs, so the load on the raid should be heavy.

Questions:
1. How can the stripe cache give me a boost when I have totally random
   access to the disk?

2. Does direct IO pass through this cache?

3. How can a dd of 1MB over a 1MB chunk size achieve the high
   throughput of 4 disks even if it does not get the stripe cache
   benefits?

Thank you,
Raz

On 3/7/06, Neil Brown <neilb@suse.de> wrote:
> I assume you mean a chunk size of 1024K rather than a stripe size.
> With a 4-disk array, the stripe size will be 3 times the chunk size,
> and so could not possibly be 1024K.
>
> If your chunk size is 1MB, then you will need larger sequential reads
> to get good throughput.
>
> You can also try increasing the size of the stripe cache in
> /sys/block/mdX/md/stripe_cache_size
>
> The units are pages (normally 4K) per device.  The default is 256,
> which fits only one stripe with a 1 MB chunk size.
>
> Try 1024?
>
> NeilBrown

--
Raz

^ permalink raw reply	[flat|nested] 43+ messages in thread
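To make the two access patterns concrete, here is a deliberately simplified
stand-in for the tester described above. It is not the original tool: it uses
ordinary buffered pread() calls in threads rather than asynchronous direct IO,
and the device path and sizes are placeholders. It only illustrates the
difference between random repositioning (option 1) and sequential advance
(option 2).

import os
import random
import threading

DEV = "/dev/md1"          # placeholder target device
IO_SIZE = 1 << 20         # 1 MiB per read, as in the tests above
N_WORKERS = 8             # stands in for the "N descriptors"
READS_PER_WORKER = 100

def worker(fd, dev_size, random_mode, totals):
    pos = random.randrange(0, dev_size - IO_SIZE)
    done = 0
    for _ in range(READS_PER_WORKER):
        done += len(os.pread(fd, IO_SIZE, pos))
        if random_mode:                       # option 1: non-contiguous
            pos = random.randrange(0, dev_size - IO_SIZE)
        else:                                 # option 2: contiguous
            pos += IO_SIZE
            if pos + IO_SIZE > dev_size:
                pos = 0
    totals.append(done)

def run(random_mode):
    fd = os.open(DEV, os.O_RDONLY)
    dev_size = os.lseek(fd, 0, os.SEEK_END)
    totals, threads = [], []
    for _ in range(N_WORKERS):
        t = threading.Thread(target=worker,
                             args=(fd, dev_size, random_mode, totals))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    os.close(fd)
    return sum(totals)

Timing run(True) against run(False) reproduces the option 1 versus option 2
comparison, although absolute numbers from buffered reads will not match an
AIO plus O_DIRECT engine.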
* Re: raid5 performance question
  2006-03-07  8:40 ` Raz Ben-Jehuda(caro)
@ 2006-03-07 23:03 ` Neil Brown
  2006-03-22 13:22 ` Bill Davidsen
  0 siblings, 1 reply; 43+ messages in thread
From: Neil Brown @ 2006-03-07 23:03 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List

On Tuesday March 7, raziebe@gmail.com wrote:
> Neil,
> what is the stripe_cache exactly?

In order to ensure correctness of data, all IO operations on a raid5
pass through the 'stripe cache'.  This is a cache of stripes where each
stripe is one page wide across all devices.

e.g. to write a block, we allocate one stripe in the cache to cover
that block, pre-read anything that might be needed, copy in the new
data and update the parity, and write out anything that has changed.

Similarly, to read, we allocate a stripe to cover the block, read in
the required parts, and copy out of the stripe cache into the
destination.

Requiring all reads to pass through the stripe cache is not strictly
necessary, but it keeps the code a lot easier to manage (fewer special
cases).  Bypassing the cache for simple read requests when the array
is non-degraded is on my list....

> First, here are some numbers:
>
> Setting it to 1024 gives me 85 MB/s.
> Setting it to 4096 gives me 105 MB/s.
> Setting it to 8192 gives me 115 MB/s.

Not surprisingly, a larger cache gives better throughput as it allows
more parallelism.  There is probably a link between optimal cache size
and chunk size.

> md.txt does not say much about it, only that it is the number of
> entries.

No.  I should fix that.

> Here are some tests I have made:
>
> Test 1:
> When I set the stripe_cache to zero and run

Setting it to zero is a no-op.  Only values from 17 to 32768 are
permitted.

> "dd if=/dev/md1 of=/dev/zero bs=1M count=100000 skip=630000"
> I am getting 120 MB/s.
> When I set the stripe cache to 4096 and issue the same command I am
> getting 120 MB/s as well.

This sort of operation will cause the kernel's read-ahead to keep the
drives reading constantly.  Provided the stripe cache is large enough
to hold 2 full chunk-sized stripes, you should get very good
throughput.

> Test 2:
> It opens N descriptors over a device.
> It issues N IOs to the target and waits for the completion of each IO.
> When an IO completes the tester has two choices:
>
> 1. Calculate a new random seek position over the target.
> 2. Move sequentially to the next position; that is, if one reads a 1MB
>    buffer, the next position is current+1M.
>
> I am using direct IO and asynchronous IO.
>
> Option 1 simulates non-contiguous files; option 2 simulates contiguous
> files.  The numbers above were made with option 2.
> If I use option 1 I am getting 95 MB/s with stripe_cache_size=4096.
>
> A single disk in this manner (option 1) gives ~28 MB/s.
> A single disk in scenario 2 gives ~30 MB/s.
>
> I understand that the IO distribution is something to talk about, but
> I am submitting 250 IOs, so the load on the raid should be heavy.
>
> Questions:
> 1. How can the stripe cache give me a boost when I have totally random
>    access to the disk?

It doesn't give you a boost exactly.  It is just that a small cache can
get in your way by reducing the possible parallelism.

> 2. Does direct IO pass through this cache?

Yes.  Everything does.

> 3. How can a dd of 1MB over a 1MB chunk size achieve the high
>    throughput of 4 disks even if it does not get the stripe cache
>    benefits?

Read-ahead performed by the kernel.

NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread
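A little arithmetic connects the numbers in this exchange: with 4 KiB pages, a
1 MiB chunk spans 256 cache entries per device, so the default
stripe_cache_size of 256 holds exactly one chunk-deep stripe, while the larger
values tried above hold several. The helper below is only that arithmetic,
derived from the description of the cache; it is not code from the md driver.

PAGE_SIZE = 4096

def chunk_stripes_cached(stripe_cache_size, chunk_size):
    # Each stripe-cache entry covers one page per device, so a chunk column
    # occupies chunk_size / PAGE_SIZE entries.
    pages_per_chunk = chunk_size // PAGE_SIZE
    return stripe_cache_size // pages_per_chunk

for entries in (256, 1024, 4096, 8192):
    print(entries, "->", chunk_stripes_cached(entries, 1 << 20),
          "chunk-deep stripe(s) with a 1 MiB chunk")
# prints: 256 -> 1, 1024 -> 4, 4096 -> 16, 8192 -> 32

More cached chunk-deep stripes means more 1 MiB requests can be serviced
concurrently, which is consistent with the throughput trend reported above.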
* Re: raid5 performance question
  2006-03-07 23:03 ` Neil Brown
@ 2006-03-22 13:22 ` Bill Davidsen
  2006-03-24  4:40 ` Neil Brown
  0 siblings, 1 reply; 43+ messages in thread
From: Bill Davidsen @ 2006-03-22 13:22 UTC (permalink / raw)
To: Neil Brown; +Cc: Raz Ben-Jehuda(caro), Linux RAID Mailing List

Neil Brown wrote:
> On Tuesday March 7, raziebe@gmail.com wrote:
> > Neil,
> > what is the stripe_cache exactly?
>
> In order to ensure correctness of data, all IO operations on a raid5
> pass through the 'stripe cache'.  This is a cache of stripes where
> each stripe is one page wide across all devices.
>
> e.g. to write a block, we allocate one stripe in the cache to cover
> that block, pre-read anything that might be needed, copy in the new
> data and update the parity, and write out anything that has changed.

I can see that you would have to read the old data and parity blocks
for RAID-5; I assume that's what you mean by "might be needed", and not
a read of every drive to get the data to rebuild the parity from
scratch.  That would be not only slower, but would require complex
error recovery on an error reading unneeded data.

> Similarly, to read, we allocate a stripe to cover the block, read in
> the required parts, and copy out of the stripe cache into the
> destination.
>
> Requiring all reads to pass through the stripe cache is not strictly
> necessary, but it keeps the code a lot easier to manage (fewer special
> cases).  Bypassing the cache for simple read requests when the array
> is non-degraded is on my list....

It sounds as if you do a memory copy with each read, even when a read
straight into the user buffer would be possible.  Hopefully I'm reading
that wrong.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: raid5 performance question
  2006-03-22 13:22 ` Bill Davidsen
@ 2006-03-24  4:40 ` Neil Brown
  0 siblings, 0 replies; 43+ messages in thread
From: Neil Brown @ 2006-03-24 4:40 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Raz Ben-Jehuda(caro), Linux RAID Mailing List

On Wednesday March 22, davidsen@tmr.com wrote:
> Neil Brown wrote:
> > In order to ensure correctness of data, all IO operations on a raid5
> > pass through the 'stripe cache'.  This is a cache of stripes where
> > each stripe is one page wide across all devices.
> >
> > e.g. to write a block, we allocate one stripe in the cache to cover
> > that block, pre-read anything that might be needed, copy in the new
> > data and update the parity, and write out anything that has changed.
>
> I can see that you would have to read the old data and parity blocks
> for RAID-5; I assume that's what you mean by "might be needed", and
> not a read of every drive to get the data to rebuild the parity from
> scratch.  That would be not only slower, but would require complex
> error recovery on an error reading unneeded data.

"Might be needed" because sometimes raid5 reads the old copies of the
blocks it is about to over-write, and sometimes it instead reads all
the blocks that it is NOT going to over-write.  And if it is
over-writing every block in the stripe, it doesn't need to read
anything.

> > Similarly, to read, we allocate a stripe to cover the block, read in
> > the required parts, and copy out of the stripe cache into the
> > destination.
> >
> > Requiring all reads to pass through the stripe cache is not strictly
> > necessary, but it keeps the code a lot easier to manage (fewer
> > special cases).  Bypassing the cache for simple read requests when
> > the array is non-degraded is on my list....
>
> It sounds as if you do a memory copy with each read, even when a read
> straight into the user buffer would be possible.  Hopefully I'm
> reading that wrong.

Unfortunately you are reading it correctly.

NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread
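The write-path distinction Neil sketches (read-modify-write versus
reconstruct-write versus a full-stripe write) boils down to comparing how many
pre-reads each path needs. The function below is a schematic of that
comparison only; the real raid5 code weighs other factors (degraded arrays,
blocks already cached, and so on), so the tie-breaking and the names here are
illustrative, not the kernel's actual heuristic.

# Schematic of the write-path choice described above (single-parity raid5).
# n_data = data blocks per stripe (disks - 1); k = blocks being overwritten.

def raid5_write_plan(n_data, k):
    if k == n_data:
        return "full-stripe write: read nothing, compute parity from new data"
    rmw_reads = k + 1          # old copies of the k blocks, plus the old parity
    rcw_reads = n_data - k     # the blocks that are NOT being overwritten
    if rmw_reads <= rcw_reads:
        return "read-modify-write: pre-read %d block(s)" % rmw_reads
    return "reconstruct-write: pre-read %d block(s)" % rcw_reads

for k in range(1, 4):
    print("%d of 3 data blocks -> %s" % (k, raid5_write_plan(3, k)))

With 3 data disks, overwriting a single chunk pre-reads two blocks either way,
while a full-stripe write avoids pre-reads entirely, which is one reason large
sequential writes behave so much better on raid5 than small scattered ones.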
* Re: raid5 performance question
  2006-03-06 22:24 ` Neil Brown
  2006-03-07  8:40 ` Raz Ben-Jehuda(caro)
@ 2006-03-08  6:45 ` thunder7
  1 sibling, 0 replies; 43+ messages in thread
From: thunder7 @ 2006-03-08 6:45 UTC (permalink / raw)
To: Linux RAID Mailing List

From: Neil Brown <neilb@suse.de>
Date: Tue, Mar 07, 2006 at 09:24:26AM +1100
> You can also try increasing the size of the stripe cache in
> /sys/block/mdX/md/stripe_cache_size
>
> The units are pages (normally 4K) per device.  The default is 256,
> which fits only one stripe with a 1 MB chunk size.
>
> Try 1024?
>
Interesting. I noticed I don't have such a file for my raid6 device.
Can you explain why? I thought raid6 and raid5 worked a lot like each
other.

Thanks,
Jurriaan
--
So I must behave with haste and dishonor to serve the political
ambitions of men that I never vowed, before the Lord, to serve and
honor.  I must be pretty and delicate and happy and perfect; I must
accord his passing no mourning, and no loss.
	Michelle West - The Broken Crown
Debian (Unstable) GNU/Linux 2.6.16-rc5-mm2 5503 bogomips load 0.06

^ permalink raw reply	[flat|nested] 43+ messages in thread