* raid5 performance question
@ 2006-03-06 11:46 Raz Ben-Jehuda(caro)
2006-03-06 11:59 ` Gordon Henderson
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-03-06 11:46 UTC (permalink / raw)
To: Linux RAID Mailing List; +Cc: Neil Brown
Neil Hello .
I have a performance question.
I am using raid5 stripe size 1024K over 4 disks.
I am benchmarking it with an asynchronous tester.
This tester submits 100 IOs of size of 1024 K --> as the stripe size.
It reads raw io from the device, no file system is involved.
I am making the following comparsion:
1. Reading 4 disks at the same time using 1 MB buffer in random manner.
2. Reading 1 raid5 device using 1MB buffer in random manner.
I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from
4 disks, the raid5 device gives 35 MB/s .
it is like i am reading a single disk , but by looking at iostat i can
see that all
disks are active but with low throughput.
Any idea ?
Thank you.
--
Raz
^ permalink raw reply [flat|nested] 43+ messages in thread* Re: raid5 performance question 2006-03-06 11:46 raid5 performance question Raz Ben-Jehuda(caro) @ 2006-03-06 11:59 ` Gordon Henderson 2006-03-06 12:56 ` Raz Ben-Jehuda(caro) 2006-03-06 22:17 ` raid5 performance question Guy 2006-03-06 22:24 ` Neil Brown 2 siblings, 1 reply; 43+ messages in thread From: Gordon Henderson @ 2006-03-06 11:59 UTC (permalink / raw) To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List On Mon, 6 Mar 2006, Raz Ben-Jehuda(caro) wrote: > Neil Hello . > I have a performance question. > > I am using raid5 stripe size 1024K over 4 disks. > I am benchmarking it with an asynchronous tester. > This tester submits 100 IOs of size of 1024 K --> as the stripe size. > It reads raw io from the device, no file system is involved. > > I am making the following comparsion: > > 1. Reading 4 disks at the same time using 1 MB buffer in random manner. > 2. Reading 1 raid5 device using 1MB buffer in random manner. > > I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from > 4 disks, the raid5 device gives 35 MB/s . > it is like i am reading a single disk , but by looking at iostat i can > see that all > disks are active but with low throughput. > > Any idea ? Is this reading the block device direct, or via a filesystem? If the latter, what filesystem? If ext2/3 have you tried mkfs with a stride option? See: http://www.tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.11 Gordon ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 performance question 2006-03-06 11:59 ` Gordon Henderson @ 2006-03-06 12:56 ` Raz Ben-Jehuda(caro) 2006-03-13 22:36 ` raid5 hang on get_active_stripe Patrik Jonsson 0 siblings, 1 reply; 43+ messages in thread From: Raz Ben-Jehuda(caro) @ 2006-03-06 12:56 UTC (permalink / raw) To: Gordon Henderson; +Cc: Linux RAID Mailing List it reads raw. no filesystem whatsover. On 3/6/06, Gordon Henderson <gordon@drogon.net> wrote: > On Mon, 6 Mar 2006, Raz Ben-Jehuda(caro) wrote: > > > Neil Hello . > > I have a performance question. > > > > I am using raid5 stripe size 1024K over 4 disks. > > I am benchmarking it with an asynchronous tester. > > This tester submits 100 IOs of size of 1024 K --> as the stripe size. > > It reads raw io from the device, no file system is involved. > > > > I am making the following comparsion: > > > > 1. Reading 4 disks at the same time using 1 MB buffer in random manner. > > 2. Reading 1 raid5 device using 1MB buffer in random manner. > > > > I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from > > 4 disks, the raid5 device gives 35 MB/s . > > it is like i am reading a single disk , but by looking at iostat i can > > see that all > > disks are active but with low throughput. > > > > Any idea ? > > Is this reading the block device direct, or via a filesystem? If the > latter, what filesystem? > > If ext2/3 have you tried mkfs with a stride option? > > See: > http://www.tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.11 > > Gordon > -- Raz ^ permalink raw reply [flat|nested] 43+ messages in thread
* raid5 hang on get_active_stripe 2006-03-06 12:56 ` Raz Ben-Jehuda(caro) @ 2006-03-13 22:36 ` Patrik Jonsson 2006-03-13 23:17 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: Patrik Jonsson @ 2006-03-13 22:36 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 577 bytes --] Hi all, I just experienced some kind of lockup accessing my 8-drive raid5 (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but now processes that try to read the md device hang. ps tells me they are all sleeping in get_active_stripe. There is nothing in the syslog, and I can read from the individual drives fine with dd. mdadm says the state is "active". I'm not sure what to do now. Is it safe to try to reboot the system or could that cause the device to get corrupted if it's hung in the middle of some important operation? Ideas appreciated, /Patrik [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 254 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-03-13 22:36 ` raid5 hang on get_active_stripe Patrik Jonsson @ 2006-03-13 23:17 ` Neil Brown 2006-05-11 15:13 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-03-13 23:17 UTC (permalink / raw) To: Patrik Jonsson; +Cc: linux-raid On Monday March 13, patrik@ucolick.org wrote: > Hi all, > > I just experienced some kind of lockup accessing my 8-drive raid5 > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but > now processes that try to read the md device hang. ps tells me they are > all sleeping in get_active_stripe. There is nothing in the syslog, and I > can read from the individual drives fine with dd. mdadm says the state > is "active". Hmmm... That's sad. That's going to be very hard to track down. If you could echo t > /proc/sysrq-trigger and send me the dump that appears in the kernel log, I would appreciate it. I doubt it will be very helpful, but it is the best bet I can come up with. > > I'm not sure what to do now. Is it safe to try to reboot the system or > could that cause the device to get corrupted if it's hung in the middle > of some important operation? You could try increasing the size of the stripe cache echo 512 > /sys/block/mdX/md/stripe_cache_size (choose and appropriate 'X'). Maybe check the content of /sys/block/mdX/md/stripe_cache_active as well. Other than that, just reboot. The raid5 will do a resync, but the data should be fine. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-03-13 23:17 ` Neil Brown @ 2006-05-11 15:13 ` dean gaudet 2006-05-17 18:41 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-11 15:13 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 1927 bytes --] On Tue, 14 Mar 2006, Neil Brown wrote: > On Monday March 13, patrik@ucolick.org wrote: > > Hi all, > > > > I just experienced some kind of lockup accessing my 8-drive raid5 > > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but > > now processes that try to read the md device hang. ps tells me they are > > all sleeping in get_active_stripe. There is nothing in the syslog, and I > > can read from the individual drives fine with dd. mdadm says the state > > is "active". > > Hmmm... That's sad. That's going to be very hard to track down. > > If you could > echo t > /proc/sysrq-trigger > > and send me the dump that appears in the kernel log, I would > appreciate it. I doubt it will be very helpful, but it is the best > bet I can come up with. i seem to be running into this as well... it has happenned several times in the past three weeks. i attached the kernel log output... it's a debian 2.6.16 kernel, which is based mostly on 2.6.16.10. md4 : active raid5 sdd1[0] sde1[5](S) sdh1[4] sdg1[3] sdf1[2] sdc1[1] 1562834944 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU] bitmap: 3/187 pages [12KB], 1024KB chunk those drives are on 3w-xxxx (7508 controller). i'm using lvm2 and xfs as the filesystem (although i'm pretty sure an ext3 fs on another lv is hanging too -- but i forgot to check before i unwedged it). let me know if anything else is useful and i can try to catch it next time. > You could try increasing the size of the stripe cache > echo 512 > /sys/block/mdX/md/stripe_cache_size > (choose and appropriate 'X'). yeah that got things going again -- it took a minute or so maybe, i wasn't paying attention as to how fast things cleared up. > Maybe check the content of > /sys/block/mdX/md/stripe_cache_active > as well. next time i'll check this before i increase stripe_cache_size... it's 0 now, but the raid5 is working again... -dean [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 7944 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-11 15:13 ` dean gaudet @ 2006-05-17 18:41 ` dean gaudet 2006-05-19 3:46 ` Neil Brown 2006-05-23 2:33 ` Neil Brown 0 siblings, 2 replies; 43+ messages in thread From: dean gaudet @ 2006-05-17 18:41 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Thu, 11 May 2006, dean gaudet wrote: > On Tue, 14 Mar 2006, Neil Brown wrote: > > > On Monday March 13, patrik@ucolick.org wrote: > > > I just experienced some kind of lockup accessing my 8-drive raid5 > > > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but > > > now processes that try to read the md device hang. ps tells me they are > > > all sleeping in get_active_stripe. There is nothing in the syslog, and I > > > can read from the individual drives fine with dd. mdadm says the state > > > is "active". ... > > i seem to be running into this as well... it has happenned several times > in the past three weeks. i attached the kernel log output... it happenned again... same system as before... > > You could try increasing the size of the stripe cache > > echo 512 > /sys/block/mdX/md/stripe_cache_size > > (choose and appropriate 'X'). > > yeah that got things going again -- it took a minute or so maybe, i > wasn't paying attention as to how fast things cleared up. i tried 768 this time and it wasn't enough... 1024 did it again... > > > Maybe check the content of > > /sys/block/mdX/md/stripe_cache_active > > as well. > > next time i'll check this before i increase stripe_cache_size... it's > 0 now, but the raid5 is working again... here's a sequence of things i did... not sure if it helps: # cat /sys/block/md4/md/stripe_cache_active 435 # cat /sys/block/md4/md/stripe_cache_size 512 # echo 768 >/sys/block/md4/md/stripe_cache_size # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # cat /sys/block/md4/md/stripe_cache_active 752 # echo 1024 >/sys/block/md4/md/stripe_cache_size # cat /sys/block/md4/md/stripe_cache_active 927 # cat /sys/block/md4/md/stripe_cache_active 151 # cat /sys/block/md4/md/stripe_cache_active 66 # cat /sys/block/md4/md/stripe_cache_active 2 # cat /sys/block/md4/md/stripe_cache_active 1 # cat /sys/block/md4/md/stripe_cache_active 0 # cat /sys/block/md4/md/stripe_cache_active 3 and it's OK again... except i'm going to lower the stripe_cache_size to 256 again because i'm not sure i want to keep having to double it each freeze :) let me know if you want the task dump output from this one too. -dean ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-17 18:41 ` dean gaudet @ 2006-05-19 3:46 ` Neil Brown 2006-05-23 2:33 ` Neil Brown 1 sibling, 0 replies; 43+ messages in thread From: Neil Brown @ 2006-05-19 3:46 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Wednesday May 17, dean@arctic.org wrote: > > let me know if you want the task dump output from this one too. > No thanks - I doubt it will containing anything helpful. I'll try to put some serious time into this next week - as soon as I get mdadm 2.5 out. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-17 18:41 ` dean gaudet 2006-05-19 3:46 ` Neil Brown @ 2006-05-23 2:33 ` Neil Brown 2006-05-26 16:07 ` dean gaudet 1 sibling, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-23 2:33 UTC (permalink / raw) To: dean gaudet; +Cc: Neil Brown, Patrik Jonsson, linux-raid [-- Attachment #1: message body text --] [-- Type: text/plain, Size: 1367 bytes --] On Wednesday May 17, dean@arctic.org wrote: > On Thu, 11 May 2006, dean gaudet wrote: > > > On Tue, 14 Mar 2006, Neil Brown wrote: > > > > > On Monday March 13, patrik@ucolick.org wrote: > > > > I just experienced some kind of lockup accessing my 8-drive raid5 > > > > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but > > > > now processes that try to read the md device hang. ps tells me they are > > > > all sleeping in get_active_stripe. There is nothing in the syslog, and I > > > > can read from the individual drives fine with dd. mdadm says the state > > > > is "active". > ... > > > > i seem to be running into this as well... it has happenned several times > > in the past three weeks. i attached the kernel log output... > > it happenned again... same system as before... > I've spent all morning looking at this and while I cannot see what is happening I did find a couple of small bugs, so that is good... I've attached three patches. The first fix two small bugs (I think). The last adds some extra information to /sys/block/mdX/md/stripe_cache_active They are against 2.6.16.11. If you could apply them and if the problem recurs, report the content of stripe_cache_active several times before and after changing it, just like you did last time, that might help throw some light on the situation. Thanks, NeilBrown [-- Attachment #2: 002MdR5PlugProblem --] [-- Type: text/plain, Size: 2331 bytes --] Status: ok Fix a plug/unplug race in raid5 When a device is unplugged, requests are moved from one or two (depending on whether a bitmap is in use) queues to the main request queue. So whenever requests are put on either of those queues, we should make sure the raid5 array is 'plugged'. However we don't. We currently plug the raid5 queue just before putting requests on queues, so there is room for a race. If something unplugs the queue at just the wrong time, requests will be left on the queue and nothing will want to unplug them. Normally something else will plug and unplug the queue fairly soon, but there is a risk that nothing will. 
Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid5.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-23 12:27:58.000000000 +1000 +++ ./drivers/md/raid5.c 2006-05-23 12:28:26.000000000 +1000 @@ -77,12 +77,14 @@ static void __release_stripe(raid5_conf_ if (atomic_read(&conf->active_stripes)==0) BUG(); if (test_bit(STRIPE_HANDLE, &sh->state)) { - if (test_bit(STRIPE_DELAYED, &sh->state)) + if (test_bit(STRIPE_DELAYED, &sh->state)) { list_add_tail(&sh->lru, &conf->delayed_list); - else if (test_bit(STRIPE_BIT_DELAY, &sh->state) && - conf->seq_write == sh->bm_seq) + blk_plug_device(conf->mddev->queue); + } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) && + conf->seq_write == sh->bm_seq) { list_add_tail(&sh->lru, &conf->bitmap_list); - else { + blk_plug_device(conf->mddev->queue); + } else { clear_bit(STRIPE_BIT_DELAY, &sh->state); list_add_tail(&sh->lru, &conf->handle_list); } @@ -1519,13 +1521,6 @@ static int raid5_issue_flush(request_que return ret; } -static inline void raid5_plug_device(raid5_conf_t *conf) -{ - spin_lock_irq(&conf->device_lock); - blk_plug_device(conf->mddev->queue); - spin_unlock_irq(&conf->device_lock); -} - static int make_request (request_queue_t *q, struct bio * bi) { mddev_t *mddev = q->queuedata; @@ -1577,7 +1572,6 @@ static int make_request (request_queue_t goto retry; } finish_wait(&conf->wait_for_overlap, &w); - raid5_plug_device(conf); handle_stripe(sh); release_stripe(sh); [-- Attachment #3: 003MdR5BitmapPlug --] [-- Type: text/plain, Size: 2966 bytes --] Status: ok Fix some small races in bitmap plugging in raid5. The comment gives more details, but I didn't quite have the sequencing write, so there was room for races to leave bits unset in the on-disk bitmap for short periods of time. Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid5.c | 30 +++++++++++++++++++++++++++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-23 12:28:26.000000000 +1000 +++ ./drivers/md/raid5.c 2006-05-23 12:28:53.000000000 +1000 @@ -15,6 +15,30 @@ * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ +/* + * BITMAP UNPLUGGING: + * + * The sequencing for updating the bitmap reliably is a little + * subtle (and I got it wrong the first time) so it deserves some + * explanation. + * + * We group bitmap updates into batches. Each batch has a number. + * We may write out several batches at once, but that isn't very important. + * conf->bm_write is the number of the last batch successfully written. + * conf->bm_flush is the number of the last batch that was closed to + * new additions. + * When we discover that we will need to write to any block in a stripe + * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq + * the number of the batch it will be in. This is bm_flush+1. + * When we are ready to do a write, if that batch hasn't been written yet, + * we plug the array and queue the stripe for later. + * When an unplug happens, we increment bm_flush, thus closing the current + * batch. + * When we notice that bm_flush > bm_write, we write out all pending updates + * to the bitmap, and advance bm_write to where bm_flush was. + * This may occasionally write a bit out twice, but is sure never to + * miss any bits. 
+ */ #include <linux/config.h> #include <linux/module.h> @@ -81,7 +105,7 @@ static void __release_stripe(raid5_conf_ list_add_tail(&sh->lru, &conf->delayed_list); blk_plug_device(conf->mddev->queue); } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) && - conf->seq_write == sh->bm_seq) { + sh->bm_seq - conf->seq_write > 0) { list_add_tail(&sh->lru, &conf->bitmap_list); blk_plug_device(conf->mddev->queue); } else { @@ -884,9 +908,9 @@ static int add_stripe_bio(struct stripe_ (unsigned long long)sh->sector, dd_idx); if (conf->mddev->bitmap && firstwrite) { - sh->bm_seq = conf->seq_write; bitmap_startwrite(conf->mddev->bitmap, sh->sector, STRIPE_SECTORS, 0); + sh->bm_seq = conf->seq_flush+1; set_bit(STRIPE_BIT_DELAY, &sh->state); } @@ -1692,7 +1716,7 @@ static void raid5d (mddev_t *mddev) while (1) { struct list_head *first; - if (conf->seq_flush - conf->seq_write > 0) { + if (conf->seq_flush != conf->seq_write) { int seq = conf->seq_flush; spin_unlock_irq(&conf->device_lock); bitmap_unplug(mddev->bitmap); [-- Attachment #4: 004MdR5Debug --] [-- Type: text/plain, Size: 1153 bytes --] Status: ok Add some debugging to track raid5 hang problem Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid5.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-23 12:28:53.000000000 +1000 +++ ./drivers/md/raid5.c 2006-05-23 12:29:05.000000000 +1000 @@ -1807,9 +1807,23 @@ static ssize_t stripe_cache_active_show(mddev_t *mddev, char *page) { raid5_conf_t *conf = mddev_to_conf(mddev); - if (conf) - return sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); - else + if (conf) { + int n; + int c1, c2; + struct list_head *l; + n = sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); + n += sprintf(page+n, "%d preread\n", atomic_read(&conf->preread_active_stripes)); + spin_lock_irq(&conf->device_lock); + c1=0; + list_for_each(l, &conf->bitmap_list) + c1++; + c2=0; + list_for_each(l, &conf->delayed_list) + c2++; + spin_unlock_irq(&conf->device_lock); + n += sprintf(page+n, "bitlist=%d delaylist=%d\n", c1, c2); + return n; + } else return 0; } ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-23 2:33 ` Neil Brown @ 2006-05-26 16:07 ` dean gaudet 2006-05-26 23:55 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-26 16:07 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Tue, 23 May 2006, Neil Brown wrote: > I've spent all morning looking at this and while I cannot see what is > happening I did find a couple of small bugs, so that is good... > > I've attached three patches. The first fix two small bugs (I think). > The last adds some extra information to > /sys/block/mdX/md/stripe_cache_active > > They are against 2.6.16.11. > > If you could apply them and if the problem recurs, report the content > of stripe_cache_active several times before and after changing it, > just like you did last time, that might help throw some light on the > situation. i applied them against 2.6.16.18 and two days later i got my first hang... below is the stripe_cache foo. thanks -dean neemlark:~# cd /sys/block/md4/md/ neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_size 256 neemlark:/sys/block/md4/md# echo 512 >stripe_cache_size neemlark:/sys/block/md4/md# cat stripe_cache_active 474 187 preread bitlist=0 delaylist=222 neemlark:/sys/block/md4/md# cat stripe_cache_active 438 222 preread bitlist=0 delaylist=72 neemlark:/sys/block/md4/md# cat stripe_cache_active 438 222 preread bitlist=0 delaylist=72 neemlark:/sys/block/md4/md# cat stripe_cache_active 469 222 preread bitlist=0 delaylist=72 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 72 preread bitlist=160 delaylist=103 neemlark:/sys/block/md4/md# cat stripe_cache_active 1 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 2 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 0 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 2 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# md4 : active raid5 sdd1[0] sde1[5](S) sdh1[4] sdg1[3] sdf1[2] sdc1[1] 1562834944 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU] bitmap: 10/187 pages [40KB], 1024KB chunk ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-26 16:07 ` dean gaudet @ 2006-05-26 23:55 ` Neil Brown 2006-05-27 0:28 ` dean gaudet 2006-05-27 15:42 ` dean gaudet 0 siblings, 2 replies; 43+ messages in thread From: Neil Brown @ 2006-05-26 23:55 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Friday May 26, dean@arctic.org wrote: > On Tue, 23 May 2006, Neil Brown wrote: > > i applied them against 2.6.16.18 and two days later i got my first hang... > below is the stripe_cache foo. > > thanks > -dean > > neemlark:~# cd /sys/block/md4/md/ > neemlark:/sys/block/md4/md# cat stripe_cache_active > 255 > 0 preread > bitlist=0 delaylist=255 > neemlark:/sys/block/md4/md# cat stripe_cache_active > 255 > 0 preread > bitlist=0 delaylist=255 > neemlark:/sys/block/md4/md# cat stripe_cache_active > 255 > 0 preread > bitlist=0 delaylist=255 Thanks. This narrows it down quite a bit... too much infact: I can now say for sure that this cannot possible happen :-) Two things that might be helpful: 1/ Do you have any other patches on 2.6.16.18 other than the 3 I sent you? If you do I'd like to see them, just in case. 2/ The message.gz you sent earlier with the echo t > /proc/sysrq-trigger trace in it didn't contain information about md4_raid5 - the controlling thread for that array. It must have missed out due to a buffer overflowing. Next time it happens, could you to get this trace again and see if you can find out what what md4_raid5 is going. Maybe do the 'echo t' several times. I think that you need a kernel recompile to make the dmesg buffer larger. Thanks for your patience - this must be very frustrating for you. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-26 23:55 ` Neil Brown @ 2006-05-27 0:28 ` dean gaudet 2006-05-27 15:42 ` dean gaudet 1 sibling, 0 replies; 43+ messages in thread From: dean gaudet @ 2006-05-27 0:28 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 2877 bytes --] On Sat, 27 May 2006, Neil Brown wrote: > On Friday May 26, dean@arctic.org wrote: > > On Tue, 23 May 2006, Neil Brown wrote: > > > > i applied them against 2.6.16.18 and two days later i got my first hang... > > below is the stripe_cache foo. > > > > thanks > > -dean > > > > neemlark:~# cd /sys/block/md4/md/ > > neemlark:/sys/block/md4/md# cat stripe_cache_active > > 255 > > 0 preread > > bitlist=0 delaylist=255 > > neemlark:/sys/block/md4/md# cat stripe_cache_active > > 255 > > 0 preread > > bitlist=0 delaylist=255 > > neemlark:/sys/block/md4/md# cat stripe_cache_active > > 255 > > 0 preread > > bitlist=0 delaylist=255 > > Thanks. This narrows it down quite a bit... too much infact: I can > now say for sure that this cannot possible happen :-) heheh. fwiw the box has traditionally been rock solid.. it's ancient though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 w/seagate 400GB disks... i really don't suspect the hardware all that much because the freeze seems to be rather consistent as to time of day (overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb going). unfortunately it doesn't happen every time... but every time i've unstuck the box i've noticed those processes going. other tidbits... md4 is a lvm2 PV ... there are two LVs, one with ext3 and one with xfs. > Two things that might be helpful: > 1/ Do you have any other patches on 2.6.16.18 other than the 3 I > sent you? If you do I'd like to see them, just in case. it was just 2.6.16.18 plus the 3 you sent... i attached the .config (it's rather full -- based off debian kernel .config). maybe there's a compiler bug: gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3) > 2/ The message.gz you sent earlier with the > echo t > /proc/sysrq-trigger > trace in it didn't contain information about md4_raid5 - the > controlling thread for that array. It must have missed out > due to a buffer overflowing. Next time it happens, could you > to get this trace again and see if you can find out what > what md4_raid5 is going. Maybe do the 'echo t' several times. > I think that you need a kernel recompile to make the dmesg > buffer larger. ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ... note that i'm going to include two more patches in this next kernel: http://lkml.org/lkml/2006/5/23/42 http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch the first was the Jens Axboe patch you mentioned here recently (for accounting with i/o barriers)... and the second gets rid of the tcp treason uncloaked messages. > Thanks for your patience - this must be very frustrating for you. fortunately i'm the primary user of this box... and the bug doesn't corrupt anything... and i can unstick it easily :) so it's not all that frustrating actually. -dean [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 17866 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-26 23:55 ` Neil Brown 2006-05-27 0:28 ` dean gaudet @ 2006-05-27 15:42 ` dean gaudet 2006-05-28 12:04 ` Neil Brown 1 sibling, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-27 15:42 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 2878 bytes --] On Sat, 27 May 2006, Neil Brown wrote: > Thanks. This narrows it down quite a bit... too much infact: I can > now say for sure that this cannot possible happen :-) > > 2/ The message.gz you sent earlier with the > echo t > /proc/sysrq-trigger > trace in it didn't contain information about md4_raid5 - the got another hang again this morning... full dmesg output attached. -dean neemlark:/sys/block/md4/md# cat stripe_cache_active 248 0 preread bitlist=0 delaylist=248 neemlark:/sys/block/md4/md# cat stripe_cache_active 248 0 preread bitlist=0 delaylist=248 neemlark:/sys/block/md4/md# cat stripe_cache_active 248 0 preread bitlist=0 delaylist=248 neemlark:/sys/block/md4/md# cat stripe_cache_size 256 neemlark:/sys/block/md4/md# echo 512 >!$ echo 512 >stripe_cache_size neemlark:/sys/block/md4/md# cat stripe_cache_active 511 254 preread bitlist=0 delaylist=199 neemlark:/sys/block/md4/md# cat stripe_cache_active 511 148 preread bitlist=0 delaylist=199 neemlark:/sys/block/md4/md# cat stripe_cache_active 435 95 preread bitlist=0 delaylist=199 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 11 preread bitlist=0 delaylist=327 neemlark:/sys/block/md4/md# cat stripe_cache_active 511 11 preread bitlist=0 delaylist=327 neemlark:/sys/block/md4/md# cat stripe_cache_active 494 359 preread bitlist=0 delaylist=127 neemlark:/sys/block/md4/md# cat stripe_cache_active 191 67 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 272 preread bitlist=0 delaylist=175 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 32 preread bitlist=0 delaylist=317 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 398 preread bitlist=0 delaylist=114 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 398 preread bitlist=0 delaylist=114 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 398 preread bitlist=0 delaylist=114 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 17 preread bitlist=0 delaylist=265 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 17 preread bitlist=0 delaylist=265 neemlark:/sys/block/md4/md# cat stripe_cache_active 442 124 preread bitlist=0 delaylist=3 neemlark:/sys/block/md4/md# cat stripe_cache_active 127 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 154 preread bitlist=0 delaylist=235 neemlark:/sys/block/md4/md# cat stripe_cache_active 389 321 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 162 preread bitlist=0 delaylist=133 neemlark:/sys/block/md4/md# cat stripe_cache_active 385 24 preread bitlist=0 delaylist=142 neemlark:/sys/block/md4/md# cat stripe_cache_active 109 3 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 0 0 preread bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 0 0 preread bitlist=0 delaylist=0 [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 29294 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-27 15:42 ` dean gaudet @ 2006-05-28 12:04 ` Neil Brown 2006-05-29 20:56 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-28 12:04 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Saturday May 27, dean@arctic.org wrote: > On Sat, 27 May 2006, Neil Brown wrote: > > > Thanks. This narrows it down quite a bit... too much infact: I can > > now say for sure that this cannot possible happen :-) > > > > 2/ The message.gz you sent earlier with the > > echo t > /proc/sysrq-trigger > > trace in it didn't contain information about md4_raid5 - the > > got another hang again this morning... full dmesg output attached. > Thanks. Nothing surprising there, which maybe is a surprise itself... I'm still somewhat stumped by this. But given that it is nicely repeatable, I'm sure we can get there... The following patch adds some more tracing to raid5, and might fix a subtle bug in ll_rw_blk, though it is an incredible long shot that this could be affecting raid5 (if it is, I'll have to assume there is another bug somewhere). It certainly doesn't break ll_rw_blk. Whether it actually fixes something I'm not sure. If you could try with these on top of the previous patches I'd really appreciate it. When you read from ..../stripe_cache_active, it should trigger a (cryptic) kernel message within the next 15 seconds. If I could get the contents of that file and the kernel messages, that should help. Thanks heaps, NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./block/ll_rw_blk.c | 4 ++-- ./drivers/md/raid5.c | 18 ++++++++++++++++++ 2 files changed, 20 insertions(+), 2 deletions(-) diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c --- ./block/ll_rw_blk.c~current~ 2006-05-28 21:54:23.000000000 +1000 +++ ./block/ll_rw_blk.c 2006-05-28 21:55:17.000000000 +1000 @@ -874,7 +874,7 @@ static void __blk_queue_free_tags(reques } q->queue_tags = NULL; - q->queue_flags &= ~(1 << QUEUE_FLAG_QUEUED); + clear_bit(QUEUE_FLAG_QUEUED, &q->queue_flags); } /** @@ -963,7 +963,7 @@ int blk_queue_init_tags(request_queue_t * assign it, all done */ q->queue_tags = tags; - q->queue_flags |= (1 << QUEUE_FLAG_QUEUED); + set_bit(QUEUE_FLAG_QUEUED, &q->queue_flags); return 0; fail: kfree(tags); diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-27 09:17:10.000000000 +1000 +++ ./drivers/md/raid5.c 2006-05-28 21:56:56.000000000 +1000 @@ -1701,13 +1701,20 @@ static sector_t sync_request(mddev_t *md * During the scan, completed stripes are saved for us by the interrupt * handler, so that they will not have to wait for our next wakeup. 
*/ +static unsigned long trigger; + static void raid5d (mddev_t *mddev) { struct stripe_head *sh; raid5_conf_t *conf = mddev_to_conf(mddev); int handled; + int trace = 0; PRINTK("+++ raid5d active\n"); + if (test_and_clear_bit(0, &trigger)) + trace = 1; + if (trace) + printk("raid5d runs\n"); md_check_recovery(mddev); @@ -1725,6 +1732,13 @@ static void raid5d (mddev_t *mddev) activate_bit_delay(conf); } + if (trace) + printk(" le=%d, pas=%d, bqp=%d le=%d\n", + list_empty(&conf->handle_list), + atomic_read(&conf->preread_active_stripes), + blk_queue_plugged(mddev->queue), + list_empty(&conf->delayed_list)); + if (list_empty(&conf->handle_list) && atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD && !blk_queue_plugged(mddev->queue) && @@ -1756,6 +1770,8 @@ static void raid5d (mddev_t *mddev) unplug_slaves(mddev); PRINTK("--- raid5d inactive\n"); + if (trace) + printk("raid5d done\n"); } static ssize_t @@ -1813,6 +1829,7 @@ stripe_cache_active_show(mddev_t *mddev, struct list_head *l; n = sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); n += sprintf(page+n, "%d preread\n", atomic_read(&conf->preread_active_stripes)); + n += sprintf(page+n, "%splugged\n", blk_queue_plugged(mddev->queue)?"":"not "); spin_lock_irq(&conf->device_lock); c1=0; list_for_each(l, &conf->bitmap_list) @@ -1822,6 +1839,7 @@ stripe_cache_active_show(mddev_t *mddev, c2++; spin_unlock_irq(&conf->device_lock); n += sprintf(page+n, "bitlist=%d delaylist=%d\n", c1, c2); + trigger = 0xffff; return n; } else return 0; ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-28 12:04 ` Neil Brown @ 2006-05-29 20:56 ` dean gaudet 2006-05-29 23:51 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-29 20:56 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid [-- Attachment #1: Type: TEXT/PLAIN, Size: 5820 bytes --] On Sun, 28 May 2006, Neil Brown wrote: > The following patch adds some more tracing to raid5, and might fix a > subtle bug in ll_rw_blk, though it is an incredible long shot that > this could be affecting raid5 (if it is, I'll have to assume there is > another bug somewhere). It certainly doesn't break ll_rw_blk. > Whether it actually fixes something I'm not sure. > > If you could try with these on top of the previous patches I'd really > appreciate it. > > When you read from ..../stripe_cache_active, it should trigger a > (cryptic) kernel message within the next 15 seconds. If I could get > the contents of that file and the kernel messages, that should help. got the hang again... attached is the dmesg with the cryptic messages. i didn't think to grab the task dump this time though. hope there's a clue in this one :) but send me another patch if you need more data. -dean neemlark:/sys/block/md4/md# cat stripe_cache_size 256 neemlark:/sys/block/md4/md# cat stripe_cache_active 251 0 preread plugged bitlist=0 delaylist=251 neemlark:/sys/block/md4/md# cat stripe_cache_active 251 0 preread plugged bitlist=0 delaylist=251 neemlark:/sys/block/md4/md# echo 512 >stripe_cache_size neemlark:/sys/block/md4/md# cat stripe_cache_active 512 292 preread not plugged bitlist=0 delaylist=32 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 292 preread not plugged bitlist=0 delaylist=32 neemlark:/sys/block/md4/md# cat stripe_cache_active 445 0 preread not plugged bitlist=0 delaylist=73 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 413 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 13 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 493 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 487 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 405 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 1 preread not plugged bitlist=0 delaylist=28 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 84 preread not plugged bitlist=0 delaylist=69 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 69 preread not plugged bitlist=0 delaylist=56 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 41 preread not plugged bitlist=0 delaylist=38 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 10 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 453 3 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 14 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 477 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 476 0 preread not plugged bitlist=0 delaylist=0 
neemlark:/sys/block/md4/md# cat stripe_cache_active 486 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 384 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 387 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 462 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 448 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 501 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 476 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 416 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 386 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 512 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 434 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 406 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 447 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 501 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 448 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 416 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 416 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 380 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 1 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 480 0 preread not plugged bitlist=0 delaylist=0 neemlark:/sys/block/md4/md# cat stripe_cache_active 353 0 preread not plugged bitlist=0 delaylist=0 [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 6698 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-29 20:56 ` dean gaudet @ 2006-05-29 23:51 ` Neil Brown 2006-05-31 0:15 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-29 23:51 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Monday May 29, dean@arctic.org wrote: > > hope there's a clue in this one :) but send me another patch if you need > more data. Thanks. This confirms that the device is 'plugged' - which I knew had to be the case, but equally knew that it couldn't be the case :-) Whenever the device gets plugged a 3msec timer is set and when the timer fires, the device gets unplugged. So it cannot possibly stay plugged for more than 3 msecs. Yet obviously it does. I don't think the timer code can be going wrong, as it is very widely used and if there was a problem I'm sure it would have been noticed by now. Besides I've checked it and it looks good - but that doesn't seem to prove anything :-( Another possibility is another processor doing q->queue_flags |= (1 << some_flag); at the same time that the timer does clear_bit(queue_plugged, &q->queue_flags); That could cause the clearing of the bit to be lost. But I don't think that happens, certainly not after the last patch I gave you. I now realise I should have got that cryptic printk to print the result of timer_pending(&mddev->queue->unplug_timer); but I'm fairly sure it would have said '0' which would leave me equally in the dark. Maybe you have bad memory with one bit that doesn't stay set (or clear) properly, and that bit happen to always line up with the QUEUE_FLAG_PLUGGED bit for this array.... Ok, that's impossible too, especially as Patrik reported the same problem! (stares at the code lots more, goes down several blind alleys...) Well.... maybe..... There does seem to be a small hole in the chain that leads from a queue being plugged to it be unplugged again. I'm not convinced that the race can actually be lost, but obviously something fairly unbelievable is happening... Could you try this patch please? On top of the rest. And if it doesn't fail in a couple of days, tell me how regularly the message kblockd_schedule_work failed gets printed. Thanks, NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./block/ll_rw_blk.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c --- ./block/ll_rw_blk.c~current~ 2006-05-30 09:48:02.000000000 +1000 +++ ./block/ll_rw_blk.c 2006-05-30 09:48:48.000000000 +1000 @@ -1636,7 +1636,11 @@ static void blk_unplug_timeout(unsigned { request_queue_t *q = (request_queue_t *)data; - kblockd_schedule_work(&q->unplug_work); + if (!kblockd_schedule_work(&q->unplug_work)) { + /* failed to schedule the work, try again later */ + printk("kblockd_schedule_work failed\n"); + mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); + } } /** ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-29 23:51 ` Neil Brown @ 2006-05-31 0:15 ` dean gaudet 2006-05-31 0:46 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-31 0:15 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Tue, 30 May 2006, Neil Brown wrote: > Could you try this patch please? On top of the rest. > And if it doesn't fail in a couple of days, tell me how regularly the > message > kblockd_schedule_work failed > gets printed. i'm running this patch now ... and just after reboot, no freeze yet, i've already seen a handful of these: May 30 17:05:09 localhost kernel: kblockd_schedule_work failed May 30 17:05:59 localhost kernel: kblockd_schedule_work failed May 30 17:08:16 localhost kernel: kblockd_schedule_work failed May 30 17:10:51 localhost kernel: kblockd_schedule_work failed May 30 17:11:51 localhost kernel: kblockd_schedule_work failed May 30 17:12:46 localhost kernel: kblockd_schedule_work failed May 30 17:14:14 localhost kernel: kblockd_schedule_work failed -dean > > Thanks, > NeilBrown > > > Signed-off-by: Neil Brown <neilb@suse.de> > > ### Diffstat output > ./block/ll_rw_blk.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c > --- ./block/ll_rw_blk.c~current~ 2006-05-30 09:48:02.000000000 +1000 > +++ ./block/ll_rw_blk.c 2006-05-30 09:48:48.000000000 +1000 > @@ -1636,7 +1636,11 @@ static void blk_unplug_timeout(unsigned > { > request_queue_t *q = (request_queue_t *)data; > > - kblockd_schedule_work(&q->unplug_work); > + if (!kblockd_schedule_work(&q->unplug_work)) { > + /* failed to schedule the work, try again later */ > + printk("kblockd_schedule_work failed\n"); > + mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); > + } > } > > /** > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-31 0:15 ` dean gaudet @ 2006-05-31 0:46 ` Neil Brown 2006-05-31 1:03 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-31 0:46 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Tuesday May 30, dean@arctic.org wrote: > On Tue, 30 May 2006, Neil Brown wrote: > > > Could you try this patch please? On top of the rest. > > And if it doesn't fail in a couple of days, tell me how regularly the > > message > > kblockd_schedule_work failed > > gets printed. > > i'm running this patch now ... and just after reboot, no freeze yet, i've > already seen a handful of these: > > May 30 17:05:09 localhost kernel: kblockd_schedule_work failed > May 30 17:05:59 localhost kernel: kblockd_schedule_work failed > May 30 17:08:16 localhost kernel: kblockd_schedule_work failed > May 30 17:10:51 localhost kernel: kblockd_schedule_work failed > May 30 17:11:51 localhost kernel: kblockd_schedule_work failed > May 30 17:12:46 localhost kernel: kblockd_schedule_work failed > May 30 17:14:14 localhost kernel: kblockd_schedule_work failed 1 every minute or so. That's probably more than I would have expected, but strongly lends evidence to the theory that this is the problem. I certainly wouldn't expect a failure every time kblockd_schedule_work failed (in the original code), but the fact that it does fail sometimes means there is a possible race which can cause the failure that experienced. So I am optimistic that the patch will have fixed the problem. Please let me know when you reach an uptime of 3 days. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-31 0:46 ` Neil Brown @ 2006-05-31 1:03 ` dean gaudet 2006-05-31 1:38 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-31 1:03 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Wed, 31 May 2006, Neil Brown wrote: > On Tuesday May 30, dean@arctic.org wrote: > > On Tue, 30 May 2006, Neil Brown wrote: > > > > > Could you try this patch please? On top of the rest. > > > And if it doesn't fail in a couple of days, tell me how regularly the > > > message > > > kblockd_schedule_work failed > > > gets printed. > > > > i'm running this patch now ... and just after reboot, no freeze yet, i've > > already seen a handful of these: > > > > May 30 17:05:09 localhost kernel: kblockd_schedule_work failed > > May 30 17:05:59 localhost kernel: kblockd_schedule_work failed > > May 30 17:08:16 localhost kernel: kblockd_schedule_work failed > > May 30 17:10:51 localhost kernel: kblockd_schedule_work failed > > May 30 17:11:51 localhost kernel: kblockd_schedule_work failed > > May 30 17:12:46 localhost kernel: kblockd_schedule_work failed > > May 30 17:14:14 localhost kernel: kblockd_schedule_work failed > > 1 every minute or so. That's probably more than I would have > expected, but strongly lends evidence to the theory that this is the > problem. actually i think the rate is higher... i'm not sure why, but klogd doesn't seem to keep up with it: root@neemlark:~# grep -c kblockd_schedule_work /var/log/messages 31 root@neemlark:~# dmesg | grep -c kblockd_schedule_work 8192 i don't have CONFIG_PRINTK_TIME=y ... so i can't read timestamps from dmesg. but cool! if the dmesg spam seems to be a problem i can just comment it out of the patch... i'll let you know if it freezes again. thanks -dean ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-31 1:03 ` dean gaudet @ 2006-05-31 1:38 ` Neil Brown 2006-05-31 1:46 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-05-31 1:38 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Tuesday May 30, dean@arctic.org wrote: > > actually i think the rate is higher... i'm not sure why, but klogd doesn't > seem to keep up with it: > > root@neemlark:~# grep -c kblockd_schedule_work /var/log/messages > 31 > root@neemlark:~# dmesg | grep -c kblockd_schedule_work > 8192 # grep 'last message repeated' /var/log/messages ?? Obviously even faster than I thought. I guess workqueue threads must take a while to get scheduled... I'm beginning to wonder if I really have found the bug after all :-( I'll look forward to the results either way. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-05-31 1:38 ` Neil Brown @ 2006-05-31 1:46 ` dean gaudet 2006-06-01 2:51 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-05-31 1:46 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Wed, 31 May 2006, Neil Brown wrote: > On Tuesday May 30, dean@arctic.org wrote: > > > > actually i think the rate is higher... i'm not sure why, but klogd doesn't > > seem to keep up with it: > > > > root@neemlark:~# grep -c kblockd_schedule_work /var/log/messages > > 31 > > root@neemlark:~# dmesg | grep -c kblockd_schedule_work > > 8192 > > # grep 'last message repeated' /var/log/messages > ?? um hi, of course :) the paste below is approximately correct. -dean root@neemlark:~# egrep 'kblockd_schedule_work|last message repeated' /var/log/messages May 30 17:05:09 localhost kernel: kblockd_schedule_work failed May 30 17:05:59 localhost kernel: kblockd_schedule_work failed May 30 17:08:16 localhost kernel: kblockd_schedule_work failed May 30 17:10:51 localhost kernel: kblockd_schedule_work failed May 30 17:11:51 localhost kernel: kblockd_schedule_work failed May 30 17:12:46 localhost kernel: kblockd_schedule_work failed May 30 17:12:56 localhost last message repeated 22 times May 30 17:14:14 localhost kernel: kblockd_schedule_work failed May 30 17:16:57 localhost kernel: kblockd_schedule_work failed May 30 17:17:00 localhost last message repeated 83 times May 30 17:17:02 localhost kernel: kblockd_schedule_work failed May 30 17:17:33 localhost last message repeated 950 times May 30 17:18:34 localhost last message repeated 2218 times May 30 17:19:35 localhost last message repeated 1581 times May 30 17:20:01 localhost last message repeated 579 times May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost last message repeated 23 times May 30 17:20:03 localhost kernel: kblockd_schedule_work failed May 30 17:20:34 localhost last message repeated 1058 times May 30 17:21:35 localhost last message repeated 2171 times May 30 17:22:36 localhost last message repeated 2305 times May 30 17:23:37 localhost last message repeated 2311 times May 30 17:24:38 localhost last message repeated 1993 times May 30 17:25:01 localhost last message repeated 702 times May 30 17:25:02 localhost kernel: kblockd_schedule_work failed May 30 17:25:02 localhost last message repeated 15 times May 30 17:25:02 localhost kernel: kblockd_schedule_work failed May 30 17:25:02 localhost last message repeated 12 times May 30 17:25:03 localhost kernel: kblockd_schedule_work failed May 30 17:25:34 localhost last message repeated 1061 times May 30 17:26:35 localhost last message repeated 2009 times May 30 17:27:36 localhost last message repeated 1941 times May 30 17:28:37 localhost last message repeated 2345 times May 30 17:29:38 localhost last message repeated 2367 times May 30 17:30:01 localhost last message repeated 870 times May 30 17:30:01 localhost kernel: kblockd_schedule_work failed May 30 17:30:01 localhost last message repeated 45 times May 30 17:30:02 localhost kernel: kblockd_schedule_work failed May 30 17:30:33 localhost last message repeated 1180 times May 30 17:31:34 localhost last message repeated 2062 times May 30 17:32:34 localhost last message repeated 2277 times May 30 17:32:36 localhost kernel: kblockd_schedule_work failed May 30 17:33:07 localhost last message repeated 1114 times May 
30 17:34:08 localhost last message repeated 2308 times May 30 17:35:01 localhost last message repeated 1941 times May 30 17:35:01 localhost kernel: kblockd_schedule_work failed May 30 17:35:02 localhost last message repeated 20 times May 30 17:35:02 localhost kernel: kblockd_schedule_work failed May 30 17:35:33 localhost last message repeated 1051 times May 30 17:36:34 localhost last message repeated 2002 times May 30 17:37:35 localhost last message repeated 1644 times May 30 17:38:36 localhost last message repeated 1731 times May 30 17:39:37 localhost last message repeated 1844 times May 30 17:40:01 localhost last message repeated 817 times May 30 17:40:02 localhost kernel: kblockd_schedule_work failed May 30 17:40:02 localhost last message repeated 39 times May 30 17:40:02 localhost kernel: kblockd_schedule_work failed May 30 17:40:02 localhost last message repeated 12 times May 30 17:40:03 localhost kernel: kblockd_schedule_work failed May 30 17:40:34 localhost last message repeated 1051 times May 30 17:41:35 localhost last message repeated 1576 times May 30 17:42:36 localhost last message repeated 2000 times May 30 17:43:37 localhost last message repeated 2058 times May 30 17:44:15 localhost last message repeated 1337 times May 30 17:44:15 localhost kernel: kblockd_schedule_work failed May 30 17:44:46 localhost last message repeated 1016 times May 30 17:45:01 localhost last message repeated 432 times May 30 17:45:02 localhost kernel: kblockd_schedule_work failed May 30 17:45:02 localhost kernel: kblockd_schedule_work failed May 30 17:45:33 localhost last message repeated 1229 times May 30 17:46:34 localhost last message repeated 2552 times May 30 17:47:36 localhost last message repeated 2126 times May 30 17:48:37 localhost last message repeated 2348 times May 30 17:49:38 localhost last message repeated 1843 times May 30 17:50:00 localhost last message repeated 796 times May 30 17:50:01 localhost kernel: kblockd_schedule_work failed May 30 17:50:32 localhost last message repeated 1238 times May 30 17:51:33 localhost last message repeated 2374 times May 30 17:52:34 localhost last message repeated 2366 times May 30 17:53:35 localhost last message repeated 2195 times May 30 17:54:36 localhost last message repeated 2345 times May 30 17:55:01 localhost last message repeated 790 times May 30 17:55:02 localhost kernel: kblockd_schedule_work failed May 30 17:55:02 localhost last message repeated 49 times May 30 17:55:02 localhost kernel: kblockd_schedule_work failed May 30 17:55:02 localhost last message repeated 13 times May 30 17:55:03 localhost kernel: kblockd_schedule_work failed May 30 17:55:34 localhost last message repeated 1070 times May 30 17:56:35 localhost last message repeated 2278 times May 30 17:57:37 localhost last message repeated 2459 times May 30 17:58:38 localhost last message repeated 2076 times May 30 17:59:39 localhost last message repeated 1719 times May 30 18:00:01 localhost last message repeated 654 times May 30 18:00:03 localhost kernel: kblockd_schedule_work failed May 30 18:00:34 localhost last message repeated 868 times May 30 18:01:35 localhost last message repeated 1765 times May 30 18:02:33 localhost last message repeated 1759 times May 30 18:02:35 localhost kernel: kblockd_schedule_work failed May 30 18:02:35 localhost last message repeated 9 times May 30 18:02:35 localhost kernel: kblockd_schedule_work failed May 30 18:02:35 localhost kernel: kblockd_schedule_work failed May 30 18:02:35 localhost last message repeated 9 times May 30 18:02:35 localhost kernel: 
kblockd_schedule_work failed May 30 18:02:35 localhost last message repeated 4 times May 30 18:02:36 localhost kernel: kblockd_schedule_work failed May 30 18:02:36 localhost last message repeated 4 times May 30 18:02:36 localhost kernel: kblockd_schedule_work failed May 30 18:02:36 localhost last message repeated 10 times May 30 18:02:36 localhost kernel: kblockd_schedule_work failed May 30 18:02:36 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost last message repeated 7 times May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:02:37 localhost kernel: kblockd_schedule_work failed May 30 18:03:08 localhost last message repeated 842 times May 30 18:04:09 localhost last message repeated 1828 times May 30 18:04:15 localhost last message repeated 221 times May 30 18:04:15 localhost kernel: kblockd_schedule_work failed May 30 18:04:46 localhost last message repeated 1042 times May 30 18:05:00 localhost last message repeated 551 times May 30 18:05:01 localhost kernel: kblockd_schedule_work failed May 30 18:05:32 localhost last message repeated 919 times May 30 18:06:33 localhost last message repeated 2608 times May 30 18:07:34 localhost last message repeated 2412 times May 30 18:08:35 localhost last message repeated 2536 times May 30 18:09:36 localhost last message repeated 2441 times May 30 18:10:01 localhost last message repeated 812 times May 30 18:10:01 localhost kernel: kblockd_schedule_work failed May 30 18:10:01 localhost last message repeated 5 times May 30 18:10:01 localhost kernel: kblockd_schedule_work failed May 30 18:10:01 localhost last message repeated 16 times May 30 18:10:02 localhost kernel: kblockd_schedule_work failed May 30 18:10:33 localhost last message repeated 1311 times May 30 18:11:34 localhost last message repeated 2396 times May 30 18:11:45 localhost last message repeated 328 times May 30 18:11:47 localhost kernel: kblockd_schedule_work failed May 30 18:12:18 localhost last message repeated 1470 times May 30 18:13:19 localhost last message repeated 2457 times May 30 18:14:20 localhost last message repeated 2275 times May 30 18:15:01 localhost last message repeated 1577 times May 30 18:15:01 localhost kernel: kblockd_schedule_work failed May 30 18:15:01 localhost last message repeated 6 times May 30 18:15:02 localhost kernel: kblockd_schedule_work failed May 30 18:15:33 localhost last message repeated 1241 times May 30 18:16:37 localhost last message repeated 2487 times May 30 18:17:01 localhost last message repeated 826 times May 30 18:17:02 localhost kernel: kblockd_schedule_work failed May 30 18:17:33 localhost last message repeated 1386 times May 30 18:18:34 localhost last message repeated 2195 times May 30 18:19:29 localhost last message repeated 1021 times May 30 18:19:55 localhost last message repeated 62 times May 30 18:20:04 localhost kernel: kblockd_schedule_work failed May 30 18:20:39 localhost last message repeated 105 times May 30 18:21:34 localhost last message repeated 189 times May 30 18:22:45 localhost last message repeated 104 times May 30 18:23:40 localhost last message repeated 170 times May 30 18:24:15 localhost last message repeated 120 times May 30 18:24:45 localhost kernel: kblockd_schedule_work failed May 30 18:24:49 localhost last message repeated 13 
times May 30 18:25:19 localhost kernel: kblockd_schedule_work failed May 30 18:25:25 localhost last message repeated 138 times May 30 18:26:49 localhost last message repeated 113 times May 30 18:27:35 localhost last message repeated 339 times May 30 18:28:50 localhost last message repeated 108 times May 30 18:29:35 localhost last message repeated 143 times May 30 18:30:05 localhost kernel: kblockd_schedule_work failed May 30 18:30:40 localhost last message repeated 55 times May 30 18:31:50 localhost last message repeated 313 times May 30 18:32:30 localhost last message repeated 236 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost last message repeated 8 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost last message repeated 11 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost last message repeated 5 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost last message repeated 3 times May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:36 localhost kernel: kblockd_schedule_work failed May 30 18:32:50 localhost kernel: kblockd_schedule_work failed May 30 18:33:15 localhost last message repeated 91 times May 30 18:34:15 localhost last message repeated 86 times May 30 18:34:50 localhost last message repeated 133 times May 30 18:35:20 localhost kernel: kblockd_schedule_work failed May 30 18:35:50 localhost last message repeated 245 times May 30 18:36:35 localhost last message repeated 190 times May 30 18:37:50 localhost last message repeated 372 times May 30 18:38:25 localhost last message repeated 27 times May 30 18:39:40 localhost last message repeated 188 times May 30 18:40:00 localhost last message repeated 129 times May 30 18:40:35 localhost kernel: kblockd_schedule_work failed May 30 18:41:10 localhost last message repeated 76 times May 30 18:42:20 localhost last message repeated 186 times May 30 18:43:00 localhost last message repeated 154 times ^ permalink raw reply [flat|nested] 43+ messages in thread
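The "last message repeated N times" lines carry the counts that klogd suppressed, so a rough way to total them up is a one-liner over the same log (path and message text exactly as in the paste above; the result is still only approximate, because the summaries themselves are emitted at intervals):

  egrep 'kblockd_schedule_work|last message repeated' /var/log/messages |
    awk '/last message repeated/ { n += $(NF-1); next }
         /kblockd_schedule_work/ { n++ }
         END { print n, "messages in total" }'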
* Re: raid5 hang on get_active_stripe 2006-05-31 1:46 ` dean gaudet @ 2006-06-01 2:51 ` Neil Brown 2006-06-02 7:10 ` dean gaudet 0 siblings, 1 reply; 43+ messages in thread
From: Neil Brown @ 2006-06-01 2:51 UTC (permalink / raw)
To: dean gaudet; +Cc: Patrik Jonsson, linux-raid

On Tuesday May 30, dean@arctic.org wrote:
> On Wed, 31 May 2006, Neil Brown wrote:
>
> > On Tuesday May 30, dean@arctic.org wrote:
> > >
> > > actually i think the rate is higher... i'm not sure why, but klogd doesn't
> > > seem to keep up with it:
> > >
> > > root@neemlark:~# grep -c kblockd_schedule_work /var/log/messages
> > > 31
> > > root@neemlark:~# dmesg | grep -c kblockd_schedule_work
> > > 8192
> >
> > # grep 'last message repeated' /var/log/messages
> > ??
>
> um hi, of course :) the paste below is approximately correct.
>

Ok, so thousands of times a second. Not what I expected at all...

So I read the code more carefully, and can see that patch was all wrong.
It might actually make the problem go away, but isn't really the right fix.
I think I will be able to suggest a fix which should work and be a
reasonable thing to do, but it still won't help me understand the problem.

I've got one more long-shot I would like to try first. If you could
backout that change to ll_rw_block, and apply this patch instead.
Then when it hangs, just cat the stripe_cache_active file and see if
that unplugs things or not (cat it a few times).

If it does, I think I am on to the problem. If it doesn't I'll give up
and give you a patch that should work around the problem in a
reasonable way.

Thanks,
NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c | 3 +++
 1 file changed, 3 insertions(+)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~ 2006-05-28 21:56:56.000000000 +1000
+++ ./drivers/md/raid5.c 2006-06-01 12:44:52.000000000 +1000
@@ -1840,6 +1840,9 @@ stripe_cache_active_show(mddev_t *mddev,
 		spin_unlock_irq(&conf->device_lock);
 		n += sprintf(page+n, "bitlist=%d delaylist=%d\n", c1, c2);
 		trigger = 0xffff;
+
+		/* incredible long shot... */
+		kblockd_flush();
 		return n;
 	} else
 		return 0;

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-01 2:51 ` Neil Brown @ 2006-06-02 7:10 ` dean gaudet 2006-06-02 7:27 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: dean gaudet @ 2006-06-02 7:10 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Thu, 1 Jun 2006, Neil Brown wrote: > I've got one more long-shot I would like to try first. If you could > backout that change to ll_rw_block, and apply this patch instead. > Then when it hangs, just cat the stripe_cache_active file and see if > that unplugs things or not (cat it a few times). nope that didn't unstick it... i had to raise stripe_cache_size (from 256 to 768... 512 wasn't enough)... -dean ^ permalink raw reply [flat|nested] 43+ messages in thread
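For reference, the stripe cache is resized through sysfs at run time, so the change dean describes looks something like the following (md0 stands in for the real array, and the memory arithmetic assumes a 4-disk array):

  cat /sys/block/md0/md/stripe_cache_size          # default is 256 entries
  echo 768 > /sys/block/md0/md/stripe_cache_size
  # each entry holds one 4K page per member device, so on a 4-disk array
  # 768 entries pin roughly 768 * 4K * 4 = 12 MB of kernel memory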
* Re: raid5 hang on get_active_stripe 2006-06-02 7:10 ` dean gaudet @ 2006-06-02 7:27 ` Neil Brown 2006-06-10 19:49 ` dean gaudet 2006-06-13 18:53 ` Bill Davidsen 0 siblings, 2 replies; 43+ messages in thread From: Neil Brown @ 2006-06-02 7:27 UTC (permalink / raw) To: dean gaudet; +Cc: Patrik Jonsson, linux-raid On Friday June 2, dean@arctic.org wrote: > On Thu, 1 Jun 2006, Neil Brown wrote: > > > I've got one more long-shot I would like to try first. If you could > > backout that change to ll_rw_block, and apply this patch instead. > > Then when it hangs, just cat the stripe_cache_active file and see if > > that unplugs things or not (cat it a few times). > > nope that didn't unstick it... i had to raise stripe_cache_size (from 256 > to 768... 512 wasn't enough)... > > -dean Ok, thanks. I still don't know what is really going on, but I'm 99.9863% sure this will fix it, and is a reasonable thing to do. (Yes, I lose a ';'. That is deliberate). Please let me know what this proves, and thanks again for your patience. NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid5.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-28 21:56:56.000000000 +1000 +++ ./drivers/md/raid5.c 2006-06-02 17:24:07.000000000 +1000 @@ -285,7 +285,7 @@ static struct stripe_head *get_active_st < (conf->max_nr_stripes *3/4) || !conf->inactive_blocked), conf->device_lock, - unplug_slaves(conf->mddev); + raid5_unplug_device(conf->mddev->queue) ); conf->inactive_blocked = 0; } else ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-02 7:27 ` Neil Brown @ 2006-06-10 19:49 ` dean gaudet 2006-06-13 18:53 ` Bill Davidsen 1 sibling, 0 replies; 43+ messages in thread From: dean gaudet @ 2006-06-10 19:49 UTC (permalink / raw) To: Neil Brown; +Cc: Patrik Jonsson, linux-raid On Fri, 2 Jun 2006, Neil Brown wrote: > On Friday June 2, dean@arctic.org wrote: > > On Thu, 1 Jun 2006, Neil Brown wrote: > > > > > I've got one more long-shot I would like to try first. If you could > > > backout that change to ll_rw_block, and apply this patch instead. > > > Then when it hangs, just cat the stripe_cache_active file and see if > > > that unplugs things or not (cat it a few times). > > > > nope that didn't unstick it... i had to raise stripe_cache_size (from 256 > > to 768... 512 wasn't enough)... > > > > -dean > > Ok, thanks. > I still don't know what is really going on, but I'm 99.9863% sure this > will fix it, and is a reasonable thing to do. > (Yes, I lose a ';'. That is deliberate). it's been running for a week now... and the freeze hasn't occured... it's possible the circumstances for reproducing it haven't occured again either, but i haven't really changed my disk usage behaviour so it's probably fixed. let me know if you come up with some other solution you'd like tested. thanks -dean > > Please let me know what this proves, and thanks again for your > patience. > > NeilBrown > > > Signed-off-by: Neil Brown <neilb@suse.de> > > ### Diffstat output > ./drivers/md/raid5.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c > --- ./drivers/md/raid5.c~current~ 2006-05-28 21:56:56.000000000 +1000 > +++ ./drivers/md/raid5.c 2006-06-02 17:24:07.000000000 +1000 > @@ -285,7 +285,7 @@ static struct stripe_head *get_active_st > < (conf->max_nr_stripes *3/4) > || !conf->inactive_blocked), > conf->device_lock, > - unplug_slaves(conf->mddev); > + raid5_unplug_device(conf->mddev->queue) > ); > conf->inactive_blocked = 0; > } else > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-02 7:27 ` Neil Brown 2006-06-10 19:49 ` dean gaudet @ 2006-06-13 18:53 ` Bill Davidsen 2006-06-13 19:05 ` dean gaudet 2006-06-13 23:13 ` Neil Brown 1 sibling, 2 replies; 43+ messages in thread From: Bill Davidsen @ 2006-06-13 18:53 UTC (permalink / raw) To: Neil Brown; +Cc: dean gaudet, Patrik Jonsson, linux-raid Neil Brown wrote: >On Friday June 2, dean@arctic.org wrote: > > >>On Thu, 1 Jun 2006, Neil Brown wrote: >> >> >> >>>I've got one more long-shot I would like to try first. If you could >>>backout that change to ll_rw_block, and apply this patch instead. >>>Then when it hangs, just cat the stripe_cache_active file and see if >>>that unplugs things or not (cat it a few times). >>> >>> >>nope that didn't unstick it... i had to raise stripe_cache_size (from 256 >>to 768... 512 wasn't enough)... >> >>-dean >> >> > >Ok, thanks. >I still don't know what is really going on, but I'm 99.9863% sure this >will fix it, and is a reasonable thing to do. >(Yes, I lose a ';'. That is deliberate). > >Please let me know what this proves, and thanks again for your >patience. > >NeilBrown > [...snip...] Will that fix be in 2.6.17? -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-13 18:53 ` Bill Davidsen @ 2006-06-13 19:05 ` dean gaudet 2006-06-13 23:13 ` Neil Brown 1 sibling, 0 replies; 43+ messages in thread From: dean gaudet @ 2006-06-13 19:05 UTC (permalink / raw) To: Bill Davidsen; +Cc: Neil Brown, Patrik Jonsson, linux-raid On Tue, 13 Jun 2006, Bill Davidsen wrote: > Neil Brown wrote: > > > On Friday June 2, dean@arctic.org wrote: > > > > > On Thu, 1 Jun 2006, Neil Brown wrote: > > > > > > > > > > I've got one more long-shot I would like to try first. If you could > > > > backout that change to ll_rw_block, and apply this patch instead. > > > > Then when it hangs, just cat the stripe_cache_active file and see if > > > > that unplugs things or not (cat it a few times). > > > > > > > nope that didn't unstick it... i had to raise stripe_cache_size (from 256 > > > to 768... 512 wasn't enough)... > > > > > > -dean > > > > > > > Ok, thanks. > > I still don't know what is really going on, but I'm 99.9863% sure this > > will fix it, and is a reasonable thing to do. > > (Yes, I lose a ';'. That is deliberate). > > > > Please let me know what this proves, and thanks again for your > > patience. > > > > NeilBrown > > > [...snip...] > > Will that fix be in 2.6.17? also -- is it appropriate enough for 2.6.16.x? -dean ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-13 18:53 ` Bill Davidsen 2006-06-13 19:05 ` dean gaudet @ 2006-06-13 23:13 ` Neil Brown 2006-10-07 23:25 ` Chris Allen 1 sibling, 1 reply; 43+ messages in thread
From: Neil Brown @ 2006-06-13 23:13 UTC (permalink / raw)
To: Bill Davidsen; +Cc: dean gaudet, Patrik Jonsson, linux-raid

On Tuesday June 13, davidsen@tmr.com wrote:
>
> Will that fix be in 2.6.17?
>

Probably not. We have had the last 'rc' twice and so I don't think
it is appropriate to submit the patch at this stage.
I probably will submit it for an early 2.6.17.x and for 2.6.16.y.

NeilBrown

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-06-13 23:13 ` Neil Brown @ 2006-10-07 23:25 ` Chris Allen 2006-10-09 11:03 ` Chris Allen 0 siblings, 1 reply; 43+ messages in thread From: Chris Allen @ 2006-10-07 23:25 UTC (permalink / raw) To: linux-raid Neil Brown wrote: > On Tuesday June 13, davidsen@tmr.com wrote: > >> Will that fix be in 2.6.17? >> >> > > Probably not. We have had the last 'rc' twice and I so I don't think > it is appropriate to submit the patch at this stage. > I probably will submit it for an early 2.6.17.x. and for 2.6.16.y. > > > What is the status of this? I've been experiencing exactly the same get_active_stripe lockup on a FC5 2.6.17-1.2187_FC5smp stock kernel. Curiously we have ten similar heavily loaded servers but only one of them experiences the problem. The problem happens consistently after 24 hours or so when I hammer the raid5 array over NFS, but I've never managed to trigger it with local access. I'd also say (anecdotally) that it only started happening since I added a bitmap to my array. As with the other poster, the lockup is released by increasing stripe_cache_size. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-10-07 23:25 ` Chris Allen @ 2006-10-09 11:03 ` Chris Allen 2006-10-09 23:06 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: Chris Allen @ 2006-10-09 11:03 UTC (permalink / raw) To: linux-raid Ok, after more testing, this lockup happens consistently when bitmaps are switched on and never when they are switched off. Ideas anybody? On Sun, Oct 08, 2006 at 12:25:46AM +0100, Chris Allen wrote: > > > Neil Brown wrote: > >On Tuesday June 13, davidsen@tmr.com wrote: > > > >>Will that fix be in 2.6.17? > >> > >> > > > >Probably not. We have had the last 'rc' twice and I so I don't think > >it is appropriate to submit the patch at this stage. > >I probably will submit it for an early 2.6.17.x. and for 2.6.16.y. > > > > > > > > What is the status of this? > > I've been experiencing exactly the same get_active_stripe lockup > on a FC5 2.6.17-1.2187_FC5smp stock kernel. Curiously we have ten > similar heavily loaded > servers but only one of them experiences the problem. The problem > happens consistently after 24 hours or so > when I hammer the raid5 array over NFS, but I've never managed to > trigger it with local access. I'd also > say (anecdotally) that it only started happening since I added a bitmap > to my array. > > > As with the other poster, the lockup is released by increasing > stripe_cache_size. > > > > > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 43+ messages in thread
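The bitmap correlation is easy to exercise on a live array, since a write-intent bitmap can be added and removed with mdadm while the array is running; a sketch, with md0 as a placeholder and an internal bitmap assumed:

  mdadm --grow /dev/md0 --bitmap=none       # drop the bitmap, re-run the NFS load
  mdadm --grow /dev/md0 --bitmap=internal   # put it back for the next run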
* Re: raid5 hang on get_active_stripe 2006-10-09 11:03 ` Chris Allen @ 2006-10-09 23:06 ` Neil Brown 2006-10-09 23:21 ` Chris Allen 0 siblings, 1 reply; 43+ messages in thread From: Neil Brown @ 2006-10-09 23:06 UTC (permalink / raw) To: Chris Allen; +Cc: linux-raid On Monday October 9, chris@cjx.com wrote: > Ok, after more testing, this lockup happens consistently when > bitmaps are switched on and never when they are switched off. > > Ideas anybody? > No. I'm completely stumped. Which means it is probably something very obvious, but I keep looking in the wrong place :-) The interaction with bitmaps is interesting and might prove helpful. I'll have another look at the code and see if it opens up some possibilities. I've been working with Dean Gaudet who can reproduce it. He has been trying out patches that print out lots of debugging information when the problem occurs. This has narrowed it down a bit, but I'm still in the dark. It seems that blk_plug_device is setting a timer to go off in about 3msecs, but it gets deactivated without firing. This should only happen when the device gets unplugged, but the device remains plugged. Are you happy to try a kernel.org kernel with a few patches and a little shell script running? The net result is that it detects when there is a problem, prints out some trace information, and then gets array going again. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-10-09 23:06 ` Neil Brown @ 2006-10-09 23:21 ` Chris Allen 2006-10-10 0:08 ` Neil Brown 0 siblings, 1 reply; 43+ messages in thread From: Chris Allen @ 2006-10-09 23:21 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Neil Brown wrote: > On Monday October 9, chris@cjx.com wrote: > >> Ok, after more testing, this lockup happens consistently when >> bitmaps are switched on and never when they are switched off. >> >> > Are you happy to try a kernel.org kernel with a few patches and a > little shell script running? > The net result is that it detects when there is a problem, prints out > some trace information, and then gets array going again. > > > Very happy to. Let me know what you'd like me to do. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-10-09 23:21 ` Chris Allen @ 2006-10-10 0:08 ` Neil Brown 2006-10-10 10:27 ` Bas van Schaik 2006-11-14 10:28 ` Chris Allen 0 siblings, 2 replies; 43+ messages in thread From: Neil Brown @ 2006-10-10 0:08 UTC (permalink / raw) To: Chris Allen; +Cc: linux-raid On Tuesday October 10, chris@cjx.com wrote: > > Very happy to. Let me know what you'd like me to do. Cool thanks. At the end is a patch against 2.6.17.11, though it should apply against any later 2.6.17 kernel. Apply this and reboot. Then run while true do cat /sys/block/mdX/md/stripe_cache_active sleep 10 done > /dev/null (maybe write a little script or whatever). Leave this running. It effects the check for "has raid5 hung". Make sure to change "mdX" to whatever is appropriate. Occasionally look in the kernel logs for plug problem: if you find that, send me the surrounding text - there should be about a dozen lines following this one. Hopefully this will let me know which is last thing to happen: a plug or an unplug. If the last is a "plug", then the timer really should still be pending, but isn't (this is impossible). So I'll look more closely at that option. If the last is an "unplug", then the 'Plugged' flag should really be clear but it isn't (this is impossible). So I'll look more closely at that option. Dean is running this, but he only gets the hang every couple of weeks. If you get it more often, that would help me a lot. Thanks, NeilBrown diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c --- ./.patches/orig/block/ll_rw_blk.c 2006-08-21 09:52:46.000000000 +1000 +++ ./block/ll_rw_blk.c 2006-10-05 11:33:32.000000000 +1000 @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_ * This is called with interrupts off and no requests on the queue and * with the queue lock held. 
*/ +static atomic_t seq = ATOMIC_INIT(0); void blk_plug_device(request_queue_t *q) { WARN_ON(!irqs_disabled()); @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q) return; if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { + q->last_plug = jiffies; + q->plug_seq = atomic_read(&seq); + atomic_inc(&seq); mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); - } + } else + q->last_plug_skip = jiffies; + if (!timer_pending(&q->unplug_timer) && + !q->unplug_work.pending) + printk("Neither Timer or work are pending\n"); } EXPORT_SYMBOL(blk_plug_device); @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q) { WARN_ON(!irqs_disabled()); - if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) + if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { + q->last_unplug_skip = jiffies; return 0; + } del_timer(&q->unplug_timer); + q->last_unplug = jiffies; + q->unplug_seq = atomic_read(&seq); + atomic_inc(&seq); + if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) + printk("queue still (or again) plugged\n"); return 1; } @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc static void blk_unplug_work(void *data) { request_queue_t *q = data; - + q->last_unplug_work = jiffies; blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, q->rq.count[READ] + q->rq.count[WRITE]); @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, q->rq.count[READ] + q->rq.count[WRITE]); + q->last_unplug_timeout = jiffies; kblockd_schedule_work(&q->unplug_work); } diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c --- ./.patches/orig/drivers/md/raid1.c 2006-08-10 17:28:01.000000000 +1000 +++ ./drivers/md/raid1.c 2006-09-04 21:58:31.000000000 +1000 @@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev) d = conf->raid_disks; d--; rdev = conf->mirrors[d].rdev; - atomic_add(s, &rdev->corrected_errors); if (rdev && test_bit(In_sync, &rdev->flags)) { if (sync_page_io(rdev->bdev, @@ -1509,6 +1508,9 @@ static void raid1d(mddev_t *mddev) s<<9, conf->tmppage, READ) == 0) /* Well, this device is dead */ md_error(mddev, rdev); + else + atomic_add(s, &rdev->corrected_errors); + } } } else { diff ./.patches/orig/drivers/md/raid5.c ./drivers/md/raid5.c --- ./.patches/orig/drivers/md/raid5.c 2006-08-10 17:28:01.000000000 +1000 +++ ./drivers/md/raid5.c 2006-09-25 11:11:59.000000000 +1000 @@ -93,6 +93,8 @@ static void __release_stripe(raid5_conf_ md_wakeup_thread(conf->mddev->thread); } atomic_dec(&conf->active_stripes); + if (atomic_read(&conf->active_stripes) < conf->low_size) + conf->low_size = atomic_read(&conf->active_stripes); if (!test_bit(STRIPE_EXPANDING, &sh->state)) { list_add_tail(&sh->lru, &conf->inactive_list); wake_up(&conf->wait_for_stripe); @@ -1677,7 +1679,12 @@ static void raid5_unplug_device(request_ spin_lock_irqsave(&conf->device_lock, flags); + if (conf->dotrace) + printk(" trace unplug_device\n"); + if (blk_remove_plug(q)) { + if (conf->dotrace) + printk(" trace remove plug\n"); conf->seq_flush++; raid5_activate_delayed(conf); } @@ -1721,6 +1728,8 @@ static int raid5_issue_flush(request_que static inline void raid5_plug_device(raid5_conf_t *conf) { spin_lock_irq(&conf->device_lock); + if (conf->dotrace) + printk(" trace plug device\n"); blk_plug_device(conf->mddev->queue); spin_unlock_irq(&conf->device_lock); } @@ -2154,6 +2163,42 @@ static ssize_t stripe_cache_active_show(mddev_t *mddev, char *page) { raid5_conf_t *conf = mddev_to_conf(mddev); + + /* 
tracing */ + if (conf->dotrace) { + conf->dotrace = 0; + printk("end trace, force unplug\n"); + raid5_unplug_device(mddev->queue); + raid5_unplug_device(mddev->queue); + } else { + if (conf->last_size > conf->max_nr_stripes/2 && + conf->last_size <= conf->low_size) { + conf->dotrace = 1; + printk("plug problem: low_size=%d\n", conf->low_size); + printk(" blk_queue_plugged = %d\n", blk_queue_plugged(mddev->queue)); + printk(" flags = %lx\n", mddev->queue->queue_flags); + printk(" unplug_delay = %lu (HZ=%d)\n", mddev->queue->unplug_delay, HZ); + printk(" timer->expires %lu (jiffies %lu)\n", + mddev->queue->unplug_timer.expires, jiffies); + printk(" pending = %d\n", timer_pending(&mddev->queue->unplug_timer)); + printk(" wq_pending = %lx\n", mddev->queue->unplug_work.pending); + printk(" wq_entry empty? %d\n", list_empty(&mddev->queue->unplug_work.entry)); + printk(" unplug=%lu unplug_skip=%lu\n", + mddev->queue->last_unplug, + mddev->queue->last_unplug_skip); + printk(" plug=%lu plug_skip=%lu\n", + mddev->queue->last_plug, + mddev->queue->last_plug_skip); + printk(" unplug_work=%lu unplug_timeout=%lu\n", + mddev->queue->last_unplug_work, + mddev->queue->last_unplug_timeout); + printk(" unplug_seq=%lu plug_seq=%lu\n", + mddev->queue->unplug_seq, + mddev->queue->plug_seq); + } + conf->last_size = atomic_read(&conf->active_stripes); + conf->low_size = conf->last_size; + } if (conf) return sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); else diff ./.patches/orig/include/linux/blkdev.h ./include/linux/blkdev.h --- ./.patches/orig/include/linux/blkdev.h 2006-08-10 17:28:48.000000000 +1000 +++ ./include/linux/blkdev.h 2006-09-25 11:08:27.000000000 +1000 @@ -380,6 +380,9 @@ struct request_queue */ unsigned long queue_flags; + unsigned long last_unplug, last_unplug_skip, last_plug, last_plug_skip; + unsigned long last_unplug_work, last_unplug_timeout; + unsigned long unplug_seq, plug_seq; /* * protects queue structures from reentrancy. ->__queue_lock should * _never_ be used directly, it is queue private. always use diff ./.patches/orig/include/linux/raid/raid5.h ./include/linux/raid/raid5.h --- ./.patches/orig/include/linux/raid/raid5.h 2006-08-10 17:28:53.000000000 +1000 +++ ./include/linux/raid/raid5.h 2006-08-28 14:09:32.000000000 +1000 @@ -245,6 +245,10 @@ struct raid5_private_data { struct page *spare_page; /* Used when checking P/Q in raid6 */ + int dotrace; + int last_size; + int low_size; + /* * Free stripes pool */ ^ permalink raw reply [flat|nested] 43+ messages in thread
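Neil's polling loop is easily wrapped into a small script; a sketch, with md0 as a stand-in for the real array name and the "plug problem:" marker taken from the patch above:

  #!/bin/sh
  md=${1:-md0}
  while true
  do
      # reading this attribute is what triggers the hang check added by the patch
      cat /sys/block/$md/md/stripe_cache_active > /dev/null
      sleep 10
  done

When the marker does appear, something like "dmesg | grep -A 12 'plug problem:'" pulls out the dozen lines of trace that follow it.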
* Re: raid5 hang on get_active_stripe 2006-10-10 0:08 ` Neil Brown @ 2006-10-10 10:27 ` Bas van Schaik 2006-11-14 10:28 ` Chris Allen 1 sibling, 0 replies; 43+ messages in thread From: Bas van Schaik @ 2006-10-10 10:27 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Hi all, Neil Brown wrote: > On Tuesday October 10, chris@cjx.com wrote: > >> Very happy to. Let me know what you'd like me to do. >> > > Cool thanks. > (snip) > I don't know if it's useful information, but I'm encountering the same problem here, in a totally different situation. I'm using Peter Breuers ENBD (you probably know him, since he started a discussion about request retries with exponential timeouts and a communication channel to raid a while ago) to import a total of 12 devices from other machines to compose those disks into 3 arrays of RAID5. Those 3 arrays are combined in one VG with one LV, running CryptoLoop on top. Last, but not least, a ReiserFS is created on the loopback device. I'm using the Debian Etch stock 2.6.17-kernel, by the way. When doing a lot of I/O on the ReiserFS (like a "reiserfsck --rebuild-tree"), the machine suddenly gets stuck, I think after filling it's memory with buffers. I've been doing a lot of debugging with Peter, attached you'll find a "ps -axl" with a widened WCHAN column to see that some of the enbd-client processes get stuck in the RAID code. We've not been able find out how ENBD gets into the RAID code, but I don't think that's really relevant right now. Here's the relevant part of ps: ps ax -o f,uid,pid,ppid,pri,ni,vsz,rss,wchan:30,stat,tty,time,command (only the relevant rows) > F UID PID PPID PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND > (snip) > 5 0 26523 1 23 0 2140 1052 - Ss ? 00:00:00 enbd-client iss01 1300 -i iss01-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndi > 5 0 26540 1 23 0 2140 1048 get_active_stripe Ds ? 00:00:00 enbd-client iss04 1300 -i iss04-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndl > 5 0 26552 1 23 0 2140 1044 - Ss ? 00:00:00 enbd-client iss02 1200 -i iss02-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndf > 5 0 26556 1 23 0 2140 1048 - Ss ? 00:00:00 enbd-client iss01 1100 -i iss01-hda5 -n 2 -e -m -b 4096 -p 30 /dev/nda > 5 0 26561 1 23 0 2140 1052 get_active_stripe Ds ? 00:00:00 enbd-client iss02 1100 -i iss02-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndb > 5 0 26564 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss03 1200 -i iss03-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndg > 5 0 26568 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss04 1200 -i iss04-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndh > 5 0 26581 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss03 1100 -i iss03-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndc > 5 0 26590 1 23 0 2140 1048 - Ss ? 00:00:00 enbd-client iss01 1200 -i iss01-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/nde > 5 0 26606 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss02 1300 -i iss02-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndj > 5 0 26614 1 23 0 2144 1052 - Ss ? 00:00:00 enbd-client iss03 1300 -i iss03-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndk > 5 0 26616 1 23 0 2144 1056 - Ss ? 00:00:00 enbd-client iss04 1100 -i iss04-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndd > 5 0 26617 26523 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss01 1300 -i iss01-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndi > 5 0 26618 26523 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss01 1300 -i iss01-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndi > 5 0 26619 26540 24 0 2140 948 enbd_get_req S ? 00:00:01 enbd-client iss04 1300 -i iss04-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndl > 5 0 26620 26540 24 0 2140 948 enbd_get_req S ? 
00:00:01 enbd-client iss04 1300 -i iss04-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndl > 5 0 26621 26552 24 0 2140 948 get_active_stripe D ? 00:32:11 enbd-client iss02 1200 -i iss02-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndf > 5 0 26622 26552 24 0 2140 948 get_active_stripe D ? 00:32:18 enbd-client iss02 1200 -i iss02-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndf > 5 0 26623 26564 23 0 2144 956 enbd_get_req S ? 00:32:27 enbd-client iss03 1200 -i iss03-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndg > 5 0 26624 26564 24 0 2144 956 enbd_get_req S ? 00:32:37 enbd-client iss03 1200 -i iss03-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndg > 5 0 26625 26568 24 0 2144 956 enbd_get_req S ? 00:35:35 enbd-client iss04 1200 -i iss04-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndh > 5 0 26626 26561 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss02 1100 -i iss02-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndb > 5 0 26627 26561 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss02 1100 -i iss02-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndb > 5 0 26628 26568 24 0 2144 956 enbd_get_req S ? 00:35:37 enbd-client iss04 1200 -i iss04-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/ndh > 5 0 26629 26556 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss01 1100 -i iss01-hda5 -n 2 -e -m -b 4096 -p 30 /dev/nda > 5 0 26630 26556 24 0 2140 948 enbd_get_req S ? 00:00:00 enbd-client iss01 1100 -i iss01-hda5 -n 2 -e -m -b 4096 -p 30 /dev/nda > 5 0 26631 26581 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss03 1100 -i iss03-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndc > 5 0 26632 26581 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss03 1100 -i iss03-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndc > 5 0 26633 26590 24 0 2140 952 enbd_get_req S ? 00:36:58 enbd-client iss01 1200 -i iss01-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/nde > 5 0 26634 26590 24 0 2140 952 enbd_get_req S ? 00:36:50 enbd-client iss01 1200 -i iss01-hdc5 -n 2 -e -m -b 4096 -p 30 /dev/nde > 5 0 26635 26606 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss02 1300 -i iss02-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndj > 5 0 26636 26606 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss02 1300 -i iss02-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndj > 5 0 26637 26616 24 0 2144 952 enbd_get_req S ? 00:00:00 enbd-client iss04 1100 -i iss04-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndd > 5 0 26638 26616 23 0 2144 952 enbd_get_req S ? 00:00:00 enbd-client iss04 1100 -i iss04-hda5 -n 2 -e -m -b 4096 -p 30 /dev/ndd > 5 0 26639 26614 23 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss03 1300 -i iss03-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndk > 5 0 26640 26614 24 0 2144 948 enbd_get_req S ? 00:00:00 enbd-client iss03 1300 -i iss03-hdd -n 2 -e -m -b 4096 -p 30 /dev/ndk > I've tried this "reiserfsck --rebuild-tree" a couple of times, it keeps hanging at the same point when my memory gets filled with buffers. My assumption is, that reiserfs is writing out too fast, the network (ENBD) can't handle it and after a while there's no memory left for TCP buffers. I've solved this problem by editing /proc/sys/vm/min_free_kbytes to force the kernel to leave some memory for the TCP buffers and other interrupt handling. I'm not able to install a vanilla kernel with some patches, but I would be happy to provide some extra details about the crash if you want me to. I assume I can even reproduce it, on another cluster however, since I've recreated a (ext3) filesystem on the cluster we're talking about. Regards -- Bas van Schaik ^ permalink raw reply [flat|nested] 43+ messages in thread
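For completeness, the knob Bas adjusted is an ordinary sysctl; a sketch of the workaround (the 64 MB figure is only an example, Bas does not give his value):

  echo 65536 > /proc/sys/vm/min_free_kbytes   # keep ~64 MB free for atomic/network allocations
  # or equivalently:
  sysctl -w vm.min_free_kbytes=65536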
* Re: raid5 hang on get_active_stripe 2006-10-10 0:08 ` Neil Brown 2006-10-10 10:27 ` Bas van Schaik @ 2006-11-14 10:28 ` Chris Allen 2006-11-15 20:39 ` dean gaudet 1 sibling, 1 reply; 43+ messages in thread From: Chris Allen @ 2006-11-14 10:28 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid You probably guessed that no matter what I did, I never, ever saw the problem when your trace was installed. I'd guess at some obscure timing-related problem. I can still trigger it consistently with a vanilla 2.6.17_SMP though, but again only when bitmaps are turned on. Neil Brown wrote: > On Tuesday October 10, chris@cjx.com wrote: > >> Very happy to. Let me know what you'd like me to do. >> > > Cool thanks. > > At the end is a patch against 2.6.17.11, though it should apply against > any later 2.6.17 kernel. > Apply this and reboot. > > Then run > > while true > do cat /sys/block/mdX/md/stripe_cache_active > sleep 10 > done > /dev/null > > (maybe write a little script or whatever). Leave this running. It > effects the check for "has raid5 hung". Make sure to change "mdX" to > whatever is appropriate. > > Occasionally look in the kernel logs for > plug problem: > > if you find that, send me the surrounding text - there should be about > a dozen lines following this one. > > Hopefully this will let me know which is last thing to happen: a plug > or an unplug. > If the last is a "plug", then the timer really should still be > pending, but isn't (this is impossible). So I'll look more closely at > that option. > If the last is an "unplug", then the 'Plugged' flag should really be > clear but it isn't (this is impossible). So I'll look more closely at > that option. > > Dean is running this, but he only gets the hang every couple of > weeks. If you get it more often, that would help me a lot. > > Thanks, > NeilBrown > > > diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c > --- ./.patches/orig/block/ll_rw_blk.c 2006-08-21 09:52:46.000000000 +1000 > +++ ./block/ll_rw_blk.c 2006-10-05 11:33:32.000000000 +1000 > @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_ > * This is called with interrupts off and no requests on the queue and > * with the queue lock held. 
> */ > +static atomic_t seq = ATOMIC_INIT(0); > void blk_plug_device(request_queue_t *q) > { > WARN_ON(!irqs_disabled()); > @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q) > return; > > if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { > + q->last_plug = jiffies; > + q->plug_seq = atomic_read(&seq); > + atomic_inc(&seq); > mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); > blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); > - } > + } else > + q->last_plug_skip = jiffies; > + if (!timer_pending(&q->unplug_timer) && > + !q->unplug_work.pending) > + printk("Neither Timer or work are pending\n"); > } > > EXPORT_SYMBOL(blk_plug_device); > @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q) > { > WARN_ON(!irqs_disabled()); > > - if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) > + if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { > + q->last_unplug_skip = jiffies; > return 0; > + } > > del_timer(&q->unplug_timer); > + q->last_unplug = jiffies; > + q->unplug_seq = atomic_read(&seq); > + atomic_inc(&seq); > + if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) > + printk("queue still (or again) plugged\n"); > return 1; > } > > @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc > static void blk_unplug_work(void *data) > { > request_queue_t *q = data; > - > + q->last_unplug_work = jiffies; > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, > q->rq.count[READ] + q->rq.count[WRITE]); > > @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, > q->rq.count[READ] + q->rq.count[WRITE]); > > + q->last_unplug_timeout = jiffies; > kblockd_schedule_work(&q->unplug_work); > } > > > diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c > --- ./.patches/orig/drivers/md/raid1.c 2006-08-10 17:28:01.000000000 +1000 > +++ ./drivers/md/raid1.c 2006-09-04 21:58:31.000000000 +1000 > @@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev) > d = conf->raid_disks; > d--; > rdev = conf->mirrors[d].rdev; > - atomic_add(s, &rdev->corrected_errors); > if (rdev && > test_bit(In_sync, &rdev->flags)) { > if (sync_page_io(rdev->bdev, > @@ -1509,6 +1508,9 @@ static void raid1d(mddev_t *mddev) > s<<9, conf->tmppage, READ) == 0) > /* Well, this device is dead */ > md_error(mddev, rdev); > + else > + atomic_add(s, &rdev->corrected_errors); > + > } > } > } else { > > diff ./.patches/orig/drivers/md/raid5.c ./drivers/md/raid5.c > --- ./.patches/orig/drivers/md/raid5.c 2006-08-10 17:28:01.000000000 +1000 > +++ ./drivers/md/raid5.c 2006-09-25 11:11:59.000000000 +1000 > @@ -93,6 +93,8 @@ static void __release_stripe(raid5_conf_ > md_wakeup_thread(conf->mddev->thread); > } > atomic_dec(&conf->active_stripes); > + if (atomic_read(&conf->active_stripes) < conf->low_size) > + conf->low_size = atomic_read(&conf->active_stripes); > if (!test_bit(STRIPE_EXPANDING, &sh->state)) { > list_add_tail(&sh->lru, &conf->inactive_list); > wake_up(&conf->wait_for_stripe); > @@ -1677,7 +1679,12 @@ static void raid5_unplug_device(request_ > > spin_lock_irqsave(&conf->device_lock, flags); > > + if (conf->dotrace) > + printk(" trace unplug_device\n"); > + > if (blk_remove_plug(q)) { > + if (conf->dotrace) > + printk(" trace remove plug\n"); > conf->seq_flush++; > raid5_activate_delayed(conf); > } > @@ -1721,6 +1728,8 @@ static int raid5_issue_flush(request_que > static inline void raid5_plug_device(raid5_conf_t *conf) > { > spin_lock_irq(&conf->device_lock); > + if (conf->dotrace) > + printk(" trace plug 
device\n"); > blk_plug_device(conf->mddev->queue); > spin_unlock_irq(&conf->device_lock); > } > @@ -2154,6 +2163,42 @@ static ssize_t > stripe_cache_active_show(mddev_t *mddev, char *page) > { > raid5_conf_t *conf = mddev_to_conf(mddev); > + > + /* tracing */ > + if (conf->dotrace) { > + conf->dotrace = 0; > + printk("end trace, force unplug\n"); > + raid5_unplug_device(mddev->queue); > + raid5_unplug_device(mddev->queue); > + } else { > + if (conf->last_size > conf->max_nr_stripes/2 && > + conf->last_size <= conf->low_size) { > + conf->dotrace = 1; > + printk("plug problem: low_size=%d\n", conf->low_size); > + printk(" blk_queue_plugged = %d\n", blk_queue_plugged(mddev->queue)); > + printk(" flags = %lx\n", mddev->queue->queue_flags); > + printk(" unplug_delay = %lu (HZ=%d)\n", mddev->queue->unplug_delay, HZ); > + printk(" timer->expires %lu (jiffies %lu)\n", > + mddev->queue->unplug_timer.expires, jiffies); > + printk(" pending = %d\n", timer_pending(&mddev->queue->unplug_timer)); > + printk(" wq_pending = %lx\n", mddev->queue->unplug_work.pending); > + printk(" wq_entry empty? %d\n", list_empty(&mddev->queue->unplug_work.entry)); > + printk(" unplug=%lu unplug_skip=%lu\n", > + mddev->queue->last_unplug, > + mddev->queue->last_unplug_skip); > + printk(" plug=%lu plug_skip=%lu\n", > + mddev->queue->last_plug, > + mddev->queue->last_plug_skip); > + printk(" unplug_work=%lu unplug_timeout=%lu\n", > + mddev->queue->last_unplug_work, > + mddev->queue->last_unplug_timeout); > + printk(" unplug_seq=%lu plug_seq=%lu\n", > + mddev->queue->unplug_seq, > + mddev->queue->plug_seq); > + } > + conf->last_size = atomic_read(&conf->active_stripes); > + conf->low_size = conf->last_size; > + } > if (conf) > return sprintf(page, "%d\n", atomic_read(&conf->active_stripes)); > else > > diff ./.patches/orig/include/linux/blkdev.h ./include/linux/blkdev.h > --- ./.patches/orig/include/linux/blkdev.h 2006-08-10 17:28:48.000000000 +1000 > +++ ./include/linux/blkdev.h 2006-09-25 11:08:27.000000000 +1000 > @@ -380,6 +380,9 @@ struct request_queue > */ > unsigned long queue_flags; > > + unsigned long last_unplug, last_unplug_skip, last_plug, last_plug_skip; > + unsigned long last_unplug_work, last_unplug_timeout; > + unsigned long unplug_seq, plug_seq; > /* > * protects queue structures from reentrancy. ->__queue_lock should > * _never_ be used directly, it is queue private. always use > > diff ./.patches/orig/include/linux/raid/raid5.h ./include/linux/raid/raid5.h > --- ./.patches/orig/include/linux/raid/raid5.h 2006-08-10 17:28:53.000000000 +1000 > +++ ./include/linux/raid/raid5.h 2006-08-28 14:09:32.000000000 +1000 > @@ -245,6 +245,10 @@ struct raid5_private_data { > > struct page *spare_page; /* Used when checking P/Q in raid6 */ > > + int dotrace; > + int last_size; > + int low_size; > + > /* > * Free stripes pool > */ > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: raid5 hang on get_active_stripe 2006-11-14 10:28 ` Chris Allen @ 2006-11-15 20:39 ` dean gaudet 0 siblings, 0 replies; 43+ messages in thread From: dean gaudet @ 2006-11-15 20:39 UTC (permalink / raw) To: Chris Allen; +Cc: Neil Brown, linux-raid and i haven't seen it either... neil do you think your latest patch was hiding the bug? 'cause there was an iteration of an earlier patch which didn't produce much spam in dmesg but the bug was still there, then there is the version below which spams dmesg a fair amount but i didn't see the bug in ~30 days. btw i've upgraded that box to 2.6.18.2 without the patch (it had some conflicts)... haven't seen the bug yet though (~10 days so far). hmm i wonder if i could reproduce it more rapidly if i lowered /sys/block/mdX/md/stripe_cache_size. i'll give that a go. -dean On Tue, 14 Nov 2006, Chris Allen wrote: > You probably guessed that no matter what I did, I never, ever saw the problem > when your > trace was installed. I'd guess at some obscure timing-related problem. I can > still trigger it > consistently with a vanilla 2.6.17_SMP though, but again only when bitmaps are > turned on. > > > > Neil Brown wrote: > > On Tuesday October 10, chris@cjx.com wrote: > > > > > Very happy to. Let me know what you'd like me to do. > > > > > > > Cool thanks. > > > > At the end is a patch against 2.6.17.11, though it should apply against > > any later 2.6.17 kernel. > > Apply this and reboot. > > > > Then run > > > > while true > > do cat /sys/block/mdX/md/stripe_cache_active > > sleep 10 > > done > /dev/null > > > > (maybe write a little script or whatever). Leave this running. It > > effects the check for "has raid5 hung". Make sure to change "mdX" to > > whatever is appropriate. > > > > Occasionally look in the kernel logs for > > plug problem: > > > > if you find that, send me the surrounding text - there should be about > > a dozen lines following this one. > > > > Hopefully this will let me know which is last thing to happen: a plug > > or an unplug. > > If the last is a "plug", then the timer really should still be > > pending, but isn't (this is impossible). So I'll look more closely at > > that option. > > If the last is an "unplug", then the 'Plugged' flag should really be > > clear but it isn't (this is impossible). So I'll look more closely at > > that option. > > > > Dean is running this, but he only gets the hang every couple of > > weeks. If you get it more often, that would help me a lot. > > > > Thanks, > > NeilBrown > > > > > > diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c > > --- ./.patches/orig/block/ll_rw_blk.c 2006-08-21 09:52:46.000000000 +1000 > > +++ ./block/ll_rw_blk.c 2006-10-05 11:33:32.000000000 +1000 > > @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_ > > * This is called with interrupts off and no requests on the queue and > > * with the queue lock held. 
> > */ > > +static atomic_t seq = ATOMIC_INIT(0); > > void blk_plug_device(request_queue_t *q) > > { > > WARN_ON(!irqs_disabled()); > > @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q) > > return; > > if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { > > + q->last_plug = jiffies; > > + q->plug_seq = atomic_read(&seq); > > + atomic_inc(&seq); > > mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); > > blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); > > - } > > + } else > > + q->last_plug_skip = jiffies; > > + if (!timer_pending(&q->unplug_timer) && > > + !q->unplug_work.pending) > > + printk("Neither Timer or work are pending\n"); > > } > > EXPORT_SYMBOL(blk_plug_device); > > @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q) > > { > > WARN_ON(!irqs_disabled()); > > - if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) > > + if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { > > + q->last_unplug_skip = jiffies; > > return 0; > > + } > > del_timer(&q->unplug_timer); > > + q->last_unplug = jiffies; > > + q->unplug_seq = atomic_read(&seq); > > + atomic_inc(&seq); > > + if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) > > + printk("queue still (or again) plugged\n"); > > return 1; > > } > > @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc > > static void blk_unplug_work(void *data) > > { > > request_queue_t *q = data; > > - > > + q->last_unplug_work = jiffies; > > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, > > q->rq.count[READ] + q->rq.count[WRITE]); > > @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned > > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, > > q->rq.count[READ] + q->rq.count[WRITE]); > > + q->last_unplug_timeout = jiffies; > > kblockd_schedule_work(&q->unplug_work); > > } > > > > diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c > > --- ./.patches/orig/drivers/md/raid1.c 2006-08-10 17:28:01.000000000 > > +1000 > > +++ ./drivers/md/raid1.c 2006-09-04 21:58:31.000000000 +1000 > > @@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev) > > d = conf->raid_disks; > > d--; > > rdev = conf->mirrors[d].rdev; > > - atomic_add(s, > > &rdev->corrected_errors); > > if (rdev && > > test_bit(In_sync, > > &rdev->flags)) { > > if > > (sync_page_io(rdev->bdev, > > @@ -1509,6 +1508,9 @@ static void raid1d(mddev_t *mddev) > > s<<9, > > conf->tmppage, READ) == 0) > > /* Well, this > > device is dead */ > > md_error(mddev, > > rdev); > > + else > > + atomic_add(s, > > &rdev->corrected_errors); > > + > > } > > } > > } else { > > > > diff ./.patches/orig/drivers/md/raid5.c ./drivers/md/raid5.c > > --- ./.patches/orig/drivers/md/raid5.c 2006-08-10 17:28:01.000000000 > > +1000 > > +++ ./drivers/md/raid5.c 2006-09-25 11:11:59.000000000 +1000 > > @@ -93,6 +93,8 @@ static void __release_stripe(raid5_conf_ > > md_wakeup_thread(conf->mddev->thread); > > } > > atomic_dec(&conf->active_stripes); > > + if (atomic_read(&conf->active_stripes) < > > conf->low_size) > > + conf->low_size = > > atomic_read(&conf->active_stripes); > > if (!test_bit(STRIPE_EXPANDING, &sh->state)) { > > list_add_tail(&sh->lru, &conf->inactive_list); > > wake_up(&conf->wait_for_stripe); > > @@ -1677,7 +1679,12 @@ static void raid5_unplug_device(request_ > > spin_lock_irqsave(&conf->device_lock, flags); > > + if (conf->dotrace) > > + printk(" trace unplug_device\n"); > > + > > if (blk_remove_plug(q)) { > > + if (conf->dotrace) > > + printk(" trace remove plug\n"); > > conf->seq_flush++; > > raid5_activate_delayed(conf); > > 
} > > @@ -1721,6 +1728,8 @@ static int raid5_issue_flush(request_que > > static inline void raid5_plug_device(raid5_conf_t *conf) > > { > > spin_lock_irq(&conf->device_lock); > > + if (conf->dotrace) > > + printk(" trace plug device\n"); > > blk_plug_device(conf->mddev->queue); > > spin_unlock_irq(&conf->device_lock); > > } > > @@ -2154,6 +2163,42 @@ static ssize_t > > stripe_cache_active_show(mddev_t *mddev, char *page) > > { > > raid5_conf_t *conf = mddev_to_conf(mddev); > > + > > + /* tracing */ > > + if (conf->dotrace) { > > + conf->dotrace = 0; > > + printk("end trace, force unplug\n"); > > + raid5_unplug_device(mddev->queue); > > + raid5_unplug_device(mddev->queue); > > + } else { > > + if (conf->last_size > conf->max_nr_stripes/2 && > > + conf->last_size <= conf->low_size) { > > + conf->dotrace = 1; > > + printk("plug problem: low_size=%d\n", conf->low_size); > > + printk(" blk_queue_plugged = %d\n", > > blk_queue_plugged(mddev->queue)); > > + printk(" flags = %lx\n", mddev->queue->queue_flags); > > + printk(" unplug_delay = %lu (HZ=%d)\n", > > mddev->queue->unplug_delay, HZ); > > + printk(" timer->expires %lu (jiffies %lu)\n", > > + mddev->queue->unplug_timer.expires, jiffies); > > + printk(" pending = %d\n", > > timer_pending(&mddev->queue->unplug_timer)); > > + printk(" wq_pending = %lx\n", > > mddev->queue->unplug_work.pending); > > + printk(" wq_entry empty? %d\n", > > list_empty(&mddev->queue->unplug_work.entry)); > > + printk(" unplug=%lu unplug_skip=%lu\n", > > + mddev->queue->last_unplug, > > + mddev->queue->last_unplug_skip); > > + printk(" plug=%lu plug_skip=%lu\n", > > + mddev->queue->last_plug, > > + mddev->queue->last_plug_skip); > > + printk(" unplug_work=%lu unplug_timeout=%lu\n", > > + mddev->queue->last_unplug_work, > > + mddev->queue->last_unplug_timeout); > > + printk(" unplug_seq=%lu plug_seq=%lu\n", > > + mddev->queue->unplug_seq, > > + mddev->queue->plug_seq); > > + } > > + conf->last_size = atomic_read(&conf->active_stripes); > > + conf->low_size = conf->last_size; > > + } > > if (conf) > > return sprintf(page, "%d\n", > > atomic_read(&conf->active_stripes)); > > else > > > > diff ./.patches/orig/include/linux/blkdev.h ./include/linux/blkdev.h > > --- ./.patches/orig/include/linux/blkdev.h 2006-08-10 17:28:48.000000000 > > +1000 > > +++ ./include/linux/blkdev.h 2006-09-25 11:08:27.000000000 +1000 > > @@ -380,6 +380,9 @@ struct request_queue > > */ > > unsigned long queue_flags; > > + unsigned long last_unplug, last_unplug_skip, last_plug, > > last_plug_skip; > > + unsigned long last_unplug_work, last_unplug_timeout; > > + unsigned long unplug_seq, plug_seq; > > /* > > * protects queue structures from reentrancy. ->__queue_lock should > > * _never_ be used directly, it is queue private. always use > > > > diff ./.patches/orig/include/linux/raid/raid5.h ./include/linux/raid/raid5.h > > --- ./.patches/orig/include/linux/raid/raid5.h 2006-08-10 > > 17:28:53.000000000 +1000 > > +++ ./include/linux/raid/raid5.h 2006-08-28 14:09:32.000000000 +1000 > > @@ -245,6 +245,10 @@ struct raid5_private_data { > > struct page *spare_page; /* Used when checking P/Q in > > raid6 */ > > + int dotrace; > > + int last_size; > > + int low_size; > > + > > /* > > * Free stripes pool > > */ > > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 43+ messages in thread
* RE: raid5 performance question 2006-03-06 11:46 raid5 performance question Raz Ben-Jehuda(caro) 2006-03-06 11:59 ` Gordon Henderson @ 2006-03-06 22:17 ` Guy 2006-03-06 22:24 ` Neil Brown 2 siblings, 0 replies; 43+ messages in thread From: Guy @ 2006-03-06 22:17 UTC (permalink / raw) To: 'Raz Ben-Jehuda(caro)', 'Linux RAID Mailing List' Cc: 'Neil Brown' Does test 1 have 4 processes? Does test 2 have 1 process? The number of testing processes should be the same in both tests. } -----Original Message----- } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid- } owner@vger.kernel.org] On Behalf Of Raz Ben-Jehuda(caro) } Sent: Monday, March 06, 2006 6:46 AM } To: Linux RAID Mailing List } Cc: Neil Brown } Subject: raid5 performance question } } Neil Hello . } I have a performance question. } } I am using raid5 stripe size 1024K over 4 disks. } I am benchmarking it with an asynchronous tester. } This tester submits 100 IOs of size of 1024 K --> as the stripe size. } It reads raw io from the device, no file system is involved. } } I am making the following comparsion: } } 1. Reading 4 disks at the same time using 1 MB buffer in random manner. } 2. Reading 1 raid5 device using 1MB buffer in random manner. } } I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s } from } 4 disks, the raid5 device gives 35 MB/s . } it is like i am reading a single disk , but by looking at iostat i can } see that all } disks are active but with low throughput. } } Any idea ? } } Thank you. } -- } Raz } - } To unsubscribe from this list: send the line "unsubscribe linux-raid" in } the body of a message to majordomo@vger.kernel.org } More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 43+ messages in thread
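Guy's point can be checked by using the same number of readers in both runs; a rough sketch with dd (device names, sizes and offsets are made up for illustration):

  # scenario 2 with four concurrent readers against the one raid5 device
  for skip in 0 100000 200000 300000
  do
      dd if=/dev/md1 of=/dev/null bs=1M count=1000 skip=$skip &
  done
  wait
  # scenario 1 would start the same four readers, one per member disk, instead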
* Re: raid5 performance question 2006-03-06 11:46 raid5 performance question Raz Ben-Jehuda(caro) 2006-03-06 11:59 ` Gordon Henderson 2006-03-06 22:17 ` raid5 performance question Guy @ 2006-03-06 22:24 ` Neil Brown 2006-03-07 8:40 ` Raz Ben-Jehuda(caro) 2006-03-08 6:45 ` thunder7 2 siblings, 2 replies; 43+ messages in thread
From: Neil Brown @ 2006-03-06 22:24 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List

On Monday March 6, raziebe@gmail.com wrote:
> Neil Hello .
> I have a performance question.
>
> I am using raid5 stripe size 1024K over 4 disks.

I assume you mean a chunksize of 1024K rather than a stripe size.
With a 4 disk array, the stripe size will be 3 times the chunksize,
and so could not possibly be 1024K.

> I am benchmarking it with an asynchronous tester.
> This tester submits 100 IOs of size of 1024 K --> as the stripe size.
> It reads raw io from the device, no file system is involved.
>
> I am making the following comparsion:
>
> 1. Reading 4 disks at the same time using 1 MB buffer in random manner.
> 2. Reading 1 raid5 device using 1MB buffer in random manner.

If your chunk size is 1MB, then you will need larger sequential reads
to get good throughput.

You can also try increasing the size of the stripe cache in
/sys/block/mdX/md/stripe_cache_size

The units are in pages (normally 4K) per device. The default is 256 which fits
only one stripe with a 1 Meg chunk size.

Try 1024 ?

NeilBrown

>
> I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from
> 4 disks, the raid5 device gives 35 MB/s .
> it is like i am reading a single disk , but by looking at iostat i can
> see that all
> disks are active but with low throughput.
>
> Any idea ?
>
> Thank you.
> --
> Raz
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 43+ messages in thread
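Working Neil's numbers through for this particular array (4 disks, 1024K chunk; md1 is the device name used later in the thread):

  # one full stripe = 3 data chunks + 1 parity chunk = 3 * 1024K of data
  # stripe_cache_size is counted in 4K pages per member device:
  #   256 (default) * 4K = 1M per device, i.e. about one chunk, one stripe
  #   1024          * 4K = 4M per device, i.e. room for about four stripes
  echo 1024 > /sys/block/md1/md/stripe_cache_size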
* Re: raid5 performance question
  2006-03-06 22:24 ` Neil Brown
@ 2006-03-07  8:40 ` Raz Ben-Jehuda(caro)
  2006-03-07 23:03 ` Neil Brown
  2006-03-08  6:45 ` thunder7
  1 sibling, 1 reply; 43+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-03-07 8:40 UTC (permalink / raw)
To: Neil Brown; +Cc: Linux RAID Mailing List

Neil,
what is the stripe_cache exactly?

First, here are some numbers:

Setting it to 1024 gives me 85 MB/s.
Setting it to 4096 gives me 105 MB/s.
Setting it to 8192 gives me 115 MB/s.

md.txt does not say much about it, only that it is the number of
entries.

Here are some tests I have made:

Test 1:
When I set the stripe_cache to zero and run
"dd if=/dev/md1 of=/dev/zero bs=1M count=100000 skip=630000"
I am getting 120 MB/s.
When I set the stripe cache to 4096 and issue the same command I am
getting 120 MB/s as well.

Test 2:
Here is what this tester does:

It opens N descriptors over a device.
It issues N IOs to the target and waits for the completion of each IO.
When an IO completes the tester has two choices:

1. Calculate a new random seek position over the target.

2. Move sequentially to the next position; that is, if one reads a 1MB
   buffer, the next position is current+1M.

I am using direct IO and asynchronous IO.

Option 1 simulates non-contiguous files; option 2 simulates contiguous
files.  The numbers above were made with option 2.
If I use option 1 I am getting 95 MB/s with stripe_cache_size=4096.

A single disk in this manner (option 1) gives ~28 MB/s.
A single disk in scenario 2 gives ~30 MB/s.

I understand that the IO distribution is something to talk about, but
I am submitting 250 IOs, so the load on the raid should be heavy.

Questions:
1. How can the stripe cache give me a boost when I have totally random
   access to the disk?

2. Does direct IO pass through this cache?

3. How can a dd of 1MB over a 1MB chunk size achieve the high
   throughput of 4 disks even if it does not get the stripe cache
   benefits?

Thank you,
Raz

On 3/7/06, Neil Brown <neilb@suse.de> wrote:
> I assume you mean a chunk size of 1024K rather than a stripe size.
> With a 4-disk array, the stripe size will be 3 times the chunk size,
> and so could not possibly be 1024K.
>
> If your chunk size is 1MB, then you will need larger sequential reads
> to get good throughput.
>
> You can also try increasing the size of the stripe cache in
> /sys/block/mdX/md/stripe_cache_size
>
> The units are pages (normally 4K) per device.  The default is 256,
> which fits only one stripe with a 1 MB chunk size.
>
> Try 1024?
>
> NeilBrown

--
Raz

^ permalink raw reply	[flat|nested] 43+ messages in thread
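To make the two access patterns concrete, here is a deliberately simplified
stand-in for the tester described above. It is not the original tool: it uses
ordinary buffered pread() calls in threads rather than asynchronous direct IO,
and the device path and sizes are placeholders. It only illustrates the
difference between random repositioning (option 1) and sequential advance
(option 2).

import os
import random
import threading

DEV = "/dev/md1"          # placeholder target device
IO_SIZE = 1 << 20         # 1 MiB per read, as in the tests above
N_WORKERS = 8             # stands in for the "N descriptors"
READS_PER_WORKER = 100

def worker(fd, dev_size, random_mode, totals):
    pos = random.randrange(0, dev_size - IO_SIZE)
    done = 0
    for _ in range(READS_PER_WORKER):
        done += len(os.pread(fd, IO_SIZE, pos))
        if random_mode:                       # option 1: non-contiguous
            pos = random.randrange(0, dev_size - IO_SIZE)
        else:                                 # option 2: contiguous
            pos += IO_SIZE
            if pos + IO_SIZE > dev_size:
                pos = 0
    totals.append(done)

def run(random_mode):
    fd = os.open(DEV, os.O_RDONLY)
    dev_size = os.lseek(fd, 0, os.SEEK_END)
    totals, threads = [], []
    for _ in range(N_WORKERS):
        t = threading.Thread(target=worker,
                             args=(fd, dev_size, random_mode, totals))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    os.close(fd)
    return sum(totals)

Timing run(True) against run(False) reproduces the option 1 versus option 2
comparison, although absolute numbers from buffered reads will not match an
AIO plus O_DIRECT engine.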
* Re: raid5 performance question
  2006-03-07  8:40 ` Raz Ben-Jehuda(caro)
@ 2006-03-07 23:03 ` Neil Brown
  2006-03-22 13:22 ` Bill Davidsen
  0 siblings, 1 reply; 43+ messages in thread
From: Neil Brown @ 2006-03-07 23:03 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List

On Tuesday March 7, raziebe@gmail.com wrote:
> Neil,
> what is the stripe_cache exactly?

In order to ensure correctness of data, all IO operations on a raid5
pass through the 'stripe cache'.  This is a cache of stripes where each
stripe is one page wide across all devices.

e.g. to write a block, we allocate one stripe in the cache to cover
that block, pre-read anything that might be needed, copy in the new
data and update the parity, and write out anything that has changed.

Similarly, to read, we allocate a stripe to cover the block, read in
the required parts, and copy out of the stripe cache into the
destination.

Requiring all reads to pass through the stripe cache is not strictly
necessary, but it keeps the code a lot easier to manage (fewer special
cases).  Bypassing the cache for simple read requests when the array
is non-degraded is on my list....

> First, here are some numbers:
>
> Setting it to 1024 gives me 85 MB/s.
> Setting it to 4096 gives me 105 MB/s.
> Setting it to 8192 gives me 115 MB/s.

Not surprisingly, a larger cache gives better throughput as it allows
more parallelism.  There is probably a link between optimal cache size
and chunk size.

> md.txt does not say much about it, only that it is the number of
> entries.

No.  I should fix that.

> Here are some tests I have made:
>
> Test 1:
> When I set the stripe_cache to zero and run

Setting it to zero is a no-op.  Only values from 17 to 32768 are
permitted.

> "dd if=/dev/md1 of=/dev/zero bs=1M count=100000 skip=630000"
> I am getting 120 MB/s.
> When I set the stripe cache to 4096 and issue the same command I am
> getting 120 MB/s as well.

This sort of operation will cause the kernel's read-ahead to keep the
drives reading constantly.  Provided the stripe cache is large enough
to hold 2 full chunk-sized stripes, you should get very good
throughput.

> Test 2:
> It opens N descriptors over a device.
> It issues N IOs to the target and waits for the completion of each IO.
> When an IO completes the tester has two choices:
>
> 1. Calculate a new random seek position over the target.
> 2. Move sequentially to the next position; that is, if one reads a 1MB
>    buffer, the next position is current+1M.
>
> I am using direct IO and asynchronous IO.
>
> Option 1 simulates non-contiguous files; option 2 simulates contiguous
> files.  The numbers above were made with option 2.
> If I use option 1 I am getting 95 MB/s with stripe_cache_size=4096.
>
> A single disk in this manner (option 1) gives ~28 MB/s.
> A single disk in scenario 2 gives ~30 MB/s.
>
> I understand that the IO distribution is something to talk about, but
> I am submitting 250 IOs, so the load on the raid should be heavy.
>
> Questions:
> 1. How can the stripe cache give me a boost when I have totally random
>    access to the disk?

It doesn't give you a boost exactly.  It is just that a small cache can
get in your way by reducing the possible parallelism.

> 2. Does direct IO pass through this cache?

Yes.  Everything does.

> 3. How can a dd of 1MB over a 1MB chunk size achieve the high
>    throughput of 4 disks even if it does not get the stripe cache
>    benefits?

Read-ahead performed by the kernel.

NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread
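A little arithmetic connects the numbers in this exchange: with 4 KiB pages, a
1 MiB chunk spans 256 cache entries per device, so the default
stripe_cache_size of 256 holds exactly one chunk-deep stripe, while the larger
values tried above hold several. The helper below is only that arithmetic,
derived from the description of the cache; it is not code from the md driver.

PAGE_SIZE = 4096

def chunk_stripes_cached(stripe_cache_size, chunk_size):
    # Each stripe-cache entry covers one page per device, so a chunk column
    # occupies chunk_size / PAGE_SIZE entries.
    pages_per_chunk = chunk_size // PAGE_SIZE
    return stripe_cache_size // pages_per_chunk

for entries in (256, 1024, 4096, 8192):
    print(entries, "->", chunk_stripes_cached(entries, 1 << 20),
          "chunk-deep stripe(s) with a 1 MiB chunk")
# prints: 256 -> 1, 1024 -> 4, 4096 -> 16, 8192 -> 32

More cached chunk-deep stripes means more 1 MiB requests can be serviced
concurrently, which is consistent with the throughput trend reported above.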
* Re: raid5 performance question
  2006-03-07 23:03 ` Neil Brown
@ 2006-03-22 13:22 ` Bill Davidsen
  2006-03-24  4:40 ` Neil Brown
  0 siblings, 1 reply; 43+ messages in thread
From: Bill Davidsen @ 2006-03-22 13:22 UTC (permalink / raw)
To: Neil Brown; +Cc: Raz Ben-Jehuda(caro), Linux RAID Mailing List

Neil Brown wrote:
> On Tuesday March 7, raziebe@gmail.com wrote:
> > Neil,
> > what is the stripe_cache exactly?
>
> In order to ensure correctness of data, all IO operations on a raid5
> pass through the 'stripe cache'.  This is a cache of stripes where
> each stripe is one page wide across all devices.
>
> e.g. to write a block, we allocate one stripe in the cache to cover
> that block, pre-read anything that might be needed, copy in the new
> data and update the parity, and write out anything that has changed.

I can see that you would have to read the old data and parity blocks
for RAID-5; I assume that's what you mean by "might be needed", and not
a read of every drive to get the data to rebuild the parity from
scratch.  That would be not only slower, but would require complex
error recovery on an error reading unneeded data.

> Similarly, to read, we allocate a stripe to cover the block, read in
> the required parts, and copy out of the stripe cache into the
> destination.
>
> Requiring all reads to pass through the stripe cache is not strictly
> necessary, but it keeps the code a lot easier to manage (fewer special
> cases).  Bypassing the cache for simple read requests when the array
> is non-degraded is on my list....

It sounds as if you do a memory copy with each read, even when a read
straight into the user buffer would be possible.  Hopefully I'm reading
that wrong.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: raid5 performance question
  2006-03-22 13:22 ` Bill Davidsen
@ 2006-03-24  4:40 ` Neil Brown
  0 siblings, 0 replies; 43+ messages in thread
From: Neil Brown @ 2006-03-24 4:40 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Raz Ben-Jehuda(caro), Linux RAID Mailing List

On Wednesday March 22, davidsen@tmr.com wrote:
> Neil Brown wrote:
> > In order to ensure correctness of data, all IO operations on a raid5
> > pass through the 'stripe cache'.  This is a cache of stripes where
> > each stripe is one page wide across all devices.
> >
> > e.g. to write a block, we allocate one stripe in the cache to cover
> > that block, pre-read anything that might be needed, copy in the new
> > data and update the parity, and write out anything that has changed.
>
> I can see that you would have to read the old data and parity blocks
> for RAID-5; I assume that's what you mean by "might be needed", and
> not a read of every drive to get the data to rebuild the parity from
> scratch.  That would be not only slower, but would require complex
> error recovery on an error reading unneeded data.

"Might be needed" because sometimes raid5 reads the old copies of the
blocks it is about to over-write, and sometimes it instead reads all
the blocks that it is NOT going to over-write.  And if it is
over-writing every block in the stripe, it doesn't need to read
anything.

> > Similarly, to read, we allocate a stripe to cover the block, read in
> > the required parts, and copy out of the stripe cache into the
> > destination.
> >
> > Requiring all reads to pass through the stripe cache is not strictly
> > necessary, but it keeps the code a lot easier to manage (fewer
> > special cases).  Bypassing the cache for simple read requests when
> > the array is non-degraded is on my list....
>
> It sounds as if you do a memory copy with each read, even when a read
> straight into the user buffer would be possible.  Hopefully I'm
> reading that wrong.

Unfortunately you are reading it correctly.

NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread
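The write-path distinction Neil sketches (read-modify-write versus
reconstruct-write versus a full-stripe write) boils down to comparing how many
pre-reads each path needs. The function below is a schematic of that
comparison only; the real raid5 code weighs other factors (degraded arrays,
blocks already cached, and so on), so the tie-breaking and the names here are
illustrative, not the kernel's actual heuristic.

# Schematic of the write-path choice described above (single-parity raid5).
# n_data = data blocks per stripe (disks - 1); k = blocks being overwritten.

def raid5_write_plan(n_data, k):
    if k == n_data:
        return "full-stripe write: read nothing, compute parity from new data"
    rmw_reads = k + 1          # old copies of the k blocks, plus the old parity
    rcw_reads = n_data - k     # the blocks that are NOT being overwritten
    if rmw_reads <= rcw_reads:
        return "read-modify-write: pre-read %d block(s)" % rmw_reads
    return "reconstruct-write: pre-read %d block(s)" % rcw_reads

for k in range(1, 4):
    print("%d of 3 data blocks -> %s" % (k, raid5_write_plan(3, k)))

With 3 data disks, overwriting a single chunk pre-reads two blocks either way,
while a full-stripe write avoids pre-reads entirely, which is one reason large
sequential writes behave so much better on raid5 than small scattered ones.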
* Re: raid5 performance question
  2006-03-06 22:24 ` Neil Brown
  2006-03-07  8:40 ` Raz Ben-Jehuda(caro)
@ 2006-03-08  6:45 ` thunder7
  1 sibling, 0 replies; 43+ messages in thread
From: thunder7 @ 2006-03-08 6:45 UTC (permalink / raw)
To: Linux RAID Mailing List

From: Neil Brown <neilb@suse.de>
Date: Tue, Mar 07, 2006 at 09:24:26AM +1100
> You can also try increasing the size of the stripe cache in
> /sys/block/mdX/md/stripe_cache_size
>
> The units are pages (normally 4K) per device.  The default is 256,
> which fits only one stripe with a 1 MB chunk size.
>
> Try 1024?
>
Interesting. I noticed I don't have such a file for my raid6 device.
Can you explain why? I thought raid6 and raid5 worked a lot like each
other.

Thanks,
Jurriaan
--
So I must behave with haste and dishonor to serve the political
ambitions of men that I never vowed, before the Lord, to serve and
honor.  I must be pretty and delicate and happy and perfect; I must
accord his passing no mourning, and no loss.
	Michelle West - The Broken Crown
Debian (Unstable) GNU/Linux 2.6.16-rc5-mm2 5503 bogomips load 0.06

^ permalink raw reply	[flat|nested] 43+ messages in thread