* Reason for md raid 01 blksize limited to 4 KiB?
@ 2012-05-21 8:43 Sebastian Riemer
2012-05-21 23:14 ` Stan Hoeppner
2012-05-21 23:28 ` NeilBrown
0 siblings, 2 replies; 11+ messages in thread
From: Sebastian Riemer @ 2012-05-21 8:43 UTC (permalink / raw)
To: linux-raid
Hi list,
I'm wondering why stacking raid1 above raid0 limits the block sizes in
the blkio queue to 4 KiB both read and write.
The max_sectors_kb is at 512. So it's not a matter of limits.
Could someone explain, please? Or could someone point me to the
relevant location in the source code?
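For reference, this is roughly how the stacked queue limits can be
inspected (a sketch; the md device names are illustrative):
# Sketch: inspect the queue limits of the stacked md devices
# (md100/md200 = RAID0 legs, md300 = RAID1 on top; names illustrative).
for dev in md100 md200 md300; do
    echo "=== $dev ==="
    grep . /sys/block/$dev/queue/max_sectors_kb \
           /sys/block/$dev/queue/max_hw_sectors_kb \
           /sys/block/$dev/queue/max_segments
done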
We've thought of using this for replication via InfiniBand/SRP. 4 KiB
chunks are completely inefficient with SRP. We wanted to do this with
DRBD first, but this is also extremely inefficient, because of chunk
sizes in the blkio queue.
I can reproduce the small 4 KiB chunks also in a file copy benchmark
with raid 01 on ram disks.
Cheers,
Sebastian
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-21 8:43 Reason for md raid 01 blksize limited to 4 KiB? Sebastian Riemer
@ 2012-05-21 23:14 ` Stan Hoeppner
2012-05-21 23:28 ` NeilBrown
1 sibling, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2012-05-21 23:14 UTC (permalink / raw)
To: Sebastian Riemer; +Cc: linux-raid
On 5/21/2012 3:43 AM, Sebastian Riemer wrote:
> Hi list,
>
> I'm wondering why stacking raid1 above raid0 limits the block sizes in
> the blkio queue to 4 KiB both read and write.
Likely because the developers only considered RAID 1 being used in a 2-,
3-, or maybe even 4-disk array of local disks. With "standard"
storage configurations, nobody in his/her right mind would consider
mirroring two RAID 0 arrays--they'd go the opposite route, either RAID
1+0 or RAID 10. You have a unique use case.
And related to this, you may want to read my thread of earlier today
about thread/CPU core scalability WRT RAID 1. Even if you massage the
blkio problem away, you may then run into a CPU ceiling trying to push
that much data through a single RAID 1 thread.
> The max_sectors_kb is at 512. So it's not a matter of limits.
>
> Could someone explain, please? Or could someone point me to the
> relevant location in the source code?
> We've thought of using this for replication via InfiniBand/SRP. 4 KiB
> chunks are completely inefficient with SRP. We wanted to do this with
> DRBD first, but this is also extremely inefficient, because of chunk
> sizes in the blkio queue.
The InfiniBand maximum MTU is 4K, for a 1:1 ratio with the md RAID 1 blocks
pushed down the stack. Thus I'm failing to see the efficiency problem.
Is this a packet stuffing issue?
Are you using SRP or iSER?
> I can reproduce the small 4 KiB chunks also in a file copy benchmark
> with raid 01 on ram disks.
This is probably related to the Linux page size which is limited to 4K
on x86. On IA64 you can go up to 16M pages. What limit are you seeing
for the RAID 0 array blkio chunks?
--
Stan
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-21 8:43 Reason for md raid 01 blksize limited to 4 KiB? Sebastian Riemer
2012-05-21 23:14 ` Stan Hoeppner
@ 2012-05-21 23:28 ` NeilBrown
2012-05-25 12:35 ` Sebastian Riemer
1 sibling, 1 reply; 11+ messages in thread
From: NeilBrown @ 2012-05-21 23:28 UTC (permalink / raw)
To: Sebastian Riemer; +Cc: linux-raid
On Mon, 21 May 2012 10:43:51 +0200 Sebastian Riemer
<sebastian.riemer@profitbricks.com> wrote:
> Hi list,
>
> I'm wondering why stacking raid1 above raid0 limits the block sizes in
> the blkio queue to 4 KiB both read and write.
>
> The max_sectors_kb is at 512. So it's not a matter of limits.
>
> Could someone explain, please? Or could someone point me to the
> relevant location in the source code?
>
> We've thought of using this for replication via InfiniBand/SRP. 4 KiB
> chunks are completely inefficient with SRP. We wanted to do this with
> DRBD first, but this is also extremely inefficient, because of chunk
> sizes in the blkio queue.
>
> I can reproduce the small 4 KiB chunks also in a file copy benchmark
> with raid 01 on ram disks.
This should be fixed in linux 3.4 with commit 6b740b8d79252f13bcb7e5d3c1d
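(To check whether a given kernel tree already contains that commit, a
quick sketch using the abbreviated id quoted above:)
# Run inside a linux.git clone; shows the first release tag containing the fix.
git describe --contains 6b740b8d79252f13bcb7e5d3c1d
git log --oneline -1 6b740b8d79252f13bcb7e5d3c1d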
NeilBrown
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-21 23:28 ` NeilBrown
@ 2012-05-25 12:35 ` Sebastian Riemer
2012-05-28 4:05 ` NeilBrown
0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Riemer @ 2012-05-25 12:35 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Hi Neil,
On 22/05/12 01:28, NeilBrown wrote:
>
> This should be fixed in linux 3.4 with commit 6b740b8d79252f13bcb7e5d3c1d
>
I've tested the RAID 01 with kernel 3.4 and it isn't fixed. It is even
worse, because direct IO doesn't work any more on the raid1 device (with
kernel 3.2 it worked).
There are still 4k chunks which aren't merged in the raid0 devices below
(blkparse -i md100 -i md200 -i md300 | less).
Could you also check this on your setup, please?
Cheers,
Sebastian
Btw. this is my test script:
#!/bin/bash
if [ "`lsmod | grep brd`" == "" ]; then
modprobe brd rd_nr=4 rd_size=524288
fi
mdadm -C /dev/md100 --force --assume-clean -n 2 -l raid0 /dev/ram0 /dev/ram1
mdadm -C /dev/md200 --force --assume-clean -n 2 -l raid0 /dev/ram2 /dev/ram3
blktrace /dev/md100 &
pid=$!
dd if=/dev/zero of=/dev/md100 bs=1M oflag=direct
kill -2 $pid
blktrace /dev/md200 &
pid=$!
dd if=/dev/zero of=/dev/md200 bs=1M oflag=direct
kill -2 $pid
mv md100* r0_only/
mv md200* r0_only/
mdadm -C /dev/md300 --force --assume-clean -n 2 -l raid1 /dev/md100 /dev/md200
blktrace -d /dev/md100 -d /dev/md200 -d /dev/md300 -b 4096 &
pid=$!
# Kernel 3.4 doesn't support direct IO on the md300 device
dd if=/dev/zero of=/dev/md300 bs=1M
kill -2 $pid
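To see whether the 4k requests get merged further down, the traces can
then be summarized roughly like this (a sketch; the field positions
assume blkparse's default output format):
# Sketch: histogram of request sizes (in sectors) seen by each device.
# With blkparse's default format, field 6 is the action ('Q' = bio queued
# to the device) and field 10 is the request size in sectors.
blkparse -i md100 -i md200 -i md300 | \
    awk '$6 == "Q" { sizes[$10]++ } END { for (s in sizes) print s, sizes[s] }' | \
    sort -n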
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-25 12:35 ` Sebastian Riemer
@ 2012-05-28 4:05 ` NeilBrown
2012-05-29 9:30 ` Sebastian Riemer
0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2012-05-28 4:05 UTC (permalink / raw)
To: Sebastian Riemer; +Cc: linux-raid
On Fri, 25 May 2012 14:35:57 +0200 Sebastian Riemer
<sebastian.riemer@profitbricks.com> wrote:
> Hi Neil,
>
> On 22/05/12 01:28, NeilBrown wrote:
> >
> > This should be fixed in linux 3.4 with commit 6b740b8d79252f13bcb7e5d3c1d
> >
>
> I've tested the RAID 01 with kernel 3.4 and it isn't fixed. It is even
> worse, because direct IO doesn't work any more on the raid1 device (with
> kernel 3.2 it worked).
What do you mean by "doesn't work"? Returns errors? crashes? hangs? kills
your cat?
It works for me.
When I use 32K direct writes to a RAID1, both underlying RAID0 arrays see
64-sector writes.
(when I do normal buffered writes I see 8-sector writes which seems odd,
but clearly md/RAID1 is allowing large writes through)
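(Roughly how that can be reproduced, as a sketch reusing the device
names from the test script earlier in the thread:)
# Sketch: 32K direct writes to the RAID1, then look at the request sizes
# seen by the two RAID0 legs (device names taken from the earlier script).
blktrace -d /dev/md100 -d /dev/md200 &
pid=$!
dd if=/dev/zero of=/dev/md300 bs=32k count=4096 oflag=direct
kill -2 $pid
blkparse -i md100 -i md200 | less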
NeilBrown
>
> There are still 4k chunks which aren't merged in the raid0 devices below
> (blkparse -i md100 -i md200 -i md300 | less).
> Could you also check this on your setup, please?
>
> Cheers,
> Sebastian
>
>
> Btw. this is my test script:
>
> #!/bin/bash
> if [ "`lsmod | grep brd`" == "" ]; then
> modprobe brd rd_nr=4 rd_size=524288
> fi
> mdadm -C /dev/md100 --force --assume-clean -n 2 -l raid0 /dev/ram0 /dev/ram1
> mdadm -C /dev/md200 --force --assume-clean -n 2 -l raid0 /dev/ram2 /dev/ram3
> blktrace /dev/md100 &
> pid=$!
> dd if=/dev/zero of=/dev/md100 bs=1M oflag=direct
> kill -2 $pid
> blktrace /dev/md200 &
> pid=$!
> dd if=/dev/zero of=/dev/md200 bs=1M oflag=direct
> kill -2 $pid
> mv md100* r0_only/
> mv md200* r0_only/
>
> mdadm -C /dev/md300 --force --assume-clean -n 2 -l raid1 /dev/md100 /dev/md200
> blktrace -d /dev/md100 -d /dev/md200 -d /dev/md300 -b 4096 &
> pid=$!
> # Kernel 3.4 doesn't support direct IO on the md300 device
> dd if=/dev/zero of=/dev/md300 bs=1M
> kill -2 $pid
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-28 4:05 ` NeilBrown
@ 2012-05-29 9:30 ` Sebastian Riemer
2012-05-29 10:25 ` NeilBrown
0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Riemer @ 2012-05-29 9:30 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 28/05/12 06:05, NeilBrown wrote:
> What do you mean by "doesn't work"? Returns errors? crashes? hangs? kills
> your cat?
>
dd with oflag=direct returned an IO error.
> It works for me.
>
> When I use 32K direct writes to a RAID1, both underlying RAID0 arrays see
> 64-sector writes.
>
> (when I do normal buffered writes I see 8-sector writes which seems odd,
> but clearly md/RAID1 is allowing large writes through)
>
Now, I've updated mdadm to version 3.2.5 and it works like you've
described it. Thanks for your help! But the buffered IO is what matters.
4k isn't enough there. Please inform me about changes which increase the
size in buffered IO. I'll have a look at this, too.
Cheers,
Sebastian
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-29 9:30 ` Sebastian Riemer
@ 2012-05-29 10:25 ` NeilBrown
2012-05-30 13:03 ` Sebastian Riemer
0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2012-05-29 10:25 UTC (permalink / raw)
To: Sebastian Riemer; +Cc: linux-raid
On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
<sebastian.riemer@profitbricks.com> wrote:
> On 28/05/12 06:05, NeilBrown wrote:
> > What do you mean by "doesn't work"? Returns errors? crashes? hangs? kills
> > your cat?
> >
>
> dd with oflag=direct returned an IO error.
Odd. Updating mdadm shouldn't affect that.
>
> > It works for me.
> >
> > When I use 32K direct writes to a RAID1, both underlying RAID0 arrays see
> > 64-sector writes.
> >
> > (when I do normal buffered writes I see 8-sector writes which seems odd,
> > but clearly md/RAID1 is allowing large writes through)
> >
>
> Now, I've updated mdadm to version 3.2.5 and it works like you've
> described it. Thanks for your help! But the buffered IO is what matters.
> 4k isn't enough there. Please inform me about changes which increase the
> size in buffered IO. I'll have a look at this, too.
I don't know. I'd have to dive into the code and look around and put a few
printks in to see what is happening.
NeilBrown
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-29 10:25 ` NeilBrown
@ 2012-05-30 13:03 ` Sebastian Riemer
2012-05-31 5:42 ` NeilBrown
0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Riemer @ 2012-05-30 13:03 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 29/05/12 12:25, NeilBrown wrote:
> On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
> <sebastian.riemer@profitbricks.com> wrote:
>> Now, I've updated mdadm to version 3.2.5 and it works like you've
>> described it. Thanks for your help! But the buffered IO is what matters.
>> 4k isn't enough there. Please inform me about changes which increase the
>> size in buffered IO. I'll have a look at this, too.
>
> I don't know. I'd have to dive into the code and look around and put a few
> printks in to see what is happening.
Now, I've configured a storage server with real HDDs for testing the
cached IO with kernel 3.4. Here direct IO doesn't work at all
(Input/Output error with dd/fio), and cached IO is totally slow. My
RAID0 devices are md100 and md200. The RAID1 on top is md300.
The md100 is reported as a "faulty spare", and this has hit the following
kernel bug.
This is the debug output:
md/raid0:md100: make_request bug: can't convert block across chunks or
bigger than 512k 541312 320
md/raid0:md200: make_request bug: can't convert block across chunks or
bigger than 512k 541312 320
md/raid1:md300: Disk failure on md100, disabling device.
md/raid1:md300: Operation continuing on 1 devices.
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:1, o:0, dev:md100
disk 1, wo:0, o:1, dev:md200
RAID1 conf printout:
--- wd:1 rd:2
disk 1, wo:0, o:1, dev:md200
md/raid0:md200: make_request bug: can't convert block across chunks or
bigger than 512k 2704000 320
The chunk size of 320 KiB comes from max_sectors_kb of the LSI HW RAID
controller where the drives are passed through as single drive RAID0
logical devices. I guess this is a problem for MD RAID0 underneath the
RAID1, because 320 KiB doesn't fit evenly into the 512 KiB stripe size.
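For reference, roughly how the two limits can be compared (a sketch; the
sdX/mdX names are illustrative):
# Limit inherited from the LSI controller vs. the md geometry.
cat /sys/block/sda/queue/max_sectors_kb       # 320 on this controller
cat /sys/block/md100/queue/max_sectors_kb
mdadm --detail /dev/md100 | grep -i chunk     # expected 512K here
mdadm --detail /dev/md300 | grep -iE 'level|chunk'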
Cheers,
Sebastian
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-30 13:03 ` Sebastian Riemer
@ 2012-05-31 5:42 ` NeilBrown
2012-05-31 6:18 ` Yuanhan Liu
2012-05-31 10:26 ` Sebastian Riemer
0 siblings, 2 replies; 11+ messages in thread
From: NeilBrown @ 2012-05-31 5:42 UTC (permalink / raw)
To: Sebastian Riemer; +Cc: linux-raid
On Wed, 30 May 2012 15:03:16 +0200 Sebastian Riemer
<sebastian.riemer@profitbricks.com> wrote:
> On 29/05/12 12:25, NeilBrown wrote:
> > On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
> > <sebastian.riemer@profitbricks.com> wrote:
> >> Now, I've updated mdadm to version 3.2.5 and it works like you've
> >> described it. Thanks for your help! But the buffered IO is what matters.
> >> 4k isn't enough there. Please inform me about changes which increase the
> >> size in buffered IO. I'll have a look at this, too.
> >
> > I don't know. I'd have to dive into the code and look around and put a few
> > printks in to see what is happening.
>
> Now, I've configured a storage server with real HDDs for testing the
> cached IO with kernel 3.4. Here direct IO doesn't work at all
> (Input/Output error with dd/fio), and cached IO is totally slow. My
> RAID0 devices are md100 and md200. The RAID1 on top is md300.
>
> The md100 is reported as a "faulty spare", and this has hit the following
> kernel bug.
>
> This is the debug output:
>
> md/raid0:md100: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid1:md300: Disk failure on md100, disabling device.
> md/raid1:md300: Operation continuing on 1 devices.
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 0, wo:1, o:0, dev:md100
> disk 1, wo:0, o:1, dev:md200
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 1, wo:0, o:1, dev:md200
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 2704000 320
>
> The chunk size of 320 KiB comes from max_sectors_kb of the LSI HW RAID
> controller where the drives are passed through as single drive RAID0
> logical devices. I guess this is a problem for MD RAID0 underneath the
> RAID1, because 320 KiB doesn't fit evenly into the 512 KiB stripe size.
Hmmm... that's bad. Looks like I have a bug .... yes I do. Patch below
fixes it. If you could test and confirm, I would appreciate it.
As for the cached writes being always 4K - are you writing through a
filesystem or directly to /dev/md300??
If the former, it is a bug in that filesystem.
If the latter, it is a bug in fs/block_dev.c.
In particular, fs/block_dev.c uses "generic_writepages" for the
"writepages" method rather than "mpage_writepages" (or a wrapper which
calls it with appropriate args).
'generic_writepages' simply calls ->writepage on each dirty page.
mpage_writepages (used e.g. by ext2) collects multiple pages into
a single bio.
The elevator at the device level should still collect these 1-page bios into
larger requests, but I guess that has higher CPU overhead.
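One way to see the difference, as a rough sketch (buffered writes to the
bare device versus through ext2, which uses mpage_writepages; names and
sizes are illustrative):
# Sketch: buffered writes straight to the md device ...
blktrace -d /dev/md300 -o bare &
pid=$!
dd if=/dev/zero of=/dev/md300 bs=1M count=256; sync
kill -2 $pid
# ... versus buffered writes through an ext2 filesystem on the same device.
mkfs.ext2 -q /dev/md300
mount /dev/md300 /mnt
blktrace -d /dev/md300 -o ext2 &
pid=$!
dd if=/dev/zero of=/mnt/bigfile bs=1M count=256; sync
kill -2 $pid
umount /mnt
blkparse -i bare -i ext2 | less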
thanks for the report.
NeilBrown
From dd47a247ae226896205f753ad246cd40141aadf1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 31 May 2012 15:39:11 +1000
Subject: [PATCH] md: raid1/raid10: fix problem with merge_bvec_fn
The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.
However were were only setting that when a device was hot-added,
not when a device was present from the start.
This bug was introduced in 3.4 so patch is suitable for 3.4.y
kernels.
Cc: stable@vger.kernel.org
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 15dd59b..d7e9577 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2548,6 +2548,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
err = -EINVAL;
spin_lock_init(&conf->device_lock);
rdev_for_each(rdev, mddev) {
+ struct request_queue *q;
int disk_idx = rdev->raid_disk;
if (disk_idx >= mddev->raid_disks
|| disk_idx < 0)
@@ -2560,6 +2561,9 @@ static struct r1conf *setup_conf(struct mddev *mddev)
if (disk->rdev)
goto abort;
disk->rdev = rdev;
+ q = bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn)
+ mddev->merge_check_needed = 1;
disk->head_position = 0;
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3f91c2e..d037adb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3311,7 +3311,7 @@ static int run(struct mddev *mddev)
(conf->raid_disks / conf->near_copies));
rdev_for_each(rdev, mddev) {
-
+ struct request_queue *q;
disk_idx = rdev->raid_disk;
if (disk_idx >= conf->raid_disks
|| disk_idx < 0)
@@ -3327,6 +3327,9 @@ static int run(struct mddev *mddev)
goto out_free_conf;
disk->rdev = rdev;
}
+ q = bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn)
+ mddev->merge_check_needed = 1;
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-31 5:42 ` NeilBrown
@ 2012-05-31 6:18 ` Yuanhan Liu
2012-05-31 10:26 ` Sebastian Riemer
1 sibling, 0 replies; 11+ messages in thread
From: Yuanhan Liu @ 2012-05-31 6:18 UTC (permalink / raw)
To: NeilBrown; +Cc: Sebastian Riemer, linux-raid
On Thu, May 31, 2012 at 03:42:56PM +1000, NeilBrown wrote:
> On Wed, 30 May 2012 15:03:16 +0200 Sebastian Riemer
> <sebastian.riemer@profitbricks.com> wrote:
>
> > On 29/05/12 12:25, NeilBrown wrote:
> > > On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
> > > <sebastian.riemer@profitbricks.com> wrote:
> > >> Now, I've updated mdadm to version 3.2.5 and it works like you've
[snip]...
> The elevator at the device level should still collect these 1-page bios into
> larger requests, but I guess that has higher CPU overhead.
>
> thanks for the report.
>
> NeilBrown
>
> From dd47a247ae226896205f753ad246cd40141aadf1 Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Thu, 31 May 2012 15:39:11 +1000
> Subject: [PATCH] md: raid1/raid10: fix problem with merge_bvec_fn
>
> The new merge_bvec_fn which calls the corresponding function
> in subsidiary devices requires that mddev->merge_check_needed
> be set if any child has a merge_bvec_fn.
>
> However were were only setting that when a device was hot-added,
^^^^
I guess there is a typo here.
> not when a device was present from the start.
>
> This bug was introduced in 3.4 so patch is suitable for 3.4.y
> kernels.
And, Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>
> Cc: stable@vger.kernel.org
> Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 15dd59b..d7e9577 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2548,6 +2548,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
> err = -EINVAL;
> spin_lock_init(&conf->device_lock);
> rdev_for_each(rdev, mddev) {
> + struct request_queue *q;
> int disk_idx = rdev->raid_disk;
> if (disk_idx >= mddev->raid_disks
> || disk_idx < 0)
> @@ -2560,6 +2561,9 @@ static struct r1conf *setup_conf(struct mddev *mddev)
> if (disk->rdev)
> goto abort;
> disk->rdev = rdev;
> + q = bdev_get_queue(rdev->bdev);
> + if (q->merge_bvec_fn)
> + mddev->merge_check_needed = 1;
>
> disk->head_position = 0;
> }
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 3f91c2e..d037adb 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3311,7 +3311,7 @@ static int run(struct mddev *mddev)
> (conf->raid_disks / conf->near_copies));
>
> rdev_for_each(rdev, mddev) {
> -
> + struct request_queue *q;
> disk_idx = rdev->raid_disk;
> if (disk_idx >= conf->raid_disks
> || disk_idx < 0)
> @@ -3327,6 +3327,9 @@ static int run(struct mddev *mddev)
> goto out_free_conf;
> disk->rdev = rdev;
> }
> + q = bdev_get_queue(rdev->bdev);
> + if (q->merge_bvec_fn)
> + mddev->merge_check_needed = 1;
>
> disk_stack_limits(mddev->gendisk, rdev->bdev,
> rdev->data_offset << 9);
* Re: Reason for md raid 01 blksize limited to 4 KiB?
2012-05-31 5:42 ` NeilBrown
2012-05-31 6:18 ` Yuanhan Liu
@ 2012-05-31 10:26 ` Sebastian Riemer
1 sibling, 0 replies; 11+ messages in thread
From: Sebastian Riemer @ 2012-05-31 10:26 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 31/05/12 07:42, NeilBrown wrote:
> Hmmm... that's bad. Looks like I have a bug .... yes I do. Patch below
> fixes it. If you could test and confirm, I would appreciate it.
I've tested the patch and I can confirm that it works as expected now.
Thanks for fixing it so fast!
I've tested at least raid 01.
How could one test the raid10.c changes? Create a raid 010 where raid0.c
is below raid10.c in the stack?
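Something like this, perhaps (a sketch only; it follows the same pattern
as my earlier script and assumes a fresh set of ram disks):
# Sketch: md RAID10 whose members are themselves RAID0 arrays, so that
# the raid10.c merge_bvec_fn path gets exercised (names illustrative).
mdadm -C /dev/md110 --force --assume-clean -n 2 -l raid0 /dev/ram0 /dev/ram1
mdadm -C /dev/md210 --force --assume-clean -n 2 -l raid0 /dev/ram2 /dev/ram3
mdadm -C /dev/md400 --force --assume-clean -n 2 -l raid10 /dev/md110 /dev/md210
dd if=/dev/zero of=/dev/md400 bs=1M count=256 oflag=direct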
> As for the cached writes being always 4K - are you writing through a
> filesystem or directly to /dev/md300??
I've got a file copy test which creates a file on ext4 with fio (random
data, direct IO). Afterwards I copy the file, which isn't in the cache,
and measure the time, so this reads first and then writes back. During
the copy I run blktrace in order to have a closer look at the request
sizes. The throughput settles at a stable number after multiple cycles.
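Roughly, the test looks like this (a sketch; sizes and paths are
illustrative):
# Sketch of the copy benchmark: create the source file with fio
# (random data, direct IO), drop caches, then time the copy while
# tracing the md device.
fio --name=prep --filename=/mnt/test/src.bin --size=4g --bs=1M \
    --rw=write --direct=1 --ioengine=libaio --iodepth=16
echo 3 > /proc/sys/vm/drop_caches
blktrace -d /dev/md300 &
pid=$!
time cp /mnt/test/src.bin /mnt/test/dst.bin
sync
kill -2 $pid
blkparse -i md300 | less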
Cheers,
Sebastian
Thread overview: 11+ messages
2012-05-21 8:43 Reason for md raid 01 blksize limited to 4 KiB? Sebastian Riemer
2012-05-21 23:14 ` Stan Hoeppner
2012-05-21 23:28 ` NeilBrown
2012-05-25 12:35 ` Sebastian Riemer
2012-05-28 4:05 ` NeilBrown
2012-05-29 9:30 ` Sebastian Riemer
2012-05-29 10:25 ` NeilBrown
2012-05-30 13:03 ` Sebastian Riemer
2012-05-31 5:42 ` NeilBrown
2012-05-31 6:18 ` Yuanhan Liu
2012-05-31 10:26 ` Sebastian Riemer