From: NeilBrown <neilb@suse.de>
To: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Reason for md raid 01 blksize limited to 4 KiB?
Date: Thu, 31 May 2012 15:42:56 +1000
Message-ID: <20120531154256.6eb567c7@notabene.brown>
In-Reply-To: <4FC61A94.3050605@profitbricks.com>
On Wed, 30 May 2012 15:03:16 +0200 Sebastian Riemer
<sebastian.riemer@profitbricks.com> wrote:
> On 29/05/12 12:25, NeilBrown wrote:
> > On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
> > <sebastian.riemer@profitbricks.com> wrote:
> >> Now, I've updated mdadm to version 3.2.5 and it works like you've
> >> described it. Thanks for your help! But the buffered IO is what matters.
> >> 4k isn't enough there. Please inform me about changes which increase the
> >> size in buffered IO. I'll have a look at this, too.
> >
> > I don't know. I'd have to dive into the code and look around and put a few
> > printks in to see what is happening.
>
> Now, I've configured a storage server with real HDDs for testing the
> cached IO with kernel 3.4. Here direct IO never works
> (Input/Output error with dd/fio). And cached IO is very slow. My
> RAID0 devices are md100 and md200. The RAID1 on top is the md300.
>
> The md100 is reported as "faulty spare" and this has hit the following
> kernel bug.
>
> This is the debug output:
>
> md/raid0:md100: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid1:md300: Disk failure on md100, disabling device.
> md/raid1:md300: Operation continuing on 1 devices.
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 0, wo:1, o:0, dev:md100
> disk 1, wo:0, o:1, dev:md200
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 1, wo:0, o:1, dev:md200
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 2704000 320
>
> The chunk size of 320 KiB comes from max_sectors_kb of the LSI HW RAID
> controller where the drives are passed through as single drive RAID0
> logical devices. I guess this is a problem for MD RAID0 underneath the
> RAID1, because this doesn't fit as a multiple of the 512 KiB stripe size.
Hmmm... that's bad. Looks like I have a bug .... yes I do. The patch below
fixes it. If you could test and confirm, I would appreciate it.
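For anyone reading the log above, here is a minimal sketch of the boundary
check that raid0 is complaining about. It assumes the two trailing numbers in
each "make_request bug" line are the starting sector and the request size in
KiB, and that the 512 KiB chunk is 1024 sectors of 512 bytes. The function
name fits_in_chunk is made up for this illustration; it is not the kernel's
own helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: returns true when a request of len_sects sectors
 * starting at 'sector' lies entirely inside one chunk of chunk_sects
 * sectors. All quantities are in 512-byte sectors.
 */
static bool fits_in_chunk(uint64_t sector, uint32_t len_sects,
                          uint32_t chunk_sects)
{
    assert(chunk_sects > 0);

    /* Offset of the first sector within its chunk. */
    uint64_t offset = sector % chunk_sects;

    /* The request fits only if it ends at or before the chunk end. */
    return offset + len_sects <= chunk_sects;
}
```

Under those assumptions, sector 541312 sits 640 sectors into its chunk, and a
320 KiB request is another 640 sectors, so it would end at 1280, past the
1024-sector chunk boundary. raid0 relies on its merge_bvec_fn to prevent such
requests from being built, which is why they must not reach it through raid1.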
As for the cached writes always being 4K: are you writing through a
filesystem or directly to /dev/md300?
If the former, it is a bug in that filesystem.
If the latter, it is a bug in fs/block_dev.c.
In particular, fs/block_dev.c uses "generic_writepages" for the
"writepages" method rather than "mpage_writepages" (or a wrapper which
calls it with appropriate args).
'generic_writepages' simply calls ->writepage on each dirty page.
mpage_writepages (used e.g. by ext2) collects multiple pages into
a single bio.
The elevator at the device level should still collect these 1-page bios into
larger requests, but I guess that has higher CPU overhead.
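To make the difference concrete, here is a small userspace model, not kernel
code, that counts how many bios each strategy would submit for a sorted list
of dirty page indexes. The function names and the max_pages limit are
assumptions made for this sketch; the real mpage_writepages has more
conditions for starting a new bio.

```c
#include <assert.h>
#include <stddef.h>

/* Per-page writeback: every dirty page becomes its own bio. */
static size_t bios_one_per_page(size_t n_dirty)
{
    return n_dirty;
}

/*
 * Batched writeback: consecutive page indexes share a bio, up to
 * max_pages pages per bio. 'pages' must be sorted in ascending order.
 */
static size_t bios_merged(const unsigned long *pages, size_t n,
                          size_t max_pages)
{
    assert(max_pages > 0);

    size_t bios = 0;    /* bios started so far */
    size_t in_bio = 0;  /* pages already placed in the current bio */

    for (size_t i = 0; i < n; i++) {
        int contiguous = i > 0 && pages[i] == pages[i - 1] + 1;

        /* Start a new bio on the first page, after a gap, or when full. */
        if (in_bio == 0 || !contiguous || in_bio == max_pages) {
            bios++;
            in_bio = 0;
        }
        in_bio++;
    }
    return bios;
}
```

For dirty pages 0, 1, 2, 3, 7, 8 the per-page model submits six bios, while
the batched model with a generous limit submits two, one for each contiguous
run. The elevator then has six small bios to merge in the first case instead
of receiving two larger ones directly.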
Thanks for the report.
NeilBrown
From dd47a247ae226896205f753ad246cd40141aadf1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 31 May 2012 15:39:11 +1000
Subject: [PATCH] md: raid1/raid10: fix problem with merge_bvec_fn
The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.
However we were only setting that when a device was hot-added,
not when a device was present from the start.
This bug was introduced in 3.4, so this patch is suitable for 3.4.y
kernels.
Cc: stable@vger.kernel.org
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 15dd59b..d7e9577 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2548,6 +2548,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
err = -EINVAL;
spin_lock_init(&conf->device_lock);
rdev_for_each(rdev, mddev) {
+ struct request_queue *q;
int disk_idx = rdev->raid_disk;
if (disk_idx >= mddev->raid_disks
|| disk_idx < 0)
@@ -2560,6 +2561,9 @@ static struct r1conf *setup_conf(struct mddev *mddev)
if (disk->rdev)
goto abort;
disk->rdev = rdev;
+ q = bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn)
+ mddev->merge_check_needed = 1;
disk->head_position = 0;
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3f91c2e..d037adb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3311,7 +3311,7 @@ static int run(struct mddev *mddev)
(conf->raid_disks / conf->near_copies));
rdev_for_each(rdev, mddev) {
-
+ struct request_queue *q;
disk_idx = rdev->raid_disk;
if (disk_idx >= conf->raid_disks
|| disk_idx < 0)
@@ -3327,6 +3327,9 @@ static int run(struct mddev *mddev)
goto out_free_conf;
disk->rdev = rdev;
}
+ q = bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn)
+ mddev->merge_check_needed = 1;
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);