linux-raid.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.de>
To: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Reason for md raid 01 blksize limited to 4 KiB?
Date: Thu, 31 May 2012 15:42:56 +1000
Message-ID: <20120531154256.6eb567c7@notabene.brown>
In-Reply-To: <4FC61A94.3050605@profitbricks.com>

On Wed, 30 May 2012 15:03:16 +0200 Sebastian Riemer
<sebastian.riemer@profitbricks.com> wrote:

> On 29/05/12 12:25, NeilBrown wrote:
> > On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
> > <sebastian.riemer@profitbricks.com> wrote:
> >> Now, I've updated mdadm to version 3.2.5 and it works like you've
> >> described it. Thanks for your help! But the buffered IO is what matters.
> >> 4k isn't enough there. Please inform me about changes which increase the
> >> size in buffered IO. I'll have a look at this, too.
> > 
> > I don't know.  I'd have to dive into the code and look around and put a few
> > printks in to see what is happening.
> 
> Now, I've configured a storage server with real HDDs for testing the
> cached IO with kernel 3.4. Here direct IO never works
> (Input/Output error with dd/fio), and cached IO is very slow. My
> RAID0 devices are md100 and md200. The RAID1 on top is md300.
> 
> The md100 is reported as "faulty spare", and this has hit the following
> kernel bug.
> 
> This is the debug output:
> 
> md/raid0:md100: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid1:md300: Disk failure on md100, disabling device.
> md/raid1:md300: Operation continuing on 1 devices.
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 0, wo:1, o:0, dev:md100
> disk 1, wo:0, o:1, dev:md200
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 1, wo:0, o:1, dev:md200
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 2704000 320
> 
> The chunk size of 320 KiB comes from max_sectors_kb of the LSI HW RAID
> controller where the drives are passed through as single drive RAID0
> logical devices. I guess this is a problem for MD RAID0 underneath the
> RAID1, because this doesn't fit as a multiple of the 512 KiB stripe size.

Hmmm... that's bad.  Looks like I have a bug .... yes I do.  Patch below
fixes it.  If you could test and confirm I would appreciate it.

As for the cached writes being always 4K - are you writing through a
filesystem or directly to /dev/md300?

If the former it is a bug in that filesystem.
If the latter, it is a bug in fs/block_dev.c.
In particular, fs/block_dev.c uses "generic_writepages" for the
"writepages" method rather than "mpage_writepages" (or a wrapper which
calls it with appropriate args).

'generic_writepages' simply calls ->writepage on each dirty page.
mpage_writepages (used e.g. by ext2) collects multiple pages into
a single bio.

The elevator at the device level should still collect these 1-page bios into
larger requests, but I guess that has higher CPU overhead.

thanks for the report.

NeilBrown

From dd47a247ae226896205f753ad246cd40141aadf1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 31 May 2012 15:39:11 +1000
Subject: [PATCH] md: raid1/raid10: fix problem with merge_bvec_fn

The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.

However we were only setting that when a device was hot-added,
not when a device was present from the start.

This bug was introduced in 3.4 so patch is suitable for 3.4.y
kernels.

Cc: stable@vger.kernel.org
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 15dd59b..d7e9577 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2548,6 +2548,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	err = -EINVAL;
 	spin_lock_init(&conf->device_lock);
 	rdev_for_each(rdev, mddev) {
+		struct request_queue *q;
 		int disk_idx = rdev->raid_disk;
 		if (disk_idx >= mddev->raid_disks
 		    || disk_idx < 0)
@@ -2560,6 +2561,9 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		if (disk->rdev)
 			goto abort;
 		disk->rdev = rdev;
+		q = bdev_get_queue(rdev->bdev);
+		if (q->merge_bvec_fn)
+			mddev->merge_check_needed = 1;
 
 		disk->head_position = 0;
 	}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3f91c2e..d037adb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3311,7 +3311,7 @@ static int run(struct mddev *mddev)
 				 (conf->raid_disks / conf->near_copies));
 
 	rdev_for_each(rdev, mddev) {
-
+		struct request_queue *q;
 		disk_idx = rdev->raid_disk;
 		if (disk_idx >= conf->raid_disks
 		    || disk_idx < 0)
@@ -3327,6 +3327,9 @@ static int run(struct mddev *mddev)
 				goto out_free_conf;
 			disk->rdev = rdev;
 		}
+		q = bdev_get_queue(rdev->bdev);
+		if (q->merge_bvec_fn)
+			mddev->merge_check_needed = 1;
 
 		disk_stack_limits(mddev->gendisk, rdev->bdev,
 				  rdev->data_offset << 9);

Thread overview: 11+ messages
2012-05-21  8:43 Reason for md raid 01 blksize limited to 4 KiB? Sebastian Riemer
2012-05-21 23:14 ` Stan Hoeppner
2012-05-21 23:28 ` NeilBrown
2012-05-25 12:35   ` Sebastian Riemer
2012-05-28  4:05     ` NeilBrown
2012-05-29  9:30       ` Sebastian Riemer
2012-05-29 10:25         ` NeilBrown
2012-05-30 13:03           ` Sebastian Riemer
2012-05-31  5:42             ` NeilBrown [this message]
2012-05-31  6:18               ` Yuanhan Liu
2012-05-31 10:26               ` Sebastian Riemer
