Re: raid1 narrow_write_error with 4K disks, sd "bad block number requested" messages

From: NeilBrown <neilb@suse.de>
To: Nate Dailey <nate.dailey@stratus.com>
Cc: linux-raid@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: raid1 narrow_write_error with 4K disks, sd "bad block number requested" messages
Date: Fri, 13 Feb 2015 17:01:32 +1100
Message-ID: <20150213170132.0c61c508@notabene.brown>
In-Reply-To: <54DCD8DD.7080103@stratus.com>

On Thu, 12 Feb 2015 11:46:21 -0500 Nate Dailey <nate.dailey@stratus.com>
wrote:

> On 02/04/2015 11:59 PM, NeilBrown wrote:
> > On Wed, 28 Jan 2015 10:29:46 -0500 Nate Dailey <nate.dailey@stratus.com>
> > wrote:
> >
> >> I'm writing about something that appears to be an issue with raid1's
> >> narrow_write_error, particular to non-512-byte-sector disks. Here's what
> >> I'm doing:
> >>
> >> - 2 disk raid1, 4K disks, each connected to a different SAS HBA
> >> - mount a filesystem on the raid1, run a test that writes to it
> >> - remove one of the SAS HBAs (echo 1 >
> >> /sys/bus/pci/devices/0000\:45\:00.0/remove)
> >>
> >> At this point, writes fail and narrow_write_error breaks them up and
> >> retries, one sector at a time. But these are 512-byte sectors, and sd
> >> doesn't like it:
> >>
> >> [ 2645.310517] sd 3:0:1:0: [sde] Bad block number requested
> >> [ 2645.310610] sd 3:0:1:0: [sde] Bad block number requested
> >> [ 2645.310690] sd 3:0:1:0: [sde] Bad block number requested
> >> ...
> >>
> >> There appears to be no real harm done, but there can be a huge number of
> >> these messages in the log.
> >>
> >> I can avoid this by disabling bad block tracking, but it looks like
> >> maybe the superblock's bblog_shift is intended to address this exact
> >> issue. However, I don't see a way to change it. Presumably this is
> >> something mdadm should be setting up? I don't see bblog_shift ever set
> >> to anything other than 0.
> >>
> >> This is on a RHEL 7.1 kernel, version 3.10.0-221.el7. I took a look at
> >> upstream sd and md changes and nothing jumps out at me that would have
> >> affected this (but I have not tested to see if the bad block messages do
> >> or do not happen on an upstream kernel).
> >>
> >> I'd appreciate any advice re: how to handle this. Thanks!
> >
> > Thanks for the report.
> >
> > narrow_write_error() should use bdev_logical_block_size() and round up to
> > that.
> > Possibly mdadm should get the same information and set bblog_shift
> > accordingly when creating a bad block log.
> >
> > I've made a note to fix that, but I'm happy to review  patches too :-)
> >
> > thanks,
> > NeilBrown
> >
> 
> I will post a narrow_write_error patch shortly.
> 
> I did some experimentation with setting the bblog_shift in mdadm, but it 
> didn't work out the way I expected. It turns out that the value is only 
> loaded from the superblock if:
> 
>         if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BAD_BLOCKS) &&
>             rdev->badblocks.count == 0) {
> ...
>                 rdev->badblocks.shift = sb->bblog_shift;
> 
> And this feature bit is only set if any bad blocks have actually been 
> recorded.
> 
> It also appears to me that the shift is used when loading the bad blocks 
> from the superblock, but not when storing the bad block list in the 
> superblock.
> 
> Seems like these are bugs, but I'm not certain how the code is supposed 
> to work (and am getting in a bit over my head with this).

Yes, that's probably a bug.
The

	} else if (sb->bblog_offset != 0)
		rdev->badblocks.shift = 0;

should be

	} else if (sb->bblog_offset != 0)
		rdev->badblocks.shift = sb->bblog_shift;
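
To make the effect of that one-line change concrete, here is a small
user-space model of the two branches. It is only a sketch: struct sb_model,
MODEL_FEATURE_BAD_BLOCKS, model_load_shift() and the "fixed" flag are
hypothetical names invented for illustration, the code elided by "..." above
is not modelled, and returning -1 when neither branch applies is the model's
own convention.

```c
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define MODEL_FEATURE_BAD_BLOCKS 0x8u

struct sb_model {
	uint32_t feature_map;  /* assumed already in CPU byte order */
	uint32_t bblog_offset; /* non-zero when space for a log is reserved */
	uint8_t  bblog_shift;  /* granularity stored in the superblock */
};

/*
 * Returns the shift the device would end up with.
 * -1 means "neither branch applied" (a convention of this model only).
 * fixed == 0 models the current else-branch, fixed != 0 the proposed one.
 */
static int model_load_shift(const struct sb_model *sb, int badblocks_count,
			    int fixed)
{
	if ((sb->feature_map & MODEL_FEATURE_BAD_BLOCKS) &&
	    badblocks_count == 0)
		return sb->bblog_shift;             /* unchanged branch */
	else if (sb->bblog_offset != 0)
		return fixed ? sb->bblog_shift : 0; /* the changed line */
	return -1;
}
```

In this model, a superblock with a log area reserved and bblog_shift set to 3,
but without the feature bit (no bad blocks recorded yet), yields a shift of 0
with the current branch and 3 with the changed one.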

> 
> In any case, it doesn't appear to me that there's any harm in having the 
> bblog_shift not match the disk's block size (right?).

Having the bblog_shift larger than the disk's block size certainly should not
be a problem.  Having it smaller only causes the problem that you have already
discovered.
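
As a rough illustration of why only the "smaller" case matters, the sketch
below computes a retry chunk size by rounding the bad block granularity up to
the device's logical block size, as suggested earlier in the thread. It is a
user-space model, not the md code: round_up_to(), retry_block_sectors() and
sectors_to_boundary() are hypothetical helpers, and the logical block size is
passed in as a plain number rather than read via bdev_logical_block_size().
Both inputs are assumed to be powers of two, so the result is one as well.

```c
#include <stdint.h>

typedef uint64_t sector_t; /* index of a 512-byte sector */

/* Round x up to a multiple of y (y > 0). */
static sector_t round_up_to(sector_t x, sector_t y)
{
	return ((x + y - 1) / y) * y;
}

/*
 * Size of each retry chunk, in 512-byte sectors.
 * shift: bad block log granularity as log2 of 512-byte sectors, >= 0.
 * logical_block_size: device logical block size in bytes (512, 4096, ...).
 */
static sector_t retry_block_sectors(int shift, unsigned int logical_block_size)
{
	return round_up_to((sector_t)1 << shift, logical_block_size >> 9);
}

/*
 * Number of sectors from `sector` up to the next block_sectors boundary.
 * block_sectors must be a power of two.
 */
static sector_t sectors_to_boundary(sector_t sector, sector_t block_sectors)
{
	return ((sector + block_sectors) & ~(block_sectors - 1)) - sector;
}
```

With a shift of 0 and a 4096-byte logical block size this gives 8 sectors per
retry, so every retried write covers a whole device block. With a shift of 4
the chunk stays at 16 sectors, which is already a multiple of the device
block, so a larger shift needs no adjustment.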

NeilBrown


> 
> Nate Dailey
> 

