* raid1 narrow_write_error with 4K disks, sd "bad block number requested" messages
From: Nate Dailey @ 2015-01-28 15:29 UTC
To: linux-raid; +Cc: linux-scsi
I'm writing about what appears to be an issue with raid1's
narrow_write_error(), specific to disks whose logical block size is
larger than 512 bytes. Here's what I'm doing:
- 2 disk raid1, 4K disks, each connected to a different SAS HBA
- mount a filesystem on the raid1, run a test that writes to it
- remove one of the SAS HBAs (echo 1 >
/sys/bus/pci/devices/0000\:45\:00.0/remove)
At this point, writes fail and narrow_write_error breaks them up and
retries, one sector at a time. But these are 512-byte sectors, and sd
doesn't like it:
[ 2645.310517] sd 3:0:1:0: [sde] Bad block number requested
[ 2645.310610] sd 3:0:1:0: [sde] Bad block number requested
[ 2645.310690] sd 3:0:1:0: [sde] Bad block number requested
...
There appears to be no real harm done, but there can be a huge number of
these messages in the log.
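As far as I can tell, the retry unit comes from rdev->badblocks.shift,
roughly like this (a sketch of the logic, not verbatim source):

	/* badblocks.shift is 0 here, so the retry unit is one sector */
	int block_sectors = 1 << rdev->badblocks.shift;	/* 1 << 0 = 1 */
	/* each retried bio is 512 bytes, but a 4K disk only addresses
	 * 8-sector (4096-byte) logical blocks, hence the sd complaints */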
I can avoid this by disabling bad block tracking, but it looks like
maybe the superblock's bblog_shift is intended to address this exact
issue. However, I don't see a way to change it. Presumably this is
something mdadm should be setting up? I don't see bblog_shift ever set
to anything other than 0.
This is on a RHEL 7.1 kernel, version 3.10.0-221.el7. I looked through
upstream sd and md changes and nothing jumps out that would have
affected this (but I have not tested whether the bad block messages
also occur on an upstream kernel).
I'd appreciate any advice re: how to handle this. Thanks!
Nate Dailey
Stratus Technologies
* Re: raid1 narrow_write_error with 4K disks, sd "bad block number requested" messages
From: NeilBrown @ 2015-02-05 4:59 UTC
To: Nate Dailey; +Cc: linux-raid, linux-scsi
On Wed, 28 Jan 2015 10:29:46 -0500 Nate Dailey <nate.dailey@stratus.com>
wrote:
> I'm writing about what appears to be an issue with raid1's
> narrow_write_error(), specific to disks whose logical block size is
> larger than 512 bytes. Here's what I'm doing:
> [...]
> I'd appreciate any advice re: how to handle this. Thanks!
Thanks for the report.
narrow_write_error() should use bdev_logical_block_size() and round up to
that.
Possibly mdadm should get the same information and set bblog_shift
accordingly when creating a bad block log.
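Something along these lines (an untested sketch):

	block_sectors = roundup(1 << rdev->badblocks.shift,
				bdev_logical_block_size(rdev->bdev) >> 9);

With a 4K logical block size and a shift of 0 that gives 8 sectors, so
each retried bio covers a whole 4K block.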
I've made a note to fix that, but I'm happy to review patches too :-)
thanks,
NeilBrown
* Re: raid1 narrow_write_error with 4K disks, sd "bad block number requested" messages
From: Nate Dailey @ 2015-02-12 16:46 UTC
To: linux-raid; +Cc: linux-scsi
On 02/04/2015 11:59 PM, NeilBrown wrote:
> [...]
> Thanks for the report.
>
> narrow_write_error() should use bdev_logical_block_size() and round up to
> that.
> Possibly mdadm should get the same information and set bblog_shift
> accordingly when creating a bad block log.
>
> I've made a note to fix that, but I'm happy to review patches too :-)
>
> thanks,
> NeilBrown
>
I will post a narrow_write_error patch shortly.
I did some experimentation with setting the bblog_shift in mdadm, but it
didn't work out the way I expected. It turns out that the value is only
loaded from the superblock if:
	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BAD_BLOCKS) &&
	    rdev->badblocks.count == 0) {
		...
		rdev->badblocks.shift = sb->bblog_shift;
And this feature bit is only set if any bad blocks have actually been
recorded.
It also appears to me that the shift is used when loading the bad blocks
from the superblock, but not when storing the bad block list in the
superblock.
Seems like these are bugs, but I'm not certain how the code is supposed
to work (and am getting in a bit over my head with this).
In any case, it doesn't appear to me that there's any harm in having the
bblog_shift not match the disk's block size (right?).
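For reference, the rounding the shift drives when a bad range is
recorded looks roughly like this (paraphrased from md_set_badblocks()
as best I can tell; a sketch, not verbatim source):

	if (bb->shift) {
		/* round the start down, and the end up, to the
		 * granularity implied by the shift */
		sector_t next = s + sectors;
		s >>= bb->shift;
		next += (1 << bb->shift) - 1;
		next >>= bb->shift;
		sectors = next - s;
	}

so the in-memory table ends up in (512 << shift)-byte units.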
Nate Dailey
* Re: raid1 narrow_write_error with 4K disks, sd "bad block number requested" messages
From: NeilBrown @ 2015-02-13 6:01 UTC
To: Nate Dailey; +Cc: linux-raid, linux-scsi
On Thu, 12 Feb 2015 11:46:21 -0500 Nate Dailey <nate.dailey@stratus.com>
wrote:
> [...]
> I will post a narrow_write_error patch shortly.
>
> I did some experimentation with setting the bblog_shift in mdadm, but it
> didn't work out the way I expected. It turns out that the value is only
> loaded from the superblock if:
>
> 	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BAD_BLOCKS) &&
> 	    rdev->badblocks.count == 0) {
> 		...
> 		rdev->badblocks.shift = sb->bblog_shift;
>
> And this feature bit is only set if any bad blocks have actually been
> recorded.
>
> It also appears to me that the shift is used when loading the bad blocks
> from the superblock, but not when storing the bad block list in the
> superblock.
>
> Seems like these are bugs, but I'm not certain how the code is supposed
> to work (and am getting in a bit over my head with this).
Yes, that's probably a bug.

The

	} else if (sb->bblog_offset != 0)
		rdev->badblocks.shift = 0;

should be

	} else if (sb->bblog_offset != 0)
		rdev->badblocks.shift = sb->bblog_shift;
>
> In any case, it doesn't appear to me that there's any harm in having the
> bblog_shift not match the disk's block size (right?).
Having the bblog_shift larger than the disk's block size certainly should not
be a problem. Having it small only causes the problem that you have already
discovered.
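If mdadm were to derive the shift from the device, the arithmetic would
be roughly (a sketch; ilog2() is the kernel helper, userspace would use
an equivalent):

	unsigned int lbs = 4096;		/* bdev_logical_block_size() */
	int bblog_shift = ilog2(lbs >> 9);	/* 4096 >> 9 = 8 -> shift 3 */

i.e. a 4K-logical-block disk would get bblog_shift == 3, recording bad
block extents in 4K units.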
NeilBrown