* RAID-10 keeps aborting @ 2013-06-03 3:57 H. Peter Anvin 2013-06-03 4:05 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 3:57 UTC (permalink / raw) To: linux-raid Hello, I have a brand new server with a RAID-10 array. The drives are a SAS JBOD (mptsas) which I'm driving using Linux mdraid raid10. Unfortunately, although the server did burn-in fine, once put in production I have so far had multiple cases (about once every 24 hours) of the raid10 failing, with a mirror pair dropping out in very short succession: Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk failure on sdb6, disabling device. Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation continuing on 3 devices. Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk failure on sdc6, disabling device. Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation continuing on 2 devices. Jun 2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed. Manually zeroing. Unfortunately, those two devices that just dropped out are of course the mirrors of each other, causing filesystem corruption and shutdown in very short order. There are no other kernel messages from the same time, and given the timing (less than 90 ms apart) it would appear that this is a timeout of some kind and not an actual disk failure. Are there any tunables I can tweak, or do I have a $4000 paperweight? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 3:57 RAID-10 keeps aborting H. Peter Anvin @ 2013-06-03 4:05 ` H. Peter Anvin 2013-06-03 5:47 ` Dan Williams 2013-06-11 21:50 ` Joe Lawrence 2 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 4:05 UTC (permalink / raw) To: linux-raid On 06/02/2013 08:57 PM, H. Peter Anvin wrote: > Hello, > > I have a brand new server with a RAID-10 array. The drives are a SAS > JBOD (mptsas) which I'm driving using Linux mdraid raid10. > > Unfortunately, although the server did burn-in fine, once put in > production I have so far had multiple cases (about once every 24 hours) > of the raid10 failing, with a mirror pair dropping out in very short > succession: > > Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk > failure on sdb6, disabling device. > Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation > continuing on 3 devices. > Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk > failure on sdc6, disabling device. > Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation > continuing on 2 devices. > Jun 2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed. > Manually zeroing. > For the record, this is kernel 3.8.13 (-100.fc17 from Fedora). -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 3:57 RAID-10 keeps aborting H. Peter Anvin 2013-06-03 4:05 ` H. Peter Anvin @ 2013-06-03 5:47 ` Dan Williams 2013-06-03 6:06 ` H. Peter Anvin 2013-06-11 21:50 ` Joe Lawrence 2 siblings, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-03 5:47 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-raid On Sun, Jun 2, 2013 at 8:57 PM, H. Peter Anvin <hpa@zytor.com> wrote: > Hello, > > I have a brand new server with a RAID-10 array. The drives are a SAS > JBOD (mptsas) which I'm driving using Linux mdraid raid10. > > Unfortunately, although the server did burn-in fine, once put in > production I have so far had multiple cases (about once every 24 hours) > of the raid10 failing, with a mirror pair dropping out in very short > succession: > > Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk > failure on sdb6, disabling device. > Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation > continuing on 3 devices. > Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk > failure on sdc6, disabling device. > Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation > continuing on 2 devices. > Jun 2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed. > Manually zeroing. > > Unfortunately, those two devices that just dropped out are of course the > mirrors of each other, causing filesystem corruption and shutdown in > very short order. > > There are no other kernel messages from the same time, and given the > timing (less than 90 ms apart) it would appear that this is a timeout of > some kind and not an actual disk failure. Looks like the underlying devices just may not support write_same... if the device lies about support we don't find out about it until the first attempt fails and md drops the devices. > Are there any tunables I can tweak, or do I have a $4000 paperweight? One hack to prove this may be to explicitly disable write_same before the array is assembled: for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done If this works then maybe md needs to be tolerant of write_same failures since the block layer will simply retry with zeroes. -- Dan ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 5:47 ` Dan Williams @ 2013-06-03 6:06 ` H. Peter Anvin 2013-06-03 6:14 ` Dan Williams 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 6:06 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On 06/02/2013 10:47 PM, Dan Williams wrote: > > One hack to prove this may be to explicitly disable write_same before > the array is assembled: > > for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done > > If this works then maybe md needs to be tolerant of write_same > failures since the block layer will simply retry with zeroes. > Trying that (array is already assembled but is currently functional.) Let's hope it works. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 6:06 ` H. Peter Anvin @ 2013-06-03 6:14 ` Dan Williams 2013-06-03 6:30 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: Dan Williams @ 2013-06-03 6:14 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-raid On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote: > On 06/02/2013 10:47 PM, Dan Williams wrote: >> >> One hack to prove this may be to explicitly disable write_same before >> the array is assembled: >> >> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done >> >> If this works then maybe md needs to be tolerant of write_same >> failures since the block layer will simply retry with zeroes. >> > > Trying that (array is already assembled but is currently functional.) > Let's hope it works. > If I'm reading things correctly that may still result in failure since md will still pass the REQ_WRITE_SAME bios down to the devices and will receive BLKPREP_KILL for its trouble. md only notices that write same is disabled on underlying devices at assembly time. ^ permalink raw reply [flat|nested] 64+ messages in thread
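(For reference: the prep-stage kill Dan describes comes from the sd driver. Reconstructed from 3.8-era drivers/scsi/sd.c and trimmed, so treat it as a sketch rather than verbatim source -- once no_write_same is set on a device, any WRITE SAME request that still reaches it is rejected before being issued, and that rejection is what md sees as a write error on the member:

	static int sd_setup_write_same_cmnd(struct scsi_device *sdp, struct request *rq)
	{
		struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
		...
		if (sdkp->device->no_write_same)
			return BLKPREP_KILL;	/* request fails; md treats it as a write error */
		...
	}
)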
* Re: RAID-10 keeps aborting 2013-06-03 6:14 ` Dan Williams @ 2013-06-03 6:30 ` H. Peter Anvin 2013-06-03 14:39 ` H. Peter Anvin 2013-06-03 15:47 ` H. Peter Anvin 2 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 6:30 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid Ok, will muck with that tomorrow. Hopefully it doesn't die tonight. Dan Williams <dan.j.williams@gmail.com> wrote: >On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 06/02/2013 10:47 PM, Dan Williams wrote: >>> >>> One hack to prove this may be to explicitly disable write_same >before >>> the array is assembled: >>> >>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > >$i; done >>> >>> If this works then maybe md needs to be tolerant of write_same >>> failures since the block layer will simply retry with zeroes. >>> >> >> Trying that (array is already assembled but is currently functional.) >> Let's hope it works. >> > >If I'm reading things correctly that may still result in failure since >md will still pass the REQ_WRITE_SAME bios down to the devices and >will receive BLKPREP_KILL for its trouble. md only notices that >write same is disabled on underlying devices at assembly time. -- Sent from my mobile phone. Please excuse brevity and lack of formatting. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 6:14 ` Dan Williams 2013-06-03 6:30 ` H. Peter Anvin @ 2013-06-03 14:39 ` H. Peter Anvin 2013-06-11 16:47 ` Joe Lawrence 2013-06-03 15:47 ` H. Peter Anvin 2 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 14:39 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On 06/02/2013 11:14 PM, Dan Williams wrote: > On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 06/02/2013 10:47 PM, Dan Williams wrote: >>> >>> One hack to prove this may be to explicitly disable write_same before >>> the array is assembled: >>> >>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done >>> >>> If this works then maybe md needs to be tolerant of write_same >>> failures since the block layer will simply retry with zeroes. >>> >> >> Trying that (array is already assembled but is currently functional.) >> Let's hope it works. >> > > If I'm reading things correctly that may still result in failure since > md will still pass the REQ_WRITE_SAME bios down to the devices and > will receive BLKPREP_KILL for its trouble. md only notices that > write same is disabled on underlying devices at assembly time. > Hmmm... that means getting dracut/udev to supply this little mod, unless it can be fed as a kernel command-line option somehow. Digging... -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 14:39 ` H. Peter Anvin @ 2013-06-11 16:47 ` Joe Lawrence 2013-06-11 17:12 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: Joe Lawrence @ 2013-06-11 16:47 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Dan Williams, linux-raid On Mon, 03 Jun 2013 07:39:46 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 06/02/2013 11:14 PM, Dan Williams wrote: > > On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote: > >> On 06/02/2013 10:47 PM, Dan Williams wrote: > >>> > >>> One hack to prove this may be to explicitly disable write_same before > >>> the array is assembled: > >>> > >>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done > >>> > >>> If this works then maybe md needs to be tolerant of write_same > >>> failures since the block layer will simply retry with zeroes. > >>> > >> > >> Trying that (array is already assembled but is currently functional.) > >> Let's hope it works. > >> > > > > If I'm reading things correctly that may still result in failure since > > md will still pass the REQ_WRITE_SAME bios down to the devices and > > will receive BLKPREP_KILL for its trouble. md only notices that > > write same is disabled on underlying devices at assembly time. > > > > Hmmm... that means getting dracut/udev to supply this little mod, unless > it can be fed as a kernel command-line option somehow. Digging... > > -hpa You've probably worked around this by now, but Dan's suggestion can be tweaked if you are willing to fail/remove/re-add one of the disks. I just verified the following: /sys/block/md125/queue/write_same_max_bytes: 524288 % mdadm --fail /dev/md125 /dev/sds1 % mdadm --remove /dev/md125 /dev/sds1 % echo 0 > /sys/class/scsi_disk/2\:0\:2\:0/max_write_same_blocks % mdadm --add /dev/md125 /dev/sds1 /sys/block/md125/queue/write_same_max_bytes: 0 That gets raid10.c to invoke disk_stack_limits (via raid10_add_disk) which will recalculate write_same_max_bytes for the MD. Regards, -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-11 16:47 ` Joe Lawrence @ 2013-06-11 17:12 ` H. Peter Anvin 0 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-11 17:12 UTC (permalink / raw) To: Joe Lawrence; +Cc: Dan Williams, linux-raid On 06/11/2013 09:47 AM, Joe Lawrence wrote: > > You've probably worked around this by now, but Dan's suggestion can be > tweaked if you are willing to fail/remove/re-add one of the disks. I > just verified the following: > > /sys/block/md125/queue/write_same_max_bytes: 524288 > > % mdadm --fail /dev/md125 /dev/sds1 > % mdadm --remove /dev/md125 /dev/sds1 > % echo 0 > /sys/class/scsi_disk/2\:0\:2\:0/max_write_same_blocks > % mdadm --add /dev/md125 /dev/sds1 > > /sys/block/md125/queue/write_same_max_bytes: 0 > > That gets raid10.c to invoke disk_stack_limits (via raid10_add_disk) > which will recalculate write_same_max_bytes for the MD. > No thanks. I just hacked the kernel directly. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 6:14 ` Dan Williams 2013-06-03 6:30 ` H. Peter Anvin 2013-06-03 14:39 ` H. Peter Anvin @ 2013-06-03 15:47 ` H. Peter Anvin 2013-06-03 16:09 ` Joe Lawrence 2013-06-03 17:22 ` Dan Williams 2 siblings, 2 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 15:47 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On 06/02/2013 11:14 PM, Dan Williams wrote: > > If I'm reading things correctly that may still result in failure since > md will still pass the REQ_WRITE_SAME bios down to the devices and > will receive BLKPREP_KILL for its trouble. md only notices that > write same is disabled on underlying devices at assembly time. > I have to admit to not seeing where md (as opposed to dm) even checks whether the underlying devices have write same enabled. It seems extremely likely that it is write same that is causing the headaches, though. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 15:47 ` H. Peter Anvin @ 2013-06-03 16:09 ` Joe Lawrence 2013-06-03 17:22 ` Dan Williams 1 sibling, 0 replies; 64+ messages in thread From: Joe Lawrence @ 2013-06-03 16:09 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Dan Williams, linux-raid On Mon, 3 Jun 2013, H. Peter Anvin wrote: > On 06/02/2013 11:14 PM, Dan Williams wrote: > > > > If I'm reading things correctly that may still result in failure since > > md will still pass the REQ_WRITE_SAME bios down to the devices and > > will receive BLKPREP_KILL for its trouble. md only notices that > > write same is disabled on underlying devices at assembly time. > > > > I have to admit to not seeing where md (as opposed to dm) even checks > whether the underlying devices have write same enabled. It seems extremely > likely that it is write same that is causing the headaches, though. I'll try to take a look later, but for now maybe these threads could help you? http://thread.gmane.org/gmane.linux.raid/41035 http://thread.gmane.org/gmane.linux.raid/41078 It might be that after commit 4363ac7 "block: Implement support for WRITE SAME", the max_write_same_sectors for the MD is set to the minimum of its component disks (at least for RAID1). Not sure exactly how RAID10 treats it, but maybe there are enough clues in that first thread to figure it out. When I have some more time I can help investigate. Regards, -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 15:47 ` H. Peter Anvin 2013-06-03 16:09 ` Joe Lawrence @ 2013-06-03 17:22 ` Dan Williams 2013-06-03 17:40 ` H. Peter Anvin 1 sibling, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-03 17:22 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-raid On Mon, Jun 3, 2013 at 8:47 AM, H. Peter Anvin <hpa@zytor.com> wrote: > On 06/02/2013 11:14 PM, Dan Williams wrote: >> >> If I'm reading things correctly that may still result in failure since >> md will still pass the REQ_WRITE_SAME bios down to the devices and >> will receive BLKPREP_KILL for its trouble. md only notices that >> write same is disabled on underlying devices at assembly time. >> > > I have to admit to not seeing where md (as opposed to dm) even checks > whether the underlying devices have write same enabled. It seems extremely > likely that it is write same that is causing the headaches, though. > raid10 calls disk_stack_limits() after blk_queue_max_write_same_sectors(). ...and here is where scsi considers failures as non-fatal in sd_done(). I assume the REQ_QUIET is why there are no other kernel messages. case ILLEGAL_REQUEST: if (sshdr.asc == 0x10) /* DIX: Host detected corruption */ good_bytes = sd_completed_bytes(SCpnt); /* INVALID COMMAND OPCODE or INVALID FIELD IN CDB */ if (sshdr.asc == 0x20 || sshdr.asc == 0x24) { switch (op) { case UNMAP: sd_config_discard(sdkp, SD_LBP_DISABLE); break; case WRITE_SAME_16: case WRITE_SAME: if (unmap) sd_config_discard(sdkp, SD_LBP_DISABLE); else { sdkp->device->no_write_same = 1; sd_config_write_same(sdkp); good_bytes = 0; req->__data_len = blk_rq_bytes(req); req->cmd_flags |= REQ_QUIET; } } } break; ^ permalink raw reply [flat|nested] 64+ messages in thread
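(For reference: the stacking Dan mentions happens in blk_stack_limits(), which disk_stack_limits() wraps. From 3.8-era block/blk-settings.c, reconstructed and trimmed, so a sketch rather than verbatim source:

	t->max_write_same_sectors = min(t->max_write_same_sectors,
					b->max_write_same_sectors);

i.e. the md queue's limit becomes the minimum across its members, but it is only recomputed when the limits are restacked -- at assembly, or when a member is added -- which is why poking the sysfs knobs on a running array is not enough, and why Joe's fail/remove/re-add trick above works.)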
* Re: RAID-10 keeps aborting 2013-06-03 17:22 ` Dan Williams @ 2013-06-03 17:40 ` H. Peter Anvin 2013-06-03 18:35 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 17:40 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On 06/03/2013 10:22 AM, Dan Williams wrote: > On Mon, Jun 3, 2013 at 8:47 AM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 06/02/2013 11:14 PM, Dan Williams wrote: >>> >>> If I'm reading things correctly that may still result in failure since >>> md will still pass the REQ_WRITE_SAME bios down to the devices and >>> will receive BLKPREP_KILL for its trouble. md only notices that >>> write same is disabled on underlying devices at assembly time. >>> >> >> I have to admit to not seeing where md (as opposed to dm) even checks >> whether the underlying devices have write same enabled. It seems extremely >> likely that it is write same that is causing the headaches, though. >> > > raid10 calls disk_stack_limits() after blk_queue_max_write_same_sectors(). > OK, I see it now. I wonder if changing blk_queue_max_write_same_sectors() to zero in the kernel sources would do the trick here... might be easier than making udev/dracut do the right thing... :-/ > ...and here is where scsi considers failures as non-fatal in > sd_done(). I assume the REQ_QUIET is why there are no other kernel > messages. > case WRITE_SAME_16: > case WRITE_SAME: > if (unmap) > sd_config_discard(sdkp, SD_LBP_DISABLE); > else { > sdkp->device->no_write_same = 1; > sd_config_write_same(sdkp); > > good_bytes = 0; > req->__data_len = blk_rq_bytes(req); > req->cmd_flags |= REQ_QUIET; > } > } > } > break; OK, so the device here says don't do this again, but fails the request anyway expecting the block device to pick up the slack. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
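(A sketch of the kind of source hack being contemplated, against 3.8's sd_config_write_same(); the placement is illustrative only, not a proposed upstream change:

	-	if (sdkp->max_ws_blocks == 0)
	-		sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS;
	+	/* hack: never advertise WRITE SAME, so md never issues it */
	+	sdkp->max_ws_blocks = 0;

With the limit forced to zero, write_same_max_bytes is zero for every sd disk, and the stacked md limit follows at the next assembly.)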
* Re: RAID-10 keeps aborting 2013-06-03 17:40 ` H. Peter Anvin @ 2013-06-03 18:35 ` Martin K. Petersen 2013-06-03 18:38 ` H. Peter Anvin ` (3 more replies) 0 siblings, 4 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-03 18:35 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> OK, so the device here says don't do this again, but fails the hpa> request anyway expecting the block device to pick up the slack. Yes, the block layer function will resort to writing out zeroes directly in this case. MD should not consider a rejected WRITE SAME a failure. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
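(The fallback Martin refers to is blkdev_issue_zeroout(); the 3.8-era version in block/blk-lib.c reads roughly as follows -- reconstructed from memory, so treat as a sketch -- and note that its pr_err() is exactly the "WRITE SAME failed. Manually zeroing." line from the original report:

	int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
				 sector_t nr_sects, gfp_t gfp_mask)
	{
		if (bdev_write_same(bdev)) {
			unsigned char bdn[BDEVNAME_SIZE];

			/* First try a single WRITE SAME of the zero page. */
			if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
						     ZERO_PAGE(0)))
				return 0;

			bdevname(bdev, bdn);
			pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
		}

		/* Fall back to submitting ordinary zero-filled write bios. */
		return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
	}

So the retry happens above md; the bug is that md fails its members before the error ever propagates back up to this function.)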
* Re: RAID-10 keeps aborting 2013-06-03 18:35 ` Martin K. Petersen @ 2013-06-03 18:38 ` H. Peter Anvin 2013-06-03 18:40 ` H. Peter Anvin ` (2 subsequent siblings) 3 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 18:38 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/03/2013 11:35 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> OK, so the device here says don't do this again, but fails the > hpa> request anyway expecting the block device to pick up the slack. > > Yes, the block layer function will resort to writing out zeroes directly > in this case. > > MD should not consider a rejected WRITE SAME a failure. > Right. That seems to be the root of the problem here. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 18:35 ` Martin K. Petersen 2013-06-03 18:38 ` H. Peter Anvin @ 2013-06-03 18:40 ` H. Peter Anvin 2013-06-03 22:20 ` H. Peter Anvin 2013-06-03 23:19 ` H. Peter Anvin 2013-06-04 17:36 ` Dan Williams 3 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 18:40 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/03/2013 11:35 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> OK, so the device here says don't do this again, but fails the > hpa> request anyway expecting the block device to pick up the slack. > > Yes, the block layer function will resort to writing out zeroes directly > in this case. > > MD should not consider a rejected WRITE SAME a failure. > Presumably MD should do the same thing that SCSI does: disable write_same and kick the failure upstack? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
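(A minimal sketch of that proposal -- hypothetical code, the helper name and call site are invented for illustration; only blk_queue_max_write_same_sectors(), md_error(), bi_rw and REQ_WRITE_SAME are real 3.8 interfaces:

	/* in the raid1/raid10 member-bio error path: */
	static void member_write_error(struct mddev *mddev, struct md_rdev *rdev,
				       struct bio *bio)
	{
		if (bio->bi_rw & REQ_WRITE_SAME) {
			/* Capability problem, not a bad disk: stop advertising
			 * WRITE SAME and fail only this bio upstack, so that
			 * blkdev_issue_zeroout() retries with plain zeroes. */
			blk_queue_max_write_same_sectors(mddev->queue, 0);
			return;
		}
		md_error(mddev, rdev);	/* a genuine write error still fails the member */
	}

The open question, picked up later in the thread, is what to do when one mirror accepted the WRITE SAME and the other rejected it.)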
* Re: RAID-10 keeps aborting 2013-06-03 18:40 ` H. Peter Anvin @ 2013-06-03 22:20 ` H. Peter Anvin 2013-06-03 22:34 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 22:20 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/03/2013 11:40 AM, H. Peter Anvin wrote: > On 06/03/2013 11:35 AM, Martin K. Petersen wrote: >>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> >> hpa> OK, so the device here says don't do this again, but fails the >> hpa> request anyway expecting the block device to pick up the slack. >> >> Yes, the block layer function will resort to writing out zeroes directly >> in this case. >> >> MD should not consider a rejected WRITE SAME a failure. >> > > Presumably MD should do the same thing that SCSI does: disable > write_same and kick the failure upstack? > Also, given the seriousness of the failure mode, is this something that should be addressed in stable? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 22:20 ` H. Peter Anvin @ 2013-06-03 22:34 ` H. Peter Anvin 2013-06-04 15:56 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 22:34 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/03/2013 03:20 PM, H. Peter Anvin wrote: > > Also, given the seriousness of the failure mode, is this something that > should be addressed in stable? > Note that the filesystem is not mounted with the "discard" option, and given the rarity of the aborts, whatever causes the WRITE SAME bio to be generated must be a very rare event. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 22:34 ` H. Peter Anvin @ 2013-06-04 15:56 ` Martin K. Petersen 0 siblings, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-04 15:56 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> Note that the filesystem is not mounted with the "discard" option, hpa> and given the rarity of the aborts, whatever causes the WRITE SAME hpa> bio to be generated must be a very rare event. It's probably not discard. I'm guessing it's something that needs to zero out a block range. Not something that happens very often. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
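(One concrete non-discard source, assuming the filesystem in question is ext4 -- an assumption, not something stated in the thread: zeroing the unwritten remainder of a split extent goes through sb_issue_zeroout(). From 3.8-era fs/ext4/extents.c, reconstructed and trimmed:

	static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
	{
		ext4_fsblk_t ee_pblock;
		unsigned int ee_len;
		int ret;

		ee_len    = ext4_ext_get_actual_len(ex);
		ee_pblock = ext4_ext_pblock(ex);

		/* sb_issue_zeroout() -> blkdev_issue_zeroout() -> WRITE SAME
		 * whenever the queue advertises write_same_max_bytes > 0. */
		ret = sb_issue_zeroout(inode->i_sb, ee_pblock, ee_len, GFP_NOFS);
		if (ret > 0)
			ret = 0;

		return ret;
	}

That would fit the observed pattern: rare, write-triggered, and independent of the "discard" mount option.)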
* Re: RAID-10 keeps aborting 2013-06-03 18:35 ` Martin K. Petersen 2013-06-03 18:38 ` H. Peter Anvin 2013-06-03 18:40 ` H. Peter Anvin @ 2013-06-03 23:19 ` H. Peter Anvin 2013-06-04 15:39 ` Joe Lawrence 2013-06-04 17:36 ` Dan Williams 3 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 23:19 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid, Joe Lawrence On 06/03/2013 11:35 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> OK, so the device here says don't do this again, but fails the > hpa> request anyway expecting the block device to pick up the slack. > > Yes, the block layer function will resort to writing out zeroes directly > in this case. > > MD should not consider a rejected WRITE SAME a failure. > We should probably add Joe Lawrence to this thread. Joe: basically it seems that the error behavior of md (at least raid10, but probably raid1 as well) on WRITE SAME is wrong, and it causes the RAID to abort. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 23:19 ` H. Peter Anvin @ 2013-06-04 15:39 ` Joe Lawrence 2013-06-04 15:46 ` H. Peter Anvin 2013-06-05 10:02 ` Bernd Schubert 0 siblings, 2 replies; 64+ messages in thread From: Joe Lawrence @ 2013-06-04 15:39 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid, Joe Lawrence On Mon, 3 Jun 2013, H. Peter Anvin wrote: > On 06/03/2013 11:35 AM, Martin K. Petersen wrote: > >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > > > hpa> OK, so the device here says don't do this again, but fails the > > hpa> request anyway expecting the block device to pick up the slack. > > > > Yes, the block layer function will resort to writing out zeroes directly > > in this case. > > > > MD should not consider a rejected WRITE SAME a failure. > > > > We should probably add Joe Lawrence to this thread. > > Joe: basically it seems that the error behavior of md (at least raid10, > but probably raid1 as well) on WRITE SAME is wrong, and it causes the > RAID to abort. Martin is probably the expert here (I had extended his initial WRITE SAME support in MD raid0 to raid1 and raid10), but I can try failing a WS cmd using our SANBlaze emulator to see the fallout. Just curious, what type of drives were in your RAID and what does /sys/class/scsi_disk/*/max_write_same_blocks report? If you have a spare drive to test, maybe you could try a quick sg_write_same command to see how the drive reacts? -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 15:39 ` Joe Lawrence @ 2013-06-04 15:46 ` H. Peter Anvin 2013-06-04 15:54 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-04 15:46 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, Dan Williams, linux-raid On 06/04/2013 08:39 AM, Joe Lawrence wrote: >> >> We should probably add Joe Lawrence to this thread. >> >> Joe: basically it seems that the error behavior of md (at least raid10, >> but probably raid1 as well) on WRITE SAME is wrong, and it causes the >> RAID to abort. > > Martin is probably the expert here (I had extended his initial WRITE SAME > support in MD raid0 to raid1 and raid10), but I can try failing a WS cmd > using our SANBlaze emulator to see the fallout. > > Just curious, what type of drives were in your RAID and what does > /sys/class/scsi_disk/*/max_write_same_blocks report? If you have a spare > drive to test, maybe you could try a quick sg_write_same command to see > how the drive reacts? > The drives are SATA drives connected via mptsas. max_write_same_blocks shows 65535. Unfortunately the problems are rare enough that it didn't pop up until the server was put in production, so I would like to avoid experimenting on it as much as possible. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 15:46 ` H. Peter Anvin @ 2013-06-04 15:54 ` Martin K. Petersen 0 siblings, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-04 15:54 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> Unfortunately the problems are rare enough that it didn't pop up hpa> until the server was put in production, so I would like to avoid hpa> experimenting on it as much as possible. Yeah, and this is trivial to reproduce. I will take a look if Joe doesn't beat me to it... -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 15:39 ` Joe Lawrence 2013-06-04 15:46 ` H. Peter Anvin @ 2013-06-05 10:02 ` Bernd Schubert 2013-06-05 11:38 ` Bernd Schubert 1 sibling, 1 reply; 64+ messages in thread From: Bernd Schubert @ 2013-06-05 10:02 UTC (permalink / raw) To: Joe Lawrence; +Cc: H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid On 06/04/2013 05:39 PM, Joe Lawrence wrote: > > Just curious, what type of drives were in your RAID and what does > /sys/class/scsi_disk/*/max_write_same_blocks report? If you have a spare > drive to test, maybe you could try a quick sg_write_same command to see > how the drive reacts? > I just ran into the same issue with an ancient system from 2006. Except that I'm in a hurry and need it to stress-test my own work, I can do anything with it - it is booted via NFS and disks are only used for development/testing. > (squeeze)fslab1:~# cat /sys/block/md126/queue/write_same_max_bytes > 16384 > (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes > 0 > 0 > 0 > 0 Ah, now I found the reason why it fails: the scsi layer had set write_same_max_bytes to zero when it detected that the device does not support it, but after reloading the Areca module (arcmsr) I now get: > (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes > 33553920 > 33553920 > 33553920 > 33553920 Now for example > [11:0:1:2] disk Hitachi HDS724040KLSA80 R001 /dev/sdl /dev/sg11 > (squeeze)fslab1:~# sg_write_same --num=100 /dev/sg11 > Write same(10) command not supported > (squeeze)fslab1:~# sg_write_same --16 --num=100 /dev/sg11 > Write same(16) command not supported Cheers, Bernd PS: This is the 2nd time I have run into this; on Sunday I had a similar issue at home with Ubuntu's 3.8 kernel, but somehow not with vanilla 3.9. I need to recheck the logs in the evening to see if it is really the same issue. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-05 10:02 ` Bernd Schubert @ 2013-06-05 11:38 ` Bernd Schubert 2013-06-05 12:53 ` [PATCH] scsi: Check if the device supports WRITE_SAME_10 Bernd Schubert 2013-06-05 19:11 ` RAID-10 keeps aborting Martin K. Petersen 1 sibling, 2 replies; 64+ messages in thread From: Bernd Schubert @ 2013-06-05 11:38 UTC (permalink / raw) Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid, linux-scsi On 06/05/2013 12:02 PM, Bernd Schubert wrote: > On 06/04/2013 05:39 PM, Joe Lawrence wrote: >> >> Just curious, what type of drives were in your RAID and what does >> /sys/class/scsi_disk/*/max_write_same_blocks report? If you have a spare >> drive to test, maybe you could try a quick sg_write_same command to see >> how the drive reacts? >> > > I just ran into the same issue with an ancient system from 2006. Except > that I'm in a hurry and need it to stress-test my own work, I can do > anything with it - it is booted via NFS and disks are only used for > development/testing. > >> (squeeze)fslab1:~# cat /sys/block/md126/queue/write_same_max_bytes >> 16384 > >> (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes >> 0 >> 0 >> 0 >> 0 > > > Ah, now I found the reason why it fails: the scsi layer had set > write_same_max_bytes to zero when it detected that the device does not support > it, but after reloading the Areca module (arcmsr) I now get: > >> (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes >> 33553920 >> 33553920 >> 33553920 >> 33553920 In sd_config_write_same() it sets if (sdkp->max_ws_blocks == 0) sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS; except when sdkp->device->no_write_same is set. But only ata_scsi_sdev_config() sets that. And I also don't see any driver setting max_ws_blocks, so everything except libata gets the default of SD_MAX_WS10_BLOCKS. This also seems to be consistent with the 33553920 I see, except that there is a bit shift somewhere. So no surprise that mptsas and arcmsr (and anything else) devices have write_same_max_bytes set. As the correct handling in the md layer seems to be difficult, can we send a fake request at device configuration time to figure out if the device really supports write same? ^ permalink raw reply [flat|nested] 64+ messages in thread
* [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-05 11:38 ` Bernd Schubert @ 2013-06-05 12:53 ` Bernd Schubert 2013-06-05 19:14 ` Martin K. Petersen 1 sibling, 1 reply; 64+ messages in thread From: Bernd Schubert @ 2013-06-05 12:53 UTC (permalink / raw) Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid, linux-scsi Here's a rather simple patch for the scsi midlayer. checkpatch.pl complains about a style issue, but I just did it like the other lines there. > schubert@squeeze@fsdevel3 linux-stable>scripts/checkpatch.pl patches-linux-3.9.y/ws10 > ERROR: spaces prohibited around that ':' (ctx:WxW) > #48: FILE: drivers/scsi/sd.h:87: > + unsigned ws10 : 1; > ^ If wanted, I can send another patch to fix the other lines first. scsi: Check if the device supports WRITE_SAME_10 From: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> The md layer currently cannot handle failed WRITE_SAME commands and the initial easiest fix is to check if the device supports WRITE_SAME at all. It already tested for WRITE_SAME_16 and this commit adds a test for WRITE_SAME_10. Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> --- drivers/scsi/sd.c | 6 +++++- drivers/scsi/sd.h | 1 + 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 82910cc..368f0a4 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -742,7 +742,7 @@ static void sd_config_write_same(struct scsi_disk *sdkp) unsigned int logical_block_size = sdkp->device->sector_size; unsigned int blocks = 0; - if (sdkp->device->no_write_same) { + if (sdkp->device->no_write_same || !(sdkp->ws10 || sdkp->ws16)) { sdkp->max_ws_blocks = 0; goto out; } @@ -2648,6 +2648,10 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp) static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer) { if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE, + WRITE_SAME)) + sdkp->ws10 = 1; + + if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE, WRITE_SAME_16)) sdkp->ws16 = 1; } diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 2386aeb..7a049de 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -84,6 +84,7 @@ struct scsi_disk { unsigned lbpws : 1; unsigned lbpws10 : 1; unsigned lbpvpd : 1; + unsigned ws10 : 1; unsigned ws16 : 1; }; #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev) ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-05 12:53 ` [PATCH] scsi: Check if the device supports WRITE_SAME_10 Bernd Schubert @ 2013-06-05 19:14 ` Martin K. Petersen 2013-06-05 20:09 ` Bernd Schubert 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-05 19:14 UTC (permalink / raw) To: Bernd Schubert Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid, linux-scsi >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: Bernd> The md layer currently cannot handle failed WRITE_SAME commands Bernd> and the initial easiest fix is to check if the device supports Bernd> WRITE_SAME at all. It already tested for WRITE_SAME_16 and this Bernd> commit adds a test for WRITE_SAME_10. No go. That'll disable WRITE SAME for drives which don't support RSOC. Which means almost all of them. I propose the following... -- Martin K. Petersen Oracle Linux Engineering [PATCH] sd: Update WRITE SAME heuristics SATA drives located behind a SAS controller would incorrectly receive WRITE SAME commands. Tweak the heuristics so that: - If REPORT SUPPORTED OPERATION CODES is provided we will use that to choose between WRITE SAME(16), WRITE SAME(10) and disabled. This also fixes an issue with the old code which would issue WRITE SAME(10) despite the command not being whitelisted in REPORT SUPPORTED OPERATION CODES. - If REPORT SUPPORTED OPERATION CODES is not provided we will fall back to WRITE SAME(10) unless the device has an ATA Information VPD page. The assumption is that a SATL which is smart enough to implement WRITE SAME would also provide REPORT SUPPORTED OPERATION CODES. To facilitate the new heuristics scsi_report_opcode() has been modified so we can distinguish between "operation not supported" and "RSOC not supported". Reported-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index 2c0d0ec..3b1ea34 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -1070,8 +1070,8 @@ EXPORT_SYMBOL_GPL(scsi_get_vpd_page); * @opcode: opcode for command to look up * * Uses the REPORT SUPPORTED OPERATION CODES to look up the given - * opcode. Returns 0 if RSOC fails or if the command opcode is - * unsupported. Returns 1 if the device claims to support the command. + * opcode. Returns -EINVAL if RSOC fails, 0 if the command opcode is + * unsupported and 1 if the device claims to support the command. 
*/ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, unsigned int len, unsigned char opcode) @@ -1081,7 +1081,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, int result; if (sdev->no_report_opcodes || sdev->scsi_level < SCSI_SPC_3) - return 0; + return -EINVAL; memset(cmd, 0, 16); cmd[0] = MAINTENANCE_IN; @@ -1097,7 +1097,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, if (result && scsi_sense_valid(&sshdr) && sshdr.sense_key == ILLEGAL_REQUEST && (sshdr.asc == 0x20 || sshdr.asc == 0x24) && sshdr.ascq == 0x00) - return 0; + return -EINVAL; if ((buffer[1] & 3) == 3) /* Command supported */ return 1; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index a37eda9..366b661 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -442,8 +442,15 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr, if (max == 0) sdp->no_write_same = 1; - else if (max <= SD_MAX_WS16_BLOCKS) - sdkp->max_ws_blocks = max; + else + sdp->no_write_same = 0; + + if (sdkp->ws16) + sdkp->max_ws_blocks = + max_t(unsigned long, max, SD_MAX_WS16_BLOCKS); + else + sdkp->max_ws_blocks = + max_t(unsigned long, max, SD_MAX_WS10_BLOCKS); sd_config_write_same(sdkp); @@ -762,16 +769,16 @@ static void sd_config_write_same(struct scsi_disk *sdkp) * blocks per I/O unless the device explicitly advertises a * bigger limit. */ - if (sdkp->max_ws_blocks == 0) - sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS; - - if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) + if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) blocks = min_not_zero(sdkp->max_ws_blocks, (u32)SD_MAX_WS16_BLOCKS); else blocks = min_not_zero(sdkp->max_ws_blocks, (u32)SD_MAX_WS10_BLOCKS); + if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes) + sdkp->max_ws_blocks = blocks; + out: blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9)); } @@ -2645,9 +2652,24 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp) static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer) { - if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE, - WRITE_SAME_16)) + struct scsi_device *sdev = sdkp->device; + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) { + sdev->no_report_opcodes = 1; + + /* Disable WRITE SAME if REPORT SUPPORTED OPERATION + * CODES is unsupported and the device has an ATA + * Information VPD page (SAT). + */ + if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE)) + sdev->no_write_same = 1; + } + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1) sdkp->ws16 = 1; + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME) == 1) + sdkp->ws10 = 1; } static int sd_try_extended_inquiry(struct scsi_device *sdp) diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 2386aeb..7a049de 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -84,6 +84,7 @@ struct scsi_disk { unsigned lbpws : 1; unsigned lbpws10 : 1; unsigned lbpvpd : 1; + unsigned ws10 : 1; unsigned ws16 : 1; }; #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev) ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-05 19:14 ` Martin K. Petersen @ 2013-06-05 20:09 ` Bernd Schubert 2013-06-07 2:15 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: Bernd Schubert @ 2013-06-05 20:09 UTC (permalink / raw) To: Martin K. Petersen Cc: Joe Lawrence, H. Peter Anvin, Dan Williams, linux-raid, linux-scsi On 06/05/2013 09:14 PM, Martin K. Petersen wrote: >>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: > > Bernd> The md layer currently cannot handle failed WRITE_SAME commands > Bernd> and the initial easiest fix is to check if the device supports > Bernd> WRITE_SAME at all. It already tested for WRITE_SAME_16 and this > Bernd> commit adds a test for WRITE_SAME_10. > > No go. That'll disable WRITE SAME for drives which don't support > RSOC. Which means almost all of them. Ah, sorry, I didn't check the specs. > > I propose the following... > > --- a/drivers/scsi/sd.c > +++ b/drivers/scsi/sd.c > @@ -442,8 +442,15 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr, > > if (max == 0) > sdp->no_write_same = 1; > - else if (max <= SD_MAX_WS16_BLOCKS) > - sdkp->max_ws_blocks = max; > + else > + sdp->no_write_same = 0; > + > + if (sdkp->ws16) > + sdkp->max_ws_blocks = > + max_t(unsigned long, max, SD_MAX_WS16_BLOCKS); > + else > + sdkp->max_ws_blocks = > + max_t(unsigned long, max, SD_MAX_WS10_BLOCKS); > > sd_config_write_same(sdkp); Max? Not min_t()? > @@ -762,16 +769,16 @@ static void sd_config_write_same(struct scsi_disk *sdkp) > * blocks per I/O unless the device explicitly advertises a > * bigger limit. > */ > - if (sdkp->max_ws_blocks == 0) > - sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS; > - > - if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) > + if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) > blocks = min_not_zero(sdkp->max_ws_blocks, > (u32)SD_MAX_WS16_BLOCKS); > else > blocks = min_not_zero(sdkp->max_ws_blocks, > (u32)SD_MAX_WS10_BLOCKS); > > + if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes) > + sdkp->max_ws_blocks = blocks; > + > out: > blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9)); > } blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks * (logical_block_size >> 9)) ? Otherwise sdkp->max_ws_blocks and the queue might have different values, wouldn't they? I can't provide a comment about scsi_get_vpd_page, I simply don't know. You certainly know the scsi specs way better than I do! Thanks, Bernd ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-05 20:09 ` Bernd Schubert @ 2013-06-07 2:15 ` Martin K. Petersen 2013-06-12 19:34 ` Bernd Schubert 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-07 2:15 UTC (permalink / raw) To: Bernd Schubert Cc: Martin K. Petersen, Joe Lawrence, H. Peter Anvin, Dan Williams, linux-raid, linux-scsi >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: >> max_t(unsigned long, max, SD_MAX_WS10_BLOCKS); Bernd> Max? Not min_t()? Brain fart. Updated patch with a few other adjustments. I have tested this on a couple of JBODs with a mishmash of SATA and SAS drives, including a few specimens that report MAX WRITE SAME BLOCKS. -- Martin K. Petersen Oracle Linux Engineering [PATCH] sd: Update WRITE SAME heuristics SATA drives located behind a SAS controller would incorrectly receive WRITE SAME commands. Tweak the heuristics so that: - If REPORT SUPPORTED OPERATION CODES is provided we will use that to choose between WRITE SAME(16), WRITE SAME(10) and disabled. This also fixes an issue with the old code which would issue WRITE SAME(10) despite the command not being whitelisted in REPORT SUPPORTED OPERATION CODES. - If REPORT SUPPORTED OPERATION CODES is not provided we will fall back to WRITE SAME(10) unless the device has an ATA Information VPD page. The assumption is that a SATL which is smart enough to implement WRITE SAME would also provide REPORT SUPPORTED OPERATION CODES. To facilitate the new heuristics scsi_report_opcode() has been modified so we can distinguish between "operation not supported" and "RSOC not supported". Reported-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: <stable@vger.kernel.org> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index 2c0d0ec..3b1ea34 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -1070,8 +1070,8 @@ EXPORT_SYMBOL_GPL(scsi_get_vpd_page); * @opcode: opcode for command to look up * * Uses the REPORT SUPPORTED OPERATION CODES to look up the given - * opcode. Returns 0 if RSOC fails or if the command opcode is - * unsupported. Returns 1 if the device claims to support the command. + * opcode. Returns -EINVAL if RSOC fails, 0 if the command opcode is + * unsupported and 1 if the device claims to support the command. 
*/ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, unsigned int len, unsigned char opcode) @@ -1081,7 +1081,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, int result; if (sdev->no_report_opcodes || sdev->scsi_level < SCSI_SPC_3) - return 0; + return -EINVAL; memset(cmd, 0, 16); cmd[0] = MAINTENANCE_IN; @@ -1097,7 +1097,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, if (result && scsi_sense_valid(&sshdr) && sshdr.sense_key == ILLEGAL_REQUEST && (sshdr.asc == 0x20 || sshdr.asc == 0x24) && sshdr.ascq == 0x00) - return 0; + return -EINVAL; if ((buffer[1] & 3) == 3) /* Command supported */ return 1; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index a37eda9..420e763 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -442,8 +442,10 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr, if (max == 0) sdp->no_write_same = 1; - else if (max <= SD_MAX_WS16_BLOCKS) + else if (max <= SD_MAX_WS16_BLOCKS) { + sdp->no_write_same = 0; sdkp->max_ws_blocks = max; + } sd_config_write_same(sdkp); @@ -750,7 +752,6 @@ static void sd_config_write_same(struct scsi_disk *sdkp) { struct request_queue *q = sdkp->disk->queue; unsigned int logical_block_size = sdkp->device->sector_size; - unsigned int blocks = 0; if (sdkp->device->no_write_same) { sdkp->max_ws_blocks = 0; @@ -762,18 +763,20 @@ static void sd_config_write_same(struct scsi_disk *sdkp) * blocks per I/O unless the device explicitly advertises a * bigger limit. */ - if (sdkp->max_ws_blocks == 0) - sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS; - - if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) - blocks = min_not_zero(sdkp->max_ws_blocks, - (u32)SD_MAX_WS16_BLOCKS); - else - blocks = min_not_zero(sdkp->max_ws_blocks, - (u32)SD_MAX_WS10_BLOCKS); + if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) + sdkp->max_ws_blocks = min_not_zero(sdkp->max_ws_blocks, + (u32)SD_MAX_WS16_BLOCKS); + else if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes) + sdkp->max_ws_blocks = min_not_zero(sdkp->max_ws_blocks, + (u32)SD_MAX_WS10_BLOCKS); + else { + sdkp->device->no_write_same = 1; + sdkp->max_ws_blocks = 0; + } out: - blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9)); + blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks * + (logical_block_size >> 9)); } /** @@ -2645,9 +2648,24 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp) static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer) { - if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE, - WRITE_SAME_16)) + struct scsi_device *sdev = sdkp->device; + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) { + sdev->no_report_opcodes = 1; + + /* Disable WRITE SAME if REPORT SUPPORTED OPERATION + * CODES is unsupported and the device has an ATA + * Information VPD page (SAT). 
+ */ + if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE)) + sdev->no_write_same = 1; + } + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1) sdkp->ws16 = 1; + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME) == 1) + sdkp->ws10 = 1; } static int sd_try_extended_inquiry(struct scsi_device *sdp) diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 2386aeb..7a049de 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -84,6 +84,7 @@ struct scsi_disk { unsigned lbpws : 1; unsigned lbpws10 : 1; unsigned lbpvpd : 1; + unsigned ws10 : 1; unsigned ws16 : 1; }; #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev) ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-07 2:15 ` Martin K. Petersen @ 2013-06-12 19:34 ` Bernd Schubert 0 siblings, 0 replies; 64+ messages in thread From: Bernd Schubert @ 2013-06-12 19:34 UTC (permalink / raw) To: Martin K. Petersen Cc: Joe Lawrence, H. Peter Anvin, Dan Williams, linux-raid, linux-scsi On 06/07/2013 04:15 AM, Martin K. Petersen wrote: >>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: > >>> max_t(unsigned long, max, SD_MAX_WS10_BLOCKS); > > Bernd> Max? Not min_t()? > > Brain fart. Updated patch with a few other adjustments. > > I have tested this on a couple of JBODs with a mishmash of SATA and SAS > drives, including a few specimens that report MAX WRITE SAME BLOCKS. > Thanks for the update! I've been at work far too long again, but I managed to test it now and it works fine with the ancient Areca controller (ARC-1260) in this test lab. So you may add Tested-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> and Reviewed-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> (although the latter probably does not count much for linux-scsi). Cheers, Bernd ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-05 11:38 ` Bernd Schubert 2013-06-05 12:53 ` [PATCH] scsi: Check if the device supports WRITE_SAME_10 Bernd Schubert @ 2013-06-05 19:11 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-05 19:11 UTC (permalink / raw) To: Bernd Schubert Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid, linux-scsi >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: Bernd> And I also don't see any driver setting max_ws_blocks, so Bernd> everything except libata gets the default of Bernd> SD_MAX_WS10_BLOCKS. Yes. That's intentional. Unless the device provides MAXIMUM WRITE SAME BLOCKS in the BLOCK LIMITS VPD. Bernd> As the correct handling in the md layer seems to be difficult, Bernd> can we send a fake request at device configuration time to figure Bernd> out if the device really supports write same? The problem is that WRITE SAME is destructive. And unfortunately a block count of 0 means "write entire device". -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 18:35 ` Martin K. Petersen ` (2 preceding siblings ...) 2013-06-03 23:19 ` H. Peter Anvin @ 2013-06-04 17:36 ` Dan Williams 2013-06-04 17:54 ` Martin K. Petersen 3 siblings, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-04 17:36 UTC (permalink / raw) To: Martin K. Petersen; +Cc: H. Peter Anvin, linux-raid On Mon, Jun 3, 2013 at 11:35 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> OK, so the device here says don't do this again, but fails the > hpa> request anyway expecting the block device to pick up the slack. > > Yes, the block layer function will resort to writing out zeroes directly > in this case. > > MD should not consider a rejected WRITE SAME a failure. What should md do in the partial success case? Seems to be a layering violation to assume that the block layer will retry with zeroes. Maybe just act on BIO_QUIET to retry the write with REQ_WRITE_SAME cleared? -- Dan ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 17:36 ` Dan Williams @ 2013-06-04 17:54 ` Martin K. Petersen 2013-06-04 17:57 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-04 17:54 UTC (permalink / raw) To: Dan Williams; +Cc: Martin K. Petersen, H. Peter Anvin, linux-raid >>>>> "Dan" == Dan Williams <dan.j.williams@gmail.com> writes: >> MD should not consider a rejected WRITE SAME a failure. Dan> What should md do in the partial success case? What exactly do you mean when you say "partial success"? Either the device accepts the command or it doesn't... Dan> Maybe just act on BIO_QUIET to retry the write with REQ_WRITE_SAME Dan> cleared? No go. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 17:54 ` Martin K. Petersen @ 2013-06-04 17:57 ` H. Peter Anvin 2013-06-04 18:04 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-04 17:57 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/04/2013 10:54 AM, Martin K. Petersen wrote: >>>>>> "Dan" == Dan Williams <dan.j.williams@gmail.com> writes: > >>> MD should not consider a rejected WRITE SAME a failure. > > Dan> What should md do in the partial success case? > > What exactly do you mean when you say "partial success"? Either the > device accepts the command or it doesn't... > One subdevice accepts it and the other doesn't, presumably. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 17:57 ` H. Peter Anvin @ 2013-06-04 18:04 ` Martin K. Petersen 2013-06-04 18:32 ` Dan Williams 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-04 18:04 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> One subdevice accepts it and the other doesn't, presumably. Ah. Well fail the command and let the block layer deal with it. This is really no different from the discard case. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 18:04 ` Martin K. Petersen @ 2013-06-04 18:32 ` Dan Williams 2013-06-04 18:38 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-04 18:32 UTC (permalink / raw) To: Martin K. Petersen; +Cc: H. Peter Anvin, linux-raid On Tue, Jun 4, 2013 at 11:04 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> One subdevice accepts it and the other doesn't, presumably. > > Ah. Well fail the command and let the block layer deal with it. This is > really no different from the discard case. > Which md also does not handle if the device later returns "illegal request" to a discard command. My point about one device accepting the write and another device dropping it is that we now have an inconsistent array and a write command to complete. So I don't see how md can wait/trust that the upper layer will retry and fix things up? Translate and retry internally for these command types, return success to the original request, and disable future requests. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 18:32 ` Dan Williams @ 2013-06-04 18:38 ` H. Peter Anvin 2013-06-04 18:56 ` Dan Williams 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-04 18:38 UTC (permalink / raw) To: Dan Williams; +Cc: Martin K. Petersen, linux-raid On 06/04/2013 11:32 AM, Dan Williams wrote: > On Tue, Jun 4, 2013 at 11:04 AM, Martin K. Petersen > <martin.petersen@oracle.com> wrote: >>>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> >> hpa> One subdevice accepts it and the other doesn't, presumably. >> >> Ah. Well fail the command and let the block layer deal with it. This is >> really no different from the discard case. > > Which md also does not handle if the device later returns "illegal > request" to a discard command. My point about one device accepting > the write and another device dropping it is that we now have an > inconsistent array and a write command to complete. So I don't see > how md can wait/trust that the upper layer will retry and fix things > up? Translate and retry internally for these command types, return > success to the original request, and disable future requests. > Well, if that is what the block layer is defined to do, then that is what the block layer does. It makes sense from the point of view of a disk: there the block layer has to translate and retry, so if the block layer is defined to do that, why not rely on it? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 18:38 ` H. Peter Anvin @ 2013-06-04 18:56 ` Dan Williams 2013-06-05 2:39 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-04 18:56 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, linux-raid On Tue, Jun 4, 2013 at 11:38 AM, H. Peter Anvin <hpa@zytor.com> wrote: > Well, if that is what the block device layer is defined to do, then that > is what the block layer does. It makes sense from the point of view of > a disk, where the block layer has to translate and redo, so if the block > layer is defined to do that, why not rely on it? > I'm just hung up on when we can safely mark the array as not dirty. At a minimum this means raid needs an "I have an ignored-write-failure in-flight, awaiting retry from upper layer" state. ^ permalink raw reply [flat|nested] 64+ messages in thread
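A hypothetical sketch of the state Dan is describing -- none of these names exist in the tree; this is purely illustrative of the bookkeeping md would need:

    /* Hypothetical (not in mainline): remember that a WRITE SAME
     * failure was ignored and a block-layer retry is still expected,
     * so the array must not be marked clean in the interim. */
    #define MD_WS_RETRY_PENDING 5  /* hypothetical mddev->flags bit */

    static bool md_can_mark_clean(struct mddev *mddev)
    {
            /* The superblock clean transition would have to wait
             * until the upper layer has re-issued the zeroing
             * writes for the failed region. */
            return !test_bit(MD_WS_RETRY_PENDING, &mddev->flags);
    }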
* Re: RAID-10 keeps aborting 2013-06-04 18:56 ` Dan Williams @ 2013-06-05 2:39 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-05 2:39 UTC (permalink / raw) To: Dan Williams; +Cc: Martin K. Petersen, linux-raid On 06/04/2013 11:56 AM, Dan Williams wrote: > On Tue, Jun 4, 2013 at 11:38 AM, H. Peter Anvin <hpa@zytor.com> wrote: >> Well, if that is what the block device layer is defined to do, then that >> is what the block layer does. It makes sense from the point of view of >> a disk, where the block layer has to translate and redo, so if the block >> layer is defined to do that, why not rely on it? >> > > I'm just hung up on when we can safely mark the array as not dirty. > At a minimum this means raid needs an "I have an ignored-write-failure > in-flight, awaiting retry from upper layer" state. > Ah yes, if you rely on the block layer to retry for you, you don't see the beginnings and ends of the entire transaction, and at least ideally the RAID -- and the specific blocks -- should be marked dirty during that operation. The same applies to DISCARD, presumably. Yuck, this suddenly got complex. Perhaps WRITE SAME should simply be disabled on raid1/raid10 until this can be addressed? Do we need to do the same for DISCARD? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting @ 2013-06-07 2:19 ` Martin K. Petersen 2013-06-10 14:15 ` Joe Lawrence 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-07 2:19 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid >>>>> "Joe" == Joe Lawrence <joe.lawrence@stratus.com> writes: Joe> I'm looking at the changes in raid1.c and am confused as to why we Joe> did this in the first place (commit c8dc9c6): Joe> if (mddev->queue) Joe> blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); Joe> for RAID1, aren't chunk_sectors always going to be zero? (At least Joe> on my machine, /sys/block/md*/queue/write_same_max_bytes for all md Joe> RAID1 devices are set to 0.) This would have the effect of Joe> rendering bdev_write_same() always false for these MD devices. I'm guessing that's a copy and paste from my raid0 support. In the raid0 case I set it to prevent straddling discard/write same commands. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
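For reference, the raid0 code Martin is describing caps WRITE SAME (like DISCARD) at the chunk size so that a single command never straddles a chunk boundary. Roughly (a sketch along the lines of the raid0 support, not a verbatim quote of the commit):

    /* raid0 run(): neighbouring chunks live on different member
     * devices, so a single DISCARD or WRITE SAME must not cross a
     * chunk boundary.  Cap both limits at the chunk size. */
    blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
    blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);

Copied into raid1, where chunk_sectors is typically zero, the same call clamps write_same_max_bytes to 0 and silently disables WRITE SAME -- which is exactly the behaviour Joe observed on his RAID1 devices.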
* Re: RAID-10 keeps aborting 2013-06-07 2:19 ` Martin K. Petersen @ 2013-06-10 14:15 ` Joe Lawrence 2013-06-12 3:15 ` NeilBrown 0 siblings, 1 reply; 64+ messages in thread From: Joe Lawrence @ 2013-06-10 14:15 UTC (permalink / raw) To: Martin K. Petersen; +Cc: H. Peter Anvin, Dan Williams, linux-raid, Neil Brown [Cc: Neil -- a few questions if you want to skip to the bottom ] I looked a little more into the RAID1 case and tried a few things. First, to enable WRITE SAME on RAID1, I had to apply the patch at the bottom of this mail. With the patch in place, the component disk limits propagate up to their MD: [2:0:4:0] disk SANBLAZE VLUN0001 0001 /dev/sds [3:0:2:0] disk SEAGATE ST9146853SS 0004 /dev/sdv /sys/class/scsi_disk/2:0:4:0/max_write_same_blocks: 65535 /sys/class/scsi_disk/3:0:2:0/max_write_same_blocks: 65535 /sys/block/md125/queue/write_same_max_bytes: 33553920 I was interested in observing how a failed WRITE SAME would interact with MD, the intent bitmap, and resyncing. In my setup, I created a RAID1 out of a 4G partition from a SAS disk (supporting WRITE SAME) and a SanBlaze VirtualLUN (claiming WRITE SAME support, but returning [sense_key,asc,ascq]: [0x05,0x20,0x00] on the first such command). I also added an external write intent bitmap file with a chunk size of 4KB to create a large, granular bitmap: mdadm --create /dev/md125 --raid-devices=2 --level=1 \ --bitmap=/mnt/bitmap/file --bitmap-chunk=4K /dev/sds1 /dev/sdv1 After creating the RAID and letting the initial synchronization finish, I filled the entire MD with random data. I would use this later to verify resync using the write intent bitmap. From previous tests, I knew that the first failed WRITE SAME to the VirtualLUN would bounce that disk from the MD. That and subsequent WRITE SAME cmds would still complete fine on the member disk that actually supported the command. To kick off WRITE SAME commands, I added a new ext4 filesystem to the disk. When mounting (no special options), this executes the following call chain: ext4_lazyinit_thread ext4_init_inode_table sb_issue_zeroout blkdev_issue_zeroout blkdev_issue_write_same When the first WRITE SAME hits the VirtualLUN, MD kicks it from the RAID and degrades the array: EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null) sd 2:0:6:0: [sds] CDB: Write same(16): 93 00 00 00 00 00 00 00 21 10 00 00 0f f8 00 00 mpt2sas0: sas_address(0x500605b0006c0ae0), phy(3) mpt2sas0: handle(0x000b), ioc_status(scsi data underrun)(0x0045), smid(59) mpt2sas0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01) mpt2sas0: [sense_key,asc,ascq]: [0x05,0x20,0x00], count(96) md/raid1:md125: Disk failure on sds1, disabling device. md/raid1:md125: Operation continuing on 1 devices. The bitmap file then starts recording dirty chunks: Sync Size : 4192192 (4.00 GiB 4.29 GB) Bitmap : 1048048 bits (chunks), 16409 dirty (1.6%) The MD write_same_max_bytes are left at 33553920 until the VirtualLUN is failed/removed/re-added. After the WRITE SAME failure, the VirtualLUN's max_write_same_blocks has been set to 0. When re-added to the MD, this value is reconsidered in the MD's write_same_max_bytes, which also gets set to zero. This behavior seems okay, as the remaining good disk fully supported WRITE SAME while the RAID was degraded.
Once the non-supporting component disk was re-added to the RAID1, WRITE SAME was disabled for the MD: /sys/class/scsi_disk/2:0:4:0/max_write_same_blocks: 0 /sys/class/scsi_disk/3:0:2:0/max_write_same_blocks: 65535 /sys/block/md125/queue/write_same_max_bytes: 0 When the VirtualLUN was re-added to the RAID1, resync initiated. Recall that earlier I had dumped random bits on the entire MD device, so the state of the disks should have looked like this: SAS = init RAID sync + random bits + ext4 WRITE SAME 0's + ext4 misc VLUN = init RAID sync + random bits and resync would need to consult the bitmap to repair the VLUN chunks that WRITE SAME and whatever else ext4_lazyinit_thread laid down. By setting the bitmap chunksize so small, the idea was to spread the failed WRITE SAME across tracking bits. CDB WRITE SAME num blocks was 0x0FF8 (4088) and 4088 x 512 (block size) ~= 2MB (much greater than 4KB). With a systemtap probe, I saw 32 WRITE SAME commands (each about 4K blocks) emitted from the block layer via ext4_lazyinit_thread. So the estimated dirty bits for all 32 should be somewhere around: 32 * (2MB dirtied / 4KB per bit) = 16384 dirty bits, pretty close to the observed 16409 (the rest I assume were other ext4 housekeeping). At this point we know: - Failed WRITE SAME will kick disk from MD RAID1 - WRITE SAME is disabled if unsupported disk added to MD - Failed WRITE SAME is properly handled by bitmap, even when spanning bitmap bits. A few outstanding questions that I have; maybe Neil or someone more familiar with the code could answer. Q1 - Is mddev->chunk_sectors always zero for RAID1? Q2 - I noticed handle_write_finished calls narrow_write_error to try and potentially avoid failing an entire device. In my tests, narrow_write_error never succeeded as rdev->badblocks.shift = -1. I think this is part of the bad block list code Neil has been working on. I don't suppose this is the proper place for MD to reset write_same_max_bytes to disable future WRITE SAME and handle the individual writes here instead of in the block layer? Regards, -- Joe From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001 From: Joe Lawrence <joe.lawrence@stratus.com> Date: Fri, 7 Jun 2013 15:25:54 -0400 Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors MD RAID1 chunk_sectors will always be zero, unlike RAID0, so RAID1 does not need to worry about limiting the write same sectors in that regard. Let disk_stack_limits choose the minimum of the RAID1 components write same values. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> --- drivers/md/raid1.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index fd86b37..3dc9ad6 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -2821,9 +2821,6 @@ static int run(struct mddev *mddev) if (IS_ERR(conf)) return PTR_ERR(conf); - if (mddev->queue) - blk_queue_max_write_same_sectors(mddev->queue, - mddev->chunk_sectors); rdev_for_each(rdev, mddev) { if (!mddev->gendisk) continue; -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-10 14:15 ` Joe Lawrence @ 2013-06-12 3:15 ` NeilBrown 2013-06-12 4:07 ` H. Peter Anvin 2013-06-13 2:45 ` Joe Lawrence 0 siblings, 2 replies; 64+ messages in thread From: NeilBrown @ 2013-06-12 3:15 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 2465 bytes --] On Mon, 10 Jun 2013 10:15:05 -0400 Joe Lawrence <joe.lawrence@stratus.com> wrote: > A few outstanding questions that I have; maybe Neil or someone more > familiar with the code could answer. > > Q1 - Is mddev->chunk_sectors always zero for RAID1? Not always. But often. It is largely ignored. I think the only effect is to round the size of the device down to a multiple of the chunk size. > > Q2 - I noticed handle_write_finished calls narrow_write_error to try and > potentially avoid failing an entire device. In my tests, > narrow_write_error never succeeded as rdev->badblocks.shift = -1. Yes. You would need a newer mdadm (from my git tree) to get badblocks.shift to something else. > > I think this is part of the bad block list code Neil has been working > on. I don't suppose this is the proper place for MD to reset > write_same_max_bytes to disable future WRITE SAME and handle the > individual writes here instead of in the block layer? If a drive reports that WRITE SAME works, but it doesn't, then I'm not sure that I can be happy about working with that drive. If a drive has some quirky behaviour wrt WRITE SAME, then that should be handled in some place where 'quirks' are handled - certainly not in md. I've applied that patch below - thanks. NeilBrown > > Regards, > > -- Joe > > > From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001 > From: Joe Lawrence <joe.lawrence@stratus.com> > Date: Fri, 7 Jun 2013 15:25:54 -0400 > Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors > > MD RAID1 chunk_sectors will always be zero, unlike RAID0, so RAID1 does > not need to worry about limiting the write same sectors in that regard. > Let disk_stack_limits choose the minimum of the RAID1 components write > same values. > > Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> > --- > drivers/md/raid1.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index fd86b37..3dc9ad6 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -2821,9 +2821,6 @@ static int run(struct mddev *mddev) > if (IS_ERR(conf)) > return PTR_ERR(conf); > > - if (mddev->queue) > - blk_queue_max_write_same_sectors(mddev->queue, > - mddev->chunk_sectors); > rdev_for_each(rdev, mddev) { > if (!mddev->gendisk) > continue; [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 3:15 ` NeilBrown @ 2013-06-12 4:07 ` H. Peter Anvin 2013-06-12 6:29 ` Bernd Schubert 2013-06-12 14:25 ` Martin K. Petersen 1 sibling, 2 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 4:07 UTC (permalink / raw) To: NeilBrown; +Cc: Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid On 06/11/2013 08:15 PM, NeilBrown wrote: > If a drive reports that WRITE SAME works, but it doesn't, then I'm > not sure that I can be happy about working with that drive. Seriously... we have that kind of problem all over the place with all kinds of hardware. Falling back is sensible... the problem here is *where* that needs to happen... the block layer already does, apparently. > If a drive has some quirky behaviour wrt WRITE SAME, then that > should be handled in some place where 'quirks' are handled - > certainly not in md. The problem here is that you don't find out ahead of time. Now, if I understand the issue at hand correctly, the reporting here was actually a Linux bug related to SATA drives behind a SAS controller. Martin, am I right? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 4:07 ` H. Peter Anvin @ 2013-06-12 6:29 ` Bernd Schubert 2013-06-12 10:22 ` Joe Lawrence 2013-06-12 14:28 ` Martin K. Petersen 1 sibling, 2 replies; 64+ messages in thread From: Bernd Schubert @ 2013-06-12 6:29 UTC (permalink / raw) To: H. Peter Anvin Cc: NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid On 06/12/2013 06:07 AM, H. Peter Anvin wrote: > On 06/11/2013 08:15 PM, NeilBrown wrote: >> If a drive reports that WRITE SAME works, but it doesn't, then I'm >> not sure that I can be happy about working with that drive. > > Seriously... we have that kind of problem all over the place with all > kinds of hardware. Falling back is sensible... the problem here is > *where* that needs to happen... the block layer already does, apparently. > >> If a drive has some quirky behaviour wrt WRITE SAME, then that >> should be handled in some place where 'quirks' are handled - >> certainly not in md. > > The problem here is that you don't find out ahead of time. > > Now, if I understand the issue at hand correctly, the reporting > here was actually a Linux bug related to SATA drives behind a SAS > controller. Martin, am I right? Martin, please correct me if I'm wrong, but I think the code optimistically enabled WRITE_SAME for any drive, except those on a SATA (libata) controller. So it was not the drive that reported it can do WRITE_SAME; the SCSI midlayer did. Martin's patch should improve that (I still need to test it on our hardware), but I'm not sure if there won't be some hardware falling through. Cheers, Bernd ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 6:29 ` Bernd Schubert @ 2013-06-12 10:22 ` Joe Lawrence 2013-06-12 14:28 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Joe Lawrence @ 2013-06-12 10:22 UTC (permalink / raw) To: Bernd Schubert Cc: H. Peter Anvin, NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid On Wed, 12 Jun 2013, Bernd Schubert wrote: > On 06/12/2013 06:07 AM, H. Peter Anvin wrote: > > On 06/11/2013 08:15 PM, NeilBrown wrote: > >> If a drive reports that WRITE SAME works, but it doesn't, then I'm > >> not sure that I can be happy about working with that drive. > > > > Seriously... we have that kind of problem all over the place with all > > kinds of hardware. Falling back is sensible... the problem here is > > *where* that needs to happen... the block layer already does, apparently. > > > >> If a drive has some quirky behaviour wrt WRITE SAME, then that > >> should be handled in some place where 'quirks' are handled - > >> certainly not in md. > > > > The problem here is that you don't find out ahead of time. > > > > Now, if I understand the issue at hand correctly, the reporting > > here was actually a Linux bug related to SATA drives behind a SAS > > controller. Martin, am I right? > > Martin, please correct me if I'm wrong, but I think the code > optimistically enabled WRITE_SAME for any drive, except those on a > SATA (libata) controller. So it was not the drive that reported it can do > WRITE_SAME; the SCSI midlayer did. Martin's patch should improve > that (I still need to test it on our hardware), but I'm not sure if > there won't be some hardware falling through. I'm worried about other unsupported HW, too. Last night I started writing a patch to set the raid1,10 max write same sectors to 0 for inclusion in 3.10 + stable... I'd like to include a mention of Martin's patch/the SATA drives in the commit log. Thanks so much for hunting this down. -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 6:29 ` Bernd Schubert 2013-06-12 10:22 ` Joe Lawrence @ 2013-06-12 14:28 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-12 14:28 UTC (permalink / raw) To: Bernd Schubert Cc: H. Peter Anvin, NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: Bernd> please correct me if I'm wrong, but I think the code Bernd> optimistically enabled WRITE_SAME for any drive, except those Bernd> on a SATA (libata) controller. Correct. WRITE SAME has been used extensively by RAID manufacturers for years. So pretty much any SCSI-class drive supports it. The headache we ran into in this case was SATA drives behind a SAS controller. Bernd> Martin's patch should improve that (I still need to test it on our Bernd> hardware), Please do! James wants a tested-by from someone that's not me before he pushes the patch to Linus. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 4:07 ` H. Peter Anvin 2013-06-12 6:29 ` Bernd Schubert @ 2013-06-12 14:25 ` Martin K. Petersen 2013-06-12 14:29 ` H. Peter Anvin 1 sibling, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-12 14:25 UTC (permalink / raw) To: H. Peter Anvin Cc: NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> If a drive has some quirky behaviour wrt WRITE SAME, then that should >> be handled in some place where 'quirks' are handled - certainly not >> in md. hpa> The problem here is that you don't find out ahead of time. hpa> Now, if I understand the issue at hand correctly, the hpa> reporting here was actually a Linux bug related to SATA drives hpa> behind a SAS controller. Martin, am I right? Support for WRITE SAME is harder for us to detect. With discard we have a set of device-reported bits we can use as triggers, not so for WRITE SAME. And since it is a destructive command we cannot simply issue one at device discovery time to see whether it works. Technically there's nothing that prevents a SAS controller's SCSI-ATA Translation from handling WRITE SAME. The patch I posted simply adds another heuristic. Namely that if we can see that the drive behind the SAS controller is of the ATA persuasion we will not attempt to issue WRITE SAME unless the controller explicitly advertises WRITE SAME support using REPORT SUPPORTED OPERATION CODES. Sadly we cannot exclusively rely on RSOC when deciding whether WRITE SAME is supported or not for devices in general. 95% of the WRITE SAME-capable devices out there do not support RSOC :( -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
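A sketch of the heuristic Martin describes, paraphrased from the sd.c patch (the exact conditions may differ from the final commit): if the device cannot do REPORT SUPPORTED OPERATION CODES and it exposes an ATA Information VPD page -- i.e. it sits behind SCSI-ATA Translation -- WRITE SAME is disabled outright.

    /* Sketch of the sd.c detection policy, not the literal patch. */
    static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
    {
            struct scsi_device *sdev = sdkp->device;

            if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
                    /* Device predates RSOC. */
                    sdev->no_report_opcodes = 1;

                    /* No RSOC plus an ATA Information VPD page (0x89)
                     * means an ATA device behind SAT: don't risk
                     * WRITE SAME at all. */
                    if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE))
                            sdev->no_write_same = 1;
                    return;
            }

            /* RSOC works: believe only what is explicitly advertised. */
            if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1)
                    sdkp->ws16 = 1;
    }

As Martin notes, the RSOC branch cannot be the only test: most WRITE SAME-capable SCSI drives don't implement RSOC at all, so plain SCSI devices still get the optimistic default.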
* Re: RAID-10 keeps aborting 2013-06-12 14:25 ` Martin K. Petersen @ 2013-06-12 14:29 ` H. Peter Anvin 2013-06-12 14:34 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 14:29 UTC (permalink / raw) To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid On 06/12/2013 07:25 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > >>> If a drive has some quirky behaviour wrt WRITE SAME, then that should >>> be handled in some place where 'quirks' are handled - certainly not >>> in md. > > hpa> The problem here is that you don't find out ahead of time. > > hpa> Now, if I understand the issue at hand correctly, the > hpa> reporting here was actually a Linux bug related to SATA drives > hpa> behind a SAS controller. Martin, am I right? > > Support for WRITE SAME is harder for us to detect. With discard we have > a set of device-reported bits we can use as triggers, not so for WRITE > SAME. And since it is a destructive command we cannot simply issue one > at device discovery time to see whether it works. > > Technically there's nothing that prevents a SAS controller's SCSI-ATA > Translation from handling WRITE SAME. The patch I posted simply adds another > heuristic. Namely that if we can see that the drive behind the SAS > controller is of the ATA persuasion we will not attempt to issue WRITE > SAME unless the controller explicitly advertises WRITE SAME support > using REPORT SUPPORTED OPERATION CODES. > > Sadly we cannot exclusively rely on RSOC when deciding whether WRITE > SAME is supported or not for devices in general. 95% of the WRITE > SAME-capable devices out there do not support RSOC :( > The second question is whether we should disable WRITE SAME for raid1/10 (what about raid0?) for 3.10/stable, or whether your patch really is sufficient... "just adds another heuristic" makes me nervous. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 14:29 ` H. Peter Anvin @ 2013-06-12 14:34 ` Martin K. Petersen 2013-06-12 14:37 ` H. Peter Anvin 2013-06-12 14:45 ` H. Peter Anvin 0 siblings, 2 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-12 14:34 UTC (permalink / raw) To: H. Peter Anvin Cc: Martin K. Petersen, NeilBrown, Joe Lawrence, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> The second question is whether we should disable WRITE SAME for raid1/10 hpa> (what about raid0?) for 3.10/stable, or whether your patch really is hpa> sufficient... "just adds another heuristic" makes me nervous. I think we should disable 1+10 in stable until we get the recovery scenario sorted out. I don't believe there are any problems with raid0. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 14:34 ` Martin K. Petersen @ 2013-06-12 14:37 ` H. Peter Anvin 2013-06-12 14:45 ` H. Peter Anvin 1 sibling, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 14:37 UTC (permalink / raw) To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid On 06/12/2013 07:34 AM, Martin K. Petersen wrote: > > I don't believe there are any problems with raid0. > I was wondering if a WRITE SAME sent to raid0 would cause the array to get stopped or panicked. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 14:34 ` Martin K. Petersen 2013-06-12 14:37 ` H. Peter Anvin @ 2013-06-12 14:45 ` H. Peter Anvin 2013-06-13 3:10 ` NeilBrown 1 sibling, 2 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 14:45 UTC (permalink / raw) To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 504 bytes --] On 06/12/2013 07:34 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> The second question is whether we should disable WRITE SAME for raid1/10 > hpa> (what about raid0?) for 3.10/stable, or whether your patch really is > hpa> sufficient... "just adds another heuristic" makes me nervous. > > I think we should disable 1+10 in stable until we get the recovery > scenario sorted out. > > I don't believe there are any problems with raid0. > How does this look? -hpa [-- Attachment #2: 0001-raid1-10-Disable-WRITE-SAME-until-a-recovery-strateg.patch --] [-- Type: text/x-patch, Size: 2452 bytes --] From ac28be1574a6187f4f26bd75217059bf17b13560 Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" <hpa@zytor.com> Date: Wed, 12 Jun 2013 07:37:43 -0700 Subject: [PATCH] raid1,10: Disable WRITE SAME until a recovery strategy is in place There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drives behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession, resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1 and raid10. Signed-off-by: H.
Peter Anvin <hpa@linux.intel.com> --- drivers/md/raid1.c | 4 ++-- drivers/md/raid10.c | 3 +-- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 5595118..914ca0a 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -2780,8 +2780,8 @@ static int run(struct mddev *mddev) return PTR_ERR(conf); if (mddev->queue) - blk_queue_max_write_same_sectors(mddev->queue, - mddev->chunk_sectors); + blk_queue_max_write_same_sectors(mddev->queue, 0); + rdev_for_each(rdev, mddev) { if (!mddev->gendisk) continue; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 59d4daa..807ace8 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -3609,8 +3609,7 @@ static int run(struct mddev *mddev) if (mddev->queue) { blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); - blk_queue_max_write_same_sectors(mddev->queue, - mddev->chunk_sectors); + blk_queue_max_write_same_sectors(mddev->queue, 0); blk_queue_io_min(mddev->queue, chunk_size); if (conf->geo.raid_disks % conf->geo.near_copies) blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks); -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting @ 2013-06-12 15:58 ` H. Peter Anvin 0 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 15:58 UTC (permalink / raw) To: Lawrence, Joe; +Cc: Martin K. Petersen, NeilBrown, Dan Williams, linux-raid On 06/12/2013 08:55 AM, Lawrence, Joe wrote: > This looks exactly like what I started last night, only with a more detailed commit msg. Would it be worth mentioning that block will only retry if MD fails the master bio? (i.e., if one of the mirrored components succeeds the WRITE SAME) Probably not... the big deal here is that it isn't a viable solution due to the write bitmap fence. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 14:45 ` H. Peter Anvin @ 2013-06-13 3:10 ` NeilBrown 2013-06-13 3:13 ` H. Peter Anvin 2013-06-13 21:40 ` Martin K. Petersen 1 sibling, 2 replies; 64+ messages in thread From: NeilBrown @ 2013-06-13 3:10 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 960 bytes --] On Wed, 12 Jun 2013 07:45:16 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 06/12/2013 07:34 AM, Martin K. Petersen wrote: > >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > > > hpa> The second question is whether we should disable WRITE SAME for raid1/10 > > hpa> (what about raid0?) for 3.10/stable, or whether your patch really is > > hpa> sufficient... "just adds another heuristic" makes me nervous. > > > > I think we should disable 1+10 in stable until we get the recovery > > scenario sorted out. > > > > I don't believe there are any problems with raid0. > > > > How does this look? > > -hpa > Promising - thanks. However, should we do the same thing in raid5.c too? As far as I can tell, the default set by blk_set_stacking_limits() (which md calls) is to allow WRITE_SAME if all underlying devices do. But I'm pretty sure raid5 will do the wrong thing with a WRITE_SAME request. ?? Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-13 3:10 ` NeilBrown @ 2013-06-13 3:13 ` H. Peter Anvin 2013-06-13 3:31 ` NeilBrown 1 sibling, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-13 3:13 UTC (permalink / raw) To: NeilBrown; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid On 06/12/2013 08:10 PM, NeilBrown wrote: > > Promising - thanks. > > However, should we do the same thing in raid5.c too? As far as I can > tell, the default set by blk_set_stacking_limits() (which md calls) > is to allow WRITE_SAME if all underlying devices do. But I'm > pretty sure raid5 will do the wrong thing with a WRITE_SAME > request. > Yes, if raid5 also bounces the array if the WRITE SAME request fails at the device, we need to do the same thing there. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-13 3:13 ` H. Peter Anvin @ 2013-06-13 3:31 ` NeilBrown 0 siblings, 0 replies; 64+ messages in thread From: NeilBrown @ 2013-06-13 3:31 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 681 bytes --] On Wed, 12 Jun 2013 20:13:19 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 06/12/2013 08:10 PM, NeilBrown wrote: > > > > Promising - thanks. > > > > However, should we do the same thing in raid5.c too? As far as I can > > tell, the default set by blk_set_stacking_limits() (which md calls) > > is to allow WRITE_SAME if all underlying devices do. But I'm > > pretty sure raid5 will do the wrong thing with a WRITE_SAME > > request. > > > > Yes, if raid5 also bounces the array if the WRITE SAME request fails > at the device, we need to do the same thing there. > > -hpa > Thanks. I've modified the patch and tagged it for -stable. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-13 3:10 ` NeilBrown 2013-06-13 3:13 ` H. Peter Anvin @ 2013-06-13 21:40 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-13 21:40 UTC (permalink / raw) To: NeilBrown Cc: H. Peter Anvin, Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid >>>>> "Neil" == NeilBrown <neilb@suse.de> writes: Neil> But I'm pretty sure raid5 will do the wrong thing with a Neil> WRITE_SAME request. Yeah, you should set: mddev->queue->limits.max_write_same_sectors = 0; before disk_stack_limits() is called. Just like it's done with discard_zeroes_data. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
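For raid5, the analogous change would look something like this sketch (the patch Neil tagged may differ in detail):

    /* raid5 run(): member devices may advertise WRITE SAME, but raid5
     * cannot recover from a partial failure safely, so advertise a
     * limit of 0 before the member limits are stacked. */
    if (mddev->queue)
            blk_queue_max_write_same_sectors(mddev->queue, 0);

This mirrors the raid1/raid10 hunks in the patch posted earlier in the thread, and ensures that blk_set_stacking_limits() cannot re-enable the feature based on what the component devices claim.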
* Re: RAID-10 keeps aborting 2013-06-12 3:15 ` NeilBrown 2013-06-12 4:07 ` H. Peter Anvin @ 2013-06-13 2:45 ` Joe Lawrence 2013-06-13 3:11 ` NeilBrown 1 sibling, 1 reply; 64+ messages in thread From: Joe Lawrence @ 2013-06-13 2:45 UTC (permalink / raw) To: NeilBrown Cc: Joe Lawrence, Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid On Wed, 12 Jun 2013, NeilBrown wrote: > I've applied that patch below - thanks. > > NeilBrown > > > > From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001 > > From: Joe Lawrence <joe.lawrence@stratus.com> > > Date: Fri, 7 Jun 2013 15:25:54 -0400 > > Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors > > [snip patch] Hi Neil -- this patch was only to test out what RAID1 would do if the blk_queue_max_write_same_sectors were set to enable WRITE SAME. As HPA and Martin point out, we should be disabling WRITE SAME for raid1/10 in 3.10 / stable for now. Sorry for the confusion. -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-13 2:45 ` Joe Lawrence @ 2013-06-13 3:11 ` NeilBrown 0 siblings, 0 replies; 64+ messages in thread From: NeilBrown @ 2013-06-13 3:11 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 1036 bytes --] On Wed, 12 Jun 2013 22:45:08 -0400 (EDT) Joe Lawrence <joe.lawrence@stratus.com> wrote: > On Wed, 12 Jun 2013, NeilBrown wrote: > > > I've applied that patch below - thanks. > > > > NeilBrown > > > > > > From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001 > > > From: Joe Lawrence <joe.lawrence@stratus.com> > > > Date: Fri, 7 Jun 2013 15:25:54 -0400 > > > Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors > > > > [snip patch] > > Hi Neil -- this patch was only to test out what RAID1 would do if the > blk_queue_max_write_same_sectors were set to enable WRITE SAME. As HPA > and Martin point out, we should be disabling WRITE SAME for raid1/10 in > 3.10 / stable for now. Sorry for the confusion. > > -- Joe Ahh - thanks. The code your patch removes is clearly wrong as chunk size doesn't mean anything useful for raid1, but as you say we need to actually disable it, so may as well do that at the same time as removing the error. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting @ 2013-06-05 19:29 ` Martin K. Petersen 2013-06-06 18:27 ` Joe Lawrence 2013-06-12 14:43 ` Martin K. Petersen 1 sibling, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-05 19:29 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Dan Williams, Martin K. Petersen, linux-raid, joe.lawrence >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> Yuck, this suddenly got complex. Perhaps WRITE SAME should simply hpa> be disabled on raid1/raid10 until this can be addressed? Yeah, maybe. hpa> Do we need to do the same for DISCARD? For discard we have better heuristics from the device so the partial completion case should be rare. But obviously a disk can reject a command anytime with or without reason... Also, discard is advisory. There is no guarantee that a pair of mirrored drives will be consistent in the discarded region (unless the drives promise to zero discarded blocks but MD doesn't check that for raid1/raid10). WRITE SAME requires strict consistency between the devices, however. So it looks like there's some work that needs to be done in that department. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-05 19:29 ` Martin K. Petersen @ 2013-06-06 18:27 ` Joe Lawrence 2013-06-06 18:36 ` H. Peter Anvin 0 siblings, 2 replies; 64+ messages in thread From: Joe Lawrence @ 2013-06-06 18:27 UTC (permalink / raw) To: Martin K. Petersen; +Cc: H. Peter Anvin, Dan Williams, linux-raid On Wed, 5 Jun 2013 12:29:32 -0700 (PDT) "Martin K. Petersen" <martin.petersen@oracle.com> wrote: > >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> Yuck, this suddenly got complex. Perhaps WRITE SAME should simply > hpa> be disabled on raid1/raid10 until this can be addressed? > > Yeah, maybe. Martin, I'm looking at the changes in raid1.c and am confused as to why we did this in the first place (commit c8dc9c6): if (mddev->queue) blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); for RAID1, aren't chunk_sectors always going to be zero? (At least on my machine, /sys/block/md*/queue/write_same_max_bytes for all md RAID1 devices are set to 0.) This would have the effect of rendering bdev_write_same() always false for these MD devices. -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-06 18:27 ` Joe Lawrence @ 2013-06-06 18:36 ` H. Peter Anvin 1 sibling, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-06 18:36 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, Dan Williams, linux-raid On 06/06/2013 11:27 AM, Joe Lawrence wrote: > On Wed, 5 Jun 2013 12:29:32 -0700 (PDT) > "Martin K. Petersen" <martin.petersen@oracle.com> wrote: > >>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> >> hpa> Yuck, this suddenly got complex. Perhaps WRITE SAME should simply >> hpa> be disabled on raid1/raid10 until this can be addressed? >> >> Yeah, maybe. > > Martin, > > I'm looking at the changes in raid1.c and am confused as to why we did > this in the first place (commit c8dc9c6): > > if (mddev->queue) > blk_queue_max_write_same_sectors(mddev->queue, > mddev->chunk_sectors); > > for RAID1, aren't chunk_sectors always going to be zero? (At least on > my machine, /sys/block/md*/queue/write_same_max_bytes for all md RAID1 > devices are set to 0.) This would have the effect of rendering > bdev_write_same() always false for these MD devices. > That presumably also explains why only RAID-10 seems to be affected. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-05 19:29 ` Martin K. Petersen @ 2013-06-12 14:43 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-12 14:43 UTC (permalink / raw) To: H. Peter Anvin Cc: Martin K. Petersen, NeilBrown, Joe Lawrence, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> I don't believe there are any problems with raid0. hpa> I was wondering if a WRITE SAME sent to raid0 would cause the array hpa> to get stopped or panicked. raid0 doesn't do its own endio processing, so an error will just bubble up to the generic code in blk-lib. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 3:57 RAID-10 keeps aborting H. Peter Anvin 2013-06-03 4:05 ` H. Peter Anvin 2013-06-03 5:47 ` Dan Williams @ 2013-06-11 21:50 ` Joe Lawrence 2013-06-11 21:53 ` H. Peter Anvin 2 siblings, 1 reply; 64+ messages in thread From: Joe Lawrence @ 2013-06-11 21:50 UTC (permalink / raw) To: H. Peter Anvin Cc: linux-raid, Dan Williams, Martin K. Petersen, Bernd Schubert I'll be away from the office the rest of the week for the Red Hat Summit, but here are a few things tested today: *** Test 1 (repeated for both RAID1 and RAID10) Setup: RAID1/10, a single disk fails WRITE SAME with [sense_key,asc,ascq]: [0x05,0x20,0x00] Result: the RAID md_personality error_handler was invoked for the disk that kicked back Illegal Request. That disk was marked faulty and write intent bitmaps started tracking chunks. WRITE SAME to the other disks succeeded so MD didn't fail the parent bio and therefore blkdev_issue_write_same was considered successful. When the failed component disk is re-added to the RAID, the write_same_max_bytes are recalculated and set to 0. Resync with write intent bitmap properly mirrored the disk. This behavior seems reasonable, though one could argue that write_same_max_bytes should be set to 0 for the MD device once any WRITE SAME fails. Better yet, the block layer could fall back to plain WRITEs without dropping the disk. *** Test 2 Setup: RAID1, all member disks fail WRITE SAME with [sense_key,asc,ascq]: [0x05,0x20,0x00]. Result: the first disk is kicked out of the RAID, but the second remains active. The block layer detects an error from blkdev_issue_write_same, emits "mdXXX: WRITE SAME failed. Manually zeroing" messages, but never prevents future WRITE SAMEs from hitting this MD device. *** Test 3 Setup: RAID10, all member disks fail WRITE SAME with [sense_key,asc,ascq]: [0x05,0x20,0x00]. Result: all component disks are marked as faulty and lost page writes occur on the MD device! blkdev_issue_write_same fails and "mdXXX: WRITE SAME failed. Manually zeroing" messages appear, but it's too late, there aren't any disks left standing in the RAID. I think the results of Tests 2 and 3 confirm that RAID10 definitely needs to be fixed... Disabling WRITE SAME for RAID10 for now (3.10 + stable) would be the safest thing to do until this is sorted out properly. -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
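Joe's "better yet" alternative -- falling back to plain writes without kicking the disk -- would amount to something like this hypothetical check in md's write-completion path (no such code exists in any tree; the helper name is illustrative only):

    /* Hypothetical (not in mainline): decide whether a write error
     * should fail the member device, or merely fail the parent bio so
     * that blk-lib retries with plain zeroing writes. */
    static bool md_write_error_is_fatal(struct bio *bio)
    {
            if (bio->bi_rw & REQ_WRITE_SAME)
                    return false;   /* retryable: don't call md_error() */
            return true;            /* real write failure: fail the rdev */
    }

As the earlier bitmap discussion showed, this would still need the failed region kept dirty until the retry completes, which is why disabling WRITE SAME outright was the chosen stopgap.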
* Re: RAID-10 keeps aborting 2013-06-11 21:50 ` Joe Lawrence @ 2013-06-11 21:53 ` H. Peter Anvin 0 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-11 21:53 UTC (permalink / raw) To: Joe Lawrence; +Cc: linux-raid, Dan Williams, Martin K. Petersen, Bernd Schubert Exactly. Joe Lawrence <joe.lawrence@stratus.com> wrote: >I'll be away from the office the rest of the week for the Red Hat Summit, >but here are a few things tested today: > >*** Test 1 (repeated for both RAID1 and RAID10) > >Setup: RAID1/10, a single disk fails WRITE SAME with > [sense_key,asc,ascq]: [0x05,0x20,0x00] > >Result: the RAID md_personality error_handler was invoked for the disk > that kicked back Illegal Request. That disk was marked > faulty and write intent bitmaps started tracking chunks. WRITE > SAME to the other disks succeeded so MD didn't fail the parent > bio and therefore blkdev_issue_write_same was considered > successful. > > When the failed component disk is re-added to the RAID, the > write_same_max_bytes are recalculated and set to 0. Resync with > write intent bitmap properly mirrored the disk. > >This behavior seems reasonable, though one could argue that >write_same_max_bytes should be set to 0 for the MD device once any >WRITE >SAME fails. Better yet, the block layer could fall back to plain >WRITEs >without dropping the disk. > > >*** Test 2 > >Setup: RAID1, all member disks fail WRITE SAME with > [sense_key,asc,ascq]: [0x05,0x20,0x00]. > >Result: the first disk is kicked out of the RAID, but the second >remains > active. The block layer detects an error from > blkdev_issue_write_same, emits "mdXXX: WRITE SAME failed. > Manually zeroing" messages, but never prevents future WRITE > SAMEs from hitting this MD device. > > >*** Test 3 > >Setup: RAID10, all member disks fail WRITE SAME with > [sense_key,asc,ascq]: [0x05,0x20,0x00]. > >Result: all component disks are marked as faulty and lost page writes > occur on the MD device! blkdev_issue_write_same fails and "mdXXX: > WRITE SAME failed. Manually zeroing" messages appear, but it's too late, > there aren't any disks left standing in the RAID. > > >I think the results of Tests 2 and 3 confirm that RAID10 definitely >needs to be fixed... Disabling WRITE SAME for RAID10 for now (3.10 + >stable) would be the safest thing to do until this is sorted out properly. > >-- Joe -- Sent from my mobile phone. Please excuse brevity and lack of formatting. ^ permalink raw reply [flat|nested] 64+ messages in thread
end of thread, other threads:[~2013-06-13 21:40 UTC | newest] Thread overview: 64+ messages 2013-06-03 3:57 RAID-10 keeps aborting H. Peter Anvin 2013-06-03 4:05 ` H. Peter Anvin 2013-06-03 5:47 ` Dan Williams 2013-06-03 6:06 ` H. Peter Anvin 2013-06-03 6:14 ` Dan Williams 2013-06-03 6:30 ` H. Peter Anvin 2013-06-03 14:39 ` H. Peter Anvin 2013-06-11 16:47 ` Joe Lawrence 2013-06-11 17:12 ` H. Peter Anvin 2013-06-03 15:47 ` H. Peter Anvin 2013-06-03 16:09 ` Joe Lawrence 2013-06-03 17:22 ` Dan Williams 2013-06-03 17:40 ` H. Peter Anvin 2013-06-03 18:35 ` Martin K. Petersen 2013-06-03 18:38 ` H. Peter Anvin 2013-06-03 18:40 ` H. Peter Anvin 2013-06-03 22:20 ` H. Peter Anvin 2013-06-03 22:34 ` H. Peter Anvin 2013-06-04 15:56 ` Martin K. Petersen 2013-06-03 23:19 ` H. Peter Anvin 2013-06-04 15:39 ` Joe Lawrence 2013-06-04 15:46 ` H. Peter Anvin 2013-06-04 15:54 ` Martin K. Petersen 2013-06-05 10:02 ` Bernd Schubert 2013-06-05 11:38 ` Bernd Schubert 2013-06-05 12:53 ` [PATCH] scsi: Check if the device support WRITE_SAME_10 Bernd Schubert 2013-06-05 19:14 ` Martin K. Petersen 2013-06-05 20:09 ` Bernd Schubert 2013-06-07 2:15 ` Martin K. Petersen 2013-06-12 19:34 ` Bernd Schubert 2013-06-05 19:11 ` RAID-10 keeps aborting Martin K. Petersen 2013-06-04 17:36 ` Dan Williams 2013-06-04 17:54 ` Martin K. Petersen 2013-06-04 17:57 ` H. Peter Anvin 2013-06-04 18:04 ` Martin K. Petersen 2013-06-04 18:32 ` Dan Williams 2013-06-04 18:38 ` H. Peter Anvin 2013-06-04 18:56 ` Dan Williams 2013-06-05 2:39 ` H. Peter Anvin 2013-06-07 2:19 ` Martin K. Petersen 2013-06-10 14:15 ` Joe Lawrence 2013-06-12 3:15 ` NeilBrown 2013-06-12 4:07 ` H. Peter Anvin 2013-06-12 6:29 ` Bernd Schubert 2013-06-12 10:22 ` Joe Lawrence 2013-06-12 14:28 ` Martin K. Petersen 2013-06-12 14:25 ` Martin K. Petersen 2013-06-12 14:29 ` H. Peter Anvin 2013-06-12 14:34 ` Martin K. Petersen 2013-06-12 14:37 ` H. Peter Anvin 2013-06-12 14:45 ` H. Peter Anvin 2013-06-12 15:58 ` H. Peter Anvin 2013-06-13 3:10 ` NeilBrown 2013-06-13 3:13 ` H. Peter Anvin 2013-06-13 3:31 ` NeilBrown 2013-06-13 21:40 ` Martin K. Petersen 2013-06-13 2:45 ` Joe Lawrence 2013-06-13 3:11 ` NeilBrown 2013-06-05 19:29 ` Martin K. Petersen 2013-06-06 18:27 ` Joe Lawrence 2013-06-06 18:36 ` H. Peter Anvin 2013-06-12 14:43 ` Martin K. Petersen 2013-06-11 21:50 ` Joe Lawrence 2013-06-11 21:53 ` H. Peter Anvin