* RAID-10 keeps aborting @ 2013-06-03 3:57 H. Peter Anvin 2013-06-03 4:05 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 3:57 UTC (permalink / raw) To: linux-raid Hello, I have a brand new server with a RAID-10 array. The drives are a SAS JBOD (mptsas) which I'm driving using Linux mdraid raid10. Unfortunately, although the server did burn-in fine, once put in production I have so far had multiple cases (about once every 24 hours) of the raid10 failing, with a mirror pair dropping out in very short succession: Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk failure on sdb6, disabling device. Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation continuing on 3 devices. Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk failure on sdc6, disabling device. Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation continuing on 2 devices. Jun 2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed. Manually zeroing. Unfortunately, those two devices that just dropped out are of course the mirrors of each other, causing filesystem corruption and shutdown in very short order. There are no other kernel messages from the same time, and given the timing (less than 90 ms apart) it would appear that this is a timeout of some kind and not an actual disk failure. Are there any tunables I can tweak, or do I have a $4000 paperweight? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 3:57 RAID-10 keeps aborting H. Peter Anvin @ 2013-06-03 4:05 ` H. Peter Anvin 2013-06-03 5:47 ` Dan Williams 2013-06-11 21:50 ` Joe Lawrence 2 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 4:05 UTC (permalink / raw) To: linux-raid On 06/02/2013 08:57 PM, H. Peter Anvin wrote: > Hello, > > I have a brand new server with a RAID-10 array. The drives are a SAS > JBOD (mptsas) which I'm driving using Linux mdraid raid10. > > Unfortunately, although the server did burn-in fine, once put in > production I have so far had multiple cases (about once every 24 hours) > of the raid10 failing, with a mirror pair dropping out in very short > succession: > > Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk > failure on sdb6, disabling device. > Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation > continuing on 3 devices. > Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk > failure on sdc6, disabling device. > Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation > continuing on 2 devices. > Jun 2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed. > Manually zeroing. > For the record, this is kernel 3.8.13 (-100.fc17 from Fedora). -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 3:57 RAID-10 keeps aborting H. Peter Anvin 2013-06-03 4:05 ` H. Peter Anvin @ 2013-06-03 5:47 ` Dan Williams 2013-06-03 6:06 ` H. Peter Anvin 2013-06-11 21:50 ` Joe Lawrence 2 siblings, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-03 5:47 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-raid On Sun, Jun 2, 2013 at 8:57 PM, H. Peter Anvin <hpa@zytor.com> wrote: > Hello, > > I have a brand new server with a RAID-10 array. The drives are a SAS > JBOD (mptsas) which I'm driving using Linux mdraid raid10. > > Unfortunately, although the server did burn-in fine, once put in > production I have so far had multiple cases (about once every 24 hours) > of the raid10 failing, with a mirror pair dropping out in very short > succession: > > Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Disk > failure on sdb6, disabling device. > Jun 2 20:23:05 terminus kernel: [83595.614689] md/raid10:md4: Operation > continuing on 3 devices. > Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Disk > failure on sdc6, disabling device. > Jun 2 20:23:05 terminus kernel: [83595.703106] md/raid10:md4: Operation > continuing on 2 devices. > Jun 2 20:23:05 terminus kernel: [83595.789234] md4: WRITE SAME failed. > Manually zeroing. > > Unfortunately, those two devices that just dropped out are of course the > mirrors of each other, causing filesystem corruption and shutdown in > very short order. > > There are no other kernel messages from the same time, and given the > timing (less than 90 ms apart) it would appear that this is a timeout of > some kind and not an actual disk failure. Looks like the underlying devices just may not support write_same... if the device lies about support we don't find out about it until the first attempt fails and md drops the devices. > Are there any tunables I can tweak, or do I have a $4000 paperweight? One hack to prove this may be to explicitly disable write_same before the array is assembled: for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done If this works then maybe md needs to be tolerant of write_same failures since the block layer will simply retry with zeroes. -- Dan ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 5:47 ` Dan Williams @ 2013-06-03 6:06 ` H. Peter Anvin 2013-06-03 6:14 ` Dan Williams 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 6:06 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On 06/02/2013 10:47 PM, Dan Williams wrote: > > One hack to prove this may be to explicitly disable write_same before > the array is assembled: > > for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done > > If this works then maybe md needs to be tolerant of write_same > failures since the block layer will simply retry with zeroes. > Trying that (array is already assembled but is currently functional.) Let's hope it works. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 6:06 ` H. Peter Anvin @ 2013-06-03 6:14 ` Dan Williams 2013-06-03 6:30 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: Dan Williams @ 2013-06-03 6:14 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-raid On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote: > On 06/02/2013 10:47 PM, Dan Williams wrote: >> >> One hack to prove this may be to explicitly disable write_same before >> the array is assembled: >> >> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done >> >> If this works then maybe md needs to be tolerant of write_same >> failures since the block layer will simply retry with zeroes. >> > > Trying that (array is already assembled but is currently functional.) > Let's hope it works. > If I'm reading things correctly that may still result in failure since md will still pass the REQ_WRITE_SAME bios down to the devices and will receive BLKPREP_KILL for its trouble. md only notices that write same is disabled on underlying devices at assembly time. ^ permalink raw reply [flat|nested] 64+ messages in thread
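(For reference: the prep-stage kill Dan describes comes from the sd driver. Reconstructed from 3.8-era drivers/scsi/sd.c and trimmed, so treat it as a sketch rather than verbatim source -- once no_write_same is set on a device, any WRITE SAME request that still reaches it is rejected before being issued, and that rejection is what md sees as a write error on the member:

	static int sd_setup_write_same_cmnd(struct scsi_device *sdp, struct request *rq)
	{
		struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
		...
		if (sdkp->device->no_write_same)
			return BLKPREP_KILL;	/* request fails; md treats it as a write error */
		...
	}
)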
* Re: RAID-10 keeps aborting 2013-06-03 6:14 ` Dan Williams @ 2013-06-03 6:30 ` H. Peter Anvin 2013-06-03 14:39 ` H. Peter Anvin 2013-06-03 15:47 ` H. Peter Anvin 2 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 6:30 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid Ok, will muck with that tomorrow. Hopefully it doesn't die tonight. Dan Williams <dan.j.williams@gmail.com> wrote: >On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 06/02/2013 10:47 PM, Dan Williams wrote: >>> >>> One hack to prove this may be to explicitly disable write_same >before >>> the array is assembled: >>> >>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > >$i; done >>> >>> If this works then maybe md needs to be tolerant of write_same >>> failures since the block layer will simply retry with zeroes. >>> >> >> Trying that (array is already assembled but is currently functional.) >> Let's hope it works. >> > >If I'm reading things correctly that may still result in failure since >md will still pass the REQ_WRITE_SAME bios down to the devices and >will receive BLKPREP_KILL for its trouble. md only notices that >write same is disabled on underlying devices at assembly time. -- Sent from my mobile phone. Please excuse brevity and lack of formatting. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 6:14 ` Dan Williams 2013-06-03 6:30 ` H. Peter Anvin @ 2013-06-03 14:39 ` H. Peter Anvin 2013-06-11 16:47 ` Joe Lawrence 2013-06-03 15:47 ` H. Peter Anvin 2 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 14:39 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On 06/02/2013 11:14 PM, Dan Williams wrote: > On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 06/02/2013 10:47 PM, Dan Williams wrote: >>> >>> One hack to prove this may be to explicitly disable write_same before >>> the array is assembled: >>> >>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done >>> >>> If this works then maybe md needs to be tolerant of write_same >>> failures since the block layer will simply retry with zeroes. >>> >> >> Trying that (array is already assembled but is currently functional.) >> Let's hope it works. >> > > If I'm reading things correctly that may still result in failure since > md will still pass the REQ_WRITE_SAME bios down to the devices and > will receive BLKPREP_KILL for its trouble. md only notices that > write same is disabled on underlying devices at assembly time. > Hmmm... that means getting dracut/udev to supply this little mod, unless it can be fed as a kernel command-line option somehow. Digging... -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 14:39 ` H. Peter Anvin @ 2013-06-11 16:47 ` Joe Lawrence 2013-06-11 17:12 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: Joe Lawrence @ 2013-06-11 16:47 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Dan Williams, linux-raid On Mon, 03 Jun 2013 07:39:46 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 06/02/2013 11:14 PM, Dan Williams wrote: > > On Sun, Jun 2, 2013 at 11:06 PM, H. Peter Anvin <hpa@zytor.com> wrote: > >> On 06/02/2013 10:47 PM, Dan Williams wrote: > >>> > >>> One hack to prove this may be to explicitly disable write_same before > >>> the array is assembled: > >>> > >>> for i in /sys/class/scsi_disk/*/max_write_same_blocks; do echo 0 > $i; done > >>> > >>> If this works then maybe md needs to be tolerant of write_same > >>> failures since the block layer will simply retry with zeroes. > >>> > >> > >> Trying that (array is already assembled but is currently functional.) > >> Let's hope it works. > >> > > > > If I'm reading things correctly that may still result in failure since > > md will still pass the REQ_WRITE_SAME bios down to the devices and > > will receive BLKPREP_KILL for its trouble. md only notices that > > write same is disabled on underlying devices at assembly time. > > > > Hmmm... that means getting dracut/udev to supply this little mod, unless > it can be fed as a kernel command-line option somehow. Digging... > > -hpa You've probably worked around this by now, but Dan's suggestion can be tweaked if you are willing to fail/remove/re-add one of the disks. I just verified the following: /sys/block/md125/queue/write_same_max_bytes: 524288 % mdadm --fail /dev/md125 /dev/sds1 % mdadm --remove /dev/md125 /dev/sds1 % echo 0 > /sys/class/scsi_disk/2\:0\:2\:0/max_write_same_blocks % mdadm --add /dev/md125 /dev/sds1 /sys/block/md125/queue/write_same_max_bytes: 0 That gets raid10.c to invoke disk_stack_limits (via raid10_add_disk) which will recalculate write_same_max_bytes for the MD. Regards, -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-11 16:47 ` Joe Lawrence @ 2013-06-11 17:12 ` H. Peter Anvin 0 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-11 17:12 UTC (permalink / raw) To: Joe Lawrence; +Cc: Dan Williams, linux-raid On 06/11/2013 09:47 AM, Joe Lawrence wrote: > > You've probably worked around this by now, but Dan's suggestion can be > tweaked if you are willing to fail/remove/re-add one of the disks. I > just verified the following: > > /sys/block/md125/queue/write_same_max_bytes: 524288 > > % mdadm --fail /dev/md125 /dev/sds1 > % mdadm --remove /dev/md125 /dev/sds1 > % echo 0 > /sys/class/scsi_disk/2\:0\:2\:0/max_write_same_blocks > % mdadm --add /dev/md125 /dev/sds1 > > /sys/block/md125/queue/write_same_max_bytes: 0 > > That gets raid10.c to invoke disk_stack_limits (via raid10_add_disk) > which will recalculate write_same_max_bytes for the MD. > No thanks. I just hacked the kernel directly. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 6:14 ` Dan Williams 2013-06-03 6:30 ` H. Peter Anvin 2013-06-03 14:39 ` H. Peter Anvin @ 2013-06-03 15:47 ` H. Peter Anvin 2013-06-03 16:09 ` Joe Lawrence 2013-06-03 17:22 ` Dan Williams 2 siblings, 2 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 15:47 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On 06/02/2013 11:14 PM, Dan Williams wrote: > > If I'm reading things correctly that may still result in failure since > md will still pass the REQ_WRITE_SAME bios down to the devices and > will receive BLKPREP_KILL for its trouble. md only notices that > write same is disabled on underlying devices at assembly time. > I have to admit to not seeing where md (as opposed to dm) even checks whether the underlying devices have write same enabled. It seems extremely likely that it is write same that is causing the headaches, though. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 15:47 ` H. Peter Anvin @ 2013-06-03 16:09 ` Joe Lawrence 2013-06-03 17:22 ` Dan Williams 1 sibling, 0 replies; 64+ messages in thread From: Joe Lawrence @ 2013-06-03 16:09 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Dan Williams, linux-raid On Mon, 3 Jun 2013, H. Peter Anvin wrote: > On 06/02/2013 11:14 PM, Dan Williams wrote: > > > > If I'm reading things correctly that may still result in failure since > > md will still pass the REQ_WRITE_SAME bios down to the devices and > > will receive BLKPREP_KILL for its trouble. md only notices that > > write same is disabled on underlying devices at assembly time. > > > > I have to admit to not seeing where md (as opposed to dm) even checks > whether the underlying devices have write same enabled. It seems extremely > likely that it is write same that is causing the headaches, though. I'll try to take a look later, but for now maybe these threads could help you? http://thread.gmane.org/gmane.linux.raid/41035 http://thread.gmane.org/gmane.linux.raid/41078 It might be that after commit 4363ac7 "block: Implement support for WRITE SAME", the max_write_same_sectors for the MD is set to the minimum of its component disks (at least for RAID1). Not sure exactly how RAID10 treats it, but maybe there are enough clues in that first thread to figure it out. When I have some more time I can help investigate. Regards, -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 15:47 ` H. Peter Anvin 2013-06-03 16:09 ` Joe Lawrence @ 2013-06-03 17:22 ` Dan Williams 2013-06-03 17:40 ` H. Peter Anvin 1 sibling, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-03 17:22 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-raid On Mon, Jun 3, 2013 at 8:47 AM, H. Peter Anvin <hpa@zytor.com> wrote: > On 06/02/2013 11:14 PM, Dan Williams wrote: >> >> If I'm reading things correctly that may still result in failure since >> md will still pass the REQ_WRITE_SAME bios down to the devices and >> will receive BLKPREP_KILL for its trouble. md only notices that >> write same is disabled on underlying devices at assembly time. >> > > I have to admit to not seeing where md (as opposed to dm) even checks > whether the underlying devices have write same enabled. It seems extremely > likely that it is write same that is causing the headaches, though. > raid10 calls disk_stack_limits() after blk_queue_max_write_same_sectors(). ...and here is where scsi considers failures as non-fatal in sd_done(). I assume the REQ_QUIET is why there are no other kernel messages. case ILLEGAL_REQUEST: if (sshdr.asc == 0x10) /* DIX: Host detected corruption */ good_bytes = sd_completed_bytes(SCpnt); /* INVALID COMMAND OPCODE or INVALID FIELD IN CDB */ if (sshdr.asc == 0x20 || sshdr.asc == 0x24) { switch (op) { case UNMAP: sd_config_discard(sdkp, SD_LBP_DISABLE); break; case WRITE_SAME_16: case WRITE_SAME: if (unmap) sd_config_discard(sdkp, SD_LBP_DISABLE); else { sdkp->device->no_write_same = 1; sd_config_write_same(sdkp); good_bytes = 0; req->__data_len = blk_rq_bytes(req); req->cmd_flags |= REQ_QUIET; } } } break; ^ permalink raw reply [flat|nested] 64+ messages in thread
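(For reference: the stacking Dan mentions happens in blk_stack_limits(), which disk_stack_limits() wraps. From 3.8-era block/blk-settings.c, reconstructed and trimmed, so a sketch rather than verbatim source:

	t->max_write_same_sectors = min(t->max_write_same_sectors,
					b->max_write_same_sectors);

i.e. the md queue's limit becomes the minimum across its members, but it is only recomputed when the limits are restacked -- at assembly, or when a member is added -- which is why poking the sysfs knobs on a running array is not enough, and why Joe's fail/remove/re-add trick above works.)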
* Re: RAID-10 keeps aborting 2013-06-03 17:22 ` Dan Williams @ 2013-06-03 17:40 ` H. Peter Anvin 2013-06-03 18:35 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 17:40 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid On 06/03/2013 10:22 AM, Dan Williams wrote: > On Mon, Jun 3, 2013 at 8:47 AM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 06/02/2013 11:14 PM, Dan Williams wrote: >>> >>> If I'm reading things correctly that may still result in failure since >>> md will still pass the REQ_WRITE_SAME bios down to the devices and >>> will receive BLKPREP_KILL for its trouble. md only notices that >>> write same is disabled on underlying devices at assembly time. >>> >> >> I have to admit to not seeing where md (as opposed to dm) even checks >> whether the underlying devices have write same enabled. It seems extremely >> likely that it is write same that is causing the headaches, though. >> > > raid10 calls disk_stack_limits() after blk_queue_max_write_same_sectors(). > OK, I see it now. I wonder if changing blk_queue_max_write_same_sectors() to zero in the kernel sources would do the trick here... might be easier than making udev/dracut do the right thing... :-/ > ...and here is where scsi considers failures as non-fatal in > sd_done(). I assume the REQ_QUIET is why there are no other kernel > messages. > case WRITE_SAME_16: > case WRITE_SAME: > if (unmap) > sd_config_discard(sdkp, SD_LBP_DISABLE); > else { > sdkp->device->no_write_same = 1; > sd_config_write_same(sdkp); > > good_bytes = 0; > req->__data_len = blk_rq_bytes(req); > req->cmd_flags |= REQ_QUIET; > } > } > } > break; OK, so the device here says don't do this again, but fails the request anyway expecting the block device to pick up the slack. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
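(A sketch of the kind of source hack being contemplated, against 3.8's sd_config_write_same(); the placement is illustrative only, not a proposed upstream change:

	-	if (sdkp->max_ws_blocks == 0)
	-		sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS;
	+	/* hack: never advertise WRITE SAME, so md never issues it */
	+	sdkp->max_ws_blocks = 0;

With the limit forced to zero, write_same_max_bytes is zero for every sd disk, and the stacked md limit follows at the next assembly.)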
* Re: RAID-10 keeps aborting 2013-06-03 17:40 ` H. Peter Anvin @ 2013-06-03 18:35 ` Martin K. Petersen 2013-06-03 18:38 ` H. Peter Anvin ` (3 more replies) 0 siblings, 4 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-03 18:35 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> OK, so the device here says don't do this again, but fails the hpa> request anyway expecting the block device to pick up the slack. Yes, the block layer function will resort to writing out zeroes directly in this case. MD should not consider a rejected WRITE SAME a failure. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
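(The fallback Martin refers to is blkdev_issue_zeroout(); the 3.8-era version in block/blk-lib.c reads roughly as follows -- reconstructed from memory, so treat as a sketch -- and note that its pr_err() is exactly the "WRITE SAME failed. Manually zeroing." line from the original report:

	int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
				 sector_t nr_sects, gfp_t gfp_mask)
	{
		if (bdev_write_same(bdev)) {
			unsigned char bdn[BDEVNAME_SIZE];

			/* First try a single WRITE SAME of the zero page. */
			if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
						     ZERO_PAGE(0)))
				return 0;

			bdevname(bdev, bdn);
			pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
		}

		/* Fall back to submitting ordinary zero-filled write bios. */
		return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
	}

So the retry happens above md; the bug is that md fails its members before the error ever propagates back up to this function.)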
* Re: RAID-10 keeps aborting 2013-06-03 18:35 ` Martin K. Petersen @ 2013-06-03 18:38 ` H. Peter Anvin 2013-06-03 18:40 ` H. Peter Anvin ` (2 subsequent siblings) 3 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 18:38 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/03/2013 11:35 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> OK, so the device here says don't do this again, but fails the > hpa> request anyway expecting the block device to pick up the slack. > > Yes, the block layer function will resort to writing out zeroes directly > in this case. > > MD should not consider a rejected WRITE SAME a failure. > Right. That seems to be the root of the problem here. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 18:35 ` Martin K. Petersen 2013-06-03 18:38 ` H. Peter Anvin @ 2013-06-03 18:40 ` H. Peter Anvin 2013-06-03 22:20 ` H. Peter Anvin 2013-06-03 23:19 ` H. Peter Anvin 2013-06-04 17:36 ` Dan Williams 3 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 18:40 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/03/2013 11:35 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> OK, so the device here says don't do this again, but fails the > hpa> request anyway expecting the block device to pick up the slack. > > Yes, the block layer function will resort to writing out zeroes directly > in this case. > > MD should not consider a rejected WRITE SAME a failure. > Presumably MD should do the same thing that SCSI does: disable write_same and kick the failure upstack? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
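(A minimal sketch of that proposal -- hypothetical code, the helper name and call site are invented for illustration; only blk_queue_max_write_same_sectors(), md_error(), bi_rw and REQ_WRITE_SAME are real 3.8 interfaces:

	/* in the raid1/raid10 member-bio error path: */
	static void member_write_error(struct mddev *mddev, struct md_rdev *rdev,
				       struct bio *bio)
	{
		if (bio->bi_rw & REQ_WRITE_SAME) {
			/* Capability problem, not a bad disk: stop advertising
			 * WRITE SAME and fail only this bio upstack, so that
			 * blkdev_issue_zeroout() retries with plain zeroes. */
			blk_queue_max_write_same_sectors(mddev->queue, 0);
			return;
		}
		md_error(mddev, rdev);	/* a genuine write error still fails the member */
	}

The open question, picked up later in the thread, is what to do when one mirror accepted the WRITE SAME and the other rejected it.)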
* Re: RAID-10 keeps aborting 2013-06-03 18:40 ` H. Peter Anvin @ 2013-06-03 22:20 ` H. Peter Anvin 2013-06-03 22:34 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 22:20 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/03/2013 11:40 AM, H. Peter Anvin wrote: > On 06/03/2013 11:35 AM, Martin K. Petersen wrote: >>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> >> hpa> OK, so the device here says don't do this again, but fails the >> hpa> request anyway expecting the block device to pick up the slack. >> >> Yes, the block layer function will resort to writing out zeroes directly >> in this case. >> >> MD should not consider a rejected WRITE SAME a failure. >> > > Presumably MD should do the same thing that SCSI does: disable > write_same and kick the failure upstack? > Also, given the seriousness of the failure mode, is this something that should be addressed in stable? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 22:20 ` H. Peter Anvin @ 2013-06-03 22:34 ` H. Peter Anvin 2013-06-04 15:56 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 22:34 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/03/2013 03:20 PM, H. Peter Anvin wrote: > > Also, given the seriousness of the failure mode, is this something that > should be addressed in stable? > Note that the filesystem is not mounted with the "discard" option, and given the rarity of the aborts, whatever causes the WRITE SAME bio to be generated must be a very rare event. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 22:34 ` H. Peter Anvin @ 2013-06-04 15:56 ` Martin K. Petersen 0 siblings, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-04 15:56 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> Note that the filesystem is not mounted with the "discard" option, hpa> and given the rarity of the aborts, whatever causes the WRITE SAME hpa> bio to be generated must be a very rare event. It's probably not discard. I'm guessing it's something that needs to zero out a block range. Not something that happens very often. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
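(One concrete non-discard source, assuming the filesystem in question is ext4 -- an assumption, not something stated in the thread: zeroing the unwritten remainder of a split extent goes through sb_issue_zeroout(). From 3.8-era fs/ext4/extents.c, reconstructed and trimmed:

	static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
	{
		ext4_fsblk_t ee_pblock;
		unsigned int ee_len;
		int ret;

		ee_len    = ext4_ext_get_actual_len(ex);
		ee_pblock = ext4_ext_pblock(ex);

		/* sb_issue_zeroout() -> blkdev_issue_zeroout() -> WRITE SAME
		 * whenever the queue advertises write_same_max_bytes > 0. */
		ret = sb_issue_zeroout(inode->i_sb, ee_pblock, ee_len, GFP_NOFS);
		if (ret > 0)
			ret = 0;

		return ret;
	}

That would fit the observed pattern: rare, write-triggered, and independent of the "discard" mount option.)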
* Re: RAID-10 keeps aborting 2013-06-03 18:35 ` Martin K. Petersen 2013-06-03 18:38 ` H. Peter Anvin 2013-06-03 18:40 ` H. Peter Anvin @ 2013-06-03 23:19 ` H. Peter Anvin 2013-06-04 15:39 ` Joe Lawrence 2013-06-04 17:36 ` Dan Williams 3 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-03 23:19 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid, Joe Lawrence On 06/03/2013 11:35 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> OK, so the device here says don't do this again, but fails the > hpa> request anyway expecting the block device to pick up the slack. > > Yes, the block layer function will resort to writing out zeroes directly > in this case. > > MD should not consider a rejected WRITE SAME a failure. > We should probably add Joe Lawrence to this thread. Joe: basically it seems that the error behavior of md (at least raid10, but probably raid1 as well) on WRITE SAME is wrong, and it causes the RAID to abort. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 23:19 ` H. Peter Anvin @ 2013-06-04 15:39 ` Joe Lawrence 2013-06-04 15:46 ` H. Peter Anvin 2013-06-05 10:02 ` Bernd Schubert 0 siblings, 2 replies; 64+ messages in thread From: Joe Lawrence @ 2013-06-04 15:39 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid, Joe Lawrence On Mon, 3 Jun 2013, H. Peter Anvin wrote: > On 06/03/2013 11:35 AM, Martin K. Petersen wrote: > >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > > > hpa> OK, so the device here says don't do this again, but fails the > > hpa> request anyway expecting the block device to pick up the slack. > > > > Yes, the block layer function will resort to writing out zeroes directly > > in this case. > > > > MD should not consider a rejected WRITE SAME a failure. > > > > We should probably add Joe Lawrence to this thread. > > Joe: basically it seems that the error behavior of md (at least raid10, > but probably raid1 as well) on WRITE SAME is wrong, and it causes the > RAID to abort. Martin is probably the expert here (I had extended his initial WRITE SAME support in MD raid0 to raid1 and raid10), but I can try failing a WS cmd using our SANBlaze emulator to see the fallout. Just curious, what type of drives were in your RAID and what does /sys/class/scsi_disk/*/max_write_same_blocks report? If you have a spare drive to test, maybe you could try a quick sg_write_same command to see how the drive reacts? -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 15:39 ` Joe Lawrence @ 2013-06-04 15:46 ` H. Peter Anvin 2013-06-04 15:54 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-04 15:46 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, Dan Williams, linux-raid On 06/04/2013 08:39 AM, Joe Lawrence wrote: >> >> We should probably add Joe Lawrence to this thread. >> >> Joe: basically it seems that the error behavior of md (at least raid10, >> but probably raid1 as well) on WRITE SAME is wrong, and it causes the >> RAID to abort. > > Martin is probably the expert here (I had extended his initial WRITE SAME > support in MD raid0 to raid1 and raid10), but I can try failing a WS cmd > using our SANBlaze emulator to see the fallout. > > Just curious, what type of drives were in your RAID and what does > /sys/class/scsi_disk/*/max_write_same_blocks report? If you have a spare > drive to test, maybe you could try a quick sg_write_same command to see > how the drive reacts? > The drives are SATA drives connected via mptsas. max_write_same_blocks shows 65535. Unfortunately the problems are rare enough that it didn't pop up until the server was put in production, so I would like to avoid experimenting on it as much as possible. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 15:46 ` H. Peter Anvin @ 2013-06-04 15:54 ` Martin K. Petersen 0 siblings, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-04 15:54 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> Unfortunately the problems are rare enough that it didn't pop up hpa> until the server was put in production, so I would like to avoid hpa> experimenting on it as much as possible. Yeah, and this is trivial to reproduce. I will take a look if Joe doesn't beat me to it... -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 15:39 ` Joe Lawrence 2013-06-04 15:46 ` H. Peter Anvin @ 2013-06-05 10:02 ` Bernd Schubert 2013-06-05 11:38 ` Bernd Schubert 1 sibling, 1 reply; 64+ messages in thread From: Bernd Schubert @ 2013-06-05 10:02 UTC (permalink / raw) To: Joe Lawrence; +Cc: H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid On 06/04/2013 05:39 PM, Joe Lawrence wrote: > > Just curious, what type of drives were in your RAID and what does > /sys/class/scsi_disk/*/max_write_same_blocks report? If you have a spare > drive to test, maybe you could try a quick sg_write_same command to see > how the drive reacts? > I just ran into the same issue with an ancient system from 2006. Except that I'm in a hurry and need it to stress-test my own work, I can do anything with it - it is booted via NFS and disks are only used for development/testing. > (squeeze)fslab1:~# cat /sys/block/md126/queue/write_same_max_bytes > 16384 > (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes > 0 > 0 > 0 > 0 Ah, now I found the reason why it fails: the scsi layer had set write_same_max_bytes to zero when it detected that the device does not support it, but after reloading the Areca module (arcmsr) I now get: > (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes > 33553920 > 33553920 > 33553920 > 33553920 Now for example > [11:0:1:2] disk Hitachi HDS724040KLSA80 R001 /dev/sdl /dev/sg11 > (squeeze)fslab1:~# sg_write_same --num=100 /dev/sg11 > Write same(10) command not supported > (squeeze)fslab1:~# sg_write_same --16 --num=100 /dev/sg11 > Write same(16) command not supported Cheers, Bernd PS: This is the 2nd time I have run into this; on Sunday I had a similar issue at home with Ubuntu's 3.8 kernel, but somehow not with vanilla 3.9. I need to recheck the logs in the evening to see if it is really the same issue. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-05 10:02 ` Bernd Schubert @ 2013-06-05 11:38 ` Bernd Schubert 2013-06-05 12:53 ` [PATCH] scsi: Check if the device supports WRITE_SAME_10 Bernd Schubert 2013-06-05 19:11 ` RAID-10 keeps aborting Martin K. Petersen 1 sibling, 2 replies; 64+ messages in thread From: Bernd Schubert @ 2013-06-05 11:38 UTC (permalink / raw) Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid, linux-scsi On 06/05/2013 12:02 PM, Bernd Schubert wrote: > On 06/04/2013 05:39 PM, Joe Lawrence wrote: >> >> Just curious, what type of drives were in your RAID and what does >> /sys/class/scsi_disk/*/max_write_same_blocks report? If you have a spare >> drive to test, maybe you could try a quick sg_write_same command to see >> how the drive reacts? >> > > I just ran into the same issue with an ancient system from 2006. Except > that I'm in a hurry and need it to stress-test my own work, I can do > anything with it - it is booted via NFS and disks are only used for > development/testing. > >> (squeeze)fslab1:~# cat /sys/block/md126/queue/write_same_max_bytes >> 16384 > >> (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes >> 0 >> 0 >> 0 >> 0 > > > Ah, now I found the reason why it fails: the scsi layer had set > write_same_max_bytes to zero when it detected that the device does not support > it, but after reloading the Areca module (arcmsr) I now get: > >> (squeeze)fslab1:~# cat /sys/block/sd[o,n,m,l]/queue/write_same_max_bytes >> 33553920 >> 33553920 >> 33553920 >> 33553920 In sd_config_write_same() it sets if (sdkp->max_ws_blocks == 0) sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS; except when sdkp->device->no_write_same is set. But only ata_scsi_sdev_config() sets that. And I also don't see any driver setting max_ws_blocks, so everything except libata gets the default of SD_MAX_WS10_BLOCKS. This also seems to be consistent with the 33553920 I see, except that there is a bit shift somewhere. So no surprise that mptsas and arcmsr (and anything else) devices have write_same_max_bytes set. As the correct handling in the md layer seems to be difficult, can we send a fake request at device configuration time to figure out if the device really supports write same? ^ permalink raw reply [flat|nested] 64+ messages in thread
* [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-05 11:38 ` Bernd Schubert @ 2013-06-05 12:53 ` Bernd Schubert 2013-06-05 19:14 ` Martin K. Petersen 1 sibling, 1 reply; 64+ messages in thread From: Bernd Schubert @ 2013-06-05 12:53 UTC (permalink / raw) Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid, linux-scsi Here's a rather simple patch for the scsi midlayer. checkpatch.pl complains about a style issue, but I just did it like the other lines there. > schubert@squeeze@fsdevel3 linux-stable>scripts/checkpatch.pl patches-linux-3.9.y/ws10 > ERROR: spaces prohibited around that ':' (ctx:WxW) > #48: FILE: drivers/scsi/sd.h:87: > + unsigned ws10 : 1; > ^ If wanted, I can send another patch to fix the other lines first. scsi: Check if the device supports WRITE_SAME_10 From: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> The md layer currently cannot handle failed WRITE_SAME commands and the initial easiest fix is to check if the device supports WRITE_SAME at all. It already tested for WRITE_SAME_16 and this commit adds a test for WRITE_SAME_10. Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> --- drivers/scsi/sd.c | 6 +++++- drivers/scsi/sd.h | 1 + 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 82910cc..368f0a4 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -742,7 +742,7 @@ static void sd_config_write_same(struct scsi_disk *sdkp) unsigned int logical_block_size = sdkp->device->sector_size; unsigned int blocks = 0; - if (sdkp->device->no_write_same) { + if (sdkp->device->no_write_same || !(sdkp->ws10 || sdkp->ws16)) { sdkp->max_ws_blocks = 0; goto out; } @@ -2648,6 +2648,10 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp) static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer) { if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE, + WRITE_SAME)) + sdkp->ws10 = 1; + + if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE, WRITE_SAME_16)) sdkp->ws16 = 1; } diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 2386aeb..7a049de 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -84,6 +84,7 @@ struct scsi_disk { unsigned lbpws : 1; unsigned lbpws10 : 1; unsigned lbpvpd : 1; + unsigned ws10 : 1; unsigned ws16 : 1; }; #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev) ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-05 12:53 ` [PATCH] scsi: Check if the device supports WRITE_SAME_10 Bernd Schubert @ 2013-06-05 19:14 ` Martin K. Petersen 2013-06-05 20:09 ` Bernd Schubert 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-05 19:14 UTC (permalink / raw) To: Bernd Schubert Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid, linux-scsi >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: Bernd> The md layer currently cannot handle failed WRITE_SAME commands Bernd> and the initial easiest fix is to check if the device supports Bernd> WRITE_SAME at all. It already tested for WRITE_SAME_16 and this Bernd> commit adds a test for WRITE_SAME_10. No go. That'll disable WRITE SAME for drives which don't support RSOC. Which means almost all of them. I propose the following... -- Martin K. Petersen Oracle Linux Engineering [PATCH] sd: Update WRITE SAME heuristics SATA drives located behind a SAS controller would incorrectly receive WRITE SAME commands. Tweak the heuristics so that: - If REPORT SUPPORTED OPERATION CODES is provided we will use that to choose between WRITE SAME(16), WRITE SAME(10) and disabled. This also fixes an issue with the old code which would issue WRITE SAME(10) despite the command not being whitelisted in REPORT SUPPORTED OPERATION CODES. - If REPORT SUPPORTED OPERATION CODES is not provided we will fall back to WRITE SAME(10) unless the device has an ATA Information VPD page. The assumption is that a SATL which is smart enough to implement WRITE SAME would also provide REPORT SUPPORTED OPERATION CODES. To facilitate the new heuristics scsi_report_opcode() has been modified so we can distinguish between "operation not supported" and "RSOC not supported". Reported-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index 2c0d0ec..3b1ea34 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -1070,8 +1070,8 @@ EXPORT_SYMBOL_GPL(scsi_get_vpd_page); * @opcode: opcode for command to look up * * Uses the REPORT SUPPORTED OPERATION CODES to look up the given - * opcode. Returns 0 if RSOC fails or if the command opcode is - * unsupported. Returns 1 if the device claims to support the command. + * opcode. Returns -EINVAL if RSOC fails, 0 if the command opcode is + * unsupported and 1 if the device claims to support the command. 
*/ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, unsigned int len, unsigned char opcode) @@ -1081,7 +1081,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, int result; if (sdev->no_report_opcodes || sdev->scsi_level < SCSI_SPC_3) - return 0; + return -EINVAL; memset(cmd, 0, 16); cmd[0] = MAINTENANCE_IN; @@ -1097,7 +1097,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, if (result && scsi_sense_valid(&sshdr) && sshdr.sense_key == ILLEGAL_REQUEST && (sshdr.asc == 0x20 || sshdr.asc == 0x24) && sshdr.ascq == 0x00) - return 0; + return -EINVAL; if ((buffer[1] & 3) == 3) /* Command supported */ return 1; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index a37eda9..366b661 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -442,8 +442,15 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr, if (max == 0) sdp->no_write_same = 1; - else if (max <= SD_MAX_WS16_BLOCKS) - sdkp->max_ws_blocks = max; + else + sdp->no_write_same = 0; + + if (sdkp->ws16) + sdkp->max_ws_blocks = + max_t(unsigned long, max, SD_MAX_WS16_BLOCKS); + else + sdkp->max_ws_blocks = + max_t(unsigned long, max, SD_MAX_WS10_BLOCKS); sd_config_write_same(sdkp); @@ -762,16 +769,16 @@ static void sd_config_write_same(struct scsi_disk *sdkp) * blocks per I/O unless the device explicitly advertises a * bigger limit. */ - if (sdkp->max_ws_blocks == 0) - sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS; - - if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) + if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) blocks = min_not_zero(sdkp->max_ws_blocks, (u32)SD_MAX_WS16_BLOCKS); else blocks = min_not_zero(sdkp->max_ws_blocks, (u32)SD_MAX_WS10_BLOCKS); + if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes) + sdkp->max_ws_blocks = blocks; + out: blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9)); } @@ -2645,9 +2652,24 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp) static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer) { - if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE, - WRITE_SAME_16)) + struct scsi_device *sdev = sdkp->device; + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) { + sdev->no_report_opcodes = 1; + + /* Disable WRITE SAME if REPORT SUPPORTED OPERATION + * CODES is unsupported and the device has an ATA + * Information VPD page (SAT). + */ + if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE)) + sdev->no_write_same = 1; + } + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1) sdkp->ws16 = 1; + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME) == 1) + sdkp->ws10 = 1; } static int sd_try_extended_inquiry(struct scsi_device *sdp) diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 2386aeb..7a049de 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -84,6 +84,7 @@ struct scsi_disk { unsigned lbpws : 1; unsigned lbpws10 : 1; unsigned lbpvpd : 1; + unsigned ws10 : 1; unsigned ws16 : 1; }; #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev) ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-05 19:14 ` Martin K. Petersen @ 2013-06-05 20:09 ` Bernd Schubert 2013-06-07 2:15 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: Bernd Schubert @ 2013-06-05 20:09 UTC (permalink / raw) To: Martin K. Petersen Cc: Joe Lawrence, H. Peter Anvin, Dan Williams, linux-raid, linux-scsi On 06/05/2013 09:14 PM, Martin K. Petersen wrote: >>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: > > Bernd> The md layer currently cannot handle failed WRITE_SAME commands > Bernd> and the initial easiest fix is to check if the device supports > Bernd> WRITE_SAME at all. It already tested for WRITE_SAME_16 and this > Bernd> commit adds a test for WRITE_SAME_10. > > No go. That'll disable WRITE SAME for drives which don't support > RSOC. Which means almost all of them. Ah, sorry, I didn't check the specs. > > I propose the following... > > --- a/drivers/scsi/sd.c > +++ b/drivers/scsi/sd.c > @@ -442,8 +442,15 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr, > > if (max == 0) > sdp->no_write_same = 1; > - else if (max <= SD_MAX_WS16_BLOCKS) > - sdkp->max_ws_blocks = max; > + else > + sdp->no_write_same = 0; > + > + if (sdkp->ws16) > + sdkp->max_ws_blocks = > + max_t(unsigned long, max, SD_MAX_WS16_BLOCKS); > + else > + sdkp->max_ws_blocks = > + max_t(unsigned long, max, SD_MAX_WS10_BLOCKS); > > sd_config_write_same(sdkp); Max? Not min_t()? > @@ -762,16 +769,16 @@ static void sd_config_write_same(struct scsi_disk *sdkp) > * blocks per I/O unless the device explicitly advertises a > * bigger limit. > */ > - if (sdkp->max_ws_blocks == 0) > - sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS; > - > - if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) > + if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) > blocks = min_not_zero(sdkp->max_ws_blocks, > (u32)SD_MAX_WS16_BLOCKS); > else > blocks = min_not_zero(sdkp->max_ws_blocks, > (u32)SD_MAX_WS10_BLOCKS); > > + if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes) > + sdkp->max_ws_blocks = blocks; > + > out: > blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9)); > } blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks * (logical_block_size >> 9)) ? Otherwise sdkp->max_ws_blocks and the queue might have different values, wouldn't they? I can't provide a comment about scsi_get_vpd_page, I simply don't know. You certainly know the scsi specs way better than I do! Thanks, Bernd ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-05 20:09 ` Bernd Schubert @ 2013-06-07 2:15 ` Martin K. Petersen 2013-06-12 19:34 ` Bernd Schubert 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-07 2:15 UTC (permalink / raw) To: Bernd Schubert Cc: Martin K. Petersen, Joe Lawrence, H. Peter Anvin, Dan Williams, linux-raid, linux-scsi >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: >> max_t(unsigned long, max, SD_MAX_WS10_BLOCKS); Bernd> Max? Not min_t()? Brain fart. Updated patch with a few other adjustments. I have tested this on a couple of JBODs with a mishmash of SATA and SAS drives, including a few specimens that report MAX WRITE SAME BLOCKS. -- Martin K. Petersen Oracle Linux Engineering [PATCH] sd: Update WRITE SAME heuristics SATA drives located behind a SAS controller would incorrectly receive WRITE SAME commands. Tweak the heuristics so that: - If REPORT SUPPORTED OPERATION CODES is provided we will use that to choose between WRITE SAME(16), WRITE SAME(10) and disabled. This also fixes an issue with the old code which would issue WRITE SAME(10) despite the command not being whitelisted in REPORT SUPPORTED OPERATION CODES. - If REPORT SUPPORTED OPERATION CODES is not provided we will fall back to WRITE SAME(10) unless the device has an ATA Information VPD page. The assumption is that a SATL which is smart enough to implement WRITE SAME would also provide REPORT SUPPORTED OPERATION CODES. To facilitate the new heuristics scsi_report_opcode() has been modified so we can distinguish between "operation not supported" and "RSOC not supported". Reported-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: <stable@vger.kernel.org> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index 2c0d0ec..3b1ea34 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -1070,8 +1070,8 @@ EXPORT_SYMBOL_GPL(scsi_get_vpd_page); * @opcode: opcode for command to look up * * Uses the REPORT SUPPORTED OPERATION CODES to look up the given - * opcode. Returns 0 if RSOC fails or if the command opcode is - * unsupported. Returns 1 if the device claims to support the command. + * opcode. Returns -EINVAL if RSOC fails, 0 if the command opcode is + * unsupported and 1 if the device claims to support the command. 
*/ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, unsigned int len, unsigned char opcode) @@ -1081,7 +1081,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, int result; if (sdev->no_report_opcodes || sdev->scsi_level < SCSI_SPC_3) - return 0; + return -EINVAL; memset(cmd, 0, 16); cmd[0] = MAINTENANCE_IN; @@ -1097,7 +1097,7 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, if (result && scsi_sense_valid(&sshdr) && sshdr.sense_key == ILLEGAL_REQUEST && (sshdr.asc == 0x20 || sshdr.asc == 0x24) && sshdr.ascq == 0x00) - return 0; + return -EINVAL; if ((buffer[1] & 3) == 3) /* Command supported */ return 1; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index a37eda9..420e763 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -442,8 +442,10 @@ sd_store_write_same_blocks(struct device *dev, struct device_attribute *attr, if (max == 0) sdp->no_write_same = 1; - else if (max <= SD_MAX_WS16_BLOCKS) + else if (max <= SD_MAX_WS16_BLOCKS) { + sdp->no_write_same = 0; sdkp->max_ws_blocks = max; + } sd_config_write_same(sdkp); @@ -750,7 +752,6 @@ static void sd_config_write_same(struct scsi_disk *sdkp) { struct request_queue *q = sdkp->disk->queue; unsigned int logical_block_size = sdkp->device->sector_size; - unsigned int blocks = 0; if (sdkp->device->no_write_same) { sdkp->max_ws_blocks = 0; @@ -762,18 +763,20 @@ static void sd_config_write_same(struct scsi_disk *sdkp) * blocks per I/O unless the device explicitly advertises a * bigger limit. */ - if (sdkp->max_ws_blocks == 0) - sdkp->max_ws_blocks = SD_MAX_WS10_BLOCKS; - - if (sdkp->ws16 || sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) - blocks = min_not_zero(sdkp->max_ws_blocks, - (u32)SD_MAX_WS16_BLOCKS); - else - blocks = min_not_zero(sdkp->max_ws_blocks, - (u32)SD_MAX_WS10_BLOCKS); + if (sdkp->max_ws_blocks > SD_MAX_WS10_BLOCKS) + sdkp->max_ws_blocks = min_not_zero(sdkp->max_ws_blocks, + (u32)SD_MAX_WS16_BLOCKS); + else if (sdkp->ws16 || sdkp->ws10 || sdkp->device->no_report_opcodes) + sdkp->max_ws_blocks = min_not_zero(sdkp->max_ws_blocks, + (u32)SD_MAX_WS10_BLOCKS); + else { + sdkp->device->no_write_same = 1; + sdkp->max_ws_blocks = 0; + } out: - blk_queue_max_write_same_sectors(q, blocks * (logical_block_size >> 9)); + blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks * + (logical_block_size >> 9)); } /** @@ -2645,9 +2648,24 @@ static void sd_read_block_provisioning(struct scsi_disk *sdkp) static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer) { - if (scsi_report_opcode(sdkp->device, buffer, SD_BUF_SIZE, - WRITE_SAME_16)) + struct scsi_device *sdev = sdkp->device; + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) { + sdev->no_report_opcodes = 1; + + /* Disable WRITE SAME if REPORT SUPPORTED OPERATION + * CODES is unsupported and the device has an ATA + * Information VPD page (SAT). 
+ */ + if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE)) + sdev->no_write_same = 1; + } + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1) sdkp->ws16 = 1; + + if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME) == 1) + sdkp->ws10 = 1; } static int sd_try_extended_inquiry(struct scsi_device *sdp) diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 2386aeb..7a049de 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -84,6 +84,7 @@ struct scsi_disk { unsigned lbpws : 1; unsigned lbpws10 : 1; unsigned lbpvpd : 1; + unsigned ws10 : 1; unsigned ws16 : 1; }; #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev) ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] scsi: Check if the device supports WRITE_SAME_10 2013-06-07 2:15 ` Martin K. Petersen @ 2013-06-12 19:34 ` Bernd Schubert 0 siblings, 0 replies; 64+ messages in thread From: Bernd Schubert @ 2013-06-12 19:34 UTC (permalink / raw) To: Martin K. Petersen Cc: Joe Lawrence, H. Peter Anvin, Dan Williams, linux-raid, linux-scsi On 06/07/2013 04:15 AM, Martin K. Petersen wrote: >>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: > >>> max_t(unsigned long, max, SD_MAX_WS10_BLOCKS); > > Bernd> Max? Not min_t()? > > Brain fart. Updated patch with a few other adjustments. > > I have tested this on a couple of JBODs with a mishmash of SATA and SAS > drives, including a few specimens that report MAX WRITE SAME BLOCKS. > Thanks for the update! I've been at work far too long again, but I managed to test it now and it works fine with the ancient Areca controller (ARC-1260) in this test lab. So you may add Tested-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> and Reviewed-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> (although the latter probably does not count much for linux-scsi). Cheers, Bernd ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-05 11:38 ` Bernd Schubert 2013-06-05 12:53 ` [PATCH] scsi: Check if the device supports WRITE_SAME_10 Bernd Schubert @ 2013-06-05 19:11 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-05 19:11 UTC (permalink / raw) To: Bernd Schubert Cc: Joe Lawrence, H. Peter Anvin, Martin K. Petersen, Dan Williams, linux-raid, linux-scsi >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: Bernd> And I also don't see any driver setting max_ws_blocks, so Bernd> everything except libata gets the default of Bernd> SD_MAX_WS10_BLOCKS. Yes. That's intentional. Unless the device provides MAXIMUM WRITE SAME BLOCKS in the BLOCK LIMITS VPD. Bernd> As the correct handling in the md layer seems to be difficult, Bernd> can we send a fake request at device configuration time to figure Bernd> out if the device really supports write same? The problem is that WRITE SAME is destructive. And unfortunately a block count of 0 means "write entire device". -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 18:35 ` Martin K. Petersen ` (2 preceding siblings ...) 2013-06-03 23:19 ` H. Peter Anvin @ 2013-06-04 17:36 ` Dan Williams 2013-06-04 17:54 ` Martin K. Petersen 3 siblings, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-04 17:36 UTC (permalink / raw) To: Martin K. Petersen; +Cc: H. Peter Anvin, linux-raid On Mon, Jun 3, 2013 at 11:35 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> OK, so the device here says don't do this again, but fails the > hpa> request anyway expecting the block device to pick up the slack. > > Yes, the block layer function will resort to writing out zeroes directly > in this case. > > MD should not consider a rejected WRITE SAME a failure. What should md do in the partial success case? Seems to be a layering violation to assume that the block layer will retry with zeroes. Maybe just act on BIO_QUIET to retry the write with REQ_WRITE_SAME cleared? -- Dan ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 17:36 ` Dan Williams @ 2013-06-04 17:54 ` Martin K. Petersen 2013-06-04 17:57 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-04 17:54 UTC (permalink / raw) To: Dan Williams; +Cc: Martin K. Petersen, H. Peter Anvin, linux-raid >>>>> "Dan" == Dan Williams <dan.j.williams@gmail.com> writes: >> MD should not consider a rejected WRITE SAME a failure. Dan> What should md do in the partial success case? What exactly do you mean when you say "partial success"? Either the device accepts the command or it doesn't... Dan> Maybe just act on BIO_QUIET to retry the write with REQ_WRITE_SAME Dan> cleared? No go. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 17:54 ` Martin K. Petersen @ 2013-06-04 17:57 ` H. Peter Anvin 2013-06-04 18:04 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-04 17:57 UTC (permalink / raw) To: Martin K. Petersen; +Cc: Dan Williams, linux-raid On 06/04/2013 10:54 AM, Martin K. Petersen wrote: >>>>>> "Dan" == Dan Williams <dan.j.williams@gmail.com> writes: > >>> MD should not consider a rejected WRITE SAME a failure. > > Dan> What should md do in the partial success case? > > What exactly do you mean when you say "partial success"? Either the > device accepts the command or it doesn't... > One subdevice accepts it and the other doesn't, presumably. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 17:57 ` H. Peter Anvin @ 2013-06-04 18:04 ` Martin K. Petersen 2013-06-04 18:32 ` Dan Williams 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-04 18:04 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> One subdevice accepts it and the other doesn't, presumably. Ah. Well fail the command and let the block layer deal with it. This is really no different from the discard case. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 18:04 ` Martin K. Petersen @ 2013-06-04 18:32 ` Dan Williams 2013-06-04 18:38 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-04 18:32 UTC (permalink / raw) To: Martin K. Petersen; +Cc: H. Peter Anvin, linux-raid On Tue, Jun 4, 2013 at 11:04 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> One subdevice accepts it and the other doesn't, presumably. > > Ah. Well fail the command and let the block layer deal with it. This is > really no different from the discard case. > Which md also does not handle if the device later returns "illegal request" to a discard command. My point about one device accepting the write and another device dropping it is that we now have an inconsistent array and a write command to complete. So I don't see how md can wait/trust that the upper layer will retry and fix things up? Translate and retry internally for these command types, return success to the original request, and disable future requests. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 18:32 ` Dan Williams @ 2013-06-04 18:38 ` H. Peter Anvin 2013-06-04 18:56 ` Dan Williams 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-04 18:38 UTC (permalink / raw) To: Dan Williams; +Cc: Martin K. Petersen, linux-raid On 06/04/2013 11:32 AM, Dan Williams wrote: > On Tue, Jun 4, 2013 at 11:04 AM, Martin K. Petersen > <martin.petersen@oracle.com> wrote: >>>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> >> hpa> One subdevice accepts it and the other doesn't, presumably. >> >> Ah. Well fail the command and let the block layer deal with it. This is >> really no different from the discard case. > > Which md also does not handle if the device later returns "illegal > request" to a discard command. My point about one device accepting > the write and another device dropping it is that we now have an > inconsistent array and a write command to complete. So I don't see > how md can wait/trust that the upper layer will retry and fix things > up? Translate and retry internally for these command types, return > success to the original request, and disable future requests. > Well, if that is what the block layer is defined to do, then that is what the block layer does. It makes sense from the point of view of a disk: there the block layer has to translate and retry, so if the block layer is defined to do that, why not rely on it? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-04 18:38 ` H. Peter Anvin @ 2013-06-04 18:56 ` Dan Williams 2013-06-05 2:39 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: Dan Williams @ 2013-06-04 18:56 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, linux-raid On Tue, Jun 4, 2013 at 11:38 AM, H. Peter Anvin <hpa@zytor.com> wrote: > Well, if that is what the block device layer is defined to do, then that > is what the block layer does. It makes sense from the point of view of > a disk, where the block layer has to translate and redo, so if the block > layer is defined to do that, why not rely on it? > I'm just hung up on when we can safely mark the array as not dirty. At a minimum this means raid needs an "I have an ignored-write-failure in-flight, awaiting retry from upper layer" state. ^ permalink raw reply [flat|nested] 64+ messages in thread
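A hypothetical sketch of the state Dan is describing -- none of these names exist in the tree; this is purely illustrative of the bookkeeping md would need:

    /* Hypothetical (not in mainline): remember that a WRITE SAME
     * failure was ignored and a block-layer retry is still expected,
     * so the array must not be marked clean in the interim. */
    #define MD_WS_RETRY_PENDING 5  /* hypothetical mddev->flags bit */

    static bool md_can_mark_clean(struct mddev *mddev)
    {
            /* The superblock clean transition would have to wait
             * until the upper layer has re-issued the zeroing
             * writes for the failed region. */
            return !test_bit(MD_WS_RETRY_PENDING, &mddev->flags);
    }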
* Re: RAID-10 keeps aborting 2013-06-04 18:56 ` Dan Williams @ 2013-06-05 2:39 ` H. Peter Anvin 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-05 2:39 UTC (permalink / raw) To: Dan Williams; +Cc: Martin K. Petersen, linux-raid On 06/04/2013 11:56 AM, Dan Williams wrote: > On Tue, Jun 4, 2013 at 11:38 AM, H. Peter Anvin <hpa@zytor.com> wrote: >> Well, if that is what the block device layer is defined to do, then that >> is what the block layer does. It makes sense from the point of view of >> a disk, where the block layer has to translate and redo, so if the block >> layer is defined to do that, why not rely on it? >> > > I'm just hung up on when we can safely mark the array as not dirty. > At a minimum this means raid needs an "I have an ignored-write-failure > in-flight, awaiting retry from upper layer" state. > Ah yes, if you rely on the block layer to retry for you, you don't see the beginnings and ends of the entire transaction, and at least ideally the RAID -- and the specific blocks -- should be marked dirty during that operation. The same applies to DISCARD, presumably. Yuck, this suddenly got complex. Perhaps WRITE SAME should simply be disabled on raid1/raid10 until this can be addressed? Do we need to do the same for DISCARD? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting @ 2013-06-07 2:19 ` Martin K. Petersen 2013-06-10 14:15 ` Joe Lawrence 0 siblings, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-07 2:19 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid >>>>> "Joe" == Joe Lawrence <joe.lawrence@stratus.com> writes: Joe> I'm looking at the changes in raid1.c and am confused as to why we Joe> did this in the first place (commit c8dc9c6): Joe> if (mddev->queue) Joe> blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); Joe> for RAID1, aren't chunk_sectors always going to be zero? (At least Joe> on my machine, /sys/block/md*/queue/write_same_max_bytes for all md Joe> RAID1 devices are set to 0.) This would have the effect of Joe> rendering bdev_write_same() always false for these MD devices. I'm guessing that's a copy and paste from my raid0 support. In the raid0 case I set it to prevent straddling discard/write same commands. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
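For reference, the raid0 code Martin is describing caps WRITE SAME (like DISCARD) at the chunk size so that a single command never straddles a chunk boundary. Roughly (a sketch along the lines of the raid0 support, not a verbatim quote of the commit):

    /* raid0 run(): neighbouring chunks live on different member
     * devices, so a single DISCARD or WRITE SAME must not cross a
     * chunk boundary.  Cap both limits at the chunk size. */
    blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
    blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);

Copied into raid1, where chunk_sectors is typically zero, the same call clamps write_same_max_bytes to 0 and silently disables WRITE SAME -- which is exactly the behaviour Joe observed on his RAID1 devices.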
* Re: RAID-10 keeps aborting 2013-06-07 2:19 ` Martin K. Petersen @ 2013-06-10 14:15 ` Joe Lawrence 2013-06-12 3:15 ` NeilBrown 0 siblings, 1 reply; 64+ messages in thread From: Joe Lawrence @ 2013-06-10 14:15 UTC (permalink / raw) To: Martin K. Petersen; +Cc: H. Peter Anvin, Dan Williams, linux-raid, Neil Brown [Cc: Neil -- a few questions if you want to skip to the bottom ] I looked a little more into the RAID1 case and tried a few things. First, to enable WRITE SAME on RAID1, I had to apply the patch at the bottom of this mail. With the patch in place, the component disk limits propagate up to their MD: [2:0:4:0] disk SANBLAZE VLUN0001 0001 /dev/sds [3:0:2:0] disk SEAGATE ST9146853SS 0004 /dev/sdv /sys/class/scsi_disk/2:0:4:0/max_write_same_blocks: 65535 /sys/class/scsi_disk/3:0:2:0/max_write_same_blocks: 65535 /sys/block/md125/queue/write_same_max_bytes: 33553920 I was interested in observing how a failed WRITE SAME would interact with MD, the intent bitmap, and resyncing. In my setup, I created a RAID1 out of a 4G partition from a SAS disk (supporting WRITE SAME) and a SanBlaze VirtualLUN (claiming WRITE SAME support, but returning [sense_key,asc,ascq]: [0x05,0x20,0x00] on the first such command). I also added an external write intent bitmap file with a chunk size of 4KB to create a large, granular bitmap: mdadm --create /dev/md125 --raid-devices=2 --level=1 \ --bitmap=/mnt/bitmap/file --bitmap-chunk=4K /dev/sds1 /dev/sdv1 After creating the RAID and letting the initial synchronization finish, I filled the entire MD with random data. I would use this later to verify resync using the write intent bitmap. From previous tests, I knew that the first failed WRITE SAME to the VirtualLUN would bounce that disk from the MD. That and subsequent WRITE SAME cmds would still complete fine on the member disk that actually supported the command. To kick off WRITE SAME commands, I added a new ext4 filesystem to the disk. When mounting (no special options), this executes the following call chain: ext4_lazyinit_thread ext4_init_inode_table sb_issue_zeroout blkdev_issue_zeroout blkdev_issue_write_same When the first WRITE SAME hits the VirtualLUN, MD kicks it from the RAID and degrades the array: EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null) sd 2:0:6:0: [sds] CDB: Write same(16): 93 00 00 00 00 00 00 00 21 10 00 00 0f f8 00 00 mpt2sas0: sas_address(0x500605b0006c0ae0), phy(3) mpt2sas0: handle(0x000b), ioc_status(scsi data underrun)(0x0045), smid(59) mpt2sas0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01) mpt2sas0: [sense_key,asc,ascq]: [0x05,0x20,0x00], count(96) md/raid1:md125: Disk failure on sds1, disabling device. md/raid1:md125: Operation continuing on 1 devices. The bitmap file then starts recording dirty chunks: Sync Size : 4192192 (4.00 GiB 4.29 GB) Bitmap : 1048048 bits (chunks), 16409 dirty (1.6%) The MD write_same_max_bytes are left at 33553920 until the VirtualLUN is failed/removed/re-added. After the WRITE SAME failure, the VirtualLUN's max_write_same_blocks has been set to 0. When re-added to the MD, this value is reconsidered in the MD's write_same_max_bytes, which also gets set to zero. This behavior seems okay, as the remaining good disk fully supported WRITE SAME while the RAID was degraded.
Once the non-supporting component disk was re-added to the RAID1, WRITE SAME was disabled for the MD: /sys/class/scsi_disk/2:0:4:0/max_write_same_blocks: 0 /sys/class/scsi_disk/3:0:2:0/max_write_same_blocks: 65535 /sys/block/md125/queue/write_same_max_bytes: 0 When the VirtualLUN was re-added to the RAID1, resync initiated. Recall that earlier I had dumped random bits on the entire MD device, so the state of the disks should have looked like this: SAS = init RAID sync + random bits + ext4 WRITE SAME 0's + ext4 misc VLUN = init RAID sync + random bits and resync would need to consult the bitmap to repair the VLUN chunks that WRITE SAME and whatever else ext4_lazyinit_thread laid down. By setting the bitmap chunksize so small, the idea was to spread the failed WRITE SAME across tracking bits. CDB WRITE SAME num blocks was 0x0FF8 (4088) and 4088 x 512 (block size) ~= 2MB (much greater than 4KB). With a systemtap probe, I saw 32 WRITE SAME commands (each about 4K blocks) emitted from the block layer via ext4_lazyinit_thread. So the estimated dirty bits for all 32 should be somewhere around: 32 * (2MB dirtied / 4KB per bit) = 16384 dirty bits, pretty close to the observed 16409 (the rest I assume were other ext4 housekeeping). At this point we know: - Failed WRITE SAME will kick disk from MD RAID1 - WRITE SAME is disabled if unsupported disk added to MD - Failed WRITE SAME is properly handled by bitmap, even when spanning bitmap bits. A few outstanding questions that I have; maybe Neil or someone more familiar with the code could answer. Q1 - Is mddev->chunk_sectors always zero for RAID1? Q2 - I noticed handle_write_finished calls narrow_write_error to try and potentially avoid failing an entire device. In my tests, narrow_write_error never succeeded as rdev->badblocks.shift = -1. I think this is part of the bad block list code Neil has been working on. I don't suppose this is the proper place for MD to reset write_same_max_bytes to disable future WRITE SAME and handle the individual writes here instead of in the block layer? Regards, -- Joe From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001 From: Joe Lawrence <joe.lawrence@stratus.com> Date: Fri, 7 Jun 2013 15:25:54 -0400 Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors MD RAID1 chunk_sectors will always be zero, unlike RAID0, so RAID1 does not need to worry about limiting the write same sectors in that regard. Let disk_stack_limits choose the minimum of the RAID1 components write same values. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> --- drivers/md/raid1.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index fd86b37..3dc9ad6 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -2821,9 +2821,6 @@ static int run(struct mddev *mddev) if (IS_ERR(conf)) return PTR_ERR(conf); - if (mddev->queue) - blk_queue_max_write_same_sectors(mddev->queue, - mddev->chunk_sectors); rdev_for_each(rdev, mddev) { if (!mddev->gendisk) continue; -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-10 14:15 ` Joe Lawrence @ 2013-06-12 3:15 ` NeilBrown 2013-06-12 4:07 ` H. Peter Anvin 2013-06-13 2:45 ` Joe Lawrence 0 siblings, 2 replies; 64+ messages in thread From: NeilBrown @ 2013-06-12 3:15 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 2465 bytes --] On Mon, 10 Jun 2013 10:15:05 -0400 Joe Lawrence <joe.lawrence@stratus.com> wrote: > A few outstanding questions that I have; maybe Neil or someone more > familiar with the code could answer. > > Q1 - Is mddev->chunk_sectors always zero for RAID1? Not always. But often. It is largely ignored. I think the only effect is to round the size of the device down to a multiple of the chunk size. > > Q2 - I noticed handle_write_finished calls narrow_write_error to try and > potentially avoid failing an entire device. In my tests, > narrow_write_error never succeeded as rdev->badblocks.shift = -1. Yes. You would need a newer mdadm (from my git tree) to get badblocks.shift to something else. > > I think this is part of the bad block list code Neil has been working > on. I don't suppose this is the proper place for MD to reset > write_same_max_bytes to disable future WRITE SAME and handle the > individual writes here instead of in the block layer? If a drive reports that WRITE SAME works, but it doesn't, then I'm not sure that I can be happy about working with that drive. If a drive has some quirky behaviour wrt WRITE SAME, then that should be handled in some place where 'quirks' are handled - certainly not in md. I've applied that patch below - thanks. NeilBrown > > Regards, > > -- Joe > > > From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001 > From: Joe Lawrence <joe.lawrence@stratus.com> > Date: Fri, 7 Jun 2013 15:25:54 -0400 > Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors > > MD RAID1 chunk_sectors will always be zero, unlike RAID0, so RAID1 does > not need to worry about limiting the write same sectors in that regard. > Let disk_stack_limits choose the minimum of the RAID1 components write > same values. > > Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> > --- > drivers/md/raid1.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index fd86b37..3dc9ad6 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -2821,9 +2821,6 @@ static int run(struct mddev *mddev) > if (IS_ERR(conf)) > return PTR_ERR(conf); > > - if (mddev->queue) > - blk_queue_max_write_same_sectors(mddev->queue, > - mddev->chunk_sectors); > rdev_for_each(rdev, mddev) { > if (!mddev->gendisk) > continue; [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 3:15 ` NeilBrown @ 2013-06-12 4:07 ` H. Peter Anvin 2013-06-12 6:29 ` Bernd Schubert 2013-06-12 14:25 ` Martin K. Petersen 1 sibling, 2 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 4:07 UTC (permalink / raw) To: NeilBrown; +Cc: Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid On 06/11/2013 08:15 PM, NeilBrown wrote: > If a drive reports that WRITE SAME works, but it doesn't, then I'm > not sure that I can be happy about working with that drive. Seriously... we have that kind of problem all over the place with all kinds of hardware. Falling back is sensible... the problem here is *where* that needs to happen... the block layer already does, apparently. > If a drive has some quirky behaviour wrt WRITE SAME, then that > should be handled in some place where 'quirks' are handled - > certainly not in md. The problem here is that you don't find out ahead of time. Now, if I understand the issue at hand correctly, the reporting here was actually a Linux bug related to SATA drives behind a SAS controller. Martin, am I right? -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 4:07 ` H. Peter Anvin @ 2013-06-12 6:29 ` Bernd Schubert 2013-06-12 10:22 ` Joe Lawrence 2013-06-12 14:28 ` Martin K. Petersen 1 sibling, 2 replies; 64+ messages in thread From: Bernd Schubert @ 2013-06-12 6:29 UTC (permalink / raw) To: H. Peter Anvin Cc: NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid On 06/12/2013 06:07 AM, H. Peter Anvin wrote: > On 06/11/2013 08:15 PM, NeilBrown wrote: >> If a drive reports that WRITE SAME works, but it doesn't, then I'm >> not sure that I can be happy about working with that drive. > > Seriously... we have that kind of problem all over the place with all > kinds of hardware. Falling back is sensible... the problem here is > *where* that needs to happen... the block layer already does, apparently. > >> If a drive has some quirky behaviour wrt WRITE SAME, then that >> should be handled in some place where 'quirks' are handled - >> certainly not in md. > > The problem here is that you don't find out ahead of time. > > Now, if I understand the issue at hand correctly, the reporting > here was actually a Linux bug related to SATA drives behind a SAS > controller. Martin, am I right? Martin, please correct me if I'm wrong, but I think the code optimistically enabled WRITE_SAME for any drive, except those on a SATA (libata) controller. So it was not the drive that reported it can do WRITE_SAME; the SCSI midlayer did. Martin's patch should improve that (I still need to test it on our hardware), but I'm not sure if there won't be some hardware falling through. Cheers, Bernd ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 6:29 ` Bernd Schubert @ 2013-06-12 10:22 ` Joe Lawrence 2013-06-12 14:28 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Joe Lawrence @ 2013-06-12 10:22 UTC (permalink / raw) To: Bernd Schubert Cc: H. Peter Anvin, NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid On Wed, 12 Jun 2013, Bernd Schubert wrote: > On 06/12/2013 06:07 AM, H. Peter Anvin wrote: > > On 06/11/2013 08:15 PM, NeilBrown wrote: > >> If a drive reports that WRITE SAME works, but it doesn't, then I'm > >> not sure that I can be happy about working with that drive. > > > > Seriously... we have that kind of problem all over the place with all > > kinds of hardware. Falling back is sensible... the problem here is > > *where* that needs to happen... the block layer already does, apparently. > > > >> If a drive has some quirky behaviour wrt WRITE SAME, then that > >> should be handled in some place where 'quirks' are handled - > >> certainly not in md. > > > > The problem here is that you don't find out ahead of time. > > > > Now, if I understand the issue at hand correctly, the reporting > > here was actually a Linux bug related to SATA drives behind a SAS > > controller. Martin, am I right? > > Martin, please correct me if I'm wrong, but I think the code > optimistically enabled WRITE_SAME for any drive, except those on a > SATA (libata) controller. So it was not the drive that reported it can do > WRITE_SAME; the SCSI midlayer did. Martin's patch should improve > that (I still need to test it on our hardware), but I'm not sure if > there won't be some hardware falling through. I'm worried about other unsupported HW, too. Last night I started writing a patch to set the raid1,10 max write same sectors to 0 for inclusion in 3.10 + stable... I'd like to include a mention of Martin's patch/the SATA drives in the commit log. Thanks so much for hunting this down. -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 6:29 ` Bernd Schubert 2013-06-12 10:22 ` Joe Lawrence @ 2013-06-12 14:28 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-12 14:28 UTC (permalink / raw) To: Bernd Schubert Cc: H. Peter Anvin, NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: Bernd> please correct me if I'm wrong, but I think the code Bernd> optimistically enabled WRITE_SAME for any drive, except those Bernd> on a SATA (libata) controller. Correct. WRITE SAME has been used extensively by RAID manufacturers for years. So pretty much any SCSI-class drive supports it. The headache we ran into in this case was SATA drives behind a SAS controller. Bernd> Martin's patch should improve that (I still need to test it on our Bernd> hardware), Please do! James wants a tested-by from someone that's not me before he pushes the patch to Linus. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 4:07 ` H. Peter Anvin 2013-06-12 6:29 ` Bernd Schubert @ 2013-06-12 14:25 ` Martin K. Petersen 2013-06-12 14:29 ` H. Peter Anvin 1 sibling, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-12 14:25 UTC (permalink / raw) To: H. Peter Anvin Cc: NeilBrown, Joe Lawrence, Martin K. Petersen, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> If a drive has some quirky behaviour wrt WRITE SAME, then that should >> be handled in some place where 'quirks' are handled - certainly not >> in md. hpa> The problem here is that you don't find out ahead of time. hpa> Now, if I understand the issue at hand correctly, the hpa> reporting here was actually a Linux bug related to SATA drives hpa> behind a SAS controller. Martin, am I right? Support for WRITE SAME is harder for us to detect. With discard we have a set of device-reported bits we can use as triggers, not so for WRITE SAME. And since it is a destructive command we cannot simply issue one at device discovery time to see whether it works. Technically there's nothing that prevents a SAS controller's SCSI-ATA Translation from handling WRITE SAME. The patch I posted simply adds another heuristic. Namely that if we can see that the drive behind the SAS controller is of the ATA persuasion we will not attempt to issue WRITE SAME unless the controller explicitly advertises WRITE SAME support using REPORT SUPPORTED OPERATION CODES. Sadly we cannot exclusively rely on RSOC when deciding whether WRITE SAME is supported or not for devices in general. 95% of the WRITE SAME-capable devices out there do not support RSOC :( -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
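A sketch of the heuristic Martin describes, paraphrased from the sd.c patch (the exact conditions may differ from the final commit): if the device cannot do REPORT SUPPORTED OPERATION CODES and it exposes an ATA Information VPD page -- i.e. it sits behind SCSI-ATA Translation -- WRITE SAME is disabled outright.

    /* Sketch of the sd.c detection policy, not the literal patch. */
    static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
    {
            struct scsi_device *sdev = sdkp->device;

            if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
                    /* Device predates RSOC. */
                    sdev->no_report_opcodes = 1;

                    /* No RSOC plus an ATA Information VPD page (0x89)
                     * means an ATA device behind SAT: don't risk
                     * WRITE SAME at all. */
                    if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE))
                            sdev->no_write_same = 1;
                    return;
            }

            /* RSOC works: believe only what is explicitly advertised. */
            if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1)
                    sdkp->ws16 = 1;
    }

As Martin notes, the RSOC branch cannot be the only test: most WRITE SAME-capable SCSI drives don't implement RSOC at all, so plain SCSI devices still get the optimistic default.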
* Re: RAID-10 keeps aborting 2013-06-12 14:25 ` Martin K. Petersen @ 2013-06-12 14:29 ` H. Peter Anvin 2013-06-12 14:34 ` Martin K. Petersen 0 siblings, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 14:29 UTC (permalink / raw) To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid On 06/12/2013 07:25 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > >>> If a drive has some quirky behaviour wrt WRITE SAME, then that should >>> be handled in some place where 'quirks' are handled - certainly not >>> in md. > > hpa> The problem here is that you don't find out ahead of time. > > hpa> Now, if I understand the issue at hand correctly, the > hpa> reporting here was actually a Linux bug related to SATA drives > hpa> behind a SAS controller. Martin, am I right? > > Support for WRITE SAME is harder for us to detect. With discard we have > a set of device-reported bits we can use as triggers, not so for WRITE > SAME. And since it is a destructive command we cannot simply issue one > at device discovery time to see whether it works. > > Technically there's nothing that prevents a SAS controller's SCSI-ATA > Translation from handling WRITE SAME. The patch I posted simply adds another > heuristic. Namely that if we can see that the drive behind the SAS > controller is of the ATA persuasion we will not attempt to issue WRITE > SAME unless the controller explicitly advertises WRITE SAME support > using REPORT SUPPORTED OPERATION CODES. > > Sadly we cannot exclusively rely on RSOC when deciding whether WRITE > SAME is supported or not for devices in general. 95% of the WRITE > SAME-capable devices out there do not support RSOC :( > The second question is whether we should disable WRITE SAME for raid1/10 (what about raid0?) for 3.10/stable, or whether your patch really is sufficient... "just adds another heuristic" makes me nervous. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 14:29 ` H. Peter Anvin @ 2013-06-12 14:34 ` Martin K. Petersen 2013-06-12 14:37 ` H. Peter Anvin 2013-06-12 14:45 ` H. Peter Anvin 0 siblings, 2 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-12 14:34 UTC (permalink / raw) To: H. Peter Anvin Cc: Martin K. Petersen, NeilBrown, Joe Lawrence, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> The second question is whether we should disable WRITE SAME for raid1/10 hpa> (what about raid0?) for 3.10/stable, or whether your patch really is hpa> sufficient... "just adds another heuristic" makes me nervous. I think we should disable 1+10 in stable until we get the recovery scenario sorted out. I don't believe there are any problems with raid0. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 14:34 ` Martin K. Petersen @ 2013-06-12 14:37 ` H. Peter Anvin 2013-06-12 14:45 ` H. Peter Anvin 1 sibling, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 14:37 UTC (permalink / raw) To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid On 06/12/2013 07:34 AM, Martin K. Petersen wrote: > > I don't believe there are any problems with raid0. > I was wondering if a WRITE SAME sent to raid0 would cause the array to get stopped or panicked. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 14:34 ` Martin K. Petersen 2013-06-12 14:37 ` H. Peter Anvin @ 2013-06-12 14:45 ` H. Peter Anvin 2013-06-13 3:10 ` NeilBrown 1 sibling, 2 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 14:45 UTC (permalink / raw) To: Martin K. Petersen; +Cc: NeilBrown, Joe Lawrence, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 504 bytes --] On 06/12/2013 07:34 AM, Martin K. Petersen wrote: >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> The second question is whether we should disable WRITE SAME for raid1/10 > hpa> (what about raid0?) for 3.10/stable, or whether your patch really is > hpa> sufficient... "just adds another heuristic" makes me nervous. > > I think we should disable 1+10 in stable until we get the recovery > scenario sorted out. > > I don't believe there are any problems with raid0. > How does this look? -hpa [-- Attachment #2: 0001-raid1-10-Disable-WRITE-SAME-until-a-recovery-strateg.patch --] [-- Type: text/x-patch, Size: 2452 bytes --] From ac28be1574a6187f4f26bd75217059bf17b13560 Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" <hpa@zytor.com> Date: Wed, 12 Jun 2013 07:37:43 -0700 Subject: [PATCH] raid1,10: Disable WRITE SAME until a recovery strategy is in place There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drives behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession, resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1 and raid10. Signed-off-by: H.
Peter Anvin <hpa@linux.intel.com> --- drivers/md/raid1.c | 4 ++-- drivers/md/raid10.c | 3 +-- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 5595118..914ca0a 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -2780,8 +2780,8 @@ static int run(struct mddev *mddev) return PTR_ERR(conf); if (mddev->queue) - blk_queue_max_write_same_sectors(mddev->queue, - mddev->chunk_sectors); + blk_queue_max_write_same_sectors(mddev->queue, 0); + rdev_for_each(rdev, mddev) { if (!mddev->gendisk) continue; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 59d4daa..807ace8 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -3609,8 +3609,7 @@ static int run(struct mddev *mddev) if (mddev->queue) { blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); - blk_queue_max_write_same_sectors(mddev->queue, - mddev->chunk_sectors); + blk_queue_max_write_same_sectors(mddev->queue, 0); blk_queue_io_min(mddev->queue, chunk_size); if (conf->geo.raid_disks % conf->geo.near_copies) blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks); -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting @ 2013-06-12 15:58 ` H. Peter Anvin 0 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-12 15:58 UTC (permalink / raw) To: Lawrence, Joe; +Cc: Martin K. Petersen, NeilBrown, Dan Williams, linux-raid On 06/12/2013 08:55 AM, Lawrence, Joe wrote: > This looks exactly like what I started last night, only with a more detailed commit msg. Would it be worth mentioning that block will only retry if MD fails the master bio? (i.e., if one of the mirrored components succeeds the WRITE SAME) Probably not... the big deal here is that it isn't a viable solution due to the write bitmap fence. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-12 14:45 ` H. Peter Anvin @ 2013-06-13 3:10 ` NeilBrown 2013-06-13 3:13 ` H. Peter Anvin 2013-06-13 21:40 ` Martin K. Petersen 1 sibling, 2 replies; 64+ messages in thread From: NeilBrown @ 2013-06-13 3:10 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 960 bytes --] On Wed, 12 Jun 2013 07:45:16 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 06/12/2013 07:34 AM, Martin K. Petersen wrote: > >>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > > > hpa> The second question is whether we should disable WRITE SAME for raid1/10 > > hpa> (what about raid0?) for 3.10/stable, or whether your patch really is > > hpa> sufficient... "just adds another heuristic" makes me nervous. > > > > I think we should disable 1+10 in stable until we get the recovery > > scenario sorted out. > > > > I don't believe there are any problems with raid0. > > > > How does this look? > > -hpa > Promising - thanks. However, should we do the same thing in raid5.c too? As far as I can tell, the default set by blk_set_stacking_limits() (which md calls) is to allow WRITE_SAME if all underlying devices do. But I'm pretty sure raid5 will do the wrong thing with a WRITE_SAME request. ?? Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-13 3:10 ` NeilBrown @ 2013-06-13 3:13 ` H. Peter Anvin 2013-06-13 3:31 ` NeilBrown 1 sibling, 1 reply; 64+ messages in thread From: H. Peter Anvin @ 2013-06-13 3:13 UTC (permalink / raw) To: NeilBrown; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid On 06/12/2013 08:10 PM, NeilBrown wrote: > > Promising - thanks. > > However, should we do the same thing in raid5.c too? As far as I can > tell, the default set by blk_set_stacking_limits() (which md calls) > is to allow WRITE_SAME if all underlying devices do. But I'm > pretty sure raid5 will do the wrong thing with a WRITE_SAME > request. > Yes, if raid5 also bounces the array if the WRITE SAME request fails at the device, we need to do the same thing there. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-13 3:13 ` H. Peter Anvin @ 2013-06-13 3:31 ` NeilBrown 0 siblings, 0 replies; 64+ messages in thread From: NeilBrown @ 2013-06-13 3:31 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 681 bytes --] On Wed, 12 Jun 2013 20:13:19 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 06/12/2013 08:10 PM, NeilBrown wrote: > > > > Promising - thanks. > > > > However, should we do the same thing in raid5.c too? As far as I can > > tell, the default set by blk_set_stacking_limits() (which md calls) > > is to allow WRITE_SAME if all underlying devices do. But I'm > > pretty sure raid5 will do the wrong thing with a WRITE_SAME > > request. > > > > Yes, if raid5 also bounces the array if the WRITE SAME request fails > at the device, we need to do the same thing there. > > -hpa > Thanks. I've modified the patch and tagged it for -stable. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-13 3:10 ` NeilBrown 2013-06-13 3:13 ` H. Peter Anvin @ 2013-06-13 21:40 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-13 21:40 UTC (permalink / raw) To: NeilBrown Cc: H. Peter Anvin, Martin K. Petersen, Joe Lawrence, Dan Williams, linux-raid >>>>> "Neil" == NeilBrown <neilb@suse.de> writes: Neil> But I'm pretty sure raid5 will do the wrong thing with a Neil> WRITE_SAME request. Yeah, you should set: mddev->queue->limits.max_write_same_sectors = 0; before disk_stack_limits() is called. Just like it's done with discard_zeroes_data. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
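For raid5, the analogous change would look something like this sketch (the patch Neil tagged may differ in detail):

    /* raid5 run(): member devices may advertise WRITE SAME, but raid5
     * cannot recover from a partial failure safely, so advertise a
     * limit of 0 before the member limits are stacked. */
    if (mddev->queue)
            blk_queue_max_write_same_sectors(mddev->queue, 0);

This mirrors the raid1/raid10 hunks in the patch posted earlier in the thread, and ensures that blk_set_stacking_limits() cannot re-enable the feature based on what the component devices claim.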
* Re: RAID-10 keeps aborting 2013-06-12 3:15 ` NeilBrown 2013-06-12 4:07 ` H. Peter Anvin @ 2013-06-13 2:45 ` Joe Lawrence 2013-06-13 3:11 ` NeilBrown 1 sibling, 1 reply; 64+ messages in thread From: Joe Lawrence @ 2013-06-13 2:45 UTC (permalink / raw) To: NeilBrown Cc: Joe Lawrence, Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid On Wed, 12 Jun 2013, NeilBrown wrote: > I've applied that patch below - thanks. > > NeilBrown > > > > From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001 > > From: Joe Lawrence <joe.lawrence@stratus.com> > > Date: Fri, 7 Jun 2013 15:25:54 -0400 > > Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors > > [snip patch] Hi Neil -- this patch was only to test out what RAID1 would do if the blk_queue_max_write_same_sectors were set to enable WRITE SAME. As HPA and Martin point out, we should be disabling WRITE SAME for raid1/10 in 3.10 / stable for now. Sorry for the confusion. -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-13 2:45 ` Joe Lawrence @ 2013-06-13 3:11 ` NeilBrown 0 siblings, 0 replies; 64+ messages in thread From: NeilBrown @ 2013-06-13 3:11 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, H. Peter Anvin, Dan Williams, linux-raid [-- Attachment #1: Type: text/plain, Size: 1036 bytes --] On Wed, 12 Jun 2013 22:45:08 -0400 (EDT) Joe Lawrence <joe.lawrence@stratus.com> wrote: > On Wed, 12 Jun 2013, NeilBrown wrote: > > > I've applied that patch below - thanks. > > > > NeilBrown > > > > > > From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001 > > > From: Joe Lawrence <joe.lawrence@stratus.com> > > > Date: Fri, 7 Jun 2013 15:25:54 -0400 > > > Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors > > > > [snip patch] > > Hi Neil -- this patch was only to test out what RAID1 would do if the > blk_queue_max_write_same_sectors were set to enable WRITE SAME. As HPA > and Martin point out, we should be disabling WRITE SAME for raid1/10 in > 3.10 / stable for now. Sorry for the confusion. > > -- Joe Ahh - thanks. The code your patch removes is clearly wrong as chunk size doesn't mean anything useful for raid1, but as you say we need to actually disable it, so may as well do that at the same time as removing the error. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting @ 2013-06-05 19:29 ` Martin K. Petersen 2013-06-06 18:27 ` Joe Lawrence 2013-06-12 14:43 ` Martin K. Petersen 1 sibling, 1 reply; 64+ messages in thread From: Martin K. Petersen @ 2013-06-05 19:29 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Dan Williams, Martin K. Petersen, linux-raid, joe.lawrence >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: hpa> Yuck, this suddenly got complex. Perhaps WRITE SAME should simply hpa> be disabled on raid1/raid10 until this can be addressed? Yeah, maybe. hpa> Do we need to do the same for DISCARD? For discard we have better heuristics from the device so the partial completion case should be rare. But obviously a disk can reject a command anytime with or without reason... Also, discard is advisory. There is no guarantee that a pair of mirrored drives will be consistent in the discarded region (unless the drives promise to zero discarded blocks but MD doesn't check that for raid1/raid10). WRITE SAME requires strict consistency between the devices, however. So it looks like there's some work that needs to be done in that department. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-05 19:29 ` Martin K. Petersen @ 2013-06-06 18:27 ` Joe Lawrence 2013-06-06 18:36 ` H. Peter Anvin 0 siblings, 2 replies; 64+ messages in thread From: Joe Lawrence @ 2013-06-06 18:27 UTC (permalink / raw) To: Martin K. Petersen; +Cc: H. Peter Anvin, Dan Williams, linux-raid On Wed, 5 Jun 2013 12:29:32 -0700 (PDT) "Martin K. Petersen" <martin.petersen@oracle.com> wrote: > >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: > > hpa> Yuck, this suddenly got complex. Perhaps WRITE SAME should simply > hpa> be disabled on raid1/raid10 until this can be addressed? > > Yeah, maybe. Martin, I'm looking at the changes in raid1.c and am confused as to why we did this in the first place (commit c8dc9c6): if (mddev->queue) blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); for RAID1, aren't chunk_sectors always going to be zero? (At least on my machine, /sys/block/md*/queue/write_same_max_bytes for all md RAID1 devices are set to 0.) This would have the effect of rendering bdev_write_same() always false for these MD devices. -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-06 18:27 ` Joe Lawrence @ 2013-06-06 18:36 ` H. Peter Anvin 1 sibling, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-06 18:36 UTC (permalink / raw) To: Joe Lawrence; +Cc: Martin K. Petersen, Dan Williams, linux-raid On 06/06/2013 11:27 AM, Joe Lawrence wrote: > On Wed, 5 Jun 2013 12:29:32 -0700 (PDT) > "Martin K. Petersen" <martin.petersen@oracle.com> wrote: > >>>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> >> hpa> Yuck, this suddenly got complex. Perhaps WRITE SAME should simply >> hpa> be disabled on raid1/raid10 until this can be addressed? >> >> Yeah, maybe. > > Martin, > > I'm looking at the changes in raid1.c and am confused as to why we did > this in the first place (commit c8dc9c6): > > if (mddev->queue) > blk_queue_max_write_same_sectors(mddev->queue, > mddev->chunk_sectors); > > for RAID1, aren't chunk_sectors always going to be zero? (At least on > my machine, /sys/block/md*/queue/write_same_max_bytes for all md RAID1 > devices are set to 0.) This would have the effect of rendering > bdev_write_same() always false for these MD devices. > That presumably also explains why only RAID-10 seems to be affected. -hpa ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-05 19:29 ` Martin K. Petersen @ 2013-06-12 14:43 ` Martin K. Petersen 1 sibling, 0 replies; 64+ messages in thread From: Martin K. Petersen @ 2013-06-12 14:43 UTC (permalink / raw) To: H. Peter Anvin Cc: Martin K. Petersen, NeilBrown, Joe Lawrence, Dan Williams, linux-raid >>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes: >> I don't believe there are any problems with raid0. hpa> I was wondering if a WRITE SAME sent to raid0 would cause the array hpa> to get stopped or panicked. raid0 doesn't do its own endio processing, so an error will just bubble up to the generic code in blk-lib. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: RAID-10 keeps aborting 2013-06-03 3:57 RAID-10 keeps aborting H. Peter Anvin 2013-06-03 4:05 ` H. Peter Anvin 2013-06-03 5:47 ` Dan Williams @ 2013-06-11 21:50 ` Joe Lawrence 2013-06-11 21:53 ` H. Peter Anvin 2 siblings, 1 reply; 64+ messages in thread From: Joe Lawrence @ 2013-06-11 21:50 UTC (permalink / raw) To: H. Peter Anvin Cc: linux-raid, Dan Williams, Martin K. Petersen, Bernd Schubert I'll be away from the office the rest of the week for the Red Hat Summit, but here are a few things tested today: *** Test 1 (repeated for both RAID1 and RAID10) Setup: RAID1/10, a single disk fails WRITE SAME with [sense_key,asc,ascq]: [0x05,0x20,0x00] Result: the RAID md_personality error_handler was invoked for the disk that kicked back Illegal Request. That disk was marked faulty and write intent bitmaps started tracking chunks. WRITE SAME to the other disks succeeded so MD didn't fail the parent bio and therefore blkdev_issue_write_same was considered successful. When the failed component disk is re-added to the RAID, the write_same_max_bytes are recalculated and set to 0. Resync with write intent bitmap properly mirrored the disk. This behavior seems reasonable, though one could argue that write_same_max_bytes should be set to 0 for the MD device once any WRITE SAME fails. Better yet, the block layer could fall back to plain WRITEs without dropping the disk. *** Test 2 Setup: RAID1, all member disks fail WRITE SAME with [sense_key,asc,ascq]: [0x05,0x20,0x00]. Result: the first disk is kicked out of the RAID, but the second remains active. The block layer detects an error from blkdev_issue_write_same, emits "mdXXX: WRITE SAME failed. Manually zeroing" messages, but never prevents future WRITE SAMEs from hitting this MD device. *** Test 3 Setup: RAID10, all member disks fail WRITE SAME with [sense_key,asc,ascq]: [0x05,0x20,0x00]. Result: all component disks are marked as faulty and lost page writes occur on the MD device! blkdev_issue_write_same fails and "mdXXX: WRITE SAME failed. Manually zeroing" messages appear, but it's too late, there aren't any disks left standing in the RAID. I think the results of Tests 2 and 3 confirm that RAID10 definitely needs to be fixed... Disabling WRITE SAME for RAID10 for now (3.10 + stable) would be the safest thing to do until this is sorted out properly. -- Joe ^ permalink raw reply [flat|nested] 64+ messages in thread
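Joe's "better yet" alternative -- falling back to plain writes without kicking the disk -- would amount to something like this hypothetical check in md's write-completion path (no such code exists in any tree; the helper name is illustrative only):

    /* Hypothetical (not in mainline): decide whether a write error
     * should fail the member device, or merely fail the parent bio so
     * that blk-lib retries with plain zeroing writes. */
    static bool md_write_error_is_fatal(struct bio *bio)
    {
            if (bio->bi_rw & REQ_WRITE_SAME)
                    return false;   /* retryable: don't call md_error() */
            return true;            /* real write failure: fail the rdev */
    }

As the earlier bitmap discussion showed, this would still need the failed region kept dirty until the retry completes, which is why disabling WRITE SAME outright was the chosen stopgap.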
* Re: RAID-10 keeps aborting 2013-06-11 21:50 ` Joe Lawrence @ 2013-06-11 21:53 ` H. Peter Anvin 0 siblings, 0 replies; 64+ messages in thread From: H. Peter Anvin @ 2013-06-11 21:53 UTC (permalink / raw) To: Joe Lawrence; +Cc: linux-raid, Dan Williams, Martin K. Petersen, Bernd Schubert Exactly. Joe Lawrence <joe.lawrence@stratus.com> wrote: >I'll be away from the office the rest of the week for the Red Hat Summit, >but here are a few things tested today: > >*** Test 1 (repeated for both RAID1 and RAID10) > >Setup: RAID1/10, a single disk fails WRITE SAME with > [sense_key,asc,ascq]: [0x05,0x20,0x00] > >Result: the RAID md_personality error_handler was invoked for the disk > that kicked back Illegal Request. That disk was marked > faulty and write intent bitmaps started tracking chunks. WRITE > SAME to the other disks succeeded so MD didn't fail the parent > bio and therefore blkdev_issue_write_same was considered > successful. > > When the failed component disk is re-added to the RAID, the > write_same_max_bytes are recalculated and set to 0. Resync with > write intent bitmap properly mirrored the disk. > >This behavior seems reasonable, though one could argue that >write_same_max_bytes should be set to 0 for the MD device once any >WRITE >SAME fails. Better yet, the block layer could fall back to plain >WRITEs >without dropping the disk. > > >*** Test 2 > >Setup: RAID1, all member disks fail WRITE SAME with > [sense_key,asc,ascq]: [0x05,0x20,0x00]. > >Result: the first disk is kicked out of the RAID, but the second >remains > active. The block layer detects an error from > blkdev_issue_write_same, emits "mdXXX: WRITE SAME failed. > Manually zeroing" messages, but never prevents future WRITE > SAMEs from hitting this MD device. > > >*** Test 3 > >Setup: RAID10, all member disks fail WRITE SAME with > [sense_key,asc,ascq]: [0x05,0x20,0x00]. > >Result: all component disks are marked as faulty and lost page writes > occur on the MD device! blkdev_issue_write_same fails and "mdXXX: > WRITE SAME failed. Manually zeroing" messages appear, but it's too late, > there aren't any disks left standing in the RAID. > > >I think the results of Tests 2 and 3 confirm that RAID10 definitely >needs to be fixed... Disabling WRITE SAME for RAID10 for now (3.10 + >stable) would be the safest thing to do until this is sorted out properly. > >-- Joe -- Sent from my mobile phone. Please excuse brevity and lack of formatting. ^ permalink raw reply [flat|nested] 64+ messages in thread
end of thread, other threads:[~2013-06-13 21:40 UTC | newest] Thread overview: 64+ messages 2013-06-03 3:57 RAID-10 keeps aborting H. Peter Anvin 2013-06-03 4:05 ` H. Peter Anvin 2013-06-03 5:47 ` Dan Williams 2013-06-03 6:06 ` H. Peter Anvin 2013-06-03 6:14 ` Dan Williams 2013-06-03 6:30 ` H. Peter Anvin 2013-06-03 14:39 ` H. Peter Anvin 2013-06-11 16:47 ` Joe Lawrence 2013-06-11 17:12 ` H. Peter Anvin 2013-06-03 15:47 ` H. Peter Anvin 2013-06-03 16:09 ` Joe Lawrence 2013-06-03 17:22 ` Dan Williams 2013-06-03 17:40 ` H. Peter Anvin 2013-06-03 18:35 ` Martin K. Petersen 2013-06-03 18:38 ` H. Peter Anvin 2013-06-03 18:40 ` H. Peter Anvin 2013-06-03 22:20 ` H. Peter Anvin 2013-06-03 22:34 ` H. Peter Anvin 2013-06-04 15:56 ` Martin K. Petersen 2013-06-03 23:19 ` H. Peter Anvin 2013-06-04 15:39 ` Joe Lawrence 2013-06-04 15:46 ` H. Peter Anvin 2013-06-04 15:54 ` Martin K. Petersen 2013-06-05 10:02 ` Bernd Schubert 2013-06-05 11:38 ` Bernd Schubert 2013-06-05 12:53 ` [PATCH] scsi: Check if the device support WRITE_SAME_10 Bernd Schubert 2013-06-05 19:14 ` Martin K. Petersen 2013-06-05 20:09 ` Bernd Schubert 2013-06-07 2:15 ` Martin K. Petersen 2013-06-12 19:34 ` Bernd Schubert 2013-06-05 19:11 ` RAID-10 keeps aborting Martin K. Petersen 2013-06-04 17:36 ` Dan Williams 2013-06-04 17:54 ` Martin K. Petersen 2013-06-04 17:57 ` H. Peter Anvin 2013-06-04 18:04 ` Martin K. Petersen 2013-06-04 18:32 ` Dan Williams 2013-06-04 18:38 ` H. Peter Anvin 2013-06-04 18:56 ` Dan Williams 2013-06-05 2:39 ` H. Peter Anvin 2013-06-07 2:19 ` Martin K. Petersen 2013-06-10 14:15 ` Joe Lawrence 2013-06-12 3:15 ` NeilBrown 2013-06-12 4:07 ` H. Peter Anvin 2013-06-12 6:29 ` Bernd Schubert 2013-06-12 10:22 ` Joe Lawrence 2013-06-12 14:28 ` Martin K. Petersen 2013-06-12 14:25 ` Martin K. Petersen 2013-06-12 14:29 ` H. Peter Anvin 2013-06-12 14:34 ` Martin K. Petersen 2013-06-12 14:37 ` H. Peter Anvin 2013-06-12 14:45 ` H. Peter Anvin 2013-06-12 15:58 ` H. Peter Anvin 2013-06-13 3:10 ` NeilBrown 2013-06-13 3:13 ` H. Peter Anvin 2013-06-13 3:31 ` NeilBrown 2013-06-13 21:40 ` Martin K. Petersen 2013-06-13 2:45 ` Joe Lawrence 2013-06-13 3:11 ` NeilBrown 2013-06-05 19:29 ` Martin K. Petersen 2013-06-06 18:27 ` Joe Lawrence 2013-06-06 18:36 ` H. Peter Anvin 2013-06-12 14:43 ` Martin K. Petersen 2013-06-11 21:50 ` Joe Lawrence 2013-06-11 21:53 ` H. Peter Anvin