* absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-06 16:47 UTC
To: Jens Axboe, lkml

Hi,

I'm running a modified 3.4-stable kernel on relatively recent x86
server-class hardware.

I recently installed a Seagate ST900MM0026 (900GB 2.5in 10K SAS drive)
and it's reporting a value of 4294966784 for optimal_io_size. The other
parameters look normal, though:

/sys/block/sda/queue/hw_sector_size:512
/sys/block/sda/queue/logical_block_size:512
/sys/block/sda/queue/max_segment_size:65536
/sys/block/sda/queue/minimum_io_size:512
/sys/block/sda/queue/optimal_io_size:4294966784

The other drives in the system look more like what I'd expect:

/sys/block/sdb/queue/hw_sector_size:512
/sys/block/sdb/queue/logical_block_size:512
/sys/block/sdb/queue/max_segment_size:65536
/sys/block/sdb/queue/minimum_io_size:4096
/sys/block/sdb/queue/optimal_io_size:0
/sys/block/sdb/queue/physical_block_size:4096

/sys/block/sdc/queue/hw_sector_size:512
/sys/block/sdc/queue/logical_block_size:512
/sys/block/sdc/queue/max_segment_size:65536
/sys/block/sdc/queue/minimum_io_size:4096
/sys/block/sdc/queue/optimal_io_size:0
/sys/block/sdc/queue/physical_block_size:4096

According to the manual, the ST900MM0026 has a 512-byte physical sector
size.

Is this a drive firmware bug? Or a bug in the SAS driver? Or is there
a valid reason for a single drive to report such a huge value?

Would it make sense for the kernel to do some sort of sanity checking
on this value?

Chris

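(For anyone wanting to double-check what the drive itself reports, as
opposed to what the kernel stacked into sysfs, the Block Limits VPD page
can be dumped with sg_inq — a quick sketch, assuming sg3_utils is
installed; substitute your own device node. The same command shows up
later in this thread.)

# Dump the Block Limits VPD page (0xb0); its "Optimal transfer
# length" field is the block count that sd.c multiplies by the
# logical block size to produce optimal_io_size.
sg_inq --vpd --page=0xb0 /dev/sda
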
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-06 17:16 UTC
To: Jens Axboe, lkml, linux-scsi, Mike Snitzer, Martin K. Petersen

On 11/06/2014 10:47 AM, Chris Friesen wrote:
> Hi,
>
> I'm running a modified 3.4-stable kernel on relatively recent x86
> server-class hardware.
>
> I recently installed a Seagate ST900MM0026 (900GB 2.5in 10K SAS drive)
> and it's reporting a value of 4294966784 for optimal_io_size. The other
> parameters look normal, though:
>
> /sys/block/sda/queue/hw_sector_size:512
> /sys/block/sda/queue/logical_block_size:512
> /sys/block/sda/queue/max_segment_size:65536
> /sys/block/sda/queue/minimum_io_size:512
> /sys/block/sda/queue/optimal_io_size:4294966784

<snip>

> According to the manual, the ST900MM0026 has a 512-byte physical sector
> size.
>
> Is this a drive firmware bug? Or a bug in the SAS driver? Or is there
> a valid reason for a single drive to report such a huge value?
>
> Would it make sense for the kernel to do some sort of sanity checking
> on this value?

Looks like this sort of thing has been seen before in other drives (one
of which is from the same family as my drive):

http://www.spinics.net/lists/linux-scsi/msg65292.html
http://iamlinux.technoyard.in/blog/why-is-my-ssd-disk-not-reconized-by-the-rhel6-anaconda-installer/

Perhaps the ST900MM0026 should be blacklisted as well? Or maybe the
SCSI code should do a variation on Mike Snitzer's original patch and
just ignore any values above some reasonable threshold? (And then we
could remove the blacklist on the ST900MM0006.)

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-06 17:34 UTC
To: Chris Friesen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer, Martin K. Petersen

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris> Perhaps the ST900MM0026 should be blacklisted as well?

Sure. I'll widen the net a bit for that Seagate model.

commit 17f1ee2d16a6878269c4429306f6e678b7e61505
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Thu Nov 6 12:31:43 2014 -0500

    SCSI: Blacklist ST900MM0026

    Looks like this entire series of drives reports the wrong values in
    the block limits VPD. Widen the blacklist.

    Reported-by: Chris Friesen <chris.friesen@windriver.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 49014a143c6a..9116531b415a 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -229,7 +229,7 @@ static struct {
 	{"SanDisk", "ImageMate CF-SD1", NULL, BLIST_FORCELUN},
 	{"SEAGATE", "ST34555N", "0930", BLIST_NOTQ},	/* Chokes on tagged INQUIRY */
 	{"SEAGATE", "ST3390N", "9546", BLIST_NOTQ},
-	{"SEAGATE", "ST900MM0006", NULL, BLIST_SKIP_VPD_PAGES},
+	{"SEAGATE", "ST900MM", NULL, BLIST_SKIP_VPD_PAGES},
 	{"SGI", "RAID3", "*", BLIST_SPARSELUN},
 	{"SGI", "RAID5", "*", BLIST_SPARSELUN},
 	{"SGI", "TP9100", "*", BLIST_REPORTLUN2},

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-06 17:45 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/06/2014 11:34 AM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> Perhaps the ST900MM0026 should be blacklisted as well?
>
> Sure. I'll widen the net a bit for that Seagate model.

That'd work, but is it the best way to go? I mean, I found one report
of a similar problem on an SSD (model number unknown). In that case it
was a near-UINT_MAX value as well.

The problem with the blacklist is that until someone patches it, the
drive is broken. And then it stays blacklisted even if the firmware
gets fixed.

I'm wondering if it might not be better to just ignore all values
larger than X (where X is whatever we think is the largest conceivable
reasonable value).

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-06 18:12 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris> That'd work, but is it the best way to go? I mean, I found one
Chris> report of a similar problem on an SSD (model number unknown). In
Chris> that case it was a near-UINT_MAX value as well.

My concern is still the same. Namely that this particular drive happens
to be returning UINT_MAX, but it might as well be a value that's
entirely random. Or even a value that is small and innocuous-looking
but completely wrong.

Chris> The problem with the blacklist is that until someone patches it,
Chris> the drive is broken. And then it stays blacklisted even if the
Chris> firmware gets fixed.

Well, you can manually blacklist in /proc/scsi/device_info.

Chris> I'm wondering if it might not be better to just ignore all values
Chris> larger than X (where X is whatever we think is the largest
Chris> conceivable reasonable value).

The problem is that finding that value is not easy, and it too will be
a moving target. I'm willing to entertain the following, however...

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 95bfb7bfbb9d..75cc51a01860 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2593,7 +2593,8 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	blk_queue_io_min(sdkp->disk->queue,
 			 get_unaligned_be16(&buffer[6]) * sector_sz);
 	blk_queue_io_opt(sdkp->disk->queue,
-			 get_unaligned_be32(&buffer[12]) * sector_sz);
+			 min_t(u32, get_unaligned_be32(&buffer[12]),
+			       sdkp->capacity) * sector_sz);
 
 	if (buffer[3] == 0x3c) {
 		unsigned int lba_count, desc_count;

-- 
Martin K. Petersen	Oracle Linux Engineering

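(On the manual blacklist Martin mentions: /proc/scsi/device_info
entries take the form vendor:model:flags, with flags given as a numeric
BLIST mask. A sketch follows, with the loud caveat that 0x4000000 is
only an assumed value for BLIST_SKIP_VPD_PAGES — check
include/scsi/scsi_devinfo.h for your kernel before relying on it.)

# Hypothetical runtime quirk; the flag value below is an assumption,
# not verified against any particular kernel tree.
echo 'SEAGATE:ST900MM0026:0x4000000' > /proc/scsi/device_info
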
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Jens Axboe @ 2014-11-06 18:15 UTC
To: Martin K. Petersen, Chris Friesen
Cc: lkml, linux-scsi, Mike Snitzer

On 2014-11-06 11:12, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> That'd work, but is it the best way to go? I mean, I found one
> Chris> report of a similar problem on an SSD (model number unknown). In
> Chris> that case it was a near-UINT_MAX value as well.
>
> My concern is still the same. Namely that this particular drive happens
> to be returning UINT_MAX, but it might as well be a value that's
> entirely random. Or even a value that is small and innocuous-looking
> but completely wrong.
>
> Chris> The problem with the blacklist is that until someone patches it,
> Chris> the drive is broken. And then it stays blacklisted even if the
> Chris> firmware gets fixed.
>
> Well, you can manually blacklist in /proc/scsi/device_info.
>
> Chris> I'm wondering if it might not be better to just ignore all values
> Chris> larger than X (where X is whatever we think is the largest
> Chris> conceivable reasonable value).
>
> The problem is that finding that value is not easy, and it too will be
> a moving target.

Didn't check, but I'm assuming the value is the upper 24 bits of 32. If
so, it might not hurt to check for 0xfffffe00 as an invalid value.

-- 
Jens Axboe

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-06 19:14 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/06/2014 12:12 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> That'd work, but is it the best way to go? I mean, I found one
> Chris> report of a similar problem on an SSD (model number unknown).
> Chris> In that case it was a near-UINT_MAX value as well.
>
> My concern is still the same. Namely that this particular drive
> happens to be returning UINT_MAX, but it might as well be a value
> that's entirely random. Or even a value that is small and
> innocuous-looking but completely wrong.
>
> Chris> The problem with the blacklist is that until someone patches
> Chris> it, the drive is broken. And then it stays blacklisted even if
> Chris> the firmware gets fixed.
>
> Well, you can manually blacklist in /proc/scsi/device_info.
>
> Chris> I'm wondering if it might not be better to just ignore all
> Chris> values larger than X (where X is whatever we think is the
> Chris> largest conceivable reasonable value).
>
> The problem is that finding that value is not easy, and it too will
> be a moving target.

Do we need to be perfect, or just "good enough"?

For a RAID card I expect it would be related to chunk size or stripe
width or something...but even then I would expect to be able to cap it
at 100MB or so. Or are there storage systems on really fast interfaces
that could legitimately want a hundred meg of data at a time?

On 11/06/2014 12:15 PM, Jens Axboe wrote:
> Didn't check, but I'm assuming the value is the upper 24 bits of 32.
> If so, it might not hurt to check for 0xfffffe00 as an invalid value.

Yep, in all three wonky cases so far "optimal_io_size" ended up as
4294966784, which is 0xfffffe00. Does something mask out the lower
bits?

Chris

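(An aside on the "masking": the arithmetic suggests nothing masks
anything — it looks like plain 32-bit overflow. OPTIMAL TRANSFER LENGTH
is a 32-bit block count, and multiplying it by the 512-byte block size
in 32-bit unsigned arithmetic wraps a reported 0xffffffff to exactly
the value seen. A standalone demonstration, not the kernel code:)

#include <stdio.h>

int main(void)
{
	unsigned int blocks = 0xffffffffu; /* OPTIMAL TRANSFER LENGTH from the VPD */
	unsigned int sector_sz = 512;      /* logical block size */

	/* 0xffffffff * 512 == 0x1fffffffe00; truncated to 32 bits this
	 * is 0xfffffe00 == 4294966784, the sysfs value reported above. */
	printf("%u (0x%x)\n", blocks * sector_sz, blocks * sector_sz);
	return 0;
}
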
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 1:56 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris,

Chris> For a RAID card I expect it would be related to chunk size or
Chris> stripe width or something...but even then I would expect to be
Chris> able to cap it at 100MB or so. Or are there storage systems on
Chris> really fast interfaces that could legitimately want a hundred
Chris> meg of data at a time?

Well, there are several devices that report their capacity to indicate
that they don't suffer any performance (RMW) penalties for large
commands regardless of size. I would personally prefer them to report 0
in that case.

Chris> Yep, in all three wonky cases so far "optimal_io_size" ended up
Chris> as 4294966784, which is 0xfffffe00. Does something mask out the
Chris> lower bits?

Ignoring reported values of UINT_MAX and 0xfffffe00 only works until
the next spec-dyslexic firmware writer comes along.

I also think that singling out the OPTIMAL TRANSFER LENGTH is a bit of
a red herring. A vendor could mess up any value in that VPD and it
would still cause us grief. There's no rational explanation for why OTL
would be more prone to being filled out incorrectly than any of the
other parameters in that page.

I do concur, though, that io_opt is problematic by virtue of being
32 bits and getting multiplied by the sector size. So things can easily
get out of whack for fdisk and friends (by comparison, the value we use
for io_min is only 16 bits).

I'm still partial to just blacklisting that entire Seagate family. We
don't have any details on the alleged SSD having the same problem. For
all we know it could be the same SAS disk drive and not an SSD at all.

If there are compelling arguments or other supporting data for sanity
checking OTL, I'd suggest the following patch that caps it at 1GB. I
know of a few devices that prefer alignment at that granularity.

-- 
Martin K. Petersen	Oracle Linux Engineering

commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Thu Nov 6 12:31:43 2014 -0500

    [SCSI] sd: Sanity check the optimal I/O size

    We have come across a couple of devices that report crackpot values
    in the optimal I/O size in the Block Limits VPD page. Since this is
    a 32-bit entity that gets multiplied by the logical block size we
    can get disproportionately large values reported to the block
    layer.

    Cap io_opt at 1 GB.

    Reported-by: Chris Friesen <chris.friesen@windriver.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Cc: stable@vger.kernel.org

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b041eca8955d..806e06c2575f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2591,7 +2591,8 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	blk_queue_io_min(sdkp->disk->queue,
 			 get_unaligned_be16(&buffer[6]) * sector_sz);
 	blk_queue_io_opt(sdkp->disk->queue,
-			 get_unaligned_be32(&buffer[12]) * sector_sz);
+			 min_t(unsigned int, SD_MAX_IO_OPT_BYTES,
+			       get_unaligned_be32(&buffer[12]) * sector_sz));
 
 	if (buffer[3] == 0x3c) {
 		unsigned int lba_count, desc_count;

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 63ba5ca7f9a1..3492779d9d3e 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -44,10 +44,11 @@ enum {
 };
 
 enum {
-	SD_DEF_XFER_BLOCKS = 0xffff,
-	SD_MAX_XFER_BLOCKS = 0xffffffff,
-	SD_MAX_WS10_BLOCKS = 0xffff,
-	SD_MAX_WS16_BLOCKS = 0x7fffff,
+	SD_DEF_XFER_BLOCKS  = 0xffff,
+	SD_MAX_XFER_BLOCKS  = 0xffffffff,
+	SD_MAX_WS10_BLOCKS  = 0xffff,
+	SD_MAX_WS16_BLOCKS  = 0x7fffff,
+	SD_MAX_IO_OPT_BYTES = 1024 * 1024 * 1024,
 };
 
 enum {

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-07 5:35 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/06/2014 07:56 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> For a RAID card I expect it would be related to chunk size or
> Chris> stripe width or something...but even then I would expect to be
> Chris> able to cap it at 100MB or so. Or are there storage systems on
> Chris> really fast interfaces that could legitimately want a hundred
> Chris> meg of data at a time?
>
> Well, there are several devices that report their capacity to indicate
> that they don't suffer any performance (RMW) penalties for large
> commands regardless of size. I would personally prefer them to report
> 0 in that case.

I got curious and looked at the spec at
http://www.13thmonkey.org/documentation/SCSI/sbc3r25.pdf. I'm now
wondering if maybe Linux is misbehaving. I think there is actually some
justification for putting a huge value in the "optimal transfer length"
field.

That field is described as "the optimal transfer length in blocks for a
single...command", but then later it has "If a device server receives a
request with a transfer length exceeding this value, then a significant
delay in processing the request may be incurred." As written, it is
ambiguous.

Looking at ftp://ftp.t10.org/t10/document.03/03-028r2.pdf, it appears
that originally that field was the "optimal maximum transfer length",
not the "optimal transfer length". It appears that the intent was that
the device was able to take requests up to the "maximum transfer
length", but there would be a performance penalty if you went over the
"optimum maximum transfer length".

Section E.4 in sbc3r25.pdf talks about optimizing transfers. It
suggests using a transfer length that is a multiple of "optimal
transfer length granularity", up to a max of either the maximum or the
optimal transfer length, depending on the size of the penalty for
exceeding the optimal transfer length. This reinforces the idea that
the "optimal transfer length" is actually the optimal *maximum* length,
but any multiple of the optimal granularity is fine.

Based on that, I think it would have been clearer if it had been called
"/sys/block/sdb/queue/optimal_max_io_size".

Also, I think it's wrong for filesystems and userspace to use it for
alignment. In E.4 and E.5 in the sbc3r25.pdf doc, it looks like they
use the optimal granularity field for alignment, not the optimal
transfer length.

So for the ST900MM0006, it had:

# sg_inq --vpd --page=0xb0 /dev/sdb
VPD INQUIRY: Block limits page (SBC)
  Optimal transfer length granularity: 1 blocks
  Maximum transfer length: 0 blocks
  Optimal transfer length: 4294967295 blocks

In this case I think the drive is trying to say that it doesn't require
any special granularity (it can handle alignment on 512-byte blocks),
and that it can handle any size of transfer without performance
penalty.

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Dale R. Worley @ 2014-11-07 15:18 UTC
To: Chris Friesen
Cc: martin.petersen, axboe, linux-kernel, linux-scsi, snitzer

> From: Chris Friesen <chris.friesen@windriver.com>
>
> Also, I think it's wrong for filesystems and userspace to use it for
> alignment. In E.4 and E.5 in the sbc3r25.pdf doc, it looks like they
> use the optimal granularity field for alignment, not the optimal
> transfer length.

Everything you say suggests that "optimal transfer length" means "there
is a penalty for doing transfers *larger* than this", but people have
been treating it as "there is a penalty for doing transfers *smaller*
than this". But the latter is the "optimal transfer length
granularity".

Dale

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 16:25 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris,

Chris> Also, I think it's wrong for filesystems and userspace to use it
Chris> for alignment. In E.4 and E.5 in the sbc3r25.pdf doc, it looks
Chris> like they use the optimal granularity field for alignment, not
Chris> the optimal transfer length.

The original rationale behind the OTLG and OTL values was to be able to
express stripe chunk size and stripe width, and to encourage aligned,
full-stripe writes but nothing bigger than that. Obviously the wording
went through the usual standards-body process to become vague/generic
enough to be used for anything. It has changed several times since
sbc3r25, btw.

The kernel really isn't using io_opt. The value is merely stacked and
communicated to userspace. The reason the partitioning tools blow up
with weird values is that they try to align partition beginnings to the
stripe width. Which is the right thing to do as far as I'm concerned.

I have worked with many, many partners in the storage industry to make
sure they report sensible values in the Block Limits VPD. I have no
reason to believe that the SAS drive issue in question is anything but
a simple typo. I know there was a bug open with Seagate. I assume it
has been fixed in their latest firmware. To my knowledge it is not a
problem in any of their other drive models. It certainly isn't in any
of the ones we are shipping.

The unfortunate thing with disk drives is that firmware updates are
much harder to deal with. And you rarely end up having access to an
updated firmware unless your drive was procured through a vendor like
Dell, HP or Oracle. That's why I originally opted to quirk this model
in Linux. Otherwise I would just have said "update your firmware".

If we had devices from many different vendors showing up with values
that constantly threw off our tooling I would have more reason to be
concerned. But we haven't. And this code has been in the kernel since
2.6.32 or so.

-- 
Martin K. Petersen	Oracle Linux Engineering

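(To make the point about partitioning tools concrete, here is a rough,
illustrative sketch of the kind of alignment decision they make from
these sysfs values — a simplification, not lifted from fdisk or parted:)

# Illustrative only: derive a partition-start alignment from the
# queue limits.  The paths are real sysfs attributes; the policy
# shown is an assumption about tool behavior, not their actual code.
opt=$(cat /sys/block/sdb/queue/optimal_io_size)
min=$(cat /sys/block/sdb/queue/minimum_io_size)
lbs=$(cat /sys/block/sdb/queue/logical_block_size)

# Prefer the reported stripe width (io_opt) when nonzero, otherwise
# fall back to io_min; convert bytes to logical blocks.
align_bytes=$(( opt > 0 ? opt : min ))
echo "align partition start to a multiple of $(( align_bytes / lbs )) sectors"

A bogus io_opt of 4294966784 bytes would demand alignment to 8388607
sectors here, which is exactly the kind of value that throws fdisk and
friends off.
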
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 17:42 UTC
To: Martin K. Petersen
Cc: Chris Friesen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:

Martin> I know there was a bug open with Seagate. I assume it has been
Martin> fixed in their latest firmware.

Seagate confirms that this issue was fixed about a year ago. Will
provide more data when I have it.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-07 17:51 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/07/2014 11:42 AM, Martin K. Petersen wrote:
>>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
>
> Martin> I know there was a bug open with Seagate. I assume it has been
> Martin> fixed in their latest firmware.
>
> Seagate confirms that this issue was fixed about a year ago. Will
> provide more data when I have it.

Okay, thanks for the clarification (for this and the spec itself).

Apparently there's a new firmware available, dated Oct 13, but with no
release notes. We just tried updating the firmware on one of the drives
in question and it failed from two different versions of Linux, while
Windows won't install because, apparently, it doesn't like our SSD.
Joy.

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 18:03 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris> Apparently there's a new firmware available, dated Oct 13, but
Chris> with no release notes. We just tried updating the firmware on
Chris> one of the drives in question and it failed from two different
Chris> versions of Linux,

Did you use sg_write_buffer or some special firmware update tool?

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-07 18:48 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/07/2014 10:25 AM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> Also, I think it's wrong for filesystems and userspace to use
> Chris> it for alignment. In E.4 and E.5 in the sbc3r25.pdf doc, it
> Chris> looks like they use the optimal granularity field for
> Chris> alignment, not the optimal transfer length.
>
> The original rationale behind the OTLG and OTL values was to be able
> to express stripe chunk size and stripe width, and to encourage
> aligned, full-stripe writes but nothing bigger than that. Obviously
> the wording went through the usual standards-body process to become
> vague/generic enough to be used for anything. It has changed several
> times since sbc3r25, btw.

You've obviously been involved in this area a lot more closely than me,
so I'll defer to your experience. :)

I think that if that's the intended use case, then the spec wording
could be improved. Looking at sbc3r36.pdf, it still only explicitly
mentions performance penalties for transfers that are larger than the
"optimal transfer length", not for transfers that are smaller.

On 11/07/2014 12:03 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> Apparently there's a new firmware available, dated Oct 13, but
> Chris> with no release notes. We just tried updating the firmware on
> Chris> one of the drives in question and it failed from two different
> Chris> versions of Linux,
>
> Did you use sg_write_buffer or some special firmware update tool?

Both. I didn't do it myself, but the guy who did sent me the following:

localhost:~$ ./dl_sea_fw-0.2.3_64 -m ST900MM0026 -d /dev/sda -f Lightningbug10K6-SED-0003.LOD
================================================================================
Seagate Firmware Download Utility v0.2.3  Build Date: Jan 9 2013
Copyright (c) 2012 Seagate Technology LLC, All Rights Reserved
Fri Nov 7 14:51:21 2014
================================================================================
Downloading file Lightningbug10K6-SED-0003.LOD to /dev/sda
send_io: Input/output error
send_io: Input/output error
! FW Download FAILED

This log is from a different system running Debian:

root@bricklane-2:/home/cgcs# sg_write_buffer -vvv --in=Lightningbug10K6-SED-0003.LOD --length=1752576 --mode=5 /dev/sdb
open /dev/sdb with flags=0x802
sending single write buffer, mode=0x5, mpsec=0, id=0, offset=0, len=1752576
Write buffer cmd: 3b 05 00 00 00 00 1a be 00 00
Write buffer parameter list (first 256 bytes):
e7 1a 0e 59 01 00 02 00 00 00 00 00 00 00 19 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 be 1a 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 d5 cd
00 00 00 00 00 00 00 00 00 00 00 00 00 00 07 00
80 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 bc 1a 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 5f 42 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ioctl(SG_IO v3) failed: Invalid argument (errno=22)
write buffer: pass through os error: Invalid argument
Write buffer failed: Sense category: -1, try '-v' option for more information

Apparently the "hdparm -I" command is giving bogus data as well. I've
seen that happen if the drive is on a RAID controller--I assume that
could cause problems with firmware updates too?

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 19:17 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris> Apparently the "hdparm -I" command is giving bogus data as well.
Chris> I've seen that happen if the drive is on a RAID controller--I
Chris> assume that could cause problems with firmware updates too?

I'd suggest trying /dev/sgN instead. But yes, some RAID controllers
require you to use their tooling and won't allow direct passthrough.

-- 
Martin K. Petersen	Oracle Linux Engineering

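(Concretely, that means pointing the same sg_write_buffer invocation at
the SCSI generic node rather than the block node — /dev/sg1 below is
only a stand-in for whatever sg node the drive maps to, e.g. as listed
by lsscsi -g:)

# Same download as before, but via the sg passthrough node; the
# device node here is hypothetical, check lsscsi -g for yours.
sg_write_buffer --in=Lightningbug10K6-SED-0003.LOD --length=1752576 \
                --mode=5 /dev/sg1
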
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-07 21:04 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/07/2014 01:17 PM, Martin K. Petersen wrote:
> I'd suggest trying /dev/sgN instead.

That seems to work. Much appreciated. And it's now showing an
"optimal_io_size" of 0, so I think the issue is dealt with.

Thanks for all the help, it's been educational. :)

Chris

* RE: absurdly high "optimal_io_size" on Seagate SAS disk

From: Elliott, Robert (Server Storage) @ 2014-11-07 17:10 UTC
To: Martin K. Petersen, Chris Friesen
Cc: Jens Axboe, lkml, linux-scsi@vger.kernel.org, Mike Snitzer

> commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
> Author: Martin K. Petersen <martin.petersen@oracle.com>
> Date:   Thu Nov 6 12:31:43 2014 -0500
>
>     [SCSI] sd: Sanity check the optimal I/O size
>
>     We have come across a couple of devices that report crackpot
>     values in the optimal I/O size in the Block Limits VPD page.
>     Since this is a 32-bit entity that gets multiplied by the
>     logical block size we can get disproportionately large values
>     reported to the block layer.
>
>     Cap io_opt at 1 GB.

Another reasonable cap is the maximum transfer size. There are lots of
them (a sysfs sketch for querying the kernel-side ones follows this
message):

* the block layer BIO_MAX_PAGES value of 256 limits I/Os to a maximum
  of 1 MiB
* SCSI LLDs report their maximum transfer size in
  /sys/block/sdNN/queue/max_hw_sectors_kb
* the SCSI midlayer maximum transfer size is set/reported in
  /sys/block/sdNN/queue/max_sectors_kb, and the default is 512 KiB
* the SCSI LLD maximum number of scatter-gather entries reported in
  /sys/block/sdNN/queue/max_segments and
  /sys/block/sdNN/queue/max_segment_size creates a limit based on how
  fragmented the data buffer is in virtual memory
* the Block Limits VPD page MAXIMUM TRANSFER LENGTH field indicates
  the maximum transfer size for one command over the SCSI transport
  protocol supported by the drive itself

It is risky to use transfer sizes larger than Linux and Windows can
generate, since drives are probably tested in those environments.

---
Rob Elliott    HP Server Storage

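(All of the kernel-side limits Rob lists can be read straight out of
sysfs — a small sketch, with sda standing in for the disk under test:)

# Print each queue limit mentioned above for one device.
for f in max_hw_sectors_kb max_sectors_kb max_segments max_segment_size; do
	printf '%-20s ' "$f"
	cat /sys/block/sda/queue/$f
done
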
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 17:40 UTC
To: Elliott, Robert (Server Storage)
Cc: Martin K. Petersen, Chris Friesen, Jens Axboe, lkml, linux-scsi@vger.kernel.org, Mike Snitzer

>>>>> "Rob" == Elliott, Robert (Server Storage) <Elliott@hp.com> writes:

Rob,

Rob> * the block layer BIO_MAX_PAGES value of 256 limits I/Os to a
Rob>   maximum of 1 MiB

We do support scatterlist chaining, though.

Rob> * SCSI LLDs report their maximum transfer size in
Rob>   /sys/block/sdNN/queue/max_hw_sectors_kb
Rob> * the SCSI midlayer maximum transfer size is set/reported in
Rob>   /sys/block/sdNN/queue/max_sectors_kb, and the default is 512 KiB
Rob> * the SCSI LLD maximum number of scatter-gather entries reported
Rob>   in /sys/block/sdNN/queue/max_segments and
Rob>   /sys/block/sdNN/queue/max_segment_size creates a limit based on
Rob>   how fragmented the data buffer is in virtual memory
Rob> * the Block Limits VPD page MAXIMUM TRANSFER LENGTH field
Rob>   indicates the maximum transfer size for one command over the
Rob>   SCSI transport protocol supported by the drive itself

Yep. We're already capping the actual max I/O size based on all of the
above. However, the purpose of exposing io_opt was to be able to report
stripe size to partitioning tools and filesystems for alignment
purposes. And although they would ideally be the same, it was always
anticipated that stripe size could be bigger than the max I/O size.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Douglas Gilbert @ 2014-11-07 20:15 UTC
To: Elliott, Robert (Server Storage), Martin K. Petersen, Chris Friesen
Cc: Jens Axboe, lkml, linux-scsi@vger.kernel.org, Mike Snitzer

On 14-11-07 12:10 PM, Elliott, Robert (Server Storage) wrote:
>> commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
>> Author: Martin K. Petersen <martin.petersen@oracle.com>
>> Date:   Thu Nov 6 12:31:43 2014 -0500
>>
>>     [SCSI] sd: Sanity check the optimal I/O size
>>
>>     We have come across a couple of devices that report crackpot
>>     values in the optimal I/O size in the Block Limits VPD page.
>>     Since this is a 32-bit entity that gets multiplied by the
>>     logical block size we can get disproportionately large values
>>     reported to the block layer.
>>
>>     Cap io_opt at 1 GB.
>
> Another reasonable cap is the maximum transfer size. There are lots
> of them:
>
> * the block layer BIO_MAX_PAGES value of 256 limits I/Os to a maximum
>   of 1 MiB
> * SCSI LLDs report their maximum transfer size in
>   /sys/block/sdNN/queue/max_hw_sectors_kb
> * the SCSI midlayer maximum transfer size is set/reported in
>   /sys/block/sdNN/queue/max_sectors_kb, and the default is 512 KiB
> * the SCSI LLD maximum number of scatter-gather entries reported in
>   /sys/block/sdNN/queue/max_segments and
>   /sys/block/sdNN/queue/max_segment_size creates a limit based on how
>   fragmented the data buffer is in virtual memory
> * the Block Limits VPD page MAXIMUM TRANSFER LENGTH field indicates
>   the maximum transfer size for one command over the SCSI transport
>   protocol supported by the drive itself
>
> It is risky to use transfer sizes larger than Linux and Windows can
> generate, since drives are probably tested in those environments.

After being burnt by a (virtual) SCSI disk recently, my utilities now
take a more aggressive approach to the data-in buffer received from
INQUIRY, MODE SENSE and LOG SENSE (and probably should add a few more):
at a low level, after the command is completed, the data-in buffer is
post-filled with zeros following the last valid byte as indicated by
resid, until the end of that buffer. Then it is passed back for
higher-level processing of the command, including its data-in buffer.

Pre-filling the data-in buffer with zeros has been in place for a long
time, but I don't think it helps much.

So if there are any HBA drivers that set resid higher than it should
be, expect some pain soon.

Doug Gilbert

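(A minimal sketch of the post-fill Doug describes — the function name
and calling convention here are invented for illustration; the real
logic lives in his utilities:)

#include <string.h>

/* Zero a data-in buffer past its last valid byte, where "resid" is
 * the residual byte count the HBA reported for the command.  A
 * nonsensical resid leaves the buffer untouched. */
static void zero_din_tail(unsigned char *din, int din_len, int resid)
{
	if (din == NULL || resid <= 0 || resid > din_len)
		return;
	memset(din + (din_len - resid), 0, (size_t)resid);
}
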