* absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-06 16:47 UTC
To: Jens Axboe, lkml

Hi,

I'm running a modified 3.4-stable kernel on relatively recent x86
server-class hardware.

I recently installed a Seagate ST900MM0026 (900GB 2.5in 10K SAS drive)
and it's reporting a value of 4294966784 for optimal_io_size. The other
parameters look normal, though:

/sys/block/sda/queue/hw_sector_size:512
/sys/block/sda/queue/logical_block_size:512
/sys/block/sda/queue/max_segment_size:65536
/sys/block/sda/queue/minimum_io_size:512
/sys/block/sda/queue/optimal_io_size:4294966784

The other drives in the system look more like what I'd expect:

/sys/block/sdb/queue/hw_sector_size:512
/sys/block/sdb/queue/logical_block_size:512
/sys/block/sdb/queue/max_segment_size:65536
/sys/block/sdb/queue/minimum_io_size:4096
/sys/block/sdb/queue/optimal_io_size:0
/sys/block/sdb/queue/physical_block_size:4096

/sys/block/sdc/queue/hw_sector_size:512
/sys/block/sdc/queue/logical_block_size:512
/sys/block/sdc/queue/max_segment_size:65536
/sys/block/sdc/queue/minimum_io_size:4096
/sys/block/sdc/queue/optimal_io_size:0
/sys/block/sdc/queue/physical_block_size:4096

According to the manual, the ST900MM0026 has a 512-byte physical sector
size.

Is this a drive firmware bug? Or a bug in the SAS driver? Or is there
a valid reason for a single drive to report such a huge value?

Would it make sense for the kernel to do some sort of sanity checking
on this value?

Chris

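(For anyone wanting to double-check what the drive itself reports, as
opposed to what the kernel stacked into sysfs, the Block Limits VPD page
can be dumped with sg_inq — a quick sketch, assuming sg3_utils is
installed; substitute your own device node. The same command shows up
later in this thread.)

# Dump the Block Limits VPD page (0xb0); its "Optimal transfer
# length" field is the block count that sd.c multiplies by the
# logical block size to produce optimal_io_size.
sg_inq --vpd --page=0xb0 /dev/sda
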
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-06 17:16 UTC
To: Jens Axboe, lkml, linux-scsi, Mike Snitzer, Martin K. Petersen

On 11/06/2014 10:47 AM, Chris Friesen wrote:
> Hi,
>
> I'm running a modified 3.4-stable kernel on relatively recent x86
> server-class hardware.
>
> I recently installed a Seagate ST900MM0026 (900GB 2.5in 10K SAS drive)
> and it's reporting a value of 4294966784 for optimal_io_size. The other
> parameters look normal, though:
>
> /sys/block/sda/queue/hw_sector_size:512
> /sys/block/sda/queue/logical_block_size:512
> /sys/block/sda/queue/max_segment_size:65536
> /sys/block/sda/queue/minimum_io_size:512
> /sys/block/sda/queue/optimal_io_size:4294966784

<snip>

> According to the manual, the ST900MM0026 has a 512-byte physical sector
> size.
>
> Is this a drive firmware bug? Or a bug in the SAS driver? Or is there
> a valid reason for a single drive to report such a huge value?
>
> Would it make sense for the kernel to do some sort of sanity checking
> on this value?

Looks like this sort of thing has been seen before in other drives (one
of which is from the same family as my drive):

http://www.spinics.net/lists/linux-scsi/msg65292.html
http://iamlinux.technoyard.in/blog/why-is-my-ssd-disk-not-reconized-by-the-rhel6-anaconda-installer/

Perhaps the ST900MM0026 should be blacklisted as well? Or maybe the
SCSI code should do a variation on Mike Snitzer's original patch and
just ignore any values above some reasonable threshold? (And then we
could remove the blacklist on the ST900MM0006.)

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-06 17:34 UTC
To: Chris Friesen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer, Martin K. Petersen

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris> Perhaps the ST900MM0026 should be blacklisted as well?

Sure. I'll widen the net a bit for that Seagate model.

commit 17f1ee2d16a6878269c4429306f6e678b7e61505
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Thu Nov 6 12:31:43 2014 -0500

    SCSI: Blacklist ST900MM0026

    Looks like this entire series of drives reports the wrong values in
    the block limits VPD. Widen the blacklist.

    Reported-by: Chris Friesen <chris.friesen@windriver.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 49014a143c6a..9116531b415a 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -229,7 +229,7 @@ static struct {
 	{"SanDisk", "ImageMate CF-SD1", NULL, BLIST_FORCELUN},
 	{"SEAGATE", "ST34555N", "0930", BLIST_NOTQ},	/* Chokes on tagged INQUIRY */
 	{"SEAGATE", "ST3390N", "9546", BLIST_NOTQ},
-	{"SEAGATE", "ST900MM0006", NULL, BLIST_SKIP_VPD_PAGES},
+	{"SEAGATE", "ST900MM", NULL, BLIST_SKIP_VPD_PAGES},
 	{"SGI", "RAID3", "*", BLIST_SPARSELUN},
 	{"SGI", "RAID5", "*", BLIST_SPARSELUN},
 	{"SGI", "TP9100", "*", BLIST_REPORTLUN2},

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-06 17:45 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/06/2014 11:34 AM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> Perhaps the ST900MM0026 should be blacklisted as well?
>
> Sure. I'll widen the net a bit for that Seagate model.

That'd work, but is it the best way to go? I mean, I found one report
of a similar problem on an SSD (model number unknown). In that case it
was a near-UINT_MAX value as well.

The problem with the blacklist is that until someone patches it, the
drive is broken. And then it stays blacklisted even if the firmware
gets fixed.

I'm wondering if it might not be better to just ignore all values
larger than X (where X is whatever we think is the largest conceivable
reasonable value).

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-06 18:12 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris> That'd work, but is it the best way to go? I mean, I found one
Chris> report of a similar problem on an SSD (model number unknown). In
Chris> that case it was a near-UINT_MAX value as well.

My concern is still the same. Namely that this particular drive happens
to be returning UINT_MAX, but it might as well be a value that's
entirely random. Or even a value that is small and innocuous-looking
but completely wrong.

Chris> The problem with the blacklist is that until someone patches it,
Chris> the drive is broken. And then it stays blacklisted even if the
Chris> firmware gets fixed.

Well, you can manually blacklist in /proc/scsi/device_info.

Chris> I'm wondering if it might not be better to just ignore all values
Chris> larger than X (where X is whatever we think is the largest
Chris> conceivable reasonable value).

The problem is that finding that value is not easy, and it too will be
a moving target. I'm willing to entertain the following, however...

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 95bfb7bfbb9d..75cc51a01860 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2593,7 +2593,8 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	blk_queue_io_min(sdkp->disk->queue,
 			 get_unaligned_be16(&buffer[6]) * sector_sz);
 	blk_queue_io_opt(sdkp->disk->queue,
-			 get_unaligned_be32(&buffer[12]) * sector_sz);
+			 min_t(u32, get_unaligned_be32(&buffer[12]),
+			       sdkp->capacity) * sector_sz);
 
 	if (buffer[3] == 0x3c) {
 		unsigned int lba_count, desc_count;

-- 
Martin K. Petersen	Oracle Linux Engineering

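(On the manual blacklist Martin mentions: /proc/scsi/device_info
entries take the form vendor:model:flags, with flags given as a numeric
BLIST mask. A sketch follows, with the loud caveat that 0x4000000 is
only an assumed value for BLIST_SKIP_VPD_PAGES — check
include/scsi/scsi_devinfo.h for your kernel before relying on it.)

# Hypothetical runtime quirk; the flag value below is an assumption,
# not verified against any particular kernel tree.
echo 'SEAGATE:ST900MM0026:0x4000000' > /proc/scsi/device_info
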
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Jens Axboe @ 2014-11-06 18:15 UTC
To: Martin K. Petersen, Chris Friesen
Cc: lkml, linux-scsi, Mike Snitzer

On 2014-11-06 11:12, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> That'd work, but is it the best way to go? I mean, I found one
> Chris> report of a similar problem on an SSD (model number unknown). In
> Chris> that case it was a near-UINT_MAX value as well.
>
> My concern is still the same. Namely that this particular drive happens
> to be returning UINT_MAX, but it might as well be a value that's
> entirely random. Or even a value that is small and innocuous-looking
> but completely wrong.
>
> Chris> The problem with the blacklist is that until someone patches it,
> Chris> the drive is broken. And then it stays blacklisted even if the
> Chris> firmware gets fixed.
>
> Well, you can manually blacklist in /proc/scsi/device_info.
>
> Chris> I'm wondering if it might not be better to just ignore all values
> Chris> larger than X (where X is whatever we think is the largest
> Chris> conceivable reasonable value).
>
> The problem is that finding that value is not easy, and it too will be
> a moving target.

Didn't check, but I'm assuming the value is the upper 24 bits of 32. If
so, it might not hurt to check for 0xfffffe00 as an invalid value.

-- 
Jens Axboe

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-06 19:14 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/06/2014 12:12 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> That'd work, but is it the best way to go? I mean, I found one
> Chris> report of a similar problem on an SSD (model number unknown).
> Chris> In that case it was a near-UINT_MAX value as well.
>
> My concern is still the same. Namely that this particular drive
> happens to be returning UINT_MAX, but it might as well be a value
> that's entirely random. Or even a value that is small and
> innocuous-looking but completely wrong.
>
> Chris> The problem with the blacklist is that until someone patches
> Chris> it, the drive is broken. And then it stays blacklisted even if
> Chris> the firmware gets fixed.
>
> Well, you can manually blacklist in /proc/scsi/device_info.
>
> Chris> I'm wondering if it might not be better to just ignore all
> Chris> values larger than X (where X is whatever we think is the
> Chris> largest conceivable reasonable value).
>
> The problem is that finding that value is not easy, and it too will
> be a moving target.

Do we need to be perfect, or just "good enough"?

For a RAID card I expect it would be related to chunk size or stripe
width or something...but even then I would expect to be able to cap it
at 100MB or so. Or are there storage systems on really fast interfaces
that could legitimately want a hundred meg of data at a time?

On 11/06/2014 12:15 PM, Jens Axboe wrote:
> Didn't check, but I'm assuming the value is the upper 24 bits of 32.
> If so, it might not hurt to check for 0xfffffe00 as an invalid value.

Yep, in all three wonky cases so far "optimal_io_size" ended up as
4294966784, which is 0xfffffe00. Does something mask out the lower
bits?

Chris

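(An aside on the "masking": the arithmetic suggests nothing masks
anything — it looks like plain 32-bit overflow. OPTIMAL TRANSFER LENGTH
is a 32-bit block count, and multiplying it by the 512-byte block size
in 32-bit unsigned arithmetic wraps a reported 0xffffffff to exactly
the value seen. A standalone demonstration, not the kernel code:)

#include <stdio.h>

int main(void)
{
	unsigned int blocks = 0xffffffffu; /* OPTIMAL TRANSFER LENGTH from the VPD */
	unsigned int sector_sz = 512;      /* logical block size */

	/* 0xffffffff * 512 == 0x1fffffffe00; truncated to 32 bits this
	 * is 0xfffffe00 == 4294966784, the sysfs value reported above. */
	printf("%u (0x%x)\n", blocks * sector_sz, blocks * sector_sz);
	return 0;
}
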
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 1:56 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris,

Chris> For a RAID card I expect it would be related to chunk size or
Chris> stripe width or something...but even then I would expect to be
Chris> able to cap it at 100MB or so. Or are there storage systems on
Chris> really fast interfaces that could legitimately want a hundred
Chris> meg of data at a time?

Well, there are several devices that report their capacity to indicate
that they don't suffer any performance (RMW) penalties for large
commands regardless of size. I would personally prefer them to report 0
in that case.

Chris> Yep, in all three wonky cases so far "optimal_io_size" ended up
Chris> as 4294966784, which is 0xfffffe00. Does something mask out the
Chris> lower bits?

Ignoring reported values of UINT_MAX and 0xfffffe00 only works until
the next spec-dyslexic firmware writer comes along.

I also think that singling out the OPTIMAL TRANSFER LENGTH is a bit of
a red herring. A vendor could mess up any value in that VPD and it
would still cause us grief. There's no rational explanation for why OTL
would be more prone to being filled out incorrectly than any of the
other parameters in that page.

I do concur, though, that io_opt is problematic by virtue of being
32 bits and getting multiplied by the sector size. So things can easily
get out of whack for fdisk and friends (by comparison, the value we use
for io_min is only 16 bits).

I'm still partial to just blacklisting that entire Seagate family. We
don't have any details on the alleged SSD having the same problem. For
all we know it could be the same SAS disk drive and not an SSD at all.

If there are compelling arguments or other supporting data for sanity
checking OTL, I'd suggest the following patch that caps it at 1GB. I
know of a few devices that prefer alignment at that granularity.

-- 
Martin K. Petersen	Oracle Linux Engineering

commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Thu Nov 6 12:31:43 2014 -0500

    [SCSI] sd: Sanity check the optimal I/O size

    We have come across a couple of devices that report crackpot values
    in the optimal I/O size in the Block Limits VPD page. Since this is
    a 32-bit entity that gets multiplied by the logical block size we
    can get disproportionately large values reported to the block
    layer.

    Cap io_opt at 1 GB.

    Reported-by: Chris Friesen <chris.friesen@windriver.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Cc: stable@vger.kernel.org

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b041eca8955d..806e06c2575f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2591,7 +2591,8 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	blk_queue_io_min(sdkp->disk->queue,
 			 get_unaligned_be16(&buffer[6]) * sector_sz);
 	blk_queue_io_opt(sdkp->disk->queue,
-			 get_unaligned_be32(&buffer[12]) * sector_sz);
+			 min_t(unsigned int, SD_MAX_IO_OPT_BYTES,
+			       get_unaligned_be32(&buffer[12]) * sector_sz));
 
 	if (buffer[3] == 0x3c) {
 		unsigned int lba_count, desc_count;

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 63ba5ca7f9a1..3492779d9d3e 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -44,10 +44,11 @@ enum {
 };
 
 enum {
-	SD_DEF_XFER_BLOCKS = 0xffff,
-	SD_MAX_XFER_BLOCKS = 0xffffffff,
-	SD_MAX_WS10_BLOCKS = 0xffff,
-	SD_MAX_WS16_BLOCKS = 0x7fffff,
+	SD_DEF_XFER_BLOCKS  = 0xffff,
+	SD_MAX_XFER_BLOCKS  = 0xffffffff,
+	SD_MAX_WS10_BLOCKS  = 0xffff,
+	SD_MAX_WS16_BLOCKS  = 0x7fffff,
+	SD_MAX_IO_OPT_BYTES = 1024 * 1024 * 1024,
 };
 
 enum {

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-07 5:35 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/06/2014 07:56 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> For a RAID card I expect it would be related to chunk size or
> Chris> stripe width or something...but even then I would expect to be
> Chris> able to cap it at 100MB or so. Or are there storage systems on
> Chris> really fast interfaces that could legitimately want a hundred
> Chris> meg of data at a time?
>
> Well, there are several devices that report their capacity to indicate
> that they don't suffer any performance (RMW) penalties for large
> commands regardless of size. I would personally prefer them to report
> 0 in that case.

I got curious and looked at the spec at
http://www.13thmonkey.org/documentation/SCSI/sbc3r25.pdf. I'm now
wondering if maybe Linux is misbehaving. I think there is actually some
justification for putting a huge value in the "optimal transfer length"
field.

That field is described as "the optimal transfer length in blocks for a
single...command", but then later it has "If a device server receives a
request with a transfer length exceeding this value, then a significant
delay in processing the request may be incurred." As written, it is
ambiguous.

Looking at ftp://ftp.t10.org/t10/document.03/03-028r2.pdf, it appears
that originally that field was the "optimal maximum transfer length",
not the "optimal transfer length". It appears that the intent was that
the device was able to take requests up to the "maximum transfer
length", but there would be a performance penalty if you went over the
"optimum maximum transfer length".

Section E.4 in sbc3r25.pdf talks about optimizing transfers. It
suggests using a transfer length that is a multiple of "optimal
transfer length granularity", up to a max of either the maximum or the
optimal transfer length, depending on the size of the penalty for
exceeding the optimal transfer length. This reinforces the idea that
the "optimal transfer length" is actually the optimal *maximum* length,
but any multiple of the optimal granularity is fine.

Based on that, I think it would have been clearer if it had been called
"/sys/block/sdb/queue/optimal_max_io_size".

Also, I think it's wrong for filesystems and userspace to use it for
alignment. In E.4 and E.5 in the sbc3r25.pdf doc, it looks like they
use the optimal granularity field for alignment, not the optimal
transfer length.

So for the ST900MM0006, it had:

# sg_inq --vpd --page=0xb0 /dev/sdb
VPD INQUIRY: Block limits page (SBC)
  Optimal transfer length granularity: 1 blocks
  Maximum transfer length: 0 blocks
  Optimal transfer length: 4294967295 blocks

In this case I think the drive is trying to say that it doesn't require
any special granularity (it can handle alignment on 512-byte blocks),
and that it can handle any size of transfer without performance
penalty.

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Dale R. Worley @ 2014-11-07 15:18 UTC
To: Chris Friesen
Cc: martin.petersen, axboe, linux-kernel, linux-scsi, snitzer

> From: Chris Friesen <chris.friesen@windriver.com>
>
> Also, I think it's wrong for filesystems and userspace to use it for
> alignment. In E.4 and E.5 in the sbc3r25.pdf doc, it looks like they
> use the optimal granularity field for alignment, not the optimal
> transfer length.

Everything you say suggests that "optimal transfer length" means "there
is a penalty for doing transfers *larger* than this", but people have
been treating it as "there is a penalty for doing transfers *smaller*
than this". But the latter is the "optimal transfer length
granularity".

Dale

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 16:25 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris,

Chris> Also, I think it's wrong for filesystems and userspace to use it
Chris> for alignment. In E.4 and E.5 in the sbc3r25.pdf doc, it looks
Chris> like they use the optimal granularity field for alignment, not
Chris> the optimal transfer length.

The original rationale behind the OTLG and OTL values was to be able to
express stripe chunk size and stripe width, and to encourage aligned,
full-stripe writes but nothing bigger than that. Obviously the wording
went through the usual standards-body process to become vague/generic
enough to be used for anything. It has changed several times since
sbc3r25, btw.

The kernel really isn't using io_opt. The value is merely stacked and
communicated to userspace. The reason the partitioning tools blow up
with weird values is that they try to align partition beginnings to the
stripe width. Which is the right thing to do as far as I'm concerned.

I have worked with many, many partners in the storage industry to make
sure they report sensible values in the Block Limits VPD. I have no
reason to believe that the SAS drive issue in question is anything but
a simple typo. I know there was a bug open with Seagate. I assume it
has been fixed in their latest firmware. To my knowledge it is not a
problem in any of their other drive models. It certainly isn't in any
of the ones we are shipping.

The unfortunate thing with disk drives is that firmware updates are
much harder to deal with. And you rarely end up having access to an
updated firmware unless your drive was procured through a vendor like
Dell, HP or Oracle. That's why I originally opted to quirk this model
in Linux. Otherwise I would just have said "update your firmware".

If we had devices from many different vendors showing up with values
that constantly threw off our tooling I would have more reason to be
concerned. But we haven't. And this code has been in the kernel since
2.6.32 or so.

-- 
Martin K. Petersen	Oracle Linux Engineering

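(To make the point about partitioning tools concrete, here is a rough,
illustrative sketch of the kind of alignment decision they make from
these sysfs values — a simplification, not lifted from fdisk or parted:)

# Illustrative only: derive a partition-start alignment from the
# queue limits.  The paths are real sysfs attributes; the policy
# shown is an assumption about tool behavior, not their actual code.
opt=$(cat /sys/block/sdb/queue/optimal_io_size)
min=$(cat /sys/block/sdb/queue/minimum_io_size)
lbs=$(cat /sys/block/sdb/queue/logical_block_size)

# Prefer the reported stripe width (io_opt) when nonzero, otherwise
# fall back to io_min; convert bytes to logical blocks.
align_bytes=$(( opt > 0 ? opt : min ))
echo "align partition start to a multiple of $(( align_bytes / lbs )) sectors"

A bogus io_opt of 4294966784 bytes would demand alignment to 8388607
sectors here, which is exactly the kind of value that throws fdisk and
friends off.
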
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 17:42 UTC
To: Martin K. Petersen
Cc: Chris Friesen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:

Martin> I know there was a bug open with Seagate. I assume it has been
Martin> fixed in their latest firmware.

Seagate confirms that this issue was fixed about a year ago. Will
provide more data when I have it.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-07 17:51 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/07/2014 11:42 AM, Martin K. Petersen wrote:
>>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
>
> Martin> I know there was a bug open with Seagate. I assume it has been
> Martin> fixed in their latest firmware.
>
> Seagate confirms that this issue was fixed about a year ago. Will
> provide more data when I have it.

Okay, thanks for the clarification (for this and the spec itself).

Apparently there's a new firmware available, dated Oct 13, but with no
release notes. We just tried updating the firmware on one of the drives
in question and it failed from two different versions of Linux, while
Windows won't install because, apparently, it doesn't like our SSD.
Joy.

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 18:03 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris> Apparently there's a new firmware available, dated Oct 13, but
Chris> with no release notes. We just tried updating the firmware on
Chris> one of the drives in question and it failed from two different
Chris> versions of Linux,

Did you use sg_write_buffer or some special firmware update tool?

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-07 18:48 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/07/2014 10:25 AM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> Also, I think it's wrong for filesystems and userspace to use
> Chris> it for alignment. In E.4 and E.5 in the sbc3r25.pdf doc, it
> Chris> looks like they use the optimal granularity field for
> Chris> alignment, not the optimal transfer length.
>
> The original rationale behind the OTLG and OTL values was to be able
> to express stripe chunk size and stripe width, and to encourage
> aligned, full-stripe writes but nothing bigger than that. Obviously
> the wording went through the usual standards-body process to become
> vague/generic enough to be used for anything. It has changed several
> times since sbc3r25, btw.

You've obviously been involved in this area a lot more closely than me,
so I'll defer to your experience. :)

I think that if that's the intended use case, then the spec wording
could be improved. Looking at sbc3r36.pdf, it still only explicitly
mentions performance penalties for transfers that are larger than the
"optimal transfer length", not for transfers that are smaller.

On 11/07/2014 12:03 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris> Apparently there's a new firmware available, dated Oct 13, but
> Chris> with no release notes. We just tried updating the firmware on
> Chris> one of the drives in question and it failed from two different
> Chris> versions of Linux,
>
> Did you use sg_write_buffer or some special firmware update tool?

Both. I didn't do it myself, but the guy who did sent me the following:

localhost:~$ ./dl_sea_fw-0.2.3_64 -m ST900MM0026 -d /dev/sda -f Lightningbug10K6-SED-0003.LOD
================================================================================
Seagate Firmware Download Utility v0.2.3  Build Date: Jan 9 2013
Copyright (c) 2012 Seagate Technology LLC, All Rights Reserved
Fri Nov 7 14:51:21 2014
================================================================================
Downloading file Lightningbug10K6-SED-0003.LOD to /dev/sda
send_io: Input/output error
send_io: Input/output error
! FW Download FAILED

This log is from a different system running Debian:

root@bricklane-2:/home/cgcs# sg_write_buffer -vvv --in=Lightningbug10K6-SED-0003.LOD --length=1752576 --mode=5 /dev/sdb
open /dev/sdb with flags=0x802
sending single write buffer, mode=0x5, mpsec=0, id=0, offset=0, len=1752576
Write buffer cmd: 3b 05 00 00 00 00 1a be 00 00
Write buffer parameter list (first 256 bytes):
e7 1a 0e 59 01 00 02 00 00 00 00 00 00 00 19 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 be 1a 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 d5 cd
00 00 00 00 00 00 00 00 00 00 00 00 00 00 07 00
80 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 bc 1a 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 5f 42 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ioctl(SG_IO v3) failed: Invalid argument (errno=22)
write buffer: pass through os error: Invalid argument
Write buffer failed: Sense category: -1, try '-v' option for more information

Apparently the "hdparm -I" command is giving bogus data as well. I've
seen that happen if the drive is on a RAID controller--I assume that
could cause problems with firmware updates too?

Chris

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 19:17 UTC
To: Chris Friesen
Cc: Martin K. Petersen, Jens Axboe, lkml, linux-scsi, Mike Snitzer

>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:

Chris> Apparently the "hdparm -I" command is giving bogus data as well.
Chris> I've seen that happen if the drive is on a RAID controller--I
Chris> assume that could cause problems with firmware updates too?

I'd suggest trying /dev/sgN instead. But yes, some RAID controllers
require you to use their tooling and won't allow direct passthrough.

-- 
Martin K. Petersen	Oracle Linux Engineering

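(Concretely, that means pointing the same sg_write_buffer invocation at
the SCSI generic node rather than the block node — /dev/sg1 below is
only a stand-in for whatever sg node the drive maps to, e.g. as listed
by lsscsi -g:)

# Same download as before, but via the sg passthrough node; the
# device node here is hypothetical, check lsscsi -g for yours.
sg_write_buffer --in=Lightningbug10K6-SED-0003.LOD --length=1752576 \
                --mode=5 /dev/sg1
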
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Chris Friesen @ 2014-11-07 21:04 UTC
To: Martin K. Petersen
Cc: Jens Axboe, lkml, linux-scsi, Mike Snitzer

On 11/07/2014 01:17 PM, Martin K. Petersen wrote:
> I'd suggest trying /dev/sgN instead.

That seems to work. Much appreciated. And it's now showing an
"optimal_io_size" of 0, so I think the issue is dealt with.

Thanks for all the help, it's been educational. :)

Chris

* RE: absurdly high "optimal_io_size" on Seagate SAS disk

From: Elliott, Robert (Server Storage) @ 2014-11-07 17:10 UTC
To: Martin K. Petersen, Chris Friesen
Cc: Jens Axboe, lkml, linux-scsi@vger.kernel.org, Mike Snitzer

> commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
> Author: Martin K. Petersen <martin.petersen@oracle.com>
> Date:   Thu Nov 6 12:31:43 2014 -0500
>
>     [SCSI] sd: Sanity check the optimal I/O size
>
>     We have come across a couple of devices that report crackpot
>     values in the optimal I/O size in the Block Limits VPD page.
>     Since this is a 32-bit entity that gets multiplied by the
>     logical block size we can get disproportionately large values
>     reported to the block layer.
>
>     Cap io_opt at 1 GB.

Another reasonable cap is the maximum transfer size. There are lots of
them (a sysfs sketch for querying the kernel-side ones follows this
message):

* the block layer BIO_MAX_PAGES value of 256 limits I/Os to a maximum
  of 1 MiB
* SCSI LLDs report their maximum transfer size in
  /sys/block/sdNN/queue/max_hw_sectors_kb
* the SCSI midlayer maximum transfer size is set/reported in
  /sys/block/sdNN/queue/max_sectors_kb, and the default is 512 KiB
* the SCSI LLD maximum number of scatter-gather entries reported in
  /sys/block/sdNN/queue/max_segments and
  /sys/block/sdNN/queue/max_segment_size creates a limit based on how
  fragmented the data buffer is in virtual memory
* the Block Limits VPD page MAXIMUM TRANSFER LENGTH field indicates
  the maximum transfer size for one command over the SCSI transport
  protocol supported by the drive itself

It is risky to use transfer sizes larger than Linux and Windows can
generate, since drives are probably tested in those environments.

---
Rob Elliott    HP Server Storage

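(All of the kernel-side limits Rob lists can be read straight out of
sysfs — a small sketch, with sda standing in for the disk under test:)

# Print each queue limit mentioned above for one device.
for f in max_hw_sectors_kb max_sectors_kb max_segments max_segment_size; do
	printf '%-20s ' "$f"
	cat /sys/block/sda/queue/$f
done
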
* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Martin K. Petersen @ 2014-11-07 17:40 UTC
To: Elliott, Robert (Server Storage)
Cc: Martin K. Petersen, Chris Friesen, Jens Axboe, lkml, linux-scsi@vger.kernel.org, Mike Snitzer

>>>>> "Rob" == Elliott, Robert (Server Storage) <Elliott@hp.com> writes:

Rob,

Rob> * the block layer BIO_MAX_PAGES value of 256 limits I/Os to a
Rob>   maximum of 1 MiB

We do support scatterlist chaining, though.

Rob> * SCSI LLDs report their maximum transfer size in
Rob>   /sys/block/sdNN/queue/max_hw_sectors_kb
Rob> * the SCSI midlayer maximum transfer size is set/reported in
Rob>   /sys/block/sdNN/queue/max_sectors_kb, and the default is 512 KiB
Rob> * the SCSI LLD maximum number of scatter-gather entries reported
Rob>   in /sys/block/sdNN/queue/max_segments and
Rob>   /sys/block/sdNN/queue/max_segment_size creates a limit based on
Rob>   how fragmented the data buffer is in virtual memory
Rob> * the Block Limits VPD page MAXIMUM TRANSFER LENGTH field
Rob>   indicates the maximum transfer size for one command over the
Rob>   SCSI transport protocol supported by the drive itself

Yep. We're already capping the actual max I/O size based on all of the
above. However, the purpose of exposing io_opt was to be able to report
stripe size to partitioning tools and filesystems for alignment
purposes. And although they would ideally be the same, it was always
anticipated that stripe size could be bigger than the max I/O size.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: absurdly high "optimal_io_size" on Seagate SAS disk

From: Douglas Gilbert @ 2014-11-07 20:15 UTC
To: Elliott, Robert (Server Storage), Martin K. Petersen, Chris Friesen
Cc: Jens Axboe, lkml, linux-scsi@vger.kernel.org, Mike Snitzer

On 14-11-07 12:10 PM, Elliott, Robert (Server Storage) wrote:
>> commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
>> Author: Martin K. Petersen <martin.petersen@oracle.com>
>> Date:   Thu Nov 6 12:31:43 2014 -0500
>>
>>     [SCSI] sd: Sanity check the optimal I/O size
>>
>>     We have come across a couple of devices that report crackpot
>>     values in the optimal I/O size in the Block Limits VPD page.
>>     Since this is a 32-bit entity that gets multiplied by the
>>     logical block size we can get disproportionately large values
>>     reported to the block layer.
>>
>>     Cap io_opt at 1 GB.
>
> Another reasonable cap is the maximum transfer size. There are lots
> of them:
>
> * the block layer BIO_MAX_PAGES value of 256 limits I/Os to a maximum
>   of 1 MiB
> * SCSI LLDs report their maximum transfer size in
>   /sys/block/sdNN/queue/max_hw_sectors_kb
> * the SCSI midlayer maximum transfer size is set/reported in
>   /sys/block/sdNN/queue/max_sectors_kb, and the default is 512 KiB
> * the SCSI LLD maximum number of scatter-gather entries reported in
>   /sys/block/sdNN/queue/max_segments and
>   /sys/block/sdNN/queue/max_segment_size creates a limit based on how
>   fragmented the data buffer is in virtual memory
> * the Block Limits VPD page MAXIMUM TRANSFER LENGTH field indicates
>   the maximum transfer size for one command over the SCSI transport
>   protocol supported by the drive itself
>
> It is risky to use transfer sizes larger than Linux and Windows can
> generate, since drives are probably tested in those environments.

After being burnt by a (virtual) SCSI disk recently, my utilities now
take a more aggressive approach to the data-in buffer received from
INQUIRY, MODE SENSE and LOG SENSE (and probably should add a few more):
at a low level, after the command is completed, the data-in buffer is
post-filled with zeros following the last valid byte as indicated by
resid, until the end of that buffer. Then it is passed back for
higher-level processing of the command, including its data-in buffer.

Pre-filling the data-in buffer with zeros has been in place for a long
time, but I don't think it helps much.

So if there are any HBA drivers that set resid higher than it should
be, expect some pain soon.

Doug Gilbert

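(A minimal sketch of the post-fill Doug describes — the function name
and calling convention here are invented for illustration; the real
logic lives in his utilities:)

#include <string.h>

/* Zero a data-in buffer past its last valid byte, where "resid" is
 * the residual byte count the HBA reported for the command.  A
 * nonsensical resid leaves the buffer untouched. */
static void zero_din_tail(unsigned char *din, int din_len, int resid)
{
	if (din == NULL || resid <= 0 || resid > din_len)
		return;
	memset(din + (din_len - resid), 0, (size_t)resid);
}
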