linux-kernel.vger.kernel.org archive mirror
From: Chris Friesen <chris.friesen@windriver.com>
To: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>, lkml <linux-kernel@vger.kernel.org>,
	<linux-scsi@vger.kernel.org>, Mike Snitzer <snitzer@redhat.com>
Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk
Date: Thu, 6 Nov 2014 23:35:23 -0600
Message-ID: <545C5A1B.9020206@windriver.com>
In-Reply-To: <yq1zjc3ai4h.fsf@sermon.lab.mkp.net>

On 11/06/2014 07:56 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <chris.friesen@windriver.com> writes:
>
> Chris,
>
> Chris> For a RAID card I expect it would be related to chunk size or
> Chris> stripe width or something...but even then I would expect to be
> Chris> able to cap it at 100MB or so.  Or are there storage systems on
> Chris> really fast interfaces that could legitimately want a hundred meg
> Chris> of data at a time?
>
> Well, there are several devices that report their capacity to indicate
> that they don't suffer any performance (RMW) penalties for large
> commands regardless of size. I would personally prefer them to report 0
> in that case.

I got curious and looked at the spec at
"http://www.13thmonkey.org/documentation/SCSI/sbc3r25.pdf".  I'm now
wondering if maybe Linux is misbehaving.

I think there is actually some justification for putting a huge value in 
the "optimal transfer length" field.  That field is described as "the 
optimal transfer length in blocks for a single...command", but then 
later it has "If a device server receives a request with a transfer 
length exceeding this value, then a significant delay in processing the 
request may be incurred."  As written, it is ambiguous.

Looking at "ftp://ftp.t10.org/t10/document.03/03-028r2.pdf" it appears 
that originally that field was the "optimal maximum transfer length", 
not the "optimal transfer length".  It appears that the intent was that 
the device was able to take requests up to the "maximum transfer 
length", but there would be a performance penalty if you went over the 
"optimal maximum transfer length".

Section E.4 in "sbc3r25.pdf" talks about optimizing transfers.  It 
suggests using a transfer length that is a multiple of the "optimal 
transfer length granularity", capped at either the maximum transfer 
length or the optimal transfer length, depending on how large the 
penalty is for exceeding the optimal transfer length.  This reinforces 
the idea that the "optimal transfer length" is actually the optimal 
*maximum* length, but any multiple of the optimal granularity is fine.
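
To make that reading concrete, here is a rough sketch of how an initiator 
might pick a transfer length under it.  This is only my interpretation of 
E.4, not code from the spec or from the kernel.  The function name, the 
"avoid_penalty" switch, and treating a reported value of 0 as "no limit 
reported" are all assumptions of the sketch.

```python
def pick_transfer_length(requested, granularity, optimal, maximum,
                         avoid_penalty=True):
    """Return a transfer length in logical blocks (hypothetical helper).

    Assumption: a value of 0 for `optimal` or `maximum` means the device
    did not report that limit, so it is ignored.
    """
    length = requested
    # Never exceed the hard limit, if one is reported.
    if maximum:
        length = min(length, maximum)
    # Stay at or below the optimal length only if the penalty matters.
    if avoid_penalty and optimal:
        length = min(length, optimal)
    # Round down to a multiple of the granularity where that is possible.
    if granularity > 1 and length >= granularity:
        length -= length % granularity
    return length


# Granularity 8, optimal 256, maximum 1024, caller wants 1000 blocks.
print(pick_transfer_length(1000, 8, 256, 1024))                       # 256
print(pick_transfer_length(1000, 8, 256, 1024, avoid_penalty=False))  # 1000
```

Under this interpretation the optimal transfer length only ever acts as a 
cap, and the granularity is the only value used for rounding.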

Based on that, I think it would have been clearer if it had been called 
"/sys/block/sdb/queue/optimal_max_io_size".

Also, I think it's wrong for filesystems and userspace to use it for 
alignment.  In E.4 and E.5 in the "sbc3r25.pdf" doc, it looks like they 
use the optimal granularity field for alignment, not the optimal 
transfer length.


So for the ST900MM0006, it had:

# sg_inq --vpd --page=0xb0 /dev/sdb
VPD INQUIRY: Block limits page (SBC)
   Optimal transfer length granularity: 1 blocks
   Maximum transfer length: 0 blocks
   Optimal transfer length: 4294967295 blocks

In this case I think the drive is trying to say that it doesn't require 
any special granularity (can handle alignment on 512-byte blocks), and 
that it can handle any size of transfer without performance penalty.
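
For reference, here is a small sketch that turns the sg_inq output above 
into byte counts, assuming 512-byte logical blocks.  The parsing helper is 
hypothetical and only handles the three lines shown.  The last step is an 
assumption on my part: it shows what the number would look like if the byte 
count were stored in a 32-bit unsigned field, which I have not confirmed 
here.

```python
import re

BLOCK_SIZE = 512  # bytes per logical block (assumed, see above)

SG_INQ_OUTPUT = """\
VPD INQUIRY: Block limits page (SBC)
   Optimal transfer length granularity: 1 blocks
   Maximum transfer length: 0 blocks
   Optimal transfer length: 4294967295 blocks
"""


def parse_block_limits(text):
    """Collect every 'name: N blocks' line into a dict of integers."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?):\s+(\d+) blocks\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


limits = parse_block_limits(SG_INQ_OUTPUT)
opt_bytes = limits["Optimal transfer length"] * BLOCK_SIZE

# Assumption: the byte count is truncated to 32 bits somewhere.
wrapped = opt_bytes % 2**32

print(opt_bytes)  # 2199023255040
print(wrapped)    # 4294966784
```

Either way the number is far too large to be meaningful as an alignment 
hint, which is consistent with reading it as "no upper bound".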

Chris
