* SCSI target and IO-throttling
From: Vladislav Bolkhovitin @ 2006-03-02 16:21 UTC
To: linux-scsi
Hello
Could anyone advise how a SCSI target device can IO-throttle its
initiators, i.e. prevent them from queuing too many commands, please?
I suppose the best way to do this is to inform the initiators of the
target device's maximum queue depth X, so that no initiator sends more
than X commands. But I have not found anything like that in the INQUIRY
or MODE SENSE pages. Have I missed something?
Just returning QUEUE FULL status doesn't look correct, because it can
lead to out-of-order command execution.
Apparently, hardware SCSI targets don't suffer from queue overflow and
don't return QUEUE FULL status all the time, so there must be a way to
do the throttling more elegantly.
Regards,
Vlad
* Re: SCSI target and IO-throttling
From: Steve Byan @ 2006-03-03 18:07 UTC
To: Vladislav Bolkhovitin; +Cc: linux-scsi
On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
> Could anyone advise how a SCSI target device can IO-throttle its
> initiators, i.e. prevent them from queuing too many commands, please?
>
> I suppose the best way to do this is to inform the initiators of the
> target device's maximum queue depth X, so that no initiator sends
> more than X commands. But I have not found anything like that in the
> INQUIRY or MODE SENSE pages. Have I missed something? Just returning
> QUEUE FULL status doesn't look correct, because it can lead to
> out-of-order command execution.
Returning QUEUE FULL status is correct, unless the initiator does not
have any pending commands on the LUN, in which case you should return
BUSY. Yes, this can lead to out-of-order execution. That's why tapes
have traditionally not used SCSI command queuing.
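To make that rule concrete, here is a minimal sketch in C (the helper
name is invented; the status values are the SAM-defined codes, the same
ones Linux's <scsi/scsi.h> calls SAM_STAT_BUSY and SAM_STAT_TASK_SET_FULL):

enum scsi_status {
	SAM_STAT_BUSY          = 0x08,
	SAM_STAT_TASK_SET_FULL = 0x28,
};

/* Pick the overflow status for a new command that cannot be queued.
 * TASK SET FULL tells an initiator with work already in the task set
 * to hold at its current depth; an initiator with nothing pending
 * gets BUSY instead. */
enum scsi_status queue_overflow_status(unsigned pending_from_initiator)
{
	return pending_from_initiator ? SAM_STAT_TASK_SET_FULL
	                              : SAM_STAT_BUSY;
}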
Look into the unit attention interlock feature added to SCSI as a
result of uncovering this issue during the development of the iSCSI
standard.
> Apparently, hardware SCSI targets don't suffer from queue overflow
> and don't return QUEUE FULL status all the time, so there must be a
> way to do the throttling more elegantly.
No, they just have big queues.
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
* Re: SCSI target and IO-throttling
From: Stefan Richter @ 2006-03-03 18:47 UTC
To: Steve Byan; +Cc: Vladislav Bolkhovitin, linux-scsi
Steve Byan wrote:
> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>> Could anyone advise how a SCSI target device can IO-throttle its
>> initiators, i.e. prevent them from queuing too many commands, please?
>>
>> I suppose the best way to do this is to inform the initiators of the
>> target device's maximum queue depth X,
[...]
> Returning QUEUE FULL status is correct, unless the initiator does not
> have any pending commands on the LUN, in which case you should return
> BUSY. Yes, this can lead to out-of-order execution. That's why tapes
> have traditionally not used SCSI command queuing.
>
> Look into the unit attention interlock feature added to SCSI as a
> result of uncovering this issue during the development of the iSCSI
> standard.
>
>> Apparently, hardware SCSI targets don't suffer from queue overflow
[...]
> No, they just have big queues.
Depending on the transport protocol, the problem of queue depth at
the target may not even exist in the first place. This is the case with
SBP-2, where the queue of command blocks resides at the initiator.
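Roughly, an SBP-2 command block ORB looks like this (layout simplified
and field names paraphrased; real ORBs are big-endian and carry more
control bits). The initiator chains these in its own memory and the
target fetches them by DMA at its own pace, so there is no target-side
queue to overflow:

#include <stdint.h>

struct sbp2_command_orb {
	uint64_t next_orb;          /* bus address of next ORB in the list */
	uint64_t data_descriptor;   /* where the data buffer (or page table) is */
	uint32_t misc;              /* notify, direction, speed, data_size, ... */
	uint8_t  command_block[12]; /* the encapsulated SCSI CDB */
};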
--
Stefan Richter
-=====-=-==- --== ---==
http://arcgraph.de/sr/
* Re: SCSI target and IO-throttling
From: Steve Byan @ 2006-03-03 20:24 UTC
To: Stefan Richter; +Cc: Vladislav Bolkhovitin, linux-scsi
On Mar 3, 2006, at 1:47 PM, Stefan Richter wrote:
> Steve Byan wrote:
>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>> Apparently, hardware SCSI targets don't suffer from queue overflow
> [...]
>> No, they just have big queues.
>
> Depending on the transport protocol, the problem of queue depth
> at the target may not even exist in the first place. This is the
> case with SBP-2, where the queue of command blocks resides at the
> initiator.
Yes, and that's a clever optimization in SBP-2 to support
resource-poor targets. Thanks for reminding us of it.
Too bad SATA drives didn't take advantage of the SATA first-party DMA
to implement SBP-2. The definition of the tag field for native
command queuing adopted by T13 essentially makes it infeasible to
revisit this decision.
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
* Re: SCSI target and IO-throttling
From: Bryan Henderson @ 2006-03-06 19:15 UTC
To: Steve Byan; +Cc: linux-scsi, Vladislav Bolkhovitin
>On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>
>> Could anyone advise how a SCSI target device can IO-throttle its
>> initiators, i.e. prevent them from queuing too many commands, please?
>>
>> I suppose the best way to do this is to inform the initiators of
>> the target device's maximum queue depth X, so that no initiator
>> sends more than X commands. But I have not found anything like that
>> in the INQUIRY or MODE SENSE pages. Have I missed something? Just
>> returning QUEUE FULL status doesn't look correct, because it can
>> lead to out-of-order command execution.
>
>Returning QUEUE FULL status is correct, unless the initiator does not
>have any pending commands on the LUN, in which case you should return
>BUSY. Yes, this can lead to out-of-order execution. That's why tapes
>have traditionally not used SCSI command queuing.
I'm confused; Vladislav appears to be asking about flow control such as
is built into iSCSI, wherein the iSCSI target tells the initiator how
many tasks it's willing to work on at once, and the initiator stops
sending new ones when it has hit that limit and waits for one of the
previous ones to finish. And the target can continuously change that
number.
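In iSCSI terms that window is carried as ExpCmdSN/MaxCmdSN in every
response PDU; here is a simplified sketch (real code must use the
32-bit serial-number arithmetic of RFC 3720, which the plain
comparison below glosses over):

struct cmd_window {
	unsigned exp_cmd_sn;   /* next CmdSN the target expects */
	unsigned max_cmd_sn;   /* highest CmdSN the target will accept */
};

/* The target recomputes and advertises the window on every response,
 * so it can widen or narrow the initiator's queue depth on the fly. */
int window_open(const struct cmd_window *w, unsigned next_cmd_sn)
{
	return next_cmd_sn <= w->max_cmd_sn; /* otherwise hold the command */
}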
With the more primitive transports, I believe this is a manual
configuration step -- the target has a fixed maximum queue depth and you
tell the driver via some configuration parameter what it is.
As I understand it, any system in which QUEUE FULL (that's another name
for SCSI's Task Set Full, isn't it?) errors happen is one that is not
properly configured. I saw a broken iSCSI system that had QUEUE FULLs
happening, and it was a performance disaster.
>> Apparently, hardware SCSI targets don't suffer from queue overflow
>> and don't return QUEUE FULL status all the time, so there must be a
>> way to do the throttling more elegantly.
>
>No, they just have big queues.
Big queues are another serious performance problem, when it means a target
accepts work faster than it can do it. I've seen that cause initiators to
send suboptimal requests (if the target appears to be working at infinite
speed, the initiator sends small chunks of work as soon as each is ready,
whereas if the initiator can tell that the target is choked, the initiator
combines and sorts work while it waits, into a stream the target can
handle more efficiently). When systems substitute an oversized queue in a
target for initiator-target flow control, the initiator ends up having to
compensate with artificial schemes to withhold work from a willing target
(e.g. Linux "queue plugging").
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: SCSI target and IO-throttling
From: Steve Byan @ 2006-03-06 19:55 UTC
To: Bryan Henderson; +Cc: linux-scsi, Vladislav Bolkhovitin
On Mar 6, 2006, at 2:15 PM, Bryan Henderson wrote:
>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>
>>> Could anyone advise how a SCSI target device can IO-throttle its
>>> initiators, i.e. prevent them from queuing too many commands,
>>> please?
>>>
>>> I suppose the best way to do this is to inform the initiators of
>>> the target device's maximum queue depth X, so that no initiator
>>> sends more than X commands. But I have not found anything like
>>> that in the INQUIRY or MODE SENSE pages. Have I missed something?
>>> Just returning QUEUE FULL status doesn't look correct, because it
>>> can lead to out-of-order command execution.
>>
>> Returning QUEUE FULL status is correct, unless the initiator does not
>> have any pending commands on the LUN, in which case you should return
>> BUSY. Yes, this can lead to out-of-order execution. That's why tapes
>> have traditionally not used SCSI command queuing.
>
> I'm confused; Vladislav appears to be asking about flow control such
> as is built into iSCSI, wherein the iSCSI target tells the initiator
> how many tasks it's willing to work on at once, and the initiator
> stops sending new ones when it has hit that limit and waits for one
> of the previous ones to finish. And the target can continuously
> change that number.
>
> With the more primitive transports,
Seems like a somewhat loaded description to me. Personally, I'd pick
something more neutral.
> I believe this is a manual configuration step -- the target has a
> fixed maximum queue depth and you tell the driver via some
> configuration parameter what it is.
Not true. Consider the case where multiple initiators share one
logical unit - there is no guarantee that a single initiator can
queue even a single command, since another initiator may have filled
the queue at the device.
Another case is a target that has multiple logical units; it is
conceivable that an implementation may share the device queue
resources among all logical units. In this case again, there is no
fixed number of commands that the target can guarantee to queue for a
logical unit.
> As I understand it, any system in which QUEUE FULL (that's another
> name for SCSI's Task Set Full, isn't it?)
Yes, you're correct. I should have written TASK SET FULL, which is
the correct name for the SCSI status value that we are discussing.
> errors happen is one that is not
> properly configured.
Absolutely untrue.
> I saw a broken iSCSI system that had QUEUE FULLs
> happening, and it was a performance disaster.
Was it a performance disaster because of the broken-ness, or solely
because of the TASK SET FULLs?
>>> Apparently, hardware SCSI targets don't suffer from queue
>>> overflow and don't return QUEUE FULL status all the time, so
>>> there must be a way to do the throttling more elegantly.
>>
>> No, they just have big queues.
>
> Big queues are another serious performance problem, when it means a
> target accepts work faster than it can do it. I've seen that cause
> initiators to send suboptimal requests (if the target appears to be
> working at infinite speed, the initiator sends small chunks of work
> as soon as each is ready, whereas if the initiator can tell that the
> target is choked, the initiator combines and sorts work while it
> waits, into a stream the target can handle more efficiently).
1) Considering only first-order effects, who cares whether the
initiator sends sub-optimal requests and the target coalesces them,
or if the initiator does the coalescing itself?
2) If you care about performance, you don't try to fill the device
queue; you just want to have enough outstanding so that the device
doesn't go idle when there is work to do.
The reason you do this has to do with the access scheduling algorithm
in the target more than anything else; brain-damaged marketing values
small average access times more than a small variance in access times,
so the device folks do crazy shortest-access-time-first scheduling
instead of something more sane and less prone to spreading out the
access-time distribution, like CSCAN.
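For contrast, a toy CSCAN pick in C: always sweep upward and wrap to
the lowest outstanding request, which bounds how long any one request
can starve (this is a sketch of the algorithm only, not taken from any
real drive firmware):

#include <limits.h>

long cscan_next(const long *lba, int n, long head)
{
	long ahead = LONG_MAX, lowest = LONG_MAX;
	for (int i = 0; i < n; i++) {
		if (lba[i] >= head && lba[i] < ahead)
			ahead = lba[i];  /* nearest request ahead of the head */
		if (lba[i] < lowest)
			lowest = lba[i]; /* wrap target for the next sweep */
	}
	return ahead != LONG_MAX ? ahead : lowest;
}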
> When systems substitute an oversized queue in a target for
> initiator-target flow control, the initiator ends up having to
> compensate with artificial schemes to withhold work from a willing
> target (e.g. Linux "queue plugging").
1) The SCSI architectural standard does not prescribe any method for
initiator-target flow control other than TASK SET FULL and BUSY.
There's nothing wrong with X-ON and X-OFF for flow control,
especially when you cannot deterministically calculate a window size.
2) Tell the device folks to switch from shortest-access-time-first
scheduling to something less aggressive like CSCAN, and then you
might be able to tolerate the device queuing better.
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin @ 2006-03-07 17:53 UTC
To: Steve Byan; +Cc: linux-scsi
Steve Byan wrote:
>
> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>
>> Could anyone advise how a SCSI target device can IO-throttle its
>> initiators, i.e. prevent them from queuing too many commands, please?
>>
>> I suppose the best way to do this is to inform the initiators of the
>> target device's maximum queue depth X, so that no initiator sends
>> more than X commands. But I have not found anything like that in the
>> INQUIRY or MODE SENSE pages. Have I missed something? Just returning
>> QUEUE FULL status doesn't look correct, because it can lead to
>> out-of-order command execution.
>
>
> Returning QUEUE FULL status is correct, unless the initiator does not
> have any pending commands on the LUN, in which case you should return
> BUSY. Yes, this can lead to out-of-order execution. That's why tapes
> have traditionally not used SCSI command queuing.
>
> Look into the unit attention interlock feature added to SCSI as a
> result of uncovering this issue during the development of the iSCSI
> standard.
>
>> Apparently, hardware SCSI targets don't suffer from queue overflow
>> and don't return QUEUE FULL status all the time, so there must be a
>> way to do the throttling more elegantly.
>
>
> No, they just have big queues.
Thanks for the reply!
Things are getting clearer for me now, but there are still a few things
that are not very clear to me. I hope they won't require too long
answers. I'm asking because we in the SCST project (a SCSI target
mid-level for Linux plus some target drivers,
http://scst.sourceforge.net) must emulate correct SCSI target device
behavior under any IO load, including extremely high ones.
- Can you estimate, please, how big the target's command queue should
be so that initiators never receive QUEUE FULL status? Consider the
case where the initiators are Linux-based and each has a separate and
independent queue.
- The queue could be so big that the last command in it cannot be
processed before the initiator's timeout; then, after the timeout is
hit, the initiator would start issuing ABORTs for the timed-out
command. Is that OK behavior, or rather a misconfiguration (of which
side, initiator or target)? Is the initiator in such a situation
supposed to reissue the command after the preceding ones finish, or to
behave in some other way? Apparently, ABORTs must hit performance to a
similar degree as too many QUEUE FULLs, if not more.
It seems we should set up a queue with virtually unlimited size on the
target and, if an initiator is dumb enough to queue so many commands
that there are timeouts, then it is its problem and duty to handle the
situation without performance loss. Does that look OK?
Thanks,
Vlad
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin @ 2006-03-07 17:56 UTC
To: Bryan Henderson; +Cc: Steve Byan, linux-scsi
Bryan Henderson wrote:
>>On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>
>>
>>>Could anyone advise how a SCSI target device can IO-throttle its
>>>initiators, i.e. prevent them from queuing too many commands, please?
>>>
>>>I suppose the best way to do this is to inform the initiators of the
>>>target device's maximum queue depth X, so that no initiator sends
>>>more than X commands. But I have not found anything like that in the
>>>INQUIRY or MODE SENSE pages. Have I missed something? Just returning
>>>QUEUE FULL status doesn't look correct, because it can lead to
>>>out-of-order command execution.
>>
>>Returning QUEUE FULL status is correct, unless the initiator does not
>>have any pending commands on the LUN, in which case you should return
>>BUSY. Yes, this can lead to out-of-order execution. That's why tapes
>>have traditionally not used SCSI command queuing.
>
>
> I'm confused; Vladislav appears to be asking about flow control such as
> is built into iSCSI, wherein the iSCSI target tells the initiator how
> many tasks it's willing to work on at once, and the initiator stops sending
> new ones when it has hit that limit and waits for one of the previous ones
> to finish. And the target can continuously change that number.
Yes, exactly.
> With the more primitive transports, I believe this is a manual
> configuration step -- the target has a fixed maximum queue depth and you
> tell the driver via some configuration parameter what it is.
We currently mostly deal with Fibre Channel, which seems to be a kind of
"more primitive transport" without explicit flow control. Actually, I'm
very surprised and can hardly believe that such an advanced and
expensive technology doesn't have something as basic as good flow
control. Although, precisely speaking, such flow control is located at
the level above the transport (this is true for iSCSI as well), so this
is a SCSI flaw, not an FC one.
> As I understand it, any system in which QUEUE FULL (that's another name
> for SCSI's Task Set Full, isn't it?) errors happen is one that is not
> properly configured. I saw a broken iSCSI system that had QUEUE FULLs
> happening, and it was a performance disaster.
That is what we observe too: too many QUEUE FULLs degrade performance
considerably.
>>>Apparently, hardware SCSI targets don't suffer from queue overflow
>>>and don't return QUEUE FULL status all the time, so there must be a
>>>way to do the throttling more elegantly.
>>
>>No, they just have big queues.
>
> Big queues are another serious performance problem, when it means a target
> accepts work faster than it can do it. I've seen that cause initiators to
> send suboptimal requests (if the target appears to be working at infinite
> speed, the initiator sends small chunks of work as soon as each is ready,
> whereas if the initiator can tell that the target is choked, the initiator
> combines and sorts work while it waits, into a stream the target can
> handle more efficiently). When systems substitute an oversized queue in a
> target for initiator-target flow control, the initiator ends up having to
> compensate with artificial schemes to withhold work from a willing target
> (e.g. Linux "queue plugging").
This is one reason why I don't like having an oversized queue on the
target. Another is initiator-side timeouts when the queue is so big
that it cannot be drained in time. I described it in the previous email.
Thanks,
Vlad
* Re: SCSI target and IO-throttling
From: Steve Byan @ 2006-03-07 18:19 UTC
To: Vladislav Bolkhovitin; +Cc: linux-scsi
On Mar 7, 2006, at 12:53 PM, Vladislav Bolkhovitin wrote:
> Steve Byan wrote:
>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>> Could anyone advise how a SCSI target device can IO-throttle its
>>> initiators, i.e. prevent them from queuing too many commands,
>>> please?
>>>
>>> I suppose the best way to do this is to inform the initiators of
>>> the target device's maximum queue depth X, so that no initiator
>>> sends more than X commands. But I have not found anything like
>>> that in the INQUIRY or MODE SENSE pages. Have I missed something?
>>> Just returning QUEUE FULL status doesn't look correct, because it
>>> can lead to out-of-order command execution.
>> Returning QUEUE FULL status is correct, unless the initiator does
>> not have any pending commands on the LUN, in which case you
>> should return BUSY. Yes, this can lead to out-of-order execution.
>> That's why tapes have traditionally not used SCSI command queuing.
>> Look into the unit attention interlock feature added to SCSI as a
>> result of uncovering this issue during the development of the
>> iSCSI standard.
>>> Apparently, hardware SCSI targets don't suffer from queue
>>> overflow and don't return QUEUE FULL status all the time, so
>>> there must be a way to do the throttling more elegantly.
>> No, they just have big queues.
>
> Thanks for the reply!
>
> Things are getting clearer for me now, but there are still a few
> things that are not very clear to me. I hope they won't require too
> long answers. I'm asking because we in the SCST project (a SCSI
> target mid-level for Linux plus some target drivers,
> http://scst.sourceforge.net) must emulate correct SCSI target device
> behavior under any IO load, including extremely high ones.
>
> - Can you estimate, please, how big the target's command queue
> should be so that initiators never receive QUEUE FULL status?
> Consider the case where the initiators are Linux-based and each has
> a separate and independent queue.
Do you have a per-target pool of resources for handling commands, or
are the pools per-logical unit?
I'm not sure you could size the queue so that TASK_SET_FULL is never
returned. Just accept the fact that the target must return
TASK_SET_FULL or BUSY sometimes.
As a data-point, some modern SCSI disks support queue depths in the
range of 128 to 256 commands.
> - The queue could be so big that the last command in it cannot be
> processed before the initiator's timeout; then, after the timeout
> is hit, the initiator would start issuing ABORTs for the timed-out
> command. Is that OK behavior?
Well, it's the behavior implied by the SCSI standard; that is, on a
timeout, the initiator should abort the command. If an initiator sets
its timeout to less than the queuing delay at the server, I wouldn't
call that "OK behavior", but it's not the target's fault, it's the
initiator's fault.
> Or rather a misconfiguration (of which side, initiator or target)? Is
> the initiator in such a situation supposed to reissue the command
> after the preceding ones finish, or to behave in some other way?
I think it's up to the class driver to decide whether to retry a
command after it times out.
> Apparently, ABORTs must hit performance to a similar degree
> as too many QUEUE FULLs, if not more.
Much worse, I would think.
> It seems we should set up a queue with virtually unlimited size on
> the target and, if an initiator is dumb enough to queue so many
> commands that there are timeouts, then it is its problem and duty
> to handle the situation without performance loss. Does that look OK?
I don't think you need to pick an unlimited size. Something on the
order of 128 to 512 commands should be sufficient. If you have
multiple logical units, you could probably combine them in a common
pool and somewhat reduce the number of command resources you allocate
per logical unit, on the theory that they'll not all be fully
utilized at the same time.
By the way, make sure you don't deadlock trying to obtain command
resources to return TASK_SET_FULL or BUSY to a command in the case
where the pool of command resources is exhausted. This is one of the
tricky bits.
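A sketch of that pooling scheme in C, with the deadlock caveat handled
by keeping the rejection path allocation-free (all names are
illustrative, not SCST code):

struct lu_queue    { unsigned reserved_free; }; /* slots guaranteed to this LU */
struct target_pool { unsigned shared_free;   }; /* overflow shared by all LUs */

/* Returns 1 if a command slot was claimed. On 0 the caller must reply
 * TASK SET FULL or BUSY using a preallocated, status-only response --
 * never by allocating from the pool that just ran dry. */
int cmd_slot_claim(struct lu_queue *lu, struct target_pool *t)
{
	if (lu->reserved_free) {
		lu->reserved_free--;
		return 1;
	}
	if (t->shared_free) {
		t->shared_free--;
		return 1;
	}
	return 0;
}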
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
* Re: SCSI target and IO-throttling
From: Steve Byan @ 2006-03-07 18:38 UTC
To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi
On Mar 7, 2006, at 12:56 PM, Vladislav Bolkhovitin wrote:
> Bryan Henderson wrote:
>>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>>
>>>
>>>> Could anyone advise how a SCSI target device can IO-throttle its
>>>> initiators, i.e. prevent them from queuing too many commands,
>>>> please?
>>>>
>>>> I suppose the best way to do this is to inform the initiators of
>>>> the target device's maximum queue depth X, so that no initiator
>>>> sends more than X commands. But I have not found anything like
>>>> that in the INQUIRY or MODE SENSE pages. Have I missed something?
>>>> Just returning QUEUE FULL status doesn't look correct, because it
>>>> can lead to out-of-order command execution.
>>>
>>> Returning QUEUE FULL status is correct, unless the initiator does
>>> not have any pending commands on the LUN, in which case you
>>> should return BUSY. Yes, this can lead to out-of-order execution.
>>> That's why tapes have traditionally not used SCSI command queuing.
>> I'm confused; Vladislav appears to be asking about flow control
>> such as is built into iSCSI, wherein the iSCSI target tells the
>> initiator how many tasks it's willing to work on at once, and the
>> initiator stops sending new ones when it has hit that limit and
>> waits for one of the previous ones to finish. And the target can
>> continuously change that number.
>
> Yes, exactly.
>
>> With the more primitive transports, I believe this is a manual
>> configuration step -- the target has a fixed maximum queue depth
>> and you tell the driver via some configuration parameter what it is.
>
> We currently mostly deal with Fibre Channel, which seems to be a
> kind of "more primitive transport" without explicit flow control.
> Actually, I'm very surprised and can hardly believe that such an
> advanced and expensive technology doesn't have something as basic
> as good flow control. Although, precisely speaking, such flow
> control is located at the level above the transport (this is true
> for iSCSI as well), so this is a SCSI flaw, not an FC one.
It has X-ON and X-OFF flow control. Not bad considering it was
designed in the early 1980's.
X-OFF is TASK_SET_FULL or BUSY.
X-ON is a command completing, or, if BUSY was received because the
initiator did not have any outstanding commands at the target, X-ON is
implied after a short time delay.
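From the initiator's side the same exchange looks like a simple ramp,
sketched below with invented names (the Linux midlayer does a
comparable depth ramp-down in scsi_track_queue_full()):

struct lun_throttle {
	unsigned depth;     /* commands we currently allow in flight */
	unsigned in_flight; /* commands actually outstanding */
};

void on_task_set_full(struct lun_throttle *t)  /* the X-OFF */
{
	if (t->in_flight > 1)
		t->depth = t->in_flight - 1; /* target just showed its limit */
}

void on_completion(struct lun_throttle *t)     /* the X-ON */
{
	t->in_flight--;
}

int may_issue(const struct lun_throttle *t)
{
	return t->in_flight < t->depth;
}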
Since an intelligently-designed initiator isn't going to dump every
command to the device anyway (after all, the person writing the
initiator driver wants to have some fun implementing I/O
optimizations too; can't let those target folk have all the fun :-),
the XON/XOFF flow control isn't often invoked.
>> As I understand it, any system in which QUEUE FULL (that's another
>> name for SCSI's Task Set Full, isn't it?) errors happen is one
>> that is not properly configured. I saw a broken iSCSI system that
>> had QUEUE FULLs happening, and it was a performance disaster.
>
> That is what we observe too: too many QUEUE FULLs degrade
> performance considerably.
Sounds like a broken initiator.
>
>>>> Apparently, hardware SCSI targets don't suffer from queue
>>>> overflow and don't return QUEUE FULL status all the time, so
>>>> there must be a way to do the throttling more elegantly.
>>>
>>> No, they just have big queues.
>> Big queues are another serious performance problem, when it means
>> a target accepts work faster than it can do it. I've seen that
>> cause initiators to send suboptimal requests (if the target
>> appears to be working at infinite speed, the initiator sends small
>> chunks of work as soon as each is ready, whereas if the initiator
>> can tell that the target is choked, the initiator combines and
>> sorts work while it waits, into a stream the target can handle
>> more efficiently). When systems substitute an oversized queue in
>> a target for initiator-target flow control, the initiator ends up
>> having to compensate with artificial schemes to withhold work from
>> a willing target (e.g. Linux "queue plugging").
>
> This is one reason why I don't like having an oversized queue on
> the target.
This is just a matter of taste as to whether you prefer the optimization
to be done on the initiator side or the target side. If you prefer it
to be done on the initiator side, then don't queue large amounts of
work at the target.
> Another is initiator-side timeouts when the queue is so big that
> it cannot be drained in time. I described it in the previous email.
This is just a bug in the initiator. It can observe the average
service time and it knows how many commands it has queued. If it sets
its timeout anywhere close to the product of those two numbers it is
buggy.
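That sanity rule, written out as a check an initiator could apply
(figures in the example are invented):

/* The timeout must clear the expected drain time of everything the
 * initiator has queued, with margin to spare. */
int timeout_is_sane(double timeout_s, unsigned queued,
                    double avg_service_s, double margin)
{
	return timeout_s > margin * queued * avg_service_s;
}
/* e.g. timeout_is_sane(30.0, 256, 0.008, 4.0): 30 s against an
 * ~8 s floor (4 * 256 * 8 ms), so a 30 s timeout is defensible. */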
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin @ 2006-03-07 18:46 UTC
To: Steve Byan; +Cc: linux-scsi
Steve Byan wrote:
>
> On Mar 7, 2006, at 12:53 PM, Vladislav Bolkhovitin wrote:
>
>> Steve Byan wrote:
>>
>>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>>
>>>> Could anyone advise how a SCSI target device can IO-throttle its
>>>> initiators, i.e. prevent them from queuing too many commands, please?
>>>>
>>>> I suppose the best way to do this is to inform the initiators of
>>>> the target device's maximum queue depth X, so that no initiator
>>>> sends more than X commands. But I have not found anything like
>>>> that in the INQUIRY or MODE SENSE pages. Have I missed something?
>>>> Just returning QUEUE FULL status doesn't look correct, because it
>>>> can lead to out-of-order command execution.
>>>
>>> Returning QUEUE FULL status is correct, unless the initiator does
>>> not have any pending commands on the LUN, in which case you should
>>> return BUSY. Yes, this can lead to out-of-order execution. That's
>>> why tapes have traditionally not used SCSI command queuing.
>>> Look into the unit attention interlock feature added to SCSI as a
>>> result of uncovering this issue during the development of the iSCSI
>>> standard.
>>>
>>>> Apparently, hardware SCSI targets don't suffer from queue
>>>> overflow and don't return QUEUE FULL status all the time, so
>>>> there must be a way to do the throttling more elegantly.
>>>
>>> No, they just have big queues.
>>
>>
>> Thanks for the reply!
>>
>> Things are getting clearer for me now, but there are still a few
>> things that are not very clear to me. I hope they won't require too
>> long answers. I'm asking because we in the SCST project (a SCSI
>> target mid-level for Linux plus some target drivers,
>> http://scst.sourceforge.net) must emulate correct SCSI target device
>> behavior under any IO load, including extremely high ones.
>>
>> - Can you estimate, please, how big the target's command queue
>> should be so that initiators never receive QUEUE FULL status?
>> Consider the case where the initiators are Linux-based and each has
>> a separate and independent queue.
>
>
> Do you have a per-target pool of resources for handling commands, or are
> the pools per-logical unit?
The most limited resource is the memory allocated for command buffers.
It is per-target. Other resources, like internal command structures,
are so small that they can be considered virtually unlimited. They are
also global, but accounting is done per (session (nexus), LU).
> I'm not sure you could size the queue so that TASK_SET_FULL is never
> returned. Just accept the fact that the target must return TASK_SET_FULL
> or BUSY sometimes.
We have a relatively cheap method of queuing commands without
allocating buffers for them. This way, millions of commands could be
queued on an average Linux box without problems. Only ABORTs and their
influence on performance worry me.
> As a data-point, some modern SCSI disks support queue depths in the
> range of 128 to 256 commands.
I was rather asking about the practical upper limit. From our
observations, a Linux initiator can easily send 128+ commands, but
usually sends fewer. It looks like it depends on its available memory.
I'd be interested to know the exact rule.
>> - The queue could be so big that the last command in it cannot be
>> processed before the initiator's timeout; then, after the timeout
>> is hit, the initiator would start issuing ABORTs for the timed-out
>> command. Is that OK behavior?
>
>
> Well, it's the behavior implied by the SCSI standard; that is, on a
> timeout, the initiator should abort the command. If an initiator sets
> its timeout to less than the queuing delay at the server, I wouldn't
> call that "OK behavior", but it's not the target's fault, it's the
> initiator's fault.
>
>> Or rather a misconfiguration (of which side, initiator or target)?
>> Is the initiator in such a situation supposed to reissue the command
>> after the preceding ones finish, or to behave in some other way?
>
>
> I think it's up to the class driver to decide whether to retry a
> command after it times out.
>
>> Apparently, ABORTs must hit performance to a similar degree as
>> too many QUEUE FULLs, if not more.
>
>
> Much worse, I would think.
>
>> It seems we should set up a queue with virtually unlimited size on
>> the target and, if an initiator is dumb enough to queue so many
>> commands that there are timeouts, then it is its problem and duty
>> to handle the situation without performance loss. Does that look OK?
>
>
> I don't think you need to pick an unlimited size. Something on the
> order of 128 to 512 commands should be sufficient. If you have multiple
> logical units, you could probably combine them in a common pool and
> somewhat reduce the number of command resources you allocate per
> logical unit, on the theory that they'll not all be fully utilized at
> the same time.
OK
> By the way, make sure you don't deadlock trying to obtain command
> resources to return TASK_SET_FULL or BUSY to a command in the case
> where the pool of command resources is exhausted. This is one of the
> tricky bits.
In our architecture there is no need to allocate any additional
resources to reply with TASK_SET_FULL or BUSY, so we have already taken
care of this.
Thanks,
Vlad
* Re: SCSI target and IO-throttling
From: Steve Byan @ 2006-03-07 19:00 UTC
To: Vladislav Bolkhovitin; +Cc: linux-scsi
On Mar 7, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:
> Steve Byan wrote:
>> As a data-point, some modern SCSI disks support queue depths in
>> the range of 128 to 256 commands.
>
> I was rather asking about the practical upper limit. From our
> observations, a Linux initiator can easily send 128+ commands, but
> usually sends fewer. It looks like it depends on its available
> memory. I'd be interested to know the exact rule.
I don't know the rule. Obviously, it could change over time, and be
different for different OS's.
Sounds to me like you might be trying to fix a busted initiator by
changing the target behavior.
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
* Re: SCSI target and IO-throttling
From: Bryan Henderson @ 2006-03-07 23:32 UTC
To: Steve Byan; +Cc: linux-scsi, Vladislav Bolkhovitin
>> With the more primitive transports,
>
>Seems like a somewhat loaded description to me. Personally, I'd pick
>something more neutral.
Unfortunately, it's exactly what I mean. I understand that some people
attach negative connotations to primitivity, but I can't let that get in
the way of clarity.
>> I believe this is a manual configuration step -- the target has a
>> fixed maximum queue depth and you tell the driver via some
>> configuration parameter what it is.
>
>Not true. Consider the case where multiple initiators share one
>logical unit - there is no guarantee that a single initiator can
>queue even a single command, since another initiator may have filled
>the queue at the device.
I'm not sure what it is that you're saying isn't true. You do give a good
explanation of why designers would want something more sophisticated than
this, but that doesn't mean every SCSI implementation actually is that
sophisticated. Are
you saying there are no SCSI targets so primitive that they have a fixed
maximum queue depth? That there are no systems where you manually set the
maximum requests-in-flight at the initiator in order to optimally drive
such targets?
>> I saw a broken iSCSI system that had QUEUE FULLs
>> happening, and it was a performance disaster.
>
>Was it a performance disaster because of the broken-ness, or solely
>because of the TASK SET FULLs?
Because of the broken-ness. Task Set Full is the symptom, not the
disease. I should add that in this system, there was no way to make it
perform optimally and also see Task Set Full regularly.
You mentioned in another email that FCP is designed to use Task Set Full
for normal flow control. I heard that before, but didn't believe it; I
thought FCP was more advanced than that. But I believe it now. So I was
wrong to say that Task Set Full happening means a system is misconfigured.
But it's still the case that if you can design a system in which Task Set
Full never happens, it will perform better than one in which it does.
iSCSI flow control and manual setting of queue sizes in initiators are
two ways people do that.
>1) Considering only first-order effects, who cares whether the
>initiator sends sub-optimal requests and the target coalesces them,
>or if the initiator does the coalescing itself?
I don't know what a first-order effect is, so this may be out of bounds,
but here's a reason to care: the initiator may have more resources
available to do the work than the target. We're talking here about a
saturated target (which, rather than admit it's overwhelmed, keeps
accepting new tasks).
But it's really the wrong question, because the more important question is
would you rather have the initiator do the coalescing or nobody? There
exist targets that are not capable of combining or ordering tasks, and
still accept large queues of them. These are the ones I have seen with
improperly large queues. A target that can actually make use of a large
backlog of work, on the other hand, is right to accept one.
I have seen people try to improve the performance of a storage system by
increasing the queue depth in a target such as this. They note that the
queue is always full, so it must need more queue space. But this degrades
performance, because on one of these first-in-first-out targets, the only
way to get peak capacity is to keep the queue full all the time so as to
create backpressure and cause the initiator to schedule the work.
Increasing the queue depth increases the chance that the initiator will
not have the backlog necessary to do that scheduling. The correct queue
depth on this kind of target is the number of requests the target can
process within the initiator's (and channel's) turnaround time.
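That sizing rule is essentially Little's law; here it is as a toy
computation in C, with invented figures:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double target_iops  = 200.0; /* requests the target completes per second */
	double turnaround_s = 0.02;  /* initiator + channel round-trip time */

	/* Enough backlog to cover one turnaround keeps the target busy;
	 * anything beyond that just hides the backpressure. */
	printf("useful queue depth ~ %.0f\n", ceil(target_iops * turnaround_s));
	return 0; /* prints: useful queue depth ~ 4 */
}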
>brain-damaged
>marketing values small average access times more than a small
>variance in access times, so the device folks do crazy shortest-
>access-time-first scheduling instead of something more sane and less
>prone to spreading out the access time distribution like CSCAN.
Since I'm talking about targets that don't do anything close to that
sophisticated with the stuff in their queue, this doesn't apply.
But I do have to point out that there are systems where throughput is
everything, and response time, including variability of it, is nothing. In
fact, the systems I work with are mostly that kind. For that kind of
system, you'd want the target to do that kind of scheduling.
>2) If you care about performance, you don't try to fill the device
>queue; you just want to have enough outstanding so that the device
>doesn't go idle when there is work to do.
Why would the queue have a greater capacity than what is needed when you
care about performance? Is there some non-performance reason to have a
giant queue?
I still think having a giant queue is not a solution to any flow control
(or, in the words of the original problem, I/O throttling) problem. I'm
even skeptical that there's any size you can make one that would avoid
queue full conditions. It would be like avoiding difficult memory
allocation algorithms by just having a whole lot of memory.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin @ 2006-03-08 15:35 UTC
To: Bryan Henderson; +Cc: Steve Byan, linux-scsi
Bryan Henderson wrote:
[...]
>>2) If you care about performance, you don't try to fill the device
>>queue; you just want to have enough outstanding so that the device
>>doesn't go idle when there is work to do.
>
>
> Why would the queue have a greater capacity than what is needed when you
> care about performance? Is there some non-performance reason to have a
> giant queue?
>
> I still think having a giant queue is not a solution to any flow control
> (or, in the words of the original problem, I/O throttling) problem. I'm
> even skeptical that there's any size you can make one that would avoid
> queue full conditions. It would be like avoiding difficult memory
> allocation algorithms by just having a whole lot of memory.
Yes, you're correct. But can you formulate a practical common rule,
working on any SCSI transport including FC, by which a SCSI target that
knows some limit can tell it to an initiator, so that it will not try
to queue too many commands? It looks like I have no choice except to
have a "giant" queue on the target and hope that initiators are smart
enough not to queue so many commands that they start seeing timeouts.
Vlad
* Re: SCSI target and IO-throttling
From: Steve Byan @ 2006-03-08 15:56 UTC
To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi
On Mar 8, 2006, at 10:35 AM, Vladislav Bolkhovitin wrote:
> Bryan Henderson wrote:
>> Why would the queue have a greater capacity than what is needed
>> when you care about performance? Is there some non-performance
>> reason to have a giant queue?
>> I still think having a giant queue is not a solution to any flow
>> control (or, in the words of the original problem, I/O throttling)
>> problem. I'm even skeptical that there's any size you can make
>> one that would avoid queue full conditions. It would be like
>> avoiding difficult memory allocation algorithms by just having a
>> whole lot of memory.
>
> Yes, you're correct. But can you formulate a practical common rule,
> working on any SCSI transport including FC, by which a SCSI target
> that knows some limit can tell it to an initiator, so that it will
> not try to queue too many commands? It looks like I have no choice
> except to have a "giant" queue on the target and hope that
> initiators are smart enough not to queue so many commands that they
> start seeing timeouts.
I still don't understand why you are reluctant to return
TASK_SET_FULL or BUSY in this case; it's what the SCSI standard
supplies as the way to say "don't queue too many commands, please".
If you don't want to return TASK_SET_FULL, then yes, an effectively
unbounded command queue is your only alternative.
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin @ 2006-03-08 17:49 UTC
To: Steve Byan; +Cc: Bryan Henderson, linux-scsi
Steve Byan wrote:
>
> On Mar 8, 2006, at 10:35 AM, Vladislav Bolkhovitin wrote:
>
>> Bryan Henderson wrote:
>>
>>> Why would the queue have a greater capacity than what is needed when
>>> you care about performance? Is there some non-performance reason to
>>> have a giant queue?
>>> I still think having a giant queue is not a solution to any flow
>>> control (or, in the words of the original problem, I/O throttling)
>>> problem. I'm even skeptical that there's any size you can make one
>>> that would avoid queue full conditions. It would be like avoiding
>>> difficult memory allocation algorithms by just having a whole lot of
>>> memory.
>>
>>
>> Yes, you're correct. But can you formulate a practical common rule,
>> working on any SCSI transport including FC, by which a SCSI target
>> that knows some limit can tell it to an initiator, so that it will
>> not try to queue too many commands? It looks like I have no choice
>> except to have a "giant" queue on the target and hope that
>> initiators are smart enough not to queue so many commands that they
>> start seeing timeouts.
>
>
> I still don't understand why you are reluctant to return TASK_SET_FULL
> or BUSY in this case; it's what the SCSI standard supplies as the way
> to say "don't queue too many commands, please".
I don't like out-of-order execution, which happens on practically all
such "rejected" commands, because subsequent already-queued commands are
not "rejected" along with them, and some of them could be accepted
later. And the initiator (Linux with an FC driver) is dumb enough to
hit this TASK_SET_FULL again and again until the queue is large enough.
So I can see only one solution that almost eliminates breaking the
order: an unbounded command queue.
But maybe I should think/experiment more and relax the ordering
restriction...
Thanks,
Vlad
> If you don't want to return TASK_SET_FULL, then yes, an effectively
> unbounded command queue is your only alternative.
>
* Re: SCSI target and IO-throttling
From: Steve Byan @ 2006-03-08 18:09 UTC
To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi
On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:
> Steve Byan wrote:
>>
>> I still don't understand why you are reluctant to return
>> TASK_SET_FULL or BUSY in this case; it's what the SCSI standard
>> supplies as the way to say "don't queue too many commands, please".
>
> I don't like out-of-order execution, which happens on practically
> all such "rejected" commands, because subsequent already-queued
> commands are not "rejected" along with them, and some of them could
> be accepted later.
I see, you care about order. So do tapes. The historical answer has
been to not support tagged command queuing when you care about
ordering. To dodge the performance problem due to lack of queuing,
the targets usually implement a read-ahead and write-behind cache,
and then perform queuing behind the scenes, after telling the
initiator that the command has completed. Of course, this has obvious
data integrity issues for disk-type logical units.
The solution, introduced for tapes concurrently with iSCSI (which
motivated the need for command queuing for tapes, since some envisioned
backing up to a tape drive located 3000 miles away), is something
called "unit-attention interlock", or "UA interlock". Check out page
287 of draft revision 23 of the SCSI Primary Commands - 3 (SPC-3)
standard from T10.org. The UA_INTLCK_CTRL field can be set to cause a
persistent unit attention condition if a command was rejected with
TASK_SET_FULL or BUSY.
This requires the cooperation of the initiator.
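Here is the target-side behavior, sketched as logic only (the
UA_INTLCK_CTRL bit layout in the Control mode page is defined by SPC-3
and is not reproduced here, and the UA clearing rules are simplified):

#include <stdbool.h>

enum status { ST_GOOD, ST_CHECK_CONDITION, ST_TASK_SET_FULL };

struct it_nexus {
	bool ua_interlock; /* enabled via UA_INTLCK_CTRL */
	bool ua_pending;   /* interlock unit attention armed */
};

/* Once a command bounces, the next command on that I_T nexus is forced
 * through a unit attention, so the initiator cannot silently let later
 * queued commands overtake the rejected one. */
enum status dispatch(struct it_nexus *n, bool queue_full)
{
	if (n->ua_pending)
		return ST_CHECK_CONDITION; /* UNIT ATTENTION sense data */
	if (queue_full) {
		if (n->ua_interlock)
			n->ua_pending = true;
		return ST_TASK_SET_FULL;   /* or BUSY, per the earlier rule */
	}
	return ST_GOOD;
}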
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin @ 2006-03-09 18:37 UTC
To: Steve Byan; +Cc: Bryan Henderson, linux-scsi
Steve Byan wrote:
>
> On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:
>
>> Steve Byan wrote:
>>
>>>
>
>>> I still don't understand why you are reluctant to return
>>> TASK_SET_FULL or BUSY in this case; it's what the SCSI standard
>>> supplies as the way to say "don't queue too many commands, please".
>>
>>
>> I don't like out-of-order execution, which happens on practically all
>> such "rejected" commands, because subsequent already-queued commands
>> are not "rejected" along with them, and some of them could be
>> accepted later.
>
>
> I see, you care about order. So do tapes. The historical answer has
> been to not support tagged command queuing when you care about
> ordering. To dodge the performance problem due to lack of queuing, the
> targets usually implement a read-ahead and write-behind cache, and then
> perform queuing behind the scenes, after telling the initiator that the
> command has completed. Of course, this has obvious data integrity
> issues for disk-type logical units.
Yes, tapes just can't work without strict ordering. SCST was originally
done for tapes, so I still keep some kind of tape-oriented thinking :)
Actually, with current journaling file systems, ordering has become more
important for disks as well. The data-integrity problem of "behind the
scenes" queuing could in practice easily be solved by battery-backed
power on the disks. In the case of TASK_SET_FULL things are much worse,
because the reordering happens _between_ target and _initiator_, since
the initiator must retry the "rejected" command explicitly. If the
initiator crashes before the command is retried, and the FS on it uses
ordering barriers to protect its integrity (Linux seems to do so, but I
could be wrong), the FS data could be written out of order with respect
to its journal and the FS could be corrupted. Even worse, TASK_SET_FULL
"rejects" basically happen every queue-length'th command, i.e. very
often. This is why I prefer the "dumb" and "safe" way.
But I could be overestimating the problem, because it looks like nobody
cares about it...
> The solution introduced for tapes concurrent with iSCSI (which
> motivated the need for command-queuing for tapes, since some envisioned
> backing up to a tape drive located on 3000 miles away is something
> called "unit-attention interlock", or "UA interlock". Check out page
> 287 of the draft revision 23 of the SCSI Primary Commands - 3 (SPC-3)
> standard from T10.org. The UA_INTLCK_CTRL field can be set to cause a
> persistent unit attention condition if a command was rejected with
> TASK_SET_FULL or BUSY.
Thanks, I'll take a look.
> This requires the cooperation of the initiator.
Which practically means that it will not work for at least several
years. I think, I won't be wrong, if say that no Linux initiators use
this feature and going to use...
BTW, it is also impossible to correctly process commands errors (CHECK
CONDITIONs) in async environment without using ACA (Auto Contingent
Allegiance). Again, I see no sign that it's used by Linux or somebody
interested to use it in Linux. Have I missed anything and it is not
important? (rather rhetorical question)
Thanks,
Vlad
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling
2006-03-09 18:37 ` Vladislav Bolkhovitin
@ 2006-03-09 19:32 ` Steve Byan
2006-03-10 18:46 ` Vladislav Bolkhovitin
0 siblings, 1 reply; 25+ messages in thread
From: Steve Byan @ 2006-03-09 19:32 UTC (permalink / raw)
To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi
On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:
> Steve Byan wrote:
>> On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:
>>> Steve Byan wrote:
>>>
>>>>
>>>> I still don't understand why you are reluctant to return
>>>> TASK_SET_FULL or BUSY in this case; it's what the SCSI
>>>> standard supplies as the way to say "don't queue too many
>>>> commands, please".
>>>
>>>
>>> I don't like out of order execution, which happens practically
>>> on all such "rejected" commands, because subsequent already
>>> queued commands are not "rejected" with it and some of them
>>> could be accepted later.
>> I see, you care about order. So do tapes. The historical answer
>> has been to not support tagged command queuing when you care
>> about ordering. To dodge the performance problem due to lack of
>> queuing, the targets usually implement a read-ahead and write-
>> behind cache, and then perform queuing behind the scenes, after
>> telling the initiator that the command has completed. Of course,
>> this has obvious data integrity issues for disk-type logical units.
>
> Yes, tapes just can't work without strict ordering. SCST was
> originally done for tapes, so I still keep some kind of tape-
> oriented thinking :)
>
> Actually, with current journaling file systems ordering also became
> more important for disks as well.
Usually the workload from a journaling filesystem consists of a lot
of unordered writes (user data) and some partially-ordered writes
(metadata). The partially-ordered writes do not have a defined
ordering with respect to the unordered writes; they are ordered only
with respect to each other. Most systems today solve the
TASK_SET_FULL problem by only having one ordered write outstanding at
any point in time. You want to do it this way anyway, so that you can
build up a queue of commits and do a group commit with the next write
to the journal.
If you need write barriers between the metadata writes and the data
writes, the initiator should use the ORDERED task tag on that write,
and have only one ORDERED write outstanding at any point in time (I
mean to the same logical unit, of course).
> Data integrity problem in "behind the scenes" queuing could be on
> practice easily solved by battery-based backup power on the disks.
> In case of TASK_SET_FULL things are much worse, because the
> reordering happens _between_ target and _initiator_, since the
> initiator must retry "rejected" command explicitly, then in case of
> the initiator crash before the command will be retried and if FS on
> it uses ordering barriers to protect the integrity (Linux seems
> does so, but I could be wrong), the FS data could be written out of
> order with its journal and the FS could be corrupted. Even worse,
> TASK_SET_FULL "rejects" basically happen every the queue length'th
> command, ie very often. This is why I prefer the "dumb" and "safe"
> way. But, I could overestimate the problem, because it looks like
> nobody cares about it..
See above, Since only one ordered write is ever pending, no file
system corruption occurs. Since you want to do group commits anyway,
you never need to have more than one ordered write pending.
>
>> The solution introduced for tapes concurrent with iSCSI (which
>> motivated the need for command-queuing for tapes, since some
>> envisioned backing up to a tape drive located on 3000 miles away
>> is something called "unit-attention interlock", or "UA
>> interlock". Check out page 287 of the draft revision 23 of the
>> SCSI Primary Commands - 3 (SPC-3) standard from T10.org. The
>> UA_INTLCK_CTRL field can be set to cause a persistent unit
>> attention condition if a command was rejected with TASK_SET_FULL
>> or BUSY.
>
> Thanks, I'll take a look.
>
>> This requires the cooperation of the initiator.
>
> Which practically means that it will not work for at least several
> years.
Well, the feature was added back in 2001 or 2002; the initiators have
already had years to incorporate it. This might say something about
the state of the Linux SCSI subsystem (running and ducking for
cover :-). Seriously, I think this has more to do with either the
lack of need for command-queuing for tapes or the lack of modern tape
support in Linux.
> I think, I won't be wrong, if say that no Linux initiators use this
> feature and going to use...
If you have an initiator that is sending queued SCSI commands with
the SIMPLE task attribute but which expects the target to maintain
ordering of those commands, the SCSI standard can't help you. The
initiator is broken.
If the initiator needs to send _queued_ SCSI commands with a task
attribute of ORDERED, then to preserve ordering it must set the
UA_INTLCK_CTL appropriately. The SCSI standard has no other mechanism
to offer such an initiator.
To the best of my knowledge no current Linux initiator sends SCSI
commands with a task attribute other than SIMPLE., and you seem to be
concerned only about Linux initiators. Therefor your target does not
need to preserve order. QUED.
> BTW, it is also impossible to correctly process commands errors
> (CHECK CONDITIONs) in async environment
When you say "async environment" I assume you are referring to
queuing SCSI commands using SCSI command queuing, as opposed to
sending a single SCSI command and synchronously awaiting its completion.
> without using ACA (Auto Contingent Allegiance). Again, I see no
> sign that it's used by Linux or somebody interested to use it in
> Linux. Have I missed anything and it is not important? (rather
> rhetorical question)
ACA is not important if the command that got the error is idempotent
and independent of all other commands in flight. In the case of disks
(SBC command set) and CD-ROMs and DVD-ROMs (MMC command-set) this
condition is true (given the restriction on the number of outstanding
ordered writes which I discussed above), and so ACA is not needed.
Tapes would need ACA if they did command queuing (which is why ACA
was invented), but the practice in tape-land seems to be to avoid
SCSI command queuing and instead asynchronously stage the operations
behind the target. This does lead to complications in error recovery,
which is why tape error handling is so problematic.
My advice to you is to either
a) follow the industry trend, which is to use command queuing only
for SBC (disk) targets and not for MMC (CD-ROM) and SSC (tape)
targets, or
b) fix the initiator to handle ordered queuing (i.e. add support for
the ORDERED and ACA task tags, ACA, and UA_INTLCK_CTL).
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling
2006-03-07 23:32 ` Bryan Henderson
2006-03-08 15:35 ` Vladislav Bolkhovitin
@ 2006-03-10 13:26 ` Steve Byan
1 sibling, 0 replies; 25+ messages in thread
From: Steve Byan @ 2006-03-10 13:26 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-scsi, Vladislav Bolkhovitin
On Mar 7, 2006, at 6:32 PM, Bryan Henderson wrote:
>>> With the more primitive transports,
>>
>> Seems like a somewhat loaded description to me. Personally, I'd pick
>> something more neutral.
>
> Unfortunately, it's exactly what I mean. I understand that some
> people
> attach negative connotations to primitivity, but I can't let that
> get in
> the way of clarity.
>
>>> I believe this is a manual
>>> configuration step -- the target has a fixed maximum queue depth
>>> and you
>>> tell the driver via some configuration parameter what it is.
>>
>> Not true. Consider the case where multiple initiators share one
>> logical unit - there is no guarantee that a single initiator can
>> queue even a single command, since another initiator may have filled
>> the queue at the device.
>
> I'm not sure what it is that you're saying isn't true.
I'm saying that your blanket statement that "With the more primitive
transports, I believe this is a manual configuration step -- the
target has a fixed maximum queue depth and you tell the driver via
some configuration parameter what it is." is not true.
> You do give a good
> explanation of why designers would want something more
> sophisticated than
> this, but that doesn't mean every SCSI implementation actually is.
I didn't say every SCSI implementation did anything in particular. On
the other hand, you did.
> Are
> you saying there are no SCSI targets so primitive that they have a
> fixed
> maximum queue depth?
Of course I'm not saying that no such systems exist. I'm only
refuting your claim that they all behave that way.
> That there are no systems where you manually set the
> maximum requests-in-flight at the initiator in order to optimally
> drive
> such targets?
Of course I'm not saying that no such systems exist. I'm only
refuting your claim that they all behave that way.
>
>>> I saw a broken ISCSI system that had QUEUE FULLs
>>> happening, and it was a performance disaster.
>>
>> Was it a performance disaster because of the broken-ness, or solely
>> because of the TASK SET FULLs?
>
> Because of the broken-ness. Task Set Full is the symptom, not the
> disease. I should add that in this system, there was no way to
> make it
> perform optimally and also see Task Set Full regularly.
>
> You mentioned in another email that FCP is designed to use Task Set
> Full
> for normal flow control. I heard that before, but didn't believe
> it; I
> thought FCP was more advanced than that. But I believe it now.
> So I was
> wrong to say that Task Set Full happening means a system is
> misconfigured.
> But it's still the case that if you can design a system in which
> Task Set
> Full never happens, it will perform better than one in which it does.
This is not necessarily true. TASK_SET_FULL does consume some
initiator CPU resources and some bus bandwidth, so if one of those is
your bottleneck, then yes, avoiding TASK_SET_FULL will improve
performance. But if the performance bottleneck is the device server
itself, then to a first approximation it makes no difference to
performance whether the commands are queued on the initiator side of
the interface or on the target side of the interface, assuming both
the initiator and the target are capable of performing the same
reordering optimizations.
> ISCSI flow control and manual setting of queue sizes in initiators
> are two
> ways people do that.
>
>> 1) Considering only first-order effects, who cares whether the
>> initiator sends sub-optimal requests and the target coalesces them,
>> or if the initiator does the coalescing itself?
>
> I don't know what a first-order effect is, so this may be out of
> bounds,
> but here's a reason to care: the initiator may have more resource
> available to do the work than the target. We're talking here about a
> saturated target (which, rather than admit it's overwhelmed, keeps
> accepting new tasks).
Usually the target resource that is the bottleneck is the mechanical
device, not the CPU. So it usually has the resources to devote to
reordering the queue. Even disk drives with their $5 CPU have enough
CPU bandwidth for this.
>
> But it's really the wrong question, because the more important
> question is
> would you rather have the initiator do the coalescing or nobody? There
> exist targets that are not capable of combining or ordering tasks, and
> still accept large queues of them.
So no target should be able to accept large numbers of queued
commands because some targets you've worked with are broken? Or we
should have to manually configure the queue depth on every target
because some of them are broken?
This also doesn't seem pertinent to TASK_SET_FULL versus iSCSI-style
windowing, since a broken target can accept a large queue of commands
no matter what flow-control mechanism is used.
I don't oppose including an option to an initiator that would
manually set a maximum queue depth for a particular make and model of
a SCSI target as a device-specific quirk; I just don't think it's
mandatory, I don't think it's a good idea to have it be a global
setting, and I also don't think it is the best general solution.
> These are the ones I saw have
> improperly large queues. A target that can actually make use of a
> large
> backlog of work, on the other hand, is right to accept one.
Absolutely. And the ones that can't should be sending TASK_SET_FULL
when they've reached their limit.
>
> I have seen people try to improve performance of a storage system by
> increasing queue depth in the target such as this. They note that the
> queue is always full, so it must need more queue space. But this
> degrades
> performance, because on one of these first-in-first-out targets,
> the only
> way to get peak capacity is to keep the queue full all the time so
> as to
> create backpressure and cause the initiator to schedule the work.
> Increasing the queue depth increases the chance that the initiator
> will
> not have the backlog necessary to do that scheduling. The correct
> queue
> depth on this kind of target is the number of requests the target can
> process within the initiator's (and channel's) turnaround time.
>
>> brain-damaged
>> marketing values small average access times more than a small
>> variance in access times, so the device folks do crazy shortest-
>> access-time-first scheduling instead of something more sane and less
>> prone to spreading out the access time distribution like CSCAN.
>
> Since I'm talking about targets that don't do anything close to that
> sophisticated with the stuff in their queue, this doesn't apply.
>
> But I do have to point out that there are systems where throughput is
> everything, and response time, including variability of it, is
> nothing. In
> fact, the systems I work with are mostly that kind. For that kind of
> system, you'd want to target to do that kind of scheduling.
Yep, for batch you want SATF scheduling. It's not appropriate as the
default setting for mass-produced disk devices, however.
>
>> 2) If you care about performance, you don't try to fill the device
>> queue; you just want to have enough outstanding so that the device
>> doesn't go idle when there is work to do.
>
> Why would the queue have a greater capacity than what is needed
> when you
> care about performance? Is there some non-performance reason to
> have a
> giant queue?
Benchmarks which measure whether the device can coalesce 256 512-byte
sequential writes :-)
Basically it is that for disk devices the optimal queue depth depends
on the workload, so it's statically-sized for the worst-case.
> I still think having a giant queue is not a solution to any flow
> control
> (or, in the words of the original problem, I/O throttling) problem.
I did not suggest a giant queue as a "solution". I only replied to
Vladislav's question as to how disk drives avoid sending
TASK_SET_FULL all the time. They have queue sizes larger than the
number of commands that the initiator usually tries to send.
> I'm
> even skeptical that there's any size you can make one that would avoid
> queue full conditions.
Well, if it's bigger than the number of SCSI command buffers
allocated by the initiator, the target wins and never has to send
TASK_SET_FULL (unless there are multiple initiators).
> It would be like avoiding difficult memory
> allocation algorithms by just having a whole lot of memory.
Yep. That's a good practical solution, and one which the operating
system on your desktop computer probably uses :-)
I do take your point; arbitrarily large queues only postpone the
point at which the target must reply TASK_SET_FULL. Usually that is
good enough.
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling
2006-03-09 19:32 ` Steve Byan
@ 2006-03-10 18:46 ` Vladislav Bolkhovitin
2006-03-10 19:47 ` Steve Byan
2006-03-14 20:54 ` Douglas Gilbert
0 siblings, 2 replies; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-10 18:46 UTC (permalink / raw)
To: Steve Byan; +Cc: Bryan Henderson, linux-scsi
Steve Byan wrote:
> On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:
>
>> Steve Byan wrote:
>>
>>> On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:
>>>
>>>> Steve Byan wrote:
>>>>
>>>>>
>>>>> I still don't understand why you are reluctant to return
>>>>> TASK_SET_FULL or BUSY in this case; it's what the SCSI standard
>>>>> supplies as the way to say "don't queue too many commands, please".
>>>>
>>>>
>>>>
>>>> I don't like out of order execution, which happens practically on
>>>> all such "rejected" commands, because subsequent already queued
>>>> commands are not "rejected" with it and some of them could be
>>>> accepted later.
>>>
>>> I see, you care about order. So do tapes. The historical answer has
>>> been to not support tagged command queuing when you care about
>>> ordering. To dodge the performance problem due to lack of queuing,
>>> the targets usually implement a read-ahead and write- behind cache,
>>> and then perform queuing behind the scenes, after telling the
>>> initiator that the command has completed. Of course, this has
>>> obvious data integrity issues for disk-type logical units.
>>
>>
>> Yes, tapes just can't work without strict ordering. SCST was
>> originally done for tapes, so I still keep some kind of tape- oriented
>> thinking :)
>>
>> Actually, with current journaling file systems ordering also became
>> more important for disks as well.
>
>
> Usually the workload from a journaling filesystem consists of a lot of
> unordered writes (user data) and some partially-ordered writes
> (metadata). The partially-ordered writes do not have a defined ordering
> with respect to the unordered writes; they are ordered only with
> respect to each other. Most systems today solve the TASK_SET_FULL
> problem by only having one ordered write outstanding at any point in
> time. You want to do it this way anyway, so that you can build up a
> queue of commits and do a group commit with the next write to the journal.
>
> If you need write barriers between the metadata writes and the data
> writes, the initiator should use the ORDERED task tag on that write,
> and have only one ORDERED write outstanding at any point in time (I
> mean to the same logical unit, of course).
I mean the barrier between journal writes and metadata writes, because
they order is essential for a FS health. User data almost always not
journaled and not protected.
Obviously, having only one ORDERED, i.e. journal, write and having to
wait for it completition before submitting subsequent commands creates
some performance bottleneck. I mean mostly latency, which often quite
big in many SCSI transports. It would be much better to queue as many
such ORDERED commands as necessary and then, without waiting for their
completition, metadata updates (SIMPLE) commands and being sure, that no
metadata commands will be executed if any of ORDERED ones fail. As far
as I can see, nothing prevents to work that way right now, except that
somebody should implement it in both hardware and software.
>> Data integrity problem in "behind the scenes" queuing could be on
>> practice easily solved by battery-based backup power on the disks. In
>> case of TASK_SET_FULL things are much worse, because the reordering
>> happens _between_ target and _initiator_, since the initiator must
>> retry "rejected" command explicitly, then in case of the initiator
>> crash before the command will be retried and if FS on it uses
>> ordering barriers to protect the integrity (Linux seems does so, but
>> I could be wrong), the FS data could be written out of order with its
>> journal and the FS could be corrupted. Even worse, TASK_SET_FULL
>> "rejects" basically happen every the queue length'th command, ie very
>> often. This is why I prefer the "dumb" and "safe" way. But, I could
>> overestimate the problem, because it looks like nobody cares about it..
>
>
> See above, Since only one ordered write is ever pending, no file system
> corruption occurs. Since you want to do group commits anyway, you never
> need to have more than one ordered write pending.
>
>>
>>> The solution introduced for tapes concurrent with iSCSI (which
>>> motivated the need for command-queuing for tapes, since some
>>> envisioned backing up to a tape drive located on 3000 miles away is
>>> something called "unit-attention interlock", or "UA interlock".
>>> Check out page 287 of the draft revision 23 of the SCSI Primary
>>> Commands - 3 (SPC-3) standard from T10.org. The UA_INTLCK_CTRL
>>> field can be set to cause a persistent unit attention condition if
>>> a command was rejected with TASK_SET_FULL or BUSY.
>>
>>
>> Thanks, I'll take a look.
>>
>>> This requires the cooperation of the initiator.
>>
>>
>> Which practically means that it will not work for at least several
>> years.
>
>
> Well, the feature was added back in 2001 or 2002; the initiators have
> already had years to incorporate it. This might say something about the
> state of the Linux SCSI subsystem (running and ducking for cover :-).
> Seriously, I think this has more to do with either the lack of need for
> command-queuing for tapes or the lack of modern tape support in Linux.
>
>> I think, I won't be wrong, if say that no Linux initiators use this
>> feature and going to use...
>
>
> If you have an initiator that is sending queued SCSI commands with the
> SIMPLE task attribute but which expects the target to maintain ordering
> of those commands, the SCSI standard can't help you. The initiator is
> broken.
Sure
> If the initiator needs to send _queued_ SCSI commands with a task
> attribute of ORDERED, then to preserve ordering it must set the
> UA_INTLCK_CTL appropriately. The SCSI standard has no other mechanism
> to offer such an initiator.
>
> To the best of my knowledge no current Linux initiator sends SCSI
> commands with a task attribute other than SIMPLE., and you seem to be
> concerned only about Linux initiators. Therefor your target does not
> need to preserve order. QUED.
I prefer to be overinsured in such cases.
>> BTW, it is also impossible to correctly process commands errors
>> (CHECK CONDITIONs) in async environment
>
>
> When you say "async environment" I assume you are referring to queuing
> SCSI commands using SCSI command queuing, as opposed to sending a
> single SCSI command and synchronously awaiting its completion.
Yes
>> without using ACA (Auto Contingent Allegiance). Again, I see no sign
>> that it's used by Linux or somebody interested to use it in Linux.
>> Have I missed anything and it is not important? (rather rhetorical
>> question)
>
>
> ACA is not important if the command that got the error is idempotent
> and independent of all other commands in flight. In the case of disks
> (SBC command set) and CD-ROMs and DVD-ROMs (MMC command-set) this
> condition is true (given the restriction on the number of outstanding
> ordered writes which I discussed above), and so ACA is not needed.
Yes, when working as you described, ACA is not needed. But when working
as I described, ACA is essential.
> Tapes would need ACA if they did command queuing (which is why ACA was
> invented), but the practice in tape-land seems to be to avoid SCSI
> command queuing and instead asynchronously stage the operations behind
> the target. This does lead to complications in error recovery, which is
> why tape error handling is so problematic.
Could you please explain "synchronously stage the operations behind the
target" more? I don't understand what you mean.
> My advice to you is to either
> a) follow the industry trend, which is to use command queuing only for
> SBC (disk) targets and not for MMC (CD-ROM) and SSC (tape) targets, or
> b) fix the initiator to handle ordered queuing (i.e. add support for
> the ORDERED and ACA task tags, ACA, and UA_INTLCK_CTL).
OK, thanks. Looks like (a) is easier :).
BTW, do you have any statistic how many modern SCSI disks support those
features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago none of
available for us SCSI hardware, including tape libraries, supported ACA.
It was not very modern for that time, though
Regards,
Vlad
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling
2006-03-10 18:46 ` Vladislav Bolkhovitin
@ 2006-03-10 19:47 ` Steve Byan
2006-03-13 17:35 ` Vladislav Bolkhovitin
2006-03-14 20:54 ` Douglas Gilbert
1 sibling, 1 reply; 25+ messages in thread
From: Steve Byan @ 2006-03-10 19:47 UTC (permalink / raw)
To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi
On Mar 10, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:
> Steve Byan wrote:
>> On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:
> I mean the barrier between journal writes and metadata writes,
> because they order is essential for a FS health.
I counted journal writes as metadata writes. If you want to make a
distinction, OK, we now have a common language.
> Obviously, having only one ORDERED, i.e. journal, write and having
> to wait for it completition before submitting subsequent commands
> creates some performance bottleneck.
It might be obvious but it's not true.
You missed my point about group commits to the journal. That's why
there's no performance hit for only having one outstanding journal
write at a time; each journal write commits many transactions. Stated
another way, you don't want to eagerly initiate journal writes; you
want to execute one at a time, and group all transactions that arrive
while the one write is active into the next write.
See the seminal paper from Xerox PARC on "Group Commits in the CEDAR
Filesystem". I'm working from memory so I can't give you a better
citation than that. It's an old paper, probably circa 1987 or 1988,
published I think in an ACM journal.
I've benchmarked metadata-intensive workloads on a journaling
filesystem with a storage controller with NV-RAM arranged so that all
metadata and journal writes complete without any disk activity
against a vanilla controller. The lights on the disks on the NV-RAM
controller never came on; i.e. there was _no_ disk activity. The
lights on the disks attached to the vanilla controller were on solid.
The performance of the two systems was essentially the same with
respect to average response time and throughput.
> I mean mostly latency, which often quite big in many SCSI
> transports. It would be much better to queue as many such ORDERED
> commands as necessary and then, without waiting for their
> completition, metadata updates (SIMPLE) commands and being sure,
> that no metadata commands will be executed if any of ORDERED ones
> fail. As far as I can see, nothing prevents to work that way right
> now, except that somebody should implement it in both hardware and
> software.
If you use group commits, there's little value in implementing this.
>> To the best of my knowledge no current Linux initiator sends SCSI
>> commands with a task attribute other than SIMPLE., and you seem to
>> be concerned only about Linux initiators. Therefor your target
>> does not need to preserve order. QUED.
>
> I prefer to be overinsured in such cases.
Suit yourself. Just don't expect help from the SCSI standard, it's
not designed to do that.
>> ACA is not important if the command that got the error is
>> idempotent and independent of all other commands in flight. In
>> the case of disks (SBC command set) and CD-ROMs and DVD-ROMs (MMC
>> command-set) this condition is true (given the restriction on the
>> number of outstanding ordered writes which I discussed above),
>> and so ACA is not needed.
>
> Yes, when working as you described, ACA is not needed. But when
> working as I described, ACA is essential.
As is unit attention interlock.
>> Tapes would need ACA if they did command queuing (which is why
>> ACA was invented), but the practice in tape-land seems to be to
>> avoid SCSI command queuing and instead asynchronously stage the
>> operations behind the target. This does lead to complications in
>> error recovery, which is why tape error handling is so problematic.
>
> Could you please explain "synchronously stage the operations behind
> the target" more? I don't understand what you mean.
I mean they buffer the operations in memory after completing the SCSI
command and then (asynchronous to the execution of the SCSI command,
i,e, after it has been completed) queue them ("stage" them) and send
them on to the physical device.
I'm a bit hazy on the terminology, because I was never a tape guy and
it's been years since I thought about tapes, but I think the term the
industry used when streaming tapes first came out was "buffered
operation". The tape controller accepts the write command and
completes it with good status but doesn't write it to the media; it
waits until it has accumulated a sufficient number of records to keep
the tape streaming before starting to dump the buffer to the tape
media. This avoids the need for SCSI command-queuing while still
keeping the tape streaming.
>> My advice to you is to either
>> a) follow the industry trend, which is to use command queuing
>> only for SBC (disk) targets and not for MMC (CD-ROM) and SSC
>> (tape) targets, or
>> b) fix the initiator to handle ordered queuing (i.e. add support
>> for the ORDERED and ACA task tags, ACA, and UA_INTLCK_CTL).
>
> OK, thanks. Looks like (a) is easier :).
>
> BTW, do you have any statistic how many modern SCSI disks support
> those features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago
> none of available for us SCSI hardware, including tape libraries,
> supported ACA. It was not very modern for that time, though
I can't say with certainty, but I believe no SCSI disk supports ACA
or UA_INTLCK_CTL. Some may support the ORDERED task tag but I guess
it would be implemented in a low-performance path.
Storage controllers might be a different story; I have no data on
what they support in the way of task attributes, ACA, and unit
attention interlock.
As far as tapes go, I've got no data on modern SCSI tape controllers,
but judging by the squirming going on in T10 around command-ordering
for Fibre Channel tapes, I'd guess very few if any have gotten
command-queuing to work for tapes.
Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling
2006-03-10 19:47 ` Steve Byan
@ 2006-03-13 17:35 ` Vladislav Bolkhovitin
0 siblings, 0 replies; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-13 17:35 UTC (permalink / raw)
To: Steve Byan; +Cc: Bryan Henderson, linux-scsi
Steve Byan wrote:
> On Mar 10, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:
>
>> Steve Byan wrote:
>>
>>> On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:
>
>
>> I mean the barrier between journal writes and metadata writes,
>> because they order is essential for a FS health.
>
>
> I counted journal writes as metadata writes. If you want to make a
> distinction, OK, we now have a common language.
>
>> Obviously, having only one ORDERED, i.e. journal, write and having to
>> wait for it completition before submitting subsequent commands
>> creates some performance bottleneck.
>
>
> It might be obvious but it's not true.
>
> You missed my point about group commits to the journal. That's why
> there's no performance hit for only having one outstanding journal
> write at a time; each journal write commits many transactions. Stated
> another way, you don't want to eagerly initiate journal writes; you
> want to execute one at a time, and group all transactions that arrive
> while the one write is active into the next write.
>
> See the seminal paper from Xerox PARC on "Group Commits in the CEDAR
> Filesystem". I'm working from memory so I can't give you a better
> citation than that. It's an old paper, probably circa 1987 or 1988,
> published I think in an ACM journal.
I didn't miss your point. I wrote that such journal updates have to be
_synchronous_, i.e. it's necessary, despite that the updates are
combined in one command, to wait for their completion (as well as _all_
previously queued commands, including SIMPLE ones). This is the
(possible) performance bottleneck. Yes, the disk can imitate the
commands completion with its write back cache, but the cache is limited
in size, so on some workload it could get full and not able to help.
However, I don't have any numbers and maybe this is not so noticeable in
practice.
> I've benchmarked metadata-intensive workloads on a journaling
> filesystem with a storage controller with NV-RAM arranged so that all
> metadata and journal writes complete without any disk activity against
> a vanilla controller. The lights on the disks on the NV-RAM controller
> never came on; i.e. there was _no_ disk activity. The lights on the
> disks attached to the vanilla controller were on solid. The performance
> of the two systems was essentially the same with respect to average
> response time and throughput.
>
>> I mean mostly latency, which often quite big in many SCSI transports.
>> It would be much better to queue as many such ORDERED commands as
>> necessary and then, without waiting for their completition, metadata
>> updates (SIMPLE) commands and being sure, that no metadata commands
>> will be executed if any of ORDERED ones fail. As far as I can see,
>> nothing prevents to work that way right now, except that somebody
>> should implement it in both hardware and software.
>
>
> If you use group commits, there's little value in implementing this.
>
>>> Tapes would need ACA if they did command queuing (which is why ACA
>>> was invented), but the practice in tape-land seems to be to avoid
>>> SCSI command queuing and instead asynchronously stage the
>>> operations behind the target. This does lead to complications in
>>> error recovery, which is why tape error handling is so problematic.
>>
>>
>> Could you please explain "synchronously stage the operations behind
>> the target" more? I don't understand what you mean.
>
>
> I mean they buffer the operations in memory after completing the SCSI
> command and then (asynchronous to the execution of the SCSI command,
> i,e, after it has been completed) queue them ("stage" them) and send
> them on to the physical device.
>
> I'm a bit hazy on the terminology, because I was never a tape guy and
> it's been years since I thought about tapes, but I think the term the
> industry used when streaming tapes first came out was "buffered
> operation". The tape controller accepts the write command and completes
> it with good status but doesn't write it to the media; it waits until
> it has accumulated a sufficient number of records to keep the tape
> streaming before starting to dump the buffer to the tape media. This
> avoids the need for SCSI command-queuing while still keeping the tape
> streaming.
I see
>>> My advice to you is to either
>>> a) follow the industry trend, which is to use command queuing only
>>> for SBC (disk) targets and not for MMC (CD-ROM) and SSC (tape)
>>> targets, or
>>> b) fix the initiator to handle ordered queuing (i.e. add support
>>> for the ORDERED and ACA task tags, ACA, and UA_INTLCK_CTL).
>>
>>
>> OK, thanks. Looks like (a) is easier :).
>>
>> BTW, do you have any statistic how many modern SCSI disks support
>> those features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago none
>> of available for us SCSI hardware, including tape libraries,
>> supported ACA. It was not very modern for that time, though
>
>
> I can't say with certainty, but I believe no SCSI disk supports ACA or
> UA_INTLCK_CTL. Some may support the ORDERED task tag but I guess it
> would be implemented in a low-performance path.
This is the point from which we should have started :). It's senseless
to implement something, which you can't use.
> Storage controllers might be a different story; I have no data on what
> they support in the way of task attributes, ACA, and unit attention
> interlock.
>
> As far as tapes go, I've got no data on modern SCSI tape controllers,
> but judging by the squirming going on in T10 around command-ordering
> for Fibre Channel tapes, I'd guess very few if any have gotten
> command-queuing to work for tapes.
Thanks,
Vlad
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling
2006-03-10 18:46 ` Vladislav Bolkhovitin
2006-03-10 19:47 ` Steve Byan
@ 2006-03-14 20:54 ` Douglas Gilbert
2006-03-15 17:15 ` Vladislav Bolkhovitin
1 sibling, 1 reply; 25+ messages in thread
From: Douglas Gilbert @ 2006-03-14 20:54 UTC (permalink / raw)
To: Vladislav Bolkhovitin; +Cc: Steve Byan, Bryan Henderson, linux-scsi
Vladislav Bolkhovitin wrote:
> Steve Byan wrote:
<snip>
> BTW, do you have any statistic how many modern SCSI disks support those
> features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago none of
> available for us SCSI hardware, including tape libraries, supported ACA.
> It was not very modern for that time, though
Vlad,
Here is part of the control mode page from a
recent SCSI disk (Cheetah 15k.4) :
# sdparm -p co /dev/sdb -ll
/dev/sdb: SEAGATE ST336754SS 0003
Direct access device specific parameters: WP=0 DPOFUA=1
Control mode page [PS=1]:
TST 0 [cha: n, def: 0, sav: 0] Task set type
0: lu maintains one task set for all I_T nexuses
1: lu maintains separate task sets for each I_T nexus
TMF_ONLY 0 [cha: n, def: 0, sav: 0] Task management functions only
D_SENSE 0 [cha: n, def: 0, sav: 0] Descriptor format sense data
GLTSD 0 [cha: y, def: 1, sav: 0] Global logging target save disable
RLEC 0 [cha: y, def: 0, sav: 0] Report log exception condition
QAM 0 [cha: y, def: 0, sav: 0] Queue algorithm modifier
0: restricted re-ordering; 1: unrestricted
QERR 0 [cha: n, def: 0, sav: 0] Queue error management
0: only affected task gets CC; 1: affected tasks aborted
3: affected tasks aborted on same I_T nexus
RAC 0 [cha: n, def: 0, sav: 0] Report a check
UA_INTLCK 0 [cha: n, def: 0, sav: 0] Unit attention interlocks control
0: unit attention cleared with check condition status
2: unit attention not cleared with check condition status
3: as 2 plus ua on busy, task set full or reservation conflict
SWP 0 [cha: n, def: 0, sav: 0] Software write protect
ATO 0 [cha: n, def: 0, sav: 0] Application tag owner
TAS 0 [cha: n, def: 0, sav: 0] Task aborted status
0: tasks aborted without response to app client
1: any other I_T nexuses receive task aborted
So it doesn't support UA_INTLCK ("cha: n" implies the user
cannot change that value). QAM can be changed to allow
unrestricted re-ordering (of task with the SIMPLE task
attribute).
The NormACA bit in the standard INQUIRY response is 0 so
it doesn't support ACA either.
Doug Gilbert
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling
2006-03-14 20:54 ` Douglas Gilbert
@ 2006-03-15 17:15 ` Vladislav Bolkhovitin
0 siblings, 0 replies; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-15 17:15 UTC (permalink / raw)
To: dougg; +Cc: Steve Byan, Bryan Henderson, linux-scsi
Douglas Gilbert wrote:
> Vladislav Bolkhovitin wrote:
>
>>Steve Byan wrote:
>
>
> <snip>
>
>>BTW, do you have any statistic how many modern SCSI disks support those
>>features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago none of
>>available for us SCSI hardware, including tape libraries, supported ACA.
>>It was not very modern for that time, though
>
>
> Vlad,
> Here is part of the control mode page from a
> recent SCSI disk (Cheetah 15k.4) :
>
> # sdparm -p co /dev/sdb -ll
> /dev/sdb: SEAGATE ST336754SS 0003
> Direct access device specific parameters: WP=0 DPOFUA=1
> Control mode page [PS=1]:
> TST 0 [cha: n, def: 0, sav: 0] Task set type
> 0: lu maintains one task set for all I_T nexuses
> 1: lu maintains separate task sets for each I_T nexus
> TMF_ONLY 0 [cha: n, def: 0, sav: 0] Task management functions only
> D_SENSE 0 [cha: n, def: 0, sav: 0] Descriptor format sense data
> GLTSD 0 [cha: y, def: 1, sav: 0] Global logging target save disable
> RLEC 0 [cha: y, def: 0, sav: 0] Report log exception condition
> QAM 0 [cha: y, def: 0, sav: 0] Queue algorithm modifier
> 0: restricted re-ordering; 1: unrestricted
> QERR 0 [cha: n, def: 0, sav: 0] Queue error management
> 0: only affected task gets CC; 1: affected tasks aborted
> 3: affected tasks aborted on same I_T nexus
> RAC 0 [cha: n, def: 0, sav: 0] Report a check
> UA_INTLCK 0 [cha: n, def: 0, sav: 0] Unit attention interlocks control
> 0: unit attention cleared with check condition status
> 2: unit attention not cleared with check condition status
> 3: as 2 plus ua on busy, task set full or reservation conflict
> SWP 0 [cha: n, def: 0, sav: 0] Software write protect
> ATO 0 [cha: n, def: 0, sav: 0] Application tag owner
> TAS 0 [cha: n, def: 0, sav: 0] Task aborted status
> 0: tasks aborted without response to app client
> 1: any other I_T nexuses receive task aborted
>
> So it doesn't support UA_INTLCK ("cha: n" implies the user
> cannot change that value). QAM can be changed to allow
> unrestricted re-ordering (of task with the SIMPLE task
> attribute).
>
> The NormACA bit in the standard INQUIRY response is 0 so
> it doesn't support ACA either.
Thanks! This is exactly what we've seen in the our small investigation.
Perhaps, those features are really not needed, if nobody still use them.
Vlad
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2006-03-15 17:15 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-02 16:21 SCSI target and IO-throttling Vladislav Bolkhovitin
2006-03-03 18:07 ` Steve Byan
2006-03-03 18:47 ` Stefan Richter
2006-03-03 20:24 ` Steve Byan
2006-03-06 19:15 ` Bryan Henderson
2006-03-06 19:55 ` Steve Byan
2006-03-07 23:32 ` Bryan Henderson
2006-03-08 15:35 ` Vladislav Bolkhovitin
2006-03-08 15:56 ` Steve Byan
2006-03-08 17:49 ` Vladislav Bolkhovitin
2006-03-08 18:09 ` Steve Byan
2006-03-09 18:37 ` Vladislav Bolkhovitin
2006-03-09 19:32 ` Steve Byan
2006-03-10 18:46 ` Vladislav Bolkhovitin
2006-03-10 19:47 ` Steve Byan
2006-03-13 17:35 ` Vladislav Bolkhovitin
2006-03-14 20:54 ` Douglas Gilbert
2006-03-15 17:15 ` Vladislav Bolkhovitin
2006-03-10 13:26 ` Steve Byan
2006-03-07 17:56 ` Vladislav Bolkhovitin
2006-03-07 18:38 ` Steve Byan
2006-03-07 17:53 ` Vladislav Bolkhovitin
2006-03-07 18:19 ` Steve Byan
2006-03-07 18:46 ` Vladislav Bolkhovitin
2006-03-07 19:00 ` Steve Byan
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).