linux-scsi.vger.kernel.org archive mirror
* SCSI target and IO-throttling
@ 2006-03-02 16:21 Vladislav Bolkhovitin
  2006-03-03 18:07 ` Steve Byan
  0 siblings, 1 reply; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-02 16:21 UTC (permalink / raw)
  To: linux-scsi

Hello

Could anyone advise how a SCSI target device can IO-throttle its 
initiators, i.e. prevent them from queuing too many commands, please?

I suppose the best way to do this is to inform the initiators of the 
maximum queue depth X of the target device, so that no initiator will 
send more than X commands. But I have not found anything like that in 
the INQUIRY or MODE SENSE pages. Have I missed something? Just returning 
QUEUE FULL status doesn't look correct, because it can lead to 
out-of-order command execution.

Apparently, hardware SCSI targets don't suffer from queue overflow and 
don't return QUEUE FULL status all the time, so there must be a way to 
do the throttling more elegantly.

Regards,
Vlad


* Re: SCSI target and IO-throttling
  2006-03-02 16:21 SCSI target and IO-throttling Vladislav Bolkhovitin
@ 2006-03-03 18:07 ` Steve Byan
  2006-03-03 18:47   ` Stefan Richter
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Steve Byan @ 2006-03-03 18:07 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: linux-scsi


On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:

> Could anyone advise how a SCSI target device can IO-throttle its 
> initiators, i.e. prevent them from queuing too many commands, please?
>
> I suppose the best way to do this is to inform the initiators of the 
> maximum queue depth X of the target device, so that no initiator will 
> send more than X commands. But I have not found anything like that in 
> the INQUIRY or MODE SENSE pages. Have I missed something? Just 
> returning QUEUE FULL status doesn't look correct, because it can lead 
> to out-of-order command execution.

Returning QUEUE FULL status is correct, unless the initiator does not  
have any pending commands on the LUN, in which case you should return  
BUSY. Yes, this can lead to out-of-order execution. That's why tapes  
have traditionally not used SCSI command queuing.
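
In code, that rule is roughly the following (a minimal sketch in C with
invented names; a real target makes this decision in its command-intake
path, and nothing here is SCST's actual API):

/* Sketch of the status rule above: reject with TASK SET FULL when
 * this initiator already has commands queued on the logical unit,
 * and with BUSY when it does not. Names are illustrative. */
enum scsi_status { GOOD = 0x00, BUSY = 0x08, TASK_SET_FULL = 0x28 };

struct task_set {
    unsigned depth;      /* commands currently queued on the LUN  */
    unsigned max_depth;  /* capacity of this logical unit's queue */
};

enum scsi_status admit_command(struct task_set *ts,
                               unsigned initiator_pending)
{
    if (ts->depth < ts->max_depth) {
        ts->depth++;
        return GOOD;     /* command accepted into the task set */
    }
    /* Queue full: the status depends on whether this initiator
     * already has work outstanding on the logical unit. */
    return initiator_pending ? TASK_SET_FULL : BUSY;
}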

Look into the unit attention interlock feature added to SCSI as a  
result of uncovering this issue during the development of the iSCSI  
standard.

> Apparently, hardware SCSI targets don't suffer from queue overflow 
> and don't return QUEUE FULL status all the time, so there must be a 
> way to do the throttling more elegantly.

No, they just have big queues.

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125




* Re: SCSI target and IO-throttling
  2006-03-03 18:07 ` Steve Byan
@ 2006-03-03 18:47   ` Stefan Richter
  2006-03-03 20:24     ` Steve Byan
  2006-03-06 19:15   ` Bryan Henderson
  2006-03-07 17:53   ` Vladislav Bolkhovitin
  2 siblings, 1 reply; 25+ messages in thread
From: Stefan Richter @ 2006-03-03 18:47 UTC (permalink / raw)
  To: Steve Byan; +Cc: Vladislav Bolkhovitin, linux-scsi

Steve Byan wrote:
> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>> Could anyone advise how a SCSI target device can IO-throttle its 
>> initiators, i.e. prevent them from queuing too many commands, please?
>>
>> I suppose the best way to do this is to inform the initiators of 
>> the maximum queue depth X of the target device,
[...]
> Returning QUEUE FULL status is correct, unless the initiator does not  
> have any pending commands on the LUN, in which case you should return  
> BUSY. Yes, this can lead to out-of-order execution. That's why tapes  
> have traditionally not used SCSI command queuing.
> 
> Look into the unit attention interlock feature added to SCSI as a  
> result of uncovering this issue during the development of the iSCSI  
> standard.
> 
>> Apparently, hardware SCSI targets don't suffer from queue overflow
[...]
> No, they just have big queues.

Depending on the transport protocol, the problem of queue depth at the 
target may not even exist in the first place. This is the case with 
SBP-2, where the queue of command blocks resides at the initiator.
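
For readers who don't know SBP-2: commands are queued as ORBs (operation
request blocks) linked together in initiator memory, and the target
fetches them over the bus at its own pace. A rough sketch of the shape
(fields simplified for illustration, not the exact wire format):

#include <stdint.h>

/* Simplified sketch of an SBP-2 operation request block (ORB). The
 * real layout is defined by the SBP-2 standard; the point here is
 * only that the queue is a linked list in initiator memory which
 * the target pulls in via 1394 DMA whenever it has resources. */
struct sbp2_orb {
    uint64_t next_orb;        /* bus address of the next ORB in the list */
    uint64_t data_descriptor; /* where this command's data buffer lives  */
    uint32_t misc;            /* direction, speed, data size, ...        */
    uint8_t  cdb[12];         /* the encapsulated SCSI command           */
};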
-- 
Stefan Richter
-=====-=-==- --== ---==
http://arcgraph.de/sr/


* Re: SCSI target and IO-throttling
  2006-03-03 18:47   ` Stefan Richter
@ 2006-03-03 20:24     ` Steve Byan
  0 siblings, 0 replies; 25+ messages in thread
From: Steve Byan @ 2006-03-03 20:24 UTC (permalink / raw)
  To: Stefan Richter; +Cc: Vladislav Bolkhovitin, linux-scsi


On Mar 3, 2006, at 1:47 PM, Stefan Richter wrote:

> Steve Byan wrote:
>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>> Apparently, hardware SCSI targets don't suffer from queue 
>>> overflow
> [...]
>> No, they just have big queues.
>
> Depending on the transport protocol, the problem of queue depth 
> at the target may not even exist in the first place. This is the 
> case with SBP-2, where the queue of command blocks resides at the 
> initiator.

Yes, and that's a clever optimization in SBP-2 to support 
resource-poor targets. Thanks for reminding us of it.

Too bad SATA drives didn't take advantage of the SATA first-party DMA  
to implement SBP-2. The definition of the tag field for native  
command queuing adopted by T13 essentially makes it infeasible to  
revisit this decision.

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125




* Re: SCSI target and IO-throttling
  2006-03-03 18:07 ` Steve Byan
  2006-03-03 18:47   ` Stefan Richter
@ 2006-03-06 19:15   ` Bryan Henderson
  2006-03-06 19:55     ` Steve Byan
  2006-03-07 17:56     ` Vladislav Bolkhovitin
  2006-03-07 17:53   ` Vladislav Bolkhovitin
  2 siblings, 2 replies; 25+ messages in thread
From: Bryan Henderson @ 2006-03-06 19:15 UTC (permalink / raw)
  To: Steve Byan; +Cc: linux-scsi, Vladislav Bolkhovitin

>On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>
>> Could anyone advise how a SCSI target device can IO-throttle its 
>> initiators, i.e. prevent them from queuing too many commands, please?
>>
>> I suppose the best way to do this is to inform the initiators of the 
>> maximum queue depth X of the target device, so that no initiator will 
>> send more than X commands. But I have not found anything like that in 
>> the INQUIRY or MODE SENSE pages. Have I missed something? Just 
>> returning QUEUE FULL status doesn't look correct, because it can lead 
>> to out-of-order command execution.
>
>Returning QUEUE FULL status is correct, unless the initiator does not 
>have any pending commands on the LUN, in which case you should return 
>BUSY. Yes, this can lead to out-of-order execution. That's why tapes 
>have traditionally not used SCSI command queuing.

I'm confused; Vladislav appears to be asking about flow control such as 
is built into iSCSI, wherein the iSCSI target tells the initiator how 
many tasks it's willing to work on at once, and the initiator stops sending 
new ones when it has hit that limit and waits for one of the previous ones 
to finish.  And the target can continuously change that number.
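
Concretely, iSCSI implements this as a sliding command window: every
target response carries ExpCmdSN and MaxCmdSN, and the initiator may
only issue a command while its CmdSN lies inside that window. A minimal
initiator-side sketch, with invented structure names; real code uses
wraparound-safe serial-number arithmetic, which this sketch ignores:

#include <stdbool.h>
#include <stdint.h>

/* The target advertises ExpCmdSN/MaxCmdSN in every PDU it sends
 * back, so it can widen or shrink the window continuously. */
struct iscsi_session {
    uint32_t cmd_sn;      /* next CmdSN this initiator will assign */
    uint32_t exp_cmd_sn;  /* oldest CmdSN the target still expects */
    uint32_t max_cmd_sn;  /* highest CmdSN the target will accept  */
};

static bool can_send_command(const struct iscsi_session *s)
{
    /* Window closed: hold the command until a response raises
     * MaxCmdSN. */
    return s->cmd_sn <= s->max_cmd_sn;
}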

With the more primitive transports, I believe this is a manual 
configuration step -- the target has a fixed maximum queue depth and you 
tell the driver via some configuration parameter what it is.

As I understand it, any system in which QUEUE FULL (that's another name 
for SCSI's Task Set Full, isn't it?) errors happen is one that is not 
properly configured.  I saw a broken iSCSI system that had QUEUE FULLs 
happening, and it was a performance disaster.

>> Apparently, hardware SCSI targets don't suffer from queue 
>> overflow and don't return QUEUE FULL status all the time, so there 
>> must be a way to do the throttling more elegantly.
>
>No, they just have big queues.

Big queues are another serious performance problem, when it means a target 
accepts work faster than it can do it.  I've seen that cause initiators to 
send suboptimal requests (if the target appears to be working at infinite 
speed, the initiator sends small chunks of work as soon as each is ready, 
whereas if the initiator can tell that the target is choked, the initiator 
combines and sorts work while it waits, into a stream the target can 
handle more efficiently).  When systems substitute an oversized queue in a 
target for initiator-target flow control, the initiator ends up having to 
compensate with artificial schemes to withhold work from a willing target 
(e.g. Linux "queue plugging").

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems



* Re: SCSI target and IO-throttling
  2006-03-06 19:15   ` Bryan Henderson
@ 2006-03-06 19:55     ` Steve Byan
  2006-03-07 23:32       ` Bryan Henderson
  2006-03-07 17:56     ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 25+ messages in thread
From: Steve Byan @ 2006-03-06 19:55 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: linux-scsi, Vladislav Bolkhovitin


On Mar 6, 2006, at 2:15 PM, Bryan Henderson wrote:

>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>
>>> Could anyone advise how a SCSI target device can IO-throttle its
>>> initiators, i.e. prevent them from queuing too many commands,  
>>> please?
>>>
>>> I suppose the best way to do this is to inform the initiators of
>>> the maximum queue depth X of the target device, so that no initiator
>>> will send more than X commands. But I have not found anything like
>>> that in the INQUIRY or MODE SENSE pages. Have I missed something?
>>> Just returning QUEUE FULL status doesn't look correct, because it
>>> can lead to out-of-order command execution.
>>
>> Returning QUEUE FULL status is correct, unless the initiator does not
>> have any pending commands on the LUN, in which case you should return
>> BUSY. Yes, this can lead to out-of-order execution. That's why tapes
>> have traditionally not used SCSI command queuing.
>
> I'm confused; Vladislav appears to be asking about flow control  
> such as is built into iSCSI, wherein the iSCSI target tells the  
> initiator how many tasks it's willing to work on at once, and the  
> initiator stops sending new ones when it has hit that limit and  
> waits for one of the previous ones to finish.  And the target can  
> continuously change that number.
>
> With the more primitive transports,

Seems like a somewhat loaded description to me. Personally, I'd pick  
something more neutral.

> I believe this is a manual
> configuration step -- the target has a fixed maximum queue depth  
> and you
> tell the driver via some configuration parameter what it is.

Not true. Consider the case where multiple initiators share one  
logical unit  - there is no guarantee that a single initiator can  
queue even a single command, since another initiator may have filled  
the queue at the device.

Another case is a target that has multiple logical units; it is  
conceivable that an implementation may share the device queue  
resources among all logical units. In this case again, there is no  
fixed number of commands that the target can guarantee to queue for a  
logical unit.

> As I understand it, any system in which QUEUE FULL (that's another  
> name
> for SCSI's Task Set Full, isn't it?)

Yes, you're correct. I should have written TASK SET FULL, which is  
the correct name for the SCSI status value that we are discussing.

> errors happen is one that is not
> properly configured.

Absolutely untrue.

> I saw a broken iSCSI system that had QUEUE FULLs
> happening, and it was a performance disaster.

Was it a performance disaster because of the broken-ness, or solely  
because of the TASK SET FULLs?

>>> Apparently, hardware SCSI targets don't suffer from queue
>>> overflow and don't return QUEUE FULL status all the time, so there
>>> must be a way to do the throttling more elegantly.
>>
>> No, they just have big queues.
>
> Big queues are another serious performance problem, when it means a  
> target
> accepts work faster than it can do it.  I've seen that cause  
> initiators to
> send suboptimal requests (if the target appears to be working at  
> infinite
> speed, the initiator sends small chunks of work as soon as each is  
> ready,
> whereas if the initiator can tell that the target is choked, the  
> initiator
> combines and sorts work while it waits, into a stream the target can
> handle more efficiently).

1) Considering only first-order effects, who cares whether the  
initiator sends sub-optimal requests and the target coalesces them,  
or if the initiator does the coalescing itself?

2) If you care about performance, you don't try to fill the device  
queue; you just want to have enough outstanding so that the device  
doesn't go idle when there is work to do.

The reason why you do this has to do with the access-scheduling 
algorithm in the target more than anything else; brain-damaged 
marketing values small average access times more than a small 
variance in access times, so the device folks do crazy 
shortest-access-time-first scheduling instead of something more sane 
and less prone to spreading out the access-time distribution, like CSCAN.
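
For contrast, here is a toy C-SCAN pass over a set of pending LBAs
(illustrative only; real firmware schedules on rotational position as
well as seek distance):

#include <stdlib.h>

/* Toy C-SCAN ("circular elevator"): service requests at or beyond
 * the current head position in ascending LBA order, then wrap to
 * the lowest LBA. One-directional sweeps bound the wait of any
 * single request, which keeps access-time variance low. */
static int cmp_lba(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Reorders lba[0..n) into C-SCAN service order for a given head. */
void cscan_order(unsigned long *lba, size_t n, unsigned long head)
{
    unsigned long *tmp;
    size_t i, split = 0;

    qsort(lba, n, sizeof(*lba), cmp_lba);
    while (split < n && lba[split] < head)
        split++;              /* first request ahead of the head */
    tmp = malloc(n * sizeof(*lba));
    if (!tmp)
        return;               /* keep plain sorted order on failure */
    for (i = 0; i < n; i++)   /* rotate: sweep forward, then wrap */
        tmp[i] = lba[(split + i) % n];
    for (i = 0; i < n; i++)
        lba[i] = tmp[i];
    free(tmp);
}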

> When systems substitute an oversized queue in a
> target for initiator-target flow control, the initiator ends up  
> having to
> compensate with artificial schemes to withhold work from a willing  
> target
> (e.g. Linux "queue plugging").

1) The SCSI architectural standard does not prescribe any method for  
initiator-target flow control other than TASK SET FULL and BUSY.  
There's nothing wrong with X-ON and X-OFF for flow control,  
especially when you cannot deterministically calculate a window size.

2) Tell the device folks to switch from shortest-access-time-first  
scheduling to something less aggressive like CSCAN, and then you  
might be able to tolerate the device queuing better.

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125




* Re: SCSI target and IO-throttling
  2006-03-03 18:07 ` Steve Byan
  2006-03-03 18:47   ` Stefan Richter
  2006-03-06 19:15   ` Bryan Henderson
@ 2006-03-07 17:53   ` Vladislav Bolkhovitin
  2006-03-07 18:19     ` Steve Byan
  2 siblings, 1 reply; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-07 17:53 UTC (permalink / raw)
  To: Steve Byan; +Cc: linux-scsi

Steve Byan wrote:
> 
> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
> 
>> Could anyone advise how a SCSI target device can IO-throttle its  
>> initiators, i.e. prevent them from queuing too many commands, please?
>>
>> I suppose the best way to do this is to inform the initiators of  
>> the maximum queue depth X of the target device, so that no initiator  
>> will send more than X commands. But I have not found anything like  
>> that in the INQUIRY or MODE SENSE pages. Have I missed something?  
>> Just returning QUEUE FULL status doesn't look correct, because it  
>> can lead to out-of-order command execution.
> 
> 
> Returning QUEUE FULL status is correct, unless the initiator does not  
> have any pending commands on the LUN, in which case you should return  
> BUSY. Yes, this can lead to out-of-order execution. That's why tapes  
> have traditionally not used SCSI command queuing.
> 
> Look into the unit attention interlock feature added to SCSI as a  
> result of uncovering this issue during the development of the iSCSI  
> standard.
> 
>> Apparently, hardware SCSI targets don't suffer from queue overflow 
>> and don't return QUEUE FULL status all the time, so there must be a 
>> way to do the throttling more elegantly.
> 
> 
> No, they just have big queues.

Thanks for the reply!

Things are getting clearer for me now, but there are still a few things 
that are not very clear to me. I hope they won't require overly long 
answers. I'm asking because we in the SCST project (a SCSI target 
mid-level for Linux plus some target drivers, http://scst.sourceforge.net) 
must emulate correct SCSI target device behavior under any IO load, 
including extremely high load.

  - Can you estimate, please, how big the target's command queue should 
be so that initiators never receive QUEUE FULL status? Consider the case 
where the initiators are Linux-based and each has a separate and 
independent queue.

  - The queue could be so big that the last command in it cannot be 
processed before the initiator's timeout; then, once the timeout hits, 
the initiator would start issuing ABORTs for the timed-out command. Is 
that OK behavior? Or rather a misconfiguration (of whom, initiator or 
target?)? Is the initiator in such a situation supposed to reissue the 
command after the preceding ones finish, or to behave in some other way? 
Apparently, ABORTs must hurt performance to a similar degree as too many 
QUEUE FULLs, if not more.

It seems we should set up a queue of virtually unlimited size on the 
target and, if an initiator is dumb enough to queue so many commands 
that there are timeouts, then it will be its problem and duty to handle 
the situation without performance loss. Does that look OK?

Thanks,
Vlad

> Regards,
> -Steve




* Re: SCSI target and IO-throttling
  2006-03-06 19:15   ` Bryan Henderson
  2006-03-06 19:55     ` Steve Byan
@ 2006-03-07 17:56     ` Vladislav Bolkhovitin
  2006-03-07 18:38       ` Steve Byan
  1 sibling, 1 reply; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-07 17:56 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: Steve Byan, linux-scsi

Bryan Henderson wrote:
>>On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>
>>
>>>Could anyone advise how a SCSI target device can IO-throttle its 
>>>initiators, i.e. prevent them from queuing too many commands, please?
>>>
>>>I suppose the best way to do this is to inform the initiators of the 
>>>maximum queue depth X of the target device, so that no initiator will 
>>>send more than X commands. But I have not found anything like that in 
>>>the INQUIRY or MODE SENSE pages. Have I missed something? Just 
>>>returning QUEUE FULL status doesn't look correct, because it can lead 
>>>to out-of-order command execution.
>>
>>Returning QUEUE FULL status is correct, unless the initiator does not 
>>have any pending commands on the LUN, in which case you should return 
>>BUSY. Yes, this can lead to out-of-order execution. That's why tapes 
>>have traditionally not used SCSI command queuing.
> 
> 
> I'm confused; Vladislav appears to be asking about flow control such as 
> is built into iSCSI, wherein the iSCSI target tells the initiator how 
> many tasks it's willing to work on at once, and the initiator stops sending 
> new ones when it has hit that limit and waits for one of the previous ones 
> to finish.  And the target can continuously change that number.

Yes, exactly.

> With the more primitive transports, I believe this is a manual 
> configuration step -- the target has a fixed maximum queue depth and you 
> tell the driver via some configuration parameter what it is.

We currently deal mostly with Fibre Channel, which seems to be a kind of 
"more primitive transport" without explicit flow control. Actually, I'm 
very surprised and can hardly believe that such an advanced and expensive 
technology doesn't have something as basic as good flow control. 
Although, precisely speaking, such flow control would sit at the level 
above the transport (this is true for iSCSI as well), so this is a SCSI 
flaw, not an FC one.

> As I understand it, any system in which QUEUE FULL (that's another name 
> for SCSI's Task Set Full, isn't it?) errors happen is one that is not 
> properly configured.  I saw a broken iSCSI system that had QUEUE FULLs 
> happening, and it was a performance disaster.

That is what we observe, too: too many QUEUE FULLs degrade performance 
considerably.

>>>Apparently, hardware SCSI targets don't suffer from queue 
>>>overflow and don't return QUEUE FULL status all the time, so there 
>>>must be a way to do the throttling more elegantly.
>>
>>No, they just have big queues.
> 
> Big queues are another serious performance problem, when it means a target 
> accepts work faster than it can do it.  I've seen that cause initiators to 
> send suboptimal requests (if the target appears to be working at infinite 
> speed, the initiator sends small chunks of work as soon as each is ready, 
> whereas if the initiator can tell that the target is choked, the initiator 
> combines and sorts work while it waits, into a stream the target can 
> handle more efficiently).  When systems substitute an oversized queue in a 
> target for initiator-target flow control, the initiator ends up having to 
> compensate with artificial schemes to withhold work from a willing target 
> (e.g. Linux "queue plugging").

This is one reason why I don't like having an oversized queue on the 
target. Another is initiator-side timeouts when the queue is so big that 
it cannot be drained in time. I described that in the previous email.

Thanks,
Vlad


* Re: SCSI target and IO-throttling
  2006-03-07 17:53   ` Vladislav Bolkhovitin
@ 2006-03-07 18:19     ` Steve Byan
  2006-03-07 18:46       ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 25+ messages in thread
From: Steve Byan @ 2006-03-07 18:19 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: linux-scsi


On Mar 7, 2006, at 12:53 PM, Vladislav Bolkhovitin wrote:

> Steve Byan wrote:
>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>> Could anyone advise how a SCSI target device can IO-throttle its  
>>> initiators, i.e. prevent them from queuing too many commands,  
>>> please?
>>>
>>> I suppose the best way to do this is to inform the initiators of  
>>> the maximum queue depth X of the target device, so that no  
>>> initiator will send more than X commands. But I have not found  
>>> anything like that in the INQUIRY or MODE SENSE pages. Have I  
>>> missed something? Just returning QUEUE FULL status doesn't look  
>>> correct, because it can lead to out-of-order command execution.
>> Returning QUEUE FULL status is correct, unless the initiator does  
>> not  have any pending commands on the LUN, in which case you  
>> should return  BUSY. Yes, this can lead to out-of-order execution.  
>> That's why tapes  have traditionally not used SCSI command queuing.
>> Look into the unit attention interlock feature added to SCSI as a   
>> result of uncovering this issue during the development of the  
>> iSCSI  standard.
>>> Apparently, hardware SCSI targets don't suffer from queue  
>>> overflow and don't return QUEUE FULL status all the time, so there  
>>> must be a way to do the throttling more elegantly.
>> No, they just have big queues.
>
> Thanks for the reply!
>
> Things are getting clearer for me now, but there are still a few  
> things that are not very clear to me. I hope they won't require  
> overly long answers. I'm asking because we in the SCST project (a  
> SCSI target mid-level for Linux plus some target drivers,  
> http://scst.sourceforge.net) must emulate correct SCSI target device  
> behavior under any IO load, including extremely high load.
>
>  - Can you estimate, please, how big the target's command queue  
> should be so that initiators never receive QUEUE FULL status?  
> Consider the case where the initiators are Linux-based and each has  
> a separate and independent queue.

Do you have a per-target pool of resources for handling commands, or  
are the pools per-logical-unit?

I'm not sure you could size the queue so that TASK_SET_FULL is never  
returned. Just accept the fact that the target must return  
TASK_SET_FULL or BUSY sometimes.

As a data-point, some modern SCSI disks support queue depths in the  
range of 128 to 256 commands.

>  - The queue could be so big that the last command in it cannot be  
> processed before the initiator's timeout; then, once the timeout  
> hits, the initiator would start issuing ABORTs for the timed-out  
> command. Is that OK behavior?

Well, it's the behavior implied by the SCSI standard; that is, on a  
timeout, the initiator should abort the command. If an initiator sets  
its timeout to less than the queuing delay at the server, I wouldn't  
call that "OK behavior", but it's not the target's fault, it's the  
initiator's fault.

> Or rather a misconfiguration (of whom, initiator or target?)? Is the  
> initiator in such a situation supposed to reissue the command after  
> the preceding ones finish, or to behave in some other way?

I think it's up to the class driver to decide whether to retry a  
command after it times out.

> Apparently, ABORTs must hurt performance to a similar degree  
> as too many QUEUE FULLs, if not more.

Much worse, I would think.

> It seems we should set up a queue of virtually unlimited size on  
> the target and, if an initiator is dumb enough to queue so many  
> commands that there are timeouts, then it will be its problem and  
> duty to handle the situation without performance loss. Does that look OK?

I don't think you need to pick an unlimited size. Something on the  
order of 128 to 512 commands should be sufficient. If you have  
multiple logical units, you could probably combine them in a common  
pool and somewhat reduce the number of command resources you allocate  
per logical unit, on the theory that they'll not all be fully  
utilized at the same time.


By the way, make sure you don't deadlock trying to obtain command  
resources to return TASK_SET_FULL or BUSY to a command in the case  
where the pool of command resources is exhausted. This is one of the  
tricky bits.
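
One standard way to avoid that deadlock is to reserve the status-reply
resource up front, so the reply path never has to allocate from an
exhausted pool. A hedged sketch of the idea (invented names, not SCST's
actual internals):

/* A pool with one preallocated emergency slot, so the target can
 * always build a TASK_SET_FULL/BUSY reply even when the ordinary
 * free list is empty. The emergency slot is returned to the pool
 * as soon as the status-only reply has been sent. */
struct cmd {
    struct cmd *next;
    /* ... payload omitted ... */
};

struct cmd_pool {
    struct cmd *free_list;  /* ordinary command resources   */
    struct cmd *reserve;    /* only for status-only replies */
};

struct cmd *get_cmd_for_status_reply(struct cmd_pool *p)
{
    struct cmd *c = p->free_list;

    if (c) {
        p->free_list = c->next;
        return c;
    }
    /* Pool exhausted: use the reserved slot instead of blocking,
     * which is where the deadlock described above would arise. */
    c = p->reserve;
    p->reserve = NULL;  /* NULL here means a reply is in flight */
    return c;
}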


Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125




* Re: SCSI target and IO-throttling
  2006-03-07 17:56     ` Vladislav Bolkhovitin
@ 2006-03-07 18:38       ` Steve Byan
  0 siblings, 0 replies; 25+ messages in thread
From: Steve Byan @ 2006-03-07 18:38 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi


On Mar 7, 2006, at 12:56 PM, Vladislav Bolkhovitin wrote:

> Bryan Henderson wrote:
>>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>>
>>>
>>>> Could anyone advise how a SCSI target device can IO-throttle its  
>>>> initiators, i.e. prevent them from queuing too many commands,  
>>>> please?
>>>>
>>>> I suppose the best way to do this is to inform the initiators of  
>>>> the maximum queue depth X of the target device, so that no  
>>>> initiator will send more than X commands. But I have not found  
>>>> anything like that in the INQUIRY or MODE SENSE pages. Have I  
>>>> missed something? Just returning QUEUE FULL status doesn't look  
>>>> correct, because it can lead to out-of-order command execution.
>>>
>>> Returning QUEUE FULL status is correct, unless the initiator does  
>>> not have any pending commands on the LUN, in which case you  
>>> should return BUSY. Yes, this can lead to out-of-order execution.  
>>> That's why tapes have traditionally not used SCSI command queuing.
>> I'm confused; Vladislav appears to be asking about flow control  
>> such as is built into iSCSI, wherein the iSCSI target tells the  
>> initiator how many tasks it's willing to work on at once, and the  
>> initiator stops sending new ones when it has hit that limit and  
>> waits for one of the previous ones to finish.  And the target can  
>> continuously change that number.
>
> Yes, exactly.
>
>> With the more primitive transports, I believe this is a manual  
>> configuration step -- the target has a fixed maximum queue depth  
>> and you tell the driver via some configuration parameter what it is.
>
> We currently deal mostly with Fibre Channel, which seems to be a  
> kind of "more primitive transport" without explicit flow control.  
> Actually, I'm very surprised and can hardly believe that such an  
> advanced and expensive technology doesn't have something as basic as  
> good flow control. Although, precisely speaking, such flow control  
> would sit at the level above the transport (this is true for iSCSI  
> as well), so this is a SCSI flaw, not an FC one.

It has X-ON and X-OFF flow control. Not bad, considering it was  
designed in the early 1980s.

X-OFF is TASK_SET_FULL or BUSY.
X-ON is a command completing; or, if BUSY was received because the  
initiator did not have any outstanding commands at the target, then  
X-ON is implied after a short time delay.

Since an intelligently-designed initiator isn't going to dump every  
command to the device anyway (after all, the person writing the  
initiator driver wants to have some fun implementing I/O  
optimizations too; can't let those target folk have all the fun :-),  
the XON/XOFF flow control isn't often invoked.
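
From the initiator's side, this amounts to a self-tuning queue depth:
back off when the target signals X-OFF, probe upward again as commands
complete cleanly. A hedged sketch (invented names, not any real
driver's API):

/* Shrink the window on TASK_SET_FULL/BUSY; slowly reopen it on
 * clean completions. Illustrative only. */
struct lun_throttle {
    unsigned depth;      /* commands currently outstanding     */
    unsigned limit;      /* current self-imposed queue depth   */
    unsigned max_limit;  /* configured ceiling, never exceeded */
};

void on_completion(struct lun_throttle *t, int was_task_set_full)
{
    t->depth--;
    if (was_task_set_full) {
        /* X-OFF: fall back to what the target demonstrably took. */
        if (t->depth > 0 && t->depth < t->limit)
            t->limit = t->depth;
    } else if (t->limit < t->max_limit) {
        t->limit++;          /* X-ON: creep upward again */
    }
}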

>> As I understand it, any system in which QUEUE FULL (that's another  
>> name for SCSI's Task Set Full, isn't it?) errors happen is one  
>> that is not properly configured.  I saw a broken iSCSI system that  
>> had QUEUE FULLs happening, and it was a performance disaster.
>
> That is what we observe, too: too many QUEUE FULLs degrade  
> performance considerably.

Sounds like a broken initiator.

>
>>>> Apparently, hardware SCSI targets don't suffer from queue  
>>>> overflow and don't return QUEUE FULL status all the time, so  
>>>> there must be a way to do the throttling more elegantly.
>>>
>>> No, they just have big queues.
>> Big queues are another serious performance problem, when it means  
>> a target accepts work faster than it can do it.  I've seen that  
>> cause initiators to send suboptimal requests (if the target  
>> appears to be working at infinite speed, the initiator sends small  
>> chunks of work as soon as each is ready, whereas if the initiator  
>> can tell that the target is choked, the initiator combines and  
>> sorts work while it waits, into a stream the target can handle  
>> more efficiently).  When systems substitute an oversized queue in  
>> a target for initiator-target flow control, the initiator ends up  
>> having to compensate with artificial schemes to withhold work from  
>> a willing target (e.g. Linux "queue plugging").
>
> This is one reason why I don't like having an oversized queue on  
> the target.

This is just a matter of taste of whether you prefer the optimization  
to be done on the initiator side or the target side. If you prefer it  
to be done on the initiator side, then don't queue large amounts of  
work at the target.

> Another is initiator-side timeouts when the queue is so big that  
> it cannot be drained in time. I described that in the previous email.

This is just a bug in the initiator. It can observe the average  
service time and it knows how many commands it has queued. If it sets  
its timeout anywhere close to the product of those two numbers it is  
buggy.
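
As a worked example of that rule (illustrative numbers): with 200
commands queued and an average service time of 10 ms, the tail of the
queue waits roughly 200 x 10 ms = 2 s, so a 1-second command timeout is
guaranteed to fire under load, while a 30- or 60-second timeout leaves
a comfortable margin.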

Regards,
-Steve

-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125




* Re: SCSI target and IO-throttling
  2006-03-07 18:19     ` Steve Byan
@ 2006-03-07 18:46       ` Vladislav Bolkhovitin
  2006-03-07 19:00         ` Steve Byan
  0 siblings, 1 reply; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-07 18:46 UTC (permalink / raw)
  To: Steve Byan; +Cc: linux-scsi

Steve Byan wrote:
> 
> On Mar 7, 2006, at 12:53 PM, Vladislav Bolkhovitin wrote:
> 
>> Steve Byan wrote:
>>
>>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>>
>>>> Could anyone advise how a SCSI target device can IO-throttle its  
>>>> initiators, i.e. prevent them from queuing too many commands, please?
>>>>
>>>> I suppose the best way to do this is to inform the initiators of 
>>>> the maximum queue depth X of the target device, so that no 
>>>> initiator will send more than X commands. But I have not found 
>>>> anything like that in the INQUIRY or MODE SENSE pages. Have I 
>>>> missed something? Just returning QUEUE FULL status doesn't look 
>>>> correct, because it can lead to out-of-order command 
>>>> execution.
>>>
>>> Returning QUEUE FULL status is correct, unless the initiator does  
>>> not  have any pending commands on the LUN, in which case you  should 
>>> return  BUSY. Yes, this can lead to out-of-order execution.  That's 
>>> why tapes  have traditionally not used SCSI command queuing.
>>> Look into the unit attention interlock feature added to SCSI as a   
>>> result of uncovering this issue during the development of the  iSCSI  
>>> standard.
>>>
>>>> Apparently, hardware SCSI targets don't suffer from queue  
>>>> overflow and don't return QUEUE FULL status all the time, so  
>>>> there must be a way to do the throttling more elegantly.
>>>
>>> No, they just have big queues.
>>
>>
>> Thanks for the reply!
>>
>> Things are getting clearer for me now, but there are still a few things 
>> that are not very clear to me. I hope they won't require overly long 
>> answers. I'm asking because we in the SCST project (a SCSI target 
>> mid-level for Linux plus some target drivers, http://scst.sourceforge.net) 
>> must emulate correct SCSI target device behavior under any IO load, 
>> including extremely high load.
>>
>>  - Can you estimate, please, how big the target's command queue should 
>> be so that initiators never receive QUEUE FULL status? Consider the 
>> case where the initiators are Linux-based and each has a separate 
>> and independent queue.
> 
> 
> Do you have a per-target pool of resources for handling commands, or are 
> the pools per-logical-unit?

The most limited resource is the memory allocated for command buffers. 
It is per-target. Other resources, like internal command structures, are 
so small that they can be considered virtually unlimited. They are also 
global, but accounting is done per (session (nexus), LU) pair.

> I'm not sure you could size the queue so that TASK_SET_FULL is never  
> returned. Just accept the fact that the target must return TASK_SET_FULL 
> or BUSY sometimes.

We have a relatively cheap method of queuing commands without allocating 
buffers for them. This way, millions of commands could be queued on an 
average Linux box without problems. Only ABORTs and their influence on 
performance worry me.
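
A sketch of what such two-stage queuing might look like (invented
structure; SCST's real accounting surely differs): the descriptor is
tiny and queued immediately, while the scarce per-target buffer memory
is committed only when the command is about to be serviced.

#include <stddef.h>

/* A queued command without its data buffer: cheap enough that
 * millions can wait, as described above. The buffer is allocated
 * from the per-target memory pool only at service time. */
struct queued_cmd {
    struct queued_cmd *next;
    unsigned char cdb[16];  /* the SCSI command itself: tiny      */
    size_t data_len;        /* buffer to allocate at service time */
    void *buf;              /* NULL while the command just waits  */
};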

> As a data-point, some modern SCSI disks support queue depths in the  
> range of 128 to 256 commands.

I was rather asking about the practical upper limit. From our 
observations, a Linux initiator can easily send 128+ commands, but 
usually sends fewer. It looks like this depends on its available memory. 
I'd be interested to know the exact rule.

>>  - The queue could be so big that the last command in it cannot be  
>> processed before the initiator's timeout; then, once the  
>> timeout hits, the initiator would start issuing ABORTs for the  
>> timed-out command. Is that OK behavior?
> 
> 
> Well, it's the behavior implied by the SCSI standard; that is, on a  
> timeout, the initiator should abort the command. If an initiator sets  
> its timeout to less than the queuing delay at the server, I wouldn't  
> call that "OK behavior", but it's not the target's fault, it's the  
> initiator's fault.
> 
>> Or rather a misconfiguration (of whom, initiator or target?)? Is the  
>> initiator in such a situation supposed to reissue the command after the 
>> preceding ones finish, or to behave in some other way?
> 
> 
> I think it's up to the class driver to decide whether to retry a  
> command after it times out.
> 
>> Apparently, ABORTs must hurt performance to a similar degree as 
>> too many QUEUE FULLs, if not more.
> 
> 
> Much worse, I would think.
> 
>> It seems we should set up a queue of virtually unlimited size on the  
>> target and, if an initiator is dumb enough to queue so many commands  
>> that there are timeouts, then it will be its problem and duty to 
>> handle the situation without performance loss. Does that look OK?
> 
> 
> I don't think you need to pick an unlimited size. Something on the  
> order of 128 to 512 commands should be sufficient. If you have  multiple 
> logical units, you could probably combine them in a common  pool and 
> somewhat reduce the number of command resources you allocate  per 
> logical unit, on the theory that they'll not all be fully  utilized at 
> the same time.

OK

> By the way, make sure you don't deadlock trying to obtain command 
> resources to return TASK_SET_FULL or BUSY to a command in the case  
> where the pool of command resources is exhausted. This is one of the  
> tricky bits.

In our architecture there is no need to allocate any additional 
resources to reply with TASK_SET_FULL or BUSY, so we have already taken 
care of this.

Thanks,
Vlad


* Re: SCSI target and IO-throttling
  2006-03-07 18:46       ` Vladislav Bolkhovitin
@ 2006-03-07 19:00         ` Steve Byan
  0 siblings, 0 replies; 25+ messages in thread
From: Steve Byan @ 2006-03-07 19:00 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: linux-scsi


On Mar 7, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:

> Steve Byan wrote:

>> As a data-point, some modern SCSI disks support queue depths in  
>> the  range of 128 to 256 commands.
>
> I was rather asking about the practical upper limit. From our  
> observations, a Linux initiator can easily send 128+ commands, but  
> usually sends fewer. It looks like this depends on its available  
> memory. I'd be interested to know the exact rule.

I don't know the rule. Obviously, it could change over time, and be  
different for different OSes.

Sounds to me like you might be trying to fix a busted initiator by  
changing the target behavior.

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125




* Re: SCSI target and IO-throttling
  2006-03-06 19:55     ` Steve Byan
@ 2006-03-07 23:32       ` Bryan Henderson
  2006-03-08 15:35         ` Vladislav Bolkhovitin
  2006-03-10 13:26         ` Steve Byan
  0 siblings, 2 replies; 25+ messages in thread
From: Bryan Henderson @ 2006-03-07 23:32 UTC (permalink / raw)
  To: Steve Byan; +Cc: linux-scsi, Vladislav Bolkhovitin

>> With the more primitive transports,
>
>Seems like a somewhat loaded description to me. Personally, I'd pick 
>something more neutral.

Unfortunately, it's exactly what I mean.  I understand that some people 
attach negative connotations to primitivity, but I can't let that get in 
the way of clarity.

>> I believe this is a manual
>> configuration step -- the target has a fixed maximum queue depth 
>> and you
>> tell the driver via some configuration parameter what it is.
>
>Not true. Consider the case where multiple initiators share one 
>logical unit  - there is no guarantee that a single initiator can 
>queue even a single command, since another initiator may have filled 
>the queue at the device.

I'm not sure what it is that you're saying isn't true.  You do give a good 
explanation of why designers would want something more sophisticated than 
this, but that doesn't mean every SCSI implementation actually is.  Are 
you saying there are no SCSI targets so primitive that they have a fixed 
maximum queue depth?  That there are no systems where you manually set the 
maximum requests-in-flight at the initiator in order to optimally drive 
such targets?

>> I saw a broken ISCSI system that had QUEUE FULLs
>> happening, and it was a performance disaster.
>
>Was it a performance disaster because of the broken-ness, or solely 
>because of the TASK SET FULLs?

Because of the broken-ness.  Task Set Full is the symptom, not the 
disease.  I should add that in this system, there was no way to make it 
perform optimally and also see Task Set Full regularly.

You mentioned in another email that FCP is designed to use Task Set Full 
for normal flow control.  I heard that before, but didn't believe it; I 
thought  FCP was more advanced than that.  But I believe it now.  So I was 
wrong to say that Task Set Full happening means a system is misconfigured. 
 But it's still the case that if you can design a system in which Task Set 
Full never happens, it will perform better than one in which it does. 
iSCSI flow control and manual setting of queue sizes in initiators are two 
ways people do that.

>1) Considering only first-order effects, who cares whether the 
>initiator sends sub-optimal requests and the target coalesces them, 
>or if the initiator does the coalescing itself?

I don't know what a first-order effect is, so this may be out of bounds, 
but here's a reason to care: the initiator may have more resources 
available to do the work than the target.  We're talking here about a 
saturated target (which, rather than admit it's overwhelmed, keeps 
accepting new tasks).

But it's really the wrong question, because the more important question is 
would you rather have the initiator do the coalescing or nobody?  There 
exist targets that are not capable of combining or ordering tasks, and 
still accept large queues of them.  These are the ones I saw have 
improperly large queues.  A target that can actually make use of a large 
backlog of work, on the other hand, is right to accept one.

I have seen people try to improve performance of a storage system by 
increasing queue depth in the target such as this.  They note that the 
queue is always full, so it must need more queue space.  But this degrades 
performance, because on one of these first-in-first-out targets, the only 
way to get peak capacity is to keep the queue full all the time so as to 
create backpressure and cause the initiator to schedule the work. 
Increasing the queue depth increases the chance that the initiator will 
not have the backlog necessary to do that scheduling.  The correct queue 
depth on this kind of target is the number of requests the target can 
process within the initiator's (and channel's) turnaround time.
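
(A worked example of that sizing rule, with illustrative numbers: a
target that completes 5,000 requests per second behind a 2 ms
initiator-plus-channel turnaround needs a queue of only about
5000/s x 0.002 s = 10 requests to stay busy; any capacity beyond that
merely hides the backpressure.)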

>brain-damaged 
>marketing values small average access times more than a small 
>variance in access times, so the device folks do crazy 
>shortest-access-time-first scheduling instead of something more sane 
>and less prone to spreading out the access-time distribution, like CSCAN.

Since I'm talking about targets that don't do anything close to that 
sophisticated with the stuff in their queue, this doesn't apply.

But I do have to point out that there are systems where throughput is 
everything, and response time, including variability of it, is nothing. In 
fact, the systems I work with are mostly that kind.  For that kind of 
system, you'd want the target to do that kind of scheduling.

>2) If you care about performance, you don't try to fill the device 
>queue; you just want to have enough outstanding so that the device 
>doesn't go idle when there is work to do.

Why would the queue have a greater capacity than what is needed when you 
care about performance?  Is there some non-performance reason to have a 
giant queue?

I still think having a giant queue is not a solution to any flow control 
(or, in the words of the original problem, I/O throttling) problem.  I'm 
even skeptical that there's any size you can make one that would avoid 
queue full conditions.  It would be like avoiding difficult memory 
allocation algorithms by just having a whole lot of memory.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems



* Re: SCSI target and IO-throttling
  2006-03-07 23:32       ` Bryan Henderson
@ 2006-03-08 15:35         ` Vladislav Bolkhovitin
  2006-03-08 15:56           ` Steve Byan
  2006-03-10 13:26         ` Steve Byan
  1 sibling, 1 reply; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-08 15:35 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: Steve Byan, linux-scsi

Bryan Henderson wrote:
>>>With the more primitive transports,
>>
>>Seems like a somewhat loaded description to me. Personally, I'd pick 
>>something more neutral.
> 
> 
> Unfortunately, it's exactly what I mean.  I understand that some people 
> attach negative connotations to primitivity, but I can't let that get in 
> the way of clarity.
> 
> 
>>>I believe this is a manual
>>>configuration step -- the target has a fixed maximum queue depth 
>>>and you
>>>tell the driver via some configuration parameter what it is.
>>
>>Not true. Consider the case where multiple initiators share one 
>>logical unit  - there is no guarantee that a single initiator can 
>>queue even a single command, since another initiator may have filled 
>>the queue at the device.
> 
> 
> I'm not sure what it is that you're saying isn't true.  You do give a good 
> explanation of why designers would want something more sophisticated than 
> this, but that doesn't mean every SCSI implementation actually is.  Are 
> you saying there are no SCSI targets so primitive that they have a fixed 
> maximum queue depth?  That there are no systems where you manually set the 
> maximum requests-in-flight at the initiator in order to optimally drive 
> such targets?
> 
> 
>>>I saw a broken iSCSI system that had QUEUE FULLs
>>>happening, and it was a performance disaster.
>>
>>Was it a performance disaster because of the broken-ness, or solely 
>>because of the TASK SET FULLs?
> 
> 
> Because of the broken-ness.  Task Set Full is the symptom, not the 
> disease.  I should add that in this system, there was no way to make it 
> perform optimally and also see Task Set Full regularly.
> 
> You mentioned in another email that FCP is designed to use Task Set Full 
> for normal flow control.  I heard that before, but didn't believe it; I 
> thought  FCP was more advanced than that.  But I believe it now.  So I was 
> wrong to say that Task Set Full happening means a system is misconfigured. 
>  But it's still the case that if you can design a system in which Task Set 
> Full never happens, it will perform better than one in which it does. 
> iSCSI flow control and manual setting of queue sizes in initiators are two 
> ways people do that.
> 
> 
>>1) Considering only first-order effects, who cares whether the 
>>initiator sends sub-optimal requests and the target coalesces them, 
>>or if the initiator does the coalescing itself?
> 
> 
> I don't know what a first-order effect is, so this may be out of bounds, 
> but here's a reason to care: the initiator may have more resources 
> available to do the work than the target.  We're talking here about a 
> saturated target (which, rather than admit it's overwhelmed, keeps 
> accepting new tasks).
> 
> But it's really the wrong question, because the more important question is 
> would you rather have the initiator do the coalescing or nobody?  There 
> exist targets that are not capable of combining or ordering tasks, and 
> still accept large queues of them.  These are the ones I saw have 
> improperly large queues.  A target that can actually make use of a large 
> backlog of work, on the other hand, is right to accept one.
> 
> I have seen people try to improve performance of a storage system by 
> increasing queue depth in the target such as this.  They note that the 
> queue is always full, so it must need more queue space.  But this degrades 
> performance, because on one of these first-in-first-out targets, the only 
> way to get peak capacity is to keep the queue full all the time so as to 
> create backpressure and cause the initiator to schedule the work. 
> Increasing the queue depth increases the chance that the initiator will 
> not have the backlog necessary to do that scheduling.  The correct queue 
> depth on this kind of target is the number of requests the target can 
> process within the initiator's (and channel's) turnaround time.
> 
> 
>>brain-damaged 
>>marketing values small average access times more than a small 
>>variance in access times, so the device folks do crazy 
>>shortest-access-time-first scheduling instead of something more sane 
>>and less prone to spreading out the access-time distribution, like CSCAN.
> 
> 
> Since I'm talking about targets that don't do anything close to that 
> sophisticated with the stuff in their queue, this doesn't apply.
> 
> But I do have to point out that there are systems where throughput is 
> everything, and response time, including variability of it, is nothing. In 
> fact, the systems I work with are mostly that kind.  For that kind of 
> system, you'd want the target to do that kind of scheduling.
> 
> 
>>2) If you care about performance, you don't try to fill the device 
>>queue; you just want to have enough outstanding so that the device 
>>doesn't go idle when there is work to do.
> 
> 
> Why would the queue have a greater capacity than what is needed when you 
> care about performance?  Is there some non-performance reason to have a 
> giant queue?
> 
> I still think having a giant queue is not a solution to any flow control 
> (or, in the words of the original problem, I/O throttling) problem.  I'm 
> even skeptical that there's any size you can make one that would avoid 
> queue full conditions.  It would be like avoiding difficult memory 
> allocation algorithms by just having a whole lot of memory.

Yes, you're correct. But can you formulate a practical common rule, 
working on any SCSI transport including FC, by which a SCSI target that 
knows some limit can tell it to an initiator, so that the initiator will 
not try to queue too many commands? It looks like I have no choice 
except a "giant" queue on the target, hoping that the initiators are 
smart enough not to queue so many commands that they start seeing 
timeouts.

Vlad

> --
> Bryan Henderson                     IBM Almaden Research Center
> San Jose CA                         Filesystems
> 
> 



* Re: SCSI target and IO-throttling
  2006-03-08 15:35         ` Vladislav Bolkhovitin
@ 2006-03-08 15:56           ` Steve Byan
  2006-03-08 17:49             ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 25+ messages in thread
From: Steve Byan @ 2006-03-08 15:56 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi


On Mar 8, 2006, at 10:35 AM, Vladislav Bolkhovitin wrote:

> Bryan Henderson wrote:
>> Why would the queue have a greater capacity than what is needed  
>> when you care about performance?  Is there some non-performance  
>> reason to have a giant queue?
>> I still think having a giant queue is not a solution to any flow  
>> control (or, in the words of the original problem, I/O throttling)  
>> problem.  I'm even skeptical that there's any size you can make  
>> one that would avoid queue full conditions.  It would be like  
>> avoiding difficult memory allocation algorithms by just having a  
>> whole lot of memory.
>
> Yes, you're correct. But can you formulate a practical common rule,  
> working on any SCSI transport including FC, by which a SCSI target  
> that knows some limit can tell it to an initiator, so that the  
> initiator will not try to queue too many commands? It looks like I  
> have no choice except a "giant" queue on the target, hoping that the  
> initiators are smart enough not to queue so many commands that they  
> start seeing timeouts.

I still don't understand why you are reluctant to return  
TASK_SET_FULL or BUSY in this case; it's what the SCSI standard  
supplies as the way to say "don't queue too many commands, please".

If you don't want to return TASK_SET_FULL, then yes, an effectively  
unbounded command queue is your only alternative.

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125




* Re: SCSI target and IO-throttling
  2006-03-08 15:56           ` Steve Byan
@ 2006-03-08 17:49             ` Vladislav Bolkhovitin
  2006-03-08 18:09               ` Steve Byan
  0 siblings, 1 reply; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-08 17:49 UTC (permalink / raw)
  To: Steve Byan; +Cc: Bryan Henderson, linux-scsi

Steve Byan wrote:
> 
> On Mar 8, 2006, at 10:35 AM, Vladislav Bolkhovitin wrote:
> 
>> Bryan Henderson wrote:
>>
>>> Why would the queue have a greater capacity than what is needed  when 
>>> you care about performance?  Is there some non-performance  reason to 
>>> have a giant queue?
>>> I still think having a giant queue is not a solution to any flow  
>>> control (or, in the words of the original problem, I/O throttling)  
>>> problem.  I'm even skeptical that there's any size you can make  one 
>>> that would avoid queue full conditions.  It would be like  avoiding 
>>> difficult memory allocation algorithms by just having a  whole lot of 
>>> memory.
>>
>>
>> Yes, you're correct. But can you formulate a practical common rule,  
>> working on any SCSI transport including FC, by which a SCSI target 
>> that knows some limit can tell it to an initiator, so that the 
>> initiator will not try to queue too many commands? It looks like I 
>> have no choice except a "giant" queue on the target, hoping that the 
>> initiators are smart enough not to queue so many commands that they 
>> start seeing timeouts.
> 
> 
> I still don't understand why you are reluctant to return  TASK_SET_FULL 
> or BUSY in this case; it's what the SCSI standard  supplies as the way 
> to say "don't queue too many commands, please".

I don't like out-of-order execution, which happens on practically all 
such "rejected" commands, because subsequent already-queued commands are 
not "rejected" along with it and some of them could be accepted ahead of 
it. And the initiator (Linux with an FC driver) is dumb enough to hit 
this TASK_SET_FULL again and again until the queue is large enough. So I 
can see only one solution that almost eliminates breaking the order: an 
unbounded command queue.

But maybe I should think and experiment more, and relax the ordering 
restriction...

Thanks,
Vlad

> If you don't want to return TASK_SET_FULL, then yes, an effectively  
> unbounded command queue is your only alternative.
> 
> Regards,
> -Steve



* Re: SCSI target and IO-throttling
  2006-03-08 17:49             ` Vladislav Bolkhovitin
@ 2006-03-08 18:09               ` Steve Byan
  2006-03-09 18:37                 ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 25+ messages in thread
From: Steve Byan @ 2006-03-08 18:09 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi


On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:

> Steve Byan wrote:
>>

>> I still don't understand why you are reluctant to return   
>> TASK_SET_FULL or BUSY in this case; it's what the SCSI standard   
>> supplies as the way to say "don't queue too many commands, please".
>
> I don't like out-of-order execution, which happens on practically  
> all such "rejected" commands, because subsequent already-queued  
> commands are not "rejected" along with it and some of them could be  
> accepted ahead of it.

I see, you care about order. So do tapes. The historical answer has  
been to not support tagged command queuing when you care about  
ordering. To dodge the performance problem due to lack of queuing,  
the targets usually implement a read-ahead and write-behind cache,  
and then perform queuing behind the scenes, after telling the  
initiator that the command has completed. Of course, this has obvious  
data integrity issues for disk-type logical units.

The solution, introduced for tapes concurrently with iSCSI (which 
motivated the need for command queuing for tapes, since some envisioned 
backing up to a tape drive located 3000 miles away), is something called 
"unit-attention interlock", or "UA interlock". Check out page 287 of 
draft revision 23 of the SCSI Primary Commands - 3 (SPC-3) standard from 
T10.org. The UA_INTLCK_CTRL field can be set to cause a persistent unit 
attention condition if a command was rejected with TASK_SET_FULL or 
BUSY.

This requires the cooperation of the initiator.
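
For the curious, here is a minimal sketch of decoding that field on
the initiator side. It assumes the SPC-3 layout of the Control mode
page (page code 0Ah, UA_INTLCK_CTRL in bits 5:4 of byte 4); the
offsets are from my reading of the draft, so double-check them
against it:

/* Sketch: decode UA_INTLCK_CTRL from a Control mode page (0Ah).
 * 'page' points at the start of the mode page itself, i.e. past the
 * mode parameter header and any block descriptors returned by MODE
 * SENSE. Offsets per my reading of SPC-3 rev 23 - verify them. */
#include <stdio.h>

static int ua_intlck_ctrl(const unsigned char *page)
{
    if ((page[0] & 0x3f) != 0x0a)   /* not the Control mode page */
        return -1;
    return (page[4] >> 4) & 0x3;    /* UA_INTLCK_CTRL, byte 4 bits 5:4 */
}

int main(void)
{
    /* made-up page contents, as a target with UA_INTLCK_CTRL = 3
     * might report them */
    unsigned char page[12] = { 0x0a, 0x0a, 0, 0, 0x30, 0,
                               0, 0, 0, 0, 0, 0 };

    switch (ua_intlck_ctrl(page)) {
    case 0:
        puts("0: UA cleared when CHECK CONDITION is reported");
        break;
    case 2:
        puts("2: UA not cleared on CHECK CONDITION");
        break;
    case 3:
        puts("3: as 2, plus UA established after BUSY, TASK SET FULL"
             " or RESERVATION CONFLICT");
        break;
    default:
        puts("not a Control mode page, or a reserved value");
    }
    return 0;
}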

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: SCSI target and IO-throttling
  2006-03-08 18:09               ` Steve Byan
@ 2006-03-09 18:37                 ` Vladislav Bolkhovitin
  2006-03-09 19:32                   ` Steve Byan
  0 siblings, 1 reply; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-09 18:37 UTC (permalink / raw)
  To: Steve Byan; +Cc: Bryan Henderson, linux-scsi

Steve Byan wrote:
> 
> On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:
> 
>> Steve Byan wrote:
>>
>>>
> 
>>> I still don't understand why you are reluctant to return   
>>> TASK_SET_FULL or BUSY in this case; it's what the SCSI standard   
>>> supplies as the way to say "don't queue too many commands, please".
>>
>>
>> I don't like out of order execution, which happens practically on  all 
>> such "rejected" commands, because subsequent already queued  commands 
>> are not "rejected" with it and some of them could be  accepted later.
> 
> 
> I see, you care about order. So do tapes. The historical answer has  
> been to not support tagged command queuing when you care about  
> ordering. To dodge the performance problem due to lack of queuing,  the 
> targets usually implement a read-ahead and write-behind cache,  and then 
> perform queuing behind the scenes, after telling the  initiator that the 
> command has completed. Of course, this has obvious  data integrity 
> issues for disk-type logical units.

Yes, tapes just can't work without strict ordering. SCST was originally 
done for tapes, so I still keep some kind of tape-oriented thinking :)

Actually, with current journaling file systems, ordering has become
more important for disks as well. The data integrity problem with
"behind the scenes" queuing can in practice be easily solved by
battery-backed power on the disks. In the case of TASK_SET_FULL
things are much worse, because the reordering happens _between_ the
target and the _initiator_: the initiator must explicitly retry the
"rejected" command. So if the initiator crashes before the command is
retried, and the FS on it uses ordering barriers to protect its
integrity (Linux seems to do so, but I could be wrong), the FS data
could be written out of order with respect to its journal and the FS
could be corrupted. Even worse, TASK_SET_FULL "rejects" basically
happen on every queue-length'th command, i.e. very often. This is why
I prefer the "dumb" but "safe" way. But I could be overestimating the
problem, because it looks like nobody cares about it...

> The solution introduced for tapes concurrent with iSCSI (which  
> motivated the need for command-queuing for tapes, since some  envisioned 
> backing up to a tape drive located on 3000 miles away is  something 
> called "unit-attention interlock", or "UA interlock". Check  out page 
> 287 of the draft revision 23 of the SCSI Primary Commands -  3 (SPC-3) 
> standard from T10.org. The UA_INTLCK_CTRL field can be set  to cause a 
> persistent unit attention condition if a command was  rejected with 
> TASK_SET_FULL or BUSY.

Thanks, I'll take a look.

> This requires the cooperation of the initiator.

Which in practice means that it will not work for at least several
years. I think I won't be wrong if I say that no Linux initiator uses
this feature, or is going to...

BTW, it is also impossible to correctly process command errors (CHECK
CONDITIONs) in an async environment without using ACA (Auto Contingent
Allegiance). Again, I see no sign that it is used by Linux, or that
anybody is interested in using it in Linux. Have I missed something,
or is it just not important? (a rather rhetorical question)

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: SCSI target and IO-throttling
  2006-03-09 18:37                 ` Vladislav Bolkhovitin
@ 2006-03-09 19:32                   ` Steve Byan
  2006-03-10 18:46                     ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 25+ messages in thread
From: Steve Byan @ 2006-03-09 19:32 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi


On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:

> Steve Byan wrote:
>> On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:
>>> Steve Byan wrote:
>>>
>>>>
>>>> I still don't understand why you are reluctant to return    
>>>> TASK_SET_FULL or BUSY in this case; it's what the SCSI  
>>>> standard   supplies as the way to say "don't queue too many  
>>>> commands, please".
>>>
>>>
>>> I don't like out of order execution, which happens practically  
>>> on  all such "rejected" commands, because subsequent already  
>>> queued  commands are not "rejected" with it and some of them  
>>> could be  accepted later.
>> I see, you care about order. So do tapes. The historical answer  
>> has  been to not support tagged command queuing when you care  
>> about  ordering. To dodge the performance problem due to lack of  
>> queuing,  the targets usually implement a read-ahead and write- 
>> behind cache,  and then perform queuing behind the scenes, after  
>> telling the  initiator that the command has completed. Of course,  
>> this has obvious  data integrity issues for disk-type logical units.
>
> Yes, tapes just can't work without strict ordering. SCST was  
> originally done for tapes, so I still keep some kind of tape- 
> oriented thinking :)
>
> Actually, with current journaling file systems ordering also became  
> more important for disks as well.

Usually the workload from a journaling filesystem consists of a lot  
of unordered writes (user data) and some partially-ordered writes  
(metadata). The partially-ordered writes do not have a defined  
ordering with respect to the unordered writes; they are ordered only  
with respect to each other. Most systems today solve the  
TASK_SET_FULL problem by only having one ordered write outstanding at  
any point in time. You want to do it this way anyway, so that you can  
build up a queue of commits and do a group commit with the next write  
to the journal.

If you need write barriers between the metadata writes and the data  
writes, the initiator should use the ORDERED task tag on that write,  
and have only one ORDERED write outstanding at any point in time (I  
mean to the same logical unit, of course).
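
A rough initiator-side sketch of that rule (submit_scsi_write() and
the task-attribute names are hypothetical stand-ins, not any real
driver API): SIMPLE writes are dispatched freely, while a mutex
guarantees at most one ORDERED write in flight per logical unit:

/* Sketch of "at most one ORDERED write in flight per LU".
 * submit_scsi_write() is an invented primitive assumed to queue the
 * command and block until it completes. */
#include <pthread.h>

enum task_attr { TASK_ATTR_SIMPLE, TASK_ATTR_ORDERED };

struct lu {
    pthread_mutex_t ordered_lock;   /* serializes ORDERED writes */
};

extern int submit_scsi_write(struct lu *lu, const void *buf,
                             unsigned long lba, unsigned int len,
                             enum task_attr attr);

/* unordered user data: queue freely with the SIMPLE attribute */
int write_data(struct lu *lu, const void *buf,
               unsigned long lba, unsigned int len)
{
    return submit_scsi_write(lu, buf, lba, len, TASK_ATTR_SIMPLE);
}

/* barrier write: at most one ORDERED command outstanding on the LU */
int write_barrier(struct lu *lu, const void *buf,
                  unsigned long lba, unsigned int len)
{
    int ret;

    pthread_mutex_lock(&lu->ordered_lock);
    ret = submit_scsi_write(lu, buf, lba, len, TASK_ATTR_ORDERED);
    pthread_mutex_unlock(&lu->ordered_lock);
    return ret;
}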

> Data integrity problem in "behind the scenes" queuing could be on  
> practice easily solved by battery-based backup power on the disks.  
> In case of TASK_SET_FULL things are much worse, because the  
> reordering happens _between_ target and _initiator_, since the  
> initiator must retry "rejected" command explicitly, then in case of  
> the initiator crash before the command will be retried and if FS on  
> it uses ordering barriers to protect the integrity (Linux seems  
> does so, but I could be wrong), the FS data could be written out of  
> order with its journal and the FS could be corrupted. Even worse,  
> TASK_SET_FULL "rejects" basically happen every the queue length'th  
> command, ie very often. This is why I prefer the "dumb" and "safe"  
> way. But, I could overestimate the problem, because it looks like  
> nobody cares about it..

See above. Since only one ordered write is ever pending, no file
system corruption occurs. And since you want to do group commits
anyway, you never need to have more than one ordered write pending.
>
>> The solution introduced for tapes concurrent with iSCSI (which   
>> motivated the need for command-queuing for tapes, since some   
>> envisioned backing up to a tape drive located on 3000 miles away  
>> is  something called "unit-attention interlock", or "UA  
>> interlock". Check  out page 287 of the draft revision 23 of the  
>> SCSI Primary Commands -  3 (SPC-3) standard from T10.org. The  
>> UA_INTLCK_CTRL field can be set  to cause a persistent unit  
>> attention condition if a command was  rejected with TASK_SET_FULL  
>> or BUSY.
>
> Thanks, I'll take a look.
>
>> This requires the cooperation of the initiator.
>
> Which practically means that it will not work for at least several  
> years.

Well, the feature was added back in 2001 or 2002; the initiators have  
already had years to incorporate it. This might say something about  
the state of the Linux SCSI subsystem (running and ducking for  
cover :-). Seriously, I think this has more to do with either the  
lack of need for command-queuing for tapes or the lack of modern tape  
support in Linux.

> I think, I won't be wrong, if say that no Linux initiators use this  
> feature and going to use...

If you have an initiator that is sending queued SCSI commands with  
the SIMPLE task attribute but which expects the target to maintain  
ordering of those commands, the SCSI standard can't help you. The  
initiator is broken.

If the initiator needs to send _queued_ SCSI commands with a task
attribute of ORDERED, then to preserve ordering it must set
UA_INTLCK_CTRL appropriately. The SCSI standard has no other
mechanism to offer such an initiator.

To the best of my knowledge no current Linux initiator sends SCSI
commands with a task attribute other than SIMPLE, and you seem to be
concerned only about Linux initiators. Therefore your target does not
need to preserve order. QED.


> BTW, it is also impossible to correctly process commands errors  
> (CHECK CONDITIONs) in async environment

When you say "async environment" I assume you are referring to  
queuing SCSI commands using SCSI command queuing, as opposed to  
sending a single SCSI command and synchronously awaiting its completion.

> without using ACA (Auto Contingent Allegiance). Again, I see no  
> sign that it's used by Linux or somebody interested to use it in  
> Linux. Have I missed anything and it is not important? (rather  
> rhetorical question)

ACA is not important if the command that got the error is idempotent  
and independent of all other commands in flight. In the case of disks  
(SBC command set) and CD-ROMs and DVD-ROMs (MMC command-set) this  
condition is true (given the restriction on the number of outstanding  
ordered writes which I discussed above), and so ACA is not needed.

Tapes would need ACA if they did command queuing (which is why ACA  
was invented), but the practice in tape-land seems to be to avoid  
SCSI command queuing and instead asynchronously stage the operations  
behind the target. This does lead to complications in error recovery,  
which is why tape error handling is so problematic.

My advice to you is to either
a) follow the industry trend, which is to use command queuing only
for SBC (disk) targets and not for MMC (CD-ROM) and SSC (tape)
targets, or
b) fix the initiator to handle ordered queuing (i.e. add support for
the ORDERED and ACA task attributes, ACA itself, and UA_INTLCK_CTRL).

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: SCSI target and IO-throttling
  2006-03-07 23:32       ` Bryan Henderson
  2006-03-08 15:35         ` Vladislav Bolkhovitin
@ 2006-03-10 13:26         ` Steve Byan
  1 sibling, 0 replies; 25+ messages in thread
From: Steve Byan @ 2006-03-10 13:26 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: linux-scsi, Vladislav Bolkhovitin


On Mar 7, 2006, at 6:32 PM, Bryan Henderson wrote:

>>> With the more primitive transports,
>>
>> Seems like a somewhat loaded description to me. Personally, I'd pick
>> something more neutral.
>
> Unfortunately, it's exactly what I mean.  I understand that some  
> people
> attach negative connotations to primitivity, but I can't let that  
> get in
> the way of clarity.
>
>>> I believe this is a manual
>>> configuration step -- the target has a fixed maximum queue depth
>>> and you
>>> tell the driver via some configuration parameter what it is.
>>
>> Not true. Consider the case where multiple initiators share one
>> logical unit  - there is no guarantee that a single initiator can
>> queue even a single command, since another initiator may have filled
>> the queue at the device.
>
> I'm not sure what it is that you're saying isn't true.

I'm saying that your blanket statement that "With the more primitive  
transports, I believe this is a manual configuration step -- the  
target has a fixed maximum queue depth and you tell the driver via  
some configuration parameter what it is." is not true.

> You do give a good
> explanation of why designers would want something more  
> sophisticated than
> this, but that doesn't mean every SCSI implementation actually is.

I didn't say every SCSI implementation did anything in particular. On  
the other hand, you did.

> Are
> you saying there are no SCSI targets so primitive that they have a  
> fixed
> maximum queue depth?

Of course I'm not saying that no such systems exist. I'm only  
refuting your claim that they all behave that way.

> That there are no systems where you manually set the
> maximum requests-in-flight at the initiator in order to optimally  
> drive
> such targets?

Of course I'm not saying that no such systems exist. I'm only  
refuting your claim that they all behave that way.

>
>>> I saw a broken ISCSI system that had QUEUE FULLs
>>> happening, and it was a performance disaster.
>>
>> Was it a performance disaster because of the broken-ness, or solely
>> because of the TASK SET FULLs?
>
> Because of the broken-ness.  Task Set Full is the symptom, not the
> disease.  I should add that in this system, there was no way to  
> make it
> perform optimally and also see Task Set Full regularly.
>
> You mentioned in another email that FCP is designed to use Task Set  
> Full
> for normal flow control.  I heard that before, but didn't believe  
> it; I
> thought  FCP was more advanced than that.  But I believe it now.   
> So I was
> wrong to say that Task Set Full happening means a system is  
> misconfigured.

>  But it's still the case that if you can design a system in which  
> Task Set
> Full never happens, it will perform better than one in which it does.

This is not necessarily true. TASK_SET_FULL does consume some  
initiator CPU resources and some bus bandwidth, so if one of those is  
your bottleneck, then yes, avoiding TASK_SET_FULL will improve  
performance. But if the performance bottleneck is the device server  
itself, then to a first approximation it makes no difference to  
performance whether the commands are queued on the initiator side of  
the interface or on the target side of the interface, assuming both  
the initiator and the target are capable of performing the same  
reordering optimizations.

> ISCSI flow control and manual setting of queue sizes in initiators  
> are two
> ways people do that.
>
>> 1) Considering only first-order effects, who cares whether the
>> initiator sends sub-optimal requests and the target coalesces them,
>> or if the initiator does the coalescing itself?
>
> I don't know what  a first-order effect is, so this may be out of  
> bounds,
> but here's a reason to care:  the initiator may have more resource
> available to do the work than the target.  We're talking here about a
> saturated target (which, rather than admit it's overwhelmed, keeps
> accepting new tasks).

Usually the target resource that is the bottleneck is the mechanical  
device, not the CPU. So it usually has the resources to devote to  
reordering the queue. Even disk drives with their $5 CPU have enough  
CPU bandwidth for this.
>
> But it's really the wrong question, because the more important  
> question is
> would you rather have the initiator do the coalescing or nobody? There
> exist targets that are not capable of combining or ordering tasks, and
> still accept large queues of them.

So no target should be able to accept large numbers of queued  
commands because some targets you've worked with are broken? Or we  
should have to manually configure the queue depth on every target  
because some of them are broken?

This also doesn't seem pertinent to TASK_SET_FULL versus iSCSI-style  
windowing, since a broken target can accept a large queue of commands  
no matter what flow-control mechanism is used.

I don't oppose including an option to an initiator that would  
manually set a maximum queue depth for a particular make and model of  
a SCSI target as a device-specific quirk; I just don't think it's  
mandatory, I don't think it's a good idea to have it be a global  
setting, and I also don't think it is the best general solution.

> These are the ones I saw have
> improperly large queues.  A target that can actually make use of a  
> large
> backlog of work, on the other hand, is right to accept one.

Absolutely. And the ones that can't should be sending TASK_SET_FULL  
when they've reached their limit.
>
> I have seen people try to improve performance of a storage system by
> increasing queue depth in the target such as this.  They note that the
> queue is always full, so it must need more queue space.  But this  
> degrades
> performance, because on one of these first-in-first-out targets,  
> the only
> way to get peak capacity is to keep the queue full all the time so  
> as to
> create backpressure and cause the initiator to schedule the work.
> Increasing the queue depth increases the chance that the initiator  
> will
> not have the backlog necessary to do that scheduling.  The correct  
> queue
> depth on this kind of target is the number of requests the target can
> process within the initiator's (and channel's) turnaround time.
>
>> brain-damaged
>> marketing values small average access times more than a small
>> variance in access times, so the device folks do crazy shortest-
>> access-time-first scheduling instead of something more sane and less
>> prone to spreading out the access time distribution like CSCAN.
>
> Since I'm talking about targets that don't do anything close to that
> sophisticated with the stuff in their queue, this doesn't apply.
>
> But I do have to point out that there are systems where throughput is
> everything, and response time, including variability of it, is  
> nothing. In
> fact, the systems I work with are mostly that kind.  For that kind of
> system, you'd want to target to do that kind of scheduling.

Yep, for batch you want SATF scheduling. It's not appropriate as the  
default setting for mass-produced disk devices, however.

>
>> 2) If you care about performance, you don't try to fill the device
>> queue; you just want to have enough outstanding so that the device
>> doesn't go idle when there is work to do.
>
> Why would the queue have a greater capacity than what is needed  
> when you
> care about performance?  Is there some non-performance reason to  
> have a
> giant queue?

Benchmarks which measure whether the device can coalesce 256 512-byte  
sequential writes :-)

Basically, for disk devices the optimal queue depth depends on the
workload, so the queue is statically sized for the worst case.

> I still think having a giant queue is not a solution to any flow  
> control
> (or, in the words of the original problem, I/O throttling) problem.

I did not suggest a giant queue as a "solution". I only replied to  
Vladislav's question as to how disk drives avoid sending  
TASK_SET_FULL all the time. They have queue sizes larger than the  
number of commands that the initiator usually tries to send.

> I'm
> even skeptical that there's any size you can make one that would avoid
> queue full conditions.

Well, if it's bigger than the number of SCSI command buffers  
allocated by the initiator, the target wins and never has to send  
TASK_SET_FULL (unless there are multiple initiators).

> It would be like avoiding difficult memory
> allocation algorithms by just having a whole lot of memory.

Yep. That's a good practical solution, and one which the operating  
system on your desktop computer probably uses :-)

I do take your point; arbitrarily large queues only postpone the  
point at which the target must reply TASK_SET_FULL. Usually that is  
good enough.

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: SCSI target and IO-throttling
  2006-03-09 19:32                   ` Steve Byan
@ 2006-03-10 18:46                     ` Vladislav Bolkhovitin
  2006-03-10 19:47                       ` Steve Byan
  2006-03-14 20:54                       ` Douglas Gilbert
  0 siblings, 2 replies; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-10 18:46 UTC (permalink / raw)
  To: Steve Byan; +Cc: Bryan Henderson, linux-scsi

Steve Byan wrote:
> On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:
> 
>> Steve Byan wrote:
>>
>>> On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:
>>>
>>>> Steve Byan wrote:
>>>>
>>>>>
>>>>> I still don't understand why you are reluctant to return    
>>>>> TASK_SET_FULL or BUSY in this case; it's what the SCSI  standard   
>>>>> supplies as the way to say "don't queue too many  commands, please".
>>>>
>>>>
>>>>
>>>> I don't like out of order execution, which happens practically  on  
>>>> all such "rejected" commands, because subsequent already  queued  
>>>> commands are not "rejected" with it and some of them  could be  
>>>> accepted later.
>>>
>>> I see, you care about order. So do tapes. The historical answer  has  
>>> been to not support tagged command queuing when you care  about  
>>> ordering. To dodge the performance problem due to lack of  queuing,  
>>> the targets usually implement a read-ahead and write- behind cache,  
>>> and then perform queuing behind the scenes, after  telling the  
>>> initiator that the command has completed. Of course,  this has 
>>> obvious  data integrity issues for disk-type logical units.
>>
>>
>> Yes, tapes just can't work without strict ordering. SCST was  
>> originally done for tapes, so I still keep some kind of tape- oriented 
>> thinking :)
>>
>> Actually, with current journaling file systems ordering also became  
>> more important for disks as well.
> 
> 
> Usually the workload from a journaling filesystem consists of a lot  of 
> unordered writes (user data) and some partially-ordered writes  
> (metadata). The partially-ordered writes do not have a defined  ordering 
> with respect to the unordered writes; they are ordered only  with 
> respect to each other. Most systems today solve the  TASK_SET_FULL 
> problem by only having one ordered write outstanding at  any point in 
> time. You want to do it this way anyway, so that you can  build up a 
> queue of commits and do a group commit with the next write  to the journal.
> 
> If you need write barriers between the metadata writes and the data  
> writes, the initiator should use the ORDERED task tag on that write,  
> and have only one ORDERED write outstanding at any point in time (I  
> mean to the same logical unit, of course).

I mean the barrier between journal writes and metadata writes, because
their order is essential for FS health. User data is almost always
neither journaled nor protected.

Obviously, having only one ORDERED (i.e. journal) write outstanding
and having to wait for its completion before submitting subsequent
commands creates something of a performance bottleneck. I mean mostly
latency, which is often quite big on many SCSI transports. It would be
much better to queue as many such ORDERED commands as necessary and
then, without waiting for their completion, queue the metadata update
(SIMPLE) commands, while being sure that no metadata command will be
executed if any of the ORDERED ones fails. As far as I can see,
nothing prevents working that way right now, except that somebody
would have to implement it in both hardware and software.

>> Data integrity problem in "behind the scenes" queuing could be on  
>> practice easily solved by battery-based backup power on the disks.  In 
>> case of TASK_SET_FULL things are much worse, because the  reordering 
>> happens _between_ target and _initiator_, since the  initiator must 
>> retry "rejected" command explicitly, then in case of  the initiator 
>> crash before the command will be retried and if FS on  it uses 
>> ordering barriers to protect the integrity (Linux seems  does so, but 
>> I could be wrong), the FS data could be written out of  order with its 
>> journal and the FS could be corrupted. Even worse,  TASK_SET_FULL 
>> "rejects" basically happen every the queue length'th  command, ie very 
>> often. This is why I prefer the "dumb" and "safe"  way. But, I could 
>> overestimate the problem, because it looks like  nobody cares about it..
> 
> 
> See above, Since only one ordered write is ever pending, no file  system 
> corruption occurs. Since you want to do group commits anyway,  you never 
> need to have more than one ordered write pending.
> 
>>
>>> The solution introduced for tapes concurrent with iSCSI (which   
>>> motivated the need for command-queuing for tapes, since some   
>>> envisioned backing up to a tape drive located on 3000 miles away  is  
>>> something called "unit-attention interlock", or "UA  interlock". 
>>> Check  out page 287 of the draft revision 23 of the  SCSI Primary 
>>> Commands -  3 (SPC-3) standard from T10.org. The  UA_INTLCK_CTRL 
>>> field can be set  to cause a persistent unit  attention condition if 
>>> a command was  rejected with TASK_SET_FULL  or BUSY.
>>
>>
>> Thanks, I'll take a look.
>>
>>> This requires the cooperation of the initiator.
>>
>>
>> Which practically means that it will not work for at least several  
>> years.
> 
> 
> Well, the feature was added back in 2001 or 2002; the initiators have  
> already had years to incorporate it. This might say something about  the 
> state of the Linux SCSI subsystem (running and ducking for  cover :-). 
> Seriously, I think this has more to do with either the  lack of need for 
> command-queuing for tapes or the lack of modern tape  support in Linux.
> 
>> I think, I won't be wrong, if say that no Linux initiators use this  
>> feature and going to use...
> 
> 
> If you have an initiator that is sending queued SCSI commands with  the 
> SIMPLE task attribute but which expects the target to maintain  ordering 
> of those commands, the SCSI standard can't help you. The  initiator is 
> broken.

Sure

> If the initiator needs to send _queued_ SCSI commands with a task  
> attribute of ORDERED, then to preserve ordering it must set the  
> UA_INTLCK_CTL appropriately. The SCSI standard has no other mechanism  
> to offer such an initiator.
> 
> To the best of my knowledge no current Linux initiator sends SCSI  
> commands with a task attribute other than SIMPLE., and you seem to be  
> concerned only about Linux initiators. Therefor your target does not  
> need to preserve order. QUED.

I prefer to be overinsured in such cases.

>> BTW, it is also impossible to correctly process commands errors  
>> (CHECK CONDITIONs) in async environment
> 
> 
> When you say "async environment" I assume you are referring to  queuing 
> SCSI commands using SCSI command queuing, as opposed to  sending a 
> single SCSI command and synchronously awaiting its completion.

Yes

>> without using ACA (Auto Contingent Allegiance). Again, I see no  sign 
>> that it's used by Linux or somebody interested to use it in  Linux. 
>> Have I missed anything and it is not important? (rather  rhetorical 
>> question)
> 
> 
> ACA is not important if the command that got the error is idempotent  
> and independent of all other commands in flight. In the case of disks  
> (SBC command set) and CD-ROMs and DVD-ROMs (MMC command-set) this  
> condition is true (given the restriction on the number of outstanding  
> ordered writes which I discussed above), and so ACA is not needed.

Yes, when working as you described, ACA is not needed. But when working 
as I described, ACA is essential.

> Tapes would need ACA if they did command queuing (which is why ACA  was 
> invented), but the practice in tape-land seems to be to avoid  SCSI 
> command queuing and instead asynchronously stage the operations  behind 
> the target. This does lead to complications in error recovery,  which is 
> why tape error handling is so problematic.

Could you please explain "asynchronously stage the operations behind 
the target" a bit more? I don't understand what you mean.

> My advice to you is to either
> a) follow the industry trend, which is to use command queuing only  for 
> SBC (disk) targets and not for MMC (CD-ROM) and SSC (tape)  targets, or
> b) fix the initiator to handle ordered queuing (i.e. add support for  
> the ORDERED and ACA task tags, ACA, and UA_INTLCK_CTL).

OK, thanks. Looks like (a) is easier :).

BTW, do you have any statistics on how many modern SCSI disks support
those features (ORDERED, ACA, UA_INTLCK_CTRL, etc.)? A few years ago
none of the SCSI hardware available to us, including tape libraries,
supported ACA. Granted, it was not very modern even for that time.

Regards,
Vlad

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: SCSI target and IO-throttling
  2006-03-10 18:46                     ` Vladislav Bolkhovitin
@ 2006-03-10 19:47                       ` Steve Byan
  2006-03-13 17:35                         ` Vladislav Bolkhovitin
  2006-03-14 20:54                       ` Douglas Gilbert
  1 sibling, 1 reply; 25+ messages in thread
From: Steve Byan @ 2006-03-10 19:47 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi


On Mar 10, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:

> Steve Byan wrote:
>> On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:

> I mean the barrier between journal writes and metadata writes,  
> because they order is essential for a FS health.

I counted journal writes as metadata writes. If you want to make a  
distinction, OK, we now have a common language.

> Obviously, having only one ORDERED, i.e. journal, write and having  
> to wait for it completition before submitting subsequent commands  
> creates some performance bottleneck.

It might be obvious but it's not true.

You missed my point about group commits to the journal. That's why  
there's no performance hit for only having one outstanding journal  
write at a time; each journal write commits many transactions. Stated  
another way, you don't want to eagerly initiate journal writes; you  
want to execute one at a time, and group all transactions that arrive  
while the one write is active into the next write.

See the seminal paper from Xerox PARC on "Group Commits in the CEDAR  
Filesystem". I'm working from memory so I can't give you a better  
citation than that. It's an old paper, probably circa 1987 or 1988,  
published I think in an ACM journal.
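
The mechanics are simple enough to sketch. Assuming a hypothetical
blocking journal_write_ordered() primitive (this is illustration, not
any real filesystem's code): while one journal write is in flight,
arriving transactions just accumulate, and the next write commits
them all at once:

/* Group-commit sketch: one journal write in flight at a time;
 * transactions arriving meanwhile ride the next write. */
#include <pthread.h>

struct journal {
    pthread_mutex_t lock;
    pthread_cond_t  done;
    int             write_in_flight;
    unsigned long   committed_seq;  /* highest seq durable on disk */
    unsigned long   pending_seq;    /* highest seq waiting to commit */
};

/* invented primitive: one blocking (ORDERED) write to the journal */
extern void journal_write_ordered(struct journal *j, unsigned long upto);

/* called by each transaction that wants its records made durable */
void journal_commit(struct journal *j, unsigned long my_seq)
{
    pthread_mutex_lock(&j->lock);
    if (my_seq > j->pending_seq)
        j->pending_seq = my_seq;

    while (j->committed_seq < my_seq) {
        if (!j->write_in_flight) {
            unsigned long upto = j->pending_seq;

            j->write_in_flight = 1;
            pthread_mutex_unlock(&j->lock);
            journal_write_ordered(j, upto);  /* the one pending write */
            pthread_mutex_lock(&j->lock);
            j->write_in_flight = 0;
            j->committed_seq = upto;
            pthread_cond_broadcast(&j->done);
        } else {
            /* a write is already on the wire; our records will be
             * picked up by the next group commit */
            pthread_cond_wait(&j->done, &j->lock);
        }
    }
    pthread_mutex_unlock(&j->lock);
}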

I've benchmarked metadata-intensive workloads on a journaling
filesystem, comparing a storage controller with NV-RAM, arranged so
that all metadata and journal writes complete without any disk
activity, against a vanilla controller. The lights on the disks
behind the NV-RAM controller never came on; i.e. there was _no_ disk
activity. The lights on the disks attached to the vanilla controller
were on solid. Yet the performance of the two systems was essentially
the same with respect to average response time and throughput.

> I mean mostly latency, which often quite big in many SCSI  
> transports. It would be much better to queue as many such ORDERED  
> commands as necessary and then, without waiting for their  
> completition, metadata updates (SIMPLE) commands and being sure,  
> that no metadata commands will be executed if any of ORDERED ones  
> fail. As far as I can see, nothing prevents to work that way right  
> now, except that somebody should implement it in both hardware and  
> software.

If you use group commits, there's little value in implementing this.

>> To the best of my knowledge no current Linux initiator sends SCSI   
>> commands with a task attribute other than SIMPLE., and you seem to  
>> be  concerned only about Linux initiators. Therefor your target  
>> does not  need to preserve order. QUED.
>
> I prefer to be overinsured in such cases.

Suit yourself. Just don't expect help from the SCSI standard; it's 
not designed for that.



>> ACA is not important if the command that got the error is  
>> idempotent  and independent of all other commands in flight. In  
>> the case of disks  (SBC command set) and CD-ROMs and DVD-ROMs (MMC  
>> command-set) this  condition is true (given the restriction on the  
>> number of outstanding  ordered writes which I discussed above),  
>> and so ACA is not needed.
>
> Yes, when working as you described, ACA is not needed. But when  
> working as I described, ACA is essential.

As is unit attention interlock.

>> Tapes would need ACA if they did command queuing (which is why  
>> ACA  was invented), but the practice in tape-land seems to be to  
>> avoid  SCSI command queuing and instead asynchronously stage the  
>> operations  behind the target. This does lead to complications in  
>> error recovery,  which is why tape error handling is so problematic.
>
> Could you please explain "asynchronously stage the operations behind  
> the target" a bit more? I don't understand what you mean.

I mean they buffer the operations in memory after completing the SCSI
command and then (asynchronously to the execution of the SCSI command,
i.e. after it has been completed) queue them ("stage" them) and send
them on to the physical device.

I'm a bit hazy on the terminology, because I was never a tape guy and  
it's been years since I thought about tapes, but I think the term the  
industry used when streaming tapes first came out was "buffered  
operation". The tape controller accepts the write command and  
completes it with good status but doesn't write it to the media; it  
waits until it has accumulated a sufficient number of records to keep  
the tape streaming before starting to dump the buffer to the tape  
media. This avoids the need for SCSI command-queuing while still  
keeping the tape streaming.
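
A rough sketch of that buffered mode, with an invented media_write()
primitive and none of the flush/filemark or error-reporting logic
where the real complexity lives:

/* Tape "buffered operation" sketch: complete the WRITE with GOOD
 * status as soon as the record is buffered; a background thread
 * dumps accumulated records once enough have piled up to keep the
 * tape streaming. */
#include <pthread.h>
#include <string.h>

#define STREAM_THRESHOLD (64 * 1024)  /* assumed streaming high-water mark */
#define BUF_SIZE         (1024 * 1024)

static unsigned char buf[BUF_SIZE];
static size_t fill;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

extern void media_write(const unsigned char *data, size_t len);

/* command path: returns "GOOD" before anything reaches the media */
int scsi_write_buffered(const unsigned char *rec, size_t len)
{
    pthread_mutex_lock(&lock);
    if (fill + len > BUF_SIZE) {
        pthread_mutex_unlock(&lock);
        return -1;             /* would report BUSY or TASK SET FULL */
    }
    memcpy(buf + fill, rec, len);
    fill += len;
    if (fill >= STREAM_THRESHOLD)
        pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
    return 0;                  /* GOOD status, media not yet written */
}

/* background thread: keeps the tape streaming */
void *flusher(void *unused)
{
    static unsigned char staging[BUF_SIZE];
    size_t n;

    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (fill < STREAM_THRESHOLD)
            pthread_cond_wait(&ready, &lock);
        n = fill;
        memcpy(staging, buf, n);
        fill = 0;
        pthread_mutex_unlock(&lock);
        media_write(staging, n);  /* new WRITEs keep buffering meanwhile */
    }
}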

>> My advice to you is to either
>> a) follow the industry trend, which is to use command queuing  
>> only  for SBC (disk) targets and not for MMC (CD-ROM) and SSC  
>> (tape)  targets, or
>> b) fix the initiator to handle ordered queuing (i.e. add support  
>> for  the ORDERED and ACA task tags, ACA, and UA_INTLCK_CTL).
>
> OK, thanks. Looks like (a) is easier :).
>
> BTW, do you have any statistic how many modern SCSI disks support  
> those features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago  
> none of available for us SCSI hardware, including tape libraries,  
> supported ACA. It was not very modern for that time, though

I can't say with certainty, but I believe no SCSI disk supports ACA  
or UA_INTLCK_CTL. Some may support the ORDERED task tag but I guess  
it would be implemented in a low-performance path.

Storage controllers might be a different story; I have no data on  
what they support in the way of task attributes, ACA, and unit  
attention interlock.

As far as tapes go, I've got no data on modern SCSI tape controllers,  
but judging by the squirming going on in T10 around command-ordering  
for Fibre Channel tapes, I'd guess very few if any have gotten  
command-queuing to work for tapes.

Regards,
-Steve
-- 
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: SCSI target and IO-throttling
  2006-03-10 19:47                       ` Steve Byan
@ 2006-03-13 17:35                         ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-13 17:35 UTC (permalink / raw)
  To: Steve Byan; +Cc: Bryan Henderson, linux-scsi

Steve Byan wrote:
> On Mar 10, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:
> 
>> Steve Byan wrote:
>>
>>> On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:
> 
> 
>> I mean the barrier between journal writes and metadata writes,  
>> because they order is essential for a FS health.
> 
> 
> I counted journal writes as metadata writes. If you want to make a  
> distinction, OK, we now have a common language.
> 
>> Obviously, having only one ORDERED, i.e. journal, write and having  to 
>> wait for it completition before submitting subsequent commands  
>> creates some performance bottleneck.
> 
> 
> It might be obvious but it's not true.
> 
> You missed my point about group commits to the journal. That's why  
> there's no performance hit for only having one outstanding journal  
> write at a time; each journal write commits many transactions. Stated  
> another way, you don't want to eagerly initiate journal writes; you  
> want to execute one at a time, and group all transactions that arrive  
> while the one write is active into the next write.
> 
> See the seminal paper from Xerox PARC on "Group Commits in the CEDAR  
> Filesystem". I'm working from memory so I can't give you a better  
> citation than that. It's an old paper, probably circa 1987 or 1988,  
> published I think in an ACM journal.

I didn't miss your point. I wrote that such journal updates have to
be _synchronous_, i.e. even though the updates are combined into one
command, it is necessary to wait for its completion (as well as for
_all_ previously queued commands, including SIMPLE ones). This is the
(possible) performance bottleneck. Yes, the disk can simulate command
completion with its write-back cache, but the cache is limited in
size, so under some workloads it could fill up and stop helping.
However, I don't have any numbers, and maybe this is not so
noticeable in practice.

> I've benchmarked metadata-intensive workloads on a journaling  
> filesystem with a storage controller with NV-RAM arranged so that all  
> metadata and journal writes complete without any disk activity  against 
> a vanilla controller. The lights on the disks on the NV-RAM  controller 
> never came on; i.e. there was _no_ disk activity. The  lights on the 
> disks attached to the vanilla controller were on solid.  The performance 
> of the two systems was essentially the same with  respect to average 
> response time and throughput.
> 
>> I mean mostly latency, which often quite big in many SCSI  transports. 
>> It would be much better to queue as many such ORDERED  commands as 
>> necessary and then, without waiting for their  completition, metadata 
>> updates (SIMPLE) commands and being sure,  that no metadata commands 
>> will be executed if any of ORDERED ones  fail. As far as I can see, 
>> nothing prevents to work that way right  now, except that somebody 
>> should implement it in both hardware and  software.
> 
> 
> If you use group commits, there's little value in implementing this.
> 
>>> Tapes would need ACA if they did command queuing (which is why ACA
>>> was invented), but the practice in tape-land seems to be to  avoid  
>>> SCSI command queuing and instead asynchronously stage the  
>>> operations  behind the target. This does lead to complications in  
>>> error recovery,  which is why tape error handling is so problematic.
>>
>>
>> Could you please explain "asynchronously stage the operations behind  
>> the target" a bit more? I don't understand what you mean.
> 
> 
> I mean they buffer the operations in memory after completing the SCSI  
> command and then (asynchronous to the execution of the SCSI command,  
> i,e, after it has been completed) queue them ("stage" them) and send  
> them on to the physical device.
> 
> I'm a bit hazy on the terminology, because I was never a tape guy and  
> it's been years since I thought about tapes, but I think the term the  
> industry used when streaming tapes first came out was "buffered  
> operation". The tape controller accepts the write command and  completes 
> it with good status but doesn't write it to the media; it  waits until 
> it has accumulated a sufficient number of records to keep  the tape 
> streaming before starting to dump the buffer to the tape  media. This 
> avoids the need for SCSI command-queuing while still  keeping the tape 
> streaming.

I see

>>> My advice to you is to either
>>> a) follow the industry trend, which is to use command queuing  only  
>>> for SBC (disk) targets and not for MMC (CD-ROM) and SSC  (tape)  
>>> targets, or
>>> b) fix the initiator to handle ordered queuing (i.e. add support  
>>> for  the ORDERED and ACA task tags, ACA, and UA_INTLCK_CTL).
>>
>>
>> OK, thanks. Looks like (a) is easier :).
>>
>> BTW, do you have any statistic how many modern SCSI disks support  
>> those features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago  none 
>> of available for us SCSI hardware, including tape libraries,  
>> supported ACA. It was not very modern for that time, though
> 
> 
> I can't say with certainty, but I believe no SCSI disk supports ACA  or 
> UA_INTLCK_CTL. Some may support the ORDERED task tag but I guess  it 
> would be implemented in a low-performance path.

This is the point we should have started from :). It's senseless to
implement something which you can't use.

> Storage controllers might be a different story; I have no data on  what 
> they support in the way of task attributes, ACA, and unit  attention 
> interlock.
> 
> As far as tapes go, I've got no data on modern SCSI tape controllers,  
> but judging by the squirming going on in T10 around command-ordering  
> for Fibre Channel tapes, I'd guess very few if any have gotten  
> command-queuing to work for tapes.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: SCSI target and IO-throttling
  2006-03-10 18:46                     ` Vladislav Bolkhovitin
  2006-03-10 19:47                       ` Steve Byan
@ 2006-03-14 20:54                       ` Douglas Gilbert
  2006-03-15 17:15                         ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 25+ messages in thread
From: Douglas Gilbert @ 2006-03-14 20:54 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: Steve Byan, Bryan Henderson, linux-scsi

Vladislav Bolkhovitin wrote:
> Steve Byan wrote:

<snip>

> BTW, do you have any statistic how many modern SCSI disks support those
> features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago none of
> available for us SCSI hardware, including tape libraries, supported ACA.
> It was not very modern for that time, though

Vlad,
Here is part of the control mode page from a
recent SCSI disk (Cheetah 15k.4) :

# sdparm -p co /dev/sdb -ll
    /dev/sdb: SEAGATE   ST336754SS        0003
    Direct access device specific parameters: WP=0  DPOFUA=1
Control mode page [PS=1]:
  TST         0  [cha: n, def:  0, sav:  0]  Task set type
        0: lu maintains one task set for all I_T nexuses
        1: lu maintains separate task sets for each I_T nexus
  TMF_ONLY    0  [cha: n, def:  0, sav:  0]  Task management functions only
  D_SENSE     0  [cha: n, def:  0, sav:  0]  Descriptor format sense data
  GLTSD       0  [cha: y, def:  1, sav:  0]  Global logging target save disable
  RLEC        0  [cha: y, def:  0, sav:  0]  Report log exception condition
  QAM         0  [cha: y, def:  0, sav:  0]  Queue algorithm modifier
        0: restricted re-ordering; 1: unrestricted
  QERR        0  [cha: n, def:  0, sav:  0]  Queue error management
        0: only affected task gets CC; 1: affected tasks aborted
        3: affected tasks aborted on same I_T nexus
  RAC         0  [cha: n, def:  0, sav:  0]  Report a check
  UA_INTLCK   0  [cha: n, def:  0, sav:  0]  Unit attention interlocks control
        0: unit attention cleared with check condition status
        2: unit attention not cleared with check condition status
        3: as 2 plus ua on busy, task set full or reservation conflict
  SWP         0  [cha: n, def:  0, sav:  0]  Software write protect
  ATO         0  [cha: n, def:  0, sav:  0]  Application tag owner
  TAS         0  [cha: n, def:  0, sav:  0]  Task aborted status
        0: tasks aborted without response to app client
        1: any other I_T nexuses receive task aborted

So it doesn't support UA_INTLCK ("cha: n" implies the user
cannot change that value). QAM can be changed to allow
unrestricted re-ordering (of tasks with the SIMPLE task
attribute).

The NormACA bit in the standard INQUIRY response is 0 so
it doesn't support ACA either.
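
For reference, checking the NormACA bit programmatically takes only a
few lines with the Linux SG_IO ioctl. A sketch with error handling
trimmed (NormACA is bit 5 of byte 3 of the standard INQUIRY data, per
SPC-3):

/* Sketch: issue a standard INQUIRY via SG_IO and test NormACA. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(int argc, char **argv)
{
    unsigned char cdb[6] = { 0x12, 0, 0, 0, 36, 0 }; /* INQUIRY, 36 bytes */
    unsigned char resp[36], sense[32];
    struct sg_io_hdr io;
    int fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
        fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
        return 1;
    }
    memset(&io, 0, sizeof(io));
    io.interface_id    = 'S';
    io.dxfer_direction = SG_DXFER_FROM_DEV;
    io.cmd_len         = sizeof(cdb);
    io.cmdp            = cdb;
    io.dxferp          = resp;
    io.dxfer_len       = sizeof(resp);
    io.sbp             = sense;
    io.mx_sb_len       = sizeof(sense);
    io.timeout         = 5000;                       /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0) {
        perror("SG_IO");
        return 1;
    }
    printf("NormACA: %d\n", (resp[3] >> 5) & 1);     /* byte 3, bit 5 */
    return 0;
}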

Doug Gilbert


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: SCSI target and IO-throttling
  2006-03-14 20:54                       ` Douglas Gilbert
@ 2006-03-15 17:15                         ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 25+ messages in thread
From: Vladislav Bolkhovitin @ 2006-03-15 17:15 UTC (permalink / raw)
  To: dougg; +Cc: Steve Byan, Bryan Henderson, linux-scsi

Douglas Gilbert wrote:
> Vladislav Bolkhovitin wrote:
> 
>>Steve Byan wrote:
> 
> 
> <snip>
> 
>>BTW, do you have any statistic how many modern SCSI disks support those
>>features (ORDERED, ACA, UA_INTLCK_CTL, etc)? Few years ago none of
>>available for us SCSI hardware, including tape libraries, supported ACA.
>>It was not very modern for that time, though
> 
> 
> Vlad,
> Here is part of the control mode page from a
> recent SCSI disk (Cheetah 15k.4) :
> 
> # sdparm -p co /dev/sdb -ll
>     /dev/sdb: SEAGATE   ST336754SS        0003
>     Direct access device specific parameters: WP=0  DPOFUA=1
> Control mode page [PS=1]:
>   TST         0  [cha: n, def:  0, sav:  0]  Task set type
>         0: lu maintains one task set for all I_T nexuses
>         1: lu maintains separate task sets for each I_T nexus
>   TMF_ONLY    0  [cha: n, def:  0, sav:  0]  Task management functions only
>   D_SENSE     0  [cha: n, def:  0, sav:  0]  Descriptor format sense data
>   GLTSD       0  [cha: y, def:  1, sav:  0]  Global logging target save disable
>   RLEC        0  [cha: y, def:  0, sav:  0]  Report log exception condition
>   QAM         0  [cha: y, def:  0, sav:  0]  Queue algorithm modifier
>         0: restricted re-ordering; 1: unrestricted
>   QERR        0  [cha: n, def:  0, sav:  0]  Queue error management
>         0: only affected task gets CC; 1: affected tasks aborted
>         3: affected tasks aborted on same I_T nexus
>   RAC         0  [cha: n, def:  0, sav:  0]  Report a check
>   UA_INTLCK   0  [cha: n, def:  0, sav:  0]  Unit attention interlocks control
>         0: unit attention cleared with check condition status
>         2: unit attention not cleared with check condition status
>         3: as 2 plus ua on busy, task set full or reservation conflict
>   SWP         0  [cha: n, def:  0, sav:  0]  Software write protect
>   ATO         0  [cha: n, def:  0, sav:  0]  Application tag owner
>   TAS         0  [cha: n, def:  0, sav:  0]  Task aborted status
>         0: tasks aborted without response to app client
>         1: any other I_T nexuses receive task aborted
> 
> So it doesn't support UA_INTLCK ("cha: n" implies the user
> cannot change that value). QAM can be changed to allow
> unrestricted re-ordering (of task with the SIMPLE task
> attribute).
> 
> The NormACA bit in the standard INQUIRY response is 0 so
> it doesn't support ACA either.

Thanks! This is exactly what we saw in our own small investigation.
Perhaps those features are really not needed, if nobody uses them
anyway.

Vlad

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread

Thread overview: 25+ messages
2006-03-02 16:21 SCSI target and IO-throttling Vladislav Bolkhovitin
2006-03-03 18:07 ` Steve Byan
2006-03-03 18:47   ` Stefan Richter
2006-03-03 20:24     ` Steve Byan
2006-03-06 19:15   ` Bryan Henderson
2006-03-06 19:55     ` Steve Byan
2006-03-07 23:32       ` Bryan Henderson
2006-03-08 15:35         ` Vladislav Bolkhovitin
2006-03-08 15:56           ` Steve Byan
2006-03-08 17:49             ` Vladislav Bolkhovitin
2006-03-08 18:09               ` Steve Byan
2006-03-09 18:37                 ` Vladislav Bolkhovitin
2006-03-09 19:32                   ` Steve Byan
2006-03-10 18:46                     ` Vladislav Bolkhovitin
2006-03-10 19:47                       ` Steve Byan
2006-03-13 17:35                         ` Vladislav Bolkhovitin
2006-03-14 20:54                       ` Douglas Gilbert
2006-03-15 17:15                         ` Vladislav Bolkhovitin
2006-03-10 13:26         ` Steve Byan
2006-03-07 17:56     ` Vladislav Bolkhovitin
2006-03-07 18:38       ` Steve Byan
2006-03-07 17:53   ` Vladislav Bolkhovitin
2006-03-07 18:19     ` Steve Byan
2006-03-07 18:46       ` Vladislav Bolkhovitin
2006-03-07 19:00         ` Steve Byan
