linux-scsi.vger.kernel.org archive mirror

* how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
@ 2007-11-12 19:54 Chris Friesen
  2007-11-13 22:04 ` Chris Friesen
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Friesen @ 2007-11-12 19:54 UTC (permalink / raw)
  To: linux-scsi, dgilbert, James.Bottomley, Eric.Moore,
	DL-MPTFusionLinux

Hi,

I asked this question on the list last Friday and haven't seen any 
replies, so I thought I'd ask again and broaden the receiver list a bit.

We have x86-based hardware with dual LSI 53c1030 devices.  We have a few 
apps that issue SCSI requests on sg device nodes.  The requests are 
generally related to the health of the disks (ie, LOG_SENSE, 
REQUEST_SENSE, TEST_UNIT_READY, MODE_SENSE_10, that sort of thing).

We recently moved from 2.6.10 to 2.6.14 and now we're seeing occasional 
QUEUE_FULL/SAM_STAT_TASK_SET_FULL errors being returned to userspace. 
These didn't ever show up in 2.6.10.

So...are these errors expected?  If so, why are they only showing up now?

Is there any way to get rid of the errors?  Should the scsi midlayer be 
handling retries for these or is it up to userspace because we're using 
the ioctl() interface?

Is there a "correct" way to handle them in userspace?  Should we delay 
then retry the command?  How long should the app delay, how many retries 
should it attempt before giving up?

I'm at a loss here, and I'm having a hard time finding any concrete 
information on the expected behaviour.

Thanks,

Chris

* Re: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-12 19:54 how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace? Chris Friesen
@ 2007-11-13 22:04 ` Chris Friesen
  2007-11-13 22:34   ` Moore, Eric
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Friesen @ 2007-11-13 22:04 UTC (permalink / raw)
  To: Larry.Stephens
  Cc: linux-scsi, dgilbert, James.Bottomley, Eric.Moore,
	DL-MPTFusionLinux

Chris Friesen wrote:

> We recently moved from 2.6.10 to 2.6.14 and now we're seeing occasional 
> QUEUE_FULL/SAM_STAT_TASK_SET_FULL errors being returned to userspace. 
> These didn't ever show up in 2.6.10.

I found something that might be interesting.

With the 3.01.18 fusion driver the queue length (as shown by 
"/sys/class/scsi_generic/sgX/device/queue_depth") was set to 7, while 
with the 3.02.57 fusion driver it was set to either 64 or 32.

It may be coincidence, but it's interesting that 
MPT_SCSI_CMD_PER_DEV_LOW is set to 7 in the earlier driver, and 32 in 
the later one, while MPT_SCSI_CMD_PER_DEV_HIGH went from 31 to 64.

Chris

* RE: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-13 22:04 ` Chris Friesen
@ 2007-11-13 22:34   ` Moore, Eric
  2007-11-14 17:23     ` Chris Friesen
  0 siblings, 1 reply; 13+ messages in thread
From: Moore, Eric @ 2007-11-13 22:34 UTC (permalink / raw)
  To: Chris Friesen, Stephens, Larry
  Cc: linux-scsi, dgilbert, James.Bottomley, DL-MPT Fusion Linux

On Tuesday, November 13, 2007 3:04 PM, Chris Friesen wrote: 
> 
> Chris Friesen wrote:
> 
> > We recently moved from 2.6.10 to 2.6.14 and now we're seeing
> > occasional QUEUE_FULL/SAM_STAT_TASK_SET_FULL errors being returned
> > to userspace.  These didn't ever show up in 2.6.10.
> 
> I found something that might be interesting.
> 
> With the 3.01.18 fusion driver the queue length (as shown by 
> "/sys/class/scsi_generic/sgX/device/queue_depth") was set to 7, while 
> with the 3.02.57 fusion driver it was set to either 64 or 32.
> 
> It may be coincidence, but it's interesting that 
> MPT_SCSI_CMD_PER_DEV_LOW is set to 7 in the earlier driver, and 32 in 
> the later one, while MPT_SCSI_CMD_PER_DEV_HIGH went from 31 to 64.
> 


QUEUE_FULL and SAM_STAT_TASK_SET_FULL are not errors.
SAM_STAT_TASK_SET_FULL is returned by the target when it can't handle the
number of commands, and QUEUE_FULL is returned by the HBA firmware when it
can't handle the number of commands.  In other words, the commands are
retried by the SCSI midlayer (scsiml).  I probably should be calling
scsi_track_queue_full, which would throttle the commands back; however,
I'm not sure whether it matters.
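
As an illustration of how the two names relate, here is a minimal Python
sketch (not taken from the driver).  It assumes the values from the
kernel's scsi.h: the SAM status byte SAM_STAT_TASK_SET_FULL is 0x28, and
the legacy QUEUE_FULL code is 0x14, i.e. the same status shifted right by
one bit, similar to what the kernel's status_byte() macro computes.  The
helper names are hypothetical.

```python
# SAM status bytes as defined in the kernel's scsi.h.
SAM_STAT_GOOD = 0x00
SAM_STAT_CHECK_CONDITION = 0x02
SAM_STAT_BUSY = 0x08
SAM_STAT_TASK_SET_FULL = 0x28

# Legacy (pre-SAM) code for the same condition.
QUEUE_FULL = 0x14


def legacy_status(sam_status):
    """Convert a SAM status byte to the legacy shifted form (hypothetical helper)."""
    return (sam_status >> 1) & 0x7F


def is_queue_full(sam_status):
    """True if the SAM status byte reports TASK SET FULL (hypothetical helper)."""
    return legacy_status(sam_status) == QUEUE_FULL


if __name__ == "__main__":
    print(hex(legacy_status(SAM_STAT_TASK_SET_FULL)))
```

Both names therefore describe the same "try again later" condition; only
the encoding differs.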

Eric

* Re: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-13 22:34   ` Moore, Eric
@ 2007-11-14 17:23     ` Chris Friesen
  2007-11-14 22:45       ` Moore, Eric
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Friesen @ 2007-11-14 17:23 UTC (permalink / raw)
  To: Moore, Eric
  Cc: Stephens, Larry, linux-scsi, dgilbert, James.Bottomley,
	DL-MPT Fusion Linux

Moore, Eric wrote:

> QUEUE_FULL and SAM_STAT_TASK_SET_FULL are not errors.

I consider them errors in the same way that ENOMEM or ENOBUFS (or even 
EAGAIN) are errors.  "There is a shortage of resources and the command 
could not be completed, please try again later."

Also, the behaviour has changed from 2.6.10 with the 3.01.18 fusion 
driver, to 2.6.14 with the 3.02.57 fusion driver.

With 2.6.10 our user app never saw SAM_STAT_TASK_SET_FULL.  I suspect it 
is due to the fact that it's using a queue size of 7, while in 2.6.14 
it's using a queue size of 32 or 64.

Which kernel version is behaving properly?

I've asked Seagate what the queue size should be for that hardware, but 
haven't heard back yet.

> SAM_STAT_TASK_SET_FULL is returned by the target when it can't handle the
> number of commands, and QUEUE_FULL is returned by the HBA firmware when it
> can't handle the number of commands.  In other words, the commands are
> retried by the SCSI midlayer (scsiml).  I probably should be calling
> scsi_track_queue_full, which would throttle the commands back; however,
> I'm not sure whether it matters.

We have a userspace app calling ioctl(...SG_IO...) on /dev/sdX and 
occasionally getting a status of SAM_STAT_TASK_SET_FULL.  I may be 
misreading the code, but it doesn't appear that the midlayer is retrying 
these commands.

If the queue length in 2.6.14 is correct then how do I handle that 
status code?  Maybe delay a bit then retry a few times?  How much delay? 
   How many retries?

Chris

* RE: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-14 17:23     ` Chris Friesen
@ 2007-11-14 22:45       ` Moore, Eric
  2007-11-15 19:09         ` Chris Friesen
  0 siblings, 1 reply; 13+ messages in thread
From: Moore, Eric @ 2007-11-14 22:45 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Stephens, Larry, linux-scsi, dgilbert, James.Bottomley,
	DL-MPT Fusion Linux

On Wednesday, November 14, 2007 10:23 AM, Chris Friesen wrote: 
> > QUEUE_FULL and SAM_STAT_TASK_SET_FULL are not errors.
> 
> I consider them errors in the same way that ENOMEM or ENOBUFS (or even
> EAGAIN) are errors.  "There is a shortage of resources and the command
> could not be completed, please try again later."
> 
> Also, the behaviour has changed from 2.6.10 with the 3.01.18 fusion 
> driver, to 2.6.14 with the 3.02.57 fusion driver.
> 
> With 2.6.10 our user app never saw SAM_STAT_TASK_SET_FULL.  I suspect it
> is due to the fact that it's using a queue size of 7, while in 2.6.14
> it's using a queue size of 32 or 64.
> 
> Which kernel version is behaving properly?

You already figured out the problem; I don't understand why you're asking
if the kernel version is behaving properly.  You said that between those
driver versions the device queue depth increased from 32 to 64, and that
is exactly what happened.  The reason for the increase is that some
customers asked for the increased queue_depth, which helps with
performance.  We are not going to decrease it back.

> 
> I've asked Seagate what the queue size should be for that hardware, but
> haven't heard back yet.
> 
> > SAM_STAT_TASK_SET_FULL is returned by the target when it can't handle
> > the number of commands, and QUEUE_FULL is returned by the HBA firmware
> > when it can't handle the number of commands.  In other words, the
> > commands are retried by the SCSI midlayer (scsiml).  I probably should
> > be calling scsi_track_queue_full, which would throttle the commands
> > back; however, I'm not sure whether it matters.
> 
> We have a userspace app calling ioctl(...SG_IO...) on /dev/sdX and
> occasionally getting a status of SAM_STAT_TASK_SET_FULL.  I may be
> misreading the code, but it doesn't appear that the midlayer is retrying
> these commands.
> 
> If the queue length in 2.6.14 is correct then how do I handle that
> status code?  Maybe delay a bit then retry a few times?  How much delay?
> How many retries?
> 

SAM_STAT_TASK_SET_FULL, in /usr/src/linux/scsi/scsi.h, is the same as
QUEUE_FULL.  If you search scsi_error.c for QUEUE_FULL, you will see
that it translates to ADD_TO_MLQUEUE, which means the command will be
reposted to the request queue.  Ultimately, calling
scsi_track_queue_full would help by reducing the queue_depth on the fly;
however, I'm not sure if that is there in the older kernels you're running.
What I suggest you do is write a script to update the queue_depth to the
values you want.

Example
#  echo 32 > /sys/class/scsi_device/0:0:0:0/device/queue_depth
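
The same adjustment can be scripted.  Below is a minimal Python sketch,
assuming the sysfs layout used in the example above (a "queue_depth" file
inside the device directory).  The function names are hypothetical, and
writing to sysfs normally requires root.

```python
from pathlib import Path


def read_queue_depth(device_dir):
    """Return the current queue depth from <device_dir>/queue_depth."""
    return int((Path(device_dir) / "queue_depth").read_text().strip())


def set_queue_depth(device_dir, depth):
    """Write a new queue depth to <device_dir>/queue_depth."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    (Path(device_dir) / "queue_depth").write_text(f"{depth}\n")


if __name__ == "__main__":
    # Example path taken from the shell command above; adjust per device.
    device = "/sys/class/scsi_device/0:0:0:0/device"
    print("before:", read_queue_depth(device))
    set_queue_depth(device, 32)
    print("after:", read_queue_depth(device))
```

The value does not persist across reboots, so a script like this would
have to run at boot for each device it should apply to.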

* Re: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-14 22:45       ` Moore, Eric
@ 2007-11-15 19:09         ` Chris Friesen
  2007-11-15 19:43           ` James Smart
  2007-11-15 19:57           ` Moore, Eric
  0 siblings, 2 replies; 13+ messages in thread
From: Chris Friesen @ 2007-11-15 19:09 UTC (permalink / raw)
  To: Moore, Eric
  Cc: Stephens, Larry, linux-scsi, dgilbert, James.Bottomley,
	DL-MPT Fusion Linux

Moore, Eric wrote:

> You already figured out the problem; I don't understand why you're asking
> if the kernel version is behaving properly.  You said that between those
> driver versions the device queue depth increased from 32 to 64, and that
> is exactly what happened.  The reason for the increase is that some
> customers asked for the increased queue_depth, which helps with
> performance.  We are not going to decrease it back.

My impression is that the per-device queue is supposed to be decreased 
at runtime to match the actual size that the hardware can handle.  In 
the earlier version we're seeing the queue set to 7 at runtime, while 
the more recent version is showing a queue depth of 32 or 64 and is 
giving QUEUE_FULL errors to the userspace apps.

I just wanted to make sure that 2.6.14 was working correctly (ie, this 
wasn't a bug that has been fixed in a more recent version).

> SAM_STAT_TASK_SET_FULL, in /usr/src/linux/scsi/scsi.h, is the same as
> QUEUE_FULL.  If you search scsi_error.c for QUEUE_FULL, you will see
> that it translates to ADD_TO_MLQUEUE, which means the command will be
> reposted to the request queue.

I don't know the scsi code very well, so maybe I'm missing something 
obvious here.  If so, I apologize.

Our userspace apps are getting a status of TASK_SET_FULL on completion 
of an ioctl() call.

Does this status mean that the command needs to be retried by the 
userspace app, that it has already been retried by the lower levels and 
is now completed, or something else entirely?

Thanks,

Chris

* Re: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-15 19:09         ` Chris Friesen
@ 2007-11-15 19:43           ` James Smart
  2007-11-15 19:59             ` Moore, Eric
  2007-11-15 19:57           ` Moore, Eric
  1 sibling, 1 reply; 13+ messages in thread
From: James Smart @ 2007-11-15 19:43 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Moore, Eric, Stephens, Larry, linux-scsi, dgilbert,
	James.Bottomley, DL-MPT Fusion Linux



Chris Friesen wrote:
> Moore, Eric wrote:
> 
>> You already figured out the problem; I don't understand why you're asking
>> if the kernel version is behaving properly.  You said that between those
>> driver versions the device queue depth increased from 32 to 64, and that
>> is exactly what happened.  The reason for the increase is that some
>> customers asked for the increased queue_depth, which helps with
>> performance.  We are not going to decrease it back.
> 
> My impression is that the per-device queue is supposed to be decreased 
> at runtime to match the actual size that the hardware can handle.  In 
> the earlier version we're seeing the queue set to 7 at runtime, while 
> the more recent version is showing a queue depth of 32 or 64 and is 
> giving QUEUE_FULL errors to the userspace apps.

The midlayer doesn't do this automatically. The LLDD has to note the
QUEUE_FULL/TASK_SET_FULL status, then call scsi_adjust_queue_depth()
to manipulate things. And this gets really hairy to decrease load, then
ramp back up.

> I just wanted to make sure that 2.6.14 was working correctly (ie, this 
> wasn't a bug that has been fixed in a more recent version).
> 
>> SAM_STAT_TASK_SET_FULL, in /usr/src/linux/scsi/scsi.h, is the same as
>> QUEUE_FULL.  If you search scsi_error.c for QUEUE_FULL, you will see
>> that it translates to ADD_TO_MLQUEUE, which means the command will be
>> reposted to the request queue.
> 
> I don't know the scsi code very well, so maybe I'm missing something 
> obvious here.  If so, I apologize.
> 
> Our userspace apps are getting a status of TASK_SET_FULL on completion 
> of an ioctl() call.

If you're using sgio, you will always be susceptible to getting these
statuses, even if the driver adjusts queue depth.

> Does this status mean that the command needs to be retried by the 
> userspace app, that it has already been retried by the lower levels and 
> is now completed, or something else entirely?

The status means you can retry again, hoping that the queue is not as
busy at that time. The recommendation is that you delay 1 or more seconds
before attempting again. But, even that is a general recommendation.
SAM-4 gives some basic guidance on how long to delay before the retry
(see table 26).

It would be bad form for the lower levels or driver to retry the command.
Some commands are not retryable without affecting device state. Since you
use sgio, it's up to you to retry.
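
A minimal sketch of such a userspace retry loop, in Python.  It assumes a
hypothetical send_command() callable that issues the SG_IO request and
returns the SCSI status byte; the default of 5 retries is an arbitrary
illustrative choice, and the 1 second delay follows the general
recommendation above.  It is only appropriate for commands that are safe
to reissue.

```python
import time

SAM_STAT_GOOD = 0x00
SAM_STAT_TASK_SET_FULL = 0x28


def issue_with_retry(send_command, max_retries=5, delay_s=1.0, sleep=time.sleep):
    """Issue a command, retrying while the device reports TASK SET FULL.

    send_command is a hypothetical callable returning the SCSI status byte.
    Returns (final_status, retries_performed).
    """
    status = send_command()
    retries = 0
    while status == SAM_STAT_TASK_SET_FULL and retries < max_retries:
        sleep(delay_s)  # give the device time to drain its queue
        status = send_command()
        retries += 1
    return status, retries


if __name__ == "__main__":
    # Simulated device: full twice, then succeeds.
    replies = iter([SAM_STAT_TASK_SET_FULL, SAM_STAT_TASK_SET_FULL, SAM_STAT_GOOD])
    print(issue_with_retry(lambda: next(replies), delay_s=0.0))
```

If the status is still TASK SET FULL after the last retry, the caller gets
it back unchanged and can decide whether to report a failure.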

-- james s

* RE: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-15 19:09         ` Chris Friesen
  2007-11-15 19:43           ` James Smart
@ 2007-11-15 19:57           ` Moore, Eric
  2007-11-15 21:59             ` Chris Friesen
  1 sibling, 1 reply; 13+ messages in thread
From: Moore, Eric @ 2007-11-15 19:57 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Stephens, Larry, linux-scsi, dgilbert, James.Bottomley,
	DL-MPT Fusion Linux

On Thursday, November 15, 2007 12:10 PM, Chris Friesen wrote:
> 
> My impression is that the per-device queue is supposed to be decreased
> at runtime to match the actual size that the hardware can handle.  In 
> the earlier version we're seeing the queue set to 7 at runtime, while 
> the more recent version is showing a queue depth of 32 or 64 and is 
> giving QUEUE_FULL errors to the userspace apps.
> 

The per-device queue depth is a hard-coded value.  We don't know the
queue depth of each device attached to the controller.


> 
> I don't know the scsi code very well, so maybe I'm missing something 
> obvious here.  If so, I apologize.
> 
> Our userspace apps are getting a status of TASK_SET_FULL on completion
> of an ioctl() call.
> 
> Does this status mean that the command needs to be retried by the 
> userspace app, that it has already been retried by the lower levels and
> is now completed, or something else entirely?
> 

The midlayer is retrying the command.  I pointed you to the code in the
previous email.

Eric

* RE: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-15 19:43           ` James Smart
@ 2007-11-15 19:59             ` Moore, Eric
  0 siblings, 0 replies; 13+ messages in thread
From: Moore, Eric @ 2007-11-15 19:59 UTC (permalink / raw)
  To: James.Smart, Chris Friesen
  Cc: Stephens, Larry, linux-scsi, dgilbert, James.Bottomley,
	DL-MPT Fusion Linux

On  Thursday, November 15, 2007 12:44 PM, James Smart wrote:
> The midlayer doesn't do this automatically. The LLDD has to note the
> QUEUE_FULL/TASK_SET_FULL status, then call scsi_adjust_queue_depth()
> to manipulate things. And this gets really hairy to decrease load, then
> ramp back up.
> 

Yeah, I need to do that, but the customer should do it via sysfs, as I
previously noted; he is on a 2.6.14 kernel.

* Re: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-15 19:57           ` Moore, Eric
@ 2007-11-15 21:59             ` Chris Friesen
  2007-11-15 22:18               ` James Bottomley
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Friesen @ 2007-11-15 21:59 UTC (permalink / raw)
  To: Moore, Eric
  Cc: Stephens, Larry, linux-scsi, dgilbert, James.Bottomley,
	DL-MPT Fusion Linux

Moore, Eric wrote:
> On Thursday, November 15, 2007 12:10 PM, Chris Friesen wrote:

>>Does this status mean that the command needs to be retried by the 
>>userspace app, that it has already been retried by the lower levels and
>>is now completed, or something else entirely?

> The midlayer is retrying the command.  I pointed you to the code in the
> previous email.

James Smart just indicated that the midlayer was not retrying the 
command because it's sgio.  Is he mistaken?

If the midlayer is retrying the command, then what should the 
application do when it receives that status?


Chris



* Re: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-15 21:59             ` Chris Friesen
@ 2007-11-15 22:18               ` James Bottomley
  2007-11-15 22:35                 ` Moore, Eric
  0 siblings, 1 reply; 13+ messages in thread
From: James Bottomley @ 2007-11-15 22:18 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Moore, Eric, Stephens, Larry, linux-scsi, dgilbert,
	DL-MPT Fusion Linux

On Thu, 2007-11-15 at 15:59 -0600, Chris Friesen wrote:
> Moore, Eric wrote:
> > On Thursday, November 15, 2007 12:10 PM, Chris Friesen wrote:
> 
> >>Does this status mean that the command needs to be retried by the 
> >>userspace app, that it has already been retried by the lower levels and
> >>is now completed, or something else entirely?
> 
> > The midlayer is retrying the command.  I pointed you to the code in the
> > previous email.
> 
> James Smart just indicated that the midlayer was not retrying the 
> command because it's sgio.  Is he mistaken?

No.  When the command goes via SG_IO it bypasses all return status
processing (and QUEUE_FULL/BUSY is a return status).  When it's
submitted in the normal way (i.e. via a ULD) then the mid-layer
processes these returns to a retry strategy.

> If the midlayer is retrying the command, then what should the 
> application do when it receives that status?

James



* RE: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-15 22:18               ` James Bottomley
@ 2007-11-15 22:35                 ` Moore, Eric
  2007-11-15 22:47                   ` James Bottomley
  0 siblings, 1 reply; 13+ messages in thread
From: Moore, Eric @ 2007-11-15 22:35 UTC (permalink / raw)
  To: James Bottomley, Chris Friesen
  Cc: Stephens, Larry, linux-scsi, dgilbert, DL-MPT Fusion Linux

> No.  When the command goes via SG_IO it bypasses all return status
> processing (and QUEUE_FULL/BUSY is a return status).  When it's
> submitted in the normal way (i.e. via a ULD) then the mid-layer
> processes these returns to a retry strategy.
> 

James - Today I'm working on another customer issue, and my target
returns SAM_STAT_TASK_SET_FULL when sg_inq is sent.  I see about 10
retries before the data is finally returned.  Who is issuing the retries
to my driver?  Doesn't sg_inq (sg3_utils) use SG_IO ->
scsi_execute_async -> scsi_softirq_done, where SAM_STAT_TASK_SET_FULL is
translated to ADD_TO_MLQUEUE and then retried, regardless of the fact that
SG_DEFAULT_RETRIES equals zero?  Maybe I'm missing something, but I'm
seeing retries.

Eric    

* RE: how to handle QUEUE_FULL/SAM_STAT_TASK_SET_FULL in userspace?
  2007-11-15 22:35                 ` Moore, Eric
@ 2007-11-15 22:47                   ` James Bottomley
  0 siblings, 0 replies; 13+ messages in thread
From: James Bottomley @ 2007-11-15 22:47 UTC (permalink / raw)
  To: Moore, Eric
  Cc: Chris Friesen, Stephens, Larry, linux-scsi, dgilbert,
	DL-MPT Fusion Linux

On Thu, 2007-11-15 at 15:35 -0700, Moore, Eric wrote:
> > No.  When the command goes via SG_IO it bypasses all return status
> > processing (and QUEUE_FULL/BUSY is a return status).  When it's
> > submitted in the normal way (i.e. via a ULD) then the mid-layer
> > processes these returns to a retry strategy.
> > 
> 
> James - Today I'm working on another customer issue, and my target
> returns SAM_STAT_TASK_SET_FULL when sg_inq is sent.  I see about 10
> retries before the data is finally returned.  Who is issuing the retries
> to my driver?  Doesn't sg_inq (sg3_utils) use SG_IO ->
> scsi_execute_async -> scsi_softirq_done, where SAM_STAT_TASK_SET_FULL is
> translated to ADD_TO_MLQUEUE and then retried, regardless of the fact that
> SG_DEFAULT_RETRIES equals zero?  Maybe I'm missing something, but I'm
> seeing retries.

No, you're not ... I'm not thinking straight about the disposition path.
There's another thing at work, which is the command default timeout,
when that exhausts we do return the status back to SG_IO; otherwise we
will follow the retry strategy in all cases.

James


