[bug report] block: Non-NCQ commands will never be executed while fio is continuously running

* [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
@ 2024-09-09 13:10 yangxingui
  2024-09-09 13:21 ` Damien Le Moal
  0 siblings, 1 reply; 12+ messages in thread
From: yangxingui @ 2024-09-09 13:10 UTC (permalink / raw)
  To: axboe, John Garry
  Cc: linux-block, linux-kernel, James.Bottomley, Martin K. Petersen,
	damien.lemoal

Hello axboe & John,

After the driver exposes all HW queues to the block layer, non-NCQ
commands, such as those issued by smartctl, are never executed while fio
is running continuously.

The cause is that the other hctxs used for NCQ commands remain active and
keep issuing NCQ commands to the SATA disk. Meanwhile, the PIO command
keeps being retried in its own hctx because qc_defer() always returns
true.

hctx0: ncq, pio, ncq
hctx1: ncq, ncq, ...
...
hctxn: ncq, ncq, ...
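
The situation in the diagram can be reproduced with a toy model. This is
only a sketch under stated assumptions, not kernel code: the function name,
the "one completion per step" device and the default values are all
hypothetical. hctx0 holds the PIO command at its head, and every other hctx
issues one more NCQ command per step.

```python
def steps_until_pio_issued(num_hctx, ncq_in_flight=4, queue_depth=32,
                           max_steps=1000):
    """Return the step at which the PIO command is issued, or None if it
    is still being deferred after max_steps (toy model, not kernel code)."""
    for step in range(1, max_steps + 1):
        # Assumption: the device completes one outstanding NCQ command
        # per step.
        if ncq_in_flight > 0:
            ncq_in_flight -= 1
        # hctx0 retries the PIO command. It may only be issued when no
        # NCQ command is outstanding on the device.
        if ncq_in_flight == 0:
            return step
        # The PIO command is deferred and blocks hctx0, but each of the
        # other hctxs issues one more NCQ command, up to the queue depth.
        ncq_in_flight = min(queue_depth, ncq_in_flight + (num_hctx - 1))
    return None


if __name__ == "__main__":
    print("1 hctx  :", steps_until_pio_issued(1))
    print("16 hctxs:", steps_until_pio_issued(16))
```

With a single hctx the deferred PIO command stalls the only queue, so the
device drains and the command is issued. With 16 hctxs the other queues
keep refilling the device and the model never issues it.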

Is there any good solution for this?

Thanks.
Xingui

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-09 13:10 [bug report] block: Non-NCQ commands will never be executed while fio is continuously running yangxingui
@ 2024-09-09 13:21 ` Damien Le Moal
  2024-09-10  1:09   ` yangxingui
  0 siblings, 1 reply; 12+ messages in thread
From: Damien Le Moal @ 2024-09-09 13:21 UTC (permalink / raw)
  To: yangxingui, axboe, John Garry
  Cc: linux-block, linux-kernel, James.Bottomley, Martin K. Petersen,
	damien.lemoal

On 9/9/24 22:10, yangxingui wrote:
> Hello axboe & John,
> 
> After the driver exposes all HW queues to the block layer, non-NCQ 
> commands will never be executed while fio is continuously running, such 
> as a smartctl command.
> 
> The cause of the problem is that other hctx used by the NCQ command is 
> still active and can continue to issue NCQ commands to the sata disk.
> And the pio command keeps retrying in its corresponding hctx because 
> qc_defer() always returns true.
> 
> hctx0: ncq, pio, ncq
> hctx1：ncq, ncq, ...
> ...
> hctxn: ncq, ncq, ...
> 
> Is there any good solution for this?

SATA devices are single queue so how can you have multiple queues ?
What adapter are you using ?

-- 
Damien Le Moal
Western Digital Research



* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-09 13:21 ` Damien Le Moal
@ 2024-09-10  1:09   ` yangxingui
  2024-09-10  4:45     ` Damien Le Moal
  0 siblings, 1 reply; 12+ messages in thread
From: yangxingui @ 2024-09-10  1:09 UTC (permalink / raw)
  To: Damien Le Moal, axboe, John Garry
  Cc: linux-block, linux-kernel, James.Bottomley, Martin K. Petersen,
	damien.lemoal



On 2024/9/9 21:21, Damien Le Moal wrote:
> On 9/9/24 22:10, yangxingui wrote:
>> Hello axboe & John,
>>
>> After the driver exposes all HW queues to the block layer, non-NCQ
>> commands will never be executed while fio is continuously running, such
>> as a smartctl command.
>>
>> The cause of the problem is that other hctx used by the NCQ command is
>> still active and can continue to issue NCQ commands to the sata disk.
>> And the pio command keeps retrying in its corresponding hctx because
>> qc_defer() always returns true.
>>
>> hctx0: ncq, pio, ncq
>> hctx1：ncq, ncq, ...
>> ...
>> hctxn: ncq, ncq, ...
>>
>> Is there any good solution for this?
> 
> SATA devices are single queue so how can you have multiple queues ?
> What adapter are you using ?

The following patch exposes the host's 16 hardware queues to the block
layer, so a SATA disk attached to this host uses 16 hctxs.

8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")

Thanks,
Xingui

* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-10  1:09   ` yangxingui
@ 2024-09-10  4:45     ` Damien Le Moal
  2024-09-10  6:34       ` yangxingui
  0 siblings, 1 reply; 12+ messages in thread
From: Damien Le Moal @ 2024-09-10  4:45 UTC (permalink / raw)
  To: yangxingui, axboe, John Garry
  Cc: linux-block, linux-kernel, James.Bottomley, Martin K. Petersen,
	damien.lemoal

On 9/10/24 10:09 AM, yangxingui wrote:
> 
> 
> On 2024/9/9 21:21, Damien Le Moal wrote:
>> On 9/9/24 22:10, yangxingui wrote:
>>> Hello axboe & John,
>>>
>>> After the driver exposes all HW queues to the block layer, non-NCQ
>>> commands will never be executed while fio is continuously running, such
>>> as a smartctl command.
>>>
>>> The cause of the problem is that other hctx used by the NCQ command is
>>> still active and can continue to issue NCQ commands to the sata disk.
>>> And the pio command keeps retrying in its corresponding hctx because
>>> qc_defer() always returns true.
>>>
>>> hctx0: ncq, pio, ncq
>>> hctx1：ncq, ncq, ...
>>> ...
>>> hctxn: ncq, ncq, ...
>>>
>>> Is there any good solution for this?
>>
>> SATA devices are single queue so how can you have multiple queues ?
>> What adapter are you using ?
> 
> In the following patch, we expose the host's 16 hardware queues to the block
> layer. And when connecting to a sata disk, 16 hctx are used.
> 
> 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")

OK, so the HBA is a hisi one, using libsas...
What is the device? An SSD? An HDD?

Do you set a block I/O scheduler for the drive, e.g. mq-deadline? If not, does
setting a scheduler resolve the issue?

I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs, which also
have multiple queues with a shared tagset. I have never seen the issue you are
reporting, though, when using HDDs with mq-deadline or bfq as the scheduler.

-- 
Damien Le Moal
Western Digital Research


* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-10  4:45     ` Damien Le Moal
@ 2024-09-10  6:34       ` yangxingui
  2024-09-10 11:27         ` Niklas Cassel
  0 siblings, 1 reply; 12+ messages in thread
From: yangxingui @ 2024-09-10  6:34 UTC (permalink / raw)
  To: Damien Le Moal, axboe, John Garry
  Cc: linux-block, linux-kernel, James.Bottomley, Martin K. Petersen,
	damien.lemoal



On 2024/9/10 12:45, Damien Le Moal wrote:
> On 9/10/24 10:09 AM, yangxingui wrote:
>>
>>
>> On 2024/9/9 21:21, Damien Le Moal wrote:
>>> On 9/9/24 22:10, yangxingui wrote:
>>>> Hello axboe & John,
>>>>
>>>> After the driver exposes all HW queues to the block layer, non-NCQ
>>>> commands will never be executed while fio is continuously running, such
>>>> as a smartctl command.
>>>>
>>>> The cause of the problem is that other hctx used by the NCQ command is
>>>> still active and can continue to issue NCQ commands to the sata disk.
>>>> And the pio command keeps retrying in its corresponding hctx because
>>>> qc_defer() always returns true.
>>>>
>>>> hctx0: ncq, pio, ncq
>>>> hctx1：ncq, ncq, ...
>>>> ...
>>>> hctxn: ncq, ncq, ...
>>>>
>>>> Is there any good solution for this?
>>>
>>> SATA devices are single queue so how can you have multiple queues ?
>>> What adapter are you using ?
>>
>> In the following patch, we expose the host's 16 hardware queues to the block
>> layer. And when connecting to a sata disk, 16 hctx are used.
>>
>> 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
> 
> OK, so the HBA is a hisi one, using libsas...
> What is the device ? An SSD ? and HDD ?
Both SATA SSD and SATA HDD have this problem.

> 
> Do you set a block I/O scheduler for the drive, e.g. mq-deadline. If not, does
> setting a scheduler resolve the issue ?
Currently the default scheduler, mq-deadline, is used, and the same
behavior occurs when I set the scheduler to none. It seems to be
unrelated to the I/O scheduler.

> 
> I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
> have multiple queues with a shared tagset. Never seen the issue you are
> reporting though using HDDs with mq-deadline or bfq as the scheduler.
Is that because, unlike libsas, these hosts don't use qc_defer()?

Thanks,
Xingui

* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-10  6:34       ` yangxingui
@ 2024-09-10 11:27         ` Niklas Cassel
  2024-09-10 22:38           ` Damien Le Moal
  0 siblings, 1 reply; 12+ messages in thread
From: Niklas Cassel @ 2024-09-10 11:27 UTC (permalink / raw)
  To: yangxingui
  Cc: Damien Le Moal, axboe, John Garry, linux-block, linux-kernel,
	James.Bottomley, Martin K. Petersen, damien.lemoal

On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:
> 
> 
> On 2024/9/10 12:45, Damien Le Moal wrote:
> > On 9/10/24 10:09 AM, yangxingui wrote:
> > > 
> > > 
> > > On 2024/9/9 21:21, Damien Le Moal wrote:
> > > > On 9/9/24 22:10, yangxingui wrote:
> > > > > Hello axboe & John,
> > > > > 
> > > > > After the driver exposes all HW queues to the block layer, non-NCQ
> > > > > commands will never be executed while fio is continuously running, such
> > > > > as a smartctl command.
> > > > > 
> > > > > The cause of the problem is that other hctx used by the NCQ command is
> > > > > still active and can continue to issue NCQ commands to the sata disk.
> > > > > And the pio command keeps retrying in its corresponding hctx because
> > > > > qc_defer() always returns true.
> > > > > 
> > > > > hctx0: ncq, pio, ncq
> > > > > hctx1：ncq, ncq, ...
> > > > > ...
> > > > > hctxn: ncq, ncq, ...
> > > > > 
> > > > > Is there any good solution for this?
> > > > 
> > > > SATA devices are single queue so how can you have multiple queues ?
> > > > What adapter are you using ?
> > > 
> > > In the following patch, we expose the host's 16 hardware queues to the block
> > > layer. And when connecting to a sata disk, 16 hctx are used.
> > > 
> > > 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
> > 
> > OK, so the HBA is a hisi one, using libsas...
> > What is the device ? An SSD ? and HDD ?
> Both SATA SSD and SATA HDD have this problem.
> 
> > 
> > Do you set a block I/O scheduler for the drive, e.g. mq-deadline. If not, does
> > setting a scheduler resolve the issue ?
> Currently, the default configuration mq-deadline is used, and the same
> phenomenon occurs when I try setting it to none. It seems to have nothing to
> do with the scheduling strategy.
> 
> > 
> > I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
> > have multiple queues with a shared tagset. Never seen the issue you are
> > reporting though using HDDs with mq-deadline or bfq as the scheduler.
> Unlike libsas, as these hosts don't use qc_defer()?

mpt3sas and mpi3mr do not use any libata code at all. The SCSI to ATA
Translation (SAT) is done completely by the HBA, so from a Linux
perspective, we are issuing SCSI commands to the HBA.

We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566


If you look at the SATA 3.a Gold specification,
"13.6.3 Intermixing Non-NCQ commands and NCQ commands"

"The host shall not issue a non-NCQ command while an NCQ command is outstanding."


In the AHCI 1.3.1 specification,
"1.7 Theory of Operation"

"System software is responsible to ensure that queued and non-queued commands
are not mixed in the command list for the same device with the exception of
the NCQ Unload command."
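
The two quoted rules boil down to a small predicate, roughly the decision a
.qc_defer callback has to make. The sketch below is a simplified model with
hypothetical names (LinkState, must_defer), not the libata implementation,
and it ignores the NCQ Unload exception mentioned by AHCI.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    """Hypothetical per-device state, reduced to what the rule needs."""
    ncq_in_flight: int = 0        # number of outstanding NCQ commands
    non_ncq_active: bool = False  # a non-queued command is executing


def must_defer(link: LinkState, is_ncq: bool) -> bool:
    """Return True if the new command must wait (toy model)."""
    if is_ncq:
        # An NCQ command may join other NCQ commands, but must not be
        # issued while a non-queued command is executing.
        return link.non_ncq_active
    # A non-queued command needs the device completely idle.
    return link.non_ncq_active or link.ncq_in_flight > 0


if __name__ == "__main__":
    busy = LinkState(ncq_in_flight=3)
    print(must_defer(busy, is_ncq=True))   # NCQ while NCQ is outstanding
    print(must_defer(busy, is_ncq=False))  # non-NCQ while NCQ is outstanding
```

Under this rule a non-queued command is only accepted at an instant when
nothing is outstanding, which is why a steady stream of NCQ commands from
other submitters can keep it waiting.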


Usually, tools like smartctl submit SCSI commands of type "ATA-16 passthrough",
which is a specific SCSI command that just contains a regular ATA command as
payload:
https://www.smartmontools.org/browser/trunk/smartmontools/scsiata.cpp?desc=1&order=date#L346

For an "ATA-16 passthrough" SCSI command, libata will simply copy the fields
from the "ATA-16 passthrough" SCSI command to the appropriate fields in a newly
created ATA command, see the SAT specification and:
https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/ata/libata-scsi.c#L2878-L2887
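
As an illustration of that field copy, here is a sketch that builds an
ATA PASS-THROUGH (16) CDB carrying IDENTIFY DEVICE and then extracts the
ATA fields again. The byte layout follows my reading of the SAT
specification and should be checked against it; the function names and the
dictionary keys are hypothetical and this is not the libata code.

```python
ATA_16 = 0x85             # ATA PASS-THROUGH (16) operation code
PROTO_PIO_DATA_IN = 4     # SAT PROTOCOL field value for PIO Data-In
ATA_CMD_IDENTIFY = 0xEC   # IDENTIFY DEVICE


def build_ata16_identify() -> bytes:
    """Build a 16-byte CDB asking for one 512-byte block of IDENTIFY data."""
    cdb = bytearray(16)
    cdb[0] = ATA_16
    cdb[1] = PROTO_PIO_DATA_IN << 1  # PROTOCOL in bits 4:1, EXTEND = 0
    cdb[2] = 0x0E                    # T_DIR=1, BYT_BLOK=1, T_LENGTH=2
    cdb[6] = 1                       # COUNT (7:0): one block
    cdb[14] = ATA_CMD_IDENTIFY       # COMMAND
    return bytes(cdb)


def cdb_to_taskfile(cdb: bytes) -> dict:
    """Copy the low-order ATA register fields out of the CDB."""
    if len(cdb) != 16 or cdb[0] != ATA_16:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,
        "feature": cdb[4],
        "nsect": cdb[6],
        "lbal": cdb[8],
        "lbam": cdb[10],
        "lbah": cdb[12],
        "device": cdb[13],
        "command": cdb[14],
    }


if __name__ == "__main__":
    cdb = build_ata16_identify()
    print(cdb.hex(" "))
    print(cdb_to_taskfile(cdb))
```

The point is only that the translation is mechanical: the command in the
CDB is passed through as-is, so a non-NCQ command such as IDENTIFY DEVICE
stays a non-NCQ command.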


See also the SAT-6 specification,
"6.2.4 Mechanism for processing some commands as NCQ commands"

"The ACS-5 standard defines a mechanism for NCQ encapsulation of some commands.
Use of this mechanism allows these commands to be processed without quiescing
the ATA device."

Without considering if it is a good idea or not, it should be possible to
translate some commands to instead use the "NCQ encapsulated" variant of
the ATA command that was used in the "ATA-16 passthrough" SCSI command.

However looking at e.g.:
https://www.smartmontools.org/browser/trunk/smartmontools/scsiata.cpp?desc=1&order=date#L566
smartctl is sending an IDENTIFY DEVICE (ECh) ATA command,
and this command has no NCQ encapsulated variant.

(Had the application instead used a READ LOG DMA EXT command to read the
IDENTIFY DEVICE data log, where log page 01h is a copy of IDENTIFY DEVICE data,
we would have been able to convert the command to an NCQ encapsulated variant.)



TL;DR: I do not see an easy generic solution to this problem.

To be able to send a non-queued command, there must be no NCQ commands queued
on the device. I guess you could implement a scheduler that quiesces the
queue, processes the non-queued command, and then thaws the queue, but that
would essentially make non-queued commands high priority commands, and could
thus be used to seriously limit throughput by just sending some non-queued
commands every now and then :)
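
The quiesce/process/thaw idea can be sketched as a small gate shared by all
submitters. This is a hypothetical model (the QuiescingGate class and its
method names are invented for illustration), not a proposal for an actual
block layer or libata interface.

```python
class QuiescingGate:
    """Toy gate: a waiting non-queued command holds back new NCQ commands."""

    def __init__(self):
        self.ncq_in_flight = 0
        self.non_ncq_waiting = False
        self.non_ncq_active = False

    def try_issue_ncq(self) -> bool:
        # While a non-queued command waits or runs, new NCQ commands
        # are refused so that the device can drain.
        if self.non_ncq_waiting or self.non_ncq_active:
            return False
        self.ncq_in_flight += 1
        return True

    def complete_ncq(self) -> None:
        if self.ncq_in_flight == 0:
            raise RuntimeError("no NCQ command in flight")
        self.ncq_in_flight -= 1

    def try_issue_non_ncq(self) -> bool:
        # Quiesce: from now on NCQ submitters are held back.
        self.non_ncq_waiting = True
        if self.ncq_in_flight > 0 or self.non_ncq_active:
            return False
        self.non_ncq_waiting = False
        self.non_ncq_active = True
        return True

    def complete_non_ncq(self) -> None:
        # Thaw: NCQ commands may be issued again.
        self.non_ncq_active = False


if __name__ == "__main__":
    gate = QuiescingGate()
    gate.try_issue_ncq()
    print(gate.try_issue_non_ncq())  # deferred, but the gate is now closed
    print(gate.try_issue_ncq())      # refused while the device drains
    gate.complete_ncq()
    print(gate.try_issue_non_ncq())  # device idle, command goes out
```

The model also shows the cost described above: every non-queued command
forces the whole device to drain, so whoever can send such commands can
throttle everyone else.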


Kind regards,
Niklas


* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-10 11:27         ` Niklas Cassel
@ 2024-09-10 22:38           ` Damien Le Moal
  2024-09-11  9:41             ` yangxingui
  2024-09-19 12:26             ` Yu Kuai
  0 siblings, 2 replies; 12+ messages in thread
From: Damien Le Moal @ 2024-09-10 22:38 UTC (permalink / raw)
  To: Niklas Cassel, yangxingui
  Cc: axboe, John Garry, linux-block, linux-kernel, James.Bottomley,
	Martin K. Petersen

On 9/10/24 20:27, Niklas Cassel wrote:
> On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:
>>
>>
>> On 2024/9/10 12:45, Damien Le Moal wrote:
>>> On 9/10/24 10:09 AM, yangxingui wrote:
>>>>
>>>>
>>>> On 2024/9/9 21:21, Damien Le Moal wrote:
>>>>> On 9/9/24 22:10, yangxingui wrote:
>>>>>> Hello axboe & John,
>>>>>>
>>>>>> After the driver exposes all HW queues to the block layer, non-NCQ
>>>>>> commands will never be executed while fio is continuously running, such
>>>>>> as a smartctl command.
>>>>>>
>>>>>> The cause of the problem is that other hctx used by the NCQ command is
>>>>>> still active and can continue to issue NCQ commands to the sata disk.
>>>>>> And the pio command keeps retrying in its corresponding hctx because
>>>>>> qc_defer() always returns true.
>>>>>>
>>>>>> hctx0: ncq, pio, ncq
>>>>>> hctx1：ncq, ncq, ...
>>>>>> ...
>>>>>> hctxn: ncq, ncq, ...
>>>>>>
>>>>>> Is there any good solution for this?
>>>>>
>>>>> SATA devices are single queue so how can you have multiple queues ?
>>>>> What adapter are you using ?
>>>>
>>>> In the following patch, we expose the host's 16 hardware queues to the block
>>>> layer. And when connecting to a sata disk, 16 hctx are used.
>>>>
>>>> 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
>>>
>>> OK, so the HBA is a hisi one, using libsas...
>>> What is the device ? An SSD ? and HDD ?
>> Both SATA SSD and SATA HDD have this problem.
>>
>>>
>>> Do you set a block I/O scheduler for the drive, e.g. mq-deadline. If not, does
>>> setting a scheduler resolve the issue ?
>> Currently, the default configuration mq-deadline is used, and the same
>> phenomenon occurs when I try setting it to none. It seems to have nothing to
>> do with the scheduling strategy.
>>
>>>
>>> I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
>>> have multiple queues with a shared tagset. Never seen the issue you are
>>> reporting though using HDDs with mq-deadline or bfq as the scheduler.
>> Unlike libsas, as these hosts don't use qc_defer()?
> 
> mpt3sas and mpi3mr do not use any libata code at all, the SCSI to ATA
> Translation (SAT) is done completely by the HBA, so from a Linux
> perspective, we are issuing SCSI commands to the HBA.

Yes, but requeues can still happen. Though for a SATA drive, that is
unlikely since the max queue depth is clearly defined, unlike for SAS drives.

> We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
> https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566

And that may be the issue. More on this below.

> Without considering if it is a good idea or not, it should be possible to
> translate some commands to instead use the "NCQ encapsulated" variant of
> the ATA command that was used in the "ATA-16 passthrough" SCSI command.

That would be way too much work on the user side, and likely open up a can of
device bugs unseen until now.

> To be able to send a non-queued command, there has to be no NCQ commands queued
> on the device. I guess you could implement a scheduler that would be quiescing
> the queue, processes the non-queued command, and then thaw the queue, but that
> would essentially make non-queued commands high priority commands, and could
> thus be used to seriously limit throughput by just sending some non-queued
> commands every now and then :)

Passthrough commands do not go through the scheduler and are submitted directly
to the dispatch queue, generally at the head of it (see blk_mq_insert_request()).

So for a single queue device, even if ata_qc_defer causes a requeue, the
passthrough command ends up back at the top of the dispatch queue. After
repeating this a few times, all in-flight NCQ commands complete and the
passthrough command goes through.

But I feel this is very fragile given that the block layer requeue is done
through a work item, so in parallel to an application submitting IOs. So in
theory, I think that the requeue for the passthrough command could happen forever...

And for a multi-queue setup like with the hisi adapter, that is what is happening.
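
The single queue case described above can be sketched as follows. This is a
toy model only: the function name, the request labels and the assumption
that one in-flight NCQ command completes between two dispatch attempts are
all hypothetical, and the parallel requeue work item is not modeled.

```python
from collections import deque


def run_single_queue(normal_requests, ncq_in_flight, passthrough="PT"):
    """Return (issue order, number of deferred attempts) for one queue."""
    queue = deque(normal_requests)
    queue.appendleft(passthrough)  # passthrough is inserted at the head
    issued = []
    deferred_attempts = 0
    while queue:
        req = queue.popleft()
        if req == passthrough and ncq_in_flight > 0:
            # Deferred: the request goes back to the head, so nothing
            # queued behind it is dispatched in the meantime.
            deferred_attempts += 1
            queue.appendleft(req)
            # Assumption: one in-flight NCQ command completes before the
            # next dispatch attempt.
            ncq_in_flight -= 1
            continue
        issued.append(req)
    return issued, deferred_attempts


if __name__ == "__main__":
    order, retries = run_single_queue(["w1", "w2"], ncq_in_flight=3)
    print(order, retries)
```

Because the only queue stalls behind the passthrough request, the device
drains after a bounded number of retries. With several hctxs nothing stalls
the other queues, so that bound disappears.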

I do not have any good idea how to fix that yet. We need to find something.
scsi_queue_rq() and the budget/host or device blocked state management may help
with that, or we have a bug there... In any case, I do not think it is a block
layer issue as the block layer knows nothing about NCQ vs non-NCQ.

-- 
Damien Le Moal
Western Digital Research



* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-10 22:38           ` Damien Le Moal
@ 2024-09-11  9:41             ` yangxingui
  2024-09-19 12:26             ` Yu Kuai
  1 sibling, 0 replies; 12+ messages in thread
From: yangxingui @ 2024-09-11  9:41 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel
  Cc: axboe, John Garry, linux-block, linux-kernel, James.Bottomley,
	Martin K. Petersen



On 2024/9/11 6:38, Damien Le Moal wrote:
> On 9/10/24 20:27, Niklas Cassel wrote:
>> On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:
>>>
>>>
>>> On 2024/9/10 12:45, Damien Le Moal wrote:
>>>> On 9/10/24 10:09 AM, yangxingui wrote:
>>>>>
>>>>>
>>>>> On 2024/9/9 21:21, Damien Le Moal wrote:
>>>>>> On 9/9/24 22:10, yangxingui wrote:
>>>>>>> Hello axboe & John,
>>>>>>>
>>>>>>> After the driver exposes all HW queues to the block layer, non-NCQ
>>>>>>> commands will never be executed while fio is continuously running, such
>>>>>>> as a smartctl command.
>>>>>>>
>>>>>>> The cause of the problem is that other hctx used by the NCQ command is
>>>>>>> still active and can continue to issue NCQ commands to the sata disk.
>>>>>>> And the pio command keeps retrying in its corresponding hctx because
>>>>>>> qc_defer() always returns true.
>>>>>>>
>>>>>>> hctx0: ncq, pio, ncq
>>>>>>> hctx1：ncq, ncq, ...
>>>>>>> ...
>>>>>>> hctxn: ncq, ncq, ...
>>>>>>>
>>>>>>> Is there any good solution for this?
>>>>>>
>>>>>> SATA devices are single queue so how can you have multiple queues ?
>>>>>> What adapter are you using ?
>>>>>
>>>>> In the following patch, we expose the host's 16 hardware queues to the block
>>>>> layer. And when connecting to a sata disk, 16 hctx are used.
>>>>>
>>>>> 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
>>>>
>>>> OK, so the HBA is a hisi one, using libsas...
>>>> What is the device ? An SSD ? and HDD ?
>>> Both SATA SSD and SATA HDD have this problem.
>>>
>>>>
>>>> Do you set a block I/O scheduler for the drive, e.g. mq-deadline. If not, does
>>>> setting a scheduler resolve the issue ?
>>> Currently, the default configuration mq-deadline is used, and the same
>>> phenomenon occurs when I try setting it to none. It seems to have nothing to
>>> do with the scheduling strategy.
>>>
>>>>
>>>> I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
>>>> have multiple queues with a shared tagset. Never seen the issue you are
>>>> reporting though using HDDs with mq-deadline or bfq as the scheduler.
>>> Unlike libsas, as these hosts don't use qc_defer()?
>>
>> mpt3sas and mpi3mr do not use any libata code at all, the SCSI to ATA
>> Translation (SAT) is done completely by the HBA, so from a Linux
>> perspective, we are issuing SCSI commands to the HBA.
> 
> Yes, but we still can get requeue happening. Though for a SATA drive, that is
> unlikely since the max queue depth is clearly defined, unlike for SAS drives
> 
>> We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
>> https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566
> 
> And that may be the issue. More on this below.
> 
>> Without considering if it is a good idea or not, it should be possible to
>> translate some commands to instead use the "NCQ encapsulated" variant of
>> the ATA command that was used in the "ATA-16 passthrough" SCSI command.
> 
> That would be way too much work on the user side, and likely open up a can of
> device bugs unseen until now.
> 
>> To be able to send a non-queued command, there has to be no NCQ commands queued
>> on the device. I guess you could implement a scheduler that would be quiescing
>> the queue, processes the non-queued command, and then thaw the queue, but that
>> would essentially make non-queued commands high priority commands, and could
>> thus be used to seriously limit throughput by just sending some non-queued
>> commands every now and then :)
> 
> Passthrough commands do not go through the scheduler and are submitted directly
> to the dispatch queue, generally at the head of it (see blk_mq_insert_request()).
> 
> So for a single queue device, even if ata_qc_defer causes a requeue, the
> passthrough command ends up back at the top of the dispatch queue. After
> repeating this a few times, all in-flight NCQ commands complete and the
> passthrough command goes through.
> 
> But I feel this is very fragile given that the block layer requeue is done
> through a work item, so in parallel to an application submitting IOs. So in
> theory, I think that the requeue for the passthrough command could happen forever...
> 
> And for a multi-queue setup like with the hisi adapter, that is what is happening.
> 
> I do not have any good idea how to fix that yet. We need to find something.
> scsi_queue_rq() and the budget/host or device blocked state management may help
> with that, or we have a bug there... In any case, I do not think it is a block
> layer issue as the block layer knows nothing about NCQ vs non-NCQ.
> 
Thanks for your reply. Could we provide a module parameter that controls
whether multiple queues are exposed to the upper layer, and let users
choose?

Thanks,
Xingui

* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-10 22:38           ` Damien Le Moal
  2024-09-11  9:41             ` yangxingui
@ 2024-09-19 12:26             ` Yu Kuai
  2024-09-19 14:14               ` Damien Le Moal
  1 sibling, 1 reply; 12+ messages in thread
From: Yu Kuai @ 2024-09-19 12:26 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel, yangxingui
  Cc: axboe, John Garry, linux-block, linux-kernel, James.Bottomley,
	Martin K. Petersen, yukuai (C), yangerkun@huawei.com

Hi,

On 2024/09/11 6:38, Damien Le Moal wrote:
> On 9/10/24 20:27, Niklas Cassel wrote:
>> On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:
>>>
>>>
>>> On 2024/9/10 12:45, Damien Le Moal wrote:
>>>> On 9/10/24 10:09 AM, yangxingui wrote:
>>>>>
>>>>>
>>>>> On 2024/9/9 21:21, Damien Le Moal wrote:
>>>>>> On 9/9/24 22:10, yangxingui wrote:
>>>>>>> Hello axboe & John,
>>>>>>>
>>>>>>> After the driver exposes all HW queues to the block layer, non-NCQ
>>>>>>> commands will never be executed while fio is continuously running, such
>>>>>>> as a smartctl command.
>>>>>>>
>>>>>>> The cause of the problem is that other hctx used by the NCQ command is
>>>>>>> still active and can continue to issue NCQ commands to the sata disk.
>>>>>>> And the pio command keeps retrying in its corresponding hctx because
>>>>>>> qc_defer() always returns true.
>>>>>>>
>>>>>>> hctx0: ncq, pio, ncq
>>>>>>> hctx1：ncq, ncq, ...
>>>>>>> ...
>>>>>>> hctxn: ncq, ncq, ...
>>>>>>>
>>>>>>> Is there any good solution for this?
>>>>>>
>>>>>> SATA devices are single queue so how can you have multiple queues ?
>>>>>> What adapter are you using ?
>>>>>
>>>>> In the following patch, we expose the host's 16 hardware queues to the block
>>>>> layer. And when connecting to a sata disk, 16 hctx are used.
>>>>>
>>>>> 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
>>>>
>>>> OK, so the HBA is a hisi one, using libsas...
>>>> What is the device ? An SSD ? and HDD ?
>>> Both SATA SSD and SATA HDD have this problem.
>>>
>>>>
>>>> Do you set a block I/O scheduler for the drive, e.g. mq-deadline. If not, does
>>>> setting a scheduler resolve the issue ?
>>> Currently, the default configuration mq-deadline is used, and the same
>>> phenomenon occurs when I try setting it to none. It seems to have nothing to
>>> do with the scheduling strategy.
>>>
>>>>
>>>> I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
>>>> have multiple queues with a shared tagset. Never seen the issue you are
>>>> reporting though using HDDs with mq-deadline or bfq as the scheduler.
>>> Unlike libsas, as these hosts don't use qc_defer()?
>>
>> mpt3sas and mpi3mr do not use any libata code at all, the SCSI to ATA
>> Translation (SAT) is done completely by the HBA, so from a Linux
>> perspective, we are issuing SCSI commands to the HBA.
> 
> Yes, but we still can get requeue happening. Though for a SATA drive, that is
> unlikely since the max queue depth is clearly defined, unlike for SAS drives
> 
>> We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
>> https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566
> 
> And that may be the issue. More on this below.
> 
>> Without considering if it is a good idea or not, it should be possible to
>> translate some commands to instead use the "NCQ encapsulated" variant of
>> the ATA command that was used in the "ATA-16 passthrough" SCSI command.
> 
> That would be way too much work on the user side, and likely open up a can of
> device bugs unseen until now.
> 
>> To be able to send a non-queued command, there has to be no NCQ commands queued
>> on the device. I guess you could implement a scheduler that would be quiescing
>> the queue, processes the non-queued command, and then thaw the queue, but that
>> would essentially make non-queued commands high priority commands, and could
>> thus be used to seriously limit throughput by just sending some non-queued
>> commands every now and then :)
> 
> Passthrough commands do not go through the scheduler and are submitted directly
> to the dispatch queue, generally at the head of it (see blk_mq_insert_request()).
> 
> So for a single queue device, even if ata_qc_defer causes a requeue, the
> passthrough command ends up back at the top of the dispatch queue. After
> repeating this a few times, all in-flight NCQ commands complete and the
> passthrough command goes through.
> 
> But I feel this is very fragile given that the block layer requeue is done
> through a work item, so in parallel to an application submitting IOs. So in
> theory, I think that the requeue for the passthrough command could happen forever...
> 
> And for a multi-queue setup like with the hisi adapter, that is what is happening.
> 
> I do not have any good idea how to fix that yet. We need to find something.
> scsi_queue_rq() and the budget/host or device blocked state management may help
> with that, or we have a bug there... In any case, I do not think it is a block
> layer issue as the block layer knows nothing about NCQ vs non-NCQ.

Does libata return a specific value in this case? If so, maybe we can
stop the other hctxs until this IO is handled.

For now, I think libata should use a single hctx; it just doesn't support
multiple hctxs yet.

Thanks,
Kuai


* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-19 12:26             ` Yu Kuai
@ 2024-09-19 14:14               ` Damien Le Moal
  2024-10-31 14:12                 ` Niklas Cassel
  0 siblings, 1 reply; 12+ messages in thread
From: Damien Le Moal @ 2024-09-19 14:14 UTC (permalink / raw)
  To: Yu Kuai, Niklas Cassel, yangxingui
  Cc: axboe, John Garry, linux-block, linux-kernel, James.Bottomley,
	Martin K. Petersen, yukuai (C), yangerkun@huawei.com

On 2024/09/19 14:26, Yu Kuai wrote:
> Hi,
> 
> On 2024/09/11 6:38, Damien Le Moal wrote:
>> On 9/10/24 20:27, Niklas Cassel wrote:
>>> On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:
>>>>
>>>>
>>>> On 2024/9/10 12:45, Damien Le Moal wrote:
>>>>> On 9/10/24 10:09 AM, yangxingui wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/9/9 21:21, Damien Le Moal wrote:
>>>>>>> On 9/9/24 22:10, yangxingui wrote:
>>>>>>>> Hello axboe & John,
>>>>>>>>
>>>>>>>> After the driver exposes all HW queues to the block layer, non-NCQ
>>>>>>>> commands will never be executed while fio is continuously running, such
>>>>>>>> as a smartctl command.
>>>>>>>>
>>>>>>>> The cause of the problem is that other hctx used by the NCQ command is
>>>>>>>> still active and can continue to issue NCQ commands to the sata disk.
>>>>>>>> And the pio command keeps retrying in its corresponding hctx because
>>>>>>>> qc_defer() always returns true.
>>>>>>>>
>>>>>>>> hctx0: ncq, pio, ncq
>>>>>>>> hctx1：ncq, ncq, ...
>>>>>>>> ...
>>>>>>>> hctxn: ncq, ncq, ...
>>>>>>>>
>>>>>>>> Is there any good solution for this?
>>>>>>>
>>>>>>> SATA devices are single queue so how can you have multiple queues ?
>>>>>>> What adapter are you using ?
>>>>>>
>>>>>> In the following patch, we expose the host's 16 hardware queues to the block
>>>>>> layer. And when connecting to a sata disk, 16 hctx are used.
>>>>>>
>>>>>> 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
>>>>>
>>>>> OK, so the HBA is a hisi one, using libsas...
>>>>> What is the device ? An SSD ? and HDD ?
>>>> Both SATA SSD and SATA HDD have this problem.
>>>>
>>>>>
>>>>> Do you set a block I/O scheduler for the drive, e.g. mq-deadline. If not, does
>>>>> setting a scheduler resolve the issue ?
>>>> Currently, the default configuration mq-deadline is used, and the same
>>>> phenomenon occurs when I try setting it to none. It seems to have nothing to
>>>> do with the scheduling strategy.
>>>>
>>>>>
>>>>> I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
>>>>> have multiple queues with a shared tagset. Never seen the issue you are
>>>>> reporting though using HDDs with mq-deadline or bfq as the scheduler.
>>>> Is that because, unlike libsas, these hosts don't use qc_defer()?
>>>
>>> mpt3sas and mpi3mr do not use any libata code at all, the SCSI to ATA
>>> Translation (SAT) is done completely by the HBA, so from a Linux
>>> perspective, we are issuing SCSI commands to the HBA.
>>
>> Yes, but requeues can still happen. For a SATA drive, though, that is
>> unlikely since the max queue depth is clearly defined, unlike for SAS drives.
>>
>>> We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
>>> https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566
>>
>> And that may be the issue. More on this below.
>>
>>> Without considering if it is a good idea or not, it should be possible to
>>> translate some commands to instead use the "NCQ encapsulated" variant of
>>> the ATA command that was used in the "ATA-16 passthrough" SCSI command.
>>
>> That would be way too much work on the user side, and likely open up a can of
>> device bugs unseen until now.
>>
>>> To be able to send a non-queued command, there must be no NCQ commands queued
>>> on the device. I guess you could implement a scheduler that would quiesce
>>> the queue, process the non-queued command, and then thaw the queue, but that
>>> would essentially make non-queued commands high priority commands, and could
>>> thus be used to seriously limit throughput by just sending some non-queued
>>> commands every now and then :)
>>
>> Passthrough commands do not go through the scheduler and are submitted directly
>> to the dispatch queue, generally at the head of it (see blk_mq_insert_request()).
>>
>> So for a single queue device, even if ata_qc_defer causes a requeue, the
>> passthrough command ends up back at the top of the dispatch queue. After
>> repeating this a few times, all in-flight NCQ commands complete and the
>> passthrough command goes through.
>>
>> But I feel this is very fragile given that the block layer requeue is done
>> through a work item, so in parallel to an application submitting IOs. So in
>> theory, I think that the requeue for the passthrough command could happen forever...
>>
>> And for a multi-queue setup like with the hisi adapter, that is what is happening.
>>
>> I do not have any good idea how to fix that yet. We need to find something.
>> scsi_queue_rq() and the budget/host or device blocked state management may help
>> with that, or we have a bug there... In any case, I do not think it is a block
>> layer issue as the block layer knows nothing about NCQ vs non-NCQ.
> 
> Does libata return a specific value in this case? If so, maybe we can
> stop other hctx until this IO is handled.
> 
> For now, I think libata should use single hctx, it just doesn't support
> multiple hctx yet.

libata does not care/know about hctx. It only issues commands to ATA devices,
which always are single queue. And pure SATA adapters like AHCI are always
single queue.

The issue at hand can happen only for libsas based SAS HBAs that have multiple
command submission queues (with a shared tag set). Commands for the same device
may end up being submitted through different queues, and when the submitted
commands include a mix of NCQ and non-NCQ commands, the problem happens without
libata being able to easily do anything about it. No control is possible at the
scsi layer either, since the commands submitted are SCSI commands (not yet
translated to ATA commands), which carry no NCQ/non-NCQ exclusion knowledge at
all. NCQ is an ATA concept unknown to the scsi and block layers.
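
To make the difference between the single queue and the multi-queue case
concrete, here is a toy model in plain C. It is only a sketch, not kernel code:
the names (toy_dev, toy_qc_defer, toy_simulate), the tick-based timing and the
constants are all made up for illustration. The exclusion rule is written in
the spirit of ata_std_qc_defer(), and the model assumes that the hctx holding
the non-NCQ command issues nothing else until that command goes through, while
every other hctx keeps issuing one NCQ command per tick.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_QD        32   /* NCQ queue depth of the toy device */
#define NCQ_LIFETIME  3    /* ticks an NCQ command stays in flight */
#define MAX_TICKS     1000 /* give up after this many dispatch rounds */

/* Toy device state; all names here are hypothetical. */
struct toy_dev {
	int ncq_expire[TOY_QD]; /* tick at which each NCQ slot completes */
	bool non_ncq_active;    /* a non-NCQ command is in flight */
};

/* Count the NCQ commands still in flight at tick 'now'. */
static int toy_ncq_inflight(const struct toy_dev *dev, int now)
{
	int i, n = 0;

	for (i = 0; i < TOY_QD; i++)
		if (dev->ncq_expire[i] > now)
			n++;
	return n;
}

/*
 * Exclusion rule in the spirit of ata_std_qc_defer(): an NCQ command must
 * wait for an active non-NCQ command, and a non-NCQ command must wait until
 * nothing at all is in flight. Returns true when the command is deferred.
 */
static bool toy_qc_defer(const struct toy_dev *dev, int now, bool is_ncq)
{
	if (is_ncq)
		return dev->non_ncq_active;
	return dev->non_ncq_active || toy_ncq_inflight(dev, now) > 0;
}

/* Issue one NCQ command if the rule allows it and a slot is free. */
static void toy_issue_ncq(struct toy_dev *dev, int now)
{
	int i;

	if (toy_qc_defer(dev, now, true))
		return;
	for (i = 0; i < TOY_QD; i++) {
		if (dev->ncq_expire[i] <= now) {
			dev->ncq_expire[i] = now + NCQ_LIFETIME;
			return;
		}
	}
}

/*
 * hctx 0 holds the non-NCQ command at the head of its dispatch list, so it
 * issues nothing else until that command goes through. Every other hctx
 * keeps issuing one NCQ command per tick. One NCQ command is already in
 * flight at tick 0. Returns the tick at which the non-NCQ command is
 * issued, or -1 if it is still deferred after MAX_TICKS.
 */
static int toy_simulate(int nr_hctx)
{
	struct toy_dev dev = { .ncq_expire = { NCQ_LIFETIME } };
	int tick, q;

	assert(nr_hctx >= 1);
	for (tick = 1; tick <= MAX_TICKS; tick++) {
		if (!toy_qc_defer(&dev, tick, false))
			return tick;
		for (q = 1; q < nr_hctx; q++)
			toy_issue_ncq(&dev, tick);
	}
	return -1;
}
```

Under these assumptions, toy_simulate(1) returns 3: the one in-flight NCQ
command drains and the non-NCQ command is issued. toy_simulate(16) returns -1,
because the other queues refill the device faster than it drains and the
deferral never ends. The model deliberately ignores the requeue work item
racing with new submissions, so it says nothing about how robust the single
queue case really is.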

We (Niklas and I) are trying to find a solution, but that may not be within
libata itself. It may need changes to libsas as well. Not sure yet. Still exploring.

> 
> Thanks,
> Kuai
> 
>>
> 

-- 
Damien Le Moal
Western Digital Research



* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-09-19 14:14               ` Damien Le Moal
@ 2024-10-31 14:12                 ` Niklas Cassel
  2024-11-01  2:17                   ` yangxingui
  0 siblings, 1 reply; 12+ messages in thread
From: Niklas Cassel @ 2024-10-31 14:12 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Yu Kuai, yangxingui, axboe, John Garry, linux-block, linux-kernel,
	James.Bottomley, Martin K. Petersen, yukuai (C),
	yangerkun@huawei.com

On Thu, Sep 19, 2024 at 04:14:15PM +0200, Damien Le Moal wrote:
> On 2024/09/19 14:26, Yu Kuai wrote:
> > 
> > Does libata return a specific value in this case? If so, maybe we can
> > stop other hctx until this IO is handled.
> > 
> > For now, I think libata should use single hctx, it just doesn't support
> > multiple hctx yet.
> 
> libata does not care/know about hctx. It only issues commands to ATA devices,
> which always are single queue. And pure SATA adapters like AHCI are always
> single queue.
> 
> > The issue at hand can happen only for libsas based SAS HBAs that have multiple
> > command submission queues (with a shared tag set). Commands for the same device
> > may end up being submitted through different queues, and when the submitted
> > commands include a mix of NCQ and non-NCQ commands, the problem happens without
> > libata being able to easily do anything about it. No control is possible at the
> > scsi layer either, since the commands submitted are SCSI commands (not yet
> > translated to ATA commands), which carry no NCQ/non-NCQ exclusion knowledge at
> > all. NCQ is an ATA concept unknown to the scsi and block layers.
> 
> We (Niklas and I) are trying to find a solution, but that may not be within
> libata itself. It may need changes to libsas as well. Not sure yet. Still exploring.

Hello Xingui,

I sent a proposed solution to this problem here:
https://lore.kernel.org/linux-ide/20241031140731.224589-4-cassel@kernel.org/

Please test and see if it addresses your problem.


Kind regards,
Niklas


* Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running
  2024-10-31 14:12                 ` Niklas Cassel
@ 2024-11-01  2:17                   ` yangxingui
  0 siblings, 0 replies; 12+ messages in thread
From: yangxingui @ 2024-11-01  2:17 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Yu Kuai, axboe, John Garry, linux-block, linux-kernel,
	James.Bottomley, Martin K. Petersen, yukuai (C),
	yangerkun@huawei.com



On 2024/10/31 22:12, Niklas Cassel wrote:
> On Thu, Sep 19, 2024 at 04:14:15PM +0200, Damien Le Moal wrote:
>> On 2024/09/19 14:26, Yu Kuai wrote:
>>>
>>> Does libata return a specific value in this case? If so, maybe we can
>>> stop other hctx until this IO is handled.
>>>
>>> For now, I think libata should use single hctx, it just doesn't support
>>> multiple hctx yet.
>>
>> libata does not care/know about hctx. It only issues commands to ATA devices,
>> which always are single queue. And pure SATA adapters like AHCI are always
>> single queue.
>>
>> The issue at hand can happen only for libsas based SAS HBAs that have multiple
>> command submission queues (with a shared tag set). Commands for the same device
>> may end up being submitted through different queues, and when the submitted
>> commands include a mix of NCQ and non-NCQ commands, the problem happens without
>> libata being able to easily do anything about it. No control is possible at the
>> scsi layer either, since the commands submitted are SCSI commands (not yet
>> translated to ATA commands), which carry no NCQ/non-NCQ exclusion knowledge at
>> all. NCQ is an ATA concept unknown to the scsi and block layers.
>>
>> We (Niklas and I) are trying to find a solution, but that may not be within
>> libata itself. It may need changes to libsas as well. Not sure yet. Still exploring.
> 
> Hello Xingui,
> 
> I sent a proposed solution to this problem here:
> https://lore.kernel.org/linux-ide/20241031140731.224589-4-cassel@kernel.org/
> 
> Please test and see if it addresses your problem.
> 
OK, thanks for following up on this issue and fixing it. We will verify it as
soon as possible.

Thanks.
Xingui



Thread overview: 12+ messages
2024-09-09 13:10 [bug report] block: Non-NCQ commands will never be executed while fio is continuously running yangxingui
2024-09-09 13:21 ` Damien Le Moal
2024-09-10  1:09   ` yangxingui
2024-09-10  4:45     ` Damien Le Moal
2024-09-10  6:34       ` yangxingui
2024-09-10 11:27         ` Niklas Cassel
2024-09-10 22:38           ` Damien Le Moal
2024-09-11  9:41             ` yangxingui
2024-09-19 12:26             ` Yu Kuai
2024-09-19 14:14               ` Damien Le Moal
2024-10-31 14:12                 ` Niklas Cassel
2024-11-01  2:17                   ` yangxingui
