* [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
@ 2024-01-16 18:20 Bart Van Assche
2024-01-16 23:34 ` Damien Le Moal
2024-01-17 8:15 ` Viacheslav Dubeyko
0 siblings, 2 replies; 21+ messages in thread
From: Bart Van Assche @ 2024-01-16 18:20 UTC (permalink / raw)
To: lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Damien Le Moal, Christoph Hellwig
The advantages of zoned storage are well known [1]:
* Higher sequential read and random read performance.
* Lower write amplification.
* Lower tail latency.
* Higher usable capacity because of less overprovisioning.
For many SSDs the L2P (logical to physical translation) table does not
fit entirely in the memory of the storage device. Zoned storage reduces
the size of the L2P table significantly and hence makes it much more
likely that the L2P table fits in the memory of the storage device. If
zoned storage eliminates L2P table paging, random read performance is
improved significantly.
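To make the L2P argument concrete, here is a back-of-envelope sketch
(the 4-byte entry size, 4 KiB page granularity and 64 MiB zone size
are illustrative assumptions, not figures from any vendor):

```python
# Back-of-envelope L2P table sizes for an 8 TiB device.
def l2p_size_bytes(capacity_bytes, map_granularity_bytes, entry_bytes=4):
    """Size of a flat L2P table: one entry per mapped unit."""
    return capacity_bytes // map_granularity_bytes * entry_bytes

TiB = 1 << 40
# Conventional SSD: page-level mapping at 4 KiB granularity.
conventional = l2p_size_bytes(8 * TiB, 4 << 10)   # 8 GiB of table
# Zoned SSD: coarse zone-level mapping, e.g. 64 MiB zones.
zoned = l2p_size_bytes(8 * TiB, 64 << 20)         # 512 KiB of table
```

A table of a few hundred KiB trivially fits in device memory, while an
8 GiB table forces the controller to page mapping data in and out.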
A zoned storage SSD does not have to perform garbage collection. Hence,
write amplification and tail latency are reduced.
Zoned storage gives file systems control over how files are laid out on
the storage device. With zoned storage it is possible to allocate a
contiguous range of storage on the storage medium for a file. This
improves sequential read performance.
Log-structured file systems are a good match for zoned storage. Such
filesystems typically submit large bios to the block layer and have
multiple bios outstanding concurrently. The block layer splits bios if
their size exceeds the max_sectors limit (512 KiB for UFS; 128 KiB for a
popular NVMe controller). This increases the number of concurrently
outstanding bios further.
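As an illustration of how splitting multiplies the number of in-flight
bios (the 2 MiB bio size is an arbitrary example; the limits are the
ones quoted above):

```python
# Rough model of bio splitting against a max_sectors byte limit.
def nr_split_bios(bio_bytes, max_bytes):
    """Number of bios after the block layer splits one large bio."""
    return -(-bio_bytes // max_bytes)  # ceiling division

KiB = 1 << 10
MiB = 1 << 20
ufs_splits = nr_split_bios(2 * MiB, 512 * KiB)   # 4 bios for UFS
nvme_splits = nr_split_bios(2 * MiB, 128 * KiB)  # 16 bios for that NVMe controller
```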
While the NVMe standard supports two different commands for writing to
zoned storage (Write and Zone Append), the SCSI standard only supports a
single command for writing to zoned storage (WRITE). A Zone Append
emulation for SCSI exists in drivers/scsi/sd_zbc.c.
File system implementers have to decide whether to use Write or Zone
Append. While the Zone Append command tolerates reordering, with this
command the filesystem cannot control the order in which the data is
written on the medium without restricting the queue depth to one.
Additionally, the latency of write operations is lower compared to zone
append operations. From [2], a paper with performance results for one
ZNS SSD model: "we observe that the latency of write operations is lower
than that of append operations, even if the request size is the same".
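The difference between the two commands can be pictured with a toy
zone model (purely hypothetical code, not any driver's API): a regular
Write must hit the write pointer exactly, so a reordered write fails,
while a Zone Append always succeeds but only reports the chosen LBA on
completion.

```python
class Zone:
    """Toy model of a sequential-write-required zone."""
    def __init__(self, start_lba, nr_blocks):
        self.wp = start_lba                  # write pointer
        self.end = start_lba + nr_blocks

    def write(self, lba, nr_blocks):
        # Regular Write: must land exactly on the write pointer, so a
        # reordered write fails with an unaligned-write error.
        if lba != self.wp:
            raise ValueError("unaligned write")
        self.wp += nr_blocks
        return lba

    def zone_append(self, nr_blocks):
        # Zone Append: the device picks the LBA, so reordering does
        # not matter, but the submitter only learns the location on
        # completion and hence cannot control data placement.
        lba = self.wp
        self.wp += nr_blocks
        return lba
```

With Write, either ordering must be preserved or the queue depth
restricted to one; with Zone Append, placement is up to the device.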
The mq-deadline I/O scheduler serializes zoned writes even if they
were reordered by the block layer. However, the mq-deadline I/O scheduler,
just like any other single-queue I/O scheduler, is a performance
bottleneck for SSDs that support more than 200 K IOPS. Current NVMe and
UFS 4.0 block devices support more than 200 K IOPS.
Supporting more than 200 K IOPS and giving the filesystem control over
the data layout is only possible by supporting multiple outstanding
writes and by preserving the order of these writes. Hence the proposal
to discuss this topic during the 2024 edition of LSF/MM/BPF summit.
Potential approaches to preserve the order of zoned writes are as follows:
* Track (e.g. in a hash table) for which zones there are pending zoned
writes and submit all zoned writes per zone to the same hardware
queue.
* For SCSI, if a device responds to a zoned write with a unit
attention and the block device subsequently reports an unaligned
write error, activate the error handler, sort the pending zoned
writes by LBA and resubmit them from inside the error handler.
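The first approach can be sketched as a simple mapping (the zone size
and hardware queue count below are made-up values):

```python
# Sketch: route every zoned write for a given zone to one hardware
# queue, so the device receives the writes for a zone in submission
# order even when multiple queues are in use.
ZONE_SIZE_BLOCKS = 1 << 16   # e.g. 32 MiB zones with 512 B blocks
NR_HW_QUEUES = 8

def hw_queue_for_zoned_write(lba):
    zone = lba // ZONE_SIZE_BLOCKS
    return zone % NR_HW_QUEUES
```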
In other words, this proposal is about supporting both the Write and
Zone Append commands as first class operations and to let filesystem
implementers decide which command(s) to use.
[1] Stavrinos, Theano, Daniel S. Berger, Ethan Katz-Bassett, and Wyatt
Lloyd. "Don't be a blockhead: zoned namespaces make work on conventional
SSDs obsolete." In Proceedings of the Workshop on Hot Topics in
Operating Systems, pp. 144-151. 2021.
[2] K. Doekemeijer, N. Tehrany, B. Chandrasekaran, M. Bjørling and A.
Trivedi, "Performance Characterization of NVMe Flash Devices with Zoned
Namespaces (ZNS)," 2023 IEEE International Conference on Cluster
Computing (CLUSTER), Santa Fe, NM, USA, 2023, pp. 118-131, doi:
10.1109/CLUSTER52292.2023.00018.
(https://ieeexplore.ieee.org/abstract/document/10319951).
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-16 18:20 [LSF/MM/BPF TOPIC] Improving Zoned Storage Support Bart Van Assche
@ 2024-01-16 23:34 ` Damien Le Moal
2024-01-17 1:21 ` Bart Van Assche
2024-01-17 17:36 ` Bart Van Assche
2024-01-17 8:15 ` Viacheslav Dubeyko
1 sibling, 2 replies; 21+ messages in thread
From: Damien Le Moal @ 2024-01-16 23:34 UTC (permalink / raw)
To: Bart Van Assche, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 03:20, Bart Van Assche wrote:
> The advantages of zoned storage are well known [1]:
> * Higher sequential read and random read performance.
> * Lower write amplification.
> * Lower tail latency.
> * Higher usable capacity because of less overprovisioning.
>
> For many SSDs the L2P (logical to physical translation) table does not
> fit entirely in the memory of the storage device. Zoned storage reduces
> the size of the L2P table significantly and hence makes it much more
> likely that the L2P table fits in the memory of the storage device. If
> zoned storage eliminates L2P table paging, random read performance is
> improved significantly.
>
> A zoned storage SSD does not have to perform garbage collection. Hence,
> write amplification and tail latency are reduced.
>
> Zoned storage gives file systems control over how files are laid out on
> the storage device. With zoned storage it is possible to allocate a
> contiguous range of storage on the storage medium for a file. This
> improves sequential read performance.
>
> Log-structured file systems are a good match for zoned storage. Such
> filesystems typically submit large bios to the block layer and have
> multiple bios outstanding concurrently. The block layer splits bios if
> their size exceeds the max_sectors limit (512 KiB for UFS; 128 KiB for a
> popular NVMe controller). This increases the number of concurrently
> outstanding bios further.
>
> While the NVMe standard supports two different commands for writing to
> zoned storage (Write and Zone Append), the SCSI standard only supports a
> single command for writing to zoned storage (WRITE). A write append
> emulation for SCSI exists in drivers/scsi/sd_zbc.c.
>
> File system implementers have to decide whether to use Write or Zone
> Append. While the Zone Append command tolerates reordering, with this
> command the filesystem cannot control the order in which the data is
> written on the medium without restricting the queue depth to one.
> Additionally, the latency of write operations is lower compared to zone
> append operations. From [2], a paper with performance results for one
> ZNS SSD model: "we observe that the latency of write operations is lower
> than that of append operations, even if the request size is the same".
What is the queue depth for this claim?
> The mq-deadline I/O scheduler serializes zoned writes even if these got
> reordered by the block layer. However, the mq-deadline I/O scheduler,
> just like any other single-queue I/O scheduler, is a performance
> bottleneck for SSDs that support more than 200 K IOPS. Current NVMe and
> UFS 4.0 block devices support more than 200 K IOPS.
FYI, I am about to post 20-something patches that completely remove zone write
locking and replace it with "zone write plugging". That is done above the IO
scheduler and also provides zone append emulation for drives that ask for it.
With this change:
- Zone append emulation is moved to the block layer, as a generic
implementation. sd and dm zone append emulation code is removed.
- Any scheduler can be used, including "none". mq-deadline zone block device
special support is removed.
- Overall, a lot less code (the series removes more code than it adds).
- Reordering problems, such as those caused by IO priority, are resolved as well.
This will need a lot of testing, which we are working on. But your help with
testing on UFS devices will be appreciated as well.
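For reference, the core idea of zone write plugging can be pictured
with a toy model (a sketch only, not the actual patches): at most one
write per zone is in flight, and later writes wait in a per-zone FIFO,
so per-zone ordering is preserved without any I/O scheduler.

```python
from collections import deque

class ZoneWritePlug:
    """Toy per-zone write plug: one write in flight, rest queued."""
    def __init__(self):
        self.in_flight = False
        self.pending = deque()

    def submit(self, bio, dispatch):
        if self.in_flight:
            self.pending.append(bio)   # plug the write
        else:
            self.in_flight = True
            dispatch(bio)

    def complete(self, dispatch):
        # On completion of the in-flight write, unplug the next one.
        if self.pending:
            dispatch(self.pending.popleft())
        else:
            self.in_flight = False
```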
>
> Supporting more than 200 K IOPS and giving the filesystem control over
> the data layout is only possible by supporting multiple outstanding
> writes and by preserving the order of these writes. Hence the proposal
> to discuss this topic during the 2024 edition of LSF/MM/BPF summit.
> Potential approaches to preserve the order of zoned writes are as follows:
> * Track (e.g. in a hash table) for which zones there are pending zoned
> writes and submit all zoned writes per zone to the same hardware
> queue.
> * For SCSI, if a SCSI device responds with a unit attention to a zoned
> write, activate the error handler if the block device reports an
> unaligned write error and sort by LBA and resubmit the zoned writes
> from inside the error handler.
>
> In other words, this proposal is about supporting both the Write and
> Zone Append commands as first class operations and to let filesystem
> implementers decide which command(s) to use.
>
> [1] Stavrinos, Theano, Daniel S. Berger, Ethan Katz-Bassett, and Wyatt
> Lloyd. "Don't be a blockhead: zoned namespaces make work on conventional
> SSDs obsolete." In Proceedings of the Workshop on Hot Topics in
> Operating Systems, pp. 144-151. 2021.
>
> [2] K. Doekemeijer, N. Tehrany, B. Chandrasekaran, M. Bjørling and A.
> Trivedi, "Performance Characterization of NVMe Flash Devices with Zoned
> Namespaces (ZNS)," 2023 IEEE International Conference on Cluster
> Computing (CLUSTER), Santa Fe, NM, USA, 2023, pp. 118-131, doi:
> 10.1109/CLUSTER52292.2023.00018.
> (https://ieeexplore.ieee.org/abstract/document/10319951).
>
--
Damien Le Moal
Western Digital Research
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-16 23:34 ` Damien Le Moal
@ 2024-01-17 1:21 ` Bart Van Assche
2024-01-17 17:36 ` Bart Van Assche
1 sibling, 0 replies; 21+ messages in thread
From: Bart Van Assche @ 2024-01-17 1:21 UTC (permalink / raw)
To: Damien Le Moal, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig, Jaegeuk Kim
On 1/16/24 15:34, Damien Le Moal wrote:
> On 1/17/24 03:20, Bart Van Assche wrote:
>> File system implementers have to decide whether to use Write or Zone
>> Append. While the Zone Append command tolerates reordering, with this
>> command the filesystem cannot control the order in which the data is
>> written on the medium without restricting the queue depth to one.
>> Additionally, the latency of write operations is lower compared to zone
>> append operations. From [2], a paper with performance results for one
>> ZNS SSD model: "we observe that the latency of write operations is lower
>> than that of append operations, even if the request size is the same".
>
> What is the queue depth for this claim ?
Hmm ... I haven't found this in the paper. Maybe I overlooked something.
>> The mq-deadline I/O scheduler serializes zoned writes even if these got
>> reordered by the block layer. However, the mq-deadline I/O scheduler,
>> just like any other single-queue I/O scheduler, is a performance
>> bottleneck for SSDs that support more than 200 K IOPS. Current NVMe and
>> UFS 4.0 block devices support more than 200 K IOPS.
>
> FYI, I am about to post 20-something patches that completely remove zone write
> locking and replace it with "zone write plugging". That is done above the IO
> scheduler and also provides zone append emulation for drives that ask for it.
>
> With this change:
> - Zone append emulation is moved to the block layer, as a generic
> implementation. sd and dm zone append emulation code is removed.
> - Any scheduler can be used, including "none". mq-deadline zone block device
> special support is removed.
> - Overall, a lot less code (the series removes more code than it adds).
> - Reordering problems such as due to IO priority is resolved as well.
>
> This will need a lot of testing, which we are working on. But your help with
> testing on UFS devices will be appreciated as well.
That sounds very interesting. I can help with reviewing the kernel
patches and also with testing these.
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-16 18:20 [LSF/MM/BPF TOPIC] Improving Zoned Storage Support Bart Van Assche
2024-01-16 23:34 ` Damien Le Moal
@ 2024-01-17 8:15 ` Viacheslav Dubeyko
1 sibling, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2024-01-17 8:15 UTC (permalink / raw)
To: Bart Van Assche
Cc: lsf-pc@lists.linux-foundation.org, linux-block@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-nvme@lists.infradead.org,
Damien Le Moal, Christoph Hellwig
> On Jan 16, 2024, at 9:20 PM, Bart Van Assche <bvanassche@acm.org> wrote:
>
> The advantages of zoned storage are well known [1]:
> * Higher sequential read and random read performance.
> * Lower write amplification.
> * Lower tail latency.
> * Higher usable capacity because of less overprovisioning.
>
>
<skipped>
>
> In other words, this proposal is about supporting both the Write and
> Zone Append commands as first class operations and to let filesystem
> implementers decide which command(s) to use.
>
I am interested in attending the discussion.
Thanks,
Slava.
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-16 23:34 ` Damien Le Moal
2024-01-17 1:21 ` Bart Van Assche
@ 2024-01-17 17:36 ` Bart Van Assche
2024-01-17 17:48 ` Jens Axboe
1 sibling, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2024-01-17 17:36 UTC (permalink / raw)
To: Damien Le Moal, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/16/24 15:34, Damien Le Moal wrote:
> FYI, I am about to post 20-something patches that completely remove zone write
> locking and replace it with "zone write plugging". That is done above the IO
> scheduler and also provides zone append emulation for drives that ask for it.
>
> With this change:
> - Zone append emulation is moved to the block layer, as a generic
> implementation. sd and dm zone append emulation code is removed.
> - Any scheduler can be used, including "none". mq-deadline zone block device
> special support is removed.
> - Overall, a lot less code (the series removes more code than it adds).
> - Reordering problems such as due to IO priority is resolved as well.
>
> This will need a lot of testing, which we are working on. But your help with
> testing on UFS devices will be appreciated as well.
When posting this patch series, please include performance results
(IOPS) for a zoned null_blk device instance. mq-deadline doesn't support
more than 200 K IOPS, which is less than what UFS devices support. I
hope that this performance bottleneck will be solved with the new
approach.
Thank you,
Bart.
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 17:36 ` Bart Van Assche
@ 2024-01-17 17:48 ` Jens Axboe
2024-01-17 18:22 ` Bart Van Assche
0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2024-01-17 17:48 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 10:36 AM, Bart Van Assche wrote:
> On 1/16/24 15:34, Damien Le Moal wrote:
>> FYI, I am about to post 20-something patches that completely remove zone write
>> locking and replace it with "zone write plugging". That is done above the IO
>> scheduler and also provides zone append emulation for drives that ask for it.
>>
>> With this change:
>> - Zone append emulation is moved to the block layer, as a generic
>> implementation. sd and dm zone append emulation code is removed.
>> - Any scheduler can be used, including "none". mq-deadline zone block device
>> special support is removed.
>> - Overall, a lot less code (the series removes more code than it adds).
>> - Reordering problems such as due to IO priority is resolved as well.
>>
>> This will need a lot of testing, which we are working on. But your help with
>> testing on UFS devices will be appreciated as well.
>
> When posting this patch series, please include performance results
> (IOPS) for a zoned null_blk device instance. mq-deadline doesn't support
> more than 200 K IOPS, which is less than what UFS devices support. I
> hope that this performance bottleneck will be solved with the new
> approach.
Not really zone related, but I was very aware of the single lock
limitations when I ported deadline to blk-mq. Was always hoping that
someone would actually take the time to make it more efficient, but so
far that hasn't happened. Or maybe it'll be a case of "just do it
yourself, Jens" at some point...
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 17:48 ` Jens Axboe
@ 2024-01-17 18:22 ` Bart Van Assche
2024-01-17 18:43 ` Jens Axboe
0 siblings, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2024-01-17 18:22 UTC (permalink / raw)
To: Jens Axboe, Damien Le Moal, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 09:48, Jens Axboe wrote:
>> When posting this patch series, please include performance results
>> (IOPS) for a zoned null_blk device instance. mq-deadline doesn't support
>> more than 200 K IOPS, which is less than what UFS devices support. I
>> hope that this performance bottleneck will be solved with the new
>> approach.
>
> Not really zone related, but I was very aware of the single lock
> limitations when I ported deadline to blk-mq. Was always hoping that
> someone would actually take the time to make it more efficient, but so
> far that hasn't happened. Or maybe it'll be a case of "just do it
> yourself, Jens" at some point...
Hi Jens,
I think it is something fundamental rather than something that can be
fixed. The I/O scheduling algorithms in mq-deadline and BFQ require
knowledge of all pending I/O requests. This implies that data structures
must be maintained that are shared across all CPU cores. Making these
thread-safe implies having synchronization mechanisms that are used
across all CPU cores. I think this is where the (about) 200 K IOPS
bottleneck comes from.
Additionally, the faster storage devices become, the larger the relative
overhead of an I/O scheduler is (assuming that I/O schedulers won't
become significantly faster).
A fundamental limitation of I/O schedulers is that multiple commands
must be submitted simultaneously to the storage device to achieve good
performance. However, if the queue depth is larger than one then the
device has some control over the order in which commands are executed.
Because of all the above reasons I'm recommending my colleagues to move
I/O prioritization into the storage device and to evolve towards a
future for solid storage devices without I/O schedulers. I/O schedulers
probably will remain important for rotating magnetic media.
Thank you,
Bart.
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 18:22 ` Bart Van Assche
@ 2024-01-17 18:43 ` Jens Axboe
2024-01-17 20:06 ` Jens Axboe
2024-01-18 0:38 ` Bart Van Assche
0 siblings, 2 replies; 21+ messages in thread
From: Jens Axboe @ 2024-01-17 18:43 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 11:22 AM, Bart Van Assche wrote:
> On 1/17/24 09:48, Jens Axboe wrote:
>>> When posting this patch series, please include performance results
>>> (IOPS) for a zoned null_blk device instance. mq-deadline doesn't support
>>> more than 200 K IOPS, which is less than what UFS devices support. I
>>> hope that this performance bottleneck will be solved with the new
>>> approach.
>>
>> Not really zone related, but I was very aware of the single lock
>> limitations when I ported deadline to blk-mq. Was always hoping that
>> someone would actually take the time to make it more efficient, but so
>> far that hasn't happened. Or maybe it'll be a case of "just do it
>> yourself, Jens" at some point...
>
> Hi Jens,
>
> I think it is something fundamental rather than something that can be
> fixed. The I/O scheduling algorithms in mq-deadline and BFQ require
> knowledge of all pending I/O requests. This implies that data structures
> must be maintained that are shared across all CPU cores. Making these
> thread-safe implies having synchronization mechanisms that are used
> across all CPU cores. I think this is where the (about) 200 K IOPS
> bottleneck comes from.
Has any analysis been done on where the limitation comes from? For
kicks, I ran an IOPS benchmark on a smaller AMD box. It has 4 fast
drives, and if I use mq-deadline on those 4 drives I can get 13.5M IOPS
using just 4 threads, and only 2 cores. That's vastly more than 200K, in
fact that's ~3.3M per drive. At the same time it's vastly slower than
the 5M that they will do without a scheduler.
Doing a quick look at what slows it down, it's a mix of not being able
to use completion side batching (which again then brings in TSC reading
as the highest cycle user...), and some general deadline overhead. In
order:
+ 3.32% io_uring [kernel.kallsyms] [k] __dd_dispatch_request
+ 2.71% io_uring [kernel.kallsyms] [k] dd_insert_requests
+ 1.21% io_uring [kernel.kallsyms] [k] dd_dispatch_request
with the rest being noise. Biggest one is dd_has_work(), which seems
like it would be trivially fixable by just having a shared flag if ANY
of the priorities had work.
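That shared-flag idea could look roughly like this (a sketch, not
mq-deadline's actual data structures; the three priority classes
mirror the RT/BE/IDLE split):

```python
class PrioFlags:
    """Keep one bitmask updated on insert/complete so has_work() is a
    single zero test instead of scanning every priority level."""
    NR_PRIO = 3  # e.g. RT, BE, IDLE

    def __init__(self):
        self.bits = 0
        self.counts = [0] * self.NR_PRIO

    def insert(self, prio):
        self.counts[prio] += 1
        self.bits |= 1 << prio

    def complete(self, prio):
        self.counts[prio] -= 1
        if self.counts[prio] == 0:
            self.bits &= ~(1 << prio)

    def has_work(self):
        return self.bits != 0
```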
Granted, this test case is single threaded as far as a device is
concerned, which is obviously best case. Which then leads me to believe
that it may indeed be locking that's the main issue here, which is what
I suspected from the get-go. And while yes this is a lot of shared data,
there's absolutely ZERO reason why we would end up with a hard limit of
~200K IOPS even maintaining the behavior it has now.
So let's try a single device, single thread:
IOPS=5.10M, BW=2.49GiB/s, IOS/call=32/31
That's device limits, using mq-deadline. Now let's try and have 4
threads banging on it, pinned to the same two cores:
IOPS=3.90M, BW=1903MiB/s, IOS/call=32/31
Certainly slower. Now let's try and have the scheduler place the same 4
threads where it sees fit:
IOPS=1.56M, BW=759MiB/s, IOS/call=32/31
Yikes! That's still substantially more than 200K IOPS even with heavy
contention, let's take a look at the profile:
- 70.63% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
- submitter_uring_fn
- entry_SYSCALL_64
- do_syscall_64
- __se_sys_io_uring_enter
- 70.62% io_submit_sqes
blk_finish_plug
__blk_flush_plug
- blk_mq_flush_plug_list
- 69.65% blk_mq_run_hw_queue
blk_mq_sched_dispatch_requests
- __blk_mq_sched_dispatch_requests
+ 60.61% dd_dispatch_request
+ 8.98% blk_mq_dispatch_rq_list
+ 0.98% dd_insert_requests
which is exactly as expected, we're spending 70% of the CPU cycles
banging on dd->lock.
Let's run the same thing again, but let's just do single requests at the
time:
IOPS=1.10M, BW=535MiB/s, IOS/call=1/0
worse again, but still a far cry from 200K IOPS. Contention basically
the same, but now we're not able to amortize other submission side
costs.
What I'm getting at is that it's a trap to just say "oh IO schedulers
can't scale beyond low IOPS" without even looking into where those
limits may be coming from. I'm willing to bet that increasing the
current limit for multi-threaded workloads would not be that difficult,
and it would probably 5x the performance potential of such setups.
Do we care? Maybe not, if we accept that an IO scheduler is just for
"slower devices". But let's not go around spouting some 200K number as
if it's gospel, when it depends on so many factors like IO workload,
system used, etc.
> Additionally, the faster storage devices become, the larger the relative
> overhead of an I/O scheduler is (assuming that I/O schedulers won't
> become significantly faster).
First part is definitely true, second assumption I think is a "I just
give up without even looking at why" kind of attitude.
> A fundamental limitation of I/O schedulers is that multiple commands
> must be submitted simultaneously to the storage device to achieve good
> performance. However, if the queue depth is larger than one then the
> device has some control over the order in which commands are executed.
This isn't new, that's been known and understood for decades.
> Because of all the above reasons I'm recommending my colleagues to move
> I/O prioritization into the storage device and to evolve towards a
> future for solid storage devices without I/O schedulers. I/O schedulers
> probably will remain important for rotating magnetic media.
While I don't agree with a lot of your stipulations above, this is a
recommendation I've been giving for a long time as well. Mostly
because it means less cruft for us to maintain in software, also full
well knowing that we're then at the mercy of hardware implementations
which may all behave differently, even though we historically have not
had good luck punting these problems to hardware and getting the
desired outcome.
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 18:43 ` Jens Axboe
@ 2024-01-17 20:06 ` Jens Axboe
2024-01-17 20:18 ` Bart Van Assche
2024-01-18 0:38 ` Bart Van Assche
1 sibling, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2024-01-17 20:06 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 11:43 AM, Jens Axboe wrote:
> Certainly slower. Now let's try and have the scheduler place the same 4
> threads where it sees fit:
>
> IOPS=1.56M, BW=759MiB/s, IOS/call=32/31
>
> Yikes! That's still substantially more than 200K IOPS even with heavy
> contention, let's take a look at the profile:
>
> - 70.63% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
> - submitter_uring_fn
> - entry_SYSCALL_64
> - do_syscall_64
> - __se_sys_io_uring_enter
> - 70.62% io_submit_sqes
> blk_finish_plug
> __blk_flush_plug
> - blk_mq_flush_plug_list
> - 69.65% blk_mq_run_hw_queue
> blk_mq_sched_dispatch_requests
> - __blk_mq_sched_dispatch_requests
> + 60.61% dd_dispatch_request
> + 8.98% blk_mq_dispatch_rq_list
> + 0.98% dd_insert_requests
>
> which is exactly as expected, we're spending 70% of the CPU cycles
> banging on dd->lock.
Case in point, I spent 10 min hacking up some smarts on the insertion
and dispatch side, and then we get:
IOPS=2.54M, BW=1240MiB/s, IOS/call=32/32
or about a 63% improvement when running the _exact same thing_. Looking
at profiles:
- 13.71% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
reducing the > 70% of locking contention down to ~14%. No change in data
structures, just an ugly hack that:
- Serializes dispatch, no point having someone hammer on dd->lock for
dispatch when already running
- Serialize insertions, punt to one of N buckets if insertion is already
busy. Current insertion will notice someone else did that, and will
prune the buckets and re-run insertion.
And while I seriously doubt that my quick hack is 100% foolproof, it
works as a proof of concept. If we can get that kind of reduction with
minimal effort, well...
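As a rough illustration of those two tricks (a toy model, not the
actual hack): dispatch is claimed with a trylock so only one thread
hammers the lock, and contended insertions are punted to buckets that
the lock holder drains later.

```python
import threading
from collections import deque

class ShardedSubmit:
    """Toy model: serialize dispatch via a trylock; on lock
    contention, park insertions in buckets instead of spinning."""
    def __init__(self, nr_buckets=4):
        self.lock = threading.Lock()
        self.queue = deque()
        self.buckets = [deque() for _ in range(nr_buckets)]

    def insert(self, rq, cpu):
        if self.lock.acquire(blocking=False):
            try:
                self._drain_buckets()
                self.queue.append(rq)
            finally:
                self.lock.release()
        else:
            # Lock contended: punt to a bucket; the current lock
            # holder will notice and prune the buckets.
            self.buckets[cpu % len(self.buckets)].append(rq)

    def _drain_buckets(self):
        for b in self.buckets:
            while b:
                self.queue.append(b.popleft())

    def dispatch_all(self):
        with self.lock:
            self._drain_buckets()
            out = list(self.queue)
            self.queue.clear()
            return out
```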
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 20:06 ` Jens Axboe
@ 2024-01-17 20:18 ` Bart Van Assche
2024-01-17 20:20 ` Jens Axboe
0 siblings, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2024-01-17 20:18 UTC (permalink / raw)
To: Jens Axboe, Damien Le Moal, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 12:06, Jens Axboe wrote:
> Case in point, I spent 10 min hacking up some smarts on the insertion
> and dispatch side, and then we get:
>
> IOPS=2.54M, BW=1240MiB/s, IOS/call=32/32
>
> or about a 63% improvement when running the _exact same thing_. Looking
> at profiles:
>
> - 13.71% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
>
> reducing the > 70% of locking contention down to ~14%. No change in data
> structures, just an ugly hack that:
>
> - Serializes dispatch, no point having someone hammer on dd->lock for
> dispatch when already running
> - Serialize insertions, punt to one of N buckets if insertion is already
> busy. Current insertion will notice someone else did that, and will
> prune the buckets and re-run insertion.
>
> And while I seriously doubt that my quick hack is 100% fool proof, it
> works as a proof of concept. If we can get that kind of reduction with
> minimal effort, well...
If nobody else beats me to it then I will look into using separate
locks in the mq-deadline scheduler for insertion and dispatch.
Bart.
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 20:18 ` Bart Van Assche
@ 2024-01-17 20:20 ` Jens Axboe
2024-01-17 21:02 ` Jens Axboe
0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2024-01-17 20:20 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 1:18 PM, Bart Van Assche wrote:
> On 1/17/24 12:06, Jens Axboe wrote:
>> Case in point, I spent 10 min hacking up some smarts on the insertion
>> and dispatch side, and then we get:
>>
>> IOPS=2.54M, BW=1240MiB/s, IOS/call=32/32
>>
>> or about a 63% improvement when running the _exact same thing_. Looking
>> at profiles:
>>
>> - 13.71% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
>>
>> reducing the > 70% of locking contention down to ~14%. No change in data
>> structures, just an ugly hack that:
>>
>> - Serializes dispatch, no point having someone hammer on dd->lock for
>> dispatch when already running
>> - Serialize insertions, punt to one of N buckets if insertion is already
>> busy. Current insertion will notice someone else did that, and will
>> prune the buckets and re-run insertion.
>>
>> And while I seriously doubt that my quick hack is 100% fool proof, it
>> works as a proof of concept. If we can get that kind of reduction with
>> minimal effort, well...
>
> If nobody else beats me to it then I will look into using separate
> locks in the mq-deadline scheduler for insertion and dispatch.
That's not going to help by itself, as most of the contention (as I
showed in the profile trace in the email) is from dispatch competing
with itself, and not necessarily dispatch competing with insertion. And
not sure how that would even work, as insert and dispatch are working on
the same structures.
Do some proper analysis first, then that will show you where the problem
is.
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 20:20 ` Jens Axboe
@ 2024-01-17 21:02 ` Jens Axboe
2024-01-17 21:14 ` Jens Axboe
0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2024-01-17 21:02 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 1:20 PM, Jens Axboe wrote:
> On 1/17/24 1:18 PM, Bart Van Assche wrote:
>> On 1/17/24 12:06, Jens Axboe wrote:
>>> Case in point, I spent 10 min hacking up some smarts on the insertion
>>> and dispatch side, and then we get:
>>>
>>> IOPS=2.54M, BW=1240MiB/s, IOS/call=32/32
>>>
>>> or about a 63% improvement when running the _exact same thing_. Looking
>>> at profiles:
>>>
>>> - 13.71% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
>>>
>>> reducing the > 70% of locking contention down to ~14%. No change in data
>>> structures, just an ugly hack that:
>>>
>>> - Serializes dispatch, no point having someone hammer on dd->lock for
>>> dispatch when already running
>>> - Serialize insertions, punt to one of N buckets if insertion is already
>>> busy. Current insertion will notice someone else did that, and will
>>> prune the buckets and re-run insertion.
>>>
>>> And while I seriously doubt that my quick hack is 100% foolproof, it
>>> works as a proof of concept. If we can get that kind of reduction with
>>> minimal effort, well...
>>
>> If nobody else beats me to it then I will look into using separate
>> locks in the mq-deadline scheduler for insertion and dispatch.
>
> That's not going to help by itself, as most of the contention (as I
> showed in the profile trace in the email) is from dispatch competing
> with itself, and not necessarily dispatch competing with insertion. And
> not sure how that would even work, as insert and dispatch are working on
> the same structures.
>
> Do some proper analysis first, then that will show you where the problem
> is.
Here's a quick'n dirty that brings it from 1.56M to:
IOPS=3.50M, BW=1711MiB/s, IOS/call=32/32
by just doing something stupid - if someone is already dispatching, then
don't dispatch anything. Clearly shows that this is just dispatch
contention. But a 160% improvement from looking at the initial profile I
sent and hacking up something stupid in a few minutes does show that
there's a ton of low hanging fruit here.
This is run on nvme, so there's going to be lots of hardware queues.
This may even be worth solving in blk-mq rather than trying to hack around
it in the scheduler; blk-mq has no idea that mq-deadline is serializing
all hardware queues like this. Or we just solve it in the io scheduler,
since that's the one with the knowledge.
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..822337521fc5 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -80,6 +80,11 @@ struct dd_per_prio {
};
struct deadline_data {
+ spinlock_t lock;
+ spinlock_t zone_lock ____cacheline_aligned_in_smp;
+
+ unsigned long dispatch_state;
+
/*
* run time data
*/
@@ -100,9 +105,6 @@ struct deadline_data {
int front_merges;
u32 async_depth;
int prio_aging_expire;
-
- spinlock_t lock;
- spinlock_t zone_lock;
};
/* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +602,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
struct request *rq;
enum dd_prio prio;
+ if (test_bit(0, &dd->dispatch_state) &&
+ test_and_set_bit(0, &dd->dispatch_state))
+ return NULL;
+
spin_lock(&dd->lock);
rq = dd_dispatch_prio_aged_requests(dd, now);
if (rq)
@@ -616,6 +622,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
}
unlock:
+ clear_bit(0, &dd->dispatch_state);
spin_unlock(&dd->lock);
return rq;
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 21:02 ` Jens Axboe
@ 2024-01-17 21:14 ` Jens Axboe
2024-01-17 21:33 ` Bart Van Assche
0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2024-01-17 21:14 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 2:02 PM, Jens Axboe wrote:
> On 1/17/24 1:20 PM, Jens Axboe wrote:
>> On 1/17/24 1:18 PM, Bart Van Assche wrote:
>>> On 1/17/24 12:06, Jens Axboe wrote:
>>>> Case in point, I spent 10 min hacking up some smarts on the insertion
>>>> and dispatch side, and then we get:
>>>>
>>>> IOPS=2.54M, BW=1240MiB/s, IOS/call=32/32
>>>>
>>>> or about a 63% improvement when running the _exact same thing_. Looking
>>>> at profiles:
>>>>
>>>> - 13.71% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
>>>>
>>>> reducing the > 70% of locking contention down to ~14%. No change in data
>>>> structures, just an ugly hack that:
>>>>
>>>> - Serializes dispatch, no point having someone hammer on dd->lock for
>>>> dispatch when already running
>>>> - Serialize insertions, punt to one of N buckets if insertion is already
>>>> busy. Current insertion will notice someone else did that, and will
>>>> prune the buckets and re-run insertion.
>>>>
>>>> And while I seriously doubt that my quick hack is 100% foolproof, it
>>>> works as a proof of concept. If we can get that kind of reduction with
>>>> minimal effort, well...
>>>
>>> If nobody else beats me to it then I will look into using separate
>>> locks in the mq-deadline scheduler for insertion and dispatch.
>>
>> That's not going to help by itself, as most of the contention (as I
>> showed in the profile trace in the email) is from dispatch competing
>> with itself, and not necessarily dispatch competing with insertion. And
>> not sure how that would even work, as insert and dispatch are working on
>> the same structures.
>>
>> Do some proper analysis first, then that will show you where the problem
>> is.
>
> Here's a quick'n dirty that brings it from 1.56M to:
>
> IOPS=3.50M, BW=1711MiB/s, IOS/call=32/32
>
> by just doing something stupid - if someone is already dispatching, then
> don't dispatch anything. Clearly shows that this is just dispatch
> contention. But a 160% improvement from looking at the initial profile I
224%, not sure where that math came from...
Anyway, just replying as I sent out the wrong patch. Here's the one I
tested.
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..133ab4a2673b 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -80,6 +80,13 @@ struct dd_per_prio {
};
struct deadline_data {
+ struct {
+ spinlock_t lock;
+ spinlock_t zone_lock;
+ } ____cacheline_aligned_in_smp;
+
+ unsigned long dispatch_state;
+
/*
* run time data
*/
@@ -100,9 +107,6 @@ struct deadline_data {
int front_merges;
u32 async_depth;
int prio_aging_expire;
-
- spinlock_t lock;
- spinlock_t zone_lock;
};
/* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +604,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
struct request *rq;
enum dd_prio prio;
+ if (test_bit(0, &dd->dispatch_state) ||
+ test_and_set_bit(0, &dd->dispatch_state))
+ return NULL;
+
spin_lock(&dd->lock);
rq = dd_dispatch_prio_aged_requests(dd, now);
if (rq)
@@ -616,6 +624,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
}
unlock:
+ clear_bit(0, &dd->dispatch_state);
spin_unlock(&dd->lock);
return rq;
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 21:14 ` Jens Axboe
@ 2024-01-17 21:33 ` Bart Van Assche
2024-01-17 21:40 ` Jens Axboe
0 siblings, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2024-01-17 21:33 UTC (permalink / raw)
To: Jens Axboe, Damien Le Moal, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 13:14, Jens Axboe wrote:
> /* Maps an I/O priority class to a deadline scheduler priority. */
> @@ -600,6 +604,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> struct request *rq;
> enum dd_prio prio;
>
> + if (test_bit(0, &dd->dispatch_state) ||
> + test_and_set_bit(0, &dd->dispatch_state))
> + return NULL;
> +
> spin_lock(&dd->lock);
> rq = dd_dispatch_prio_aged_requests(dd, now);
> if (rq)
> @@ -616,6 +624,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> }
>
> unlock:
> + clear_bit(0, &dd->dispatch_state);
> spin_unlock(&dd->lock);
Can the above code be simplified by using spin_trylock() instead of
test_bit() and test_and_set_bit()?
Please note that whether or not spin_trylock() is used, there is a
race condition in this approach: if dd_dispatch_request() is called
just before another CPU calls spin_unlock() from inside
dd_dispatch_request() then some requests won't be dispatched until
the next time dd_dispatch_request() is called.
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 21:33 ` Bart Van Assche
@ 2024-01-17 21:40 ` Jens Axboe
2024-01-18 0:43 ` Bart Van Assche
0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2024-01-17 21:40 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 2:33 PM, Bart Van Assche wrote:
> On 1/17/24 13:14, Jens Axboe wrote:
>> /* Maps an I/O priority class to a deadline scheduler priority. */
>> @@ -600,6 +604,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>> struct request *rq;
>> enum dd_prio prio;
>> + if (test_bit(0, &dd->dispatch_state) ||
>> + test_and_set_bit(0, &dd->dispatch_state))
>> + return NULL;
>> +
>> spin_lock(&dd->lock);
>> rq = dd_dispatch_prio_aged_requests(dd, now);
>> if (rq)
>> @@ -616,6 +624,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>> }
>> unlock:
>> + clear_bit(0, &dd->dispatch_state);
>> spin_unlock(&dd->lock);
>
> Can the above code be simplified by using spin_trylock() instead of
> test_bit() and test_and_set_bit()?
It can't, because you can't assume that dispatch is running just because
dd->lock is already being held.
> Please note that whether or not spin_trylock() is used, there is a
> race condition in this approach: if dd_dispatch_request() is called
> just before another CPU calls spin_unlock() from inside
> dd_dispatch_request() then some requests won't be dispatched until the
> next time dd_dispatch_request() is called.
Sure, that's not surprising. What I cared most about here is that we
should not have a race such that we'd stall. Since we haven't returned
this request just yet if we race, we know at least one will be issued
and we'll re-run at completion. So yeah, we may very well skip an issue,
that's well known within that change, which will be postponed to the
next queue run.
The patch is more to demonstrate that it would not take much to fix this
case, at least, it's a proof-of-concept.
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 18:43 ` Jens Axboe
2024-01-17 20:06 ` Jens Axboe
@ 2024-01-18 0:38 ` Bart Van Assche
2024-01-18 0:42 ` Jens Axboe
1 sibling, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2024-01-18 0:38 UTC (permalink / raw)
To: Jens Axboe, Damien Le Moal, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 10:43, Jens Axboe wrote:
> Do we care? Maybe not, if we accept that an IO scheduler is just for
> "slower devices". But let's not go around spouting some 200K number as
> if it's gospel, when it depends on so many factors like IO workload,
> system used, etc.
I've never seen more than 200K IOPS in a single-threaded test. Since
your tests report higher IOPS numbers, I assume that you are submitting
I/O from multiple CPU cores at the same time.
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-18 0:38 ` Bart Van Assche
@ 2024-01-18 0:42 ` Jens Axboe
2024-01-18 0:54 ` Bart Van Assche
0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2024-01-18 0:42 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 5:38 PM, Bart Van Assche wrote:
> On 1/17/24 10:43, Jens Axboe wrote:
>> Do we care? Maybe not, if we accept that an IO scheduler is just for
>> "slower devices". But let's not go around spouting some 200K number as
>> if it's gospel, when it depends on so many factors like IO workload,
>> system used, etc.
> I've never seen more than 200K IOPS in a single-threaded test. Since
> your tests report higher IOPS numbers, I assume that you are submitting
> I/O from multiple CPU cores at the same time.
Single core, using mq-deadline (with the poc patch, but number without
you can already find in a previous reply):
axboe@7950x ~/g/fio (master)> cat /sys/block/nvme0n1/queue/scheduler
none [mq-deadline]
axboe@7950x ~/g/fio (master)> sudo t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 /dev/nvme0n1
submitter=0, tid=1957, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=5.10M, BW=2.49GiB/s, IOS/call=32/31
IOPS=5.10M, BW=2.49GiB/s, IOS/call=32/32
IOPS=5.10M, BW=2.49GiB/s, IOS/call=31/31
Using non-polled IO, the number is around 4M.
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-17 21:40 ` Jens Axboe
@ 2024-01-18 0:43 ` Bart Van Assche
2024-01-18 14:51 ` Jens Axboe
0 siblings, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2024-01-18 0:43 UTC (permalink / raw)
To: Jens Axboe, Damien Le Moal, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 13:40, Jens Axboe wrote:
> On 1/17/24 2:33 PM, Bart Van Assche wrote:
>> Please note that whether or not spin_trylock() is used, there is a
>> race condition in this approach: if dd_dispatch_request() is called
>> just before another CPU calls spin_unlock() from inside
>> dd_dispatch_request() then some requests won't be dispatched until the
>> next time dd_dispatch_request() is called.
>
> Sure, that's not surprising. What I cared most about here is that we
> should not have a race such that we'd stall. Since we haven't returned
> this request just yet if we race, we know at least one will be issued
> and we'll re-run at completion. So yeah, we may very well skip an issue,
> that's well known within that change, which will be postponed to the
> next queue run.
>
> The patch is more to demonstrate that it would not take much to fix this
> case, at least, it's a proof-of-concept.
The patch below implements what has been discussed in this e-mail
thread. I do not recommend applying this patch since it reduces
single-threaded performance by 11% on an Intel Xeon Processor (Skylake, IBRS):
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..d83831ced69a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -84,6 +84,10 @@ struct deadline_data {
* run time data
*/
+ spinlock_t lock;
+ spinlock_t dispatch_lock;
+ spinlock_t zone_lock;
+
struct dd_per_prio per_prio[DD_PRIO_COUNT];
/* Data direction of latest dispatched request. */
@@ -100,9 +104,6 @@ struct deadline_data {
int front_merges;
u32 async_depth;
int prio_aging_expire;
-
- spinlock_t lock;
- spinlock_t zone_lock;
};
/* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +601,16 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
struct request *rq;
enum dd_prio prio;
+ /*
+ * Reduce lock contention on dd->lock by re-running the queue
+ * asynchronously if another CPU core is already executing
+ * dd_dispatch_request().
+ */
+ if (!spin_trylock(&dd->dispatch_lock)) {
+ blk_mq_delay_run_hw_queue(hctx, 0);
+ return NULL;
+ }
+
spin_lock(&dd->lock);
rq = dd_dispatch_prio_aged_requests(dd, now);
if (rq)
@@ -617,6 +628,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
unlock:
spin_unlock(&dd->lock);
+ spin_unlock(&dd->dispatch_lock);
return rq;
}
@@ -723,6 +735,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
dd->fifo_batch = fifo_batch;
dd->prio_aging_expire = prio_aging_expire;
spin_lock_init(&dd->lock);
+ spin_lock_init(&dd->dispatch_lock);
spin_lock_init(&dd->zone_lock);
/* We dispatch from request queue wide instead of hw queue */
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-18 0:42 ` Jens Axboe
@ 2024-01-18 0:54 ` Bart Van Assche
2024-01-18 15:07 ` Jens Axboe
0 siblings, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2024-01-18 0:54 UTC (permalink / raw)
To: Jens Axboe, Damien Le Moal, lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 16:42, Jens Axboe wrote:
> On 1/17/24 5:38 PM, Bart Van Assche wrote:
>> On 1/17/24 10:43, Jens Axboe wrote:
>>> Do we care? Maybe not, if we accept that an IO scheduler is just for
>>> "slower devices". But let's not go around spouting some 200K number as
>>> if it's gospel, when it depends on so many factors like IO workload,
>>> system used, etc.
>> I've never seen more than 200K IOPS in a single-threaded test. Since
>> your tests report higher IOPS numbers, I assume that you are submitting
>> I/O from multiple CPU cores at the same time.
>
> Single core, using mq-deadline (with the poc patch, but number without
> you can already find in a previous reply):
>
> axboe@7950x ~/g/fio (master)> cat /sys/block/nvme0n1/queue/scheduler
> none [mq-deadline]
> axboe@7950x ~/g/fio (master)> sudo t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 /dev/nvme0n1
>
> submitter=0, tid=1957, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=5.10M, BW=2.49GiB/s, IOS/call=32/31
> IOPS=5.10M, BW=2.49GiB/s, IOS/call=32/32
> IOPS=5.10M, BW=2.49GiB/s, IOS/call=31/31
>
> Using non-polled IO, the number is around 4M.
A correction: my tests ran with 72 fio jobs instead of 1. I used
fio + io_uring + null_blk in my tests. I see about 1100 K IOPS with
a single fio job and about 150 K IOPS with 72 fio jobs. This shows
how I measured mq-deadline performance:
modprobe null_blk
fio --bs=4096 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
--ioengine=io_uring --ioscheduler=mq-deadline --norandommap \
--runtime=60 --rw=randread --thread --time_based=1 --buffered=0 \
--numjobs=72 --iodepth=128 --iodepth_batch_submit=64 \
--iodepth_batch_complete=64 --name=/dev/nullb0 --filename=/dev/nullb0
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-18 0:43 ` Bart Van Assche
@ 2024-01-18 14:51 ` Jens Axboe
0 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2024-01-18 14:51 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 5:43 PM, Bart Van Assche wrote:
> On 1/17/24 13:40, Jens Axboe wrote:
>> On 1/17/24 2:33 PM, Bart Van Assche wrote:
>>> Please note that whether or not spin_trylock() is used, there is a
>>> race condition in this approach: if dd_dispatch_request() is called
>>> just before another CPU calls spin_unlock() from inside
>>> dd_dispatch_request() then some requests won't be dispatched until the
>>> next time dd_dispatch_request() is called.
>>
>> Sure, that's not surprising. What I cared most about here is that we
>> should not have a race such that we'd stall. Since we haven't returned
>> this request just yet if we race, we know at least one will be issued
>> and we'll re-run at completion. So yeah, we may very well skip an issue,
>> that's well known within that change, which will be postponed to the
>> next queue run.
>>
>> The patch is more to demonstrate that it would not take much to fix this
>> case, at least, it's a proof-of-concept.
>
> The patch below implements what has been discussed in this e-mail
> thread. I do not recommend applying this patch since it reduces single-
No, it implements a suggestion that you had; it had nothing to do with
what I suggested.
> threaded performance by 11% on an Intel Xeon Processor (Skylake, IBRS):
Not sure why you are even bothering sending a patch that makes things
_worse_ when the whole point is to reduce contention here. You added
another lock, and on top of that, you added code that now just bangs on
dispatch if it's busy already.
I already gave you a decent starting point with a patch that actually
reduces contention, no idea what this thing is.
--
Jens Axboe
* Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support
2024-01-18 0:54 ` Bart Van Assche
@ 2024-01-18 15:07 ` Jens Axboe
0 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2024-01-18 15:07 UTC (permalink / raw)
To: Bart Van Assche, Damien Le Moal,
lsf-pc@lists.linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig
On 1/17/24 5:54 PM, Bart Van Assche wrote:
> On 1/17/24 16:42, Jens Axboe wrote:
>> On 1/17/24 5:38 PM, Bart Van Assche wrote:
>>> On 1/17/24 10:43, Jens Axboe wrote:
>>>> Do we care? Maybe not, if we accept that an IO scheduler is just for
>>>> "slower devices". But let's not go around spouting some 200K number as
>>>> if it's gospel, when it depends on so many factors like IO workload,
>>>> system used, etc.
>>> I've never seen more than 200K IOPS in a single-threaded test. Since
>>> your tests report higher IOPS numbers, I assume that you are submitting
>>> I/O from multiple CPU cores at the same time.
>>
>> Single core, using mq-deadline (with the poc patch, but number without
>> you can already find in a previous reply):
>>
>> axboe@7950x ~/g/fio (master)> cat /sys/block/nvme0n1/queue/scheduler
>> none [mq-deadline]
>> axboe@7950x ~/g/fio (master)> sudo t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 /dev/nvme0n1
>>
>> submitter=0, tid=1957, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=5.10M, BW=2.49GiB/s, IOS/call=32/31
>> IOPS=5.10M, BW=2.49GiB/s, IOS/call=32/32
>> IOPS=5.10M, BW=2.49GiB/s, IOS/call=31/31
>>
>> Using non-polled IO, the number is around 4M.
>
> A correction: my tests ran with 72 fio jobs instead of 1. I used
> fio + io_uring + null_blk in my tests. I see about 1100 K IOPS with
> a single fio job and about 150 K IOPS with 72 fio jobs. This shows
> how I measured mq-deadline performance:
>
> modprobe null_blk
> fio --bs=4096 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
> --ioengine=io_uring --ioscheduler=mq-deadline --norandommap \
> --runtime=60 --rw=randread --thread --time_based=1 --buffered=0 \
> --numjobs=72 --iodepth=128 --iodepth_batch_submit=64 \
> --iodepth_batch_complete=64 --name=/dev/nullb0 --filename=/dev/nullb0
I don't think you're testing what you think you are testing here. With a
queue depth of > 9000, you are going to be sleeping basically all of the
time. Hardly a realistic workload; you'll spend a lot of time on that and
also make the internal data structures much slower.
Since I still have the box booted with my patch, here's what I see:
Jobs Queue depth IOPS
============================
1 128 3090K
32 4 1313K
and taking a quick peek, we're spending a lot of time trying to merge.
Disabling expensive merges, and I get:
Jobs Queue depth IOPS
============================
32 4 1980K
which is more reasonable. I used 32 jobs as I have 32 threads in this
box, and QD=4 to keep the same overall queue depth.
All the contention from the numjobs=32 case is insertion at that point,
in fact that's 50% of the time! So I added a quick hack that makes that a
bit better; see below, it's folded in with the previous one. That brings
it to 2300K, and queue lock contention is 18%, down from 50% before.
As before, don't take this patch as gospel, it's just a proof of concept
that it would indeed be possible to make this work. 2300K isn't 3100K,
but it's not terrible scaling for a) all CPUs in the system hammering on
the device, b) a single queue device, and c) using an IO scheduler.
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..46814b5ed1c9 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -79,7 +79,30 @@ struct dd_per_prio {
struct io_stats_per_prio stats;
};
+#define DD_CPU_BUCKETS 32
+#define DD_CPU_BUCKETS_MASK (DD_CPU_BUCKETS - 1)
+
+struct dd_bucket_list {
+ struct list_head list;
+ spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
+enum {
+ DD_DISPATCHING = 0,
+ DD_INSERTING = 1,
+};
+
struct deadline_data {
+ struct {
+ spinlock_t lock;
+ spinlock_t zone_lock;
+ } ____cacheline_aligned_in_smp;
+
+ unsigned long run_state;
+
+ atomic_t insert_seq;
+ struct dd_bucket_list bucket_lists[DD_CPU_BUCKETS];
+
/*
* run time data
*/
@@ -100,9 +123,6 @@ struct deadline_data {
int front_merges;
u32 async_depth;
int prio_aging_expire;
-
- spinlock_t lock;
- spinlock_t zone_lock;
};
/* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +620,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
struct request *rq;
enum dd_prio prio;
+ if (test_bit(DD_DISPATCHING, &dd->run_state) ||
+ test_and_set_bit(DD_DISPATCHING, &dd->run_state))
+ return NULL;
+
spin_lock(&dd->lock);
rq = dd_dispatch_prio_aged_requests(dd, now);
if (rq)
@@ -616,6 +640,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
}
unlock:
+ clear_bit(DD_DISPATCHING, &dd->run_state);
spin_unlock(&dd->lock);
return rq;
@@ -694,7 +719,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
struct deadline_data *dd;
struct elevator_queue *eq;
enum dd_prio prio;
- int ret = -ENOMEM;
+ int i, ret = -ENOMEM;
eq = elevator_alloc(q, e);
if (!eq)
@@ -706,6 +731,11 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
eq->elevator_data = dd;
+ for (i = 0; i < DD_CPU_BUCKETS; i++) {
+ INIT_LIST_HEAD(&dd->bucket_lists[i].list);
+ spin_lock_init(&dd->bucket_lists[i].lock);
+ }
+
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -724,6 +754,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
dd->prio_aging_expire = prio_aging_expire;
spin_lock_init(&dd->lock);
spin_lock_init(&dd->zone_lock);
+ atomic_set(&dd->insert_seq, 0);
/* We dispatch from request queue wide instead of hw queue */
blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
@@ -789,6 +820,22 @@ static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
return ret;
}
+static void dd_dispatch_from_buckets(struct deadline_data *dd,
+ struct list_head *list)
+{
+ int i;
+
+ for (i = 0; i < DD_CPU_BUCKETS; i++) {
+ struct dd_bucket_list *bucket = &dd->bucket_lists[i];
+
+ if (list_empty_careful(&bucket->list))
+ continue;
+ spin_lock(&bucket->lock);
+ list_splice_init(&bucket->list, list);
+ spin_unlock(&bucket->lock);
+ }
+}
+
/*
* add rq to rbtree and fifo
*/
@@ -868,8 +915,29 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
struct request_queue *q = hctx->queue;
struct deadline_data *dd = q->elevator->elevator_data;
LIST_HEAD(free);
+ int seq, new_seq;
- spin_lock(&dd->lock);
+ seq = atomic_inc_return(&dd->insert_seq);
+ if (!spin_trylock(&dd->lock)) {
+ if (!test_bit(DD_INSERTING, &dd->run_state)) {
+ spin_lock(&dd->lock);
+ } else {
+ struct dd_bucket_list *bucket;
+ int cpu = get_cpu();
+
+ bucket = &dd->bucket_lists[cpu & DD_CPU_BUCKETS_MASK];
+ spin_lock(&bucket->lock);
+ list_splice_init(list, &bucket->list);
+ spin_unlock(&bucket->lock);
+ put_cpu();
+ if (test_bit(DD_INSERTING, &dd->run_state))
+ return;
+ spin_lock(&dd->lock);
+ }
+ }
+
+ set_bit(DD_INSERTING, &dd->run_state);
+retry:
while (!list_empty(list)) {
struct request *rq;
@@ -877,7 +945,16 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
list_del_init(&rq->queuelist);
dd_insert_request(hctx, rq, flags, &free);
}
+
+ new_seq = atomic_read(&dd->insert_seq);
+ if (seq != new_seq) {
+ seq = new_seq;
+ dd_dispatch_from_buckets(dd, list);
+ goto retry;
+ }
+
spin_unlock(&dd->lock);
+ clear_bit(DD_INSERTING, &dd->run_state);
blk_mq_free_requests(&free);
}
--
Jens Axboe
Thread overview: 21+ messages
2024-01-16 18:20 [LSF/MM/BPF TOPIC] Improving Zoned Storage Support Bart Van Assche
2024-01-16 23:34 ` Damien Le Moal
2024-01-17 1:21 ` Bart Van Assche
2024-01-17 17:36 ` Bart Van Assche
2024-01-17 17:48 ` Jens Axboe
2024-01-17 18:22 ` Bart Van Assche
2024-01-17 18:43 ` Jens Axboe
2024-01-17 20:06 ` Jens Axboe
2024-01-17 20:18 ` Bart Van Assche
2024-01-17 20:20 ` Jens Axboe
2024-01-17 21:02 ` Jens Axboe
2024-01-17 21:14 ` Jens Axboe
2024-01-17 21:33 ` Bart Van Assche
2024-01-17 21:40 ` Jens Axboe
2024-01-18 0:43 ` Bart Van Assche
2024-01-18 14:51 ` Jens Axboe
2024-01-18 0:38 ` Bart Van Assche
2024-01-18 0:42 ` Jens Axboe
2024-01-18 0:54 ` Bart Van Assche
2024-01-18 15:07 ` Jens Axboe
2024-01-17 8:15 ` Viacheslav Dubeyko