Linux block layer
 help / color / mirror / Atom feed
* Re: [PATCH v4 7/7] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
From: Jens Axboe @ 2023-01-09  3:54 UTC (permalink / raw)
  To: David Howells, Al Viro
  Cc: Christoph Hellwig, Matthew Wilcox, Logan Gunthorpe,
	Christoph Hellwig, Jeff Layton, linux-fsdevel, linux-block,
	linux-kernel
In-Reply-To: <167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk>

On 1/6/23 5:34 PM, David Howells wrote:
> Convert the block layer's bio code to use iov_iter_extract_pages() instead
> of iov_iter_get_pages().  This will pin pages or leave them unaltered
> rather than getting a ref on them as appropriate to the source iterator.
> 
> A field, bi_cleanup_mode, is added to the bio struct that gets set by
> iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
> necessary.  FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page().  Other
> flags could also be used in future.
> 
> Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
> indicate that attached pages are ref'd by default.  Cloning sets it to 0.
> __bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
> indicates.

What's the motivation for this change? It's growing struct bio, which we
can have a lot of in the system. I read the cover letter too and I can
tell what the change does, but there's no justification really for the
change.

So unless there's a good reason to do this, then that's a NAK in terms
of just the addition to struct bio alone.

-- 
Jens Axboe



^ permalink raw reply

* Re: [PATCHv2 00/12] iov_iter: replace import_single_range with ubuf
From: Jens Axboe @ 2023-01-09  3:51 UTC (permalink / raw)
  To: Keith Busch, linux-kernel, Alexander Viro
  Cc: linux-aio, linux-fsdevel, hch, io-uring, Pavel Begunkov,
	Eric Dumazet, Steven Rostedt, davem, Jakub Kicinski, Paolo Abeni,
	David Howells, Jarkko Sakkinen, Paul Moore, keyrings,
	linux-security-module, linux-trace-kernel, linux-block, netdev,
	Keith Busch
In-Reply-To: <20230105190741.2405013-1-kbusch@meta.com>

On 1/5/23 12:07 PM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> ITER_UBUF is a more efficient representation when using single vector
> buffers, providing small optimizations over ITER_IOVEC. This series
> introduces a helper to set these up, and replaces all applicable users
> of import_single_range with the new helper. And since there are no
> single range users left after this change, the helper is no longer
> needed.
> 
> As noted in v1(*), there are some fundamental differences to how io_uring
> compares to read/write/readv/writev. There are only the two affected
> file_operations, and they already do not work with io_uring due to their
> diverging semantics for vectored vs non-vectored read/write. Therefore,
> this series having io_uring prefer ubuf iov_iter isn't introducing new
> breakage.

Pondering how to stage this, both for later upstream but also for
testing. Would probably make the best sense to stage 1-5 separately,
and then just punt the remaining ones to the appropriate subsystems.
And then 12/12 can go in when they have all been applied.

-- 
Jens Axboe



^ permalink raw reply

* Re: [PATCHv2 02/12] io_uring: switch network send/recv to ITER_UBUF
From: Jens Axboe @ 2023-01-09  3:49 UTC (permalink / raw)
  To: Keith Busch, linux-kernel, Alexander Viro
  Cc: linux-aio, linux-fsdevel, hch, io-uring, Pavel Begunkov,
	Eric Dumazet, Steven Rostedt, davem, Jakub Kicinski, Paolo Abeni,
	David Howells, Jarkko Sakkinen, Paul Moore, keyrings,
	linux-security-module, linux-trace-kernel, linux-block, netdev,
	Keith Busch
In-Reply-To: <20230105190741.2405013-3-kbusch@meta.com>

On 1/5/23 12:07 PM, Keith Busch wrote:
> From: Jens Axboe <axboe@kernel.dk>
> 
> This is more efficient that iter_iov.

Looks like that s/that/than typo ended up in a few spots throughout
this series... Nothing major, but figured I'd bring it up.

-- 
Jens Axboe



^ permalink raw reply

* Re: [PATCH block/for-6.2-fixes] block: Drop spurious might_sleep() from blk_put_queue()
From: Jens Axboe @ 2023-01-09  3:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dan Carpenter, Christoph Hellwig, Luis Chamberlain, linux-block,
	linux-kernel
In-Reply-To: <Y7iFwjN+XzWvLv3y@slm.duckdns.org>


On Fri, 06 Jan 2023 10:34:10 -1000, Tejun Heo wrote:
> Dan reports the following smatch detected the following:
> 
>   block/blk-cgroup.c:1863 blkcg_schedule_throttle() warn: sleeping in atomic context
> 
> caused by blkcg_schedule_throttle() calling blk_put_queue() in an
> non-sleepable context.
> 
> [...]

Applied, thanks!

[1/1] block: Drop spurious might_sleep() from blk_put_queue()
      commit: 49e4d04f0486117ac57a97890eb1db6d52bf82b3

Best regards,
-- 
Jens Axboe



^ permalink raw reply

* Re: [PATCHv4 0/2] block: don't forget user settings
From: Jens Axboe @ 2023-01-09  3:27 UTC (permalink / raw)
  To: linux-block, Keith Busch
  Cc: hch, martin.petersen, Damien Le Moal, Keith Busch
In-Reply-To: <20230105205146.3610282-1-kbusch@meta.com>


On Thu, 05 Jan 2023 12:51:44 -0800, Keith Busch wrote:
> If the user overrides the max sectors for their device (which is
> currently defaulting to quite a low value for modern hardware), their
> request is lost on a rescan. Save it and fix the weird type issues.
> 
> Changes since v3: Fixed the unsigned long/unsigned int issue raised by
> clang.
> 
> [...]

Applied, thanks!

[1/2] block: make BLK_DEF_MAX_SECTORS unsigned
      commit: c749b9c6de40f0882d9c83c8a3780e631603eb9d
[2/2] block: save user max_sectors limit
      commit: 39001023bf1dfcee0ad1602501fbdea6f0d3d3bb

Best regards,
-- 
Jens Axboe



^ permalink raw reply

* Re: [PATCH v2 08/13] blk-mq: simplify flush check in blk_mq_dispatch_rq_list
From: Kemeng Shi @ 2023-01-09  2:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel,
	john.garry, jack
In-Reply-To: <20230108180611.GG23466@lst.de>


Hi, Christoph, thank you so much for review.
on 1/9/2023 2:06 AM, Christoph Hellwig wrote:
> I think we need to come up with a clear rule on when commit_rqs
> needs to be called, and follow that.  In this case I'd be confused
> if there was any case where we need to call it if list was empty.
> 
After we queue request[s] to one driver queue, we need to notify driver
that there are no more request to the queue or driver will keep waiting
for the last request to be queued and IO hung could happen.
Normaly, we will notify this by setting .last in struct blk_mq_queue_data
along with the normal last request .rq in struct blk_mq_queue_data. The
extra commit is only needed if normal last information in .last is lost.
(See comment in struct blk_mq_ops for commit_rqs).

The lost could occur if error happens for sending last request with .last
set or error happen in middle of list and we even do not send the request
with .last set.

-- 
Best wishes
Kemeng Shi


^ permalink raw reply

* Re: [PATCH -next 3/4] block/rq_qos: use a global mutex to protect rq_qos apis
From: Yu Kuai @ 2023-01-09  1:38 UTC (permalink / raw)
  To: Tejun Heo, Yu Kuai
  Cc: hch, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)
In-Reply-To: <Y7hnH9GT6D469Vuu@slm.duckdns.org>

Hi,

在 2023/01/07 2:23, Tejun Heo 写道:
> Hello,
> 
> On Fri, Jan 06, 2023 at 09:33:26AM +0800, Yu Kuai wrote:
>>> wbt's lazy init is tied to one of the block device sysfs files, right? So,
>>> it *should* already be protected against device removal.
>>
>> That seems not true, I don't think q->sysfs_lock can protect that,
>> consider that queue_wb_lat_store() doesn't check if del_gendisk() is
>> called or not:
>>
>> t1: wbt lazy init		t2: remove device
>> queue_attr_store
>> 				del_gendisk
>> 				blk_unregister_queue
>> 				 mutex_lock(&q->sysfs_lock)
>> 			         ...
>> 				 mutex_unlock(&q->sysfs_lock);
>> 				rq_qos_exit
>>   mutex_lock(&q->sysfs_lock);
>>    queue_wb_lat_store
>>    wbt_init
>>     rq_qos_add
>>   mutex_unlock(&q->sysfs_lock);
> 
> So, it's not sysfs_lock but sysfs file deletion. When a kernfs, which backs
> sysfs, file is removed, it disables future operations and drains all
> inflight ones before returning, so you remove the interface files before
> cleaning up the object that it interacts with, you don't have to worry about
> racing against file operations as none can be in flight at that point.

Ok, thanks for explanation, I'll look into this and try to find out how
this works.

> 
>> I tried to comfirm that by adding following delay:
>>
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 93d9e9c9a6ea..101c33cb0a2b 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -11,6 +11,7 @@
>>   #include <linux/blktrace_api.h>
>>   #include <linux/blk-mq.h>
>>   #include <linux/debugfs.h>
>> +#include <linux/delay.h>
>>
>>   #include "blk.h"
>>   #include "blk-mq.h"
>> @@ -734,6 +735,8 @@ queue_attr_store(struct kobject *kobj, struct attribute
>> *attr,
>>          if (!entry->store)
>>                  return -EIO;
>>
>> +       msleep(10000);
>> +
>>          mutex_lock(&q->sysfs_lock);
>>          res = entry->store(q, page, length);
>>          mutex_unlock(&q->sysfs_lock);
>>
>> And then do the following test:
>>
>> 1) echo 10000 > /sys/block/sdb/queue/wbt_lat_usec &
>> 2) echo 1 > /sys/block/sda/device/delete
>>
>> Then, following bug is triggered:
>>
>> [   51.923642] BUG: unable to handle page fault for address:
>> ffffffffffffffed
>> [   51.924294] #PF: supervisor read access in kernel mode
>> [   51.924773] #PF: error_code(0x0000) - not-present page
>> [   51.925252] PGD 1820b067 P4D 1820b067 PUD 1820d067 PMD 0
>> [   51.925754] Oops: 0000 [#1] PREEMPT SMP
>> [   51.926123] CPU: 1 PID: 539 Comm: bash Tainted: G        W
>> 6.2.0-rc1-next-202212267
>> [   51.927124] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> ?-20190727_073836-b4
>> [   51.928334] RIP: 0010:__rq_qos_issue+0x30/0x60
> 
> This indicates that we aren't getting the destruction order right. It could
> be that there are other reasons why the ordering is like this and we might
> have to synchronize separately.
> 
> Sorry that I've been asking you to go round and round but block device
> add/remove paths have always been really tricky and we wanna avoid adding
> more complications if at all possible. Can you see why the device is being
> destroyed before the queue attr is removed?

Of course, I'll glad to help, I'll let you know if I have any progress.

Thanks,
Kuai


^ permalink raw reply

* Re: [PATCH v2 1/2] blk-iocost: add refcounting for iocg
From: Yu Kuai @ 2023-01-09  1:32 UTC (permalink / raw)
  To: Tejun Heo, Yu Kuai
  Cc: hch, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yukuai (C)
In-Reply-To: <Y7iCId3pnEnLqY8G@slm.duckdns.org>

Hi,

在 2023/01/07 4:18, Tejun Heo 写道:
> On Fri, Jan 06, 2023 at 09:08:45AM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2023/01/06 2:32, Tejun Heo 写道:
>>> On Thu, Jan 05, 2023 at 09:14:07AM +0800, Yu Kuai wrote:
>>>> 1) is related to blkg, while 2) is not, hence refcnting from blkg can't
>>>> fix the problem. refcnting from blkcg_policy_data should be ok, but I
>>>> see that bfq already has the similar refcnting, while other policy
>>>> doesn't require such refcnting.
>>>
>>> Hmm... taking a step back, wouldn't this be solved by moving the first part
>>> of ioc_pd_free() to pd_offline_fn()? The ordering is strictly defined there,
>>> right?
>>>
>>
>> Moving first part to pd_offline_fn() has some requirements, like what I
>> did in the other thread:
>>
>> iocg can be activated again after pd_offline_fn(), which is possible
>> because bio can be dispatched when cgroup is removed. I tried to avoid
>> that by:
>>
>> 1) dispatch all throttled bio io ioc_pd_offline()
>> 2) don't throttle bio after ioc_pd_offline()
>>
>> However, you already disagreed with that. 😔
> 
> Okay, I was completely wrong while I was replying to your original patch.
> Should have looked at the code closer, my apologies.
> 
> What I missed is that pd_offline doesn't happen when the cgroup goes
> offline. Please take a look at the following two commits:
> 
>   59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
>   d866dbf61787 ("blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it")
> 

These two commits are applied for three years, I don't check the details
yet but they seem can't guarantee that no io will be handled by
rq_qos_throttle() after pd_offline_fn(), because I just reproduced this
in another problem:

f02be9002c48 ("block, bfq: fix null pointer dereference in bfq_bio_bfqg()")

User thread can issue async io, and io can be throttled by
blk-throttle(not writeback), then user thread can exit and cgroup can be
removed before such io is dispatched to rq_qos_throttle.

> After the above two commits, ->pd_offline_fn() is called only after all
> possible writebacks are complete, so it shouldn't allow mass escapes to
> root. With writebacks out of the picture, it might be that there can be no
> further IOs once ->pd_offline_fn() is called too as there can be no tasks
> left in it and no dirty pages, but best to confirm that.
> 
> So, yeah, the original approach you took should work although I'm not sure
> the patches that you added to make offline blkg to bypass are necessary
> (that also contributed to my assumption that there will be more IOs on those
> blkg's). Have you seen more IOs coming down the pipeline after offline? If
> so, can you dump some backtraces and see where they're coming from?

Currently I'm sure such IOs can come from blk-throttle, and I'm not sure
yet but I also suspect io_uring can do this.

Thanks,
Kuai


^ permalink raw reply

* Re: [PATCHv4 2/2] block: save user max_sectors limit
From: Martin K. Petersen @ 2023-01-09  0:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, axboe, hch, martin.petersen, Damien Le Moal,
	Keith Busch
In-Reply-To: <20230105205146.3610282-3-kbusch@meta.com>


Keith,

> The user can set the max_sectors limit to any valid value via sysfs
> /sys/block/<dev>/queue/max_sectors_kb attribute. If the device limits
> are ever rescanned, though, the limit reverts back to the potentially
> artificially low BLK_DEF_MAX_SECTORS value.
>
> Preserve the user's setting as the max_sectors limit as long as it's
> valid. The user can reset back to defaults by writing 0 to the sysfs
> file.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCHv4 1/2] block: make BLK_DEF_MAX_SECTORS unsigned
From: Martin K. Petersen @ 2023-01-09  0:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, axboe, hch, martin.petersen, Damien Le Moal,
	Keith Busch
In-Reply-To: <20230105205146.3610282-2-kbusch@meta.com>


Keith,

> This is used as an unsigned value, so define it that way to avoid
> having to cast it.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH v2 11/13] blk-mq: remove unncessary from_schedule parameter in blk_mq_plug_issue_direct
From: Christoph Hellwig @ 2023-01-08 18:06 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel, hch,
	john.garry, jack
In-Reply-To: <20230104142259.2673013-12-shikemeng@huaweicloud.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v2 08/13] blk-mq: simplify flush check in blk_mq_dispatch_rq_list
From: Christoph Hellwig @ 2023-01-08 18:06 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel, hch,
	john.garry, jack
In-Reply-To: <20230104142259.2673013-9-shikemeng@huaweicloud.com>

I think we need to come up with a clear rule on when commit_rqs
needs to be called, and follow that.  In this case I'd be confused
if there was any case where we need to call it if list was empty.

^ permalink raw reply

* Re: [PATCH v2 07/13] blk-mq: remove error count and unncessary flush in blk_mq_try_issue_list_directly
From: Christoph Hellwig @ 2023-01-08 18:03 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel, hch,
	john.garry, jack
In-Reply-To: <20230104142259.2673013-8-shikemeng@huaweicloud.com>

Similar comment as for the previous patch, just complicated by the
excursion to blk_mq_request_bypass_insert.

^ permalink raw reply

* Re: [PATCH v2 06/13] blk-mq: remove unncessary error count and flush in blk_mq_plug_issue_direct
From: Christoph Hellwig @ 2023-01-08 18:02 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel, hch,
	john.garry, jack
In-Reply-To: <20230104142259.2673013-7-shikemeng@huaweicloud.com>

I'm really confused by this.  Why do we need the extra commit
anyway if the errored command has never made it to the device?

^ permalink raw reply

* Re: [PATCH v2 05/13] blk-mq: remove unnecessary list_empty check in blk_mq_try_issue_list_directly
From: Christoph Hellwig @ 2023-01-08 17:56 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel, hch,
	john.garry, jack
In-Reply-To: <20230104142259.2673013-6-shikemeng@huaweicloud.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v2 03/13] blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait
From: Christoph Hellwig @ 2023-01-08 17:55 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel, hch,
	john.garry, jack
In-Reply-To: <20230104142259.2673013-4-shikemeng@huaweicloud.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v2 02/13] blk-mq: remove stale comment for blk_mq_sched_mark_restart_hctx
From: Christoph Hellwig @ 2023-01-08 17:55 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel, hch,
	john.garry, jack
In-Reply-To: <20230104142259.2673013-3-shikemeng@huaweicloud.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v2 01/13] blk-mq: avoid sleep in blk_mq_alloc_request_hctx
From: Christoph Hellwig @ 2023-01-08 17:55 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: axboe, dwagner, hare, ming.lei, linux-block, linux-kernel, hch,
	john.garry, jack
In-Reply-To: <20230104142259.2673013-2-shikemeng@huaweicloud.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH block/for-6.2-fixes] block: Drop spurious might_sleep() from blk_put_queue()
From: Christoph Hellwig @ 2023-01-08 17:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Dan Carpenter, Ming Lei, Luis Chamberlain,
	linux-block, linux-kernel
In-Reply-To: <Y7iFwjN+XzWvLv3y@slm.duckdns.org>

I walked through everything called from blk_free_queue and can't find
anything that sleeps either, so:

Reviewed-by: Christoph Hellwig <hch@lst.de>

I wonder if we could not also revert
d578c770c85233af592e54537f93f3831bde7e9a as I think that was added
to avoid calling blk_put_queue from atomic context.

^ permalink raw reply

* Re: [PATCHv4 2/2] block: save user max_sectors limit
From: Christoph Hellwig @ 2023-01-08 17:13 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, axboe, hch, martin.petersen, Damien Le Moal,
	Keith Busch
In-Reply-To: <20230105205146.3610282-3-kbusch@meta.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCHv4 1/2] block: make BLK_DEF_MAX_SECTORS unsigned
From: Christoph Hellwig @ 2023-01-08 17:13 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, axboe, hch, martin.petersen, Damien Le Moal,
	Keith Busch
In-Reply-To: <20230105205146.3610282-2-kbusch@meta.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCHv2 00/12] iov_iter: replace import_single_range with ubuf
From: Christoph Hellwig @ 2023-01-08 17:12 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, Alexander Viro, Jens Axboe, linux-aio,
	linux-fsdevel, hch, io-uring, Pavel Begunkov, Eric Dumazet,
	Steven Rostedt, davem, Jakub Kicinski, Paolo Abeni, David Howells,
	Jarkko Sakkinen, Paul Moore, keyrings, linux-security-module,
	linux-trace-kernel, linux-block, netdev, Keith Busch
In-Reply-To: <20230105190741.2405013-1-kbusch@meta.com>

The entire series looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH 1/4] blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()
From: Christoph Hellwig @ 2023-01-08 17:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, josef, hch, linux-block, linux-kernel
In-Reply-To: <20230105212432.289569-2-tj@kernel.org>

On Thu, Jan 05, 2023 at 11:24:29AM -1000, Tejun Heo wrote:
> Holding the queue lock now implies RCU read lock, so no need to use
> rcu_read_[un]lock() explicitly. This shouldn't cause any behavior changes.

How so?

> While at it, drop __acquires() annotation on the queue lock too. The
> __acquires() part was already out of sync and it doesn't catch anything that
> lockdep can't.

This makes sparse even more unhappy than it was before.  For now
please keep the annotation.

^ permalink raw reply

* [PATCH v4] blk-throtl: Introduce sync and async queues for blk-throtl
From: Jinke Han @ 2023-01-07 13:07 UTC (permalink / raw)
  To: tj, josef, axboe; +Cc: cgroups, linux-block, linux-kernel, yinxin.x, Jinke Han

From: Jinke Han <hanjinke.666@bytedance.com>

Now we don't distinguish sync write ios from normal buffer write ios
in blk-throtl. A bio with REQ_SYNC tagged always mean it will be wait
until write completion soon after it submit. So it's reasonable for sync
io to complete as soon as possible.

In our test, fio writes a 100g file in sequential 4k blocksize in
a container with low bps limit configured (wbps=10M). More than 1200
ios were throttled in blk-throtl queue and the avarage throtle time
of each io is 140s. At the same time, the operation of saving a small
file by vim will be blocked amolst 140s. As a fsync will be send by vim,
the sync ios of fsync will be blocked by a huge amount of buffer write
ios ahead. This is also a priority inversion problem within one cgroup.
In the database scene, things got really bad with blk-throtle enabled
as fsync is called very often.

This patch splits bio queue into sync and async queues for blk-throtl
and gives a huge priority to sync write ios. Sync queue only make sense
for write ios as we treat all read io as sync io. I think it's a nice
respond to the semantics of REQ_SYNC. Bios with REQ_META and REQ_PRIO
gains the same priority as they are important to fs. This may avoid
some potential priority inversion problems.

With this patch, do the same test above, the duration of the fsync sent
by vim drops to several hundreds of milliseconds.

Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
---

Changes in v2
- Make code more simple.
Changes in v3
- Fix mismatch of waiting bio and the next dispatched bio.
- Rename dispatch_sync_cnt to disp_sync_cnt.
- Add more comments.
Changes in v4
- Improve patch in code style and code comments. 

 block/blk-throttle.c | 137 ++++++++++++++++++++++++++++++++++++++++---
 block/blk-throttle.h |  13 +++-
 2 files changed, 141 insertions(+), 9 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6fb5a2f9e1ee..cb81a957696b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,13 @@
 /* Total max dispatch from all groups in one round */
 #define THROTL_QUANTUM 32
 
+/* For write ios, dispatch 4 sync ios and 1 normal io in one loop */
+#define THROTL_SYNC_FACTOR 4
+
+/* Only make sense for write ios, all read ios are treated as SYNC */
+#define SYNC	0
+#define ASYNC	1
+
 /* Throttling is performed over a slice and after that slice is renewed */
 #define DFL_THROTL_SLICE_HD (HZ / 10)
 #define DFL_THROTL_SLICE_SSD (HZ / 50)
@@ -241,11 +248,30 @@ static inline unsigned int throtl_bio_data_size(struct bio *bio)
 	return bio->bi_iter.bi_size;
 }
 
-static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
+static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg,
+			      bool rw)
 {
 	INIT_LIST_HEAD(&qn->node);
-	bio_list_init(&qn->bios);
+	bio_list_init(&qn->bios[SYNC]);
+	bio_list_init(&qn->bios[ASYNC]);
 	qn->tg = tg;
+	qn->disp_sync_cnt = rw == READ ? UINT_MAX : 0;
+	qn->next_to_disp = NULL;
+}
+
+static inline void throtl_qnode_add_bio_list(struct throtl_qnode *qn,
+					     struct bio *bio)
+{
+	bool rw = bio_data_dir(bio);
+
+	/*
+	 * All read bios and those write bios with REQ_SYNC, REQ_META or
+	 * REQ_PRIO flags set are treated as SYNC io.
+	 */
+	if (rw == READ || bio->bi_opf & (REQ_SYNC | REQ_META | REQ_PRIO))
+		bio_list_add(&qn->bios[SYNC], bio);
+	else
+		bio_list_add(&qn->bios[ASYNC], bio);
 }
 
 /**
@@ -261,13 +287,77 @@ static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
 static void throtl_qnode_add_bio(struct bio *bio, struct throtl_qnode *qn,
 				 struct list_head *queued)
 {
-	bio_list_add(&qn->bios, bio);
+	throtl_qnode_add_bio_list(qn, bio);
 	if (list_empty(&qn->node)) {
 		list_add_tail(&qn->node, queued);
 		blkg_get(tg_to_blkg(qn->tg));
 	}
 }
 
+/**
+ * throtl_qnode_bio_list_pop: pop a bio from sync/async queue
+ * @qn: the qnode to pop a bio from
+ *
+ * For writes, target SYNC or ASYNC queue based on disp_sync_cnt. If empty,
+ * try the other queue.
+ */
+static struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn)
+{
+	struct bio *bio;
+	int from = qn->disp_sync_cnt == THROTL_SYNC_FACTOR ? ASYNC : SYNC;
+
+	bio = bio_list_pop(&qn->bios[from]);
+	if (!bio) {
+		from = 1 - from;
+		bio = bio_list_pop(&qn->bios[from]);
+	}
+
+	if (qn->disp_sync_cnt < THROTL_SYNC_FACTOR && from == SYNC)
+		qn->disp_sync_cnt++;
+	else
+		qn->disp_sync_cnt = 0;
+
+	return bio;
+}
+
+/**
+ * throtl_qnode_bio_peek - peek a bio from a qn
+ * @qn: the qnode to peek from
+ *
+ * For read, always peek bio from the SYNC queue.
+ *
+ * For write, we always peek bio from next_to_disp. If it's NULL, a bio
+ * will be popped from SYNC or ASYNC queue to fill it. The next_to_disp
+ * is used to make sure that the peeked bio and the next popped bio are
+ * always the same even in case that the spinlock of queue was released
+ * and re-holded.
+ *
+ * Without the next_to_disp, consider the following situation:
+ *
+ * Assumed that there were only bios queued in ASYNC queue and the SYNC
+ * queue was empty. The ASYNC bio was selected to dispatch and the
+ * disp_sync_cnt was set to 0 after each dispatching. If a ASYNC bio
+ * can't be dispatched because of overlimit in current slice, the process
+ * of dispatch should give up and the spin lock of the request queue
+ * may be released. A new SYNC bio may be queued in the SYNC queue then.
+ * When it's time to dispatch this tg, the SYNC bio was selected and pop
+ * to dispatch as the disp_sync_cnt is 0 and the SYNC queue is no-empty.
+ * If the dispatched bio is smaller than the waiting bio, the bandwidth
+ * may be hard to satisfied as the slice may be trimed after each dispatch.
+ */
+static inline struct bio *throtl_qnode_bio_peek(struct throtl_qnode *qn)
+{
+	/* qn for read ios */
+	if (qn->disp_sync_cnt == UINT_MAX)
+		return bio_list_peek(&qn->bios[SYNC]);
+
+	/* qn for write ios */
+	if (!qn->next_to_disp)
+		qn->next_to_disp  = throtl_qnode_bio_list_pop(qn);
+
+	return qn->next_to_disp;
+}
+
 /**
  * throtl_peek_queued - peek the first bio on a qnode list
  * @queued: the qnode list to peek
@@ -281,11 +371,42 @@ static struct bio *throtl_peek_queued(struct list_head *queued)
 		return NULL;
 
 	qn = list_first_entry(queued, struct throtl_qnode, node);
-	bio = bio_list_peek(&qn->bios);
+	bio = throtl_qnode_bio_peek(qn);
 	WARN_ON_ONCE(!bio);
 	return bio;
 }
 
+/**
+ * throtl_qnode_bio_pop: pop a bio from a qnode
+ * @qn: the qnode to pop a bio from
+ */
+static inline struct bio *throtl_qnode_bio_pop(struct throtl_qnode *qn)
+{
+	struct bio *bio;
+
+	/* qn for read ios */
+	if (qn->disp_sync_cnt == UINT_MAX)
+		return bio_list_pop(&qn->bios[SYNC]);
+
+	/* qn for write ios */
+	if (qn->next_to_disp) {
+		bio = qn->next_to_disp;
+		qn->next_to_disp = NULL;
+		return bio;
+	}
+
+	return throtl_qnode_bio_list_pop(qn);
+}
+
+static inline bool throtl_qnode_empty(struct throtl_qnode *qn)
+{
+	if (!qn->next_to_disp &&
+	    bio_list_empty(&qn->bios[SYNC]) &&
+			   bio_list_empty(&qn->bios[ASYNC]))
+		return true;
+	return false;
+}
+
 /**
  * throtl_pop_queued - pop the first bio form a qnode list
  * @queued: the qnode list to pop a bio from
@@ -310,10 +431,10 @@ static struct bio *throtl_pop_queued(struct list_head *queued,
 		return NULL;
 
 	qn = list_first_entry(queued, struct throtl_qnode, node);
-	bio = bio_list_pop(&qn->bios);
+	bio = throtl_qnode_bio_pop(qn);
 	WARN_ON_ONCE(!bio);
 
-	if (bio_list_empty(&qn->bios)) {
+	if (throtl_qnode_empty(qn)) {
 		list_del_init(&qn->node);
 		if (tg_to_put)
 			*tg_to_put = qn->tg;
@@ -355,8 +476,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp,
 	throtl_service_queue_init(&tg->service_queue);
 
 	for (rw = READ; rw <= WRITE; rw++) {
-		throtl_qnode_init(&tg->qnode_on_self[rw], tg);
-		throtl_qnode_init(&tg->qnode_on_parent[rw], tg);
+		throtl_qnode_init(&tg->qnode_on_self[rw], tg, rw);
+		throtl_qnode_init(&tg->qnode_on_parent[rw], tg, rw);
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index ef4b7a4de987..bd4fff65b6d9 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -28,8 +28,19 @@
  */
 struct throtl_qnode {
 	struct list_head	node;		/* service_queue->queued[] */
-	struct bio_list		bios;		/* queued bios */
+	struct bio_list		bios[2];	/* queued bios */
 	struct throtl_grp	*tg;		/* tg this qnode belongs to */
+
+	struct bio		*next_to_disp;	/* pinned for next to dispatch */
+	/*
+	 * 1) for write throtl_qnode:
+	 * [0, THROTL_SYNC_FACTOR-1]: dispatch sync io
+	 * [THROTL_SYNC_FACTOR]: dispatch async io
+	 *
+	 * 2) for read throtl_qnode:
+	 * UINT_MAX
+	 */
+	unsigned int disp_sync_cnt;         /* sync io dispatch counter */
 };
 
 struct throtl_service_queue {
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH block/for-6.2-fixes] block: Drop spurious might_sleep() from blk_put_queue()
From: Dan Carpenter @ 2023-01-07  8:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Tejun Heo, Jens Axboe, Christoph Hellwig, linux-block,
	linux-kernel
In-Reply-To: <Y7iIWA6h88cYjhcO@bombadil.infradead.org>

On Fri, Jan 06, 2023 at 12:45:12PM -0800, Luis Chamberlain wrote:
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Reported-by: Dan Carpenter <error27@gmail.com>
> > Link: https://lkml.kernel.org/r/Y7g3L6fntnTtOm63@kili
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Luis Chamberlain <mcgrof@kernel.org>
> > Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal") # v5.9+
> 
> *tons* has changed since e8c7d14ac6c3 and so the bots might think that
> *if* this patch is applied upstream it is justified for older kernels
> and I don't think that's yet been verified and doubt it.

If you enable the correct debug option then the might_sleep() causes a
stack trace.  Eventually syzbot will find it.

I would backport it.

regards,
dan carpenter

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox