Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH] io_uring: fix stuck problem caused by potential memory alloc failure
From: JackieLiu @ 2019-07-10 15:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block
In-Reply-To: <25691b9f-8c0f-0d19-d1db-4c4ce6dfb5a9@kernel.dk>

Thank you for pointing out. I haven't had time to pay attention to the new 
features of sqe links. When I was debugging another stuck problem, I found 
that there was no free the reference count, so I sent the patch about this.

Anyway, if necessary, I will resend the patch again, with your rewritten
comments.
 

> 在 2019年7月10日，21:57，Jens Axboe <axboe@kernel.dk> 写道：
> 
> On 7/10/19 2:35 AM, Jackie Liu wrote:
>> when io_req_defer alloc memory failed, it will be return -EAGAIN.
>> But io_submit_sqe cannot be returned immediately because the reference
>> count for req is still hold. we should be clean it.
> 
> Looks like this actually got fixed in my for-5.3/io_uring branch which
> I haven't pushed out yet. Once that's in, I'd suggest we send yours to
> stable separately. Probably change the wording to:
> 
> When io_req_defer alloc memory fails, it will be -EAGAIN. But
> io_submit_sqe cannot return immediately because the reference count for
> req is still held. Ensure that we free it.
> 
> -- 
> Jens Axboe
> 
> 




^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Danil Kipnis @ 2019-07-10 14:55 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jack Wang, linux-block, linux-rdma, axboe, Christoph Hellwig,
	Sagi Grimberg, bvanassche, jgg, dledford, Roman Pen, gregkh
In-Reply-To: <20190709110036.GQ7034@mtr-leonro.mtl.com>

Hi Leon,

thanks for the feedback!

On Tue, Jul 9, 2019 at 1:00 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > it would make sense for us to rework our patchset and try pushing it
> > for staging tree first, so that we can proof IBNBD is well maintained,
> > beneficial for the eco-system, find a proper location for it within
> > block/rdma subsystems? This would make it easier for people to try it
> > out and would also be a huge step for us in terms of maintenance
> > effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > near future). Do you think it would make sense to rename the driver to
> > RNBD/RTRS?
>
> It is better to avoid "staging" tree, because it will lack attention of
> relevant people and your efforts will be lost once you will try to move
> out of staging. We are all remembering Lustre and don't want to see it
> again.
>
> Back then, you was asked to provide support for performance superiority.

I have only theories of why ibnbd is showing better numbers than nvmeof:
1. The way we utilize the MQ framework in IBNBD. We promise to have
queue_depth (say 512) requests on each of the num_cpus hardware queues
of each device, but in fact we have only queue_depth for the whole
"session" toward a given server. The moment we have queue_depth
inflights we need stop the queue (on a device on a cpu) we get more
requests on. We need to start them again after some requests are
completed. We maintain per cpu lists of stopped HW queues, a bitmap
showing which lists are not empty, etc. to wake them up in a
round-robin fashion to avoid starvation of any devices.
2. We only do rdma writes with imm. A server reserves queue_depth of
max_io_size buffers for a given client. The client manages those
himself. Client uses imm field to tell to the server which buffer has
been written (and where) and server uses the imm field to send back
errno. If our max_io_size is 64K and queue_depth 512 and client only
issues 4K IOs all the time, then 60*512K memory is wasted. On the
other hand we do no buffer allocation/registration in io path on
server side. Server sends rdma addresses and keys to those
preregistered buffers on connection establishment and
deallocates/unregisters them when a session is closed. That's for
writes. For reads, client registers user buffers (after fr) and sends
the addresses and keys to the server (with an rdma write with imm).
Server rdma writes into those buffers. Client does the
unregistering/invalidation and completes the request.

> Can you please share any numbers with us?
Apart from github
(https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3)
the performance results for v5.2-rc3 on two different systems can be
accessed under dcd.ionos.com/ibnbd-performance-report. The page allows
to filter out test scenarios interesting for comparison.

>
> Thanks

^ permalink raw reply

* Re: [PATCH] blk-throttle: fix zero wait time for iops throttled group
From: Jens Axboe @ 2019-07-10 14:25 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-block, linux-kernel; +Cc: Liu Bo, stable
In-Reply-To: <f4be51ff-e426-cc44-db94-3c26e2cfbca9@yandex-team.ru>

On 7/10/19 8:24 AM, Konstantin Khlebnikov wrote:
> On 10.07.2019 17:00, Jens Axboe wrote:
>> On 7/8/19 9:29 AM, Konstantin Khlebnikov wrote:
>>> After commit 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops
>>> limit is enforced") wait time could be zero even if group is throttled and
>>> cannot issue requests right now. As a result throtl_select_dispatch() turns
>>> into busy-loop under irq-safe queue spinlock.
>>>
>>> Fix is simple: always round up target time to the next throttle slice.
>>
>> Applied, thanks. In the future, please break lines at 72 chars in
>> commit messages, I fixed it up.
>>
> 
> Ok, but Documentation/process/submitting-patches.rst and
> scripts/checkpatch.pl recommends 75 chars per line.

Huh, oh well. Not a big deal for me, line breaking is easily automated.


-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] blk-throttle: fix zero wait time for iops throttled group
From: Konstantin Khlebnikov @ 2019-07-10 14:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-kernel; +Cc: Liu Bo, stable
In-Reply-To: <30caacb5-4d45-016b-a97d-db8b37010218@kernel.dk>

On 10.07.2019 17:00, Jens Axboe wrote:
> On 7/8/19 9:29 AM, Konstantin Khlebnikov wrote:
>> After commit 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops
>> limit is enforced") wait time could be zero even if group is throttled and
>> cannot issue requests right now. As a result throtl_select_dispatch() turns
>> into busy-loop under irq-safe queue spinlock.
>>
>> Fix is simple: always round up target time to the next throttle slice.
> 
> Applied, thanks. In the future, please break lines at 72 chars in
> commit messages, I fixed it up.
> 

Ok, but Documentation/process/submitting-patches.rst and
scripts/checkpatch.pl recommends 75 chars per line.

^ permalink raw reply

* Re: [PATCH v3 5/5] nvme: add support weighted round robin queue
From: Weiping Zhang @ 2019-07-10 14:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Minwoo Im, keith.busch, sagi, linux-block, linux-nvme
In-Reply-To: <CAA70yB5uve6x-t56u7RcM8=JNSXh644uErC5z4m5h2C1rkSuvA@mail.gmail.com>

Weiping Zhang <zwp10758@gmail.com> 于2019年6月28日周五 下午11:57写道：
>
> Christoph Hellwig <hch@lst.de> 于2019年6月27日周四 下午7:06写道：
> >
> > On Thu, Jun 27, 2019 at 07:37:19PM +0900, Minwoo Im wrote:
> > > Hi, Maintainers
> > >
> > > Would you guys please give some thoughts about this patch?  I like this
> > > feature WRR addition to the driver so I really want to hear something
> > > from you guys.
> >
> > We are at the end of the merge window with tons of things to sort out.
> > A giant feature series with a lot of impact is not at the top of the
> > priority list right now.
>
> Hi Christoph,
>
> There are some feedback in V3, I really want to get some more feedback from you
> and other people, at that time I post V4.
>
> So please give some comments for V3 at your convenience after this merge window.
>
Hi Christoph,

Ping

^ permalink raw reply

* Re: [PATCH] blk-throttle: fix zero wait time for iops throttled group
From: Jens Axboe @ 2019-07-10 14:00 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-block, linux-kernel; +Cc: Liu Bo, stable
In-Reply-To: <156259979778.2486.6296077059654653057.stgit@buzz>

On 7/8/19 9:29 AM, Konstantin Khlebnikov wrote:
> After commit 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops
> limit is enforced") wait time could be zero even if group is throttled and
> cannot issue requests right now. As a result throtl_select_dispatch() turns
> into busy-loop under irq-safe queue spinlock.
> 
> Fix is simple: always round up target time to the next throttle slice.

Applied, thanks. In the future, please break lines at 72 chars in
commit messages, I fixed it up.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] block: Disable write plugging for zoned block devices
From: Jens Axboe @ 2019-07-10 13:59 UTC (permalink / raw)
  To: Damien Le Moal, Ming Lei
  Cc: linux-block@vger.kernel.org, Christoph Hellwig, Matias Bjorling
In-Reply-To: <BYAPR04MB5816B43AEAACBCBCEE27588AE7F00@BYAPR04MB5816.namprd04.prod.outlook.com>

On 7/9/19 8:59 PM, Damien Le Moal wrote:
> On 7/10/19 11:55 AM, Jens Axboe wrote:
>> On 7/9/19 8:29 AM, Ming Lei wrote:
>>> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>>>> Simultaneously writing to a sequential zone of a zoned block device
>>>> from multiple contexts requires mutual exclusion for BIO issuing to
>>>> ensure that writes happen sequentially. However, even for a well
>>>> behaved user correctly implementing such synchronization, BIO plugging
>>>> may interfere and result in BIOs from the different contextx to be
>>>> reordered if plugging is done outside of the mutual exclusion section,
>>>> e.g. the plug was started by a function higher in the call chain than
>>>> the function issuing BIOs.
>>>>
>>>>         Context A                           Context B
>>>>
>>>>      | blk_start_plug()
>>>>      | ...
>>>>      | seq_write_zone()
>>>>        | mutex_lock(zone)
>>>>        | submit_bio(bio-0)
>>>>        | submit_bio(bio-1)
>>>>        | mutex_unlock(zone)
>>>>        | return
>>>>      | ------------------------------> | seq_write_zone()
>>>>     				       | mutex_lock(zone)
>>>> 				       | submit_bio(bio-2)
>>>> 				       | mutex_unlock(zone)
>>>>      | <------------------------------ |
>>>>      | blk_finish_plug()
>>>>
>>>> In the above example, despite the mutex synchronization resulting in the
>>>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>>>> issued after BIO 2 when the plug is released with blk_finish_plug().
>>>
>>> I am wondering how you guarantee that context B is always run after
>>> context A.
>>>
>>>>
>>>> To fix this problem, introduce the internal helper function
>>>> blk_mq_plug() to access the current context plug, return the current
>>>> plug only if the target device is not a zoned block device or if the
>>>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>>>> return NULL, resulting is all writes to zoned block device to never be
>>>> plugged.
>>>
>>> Another candidate approach is to run the following code before
>>> releasing 'zone' lock:
>>>
>>> 	if (current->plug)
>>> 		blk_finish_plug(context->plug)
>>>
>>> Then we can fix zone specific issue in zone code only, and avoid generic
>>> blk-core change for zone issue.
>>
>> I prefer that to the existing solution as well.
> 
> My apologies, you lost me: do you mean that you prefer Ming's suggestion
> and force FS or dm users to manually unplug in the case of zoned block
> devices ?

I take that back, I thought we could do it manually in the zoned code
while dealing with the locking, but I don't think that is feasible.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH V2] block: Fix potential overflow in blk_report_zones()
From: Jens Axboe @ 2019-07-10 13:58 UTC (permalink / raw)
  To: Damien Le Moal, linux-block; +Cc: Christoph Hellwig, Matias Bjorling
In-Reply-To: <20190710045310.12397-1-damien.lemoal@wdc.com>

On 7/9/19 10:53 PM, Damien Le Moal wrote:
> For large values of the number of zones reported and/or large zone
> sizes, the sector increment calculated with
> 
> blk_queue_zone_sectors(q) * n
> 
> in blk_report_zones() loop can overflow the unsigned int type used for
> the calculation as both "n" and blk_queue_zone_sectors() value are
> unsigned int. E.g. for a device with 256 MB zones (524288 sectors),
> overflow happens with 8192 or more zones reported.
> 
> Changing the return type of blk_queue_zone_sectors() to sector_t, fixes
> this problem and avoids overflow problem for all other callers of this
> helper too. The same change is also applied to the bdev_zone_sectors()
> helper.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] io_uring: fix stuck problem caused by potential memory alloc failure
From: Jens Axboe @ 2019-07-10 13:57 UTC (permalink / raw)
  To: Jackie Liu; +Cc: linux-block
In-Reply-To: <20190710083558.5940-1-liuyun01@kylinos.cn>

On 7/10/19 2:35 AM, Jackie Liu wrote:
> when io_req_defer alloc memory failed, it will be return -EAGAIN.
> But io_submit_sqe cannot be returned immediately because the reference
> count for req is still hold. we should be clean it.

Looks like this actually got fixed in my for-5.3/io_uring branch which
I haven't pushed out yet. Once that's in, I'd suggest we send yours to
stable separately. Probably change the wording to:

When io_req_defer alloc memory fails, it will be -EAGAIN. But
io_submit_sqe cannot return immediately because the reference count for
req is still held. Ensure that we free it.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jason Gunthorpe @ 2019-07-10 13:55 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh
In-Reply-To: <a8f2f1d2-b5d9-92fc-40c8-090af0487723@grimberg.me>

On Tue, Jul 09, 2019 at 12:45:57PM -0700, Sagi Grimberg wrote:

> Another question, from what I understand from the code, the client
> always rdma_writes data on writes (with imm) from a remote pool of
> server buffers dedicated to it. Essentially all writes are immediate (no
> rdma reads ever). How is that different than using send wrs to a set of
> pre-posted recv buffers (like all others are doing)? Is it faster?

RDMA WRITE only is generally a bit faster, and if you use a buffer
pool in a smart way it is possible to get very good data packing. With
SEND the number of recvq entries dictates how big the rx buffer can
be, or you waste even more memory by using partial send buffers..

A scheme like this seems like a high performance idea, but on the
other side, I have no idea how you could possibly manage invalidations
efficiently with a shared RX buffer pool...

The RXer has to push out an invalidation for the shared buffer pool
MR, but we don't have protocols for partial MR invalidation.

Which is back to my earlier thought that the main reason this perfoms
better is because it doesn't have synchronous MR invalidation.

Maybe this is fine, but it needs to be made very clear that it uses
this insecure operating model to get higher performance..

Jason

^ permalink raw reply

* Re: [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism
From: Jens Axboe @ 2019-07-10 13:53 UTC (permalink / raw)
  To: Tejun Heo, axboe@kernel.dk
  Cc: jack@suse.cz, josef@toxicpanda.com, Chris Mason, dsterba@suse.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-btrfs@vger.kernel.org, Kernel Team
In-Reply-To: <20190627203952.386785-1-tj@kernel.org>

On 6/27/19 2:39 PM, Tejun Heo wrote:
> Hello,
> 
> This patchset contains only the block part of the following
> 
>    [1] [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
> 
> with the following changes
> 
>   * wbc_account_io() is renamed to wbc_account_cgroup_owner() and
>     wbc->no_account_io to wbc->no_cgroup_owner for clarity.
> 
> When writeback is executed asynchronously (e.g. for compression), bios
> are bounced to and issued by worker pool shared by all cgroups.  This
> leads to significant priority inversions when cgroup IO control is in
> use - IOs for a low priority cgroup can tie down the workers forcing
> higher priority IOs to wait behind them.
> 
> This patchset adds an bio punt mechanism to blkcg and updates btrfs to
> issue async IOs through it.  A bio tagged with REQ_CGROUP_PUNT flag is
> bounced to the asynchronous issue context of the associated blkcg on
> bio_submit().  As the bios are issued from per-blkcg work items,
> there's no concern for priority inversions and it doesn't require
> invasive changes to the filesystems.  The mechanism should be
> generally useful for IO control support across different filesystems.

Applied for 5.3, thanks Tejun.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH 2/5] blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
From: Jan Kara @ 2019-07-10 10:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, josef, clm, dsterba, linux-block, linux-kernel,
	linux-btrfs, kernel-team
In-Reply-To: <20190627203952.386785-3-tj@kernel.org>

On Thu 27-06-19 13:39:49, Tejun Heo wrote:
> wbc_account_io() does a very specific job - try to see which cgroup is
> actually dirtying an inode and transfer its ownership to the majority
> dirtier if needed.  The name is too generic and confusing.  Let's
> rename it to something more specific.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jan Kara <jack@suse.cz>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  Documentation/admin-guide/cgroup-v2.rst | 2 +-
>  fs/btrfs/extent_io.c                    | 4 ++--
>  fs/buffer.c                             | 2 +-
>  fs/ext4/page-io.c                       | 2 +-
>  fs/f2fs/data.c                          | 4 ++--
>  fs/fs-writeback.c                       | 8 ++++----
>  fs/mpage.c                              | 2 +-
>  include/linux/writeback.h               | 8 ++++----
>  8 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index cf88c1f98270..356a7a3dcb2f 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2108,7 +2108,7 @@ following two functions.
>  	a queue (device) has been associated with the bio and
>  	before submission.
>  
> -  wbc_account_io(@wbc, @page, @bytes)
> +  wbc_account_cgroup_owner(@wbc, @page, @bytes)
>  	Should be called for each data segment being written out.
>  	While this function doesn't care exactly when it's called
>  	during the writeback session, it's the easiest and most
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index db337e53aab3..5106008f5e28 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2911,7 +2911,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
>  			bio = NULL;
>  		} else {
>  			if (wbc)
> -				wbc_account_io(wbc, page, page_size);
> +				wbc_account_cgroup_owner(wbc, page, page_size);
>  			return 0;
>  		}
>  	}
> @@ -2924,7 +2924,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
>  	bio->bi_opf = opf;
>  	if (wbc) {
>  		wbc_init_bio(wbc, bio);
> -		wbc_account_io(wbc, page, page_size);
> +		wbc_account_cgroup_owner(wbc, page, page_size);
>  	}
>  
>  	*bio_ret = bio;
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e450c55f6434..40547bbbea94 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3093,7 +3093,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
>  
>  	if (wbc) {
>  		wbc_init_bio(wbc, bio);
> -		wbc_account_io(wbc, bh->b_page, bh->b_size);
> +		wbc_account_cgroup_owner(wbc, bh->b_page, bh->b_size);
>  	}
>  
>  	submit_bio(bio);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 4690618a92e9..56e287f5ee50 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -404,7 +404,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
>  	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
>  	if (ret != bh->b_size)
>  		goto submit_and_retry;
> -	wbc_account_io(io->io_wbc, page, bh->b_size);
> +	wbc_account_cgroup_owner(io->io_wbc, page, bh->b_size);
>  	io->io_next_block++;
>  	return 0;
>  }
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index eda4181d2092..e1cab1717ac7 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -470,7 +470,7 @@ int f2fs_submit_page_bio(struct f2fs_io_info *fio)
>  	}
>  
>  	if (fio->io_wbc && !is_read_io(fio->op))
> -		wbc_account_io(fio->io_wbc, page, PAGE_SIZE);
> +		wbc_account_cgroup_owner(fio->io_wbc, page, PAGE_SIZE);
>  
>  	bio_set_op_attrs(bio, fio->op, fio->op_flags);
>  
> @@ -537,7 +537,7 @@ void f2fs_submit_page_write(struct f2fs_io_info *fio)
>  	}
>  
>  	if (fio->io_wbc)
> -		wbc_account_io(fio->io_wbc, bio_page, PAGE_SIZE);
> +		wbc_account_cgroup_owner(fio->io_wbc, bio_page, PAGE_SIZE);
>  
>  	io->last_block_in_bio = fio->new_blkaddr;
>  	f2fs_trace_ios(fio, 0);
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a8a40bc26c2f..0aef79e934bb 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -706,7 +706,7 @@ void wbc_detach_inode(struct writeback_control *wbc)
>  EXPORT_SYMBOL_GPL(wbc_detach_inode);
>  
>  /**
> - * wbc_account_io - account IO issued during writeback
> + * wbc_account_cgroup_owner - account writeback to update inode cgroup ownership
>   * @wbc: writeback_control of the writeback in progress
>   * @page: page being written out
>   * @bytes: number of bytes being written out
> @@ -715,8 +715,8 @@ EXPORT_SYMBOL_GPL(wbc_detach_inode);
>   * controlled by @wbc.  Keep the book for foreign inode detection.  See
>   * wbc_detach_inode().
>   */
> -void wbc_account_io(struct writeback_control *wbc, struct page *page,
> -		    size_t bytes)
> +void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
> +			      size_t bytes)
>  {
>  	struct cgroup_subsys_state *css;
>  	int id;
> @@ -753,7 +753,7 @@ void wbc_account_io(struct writeback_control *wbc, struct page *page,
>  	else
>  		wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
>  }
> -EXPORT_SYMBOL_GPL(wbc_account_io);
> +EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
>  
>  /**
>   * inode_congested - test whether an inode is congested
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 436a85260394..a63620cdb73a 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -647,7 +647,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>  	 * the confused fail path above (OOM) will be very confused when
>  	 * it finds all bh marked clean (i.e. it will not write anything)
>  	 */
> -	wbc_account_io(wbc, page, PAGE_SIZE);
> +	wbc_account_cgroup_owner(wbc, page, PAGE_SIZE);
>  	length = first_unmapped << blkbits;
>  	if (bio_add_page(bio, page, length, 0) < length) {
>  		bio = mpage_bio_submit(REQ_OP_WRITE, op_flags, bio);
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 738a0c24874f..dda5cf228172 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -188,8 +188,8 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
>  				 struct inode *inode)
>  	__releases(&inode->i_lock);
>  void wbc_detach_inode(struct writeback_control *wbc);
> -void wbc_account_io(struct writeback_control *wbc, struct page *page,
> -		    size_t bytes);
> +void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
> +			      size_t bytes);
>  void cgroup_writeback_umount(void);
>  
>  /**
> @@ -291,8 +291,8 @@ static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
>  {
>  }
>  
> -static inline void wbc_account_io(struct writeback_control *wbc,
> -				  struct page *page, size_t bytes)
> +static inline void wbc_account_cgroup_owner(struct writeback_control *wbc,
> +					    struct page *page, size_t bytes)
>  {
>  }
>  
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] blk-throttle: fix zero wait time for iops throttled group
From: Konstantin Khlebnikov @ 2019-07-10 10:42 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-kernel; +Cc: Liu Bo, Stable, cgroups
In-Reply-To: <156259979778.2486.6296077059654653057.stgit@buzz>

On 08.07.2019 18:29, Konstantin Khlebnikov wrote:
> After commit 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops
> limit is enforced") wait time could be zero even if group is throttled and
> cannot issue requests right now. As a result throtl_select_dispatch() turns
> into busy-loop under irq-safe queue spinlock.

To be clear: this almost instantly kills entire machine - other cpus stuck at sending ipi.

> 
> Fix is simple: always round up target time to the next throttle slice.
> 
> Fixes: 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops limit is enforced")
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: stable@vger.kernel.org # v4.19+
> ---
>   block/blk-throttle.c |    9 +++------
>   1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 9ea7c0ecad10..8ab6c8153223 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -881,13 +881,10 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
>   	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
>   	u64 tmp;
>   
> -	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
> -
> -	/* Slice has just started. Consider one slice interval */
> -	if (!jiffy_elapsed)
> -		jiffy_elapsed_rnd = tg->td->throtl_slice;
> +	jiffy_elapsed = jiffies - tg->slice_start[rw];
>   
> -	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
> +	/* Round up to the next throttle slice, wait time must be nonzero */
> +	jiffy_elapsed_rnd = roundup(jiffy_elapsed + 1, tg->td->throtl_slice);
>   
>   	/*
>   	 * jiffy_elapsed_rnd should not be a big value as minimum iops can be
> 

^ permalink raw reply

* [PATCH] io_uring: fix stuck problem caused by potential memory alloc failure
From: Jackie Liu @ 2019-07-10  8:35 UTC (permalink / raw)
  To: axboe; +Cc: Jackie Liu, linux-block

when io_req_defer alloc memory failed, it will be return -EAGAIN.
But io_submit_sqe cannot be returned immediately because the reference
count for req is still hold. we should be clean it.

Fixes: de0617e46717 ("io_uring: add support for marking commands as draining")
Cc: linux-block@vger.kernel.org
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
---
 fs/io_uring.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4ef62a45045d..1c388533cdc8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1843,8 +1843,8 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 	ret = io_req_defer(ctx, req, s->sqe);
 	if (ret) {
 		if (ret == -EIOCBQUEUED)
-			ret = 0;
-		return ret;
+			return 0;
+		goto out;
 	}
 
 	ret = __io_submit_sqe(ctx, req, s, true);
-- 
2.22.0




^ permalink raw reply related

* [PATCH V2] block: Fix potential overflow in blk_report_zones()
From: Damien Le Moal @ 2019-07-10  4:53 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Christoph Hellwig, Matias Bjorling

For large values of the number of zones reported and/or large zone
sizes, the sector increment calculated with

blk_queue_zone_sectors(q) * n

in blk_report_zones() loop can overflow the unsigned int type used for
the calculation as both "n" and blk_queue_zone_sectors() value are
unsigned int. E.g. for a device with 256 MB zones (524288 sectors),
overflow happens with 8192 or more zones reported.

Changing the return type of blk_queue_zone_sectors() to sector_t, fixes
this problem and avoids overflow problem for all other callers of this
helper too. The same change is also applied to the bdev_zone_sectors()
helper.

Fixes: e76239a3748c ("block: add a report_zones method")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-zoned.c      | 2 +-
 include/linux/blkdev.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 79ad269b545d..6c503824ba3f 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -73,7 +73,7 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
 static inline unsigned int __blkdev_nr_zones(struct request_queue *q,
 					     sector_t nr_sectors)
 {
-	unsigned long zone_sectors = blk_queue_zone_sectors(q);
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
 
 	return (nr_sectors + zone_sectors - 1) >> ilog2(zone_sectors);
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5ace0bb77213..760668ed555a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -686,7 +686,7 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
 	}
 }
 
-static inline unsigned int blk_queue_zone_sectors(struct request_queue *q)
+static inline sector_t blk_queue_zone_sectors(struct request_queue *q)
 {
 	return blk_queue_is_zoned(q) ? q->limits.chunk_sectors : 0;
 }
@@ -1434,7 +1434,7 @@ static inline bool bdev_is_zoned(struct block_device *bdev)
 	return false;
 }
 
-static inline unsigned int bdev_zone_sectors(struct block_device *bdev)
+static inline sector_t bdev_zone_sectors(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
 
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH] block: Disable write plugging for zoned block devices
From: Damien Le Moal @ 2019-07-10  4:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block@vger.kernel.org, Jens Axboe, Christoph Hellwig,
	Matias Bjorling
In-Reply-To: <20190710031003.GB2621@ming.t460p>

On 7/10/19 12:10 PM, Ming Lei wrote:
> On Tue, Jul 09, 2019 at 02:47:12PM +0000, Damien Le Moal wrote:
>> Hi Ming,
>>
>> On 2019/07/09 23:29, Ming Lei wrote:
>>> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>>>> Simultaneously writing to a sequential zone of a zoned block device
>>>> from multiple contexts requires mutual exclusion for BIO issuing to
>>>> ensure that writes happen sequentially. However, even for a well
>>>> behaved user correctly implementing such synchronization, BIO plugging
>>>> may interfere and result in BIOs from the different contextx to be
>>>> reordered if plugging is done outside of the mutual exclusion section,
>>>> e.g. the plug was started by a function higher in the call chain than
>>>> the function issuing BIOs.
>>>>
>>>>       Context A                           Context B
>>>>
>>>>    | blk_start_plug()
>>>>    | ...
>>>>    | seq_write_zone()
>>>>      | mutex_lock(zone)
>>>>      | submit_bio(bio-0)
>>>>      | submit_bio(bio-1)
>>>>      | mutex_unlock(zone)
>>>>      | return
>>>>    | ------------------------------> | seq_write_zone()
>>>>   				       | mutex_lock(zone)
>>>> 				       | submit_bio(bio-2)
>>>> 				       | mutex_unlock(zone)
>>>>    | <------------------------------ |
>>>>    | blk_finish_plug()
>>>>
>>>> In the above example, despite the mutex synchronization resulting in the
>>>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>>>> issued after BIO 2 when the plug is released with blk_finish_plug().
>>>
>>> I am wondering how you guarantee that context B is always run after
>>> context A.
>>
>> My example was a little too oversimplified. Think of a file system allocating
>> blocks sequentially and issuing page I/Os for the allocated blocks sequentially.
>> The typical sequence is:
>>
>> mutex_lock(zone)
>> alloc_block_extent(zone)
>> for all blocks in the extent
>> 	submit_bio()
>> mutex_unlock(zone)
>>
>> This way, it does not matter which context gets the lock first, all write BIOs
>> for the zone remain sequential. The problem with plugs as explained above is
> 
> But wrt. the example in the commit log, it does matter which context gets the lock
> first, and it implies that context A has to run seq_write_zone() first,
> because you mentioned bio-2 has to be issued after bio-0 and bio-1.
> 
> If there is 3rd context which is holding the lock, then either context A or
> context B can win in getting the lock first. So looks the zone lock itself
> isn't enough for maintaining the IO order. But that may not be related
> with this patch.

For a raw block device driver, the zone lock is enough to maintain
sequential write sequence. This is not visible in my example, because it
is too simplistic. My apologies for the confusion.

The reason is that the target sector of any zone write BIO must always
be set to the end sector of the last issued write BIO for the zone. A
more detailed and correct typical sequence for writing to a zone for a
raw block device driver (e.g. a dm target) is:

seq_write_zone() {

	mutex_lock(zone)

	/* bio-0 */
	bio = bio_alloc()
	bio->bi_iter.bi_sector = zone->wp
	zone->wp += bio_sectors(bio)
	submit_bio(bio)

	/* bio-1 */
	bio = bio_alloc()
	bio->bi_iter.bi_sector = zone->wp
	zone->wp += bio_sectors(bio)
	submit_bio(bio)

	...

	mutex_unlock(zone)
}

Doing so, multiple contexts serialized with the zone mutex can keep
writing sequentially, no matter the number of BIOs they issue and no
matter the order in which they grab the zone lock. Note that here, the
zone write pointer is a "soft" write pointer, not the actual device
managed write pointer, because this latter WP is updated only on
completion of the write commands, so visible to the host only on
completion of the write BIOs. The "soft" write pointer is thus always
equal to or in advance of the device hard WP. The soft WP must be
re-synced to the hard WP in case of failed writes.

For a file system, the zone hard WP is used as a starting point for
block allocation. BIO issuing can then simply use the allocated extent
sector directly instead of the zone soft write pointer. The block
allocation code will manage the zone soft WP and do the resync with the
device hard WP in case of write error.

> Also seems there is issue with REQ_NOWAIT for zone support, for example,
> context A may see out-of-request and return earlier, however context B
> may get request and move on.

Yes, but context B will move on from the last successfully written
sector so sequential writes can still go on. It is the responsibility of
the user code to deal with failed writes and how to recover from them.

If REQ_NOWAIT is used for a BIO and causes submit_bio() to fail (
BLK_QC_T_NONE returned) in one context, that context may retry until it
succeeds and increment the soft WP or bail out without incrementing the
zone soft WP. In both cases, other contexts may simply resume trying to
write from the still valid soft WP. Any number of methods exist for
dealing with this. All the responsibility of the user (fs or dm) because
sequential write issuing must in the first place be guaranteed by the
users. In this regard, the generic block layer is fine.

>> that if the plug start/finish is not within the zone lock, reordering can happen
>> for the 2 sequences of BIOs issued by the 2 contexts.
>>
>> We hit this problem with btrfs writepages (page writeback) where plugging is
>> done before the above sequence execution, in the caller function of the page
>> writeback processing, resulting in unaligned write errors.
>>
>>>> To fix this problem, introduce the internal helper function
>>>> blk_mq_plug() to access the current context plug, return the current
>>>> plug only if the target device is not a zoned block device or if the
>>>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>>>> return NULL, resulting is all writes to zoned block device to never be
>>>> plugged.
>>>
>>> Another candidate approach is to run the following code before
>>> releasing 'zone' lock:
>>>
>>> 	if (current->plug)
>>> 		blk_finish_plug(context->plug)
>>>
>>> Then we can fix zone specific issue in zone code only, and avoid generic
>>> blk-core change for zone issue.
>>
>> Yes indeed, that would work too. But this patch is precisely to avoid having to
>> add such code and simplify implementing support for zoned block device in
>> existing code. Furthermore, plugging for writes to sequential zones has no real
>> value because mq-deadline will dispatch at most one write per zone. So writes
>> for a single zone tend to accumulate in the scheduler queue, and that creates
>> plenty of opportunities for merging small sequential writes (e.g. file system
>> page BIOs).
>>
>> If you think this patch is really not appropriate, we can still address the
>> problem case by case in the support we add for zoned devices. But again, a
>> generic solution makes things simpler I think.
> 
> OK, then I am fine with this simple generic approach.

Thanks.

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply

* [PATCH 3/3] block: set ioprio for write-zeroes, discard etc
From: Chaitanya Kulkarni @ 2019-07-10  4:02 UTC (permalink / raw)
  To: linux-block; +Cc: Chaitanya Kulkarni, axboe, hch
In-Reply-To: <20190710040224.12342-1-chaitanya.kulkarni@wdc.com>

Set the current process's iopriority for discard, write-zeroes and
write-same operations.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 block/blk-lib.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..efc9a1bf5262 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -7,6 +7,7 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/scatterlist.h>
+#include <linux/ioprio.h>
 
 #include "blk.h"
 
@@ -64,6 +65,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
 		bio_set_op_attrs(bio, op, 0);
+		bio_set_prio(bio, get_current_ioprio());
 
 		bio->bi_iter.bi_size = req_sects << 9;
 		sector += req_sects;
@@ -162,6 +164,7 @@ static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		bio->bi_io_vec->bv_offset = 0;
 		bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);
 		bio_set_op_attrs(bio, REQ_OP_WRITE_SAME, 0);
+		bio_set_prio(bio, get_current_ioprio());
 
 		if (nr_sects > max_write_same_sectors) {
 			bio->bi_iter.bi_size = max_write_same_sectors << 9;
@@ -234,6 +237,8 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
 		bio->bi_opf = REQ_OP_WRITE_ZEROES;
+		bio_set_prio(bio, get_current_ioprio());
+
 		if (flags & BLKDEV_ZERO_NOUNMAP)
 			bio->bi_opf |= REQ_NOUNMAP;
 
@@ -286,6 +291,7 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
 		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+		bio_set_prio(bio, get_current_ioprio());
 
 		while (nr_sects != 0) {
 			sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
-- 
2.17.0


^ permalink raw reply related

* [PATCH 2/3] block: set ioprio for flush bio
From: Chaitanya Kulkarni @ 2019-07-10  4:02 UTC (permalink / raw)
  To: linux-block; +Cc: Chaitanya Kulkarni, axboe, hch
In-Reply-To: <20190710040224.12342-1-chaitanya.kulkarni@wdc.com>

Set the current process's iopriority for flush bio.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 block/blk-flush.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index aedd9320e605..21013fcdf42c 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -69,6 +69,7 @@
 #include <linux/blkdev.h>
 #include <linux/gfp.h>
 #include <linux/blk-mq.h>
+#include <linux/ioprio.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -445,6 +446,7 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 	bio = bio_alloc(gfp_mask, 0);
 	bio_set_dev(bio, bdev);
 	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+	bio_set_prio(bio, get_current_ioprio());
 
 	ret = submit_bio_wait(bio);
 
-- 
2.17.0


^ permalink raw reply related

* [PATCH 1/3] block: set ioprio for zone-reset bio
From: Chaitanya Kulkarni @ 2019-07-10  4:02 UTC (permalink / raw)
  To: linux-block; +Cc: Chaitanya Kulkarni, axboe, hch
In-Reply-To: <20190710040224.12342-1-chaitanya.kulkarni@wdc.com>

Set the current process's iopriority to the bio for REQ_OP_ZONE_RESET.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 block/blk-zoned.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ae7e91bd0618..0bd7981e9535 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -249,6 +249,7 @@ int blkdev_reset_zones(struct block_device *bdev,
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
 		bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0);
+		bio_set_prio(bio, get_current_ioprio());
 
 		sector += zone_sectors;
 
-- 
2.17.0


^ permalink raw reply related

* [PATCH 0/3] block: set ioprio value
From: Chaitanya Kulkarni @ 2019-07-10  4:02 UTC (permalink / raw)
  To: linux-block; +Cc: Chaitanya Kulkarni, axboe, hch

Hi,

While experimenting with the ionice, I came across the scenario where
I cannot see the bio/request priority field being set when using
blkdiscard for REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES and other operations
like REQ_OP_WRITE_SAME, REQ_OP_ZONE_RESET, and flush.

This is a small patch-series which sets the ioprio value at various
places for the zone-reset, flush, write-zeroes and discard operations.
This patch series uses get_current_ioprio() so that bio associated
with the respective operation can inherit the value from current
process.

In order to test this, I've modified the null_blk and enabled the 
write_zeroes through module param. Following are the results.

Without these patches:-  

# modprobe null_blk gb=5 nr_devices=1 write_zeroes=1
# for prio in `seq 0 3`; do ionice -c ${prio} blkdiscard -z -o 0 -l 4096 /dev/nullb0; done
# dmesg  -c 
[  402.958458] null_handle_cmd REQ_OP_WRITE_ZEROES priority class 0
[  402.966024] null_handle_cmd REQ_OP_WRITE_ZEROES priority class 0
[  402.973960] null_handle_cmd REQ_OP_WRITE_ZEROES priority class 0
[  402.981373] null_handle_cmd REQ_OP_WRITE_ZEROES priority class 0
# 

With these patches:-

# modprobe null_blk gb=5 nr_devices=1 write_zeroes=1
# for prio in `seq 0 3`; do ionice -c ${prio} blkdiscard -z -o 0 -l 4096 /dev/nullb0; done 
# dmesg  -c 
[ 1426.091772] null_handle_cmd REQ_OP_WRITE_ZEROES priority class 0
[ 1426.100177] null_handle_cmd REQ_OP_WRITE_ZEROES priority class 1
[ 1426.108035] null_handle_cmd REQ_OP_WRITE_ZEROES priority class 2
[ 1426.115768] null_handle_cmd REQ_OP_WRITE_ZEROES priority class 3
# 

With the block trace extension support is being worked on [1] which has
the ability to report the I/O priority, now we can track the correct
priority values with this patch-series.

Regards,
Chaitanya

[1] https://www.spinics.net/lists/linux-btrace/msg00880.html

Chaitanya Kulkarni (3):
  block: set ioprio for zone-reset bio
  block: set ioprio for flush bio
  block: set ioprio for write-zeroes, discard etc

 block/blk-flush.c | 2 ++
 block/blk-lib.c   | 6 ++++++
 block/blk-zoned.c | 1 +
 3 files changed, 9 insertions(+)

-- 
2.17.0

^ permalink raw reply

* Re: [PATCH] block: Disable write plugging for zoned block devices
From: Ming Lei @ 2019-07-10  3:10 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block@vger.kernel.org, Jens Axboe, Christoph Hellwig,
	Matias Bjorling
In-Reply-To: <BYAPR04MB58162A2087D53F27BAF8176BE7F10@BYAPR04MB5816.namprd04.prod.outlook.com>

On Tue, Jul 09, 2019 at 02:47:12PM +0000, Damien Le Moal wrote:
> Hi Ming,
> 
> On 2019/07/09 23:29, Ming Lei wrote:
> > On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
> >> Simultaneously writing to a sequential zone of a zoned block device
> >> from multiple contexts requires mutual exclusion for BIO issuing to
> >> ensure that writes happen sequentially. However, even for a well
> >> behaved user correctly implementing such synchronization, BIO plugging
> >> may interfere and result in BIOs from the different contextx to be
> >> reordered if plugging is done outside of the mutual exclusion section,
> >> e.g. the plug was started by a function higher in the call chain than
> >> the function issuing BIOs.
> >>
> >>       Context A                           Context B
> >>
> >>    | blk_start_plug()
> >>    | ...
> >>    | seq_write_zone()
> >>      | mutex_lock(zone)
> >>      | submit_bio(bio-0)
> >>      | submit_bio(bio-1)
> >>      | mutex_unlock(zone)
> >>      | return
> >>    | ------------------------------> | seq_write_zone()
> >>   				       | mutex_lock(zone)
> >> 				       | submit_bio(bio-2)
> >> 				       | mutex_unlock(zone)
> >>    | <------------------------------ |
> >>    | blk_finish_plug()
> >>
> >> In the above example, despite the mutex synchronization resulting in the
> >> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
> >> issued after BIO 2 when the plug is released with blk_finish_plug().
> > 
> > I am wondering how you guarantee that context B is always run after
> > context A.
> 
> My example was a little too oversimplified. Think of a file system allocating
> blocks sequentially and issuing page I/Os for the allocated blocks sequentially.
> The typical sequence is:
> 
> mutex_lock(zone)
> alloc_block_extent(zone)
> for all blocks in the extent
> 	submit_bio()
> mutex_unlock(zone)
> 
> This way, it does not matter which context gets the lock first, all write BIOs
> for the zone remain sequential. The problem with plugs as explained above is

But wrt. the example in the commit log, it does matter which context gets the lock
first, and it implies that context A has to run seq_write_zone() first,
because you mentioned bio-2 has to be issued after bio-0 and bio-1.

If there is 3rd context which is holding the lock, then either context A or
context B can win in getting the lock first. So looks the zone lock itself
isn't enough for maintaining the IO order. But that may not be related
with this patch.

Also seems there is issue with REQ_NOWAIT for zone support, for example,
context A may see out-of-request and return earlier, however context B
may get request and move on.

> that if the plug start/finish is not within the zone lock, reordering can happen
> for the 2 sequences of BIOs issued by the 2 contexts.
> 
> We hit this problem with btrfs writepages (page writeback) where plugging is
> done before the above sequence execution, in the caller function of the page
> writeback processing, resulting in unaligned write errors.
> 
> >> To fix this problem, introduce the internal helper function
> >> blk_mq_plug() to access the current context plug, return the current
> >> plug only if the target device is not a zoned block device or if the
> >> BIO to be plugged not a write operation. Otherwise, ignore the plug and
> >> return NULL, resulting is all writes to zoned block device to never be
> >> plugged.
> > 
> > Another candidate approach is to run the following code before
> > releasing 'zone' lock:
> > 
> > 	if (current->plug)
> > 		blk_finish_plug(context->plug)
> > 
> > Then we can fix zone specific issue in zone code only, and avoid generic
> > blk-core change for zone issue.
> 
> Yes indeed, that would work too. But this patch is precisely to avoid having to
> add such code and simplify implementing support for zoned block device in
> existing code. Furthermore, plugging for writes to sequential zones has no real
> value because mq-deadline will dispatch at most one write per zone. So writes
> for a single zone tend to accumulate in the scheduler queue, and that creates
> plenty of opportunities for merging small sequential writes (e.g. file system
> page BIOs).
> 
> If you think this patch is really not appropriate, we can still address the
> problem case by case in the support we add for zoned devices. But again, a
> generic solution makes things simpler I think.

OK, then I am fine with this simple generic approach.

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH] block: Disable write plugging for zoned block devices
From: Damien Le Moal @ 2019-07-10  2:59 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block@vger.kernel.org, Christoph Hellwig, Matias Bjorling
In-Reply-To: <341defc6-128e-3b18-9468-951d311e0993@kernel.dk>

On 7/10/19 11:55 AM, Jens Axboe wrote:
> On 7/9/19 8:29 AM, Ming Lei wrote:
>> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>>> Simultaneously writing to a sequential zone of a zoned block device
>>> from multiple contexts requires mutual exclusion for BIO issuing to
>>> ensure that writes happen sequentially. However, even for a well
>>> behaved user correctly implementing such synchronization, BIO plugging
>>> may interfere and result in BIOs from the different contextx to be
>>> reordered if plugging is done outside of the mutual exclusion section,
>>> e.g. the plug was started by a function higher in the call chain than
>>> the function issuing BIOs.
>>>
>>>        Context A                           Context B
>>>
>>>     | blk_start_plug()
>>>     | ...
>>>     | seq_write_zone()
>>>       | mutex_lock(zone)
>>>       | submit_bio(bio-0)
>>>       | submit_bio(bio-1)
>>>       | mutex_unlock(zone)
>>>       | return
>>>     | ------------------------------> | seq_write_zone()
>>>    				       | mutex_lock(zone)
>>> 				       | submit_bio(bio-2)
>>> 				       | mutex_unlock(zone)
>>>     | <------------------------------ |
>>>     | blk_finish_plug()
>>>
>>> In the above example, despite the mutex synchronization resulting in the
>>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>>> issued after BIO 2 when the plug is released with blk_finish_plug().
>>
>> I am wondering how you guarantee that context B is always run after
>> context A.
>>
>>>
>>> To fix this problem, introduce the internal helper function
>>> blk_mq_plug() to access the current context plug, return the current
>>> plug only if the target device is not a zoned block device or if the
>>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>>> return NULL, resulting is all writes to zoned block device to never be
>>> plugged.
>>
>> Another candidate approach is to run the following code before
>> releasing 'zone' lock:
>>
>> 	if (current->plug)
>> 		blk_finish_plug(context->plug)
>>
>> Then we can fix zone specific issue in zone code only, and avoid generic
>> blk-core change for zone issue.
> 
> I prefer that to the existing solution as well.

My apologies, you lost me: do you mean that you prefer Ming's suggestion
and force FS or dm users to manually unplug in the case of zoned block
devices ?

Thanks.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply

* Re: [PATCH] block: Disable write plugging for zoned block devices
From: Jens Axboe @ 2019-07-10  2:55 UTC (permalink / raw)
  To: Ming Lei, Damien Le Moal; +Cc: linux-block, Christoph Hellwig, Matias Bjorling
In-Reply-To: <20190709142915.GA30082@ming.t460p>

On 7/9/19 8:29 AM, Ming Lei wrote:
> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>> Simultaneously writing to a sequential zone of a zoned block device
>> from multiple contexts requires mutual exclusion for BIO issuing to
>> ensure that writes happen sequentially. However, even for a well
>> behaved user correctly implementing such synchronization, BIO plugging
>> may interfere and result in BIOs from the different contextx to be
>> reordered if plugging is done outside of the mutual exclusion section,
>> e.g. the plug was started by a function higher in the call chain than
>> the function issuing BIOs.
>>
>>        Context A                           Context B
>>
>>     | blk_start_plug()
>>     | ...
>>     | seq_write_zone()
>>       | mutex_lock(zone)
>>       | submit_bio(bio-0)
>>       | submit_bio(bio-1)
>>       | mutex_unlock(zone)
>>       | return
>>     | ------------------------------> | seq_write_zone()
>>    				       | mutex_lock(zone)
>> 				       | submit_bio(bio-2)
>> 				       | mutex_unlock(zone)
>>     | <------------------------------ |
>>     | blk_finish_plug()
>>
>> In the above example, despite the mutex synchronization resulting in the
>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>> issued after BIO 2 when the plug is released with blk_finish_plug().
> 
> I am wondering how you guarantee that context B is always run after
> context A.
> 
>>
>> To fix this problem, introduce the internal helper function
>> blk_mq_plug() to access the current context plug, return the current
>> plug only if the target device is not a zoned block device or if the
>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>> return NULL, resulting is all writes to zoned block device to never be
>> plugged.
> 
> Another candidate approach is to run the following code before
> releasing 'zone' lock:
> 
> 	if (current->plug)
> 		blk_finish_plug(context->plug)
> 
> Then we can fix zone specific issue in zone code only, and avoid generic
> blk-core change for zone issue.

I prefer that to the existing solution as well.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] blk-cgroup: turn on psi memstall stuff
From: Jens Axboe @ 2019-07-10  2:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-block, linux-kernel, kernel-team, Josef Bacik
In-Reply-To: <20190709214129.GK657710@devbig004.ftw2.facebook.com>

On 7/9/19 3:41 PM, Tejun Heo wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> With the psi stuff in place we can use the memstall flag to indicate
> pressure that happens from throttling.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] block: Disable write plugging for zoned block devices
From: Damien Le Moal @ 2019-07-10  2:36 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org, Jens Axboe
  Cc: Christoph Hellwig, Matias Bjorling
In-Reply-To: <29cfbb53-36c9-1e03-183d-572223eb01f3@acm.org>

On 7/10/19 11:32 AM, Bart Van Assche wrote:
> On 7/9/19 2:02 AM, Damien Le Moal wrote:
>> +static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
>> +					   struct bio *bio)
>> +{
>> +	struct blk_plug *plug = current->plug;
>> +
>> +	if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
>> +		return plug;
>> +
>> +	/* Zoned block device write case: do not plug the BIO */
>> +	return NULL;
>> +}
> 
> Can the 'plug' variable be left out from this function and can 'return 
> plug' be changed into 'return current->plug'? Anyway:

Sure, that would be cleaner. Will Send a V2.

> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Thanks. Can I add this to the V2 or would you prefer to see the revised
patch first ?

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox