public inbox for linux-scsi@vger.kernel.org
From: Hannes Reinecke <hare@suse.de>
To: Damien Le Moal <dlemoal@kernel.org>,
	linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
	linux-scsi@vger.kernel.org,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	dm-devel@lists.linux.dev, Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v2 07/28] block: Introduce zone write plugging
Date: Wed, 27 Mar 2024 08:18:17 +0100
Message-ID: <1e6b4eef-dee8-49cc-97e6-a798d3fdb1fb@suse.de>
In-Reply-To: <20240325044452.3125418-8-dlemoal@kernel.org>

On 3/25/24 05:44, Damien Le Moal wrote:
> Zone write plugging implements a per-zone "plug" for write operations
> to control the submission and execution order of write operations to
> sequential write required zones of a zoned block device. Per-zone
> plugging guarantees that at any time there is at most one write
> request per zone being executed. This mechanism is intended to replace
> zone write locking which implements a similar per-zone write throttling
> at the scheduler level, but is implemented only by mq-deadline.
> 
> Unlike zone write locking which operates on requests, zone write
> plugging operates on BIOs. A zone write plug is simply a BIO list that
> is atomically manipulated using a spinlock and a kblockd submission
> work. A write BIO to a zone is "plugged" to delay its execution if a
> write BIO for the same zone was already issued, that is, if a write
> request for the same zone is being executed. The next plugged BIO is
> unplugged and issued once the write request completes.
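
The plug/unplug behaviour described above can be sketched in user-space C.
This is only a toy model under stated assumptions, not the code from the
patch: the names toy_zone_plug, toy_plug_write() and toy_complete_write()
are hypothetical, integer ids stand in for BIOs, and a fixed-size FIFO
replaces the spinlock-protected BIO list and the kblockd work.

```c
/* User-space illustration only; not the kernel code from this patch. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PLUG_MAX 8	/* hypothetical capacity of the toy FIFO */

struct toy_zone_plug {
	bool write_in_flight;		/* a write to this zone is executing */
	int pending[TOY_PLUG_MAX];	/* FIFO of plugged write ids */
	size_t head;			/* index of the oldest plugged write */
	size_t count;			/* number of plugged writes */
};

/*
 * Returns false if the write can be issued immediately, true if it was
 * plugged (delayed) because another write to the zone is in flight.
 */
bool toy_plug_write(struct toy_zone_plug *zp, int id)
{
	if (!zp->write_in_flight) {
		zp->write_in_flight = true;
		return false;
	}
	assert(zp->count < TOY_PLUG_MAX);
	zp->pending[(zp->head + zp->count) % TOY_PLUG_MAX] = id;
	zp->count++;
	return true;
}

/*
 * Called when the in-flight write completes. Returns the id of the next
 * write to issue, or -1 if nothing is plugged and the zone becomes idle.
 */
int toy_complete_write(struct toy_zone_plug *zp)
{
	int id;

	assert(zp->write_in_flight);
	if (zp->count == 0) {
		zp->write_in_flight = false;
		return -1;
	}
	id = zp->pending[zp->head];
	zp->head = (zp->head + 1) % TOY_PLUG_MAX;
	zp->count--;
	return id;	/* the zone still has exactly one write in flight */
}
```

In this model, three back-to-back writes to the same zone result in the
first one being issued and the other two being plugged; each completion
then issues the next plugged write in submission order.
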
> 
> This mechanism has the following benefits:
>   - Zone write ordering is untangled from block IO schedulers. This allows
>     removing the restriction on using mq-deadline for writing to zoned
>     block devices. Any block IO scheduler, including "none" can be used.
>   - Zone write plugging operates on BIOs instead of requests. Plugged
>     BIOs waiting for execution thus do not hold scheduling tags and thus
>     are not preventing other BIOs from executing (reads or writes to
>     other zones). Depending on the workload, this can significantly
>     improve the device use (higher queue depth operation) and
>     performance.
>   - Both blk-mq (request based) zoned devices and BIO-based zoned devices
>     (e.g. device mapper) can use zone write plugging. It is mandatory
>     for the former but optional for the latter. BIO-based drivers can
>     use zone write plugging to implement write ordering guarantees, or
>     the drivers can implement their own if needed.
>   - The code is less invasive in the block layer and is mostly limited to
>     blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
>     bio.c.
> 
> Zone write plugging is implemented using struct blk_zone_wplug. This
> structure includes a spinlock, a BIO list and a work structure to
> handle the submission of plugged BIOs. Zone write plug structures are
> managed using a per-disk hash table.
> 
> Plugging of zone write BIOs is done using the function
> blk_zone_write_plug_bio() which returns false if a BIO execution does
> not need to be delayed and true otherwise. This function is called
> from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
> spanning multiple zones which would cause mishandling of zone write
> plugs. This change enables zone write plugging by default for any mq
> request-based block device. BIO-based device drivers can also use zone
> write plugging by explicitly calling blk_zone_write_plug_bio() in their
> ->submit_bio method. For such devices, the driver must ensure that a
> BIO passed to blk_zone_write_plug_bio() is already split and not
> straddling zone boundaries.
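
The "not straddling zone boundaries" requirement can be illustrated with
a small user-space check. This is a sketch under assumptions, not the
helper from this series: toy_straddles_zones() is a hypothetical name,
all zones are assumed to have the same size, and plain integers stand in
for the BIO start sector and length.

```c
/* User-space illustration only; not the kernel helper from this series. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the sector range [start, start + nr_sectors) touches
 * more than one zone, assuming all zones are zone_sectors long.
 */
bool toy_straddles_zones(uint64_t start, uint32_t nr_sectors,
			 uint64_t zone_sectors)
{
	assert(zone_sectors > 0 && nr_sectors > 0);
	/* Compare the zone index of the first and of the last sector. */
	return start / zone_sectors !=
	       (start + nr_sectors - 1) / zone_sectors;
}
```

A BIO-based driver following the rule above would split any write for
which such a check is true before handing the pieces to the plugging
code.
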
> 
> Only write and write zeroes BIOs are plugged. Zone write plugging does
> not introduce any significant overhead for other operations. A BIO that
> is being handled through zone write plugging is flagged using the new
> BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
> this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
> The completion of flagged BIOs and requests triggers calls to the
> functions blk_zone_write_plug_bio_endio() and
> blk_zone_write_plug_complete_request(), respectively. The latter
> function is used to trigger submission of the next plugged BIO using
> the zone plug work. blk_zone_write_plug_bio_endio() does the same for
> BIO-based devices.
> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) is being executed for any zone. The
> handling of zone write plugs using a per-zone plug spinlock maximizes
> parallelism and device usage by allowing multiple zones to be written
> simultaneously without lock contention.
> 
> Zone write plugging ignores flush BIOs without data. However, any flush
> BIO that has data is always plugged so that the write part of the flush
> sequence is serialized with other regular writes.
> 
> Given that any BIO handled through zone write plugging will be the only
> BIO in flight for the target zone when it is executed, the unplugging
> and submission of a BIO will have no chance of successfully merging with
> plugged requests or requests in the scheduler. To overcome this
> potential performance degradation, blk_mq_submit_bio() calls the
> function blk_zone_write_plug_attempt_merge() to try to merge other
> plugged BIOs with the one just unplugged and submitted. Successful
> merging is signaled using blk_zone_write_plug_bio_merged(), called from
> bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
> of segments of plugged BIOs to attempt merging, the number of segments
> of a plugged BIO is saved using the new struct bio field
> __bi_nr_segments. To avoid growing the size of struct bio, this field is
> added as a union with the bio_cookie field. This is safe to do as
> polling is always disabled for plugged BIOs.
> 
> When BIOs are plugged in a zone write plug, the device request queue
> usage counter is always incremented. This reference is kept and reused
> for blk-mq devices when the plugged BIO is unplugged and submitted
> again using submit_bio_noacct_nocheck(). For this case, the unplugged
> BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
> blk_mq_submit_bio() proceeds directly to allocating a new request for
> the BIO, re-using the usage reference count taken when the BIO was
> plugged. This extra reference count is dropped in
> blk_zone_write_plug_attempt_merge() for any plugged BIO that is
> successfully merged. Given that BIO-based devices will not take this
> path, the extra reference is dropped after a plugged BIO is unplugged
> and submitted.
> 
> Zone write plugs are dynamically allocated and managed using a hash
> table (an array of struct hlist_head) with RCU protection.
> A zone write plug is allocated when a write BIO is received for the
> zone and not freed until the zone is fully written, reset or finished.
> To detect when a zone write plug can be freed, the write state of each
> zone is tracked using a write pointer offset which corresponds to the
> offset of a zone write pointer relative to the zone start. Write
> operations always increment this write pointer offset. Zone reset
> operations set it to 0 and zone finish operations set it to the zone
> size.
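
The write pointer offset tracking described above can be modelled as
follows. This is a simplified user-space sketch, not the patch code: the
toy_* names are hypothetical, offsets are plain sector counts, and the
"can be freed" test only covers the three cases named in the text (zone
fully written, reset or finished). It ignores any other condition the
real code may check, such as BIOs still being plugged.

```c
/* User-space illustration only; not the kernel code from this patch. */
#include <assert.h>
#include <stdbool.h>

struct toy_zone_state {
	unsigned int zone_size;	/* zone size in sectors */
	unsigned int wp_offset;	/* write pointer offset from zone start */
};

/*
 * A write is accepted only if it starts at the write pointer and fits
 * in the zone; a successful write always advances wp_offset.
 */
bool toy_zone_write(struct toy_zone_state *z, unsigned int offset,
		    unsigned int nr_sectors)
{
	if (offset != z->wp_offset ||
	    nr_sectors > z->zone_size - z->wp_offset)
		return false;	/* unaligned or overflowing write */
	z->wp_offset += nr_sectors;
	return true;
}

/* Zone reset: the write pointer returns to the start of the zone. */
void toy_zone_reset(struct toy_zone_state *z)
{
	z->wp_offset = 0;
}

/* Zone finish: the write pointer moves to the end of the zone. */
void toy_zone_finish(struct toy_zone_state *z)
{
	z->wp_offset = z->zone_size;
}

/* The plug is no longer needed once the zone is empty or full. */
bool toy_zone_plug_can_be_freed(const struct toy_zone_state *z)
{
	return z->wp_offset == 0 || z->wp_offset >= z->zone_size;
}
```

With an 8-sector zone, a 4-sector write at offset 0 moves wp_offset to 4
and keeps the plug alive; a second 4-sector write at offset 4 fills the
zone, at which point the plug could be released.
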
> 
> If a write error happens, the wp_offset value of a zone write plug may
> become incorrect and out of sync with the device managed write pointer.
> This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
> The function blk_zone_wplug_handle_error() is called from the new disk
> zone write plug work when this flag is set. This function executes a
> report zone to update the zone write pointer offset to the current
> value as indicated by the device. The disk zone write plug work is
> scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
> with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
> write. Once scheduled, the disk zone write plug work keeps running
> until all zone errors are handled.
> 
> To match the new data structures used for zoned disks, the function
> disk_free_zone_bitmaps() is renamed to the more generic
> disk_free_zone_resources(). The function disk_init_zone_resources() is
> also introduced to initialize zone write plug resources when a gendisk
> is allocated.
> 
> This commit contains contributions from Christoph Hellwig <hch@lst.de>.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/bio.c               |    7 +
>   block/blk-core.c          |    2 +
>   block/blk-merge.c         |   11 +
>   block/blk-mq.c            |   38 +-
>   block/blk-zoned.c         | 1034 ++++++++++++++++++++++++++++++++++++-
>   block/blk.h               |   40 +-
>   block/genhd.c             |    3 +-
>   include/linux/blk-mq.h    |    2 +
>   include/linux/blk_types.h |    8 +-
>   include/linux/blkdev.h    |   11 +
>   10 files changed, 1144 insertions(+), 12 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

