From: Hannes Reinecke <hare@suse.de>
To: Damien Le Moal <dlemoal@kernel.org>,
linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
dm-devel@lists.linux.dev, Mike Snitzer <snitzer@redhat.com>
Subject: Re: [PATCH v3 07/14] block: Do not remove zone write plugs still in use
Date: Thu, 2 May 2024 08:09:36 +0200 [thread overview]
Message-ID: <751efc92-2081-41c0-8007-c73f51fcf87e@suse.de> (raw)
In-Reply-To: <20240501110907.96950-8-dlemoal@kernel.org>
On 5/1/24 13:09, Damien Le Moal wrote:
> Large write BIOs that span a zone boundary are split in
> blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
> write plugging. Such split BIO will be chained with one fragment
> targeting one zone and the remainder of the BIO targeting the next
> zone. The two BIOs can be executed in parallel, without a predetermine
> order relative to eachother and their completion may be reversed: the
each other
> remainder first completing and the first fragment then completing. In
> such case, bio_endio() will not immediately execute
> blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the
> split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for
> the parent BIO will be executed only once the first fragment completes.
>
> In the case of a device with small zones and very large BIOs, uch
such
> completion pattern can lead to disk_should_remove_zone_wplug() to return
> true for the zone of the parent BIO when the parent BIO request
> completes and blk_zone_write_plug_complete_request() is executed. This
> triggers the removal of the zone write plug from the hash table using
> disk_remove_zone_wplug(). With the zone write plug of the parent BIO
> missing, the call to disk_get_zone_wplug() in
> blk_zone_write_plug_bio_endio() returns NULL and triggers a warning.
>
> This patterns can be recreated fairly easily using a scsi_debug device
> with small zone and btrfs. E.g.
>
> modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \
> zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4
> mkfs.btrfs -f -O zoned /dev/sda
> mount -t btrfs /dev/sda /mnt
> fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \
> --bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \
> --runtime=10
> umount /dev/sda
>
> Will result in the warning:
>
> [ 29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0
> ...
> [ 29.058682] Call Trace:
> [ 29.059095] <TASK>
> [ 29.059473] ? __warn+0x80/0x120
> [ 29.059983] ? blk_zone_write_plug_bio_endio+0xee/0x1e0
> [ 29.060728] ? report_bug+0x160/0x190
> [ 29.061283] ? handle_bug+0x36/0x70
> [ 29.061830] ? exc_invalid_op+0x17/0x60
> [ 29.062399] ? asm_exc_invalid_op+0x1a/0x20
> [ 29.063025] ? blk_zone_write_plug_bio_endio+0xee/0x1e0
> [ 29.063760] bio_endio+0xb7/0x150
> [ 29.064280] btrfs_clone_write_end_io+0x2b/0x60 [btrfs]
> [ 29.065049] blk_update_request+0x17c/0x500
> [ 29.065666] scsi_end_request+0x27/0x1a0 [scsi_mod]
> [ 29.066356] scsi_io_completion+0x5b/0x690 [scsi_mod]
> [ 29.067077] blk_complete_reqs+0x3a/0x50
> [ 29.067692] __do_softirq+0xcf/0x2b3
> [ 29.068248] ? sort_range+0x20/0x20
> [ 29.068791] run_ksoftirqd+0x1c/0x30
> [ 29.069339] smpboot_thread_fn+0xcc/0x1b0
> [ 29.069936] kthread+0xcf/0x100
> [ 29.070438] ? kthread_complete_and_exit+0x20/0x20
> [ 29.071314] ret_from_fork+0x31/0x50
> [ 29.071873] ? kthread_complete_and_exit+0x20/0x20
> [ 29.072563] ret_from_fork_asm+0x11/0x20
> [ 29.073146] </TASK>
>
> either when fio executes or when unmount is executed.
>
> Fix this by modifying disk_should_remove_zone_wplug() to check that the
> reference count to a zone write plug is not larger than 2, that is, that
> the only references left on the zone are the caller held reference
> (blk_zone_write_plug_complete_request()) and the initial extra reference
> for the zone write plug taken when it was initialized (and that is
> dropped when the zone write plug is removed from the hash table).
>
> To be consistent with this change, make sure to drop the request or BIO
> held reference to the zone write plug before calling
> disk_zone_wplug_unplug_bio(). All references are also dropped using
> disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone
> write plug is freed if it needs to be.
>
> Comments are also improved to clarify zone write plugs reference
> handling.
>
> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 39 +++++++++++++++++++++++++++++++--------
> 1 file changed, 31 insertions(+), 8 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
next prev parent reply other threads:[~2024-05-02 6:09 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
2024-05-01 11:08 ` [PATCH v3 01/14] dm: Check that a zoned table leads to a valid mapped device Damien Le Moal
2024-05-01 11:08 ` [PATCH v3 02/14] block: Exclude conventional zones when faking max open limit Damien Le Moal
2024-05-02 5:52 ` Hannes Reinecke
2024-05-01 11:08 ` [PATCH v3 03/14] block: Fix zone write plug initialization from blk_revalidate_zone_cb() Damien Le Moal
2024-05-02 5:53 ` Hannes Reinecke
2024-05-01 11:08 ` [PATCH v3 04/14] block: Fix reference counting for zone write plugs in error state Damien Le Moal
2024-05-02 6:01 ` Hannes Reinecke
2024-05-02 9:37 ` Damien Le Moal
2024-05-01 11:08 ` [PATCH v3 05/14] block: Hold a reference on zone write plugs to schedule submission Damien Le Moal
2024-05-02 6:04 ` Hannes Reinecke
2024-05-01 11:08 ` [PATCH v3 06/14] block: Unhash a zone write plug only if needed Damien Le Moal
2024-05-02 6:05 ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 07/14] block: Do not remove zone write plugs still in use Damien Le Moal
2024-05-02 5:38 ` Christoph Hellwig
2024-05-02 6:09 ` Hannes Reinecke [this message]
2024-05-01 11:09 ` [PATCH v3 08/14] block: Fix flush request sector restore Damien Le Moal
2024-05-02 6:10 ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 09/14] block: Fix handling of non-empty flush write requests to zones Damien Le Moal
2024-05-02 6:11 ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 10/14] block: Improve blk_zone_write_plug_bio_merged() Damien Le Moal
2024-05-02 6:13 ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 11/14] block: Improve zone write request completion handling Damien Le Moal
2024-05-02 6:14 ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 12/14] block: Simplify blk_zone_write_plug_bio_endio() Damien Le Moal
2024-05-02 6:15 ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 13/14] block: Simplify zone write plug BIO abort Damien Le Moal
2024-05-02 6:15 ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 14/14] block: Cleanup blk_revalidate_zone_cb() Damien Le Moal
2024-05-02 6:17 ` Hannes Reinecke
2024-05-01 14:08 ` [PATCH v3 00/14] Zone write plugging fixes and cleanup Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=751efc92-2081-41c0-8007-c73f51fcf87e@suse.de \
--to=hare@suse.de \
--cc=axboe@kernel.dk \
--cc=dlemoal@kernel.org \
--cc=dm-devel@lists.linux.dev \
--cc=linux-block@vger.kernel.org \
--cc=snitzer@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox