From: Stefan Hajnoczi <stefanha@redhat.com>
To: Sam Li <faithilikerun@gmail.com>
Cc: qemu-devel@nongnu.org, Fam Zheng <fam@euphon.net>,
hare@suse.de, Hanna Reitz <hreitz@redhat.com>,
dmitry.fomichev@wdc.com, Julia Suvorova <jusual@redhat.com>,
Stefano Garzarella <sgarzare@redhat.com>,
damien.lemoal@opensource.wdc.com,
Aarushi Mehta <mehta.aaru20@gmail.com>,
Kevin Wolf <kwolf@redhat.com>,
qemu-block@nongnu.org
Subject: Re: [PATCH v8 1/4] file-posix: add tracking of the zone write pointers
Date: Wed, 5 Apr 2023 16:26:41 -0400 [thread overview]
Message-ID: <20230405202641.GD676473@fedora> (raw)
In-Reply-To: <20230404153239.32234-2-faithilikerun@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 6749 bytes --]
On Tue, Apr 04, 2023 at 11:32:36PM +0800, Sam Li wrote:
> Since Linux doesn't have a user API to issue zone append operations to
> zoned devices from user space, the file-posix driver is modified to add
> zone append emulation using regular writes. To do this, the file-posix
> driver tracks the wp location of all zones of the device. It uses an
> array of uint64_t. The most significant bit of each wp location indicates
> if the zone type is conventional zones.
>
> The zones wp can be changed due to the following operations issued:
> - zone reset: change the wp to the start offset of that zone
> - zone finish: change to the end location of that zone
> - write to a zone
> - zone append
>
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> ---
> block/file-posix.c | 168 ++++++++++++++++++++++++++++++-
> include/block/block-common.h | 14 +++
> include/block/block_int-common.h | 5 +
> 3 files changed, 184 insertions(+), 3 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 65efe5147e..bc58f7193b 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1324,6 +1324,88 @@ static int hdev_get_max_segments(int fd, struct stat *st)
> #endif
> }
>
> +#if defined(CONFIG_BLKZONED)
> +/*
> + * If the reset_all flag is true, then the wps of zone whose state is
> + * not readonly or offline should be all reset to the start sector.
> + * Else, take the real wp of the device.
> + */
> +static int get_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
> + unsigned int nrz, bool reset_all)
> +{
> + struct blk_zone *blkz;
> + size_t rep_size;
> + uint64_t sector = offset >> BDRV_SECTOR_BITS;
> + int ret, n = 0, i = 0;
> + rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> + g_autofree struct blk_zone_report *rep = NULL;
> +
> + rep = g_malloc(rep_size);
> + blkz = (struct blk_zone *)(rep + 1);
> + while (n < nrz) {
> + memset(rep, 0, rep_size);
> + rep->sector = sector;
> + rep->nr_zones = nrz - n;
> +
> + do {
> + ret = ioctl(fd, BLKREPORTZONE, rep);
> + } while (ret != 0 && errno == EINTR);
> + if (ret != 0) {
> + error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> + fd, offset, errno);
> + return -errno;
> + }
> +
> + if (!rep->nr_zones) {
> + break;
> + }
> +
> + for (i = 0; i < rep->nr_zones; i++, n++) {
> + /*
> + * The wp tracking cares only about sequential writes required and
> + * sequential write preferred zones so that the wp can advance to
> + * the right location.
> + * Use the most significant bit of the wp location to indicate the
> + * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
> + */
> + if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> + wps->wp[i] &= 1ULL << 63;
Should this be |= instead of &=? I think the intention is to set the
bit.
> + } else {
> + switch(blkz[i].cond) {
> + case BLK_ZONE_COND_FULL:
> + case BLK_ZONE_COND_READONLY:
> + /* Zone not writable */
> + wps->wp[i] = (blkz[i].start + blkz[i].len) << BDRV_SECTOR_BITS;
wps->wp[i] looks wrong in two cases:
1. After the first iteration of the while (n < nrz) loop.
2. When offset > 0.
I think there should be a j variable that tracks the index into wp[]. It
should be initialized outside the while loop based on offset and
incremented inside the for loop.
> + break;
> + case BLK_ZONE_COND_OFFLINE:
> + /* Zone not writable nor readable */
> + wps->wp[i] = (blkz[i].start) << BDRV_SECTOR_BITS;
> + break;
> + default:
> + if (reset_all) {
> + wps->wp[i] = blkz[i].start << BDRV_SECTOR_BITS;
> + } else {
> + wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS;
> + }
> + break;
> + }
> + }
> + }
> + sector = blkz[i - 1].start + blkz[i - 1].len;
> + }
> +
> + return 0;
> +}
> +
> +static void update_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
> + unsigned int nrz)
> +{
> + if (get_zones_wp(fd, wps, offset, nrz, 0) < 0) {
> + error_report("update zone wp failed");
> + }
> +}
> +#endif
> +
> static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> {
> BDRVRawState *s = bs->opaque;
> @@ -1413,6 +1495,23 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> if (ret >= 0) {
> bs->bl.max_active_zones = ret;
> }
> +
> + ret = get_sysfs_long_val(&st, "physical_block_size");
> + if (ret >= 0) {
> + bs->bl.write_granularity = ret;
> + }
> +
> + /* The refresh_limits() function can be called multiple times. */
> + bs->wps = NULL;
This needs to be g_free(bs->wps) to avoid leaking the odl bs->wps
memory.
(There are more complex solutions that reuse bs->wps when nr_zones has
not gotten larger, but freeing and allocating a new one is the simple
solution for now. This code isn't performance-critical.)
> + bs->wps = g_malloc(sizeof(BlockZoneWps) +
> + sizeof(int64_t) * bs->bl.nr_zones);
> + ret = get_zones_wp(s->fd, bs->wps, 0, bs->bl.nr_zones, 0);
> + if (ret < 0) {
> + error_setg_errno(errp, -ret, "report wps failed");
> + bs->wps = NULL;
> + return;
> + }
> + qemu_co_mutex_init(&bs->wps->colock);
> return;
> }
> out:
> @@ -2338,9 +2437,15 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
> {
> BDRVRawState *s = bs->opaque;
> RawPosixAIOData acb;
> + int ret;
>
> if (fd_open(bs) < 0)
> return -EIO;
> +#if defined(CONFIG_BLKZONED)
> + if (type & QEMU_AIO_WRITE && bs->wps) {
> + QEMU_LOCK_GUARD(&bs->wps->colock);
> + }
QEMU_LOCK_GUARD() has lexical scope so its lifetime ends at the end of
the {} block. Therefore it doesn't lock for the remainder of the
function.
This is a case where it's necessary to use qemu_co_mutex_lock() directly
and remember to qemu_co_mutex_unlock() at the exit point of this
function.
Stefan
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
next prev parent reply other threads:[~2023-04-05 20:27 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-04 15:32 [PATCH v8 0/4] Add zone append write for zoned device Sam Li
2023-04-04 15:32 ` [PATCH v8 1/4] file-posix: add tracking of the zone write pointers Sam Li
2023-04-05 20:26 ` Stefan Hajnoczi [this message]
2023-04-04 15:32 ` [PATCH v8 2/4] block: introduce zone append write for zoned devices Sam Li
2023-04-05 20:32 ` Stefan Hajnoczi
2023-04-04 15:32 ` [PATCH v8 3/4] qemu-iotests: test zone append operation Sam Li
2023-04-04 15:32 ` [PATCH v8 4/4] block: add some trace events for zone append Sam Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230405202641.GD676473@fedora \
--to=stefanha@redhat.com \
--cc=damien.lemoal@opensource.wdc.com \
--cc=dmitry.fomichev@wdc.com \
--cc=faithilikerun@gmail.com \
--cc=fam@euphon.net \
--cc=hare@suse.de \
--cc=hreitz@redhat.com \
--cc=jusual@redhat.com \
--cc=kwolf@redhat.com \
--cc=mehta.aaru20@gmail.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=sgarzare@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.