From: Stefan Hajnoczi <stefanha@redhat.com>
To: Sam Li <faithilikerun@gmail.com>
Cc: qemu-devel@nongnu.org, dmitry.fomichev@wdc.com,
Aarushi Mehta <mehta.aaru20@gmail.com>,
Stefano Garzarella <sgarzare@redhat.com>,
Julia Suvorova <jusual@redhat.com>,
damien.lemoal@opensource.wdc.com, Kevin Wolf <kwolf@redhat.com>,
qemu-block@nongnu.org, Fam Zheng <fam@euphon.net>,
hare@suse.de, Hanna Reitz <hreitz@redhat.com>
Subject: Re: [PATCH v9 1/4] file-posix: add tracking of the zone write pointers
Date: Mon, 10 Apr 2023 09:04:44 -0400 [thread overview]
Message-ID: <20230410130444.GB888305@fedora> (raw)
In-Reply-To: <20230407081657.17947-2-faithilikerun@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 12941 bytes --]
On Fri, Apr 07, 2023 at 04:16:54PM +0800, Sam Li wrote:
> Since Linux doesn't have a user API to issue zone append operations to
> zoned devices from user space, the file-posix driver is modified to add
> zone append emulation using regular writes. To do this, the file-posix
> driver tracks the wp location of all zones of the device. It uses an
> array of uint64_t. The most significant bit of each wp location indicates
> if the zone type is conventional zones.
>
> The zones wp can be changed due to the following operations issued:
> - zone reset: change the wp to the start offset of that zone
> - zone finish: change to the end location of that zone
> - write to a zone
> - zone append
>
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> ---
> block/file-posix.c | 173 ++++++++++++++++++++++++++++++-
> include/block/block-common.h | 14 +++
> include/block/block_int-common.h | 5 +
> 3 files changed, 189 insertions(+), 3 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 65efe5147e..e7957f5559 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1324,6 +1324,90 @@ static int hdev_get_max_segments(int fd, struct stat *st)
> #endif
> }
>
> +#if defined(CONFIG_BLKZONED)
> +/*
> + * If the reset_all flag is true, then the wps of zone whose state is
> + * not readonly or offline should be all reset to the start sector.
> + * Else, take the real wp of the device.
> + */
> +static int get_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
> + unsigned int nrz, bool reset_all)
> +{
> + struct blk_zone *blkz;
> + size_t rep_size;
> + uint64_t sector = offset >> BDRV_SECTOR_BITS;
> + BlockZoneWps *wps = bs->wps;
> + int j = offset / bs->bl.zone_size;
> + int ret, n = 0, i = 0;
I would feel more comfortable if i, j, and n were unsigned int like nrz.
That way we don't need to worry about negative array indices when int
wraps to INT_MIN.
In practice we'll probably hit scalability problems before nrz becomes
greater than INT_MAX. Also, such devices probably don't exist. A 5 TB
drive with 256 MB zones only has 20,480 zones.
So for now I think you can keep the code the way it is.
> + rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> + g_autofree struct blk_zone_report *rep = NULL;
> +
> + rep = g_malloc(rep_size);
> + blkz = (struct blk_zone *)(rep + 1);
> + while (n < nrz) {
> + memset(rep, 0, rep_size);
> + rep->sector = sector;
> + rep->nr_zones = nrz - n;
> +
> + do {
> + ret = ioctl(fd, BLKREPORTZONE, rep);
> + } while (ret != 0 && errno == EINTR);
> + if (ret != 0) {
> + error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> + fd, offset, errno);
> + return -errno;
> + }
> +
> + if (!rep->nr_zones) {
> + break;
> + }
> +
> + for (i = 0; i < rep->nr_zones; ++i, ++n, ++j) {
> + /*
> + * The wp tracking cares only about sequential writes required and
> + * sequential write preferred zones so that the wp can advance to
> + * the right location.
> + * Use the most significant bit of the wp location to indicate the
> + * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
> + */
> + if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> + wps->wp[j] |= 1ULL << 63;
> + } else {
> + switch(blkz[i].cond) {
> + case BLK_ZONE_COND_FULL:
> + case BLK_ZONE_COND_READONLY:
> + /* Zone not writable */
> + wps->wp[j] = (blkz[i].start + blkz[i].len) << BDRV_SECTOR_BITS;
> + break;
> + case BLK_ZONE_COND_OFFLINE:
> + /* Zone not writable nor readable */
> + wps->wp[j] = (blkz[i].start) << BDRV_SECTOR_BITS;
> + break;
> + default:
> + if (reset_all) {
> + wps->wp[j] = blkz[i].start << BDRV_SECTOR_BITS;
> + } else {
> + wps->wp[j] = blkz[i].wp << BDRV_SECTOR_BITS;
> + }
> + break;
> + }
> + }
> + }
> + sector = blkz[i - 1].start + blkz[i - 1].len;
> + }
> +
> + return 0;
> +}
> +
> +static void update_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
> + unsigned int nrz)
> +{
> + if (get_zones_wp(bs, fd, offset, nrz, 0) < 0) {
> + error_report("update zone wp failed");
> + }
> +}
> +#endif
> +
> static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> {
> BDRVRawState *s = bs->opaque;
> @@ -1413,6 +1497,23 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> if (ret >= 0) {
> bs->bl.max_active_zones = ret;
> }
> +
> + ret = get_sysfs_long_val(&st, "physical_block_size");
> + if (ret >= 0) {
> + bs->bl.write_granularity = ret;
> + }
> +
> + /* The refresh_limits() function can be called multiple times. */
> + g_free(bs->wps);
> + bs->wps = g_malloc(sizeof(BlockZoneWps) +
> + sizeof(int64_t) * bs->bl.nr_zones);
> + ret = get_zones_wp(bs, s->fd, 0, bs->bl.nr_zones, 0);
> + if (ret < 0) {
> + error_setg_errno(errp, -ret, "report wps failed");
> + bs->wps = NULL;
> + return;
> + }
> + qemu_co_mutex_init(&bs->wps->colock);
I suggest moving qemu_co_mutex_init() to raw_open_common() in the future
to eliminate the assumption that raw_refresh_limits() is called before
other functions that use colock. But there is no need to resend the
patch series.
> return;
> }
> out:
> @@ -2338,9 +2439,15 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
> {
> BDRVRawState *s = bs->opaque;
> RawPosixAIOData acb;
> + int ret;
>
> if (fd_open(bs) < 0)
> return -EIO;
> +#if defined(CONFIG_BLKZONED)
> + if (type & QEMU_AIO_WRITE && bs->wps) {
> + qemu_co_mutex_lock(&bs->wps->colock);
> + }
> +#endif
>
> /*
> * When using O_DIRECT, the request must be aligned to be able to use
> @@ -2354,14 +2461,16 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
> } else if (s->use_linux_io_uring) {
> LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
> assert(qiov->size == bytes);
> - return luring_co_submit(bs, aio, s->fd, offset, qiov, type);
> + ret = luring_co_submit(bs, aio, s->fd, offset, qiov, type);
> + goto out;
> #endif
> #ifdef CONFIG_LINUX_AIO
> } else if (s->use_linux_aio) {
> LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
> assert(qiov->size == bytes);
> - return laio_co_submit(bs, aio, s->fd, offset, qiov, type,
> + ret = laio_co_submit(bs, aio, s->fd, offset, qiov, type,
> s->aio_max_batch);
> + goto out;
> #endif
> }
>
> @@ -2378,7 +2487,32 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
> };
>
> assert(qiov->size == bytes);
> - return raw_thread_pool_submit(bs, handle_aiocb_rw, &acb);
> + ret = raw_thread_pool_submit(bs, handle_aiocb_rw, &acb);
> +
> +out:
> +#if defined(CONFIG_BLKZONED)
> + BlockZoneWps *wps = bs->wps;
> + if (ret == 0) {
> + if (type & QEMU_AIO_WRITE && wps && bs->bl.zone_size) {
> + uint64_t *wp = &wps->wp[offset / bs->bl.zone_size];
> + if (!BDRV_ZT_IS_CONV(*wp)) {
> + /* Advance the wp if needed */
> + if (offset + bytes > *wp) {
> + *wp = offset + bytes;
> + }
> + }
> + }
> + } else {
> + if (type & QEMU_AIO_WRITE) {
> + update_zones_wp(bs, s->fd, 0, 1);
> + }
> + }
> +
> + if (type & QEMU_AIO_WRITE && wps) {
> + qemu_co_mutex_unlock(&wps->colock);
> + }
> +#endif
> + return ret;
> }
>
> static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
> @@ -2486,6 +2620,9 @@ static void raw_close(BlockDriverState *bs)
> BDRVRawState *s = bs->opaque;
>
> if (s->fd >= 0) {
> +#if defined(CONFIG_BLKZONED)
> + g_free(bs->wps);
> +#endif
> qemu_close(s->fd);
> s->fd = -1;
> }
> @@ -3283,6 +3420,7 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> const char *op_name;
> unsigned long zo;
> int ret;
> + BlockZoneWps *wps = bs->wps;
> int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
>
> zone_size = bs->bl.zone_size;
> @@ -3300,6 +3438,15 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> return -EINVAL;
> }
>
> + QEMU_LOCK_GUARD(&wps->colock);
> + uint32_t i = offset / bs->bl.zone_size;
> + uint32_t nrz = len / bs->bl.zone_size;
> + uint64_t *wp = &wps->wp[i];
> + if (BDRV_ZT_IS_CONV(*wp) && len != capacity) {
> + error_report("zone mgmt operations are not allowed for conventional zones");
> + return -EIO;
> + }
> +
> switch (op) {
> case BLK_ZO_OPEN:
> op_name = "BLKOPENZONE";
> @@ -3337,8 +3484,28 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> len >> BDRV_SECTOR_BITS);
> ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
> if (ret != 0) {
> + update_zones_wp(bs, s->fd, offset, i);
> ret = -errno;
> error_report("ioctl %s failed %d", op_name, ret);
> + return ret;
> + }
> +
> + if (zo == BLKRESETZONE && len == capacity) {
> + ret = get_zones_wp(bs, s->fd, 0, bs->bl.nr_zones, 1);
> + if (ret < 0) {
> + error_report("reporting single wp failed");
> + return ret;
> + }
> + } else if (zo == BLKRESETZONE) {
> + for (int j = 0; j < nrz; ++j) {
> + wp[j] = offset + j * zone_size;
> + }
> + } else if (zo == BLKFINISHZONE) {
> + for (int j = 0; j < nrz; ++j) {
> + /* The zoned device allows the last zone smaller that the
> + * zone size. */
> + wp[j] = MIN(offset + (j + 1) * zone_size, offset + len);
> + }
> }
>
> return ret;
> diff --git a/include/block/block-common.h b/include/block/block-common.h
> index 1576fcf2ed..93196229ac 100644
> --- a/include/block/block-common.h
> +++ b/include/block/block-common.h
> @@ -118,6 +118,14 @@ typedef struct BlockZoneDescriptor {
> BlockZoneState state;
> } BlockZoneDescriptor;
>
> +/*
> + * Track write pointers of a zone in bytes.
> + */
> +typedef struct BlockZoneWps {
> + CoMutex colock;
> + uint64_t wp[];
> +} BlockZoneWps;
> +
> typedef struct BlockDriverInfo {
> /* in bytes, 0 if irrelevant */
> int cluster_size;
> @@ -240,6 +248,12 @@ typedef enum {
> #define BDRV_SECTOR_BITS 9
> #define BDRV_SECTOR_SIZE (1ULL << BDRV_SECTOR_BITS)
>
> +/*
> + * Get the first most significant bit of wp. If it is zero, then
> + * the zone type is SWR.
> + */
> +#define BDRV_ZT_IS_CONV(wp) (wp & (1ULL << 63))
> +
> #define BDRV_REQUEST_MAX_SECTORS MIN_CONST(SIZE_MAX >> BDRV_SECTOR_BITS, \
> INT_MAX >> BDRV_SECTOR_BITS)
> #define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS)
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 1bd2aef4d5..b34a7f175d 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -884,6 +884,8 @@ typedef struct BlockLimits {
>
> /* maximum number of active zones */
> int64_t max_active_zones;
> +
> + int64_t write_granularity;
> } BlockLimits;
>
> typedef struct BdrvOpBlocker BdrvOpBlocker;
> @@ -1245,6 +1247,9 @@ struct BlockDriverState {
> CoMutex bsc_modify_lock;
> /* Always non-NULL, but must only be dereferenced under an RCU read guard */
> BdrvBlockStatusCache *block_status_cache;
> +
> + /* array of write pointers' location of each zone in the zoned device. */
> + BlockZoneWps *wps;
> };
>
> struct BlockBackendRootState {
> --
> 2.39.2
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
next prev parent reply other threads:[~2023-04-10 13:05 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-07 8:16 [PATCH v9 0/4] Add zone append write for zoned device Sam Li
2023-04-07 8:16 ` [PATCH v9 1/4] file-posix: add tracking of the zone write pointers Sam Li
2023-04-10 13:04 ` Stefan Hajnoczi [this message]
2023-04-10 13:42 ` Sam Li
2023-04-07 8:16 ` [PATCH v9 2/4] block: introduce zone append write for zoned devices Sam Li
2023-04-07 8:16 ` [PATCH v9 3/4] qemu-iotests: test zone append operation Sam Li
2023-04-07 8:16 ` [PATCH v9 4/4] block: add some trace events for zone append Sam Li
2023-04-10 13:10 ` [PATCH v9 0/4] Add zone append write for zoned device Stefan Hajnoczi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230410130444.GB888305@fedora \
--to=stefanha@redhat.com \
--cc=damien.lemoal@opensource.wdc.com \
--cc=dmitry.fomichev@wdc.com \
--cc=faithilikerun@gmail.com \
--cc=fam@euphon.net \
--cc=hare@suse.de \
--cc=hreitz@redhat.com \
--cc=jusual@redhat.com \
--cc=kwolf@redhat.com \
--cc=mehta.aaru20@gmail.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=sgarzare@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.