Re: [PATCH] block: introduce zone append write for zoned devices

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
To: Sam Li <faithilikerun@gmail.com>
Cc: qemu-devel <qemu-devel@nongnu.org>, Kevin Wolf <kwolf@redhat.com>,
	Dmitry Fomichev <dmitry.fomichev@wdc.com>,
	Fam Zheng <fam@euphon.net>, Hannes Reinecke <hare@suse.de>,
	Hanna Reitz <hreitz@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	qemu block <qemu-block@nongnu.org>
Subject: Re: [PATCH] block: introduce zone append write for zoned devices
Date: Sun, 11 Sep 2022 17:36:45 +0900	[thread overview]
Message-ID: <a0d5fe49-8290-72ca-3999-fd8701f94d15@opensource.wdc.com> (raw)
In-Reply-To: <CAAAx-8KUHoW4Apr1u9urRyVnKmp1Qo2kMKWP2+iRui4jtYj9LA@mail.gmail.com>

On 2022/09/11 17:00, Sam Li wrote:
[...]
>>> @@ -1604,6 +1629,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf)
>>>                           (const char *)buf + offset,
>>>                           aiocb->aio_nbytes - offset,
>>>                           aiocb->aio_offset + offset);
>>> +        } else if (aiocb->aio_type == QEMU_AIO_ZONE_APPEND) {
>>> +            uint64_t wp = aiocb->aio_offset;
>>
>> This variable is not necessary.
>>
>>> +            len = pwrite(aiocb->aio_fildes,
>>> +                         (const char *)buf + offset,
>>> +                         aiocb->aio_nbytes - offset,
>>> +                         wp + offset);
>>>          } else {
>>>              len = pread(aiocb->aio_fildes,
>>>                          buf + offset,
>>> @@ -1638,7 +1669,6 @@ static int handle_aiocb_rw(void *opaque)
>>>      RawPosixAIOData *aiocb = opaque;
>>>      ssize_t nbytes;
>>>      char *buf;
>>> -
>>
>> whiteline change.
>>
>>>      if (!(aiocb->aio_type & QEMU_AIO_MISALIGNED)) {
>>>          /*
>>>           * If there is just a single buffer, and it is properly aligned
>>> @@ -1692,7 +1722,7 @@ static int handle_aiocb_rw(void *opaque)
>>>      }
>>>
>>>      nbytes = handle_aiocb_rw_linear(aiocb, buf);
>>> -    if (!(aiocb->aio_type & QEMU_AIO_WRITE)) {
>>> +    if (!(aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) {
>>>          char *p = buf;
>>>          size_t count = aiocb->aio_nbytes, copy;
>>>          int i;
>>> @@ -1713,6 +1743,25 @@ static int handle_aiocb_rw(void *opaque)
>>>
>>>  out:
>>>      if (nbytes == aiocb->aio_nbytes) {
>>> +#if defined(CONFIG_BLKZONED)
>>> +        if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
>>> +            BlockZoneDescriptor *zone = aiocb->io.zone;
>>> +            int64_t nr_sectors = aiocb->aio_nbytes / 512;
>>> +            if (zone) {
>>> +                qemu_mutex_init(&zone->lock);
>>> +                if (zone->type == BLK_ZT_SWR) {
>>> +                    qemu_mutex_lock(&zone->lock);
>>> +                    zone->wp += nr_sectors;
>>> +                    qemu_mutex_unlock(&zone->lock);
>>> +                }
>>> +                qemu_mutex_destroy(&zone->lock);
>>
>> This is weird. you init the mutex, lock/unlock it and destroy it. So it has
>> absolutely no meaning at all.
> 
> I was thinking that init every lock for all the zones or init the lock
> for the zone that needed it. The confusion I have here is the cost of
> initializing/destroying the lock.

A mutex needs to be initialized before it is used and should not be
re-initialized, ever, until it is not needed anymore. That is, in this case,
since the mutex protects a zone wp tracking entry, it should be initialized when
the array of zone wp is allocated & initialized with a zone report, and the
mutex destroyed when that same array is freed.

The cost of initializing & destroying a mutex is low. And since that is not done
in the hot IO path, you do not need to worry about it.

[...]
>>> +static int coroutine_fn raw_co_zone_append(BlockDriverState *bs,
>>> +                                           int64_t *offset,
>>> +                                           QEMUIOVector *qiov,
>>> +                                           BdrvRequestFlags flags) {
>>> +#if defined(CONFIG_BLKZONED)
>>> +    BDRVRawState *s = bs->opaque;
>>> +    int64_t zone_sector = bs->bl.zone_sectors;
>>> +    int64_t zone_sector_mask = zone_sector - 1;
>>> +    int64_t iov_len = 0;
>>> +    int64_t len = 0;
>>> +    RawPosixAIOData acb;
>>> +
>>> +    if (*offset & zone_sector_mask) {
>>> +        error_report("offset %" PRId64 " is not aligned to zone size "
>>> +                     "%" PRId64 "", *offset, zone_sector);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    int64_t lbsz = bs->bl.logical_block_size;> +    int64_t lbsz_mask = lbsz - 1;
>>> +    for (int i = 0; i < qiov->niov; i++) {
>>> +       iov_len = qiov->iov[i].iov_len;
>>> +       if (iov_len & lbsz_mask) {
>>> +           error_report("len of IOVector[%d] %" PRId64 " is not aligned to block "
>>> +                        "size %" PRId64 "", i, iov_len, lbsz);
>>> +           return -EINVAL;
>>> +       }
>>
>> This alignment check should be against the device write granularity, not the
>> logical block size. The write granularity will always be equal to the device
>> physical block size, which may or may not be equal to the device logical block
>> size. E.g. a 512e SMR disk has a 512B logical block size but a 4096B physical
>> block size. And the ZBC & ZAC specifications mandate that all write be aligned
>> to the physical block size.
> 
> I see. I'll change it to physical block size.

I would use a filed called "write_granularity" since the virtio specs will
introduce that anyway. This zone granularity is going to be indeed equal to the
physical block size of the host device for now.

[...]
>>>      /* removable device specific */
>>>      bool (*bdrv_is_inserted)(BlockDriverState *bs);
>>> @@ -854,6 +857,12 @@ typedef struct BlockLimits {
>>>
>>>      /* maximum number of active zones */
>>>      int64_t max_active_zones;
>>> +
>>> +    /* array of zones in the zoned block device. Only tracks write pointer's
>>> +     * location of each zone as a helper for zone_append API */
>>> +    BlockZoneDescriptor *zones;
>>
>> This is a lot of memory for only tracking the wp... Why not reduce this to an
>> array of int64 values, for the wp only ? As you may need the zone type too
>> (conventional vs sequential), you can use the most significant bit (a zone wp
>> value will never use all 64 bits !).
>>
>> Or device another zone structure with zone type, zone wp and mutex only, so
>> smaller than the regular zone report structure.
> 
> I was just trying to reuse do_zone_report. It is better to only track
> the wp only. Then a new struct and smaller zone_report should be
> introduced for it.

Yes, this will use less memory, which is always good.

-- 
Damien Le Moal
Western Digital Research

next prev parent reply	other threads:[~2022-09-11  8:38 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-10  6:38 [PATCH] block: introduce zone append write for zoned devices Sam Li
2022-09-11  6:06 ` Damien Le Moal
2022-09-11  6:54   ` Damien Le Moal
2022-09-11  8:00   ` Sam Li
2022-09-11  8:08     ` Sam Li
2022-09-11  8:36     ` Damien Le Moal [this message]
2022-09-11  8:39       ` Sam Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a0d5fe49-8290-72ca-3999-fd8701f94d15@opensource.wdc.com \
    --to=damien.lemoal@opensource.wdc.com \
    --cc=dmitry.fomichev@wdc.com \
    --cc=faithilikerun@gmail.com \
    --cc=fam@euphon.net \
    --cc=hare@suse.de \
    --cc=hreitz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).