qemu-devel.nongnu.org archive mirror
* [PATCH v2] block/file-posix: optimize append write
@ 2024-10-04 10:41 Sam Li
  2024-10-18 14:37 ` Kevin Wolf
  0 siblings, 1 reply; 9+ messages in thread
From: Sam Li @ 2024-10-04 10:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, stefanha, Hanna Reitz, dmitry.fomichev, qemu-block,
	dlemoal, Sam Li

When the file-posix driver emulates append write, it holds the lock
whenever accessing wp, which limits the IO queue depth to one.

The write IO flow can be optimized to allow concurrent writes. The lock
is held in two cases:
1. Assumed that the write IO succeeds, update the wp before issuing the
write.
2. If the write IO fails, report that zone and use the reported value
as the current wp.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
---
 block/file-posix.c | 57 ++++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 19 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 90fa54352c..a65a23cb2c 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2482,18 +2482,46 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
     BDRVRawState *s = bs->opaque;
     RawPosixAIOData acb;
     int ret;
-    uint64_t offset = *offset_ptr;
+    uint64_t end_offset, end_zone, offset = *offset_ptr;
+    uint64_t *wp;
 
     if (fd_open(bs) < 0)
         return -EIO;
 #if defined(CONFIG_BLKZONED)
     if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) &&
         bs->bl.zoned != BLK_Z_NONE) {
-        qemu_co_mutex_lock(&bs->wps->colock);
-        if (type & QEMU_AIO_ZONE_APPEND) {
-            int index = offset / bs->bl.zone_size;
-            offset = bs->wps->wp[index];
+        BlockZoneWps *wps = bs->wps;
+        int index = offset / bs->bl.zone_size;
+
+        qemu_co_mutex_lock(&wps->colock);
+        wp = &wps->wp[index];
+        if (!BDRV_ZT_IS_CONV(*wp)) {
+            if (type & QEMU_AIO_WRITE && offset != *wp) {
+                error_report("write offset 0x%" PRIx64 " is not equal to the wp"
+                             " of zone[%d] 0x%" PRIx64 "", offset, index, *wp);
+                qemu_co_mutex_unlock(&wps->colock);
+                return -EINVAL;
+            }
+
+            if (type & QEMU_AIO_ZONE_APPEND) {
+                offset = *wp;
+                *offset_ptr = offset;
+            }
+
+            end_offset = offset + bytes;
+            end_zone = (index + 1) * bs->bl.zone_size;
+            if (end_offset > end_zone) {
+                error_report("write exceeds zone boundary with end_offset "
+                             "%" PRIu64 ", end_zone %" PRIu64 "",
+                             end_offset, end_zone);
+                qemu_co_mutex_unlock(&wps->colock);
+                return -EINVAL;
+            }
+
+            /* Advance the wp */
+            *wp = end_offset;
         }
+        qemu_co_mutex_unlock(&bs->wps->colock);
     }
 #endif
 
@@ -2540,28 +2568,19 @@ out:
 #if defined(CONFIG_BLKZONED)
     if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) &&
         bs->bl.zoned != BLK_Z_NONE) {
-        BlockZoneWps *wps = bs->wps;
         if (ret == 0) {
-            uint64_t *wp = &wps->wp[offset / bs->bl.zone_size];
-            if (!BDRV_ZT_IS_CONV(*wp)) {
-                if (type & QEMU_AIO_ZONE_APPEND) {
-                    *offset_ptr = *wp;
-                    trace_zbd_zone_append_complete(bs, *offset_ptr
-                        >> BDRV_SECTOR_BITS);
-                }
-                /* Advance the wp if needed */
-                if (offset + bytes > *wp) {
-                    *wp = offset + bytes;
-                }
+            if (type & QEMU_AIO_ZONE_APPEND) {
+                trace_zbd_zone_append_complete(bs, *offset_ptr
+                    >> BDRV_SECTOR_BITS);
             }
         } else {
+            qemu_co_mutex_lock(&bs->wps->colock);
             /*
              * write and append write are not allowed to cross zone boundaries
              */
             update_zones_wp(bs, s->fd, offset, 1);
+            qemu_co_mutex_unlock(&bs->wps->colock);
         }
-
-        qemu_co_mutex_unlock(&wps->colock);
     }
 #endif
     return ret;
-- 
2.34.1




* Re: [PATCH v2] block/file-posix: optimize append write
  2024-10-04 10:41 [PATCH v2] block/file-posix: optimize append write Sam Li
@ 2024-10-18 14:37 ` Kevin Wolf
  2024-10-20  1:03   ` Damien Le Moal
  2024-10-21 13:21   ` Sam Li
  0 siblings, 2 replies; 9+ messages in thread
From: Kevin Wolf @ 2024-10-18 14:37 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, stefanha, Hanna Reitz, dmitry.fomichev, qemu-block,
	dlemoal

On 04.10.2024 at 12:41, Sam Li wrote:
> When the file-posix driver emulates append write, it holds the lock
> whenever accessing wp, which limits the IO queue depth to one.
> 
> The write IO flow can be optimized to allow concurrent writes. The lock
> is held in two cases:
> 1. Assumed that the write IO succeeds, update the wp before issuing the
> write.
> 2. If the write IO fails, report that zone and use the reported value
> as the current wp.

What happens with the concurrent writes that started later and may not
have completed yet? Can we really just reset to the reported value
before all other requests have completed, too?

> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> ---
>  block/file-posix.c | 57 ++++++++++++++++++++++++++++++----------------
>  1 file changed, 38 insertions(+), 19 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 90fa54352c..a65a23cb2c 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -2482,18 +2482,46 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
>      BDRVRawState *s = bs->opaque;
>      RawPosixAIOData acb;
>      int ret;
> -    uint64_t offset = *offset_ptr;
> +    uint64_t end_offset, end_zone, offset = *offset_ptr;
> +    uint64_t *wp;

Without CONFIG_BLKZONED, these are unused variables and break the build.

They are only used in the first CONFIG_BLKZONED block, so you could just
declare them locally there.
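
For illustration, moving those declarations into the zoned-only path could
look roughly like this (a sketch against the hunk above, not a tested change):

#if defined(CONFIG_BLKZONED)
    if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) &&
        bs->bl.zoned != BLK_Z_NONE) {
        uint64_t end_offset, end_zone;      /* moved from function scope */
        uint64_t *wp;
        BlockZoneWps *wps = bs->wps;
        int index = offset / bs->bl.zone_size;

        /* ... zone checks and wp update as in the hunk above ... */
    }
#endif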

>  
>      if (fd_open(bs) < 0)
>          return -EIO;
>  #if defined(CONFIG_BLKZONED)
>      if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) &&
>          bs->bl.zoned != BLK_Z_NONE) {
> -        qemu_co_mutex_lock(&bs->wps->colock);
> -        if (type & QEMU_AIO_ZONE_APPEND) {
> -            int index = offset / bs->bl.zone_size;
> -            offset = bs->wps->wp[index];
> +        BlockZoneWps *wps = bs->wps;
> +        int index = offset / bs->bl.zone_size;
> +
> +        qemu_co_mutex_lock(&wps->colock);

This is preexisting, but what is the reason for using coroutine locks
here? There doesn't seem to be any code running under the lock that can
yield, so a normal mutex might be more efficient.

Hm... Looking a bit closer, get_zones_wp() could probably be a
coroutine_fn and call hdev_co_ioctl() instead of calling ioctl()
directly in order to avoid blocking.
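
Something along these lines, maybe (rough sketch only: the get_zones_wp_co()
name is made up and the exact plumbing is an assumption, but it shows the idea
of issuing the zone report through the coroutine ioctl path instead of a
blocking ioctl()):

static int coroutine_fn get_zones_wp_co(BlockDriverState *bs, int64_t offset,
                                        unsigned int nrz)
{
    size_t rep_size = sizeof(struct blk_zone_report) +
                      nrz * sizeof(struct blk_zone);
    g_autofree struct blk_zone_report *rep = g_malloc0(rep_size);
    int ret;

    rep->sector = offset >> BDRV_SECTOR_BITS;
    rep->nr_zones = nrz;

    /* yields instead of blocking the thread in ioctl() */
    ret = hdev_co_ioctl(bs, BLKREPORTZONE, rep);
    if (ret < 0) {
        return ret;
    }

    /* ... copy rep->zones[i].wp into bs->wps->wp[] like get_zones_wp() ... */
    return 0;
}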

> +        wp = &wps->wp[index];

Also preexisting, but who guarantees that index is within the bounds? A
stacked block driver may try to grow the file size.

> +        if (!BDRV_ZT_IS_CONV(*wp)) {
> +            if (type & QEMU_AIO_WRITE && offset != *wp) {
> +                error_report("write offset 0x%" PRIx64 " is not equal to the wp"
> +                             " of zone[%d] 0x%" PRIx64 "", offset, index, *wp);

We can't error_report() in an I/O path that can be triggered by the
guest, it could result in unbounded log file growth.

> +                qemu_co_mutex_unlock(&wps->colock);
> +                return -EINVAL;
> +            }
> +
> +            if (type & QEMU_AIO_ZONE_APPEND) {
> +                offset = *wp;
> +                *offset_ptr = offset;
> +            }
> +
> +            end_offset = offset + bytes;
> +            end_zone = (index + 1) * bs->bl.zone_size;
> +            if (end_offset > end_zone) {
> +                error_report("write exceeds zone boundary with end_offset "
> +                             "%" PRIu64 ", end_zone %" PRIu64 "",
> +                             end_offset, end_zone);

Same error_report() problem.

> +                qemu_co_mutex_unlock(&wps->colock);
> +                return -EINVAL;
> +            }
> +
> +            /* Advance the wp */
> +            *wp = end_offset;
>          }
> +        qemu_co_mutex_unlock(&bs->wps->colock);
>      }
>  #endif
>  
> @@ -2540,28 +2568,19 @@ out:
>  #if defined(CONFIG_BLKZONED)
>      if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) &&
>          bs->bl.zoned != BLK_Z_NONE) {
> -        BlockZoneWps *wps = bs->wps;
>          if (ret == 0) {
> -            uint64_t *wp = &wps->wp[offset / bs->bl.zone_size];
> -            if (!BDRV_ZT_IS_CONV(*wp)) {
> -                if (type & QEMU_AIO_ZONE_APPEND) {
> -                    *offset_ptr = *wp;
> -                    trace_zbd_zone_append_complete(bs, *offset_ptr
> -                        >> BDRV_SECTOR_BITS);
> -                }
> -                /* Advance the wp if needed */
> -                if (offset + bytes > *wp) {
> -                    *wp = offset + bytes;
> -                }
> +            if (type & QEMU_AIO_ZONE_APPEND) {
> +                trace_zbd_zone_append_complete(bs, *offset_ptr
> +                    >> BDRV_SECTOR_BITS);
>              }
>          } else {
> +            qemu_co_mutex_lock(&bs->wps->colock);
>              /*
>               * write and append write are not allowed to cross zone boundaries
>               */
>              update_zones_wp(bs, s->fd, offset, 1);
> +            qemu_co_mutex_unlock(&bs->wps->colock);
>          }
> -
> -        qemu_co_mutex_unlock(&wps->colock);
>      }
>  #endif
>      return ret;

Kevin




* Re: [PATCH v2] block/file-posix: optimize append write
  2024-10-18 14:37 ` Kevin Wolf
@ 2024-10-20  1:03   ` Damien Le Moal
  2024-10-21 11:08     ` Kevin Wolf
  2024-10-21 13:21   ` Sam Li
  1 sibling, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2024-10-20  1:03 UTC (permalink / raw)
  To: Kevin Wolf, Sam Li
  Cc: qemu-devel, stefanha, Hanna Reitz, dmitry.fomichev, qemu-block

On 10/18/24 23:37, Kevin Wolf wrote:
> On 04.10.2024 at 12:41, Sam Li wrote:
>> When the file-posix driver emulates append write, it holds the lock
>> whenever accessing wp, which limits the IO queue depth to one.
>>
>> The write IO flow can be optimized to allow concurrent writes. The lock
>> is held in two cases:
>> 1. Assumed that the write IO succeeds, update the wp before issuing the
>> write.
>> 2. If the write IO fails, report that zone and use the reported value
>> as the current wp.
> 
> What happens with the concurrent writes that started later and may not
> have completed yet? Can we really just reset to the reported value
> before all other requests have completed, too?

Yes, because if one write fails, we know that the following writes will fail too
as they will not be aligned to the write pointer. These subsequent failed writes
will again trigger the report zones and update, but that is fine. All of them
have failed and the report will give the same wp again.

This is a typical pattern with zoned block device: if one write fails in a zone,
the user has to expect failures for all other writes issued to the same zone, do
a report zone to get the wp and restart writing from there.
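
For reference, the recovery pattern described above boils down to something
like this on the user side (illustrative standalone snippet using the Linux
BLKREPORTZONE ioctl, not QEMU code; error handling trimmed):

#include <stdlib.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

/* After a failed write to a zone: re-read the zone's write pointer and
 * restart writing from there. */
static int refresh_zone_wp(int fd, uint64_t zone_start_sector,
                           uint64_t *wp_sector)
{
    size_t len = sizeof(struct blk_zone_report) + sizeof(struct blk_zone);
    struct blk_zone_report *rep = calloc(1, len);
    int ret = -1;

    if (!rep) {
        return -1;
    }
    rep->sector = zone_start_sector;
    rep->nr_zones = 1;

    if (ioctl(fd, BLKREPORTZONE, rep) == 0) {
        *wp_sector = rep->zones[0].wp;  /* resume writes at this sector */
        ret = 0;
    }
    free(rep);
    return ret;
}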

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v2] block/file-posix: optimize append write
  2024-10-20  1:03   ` Damien Le Moal
@ 2024-10-21 11:08     ` Kevin Wolf
  2024-10-21 12:32       ` Damien Le Moal
  0 siblings, 1 reply; 9+ messages in thread
From: Kevin Wolf @ 2024-10-21 11:08 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Sam Li, qemu-devel, stefanha, Hanna Reitz, dmitry.fomichev,
	qemu-block

On 20.10.2024 at 03:03, Damien Le Moal wrote:
> On 10/18/24 23:37, Kevin Wolf wrote:
> > On 04.10.2024 at 12:41, Sam Li wrote:
> >> When the file-posix driver emulates append write, it holds the lock
> >> whenever accessing wp, which limits the IO queue depth to one.
> >>
> >> The write IO flow can be optimized to allow concurrent writes. The lock
> >> is held in two cases:
> >> 1. Assumed that the write IO succeeds, update the wp before issuing the
> >> write.
> >> 2. If the write IO fails, report that zone and use the reported value
> >> as the current wp.
> > 
> > What happens with the concurrent writes that started later and may not
> > have completed yet? Can we really just reset to the reported value
> > before all other requests have completed, too?
> 
> Yes, because if one write fails, we know that the following writes
> will fail too as they will not be aligned to the write pointer. These
> subsequent failed writes will again trigger the report zones and
> update, but that is fine. All of them have failed and the report will
> give the same wp again.
> 
> This is a typical pattern with zoned block device: if one write fails
> in a zone, the user has to expect failures for all other writes issued
> to the same zone, do a report zone to get the wp and restart writing
> from there.

Ok, that makes sense. Can we be sure that requests are handled in the
order they were submitted, though? That is, if the failed request is
resubmitted, could the already pending next one still succeed if it's
overtaken by the resubmitted request? Not sure if this would even cause
a problem, but is it a case we have to consider?

Kevin




* Re: [PATCH v2] block/file-posix: optimize append write
  2024-10-21 11:08     ` Kevin Wolf
@ 2024-10-21 12:32       ` Damien Le Moal
  2024-10-21 18:13         ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2024-10-21 12:32 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Sam Li, qemu-devel, stefanha, Hanna Reitz, dmitry.fomichev,
	qemu-block

On 10/21/24 20:08, Kevin Wolf wrote:
> On 20.10.2024 at 03:03, Damien Le Moal wrote:
>> On 10/18/24 23:37, Kevin Wolf wrote:
>>> On 04.10.2024 at 12:41, Sam Li wrote:
>>>> When the file-posix driver emulates append write, it holds the lock
>>>> whenever accessing wp, which limits the IO queue depth to one.
>>>>
>>>> The write IO flow can be optimized to allow concurrent writes. The lock
>>>> is held in two cases:
>>>> 1. Assumed that the write IO succeeds, update the wp before issuing the
>>>> write.
>>>> 2. If the write IO fails, report that zone and use the reported value
>>>> as the current wp.
>>>
>>> What happens with the concurrent writes that started later and may not
>>> have completed yet? Can we really just reset to the reported value
>>> before all other requests have completed, too?
>>
>> Yes, because if one write fails, we know that the following writes
>> will fail too as they will not be aligned to the write pointer. These
>> subsequent failed writes will again trigger the report zones and
>> update, but that is fine. All of them have failed and the report will
>> give the same wp again.
>>
>> This is a typical pattern with zoned block device: if one write fails
>> in a zone, the user has to expect failures for all other writes issued
>> to the same zone, do a report zone to get the wp and restart writing
>> from there.
> 
> Ok, that makes sense. Can we be sure that requests are handled in the
> order they were submitted, though? That is, if the failed request is
> resubmitted, could the already pending next one still succeed if it's
> overtaken by the resubmitted request? Not sure if this would even cause
> a problem, but is it a case we have to consider?

A zoned device will always handle writes in the order they were submitted (per
zone) and that is true for emulated devices as well as real ones. The
completions may not be seen in order though, but that is fine.
So what you are saying above is not a problem. The resubmitted failed write will
go after the ones already submitted (and about to be failed) and may succeed if
it is aligned to the wp, or fail. Whichever will happen only after all the
already submitted writes have failed.

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v2] block/file-posix: optimize append write
  2024-10-18 14:37 ` Kevin Wolf
  2024-10-20  1:03   ` Damien Le Moal
@ 2024-10-21 13:21   ` Sam Li
  2024-10-21 22:11     ` Kevin Wolf
  1 sibling, 1 reply; 9+ messages in thread
From: Sam Li @ 2024-10-21 13:21 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, stefanha, Hanna Reitz, dmitry.fomichev, qemu-block,
	dlemoal

Kevin Wolf <kwolf@redhat.com> wrote on Fri, Oct 18, 2024 at 16:37:
>
> On 04.10.2024 at 12:41, Sam Li wrote:
> > When the file-posix driver emulates append write, it holds the lock
> > whenever accessing wp, which limits the IO queue depth to one.
> >
> > The write IO flow can be optimized to allow concurrent writes. The lock
> > is held in two cases:
> > 1. Assumed that the write IO succeeds, update the wp before issuing the
> > write.
> > 2. If the write IO fails, report that zone and use the reported value
> > as the current wp.
>
> What happens with the concurrent writes that started later and may not
> have completed yet? Can we really just reset to the reported value
> before all other requests have completed, too?
>
> > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > ---
> >  block/file-posix.c | 57 ++++++++++++++++++++++++++++++----------------
> >  1 file changed, 38 insertions(+), 19 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 90fa54352c..a65a23cb2c 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -2482,18 +2482,46 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
> >      BDRVRawState *s = bs->opaque;
> >      RawPosixAIOData acb;
> >      int ret;
> > -    uint64_t offset = *offset_ptr;
> > +    uint64_t end_offset, end_zone, offset = *offset_ptr;
> > +    uint64_t *wp;
>
> Without CONFIG_BLKZONED, these are unused variables and break the build.
>
> They are only used in the first CONFIG_BLKZONED block, so you could just
> declare them locally there.

Thanks! Will do.

>
> >
> >      if (fd_open(bs) < 0)
> >          return -EIO;
> >  #if defined(CONFIG_BLKZONED)
> >      if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) &&
> >          bs->bl.zoned != BLK_Z_NONE) {
> > -        qemu_co_mutex_lock(&bs->wps->colock);
> > -        if (type & QEMU_AIO_ZONE_APPEND) {
> > -            int index = offset / bs->bl.zone_size;
> > -            offset = bs->wps->wp[index];
> > +        BlockZoneWps *wps = bs->wps;
> > +        int index = offset / bs->bl.zone_size;
> > +
> > +        qemu_co_mutex_lock(&wps->colock);
>
> This is preexisting, but what is the reason for using coroutine locks
> here? There doesn't seem to be any code running under the lock that can
> yield, so a normal mutex might be more efficient.

Coroutine locks are used to avoid blocking in coroutines: QemuMutex
blocks the thread when the lock is already held instead of yielding.

>
> Hm... Looking a bit closer, get_zones_wp() could probably be a
> coroutine_fn and call hdev_co_ioctl() instead of calling ioctl()
> directly in order to avoid blocking.
>
> > +        wp = &wps->wp[index];
>
> Also preexisting, but who guarantees that index is within the bounds? A
> stacked block driver may try to grow the file size.

Right. It needs to be checked if it's over nr_zones.
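
Something like the following guard, perhaps (sketch only, assuming
bs->bl.nr_zones holds the number of zones of the device):

        int index = offset / bs->bl.zone_size;

        qemu_co_mutex_lock(&wps->colock);
        if (index >= bs->bl.nr_zones) {
            qemu_co_mutex_unlock(&wps->colock);
            return -EINVAL;
        }
        wp = &wps->wp[index];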

>
> > +        if (!BDRV_ZT_IS_CONV(*wp)) {
> > +            if (type & QEMU_AIO_WRITE && offset != *wp) {
> > +                error_report("write offset 0x%" PRIx64 " is not equal to the wp"
> > +                             " of zone[%d] 0x%" PRIx64 "", offset, index, *wp);
>
> We can't error_report() in an I/O path that can be triggered by the
> guest, it could result in unbounded log file growth.

Those error messages show up in the error path and are useful for debugging
zoned device emulation.

I was wondering if there is a better approach to print errors? Use
error_report_once() to reduce the log?


Sam

>
> > +                qemu_co_mutex_unlock(&wps->colock);
> > +                return -EINVAL;
> > +            }
> > +
> > +            if (type & QEMU_AIO_ZONE_APPEND) {
> > +                offset = *wp;
> > +                *offset_ptr = offset;
> > +            }
> > +
> > +            end_offset = offset + bytes;
> > +            end_zone = (index + 1) * bs->bl.zone_size;
> > +            if (end_offset > end_zone) {
> > +                error_report("write exceeds zone boundary with end_offset "
> > +                             "%" PRIu64 ", end_zone %" PRIu64 "",
> > +                             end_offset, end_zone);
>
> Same error_report() problem.
>
> > +                qemu_co_mutex_unlock(&wps->colock);
> > +                return -EINVAL;
> > +            }
> > +
> > +            /* Advance the wp */
> > +            *wp = end_offset;
> >          }
> > +        qemu_co_mutex_unlock(&bs->wps->colock);
> >      }
> >  #endif
> >
> > @@ -2540,28 +2568,19 @@ out:
> >  #if defined(CONFIG_BLKZONED)
> >      if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) &&
> >          bs->bl.zoned != BLK_Z_NONE) {
> > -        BlockZoneWps *wps = bs->wps;
> >          if (ret == 0) {
> > -            uint64_t *wp = &wps->wp[offset / bs->bl.zone_size];
> > -            if (!BDRV_ZT_IS_CONV(*wp)) {
> > -                if (type & QEMU_AIO_ZONE_APPEND) {
> > -                    *offset_ptr = *wp;
> > -                    trace_zbd_zone_append_complete(bs, *offset_ptr
> > -                        >> BDRV_SECTOR_BITS);
> > -                }
> > -                /* Advance the wp if needed */
> > -                if (offset + bytes > *wp) {
> > -                    *wp = offset + bytes;
> > -                }
> > +            if (type & QEMU_AIO_ZONE_APPEND) {
> > +                trace_zbd_zone_append_complete(bs, *offset_ptr
> > +                    >> BDRV_SECTOR_BITS);
> >              }
> >          } else {
> > +            qemu_co_mutex_lock(&bs->wps->colock);
> >              /*
> >               * write and append write are not allowed to cross zone boundaries
> >               */
> >              update_zones_wp(bs, s->fd, offset, 1);
> > +            qemu_co_mutex_unlock(&bs->wps->colock);
> >          }
> > -
> > -        qemu_co_mutex_unlock(&wps->colock);
> >      }
> >  #endif
> >      return ret;
>
> Kevin
>



* Re: [PATCH v2] block/file-posix: optimize append write
  2024-10-21 12:32       ` Damien Le Moal
@ 2024-10-21 18:13         ` Stefan Hajnoczi
  2024-10-22  1:56           ` Damien Le Moal
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2024-10-21 18:13 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Kevin Wolf, Sam Li, qemu-devel, Hanna Reitz, dmitry.fomichev,
	qemu-block

On Mon, Oct 21, 2024 at 09:32:50PM +0900, Damien Le Moal wrote:
> On 10/21/24 20:08, Kevin Wolf wrote:
> > On 20.10.2024 at 03:03, Damien Le Moal wrote:
> >> On 10/18/24 23:37, Kevin Wolf wrote:
> >>> On 04.10.2024 at 12:41, Sam Li wrote:
> >>>> When the file-posix driver emulates append write, it holds the lock
> >>>> whenever accessing wp, which limits the IO queue depth to one.
> >>>>
> >>>> The write IO flow can be optimized to allow concurrent writes. The lock
> >>>> is held in two cases:
> >>>> 1. Assumed that the write IO succeeds, update the wp before issuing the
> >>>> write.
> >>>> 2. If the write IO fails, report that zone and use the reported value
> >>>> as the current wp.
> >>>
> >>> What happens with the concurrent writes that started later and may not
> >>> have completed yet? Can we really just reset to the reported value
> >>> before all other requests have completed, too?
> >>
> >> Yes, because if one write fails, we know that the following writes
> >> will fail too as they will not be aligned to the write pointer. These
> >> subsequent failed writes will again trigger the report zones and
> >> update, but that is fine. All of them have failed and the report will
> >> give the same wp again.
> >>
> >> This is a typical pattern with zoned block device: if one write fails
> >> in a zone, the user has to expect failures for all other writes issued
> >> to the same zone, do a report zone to get the wp and restart writing
> >> from there.
> > 
> > Ok, that makes sense. Can we be sure that requests are handled in the
> > order they were submitted, though? That is, if the failed request is
> > resubmitted, could the already pending next one still succeed if it's
> > overtaken by the resubmitted request? Not sure if this would even cause
> > a problem, but is it a case we have to consider?
> 
> A zoned device will always handle writes in the order they were submitted (per
> zone) and that is true for emulated devices as well as real ones.

Is there serialization code in the kernel so that zoned devices behind
multi-path keep requests ordered?

Normally I don't assume any ordering between concurrent requests to a
block device, so I'm surprised that it's safe to submit multiple writes.

Stefan



* Re: [PATCH v2] block/file-posix: optimize append write
  2024-10-21 13:21   ` Sam Li
@ 2024-10-21 22:11     ` Kevin Wolf
  0 siblings, 0 replies; 9+ messages in thread
From: Kevin Wolf @ 2024-10-21 22:11 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, stefanha, Hanna Reitz, dmitry.fomichev, qemu-block,
	dlemoal

On 21.10.2024 at 15:21, Sam Li wrote:
> Kevin Wolf <kwolf@redhat.com> wrote on Fri, Oct 18, 2024 at 16:37:
> >
> > On 04.10.2024 at 12:41, Sam Li wrote:
> > > When the file-posix driver emulates append write, it holds the lock
> > > whenever accessing wp, which limits the IO queue depth to one.
> > >
> > > The write IO flow can be optimized to allow concurrent writes. The lock
> > > is held in two cases:
> > > 1. Assumed that the write IO succeeds, update the wp before issuing the
> > > write.
> > > 2. If the write IO fails, report that zone and use the reported value
> > > as the current wp.
> >
> > What happens with the concurrent writes that started later and may not
> > have completed yet? Can we really just reset to the reported value
> > before all other requests have completed, too?
> >
> > > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > > ---
> > >  block/file-posix.c | 57 ++++++++++++++++++++++++++++++----------------
> > >  1 file changed, 38 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/block/file-posix.c b/block/file-posix.c
> > > index 90fa54352c..a65a23cb2c 100644
> > > --- a/block/file-posix.c
> > > +++ b/block/file-posix.c
> > > @@ -2482,18 +2482,46 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
> > >      BDRVRawState *s = bs->opaque;
> > >      RawPosixAIOData acb;
> > >      int ret;
> > > -    uint64_t offset = *offset_ptr;
> > > +    uint64_t end_offset, end_zone, offset = *offset_ptr;
> > > +    uint64_t *wp;
> >
> > Without CONFIG_BLKZONED, these are unused variables and break the build.
> >
> > They are only used in the first CONFIG_BLKZONED block, so you could just
> > declare them locally there.
> 
> Thanks! Will do.
> 
> >
> > >
> > >      if (fd_open(bs) < 0)
> > >          return -EIO;
> > >  #if defined(CONFIG_BLKZONED)
> > >      if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) &&
> > >          bs->bl.zoned != BLK_Z_NONE) {
> > > -        qemu_co_mutex_lock(&bs->wps->colock);
> > > -        if (type & QEMU_AIO_ZONE_APPEND) {
> > > -            int index = offset / bs->bl.zone_size;
> > > -            offset = bs->wps->wp[index];
> > > +        BlockZoneWps *wps = bs->wps;
> > > +        int index = offset / bs->bl.zone_size;
> > > +
> > > +        qemu_co_mutex_lock(&wps->colock);
> >
> > This is preexisting, but what is the reason for using coroutine locks
> > here? There doesn't seem to be any code running under the lock that can
> > yield, so a normal mutex might be more efficient.
> 
> Coroutine locks are used to avoid blocking in coroutines: QemuMutex
> blocks the thread when the lock is already held instead of yielding.

Right, but usually you only have to wait for a very short time until the
mutex is released again, and CoMutexes are more expensive in that case.

You absolutely do need to use CoMutex when the coroutine can yield in
the critical section, but if it can't, the CoMutex is often worse.
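
As a contrived illustration of the difference (names and context invented,
init calls omitted; both lock types are existing QEMU primitives):

/* Short, non-yielding critical section: a plain QemuMutex is enough and
 * cheaper, a waiter just blocks the thread briefly. */
static QemuMutex wp_lock;

static void advance_wp(uint64_t *wp, uint64_t end_offset)
{
    qemu_mutex_lock(&wp_lock);
    *wp = end_offset;               /* nothing in here can yield */
    qemu_mutex_unlock(&wp_lock);
}

/* If the critical section can yield (e.g. it awaits a coroutine ioctl),
 * a CoMutex is needed so a waiter yields instead of blocking the thread. */
static CoMutex wp_co_lock;

static void coroutine_fn refresh_wp(void)
{
    qemu_co_mutex_lock(&wp_co_lock);
    /* ... coroutine I/O that may yield, e.g. a zone report ... */
    qemu_co_mutex_unlock(&wp_co_lock);
}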

Though as I said here...

> > Hm... Looking a bit closer, get_zones_wp() could probably be a
> > coroutine_fn and call hdev_co_ioctl() instead of calling ioctl()
> > directly in order to avoid blocking.

...we should probably use a coroutine version of ioctl() instead of
the blocking one, and then you do need the CoMutex.

> > > +        wp = &wps->wp[index];
> >
> > Also preexisting, but who guarantees that index is within the bounds? A
> > stacked block driver may try to grow the file size.
> 
> Right. It needs to be checked if it's over nr_zones.

Can you send a separate fix for this, please? (Can be both in one
patch series, though.)

> >
> > > +        if (!BDRV_ZT_IS_CONV(*wp)) {
> > > +            if (type & QEMU_AIO_WRITE && offset != *wp) {
> > > +                error_report("write offset 0x%" PRIx64 " is not equal to the wp"
> > > +                             " of zone[%d] 0x%" PRIx64 "", offset, index, *wp);
> >
> > We can't error_report() in an I/O path that can be triggered by the
> > guest, it could result in unbounded log file growth.
> 
> Those error messages show up in the error path and are useful for debugging
> zoned device emulation.
> 
> I was wondering if there is a better approach to print errors? Use
> error_report_once() to reduce the log?

If it's for debugging, I think trace points are best.
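
For example, the two error_report() calls could become trace points roughly
like this (event names are made up):

In block/trace-events:

zbd_wp_mismatch(uint64_t offset, int index, uint64_t wp) "write offset 0x%" PRIx64 " != wp of zone[%d] 0x%" PRIx64
zbd_zone_boundary_exceeded(uint64_t end_offset, uint64_t end_zone) "end_offset %" PRIu64 " > end_zone %" PRIu64

and in raw_co_prw():

            if (type & QEMU_AIO_WRITE && offset != *wp) {
                trace_zbd_wp_mismatch(offset, index, *wp);
                qemu_co_mutex_unlock(&wps->colock);
                return -EINVAL;
            }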

Kevin




* Re: [PATCH v2] block/file-posix: optimize append write
  2024-10-21 18:13         ` Stefan Hajnoczi
@ 2024-10-22  1:56           ` Damien Le Moal
  0 siblings, 0 replies; 9+ messages in thread
From: Damien Le Moal @ 2024-10-22  1:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Sam Li, qemu-devel, Hanna Reitz, dmitry.fomichev,
	qemu-block

On 10/22/24 03:13, Stefan Hajnoczi wrote:
> On Mon, Oct 21, 2024 at 09:32:50PM +0900, Damien Le Moal wrote:
>> On 10/21/24 20:08, Kevin Wolf wrote:
>>> On 20.10.2024 at 03:03, Damien Le Moal wrote:
>>>> On 10/18/24 23:37, Kevin Wolf wrote:
>>>>> On 04.10.2024 at 12:41, Sam Li wrote:
>>>>>> When the file-posix driver emulates append write, it holds the lock
>>>>>> whenever accessing wp, which limits the IO queue depth to one.
>>>>>>
>>>>>> The write IO flow can be optimized to allow concurrent writes. The lock
>>>>>> is held in two cases:
>>>>>> 1. Assumed that the write IO succeeds, update the wp before issuing the
>>>>>> write.
>>>>>> 2. If the write IO fails, report that zone and use the reported value
>>>>>> as the current wp.
>>>>>
>>>>> What happens with the concurrent writes that started later and may not
>>>>> have completed yet? Can we really just reset to the reported value
>>>>> before all other requests have completed, too?
>>>>
>>>> Yes, because if one write fails, we know that the following writes
>>>> will fail too as they will not be aligned to the write pointer. These
>>>> subsequent failed writes will again trigger the report zones and
>>>> update, but that is fine. All of them have failed and the report will
>>>> give the same wp again.
>>>>
>>>> This is a typical pattern with zoned block device: if one write fails
>>>> in a zone, the user has to expect failures for all other writes issued
>>>> to the same zone, do a report zone to get the wp and restart writing
>>>> from there.
>>>
>>> Ok, that makes sense. Can we be sure that requests are handled in the
>>> order they were submitted, though? That is, if the failed request is
>>> resubmitted, could the already pending next one still succeed if it's
>>> overtaken by the resubmitted request? Not sure if this would even cause
>>> a problem, but is it a case we have to consider?
>>
>> A zoned device will always handle writes in the order they were submitted (per
>> zone) and that is true for emulated devices as well as real ones.
> 
> Is there serialization code in the kernel so that zoned devices behind
> multi-path keep requests ordered?

Yes: the kernel only issues at most one write per zone at any time, to preserve
ordering. So there should be no issues at all.

> Normally I don't assume any ordering between concurrent requests to a
> block device, so I'm surprised that it's safe to submit multiple writes.

Correct, the normal case does not provide any guarantees. But writes to zoned
block devices are a special case. More on this here:

https://zonedstorage.io/docs/linux/sched


-- 
Damien Le Moal
Western Digital Research


