* [PATCH v2 0/2] block: discard alignment fixes
@ 2025-04-10 18:41 Stefan Hajnoczi
  2025-04-10 18:41 ` [PATCH v2 1/2] file-posix: probe discard alignment on Linux block devices Stefan Hajnoczi
  2025-04-10 18:41 ` [PATCH v2 2/2] block/io: skip head/tail requests on EINVAL Stefan Hajnoczi
  0 siblings, 2 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2025-04-10 18:41 UTC
  To: qemu-devel
  Cc: Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi, Fam Zheng, qemu-block

v2:
- Fix inverted logic in alignment check [Qing Wang]

Two discard alignment issues were identified in
https://issues.redhat.com/browse/RHEL-86032:

1. pdiscard_alignment is not populated for host_device in file-posix.c.
2. Misaligned head/tail discard requests are not skipped when file-posix.c
   returns -EINVAL. This causes an undesired pause when guests are configured
   with werror=stop.

Stefan Hajnoczi (2):
  file-posix: probe discard alignment on Linux block devices
  block/io: skip head/tail requests on EINVAL

 block/file-posix.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
 block/io.c         |  6 ++++-
 2 files changed, 60 insertions(+), 2 deletions(-)

-- 
2.49.0

* [PATCH v2 1/2] file-posix: probe discard alignment on Linux block devices
  2025-04-10 18:41 [PATCH v2 0/2] block: discard alignment fixes Stefan Hajnoczi
@ 2025-04-10 18:41 ` Stefan Hajnoczi
  2025-04-11 8:15   ` Hanna Czenczek
  2025-04-10 18:41 ` [PATCH v2 2/2] block/io: skip head/tail requests on EINVAL Stefan Hajnoczi
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2025-04-10 18:41 UTC
  To: qemu-devel
  Cc: Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi, Fam Zheng, qemu-block

Populate the pdiscard_alignment block limit so the block layer is able to
align discard requests correctly.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/file-posix.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 56d1972d15..2a1e1f48c0 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1276,10 +1276,10 @@ static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
 }
 #endif /* defined(CONFIG_BLKZONED) */

+#ifdef CONFIG_LINUX
 /*
  * Get a sysfs attribute value as a long integer.
  */
-#ifdef CONFIG_LINUX
 static long get_sysfs_long_val(struct stat *st, const char *attribute)
 {
     g_autofree char *str = NULL;
@@ -1299,6 +1299,30 @@ static long get_sysfs_long_val(struct stat *st, const char *attribute)
     }
     return ret;
 }
+
+/*
+ * Get a sysfs attribute value as a uint32_t.
+ */
+static int get_sysfs_u32_val(struct stat *st, const char *attribute,
+                             uint32_t *u32)
+{
+    g_autofree char *str = NULL;
+    const char *end;
+    unsigned int val;
+    int ret;
+
+    ret = get_sysfs_str_val(st, attribute, &str);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* The file is ended with '\n', pass 'end' to accept that. */
+    ret = qemu_strtoui(str, &end, 10, &val);
+    if (ret == 0 && end && *end == '\0') {
+        *u32 = val;
+    }
+    return ret;
+}
 #endif

 static int hdev_get_max_segments(int fd, struct stat *st)
@@ -1318,6 +1342,23 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 #endif
 }

+/*
+ * Fills in *dalign with the discard alignment and returns 0 on success,
+ * -errno otherwise.
+ */
+static int hdev_get_pdiscard_alignment(struct stat *st, uint32_t *dalign)
+{
+#ifdef CONFIG_LINUX
+    /*
+     * Note that Linux "discard_granularity" is QEMU "discard_alignment". Linux
+     * "discard_alignment" is something else.
+     */
+    return get_sysfs_u32_val(st, "discard_granularity", dalign);
+#else
+    return -ENOTSUP;
+#endif
+}
+
 #if defined(CONFIG_BLKZONED)
 /*
  * If the reset_all flag is true, then the wps of zone whose state is
@@ -1527,6 +1568,19 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
         }
     }

+    if (S_ISBLK(st.st_mode)) {
+        uint32_t dalign = 0;
+        int ret;
+
+        ret = hdev_get_pdiscard_alignment(&st, &dalign);
+        if (ret == 0) {
+            /* Must be a multiple of request_alignment */
+            assert(dalign % bs->bl.request_alignment == 0);
+
+            bs->bl.pdiscard_alignment = dalign;
+        }
+    }
+
     raw_refresh_zoned_limits(bs, &st, errp);
 }

-- 
2.49.0

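The parsing step in get_sysfs_u32_val() can be illustrated outside of QEMU. The following is only a minimal sketch using plain strtoul() rather than QEMU's qemu_strtoui(); the name parse_u32_attr() is hypothetical, and unlike the patch (which relies on the string helper having dealt with the trailing newline) this sketch accepts one trailing '\n' itself.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of QEMU): parse the text of a sysfs
 * attribute such as "4096\n" into a uint32_t.
 *
 * Returns 0 on success and a negative errno value otherwise. *out is only
 * written on success.
 */
static int parse_u32_attr(const char *str, uint32_t *out)
{
    char *end = NULL;
    unsigned long val;

    assert(out != NULL);

    /* strtoul() would accept whitespace and a sign, so insist on a digit */
    if (str == NULL || !isdigit((unsigned char)*str)) {
        return -EINVAL;
    }

    errno = 0;
    val = strtoul(str, &end, 10);
    if (errno == ERANGE || val > UINT32_MAX) {
        return -ERANGE;
    }

    /* sysfs attributes normally end with a single newline */
    if (*end == '\n') {
        end++;
    }
    if (*end != '\0') {
        return -EINVAL; /* trailing garbage */
    }

    *out = (uint32_t)val;
    return 0;
}
```

A caller would read the contents of the device's queue/discard_granularity attribute into a buffer and pass it to this function; how the sysfs path is derived from the struct stat is left to the existing QEMU helpers and is not shown here.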
* Re: [PATCH v2 1/2] file-posix: probe discard alignment on Linux block devices
  2025-04-10 18:41 ` [PATCH v2 1/2] file-posix: probe discard alignment on Linux block devices Stefan Hajnoczi
@ 2025-04-11 8:15   ` Hanna Czenczek
  2025-04-14 15:34     ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Hanna Czenczek @ 2025-04-11 8:15 UTC
  To: Stefan Hajnoczi, qemu-devel; +Cc: Kevin Wolf, Fam Zheng, qemu-block

On 10.04.25 20:41, Stefan Hajnoczi wrote:
> Populate the pdiscard_alignment block limit so the block layer is able to
> align discard requests correctly.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block/file-posix.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 55 insertions(+), 1 deletion(-)

Ah, I didn’t know sysfs is actually fair game. Should we not also get the
maximum discard length then, too?

> diff --git a/block/file-posix.c b/block/file-posix.c
> index 56d1972d15..2a1e1f48c0 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1276,10 +1276,10 @@ static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
>  }
>  #endif /* defined(CONFIG_BLKZONED) */
>
> +#ifdef CONFIG_LINUX
>  /*
>   * Get a sysfs attribute value as a long integer.
>   */
> -#ifdef CONFIG_LINUX
>  static long get_sysfs_long_val(struct stat *st, const char *attribute)
>  {
>      g_autofree char *str = NULL;
> @@ -1299,6 +1299,30 @@ static long get_sysfs_long_val(struct stat *st, const char *attribute)
>      }
>      return ret;
>  }
> +
> +/*
> + * Get a sysfs attribute value as a uint32_t.
> + */
> +static int get_sysfs_u32_val(struct stat *st, const char *attribute,
> +                             uint32_t *u32)
> +{
> +    g_autofree char *str = NULL;
> +    const char *end;
> +    unsigned int val;
> +    int ret;
> +
> +    ret = get_sysfs_str_val(st, attribute, &str);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    /* The file is ended with '\n', pass 'end' to accept that. */
> +    ret = qemu_strtoui(str, &end, 10, &val);
> +    if (ret == 0 && end && *end == '\0') {
> +        *u32 = val;
> +    }
> +    return ret;
> +}
>  #endif
>
>  static int hdev_get_max_segments(int fd, struct stat *st)
> @@ -1318,6 +1342,23 @@ static int hdev_get_max_segments(int fd, struct stat *st)
>  #endif
>  }
>
> +/*
> + * Fills in *dalign with the discard alignment and returns 0 on success,
> + * -errno otherwise.
> + */
> +static int hdev_get_pdiscard_alignment(struct stat *st, uint32_t *dalign)
> +{
> +#ifdef CONFIG_LINUX
> +    /*
> +     * Note that Linux "discard_granularity" is QEMU "discard_alignment". Linux
> +     * "discard_alignment" is something else.
> +     */
> +    return get_sysfs_u32_val(st, "discard_granularity", dalign);
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
>  #if defined(CONFIG_BLKZONED)
>  /*
>   * If the reset_all flag is true, then the wps of zone whose state is
> @@ -1527,6 +1568,19 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>          }
>      }
>
> +    if (S_ISBLK(st.st_mode)) {
> +        uint32_t dalign = 0;
> +        int ret;
> +
> +        ret = hdev_get_pdiscard_alignment(&st, &dalign);
> +        if (ret == 0) {
> +            /* Must be a multiple of request_alignment */
> +            assert(dalign % bs->bl.request_alignment == 0);

Is it fair to crash qemu if the kernel reports a value that is not a
multiple of request_alignment? Wouldn’t it make more sense to take the
maximum, and if that still isn’t a multiple, return an error here?

Hanna

> +
> +            bs->bl.pdiscard_alignment = dalign;
> +        }
> +    }
> +
>      raw_refresh_zoned_limits(bs, &st, errp);
>  }
>

* Re: [PATCH v2 1/2] file-posix: probe discard alignment on Linux block devices
  2025-04-11 8:15   ` Hanna Czenczek
@ 2025-04-14 15:34     ` Stefan Hajnoczi
  2025-04-16 10:43       ` Kevin Wolf
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2025-04-14 15:34 UTC
  To: Hanna Czenczek; +Cc: qemu-devel, Kevin Wolf, Fam Zheng, qemu-block

On Fri, Apr 11, 2025 at 10:15:13AM +0200, Hanna Czenczek wrote:
> On 10.04.25 20:41, Stefan Hajnoczi wrote:
> > Populate the pdiscard_alignment block limit so the block layer is able to
> > align discard requests correctly.
> >
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >   block/file-posix.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 55 insertions(+), 1 deletion(-)
>
> Ah, I didn’t know sysfs is actually fair game. Should we not also get the
> maximum discard length then, too?

The maximum discard length behaves differently: the Linux block layer
splits requests according to the maximum discard length. If the guest
submits a discard request that is too large for the host, the host block
layer will split it and the request succeeds. That is why I didn't make
any changes to the maximum discard length in this series.

>
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 56d1972d15..2a1e1f48c0 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1276,10 +1276,10 @@ static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
> >  }
> >  #endif /* defined(CONFIG_BLKZONED) */
> > +#ifdef CONFIG_LINUX
> >  /*
> >   * Get a sysfs attribute value as a long integer.
> >   */
> > -#ifdef CONFIG_LINUX
> >  static long get_sysfs_long_val(struct stat *st, const char *attribute)
> >  {
> >      g_autofree char *str = NULL;
> > @@ -1299,6 +1299,30 @@ static long get_sysfs_long_val(struct stat *st, const char *attribute)
> >      }
> >      return ret;
> >  }
> > +
> > +/*
> > + * Get a sysfs attribute value as a uint32_t.
> > + */
> > +static int get_sysfs_u32_val(struct stat *st, const char *attribute,
> > +                             uint32_t *u32)
> > +{
> > +    g_autofree char *str = NULL;
> > +    const char *end;
> > +    unsigned int val;
> > +    int ret;
> > +
> > +    ret = get_sysfs_str_val(st, attribute, &str);
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +
> > +    /* The file is ended with '\n', pass 'end' to accept that. */
> > +    ret = qemu_strtoui(str, &end, 10, &val);
> > +    if (ret == 0 && end && *end == '\0') {
> > +        *u32 = val;
> > +    }
> > +    return ret;
> > +}
> >  #endif
> >  static int hdev_get_max_segments(int fd, struct stat *st)
> > @@ -1318,6 +1342,23 @@ static int hdev_get_max_segments(int fd, struct stat *st)
> >  #endif
> >  }
> > +/*
> > + * Fills in *dalign with the discard alignment and returns 0 on success,
> > + * -errno otherwise.
> > + */
> > +static int hdev_get_pdiscard_alignment(struct stat *st, uint32_t *dalign)
> > +{
> > +#ifdef CONFIG_LINUX
> > +    /*
> > +     * Note that Linux "discard_granularity" is QEMU "discard_alignment". Linux
> > +     * "discard_alignment" is something else.
> > +     */
> > +    return get_sysfs_u32_val(st, "discard_granularity", dalign);
> > +#else
> > +    return -ENOTSUP;
> > +#endif
> > +}
> > +
> >  #if defined(CONFIG_BLKZONED)
> >  /*
> >   * If the reset_all flag is true, then the wps of zone whose state is
> > @@ -1527,6 +1568,19 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> >          }
> >      }
> > +    if (S_ISBLK(st.st_mode)) {
> > +        uint32_t dalign = 0;
> > +        int ret;
> > +
> > +        ret = hdev_get_pdiscard_alignment(&st, &dalign);
> > +        if (ret == 0) {
> > +            /* Must be a multiple of request_alignment */
> > +            assert(dalign % bs->bl.request_alignment == 0);
>
> Is it fair to crash qemu if the kernel reports a value that is not a
> multiple of request_alignment? Wouldn’t it make more sense to take the
> maximum, and if that still isn’t a multiple, return an error here?

I'll replace the assertion with an error. The Linux block layer sysfs
documentation says:

  [RO] Devices that support discard functionality may internally
  allocate space using units that are bigger than the logical block
  size.

I don't expect dalign to be smaller than request_alignment, but it
doesn't hurt to check whether request_alignment would work.

>
> Hanna
>
> > +
> > +            bs->bl.pdiscard_alignment = dalign;
> > +        }
> > +    }
> > +
> >      raw_refresh_zoned_limits(bs, &st, errp);
> >  }
>

* Re: [PATCH v2 1/2] file-posix: probe discard alignment on Linux block devices
  2025-04-14 15:34     ` Stefan Hajnoczi
@ 2025-04-16 10:43       ` Kevin Wolf
  0 siblings, 0 replies; 9+ messages in thread
From: Kevin Wolf @ 2025-04-16 10:43 UTC
  To: Stefan Hajnoczi; +Cc: Hanna Czenczek, qemu-devel, Fam Zheng, qemu-block

On 14.04.2025 at 17:34, Stefan Hajnoczi wrote:
> On Fri, Apr 11, 2025 at 10:15:13AM +0200, Hanna Czenczek wrote:
> > On 10.04.25 20:41, Stefan Hajnoczi wrote:
> > > Populate the pdiscard_alignment block limit so the block layer is able to
> > > align discard requests correctly.
> > >
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > >   block/file-posix.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
> > >   1 file changed, 55 insertions(+), 1 deletion(-)
> >
> > Ah, I didn’t know sysfs is actually fair game. Should we not also get the
> > maximum discard length then, too?
>
> The maximum discard length behaves differently: the Linux block layer
> splits requests according to the maximum discard length. If the guest
> submits a discard request that is too large for the host, the host block
> layer will split it and the request succeeds. That is why I didn't make
> any changes to the maximum discard length in this series.

Do we need to do something with it for SCSI passthrough? Similar to how
we expose bs->bl.max_hw_transfer/iov in the block limits VPD page?

Kevin

* [PATCH v2 2/2] block/io: skip head/tail requests on EINVAL
  2025-04-10 18:41 [PATCH v2 0/2] block: discard alignment fixes Stefan Hajnoczi
  2025-04-10 18:41 ` [PATCH v2 1/2] file-posix: probe discard alignment on Linux block devices Stefan Hajnoczi
@ 2025-04-10 18:41 ` Stefan Hajnoczi
  2025-04-11 8:18   ` Hanna Czenczek
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2025-04-10 18:41 UTC
  To: qemu-devel
  Cc: Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi, Fam Zheng, qemu-block

When guests send misaligned discard requests, the block layer breaks
them up into a misaligned head, an aligned main body, and a misaligned
tail.

The file-posix block driver on Linux returns -EINVAL on misaligned
discard requests. This causes bdrv_co_pdiscard() to fail and guests
configured with werror=stop will pause.

Add a special case for misaligned head/tail requests. Simply continue
when EINVAL is encountered so that the aligned main body of the request
can be completed and the guest is not paused. This is the best we can do
when guest discard limits do not match the host discard limits.

Fixes: https://issues.redhat.com/browse/RHEL-86032
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 1ba8d1aeea..a0d0b31a3e 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3180,7 +3180,11 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
             }
         }
         if (ret && ret != -ENOTSUP) {
-            goto out;
+            if (ret == -EINVAL && (offset % align != 0 || num % align != 0)) {
+                /* Silently skip rejected unaligned head/tail requests */
+            } else {
+                goto out; /* bail out */
+            }
         }

         offset += num;
-- 
2.49.0

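The head/body/tail behaviour described in the commit message above can be sketched in isolation. The code below is a simplified stand-in for the loop in bdrv_co_pdiscard(), not a copy of it: discard_with_skips(), discard_fn, and the strict_backend test double are all hypothetical names, and limits such as the maximum discard size are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical backend callback: returns 0 or a negative errno value */
typedef int (*discard_fn)(int64_t offset, int64_t bytes, void *opaque);

/*
 * Split [offset, offset + bytes) into a misaligned head, an aligned body and
 * a misaligned tail. A misaligned piece that the backend rejects with
 * -EINVAL is skipped; any other error aborts the request. -ENOTSUP is
 * ignored, as in the quoted hunk.
 */
static int discard_with_skips(int64_t offset, int64_t bytes, int64_t align,
                              discard_fn fn, void *opaque)
{
    assert(align > 0 && offset >= 0 && bytes >= 0);

    while (bytes > 0) {
        int64_t num = bytes;
        int64_t head = offset % align;
        int ret;

        if (head != 0) {
            /* Misaligned head: stop at the next alignment boundary */
            if (num > align - head) {
                num = align - head;
            }
        } else if (num >= align) {
            /* Aligned body: round down to a multiple of align */
            num -= num % align;
        }
        /* Otherwise this is a tail shorter than align */

        ret = fn(offset, num, opaque);
        if (ret && ret != -ENOTSUP) {
            if (ret == -EINVAL && (offset % align != 0 || num % align != 0)) {
                /* Silently skip rejected unaligned head/tail requests */
            } else {
                return ret;
            }
        }

        offset += num;
        bytes -= num;
    }
    return 0;
}

/* Test double that rejects misaligned requests, like a strict host device */
struct strict_backend {
    int64_t align;
    int64_t discarded; /* total bytes accepted */
};

static int strict_discard(int64_t offset, int64_t bytes, void *opaque)
{
    struct strict_backend *b = opaque;

    if (offset % b->align != 0 || bytes % b->align != 0) {
        return -EINVAL;
    }
    b->discarded += bytes;
    return 0;
}
```

For a request at offset 1000 of length 10000 with a 4096-byte alignment, the head (3096 bytes) and tail (2808 bytes) are rejected and skipped, while the 4096-byte body is discarded and the request as a whole succeeds.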
* Re: [PATCH v2 2/2] block/io: skip head/tail requests on EINVAL
  2025-04-10 18:41 ` [PATCH v2 2/2] block/io: skip head/tail requests on EINVAL Stefan Hajnoczi
@ 2025-04-11 8:18   ` Hanna Czenczek
  2025-04-11 17:28     ` Eric Blake
  0 siblings, 1 reply; 9+ messages in thread
From: Hanna Czenczek @ 2025-04-11 8:18 UTC
  To: Stefan Hajnoczi, qemu-devel; +Cc: Kevin Wolf, Fam Zheng, qemu-block

On 10.04.25 20:41, Stefan Hajnoczi wrote:
> When guests send misaligned discard requests, the block layer breaks
> them up into a misaligned head, an aligned main body, and a misaligned
> tail.
>
> The file-posix block driver on Linux returns -EINVAL on misaligned
> discard requests. This causes bdrv_co_pdiscard() to fail and guests
> configured with werror=stop will pause.
>
> Add a special case for misaligned head/tail requests. Simply continue
> when EINVAL is encountered so that the aligned main body of the request
> can be completed and the guest is not paused. This is the best we can do
> when guest discard limits do not match the host discard limits.
>
> Fixes: https://issues.redhat.com/browse/RHEL-86032
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block/io.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/block/io.c b/block/io.c
> index 1ba8d1aeea..a0d0b31a3e 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3180,7 +3180,11 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>              }
>          }
>          if (ret && ret != -ENOTSUP) {
> -            goto out;
> +            if (ret == -EINVAL && (offset % align != 0 || num % align != 0)) {

Could use `(offset | num) % align != 0`, but either way:

Reviewed-by: Hanna Czenczek <hreitz@redhat.com>

> +                /* Silently skip rejected unaligned head/tail requests */
> +            } else {
> +                goto out; /* bail out */
> +            }
>          }
>
>          offset += num;

* Re: [PATCH v2 2/2] block/io: skip head/tail requests on EINVAL
  2025-04-11 8:18   ` Hanna Czenczek
@ 2025-04-11 17:28     ` Eric Blake
  2025-04-14 13:39       ` Hanna Czenczek
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Blake @ 2025-04-11 17:28 UTC
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, Kevin Wolf, Fam Zheng, qemu-block

On Fri, Apr 11, 2025 at 10:18:55AM +0200, Hanna Czenczek wrote:
> >           if (ret && ret != -ENOTSUP) {
> > -            goto out;
> > +            if (ret == -EINVAL && (offset % align != 0 || num % align != 0)) {
>
> Could use `(offset | num) % align != 0`, but either way:

Use of | and & to perform alignment checks only works if align is
guaranteed to be a power of 2. But isn't there (odd) hardware out
there with something like a 15M alignment, at which point you HAVE to
do separate checks with % because bitwise ops no longer work?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org

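The point about power-of-2 alignments can be checked with a small experiment. The two helpers below are hypothetical names written for this illustration only; they compare the separate modulo checks from the patch with the combined `(offset | num) % align` form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The check as written in the patch: two independent modulo tests */
static bool misaligned_mod(int64_t offset, int64_t num, int64_t align)
{
    assert(align > 0);
    return offset % align != 0 || num % align != 0;
}

/* The combined form: only equivalent when align is a power of 2 */
static bool misaligned_or(int64_t offset, int64_t num, int64_t align)
{
    assert(align > 0);
    return (offset | num) % align != 0;
}
```

With align = 4096 both helpers agree. With align = 15, offset = 15 and num = 30 are both multiples of the alignment, yet 15 | 30 is 31 and 31 % 15 is 1, so the combined form wrongly reports a misaligned request. The same happens at the 15M scale mentioned above, since 15M | 30M equals 31M.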
* Re: [PATCH v2 2/2] block/io: skip head/tail requests on EINVAL
  2025-04-11 17:28     ` Eric Blake
@ 2025-04-14 13:39       ` Hanna Czenczek
  0 siblings, 0 replies; 9+ messages in thread
From: Hanna Czenczek @ 2025-04-14 13:39 UTC
  To: Eric Blake; +Cc: Stefan Hajnoczi, qemu-devel, Kevin Wolf, Fam Zheng, qemu-block

On 11.04.25 19:28, Eric Blake wrote:
> On Fri, Apr 11, 2025 at 10:18:55AM +0200, Hanna Czenczek wrote:
>>>           if (ret && ret != -ENOTSUP) {
>>> -            goto out;
>>> +            if (ret == -EINVAL && (offset % align != 0 || num % align != 0)) {
>> Could use `(offset | num) % align != 0`, but either way:
> Use of | and & to perform alignment checks only works if align is
> guaranteed to be a power of 2. But isn't there (odd) hardware out
> there with something like a 15M alignment, at which point you HAVE to
> do separate checks with % because bitwise ops no longer work?

Ah, true, thanks!

Hanna