Linux block layer


* Re: [PATCH v2 1/5] block: allow making a block device unfreezable
From: Johannes Thumshirn @ 2026-06-18 12:47 UTC
  To: Christian Brauner, Chris Mason, Jens Axboe, David Sterba,
	Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel
In-Reply-To: <20260616-work-super-freeze_deny_upstream-v2-1-b3567c7f994b@kernel.org>

Looks good to me,

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>





* Re: [PATCH v2 2/5] block: split bdev_yield_claim() out of bdev_fput()
From: Johannes Thumshirn @ 2026-06-18 12:40 UTC
  To: Christian Brauner, Chris Mason, Jens Axboe, David Sterba,
	Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel
In-Reply-To: <20260616-work-super-freeze_deny_upstream-v2-2-b3567c7f994b@kernel.org>

Looks good to me,

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>



* Re: [PATCH v2 3/7] rust: doctest: add LocalModule fallback for #[vtable] ThisModule
From: Andreas Hindborg @ 2026-06-18 12:13 UTC
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
	Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, Alvin Sun
In-Reply-To: <20260521-fix-fops-owner-v2-3-fd99079c5a04@linux.dev>

Alvin Sun <alvin.sun@linux.dev> writes:

> Add a `LocalModule` struct with a null-pointer `ModuleMetadata` impl
> in the doctest harness, so that `crate::LocalModule` (auto-inserted
> by `#[vtable]`) resolves correctly when there is no `module!` macro.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>

Does this need to be ordered before the vtable auto insert in the patch series?

Best regards,
Andreas Hindborg




* Re: [PATCH v2 7/7] block: rnull: use `LocalModule` for `THIS_MODULE`
From: Andreas Hindborg @ 2026-06-18 12:17 UTC
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
	Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, Alvin Sun
In-Reply-To: <20260521-fix-fops-owner-v2-7-fd99079c5a04@linux.dev>

Alvin Sun <alvin.sun@linux.dev> writes:

> Replace the `THIS_MODULE` import with `LocalModule` from the crate,
> consistent with the move of `THIS_MODULE` into the `ModuleMetadata`
> trait.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

You need to squash this with the previous patch.


Best regards,
Andreas Hindborg





* Re: [PATCH v2 2/7] rust: macros: auto-insert ThisModule in #[vtable]
From: Andreas Hindborg @ 2026-06-18 12:11 UTC
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
	Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, Alvin Sun
In-Reply-To: <20260521-fix-fops-owner-v2-2-fd99079c5a04@linux.dev>

Alvin Sun <alvin.sun@linux.dev> writes:

> Auto-add `type ThisModule: ::kernel::ModuleMetadata;` as a required
> associated type on the trait side if not already defined, and
> auto-insert `type ThisModule = crate::LocalModule;` on the impl side
> if not explicitly provided, eliminating the need to manually declare
> and implement `ThisModule` in every vtable trait and impl.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>


Best regards,
Andreas Hindborg





* Re: [PATCH v2 1/7] rust: module: add `THIS_MODULE` const to `ModuleMetadata` trait
From: Andreas Hindborg @ 2026-06-18 12:04 UTC
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
	Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, Alvin Sun
In-Reply-To: <20260521-fix-fops-owner-v2-1-fd99079c5a04@linux.dev>

"Alvin Sun" <alvin.sun@linux.dev> writes:

> Add a `THIS_MODULE` const to the `ModuleMetadata` trait so that
> modules can provide their `ThisModule` pointer usable in const
> contexts such as static file_operations.
>
> Move the `THIS_MODULE` static from the `module!` macro into the
> `ModuleMetadata` impl, and update `__init` to use
> `LocalModule::THIS_MODULE` instead.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
> ---
>  rust/kernel/lib.rs    |  3 +++
>  rust/macros/module.rs | 34 +++++++++++++++++-----------------
>  2 files changed, 20 insertions(+), 17 deletions(-)
>
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index b72b2fbe046d6..f0cf0705d9697 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -184,6 +184,9 @@ fn init(module: &'static ThisModule) -> impl pin_init::PinInit<Self, error::Erro
>  pub trait ModuleMetadata {
>      /// The name of the module as specified in the `module!` macro.
>      const NAME: &'static crate::str::CStr;
> +
> +    /// The module's `THIS_MODULE` pointer.
> +    const THIS_MODULE: ThisModule;
>  }
>
>  /// Equivalent to `THIS_MODULE` in the C API.
> diff --git a/rust/macros/module.rs b/rust/macros/module.rs
> index 06c18e2075083..b6d7b3299fbf9 100644
> --- a/rust/macros/module.rs
> +++ b/rust/macros/module.rs
> @@ -497,28 +497,28 @@ pub(crate) fn module(info: ModuleInfo) -> Result<TokenStream> {
>          /// Used by the printing macros, e.g. [`info!`].
>          const __LOG_PREFIX: &[u8] = #name_cstr.to_bytes_with_nul();
>
> -        // SAFETY: `__this_module` is constructed by the kernel at load time and will not be
> -        // freed until the module is unloaded.
> -        #[cfg(MODULE)]
> -        static THIS_MODULE: ::kernel::ThisModule = unsafe {
> -            extern "C" {
> -                static __this_module: ::kernel::types::Opaque<::kernel::bindings::module>;
> -            };
> -
> -            ::kernel::ThisModule::from_ptr(__this_module.get())
> -        };
> -
> -        #[cfg(not(MODULE))]
> -        static THIS_MODULE: ::kernel::ThisModule = unsafe {
> -            ::kernel::ThisModule::from_ptr(::core::ptr::null_mut())
> -        };
> -
>          /// The `LocalModule` type is the type of the module created by `module!`,
>          /// `module_pci_driver!`, `module_platform_driver!`, etc.
>          type LocalModule = #type_;
>
>          impl ::kernel::ModuleMetadata for #type_ {
>              const NAME: &'static ::kernel::str::CStr = #name_cstr;
> +
> +            #[cfg(MODULE)]
> +            const THIS_MODULE: ::kernel::ThisModule = {
> +                extern "C" {
> +                    static __this_module: ::kernel::types::Opaque<::kernel::bindings::module>;
> +                }
> +
> +                // SAFETY: `__this_module` is constructed by the kernel at load time
> +                // and lives until the module is unloaded.
> +                unsafe { ::kernel::ThisModule::from_ptr(__this_module.get()) }
> +            };
> +
> +            #[cfg(not(MODULE))]
> +            const THIS_MODULE: ::kernel::ThisModule = unsafe {
> +                ::kernel::ThisModule::from_ptr(::core::ptr::null_mut())
> +            };
>          }
>
>          // Double nested modules, since then nobody can access the public items inside.
> @@ -616,7 +616,7 @@ pub extern "C" fn #ident_exit() {
>                  /// This function must only be called once.
>                  unsafe fn __init() -> ::kernel::ffi::c_int {
>                      let initer = <super::super::LocalModule as ::kernel::InPlaceModule>::init(
> -                        &super::super::THIS_MODULE
> +                        &<super::super::LocalModule as ::kernel::ModuleMetadata>::THIS_MODULE

Is it possible we could make this more ergonomic? Perhaps by adding a
helper:

  fn this_module<M: ::kernel::ModuleMetadata>() -> &'static ::kernel::ThisModule {
      &M::THIS_MODULE
  }

Then the invocation is a little better:

  let initer = <super::super::LocalModule as ::kernel::InPlaceModule>::init(
      this_module::<super::super::LocalModule>()
  );


Best regards,
Andreas Hindborg




* Re: [PATCH v3 6/7] rust: block: rnull: use vertical import style
From: Andreas Hindborg @ 2026-06-18 10:41 UTC
  To: Alvin Sun, Arnd Bergmann, Greg Kroah-Hartman, Miguel Ojeda,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, Jens Axboe,
	Brendan Higgins, David Gow, Rae Moar
  Cc: rust-for-linux, linux-block, linux-kselftest, kunit-dev,
	Alvin Sun
In-Reply-To: <20260521-miscdev-use-format-v3-6-56240ca70d0c@linux.dev>

"Alvin Sun" <alvin.sun@linux.dev> writes:

> Convert `use` imports to vertical layout for better readability and
> maintainability.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>


Acked-by: Andreas Hindborg <a.hindborg@kernel.org>


Best regards,
Andreas Hindborg




* Re: [PATCH v2 4/5] rust: block: mq: use vertical import style
From: Andreas Hindborg @ 2026-06-18 10:29 UTC
  To: Jens Axboe
  Cc: Alvin Sun, Arnd Bergmann, Greg Kroah-Hartman, Miguel Ojeda,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, rust-for-linux,
	linux-block, Alvin Sun
In-Reply-To: <20260520-miscdev-use-format-v2-4-64dc48fc1345@linux.dev>

"Alvin Sun" <alvin.sun@linux.dev> writes:

> Convert `use` imports to vertical layout for better readability and
> maintainability.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>


Acked-by: Andreas Hindborg <a.hindborg@kernel.org>

Cc: Jens Axboe <axboe@kernel.dk>

Best regards,
Andreas Hindborg





* Re: [PATCH v2 5/5] rust: block: mq: remove redundant imports and format
From: Andreas Hindborg @ 2026-06-18 10:32 UTC
  To: Jens Axboe
  Cc: Alvin Sun, Arnd Bergmann, Greg Kroah-Hartman, Miguel Ojeda,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, rust-for-linux,
	linux-block, Alvin Sun
In-Reply-To: <20260520-miscdev-use-format-v2-5-64dc48fc1345@linux.dev>

"Alvin Sun" <alvin.sun@linux.dev> writes:

> Drop `Result`, `Pin`, `pin_data`, `pinned_drop`, `PinInit`, and
> `try_pin_init` imports already provided by `kernel::prelude`.
>
> Simplify `error` imports and flatten parameters formatting.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

Acked-by: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>

@Jens can you pick 4/5 and 5/5?


Best regards,
Andreas Hindborg



* Re: [PATCH 1/1] block: validate user space vectors during extraction
From: Christoph Hellwig @ 2026-06-18 10:26 UTC
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, dm-devel, hch, axboe, brauner, djwong,
	viro, Keith Busch, stable
In-Reply-To: <20260617233235.1016063-2-kbusch@meta.com>

On Wed, Jun 17, 2026 at 04:32:35PM -0700, Keith Busch wrote:
> @@ -1242,7 +1242,7 @@ static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
>   * is returned only if 0 pages could be pinned.
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
> -			   unsigned len_align_mask)
> +			   unsigned len_align_mask, unsigned vec_align_mask)

vec_align_mask needs to be documented in the kernel doc.  And I find
the vec_align_mask name a bit confusing.  This is all about the physical
address (really the dma address, but the page aligned offsets map 1:1),
so maybe phys_align_mask or dma_align_mask might be better names?

Also wouldn't it be more natural to pass the start alignment requirement
before the length alignment parameter?
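
To illustrate the "offsets map 1:1" remark, here is a small userspace
sketch, not kernel code. It assumes 4 KiB pages and models address
translation as swapping the page frame while keeping the in-page offset;
every demo_* name and DEMO_PAGE_SIZE is made up for the illustration.
Under those assumptions, an alignment mask smaller than a page gives the
same answer for the virtual and the translated address:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Hypothetical model of address translation: the page frame number
 * changes, the offset inside the page is carried over unchanged.
 */
static uint64_t demo_virt_to_phys(uint64_t vaddr, uint64_t pfn)
{
	return pfn * DEMO_PAGE_SIZE + (vaddr & (DEMO_PAGE_SIZE - 1));
}

/*
 * True if addr is aligned to mask. The mask must have the form 2^n - 1
 * and be smaller than a page for the 1:1 argument to hold.
 */
static bool demo_is_aligned(uint64_t addr, unsigned int mask)
{
	assert((mask & (mask + 1)) == 0 && mask < DEMO_PAGE_SIZE);
	return (addr & mask) == 0;
}
```

With a 512-byte requirement (mask 511), demo_is_aligned() returns the
same result for a user address and for demo_virt_to_phys() of that
address, whatever frame number is picked.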

> @@ -1251,6 +1251,11 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
>  
>  	if (iov_iter_is_bvec(iter)) {
>  		bio_iov_bvec_set(bio, iter);
> +
> +		if (mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) &
> +							vec_align_mask)
> +			return -EINVAL;

Can you add a comment here?  Especially as the bvec iter doesn't actually
require all individual bvecs to be aligned and I'm not entirely sure this
handles all cases - writing down the rules might help a bit with that.

>  		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec,
>  				BIO_MAX_SIZE - bio->bi_iter.bi_size,
> -				&bio->bi_vcnt, bio->bi_max_vecs, flags);
> +				&bio->bi_vcnt, bio->bi_max_vecs,
> +				vec_align_mask, flags);
>  		if (ret <= 0) {
> +			if (ret == -EINVAL) {
> +				bio_release_pages(bio, false);
> +				bio_clear_flag(bio, BIO_PAGE_PINNED);
> +				bio->bi_iter.bi_size = 0;
> +				bio->bi_vcnt = 0;
> +				return ret;
> +			}

Do we need all these cleanups beyond the bio_release_pages()?  Most
callers just free the bio, so should not care about it, and the error
handling in __blkdev_direct_IO that calls bio_endio looks buggy for
other reasons.

> + * @align_mask:	reject with -EINVAL if the source address or length is not
> + *		aligned to this mask

Maybe use the same parameter name as on the bio side here?

And not for this patch, but this makes me wonder if we should handle the
len alignment in iov_iter_extract_bvecs as well, as that should simplify
it quite a bit.


* Re: [PATCH 1/1] block: validate user space vectors during extraction
From: kernel test robot @ 2026-06-18 10:22 UTC
  To: Keith Busch, linux-block, linux-fsdevel
  Cc: llvm, oe-kbuild-all, dm-devel, hch, axboe, brauner, djwong, viro,
	Keith Busch, stable
In-Reply-To: <20260617233235.1016063-2-kbusch@meta.com>

Hi Keith,

kernel test robot noticed the following build warnings:

[auto build test WARNING on axboe/for-next]
[also build test WARNING on brauner-vfs/vfs.all akpm-mm/mm-nonmm-unstable linus/master v7.1 next-20260616]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Keith-Busch/block-validate-user-space-vectors-during-extraction/20260618-073522
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link:    https://lore.kernel.org/r/20260617233235.1016063-2-kbusch%40meta.com
patch subject: [PATCH 1/1] block: validate user space vectors during extraction
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20260618/202606181254.ohF2ZO9K-lkp@intel.com/config)
compiler: clang version 22.1.8 (https://github.com/llvm/llvm-project ca7933e47d3a3451d81e72ac174dcb5aa28b59d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260618/202606181254.ohF2ZO9K-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606181254.ohF2ZO9K-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: block/bio.c:1245 function parameter 'vec_align_mask' not described in 'bio_iov_iter_get_pages'

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Shin'ichiro Kawasaki @ 2026-06-18 8:04 UTC
  To: Nilay Shroff; +Cc: Ming Lei, linux-block, Jens Axboe
In-Reply-To: <2371227f-43ef-4a0d-ad8f-da23eea43357@linux.ibm.com>

On Jun 17, 2026 / 16:38, Nilay Shroff wrote:
[...]
> Given the above, I'm fine with the earlier approach of upgrading update_nr_hwq_lock from
> a reader lock to a writer lock in elv_iosched_store(). That directly serializes concurrent
> scheduler updates and avoids the race on q->elevator without introducing additional lock
> ordering concerns.

Thanks for the comment. I will prepare the "writer lock in elv_iosched_store()"
approach as a v2 patch.


* Re: [PATCH] virtio-blk: use little-endian types for the zoned fields
From: Stefano Garzarella @ 2026-06-18 7:41 UTC
  To: Michael Bommarito
  Cc: Michael S . Tsirkin, Jason Wang, Stefan Hajnoczi, Dmitry Fomichev,
	Damien Le Moal, Jens Axboe, Paolo Bonzini, virtualization,
	linux-block, linux-kernel
In-Reply-To: <20260617151727.4071754-1-michael.bommarito@gmail.com>

On Wed, Jun 17, 2026 at 11:17:27AM -0400, Michael Bommarito wrote:
>The zoned block-device fields in the virtio-blk header are typed
>__virtio{32,64}, so their endianness follows VIRTIO_F_VERSION_1. The
>zoned feature is only defined for VIRTIO 1.x devices, and the virtio
>specification defines all of its fields as little-endian. Commit
>b16a1756c716 ("virtio_blk: mark all zone fields LE") tagged them
>__le* for exactly this reason, but commit f1ba4e674feb ("virtio-blk:
>fix to match virtio spec") re-applied the reviewed version of the
>original zoned series -- which predated b16a1756 -- and silently
>restored the __virtio* typing together with the matching
>virtio*_to_cpu() / virtio_cread() accessors in the driver.
>
>Restore the little-endian typing for the zoned configuration-space
>characteristics, the zone descriptor, the zone report header and the
>ZONE_APPEND in-header sector, and read them with le*_to_cpu() and
>virtio_cread_le() to match.
>
>There is no functional change on any spec-compliant device: zoned
>requires VIRTIO_F_VERSION_1, and for a VERSION_1 device
>virtio*_to_cpu() is identical to le*_to_cpu(). The change makes the
>uapi types describe the actual wire format and removes a latent
>endianness mismatch for a (non-conformant) legacy device on a
>big-endian guest.

Not for this patch, but at this point should we also do the same for the
fields gated by the following features, which IIUC were all added in 1.x:
- VIRTIO_BLK_F_MQ
- VIRTIO_BLK_F_DISCARD
- VIRTIO_BLK_F_WRITE_ZEROES
- VIRTIO_BLK_F_SECURE_ERASE
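
The quoted claim that virtio*_to_cpu() is identical to le*_to_cpu() on a
VERSION_1 device can be pictured with a small userspace sketch. This is
not the kernel implementation: the demo_* helpers are hypothetical
stand-ins, and it assumes the virtio accessor picks little-endian for
VERSION_1 devices and the guest's native byte order for legacy ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for le64_to_cpu(): decode 8 little-endian wire bytes. */
static uint64_t demo_le64_to_cpu(const unsigned char *wire)
{
	uint64_t v = 0;

	assert(wire != NULL);
	/* Most significant byte is last on the wire, so fold from the end. */
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | wire[i];
	return v;
}

/* Decode 8 bytes in the byte order of the machine running this program. */
static uint64_t demo_native64(const unsigned char *wire)
{
	uint64_t v;

	assert(wire != NULL);
	memcpy(&v, wire, sizeof(v));
	return v;
}

/*
 * Hypothetical stand-in for virtio64_to_cpu(): little-endian when the
 * device negotiated VERSION_1, guest-native for a legacy device.
 */
static uint64_t demo_virtio64_to_cpu(bool version_1, const unsigned char *wire)
{
	return version_1 ? demo_le64_to_cpu(wire) : demo_native64(wire);
}
```

With version_1 set, both helpers decode the same bytes to the same
value on any host. With it cleared, the result follows the host byte
order, which is where a big-endian guest would diverge from the
little-endian wire format.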

>
>Fixes: f1ba4e674feb ("virtio-blk: fix to match virtio spec")
>Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>Assisted-by: Claude:claude-opus-4-8
>Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
>---
>Testing:
> - Builds with no new warnings; sparse endian-clean (C=2,
>   __CHECK_ENDIAN__, CONFIG_BLK_DEV_ZONED=y) both before and after.
> - Booted under QEMU with a host-managed zoned device exposed through
>   virtio-blk. Zone revalidation, blkzone report and a sequential
>   write / write-pointer check return correct values; blktests zbd
>   device tests 001-006 (sysfs+ioctl, report zone, reset, write split,
>   write ordering, revalidate) pass, with results identical before and
>   after this change -- expected, since on a VIRTIO_F_VERSION_1 device
>   virtio*_to_cpu() == le*_to_cpu().
>
> drivers/block/virtio_blk.c      | 38 +++++++++++++++------------------
> include/uapi/linux/virtio_blk.h | 18 ++++++++--------
> 2 files changed, 26 insertions(+), 30 deletions(-)
>
>diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>index b1c9a27fe00f3..5532cfbde7bfe 100644
>--- a/drivers/block/virtio_blk.c
>+++ b/drivers/block/virtio_blk.c
>@@ -99,7 +99,7 @@ struct virtblk_req {
> 		 * be the last byte.
> 		 */
> 		struct {
>-			__virtio64 sector;
>+			__le64 sector;
> 			u8 status;
> 		} zone_append;
> 	} in_hdr;
>@@ -335,14 +335,12 @@ static inline void virtblk_request_done(struct request *req)
> {
> 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
> 	blk_status_t status = virtblk_result(virtblk_vbr_status(vbr));
>-	struct virtio_blk *vblk = req->mq_hctx->queue->queuedata;
>
> 	virtblk_unmap_data(req, vbr);
> 	virtblk_cleanup_cmd(req);
>
> 	if (req_op(req) == REQ_OP_ZONE_APPEND)
>-		req->__sector = virtio64_to_cpu(vblk->vdev,
>-						vbr->in_hdr.zone_append.sector);
>+		req->__sector = le64_to_cpu(vbr->in_hdr.zone_append.sector);
>
> 	blk_mq_end_request(req, status);
> }
>@@ -589,13 +587,13 @@ static int virtblk_parse_zone(struct virtio_blk *vblk,
> {
> 	struct blk_zone zone = { };
>
>-	zone.start = virtio64_to_cpu(vblk->vdev, entry->z_start);
>+	zone.start = le64_to_cpu(entry->z_start);
> 	if (zone.start + vblk->zone_sectors <= get_capacity(vblk->disk))
> 		zone.len = vblk->zone_sectors;
> 	else
> 		zone.len = get_capacity(vblk->disk) - zone.start;
>-	zone.capacity = virtio64_to_cpu(vblk->vdev, entry->z_cap);
>-	zone.wp = virtio64_to_cpu(vblk->vdev, entry->z_wp);
>+	zone.capacity = le64_to_cpu(entry->z_cap);
>+	zone.wp = le64_to_cpu(entry->z_wp);
>
> 	switch (entry->z_type) {
> 	case VIRTIO_BLK_ZT_SWR:
>@@ -687,8 +685,7 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
> 		if (ret)
> 			goto fail_report;
>
>-		nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
>-			   nr_zones);
>+		nz = min_t(u64, le64_to_cpu(report->nr_zones), nr_zones);
> 		if (!nz)
> 			break;
>
>@@ -698,8 +695,7 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
> 			if (ret)
> 				goto fail_report;
>
>-			sector = virtio64_to_cpu(vblk->vdev,
>-						 report->zones[i].z_start) +
>+			sector = le64_to_cpu(report->zones[i].z_start) +
> 				 vblk->zone_sectors;
> 			zone_idx++;
> 		}
>@@ -725,18 +721,18 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
>
> 	lim->features |= BLK_FEAT_ZONED;
>
>-	virtio_cread(vdev, struct virtio_blk_config,
>-		     zoned.max_open_zones, &v);
>+	virtio_cread_le(vdev, struct virtio_blk_config,
>+			zoned.max_open_zones, &v);
> 	lim->max_open_zones = v;
> 	dev_dbg(&vdev->dev, "max open zones = %u\n", v);
>
>-	virtio_cread(vdev, struct virtio_blk_config,
>-		     zoned.max_active_zones, &v);
>+	virtio_cread_le(vdev, struct virtio_blk_config,
>+			zoned.max_active_zones, &v);
> 	lim->max_active_zones = v;
> 	dev_dbg(&vdev->dev, "max active zones = %u\n", v);
>
>-	virtio_cread(vdev, struct virtio_blk_config,
>-		     zoned.write_granularity, &wg);
>+	virtio_cread_le(vdev, struct virtio_blk_config,
>+			zoned.write_granularity, &wg);
> 	if (!wg) {
> 		dev_warn(&vdev->dev, "zero write granularity reported\n");
> 		return -ENODEV;
>@@ -750,8 +746,8 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
> 	 * virtio ZBD specification doesn't require zones to be a power of
> 	 * two sectors in size, but the code in this driver expects that.
> 	 */
>-	virtio_cread(vdev, struct virtio_blk_config, zoned.zone_sectors,
>-		     &vblk->zone_sectors);
>+	virtio_cread_le(vdev, struct virtio_blk_config, zoned.zone_sectors,
>+			&vblk->zone_sectors);
> 	if (vblk->zone_sectors == 0 || !is_power_of_2(vblk->zone_sectors)) {
> 		dev_err(&vdev->dev,
> 			"zoned device with non power of two zone size %u\n",
>@@ -767,8 +763,8 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
> 		lim->max_hw_discard_sectors = 0;
> 	}
>
>-	virtio_cread(vdev, struct virtio_blk_config,
>-		     zoned.max_append_sectors, &v);
>+	virtio_cread_le(vdev, struct virtio_blk_config,
>+			zoned.max_append_sectors, &v);
> 	if (!v) {
> 		dev_warn(&vdev->dev, "zero max_append_sectors reported\n");
> 		return -ENODEV;
>diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
>index 3744e4da1b2a7..5af2a0300bb9d 100644
>--- a/include/uapi/linux/virtio_blk.h
>+++ b/include/uapi/linux/virtio_blk.h
>@@ -140,11 +140,11 @@ struct virtio_blk_config {
>

To avoid making this mistake again, how about adding a note here to 
clarify that all the fields listed below are defined only for VIRTIO 1.x 
devices and are therefore always little-endian?

Anyway, the patch LGTM:

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>


> 	/* Zoned block device characteristics (if VIRTIO_BLK_F_ZONED) */
> 	struct virtio_blk_zoned_characteristics {
>-		__virtio32 zone_sectors;
>-		__virtio32 max_open_zones;
>-		__virtio32 max_active_zones;
>-		__virtio32 max_append_sectors;
>-		__virtio32 write_granularity;
>+		__le32 zone_sectors;
>+		__le32 max_open_zones;
>+		__le32 max_active_zones;
>+		__le32 max_append_sectors;
>+		__le32 write_granularity;
> 		__u8 model;
> 		__u8 unused2[3];
> 	} zoned;
>@@ -241,11 +241,11 @@ struct virtio_blk_outhdr {
>  */
> struct virtio_blk_zone_descriptor {
> 	/* Zone capacity */
>-	__virtio64 z_cap;
>+	__le64 z_cap;
> 	/* The starting sector of the zone */
>-	__virtio64 z_start;
>+	__le64 z_start;
> 	/* Zone write pointer position in sectors */
>-	__virtio64 z_wp;
>+	__le64 z_wp;
> 	/* Zone type */
> 	__u8 z_type;
> 	/* Zone state */
>@@ -254,7 +254,7 @@ struct virtio_blk_zone_descriptor {
> };
>
> struct virtio_blk_zone_report {
>-	__virtio64 nr_zones;
>+	__le64 nr_zones;
> 	__u8 reserved[56];
> 	struct virtio_blk_zone_descriptor zones[];
> };
>-- 
>2.53.0
>



* Re: [PATCH blktests] ublk: mark all tests as QUICK
From: Shin'ichiro Kawasaki @ 2026-06-18 6:20 UTC
  To: Sebastian Chlad; +Cc: linux-block, Sebastian Chlad
In-Reply-To: <20260615094144.13060-1-sebastian.chlad@suse.com>

On Jun 15, 2026 / 11:41, Sebastian Chlad wrote:
> These tests are quick to run so mark them accordingly to ensure
> they are included in quick runs.

Thanks, I applied it.


* [PATCH 1/1] block: validate user space vectors during extraction
From: Keith Busch @ 2026-06-17 23:32 UTC
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch, stable
In-Reply-To: <20260617233235.1016063-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

The blk-mq based drivers have every incoming bio validated by an
unconditional __bio_split_to_limits() call, which rejects any segment
that does not meet the queue's dma_alignment with BLK_STS_INVAL, so they
only see viable requests. A bio-based driver, though, receives a bio
whose memory alignment has not been checked.

Misalignment is possible for vectors supplied from user space direct-io.
When a stacking driver forwards a misaligned bio to a member device,
that member may reject it with BLK_STS_INVAL if the lower level attempts
to split the bio to the queue limits. The stacker tends to mishandle the
error: dm-raid1 may degrade an otherwise healthy array.

Alternatively, some lower level bio-based block drivers never attempt to
split their bio and assume the one received is viable. If it's
unaligned, block devices like brd and pmem may corrupt their data as
they have a strong dependency on sector-size-aligned bvecs.

Validate the source against the device's dma_alignment where the bio is
built from the iov_iter, rejecting misaligned I/O with -EINVAL before it
is submitted. This is done opportunistically in a path that already pins
the pages, so no additional io vector walking is needed.

The required alignment is supplied by the callers as vec_align_mask
(bdev_dma_alignment()); passthrough and the bounce path pass 0 as they
have no such requirement. If a vector is misaligned while building the
bio, any pages already pinned into that bio are released before
returning.
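
As a rough userspace model of the check described above (a sketch, not
the code in this patch): walk the vectors once and reject the first one
whose start address or length is not aligned to the mask. struct
demo_vec and demo_check_vecs() are hypothetical names, and the mask is
assumed to have the form 2^n - 1, with 0 meaning "no requirement" as for
the passthrough and bounce callers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one user-supplied vector. */
struct demo_vec {
	uintptr_t addr;	/* start address of the buffer */
	size_t len;	/* length of the buffer in bytes */
};

/*
 * Return 0 if every vector's start address and length are aligned to
 * align_mask, or -EINVAL at the first misaligned vector. align_mask
 * must be 2^n - 1 (e.g. 511 for 512-byte alignment); 0 disables the
 * check because nothing can have bits set under an empty mask.
 */
static int demo_check_vecs(const struct demo_vec *vecs, size_t nr,
			   unsigned int align_mask)
{
	assert((align_mask & (align_mask + 1)) == 0);

	for (size_t i = 0; i < nr; i++) {
		/* OR-ing address and length tests both with one mask. */
		if ((vecs[i].addr | vecs[i].len) & align_mask)
			return -EINVAL;
	}
	return 0;
}
```

A caller with a 512-byte requirement would pass 511; a vector starting
at 0x2201 is then rejected, while the same vector passes with a mask
of 0.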

Cc: stable@vger.kernel.org
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c          | 19 ++++++++++++++++---
 block/blk-map.c      |  2 +-
 block/fops.c         |  3 ++-
 fs/iomap/direct-io.c |  3 ++-
 include/linux/bio.h  |  2 +-
 include/linux/uio.h  |  3 ++-
 lib/iov_iter.c       |  9 ++++++++-
 7 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f2a5f4d0a9672..1bd7da889e069 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1242,7 +1242,7 @@ static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
  * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-			   unsigned len_align_mask)
+			   unsigned len_align_mask, unsigned vec_align_mask)
 {
 	iov_iter_extraction_t flags = 0;
 
@@ -1251,6 +1251,11 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 
 	if (iov_iter_is_bvec(iter)) {
 		bio_iov_bvec_set(bio, iter);
+
+		if (mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) &
+							vec_align_mask)
+			return -EINVAL;
+
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
 		return 0;
 	}
@@ -1265,8 +1270,16 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 
 		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec,
 				BIO_MAX_SIZE - bio->bi_iter.bi_size,
-				&bio->bi_vcnt, bio->bi_max_vecs, flags);
+				&bio->bi_vcnt, bio->bi_max_vecs,
+				vec_align_mask, flags);
 		if (ret <= 0) {
+			if (ret == -EINVAL) {
+				bio_release_pages(bio, false);
+				bio_clear_flag(bio, BIO_PAGE_PINNED);
+				bio->bi_iter.bi_size = 0;
+				bio->bi_vcnt = 0;
+				return ret;
+			}
 			if (!bio->bi_vcnt)
 				return ret;
 			break;
@@ -1377,7 +1390,7 @@ static int bio_iov_iter_bounce_read(struct bio *bio, struct iov_iter *iter,
 		ssize_t ret;
 
 		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec + 1, len,
-				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0);
+				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0, 0);
 		if (ret <= 0) {
 			if (!bio->bi_vcnt) {
 				folio_put(folio);
diff --git a/block/blk-map.c b/block/blk-map.c
index 768549f19f97e..c9535efe1a913 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -274,7 +274,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	 * No alignment requirements on our part to support arbitrary
 	 * passthrough commands.
 	 */
-	ret = bio_iov_iter_get_pages(bio, iter, 0);
+	ret = bio_iov_iter_get_pages(bio, iter, 0, 0);
 	if (ret)
 		goto out_put;
 	ret = blk_rq_append_bio(rq, bio);
diff --git a/block/fops.c b/block/fops.c
index 15783a6180dec..928ba9be170cd 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -47,7 +47,8 @@ static inline int blkdev_iov_iter_get_pages(struct bio *bio,
 		struct iov_iter *iter, struct block_device *bdev)
 {
 	return bio_iov_iter_get_pages(bio, iter,
-			bdev_logical_block_size(bdev) - 1);
+			bdev_logical_block_size(bdev) - 1,
+			bdev_dma_alignment(bdev));
 }
 
 #define DIO_INLINE_BIO_VECS 4
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b485e3b191daf..645a4e9cd25f9 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -358,7 +358,8 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 				iomap_max_bio_size(&iter->iomap), alignment);
 	else
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter,
-					     alignment - 1);
+					     alignment - 1,
+					     bdev_dma_alignment(bio->bi_bdev));
 	if (unlikely(ret))
 		goto out_put_bio;
 	ret = bio->bi_iter.bi_size;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8f33f717b14f5..13be7edb524fc 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -477,7 +477,7 @@ int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data,
 		size_t len, enum req_op op);
 
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-		unsigned len_align_mask);
+		unsigned len_align_mask, unsigned vec_align_mask);
 
 void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e32..be8b2625b376a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -391,7 +391,8 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
 			       size_t *offset0);
 ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
 		size_t max_size, unsigned short *nr_vecs,
-		unsigned short max_vecs, iov_iter_extraction_t extraction_flags);
+		unsigned short max_vecs, unsigned align_mask,
+		iov_iter_extraction_t extraction_flags);
 
 /**
  * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 273919b161617..ccd5b49f6b78d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1886,6 +1886,8 @@ static unsigned int get_contig_folio_len(struct page **pages,
  * @max_size:	maximum size to extract from @iter
  * @nr_vecs:	number of vectors in @bv (on in and output)
  * @max_vecs:	maximum vectors in @bv, including those filled before calling
+ * @align_mask:	reject with -EINVAL if the source address or length is not
+ *		aligned to this mask
  * @extraction_flags: flags to qualify request
  *
  * Like iov_iter_extract_pages(), but returns physically contiguous ranges
@@ -1897,14 +1899,19 @@ static unsigned int get_contig_folio_len(struct page **pages,
  */
 ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
 		size_t max_size, unsigned short *nr_vecs,
-		unsigned short max_vecs, iov_iter_extraction_t extraction_flags)
+		unsigned short max_vecs, unsigned align_mask,
+		iov_iter_extraction_t extraction_flags)
 {
+	unsigned long start = (unsigned long)iter_iov_addr(iter);
 	unsigned short entries_left = max_vecs - *nr_vecs;
 	unsigned short nr_pages, i = 0;
 	size_t left, offset, len;
 	struct page **pages;
 	ssize_t size;
 
+	if ((start | iter_iov_len(iter)) & align_mask)
+		return -EINVAL;
+
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
 	 * possible so that we can start filling biovecs from the beginning
-- 
2.52.0



* [PATCH 0/1] direct-io: validate user space vectors during extraction
From: Keith Busch @ 2026-06-17 23:32 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

This addresses the misaligned direct-io problem behind various threads:

 https://lore.kernel.org/linux-xfs/20260610145218.141369-1-cem@kernel.org/
 https://lore.kernel.org/all/CAC_j7i1R7oy+nRhxEjCTba=DUgn02w9X+p94DCu0aHv5+5tKnQ@mail.gmail.com/
 https://lore.kernel.org/linux-block/ai7rnH20IYeSmY8s@gallifrey/
 https://lore.kernel.org/linux-block/20260616154009.2123183-1-kbusch@meta.com/

The various tested fixes are correct as far as they go, but they treat the
symptom: they only matter because an invalid bio reaches those drivers in the
first place.

The reason it reaches them is an assumption I made when I removed
direct-io alignment checks in 5ff3f74e145a ("block: simplify direct io
validity check") and 7eac331869575 ("iomap: simplify direct io validity
check"): every bio is eventually split to the device limits, and the
upper layers cope with resulting errors once the bio has formed. Both
were optimistic assumptions. Drivers with their own ->submit_bio may
never pass through blk_mq_submit_bio()'s split, so the check never runs
for them, and as numerous threads showed, the consumers don't uniformly
handle this condition.

This patch stops the invalid bio at the source instead. It validates the
buffer's alignment against the alignment limits when the bio is built
from the iov_iter. The check is folded into the bvec extraction that
already walks the vectors, so it adds only a comparison on a path that
is pinning direct-io pages anyway. Misalignment is now uniformly
rejected with EINVAL before submission for every direct-io submission
path.
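
A minimal userspace sketch of the kind of check described above, assuming the
alignment limit is expressed as a mask (a power of two minus one). The
struct and function names here are hypothetical and are not the kernel's;
the patch itself performs the test inside iov_iter_extract_bvecs() and
returns -EINVAL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one user-space vector (not a kernel type). */
struct user_vec {
	uintptr_t addr;	/* start address of the buffer */
	size_t len;	/* length of the buffer in bytes */
};

/*
 * Return 0 if both the address and the length are multiples of
 * align_mask + 1, or -1 otherwise (the patch returns -EINVAL here).
 */
static int check_vec_alignment(const struct user_vec *v, unsigned int align_mask)
{
	/* Assumption of this sketch: the mask is a power of two minus one. */
	assert((align_mask & (align_mask + 1)) == 0);

	/* OR-ing address and length lets a single AND test both at once. */
	if ((v->addr | v->len) & align_mask)
		return -1;
	return 0;
}
```

With a mask of 511, a 4096-byte buffer at address 0x1000 passes, while the
same buffer at 0x1001 or a 100-byte buffer at 0x1000 is rejected. A mask of
0 accepts everything, which matches the callers in the patch that pass 0.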

With this in place, the dm side changes under discussion are no longer
required to fix the bugs: the affected targets simply never see the
invalid bio. The tested patches remain reasonable as defense-in-depth if
desired, but they are not strictly necessary after this.

Keith Busch (1):
  block: validate user space vectors during extraction

 block/bio.c          | 19 ++++++++++++++++---
 block/blk-map.c      |  2 +-
 block/fops.c         |  3 ++-
 fs/iomap/direct-io.c |  3 ++-
 include/linux/bio.h  |  2 +-
 include/linux/uio.h  |  3 ++-
 lib/iov_iter.c       |  9 ++++++++-
 7 files changed, 32 insertions(+), 9 deletions(-)

-- 
2.52.0


* Re: [PATCH 00/19] init: discoverable root partitions, a.k.a. an omittable "root=" cmdline option
From: Vincent Mailhol @ 2026-06-17 20:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Davidlohr Bueso, Alexander Viro, Jan Kara,
	linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Richard Henderson, Matt Turner, Magnus Lindholm, linux-alpha,
	Vineet Gupta, linux-snps-arc, Russell King, linux-arm-kernel,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui, loongarch,
	Thomas Bogendoerfer, linux-mips, James E.J. Bottomley,
	Helge Deller, linux-parisc, Madhavan Srinivasan, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	linux-s390, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <20260617-irritation-rollen-wirst-7d636cbfec92@brauner>

On 17/06/2026 at 14:41, Christian Brauner wrote:
> On Mon, Jun 15, 2026 at 06:08:56PM +0200, Vincent Mailhol wrote:
>> DPS [1] defines GPT partition type UUIDs for OS partitions and
>> attributes that control whether such partitions should be
>> automatically discovered. The specification states that:
>>
>>   The OS can discover and mount the necessary file systems with a
>>   non-existent or incomplete /etc/fstab file and without the root=
>>   kernel command line option.
>>
>> DPS is already implemented in systemd-gpt-auto-generator [2], which,
>> when embedded in an initrd, indeed allows automatic detection of the
>> root filesystem through its partition type UUID.
>>
>> This series adds this discovery feature directly into the kernel so
>> that people who are not using systemd or not using an initrd can still
>> benefit from it. The implementation follows the same model as
>> systemd-gpt-auto-generator:
> 
> I happen to co-maintain the DPS. It is userspace policy and complex
> userspace policy at that and does not belong into the kernel.
> 
> This also implements a really tiny portion of the spec. It deals with a
> lot more complex concepts such as automatic partitioning during
> installation, verity, LUKS, containers. This is really not intended for
> the kernel at all. I mean, it's great that this spec is being used but I
> do not want this in the kernel just for the sake of auto-discovery.

Implementing only a tiny portion is intentional. To draw a parallel, it
is like saying that the root= cmdline option covers only a tiny portion
of what an fstab can do.

Yes, it does not handle LUKS, containers and so on, in the same way that
it is not possible to boot those things directly from the kernel.

So, I don't think this conflicts with the actual userland
implementations, the same way you can add root= to your command line and
still have an initrd next to it.

I did not intend this as a replacement but as a complement, to fill the
gap for kernels booted without an initrd.

> The DPS is completely generic and can be implemented by tooling other
> than systemd (util-linux implements it and so does refind iirc). I think
> not wanting to use or build alternative userspace tooling for this is a
> really weak argument for pushing this into the kernel.

Well, let me explain where I am coming from. From time to time, I mess up
my configuration. When the issue is in a userland config file (e.g. a bad
fstab), the recovery is always easy.

But when I mess up the bootloader firmware configuration (e.g. grub,
u-boot, edk2), the fix is always painful. I have to fight with a shell
with which I am not familiar to figure out what the correct
configuration is.

And an initrd would help but:

 - it is still one more file to look for and pass as a parameter
 - on some machines I do not have one anyway

I think it would be very neat to have a method to boot a kernel
with zero config (that is: no cmdline, no initrd), and I found out
that DPS could achieve that if just a tiny part of it were implemented
in the kernel.

For example, in edk2, I would be able to just browse the disk from the
"Boot from file" menu and select a kernel. Currently it panics because
no configuration is attached. With DPS, we could have it boot linux from
that menu. All in a graphical interface, with just up/down arrows and
one enter keypress.

And this is my motivation. This non-LUKS root read-only part of the DPS
is the only piece which makes sense to me in the kernel. It is not that I
don't *want* to implement it in userland, just that doing so doesn't
achieve what would be helpful to me (and I guess to others).

I thought I wouldn't be the only one in the world to see value in that,
which is why I posted it.

Yours sincerely,
Vincent Mailhol


* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Dr. David Alan Gilbert @ 2026-06-17 16:59 UTC (permalink / raw)
  To: Keith Busch, regressions
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajLRTkSZJ0WCYNk4@kbusch-mbp>

* Keith Busch (kbusch@kernel.org) wrote:
> On Wed, Jun 17, 2026 at 04:44:35PM +0000, Dr. David Alan Gilbert wrote:
> > (It's a bit scary you're having to go around quite
> > a few places and make similar fixes; I assume there
> > are others that do similar things).
> 
> Yes, I understand that. I'm looking into a common way to validate this.
> The md raid doesn't have this problem because they always call
> bio_split_to_limits() first, but that's not an optimal thing to do for
> dm raid in the normal read/write path, so perhaps a common checker needs
> to happen generically in the block layer. Yeah, I know I removed the
> previous higher level validation ... I'll try to find something less costly
> than what we had before.

OK, thanks again
(and to Thomas for gluing my query to those other two which got this
moving!)

Dave.
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-17 16:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajLO80kZmFnyR2mH@gallifrey>

On Wed, Jun 17, 2026 at 04:44:35PM +0000, Dr. David Alan Gilbert wrote:
> (It's a bit scary you're having to go around quite
> a few places and make similar fixes; I assume there
> are others that do similar things).

Yes, I understand that. I'm looking into a common way to validate this.
The md raid doesn't have this problem because they always call
bio_split_to_limits() first, but that's not an optimal thing to do for
dm raid in the normal read/write path, so perhaps a common checker needs
to happen generically in the block layer. Yeah, I know I removed the
previous higher level validation ... I'll try to find something less costly
than what we had before.


* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Dr. David Alan Gilbert @ 2026-06-17 16:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajLJiOwim7qBp_0P@kbusch-mbp>

* Keith Busch (kbusch@kernel.org) wrote:
> On Wed, Jun 17, 2026 at 03:33:55PM +0000, Dr. David Alan Gilbert wrote:
> > * Keith Busch (kbusch@kernel.org) wrote:
> > > On Tue, Jun 16, 2026 at 08:09:18PM +0000, Dr. David Alan Gilbert wrote:
> > > > root@dalek:/home/dg# lvcreate  --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> > > 
> > > So this is a subtle difference from your original report which ran
> > > lvcreate a little differently:
> > > 
> > >   # lvcreate --type mirror --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> > > 
> > > This patch series addresses problems with the original report with the
> > > "--type mirror" parameter, which uses dm-raid1.c instead of md/raid1.c.
> > 
> > Ah OK.
> > (I think I did say that somewhere, hmm ajFK5NXkxd6jU5zu@gallifrey ? )
> 
> I see. This will fix that setup:

And it does;
dg@dalek:~$ ./dbf
pread of 4096 said: -1 (Invalid argument)
dg@dalek:~$ ./dbf-write 
pwrite of 4096 said: -1 (Invalid argument)
dg@dalek:~$ ./dbf-joint 
pread of 4096 said: -1 (Invalid argument)
pwrite of 4096 said: -1 (Invalid argument)

and the log is clean.

Tested-by: Dr. David Alan Gilbert <linux@treblig.org>

(It's a bit scary you're having to go around quite
a few places and make similar fixes; I assume there
are others that do similar things).

Thanks again,

Dave

> 
> ---
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 5b9368bd9e700..17a5f0d98aacc 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -322,7 +322,9 @@ static void call_bio_endio(struct r1bio *r1_bio)
>  {
>  	struct bio *bio = r1_bio->master_bio;
>  
> -	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> +	if (test_bit(R1BIO_Invalid, &r1_bio->state))
> +		bio->bi_status = BLK_STS_INVAL;
> +	else if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>  		bio->bi_status = BLK_STS_IOERR;
>  
>  	bio_endio(bio);
> @@ -403,6 +405,8 @@ static void raid1_end_read_request(struct bio *bio)
>  		;
>  	} else if (!raid1_should_handle_error(bio)) {
>  		uptodate = 1;
> +		if (bio->bi_status == BLK_STS_INVAL)
> +			set_bit(R1BIO_Invalid, &r1_bio->state);
>  	} else {
>  		/* If all other devices have failed, we want to return
>  		 * the error upwards rather than fail the last device.
> @@ -519,6 +523,14 @@ static void raid1_end_write_request(struct bio *bio)
>  		 */
>  		r1_bio->bios[mirror] = NULL;
>  		to_put = bio;
> +		/*
> +		 * An invalid I/O (e.g. a misaligned bio rejected by the lower
> +		 * device) was ignored above rather than faulting the device.
> +		 * It is not a successful write, though, so report the error to
> +		 * the caller instead of completing the master bio as uptodate.
> +		 */
> +		if (bio->bi_status == BLK_STS_INVAL)
> +			set_bit(R1BIO_Invalid, &r1_bio->state);
>  		/*
>  		 * Do not set R1BIO_Uptodate if the current device is
>  		 * rebuilding or Faulty. This is because we cannot use
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index c98d43a7ae993..21e837db5b25e 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -184,6 +184,12 @@ enum r1bio_state {
>  	R1BIO_MadeGood,
>  	R1BIO_WriteError,
>  	R1BIO_FailFast,
> +/* An invalid I/O (e.g. a bio rejected by the lower device because it does
> + * not meet that device's dma_alignment) is not a device failure.  Report
> + * the error to the caller without faulting the device or retrying, and do
> + * not complete a write as if it had succeeded.
> + */
> +	R1BIO_Invalid,
>  };
>  
>  static inline int sector_to_idx(sector_t sector)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index cee5a253a281d..3cee9612be26d 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -323,7 +323,9 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
>  	struct r10conf *conf = r10_bio->mddev->private;
>  
>  	if (!test_and_set_bit(R10BIO_Returned, &r10_bio->state)) {
> -		if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
> +		if (test_bit(R10BIO_Invalid, &r10_bio->state))
> +			bio->bi_status = BLK_STS_INVAL;
> +		else if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>  			bio->bi_status = BLK_STS_IOERR;
>  		bio_endio(bio);
>  	}
> @@ -403,6 +405,8 @@ static void raid10_end_read_request(struct bio *bio)
>  		set_bit(R10BIO_Uptodate, &r10_bio->state);
>  	} else if (!raid1_should_handle_error(bio)) {
>  		uptodate = 1;
> +		if (bio->bi_status == BLK_STS_INVAL)
> +			set_bit(R10BIO_Invalid, &r10_bio->state);
>  	} else {
>  		/* If all other devices that store this block have
>  		 * failed, we want to return the error upwards rather
> @@ -523,6 +527,8 @@ static void raid10_end_write_request(struct bio *bio)
>  		 * before rdev->recovery_offset, but for simplicity we don't
>  		 * check this here.
>  		 */
> +		if (bio->bi_status == BLK_STS_INVAL)
> +			set_bit(R10BIO_Invalid, &r10_bio->state);
>  		if (test_bit(In_sync, &rdev->flags) &&
>  		    !test_bit(Faulty, &rdev->flags))
>  			set_bit(R10BIO_Uptodate, &r10_bio->state);
> diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
> index ec79d87fb92f6..a1adad3acafe1 100644
> --- a/drivers/md/raid10.h
> +++ b/drivers/md/raid10.h
> @@ -175,5 +175,11 @@ enum r10bio_state {
>  /* failfast devices did receive failfast requests. */
>  	R10BIO_FailFast,
>  	R10BIO_Discard,
> +/* An invalid I/O (e.g. a bio rejected by the lower device because it does not
> + * meet that device's queue_limits) is not a device failure. Report the error
> + * to the caller without faulting the device or retrying, and do not complete a
> + * write as if it had succeeded.
> + */
> +	R10BIO_Invalid,
>  };
>  #endif
> --
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-17 16:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajK-Y7uoK0QT5PVl@gallifrey>

On Wed, Jun 17, 2026 at 03:33:55PM +0000, Dr. David Alan Gilbert wrote:
> * Keith Busch (kbusch@kernel.org) wrote:
> > On Tue, Jun 16, 2026 at 08:09:18PM +0000, Dr. David Alan Gilbert wrote:
> > > root@dalek:/home/dg# lvcreate  --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> > 
> > So this is a subtle difference from your original report which ran
> > lvcreate a little differently:
> > 
> >   # lvcreate --type mirror --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> > 
> > This patch series addresses problems with the original report with the
> > "--type mirror" parameter, which uses dm-raid1.c instead of md/raid1.c.
> 
> Ah OK.
> (I think I did say that somewhere, hmm ajFK5NXkxd6jU5zu@gallifrey ? )

I see. This will fix that setup:

---
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 5b9368bd9e700..17a5f0d98aacc 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -322,7 +322,9 @@ static void call_bio_endio(struct r1bio *r1_bio)
 {
 	struct bio *bio = r1_bio->master_bio;
 
-	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
+	if (test_bit(R1BIO_Invalid, &r1_bio->state))
+		bio->bi_status = BLK_STS_INVAL;
+	else if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
 		bio->bi_status = BLK_STS_IOERR;
 
 	bio_endio(bio);
@@ -403,6 +405,8 @@ static void raid1_end_read_request(struct bio *bio)
 		;
 	} else if (!raid1_should_handle_error(bio)) {
 		uptodate = 1;
+		if (bio->bi_status == BLK_STS_INVAL)
+			set_bit(R1BIO_Invalid, &r1_bio->state);
 	} else {
 		/* If all other devices have failed, we want to return
 		 * the error upwards rather than fail the last device.
@@ -519,6 +523,14 @@ static void raid1_end_write_request(struct bio *bio)
 		 */
 		r1_bio->bios[mirror] = NULL;
 		to_put = bio;
+		/*
+		 * An invalid I/O (e.g. a misaligned bio rejected by the lower
+		 * device) was ignored above rather than faulting the device.
+		 * It is not a successful write, though, so report the error to
+		 * the caller instead of completing the master bio as uptodate.
+		 */
+		if (bio->bi_status == BLK_STS_INVAL)
+			set_bit(R1BIO_Invalid, &r1_bio->state);
 		/*
 		 * Do not set R1BIO_Uptodate if the current device is
 		 * rebuilding or Faulty. This is because we cannot use
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index c98d43a7ae993..21e837db5b25e 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -184,6 +184,12 @@ enum r1bio_state {
 	R1BIO_MadeGood,
 	R1BIO_WriteError,
 	R1BIO_FailFast,
+/* An invalid I/O (e.g. a bio rejected by the lower device because it does
+ * not meet that device's dma_alignment) is not a device failure.  Report
+ * the error to the caller without faulting the device or retrying, and do
+ * not complete a write as if it had succeeded.
+ */
+	R1BIO_Invalid,
 };
 
 static inline int sector_to_idx(sector_t sector)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index cee5a253a281d..3cee9612be26d 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -323,7 +323,9 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
 	struct r10conf *conf = r10_bio->mddev->private;
 
 	if (!test_and_set_bit(R10BIO_Returned, &r10_bio->state)) {
-		if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
+		if (test_bit(R10BIO_Invalid, &r10_bio->state))
+			bio->bi_status = BLK_STS_INVAL;
+		else if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
 			bio->bi_status = BLK_STS_IOERR;
 		bio_endio(bio);
 	}
@@ -403,6 +405,8 @@ static void raid10_end_read_request(struct bio *bio)
 		set_bit(R10BIO_Uptodate, &r10_bio->state);
 	} else if (!raid1_should_handle_error(bio)) {
 		uptodate = 1;
+		if (bio->bi_status == BLK_STS_INVAL)
+			set_bit(R10BIO_Invalid, &r10_bio->state);
 	} else {
 		/* If all other devices that store this block have
 		 * failed, we want to return the error upwards rather
@@ -523,6 +527,8 @@ static void raid10_end_write_request(struct bio *bio)
 		 * before rdev->recovery_offset, but for simplicity we don't
 		 * check this here.
 		 */
+		if (bio->bi_status == BLK_STS_INVAL)
+			set_bit(R10BIO_Invalid, &r10_bio->state);
 		if (test_bit(In_sync, &rdev->flags) &&
 		    !test_bit(Faulty, &rdev->flags))
 			set_bit(R10BIO_Uptodate, &r10_bio->state);
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index ec79d87fb92f6..a1adad3acafe1 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -175,5 +175,11 @@ enum r10bio_state {
 /* failfast devices did receive failfast requests. */
 	R10BIO_FailFast,
 	R10BIO_Discard,
+/* An invalid I/O (e.g. a bio rejected by the lower device because it does not
+ * meet that device's queue_limits) is not a device failure. Report the error
+ * to the caller without faulting the device or retrying, and do not complete a
+ * write as if it had succeeded.
+ */
+	R10BIO_Invalid,
 };
 #endif
--
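
The precedence this patch gives the new state bit can be modelled in a few
lines of userspace C. This is only a sketch: the enum names, bit values and
helper functions below are hypothetical stand-ins, not the md driver's
definitions. It shows that an "invalid" completion is reported as such even
when the request was also marked up to date, and that a request that is
neither invalid nor up to date still reports an I/O error.

```c
#include <assert.h>

/* Simplified stand-ins for the block status codes named in the patch. */
enum sketch_status { STS_OK, STS_IOERR, STS_INVAL };

/* Bit numbers modelled on R1BIO_Uptodate / R1BIO_Invalid (values arbitrary). */
enum sketch_state_bit { SK_UPTODATE, SK_INVALID };

/* Userspace equivalent of testing one bit in a state word. */
static int sketch_test_bit(int bit, unsigned long state)
{
	assert(bit >= 0 && bit < (int)(sizeof(state) * 8));
	return (state >> bit) & 1UL;
}

/* Pick the status reported to the submitter of the master bio. */
static enum sketch_status master_status(unsigned long state)
{
	/* Invalid I/O takes precedence: it is reported, never hidden. */
	if (sketch_test_bit(SK_INVALID, state))
		return STS_INVAL;
	/* Otherwise a request that never became up to date is an I/O error. */
	if (!sketch_test_bit(SK_UPTODATE, state))
		return STS_IOERR;
	return STS_OK;
}
```

The read path in the patch sets both bits for an invalid bio (uptodate = 1
plus the invalid flag), which is the case the first branch exists to catch.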


* [PATCH v3] blk-mq: bound blk_hctx_poll() to one jiffy
From: Anuj Gupta @ 2026-06-17 15:50 UTC (permalink / raw)
  To: axboe, hch, kbusch, lidiangang, changfengnan, tom.leiming,
	nj.shetty, joshi.k, anuj1072538
  Cc: linux-block, Anuj Gupta, Alok Rathore
In-Reply-To: <CGME20260617155734epcas5p128e015bf4a1653f386c1ad75b32b86bc@epcas5p1.samsung.com>

blk_hctx_poll() can busy-poll until a completion is found or
need_resched() becomes true. On preemptible kernels, the scheduler can
set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
return before the loop condition re-evaluates it. After the context
switch, the flag is cleared, so the poller can continue spinning instead
of returning to its caller.

This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
which holds the rcu_read_lock() while calling bio_poll(). If another
poller on the same polled queue drains the available completions, this
poller may repeatedly find no completions and remain inside the RCU
read-side critical section long enough to trigger RCU stall reports:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
task:fio state:R  running task     stack:0     pid:3961
Call Trace:
<TASK>
? nvme_poll+0x36/0xa0 [nvme]
? blk_hctx_poll+0x39/0x90
? blk_mq_poll+0x30/0x60
? bio_poll+0x87/0x170
? iocb_bio_iopoll+0x32/0x50
? io_uring_classic_poll+0x25/0x50
? io_do_iopoll+0x216/0x420
? __do_sys_io_uring_enter+0x2c7/0x7c0

Reproducible with:

fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
--numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
--registerfiles=1 --group_reporting --thread

Record a deadline of jiffies + 2 on entry and exit the loop once jiffies
reaches it. This bounds each blk_hctx_poll() invocation to at most two
jiffies while also covering the case where the reschedule flag was
cleared by the context switch before the loop condition could observe it.
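
As an illustration only, the bounded loop can be sketched in userspace C.
The tick source, the poll callback and all names below are hypothetical;
tick_before() mirrors the wraparound-safe comparison idea behind the
kernel's time_before(), using a signed difference so that a counter
overflow between entry and the deadline is handled correctly.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is before b" for a free-running tick counter. */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Fake tick source for the sketch: advances by one on every read. */
static unsigned long fake_ticks;
static unsigned long fake_clock(void)
{
	return fake_ticks++;
}

/* Fake poller that never finds a completion; counts how often it ran. */
static int poll_calls;
static int never_completes(void)
{
	poll_calls++;
	return 0;
}

/*
 * Poll until poll_fn reports completions or two ticks have elapsed.
 * Returns the number of completions found, or 0 if the deadline passed.
 */
static int bounded_poll(unsigned long (*read_ticks)(void), int (*poll_fn)(void))
{
	unsigned long deadline;
	int ret;

	assert(read_ticks && poll_fn);
	deadline = read_ticks() + 2;

	do {
		ret = poll_fn();
		if (ret > 0)
			return ret;
	} while (tick_before(read_ticks(), deadline));

	return 0;
}
```

Because the comparison uses the signed difference, the loop still
terminates when the deadline wraps past the maximum counter value.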

Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
Reviewed-by: Fengnan Chang <changfengnan@bytedance.com>
Suggested-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..e5850dc6c5d9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			 struct io_comp_batch *iob, unsigned int flags)
 {
 	int ret;
+	unsigned long timeout = jiffies + 2;

 	do {
 		ret = q->mq_ops->poll(hctx, iob);
@@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
 			break;
 		cpu_relax();
-	} while (!need_resched());
+	} while (!need_resched() && time_before(jiffies, timeout));

 	return 0;
 }
-- 
2.25.1


* Re: [PATCH v2] blk-mq: bound blk_hctx_poll() to one jiffy
From: Anuj gupta @ 2026-06-17 15:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Anuj Gupta, hch, kbusch, lidiangang, changfengnan, tom.leiming,
	nj.shetty, joshi.k, linux-block, Alok Rathore
In-Reply-To: <d71f62d7-f1f1-4169-bc05-1b354b94ef78@kernel.dk>

> I wonder if it'd be better to set this to jiffies + 2, just to avoid
> skipping after a single loop if jiffies changes right after this.
> Probably not a big deal, on average it should be fine. But also should
> not really matter if this is potentially spinning uselessly 10ms or 20ms
> at most, if HZ == 100. Similarly, this is also a misuse/misconfiguration
> if you end up having multiple pollers on the same queue. Yes it'll work,
> but it's a terrible idea for obvious reasons. Hence the patch is mostly
> about ensuring that bad case isn't TOO terrible. But you'd really want
> to sort out the app/config side of things in any case.

Thanks Jens, will use jiffies + 2 in v3.

On the config side: we first hit this with just 2 jobs on one poll queue
on the dmabuf-rw path[1], which is not upstream yet. The 32-job io_uring
read repro was only to confirm it's generic, not dmabuf-specific. Agreed,
the multiple-pollers-per-queue setup is not a good idea in the first place.
[1] https://lore.kernel.org/linux-block/cover.1777475843.git.asml.silence@gmail.com/

>
> --
> Jens Axboe


* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Dr. David Alan Gilbert @ 2026-06-17 15:33 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajK4YacvP7eQ-S2o@kbusch-mbp>

* Keith Busch (kbusch@kernel.org) wrote:
> On Tue, Jun 16, 2026 at 08:09:18PM +0000, Dr. David Alan Gilbert wrote:
> > root@dalek:/home/dg# lvcreate  --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> 
> So this is a subtle difference from your original report which ran
> lvcreate a little differently:
> 
>   # lvcreate --type mirror --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> 
> This patch series addresses problems with the original report with the
> "--type mirror" parameter, which uses dm-raid1.c instead of md/raid1.c.

Ah OK.
(I think I did say that somewhere, hmm ajFK5NXkxd6jU5zu@gallifrey ? )

> Knowing that detail makes this a trivial matter to fix now, so I'll send
> a separate patch for that. But this series should be good to go for the
> original issue on the legacy dm mirror.

Great!  Thanks again,

Dave

-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* [PATCH] virtio-blk: use little-endian types for the zoned fields
From: Michael Bommarito @ 2026-06-17 15:17 UTC (permalink / raw)
  To: Michael S . Tsirkin, Jason Wang
  Cc: Stefan Hajnoczi, Stefano Garzarella, Dmitry Fomichev,
	Damien Le Moal, Jens Axboe, Paolo Bonzini, virtualization,
	linux-block, linux-kernel

The zoned block-device fields in the virtio-blk header are typed
__virtio{32,64}, so their endianness follows VIRTIO_F_VERSION_1. The
zoned feature is only defined for VIRTIO 1.x devices, and the virtio
specification defines all of its fields as little-endian. Commit
b16a1756c716 ("virtio_blk: mark all zone fields LE") tagged them
__le* for exactly this reason, but commit f1ba4e674feb ("virtio-blk:
fix to match virtio spec") re-applied the reviewed version of the
original zoned series -- which predated b16a1756 -- and silently
restored the __virtio* typing together with the matching
virtio*_to_cpu() / virtio_cread() accessors in the driver.

Restore the little-endian typing for the zoned configuration-space
characteristics, the zone descriptor, the zone report header and the
ZONE_APPEND in-header sector, and read them with le*_to_cpu() and
virtio_cread_le() to match.

There is no functional change on any spec-compliant device: zoned
requires VIRTIO_F_VERSION_1, and for a VERSION_1 device
virtio*_to_cpu() is identical to le*_to_cpu(). The change makes the
uapi types describe the actual wire format and removes a latent
endianness mismatch for a (non-conformant) legacy device on a
big-endian guest.
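
To make the mismatch concrete, here is a small userspace sketch (the
function names are hypothetical, not kernel helpers). It decodes the same
eight wire bytes once as little-endian, which is what the specification
defines for the zoned fields, and once as big-endian, which is how a
native-endian accessor would read them for a legacy device on a big-endian
guest.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 64-bit little-endian wire value, independent of host byte order. */
static uint64_t le64_decode(const uint8_t wire[8])
{
	uint64_t v = 0;

	assert(wire);
	/* wire[7] is the most significant byte in little-endian layout. */
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | wire[i];
	return v;
}

/* Decode the same bytes as big-endian, for comparison only. */
static uint64_t be64_decode(const uint8_t wire[8])
{
	uint64_t v = 0;

	assert(wire);
	/* wire[0] is the most significant byte in big-endian layout. */
	for (int i = 0; i < 8; i++)
		v = (v << 8) | wire[i];
	return v;
}
```

For the bytes 00 08 00 00 00 00 00 00, the little-endian reading is sector
2048, while the big-endian reading is 0x0008000000000000.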

Fixes: f1ba4e674feb ("virtio-blk: fix to match virtio spec")
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
Testing:
 - Builds with no new warnings; sparse endian-clean (C=2,
   __CHECK_ENDIAN__, CONFIG_BLK_DEV_ZONED=y) both before and after.
 - Booted under QEMU with a host-managed zoned device exposed through
   virtio-blk. Zone revalidation, blkzone report and a sequential
   write / write-pointer check return correct values; blktests zbd
   device tests 001-006 (sysfs+ioctl, report zone, reset, write split,
   write ordering, revalidate) pass, with results identical before and
   after this change -- expected, since on a VIRTIO_F_VERSION_1 device
   virtio*_to_cpu() == le*_to_cpu().

 drivers/block/virtio_blk.c      | 38 +++++++++++++++------------------
 include/uapi/linux/virtio_blk.h | 18 ++++++++--------
 2 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b1c9a27fe00f3..5532cfbde7bfe 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -99,7 +99,7 @@ struct virtblk_req {
 		 * be the last byte.
 		 */
 		struct {
-			__virtio64 sector;
+			__le64 sector;
 			u8 status;
 		} zone_append;
 	} in_hdr;
@@ -335,14 +335,12 @@ static inline void virtblk_request_done(struct request *req)
 {
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	blk_status_t status = virtblk_result(virtblk_vbr_status(vbr));
-	struct virtio_blk *vblk = req->mq_hctx->queue->queuedata;
 
 	virtblk_unmap_data(req, vbr);
 	virtblk_cleanup_cmd(req);
 
 	if (req_op(req) == REQ_OP_ZONE_APPEND)
-		req->__sector = virtio64_to_cpu(vblk->vdev,
-						vbr->in_hdr.zone_append.sector);
+		req->__sector = le64_to_cpu(vbr->in_hdr.zone_append.sector);
 
 	blk_mq_end_request(req, status);
 }
@@ -589,13 +587,13 @@ static int virtblk_parse_zone(struct virtio_blk *vblk,
 {
 	struct blk_zone zone = { };
 
-	zone.start = virtio64_to_cpu(vblk->vdev, entry->z_start);
+	zone.start = le64_to_cpu(entry->z_start);
 	if (zone.start + vblk->zone_sectors <= get_capacity(vblk->disk))
 		zone.len = vblk->zone_sectors;
 	else
 		zone.len = get_capacity(vblk->disk) - zone.start;
-	zone.capacity = virtio64_to_cpu(vblk->vdev, entry->z_cap);
-	zone.wp = virtio64_to_cpu(vblk->vdev, entry->z_wp);
+	zone.capacity = le64_to_cpu(entry->z_cap);
+	zone.wp = le64_to_cpu(entry->z_wp);
 
 	switch (entry->z_type) {
 	case VIRTIO_BLK_ZT_SWR:
@@ -687,8 +685,7 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
 		if (ret)
 			goto fail_report;
 
-		nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
-			   nr_zones);
+		nz = min_t(u64, le64_to_cpu(report->nr_zones), nr_zones);
 		if (!nz)
 			break;
 
@@ -698,8 +695,7 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
 			if (ret)
 				goto fail_report;
 
-			sector = virtio64_to_cpu(vblk->vdev,
-						 report->zones[i].z_start) +
+			sector = le64_to_cpu(report->zones[i].z_start) +
 				 vblk->zone_sectors;
 			zone_idx++;
 		}
@@ -725,18 +721,18 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
 
 	lim->features |= BLK_FEAT_ZONED;
 
-	virtio_cread(vdev, struct virtio_blk_config,
-		     zoned.max_open_zones, &v);
+	virtio_cread_le(vdev, struct virtio_blk_config,
+			zoned.max_open_zones, &v);
 	lim->max_open_zones = v;
 	dev_dbg(&vdev->dev, "max open zones = %u\n", v);
 
-	virtio_cread(vdev, struct virtio_blk_config,
-		     zoned.max_active_zones, &v);
+	virtio_cread_le(vdev, struct virtio_blk_config,
+			zoned.max_active_zones, &v);
 	lim->max_active_zones = v;
 	dev_dbg(&vdev->dev, "max active zones = %u\n", v);
 
-	virtio_cread(vdev, struct virtio_blk_config,
-		     zoned.write_granularity, &wg);
+	virtio_cread_le(vdev, struct virtio_blk_config,
+			zoned.write_granularity, &wg);
 	if (!wg) {
 		dev_warn(&vdev->dev, "zero write granularity reported\n");
 		return -ENODEV;
@@ -750,8 +746,8 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
 	 * virtio ZBD specification doesn't require zones to be a power of
 	 * two sectors in size, but the code in this driver expects that.
 	 */
-	virtio_cread(vdev, struct virtio_blk_config, zoned.zone_sectors,
-		     &vblk->zone_sectors);
+	virtio_cread_le(vdev, struct virtio_blk_config, zoned.zone_sectors,
+			&vblk->zone_sectors);
 	if (vblk->zone_sectors == 0 || !is_power_of_2(vblk->zone_sectors)) {
 		dev_err(&vdev->dev,
 			"zoned device with non power of two zone size %u\n",
@@ -767,8 +763,8 @@ static int virtblk_read_zoned_limits(struct virtio_blk *vblk,
 		lim->max_hw_discard_sectors = 0;
 	}
 
-	virtio_cread(vdev, struct virtio_blk_config,
-		     zoned.max_append_sectors, &v);
+	virtio_cread_le(vdev, struct virtio_blk_config,
+			zoned.max_append_sectors, &v);
 	if (!v) {
 		dev_warn(&vdev->dev, "zero max_append_sectors reported\n");
 		return -ENODEV;
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 3744e4da1b2a7..5af2a0300bb9d 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -140,11 +140,11 @@ struct virtio_blk_config {
 
 	/* Zoned block device characteristics (if VIRTIO_BLK_F_ZONED) */
 	struct virtio_blk_zoned_characteristics {
-		__virtio32 zone_sectors;
-		__virtio32 max_open_zones;
-		__virtio32 max_active_zones;
-		__virtio32 max_append_sectors;
-		__virtio32 write_granularity;
+		__le32 zone_sectors;
+		__le32 max_open_zones;
+		__le32 max_active_zones;
+		__le32 max_append_sectors;
+		__le32 write_granularity;
 		__u8 model;
 		__u8 unused2[3];
 	} zoned;
@@ -241,11 +241,11 @@ struct virtio_blk_outhdr {
  */
 struct virtio_blk_zone_descriptor {
 	/* Zone capacity */
-	__virtio64 z_cap;
+	__le64 z_cap;
 	/* The starting sector of the zone */
-	__virtio64 z_start;
+	__le64 z_start;
 	/* Zone write pointer position in sectors */
-	__virtio64 z_wp;
+	__le64 z_wp;
 	/* Zone type */
 	__u8 z_type;
 	/* Zone state */
@@ -254,7 +254,7 @@ struct virtio_blk_zone_descriptor {
 };
 
 struct virtio_blk_zone_report {
-	__virtio64 nr_zones;
+	__le64 nr_zones;
 	__u8 reserved[56];
 	struct virtio_blk_zone_descriptor zones[];
 };
-- 
2.53.0


