Linux block layer


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Ming Lei @ 2026-06-12 11:06 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: linux-block, Jens Axboe, Nilay Shroff
In-Reply-To: <aivMxPCd305WbBsk@shinmob>

On Fri, Jun 12, 2026 at 06:47:50PM +0900, Shin'ichiro Kawasaki wrote:
> On Jun 11, 2026 / 06:22, Ming Lei wrote:
> > Hi Shin'ichiro,
> 
> Hi Ming, thanks for the comments.
> 
> > 
> > On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> > > I observed that the blktests test case block/005 hangs on a specific
> > > server hardware using a specific HDD as a block device. During the test
> > > case run, the kernel reported a KASAN null-ptr-deref (and other memory
> > > corruption symptoms) [2]. This failure looked sporadic and hardware-
> > > dependent.
> > > 
> > > From the kernel message, I noticed that udev-worker wrote to the
> > > queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> > > The test case block/005 also wrote to the same sysfs attribute, which
> > 
> > sysfs write is supposed to be serialized...
> 
> I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
> I found that the elevator_change() call is guarded with the rw_semaphore
> "set->update_nr_hwq_lock", but the guard takes the reader lock, not the writer
> lock. This does not serialize the sysfs writes.

Please see kernfs_fop_write_iter(), in which a mutex is held before calling
->write().

> 
> I tried the patch below to replace the reader lock with the writer lock. In a
> quick trial, it appears to work: the kernel message is no longer observed and
> the new test case does not cause hangs. I will do further testing to confirm
> that this change does not trigger new lockdep WARNs. Assuming it has no such
> side effects, I hope this fix approach is acceptable. It does not add a new
> lock, so I think it is the better option.
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index 3bcd37c2aa34..b03185a217ff 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>  	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
>  	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
>  	 */
> -	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
> +	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
>  		ret = -EBUSY;
>  		goto out;
>  	}
> @@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>  	} else {
>  		ret = -ENOENT;
>  	}
> -	up_read(&set->update_nr_hwq_lock);
> +	up_write(&set->update_nr_hwq_lock);
>  
>  out:
>  	if (ctx.type)
> 
> [...]
> 
> > blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
> > fix could be to check for and avoid the null-ptr-deref.
> 
> Actually, null-ptr-deref is only one of the failure symptoms. A KASAN
> slab-use-after-free is also observed [3], so I suspect that adding null checks
> may not be enough.
> 
> > Adding a new lock should usually be the last resort, especially since this one
> > is depended on by queue freeze.
> 
> Got it, thanks.
> 
> 
> [3] KASAN slab-use-after-free

Then you need to figure out the exact slab type and check whether the pointer is
cleared during free.

Anyway, there is a guard already, so I do not see a reason to add a new lock to
cover it.


Thanks,
Ming


* Re: [LSF/MM/BPF RFC PATCH 00/13]
From: Haris Iqbal @ 2026-06-12 10:36 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: linux-block, linux-rdma, linux-kernel, axboe, bvanassche, hch,
	jgg, jinpu.wang
In-Reply-To: <20260611115902.GO327369@unreal>

On Thu, Jun 11, 2026 at 1:59 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, May 27, 2026 at 02:44:08PM +0200, Haris Iqbal wrote:
> > On Tue, May 12, 2026 at 12:34 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Tue, May 05, 2026 at 09:46:12AM +0200, Md Haris Iqbal wrote:
> > > > Following a conversation with Bart yesterday, I am sending the RMR+BRMR
> > > > code through patch for easier review.
> > > >
> > > > The patches apply over the for-next branch of the block tree over commit
> > > > 07dfa981ca3
> > > >
> > > > For context,
> > > > RMR (Reliable Multicast over RTRS) is a kernel module that provides
> > > > active-active block-level replication over RDMA. It guarantees delivery
> > > > of IO to a group of storage nodes and handles resynchronization of data
> > > > directly between storage nodes without involving the compute client.
> > > >
> > > > BRMR (Block device over RMR) sits on top of RMR and exposes a standard
> > > > Linux block device (/dev/brmrX) backed by an RMR pool. Together, RMR and
> > > > BRMR provide a single-hop replication and resynchronization solution for
> > > > RDMA-connected storage clusters.
> > > >
> > > > My session is on Wednesday, at 12 in the storage room (Istanbul).
> > >
> > > To summarize the discussion:
> > >
> > > 1. Move as much logic as possible into the block layer; RDMA should serve
> > >    strictly as a transport.
> > > 2. Identify another in‑kernel user of this functionality, and add support for
> > >    it if required. At least accommodate potential users elsewhere in the
> > >    kernel.
> >
> > Thanks for the summary Leon.
> >
> > The main logic which handles multicast/replication legs, missed I/O
> > tracking, re-synchronization, etc are the core parts of RMR.
> > If we move those to a separate module, there won't be much left in
> > RMR. RMR already uses RTRS from the RDMA subsystem as transport.
> >
> > Having said that, I am not against moving RMR out of the RDMA layer.
> > It can serve as a reliable replication service/library for any other
> > user in the kernel to use.
> > Which subsystem (block or something else) would be a better fit then,
> > can be discussed.
> >
> > PS: Would this be a good candidate for a session/discussion in the upcoming LPC?
>
> Probably yes.
>
> Thanks

Thanks Leon. I'll submit the abstract through the portal.
Do you think the topic is better suited to the Refereed track or the Kernel Summit?

>
> >
> > >
> > > Thanks


* Re: [PATCH v4 6/8] Bluetooth: hci_sync: Add NVMEM-backed BD address retrieval
From: Loic Poulain @ 2026-06-12 10:00 UTC (permalink / raw)
  To: Dmitry Baryshkov
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan, linux-mmc, devicetree,
	linux-kernel, linux-arm-msm, linux-block, linux-wireless, ath10k,
	linux-bluetooth, netdev, daniel, Bartosz Golaszewski
In-Reply-To: <sy2ofvdbcxspxtmfdavjvdz7oes5ieuep4znf4ayknmuwhrlgk@7lp3bkegaeif>

On Fri, Jun 12, 2026 at 11:11 AM Dmitry Baryshkov
<dmitry.baryshkov@oss.qualcomm.com> wrote:
>
> On Tue, Jun 09, 2026 at 09:52:31AM +0200, Loic Poulain wrote:
> > Some devices store the Bluetooth BD address in non-volatile
> > memory, which can be accessed through the NVMEM framework.
> > Similar to Ethernet or WiFi MAC addresses, add support for
> > reading the BD address from a 'local-bd-address' NVMEM cell.
> >
> > As with the device-tree provided BD address, add a quirk to
> > indicate whether a device or platform should attempt to read
> > the address from NVMEM when no valid in-chip address is present.
> > Also add a quirk to indicate if the address is stored in
> > big-endian byte order.
> >
> > Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> > Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > ---
> >  include/net/bluetooth/hci.h | 18 ++++++++++++++++++
> >  net/bluetooth/hci_sync.c    | 39 ++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 56 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
> > index 572b1c620c5d653a1fe10b26c1b0ba33e8f4968f..7686466d1109253b0d75edeb5f6a99fb98ce4cc6 100644
> > --- a/include/net/bluetooth/hci.h
> > +++ b/include/net/bluetooth/hci.h
> > @@ -164,6 +164,24 @@ enum {
> >        */
> >       HCI_QUIRK_BDADDR_PROPERTY_BROKEN,
> >
> > +     /* When this quirk is set, the public Bluetooth address
> > +      * initially reported by HCI Read BD Address command
> > +      * is considered invalid. The public BD Address can be
> > +      * retrieved via a 'local-bd-address' NVMEM cell.
>
> Why do we need a quirk here? Can't we always assume that if there is an
> NVMEM cell, it contains a correct address, even if HCI command returned
> a seemingly-sensible one?

The pattern follows HCI_QUIRK_USE_BDADDR_PROPERTY: that quirk indicates
that the address returned by the HCI Read BD Address command is
invalid and should be overridden using a fwnode property. Without the
quirk, even a valid fwnode-provided address is ignored. So this is
primarily done to align with that established behavior, although
whether that design choice is ideal is a good question.

This also raises the question of why an explicit HCI_QUIRK_USE_* flag
is required to allow reading from NVMEM when the controller-provided
address is known to be invalid, rather than attempting to use any
available backend (fwnode property or NVMEM). But this remains
consistent with the behavior established by the fwnode-based quirk.

So I think these aspects could be revisited in a Bluetooth follow-up
series if there is interest in reworking the overall address fallback
design.

Regards,
Loic



>
> > +      *
> > +      * This quirk can be set before hci_register_dev is called or
> > +      * during the hdev->setup vendor callback.
> > +      */
> > +     HCI_QUIRK_USE_BDADDR_NVMEM,
> > +
> > +     /* When this quirk is set, the Bluetooth Device Address provided by
> > +      * the 'local-bd-address' NVMEM is stored in big-endian order.
> > +      *
> > +      * This quirk can be set before hci_register_dev is called or
> > +      * during the hdev->setup vendor callback.
> > +      */
> > +     HCI_QUIRK_BDADDR_NVMEM_BE,
>
> Also, is this necessary? Are there devices which store the address in the
> wrong format in the NVMEM?
>
> > +
> >       /* When this quirk is set, the duplicate filtering during
> >        * scanning is based on Bluetooth devices addresses. To allow
> >        * RSSI based updates, restart scanning if needed.
>
> --
> With best wishes
> Dmitry


* Re: [PATCH v2] block: invalidate cached plug timestamp after task switch
From: Usama Arif @ 2026-06-12 10:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, rostedt,
	vincent.guittot, vschneid, shakeel.butt, hannes, riel,
	kernel-team, stable
In-Reply-To: <20260612094520.GA42921@noisy.programming.kicks-ass.net>



On 12/06/2026 10:45, Peter Zijlstra wrote:
> On Fri, Jun 12, 2026 at 02:40:42AM -0700, Usama Arif wrote:
> 
>> +static __always_inline void blk_plug_invalidate_ts(void)
>>  {
>> +	if (unlikely(current->flags & PF_BLOCK_TS)) {
>> +		struct blk_plug *plug = current->plug;
>>  
>> +		if (plug)
>> +			plug->cur_ktime = 0;
>> +		current->flags &= ~PF_BLOCK_TS;
>> +	}
>>  }
> 
> If you can guarantee PF_BLOCK_TS is only ever set when current->plug,
> this can be reduced further.

Thanks for the reviews!

The invariant holds at set time (the only set in blk_time_get_ns() is
gated by if (!plug)) and through the only legitimate plug clear in
blk_finish_plug() (which goes through __blk_flush_plug() that clears
PF_BLOCK_TS first).

However, copy_process() sets p->plug = NULL for the child but doesn't
strip PF_BLOCK_TS from the inherited flags.
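
The fork case above can be modelled in plain userspace C. This is only a sketch: `toy_task`, `toy_plug`, `toy_fork()` and the flag value `TOY_PF_BLOCK_TS` are hypothetical stand-ins, not the kernel's `task_struct`, `blk_plug`, `copy_process()` or `PF_BLOCK_TS`. It shows that "flag set implies plug present" breaks in the child unless the flag is masked at fork time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value; the real PF_BLOCK_TS lives in the kernel headers. */
#define TOY_PF_BLOCK_TS 0x1u

struct toy_plug {
	unsigned long long cur_ktime;	/* cached timestamp, 0 means "not cached" */
};

struct toy_task {
	unsigned int flags;
	struct toy_plug *plug;
};

/* Model of fork: flags are inherited minus clear_mask, the plug is not. */
static void toy_fork(const struct toy_task *parent, struct toy_task *child,
		     unsigned int clear_mask)
{
	child->flags = parent->flags & ~clear_mask;
	child->plug = NULL;
}

/* The invariant under discussion: flag set implies a plug is present. */
static bool toy_invariant_holds(const struct toy_task *t)
{
	return !(t->flags & TOY_PF_BLOCK_TS) || t->plug != NULL;
}

/* Fork a parent that has a plug and a cached timestamp; check the child. */
bool child_invariant_after_fork(unsigned int clear_mask)
{
	struct toy_plug plug = { .cur_ktime = 12345 };
	struct toy_task parent = { .flags = TOY_PF_BLOCK_TS, .plug = &plug };
	struct toy_task child;

	toy_fork(&parent, &child, clear_mask);
	return toy_invariant_holds(&child);
}
```

In this model, `child_invariant_after_fork(0)` is false (the child carries the flag with a NULL plug), while `child_invariant_after_fork(TOY_PF_BLOCK_TS)` is true, which mirrors the reasoning for either keeping the NULL check or clearing the flag at fork.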

I think the if (plug) check is a good defensive measure, but I can also do the
below if you prefer.

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1c1fd31ce187..c285a4d9837d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1219,10 +1219,7 @@ static inline void blk_flush_plug(struct blk_plug *plug, bool async)
 static __always_inline void blk_plug_invalidate_ts(void)
 {
        if (unlikely(current->flags & PF_BLOCK_TS)) {
-               struct blk_plug *plug = current->plug;
-
-               if (plug)
-                       plug->cur_ktime = 0;
+               current->plug->cur_ktime = 0;
                current->flags &= ~PF_BLOCK_TS;
        }
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 892a95214c54..9a062149e0d8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2167,7 +2167,8 @@ __latent_entropy struct task_struct *copy_process(
                goto bad_fork_cleanup_count;
 
        delayacct_tsk_init(p);  /* Must remain after dup_task_struct() */
-       p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE | PF_NO_SETAFFINITY);
+       p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE | PF_NO_SETAFFINITY |
+                     PF_BLOCK_TS);
        p->flags |= PF_FORKNOEXEC;
        INIT_LIST_HEAD(&p->children);
        INIT_LIST_HEAD(&p->sibling);


* [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781253428.git.wqu@suse.com>

If a direct IO requires bounce pages for a stable buffer, it always has to
allocate memory. Both bio_iov_iter_bounce_write() and
bio_iov_iter_bounce_read() allocate pages using GFP_KERNEL, which can sleep
and thus break the NOWAIT requirement.

So reject direct IOs that combine NOWAIT and BOUNCE in
iomap_dio_bio_iter().
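
As a standalone illustration of that rule, the check can be sketched in userspace C. The flag bits `TOY_IOMAP_NOWAIT` and `TOY_IOMAP_DIO_BOUNCE` and the function name are hypothetical; only the shape of the test (both flags set means -EAGAIN) is taken from the description above.

```c
#include <errno.h>

/* Hypothetical flag bits for illustration; the kernel's values differ. */
#define TOY_IOMAP_NOWAIT	(1u << 0)
#define TOY_IOMAP_DIO_BOUNCE	(1u << 1)

/*
 * Return -EAGAIN when a non-blocking request would need a bounce buffer,
 * because allocating that buffer may sleep; return 0 otherwise.
 */
int toy_check_nowait_bounce(unsigned int iter_flags, unsigned int dio_flags)
{
	if ((iter_flags & TOY_IOMAP_NOWAIT) && (dio_flags & TOY_IOMAP_DIO_BOUNCE))
		return -EAGAIN;
	return 0;
}
```

Only the combination is rejected in this sketch: a NOWAIT request without bouncing, or a bounced request that is allowed to block, both return 0.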

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/iomap/direct-io.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b36ee619cdcd..d1601122f0b5 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -412,6 +412,10 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	unsigned int alignment;
 	ssize_t ret = 0;
 
+	/* Bounced direct IO will need to allocate memory, breaking NOWAIT flag. */
+	if (unlikely(iter->flags & IOMAP_NOWAIT && dio->flags & IOMAP_DIO_BOUNCE))
+		return -EAGAIN;
+
 	/*
 	 * File systems that write out of place and always allocate new blocks
 	 * need each bio to be block aligned as that's the unit of allocation.
-- 
2.54.0



* [PATCH v3 2/4] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781253428.git.wqu@suse.com>

For the upcoming use of IOMAP_DIO_BOUNCE in btrfs, btrfs sets
iov_iter::nofault to prevent a deadlock when a page fault is needed to
read out the buffer.

However, bio_iov_iter_bounce_write() doesn't respect the iov_iter::nofault
flag and just calls a plain copy_from_iter(), so it can still trigger a
page fault and cause a deadlock in btrfs.

Fix it by using copy_folio_from_iter_atomic() if the nofault flag is
set, and copy_folio_from_iter() otherwise.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index b33ff69bb722..01bb76d9717c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1335,7 +1335,10 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);
 
-		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (iter->nofault)
+			copied = copy_folio_from_iter_atomic(folio, 0, this_len, iter);
+		else
+			copied = copy_folio_from_iter(folio, 0, this_len, iter);
 		if (copied < this_len) {
 			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len + copied);
 			bio_free_folios(bio);
-- 
2.54.0



* [PATCH v3 4/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781253428.git.wqu@suse.com>

Previously btrfs forced direct writes to fall back to buffered ones if the
inode has data checksums or the profile has duplication.

That fallback prevents the content from being modified during the write, in
which case the final content could mismatch the checksum or the other mirrors.

It brings a pretty huge performance cost, which already caused some
concern at that time.

But later upstream commit c9d114846b38 ("iomap: add a flag to bounce
buffer direct I/O") introduced a new method that copies the content into
new pages and does all the operations on the newly allocated pages.

So let btrfs utilize the new flag for direct writes if we require
stable folios.

Here is a quick benchmark, using the following fio setup:

 fio --name=randwrite --filename $mnt/foobar --ioengine=libaio --size=4G \
     --rw=randwrite --iodepth=64 --runtime=60 --time_based --direct=1 \
     --bs=$blocksize

Unit is MiB/s.

 Blocksize | Zero-copy (*) | Buffered |   Bounce
-----------+---------------+----------+-----------
        4K |          35.1 |     17.1 |      33.8
       64K |           522 |      251 |       492

*: This is done by reverting the commit 968f19c5b1b7 ("btrfs: always
   fallback to buffered write if the inode requires checksum")

Although with page bouncing the performance is only around 95% of
true zero-copy, it's still almost double the performance of the buffered
fallback.
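
The "around 95%" and "almost double" figures follow directly from the table. A minimal helper (the name `throughput_ratio` is made up for this note) makes the arithmetic explicit:

```c
/* Throughput of configuration a relative to configuration b (both in MiB/s). */
double throughput_ratio(double a_mib_s, double b_mib_s)
{
	return a_mib_s / b_mib_s;
}
```

With the table's numbers, bounce versus zero-copy is 33.8 / 35.1, roughly 0.96, at 4K and 492 / 522, roughly 0.94, at 64K. Bounce versus buffered is 33.8 / 17.1, roughly 1.98, at 4K and 492 / 251, roughly 1.96, at 64K.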

There will be a small change in behavior: since we're using the
IOMAP_DIO_BOUNCE flag to allocate new folios, NOWAIT direct IOs will
fail immediately.

So for NOWAIT direct IOs, NODATASUM and RAID0/SINGLE profiles are still
required.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 53 ++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index e566a60b0ce5..bbf94056a874 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -818,13 +818,36 @@ static ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
 			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
 }
 
+static bool need_stable_write(struct btrfs_inode *inode)
+{
+	const u64 data_profile = btrfs_data_alloc_profile(inode->root->fs_info) &
+				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+	/* Data checksum requires stable buffer. */
+	if (!(inode->flags & BTRFS_INODE_NODATASUM))
+		return true;
+	/*
+	 * Any profile with mirror/parity will require stable buffer.
+	 * Otherwise the mirror may differ from each other.
+	 *
+	 * Thus only SINGLE and RAID0 doesn't require stable buffer.
+	 */
+	if (data_profile != 0 && data_profile != BTRFS_BLOCK_GROUP_RAID0)
+		return true;
+	return false;
+}
+
 static struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 					 size_t done_before)
 {
 	struct btrfs_dio_data data = { 0 };
+	unsigned int dio_flags = IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED;
+
+	if (need_stable_write(BTRFS_I(file_inode(iocb->ki_filp))))
+		dio_flags |= IOMAP_DIO_BOUNCE;
 
 	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
+			      dio_flags, &data, done_before);
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -853,8 +876,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio;
-	const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
-				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -868,16 +889,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;
 
-	/*
-	 * If our data profile has duplication (either extra mirrors or RAID56),
-	 * we can not trust the direct IO buffer, the content may change during
-	 * writeback and cause different contents written to different mirrors.
-	 *
-	 * Thus only RAID0 and SINGLE can go true zero-copy direct IO.
-	 */
-	if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
-		goto buffered;
-
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -918,22 +929,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
 		goto buffered;
 	}
-	/*
-	 * We can't control the folios being passed in, applications can write
-	 * to them while a direct IO write is in progress.  This means the
-	 * content might change after we calculated the data checksum.
-	 * Therefore we can end up storing a checksum that doesn't match the
-	 * persisted data.
-	 *
-	 * To be extra safe and avoid false data checksum mismatch, if the
-	 * inode requires data checksum, just fallback to buffered IO.
-	 * For buffered IO we have full control of page cache and can ensure
-	 * no one is modifying the content during writeback.
-	 */
-	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
-		goto buffered;
-	}
 
 	/*
 	 * The iov_iter can be mapped to the same file range we are writing to.
-- 
2.54.0



* [PATCH v3 1/4] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781253428.git.wqu@suse.com>

For the upcoming IOMAP_DIO_BOUNCE flag usage inside btrfs, it's pretty
easy to hit a short copy inside bio_iov_iter_bounce_write().

This is because btrfs disables page faults to avoid a certain deadlock
during direct writes, and instead manually faults in the pages and then
retries.

And inside bio_iov_iter_bounce_write(), if we hit a short copy, we
don't revert the iov_iter, which can cause problems like unexpected
garbage on the next retry.

Revert the iov_iter after a short copy.

One thing to note is that the folio is allocated and then immediately
queued into the bio, so the proper revert size is
(bi_size - this_len + copied).
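
The revert size can be checked with a small userspace model. This is a sketch under stated assumptions: the iterator is reduced to a byte counter, the bio is assumed to be empty when the loop starts, and both function names are invented for illustration. It is not the kernel code.

```c
#include <stddef.h>

/*
 * Bytes to hand back to the iterator after a short copy. bi_size already
 * counts the full this_len of the failing chunk, because the folio is
 * queued into the bio before the copy is attempted.
 */
size_t short_copy_revert_len(size_t bi_size, size_t this_len, size_t copied)
{
	return bi_size - this_len + copied;
}

/*
 * Toy run: full_chunks chunks copy completely, then one more chunk copies
 * only `copied` bytes. Returns the iterator position after the revert.
 */
size_t iter_pos_after_short_copy(size_t full_chunks, size_t chunk_len,
				 size_t copied)
{
	size_t consumed = 0;	/* bytes the iterator has advanced */
	size_t bi_size = 0;	/* bytes queued in the (initially empty) bio */
	size_t i;

	for (i = 0; i < full_chunks; i++) {
		bi_size += chunk_len;
		consumed += chunk_len;
	}

	/* The failing chunk: queued in full, copied only partially. */
	bi_size += chunk_len;
	consumed += copied;

	consumed -= short_copy_revert_len(bi_size, chunk_len, copied);
	return consumed;
}
```

For two full 4096-byte chunks followed by a 100-byte short copy, the revert length is 12288 - 4096 + 100 = 8292, exactly the 8192 + 100 bytes the iterator had advanced, so the model's iterator ends up back at position 0.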

Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f10900b3f42..b33ff69bb722 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1321,6 +1321,7 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,

 	do {
 		size_t this_len = min(total_len, SZ_1M);
+		size_t copied;
 		struct folio *folio;

 		if (this_len > minsize * 2)
@@ -1334,12 +1335,12 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);

-		if (copy_from_iter(folio_address(folio), this_len, iter) !=
-				this_len) {
+		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (copied < this_len) {
+			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len + copied);
 			bio_free_folios(bio);
 			return -EFAULT;
 		}
-
 		total_len -= this_len;
 	} while (total_len && bio->bi_vcnt < bio->bi_max_vecs);

-- 
2.54.0


* [PATCH v3 0/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

[CHANGELOG]
v3:
- Fix a bug in the error handling of bio_iov_iter_bounce_write()
  Which can lead to a generic/708 failure on btrfs.

- Respect the nofault flag in bio_iov_iter_bounce_write()
  To avoid btrfs specific deadlocks.

- Reject NOWAIT and BOUNCE direct IOs
  Since BOUNCE always allocates pages using GFP_KERNEL, which can sleep
  and break the NOWAIT requirement, such a combination has to be rejected.

v2:
- Rework the comment in btrfs_dio_write()

Commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the
inode requires checksum") solved the csum mismatch caused by unstable
direct IO buffers, but it has a pretty hefty performance penalty.

Meanwhile upstream iomap has introduced the IOMAP_DIO_BOUNCE flag to get
stable buffers without falling back to buffered IOs.

Using that flag, btrfs can reach 95% of the original zero-copy direct IO
performance, almost 2x the current buffered fallback performance.

However, during my tests I hit several bugs related to iomap that
can lead to direct IO test case failures:

- generic/708
  Garbage shows up at the end of the writes, caused by a bug in the error
  handling of a short copy.

  Fixed in the first patch.

- Deadlock if using the page cache as direct IO buffer
  This is because bio_iov_iter_bounce_write() doesn't respect the
  iov_iter::nofault flag.

  Fixed in the second patch.

- Possible NOWAIT and BOUNCE conflicts
  The BOUNCE flag for both reads and writes will allocate new folios using
  GFP_KERNEL, which can sleep and break the NOWAIT requirement.

  Reject such a combination in iomap_dio_bio_iter() directly in the third
  patch.

And the final one enables btrfs to use the IOMAP_DIO_BOUNCE flag, so
that even with data checksums we do not need to fall back to buffered IO
and can reclaim most of the dropped direct IO performance.

Qu Wenruo (4):
  block: revert the iov_iter after a short copy in
    bio_iov_iter_bounce_write()
  block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
  iomap: reject NOWAIT and BOUNCE direct IOs
  btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered
    IO

 block/bio.c          | 10 ++++++---
 fs/btrfs/direct-io.c | 53 ++++++++++++++++++++------------------------
 fs/iomap/direct-io.c |  4 ++++
 3 files changed, 35 insertions(+), 32 deletions(-)

-- 
2.54.0


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Shin'ichiro Kawasaki @ 2026-06-12  9:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Nilay Shroff
In-Reply-To: <aiqaXfTqCLMu2DwF@fedora>

On Jun 11, 2026 / 06:22, Ming Lei wrote:
> Hi Shin'ichiro,

Hi Ming, thanks for the comments.

> 
> On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> > I observed that the blktests test case block/005 hangs on a specific
> > server hardware using a specific HDD as a block device. During the test
> > case run, the kernel reported a KASAN null-ptr-deref (and other memory
> > corruption symptoms) [2]. This failure looked sporadic and hardware-
> > dependent.
> > 
> > From the kernel message, I noticed that udev-worker wrote to the
> > queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> > The test case block/005 also wrote to the same sysfs attribute, which
> 
> sysfs write is supposed to be serialized...

I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
I found that the elevator_change() call is guarded with the rw_semaphore
"set->update_nr_hwq_lock", but the guard takes the reader lock, not the writer
lock. This does not serialize the sysfs writes.
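
The reader-versus-writer distinction can be illustrated with a toy model in userspace C. This is only a sketch: `toy_rwsem` and its helpers are hypothetical and merely mimic the trylock semantics of a reader/writer lock. They are not the kernel's rw_semaphore, and the model says nothing about any other serialization on the sysfs write path.

```c
#include <stdbool.h>

/* Toy reader/writer lock state; not the kernel's struct rw_semaphore. */
struct toy_rwsem {
	int readers;	/* number of current read-side holders */
	bool writer;	/* true while a writer holds the lock */
};

static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	sem->readers--;
}

static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

static void toy_up_write(struct toy_rwsem *sem)
{
	sem->writer = false;
}

/* Two callers take the read side back to back; count how many get in. */
int callers_admitted_with_read_trylock(void)
{
	struct toy_rwsem sem = { 0 };
	int admitted = 0;

	admitted += toy_down_read_trylock(&sem);
	admitted += toy_down_read_trylock(&sem);
	while (sem.readers > 0)
		toy_up_read(&sem);
	return admitted;
}

/* Same experiment with the write side; the second caller is refused. */
int callers_admitted_with_write_trylock(void)
{
	struct toy_rwsem sem = { 0 };
	int admitted = 0;

	admitted += toy_down_write_trylock(&sem);
	admitted += toy_down_write_trylock(&sem);
	toy_up_write(&sem);
	return admitted;
}
```

In this model both read-side callers are admitted at once, while the write side admits only one, which is the difference between the two trylock variants discussed here.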

I tried the patch below to replace the reader lock with the writer lock. In a
quick trial, it appears to work: the kernel message is no longer observed and
the new test case does not cause hangs. I will do further testing to confirm
that this change does not trigger new lockdep WARNs. Assuming it has no such
side effects, I hope this fix approach is acceptable. It does not add a new
lock, so I think it is the better option.

diff --git a/block/elevator.c b/block/elevator.c
index 3bcd37c2aa34..b03185a217ff 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
 	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
 	 */
-	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
+	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
 		ret = -EBUSY;
 		goto out;
 	}
@@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	} else {
 		ret = -ENOENT;
 	}
-	up_read(&set->update_nr_hwq_lock);
+	up_write(&set->update_nr_hwq_lock);
 
 out:
 	if (ctx.type)

[...]

> blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
> fix could be to check for and avoid the null-ptr-deref.

Actually, null-ptr-deref is only one of the failure symptoms. A KASAN
slab-use-after-free is also observed [3], so I suspect that adding null checks
may not be enough.

> Adding a new lock should usually be the last resort, especially since this one
> is depended on by queue freeze.

Got it, thanks.


[3] KASAN slab-use-after-free

[  802.836569][ T3919] run blktests block/005 at 2026-05-11 10:42:39
[  804.256901][ T3866] debugfs: 'sched' already exists in 'sdd'
[  804.874743][ T3919] debugfs: 'sched' already exists in 'sdd'
[  804.882124][ T3919] ==================================================================
[  804.882154][ T3866] debugfs: 'sched' already exists in 'sdd'
[  804.890039][ T3919] BUG: KASAN: slab-use-after-free in elevator_change_done+0x304/0x610
[  804.890053][ T3919] Write of size 8 at addr ffff8881273e08e0 by task check/3919
[  804.890061][ T3919]
[  804.890069][ T3919] CPU: 4 UID: 0 PID: 3919 Comm: check Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
[  804.890080][ T3919] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
[  804.890086][ T3919] Call Trace:
[  804.890092][ T3919]  <TASK>
[  804.890098][ T3919]  dump_stack_lvl+0x6e/0xa0
[  804.890118][ T3919]  print_address_description.constprop.0+0x70/0x300
[  804.890135][ T3919]  ? elevator_change_done+0x304/0x610
[  804.890145][ T3919]  print_report+0xfc/0x1ff
[  804.890154][ T3919]  ? __virt_addr_valid+0x1d1/0x3f0
[  804.890163][ T3919]  ? elevator_change_done+0x304/0x610
[  804.890168][ T3919]  kasan_report+0xf6/0x1c0
[  804.890176][ T3919]  ? elevator_change_done+0x304/0x610
[  804.890185][ T3919]  kasan_check_range+0x125/0x200
[  804.890192][ T3919]  elevator_change_done+0x304/0x610
[  804.890198][ T3919]  ? sysfs_file_ops+0x70/0x140
[  804.890206][ T3919]  ? __pfx_elevator_change_done+0x10/0x10
[  804.890213][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890220][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890225][ T3919]  elevator_change+0x283/0x4f0
[  804.890233][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890239][ T3919]  elv_iosched_store+0x30c/0x3a0
[  804.890246][ T3919]  ? __pfx_elv_iosched_store+0x10/0x10
[  804.890255][ T3919]  ? lock_acquire.part.0+0xb8/0x230
[  804.890262][ T3919]  ? kernfs_fop_write_iter+0x25b/0x5e0
[  804.890268][ T3919]  ? lock_acquire.part.0+0xb8/0x230
[  804.890274][ T3919]  ? lock_acquire+0x126/0x140
[  804.890281][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890286][ T3919]  queue_attr_store+0x23f/0x360
[  804.890295][ T3919]  ? __pfx_queue_attr_store+0x10/0x10
[  804.890300][ T3919]  ? __lock_acquire+0x55d/0xbd0
[  804.890308][ T3919]  ? lock_acquire.part.0+0xb8/0x230
[  804.890314][ T3919]  ? sysfs_file_kobj+0x1d/0x1b0
[  804.890319][ T3919]  ? find_held_lock+0x2b/0x80
[  804.890326][ T3919]  ? __lock_release.isra.0+0x59/0x170
[  804.890334][ T3919]  ? lock_release.part.0+0x1c/0x50
[  804.890340][ T3919]  ? sysfs_file_kobj+0xb9/0x1b0
[  804.890345][ T3919]  ? sysfs_kf_write+0x65/0x170
[  804.890352][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890357][ T3919]  kernfs_fop_write_iter+0x3da/0x5e0
[  804.890363][ T3919]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[  804.890368][ T3919]  vfs_write+0x524/0x1010
[  804.890378][ T3919]  ? __pfx_vfs_write+0x10/0x10
[  804.890393][ T3919]  ksys_write+0xff/0x200
[  804.890401][ T3919]  ? __pfx_ksys_write+0x10/0x10
[  804.890408][ T3919]  ? __pfx_pte_val+0x10/0x10
[  804.890414][ T3919]  ? folio_xchg_last_cpupid+0xc6/0x130
[  804.890421][ T3919]  do_syscall_64+0xf4/0x1550
[  804.890429][ T3919]  ? __lock_release.isra.0+0x59/0x170
[  804.890437][ T3919]  ? lock_release.part.0+0x1c/0x50
[  804.890444][ T3919]  ? rcu_read_unlock+0x1c/0x60
[  804.890449][ T3919]  ? wp_page_reuse+0x160/0x1e0
[  804.890455][ T3919]  ? do_wp_page+0x5db/0x10a0
[  804.890465][ T3919]  ? handle_pte_fault+0x54e/0x760
[  804.890472][ T3919]  ? __pfx_handle_pte_fault+0x10/0x10
[  804.890479][ T3919]  ? __pfx_pmd_val+0x10/0x10
[  804.890485][ T3919]  ? __handle_mm_fault+0xa02/0xef0
[  804.890493][ T3919]  ? __lock_acquire+0x55d/0xbd0
[  804.890499][ T3919]  ? __pfx_css_rstat_updated+0x10/0x10
[  804.890509][ T3919]  ? lock_acquire.part.0+0xb8/0x230
[  804.890515][ T3919]  ? count_memcg_events_mm.constprop.0+0x22/0x130
[  804.890522][ T3919]  ? find_held_lock+0x2b/0x80
[  804.890528][ T3919]  ? __lock_release.isra.0+0x59/0x170
[  804.890536][ T3919]  ? find_held_lock+0x2b/0x80
[  804.890542][ T3919]  ? __lock_release.isra.0+0x59/0x170
[  804.890550][ T3919]  ? do_user_addr_fault+0x811/0xed0
[  804.890559][ T3919]  ? do_syscall_64+0x34/0x1550
[  804.890564][ T3919]  ? lockdep_hardirqs_on_prepare.part.0+0x9b/0x140
[  804.890570][ T3919]  ? do_syscall_64+0x34/0x1550
[  804.890575][ T3919]  ? trace_hardirqs_on+0x19/0x1a0
[  804.890584][ T3919]  ? do_syscall_64+0xab/0x1550
[  804.890590][ T3919]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  804.890596][ T3919] RIP: 0033:0x7ff08cbe3bbe
[  804.890603][ T3919] Code: 4d 89 d8 e8 34 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[  804.890609][ T3919] RSP: 002b:00007ffc95718820 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[  804.890616][ T3919] RAX: ffffffffffffffda RBX: 00007ff08cd5f5c0 RCX: 00007ff08cbe3bbe
[  804.890621][ T3919] RDX: 0000000000000006 RSI: 0000563340f2c390 RDI: 0000000000000001
[  804.890624][ T3919] RBP: 00007ffc95718830 R08: 0000000000000000 R09: 0000000000000000
[  804.890627][ T3919] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006
[  804.890630][ T3919] R13: 0000000000000006 R14: 0000563340f2c390 R15: 0000563340f96890
[  804.890641][ T3919]  </TASK>
[  804.890643][ T3919]
[  805.368835][ T3919] Allocated by task 3919:
[  805.373543][ T3919]  kasan_save_stack+0x30/0x50
[  805.378559][ T3919]  kasan_save_track+0x14/0x30
[  805.383559][ T3919]  __kasan_kmalloc+0x9a/0xb0
[  805.388465][ T3919]  elevator_alloc+0xc5/0x2b0
[  805.393366][ T3919]  blk_mq_init_sched+0xa6/0x5e0
[  805.398554][ T3919]  elevator_switch+0x18e/0x680
[  805.403702][ T3919]  elevator_change+0x2d8/0x4f0
[  805.408802][ T3919]  elv_iosched_store+0x30c/0x3a0
[  805.414116][ T3919]  queue_attr_store+0x23f/0x360
[  805.419289][ T3919]  kernfs_fop_write_iter+0x3da/0x5e0
[  805.424938][ T3919]  vfs_write+0x524/0x1010
[  805.429600][ T3919]  ksys_write+0xff/0x200
[  805.434159][ T3919]  do_syscall_64+0xf4/0x1550
[  805.439064][ T3919]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  805.445273][ T3919]
[  805.447927][ T3919] Freed by task 3866:
[  805.452231][ T3919]  kasan_save_stack+0x30/0x50
[  805.457287][ T3919]  kasan_save_track+0x14/0x30
[  805.462282][ T3919]  kasan_save_free_info+0x3b/0x70
[  805.467645][ T3919]  __kasan_slab_free+0x6b/0x90
[  805.472736][ T3919]  kfree+0x21c/0x620
[  805.476953][ T3919]  kobject_cleanup+0x105/0x3a0
[  805.482039][ T3919]  elevator_change_done+0x196/0x610
[  805.487633][ T3919]  elevator_change+0x283/0x4f0
[  805.492730][ T3919]  elv_iosched_store+0x30c/0x3a0
[  805.497989][ T3919]  queue_attr_store+0x23f/0x360
[  805.503144][ T3919]  kernfs_fop_write_iter+0x3da/0x5e0
[  805.508747][ T3919]  vfs_write+0x524/0x1010
[  805.513381][ T3919]  ksys_write+0xff/0x200
[  805.517944][ T3919]  do_syscall_64+0xf4/0x1550
[  805.522862][ T3919]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  805.529118][ T3919]
[  805.531858][ T3919] The buggy address belongs to the object at ffff8881273e0800
[  805.531858][ T3919]  which belongs to the cache kmalloc-rnd-13-1k of size 1024
[  805.547392][ T3919] The buggy address is located 224 bytes inside of
[  805.547392][ T3919]  freed 1024-byte region [ffff8881273e0800, ffff8881273e0c00)
[  805.562078][ T3919]
[  805.564734][ T3919] The buggy address belongs to the physical page:
[  805.571446][ T3919] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1273e0
[  805.580609][ T3919] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  805.589411][ T3919] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  805.597524][ T3919] page_type: f5(slab)
[  805.601916][ T3919] raw: 0017ffffc0000040 ffff88810005c640 dead000000000100 dead000000000122
[  805.610881][ T3919] raw: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
[  805.619808][ T3919] head: 0017ffffc0000040 ffff88810005c640 dead000000000100 dead000000000122
[  805.628815][ T3919] head: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
[  805.637838][ T3919] head: 0017ffffc0000003 fffffffffffffe01 00000000ffffffff 00000000ffffffff
[  805.646901][ T3919] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[  805.655983][ T3919] page dumped because: kasan: bad access detected
[  805.662913][ T3919]
[  805.665657][ T3919] Memory state around the buggy address:
[  805.671717][ T3919]  ffff8881273e0780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  805.680194][ T3919]  ffff8881273e0800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  805.688697][ T3919] >ffff8881273e0880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  805.697130][ T3919]                                                        ^
[  805.704717][ T3919]  ffff8881273e0900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  805.713179][ T3919]  ffff8881273e0980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  805.721720][ T3919] ==================================================================
[  805.730526][ T3919] Disabling lock debugging due to kernel taint
...

^ permalink raw reply related

* Re: [PATCH v2] block: invalidate cached plug timestamp after task switch
From: Peter Zijlstra @ 2026-06-12  9:45 UTC (permalink / raw)
  To: Usama Arif
  Cc: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, rostedt,
	vincent.guittot, vschneid, shakeel.butt, hannes, riel,
	kernel-team, stable
In-Reply-To: <20260612094042.3350401-1-usama.arif@linux.dev>

On Fri, Jun 12, 2026 at 02:40:42AM -0700, Usama Arif wrote:

> +static __always_inline void blk_plug_invalidate_ts(void)
>  {
> +	if (unlikely(current->flags & PF_BLOCK_TS)) {
> +		struct blk_plug *plug = current->plug;
>  
> +		if (plug)
> +			plug->cur_ktime = 0;
> +		current->flags &= ~PF_BLOCK_TS;
> +	}
>  }

If you can guarantee PF_BLOCK_TS is only ever set while current->plug is
non-NULL, this can be reduced further.
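
For illustration only, a minimal userspace sketch of the reduction suggested
above. It assumes the invariant in question, namely that the flag is only ever
set while the task has a plug. The types struct plug_model and
struct task_model, the flag PF_BLOCK_TS_MODEL, and the helper cache_ts() are
hypothetical stand-ins, not the kernel definitions, and whether the kernel
actually upholds this invariant is not established here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
#define PF_BLOCK_TS_MODEL 0x1u

struct plug_model {
	uint64_t cur_ktime;	/* cached timestamp, 0 means "not cached" */
};

struct task_model {
	unsigned int flags;
	struct plug_model *plug;
};

/* Setter that upholds the assumed invariant: flag set implies plug != NULL. */
static void cache_ts(struct task_model *t, uint64_t now)
{
	if (!t->plug)
		return;		/* nothing to cache into, flag stays clear */
	t->plug->cur_ktime = now;
	t->flags |= PF_BLOCK_TS_MODEL;
}

/* Reduced invalidation: the NULL check on the plug is dropped. */
static void invalidate_ts(struct task_model *t)
{
	if (t->flags & PF_BLOCK_TS_MODEL) {
		assert(t->plug != NULL);	/* documents the assumed invariant */
		t->plug->cur_ktime = 0;
		t->flags &= ~PF_BLOCK_TS_MODEL;
	}
}
```

The reduced form is only safe if every path that clears the plug pointer also
clears the flag first; otherwise the NULL check has to stay.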

^ permalink raw reply

* [PATCH v2] block: invalidate cached plug timestamp after task switch
From: Usama Arif @ 2026-06-12  9:40 UTC (permalink / raw)
  To: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid
  Cc: shakeel.butt, hannes, riel, kernel-team, Usama Arif, stable

blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
and marks the task with PF_BLOCK_TS. That cache is only valid while the
task keeps running; if the task is switched out, wall-clock time
advances and the cached value must not be reused when the task runs again.

The existing invalidation covers explicit plug flushes through
__blk_flush_plug(), and the schedule() / rtmutex paths through
sched_update_worker(). It does not cover in-kernel preemption paths such
as preempt_schedule(), preempt_schedule_notrace(), and
preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
return without calling sched_update_worker().

As a result, a task preempted while holding a plug with PF_BLOCK_TS set
can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
then consumes that stale timestamp through ioc_now(), producing stale vnow
values for throttle decisions, and through ioc_rqos_done(), inflating
on-queue time and feeding false missed-QoS samples into vrate
adjustment.

Move the schedule-side invalidation to finish_task_switch(), which runs
for the scheduled-in task after every actual context switch regardless
of which schedule entry point was used. Keep __blk_flush_plug() as the
explicit flush/finish-plug invalidation path, and remove only the
PF_BLOCK_TS handling from sched_update_worker().

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
v1 -> v2: https://lore.kernel.org/all/20260611231428.345098-1-usama.arif@linux.dev/
- Make the function just blk_plug_invalidate_ts(), move the check for
PF_BLOCK_TS flag into blk_plug_invalidate_ts and make it __always_inline
(Peter Zijlstra).
---
 include/linux/blkdev.h | 17 ++++++++---------
 kernel/sched/core.c    | 12 ++++++++----
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 57e84d59a642..1c1fd31ce187 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1216,16 +1216,15 @@ static inline void blk_flush_plug(struct blk_plug *plug, bool async)
 		__blk_flush_plug(plug, async);
 }

-/*
- * tsk == current here
- */
-static inline void blk_plug_invalidate_ts(struct task_struct *tsk)
+static __always_inline void blk_plug_invalidate_ts(void)
 {
-	struct blk_plug *plug = tsk->plug;
+	if (unlikely(current->flags & PF_BLOCK_TS)) {
+		struct blk_plug *plug = current->plug;

-	if (plug)
-		plug->cur_ktime = 0;
-	current->flags &= ~PF_BLOCK_TS;
+		if (plug)
+			plug->cur_ktime = 0;
+		current->flags &= ~PF_BLOCK_TS;
+	}
 }

 int blkdev_issue_flush(struct block_device *bdev);
@@ -1251,7 +1250,7 @@ static inline void blk_flush_plug(struct blk_plug *plug, bool async)
 {
 }

-static inline void blk_plug_invalidate_ts(struct task_struct *tsk)
+static inline void blk_plug_invalidate_ts(void)
 {
 }

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b791e9e9f67..e97e98c33be5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5368,6 +5368,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 */
 	kmap_local_sched_in();

+	/*
+	 * Any cached block-layer timestamp (plug->cur_ktime) is stale now,
+	 * invalidate it.
+	 */
+	blk_plug_invalidate_ts();
+
 	fire_sched_in_preempt_notifiers(current);
 	/*
 	 * When switching through a kernel thread, the loop in
@@ -7290,12 +7296,10 @@ static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_BLOCK_TS)) {
-		if (tsk->flags & PF_BLOCK_TS)
-			blk_plug_invalidate_ts(tsk);
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
-		else if (tsk->flags & PF_IO_WORKER)
+		else
 			io_wq_worker_running(tsk);
 	}
 }
-- 
2.53.0-Meta

^ permalink raw reply related

* Re: [PATCH] rnbd-clt: Use common error handling code in rnbd_get_iu()
From: Haris Iqbal @ 2026-06-12  9:39 UTC (permalink / raw)
  To: Markus Elfring; +Cc: linux-block, Jack Wang, Jens Axboe, LKML, kernel-janitors
In-Reply-To: <c9f86f0b-331d-4cb1-b8a2-00bc1e857ec7@web.de>

On Wed, Jun 10, 2026 at 9:03 PM Markus Elfring <Markus.Elfring@web.de> wrote:
>
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Wed, 10 Jun 2026 20:58:47 +0200
>
> Use an additional label so that a bit of exception handling can be better
> reused at the end of an if branch.
>
> This issue was detected by using the Coccinelle software.
>
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
> ---
>  drivers/block/rnbd/rnbd-clt.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/block/rnbd/rnbd-clt.c b/drivers/block/rnbd/rnbd-clt.c
> index 4d6725a0035e..d8e3f145ee2f 100644
> --- a/drivers/block/rnbd/rnbd-clt.c
> +++ b/drivers/block/rnbd/rnbd-clt.c
> @@ -329,10 +329,8 @@ static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
>                 return NULL;
>
>         permit = rnbd_get_permit(sess, con_type, wait);
> -       if (!permit) {
> -               kfree(iu);
> -               return NULL;
> -       }
> +       if (!permit)
> +               goto free_iu;
>
>         iu->permit = permit;
>         /*
> @@ -349,6 +347,7 @@ static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
>
>         if (sg_alloc_table(&iu->sgt, 1, GFP_KERNEL)) {
>                 rnbd_put_permit(sess, permit);
> +free_iu:

Thanks for the patch.
It does what the commit description says, but maybe we do not need to
do this?

If there was a kfree before the last "return iu;", it would have made
more sense. But jumping to the middle of a conditional block to reuse
the free and return seems forced.

>                 kfree(iu);
>                 return NULL;
>         }
> --
> 2.54.0
>

^ permalink raw reply

* Re: [PATCH v4 7/8] Bluetooth: qca: Set NVMEM BD address quirks when address is invalid
From: Dmitry Baryshkov @ 2026-06-12  9:12 UTC (permalink / raw)
  To: Loic Poulain
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan, linux-mmc, devicetree,
	linux-kernel, linux-arm-msm, linux-block, linux-wireless, ath10k,
	linux-bluetooth, netdev, daniel, Bartosz Golaszewski
In-Reply-To: <20260609-block-as-nvmem-v4-7-45712e6b22c6@oss.qualcomm.com>

On Tue, Jun 09, 2026 at 09:52:32AM +0200, Loic Poulain wrote:
> When the controller BD address is invalid (zero or default),
> set the NVMEM quirks to allow retrieving the address from a
> 'local-bd-address' NVMEM cell. The BD address is often stored
> alongside the WiFi MAC address in big-endian format, so also
> set the big-endian quirk.

Okay, this answers my question to the previous patch. We need to support
BE addresses.

> 
> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---
>  drivers/bluetooth/btqca.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 

-- 
With best wishes
Dmitry

^ permalink raw reply

* Re: [PATCH v4 6/8] Bluetooth: hci_sync: Add NVMEM-backed BD address retrieval
From: Dmitry Baryshkov @ 2026-06-12  9:11 UTC (permalink / raw)
  To: Loic Poulain
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan, linux-mmc, devicetree,
	linux-kernel, linux-arm-msm, linux-block, linux-wireless, ath10k,
	linux-bluetooth, netdev, daniel, Bartosz Golaszewski
In-Reply-To: <20260609-block-as-nvmem-v4-6-45712e6b22c6@oss.qualcomm.com>

On Tue, Jun 09, 2026 at 09:52:31AM +0200, Loic Poulain wrote:
> Some devices store the Bluetooth BD address in non-volatile
> memory, which can be accessed through the NVMEM framework.
> Similar to Ethernet or WiFi MAC addresses, add support for
> reading the BD address from a 'local-bd-address' NVMEM cell.
> 
> As with the device-tree provided BD address, add a quirk to
> indicate whether a device or platform should attempt to read
> the address from NVMEM when no valid in-chip address is present.
> Also add a quirk to indicate if the address is stored in
> big-endian byte order.
> 
> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---
>  include/net/bluetooth/hci.h | 18 ++++++++++++++++++
>  net/bluetooth/hci_sync.c    | 39 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
> index 572b1c620c5d653a1fe10b26c1b0ba33e8f4968f..7686466d1109253b0d75edeb5f6a99fb98ce4cc6 100644
> --- a/include/net/bluetooth/hci.h
> +++ b/include/net/bluetooth/hci.h
> @@ -164,6 +164,24 @@ enum {
>  	 */
>  	HCI_QUIRK_BDADDR_PROPERTY_BROKEN,
>  
> +	/* When this quirk is set, the public Bluetooth address
> +	 * initially reported by HCI Read BD Address command
> +	 * is considered invalid. The public BD Address can be
> +	 * retrieved via a 'local-bd-address' NVMEM cell.

Why do we need a quirk here? Can't we always assume that if there is an
NVMEM cell, it contains a correct address, even if HCI command returned
a seemingly-sensible one?

> +	 *
> +	 * This quirk can be set before hci_register_dev is called or
> +	 * during the hdev->setup vendor callback.
> +	 */
> +	HCI_QUIRK_USE_BDADDR_NVMEM,
> +
> +	/* When this quirk is set, the Bluetooth Device Address provided by
> +	 * the 'local-bd-address' NVMEM is stored in big-endian order.
> +	 *
> +	 * This quirk can be set before hci_register_dev is called or
> +	 * during the hdev->setup vendor callback.
> +	 */
> +	HCI_QUIRK_BDADDR_NVMEM_BE,

Also, is this necessary? Are there devices which store the address in the
wrong format in the NVMEM?

> +
>  	/* When this quirk is set, the duplicate filtering during
>  	 * scanning is based on Bluetooth devices addresses. To allow
>  	 * RSSI based updates, restart scanning if needed.

-- 
With best wishes
Dmitry

^ permalink raw reply

* Re: Direct IO page bouncing got some garbage?
From: Qu Wenruo @ 2026-06-12  8:27 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Linux Memory Management List
In-Reply-To: <aivABK8nwNR6Z__A@infradead.org>



On 2026/6/12 17:45, Christoph Hellwig wrote:
> On Fri, Jun 12, 2026 at 02:54:22PM +0930, Qu Wenruo wrote:
>>
>>
>> On 2026/6/12 11:11, Qu Wenruo wrote:
>>> Hi,
>>>
>>> Recently I'm trying to make btrfs utilize IOMAP_DIO_BOUNCE, however I'm
>>> experiencing weird data corruption.
>>>
>>> During test case generic/708, I'm reliably hitting garbage pages at the
>>> last 64KiB, the garbage even contains an ELF header.
>>
>> Added ftrace shows that, since btrfs has to disable page fault to avoid
>> certain deadlock, bio_iov_iter_bounced() failed with a short copy.
>>
>> Initially bio_iov_iter_bounced() got a 1MiB page, but copy_iter_from() only
>> copied 64K then failed due to the disabled page fault.
>>
>> I don't think it's a coincidence that the short 64K exactly matches where the
>> garbage is (the last 64K).
>>
>> I guess it's in the error path we didn't properly revert the iov iter?
> 
> Looks like it.  The better option would probably be not to give up on
> a short copy, and just reduce the bio size to fit the short copy
> even if that wastes a little memory.

And that will only work if IOMAP_PARTIAL is specified.

For now I'll just fix the short copy path, and continue to make btrfs 
work with IOMAP_DIO_BOUNCE first (already testing and finished one round).

The partial bio return solution will require some way to pass the dio 
flag, or a refactor to move all the error handling to the only caller.
Anyway it will be a dedicated series for the change, meanwhile the 
simple revert fix will be sent out soon along with the btrfs enablement.

Thanks,
Qu

^ permalink raw reply

* Re: [PATCH v4 2/8] dt-bindings: net: wireless: qcom,ath10k: Document NVMEM cells
From: Krzysztof Kozlowski @ 2026-06-12  8:26 UTC (permalink / raw)
  To: Loic Poulain, Ulf Hansson, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Bjorn Andersson, Konrad Dybcio, Jens Axboe,
	Johannes Berg, Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan
  Cc: linux-mmc, devicetree, linux-kernel, linux-arm-msm, linux-block,
	linux-wireless, ath10k, linux-bluetooth, netdev, daniel,
	Bartosz Golaszewski
In-Reply-To: <20260609-block-as-nvmem-v4-2-45712e6b22c6@oss.qualcomm.com>

On 09/06/2026 09:52, Loic Poulain wrote:
> Document the NVMEM cells supported by the ath10k driver, the
> mac-address, pre-calibration data, and calibration data.
> 
> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---


Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>

Best regards,
Krzysztof

^ permalink raw reply

* Re: Direct IO page bouncing got some garbage?
From: Christoph Hellwig @ 2026-06-12  8:15 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-btrfs, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Linux Memory Management List
In-Reply-To: <f1dbdcd6-4e6e-4304-8fa5-59c2c60252ad@suse.com>

On Fri, Jun 12, 2026 at 02:54:22PM +0930, Qu Wenruo wrote:
> 
> 
> On 2026/6/12 11:11, Qu Wenruo wrote:
> > Hi,
> > 
> > Recently I'm trying to make btrfs utilize IOMAP_DIO_BOUNCE, however I'm
> > experiencing weird data corruption.
> > 
> > During test case generic/708, I'm reliably hitting garbage pages at the
> > last 64KiB, the garbage even contains an ELF header.
> 
> Added ftrace shows that, since btrfs has to disable page fault to avoid
> certain deadlock, bio_iov_iter_bounced() failed with a short copy.
> 
> Initially bio_iov_iter_bounced() got a 1MiB page, but copy_iter_from() only
> copied 64K then failed due to the disabled page fault.
> 
> I don't think it's a coincidence that the short 64K exactly matches where the
> garbage is (the last 64K).
> 
> I guess it's in the error path we didn't properly revert the iov iter?

Looks like it.  The better option would probably be not to give up on
a short copy, and just reduce the bio size to fit the short copy
even if that wastes a little memory.
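
A rough userspace sketch of that idea, for illustration only: on a short copy,
shrink the I/O to the bytes that were actually copied rather than failing or
submitting the whole bounce buffer. struct src_model, copy_nofault() and
fill_bounce() are hypothetical names invented for this sketch, not block layer
or iov_iter APIs, and the "readable" limit merely models a copy that stops
early because page faults are disabled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of a user buffer where only the first `readable`
 * bytes can be copied without taking a page fault.
 */
struct src_model {
	const unsigned char *data;
	size_t len;
	size_t readable;
};

/* Copies as much as possible and returns the number of bytes copied. */
static size_t copy_nofault(unsigned char *dst, const struct src_model *src,
			   size_t want)
{
	size_t n = want;

	if (n > src->len)
		n = src->len;
	if (n > src->readable)
		n = src->readable;
	memcpy(dst, src->data, n);
	return n;
}

/*
 * Fill a bounce buffer of `cap` bytes. On a short copy the returned I/O
 * size is trimmed to the copied bytes, so the stale tail of the bounce
 * buffer is never part of the submitted range. Returns 0 if nothing
 * could be copied.
 */
static size_t fill_bounce(unsigned char *bounce, size_t cap,
			  const struct src_model *src)
{
	size_t copied = copy_nofault(bounce, src, cap);

	assert(copied <= cap);
	return copied;
}
```

The caller would submit only the returned size and retry the remainder later,
which wastes the unused part of the bounce buffer but never exposes its old
contents.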


^ permalink raw reply

* Re: [PATCH v4 2/8] dt-bindings: net: wireless: qcom,ath10k: Document NVMEM cells
From: Loic Poulain @ 2026-06-12  7:57 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan, linux-mmc, devicetree,
	linux-kernel, linux-arm-msm, linux-block, linux-wireless, ath10k,
	linux-bluetooth, netdev, daniel, Bartosz Golaszewski
In-Reply-To: <20260610-funny-paper-warthog-25fa0a@quoll>

On Wed, Jun 10, 2026 at 9:16 AM Krzysztof Kozlowski <krzk@kernel.org> wrote:
>
> On Tue, Jun 09, 2026 at 09:52:27AM +0200, Loic Poulain wrote:
> > Document the NVMEM cells supported by the ath10k driver, the
> > mac-address, pre-calibration data, and calibration data.
> >
> > Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> > Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > ---
> >  .../devicetree/bindings/net/wireless/qcom,ath10k.yaml    | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml
> > index c21d66c7cd558ab792524be9afec8b79272d1c87..7391df5e7071e626af4c64b9919d48c41ac09f1e 100644
> > --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml
> > +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml
> > @@ -92,6 +92,22 @@ properties:
> >
> >    ieee80211-freq-limit: true
> >
> > +  nvmem-cells:
> > +    minItems: 1
> > +    maxItems: 3
> > +    description: |
>
> If there is going to be resend:
> Do not need '|' unless you need to preserve formatting.

Sure, thanks.

>
> > +      References to nvmem cells for MAC address and/or calibration data.
> > +      Supported cell names are mac-address, calibration, and pre-calibration.
> > +
> > +  nvmem-cell-names:
> > +    minItems: 1
> > +    maxItems: 3
> > +    items:
> > +      enum:
> > +        - mac-address
> > +        - calibration
> > +        - pre-calibration
>
> This means you expect random order with variable number of items. Is
> that intentional? If yes, please provide short explanation in the commit
> msg.

Yes, we may or may not have any of those cells. Will document.

Thanks,
Loic

^ permalink raw reply

* Re: [PATCH] block: invalidate cached plug timestamp after task switch
From: Peter Zijlstra @ 2026-06-12  7:21 UTC (permalink / raw)
  To: Usama Arif
  Cc: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, rostedt,
	vincent.guittot, vschneid, shakeel.butt, hannes, riel,
	kernel-team, stable
In-Reply-To: <20260611231428.345098-1-usama.arif@linux.dev>

On Thu, Jun 11, 2026 at 04:14:28PM -0700, Usama Arif wrote:
> blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
> and marks the task with PF_BLOCK_TS. That cache is only valid while the
> task keeps running; if the task is switched out, wall-clock time
> advances and the cached value must not be reused when the task runs again.
> 
> The existing invalidation covers explicit plug flushes through
> __blk_flush_plug(), and the schedule() / rtmutex paths through
> sched_update_worker(). It does not cover in-kernel preemption paths such
> as preempt_schedule(), preempt_schedule_notrace(), and
> preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
> return without calling sched_update_worker().
> 
> As a result, a task preempted while holding a plug with PF_BLOCK_TS set
> can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
> then consumes that stale timestamp through ioc_now(), producing stale vnow
> values for throttle decisions, and through ioc_rqos_done(), inflating
> on-queue time and feeding false missed-QoS samples into vrate
> adjustment.
> 
> Move the schedule-side invalidation to finish_task_switch(), which runs
> for the scheduled-in task after every actual context switch regardless
> of which schedule entry point was used. Keep __blk_flush_plug() as the
> explicit flush/finish-plug invalidation path, and remove only the
> PF_BLOCK_TS handling from sched_update_worker().
> 
> Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
> Cc: stable@vger.kernel.org
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  kernel/sched/core.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8b791e9e9f67..bf024ca115ff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5368,6 +5368,13 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	 */
>  	kmap_local_sched_in();
>  
> +	/*
> +	 * Any cached block-layer timestamp (plug->cur_ktime) is stale now,
> +	 * invalidate it.
> +	 */
> +	if (unlikely(current->flags & PF_BLOCK_TS))
> +		blk_plug_invalidate_ts(current);

Can you make that just blk_plug_invalidate_ts() and move the branch into
the function itself, which is already inline anyway (but perhaps upgrade
it to __always_inline).

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-12  5:28 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, Keith Busch, brauner, linux-block,
	linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
	Martin K. Petersen, Jens Axboe
In-Reply-To: <airX6BmMQ14Rvjcb@nidhogg.toxiclabs.cc>

On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > It's entirely possible a device supports byte aligned addresses. The
> > > block layer just doesn't let a driver report that. So either it really
> > > was successful because you found a bug that skips the alignment checks,
> > > or your device silently corrupted your payload.
> 
> I tried this on different hardware, I find it hard to say all those
> devices were corrupting the payload.

I think in the other thread we agreed that we are currently missing
the alignment check for fast-path bios not hitting the splitting code,
so maybe that is something you see.  Additionally we're missing the
checks for purely bio based drivers not calling the splitting helper
at all, but I don't think that applies here.

> > > Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> > > though, in not taking the optimization when it was possible. So here's
> > > an alternative suggestion that should get things working as expected:
> > 
> > The fix below looks like it is addressing a real bug.  I'm not sure if
> > Carlos is hitting it, but we were missing the alignment checks for
> > single-bvec fast path bios so far indeed.
> 
> You left context out so I'm assuming by the fix you meant Keith's patch.

Yes.


^ permalink raw reply

* Re: Direct IO page bouncing got some garbage?
From: Qu Wenruo @ 2026-06-12  5:24 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Linux Memory Management List
In-Reply-To: <b866b2e6-292f-4c5e-ae5a-d77e983cfefd@suse.com>



On 2026/6/12 11:11, Qu Wenruo wrote:
> Hi,
> 
> Recently I'm trying to make btrfs utilize IOMAP_DIO_BOUNCE, however I'm 
> experiencing weird data corruption.
> 
> During test case generic/708, I'm reliably hitting garbage pages at the 
> last 64KiB, the garbage even contains an ELF header.

Added ftrace shows that, since btrfs has to disable page fault to avoid 
certain deadlock, bio_iov_iter_bounced() failed with a short copy.

Initially bio_iov_iter_bounced() got a 1MiB page, but copy_iter_from() 
only copied 64K then failed due to the disabled page fault.

I don't think it's a coincidence that the short 64K exactly matches where 
the garbage is (the last 64K).

I guess it's in the error path we didn't properly revert the iov iter?

> 
> In that test case, we mmap a 2MiB sized buffer from another file, and 
> use that 2MiB mmapped memory as buffer for direct IO, write into a 
> different file.
> 
> The source file has dirty page cache for that 2MiB range, and no 
> writeback happened during that direct IO write.
> 
> So it means as long as we fault in all the pages of that 2MiB buffer, we 
> should be able to copy them into the newly allocated folio, and submit a 
> bio using the bounced pages.
> 
> But the last 64KiB is reliably corrupted with some ELF header.
> 
> I'm wondering where the corruption is from, especially it seems btrfs 
> has very little to do, except calling fault_in_iov_readable() to fault 
> in all the pages.
> 
> Thanks,
> Qu


^ permalink raw reply

* Re: [PATCH] blk-mq: bound blk_hctx_poll() to one jiffy
From: Fengnan @ 2026-06-12  1:53 UTC (permalink / raw)
  To: Anuj Gupta, axboe, hch, kbusch, lidiangang, tom.leiming,
	nj.shetty, joshi.k, anuj1072538
  Cc: linux-block, Alok Rathore
In-Reply-To: <20260611162719.910837-1-anuj20.g@samsung.com>

On 2026/6/12 00:27, Anuj Gupta wrote:
> blk_hctx_poll() can busy-poll until a completion is found or
> need_resched() becomes true. On preemptible kernels, the scheduler can
> set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
> return before the loop condition re-evaluates it. After the context
> switch, the flag is cleared, so the poller can continue spinning instead
> of returning to its caller.
>
> This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
> which holds the rcu_read_lock() while calling bio_poll(). If another
> poller on the same polled queue drains the available completions, this
> poller may repeatedly find no completions and remain inside the RCU
> read-side critical section long enough to trigger RCU stall reports:
>
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
> rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
> task:fio state:R  running task     stack:0     pid:3961
> Call Trace:
> <TASK>
> ? nvme_poll+0x36/0xa0 [nvme]
> ? blk_hctx_poll+0x39/0x90
> ? blk_mq_poll+0x30/0x60
> ? bio_poll+0x87/0x170
> ? iocb_bio_iopoll+0x32/0x50
> ? io_uring_classic_poll+0x25/0x50
> ? io_do_iopoll+0x216/0x420
> ? __do_sys_io_uring_enter+0x2c7/0x7c0
>
> Reproducible with:
>
> fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
> --numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
> --registerfiles=1 --group_reporting --thread
>
> Record the starting jiffy and exit the loop once jiffies has advanced.
> This bounds each blk_hctx_poll() invocation while also covering the
> case where the reschedule flag was cleared by the context switch
> before the loop condition could observe it.
>
> Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
> ---
>   block/blk-mq.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8..d85fa4a51e79 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>   			 struct io_comp_batch *iob, unsigned int flags)
>   {
>   	int ret;
> +	unsigned long start = jiffies;
How about this:

unsigned long timeout = jiffies + 1;
...
} while (!need_resched() && time_before(jiffies, timeout));

>   
>   	do {
>   		ret = q->mq_ops->poll(hctx, iob);
> @@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>   		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
>   			break;
>   		cpu_relax();
> -	} while (!need_resched());
> +	} while (!need_resched() && time_before_eq(jiffies, start));
>   
>   	return 0;
>   }

^ permalink raw reply

* Direct IO page bouncing got some garbage?
From: Qu Wenruo @ 2026-06-12  1:41 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Linux Memory Management List

Hi,

I've recently been trying to make btrfs use IOMAP_DIO_BOUNCE, but I'm
seeing weird data corruption.

During test case generic/708, I'm reliably hitting garbage pages in the
last 64KiB; the garbage even contains an ELF header.

In that test case, we mmap a 2MiB buffer from another file and use that
mmapped memory as the buffer for a direct IO write into a different file.

The source file has dirty page cache for that 2MiB range, and no
writeback happened during the direct IO write.

So as long as we fault in all the pages of that 2MiB buffer, we should be
able to copy them into the newly allocated folio and submit a bio using
the bounced pages.

But the last 64KiB is reliably corrupted with some ELF header.

I'm wondering where the corruption comes from, especially since btrfs
seems to have very little to do here except call fault_in_iov_readable()
to fault in all the pages.

Thanks,
Qu

^ permalink raw reply

* [PATCH] block: invalidate cached plug timestamp after task switch
From: Usama Arif @ 2026-06-11 23:14 UTC (permalink / raw)
  To: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid
  Cc: shakeel.butt, hannes, riel, kernel-team, Usama Arif, stable

blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
and marks the task with PF_BLOCK_TS. That cache is only valid while the
task keeps running; if the task is switched out, wall-clock time
advances and the cached value must not be reused when the task runs again.

The existing invalidation covers explicit plug flushes through
__blk_flush_plug(), and the schedule() / rtmutex paths through
sched_update_worker(). It does not cover in-kernel preemption paths such
as preempt_schedule(), preempt_schedule_notrace(), and
preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
return without calling sched_update_worker().

As a result, a task preempted while holding a plug with PF_BLOCK_TS set
can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
then consumes that stale timestamp through ioc_now(), producing stale vnow
values for throttle decisions, and through ioc_rqos_done(), inflating
on-queue time and feeding false missed-QoS samples into vrate
adjustment.

Move the schedule-side invalidation to finish_task_switch(), which runs
for the scheduled-in task after every actual context switch regardless
of which schedule entry point was used. Keep __blk_flush_plug() as the
explicit flush/finish-plug invalidation path, and remove only the
PF_BLOCK_TS handling from sched_update_worker().
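
As an illustration only, the stale-cache problem and the effect of
invalidating on switch-in can be modeled in userspace as below. Nothing
here is kernel code: struct toy_task, toy_clock_ns and the toy_* helpers
are hypothetical stand-ins for the task flag, the clock and the plug
timestamp cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a task holding a plug. */
struct toy_task {
	bool ts_cached;		/* plays the role of PF_BLOCK_TS */
	uint64_t cur_ktime;	/* plays the role of plug->cur_ktime */
};

/* Fake clock, advanced only by toy_context_switch(). */
static uint64_t toy_clock_ns;

/* Return the cached timestamp, reading the clock on first use. */
static uint64_t toy_time_get_ns(struct toy_task *t)
{
	if (!t->ts_cached) {
		t->cur_ktime = toy_clock_ns;
		t->ts_cached = true;
	}
	return t->cur_ktime;
}

/* Drop the cached timestamp so the next read hits the clock again. */
static void toy_invalidate_ts(struct toy_task *t)
{
	t->ts_cached = false;
	t->cur_ktime = 0;
}

/* Model the task being switched out for off_cpu_ns and back in.
 * `invalidate` selects whether the switch-in hook drops the cache. */
static void toy_context_switch(struct toy_task *t, uint64_t off_cpu_ns,
			       bool invalidate)
{
	assert(off_cpu_ns > 0);	/* time always advances while off CPU */
	toy_clock_ns += off_cpu_ns;
	if (invalidate && t->ts_cached)
		toy_invalidate_ts(t);
}
```

With invalidate == false, which models a preemption path that skips the
hook, toy_time_get_ns() keeps returning the pre-switch value; with
invalidate == true it returns the current clock after the switch.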

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 kernel/sched/core.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b791e9e9f67..bf024ca115ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5368,6 +5368,13 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 */
 	kmap_local_sched_in();

+	/*
+	 * Any cached block-layer timestamp (plug->cur_ktime) is stale now,
+	 * invalidate it.
+	 */
+	if (unlikely(current->flags & PF_BLOCK_TS))
+		blk_plug_invalidate_ts(current);
+
 	fire_sched_in_preempt_notifiers(current);
 	/*
 	 * When switching through a kernel thread, the loop in
@@ -7290,12 +7297,10 @@ static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_BLOCK_TS)) {
-		if (tsk->flags & PF_BLOCK_TS)
-			blk_plug_invalidate_ts(tsk);
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
-		else if (tsk->flags & PF_IO_WORKER)
+		else
 			io_wq_worker_running(tsk);
 	}
 }
-- 
2.53.0-Meta

^ permalink raw reply related
