All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: brauner@kernel.org, miklos@szeredi.hu, djwong@kernel.org,
	hch@infradead.org, hsiangkao@linux.alibaba.com,
	linux-block@vger.kernel.org, gfs2@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, kernel-team@meta.com,
	linux-xfs@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
Date: Thu, 23 Oct 2025 15:34:22 -0400	[thread overview]
Message-ID: <aPqDPjnIaR3EF5Lt@bfoster> (raw)
In-Reply-To: <20250926002609.1302233-8-joannelkoong@gmail.com>

On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> Instead of incrementing read_bytes_pending for every folio range read in
> (which requires acquiring the spinlock to do so), set read_bytes_pending
> to the folio size when the first range is asynchronously read in, keep
> track of how many bytes total are asynchronously read in, and adjust
> read_bytes_pending accordingly after issuing requests to read in all the
> necessary ranges.
> 
> iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> value for pending bytes necessarily indicates the folio is in the bio.
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Hi Joanne,

I was throwing some extra testing at the vfs-6.19.iomap branch since the
little merge conflict thing with iomap_iter_advance(). I end up hitting
what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
generic/051. It reproduces fairly reliably within a few iterations or so
and seems to always stall during a read for a dedupe operation:

task:fsstress        state:D stack:0     pid:12094 tgid:12094 ppid:12091  task_flags:0x400140 flags:0x00080003
Call Trace:
 <TASK>
 __schedule+0x2fc/0x7a0
 schedule+0x27/0x80
 io_schedule+0x46/0x70
 folio_wait_bit_common+0x12b/0x310
 ? __pfx_wake_page_function+0x10/0x10
 ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
 filemap_read_folio+0x85/0xd0
 ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
 do_read_cache_folio+0x7c/0x1b0
 vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
 __generic_remap_file_range_prep+0x276/0x2a0
 generic_remap_file_range_prep+0x10/0x20
 xfs_reflink_remap_prep+0x22c/0x300 [xfs]
 xfs_file_remap_range+0x84/0x360 [xfs]
 vfs_dedupe_file_range_one+0x1b2/0x1d0
 ? remap_verify_area+0x46/0x140
 vfs_dedupe_file_range+0x162/0x220
 do_vfs_ioctl+0x4d1/0x940
 __x64_sys_ioctl+0x75/0xe0
 do_syscall_64+0x84/0x800
 ? do_syscall_64+0xbb/0x800
 ? avc_has_perm_noaudit+0x6b/0xf0
 ? _copy_to_user+0x31/0x40
 ? cp_new_stat+0x130/0x170
 ? __do_sys_newfstat+0x44/0x70
 ? do_syscall_64+0xbb/0x800
 ? do_syscall_64+0xbb/0x800
 ? clear_bhb_loop+0x30/0x80
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fe6bbd9a14d
RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
 </TASK>

It wasn't immediately clear to me what the issue was so I bisected and
it landed on this patch. It kind of looks like we're failing to unlock a
folio at some point and then tripping over it later..? I can kill the
fsstress process but then the umount ultimately gets stuck tossing
pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
I'll poke at it some more but I figure you might be able to make sense
of this faster than I can.

Brian

[1] umount stack trace: 

task:umount          state:D stack:0     pid:12216 tgid:12216 ppid:2514   task_flags:0x400100 flags:0x00080001
Call Trace:
 <TASK>
 __schedule+0x2fc/0x7a0
 schedule+0x27/0x80
 io_schedule+0x46/0x70
 folio_wait_bit_common+0x12b/0x310
 ? __pfx_wake_page_function+0x10/0x10
 truncate_inode_pages_range+0x42a/0x4d0
 xfs_fs_evict_inode+0x1f/0x30 [xfs]
 evict+0x112/0x290
 evict_inodes+0x209/0x230
 generic_shutdown_super+0x42/0x100
 kill_block_super+0x1a/0x40
 xfs_kill_sb+0x12/0x20 [xfs]
 deactivate_locked_super+0x33/0xb0
 cleanup_mnt+0xba/0x150
 task_work_run+0x5c/0x90
 exit_to_user_mode_loop+0x12f/0x170
 do_syscall_64+0x1af/0x800
 ? vfs_statx+0x80/0x160
 ? do_statx+0x62/0xa0
 ? __x64_sys_statx+0xaf/0x100
 ? do_syscall_64+0xbb/0x800
 ? __x64_sys_statx+0xaf/0x100
 ? do_syscall_64+0xbb/0x800
 ? count_memcg_events+0xdd/0x1b0
 ? handle_mm_fault+0x220/0x340
 ? do_user_addr_fault+0x2c3/0x7f0
 ? clear_bhb_loop+0x30/0x80
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fdd641ed5ab
RSP: 002b:00007ffd671182e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000559b3e2056b0 RCX: 00007fdd641ed5ab
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000559b3e205ac0
RBP: 00007ffd671183c0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000103 R11: 0000000000000246 R12: 0000559b3e2057b8
R13: 0000000000000000 R14: 0000559b3e205ac0 R15: 0000000000000000
 </TASK>

>  fs/iomap/buffered-io.c | 87 ++++++++++++++++++++++++++++++++----------
>  1 file changed, 66 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 09e65771a947..4e6258fdb915 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -362,7 +362,6 @@ static void iomap_read_end_io(struct bio *bio)
>  
>  struct iomap_read_folio_ctx {
>  	struct folio		*cur_folio;
> -	bool			cur_folio_in_bio;
>  	void			*read_ctx;
>  	struct readahead_control *rac;
>  };
> @@ -380,19 +379,11 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
>  {
>  	struct folio *folio = ctx->cur_folio;
>  	const struct iomap *iomap = &iter->iomap;
> -	struct iomap_folio_state *ifs = folio->private;
>  	size_t poff = offset_in_folio(folio, pos);
>  	loff_t length = iomap_length(iter);
>  	sector_t sector;
>  	struct bio *bio = ctx->read_ctx;
>  
> -	ctx->cur_folio_in_bio = true;
> -	if (ifs) {
> -		spin_lock_irq(&ifs->state_lock);
> -		ifs->read_bytes_pending += plen;
> -		spin_unlock_irq(&ifs->state_lock);
> -	}
> -
>  	sector = iomap_sector(iomap, pos);
>  	if (!bio || bio_end_sector(bio) != sector ||
>  	    !bio_add_folio(bio, folio, plen, poff)) {
> @@ -422,8 +413,57 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
>  	}
>  }
>  
> +static void iomap_read_init(struct folio *folio)
> +{
> +	struct iomap_folio_state *ifs = folio->private;
> +
> +	if (ifs) {
> +		size_t len = folio_size(folio);
> +
> +		spin_lock_irq(&ifs->state_lock);
> +		ifs->read_bytes_pending += len;
> +		spin_unlock_irq(&ifs->state_lock);
> +	}
> +}
> +
> +static void iomap_read_end(struct folio *folio, size_t bytes_pending)
> +{
> +	struct iomap_folio_state *ifs;
> +
> +	/*
> +	 * If there are no bytes pending, this means we are responsible for
> +	 * unlocking the folio here, since no IO helper has taken ownership of
> +	 * it.
> +	 */
> +	if (!bytes_pending) {
> +		folio_unlock(folio);
> +		return;
> +	}
> +
> +	ifs = folio->private;
> +	if (ifs) {
> +		bool end_read, uptodate;
> +		size_t bytes_accounted = folio_size(folio) - bytes_pending;
> +
> +		spin_lock_irq(&ifs->state_lock);
> +		ifs->read_bytes_pending -= bytes_accounted;
> +		/*
> +		 * If !ifs->read_bytes_pending, this means all pending reads
> +		 * by the IO helper have already completed, which means we need
> +		 * to end the folio read here. If ifs->read_bytes_pending != 0,
> +		 * the IO helper will end the folio read.
> +		 */
> +		end_read = !ifs->read_bytes_pending;
> +		if (end_read)
> +			uptodate = ifs_is_fully_uptodate(folio, ifs);
> +		spin_unlock_irq(&ifs->state_lock);
> +		if (end_read)
> +			folio_end_read(folio, uptodate);
> +	}
> +}
> +
>  static int iomap_read_folio_iter(struct iomap_iter *iter,
> -		struct iomap_read_folio_ctx *ctx)
> +		struct iomap_read_folio_ctx *ctx, size_t *bytes_pending)
>  {
>  	const struct iomap *iomap = &iter->iomap;
>  	loff_t pos = iter->pos;
> @@ -460,6 +500,9 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
>  			folio_zero_range(folio, poff, plen);
>  			iomap_set_range_uptodate(folio, poff, plen);
>  		} else {
> +			if (!*bytes_pending)
> +				iomap_read_init(folio);
> +			*bytes_pending += plen;
>  			iomap_bio_read_folio_range(iter, ctx, pos, plen);
>  		}
>  
> @@ -482,17 +525,18 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
>  	struct iomap_read_folio_ctx ctx = {
>  		.cur_folio	= folio,
>  	};
> +	size_t bytes_pending = 0;
>  	int ret;
>  
>  	trace_iomap_readpage(iter.inode, 1);
>  
>  	while ((ret = iomap_iter(&iter, ops)) > 0)
> -		iter.status = iomap_read_folio_iter(&iter, &ctx);
> +		iter.status = iomap_read_folio_iter(&iter, &ctx,
> +				&bytes_pending);
>  
>  	iomap_bio_submit_read(&ctx);
>  
> -	if (!ctx.cur_folio_in_bio)
> -		folio_unlock(folio);
> +	iomap_read_end(folio, bytes_pending);
>  
>  	/*
>  	 * Just like mpage_readahead and block_read_full_folio, we always
> @@ -504,24 +548,23 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
>  EXPORT_SYMBOL_GPL(iomap_read_folio);
>  
>  static int iomap_readahead_iter(struct iomap_iter *iter,
> -		struct iomap_read_folio_ctx *ctx)
> +		struct iomap_read_folio_ctx *ctx, size_t *cur_bytes_pending)
>  {
>  	int ret;
>  
>  	while (iomap_length(iter)) {
>  		if (ctx->cur_folio &&
>  		    offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
> -			if (!ctx->cur_folio_in_bio)
> -				folio_unlock(ctx->cur_folio);
> +			iomap_read_end(ctx->cur_folio, *cur_bytes_pending);
>  			ctx->cur_folio = NULL;
>  		}
>  		if (!ctx->cur_folio) {
>  			ctx->cur_folio = readahead_folio(ctx->rac);
>  			if (WARN_ON_ONCE(!ctx->cur_folio))
>  				return -EINVAL;
> -			ctx->cur_folio_in_bio = false;
> +			*cur_bytes_pending = 0;
>  		}
> -		ret = iomap_read_folio_iter(iter, ctx);
> +		ret = iomap_read_folio_iter(iter, ctx, cur_bytes_pending);
>  		if (ret)
>  			return ret;
>  	}
> @@ -554,16 +597,18 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
>  	struct iomap_read_folio_ctx ctx = {
>  		.rac	= rac,
>  	};
> +	size_t cur_bytes_pending;
>  
>  	trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
>  
>  	while (iomap_iter(&iter, ops) > 0)
> -		iter.status = iomap_readahead_iter(&iter, &ctx);
> +		iter.status = iomap_readahead_iter(&iter, &ctx,
> +					&cur_bytes_pending);
>  
>  	iomap_bio_submit_read(&ctx);
>  
> -	if (ctx.cur_folio && !ctx.cur_folio_in_bio)
> -		folio_unlock(ctx.cur_folio);
> +	if (ctx.cur_folio)
> +		iomap_read_end(ctx.cur_folio, cur_bytes_pending);
>  }
>  EXPORT_SYMBOL_GPL(iomap_readahead);
>  
> -- 
> 2.47.3
> 
> 


  reply	other threads:[~2025-10-23 19:30 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-26  0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
2025-09-26  0:25 ` [PATCH v5 01/14] iomap: move bio read logic into helper function Joanne Koong
2025-09-26  0:25 ` [PATCH v5 02/14] iomap: move read/readahead bio submission " Joanne Koong
2025-09-26  0:25 ` [PATCH v5 03/14] iomap: store read/readahead bio generically Joanne Koong
2025-09-26  0:25 ` [PATCH v5 04/14] iomap: iterate over folio mapping in iomap_readpage_iter() Joanne Koong
2025-09-26  0:26 ` [PATCH v5 05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter() Joanne Koong
2025-09-26  0:26 ` [PATCH v5 06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx Joanne Koong
2025-09-26  0:26 ` [PATCH v5 07/14] iomap: track pending read bytes more optimally Joanne Koong
2025-10-23 19:34   ` Brian Foster [this message]
2025-10-24  0:01     ` Joanne Koong
2025-10-24 16:25       ` Joanne Koong
2025-10-24 17:14         ` Brian Foster
2025-10-24 19:48           ` Joanne Koong
2025-10-24 21:55             ` Joanne Koong
2025-10-27 12:16               ` Brian Foster
2025-10-24 17:21         ` Matthew Wilcox
2025-10-24 19:22           ` Joanne Koong
2025-10-24 20:59             ` Matthew Wilcox
2025-10-24 21:37               ` Darrick J. Wong
2025-10-24 21:58               ` Joanne Koong
2025-09-26  0:26 ` [PATCH v5 08/14] iomap: set accurate iter->pos when reading folio ranges Joanne Koong
2025-09-26  0:26 ` [PATCH v5 09/14] iomap: add caller-provided callbacks for read and readahead Joanne Koong
2025-09-26  0:26 ` [PATCH v5 10/14] iomap: move buffered io bio logic into new file Joanne Koong
2025-09-26  0:26 ` [PATCH v5 11/14] iomap: make iomap_read_folio() a void return Joanne Koong
2025-09-26  0:26 ` [PATCH v5 12/14] fuse: use iomap for read_folio Joanne Koong
2025-12-23 22:30   ` [RFC PATCH 0/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read Sasha Levin
2025-12-23 22:30     ` [RFC PATCH 1/1] " Sasha Levin
2025-12-24  1:12       ` Joanne Koong
2025-12-24  1:31         ` Sasha Levin
2026-02-07  7:16           ` Wei Gao
2026-02-09 19:08             ` Joanne Koong
2026-02-10  0:12               ` Wei Gao
2026-02-10  0:20                 ` Joanne Koong
2026-02-10  0:40                   ` Wei Gao
2026-02-10 22:18                     ` Joanne Koong
2026-02-11  0:00                       ` Sasha Levin
2026-02-11  3:11                       ` Matthew Wilcox
2026-02-11 19:33                         ` Joanne Koong
2026-02-11 21:03                           ` Matthew Wilcox
2026-02-11 23:13                             ` Joanne Koong
2026-02-12 19:31                               ` Matthew Wilcox
2026-02-13  0:53                                 ` Joanne Koong
2025-12-24  2:10         ` Matthew Wilcox
2025-12-24 15:43           ` Sasha Levin
2025-12-24 17:27             ` Matthew Wilcox
2025-12-24 21:21               ` Sasha Levin
2025-12-30  0:58                 ` Joanne Koong
2025-09-26  0:26 ` [PATCH v5 13/14] fuse: use iomap for readahead Joanne Koong
2025-09-26  0:26 ` [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes Joanne Koong
2025-09-29  9:38 ` [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPqDPjnIaR3EF5Lt@bfoster \
    --to=bfoster@redhat.com \
    --cc=brauner@kernel.org \
    --cc=djwong@kernel.org \
    --cc=gfs2@lists.linux.dev \
    --cc=hch@infradead.org \
    --cc=hsiangkao@linux.alibaba.com \
    --cc=joannelkoong@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.