* [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
@ 2025-12-09 21:04 Dominique Martinet via B4 Relay
2025-12-10 4:21 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Dominique Martinet via B4 Relay @ 2025-12-09 21:04 UTC (permalink / raw)
To: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck
Cc: v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel,
Chris Arges, Dominique Martinet
From: Dominique Martinet <asmadeus@codewreck.org>
When doing a loop mount of a filesystem over 9p, read requests can come
from unexpected places and blow up as reported by Chris Arges with this
reproducer:
```
dd if=/dev/zero of=./xfs.img bs=1M count=300
yes | mkfs.xfs -b size=8192 ./xfs.img
rm -rf ./mount && mkdir -p ./mount
mount -o loop ./xfs.img ./mount
```
The problem is that iov_iter_get_pages_alloc2() apparently cannot be
called on folios (as illustrated by the backtrace below), so limit what
iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
Full backtrace:
```
[ 31.276957][ T255] loop0: detected capacity change from 0 to 614400
[ 31.286377][ T255] XFS (loop0): EXPERIMENTAL large block size feature enabled. Use at your own risk!
[ 31.286624][ T255] XFS (loop0): Mounting V5 Filesystem fa3c2d3c-b936-4ee3-a5a8-e80ba36298cc
[ 31.395721][ T62] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x102600
[ 31.395833][ T62] head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 31.395915][ T62] flags: 0x2ffff800000040(head|node=0|zone=2|lastcpupid=0x1ffff)
[ 31.395976][ T62] page_type: f8(unknown)
[ 31.396004][ T62] raw: 002ffff800000040 0000000000000000 dead000000000122 0000000000000000
[ 31.396092][ T62] raw: 0000000000000000 0000000000000000 00000000f8000000 0000000000000000
[ 31.396174][ T62] head: 002ffff800000040 0000000000000000 dead000000000122 0000000000000000
[ 31.396251][ T62] head: 0000000000000000 0000000000000000 00000000f8000000 0000000000000000
[ 31.396339][ T62] head: 002ffff800000009 ffffea0004098001 00000000ffffffff 00000000ffffffff
[ 31.396425][ T62] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000200
[ 31.396523][ T62] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
[ 31.396641][ T62] ------------[ cut here ]------------
[ 31.396689][ T62] kernel BUG at include/linux/mm.h:1386!
[ 31.396748][ T62] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 31.396820][ T62] CPU: 4 UID: 0 PID: 62 Comm: kworker/u32:1 Not tainted 6.18.0-rc7-cloudflare-2025.11.11-21-gab0ed6ff #1 PREEMPT(voluntary)
[ 31.396947][ T62] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 2025.02-8 05/13/2025
[ 31.397031][ T62] Workqueue: loop0 loop_rootcg_workfn
[ 31.397084][ T62] RIP: 0010:__iov_iter_get_pages_alloc+0x7b6/0x920
[ 31.397152][ T62] Code: 08 4c 89 5d 10 44 88 55 20 e9 0d fb ff ff 0f 0b 4d 85 ed 0f 85 fc fb ff ff e9 38 fd ff ff 48 c7 c6 20 88 6d 83 e8 fa 2f b7 ff <0f> 0b 31 f6 b9 c0 0c 00 00 ba 01 00 00 00 4c 89 0c 24 48 8d 3c dd
[ 31.397310][ T62] RSP: 0018:ffffc90000257908 EFLAGS: 00010246
[ 31.397365][ T62] RAX: 000000000000005c RBX: 0000000000000020 RCX: 0000000000000003
[ 31.397424][ T62] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff83f38508
[ 31.397498][ T62] RBP: ffff888101af90f8 R08: 0000000000000000 R09: ffffc900002577a0
[ 31.397571][ T62] R10: ffffffff83f084c8 R11: 0000000000000003 R12: 0000000000020000
[ 31.397654][ T62] R13: ffffc90000257a70 R14: ffffc90000257a68 R15: ffffea0004098000
[ 31.397727][ T62] FS: 0000000000000000(0000) GS:ffff8882b3266000(0000) knlGS:0000000000000000
[ 31.397819][ T62] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 31.397890][ T62] CR2: 00007f846eb985a0 CR3: 0000000004620003 CR4: 0000000000772ef0
[ 31.397964][ T62] PKRU: 55555554
[ 31.398005][ T62] Call Trace:
[ 31.398045][ T62] <TASK>
[ 31.398075][ T62] ? kvm_sched_clock_read+0x11/0x20
[ 31.398131][ T62] ? sched_clock+0x10/0x30
[ 31.398179][ T62] ? sched_clock_cpu+0xf/0x1d0
[ 31.398234][ T62] iov_iter_get_pages_alloc2+0x20/0x50
[ 31.398277][ T62] p9_get_mapped_pages.part.0.constprop.0+0x6f/0x280 [9pnet_virtio]
[ 31.398354][ T62] ? p9pdu_vwritef+0xe0/0x6e0 [9pnet]
[ 31.398413][ T62] ? pdu_write+0x2d/0x40 [9pnet]
[ 31.398464][ T62] p9_virtio_zc_request+0x92/0x69a [9pnet_virtio]
[ 31.398530][ T62] ? p9pdu_vwritef+0xe0/0x6e0 [9pnet]
[ 31.398582][ T62] ? p9pdu_finalize+0x32/0x90 [9pnet]
[ 31.398620][ T62] ? p9_client_prepare_req+0xbe/0x150 [9pnet]
[ 31.398693][ T62] p9_client_zc_rpc.constprop.0+0xf4/0x2f0 [9pnet]
[ 31.398768][ T62] ? p9_client_xattrwalk+0x148/0x1d0 [9pnet]
[ 31.398840][ T62] p9_client_write+0x16a/0x240 [9pnet]
[ 31.398887][ T62] ? __kmalloc_cache_noprof+0x2f3/0x5a0
[ 31.398939][ T62] v9fs_issue_write+0x3a/0x80 [9p]
[ 31.399002][ T62] netfs_advance_write+0xd3/0x2b0 [netfs]
[ 31.399069][ T62] netfs_unbuffered_write+0x66/0xb0 [netfs]
[ 31.399131][ T62] netfs_unbuffered_write_iter_locked+0x1cd/0x220 [netfs]
[ 31.399202][ T62] netfs_unbuffered_write_iter+0x100/0x1d0 [netfs]
[ 31.399265][ T62] lo_rw_aio.isra.0+0x2e7/0x330
[ 31.399321][ T62] loop_process_work+0x86/0x420
[ 31.399380][ T62] process_one_work+0x192/0x350
[ 31.399434][ T62] worker_thread+0x2d3/0x400
[ 31.399493][ T62] ? __pfx_worker_thread+0x10/0x10
[ 31.399559][ T62] kthread+0xfc/0x240
[ 31.399605][ T62] ? __pfx_kthread+0x10/0x10
[ 31.399660][ T62] ? _raw_spin_unlock+0xe/0x30
[ 31.399711][ T62] ? finish_task_switch.isra.0+0x8d/0x280
[ 31.399764][ T62] ? __pfx_kthread+0x10/0x10
[ 31.399820][ T62] ? __pfx_kthread+0x10/0x10
[ 31.399878][ T62] ret_from_fork+0x113/0x130
[ 31.399931][ T62] ? __pfx_kthread+0x10/0x10
[ 31.399992][ T62] ret_from_fork_asm+0x1a/0x30
[ 31.400050][ T62] </TASK>
[ 31.400088][ T62] Modules linked in: kvm_intel kvm irqbypass aesni_intel rapl i2c_piix4 i2c_smbus tiny_power_button button configfs virtio_mmio virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio_console 9pnet_virtio virtiofs virtio virtio_ring fuse 9p 9pnet netfs
[ 31.400365][ T62] ---[ end trace 0000000000000000 ]---
[ 31.405087][ T62] RIP: 0010:__iov_iter_get_pages_alloc+0x7b6/0x920
[ 31.405166][ T62] Code: 08 4c 89 5d 10 44 88 55 20 e9 0d fb ff ff 0f 0b 4d 85 ed 0f 85 fc fb ff ff e9 38 fd ff ff 48 c7 c6 20 88 6d 83 e8 fa 2f b7 ff <0f> 0b 31 f6 b9 c0 0c 00 00 ba 01 00 00 00 4c 89 0c 24 48 8d 3c dd
[ 31.405281][ T62] RSP: 0018:ffffc90000257908 EFLAGS: 00010246
[ 31.405328][ T62] RAX: 000000000000005c RBX: 0000000000000020 RCX: 0000000000000003
[ 31.405383][ T62] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff83f38508
[ 31.405456][ T62] RBP: ffff888101af90f8 R08: 0000000000000000 R09: ffffc900002577a0
[ 31.405516][ T62] R10: ffffffff83f084c8 R11: 0000000000000003 R12: 0000000000020000
[ 31.405593][ T62] R13: ffffc90000257a70 R14: ffffc90000257a68 R15: ffffea0004098000
[ 31.405665][ T62] FS: 0000000000000000(0000) GS:ffff8882b3266000(0000) knlGS:0000000000000000
[ 31.405730][ T62] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 31.405774][ T62] CR2: 00007f846eb985a0 CR3: 0000000004620004 CR4: 0000000000772ef0
[ 31.405837][ T62] PKRU: 55555554
[ 31.434509][ C4] ------------[ cut here ]------------
```
Reported-by: Chris Arges <carges@cloudflare.com>
Closes: https://lkml.kernel.org/r/aSR-C4ahmNRoUV58@861G6M3
Tested-by: Chris Arges <carges@cloudflare.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Suggested-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---
This is the patch Chris tested in the linked thread, as-is.
I'll admit I still don't really understand how this even works: the else
branch of the trans_virtio patch assumes it can use data->kvec->iov_base,
so it shouldn't work with a folio-backed iov?!
(Also: why does iov_iter_get_pages_alloc2() explicitly implement support
for folios if it's just going to blow up later?)
... but if it worked for Chris I guess that's good enough for now?
I'm still surprised I can't reproduce this; I'll try to play with the
backing 9p mount options and check these IOs are done properly before
sending this to Linus, but any feedback is welcome until then.
---
net/9p/trans_virtio.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 10c2dd48643818907f4370243eb971fceba4d40b..f7ee1f864b03a59568510eb0dd3496bd05b3b8d6 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -318,7 +318,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
if (!iov_iter_count(data))
return 0;
- if (!iov_iter_is_kvec(data)) {
+ if (user_backed_iter(data)) {
int n;
/*
* We allow only p9_max_pages pinned. We wait for the
---
base-commit: 3e281113f871d7f9c69ca55a4d806a72180b7e8a
change-id: 20251210-virtio_trans_iter-5973892db2e3
Best regards,
--
Dominique Martinet <asmadeus@codewreck.org>
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-09 21:04 [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec Dominique Martinet via B4 Relay
@ 2025-12-10 4:21 ` Matthew Wilcox
2025-12-10 6:04 ` Christoph Hellwig
2025-12-17 13:41 ` Christian Schoenebeck
2 siblings, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2025-12-10 4:21 UTC (permalink / raw)
To: asmadeus
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck,
v9fs, linux-kernel, David Howells, linux-fsdevel, Chris Arges
On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
> The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> called on folios (as illustrated by the backtrace below), so limit what
> iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
>
> Full backtrace:
> ```
> [ 31.395721][ T62] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x102600
> [ 31.395833][ T62] head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> [ 31.395915][ T62] flags: 0x2ffff800000040(head|node=0|zone=2|lastcpupid=0x1ffff)
> [ 31.395976][ T62] page_type: f8(unknown)
This _isn't_ a folio. It's a kmalloc allocation. It's a very large
kmalloc allocation (something between 1024 * 1024 + 1 and 2048 * 1024
bytes inclusive). You can _only_ use iov_iter_get_pages_alloc2() with
folios (or other struct pages that have a refcount, but I'm not sure how
many of those there really are; we're removing refcounts from various
types of pages)
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-09 21:04 [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec Dominique Martinet via B4 Relay
2025-12-10 4:21 ` Matthew Wilcox
@ 2025-12-10 6:04 ` Christoph Hellwig
2025-12-10 7:38 ` asmadeus
2025-12-10 13:33 ` Christian Schoenebeck
2025-12-17 13:41 ` Christian Schoenebeck
2 siblings, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2025-12-10 6:04 UTC (permalink / raw)
To: asmadeus
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck,
v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel,
Chris Arges
On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
> From: Dominique Martinet <asmadeus@codewreck.org>
>
> When doing a loop mount of a filesystem over 9p, read requests can come
> from unexpected places and blow up as reported by Chris Arges with this
> reproducer:
> ```
> dd if=/dev/zero of=./xfs.img bs=1M count=300
> yes | mkfs.xfs -b size=8192 ./xfs.img
> rm -rf ./mount && mkdir -p ./mount
> mount -o loop ./xfs.img ./mount
We should really wire this up to xfstests so that all file systems
see the pattern of kmalloc allocations passed into the block layer
and then on to the direct I/O code.
> The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> called on folios (as illustrated by the backtrace below), so limit what
> iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
As willy pointed out this is a kmalloc.
And 9p (just like NFS) really needs to switch away from
iov_iter_get_pages_alloc2 to iov_iter_extract_pages, which handles not
just this perfectly fine but also fixes various other issues.
Note that the networking code still wants special treatment for kmalloc
pages, so you might have more work there.
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-10 6:04 ` Christoph Hellwig
@ 2025-12-10 7:38 ` asmadeus
2025-12-10 8:32 ` Christoph Hellwig
2025-12-10 13:33 ` Christian Schoenebeck
1 sibling, 1 reply; 24+ messages in thread
From: asmadeus @ 2025-12-10 7:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck,
v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel,
Chris Arges
Christoph Hellwig wrote on Tue, Dec 09, 2025 at 10:04:30PM -0800:
> On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
> > From: Dominique Martinet <asmadeus@codewreck.org>
> >
> > When doing a loop mount of a filesystem over 9p, read requests can come
> > from unexpected places and blow up as reported by Chris Arges with this
> > reproducer:
> > ```
> > dd if=/dev/zero of=./xfs.img bs=1M count=300
> > yes | mkfs.xfs -b size=8192 ./xfs.img
> > rm -rf ./mount && mkdir -p ./mount
> > mount -o loop ./xfs.img ./mount
>
> We should really wire this up to xfstests so that all file systems
> see the pattern of kmalloc allocations passed into the block layer
> and then on to the direct I/O code.
Note this doesn't seem to reproduce on my test VM, so I'm not sure what
kind of precondition is needed to go through this code...
> > The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> > called on folios (as illustrated by the backtrace below), so limit what
> > iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
>
> As willy pointed out this is a kmalloc.
Ok I got confused because of the VM_BUG_ON_FOLIO(), but looking back
it's in a folio_get() called directly from __iov_iter_get_pages_alloc()
so that was likely a bvec...
My points of "but there's a case for it in __iov_iter_get_pages_alloc()"
and "we have no idea what to do" still stand though, but you answered
that below:
> And 9p (just like NFS) really needs to switch away from
> iov_iter_get_pages_alloc2 to iov_iter_extract_pages, which handles not
> just this perfectly fine but also fixes various other issues.
Ok, so we can remove the special branch for kvec and just extract pages
with this.
I understand it pins user-space pages, so there's no risk of them moving
under us during the IO, and there's nothing else we need to do about it?
Looking at the implementation for iov_iter_extract_bvec_pages() it looks
like it might not process all the way to the end, so we need to loop on
calling iov_iter_extract_pages()? (I see networking code looping on
"while (iter->count > 0)")
I'll send a v2 with that when I can
While I have your attention, there's some work to move away from large
(>1MB) kmalloc() in the non-zerocopy case to kvmalloc() that might not
be contiguous (see commit e21d451a82f3 ("9p: Use kvmalloc for message
buffers on supported transports"), which basically only did that for
trans_fd); there's no iov_iter involved so it's off topic, but how would
one go about "extracting pages" from that?
> Note that the networking code still wants special treatment for kmalloc
> pages, so you might have more work there.
I *think* we're fine on this end, as it's just passing the buffers into
a sg list for virtio, as long as things don't move under the caller I
assume they don't care...
Thanks,
--
Dominique Martinet | Asmadeus
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-10 7:38 ` asmadeus
@ 2025-12-10 8:32 ` Christoph Hellwig
2025-12-13 13:28 ` asmadeus
0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2025-12-10 8:32 UTC (permalink / raw)
To: asmadeus
Cc: Christoph Hellwig, Eric Van Hensbergen, Latchesar Ionkov,
Christian Schoenebeck, v9fs, linux-kernel, David Howells,
Matthew Wilcox, linux-fsdevel, Chris Arges
On Wed, Dec 10, 2025 at 04:38:02PM +0900, asmadeus@codewreck.org wrote:
> Christoph Hellwig wrote on Tue, Dec 09, 2025 at 10:04:30PM -0800:
> > On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
> > > From: Dominique Martinet <asmadeus@codewreck.org>
> > >
> > > When doing a loop mount of a filesystem over 9p, read requests can come
> > > from unexpected places and blow up as reported by Chris Arges with this
> > > reproducer:
> > > ```
> > > dd if=/dev/zero of=./xfs.img bs=1M count=300
> > > yes | mkfs.xfs -b size=8192 ./xfs.img
> > > rm -rf ./mount && mkdir -p ./mount
> > > mount -o loop ./xfs.img ./mount
> >
> > We should really wire this up to xfstests so that all file systems
> > see the pattern of kmalloc allocations passed into the block layer
> > and then on to the direct I/O code.
>
> Note this doesn't seem to reproduce on my test VM, so I'm not sure what
> kind of precondition is needed to go through this code...
In general the best way to get XFS to issue kmalloc I/O is to use a
file system sector size smaller than the page size, i.e. something
like:
mkfs.xfs -s 512 -b 4096 -f
The above would do that job when run on a large page size system
like typical arm64 configs.
> > And 9p (just like NFS) really needs to switch away from
> > iov_iter_get_pages_alloc2 to iov_iter_extract_pages, which handles not
> > just this perfectly fine but also fixes various other issues.
>
> Ok, so we can remove the special branch for kvec and just extract pages
> with this.
Yes.
> I understand it pins user-space pages, so there's no risk of them moving
> under us during the IO, and there's nothing else we need to do about it?
Yes, unlike iov_iter_get_pages_alloc2 which gets a reference and doesn't
pin.
> Looking at the implementation for iov_iter_extract_bvec_pages() it looks
> like it might not process all the way to the end, so we need to loop on
> calling iov_iter_extract_pages()? (I see networking code looping on
> "while (iter->count > 0)")
Yes.
> I'll send a v2 with that when I can
You looked into this already, but in case you haven't seen it yet,
don't forget to call unpin_user_folio on the completion side as well.
> While I have your attention, there's some work to move away from large
> (>1MB) kmalloc() in the non-zerocopy case to kvmalloc() that might not
> be contiguous (see commit e21d451a82f3 ("9p: Use kvmalloc for message
> buffers on supported transports"), which basically only did that for
> trans_fd); there's no iov_iter involved so it's off topic, but how would
> one go about "extracting pages" from that?
vmalloc and I/O is in general problematic, because you need special
calls to flush the cache for VIVT caches:
flush_kernel_vmap_range / invalidate_kernel_vmap_range.
If you want to stuff that into an iov_iter, the only sane way is to
call vmalloc_to_page and build a bvec iter from that at the moment.
You also need to do the flush_kernel_vmap_range /
invalidate_kernel_vmap_range in the caller for it.
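i.e. very roughly something like this untested sketch (assumes the buffer
starts page-aligned, which vmalloc() guarantees, and that bv[] has room for
max_vecs entries; the function name is made up):
```
#include <linux/bvec.h>
#include <linux/uio.h>
#include <linux/vmalloc.h>

/* The caller still has to do flush_kernel_vmap_range() before submitting
 * a write and invalidate_kernel_vmap_range() once a read completes.
 */
static int vmalloc_buf_to_bvec_iter(void *buf, size_t len,
				    struct bio_vec *bv, unsigned int max_vecs,
				    struct iov_iter *iter, bool to_device)
{
	unsigned int i, nr = DIV_ROUND_UP(len, PAGE_SIZE);
	size_t remain = len;

	if (nr > max_vecs)
		return -EINVAL;
	for (i = 0; i < nr; i++) {
		size_t seg = min_t(size_t, remain, PAGE_SIZE);

		/* resolve each vmalloc page to its backing struct page */
		bvec_set_page(&bv[i], vmalloc_to_page(buf + i * PAGE_SIZE),
			      seg, 0);
		remain -= seg;
	}
	iov_iter_bvec(iter, to_device ? ITER_SOURCE : ITER_DEST, bv, nr, len);
	return 0;
}
```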
>
> > Note that the networking code still wants special treatment for kmalloc
> > pages, so you might have more work there.
>
> I *think* we're fine on this end, as it's just passing the buffers into
> a sg list for virtio, as long as things don't move under the caller I
> assume they don't care...
Ok, if your backend is virtio that's fine. If it's actual 9p over the
network you might still run into issues, though.
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-10 6:04 ` Christoph Hellwig
2025-12-10 7:38 ` asmadeus
@ 2025-12-10 13:33 ` Christian Schoenebeck
1 sibling, 0 replies; 24+ messages in thread
From: Christian Schoenebeck @ 2025-12-10 13:33 UTC (permalink / raw)
To: asmadeus, Christoph Hellwig
Cc: Eric Van Hensbergen, Latchesar Ionkov, v9fs, linux-kernel,
David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges
On Wednesday, 10 December 2025 07:04:30 CET Christoph Hellwig wrote:
> On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay
> wrote:
[...]
> > The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> > called on folios (as illustrated by the backtrace below), so limit what
> > iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
>
> As willy pointed out this is a kmalloc.
>
> And 9p (just like NFS) really needs to switch away from
> iov_iter_get_pages_alloc2 to iov_iter_extract_pages, which handles not
> just this perfectly fine but also fixes various other issues.
>
> Note that the networking code still wants special treatment for kmalloc
> pages, so you might have more work there.
But couldn't this patch be used as a preliminary solution for this issue
before switching to iov_iter_extract_pages(), as the latter does not look like
a trivial change?
Maybe I'm still missing something important here, not sure.
/Christian
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-10 8:32 ` Christoph Hellwig
@ 2025-12-13 13:28 ` asmadeus
2025-12-15 5:55 ` Christoph Hellwig
2025-12-19 11:26 ` David Howells
0 siblings, 2 replies; 24+ messages in thread
From: asmadeus @ 2025-12-13 13:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck,
v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel,
Chris Arges
Christoph Hellwig wrote on Wed, Dec 10, 2025 at 12:32:41AM -0800:
> > Looking at the implementation for iov_iter_extract_bvec_pages() it looks
> > like it might not process all the way to the end, so we need to loop on
> > calling iov_iter_extract_pages()? (I see networking code looping on
> > "while (iter->count > 0)")
>
> Yes.
Ok, I don't understand why the current code locks everything down and
wants to use a single scatterlist shared for the whole channel (and
capped to 128 pages?), it should only need to lock around the
virtqueue_add_sg() call, I'll need to play with that some more.
Looking at other virtio drivers I could probably use a sg_table and
have extract_iter_to_sg() do all the work for us...
> > I'll send a v2 with that when I can
>
> You looked into this already, but in case you haven't seen it yet,
> don't forget to call unpin_user_folio on the completion side as well.
Thanks for bringing this up -- I'm not sure I understand how to decide
what to clean up with... Depending on the type of pages it's not
necessarily unpin_user_folio(), is it? Or will it just ignore anything
else?
Ah, the caller must check iov_iter_extract_will_pin(iter), then it looks
like I'm free to call unpin_user_page() (or unpin_folio(), which is the
same without the pinned check) -- unpin_user_folio() would allow unpinning
multiple pages within a folio, but iov_iter_extract_pages() doesn't give
us folios that could be used like that.
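i.e. the completion side would look something like this (untested; the
function name is made up):
```
#include <linux/uio.h>
#include <linux/mm.h>

static void p9_release_extracted_pages(struct iov_iter *iter,
				       struct page **pages, unsigned int nr)
{
	/* only user-backed extraction takes a pin; kvec/bvec/folioq don't */
	if (iov_iter_extract_will_pin(iter))
		unpin_user_pages(pages, nr);
}
```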
FWIW, the comment at the top of extract_iter_to_sg() says:
> The iov_iter_extract_mode() function should be used to query how cleanup
but I couldn't find any such function (even back when this comment was
added to netfs code in 2023...), the two copies of this comment probably
could use updating... David?
> > While I have your attention, there's some work to move away from large
> > (>1MB) kmalloc() in the non-zerocopy case to kvmalloc() that might not
> > be contiguous (see commit e21d451a82f3 ("9p: Use kvmalloc for message
> > buffers on supported transports"), which basically only did that for
> > trans_fd); there's no iov_iter involved so it's off topic, but how would
> > one go about "extracting pages" from that?
>
> vmalloc and I/O is in general problematic, because you need special
> calls to flush the cache for VIVT caches:
> flush_kernel_vmap_range / invalidate_kernel_vmap_range.
>
> If you want to stuff that into an iov_iter, the only sane way is to
> call vmalloc_to_page and build a bvec iter from that at the moment.
> You also need to do the flush_kernel_vmap_range /
> invalidate_kernel_vmap_range in the caller for it.
Thanks, this is on the virtio/network side so I don't need an iov_iter,
but I agree virtio will not like vmap addresses either -- I'd want to
resolve and pin the backing pages and pass those to a scatterlist like
we're doing here... Looking at extract_kvec_to_sg() it's just calling
vmalloc_to_page() on the kvec address if it's a vmalloc address, so
no pin, so I guess a pin might not be needed?
I was under the impression that vmalloc meant the kernel could shuffle
the physical data under us when we're not scheduled, but I'm really not
familiar enough with these memory management details; let's focus on the
bug at hand first...
I'll (eventually) poke back at this after we're all happy with the
zero-copy side.
> > I *think* we're fine on this end, as it's just passing the buffers into
> > a sg list for virtio, as long as things don't move under the caller I
> > assume they don't care...
>
> Ok, if your backend is virtio that's fine. If it's actual 9p over the
> network you might still run into issues, though.
TCP and other transports don't do zero-copy straight into the iov_iter,
so virtio is our only problem right now -- thankfully! :)
Thanks,
--
Dominique Martinet | Asmadeus
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-13 13:28 ` asmadeus
@ 2025-12-15 5:55 ` Christoph Hellwig
2025-12-15 7:34 ` Dominique Martinet
2025-12-19 12:00 ` David Howells
2025-12-19 11:26 ` David Howells
1 sibling, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2025-12-15 5:55 UTC (permalink / raw)
To: asmadeus
Cc: Christoph Hellwig, Eric Van Hensbergen, Latchesar Ionkov,
Christian Schoenebeck, v9fs, linux-kernel, David Howells,
Matthew Wilcox, linux-fsdevel, Chris Arges
[Dave: netfs questions below, please read]
On Sat, Dec 13, 2025 at 10:28:50PM +0900, asmadeus@codewreck.org wrote:
> Christoph Hellwig wrote on Wed, Dec 10, 2025 at 12:32:41AM -0800:
> > > Looking at the implementation for iov_iter_extract_bvec_pages() it looks
> > > like it might not process all the way to the end, so we need to loop on
> > > calling iov_iter_extract_pages()? (I see networking code looping on
> > > "while (iter->count > 0)")
> >
> > Yes.
>
> Ok, I don't understand why the current code locks everything down and
> wants to use a single scatterlist shared for the whole channel (and
> capped to 128 pages?), it should only need to lock around the
> virtqueue_add_sg() call, I'll need to play with that some more.
What do you mean with "lock down"?
> Looking at other virtio drivers I could probably use a sg_table and
> have extract_iter_to_sg() do all the work for us...
Looking at the code I'm actually really confused. Both because I
actually though we were talking about the 9fs direct I/O code, but
that has actually been removed / converted to netfs a long time ago.
But even more so what the net/9p code is actually doing.. How do
we even end up with user addresses here at all?
Let me try to understand things:
- p9_virtio_zc_request is the only instance of the p9_trans_module
zc_request operation.
- zc_request only gets called by p9_client_zc_rpc
- p9_client_zc_rpc gets called by p9_client_read_once, p9_client_write,
p9_client_write_subreq and p9_client_readdir
Let's go through these:
- p9_client_write_subreq is entirely unused
- p9_client_readdir builds a local iov_iter_kvec
- p9_client_read_once is only called by p9_client_read, and really
should be marked static.
- p9_client_read is called by v9fs_issue_read on a netfs iov_iter
and by v9fs_dir_readdir and v9fs_fid_xattr_get on a local kvec iter
- p9_client_write is called with a iov_iter_kvec from
v9fs_fid_xattr_set, and with a netfs-issued iov_iter by
v9fs_issue_write
So right now except for netfs everything is on a kvec. Dave, what
kind of iov_iter does netfs send down to the file system? I had
a bit of a hard time reading through it, but I'd expect that any
page pinning would be done in netfs and not below it? Why are we
using iov_iters here and not something like a bio_vec? What is
the fs / transport supposed to do with these iters?
Ignoring the rest of the mail for now, because I suspect the outcome
of the above might make it irrelevant, but I'll come back to it if
needed.
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-15 5:55 ` Christoph Hellwig
@ 2025-12-15 7:34 ` Dominique Martinet
2025-12-15 11:16 ` Christian Schoenebeck
` (2 more replies)
2025-12-19 12:00 ` David Howells
1 sibling, 3 replies; 24+ messages in thread
From: Dominique Martinet @ 2025-12-15 7:34 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck,
v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel,
Chris Arges
Thanks for having a look
Christoph Hellwig wrote on Sun, Dec 14, 2025 at 09:55:12PM -0800:
> > Ok, I don't understand why the current code locks everything down and
> > wants to use a single scatterlist shared for the whole channel (and
> > capped to 128 pages?), it should only need to lock around the
> > virtqueue_add_sg() call, I'll need to play with that some more.
>
> What do you mean with "lock down"?
Just the odd (to me) use of the chan->lock around basically all of
p9_virtio_request() and most of p9_virtio_zc_request() -- I'm pretty
sure this was just the author trying to avoid an allocation by recycling
the chan->sg array though, so ignore this.
> > Looking at other virtio drivers I could probably use a sg_table and
> > have extract_iter_to_sg() do all the work for us...
>
> Looking at the code I'm actually really confused. Both because I
> actually thought we were talking about the 9fs direct I/O code, but
> that has actually been removed / converted to netfs a long time ago.
>
> But even more so what the net/9p code is actually doing.. How do
> we even end up with user addresses here at all?
FWIW I tried logging and saw ITER_BVEC, ITER_KVEC and ITER_FOLIOQ --
O_DIRECT writes are seen as BVEC so I guess it's not as direct as I
expected them to be -- that code could very well be leftovers from
the switch to iov_iter back in 2015...
(I'm actually not sure why Christian suggested checking for is_iovec()
in https://lkml.kernel.org/r/2245723.irdbgypaU6@weasel -- then I
generalized it to user_backed_iter() and it just worked because checking
for that moved out bvec and folioq from iov_iter_get_pages_alloc2()
to... something that obviously should not work in my opinion but
apparently was enough to not trigger this particular BUG.)
> Let me try to understand things:
>
> - p9_virtio_zc_request is the only instance of the p9_trans_module
> zc_request operation.
> - zc_request only gets called by p9_client_zc_rpc
> - p9_client_zc_rpc gets called by p9_client_read_once, p9_client_write,
> p9_client_write_subreq and p9_client_readdir
>
> Let's go through these:
>
> - p9_client_write_subreq is entirely unused
Let's remove that... I'll send a patch later.
> - p9_client_readdir builds a local iov_iter_kvec
> - p9_client_read_once is only called by p9_client_read, and really
> should be marked static.
agreed, will clean up too.
> - p9_client_read is called by v9fs_issue_read on a netfs iov_iter
> and by v9fs_dir_readdir and v9fs_fid_xattr_get on a local kvec iter
> - p9_client_write is called with a iov_iter_kvec from
> v9fs_fid_xattr_set, and with a netfs-issued iov_iter by
> v9fs_issue_write
>
> So right now except for netfs everything is on a kvec. Dave, what
> kind of iov_iter does netfs send down to the file system? I had
> a bit of a hard time reading through it, but I'd expect that any
> page pinning would be done in netfs and not below it? Why are we
> using iov_iters here and not something like a bio_vec? What is
> the fs / transport supposed to do with these iters?
>
> Ignoring the rest of the mail for now, because I suspect the outcome
> of the above might make it irrelevant, but I'll come back to it if
> needed.
(waiting for David's answer here, but as far as I see the contract
between the transport and the vfs is that the transport should handle
whatever it's being fed, so it doesn't really matter if it's a bio_vec
or an iov_iter -- ultimately virtio or whatever backend wants to
handle zc likely won't handle bio_vec any better, so it'll need
converting anyway)
Thanks,
--
Dominique Martinet | Asmadeus
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-15 7:34 ` Dominique Martinet
@ 2025-12-15 11:16 ` Christian Schoenebeck
2025-12-15 14:37 ` Christoph Hellwig
2025-12-19 12:03 ` David Howells
2 siblings, 0 replies; 24+ messages in thread
From: Christian Schoenebeck @ 2025-12-15 11:16 UTC (permalink / raw)
To: Christoph Hellwig, Dominique Martinet
Cc: Eric Van Hensbergen, Latchesar Ionkov, v9fs, linux-kernel,
David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges
On Monday, 15 December 2025 08:34:12 CET Dominique Martinet wrote:
> Thanks for having a look
>
> Christoph Hellwig wrote on Sun, Dec 14, 2025 at 09:55:12PM -0800:
> > > Ok, I don't understand why the current code locks everything down and
> > > wants to use a single scatterlist shared for the whole channel (and
> > > capped to 128 pages?), it should only need to lock around the
> > > virtqueue_add_sg() call, I'll need to play with that some more.
> >
> > What do you mean with "lock down"?
>
> Just the odd (to me) use of the chan->lock around basically all of
> p9_virtio_request() and most of p9_virtio_zc_request() -- I'm pretty
> sure this was just the author trying to avoid an allocation by recycling
> the chan->sg array though, so ignore this.
The lock protects the channel-wide, shared scatterlist while the scatterlist
is filled from the linear buffers by pack_sg_list(). Then virtqueue_add_sgs()
pushes the scatterlist's segments as virtio descriptors into the virtio FIFOs.
From this point it is safe to unlock, as the scatterlist is no longer needed.
And yes, the assumption probably was that handling the scatterlist as a
temporary was more expensive due to allocation.
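Roughly this flow (simplified from memory of p9_virtio_request();
declarations, error handling and the retry path elided, so double check
against the actual code):
```
spin_lock_irqsave(&chan->lock, flags);
/* chan->sg is shared channel-wide, hence holding the lock while filling */
out = pack_sg_list(chan->sg, 0, VIRTQUEUE_NUM,
		   req->tc.sdata, req->tc.size);
in = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM,
		  req->rc.sdata, req->rc.capacity);
sgs[out_sgs++] = chan->sg;
sgs[out_sgs + in_sgs++] = chan->sg + out;
/* copies the segments into the virtio ring as descriptors ... */
err = virtqueue_add_sgs(chan->vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC);
/* ... so from here on the scatterlist can be reused */
virtqueue_kick(chan->vq);
spin_unlock_irqrestore(&chan->lock, flags);
```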
> > > Looking at other virtio drivers I could probably use a sg_table and
> > > have extract_iter_to_sg() do all the work for us...
> >
> > Looking at the code I'm actually really confused. Both because I
> > actually thought we were talking about the 9fs direct I/O code, but
> > that has actually been removed / converted to netfs a long time ago.
> >
> > But even more so what the net/9p code is actually doing.. How do
> > we even end up with user addresses here at all?
>
> FWIW I tried logging and saw ITER_BVEC, ITER_KVEC and ITER_FOLIOQ --
> O_DIRECT writes are seen as BVEC so I guess it's not as direct as I
> expected them to be -- that code could very well be leftovers from
> the switch to iov_iter back in 2015...
>
> (I'm actually not sure why Christian suggested checking for is_iovec()
> in https://lkml.kernel.org/r/2245723.irdbgypaU6@weasel -- then I
> generalized it to user_backed_iter() and it just worked because checking
> for that moved out bvec and folioq from iov_iter_get_pages_alloc2()
> to... something that obviously should not work in my opinion but
> apparently was enough to not trigger this particular BUG.)
Sorry, I should have explained why I suggested that change: my understanding
of the p9_get_mapped_pages() code was that the original author intended to go
down the iov_iter_get_pages_alloc2() path exclusively for user space memory,
and that he assumed his preceding !iov_iter_is_kvec(data) check would guard
this appropriately. But apparently there are several other iterator types
that are kernel memory as well. So I thought switching the check to
is_iovec() would be better as these are user space memory; however, I missed
that there is also ITER_UBUF with user space memory, which you fortunately
caught by suggesting the user_backed_iter() check instead.
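For reference, user_backed_iter() should nowadays boil down to exactly
those two user space backed iterator types (quoting include/linux/uio.h
from memory, so double check):
```
static inline bool user_backed_iter(const struct iov_iter *i)
{
	return iter_is_ubuf(i) || iter_is_iovec(i);
}
```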
/Christian
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-15 7:34 ` Dominique Martinet
2025-12-15 11:16 ` Christian Schoenebeck
@ 2025-12-15 14:37 ` Christoph Hellwig
2025-12-19 12:03 ` David Howells
2 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2025-12-15 14:37 UTC (permalink / raw)
To: Dominique Martinet
Cc: Christoph Hellwig, Eric Van Hensbergen, Latchesar Ionkov,
Christian Schoenebeck, v9fs, linux-kernel, David Howells,
Matthew Wilcox, linux-fsdevel, Chris Arges
On Mon, Dec 15, 2025 at 04:34:12PM +0900, Dominique Martinet wrote:
> Christoph Hellwig wrote on Sun, Dec 14, 2025 at 09:55:12PM -0800:
> > > Ok, I don't understand why the current code locks everything down and
> > > wants to use a single scatterlist shared for the whole channel (and
> > > capped to 128 pages?), it should only need to lock around the
> > > virtqueue_add_sg() call, I'll need to play with that some more.
> >
> > What do you mean with "lock down"?
>
> Just the odd (to me) use of the chan->lock around basically all of
> p9_virtio_request() and most of p9_virtio_zc_request() -- I'm pretty
> sure this was just the author trying to avoid an allocation by recycling
> the chan->sg array though, so ignore this.
Oh, ok. This seems unrelated to the handling of the iov_iters and
I'm sorry that I don't really know anything about that part.
>
> > > Looking at other virtio drivers I could probably use a sg_table and
> > > have extract_iter_to_sg() do all the work for us...
> >
> > Looking at the code I'm actually really confused. Both because I
> > actually though we were talking about the 9fs direct I/O code, but
> > that has actually been removed / converted to netfs a long time ago.
> >
> > But even more so what the net/9p code is actually doing.. How do
> > we even end up with user addresses here at all?
>
> FWIW I tried logging and saw ITER_BVEC, ITER_KVEC and ITER_FOLIOQ --
> O_DIRECT writes are seen as BVEC so I guess it's not as direct as I
> expected them to be -- that code could very well be leftovers from
> the switch to iov_iter back in 2015...
Oh right, I think this comes from Dave's netfs_extract_user_iter.
> (waiting for David's answer here, but as far as I see the contract
> between the transport and the vfs is that the transport should handle
> whatever it's being fed, so it doesn't really matter if it's a bio_vec
> or an iov_iter -- ultimately virtio or whatever backend wants to
> handle zc likely won't handle bio_vec any better, so it'll need
> converting anyway)
Yeah. Looking at what the code does with the pages, I think all this
should go away in favor of using extract_iter_to_sg() and building the
scatterlists directly from the iters, without the extra page indirection.
(and of course one day virtio should migrate away from scatterlists,
but that's for another time).
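i.e. something like this untested sketch (the function name is made up;
the caller would provide an sg array with room for sg_max entries behind
the sg_table):
```
#include <linux/scatterlist.h>
#include <linux/uio.h>

static int p9_iter_to_sgtable(struct iov_iter *iter, size_t len,
			      struct sg_table *sgt, unsigned int sg_max)
{
	/* advances the iterator, pins user pages, returns bytes added */
	ssize_t n = extract_iter_to_sg(iter, len, sgt, sg_max, 0);

	if (n <= 0)
		return n ?: -EIO;
	sg_mark_end(&sgt->sgl[sgt->nents - 1]);
	/* sgt->sgl can now go straight into virtqueue_add_sgs() */
	return 0;
}
```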
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-09 21:04 [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec Dominique Martinet via B4 Relay
2025-12-10 4:21 ` Matthew Wilcox
2025-12-10 6:04 ` Christoph Hellwig
@ 2025-12-17 13:41 ` Christian Schoenebeck
2025-12-17 21:49 ` asmadeus
2025-12-19 12:06 ` [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec David Howells
2 siblings, 2 replies; 24+ messages in thread
From: Christian Schoenebeck @ 2025-12-17 13:41 UTC (permalink / raw)
To: Eric Van Hensbergen, Latchesar Ionkov, asmadeus, Chris Arges,
Dominique Martinet
Cc: v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel
On Tuesday, 9 December 2025 22:04:23 CET Dominique Martinet via B4 Relay
wrote:
> From: Dominique Martinet <asmadeus@codewreck.org>
>
> When doing a loop mount of a filesystem over 9p, read requests can come
> from unexpected places and blow up as reported by Chris Arges with this
> reproducer:
> ```
> dd if=/dev/zero of=./xfs.img bs=1M count=300
> yes | mkfs.xfs -b size=8192 ./xfs.img
> rm -rf ./mount && mkdir -p ./mount
> mount -o loop ./xfs.img ./mount
> ```
>
> The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> called on folios (as illustrated by the backtrace below), so limit what
> iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
[...]
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 10c2dd48643818907f4370243eb971fceba4d40b..f7ee1f864b03a59568510eb0dd3496bd05b3b8d6 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -318,7 +318,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
> if (!iov_iter_count(data))
> return 0;
>
> - if (!iov_iter_is_kvec(data)) {
> + if (user_backed_iter(data)) {
> int n;
> /*
> * We allow only p9_max_pages pinned. We wait for the
>
> ---
> base-commit: 3e281113f871d7f9c69ca55a4d806a72180b7e8a
> change-id: 20251210-virtio_trans_iter-5973892db2e3
Something's seriously messed up with 9p cache right now. With today's git
master I do get data corruption in any 9p cache mode, including cache=mmap,
only cache=none behaves clean.
With this patch applied though it gets even worse as I can't even boot due to
immediate 9p data corruption. Could be coincidence, as I get corruption
without this patch in any cache mode as well, but at least I have to try much
harder to trigger it.
Currently busy with other stuff, so could be a while before being able to
identify the cause.
/Christian
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-17 13:41 ` Christian Schoenebeck
@ 2025-12-17 21:49 ` asmadeus
2025-12-18 14:21 ` asmadeus
2025-12-19 12:06 ` [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec David Howells
1 sibling, 1 reply; 24+ messages in thread
From: asmadeus @ 2025-12-17 21:49 UTC (permalink / raw)
To: Christian Schoenebeck
Cc: Eric Van Hensbergen, Latchesar Ionkov, Chris Arges, v9fs,
linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel
Christian Schoenebeck wrote on Wed, Dec 17, 2025 at 02:41:31PM +0100:
> Something's seriously messed up with 9p cache right now. With today's git
> master I do get data corruption in any 9p cache mode, including cache=mmap,
> only cache=none behaves clean.
Ugh...
> With this patch applied though it gets even worse as I can't even boot due to
> immediate 9p data corruption. Could be coincidence, as I get corruption
> without this patch in any cache mode as well, but at least I have to try much
> harder to trigger it.
As said before don't bother with this patch, it's definitely wrong --
we can't just use data->kvec->iov_base for bvec or folioq iters, so
there *will* be corruptions with this patch.
I'll try to see if I can produce anything wrong with master...
--
Dominique
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-17 21:49 ` asmadeus
@ 2025-12-18 14:21 ` asmadeus
2025-12-18 15:14 ` Christian Schoenebeck
0 siblings, 1 reply; 24+ messages in thread
From: asmadeus @ 2025-12-18 14:21 UTC (permalink / raw)
To: Christian Schoenebeck
Cc: Eric Van Hensbergen, Latchesar Ionkov, Chris Arges, v9fs,
linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel
asmadeus@codewreck.org wrote on Thu, Dec 18, 2025 at 06:49:04AM +0900:
> Christian Schoenebeck wrote on Wed, Dec 17, 2025 at 02:41:31PM +0100:
>> Something's seriously messed up with 9p cache right now. With today's git
>> master I do get data corruption in any 9p cache mode, including cache=mmap,
>> only cache=none behaves clean.
>
> Ugh...
I've updated from v6.18-rc2 + 9p pull requests to master and I can't
reproduce any obvious corruption, booting an alpine rootfs over 9p and
building a kernel inside (tried cache=loose and mmap)
Won't be the first time I can't reproduce, but what kind of workload are
you testing?
Anything that might help me try to reproduce (like VM cpu count/memory)
will be appreciated, corruptions are Bad...
Thanks,
--
Dominique
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-18 14:21 ` asmadeus
@ 2025-12-18 15:14 ` Christian Schoenebeck
2025-12-18 23:14 ` asmadeus
0 siblings, 1 reply; 24+ messages in thread
From: Christian Schoenebeck @ 2025-12-18 15:14 UTC (permalink / raw)
To: asmadeus
Cc: Eric Van Hensbergen, Latchesar Ionkov, Chris Arges, v9fs,
linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel
On Thursday, 18 December 2025 15:21:43 CET asmadeus@codewreck.org wrote:
> asmadeus@codewreck.org wrote on Thu, Dec 18, 2025 at 06:49:04AM +0900:
> > Christian Schoenebeck wrote on Wed, Dec 17, 2025 at 02:41:31PM +0100:
> >> Something's seriously messed up with 9p cache right now. With today's git
> >> master I do get data corruption in any 9p cache mode, including
> >> cache=mmap,
> >> only cache=none behaves clean.
> >
> > Ugh...
>
> I've updated from v6.18-rc2 + 9p pull requests to master and I can't
> reproduce any obvious corruption, booting an alpine rootfs over 9p and
> building a kernel inside (tried cache=loose and mmap)
>
> Won't be the first time I can't reproduce, but what kind of workload are
> you testing?
> Anything that might help me try to reproduce (like VM cpu count/memory)
> will be appreciated, corruptions are Bad...
Debian Trixie guest running as 9p rootfs in QEMU, 4 cores, 16 GB.
Compiling a bunch of projects with GCC works fine without errors, but with
clang it's very simple for me to reproduce. E.g. just a very short C++ file
that pulls in some system headers:
#include <utility>
#include <sys/cdefs.h>
#include <limits>
Then running 3 times: clang++ -c foo.cpp -std=c++17
The first 2 clang runs succeed, the 3rd clang run then always blows up for
anything else than cache=none, various spurious clang errors on those system
headers like
error: source file is not valid UTF-8
...
warning: null character ignored [-Wnull-character]
...
error: expected unqualified-id
and finally clang crashes.
/Christian
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-18 15:14 ` Christian Schoenebeck
@ 2025-12-18 23:14 ` asmadeus
2025-12-19 4:09 ` 9p read corruption of mmaped content (Was: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec) Dominique Martinet
0 siblings, 1 reply; 24+ messages in thread
From: asmadeus @ 2025-12-18 23:14 UTC (permalink / raw)
To: Christian Schoenebeck
Cc: Eric Van Hensbergen, Latchesar Ionkov, Chris Arges, v9fs,
linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel
Christian Schoenebeck wrote on Thu, Dec 18, 2025 at 04:14:45PM +0100:
> > Won't be the first time I can't reproduce, but what kind of workload are
> > you testing?
> > Anything that might help me try to reproduce (like VM cpu count/memory)
> > will be appreciated, corruptions are Bad...
>
> Debian Trixie guest running as 9p rootfs in QEMU, 4 cores, 16 GB.
>
> Compiling a bunch of projects with GCC works fine without errors, but with
> clang it's very simple for me to reproduce. E.g. just a very short C++ file
> that pulls in some system headers:
>
> #include <utility>
> #include <sys/cdefs.h>
> #include <limits>
>
> Then running 3 times: clang++ -c foo.cpp -std=c++17
>
> The first 2 clang runs succeed; the 3rd clang run then always blows up for
> anything other than cache=none, with various spurious clang errors on those
> system headers like
Thanks, I can't reproduce with this example, but building linux with
`make LLVM=1` does blow up on debian... even with cache=none actually?
I couldn't reproduce running the same rootfs directory in a container so
I don't think I corrupted my image, it appears to be reading junk? short
reads perhaps?...
(Interestingly, it doesn't seem to blow up on an alpine rootfs, I wonder
what's different...)
I'm now running late for work, but at least there's something I can
reproduce, I'll have a closer look ASAP, thank you.
--
Dominique Martinet | Asmadeus
* 9p read corruption of mmaped content (Was: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec)
2025-12-18 23:14 ` asmadeus
@ 2025-12-19 4:09 ` Dominique Martinet
2025-12-19 11:53 ` Dominique Martinet
2025-12-19 14:01 ` David Howells
0 siblings, 2 replies; 24+ messages in thread
From: Dominique Martinet @ 2025-12-19 4:09 UTC (permalink / raw)
To: David Howells, Christian Schoenebeck
Cc: Eric Van Hensbergen, Latchesar Ionkov, Chris Arges, v9fs,
linux-kernel, Matthew Wilcox, linux-fsdevel
David, would be great if you can find a minute to look at the netfs
trace below.
asmadeus@codewreck.org wrote on Fri, Dec 19, 2025 at 08:14:58AM +0900:
> Christian Schoenebeck wrote on Thu, Dec 18, 2025 at 04:14:45PM +0100:
> > > Won't be the first time I can't reproduce, but what kind of workload are
> > > you testing?
> > > Anything that might help me try to reproduce (like VM cpu count/memory)
> > > will be appreciated, corruptions are Bad...
> >
> > Debian Trixie guest running as 9p rootfs in QEMU, 4 cores, 16 GB.
> >
> > Compiling a bunch of projects with GCC works fine without errors, but with
> > clang it's very simple for me to reproduce. E.g. just a very short C++ file
> > that pulls in some system headers:
> >
> > #include <utility>
> > #include <sys/cdefs.h>
> > #include <limits>
> >
> > Then running 3 times: clang++ -c foo.cpp -std=c++17
> >
> > The first 2 clang runs succeed, the 3rd clang run then always blows up for
> > anything else than cache=none, various spurious clang errors on those system
> > headers like
>
> Thanks, I can't reproduce with this example, but building linux with
> `make LLVM=1` does blow up on debian... even with cache=none actually?
Sorry, that must have been because I still had cache=loose on the system
mount, even though the linux repo was mounted with cache=none; I agree this
doesn't happen with none (which agrees with the following, because mmap
wouldn't be allowed in the first place...)
> I couldn't reproduce running the same rootfs directory in a container so
> I don't think I corrupted my image, it appears to be reading junk? short
> reads perhaps?...
> (Interestingly, it doesn't seem to blow up on an alpine rootfs, I wonder
> what's different...)
>
> I'm now running late for work, but at least there's something I can
> reproduce, I'll have a closer look ASAP, thank you.
Curiosity won over work... So I straced a failing clang command, and it
seems to always fail on a file that's been mmap'd:
120129 openat(AT_FDCWD, "./include/linux/vmstat.h", O_RDONLY|O_CLOEXEC) = 3
120129 readlink("/proc/self/fd/3", "/mnt/linux/include/linux/vmstat.h", 4096) = 33
120129 fstat(3, {st_mode=S_IFREG|0644, st_size=16755, ...}) = 0
120129 mmap(NULL, 16755, PROT_READ, MAP_PRIVATE|MAP_NORESERVE, 3, 0) = 0x7f697182e000
120129 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
120129 close(3) = 0
120129 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
120129 openat(AT_FDCWD, "./include/linux/vm_event_item.h", O_RDONLY|O_CLOEXEC) = 3
120129 readlink("/proc/self/fd/3", "/mnt/linux/include/linux/vm_event_item.h", 4096) = 40
120129 fstat(3, {st_mode=S_IFREG|0644, st_size=4526, ...}) = 0
120129 pread64(3, "/* SP...", 4526, 0) = 4526
120129 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
120129 close(3) = 0
120129 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
120129 brk(0x55665820c000) = 0x55665820c000
120129 brk(0x55665822d000) = 0x55665822d000
120129 brk(0x55665824e000) = 0x55665824e000
120129 write(2, "In file included from ", 22) = 22
./include/linux/vmstat.h:616:1: error: unknown type name 'ons'
(fun fact: the error line always seems to be the file's total number of
lines + 1... but I assume that's just luck at this point)
Unfortunately that's about as far as I got with anything clear...
- I tried a couple of variants around doing mmap() close() and accessing
data but couldn't create a simpler reproducer
- I took a dump of dmesg (with debug=65535) and tracepoints (netfs, 9p),
and it looks like the mmapped file IO is mostly correct? -- a TREAD is
issued with the correct size, I'm seeing the data is collected... and...
what is that ZERO SUBMT with the same size? Could it be related?
David, could you please have a look?
This was a different occurrence from the strace above; that time it was the
0x5fb2 read (24498 bytes) -- I think I got everything related to it here:
----
clang-318 [002] ..... 3031.183402: netfs_rreq_ref: R=00001b55 NEW r=2
clang-318 [002] ..... 3031.183402: netfs_read: R=00001b55 READAHEAD c=00000000 ni=157f3 s=0 l=6000 sz=5fb2
clang-318 [002] ..... 3031.183403: netfs_folioq: R=00001b55 fq=1edf roll-init
clang-318 [002] ..... 3031.183403: netfs_rreq_ref: R=00001b55 GET SUBREQ r=3
clang-318 [002] ..... 3031.183404: netfs_folio: i=157f3 ix=00000-00000 read
clang-318 [002] ..... 3031.183404: netfs_folio: i=157f3 ix=00001-00001 read
clang-318 [002] ..... 3031.183404: netfs_folio: i=157f3 ix=00002-00002 read
clang-318 [002] ..... 3031.183404: netfs_folio: i=157f3 ix=00003-00003 read
clang-318 [002] ..... 3031.183404: netfs_folio: i=157f3 ix=00004-00004 read
clang-318 [002] ..... 3031.183404: netfs_folio: i=157f3 ix=00005-00005 read
clang-318 [002] ..... 3031.183408: 9p_protocol_dump: clnt 18446612686390831168 P9_TREAD(tag = 0)
17 00 00 00 74 00 00 14 00 00 00 00 00 00 00 00 00 00 00 b2 5f 00 00 00 00 00 00 00 a4 81 00 00
clang-318 [002] ..... 3031.183408: 9p_client_req: client 18446612686390831168 request P9_TREAD tag 0
clang-318 [002] ..... 3031.183445: 9p_protocol_dump: clnt 18446612686390831168 P9_RREAD(tag = 0)
bd 5f 00 00 75 00 00 b2 5f 00 00 ff 3f 00 00 00 00 00 00 00 00 00 02 00 00 00 00 00 a4 81 00 00
clang-318 [002] ..... 3031.183450: 9p_client_res: client 18446612686390831168 response P9_TREAD tag 0 err 0
clang-318 [002] ..... 3031.183452: netfs_sreq: R=00001b55[1] DOWN TERM f=192 s=0 5fb2/5fb2 s=5 e=0
clang-318 [002] .N... 3031.183454: netfs_sreq_ref: R=00001b55[1] PUT TERM r=1
kworker/u16:2-233 [002] ..... 3031.183455: netfs_rreq_ref: R=00001b55 SEE WORK r=3
kworker/u16:2-233 [002] ..... 3031.183455: netfs_rreq: R=00001b55 RA COLLECT f=101
kworker/u16:2-233 [002] ..... 3031.183455: netfs_collect: R=00001b55 s=0-6000
kworker/u16:2-233 [002] ..... 3031.183455: netfs_collect_sreq: R=00001b55[0:01] s=0 t=5fb2/5fb2
kworker/u16:2-233 [002] ..... 3031.183455: netfs_collect_folio: R=00001b55 ix=00000 r=0-1000 t=0/5fb2
kworker/u16:2-233 [002] ..... 3031.183456: netfs_folio: i=157f3 ix=00000-00000 read-done
kworker/u16:2-233 [002] ..... 3031.183456: netfs_folio: i=157f3 ix=00000-00000 read-unlock
kworker/u16:2-233 [002] ..... 3031.183456: netfs_collect_folio: R=00001b55 ix=00001 r=1000-2000 t=1000/5fb2
kworker/u16:2-233 [002] ..... 3031.183456: netfs_folio: i=157f3 ix=00001-00001 read-done
kworker/u16:2-233 [002] ..... 3031.183457: netfs_folio: i=157f3 ix=00001-00001 read-unlock
kworker/u16:2-233 [002] ..... 3031.183457: netfs_collect_folio: R=00001b55 ix=00002 r=2000-3000 t=2000/5fb2
kworker/u16:2-233 [002] ..... 3031.183457: netfs_folio: i=157f3 ix=00002-00002 read-done
kworker/u16:2-233 [002] ..... 3031.183457: netfs_folio: i=157f3 ix=00002-00002 read-unlock
kworker/u16:2-233 [002] ..... 3031.183457: netfs_collect_folio: R=00001b55 ix=00003 r=3000-4000 t=3000/5fb2
kworker/u16:2-233 [002] ..... 3031.183458: netfs_folio: i=157f3 ix=00003-00003 read-done
kworker/u16:2-233 [002] ..... 3031.183458: netfs_folio: i=157f3 ix=00003-00003 read-unlock
kworker/u16:2-233 [002] ..... 3031.183458: netfs_collect_folio: R=00001b55 ix=00004 r=4000-5000 t=4000/5fb2
kworker/u16:2-233 [002] ..... 3031.183458: netfs_folio: i=157f3 ix=00004-00004 read-done
kworker/u16:2-233 [002] ..... 3031.183458: netfs_folio: i=157f3 ix=00004-00004 read-unlock
kworker/u16:2-233 [002] ..... 3031.183459: netfs_collect_folio: R=00001b55 ix=00005 r=5000-5fb2 t=5000/5fb2
kworker/u16:2-233 [002] ..... 3031.183459: netfs_folio: i=157f3 ix=00005-00005 read-done
kworker/u16:2-233 [002] ..... 3031.183459: netfs_folio: i=157f3 ix=00005-00005 read-unlock
kworker/u16:2-233 [002] ..... 3031.183459: netfs_sreq: R=00001b55[1] DOWN CONSM f=092 s=0 5fb2/5fb2 s=5 e=0
kworker/u16:2-233 [002] ..... 3031.183460: netfs_sreq_ref: R=00001b55[1] PUT DONE r=0
kworker/u16:2-233 [002] ..... 3031.183460: netfs_sreq: R=00001b55[1] DOWN FREE f=092 s=0 5fb2/5fb2 s=5 e=0
kworker/u16:2-233 [002] ..... 3031.183460: netfs_rreq_ref: R=00001b55 PUT SUBREQ r=2
kworker/u16:2-233 [002] ..... 3031.183460: netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff
kworker/u16:2-233 [002] ..... 3031.183460: netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=c
kworker/u16:2-233 [002] ..... 3031.183460: netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff
kworker/u16:2-233 [002] ..... 3031.183460: netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=8
kworker/u16:2-233 [002] ..... 3031.183460: netfs_rreq_ref: R=00001b55 SEE WORK CP r=2
clang-318 [002] ..... 3031.183462: netfs_rreq_ref: R=00001b55 GET SUBREQ r=3
clang-318 [002] ..... 3031.183462: netfs_sreq: R=00001b55[2] ZERO SUBMT f=000 s=5fb2 0/4e s=0 e=0
clang-318 [002] ..... 3031.183462: netfs_sreq: R=00001b55[2] ZERO TERM f=102 s=5fb2 4e/4e s=5 e=0
clang-318 [002] ..... 3031.183463: netfs_sreq_ref: R=00001b55[2] PUT TERM r=1
clang-318 [002] ..... 3031.183463: netfs_rreq_ref: R=00001b55 PUT RETURN r=2
kworker/u16:2-233 [002] ..... 3031.183476: netfs_rreq_ref: R=00001b55 SEE WORK r=2
kworker/u16:2-233 [002] ..... 3031.183476: netfs_rreq: R=00001b55 RA COLLECT f=103
kworker/u16:2-233 [002] ..... 3031.183476: netfs_collect: R=00001b55 s=0-6000
kworker/u16:2-233 [002] ..... 3031.183476: netfs_collect_sreq: R=00001b55[0:02] s=5fb2 t=4e/4e
kworker/u16:2-233 [002] ..... 3031.183476: netfs_sreq: R=00001b55[2] ZERO CONSM f=002 s=5fb2 4e/4e s=5 e=0
kworker/u16:2-233 [002] ..... 3031.183477: netfs_sreq_ref: R=00001b55[2] PUT DONE r=0
kworker/u16:2-233 [002] ..... 3031.183477: netfs_sreq: R=00001b55[2] ZERO FREE f=002 s=5fb2 4e/4e s=5 e=0
kworker/u16:2-233 [002] ..... 3031.183478: netfs_rreq_ref: R=00001b55 PUT SUBREQ r=1
kworker/u16:2-233 [002] ..... 3031.183478: netfs_collect_stream: R=00001b55[0:] cto=6000 frn=ffffffff
kworker/u16:2-233 [002] ..... 3031.183478: netfs_collect_state: R=00001b55 col=6000 cln=6000 n=c
kworker/u16:2-233 [002] ..... 3031.183478: netfs_collect_stream: R=00001b55[0:] cto=6000 frn=ffffffff
kworker/u16:2-233 [002] ..... 3031.183479: netfs_collect_state: R=00001b55 col=6000 cln=6000 n=8
kworker/u16:2-233 [002] ..... 3031.183479: netfs_rreq: R=00001b55 RA COMPLET f=103
kworker/u16:2-233 [002] ..... 3031.183479: netfs_rreq: R=00001b55 RA WAKE-IP f=102
kworker/u16:2-233 [002] ..... 3031.183479: netfs_rreq: R=00001b55 RA DONE f=102
kworker/u16:2-233 [002] ..... 3031.183479: netfs_rreq_ref: R=00001b55 PUT WORK IP r=0
kworker/u16:2-233 [002] ..... 3031.183482: netfs_rreq: R=00001b55 RA FREE f=102
kworker/u16:2-233 [002] ..... 3031.183482: 9p_fid_ref: put fid 20, refcount 2
kworker/u16:2-233 [002] ..... 3031.183482: netfs_folioq: R=00001b55 fq=1edf clear
----
If you know how I could also track the page fault from clang, I'll be
happy to try to grab that too.
Thanks,
--
Dominique
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-13 13:28 ` asmadeus
2025-12-15 5:55 ` Christoph Hellwig
@ 2025-12-19 11:26 ` David Howells
1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2025-12-19 11:26 UTC (permalink / raw)
To: asmadeus
Cc: dhowells, Christoph Hellwig, Eric Van Hensbergen,
Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel,
Matthew Wilcox, linux-fsdevel, Chris Arges
asmadeus@codewreck.org wrote:
> FWIW, the comment at the top of extract_iter_to_sg() says:
> > The iov_iter_extract_mode() function should be used to query how cleanup
> but I couldn't find any such function (even back when this comment was
> added to the netfs code in 2023...); the two copies of this comment
> could probably use updating... David?
Ah, the function got renamed to iov_iter_extract_will_pin() instead. I need
to fix the comment.
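(For reference, a minimal sketch of the convention that comment is
describing, using the renamed helper - this is not the actual
extract_iter_to_sg() code, just the cleanup pattern:)
```
/* Extract pages, then ask iov_iter_extract_will_pin() how to clean up:
 * user-backed iterators get their pages pinned and need
 * unpin_user_page(); kernel-backed iterators take no pin or ref. */
struct page **pages = NULL;
size_t offset;
ssize_t len;

len = iov_iter_extract_pages(iter, &pages, maxsize, maxpages, 0, &offset);
if (len < 0)
	return len;
/* ... build the sg list and perform the I/O ... */
if (iov_iter_extract_will_pin(iter)) {
	unsigned int i, npages = DIV_ROUND_UP(offset + len, PAGE_SIZE);

	for (i = 0; i < npages; i++)
		unpin_user_page(pages[i]);
}
kvfree(pages);	/* the array was allocated by iov_iter_extract_pages() */
```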
David
* Re: 9p read corruption of mmaped content (Was: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec)
2025-12-19 4:09 ` 9p read corruption of mmaped content (Was: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec) Dominique Martinet
@ 2025-12-19 11:53 ` Dominique Martinet
2025-12-19 13:46 ` Dominique Martinet
2025-12-19 14:01 ` David Howells
1 sibling, 1 reply; 24+ messages in thread
From: Dominique Martinet @ 2025-12-19 11:53 UTC (permalink / raw)
To: David Howells, Christian Schoenebeck
Cc: Eric Van Hensbergen, Latchesar Ionkov, Chris Arges, v9fs,
linux-kernel, Matthew Wilcox, linux-fsdevel
Dominique Martinet wrote on Fri, Dec 19, 2025 at 01:09:26PM +0900:
> - I took a dump of dmesg (with debug=65535) and tracepoints (netfs, 9p),
> and it looks like the mmaped file IO is mostly correct? -- a TREAD is
> issued with the correct size, I'm seeing the data is collected... and..
> what is that ZERO SUBMT with the same size? Could it be related?
> David, could you please have a look?
So, answering myself on that ZERO submit now that I've looked up the
tracepoint definition: s=%x is the start offset and the following item
is the length, so [0-5fb2[ was "downloaded from the server" and
[5fb2-6000[ was "filled with zero" -- there's nothing wrong here...
Both are triggered straight from the page fault, so that answers my last
question as well:
clang 691 [000] 18146.476058: netfs:netfs_sreq: R=0003306d[1] DOWN TERM f=192 s=0 5fb2/5fb2 s=5 e=0
ffffffff81601197 __traceiter_netfs_sreq+0x37 ([kernel.kallsyms])
ffffffff8160625a netfs_read_subreq_terminated+0x10a ([kernel.kallsyms])
ffffffff817fcbc6 v9fs_issue_read+0x86 ([kernel.kallsyms])
ffffffff815fd86b netfs_read_to_pagecache+0x18b ([kernel.kallsyms])
ffffffff815fdc62 netfs_readahead+0x152 ([kernel.kallsyms])
ffffffff814b9c2a read_pages+0x4a ([kernel.kallsyms])
ffffffff814b9f20 page_cache_ra_unbounded+0x190 ([kernel.kallsyms])
ffffffff814ba677 page_cache_ra_order+0x387 ([kernel.kallsyms])
ffffffff814ae2a9 filemap_fault+0x5c9 ([kernel.kallsyms])
ffffffff814ee518 __do_fault+0x38 ([kernel.kallsyms])
ffffffff814f83a5 __handle_mm_fault+0xbb5 ([kernel.kallsyms])
ffffffff814f9509 handle_mm_fault+0x99 ([kernel.kallsyms])
ffffffff812a6e2f do_user_addr_fault+0x1ef ([kernel.kallsyms])
ffffffff81ceb097 exc_page_fault+0x67 ([kernel.kallsyms])
ffffffff81000bb7 asm_exc_page_fault+0x27 ([kernel.kallsyms])
7f776ed86c23 clang::SrcMgr::ContentCache::getInvalidBOM(llvm::StringRef)+0x13 (/usr/lib/x86_64-linux-gnu/li
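(For reference, decoding the two subrequests with that field layout:
```
DOWN: s=0      len=0x5fb2  -> covers [0x0000, 0x5fb2) : data from the server
ZERO: s=0x5fb2 len=0x4e    -> covers [0x5fb2, 0x6000) : zeroed tail
0x5fb2 + 0x4e = 0x6000 = 6 * 0x1000, i.e. exactly six 4k pages,
with i_size = 0x5fb2
```
so the two subrequests tile the six-page read with no gap or overlap.)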
Which leaves me absolutely clueless as to where to look next; I assume
the data is populated cleanly, but by the time later pages are read by
clang the content has changed, or something like that?
I'll try harder to create a synthetic/more deterministic reproducer for
now...
Any ideas welcome...!
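(One possible shape for such a reproducer, as a sketch - the path is a
placeholder, and actually hitting the race would presumably need the
last page to be faulted in fresh each time, e.g. looping after dropping
caches: mmap a file on the 9p mount whose size is not page-aligned and
check that the bytes between EOF and the end of the last page read as
zero, as mmap(2) guarantees:)
```
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/mnt/9p/testfile";	/* placeholder path */
	long psz = sysconf(_SC_PAGESIZE);
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size % psz == 0)
		return 1;	/* need a file with a partial last page */

	size_t maplen = ((st.st_size / psz) + 1) * psz;
	char *p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* if the folio was unlocked before its tail was zeroed, a fault
	 * in that window can observe stale bytes past EOF */
	for (size_t i = st.st_size; i < maplen; i++)
		if (p[i] != 0)
			printf("nonzero tail byte at offset %zu\n", i);

	munmap(p, maplen);
	close(fd);
	return 0;
}
```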
--
Dominique Martinet | Asmadeus
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-15 5:55 ` Christoph Hellwig
2025-12-15 7:34 ` Dominique Martinet
@ 2025-12-19 12:00 ` David Howells
1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2025-12-19 12:00 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, asmadeus, Eric Van Hensbergen, Latchesar Ionkov,
Christian Schoenebeck, v9fs, linux-kernel, Matthew Wilcox,
linux-fsdevel, Chris Arges
Christoph Hellwig <hch@infradead.org> wrote:
> So right now except for netfs everything is on a kvec. Dave, what
> kind of iov_iter does netfs send down to the file system?
It depends. For buffered I/O it's an ITER_FOLIOQ, as you might expect, for
both reading and writing.
For direct (and unbuffered) I/O, it's more complicated.
For direct writes, if the source iter is user-backed, it'll be extracted to an
ITER_BVEC by netfs_extract_user_iter(). This calls iov_iter_extract_pages()
to do the pinning - and netfs will release the pins later. However, the
network layer, if MSG_SPLICE_PAGES is set, will attempt to take refs on
those pages.
For direct writes, if the source iter is ITER_BVEC/KVEC/FOLIOQ (also XARRAY,
but I'm trying to get rid of that), netfs passes a slice of the source iter
down to the filesystem. Netfs does not take refs on it as there's no
guarantee that the memory it points to has pages with refcounts.
Netfs will at some point soon hopefully acquire the ability to do bounce
buffering, but it's not there yet.
Direct reads work the same as direct writes.
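(As an illustrative sketch of that taxonomy - not actual netfs or 9p
code; the type-test helpers are the ones from <linux/uio.h>:)
```
/* What a transport can legitimately do depends on the iterator type: */
if (user_backed_iter(iter)) {
	/* ITER_UBUF/ITER_IOVEC: should not reach the filesystem via
	 * netfslib; pages may be pinned via iov_iter_extract_pages() */
} else if (iov_iter_is_bvec(iter) || iov_iter_is_kvec(iter) ||
	   iov_iter_is_folioq(iter)) {
	/* kernel-backed: netfs (or the caller) owns the lifetime; walk
	 * the memory in place and take no pins or refs */
}
```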
> I had a bit of a hard time reading through it, but I'd expect that any page
> pinning would be done in netfs and not below it?
Page pinning is done by netfs_extract_user_iter() calling
iov_iter_extract_pages() - but only for user-backed iterators. The network
layer needs a way to be told how to handle these correctly (which it
currently lacks).
Kernel-backed pages may not be pinned or ref'd. They can be buffered instead,
but pinning and ref-taking are not permitted for, say, kmalloc'd buffers.
Instead, we need to use a callback from the network layer to indicate
completion - and the network layer needs to change to not ref the pages.
(Note that "pinning" != "ref-taking", thanks to GUP terminology.)
> Why are we using iov_iters here and not something like a bio_vec?
Because your idea of vmalloc'ing a big bio_vec[] and copying it repeatedly in
order to stick bits on the ends is not good. I have a relatively simple
solution I'm working on - and mostly have working - but the act of allocating
and transcribing into a bio_vec[] incurs a noticeable performance penalty :-/
This will hopefully allow me to phase out ITER_FOLIOQ, but I will still need a
folio list inside netfs: firstly, because we may have to split a folio across
multiple RPC ops of different sizes to multiple devices in parallel; and
secondly, because as Willy's plans unfold, folio structs will no longer be
colocated with page structs, and page structs will be dynamically allocated
and looked up in some sort of tree instead of a flat array - which means going
from a physaddr in a bio_vec[] to a struct folio will suck when it comes to
cleaning up the page flags after I/O.
> What is the fs / transport supported to do with these iters?
Pass them to sendmsg() or recvmsg(). That's what they currently do.
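(Roughly this shape, as a sketch - assumed names, not the actual 9p
transport code:)
```
/* The slice netfs hands down is wrapped in a msghdr; the socket layer
 * walks the iterator itself and no pinning happens at this level. */
struct msghdr msg = { .msg_iter = *iter };
int ret = sock_recvmsg(sock, &msg, MSG_WAITALL);
```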
David
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-15 7:34 ` Dominique Martinet
2025-12-15 11:16 ` Christian Schoenebeck
2025-12-15 14:37 ` Christoph Hellwig
@ 2025-12-19 12:03 ` David Howells
2 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2025-12-19 12:03 UTC (permalink / raw)
To: Dominique Martinet
Cc: dhowells, Christoph Hellwig, Eric Van Hensbergen,
Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel,
Matthew Wilcox, linux-fsdevel, Chris Arges
Dominique Martinet <asmadeus@codewreck.org> wrote:
> FWIW I tried logging and saw ITER_BVEC, ITER_KVEC and ITER_FOLIOQ --
> O_DIRECT writes are seen as BVEC, so I guess they're not as direct as I
> expected them to be -- that code could very well be leftovers from
> the switch to iov_iter back in 2015...
ITER_FOLIOQ should only be seen from buffered I/O. For unbuffered/direct I/O,
user-backed writes are extracted to ITER_BVEC; kernel-backed writes are
sliced, but the original iter type comes down. For the moment.
David
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
2025-12-17 13:41 ` Christian Schoenebeck
2025-12-17 21:49 ` asmadeus
@ 2025-12-19 12:06 ` David Howells
1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2025-12-19 12:06 UTC (permalink / raw)
To: Christian Schoenebeck
Cc: dhowells, Eric Van Hensbergen, Latchesar Ionkov, asmadeus,
Chris Arges, v9fs, linux-kernel, Matthew Wilcox, linux-fsdevel
Christian Schoenebeck <linux_oss@crudebyte.com> wrote:
> > - if (!iov_iter_is_kvec(data)) {
> > + if (user_backed_iter(data)) {
That looks wrong. If the I/O is coming through netfslib, you should never see
ITER_UBUF or ITER_IOVEC in the filesystem.
David
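(For context, a minimal sketch of the shape of the hunk under
discussion - not the full p9_get_mapped_pages(); the helpers are the
ones from <linux/uio.h>:)
```
/* Pin pages only for user-backed iterators; anything coming down
 * through netfslib (BVEC/KVEC/FOLIOQ) is kernel-backed and must not be
 * fed to iov_iter_get_pages_alloc2(), which takes page references. */
size_t offs;
ssize_t n;

if (user_backed_iter(data)) {
	n = iov_iter_get_pages_alloc2(data, pages, count, &offs);
	if (n < 0)
		return n;
	*need_drop = 1;	/* release the page refs after the I/O */
} else {
	/* kernel-backed: translate the buffer addresses directly,
	 * taking no pins or refs */
}
```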
* Re: 9p read corruption of mmaped content (Was: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec)
2025-12-19 11:53 ` Dominique Martinet
@ 2025-12-19 13:46 ` Dominique Martinet
0 siblings, 0 replies; 24+ messages in thread
From: Dominique Martinet @ 2025-12-19 13:46 UTC (permalink / raw)
To: David Howells, Christian Schoenebeck
Cc: Eric Van Hensbergen, Latchesar Ionkov, Chris Arges, v9fs,
linux-kernel, Matthew Wilcox, linux-fsdevel
David pointed this out on IRC (thanks!):
> you may have actually caught the bug in your trace
> kworker/u16:2-233 [002] ..... 3031.183459: netfs_folio: i=157f3 ix=00005-00005 read-unlock
> ...
> clang-318 [002] ..... 3031.183462: netfs_sreq: R=00001b55[2] ZERO SUBMT f=000
> s=5fb2 0/4e s=0 e=0
> we shouldn't have unlocked page 5 yet
> we still have to clear the tail of the page
And indeed, if I update the read_collect code to not unlock as soon as
we've read up to i_size, but to wait for the tail to be cleared as
follows, I can't reproduce anymore:
---------
diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index a95e7aadafd0..7a0ffa675fb1 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -137,7 +137,7 @@ static void netfs_read_unlock_folios(struct netfs_io_request *rreq,
rreq->front_folio_order = order;
fsize = PAGE_SIZE << order;
fpos = folio_pos(folio);
- fend = umin(fpos + fsize, rreq->i_size);
+ fend = fpos + fsize;
trace_netfs_collect_folio(rreq, folio, fend, collected_to);
---------
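(To spell out the effect with the sizes from the trace: i_size = 0x5fb2
and the read covers six pages, 0x6000 bytes. The old code computed
fend = umin(0x6000, 0x5fb2) = 0x5fb2, so the collector treated folio 5
as fully collected - and unlocked it - as soon as the DOWN subreq
reached i_size, before the ZERO subreq had filled [0x5fb2, 0x6000).
With fend = fpos + fsize = 0x6000, the unlock also has to wait for the
tail-clearing subreq.)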
It's too late for me to think through the implications and consider this
diff's correctness, but with that patch the kernel is happily building
with LLVM=1, so I'm fairly confident this particular bug goes away...
(Now, given that the umin() dates back to v6.14-rc1 (commit e2d46f2ec332
("netfs: Change the read result collector to only use one work item")), I
guess I'll next want to check whether it's been broken this long, or if
it's a side effect of another change... Tomorrow...)
I'm off to bed; feel free to send a patch with the above if you think
it's correct.
--
Dominique Martinet | Asmadeus
* Re: 9p read corruption of mmaped content (Was: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec)
2025-12-19 4:09 ` 9p read corruption of mmaped content (Was: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec) Dominique Martinet
2025-12-19 11:53 ` Dominique Martinet
@ 2025-12-19 14:01 ` David Howells
1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2025-12-19 14:01 UTC (permalink / raw)
To: Dominique Martinet
Cc: dhowells, Christian Schoenebeck, Eric Van Hensbergen,
Latchesar Ionkov, Chris Arges, v9fs, linux-kernel, Matthew Wilcox,
linux-fsdevel
Dominique Martinet <asmadeus@codewreck.org> wrote:
> netfs_collect_folio: R=00001b55 ix=00003 r=3000-4000 t=3000/5fb2
> netfs_folio: i=157f3 ix=00003-00003 read-done
> netfs_folio: i=157f3 ix=00003-00003 read-unlock
> netfs_collect_folio: R=00001b55 ix=00004 r=4000-5000 t=4000/5fb2
> netfs_folio: i=157f3 ix=00004-00004 read-done
> netfs_folio: i=157f3 ix=00004-00004 read-unlock
> netfs_collect_folio: R=00001b55 ix=00005 r=5000-5fb2 t=5000/5fb2
> netfs_folio: i=157f3 ix=00005-00005 read-done
> netfs_folio: i=157f3 ix=00005-00005 read-unlock
> ...
> netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff
> netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=c
> netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff
> netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=8
> ...
> netfs_sreq: R=00001b55[2] ZERO SUBMT f=000 s=5fb2 0/4e s=0 e=0
> netfs_sreq: R=00001b55[2] ZERO TERM f=102 s=5fb2 4e/4e s=5 e=0
This would seem to show a problem, if not the problem.
We unlocked page ix=00005 before doing the ZERO subreq that clears the page
tail. That shouldn't have happened, since the collection point hadn't reached
the end of the folio yet.
David