* [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Dominique Martinet via B4 Relay @ 2025-12-09 21:04 UTC
To: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck
Cc: v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges, Dominique Martinet

From: Dominique Martinet <asmadeus@codewreck.org>

When doing a loop mount of a filesystem over 9p, read requests can come
from unexpected places and blow up, as reported by Chris Arges with this
reproducer:
```
dd if=/dev/zero of=./xfs.img bs=1M count=300
yes | mkfs.xfs -b size=8192 ./xfs.img
rm -rf ./mount && mkdir -p ./mount
mount -o loop ./xfs.img ./mount
```

The problem is that iov_iter_get_pages_alloc2() apparently cannot be
called on folios (as illustrated by the backtrace below), so limit what
iov we can pin from !iov_iter_is_kvec() to user_backed_iter().

Full backtrace:
```
[ 31.276957][ T255] loop0: detected capacity change from 0 to 614400
[ 31.286377][ T255] XFS (loop0): EXPERIMENTAL large block size feature enabled. Use at your own risk!
[ 31.286624][ T255] XFS (loop0): Mounting V5 Filesystem fa3c2d3c-b936-4ee3-a5a8-e80ba36298cc
[ 31.395721][ T62] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x102600
[ 31.395833][ T62] head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 31.395915][ T62] flags: 0x2ffff800000040(head|node=0|zone=2|lastcpupid=0x1ffff)
[ 31.395976][ T62] page_type: f8(unknown)
[ 31.396004][ T62] raw: 002ffff800000040 0000000000000000 dead000000000122 0000000000000000
[ 31.396092][ T62] raw: 0000000000000000 0000000000000000 00000000f8000000 0000000000000000
[ 31.396174][ T62] head: 002ffff800000040 0000000000000000 dead000000000122 0000000000000000
[ 31.396251][ T62] head: 0000000000000000 0000000000000000 00000000f8000000 0000000000000000
[ 31.396339][ T62] head: 002ffff800000009 ffffea0004098001 00000000ffffffff 00000000ffffffff
[ 31.396425][ T62] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000200
[ 31.396523][ T62] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
[ 31.396641][ T62] ------------[ cut here ]------------
[ 31.396689][ T62] kernel BUG at include/linux/mm.h:1386!
[ 31.396748][ T62] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 31.396820][ T62] CPU: 4 UID: 0 PID: 62 Comm: kworker/u32:1 Not tainted 6.18.0-rc7-cloudflare-2025.11.11-21-gab0ed6ff #1 PREEMPT(voluntary)
[ 31.396947][ T62] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 2025.02-8 05/13/2025
[ 31.397031][ T62] Workqueue: loop0 loop_rootcg_workfn
[ 31.397084][ T62] RIP: 0010:__iov_iter_get_pages_alloc+0x7b6/0x920
[ 31.397152][ T62] Code: 08 4c 89 5d 10 44 88 55 20 e9 0d fb ff ff 0f 0b 4d 85 ed 0f 85 fc fb ff ff e9 38 fd ff ff 48 c7 c6 20 88 6d 83 e8 fa 2f b7 ff <0f> 0b 31 f6 b9 c0 0c 00 00 ba 01 00 00 00 4c 89 0c 24 48 8d 3c dd
[ 31.397310][ T62] RSP: 0018:ffffc90000257908 EFLAGS: 00010246
[ 31.397365][ T62] RAX: 000000000000005c RBX: 0000000000000020 RCX: 0000000000000003
[ 31.397424][ T62] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff83f38508
[ 31.397498][ T62] RBP: ffff888101af90f8 R08: 0000000000000000 R09: ffffc900002577a0
[ 31.397571][ T62] R10: ffffffff83f084c8 R11: 0000000000000003 R12: 0000000000020000
[ 31.397654][ T62] R13: ffffc90000257a70 R14: ffffc90000257a68 R15: ffffea0004098000
[ 31.397727][ T62] FS: 0000000000000000(0000) GS:ffff8882b3266000(0000) knlGS:0000000000000000
[ 31.397819][ T62] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 31.397890][ T62] CR2: 00007f846eb985a0 CR3: 0000000004620003 CR4: 0000000000772ef0
[ 31.397964][ T62] PKRU: 55555554
[ 31.398005][ T62] Call Trace:
[ 31.398045][ T62] <TASK>
[ 31.398075][ T62] ? kvm_sched_clock_read+0x11/0x20
[ 31.398131][ T62] ? sched_clock+0x10/0x30
[ 31.398179][ T62] ? sched_clock_cpu+0xf/0x1d0
[ 31.398234][ T62] iov_iter_get_pages_alloc2+0x20/0x50
[ 31.398277][ T62] p9_get_mapped_pages.part.0.constprop.0+0x6f/0x280 [9pnet_virtio]
[ 31.398354][ T62] ? p9pdu_vwritef+0xe0/0x6e0 [9pnet]
[ 31.398413][ T62] ? pdu_write+0x2d/0x40 [9pnet]
[ 31.398464][ T62] p9_virtio_zc_request+0x92/0x69a [9pnet_virtio]
[ 31.398530][ T62] ? p9pdu_vwritef+0xe0/0x6e0 [9pnet]
[ 31.398582][ T62] ? p9pdu_finalize+0x32/0x90 [9pnet]
[ 31.398620][ T62] ? p9_client_prepare_req+0xbe/0x150 [9pnet]
[ 31.398693][ T62] p9_client_zc_rpc.constprop.0+0xf4/0x2f0 [9pnet]
[ 31.398768][ T62] ? p9_client_xattrwalk+0x148/0x1d0 [9pnet]
[ 31.398840][ T62] p9_client_write+0x16a/0x240 [9pnet]
[ 31.398887][ T62] ? __kmalloc_cache_noprof+0x2f3/0x5a0
[ 31.398939][ T62] v9fs_issue_write+0x3a/0x80 [9p]
[ 31.399002][ T62] netfs_advance_write+0xd3/0x2b0 [netfs]
[ 31.399069][ T62] netfs_unbuffered_write+0x66/0xb0 [netfs]
[ 31.399131][ T62] netfs_unbuffered_write_iter_locked+0x1cd/0x220 [netfs]
[ 31.399202][ T62] netfs_unbuffered_write_iter+0x100/0x1d0 [netfs]
[ 31.399265][ T62] lo_rw_aio.isra.0+0x2e7/0x330
[ 31.399321][ T62] loop_process_work+0x86/0x420
[ 31.399380][ T62] process_one_work+0x192/0x350
[ 31.399434][ T62] worker_thread+0x2d3/0x400
[ 31.399493][ T62] ? __pfx_worker_thread+0x10/0x10
[ 31.399559][ T62] kthread+0xfc/0x240
[ 31.399605][ T62] ? __pfx_kthread+0x10/0x10
[ 31.399660][ T62] ? _raw_spin_unlock+0xe/0x30
[ 31.399711][ T62] ? finish_task_switch.isra.0+0x8d/0x280
[ 31.399764][ T62] ? __pfx_kthread+0x10/0x10
[ 31.399820][ T62] ? __pfx_kthread+0x10/0x10
[ 31.399878][ T62] ret_from_fork+0x113/0x130
[ 31.399931][ T62] ? __pfx_kthread+0x10/0x10
[ 31.399992][ T62] ret_from_fork_asm+0x1a/0x30
[ 31.400050][ T62] </TASK>
[ 31.400088][ T62] Modules linked in: kvm_intel kvm irqbypass aesni_intel rapl i2c_piix4 i2c_smbus tiny_power_button button configfs virtio_mmio virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio_console 9pnet_virtio virtiofs virtio virtio_ring fuse 9p 9pnet netfs
[ 31.400365][ T62] ---[ end trace 0000000000000000 ]---
[ 31.405087][ T62] RIP: 0010:__iov_iter_get_pages_alloc+0x7b6/0x920
[ 31.405166][ T62] Code: 08 4c 89 5d 10 44 88 55 20 e9 0d fb ff ff 0f 0b 4d 85 ed 0f 85 fc fb ff ff e9 38 fd ff ff 48 c7 c6 20 88 6d 83 e8 fa 2f b7 ff <0f> 0b 31 f6 b9 c0 0c 00 00 ba 01 00 00 00 4c 89 0c 24 48 8d 3c dd
[ 31.405281][ T62] RSP: 0018:ffffc90000257908 EFLAGS: 00010246
[ 31.405328][ T62] RAX: 000000000000005c RBX: 0000000000000020 RCX: 0000000000000003
[ 31.405383][ T62] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff83f38508
[ 31.405456][ T62] RBP: ffff888101af90f8 R08: 0000000000000000 R09: ffffc900002577a0
[ 31.405516][ T62] R10: ffffffff83f084c8 R11: 0000000000000003 R12: 0000000000020000
[ 31.405593][ T62] R13: ffffc90000257a70 R14: ffffc90000257a68 R15: ffffea0004098000
[ 31.405665][ T62] FS: 0000000000000000(0000) GS:ffff8882b3266000(0000) knlGS:0000000000000000
[ 31.405730][ T62] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 31.405774][ T62] CR2: 00007f846eb985a0 CR3: 0000000004620004 CR4: 0000000000772ef0
[ 31.405837][ T62] PKRU: 55555554
[ 31.434509][ C4] ------------[ cut here ]------------
```

Reported-by: Chris Arges <carges@cloudflare.com>
Closes: https://lkml.kernel.org/r/aSR-C4ahmNRoUV58@861G6M3
Tested-by: Chris Arges <carges@cloudflare.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Suggested-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---
This is the patch Chris tested in the linked thread, as-is.

I'll admit I still don't really understand how this even works: the else
branch of the trans_virtio patch assumes it can use data->kvec->iov_base,
so it shouldn't work with a folio-backed iov?!
(Also: why does iov_iter_get_pages_alloc2() explicitly implement support
for folios if it's just to blow up later?)
... but if it worked for Chris I guess that's good enough for now?

I'm still surprised I can't reproduce this; I'll try to play with the
backing 9p mount options and check these IOs are done properly before
sending to Linus, but any feedback is welcome until then.
---
 net/9p/trans_virtio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 10c2dd48643818907f4370243eb971fceba4d40b..f7ee1f864b03a59568510eb0dd3496bd05b3b8d6 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -318,7 +318,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 	if (!iov_iter_count(data))
 		return 0;
 
-	if (!iov_iter_is_kvec(data)) {
+	if (user_backed_iter(data)) {
 		int n;
 		/*
 		 * We allow only p9_max_pages pinned. We wait for the

---
base-commit: 3e281113f871d7f9c69ca55a4d806a72180b7e8a
change-id: 20251210-virtio_trans_iter-5973892db2e3

Best regards,
-- 
Dominique Martinet <asmadeus@codewreck.org>

* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Matthew Wilcox @ 2025-12-10 4:21 UTC
To: asmadeus
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel, David Howells, linux-fsdevel, Chris Arges

On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
> The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> called on folios (as illustrated by the backtrace below), so limit what
> iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
>
> Full backtrace:
> ```
> [ 31.395721][ T62] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x102600
> [ 31.395833][ T62] head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> [ 31.395915][ T62] flags: 0x2ffff800000040(head|node=0|zone=2|lastcpupid=0x1ffff)
> [ 31.395976][ T62] page_type: f8(unknown)

This _isn't_ a folio.  It's a kmalloc allocation.  It's a very large
kmalloc allocation (something between 1024 * 1024 + 1 and 2048 * 1024
bytes inclusive).

You can _only_ use iov_iter_get_pages_alloc2() with folios (or other
struct pages that have a refcount, but I'm not sure how many of those
there really are; we're removing refcounts from various types of pages).

* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Christoph Hellwig @ 2025-12-10 6:04 UTC
To: asmadeus
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
> From: Dominique Martinet <asmadeus@codewreck.org>
>
> When doing a loop mount of a filesystem over 9p, read requests can come
> from unexpected places and blow up as reported by Chris Arges with this
> reproducer:
> ```
> dd if=/dev/zero of=./xfs.img bs=1M count=300
> yes | mkfs.xfs -b size=8192 ./xfs.img
> rm -rf ./mount && mkdir -p ./mount
> mount -o loop ./xfs.img ./mount

We should really wire this up to xfstests so that all file systems see
the pattern of kmalloc allocations passed into the block layer and then
on to the direct I/O code.

> The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> called on folios (as illustrated by the backtrace below), so limit what
> iov we can pin from !iov_iter_is_kvec() to user_backed_iter()

As willy pointed out this is a kmalloc.

And 9p (just like NFS) really needs to switch away from
iov_iter_get_pages_alloc2 to iov_iter_extract_pages, which handles not
just this perfectly fine but also fixes various other issues.

Note that the networking code still wants special treatment for kmalloc
pages, so you might have more work there.

* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: asmadeus @ 2025-12-10 7:38 UTC
To: Christoph Hellwig
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

Christoph Hellwig wrote on Tue, Dec 09, 2025 at 10:04:30PM -0800:
> On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
> > From: Dominique Martinet <asmadeus@codewreck.org>
> >
> > When doing a loop mount of a filesystem over 9p, read requests can come
> > from unexpected places and blow up as reported by Chris Arges with this
> > reproducer:
> > ```
> > dd if=/dev/zero of=./xfs.img bs=1M count=300
> > yes | mkfs.xfs -b size=8192 ./xfs.img
> > rm -rf ./mount && mkdir -p ./mount
> > mount -o loop ./xfs.img ./mount
>
> We should really wire this up to xfstests so that all file systems
> see the pattern of kmalloc allocations passed into the block layer
> and then on to the direct I/O code.

Note this doesn't seem to reproduce on my test VM so I'm not sure what
kind of precondition there is to going through this code...

> > The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> > called on folios (as illustrated by the backtrace below), so limit what
> > iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
>
> As willy pointed out this is a kmalloc.

Ok, I got confused because of the VM_BUG_ON_FOLIO(), but looking back it's
in a folio_get() called directly from __iov_iter_get_pages_alloc(), so
that was likely a bvec...

My points of "but there's a case for it in __iov_iter_get_pages_alloc()"
and "we have no idea what to do" still stand though, but you answered
that below:

> And 9p (just like NFS) really needs to switch away from
> iov_iter_get_pages_alloc2 to iov_iter_extract_pages, which handles not
> just this perfectly fine but also fixes various other issues.

Ok, so we can remove the special branch for kvec and just extract pages
with this.

I understand it pins user-space pages, so there's no risk of them moving
under us during the IO, and there's nothing else we need to do about it?

Looking at the implementation for iov_iter_extract_bvec_pages() it looks
like it might not process all the way to the end, so we need to loop on
calling iov_iter_extract_pages()? (I see networking code looping on
"while (iter->count > 0)")

I'll send a v2 with that when I can.

While I have your attention, there's some work to move away from large
(>1MB) kmalloc() in the non-zerocopy case into kvmalloc() that might not
be contiguous (see commit e21d451a82f3 ("9p: Use kvmalloc for message
buffers on supported transports") that basically only did that for
trans_fd); there's no iov_iter involved so it's off topic, but how would
one get around "extracting pages" out of that?

> Note that the networking code still wants special treatment for kmalloc
> pages, so you might have more work there.

I *think* we're fine on this end, as it's just passing the buffers into
an sg list for virtio; as long as things don't move under the caller I
assume they don't care...

Thanks,
-- 
Dominique Martinet | Asmadeus

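
For reference, a minimal sketch of the extraction loop discussed above --
not code from this thread; the helper name and the bookkeeping are
illustrative only, and it assumes the caller pre-allocated a page array of
at least maxpages entries:

```
#include <linux/mm.h>
#include <linux/uio.h>

/*
 * Illustrative sketch: drain an iov_iter with iov_iter_extract_pages().
 * A single call may stop early (e.g. at a bvec/kvec/folioq segment
 * boundary), so loop until the iterator is empty or the array is full.
 */
static ssize_t sketch_extract_pages(struct iov_iter *iter, struct page **pages,
				    unsigned int maxpages, size_t *offset0)
{
	unsigned int npages = 0;
	ssize_t len, total = 0;

	while (iov_iter_count(iter) && npages < maxpages) {
		struct page **p = pages + npages;	/* fill in place */
		size_t off;

		len = iov_iter_extract_pages(iter, &p, iov_iter_count(iter),
					     maxpages - npages, 0, &off);
		if (len <= 0)
			return total ? total : len;
		if (!total)
			*offset0 = off;	/* offset into the very first page */
		npages += DIV_ROUND_UP(off + len, PAGE_SIZE);
		total += len;
	}
	return total;
}
```

Whatever gets extracted this way still has to be released on completion,
which is what the unpin discussion further down is about.
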
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Christoph Hellwig @ 2025-12-10 8:32 UTC
To: asmadeus
Cc: Christoph Hellwig, Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

On Wed, Dec 10, 2025 at 04:38:02PM +0900, asmadeus@codewreck.org wrote:
> Christoph Hellwig wrote on Tue, Dec 09, 2025 at 10:04:30PM -0800:
> > On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
> > > From: Dominique Martinet <asmadeus@codewreck.org>
> > >
> > > When doing a loop mount of a filesystem over 9p, read requests can come
> > > from unexpected places and blow up as reported by Chris Arges with this
> > > reproducer:
> > > ```
> > > dd if=/dev/zero of=./xfs.img bs=1M count=300
> > > yes | mkfs.xfs -b size=8192 ./xfs.img
> > > rm -rf ./mount && mkdir -p ./mount
> > > mount -o loop ./xfs.img ./mount
> >
> > We should really wire this up to xfstests so that all file systems
> > see the pattern of kmalloc allocations passed into the block layer
> > and then on to the direct I/O code.
>
> Note this doesn't seem to reproduce on my test VM so I'm not sure what
> kind of precondition there is to going through this code...

In general the best way to get XFS to issue kmalloc I/O is to use a file
system sector size smaller than the page size, i.e. something like:

mkfs.xfs -s 512 -b 4096 -f

The above would do that job when run on a large page size system like
typical arm64 configs.

> > And 9p (just like NFS) really needs to switch away from
> > iov_iter_get_pages_alloc2 to iov_iter_extract_pages, which handles not
> > just this perfectly fine but also fixes various other issues.
>
> Ok, so we can remove the special branch for kvec and just extract pages
> with this.

Yes.

> I understand it pins user-space pages, so there's no risk of them moving
> under us during the IO, and there's nothing else we need to do about it?

Yes, unlike iov_iter_get_pages_alloc2 which gets a reference and doesn't
pin.

> Looking at the implementation for iov_iter_extract_bvec_pages() it looks
> like it might not process all the way to the end, so we need to loop on
> calling iov_iter_extract_pages()? (I see networking code looping on
> "while (iter->count > 0)")

Yes.

> I'll send a v2 with that when I can.

You looked into this already, but in case you haven't seen it yet, don't
forget to call unpin_user_folio on the completion side as well.

> While I have your attention, there's some work to move away from large
> (>1MB) kmalloc() in the non-zerocopy case into kvmalloc() that might not
> be contiguous (see commit e21d451a82f3 ("9p: Use kvmalloc for message
> buffers on supported transports") that basically only did that for
> trans_fd); there's no iov_iter involved so it's off topic, but how would
> one get around "extracting pages" out of that?

vmalloc and I/O is in general problematic, because you need special calls
to flush the cache for VIVT caches: flush_kernel_vmap_range /
invalidate_kernel_vmap_range.

If you want to stuff that into an iov_iter, the only sane way is to call
vmalloc_to_page and build a bvec iter from that at the moment.  You also
need to do the flush_kernel_vmap_range / invalidate_kernel_vmap_range in
the caller for it.

> > Note that the networking code still wants special treatment for kmalloc
> > pages, so you might have more work there.
>
> I *think* we're fine on this end, as it's just passing the buffers into
> an sg list for virtio; as long as things don't move under the caller I
> assume they don't care...

Ok, if your backend is virtio that's fine.  If it's actual 9p over the
network you might still run into issues, though.

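
A rough sketch of the vmalloc_to_page() + bvec approach Christoph
describes here; purely illustrative (the helper name, the error handling,
and the assumption of a page-aligned buffer are mine, not from the
thread):

```
#include <linux/bvec.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/uio.h>
#include <linux/vmalloc.h>

/*
 * Illustrative only: describe a page-aligned vmalloc/kvmalloc buffer with
 * a bvec-backed iov_iter, flushing the kernel vmap range so VIVT caches
 * stay coherent around the I/O.
 */
static int sketch_vmalloc_to_bvec_iter(void *buf, size_t len,
				       struct bio_vec *bv, unsigned int max_segs,
				       struct iov_iter *iter, unsigned int direction)
{
	unsigned int i, nr = DIV_ROUND_UP(len, PAGE_SIZE);

	if (!is_vmalloc_addr(buf) || nr > max_segs)
		return -EINVAL;	/* linear kmalloc buffers don't need any of this */

	for (i = 0; i < nr; i++) {
		struct page *page = vmalloc_to_page(buf + i * PAGE_SIZE);

		if (!page)
			return -EFAULT;
		bvec_set_page(&bv[i], page,
			      min_t(size_t, len - i * PAGE_SIZE, PAGE_SIZE), 0);
	}

	/*
	 * Device is about to read the buffer; after a device write into it,
	 * call invalidate_kernel_vmap_range(buf, len) on completion instead.
	 */
	flush_kernel_vmap_range(buf, len);

	/* direction is ITER_SOURCE (data goes to the device) or ITER_DEST */
	iov_iter_bvec(iter, direction, bv, nr, len);
	return 0;
}
```
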
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: asmadeus @ 2025-12-13 13:28 UTC
To: Christoph Hellwig
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

Christoph Hellwig wrote on Wed, Dec 10, 2025 at 12:32:41AM -0800:
> > Looking at the implementation for iov_iter_extract_bvec_pages() it looks
> > like it might not process all the way to the end, so we need to loop on
> > calling iov_iter_extract_pages()? (I see networking code looping on
> > "while (iter->count > 0)")
>
> Yes.

Ok, I don't understand why the current code locks everything down and
wants to use a single scatterlist shared for the whole channel (and
capped to 128 pages?); it should only need to lock around the
virtqueue_add_sg() call. I'll need to play with that some more.

Looking at other virtio drivers I could probably use a sg_table and have
extract_iter_to_sg() do all the work for us...

> > I'll send a v2 with that when I can.
>
> You looked into this already, but in case you haven't seen it yet,
> don't forget to call unpin_user_folio on the completion side as well.

Thanks for bringing this up -- I'm not sure I understand how to decide
what to clean up with... Depending on the type of pages it's not
necessarily unpin_user_folio(), is it? Or will it just ignore anything
else?

Ah, the caller must check iov_iter_extract_will_pin(iter), and then it
looks like I'm free to call unpin_user_page() (or unpin_folio(), which is
the same without the pinned check). unpin_user_folio() would allow
unpinning multiple pages within a folio, but iov_iter_extract_pages()
doesn't give us folios that could be used like that.

FWIW, the comment at the top of extract_iter_to_sg() says:
> The iov_iter_extract_mode() function should be used to query how cleanup
but I couldn't find any such function (even back when this comment was
added to netfs code in 2023...); the two copies of this comment probably
could use updating... David?

> > While I have your attention, there's some work to move away from large
> > (>1MB) kmalloc() in the non-zerocopy case into kvmalloc() that might not
> > be contiguous (see commit e21d451a82f3 ("9p: Use kvmalloc for message
> > buffers on supported transports") that basically only did that for
> > trans_fd); there's no iov_iter involved so it's off topic, but how would
> > one get around "extracting pages" out of that?
>
> vmalloc and I/O is in general problematic, because you need special
> calls to flush the cache for VIVT caches:
> flush_kernel_vmap_range / invalidate_kernel_vmap_range.
>
> If you want to stuff that into an iov_iter, the only sane way is to
> call vmalloc_to_page and build a bvec iter from that at the moment.
> You also need to do the flush_kernel_vmap_range /
> invalidate_kernel_vmap_range in the caller for it.

Thanks, this is on the virtio/network side so I don't need an iov_iter,
but I agree virtio will not like vmap addresses either -- I'd want to
resolve and pin the backing pages and pass those to a scatterlist like
we're doing here...

Looking at extract_kvec_to_sg() it's just calling vmalloc_to_page() on
the kvec address if it's a vmalloc address, so no pin -- so I guess a pin
might not be needed? I was under the impression that vmalloc meant the
kernel could shuffle the physical data under us when we're not scheduled,
but I'm really not familiar enough with these memory management details;
let's focus on the bug at hand first... I'll (eventually) poke back at
this after we're all happy with the zero-copy side.

> > I *think* we're fine on this end, as it's just passing the buffers into
> > an sg list for virtio; as long as things don't move under the caller I
> > assume they don't care...
>
> Ok, if your backend is virtio that's fine.  If it's actual 9p over the
> network you might still run into issues, though.

TCP and other transports don't do zero-copy straight into the iov_iter,
so virtio is our only problem right now -- thankfully! :)

Thanks,
-- 
Dominique Martinet | Asmadeus

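
A tiny sketch of the completion-side rule being worked out here
(illustrative only; the helper name is made up): only user-backed
iterators are pinned by iov_iter_extract_pages(), so gate the release on
iov_iter_extract_will_pin().

```
#include <linux/mm.h>
#include <linux/uio.h>

/*
 * Illustrative only: release pages obtained via iov_iter_extract_pages().
 * User-backed iters were pinned (FOLL_PIN) and must be unpinned; pages
 * extracted from kvec/bvec/folioq iters took no reference, so there is
 * nothing to drop for them.
 */
static void sketch_put_extracted_pages(const struct iov_iter *iter,
				       struct page **pages, unsigned int npages)
{
	if (iov_iter_extract_will_pin(iter))
		unpin_user_pages(pages, npages);
}
```

In practice the transport would probably record the result of
iov_iter_extract_will_pin() at submission time instead of keeping the
iterator around until the request completes.
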
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Christoph Hellwig @ 2025-12-15 5:55 UTC
To: asmadeus
Cc: Christoph Hellwig, Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

[Dave: netfs questions below, please read]

On Sat, Dec 13, 2025 at 10:28:50PM +0900, asmadeus@codewreck.org wrote:
> Christoph Hellwig wrote on Wed, Dec 10, 2025 at 12:32:41AM -0800:
> > > Looking at the implementation for iov_iter_extract_bvec_pages() it looks
> > > like it might not process all the way to the end, so we need to loop on
> > > calling iov_iter_extract_pages()? (I see networking code looping on
> > > "while (iter->count > 0)")
> >
> > Yes.
>
> Ok, I don't understand why the current code locks everything down and
> wants to use a single scatterlist shared for the whole channel (and
> capped to 128 pages?); it should only need to lock around the
> virtqueue_add_sg() call. I'll need to play with that some more.

What do you mean with "lock down"?

> Looking at other virtio drivers I could probably use a sg_table and have
> extract_iter_to_sg() do all the work for us...

Looking at the code I'm actually really confused.  Both because I
actually thought we were talking about the 9fs direct I/O code, but that
has actually been removed / converted to netfs a long time ago.

But even more so what the net/9p code is actually doing...  How do we
even end up with user addresses here at all?

Let me try to understand things:

 - p9_virtio_zc_request is the only instance of the p9_trans_module
   zc_request operation.
 - zc_request only gets called by p9_client_zc_rpc
 - p9_client_zc_rpc gets called by p9_client_read_once, p9_client_write,
   p9_client_write_subreq and p9_client_readdir

Let's go through these:

 - p9_client_write_subreq is entirely unused
 - p9_client_readdir builds a local iov_iter_kvec
 - p9_client_read_once is only called by p9_client_read, and really
   should be marked static.
 - p9_client_read is called by v9fs_issue_read on a netfs iov_iter and by
   v9fs_dir_readdir and v9fs_fid_xattr_get on a local kvec iter
 - p9_client_write is called with an iov_iter_kvec from
   v9fs_fid_xattr_set, and with a netfs-issued iov_iter by
   v9fs_issue_write

So right now except for netfs everything is on a kvec.  Dave, what kind
of iov_iter does netfs send down to the file system?  I had a bit of a
hard time reading through it, but I'd expect that any page pinning would
be done in netfs and not below it?  Why are we using iov_iters here and
not something like a bio_vec?  What is the fs / transport supposed to do
with these iters?

Ignoring the rest of the mail for now, because I suspect the outcome of
the above might make it irrelevant, but I'll come back to it if needed.

* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Dominique Martinet @ 2025-12-15 7:34 UTC
To: Christoph Hellwig
Cc: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

Thanks for having a look

Christoph Hellwig wrote on Sun, Dec 14, 2025 at 09:55:12PM -0800:
> > Ok, I don't understand why the current code locks everything down and
> > wants to use a single scatterlist shared for the whole channel (and
> > capped to 128 pages?); it should only need to lock around the
> > virtqueue_add_sg() call. I'll need to play with that some more.
>
> What do you mean with "lock down"?

Just the odd (to me) use of the chan->lock around basically all of
p9_virtio_request() and most of p9_virtio_zc_request() -- I'm pretty sure
this was just the author trying to avoid an allocation by recycling the
chan->sg array, though, so ignore this.

> > Looking at other virtio drivers I could probably use a sg_table and have
> > extract_iter_to_sg() do all the work for us...
>
> Looking at the code I'm actually really confused.  Both because I
> actually thought we were talking about the 9fs direct I/O code, but that
> has actually been removed / converted to netfs a long time ago.
>
> But even more so what the net/9p code is actually doing...  How do we
> even end up with user addresses here at all?

FWIW I tried logging and saw ITER_BVEC, ITER_KVEC and ITER_FOLIOQ --
O_DIRECT writes are seen as BVEC, so I guess it's not as direct as I
expected them to be -- that code could very well be leftovers from the
switch to iov_iter back in 2015...

(I'm actually not sure why Christian suggested checking for is_iovec() in
https://lkml.kernel.org/r/2245723.irdbgypaU6@weasel -- then I generalized
it to user_backed_iter() and it just worked, because checking for that
moved bvec and folioq out of iov_iter_get_pages_alloc2() to... something
that obviously should not work in my opinion, but apparently was enough
to not trigger this particular BUG.)

> Let me try to understand things:
>
>  - p9_virtio_zc_request is the only instance of the p9_trans_module
>    zc_request operation.
>  - zc_request only gets called by p9_client_zc_rpc
>  - p9_client_zc_rpc gets called by p9_client_read_once, p9_client_write,
>    p9_client_write_subreq and p9_client_readdir
>
> Let's go through these:
>
>  - p9_client_write_subreq is entirely unused

Let's remove that... I'll send a patch later.

>  - p9_client_readdir builds a local iov_iter_kvec
>  - p9_client_read_once is only called by p9_client_read, and really
>    should be marked static.

Agreed, will clean up too.

>  - p9_client_read is called by v9fs_issue_read on a netfs iov_iter and by
>    v9fs_dir_readdir and v9fs_fid_xattr_get on a local kvec iter
>  - p9_client_write is called with an iov_iter_kvec from
>    v9fs_fid_xattr_set, and with a netfs-issued iov_iter by
>    v9fs_issue_write
>
> So right now except for netfs everything is on a kvec.  Dave, what kind
> of iov_iter does netfs send down to the file system?  I had a bit of a
> hard time reading through it, but I'd expect that any page pinning would
> be done in netfs and not below it?  Why are we using iov_iters here and
> not something like a bio_vec?  What is the fs / transport supposed to do
> with these iters?
>
> Ignoring the rest of the mail for now, because I suspect the outcome of
> the above might make it irrelevant, but I'll come back to it if needed.

(waiting for David's answer here, but as far as I can see the contract
between the transport and the vfs is that the transport should handle
whatever it's being fed, so it doesn't really matter if it's a bio_vec or
an iov_iter -- ultimately virtio or whatever backend that wants to handle
zc likely won't handle bio_vec any better, so it'll need converting
anyway)

Thanks,
-- 
Dominique Martinet | Asmadeus

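
For reference, the kind of instrumentation mentioned above ("I tried
logging and saw ITER_BVEC, ITER_KVEC and ITER_FOLIOQ") could look roughly
like this; an illustrative debug helper only, not part of any patch in
this thread:

```
#include <linux/printk.h>
#include <linux/uio.h>

/* Illustrative debug aid: report what kind of iov_iter the transport sees. */
static void sketch_dump_iter(const char *who, const struct iov_iter *i)
{
	pr_debug("%s: iter type %u, user_backed=%d, count=%zu\n", who,
		 (unsigned int)iov_iter_type(i), user_backed_iter(i),
		 iov_iter_count(i));
}
```
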
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Christian Schoenebeck @ 2025-12-15 11:16 UTC
To: Christoph Hellwig, Dominique Martinet
Cc: Eric Van Hensbergen, Latchesar Ionkov, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

On Monday, 15 December 2025 08:34:12 CET Dominique Martinet wrote:
> Thanks for having a look
>
> Christoph Hellwig wrote on Sun, Dec 14, 2025 at 09:55:12PM -0800:
> > > Ok, I don't understand why the current code locks everything down and
> > > wants to use a single scatterlist shared for the whole channel (and
> > > capped to 128 pages?); it should only need to lock around the
> > > virtqueue_add_sg() call. I'll need to play with that some more.
> >
> > What do you mean with "lock down"?
>
> Just the odd (to me) use of the chan->lock around basically all of
> p9_virtio_request() and most of p9_virtio_zc_request() -- I'm pretty sure
> this was just the author trying to avoid an allocation by recycling the
> chan->sg array, though, so ignore this.

The lock protects the channel-wide, shared scatterlist while the
scatterlist is filled from the linear buffers by pack_sg_list(). Then
virtqueue_add_sgs() pushes the scatterlist's segments as virtio
descriptors into the virtio FIFOs. From that point on it is safe to
unlock, as the scatterlist is no longer needed.

And yes, the assumption probably was that handling the scatterlist as a
temporary was more expensive due to the allocation.

> > > Looking at other virtio drivers I could probably use a sg_table and have
> > > extract_iter_to_sg() do all the work for us...
> >
> > Looking at the code I'm actually really confused.  Both because I
> > actually thought we were talking about the 9fs direct I/O code, but that
> > has actually been removed / converted to netfs a long time ago.
> >
> > But even more so what the net/9p code is actually doing...  How do we
> > even end up with user addresses here at all?
>
> FWIW I tried logging and saw ITER_BVEC, ITER_KVEC and ITER_FOLIOQ --
> O_DIRECT writes are seen as BVEC, so I guess it's not as direct as I
> expected them to be -- that code could very well be leftovers from the
> switch to iov_iter back in 2015...
>
> (I'm actually not sure why Christian suggested checking for is_iovec() in
> https://lkml.kernel.org/r/2245723.irdbgypaU6@weasel -- then I generalized
> it to user_backed_iter() and it just worked, because checking for that
> moved bvec and folioq out of iov_iter_get_pages_alloc2() to... something
> that obviously should not work in my opinion, but apparently was enough
> to not trigger this particular BUG.)

Sorry, I should have explained why I suggested that change: my
understanding of the p9_get_mapped_pages() code was that the original
author intended to go down the iov_iter_get_pages_alloc2() path
exclusively for user space memory, and that he assumed his preceding
!iov_iter_is_kvec(data) check would guard this appropriately. But
apparently there are several other iterator types that are kernel memory
as well. So I thought switching the check to is_iovec() would be better,
as these are user space memory; however I missed that there is also
ITER_UBUF with user space memory, which you fortunately caught by
suggesting the user_backed_iter() check instead.

/Christian

* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Christoph Hellwig @ 2025-12-15 14:37 UTC
To: Dominique Martinet
Cc: Christoph Hellwig, Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

On Mon, Dec 15, 2025 at 04:34:12PM +0900, Dominique Martinet wrote:
> Christoph Hellwig wrote on Sun, Dec 14, 2025 at 09:55:12PM -0800:
> > > Ok, I don't understand why the current code locks everything down and
> > > wants to use a single scatterlist shared for the whole channel (and
> > > capped to 128 pages?); it should only need to lock around the
> > > virtqueue_add_sg() call. I'll need to play with that some more.
> >
> > What do you mean with "lock down"?
>
> Just the odd (to me) use of the chan->lock around basically all of
> p9_virtio_request() and most of p9_virtio_zc_request() -- I'm pretty sure
> this was just the author trying to avoid an allocation by recycling the
> chan->sg array, though, so ignore this.

Oh, ok.  This seems unrelated to the handling of the iov_iters, and I'm
sorry that I don't really know anything about that part.

> > > Looking at other virtio drivers I could probably use a sg_table and have
> > > extract_iter_to_sg() do all the work for us...
> >
> > Looking at the code I'm actually really confused.  Both because I
> > actually thought we were talking about the 9fs direct I/O code, but that
> > has actually been removed / converted to netfs a long time ago.
> >
> > But even more so what the net/9p code is actually doing...  How do we
> > even end up with user addresses here at all?
>
> FWIW I tried logging and saw ITER_BVEC, ITER_KVEC and ITER_FOLIOQ --
> O_DIRECT writes are seen as BVEC, so I guess it's not as direct as I
> expected them to be -- that code could very well be leftovers from the
> switch to iov_iter back in 2015...

Oh right, I think this comes from Dave's netfs_extract_user_iter.

> (waiting for David's answer here, but as far as I can see the contract
> between the transport and the vfs is that the transport should handle
> whatever it's being fed, so it doesn't really matter if it's a bio_vec or
> an iov_iter -- ultimately virtio or whatever backend that wants to handle
> zc likely won't handle bio_vec any better, so it'll need converting
> anyway)

Yeah.  Looking at what the code does with the pages, I think all this
should go away in favor of using extract_iter_to_sg and building the
scatterlists directly from the iters, without an extra page indirection.

(And of course one day virtio should migrate away from scatterlists, but
that's for another time.)

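
A rough sketch of the extract_iter_to_sg() direction suggested here;
illustrative only -- the helper name, GFP flags and sg_max handling are
assumptions, it presumes sg_max is small enough for a single unchained sg
allocation, and it relies on extract_iter_to_sg() appending at
sgtable->nents the way its in-tree users do:

```
#include <linux/scatterlist.h>
#include <linux/uio.h>

/*
 * Illustrative only: build a scatterlist straight from an iov_iter, with
 * no intermediate page array.  User-backed iters get pinned, so the
 * completion path must still unpin (see iov_iter_extract_will_pin()).
 */
static ssize_t sketch_iter_to_sgtable(struct iov_iter *iter,
				      struct sg_table *sgt, unsigned int sg_max)
{
	ssize_t len;

	if (sg_alloc_table(sgt, sg_max, GFP_NOFS))
		return -ENOMEM;

	sgt->nents = 0;	/* extract_iter_to_sg() appends starting here */
	len = extract_iter_to_sg(iter, iov_iter_count(iter), sgt, sg_max, 0);
	if (len <= 0) {
		sg_free_table(sgt);
		return len ? len : -EIO;
	}
	sg_mark_end(&sgt->sgl[sgt->nents - 1]);
	return len;
}
```

The existing virtio transport already caps its shared scatterlist at 128
entries, so a single unchained table along these lines would presumably be
in the same ballpark.
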
* Re: [PATCH] 9p/virtio: restrict page pinning to user_backed_iter() iovec
From: Christian Schoenebeck @ 2025-12-10 13:33 UTC
To: asmadeus, Christoph Hellwig
Cc: Eric Van Hensbergen, Latchesar Ionkov, v9fs, linux-kernel, David Howells, Matthew Wilcox, linux-fsdevel, Chris Arges

On Wednesday, 10 December 2025 07:04:30 CET Christoph Hellwig wrote:
> On Wed, Dec 10, 2025 at 06:04:23AM +0900, Dominique Martinet via B4 Relay wrote:
[...]
> > The problem is that iov_iter_get_pages_alloc2() apparently cannot be
> > called on folios (as illustrated by the backtrace below), so limit what
> > iov we can pin from !iov_iter_is_kvec() to user_backed_iter()
>
> As willy pointed out this is a kmalloc.
>
> And 9p (just like NFS) really needs to switch away from
> iov_iter_get_pages_alloc2 to iov_iter_extract_pages, which handles not
> just this perfectly fine but also fixes various other issues.
>
> Note that the networking code still wants special treatment for kmalloc
> pages, so you might have more work there.

But couldn't this patch be used as a preliminary solution for this issue
before switching to iov_iter_extract_pages(), as the latter does not look
like a trivial change? Maybe I'm still missing something important here,
not sure.

/Christian
