linux-nfs.vger.kernel.org archive mirror
* [Bug report] xfstests generic/323 over NFS hit BUG: KASAN: slab-use-after-free in nfs_local_call_read on 6.18.0-rc1
@ 2025-10-19  9:29 Yongcheng Yang
  2025-10-19 15:18 ` Trond Myklebust
  0 siblings, 1 reply; 18+ messages in thread
From: Yongcheng Yang @ 2025-10-19  9:29 UTC
  To: linux-nfs

Hi All,

There is a new nfs slab-use-after-free issue since 6.18.0-rc1.
It appears to be reliably reproducible on my side when running xfstests
generic/323 over NFSv4.2 in *debug* kernel mode:

[18265.311177] ==================================================================
[18265.315831] BUG: KASAN: slab-use-after-free in nfs_local_call_read+0x590/0x7f0 [nfs]
[18265.320135] Read of size 2 at addr ffff8881090556a2 by task kworker/u9:0/667366

[18265.325454] CPU: 0 UID: 0 PID: 667366 Comm: kworker/u9:0 Not tainted 6.18.0-rc1 #1 PREEMPT(full) 
[18265.325461] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20241117-2.el9 11/17/2024
[18265.325465] Workqueue: nfslocaliod nfs_local_call_read [nfs]
[18265.325611] Call Trace:
[18265.325615]  <TASK>
[18265.325619]  dump_stack_lvl+0x77/0xa0
[18265.325629]  print_report+0x171/0x820
[18265.325637]  ? __virt_addr_valid+0x151/0x3a0
[18265.325644]  ? __virt_addr_valid+0x300/0x3a0
[18265.325650]  ? nfs_local_call_read+0x590/0x7f0 [nfs]
[18265.325770]  kasan_report+0x167/0x1a0
[18265.325777]  ? nfs_local_call_read+0x590/0x7f0 [nfs]
[18265.325900]  nfs_local_call_read+0x590/0x7f0 [nfs]
[18265.326027]  ? process_scheduled_works+0x7d3/0x11d0
[18265.326034]  process_scheduled_works+0x857/0x11d0
[18265.326050]  worker_thread+0x897/0xd00
[18265.326065]  kthread+0x51b/0x650
[18265.326071]  ? __pfx_worker_thread+0x10/0x10
[18265.326076]  ? __pfx_kthread+0x10/0x10
[18265.326082]  ret_from_fork+0x249/0x480
[18265.326087]  ? __pfx_kthread+0x10/0x10
[18265.326092]  ret_from_fork_asm+0x1a/0x30
[18265.326104]  </TASK>

[18265.378345] Allocated by task 681242:
[18265.380068]  kasan_save_track+0x3e/0x80
[18265.381838]  __kasan_kmalloc+0x93/0xb0
[18265.383587]  __kmalloc_cache_noprof+0x3eb/0x6e0
[18265.385532]  nfs_local_doio+0x1cb/0xeb0 [nfs]
[18265.387630]  nfs_initiate_pgio+0x284/0x400 [nfs]
[18265.389815]  nfs_generic_pg_pgios+0x6e2/0x810 [nfs]
[18265.391998]  nfs_pageio_complete+0x278/0x750 [nfs]
[18265.394146]  nfs_file_direct_read+0x78c/0x9e0 [nfs]
[18265.396386]  vfs_read+0x5d0/0x770
[18265.398043]  __x64_sys_pread64+0xed/0x160
[18265.399837]  do_syscall_64+0xad/0x7d0
[18265.401561]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[18265.404687] Freed by task 596986:
[18265.406245]  kasan_save_track+0x3e/0x80
[18265.408013]  __kasan_save_free_info+0x46/0x50
[18265.409884]  __kasan_slab_free+0x58/0x80
[18265.411613]  kfree+0x1c1/0x620
[18265.413075]  nfs_local_read_aio_complete_work+0x86/0x100 [nfs]
[18265.415486]  process_scheduled_works+0x857/0x11d0
[18265.417437]  worker_thread+0x897/0xd00
[18265.419177]  kthread+0x51b/0x650
[18265.420689]  ret_from_fork+0x249/0x480
[18265.422331]  ret_from_fork_asm+0x1a/0x30

[18265.424989] Last potentially related work creation:
[18265.426949]  kasan_save_stack+0x3e/0x60
[18265.428639]  kasan_record_aux_stack+0xbd/0xd0
[18265.430423]  insert_work+0x2d/0x230
[18265.431968]  __queue_work+0x8ec/0xb50
[18265.433555]  queue_work_on+0xaf/0xe0
[18265.435126]  iomap_dio_bio_end_io+0xb5/0x160
[18265.436902]  blk_update_request+0x3d1/0x1000
[18265.438699]  blk_mq_end_request+0x3c/0x70
[18265.440379]  virtblk_done+0x148/0x250
[18265.441973]  vring_interrupt+0x159/0x300
[18265.443642]  __handle_irq_event_percpu+0x1c3/0x700
[18265.445556]  handle_irq_event+0x8b/0x1c0
[18265.447219]  handle_edge_irq+0x1b5/0x760
[18265.448881]  __common_interrupt+0xba/0x140
[18265.450588]  common_interrupt+0x45/0xa0
[18265.452258]  asm_common_interrupt+0x26/0x40

[18265.454941] Second to last potentially related work creation:
[18265.457141]  kasan_save_stack+0x3e/0x60
[18265.458790]  kasan_record_aux_stack+0xbd/0xd0
[18265.460597]  insert_work+0x2d/0x230
[18265.462129]  __queue_work+0x8ec/0xb50
[18265.463725]  queue_work_on+0xaf/0xe0
[18265.465289]  nfs_local_doio+0xa75/0xeb0 [nfs]
[18265.467220]  nfs_initiate_pgio+0x284/0x400 [nfs]
[18265.469226]  nfs_generic_pg_pgios+0x6e2/0x810 [nfs]
[18265.471310]  nfs_pageio_complete+0x278/0x750 [nfs]
[18265.473363]  nfs_file_direct_read+0x78c/0x9e0 [nfs]
[18265.475432]  vfs_read+0x5d0/0x770
[18265.476941]  __x64_sys_pread64+0xed/0x160
[18265.478648]  do_syscall_64+0xad/0x7d0
[18265.480240]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[18265.483211] The buggy address belongs to the object at ffff888109055600
                which belongs to the cache kmalloc-rnd-14-512 of size 512
[18265.488048] The buggy address is located 162 bytes inside of
                freed 512-byte region [ffff888109055600, ffff888109055800)

[18265.493827] The buggy address belongs to the physical page:
[18265.496033] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888109050e00 pfn:0x109050
[18265.499353] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[18265.502198] flags: 0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff)
[18265.505105] page_type: f5(slab)
[18265.506675] raw: 0017ffffc0000240 ffff88810006d540 ffffea0004151c10 ffff88810006e088
[18265.509537] raw: ffff888109050e00 000000000015000f 00000000f5000000 0000000000000000
[18265.512418] head: 0017ffffc0000240 ffff88810006d540 ffffea0004151c10 ffff88810006e088
[18265.515326] head: ffff888109050e00 000000000015000f 00000000f5000000 0000000000000000
[18265.518244] head: 0017ffffc0000003 ffffea0004241401 00000000ffffffff 00000000ffffffff
[18265.521168] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[18265.524113] page dumped because: kasan: bad access detected

[18265.527455] Memory state around the buggy address:
[18265.529505]  ffff888109055580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[18265.532292]  ffff888109055600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[18265.535149] >ffff888109055680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[18265.537930]                                ^
[18265.539899]  ffff888109055700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[18265.542713]  ffff888109055780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[18265.545507] ==================================================================
[18265.554665] Disabling lock debugging due to kernel taint
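
The offsets in the report can be checked by hand. A minimal sketch of that arithmetic follows, assuming generic KASAN's granule of 8 bytes per shadow byte; the helper names `offset_in_object` and `shadow_index_in_row` are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Generic KASAN tracks one shadow byte per 8 bytes of memory. */
#define KASAN_GRANULE 8u

/* Hypothetical helper: byte offset of a faulting address inside its object. */
static uint64_t offset_in_object(uint64_t addr, uint64_t obj_start)
{
	assert(addr >= obj_start);
	return addr - obj_start;
}

/*
 * Hypothetical helper: index of the shadow byte, within one dumped row of
 * "Memory state around the buggy address", that covers addr. Each row
 * shows 16 shadow bytes, i.e. 128 bytes of memory.
 */
static unsigned int shadow_index_in_row(uint64_t addr, uint64_t row_start)
{
	assert(addr >= row_start);
	return (unsigned int)((addr - row_start) / KASAN_GRANULE);
}
```

With the values from the report, `offset_in_object(0xffff8881090556a2, 0xffff888109055600)` is 162, matching "located 162 bytes inside of freed 512-byte region", and `shadow_index_in_row(0xffff8881090556a2, 0xffff888109055680)` is 4, i.e. the fifth `fb` (freed slab memory) byte on the row marked with `>`.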



* Re: [Bug report] xfstests generic/323 over NFS hit BUG: KASAN: slab-use-after-free in nfs_local_call_read on 6.18.0-rc1
  2025-10-19  9:29 [Bug report] xfstests generic/323 over NFS hit BUG: KASAN: slab-use-after-free in nfs_local_call_read on 6.18.0-rc1 Yongcheng Yang
@ 2025-10-19 15:18 ` Trond Myklebust
  2025-10-19 16:26   ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Trond Myklebust @ 2025-10-19 15:18 UTC
  To: Yongcheng Yang, linux-nfs, Mike Snitzer

On Sun, 2025-10-19 at 17:29 +0800, Yongcheng Yang wrote:
> Hi All,
> 
> There is a new nfs slab-use-after-free issue since 6.18.0-rc1.
> It appears to be reliably reproducible on my side when running
> xfstests
> generic/323 over NFSv4.2 in *debug* kernel mode:

Thanks for the report! I think I see the problem.

Mike,

When you iterate over the iocb in nfs_local_call_read(), you're calling
nfs_local_pgio_done(), nfs_local_read_done() and
nfs_local_pgio_release() multiple times.

 * You're calling nfs_local_read_aio_complete() and
   nfs_local_read_aio_complete_work() once for each and every
   asynchronous call.
 * You're calling nfs_local_pgio_done() for each synchronous call.
 * In addition, if there is a synchronous call at the very end of the
   iteration, so that status != -EIOCBQUEUED, then you're also calling
   nfs_local_read_done() one extra time, and then calling
   nfs_local_pgio_release().

The same thing appears to be happening in nfs_local_call_write().
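
The concern in the list above can be made concrete with a toy model. The sketch below is not the code in fs/nfs/localio.c; it only counts how often the release step would run for one pgio header if the control flow is exactly as described in the three bullets. `struct piece` and `count_releases` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one entry per piece of IO issued for a pgio header. */
struct piece {
	bool async;	/* true if the piece returned -EIOCBQUEUED */
};

/*
 * Count how many times the header would be released if:
 *  - every async piece's completion work ends by releasing the header,
 *  - a sync piece only records its status (the pgio_done step),
 *  - a trailing sync piece is followed by read_done plus release.
 */
static int count_releases(const struct piece *p, int n)
{
	int releases = 0;
	bool last_was_async = false;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		last_was_async = p[i].async;
		if (p[i].async)
			releases++;	/* release from async completion work */
	}
	if (n > 0 && !last_was_async)
		releases++;		/* release after the final sync piece */
	return releases;
}
```

Under this model a sync head, a sync tail and then one async middle piece give a single release, while two async pieces, or an async piece followed by a sync one, give two. Any count above one would mean the header is released more than once.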


-- 
Trond Myklebust Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com


* Re: [Bug report] xfstests generic/323 over NFS hit BUG: KASAN: slab-use-after-free in nfs_local_call_read on 6.18.0-rc1
  2025-10-19 15:18 ` Trond Myklebust
@ 2025-10-19 16:26   ` Mike Snitzer
  2025-10-20 18:24     ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2025-10-19 16:26 UTC
  To: Trond Myklebust; +Cc: Yongcheng Yang, linux-nfs

On Sun, Oct 19, 2025 at 11:18:57AM -0400, Trond Myklebust wrote:
> On Sun, 2025-10-19 at 17:29 +0800, Yongcheng Yang wrote:
> > Hi All,
> > 
> > There is a new nfs slab-use-after-free issue since 6.18.0-rc1.
> > It appears to be reliably reproducible on my side when running
> > xfstests
> > generic/323 over NFSv4.2 in *debug* kernel mode:
> 
> Thanks for the report! I think I see the problem.
> 
> Mike,
> 
> When you iterate over the iocb in nfs_local_call_read(), you're calling
> nfs_local_pgio_done(), nfs_local_read_done() and
> nfs_local_pgio_release() multiple times.

I purposely made nfs_local_pgio_done() safe to call multiple times.

And nfs_local_{read,write}_done() and nfs_local_pgio_release()
_should_ only be called once.

>  * You're calling nfs_local_read_aio_complete() and
>    nfs_local_read_aio_complete_work() once for each and every
>    asynchronous call.

There is only the possibility of a single async call for the single
aligned DIO.

For any given pgio entering LOCALIO, it may be split into 3 pieces:
The misaligned head and tail are first handled sync and only then the
aligned middle async (or possibly sync if underlying device imposes
sync, e.g. ramdisk).
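
The three-way split can be sketched in isolation. This is a simplified userspace illustration that assumes the only constraint is a power-of-two alignment on the file offset; the real LOCALIO code works on iov_iters and may apply further checks. `struct span`, `struct split` and `split_misaligned` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types: a byte range and its head/middle/tail split. */
struct span {
	uint64_t off;
	uint64_t len;
};

struct split {
	struct span head;	/* misaligned start, handled sync */
	struct span middle;	/* aligned part, eligible for async DIO */
	struct span tail;	/* misaligned end, handled sync */
};

/* align must be a power of two, e.g. the device's DIO alignment. */
static struct split split_misaligned(uint64_t off, uint64_t len, uint64_t align)
{
	struct split s = { { off, 0 }, { off, 0 }, { off, 0 } };
	uint64_t end = off + len;
	uint64_t mid_start = (off + align - 1) & ~(align - 1);	/* round up */
	uint64_t mid_end = end & ~(align - 1);			/* round down */

	assert(align != 0 && (align & (align - 1)) == 0);

	if (mid_start >= mid_end) {
		/* No aligned middle: keep the whole range as one piece. */
		s.head.len = len;
		return s;
	}
	s.head = (struct span){ off, mid_start - off };
	s.middle = (struct span){ mid_start, mid_end - mid_start };
	s.tail = (struct span){ mid_end, end - mid_end };
	return s;
}
```

For example, a 10000-byte request at offset 1000 with 4096-byte alignment splits into a 3096-byte head, a 4096-byte middle starting at 4096, and a 2808-byte tail starting at 8192.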

>  * You're calling nfs_local_pgio_done() for each synchronous call.

Yes, which is safe.  It just updates status, deals with partial
completion.

>  * In addition, if there is a synchronous call at the very end of the
>    iteration, so that status != -EIOCBQUEUED, then you're also calling
>    nfs_local_read_done() one extra time, and then calling
>    nfs_local_pgio_release().

It isn't in addition; it's only for the last piece of IO (be it sync or
async).

> The same thing appears to be happening in nfs_local_call_write().

I fully acknowledge this isn't an easy audit.  And there could be
something wrong.  But I'm not seeing it.  Obviously this BUG report
puts the onus on me to figure it out...

BUT, I have used this code extensively on non-debug and had no issues.
Is it at all possible KASAN is triggering a false positive!?

Mike



* Re: [Bug report] xfstests generic/323 over NFS hit BUG: KASAN: slab-use-after-free in nfs_local_call_read on 6.18.0-rc1
  2025-10-19 16:26   ` Mike Snitzer
@ 2025-10-20 18:24     ` Mike Snitzer
  2025-10-27 13:08       ` [v6.18-rcX PATCH 0/3] nfs/localio: fixes for recent misaligned DIO changes Mike Snitzer
                         ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Mike Snitzer @ 2025-10-20 18:24 UTC
  To: Yongcheng Yang; +Cc: Trond Myklebust, linux-nfs

On Sun, Oct 19, 2025 at 12:26:25PM -0400, Mike Snitzer wrote:
> On Sun, Oct 19, 2025 at 11:18:57AM -0400, Trond Myklebust wrote:
> > On Sun, 2025-10-19 at 17:29 +0800, Yongcheng Yang wrote:
> > > Hi All,
> > > 
> > > There is a new nfs slab-use-after-free issue since 6.18.0-rc1.
> > > It appears to be reliably reproducible on my side when running
> > > xfstests
> > > generic/323 over NFSv4.2 in *debug* kernel mode:
> > 
> > Thanks for the report! I think I see the problem.
> > 
> > Mike,
> > 
> > When you iterate over the iocb in nfs_local_call_read(), you're calling
> > nfs_local_pgio_done(), nfs_local_read_done() and
> > nfs_local_pgio_release() multiple times.
> 
> I purposely made nfs_local_pgio_done() safe to call multiple times.
> 
> And nfs_local_{read,write}_done() and nfs_local_pgio_release()
> _should_ only be called once.
> 
> >  * You're calling nfs_local_read_aio_complete() and
> >    nfs_local_read_aio_complete_work() once for each and every
> >    asynchronous call.
> 
> There is only the possibility of a single async call for the single
> aligned DIO.
> 
> For any given pgio entering LOCALIO, it may be split into 3 pieces:
> The misaligned head and tail are first handled sync and only then the
> aligned middle async (or possibly sync if underlying device imposes
> sync, e.g. ramdisk).
> 
> >  * You're calling nfs_local_pgio_done() for each synchronous call.
> 
> Yes, which is safe.  It just updates status, deals with partial
> completion.
> 
> >  * In addition, if there is a synchronous call at the very end of the
> >    iteration, so that status != -EIOCBQUEUED, then you're also calling
> >    nfs_local_read_done() one extra time, and then calling
> >    nfs_local_pgio_release().
> 
> It isn't in addition; it's only for the last piece of IO (be it sync or
> async).
> 
> > The same thing appears to be happening in nfs_local_call_write().
> 
> I fully acknowledge this isn't an easy audit.  And there could be
> something wrong.  But I'm not seeing it.  Obviously this BUG report
> puts the onus on me to figure it out...
> 
> BUT, I have used this code extensively on non-debug and had no issues.
> Is it at all possible KASAN is triggering a false positive!?

I haven't been able to reproduce this (with NFS LOCALIO and KASAN both
enabled):

[root@snitzer xfstests-dev]# cat local.config
export TEST_DIR="/mnt/share1"
export TEST_DEV="10.200.111.104:/share1"
export SCRATCH_MNT="/mnt/scratch"
export SCRATCH_DEV="10.200.111.104:/"
export TEST_FS_MOUNT_OPTS="-overs=4.2,sec=sys,acl,nconnect=5"

[root@snitzer xfstests-dev]# ./check -nfs generic/323
FSTYP         -- nfs
PLATFORM      -- Linux/x86_64 snitzer 6.12.53.1.hs.snitm+ #75 SMP PREEMPT_DYNAMIC Fri Oct 17 03:55:21 UTC 2025
MKFS_OPTIONS  -- 10.200.111.104:/
MOUNT_OPTIONS -- 10.200.111.104:/ /mnt/scratch

generic/323        121s
Ran: generic/323
Passed all 1 tests

My kernel is 6.12-stable based, but includes all NFS and NFSD changes
through 6.18-rc1 (and also most of Chuck's nfsd-testing), see:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=kernel-6.12.53/main

Please provide your .config (off-list is fine!) and I'll see if I'm
somehow missing something.

(I suppose it could be that my test system is too slow...)

Thanks,
Mike


* [v6.18-rcX PATCH 0/3] nfs/localio: fixes for recent misaligned DIO changes
  2025-10-20 18:24     ` Mike Snitzer
@ 2025-10-27 13:08       ` Mike Snitzer
  2025-10-27 13:08       ` [v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support Mike Snitzer
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2025-10-27 13:08 UTC
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

Hi,

These changes are needed to fix v6.18-rc1 commit c817248fc831
("nfs/localio: add proper O_DIRECT support for READ and WRITE").

This patchset fixes the KASAN use-after-free bug that was reported here:
https://lore.kernel.org/linux-nfs/aPSvi5Yr2lGOh5Jh@dell-per750-06-vm-07.rhts.eng.pek2.redhat.com/

I still contend there wasn't an actual problem, but these changes bring
more control to the completion of misaligned DIO by marshalling it
through the use of a refcount.

With the additional minimal memory barriers associated with atomic_t
methods, the KASAN splat no longer occurs.
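
The refcount idea can be sketched in userspace with C11 atomics. This is an illustration of the general pattern only, not the patch itself: `struct io_header`, `header_init` and `piece_done` are hypothetical names, and `atomic_fetch_sub()` (sequentially consistent by default) stands in for the kernel's fully ordered `atomic_dec_and_test()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pgio header shared by up to 3 pieces of IO. */
struct io_header {
	atomic_int remaining;	/* pieces that have not completed yet */
	int release_count;	/* times the release step ran (for checking) */
};

static void header_init(struct io_header *h, int pieces)
{
	assert(pieces > 0);
	atomic_init(&h->remaining, pieces);
	h->release_count = 0;
}

/*
 * Called once per completed piece, from any context. Returns true only
 * for the caller that drops the count to zero; that caller alone runs
 * the release step, and nobody may touch the header afterwards.
 */
static bool piece_done(struct io_header *h)
{
	if (atomic_fetch_sub(&h->remaining, 1) == 1) {
		h->release_count++;	/* stands in for releasing the header */
		return true;
	}
	return false;
}
```

With three pieces, the first two calls to `piece_done()` return false and the third returns true, so the release step runs exactly once regardless of which piece finishes last.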

Also, remove "dead code" that handles ENOTBLK, which NFS should never
see, and backfill short read support for misaligned DIO.

Thanks,
Mike

Mike Snitzer (3):
  nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
  nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
  nfs/localio: backfill missing partial read support for misaligned DIO

 fs/nfs/localio.c | 149 +++++++++++++++++++++++++++++------------------
 1 file changed, 91 insertions(+), 58 deletions(-)

-- 
2.44.0



* [v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
  2025-10-20 18:24     ` Mike Snitzer
  2025-10-27 13:08       ` [v6.18-rcX PATCH 0/3] nfs/localio: fixes for recent misaligned DIO changes Mike Snitzer
@ 2025-10-27 13:08       ` Mike Snitzer
  2025-10-27 13:08       ` [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header Mike Snitzer
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2025-10-27 13:08 UTC
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

Each filesystem is meant to fall back to retrying DIO in terms of
buffered IO when it encounters -ENOTBLK while issuing DIO (which can
happen if the VFS cannot invalidate the page cache).

So NFS doesn't need special handling for -ENOTBLK.

Also, explicitly initialize a couple of DIO-related iocb members
rather than simply relying on data structure zeroing.

Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/localio.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index b575f0e6c7c8..7c97055bddb1 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -315,6 +315,7 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
 
 	iocb->hdr = hdr;
 	iocb->kiocb.ki_flags &= ~IOCB_APPEND;
+	iocb->kiocb.ki_complete = NULL;
 	iocb->aio_complete_work = NULL;
 
 	iocb->end_iter_index = -1;
@@ -484,6 +485,7 @@ nfs_local_iters_init(struct nfs_local_kiocb *iocb, int rw)
 	/* Use buffered IO */
 	iocb->offset[0] = hdr->args.offset;
 	iov_iter_bvec(&iocb->iters[0], rw, iocb->bvec, v, len);
+	iocb->iter_is_dio_aligned[0] = false;
 	iocb->n_iters = 1;
 }
 
@@ -803,7 +805,7 @@ static void nfs_local_call_write(struct work_struct *work)
 			iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
 			iocb->aio_complete_work = nfs_local_write_aio_complete_work;
 		}
-retry:
+
 		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->write_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
@@ -823,15 +825,6 @@ static void nfs_local_call_write(struct work_struct *work)
 					nfs_local_pgio_done(iocb->hdr, status);
 					break;
 				}
-			} else if (unlikely(status == -ENOTBLK &&
-					    (iocb->kiocb.ki_flags & IOCB_DIRECT))) {
-				/* VFS will return -ENOTBLK if DIO WRITE fails to
-				 * invalidate the page cache. Retry using buffered IO.
-				 */
-				iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
-				iocb->kiocb.ki_complete = NULL;
-				iocb->aio_complete_work = NULL;
-				goto retry;
 			}
 			nfs_local_pgio_done(iocb->hdr, status);
 			if (iocb->hdr->task.tk_status)
-- 
2.44.0



* [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
  2025-10-20 18:24     ` Mike Snitzer
  2025-10-27 13:08       ` [v6.18-rcX PATCH 0/3] nfs/localio: fixes for recent misaligned DIO changes Mike Snitzer
  2025-10-27 13:08       ` [v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support Mike Snitzer
@ 2025-10-27 13:08       ` Mike Snitzer
  2025-10-27 13:19         ` Christoph Hellwig
  2025-10-27 13:08       ` [v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO Mike Snitzer
  2025-10-27 17:52       ` [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion Mike Snitzer
  4 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2025-10-27 13:08 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

Improve completion handling of as many as 3 IOs associated with each
misaligned DIO by using an atomic_t to track completion of each IO.

Update nfs_local_pgio_done() to use precise atomic_t accounting for
remaining iov_iter (up to 3) associated with each iocb, so that each
NFS LOCALIO pgio header is only released after all IOs have completed.
But also allow early return if/when a short read or write occurs.

Fixes reported BUG: KASAN: slab-use-after-free in nfs_local_call_read:
https://lore.kernel.org/linux-nfs/aPSvi5Yr2lGOh5Jh@dell-per750-06-vm-07.rhts.eng.pek2.redhat.com/

Reported-by: Yongcheng Yang <yoyang@redhat.com>
Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/localio.c | 114 ++++++++++++++++++++++++++++-------------------
 1 file changed, 69 insertions(+), 45 deletions(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 7c97055bddb1..a5f1eeeef30e 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -42,7 +42,7 @@ struct nfs_local_kiocb {
 	/* Begin mostly DIO-specific members */
 	size_t                  end_len;
 	short int		end_iter_index;
-	short int		n_iters;
+	atomic_t		n_iters;
 	bool			iter_is_dio_aligned[NFSLOCAL_MAX_IOS];
 	loff_t                  offset[NFSLOCAL_MAX_IOS] ____cacheline_aligned;
 	struct iov_iter		iters[NFSLOCAL_MAX_IOS];
@@ -407,6 +407,7 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 		iters[n_iters].count = local_dio->start_len;
 		iocb->offset[n_iters] = iocb->hdr->args.offset;
 		iocb->iter_is_dio_aligned[n_iters] = false;
+		atomic_inc(&iocb->n_iters);
 		++n_iters;
 	}
 
@@ -425,6 +426,7 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 		/* Save index and length of end */
 		iocb->end_iter_index = n_iters;
 		iocb->end_len = local_dio->end_len;
+		atomic_inc(&iocb->n_iters);
 		++n_iters;
 	}
 
@@ -448,7 +450,6 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 	}
 	++n_iters;
 
-	iocb->n_iters = n_iters;
 	return n_iters;
 }
 
@@ -474,6 +475,12 @@ nfs_local_iters_init(struct nfs_local_kiocb *iocb, int rw)
 	}
 	len = hdr->args.count - total;
 
+	/*
+	 * For each iocb, iocb->n_iters is always at least 1 and we always
+	 * end io after first nfs_local_pgio_done call unless misaligned DIO.
+	 */
+	atomic_set(&iocb->n_iters, 1);
+
 	if (test_bit(NFS_IOHDR_ODIRECT, &hdr->flags)) {
 		struct nfs_local_dio local_dio;
 
@@ -486,7 +493,6 @@ nfs_local_iters_init(struct nfs_local_kiocb *iocb, int rw)
 	iocb->offset[0] = hdr->args.offset;
 	iov_iter_bvec(&iocb->iters[0], rw, iocb->bvec, v, len);
 	iocb->iter_is_dio_aligned[0] = false;
-	iocb->n_iters = 1;
 }
 
 static void
@@ -506,9 +512,11 @@ nfs_local_pgio_init(struct nfs_pgio_header *hdr,
 		hdr->task.tk_start = ktime_get();
 }
 
-static void
-nfs_local_pgio_done(struct nfs_pgio_header *hdr, long status)
+static bool
+nfs_local_pgio_done(struct nfs_local_kiocb *iocb, long status, bool force)
 {
+	struct nfs_pgio_header *hdr = iocb->hdr;
+
 	/* Must handle partial completions */
 	if (status >= 0) {
 		hdr->res.count += status;
@@ -519,6 +527,12 @@ nfs_local_pgio_done(struct nfs_pgio_header *hdr, long status)
 		hdr->res.op_status = nfs_localio_errno_to_nfs4_stat(status);
 		hdr->task.tk_status = status;
 	}
+
+	if (force)
+		return true;
+
+	BUG_ON(atomic_read(&iocb->n_iters) <= 0);
+	return atomic_dec_and_test(&iocb->n_iters);
 }
 
 static void
@@ -549,11 +563,11 @@ static inline void nfs_local_pgio_aio_complete(struct nfs_local_kiocb *iocb)
 	queue_work(nfsiod_workqueue, &iocb->work);
 }
 
-static void
-nfs_local_read_done(struct nfs_local_kiocb *iocb, long status)
+static void nfs_local_read_done(struct nfs_local_kiocb *iocb)
 {
 	struct nfs_pgio_header *hdr = iocb->hdr;
 	struct file *filp = iocb->kiocb.ki_filp;
+	long status = hdr->task.tk_status;
 
 	if ((iocb->kiocb.ki_flags & IOCB_DIRECT) && status == -EINVAL) {
 		/* Underlying FS will return -EINVAL if misaligned DIO is attempted. */
@@ -574,12 +588,18 @@ nfs_local_read_done(struct nfs_local_kiocb *iocb, long status)
 			status > 0 ? status : 0, hdr->res.eof);
 }
 
+static inline void nfs_local_read_iocb_done(struct nfs_local_kiocb *iocb)
+{
+	nfs_local_read_done(iocb);
+	nfs_local_pgio_release(iocb);
+}
+
 static void nfs_local_read_aio_complete_work(struct work_struct *work)
 {
 	struct nfs_local_kiocb *iocb =
 		container_of(work, struct nfs_local_kiocb, work);
 
-	nfs_local_pgio_release(iocb);
+	nfs_local_read_iocb_done(iocb);
 }
 
 static void nfs_local_read_aio_complete(struct kiocb *kiocb, long ret)
@@ -587,8 +607,10 @@ static void nfs_local_read_aio_complete(struct kiocb *kiocb, long ret)
 	struct nfs_local_kiocb *iocb =
 		container_of(kiocb, struct nfs_local_kiocb, kiocb);
 
-	nfs_local_pgio_done(iocb->hdr, ret);
-	nfs_local_read_done(iocb, ret);
+	/* AIO completion of DIO read should always be last to complete */
+	if (unlikely(!nfs_local_pgio_done(iocb, ret, false)))
+		return;
+
 	nfs_local_pgio_aio_complete(iocb); /* Calls nfs_local_read_aio_complete_work */
 }
 
@@ -599,10 +621,13 @@ static void nfs_local_call_read(struct work_struct *work)
 	struct file *filp = iocb->kiocb.ki_filp;
 	const struct cred *save_cred;
 	ssize_t status;
+	int n_iters;
 
 	save_cred = override_creds(filp->f_cred);
 
-	for (int i = 0; i < iocb->n_iters ; i++) {
+	n_iters = atomic_read(&iocb->n_iters);
+	for (int i = 0; i < n_iters ; i++) {
+		/* DIO-aligned middle is always issued last with AIO completion */
 		if (iocb->iter_is_dio_aligned[i]) {
 			iocb->kiocb.ki_flags |= IOCB_DIRECT;
 			iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
@@ -612,18 +637,14 @@ static void nfs_local_call_read(struct work_struct *work)
 		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->read_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
-			nfs_local_pgio_done(iocb->hdr, status);
-			if (iocb->hdr->task.tk_status)
+			if (nfs_local_pgio_done(iocb, status, false)) {
+				nfs_local_read_iocb_done(iocb);
 				break;
+			}
 		}
 	}
 
 	revert_creds(save_cred);
-
-	if (status != -EIOCBQUEUED) {
-		nfs_local_read_done(iocb, status);
-		nfs_local_pgio_release(iocb);
-	}
 }
 
 static int
@@ -738,11 +759,10 @@ static void nfs_local_vfs_getattr(struct nfs_local_kiocb *iocb)
 	fattr->du.nfs3.used = stat.blocks << 9;
 }
 
-static void
-nfs_local_write_done(struct nfs_local_kiocb *iocb, long status)
+static void nfs_local_write_done(struct nfs_local_kiocb *iocb)
 {
 	struct nfs_pgio_header *hdr = iocb->hdr;
-	struct inode *inode = hdr->inode;
+	long status = hdr->task.tk_status;
 
 	dprintk("%s: wrote %ld bytes.\n", __func__, status > 0 ? status : 0);
 
@@ -761,28 +781,36 @@ nfs_local_write_done(struct nfs_local_kiocb *iocb, long status)
 		nfs_set_pgio_error(hdr, -ENOSPC, hdr->args.offset);
 		status = -ENOSPC;
 		/* record -ENOSPC in terms of nfs_local_pgio_done */
-		nfs_local_pgio_done(hdr, status);
+		(void) nfs_local_pgio_done(iocb, status, true);
 	}
 	if (hdr->task.tk_status < 0)
-		nfs_reset_boot_verifier(inode);
+		nfs_reset_boot_verifier(hdr->inode);
 }
 
-static void nfs_local_write_aio_complete_work(struct work_struct *work)
+static inline void nfs_local_write_iocb_done(struct nfs_local_kiocb *iocb)
 {
-	struct nfs_local_kiocb *iocb =
-		container_of(work, struct nfs_local_kiocb, work);
-
+	nfs_local_write_done(iocb);
 	nfs_local_vfs_getattr(iocb);
 	nfs_local_pgio_release(iocb);
 }
 
+static void nfs_local_write_aio_complete_work(struct work_struct *work)
+{
+	struct nfs_local_kiocb *iocb =
+		container_of(work, struct nfs_local_kiocb, work);
+
+	nfs_local_write_iocb_done(iocb);
+}
+
 static void nfs_local_write_aio_complete(struct kiocb *kiocb, long ret)
 {
 	struct nfs_local_kiocb *iocb =
 		container_of(kiocb, struct nfs_local_kiocb, kiocb);
 
-	nfs_local_pgio_done(iocb->hdr, ret);
-	nfs_local_write_done(iocb, ret);
+	/* AIO completion of DIO write should always be last to complete */
+	if (unlikely(!nfs_local_pgio_done(iocb, ret, false)))
+		return;
+
 	nfs_local_pgio_aio_complete(iocb); /* Calls nfs_local_write_aio_complete_work */
 }
 
@@ -793,13 +821,17 @@ static void nfs_local_call_write(struct work_struct *work)
 	struct file *filp = iocb->kiocb.ki_filp;
 	unsigned long old_flags = current->flags;
 	const struct cred *save_cred;
+	bool force_done = false;
 	ssize_t status;
+	int n_iters;
 
 	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
 	save_cred = override_creds(filp->f_cred);
 
 	file_start_write(filp);
-	for (int i = 0; i < iocb->n_iters ; i++) {
+	n_iters = atomic_read(&iocb->n_iters);
+	for (int i = 0; i < n_iters ; i++) {
+		/* DIO-aligned middle is always issued last with AIO completion */
 		if (iocb->iter_is_dio_aligned[i]) {
 			iocb->kiocb.ki_flags |= IOCB_DIRECT;
 			iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
@@ -812,35 +844,27 @@ static void nfs_local_call_write(struct work_struct *work)
 			if (unlikely(status >= 0 && status < iocb->iters[i].count)) {
 				/* partial write */
 				if (i == iocb->end_iter_index) {
-					/* Must not account partial end, otherwise, due
-					 * to end being issued before middle: the partial
+					/* Must not account DIO partial end, otherwise (due
+					 * to end being issued before middle): the partial
 					 * write accounting in nfs_local_write_done()
 					 * would incorrectly advance hdr->args.offset
 					 */
 					status = 0;
 				} else {
-					/* Partial write at start or buffered middle,
-					 * exit early.
-					 */
-					nfs_local_pgio_done(iocb->hdr, status);
-					break;
+					/* Partial write at start or middle, force done */
+					force_done = true;
 				}
 			}
-			nfs_local_pgio_done(iocb->hdr, status);
-			if (iocb->hdr->task.tk_status)
+			if (nfs_local_pgio_done(iocb, status, force_done)) {
+				nfs_local_write_iocb_done(iocb);
 				break;
+			}
 		}
 	}
 	file_end_write(filp);
 
 	revert_creds(save_cred);
 	current->flags = old_flags;
-
-	if (status != -EIOCBQUEUED) {
-		nfs_local_write_done(iocb, status);
-		nfs_local_vfs_getattr(iocb);
-		nfs_local_pgio_release(iocb);
-	}
 }
 
 static int
-- 
2.44.0
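
The completion accounting described in this patch can be sketched in
userspace C. This is only an illustration under stated assumptions, not
the kernel code: struct demo_iocb, demo_iocb_init() and demo_pgio_done()
are hypothetical names, C11 atomic_int stands in for the kernel's
atomic_t, and the plain count/error fields are not protected against
concurrent updates here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct nfs_local_kiocb. */
struct demo_iocb {
	atomic_int n_iters;	/* IOs still outstanding (1 to 3) */
	long count;		/* bytes completed so far */
	long error;		/* most recent negative status, 0 if none */
};

static void demo_iocb_init(struct demo_iocb *iocb, int n_ios)
{
	atomic_init(&iocb->n_iters, n_ios);
	iocb->count = 0;
	iocb->error = 0;
}

/*
 * Record the result of one IO. Returns true when the caller must
 * finish the whole request: either this was the last outstanding IO,
 * or a short IO forces early completion.
 */
static bool demo_pgio_done(struct demo_iocb *iocb, long status, bool force)
{
	if (status >= 0)
		iocb->count += status;
	else
		iocb->error = status;

	if (force)
		return true;

	/* Mirrors the BUG_ON() in the patch: the count must never underflow. */
	assert(atomic_load(&iocb->n_iters) > 0);
	/* fetch_sub returns the previous value, so 1 means we were last. */
	return atomic_fetch_sub(&iocb->n_iters, 1) == 1;
}
```

Only the caller that sees true goes on to release the request, which is
the property that keeps the pgio header alive until every IO is done.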



* [v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO
  2025-10-20 18:24     ` Mike Snitzer
                         ` (2 preceding siblings ...)
  2025-10-27 13:08       ` [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header Mike Snitzer
@ 2025-10-27 13:08       ` Mike Snitzer
  2025-10-27 17:52       ` [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion Mike Snitzer
  4 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2025-10-27 13:08 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

A misaligned DIO read can be split into 3 IOs, so the potential for a
short read from each component IO must be handled. This follows the
same pattern used for handling partial writes, except that the upper
layer read code handles advancing the offset before retry.

Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/localio.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index a5f1eeeef30e..35e332627168 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -414,7 +414,7 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 	/* Setup misaligned end?
 	 * If so, the end is purposely setup to be issued using buffered IO
 	 * before the middle (which will use DIO, if DIO-aligned, with AIO).
-	 * This creates problems if/when the end results in a partial write.
+	 * This creates problems if/when the end results in short read or write.
 	 * So must save index and length of end to handle this corner case.
 	 */
 	if (local_dio->end_len) {
@@ -580,8 +580,9 @@ static void nfs_local_read_done(struct nfs_local_kiocb *iocb)
 	 */
 	hdr->res.replen = 0;
 
-	if (hdr->res.count != hdr->args.count ||
-	    hdr->args.offset + hdr->res.count >= i_size_read(file_inode(filp)))
+	/* nfs_readpage_result() handles short read */
+
+	if (hdr->args.offset + hdr->res.count >= i_size_read(file_inode(filp)))
 		hdr->res.eof = true;
 
 	dprintk("%s: read %ld bytes eof %d.\n", __func__,
@@ -620,6 +621,7 @@ static void nfs_local_call_read(struct work_struct *work)
 		container_of(work, struct nfs_local_kiocb, work);
 	struct file *filp = iocb->kiocb.ki_filp;
 	const struct cred *save_cred;
+	bool force_done = false;
 	ssize_t status;
 	int n_iters;
 
@@ -637,7 +639,21 @@ static void nfs_local_call_read(struct work_struct *work)
 		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->read_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
-			if (nfs_local_pgio_done(iocb, status, false)) {
+			if (unlikely(status >= 0 && status < iocb->iters[i].count)) {
+				/* partial read */
+				if (i == iocb->end_iter_index) {
+					/* Must not account DIO partial end, otherwise (due
+					 * to end being issued before middle): the partial
+					 * read accounting in nfs_local_read_done()
+					 * would incorrectly advance hdr->args.offset
+					 */
+					status = 0;
+				} else {
+					/* Partial read at start or middle, force done */
+					force_done = true;
+				}
+			}
+			if (nfs_local_pgio_done(iocb, status, force_done)) {
 				nfs_local_read_iocb_done(iocb);
 				break;
 			}
-- 
2.44.0
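
The short-IO rule in this patch (with the end segment still issued
before the middle) can be sketched as a small pure function. This is an
illustration only: demo_classify() and struct demo_result are
hypothetical names, and the real code applies the rule inline in
nfs_local_call_read().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result pair: the status to account and the force flag. */
struct demo_result {
	long status;
	bool force_done;
};

/*
 * Decide how to account the result of segment i.
 * A short result on the end segment is accounted as 0 bytes; a short
 * result on any other segment forces the request to finish early.
 */
static struct demo_result
demo_classify(long status, size_t requested, int i, int end_iter_index)
{
	struct demo_result r = { .status = status, .force_done = false };

	/* An IO never returns more bytes than were requested. */
	assert(status < 0 || (size_t)status <= requested);

	if (status >= 0 && (size_t)status < requested) {
		if (i == end_iter_index)
			r.status = 0;		/* do not account a partial end */
		else
			r.force_done = true;	/* partial start or middle */
	}
	return r;
}
```

Errors (negative status) and full-length results pass through unchanged.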



* Re: [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
  2025-10-27 13:08       ` [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header Mike Snitzer
@ 2025-10-27 13:19         ` Christoph Hellwig
  2025-10-27 13:55           ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-10-27 13:19 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Anna Schumaker, Trond Myklebust, linux-nfs

On Mon, Oct 27, 2025 at 09:08:32AM -0400, Mike Snitzer wrote:
> Improve completion handling of as many as 3 IOs associated with each
> misaligned DIO by using an atomic_t to track completion of each IO.
> 
> Update nfs_local_pgio_done() to use precise atomic_t accounting for
> remaining iov_iter (up to 3) associated with each iocb, so that each
> NFS LOCALIO pgio header is only released after all IOs have completed.
> But also allow early return if/when a short read or write occurs.

Maybe just split the pgio instead?  That's what a lot of the pnfs code
does.



* Re: [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
  2025-10-27 13:19         ` Christoph Hellwig
@ 2025-10-27 13:55           ` Mike Snitzer
  2025-10-27 14:45             ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2025-10-27 13:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Anna Schumaker, Trond Myklebust, linux-nfs

On Mon, Oct 27, 2025 at 06:19:16AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 09:08:32AM -0400, Mike Snitzer wrote:
> > Improve completion handling of as many as 3 IOs associated with each
> > misaligned DIO by using an atomic_t to track completion of each IO.
> > 
> > Update nfs_local_pgio_done() to use precise atomic_t accounting for
> > remaining iov_iter (up to 3) associated with each iocb, so that each
> > NFS LOCALIO pgio header is only released after all IOs have completed.
> > But also allow early return if/when a short read or write occurs.
> 
> Maybe just split the pgio instead?  That's what a lot of the pnfs code
> does. 

I already tried that, in terms of frontend fs/nfs/direct.c and then
supporting fs/nfs/pagelist.c changes; ended up being pretty nasty (and
overdone because in general the NFS client doesn't need to do this
extra work if it's not using LOCALIO).

We only need this misaligned DIO splitting for LOCALIO's benefit
because in general the NFS client is perfectly happy handling
misaligned DIO (and sending it out over the wire).


* Re: [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
  2025-10-27 13:55           ` Mike Snitzer
@ 2025-10-27 14:45             ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2025-10-27 14:45 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Anna Schumaker, Trond Myklebust, linux-nfs

On Mon, Oct 27, 2025 at 09:55:14AM -0400, Mike Snitzer wrote:
> I already tried that, in terms of frontend fs/nfs/direct.c and then
> supporting fs/nfs/pagelist.c changes; ended up being pretty nasty (and
> overdone because in general the NFS client doesn't need to do this
> extra work if it's not using LOCALIO).
> 
> We only need this misaligned DIO splitting for LOCALIO's benefit
> because in general the NFS client is perfectly happy handling
> misaligned DIO (and sending it out over the wire).

Ok.  Maybe stick that into a comment?


* [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
  2025-10-20 18:24     ` Mike Snitzer
                         ` (3 preceding siblings ...)
  2025-10-27 13:08       ` [v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO Mike Snitzer
@ 2025-10-27 17:52       ` Mike Snitzer
  2025-10-29 23:19         ` [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order Mike Snitzer
  4 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2025-10-27 17:52 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

LOCALIO's misaligned DIO WRITE support requires synchronous IO for any
misaligned head and/or tail that are issued using buffered IO.  In
addition, it is important that the O_DIRECT middle be on stable
storage upon its completion via AIO.

Otherwise, a misaligned DIO WRITE could mix buffered IO for the
head/tail and direct IO for the DIO-aligned middle -- which could lead
to problems associated with deferred writes to stable storage (such as
out of order partial completions causing incorrect advancement of the
file's offset, etc).

Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/localio.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 35e332627168..fdbbf5a3617b 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -485,8 +485,12 @@ nfs_local_iters_init(struct nfs_local_kiocb *iocb, int rw)
 		struct nfs_local_dio local_dio;
 
 		if (nfs_is_local_dio_possible(iocb, rw, len, &local_dio) &&
-		    nfs_local_iters_setup_dio(iocb, rw, v, len, &local_dio) != 0)
+		    nfs_local_iters_setup_dio(iocb, rw, v, len, &local_dio) != 0) {
+			/* Ensure DIO WRITE's IO on stable storage upon completion */
+			if (rw == ITER_SOURCE)
+				iocb->kiocb.ki_flags |= IOCB_DSYNC|IOCB_SYNC;
 			return; /* is DIO-aligned */
+		}
 	}
 
 	/* Use buffered IO */
-- 
2.44.0



* [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order
  2025-10-27 17:52       ` [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion Mike Snitzer
@ 2025-10-29 23:19         ` Mike Snitzer
  2025-10-31  1:50           ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2025-10-29 23:19 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/

On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> > LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> > middle (via AIO completion of that aligned middle). So out of order
> > relative to file offset.
>
> That's in general a really bad idea.  It will obviously work, but
> both on SSDs and out of place write file systems it is a sure way
> to increase your garbage collection overhead a lot down the line.

Fix this by never issuing misaligned DIO out-of-order. This fix means
the DIO-aligned segment will only use AIO completion if there is no
misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
will be issued without AIO completion to ensure file offset increases
properly for all partial READ or WRITE situations.

Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/localio.c | 83 +++++++++++++++++-------------------------------
 1 file changed, 29 insertions(+), 54 deletions(-)

Anna, apologies for stringing fixes together like this; and that this
same commit c817248fc831 has so many follow-on Fixes is not lost on
me.  But the full series of commit c817248fc831 fixes is composed of:

[v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
[v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
[v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO
[v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
[v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order

NOTE: PATCH 4/3's use of IOCB_DSYNC|IOCB_SYNC _is_ conservative, but I
will audit and adjust this further (informed by NFSD Direct's ongoing
evolution for handling this same situation) for the v6.19 merge window.

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index ca9df8d09c2d..018fa332aae4 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -40,7 +40,6 @@ struct nfs_local_kiocb {
 	void (*aio_complete_work)(struct work_struct *);
 	struct nfsd_file	*localio;
 	/* Begin mostly DIO-specific members */
-	size_t                  end_len;
 	short int		end_iter_index;
 	atomic_t		n_iters;
 	bool			iter_is_dio_aligned[NFSLOCAL_MAX_IOS];
@@ -411,27 +410,8 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 		++n_iters;
 	}
 
-	/* Setup misaligned end?
-	 * If so, the end is purposely setup to be issued using buffered IO
-	 * before the middle (which will use DIO, if DIO-aligned, with AIO).
-	 * This creates problems if/when the end results in short read or write.
-	 * So must save index and length of end to handle this corner case.
-	 */
-	if (local_dio->end_len) {
-		iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
-		iocb->offset[n_iters] = local_dio->end_offset;
-		iov_iter_advance(&iters[n_iters],
-			local_dio->start_len + local_dio->middle_len);
-		iocb->iter_is_dio_aligned[n_iters] = false;
-		/* Save index and length of end */
-		iocb->end_iter_index = n_iters;
-		iocb->end_len = local_dio->end_len;
-		atomic_inc(&iocb->n_iters);
-		++n_iters;
-	}
-
-	/* Setup DIO-aligned middle to be issued last, to allow for
-	 * DIO with AIO completion (see nfs_local_call_{read,write}).
+	/* Setup DIO-aligned middle, if there is no misaligned end (below)
+	 * then AIO completion is used, see nfs_local_call_{read,write}
 	 */
 	iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
 	if (local_dio->start_len)
@@ -448,8 +428,21 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 			iocb->hdr->args.offset, len, local_dio);
 		return 0; /* no DIO-aligned IO possible */
 	}
+	iocb->end_iter_index = n_iters;
 	++n_iters;
 
+	/* Setup misaligned end? */
+	if (local_dio->end_len) {
+		iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
+		iocb->offset[n_iters] = local_dio->end_offset;
+		iov_iter_advance(&iters[n_iters],
+			local_dio->start_len + local_dio->middle_len);
+		iocb->iter_is_dio_aligned[n_iters] = false;
+		atomic_inc(&iocb->n_iters);
+		iocb->end_iter_index = n_iters;
+		++n_iters;
+	}
+
 	return n_iters;
 }
 
@@ -636,27 +629,18 @@ static void nfs_local_call_read(struct work_struct *work)
 		/* DIO-aligned middle is always issued last with AIO completion */
 		if (iocb->iter_is_dio_aligned[i]) {
 			iocb->kiocb.ki_flags |= IOCB_DIRECT;
-			iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
-			iocb->aio_complete_work = nfs_local_read_aio_complete_work;
+			/* Only use AIO completion if DIO-aligned segment is last */
+			if (i == iocb->end_iter_index) {
+				iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
+				iocb->aio_complete_work = nfs_local_read_aio_complete_work;
+			}
 		}
 
 		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->read_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
-			if (unlikely(status >= 0 && status < iocb->iters[i].count)) {
-				/* partial read */
-				if (i == iocb->end_iter_index) {
-					/* Must not account DIO partial end, otherwise (due
-					 * to end being issued before middle): the partial
-					 * read accounting in nfs_local_read_done()
-					 * would incorrectly advance hdr->args.offset
-					 */
-					status = 0;
-				} else {
-					/* Partial read at start or middle, force done */
-					force_done = true;
-				}
-			}
+			if (unlikely(status >= 0 && status < iocb->iters[i].count))
+				force_done = true; /* Partial read */
 			if (nfs_local_pgio_done(iocb, status, force_done)) {
 				nfs_local_read_iocb_done(iocb);
 				break;
@@ -854,27 +838,18 @@ static void nfs_local_call_write(struct work_struct *work)
 		/* DIO-aligned middle is always issued last with AIO completion */
 		if (iocb->iter_is_dio_aligned[i]) {
 			iocb->kiocb.ki_flags |= IOCB_DIRECT;
-			iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
-			iocb->aio_complete_work = nfs_local_write_aio_complete_work;
+			/* Only use AIO completion if DIO-aligned segment is last */
+			if (i == iocb->end_iter_index) {
+				iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
+				iocb->aio_complete_work = nfs_local_write_aio_complete_work;
+			}
 		}
 
 		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->write_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
-			if (unlikely(status >= 0 && status < iocb->iters[i].count)) {
-				/* partial write */
-				if (i == iocb->end_iter_index) {
-					/* Must not account DIO partial end, otherwise (due
-					 * to end being issued before middle): the partial
-					 * write accounting in nfs_local_write_done()
-					 * would incorrectly advance hdr->args.offset
-					 */
-					status = 0;
-				} else {
-					/* Partial write at start or middle, force done */
-					force_done = true;
-				}
-			}
+			if (unlikely(status >= 0 && status < iocb->iters[i].count))
+				force_done = true; /* Partial write */
 			if (nfs_local_pgio_done(iocb, status, force_done)) {
 				nfs_local_write_iocb_done(iocb);
 				break;
-- 
2.44.0
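
The ordering rule this patch describes (segments issued in ascending
file offset, with AIO completion only when the DIO-aligned segment is
the last one) can be sketched as follows. This is a simplified
illustration: demo_split() and struct demo_seg are hypothetical, only
file-offset alignment is considered, and the alignment is assumed to be
a power of two. The real code also has to consider memory alignment of
the buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_IOS 3

/* Hypothetical description of one IO segment. */
struct demo_seg {
	uint64_t offset;
	uint64_t len;
	bool dio_aligned;	/* issue with O_DIRECT */
	bool use_aio;		/* complete asynchronously */
};

/*
 * Split [offset, offset + len) at multiples of align.
 * Fills segs[] in ascending offset order and returns the segment
 * count, or 0 if the range contains no aligned middle (the caller
 * would then fall back to a single buffered IO).
 */
static int demo_split(uint64_t offset, uint64_t len, uint64_t align,
		      struct demo_seg segs[DEMO_MAX_IOS])
{
	uint64_t end = offset + len;
	uint64_t mid_start, mid_end;
	int n = 0;

	/* align must be a non-zero power of two */
	assert(align != 0 && (align & (align - 1)) == 0);

	mid_start = (offset + align - 1) & ~(align - 1);	/* round up */
	mid_end = end & ~(align - 1);				/* round down */
	if (mid_start >= mid_end)
		return 0;

	if (mid_start > offset)		/* misaligned start, buffered */
		segs[n++] = (struct demo_seg){ offset, mid_start - offset,
					       false, false };

	/* DIO-aligned middle */
	segs[n++] = (struct demo_seg){ mid_start, mid_end - mid_start,
				       true, false };

	if (end > mid_end)		/* misaligned end, buffered */
		segs[n++] = (struct demo_seg){ mid_end, end - mid_end,
					       false, false };
	else				/* aligned segment is last: AIO is safe */
		segs[n - 1].use_aio = true;

	return n;
}
```

For example, with a 4096-byte alignment a request at offset 100 of
length 10000 yields three segments (100..4096, 4096..8192, 8192..10100),
none of which uses AIO completion because a misaligned end follows the
middle.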



* Re: [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order
  2025-10-29 23:19         ` [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order Mike Snitzer
@ 2025-10-31  1:50           ` Mike Snitzer
  2025-10-31 13:33             ` Anna Schumaker
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2025-10-31  1:50 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

On Wed, Oct 29, 2025 at 07:19:30PM -0400, Mike Snitzer wrote:
> From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/
> 
> On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> > > LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> > > middle (via AIO completion of that aligned middle). So out of order
> > > relative to file offset.
> >
> > That's in general a really bad idea.  It will obviously work, but
> > both on SSDs and out of place write file systems it is a sure way
> > to increase your garbage collection overhead a lot down the line.
> 
> Fix this by never issuing misaligned DIO out-of-order. This fix means
> the DIO-aligned segment will only use AIO completion if there is no
> misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
> will be issued without AIO completion to ensure file offset increases
> properly for all partial READ or WRITE situations.
> 
> Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfs/localio.c | 83 +++++++++++++++++-------------------------------
>  1 file changed, 29 insertions(+), 54 deletions(-)
> 
> Anna, apologies for stringing fixes together like this; and that this
> same commit c817248fc831 has so many follow-on Fixes is not lost on
> me.  But the full series of commit c817248fc831 fixes is composed of:
> 
> [v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
> [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
> [v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO
> [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
> [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order
> 
> NOTE: PATCH 4/3's use of IOCB_DSYNC|IOCB_SYNC _is_ conservative, but I
> will audit and adjust this further (informed by NFSD Direct's ongoing
> evolution for handling this same situation) for the v6.19 merge window.

Hi Anna,

Please don't pick up this PATCH 5/3, further testing shows there is
something wrong with it.  I'll circle back once I fix it.  But this
5/3 patch doesn't impact the other 4.

Thanks,
Mike


* Re: [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order
  2025-10-31  1:50           ` Mike Snitzer
@ 2025-10-31 13:33             ` Anna Schumaker
  2025-11-04 18:02               ` [v6.18-rcX PATCH v2] " Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Anna Schumaker @ 2025-10-31 13:33 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Trond Myklebust, linux-nfs

Hi Mike,

On 10/30/25 9:50 PM, Mike Snitzer wrote:
> On Wed, Oct 29, 2025 at 07:19:30PM -0400, Mike Snitzer wrote:
>> From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/
>>
>> On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
>>> On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
>>>> LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
>>>> middle (via AIO completion of that aligned middle). So out of order
>>>> relative to file offset.
>>>
>>> That's in general a really bad idea.  It will obviously work, but
>>> both on SSDs and out of place write file systems it is a sure way
>>> to increase your garbage collection overhead a lot down the line.
>>
>> Fix this by never issuing misaligned DIO out-of-order. This fix means
>> the DIO-aligned segment will only use AIO completion if there is no
>> misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
>> will be issued without AIO completion to ensure file offset increases
>> properly for all partial READ or WRITE situations.
>>
>> Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>> ---
>>  fs/nfs/localio.c | 83 +++++++++++++++++-------------------------------
>>  1 file changed, 29 insertions(+), 54 deletions(-)
>>
>> Anna, apologies for stringing fixes together like this; and that this
>> same commit c817248fc831 has so many follow-on Fixes is not lost on
>> me.  But the full series of commit c817248fc831 fixes is composed of:
>>
>> [v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
>> [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
>> [v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO
>> [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
>> [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order
>>
>> NOTE: PATCH 4/3's use of IOCBD_DSYNC|IOCB_SYNC _is_ conservative, but I
>> will audit and adjust this further (informed by NFSD Direct's ongoing
>> evolution for handling this same situaiton) for the v6.19 merge window.
> 
> Hi Anna,
> 
> Please don't pick up this PATCH 5/3, further testing shows there is
> something wrong with it.  I'll circle back once I fix it.  But this
> 5/3 patch doesn't impact the other 4.

Thanks for the update! I've already looked at the first 4 patches, but
hadn't had a chance to look at 5/3 yet. I'll skip it for now until I
hear otherwise from you!

Anna

> 
> Thanks,
> Mike
> 



* [v6.18-rcX PATCH v2] nfs/localio: do not issue misaligned DIO out-of-order
  2025-10-31 13:33             ` Anna Schumaker
@ 2025-11-04 18:02               ` Mike Snitzer
  2025-11-06  2:50                 ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2025-11-04 18:02 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

[Hi Anna, here is a fixed v2 of patch 5/3]

On Fri, Oct 31, 2025 at 09:33:40AM -0400, Anna Schumaker wrote:
> Hi Mike,
> 
> On 10/30/25 9:50 PM, Mike Snitzer wrote:
> > On Wed, Oct 29, 2025 at 07:19:30PM -0400, Mike Snitzer wrote:
> >> From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/
> >>
> >> On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> >>> On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> >>>> LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> >>>> middle (via AIO completion of that aligned middle). So out of order
> >>>> relative to file offset.
> >>>
> >>> That's in general a really bad idea.  It will obviously work, but
> >>> both on SSDs and out of place write file systems it is a sure way
> >>> to increase your garbage collection overhead a lot down the line.
> >>
> >> Fix this by never issuing misaligned DIO out-of-order. This fix means
> >> the DIO-aligned segment will only use AIO completion if there is no
> >> misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
> >> will be issued without AIO completion to ensure file offset increases
> >> properly for all partial READ or WRITE situations.
> >>
> >> Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
> >> Reported-by: Christoph Hellwig <hch@lst.de>
> >> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> >> ---
> >>  fs/nfs/localio.c | 83 +++++++++++++++++-------------------------------
> >>  1 file changed, 29 insertions(+), 54 deletions(-)
> >>
> >> Anna, apologies for stringing fixes together like this; and that this
> >> same commit c817248fc831 has so many follow-on Fixes is not lost on
> >> me.  But the full series of commit c817248fc831 fixes is composed of:
> >>
> >> [v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
> >> [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
> >> [v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO
> >> [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
> >> [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order
> >>
> >> NOTE: PATCH 4/3's use of IOCBD_DSYNC|IOCB_SYNC _is_ conservative, but I
> >> will audit and adjust this further (informed by NFSD Direct's ongoing
> >> evolution for handling this same situaiton) for the v6.19 merge window.
> > 
> > Hi Anna,
> > 
> > Please don't pick up this PATCH 5/3, further testing shows there is
> > something wrong with it.  I'll circle back once I fix it.  But this
> > 5/3 patch doesn't impact the other 4.
> 
> Thanks for the update! I've already looked at the first 4 patches, but
> hadn't had a chance too look at 5/3 yet. I'll skip it for now until I
> hear otherwise from you!

From: Mike Snitzer <snitzer@kernel.org>
Date: Wed, 29 Oct 2025 17:41:02 -0400
Subject: [v6.18-rcX PATCH v2] nfs/localio: do not issue misaligned DIO out-of-order

From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/

On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> > LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> > middle (via AIO completion of that aligned middle). So out of order
> > relative to file offset.
>
> That's in general a really bad idea.  It will obviously work, but
> both on SSDs and out of place write file systems it is a sure way
> to increase your garbage collection overhead a lot down the line.

Fix this by never issuing misaligned DIO out of order. This fix means
the DIO-aligned middle will only use AIO completion if there is no
misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
will be issued without AIO completion to ensure file offset increases
properly for all partial READ or WRITE situations.

Factoring out nfs_local_iter_setup() helps standardize repetitive
nfs_local_iters_setup_dio() code and is inspired by cleanup work that
Chuck Lever did on the NFSD Direct code.

Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/localio.c | 125 +++++++++++++++++++----------------------------
 1 file changed, 51 insertions(+), 74 deletions(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 24b0c7d62458..985242780abb 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -44,8 +44,7 @@ struct nfs_local_kiocb {
 	short int		end_iter_index;
 	atomic_t		n_iters;
 	bool			iter_is_dio_aligned[NFSLOCAL_MAX_IOS];
-	loff_t                  offset[NFSLOCAL_MAX_IOS] ____cacheline_aligned;
-	struct iov_iter		iters[NFSLOCAL_MAX_IOS];
+	struct iov_iter		iters[NFSLOCAL_MAX_IOS] ____cacheline_aligned;
 	/* End mostly DIO-specific members */
 };
 
@@ -314,6 +313,7 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
 	init_sync_kiocb(&iocb->kiocb, file);
 
 	iocb->hdr = hdr;
+	iocb->kiocb.ki_pos = hdr->args.offset;
 	iocb->kiocb.ki_flags &= ~IOCB_APPEND;
 	iocb->kiocb.ki_complete = NULL;
 	iocb->aio_complete_work = NULL;
@@ -387,13 +387,24 @@ static bool nfs_iov_iter_aligned_bvec(const struct iov_iter *i,
 	return true;
 }
 
+static void
+nfs_local_iter_setup(struct iov_iter *iter, int rw, struct bio_vec *bvec,
+		     unsigned int nvecs, unsigned long total,
+		     size_t start, size_t len)
+{
+	iov_iter_bvec(iter, rw, bvec, nvecs, total);
+	if (start)
+		iov_iter_advance(iter, start);
+	iov_iter_truncate(iter, len);
+}
+
 /*
  * Setup as many as 3 iov_iter based on extents described by @local_dio.
  * Returns the number of iov_iter that were setup.
  */
 static int
 nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
-			  unsigned int nvecs, size_t len,
+			  unsigned int nvecs, unsigned long total,
 			  struct nfs_local_dio *local_dio)
 {
 	int n_iters = 0;
@@ -401,41 +412,18 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 
 	/* Setup misaligned start? */
 	if (local_dio->start_len) {
-		iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
-		iters[n_iters].count = local_dio->start_len;
-		iocb->offset[n_iters] = iocb->hdr->args.offset;
-		iocb->iter_is_dio_aligned[n_iters] = false;
+		nfs_local_iter_setup(&iters[n_iters], rw, iocb->bvec,
+				     nvecs, total, 0, local_dio->start_len);
 		atomic_inc(&iocb->n_iters);
 		++n_iters;
 	}
 
-	/* Setup misaligned end?
-	 * If so, the end is purposely setup to be issued using buffered IO
-	 * before the middle (which will use DIO, if DIO-aligned, with AIO).
-	 * This creates problems if/when the end results in short read or write.
-	 * So must save index and length of end to handle this corner case.
+	/*
+	 * Setup DIO-aligned middle, if there is no misaligned end (below)
+	 * then AIO completion is used, see nfs_local_call_{read,write}
 	 */
-	if (local_dio->end_len) {
-		iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
-		iocb->offset[n_iters] = local_dio->end_offset;
-		iov_iter_advance(&iters[n_iters],
-			local_dio->start_len + local_dio->middle_len);
-		iocb->iter_is_dio_aligned[n_iters] = false;
-		/* Save index and length of end */
-		iocb->end_iter_index = n_iters;
-		iocb->end_len = local_dio->end_len;
-		atomic_inc(&iocb->n_iters);
-		++n_iters;
-	}
-
-	/* Setup DIO-aligned middle to be issued last, to allow for
-	 * DIO with AIO completion (see nfs_local_call_{read,write}).
-	 */
-	iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
-	if (local_dio->start_len)
-		iov_iter_advance(&iters[n_iters], local_dio->start_len);
-	iters[n_iters].count -= local_dio->end_len;
-	iocb->offset[n_iters] = local_dio->middle_offset;
+	nfs_local_iter_setup(&iters[n_iters], rw, iocb->bvec, nvecs,
+			     total, local_dio->start_len, local_dio->middle_len);
 
 	iocb->iter_is_dio_aligned[n_iters] =
 		nfs_iov_iter_aligned_bvec(&iters[n_iters],
@@ -443,11 +431,22 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 
 	if (unlikely(!iocb->iter_is_dio_aligned[n_iters])) {
 		trace_nfs_local_dio_misaligned(iocb->hdr->inode,
-			iocb->hdr->args.offset, len, local_dio);
+			local_dio->start_len, local_dio->middle_len, local_dio);
 		return 0; /* no DIO-aligned IO possible */
 	}
+	iocb->end_iter_index = n_iters;
 	++n_iters;
 
+	/* Setup misaligned end? */
+	if (local_dio->end_len) {
+		nfs_local_iter_setup(&iters[n_iters], rw, iocb->bvec,
+				     nvecs, total, local_dio->start_len +
+				     local_dio->middle_len, local_dio->end_len);
+		atomic_inc(&iocb->n_iters);
+		iocb->end_iter_index = n_iters;
+		++n_iters;
+	}
+
 	return n_iters;
 }
 
@@ -492,9 +491,7 @@ nfs_local_iters_init(struct nfs_local_kiocb *iocb, int rw)
 	}
 
 	/* Use buffered IO */
-	iocb->offset[0] = hdr->args.offset;
 	iov_iter_bvec(&iocb->iters[0], rw, iocb->bvec, v, len);
-	iocb->iter_is_dio_aligned[0] = false;
 }
 
 static void
@@ -631,30 +628,20 @@ static void nfs_local_call_read(struct work_struct *work)
 
 	n_iters = atomic_read(&iocb->n_iters);
 	for (int i = 0; i < n_iters ; i++) {
-		/* DIO-aligned middle is always issued last with AIO completion */
 		if (iocb->iter_is_dio_aligned[i]) {
 			iocb->kiocb.ki_flags |= IOCB_DIRECT;
-			iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
-			iocb->aio_complete_work = nfs_local_read_aio_complete_work;
-		}
+			/* Only use AIO completion if DIO-aligned segment is last */
+			if (i == iocb->end_iter_index) {
+				iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
+				iocb->aio_complete_work = nfs_local_read_aio_complete_work;
+			}
+		} else
+			iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
 
-		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->read_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
-			if (unlikely(status >= 0 && status < iocb->iters[i].count)) {
-				/* partial read */
-				if (i == iocb->end_iter_index) {
-					/* Must not account DIO partial end, otherwise (due
-					 * to end being issued before middle): the partial
-					 * read accounting in nfs_local_read_done()
-					 * would incorrectly advance hdr->args.offset
-					 */
-					status = 0;
-				} else {
-					/* Partial read at start or middle, force done */
-					force_done = true;
-				}
-			}
+			if (unlikely(status >= 0 && status < iocb->iters[i].count))
+				force_done = true; /* Partial read */
 			if (nfs_local_pgio_done(iocb, status, force_done)) {
 				nfs_local_read_iocb_done(iocb);
 				break;
@@ -849,30 +836,20 @@ static void nfs_local_call_write(struct work_struct *work)
 	file_start_write(filp);
 	n_iters = atomic_read(&iocb->n_iters);
 	for (int i = 0; i < n_iters ; i++) {
-		/* DIO-aligned middle is always issued last with AIO completion */
 		if (iocb->iter_is_dio_aligned[i]) {
 			iocb->kiocb.ki_flags |= IOCB_DIRECT;
-			iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
-			iocb->aio_complete_work = nfs_local_write_aio_complete_work;
-		}
+			/* Only use AIO completion if DIO-aligned segment is last */
+			if (i == iocb->end_iter_index) {
+				iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
+				iocb->aio_complete_work = nfs_local_write_aio_complete_work;
+			}
+		} else
+			iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
 
-		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->write_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
-			if (unlikely(status >= 0 && status < iocb->iters[i].count)) {
-				/* partial write */
-				if (i == iocb->end_iter_index) {
-					/* Must not account DIO partial end, otherwise (due
-					 * to end being issued before middle): the partial
-					 * write accounting in nfs_local_write_done()
-					 * would incorrectly advance hdr->args.offset
-					 */
-					status = 0;
-				} else {
-					/* Partial write at start or middle, force done */
-					force_done = true;
-				}
-			}
+			if (unlikely(status >= 0 && status < iocb->iters[i].count))
+				force_done = true; /* Partial write */
 			if (nfs_local_pgio_done(iocb, status, force_done)) {
 				nfs_local_write_iocb_done(iocb);
 				break;
-- 
2.44.0
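
The ordering rule in the commit message above can be modelled in userspace. The sketch below is only an illustration: `struct seg`, `plan_segments()` and the single `align` parameter are hypothetical names that do not exist in fs/nfs/localio.c, and it ignores the memory-alignment check that nfs_iov_iter_aligned_bvec() performs in the real code. It splits a request into head, DIO-aligned middle and tail in ascending offset order, and allows AIO completion only when the aligned middle is the last segment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one issued segment; not a kernel structure. */
struct seg {
	size_t start;	/* byte offset into the request buffer */
	size_t len;	/* segment length in bytes */
	bool dio;	/* would be issued with IOCB_DIRECT */
	bool aio;	/* may complete asynchronously */
};

/*
 * Split [offset, offset + count) around @align and emit the segments in
 * ascending file-offset order. Returns the number of segments (1..3).
 * Assumes align > 0 and count > 0. If no aligned middle exists, the whole
 * request becomes one buffered segment.
 */
static int plan_segments(unsigned long long offset, size_t count,
			 size_t align, struct seg out[3])
{
	size_t head = (size_t)((align - offset % align) % align);
	size_t middle, tail;
	int n = 0;

	if (head >= count || count - head < align) {
		out[0] = (struct seg){ 0, count, false, false };
		return 1;
	}

	middle = (count - head) / align * align;
	tail = count - head - middle;

	if (head)
		out[n++] = (struct seg){ 0, head, false, false };
	/* AIO completion only if nothing has to be issued after the middle */
	out[n++] = (struct seg){ head, middle, true, tail == 0 };
	if (tail)
		out[n++] = (struct seg){ head + middle, tail, false, false };
	return n;
}
```

With a misaligned tail the model marks every segment as synchronous, which mirrors the trade-off described above: strict ordering is kept at the cost of AIO for that request.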




* Re: [v6.18-rcX PATCH v2] nfs/localio: do not issue misaligned DIO out-of-order
  2025-11-04 18:02               ` [v6.18-rcX PATCH v2] " Mike Snitzer
@ 2025-11-06  2:50                 ` Mike Snitzer
  2025-11-06  3:03                   ` [v6.18-rcX PATCH v3 5/3] " Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2025-11-06  2:50 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

On Tue, Nov 04, 2025 at 01:02:42PM -0500, Mike Snitzer wrote:
> [Hi Anna, here is a fixed v2 of patch 5/3]
> 
> On Fri, Oct 31, 2025 at 09:33:40AM -0400, Anna Schumaker wrote:
> > Hi Mike,
> > 
> > On 10/30/25 9:50 PM, Mike Snitzer wrote:
> > > On Wed, Oct 29, 2025 at 07:19:30PM -0400, Mike Snitzer wrote:
> > >> From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/
> > >>
> > >> On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> > >>> On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> > >>>> LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> > >>>> middle (via AIO completion of that aligned middle). So out of order
> > >>>> relative to file offset.
> > >>>
> > >>> That's in general a really bad idea.  It will obviously work, but
> > >>> both on SSDs and out of place write file systems it is a sure way
> > >>> to increase your garbage collection overhead a lot down the line.
> > >>
> > >> Fix this by never issuing misaligned DIO out-of-order. This fix means
> > >> the DIO-aligned segment will only use AIO completion if there is no
> > >> misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
> > >> will be issued without AIO completion to ensure file offset increases
> > >> properly for all partial READ or WRITE situations.
> > >>
> > >> Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
> > >> Reported-by: Christoph Hellwig <hch@lst.de>
> > >> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > >> ---
> > >>  fs/nfs/localio.c | 83 +++++++++++++++++-------------------------------
> > >>  1 file changed, 29 insertions(+), 54 deletions(-)
> > >>
> > >> Anna, apologies for stringing fixes together like this; and that this
> > >> same commit c817248fc831 has so many follow-on Fixes is not lost on
> > >> me.  But the full series of commit c817248fc831 fixes is composed of:
> > >>
> > >> [v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
> > >> [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
> > >> [v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO
> > >> [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
> > >> [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order
> > >>
> > >> NOTE: PATCH 4/3's use of IOCBD_DSYNC|IOCB_SYNC _is_ conservative, but I
> > >> will audit and adjust this further (informed by NFSD Direct's ongoing
> > >> evolution for handling this same situaiton) for the v6.19 merge window.
> > > 
> > > Hi Anna,
> > > 
> > > Please don't pick up this PATCH 5/3, further testing shows there is
> > > something wrong with it.  I'll circle back once I fix it.  But this
> > > 5/3 patch doesn't impact the other 4.
> > 
> > Thanks for the update! I've already looked at the first 4 patches, but
> > hadn't had a chance too look at 5/3 yet. I'll skip it for now until I
> > hear otherwise from you!
> 
> From: Mike Snitzer <snitzer@kernel.org>
> Date: Wed, 29 Oct 2025 17:41:02 -0400
> Subject: [v6.18-rcX PATCH v2] nfs/localio: do not issue misaligned DIO out-of-order
> 
> From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/
> 
> On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> > > LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> > > middle (via AIO completion of that aligned middle). So out of order
> > > relative to file offset.
> >
> > That's in general a really bad idea.  It will obviously work, but
> > both on SSDs and out of place write file systems it is a sure way
> > to increase your garbage collection overhead a lot down the line.
> 
> Fix this by never issuing misaligned DIO out of order. This fix means
> the DIO-aligned middle will only use AIO completion if there is no
> misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
> will be issued without AIO completion to ensure file offset increases
> properly for all partial READ or WRITE situations.
> 
> Factoring out nfs_local_iter_setup() helps standardize repetitive
> nfs_local_iters_setup_dio() code and is inspired by cleanup work that
> Chuck Lever did on the NFSD Direct code.
> 
> Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfs/localio.c | 125 +++++++++++++++++++----------------------------
>  1 file changed, 51 insertions(+), 74 deletions(-)
> 


Hi Anna,

I found that this v2 of patch 5/3 had a bug when falling back from DIO
to buffered due to misalignment.  Here is the incremental fix (I'll
also reply with v3 of 5/3 with this fix folded in):

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 985242780abb..5aa903b2b836 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -414,7 +414,6 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 	if (local_dio->start_len) {
 		nfs_local_iter_setup(&iters[n_iters], rw, iocb->bvec,
 				     nvecs, total, 0, local_dio->start_len);
-		atomic_inc(&iocb->n_iters);
 		++n_iters;
 	}
 
@@ -442,11 +441,11 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 		nfs_local_iter_setup(&iters[n_iters], rw, iocb->bvec,
 				     nvecs, total, local_dio->start_len +
 				     local_dio->middle_len, local_dio->end_len);
-		atomic_inc(&iocb->n_iters);
 		iocb->end_iter_index = n_iters;
 		++n_iters;
 	}
 
+	atomic_set(&iocb->n_iters, n_iters);
 	return n_iters;
 }
 
@@ -473,7 +472,7 @@ nfs_local_iters_init(struct nfs_local_kiocb *iocb, int rw)
 	len = hdr->args.count - total;
 
 	/*
-	 * For each iocb, iocb->n_iter is always at least 1 and we always
+	 * For each iocb, iocb->n_iters is always at least 1 and we always
 	 * end io after first nfs_local_pgio_done call unless misaligned DIO.
 	 */
 	atomic_set(&iocb->n_iters, 1);
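
For readers following along, the accounting problem this incremental diff addresses can be shown with a small userspace model. Everything here is a hypothetical stand-in (`struct model_iocb`, the two `setup_*` functions and `model_iters_init()` are not kernel code); only the counting pattern is taken from the diff. Counting as segments are set up leaves the counter too high when setup later bails out, while publishing the count once on success does not.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the iocb; only the counter is modelled. */
struct model_iocb {
	int n_iters;
};

typedef int (*setup_fn)(struct model_iocb *, bool, bool, bool);

/* v2-style: bump the shared counter while segments are being set up. */
static int setup_count_as_you_go(struct model_iocb *iocb, bool has_head,
				 bool middle_aligned, bool has_tail)
{
	int n = 0;

	if (has_head) {
		iocb->n_iters++;
		n++;
	}
	if (!middle_aligned)
		return 0;	/* bail out, but the counter was already bumped */
	n++;
	if (has_tail) {
		iocb->n_iters++;
		n++;
	}
	return n;
}

/* v3-style: count locally, publish once when setup succeeded. */
static int setup_publish_once(struct model_iocb *iocb, bool has_head,
			      bool middle_aligned, bool has_tail)
{
	int n = 0;

	if (has_head)
		n++;
	if (!middle_aligned)
		return 0;	/* counter untouched, buffered default stands */
	n++;
	if (has_tail)
		n++;
	iocb->n_iters = n;
	return n;
}

/* Caller: defaults to one buffered iterator, then tries the DIO setup. */
static int model_iters_init(struct model_iocb *iocb, setup_fn setup,
			    bool has_head, bool middle_aligned, bool has_tail)
{
	iocb->n_iters = 1;
	setup(iocb, has_head, middle_aligned, has_tail);
	return iocb->n_iters;
}
```

In this model a request with a misaligned head and a misaligned middle ends up with a count of 2 under the first scheme even though only one buffered iterator will be issued, and with 1 under the second.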


* [v6.18-rcX PATCH v3 5/3] nfs/localio: do not issue misaligned DIO out-of-order
  2025-11-06  2:50                 ` Mike Snitzer
@ 2025-11-06  3:03                   ` Mike Snitzer
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2025-11-06  3:03 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, linux-nfs

From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/

On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> > LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> > middle (via AIO completion of that aligned middle). So out of order
> > relative to file offset.
>
> That's in general a really bad idea.  It will obviously work, but
> both on SSDs and out of place write file systems it is a sure way
> to increase your garbage collection overhead a lot down the line.

Fix this by never issuing misaligned DIO out of order. This fix means
the DIO-aligned middle will only use AIO completion if there is no
misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
will be issued without AIO completion to ensure file offset increases
properly for all partial READ or WRITE situations.

Factoring out nfs_local_iter_setup() helps standardize repetitive
nfs_local_iters_setup_dio() code and is inspired by cleanup work that
Chuck Lever did on the NFSD Direct code.

Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/localio.c | 128 +++++++++++++++++++----------------------------
 1 file changed, 52 insertions(+), 76 deletions(-)

v3: fix accounting bug where iocb->n_iters is too high if DIO has a
    misaligned head _and_ the middle also isn't DIO-aligned so we
    fall back to buffered. Also fix a typo in a related comment.
    (and again, sorry for all the churn while arriving at this fix)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 24b0c7d62458..5aa903b2b836 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -44,8 +44,7 @@ struct nfs_local_kiocb {
 	short int		end_iter_index;
 	atomic_t		n_iters;
 	bool			iter_is_dio_aligned[NFSLOCAL_MAX_IOS];
-	loff_t                  offset[NFSLOCAL_MAX_IOS] ____cacheline_aligned;
-	struct iov_iter		iters[NFSLOCAL_MAX_IOS];
+	struct iov_iter		iters[NFSLOCAL_MAX_IOS] ____cacheline_aligned;
 	/* End mostly DIO-specific members */
 };
 
@@ -314,6 +313,7 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
 	init_sync_kiocb(&iocb->kiocb, file);
 
 	iocb->hdr = hdr;
+	iocb->kiocb.ki_pos = hdr->args.offset;
 	iocb->kiocb.ki_flags &= ~IOCB_APPEND;
 	iocb->kiocb.ki_complete = NULL;
 	iocb->aio_complete_work = NULL;
@@ -387,13 +387,24 @@ static bool nfs_iov_iter_aligned_bvec(const struct iov_iter *i,
 	return true;
 }
 
+static void
+nfs_local_iter_setup(struct iov_iter *iter, int rw, struct bio_vec *bvec,
+		     unsigned int nvecs, unsigned long total,
+		     size_t start, size_t len)
+{
+	iov_iter_bvec(iter, rw, bvec, nvecs, total);
+	if (start)
+		iov_iter_advance(iter, start);
+	iov_iter_truncate(iter, len);
+}
+
 /*
  * Setup as many as 3 iov_iter based on extents described by @local_dio.
  * Returns the number of iov_iter that were setup.
  */
 static int
 nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
-			  unsigned int nvecs, size_t len,
+			  unsigned int nvecs, unsigned long total,
 			  struct nfs_local_dio *local_dio)
 {
 	int n_iters = 0;
@@ -401,41 +412,17 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 
 	/* Setup misaligned start? */
 	if (local_dio->start_len) {
-		iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
-		iters[n_iters].count = local_dio->start_len;
-		iocb->offset[n_iters] = iocb->hdr->args.offset;
-		iocb->iter_is_dio_aligned[n_iters] = false;
-		atomic_inc(&iocb->n_iters);
+		nfs_local_iter_setup(&iters[n_iters], rw, iocb->bvec,
+				     nvecs, total, 0, local_dio->start_len);
 		++n_iters;
 	}
 
-	/* Setup misaligned end?
-	 * If so, the end is purposely setup to be issued using buffered IO
-	 * before the middle (which will use DIO, if DIO-aligned, with AIO).
-	 * This creates problems if/when the end results in short read or write.
-	 * So must save index and length of end to handle this corner case.
+	/*
+	 * Setup DIO-aligned middle, if there is no misaligned end (below)
+	 * then AIO completion is used, see nfs_local_call_{read,write}
 	 */
-	if (local_dio->end_len) {
-		iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
-		iocb->offset[n_iters] = local_dio->end_offset;
-		iov_iter_advance(&iters[n_iters],
-			local_dio->start_len + local_dio->middle_len);
-		iocb->iter_is_dio_aligned[n_iters] = false;
-		/* Save index and length of end */
-		iocb->end_iter_index = n_iters;
-		iocb->end_len = local_dio->end_len;
-		atomic_inc(&iocb->n_iters);
-		++n_iters;
-	}
-
-	/* Setup DIO-aligned middle to be issued last, to allow for
-	 * DIO with AIO completion (see nfs_local_call_{read,write}).
-	 */
-	iov_iter_bvec(&iters[n_iters], rw, iocb->bvec, nvecs, len);
-	if (local_dio->start_len)
-		iov_iter_advance(&iters[n_iters], local_dio->start_len);
-	iters[n_iters].count -= local_dio->end_len;
-	iocb->offset[n_iters] = local_dio->middle_offset;
+	nfs_local_iter_setup(&iters[n_iters], rw, iocb->bvec, nvecs,
+			     total, local_dio->start_len, local_dio->middle_len);
 
 	iocb->iter_is_dio_aligned[n_iters] =
 		nfs_iov_iter_aligned_bvec(&iters[n_iters],
@@ -443,11 +430,22 @@ nfs_local_iters_setup_dio(struct nfs_local_kiocb *iocb, int rw,
 
 	if (unlikely(!iocb->iter_is_dio_aligned[n_iters])) {
 		trace_nfs_local_dio_misaligned(iocb->hdr->inode,
-			iocb->hdr->args.offset, len, local_dio);
+			local_dio->start_len, local_dio->middle_len, local_dio);
 		return 0; /* no DIO-aligned IO possible */
 	}
+	iocb->end_iter_index = n_iters;
 	++n_iters;
 
+	/* Setup misaligned end? */
+	if (local_dio->end_len) {
+		nfs_local_iter_setup(&iters[n_iters], rw, iocb->bvec,
+				     nvecs, total, local_dio->start_len +
+				     local_dio->middle_len, local_dio->end_len);
+		iocb->end_iter_index = n_iters;
+		++n_iters;
+	}
+
+	atomic_set(&iocb->n_iters, n_iters);
 	return n_iters;
 }
 
@@ -474,7 +472,7 @@ nfs_local_iters_init(struct nfs_local_kiocb *iocb, int rw)
 	len = hdr->args.count - total;
 
 	/*
-	 * For each iocb, iocb->n_iter is always at least 1 and we always
+	 * For each iocb, iocb->n_iters is always at least 1 and we always
 	 * end io after first nfs_local_pgio_done call unless misaligned DIO.
 	 */
 	atomic_set(&iocb->n_iters, 1);
@@ -492,9 +490,7 @@ nfs_local_iters_init(struct nfs_local_kiocb *iocb, int rw)
 	}
 
 	/* Use buffered IO */
-	iocb->offset[0] = hdr->args.offset;
 	iov_iter_bvec(&iocb->iters[0], rw, iocb->bvec, v, len);
-	iocb->iter_is_dio_aligned[0] = false;
 }
 
 static void
@@ -631,30 +627,20 @@ static void nfs_local_call_read(struct work_struct *work)
 
 	n_iters = atomic_read(&iocb->n_iters);
 	for (int i = 0; i < n_iters ; i++) {
-		/* DIO-aligned middle is always issued last with AIO completion */
 		if (iocb->iter_is_dio_aligned[i]) {
 			iocb->kiocb.ki_flags |= IOCB_DIRECT;
-			iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
-			iocb->aio_complete_work = nfs_local_read_aio_complete_work;
-		}
+			/* Only use AIO completion if DIO-aligned segment is last */
+			if (i == iocb->end_iter_index) {
+				iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
+				iocb->aio_complete_work = nfs_local_read_aio_complete_work;
+			}
+		} else
+			iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
 
-		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->read_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
-			if (unlikely(status >= 0 && status < iocb->iters[i].count)) {
-				/* partial read */
-				if (i == iocb->end_iter_index) {
-					/* Must not account DIO partial end, otherwise (due
-					 * to end being issued before middle): the partial
-					 * read accounting in nfs_local_read_done()
-					 * would incorrectly advance hdr->args.offset
-					 */
-					status = 0;
-				} else {
-					/* Partial read at start or middle, force done */
-					force_done = true;
-				}
-			}
+			if (unlikely(status >= 0 && status < iocb->iters[i].count))
+				force_done = true; /* Partial read */
 			if (nfs_local_pgio_done(iocb, status, force_done)) {
 				nfs_local_read_iocb_done(iocb);
 				break;
@@ -849,30 +835,20 @@ static void nfs_local_call_write(struct work_struct *work)
 	file_start_write(filp);
 	n_iters = atomic_read(&iocb->n_iters);
 	for (int i = 0; i < n_iters ; i++) {
-		/* DIO-aligned middle is always issued last with AIO completion */
 		if (iocb->iter_is_dio_aligned[i]) {
 			iocb->kiocb.ki_flags |= IOCB_DIRECT;
-			iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
-			iocb->aio_complete_work = nfs_local_write_aio_complete_work;
-		}
+			/* Only use AIO completion if DIO-aligned segment is last */
+			if (i == iocb->end_iter_index) {
+				iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
+				iocb->aio_complete_work = nfs_local_write_aio_complete_work;
+			}
+		} else
+			iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
 
-		iocb->kiocb.ki_pos = iocb->offset[i];
 		status = filp->f_op->write_iter(&iocb->kiocb, &iocb->iters[i]);
 		if (status != -EIOCBQUEUED) {
-			if (unlikely(status >= 0 && status < iocb->iters[i].count)) {
-				/* partial write */
-				if (i == iocb->end_iter_index) {
-					/* Must not account DIO partial end, otherwise (due
-					 * to end being issued before middle): the partial
-					 * write accounting in nfs_local_write_done()
-					 * would incorrectly advance hdr->args.offset
-					 */
-					status = 0;
-				} else {
-					/* Partial write at start or middle, force done */
-					force_done = true;
-				}
-			}
+			if (unlikely(status >= 0 && status < iocb->iters[i].count))
+				force_done = true; /* Partial write */
 			if (nfs_local_pgio_done(iocb, status, force_done)) {
 				nfs_local_write_iocb_done(iocb);
 				break;
-- 
2.44.0
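
The patch above drops the per-segment offset array and sets ki_pos once, relying on each read or write to advance the position. A rough userspace sketch of that idea follows, under stated assumptions: `io_fn`, `issue_in_order()` and `size_limited_io()` are hypothetical names invented for illustration, and the backend simply models a file of fixed size. Segments are issued in order against one shared position, and a short transfer ends the loop, loosely mirroring the force_done handling.

```c
#include <stddef.h>

/* Hypothetical backend: returns bytes transferred, or a negative error. */
typedef long long (*io_fn)(long long pos, size_t len, void *ctx);

/*
 * Issue @n segments back to back from *@pos. The position is advanced by
 * whatever each call transferred, so no per-segment offset is stored.
 * A short or failed transfer stops the loop. Returns total bytes done.
 */
static size_t issue_in_order(long long *pos, const size_t *lens, int n,
			     io_fn io, void *ctx)
{
	size_t done = 0;

	for (int i = 0; i < n; i++) {
		long long ret = io(*pos, lens[i], ctx);

		if (ret < 0)
			break;
		*pos += ret;
		done += (size_t)ret;
		if ((size_t)ret < lens[i])
			break;	/* partial transfer: later segments are skipped */
	}
	return done;
}

/* Example backend: a file whose size is *(long long *)ctx. */
static long long size_limited_io(long long pos, size_t len, void *ctx)
{
	long long size = *(long long *)ctx;
	long long avail = size > pos ? size - pos : 0;

	return avail < (long long)len ? avail : (long long)len;
}
```

Because segments are issued in ascending order, the bytes completed before a short transfer always form one contiguous range starting at the original offset, which is what makes the simpler partial-IO accounting possible.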



end of thread, other threads:[~2025-11-06  3:03 UTC | newest]

Thread overview: 18+ messages
2025-10-19  9:29 [Bug report] xfstests generic/323 over NFS hit BUG: KASAN: slab-use-after-free in nfs_local_call_read on 6.18.0-rc1 Yongcheng Yang
2025-10-19 15:18 ` Trond Myklebust
2025-10-19 16:26   ` Mike Snitzer
2025-10-20 18:24     ` Mike Snitzer
2025-10-27 13:08       ` [v6.18-rcX PATCH 0/3] nfs/localio: fixes for recent misaligned DIO changes Mike Snitzer
2025-10-27 13:08       ` [v6.18-rcX PATCH 1/3] nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support Mike Snitzer
2025-10-27 13:08       ` [v6.18-rcX PATCH 2/3] nfs/localio: add refcounting for each iocb IO associated with NFS pgio header Mike Snitzer
2025-10-27 13:19         ` Christoph Hellwig
2025-10-27 13:55           ` Mike Snitzer
2025-10-27 14:45             ` Christoph Hellwig
2025-10-27 13:08       ` [v6.18-rcX PATCH 3/3] nfs/localio: backfill missing partial read support for misaligned DIO Mike Snitzer
2025-10-27 17:52       ` [v6.18-rcX PATCH 4/3] nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion Mike Snitzer
2025-10-29 23:19         ` [v6.18-rcX PATCH 5/3] nfs/localio: do not issue misaligned DIO out-of-order Mike Snitzer
2025-10-31  1:50           ` Mike Snitzer
2025-10-31 13:33             ` Anna Schumaker
2025-11-04 18:02               ` [v6.18-rcX PATCH v2] " Mike Snitzer
2025-11-06  2:50                 ` Mike Snitzer
2025-11-06  3:03                   ` [v6.18-rcX PATCH v3 5/3] " Mike Snitzer
