Linux filesystem development
 help / color / mirror / Atom feed
* Re: [PATCH] hfsplus: validate thread record before delete key rebuild
From: Kyle Zeng @ 2026-06-27  0:52 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: linux-fsdevel, Yangtao Li, John Paul Adrian Glaubitz,
	outbounddisclosures
In-Reply-To: <a6220ce5bf7f546c77123934974a0f94f19a6711.camel@dubeyko.com>

Hi Viacheslav,

I hope you had a great trip.
It was difficult for us to test the patches without knowing the exact
number of xfstests failures. So I wonder whether you could help test
the two HFS patches we shared previously.

Thanks,
Kyle

On Mon, Jun 15, 2026 at 11:24 PM Viacheslav Dubeyko <slava@dubeyko.com> wrote:
>
> On Fri, 2026-06-12 at 16:08 -0700, Kyle Zeng wrote:
> > Hi Viacheslav,
> >
> > I prepared a xfstests environment and tried to test the patches.
> > But unfortunately, due to my lab environment constraints, I could not
> > fully finish the tests: some tests errored out even without the
> > patches due to environment constraints.
> > I think you will still be the best person to perform the test. I'll
> > ping you around end of June.
> >
>
> Currently, HFS+ has around 15 - 17 xfstests failures. If you can see
> around the same number of issues without your patches, then you are
> good if your patches will not increase the number of failed xfestests.
> Anyway, I'll be back in two weeks and I will be able to test your
> patches in more reliable environment. Sorry, I have pretty no time and
> capabilities to run xfstests during my trip. And not very much time for
> the review too. :)
>
> Thanks,
> Slava.
>
> >
> > On Thu, Jun 11, 2026 at 9:11 PM Viacheslav Dubeyko
> > <slava@dubeyko.com> wrote:
> > >
> > > On Thu, 2026-06-11 at 14:26 -0700, Kyle Zeng wrote:
> > > > hfsplus_delete_cat() is called with str == NULL when the last
> > > > open
> > > > reference to an unlinked HFS+ hardlink backing inode is closed.
> > > > In
> > > > that
> > > > case, the function finds the catalog thread by CNID and rebuilds
> > > > the
> > > > catalog key from thread.nodeName.
> > > >
> > > > That reconstruction path reads thread.nodeName.length directly
> > > > from
> > > > the
> > > > catalog B-tree into fd.search_key and then copies length * 2
> > > > bytes
> > > > into
> > > > fd.search_key->cat.name.unicode. It does not first check that the
> > > > found
> > > > record is a thread record, that the record size matches the
> > > > thread
> > > > name
> > > > length, or that the name length fits in the fixed HFS+ catalog
> > > > key
> > > > buffer.
> > > >
> > > > A corrupted image can therefore provide an oversized thread name
> > > > length
> > > > and make hfs_bnode_read() write past the catalog search-key
> > > > allocation.
> > > >
> > > > Read the CNID record through hfsplus_brec_read_cat(), reject non-
> > > > thread
> > > > records, and explicitly bound nodeName.length before building the
> > > > delete
> > > > key from the validated thread name.
> > > >
> > > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > > > Assisted-by: Codex:gpt-5.5
> > > > Signed-off-by: Kyle Zeng <kylebot@openai.com>
> > > > ---
> > > >  fs/hfsplus/catalog.c | 32 ++++++++++++++++++++------------
> > > >  1 file changed, 20 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/fs/hfsplus/catalog.c b/fs/hfsplus/catalog.c
> > > > index 3f0cc1b1bb58..000000000000 100644
> > > > --- a/fs/hfsplus/catalog.c
> > > > +++ b/fs/hfsplus/catalog.c
> > > > @@ -351,23 +351,31 @@ int hfsplus_delete_cat(u32 cnid, struct
> > > > inode
> > > > *dir, const struct qstr *str)
> > > >               goto out;
> > > >
> > > >       if (!str) {
> > > > -             int len;
> > > > +             hfsplus_cat_entry entry = {0};
> > > > +             u16 thread_type;
> > > >
> > > >               hfsplus_cat_build_key_with_cnid(sb, fd.search_key,
> > > > cnid);
> > > > -             err = hfs_brec_find(&fd, hfs_find_rec_by_key);
> > > > +             err = hfsplus_brec_read_cat(&fd, &entry);
> > > >               if (err)
> > > >                       goto out;
> > > >
> > > > -             off = fd.entryoffset +
> > > > -                     offsetof(struct hfsplus_cat_thread,
> > > > nodeName);
> > > > -             fd.search_key->cat.parent = cpu_to_be32(dir-
> > > > >i_ino);
> > > > -             hfs_bnode_read(fd.bnode,
> > > > -                     &fd.search_key->cat.name.length, off, 2);
> > > > -             len = be16_to_cpu(fd.search_key->cat.name.length) *
> > > > 2;
> > > > -             hfs_bnode_read(fd.bnode,
> > > > -                     &fd.search_key->cat.name.unicode,
> > > > -                     off + 2, len);
> > > > -             fd.search_key->key_len = cpu_to_be16(6 + len);
> > > > +             thread_type = be16_to_cpu(entry.type);
> > > > +             if (thread_type != HFSPLUS_FOLDER_THREAD &&
> > > > +                 thread_type != HFSPLUS_FILE_THREAD) {
> > > > +                     pr_err("found bad thread record in
> > > > catalog\n");
> > > > +                     err = -EIO;
> > > > +                     goto out;
> > > > +             }
> > > > +
> > > > +             if (be16_to_cpu(entry.thread.nodeName.length) >
> > > > +                 HFSPLUS_MAX_STRLEN) {
> > > > +                     pr_err("catalog name length corrupted\n");
> > > > +                     err = -EIO;
> > > > +                     goto out;
> > > > +             }
> > > > +
> > > > +             hfsplus_cat_build_key_uni(fd.search_key, dir-
> > > > >i_ino,
> > > > +                                       &entry.thread.nodeName);
> > > >       } else {
> > > >               err = hfsplus_cat_build_key(sb, fd.search_key, dir-
> > > > > i_ino, str);
> > > >               if (unlikely(err))
> > >
> > > Have you run xfstests for this patch? Sorry, I am in personal trip
> > > right now and I cannot test your patch until the end of June.
> > > Please,
> > > ping me around that time.
> > >
> > > Thanks,
> > > Slava.

^ permalink raw reply

* [GIT PULL] fscrypt fixes for 7.2
From: Eric Biggers @ 2026-06-27  0:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fscrypt, linux-fsdevel, linux-kernel, Theodore Ts'o,
	Jaegeuk Kim, Mohammed EL Kadiri, Luis Henriques,
	syzbot+f55b043dacf43776b50c

The following changes since commit 1dc18801be29bc54709aa355b8acd80e183b03cd:

  Merge tag 'i2c-7.2-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux (2026-06-22 09:30:31 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/fs/fscrypt/linux.git tags/fscrypt-for-linus

for you to fetch changes up to 696c030e1e3438955aba443b308ee8b6faa3983e:

  fscrypt: Replace mk_users keyring with simple list (2026-06-22 12:12:11 -0700)

----------------------------------------------------------------

- Fix a bug where in a specific edge case, file contents en/decryption
  could be done with the wrong data unit size.

- Fix the data structure used for keeping track of users that have added
  an fscrypt key to be a simple list instead of a 'struct key' keyring.

  This fixes issues such as a lockdep report found by syzbot and
  possible unintended interactions with the keyctl() system calls.

----------------------------------------------------------------
Eric Biggers (2):
      fscrypt: Fix key setup in edge case with multiple data unit sizes
      fscrypt: Replace mk_users keyring with simple list

 fs/crypto/fscrypt_private.h |  84 ++++++++++------
 fs/crypto/inline_crypt.c    |   8 +-
 fs/crypto/keyring.c         | 239 +++++++++++++++++++-------------------------
 fs/crypto/keysetup.c        | 118 ++++++++++++++--------
 4 files changed, 233 insertions(+), 216 deletions(-)

^ permalink raw reply

* Re: kernel BUG at hfs_write_inode [verbose debug info unavailable]
From: Matthew Wilcox @ 2026-06-26 21:44 UTC (permalink / raw)
  To: sanan.hasanou
  Cc: slava, glaubitz, frank.li, linux-fsdevel, linux-kernel, syzkaller,
	contact
In-Reply-To: <6a3eeb80.c68533e6.3320fc.f244@mx.google.com>

On Fri, Jun 26, 2026 at 02:13:36PM -0700, sanan.hasanou@gmail.com wrote:
> Good day, dear maintainers,
> 
> We found a bug using a modified version of syzkaller.

Do not so this.  Get your changes upstream into syzkaller and let syzbot
do the rest.

^ permalink raw reply

* KASAN: slab-use-after-free Read in fserror_worker
From: sanan.hasanou @ 2026-06-26 21:29 UTC (permalink / raw)
  To: viro, brauner, jack, linux-fsdevel, linux-kernel; +Cc: syzkaller, contact

Good day, dear maintainers,

We found a bug using a modified version of syzkaller.

Kernel Branch: 7.0-rc1
Kernel Config: <https://drive.google.com/open?id=1dd0qteaHHIsE3puUVyWFoRLtF4bP2IOy>
Unfortunately, we don't have any reproducer for this bug yet.
Thank you!

Best regards,
Sanan Hasanov

==================================================================
BUG: KASAN: slab-use-after-free in inode_state_read_once include/linux/fs.h:884 [inline]
BUG: KASAN: slab-use-after-free in iput+0x34c/0xc60 fs/inode.c:1986
Read of size 4 at addr ffff888066ffafb8 by task kworker/0:1/448004

CPU: 0 UID: 0 PID: 448004 Comm: kworker/0:1 Tainted: G             L      7.0.0-rc1 #1 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: events fserror_worker
Call Trace:
 <TASK>
 __dump_stack+0x21/0x30 lib/dump_stack.c:94
 dump_stack_lvl+0xee/0x150 lib/dump_stack.c:120
 print_address_description+0x51/0x1e0 mm/kasan/report.c:378
 print_report+0x67/0x80 mm/kasan/report.c:482
 kasan_report+0x135/0x170 mm/kasan/report.c:595
 __asan_report_load4_noabort+0x18/0x20 mm/kasan/report_generic.c:380
 inode_state_read_once include/linux/fs.h:884 [inline]
 iput+0x34c/0xc60 fs/inode.c:1986
 fserror_worker+0x215/0x310 fs/fserror.c:69
 process_one_work kernel/workqueue.c:3275 [inline]
 process_scheduled_works+0xa30/0x13d0 kernel/workqueue.c:3358
 worker_thread+0xacb/0x1060 kernel/workqueue.c:3439
 kthread+0x388/0x470 kernel/kthread.c:467
 ret_from_fork+0x5e4/0xb90 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:245
 </TASK>

Allocated by task 475317:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x70 mm/kasan/common.c:78
 kasan_save_alloc_info+0x40/0x50 mm/kasan/generic.c:570
 unpoison_slab_object mm/kasan/common.c:340 [inline]
 __kasan_slab_alloc+0x73/0x80 mm/kasan/common.c:366
 kasan_slab_alloc include/linux/kasan.h:253 [inline]
 slab_post_alloc_hook mm/slub.c:4501 [inline]
 slab_alloc_node mm/slub.c:4830 [inline]
 kmem_cache_alloc_lru_noprof+0x2bc/0x4a0 mm/slub.c:4849
 xfs_inode_alloc+0xf8/0x7b0 fs/xfs/xfs_icache.c:97
 xfs_iget_cache_miss fs/xfs/xfs_icache.c:635 [inline]
 xfs_iget+0x635/0x2330 fs/xfs/xfs_icache.c:799
 xfs_lookup+0x2fb/0x4f0 fs/xfs/xfs_inode.c:553
 xfs_vn_lookup+0x11d/0x1e0 fs/xfs/xfs_iops.c:327
 __lookup_slow+0x28f/0x3c0 fs/namei.c:1916
 lookup_slow+0x5c/0x80 fs/namei.c:1933
 walk_component fs/namei.c:2279 [inline]
 lookup_last fs/namei.c:2780 [inline]
 path_lookupat+0x403/0x8f0 fs/namei.c:2804
 filename_lookup+0x217/0x570 fs/namei.c:2833
 filename_listxattr fs/xattr.c:945 [inline]
 path_listxattrat+0x117/0x3a0 fs/xattr.c:975
 __do_sys_listxattr fs/xattr.c:988 [inline]
 __se_sys_listxattr fs/xattr.c:985 [inline]
 __x64_sys_listxattr+0x8b/0xa0 fs/xattr.c:985
 x64_sys_call+0x1899/0x2900 arch/x86/include/generated/asm/syscalls_64.h:195
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x13f/0x860 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Freed by task 15:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x70 mm/kasan/common.c:78
 kasan_save_free_info+0x4a/0x50 mm/kasan/generic.c:584
 poison_slab_object mm/kasan/common.c:253 [inline]
 __kasan_slab_free+0x63/0x80 mm/kasan/common.c:285
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:2687 [inline]
 slab_free mm/slub.c:6124 [inline]
 kmem_cache_free+0x20c/0x5a0 mm/slub.c:6254
 xfs_inode_free_callback+0x1ad/0x1e0 fs/xfs/xfs_icache.c:165
 rcu_do_batch+0x541/0xc90 kernel/rcu/tree.c:2617
 rcu_core+0x455/0x870 kernel/rcu/tree.c:2869
 rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2886
 handle_softirqs+0x229/0x750 kernel/softirq.c:622
 run_ksoftirqd+0x3f/0x70 kernel/softirq.c:1063
 smpboot_thread_fn+0x611/0xbe0 kernel/smpboot.c:160
 kthread+0x388/0x470 kernel/kthread.c:467
 ret_from_fork+0x5e4/0xb90 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:245

Last potentially related work creation:
 kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
 kasan_record_aux_stack+0xc1/0xd0 mm/kasan/generic.c:556
 __call_rcu_common kernel/rcu/tree.c:3131 [inline]
 call_rcu+0xec/0x7d0 kernel/rcu/tree.c:3251
 __xfs_inode_free fs/xfs/xfs_icache.c:177 [inline]
 xfs_inode_free+0x1c5/0x240 fs/xfs/xfs_icache.c:197
 xfs_iget_cache_miss fs/xfs/xfs_icache.c:740 [inline]
 xfs_iget+0x6b6/0x2330 fs/xfs/xfs_icache.c:799
 xfs_lookup+0x2fb/0x4f0 fs/xfs/xfs_inode.c:553
 xfs_vn_lookup+0x11d/0x1e0 fs/xfs/xfs_iops.c:327
 __lookup_slow+0x28f/0x3c0 fs/namei.c:1916
 lookup_slow+0x5c/0x80 fs/namei.c:1933
 walk_component fs/namei.c:2279 [inline]
 lookup_last fs/namei.c:2780 [inline]
 path_lookupat+0x403/0x8f0 fs/namei.c:2804
 filename_lookup+0x217/0x570 fs/namei.c:2833
 filename_listxattr fs/xattr.c:945 [inline]
 path_listxattrat+0x117/0x3a0 fs/xattr.c:975
 __do_sys_listxattr fs/xattr.c:988 [inline]
 __se_sys_listxattr fs/xattr.c:985 [inline]
 __x64_sys_listxattr+0x8b/0xa0 fs/xattr.c:985
 x64_sys_call+0x1899/0x2900 arch/x86/include/generated/asm/syscalls_64.h:195
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x13f/0x860 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

The buggy address belongs to the object at ffff888066ffad00
 which belongs to the cache xfs_inode of size 1776
The buggy address is located 696 bytes inside of
 freed 1776-byte region [ffff888066ffad00, ffff888066ffb3f0)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888066ffe180 pfn:0x66ff8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff888066ff8719
flags: 0x2000000000000240(workingset|head|zone=1)
page_type: f5(slab)
raw: 2000000000000240 ffff888019fd2640 ffff888019fd1ac8 ffffea0000adde10
raw: ffff888066ffe180 0000078000110009 00000000f5000000 ffff888066ff8719
head: 2000000000000240 ffff888019fd2640 ffff888019fd1ac8 ffffea0000adde10
head: ffff888066ffe180 0000078000110009 00000000f5000000 ffff888066ff8719
head: 2000000000000003 ffffea00019bfe01 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Reclaimable, gfp_mask 0xd20d0(__GFP_RECLAIMABLE|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 40578, tgid 40575 (syz.3.7187), ts 249736208593, free_ts 245599182745
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x25f/0x490 mm/page_alloc.c:1889
 prep_new_page mm/page_alloc.c:1897 [inline]
 get_page_from_freelist+0x2da9/0x2ed0 mm/page_alloc.c:3962
 __alloc_frozen_pages_noprof+0x17c/0x340 mm/page_alloc.c:5250
 alloc_slab_page+0x62/0x130 mm/slub.c:-1
 allocate_slab+0x7a/0x530 mm/slub.c:3444
 new_slab mm/slub.c:3502 [inline]
 refill_objects+0x4bf/0x640 mm/slub.c:7134
 refill_sheaf+0x32/0x50 mm/slub.c:2804
 alloc_full_sheaf mm/slub.c:2825 [inline]
 __pcs_replace_empty_main+0x335/0x580 mm/slub.c:4588
 alloc_from_pcs mm/slub.c:4681 [inline]
 slab_alloc_node mm/slub.c:4815 [inline]
 kmem_cache_alloc_lru_noprof+0x41c/0x4a0 mm/slub.c:4849
 xfs_inode_alloc+0xf8/0x7b0 fs/xfs/xfs_icache.c:97
 xfs_iget_cache_miss fs/xfs/xfs_icache.c:635 [inline]
 xfs_iget+0x635/0x2330 fs/xfs/xfs_icache.c:799
 xfs_mountfs+0xf84/0x2050 fs/xfs/xfs_mount.c:1072
 xfs_fs_fill_super+0x1225/0x16a0 fs/xfs/xfs_super.c:1938
 get_tree_bdev_flags+0x407/0x4d0 fs/super.c:1694
 get_tree_bdev+0x28/0x30 fs/super.c:1717
 xfs_fs_get_tree+0x25/0x30 fs/xfs/xfs_super.c:1985
page last free pid 5023 tgid 5023 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1433 [inline]
 __free_frozen_pages+0xb63/0x1040 mm/page_alloc.c:2978
 free_frozen_pages+0x14/0x20 mm/page_alloc.c:3016
 __free_slab+0x1a2/0x290 mm/slub.c:3518
 free_slab+0xdd/0x100 mm/slub.c:3552
 discard_slab+0x28/0x30 mm/slub.c:3558
 __slab_free+0x2a8/0x2b0 mm/slub.c:5532
 ___cache_free+0x72/0x80 mm/slub.c:6199
 qlink_free mm/kasan/quarantine.c:163 [inline]
 qlist_free_all+0xa3/0x110 mm/kasan/quarantine.c:179
 kasan_quarantine_reduce+0x13f/0x150 mm/kasan/quarantine.c:286
 __kasan_slab_alloc+0x28/0x80 mm/kasan/common.c:350
 kasan_slab_alloc include/linux/kasan.h:253 [inline]
 slab_post_alloc_hook mm/slub.c:4501 [inline]
 slab_alloc_node mm/slub.c:4830 [inline]
 __do_kmalloc_node mm/slub.c:5218 [inline]
 __kmalloc_noprof+0x329/0x610 mm/slub.c:5231
 kmalloc_noprof include/linux/slab.h:966 [inline]
 tomoyo_realpath_from_path+0x172/0x710 security/tomoyo/realpath.c:251
 tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
 tomoyo_path_perm+0x208/0x460 security/tomoyo/file.c:827
 tomoyo_inode_getattr+0x25/0x30 security/tomoyo/tomoyo.c:123
 security_inode_getattr+0x1eb/0x3d0 security/security.c:1869
 vfs_getattr fs/stat.c:259 [inline]
 vfs_fstat fs/stat.c:281 [inline]
 __do_sys_newfstat fs/stat.c:551 [inline]
 __se_sys_newfstat+0xe9/0x3e0 fs/stat.c:546

Memory state around the buggy address:
 ffff888066ffae80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888066ffaf00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888066ffaf80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                        ^
 ffff888066ffb000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888066ffb080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

<<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>

^ permalink raw reply

* kernel BUG at hfs_write_inode [verbose debug info unavailable]
From: sanan.hasanou @ 2026-06-26 21:13 UTC (permalink / raw)
  To: slava, glaubitz, frank.li, linux-fsdevel, linux-kernel; +Cc: syzkaller, contact

Good day, dear maintainers,

We found a bug using a modified version of syzkaller.

Kernel Branch: 7.0-rc1
Kernel Config: <https://drive.google.com/open?id=173DLEAEPKPhhR1TcqofdnkLpdoK7PMFl>
Reproducer: <https://drive.google.com/open?id=1CqxzPCkagwu-C1x-19rSi9hAPdorJFHY>
Thank you!

Best regards,
Sanan Hasanov

------------[ cut here ]------------
Kernel BUG at hfs_write_inode+0x8b1/0x8c0 [verbose]
Oops: invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 UID: 0 PID: 52148 Comm: kworker/u8:18 Tainted: G             L      7.0.0-rc1 #1 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: writeback wb_workfn (flush-7:7)
RIP: 0010:hfs_write_inode+0x8b1/0x8c0 fs/hfs/inode.c:474
Code: ff e9 c5 fd ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c 61 fe ff ff 4c 89 f7 e8 4a 02 85 ff e9 54 fe ff ff e8 d0 d2 18 ff 90 <0f> 0b 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 0f 1f 44
RSP: 0018:ffffc900107e72c0 EFLAGS: 00010293
RAX: ffffffff82a97f00 RBX: ffff88805491a520 RCX: ffff88801ccc2700
RDX: 0000000000000000 RSI: ffffffff8eb6dc00 RDI: 0000000000000000
RBP: ffffc900107e7450 R08: ffff88801ccc2700 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 1ffff920020fce5c
R13: dffffc0000000000 R14: 0000000000000000 R15: ffff88805491a4e0
FS:  0000000000000000(0000) GS:ffff8880d98df000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f80878c4e78 CR3: 00000000476ce000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 write_inode fs/fs-writeback.c:1581 [inline]
 __writeback_single_inode+0x56f/0x870 fs/fs-writeback.c:1812
 writeback_sb_inodes+0x73b/0x1110 fs/fs-writeback.c:2040
 wb_writeback+0x3fa/0x690 fs/fs-writeback.c:2226
 wb_do_writeback fs/fs-writeback.c:2373 [inline]
 wb_workfn+0x3db/0xef0 fs/fs-writeback.c:2413
 process_one_work kernel/workqueue.c:3275 [inline]
 process_scheduled_works+0x811/0xf10 kernel/workqueue.c:3358
 worker_thread+0x9c1/0xeb0 kernel/workqueue.c:3439
 kthread+0x3c1/0x4d0 kernel/kthread.c:467
 ret_from_fork+0x608/0xc40 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:245
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:hfs_write_inode+0x8b1/0x8c0 fs/hfs/inode.c:474
Code: ff e9 c5 fd ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c 61 fe ff ff 4c 89 f7 e8 4a 02 85 ff e9 54 fe ff ff e8 d0 d2 18 ff 90 <0f> 0b 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 0f 1f 44
RSP: 0018:ffffc900107e72c0 EFLAGS: 00010293
RAX: ffffffff82a97f00 RBX: ffff88805491a520 RCX: ffff88801ccc2700
RDX: 0000000000000000 RSI: ffffffff8eb6dc00 RDI: 0000000000000000
RBP: ffffc900107e7450 R08: ffff88801ccc2700 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 1ffff920020fce5c
R13: dffffc0000000000 R14: 0000000000000000 R15: ffff88805491a4e0
FS:  0000000000000000(0000) GS:ffff8880d99df000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffdb2c2dff0 CR3: 000000003181d000 CR4: 00000000000006f0

<<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>

Oops: invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 UID: 0 PID: 52148 Comm: kworker/u8:18 Tainted: G             L      7.0.0-rc1 #1 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: writeback wb_workfn (flush-7:7)
RIP: 0010:hfs_write_inode+0x8b1/0x8c0
Code: ff e9 c5 fd ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c 61 fe ff ff 4c 89 f7 e8 4a 02 85 ff e9 54 fe ff ff e8 d0 d2 18 ff 90 <0f> 0b 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 0f 1f 44
RSP: 0018:ffffc900107e72c0 EFLAGS: 00010293
RAX: ffffffff82a97f00 RBX: ffff88805491a520 RCX: ffff88801ccc2700
RDX: 0000000000000000 RSI: ffffffff8eb6dc00 RDI: 0000000000000000
RBP: ffffc900107e7450 R08: ffff88801ccc2700 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 1ffff920020fce5c
R13: dffffc0000000000 R14: 0000000000000000 R15: ffff88805491a4e0
FS:  0000000000000000(0000) GS:ffff8880d98df000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f80878c4e78 CR3: 00000000476ce000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __writeback_single_inode+0x56f/0x870
 writeback_sb_inodes+0x73b/0x1110
 wb_writeback+0x3fa/0x690
 wb_workfn+0x3db/0xef0
 process_scheduled_works+0x811/0xf10
 worker_thread+0x9c1/0xeb0
 kthread+0x3c1/0x4d0
 ret_from_fork+0x608/0xc40
 ret_from_fork_asm+0x11/0x20
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:hfs_write_inode+0x8b1/0x8c0
Code: ff e9 c5 fd ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c 61 fe ff ff 4c 89 f7 e8 4a 02 85 ff e9 54 fe ff ff e8 d0 d2 18 ff 90 <0f> 0b 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 0f 1f 44
RSP: 0018:ffffc900107e72c0 EFLAGS: 00010293
RAX: ffffffff82a97f00 RBX: ffff88805491a520 RCX: ffff88801ccc2700
RDX: 0000000000000000 RSI: ffffffff8eb6dc00 RDI: 0000000000000000
RBP: ffffc900107e7450 R08: ffff88801ccc2700 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 1ffff920020fce5c
R13: dffffc0000000000 R14: 0000000000000000 R15: ffff88805491a4e0
FS:  0000000000000000(0000) GS:ffff8880d99df000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffdb2c2dff0 CR3: 000000003181d000 CR4: 00000000000006f0

<<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>

^ permalink raw reply

* Re: [PATCH] fscrypt: Replace mk_users keyring with simple list
From: Eric Biggers @ 2026-06-26 20:29 UTC (permalink / raw)
  To: Luis Henriques
  Cc: linux-fscrypt, Theodore Ts'o, Jaegeuk Kim, Jarkko Sakkinen,
	linux-fsdevel, keyrings, linux-kernel,
	syzbot+f55b043dacf43776b50c, Mohammed EL Kadiri, stable
In-Reply-To: <20260626190232.GA1719948@google.com>

On Fri, Jun 26, 2026 at 07:02:32PM +0000, Eric Biggers wrote:
> On Fri, Jun 26, 2026 at 09:16:35AM +0100, Luis Henriques wrote:
> > Hi Eric!
> > 
> > On Thu, Jun 18 2026, Eric Biggers wrote:
> > 
> > > Change mk_users (the set of user claims to an fscrypt master key) from a
> > > 'struct key' keyring to a simple linked list.
> > >
> > > It's still a collection of 'struct key' for quota tracking.  It was
> > > originally thought to be natural that a collection of 'struct key'
> > > should be held in a 'struct key' keyring.  In reality, it's just been
> > > causing problems, similar to how using 'struct key' for the filesystem
> > > keyring caused problems and was removed in commit d7e7b9af104c
> > > ("fscrypt: stop using keyrings subsystem for fscrypt_master_key").
> > >
> > > Commit d3a7bd420076 ("fscrypt: clear keyring before calling key_put()")
> > > fixed mk_users cleanup to be synchronous.  But that apparently wasn't
> > > enough: the keyring subsystem's redundant locking is still generating
> > > lockdep false positives due to the interaction with filesystem reclaim.
> > >
> > > With the simple list, the redundant locking and lockdep issue goes away.
> > >
> > > Of course, searching a linked list is linear-time whereas the
> > > 'struct key' keyring used a fancy constant-time associative array.  But
> > > that's fine here, since in practice there's just one entry in the list.
> > > In fact the new code is much faster in practice, since it's much smaller
> > > and doesn't have to convert the kuid_t into a string to search for it.
> > >
> > > Reported-by: syzbot+f55b043dacf43776b50c@syzkaller.appspotmail.com
> > > Closes: https://syzkaller.appspot.com/bug?extid=f55b043dacf43776b50c
> > > Reported-by: Mohammed EL Kadiri <med08elkadiri@gmail.com>
> > > Closes: https://lore.kernel.org/keyrings/20260614150041.21172-1-med08elkadiri@gmail.com/
> > > Fixes: 23c688b54016 ("fscrypt: allow unprivileged users to add/remove keys for v2 policies")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Eric Biggers <ebiggers@kernel.org>
> > > ---
> > >
> > > I'm planning to take this via the fscrypt tree for 7.2
> > 
> > I was hoping to have some more time to test this patch, but I don't think
> > that will happen any time soon.
> > 
> > I've done a review of the patch and couldn't find any obvious problem.
> > Though, again, a more in-depth review would require more time as it has
> > been a while since I took a look into this code.
> > 
> > I just wonder if this is really stable material.  It's a bit intrusive
> > (doesn't even apply cleanly in mainline), but above all it's fixing a
> > lockdep false positive.  Other than syzbot, has this been seen in the
> > wild?
> 
> Thanks!
> 
> It applies on top of
> "[PATCH] fscrypt: Fix key setup in edge case with multiple data unit sizes"
> (https://lore.kernel.org/linux-fscrypt/20260618180652.52742-1-ebiggers@kernel.org/).
> This time I tried just relying on the prerequisite-patch-id footer (as
> generated by 'git format-patch') to express the dependency.  But
> evidently that still doesn't work: for one, 'b4 am' just ignores it.
> 
> Since this patch has "Reported-by: syzbot" as well as "Fixes", the
> stable maintainers will apply it anyway.  If I actually wanted to stop
> that, I'd have to actively oppose the backport, likely multiple times
> indefinitely since people will continue to try to backport it.  And
> syzkaller would continue to get the lockdep warning on stable kernels.
> 
> So I'd rather just get it out the way and backport it the same time as
> "fscrypt: Fix key setup in edge case with multiple data unit sizes",
> which similarly tweaks some data structures in struct fscrypt_master_key
> and needs to be backported too.  "fscrypt: stop using keyrings subsystem
> for fscrypt_master_key" several years ago was backported too.

FWIW, I would also not be surprised if the old code would also fail
fuzzing in other ways, like keyctl() being used to directly manipulate
the keyrings from underneath what fs/crypto/ assumes.  I remember at
least considering that scenario when adding this code years ago, but I
think the reasoning was quite subtle and I may have missed something.

The 'struct key' keyrings just have a lot of obscure sharp corners.
Whereas simple lists, hash tables, etc. are much easier to evaluate.

- Eric

^ permalink raw reply

* [syzbot ci] Re: minix: convert to iomap and add direct I/O
From: Jeremy Bingham @ 2026-06-26 20:21 UTC (permalink / raw)
  To: syzbot+ci97bc680341b3b928
  Cc: linux-fsdevel, linux-kernel, brauner, jkoolstra, syzkaller-bugs,
	Jeremy Bingham
In-Reply-To: <6a3ed243.656f0a6b.201ab1.0000.GAE@google.com>

Apparently I did this wrong the first time. I misunderstood and sent one
patch covering all the changes differing from master, rather than just
patching the changes to fix the errors syzbot found.

#syz test

---
 fs/minix/file.c         |  4 ++--
 fs/minix/inode.c        | 11 ++++++-----
 fs/minix/itree_common.c | 11 ++++++++++-
 fs/minix/minix.h        |  3 +++
 4 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/fs/minix/file.c b/fs/minix/file.c
index 1f4217115401..b07c853fa43a 100644
--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -175,8 +175,8 @@ const struct file_operations minix_file_operations = {
 	.splice_write	= iter_file_splice_write,
 };
 
-static int minix_setattr(struct mnt_idmap *idmap,
-			 struct dentry *dentry, struct iattr *attr)
+int minix_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+	struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
 	int error;
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index cd12e59ce9b9..8a79ff82a656 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -444,10 +444,10 @@ static ssize_t minix_writeback_range(struct iomap_writepage_ctx *wpc,
 	if (pos < wpc->iomap.offset ||
 			pos >= wpc->iomap.offset + wpc->iomap.length) {
 		if (INODE_VERSION(wpc->inode) == MINIX_V1)
-			error = V1_minix_iomap_begin(wpc->inode, pos, len, 0,
+			error = V1_minix_iomap_begin(wpc->inode, pos, len, IOMAP_WRITE,
 				&wpc->iomap, NULL);
 		else
-			error = V2_minix_iomap_begin(wpc->inode, pos, len, 0,
+			error = V2_minix_iomap_begin(wpc->inode, pos, len, IOMAP_WRITE,
 				&wpc->iomap, NULL);
 		if (error)
 			return error;
@@ -490,7 +490,7 @@ static int minix_writepages(struct address_space *mapping,
 
 static int minix_read_folio(struct file *file, struct folio *folio)
 {
-	const struct iomap_ops *ops = minix_iomap_ops_ver(file->f_inode);
+	const struct iomap_ops *ops = minix_iomap_ops_ver(folio->mapping->host);
 
 	iomap_bio_read_folio(folio, ops);
 	return 0;
@@ -504,7 +504,7 @@ static int minix_block_read_folio(struct file *file, struct folio *folio)
 
 static void minix_readahead(struct readahead_control *rac)
 {
-	const struct iomap_ops *ops = minix_iomap_ops_ver(rac->file->f_inode);
+	const struct iomap_ops *ops = minix_iomap_ops_ver(rac->mapping->host);
 
 	iomap_bio_readahead(rac, ops);
 }
@@ -545,7 +545,7 @@ static sector_t minix_bmap(struct address_space *mapping, sector_t block)
 	return iomap_bmap(mapping, block, ops);
 }
 
-static const struct address_space_operations minix_aops = {
+const struct address_space_operations minix_aops = {
 	.dirty_folio	= iomap_dirty_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.read_folio = minix_read_folio,
@@ -575,6 +575,7 @@ static const struct address_space_operations minix_dir_aops = {
 static const struct inode_operations minix_symlink_inode_operations = {
 	.get_link	= page_get_link,
 	.getattr	= minix_getattr,
+	.setattr	= minix_setattr,
 };
 
 void minix_set_inode(struct inode *inode, dev_t rdev)
diff --git a/fs/minix/itree_common.c b/fs/minix/itree_common.c
index c3cd2c75af9c..5a8b73a7beda 100644
--- a/fs/minix/itree_common.c
+++ b/fs/minix/itree_common.c
@@ -311,7 +311,16 @@ static inline void truncate (struct inode * inode)
 	long iblock;
 
 	iblock = (inode->i_size + sb->s_blocksize -1) >> sb->s_blocksize_bits;
-	block_truncate_page(inode->i_mapping, inode->i_size, get_block);
+
+	/* Depending on what address space operations are being used by the
+	 * inode being truncated, we need to either call iomap_truncate_page or
+	 * block_truncate_page.
+	 */
+	if (inode->i_mapping->a_ops == &minix_aops)
+		iomap_truncate_page(inode, inode->i_size, NULL,
+			minix_iomap_ops_ver(inode), NULL, NULL);
+	else
+		block_truncate_page(inode->i_mapping, inode->i_size, get_block);
 
 	n = block_to_path(inode, iblock, offsets);
 	if (!n)
diff --git a/fs/minix/minix.h b/fs/minix/minix.h
index face74100346..270e4e0620a1 100644
--- a/fs/minix/minix.h
+++ b/fs/minix/minix.h
@@ -58,6 +58,8 @@ void minix_free_block(struct inode *inode, unsigned long block);
 unsigned long minix_count_free_blocks(struct super_block *sb);
 int minix_getattr(struct mnt_idmap *, const struct path *,
 		struct kstat *, u32, unsigned);
+int minix_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+	struct iattr *attr);
 int minix_prepare_chunk(struct folio *folio, loff_t pos, unsigned len);
 struct mapping_metadata_bhs *minix_get_metadata_bhs(struct inode *inode);
 int minix_fsync(struct file *file, loff_t start, loff_t end, int datasync);
@@ -88,6 +90,7 @@ extern int V2_minix_iomap_begin(struct inode *inode, loff_t offset,
 	loff_t length, unsigned int flags, struct iomap *iomap,
 	struct iomap *srcmap);
 
+extern const struct address_space_operations minix_aops;
 extern const struct inode_operations minix_file_inode_operations;
 extern const struct inode_operations minix_dir_inode_operations;
 extern const struct file_operations minix_file_operations;
-- 
2.47.3


^ permalink raw reply related

* Re: [syzbot ci] Re: minix: convert to iomap and add direct I/O
From: Jeremy Bingham @ 2026-06-26 19:25 UTC (permalink / raw)
  To: syzbot+ci6eb2640f41075c71
  Cc: linux-fsdevel, linux-kernel, brauner, jkoolstra, syzkaller-bugs,
	Jeremy Bingham
In-Reply-To: <6a3e251e.b42ede87.2ae58d.0001.GAE@google.com>

#syz test

---
 fs/minix/file.c         | 157 ++++++++++++++++++++++++++++++++++++++--
 fs/minix/inode.c        |  86 ++++++++++++++++++++--
 fs/minix/iomap.c        | 114 +++++++++++++++++++++++++++++
 fs/minix/itree_common.c |  11 ++-
 fs/minix/itree_v1.c     |  25 ++++++-
 fs/minix/itree_v2.c     |  17 ++++-
 fs/minix/minix.h        |  25 ++++++-
 7 files changed, 415 insertions(+), 20 deletions(-)
 create mode 100644 fs/minix/iomap.c

diff --git a/fs/minix/file.c b/fs/minix/file.c
index 86e5943cd2ff..b07c853fa43a 100644
--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -17,21 +17,166 @@ int minix_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 			start, end, datasync);
 }
 
+static ssize_t minix_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	ssize_t ret;
+
+	inode_lock_shared(inode);
+
+	const struct iomap_ops *ops = minix_iomap_ops_ver(inode);
+
+	ret = iomap_dio_rw(iocb, to, ops, NULL, 0, NULL, 0);
+	inode_unlock_shared(inode);
+	return ret;
+}
+
+static int minix_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
+		unsigned int flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	loff_t pos = iocb->ki_pos;
+
+	if (error)
+		return error;
+
+	pos += size;
+	if (size && pos > i_size_read(inode)) {
+		i_size_write(inode, pos);
+		mark_inode_dirty(inode);
+	}
+	return 0;
+}
+
+static const struct iomap_dio_ops minix_dio_write_ops = {
+	.end_io = minix_dio_write_end_io,
+};
+
+static ssize_t minix_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	ssize_t ret;
+	unsigned int flags = 0;
+	unsigned long blocksize = inode->i_sb->s_blocksize;
+
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out_unlock;
+
+	ret = kiocb_modified(iocb);
+	if (ret)
+		goto out_unlock;
+
+	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
+		!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize))
+		flags |= IOMAP_DIO_FORCE_WAIT;
+
+	const struct iomap_ops *ops = minix_iomap_ops_ver(inode);
+
+	ret = iomap_dio_rw(iocb, from, ops,
+		&minix_dio_write_ops, flags, NULL, 0);
+	if (ret == -ENOTBLK)
+		ret = 0; /* fallback to buffered */
+
+	if (ret >= 0 && iov_iter_count(from)) {
+		loff_t pos;
+		loff_t endbyte;
+		ssize_t status;
+
+		iocb->ki_flags &= ~IOCB_DIRECT;
+		pos = iocb->ki_pos;
+		status = iomap_file_buffered_write(iocb, from, ops,
+			NULL, NULL);
+		if (unlikely(status < 0)) {
+			ret = status;
+			goto out_unlock;
+		}
+
+		ret += status;
+		endbyte = pos + status - 1;
+		status = filemap_write_and_wait_range(inode->i_mapping, pos, endbyte);
+		if (!status) {
+			invalidate_mapping_pages(inode->i_mapping,
+				pos >> PAGE_SHIFT,
+				endbyte >> PAGE_SHIFT);
+			if (ret > 0)
+				ret = generic_write_sync(iocb, ret);
+		} else {
+			ret = status;
+		}
+	}
+
+out_unlock:
+	inode_unlock(inode);
+	return ret;
+}
+
+static ssize_t minix_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return minix_dio_read_iter(iocb, to);
+
+	return generic_file_read_iter(iocb, to);
+}
+
+static ssize_t minix_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	ssize_t ret;
+
+	/* minix_dio_write_iter also locks the inode and appears to do the same
+	 * general sorts of checks as this, so just return directly from there.
+	 */
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return minix_dio_write_iter(iocb, from);
+
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto unlock;
+
+	ret = file_modified(iocb->ki_filp);
+	if (ret)
+		goto unlock;
+
+	const struct iomap_ops *ops = minix_iomap_ops_ver(inode);
+
+	ret = iomap_file_buffered_write(iocb, from, ops,
+			NULL, NULL);
+
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+
+unlock:
+	inode_unlock(inode);
+	return ret;
+}
+
+static int minix_file_open(struct inode *inode, struct file *filp)
+{
+	filp->f_mode |= FMODE_CAN_ODIRECT;
+	return generic_file_open(inode, filp);
+}
+
 /*
- * We have mostly NULLs here: the current defaults are OK for
- * the minix filesystem.
+ * We still have some NULLs here, but not as many of the current defaults are
+ * still OK for the minix filesystem.
  */
+
 const struct file_operations minix_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= generic_file_write_iter,
+	.read_iter	= minix_file_read_iter,
+	.write_iter	= minix_file_write_iter,
 	.mmap_prepare	= generic_file_mmap_prepare,
+	.open		= minix_file_open,
 	.fsync		= minix_fsync,
 	.splice_read	= filemap_splice_read,
+	.splice_write	= iter_file_splice_write,
 };
 
-static int minix_setattr(struct mnt_idmap *idmap,
-			 struct dentry *dentry, struct iattr *attr)
+int minix_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+	struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
 	int error;
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index c30cc590698d..8a79ff82a656 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -436,6 +436,31 @@ static int minix_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return 0;
 }
 
+static ssize_t minix_writeback_range(struct iomap_writepage_ctx *wpc,
+	struct folio *folio, u64 pos, unsigned int len, u64 end_pos)
+{
+	int error;
+
+	if (pos < wpc->iomap.offset ||
+			pos >= wpc->iomap.offset + wpc->iomap.length) {
+		if (INODE_VERSION(wpc->inode) == MINIX_V1)
+			error = V1_minix_iomap_begin(wpc->inode, pos, len, IOMAP_WRITE,
+				&wpc->iomap, NULL);
+		else
+			error = V2_minix_iomap_begin(wpc->inode, pos, len, IOMAP_WRITE,
+				&wpc->iomap, NULL);
+		if (error)
+			return error;
+	}
+
+	return iomap_add_to_ioend(wpc, folio, pos, end_pos, len);
+}
+
+static const struct iomap_writeback_ops minix_writeback_ops = {
+	.writeback_range = minix_writeback_range,
+	.writeback_submit = iomap_ioend_writeback_submit,
+};
+
 static int minix_get_block(struct inode *inode, sector_t block,
 		    struct buffer_head *bh_result, int create)
 {
@@ -445,17 +470,45 @@ static int minix_get_block(struct inode *inode, sector_t block,
 		return V2_minix_get_block(inode, block, bh_result, create);
 }
 
-static int minix_writepages(struct address_space *mapping,
+/* The old minix_writepages, preserved for directory operations. */
+static int minix_block_writepages(struct address_space *mapping,
 		struct writeback_control *wbc)
 {
 	return mpage_writepages(mapping, wbc, minix_get_block);
 }
 
+static int minix_writepages(struct address_space *mapping,
+		struct writeback_control *wbc)
+{
+	struct iomap_writepage_ctx wpc = {
+		.inode = mapping->host,
+		.wbc = wbc,
+		.ops = &minix_writeback_ops,
+	};
+	return iomap_writepages(&wpc);
+}
+
 static int minix_read_folio(struct file *file, struct folio *folio)
+{
+	const struct iomap_ops *ops = minix_iomap_ops_ver(folio->mapping->host);
+
+	iomap_bio_read_folio(folio, ops);
+	return 0;
+}
+
+/* The old minix_read_folio, preserved for directory operations. */
+static int minix_block_read_folio(struct file *file, struct folio *folio)
 {
 	return block_read_full_folio(folio, minix_get_block);
 }
 
+static void minix_readahead(struct readahead_control *rac)
+{
+	const struct iomap_ops *ops = minix_iomap_ops_ver(rac->mapping->host);
+
+	iomap_bio_readahead(rac, ops);
+}
+
 int minix_prepare_chunk(struct folio *folio, loff_t pos, unsigned len)
 {
 	return __block_write_begin(folio, pos, len, minix_get_block);
@@ -487,24 +540,42 @@ static int minix_write_begin(const struct kiocb *iocb,
 
 static sector_t minix_bmap(struct address_space *mapping, sector_t block)
 {
-	return generic_block_bmap(mapping,block,minix_get_block);
+	const struct iomap_ops *ops = minix_iomap_ops_ver(mapping->host);
+
+	return iomap_bmap(mapping, block, ops);
 }
 
-static const struct address_space_operations minix_aops = {
-	.dirty_folio	= block_dirty_folio,
-	.invalidate_folio = block_invalidate_folio,
+const struct address_space_operations minix_aops = {
+	.dirty_folio	= iomap_dirty_folio,
+	.invalidate_folio = iomap_invalidate_folio,
 	.read_folio = minix_read_folio,
+	.readahead = minix_readahead,
 	.writepages = minix_writepages,
+	.migrate_folio = filemap_migrate_folio,
+	.bmap = minix_bmap,
+	.is_partially_uptodate = iomap_is_partially_uptodate,
+	.release_folio = iomap_release_folio,
+	.error_remove_folio = generic_error_remove_folio,
+};
+
+/* A special aops for directories that keeps using the buffer head chunks, at
+ * least for the time being.
+ */
+static const struct address_space_operations minix_dir_aops = {
+	.dirty_folio = block_dirty_folio,
+	.invalidate_folio = block_invalidate_folio,
+	.read_folio = minix_block_read_folio,
 	.write_begin = minix_write_begin,
 	.write_end = generic_write_end,
 	.migrate_folio = buffer_migrate_folio,
 	.bmap = minix_bmap,
-	.direct_IO = noop_direct_IO
+	.writepages = minix_block_writepages,
 };
 
 static const struct inode_operations minix_symlink_inode_operations = {
 	.get_link	= page_get_link,
 	.getattr	= minix_getattr,
+	.setattr	= minix_setattr,
 };
 
 void minix_set_inode(struct inode *inode, dev_t rdev)
@@ -516,7 +587,7 @@ void minix_set_inode(struct inode *inode, dev_t rdev)
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &minix_dir_inode_operations;
 		inode->i_fop = &minix_dir_operations;
-		inode->i_mapping->a_ops = &minix_aops;
+		inode->i_mapping->a_ops = &minix_dir_aops;
 	} else if (S_ISLNK(inode->i_mode)) {
 		inode->i_op = &minix_symlink_inode_operations;
 		inode_nohighmem(inode);
@@ -768,4 +839,3 @@ module_init(init_minix_fs)
 module_exit(exit_minix_fs)
 MODULE_DESCRIPTION("Minix file system");
 MODULE_LICENSE("GPL");
-
diff --git a/fs/minix/iomap.c b/fs/minix/iomap.c
new file mode 100644
index 000000000000..7bb0439e3669
--- /dev/null
+++ b/fs/minix/iomap.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * iomap functions for minix. At least the first pass of this file was taken
+ * from the xiafs iomap.c, which is fitting since the xiafs module in turn
+ * borrowed heavily from the modernized minix fs kernel module.
+ */
+
+/*
+ * minix_iomap_begin - map a file range to disk blocks. It acts as a replacement
+ * for get_block in itree_common.c, at least in the important ways, and is
+ * adapted from it, but it uses iomap instead of buffer_head. This is taken
+ * directly from the out-of-tree xiafs iomap changes, and the exfat iomap
+ * changes were an inspiration for that.
+ */
+static int minix_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+	unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned int blkbits = sb->s_blocksize_bits;
+	sector_t iblock = offset >> blkbits;
+	int create = flags & IOMAP_WRITE;
+
+	/* Mostly taken from modern-xiafs itree.c get_block with elements from
+	 * similar exfat operations.
+	 */
+	int offsets[DEPTH];
+	Indirect chain[DEPTH];
+	Indirect *partial;
+	int depth = block_to_path(inode, iblock, offsets);
+	int left;
+	int err = -EIO;
+
+	sector_t phys;
+
+	/* block is beyond max file size */
+	if (depth == 0)
+		goto out;
+
+	iomap->bdev = inode->i_sb->s_bdev;
+
+reread:
+	partial = get_branch(inode, depth, offsets, chain, &err);
+
+	/* Simplest case - block found, no allocation needed */
+	if (!partial) {
+		/* Bit of a weird order, but it'll make sense when you get to
+		 * the bottom.
+		 */
+		iomap->flags = IOMAP_F_MERGED;
+got_it:
+		phys = block_to_cpu(chain[depth - 1].key);
+		partial = chain+depth-1;
+		/* Set up the iomap struct before cleaning up */
+		iomap->type = IOMAP_MAPPED;
+		iomap->addr = (u64)phys << blkbits;
+		iomap->length = 1 << blkbits;
+		iomap->offset = (u64)iblock << blkbits;
+		goto cleanup;
+	}
+
+	/* Next simple case - plain lookup or failed read of indirect block */
+	if (!create || err == -EIO) {
+		iomap->type = IOMAP_HOLE;
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->length = 1 << blkbits;
+		iomap->offset = (u64)iblock << blkbits;
+		iomap->flags = 0;
+cleanup:
+		while (partial > chain) {
+			brelse(partial->bh);
+			partial--;
+		}
+out:
+		return err;
+	}
+
+	/*
+	 * Indirect block might be removed by truncate while we were
+	 * reading it. Handling of that case (forget what we've got and
+	 * reread) is taken out of the main path.
+	 */
+	if (err == -EAGAIN)
+		goto changed;
+
+	left = (chain + depth) - partial;
+	err = alloc_branch(inode, left, offsets + (partial - chain), partial);
+	if (err)
+		goto cleanup;
+
+	if (splice_branch(inode, chain, partial, left) < 0)
+		goto changed;
+
+	/* Successful allocation, mapping it. */
+	iomap->flags = IOMAP_F_NEW;
+	goto got_it;
+
+changed:
+	while (partial > chain) {
+		brelse(partial->bh);
+		partial--;
+	}
+	goto reread;
+}
+
+/*
+ * minix_iomap_end ends up being a nop; since minix doesn't have any extents or
+ * transactions to worry about, there isn't anything to update here. The on-disk
+ * indirect blocks get dirtied in minix_iomap_begin.
+ */
+static int minix_iomap_end(struct inode *inode, loff_t offset, loff_t length,
+	ssize_t written, unsigned int flags, struct iomap *iomap)
+{
+	return 0;
+}
diff --git a/fs/minix/itree_common.c b/fs/minix/itree_common.c
index c3cd2c75af9c..5a8b73a7beda 100644
--- a/fs/minix/itree_common.c
+++ b/fs/minix/itree_common.c
@@ -311,7 +311,16 @@ static inline void truncate (struct inode * inode)
 	long iblock;
 
 	iblock = (inode->i_size + sb->s_blocksize -1) >> sb->s_blocksize_bits;
-	block_truncate_page(inode->i_mapping, inode->i_size, get_block);
+
+	/* Depending on what address space operations are being used by the
+	 * inode being truncated, we need to either call iomap_truncate_page or
+	 * block_truncate_page.
+	 */
+	if (inode->i_mapping->a_ops == &minix_aops)
+		iomap_truncate_page(inode, inode->i_size, NULL,
+			minix_iomap_ops_ver(inode), NULL, NULL);
+	else
+		block_truncate_page(inode->i_mapping, inode->i_size, get_block);
 
 	n = block_to_path(inode, iblock, offsets);
 	if (!n)
diff --git a/fs/minix/itree_v1.c b/fs/minix/itree_v1.c
index 1fed906042aa..58c29f4443d3 100644
--- a/fs/minix/itree_v1.c
+++ b/fs/minix/itree_v1.c
@@ -49,6 +49,18 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH])
 }
 
 #include "itree_common.c"
+/* NOTA BENE:
+ *
+ * This is icky to me, but at the same time having it be a standalone C file
+ * that's compiled to object form and linked separately like it is in xiafs is
+ * much nastier in minix because of the different versions of the minix fs that
+ * have some very, very different aspects, like the size of block_t. I don't
+ * like it, but since minix already has this pattern where a common itree file
+ * is included in the itree_v1 and itree_v2(and v3) files, I'm including iomap.c
+ * in these files as well. It does at least avoid exporting some currently
+ * static functions that aren't needed anywhere but itree_common.c and iomap.c.
+ */
+#include "iomap.c"
 
 int V1_minix_get_block(struct inode * inode, long block,
 			struct buffer_head *bh_result, int create)
@@ -61,7 +73,18 @@ void V1_minix_truncate(struct inode * inode)
 	truncate(inode);
 }
 
-unsigned V1_minix_blocks(loff_t size, struct super_block *sb)
+unsigned int V1_minix_blocks(loff_t size, struct super_block *sb)
 {
 	return nblocks(size, sb);
 }
+
+int V1_minix_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+	unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
+{
+	return minix_iomap_begin(inode, offset, length, flags, iomap, srcmap);
+}
+
+const struct iomap_ops V1_minix_iomap_ops = {
+	.iomap_begin = V1_minix_iomap_begin,
+	.iomap_end   = minix_iomap_end,
+};
diff --git a/fs/minix/itree_v2.c b/fs/minix/itree_v2.c
index 9d00f31a2d9d..fc7a5ae8fa1c 100644
--- a/fs/minix/itree_v2.c
+++ b/fs/minix/itree_v2.c
@@ -57,6 +57,10 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH])
 }
 
 #include "itree_common.c"
+/* See the note in itree_v1 in a comment that starts "NOTA BENE" for an
+ * explanation for why iomap.c is included here.
+ */
+#include "iomap.c"
 
 int V2_minix_get_block(struct inode * inode, long block,
 			struct buffer_head *bh_result, int create)
@@ -69,7 +73,18 @@ void V2_minix_truncate(struct inode * inode)
 	truncate(inode);
 }
 
-unsigned V2_minix_blocks(loff_t size, struct super_block *sb)
+unsigned int V2_minix_blocks(loff_t size, struct super_block *sb)
 {
 	return nblocks(size, sb);
 }
+
+int V2_minix_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+	unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
+{
+	return minix_iomap_begin(inode, offset, length, flags, iomap, srcmap);
+}
+
+const struct iomap_ops V2_minix_iomap_ops = {
+	.iomap_begin = V2_minix_iomap_begin,
+	.iomap_end   = minix_iomap_end,
+};
diff --git a/fs/minix/minix.h b/fs/minix/minix.h
index f2025c9b5825..270e4e0620a1 100644
--- a/fs/minix/minix.h
+++ b/fs/minix/minix.h
@@ -5,6 +5,7 @@
 #include <linux/fs.h>
 #include <linux/pagemap.h>
 #include <linux/minix_fs.h>
+#include <linux/iomap.h>
 
 #define INODE_VERSION(inode)	minix_sb(inode->i_sb)->s_version
 #define MINIX_V1		0x0001		/* original minix fs */
@@ -56,7 +57,9 @@ int minix_new_block(struct inode *inode);
 void minix_free_block(struct inode *inode, unsigned long block);
 unsigned long minix_count_free_blocks(struct super_block *sb);
 int minix_getattr(struct mnt_idmap *, const struct path *,
-		struct kstat *, u32, unsigned int);
+		struct kstat *, u32, unsigned);
+int minix_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+	struct iattr *attr);
 int minix_prepare_chunk(struct folio *folio, loff_t pos, unsigned len);
 struct mapping_metadata_bhs *minix_get_metadata_bhs(struct inode *inode);
 int minix_fsync(struct file *file, loff_t start, loff_t end, int datasync);
@@ -80,10 +83,20 @@ int minix_set_link(struct minix_dir_entry *de, struct folio *folio,
 struct minix_dir_entry *minix_dotdot(struct inode*, struct folio **);
 ino_t minix_inode_by_name(struct dentry*);
 
+extern int V1_minix_iomap_begin(struct inode *inode, loff_t offset,
+	loff_t length, unsigned int flags, struct iomap *iomap,
+	struct iomap *srcmap);
+extern int V2_minix_iomap_begin(struct inode *inode, loff_t offset,
+	loff_t length, unsigned int flags, struct iomap *iomap,
+	struct iomap *srcmap);
+
+extern const struct address_space_operations minix_aops;
 extern const struct inode_operations minix_file_inode_operations;
 extern const struct inode_operations minix_dir_inode_operations;
 extern const struct file_operations minix_file_operations;
 extern const struct file_operations minix_dir_operations;
+extern const struct iomap_ops V1_minix_iomap_ops;
+extern const struct iomap_ops V2_minix_iomap_ops;
 
 static inline struct minix_sb_info *minix_sb(struct super_block *sb)
 {
@@ -95,11 +108,17 @@ static inline struct minix_inode_info *minix_i(struct inode *inode)
 	return container_of(inode, struct minix_inode_info, vfs_inode);
 }
 
-static inline unsigned minix_blocks_needed(unsigned bits, unsigned blocksize)
+static inline unsigned int minix_blocks_needed(unsigned int bits, unsigned int blocksize)
 {
 	return DIV_ROUND_UP(bits, blocksize * 8);
 }
 
+static inline const struct iomap_ops *minix_iomap_ops_ver(struct inode *inode)
+{
+	return (INODE_VERSION(inode) == MINIX_V1) ?
+		&V1_minix_iomap_ops : &V2_minix_iomap_ops;
+}
+
 #if defined(CONFIG_MINIX_FS_NATIVE_ENDIAN) && \
 	defined(CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED)
 
@@ -129,7 +148,7 @@ static inline unsigned minix_blocks_needed(unsigned bits, unsigned blocksize)
  * big-endian 16bit indexed bitmaps
  */
 
-static inline int minix_find_first_zero_bit(const void *vaddr, unsigned size)
+static inline int minix_find_first_zero_bit(const void *vaddr, unsigned int size)
 {
 	const unsigned short *p = vaddr, *addr = vaddr;
 	unsigned short num;
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH] fscrypt: Replace mk_users keyring with simple list
From: Eric Biggers @ 2026-06-26 19:02 UTC (permalink / raw)
  To: Luis Henriques
  Cc: linux-fscrypt, Theodore Ts'o, Jaegeuk Kim, Jarkko Sakkinen,
	linux-fsdevel, keyrings, linux-kernel,
	syzbot+f55b043dacf43776b50c, Mohammed EL Kadiri, stable
In-Reply-To: <87tsqpd8d8.fsf@wotan.olymp>

On Fri, Jun 26, 2026 at 09:16:35AM +0100, Luis Henriques wrote:
> Hi Eric!
> 
> On Thu, Jun 18 2026, Eric Biggers wrote:
> 
> > Change mk_users (the set of user claims to an fscrypt master key) from a
> > 'struct key' keyring to a simple linked list.
> >
> > It's still a collection of 'struct key' for quota tracking.  It was
> > originally thought to be natural that a collection of 'struct key'
> > should be held in a 'struct key' keyring.  In reality, it's just been
> > causing problems, similar to how using 'struct key' for the filesystem
> > keyring caused problems and was removed in commit d7e7b9af104c
> > ("fscrypt: stop using keyrings subsystem for fscrypt_master_key").
> >
> > Commit d3a7bd420076 ("fscrypt: clear keyring before calling key_put()")
> > fixed mk_users cleanup to be synchronous.  But that apparently wasn't
> > enough: the keyring subsystem's redundant locking is still generating
> > lockdep false positives due to the interaction with filesystem reclaim.
> >
> > With the simple list, the redundant locking and lockdep issue goes away.
> >
> > Of course, searching a linked list is linear-time whereas the
> > 'struct key' keyring used a fancy constant-time associative array.  But
> > that's fine here, since in practice there's just one entry in the list.
> > In fact the new code is much faster in practice, since it's much smaller
> > and doesn't have to convert the kuid_t into a string to search for it.
> >
> > Reported-by: syzbot+f55b043dacf43776b50c@syzkaller.appspotmail.com
> > Closes: https://syzkaller.appspot.com/bug?extid=f55b043dacf43776b50c
> > Reported-by: Mohammed EL Kadiri <med08elkadiri@gmail.com>
> > Closes: https://lore.kernel.org/keyrings/20260614150041.21172-1-med08elkadiri@gmail.com/
> > Fixes: 23c688b54016 ("fscrypt: allow unprivileged users to add/remove keys for v2 policies")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Eric Biggers <ebiggers@kernel.org>
> > ---
> >
> > I'm planning to take this via the fscrypt tree for 7.2
> 
> I was hoping to have some more time to test this patch, but I don't think
> that will happen any time soon.
> 
> I've done a review of the patch and couldn't find any obvious problem.
> Though, again, a more in-depth review would require more time as it has
> been a while since I took a look into this code.
> 
> I just wonder if this is really stable material.  It's a bit intrusive
> (doesn't even apply cleanly in mainline), but above all it's fixing a
> lockdep false positive.  Other than syzbot, has this been seen in the
> wild?

Thanks!

It applies on top of
"[PATCH] fscrypt: Fix key setup in edge case with multiple data unit sizes"
(https://lore.kernel.org/linux-fscrypt/20260618180652.52742-1-ebiggers@kernel.org/).
This time I tried just relying on the prerequisite-patch-id footer (as
generated by 'git format-patch') to express the dependency.  But
evidently that still doesn't work: for one, 'b4 am' just ignores it.

Since this patch has "Reported-by: syzbot" as well as "Fixes", the
stable maintainers will apply it anyway.  If I actually wanted to stop
that, I'd have to actively oppose the backport, likely multiple times
indefinitely since people will continue to try to backport it.  And
syzkaller would continue to get the lockdep warning on stable kernels.

So I'd rather just get it out the way and backport it the same time as
"fscrypt: Fix key setup in edge case with multiple data unit sizes",
which similarly tweaks some data structures in struct fscrypt_master_key
and needs to be backported too.  "fscrypt: stop using keyrings subsystem
for fscrypt_master_key" several years ago was backported too.

- Eric

^ permalink raw reply

* Re: [PATCH v2 stable/linux-6.18.y 0/2] Backport Fix incorrect overlayfs mmap() and mprotect() LSM access controls
From: Sasha Levin @ 2026-06-26 17:54 UTC (permalink / raw)
  To: viro, brauner, jack, miklos, amir73il, paul, jmorris, serge,
	stephen.smalley.work, omosnace, gregkh, bboscaccy, caixinchen1
  Cc: Sasha Levin, linux-fsdevel, linux-kernel, linux-unionfs,
	linux-security-module, selinux, bpf, stable, lujialin4
In-Reply-To: <20260626075035.143419-1-caixinchen1@huawei.com>

> [PATCH v2 stable/linux-6.18.y 0/2] Backport Fix incorrect overlayfs
> mmap() and mprotect() LSM access controls

Both patches queued for 6.18, thanks.

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [RFC PATCH 1/4] capabily: Add new capable_noaudit
From: Serge E. Hallyn @ 2026-06-26 17:46 UTC (permalink / raw)
  To: Paul Moore
  Cc: cem, linux-fsdevel, jack, djwong, hch, serge,
	linux-security-module, linux-kernel, linux-xfs
In-Reply-To: <CAHC9VhQNURc=d4AOVDF-z29fjLasCiLf120Y-N3txEBccpkfSA@mail.gmail.com>

On Fri, Jun 26, 2026 at 11:31:06AM -0400, Paul Moore wrote:
> On Fri, Jun 26, 2026 at 7:49 AM <cem@kernel.org> wrote:
> >
> > From: Carlos Maiolino <cem@kernel.org>
> >
> > In some situations (quota enforcement bypass in this case) we'd like to
> > check for a specific capability without triggering spurious audit
> > messages from security modules like selinux.
> >
> > Add a new helper so we don't need to use ns_capable_noaudit() directly.
> >
> > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > ---
> >  include/linux/capability.h |  5 +++++
> >  kernel/capability.c        | 17 +++++++++++++++++
> >  2 files changed, 22 insertions(+)
> 
> This is Serge's call, not mine, but FWIW, I somewhat prefer to see
> code use the ns_capable_XXX() variants directly as I like to think it
> means some thought went into ensuring the capability check is being
> done in the right namespace.  Yes, we all know that capable() just
> uses the init namespace, but I like to think that having to type that
> out in the parameter list might be a good double check ;)

Hm, yeah, on he one hand it seems like a nice shortcut, but I still
see people confusing what 'capable' really does, so standardizing on
ns_capable_noaudit(&init_user_ns, x) might be worthwhile.

(and then patch 3 can go)

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-26 17:23 UTC (permalink / raw)
  To: David Laight, Andy Lutomirski
  Cc: H. Peter Anvin, Al Viro, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <20260626092750.58a8de9c@pumpkin>

I am replying to both Andy and David in a single email --- hope that is
not confusing.

On Thu, Jun 25, 2026, at 7:09 PM, Andy Lutomirski wrote:
> On Thu, Jun 25, 2026 at 2:53 PM John Ericson <mail@johnericson.me> wrote:
> >
> > The argument against just having an empty, immutable root directory and
> > calling it a day is the tie-in with a new process-spawning API discussed
> > near the bottom of my original email. I want to have nice secure
> > defaults, rather than forcing the programmer to remember to unshare, but
> > I also don't want to degrade performance by speculatively creating new
> > empty mount namespaces that might just be thrown away. Null fields alone
> > get us both --- security and good performance.
>
> This seems like a false dichotomy.  There's such thing as a singleton.
>
> In fact, we have this spiffy nullfs_fs_get_tree.  It seems relatively
> straightforward to have an API to get an fd to the singleton nullfs,
> and the default for a newly spawned process could even be to have cwd
> pointing at nullfs.

Ah! This is the first I am learning about the new nullfs. OK yes I agree
this gives us both properties, since it is truly immutably empty.

I still have a slight preference for something that also makes
statting/opening/etc. of `/` itself fail, but this is otherwise good ---
there's no denying it.

> root is still harder, because of the shadowing issue.  I think I
> proposed, ages ago, relaxing the chroot rules so that, at least under
> certain circumstances (e.g. the task is not already chrooted) an
> unprivileged task could chroot.  chrooting to nullfs seems like a
> somewhat useful operation.
>
> I can imagine more complex schemes to allow even a chrooted process to
> safely start acting as though their root is nullfs, but that would be
> potentially fairly nasty.  *Maybe* everything would work if there was
> a root-for-dotdot and a separate root-for-absolute-paths, and
> nameidata->root could point to the former, but I'm certainly not
> willing to say that I think this would work with any confidence at
> all.

I really like these ideas!

- Splitting the two uses of root sounds great. Even more generally (at
  least as a thought experiment, I don't like the O(n) performance), one
  can imagine a set of paths one must not `cd ..` past. Conceptually, I
  feel optimistic that inserting another boundary path into the set on
  every `chroot` makes it safe.

- In the original "real root", the "root for .." field could be null,
  since no `..` check is actually needed. Then, if we only want to have
  a single "root for .." (to avoid the O(n)), only the initial
  assignment of it from null to non-null would be unprivileged --- this
  would implement your "task is not already chrooted" idea. Subsequent
  assignment would still be privileged since we are replacing, not
  extending our "set". (The nullable single path means we have 0 or 1
  paths in our set.)

----

On Fri, Jun 26, 2026, at 4:27 AM, David Laight wrote:
>
> You'd also need to sort out the 'pwd' mess.
> The kernel inode always has its real parent, inside a chroot the scan stops
> when the inode is the same as that of the base of the chroot.
> But faf about with namespaces (IIRC I was doing an unshare to get out of
> a network namespace) and that comparison can fail (if the chroot base isn't
> a mount point) - so "../.." can go all the way back to the real root rather
> than stopping at the base of the chroot (as you would expect).
>
> David

I did get the impression that the `..` check is...rather fragile. I am
also thinking that a global setting like `openat2`'s `RESOLVE_BENEATH`
to make `..` never work would be useful; then all manner of chrooting is
trivially safe, because you cannot go up regardless!

----

Given the state of the discussion, I'll go submit my null cwd and root
patch momentarily. The nullfs alternative is quite compelling; to the
extent that I do prefer making the root operations fail as I said above,
I think my best shot is demonstrating that this patch is so small and
lightweight that this slight benefit is paid for by the simplicity of
the implementation.

John

^ permalink raw reply

* Re: [PATCH v10 5/5] ext4: prevent deadlock from duplicate EA inode references on corrupted fs
From: Jan Kara @ 2026-06-26 17:23 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner, linux-ext4, linux-kernel, linux-fsdevel
In-Reply-To: <20260625152941.24788-6-yun.zhou@windriver.com>

On Thu 25-06-26 23:29:41, Yun Zhou wrote:
> On a corrupted filesystem, multiple xattr entries may reference the same
> EA inode.  When ext4_xattr_inode_dec_ref_all() processes such entries, it
> can dec_ref the EA inode (setting nlink=0) and queue it for deferred iput.
> If the deferred worker runs before the loop processes the duplicate entry,
> the second iget() may block on I_FREEING while the worker's eviction waits
> for the caller's transaction to commit -- classic ABBA deadlock.

Hum, this looks possible but it isn't a new thing this patch set
introduces. Even before if you had corrupted filesystem,
ext4_xattr_inode_array_free() from ext4_evict_inode() could deadlock in a
similar way against say ext4_xattr_inode_dec_ref_all() (but practially
anything calling ext4_xattr_inode_iget() while holding a transaction
handle). So please leave this alone for now. We can look into that once
other EA inode settle.

								Honza

> 
> Fix by tracking successfully processed EA inodes on a per-call llist
> (reusing i_ea_iput_node) and skipping any ea_ino already in the list.
> This covers both intra-block duplicates and cross ibody/block duplicates
> in ext4_xattr_delete_inode().
> 
> The actual ext4_put_ea_inode() is deferred until after the processing
> loop completes (ext4_put_ea_inode_llist), ensuring no EA inode is queued
> for eviction while the loop is still iterating.
> 
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
> ---
>  fs/ext4/xattr.c | 68 ++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 7f334349bd4f..5c929043e44a 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1152,11 +1152,41 @@ static int ext4_xattr_restart_fn(handle_t *handle, struct inode *inode,
>  	return 0;
>  }
>  
> +/* Check if an EA inode number is already in the processed llist. */
> +static bool ext4_ea_ino_in_llist(unsigned int ea_ino,
> +				 struct llist_head *processed)
> +{
> +	struct ext4_inode_info *ei;
> +
> +	llist_for_each_entry(ei, processed->first, i_ea_iput_node) {
> +		if (ei->vfs_inode.i_ino == ea_ino)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/* Put all EA inodes on a processed llist via ext4_put_ea_inode. */
> +static void ext4_put_ea_inode_llist(struct super_block *sb,
> +				    struct llist_head *processed)
> +{
> +	struct llist_node *node = llist_del_all(processed);
> +	struct llist_node *next;
> +
> +	while (node) {
> +		struct ext4_inode_info *ei = container_of(node,
> +				struct ext4_inode_info, i_ea_iput_node);
> +		next = node->next;
> +		ext4_put_ea_inode(sb, &ei->vfs_inode);
> +		node = next;
> +	}
> +}
> +
>  static void
>  ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
>  			     struct buffer_head *bh,
>  			     struct ext4_xattr_entry *first, bool block_csum,
> -			     int extra_credits, bool skip_quota)
> +			     int extra_credits, bool skip_quota,
> +			     struct llist_head *processed)
>  {
>  	struct inode *ea_inode;
>  	struct ext4_xattr_entry *entry;
> @@ -1186,6 +1216,11 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
>  		if (!entry->e_value_inum)
>  			continue;
>  		ea_ino = le32_to_cpu(entry->e_value_inum);
> +
> +		/* Skip if already processed (duplicate on corrupted fs) */
> +		if (ext4_ea_ino_in_llist(ea_ino, processed))
> +			continue;
> +
>  		err = ext4_xattr_inode_iget(parent, ea_ino,
>  					    le32_to_cpu(entry->e_hash),
>  					    &ea_inode);
> @@ -1235,7 +1270,12 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
>  		entry->e_value_inum = 0;
>  		entry->e_value_size = 0;
>  
> -		ext4_put_ea_inode(parent->i_sb, ea_inode);
> +		/*
> +		 * Collect processed EA inodes for dedup and deferred iput.
> +		 * ext4_put_ea_inode_llist() handles the actual release
> +		 * after the loop, preventing iget deadlocks on duplicates.
> +		 */
> +		llist_add(&EXT4_I(ea_inode)->i_ea_iput_node, processed);
>  		dirty = true;
>  	}
>  
> @@ -1262,7 +1302,8 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
>  static void
>  ext4_xattr_release_block(handle_t *handle, struct inode *inode,
>  			 struct buffer_head *bh,
> -			 int extra_credits)
> +			 int extra_credits,
> +			 struct llist_head *processed)
>  {
>  	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
>  	u32 hash, ref;
> @@ -1304,7 +1345,8 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
>  						     BFIRST(bh),
>  						     true /* block_csum */,
>  						     extra_credits,
> -						     true /* skip_quota */);
> +						     true /* skip_quota */,
> +						     processed);
>  		ext4_free_blocks(handle, inode, bh, 0, 1,
>  				 EXT4_FREE_BLOCKS_METADATA |
>  				 EXT4_FREE_BLOCKS_FORGET);
> @@ -2171,8 +2213,12 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  
>  	/* Drop the previous xattr block. */
>  	if (bs->bh && bs->bh != new_bh) {
> +		LLIST_HEAD(processed);
> +
>  		ext4_xattr_release_block(handle, inode, bs->bh,
> -					 0 /* extra_credits */);
> +					 0 /* extra_credits */,
> +					 &processed);
> +		ext4_put_ea_inode_llist(inode->i_sb, &processed);
>  	}
>  	error = 0;
>  
> @@ -2866,6 +2912,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>  	struct ext4_xattr_entry *entry;
>  	struct inode *ea_inode;
>  	int error;
> +	LLIST_HEAD(processed);
>  
>  	error = ext4_journal_ensure_credits(handle, extra_credits,
>  			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
> @@ -2897,7 +2944,8 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>  						     IFIRST(header),
>  						     false /* block_csum */,
>  						     extra_credits,
> -						     false /* skip_quota */);
> +						     false /* skip_quota */,
> +						     &processed);
>  	}
>  
>  	if (EXT4_I(inode)->i_file_acl) {
> @@ -2921,6 +2969,11 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>  			     entry = EXT4_XATTR_NEXT(entry)) {
>  				if (!entry->e_value_inum)
>  					continue;
> +				/* Skip EA inodes already dec_ref'd from ibody */
> +				if (ext4_ea_ino_in_llist(
> +					    le32_to_cpu(entry->e_value_inum),
> +					    &processed))
> +					continue;
>  				error = ext4_xattr_inode_iget(inode,
>  					      le32_to_cpu(entry->e_value_inum),
>  					      le32_to_cpu(entry->e_hash),
> @@ -2935,7 +2988,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>  		}
>  
>  		ext4_xattr_release_block(handle, inode, bh,
> -					 extra_credits);
> +					 extra_credits, &processed);
>  		/*
>  		 * Update i_file_acl value in the same transaction that releases
>  		 * block.
> @@ -2951,6 +3004,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>  	}
>  	error = 0;
>  cleanup:
> +	ext4_put_ea_inode_llist(inode->i_sb, &processed);
>  	brelse(iloc.bh);
>  	brelse(bh);
>  	return error;
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v10 2/5] ext4: introduce ext4_put_ea_inode() for safe deferred iput
From: Jan Kara @ 2026-06-26 16:53 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner, linux-ext4, linux-kernel, linux-fsdevel
In-Reply-To: <20260625152941.24788-3-yun.zhou@windriver.com>

On Thu 25-06-26 23:29:38, Yun Zhou wrote:
> Calling iput() on EA inodes while holding xattr_sem or a jbd2 handle
> can trigger write_inode_now() -> ext4_writepages() -> s_writepages_rwsem,
> creating a lock ordering issue during mount (!SB_ACTIVE).
> 
> Add ext4_put_ea_inode() which uses iput_if_not_last() as a fast path.
> If this is not the last reference, it is dropped immediately.  If this
> is the last reference, the inode is linked onto a per-sb lock-free llist
> via i_ea_iput_node (embedded in ext4_inode_info, sharing space with the
> unused xattr_sem of EA inodes via a union) and a delayed worker
> (1 jiffie) performs the final iput() in a clean context.  This avoids
> per-iput memory allocation.
> 
> Convert the first call site: ext4_xattr_block_set()'s "Drop the
> previous xattr block" path, which previously called
> ext4_xattr_inode_array_free() under xattr_sem + jbd2 handle.
> 
> The worker is drained in ext4_put_super() before quota shutdown using
> a loop to handle re-arming (evicting an EA inode may queue further EA
> inodes).  Initialization is placed before journal loading since fast
> commit replay may trigger evictions that call ext4_put_ea_inode().
> 
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
> Suggested-by: Jan Kara <jack@suse.cz>

Thanks for the patch. It looks mostly good. Some comments below.

> @@ -5497,6 +5505,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  			  ext4_has_feature_orphan_present(sb) ||
>  			  ext4_has_feature_journal_needs_recovery(sb));
>  
> +	ext4_init_ea_inode_work(sbi);
> +
>  	if (ext4_has_feature_mmp(sb) && !sb_rdonly(sb)) {
>  		err = ext4_multi_mount_protect(sb, le64_to_cpu(es->s_mmp_block));
>  		if (err)
> @@ -5508,6 +5518,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	 * The first inode we look at is the journal inode.  Don't try
>  	 * root first: it may be modified in the journal!
>  	 */
> +

Stray addition of empty line.

>  	if (!test_opt(sb, NOLOAD) && ext4_has_feature_journal(sb)) {
>  		err = ext4_load_and_init_journal(sb, es, ctx);
>  		if (err)


> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 982a1f831e22..ecdad5920b14 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -117,6 +117,8 @@ const struct xattr_handler * const ext4_xattr_handlers[] = {
>  static int
>  ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
>  			struct inode *inode);
> +static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
> +				struct ext4_xattr_inode_array *array);
>  
>  #ifdef CONFIG_LOCKDEP
>  void ext4_xattr_inode_set_class(struct inode *ea_inode)
> @@ -2187,7 +2189,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  		ext4_xattr_release_block(handle, inode, bs->bh,
>  					 &ea_inode_array,
>  					 0 /* extra_credits */);
> -		ext4_xattr_inode_array_free(ea_inode_array);
> +		ext4_xattr_inode_array_free_deferred(inode->i_sb,
> +						     ea_inode_array);
>  	}
>  	error = 0;
>  
> @@ -3025,6 +3028,74 @@ void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *ea_inode_array)
>  	kfree(ea_inode_array);
>  }
>  
> +static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
> +				struct ext4_xattr_inode_array *array)
> +{
> +	int idx;
> +
> +	if (array == NULL)
> +		return;
> +
> +	for (idx = 0; idx < array->count; ++idx)
> +		ext4_put_ea_inode(sb, array->inodes[idx]);
> +	kfree(array);
> +}

It's strange to introduce this only to delete it two patches later. I'd
just introduce the mechanism in patch 1. Convert callsites in patch 2,
replace ext4_xattr_inode_array_free() mechanism in patch 3.

> +
> +/*
> + * Worker function for deferred EA inode iput.  Processes all inodes queued
> + * on s_ea_inode_to_free in a context free of xattr_sem/jbd2 handle locks.
> + */
> +static void ext4_ea_inode_work(struct work_struct *work)
> +{
> +	struct ext4_sb_info *sbi = container_of(to_delayed_work(work),
> +						struct ext4_sb_info,
> +						s_ea_inode_work);
> +	struct llist_node *node = llist_del_all(&sbi->s_ea_inode_to_free);
> +	struct llist_node *next;
> +
> +	while (node) {
> +		struct ext4_inode_info *ei = container_of(node,
> +					struct ext4_inode_info, i_ea_iput_node);
> +		next = node->next;
> +		iput(&ei->vfs_inode);
> +		node = next;

'next' is actually not needed in this function. You can directly do:
  node = node->next;

> +/*
> + * Release a VFS reference on an EA inode.  Must be used instead of iput()
> + * in any context where xattr_sem or a jbd2 handle is held.
> + *
> + * If this is not the last reference, drops it immediately via
> + * iput_if_not_last() with no further action needed.
> + *
> + * If this is the last reference, the inode is linked onto a per-sb
> + * llist via i_ea_iput_node (embedded in ext4_inode_info, sharing space
> + * with the unused xattr_sem) and a delayed worker performs the final
> + * iput() in a clean context.

I'd add here: Note that if an inode is in s_ea_inode_to_free list, the
inode reference implicitely associated with that prevents
any future iput_if_not_last() from failing and so nobody will try to add
the inode to s_ea_inode_to_free for the second time until iput() in
ext4_ea_inode_work drops that reference.

> + */
> +void ext4_put_ea_inode(struct super_block *sb, struct inode *inode)
> +{
> +	if (!inode)
> +		return;
> +	WARN_ON_ONCE(!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL));
> +	if (iput_if_not_last(inode))
> +		return;
> +	llist_add(&EXT4_I(inode)->i_ea_iput_node,
> +		  &EXT4_SB(sb)->s_ea_inode_to_free);
> +	/*
> +	 * Use a short delay to allow multiple EA inodes to accumulate,
> +	 * reducing workqueue wakeups when several are released together.
> +	 */
> +	schedule_delayed_work(&EXT4_SB(sb)->s_ea_inode_work, 1);
> +}
> +
> +void ext4_init_ea_inode_work(struct ext4_sb_info *sbi)
> +{
> +	init_llist_head(&sbi->s_ea_inode_to_free);
> +	INIT_DELAYED_WORK(&sbi->s_ea_inode_work, ext4_ea_inode_work);
> +}
> +
>  /*
>   * ext4_xattr_block_cache_insert()
>   *
> diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
> index 1fedf44d4fb6..9883ba5569a1 100644
> --- a/fs/ext4/xattr.h
> +++ b/fs/ext4/xattr.h
> @@ -190,6 +190,20 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>  				   struct ext4_xattr_inode_array **array,
>  				   int extra_credits);
>  extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
> +extern void ext4_init_ea_inode_work(struct ext4_sb_info *sbi);
> +extern void ext4_put_ea_inode(struct super_block *sb, struct inode *inode);
> +
> +/*
> + * Drain all pending deferred EA inode iputs.  Must be called before
> + * freeing resources that eviction depends on (quota, block allocator).
> + * Loops because worker iput may trigger eviction that re-queues.
> + */

Can you please explain how iput of EA inode can trigger iput of another EA
inode? So far I don't think that's possible but perhaps I'm missing
something.

> +static inline void ext4_drain_ea_inode_work(struct ext4_sb_info *sbi)
> +{
> +	while (flush_delayed_work(&sbi->s_ea_inode_work) ||
> +	       !llist_empty(&sbi->s_ea_inode_to_free))
> +		;
> +}
>  
>  extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
>  			    struct ext4_inode *raw_inode, handle_t *handle);

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v10 1/5] fs: add iput_if_not_last() helper
From: Jan Kara @ 2026-06-26 16:30 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner, linux-ext4, linux-kernel, linux-fsdevel
In-Reply-To: <20260625152941.24788-2-yun.zhou@windriver.com>

On Thu 25-06-26 23:29:37, Yun Zhou wrote:
> Add a helper that drops an inode reference only if the caller does not
> hold the last one.  Returns true if the reference was dropped, false
> otherwise.
> 
> This is useful for filesystems that need to release inode references
> in contexts where triggering final iput (and thus eviction) would be
> unsafe due to lock ordering constraints.  The caller can check the
> return value and defer the final iput to a safe context.
> 
> Unlike iput_not_last() which BUG_ON's if called with the last ref,
> this variant is designed to be called speculatively.
> 
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
> Suggested-by: Jan Kara <jack@suse.cz>

Yes, I think this is a sensible addition to the inode refcounting API which
will allow filesystems to save the hassle of iput offloading in the common
case. Just one nit below, otherwise feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6da44573ce45..4916a9d54347 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2418,6 +2418,19 @@ static inline void super_set_sysfs_name_generic(struct super_block *sb, const ch
>  extern void ihold(struct inode * inode);
>  extern void iput(struct inode *);
>  void iput_not_last(struct inode *);
> +
> +/**
> + * iput_if_not_last - drop an inode reference only if it is not the last one
> + * @inode: inode to put
> + *
> + * Returns true if the reference was dropped, false if this was the last
> + * reference and the caller must arrange for final iput() in a safe context.
> + */
> +static inline bool iput_if_not_last(struct inode *inode)

I'd add __must_check to the declaration.

								Honza

> +{
> +	return atomic_add_unless(&inode->i_count, -1, 1);
> +}
> +
>  int inode_update_time(struct inode *inode, enum fs_update_time type,
>  		unsigned int flags);
>  int generic_update_time(struct inode *inode, enum fs_update_time type,
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 0/2] fs: support $ORIGIN in ELF interpreter paths
From: David Laight @ 2026-06-26 16:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, John Ericson, Farid Zakaria, Jan Kara,
	Kees Cook, Al Viro, shuah, linux-fsdevel, linux-mm,
	linux-kselftest, LKML
In-Reply-To: <CAG48ez3o1Mt59dWdyknh_SEaoi-cHv6pdmix+aOdBy3C0-LkYQ@mail.gmail.com>

On Fri, 26 Jun 2026 15:34:12 +0200
Jann Horn <jannh@google.com> wrote:

> On Fri, Jun 26, 2026 at 3:26 PM David Laight
> <david.laight.linux@gmail.com> wrote:
> > On Fri, 26 Jun 2026 14:39:22 +0200
> > Jann Horn <jannh@google.com> wrote:
> >  
> > > On Thu, Jun 25, 2026 at 10:50 AM Christian Brauner <brauner@kernel.org> wrote:  
> > > > The arguments I have heard from various people so far are:
> > > >
> > > > (1) Userspace would be able to clone a random chroot to /woot and run a
> > > >     binary from it without having to set up a complicated sandbox
> > > >     effectively making dynamically linked binaries more like static
> > > >     binaries in a sense.
> > > >
> > > > (2) Quote:
> > > >     "If you debootstrap/dnf a chroot to some location in your
> > > >     home dir and try to run a binary from it, that it tries to load the
> > > >     libraries from your /usr is a pretty unintuitive and not at all
> > > >     useful behavior."
> > > >
> > > > (3) Quote:
> > > >     "[Various remote execution things run in locked down containers that
> > > >     disable userns, which makes the sandbox impossible and hence our
> > > >     builds wouldn't work there."  
> > >
> > > FWIW I think someone also mentioned to me that it would make things
> > > easier for them if they could build a piece of software in one
> > > environment and then bundle it up with all required libraries and such
> > > and run it in a very different environment, without
> > > container/sandboxing stuff and without static linking. But I guess
> > > that's kinda niche.  
> >
> > The problem with 'ship the shared libraries with the application' is
> > that you get all the problems of static linking.
> > If there is a bug in the library code you can't fix it without getting the
> > 3rd party to rebuild their application package.  
> 
> Yes, it's appropriate for weird use cases like "I want to run this
> historical version of the software and its dependencies", it's not
> necessarily a good idea for normal application use.

That's what LD_LIBRARY_PATH is for ...

And if you want to use a different elf interpreter just run it and pass the
program name and arguments to it. eg:
	/lib64/ld-linux-x64-64.so.2 /bin/echo fubar
Last time I did that I was trying to run non-linux ppc elf program.
I got part way there, but needed to build a lot more of libc.

	David

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-26 16:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <20260626001538.GO2636677@ZenIV>

On Thu, Jun 25, 2026, at 8:15 PM, Al Viro wrote:
> On Wed, Jun 24, 2026 at 11:41:07PM -0400, John Ericson wrote:
>
> > But I don't want that global state.
>
> Don't use it, then... out of curiosity, does that extend to stdout et.al.?

Good question; it turns out I like the standard streams much better!

First of all, the standard streams are just an idiom --- there is
nothing actually special about file descriptors 0, 1, and 2. That's a
clean design --- the kernel doesn't need to know about userspace idioms.

Second of all, if you don't want any of those, you can just close 'em!
You can't do that with the cwd, however. It's stuck open.

Ideally `*at` would have been with us from the beginning, and, say, file
descriptor 3 would have been the "current working directory" merely by
convention.

> Would that kind of thing added kernel-side assist the development of such
> library?  Maybe, but I wouldn't bet too much on that - if you start from
> scratch, you can trivially verify that you don't even attempt given
> set of syscalls and if you use libc as a starting point, you get to
> debug all the failure exits you've added...

First of all, I am trying to change what processes are allowed to do,
and this includes programs I did not write. A libc-based solution is the
program cooperating with its own sandboxing; this is not a solution for
running arbitrary programs which may not be trusted in a restricted
manner.

Second of all, this would be very laborious in practice, because we're
talking not about what syscalls the program uses, but about what data is
passed in those syscalls. Any program that consumes arbitrary user input
(like shell utilities) might receive an absolute or relative path, and
so it would have to manually check for that, lest the user input "trick"
the program into using the root dir and cwd it is trying to ignore.

Making a tiny few edits in the kernel path resolution logic to allow for
these null fields is much more practical than defending a much broader
perimeter in userspace.

> > The programmer (or coding agent) is
> > encouraged to do everything with file descriptors rather than path
> > concatenations etc., because they need to use `*at` anyways, and then
> > voilà, without browbeating anyone in security seminars or code review, a
> > bunch of TOCTOU issues disappear simply because doing the right thing is
> > now the path of least resistance.
>
> I'm sorry, but the path of least resistance is picking a snippet from google
> that will implement open(), etc., on top of your setup and using it.
> _Especially_ if coding agents are going to be involved, precisely because
> they'll do a convincing simulation of human duhveloper's behaviour, i.e.
> "cut'n'paste it from the net".

We agree! But this is precisely why it is important to make these things
fail. Mindless Stack Overflow cut'n'pasters (human or agent) still run
their program to make sure it works. Making the thing you don't want
them to do *actually fail* creates sufficiently strong and incremental
feedback that they will end up doing the right thing.

> > The current working directory, roughly, is *just* some global state
> > holding a directory file descriptor.
>
> So's the descriptor table; what's the difference?

Now that I've responded to everything else, I can answer this in
summary:

- File descriptors can be closed; cwd and root cannot be.

- File descriptors need to be explicitly used in syscalls. The cwd and
  root are implicitly used (in too many different syscalls to make
  syscall-level auditing practical) based on the sort of path string
  argument to the syscall, without the program's explicit consent.

John

^ permalink raw reply

* Re: [PATCH] mm: do file ownership checks with the proper mount idmap
From: David Hildenbrand (Arm) @ 2026-06-26 16:02 UTC (permalink / raw)
  To: Pedro Falcato, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle), Andrew Morton, Liam R. Howlett
  Cc: Jan Kara, Vlastimil Babka, Jann Horn, linux-fsdevel, linux-mm,
	linux-kernel, stable
In-Reply-To: <20260625153853.913949-1-pfalcato@suse.de>

On 6/25/26 17:38, Pedro Falcato wrote:
> Ever since idmapped mounts were introduced, inode ownership checks
> (for side-channel protection) in mincore() and madvise(MADV_PAGEOUT) were
> done against the nop_mnt_idmap, which completely ignores the file's mount's
> idmap. This results in odd edgecases like:
> 
> 1) mount/bind-mount with an idmap userA:userB:1
> 2) userB runs an owner_or_capable() check on file that is owned by userA
> on-disk/in-memory, but owned by userB after idmap translation
> 3) owner_or_capable() mysteriously fails as the correct idmap wasn't supplied
> 
> In the case of mincore/madvise MADV_PAGEOUT, this is usually benign, because
> file_permission(file, MAY_WRITE) will probably succeed, as it uses the proper
> idmap internally, but it does not need to be the case on e.g a 0444 file
> where even the owner itself doesn't have permissions to write to it.
> 
> Since this is clearly not trivial to get right, introduce a
> file_owner_or_capable() that can carry the correct semantics, and switch
> the various users in mm to it.
> 
> The issue was found by manual code inspection & an off-list discussion with
> Jan Kara.
> 
> Fixes: 9caccd41541a ("fs: introduce MOUNT_ATTR_IDMAP")
> Cc: stable@vger.kernel.org
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>

MM side LGTM

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* Re: [RFC] fanotify for flock release
From: Jeff Layton @ 2026-06-26 15:51 UTC (permalink / raw)
  To: Amir Goldstein, Jori Koolstra
  Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel@vger.kernel.org,
	Christian Brauner
In-Reply-To: <CAOQ4uxioUmS5O7fo9f0pZHTc_drzJCVstRipqSwGYKMBy3DUxA@mail.gmail.com>

On Fri, 2026-06-26 at 17:32 +0200, Amir Goldstein wrote:
> On Thu, Jun 25, 2026 at 3:02 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
> > 
> > Hi Jan and Amir, (cc Christian)
> cc Jeff Layton
> 
> > 
> > There has been a wish from systemd to be able to be notified on flock(2) releases.[1]
> > I've been looking at the locks.c code (I really wish requests were decoupled
> > from locks... :) ) and the fanotify code, and this seems to be a rather
> > straightforward expansion of existing fanotify functionality. Before sending a
> > patch up, are there any objections to this? If we implement this should we
> > also do POSIX locks notifications? And what about lock taking?
> 
> An event on funlock is very specific.
> Maybe if you can come up with some more generic semantics.
> 
> TBH it feels like functionality that could be added to flock() -
> waiting for resources to be available without actually taking the lock.
> Sounds like something that would be easy to implement, but arguing
> over UAPI is never ending fun...
> 

I've no objection to Jori's idea in principle, but I'd also ask:

Why release and not also lock acquisition? Also, why only flock() and
not POSIX locks? Would leases or layouts also be interesting?

I don't really understand the use-case though, so I'm not sure what to
suggest as far as a uAPI or backend implementation.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [RFC] fanotify for flock release
From: Amir Goldstein @ 2026-06-26 15:32 UTC (permalink / raw)
  To: Jori Koolstra
  Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel@vger.kernel.org,
	Christian Brauner, Jeff Layton
In-Reply-To: <716044027.2969861.1782392539951@kpc.webmail.kpnmail.nl>

On Thu, Jun 25, 2026 at 3:02 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
>
> Hi Jan and Amir, (cc Christian)
cc Jeff Layton

>
> There has been a wish from systemd to be able to be notified on flock(2) releases.[1]
> I've been looking at the locks.c code (I really wish requests were decoupled
> from locks... :) ) and the fanotify code, and this seems to be a rather
> straightforward expansion of existing fanotify functionality. Before sending a
> patch up, are there any objections to this? If we implement this should we
> also do POSIX locks notifications? And what about lock taking?

An event on funlock is very specific.
Maybe if you can come up with some more generic semantics.

TBH it feels like functionality that could be added to flock() -
waiting for resources to be available without actually taking the lock.
Sounds like something that would be easy to implement, but arguing
over UAPI is never ending fun...

Thanks,
Amir.

^ permalink raw reply

* Re: [RFC PATCH 1/4] capabily: Add new capable_noaudit
From: Paul Moore @ 2026-06-26 15:31 UTC (permalink / raw)
  To: cem
  Cc: linux-fsdevel, jack, djwong, hch, serge, linux-security-module,
	linux-kernel, linux-xfs
In-Reply-To: <20260626114533.102138-2-cem@kernel.org>

On Fri, Jun 26, 2026 at 7:49 AM <cem@kernel.org> wrote:
>
> From: Carlos Maiolino <cem@kernel.org>
>
> In some situations (quota enforcement bypass in this case) we'd like to
> check for a specific capability without triggering spurious audit
> messages from security modules like selinux.
>
> Add a new helper so we don't need to use ns_capable_noaudit() directly.
>
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  include/linux/capability.h |  5 +++++
>  kernel/capability.c        | 17 +++++++++++++++++
>  2 files changed, 22 insertions(+)

This is Serge's call, not mine, but FWIW, I somewhat prefer to see
code use the ns_capable_XXX() variants directly as I like to think it
means some thought went into ensuring the capability check is being
done in the right namespace.  Yes, we all know that capable() just
uses the init namespace, but I like to think that having to type that
out in the parameter list might be a good double check ;)

-- 
paul-moore.com

^ permalink raw reply

* Re: [PATCH v12 00/16] Direct Map Removal Support for guest_memfd
From: Brendan Jackman @ 2026-06-26 15:28 UTC (permalink / raw)
  To: Takahiro Itazuri, seanjc, ljs
  Cc: Liam.Howlett, ackerleytng, agordeev, ajones, akpm, alex, andrii,
	aou, ast, baolu.lu, catalin.marinas, chenhuacai, corbet, coxu,
	daniel, dave.hansen, david, dev.jain, itazur, jackmanb, jannh,
	jhubbard, jmattson, joey.gouly, john.fastabend, jolsa, jthoughton,
	kas, kernel, kpsingh, kvm, kvmarm, lenb, linux-arm-kernel,
	linux-doc, linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	linux-pm, linux-riscv, linux-s390, loongarch, lorenzo.stoakes,
	luto, maobibo, martin.lau, maz, mhocko, mingo, mlevitsk,
	nikita.kalyazin, oupton, palmer, patrick.roy, pavel, pbonzini,
	peterx, peterz, pfalcato, pjw, prsampat, rafael, riel, rppt,
	ryan.roberts, sdf, shijie, skhan, song, surenb, suzuki.poulose,
	svens, tabba, tglx, thuth, urezki, vannapurve, vbabka, will,
	willy, wu.fei9, x86, yang, yangyicong, yonghong.song, yosry,
	yu-cheng.yu, yuzenghui, zhengqi.arch
In-Reply-To: <20260506080753.14517-1-itazur@amazon.com>

On Wed May 6, 2026 at 8:07 AM UTC, Takahiro Itazuri wrote:
> Hi Lorenzo and Sean,
>
> Apologies for the delayed reply — Nikita is leaving Amazon, and I'm
> taking over this series going forward. Thanks for your patience.
>
> On Tue, Apr 21, 2026 at 01:40:00PM +0000, Lorenzo Stoakes wrote:
>> Hm, given this touches a fair bit of mm, I wonder if we shouldn't try to do this
>> through the mm tree?
>
> On Tue, Apr 21, 2026 at 04:36:00PM +0000, Sean Christopherson wrote:
>> Yeah, when the time comes, the mm pieces definitely need to go through the mm
>> tree.  Ideally, I think this would be merged in two separate parts, with all mm
>> changes going through the mm tree, and then the KVM changes through the KVM tree
>> using a stable topic branch/tag from Andrew.
>
> Thanks for the guidance. The split makes sense to me; I'm planning to
> follow this approach with patches 1-6 (mm) going through the mm tree
> and patches 7-16 (KVM) through the KVM tree on top of a stable
> branch/tag from mm. I'll confirm the exact boundary and coordination
> details as I prepare the repost.
>
> On Tue, Apr 21, 2026 at 01:40:00PM +0000, Lorenzo Stoakes wrote:
>> In any case, we definitely need a rebase on something not-next :) if not mm then
>> Linus's tree at least maybe?
>>
>> I'm seeing a lot of conflicts against mm-unstable, it can't b4 shazam even patch
>> 1 and in Linus's tree it's failing at an mm patch (mm: introduce
>> AS_NO_DIRECT_MAP).
>

Just as an FYI, I am gonna look at trying to move this forward a bit
while Takahiro ramps up on taking over (I spoke to him about this off
list).

My ulterior motive is that this would give me an excuse to add
ALLOC_UNMAPPED (formerly __GFP_UNMAPPED) and the mermap [0] which
unblocks various security nonsense including ASI [1]. (But also, this
feature is a good idea).

[0] https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/
[1] https://linuxasi.dev

^ permalink raw reply

* Re: [RFC PATCH 3/4] xfs: replace ns_capable_noaudit()
From: Darrick J. Wong @ 2026-06-26 15:19 UTC (permalink / raw)
  To: cem
  Cc: linux-fsdevel, jack, hch, serge, linux-security-module,
	linux-kernel, linux-xfs
In-Reply-To: <20260626114533.102138-4-cem@kernel.org>

On Fri, Jun 26, 2026 at 01:45:22PM +0200, cem@kernel.org wrote:
> From: Carlos Maiolino <cem@kernel.org>
> 
> We don't need to use ns_capable_noaudit() as all we care is the initial
> user namespace, use capable_noaudit() instead.

Might as well do the one in xfs_fsmap.c too, since it was originally a
capable() call.

--D

> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  fs/xfs/xfs_trans_dquot.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> index 50e5b323f7f1..30c2f6ec0aac 100644
> --- a/fs/xfs/xfs_trans_dquot.c
> +++ b/fs/xfs/xfs_trans_dquot.c
> @@ -835,7 +835,7 @@ xfs_trans_dqresv(
>  	if ((flags & XFS_QMOPT_FORCE_RES) == 0 &&
>  	    dqp->q_id &&
>  	    xfs_dquot_is_enforced(dqp) &&
> -	    !ns_capable_noaudit(&init_user_ns, CAP_SYS_RESOURCE)) {
> +	    !capable_noaudit(CAP_SYS_RESOURCE)) {
>  		int		quota_nl;
>  		bool		fatal;
>  
> -- 
> 2.54.0
> 
> 

^ permalink raw reply

* Re: [PATCH 1/3] fs, mm: add ->cachestat() file operation
From: Amir Goldstein @ 2026-06-26 15:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pavel Tikhomirov, Johannes Weiner, Miklos Szeredi, Alexander Viro,
	Jan Kara, Matthew Wilcox (Oracle), Andrew Morton, Nhat Pham,
	Shuah Khan, linux-unionfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest
In-Reply-To: <20260625-lehrbuch-gehalt-wichen-1b2f3533efbe@brauner>

On Thu, Jun 25, 2026 at 12:36 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On 2026-06-23 17:34:47+02:00, Amir Goldstein wrote:
> > On Tue, Jun 23, 2026 at 4:55 PM Pavel Tikhomirov
> > <ptikhomirov@virtuozzo.com> wrote:
> >
> > > On 6/23/26 15:48, Johannes Weiner wrote:
> > >
> > > Yes, AFAIU in overlay when we use realfile we should always use it
> > > with_ovl_creds(), even though I don't think there is anything cred related
> > > in filemap_cachestat(), I still think we should follow the common pattern
> > > other overlay helpers use (similar to ovl_fadvise() and ovl_flush()).
> > >
> > > note: Actually some places get ovl_real_file() and use it without
> > > with_ovl_creds(), e.g.: ovl_read_iter, ovl_write_iter, ovl_splice_read,
> > > ovl_splice_write. But those look more of an exception than the general
> > > rule. All other instances use with_ovl_creds().
> >
> > Use with_ovl_creds() is a good practice to keep the mental security model,
> > but it is useless if the security check (can_do_cachestat) is not in the
> > vfs helper (vfs_cachestat), so please move it there.
> >
> > Also it kind of makes more sense to check (flags != 0) in sys_cachestats
> > before checking permissions.
> >
> > > Also there are simingly no other file_operations which return "realfile"
> > > for further processing, mostly the operation from fsops simply replaces
> > > general operation with its own logic completely.
> > >
> > > Thanks for your review!
> > >
> > > ps: Hope overlay maintainers will correctly if I'm getting this wrong.
> >
> > I don't think this is wrong per-se, except for can_do_cachestat().
> >
> > Just be aware that the real file could change from one cachestat
> > call to the next (i.e. due to copy up).
>
> I'm really grump about adding a new file operation just for a
> special-sauce system call which is under a CONFIG_* option even. We're
> not going to set the precedent of piling on custom file operations for a
> single filesystem everytime someone adds a new system call unless
> absolutely necessary.

I had a similar reaction.

> This looks like it could just use a new helper in
> fs/backing_file.c that the cachestat thing can call to use the correct
> file.
>

I was considering suggesting this as well.
Having f_real_file() complement

but technically, can_do_cachestat() should be checked against
the overlayfs file/inode AND also against the real file/inode with
ovl_creds.

I'd love to be able to provide a backing_file "template" for
operations, but I don't have a good idea how to do that.
Do you?

Thanks,
Amir.

^ permalink raw reply

* Re: [RFC PATCH 2/4] quota: Don't issue audit messages on quota enforcing
From: Darrick J. Wong @ 2026-06-26 15:18 UTC (permalink / raw)
  To: cem
  Cc: linux-fsdevel, jack, hch, serge, linux-security-module,
	linux-kernel, linux-xfs
In-Reply-To: <20260626114533.102138-3-cem@kernel.org>

On Fri, Jun 26, 2026 at 01:45:21PM +0200, cem@kernel.org wrote:
> From: Carlos Maiolino <cem@kernel.org>
> 
> Calling capable() to determine if we can bypass quota enforcement or not
> can trigger spurious audit messages. We don't really require it here so
> just use the capable_noaudit() version.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  fs/quota/dquot.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 64cf42721496..1122a29215f7 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -1308,7 +1308,7 @@ static int ignore_hardlimit(struct dquot *dquot)
>  {
>  	struct mem_dqinfo *info = &sb_dqopt(dquot->dq_sb)->info[dquot->dq_id.type];
>  
> -	return capable(CAP_SYS_RESOURCE) &&
> +	return capable_noaudit(CAP_SYS_RESOURCE) &&

Yeah, we're just checking if we're going to enforce hardlimits, not
actually denying something based on lack of capability.  For all we know
the user is well under their disk quota limit.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

>  	       (info->dqi_format->qf_fmt_id != QFMT_VFS_OLD ||
>  		!(info->dqi_flags & DQF_ROOT_SQUASH));
>  }
> -- 
> 2.54.0
> 
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox