public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
       [not found] ` <9035b82cff08a3801cef3d06bbf2778b2e5a4dba.1731684329.git.josef@toxicpanda.com>
@ 2025-01-31 19:17   ` Alex Williamson
  2025-01-31 19:59     ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2025-01-31 19:17 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4, Peter Xu,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org


20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")

This breaks huge_fault support for PFNMAPs that was recently added in
v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
order mappings.  Thanks,

Alex


On Fri, 15 Nov 2024 10:30:28 -0500
Josef Bacik <josef@toxicpanda.com> wrote:

> There's nothing stopping us from supporting this, we could simply pass
> the order into the helper and emit the proper length.  However currently
> there's no tests to validate this works properly, so disable it until
> there's a desire to support this along with the appropriate tests.
> 
> Reviewed-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  mm/memory.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index bdf77a3ec47b..843ad75a4148 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -78,6 +78,7 @@
>  #include <linux/ptrace.h>
>  #include <linux/vmalloc.h>
>  #include <linux/sched/sysctl.h>
> +#include <linux/fsnotify.h>
>  
>  #include <trace/events/kmem.h>
>  
> @@ -5637,8 +5638,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> +	struct file *file = vma->vm_file;
>  	if (vma_is_anonymous(vma))
>  		return do_huge_pmd_anonymous_page(vmf);
> +	/*
> +	 * Currently we just emit PAGE_SIZE for our fault events, so don't allow
> +	 * a huge fault if we have a pre content watch on this file.  This would
> +	 * be trivial to support, but there would need to be tests to ensure
> +	 * this works properly and those don't exist currently.
> +	 */
> +	if (fsnotify_file_has_pre_content_watches(file))
> +		return VM_FAULT_FALLBACK;
>  	if (vma->vm_ops->huge_fault)
>  		return vma->vm_ops->huge_fault(vmf, PMD_ORDER);
>  	return VM_FAULT_FALLBACK;
> @@ -5648,6 +5658,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> +	struct file *file = vma->vm_file;
>  	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
>  	vm_fault_t ret;
>  
> @@ -5662,6 +5673,9 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>  	}
>  
>  	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> +		/* See comment in create_huge_pmd. */
> +		if (fsnotify_file_has_pre_content_watches(file))
> +			goto split;
>  		if (vma->vm_ops->huge_fault) {
>  			ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER);
>  			if (!(ret & VM_FAULT_FALLBACK))
> @@ -5681,9 +5695,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&			\
>  	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
>  	struct vm_area_struct *vma = vmf->vma;
> +	struct file *file = vma->vm_file;
>  	/* No support for anonymous transparent PUD pages yet */
>  	if (vma_is_anonymous(vma))
>  		return VM_FAULT_FALLBACK;
> +	/* See comment in create_huge_pmd. */
> +	if (fsnotify_file_has_pre_content_watches(file))
> +		return VM_FAULT_FALLBACK;
>  	if (vma->vm_ops->huge_fault)
>  		return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> @@ -5695,12 +5713,16 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&			\
>  	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
>  	struct vm_area_struct *vma = vmf->vma;
> +	struct file *file = vma->vm_file;
>  	vm_fault_t ret;
>  
>  	/* No support for anonymous transparent PUD pages yet */
>  	if (vma_is_anonymous(vma))
>  		goto split;
>  	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> +		/* See comment in create_huge_pmd. */
> +		if (fsnotify_file_has_pre_content_watches(file))
> +			goto split;
>  		if (vma->vm_ops->huge_fault) {
>  			ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER);
>  			if (!(ret & VM_FAULT_FALLBACK))


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-01-31 19:17   ` [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches Alex Williamson
@ 2025-01-31 19:59     ` Linus Torvalds
  2025-02-01  1:19       ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2025-01-31 19:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, amir73il, brauner,
	viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4, Peter Xu,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org

On Fri, 31 Jan 2025 at 11:17, Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
>
> This breaks huge_fault support for PFNMAPs that was recently added in
> v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
> order mappings.

Surely only for content watches?

Which shouldn't be a valid situation *anyway*.

IOW, there must be some unrelated bug somewhere: either somebody is
allowed to set a pre-content match on a special device.

That should be disabled by the whole

        /*
         * If there are permission event watchers but no pre-content event
         * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
         */

thing in file_set_fsnotify_mode() which only allows regular files and
directories to be notified on.

Or, alternatively, that check for huge-fault disabling is just
checking the wrong bits.

Or - quite possibly - I am missing something obvious?

             Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-01-31 19:59     ` Linus Torvalds
@ 2025-02-01  1:19       ` Peter Xu
  2025-02-01 14:38         ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2025-02-01  1:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Williamson, Josef Bacik, kernel-team, linux-fsdevel, jack,
	amir73il, brauner, viro, linux-xfs, linux-btrfs, linux-mm,
	linux-ext4, linux-kernel@vger.kernel.org, kvm@vger.kernel.org

On Fri, Jan 31, 2025 at 11:59:56AM -0800, Linus Torvalds wrote:
> On Fri, 31 Jan 2025 at 11:17, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> >
> > 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
> >
> > This breaks huge_fault support for PFNMAPs that was recently added in
> > v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
> > order mappings.
> 
> Surely only for content watches?
> 
> Which shouldn't be a valid situation *anyway*.
> 
> IOW, there must be some unrelated bug somewhere: either somebody is
> allowed to set a pre-content match on a special device.
> 
> That should be disabled by the whole
> 
>         /*
>          * If there are permission event watchers but no pre-content event
>          * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
>          */
> 
> thing in file_set_fsnotify_mode() which only allows regular files and
> directories to be notified on.
> 
> Or, alternatively, that check for huge-fault disabling is just
> checking the wrong bits.
> 
> Or - quite possibly - I am missing something obvious?

Is it possible that we have some paths got overlooked in setting up the
fsnotify bits in f_mode? Meanwhile since the default is "no bit set" on
those bits, I think it means FMODE_FSNOTIFY_HSM() can always return true on
those if overlooked..

One thing to mention is, /dev/vfio/* are chardevs, however the PCI bars are
not mmap()ed from these fds - whatever under /dev/vfio/* represents IOMMU
groups rather than the device fd itself.

The app normally needs to first open the IOMMU group fd under /dev/vfio/*,
then using VFIO ioctl(VFIO_GROUP_GET_DEVICE_FD) to get the device fd, which
will be the mmap() target, instead of the ones under /dev.

I checked, those device fds were allocated from vfio_device_open_file()
within the ioctl, which internally uses anon_inode_getfile().  I don't see
anywhere in that path that will set the fanotify bits..

Further, I'm not sure whether some callers of alloc_file() can also suffer
from similar issue, because at least memfd_create() syscall also uses the
API, which (hopefully?) would used to allow THPs for shmem backed memfds on
aligned mmap()s, but not sure whether it'll also wrongly trigger the
FALLBACK path similarly in create_huge_pmd() just like vfio's VMAs.  I
didn't verify it though, nor did I yet check more users.

So I wonder whether we should setup the fanotify bits in at least
alloc_file() too (to FMODE_NONOTIFY?).

I'm totally not familiar with fanotify, and it's a bit late to try verify
anything (I cannot quickly find my previous huge pfnmap setup, so setup
those will also take time..). but maybe above can provide some clues for
others..

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-02-01  1:19       ` Peter Xu
@ 2025-02-01 14:38         ` Christian Brauner
  2025-02-02  0:58           ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-02-01 14:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linus Torvalds, Alex Williamson, Josef Bacik, kernel-team,
	linux-fsdevel, jack, amir73il, viro, linux-xfs, linux-btrfs,
	linux-mm, linux-ext4, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org

On Fri, Jan 31, 2025 at 08:19:22PM -0500, Peter Xu wrote:
> On Fri, Jan 31, 2025 at 11:59:56AM -0800, Linus Torvalds wrote:
> > On Fri, 31 Jan 2025 at 11:17, Alex Williamson
> > <alex.williamson@redhat.com> wrote:
> > >
> > > 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
> > >
> > > This breaks huge_fault support for PFNMAPs that was recently added in
> > > v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
> > > order mappings.
> > 
> > Surely only for content watches?
> > 
> > Which shouldn't be a valid situation *anyway*.
> > 
> > IOW, there must be some unrelated bug somewhere: either somebody is
> > allowed to set a pre-content match on a special device.
> > 
> > That should be disabled by the whole
> > 
> >         /*
> >          * If there are permission event watchers but no pre-content event
> >          * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
> >          */
> > 
> > thing in file_set_fsnotify_mode() which only allows regular files and
> > directories to be notified on.
> > 
> > Or, alternatively, that check for huge-fault disabling is just
> > checking the wrong bits.
> > 
> > Or - quite possibly - I am missing something obvious?
> 
> Is it possible that we have some paths got overlooked in setting up the
> fsnotify bits in f_mode? Meanwhile since the default is "no bit set" on
> those bits, I think it means FMODE_FSNOTIFY_HSM() can always return true on
> those if overlooked..
> 
> One thing to mention is, /dev/vfio/* are chardevs, however the PCI bars are
> not mmap()ed from these fds - whatever under /dev/vfio/* represents IOMMU
> groups rather than the device fd itself.
> 
> The app normally needs to first open the IOMMU group fd under /dev/vfio/*,
> then using VFIO ioctl(VFIO_GROUP_GET_DEVICE_FD) to get the device fd, which
> will be the mmap() target, instead of the ones under /dev.

Ok, but those "device fds" aren't really device fds in the sense that
they are character fds. They are regular files afaict from:

vfio_device_open_file(struct vfio_device *device)

(Well, it's actually worse as anon_inode_getfile() files don't have any
mode at all but that's beside the point.)?

In any case, I think you're right that such files would (accidently?)
qualify for content watches afaict. So at least that should probably get
FMODE_NONOTIFY.

> 
> I checked, those device fds were allocated from vfio_device_open_file()
> within the ioctl, which internally uses anon_inode_getfile().  I don't see
> anywhere in that path that will set the fanotify bits..
> 
> Further, I'm not sure whether some callers of alloc_file() can also suffer

Sidenote, mm/memfd.c should pretty please rename alloc_file() to
memfd_alloc_file() or something. That would be great because
alloc_file() is a local fs/file_table.c helper and grepping for it is
confusing as I first thought someone made alloc_file() available outside
of fs/file_table.c

> from similar issue, because at least memfd_create() syscall also uses the
> API, which (hopefully?) would used to allow THPs for shmem backed memfds on
> aligned mmap()s, but not sure whether it'll also wrongly trigger the
> FALLBACK path similarly in create_huge_pmd() just like vfio's VMAs.  I
> didn't verify it though, nor did I yet check more users.
> 
> So I wonder whether we should setup the fanotify bits in at least
> alloc_file() too (to FMODE_NONOTIFY?).
> 
> I'm totally not familiar with fanotify, and it's a bit late to try verify
> anything (I cannot quickly find my previous huge pfnmap setup, so setup
> those will also take time..). but maybe above can provide some clues for
> others..
> 
> Thanks,
> 
> -- 
> Peter Xu
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-02-01 14:38         ` Christian Brauner
@ 2025-02-02  0:58           ` Linus Torvalds
  2025-02-02  7:46             ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2025-02-02  0:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Peter Xu, Alex Williamson, Josef Bacik, kernel-team,
	linux-fsdevel, jack, amir73il, viro, linux-xfs, linux-btrfs,
	linux-mm, linux-ext4, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org

On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
>
> Ok, but those "device fds" aren't really device fds in the sense that
> they are character fds. They are regular files afaict from:
>
> vfio_device_open_file(struct vfio_device *device)
>
> (Well, it's actually worse as anon_inode_getfile() files don't have any
> mode at all but that's beside the point.)?
>
> In any case, I think you're right that such files would (accidently?)
> qualify for content watches afaict. So at least that should probably get
> FMODE_NONOTIFY.

Hmm. Can we just make all anon_inodes do that? I don't think you can
sanely have pre-content watches on anon-inodes, since you can't really
have access to them to _set_ the content watch from outside anyway..

In fact, maybe do it in alloc_file_pseudo()?

Amir / Josef?

              Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-02-02  0:58           ` Linus Torvalds
@ 2025-02-02  7:46             ` Amir Goldstein
  2025-02-02 10:04               ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2025-02-02  7:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Peter Xu, Alex Williamson, Josef Bacik,
	kernel-team, linux-fsdevel, jack, viro, linux-xfs, linux-btrfs,
	linux-mm, linux-ext4, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org

On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Ok, but those "device fds" aren't really device fds in the sense that
> > they are character fds. They are regular files afaict from:
> >
> > vfio_device_open_file(struct vfio_device *device)
> >
> > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > mode at all but that's beside the point.)?
> >
> > In any case, I think you're right that such files would (accidently?)
> > qualify for content watches afaict. So at least that should probably get
> > FMODE_NONOTIFY.
>
> Hmm. Can we just make all anon_inodes do that? I don't think you can
> sanely have pre-content watches on anon-inodes, since you can't really
> have access to them to _set_ the content watch from outside anyway..
>
> In fact, maybe do it in alloc_file_pseudo()?
>

The problem is that we cannot set FMODE_NONOTIFY -
we tried that once but it regressed some workloads watching
write on pipe fd or something.

and the no-pre-content is a flag combination (to save FMODE_ flags)
which makes things a bit messy.

We could try to initialize f_mode to FMODE_NONOTIFY_PERM
for anon_inode, which opts out of both permission and pre-content
events and leaves the legacy inotify workloads unaffected.

But, then code like this will not do the right thing:

        /* We refuse fsnotify events on ptmx, since it's a shared resource */
        filp->f_mode |= FMODE_NONOTIFY;

We will need to convert all those to use a helper.
I am traveling today so will be able to look closer tomorrow.

Jan,

What do you think?

Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-02-02  7:46             ` Amir Goldstein
@ 2025-02-02 10:04               ` Christian Brauner
  2025-02-03 12:41                 ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-02-02 10:04 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Linus Torvalds, Peter Xu, Alex Williamson, Josef Bacik,
	kernel-team, linux-fsdevel, jack, viro, linux-xfs, linux-btrfs,
	linux-mm, linux-ext4, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org

On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > Ok, but those "device fds" aren't really device fds in the sense that
> > > they are character fds. They are regular files afaict from:
> > >
> > > vfio_device_open_file(struct vfio_device *device)
> > >
> > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > mode at all but that's beside the point.)?
> > >
> > > In any case, I think you're right that such files would (accidently?)
> > > qualify for content watches afaict. So at least that should probably get
> > > FMODE_NONOTIFY.
> >
> > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > sanely have pre-content watches on anon-inodes, since you can't really
> > have access to them to _set_ the content watch from outside anyway..
> >
> > In fact, maybe do it in alloc_file_pseudo()?
> >
> 
> The problem is that we cannot set FMODE_NONOTIFY -
> we tried that once but it regressed some workloads watching
> write on pipe fd or something.

Ok, that might be true. But I would assume that most users of
alloc_file_pseudo() or the anonymous inode infrastructure will not care
about fanotify events. I would not go for a separate helper. It'd be
nice to keep the number of file allocation functions low.

I'd rather have the subsystems that want it explicitly opt-in to
fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
broken fanotify support for e.g., nsfs already. So make the subsystems
think about whether they actually want to support it.

I would disqualify all anonymous inodes and see what actually does
break. I naively suspect that almost no one uses anonymous inodes +
fanotify. I'd be very surprised.

I'm currently traveling (see you later btw) but from a very cursory
reading I would naively suspect the following:

// Suspects for FMODE_NONOTIFY
drivers/dma-buf/dma-buf.c:      file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
drivers/scsi/cxlflash/ocxl_hw.c:        file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
fs/anon_inodes.c:       file = alloc_file_pseudo(inode, anon_inode_mnt, name,
fs/hugetlbfs/inode.c:           file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
kernel/bpf/token.c:     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
block/bdev.c:   bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)

// Suspects for ~FMODE_NONOTIFY
fs/aio.c:       file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
fs/pipe.c:      f = alloc_file_pseudo(inode, pipe_mnt, "",
mm/shmem.c:             res = alloc_file_pseudo(inode, mnt, name, O_RDWR,

// Unsure:
fs/nfs/nfs4file.c:      filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,
net/socket.c:   file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,

> 
> and the no-pre-content is a flag combination (to save FMODE_ flags)
> which makes things a bit messy.
> 
> We could try to initialize f_mode to FMODE_NONOTIFY_PERM
> for anon_inode, which opts out of both permission and pre-content
> events and leaves the legacy inotify workloads unaffected.
> 
> But, then code like this will not do the right thing:
> 
>         /* We refuse fsnotify events on ptmx, since it's a shared resource */
>         filp->f_mode |= FMODE_NONOTIFY;
> 
> We will need to convert all those to use a helper.
> I am traveling today so will be able to look closer tomorrow.
> 
> Jan,
> 
> What do you think?
> 
> Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-02-02 10:04               ` Christian Brauner
@ 2025-02-03 12:41                 ` Jan Kara
  2025-02-03 20:39                   ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2025-02-03 12:41 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Linus Torvalds, Peter Xu, Alex Williamson,
	Josef Bacik, kernel-team, linux-fsdevel, jack, viro, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org

On Sun 02-02-25 11:04:02, Christian Brauner wrote:
> On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> > On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > Ok, but those "device fds" aren't really device fds in the sense that
> > > > they are character fds. They are regular files afaict from:
> > > >
> > > > vfio_device_open_file(struct vfio_device *device)
> > > >
> > > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > > mode at all but that's beside the point.)?
> > > >
> > > > In any case, I think you're right that such files would (accidently?)
> > > > qualify for content watches afaict. So at least that should probably get
> > > > FMODE_NONOTIFY.
> > >
> > > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > > sanely have pre-content watches on anon-inodes, since you can't really
> > > have access to them to _set_ the content watch from outside anyway..
> > >
> > > In fact, maybe do it in alloc_file_pseudo()?
> > >
> > 
> > The problem is that we cannot set FMODE_NONOTIFY -
> > we tried that once but it regressed some workloads watching
> > write on pipe fd or something.
> 
> Ok, that might be true. But I would assume that most users of
> alloc_file_pseudo() or the anonymous inode infrastructure will not care
> about fanotify events. I would not go for a separate helper. It'd be
> nice to keep the number of file allocation functions low.
> 
> I'd rather have the subsystems that want it explicitly opt-in to
> fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
> broken fanotify support for e.g., nsfs already. So make the subsystems
> think about whether they actually want to support it.

Agreed, that would be a saner default.

> I would disqualify all anonymous inodes and see what actually does
> break. I naively suspect that almost no one uses anonymous inodes +
> fanotify. I'd be very surprised.
> 
> I'm currently traveling (see you later btw) but from a very cursory
> reading I would naively suspect the following:
> 
> // Suspects for FMODE_NONOTIFY
> drivers/dma-buf/dma-buf.c:      file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
> drivers/scsi/cxlflash/ocxl_hw.c:        file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
> fs/anon_inodes.c:       file = alloc_file_pseudo(inode, anon_inode_mnt, name,
> fs/hugetlbfs/inode.c:           file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> kernel/bpf/token.c:     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> block/bdev.c:   bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
> drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
> 
> // Suspects for ~FMODE_NONOTIFY
> fs/aio.c:       file = alloc_file_pseudo(inode, aio_mnt, "[aio]",

This is just a helper file for managing aio context so I don't think any
notification makes sense there (events are not well defined). So I'd say
FMODE_NONOTIFY here as well.

> fs/pipe.c:      f = alloc_file_pseudo(inode, pipe_mnt, "",
> mm/shmem.c:             res = alloc_file_pseudo(inode, mnt, name, O_RDWR,

This is actually used for stuff like IPC SEM where notification doesn't
make sense. It's also used when mmapping /dev/zero but that struct file
isn't easily accessible to userspace so overall I'd say this should be
FMODE_NONOTIFY as well.

> // Unsure:
> fs/nfs/nfs4file.c:      filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,

AFAICS this struct file is for copy offload and doesn't leave the kernel.
Hence FMODE_NONOTIFY should be fine.

> net/socket.c:   file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,

In this case I think we need to be careful. It's a similar case as pipes so
probably we should use ~FMODE_NONOTIFY here from pure caution.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-02-03 12:41                 ` Jan Kara
@ 2025-02-03 20:39                   ` Amir Goldstein
  2025-02-03 21:41                     ` Alex Williamson
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2025-02-03 20:39 UTC (permalink / raw)
  To: Jan Kara, Alex Williamson
  Cc: Christian Brauner, Linus Torvalds, Peter Xu, Josef Bacik,
	kernel-team, linux-fsdevel, viro, linux-xfs, linux-btrfs,
	linux-mm, linux-ext4, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org

On Mon, Feb 3, 2025 at 1:41 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sun 02-02-25 11:04:02, Christian Brauner wrote:
> > On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> > > On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > > > >
> > > > > Ok, but those "device fds" aren't really device fds in the sense that
> > > > > they are character fds. They are regular files afaict from:
> > > > >
> > > > > vfio_device_open_file(struct vfio_device *device)
> > > > >
> > > > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > > > mode at all but that's beside the point.)?
> > > > >
> > > > > In any case, I think you're right that such files would (accidently?)
> > > > > qualify for content watches afaict. So at least that should probably get
> > > > > FMODE_NONOTIFY.
> > > >
> > > > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > > > sanely have pre-content watches on anon-inodes, since you can't really
> > > > have access to them to _set_ the content watch from outside anyway..
> > > >
> > > > In fact, maybe do it in alloc_file_pseudo()?
> > > >
> > >
> > > The problem is that we cannot set FMODE_NONOTIFY -
> > > we tried that once but it regressed some workloads watching
> > > write on pipe fd or something.
> >
> > Ok, that might be true. But I would assume that most users of
> > alloc_file_pseudo() or the anonymous inode infrastructure will not care
> > about fanotify events. I would not go for a separate helper. It'd be
> > nice to keep the number of file allocation functions low.
> >
> > I'd rather have the subsystems that want it explicitly opt-in to
> > fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
> > broken fanotify support for e.g., nsfs already. So make the subsystems
> > think about whether they actually want to support it.
>
> Agreed, that would be a saner default.
>
> > I would disqualify all anonymous inodes and see what actually does
> > break. I naively suspect that almost no one uses anonymous inodes +
> > fanotify. I'd be very surprised.
> >
> > I'm currently traveling (see you later btw) but from a very cursory
> > reading I would naively suspect the following:
> >
> > // Suspects for FMODE_NONOTIFY
> > drivers/dma-buf/dma-buf.c:      file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> > drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
> > drivers/scsi/cxlflash/ocxl_hw.c:        file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
> > fs/anon_inodes.c:       file = alloc_file_pseudo(inode, anon_inode_mnt, name,
> > fs/hugetlbfs/inode.c:           file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> > kernel/bpf/token.c:     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> > block/bdev.c:   bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
> > drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
> >
> > // Suspects for ~FMODE_NONOTIFY
> > fs/aio.c:       file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
>
> This is just a helper file for managing aio context so I don't think any
> notification makes sense there (events are not well defined). So I'd say
> FMODE_NONOTIFY here as well.
>
> > fs/pipe.c:      f = alloc_file_pseudo(inode, pipe_mnt, "",
> > mm/shmem.c:             res = alloc_file_pseudo(inode, mnt, name, O_RDWR,
>
> This is actually used for stuff like IPC SEM where notification doesn't
> make sense. It's also used when mmapping /dev/zero but that struct file
> isn't easily accessible to userspace so overall I'd say this should be
> FMODE_NONOTIFY as well.

I think there is another code path that the audit missed for getting these
pseudo files not via alloc_file_pseudo():
ipc/shm.c:      file = alloc_file_clone(base, f_flags,

which does not copy f_mode as far as I can tell.

>
> > // Unsure:
> > fs/nfs/nfs4file.c:      filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,
>
> AFAICS this struct file is for copy offload and doesn't leave the kernel.
> Hence FMODE_NONOTIFY should be fine.
>
> > net/socket.c:   file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
>
> In this case I think we need to be careful. It's a similar case as pipes so
> probably we should use ~FMODE_NONOTIFY here from pure caution.
>

I tried this approach with patch:
"fsnotify: disable notification by default for all pseudo files"

But I also added another patch:
"fsnotify: disable pre-content and permission events by default"

So that code paths that we missed such as alloc_file_clone()
will not have pre-content events enabled.

Alex,

Can you please try this branch:

https://github.com/amir73il/linux/commits/fsnotify-fixes/

and verify that it fixes your issue.

The branch contains one prep patch:
"fsnotify: use accessor to set FMODE_NONOTIFY_*"
and two independent Fixes patches.

Assuming that it fixes your issue, can you please test each of the
Fixes patches individually, because every one of them should be fixing
the issue independently and every one of them could break something,
so we may end up reverting it later on.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-02-03 20:39                   ` Amir Goldstein
@ 2025-02-03 21:41                     ` Alex Williamson
  2025-02-03 22:04                       ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2025-02-03 21:41 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, Linus Torvalds, Peter Xu,
	Josef Bacik, kernel-team, linux-fsdevel, viro, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org

On Mon, 3 Feb 2025 21:39:27 +0100
Amir Goldstein <amir73il@gmail.com> wrote:

> On Mon, Feb 3, 2025 at 1:41 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sun 02-02-25 11:04:02, Christian Brauner wrote:  
> > > On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:  
> > > > On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:  
> > > > >
> > > > > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:  
> > > > > >
> > > > > > Ok, but those "device fds" aren't really device fds in the sense that
> > > > > > they are character fds. They are regular files afaict from:
> > > > > >
> > > > > > vfio_device_open_file(struct vfio_device *device)
> > > > > >
> > > > > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > > > > mode at all but that's beside the point.)?
> > > > > >
> > > > > > In any case, I think you're right that such files would (accidently?)
> > > > > > qualify for content watches afaict. So at least that should probably get
> > > > > > FMODE_NONOTIFY.  
> > > > >
> > > > > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > > > > sanely have pre-content watches on anon-inodes, since you can't really
> > > > > have access to them to _set_ the content watch from outside anyway..
> > > > >
> > > > > In fact, maybe do it in alloc_file_pseudo()?
> > > > >  
> > > >
> > > > The problem is that we cannot set FMODE_NONOTIFY -
> > > > we tried that once but it regressed some workloads watching
> > > > write on pipe fd or something.  
> > >
> > > Ok, that might be true. But I would assume that most users of
> > > alloc_file_pseudo() or the anonymous inode infrastructure will not care
> > > about fanotify events. I would not go for a separate helper. It'd be
> > > nice to keep the number of file allocation functions low.
> > >
> > > I'd rather have the subsystems that want it explicitly opt-in to
> > > fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
> > > broken fanotify support for e.g., nsfs already. So make the subsystems
> > > think about whether they actually want to support it.  
> >
> > Agreed, that would be a saner default.
> >  
> > > I would disqualify all anonymous inodes and see what actually does
> > > break. I naively suspect that almost no one uses anonymous inodes +
> > > fanotify. I'd be very surprised.
> > >
> > > I'm currently traveling (see you later btw) but from a very cursory
> > > reading I would naively suspect the following:
> > >
> > > // Suspects for FMODE_NONOTIFY
> > > drivers/dma-buf/dma-buf.c:      file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> > > drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
> > > drivers/scsi/cxlflash/ocxl_hw.c:        file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
> > > fs/anon_inodes.c:       file = alloc_file_pseudo(inode, anon_inode_mnt, name,
> > > fs/hugetlbfs/inode.c:           file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> > > kernel/bpf/token.c:     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > > mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> > > block/bdev.c:   bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
> > > drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
> > >
> > > // Suspects for ~FMODE_NONOTIFY
> > > fs/aio.c:       file = alloc_file_pseudo(inode, aio_mnt, "[aio]",  
> >
> > This is just a helper file for managing aio context so I don't think any
> > notification makes sense there (events are not well defined). So I'd say
> > FMODE_NONOTIFY here as well.
> >  
> > > fs/pipe.c:      f = alloc_file_pseudo(inode, pipe_mnt, "",
> > > mm/shmem.c:             res = alloc_file_pseudo(inode, mnt, name, O_RDWR,  
> >
> > This is actually used for stuff like IPC SEM where notification doesn't
> > make sense. It's also used when mmapping /dev/zero but that struct file
> > isn't easily accessible to userspace so overall I'd say this should be
> > FMODE_NONOTIFY as well.  
> 
> I think there is another code path that the audit missed for getting these
> pseudo files not via alloc_file_pseudo():
> ipc/shm.c:      file = alloc_file_clone(base, f_flags,
> 
> which does not copy f_mode as far as I can tell.
> 
> >  
> > > // Unsure:
> > > fs/nfs/nfs4file.c:      filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,  
> >
> > AFAICS this struct file is for copy offload and doesn't leave the kernel.
> > Hence FMODE_NONOTIFY should be fine.
> >  
> > > net/socket.c:   file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,  
> >
> > In this case I think we need to be careful. It's a similar case as pipes so
> > probably we should use ~FMODE_NONOTIFY here from pure caution.
> >  
> 
> I tried this approach with patch:
> "fsnotify: disable notification by default for all pseudo files"
> 
> But I also added another patch:
> "fsnotify: disable pre-content and permission events by default"
> 
> So that code paths that we missed such as alloc_file_clone()
> will not have pre-content events enabled.
> 
> Alex,
> 
> Can you please try this branch:
> 
> https://github.com/amir73il/linux/commits/fsnotify-fixes/
> 
> and verify that it fixes your issue.
> 
> The branch contains one prep patch:
> "fsnotify: use accessor to set FMODE_NONOTIFY_*"
> and two independent Fixes patches.
> 
> Assuming that it fixes your issue, can you please test each of the
> Fixes patches individually, because every one of them should be fixing
> the issue independently and every one of them could break something,
> so we may end up reverting it later on.

Test #1:

fsnotify: disable pre-content and permission events by default
fsnotify: disable notification by default for all pseudo files
fsnotify: use accessor to set FMODE_NONOTIFY_*

Result: Pass, vfio-pci huge_fault observed

Test #2:

fsnotify: disable notification by default for all pseudo files
fsnotify: use accessor to set FMODE_NONOTIFY_*

Result: Pass, vfio-pci huge_fault observed

Test #3:

fsnotify: disable pre-content and permission events by default
fsnotify: use accessor to set FMODE_NONOTIFY_*

Result: Pass, vfio-pci huge_fault observed

Test #4 (control):

fsnotify: use accessor to set FMODE_NONOTIFY_*

Result: Fail, no vfio-pci huge_fault observed

For any combination of the Fixes patches:

Tested-by: Alex Williamson <alex.williamson@redhat.com>

Thanks!
Alex


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
  2025-02-03 21:41                     ` Alex Williamson
@ 2025-02-03 22:04                       ` Amir Goldstein
  0 siblings, 0 replies; 11+ messages in thread
From: Amir Goldstein @ 2025-02-03 22:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jan Kara, Christian Brauner, Linus Torvalds, Peter Xu,
	Josef Bacik, kernel-team, linux-fsdevel, viro, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org

On Mon, Feb 3, 2025 at 10:41 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Mon, 3 Feb 2025 21:39:27 +0100
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > On Mon, Feb 3, 2025 at 1:41 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Sun 02-02-25 11:04:02, Christian Brauner wrote:
> > > > On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> > > > > On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> > > > > <torvalds@linux-foundation.org> wrote:
> > > > > >
> > > > > > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > > > > > >
> > > > > > > Ok, but those "device fds" aren't really device fds in the sense that
> > > > > > > they are character fds. They are regular files afaict from:
> > > > > > >
> > > > > > > vfio_device_open_file(struct vfio_device *device)
> > > > > > >
> > > > > > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > > > > > mode at all but that's beside the point.)?
> > > > > > >
> > > > > > > In any case, I think you're right that such files would (accidently?)
> > > > > > > qualify for content watches afaict. So at least that should probably get
> > > > > > > FMODE_NONOTIFY.
> > > > > >
> > > > > > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > > > > > sanely have pre-content watches on anon-inodes, since you can't really
> > > > > > have access to them to _set_ the content watch from outside anyway..
> > > > > >
> > > > > > In fact, maybe do it in alloc_file_pseudo()?
> > > > > >
> > > > >
> > > > > The problem is that we cannot set FMODE_NONOTIFY -
> > > > > we tried that once but it regressed some workloads watching
> > > > > write on pipe fd or something.
> > > >
> > > > Ok, that might be true. But I would assume that most users of
> > > > alloc_file_pseudo() or the anonymous inode infrastructure will not care
> > > > about fanotify events. I would not go for a separate helper. It'd be
> > > > nice to keep the number of file allocation functions low.
> > > >
> > > > I'd rather have the subsystems that want it explicitly opt-in to
> > > > fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
> > > > broken fanotify support for e.g., nsfs already. So make the subsystems
> > > > think about whether they actually want to support it.
> > >
> > > Agreed, that would be a saner default.
> > >
> > > > I would disqualify all anonymous inodes and see what actually does
> > > > break. I naively suspect that almost no one uses anonymous inodes +
> > > > fanotify. I'd be very surprised.
> > > >
> > > > I'm currently traveling (see you later btw) but from a very cursory
> > > > reading I would naively suspect the following:
> > > >
> > > > // Suspects for FMODE_NONOTIFY
> > > > drivers/dma-buf/dma-buf.c:      file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> > > > drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
> > > > drivers/scsi/cxlflash/ocxl_hw.c:        file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
> > > > fs/anon_inodes.c:       file = alloc_file_pseudo(inode, anon_inode_mnt, name,
> > > > fs/hugetlbfs/inode.c:           file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> > > > kernel/bpf/token.c:     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > > > mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> > > > block/bdev.c:   bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
> > > > drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
> > > >
> > > > // Suspects for ~FMODE_NONOTIFY
> > > > fs/aio.c:       file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
> > >
> > > This is just a helper file for managing aio context so I don't think any
> > > notification makes sense there (events are not well defined). So I'd say
> > > FMODE_NONOTIFY here as well.
> > >
> > > > fs/pipe.c:      f = alloc_file_pseudo(inode, pipe_mnt, "",
> > > > mm/shmem.c:             res = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> > >
> > > This is actually used for stuff like IPC SEM where notification doesn't
> > > make sense. It's also used when mmapping /dev/zero but that struct file
> > > isn't easily accessible to userspace so overall I'd say this should be
> > > FMODE_NONOTIFY as well.
> >
> > I think there is another code path that the audit missed for getting these
> > pseudo files not via alloc_file_pseudo():
> > ipc/shm.c:      file = alloc_file_clone(base, f_flags,
> >
> > which does not copy f_mode as far as I can tell.
> >
> > >
> > > > // Unsure:
> > > > fs/nfs/nfs4file.c:      filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,
> > >
> > > AFAICS this struct file is for copy offload and doesn't leave the kernel.
> > > Hence FMODE_NONOTIFY should be fine.
> > >
> > > > net/socket.c:   file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
> > >
> > > In this case I think we need to be careful. It's a similar case as pipes so
> > > probably we should use ~FMODE_NONOTIFY here from pure caution.
> > >
> >
> > I tried this approach with patch:
> > "fsnotify: disable notification by default for all pseudo files"
> >
> > But I also added another patch:
> > "fsnotify: disable pre-content and permission events by default"
> >
> > So that code paths that we missed such as alloc_file_clone()
> > will not have pre-content events enabled.
> >
> > Alex,
> >
> > Can you please try this branch:
> >
> > https://github.com/amir73il/linux/commits/fsnotify-fixes/
> >
> > and verify that it fixes your issue.
> >
> > The branch contains one prep patch:
> > "fsnotify: use accessor to set FMODE_NONOTIFY_*"
> > and two independent Fixes patches.
> >
> > Assuming that it fixes your issue, can you please test each of the
> > Fixes patches individually, because every one of them should be fixing
> > the issue independently and every one of them could break something,
> > so we may end up reverting it later on.
>
> Test #1:
>
> fsnotify: disable pre-content and permission events by default
> fsnotify: disable notification by default for all pseudo files
> fsnotify: use accessor to set FMODE_NONOTIFY_*
>
> Result: Pass, vfio-pci huge_fault observed
>
> Test #2:
>
> fsnotify: disable notification by default for all pseudo files
> fsnotify: use accessor to set FMODE_NONOTIFY_*
>
> Result: Pass, vfio-pci huge_fault observed
>
> Test #3:
>
> fsnotify: disable pre-content and permission events by default
> fsnotify: use accessor to set FMODE_NONOTIFY_*
>
> Result: Pass, vfio-pci huge_fault observed
>
> Test #4 (control):
>
> fsnotify: use accessor to set FMODE_NONOTIFY_*
>
> Result: Fail, no vfio-pci huge_fault observed
>
> For any combination of the Fixes patches:
>
> Tested-by: Alex Williamson <alex.williamson@redhat.com>
>

That was fast.
I will post the patches.

Thanks!
Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-02-03 22:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <cover.1731684329.git.josef@toxicpanda.com>
     [not found] ` <9035b82cff08a3801cef3d06bbf2778b2e5a4dba.1731684329.git.josef@toxicpanda.com>
2025-01-31 19:17   ` [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches Alex Williamson
2025-01-31 19:59     ` Linus Torvalds
2025-02-01  1:19       ` Peter Xu
2025-02-01 14:38         ` Christian Brauner
2025-02-02  0:58           ` Linus Torvalds
2025-02-02  7:46             ` Amir Goldstein
2025-02-02 10:04               ` Christian Brauner
2025-02-03 12:41                 ` Jan Kara
2025-02-03 20:39                   ` Amir Goldstein
2025-02-03 21:41                     ` Alex Williamson
2025-02-03 22:04                       ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox