linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Jan Kara <jack@suse.cz>
Cc: kernel-team@fb.com, linux-fsdevel@vger.kernel.org,
	amir73il@gmail.com, brauner@kernel.org
Subject: Re: [PATCH 10/10] fsnotify: generate pre-content permission event on page fault
Date: Wed, 7 Aug 2024 15:04:31 -0400	[thread overview]
Message-ID: <20240807190431.GA274047@perftesting> (raw)
In-Reply-To: <20240805121349.i4esnngbuckbpdea@quack3>

On Mon, Aug 05, 2024 at 02:13:49PM +0200, Jan Kara wrote:
> On Fri 02-08-24 12:03:57, Josef Bacik wrote:
> > On Thu, Aug 01, 2024 at 11:40:25PM +0200, Jan Kara wrote:
> > > On Thu 25-07-24 14:19:47, Josef Bacik wrote:
> > > > FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
> > > > on the faulting method.
> > > > 
> > > > This pre-content event is meant to be used by hierarchical storage
> > > > managers that want to fill in the file content on first read access.
> > > > 
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ...
> > > > @@ -3287,6 +3288,35 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > > >  	if (unlikely(index >= max_idx))
> > > >  		return VM_FAULT_SIGBUS;
> > > >  
> > > > +	/*
> > > > +	 * If we have pre-content watchers then we need to generate events on
> > > > +	 * page fault so that we can populate any data before the fault.
> > > > +	 *
> > > > +	 * We only do this on the first pass through, otherwise the populating
> > > > +	 * application could potentially deadlock on the mmap lock if it tries
> > > > +	 * to populate it with mmap.
> > > > +	 */
> > > > +	if (fault_flag_allow_retry_first(vmf->flags) &&
> > > > +	    fsnotify_file_has_content_watches(file)) {
> > > 
> > > I'm somewhat nervous that if ALLOW_RETRY isn't set, we'd silently jump into
> > > readpage code without ever sending pre-content event and thus we'd possibly
> > > expose invalid content to userspace? I think we should fail the fault if
> > > fsnotify_file_has_content_watches(file) && !(vmf->flags &
> > > FAULT_FLAG_ALLOW_RETRY).
> > 
> > I was worried about this too but it seems to not be the case that we'll not ever
> > have ALLOW_RETRY.  That being said I'm fine turning this into a sigbus.
> 
> Do you mean that with your workloads we always have ALLOW_RETRY set? As I
> wrote, currently you'd have to try really hard to hit such paths but they
> are there - for example if you place uprobe on an address in a VMA that is
> not present, the page fault is going to happen without ALLOW_RETRY set.

From what I can tell we almost always have FOLL_UNLOCKABLE set, which is what
translates into ALLOW_RETRY.  There's definitely some paths that can get there,
but as far as what happens in a normal environment we're going to almost always
have ALLOW_RETRY set.

This does leave a hole in some corner cases.  I'm content to say "don't do that"
if you want to use these hooks.

Optionally we could add a FAN_PRE_MMAP hook in vm_mmap() for the range that is
being mmap'ed to make sure we never miss any events, and then applications can
decide if they want to risk it with the pagefault hooks or enable the mmap hooks
for absolute certainty.

> 
> > > > +		int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_READ;
> > > > +		loff_t pos = vmf->pgoff << PAGE_SHIFT;
> > > > +
> > > > +		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> > > > +
> > > > +		/*
> > > > +		 * We can only emit the event if we did actually release the
> > > > +		 * mmap lock.
> > > > +		 */
> > > > +		if (fpin) {
> > > > +			error = fsnotify_file_area_perm(fpin, mask, &pos,
> > > > +							PAGE_SIZE);
> > > > +			if (error) {
> > > > +				fput(fpin);
> > > > +				return VM_FAULT_ERROR;
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +
> > > >  	/*
> > > >  	 * Do we have something in the page cache already?
> > > >  	 */
> > > ...
> > > > @@ -3612,6 +3643,13 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> > > >  	unsigned long rss = 0;
> > > >  	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
> > > >  
> > > > +	/*
> > > > +	 * We are under RCU, we can't emit events here, we need to force a
> > > > +	 * normal fault to make sure the events get sent.
> > > > +	 */
> > > > +	if (fsnotify_file_has_content_watches(file))
> > > > +		return ret;
> > > > +
> > > 
> > > I don't think we need to do anything for filemap_map_pages(). The call just
> > > inserts page cache content into page tables and whatever is in the page
> > > cache and has folio_uptodate() set should be already valid file content,
> > > shouldn't it?
> > 
> > I'll make this comment more clear.  filemap_fault() will start readahead,
> > but we'll only emit the event for the page size that we're faulting.  I
> > had looked at putting this at the readahead place and figuring out the
> > readahead size, but literally anything could trigger readahead so it's
> > better to just not allow filemap_map_pages() to happen, otherwise we'll
> > end up with empty pages (if the content hasn't been populated yet) and
> > never emit an event for those ranges.
> 
> This seems like an interesting problem. Even ordinary read(2) will trigger
> readahead and as you say, we would be instantiating folios with wrong
> content (zeros) due to that. It seems as a fragile design to keep such
> folios in the page cache and place checks in all the places that could
> possibly make their content visible to the user. I'd rather make sure that
> if we pull folios into page cache (and set folio_uptodate() bit), their
> content is indeed valid.

The hook exists before we go looking in pagecache, so we're fine with read(2),
the only problem is mmap (AFAICT, I am not very smart after all).

> 
> What we could do is to turn off readahead on the inode if
> fsnotify_file_has_content_watches() is true. Essentially the handler of the
> precontent event can do a much better job of prefilling the page cache with
> whatever content is needed in a range that makes sense. And then we could
> leave filemap_map_pages() intact. What do you think guys?

I had considered this but decided against it because it seemed like a big
hammer, but if you're cool with it then so am I.  Thanks,

Josef

      reply	other threads:[~2024-08-07 19:04 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-25 18:19 [PATCH 00/10] fanotify: add pre-content hooks Josef Bacik
2024-07-25 18:19 ` [PATCH 01/10] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
2024-08-01 17:59   ` Jan Kara
2024-07-25 18:19 ` [PATCH 02/10] fsnotify: introduce pre-content permission event Josef Bacik
2024-08-01 16:31   ` Jan Kara
2024-08-03 16:52     ` Amir Goldstein
2024-10-25  7:55       ` Amir Goldstein
2024-10-25 13:09         ` Jan Kara
2024-10-25 13:39           ` Amir Goldstein
2024-10-26  6:58             ` Amir Goldstein
2024-10-31 12:47               ` Jan Kara
2024-07-25 18:19 ` [PATCH 03/10] fsnotify: generate pre-content permission event on open Josef Bacik
2024-08-01 17:01   ` Jan Kara
2024-08-03 16:53     ` Amir Goldstein
2024-07-25 18:19 ` [PATCH 04/10] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
2024-08-01 17:04   ` Jan Kara
2024-07-25 18:19 ` [PATCH 05/10] fanotify: introduce FAN_PRE_MODIFY " Josef Bacik
2024-08-01 17:09   ` Jan Kara
2024-08-03 16:55     ` Amir Goldstein
2024-08-05 11:18       ` Jan Kara
2024-07-25 18:19 ` [PATCH 06/10] fanotify: pass optional file access range in pre-content event Josef Bacik
2024-08-01 17:16   ` Jan Kara
2024-08-03 17:00     ` Amir Goldstein
2024-08-05 11:20       ` Jan Kara
2024-07-25 18:19 ` [PATCH 07/10] fanotify: rename a misnamed constant Josef Bacik
2024-08-01 17:19   ` Jan Kara
2024-08-01 17:23     ` Jan Kara
2024-07-25 18:19 ` [PATCH 08/10] fanotify: report file range info with pre-content events Josef Bacik
2024-08-01 17:38   ` Jan Kara
2024-10-24 10:06     ` Amir Goldstein
2024-10-24 16:35       ` Jan Kara
2024-10-24 16:49         ` Amir Goldstein
2024-10-24 16:56           ` Jan Kara
2024-07-25 18:19 ` [PATCH 09/10] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
2024-08-01 21:14   ` Jan Kara
2024-08-03 17:06     ` Amir Goldstein
2024-07-25 18:19 ` [PATCH 10/10] fsnotify: generate pre-content permission event on page fault Josef Bacik
2024-07-25 20:19   ` Amir Goldstein
2024-07-29 17:11     ` Josef Bacik
2024-07-29 18:57       ` Amir Goldstein
2024-07-30 12:18         ` Jan Kara
2024-07-30 16:51           ` Josef Bacik
2024-08-01 21:34             ` Jan Kara
2024-08-01 21:40   ` Jan Kara
2024-08-02 16:03     ` Josef Bacik
2024-08-05 12:13       ` Jan Kara
2024-08-07 19:04         ` Josef Bacik [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240807190431.GA274047@perftesting \
    --to=josef@toxicpanda.com \
    --cc=amir73il@gmail.com \
    --cc=brauner@kernel.org \
    --cc=jack@suse.cz \
    --cc=kernel-team@fb.com \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).