* [GIT PULL] Fsnotify changes for 6.12-rc1
@ 2024-09-23 11:03 Jan Kara
2024-09-23 18:35 ` Linus Torvalds
0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2024-09-23 11:03 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel
Hello Linus,
could you please pull from
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fsnotify_for_v6.12-rc1
to get:
* The implementation of the pre-content fanotify events. These events are
sent before read / write / page fault and the execution is paused until
event listener replies similarly to current fanotify permission events.
Facebook and some other vendors already started to use these events for
implementing hierarchical storage management in their infrastructure.
* Fix of a data race noticed by syzbot
There are some conflicts of this pull with other work in linux-next, mostly
with multigrain timestamp series which I think Christian decided to
postpone in the end so this may be a non-issue in the end. The conflicts
were easy to resolve - two commits both adding a feature flag to .fs_flags
in struct filesystem_type. Anyway currently, the merge with your tree as
of today has only one trivial conflict in mm/filemap.c due to added
includes.
Top of the tree is 6165bfb427a6. The full shortlog is:
Amir Goldstein (8):
fsnotify: introduce pre-content permission event
fsnotify: generate pre-content permission event on open
fanotify: introduce FAN_PRE_ACCESS permission event
fanotify: introduce FAN_PRE_MODIFY permission event
fanotify: pass optional file access range in pre-content event
fanotify: rename a misnamed constant
fanotify: report file range info with pre-content events
fanotify: allow to set errno in FAN_DENY permission response
Jan Kara (1):
fsnotify: Avoid data race between fsnotify_recalc_mask() and fsnotify_object_watched()
Josef Bacik (10):
fanotify: don't skip extra event info if no info_mode is set
fs: add a flag to indicate the fs supports pre-content events
fanotify: add a helper to check for pre content events
fanotify: disable readahead if we have pre-content watches
mm: don't allow huge faults for files with pre content watches
fsnotify: generate pre-content permission event on page fault
bcachefs: add pre-content fsnotify hook to fault
xfs: add pre-content fsnotify hook for write faults
btrfs: disable defrag on pre-content watched files
ext4: enable pre-content events
The diffstat is
fs/bcachefs/fs-io-pagecache.c | 4 ++
fs/bcachefs/fs.c | 2 +-
fs/btrfs/ioctl.c | 9 +++
fs/btrfs/super.c | 3 +-
fs/ext4/super.c | 6 +-
fs/namei.c | 9 +++
fs/notify/fanotify/fanotify.c | 32 ++++++++--
fs/notify/fanotify/fanotify.h | 15 +++++
fs/notify/fanotify/fanotify_user.c | 111 ++++++++++++++++++++++++++------
fs/notify/fsnotify.c | 39 ++++++++---
fs/notify/inotify/inotify_user.c | 2 +-
fs/notify/mark.c | 8 ++-
fs/xfs/xfs_file.c | 4 ++
fs/xfs/xfs_super.c | 2 +-
include/linux/fanotify.h | 20 ++++--
include/linux/fs.h | 1 +
include/linux/fsnotify.h | 58 +++++++++++++++--
include/linux/fsnotify_backend.h | 59 ++++++++++++++++-
include/linux/mm.h | 1 +
include/uapi/linux/fanotify.h | 18 ++++++
mm/filemap.c | 128 +++++++++++++++++++++++++++++++++++--
mm/memory.c | 22 +++++++
mm/nommu.c | 7 ++
mm/readahead.c | 13 ++++
security/selinux/hooks.c | 3 +-
25 files changed, 511 insertions(+), 65 deletions(-)
Thanks
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-23 11:03 [GIT PULL] Fsnotify changes for 6.12-rc1 Jan Kara
@ 2024-09-23 18:35 ` Linus Torvalds
2024-09-23 19:13 ` Jan Kara
0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2024-09-23 18:35 UTC (permalink / raw)
To: Jan Kara, Amir Goldstein; +Cc: linux-fsdevel
On Mon, 23 Sept 2024 at 04:03, Jan Kara <jack@suse.cz> wrote:
>
> * The implementation of the pre-content fanotify events. T
I pulled this, and then I decided to unpull.
I don't see what the permissions for this thing are, and without
explanations for why this isn't a huge security issue, I'm not pulling
it.
Maybe those explanations exist elsewhere, but they sure aren't in the
pull request.
IOW, I want to know where the code is that says "you can't block root
processes doing accesses to your files" etc. Or things like "oh, the
kernel took a page fault while holding some lock, what protects this
from being misused"?
And if that code doesn't exist, there's no way in hell we're pulling this. Ever.
IOW, where is the "we don't allow unprivileged groups to do this" code?
Because:
> These events are
> sent before read / write / page fault and the execution is paused until
> event listener replies similarly to current fanotify permission events.
Permission events aren't allowed for unprivileged users. I want to
make sure people have thought about this, and I need to actually see
this talked about in the pull request.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-23 18:35 ` Linus Torvalds
@ 2024-09-23 19:13 ` Jan Kara
2024-09-23 19:36 ` Linus Torvalds
0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2024-09-23 19:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jan Kara, Amir Goldstein, linux-fsdevel
On Mon 23-09-24 11:35:00, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 04:03, Jan Kara <jack@suse.cz> wrote:
> >
> > * The implementation of the pre-content fanotify events. T
>
> I pulled this, and then I decided to unpull.
>
> I don't see what the permissions for this thing are, and without
> explanations for why this isn't a huge security issue, I'm not pulling
> it.
>
> Maybe those explanations exist elsewhere, but they sure aren't in the
> pull request.
Sure, the details are in some of the commit messages but you're right I
should have summarized them in the pull request as well:
Pre-content events are restricted to global CAP_SYS_ADMIN. This is achieved
by pre-content events being restricted to FAN_CLASSS_PRE_CONTENT
notification groups which are restricted to CAP_SYS_ADMIN in
fanotify_init() by this check:
if (!capable(CAP_SYS_ADMIN)) {
/*
* An unprivileged user can setup an fanotify group with
* limited functionality - an unprivileged group is limited to
* notification events with file handles and it cannot use
* unlimited queue/marks.
*/
if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) || !fid_mode)
return -EPERM;
...
}
> IOW, I want to know where the code is that says "you can't block root
> processes doing accesses to your files" etc. Or things like "oh, the
> kernel took a page fault while holding some lock, what protects this
> from being misused"?
>
> And if that code doesn't exist, there's no way in hell we're pulling
> this. Ever.
Sure, I understand that. That would have been a huge security hole.
> IOW, where is the "we don't allow unprivileged groups to do this" code?
>
> Because:
>
> > These events are
> > sent before read / write / page fault and the execution is paused until
> > event listener replies similarly to current fanotify permission events.
>
> Permission events aren't allowed for unprivileged users. I want to
> make sure people have thought about this, and I need to actually see
> this talked about in the pull request.
Should I update the pull request and resend or will you update it with
paragraph above?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-23 19:13 ` Jan Kara
@ 2024-09-23 19:36 ` Linus Torvalds
2024-09-24 9:27 ` Jan Kara
0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2024-09-23 19:36 UTC (permalink / raw)
To: Jan Kara; +Cc: Amir Goldstein, linux-fsdevel
On Mon, 23 Sept 2024 at 12:13, Jan Kara <jack@suse.cz> wrote:
>
> Sure, the details are in some of the commit messages but you're right I
> should have summarized them in the pull request as well:
So I really only looked at the parts I know - the VM side, and
honestly, I threw up in my mouth a bit there.
Do we really want to call that horrific fsnotify path for the case
where we already have the page cached? This is a fairly critical
fastpath, and not giving out page cache pages means that now you are
literally violating mmap coherency.
If the aim is to fill in caches on first access, then if we already
have a page cache page, it's by definition not first access any more!
So I *really* don't like the VM side of this. That's what then made me
go look for permission code, and since I don't know the fsnotify code
very well, I didn't find it.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-23 19:36 ` Linus Torvalds
@ 2024-09-24 9:27 ` Jan Kara
2024-09-24 16:33 ` Linus Torvalds
0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2024-09-24 9:27 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jan Kara, Amir Goldstein, linux-fsdevel
On Mon 23-09-24 12:36:14, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 12:13, Jan Kara <jack@suse.cz> wrote:
> >
> > Sure, the details are in some of the commit messages but you're right I
> > should have summarized them in the pull request as well:
>
> So I really only looked at the parts I know - the VM side, and
> honestly, I threw up in my mouth a bit there.
>
> Do we really want to call that horrific fsnotify path for the case
> where we already have the page cached? This is a fairly critical
> fastpath, and not giving out page cache pages means that now you are
> literally violating mmap coherency.
>
> If the aim is to fill in caches on first access, then if we already
> have a page cache page, it's by definition not first access any more!
Well, that's what actually should be happening. do_read_fault() will do
should_fault_around(vmf) -> yes -> do_fault_around() and
filemap_map_pages() will insert all pages in the page cache into the page
table page before we even get to filemap_fault() calling our fsnotify
hooks. Note that filemap_map_pages() returns VM_FAULT_NOPAGE if it has
mapped the page for the faulting address which makes the page fault code
bail out even before ->fault handler is even called.
That being said now that I'm rereading this code again, write faults will
always end up in the fault handler so we'll generate PRE_WRITE events for
them on each write fault (if someone is watching for it). Not sure if write
faults matter that much and I don't see easy way around that as that's the
promise of PRE_WRITE event...
Do the above explanations make the VM changes acceptable for you? I agree
it isn't exactly beautiful but it should work.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-24 9:27 ` Jan Kara
@ 2024-09-24 16:33 ` Linus Torvalds
2024-09-25 0:16 ` Gao Xiang
0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2024-09-24 16:33 UTC (permalink / raw)
To: Jan Kara; +Cc: Amir Goldstein, linux-fsdevel
On Tue, 24 Sept 2024 at 02:28, Jan Kara <jack@suse.cz> wrote:
>
> On Mon 23-09-24 12:36:14, Linus Torvalds wrote:
> >
> > Do we really want to call that horrific fsnotify path for the case
> > where we already have the page cached? This is a fairly critical
> > fastpath, and not giving out page cache pages means that now you are
> > literally violating mmap coherency.
> >
> > If the aim is to fill in caches on first access, then if we already
> > have a page cache page, it's by definition not first access any more!
>
> Well, that's what actually should be happening. do_read_fault() will do
> should_fault_around(vmf) -> yes -> do_fault_around() and
> filemap_map_pages() will insert all pages in the page cache into the page
> table page before we even get to filemap_fault() calling our fsnotify
> hooks.
That's the fault-around code, yes, and it will populate most pages on
many filesystems, but it's still optional.
Not all filesystems have a 'map_pages' function at all (from a quick
grep at least ceph, erofs, ext2, ocfs2 - although I didn't actually
validate that my quick grep was right).
Look here - the part I disliked the most was literally commit
4f0ec01f45cd ("fsnotify: generate pre-content permission event on page
fault") which at the very top of 'filemap_fault()' adds
/*
* If we have pre-content watchers then we need to generate events on
* page fault so that we can populate any data before the fault.
*/
ret = __filemap_fsnotify_fault(vmf, &fpin);
...
right *above* the code that then does
/*
* Do we have something in the page cache already?
*/
folio = filemap_get_folio(mapping, index);
and this is all still in the fast-path for any filesystem that doesn't
do map_pages.
Do we care deeply about such filesystems? Perhaps not - but this all
still smells to me.
When the code that is supposed to ignore caches exists right *above*
the code that has a big comment about "is it already in the page
cache", it makes me unhappy.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-24 16:33 ` Linus Torvalds
@ 2024-09-25 0:16 ` Gao Xiang
2024-09-25 1:15 ` Linus Torvalds
0 siblings, 1 reply; 10+ messages in thread
From: Gao Xiang @ 2024-09-25 0:16 UTC (permalink / raw)
To: Linus Torvalds, Jan Kara; +Cc: Amir Goldstein, linux-fsdevel
Hi Linus,
On 2024/9/25 00:33, Linus Torvalds wrote:
> On Tue, 24 Sept 2024 at 02:28, Jan Kara <jack@suse.cz> wrote:
>>
>> On Mon 23-09-24 12:36:14, Linus Torvalds wrote:
>>>
>>> Do we really want to call that horrific fsnotify path for the case
>>> where we already have the page cached? This is a fairly critical
>>> fastpath, and not giving out page cache pages means that now you are
>>> literally violating mmap coherency.
>>>
>>> If the aim is to fill in caches on first access, then if we already
>>> have a page cache page, it's by definition not first access any more!
>>
>> Well, that's what actually should be happening. do_read_fault() will do
>> should_fault_around(vmf) -> yes -> do_fault_around() and
>> filemap_map_pages() will insert all pages in the page cache into the page
>> table page before we even get to filemap_fault() calling our fsnotify
>> hooks.
>
> That's the fault-around code, yes, and it will populate most pages on
> many filesystems, but it's still optional.
>
> Not all filesystems have a 'map_pages' function at all (from a quick
> grep at least ceph, erofs, ext2, ocfs2 - although I didn't actually
> validate that my quick grep was right).
Just side note: I think `generic_file_vm_ops` already prepares this
feature, so generic_file_mmap users also have fault around behaviors.
Anyway..
Thanks,
Gao Xiang
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-25 0:16 ` Gao Xiang
@ 2024-09-25 1:15 ` Linus Torvalds
2024-09-25 11:04 ` Jan Kara
2024-10-16 19:28 ` Josef Bacik
0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2024-09-25 1:15 UTC (permalink / raw)
To: Gao Xiang; +Cc: Jan Kara, Amir Goldstein, linux-fsdevel
On Tue, 24 Sept 2024 at 17:17, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Just side note: I think `generic_file_vm_ops` already prepares this
> feature, so generic_file_mmap users also have fault around behaviors.
Hmm. Maybe. But it doesn't really change the fundamental issue - the
code in question seems to be just *random*.
And I mean that in a very real and very immediate sense: the
fault-around code and filemap_map_pages() only maps in pages that are
uptodate, so it literally DEPENDS ON TIMING whether some previous IO
has completed or not, and thus on whether the page fault is handled by
the fault-around in filemap_map_pages() or by the filemap_fault()
code.
In other words - I think this is all completely broken.
Put another way: explain to me why random IO timing details should
matter for the whether we do __filemap_fsnotify_fault() on a page
fault or not?
So no. I'm not taking this pull request. It makes absolutely zero
sense to me, and I don't think it has sane semantics. The argument
that it is already used by people is not an argument.
The new fsnotify hooks need to make SENSE - not be in random locations
that give some kind of random data.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-25 1:15 ` Linus Torvalds
@ 2024-09-25 11:04 ` Jan Kara
2024-10-16 19:28 ` Josef Bacik
1 sibling, 0 replies; 10+ messages in thread
From: Jan Kara @ 2024-09-25 11:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Gao Xiang, Jan Kara, Amir Goldstein, linux-fsdevel
On Tue 24-09-24 18:15:51, Linus Torvalds wrote:
> On Tue, 24 Sept 2024 at 17:17, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >
> > Just side note: I think `generic_file_vm_ops` already prepares this
> > feature, so generic_file_mmap users also have fault around behaviors.
>
> Hmm. Maybe. But it doesn't really change the fundamental issue - the
> code in question seems to be just *random*.
>
> And I mean that in a very real and very immediate sense: the
> fault-around code and filemap_map_pages() only maps in pages that are
> uptodate, so it literally DEPENDS ON TIMING whether some previous IO
> has completed or not, and thus on whether the page fault is handled by
> the fault-around in filemap_map_pages() or by the filemap_fault()
> code.
>
> In other words - I think this is all completely broken.
>
> Put another way: explain to me why random IO timing details should
> matter for the whether we do __filemap_fsnotify_fault() on a page
> fault or not?
I agree that with the fault-around code, there's dependency on IO
completion time. So we can still be generating PRE_ACCESS events while the
data is being loaded from the disk. This was a compromise we ended up with
after quite some discussions about possible solutions. Generally the
options I see for page faults are:
1) Generate PRE_ACCESS event whenever a page is being mapped into page
tables (in fact Josef originally had this in his patch). This would
provide the determinism at the cost of performance (events generated even
for cached pages). If this is OK with you, we can do that but from what I
gather from your previous emails you don't really like this either.
2) We could generate event in filemap_fault() only if we don't find
uptodate folio there. I'm happy to do this if it looks better to you
although I don't think there's a practical difference from the current
state (the same raciness with IO completion applies, just the window is
smaller).
3) We could generate event in filemap_fault() only if we didn't find a
folio in the page cache or we found it, locked it, but it still was not
uptodate. This will make sure we generate the event only if we really need
to pull in the data into the page cache. Doable. The case when we found &
locked a page that's not uptodate in the end will be a bit ugly but not too
bad and this case should be rare. I actually like this option the most I
guess.
> So no. I'm not taking this pull request. It makes absolutely zero
> sense to me, and I don't think it has sane semantics. The argument
> that it is already used by people is not an argument.
I mentioned that to show that there's practical interest in this kind of
functionality. I think people involved understand things are pretty much in
flux until they are merged upstream.
> The new fsnotify hooks need to make SENSE - not be in random locations
> that give some kind of random data.
Thanks for feedback. So 1) and 3) from the above options make sense to me
(in the sense that it is relatively easy to explain when events are
generated). I'd prefer 3) for performance reasons so can we settle on that?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] Fsnotify changes for 6.12-rc1
2024-09-25 1:15 ` Linus Torvalds
2024-09-25 11:04 ` Jan Kara
@ 2024-10-16 19:28 ` Josef Bacik
1 sibling, 0 replies; 10+ messages in thread
From: Josef Bacik @ 2024-10-16 19:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Gao Xiang, Jan Kara, Amir Goldstein, linux-fsdevel
On Tue, Sep 24, 2024 at 06:15:51PM -0700, Linus Torvalds wrote:
> On Tue, 24 Sept 2024 at 17:17, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >
> > Just side note: I think `generic_file_vm_ops` already prepares this
> > feature, so generic_file_mmap users also have fault around behaviors.
>
> Hmm. Maybe. But it doesn't really change the fundamental issue - the
> code in question seems to be just *random*.
>
> And I mean that in a very real and very immediate sense: the
> fault-around code and filemap_map_pages() only maps in pages that are
> uptodate, so it literally DEPENDS ON TIMING whether some previous IO
> has completed or not, and thus on whether the page fault is handled by
> the fault-around in filemap_map_pages() or by the filemap_fault()
> code.
>
> In other words - I think this is all completely broken.
>
> Put another way: explain to me why random IO timing details should
> matter for the whether we do __filemap_fsnotify_fault() on a page
> fault or not?
They don't, or at least I don't understand what you're getting at here.
I have a file with these precontent watches on, no IO is going to occur on these
pages without calling the fanotify hook first. We have readahead disabled in
this case for this very reason, we need to make sure that every single page
access is caught by fanotify before we allow the IO to proceed, otherwise we do
get random garbage (well really we just get empty pages because these files are
just truncated to the final size).
Now there is the case where userspace could have just done a normal pwrite() to
fill in the file, so any subsequent faults would find uptodate pages and thus
we'd not get an event. My original code disabled this, specifically because I
wanted to have the consistency of "any access == event". But we decided in one
of our many discussions around this code that it would be a bit too heavy
handed, and if there were uptodate pages it would be because the HSM application
filled them in.
As Jan pointed out, I'm fine going back to that model, as it does have more
consistent behavior. But I don't want to go rework this code for a 6th time
just to have you reject it and it take me a couple of weeks to notice that it
didn't go in.
I realize to you that "we're already using this" doesn't matter, but it matters
to me. As a rule we don't like carrying patches that aren't upstream, the only
reason we do it is for these cases where we need to see how it works in
production to see if it's actually worth chasing and implementing. In this case
the answer overwhelmingly is yes, this feature is extremely useful and pretty
greatly improves the launch times of our containers and our applications. So
I'd really like some concrete guidance on how exactly you want this to look
before I go back to rework it again. Thanks,
Josef
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-10-16 19:28 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-23 11:03 [GIT PULL] Fsnotify changes for 6.12-rc1 Jan Kara
2024-09-23 18:35 ` Linus Torvalds
2024-09-23 19:13 ` Jan Kara
2024-09-23 19:36 ` Linus Torvalds
2024-09-24 9:27 ` Jan Kara
2024-09-24 16:33 ` Linus Torvalds
2024-09-25 0:16 ` Gao Xiang
2024-09-25 1:15 ` Linus Torvalds
2024-09-25 11:04 ` Jan Kara
2024-10-16 19:28 ` Josef Bacik
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).