Fanotify: concurrent work and handling files being executed

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Fanotify: concurrent work and handling files being executed
@ 2024-02-05 23:23 Sargun Dhillon
  2024-02-06  7:44 ` Amir Goldstein
  0 siblings, 1 reply; 7+ messages in thread
From: Sargun Dhillon @ 2024-02-05 23:23 UTC (permalink / raw)
  To: Linux FS-devel Mailing List, Sweet Tea Dorminy, Jan Kara,
	Amir Goldstein

One of the issues we've hit recently while using fanotify in an HSM is
racing with files that are opened for execution.

There is a race that can result in ETXTBUSY.
Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
Pid 2: execve(file_by_path)
Pid 1: gets notification, with file.fd
Pid 2: blocked, waiting for notification to resolve
Pid 1: Does work with FD (populates the file)
Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
Pid 2: continues, and falls through to deny_write_access (and fails)
Pid 1: closes fd

Pid 1 can close the FD before responding, but this can result in a
race if fanotify is being handled in a multi-threaded
manner.

I.e. if there are two threads operating on the same fanotify group,
and an event's FD has been closed, that can be reused
by another event. This is largely not a problem because the
outstanding events are added in a FIFO manner to the outstanding
event list, and as long as the earlier event is closed and responded
to without interruption, it should be okay, but it's difficult
to guarantee that this happens, unless event responses are serialized
in some fashion, with strict ordering between
responses.

There are a couple of ways I see around this:
1. Have a flag in the fanotify response that's like FAN_CLOSE_FD,
where fanotify_write closes the fd when
it processes the response.
2. Make the response identifier separate from the FD. This can either
be an IDR / xarray, or a 64-bit always
incrementing number. The benefit of using an xarray is that responses
can than be handled in O(1) speed
whereas the current implementation is O(n) in relationship to the
number of outstanding events.

This can be implemented by adding an additional piece of response
metadata, and then that becomes the
key vs. fd on response.
---

An aside, ETXTBUSY / i_writecount is a real bummer. We want to be able
to populate file content lazily,
and I realize there are many steps between removing the write lock,
and being able to do this, but given
that you can achieve this with FUSE, NFS, EROFS / cachefilesd, it
feels somewhat arbitrary to continue
to have this in place for executable files only.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fanotify: concurrent work and handling files being executed
  2024-02-05 23:23 Fanotify: concurrent work and handling files being executed Sargun Dhillon
@ 2024-02-06  7:44 ` Amir Goldstein
  2024-02-06 13:50   ` Jan Kara
  0 siblings, 1 reply; 7+ messages in thread
From: Amir Goldstein @ 2024-02-06  7:44 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: Linux FS-devel Mailing List, Sweet Tea Dorminy, Jan Kara

On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> One of the issues we've hit recently while using fanotify in an HSM is
> racing with files that are opened for execution.
>
> There is a race that can result in ETXTBUSY.
> Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> Pid 2: execve(file_by_path)
> Pid 1: gets notification, with file.fd
> Pid 2: blocked, waiting for notification to resolve
> Pid 1: Does work with FD (populates the file)
> Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> Pid 2: continues, and falls through to deny_write_access (and fails)
> Pid 1: closes fd
>
> Pid 1 can close the FD before responding, but this can result in a
> race if fanotify is being handled in a multi-threaded
> manner.
>
> I.e. if there are two threads operating on the same fanotify group,
> and an event's FD has been closed, that can be reused
> by another event. This is largely not a problem because the
> outstanding events are added in a FIFO manner to the outstanding
> event list, and as long as the earlier event is closed and responded
> to without interruption, it should be okay, but it's difficult
> to guarantee that this happens, unless event responses are serialized
> in some fashion, with strict ordering between
> responses.
>
> There are a couple of ways I see around this:
> 1. Have a flag in the fanotify response that's like FAN_CLOSE_FD,
> where fanotify_write closes the fd when
> it processes the response.

That seems doable and useful.

> 2. Make the response identifier separate from the FD. This can either
> be an IDR / xarray, or a 64-bit always
> incrementing number. The benefit of using an xarray is that responses
> can than be handled in O(1) speed
> whereas the current implementation is O(n) in relationship to the
> number of outstanding events.

The number of outstanding permission events is usually limited
by the number of events that are handled concurrently.

I think that a 64-bit event id is a worthy addition to a well designed
notifications API. I am pretty sure that both Windows and MacOS
filesystem notification events have a 64-bit event id.

If we ever implement filesystem support for persistent change
notification journals (as in Windows), those ids would be essential.

>
> This can be implemented by adding an additional piece of response
> metadata, and then that becomes the
> key vs. fd on response.
> ---
>
> An aside, ETXTBUSY / i_writecount is a real bummer. We want to be able
> to populate file content lazily,
> and I realize there are many steps between removing the write lock,
> and being able to do this, but given
> that you can achieve this with FUSE, NFS, EROFS / cachefilesd, it
> feels somewhat arbitrary to continue
> to have this in place for executable files only.

I think there are way too many security models built around this
behavior to be able to change it for the common case.

However, I will note that technically, the filesystem does not
need to require a file open for write to fill its content as long as the
file content is written directly to disk and as long as page cache
is not populated in that region and invalidate_lock is held.

Requiring FMODE_WRITE for btrfs_ioctl_encoded_write()
may make sense, but it is not a must.
A more fine grained page cache protection approach is also
an option, a bit like (or exactly like) exported pNFS layouts.

IOW, you could potentially implement lazy filling of executable
content with specific file systems that support an out of band
file filling API, but then we would also need notifications for
page faults, so let's leave that for the far future.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fanotify: concurrent work and handling files being executed
  2024-02-06  7:44 ` Amir Goldstein
@ 2024-02-06 13:50   ` Jan Kara
  2024-02-06 16:29     ` Sargun Dhillon
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2024-02-06 13:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, Linux FS-devel Mailing List, Sweet Tea Dorminy,
	Jan Kara

On Tue 06-02-24 09:44:29, Amir Goldstein wrote:
> On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > One of the issues we've hit recently while using fanotify in an HSM is
> > racing with files that are opened for execution.
> >
> > There is a race that can result in ETXTBUSY.
> > Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> > Pid 2: execve(file_by_path)
> > Pid 1: gets notification, with file.fd
> > Pid 2: blocked, waiting for notification to resolve
> > Pid 1: Does work with FD (populates the file)
> > Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> > Pid 2: continues, and falls through to deny_write_access (and fails)
> > Pid 1: closes fd

Right, this is kind of nasty.

> > Pid 1 can close the FD before responding, but this can result in a
> > race if fanotify is being handled in a multi-threaded
> > manner.

Yep.

> > I.e. if there are two threads operating on the same fanotify group,
> > and an event's FD has been closed, that can be reused
> > by another event. This is largely not a problem because the
> > outstanding events are added in a FIFO manner to the outstanding
> > event list, and as long as the earlier event is closed and responded
> > to without interruption, it should be okay, but it's difficult
> > to guarantee that this happens, unless event responses are serialized
> > in some fashion, with strict ordering between
> > responses.

Yes, essentially you must make sure you will not read any new events from
the notification queue between fd close & writing of the response. Frankly,
I find this as quite ugly and asking for trouble (subtle bugs down the
line).

> > There are a couple of ways I see around this:
> > 1. Have a flag in the fanotify response that's like FAN_CLOSE_FD,
> > where fanotify_write closes the fd when
> > it processes the response.
> 
> That seems doable and useful.
> 
> > 2. Make the response identifier separate from the FD. This can either
> > be an IDR / xarray, or a 64-bit always
> > incrementing number. The benefit of using an xarray is that responses
> > can than be handled in O(1) speed
> > whereas the current implementation is O(n) in relationship to the
> > number of outstanding events.
> 
> The number of outstanding permission events is usually limited
> by the number of events that are handled concurrently.
> 
> I think that a 64-bit event id is a worthy addition to a well designed
> notifications API. I am pretty sure that both Windows and MacOS
> filesystem notification events have a 64-bit event id.

I agree that having 64-bit id in an event and just increment it with each
event would be simple and fine way to identify events. This could then be
used to match responses to events when we would be reporting permission
events for FID groups as I outlined in my email yesterday [1].

> If we ever implement filesystem support for persistent change
> notification journals (as in Windows), those ids would be essential.

How do you want to use it for persistent change journal? I'm mostly asking
because currently I expect the number to start from 0 for each fsnotify
group created but if you'd like to persist somewhere the event id, this
would not be ideal?

> > This can be implemented by adding an additional piece of response
> > metadata, and then that becomes the
> > key vs. fd on response.
> > ---
> >
> > An aside, ETXTBUSY / i_writecount is a real bummer. We want to be able
> > to populate file content lazily,
> > and I realize there are many steps between removing the write lock,
> > and being able to do this, but given
> > that you can achieve this with FUSE, NFS, EROFS / cachefilesd, it
> > feels somewhat arbitrary to continue
> > to have this in place for executable files only.
> 
> I think there are way too many security models built around this
> behavior to be able to change it for the common case.
> 
> However, I will note that technically, the filesystem does not
> need to require a file open for write to fill its content as long as the
> file content is written directly to disk and as long as page cache
> is not populated in that region and invalidate_lock is held.

I was thinking about this as well and as you say, technically there is no
problem with this - just bypass writer the checks when opening the file -
but I think it just gives userspace too much rope to hang itself on. So if
we are going to support something like this it will require very careful
API design and wider consensus among VFS people this is something we are
willing to support.

> Requiring FMODE_WRITE for btrfs_ioctl_encoded_write()
> may make sense, but it is not a must.
> A more fine grained page cache protection approach is also
> an option, a bit like (or exactly like) exported pNFS layouts.
> 
> IOW, you could potentially implement lazy filling of executable
> content with specific file systems that support an out of band
> file filling API, but then we would also need notifications for
> page faults, so let's leave that for the far future.

That's another good point that currently we don't generate fsnotify events
on page faults. For lazy filling of executables this is basically must-have
functionality and I'd say that even for lazy filling of other files,
handling notifications on page faults will come up sooner rather than
later. So this is probably the thing we'll have to handle first.

								Honza

[1] https://lore.kernel.org/all/20240205182718.lvtgfsxcd6htbqyy@quack3
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fanotify: concurrent work and handling files being executed
  2024-02-06 13:50   ` Jan Kara
@ 2024-02-06 16:29     ` Sargun Dhillon
  2024-02-06 16:44       ` Amir Goldstein
  0 siblings, 1 reply; 7+ messages in thread
From: Sargun Dhillon @ 2024-02-06 16:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: Amir Goldstein, Linux FS-devel Mailing List, Sweet Tea Dorminy

On Tue, Feb 6, 2024 at 6:50 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 06-02-24 09:44:29, Amir Goldstein wrote:
> > On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > >
> > > One of the issues we've hit recently while using fanotify in an HSM is
> > > racing with files that are opened for execution.
> > >
> > > There is a race that can result in ETXTBUSY.
> > > Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> > > Pid 2: execve(file_by_path)
> > > Pid 1: gets notification, with file.fd
> > > Pid 2: blocked, waiting for notification to resolve
> > > Pid 1: Does work with FD (populates the file)
> > > Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> > > Pid 2: continues, and falls through to deny_write_access (and fails)
> > > Pid 1: closes fd
>
> Right, this is kind of nasty.
>
> > > Pid 1 can close the FD before responding, but this can result in a
> > > race if fanotify is being handled in a multi-threaded
> > > manner.
>
> Yep.
>
> > > I.e. if there are two threads operating on the same fanotify group,
> > > and an event's FD has been closed, that can be reused
> > > by another event. This is largely not a problem because the
> > > outstanding events are added in a FIFO manner to the outstanding
> > > event list, and as long as the earlier event is closed and responded
> > > to without interruption, it should be okay, but it's difficult
> > > to guarantee that this happens, unless event responses are serialized
> > > in some fashion, with strict ordering between
> > > responses.
>
> Yes, essentially you must make sure you will not read any new events from
> the notification queue between fd close & writing of the response. Frankly,
> I find this as quite ugly and asking for trouble (subtle bugs down the
> line).
>
Is there a preference for either refactoring fanotify_event_metadata, or
adding this new ID type as a piece of metadata?

I almost feel like the FD should move to being metadata, and we should
use ID in place of fd in fanotify_event_metadata. If we use an xarray,
it should be reasonable to use a 32-bit identifier, so we don't need
to modify the fanotify_event_metadata structure at all.

> > > There are a couple of ways I see around this:
> > > 1. Have a flag in the fanotify response that's like FAN_CLOSE_FD,
> > > where fanotify_write closes the fd when
> > > it processes the response.
> >
> > That seems doable and useful.
> >
I can work on this. We already have FAN_AUDIT and FAN_INFO as optional
bits that can be set in the response. I know there's another thread talking
about using the high-bits for error codes, and I wouldn't want to pollute
the higher bits too much if we go down that route.

I'm also not sure how best to implement capability probing (other than you
get an EINVAL in userspace and retry without that bit being set).

> > > 2. Make the response identifier separate from the FD. This can either
> > > be an IDR / xarray, or a 64-bit always
> > > incrementing number. The benefit of using an xarray is that responses
> > > can than be handled in O(1) speed
> > > whereas the current implementation is O(n) in relationship to the
> > > number of outstanding events.
> >
> > The number of outstanding permission events is usually limited
> > by the number of events that are handled concurrently.
> >
> > I think that a 64-bit event id is a worthy addition to a well designed
> > notifications API. I am pretty sure that both Windows and MacOS
> > filesystem notification events have a 64-bit event id.
>
> I agree that having 64-bit id in an event and just increment it with each
> event would be simple and fine way to identify events. This could then be
> used to match responses to events when we would be reporting permission
> events for FID groups as I outlined in my email yesterday [1].
>
> > If we ever implement filesystem support for persistent change
> > notification journals (as in Windows), those ids would be essential.
>
> How do you want to use it for persistent change journal? I'm mostly asking
> because currently I expect the number to start from 0 for each fsnotify
> group created but if you'd like to persist somewhere the event id, this
> would not be ideal?
>
> > > This can be implemented by adding an additional piece of response
> > > metadata, and then that becomes the
> > > key vs. fd on response.
> > > ---
> > >
> > > An aside, ETXTBUSY / i_writecount is a real bummer. We want to be able
> > > to populate file content lazily,
> > > and I realize there are many steps between removing the write lock,
> > > and being able to do this, but given
> > > that you can achieve this with FUSE, NFS, EROFS / cachefilesd, it
> > > feels somewhat arbitrary to continue
> > > to have this in place for executable files only.
> >
> > I think there are way too many security models built around this
> > behavior to be able to change it for the common case.
> >
> > However, I will note that technically, the filesystem does not
> > need to require a file open for write to fill its content as long as the
> > file content is written directly to disk and as long as page cache
> > is not populated in that region and invalidate_lock is held.
>
> I was thinking about this as well and as you say, technically there is no
> problem with this - just bypass writer the checks when opening the file -
> but I think it just gives userspace too much rope to hang itself on. So if
> we are going to support something like this it will require very careful
> API design and wider consensus among VFS people this is something we are
> willing to support.
>
> > Requiring FMODE_WRITE for btrfs_ioctl_encoded_write()
> > may make sense, but it is not a must.
> > A more fine grained page cache protection approach is also
> > an option, a bit like (or exactly like) exported pNFS layouts.
> >
> > IOW, you could potentially implement lazy filling of executable
> > content with specific file systems that support an out of band
> > file filling API, but then we would also need notifications for
> > page faults, so let's leave that for the far future.
>
> That's another good point that currently we don't generate fsnotify events
> on page faults. For lazy filling of executables this is basically must-have
> functionality and I'd say that even for lazy filling of other files,
> handling notifications on page faults will come up sooner rather than
> later. So this is probably the thing we'll have to handle first.
>
>                                                                 Honza
>
> [1] https://lore.kernel.org/all/20240205182718.lvtgfsxcd6htbqyy@quack3
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fanotify: concurrent work and handling files being executed
  2024-02-06 16:29     ` Sargun Dhillon
@ 2024-02-06 16:44       ` Amir Goldstein
  2024-02-07 10:44         ` Jan Kara
  0 siblings, 1 reply; 7+ messages in thread
From: Amir Goldstein @ 2024-02-06 16:44 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: Jan Kara, Linux FS-devel Mailing List, Sweet Tea Dorminy

On Tue, Feb 6, 2024 at 6:30 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Tue, Feb 6, 2024 at 6:50 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 06-02-24 09:44:29, Amir Goldstein wrote:
> > > On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > >
> > > > One of the issues we've hit recently while using fanotify in an HSM is
> > > > racing with files that are opened for execution.
> > > >
> > > > There is a race that can result in ETXTBUSY.
> > > > Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> > > > Pid 2: execve(file_by_path)
> > > > Pid 1: gets notification, with file.fd
> > > > Pid 2: blocked, waiting for notification to resolve
> > > > Pid 1: Does work with FD (populates the file)
> > > > Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> > > > Pid 2: continues, and falls through to deny_write_access (and fails)
> > > > Pid 1: closes fd
> >
> > Right, this is kind of nasty.
> >
> > > > Pid 1 can close the FD before responding, but this can result in a
> > > > race if fanotify is being handled in a multi-threaded
> > > > manner.
> >
> > Yep.
> >
> > > > I.e. if there are two threads operating on the same fanotify group,
> > > > and an event's FD has been closed, that can be reused
> > > > by another event. This is largely not a problem because the
> > > > outstanding events are added in a FIFO manner to the outstanding
> > > > event list, and as long as the earlier event is closed and responded
> > > > to without interruption, it should be okay, but it's difficult
> > > > to guarantee that this happens, unless event responses are serialized
> > > > in some fashion, with strict ordering between
> > > > responses.
> >
> > Yes, essentially you must make sure you will not read any new events from
> > the notification queue between fd close & writing of the response. Frankly,
> > I find this as quite ugly and asking for trouble (subtle bugs down the
> > line).
> >
> Is there a preference for either refactoring fanotify_event_metadata, or
> adding this new ID type as a piece of metadata?
>
> I almost feel like the FD should move to being metadata, and we should
> use ID in place of fd in fanotify_event_metadata. If we use an xarray,
> it should be reasonable to use a 32-bit identifier, so we don't need
> to modify the fanotify_event_metadata structure at all.
>

I have a strong preference for FANOTIFY_METADATA_VERSION 4
because I really would like event->key to be 64bit and in the header,
but I have a feeling that Jan may have a different opinion..

> > > > There are a couple of ways I see around this:
> > > > 1. Have a flag in the fanotify response that's like FAN_CLOSE_FD,
> > > > where fanotify_write closes the fd when
> > > > it processes the response.
> > >
> > > That seems doable and useful.
> > >
> I can work on this. We already have FAN_AUDIT and FAN_INFO as optional
> bits that can be set in the response. I know there's another thread talking
> about using the high-bits for error codes, and I wouldn't want to pollute
> the higher bits too much if we go down that route.
>

There is plenty of bits space still.
I could implement FAN_CLOSE_FD if we get to a conclusion that it
will help. it should be quite trivial so better not do conflicting work
in parallel. But I think that Jan's idea of not opening event->fd at all
may be better.

> I'm also not sure how best to implement capability probing (other than you
> get an EINVAL in userspace and retry without that bit being set).
>

Yes, easy.
You can event write a response with FAN_NOFD for fd after
fanotify_init() to check capability.
If capability is supported you will get -ENOENT otherwise -INVAL.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fanotify: concurrent work and handling files being executed
  2024-02-06 16:44       ` Amir Goldstein
@ 2024-02-07 10:44         ` Jan Kara
  2024-02-07 11:15           ` Amir Goldstein
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2024-02-07 10:44 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, Jan Kara, Linux FS-devel Mailing List,
	Sweet Tea Dorminy

On Tue 06-02-24 18:44:47, Amir Goldstein wrote:
> On Tue, Feb 6, 2024 at 6:30 PM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > On Tue, Feb 6, 2024 at 6:50 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 06-02-24 09:44:29, Amir Goldstein wrote:
> > > > On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > >
> > > > > One of the issues we've hit recently while using fanotify in an HSM is
> > > > > racing with files that are opened for execution.
> > > > >
> > > > > There is a race that can result in ETXTBUSY.
> > > > > Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> > > > > Pid 2: execve(file_by_path)
> > > > > Pid 1: gets notification, with file.fd
> > > > > Pid 2: blocked, waiting for notification to resolve
> > > > > Pid 1: Does work with FD (populates the file)
> > > > > Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> > > > > Pid 2: continues, and falls through to deny_write_access (and fails)
> > > > > Pid 1: closes fd
> > >
> > > Right, this is kind of nasty.
> > >
> > > > > Pid 1 can close the FD before responding, but this can result in a
> > > > > race if fanotify is being handled in a multi-threaded
> > > > > manner.
> > >
> > > Yep.
> > >
> > > > > I.e. if there are two threads operating on the same fanotify group,
> > > > > and an event's FD has been closed, that can be reused
> > > > > by another event. This is largely not a problem because the
> > > > > outstanding events are added in a FIFO manner to the outstanding
> > > > > event list, and as long as the earlier event is closed and responded
> > > > > to without interruption, it should be okay, but it's difficult
> > > > > to guarantee that this happens, unless event responses are serialized
> > > > > in some fashion, with strict ordering between
> > > > > responses.
> > >
> > > Yes, essentially you must make sure you will not read any new events from
> > > the notification queue between fd close & writing of the response. Frankly,
> > > I find this as quite ugly and asking for trouble (subtle bugs down the
> > > line).
> > >
> > Is there a preference for either refactoring fanotify_event_metadata, or
> > adding this new ID type as a piece of metadata?
> >
> > I almost feel like the FD should move to being metadata, and we should
> > use ID in place of fd in fanotify_event_metadata. If we use an xarray,
> > it should be reasonable to use a 32-bit identifier, so we don't need
> > to modify the fanotify_event_metadata structure at all.
> 
> I have a strong preference for FANOTIFY_METADATA_VERSION 4
> because I really would like event->key to be 64bit and in the header,
> but I have a feeling that Jan may have a different opinion..

I also think 64-bit ID would be potentially more useful for userspace
(mostly because of guaranteed uniqueness). I'm just still not yet sure how
do you plan to use it for persistent events because there are several
options how to generate the ID. I'd hate to add yet-another-id in the near
future.

Regarding FANOTIFY_METADATA_VERSION 4: What are your reasons to want the ID
in the header? I think we'd need explicit init flag to enable event ID
reporting anyway? But admittedly I can see some appeal of having ID in the
header if we are going to use the ID for matching responses to permission
events.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fanotify: concurrent work and handling files being executed
  2024-02-07 10:44         ` Jan Kara
@ 2024-02-07 11:15           ` Amir Goldstein
  0 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2024-02-07 11:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Sargun Dhillon, Linux FS-devel Mailing List, Sweet Tea Dorminy

On Wed, Feb 7, 2024 at 12:44 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 06-02-24 18:44:47, Amir Goldstein wrote:
> > On Tue, Feb 6, 2024 at 6:30 PM Sargun Dhillon <sargun@sargun.me> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 6:50 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 06-02-24 09:44:29, Amir Goldstein wrote:
> > > > > On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > >
> > > > > > One of the issues we've hit recently while using fanotify in an HSM is
> > > > > > racing with files that are opened for execution.
> > > > > >
> > > > > > There is a race that can result in ETXTBUSY.
> > > > > > Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> > > > > > Pid 2: execve(file_by_path)
> > > > > > Pid 1: gets notification, with file.fd
> > > > > > Pid 2: blocked, waiting for notification to resolve
> > > > > > Pid 1: Does work with FD (populates the file)
> > > > > > Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> > > > > > Pid 2: continues, and falls through to deny_write_access (and fails)
> > > > > > Pid 1: closes fd
> > > >
> > > > Right, this is kind of nasty.
> > > >
> > > > > > Pid 1 can close the FD before responding, but this can result in a
> > > > > > race if fanotify is being handled in a multi-threaded
> > > > > > manner.
> > > >
> > > > Yep.
> > > >
> > > > > > I.e. if there are two threads operating on the same fanotify group,
> > > > > > and an event's FD has been closed, that can be reused
> > > > > > by another event. This is largely not a problem because the
> > > > > > outstanding events are added in a FIFO manner to the outstanding
> > > > > > event list, and as long as the earlier event is closed and responded
> > > > > > to without interruption, it should be okay, but it's difficult
> > > > > > to guarantee that this happens, unless event responses are serialized
> > > > > > in some fashion, with strict ordering between
> > > > > > responses.
> > > >
> > > > Yes, essentially you must make sure you will not read any new events from
> > > > the notification queue between fd close & writing of the response. Frankly,
> > > > I find this as quite ugly and asking for trouble (subtle bugs down the
> > > > line).
> > > >
> > > Is there a preference for either refactoring fanotify_event_metadata, or
> > > adding this new ID type as a piece of metadata?
> > >
> > > I almost feel like the FD should move to being metadata, and we should
> > > use ID in place of fd in fanotify_event_metadata. If we use an xarray,
> > > it should be reasonable to use a 32-bit identifier, so we don't need
> > > to modify the fanotify_event_metadata structure at all.
> >
> > I have a strong preference for FANOTIFY_METADATA_VERSION 4
> > because I really would like event->key to be 64bit and in the header,
> > but I have a feeling that Jan may have a different opinion..
>
> I also think 64-bit ID would be potentially more useful for userspace
> (mostly because of guaranteed uniqueness). I'm just still not yet sure how
> do you plan to use it for persistent events because there are several
> options how to generate the ID. I'd hate to add yet-another-id in the near
> future.

The use case of permission events and persistent async events do not
overlap and for the future, they should not be mixed in the same group
at all. mixing permission events and async events in the same group
was probably a mistake and we should not repeat it.

The event->id namespace is per group.
permission events are always realtime events so they do not persist
and event->id can be a runtime id used for responding.
A future group for persistent async events would be something like
https://learn.microsoft.com/en-us/windows/win32/fileio/change-journal-records
and then the 64-bit event->id would have a different meaning.
It means "ACK that events up ID are consumed" or "query the events since ID".

>
> Regarding FANOTIFY_METADATA_VERSION 4: What are your reasons to want the ID
> in the header?

Just a matter of personal taste - I see event->id as being fundamental
information
and not "extra" information.

Looking forward at persistent events, it would be easier to iterate and skip
up to ID, if event->id is in the header.

> I think we'd need explicit init flag to enable event ID
> reporting anyway?But admittedly I can see some appeal of having ID in the
> header if we are going to use the ID for matching responses to permission
> events.

Yes, as you wrote,
permission events with FAN_REPORT_FID are a clean start.
I don't mind if this setup requires an explicit FAN_REPORT_EVENT_ID flag.

Also, regarding your suggestion to report FID instead of event->fd,
I think we can do it like that:
1. With FAN_REPORT_FID in permission events, event->fd should
    NOT be used to access the file
2. Instead, it should be used as a mount_fd arg to open_by_handle_at()
    along with the reported fid
3. In that case, event->fd may be an O_PATH fd (we need a
    patch to allow O_PATH fd as mount_fd)
4. An fd that is open with open_by_handle_at(event->fd, ...
    will have the FMODE_NONOTIFY flag, so it is safe to access the file

This model also solves my problem with rename(), because a single
mount_fd could be used to open both old and new parent path fds.

Does that sound like a plan?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-02-07 11:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-05 23:23 Fanotify: concurrent work and handling files being executed Sargun Dhillon
2024-02-06  7:44 ` Amir Goldstein
2024-02-06 13:50   ` Jan Kara
2024-02-06 16:29     ` Sargun Dhillon
2024-02-06 16:44       ` Amir Goldstein
2024-02-07 10:44         ` Jan Kara
2024-02-07 11:15           ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).