Re: fanotify HSM open issues

* Re: fanotify HSM open issues
       [not found]                   ` <20230817182220.vzzklvr7ejqlfnju@quack3>
@ 2023-08-18  7:01                     ` Amir Goldstein
  2023-08-23 14:37                       ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2023-08-18  7:01 UTC
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, Jens Axboe, linux-fsdevel

[adding fsdevel]

On Thu, Aug 17, 2023 at 9:22 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 17-08-23 10:13:20, Amir Goldstein wrote:
> > [CC Christian and Jens for the NOWAIT semantics]
> >
> > Jan,
> >
> > I was going to post start-write-safe patches [1], but now that this
> > design issue has emerged, with your permission, I would like to
> > take this discussion to fsdevel, so please reply to the list.
> >
> > For those who just joined, the context is fanotify HSM API [2]
> > proposal and avoiding the fanotify deadlocks I described in my
> > talk on LSFMM [3].
>
> OK, sure. I'm resending the reply which I sent only to you here.
>
> > On Wed, Aug 16, 2023 at 8:18 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > On Wed, Aug 16, 2023 at 12:47 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Mon 14-08-23 16:57:48, Amir Goldstein wrote:
> > > > > On Mon, Jul 3, 2023 at 11:03 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > On Mon, Jul 3, 2023, 9:30 PM Jan Kara <jack@suse.cz> wrote:
> > > > > do_sendfile() or ovl_copy_up() from ovl1 to xfs1, end up calling
> > > > > do_splice_direct() with sb_writers(xfs1) held.
> > > > > Internally, the splice operation calls into ovl_splice_read(), which
> > > > > has to call the rw_verify_area() check with the fsnotify hook on the
> > > > > underlying xfs file.
> > > >
> > > > Right, we can call rw_verify_area() only after overlayfs has told us what
> > > > is actually the underlying file that is really used for reading. Hum,
> > > > nasty.
> > > >
> > > > > This is a violation of start-write-safe permission hooks and the
> > > > > lockdep_assert that I added in fsnotify_permission() catches this
> > > > > violation.
> > > > >
> > > > > I believe that a similar issue exists with do_splice_direct() from
> > > > > an fs that is loop mounted over an image file on xfs1 to xfs1.
> > > >
> > > > I don't see how that would be possible. If you have a loop image file on
> > > > filesystem xfs1, then the filesystem stored in the image is some xfs2.
> > > > Overlayfs case is special here because it doesn't really work with
> > > > filesystems but rather directory subtrees and that causes the
> > > > complications.
> > > >
> > >
> > > I was referring to sendfile() from xfs2 to xfs1.
> > > sb_writers of xfs1 is held, but loop needs to read from the image file
> > > in xfs1. No?
>
> Yes, that seems possible and it would indeed trigger rw_verify_area() in
> do_iter_read() on xfs1 while freeze protection for xfs1 is held.
>

Recap for new people joining this thread.

The following deadlock is possible in the upstream kernel
if a fanotify permission event handler tries, in some cases,
to modify the filesystem it is watching in the context
of FAN_ACCESS_PERM handling:

P1                             P2                      P3
-----------                    ------------            ------------
do_sendfile(fs1.out_fd, fs1.in_fd)
-> sb_start_write(fs1.sb)
  -> do_splice_direct()                                freeze_super(fs1.sb)
    -> rw_verify_area()                                -> sb_wait_write(fs1.sb) ......
      -> security_file_permission()
        -> fsnotify_perm() --> FAN_ACCESS_PERM
                                 -> do_unlinkat(fs1.dfd, ...)
                                   -> sb_start_write(fs1.sb) ......

start-write-safe patches [1] (not posted) are trying to solve this
deadlock and prepare the ground for a new set of permission events
with cleaner/safer semantics.

The cases described above of sendfile from a file in loop mounted
image over fs1 or overlayfs over fs1 into a file in fs1 can still deadlock
despite the start-write-safe patches [1].
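
The wait cycle in the diagram above can be replayed with a toy userspace
model. This is only a sketch: struct model_sb and the model_*() helpers are
hypothetical names that stand in for one sb_writers freeze level, and nothing
here is kernel code. It assumes that a started freeze turns away new writers
and cannot finish until the existing writers are gone.

```c
#include <stdbool.h>

/* Toy userspace model of one sb_writers freeze level; not kernel code. */
struct model_sb {
    int  writers;        /* tasks holding write (freeze) protection */
    bool freeze_pending; /* freeze has started and waits for writers */
};

/* Models sb_start_write_trylock(): fails instead of sleeping. */
static bool model_start_write_trylock(struct model_sb *sb)
{
    if (sb->freeze_pending)
        return false; /* a blocking sb_start_write() would sleep here */
    sb->writers++;
    return true;
}

static void model_end_write(struct model_sb *sb)
{
    sb->writers--;
}

/* Models the start of freeze_super(): new writers are turned away. */
static void model_freeze_begin(struct model_sb *sb)
{
    sb->freeze_pending = true;
}

/* Models sb_wait_write() returning: only once all writers are gone. */
static bool model_freeze_can_finish(const struct model_sb *sb)
{
    return sb->freeze_pending && sb->writers == 0;
}

/*
 * Replays the P1/P2/P3 sequence. Returns true when the handler (P2) is
 * refused write access while P1 still holds it, which is the point where
 * a blocking sb_start_write() would close the wait cycle.
 */
static bool model_handler_would_block(void)
{
    struct model_sb fs1 = { 0, false };
    bool p1_ok, p2_ok, stuck;

    p1_ok = model_start_write_trylock(&fs1); /* P1: do_sendfile()   */
    model_freeze_begin(&fs1);                /* P3: freeze_super()  */
    p2_ok = model_start_write_trylock(&fs1); /* P2: handler, unlink */
    stuck = !model_freeze_can_finish(&fs1);  /* P3 waits for P1     */

    model_end_write(&fs1); /* P2 denies the event, P1 unwinds */
    return p1_ok && !p2_ok && stuck && model_freeze_can_finish(&fs1);
}
```

In the model, P1 waits for the handler's reply, the handler would wait for the
thaw and the freezer waits for P1. A trylock in the handler's path breaks the
cycle at the cost of failing the handler's modification.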

> > > > > My earlier patches had annotated the rw_verify_area() calls
> > > > > in splice iterators as "MAY_NOT_START_WRITE" and the
> > > > > userspace event listener was notified via flag whether modifying
> > > > > the content of the file was allowed or not.
> > > > >
> > > > > I do not care so much about HSM being able to fill content of files
> > > > > from a nested context like this, but we do need some way for
> > > > > userspace to at least deny this access to a file with no content.
> > > > >
> > > > > Another possibility I thought of is to change file_start_write()
> > > > > > to use file_start_write_trylock() for files with FMODE_NONOTIFY.
> > > > > This should make it safe to fill file content when event is generated
> > > > > with sb_writers held (if freeze is in progress modification will fail).
> > > > > Right?
> > > >
> > > > OK, so you mean that the HSM managing application will get an fd with
> > > > FMODE_NONOTIFY set from an event and use it for filling in the file
> > > > contents and the kernel functions grabbing freeze protection will detect
> > > > the file flag and bail with error instead of waiting? That sounds like an
> > > > attractive solution - the HSM managing app could even reply with error like
> > > > ERESTARTSYS to fanotify event and make the syscall restart (which will
> > > > block until the fs is unfrozen and then we can try again) and thus handle
> > > > the whole problem transparently for the application generating the event.
> > > > But I'm just dreaming now, for start it would be fine to just fail the
> > > > syscall.
> > > >
> > >
> > > IMO, a temporary error from an HSM controlled fs is not a big deal.
> > > Same as a temporary error from a network fs or FUSE - should be
> > > tolerable when the endpoint is not connected.
> > > One of my patches allows HSM returning an error that is not EPERM as
> > > response - this can be useful in such situations.
>
> OK.
>
> > > > I see only three possible problems with the solution. Firstly, the HSM
> > > > application will have to be careful to only access the managed filesystem
> > > > with the fd returned from fanotify event as otherwise it could deadlock on
> > > > frozen filesystem.
> > >
> > > Isn't that already the case to some extent?
> > > It is not wise for permission event handlers to perform operations
> > > on fd without FMODE_NONOTIFY.
>
> Yes, it isn't a new problem. The number of bug reports in our bugzilla
> boiling down to this kind of self-deadlock just shows that fanotify users
> get this wrong all the time.
>
> > > > That may seem obvious but practice shows that with
> > > > complex software stacks with many dependencies, this is far from trivial.
> > >
> > > It will be especially important when we have permission events
> > > on directory operations that need to perform operations on O_PATH
> > > dirfd with FMODE_NONOTIFY.
> > >
> > > > Secondly, conditioning the trylock behavior on FMODE_NONOTIFY seems
> > > > somewhat arbitrary unless you understand our implementation issues and
> > > > possibly it could regress current unsuspecting users. So I'm thinking
> > > > whether we shouldn't rather have an explicit open flag requiring erroring
> > > > out on frozen filesystem instead of blocking and the HSM application will
> > > > need to use it to evade freezing deadlocks. Or we can just depend on
> > > > RWF_NOWAIT flag (we currently block on frozen filesystem despite this flag
> > > > but that can be viewed as a bug) but that's limited to writes (i.e., no way
> > > > to e.g. do fallocate(2) without blocking on frozen fs).
> > >
> > > User cannot ask for fd with FMODE_NONOTIFY as it is - this is provided
> > > as a means to an end by fanotify - so it would not be much different if
> > > the new events would provide an fd with FMODE_NONOTIFY |
> > > FMODE_NOWAIT. It will be up to documentation to say what is and what
> > > is not allowed with the event->fd provided by fanotify.
> > >
> >
> > This part needs clarifying.
> > Technically, we can use the flag FMODE_NOWAIT to prevent waiting in
> > file_start_write() *when* it is combined with FMODE_NONOTIFY.
> >
> > Yes, it would be a change of behavior, but I think it would be a good change,
> > because current event->fd from FAN_ACCESS_PERM events is really not
> > write-safe (could deadlock with freezing fs).
>
> As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> this fd. It says nothing about freeze handling or so. Furthermore as you
> observe FMODE_NONOTIFY cannot be set by userspace but practically all
> current fanotify users need to also do IO on other files in order to handle
> fanotify event. So ideally we'd have a way to do IO to other files in a
> manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> to only trylock freeze protection - that actually makes a lot of sense to
> me. The question is whether this is enough or not.
>

Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
to a file is not the only thing that HSM needs to do.
Eventually, event handler for lookup permission events should be
able to also create files without blocking on vfs level freeze protection.

In theory, I am not saying we should do it, but as a thought experiment:
if the requirement from the permission event handler is that it must use an
O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
any filesystem modifications, then instead of aiming for NOWAIT
semantics using sb_start_write_trylock(), we could use a freeze level
SB_FREEZE_FSNOTIFY between
SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.

As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
make it clear how userspace should avoid nesting "VFS faults" there is
a model that can solve the deadlock correctly.
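
To make the thought experiment concrete, here is a small userspace model of
level-by-level freezing with the extra level in place. It is a sketch under
stated assumptions: the MODEL_* names, struct model_sb and the helpers are
hypothetical, MODEL_FREEZE_FSNOTIFY does not exist in the kernel, and the
model only captures the ordering rule that a level must drain before the next
one is closed.

```c
#include <stdbool.h>

/*
 * Model only. The real enum has SB_FREEZE_WRITE, SB_FREEZE_PAGEFAULT,
 * SB_FREEZE_FS and SB_FREEZE_COMPLETE; MODEL_FREEZE_FSNOTIFY is the
 * hypothetical level and its position here is an assumption.
 */
enum model_freeze_level {
    MODEL_UNFROZEN = 0,
    MODEL_FREEZE_WRITE,     /* ordinary writers */
    MODEL_FREEZE_FSNOTIFY,  /* hypothetical: permission event handlers */
    MODEL_FREEZE_PAGEFAULT,
    MODEL_FREEZE_FS,
    MODEL_FREEZE_COMPLETE,
};
#define MODEL_FREEZE_LEVELS (MODEL_FREEZE_COMPLETE - 1)

struct model_sb {
    int frozen;                       /* level the freeze has reached */
    int holders[MODEL_FREEZE_LEVELS]; /* holders per level, index level - 1 */
};

/* Take protection at 'level' unless the freeze already reached it. */
static bool model_trylock(struct model_sb *sb, int level)
{
    if (sb->frozen >= level)
        return false;
    sb->holders[level - 1]++;
    return true;
}

static void model_unlock(struct model_sb *sb, int level)
{
    sb->holders[level - 1]--;
}

/*
 * Advance the freeze by one level. The level being frozen must drain
 * before the next one is closed. Returns false if the freeze has to wait.
 */
static bool model_freeze_step(struct model_sb *sb)
{
    if (sb->frozen == MODEL_FREEZE_COMPLETE)
        return false;
    if (sb->frozen != MODEL_UNFROZEN && sb->holders[sb->frozen - 1] > 0)
        return false;
    sb->frozen++;
    return true;
}
```

In this model a handler that takes MODEL_FREEZE_FSNOTIFY is still admitted
while the freeze waits at MODEL_FREEZE_WRITE for the task that generated the
event, so both can finish before the freeze moves on. The model says nothing
about nested events raised by the handler itself, which would still have to be
avoided.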

> > Then we have two options:
> > 1. Generate "write-safe" FAN_PRE_ACCESS events only for fs that set
> >     FMODE_NOWAIT.
> >     Other fs will still generate the legacy FAN_ACCESS_PERM events
> >     which will be documented as write-unsafe
> > 2. Use a new internal flag (e.g. FMODE_NOSBWAIT) for the stronger
> >     NOWAIT semantics that fanotify will always set on event->fd for the
> >     new write-safe FAN_PRE_ACCESS events
> >
> > TBH, the backing fs for HSM [2] is anyway supposed to be a "normal"
> > local fs and I'd be more comfortable with fs opting in to support fanotify
> > HSM events, so option #1 doesn't seem like a terrible idea??
>
> Yes, I don't think 1) would really be a limitation that would matter too
> much in practice.
>
> > > Currently, the documentation is missing, because there are operations
> > > that are not really safe in the permission event context, but there is no
> > > documentation about that.
> > >
> > > > Thirdly, unless we
> > > > propagate to the HSM app the information whether the freeze protection is
> > > > held in the kernel or not, it doesn't know whether it should just wait for
> > > > the filesystem to unfreeze or whether it should rather fail the request to
> > > > avoid the deadlock. Hrm...
> > >
> > > informing HSM if freeze protection is held by this thread may be a little
> > > challenging, but it is easy for me to annotate possible risky contexts
> > > like the hooks inside splice read.
> > > I am just not sure that waiting in HSM context is that important and
> > > if it is not better to always fail in the frozen fs case.
>
> Always failing in frozen fs case is certainly possible but that will make
> fs freezing a bit non-transparent - the application may treat such failures
> as fatal errors and abort. So it's ok for the first POC but eventually we
> should have a plan how we could make fs freezing transparent for the
> applications even for HSM managed filesystems.
>

OK. ATM, the only solution I can think of that is both maintainable
and lets HSM live in complete harmony with fsfreeze is adding the
extra SB_FREEZE_FSNOTIFY level.

I am not sure how big of an overhead that would be?
I imagine that sb_writers is large enough as it is w.r.t fitting into
cache lines?
I don't think that it adds much complexity or maintenance burden
to vfs?? I'm really not sure.

> > > I wonder if we go down this path, if we need any of the start-write-safe
> > > patches at all? maybe only some of them to avoid duplicate hooks?
>
> Yes, avoiding duplicate hooks would be nice in any case.

OK. I already posted some patches from the series to vfs [4] and ovl [5].

The rest of the series can be justified also for avoiding duplicate
permission hook and also to greatly reduce the risk of the aforementioned
deadlock, despite the remaining loop/ovl corner cases.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/start-write-safe
[2] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API
[3] https://youtu.be/z3A7mzfceKM
[4] https://lore.kernel.org/linux-fsdevel/20230817141337.1025891-1-amir73il@gmail.com/
[5] https://lore.kernel.org/linux-unionfs/20230816152334.924960-1-amir73il@gmail.com/

* Re: fanotify HSM open issues
  2023-08-18  7:01                     ` fanotify HSM open issues Amir Goldstein
@ 2023-08-23 14:37                       ` Jan Kara
  2023-08-23 16:31                         ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2023-08-23 14:37 UTC
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Fri 18-08-23 10:01:40, Amir Goldstein wrote:
> [adding fsdevel]
> 
> On Thu, Aug 17, 2023 at 9:22 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 17-08-23 10:13:20, Amir Goldstein wrote:
> > > [CC Christian and Jens for the NOWAIT semantics]
> > >
> > > Jan,
> > >
> > > I was going to post start-write-safe patches [1], but now that this
> > > design issue has emerged, with your permission, I would like to
> > > take this discussion to fsdevel, so please reply to the list.
> > >
> > > For those who just joined, the context is fanotify HSM API [2]
> > > proposal and avoiding the fanotify deadlocks I described in my
> > > talk on LSFMM [3].
> >
> > OK, sure. I'm resending the reply which I sent only to you here.
> >
> > > On Wed, Aug 16, 2023 at 8:18 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > On Wed, Aug 16, 2023 at 12:47 PM Jan Kara <jack@suse.cz> wrote:
> > > > > On Mon 14-08-23 16:57:48, Amir Goldstein wrote:
> > > > > > On Mon, Jul 3, 2023 at 11:03 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > On Mon, Jul 3, 2023, 9:30 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > do_sendfile() or ovl_copy_up() from ovl1 to xfs1, end up calling
> > > > > > do_splice_direct() with sb_writers(xfs1) held.
> > > > > > Internally, the splice operation calls into ovl_splice_read(), which
> > > > > > has to call the rw_verify_area() check with the fsnotify hook on the
> > > > > > underlying xfs file.
> > > > >
> > > > > Right, we can call rw_verify_area() only after overlayfs has told us what
> > > > > is actually the underlying file that is really used for reading. Hum,
> > > > > nasty.
> > > > >
> > > > > > This is a violation of start-write-safe permission hooks and the
> > > > > > lockdep_assert that I added in fsnotify_permission() catches this
> > > > > > violation.
> > > > > >
> > > > > > I believe that a similar issue exists with do_splice_direct() from
> > > > > > an fs that is loop mounted over an image file on xfs1 to xfs1.
> > > > >
> > > > > I don't see how that would be possible. If you have a loop image file on
> > > > > filesystem xfs1, then the filesystem stored in the image is some xfs2.
> > > > > Overlayfs case is special here because it doesn't really work with
> > > > > filesystems but rather directory subtrees and that causes the
> > > > > complications.
> > > > >
> > > >
> > > > I was referring to sendfile() from xfs2 to xfs1.
> > > > sb_writers of xfs1 is held, but loop needs to read from the image file
> > > > in xfs1. No?
> >
> > Yes, that seems possible and it would indeed trigger rw_verify_area() in
> > do_iter_read() on xfs1 while freeze protection for xfs1 is held.
> >
> 
> Recap for new people joining this thread.
> 
> The following deadlock is possible in upstream kernel
> if fanotify permission event handler tries to make
> modifications to the filesystem it is watching in the context
> of FAN_ACCESS_PERM handling in some cases:
> 
> P1                             P2                      P3
> -----------                    ------------            ------------
> do_sendfile(fs1.out_fd, fs1.in_fd)
> -> sb_start_write(fs1.sb)
>   -> do_splice_direct()                         freeze_super(fs1.sb)
>     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
>       -> security_file_permission()
>         -> fsnotify_perm() --> FAN_ACCESS_PERM
>                                  -> do_unlinkat(fs1.dfd, ...)
>                                    -> sb_start_write(fs1.sb) ......
> 
> start-write-safe patches [1] (not posted) are trying to solve this
> deadlock and prepare the ground for a new set of permission events
> with cleaner/safer semantics.
> 
> The cases described above of sendfile from a file in loop mounted
> image over fs1 or overlayfs over fs1 into a file in fs1 can still deadlock
> despite the start-write-safe patches [1].

Yep, nice summary.

> > > > > > My earlier patches had annotated the rw_verify_area() calls
> > > > > > in splice iterators as "MAY_NOT_START_WRITE" and the
> > > > > > userspace event listener was notified via flag whether modifying
> > > > > > the content of the file was allowed or not.
> > > > > >
> > > > > > I do not care so much about HSM being able to fill content of files
> > > > > > from a nested context like this, but we do need some way for
> > > > > > userspace to at least deny this access to a file with no content.
> > > > > >
> > > > > > Another possibility I thought of is to change file_start_write()
> > > > > > > to use file_start_write_trylock() for files with FMODE_NONOTIFY.
> > > > > > This should make it safe to fill file content when event is generated
> > > > > > with sb_writers held (if freeze is in progress modification will fail).
> > > > > > Right?
> > > > >
> > > > > OK, so you mean that the HSM managing application will get an fd with
> > > > > FMODE_NONOTIFY set from an event and use it for filling in the file
> > > > > contents and the kernel functions grabbing freeze protection will detect
> > > > > the file flag and bail with error instead of waiting? That sounds like an
> > > > > attractive solution - the HSM managing app could even reply with error like
> > > > > ERESTARTSYS to fanotify event and make the syscall restart (which will
> > > > > block until the fs is unfrozen and then we can try again) and thus handle
> > > > > the whole problem transparently for the application generating the event.
> > > > > But I'm just dreaming now, for start it would be fine to just fail the
> > > > > syscall.
> > > > >
> > > >
> > > > IMO, a temporary error from an HSM controlled fs is not a big deal.
> > > > Same as a temporary error from a network fs or FUSE - should be
> > > > tolerable when the endpoint is not connected.
> > > > One of my patches allows HSM returning an error that is not EPERM as
> > > > response - this can be useful in such situations.
> >
> > OK.
> >
> > > > > I see only three possible problems with the solution. Firstly, the HSM
> > > > > application will have to be careful to only access the managed filesystem
> > > > > with the fd returned from fanotify event as otherwise it could deadlock on
> > > > > frozen filesystem.
> > > >
> > > > Isn't that already the case to some extent?
> > > > It is not wise for permission event handlers to perform operations
> > > > on fd without FMODE_NONOTIFY.
> >
> > Yes, it isn't a new problem. The number of bug reports in our bugzilla
> > boiling down to this kind of self-deadlock just shows that fanotify users
> > get this wrong all the time.
> >
> > > > > That may seem obvious but practice shows that with
> > > > > complex software stacks with many dependencies, this is far from trivial.
> > > >
> > > > It will be especially important when we have permission events
> > > > on directory operations that need to perform operations on O_PATH
> > > > dirfd with FMODE_NONOTIFY.
> > > >
> > > > > Secondly, conditioning the trylock behavior on FMODE_NONOTIFY seems
> > > > > somewhat arbitrary unless you understand our implementation issues and
> > > > > possibly it could regress current unsuspecting users. So I'm thinking
> > > > > whether we shouldn't rather have an explicit open flag requiring erroring
> > > > > out on frozen filesystem instead of blocking and the HSM application will
> > > > > need to use it to evade freezing deadlocks. Or we can just depend on
> > > > > RWF_NOWAIT flag (we currently block on frozen filesystem despite this flag
> > > > > but that can be viewed as a bug) but that's limited to writes (i.e., no way
> > > > > to e.g. do fallocate(2) without blocking on frozen fs).
> > > >
> > > > User cannot ask for fd with FMODE_NONOTIFY as it is - this is provided
> > > > as a means to an end by fanotify - so it would not be much different if
> > > > the new events would provide an fd with FMODE_NONOTIFY |
> > > > FMODE_NOWAIT. It will be up to documentation to say what is and what
> > > > is not allowed with the event->fd provided by fanotify.
> > > >
> > >
> > > This part needs clarifying.
> > > Technically, we can use the flag FMODE_NOWAIT to prevent waiting in
> > > file_start_write() *when* it is combined with FMODE_NONOTIFY.
> > >
> > > Yes, it would be a change of behavior, but I think it would be a good change,
> > > because current event->fd from FAN_ACCESS_PERM events is really not
> > > write-safe (could deadlock with freezing fs).
> >
> > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > this fd. It says nothing about freeze handling or so. Furthermore as you
> > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > current fanotify users need to also do IO on other files in order to handle
> > fanotify event. So ideally we'd have a way to do IO to other files in a
> > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > to only trylock freeze protection - that actually makes a lot of sense to
> > me. The question is whether this is enough or not.
> >
> 
> Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> to a file is not the only thing that HSM needs to do.
> Eventually, event handler for lookup permission events should be
> able to also create files without blocking on vfs level freeze protection.

So this is what I wanted to clarify. The lookup permission event never gets
called under a freeze protection so the deadlock doesn't exist there. In
principle the problem exists only for access and modify events where we'd
be filling in file data and thus RWF_NOWAIT could be enough. That being
said I understand this may be assuming too much about the implementations
of HSM daemons and as you write, we might want to provide a way to do IO
not blocking on freeze protection from any hook. But I wanted to point this
out explicitly so that it's a conscious decision.
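
For the data-filling case, a handler could attempt the write without blocking
and turn the outcome into its reply. The sketch below covers only that
decision step. It assumes the write was issued with something like pwritev2()
and RWF_NOWAIT, and that a frozen filesystem would then surface as EAGAIN,
which is the proposed behavior rather than what the kernel does today. struct
fill_verdict and hsm_fill_verdict() are hypothetical names; struct
fanotify_response, FAN_ALLOW and FAN_DENY come from <sys/fanotify.h>.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/fanotify.h>

/* Hypothetical helper type: the reply plus a hint for the HSM daemon. */
struct fill_verdict {
    struct fanotify_response resp; /* to be written back to the fanotify fd */
    bool transient;                /* failure may clear after a thaw */
};

/*
 * 'ret' and 'err' are the return value and errno of the non-blocking
 * write attempt; 'want' is the number of bytes that had to be filled.
 */
static struct fill_verdict hsm_fill_verdict(int event_fd, ssize_t ret,
                                            size_t want, int err)
{
    struct fill_verdict v = {
        .resp = { .fd = event_fd, .response = FAN_DENY },
        .transient = false,
    };

    if (ret >= 0 && (size_t)ret == want) {
        /* Content is in place, let the original access proceed. */
        v.resp.response = FAN_ALLOW;
    } else if (ret < 0 && err == EAGAIN) {
        /* Assumed to mean "would block", e.g. on a frozen filesystem. */
        v.transient = true;
    }
    /* Short writes and all other errors deny the access. */
    return v;
}
```

The transient hint is only for the daemon's own bookkeeping, for example to
retry the fill after the filesystem thaws. The reply itself stays a plain
allow or deny.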

> In theory, I am not saying we should do it, but as a thought experiment:
> if the requirement from the permission event handler is that it must use an
> O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> any filesystem modifications, then instead of aiming for NOWAIT
> semantics using sb_start_write_trylock(), we could use a freeze level
> SB_FREEZE_FSNOTIFY between
> SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> 
> As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> make it clear how userspace should avoid nesting "VFS faults" there is
> a model that can solve the deadlock correctly.

OK, yes, in principle another freeze level which could be used by handlers
of fanotify permission events would solve the deadlock as well. It's just
that you seem to want to tie this functionality to the particular fd returned
from fanotify and I'm not convinced that is a good idea. What if the
application needs to write to some other location besides the one fd it got
passed from the fanotify event? E.g. imagine it wants to fetch a whole subtree on
first access to any file in a subtree. Or maybe it wants to write to some
DB file containing current state or something like that.

One solution I can imagine is to create an open flag that can be specified
on open which would result in the special behavior wrt fs freezing. If the
special behavior would be just trylocking the freeze protection then it
would be really easy. If the behavior would be another freeze protection
level, then we'd need to make sure we don't generate another fanotify
permission event with such fd - autorejecting any such access is an obvious
solution but I'm not sure if practical for applications.

> > > Then we have two options:
> > > 1. Generate "write-safe" FAN_PRE_ACCESS events only for fs that set
> > >     FMODE_NOWAIT.
> > >     Other fs will still generate the legacy FAN_ACCESS_PERM events
> > >     which will be documented as write-unsafe
> > > 2. Use a new internal flag (e.g. FMODE_NOSBWAIT) for the stronger
> > >     NOWAIT semantics that fanotify will always set on event->fd for the
> > >     new write-safe FAN_PRE_ACCESS events
> > >
> > > TBH, the backing fs for HSM [2] is anyway supposed to be a "normal"
> > > local fs and I'd be more comfortable with fs opting in to support fanotify
> > > HSM events, so option #1 doesn't seem like a terrible idea??
> >
> > Yes, I don't think 1) would really be a limitation that would matter too
> > much in practice.
> >
> > > > Currently, the documentation is missing, because there are operations
> > > > that are not really safe in the permission event context, but there is no
> > > > documentation about that.
> > > >
> > > > > Thirdly, unless we
> > > > > propagate to the HSM app the information whether the freeze protection is
> > > > > held in the kernel or not, it doesn't know whether it should just wait for
> > > > > the filesystem to unfreeze or whether it should rather fail the request to
> > > > > avoid the deadlock. Hrm...
> > > >
> > > > informing HSM if freeze protection is held by this thread may be a little
> > > > challenging, but it is easy for me to annotate possible risky contexts
> > > > like the hooks inside splice read.
> > > > I am just not sure that waiting in HSM context is that important and
> > > > if it is not better to always fail in the frozen fs case.
> >
> > Always failing in frozen fs case is certainly possible but that will make
> > fs freezing a bit non-transparent - the application may treat such failures
> > as fatal errors and abort. So it's ok for the first POC but eventually we
> > should have a plan how we could make fs freezing transparent for the
> > applications even for HSM managed filesystems.
> >
> 
> OK. ATM, the only solution I can think of that is both maintainable
> and lets HSM live in complete harmony with fsfreeze is adding the
> extra SB_FREEZE_FSNOTIFY level.

To make things clear: if the only problems were those rare sendfile(2)
corner cases, then I guess we can live with that and implement retry
in the kernel if userspace ever complains about unexpected short copy or
EAGAIN...  The problem I see is that if we advise that all IO from the
fanotify event handler should happen in the freeze-safe manner, then with
the non-blocking solution all HSM IO suddenly starts failing as soon as
the filesystem is frozen. And that is IMHO not nice.

> I am not sure how big of an overhead that would be?
> I imagine that sb_writers is large enough as it is w.r.t fitting into
> cache lines?
> I don't think that it adds much complexity or maintenance burden
> to vfs?? I'm really not sure.
 
Well, the overhead is effectively one percpu counter per superblock.
Negligible in terms of CPU time, somewhat annoying in terms of memory but
bearable. So this may be a way forward.

> > > > I wonder if we go down this path, if we need any of the start-write-safe
> > > > patches at all? maybe only some of them to avoid duplicate hooks?
> >
> > Yes, avoiding duplicate hooks would be nice in any case.
> 
> OK. I already posted some patches from the series to vfs [4] and ovl [5].
> 
> The rest of the series can be justified also for avoiding duplicate
> permission hook and also to greatly reduce the risk of the aforementioned
> deadlock, despite the remaining loop/ovl corner cases.
> 
> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/linux/commits/start-write-safe
> [2] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API
> [3] https://youtu.be/z3A7mzfceKM
> [4] https://lore.kernel.org/linux-fsdevel/20230817141337.1025891-1-amir73il@gmail.com/
> [5] https://lore.kernel.org/linux-unionfs/20230816152334.924960-1-amir73il@gmail.com/

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: fanotify HSM open issues
  2023-08-23 14:37                       ` Jan Kara
@ 2023-08-23 16:31                         ` Amir Goldstein
  2023-11-13 11:50                           ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2023-08-23 16:31 UTC
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, Jens Axboe, linux-fsdevel

On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 18-08-23 10:01:40, Amir Goldstein wrote:
> > [adding fsdevel]
> >
> > On Thu, Aug 17, 2023 at 9:22 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 17-08-23 10:13:20, Amir Goldstein wrote:
> > > > [CC Christian and Jens for the NOWAIT semantics]
> > > >
> > > > Jan,
> > > >
> > > > I was going to post start-write-safe patches [1], but now that this
> > > > design issue has emerged, with your permission, I would like to
> > > > take this discussion to fsdevel, so please reply to the list.
> > > >
> > > > For those who just joined, the context is fanotify HSM API [2]
> > > > proposal and avoiding the fanotify deadlocks I described in my
> > > > talk on LSFMM [3].
> > >
> > > OK, sure. I'm resending the reply which I sent only to you here.
> > >
> > > > On Wed, Aug 16, 2023 at 8:18 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > On Wed, Aug 16, 2023 at 12:47 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > On Mon 14-08-23 16:57:48, Amir Goldstein wrote:
> > > > > > > On Mon, Jul 3, 2023 at 11:03 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > On Mon, Jul 3, 2023, 9:30 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > do_sendfile() or ovl_copy_up() from ovl1 to xfs1, end up calling
> > > > > > > do_splice_direct() with sb_writers(xfs1) held.
> > > > > > > Internally, the splice operation calls into ovl_splice_read(), which
> > > > > > > has to call the rw_verify_area() check with the fsnotify hook on the
> > > > > > > underlying xfs file.
> > > > > >
> > > > > > Right, we can call rw_verify_area() only after overlayfs has told us what
> > > > > > is actually the underlying file that is really used for reading. Hum,
> > > > > > nasty.
> > > > > >
> > > > > > > This is a violation of start-write-safe permission hooks and the
> > > > > > > lockdep_assert that I added in fsnotify_permission() catches this
> > > > > > > violation.
> > > > > > >
> > > > > > > I believe that a similar issue exists with do_splice_direct() from
> > > > > > > an fs that is loop mounted over an image file on xfs1 to xfs1.
> > > > > >
> > > > > > I don't see how that would be possible. If you have a loop image file on
> > > > > > filesystem xfs1, then the filesystem stored in the image is some xfs2.
> > > > > > Overlayfs case is special here because it doesn't really work with
> > > > > > filesystems but rather directory subtrees and that causes the
> > > > > > complications.
> > > > > >
> > > > >
> > > > > I was referring to sendfile() from xfs2 to xfs1.
> > > > > sb_writers of xfs1 is held, but loop needs to read from the image file
> > > > > in xfs1. No?
> > >
> > > Yes, that seems possible and it would indeed trigger rw_verify_area() in
> > > do_iter_read() on xfs1 while freeze protection for xfs1 is held.
> > >
> >
> > Recap for new people joining this thread.
> >
> > The following deadlock is possible in upstream kernel
> > if fanotify permission event handler tries to make
> > modifications to the filesystem it is watching in the context
> > of FAN_ACCESS_PERM handling in some cases:
> >
> > P1                             P2                      P3
> > -----------                    ------------            ------------
> > do_sendfile(fs1.out_fd, fs1.in_fd)
> > -> sb_start_write(fs1.sb)
> >   -> do_splice_direct()                         freeze_super(fs1.sb)
> >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> >       -> security_file_permission()
> >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> >                                  -> do_unlinkat(fs1.dfd, ...)
> >                                    -> sb_start_write(fs1.sb) ......
> >
> > start-write-safe patches [1] (not posted) are trying to solve this
> > deadlock and prepare the ground for a new set of permission events
> > with cleaner/safer semantics.
> >
> > The cases described above of sendfile from a file in loop mounted
> > image over fs1 or overlayfs over fs1 into a file in fs1 can still deadlock
> > despite the start-write-safe patches [1].
>
> Yep, nice summary.
>
> > > > > > > My earlier patches had annotated the rw_verify_area() calls
> > > > > > > in splice iterators as "MAY_NOT_START_WRITE" and the
> > > > > > > userspace event listener was notified via flag whether modifying
> > > > > > > the content of the file was allowed or not.
> > > > > > >
> > > > > > > I do not care so much about HSM being able to fill content of files
> > > > > > > from a nested context like this, but we do need some way for
> > > > > > > userspace to at least deny this access to a file with no content.
> > > > > > >
> > > > > > > Another possibility I thought of is to change file_start_write()
> > > > > > > > to use file_start_write_trylock() for files with FMODE_NONOTIFY.
> > > > > > > This should make it safe to fill file content when event is generated
> > > > > > > with sb_writers held (if freeze is in progress modification will fail).
> > > > > > > Right?
> > > > > >
> > > > > > OK, so you mean that the HSM managing application will get an fd with
> > > > > > FMODE_NONOTIFY set from an event and use it for filling in the file
> > > > > > contents and the kernel functions grabbing freeze protection will detect
> > > > > > the file flag and bail with error instead of waiting? That sounds like an
> > > > > > attractive solution - the HSM managing app could even reply with error like
> > > > > > ERESTARTSYS to fanotify event and make the syscall restart (which will
> > > > > > block until the fs is unfrozen and then we can try again) and thus handle
> > > > > > the whole problem transparently for the application generating the event.
> > > > > > But I'm just dreaming now, for start it would be fine to just fail the
> > > > > > syscall.
> > > > > >
> > > > >
> > > > > IMO, a temporary error from an HSM controlled fs is not a big deal.
> > > > > Same as a temporary error from a network fs or FUSE - should be
> > > > > tolerable when the endpoint is not connected.
> > > > > One of my patches allows HSM returning an error that is not EPERM as
> > > > > response - this can be useful in such situations.
> > >
> > > OK.
> > >
> > > > > > I see only three possible problems with the solution. Firstly, the HSM
> > > > > > application will have to be careful to only access the managed filesystem
> > > > > > with the fd returned from fanotify event as otherwise it could deadlock on
> > > > > > frozen filesystem.
> > > > >
> > > > > Isn't that already the case to some extent?
> > > > > It is not wise for permission event handlers to perform operations
> > > > > on fd without  FMODE_NONOTIFY.
> > >
> > > Yes, it isn't a new problem. The amount of bug reports in our bugzilla
> > > boiling down to this kind of self-deadlock just shows that fanotify users
> > > get this wrong all the time.
> > >
> > > > > > That may seem obvious but practice shows that with
> > > > > > complex software stacks with many dependencies, this is far from trivial.
> > > > >
> > > > > It will be especially important when we have permission events
> > > > > on directory operations that need to perform operations on O_PATH
> > > > > dirfd with FMODE_NONOTIFY.
> > > > >
> > > > > > Secondly, conditioning the trylock behavior on FMODE_NONOTIFY seems
> > > > > > > somewhat arbitrary unless you understand our implementation issues and
> > > > > > possibly it could regress current unsuspecting users. So I'm thinking
> > > > > > whether we shouldn't rather have an explicit open flag requiring erroring
> > > > > > out on frozen filesystem instead of blocking and the HSM application will
> > > > > > need to use it to evade freezing deadlocks. Or we can just depend on
> > > > > > RWF_NOWAIT flag (we currently block on frozen filesystem despite this flag
> > > > > > but that can be viewed as a bug) but that's limited to writes (i.e., no way
> > > > > > to e.g. do fallocate(2) without blocking on frozen fs).
> > > > >
> > > > > User cannot ask for fd with FMODE_NONOTIFY as it is - this is provided
> > > > > as a means to an end by fanotify - so it would not be much different if
> > > > > the new events would provide an fd with FMODE_NONOTIFY |
> > > > > FMODE_NOWAIT. It will be up to documentation to say what is and what
> > > > > is not allowed with the event->fd provided by fanotify.
> > > > >
> > > >
> > > > This part needs clarifying.
> > > > Technically, we can use the flag FMODE_NOWAIT to prevent waiting in
> > > > file_start_write() *when* it is combined with FMODE_NONOTIFY.
> > > >
> > > > Yes, it would be a change of behavior, but I think it would be a good change,
> > > > because current event->fd from FAN_ACCESS_PERM events is really not
> > > > write-safe (could deadlock with freezing fs).
> > >
> > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > current fanotify users need to also do IO on other files in order to handle
> > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > to only trylock freeze protection - that actually makes a lot of sense to
> > > me. The question is whether this is enough or not.
> > >
> >
> > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > to a file is not the only thing that HSM needs to do.
> > Eventually, event handler for lookup permission events should be
> > able to also create files without blocking on vfs level freeze protection.
>
> So this is what I wanted to clarify. The lookup permission event never gets
> called under a freeze protection so the deadlock doesn't exist there. In
> principle the problem exists only for access and modify events where we'd
> be filling in file data and thus RWF_NOWAIT could be enough.

Yes, you are right.
It is possible that RWF_NOWAIT could be enough.

But the discovery of the loop/ovl corner cases has shaken my
confidence in the ability to guarantee that freeze protection is not
held somehow indirectly.

If I am not mistaken, FAN_OPEN_PERM suffers from the exact
same ovl corner case, because with splice from ovl1 to fs1,
fs1 freeze protection is held and:
  ovl_splice_read(ovl1.file)
    ovl_real_fdget()
      ovl_open_realfile(fs1.file)
         ... security_file_open(fs1.file)

> That being
> said I understand this may be assuming too much about the implementations
> of HSM daemons and as you write, we might want to provide a way to do IO
> not blocking on freeze protection from any hook. But I wanted to point this
> out explicitly so that it's a conscious decision.
>
> > In theory, I am not saying we should do it, but as a thought experiment:
> > if the requirement from permission event handler is that it must use an
> > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > any filesystem modifications, then instead of aiming for NOWAIT
> > semantics using sb_start_write_trylock(), we could use a freeze level
> > SB_FREEZE_FSNOTIFY between
> > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> >
> > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > make it clear how userspace should avoid nesting "VFS faults" there is
> > a model that can solve the deadlock correctly.
>
> OK, yes, in principle another freeze level which could be used by handlers
> of fanotify permission events would solve the deadlock as well. Just you
> seem to like to tie this functionality to the particular fd returned from
> fanotify and I'm not convinced that is a good idea. What if the application
> needs to do write to some other location besides the one fd it got passed
> from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> first access to any file in a subtree. Or maybe it wants to write to some
> DB file containing current state or something like that.
>
> One solution I can imagine is to create an open flag that can be specified
> on open which would result in the special behavior wrt fs freezing. If the
> special behavior would be just trylocking the freeze protection then it
> would be really easy. If the behaviour would be another freeze protection
> level, then we'd need to make sure we don't generate another fanotify
> permission event with such fd - autorejecting any such access is an obvious
> solution but I'm not sure if practical for applications.
>

I had also considered marking the listener process with the FSNOTIFY
context and enforcing this context on fanotify_read().
In a way, this is similar to the NOIO and NOFS process context.
It could be used to both act as a stronger form of FMODE_NONOTIFY
and to activate the desired freeze protection behavior
(whether trylock or SB_FREEZE_FSNOTIFY level).
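
To illustrate the process-context idea, here is a minimal userspace sketch
of a save/restore scope in the style of the memalloc_noio_save() and
memalloc_noio_restore() helpers. The names task_model, CTX_FSNOTIFY,
ctx_fsnotify_save(), ctx_fsnotify_restore() and ctx_fsnotify_active() are
hypothetical; they only model how such a context could nest and are not
taken from any posted patch.

```c
#include <assert.h>

/* Hypothetical context bit; nothing by this name exists in the kernel. */
#define CTX_FSNOTIFY 0x1u

/* Stand-in for the per-task flags word (current->flags in the kernel). */
struct task_model {
	unsigned int flags;
};

/* Enter the context; return the previous state of the bit for restore. */
static unsigned int ctx_fsnotify_save(struct task_model *t)
{
	unsigned int old = t->flags & CTX_FSNOTIFY;

	t->flags |= CTX_FSNOTIFY;
	return old;
}

/* Leave the context, restoring whatever the matching save() observed. */
static void ctx_fsnotify_restore(struct task_model *t, unsigned int old)
{
	assert((old & ~CTX_FSNOTIFY) == 0);
	t->flags = (t->flags & ~CTX_FSNOTIFY) | old;
}

/* Would be consulted by freeze protection and permission hook code. */
static int ctx_fsnotify_active(const struct task_model *t)
{
	return (t->flags & CTX_FSNOTIFY) != 0;
}
```

Because restore() puts back the state seen by the matching save(), an
inner scope does not clear the context that an outer scope has set.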

> > > > Then we have two options:
> > > > 1. Generate "write-safe" FAN_PRE_ACCESS events only for fs that set
> > > >     FMODE_NOWAIT.
> > > >     Other fs will still generate the legacy FAN_ACCESS_PERM events
> > > >     which will be documented as write-unsafe
> > > > 2. Use a new internal flag (e.g. FMODE_NOSBWAIT) for the stronger
> > > >     NOWAIT semantics that fanotify will always set on event->fd for the
> > > >     new write-safe FAN_PRE_ACCESS events
> > > >
> > > > TBH, the backing fs for HSM [2] is anyway supposed to be a "normal"
> > > > local fs and I'd be more comfortable with fs opting in to support fanotify
> > > > HSM events, so option #1 doesn't seem like a terrible idea??
> > >
> > > Yes, I don't think 1) would be really be a limitation that would matter too
> > > much in practice.
> > >
> > > > > Currently, the documentation is missing, because there are operations
> > > > > that are not really safe in the permission event context, but there is no
> > > > > documentation about that.
> > > > >
> > > > > > Thirdly, unless we
> > > > > > propagate to the HSM app the information whether the freeze protection is
> > > > > > held in the kernel or not, it doesn't know whether it should just wait for
> > > > > > the filesystem to unfreeze or whether it should rather fail the request to
> > > > > > avoid the deadlock. Hrm...
> > > > >
> > > > > informing HSM if freeze protection is held by this thread may be a little
> > > > > challenging, but it is easy for me to annotate possible risky contexts
> > > > > like the hooks inside splice read.
> > > > > I am just not sure that waiting in HSM context is that important and
> > > > > if it is not better to always fail in the frozen fs case.
> > >
> > > Always failing in frozen fs case is certainly possible but that will make
> > > fs freezing a bit non-transparent - the application may treat such failures
> > > as fatal errors and abort. So it's ok for the first POC but eventually we
> > > should have a plan how we could make fs freezing transparent for the
> > > applications even for HSM managed filesystems.
> > >
> >
> > OK. ATM, the only solution I can think of that is both maintainable
> > and lets HSM live in complete harmony with fsfreeze is adding the
> > extra SB_FREEZE_FSNOTIFY level.
>
> To make things clear: if the only problems would be with those sendfile(2)
> rare corner-cases, then I guess we can live with that and implement retry
> in the kernel if userspace ever complains about unexpected short copy or
> EAGAIN...  The problem I see is that if we advise that all IO from the
> fanotify event handler should happen in the freeze-safe manner, then with
> the non-blocking solution all HSM IO suddenly starts failing as soon as
> the filesystem is frozen. And that is IMHO not nice.

I see what you mean. The SB_FREEZE_FSNOTIFY design is much more
clear in that respect.

> > I am not sure how big of an overhead that would be?
> > I imagine that sb_writers is large enough as it is w.r.t fitting into
> > cache lines?
> > I don't think that it adds much complexity or maintenance burden
> > to vfs?? I'm really not sure.
>
> Well, the overhead is effectively one percpu counter per superblock.
> Negligible in terms of CPU time, somewhat annoying in terms of memory but
> bearable. So this may be a way forward.
>

Considering that I had already spent some time optimizing out
the memory and performance overhead of s_write_srcu:
https://github.com/amir73il/linux/commit/655606ca8cffba7636959547dc094651cef56f4d
I guess I may be able to also optimize out the SB_FREEZE_FSNOTIFY
percpu counter under the same object that is allocated lazily for
sb on the first mark with the new write-safe fanotify events.

Ok. I have something to try..

Thanks!
Amir.


* Re: fanotify HSM open issues
  2023-08-23 16:31                         ` Amir Goldstein
@ 2023-11-13 11:50                           ` Amir Goldstein
  2023-11-20 14:06                             ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2023-11-13 11:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, Jens Axboe, linux-fsdevel

On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 18-08-23 10:01:40, Amir Goldstein wrote:
> > > [adding fsdevel]
> > >
> > > On Thu, Aug 17, 2023 at 9:22 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Thu 17-08-23 10:13:20, Amir Goldstein wrote:
> > > > > [CC Christian and Jens for the NOWAIT semantics]
> > > > >
> > > > > Jan,
> > > > >
> > > > > I was going to post start-write-safe patches [1], but now that this
> > > > > design issue has emerged, with your permission, I would like to
> > > > > take this discussion to fsdevel, so please reply to the list.
> > > > >
> > > > > For those who just joined, the context is fanotify HSM API [2]
> > > > > proposal and avoiding the fanotify deadlocks I described in my
> > > > > talk on LSFMM [3].
> > > >
> > > > OK, sure. I'm resending the reply which I sent only to you here.
> > > >
> > > > > On Wed, Aug 16, 2023 at 8:18 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > On Wed, Aug 16, 2023 at 12:47 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > On Mon 14-08-23 16:57:48, Amir Goldstein wrote:
> > > > > > > > On Mon, Jul 3, 2023 at 11:03 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > > On Mon, Jul 3, 2023, 9:30 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > do_sendfile() or ovl_copy_up() from ovl1 to xfs1, end up calling
> > > > > > > > do_splice_direct() with sb_writers(xfs1) held.
> > > > > > > > Internally, the splice operation calls into ovl_splice_read(), which
> > > > > > > > has to call the rw_verify_area() check with the fsnotify hook on the
> > > > > > > > underlying xfs file.
> > > > > > >
> > > > > > > Right, we can call rw_verify_area() only after overlayfs has told us what
> > > > > > > is actually the underlying file that is really used for reading. Hum,
> > > > > > > nasty.
> > > > > > >
> > > > > > > > This is a violation of start-write-safe permission hooks and the
> > > > > > > > lockdep_assert that I added in fsnotify_permission() catches this
> > > > > > > > violation.
> > > > > > > >
> > > > > > > > I believe that a similar issue exists with do_splice_direct() from
> > > > > > > > an fs that is loop mounted over an image file on xfs1 to xfs1.
> > > > > > >
> > > > > > > I don't see how that would be possible. If you have a loop image file on
> > > > > > > filesystem xfs1, then the filesystem stored in the image is some xfs2.
> > > > > > > Overlayfs case is special here because it doesn't really work with
> > > > > > > filesystems but rather directory subtrees and that causes the
> > > > > > > complications.
> > > > > > >
> > > > > >
> > > > > > I was referring to sendfile() from xfs2 to xfs1.
> > > > > > sb_writers of xfs1 is held, but loop needs to read from the image file
> > > > > > in xfs1. No?
> > > >
> > > > Yes, that seems possible and it would indeed trigger rw_verify_area() in
> > > > do_iter_read() on xfs1 while freeze protection for xfs1 is held.
> > > >
> > >
> > > Recap for new people joining this thread.
> > >
> > > The following deadlock is possible in upstream kernel
> > > if fanotify permission event handler tries to make
> > > modifications to the filesystem it is watching in the context
> > > of FAN_ACCESS_PERM handling in some cases:
> > >
> > > P1                             P2                      P3
> > > -----------                    ------------            ------------
> > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > -> sb_start_write(fs1.sb)
> > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > >       -> security_file_permission()
> > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > >                                  -> do_unlinkat(fs1.dfd, ...)
> > >                                    -> sb_start_write(fs1.sb) ......
> > >
> > > start-write-safe patches [1] (not posted) are trying to solve this
> > > deadlock and prepare the ground for a new set of permission events
> > > with cleaner/safer semantics.
> > >
> > > The cases described above of sendfile from a file in loop mounted
> > > image over fs1 or overlayfs over fs1 into a file in fs1 can still deadlock
> > > despite the start-write-safe patches [1].
> >
> > Yep, nice summary.
> >
> > > > > > > > My earlier patches had annotated the rw_verify_area() calls
> > > > > > > > in splice iterators as "MAY_NOT_START_WRITE" and the
> > > > > > > > userspace event listener was notified via flag whether modifying
> > > > > > > > the content of the file was allowed or not.
> > > > > > > >
> > > > > > > > I do not care so much about HSM being able to fill content of files
> > > > > > > > from a nested context like this, but we do need some way for
> > > > > > > > userspace to at least deny this access to a file with no content.
> > > > > > > >
> > > > > > > > Another possibility I thought of is to change file_start_write()
> > > > > > > > > to use file_start_write_trylock() for files with FMODE_NONOTIFY.
> > > > > > > > This should make it safe to fill file content when event is generated
> > > > > > > > with sb_writers held (if freeze is in progress modification will fail).
> > > > > > > > Right?
> > > > > > >
> > > > > > > OK, so you mean that the HSM managing application will get an fd with
> > > > > > > FMODE_NONOTIFY set from an event and use it for filling in the file
> > > > > > > contents and the kernel functions grabbing freeze protection will detect
> > > > > > > the file flag and bail with error instead of waiting? That sounds like an
> > > > > > > attractive solution - the HSM managing app could even reply with error like
> > > > > > > ERESTARTSYS to fanotify event and make the syscall restart (which will
> > > > > > > block until the fs is unfrozen and then we can try again) and thus handle
> > > > > > > the whole problem transparently for the application generating the event.
> > > > > > > But I'm just dreaming now, for start it would be fine to just fail the
> > > > > > > syscall.
> > > > > > >
> > > > > >
> > > > > > IMO, a temporary error from an HSM controlled fs is not a big deal.
> > > > > > Same as a temporary error from a network fs or FUSE - should be
> > > > > > tolerable when the endpoint is not connected.
> > > > > > One of my patches allows HSM returning an error that is not EPERM as
> > > > > > response - this can be useful in such situations.
> > > >
> > > > OK.
> > > >
> > > > > > > I see only three possible problems with the solution. Firstly, the HSM
> > > > > > > application will have to be careful to only access the managed filesystem
> > > > > > > with the fd returned from fanotify event as otherwise it could deadlock on
> > > > > > > frozen filesystem.
> > > > > >
> > > > > > Isn't that already the case to some extent?
> > > > > > It is not wise for permission event handlers to perform operations
> > > > > > on fd without  FMODE_NONOTIFY.
> > > >
> > > > Yes, it isn't a new problem. The amount of bug reports in our bugzilla
> > > > boiling down to this kind of self-deadlock just shows that fanotify users
> > > > get this wrong all the time.
> > > >
> > > > > > > That may seem obvious but practice shows that with
> > > > > > > complex software stacks with many dependencies, this is far from trivial.
> > > > > >
> > > > > > It will be especially important when we have permission events
> > > > > > on directory operations that need to perform operations on O_PATH
> > > > > > dirfd with FMODE_NONOTIFY.
> > > > > >
> > > > > > > Secondly, conditioning the trylock behavior on FMODE_NONOTIFY seems
> > > > > > > > somewhat arbitrary unless you understand our implementation issues and
> > > > > > > possibly it could regress current unsuspecting users. So I'm thinking
> > > > > > > whether we shouldn't rather have an explicit open flag requiring erroring
> > > > > > > out on frozen filesystem instead of blocking and the HSM application will
> > > > > > > need to use it to evade freezing deadlocks. Or we can just depend on
> > > > > > > RWF_NOWAIT flag (we currently block on frozen filesystem despite this flag
> > > > > > > but that can be viewed as a bug) but that's limited to writes (i.e., no way
> > > > > > > to e.g. do fallocate(2) without blocking on frozen fs).
> > > > > >
> > > > > > User cannot ask for fd with FMODE_NONOTIFY as it is - this is provided
> > > > > > as a means to an end by fanotify - so it would not be much different if
> > > > > > the new events would provide an fd with FMODE_NONOTIFY |
> > > > > > FMODE_NOWAIT. It will be up to documentation to say what is and what
> > > > > > is not allowed with the event->fd provided by fanotify.
> > > > > >
> > > > >
> > > > > This part needs clarifying.
> > > > > Technically, we can use the flag FMODE_NOWAIT to prevent waiting in
> > > > > file_start_write() *when* it is combined with FMODE_NONOTIFY.
> > > > >
> > > > > Yes, it would be a change of behavior, but I think it would be a good change,
> > > > > because current event->fd from FAN_ACCESS_PERM events is really not
> > > > > write-safe (could deadlock with freezing fs).
> > > >
> > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > current fanotify users need to also do IO on other files in order to handle
> > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > me. The question is whether this is enough or not.
> > > >
> > >
> > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > to a file is not the only thing that HSM needs to do.
> > > Eventually, event handler for lookup permission events should be
> > > able to also create files without blocking on vfs level freeze protection.
> >
> > So this is what I wanted to clarify. The lookup permission event never gets
> > called under a freeze protection so the deadlock doesn't exist there. In
> > principle the problem exists only for access and modify events where we'd
> > be filling in file data and thus RWF_NOWAIT could be enough.
>
> Yes, you are right.
> It is possible that RWF_NOWAIT could be enough.
>
> But the discovery of the loop/ovl corner cases has shaken my
> confidence in the ability to guarantee that freeze protection is not
> held somehow indirectly.
>
> If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> same ovl corner case, because with splice from ovl1 to fs1,
> fs1 freeze protection is held and:
>   ovl_splice_read(ovl1.file)
>     ovl_real_fdget()
>       ovl_open_realfile(fs1.file)
>          ... security_file_open(fs1.file)
>
> > That being
> > said I understand this may be assuming too much about the implementations
> > of HSM daemons and as you write, we might want to provide a way to do IO
> > not blocking on freeze protection from any hook. But I wanted to point this
> > out explicitly so that it's a conscious decision.
> >

I agree, and I'd like to explain with an example why RWF_NOWAIT is
not enough for HSM needs.

The reason is that often, when HSM needs to handle filling content
in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
HSM needs to be able to avoid blocking on freeze protection
for any operations on the filesystem, not just pwrite().

For example, the POC HSM code [1] stores the DATA_DIR_fd
from the lookup event and uses it in the handling of access events to
update the metadata files that store which parts of the file were already
filled (relying on fiemap is not always a valid option).

That is the reason that in the POC patches [2], FMODE_NONOTIFY
is propagated from dirfd to an fd opened with openat(dirfd, ...), so
HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.

Another use case is that HSM may want to download content to a
temp file on the same filesystem, verify the downloaded content and
then clone the data into the accessed file range.
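
As a rough sketch of the last step of that flow, the clone could be done
with the FICLONERANGE ioctl. The helper name hsm_clone_range() is
hypothetical, the download and verify steps are left out, and the sketch
assumes the temp file and the accessed file live on the same
reflink-capable filesystem with a block-aligned range.

```c
#include <assert.h>
#include <errno.h>
#include <linux/fs.h>
#include <sys/ioctl.h>

/*
 * Share len bytes at offset off from tmp_fd into dst_fd at the same
 * offset. Returns 0 on success or a positive errno value.
 */
static int hsm_clone_range(int tmp_fd, int dst_fd,
			   unsigned long long off, unsigned long long len)
{
	struct file_clone_range arg = {
		.src_fd = tmp_fd,
		.src_offset = off,
		.src_length = len,
		.dest_offset = off,
	};

	/* The ioctl is issued on the destination file. */
	if (ioctl(dst_fd, FICLONERANGE, &arg) < 0)
		return errno;
	return 0;
}
```

The point of the example is that the clone is a modification of the
watched filesystem that does not go through pwrite(), so a fix that only
covers RWF_NOWAIT writes would not cover it.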

I think that a PF_ flag (see below) would work best for all those cases.

> > > In theory, I am not saying we should do it, but as a thought experiment:
> > > if the requirement from permission event handler is that it must use an
> > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > any filesystem modifications, then instead of aiming for NOWAIT
> > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > SB_FREEZE_FSNOTIFY between
> > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > >
> > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > a model that can solve the deadlock correctly.
> >
> > OK, yes, in principle another freeze level which could be used by handlers
> > of fanotify permission events would solve the deadlock as well. Just you
> > seem to like to tie this functionality to the particular fd returned from
> > fanotify and I'm not convinced that is a good idea. What if the application
> > needs to do write to some other location besides the one fd it got passed
> > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > first access to any file in a subtree. Or maybe it wants to write to some
> > DB file containing current state or something like that.
> >
> > One solution I can imagine is to create an open flag that can be specified
> > on open which would result in the special behavior wrt fs freezing. If the
> > special behavior would be just trylocking the freeze protection then it
> > would be really easy. If the behaviour would be another freeze protection
> > level, then we'd need to make sure we don't generate another fanotify
> > permission event with such fd - autorejecting any such access is an obvious
> > solution but I'm not sure if practical for applications.
> >
>
> I had also considered marking the listener process with the FSNOTIFY
> context and enforcing this context on fanotify_read().
> In a way, this is similar to the NOIO and NOFS process context.
> It could be used to both act as a stronger form of FMODE_NONOTIFY
> and to activate the desired freeze protection behavior
> (whether trylock or SB_FREEZE_FSNOTIFY level).
>

My feeling is that the best approach would be a PF_NOWAIT task flag:

- PF_NOWAIT will prevent blocking on freeze protection
- PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
- PF_NOWAIT could be auto-set on the reader of a permission event
- PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
- We could add user API to set this personality explicitly to any task
- PF_NOWAIT without FMODE_NONOTIFY denies permission events
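
The rules above can be modeled as a small userspace sketch. This is only
my reading of the list, not kernel code: every MODEL_* name and model_*()
function is hypothetical, the flag values are arbitrary, and the choice of
-EAGAIN and -EPERM as return codes is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* All MODEL_* names are illustrative stand-ins, not kernel definitions. */
#define MODEL_PF_NOWAIT      0x1u	/* proposed task flag */
#define MODEL_FMODE_NOWAIT   0x1u	/* file supports non-blocking IO */
#define MODEL_FMODE_NONOTIFY 0x2u	/* file does not generate events */

/* Marker meaning "the caller would sleep until the fs is thawed". */
#define MODEL_WOULD_SLEEP 1

/* Rule 1: a PF_NOWAIT task fails instead of waiting on a frozen fs. */
static int model_start_write(unsigned int task_flags, bool frozen)
{
	if (!frozen)
		return 0;
	if (task_flags & MODEL_PF_NOWAIT)
		return -EAGAIN;
	return MODEL_WOULD_SLEEP;
}

/* Rule 2: PF_NOWAIT on a FMODE_NOWAIT file behaves like RWF_NOWAIT. */
static bool model_implies_rwf_nowait(unsigned int task_flags,
				     unsigned int f_mode)
{
	return (task_flags & MODEL_PF_NOWAIT) && (f_mode & MODEL_FMODE_NOWAIT);
}

/* Rule 6: a PF_NOWAIT task may not trigger nested permission events. */
static int model_permission_hook(unsigned int task_flags, unsigned int f_mode)
{
	if ((task_flags & MODEL_PF_NOWAIT) && !(f_mode & MODEL_FMODE_NONOTIFY))
		return -EPERM;
	return 0;
}
```

The three rules that describe how PF_NOWAIT gets set (auto-set on the
event reader, set at group init, explicit user API) are not modeled here.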

Please let me know if you agree with this design and if so,
which of the methods to set PF_NOWAIT are a must for the first version
in your opinion?

Do you think we should use this method to fix the existing deadlocks
with FAN_OPEN_PERM and FAN_ACCESS_PERM, without opt-in?

[...]

> > > OK. ATM, the only solution I can think of that is both maintainable
> > > and lets HSM live in complete harmony with fsfreeze is adding the
> > > extra SB_FREEZE_FSNOTIFY level.
> >
> > To make things clear: if the only problems would be with those sendfile(2)
> > rare corner-cases, then I guess we can live with that and implement retry
> > in the kernel if userspace ever complains about unexpected short copy or
> > EAGAIN...  The problem I see is that if we advise that all IO from the
> > fanotify event handler should happen in the freeze-safe manner, then with
> > the non-blocking solution all HSM IO suddenly starts failing as soon as
> > the filesystem is frozen. And that is IMHO not nice.
>
> I see what you mean. The SB_FREEZE_FSNOTIFY design is much more
> clear in that respect.
>
> > > I am not sure how big of an overhead that would be?
> > > I imagine that sb_writers is large enough as it is w.r.t fitting into
> > > cache lines?
> > > I don't think that it adds much complexity or maintenance burden
> > > to vfs?? I'm really not sure.
> >
> > Well, the overhead is effectively one percpu counter per superblock.
> > Negligible in terms of CPU time, somewhat annoying in terms of memory but
> > bearable. So this may be a way forward.
> >
>

My feeling is that, because we only need this to handle very obscure
corner cases, adding an extra freeze level is overkill that cannot be
justified, even if the actual impact on CPU and memory is rather low.

The HSM API documentation will clearly state that EAGAIN may be
expected when writing to the filesystem.

IMO, for all practical matters, it is perfectly fine if HSM just denies
access in those corner cases, but even a simple solution of triggering
async download of the file's content and returning a temporary error to
the user is a decent solution for the rare corner cases.

FYI, I've already gotten requests from people in the community that
are waiting for this feature and are testing the POC patches,
so my plan is to send out the permission hooks cleanup patches [3]
soon and try to get the first part of the HSM API [4]
(FAN_PRE_ACCESS and FAN_PRE_MODIFY permission events)
ready for the next cycle.

In any case, permission hooks cleanup patches are independent
of the solution we will choose for the corner cases that they do
not handle.

Thanks,
Amir.

[1] https://github.com/amir73il/httpdirfs/commits/fan_lookup_perm
[2] https://github.com/amir73il/linux/commits/fan_lookup_perm
[3] https://github.com/amir73il/linux/commits/start-write-safe
[4] https://github.com/amir73il/linux/commits/fan_pre_content

* Re: fanotify HSM open issues
  2023-11-13 11:50                           ` Amir Goldstein
@ 2023-11-20 14:06                             ` Jan Kara
  2023-11-20 16:59                               ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2023-11-20 14:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

Hi Amir,

sorry for a bit delayed reply, I did not get to "swapping in" HSM
discussion during the Plumbers conference :)

On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > Recap for new people joining this thread.
> > > >
> > > > The following deadlock is possible in upstream kernel
> > > > if fanotify permission event handler tries to make
> > > > modifications to the filesystem it is watching in the context
> > > > of FAN_ACCESS_PERM handling in some cases:
> > > >
> > > > P1                             P2                      P3
> > > > -----------                    ------------            ------------
> > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > -> sb_start_write(fs1.sb)
> > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > >       -> security_file_permission()
> > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > >                                    -> sb_start_write(fs1.sb) ......
> > > >
> > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > deadlock and prepare the ground for a new set of permission events
> > > > with cleaner/safer semantics.
> > > >
> > > > The cases described above of sendfile from a file in loop mounted
> > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > deadlock despite the start-write-safe patches [1].
> > >
> > > Yep, nice summary.
...
> > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > me. The question is whether this is enough or not.
> > > > >
> > > >
> > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > to a file is not the only thing that HSM needs to do.
> > > > Eventually, event handler for lookup permission events should be
> > > > able to also create files without blocking on vfs level freeze protection.
> > >
> > > So this is what I wanted to clarify. The lookup permission event never gets
> > > called under a freeze protection so the deadlock doesn't exist there. In
> > > principle the problem exists only for access and modify events where we'd
> > > be filling in file data and thus RWF_NOWAIT could be enough.
> >
> > Yes, you are right.
> > It is possible that RWF_NOWAIT could be enough.
> >
> > But the discovery of the loop/ovl corner cases has shaken my
> > confidence in the ability to guarantee that freeze protection is not
> > held somehow indirectly.
> >
> > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > same ovl corner case, because with splice from ovl1 to fs1,
> > fs1 freeze protection is held and:
> >   ovl_splice_read(ovl1.file)
> >     ovl_real_fdget()
> >       ovl_open_realfile(fs1.file)
> >          ... security_file_open(fs1.file)
> >
> > > That being
> > > said I understand this may be assuming too much about the implementations
> > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > not blocking on freeze protection from any hook. But I wanted to point this
> > > out explicitly so that it's a conscious decision.
> > >
> 
> I agree and I'd like to explain using an example, why RWF_NOWAIT is
> not enough for HSM needs.
> 
> The reason is that often, when HSM needs to handle filling content
> in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> HSM needs to be able to avoid blocking on freeze protection
> for any operations on the filesystem, not just pwrite().
> 
> For example, the POC HSM code [1], stores the DATA_DIR_fd
> from the lookup event and uses it in the handling of access events to
> update the metadata files that store which parts of the file were already
> filled (relying on fiemap is not always a valid option).
> 
> That is the reason that in the POC patches [2], FMODE_NONOTIFY
> is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> 
> Another use case is that HSM may want to download content to a
> temp file on the same filesystem, verify the downloaded content and
> then clone the data into the accessed file range.
> 
> I think that a PF_ flag (see below) would work best for all those cases.

Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
enough for all sensible usecases to avoid deadlocks with freezing. However
note that if we want to really properly handle all possible operations, we
need to start handling error from all sb_start_write() and
file_start_write() calls and there are quite a few of those.

> > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > if the requirement from permission event handler is that it must use a
> > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > SB_FREEZE_FSNOTIFY between
> > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > >
> > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > a model that can solve the deadlock correctly.
> > >
> > > OK, yes, in principle another freeze level which could be used by handlers
> > > of fanotify permission events would solve the deadlock as well. Just you
> > > seem to like to tie this functionality to the particular fd returned from
> > > fanotify and I'm not convinced that is a good idea. What if the application
> > > needs to do write to some other location besides the one fd it got passed
> > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > first access to any file in a subtree. Or maybe it wants to write to some
> > > DB file containing current state or something like that.
> > >
> > > One solution I can imagine is to create an open flag that can be specified
> > > on open which would result in the special behavior wrt fs freezing. If the
> > > special behavior would be just trylocking the freeze protection then it
> > > would be really easy. If the behaviour would be another freeze protection
> > > level, then we'd need to make sure we don't generate another fanotify
> > > permission event with such fd - autorejecting any such access is an obvious
> > > solution but I'm not sure if practical for applications.
> > >
> >
> > I had also considered marking the listener process with the FSNOTIFY
> > context and enforcing this context on fanotify_read().
> > In a way, this is similar to the NOIO and NOFS process context.
> > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > and to activate the desired freeze protection behavior
> > (whether trylock or SB_FREEZE_FSNOTIFY level).
> >
> 
> My feeling is that the best approach would be a PF_NOWAIT task flag:
> 
> - PF_NOWAIT will prevent blocking on freeze protection
> - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> - PF_NOWAIT could be auto-set on the reader of a permission event
> - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> - We could add user API to set this personality explicitly to any task
> - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> 
> Please let me know if you agree with this design and if so,
> which of the methods to set PF_NOWAIT are a must for the first version
> in your opinion?

Yeah, the PF flag could work. It can be set for the process(es) responsible
for processing the fanotify events and filling in filesystem contents. I
don't think automatic setting of this flag is desirable though as it has
quite wide impact and some of the consequences could be surprising.  I
rather think it should be a conscious decision when setting up the process
processing the events. So I think API to explicitly set / clear the flag
would be the best. Also I think it would be better to capture in the name
that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
something like that?

Also we were thinking about having an open(2) flag for this (instead of PF
flag) in the past. That would allow finer granularity control of the
behavior but I guess you are worried that it would not cover all the needed
operations?

> Do you think we should use this method to fix the existing deadlocks
> with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?

No, I think if someone cares about these, they should explicitly set the
PF flag in their task processing the events.

> > > > OK. ATM, the only solution I can think of that is both maintainable
> > > > and lets HSM live in complete harmony with fsfreeze is adding the
> > > > extra SB_FREEZE_FSNOTIFY level.
> > >
> > > To make things clear: if the only problems would be with those sendfile(2)
> > > rare corner-cases, then I guess we can live with that and implement retry
> > > in the kernel if userspace ever complains about unexpected short copy or
> > > EAGAIN...  The problem I see is that if we advise that all IO from the
> > > fanotify event handler should happen in the freeze-safe manner, then with
> > > the non-blocking solution all HSM IO suddenly starts failing as soon as
> > > the filesystem is frozen. And that is IMHO not nice.
> >
> > I see what you mean. The SB_FREEZE_FSNOTIFY design is much more
> > clear in that respect.
> >
> > > > I am not sure how big of an overhead that would be?
> > > > I imagine that sb_writers is large enough as it is w.r.t fitting into
> > > > cache lines?
> > > > I don't think that it adds much complexity or maintenance burden
> > > > to vfs?? I'm really not sure.
> > >
> > > Well, the overhead is effectively one percpu counter per superblock.
> > > Negligible in terms of CPU time, somewhat annoying in terms of memory but
> > > bearable. So this may be a way forward.
> 
> My feeling is that, because we only need this to handle very obscure
> corner cases, adding an extra freeze level is overkill that cannot be
> justified, even if the actual impact on CPU and memory is rather low.
> 
> The HSM API documentation will clearly state that EAGAIN may be
> expected when writing to the filesystem.
> 
> IMO, for all practical matters, it is perfectly fine if HSM just denies
> access in those corner cases, but even a simple solution of triggering
> async download of the file's content and returning a temporary error to
> the user is a decent solution for the rare corner cases.

Yeah, I guess returning EAGAIN to userspace in these corner cases might be
acceptable. It won't be 100% compatible with current filesystem behavior in
case the fs is frozen but close enough.
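
As a rough sketch of the "HSM just denies access" fallback on the daemon
side, the following uses the existing fanotify permission response
interface (struct fanotify_response with FAN_ALLOW / FAN_DENY from
<sys/fanotify.h>). The helper name hsm_respond() and its 'filled'
parameter are hypothetical; the assumption is that the handler sets
'filled' to false when its own IO failed, for example with EAGAIN on a
frozen filesystem.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/fanotify.h>
#include <unistd.h>

/*
 * Answer one permission event on the fanotify group fd.
 * 'filled' is false when the handler could not fill the file content,
 * in which case access is denied instead of blocking the accessor.
 * Returns 0 on success or a negative errno.
 */
static int hsm_respond(int fanotify_fd, int event_fd, bool filled)
{
	struct fanotify_response resp = {
		.fd = event_fd,
		.response = filled ? FAN_ALLOW : FAN_DENY,
	};
	ssize_t n = write(fanotify_fd, &resp, sizeof(resp));

	if (n < 0)
		return -errno;
	return n == (ssize_t)sizeof(resp) ? 0 : -EIO;
}
```

A daemon that prefers the async download variant could start the
download before calling hsm_respond(fd, event_fd, false), so that a
later retry by the application finds the content in place.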

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: fanotify HSM open issues
  2023-11-20 14:06                             ` Jan Kara
@ 2023-11-20 16:59                               ` Amir Goldstein
  2023-11-27 13:56                                 ` Christian Brauner
  2023-11-27 19:11                                 ` Josef Bacik
  0 siblings, 2 replies; 19+ messages in thread
From: Amir Goldstein @ 2023-11-20 16:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Christian Brauner, Jens Axboe, linux-fsdevel

On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir,
>
> sorry for a bit delayed reply, I did not get to "swapping in" HSM
> discussion during the Plumbers conference :)
>
> On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > Recap for new people joining this thread.
> > > > >
> > > > > The following deadlock is possible in upstream kernel
> > > > > if fanotify permission event handler tries to make
> > > > > modifications to the filesystem it is watching in the context
> > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > >
> > > > > P1                             P2                      P3
> > > > > -----------                    ------------            ------------
> > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > -> sb_start_write(fs1.sb)
> > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > >       -> security_file_permission()
> > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > >
> > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > deadlock and prepare the ground for a new set of permission events
> > > > > with cleaner/safer semantics.
> > > > >
> > > > > The cases described above of sendfile from a file in loop mounted
> > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > deadlock despite the start-write-safe patches [1].
> > > >
> > > > Yep, nice summary.
> ...
> > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > me. The question is whether this is enough or not.
> > > > > >
> > > > >
> > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > to a file is not the only thing that HSM needs to do.
> > > > > Eventually, event handler for lookup permission events should be
> > > > > able to also create files without blocking on vfs level freeze protection.
> > > >
> > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > principle the problem exists only for access and modify events where we'd
> > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > >
> > > Yes, you are right.
> > > It is possible that RWF_NOWAIT could be enough.
> > >
> > > But the discovery of the loop/ovl corner cases has shaken my
> > > confidence in the ability to guarantee that freeze protection is not
> > > held somehow indirectly.
> > >
> > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > same ovl corner case, because with splice from ovl1 to fs1,
> > > fs1 freeze protection is held and:
> > >   ovl_splice_read(ovl1.file)
> > >     ovl_real_fdget()
> > >       ovl_open_realfile(fs1.file)
> > >          ... security_file_open(fs1.file)
> > >
> > > > That being
> > > > said I understand this may be assuming too much about the implementations
> > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > out explicitly so that it's a conscious decision.
> > > >
> >
> > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > not enough for HSM needs.
> >
> > The reason is that often, when HSM needs to handle filling content
> > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > HSM needs to be able to avoid blocking on freeze protection
> > for any operations on the filesystem, not just pwrite().
> >
> > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > from the lookup event and uses it in the handling of access events to
> > update the metadata files that store which parts of the file were already
> > filled (relying on fiemap is not always a valid option).
> >
> > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> >
> > Another use case is that HSM may want to download content to a
> > temp file on the same filesystem, verify the downloaded content and
> > then clone the data into the accessed file range.
> >
> > I think that a PF_ flag (see below) would work best for all those cases.
>
> Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> enough for all sensible usecases to avoid deadlocks with freezing. However
> note that if we want to really properly handle all possible operations, we
> need to start handling error from all sb_start_write() and
> file_start_write() calls and there are quite a few of those.
>

Darn, forgot about those.
I am starting to reconsider adding a freeze level.
I cannot shake the feeling that there is a simpler solution that escapes us...
Maybe fs anti-freeze (see below).

> > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > if the requirement from permission event handler is that it must use a
> > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > SB_FREEZE_FSNOTIFY between
> > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > >
> > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > a model that can solve the deadlock correctly.
> > > >
> > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > seem to like to tie this functionality to the particular fd returned from
> > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > needs to do write to some other location besides the one fd it got passed
> > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > DB file containing current state or something like that.
> > > >
> > > > One solution I can imagine is to create an open flag that can be specified
> > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > special behavior would be just trylocking the freeze protection then it
> > > > would be really easy. If the behaviour would be another freeze protection
> > > > level, then we'd need to make sure we don't generate another fanotify
> > > > permission event with such fd - autorejecting any such access is an obvious
> > > > solution but I'm not sure if practical for applications.
> > > >
> > >
> > > I had also considered marking the listener process with the FSNOTIFY
> > > context and enforcing this context on fanotify_read().
> > > In a way, this is similar to the NOIO and NOFS process context.
> > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > and to activate the desired freeze protection behavior
> > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > >
> >
> > My feeling is that the best approach would be a PF_NOWAIT task flag:
> >
> > - PF_NOWAIT will prevent blocking on freeze protection
> > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > - PF_NOWAIT could be auto-set on the reader of a permission event
> > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > - We could add user API to set this personality explicitly to any task
> > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> >
> > Please let me know if you agree with this design and if so,
> > which of the methods to set PF_NOWAIT are a must for the first version
> > in your opinion?
>
> Yeah, the PF flag could work. It can be set for the process(es) responsible
> for processing the fanotify events and filling in filesystem contents. I
> don't think automatic setting of this flag is desirable though as it has
> quite wide impact and some of the consequences could be surprising.  I
> rather think it should be a conscious decision when setting up the process
> processing the events. So I think API to explicitly set / clear the flag
> would be the best. Also I think it would be better to capture in the name
> that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> something like that?
>

Sure.

> Also we were thinking about having an open(2) flag for this (instead of PF
> flag) in the past. That would allow finer granularity control of the
> behavior but I guess you are worried that it would not cover all the needed
> operations?
>

Yeah, it seems like an API that is going to be harder to write safe HSM
programs with.

> > Do you think we should use this method to fix the existing deadlocks
> > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
>
> No, I think if someone cares about these, they should explicitly set the
> PF flag in their task processing the events.
>

OK.

I see an exit hatch in this statement -
If we are going to leave the responsibility to avoid deadlock in corner
cases completely in the hands of the application, then I do not feel
morally obligated to create the PF_NOWAIT_FREEZE API *before*
providing the first HSM API.

If the HSM application is running in a controlled system, on a filesystem
where fsfreeze is not expected or not needed, then a fully functional and
safe HSM does not require PF_NOWAIT_FREEZE API.

Perhaps an API to make an fs unfreezable is just as practical and a much
easier option for the first version of HSM API?

Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
ioctl. Then, for as long as the fd is open, no task can freeze the fs
except the HSM itself using this fd.

HSM itself can avoid deadlocks if it coordinates fs freezes with
the fs modifications it makes from within HSM events.
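
For reference, this is roughly what the freeze side of such an HSM would
look like with the interface that exists today, the FIFREEZE and FITHAW
ioctls from <linux/fs.h>. The proposed EXCLUSIVE_FSFREEZER ioctl is not
shown because it does not exist; hsm_freeze() and hsm_thaw() are
hypothetical wrapper names, and the sketch assumes mnt_fd is an open fd
on the filesystem and that the caller has the privileges freezing
requires.

```c
#include <errno.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */
#include <sys/ioctl.h>

/* Freeze the filesystem that mnt_fd lives on; 0 or a negative errno. */
static int hsm_freeze(int mnt_fd)
{
	return ioctl(mnt_fd, FIFREEZE, 0) < 0 ? -errno : 0;
}

/* Thaw a filesystem previously frozen through hsm_freeze(). */
static int hsm_thaw(int mnt_fd)
{
	return ioctl(mnt_fd, FITHAW, 0) < 0 ? -errno : 0;
}
```

The coordination itself (making sure no event handler is in the middle
of a modification when hsm_freeze() is called) would be internal to the
HSM and is left out here.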

Do you think that may be an acceptable way out of the corner?

Thanks,
Amir.

* Re: fanotify HSM open issues
  2023-11-20 16:59                               ` Amir Goldstein
@ 2023-11-27 13:56                                 ` Christian Brauner
  2023-11-27 14:48                                   ` Amir Goldstein
  2023-11-27 19:11                                 ` Josef Bacik
  1 sibling, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2023-11-27 13:56 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Miklos Szeredi, Jens Axboe, linux-fsdevel

On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hi Amir,
> >
> > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > discussion during the Plumbers conference :)
> >
> > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > Recap for new people joining this thread.
> > > > > >
> > > > > > The following deadlock is possible in upstream kernel
> > > > > > if fanotify permission event handler tries to make
> > > > > > modifications to the filesystem it is watching in the context
> > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > >
> > > > > > P1                             P2                      P3
> > > > > > -----------                    ------------            ------------
> > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > -> sb_start_write(fs1.sb)
> > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > >       -> security_file_permission()
> > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > >
> > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > with cleaner/safer semantics.
> > > > > >
> > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > deadlock despite the start-write-safe patches [1].
> > > > >
> > > > > Yep, nice summary.
> > ...
> > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > me. The question is whether this is enough or not.
> > > > > > >
> > > > > >
> > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > Eventually, event handler for lookup permission events should be
> > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > >
> > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > principle the problem exists only for access and modify events where we'd
> > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > >
> > > > Yes, you are right.
> > > > It is possible that RWF_NOWAIT could be enough.
> > > >
> > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > confidence in the ability to guarantee that freeze protection is not
> > > > held somehow indirectly.
> > > >
> > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > fs1 freeze protection is held and:
> > > >   ovl_splice_read(ovl1.file)
> > > >     ovl_real_fdget()
> > > >       ovl_open_realfile(fs1.file)
> > > >          ... security_file_open(fs1.file)
> > > >
> > > > > That being
> > > > > said I understand this may be assuming too much about the implementations
> > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > out explicitly so that it's a conscious decision.
> > > > >
> > >
> > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > not enough for HSM needs.
> > >
> > > The reason is that often, when HSM needs to handle filling content
> > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > HSM needs to be able to avoid blocking on freeze protection
> > > for any operations on the filesystem, not just pwrite().
> > >
> > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > from the lookup event and uses it in the handling of access events to
> > > update the metadata files that store which parts of the file were already
> > > filled (relying on fiemap is not always a valid option).
> > >
> > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > >
> > > Another use case is that HSM may want to download content to a
> > > temp file on the same filesystem, verify the downloaded content and
> > > then clone the data into the accessed file range.
> > >
> > > I think that a PF_ flag (see below) would work best for all those cases.
> >
> > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > enough for all sensible usecases to avoid deadlocks with freezing. However
> > note that if we want to really properly handle all possible operations, we
> > need to start handling error from all sb_start_write() and
> > file_start_write() calls and there are quite a few of those.
> >
> 
> Darn, forgot about those.
> I am starting to reconsider adding a freeze level.
> I cannot shake the feeling that there is a simpler solution that escapes us...
> Maybe fs anti-freeze (see below).
> 
> > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > if the requirement from permission event handler is that it must use a
> > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > >
> > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > a model that can solve the deadlock correctly.
> > > > >
> > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > needs to write to some other location besides the one fd it got passed
> > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > DB file containing current state or something like that.
> > > > >
> > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > special behavior would be just trylocking the freeze protection then it
> > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > solution but I'm not sure if practical for applications.
> > > > >
> > > >
> > > > I had also considered marking the listener process with the FSNOTIFY
> > > > context and enforcing this context on fanotify_read().
> > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > and to activate the desired freeze protection behavior
> > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > >
> > >
> > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > >
> > > - PF_NOWAIT will prevent blocking on freeze protection
> > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > - We could add user API to set this personality explicitly to any task
> > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > >
> > > Please let me know if you agree with this design and if so,
> > > which of the methods to set PF_NOWAIT are a must for the first version
> > > in your opinion?
> >
> > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > for processing the fanotify events and filling in filesystem contents. I
> > don't think automatic setting of this flag is desirable though as it has
> > quite wide impact and some of the consequences could be surprising.  I
> > rather think it should be a conscious decision when setting up the process
> > processing the events. So I think API to explicitly set / clear the flag
> > would be the best. Also I think it would be better to capture in the name
> > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > something like that?
> >
> 
> Sure.
> 
> > Also we were thinking about having an open(2) flag for this (instead of PF
> > flag) in the past. That would allow finer granularity control of the
> > behavior but I guess you are worried that it would not cover all the needed
> > operations?
> >
> 
> Yeh, it seems like an API that is going to be harder to write safe HSM
> programs with.
> 
> > > Do you think we should use this method to fix the existing deadlocks
> > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> >
> > No, I think if someone cares about these, they should explicitly set the
> > PF flag in their task processing the events.
> >
> 
> OK.
> 
> I see an exit hatch in this statement -
> If we are going to leave the responsibility to avoid deadlock in corner
> cases completely in the hands of the application, then I do not feel
> morally obligated to create the PF_NOWAIT_FREEZE API *before*
> providing the first HSM API.
> 
> If the HSM application is running in a controlled system, on a filesystem
> where fsfreeze is not expected or not needed, then a fully functional and
> safe HSM does not require PF_NOWAIT_FREEZE API.
> 
> Perhaps an API to make an fs unfreezable is just as practical and a much
> easier option for the first version of HSM API?
> 
> Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> ioctl. Then no other task can freeze the fs, for as long as the fd is open
> apart from the HSM itself using this fd.

This would mean you also prevent FREEZE_HOLDER_KERNEL requests which xfs
uses for filesystem scrubbing iirc. I would reckon that you also run
into problems with device mapper workloads where freeze/thaw requests
from the block layer and into the filesystem layer are quite common.

Have you given any thought to the idea - similar to a FUSE daemon - that
you could register with a given filesystem as an HSM? Maybe integration
like this is really undesirable for some reason but that may be an
alternative.


* Re: fanotify HSM open issues
  2023-11-27 13:56                                 ` Christian Brauner
@ 2023-11-27 14:48                                   ` Amir Goldstein
  2023-11-27 14:57                                     ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2023-11-27 14:48 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, Miklos Szeredi, Jens Axboe, linux-fsdevel

On Mon, Nov 27, 2023 at 3:56 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Hi Amir,
> > >
> > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > discussion during the Plumbers conference :)
> > >
> > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > Recap for new people joining this thread.
> > > > > > >
> > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > if fanotify permission event handler tries to make
> > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > >
> > > > > > > P1                             P2                      P3
> > > > > > > -----------                    ------------            ------------
> > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > -> sb_start_write(fs1.sb)
> > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > >       -> security_file_permission()
> > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > >
> > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > with cleaner/safer semantics.
> > > > > > >
> > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > >
> > > > > > Yep, nice summary.
> > > ...
> > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > me. The question is whether this is enough or not.
> > > > > > > >
> > > > > > >
> > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > >
> > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > >
> > > > > Yes, you are right.
> > > > > It is possible that RWF_NOWAIT could be enough.
> > > > >
> > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > held somehow indirectly.
> > > > >
> > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > fs1 freeze protection is held and:
> > > > >   ovl_splice_read(ovl1.file)
> > > > >     ovl_real_fdget()
> > > > >       ovl_open_realfile(fs1.file)
> > > > >          ... security_file_open(fs1.file)
> > > > >
> > > > > > That being
> > > > > > said I understand this may be assuming too much about the implementations
> > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > out explicitly so that it's a conscious decision.
> > > > > >
> > > >
> > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > not enough for HSM needs.
> > > >
> > > > The reason is that often, when HSM needs to handle filling content
> > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > HSM needs to be able to avoid blocking on freeze protection
> > > > for any operations on the filesystem, not just pwrite().
> > > >
> > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > from the lookup event and uses it in the handling of access events to
> > > > update the metadata files that store which parts of the file were already
> > > > > filled (relying on fiemap is not always a valid option).
> > > >
> > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > >
> > > > Another use case is that HSM may want to download content to a
> > > > temp file on the same filesystem, verify the downloaded content and
> > > > then clone the data into the accessed file range.
> > > >
> > > > I think that a PF_ flag (see below) would work best for all those cases.
> > >
> > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > note that if we want to really properly handle all possible operations, we
> > > need to start handling error from all sb_start_write() and
> > > file_start_write() calls and there are quite a few of those.
> > >
> >
> > Darn, forgot about those.
> > I am starting to reconsider adding a freeze level.
> > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > Maybe fs anti-freeze (see below).
> >
> > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > if the requirement from permission event handler is that it must use a
> > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > >
> > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > a model that can solve the deadlock correctly.
> > > > > >
> > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > needs to write to some other location besides the one fd it got passed
> > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > DB file containing current state or something like that.
> > > > > >
> > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > solution but I'm not sure if practical for applications.
> > > > > >
> > > > >
> > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > context and enforcing this context on fanotify_read().
> > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > and to activate the desired freeze protection behavior
> > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > >
> > > >
> > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > >
> > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > - We could add user API to set this personality explicitly to any task
> > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > >
> > > > Please let me know if you agree with this design and if so,
> > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > in your opinion?
> > >
> > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > for processing the fanotify events and filling in filesystem contents. I
> > > don't think automatic setting of this flag is desirable though as it has
> > > quite wide impact and some of the consequences could be surprising.  I
> > > rather think it should be a conscious decision when setting up the process
> > > processing the events. So I think API to explicitly set / clear the flag
> > > would be the best. Also I think it would be better to capture in the name
> > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > something like that?
> > >
> >
> > Sure.
> >
> > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > flag) in the past. That would allow finer granularity control of the
> > > behavior but I guess you are worried that it would not cover all the needed
> > > operations?
> > >
> >
> > Yeh, it seems like an API that is going to be harder to write safe HSM
> > programs with.
> >
> > > > Do you think we should use this method to fix the existing deadlocks
> > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > >
> > > No, I think if someone cares about these, they should explicitly set the
> > > PF flag in their task processing the events.
> > >
> >
> > OK.
> >
> > I see an exit hatch in this statement -
> > If we are going to leave the responsibility to avoid deadlock in corner
> > cases completely in the hands of the application, then I do not feel
> > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > providing the first HSM API.
> >
> > If the HSM application is running in a controlled system, on a filesystem
> > where fsfreeze is not expected or not needed, then a fully functional and
> > safe HSM does not require PF_NOWAIT_FREEZE API.
> >
> > Perhaps an API to make an fs unfreezable is just as practical and a much
> > easier option for the first version of HSM API?
> >
> > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > apart from the HSM itself using this fd.
>
> This would mean you also prevent FREEZE_HOLDER_KERNEL requests which xfs
> uses for filesystem scrubbing iirc. I would reckon that you also run
> into problems with device mapper workloads where freeze/thaw requests
> from the block layer and into the filesystem layer are quite common.

I agree. These cases will not play nicely with EXCLUSIVE_FSFREEZER.
The only case where the EXCLUSIVE_FSFREEZER API makes sense
is when the admin does not expect to meet any fsfreeze on the target fs and
wants to enforce that.
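
To make the conflict concrete, here is a minimal userspace sketch of the
semantics, a toy model and not kernel code. EXCLUSIVE_FSFREEZER exists
only as a proposal in this thread, and every toy_* name below is
hypothetical. The model assumes that kernel-initiated freezes get no
exemption, which is exactly the case you describe.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_holder { TOY_HOLDER_USERSPACE, TOY_HOLDER_KERNEL };

struct toy_sb {
	bool frozen;
	int exclusive_owner;	/* 0 means nobody claimed exclusive freeze rights */
};

/* Models the proposed EXCLUSIVE_FSFREEZER ioctl on an open fd. */
static int toy_claim_exclusive(struct toy_sb *sb, int owner)
{
	assert(sb != NULL && owner != 0);
	if (sb->exclusive_owner && sb->exclusive_owner != owner)
		return -EBUSY;
	sb->exclusive_owner = owner;
	return 0;
}

/* Models closing the fd that carried the exclusive claim. */
static void toy_release_exclusive(struct toy_sb *sb, int owner)
{
	assert(sb != NULL);
	if (sb->exclusive_owner == owner)
		sb->exclusive_owner = 0;
}

/* Any freezer other than the exclusive owner is refused. */
static int toy_freeze(struct toy_sb *sb, int who, enum toy_holder holder)
{
	assert(sb != NULL);
	(void)holder;	/* assumption: kernel holders are not exempt */
	if (sb->exclusive_owner && sb->exclusive_owner != who)
		return -EBUSY;
	if (sb->frozen)
		return -EBUSY;
	sb->frozen = true;
	return 0;
}
```

With an HSM holding the claim, toy_freeze() from any other caller fails
with -EBUSY regardless of the holder type, and succeeds again only after
toy_release_exclusive().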

>
> Have you given any thought to the idea - similar to a FUSE daemon - that
> you could register with a given filesystem as an HSM? Maybe integration
> like this is really undesirable for some reason but that may be an
> alternative.

I am not sure what you mean by "register with a given filesystem"?
The comparison to FUSE daemon baffles me.  The main point with fanotify
HSM was for the user to be able to work natively on the target filesystem
without any "passthrough".

FUSE passthrough is a valid way to implement HSM.
Many HSM already use FUSE and many HSM will continue to use FUSE.
Improving FUSE passthrough performance (e.g. FUSE BPF) is another
way to improve HSM.

Compared to fanotify HSM, FUSE passthrough is more versatile, but it
is also more resource expensive and some native fs features (e.g. ioctls)
will never work properly with FUSE passthrough.

Not sure if that answers your question?

Thanks,
Amir.


* Re: fanotify HSM open issues
  2023-11-27 14:48                                   ` Amir Goldstein
@ 2023-11-27 14:57                                     ` Christian Brauner
  2023-11-28  9:46                                       ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2023-11-27 14:57 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Miklos Szeredi, Jens Axboe, linux-fsdevel

On Mon, Nov 27, 2023 at 04:48:23PM +0200, Amir Goldstein wrote:
> On Mon, Nov 27, 2023 at 3:56 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > Hi Amir,
> > > >
> > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > discussion during the Plumbers conference :)
> > > >
> > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > Recap for new people joining this thread.
> > > > > > > >
> > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > >
> > > > > > > > P1                             P2                      P3
> > > > > > > > -----------                    ------------            ------------
> > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > >       -> security_file_permission()
> > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > >
> > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > with cleaner/safer semantics.
> > > > > > > >
> > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > >
> > > > > > > Yep, nice summary.
> > > > ...
> > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > >
> > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > >
> > > > > > Yes, you are right.
> > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > >
> > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > held somehow indirectly.
> > > > > >
> > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > fs1 freeze protection is held and:
> > > > > >   ovl_splice_read(ovl1.file)
> > > > > >     ovl_real_fdget()
> > > > > >       ovl_open_realfile(fs1.file)
> > > > > >          ... security_file_open(fs1.file)
> > > > > >
> > > > > > > That being
> > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > out explicitly so that it's a conscious decision.
> > > > > > >
> > > > >
> > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > not enough for HSM needs.
> > > > >
> > > > > The reason is that often, when HSM needs to handle filling content
> > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > for any operations on the filesystem, not just pwrite().
> > > > >
> > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > from the lookup event and uses it in the handling of access events to
> > > > > update the metadata files that store which parts of the file were already
> > > > > > filled (relying on fiemap is not always a valid option).
> > > > >
> > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > >
> > > > > Another use case is that HSM may want to download content to a
> > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > then clone the data into the accessed file range.
> > > > >
> > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > >
> > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > note that if we want to really properly handle all possible operations, we
> > > > need to start handling error from all sb_start_write() and
> > > > file_start_write() calls and there are quite a few of those.
> > > >
> > >
> > > Darn, forgot about those.
> > > I am starting to reconsider adding a freeze level.
> > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > Maybe fs anti-freeze (see below).
> > >
> > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > if the requirement from permission event handler is that it must use a
> > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > >
> > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > a model that can solve the deadlock correctly.
> > > > > > >
> > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > needs to write to some other location besides the one fd it got passed
> > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > DB file containing current state or something like that.
> > > > > > >
> > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > solution but I'm not sure if practical for applications.
> > > > > > >
> > > > > >
> > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > context and enforcing this context on fanotify_read().
> > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > and to activate the desired freeze protection behavior
> > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > >
> > > > >
> > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > >
> > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > - We could add user API to set this personality explicitly to any task
> > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > >
> > > > > Please let me know if you agree with this design and if so,
> > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > in your opinion?
> > > >
> > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > for processing the fanotify events and filling in filesystem contents. I
> > > > don't think automatic setting of this flag is desirable though as it has
> > > > quite wide impact and some of the consequences could be surprising.  I
> > > > rather think it should be a conscious decision when setting up the process
> > > > processing the events. So I think API to explicitly set / clear the flag
> > > > would be the best. Also I think it would be better to capture in the name
> > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > something like that?
> > > >
> > >
> > > Sure.
> > >
> > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > flag) in the past. That would allow finer granularity control of the
> > > > behavior but I guess you are worried that it would not cover all the needed
> > > > operations?
> > > >
> > >
> > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > programs with.
> > >
> > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > >
> > > > No, I think if someone cares about these, they should explicitly set the
> > > > PF flag in their task processing the events.
> > > >
> > >
> > > OK.
> > >
> > > I see an exit hatch in this statement -
> > > If we are going to leave the responsibility to avoid deadlock in corner
> > > cases completely in the hands of the application, then I do not feel
> > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > providing the first HSM API.
> > >
> > > If the HSM application is running in a controlled system, on a filesystem
> > > where fsfreeze is not expected or not needed, then a fully functional and
> > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > >
> > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > easier option for the first version of HSM API?
> > >
> > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > apart from the HSM itself using this fd.
> >
> > This would mean you also prevent FREEZE_HOLDER_KERNEL requests which xfs
> > uses for filesystem scrubbing iirc. I would reckon that you also run
> > into problems with device mapper workloads where freeze/thaw requests
> > from the block layer and into the filesystem layer are quite common.
> 
> I agree. These cases will not play nicely with EXCLUSIVE_FSFREEZER.
> The only case where the EXCLUSIVE_FSFREEZER API makes sense
> is when the admin does not expect to meet any fsfreeze on the target fs and
> wants to enforce that.
> 
> >
> > Have you given any thought to the idea - similar to a FUSE daemon - that
> > you could register with a given filesystem as an HSM? Maybe integration
> > like this is really undesirable for some reason but that may be an
> > alternative.
> 
> I am not sure what you mean by "register with a given filesystem"?
> The comparison to FUSE daemon baffles me.  The main point with fanotify
> HSM was for the user to be able to work natively on the target filesystem
> without any "passthrough".
> 
> FUSE passthrough is a valid way to implement HSM.
> Many HSM already use FUSE and many HSM will continue to use FUSE.
> Improving FUSE passthrough performance (e.g. FUSE BPF) is another
> way to improve HSM.
> 
> Compared to fanotify HSM, FUSE passthrough is more versatile, but it
> is also more resource expensive and some native fs features (e.g. ioctls)
> will never work properly with FUSE passthrough.
> 
> Not sure if that answers your question?

This isn't about FUSE passthrough. Maybe the analogy doesn't work.

What I meant is this: similar to how fanotify registers itself as
watching an inode, a mount, or a superblock, one could have a new HSM
watch type that lets the fs detect that it is watched by an HSM and then
refuse to be frozen, or apply whatever other special behavior you might
need. I don't know much about HSMs so I might just be talking nonsense.


* Re: fanotify HSM open issues
  2023-11-20 16:59                               ` Amir Goldstein
  2023-11-27 13:56                                 ` Christian Brauner
@ 2023-11-27 19:11                                 ` Josef Bacik
  2023-11-28 11:05                                   ` Amir Goldstein
  1 sibling, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2023-11-27 19:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hi Amir,
> >
> > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > discussion during the Plumbers conference :)
> >
> > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > Recap for new people joining this thread.
> > > > > >
> > > > > > The following deadlock is possible in upstream kernel
> > > > > > if fanotify permission event handler tries to make
> > > > > > modifications to the filesystem it is watching in the context
> > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > >
> > > > > > P1                             P2                      P3
> > > > > > -----------                    ------------            ------------
> > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > -> sb_start_write(fs1.sb)
> > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > >       -> security_file_permission()
> > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > >
> > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > with cleaner/safer semantics.
> > > > > >
> > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > deadlock despite the start-write-safe patches [1].
> > > > >
> > > > > Yep, nice summary.
> > ...
> > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > me. The question is whether this is enough or not.
> > > > > > >
> > > > > >
> > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > Eventually, event handler for lookup permission events should be
> > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > >
> > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > principle the problem exists only for access and modify events where we'd
> > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > >
> > > > Yes, you are right.
> > > > It is possible that RWF_NOWAIT could be enough.
> > > >
> > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > confidence in the ability to guarantee that freeze protection is not
> > > > held somehow indirectly.
> > > >
> > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > fs1 freeze protection is held and:
> > > >   ovl_splice_read(ovl1.file)
> > > >     ovl_real_fdget()
> > > >       ovl_open_realfile(fs1.file)
> > > >          ... security_file_open(fs1.file)
> > > >
> > > > > That being
> > > > > said I understand this may be assuming too much about the implementations
> > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > out explicitly so that it's a conscious decision.
> > > > >
> > >
> > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > not enough for HSM needs.
> > >
> > > The reason is that often, when HSM needs to handle filling content
> > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > HSM needs to be able to avoid blocking on freeze protection
> > > for any operations on the filesystem, not just pwrite().
> > >
> > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > from the lookup event and uses it in the handling of access events to
> > > update the metadata files that store which parts of the file were already
> > > filled (relying on fiemap is not always a valid option).
> > >
> > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > >
> > > Another use case is that HSM may want to download content to a
> > > temp file on the same filesystem, verify the downloaded content and
> > > then clone the data into the accessed file range.
> > >
> > > I think that a PF_ flag (see below) would work best for all those cases.
> >
> > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > enough for all sensible usecases to avoid deadlocks with freezing. However
> > note that if we want to really properly handle all possible operations, we
> > need to start handling error from all sb_start_write() and
> > file_start_write() calls and there are quite a few of those.
> >
> 
> Darn, forgot about those.
> I am starting to reconsider adding a freeze level.
> I cannot shake the feeling that there is a simpler solution that escapes us...
> Maybe fs anti-freeze (see below).
> 
> > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > if the requirement from permission event handler is that it must use an
> > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > >
> > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > a model that can solve the deadlock correctly.
> > > > >
> > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > needs to do write to some other location besides the one fd it got passed
> > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > DB file containing current state or something like that.
> > > > >
> > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > special behavior would be just trylocking the freeze protection then it
> > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > solution but I'm not sure if practical for applications.
> > > > >
> > > >
> > > > I had also considered marking the listener process with the FSNOTIFY
> > > > context and enforcing this context on fanotify_read().
> > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > and to activate the desired freeze protection behavior
> > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > >
> > >
> > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > >
> > > - PF_NOWAIT will prevent blocking on freeze protection
> > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > - We could add user API to set this personality explicitly to any task
> > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > >
> > > Please let me know if you agree with this design and if so,
> > > which of the methods to set PF_NOWAIT are a must for the first version
> > > in your opinion?
> >
> > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > for processing the fanotify events and filling in filesystem contents. I
> > don't think automatic setting of this flag is desirable though as it has
> > quite wide impact and some of the consequences could be surprising.  I
> > rather think it should be a conscious decision when setting up the process
> > processing the events. So I think API to explicitly set / clear the flag
> > would be the best. Also I think it would be better to capture in the name
> > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > something like that?
> >
> 
> Sure.
> 
> > Also we were thinking about having an open(2) flag for this (instead of PF
> > flag) in the past. That would allow finer granularity control of the
> > behavior but I guess you are worried that it would not cover all the needed
> > operations?
> >
> 
> Yeh, it seems like an API that is going to be harder to write safe HSM
> programs with.
> 
> > > Do you think we should use this method to fix the existing deadlocks
> > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> >
> > No, I think if someone cares about these, they should explicitly set the
> > PF flag in their task processing the events.
> >
> 
> OK.
> 
> I see an exit hatch in this statement -
> If we are going to leave the responsibility to avoid deadlock in corner
> cases completely in the hands of the application, then I do not feel
> morally obligated to create the PF_NOWAIT_FREEZE API *before*
> providing the first HSM API.
> 
> If the HSM application is running in a controlled system, on a filesystem
> where fsfreeze is not expected or not needed, then a fully functional and
> safe HSM does not require PF_NOWAIT_FREEZE API.
> 
> Perhaps an API to make an fs unfreezable is just as practical and a much
> easier option for the first version of HSM API?
> 
> Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> ioctl. Then no other task can freeze the fs, for as long as the fd is open
> apart from the HSM itself using this fd.
> 
> HSM itself can avoid deadlocks if it collaborates the fs freezes with
> making fs modifications from within HSM events.
> 
> Do you think that may be an acceptable way out of the corner?

This is kind of a corner case that I think is acceptable to just leave up to
application developers.  Speaking as a potential consumer of this work we don't
use fsfreeze so aren't concerned with this in practice, and arguably if you're
using this interface you know what you're doing.  As long as the sharp edge is
well documented I think that's fine for v1.

Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
about the xfs scrubbing use case.  We all know that "freeze this file system" is
an operation that is going to take X amount of time, so as long as we provide
the application a way to block fsfreeze to avoid the deadlock then I think
that's a reasonable solution.  Additionally it would allow us an avenue to
gracefully handle errors.  If we race and see that the fs is already frozen,
then we can go back to the HSM with an error saying it's out of luck, and it can
return -EAGAIN or something through fanotify to unwind and try again later.
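
To make the intended semantics concrete, here is a toy userspace model
of the exclusive-freezer idea. Nothing in it is kernel code or an
existing API: struct toy_sb, the toy_* helpers, the integer owner ids
and the choice of -EBUSY as the error are all assumptions made for the
sake of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NO_OWNER (-1)

/* Toy stand-in for a superblock's freeze state (not the kernel struct). */
struct toy_sb {
	bool frozen;
	int excl_owner;	/* id of the exclusive freezer, or TOY_NO_OWNER */
};

void toy_sb_init(struct toy_sb *sb)
{
	assert(sb != NULL);
	sb->frozen = false;
	sb->excl_owner = TOY_NO_OWNER;
}

/*
 * Model of the HSM claiming exclusive freeze rights. If the fs is
 * already frozen, or somebody else holds the claim, the HSM loses the
 * race and gets an error it can act on.
 */
int toy_claim_exclusive_freezer(struct toy_sb *sb, int owner)
{
	assert(sb != NULL && owner != TOY_NO_OWNER);
	if (sb->frozen)
		return -EBUSY;
	if (sb->excl_owner != TOY_NO_OWNER && sb->excl_owner != owner)
		return -EBUSY;
	sb->excl_owner = owner;
	return 0;
}

/* Model of dropping the claim, e.g. when the HSM closes its fd. */
void toy_release_exclusive_freezer(struct toy_sb *sb, int owner)
{
	assert(sb != NULL);
	if (sb->excl_owner == owner)
		sb->excl_owner = TOY_NO_OWNER;
}

/* Model of a freeze request: only the exclusive owner, if any, may freeze. */
int toy_freeze(struct toy_sb *sb, int who)
{
	assert(sb != NULL);
	if (sb->excl_owner != TOY_NO_OWNER && sb->excl_owner != who)
		return -EBUSY;
	if (sb->frozen)
		return -EBUSY;
	sb->frozen = true;
	return 0;
}
```

In this model a failed claim is the point where the HSM would unwind
and deny the pending permission event so the caller can retry later.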

But this is a pretty narrow corner case, and you've done the due diligence to
avoid the other deadlocks, so I don't feel that coming up with a solution to
this is a necessary prerequisite to the actual feature.  Documenting it clearly
is the only thing I would ask.  Thanks,

Josef


* Re: fanotify HSM open issues
  2023-11-27 14:57                                     ` Christian Brauner
@ 2023-11-28  9:46                                       ` Amir Goldstein
  0 siblings, 0 replies; 19+ messages in thread
From: Amir Goldstein @ 2023-11-28  9:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Miklos Szeredi, Jens Axboe, linux-fsdevel, Josef Bacik

On Mon, Nov 27, 2023 at 4:57 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Nov 27, 2023 at 04:48:23PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 27, 2023 at 3:56 PM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > Hi Amir,
> > > > >
> > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > > discussion during the Plumbers conference :)
> > > > >
> > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > Recap for new people joining this thread.
> > > > > > > > >
> > > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > > >
> > > > > > > > > P1                             P2                      P3
> > > > > > > > > -----------                    ------------            ------------
> > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > > >       -> security_file_permission()
> > > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > > >
> > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > > with cleaner/safer semantics.
> > > > > > > > >
> > > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > > >
> > > > > > > > Yep, nice summary.
> > > > > ...
> > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > > >
> > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > > >
> > > > > > > Yes, you are right.
> > > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > > >
> > > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > > held somehow indirectly.
> > > > > > >
> > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > > fs1 freeze protection is held and:
> > > > > > >   ovl_splice_read(ovl1.file)
> > > > > > >     ovl_real_fdget()
> > > > > > >       ovl_open_realfile(fs1.file)
> > > > > > >          ... security_file_open(fs1.file)
> > > > > > >
> > > > > > > > That being
> > > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > > out explicitly so that it's a conscious decision.
> > > > > > > >
> > > > > >
> > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > > not enough for HSM needs.
> > > > > >
> > > > > > The reason is that often, when HSM needs to handle filling content
> > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > > for any operations on the filesystem, not just pwrite().
> > > > > >
> > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > > from the lookup event and uses it in the handling of access events to
> > > > > > update the metadata files that store which parts of the file were already
> > > > > > filled (relying on fiemap is not always a valid option).
> > > > > >
> > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > > >
> > > > > > Another use case is that HSM may want to download content to a
> > > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > > then clone the data into the accessed file range.
> > > > > >
> > > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > > >
> > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > > note that if we want to really properly handle all possible operations, we
> > > > > need to start handling error from all sb_start_write() and
> > > > > file_start_write() calls and there are quite a few of those.
> > > > >
> > > >
> > > > Darn, forgot about those.
> > > > I am starting to reconsider adding a freeze level.
> > > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > Maybe fs anti-freeze (see below).
> > > >
> > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > if the requirement from permission event handler is that it must use an
> > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > > >
> > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > > a model that can solve the deadlock correctly.
> > > > > > > >
> > > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > needs to do write to some other location besides the one fd it got passed
> > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > > DB file containing current state or something like that.
> > > > > > > >
> > > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > > solution but I'm not sure if practical for applications.
> > > > > > > >
> > > > > > >
> > > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > > context and enforcing this context on fanotify_read().
> > > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > > and to activate the desired freeze protection behavior
> > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > > >
> > > > > >
> > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > > >
> > > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > > - We could add user API to set this personality explicitly to any task
> > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > > >
> > > > > > Please let me know if you agree with this design and if so,
> > > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > > in your opinion?
> > > > >
> > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > > for processing the fanotify events and filling in filesystem contents. I
> > > > > don't think automatic setting of this flag is desirable though as it has
> > > > > quite wide impact and some of the consequences could be surprising.  I
> > > > > rather think it should be a conscious decision when setting up the process
> > > > > processing the events. So I think API to explicitly set / clear the flag
> > > > > would be the best. Also I think it would be better to capture in the name
> > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > > something like that?
> > > > >
> > > >
> > > > Sure.
> > > >
> > > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > > flag) in the past. That would allow finer granularity control of the
> > > > > behavior but I guess you are worried that it would not cover all the needed
> > > > > operations?
> > > > >
> > > >
> > > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > > programs with.
> > > >
> > > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > > >
> > > > > No, I think if someone cares about these, they should explicitly set the
> > > > > PF flag in their task processing the events.
> > > > >
> > > >
> > > > OK.
> > > >
> > > > I see an exit hatch in this statement -
> > > > If we are going to leave the responsibility to avoid deadlock in corner
> > > > cases completely in the hands of the application, then I do not feel
> > > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > > providing the first HSM API.
> > > >
> > > > If the HSM application is running in a controlled system, on a filesystem
> > > > where fsfreeze is not expected or not needed, then a fully functional and
> > > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > > >
> > > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > > easier option for the first version of HSM API?
> > > >
> > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > > apart from the HSM itself using this fd.
> > >
> > > This would mean you also prevent FREEZE_HOLDER_KERNEL requests which xfs
> > > uses for filesystem scrubbing iirc. I would reckon that you also run
> > > into problems with device mapper workloads where freeze/thaw requests
> > > from the block layer and into the filesystem layer are quite common.
> >
> > I agree. These cases will not play nicely with EXCLUSIVE_FSFREEZER.
> > The only case where the EXCLUSIVE_FSFREEZER API makes sense
> > is when the admin does not expect to meet any fsfreeze on the target fs and
> > wants to enforce that.
> >
> > >
> > > Have you given any thought to the idea - similar to a FUSE daemon - that
> > > you could register with a given filesystem as an HSM? Maybe integration
> > > like this is really undesirable for some reason but that may be an
> > > alternative.
> >
> > I am not sure what you mean by "register with a given filesystem"?
> > The comparison to FUSE daemon baffles me.  The main point with fanotify
> > HSM was for the user to be able to work natively on the target filesystem
> > without any "passthrough".
> >
> > FUSE passthrough is a valid way to implement HSM.
> > Many HSM already use FUSE and many HSM will continue to use FUSE.
> > Improving FUSE passthrough performance (e.g. FUSE BPF) is another
> > way to improve HSM.
> >
> > Compared to fanotify HSM, FUSE passthrough is more versatile, but it
> > is also more resource expensive and some native fs features (e.g. ioctls)
> > will never work properly with FUSE passthrough.
> >
> > Not sure if that answers your question?
>
> This isn't about FUSE passthrough. Maybe the analogy doesn't work.
>
> What I just meant is similar to how fanotify registers itself as
> watching an inode or a mount or superblock one could have a new HSM
> watch type that lets the fs detect that it is watched by an HSM and then
> refuse to be frozen or other special behavior you might need. I don't
> know much about HSMs so I might just be talking nonsense.

Implementing mandatory anti-fsfreeze for any fs watched by HSM events
would be trivial and does not require specific filesystem integration.

I've already written a vfs API to advertise that filesystems are watched by
pre-modify HSM events in a later part of the series:
https://github.com/amir73il/linux/commit/88db3054b6bfa5ef4240175fa9efd6b3a871818c

However, if we do choose the solution of anti-fsfreeze,
I much prefer to leave it in the hands of userspace via
EXCLUSIVE_FSFREEZER API over mandatory anti-fsfreeze.

The main reason is that unlike what may be inferred from this thread,
HSM + fsfreeze CAN live quite well together, including HSM + xfs scrub,
HSM + LVM.

After the patches that are now in the vfs.rw branch, it takes much more
than just HSM + fsfreeze to cause a deadlock.

It requires HSM + fsfreeze + splice from a file on:
(
  - a nested overlayfs, whose lower^2 fs is on the "host" fs
  OR
  - a loop mounted filesystem, whose image file is on the "host" fs
)
AND
- the splice is to a file on the "host" fs

These two scenarios are not possible in a container, for example,
when the "host" fs is not exposed for write, directly or indirectly,
to the container.

And of course for many systems, those scenarios do not exist at all,
so there is no need for any anti-fsfreeze, neither mandatory nor user
controlled.
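
The conditions listed above can be written down as a small predicate.
This is just a restatement of that list in C, not kernel code; the
struct, enum and function names are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where the splice source file lives, relative to the "host" fs. */
enum splice_src {
	SRC_HOST_FS,		/* plain file on the host fs itself */
	SRC_NESTED_OVL_ON_HOST,	/* nested overlayfs, lower^2 fs on the host fs */
	SRC_LOOP_IMAGE_ON_HOST,	/* loop mounted fs, image file on the host fs */
	SRC_UNRELATED_FS,	/* never reaches the host fs */
};

struct hsm_scenario {
	bool hsm_watching_host;	/* HSM permission events on the host fs */
	bool fsfreeze_possible;	/* something may freeze the host fs */
	enum splice_src src;	/* where the splice reads from */
	bool dst_on_host_fs;	/* the splice writes to the host fs */
};

/*
 * True only when every ingredient of the remaining deadlock is present:
 * HSM + fsfreeze + splice from (nested ovl OR loop image on the host fs)
 * into a file on the host fs.
 */
bool hsm_freeze_deadlock_possible(const struct hsm_scenario *s)
{
	assert(s != NULL);

	bool indirect_src = s->src == SRC_NESTED_OVL_ON_HOST ||
			    s->src == SRC_LOOP_IMAGE_ON_HOST;

	return s->hsm_watching_host && s->fsfreeze_possible &&
	       indirect_src && s->dst_on_host_fs;
}
```

Note that a plain host-fs-to-host-fs splice evaluates to false here,
which matches the claim that HSM + fsfreeze alone is no longer enough.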

Thanks,
Amir.


* Re: fanotify HSM open issues
  2023-11-27 19:11                                 ` Josef Bacik
@ 2023-11-28 11:05                                   ` Amir Goldstein
  2023-11-28 14:55                                     ` Josef Bacik
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2023-11-28 11:05 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Mon, Nov 27, 2023 at 9:11 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Hi Amir,
> > >
> > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > discussion during the Plumbers conference :)
> > >
> > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > Recap for new people joining this thread.
> > > > > > >
> > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > if fanotify permission event handler tries to make
> > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > >
> > > > > > > P1                             P2                      P3
> > > > > > > -----------                    ------------            ------------
> > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > -> sb_start_write(fs1.sb)
> > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > >       -> security_file_permission()
> > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > >
> > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > with cleaner/safer semantics.
> > > > > > >
> > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > >
> > > > > > Yep, nice summary.
> > > ...
> > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > me. The question is whether this is enough or not.
> > > > > > > >
> > > > > > >
> > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > >
> > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > >
> > > > > Yes, you are right.
> > > > > It is possible that RWF_NOWAIT could be enough.
> > > > >
> > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > held somehow indirectly.
> > > > >
> > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > fs1 freeze protection is held and:
> > > > >   ovl_splice_read(ovl1.file)
> > > > >     ovl_real_fdget()
> > > > >       ovl_open_realfile(fs1.file)
> > > > >          ... security_file_open(fs1.file)
> > > > >
> > > > > > That being
> > > > > > said I understand this may be assuming too much about the implementations
> > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > out explicitly so that it's a conscious decision.
> > > > > >
> > > >
> > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > not enough for HSM needs.
> > > >
> > > > The reason is that often, when HSM needs to handle filling content
> > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > HSM needs to be able to avoid blocking on freeze protection
> > > > for any operations on the filesystem, not just pwrite().
> > > >
> > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > from the lookup event and uses it in the handling of access events to
> > > > update the metadata files that store which parts of the file were already
> > > > filled (relying on fiemap is not always a valid option).
> > > >
> > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > >
> > > > Another use case is that HSM may want to download content to a
> > > > temp file on the same filesystem, verify the downloaded content and
> > > > then clone the data into the accessed file range.
> > > >
> > > > I think that a PF_ flag (see below) would work best for all those cases.
> > >
> > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > note that if we want to really properly handle all possible operations, we
> > > need to start handling error from all sb_start_write() and
> > > file_start_write() calls and there are quite a few of those.
> > >
> >
> > Darn, forgot about those.
> > I am starting to reconsider adding a freeze level.
> > I cannot shake the feeling that there is a simpler solution that escapes us...
> > Maybe fs anti-freeze (see below).
> >
> > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > if the requirement from permission event handler is that it must use an
> > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > >
> > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > a model that can solve the deadlock correctly.
> > > > > >
> > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > needs to do write to some other location besides the one fd it got passed
> > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > DB file containing current state or something like that.
> > > > > >
> > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > solution but I'm not sure if practical for applications.
> > > > > >
> > > > >
> > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > context and enforcing this context on fanotify_read().
> > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > and to activate the desired freeze protection behavior
> > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > >
> > > >
> > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > >
> > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > - We could add user API to set this personality explicitly to any task
> > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > >
> > > > Please let me know if you agree with this design and if so,
> > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > in your opinion?
> > >
> > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > for processing the fanotify events and filling in filesystem contents. I
> > > don't think automatic setting of this flag is desirable though as it has
> > > quite wide impact and some of the consequences could be surprising.  I
> > > rather think it should be a conscious decision when setting up the process
> > > processing the events. So I think API to explicitly set / clear the flag
> > > would be the best. Also I think it would be better to capture in the name
> > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > something like that?
> > >
> >
> > Sure.
> >
> > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > flag) in the past. That would allow finer granularity control of the
> > > behavior but I guess you are worried that it would not cover all the needed
> > > operations?
> > >
> >
> > Yeh, it seems like an API that is going to be harder to write safe HSM
> > programs with.
> >
> > > > Do you think we should use this method to fix the existing deadlocks
> > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > >
> > > No, I think if someone cares about these, they should explicitly set the
> > > PF flag in their task processing the events.
> > >
> >
> > OK.
> >
> > I see an exit hatch in this statement -
> > If we are going to leave the responsibility to avoid deadlock in corner
> > cases completely in the hands of the application, then I do not feel
> > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > providing the first HSM API.
> >
> > If the HSM application is running in a controlled system, on a filesystem
> > where fsfreeze is not expected or not needed, then a fully functional and
> > safe HSM does not require PF_NOWAIT_FREEZE API.
> >
> > Perhaps an API to make an fs unfreezable is just as practical and a much
> > easier option for the first version of HSM API?
> >
> > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > apart from the HSM itself using this fd.
> >
> > HSM itself can avoid deadlocks if it collaborates the fs freezes with
> > making fs modifications from within HSM events.
> >
> > Do you think that may be an acceptable way out of the corner?
>
> This is kind of a corner case that I think is acceptable to just leave up to
> application developers.  Speaking as a potential consumer of this work we don't
> use fsfreeze so aren't concerned with this in practice, and arguably if you're
> using this interface you know what you're doing.  As long as the sharp edge is
> well documented I think that's fine for v1.
>

I agree that this is good enough for v1.
The only question is whether we can (and should) do better than good enough for v1.

> Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
> about the xfs scrubbing use case.  We all know that "freeze this file system" is
> an operation that is going to take X amount of time, so as long as we provide
> the application a way to block fsfreeze to avoid the deadlock then I think
> that's a reasonable solution.  Additionally it would allow us an avenue to
> gracefully handle errors.  If we race and see that the fs is already frozen well
> then we can go back to the HSM with an error saying he's out of luck, and he can
> return -EAGAIN or something through fanotify to unwind and try again later.
>

Actually, "fs is already frozen" is not a deadlock case.
If "fs is already frozen" then fsfreeze was successful and HSM should just
wait in line like everyone else until fs is unfrozen.

The deadlock case is "fs is being frozen" (i.e. sb->s_writers.frozen is
in state SB_FREEZE_WRITE), which cannot make progress because
an existing holder of sb write is blocked on an HSM event, which in turn
is trying to start a new sb write.
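
To spell out the distinction, here is a minimal userspace sketch in C of
the two states. It is a toy model, not kernel code: struct toy_sb, the
toy_* helpers and the two boolean parameters are names made up for
illustration, and the real freeze protection is built on per-cpu
rw-semaphores rather than a plain counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified freeze states; the kernel has more levels than this. */
enum toy_freeze_state { TOY_UNFROZEN, TOY_FREEZE_WRITE, TOY_FREEZE_COMPLETE };

struct toy_sb {
	enum toy_freeze_state frozen;	/* models sb->s_writers.frozen */
	int writers;			/* models active sb_start_write() holders */
};

/* A new blocking sb_start_write() has to wait once a freeze has started. */
static bool toy_start_write_would_block(const struct toy_sb *sb)
{
	return sb->frozen != TOY_UNFROZEN;
}

/* The freezer can move past the WRITE level only after writers drain. */
static bool toy_freeze_can_advance(const struct toy_sb *sb)
{
	assert(sb->writers >= 0);
	return sb->frozen == TOY_FREEZE_WRITE && sb->writers == 0;
}

/*
 * Deadlock cycle: the freezer waits for an existing writer, that writer
 * waits for the HSM event handler, and the handler waits for the freezer.
 */
static bool toy_is_deadlocked(const struct toy_sb *sb,
			      bool writer_waits_on_hsm, bool hsm_wants_write)
{
	bool freezer_stuck = sb->frozen == TOY_FREEZE_WRITE &&
			     !toy_freeze_can_advance(sb);
	bool hsm_stuck = hsm_wants_write && toy_start_write_would_block(sb);

	return freezer_stuck && writer_waits_on_hsm && hsm_stuck;
}
```

With { TOY_FREEZE_WRITE, 1 } and both booleans true the model reports a
deadlock. With { TOY_FREEZE_COMPLETE, 0 } the HSM write would still
block, but nothing waits on the HSM, so it simply resumes after thaw.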

So far, we have discussed proposals to:
- put sb in state where sb_wait_write(sb, SB_FREEZE_WRITE)
  will not be called (anti-fsfreeze)
- put HSM process (or HSM fd) in a state, where sb_start_write_trylock()
  is called instead of sb_start_write() from within event handler context
- put HSM process (or HSM fd) in a state, where SB_FREEZE_FSNOTIFY
  level is taken instead of sb_start_write() from within event handler context
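
As an illustration of the trylock proposal in the list above, here is a
rough userspace sketch in C. Everything in it is a toy model under my
own assumptions: struct toy_sb, toy_start_write_trylock() and
toy_hsm_fill_content() are invented names that only mimic the intended
semantics, and returning -EAGAIN is just one possible way to report the
failure to the handler.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_sb {
	bool freeze_in_progress;	/* freezer is waiting for writers */
	int writers;			/* active write-protection holders */
};

/* Never waits: fails immediately if a freeze has already started. */
static bool toy_start_write_trylock(struct toy_sb *sb)
{
	if (sb->freeze_in_progress)
		return false;
	sb->writers++;
	return true;
}

static void toy_end_write(struct toy_sb *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/*
 * Hypothetical handler step: fill file content from within an event.
 * Returns 0 on success, or -EAGAIN instead of blocking on the freezer.
 */
static int toy_hsm_fill_content(struct toy_sb *sb)
{
	if (!toy_start_write_trylock(sb))
		return -EAGAIN;
	/* ... the actual data copy would happen here ... */
	toy_end_write(sb);
	return 0;
}
```

The point of the sketch is that the handler always returns, so the
writer that is blocked on the event can be released and the freezer can
make progress. The cost is that every caller has to handle the error.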

There may be another option:
- on read of HSM event, fanotify calls sb_start_write_trylock() and
  puts HSM fd in a "nested sb write" state, where sb_start_write() does not
  take the SB_FREEZE_WRITE freeze level lock at all if
  sb->s_writers.frozen is in SB_FREEZE_WRITE state

But that sounds a bit subtle. Specifically, we will need to make sure
that a "nested" sb_start_write() scope is completed before an
HSM event is completed/canceled. I will not even try to go down this
road unless Jan gives me his blessing and a roadmap...

> But this is a pretty narrow corner case, you've done the due diligence to avoid
> the other deadlocks, I don't feel that coming up with a solution to this is a
> necessary pre-requisite to the actual feature.  Documenting it clearly is the
> only thing I would ask.

TBH, I am not at ease with "only documenting" for v1.
Since your use case has no fsfreeze, I would be much more comfortable
with "mandatory anti-freeze" plus documentation for v1,
because it is easy to relax "mandatory anti-freeze" later with finer-grained
anti-deadlock mechanisms, and for use cases with no fsfreeze at all,
"mandatory anti-freeze" doesn't hurt.

Anyway, I am going to wait for Jan's decision on the minimum requirement for v1.

Thanks,
Amir.

* Re: fanotify HSM open issues
  2023-11-28 11:05                                   ` Amir Goldstein
@ 2023-11-28 14:55                                     ` Josef Bacik
  2023-11-28 15:13                                       ` Christian Brauner
  2023-11-28 16:52                                       ` Amir Goldstein
  0 siblings, 2 replies; 19+ messages in thread
From: Josef Bacik @ 2023-11-28 14:55 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Tue, Nov 28, 2023 at 01:05:50PM +0200, Amir Goldstein wrote:
> On Mon, Nov 27, 2023 at 9:11 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > Hi Amir,
> > > >
> > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > discussion during the Plumbers conference :)
> > > >
> > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > Recap for new people joining this thread.
> > > > > > > >
> > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > >
> > > > > > > > P1                             P2                      P3
> > > > > > > > -----------                    ------------            ------------
> > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > >       -> security_file_permission()
> > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > >
> > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > with cleaner/safer semantics.
> > > > > > > >
> > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > >
> > > > > > > Yep, nice summary.
> > > > ...
> > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > >
> > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > >
> > > > > > Yes, you are right.
> > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > >
> > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > held somehow indirectly.
> > > > > >
> > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > fs1 freeze protection is held and:
> > > > > >   ovl_splice_read(ovl1.file)
> > > > > >     ovl_real_fdget()
> > > > > >       ovl_open_realfile(fs1.file)
> > > > > >          ... security_file_open(fs1.file)
> > > > > >
> > > > > > > That being
> > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > out explicitly so that it's a conscious decision.
> > > > > > >
> > > > >
> > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > not enough for HSM needs.
> > > > >
> > > > > The reason is that often, when HSM needs to handle filling content
> > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > for any operations on the filesystem, not just pwrite().
> > > > >
> > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > from the lookup event and uses it in the handling of access events to
> > > > > update the metadata files that store which parts of the file were already
> > > > > filled (relying on fiemap is not always a valid option).
> > > > >
> > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > >
> > > > > Another use case is that HSM may want to download content to a
> > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > then clone the data into the accessed file range.
> > > > >
> > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > >
> > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > note that if we want to really properly handle all possible operations, we
> > > > need to start handling error from all sb_start_write() and
> > > > file_start_write() calls and there are quite a few of those.
> > > >
> > >
> > > Darn, forgot about those.
> > > I am starting to reconsider adding a freeze level.
> > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > Maybe fs anti-freeze (see below).
> > >
> > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > if the requirement from permission event handler is that it must use an
> > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > >
> > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > a model that can solve the deadlock correctly.
> > > > > > >
> > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > needs to do write to some other location besides the one fd it got passed
> > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > DB file containing current state or something like that.
> > > > > > >
> > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > solution but I'm not sure if practical for applications.
> > > > > > >
> > > > > >
> > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > context and enforcing this context on fanotify_read().
> > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > and to activate the desired freeze protection behavior
> > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > >
> > > > >
> > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > >
> > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > - We could add user API to set this personality explicitly to any task
> > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > >
> > > > > Please let me know if you agree with this design and if so,
> > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > in your opinion?
> > > >
> > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > for processing the fanotify events and filling in filesystem contents. I
> > > > don't think automatic setting of this flag is desirable though as it has
> > > > quite wide impact and some of the consequences could be surprising.  I
> > > > rather think it should be a conscious decision when setting up the process
> > > > processing the events. So I think API to explicitly set / clear the flag
> > > > would be the best. Also I think it would be better to capture in the name
> > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > something like that?
> > > >
> > >
> > > Sure.
> > >
> > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > flag) in the past. That would allow finer granularity control of the
> > > > behavior but I guess you are worried that it would not cover all the needed
> > > > operations?
> > > >
> > >
> > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > programs with.
> > >
> > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > >
> > > > No, I think if someone cares about these, they should explicitly set the
> > > > PF flag in their task processing the events.
> > > >
> > >
> > > OK.
> > >
> > > I see an exit hatch in this statement -
> > > If we are going to leave the responsibility to avoid deadlock in corner
> > > cases completely in the hands of the application, then I do not feel
> > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > providing the first HSM API.
> > >
> > > If the HSM application is running in a controlled system, on a filesystem
> > > where fsfreeze is not expected or not needed, then a fully functional and
> > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > >
> > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > easier option for the first version of HSM API?
> > >
> > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > apart from the HSM itself using this fd.
> > >
> > > HSM itself can avoid deadlocks if it collaborates the fs freezes with
> > > making fs modifications from within HSM events.
> > >
> > > Do you think that may be an acceptable way out of the corner?
> >
> > This is kind of a corner case that I think is acceptable to just leave up to
> > application developers.  Speaking as a potential consumer of this work we don't
> > use fsfreeze so aren't concerned with this in practice, and arguably if you're
> > using this interface you know what you're doing.  As long as the sharp edge is
> > well documented I think that's fine for v1.
> >
> 
> I agree that this is good enough for v1.
> The only question is can we (and should we) do better than good enough for v1.
> 
> > Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
> > about the xfs scrubbing use case.  We all know that "freeze this file system" is
> > an operation that is going to take X amount of time, so as long as we provide
> > the application a way to block fsfreeze to avoid the deadlock then I think
> > that's a reasonable solution.  Additionally it would allow us an avenue to
> > gracefully handle errors.  If we race and see that the fs is already frozen well
> > then we can go back to the HSM with an error saying he's out of luck, and he can
> > return -EAGAIN or something through fanotify to unwind and try again later.
> >
> 
> Actually, "fs is already frozen" is not a deadlock case.
> If "fs is already frozen" then fsfreeze was successful and HSM should just
> wait in line like everyone else until fs is unfrozen.
> 
> The deadlock case is "fs is being frozen" (i.e. sb->s_writers.frozen is
> in state SB_FREEZE_WRITE), which cannot make progress because
> an existing holder of sb write is blocked on an HSM event, which in turn
> is trying to start a new sb write.

Right, and now I'm confused.  You have your patchset to re-order the permission
checks to before the sb_start_write(), so an HSM watching FAN_OPEN_PERM is no
longer holding the sb write lock and thus can't deadlock, correct?

The new things you are proposing (FAN_PRE_ACCESS and FAN_PRE_MODIFY) also do not
happen inside of an sb_start_write(), correct?

So where is the deadlock you're trying to fix?  The one you describe in this
thread is what the patchset I reviewed last week was fixing, so in my eyes it
looks like we're good?  It seems you're worried about the HSM app getting stuck
on an fsfreeze when it's trying to populate the content, but that's not actually
a deadlock: it just has to wait for the fs to be unfrozen.  The fsfreeze
operation will be able to complete, and then thaw will be able to happen, because
there's no nested sb_write with the new flags, and with your patchset there's no
sb_write with FAN_OPEN_PERM.
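
To show what I mean by the re-ordering, here is a toy userspace sketch
in C. It is only my mental model, not the actual patches: struct toy_sb
and the toy_* functions are invented for illustration, and the hook
merely reports how many write holders exist when the permission event
would be raised.

```c
#include <assert.h>

struct toy_sb {
	int writers;	/* models active sb_start_write() holders */
};

/* Stand-in for the permission event: report writers held at hook time. */
static int toy_perm_hook(const struct toy_sb *sb)
{
	return sb->writers;
}

/* Old ordering: the hook runs inside the sb_start_write() scope. */
static int toy_sendfile_old(struct toy_sb *sb)
{
	int seen;

	sb->writers++;			/* sb_start_write() */
	seen = toy_perm_hook(sb);	/* rw_verify_area() -> permission event */
	sb->writers--;			/* sb_end_write() */
	return seen;
}

/* New ordering: the hook runs before freeze protection is taken. */
static int toy_sendfile_reordered(struct toy_sb *sb)
{
	int seen = toy_perm_hook(sb);	/* permission event first */

	sb->writers++;			/* sb_start_write() */
	/* ... the data copy would happen here ... */
	sb->writers--;			/* sb_end_write() */
	return seen;
}
```

In the old ordering the hook sees one held writer, which is what lets a
freezer wait on a task that is itself waiting on the HSM. In the
re-ordered version the hook sees none.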

Sorry, I hate it when people come in the middle of a conversation and I have to
re-explain myself, so feel free to ignore me.  But I've read the whole thread a
few times and I can't quite figure out what this new deadlock is you're worried
about.  Thanks,

Josef

* Re: fanotify HSM open issues
  2023-11-28 14:55                                     ` Josef Bacik
@ 2023-11-28 15:13                                       ` Christian Brauner
  2023-11-28 16:52                                       ` Amir Goldstein
  1 sibling, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2023-11-28 15:13 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Amir Goldstein, Jan Kara, Miklos Szeredi, Jens Axboe,
	linux-fsdevel

On Tue, Nov 28, 2023 at 09:55:47AM -0500, Josef Bacik wrote:
> On Tue, Nov 28, 2023 at 01:05:50PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 27, 2023 at 9:11 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > Hi Amir,
> > > > >
> > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > > discussion during the Plumbers conference :)
> > > > >
> > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > Recap for new people joining this thread.
> > > > > > > > >
> > > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > > >
> > > > > > > > > P1                             P2                      P3
> > > > > > > > > -----------                    ------------            ------------
> > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > > >       -> security_file_permission()
> > > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > > >
> > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > > with cleaner/safer semantics.
> > > > > > > > >
> > > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > > >
> > > > > > > > Yep, nice summary.
> > > > > ...
> > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > > >
> > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > > >
> > > > > > > Yes, you are right.
> > > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > > >
> > > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > > held somehow indirectly.
> > > > > > >
> > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > > fs1 freeze protection is held and:
> > > > > > >   ovl_splice_read(ovl1.file)
> > > > > > >     ovl_real_fdget()
> > > > > > >       ovl_open_realfile(fs1.file)
> > > > > > >          ... security_file_open(fs1.file)
> > > > > > >
> > > > > > > > That being
> > > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > > out explicitly so that it's a conscious decision.
> > > > > > > >
> > > > > >
> > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > > not enough for HSM needs.
> > > > > >
> > > > > > The reason is that often, when HSM needs to handle filling content
> > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > > for any operations on the filesystem, not just pwrite().
> > > > > >
> > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > > from the lookup event and uses it in the handling of access events to
> > > > > > update the metadata files that store which parts of the file were already
> > > > > > > filled (relying on fiemap is not always a valid option).
> > > > > >
> > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > > >
> > > > > > Another use case is that HSM may want to download content to a
> > > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > > then clone the data into the accessed file range.
> > > > > >
> > > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > > >
> > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > > note that if we want to really properly handle all possible operations, we
> > > > > need to start handling error from all sb_start_write() and
> > > > > file_start_write() calls and there are quite a few of those.
> > > > >
> > > >
> > > > Darn, forgot about those.
> > > > I am starting to reconsider adding a freeze level.
> > > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > > Maybe fs anti-freeze (see below).
> > > >
> > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > > if the requirement from permission event handler is that it must use an
> > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > > >
> > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > > a model that can solve the deadlock correctly.
> > > > > > > >
> > > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > needs to do write to some other location besides the one fd it got passed
> > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > > DB file containing current state or something like that.
> > > > > > > >
> > > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > > solution but I'm not sure if practical for applications.
> > > > > > > >
> > > > > > >
> > > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > > context and enforcing this context on fanotify_read().
> > > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > > and to activate the desired freeze protection behavior
> > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > > >
> > > > > >
> > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > > >
> > > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > > - We could add user API to set this personality explicitly to any task
> > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > > >
> > > > > > Please let me know if you agree with this design and if so,
> > > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > > in your opinion?
> > > > >
> > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > > for processing the fanotify events and filling in filesystem contents. I
> > > > > don't think automatic setting of this flag is desirable though as it has
> > > > > quite wide impact and some of the consequences could be surprising.  I
> > > > > rather think it should be a conscious decision when setting up the process
> > > > > processing the events. So I think API to explicitly set / clear the flag
> > > > > would be the best. Also I think it would be better to capture in the name
> > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > > something like that?
> > > > >
> > > >
> > > > Sure.
> > > >
> > > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > > flag) in the past. That would allow finer granularity control of the
> > > > > behavior but I guess you are worried that it would not cover all the needed
> > > > > operations?
> > > > >
> > > >
> > > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > > programs with.
> > > >
> > > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > > >
> > > > > No, I think if someone cares about these, they should explicitly set the
> > > > > PF flag in their task processing the events.
> > > > >
> > > >
> > > > OK.
> > > >
> > > > I see an exit hatch in this statement -
> > > > > If we are going to leave the responsibility to avoid deadlock in corner
> > > > cases completely in the hands of the application, then I do not feel
> > > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > > providing the first HSM API.
> > > >
> > > > If the HSM application is running in a controlled system, on a filesystem
> > > > where fsfreeze is not expected or not needed, then a fully functional and
> > > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > > >
> > > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > > easier option for the first version of HSM API?
> > > >
> > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > > apart from the HSM itself using this fd.
> > > >
> > > > > HSM itself can avoid deadlocks if it coordinates the fs freezes with
> > > > making fs modifications from within HSM events.
> > > >
> > > > > Do you think that may be an acceptable way out of the corner?
> > >
> > > This is kind of a corner case that I think is acceptable to just leave up to
> > > application developers.  Speaking as a potential consumer of this work we don't
> > > > use fsfreeze so aren't concerned with this in practice, and arguably if you're
> > > using this interface you know what you're doing.  As long as the sharp edge is
> > > well documented I think that's fine for v1.
> > >
> > 
> > I agree that this is good enough for v1.
> > The only question is can we (and should we) do better than good enough for v1.
> > 
> > > Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
> > > about the xfs scrubbing use case.  We all know that "freeze this file system" is
> > > an operation that is going to take X amount of time, so as long as we provide
> > > the application a way to block fsfreeze to avoid the deadlock then I think
> > > that's a reasonable solution.  Additionally it would allow us an avenue to
> > > gracefully handle errors.  If we race and see that the fs is already frozen well
> > > then we can go back to the HSM with an error saying he's out of luck, and he can
> > > return -EAGAIN or something through fanotify to unwind and try again later.
> > >
> > 
> > Actually, "fs is already frozen" is not a deadlock case.
> > If "fs is already frozen" then fsfreeze was successful and HSM should just
> > wait in line like everyone else until fs is unfrozen.
> > 
> > The deadlock case is "fs is being frozen" (i.e. sb->s_writers.frozen is
> > in state SB_FREEZE_WRITE), which cannot make progress because
> > an existing holder of sb write is blocked on an HSM event, which in turn
> > is trying to start a new sb write.
> 
> Right, and now I'm confused.  You have your patchset to re-order the permission
> checks to before the sb_start_write(), so an HSM watching FAN_OPEN_PERM is no
> longer holding the sb write lock and thus can't deadlock, correct?
> 
> The new things you are proposing (FAN_PRE_ACCESS and FAN_PRE_MODIFY) also do not
> happen inside of an sb_start_write(), correct?
> 
> So where is the deadlock you're trying to fix?  The one you describe in this
> thread is what the patchset I reviewed last week was fixing, so in my eyes it
> looks like we're good?  It seems you're worried about the HSM app getting stuck
> on an fsfreeze when it's trying to populate the content, but that's not actually
> deadlocked, it just has to wait for the fs to be unfrozen, the fsfreeze
> operation will be able to complete and then thaw will be able to happen because
> there's no nested sb_write with the new flags, and with your patchset there's no
> sb_write with FAN_OPEN_PERM.
> 
> Sorry I hate it when people come in the middle of a conversation and I have to
> re-explain myself, so feel free to ignore me.  But I've read the whole thread a
> few times and I can't quite figure out what this new deadlock is you're worried
> about.  Thanks,

Actually, I'd appreciate that context too, as I've been looking at
this from the angle of avoiding a deadlock on fsfreeze as well.


* Re: fanotify HSM open issues
  2023-11-28 14:55                                     ` Josef Bacik
  2023-11-28 15:13                                       ` Christian Brauner
@ 2023-11-28 16:52                                       ` Amir Goldstein
  2023-11-28 21:42                                         ` Josef Bacik
  1 sibling, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2023-11-28 16:52 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Tue, Nov 28, 2023 at 4:55 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Nov 28, 2023 at 01:05:50PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 27, 2023 at 9:11 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > Hi Amir,
> > > > >
> > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > > discussion during the Plumbers conference :)
> > > > >
> > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > Recap for new people joining this thread.
> > > > > > > > >
> > > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > > >
> > > > > > > > > P1                             P2                      P3
> > > > > > > > > -----------                    ------------            ------------
> > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > > >       -> security_file_permission()
> > > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > > >
> > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > > with cleaner/safer semantics.
> > > > > > > > >
> > > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > > >
> > > > > > > > Yep, nice summary.
> > > > > ...
> > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > > >
> > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > > >
> > > > > > > Yes, you are right.
> > > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > > >
> > > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > > held somehow indirectly.
> > > > > > >
> > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > > fs1 freeze protection is held and:
> > > > > > >   ovl_splice_read(ovl1.file)
> > > > > > >     ovl_real_fdget()
> > > > > > >       ovl_open_realfile(fs1.file)
> > > > > > >          ... security_file_open(fs1.file)
> > > > > > >
> > > > > > > > That being
> > > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > > out explicitly so that it's a conscious decision.
> > > > > > > >
> > > > > >
> > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > > not enough for HSM needs.
> > > > > >
> > > > > > The reason is that often, when HSM needs to handle filling content
> > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > > for any operations on the filesystem, not just pwrite().
> > > > > >
> > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > > from the lookup event and uses it in the handling of access events to
> > > > > > update the metadata files that store which parts of the file were already
> > > > > > > filled (relying on fiemap is not always a valid option).
> > > > > >
> > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > > >
> > > > > > Another use case is that HSM may want to download content to a
> > > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > > then clone the data into the accessed file range.
> > > > > >
> > > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > > >
> > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > > note that if we want to really properly handle all possible operations, we
> > > > > need to start handling error from all sb_start_write() and
> > > > > file_start_write() calls and there are quite a few of those.
> > > > >
> > > >
> > > > Darn, forgot about those.
> > > > I am starting to reconsider adding a freeze level.
> > > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > > Maybe fs anti-freeze (see below).
> > > >
> > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > > if the requirement from permission event handler is that it must use an
> > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > > >
> > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > > a model that can solve the deadlock correctly.
> > > > > > > >
> > > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > needs to do write to some other location besides the one fd it got passed
> > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > > DB file containing current state or something like that.
> > > > > > > >
> > > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > > solution but I'm not sure if practical for applications.
> > > > > > > >
> > > > > > >
> > > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > > context and enforcing this context on fanotify_read().
> > > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > > and to activate the desired freeze protection behavior
> > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > > >
> > > > > >
> > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > > >
> > > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > > - We could add user API to set this personality explicitly to any task
> > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > > >
> > > > > > Please let me know if you agree with this design and if so,
> > > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > > in your opinion?
> > > > >
> > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > > for processing the fanotify events and filling in filesystem contents. I
> > > > > don't think automatic setting of this flag is desirable though as it has
> > > > > quite wide impact and some of the consequences could be surprising.  I
> > > > > rather think it should be a conscious decision when setting up the process
> > > > > processing the events. So I think API to explicitly set / clear the flag
> > > > > would be the best. Also I think it would be better to capture in the name
> > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > > something like that?
> > > > >
> > > >
> > > > Sure.
> > > >
> > > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > > flag) in the past. That would allow finer granularity control of the
> > > > > behavior but I guess you are worried that it would not cover all the needed
> > > > > operations?
> > > > >
> > > >
> > > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > > programs with.
> > > >
> > > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > > >
> > > > > No, I think if someone cares about these, they should explicitly set the
> > > > > PF flag in their task processing the events.
> > > > >
> > > >
> > > > OK.
> > > >
> > > > I see an exit hatch in this statement -
> > > > > If we are going to leave the responsibility to avoid deadlock in corner
> > > > cases completely in the hands of the application, then I do not feel
> > > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > > providing the first HSM API.
> > > >
> > > > If the HSM application is running in a controlled system, on a filesystem
> > > > where fsfreeze is not expected or not needed, then a fully functional and
> > > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > > >
> > > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > > easier option for the first version of HSM API?
> > > >
> > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > > apart from the HSM itself using this fd.
> > > >
> > > > > HSM itself can avoid deadlocks if it coordinates the fs freezes with
> > > > making fs modifications from within HSM events.
> > > >
> > > > > Do you think that may be an acceptable way out of the corner?
> > >
> > > This is kind of a corner case that I think is acceptable to just leave up to
> > > application developers.  Speaking as a potential consumer of this work we don't
> > > > use fsfreeze so aren't concerned with this in practice, and arguably if you're
> > > using this interface you know what you're doing.  As long as the sharp edge is
> > > well documented I think that's fine for v1.
> > >
> >
> > I agree that this is good enough for v1.
> > The only question is can we (and should we) do better than good enough for v1.
> >
> > > Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
> > > about the xfs scrubbing use case.  We all know that "freeze this file system" is
> > > an operation that is going to take X amount of time, so as long as we provide
> > > the application a way to block fsfreeze to avoid the deadlock then I think
> > > that's a reasonable solution.  Additionally it would allow us an avenue to
> > > gracefully handle errors.  If we race and see that the fs is already frozen well
> > > then we can go back to the HSM with an error saying he's out of luck, and he can
> > > return -EAGAIN or something through fanotify to unwind and try again later.
> > >
> >
> > Actually, "fs is already frozen" is not a deadlock case.
> > If "fs is already frozen" then fsfreeze was successful and HSM should just
> > wait in line like everyone else until fs is unfrozen.
> >
> > The deadlock case is "fs is being frozen" (i.e. sb->s_writers.frozen is
> > in state SB_FREEZE_WRITE), which cannot make progress because
> > an existing holder of sb write is blocked on an HSM event, which in turn
> > is trying to start a new sb write.
>
> Right, and now I'm confused.  You have your patchset to re-order the permission
> checks to before the sb_start_write(), so an HSM watching FAN_OPEN_PERM is no
> longer holding the sb write lock and thus can't deadlock, correct?

Correct.

>
> The new things you are proposing (FAN_PRE_ACCESS and FAN_PRE_MODIFY) also do not
> happen inside of an sb_start_write(), correct?
>

Almost correct.

The callers of the security_file_permission() hook do not hold sb_start_write()
*directly*, but it can be held *indirectly* in splice(file_in_fs1, file_in_fs2).
That is the corner case I was trying to explain.

When fs1 (the splice source fs) is a loop mounted fs and the loop image file
is on fs2 (a.k.a. the "host" fs), which also happens to be the splice dest fs,
splice grabs sb_start_write() on fs2.

After the patches in vfs.rw, splice() no longer calls security_file_permission()
directly on the file in the loop mounted fs1, but the reads from loopdev
translate to reads on the image file, which can call security_file_permission()
on the loop image file on the "host" fs (fs2), while sb_start_write() is held.

IOW, if HSM needs to fill the content of the loop image file and an fsfreeze of
the "host" fs, which is the destination of the splice, gets in the middle, there is
a chance for a deadlock, because the freeze will never make progress and
HSM filling of the loop image file is blocked.

Yes, it is a corner case, but it exists and a similar one exists with a splice
from an overlayfs file into a file on a "host" fs, which also happens to be the
lower layer of overlayfs (I have a test case that triggered this).
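
To make the wait cycle concrete, here is a toy model of it as a wait-for
graph. This is only a sketch, not kernel code: the task labels ("splice",
"hsm", "freezer") and the helper names build_waits() and find_cycle() are
made up for illustration, and the hsm_uses_trylock switch just stands in for
the "do not block on freeze protection" idea discussed earlier in the thread.

```python
# Toy wait-for graph for the loop/overlayfs corner case.
# All names here are illustrative labels, not kernel identifiers.

def build_waits(splice_holds_sb_write, freeze_in_progress,
                hsm_uses_trylock=False):
    """Return {task: task_it_waits_for} for the three actors."""
    waits = {}
    # The splice task is blocked until the HSM answers the permission
    # event raised for the read of the image file on the "host" fs.
    waits["splice"] = "hsm"
    if freeze_in_progress:
        # A new writer blocks behind the freezer, unless it only
        # trylocks freeze protection and bails out with an error.
        if not hsm_uses_trylock:
            waits["hsm"] = "freezer"
        # The freezer waits for existing sb write holders to drain.
        if splice_holds_sb_write:
            waits["freezer"] = "splice"
    return waits


def find_cycle(waits_for):
    """Return the tasks on a wait cycle, or None if there is none."""
    for start in waits_for:
        path = [start]
        seen = {start}
        node = waits_for.get(start)
        while node is not None:
            if node == start:
                return path
            if node in seen:
                break
            seen.add(node)
            path.append(node)
            node = waits_for.get(node)
    return None


if __name__ == "__main__":
    # Indirectly held sb write + freeze in progress: nobody can proceed.
    print(find_cycle(build_waits(True, True)))
    # Same situation, but the HSM does not block on freeze protection.
    print(find_cycle(build_waits(True, True, hsm_uses_trylock=True)))
```

In this model the cycle only appears when all three conditions hold at once:
sb write is held (directly or indirectly) around the permission event, a
freeze is in progress, and the HSM handler blocks on starting a new sb write.
Removing any one of the three edges breaks the cycle.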

> So where is the deadlock you're trying to fix?  The one you describe in this
> thread is what the patchset I reviewed last week was fixing, so in my eyes it
> looks like we're good?  It seems you're worried about the HSM app getting stuck
> on an fsfreeze when it's trying to populate the content, but that's not actually
> deadlocked, it just has to wait for the fs to be unfrozen, the fsfreeze
> operation will be able to complete and then thaw will be able to happen because
> there's no nested sb_write with the new flags, and with your patchset there's no
> sb_write with FAN_OPEN_PERM.
>
> Sorry I hate it when people come in the middle of a conversation and I have to
> re-explain myself, so feel free to ignore me.  But I've read the whole thread a
> few times and I can't quite figure out what this new deadlock is you're worried
> about

Quite the contrary, I never managed to explain the remaining deadlock
to the wider crowd (except for Jan), so I am glad that you and Christian
are taking an interest.

Thanks,
Amir.


* Re: fanotify HSM open issues
  2023-11-28 16:52                                       ` Amir Goldstein
@ 2023-11-28 21:42                                         ` Josef Bacik
  2023-11-29  5:22                                           ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2023-11-28 21:42 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Tue, Nov 28, 2023 at 06:52:00PM +0200, Amir Goldstein wrote:
> On Tue, Nov 28, 2023 at 4:55 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Tue, Nov 28, 2023 at 01:05:50PM +0200, Amir Goldstein wrote:
> > > On Mon, Nov 27, 2023 at 9:11 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > >
> > > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > Hi Amir,
> > > > > >
> > > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > > > discussion during the Plumbers conference :)
> > > > > >
> > > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > > Recap for new people joining this thread.
> > > > > > > > > >
> > > > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > > > >
> > > > > > > > > > P1                             P2                      P3
> > > > > > > > > > -----------                    ------------            ------------
> > > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > > > >       -> security_file_permission()
> > > > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > > > >
> > > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > > > with cleaner/safer semantics.
> > > > > > > > > >
> > > > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > > > >
> > > > > > > > > Yep, nice summary.
> > > > > > ...
> > > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > > > >
> > > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > > > >
> > > > > > > > Yes, you are right.
> > > > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > > > >
> > > > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > > > held somehow indirectly.
> > > > > > > >
> > > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > > > fs1 freeze protection is held and:
> > > > > > > >   ovl_splice_read(ovl1.file)
> > > > > > > >     ovl_real_fdget()
> > > > > > > >       ovl_open_realfile(fs1.file)
> > > > > > > >          ... security_file_open(fs1.file)
> > > > > > > >
> > > > > > > > > That being
> > > > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > > > out explicitly so that it's a conscious decision.
> > > > > > > > >
> > > > > > >
> > > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > > > not enough for HSM needs.
> > > > > > >
> > > > > > > The reason is that often, when HSM needs to handle filling content
> > > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > > > for any operations on the filesystem, not just pwrite().
> > > > > > >
> > > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > > > from the lookup event and uses it in the handling of access events to
> > > > > > > update the metadata files that store which parts of the file were already
> > > > > > > filled (relying on fiemap is not always a valid option).
> > > > > > >
> > > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > > > >
> > > > > > > Another use case is that HSM may want to download content to a
> > > > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > > > then clone the data into the accessed file range.
> > > > > > >
> > > > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > > > >
> > > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > > > note that if we want to really properly handle all possible operations, we
> > > > > > need to start handling error from all sb_start_write() and
> > > > > > file_start_write() calls and there are quite a few of those.
> > > > > >
> > > > >
> > > > > Darn, forgot about those.
> > > > > I am starting to reconsider adding a freeze level.
> > > > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > > Maybe fs anti-freeze (see below).
> > > > >
> > > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > > > if the requirement from permission event handler is that it must use an
> > > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > > > >
> > > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > > > a model that can solve the deadlock correctly.
> > > > > > > > >
> > > > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > > needs to do write to some other location besides the one fd it got passed
> > > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > > > DB file containing current state or something like that.
> > > > > > > > >
> > > > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > > > solution but I'm not sure if practical for applications.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > > > context and enforcing this context on fanotify_read().
> > > > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > > > and to activate the desired freeze protection behavior
> > > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > > > >
> > > > > > >
> > > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > > > >
> > > > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > > > - We could add user API to set this personality explicitly to any task
> > > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > > > >
> > > > > > > Please let me know if you agree with this design and if so,
> > > > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > > > in your opinion?
> > > > > >
> > > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > > > for processing the fanotify events and filling in filesystem contents. I
> > > > > > don't think automatic setting of this flag is desirable though as it has
> > > > > > quite wide impact and some of the consequences could be surprising.  I
> > > > > > rather think it should be a conscious decision when setting up the process
> > > > > > processing the events. So I think API to explicitly set / clear the flag
> > > > > > would be the best. Also I think it would be better to capture in the name
> > > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > > > something like that?
> > > > > >
> > > > >
> > > > > Sure.
> > > > >
> > > > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > > > flag) in the past. That would allow finer granularity control of the
> > > > > > behavior but I guess you are worried that it would not cover all the needed
> > > > > > operations?
> > > > > >
> > > > >
> > > > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > > > programs with.
> > > > >
> > > > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > > > >
> > > > > > No, I think if someone cares about these, they should explicitly set the
> > > > > > PF flag in their task processing the events.
> > > > > >
> > > > >
> > > > > OK.
> > > > >
> > > > > I see an exit hatch in this statement -
> > > > > If we are going to leave the responsibility to avoid deadlock in corner
> > > > > cases completely in the hands of the application, then I do not feel
> > > > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > > > providing the first HSM API.
> > > > >
> > > > > If the HSM application is running in a controlled system, on a filesystem
> > > > > where fsfreeze is not expected or not needed, then a fully functional and
> > > > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > > > >
> > > > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > > > easier option for the first version of HSM API?
> > > > >
> > > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > > > apart from the HSM itself using this fd.
> > > > >
> > > > > HSM itself can avoid deadlocks if it collaborates the fs freezes with
> > > > > making fs modifications from within HSM events.
> > > > >
> > > > > Do you think that may be an acceptable way out of the corner?
> > > >
> > > > This is kind of a corner case that I think is acceptable to just leave up to
> > > > application developers.  Speaking as a potential consumer of this work we don't
> > > > use fsfreeze so aren't concerned with this in practice, and arguably if you're
> > > > using this interface you know what you're doing.  As long as the sharp edge is
> > > > well documented I think that's fine for v1.
> > > >
> > >
> > > I agree that this is good enough for v1.
> > > The only question is can we (and should we) do better than good enough for v1.
> > >
> > > > Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
> > > > about the xfs scrubbing use case.  We all know that "freeze this file system" is
> > > > an operation that is going to take X amount of time, so as long as we provide
> > > > the application a way to block fsfreeze to avoid the deadlock then I think
> > > > that's a reasonable solution.  Additionally it would allow us an avenue to
> > > > gracefully handle errors.  If we race and see that the fs is already frozen well
> > > > then we can go back to the HSM with an error saying he's out of luck, and he can
> > > > return -EAGAIN or something through fanotify to unwind and try again later.
> > > >
> > >
> > > Actually, "fs is already frozen" is not a deadlock case.
> > > If "fs is already frozen" then fsfreeze was successful and HSM should just
> > > wait in line like everyone else until fs is unfrozen.
> > >
> > > The deadlock case is "fs is being frozen" (i.e. sb->s_writers.frozen is
> > > in state SB_FREEZE_WRITE), which cannot make progress because
> > > an existing holder of sb write is blocked on an HSM event, which in turn
> > > is trying to start a new sb write.
> >
> > Right, and now I'm confused.  You have your patchset to re-order the permission
> > checks to before the sb_start_write(), so an HSM watching FAN_OPEN_PERM is no
> > longer holding the sb write lock and thus can't deadlock, correct?
> 
> Correct.
> 
> >
> > The new things you are proposing (FAN_PRE_ACCESS and FAN_PRE_MODIFY) also do not
> > happen inside of an sb_start_write(), correct?
> >
> 
> Almost correct.
> 
> The callers of the security_file_permission() hook do not hold sb_start_write()
> *directly*, but it can be held *indirectly* in splice(file_in_fs1, file_in_fs2).
> That is the corner case I was trying to explain.
> 
> When fs1 (splice source fs) is a loop mounted fs and the loop image file
> is on fs2 (a.k.a. the "host" fs), which also happens to be the splice dest fs,
> splice grabs sb_start_write() on fs2.
> 
> After the patches in vfs.rw, splice() no longer calls security_file_permission()
> directly on the file in the loop mounted fs1, but the reads from loopdev
> translate to reads on the image file, which can call security_file_permission()
> on the loop image file on the "host" fs (fs2), while sb_start_write() is held.
> 
> IOW, if HSM needs to fill the content on the loop image file and fsfreeze on
> the "host" fs that is the destination of splice, gets in the middle, there is
> a chance for a deadlock, because freeze will never make progress and
> HSM filling of the loop image file is blocked.
> 
> Yes, it is a corner case, but it exists and a similar one exists with a splice
> from an overlayfs file into a file on a "host" fs, which also happens to be the
> lower layer of overlayfs (I have a test case that triggered this).
> 

I had to still draw this on my whiteboard to make sure I understood it properly,
so I'm going to draw it here to make sure I did actually understand it, because
it is indeed quite complex if I'm understanding you correctly.

We have the following

File A on FS 1 which is a loopback device backed by File B on FS 2
File B on FS 2 which is a normal file

We have an HSM watching FS1 to populate files.

sendfile(A, B);

This does

file_start_write(FS2);

Then we start to read from A to populate the page, this triggers the HSM, which
then wants to write to FS1.

At this point some other process calls fsfreeze(FS2), and now we're deadlocked,
because the HSM is stuck at sb_start_write(FS2) trying to write to the FS1 which
is backed by FS2, but we're already holding file_start_write(FS2) because of
splice.

Is this correct?

If it is, I think the best thing to do is actually push the file_start_write()
deeper into the splice work.  Do something like the patch I've applied below,
which is wildly untested and uncompiled.  However I think this closes this
deadlock in a nice clean way, because we're reading and then writing, and we
don't have to worry about any shenanigans under the read path because we only
hold the sb_start_write() when we do the actual write part.  Does that make
sense?  Thanks,

Josef
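
As a rough illustration of why narrowing the protected region helps, the
two orderings can be compared in a toy user-space model. This is a sketch
under stated assumptions only: struct toy_sb and the toy_* helpers are
hypothetical stand-ins for the kernel's freeze protection, the tasks are
replayed sequentially, and a "stuck" result simply means that no task in
the model can make progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-superblock freeze state. */
enum toy_state { TOY_UNFROZEN, TOY_FREEZING, TOY_FROZEN };

struct toy_sb {
	int writers;		/* tasks holding write protection */
	enum toy_state state;
};

static bool toy_start_write_trylock(struct toy_sb *sb)
{
	if (sb->state != TOY_UNFROZEN)
		return false;
	sb->writers++;
	return true;
}

static void toy_end_write(struct toy_sb *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/* Freeze completes only if no writer is active; true on completion. */
static bool toy_freeze(struct toy_sb *sb)
{
	sb->state = TOY_FREEZING;
	if (sb->writers > 0)
		return false;
	sb->state = TOY_FROZEN;
	return true;
}

static void toy_thaw(struct toy_sb *sb)
{
	assert(sb->state == TOY_FROZEN);
	sb->state = TOY_UNFROZEN;
}

/*
 * Replays a splice into fs2 while a freeze of fs2 races with the read
 * step and the HSM needs to write to fs2 to satisfy that read.
 * Returns true if the splice runs to completion, false if it is stuck.
 */
static bool toy_splice_completes(bool protect_whole_splice)
{
	struct toy_sb fs2 = { 0, TOY_UNFROZEN };

	/* Old ordering: write protection is taken before the read step. */
	if (protect_whole_splice && !toy_start_write_trylock(&fs2))
		return false;

	/* Read step: the freeze arrives while the HSM event is pending. */
	if (!toy_freeze(&fs2))
		return false;	/* freezer, splice and HSM wait on each other */

	/* The freeze completed, so it can be thawed and the HSM can write. */
	toy_thaw(&fs2);
	if (!toy_start_write_trylock(&fs2))
		return false;
	toy_end_write(&fs2);

	/* Write step: new ordering takes protection only around the write. */
	if (!protect_whole_splice && !toy_start_write_trylock(&fs2))
		return false;
	toy_end_write(&fs2);
	return true;
}
```

With protect_whole_splice set, the model gets stuck at the freeze; with
it cleared, the freeze completes, the HSM merely waits for the thaw, and
the splice finishes.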

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 4382881b0709..f37bb41551fe 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -230,6 +230,19 @@ static int ovl_copy_fileattr(struct inode *inode, const struct path *old,
 	return ovl_real_fileattr_set(new, &newfa);
 }
 
+static int ovl_splice_actor(struct pipe_inode_info *pipe,
+			    struct splice_desc *sd)
+{
+	struct file *file = sd->u.file;
+	long ret;
+
+	ovl_start_write(file_dentry(file));
+	ret = vfs_do_splice_from(pipe, file, sd->opos, sd->total_len,
+				 sd->flags);
+	ovl_end_write(file_dentry(file));
+	return ret;
+}
+
 static int ovl_copy_up_file(struct ovl_fs *ofs, struct dentry *dentry,
 			    struct file *new_file, loff_t len)
 {
@@ -309,6 +322,8 @@ static int ovl_copy_up_file(struct ovl_fs *ofs, struct dentry *dentry,
 			}
 		}
 
+		do_splice_direct(old_file, &old_pos, new_file, &new_pos,
+				 this_len, SPLICE_F_MOVE, ovl_splice_actor);
 		ovl_start_write(dentry);
 		bytes = do_splice_direct(old_file, &old_pos,
 					 new_file, &new_pos,
diff --git a/fs/read_write.c b/fs/read_write.c
index 4771701c896b..797ef9e2ecf5 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1250,10 +1250,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 		retval = rw_verify_area(WRITE, out.file, &out_pos, count);
 		if (retval < 0)
 			goto fput_out;
-		file_start_write(out.file);
 		retval = do_splice_direct(in.file, &pos, out.file, &out_pos,
 					  count, fl);
-		file_end_write(out.file);
 	} else {
 		if (out.file->f_flags & O_NONBLOCK)
 			fl |= SPLICE_F_NONBLOCK;
diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..85a4ed0ad06c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -925,13 +925,14 @@ static int warn_unsupported(struct file *file, const char *op)
 /*
  * Attempt to initiate a splice from pipe to file.
  */
-static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
-			   loff_t *ppos, size_t len, unsigned int flags)
+long vfs_do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+			loff_t *ppos, size_t len, unsigned int flags)
 {
 	if (unlikely(!out->f_op->splice_write))
 		return warn_unsupported(out, "write");
 	return out->f_op->splice_write(pipe, out, ppos, len, flags);
 }
+EXPORT_SYMBOL(vfs_do_splice_from);
 
 /*
  * Indicate to the caller that there was a premature EOF when reading from the
@@ -1138,9 +1139,13 @@ static int direct_splice_actor(struct pipe_inode_info *pipe,
 			       struct splice_desc *sd)
 {
 	struct file *file = sd->u.file;
+	long ret;
 
-	return do_splice_from(pipe, file, sd->opos, sd->total_len,
-			      sd->flags);
+	file_start_write(file);
+	ret = vfs_do_splice_from(pipe, file, sd->opos, sd->total_len,
+				 sd->flags);
+	file_end_write(file);
+	return ret;
 }
 
 static void direct_file_splice_eof(struct splice_desc *sd)
@@ -1152,13 +1157,14 @@ static void direct_file_splice_eof(struct splice_desc *sd)
 }
 
 /**
- * do_splice_direct - splices data directly between two files
+ * do_splice_direct_actor - splices data directly between two files
  * @in:		file to splice from
  * @ppos:	input file offset
  * @out:	file to splice to
  * @opos:	output file offset
  * @len:	number of bytes to splice
  * @flags:	splice modifier flags
+ * @actor:	the actor to use for the splice
  *
  * Description:
  *    For use by do_sendfile(). splice can easily emulate sendfile, but
@@ -1167,8 +1173,9 @@ static void direct_file_splice_eof(struct splice_desc *sd)
  *    can splice directly through a process-private pipe.
  *
  */
-long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
-		      loff_t *opos, size_t len, unsigned int flags)
+long do_splice_direct_actor(struct file *in, loff_t *ppos, struct file *out,
+			    loff_t *opos, size_t len, unsigned int flags,
+			    splice_direct_actor *actor)
 {
 	struct splice_desc sd = {
 		.len		= len,
@@ -1191,12 +1198,36 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 	if (unlikely(ret < 0))
 		return ret;
 
-	ret = splice_direct_to_actor(in, &sd, direct_splice_actor);
+	ret = splice_direct_to_actor(in, &sd, actor);
 	if (ret > 0)
 		*ppos = sd.pos;
 
 	return ret;
 }
+EXPORT_SYMBOL(do_splice_direct_actor);
+
+/**
+ * do_splice_direct - splices data directly between two files
+ * @in:		file to splice from
+ * @ppos:	input file offset
+ * @out:	file to splice to
+ * @opos:	output file offset
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags
+ *
+ * Description:
+ *    For use by do_sendfile(). splice can easily emulate sendfile, but
+ *    doing it in the application would incur an extra system call
+ *    (splice in + splice out, as compared to just sendfile()). So this helper
+ *    can splice directly through a process-private pipe.
+ *
+ */
+long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
+		      loff_t *opos, size_t len, unsigned int flags)
+{
+	return do_splice_direct_actor(in, ppos, out, opos, len, flags,
+				      direct_splice_actor);
+}
 EXPORT_SYMBOL(do_splice_direct);
 
 static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
@@ -1289,7 +1320,7 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 			flags |= SPLICE_F_NONBLOCK;
 
 		file_start_write(out);
-		ret = do_splice_from(ipipe, out, &offset, len, flags);
+		ret = vfs_do_splice_from(ipipe, out, &offset, len, flags);
 		file_end_write(out);
 
 		if (!off_out)
@@ -1323,7 +1354,7 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 	if (ret > 0) {
 		/*
 		 * Generate modify out before access in:
-		 * do_splice_from() may've already sent modify out,
+		 * vfs_do_splice_from() may've already sent modify out,
 		 * and this ensures the events get merged.
 		 */
 		fsnotify_modify(out);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 6c461573434d..8583b31135fa 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -85,11 +85,16 @@ extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 extern long do_splice(struct file *in, loff_t *off_in,
 		      struct file *out, loff_t *off_out,
 		      size_t len, unsigned int flags);
+long do_splice_direct_actor(struct file *in, loff_t *ppos, struct file *out,
+			    loff_t *opos, size_t len, unsigned int flags,
+			    splice_direct_actor *actor);
 
 extern long do_tee(struct file *in, struct file *out, size_t len,
 		   unsigned int flags);
 extern ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
 				loff_t *ppos, size_t len, unsigned int flags);
+extern long vfs_do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+			       loff_t *ppos, size_t len, unsigned int flags);
 
 /*
  * for dynamic pipe sizing


* Re: fanotify HSM open issues
  2023-11-28 21:42                                         ` Josef Bacik
@ 2023-11-29  5:22                                           ` Amir Goldstein
  2023-11-29 14:44                                             ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2023-11-29  5:22 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Tue, Nov 28, 2023 at 11:43 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Nov 28, 2023 at 06:52:00PM +0200, Amir Goldstein wrote:
> > On Tue, Nov 28, 2023 at 4:55 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > On Tue, Nov 28, 2023 at 01:05:50PM +0200, Amir Goldstein wrote:
> > > > On Mon, Nov 27, 2023 at 9:11 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > > >
> > > > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > >
> > > > > > > Hi Amir,
> > > > > > >
> > > > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > > > > discussion during the Plumbers conference :)
> > > > > > >
> > > > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > > > Recap for new people joining this thread.
> > > > > > > > > > >
> > > > > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > > > > >
> > > > > > > > > > > P1                             P2                      P3
> > > > > > > > > > > -----------                    ------------            ------------
> > > > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > > > > >       -> security_file_permission()
> > > > > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > > > > >
> > > > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > > > > with cleaner/safer semantics.
> > > > > > > > > > >
> > > > > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > > > > >
> > > > > > > > > > Yep, nice summary.
> > > > > > > ...
> > > > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > > > > >
> > > > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > > > > >
> > > > > > > > > Yes, you are right.
> > > > > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > > > > >
> > > > > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > > > > held somehow indirectly.
> > > > > > > > >
> > > > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > > > > fs1 freeze protection is held and:
> > > > > > > > >   ovl_splice_read(ovl1.file)
> > > > > > > > >     ovl_real_fdget()
> > > > > > > > >       ovl_open_realfile(fs1.file)
> > > > > > > > >          ... security_file_open(fs1.file)
> > > > > > > > >
> > > > > > > > > > That being
> > > > > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > > > > out explicitly so that it's a conscious decision.
> > > > > > > > > >
> > > > > > > >
> > > > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > > > > not enough for HSM needs.
> > > > > > > >
> > > > > > > > The reason is that often, when HSM needs to handle filling content
> > > > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > > > > for any operations on the filesystem, not just pwrite().
> > > > > > > >
> > > > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > > > > from the lookup event and uses it in the handling of access events to
> > > > > > > > update the metadata files that store which parts of the file were already
> > > > > > > > > filled (relying on fiemap is not always a valid option).
> > > > > > > >
> > > > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > > > > >
> > > > > > > > Another use case is that HSM may want to download content to a
> > > > > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > > > > then clone the data into the accessed file range.
> > > > > > > >
> > > > > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > > > > >
> > > > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > > > > note that if we want to really properly handle all possible operations, we
> > > > > > > need to start handling error from all sb_start_write() and
> > > > > > > file_start_write() calls and there are quite a few of those.
> > > > > > >
> > > > > >
> > > > > > Darn, forgot about those.
> > > > > > I am starting to reconsider adding a freeze level.
> > > > > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > > > > Maybe fs anti-freeze (see below).
> > > > > >
> > > > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > > > > if the requirement from permission event handler is that it must use a
> > > > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > > > > >
> > > > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > > > > a model that can solve the deadlock correctly.
> > > > > > > > > >
> > > > > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > > > > needs to write to some other location besides the one fd it got passed
> > > > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > > > > DB file containing current state or something like that.
> > > > > > > > > >
> > > > > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > > > > solution but I'm not sure if practical for applications.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > > > > context and enforcing this context on fanotify_read().
> > > > > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > > > > and to activate the desired freeze protection behavior
> > > > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > > > > >
> > > > > > > >
> > > > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > > > > >
> > > > > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > > > > - We could add user API to set this personality explicitly to any task
> > > > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > > > > >
> > > > > > > > Please let me know if you agree with this design and if so,
> > > > > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > > > > in your opinion?
> > > > > > >
> > > > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > > > > for processing the fanotify events and filling in filesystem contents. I
> > > > > > > don't think automatic setting of this flag is desirable though as it has
> > > > > > > quite wide impact and some of the consequences could be surprising.  I
> > > > > > > rather think it should be a conscious decision when setting up the process
> > > > > > > processing the events. So I think API to explicitly set / clear the flag
> > > > > > > would be the best. Also I think it would be better to capture in the name
> > > > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > > > > something like that?
> > > > > > >
> > > > > >
> > > > > > Sure.
> > > > > >
> > > > > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > > > > flag) in the past. That would allow finer granularity control of the
> > > > > > > behavior but I guess you are worried that it would not cover all the needed
> > > > > > > operations?
> > > > > > >
> > > > > >
> > > > > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > > > > programs with.
> > > > > >
> > > > > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > > > > >
> > > > > > > No, I think if someone cares about these, they should explicitly set the
> > > > > > > PF flag in their task processing the events.
> > > > > > >
> > > > > >
> > > > > > OK.
> > > > > >
> > > > > > I see an exit hatch in this statement -
> > > > > > > If we are going to leave the responsibility to avoid deadlock in corner
> > > > > > cases completely in the hands of the application, then I do not feel
> > > > > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > > > > providing the first HSM API.
> > > > > >
> > > > > > If the HSM application is running in a controlled system, on a filesystem
> > > > > > where fsfreeze is not expected or not needed, then a fully functional and
> > > > > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > > > > >
> > > > > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > > > > easier option for the first version of HSM API?
> > > > > >
> > > > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > > > > apart from the HSM itself using this fd.
> > > > > >
> > > > > > HSM itself can avoid deadlocks if it collaborates the fs freezes with
> > > > > > making fs modifications from within HSM events.
> > > > > >
> > > > > > > Do you think that may be an acceptable way out of the corner?
> > > > >
> > > > > This is kind of a corner case that I think is acceptable to just leave up to
> > > > > application developers.  Speaking as a potential consumer of this work we don't
> > > > > > use fsfreeze so aren't concerned with this in practice, and arguably if you're
> > > > > using this interface you know what you're doing.  As long as the sharp edge is
> > > > > well documented I think that's fine for v1.
> > > > >
> > > >
> > > > I agree that this is good enough for v1.
> > > > The only question is can we (and should we) do better than good enough for v1.
> > > >
> > > > > Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
> > > > > about the xfs scrubbing use case.  We all know that "freeze this file system" is
> > > > > an operation that is going to take X amount of time, so as long as we provide
> > > > > the application a way to block fsfreeze to avoid the deadlock then I think
> > > > > that's a reasonable solution.  Additionally it would allow us an avenue to
> > > > > gracefully handle errors.  If we race and see that the fs is already frozen well
> > > > > then we can go back to the HSM with an error saying he's out of luck, and he can
> > > > > return -EAGAIN or something through fanotify to unwind and try again later.
> > > > >
> > > >
> > > > Actually, "fs is already frozen" is not a deadlock case.
> > > > If "fs is already frozen" then fsfreeze was successful and HSM should just
> > > > wait in line like everyone else until fs is unfrozen.
> > > >
> > > > The deadlock case is "fs is being frozen" (i.e. sb->s_writers.frozen is
> > > > in state SB_FREEZE_WRITE), which cannot make progress because
> > > > an existing holder of sb write is blocked on an HSM event, which in turn
> > > > is trying to start a new sb write.
> > >
> > > Right, and now I'm confused.  You have your patchset to re-order the permission
> > > checks to before the sb_start_write(), so an HSM watching FAN_OPEN_PERM is no
> > > longer holding the sb write lock and thus can't deadlock, correct?
> >
> > Correct.
> >
> > >
> > > The new things you are proposing (FAN_PRE_ACCESS and FAN_PRE_MODIFY) also do not
> > > happen inside of an sb_start_write(), correct?
> > >
> >
> > Almost correct.
> >
> > The callers of the security_file_permission() hook do not hold sb_start_write()
> > *directly*, but it can be held *indirectly* in splice(file_in_fs1, file_in_fs2).
> > That is the corner case I was trying to explain.
> >
> > When fs1 (splice source fs) is a loop mounted fs and the loop image file
> > is on fs2 (a.k.a the "host" fs), which also happens to be the splice dest fs,
> > splice grabs sb_start_write() on fs2.
> >
> > After the patches in vfs.rw, splice() no longer calls security_file_permission()
> > directly on the file in the loop mounted fs1, but the reads from loopdev
> > translate to reads on the image file, which can call security_file_permission()
> > on the loop image file on the "host" fs (fs2), while sb_start_write() is held.
> >
> > IOW, if HSM needs to fill the content on the loop image file and fsfreeze on
> > the "host" fs that is the destination of splice, gets in the middle, there is
> > a chance for a deadlock, because freeze will never make progress and
> > HSM filling of the loop image file is blocked.
> >
> > Yes, it is a corner case, but it exists and a similar one exists with a splice
> > from an overlayfs file into a file on a "host" fs, which also happens to be the
> > lower layer of overlayfs (I have a test case that triggered this).
> >
>
> I had to still draw this on my whiteboard to make sure I understood it properly,
> so I'm going to draw it here to make sure I did actually understand it, because
> it is indeed quite complex if I'm understanding you correctly.
>
> We have the following
>
> File A on FS 1 which is a loopback device backed by File B on FS 2

B is the normal file on FS2, so I guess you meant to say backed by file C

> File B on FS 2 which is a normal file
>
> We have an HSM watching FS1 to populate files.
>
> sendfile(A, B);
>
> This does
>
> file_start_write(FS2);
>
> Then we start to read from A to populate the page, this triggers the HSM, which
> then wants to write to FS1.
>
> At this point some other process calls fsfreeze(FS2), and now we're deadlocked,
> because the HSM is stuck at sb_start_write(FS2) trying to write to the FS1 which
> is backed by FS2, but we're already holding file_start_write(FS2) because of
> splice.
>
> Is this correct?

Yes, this is correct.
I was describing a different variant of deadlock when FS2 is watched by HSM
and HSM wants to write to the image file C upon reading from file A.

There are many variants of this, but the root cause is operating on file A
while holding sb_start_write() on file B on another fs.
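
To make the cycle concrete, here is a minimal userspace sketch. It is not
kernel code: toy_sb, toy_start_write_trylock(), toy_freeze_begin() and
toy_freeze_can_progress() are hypothetical names that only mimic the
"fs is being frozen" stage, under the assumption that a freezer first stops
new writers and then waits for the existing ones to drain.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one superblock's write-level freeze protection.
 * Every name here is hypothetical and exists only for illustration. */
struct toy_sb {
	int writers;		/* tasks currently between start and end of write */
	bool freeze_started;	/* freezer has begun blocking new writers */
};

/* Non-blocking start of write: fails once a freeze has started. */
bool toy_start_write_trylock(struct toy_sb *sb)
{
	if (sb->freeze_started)
		return false;
	sb->writers++;
	return true;
}

void toy_end_write(struct toy_sb *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/* The freezer stops new writers first ... */
void toy_freeze_begin(struct toy_sb *sb)
{
	sb->freeze_started = true;
}

/* ... and can only complete once the existing writers are gone. */
bool toy_freeze_can_progress(const struct toy_sb *sb)
{
	return sb->freeze_started && sb->writers == 0;
}

/*
 * Returns true when the three-way cycle is present: the splicing task
 * holds write protection and waits for the HSM, the HSM cannot start a
 * new write, and the freezer cannot finish because of the first writer.
 */
bool toy_cycle_present(void)
{
	struct toy_sb fs2 = { 0, false };
	bool hsm_can_write, freezer_done;

	toy_start_write_trylock(&fs2);	/* splice: protection on the dest fs */
	toy_freeze_begin(&fs2);		/* fsfreeze races in */
	hsm_can_write = toy_start_write_trylock(&fs2);	/* HSM event handler */
	freezer_done = toy_freeze_can_progress(&fs2);
	return !hsm_can_write && !freezer_done;
}
```

In this model the freezer makes progress as soon as the first writer calls
toy_end_write(), which is exactly what cannot happen while that writer is
blocked on the HSM event.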

>
> If it is, I think the best thing to do is actually push the file_start_write()
> deeper into the splice work.  Do something like the patch I've applied below,
> which is wildly untested and uncompiled.  However I think this closes this
> deadlock in a nice clean way, because we're reading and then writing, and we
> don't have to worry about any shenanigans under the read path because we only
> hold the sb_start_write() when we do the actual write part.  Does that make
> sense?

That makes a lot of sense!

I think this is the correct way out of the deadlock corner case.
I will amend the patch and test it.
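
The ordering change can be sketched in userspace as follows. This is a toy
model, not the splice code: model_splice_outer() stands for the caller taking
write protection around the whole read+write, model_splice_inner() for taking
it only around the write step, and model_read_step() stands in for a read
that may fire a permission hook. All of these names are made up for
illustration.

```c
#include <assert.h>

struct model_ctx {
	int write_depth;	/* >0 while "sb write protection" is held */
	int hooks_under_write;	/* read-side hooks fired with protection held */
	int bytes_written;
};

static void model_start_write(struct model_ctx *c)
{
	c->write_depth++;
}

static void model_end_write(struct model_ctx *c)
{
	assert(c->write_depth > 0);
	c->write_depth--;
}

/* Read step: records whether it ran with write protection held. */
static int model_read_step(struct model_ctx *c, int len)
{
	if (c->write_depth > 0)
		c->hooks_under_write++;
	return len;
}

/* Write step with no locking of its own. */
static void model_write_step(struct model_ctx *c, int len)
{
	c->bytes_written += len;
}

/* Protection taken by the caller around the whole splice. */
static void model_splice_outer(struct model_ctx *c, int len)
{
	int got;

	model_start_write(c);
	got = model_read_step(c, len);
	model_write_step(c, got);
	model_end_write(c);
}

/* Protection pushed down: held only around the write step. */
static void model_splice_inner(struct model_ctx *c, int len)
{
	int got = model_read_step(c, len);

	model_start_write(c);
	model_write_step(c, got);
	model_end_write(c);
}
```

With the "inner" variant the read side never runs under write protection, so
whatever the read path ends up calling cannot nest inside it.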

Thanks for getting me out of tunnel vision ;)

Some comments for myself below...

>
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 4382881b0709..f37bb41551fe 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -230,6 +230,19 @@ static int ovl_copy_fileattr(struct inode *inode, const struct path *old,
>         return ovl_real_fileattr_set(new, &newfa);
>  }
>
> +static int ovl_splice_actor(struct pipe_inode_info *pipe,
> +                           struct splice_desc *sd)
> +{
> +       struct file *file = sd->u.file;
> +       long ret;
> +
> +       ovl_start_write(file_dentry(file));
> +       ret = vfs_do_splice_from(pipe, file, sd->opos, sd->total_len,
> +                                sd->flags);
> +       ovl_end_write(file_dentry(file));
> +       return ret;
> +}
> +
>  static int ovl_copy_up_file(struct ovl_fs *ofs, struct dentry *dentry,
>                             struct file *new_file, loff_t len)
>  {
> @@ -309,6 +322,8 @@ static int ovl_copy_up_file(struct ovl_fs *ofs, struct dentry *dentry,
>                         }
>                 }
>
> +               do_splice_direct(old_file, &old_pos, new_file, &new_pos,
> +                                this_len, SPLICE_F_MOVE, ovl_splice_actor);
>                 ovl_start_write(dentry);
>                 bytes = do_splice_direct(old_file, &old_pos,
>                                          new_file, &new_pos,

Remove this..

> diff --git a/fs/read_write.c b/fs/read_write.c
> index 4771701c896b..797ef9e2ecf5 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1250,10 +1250,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>                 retval = rw_verify_area(WRITE, out.file, &out_pos, count);
>                 if (retval < 0)
>                         goto fput_out;
> -               file_start_write(out.file);
>                 retval = do_splice_direct(in.file, &pos, out.file, &out_pos,
>                                           count, fl);
> -               file_end_write(out.file);
>         } else {
>                 if (out.file->f_flags & O_NONBLOCK)
>                         fl |= SPLICE_F_NONBLOCK;
> diff --git a/fs/splice.c b/fs/splice.c
> index d983d375ff11..85a4ed0ad06c 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -925,13 +925,14 @@ static int warn_unsupported(struct file *file, const char *op)
>  /*
>   * Attempt to initiate a splice from pipe to file.
>   */
> -static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
> -                          loff_t *ppos, size_t len, unsigned int flags)
> +long vfs_do_splice_from(struct pipe_inode_info *pipe, struct file *out,
> +                       loff_t *ppos, size_t len, unsigned int flags)
>  {
>         if (unlikely(!out->f_op->splice_write))
>                 return warn_unsupported(out, "write");
>         return out->f_op->splice_write(pipe, out, ppos, len, flags);
>  }
> +EXPORT_SYMBOL(vfs_do_splice_from);

My cleanup was trying to distinguish between vfs_XXX helpers
that call permission hooks and take sb_write and do_XXX helpers
that do the rest.

It's true that exporting do_XXX helpers is not nice, but for me,
vfs_do_XXX is too much to endure ;)

If it were up to me, I would either export do_splice_from()
or open code it in overlayfs.

It might be worth making this an inline helper in fs.h
along with warn_unsupported().
I would suggest call_splice_write().
I know how people feel about call_{read,write}_iter(), but perhaps
together with warn_unsupported(), an inline helper is justified.

Anyway, unless there is consensus about call_splice_write(),
I am going to unify the two variants of warn_unsupported(), move
it to fs.h, and open code do_splice_from() in ovl_splice_actor().
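
For reference, a rough userspace mock of what such an inline helper could
look like. The struct definitions below are stand-ins and not the kernel's,
call_splice_write() is only the name suggested above and not an existing API,
and the -EINVAL return from the mocked warn_unsupported() is an assumption
of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-ins for the kernel types; hypothetical, illustration only. */
struct pipe_inode_info {
	size_t queued;		/* bytes waiting in the mock pipe */
};
struct file;
struct file_operations {
	long (*splice_write)(struct pipe_inode_info *pipe, struct file *out,
			     long long *ppos, size_t len, unsigned int flags);
};
struct file {
	const struct file_operations *f_op;
};

/* Mock of warn_unsupported(): report and fail (return value assumed). */
static inline int warn_unsupported(struct file *file, const char *op)
{
	(void)file;
	fprintf(stderr, "splice %s not supported\n", op);
	return -EINVAL;
}

/* The suggested helper: check that the method exists, then call it. */
static inline long call_splice_write(struct pipe_inode_info *pipe,
				     struct file *out, long long *ppos,
				     size_t len, unsigned int flags)
{
	if (!out->f_op->splice_write)
		return warn_unsupported(out, "write");
	return out->f_op->splice_write(pipe, out, ppos, len, flags);
}

/* Example ->splice_write that consumes what is queued in the mock pipe. */
static long mock_splice_write(struct pipe_inode_info *pipe, struct file *out,
			      long long *ppos, size_t len, unsigned int flags)
{
	size_t n;

	assert(pipe && ppos);
	(void)out;
	(void)flags;
	n = len < pipe->queued ? len : pipe->queued;
	pipe->queued -= n;
	*ppos += (long long)n;
	return (long)n;
}
```

A caller such as ovl_splice_actor() would then wrap only this call in its own
start/end write pair.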

Thanks,
Amir.

* Re: fanotify HSM open issues
  2023-11-29  5:22                                           ` Amir Goldstein
@ 2023-11-29 14:44                                             ` Amir Goldstein
  2023-11-29 18:42                                               ` Josef Bacik
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2023-11-29 14:44 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Wed, Nov 29, 2023 at 7:22 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Nov 28, 2023 at 11:43 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Tue, Nov 28, 2023 at 06:52:00PM +0200, Amir Goldstein wrote:
> > > On Tue, Nov 28, 2023 at 4:55 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > >
> > > > On Tue, Nov 28, 2023 at 01:05:50PM +0200, Amir Goldstein wrote:
> > > > > On Mon, Nov 27, 2023 at 9:11 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > > > >
> > > > > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > > > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > >
> > > > > > > > Hi Amir,
> > > > > > > >
> > > > > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > > > > > discussion during the Plumbers conference :)
> > > > > > > >
> > > > > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > > > > Recap for new people joining this thread.
> > > > > > > > > > > >
> > > > > > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > > > > > >
> > > > > > > > > > > > P1                             P2                      P3
> > > > > > > > > > > > -----------                    ------------            ------------
> > > > > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > > > > > >       -> security_file_permission()
> > > > > > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > > > > > >
> > > > > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > > > > > with cleaner/safer semantics.
> > > > > > > > > > > >
> > > > > > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > > > > > >
> > > > > > > > > > > Yep, nice summary.
> > > > > > > > ...
> > > > > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > > > > > >
> > > > > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > > > > > >
> > > > > > > > > > Yes, you are right.
> > > > > > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > > > > > >
> > > > > > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > > > > > held somehow indirectly.
> > > > > > > > > >
> > > > > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > > > > > fs1 freeze protection is held and:
> > > > > > > > > >   ovl_splice_read(ovl1.file)
> > > > > > > > > >     ovl_real_fdget()
> > > > > > > > > >       ovl_open_realfile(fs1.file)
> > > > > > > > > >          ... security_file_open(fs1.file)
> > > > > > > > > >
> > > > > > > > > > > That being
> > > > > > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > > > > > out explicitly so that it's a conscious decision.
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > > > > > not enough for HSM needs.
> > > > > > > > >
> > > > > > > > > The reason is that often, when HSM needs to handle filling content
> > > > > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > > > > > for any operations on the filesystem, not just pwrite().
> > > > > > > > >
> > > > > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > > > > > from the lookup event and uses it in the handling of access events to
> > > > > > > > > update the metadata files that store which parts of the file were already
> > > > > > > > > > filled (relying on fiemap is not always a valid option).
> > > > > > > > >
> > > > > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > > > > > >
> > > > > > > > > Another use case is that HSM may want to download content to a
> > > > > > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > > > > > then clone the data into the accessed file range.
> > > > > > > > >
> > > > > > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > > > > > >
> > > > > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > > > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > > > > > note that if we want to really properly handle all possible operations, we
> > > > > > > > need to start handling error from all sb_start_write() and
> > > > > > > > file_start_write() calls and there are quite a few of those.
> > > > > > > >
> > > > > > >
> > > > > > > Darn, forgot about those.
> > > > > > > I am starting to reconsider adding a freeze level.
> > > > > > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > > > > > Maybe fs anti-freeze (see below).
> > > > > > >
> > > > > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > > > > > if the requirement from permission event handler is that it must use a
> > > > > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > > > > > >
> > > > > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > > > > > a model that can solve the deadlock correctly.
> > > > > > > > > > >
> > > > > > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > > > > > needs to write to some other location besides the one fd it got passed
> > > > > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > > > > > DB file containing current state or something like that.
> > > > > > > > > > >
> > > > > > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > > > > > solution but I'm not sure if practical for applications.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > > > > > context and enforcing this context on fanotify_read().
> > > > > > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > > > > > and to activate the desired freeze protection behavior
> > > > > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > > > > > >
> > > > > > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > > > > > - We could add user API to set this personality explicitly to any task
> > > > > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > > > > > >
> > > > > > > > > Please let me know if you agree with this design and if so,
> > > > > > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > > > > > in your opinion?
> > > > > > > >
> > > > > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > > > > > for processing the fanotify events and filling in filesystem contents. I
> > > > > > > > don't think automatic setting of this flag is desirable though as it has
> > > > > > > > quite wide impact and some of the consequences could be surprising.  I
> > > > > > > > rather think it should be a conscious decision when setting up the process
> > > > > > > > processing the events. So I think API to explicitly set / clear the flag
> > > > > > > > would be the best. Also I think it would be better to capture in the name
> > > > > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > > > > > something like that?
> > > > > > > >
> > > > > > >
> > > > > > > Sure.
> > > > > > >
> > > > > > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > > > > > flag) in the past. That would allow finer granularity control of the
> > > > > > > > behavior but I guess you are worried that it would not cover all the needed
> > > > > > > > operations?
> > > > > > > >
> > > > > > >
> > > > > > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > > > > > programs with.
> > > > > > >
> > > > > > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > > > > > >
> > > > > > > > No, I think if someone cares about these, they should explicitly set the
> > > > > > > > PF flag in their task processing the events.
> > > > > > > >
> > > > > > >
> > > > > > > OK.
> > > > > > >
> > > > > > > I see an exit hatch in this statement -
> > > > > > > If we are going to leave the responsibility to avoid deadlock in corner
> > > > > > > cases completely in the hands of the application, then I do not feel
> > > > > > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > > > > > providing the first HSM API.
> > > > > > >
> > > > > > > If the HSM application is running in a controlled system, on a filesystem
> > > > > > > where fsfreeze is not expected or not needed, then a fully functional and
> > > > > > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > > > > > >
> > > > > > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > > > > > easier option for the first version of HSM API?
> > > > > > >
> > > > > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > > > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > > > > > apart from the HSM itself using this fd.
> > > > > > >
> > > > > > > HSM itself can avoid deadlocks if it collaborates the fs freezes with
> > > > > > > making fs modifications from within HSM events.
> > > > > > >
> > > > > > > Do you think that may be an acceptable way out of the corner?
> > > > > >
> > > > > > This is kind of a corner case that I think is acceptable to just leave up to
> > > > > > application developers.  Speaking as a potential consumer of this work we don't
> > > > > > use fsfreeze so aren't concerned with this in practice, and arguably if you're
> > > > > > using this interface you know what you're doing.  As long as the sharp edge is
> > > > > > well documented I think that's fine for v1.
> > > > > >
> > > > >
> > > > > I agree that this is good enough for v1.
> > > > > The only question is can we (and should we) do better than good enough for v1.
> > > > >
> > > > > > Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
> > > > > > about the xfs scrubbing use case.  We all know that "freeze this file system" is
> > > > > > an operation that is going to take X amount of time, so as long as we provide
> > > > > > the application a way to block fsfreeze to avoid the deadlock then I think
> > > > > > that's a reasonable solution.  Additionally it would allow us an avenue to
> > > > > > gracefully handle errors.  If we race and see that the fs is already frozen well
> > > > > > then we can go back to the HSM with an error saying he's out of luck, and he can
> > > > > > return -EAGAIN or something through fanotify to unwind and try again later.
> > > > > >
> > > > >
> > > > > Actually, "fs is already frozen" is not a deadlock case.
> > > > > If "fs is already frozen" then fsfreeze was successful and HSM should just
> > > > > wait in line like everyone else until fs is unfrozen.
> > > > >
> > > > > The deadlock case is "fs is being frozen" (i.e. sb->s_writers.frozen is
> > > > > in state SB_FREEZE_WRITE), which cannot make progress because
> > > > > an existing holder of sb write is blocked on an HSM event, which in turn
> > > > > is trying to start a new sb write.
> > > >
> > > > Right, and now I'm confused.  You have your patchset to re-order the permission
> > > > checks to before the sb_start_write(), so an HSM watching FAN_OPEN_PERM is no
> > > > longer holding the sb write lock and thus can't deadlock, correct?
> > >
> > > Correct.
> > >
> > > >
> > > > The new things you are proposing (FAN_PRE_ACCESS and FAN_PRE_MODIFY) also do not
> > > > happen inside of an sb_start_write(), correct?
> > > >
> > >
> > > Almost correct.
> > >
> > > The callers of the security_file_permission() hook do not hold sb_start_write()
> > > *directly*, but it can be held *indirectly* in splice(file_in_fs1, file_in_fs2).
> > > That is the corner case I was trying to explain.
> > >
> > > When fs1 (splice source fs) is a loop mounted fs and the loop image file
> > > is on fs2 (a.k.a the "host" fs), which also happens to be the splice dest fs,
> > > splice grabs sb_start_write() on fs2.
> > >
> > > After the patches in vfs.rw, splice() no longer calls security_file_permission()
> > > directly on the file in the loop mounted fs1, but the reads from loopdev
> > > translate to reads on the image file, which can call security_file_permission()
> > > on the loop image file on the "host" fs (fs2), while sb_start_write() is held.
> > >
> > > IOW, if HSM needs to fill the content on the loop image file and fsfreeze on
> > > the "host" fs that is the destination of splice, gets in the middle, there is
> > > a chance for a deadlock, because freeze will never make progress and
> > > HSM filling of the loop image file is blocked.
> > >
> > > Yes, it is a corner case, but it exists and a similar one exists with a splice
> > > from an overlayfs file into a file on a "host" fs, which also happens to be the
> > > lower layer of overlayfs (I have a test case that triggered this).
> > >
> >
> > I had to still draw this on my whiteboard to make sure I understood it properly,
> > so I'm going to draw it here to make sure I did actually understand it, because
> > it is indeed quite complex if I'm understanding you correctly.
> >
> > We have the following
> >
> > File A on FS 1 which is a loopback device backed by File B on FS 2
>
> B is the normal file on FS2, so I guess you meant to say backed by file C
>
> > File B on FS 2 which is a normal file
> >
> > We have an HSM watching FS1 to populate files.
> >
> > sendfile(A, B);
> >
> > This does
> >
> > file_start_write(FS2);
> >
> > Then we start to read from A to populate the page, this triggers the HSM, which
> > then wants to write to FS1.
> >
> > At this point some other process calls fsfreeze(FS2), and now we're deadlocked,
> > because the HSM is stuck at sb_start_write(FS2) trying to write to the FS1 which
> > is backed by FS2, but we're already holding file_start_write(FS2) because of
> > splice.
> >
> > Is this correct?
>
> Yes, this is correct.
> I was describing a different variant of deadlock when FS2 is watched by HSM
> and HSM wants to write to the image file C upon reading from file A.
>
> There are many variants of this, but the root cause is operating on file A
> while holding sb_start_write() on file B on another fs.
>
> >
> > If it is, I think the best thing to do is actually push the file_start_write()
> > deeper into the splice work.  Do something like the patch I've applied below,
> > which is wildly untested and uncompiled.  However I think this closes this
> > deadlock in a nice clean way, because we're reading and then writing, and we
> > don't have to worry about any shenanigans under the read path because we only
> > hold the sb_start_write() when we do the actual write part.  Does that make
> > sense?
>
> That makes a lot of sense!
>
> I think this is the correct way out of the deadlock corner case.
> I will amend the patch and test it.
>
> Thanks for getting me out of tunnel vision ;)
>
> Some comments for myself below...
>
> >
> > diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> > index 4382881b0709..f37bb41551fe 100644
> > --- a/fs/overlayfs/copy_up.c
> > +++ b/fs/overlayfs/copy_up.c
> > @@ -230,6 +230,19 @@ static int ovl_copy_fileattr(struct inode *inode, const struct path *old,
> >         return ovl_real_fileattr_set(new, &newfa);
> >  }
> >
> > +static int ovl_splice_actor(struct pipe_inode_info *pipe,
> > +                           struct splice_desc *sd)
> > +{
> > +       struct file *file = sd->u.file;
> > +       long ret;
> > +
> > +       ovl_start_write(file_dentry(file));
> > +       ret = vfs_do_splice_from(pipe, file, sd->opos, sd->total_len,
> > +                                sd->flags);
> > +       ovl_end_write(file_dentry(file));
> > +       return ret;
> > +}
> > +

On second look, this custom ovl actor is not needed at all.
ovl_start_write(file_dentry(file)) is completely equivalent to
file_start_write(file) in this context, so no need to export any actor.
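
To make the effect of moving freeze protection concrete, here is a rough
userspace model. It is not kernel code: every toy_* name is made up for
illustration, and it assumes a single superblock whose only writer is the
copier, with a freeze request racing in just before the read side raises
the HSM event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for sb->s_writers on one superblock (hypothetical). */
struct toy_sb {
	int writers;   /* holders of freeze protection */
	bool freezing; /* a freeze is waiting for writers to drain */
};

enum toy_outcome { TOY_PROGRESS, TOY_WAIT_FOR_THAW, TOY_DEADLOCK };

/*
 * The HSM handler wants freeze protection to fill content.  In this model
 * the only possible writer is the copier that is waiting on the handler.
 */
static enum toy_outcome toy_hsm_fill(const struct toy_sb *sb)
{
	assert(sb != NULL);
	if (!sb->freezing)
		return TOY_PROGRESS;
	/* The freeze cannot finish while the blocked copier holds protection. */
	return sb->writers > 0 ? TOY_DEADLOCK : TOY_WAIT_FOR_THAW;
}

/* Copy with a freeze racing in before the read side raises the HSM event. */
static enum toy_outcome toy_copy(struct toy_sb *sb, bool hold_across_read)
{
	enum toy_outcome res;

	assert(sb != NULL);
	if (hold_across_read)
		sb->writers++;       /* protection taken before the read */
	sb->freezing = true;         /* freeze request arrives */
	res = toy_hsm_fill(sb);      /* read triggers the HSM event */
	if (hold_across_read)
		sb->writers--;
	return res;
}
```

With protection held across the read, the model reports a deadlock. With
protection taken only around the write step, the handler merely waits for
the thaw, which is the "fs is already frozen" case and not a deadlock.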

OTOH, generic_copy_file_range() and ceph (from ->copy_file_range())
call do_splice_direct() with file_start_write() held and this is a bit harder
to untangle.

The easy solution is to export do_splice_copy_file_range(), which is
a variant of do_splice_direct() with an actor that does not take
file_start_write().
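
As a rough illustration of that split, the two actors could be modelled in
userspace like this. It is only a sketch under assumptions: the toy_* names
are hypothetical, toy_dst_sb stands in for freeze protection on the
destination sb, and the real do_splice_direct() signature is different.

```c
#include <assert.h>

/* Hypothetical stand-in for the destination sb's freeze protection count. */
struct toy_dst_sb {
	int writers;
};

/* The write step is only legal while freeze protection is held. */
static long toy_write(const struct toy_dst_sb *sb, long len)
{
	assert(len >= 0);
	return sb->writers > 0 ? len : -1;
}

/* Actor for the plain variant: takes protection around the write only. */
static long toy_actor_locked(struct toy_dst_sb *sb, long len)
{
	long ret;

	sb->writers++;
	ret = toy_write(sb, len);
	sb->writers--;
	return ret;
}

/* Actor for the copy_file_range variant: the caller already holds it. */
static long toy_actor_caller_locked(struct toy_dst_sb *sb, long len)
{
	return toy_write(sb, len);
}
```

The second actor leaves the counter untouched, so a caller that already
holds protection does not take it a second time.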

The good thing about copy_file_range() is that it is only allowed across
sb for filesystems with ->copy_file_range(), so if we ban HSM events
on those filesystems, the freeze deadlock is averted.

I don't think we need to support HSM events on fuse/ceph/cifs/nfs/ovl
anyway, even if some of them do not allow cross sb copy.
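
A minimal sketch of such a ban, again as a userspace model. The struct and
function names are hypothetical, and where a check like this would live in
the kernel is not something this thread settles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct file_operations. */
struct toy_file_ops {
	long (*copy_file_range)(int fd_in, int fd_out, long len);
};

/*
 * Cross-sb copy_file_range() is only possible on filesystems that
 * implement ->copy_file_range(), and those callers hold freeze protection
 * on the destination while reading the source.  Refusing HSM events on
 * such filesystems keeps the handler out of that window.
 */
static bool toy_hsm_events_allowed(const struct toy_file_ops *ops)
{
	assert(ops != NULL);
	return ops->copy_file_range == NULL;
}

/* Example ->copy_file_range() implementation used only by the model. */
static long toy_copy_file_range(int fd_in, int fd_out, long len)
{
	(void)fd_in;
	(void)fd_out;
	return len;
}
```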

Thanks,
Amir.

* Re: fanotify HSM open issues
  2023-11-29 14:44                                             ` Amir Goldstein
@ 2023-11-29 18:42                                               ` Josef Bacik
  0 siblings, 0 replies; 19+ messages in thread
From: Josef Bacik @ 2023-11-29 18:42 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Christian Brauner, Jens Axboe,
	linux-fsdevel

On Wed, Nov 29, 2023 at 04:44:28PM +0200, Amir Goldstein wrote:
> On Wed, Nov 29, 2023 at 7:22 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Nov 28, 2023 at 11:43 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > On Tue, Nov 28, 2023 at 06:52:00PM +0200, Amir Goldstein wrote:
> > > > On Tue, Nov 28, 2023 at 4:55 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > > >
> > > > > On Tue, Nov 28, 2023 at 01:05:50PM +0200, Amir Goldstein wrote:
> > > > > > On Mon, Nov 27, 2023 at 9:11 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > > > > >
> > > > > > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > > > > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > >
> > > > > > > > > Hi Amir,
> > > > > > > > >
> > > > > > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > > > > > > discussion during the Plumbers conference :)
> > > > > > > > >
> > > > > > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > > > > > Recap for new people joining this thread.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > > > > > > >
> > > > > > > > > > > > > P1                             P2                      P3
> > > > > > > > > > > > > -----------                    ------------            ------------
> > > > > > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > > > > > > >       -> security_file_permission()
> > > > > > > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > > > > > > >
> > > > > > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > > > > > > with cleaner/safer semantics.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > > > > > > >
> > > > > > > > > > > > Yep, nice summary.
> > > > > > > > > ...
> > > > > > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > > > > > > >
> > > > > > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > > > > > > >
> > > > > > > > > > > Yes, you are right.
> > > > > > > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > > > > > > >
> > > > > > > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > > > > > confidence in the ability to guarantee that freeze protection is not
> > > > > > > > > > > held somehow indirectly.
> > > > > > > > > > >
> > > > > > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > > > > > > fs1 freeze protection is held and:
> > > > > > > > > > >   ovl_splice_read(ovl1.file)
> > > > > > > > > > >     ovl_real_fdget()
> > > > > > > > > > >       ovl_open_realfile(fs1.file)
> > > > > > > > > > >          ... security_file_open(fs1.file)
> > > > > > > > > > >
> > > > > > > > > > > > That being
> > > > > > > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > > > > > > out explicitly so that it's a conscious decision.
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > > > > > > not enough for HSM needs.
> > > > > > > > > >
> > > > > > > > > > The reason is that often, when HSM needs to handle filling content
> > > > > > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > > > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > > > > > > for any operations on the filesystem, not just pwrite().
> > > > > > > > > >
> > > > > > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > > > > > > from the lookup event and uses it in the handling of access events to
> > > > > > > > > > update the metadata files that store which parts of the file were already
> > > > > > > > > > filled (relying on fiemap is not always a valid option).
> > > > > > > > > >
> > > > > > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > > > > > > >
> > > > > > > > > > Another use case is that HSM may want to download content to a
> > > > > > > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > > > > > > then clone the data into the accessed file range.
> > > > > > > > > >
> > > > > > > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > > > > > > >
> > > > > > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > > > > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > > > > > > note that if we want to really properly handle all possible operations, we
> > > > > > > > > need to start handling error from all sb_start_write() and
> > > > > > > > > file_start_write() calls and there are quite a few of those.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Darn, forgot about those.
> > > > > > > > I am starting to reconsider adding a freeze level.
> > > > > > > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > > > > > Maybe fs anti-freeze (see below).
> > > > > > > >
> > > > > > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > > > > > if the requirement from permission event handler is that it must use an
> > > > > > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > > > > > > >
> > > > > > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > > > > > > a model that can solve the deadlock correctly.
> > > > > > > > > > > >
> > > > > > > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > > > > > needs to do write to some other location besides the one fd it got passed
> > > > > > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > > > > > > DB file containing current state or something like that.
> > > > > > > > > > > >
> > > > > > > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > > > > > > solution but I'm not sure if practical for applications.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > > > > > > context and enforcing this context on fanotify_read().
> > > > > > > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > > > > > > and to activate the desired freeze protection behavior
> > > > > > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > > > > > > >
> > > > > > > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > > > > > > - We could add user API to set this personality explicitly to any task
> > > > > > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > > > > > > >
> > > > > > > > > > Please let me know if you agree with this design and if so,
> > > > > > > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > > > > > > in your opinion?
> > > > > > > > >
> > > > > > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > > > > > > for processing the fanotify events and filling in filesystem contents. I
> > > > > > > > > don't think automatic setting of this flag is desirable though as it has
> > > > > > > > > quite wide impact and some of the consequences could be surprising.  I
> > > > > > > > > rather think it should be a conscious decision when setting up the process
> > > > > > > > > processing the events. So I think API to explicitly set / clear the flag
> > > > > > > > > would be the best. Also I think it would be better to capture in the name
> > > > > > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > > > > > > something like that?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Sure.
> > > > > > > >
> > > > > > > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > > > > > > flag) in the past. That would allow finer granularity control of the
> > > > > > > > > behavior but I guess you are worried that it would not cover all the needed
> > > > > > > > > operations?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > > > > > > programs with.
> > > > > > > >
> > > > > > > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > > > > > > >
> > > > > > > > > No, I think if someone cares about these, they should explicitly set the
> > > > > > > > > PF flag in their task processing the events.
> > > > > > > > >
> > > > > > > >
> > > > > > > > OK.
> > > > > > > >
> > > > > > > > I see an exit hatch in this statement -
> > > > > > > > If we are going to leave the responsibility to avoid deadlock in corner
> > > > > > > > cases completely in the hands of the application, then I do not feel
> > > > > > > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > > > > > > providing the first HSM API.
> > > > > > > >
> > > > > > > > If the HSM application is running in a controlled system, on a filesystem
> > > > > > > > where fsfreeze is not expected or not needed, then a fully functional and
> > > > > > > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > > > > > > >
> > > > > > > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > > > > > > easier option for the first version of HSM API?
> > > > > > > >
> > > > > > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > > > > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > > > > > > apart from the HSM itself using this fd.
> > > > > > > >
> > > > > > > > HSM itself can avoid deadlocks if it collaborates the fs freezes with
> > > > > > > > making fs modifications from within HSM events.
> > > > > > > >
> > > > > > > > Do you think that may be an acceptable way out of the corner?
> > > > > > >
> > > > > > > This is kind of a corner case that I think is acceptable to just leave up to
> > > > > > > application developers.  Speaking as a potential consumer of this work we don't
> > > > > > > use fsfreeze so aren't concerned with this in practice, and arguably if you're
> > > > > > > using this interface you know what you're doing.  As long as the sharp edge is
> > > > > > > well documented I think that's fine for v1.
> > > > > > >
> > > > > >
> > > > > > I agree that this is good enough for v1.
> > > > > > The only question is can we (and should we) do better than good enough for v1.
> > > > > >
> > > > > > > Long term I like the EXCLUSIVE_FSFREEZER option, noting Christian's comment
> > > > > > > about the xfs scrubbing use case.  We all know that "freeze this file system" is
> > > > > > > an operation that is going to take X amount of time, so as long as we provide
> > > > > > > the application a way to block fsfreeze to avoid the deadlock then I think
> > > > > > > that's a reasonable solution.  Additionally it would allow us an avenue to
> > > > > > > gracefully handle errors.  If we race and see that the fs is already frozen well
> > > > > > > then we can go back to the HSM with an error saying he's out of luck, and he can
> > > > > > > return -EAGAIN or something through fanotify to unwind and try again later.
> > > > > > >
> > > > > >
> > > > > > Actually, "fs is already frozen" is not a deadlock case.
> > > > > > If "fs is already frozen" then fsfreeze was successful and HSM should just
> > > > > > wait in line like everyone else until fs is unfrozen.
> > > > > >
> > > > > > The deadlock case is "fs is being frozen" (i.e. sb->s_writers.frozen is
> > > > > > in state SB_FREEZE_WRITE), which cannot make progress because
> > > > > > an existing holder of sb write is blocked on an HSM event, which in turn
> > > > > > is trying to start a new sb write.
> > > > >
> > > > > Right, and now I'm confused.  You have your patchset to re-order the permission
> > > > > checks to before the sb_start_write(), so an HSM watching FAN_OPEN_PERM is no
> > > > > longer holding the sb write lock and thus can't deadlock, correct?
> > > >
> > > > Correct.
> > > >
> > > > >
> > > > > The new things you are proposing (FAN_PRE_ACCESS and FAN_PRE_MODIFY) also do not
> > > > > happen inside of an sb_start_write(), correct?
> > > > >
> > > >
> > > > Almost correct.
> > > >
> > > > The callers of the security_file_permission() hook do not hold sb_start_write()
> > > > *directly*, but it can be held *indirectly* in splice(file_in_fs1, file_in_fs2).
> > > > That is the corner case I was trying to explain.
> > > >
> > > > When fs1 (splice source fs) is a loop mounted fs and the loop image file
> > > > is on fs2 (a.k.a the "host" fs), which also happens to be the splice dest fs,
> > > > splice grabs sb_start_write() on fs2.
> > > >
> > > > After the patches in vfs.rw, splice() no longer calls security_file_permission()
> > > > directly on the file in the loop mounted fs1, but the reads from loopdev
> > > > translate to reads on the image file, which can call security_file_permission()
> > > > on the loop image file on the "host" fs (fs2), while sb_start_write() is held.
> > > >
> > > > IOW, if HSM needs to fill the content on the loop image file and fsfreeze on
> > > > the "host" fs that is the destination of splice, gets in the middle, there is
> > > > a chance for a deadlock, because freeze will never make progress and
> > > > HSM filling of the loop image file is blocked.
> > > >
> > > > Yes, it is a corner case, but it exists and a similar one exists with a splice
> > > > from an overlayfs file into a file on a "host" fs, which also happens to be the
> > > > lower layer of overlayfs (I have a test case that triggered this).
> > > >
> > >
> > > I had to still draw this on my whiteboard to make sure I understood it properly,
> > > so I'm going to draw it here to make sure I did actually understand it, because
> > > it is indeed quite complex if I'm understanding you correctly.
> > >
> > > We have the following
> > >
> > > File A on FS 1 which is a loopback device backed by File B on FS 2
> >
> > B is the normal file on FS2, so I guess you meant to say backed by file C
> >
> > > File B on FS 2 which is a normal file
> > >
> > > We have an HSM watching FS1 to populate files.
> > >
> > > sendfile(A, B);
> > >
> > > This does
> > >
> > > file_start_write(FS2);
> > >
> > > Then we start to read from A to populate the page, this triggers the HSM, which
> > > then wants to write to FS1.
> > >
> > > At this point some other process calls fsfreeze(FS2), and now we're deadlocked,
> > > because the HSM is stuck at sb_start_write(FS2) trying to write to the FS1 which
> > > is backed by FS2, but we're already holding file_start_write(FS2) because of
> > > splice.
> > >
> > > Is this correct?
> >
> > Yes, this is correct.
> > I was describing a different variant of deadlock when FS2 is watched by HSM
> > and HSM wants to write to the image file C upon reading from file A.
> >
> > There are many variants of this, but the root cause is operating on file A
> > while holding sb_start_write() on file B on another fs.
> >
> > >
> > > If it is, I think the best thing to do is actually push the file_start_write()
> > > deeper into the splice work.  Do something like the patch I've applied below,
> > > which is wildly untested and uncompiled.  However I think this closes this
> > > deadlock in a nice clean way, because we're reading and then writing, and we
> > > don't have to worry about any shenanigans under the read path because we only
> > > hold the sb_start_write() when we do the actual write part.  Does that make
> > > sense?
> >
> > That makes a lot of sense!
> >
> > I think this is the correct way out of the deadlock corner case.
> > I will amend the patch and test it.
> >
> > Thanks for getting me out of tunnel vision ;)
> >
> > Some comments for myself below...
> >
> > >
> > > diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> > > index 4382881b0709..f37bb41551fe 100644
> > > --- a/fs/overlayfs/copy_up.c
> > > +++ b/fs/overlayfs/copy_up.c
> > > @@ -230,6 +230,19 @@ static int ovl_copy_fileattr(struct inode *inode, const struct path *old,
> > >         return ovl_real_fileattr_set(new, &newfa);
> > >  }
> > >
> > > +static int ovl_splice_actor(struct pipe_inode_info *pipe,
> > > +                           struct splice_desc *sd)
> > > +{
> > > +       struct file *file = sd->u.file;
> > > +       long ret;
> > > +
> > > +       ovl_start_write(file_dentry(file));
> > > +       ret = vfs_do_splice_from(pipe, file, sd->opos, sd->total_len,
> > > +                                sd->flags);
> > > +       ovl_end_write(file_dentry(file));
> > > +       return ret;
> > > +}
> > > +
> 
> On second look, this custom ovl actor is not needed at all.
> ovl_start_write(file_dentry(file)) is completely equivalent to
> file_start_write(file) in this context, so no need to export any actor.
> 

Perfect, I originally started with that but then I couldn't work out if the
upper layer would end up being a different SB than the one that the file was
attached to, so I erred on the side of making an overlayfs specific solution for
that.

> OTOH, generic_copy_file_range() and ceph (from ->copy_file_range())
> call do_splice_direct() with file_start_write() held and this is a bit harder
> to untangle.
> 
> The easy solution is to export do_splice_copy_file_range(), which is
> a variant of do_splice_direct() with an actor that does not take
> file_start_write().
> 
> The good thing about copy_file_range() is that it is only allowed across
> sb for filesystems with ->copy_file_range(), so if we ban HSM events
> on those filesystems, the freeze deadlock is averted.
> 
> I don't think we need to support HSM events on fuse/ceph/cifs/nfs/ovl
> anyway, even if some of them do not allow cross sb copy.

That sounds reasonable. Then just add a big comment describing why we're
disabling it for those file systems, in case somebody wants to move the
file_start_write() around in the future.  Thanks,

Josef


end of thread, other threads:[~2023-11-29 18:42 UTC | newest]

Thread overview: 19+ messages
     [not found] <20230629133940.5w255qerlgqeqd7s@quack3>
     [not found] ` <CAOQ4uxgBMUjNvd3ZPJ1ZCzvhohB6yWe4E52XqEdfLQPHEHw-hA@mail.gmail.com>
     [not found]   ` <20230629171157.54l44agwejgnquw3@quack3>
     [not found]     ` <CAOQ4uxgxFtBZy4V8vccV2F7Lbg_9=OFNhgdgCP6Hu=o7gjcsVQ@mail.gmail.com>
     [not found]       ` <20230703183029.nn5adeyphijv5wl6@quack3>
     [not found]         ` <CAOQ4uxiS6R9hGFmputP6uRHGKywaCca0Ug53ihGcrgxkvMHomg@mail.gmail.com>
     [not found]           ` <CAOQ4uxhk_rydFejNqsmn4AydZfuknp=vPunNODNcZ_8qW-AykQ@mail.gmail.com>
     [not found]             ` <20230816094702.zztx3dctxvnfeh6o@quack3>
     [not found]               ` <CAOQ4uxhp6o40gZKnyAcjB2vkmNF0WOD9V9p2i+eHXXjSf=YFtQ@mail.gmail.com>
     [not found]                 ` <CAOQ4uxixuw9d1TGNpzc7cSPyzRN6spu48Y+4QPqFBsvOYS89kQ@mail.gmail.com>
     [not found]                   ` <20230817182220.vzzklvr7ejqlfnju@quack3>
2023-08-18  7:01                     ` fanotify HSM open issues Amir Goldstein
2023-08-23 14:37                       ` Jan Kara
2023-08-23 16:31                         ` Amir Goldstein
2023-11-13 11:50                           ` Amir Goldstein
2023-11-20 14:06                             ` Jan Kara
2023-11-20 16:59                               ` Amir Goldstein
2023-11-27 13:56                                 ` Christian Brauner
2023-11-27 14:48                                   ` Amir Goldstein
2023-11-27 14:57                                     ` Christian Brauner
2023-11-28  9:46                                       ` Amir Goldstein
2023-11-27 19:11                                 ` Josef Bacik
2023-11-28 11:05                                   ` Amir Goldstein
2023-11-28 14:55                                     ` Josef Bacik
2023-11-28 15:13                                       ` Christian Brauner
2023-11-28 16:52                                       ` Amir Goldstein
2023-11-28 21:42                                         ` Josef Bacik
2023-11-29  5:22                                           ` Amir Goldstein
2023-11-29 14:44                                             ` Amir Goldstein
2023-11-29 18:42                                               ` Josef Bacik
