From: Tycho Kirchner <tychokirchner@mail.de>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>, linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [RFC] Volatile fanotify marks
Date: Fri, 6 May 2022 11:59:50 +0200 [thread overview]
Message-ID: <99bb075f-60e3-9480-f253-45515da80348@mail.de> (raw)
In-Reply-To: <CAOQ4uxj4YOg2JP6XSzYtn2-eta2SsVcTgjHfnc=raD8S7xgrkQ@mail.gmail.com>
>>> On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>>>>
>>>> All right, I thought a bit more about that and returned to your
>>>> original BPF idea you mentioned on 2020-08-28:
>>>>
>>>>> I was thinking that we could add a BPF hook to fanotify_handle_event()
>>>>> (similar to what's happening in packet filtering code) and you could attach
>>>>> BPF programs to this hook to do filtering of events. That way we don't have
>>>>> to introduce new group flags for various filtering options. The question is
>>>>> whether eBPF is strong enough so that filters useful for fanotify users
>>>>> could be implemented with it but this particular check seems implementable.
>>>>>
>>>>> Honza
>>>>
>>>> Instead of changing fanotify's filesystem notification functionality,
>>>> I suggest to rather **add a tracing mode (fantrace)**.
>>>>
>>>> The synchronous handling of syscalls via ptrace is of course required
>>>> for debugging purposes, however that introduces a major slowdown (even
>>>> with seccomp-bpf filters). There are a number of cases, including
>>>> [1-3], where async processing of file events of specific tasks would be
>>>> fine but is not readily available in Linux. Fanotify already ships
>>>> important infrastructure in this regard: it provides very fast
>>>> event-buffering and, by using file descriptors instead of resolved
>>>> paths, a clean and race-free API to process the events later. However,
>>>> as already stated, fanotify does not provide a clean way, to monitor
>>>> only a subset of tasks. Therefore please consider the following
>>>> proposed architecture of fantrace:
>>>>
>>>> Each taks gets its own struct fsnotify_group. Within
>>>> fsnotify.c:fsnotify() it is checked if the given task has a
>>>> fsnotify_group attached where events of interest are buffered as usual.
>>>> Note that this is an additional hook - sysadmins being subscribed to
>>>> filesystem events rather than task-filesystem-events are notified as
>>>> usual - in that case two hooks possibly run. The fsnotify_group is
>>>> extended by a field optionally pointing to a BPF program which allows
>>>> for custom filters to be run.
>>>>
>>>> Some implementation details:
>>>> - To let the tracee return quickly, run BPF filter program within tracer
>>>> context during read(fan_fd) but before events are copied to userspace
>>>> - only one fantracer per task, which overrides existing ones if any
>>>> - task->fsnotify_group refcount increment on fork, decrement on exit (run
>>>> after exit_files(tsk) to not miss final close events). When last task
>>>> exited, send EOF to listener.
>>>> - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace)
>>>> - lazy check when event occurs, if listener is still alive (refcount > 1)
>>>> - for the beginning, to keep things simple and to "solve" the cleanup of
>>>> filesystem marks, I suggest to disable i_fsnotify_marks for fantrace
>>>> (only allow FAN_MARK_FILESYSTEM), as that functionality can be
>>>> implemented within the user-provided BPF-program.
>>>>
>>>
>>> Maybe I am slow, but I did not understand the need for this task fsnotify_group.
>>>
>>> What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event())
>>> that hook is supposed to filter by pid so why all this extra complexity?
>>>
>>> We may consider the option to have another BFP hook when reading
>>> events if there is
>>> good justification, but subtree filters will have to be in handle_event().
>>>
>>> Thanks,
>>> Amir.
>>
>> To be a reasonable async replacement for ptrace (see e.g. mentioned reprozip)
>> file-events from all paths have to be reported, which is difficult
>> using i_fsnotify_marks, because
>> - marking whole mountpoints requires privileges
>> - marking the whole filesystem using directory marks is unfeasible
>>
>> However, we need a quick way to find out, if a file event is of
>> interest (find its beloning fsnotify_group). For the purpose of tracing
>> it appears reasonable to consider all file events of a traced task as
>> "interesting" in the first place. So, in this way, we allow a user to
>> trace file events of his own tasks without slowing down other,
>> non-traced tasks.
>>
>> After all, it's all about the order of running filters - first inode,
>> then pid or reverse. With my proposed architecture for the purpose of
>> tracing I would hand the inode-filter to the user in form of an
>> optional BPF hook. Performance-wise that's also the "fair" solution.
>> Let's assume we allow marking the whole filesystem (via mountpoints).
>> Now, the BPF-pid-filter code has to run for every single file event (of
>> all users!), if multiple users trace the filesystem even multiple hooks
>> have to run, slowing down the whole system.
>>
>
> I understand now what you were trying to do, but still not convinced
> that the added complexity to the kernel is worth it, because you may be
> able to achieve the same with userspace LD_PRELOAD hooking.
In fact, I use that approach in shournal's fanotify backend, as the
shell process cannot join the mount namespace of the currently executing
command (without re-exec). See here, for example:
https://github.com/tycho-kirchner/shournal/blob/master/src/shell-integration-fanotify/libshournal-shellwatch.cpp
However, while this works for the supported shells bash and zsh,
the LD_PRELOAD approach in general is problematic:
- Static compiled executables cannot be traced (rare, ptrace with
seccomp-bpf filters might be ok for those)
- killed tasks may still have fd's open, so signals also have to be hooked.
SIGKILL or crashing tasks (e.g. segfault) cannot be handled at all (though
this would be ok)
- close- or other library-calls are potentially inlined. Some programs may
even use syscalls directly (probably a rare issue, but I'm not sure)
- quite some hooks required: close, fclose (glibc does not use the close wrapper),
fcloseall, dup2, dup3, exit, _exit and possibly others
Actually, this sounds more of a complexity. I guess I could make a
proof-of-principle implementation of my proposed fantrace in a few days,
but of course this makes only sense if there is a slight chance this
could be accepted.
>
> BTW, the fanotify_userns patch set [1] was an attempt to allow unprivileged
> user watch over subtree with low performance impact on the rest of the
> file system.
> This is not exactly what you need, but maybe it could be made
> into what you need.
> [1] https://github.com/amir73il/linux/commits/fanotify_userns
I'll take at look, but while I'm fine in compiling a custom kernel
users of my tool are not. Further, using user namespaces in the
outermost shell layer (where my tool runs) is problematic for
administrators or users of setuid binaries.
Thanks,
Tycho
next prev parent reply other threads:[~2022-05-06 9:59 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-23 18:42 [RFC] Volatile fanotify marks Amir Goldstein
2022-02-28 14:05 ` Jan Kara
2022-02-28 17:40 ` Amir Goldstein
2022-03-01 11:07 ` Jan Kara
2022-03-01 11:27 ` Amir Goldstein
2022-03-01 12:26 ` Tycho Kirchner
2022-03-01 16:58 ` Amir Goldstein
2022-03-02 10:04 ` Tycho Kirchner
2022-03-02 18:14 ` Amir Goldstein
2022-03-03 9:24 ` Jan Kara
2022-05-02 9:13 ` Tycho Kirchner
2022-05-04 6:13 ` Amir Goldstein
2022-05-04 10:01 ` Tycho Kirchner
2022-05-04 14:37 ` Amir Goldstein
2022-05-06 9:59 ` Tycho Kirchner [this message]
2022-05-05 10:42 ` Jan Kara
2022-05-13 15:30 ` Matthew Bobrowski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=99bb075f-60e3-9480-f253-45515da80348@mail.de \
--to=tychokirchner@mail.de \
--cc=amir73il@gmail.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).