public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>, Miklos Szeredi <miklos@szeredi.hu>,
	Eric Paris <eparis@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: fsnotify_mark_srcu wtf?
Date: Thu, 10 Nov 2016 20:46:25 +0100	[thread overview]
Message-ID: <20161110194625.GG31098@quack2.suse.cz> (raw)
In-Reply-To: <CAOQ4uxhBNHjOWFzcdzSAV3Bp=xrHe-TS_hkCpEUdivue6qDjnw@mail.gmail.com>

On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@suse.cz> wrote:
> > On Sun 06-11-16 08:45:54, Amir Goldstein wrote:
> >> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@suse.cz> wrote:
> >> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
> >> >> We've got a report where a fanotify daemon that implements permission checks
> >> >> screws up and doesn't send a reply.  This then causes widespread hangs due to
> >> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> >> >> called from e.g. inotify_release()-> fsnotify_destroy_group()->
> >> >> fsnotify_mark_destroy_list() to block.
> >> >
> >> > Yes. But if a program implementing permission checks does not reply, your
> >> > system is likely hosed anyway. We can only try to somewhat limit the
> >> > damage...
> >> >
> >>
> >> That was my initial thought as well, but at least with the sample code
> >> Miklos sent
> >> the only thing that gets hosed is the one process watching that one file.
> >> You could think of a use case of fanotify being used to watch over files
> >> in a specific user directory, where the damage on the entire system
> >> should/could be limited. No?
> >
> > Yes, the damage could at least theoretically be limited only to those files
> > / dirs watched by the buggy process.
> >
> >> >> Below program demonstrates the issue.  It should output a single line:
> >> >>
> >> >> close(inotify_fd): success
> >> >>
> >> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> >> >> the waiting permission event.
> >> >>
> >> >> Wouldn't making the srcu per-group fix this?  Would that be too expensive?
> >> >
> >> > Per-group would be IMHO too expensive. You can have lots of groups and I'm
> >> > not sure srcu would scale to that. Furthermore the SRCU protects the list
> >> > of groups that need to get notification so it would not even be easily
> >> > possible. Also Amir's solution is buggy - I'll comment on that as a reply
> >> > to his patch. I'll try to find something to improve the situation but so
> >> > far I have no good idea...
> >> >
> >>
> >> Yes, very much buggy indeed :/
> >> Anyway, the reason I drafted it quickly was to highlight the fact that the
> >> marks only need to live to the point of decision whether or not the event
> >> should be sent to the group and afterwards, its sufficient to grab the
> >> group reference, without having impact on the entire system.
> >
> > Yes, fanotify code as such does not need the marks anymore. But the core
> > fsnotify code does...
> >
> >> Yet another possible ugly (but less buggy) solution would be
> >> to iterate all marks under SRCU read protection.
> >> If any group is about to block (either by suggested return value
> >> EAGAIN or another
> >> by using a new op should_handle_event_deferred), defer event handling to post
> >> marks iteration, by keeping a few group references on stack.
> >
> > And this does not work as well... Fanotify must notify groups by their
> > priority so you cannot arbitrarily reorder ordering in which groups get
> > notified. I'm currently pondering on using mark refcount to pin it when
> > processing permission event but there are still some details to check.
> >
> 
> All right, mark refcount sound like the proper solution.

Except it doesn't quite work. We can pin the current marks by a refcount
but they can still be removed from the list so after we regain srcu lock,
we are not sure their ->next pointers still point to still allocated marks
:-| Sadly I realized this only after implementing all this.

> In case this approach doesn't work out for some reason, you may want
> to consider fsnotify_mark_srcu (and destroy_list) per priority.
> Or just 2 srcu, one for for priority 0 and one for other.
> Because priority > 0 may block and priority 0 may not block.
> 
> The priority set on an inode/vfsmount can be easily encoded into the
> i_fsnotify_mask/mnt_fsnotify_mask, e.g.:
> #define FS_GROUP_PRIO_1          0x00040000      /* fanotify content
> based access control */
> #define FS_GROUP_PRIO_2          0x00080000      /* fanotify
> pre-content access */
> 
> in fsnotify(), only need to take the read side srcu of relevant priority groups,
> but need to take extra care to set the priority bit in the inode/mnt
> mask *before*
> adding the mark to the list, with a relevant memory barrier before checking
> the priority bits in fsnotify().

Well but how would you like to protect the mark list hanging off the inode
/ mountpoint with two SRCUs? You'd need two lists hanging off the inode &
mountpoint (for different priorities) and that's too big cost to pay to
accomodate broken userspace...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2016-11-10 19:46 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-02 22:09 fsnotify_mark_srcu wtf? Miklos Szeredi
2016-11-03 10:25 ` Amir Goldstein
2016-11-05 21:43   ` Jan Kara
2016-11-05 21:34 ` Jan Kara
2016-11-06  6:45   ` Amir Goldstein
2016-11-09 11:10     ` Jan Kara
2016-11-09 18:26       ` Amir Goldstein
2016-11-10 19:46         ` Jan Kara [this message]
2016-11-10 20:02           ` Amir Goldstein
2016-11-10 20:44           ` Miklos Szeredi
2016-11-10 22:41             ` Amir Goldstein
2016-11-13 18:43           ` Amir Goldstein
2016-11-14 11:59             ` Amir Goldstein
2016-12-02  8:26           ` Miklos Szeredi
2016-12-02 10:48             ` Jan Kara
2016-12-02 11:02               ` Miklos Szeredi
2016-12-06 17:07                 ` Jan Kara
2016-12-02 11:41               ` Amir Goldstein
2016-12-02 11:57                 ` Amir Goldstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161110194625.GG31098@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=amir73il@gmail.com \
    --cc=eparis@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox