linux-fsdevel.vger.kernel.org archive mirror
From: Christian Brauner <brauner@kernel.org>
To: Ian Kent <raven@themaw.net>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	 Mark Brown <broonie@kernel.org>,
	Eric Chanudet <echanude@redhat.com>,
	 Alexander Viro <viro@zeniv.linux.org.uk>,
	Jan Kara <jack@suse.cz>, Clark Williams <clrkwllms@kernel.org>,
	 Steven Rostedt <rostedt@goodmis.org>,
	Ian Kent <ikent@redhat.com>,
	linux-fsdevel@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-rt-devel@lists.linux.dev,
	 Alexander Larsson <alexl@redhat.com>,
	Lucas Karpinski <lkarpins@redhat.com>,
	Aishwarya.TCV@arm.com
Subject: Re: [PATCH v4] fs/namespace: defer RCU sync for MNT_DETACH umount
Date: Sat, 19 Apr 2025 12:44:18 +0200
Message-ID: <20250419-auftrag-knipsen-6e56b0ccd267@brauner>
In-Reply-To: <834853f4-10ca-4585-84b2-425c4e9f7d2b@themaw.net>

On Sat, Apr 19, 2025 at 09:24:31AM +0800, Ian Kent wrote:
> 
> On 18/4/25 16:47, Christian Brauner wrote:
> > On Fri, Apr 18, 2025 at 09:20:52AM +0800, Ian Kent wrote:
> > > On 18/4/25 09:13, Ian Kent wrote:
> > > > On 18/4/25 00:28, Christian Brauner wrote:
> > > > > > On Thu, Apr 17, 2025 at 05:31:26PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > > On 2025-04-17 17:28:20 [+0200], Christian Brauner wrote:
> > > > > > > >       So if there's some userspace process with a broken NFS
> > > > > > > >       server and it does umount(MNT_DETACH) it will end up
> > > > > > > >       hanging every other umount(MNT_DETACH) on the system
> > > > > > > >       because the delayed_mntput_work workqueue (to my
> > > > > > > >       understanding) cannot make progress.
> > > > > > > Ok, "to my understanding" has been updated after going back and
> > > > > > > reading the delayed work code. Luckily it's not as bad as I
> > > > > > > thought, because the work is queued on system_wq, which is
> > > > > > > multi-threaded, so it's at least not causing everyone doing
> > > > > > > umount(MNT_DETACH) to get stuck. I'm still skeptical about how
> > > > > > > safe this all is.
> > > > > > I would (again) throw system_unbound_wq into the game, because work
> > > > > > queued on system_wq will remain on the CPU on which it has been
> > > > > > enqueued (if we're talking about multi-threading).
> > > > > Yes, good point.
> > > > > 
> > > > > However, what about using polled grace periods?
> > > > > 
> > > > > A first simple-minded thing to do would be to record the grace-period
> > > > > state after umount_tree() has finished and then check it in
> > > > > namespace_unlock():
> > > > > 
> > > > > diff --git a/fs/namespace.c b/fs/namespace.c
> > > > > index d9ca80dcc544..1e7ebcdd1ebc 100644
> > > > > --- a/fs/namespace.c
> > > > > +++ b/fs/namespace.c
> > > > > @@ -77,6 +77,7 @@ static struct hlist_head *mount_hashtable __ro_after_init;
> > > > >    static struct hlist_head *mountpoint_hashtable __ro_after_init;
> > > > >    static struct kmem_cache *mnt_cache __ro_after_init;
> > > > >    static DECLARE_RWSEM(namespace_sem);
> > > > > +static unsigned long rcu_unmount_seq; /* protected by namespace_sem */
> > > > >    static HLIST_HEAD(unmounted);  /* protected by namespace_sem */
> > > > >    static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
> > > > >    static DEFINE_SEQLOCK(mnt_ns_tree_lock);
> > > > > @@ -1794,6 +1795,7 @@ static void namespace_unlock(void)
> > > > >           struct hlist_head head;
> > > > >           struct hlist_node *p;
> > > > >           struct mount *m;
> > > > > +       unsigned long unmount_seq = rcu_unmount_seq;
> > > > >           LIST_HEAD(list);
> > > > > 
> > > > >           hlist_move_list(&unmounted, &head);
> > > > > @@ -1817,7 +1819,7 @@ static void namespace_unlock(void)
> > > > >           if (likely(hlist_empty(&head)))
> > > > >                   return;
> > > > > 
> > > > > -       synchronize_rcu_expedited();
> > > > > +       cond_synchronize_rcu_expedited(unmount_seq);
> > > > > 
> > > > >           hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
> > > > >                   hlist_del(&m->mnt_umount);
> > > > > @@ -1939,6 +1941,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
> > > > >                    */
> > > > >                   mnt_notify_add(p);
> > > > >           }
> > > > > +
> > > > > +       rcu_unmount_seq = get_state_synchronize_rcu();
> > > > >    }
> > > > > 
> > > > >    static void shrink_submounts(struct mount *mnt);
> > > > > 
> > > > > 
> > > > > I'm not sure how much that would buy us. If it doesn't then it should be
> > > > > possible to play with the following possibly strange idea:
> > > > > 
> > > > > diff --git a/fs/mount.h b/fs/mount.h
> > > > > index 7aecf2a60472..51b86300dc50 100644
> > > > > --- a/fs/mount.h
> > > > > +++ b/fs/mount.h
> > > > > @@ -61,6 +61,7 @@ struct mount {
> > > > >                   struct rb_node mnt_node; /* node in the ns->mounts rbtree */
> > > > >                   struct rcu_head mnt_rcu;
> > > > >                   struct llist_node mnt_llist;
> > > > > +               unsigned long mnt_rcu_unmount_seq;
> > > > >           };
> > > > >    #ifdef CONFIG_SMP
> > > > >           struct mnt_pcp __percpu *mnt_pcp;
> > > > > diff --git a/fs/namespace.c b/fs/namespace.c
> > > > > index d9ca80dcc544..aae9df75beed 100644
> > > > > --- a/fs/namespace.c
> > > > > +++ b/fs/namespace.c
> > > > > @@ -1794,6 +1794,7 @@ static void namespace_unlock(void)
> > > > >           struct hlist_head head;
> > > > >           struct hlist_node *p;
> > > > >           struct mount *m;
> > > > > +       bool needs_synchronize_rcu = false;
> > > > >           LIST_HEAD(list);
> > > > > 
> > > > >           hlist_move_list(&unmounted, &head);
> > > > > @@ -1817,7 +1818,16 @@ static void namespace_unlock(void)
> > > > >           if (likely(hlist_empty(&head)))
> > > > >                   return;
> > > > > 
> > > > > -       synchronize_rcu_expedited();
> > > > > +       hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
> > > > > +               if (!poll_state_synchronize_rcu(m->mnt_rcu_unmount_seq))
> > > > > +                       continue;
> > This has a bug. This needs to be:
> > 
> > 	/* A grace period has already elapsed. */
> > 	if (poll_state_synchronize_rcu(m->mnt_rcu_unmount_seq))
> > 		continue;
> > 
> > 	/* Oh oh, we have to pay up. */
> > 	needs_synchronize_rcu = true;
> > 	break;
> > 
> > which I'm pretty sure will eradicate most of the performance gain you've
> > seen, because fundamentally the two versions shouldn't be different. (Note,
> > I drafted this while on my way out the door.)
> > 
> > I would test the following version where we pay the cost of the
> > smp_mb() from poll_state_synchronize_rcu() exactly one time:
> > 
> > diff --git a/fs/mount.h b/fs/mount.h
> > index 7aecf2a60472..51b86300dc50 100644
> > --- a/fs/mount.h
> > +++ b/fs/mount.h
> > @@ -61,6 +61,7 @@ struct mount {
> >                  struct rb_node mnt_node; /* node in the ns->mounts rbtree */
> >                  struct rcu_head mnt_rcu;
> >                  struct llist_node mnt_llist;
> > +               unsigned long mnt_rcu_unmount_seq;
> >          };
> >   #ifdef CONFIG_SMP
> >          struct mnt_pcp __percpu *mnt_pcp;
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index d9ca80dcc544..dd367c54bc29 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -1794,6 +1794,7 @@ static void namespace_unlock(void)
> >          struct hlist_head head;
> >          struct hlist_node *p;
> >          struct mount *m;
> > +       unsigned long mnt_rcu_unmount_seq = 0;
> >          LIST_HEAD(list);
> > 
> >          hlist_move_list(&unmounted, &head);
> > @@ -1817,7 +1818,10 @@ static void namespace_unlock(void)
> >          if (likely(hlist_empty(&head)))
> >                  return;
> > 
> > -       synchronize_rcu_expedited();
> > +       hlist_for_each_entry_safe(m, p, &head, mnt_umount)
> > +               mnt_rcu_unmount_seq = max(m->mnt_rcu_unmount_seq, mnt_rcu_unmount_seq);
> > +
> > +       cond_synchronize_rcu_expedited(mnt_rcu_unmount_seq);
> > 
> >          hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
> >                  hlist_del(&m->mnt_umount);
> > @@ -1923,8 +1927,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
> >                          }
> >                  }
> >                  change_mnt_propagation(p, MS_PRIVATE);
> > -               if (disconnect)
> > +               if (disconnect) {
> > +                       p->mnt_rcu_unmount_seq = get_state_synchronize_rcu();
> >                          hlist_add_head(&p->mnt_umount, &unmounted);
> > +               }
> > 
> >                  /*
> >                   * At this point p->mnt_ns is NULL, notification will be queued
> > 
> > If this doesn't help, I had considered recording the rcu sequence number
> > in the mounts during __legitimize_mnt(). But we likely can't do that,
> > because get_state_synchronize_rcu() is expensive (it inserts an smp_mb())
> > and that would likely be noticeable during path lookup. This would also
> > hinge on the notion that the last store of the rcu sequence number is
> > guaranteed to be seen when we check it in namespace_unlock().
> > 
> > Another possibly insane idea (haven't fully thought it out but throwing
> > it out there to test): allocate a percpu counter for each mount and
> > increment it each time we enter __legitimize_mnt() and decrement it when
> > we leave __legitimize_mnt(). During umount_tree() check the percpu sum
> > for each mount after it's been added to the @unmounted list.
> 
> I had been thinking that a completion in the mount with a counter (say
> walker_cnt) could be used. Because the mounts are unhashed there won't
> be new walks, so if/once the count is 0 the last walker could call
> complete() and wait_for_completion() would replace the rcu sync
> completely. The catch is that managing walker_cnt correctly could be
> racy or expensive.
> 
> I thought this would not be received too well due to the additional fields

Path walking absolutely has to be as fast as possible; unmounting
doesn't. Anything that writes to a shared field from e.g.
__legitimize_mnt() will cause cacheline ping-pong and will very likely
be noticeable. And people care about even slight decreases in
performance there. That's why we have the percpu counter and why I
didn't even consider something like the completion stuff.
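
As an aside for anyone following along: the polled grace-period API the
diffs above build on boils down to taking a cookie when the object
becomes unreachable to new RCU readers and only paying for a grace
period later if one hasn't already elapsed since that cookie. A minimal
sketch with a made-up foo object (not the mount code):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        unsigned long rcu_seq;  /* cookie from get_state_synchronize_rcu() */
        /* ... payload ... */
};

/* Called once @f is unreachable to new RCU readers (e.g. unhashed). */
static void foo_retire(struct foo *f)
{
        f->rcu_seq = get_state_synchronize_rcu();
}

/* Called later, when we actually want to free @f. */
static void foo_free(struct foo *f)
{
        /*
         * No-op if a full grace period has already elapsed since the
         * cookie was recorded, otherwise blocks (expedited).
         */
        cond_synchronize_rcu_expedited(f->rcu_seq);
        kfree(f);
}

poll_state_synchronize_rcu() is the non-blocking variant used in the
hunk I corrected above; it only reports whether a grace period has
elapsed for a given cookie.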

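And for completeness, the "possibly insane" percpu idea above would look
roughly like the helpers below. This is an untested sketch: the
mnt_walkers field and the helper names are hypothetical (the field would
have to be added to struct mount and allocated with alloc_percpu(int)),
and it deliberately ignores the question of how the unmount side would
wait once it sees a non-zero sum.

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Hypothetical: entry/exit of the RCU-protected walk, e.g. around __legitimize_mnt(). */
static inline void mnt_walker_enter(struct mount *mnt)
{
        this_cpu_inc(*mnt->mnt_walkers);
}

static inline void mnt_walker_exit(struct mount *mnt)
{
        this_cpu_dec(*mnt->mnt_walkers);
}

/* After the mount has been unhashed: can any walker still see it? */
static bool mnt_has_walkers(struct mount *mnt)
{
        int cpu, sum = 0;

        for_each_possible_cpu(cpu)
                sum += *per_cpu_ptr(mnt->mnt_walkers, cpu);
        return sum != 0;
}

The percpu increments keep the fast path free of shared writes, which is
exactly the property path walking needs; the cost is moved to the sum
over all possible CPUs on the (slow) unmount side.
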
Thread overview: 47+ messages
2025-04-08 20:58 [PATCH v4] fs/namespace: defer RCU sync for MNT_DETACH umount Eric Chanudet
2025-04-09 10:37 ` Christian Brauner
2025-04-09 13:14   ` Sebastian Andrzej Siewior
2025-04-09 14:02     ` Mateusz Guzik
2025-04-09 14:25       ` Sebastian Andrzej Siewior
2025-04-09 16:04         ` Christian Brauner
2025-04-10  3:04           ` Ian Kent
2025-04-10  8:28           ` Sebastian Andrzej Siewior
2025-04-10 10:48             ` Christian Brauner
2025-04-10 13:58           ` Ian Kent
2025-04-11  2:36             ` Ian Kent
2025-04-09 16:08         ` Eric Chanudet
2025-04-11 15:17           ` Christian Brauner
2025-04-11 18:30             ` Eric Chanudet
2025-04-09 16:09     ` Christian Brauner
2025-04-10  1:17   ` Ian Kent
2025-04-09 13:04 ` Mateusz Guzik
2025-04-09 16:41   ` Eric Chanudet
2025-04-16 22:11 ` Mark Brown
2025-04-17  9:01   ` Christian Brauner
2025-04-17 10:17     ` Ian Kent
2025-04-17 11:31       ` Christian Brauner
2025-04-17 11:49         ` Mark Brown
2025-04-17 15:12         ` Christian Brauner
2025-04-17 15:28           ` Christian Brauner
2025-04-17 15:31             ` Sebastian Andrzej Siewior
2025-04-17 16:28               ` Christian Brauner
2025-04-17 22:33                 ` Eric Chanudet
2025-04-18  1:13                 ` Ian Kent
2025-04-18  1:20                   ` Ian Kent
2025-04-18  8:47                     ` Christian Brauner
2025-04-18 12:55                       ` Christian Brauner
2025-04-18 19:59                       ` Christian Brauner
2025-04-18 21:20                         ` Eric Chanudet
2025-04-19  1:24                       ` Ian Kent
2025-04-19 10:44                         ` Christian Brauner [this message]
2025-04-19 13:26                           ` Christian Brauner
2025-04-21  0:12                             ` Ian Kent
2025-04-21  0:44                               ` Al Viro
2025-04-18  0:31           ` Ian Kent
2025-04-18  8:59             ` Christian Brauner
2025-04-19  1:14               ` Ian Kent
2025-04-20  4:24           ` Al Viro
2025-04-20  5:54 ` Al Viro
2025-04-22 19:53   ` Eric Chanudet
2025-04-23  2:15     ` Al Viro
2025-04-23 15:04       ` Eric Chanudet
