public inbox for linux-fsdevel@vger.kernel.org
From: Jeff Layton <jlayton@kernel.org>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Nikolay Borisov <nik.borisov@suse.com>,
	Max Kellermann <max.kellermann@ionos.com>,
	Eric Sandeen <sandeen@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>
Subject: Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
Date: Fri, 10 Apr 2026 07:56:12 -0400	[thread overview]
Message-ID: <ecd898202614340dced57f96a22542d4d6dbbf72.camel@kernel.org> (raw)
In-Reply-To: <ff5771a38b1977dc419c92cd466df03581e9bd69.camel@kernel.org>

On Fri, 2026-04-10 at 07:18 -0400, Jeff Layton wrote:
> On Fri, 2026-04-10 at 09:48 +0100, Al Viro wrote:
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> > 
> > lock_for_kill() losing a race to eviction by another thread will end up
> > returning false (correctly - the caller should *not* evict dentry in
> > that case), with ->d_lock held and rcu read-critical area still unbroken,
> > so the caller can safely drop the locks, provided that it's done in the
> > right order - ->d_lock should be dropped before rcu_read_unlock().
> > 
> > Unfortunately, 3 of 4 callers did it the other way round and for two of them
> > (finish_dput() and shrink_kill()) that ended up with a possibility of UAF.
> > The third (shrink_dentry_list()) had been safe for other reasons (dentry being
> > on the shrink list => the other thread wouldn't even schedule its freeing
> > until observing dentry off the shrink list while holding ->d_lock), but it's
> > simpler to make it a general rule - in case of lock_for_kill() failure
> > ->d_lock must be dropped before rcu_read_unlock().
> > 
> > UAF scenario was basically this:
> > 	dput() drops the last reference to foo/bar and evicts it.
> > 	The reference it held to foo happens to be the last one.
> > 	When trylock on ->i_lock of parent's inode fails, we
> > 	(under rcu_read_lock()) drop ->d_lock and take the locks in
> > 	the right order; while we'd been spinning on ->i_lock somebody
> > 	else comes and evicts the parent.
> > 	Noticing non-zero (negative) refcount we decide there's nothing
> > 	left to do.  And had we only dropped parent's ->d_lock before
> > 	rcu_read_unlock(), everything would be fine.  Unfortunately,
> > 	doing that the other way round allows rcu-scheduled freeing of
> > 	parent to proceed before we drop its ->d_lock.
> > The same applies if instead of final dput() of foo/bar it gets evicted
> > by shrink_dcache_parent() or memory pressure, again taking out the last
> > reference to that used to pin its parent.
> > 
> > Fixes: 339e9e13530b ("don't try to cut corners in shrink_lock_dentry()")
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 0c8faeee02e2..b1c163f20db6 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -751,6 +751,8 @@ static struct dentry *__dentry_kill(struct dentry *dentry)
> >   *
> >   * Return false if dentry is busy.  Otherwise, return true and have
> >   * that dentry's inode locked.
> > + *
> > + * On failure the caller must drop ->d_lock *before* rcu_read_unlock()
> >   */
> >  
> >  static bool lock_for_kill(struct dentry *dentry)
> > @@ -933,8 +935,8 @@ static void finish_dput(struct dentry *dentry)
> >  		}
> >  		rcu_read_lock();
> >  	}
> > -	rcu_read_unlock();
> >  	spin_unlock(&dentry->d_lock);
> > +	rcu_read_unlock();
> >  }
> >  
> >  /* 
> > @@ -1193,11 +1195,12 @@ static inline void shrink_kill(struct dentry *victim)
> >  	do {
> >  		rcu_read_unlock();
> >  		victim = __dentry_kill(victim);
> > +		if (!victim)
> > +			return;
> >  		rcu_read_lock();
> > -	} while (victim && lock_for_kill(victim));
> > +	} while (lock_for_kill(victim));
> > +	spin_unlock(&victim->d_lock);
> >  	rcu_read_unlock();
> > -	if (victim)
> > -		spin_unlock(&victim->d_lock);
> >  }
> >  
> >  void shrink_dentry_list(struct list_head *list)
> > @@ -1210,10 +1213,10 @@ void shrink_dentry_list(struct list_head *list)
> >  		rcu_read_lock();
> >  		if (!lock_for_kill(dentry)) {
> >  			bool can_free;
> > -			rcu_read_unlock();
> >  			d_shrink_del(dentry);
> >  			can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
> >  			spin_unlock(&dentry->d_lock);
> > +			rcu_read_unlock();
> >  			if (can_free)
> >  				dentry_free(dentry);
> >  			continue;
> 
> I'm a little skeptical that this explains the UAF we've seen. It looks
> like this would mostly affect the parent dentry rather than the
> children on the shrink list, but parents can be children too, so we
> can't rule it out.
> 
> Either way, releasing the rcu_read_lock() protecting a dentry you're
> holding a lock on seems like a bad idea.
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>

FWIW, here's Claude's review:

  Summary
  -------

  The patch fixes a use-after-free caused by incorrect lock ordering
  when lock_for_kill() returns false (race lost to another thread's
  eviction).  Three of four callers were calling rcu_read_unlock()
  before spin_unlock(&dentry->d_lock), which allowed RCU-scheduled
  freeing of the dentry to proceed while d_lock was still held on
  the freed object.
                                                                                                    
  Changes
  -------

  finish_dput() (lines 935-938):

    Swaps the unlock order so d_lock is dropped before rcu_read_unlock().
    This is reachable from dput() and d_make_discardable().  The UAF
    scenario described in the commit message applies directly here:
    dput() evicts foo/bar, the parent's refcount drops to zero,
    lock_for_kill() fails on the parent (trylock on i_lock failed and
    during the retry another thread evicted the parent), then
    rcu_read_unlock() lets the parent be freed while d_lock is still
    held on it.

  shrink_kill() (lines 1193-1203):

    Restructured to pull the !victim early return out of the while
    condition.  When __dentry_kill() returns NULL (parent's refcount
    didn't drop to zero), the function now returns immediately without
    touching RCU at all -- correct, since __dentry_kill() entered
    with rcu_read_unlock() already called.  When lock_for_kill() fails,
    the loop exits and d_lock is dropped before rcu_read_unlock().
    This is the path most likely responsible for the production crashes
    in T262643929: shrink_dentry_list() calls shrink_kill(), which
    walks up to parent dentries via __dentry_kill().
                                                                                                    
  shrink_dentry_list() (lines 1213-1218):

    Swaps the unlock order in the lock_for_kill() failure path.  As
    the commit message notes, this was already safe for other reasons:
    the dentry is on a shrink list (DCACHE_SHRINK_LIST set), so the
    other thread's __dentry_kill() would see that flag and set
    can_free = false, preventing dentry_free() from being called.
    Changed for consistency with the general rule.
                                                                                                    
  lock_for_kill() comment:

    Documents the ordering requirement: on failure, callers must drop
    d_lock before rcu_read_unlock().
                                                                                                    
  Production crash relevance
  --------------------------

  The shrink_kill() fix directly addresses the UAF pattern observed
  in three production vmcores (od0103.dkl2, od5175.prn6,
  twshared90354.17.frc2).  In all three, the crashed thread was in
  shrink_dentry_list() -> spin_lock() on a dentry whose slab page had
  been freed and reused by kmalloc-64 or kmalloc-96.

  The scenario: shrink_dentry_list() kills a dentry via shrink_kill(),
  __dentry_kill() returns the parent with d_lock held and refcount
  decremented to zero, lock_for_kill() fails on the parent because
  another thread raced in and evicted it.  With the old code, the
  rcu_read_unlock() in shrink_kill() allowed the parent's
  call_rcu-scheduled free to complete, and the subsequent
  spin_unlock(&victim->d_lock) touched freed memory.

  Whether this is the full explanation for the list corruption
  observed in the vmcores (LRU dentries spliced onto the dispose
  list) remains an open question.  The lock ordering fix prevents
  the immediate UAF on the parent dentry, but the mechanism by
  which entire LRU chains ended up on the dispose list may involve
  additional factors.

-- 
Jeff Layton <jlayton@kernel.org>


Thread overview: 42+ messages
2026-01-22 20:20 [PATCH][RFC] get rid of busy-wait in shrink_dcache_tree() Al Viro
2026-01-23  0:19 ` Linus Torvalds
2026-01-23  0:36   ` Al Viro
2026-01-24  4:36     ` Al Viro
2026-01-24  4:46       ` Linus Torvalds
2026-01-24  5:36         ` Al Viro
2026-01-24 17:45           ` Linus Torvalds
2026-01-24 18:43             ` Al Viro
2026-01-24 19:32               ` Linus Torvalds
2026-01-24 20:28                 ` Al Viro
2026-04-02 18:08 ` [RFC PATCH v2 0/4] getting rid of busy-wait in shrink_dcache_parent() Al Viro
2026-04-02 18:08   ` [RFC PATCH v2 1/4] for_each_alias(): helper macro for iterating through dentries of given inode Al Viro
2026-04-02 18:08   ` [RFC PATCH v2 2/4] struct dentry: make ->d_u anonymous Al Viro
2026-04-02 18:08   ` [RFC PATCH v2 3/4] dcache.c: more idiomatic "positives are not allowed" sanity checks Al Viro
2026-04-02 18:08   ` [RFC PATCH v2 4/4] get rid of busy-waiting in shrink_dcache_tree() Al Viro
2026-04-02 19:52     ` Linus Torvalds
2026-04-02 22:44       ` Al Viro
2026-04-02 22:49         ` Linus Torvalds
2026-04-02 23:16           ` Al Viro
2026-04-03  0:29             ` Linus Torvalds
2026-04-03  2:15               ` Al Viro
2026-04-04  0:02                 ` Al Viro
2026-04-04  0:04                   ` Linus Torvalds
2026-04-04 18:54                     ` Al Viro
2026-04-04 19:04                       ` Linus Torvalds
2026-04-05  0:04                         ` Al Viro
2026-04-02 20:28   ` [RFC PATCH v2 0/4] getting rid of busy-wait in shrink_dcache_parent() Paulo Alcantara
2026-04-03  4:46     ` Al Viro
2026-04-04  8:07 ` [RFC PATCH v3 " Al Viro
2026-04-04  8:07   ` [RFC PATCH v3 1/4] for_each_alias(): helper macro for iterating through dentries of given inode Al Viro
2026-04-04  8:07   ` [RFC PATCH v3 2/4] struct dentry: make ->d_u anonymous Al Viro
2026-04-04  8:07   ` [RFC PATCH v3 3/4] dcache.c: more idiomatic "positives are not allowed" sanity checks Al Viro
2026-04-04  8:07   ` [RFC PATCH v3 4/4] get rid of busy-waiting in shrink_dcache_tree() Al Viro
2026-04-09 16:51   ` [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent() Jeff Layton
2026-04-09 19:02     ` Al Viro
2026-04-09 20:10       ` Jeff Layton
2026-04-09 21:57         ` Al Viro
2026-04-09 22:38           ` Jeff Layton
2026-04-10  8:48           ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
2026-04-10 11:18             ` Jeff Layton
2026-04-10 11:56               ` Jeff Layton [this message]
2026-04-10 15:25             ` Linus Torvalds
