public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
To: Thomas Moestl <moestl@ibr.cs.tu-bs.de>
Cc: linux-kernel@vger.kernel.org,
	viro@parcelfarce.linux.theplanet.co.uk,
	Trond Myklebust <trond.myklebust@fys.uio.no>
Subject: Re: umount() and NFS races in 2.4.26
Date: Fri, 9 Jul 2004 11:32:42 -0300	[thread overview]
Message-ID: <20040709143242.GB11168@logos.cnet> (raw)
In-Reply-To: <20040708180709.GA7704@timesink.dyndns.org>

On Thu, Jul 08, 2004 at 08:07:09PM +0200, Thomas Moestl wrote:
> Hi,

Hi Thomas,

> after deploying an SMP machine at work, we started to experience Oopses
> in file-system related code relatively frequently. Investigation
> revealed that they were caused by references using junk pointers from
> freed super blocks via dangling inodes from unmounted file systems;
> Oopses would always be preceded by the warning
>   VFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice day...
> on an unmount (unmount activity is high on this machine due to heavy use
> of the automounter). The predecessor to this machine, a UP system with
> otherwise almost identical configuration, did never encounter such
> problems, so I went looking for possible SMP races.
> 
> I believe that I have found two problems:
> 
> - The NFS async unlink code (fs/nfs/unlink.c) does keep a dentry for
>   later asynchronous processing, but the mount point is unbusied via
>   path_release() once sys_unlink() returns (fs/namei.c). While it
>   does a dget() on the dentry, this is insufficient to prevent an
>   umount(); when one would happen in the right time window, it seems
>   that it would initially get past the busy check:
> 	if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH) {
>   (fs/namespace.c, do_umount()), but invalidate_inodes() in kill_super()
>   (fs/super.c) would then fail because of the inode referenced from
>   the dentry needed for the async unlink (which cannot be removed
>   by shrink_dcache_parent() because the NFS code did dget() it).
> 
>   Please note that this problem is only conjectured, as it turned out
>   to not be our culprit. It looks not completely trivial to fix to me,
>   I believe it might require some changes that would affect other FS
>   implementations. It is not a true SMP race, if it exists it would
>   affect UP systems as well.

Trond?

> - There is a SMP race between the shrink_dcache_parent() (fs/dcache.c)
>   called from kill_super() and prune_dache() called via
>   shrink_dache_memory() (called by kswapd), as follows:
>   shrink_dache_parent() calls select_parent() to both prepare the LRU
>   list for purge_cache() and return the number of unused dcache
>   entries that can likely be removed in the next prune_dache() run.
>   If select_parent() returns 0, shrink_dcache_parent() assumes that
>   its work is done and returns. Now, assume for simplicity that there
>   are only two remaining dcache entries: one for "foo/bar" and for
>   the directory "foo/", which is referenced by the "foo/bar" entry.
>   Furthermore, prune_dcache() is currently running, called from kswapd,
>   and has decided to remove the "foo/bar" entry. To that end, it
>   calls prune_one_dentry(), which dentry_iput()s then "foo/bar" dentry.
>   This causes the dache_lock() to be dropped. Just now select_parent()
>   comes along, and can obtain the dcache_lock(). It now looks for unused
>   dentries; but there is only the "foo/" one left, which has a non-zero
>   d_count because "foo/bar" referenced it as parent, and
>   prune_one_dentry() did not yet have a chance to dput() it because it
>   has to wait for the dcache_lock(). Thus, select_parent() finds no
>   unused dentries, assumes that all is fine and does not purge any
>   more; the "foo/" entry can remain in the cache for much longer
>   because it may have DCACHE_REFERENCED set, so that the kswapd
>   purge_dcache() will leave it alone. The inode corresponding to the
>   dangling dcache entry will still be referenced, and end up dangling,
>   too. kill_super() will print the warning mentioned above.
>   When dentry or inode are touched again later (for example in another
>   purge_dcache() later on) we can end up accessing the super block
>   freed during the unmount, leading to an Oops.
>   This was partially verified by inspecting the dcache state via
>   /dev/kmem after the busy inodes warning had occured (the directory
>   dentry was not busy any more, but still remained in the unused list).
> 
>   In the attached patch, I have used a semaphore to serialize purging
>   accesses to the dentry_unused list. With a kernel so patched, the
>   problem seems to have disappeared. The patch is just a quick hack,
>   the semantics of the semaphore is not really well-defined; but maybe
>   it can serve as a starting point.

This is a fastpath, adding a semaphore here is not a Good Thing from the
performance POV.

Maybe shrink_dcache_parent() when called from kill_super() could be more
picky and loop again when a used dentry is found? I feel that
something along these line could make the problem go away without the
need for a slow sleep-lock.

Thanks for the detailed description of the problem.

> I have not checked whether any of these issues pertain to the 2.6 series
> as well.
> 
> 	- Thomas
> 
> P.S: please CC me in replies, as I am not subscribed to this list.
> 

> --- dcache.c.orig	Wed Jun 16 00:22:03 2004
> +++ dcache.c	Wed Jun 16 01:00:47 2004
> @@ -51,6 +51,7 @@
>  static unsigned int d_hash_shift;
>  static struct list_head *dentry_hashtable;
>  static LIST_HEAD(dentry_unused);
> +struct semaphore dcache_lru_sem;
>  
>  /* Statistics gathering. */
>  struct dentry_stat_t dentry_stat = {0, 0, 45, 0,};
> @@ -381,6 +382,7 @@
>  	struct list_head *tmp, *next;
>  	struct dentry *dentry;
>  
> +	down(&dcache_lru_sem);
>  	/*
>  	 * Pass one ... move the dentries for the specified
>  	 * superblock to the most recent end of the unused list.
> @@ -416,6 +418,7 @@
>  		goto repeat;
>  	}
>  	spin_unlock(&dcache_lock);
> +	up(&dcache_lru_sem);
>  }
>  
>  /*
> @@ -548,8 +551,10 @@
>  {
>  	int found;
>  
> +	down(&dcache_lru_sem);
>  	while ((found = select_parent(parent)) != 0)
>  		prune_dcache(found);
> +	up(&dcache_lru_sem);
>  }
>  
>  /*
> @@ -581,9 +586,11 @@
>  	if (!(gfp_mask & __GFP_FS))
>  		return 0;
>  
> +	down(&dcache_lru_sem);
>  	count = dentry_stat.nr_unused / priority;
>  
>  	prune_dcache(count);
> +	up(&dcache_lru_sem);
>  	return kmem_cache_shrink(dentry_cache);
>  }
>  
> @@ -1247,6 +1254,7 @@
>  		d++;
>  		i--;
>  	} while (i);
> +	sema_init(&dcache_lru_sem, 1);
>  }
>  
>  static void init_buffer_head(void * foo, kmem_cache_t * cachep, unsigned long flags)


  reply	other threads:[~2004-07-09 15:04 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-07-08 18:07 umount() and NFS races in 2.4.26 Thomas Moestl
2004-07-09 14:32 ` Marcelo Tosatti [this message]
2004-07-10 21:38   ` Trond Myklebust
2004-07-11  0:32     ` viro
2004-07-10  6:57 ` raven
2004-07-10 15:25   ` [autofs] " Greg Banks
2004-07-10 18:25     ` Thomas Moestl
2004-07-10 18:19   ` Thomas Moestl
2004-07-10 19:25     ` raven
2004-07-10 19:50       ` Thomas Moestl
2004-07-11 10:16     ` raven
  -- strict thread matches above, loose matches on Subject: below --
2004-07-09 15:00 James Pearson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20040709143242.GB11168@logos.cnet \
    --to=marcelo.tosatti@cyclades.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=moestl@ibr.cs.tu-bs.de \
    --cc=trond.myklebust@fys.uio.no \
    --cc=viro@parcelfarce.linux.theplanet.co.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox