umount() and NFS races in 2.4.26

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Thomas Moestl <moestl@ibr.cs.tu-bs.de>
To: linux-kernel@vger.kernel.org
Subject: umount() and NFS races in 2.4.26
Date: Thu, 8 Jul 2004 20:07:09 +0200	[thread overview]
Message-ID: <20040708180709.GA7704@timesink.dyndns.org> (raw)

[-- Attachment #1: Type: text/plain, Size: 4298 bytes --]

Hi,

after deploying an SMP machine at work, we started to experience Oopses
in file-system related code relatively frequently. Investigation
revealed that they were caused by references using junk pointers from
freed super blocks via dangling inodes from unmounted file systems;
Oopses would always be preceded by the warning
  VFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice day...
on an unmount (unmount activity is high on this machine due to heavy use
of the automounter). The predecessor to this machine, a UP system with
otherwise almost identical configuration, did never encounter such
problems, so I went looking for possible SMP races.

I believe that I have found two problems:

- The NFS async unlink code (fs/nfs/unlink.c) does keep a dentry for
  later asynchronous processing, but the mount point is unbusied via
  path_release() once sys_unlink() returns (fs/namei.c). While it
  does a dget() on the dentry, this is insufficient to prevent an
  umount(); when one would happen in the right time window, it seems
  that it would initially get past the busy check:
	if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH) {
  (fs/namespace.c, do_umount()), but invalidate_inodes() in kill_super()
  (fs/super.c) would then fail because of the inode referenced from
  the dentry needed for the async unlink (which cannot be removed
  by shrink_dcache_parent() because the NFS code did dget() it).

  Please note that this problem is only conjectured, as it turned out
  to not be our culprit. It looks not completely trivial to fix to me,
  I believe it might require some changes that would affect other FS
  implementations. It is not a true SMP race, if it exists it would
  affect UP systems as well.

- There is a SMP race between the shrink_dcache_parent() (fs/dcache.c)
  called from kill_super() and prune_dache() called via
  shrink_dache_memory() (called by kswapd), as follows:
  shrink_dache_parent() calls select_parent() to both prepare the LRU
  list for purge_cache() and return the number of unused dcache
  entries that can likely be removed in the next prune_dache() run.
  If select_parent() returns 0, shrink_dcache_parent() assumes that
  its work is done and returns. Now, assume for simplicity that there
  are only two remaining dcache entries: one for "foo/bar" and for
  the directory "foo/", which is referenced by the "foo/bar" entry.
  Furthermore, prune_dcache() is currently running, called from kswapd,
  and has decided to remove the "foo/bar" entry. To that end, it
  calls prune_one_dentry(), which dentry_iput()s then "foo/bar" dentry.
  This causes the dache_lock() to be dropped. Just now select_parent()
  comes along, and can obtain the dcache_lock(). It now looks for unused
  dentries; but there is only the "foo/" one left, which has a non-zero
  d_count because "foo/bar" referenced it as parent, and
  prune_one_dentry() did not yet have a chance to dput() it because it
  has to wait for the dcache_lock(). Thus, select_parent() finds no
  unused dentries, assumes that all is fine and does not purge any
  more; the "foo/" entry can remain in the cache for much longer
  because it may have DCACHE_REFERENCED set, so that the kswapd
  purge_dcache() will leave it alone. The inode corresponding to the
  dangling dcache entry will still be referenced, and end up dangling,
  too. kill_super() will print the warning mentioned above.
  When dentry or inode are touched again later (for example in another
  purge_dcache() later on) we can end up accessing the super block
  freed during the unmount, leading to an Oops.
  This was partially verified by inspecting the dcache state via
  /dev/kmem after the busy inodes warning had occured (the directory
  dentry was not busy any more, but still remained in the unused list).

  In the attached patch, I have used a semaphore to serialize purging
  accesses to the dentry_unused list. With a kernel so patched, the
  problem seems to have disappeared. The patch is just a quick hack,
  the semantics of the semaphore is not really well-defined; but maybe
  it can serve as a starting point.
  
I have not checked whether any of these issues pertain to the 2.6 series
as well.

	- Thomas

P.S: please CC me in replies, as I am not subscribed to this list.


[-- Attachment #2: dcachesem.diff --]
[-- Type: text/plain, Size: 1226 bytes --]

--- dcache.c.orig	Wed Jun 16 00:22:03 2004
+++ dcache.c	Wed Jun 16 01:00:47 2004
@@ -51,6 +51,7 @@
 static unsigned int d_hash_shift;
 static struct list_head *dentry_hashtable;
 static LIST_HEAD(dentry_unused);
+struct semaphore dcache_lru_sem;
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {0, 0, 45, 0,};
@@ -381,6 +382,7 @@
 	struct list_head *tmp, *next;
 	struct dentry *dentry;
 
+	down(&dcache_lru_sem);
 	/*
 	 * Pass one ... move the dentries for the specified
 	 * superblock to the most recent end of the unused list.
@@ -416,6 +418,7 @@
 		goto repeat;
 	}
 	spin_unlock(&dcache_lock);
+	up(&dcache_lru_sem);
 }
 
 /*
@@ -548,8 +551,10 @@
 {
 	int found;
 
+	down(&dcache_lru_sem);
 	while ((found = select_parent(parent)) != 0)
 		prune_dcache(found);
+	up(&dcache_lru_sem);
 }
 
 /*
@@ -581,9 +586,11 @@
 	if (!(gfp_mask & __GFP_FS))
 		return 0;
 
+	down(&dcache_lru_sem);
 	count = dentry_stat.nr_unused / priority;
 
 	prune_dcache(count);
+	up(&dcache_lru_sem);
 	return kmem_cache_shrink(dentry_cache);
 }
 
@@ -1247,6 +1254,7 @@
 		d++;
 		i--;
 	} while (i);
+	sema_init(&dcache_lru_sem, 1);
 }
 
 static void init_buffer_head(void * foo, kmem_cache_t * cachep, unsigned long flags)

next             reply	other threads:[~2004-07-08 18:07 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-07-08 18:07 Thomas Moestl [this message]
2004-07-09 14:32 ` umount() and NFS races in 2.4.26 Marcelo Tosatti
2004-07-10 21:38   ` Trond Myklebust
2004-07-11  0:32     ` viro
2004-07-10  6:57 ` raven
2004-07-10 15:25   ` Greg Banks
2004-07-10 15:25     ` [autofs] " Greg Banks
2004-07-10 18:25     ` Thomas Moestl
2004-07-10 18:19   ` Thomas Moestl
2004-07-10 19:25     ` raven
2004-07-10 19:25       ` raven
2004-07-10 19:50       ` Thomas Moestl
2004-07-11 10:16     ` raven
2004-07-11 10:16       ` raven
  -- strict thread matches above, loose matches on Subject: below --
2004-07-09 15:00 James Pearson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20040708180709.GA7704@timesink.dyndns.org \
    --to=moestl@ibr.cs.tu-bs.de \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.