From: Thomas Moestl <moestl@ibr.cs.tu-bs.de>
To: linux-kernel@vger.kernel.org
Subject: umount() and NFS races in 2.4.26
Date: Thu, 8 Jul 2004 20:07:09 +0200 [thread overview]
Message-ID: <20040708180709.GA7704@timesink.dyndns.org> (raw)
[-- Attachment #1: Type: text/plain, Size: 4298 bytes --]
Hi,
after deploying an SMP machine at work, we started to experience Oopses
in file-system related code relatively frequently. Investigation
revealed that they were caused by references using junk pointers from
freed super blocks via dangling inodes from unmounted file systems;
Oopses would always be preceded by the warning
VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
on an unmount (unmount activity is high on this machine due to heavy use
of the automounter). The predecessor to this machine, a UP system with
otherwise almost identical configuration, did never encounter such
problems, so I went looking for possible SMP races.
I believe that I have found two problems:
- The NFS async unlink code (fs/nfs/unlink.c) does keep a dentry for
later asynchronous processing, but the mount point is unbusied via
path_release() once sys_unlink() returns (fs/namei.c). While it
does a dget() on the dentry, this is insufficient to prevent an
umount(); when one would happen in the right time window, it seems
that it would initially get past the busy check:
if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH) {
(fs/namespace.c, do_umount()), but invalidate_inodes() in kill_super()
(fs/super.c) would then fail because of the inode referenced from
the dentry needed for the async unlink (which cannot be removed
by shrink_dcache_parent() because the NFS code did dget() it).
Please note that this problem is only conjectured, as it turned out
to not be our culprit. It looks not completely trivial to fix to me,
I believe it might require some changes that would affect other FS
implementations. It is not a true SMP race, if it exists it would
affect UP systems as well.
- There is a SMP race between the shrink_dcache_parent() (fs/dcache.c)
called from kill_super() and prune_dache() called via
shrink_dache_memory() (called by kswapd), as follows:
shrink_dache_parent() calls select_parent() to both prepare the LRU
list for purge_cache() and return the number of unused dcache
entries that can likely be removed in the next prune_dache() run.
If select_parent() returns 0, shrink_dcache_parent() assumes that
its work is done and returns. Now, assume for simplicity that there
are only two remaining dcache entries: one for "foo/bar" and for
the directory "foo/", which is referenced by the "foo/bar" entry.
Furthermore, prune_dcache() is currently running, called from kswapd,
and has decided to remove the "foo/bar" entry. To that end, it
calls prune_one_dentry(), which dentry_iput()s then "foo/bar" dentry.
This causes the dache_lock() to be dropped. Just now select_parent()
comes along, and can obtain the dcache_lock(). It now looks for unused
dentries; but there is only the "foo/" one left, which has a non-zero
d_count because "foo/bar" referenced it as parent, and
prune_one_dentry() did not yet have a chance to dput() it because it
has to wait for the dcache_lock(). Thus, select_parent() finds no
unused dentries, assumes that all is fine and does not purge any
more; the "foo/" entry can remain in the cache for much longer
because it may have DCACHE_REFERENCED set, so that the kswapd
purge_dcache() will leave it alone. The inode corresponding to the
dangling dcache entry will still be referenced, and end up dangling,
too. kill_super() will print the warning mentioned above.
When dentry or inode are touched again later (for example in another
purge_dcache() later on) we can end up accessing the super block
freed during the unmount, leading to an Oops.
This was partially verified by inspecting the dcache state via
/dev/kmem after the busy inodes warning had occured (the directory
dentry was not busy any more, but still remained in the unused list).
In the attached patch, I have used a semaphore to serialize purging
accesses to the dentry_unused list. With a kernel so patched, the
problem seems to have disappeared. The patch is just a quick hack,
the semantics of the semaphore is not really well-defined; but maybe
it can serve as a starting point.
I have not checked whether any of these issues pertain to the 2.6 series
as well.
- Thomas
P.S: please CC me in replies, as I am not subscribed to this list.
[-- Attachment #2: dcachesem.diff --]
[-- Type: text/plain, Size: 1226 bytes --]
--- dcache.c.orig Wed Jun 16 00:22:03 2004
+++ dcache.c Wed Jun 16 01:00:47 2004
@@ -51,6 +51,7 @@
static unsigned int d_hash_shift;
static struct list_head *dentry_hashtable;
static LIST_HEAD(dentry_unused);
+struct semaphore dcache_lru_sem;
/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {0, 0, 45, 0,};
@@ -381,6 +382,7 @@
struct list_head *tmp, *next;
struct dentry *dentry;
+ down(&dcache_lru_sem);
/*
* Pass one ... move the dentries for the specified
* superblock to the most recent end of the unused list.
@@ -416,6 +418,7 @@
goto repeat;
}
spin_unlock(&dcache_lock);
+ up(&dcache_lru_sem);
}
/*
@@ -548,8 +551,10 @@
{
int found;
+ down(&dcache_lru_sem);
while ((found = select_parent(parent)) != 0)
prune_dcache(found);
+ up(&dcache_lru_sem);
}
/*
@@ -581,9 +586,11 @@
if (!(gfp_mask & __GFP_FS))
return 0;
+ down(&dcache_lru_sem);
count = dentry_stat.nr_unused / priority;
prune_dcache(count);
+ up(&dcache_lru_sem);
return kmem_cache_shrink(dentry_cache);
}
@@ -1247,6 +1254,7 @@
d++;
i--;
} while (i);
+ sema_init(&dcache_lru_sem, 1);
}
static void init_buffer_head(void * foo, kmem_cache_t * cachep, unsigned long flags)
next reply other threads:[~2004-07-08 18:07 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2004-07-08 18:07 Thomas Moestl [this message]
2004-07-09 14:32 ` umount() and NFS races in 2.4.26 Marcelo Tosatti
2004-07-10 21:38 ` Trond Myklebust
2004-07-11 0:32 ` viro
2004-07-10 6:57 ` raven
2004-07-10 15:25 ` Greg Banks
2004-07-10 15:25 ` [autofs] " Greg Banks
2004-07-10 18:25 ` Thomas Moestl
2004-07-10 18:19 ` Thomas Moestl
2004-07-10 19:25 ` raven
2004-07-10 19:25 ` raven
2004-07-10 19:50 ` Thomas Moestl
2004-07-11 10:16 ` raven
2004-07-11 10:16 ` raven
-- strict thread matches above, loose matches on Subject: below --
2004-07-09 15:00 James Pearson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20040708180709.GA7704@timesink.dyndns.org \
--to=moestl@ibr.cs.tu-bs.de \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.