public inbox for linux-kernel@vger.kernel.org
* Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)
@ 2001-01-16 19:04 tytso
  2001-01-16 19:33 ` Andrea Arcangeli
  0 siblings, 1 reply; 4+ messages in thread
From: tytso @ 2001-01-16 19:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: alan, aviro


HJ Lu recently pointed me at a potential locking problem in
try_to_free_inodes(), and when I started prodding at it, I found what
appears to be another set of SMP locking issues in the dcache code.
(But if that were the case, why aren't we seeing huge numbers of
complaints?  So I'm wondering if I'm missing something.)

Anyway, the first problem which HJ pointed out is in
try_to_free_inodes() which attempts to implement a mutual exclusion with
respect to itself as follows....

	if (block)
	{
		struct wait_queue __wait;

		__wait.task = current;
		add_wait_queue(&wait_inode_freeing, &__wait);
		for (;;)
		{
			/* NOTE: we rely only on the inode_lock to be sure
			   to not miss the unblock event */
			current->state = TASK_UNINTERRUPTIBLE;
			spin_unlock(&inode_lock);
			schedule();
			spin_lock(&inode_lock);
			if (!block)
				break;
		}
		remove_wait_queue(&wait_inode_freeing, &__wait);
		current->state = TASK_RUNNING;
	}
	block = 1;

Of course, this is utterly unsafe on an SMP machine, since access to
the "block" variable isn't protected at all.  So the first question is
why did whoever implemented this do it in this horribly complicated way
above, instead of something simple, say like this:

	static struct semaphore block = MUTEX;
	if (down_trylock(&block)) {
		spin_unlock(&inode_lock);
		down(&block);
		spin_lock(&inode_lock);
	}

(with the appropriate unlocking code at the end of the function).


Next question.... why was this there in the first place?  After all,
most of the time try_to_free_inodes() is called with the inode_lock
spinlock held, which would act as an automatic mutual exclusion anyway.
The only time this doesn't happen is when we call prune_dcache(), where
inode_lock is temporarily dropped.

So I took a look at prune_dcache(), and discovered that (a) it's called
from multiple places, and (b) it and shrink_dcache_sb() both iterate over
dentry_unused and, among other things, try to free dcache structures
without any kind of locking to prevent two kernel threads
from mucking with the contents of dentry_unused at the same time, and
possibly having prune_one_dentry() being called by two processes on the
same dentry structure.  This should almost certainly cause problems.

So I think the following patch is definitely necessary, assuming that
the "block" mutual exclusion in try_to_free_inodes() needed to be there
in the first place.  I'm not so sure about needing a patch to protect
access to dentry_unused in fs/dcache.c, but unless there are some
other unstated locking hierarchy rules about when it's safe to call
prune_dcache() and shrink_dcache_sb(), I suspect we need to add a lock
to protect dentry_unused as well.  Comments?

						- Ted

--- fs/inode.c.old	Fri Sep 29 17:37:01 2000
+++ fs/inode.c	Tue Jan 16 09:29:06 2001
@@ -380,36 +380,19 @@
  */
 static void try_to_free_inodes(int goal)
 {
-	static int block;
-	static struct wait_queue * wait_inode_freeing;
+	static struct semaphore block = MUTEX;
 	LIST_HEAD(throw_away);
 	
 	/* We must make sure to not eat the inodes
 	   while the blocker task sleeps otherwise
 	   the blocker task may find none inode
 	   available. */
-	if (block)
-	{
-		struct wait_queue __wait;
-
-		__wait.task = current;
-		add_wait_queue(&wait_inode_freeing, &__wait);
-		for (;;)
-		{
-			/* NOTE: we rely only on the inode_lock to be sure
-			   to not miss the unblock event */
-			current->state = TASK_UNINTERRUPTIBLE;
-			spin_unlock(&inode_lock);
-			schedule();
-			spin_lock(&inode_lock);
-			if (!block)
-				break;
-		}
-		remove_wait_queue(&wait_inode_freeing, &__wait);
-		current->state = TASK_RUNNING;
+	if (down_trylock(&block)) {
+		spin_unlock(&inode_lock);
+		down(&block);
+		spin_lock(&inode_lock);
 	}
 
-	block = 1;
 	/*
 	 * First stry to just get rid of unused inodes.
 	 *
@@ -427,8 +410,7 @@
 	}
 	if (!list_empty(&throw_away))
 		dispose_list(&throw_away);
-	block = 0;
-	wake_up(&wait_inode_freeing);
+	up(&block);
 }
 
 /*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

* Re: Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)
  2001-01-16 19:04 Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c) tytso
@ 2001-01-16 19:33 ` Andrea Arcangeli
  2001-01-16 20:10   ` tytso
  0 siblings, 1 reply; 4+ messages in thread
From: Andrea Arcangeli @ 2001-01-16 19:33 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, alan, aviro

On Tue, Jan 16, 2001 at 11:04:45AM -0800, Theodore Y. Ts'o wrote:
> 
> HJ Lu recently pointed me at a potential locking problem in
> try_to_free_inodes(), and when I started prodding at it, I found what
> appears to be another set of SMP locking issues in the dcache code.
> (But if that were the case, why aren't we seeing huge numbers of
> complaints?  So I'm wondering if I'm missing something.)

Because the code is correct ;).  It is in fact a fix, and it took some time
to fix that bug in mid-2.2.x.

> 
> Anyway, the first problem which HJ pointed out is in
> try_to_free_inodes() which attempts to implement a mutual exclusion with
> respect to itself as follows....
> 
> 	if (block)
> 	{
> 		struct wait_queue __wait;
> 
> 		__wait.task = current;
> 		add_wait_queue(&wait_inode_freeing, &__wait);
> 		for (;;)
> 		{
> 			/* NOTE: we rely only on the inode_lock to be sure
> 			   to not miss the unblock event */
> 			current->state = TASK_UNINTERRUPTIBLE;
> 			spin_unlock(&inode_lock);
			^^^^^^^^^^^^^^^^^^^^^^^^
> 			schedule();
> 			spin_lock(&inode_lock);
			^^^^^^^^^^^^^^^^^^^^^^^^
> 			if (!block)
> 				break;
> 		}
> 		remove_wait_queue(&wait_inode_freeing, &__wait);
> 		current->state = TASK_RUNNING;
> 	}
> 	block = 1;
> 
> Of course, this is utterly unsafe on an SMP machine, since access to
> the "block" variable isn't protected at all.  So the first question is

Wrong, it's obviously protected by the inode_lock.  And even if it weren't
protected by the inode_lock, in 2.2.x inode.c and dcache.c run globally
serialized under the BKL (but it is obviously protected regardless of the BKL).

> why did whoever implemented this do it in this horribly complicated way
> above, instead of something simple, say like this:
> 
> 	static struct semaphore block = MUTEX;
> 	if (down_trylock(&block)) {
> 		spin_unlock(&inode_lock);
> 		down(&block);
> 		spin_lock(&inode_lock);
> 	}

The above is overkill (there's no need to use a further atomic API when we
can rely on the inode_lock for the locking).  It's overcomplex and slower.

> (with the appropriate unlocking code at the end of the function).
> 
> 
> Next question.... why was this there in the first place?  After all,

To fix the "inode-max limit reached" failures that you could reproduce on
earlier 2.2.x (the freed inodes were re-used before the task that freed them
had a chance to allocate them for itself).

> most of the time try_to_free_inodes() is called with the inode_lock
> spinlock held, which would act as an automatic mutual exclusion anyway.

> The only time this doesn't happen is when we call prune_dcache(), where
> inode_lock is temporarily dropped.

Wrong:

		spin_unlock(&inode_lock);
		prune_dcache(0, goal);
		spin_lock(&inode_lock);
		sync_all_inodes();
		__free_inodes(&throw_away);

the above code obviously drops the spinlock for doing things like flushing the
dirty inodes to the buffer cache, which can block in balance_dirty() etc.
(and that's the real problem: because it sleeps, the BKL also gets released,
while prune_dcache in practice could not race because of the BKL).

> So I took a look at prune_dcache(), and discovered that (a) it's called
> from multiple places, and (b) it and shrink_dcache_sb() both iterate over
> dentry_unused and, among other things, try to free dcache structures
> without any kind of locking to prevent two kernel threads
> from mucking with the contents of dentry_unused at the same time, and
> possibly having prune_one_dentry() being called by two processes on the
> same dentry structure.  This should almost certainly cause problems.

We're running under the BKL all over the place in 2.2.x, so they can't race.

> So the following patch I think is definitely necessary, assuming that

The patch is definitely not necessary.

Andrea

* Re: Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)
  2001-01-16 19:33 ` Andrea Arcangeli
@ 2001-01-16 20:10   ` tytso
  2001-01-16 23:05     ` Andrea Arcangeli
  0 siblings, 1 reply; 4+ messages in thread
From: tytso @ 2001-01-16 20:10 UTC (permalink / raw)
  To: andrea; +Cc: linux-kernel, alan, aviro

   Date: Tue, 16 Jan 2001 20:33:34 +0100
   From: Andrea Arcangeli <andrea@suse.de>

   > Of course, this is utterly unsafe on an SMP machine, since access to
   > the "block" variable isn't protected at all.  So the first question is

   Wrong, it's obviously protected by the inode_lock. And even if it wasn't
   protected by the inode_lock in 2.2.x inode.c and dcache.c runs globally
   serialized by the BKL (but it is obviously protected regardless of the BKL).

Yes, you're quite right.  The fact that you have to hold inode_lock
before you call try_to_free_inodes() would protect the "block"
variable.

   > 	static struct semaphore block = MUTEX;
   > 	if (down_trylock(&block)) {
   > 		spin_unlock(&inode_lock);
   > 		down(&block);
   > 		spin_lock(&inode_lock);
   > 	}

   The above is overkill (there's no need to use further atomic API,
   when we can rely on the inode_lock for the locking. It's overcomplex
   and slower. 

Actually, looking at the fast path of down_trylock() compared to the huge
mess of code that's currently there, I suspect that using
down_trylock() would actually be faster, since in the fast-path case
there would be only two assembly-language instructions, whereas the code
that's currently there is (a) much more complicated, and thus harder to
understand, and (b) many more instructions to execute.

Sometimes the simplest approach is the best.....

   > (with the appropriate unlocking code at the end of the function).
   > 
   > Next question.... why was this there in the first place?  After all,

   To fix the "inode-max limit reached" failures that you could reproduce on
   earlier 2.2.x (the freed inodes were re-used before the task that freed them
   had a chance to allocate them for itself).

Ah, OK.  Well, we're currently tracking down a slow inode leak which
only happens on SMP machines, especially our mailhubs.  It's gradual,
but if you don't reboot the machine before you run out of inodes, it
will print the "inode-max limit reached" message, and shortly after
that the entire machine locks up until someone can come in and
hit the Big Red Button.  Monitoring the machine before it locks up, we
noted that the number of inodes in use was continually climbing until
the machine died.  (Yeah, we could put a reboot command into crontab,
but you should only need to do hacks like that on Windows NT machines.
:-)

We're running a reasonably recent 2.2 kernel on our machines, and the
problem is only showing up on SMP machines, so there's *some* kind of
SMP race hiding in the 2.2 inode code....

						- Ted

* Re: Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)
  2001-01-16 20:10   ` tytso
@ 2001-01-16 23:05     ` Andrea Arcangeli
  0 siblings, 0 replies; 4+ messages in thread
From: Andrea Arcangeli @ 2001-01-16 23:05 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, alan, aviro

On Tue, Jan 16, 2001 at 12:10:31PM -0800, Theodore Y. Ts'o wrote:
> Actually, looking at the fast path of down_trylock() compared to the huge
> mess of code that's currently there, I suspect that using
> down_trylock() would actually be faster, since in the fast-path case
> there would be only two assembly-language instructions, whereas the code

The fast path of the current code never hits the "huge mess" (aka a wait_event
done with a serializing spinlock).  The main difference is that down() uses
atomic logic, while `int block' runs out of order and is serialized only
by the spinlock.  I personally prefer the current version because serializing
instructions are usually more expensive, and I don't consider the wait-event
interface a "huge mess"; it makes perfect sense there IMHO.

> Ah, OK.  Well, we're currently tracking down a slow inode leak which
> only happens on SMP machines, especially our mailhubs.  It's gradual,

How long does it take to reproduce?  BTW, I assume you can also reproduce
this with the latest vanilla 2.2.x kernels.

I suspect that the problem is a caller that is missing an iput or dput (it
may also be a userspace application; just check that nothing gets fixed by
SYSRQ+I before using the Big Red Button).

Note also that killing all apps won't decrease the number of allocated
inodes by even one.  The inodes allocated will _never_ shrink (that's why we
need the hard inode-max limit).  Deleting all in-use inodes is the only way to
have them show up in the freelist (and they still won't be freed, but at least
you'll see that they're not leaked somewhere ;).  So the debugging isn't very
friendly unless you play with the sources (2.4.x is much better).

> but if you don't reboot the machine before you run out of inodes, it
> will print the "inode-max limit reached" message, and shortly after
> that the entire machine locks up until someone can come in and
> hit the Big Red Button.  Monitoring the machine before it locks up, we

If SYSRQ+I doesn't help, can you try to reproduce with vanilla 2.2.19pre7aa1
(which also sets inode-max and the inode hash to a reasonable value for big
boxes, and fixes the inode-hash function to avoid huge collisions)?

Recent 2.2.xaa kernels are running on the same heavily loaded SMP boxes that
were able to reproduce the common-code inode leak we had in mid-2.2.x.  They
haven't shown problems since the last fix (which is the one we discussed in
this thread).  So I tend to believe this could be a bug in some non-common or
unofficial piece of code, or in userspace.  But hey, this is just a guess.

> the machine died.  (Yeah, we could put a reboot command into crontab,
> but you should only need to do hacks like that on Windows NT machines.
> :-)

;)

Andrea
