* shrink_dcache_sb scalability problem.
From: David Chinner @ 2006-04-13  8:22 UTC
To: linux-kernel; +Cc: linux-fsdevel

Folks,

After recently upgrading a build machine to 2.6.16, we started seeing
10-50s pauses where the machine would appear to hang. Profiles showed
that we were spending a substantial amount of time in shrink_dcache_sb,
and several CPUs were spinning on the dcache_lock. This is happening
quite frequently - we recorded a 10 minute period in which there were 13
incidents where a touch/rm of a single file took longer than 10s. The
machine was close to unusable when this happened.

At the time of the problem the machine had several million unused cached
dentries in memory (often > 10 million), and the builds use chroot
environments with internally mounted filesystems like /proc and /sys.

The problem is that whenever we mount /proc, /sys, /dev/pts, etc., we
call shrink_dcache_sb(), which does multiple traversals across the
unused dentry list with the dcache_lock held. It is trivial to reduce
this to one traversal for the case of a new mount. However, that doesn't
solve the real issue: we are walking a linked list of many million
entries with a global lock held, holding out everyone else.

We're open to any suggestions on how to go about fixing this, as it's
not obvious what the correct approach is. Any advice, patches, etc. are
more than welcome.

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
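For context on the traversal cost: the shrink_dcache_sb() of that era makes two passes over the single global dentry_unused list under dcache_lock - one to segregate the target superblock's dentries, one to free them - so each pass visits every unused dentry in the system, not just the ones being freed. The pattern can be modelled with a toy userspace list (hypothetical `struct dentry` stand-in; none of the kernel's locking is reproduced):

```c
#include <stdlib.h>

/* Toy stand-in for a dentry on the global unused list: just the
 * superblock it belongs to and a forward link. */
struct dentry {
	int sb;
	struct dentry *next;
};

/* Push a newly unused dentry onto the global list. */
static struct dentry *dentry_push(struct dentry *head, int sb)
{
	struct dentry *d = malloc(sizeof(*d));
	d->sb = sb;
	d->next = head;
	return d;
}

/* Model of the two-pass shrink: returns how many list nodes were
 * visited in total, i.e. the work done while the global lock is held. */
static long shrink_sb(struct dentry **head, int sb)
{
	long visits = 0;

	/* Pass one: the kernel moves this superblock's dentries to one
	 * end of the list; either way every node on the list is visited. */
	for (struct dentry *d = *head; d != NULL; d = d->next)
		visits++;

	/* Pass two: walk the whole list again, unlinking and freeing
	 * only the dentries that belong to the superblock being shrunk. */
	for (struct dentry **pp = head; *pp != NULL; ) {
		visits++;
		if ((*pp)->sb == sb) {
			struct dentry *dead = *pp;
			*pp = dead->next;
			free(dead);
		} else {
			pp = &(*pp)->next;
		}
	}
	return visits;
}
```

With a million unused dentries belonging to a busy build filesystem, shrinking a freshly mounted /proc-style superblock that owns almost none of them still costs two million node visits - all while everyone else contends on the global lock.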
* Re: shrink_dcache_sb scalability problem.
From: Andrew Morton @ 2006-04-13  8:52 UTC
To: David Chinner; +Cc: linux-kernel, linux-fsdevel

David Chinner <dgc@sgi.com> wrote:
>
> After recently upgrading a build machine to 2.6.16, we started
> seeing 10-50s pauses where the machine would appear to hang.

This sounds like the recent thread "Avoid excessive time spend on
concurrent slab shrinking" over on linux-mm.  Have you read through that?

http://marc.theaimsgroup.com/?l=linux-mm&r=1&b=200603&w=2
http://marc.theaimsgroup.com/?l=linux-mm&r=3&b=200604&w=2

It ended up somewhat inconclusive, but it looks like we do have a bit of
a problem, and it got exacerbated by an XFS slowness.
* Re: shrink_dcache_sb scalability problem.
From: David Chinner @ 2006-04-14  3:43 UTC
To: Andrew Morton; +Cc: David Chinner, linux-kernel, linux-fsdevel

On Thu, Apr 13, 2006 at 01:52:57AM -0700, Andrew Morton wrote:
> David Chinner <dgc@sgi.com> wrote:
> >
> > After recently upgrading a build machine to 2.6.16, we started seeing
> > 10-50s pauses where the machine would appear to hang.
>
> This sounds like the recent thread "Avoid excessive time spend on
> concurrent slab shrinking" over on linux-mm.  Have you read through that?
>
> http://marc.theaimsgroup.com/?l=linux-mm&r=1&b=200603&w=2
> http://marc.theaimsgroup.com/?l=linux-mm&r=3&b=200604&w=2

Yes, I even made comments directly in that thread, and it really wasn't
a problem with the slab shrinking infrastructure. It was (obvious to us
XFS folk) just another XFS inode caching scalability problem that this
machine has uncovered over the past few years.

> It ended up somewhat inconclusive, but it looks like we do have a bit
> of a problem, and it got exacerbated by an XFS slowness.

I've already fixed that problem with:

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1fc5d959d88a5f77aa7e4435f6c9d0e2d2236704

and the machine showing the shrink_dcache_sb() problems is already
running that fix. That problem masked the shrink_dcache_sb one - who
notices a 10s hang when the machine has been really, really slow for 20
minutes?

So this is not (directly) related to reclaim of inodes or dentries. It
can be seen during reclaim of dentries if someone is mounting or
unmounting a filesystem at the same time, but fundamentally it's a
result of a large number of cached dentries sitting on a single list
protected by a single lock, and having to walk that list atomically.

Given the complexity of the dcache, I really don't know enough about it,
or have the time available to invest in learning all I need to know
about it, to solve the problem. That's why I'm asking for help from the
experts....

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
* Re: shrink_dcache_sb scalability problem.
From: Andrew Morton @ 2006-04-14  5:23 UTC
To: David Chinner; +Cc: dgc, linux-kernel, linux-fsdevel, Dipankar Sarma

David Chinner <dgc@sgi.com> wrote:
>
> On Thu, Apr 13, 2006 at 01:52:57AM -0700, Andrew Morton wrote:
> > David Chinner <dgc@sgi.com> wrote:
> > >
> > > After recently upgrading a build machine to 2.6.16, we started seeing
> > > 10-50s pauses where the machine would appear to hang.
> >
> > This sounds like the recent thread "Avoid excessive time spend on
> > concurrent slab shrinking" over on linux-mm.  Have you read through that?
> >
> > http://marc.theaimsgroup.com/?l=linux-mm&r=1&b=200603&w=2
> > http://marc.theaimsgroup.com/?l=linux-mm&r=3&b=200604&w=2
>
> Yes, I even made comments directly in the thread and it really
> wasn't a problem with the slab shrinking infrastructure. It
> was (obvious to us XFS folk) just another XFS inode caching
> scalability problem that this machine has uncovered over
> the past few years.

OK.

> > It ended up somewhat inconclusive, but it looks like we do have a bit
> > of a problem, and it got exacerbated by an XFS slowness.
>
> I've already fixed that problem with:
>
> http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1fc5d959d88a5f77aa7e4435f6c9d0e2d2236704
>
> and the machine showing the shrink_dcache_sb() problems is
> already running that fix. That problem masked the shrink_dcache_sb
> one - who notices a 10s hang when the machine has been really,
> really slow for 20 minutes?

So the problem is shrink_dcache_sb() and not regular memory reclaim?

What is happening on that machine to be triggering shrink_dcache_sb()?
automounts?

We fixed a similar problem in the inode cache a year or so back by
creating per-superblock inode lists, so that
search-a-global-list-for-objects-belonging-to-this-superblock thing went
away.  Presumably we could fix this in the same manner.  But adding two
more pointers to struct dentry would hurt.

An alternative might be to remove the global LRU altogether and make it
per-superblock.  That would reduce the quality of the LRUing, but that
probably wouldn't hurt a lot.

Another idea would be to take shrinker_sem for writing when running
shrink_dcache_sb() - that would prevent tasks from coming in and getting
stuck on dcache_lock, but there are plenty of other places which want
dcache_lock.

I don't immediately see any simple tweaks which would allow us to avoid
that long lock hold time.  Perhaps the scanning in shrink_dcache_sb()
could use just rcu_read_lock()...

OT, I'm a bit curious about this:

		list_del_init(tmp);
		spin_lock(&dentry->d_lock);
		if (atomic_read(&dentry->d_count)) {
			spin_unlock(&dentry->d_lock);
			continue;
		}

So we rip the dentry off dcache_unused and just leave it floating about?
Dipankar, do you remember why that change was made, and why it's not a
bug?
* Re: shrink_dcache_sb scalability problem.
From: Bharata B Rao @ 2006-04-15  5:25 UTC
To: Andrew Morton; +Cc: David Chinner, linux-kernel, linux-fsdevel, Dipankar Sarma

On 4/14/06, Andrew Morton <akpm@osdl.org> wrote:
> David Chinner <dgc@sgi.com> wrote:
> >
> > On Thu, Apr 13, 2006 at 01:52:57AM -0700, Andrew Morton wrote:
> > > David Chinner <dgc@sgi.com> wrote:
> > > >
> > > > After recently upgrading a build machine to 2.6.16, we started seeing
> > > > 10-50s pauses where the machine would appear to hang.
> > >
> > > This sounds like the recent thread "Avoid excessive time spend on
> > > concurrent slab shrinking" over on linux-mm.  Have you read through that?
> > >
> > > http://marc.theaimsgroup.com/?l=linux-mm&r=1&b=200603&w=2
> > > http://marc.theaimsgroup.com/?l=linux-mm&r=3&b=200604&w=2
> >
> > Yes, I even made comments directly in the thread and it really
> > wasn't a problem with the slab shrinking infrastructure. It
> > was (obvious to us XFS folk) just another XFS inode caching
> > scalability problem that this machine has uncovered over
> > the past few years.
>
> OK.
>
> > > It ended up somewhat inconclusive, but it looks like we do have a bit
> > > of a problem, and it got exacerbated by an XFS slowness.
> >
> > I've already fixed that problem with:
> >
> > http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1fc5d959d88a5f77aa7e4435f6c9d0e2d2236704
> >
> > and the machine showing the shrink_dcache_sb() problems is
> > already running that fix. That problem masked the shrink_dcache_sb
> > one - who notices a 10s hang when the machine has been really,
> > really slow for 20 minutes?
>
> So the problem is shrink_dcache_sb() and not regular memory reclaim?
>
> What is happening on that machine to be triggering shrink_dcache_sb()?
> automounts?
>
> We fixed a similar problem in the inode cache a year or so back by
> creating per-superblock inode lists, so that
> search-a-global-list-for-objects-belonging-to-this-superblock thing went
> away.  Presumably we could fix this in the same manner.  But adding two
> more pointers to struct dentry would hurt.
>
> An alternative might be to remove the global LRU altogether and make it
> per-superblock.  That would reduce the quality of the LRUing, but that
> probably wouldn't hurt a lot.
>
> Another idea would be to take shrinker_sem for writing when running
> shrink_dcache_sb() - that would prevent tasks from coming in and getting
> stuck on dcache_lock, but there are plenty of other places which want
> dcache_lock.
>
> I don't immediately see any simple tweaks which would allow us to avoid
> that long lock hold time.  Perhaps the scanning in shrink_dcache_sb()
> could use just rcu_read_lock()...
>
> OT, I'm a bit curious about this:
>
> 		list_del_init(tmp);
> 		spin_lock(&dentry->d_lock);
> 		if (atomic_read(&dentry->d_count)) {
> 			spin_unlock(&dentry->d_lock);
> 			continue;
> 		}
>
> So we rip the dentry off dcache_unused and just leave it floating about?
> Dipankar, do you remember why that change was made, and why it's not a
> bug?

Due to lazy updating of the LRU list, there can be some dentries with
non-zero ref counts on the LRU list. This is one of the places where
such dentries are removed from the LRU list. (Basically, such dentries
will be on both the hash list and the LRU list, and here they get
removed from the LRU list.)

Regards,
Bharata.
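The point above - a nonzero d_count found on the LRU just means the lazy update hasn't caught up yet - is why the pruning code uses a check-then-recheck pattern: snoop the refcount cheaply, then confirm it under d_lock before actually freeing, since a concurrent lookup can take a reference between the two reads. A userspace sketch of that pattern (hypothetical miniature dentry; a C11 atomic_flag spinlock stands in for the kernel's d_lock):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A toy spinlock standing in for the kernel's per-dentry d_lock. */
typedef struct { atomic_flag held; } toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the holder clears the flag */
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
	atomic_flag_clear(&l->held);
}

/* Miniature dentry: a refcount plus an "on the LRU" flag. */
struct mini_dentry {
	atomic_int d_count;
	bool on_lru;
	toy_spinlock_t d_lock;
};

/* Try to reclaim one dentry taken off the LRU.  Mirrors the two-step
 * check: a lockless snoop of d_count first, to skip obviously busy
 * dentries without taking any lock, then a definitive re-check under
 * d_lock, because a lookup may have raced in between the two reads. */
static bool try_prune(struct mini_dentry *d)
{
	if (atomic_load(&d->d_count) != 0)
		return false;		/* busy: leave it alone, lock-free */

	toy_spin_lock(&d->d_lock);
	if (atomic_load(&d->d_count) != 0) {
		toy_spin_unlock(&d->d_lock);
		return false;		/* raced with a lookup: skip it */
	}
	d->on_lru = false;		/* the kernel would free it here */
	toy_spin_unlock(&d->d_lock);
	return true;
}
```

The cheap first check is what keeps a busy dentry on the LRU from stalling the scan; only plausibly-free dentries pay the cost of lock acquisition.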
* Re: shrink_dcache_sb scalability problem.
From: Andrew Morton @ 2006-04-15  5:53 UTC
To: Bharata B Rao; +Cc: dgc, linux-kernel, linux-fsdevel, dipankar

"Bharata B Rao" <bharata.rao@gmail.com> wrote:
>
> > OT, I'm a bit curious about this:
> >
> > 		list_del_init(tmp);
> > 		spin_lock(&dentry->d_lock);
> > 		if (atomic_read(&dentry->d_count)) {
> > 			spin_unlock(&dentry->d_lock);
> > 			continue;
> > 		}
> >
> > So we rip the dentry off dcache_unused and just leave it floating about?
> > Dipankar, do you remember why that change was made, and why it's not a bug?
>
> Due to lazy updating of the LRU list, there can be some dentries with
> non-zero ref counts on the LRU list. This is one of the places where
> such dentries are removed from the LRU list. (Basically, such dentries
> will be on both the hash list and the LRU list, and here they get
> removed from the LRU list.)

OK.  But what guarantees that these live-but-detached dentries are
appropriately destroyed before the unmount completes?

Or... if these dentries will be freed by RCU callbacks potentially after
the unmount, are we sure that they will always be in a state which will
permit them to be freed?
* Re: shrink_dcache_sb scalability problem.
From: Bharata B Rao @ 2006-04-18  5:57 UTC
To: Andrew Morton; +Cc: dgc, linux-kernel, linux-fsdevel, dipankar

On 4/15/06, Andrew Morton <akpm@osdl.org> wrote:
> "Bharata B Rao" <bharata.rao@gmail.com> wrote:
> >
> > > OT, I'm a bit curious about this:
> > >
> > > 		list_del_init(tmp);
> > > 		spin_lock(&dentry->d_lock);
> > > 		if (atomic_read(&dentry->d_count)) {
> > > 			spin_unlock(&dentry->d_lock);
> > > 			continue;
> > > 		}
> > >
> > > So we rip the dentry off dcache_unused and just leave it floating about?
> > > Dipankar, do you remember why that change was made, and why it's not a bug?
> >
> > Due to lazy updating of the LRU list, there can be some dentries with
> > non-zero ref counts on the LRU list. This is one of the places where
> > such dentries are removed from the LRU list. (Basically, such dentries
> > will be on both the hash list and the LRU list, and here they get
> > removed from the LRU list.)
>
> OK.  But what guarantees that these live-but-detached dentries are
> appropriately destroyed before the unmount completes?

These are live dentries but not really detached; they are still attached
to the hash list.  And I don't see shrink_dcache_sb holding up the
umount because of these dentries; I assume dput of such dentries will
happen from different paths.  But I am not sure we could even end up in
this situation, where we have landed in shrink_dcache_sb from the
unmount path while there are still some in-use dentries present.  Need
some clarification here.

> Or... if these dentries will be freed by RCU callbacks potentially after
> the unmount, are we sure that they will always be in a state which will
> permit them to be freed?

When a dentry is queued for RCU freeing, there will be no references to
it from anywhere.  It wouldn't be on the hash list or on the LRU list.
So I would think only those dentries which are really freeable are
queued for RCU freeing.

I see that shrink_dcache_sb is being called from the remount path
(do_remount_sb).  I couldn't understand why we do this.  AFAICS we
anyway don't modify the mountpoint or the mount root during remount.
Wouldn't it be advantageous to leave those dentries on the LRU?

Regards,
Bharata.
* Re: shrink_dcache_sb scalability problem.
From: David Chinner @ 2006-04-18  0:29 UTC
To: Andrew Morton; +Cc: David Chinner, linux-kernel, linux-fsdevel, Dipankar Sarma

On Thu, Apr 13, 2006 at 10:23:25PM -0700, Andrew Morton wrote:
> David Chinner <dgc@sgi.com> wrote:
> > I've already fixed that problem with:
> >
> > http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1fc5d959d88a5f77aa7e4435f6c9d0e2d2236704
> >
> > and the machine showing the shrink_dcache_sb() problems is
> > already running that fix. That problem masked the shrink_dcache_sb
> > one - who notices a 10s hang when the machine has been really,
> > really slow for 20 minutes?
>
> So the problem is shrink_dcache_sb() and not regular memory reclaim?

*nod*

> What is happening on that machine to be triggering shrink_dcache_sb()?
> automounts?

That was what I first suspected, as the machine has lots of automounted
directories. However, breaking into kdb when the problem occurred showed
mounts of sysfs and /dev/pts.

From what I've been told, the chroot package build environment being
used (not something we control) mounts /proc, /sys, /dev/pts, etc. when
the chroot is entered, and unmounts them after the build is complete.
Hence for every package build an engineer kicks off, we get multiple
mounts and unmounts occurring.

> We fixed a similar problem in the inode cache a year or so back by
> creating per-superblock inode lists, so that
> search-a-global-list-for-objects-belonging-to-this-superblock thing
> went away.  Presumably we could fix this in the same manner.  But
> adding two more pointers to struct dentry would hurt.

*nod* I can't see any fields we'd be able to overload, either.

> An alternative might be to remove the global LRU altogether and make
> it per-superblock.  That would reduce the quality of the LRUing, but
> that probably wouldn't hurt a lot.

That makes reclaim an interesting problem. You'd have to walk the
superblock list to get to each LRU, and then how would you ensure that
you fairly reclaim inodes from each filesystem?

> Another idea would be to take shrinker_sem for writing when running
> shrink_dcache_sb() - that would prevent tasks from coming in and
> getting stuck on dcache_lock, but there are plenty of other places
> which want dcache_lock.

Yeah, the contention we are seeing is from all those other places, so I
can't see how taking the shrinker semaphore really helps us here.

> I don't immediately see any simple tweaks which would allow us to
> avoid that long lock hold time.  Perhaps the scanning in
> shrink_dcache_sb() could use just rcu_read_lock()...

We're modifying the list as we scan it, so I can't see how we can do
this without an exclusive lock.

The other thing that I thought of over the weekend is per-node LRU
lists and a lock per node. This will reduce the length of the lists,
allow some parallelism even while we scan and purge each list using the
existing algorithm, and not completely destroy the LRU-ness of the
dcache. It would also allow parallelising prune_dcache() and allow the
shrinker to prune the local node first (i.e. where we really need the
memory).

FWIW, this is a showstopper for us. The only thing that is allowing us
to keep running a recent kernel on this machine is the fact that someone
is running `echo 3 > /proc/sys/vm/drop_caches` as soon as the slowdown
manifests to blow away the dentry cache....

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
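The per-node idea is essentially lock and list sharding: each dentry already implies a NUMA node through the page it was allocated in (Dave's follow-up patch derives it with page_zone(virt_to_page(dentry))->zone_pgdat), so the one global list and lock can be split into one list and lock per node. A userspace analogue that shards by object address (hypothetical names; NR_SHARDS stands in for the node count, and the per-shard lock is elided since this sketch is single-threaded):

```c
#include <stddef.h>
#include <stdint.h>

#define NR_SHARDS 4	/* stands in for the number of NUMA nodes */

struct lru_node {
	struct lru_node *next;
};

/* One unused-list per shard; in the kernel each pg_data_t would also
 * carry its own spinlock so shards can be scanned in parallel. */
struct lru_shard {
	struct lru_node *head;
	long nr;
};

static struct lru_shard shards[NR_SHARDS];

/* Map an object to its shard by address, the way the patch maps a
 * dentry to the pgdat of the page it lives in. */
static struct lru_shard *shard_of(const void *obj)
{
	return &shards[((uintptr_t)obj / 64) % NR_SHARDS];
}

static void lru_add(struct lru_node *n)
{
	struct lru_shard *s = shard_of(n);	/* shard lock taken here */
	n->next = s->head;
	s->head = n;
	s->nr++;
}

/* Shrinking one shard touches only that shard's list (and lock); the
 * other shards stay available to concurrent dput()/lookup traffic. */
static long lru_shrink_shard(struct lru_shard *s)
{
	long freed = 0;
	while (s->head) {
		s->head = s->head->next;
		s->nr--;
		freed++;
	}
	return freed;
}
```

The trade-off is the one Dave notes: each list is shorter and independently lockable, at the cost of LRU ordering that is only approximate across shards.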
* [RFC][PATCH] Re: shrink_dcache_sb scalability problem.
From: David Chinner @ 2006-04-18 14:37 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel, Dipankar Sarma

On Tue, Apr 18, 2006 at 10:29:28AM +1000, David Chinner wrote:
>
> The other thing that I thought of over the weekend is per-node LRU
> lists and a lock per node. This will reduce the length of the lists,
> allow some parallelism even while we scan and purge each list
> using the existing algorithm, and not completely destroy the LRU-ness
> of the dcache.

Here's a patch that does that. It's rough, but it boots and I've run
some basic tests against it. It doesn't survive dbench with 200
processes, though, as it crashes in prune_one_dentry() with a corrupted
d_child. The logic in prune_dcache() is pretty grotesque atm, and I
doubt it's correct. Comments on how to fix it are welcome ;)

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

Make the dentry unused lists per-node.

Signed-off-by: Dave Chinner <dgc@sgi.com>

Index: 2.6.x-xfs-new/fs/dcache.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/dcache.c	2006-04-18 10:35:10.000000000 +1000
+++ 2.6.x-xfs-new/fs/dcache.c	2006-04-18 22:23:04.603552947 +1000
@@ -62,7 +62,6 @@
 static kmem_cache_t *dentry_cache;
 static unsigned int d_hash_mask;
 static unsigned int d_hash_shift;
 static struct hlist_head *dentry_hashtable;
-static LIST_HEAD(dentry_unused);
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
@@ -114,6 +113,72 @@ static void dentry_iput(struct dentry *
 	}
 }
 
+static void
+dentry_unused_add(struct dentry *dentry)
+{
+	pg_data_t *pgdat = page_zone(virt_to_page(dentry))->zone_pgdat;
+
+	spin_lock(&pgdat->dentry_unused_lock);
+	list_add(&dentry->d_lru, &pgdat->dentry_unused);
+	spin_unlock(&pgdat->dentry_unused_lock);
+	dentry_stat.nr_unused++;
+}
+
+static void
+dentry_unused_add_tail(struct dentry *dentry)
+{
+	pg_data_t *pgdat = page_zone(virt_to_page(dentry))->zone_pgdat;
+
+	spin_lock(&pgdat->dentry_unused_lock);
+	list_add_tail(&dentry->d_lru, &pgdat->dentry_unused);
+	spin_unlock(&pgdat->dentry_unused_lock);
+	dentry_stat.nr_unused++;
+}
+
+/*
+ * Assumes external locks are already held
+ */
+static void
+dentry_unused_move(struct dentry *dentry, struct list_head *head)
+{
+	list_del(&dentry->d_lru);
+	list_add(&dentry->d_lru, head);
+}
+
+static void
+dentry_unused_del(struct dentry *dentry)
+{
+	if (!list_empty(&dentry->d_lru)) {
+		pg_data_t *pgdat = page_zone(virt_to_page(dentry))->zone_pgdat;
+
+		spin_lock(&pgdat->dentry_unused_lock);
+		list_del(&dentry->d_lru);
+		spin_unlock(&pgdat->dentry_unused_lock);
+		dentry_stat.nr_unused--;
+	}
+}
+
+static inline void
+__dentry_unused_del_init(struct dentry *dentry)
+{
+	if (likely(!list_empty(&dentry->d_lru)))
+		list_del_init(&dentry->d_lru);
+
+}
+
+static void
+dentry_unused_del_init(struct dentry *dentry)
+{
+	if (!list_empty(&dentry->d_lru)) {
+		pg_data_t *pgdat = page_zone(virt_to_page(dentry))->zone_pgdat;
+
+		spin_lock(&pgdat->dentry_unused_lock);
+		__dentry_unused_del_init(dentry);
+		spin_unlock(&pgdat->dentry_unused_lock);
+		dentry_stat.nr_unused--;
+	}
+}
+
 /*
  * This is dput
  *
@@ -173,8 +238,7 @@ repeat:
 		goto kill_it;
 	if (list_empty(&dentry->d_lru)) {
 		dentry->d_flags |= DCACHE_REFERENCED;
-		list_add(&dentry->d_lru, &dentry_unused);
-		dentry_stat.nr_unused++;
+		dentry_unused_add(dentry);
 	}
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
@@ -186,13 +250,8 @@ unhash_it:
 kill_it: {
 		struct dentry *parent;
 
-		/* If dentry was on d_lru list
-		 * delete it from there
-		 */
-		if (!list_empty(&dentry->d_lru)) {
-			list_del(&dentry->d_lru);
-			dentry_stat.nr_unused--;
-		}
+		/* If dentry was on d_lru list delete it from there */
+		dentry_unused_del(dentry);
 		list_del(&dentry->d_u.d_child);
 		dentry_stat.nr_dentry--;	/* For d_free, below */
 		/*drops the locks, at that point nobody can reach this dentry */
@@ -268,10 +327,7 @@ int d_invalidate(struct dentry * dentry)
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
 	atomic_inc(&dentry->d_count);
-	if (!list_empty(&dentry->d_lru)) {
-		dentry_stat.nr_unused--;
-		list_del_init(&dentry->d_lru);
-	}
+	dentry_unused_del_init(dentry);
 	return dentry;
 }
 
@@ -392,42 +448,73 @@ static inline void prune_one_dentry(stru
 
 static void prune_dcache(int count)
 {
-	spin_lock(&dcache_lock);
-	for (; count ; count--) {
-		struct dentry *dentry;
-		struct list_head *tmp;
-
-		cond_resched_lock(&dcache_lock);
-
-		tmp = dentry_unused.prev;
-		if (tmp == &dentry_unused)
-			break;
-		list_del_init(tmp);
-		prefetch(dentry_unused.prev);
-		dentry_stat.nr_unused--;
-		dentry = list_entry(tmp, struct dentry, d_lru);
+	int node_id = numa_node_id();
+	int scan_low = 0;
+	int c = count;
+	pg_data_t *pgdat;
+
+rescan:
+	for_each_pgdat(pgdat) {
+		if (!scan_low) {
+			if (pgdat->node_id < node_id)
+				continue;
+		} else {
+			if (pgdat->node_id >= node_id)
+				break;
+		}
+		for (c = count; c ; c--) {
+			struct dentry *dentry;
+			struct list_head *tmp;
+			spin_lock(&pgdat->dentry_unused_lock);
+
+			tmp = pgdat->dentry_unused.prev;
+			if (tmp == &pgdat->dentry_unused) {
+				spin_unlock(&pgdat->dentry_unused_lock);
+				break;
+			}
+			dentry = list_entry(tmp, struct dentry, d_lru);
+			__dentry_unused_del_init(dentry);
+			prefetch(&pgdat->dentry_unused.prev);
+			spin_unlock(&pgdat->dentry_unused_lock);
 
-		spin_lock(&dentry->d_lock);
+			spin_lock(&dcache_lock);
+			spin_lock(&dentry->d_lock);
+			/*
+			 * We found an inuse dentry which was not removed from
+			 * dentry_unused because of laziness during lookup or
+			 * a dentry that has just been put back on the unused
+			 * list. Do not free it - just leave it where it is.
+			 */
+			if (atomic_read(&dentry->d_count) ||
+			    !list_empty(&dentry->d_lru)) {
+				spin_unlock(&dentry->d_lock);
+				spin_unlock(&dcache_lock);
+				continue;
+			}
+			/* If the dentry was recently referenced, don't free it. */
+			if (dentry->d_flags & DCACHE_REFERENCED) {
+				dentry->d_flags &= ~DCACHE_REFERENCED;
+				dentry_unused_add(dentry);
+				spin_unlock(&dentry->d_lock);
+				spin_unlock(&dcache_lock);
+				continue;
+			}
+			prune_one_dentry(dentry);
+			spin_unlock(&dcache_lock);
+		}
 		/*
-		 * We found an inuse dentry which was not removed from
-		 * dentry_unused because of laziness during lookup. Do not free
-		 * it - just keep it off the dentry_unused list.
+		 * shrink_parent needs to scan each list, and if it only
+		 * calls in with one count then we may never find it. So
+		 * if count == 1, scan each list once.
 		 */
-		if (atomic_read(&dentry->d_count)) {
-			spin_unlock(&dentry->d_lock);
-			continue;
-		}
-		/* If the dentry was recently referenced, don't free it. */
-		if (dentry->d_flags & DCACHE_REFERENCED) {
-			dentry->d_flags &= ~DCACHE_REFERENCED;
-			list_add(&dentry->d_lru, &dentry_unused);
-			dentry_stat.nr_unused++;
-			spin_unlock(&dentry->d_lock);
-			continue;
-		}
-		prune_one_dentry(dentry);
+		if (count == 1)
+			c = 1;
+		if (!c)
+			break;
 	}
-	spin_unlock(&dcache_lock);
+	if (c && scan_low++ == 0)
+		goto rescan;
+}
 
 /*
@@ -456,39 +543,66 @@ void shrink_dcache_sb(struct super_block
 {
 	struct list_head *tmp, *next;
 	struct dentry *dentry;
+	pg_data_t *pgdat;
+	int found;
 
-	/*
-	 * Pass one ... move the dentries for the specified
-	 * superblock to the most recent end of the unused list.
-	 */
-	spin_lock(&dcache_lock);
-	list_for_each_safe(tmp, next, &dentry_unused) {
-		dentry = list_entry(tmp, struct dentry, d_lru);
-		if (dentry->d_sb != sb)
-			continue;
-		list_del(tmp);
-		list_add(tmp, &dentry_unused);
-	}
+	for_each_pgdat(pgdat) {
+		found = 0;
+		spin_lock(&pgdat->dentry_unused_lock);
+		/*
+		 * Pass one ... move the dentries for the specified
+		 * superblock to the most recent end of the unused list.
+		 */
+		list_for_each_safe(tmp, next, &pgdat->dentry_unused) {
+			dentry = list_entry(tmp, struct dentry, d_lru);
+			if (dentry->d_sb != sb)
+				continue;
+			dentry_unused_move(dentry, &pgdat->dentry_unused);
+			found++;
+		}
+		spin_unlock(&pgdat->dentry_unused_lock);
 
-	/*
-	 * Pass two ... free the dentries for this superblock.
-	 */
-repeat:
-	list_for_each_safe(tmp, next, &dentry_unused) {
-		dentry = list_entry(tmp, struct dentry, d_lru);
-		if (dentry->d_sb != sb)
-			continue;
-		dentry_stat.nr_unused--;
-		list_del_init(tmp);
-		spin_lock(&dentry->d_lock);
-		if (atomic_read(&dentry->d_count)) {
-			spin_unlock(&dentry->d_lock);
+		/*
+		 * Pass two ... free the dentries for this superblock.
+		 * use the output of the first pass to determine if we need
+		 * to run this pass.
+		 */
+		if (!found)
 			continue;
+repeat:
+		spin_lock(&pgdat->dentry_unused_lock);
+		list_for_each_safe(tmp, next, &pgdat->dentry_unused) {
+			dentry = list_entry(tmp, struct dentry, d_lru);
+			if (dentry->d_sb != sb)
+				continue;
+			__dentry_unused_del_init(dentry);
+
+			/*
+			 * We snoop on the d_count here so we can skip
+			 * dentries we can obviously not free right now
+			 * without dropping the list lock. This prevents us
+			 * from getting stuck on an in-use dentry on the unused
+			 * list.
+			 */
+			if (atomic_read(&dentry->d_count))
+				continue;
+
+			spin_unlock(&pgdat->dentry_unused_lock);
+			spin_lock(&dcache_lock);
+			spin_lock(&dentry->d_lock);
+			if (atomic_read(&dentry->d_count) ||
+			    (dentry->d_sb != sb) ||
+			    !list_empty(&dentry->d_lru)) {
+				spin_unlock(&dentry->d_lock);
+				spin_unlock(&dcache_lock);
+				goto repeat;
+			}
+			prune_one_dentry(dentry);
+			spin_unlock(&dcache_lock);
+			goto repeat;
 		}
-		prune_one_dentry(dentry);
-		goto repeat;
+		spin_unlock(&pgdat->dentry_unused_lock);
 	}
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -572,17 +686,13 @@ resume:
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
-		if (!list_empty(&dentry->d_lru)) {
-			dentry_stat.nr_unused--;
-			list_del_init(&dentry->d_lru);
-		}
+		dentry_unused_del_init(dentry);
 		/*
 		 * move only zero ref count dentries to the end
 		 * of the unused list for prune_dcache
 		 */
 		if (!atomic_read(&dentry->d_count)) {
-			list_add(&dentry->d_lru, dentry_unused.prev);
-			dentry_stat.nr_unused++;
+			dentry_unused_add_tail(dentry);
 			found++;
 		}
 
@@ -657,18 +767,14 @@ void shrink_dcache_anon(struct hlist_hea
 		spin_lock(&dcache_lock);
 		hlist_for_each(lp, head) {
 			struct dentry *this = hlist_entry(lp, struct dentry, d_hash);
-			if (!list_empty(&this->d_lru)) {
-				dentry_stat.nr_unused--;
-				list_del_init(&this->d_lru);
-			}
+			dentry_unused_del_init(this);
 			/*
 			 * move only zero ref count dentries to the end
 			 * of the unused list for prune_dcache
 			 */
 			if (!atomic_read(&this->d_count)) {
-				list_add_tail(&this->d_lru, &dentry_unused);
-				dentry_stat.nr_unused++;
+				dentry_unused_add_tail(this);
 				found++;
 			}
 		}
@@ -1673,6 +1779,12 @@ static void __init dcache_init_early(voi
 static void __init dcache_init(unsigned long mempages)
 {
 	int loop;
+	pg_data_t *pgdat;
+
+	for_each_pgdat(pgdat) {
+		spin_lock_init(&pgdat->dentry_unused_lock);
+		INIT_LIST_HEAD(&pgdat->dentry_unused);
+	}
 
 	/*
 	 * A constructor could be added for stable state like the lists,
Index: 2.6.x-xfs-new/include/linux/mmzone.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/mmzone.h	2006-02-06 11:57:55.000000000 +1100
+++ 2.6.x-xfs-new/include/linux/mmzone.h	2006-04-18 18:04:22.378952121 +1000
@@ -311,6 +311,8 @@ typedef struct pglist_data {
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+	spinlock_t dentry_unused_lock;
+	struct list_head dentry_unused;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)