Results of my VFS scaling evaluation.

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Results of my VFS scaling evaluation.
@ 2010-10-08 23:32 Frank Mayhar
  2010-10-09  0:33 ` Frank Mayhar
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Frank Mayhar @ 2010-10-08 23:32 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: mrubin

Nick Piggin has been doing work on lock contention in VFS, in particular
to remove the dcache and inode locks, and we are very interested in this
work.  He has entirely eliminated two of the most contended locks,
replacing them with a combination of more granular locking, seqlocks,
RCU lists and other mechanisms that reduce locking and contention in
general. He has published this work at

git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git

As we have run into problems with lock contention, Google is very
interested in these improvements.

I’ve built two kernels, one an unmodified 2.6.35 based on the “master”
branch of Nick’s tree (which is identical to the main Linux tree as of
that time) and one with Nick’s changes based on the “vfs-scale” branch
of his tree.

For each of the kernels I ran a “socket test” and a “storage test,” each
of which I ran on systems with a moderate number of cores and memory
(unfortunately I can’t say more about the hardware).  I gathered test
results and kernel profiling data for each.

The "Socket Test" does a lot of socket operations; it fields lots of
connections, receiving and transmitting small amounts of data over each.
The application it emulates has run into bottlenecks on the dcache_lock
and the inode_lock several times in the past, which is why I chose it as
a target.

The "Storage Test" does more storage operations, doing some network
transfers but spending most of its time reading and writing the disk.  I
chose it for the obvious reason that any VFS changes will probably
directly affect its results.

Both of these tests are multithreaded with at least one thread per core
and are designed to put as much load on the application being tested as
possible.  They are in fact designed specifically to find performance
regressions (albeit at a higher level than the kernel), which makes them
very suitable for this testing.

Before going into details of the test results, however, I must say that
the most striking thing about Nick's work how stable it is.  In all of
the work I've been doing, all the kernels I've built and run and all the
tests I've run, I've run into no hangs and only one crash, that in an
area that we happen to stress very heavily, for which I posted a patch,
available at
 http://www.kerneltrap.org/mailarchive/linux-fsdevel/2010/9/27/6886943
The crash involved the fact that we use cgroups very heavily, and there
was an oversight in the new d_set_d_op() routine that failed to clear
flags before it set them.

The Socket Test

This test has a target rate which I'll refer to as 100%.  Internal
Google kernels (with modifications to specific code paths) allow the
test to generally achieve that rate, albeit not without substantial
effort.  Against the base 2.6.35 kernel I saw a rate of around 65.28%.
Against the modified kernel the rate was around 65.2%.  The difference,
while significant, is small and entirely expected given that the running
environment was not one that could take advantage of the improved
scaling.

More interesting was the kernel profile (which I generated with the new
perf_events framework).  This revealed a distinct improvement in locking
performance.  While both kernels spent a substantial amount of time in
locking, the modified kernel spent significantly less time there.

Both kernels spent the most time in lock_release (oddly enough; other
kernels I've seen tend to spend more time acquiring locks than releasing
them), however the base kernel spent 7.02% of its time there versus
2.47% for the modified kernel.  Further, while the unmodified kernel
spent more than a quarter (26.15%) of its time in that routine actually
in spin_unlock called from the dcache code (d_alloc, __d_lookup, et al),
the modified kernel spent only 8.56% of its time in the equivalent
calls.

Other lock calls showed similar improvements across the board.  I've
enclosed a snippet of the relevant measurements (as reported by "perf
report" in its call-graph mode) for each kernel.

While the overall performance drop is a little disappointing it's not
at all unexpected, as the environment was definitely not the one that
would be helped by the scaling improvements and there is a small but
nonzero cost to those improvements.  Fortunately, the cost seems small
enough that with some work it may be possible to effectively eliminate
it.

The Storage Test

This test doesn't have any single result; rather it has a number of
measurements of such things as sequential and random reads and writes
as well as a standard set of reads and writes recorded from an
application.

As one might expect, this test did fairly well; overall things seemed to
improve very slightly, by on the order of around one percent.  (There
was one outlier, a nearly 20 percent regression, but while it should be
eventually tracked down I don't think it's significant for the purposes
of this evaluation.)  My vague suspicion, though, is that the margin of
error (which I didn't compute) nearly eclipses that slight improvement.
Since the scaling improvements aren't expected to improve performance in
this kind of environment, this is actually still a win.

The locking-related profile graph for this test is _much_ more complex
for the Storage Test than for the Socket Test.  While it appears that
the dcache locking calls have been pushed down a bit in the profile it's
a bit hard to tell because other calls appear to dominate.  In the end,
it looks like there's very little difference made by the scaling
patches.

Conclusion.

In general Nick's work does appear to make things better for locking.
It virtually eliminates contention on two very important locks that we
have seen as bottlenecks, pushing locking from the root far enough down
into the leaves of the data structures that they are no longer of
significant concern as far as scaling to larger numbers of cores.  I
suspect that with some further work, the performance cost of the
improvements, already fairly small, can be essentially eliminated, at
least for the common cases.

In the long run this will be a net win.  Systems with large numbers of
cores are coming and these changes address the scalability challengs of
the Linux kernel to those systems.  There is still some work to be done,
however; in addition to the above issues, Nick has expressed concern
that incremental adoption of his changes will mean performance
regressions early on, since earlier changes lay the groundwork for later
improvements but in the meantime add overhead.  Those early regressions
will be compensated for in the long term by the later improvements but
may be problematic in the short term.

Finally, I have kernel profiles for all of the above tests, all of which
are excessively huge, too huge to even look at in their entirety.  To
glean the above numbers I used "perf report" in its call-graph mode,
focusing on locking primitives and percentages above around 0.5%.  I
kept a copy of the profiles I looked at and they are available upon
request (just ask).  I will also post them publicly as soon as I have a
place to put them.
-- 
Frank Mayhar <fmayhar@google.com>
Google Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-08 23:32 Results of my VFS scaling evaluation Frank Mayhar
@ 2010-10-09  0:33 ` Frank Mayhar
  2010-10-09  0:38 ` Valerie Aurora
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Frank Mayhar @ 2010-10-09  0:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, mrubin

On Fri, 2010-10-08 at 16:32 -0700, Frank Mayhar wrote:
> Finally, I have kernel profiles for all of the above tests, all of which
> are excessively huge, too huge to even look at in their entirety.  To
> glean the above numbers I used "perf report" in its call-graph mode,
> focusing on locking primitives and percentages above around 0.5%.  I
> kept a copy of the profiles I looked at and they are available upon
> request (just ask).  I will also post them publicly as soon as I have a
> place to put them.

While there will be a more official place eventually, for the moment the
profiles can be found here:
    http://code.google.com/p/vfs-scaling-eval/downloads/list
-- 
Frank Mayhar <fmayhar@google.com>
Google Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-08 23:32 Results of my VFS scaling evaluation Frank Mayhar
  2010-10-09  0:33 ` Frank Mayhar
@ 2010-10-09  0:38 ` Valerie Aurora
  2010-10-11 18:47   ` Frank Mayhar
  2011-01-13 11:13   ` Nick Piggin
  2010-10-09  3:16 ` Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Valerie Aurora @ 2010-10-09  0:38 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-fsdevel, linux-mm, mrubin

On Fri, Oct 08, 2010 at 04:32:19PM -0700, Frank Mayhar wrote:
> 
> Before going into details of the test results, however, I must say that
> the most striking thing about Nick's work how stable it is.  In all of

:D

> the work I've been doing, all the kernels I've built and run and all the
> tests I've run, I've run into no hangs and only one crash, that in an
> area that we happen to stress very heavily, for which I posted a patch,
> available at
>  http://www.kerneltrap.org/mailarchive/linux-fsdevel/2010/9/27/6886943
> The crash involved the fact that we use cgroups very heavily, and there
> was an oversight in the new d_set_d_op() routine that failed to clear
> flags before it set them.

I honestly can't stand the d_set_d_op() patch (testing flags instead
of d_op->op) because it obfuscates the code in such a way that leads
directly to this kind of bug.  I don't suppose you could test the
performance effect of that specific patch and see how big of a
difference it makes?

-VAL

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-09  0:38 ` Valerie Aurora
@ 2010-10-11 18:47   ` Frank Mayhar
  2011-01-13 11:13   ` Nick Piggin
  1 sibling, 0 replies; 18+ messages in thread
From: Frank Mayhar @ 2010-10-11 18:47 UTC (permalink / raw)
  To: Valerie Aurora; +Cc: linux-fsdevel, linux-mm, mrubin

On Fri, 2010-10-08 at 20:38 -0400, Valerie Aurora wrote:
> On Fri, Oct 08, 2010 at 04:32:19PM -0700, Frank Mayhar wrote:
> > 
> > Before going into details of the test results, however, I must say that
> > the most striking thing about Nick's work how stable it is.  In all of
> 
> :D
> 
> > the work I've been doing, all the kernels I've built and run and all the
> > tests I've run, I've run into no hangs and only one crash, that in an
> > area that we happen to stress very heavily, for which I posted a patch,
> > available at
> >  http://www.kerneltrap.org/mailarchive/linux-fsdevel/2010/9/27/6886943
> > The crash involved the fact that we use cgroups very heavily, and there
> > was an oversight in the new d_set_d_op() routine that failed to clear
> > flags before it set them.
> 
> I honestly can't stand the d_set_d_op() patch (testing flags instead
> of d_op->op) because it obfuscates the code in such a way that leads
> directly to this kind of bug.  I don't suppose you could test the
> performance effect of that specific patch and see how big of a
> difference it makes?

I do kind of understand why he did it but you're right that it makes
things a bit error-prone.  Unfortunately I'm not in a position at the
moment to do a lot more testing and analysis.  I'll try to find some
spare time in which to do some more testing of both this and Dave
Chinner's tree, but no promises.
-- 
Frank Mayhar <fmayhar@google.com>
Google Inc.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-09  0:38 ` Valerie Aurora
  2010-10-11 18:47   ` Frank Mayhar
@ 2011-01-13 11:13   ` Nick Piggin
  1 sibling, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2011-01-13 11:13 UTC (permalink / raw)
  To: Valerie Aurora; +Cc: Frank Mayhar, linux-fsdevel, linux-mm, mrubin

On Sat, Oct 9, 2010 at 11:38 AM, Valerie Aurora <vaurora@redhat.com> wrote:
> On Fri, Oct 08, 2010 at 04:32:19PM -0700, Frank Mayhar wrote:
>>
>> Before going into details of the test results, however, I must say that
>> the most striking thing about Nick's work how stable it is.  In all of
>
> :D
>
>> the work I've been doing, all the kernels I've built and run and all the
>> tests I've run, I've run into no hangs and only one crash, that in an
>> area that we happen to stress very heavily, for which I posted a patch,
>> available at
>>  http://www.kerneltrap.org/mailarchive/linux-fsdevel/2010/9/27/6886943
>> The crash involved the fact that we use cgroups very heavily, and there
>> was an oversight in the new d_set_d_op() routine that failed to clear
>> flags before it set them.
>
> I honestly can't stand the d_set_d_op() patch (testing flags instead
> of d_op->op) because it obfuscates the code in such a way that leads
> directly to this kind of bug.  I don't suppose you could test the
> performance effect of that specific patch and see how big of a
> difference it makes?

I'm coming across this message a bit late (due to searching mailing
list for d_set_d_op problems), and I'm sorry I don't think I ever read
it, so I didn't reply.

There are a couple of problems I guess. One is having flags and
ops go out of sync by changing d_op around. I think this one is not
something we want to allow in filesystems and can easily be racy.
The d_set_d_op patch exposed quite a lot of these, and I wish I'd
read this earlier because we've got several of these bugs upstream
now (well arguably they are existing bugs, but anyway they are
crashing testers boxes).

The other potential nasty of this patch is filesystems assigning
d_op directly. This will be exposed pretty quickly because nothing
will work.

As for efficiency -- I am sorry for not including results in the patch.
Now we avoid a load and a couple of branches and a little bit of
icache, which is always nice.

But the biggest motivation for the patch was to fit path walking
dcache footprint in the dentry to a single cache line in the common
case, rather than 2. I think it's worthwhile and there is even a bit
more work to do on dentry shuffling and shrinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-08 23:32 Results of my VFS scaling evaluation Frank Mayhar
  2010-10-09  0:33 ` Frank Mayhar
  2010-10-09  0:38 ` Valerie Aurora
@ 2010-10-09  3:16 ` Dave Chinner
  2010-10-10  6:54   ` Andi Kleen
  2010-10-10  6:50 ` Andi Kleen
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2010-10-09  3:16 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-fsdevel, linux-mm, mrubin

On Fri, Oct 08, 2010 at 04:32:19PM -0700, Frank Mayhar wrote:
> Nick Piggin has been doing work on lock contention in VFS, in particular
> to remove the dcache and inode locks, and we are very interested in this
> work.  He has entirely eliminated two of the most contended locks,
> replacing them with a combination of more granular locking, seqlocks,
> RCU lists and other mechanisms that reduce locking and contention in
> general. He has published this work at
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git

While the code in that tree might be stable, it's not really in any
shape acceptible for mainline inclusion.

I've been reworking the inode_lock breakup code from this patch set,
and there is significant change in the locking order and structure
compared to the above tree to avoid the unmaintainable mess of
trylock operations that Nick's patchset ended up with.

Also, breaking the series down into smaller bunches also shows that
certain optimisations made later in the series (e.g. making
writeback lists per CPU, breaking up inode LRUs, etc) do not deal
with the primary causes of observable contention (e.g. unbound
writeback parallelism in balance_dirty_pages), so the parts of the
original patch set might not even end up in mainline for some time...

FWIW, it would be good if this sort of testing could be run on the tree
under review here:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale

This is what I'm trying to get reviewed in time for a .37 merge.  If
that gets in .37, then I'll probably follow the same process for the
dcache_lock in .38, and after that we can then consider all the RCU
changes for both the inode and dentry operations.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-09  3:16 ` Dave Chinner
@ 2010-10-10  6:54   ` Andi Kleen
  2010-10-10  7:37     ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2010-10-10  6:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Frank Mayhar, linux-fsdevel, linux-mm, mrubin, torvalds, viro

Dave Chinner <david@fromorbit.com> writes:

> On Fri, Oct 08, 2010 at 04:32:19PM -0700, Frank Mayhar wrote:
>> Nick Piggin has been doing work on lock contention in VFS, in particular
>> to remove the dcache and inode locks, and we are very interested in this
>> work.  He has entirely eliminated two of the most contended locks,
>> replacing them with a combination of more granular locking, seqlocks,
>> RCU lists and other mechanisms that reduce locking and contention in
>> general. He has published this work at
>> 
>> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git

...

>
> While the code in that tree might be stable, it's not really in any
> shape acceptible for mainline inclusion.
>
> I've been reworking the inode_lock breakup code from this patch set,
> and there is significant change in the locking order and structure
> compared to the above tree to avoid the unmaintainable mess of
> trylock operations that Nick's patchset ended up with.

...
>
> FWIW, it would be good if this sort of testing could be run on the tree
> under review here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale
>
> This is what I'm trying to get reviewed in time for a .37 merge.  If
> that gets in .37, then I'll probably follow the same process for the
> dcache_lock in .38, and after that we can then consider all the RCU
> changes for both the inode and dentry operations.

That would be over 6 months just to make even a little progress.
 
Sorry, I am not convinced yet that any progress in this area has to be
that glacial. Linus indicated last time he wanted to move faster on the
VFS improvements. And the locking as it stands today is certainly a
major problem.

Maybe it's possible to come up with a way to integrate this faster?


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-10  6:54   ` Andi Kleen
@ 2010-10-10  7:37     ` Christoph Hellwig
  2010-10-10  8:20       ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2010-10-10  7:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Chinner, Frank Mayhar, linux-fsdevel, linux-mm, mrubin,
	torvalds, viro

On Sun, Oct 10, 2010 at 08:54:51AM +0200, Andi Kleen wrote:
> That would be over 6 months just to make even a little progress.

I think that's unfair.  There's been absolutely no work from Nick to
get things mergeable since 2.6.35-rc days where we gave him that
feedback.  We now have had Dave pick it up and sort out various issues
with the third or so of the patchset he needed most to sort the lock
contention problems in the workloads he saw, and we'll get large
improvements for those for .37.  The dcache_lock splitup alone is
another massive task that needs a lot more work, too.  I've started
reviewing it and already fixed tons issues in in and the surrounding
code.  

> Sorry, I am not convinced yet that any progress in this area has to be
> that glacial. Linus indicated last time he wanted to move faster on the
> VFS improvements. And the locking as it stands today is certainly a
> major problem.
> 
> Maybe it's possible to come up with a way to integrate this faster?

Certainly not for .37, where even the inode_lock splitup is pretty damn
later.  Nick disappearing for a few weeks and others having to pick up
the work to sort it out certainly doesn't help.  And the dcache_lock
splitup is a much larget task than that anyway.  Getting that into .38
is the enabler for doing more fancy things.  And as Dave mentioned at
least in the writeback area it's much better to sort out the algorithmic
problems now than to blindly split some locks up more.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-10  7:37     ` Christoph Hellwig
@ 2010-10-10  8:20       ` Andi Kleen
  2010-10-10  8:37         ` Christoph Hellwig
  2010-10-10 23:31         ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Andi Kleen @ 2010-10-10  8:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andi Kleen, Dave Chinner, Frank Mayhar, linux-fsdevel, linux-mm,
	mrubin, torvalds, viro

> Certainly not for .37, where even the inode_lock splitup is pretty damn
> later.  Nick disappearing for a few weeks and others having to pick up
> the work to sort it out certainly doesn't help.  And the dcache_lock
> splitup is a much larget task than that anyway.  Getting that into .38
> is the enabler for doing more fancy things.  And as Dave mentioned at
> least in the writeback area it's much better to sort out the algorithmic
> problems now than to blindly split some locks up more.

I don't see why the algorithmic work can't be done in parallel 
to the lock split up?

Just the lock split up on its own gives us large gains here.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-10  8:20       ` Andi Kleen
@ 2010-10-10  8:37         ` Christoph Hellwig
  2010-10-10 12:03           ` Andi Kleen
  2010-10-10 23:31         ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2010-10-10  8:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Dave Chinner, Frank Mayhar, linux-fsdevel,
	linux-mm, mrubin, torvalds, viro

On Sun, Oct 10, 2010 at 10:20:39AM +0200, Andi Kleen wrote:
> > Certainly not for .37, where even the inode_lock splitup is pretty damn
> > later.  Nick disappearing for a few weeks and others having to pick up
> > the work to sort it out certainly doesn't help.  And the dcache_lock
> > splitup is a much larget task than that anyway.  Getting that into .38
> > is the enabler for doing more fancy things.  And as Dave mentioned at
> > least in the writeback area it's much better to sort out the algorithmic
> > problems now than to blindly split some locks up more.
> 
> I don't see why the algorithmic work can't be done in parallel 
> to the lock split up?
> 
> Just the lock split up on its own gives us large gains here.

What about actually starting to test the stuff headed towards Al's tree
to verify your assumptions?  It's nice to have a lot of people talking,
but actually helping with review and testing would be more useful.

Yes, lots of things could be done in parallel, but it needs people to
actually work on it.  And right now that's mostly Dave for the real
work, with me trying to prepare a proper dcache series for .38, and Al
doing some review.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-10  8:37         ` Christoph Hellwig
@ 2010-10-10 12:03           ` Andi Kleen
  2010-10-10 23:50             ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2010-10-10 12:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andi Kleen, Dave Chinner, Frank Mayhar, linux-fsdevel, linux-mm,
	mrubin, torvalds, viro

On Sun, Oct 10, 2010 at 04:37:49AM -0400, Christoph Hellwig wrote:
> On Sun, Oct 10, 2010 at 10:20:39AM +0200, Andi Kleen wrote:
> > > Certainly not for .37, where even the inode_lock splitup is pretty damn
> > > later.  Nick disappearing for a few weeks and others having to pick up
> > > the work to sort it out certainly doesn't help.  And the dcache_lock
> > > splitup is a much larget task than that anyway.  Getting that into .38
> > > is the enabler for doing more fancy things.  And as Dave mentioned at
> > > least in the writeback area it's much better to sort out the algorithmic
> > > problems now than to blindly split some locks up more.
> > 
> > I don't see why the algorithmic work can't be done in parallel 
> > to the lock split up?
> > 
> > Just the lock split up on its own gives us large gains here.
> 
> What about actually starting to test the stuff headed towards Al's tree
> to verify your assumptions?  It's nice to have a lot of people talking,

That's in the works. Previously all testing work was done
on Nick's patch series.

> but actually helping with review and testing would be more useful.

Well the constant refactoring is certainly not helping with testing.

Also what typically happens is that if we don't fix all the serious
VFS locking issues (like Nick's patch kit) we just move from one bottle 
neck to another.

> Yes, lots of things could be done in parallel, but it needs people to
> actually work on it.  And right now that's mostly Dave for the real
> work, with me trying to prepare a proper dcache series for .38, and Al
> doing some review.

It was not clear to me what was so horrible with Nick's original
patchkit?  Sure there were a few rough edges, but does it really
need to be fully redone?

It certainly held up great to lots of testing, both at our side
and apparently Google's too.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-10 12:03           ` Andi Kleen
@ 2010-10-10 23:50             ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2010-10-10 23:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Frank Mayhar, linux-fsdevel, linux-mm, mrubin,
	torvalds, viro

On Sun, Oct 10, 2010 at 02:03:09PM +0200, Andi Kleen wrote:
> On Sun, Oct 10, 2010 at 04:37:49AM -0400, Christoph Hellwig wrote:
> > but actually helping with review and testing would be more useful.
> 
> Well the constant refactoring is certainly not helping with testing.

That is the way of review cycles. The need for significant
refactoring and reworking shows how much work the VFS maintainers
consider still needs to be done on the patch set.

> Also what typically happens is that if we don't fix all the serious
> VFS locking issues (like Nick's patch kit) we just move from one bottle 
> neck to another.

Sure, but at least there is a plan for dealing with them all and,
most importantly, people committed to pushing it forward.

Fundamentally, we need to understand the source of the lock
contention problems before trying to fix them. Nick just hit them
repeatedly with a big hammer until they went away....

> > Yes, lots of things could be done in parallel, but it needs people to
> > actually work on it.  And right now that's mostly Dave for the real
> > work, with me trying to prepare a proper dcache series for .38, and Al
> > doing some review.
> 
> It was not clear to me what was so horrible with Nick's original
> patchkit?  Sure there were a few rough edges, but does it really
> need to be fully redone?

I think the trylock mess is pretty much universally disliked by
anyone who looks at the VFS and writeback code on a daily basis. And
IMO the level of nested trylock looping is generally indicative of
getting the lock ordering strategy wrong in the first place.

Not to mention that as soon as I tried to re-order cleanups to the
front of the queue, it was pretty clear that it was going to be
unmaintainable, too.

> It certainly held up great to lots of testing, both at our side
> and apparently Google's too.

Not the least bit relevant, IMO, when the code ends up unmaintanable
in the long term.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-10  8:20       ` Andi Kleen
  2010-10-10  8:37         ` Christoph Hellwig
@ 2010-10-10 23:31         ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2010-10-10 23:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Frank Mayhar, linux-fsdevel, linux-mm, mrubin,
	torvalds, viro

On Sun, Oct 10, 2010 at 10:20:39AM +0200, Andi Kleen wrote:
> > Certainly not for .37, where even the inode_lock splitup is pretty damn
> > later.  Nick disappearing for a few weeks and others having to pick up
> > the work to sort it out certainly doesn't help.  And the dcache_lock
> > splitup is a much larget task than that anyway.  Getting that into .38
> > is the enabler for doing more fancy things.  And as Dave mentioned at
> > least in the writeback area it's much better to sort out the algorithmic
> > problems now than to blindly split some locks up more.
> 
> I don't see why the algorithmic work can't be done in parallel 
> to the lock split up?

It is - see Fengguang Wu's 17 patch series RFC for removing
writeback from balance_dirty_pages(). That change is complex enough
that few people can understand it well enough to review it, and even
fewer have the hardware and time available to test it thoroughly.

That patch series is *exactly* what we need to test for fixing the
writeback lock contention, but I cannot do that while I'm still
trying to get the current series sorted out. It's next on my list
because it's now the biggest problem I'm seeing on small file
intensive workloads on XFS.

> Just the lock split up on its own gives us large gains here.

The writeback lock split up is an algorithmic change in itself, one
which no-one has yet analysed for undesirable behaviour. At minimum
it changes the writeback IO patterns because of the different list
traversal ordering, and that is not something that should go into
mainline without close scrutiny.

Indeed, I showed that Nick's patch series actually significantly
increased the amount of IO during certain workloads. There was
plenty of handwaving about possible causes, but it was never
analysed or explained. The only way to determine the cause is to go
step by step and work out which algorithmic change caused that - it
might be the RCU changes, the zone LRU reclaim, the writeback
locking, or it might be something else. This series has not shown
such a regression, so I've aleast ruled out the lock breakup as the
cause.

IMO, pushing Nick's changes into mainline without answering such
questions is the _worst_ thing we can do. Writeback has been a mess
for a long time and so shovelling a truck-load of badly understood,
unmaintainable crap into the writeback path to "fix lock contention"
is not going to improve the situation at all. It is premature
optimisation at it's finest.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation.
  2010-10-08 23:32 Results of my VFS scaling evaluation Frank Mayhar
                   ` (2 preceding siblings ...)
  2010-10-09  3:16 ` Dave Chinner
@ 2010-10-10  6:50 ` Andi Kleen
  2010-10-19 21:59 ` Results of my VFS scaling evaluation, redux Frank Mayhar
  2010-10-22 19:03 ` VFS scaling evaluation results, redux Frank Mayhar
  5 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2010-10-10  6:50 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-fsdevel, linux-mm, mrubin

Frank Mayhar <fmayhar@google.com> writes:

> Nick Piggin has been doing work on lock contention in VFS, in particular
> to remove the dcache and inode locks, and we are very interested in this
> work.  He has entirely eliminated two of the most contended locks,
> replacing them with a combination of more granular locking, seqlocks,
> RCU lists and other mechanisms that reduce locking and contention in
> general. He has published this work at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
>
> As we have run into problems with lock contention, Google is very
> interested in these improvements.

Thanks Frank for the data. Yes publication of any profiles would
be interesting. We're also seeing major issues with dcache and
inode locks in local testing.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Results of my VFS scaling evaluation, redux.
  2010-10-08 23:32 Results of my VFS scaling evaluation Frank Mayhar
                   ` (3 preceding siblings ...)
  2010-10-10  6:50 ` Andi Kleen
@ 2010-10-19 21:59 ` Frank Mayhar
  2010-10-19 22:03   ` Frank Mayhar
  2010-10-22 19:03 ` VFS scaling evaluation results, redux Frank Mayhar
  5 siblings, 1 reply; 18+ messages in thread
From: Frank Mayhar @ 2010-10-19 21:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, mrubin

After seeing the reaction to my original post of this work, I decided to
rerun the tests against Dave Chinner's tree just to see how things fare
with his changes.  This time I only ran the "socket test" due to time
constraints and since the "storage test" didn't produce anything
particularly interesting last time.

Further, I was unable to use the hardware I used previously so I reran
the test against both the 2.6.35 base and 2.6.35 plus Nick's changes in
addition to the 2.6.36 base and Dave's version of same.  The changed
hardware changed the absolute test results substantially.  Comparisons
between the runs remain valid, however.

>From my previous post (slightly rewritten):

> For each of the kernels I ran a “socket test” on systems with a
> moderate number of cores and memory (unfortunately I can’t say more
> about the hardware).  I gathered test results and kernel profiling
> data for each.
> 
> The "Socket Test" does a lot of socket operations; it fields lots of
> connections, receiving and transmitting small amounts of data over each.
> The application it emulates has run into bottlenecks on the dcache_lock
> and the inode_lock several times in the past, which is why I chose it as
> a target.
> 
> The test is multithreaded with at least one thread per core and is
> designed to put as much load on the application being tested as
> possible.  It is in fact designed specifically to find performance
> regressions (albeit at a higher level than the kernel), which makes it
> very suitable for this testing.

The kernels were very stable; I saw no crashes or hangs during my
testing.

The "Socket Test" has a target rate which I'll refer to as 100%.
Internal Google kernels (with modifications to specific code paths)
allow the test to generally achieve that rate, albeit not without
substantial effort.  Against the base 2.6.35 kernel I saw a rate of
around 13.9%; the modified 2.6.35 kernel had a rate of around 8.38%.
The base 2.6.36 kernel was effectively unchanged relative to the 2.6.35
kernel with a rate of 14.12% and, likewise, the modified 2.6.36 kernel
had a rate of around 9.1%.  In each case the difference is small and
expected given the environment.


> More interesting was the kernel profile (which I generated with the new
> perf_events framework).  This revealed a distinct improvement in locking
> performance.  While both kernels spent a substantial amount of time in
> locking, the modified kernel spent significantly less time there.
> 
> Both kernels spent the most time in lock_release (oddly enough; other
> kernels I've seen tend to spend more time acquiring locks than releasing
> them), however the base kernel spent 7.02% of its time there versus
> 2.47% for the modified kernel.  Further, while the unmodified kernel
> spent more than a quarter (26.15%) of its time in that routine actually
> in spin_unlock called from the dcache code (d_alloc, __d_lookup, et al),
> the modified kernel spent only 8.56% of its time in the equivalent
> calls.
> 
> Other lock calls showed similar improvements across the board.  I've
> enclosed a snippet of the relevant measurements (as reported by "perf
> report" in its call-graph mode) for each kernel.
> 
> While the overall performance drop is a little disappointing it's not
> at all unexpected, as the environment was definitely not the one that
> would be helped by the scaling improvements and there is a small but
> nonzero cost to those improvements.  Fortunately, the cost seems small
> enough that with some work it may be possible to effectively eliminate
> it.
> 
> 
> The Storage Test
> 
> This test doesn't have any single result; rather it has a number of
> measurements of such things as sequential and random reads and writes
> as well as a standard set of reads and writes recorded from an
> application.
> 
> As one might expect, this test did fairly well; overall things seemed to
> improve very slightly, by on the order of around one percent.  (There
> was one outlier, a nearly 20 percent regression, but while it should be
> eventually tracked down I don't think it's significant for the purposes
> of this evaluation.)  My vague suspicion, though, is that the margin of
> error (which I didn't compute) nearly eclipses that slight improvement.
> Since the scaling improvements aren't expected to improve performance in
> this kind of environment, this is actually still a win.
> 
> The locking-related profile graph for this test is _much_ more complex
> for the Storage Test than for the Socket Test.  While it appears that
> the dcache locking calls have been pushed down a bit in the profile it's
> a bit hard to tell because other calls appear to dominate.  In the end,
> it looks like there's very little difference made by the scaling
> patches.
> 
> 
> Conclusion.
> 
> In general Nick's work does appear to make things better for locking.
> It virtually eliminates contention on two very important locks that we
> have seen as bottlenecks, pushing locking from the root far enough down
> into the leaves of the data structures that they are no longer of
> significant concern as far as scaling to larger numbers of cores.  I
> suspect that with some further work, the performance cost of the
> improvements, already fairly small, can be essentially eliminated, at
> least for the common cases.
> 
> In the long run this will be a net win.  Systems with large numbers of
> cores are coming and these changes address the scalability challengs of
> the Linux kernel to those systems.  There is still some work to be done,
> however; in addition to the above issues, Nick has expressed concern
> that incremental adoption of his changes will mean performance
> regressions early on, since earlier changes lay the groundwork for later
> improvements but in the meantime add overhead.  Those early regressions
> will be compensated for in the long term by the later improvements but
> may be problematic in the short term.
> 
> Finally, I have kernel profiles for all of the above tests, all of which
> are excessively huge, too huge to even look at in their entirety.  To
> glean the above numbers I used "perf report" in its call-graph mode,
> focusing on locking primitives and percentages above around 0.5%.  I
> kept a copy of the profiles I looked at and they are available upon
> request (just ask).  I will also post them publicly as soon as I have a
> place to put them.


-- 
Frank Mayhar <fmayhar@google.com>
Google Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Results of my VFS scaling evaluation, redux.
  2010-10-19 21:59 ` Results of my VFS scaling evaluation, redux Frank Mayhar
@ 2010-10-19 22:03   ` Frank Mayhar
  0 siblings, 0 replies; 18+ messages in thread
From: Frank Mayhar @ 2010-10-19 22:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, mrubin

On Tue, 2010-10-19 at 14:59 -0700, Frank Mayhar wrote:
> After seeing the reaction to my original post of this work,

Feh, sorry about this, that was not finished and went out by accident.
I'll finish it up and send it shortly.
-- 
Frank Mayhar <fmayhar@google.com>
Google Inc.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* VFS scaling evaluation results, redux.
  2010-10-08 23:32 Results of my VFS scaling evaluation Frank Mayhar
                   ` (4 preceding siblings ...)
  2010-10-19 21:59 ` Results of my VFS scaling evaluation, redux Frank Mayhar
@ 2010-10-22 19:03 ` Frank Mayhar
  2010-10-23  0:00   ` Lin Ming
  5 siblings, 1 reply; 18+ messages in thread
From: Frank Mayhar @ 2010-10-22 19:03 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: mrubin, ext4-team

After seeing the newer work a couple of weeks ago, I decided to rerun
the tests against Dave Chinner's tree just to see how things fare with
his changes.  This time I only ran the "socket test" due to time
constraints and since the "storage test" didn't produce anything
particularly interesting last time.

Unfortunately I was unable to use the hardware I used previously so I
reran the test against both the 2.6.35 base and 2.6.35 plus Nick's
changes in addition to the 2.6.36 base and Dave's version of same.  The
changed hardware changed the absolute test results substantially, so I'm
making no comparisons with the previous runs.

Once more I ran a “socket test” on systems with a moderate number of
cores and memory (unfortunately I can’t say more about the hardware).  I
gathered test results and kernel profiling data for each. 

The "Socket Test" does a lot of socket operations; it fields lots of
connections, receiving and transmitting small amounts of data over each.
The application it emulates has run into bottlenecks on the dcache_lock
and the inode_lock several times in the past, which is why I chose it as
a target.

The test is multithreaded with at least one thread per core and is
designed to put as much load on the application being tested as
possible.  It is in fact designed specifically to find performance
regressions (albeit at a higher level than the kernel), which makes it
very suitable for this testing.

The kernels were very stable; I saw no crashes or hangs during my
testing.

The "Socket Test" has a target rate which I'll refer to as 100%.
Internal Google kernels (with modifications to specific code paths)
allow the test to generally achieve that rate, albeit not without
substantial effort.  Against the base 2.6.35 kernel I saw a rate of
around 13.9%; the modified 2.6.35 kernel had a rate of around 8.38%.
The base 2.6.36 kernel was effectively unchanged relative to the 2.6.35
kernel with a rate of 14.12% and, likewise, the modified 2.6.36 kernel
had a rate of around 9.1%.  In each case the difference is small and
expected given the environment.

The kernel profiles were not nearly as straightforward this time.
Running the test against a base and improved 2.6.35 kernel, there were
some (fairly subtle) improvements with respect to the dcache locking, in
which as before the amount of time there dropped slightly.  This time,
however, both tests spend essentially the same amount of time in locking
primitives.

I compared the 2.6.35 base and 2.6.36 base kernels as well, to see where
improvements might show up without Nick's and Dave's improvements.  I
saw that time spent in locking primitives dropped slightly but
consistently.

Comparing 2.6.36 base and improved kernels, again there seemed to be
some subtle improvements to the dcache locking but otherwise the tests
spent about the same amount of time in the locking primitives.

As before the original profiles are available at
    http://code.google.com/p/vfs-scaling-eval/downloads/list
The newer data is marked as "Socket_test-profile-<kernel
version>-<name>".  Data from the previous evaluation is there as well.
-- 
Frank Mayhar <fmayhar@google.com>
Google Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VFS scaling evaluation results, redux.
  2010-10-22 19:03 ` VFS scaling evaluation results, redux Frank Mayhar
@ 2010-10-23  0:00   ` Lin Ming
  0 siblings, 0 replies; 18+ messages in thread
From: Lin Ming @ 2010-10-23  0:00 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-fsdevel, linux-mm, mrubin, ext4-team

On Sat, Oct 23, 2010 at 3:03 AM, Frank Mayhar <fmayhar@google.com> wrote:
> After seeing the newer work a couple of weeks ago, I decided to rerun
> the tests against Dave Chinner's tree just to see how things fare with
> his changes.  This time I only ran the "socket test" due to time
> constraints and since the "storage test" didn't produce anything
> particularly interesting last time.

Could you share your "socket test" test case?
I'd like to test these vfs scaling patches also.

Thanks,
Lin Ming

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2011-01-13 11:13 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-08 23:32 Results of my VFS scaling evaluation Frank Mayhar
2010-10-09  0:33 ` Frank Mayhar
2010-10-09  0:38 ` Valerie Aurora
2010-10-11 18:47   ` Frank Mayhar
2011-01-13 11:13   ` Nick Piggin
2010-10-09  3:16 ` Dave Chinner
2010-10-10  6:54   ` Andi Kleen
2010-10-10  7:37     ` Christoph Hellwig
2010-10-10  8:20       ` Andi Kleen
2010-10-10  8:37         ` Christoph Hellwig
2010-10-10 12:03           ` Andi Kleen
2010-10-10 23:50             ` Dave Chinner
2010-10-10 23:31         ` Dave Chinner
2010-10-10  6:50 ` Andi Kleen
2010-10-19 21:59 ` Results of my VFS scaling evaluation, redux Frank Mayhar
2010-10-19 22:03   ` Frank Mayhar
2010-10-22 19:03 ` VFS scaling evaluation results, redux Frank Mayhar
2010-10-23  0:00   ` Lin Ming

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).