public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Another Performance Regression in write() syscall
@ 2009-02-24  6:05 Salman Qazi
  2009-02-24  6:25 ` Dave Hansen
  0 siblings, 1 reply; 8+ messages in thread
From: Salman Qazi @ 2009-02-24  6:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Andi Kleen,
	Dave Hansen, nickpiggin, Linus Torvalds

Analysis of profile data has led us to believe that the commit
3d733633a633065729c9e4e254b2e5442c00ef7e has caused a performance
regression.  This commit provides for tracking of writers so that read only
bind mounts function correctly.

We can verify this regression by applying the following patch to partially
disable the above-mentioned commit and then running the fstime component
of Unixbench.  The settings used were 256 byte writes with MAX_BLOCK of 2000.

The following numbers were produced (5 samples, each specified in Kb/sec)

    2.6.29-rc6:
    283750, 295200, 294500, 293000, 293300

    2.6.29-rc6 + patch:
    337200, 329000, 331050, 337300, 328450

    2.6.18
    395700, 342000, 399100, 366050, 359850

See w_test() in src/fstime in Unixbench version 4.1.0.  A simplified
version of the test (leaving out the accounting code) is:

        alarm(10);
        while (!sigalarm) {
            for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
                write(f, buf, 256);
            }
            lseek(f, 0L, 0);
        }

The following patch is not intended to be a fix, but a way to demonstrate
the problem.

diff --git a/fs/namespace.c b/fs/namespace.c
index 06f8e63..ec0bfd9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -251,21 +251,10 @@ static inline void use_cpu_writer_for_mount(struct mnt_writer *cpu_writer,
  */
 int mnt_want_write(struct vfsmount *mnt)
 {
-	int ret = 0;
-	struct mnt_writer *cpu_writer;
+	if (__mnt_is_readonly(mnt))
+		return -EROFS;
+	return 0;
 
-	cpu_writer = &get_cpu_var(mnt_writers);
-	spin_lock(&cpu_writer->lock);
-	if (__mnt_is_readonly(mnt)) {
-		ret = -EROFS;
-		goto out;
-	}
-	use_cpu_writer_for_mount(cpu_writer, mnt);
-	cpu_writer->count++;
-out:
-	spin_unlock(&cpu_writer->lock);
-	put_cpu_var(mnt_writers);
-	return ret;
 }
 EXPORT_SYMBOL_GPL(mnt_want_write);
 
@@ -282,45 +271,6 @@ static void lock_mnt_writers(void)
 	}
 }
 
-/*
- * These per-cpu write counts are not guaranteed to have
- * matched increments and decrements on any given cpu.
- * A file open()ed for write on one cpu and close()d on
- * another cpu will imbalance this count.  Make sure it
- * does not get too far out of whack.
- */
-static void handle_write_count_underflow(struct vfsmount *mnt)
-{
-	if (atomic_read(&mnt->__mnt_writers) >=
-	    MNT_WRITER_UNDERFLOW_LIMIT)
-		return;
-	/*
-	 * It isn't necessary to hold all of the locks
-	 * at the same time, but doing it this way makes
-	 * us share a lot more code.
-	 */
-	lock_mnt_writers();
-	/*
-	 * vfsmount_lock is for mnt_flags.
-	 */
-	spin_lock(&vfsmount_lock);
-	/*
-	 * If coalescing the per-cpu writer counts did not
-	 * get us back to a positive writer count, we have
-	 * a bug.
-	 */
-	if ((atomic_read(&mnt->__mnt_writers) < 0) &&
-	    !(mnt->mnt_flags & MNT_IMBALANCED_WRITE_COUNT)) {
-		WARN(1, KERN_DEBUG "leak detected on mount(%p) writers "
-				"count: %d\n",
-			mnt, atomic_read(&mnt->__mnt_writers));
-		/* use the flag to keep the dmesg spam down */
-		mnt->mnt_flags |= MNT_IMBALANCED_WRITE_COUNT;
-	}
-	spin_unlock(&vfsmount_lock);
-	unlock_mnt_writers();
-}
-
 /**
  * mnt_drop_write - give up write access to a mount
  * @mnt: the mount on which to give up write access
@@ -331,37 +281,6 @@ static void handle_write_count_underflow(struct vfsmount *mnt)
  */
 void mnt_drop_write(struct vfsmount *mnt)
 {
-	int must_check_underflow = 0;
-	struct mnt_writer *cpu_writer;
-
-	cpu_writer = &get_cpu_var(mnt_writers);
-	spin_lock(&cpu_writer->lock);
-
-	use_cpu_writer_for_mount(cpu_writer, mnt);
-	if (cpu_writer->count > 0) {
-		cpu_writer->count--;
-	} else {
-		must_check_underflow = 1;
-		atomic_dec(&mnt->__mnt_writers);
-	}
-
-	spin_unlock(&cpu_writer->lock);
-	/*
-	 * Logically, we could call this each time,
-	 * but the __mnt_writers cacheline tends to
-	 * be cold, and makes this expensive.
-	 */
-	if (must_check_underflow)
-		handle_write_count_underflow(mnt);
-	/*
-	 * This could be done right after the spinlock
-	 * is taken because the spinlock keeps us on
-	 * the cpu, and disables preemption.  However,
-	 * putting it here bounds the amount that
-	 * __mnt_writers can underflow.  Without it,
-	 * we could theoretically wrap __mnt_writers.
-	 */
-	put_cpu_var(mnt_writers);
 }
 EXPORT_SYMBOL_GPL(mnt_drop_write);
 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Another Performance Regression in write() syscall
  2009-02-24  6:05 Another Performance Regression in write() syscall Salman Qazi
@ 2009-02-24  6:25 ` Dave Hansen
  2009-02-24  8:47   ` Nick Piggin
  2009-02-25  1:05   ` Salman Qazi
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Hansen @ 2009-02-24  6:25 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Andi Kleen, Dave Hansen, nickpiggin, Linus Torvalds

On Mon, 2009-02-23 at 22:05 -0800, Salman Qazi wrote:
> Analysis of profile data has led us to believe that the commit
> 3d733633a633065729c9e4e254b2e5442c00ef7e has caused a performance
> regression.  This commit provides for tracking of writers so that read only
> bind mounts function correctly.
> 
> We can verify this regression by applying the following patch to partially
> disable the above-mentioned commit and then running the fstime component
> of Unixbench.  The settings used were 256 byte writes with MAX_BLOCK of 2000.

I'm a bit surprised that write() is what is regressing.  Unless I
screwed up, we do all the expensive accounting at open()/close() time.
Is this a test that gets run in parallel on multiple cpus?  

Could you take a look at Nick's patches to speed this stuff up?

	http://thread.gmane.org/gmane.linux.file-systems/28186

We may need to dust those off, although I'm still a bit worried about
the complexities of open-coding all the barriers.

Could we also see some kind of profile?  What kind of machine are you
seeing this on, btw?

-- Dave


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Another Performance Regression in write() syscall
  2009-02-24  6:25 ` Dave Hansen
@ 2009-02-24  8:47   ` Nick Piggin
  2009-02-24 16:58     ` Dave Hansen
  2009-02-25  1:05   ` Salman Qazi
  1 sibling, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2009-02-24  8:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Salman Qazi, linux-kernel, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Linus Torvalds

On Tuesday 24 February 2009 17:25:45 Dave Hansen wrote:
> On Mon, 2009-02-23 at 22:05 -0800, Salman Qazi wrote:
> > Analysis of profile data has led us to believe that the commit
> > 3d733633a633065729c9e4e254b2e5442c00ef7e has caused a performance
> > regression.  This commit provides for tracking of writers so that read
> > only bind mounts function correctly.
> >
> > We can verify this regression by applying the following patch to
> > partially disable the above-mentioned commit and then running the fstime
> > component of Unixbench.  The settings used were 256 byte writes with
> > MAX_BLOCK of 2000.
>
> I'm a bit surprised that write() is what is regressing.  Unless I
> screwed up, we do all the expensive accounting at open()/close() time.
> Is this a test that gets run in parallel on multiple cpus?

Don't forget touch_atime...

Still, open/close isn't unimportant either.


> Could you take a look at Nick's patches to speed this stuff up?
>
> 	http://thread.gmane.org/gmane.linux.file-systems/28186
>
> We may need to dust those off, although I'm still a bit worried about
> the complexities of open-coding all the barriers.

I really need to do something about trying to push them upstream again
actually because we've got them in SLES11 tree.

It would be interesting to know how much the unixbench numbers improve
with the patches.

> Could we also see some kind of profile?  What kind of machine are you
> seeing this on, btw?



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Another Performance Regression in write() syscall
  2009-02-24  8:47   ` Nick Piggin
@ 2009-02-24 16:58     ` Dave Hansen
  2009-02-24 17:29       ` Linus Torvalds
  2009-02-25  3:49       ` Nick Piggin
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Hansen @ 2009-02-24 16:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Salman Qazi, linux-kernel, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Linus Torvalds

On Tue, 2009-02-24 at 19:47 +1100, Nick Piggin wrote:
> On Tuesday 24 February 2009 17:25:45 Dave Hansen wrote:
> > On Mon, 2009-02-23 at 22:05 -0800, Salman Qazi wrote:
> > > Analysis of profile data has led us to believe that the commit
> > > 3d733633a633065729c9e4e254b2e5442c00ef7e has caused a performance
> > > regression.  This commit provides for tracking of writers so that read
> > > only bind mounts function correctly.
> > >
> > > We can verify this regression by applying the following patch to
> > > partially disable the above-mentioned commit and then running the fstime
> > > component of Unixbench.  The settings used were 256 byte writes with
> > > MAX_BLOCK of 2000.
> >
> > I'm a bit surprised that write() is what is regressing.  Unless I
> > screwed up, we do all the expensive accounting at open()/close() time.
> > Is this a test that gets run in parallel on multiple cpus?
> 
> Don't forget touch_atime...

Yeah, that's a good point.  Are we sure that's what is happening here,
though?  That's one thing a profile would hopefully help with.

> Still, open/close isn't unimportant either.

Yeah, that's true.  But, what I noticed was that all of the other
open/close activity masked out any overhead from mnt_want/drop_write()
since a big chunk of the overhead was just going and bringing the
vfsmount pieces into the cache.

> > Could you take a look at Nick's patches to speed this stuff up?
> >
> > 	http://thread.gmane.org/gmane.linux.file-systems/28186
> >
> > We may need to dust those off, although I'm still a bit worried about
> > the complexities of open-coding all the barriers.
> 
> I really need to do something about trying to push them upstream again
> actually because we've got them in SLES11 tree.

Were the patches that you integrated any different from the ones you
posted a few months ago?

-- Dave


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Another Performance Regression in write() syscall
  2009-02-24 16:58     ` Dave Hansen
@ 2009-02-24 17:29       ` Linus Torvalds
  2009-02-24 17:50         ` Ingo Molnar
  2009-02-25  3:49       ` Nick Piggin
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2009-02-24 17:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nick Piggin, Salman Qazi, linux-kernel, Ingo Molnar,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen, Dave Hansen



On Tue, 24 Feb 2009, Dave Hansen wrote:
> 
> Yeah, that's a good point.  Are we sure that's what is happening here,
> though?  That's one thing a profile would hopefully help with.

One thing to note is that _if_ it's purely an issue of nontemporal stores 
vs normal stores, then profiling is very likely going to be almost 
entirely useless. You'll get "results", but the results have nothing 
what-so-ever to do with reality or anything interesting.

The nontemporal stores may stand out in the profiles, but the actual 
performance impact will be all about whether totally unrelated code got 
cache misses or not. Quite often those cache misses will also be in user 
mode, and very possibly in other processes.

So profiles can certainly be interesting, but if Salman says that his 
patch makes a difference for his benchmark, then profiling is almost 
certainly not interesting FOR THAT PATCH. It's interesting mainly as a way 
to look at whether there are then also _other_ issues that are worth 
addressing (ie the whole atime thing is in a whole different dimension 
and an independent issue).

			Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Another Performance Regression in write() syscall
  2009-02-24 17:29       ` Linus Torvalds
@ 2009-02-24 17:50         ` Ingo Molnar
  0 siblings, 0 replies; 8+ messages in thread
From: Ingo Molnar @ 2009-02-24 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Nick Piggin, Salman Qazi, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen, Dave Hansen


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 24 Feb 2009, Dave Hansen wrote:
> > 
> > Yeah, that's a good point.  Are we sure that's what is 
> > happening here, though?  That's one thing a profile would 
> > hopefully help with.
> 
> One thing to note is that _if_ it's purely an issue of 
> nontemporal stores vs normal stores, then profiling is very 
> likely going to be almost entirely useless. You'll get 
> "results", but the results have nothing what-so-ever to do 
> with reality or anything interesting.
> 
> The nontemporal stores may stand out in the profiles, but the 
> actual performance impact will be all about whether totally 
> unrelated code got cache misses or not. Quite often those 
> cache misses will also be in user mode, and very possibly in 
> other processes.
> 
> So profiles can certainly be interesting, but if Salman says 
> that his patch makes a difference for his benchmark, then 
> profiling is almost certainly not interesting FOR THAT PATCH. 
> It's interesting mainly as a way to look at whether there are 
> then also _other_ issues that are worth addressing (ie the 
> whole atime thing is in a whole different dimension and an 
> independent issue).

a 'perfstat' run would certainly be interesting (for cases where 
a pure /usr/bin/time run is inconclusive), comparing the 
unpatched and patched kernel.

That way we can see summary counts for the whole workload, like:

 -----------------------------------------------
 | Performance counter stats for './mmap-perf' |
 -----------------------------------------------
 |                |
 |  x86-defconfig |   PARAVIRT=y
 |------------------------------------------------------------------
 |
 |    1311.554526 |  1360.624932  task clock ticks (msecs)    +3.74%
 |                |
 |              1 |            1  CPU migrations
 |             91 |           79  context switches
 |          55945 |        55943  pagefaults
 |    ............................................
 |     3781392474 |   3918777174  CPU cycles                  +3.63%
 |     1957153827 |   2161280486  instructions               +10.43%
 |       50234816 |     51303520  cache references            +2.12%
 |        5428258 |      5583728  cache misses                +2.86%
 |
 |      437983499 |    478967061  branches                    +9.36%
 |       32486067 |     32336874  branch-misses               -0.46%
 |                |
 |    1314.782469 |  1363.694447  time elapsed (msecs)        +3.72%
 |                |
 -----------------------------------

Such a comparison of would certainly be more meaningful for such 
things than a profile.

Salman, if you are interested in doing a perfstat comparison, 
just pick up a tip:master kernel [perfcounters are 
default-enabled in it]:

   http://people.redhat.com/mingo/tip.git/README

and run perfstat on it (as root, to get the kernel-mode counts 
too):

   http://redhat.com/~mingo/perfcounters/perfstat.c

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Another Performance Regression in write() syscall
  2009-02-24  6:25 ` Dave Hansen
  2009-02-24  8:47   ` Nick Piggin
@ 2009-02-25  1:05   ` Salman Qazi
  1 sibling, 0 replies; 8+ messages in thread
From: Salman Qazi @ 2009-02-25  1:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Andi Kleen, Dave Hansen, nickpiggin, Linus Torvalds

On Mon, Feb 23, 2009 at 10:25 PM, Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> On Mon, 2009-02-23 at 22:05 -0800, Salman Qazi wrote:
>> Analysis of profile data has led us to believe that the commit
>> 3d733633a633065729c9e4e254b2e5442c00ef7e has caused a performance
>> regression.  This commit provides for tracking of writers so that read only
>> bind mounts function correctly.
>>
>> We can verify this regression by applying the following patch to partially
>> disable the above-mentioned commit and then running the fstime component
>> of Unixbench.  The settings used were 256 byte writes with MAX_BLOCK of 2000.
>
> I'm a bit surprised that write() is what is regressing.  Unless I
> screwed up, we do all the expensive accounting at open()/close() time.
> Is this a test that gets run in parallel on multiple cpus?
>
> Could you take a look at Nick's patches to speed this stuff up?
>
>        http://thread.gmane.org/gmane.linux.file-systems/28186
>

The pair of patches seems to fix our problem.  The benchmark results
for 2.6.29-rc6 + the above patches:

308200, 335850, 335900, 335150, 334700

Thanks for your help.

> We may need to dust those off, although I'm still a bit worried about
> the complexities of open-coding all the barriers.
>
> Could we also see some kind of profile?  What kind of machine are you
> seeing this on, btw?

It's an Opteron with with 4 cores.  Unfortunately, I don't have a
profile for the upstream kernel.

>
> -- Dave
>
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Another Performance Regression in write() syscall
  2009-02-24 16:58     ` Dave Hansen
  2009-02-24 17:29       ` Linus Torvalds
@ 2009-02-25  3:49       ` Nick Piggin
  1 sibling, 0 replies; 8+ messages in thread
From: Nick Piggin @ 2009-02-25  3:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Salman Qazi, linux-kernel, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Linus Torvalds

On Wednesday 25 February 2009 03:58:52 Dave Hansen wrote:
> On Tue, 2009-02-24 at 19:47 +1100, Nick Piggin wrote:
> > On Tuesday 24 February 2009 17:25:45 Dave Hansen wrote:
> > > On Mon, 2009-02-23 at 22:05 -0800, Salman Qazi wrote:
> > > > Analysis of profile data has led us to believe that the commit
> > > > 3d733633a633065729c9e4e254b2e5442c00ef7e has caused a performance
> > > > regression.  This commit provides for tracking of writers so that
> > > > read only bind mounts function correctly.
> > > >
> > > > We can verify this regression by applying the following patch to
> > > > partially disable the above-mentioned commit and then running the
> > > > fstime component of Unixbench.  The settings used were 256 byte
> > > > writes with MAX_BLOCK of 2000.
> > >
> > > I'm a bit surprised that write() is what is regressing.  Unless I
> > > screwed up, we do all the expensive accounting at open()/close() time.
> > > Is this a test that gets run in parallel on multiple cpus?
> >
> > Don't forget touch_atime...
>
> Yeah, that's a good point.  Are we sure that's what is happening here,
> though?  That's one thing a profile would hopefully help with.
>
> > Still, open/close isn't unimportant either.
>
> Yeah, that's true.  But, what I noticed was that all of the other
> open/close activity masked out any overhead from mnt_want/drop_write()
> since a big chunk of the overhead was just going and bringing the
> vfsmount pieces into the cache.

That's very true.


> > > Could you take a look at Nick's patches to speed this stuff up?
> > >
> > > 	http://thread.gmane.org/gmane.linux.file-systems/28186
> > >
> > > We may need to dust those off, although I'm still a bit worried about
> > > the complexities of open-coding all the barriers.
> >
> > I really need to do something about trying to push them upstream again
> > actually because we've got them in SLES11 tree.
>
> Were the patches that you integrated any different from the ones you
> posted a few months ago?

Don't think they were significantly changed. I'll take another look
and repost them.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-02-25  3:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-02-24  6:05 Another Performance Regression in write() syscall Salman Qazi
2009-02-24  6:25 ` Dave Hansen
2009-02-24  8:47   ` Nick Piggin
2009-02-24 16:58     ` Dave Hansen
2009-02-24 17:29       ` Linus Torvalds
2009-02-24 17:50         ` Ingo Molnar
2009-02-25  3:49       ` Nick Piggin
2009-02-25  1:05   ` Salman Qazi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox