From: john stultz <johnstul@us.ibm.com>
To: linux-ext4@vger.kernel.org
Cc: Mingming Cao <cmm@us.ibm.com>,
	keith mannthey <kmannth@us.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	"Theodore Ts'o" <tytso@mit.edu>, Darren Hart <dvhltc@us.ibm.com>
Subject: ext4 dbench performance with CONFIG_PREEMPT_RT
Date: Wed, 07 Apr 2010 16:21:18 -0700	[thread overview]
Message-ID: <1270682478.3755.58.camel@localhost.localdomain> (raw)

I've recently been working on scalability issues with the PREEMPT_RT
kernel patch, and one area I've hit has been ext4 journal j_state_lock
contention.

With the PREEMPT_RT kernel, most spinlocks become pi-aware mutexes, so
they sleep when the lock cannot be acquired. This means lock contention
has a *much* greater impact on the PREEMPT_RT kernel than on mainline,
so scalability issues hit the PREEMPT_RT kernel at lower cpu counts than
they do mainline.

Now, you might not care about PREEMPT_RT, but consider that any lock
contention will get worse in mainline as the cpu count increases. So in
this way, the PREEMPT_RT kernel allows us to see what scalability issues
are on the horizon for mainline kernels running on larger systems.

When running dbench on ext4 with the -rt kernel, I saw a fairly severe
performance drop-off (roughly -60%) going from 4 to 8 clients (on an
8-cpu machine).

Looking at the perf log, I could see there was quite a bit of lock
contention in the jbd2 start_this_handle/jbd2_journal_stop code paths:

27.39%       dbench  [kernel]                    [k] _raw_spin_lock_irqsave
                |          
                |--90.91%-- rt_spin_lock_slowlock
                |          rt_spin_lock
                |          |          
                |          |--66.92%-- start_this_handle
                |          |          jbd2_journal_start
                |          |          ext4_journal_start_sb
                |          |          |          
... 
                |          |          
                |          |--32.31%-- jbd2_journal_stop
                |          |          __ext4_journal_stop
                |          |          |          
                |          |          |--92.86%-- ext4_da_write_end
                |          |          |          generic_file_buffered_write

Full perf log here:
http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.33/linux-2.6.33-rt11-vfs-ext4-8cpu.log

The perf logs for mainline showed similar contention, but at 8 cpus it
was not severe enough to drastically hurt performance in the same way.

Using lockstat, I was further able to isolate the contention down to the
journal's j_state_lock, and after adding some lock-owner tracking I
could see that, when we hit contention, the lock owner was almost always
in start_this_handle or jbd2_journal_stop (with the frequency breakdown
being roughly 55% in jbd2_journal_stop and 45% in start_this_handle).


Now, I'm very new to this code, and I don't fully understand the locking
rules. There is some documentation in the journal_s structure about
what's protected by the j_state_lock, but I'm not sure if it's 100%
correct or not.

For the most part, jbd2_journal_stop only reads values in the journal_t
struct, so, throwing caution to the wind, I simply removed the locking
there (trying to be careful in the one case of a structure write to use
cmpxchg so it's sort of an atomic write).
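
In rough terms, the hack boils down to something like the sketch below.
This is only an illustration of the cmpxchg pattern, not the linked
patch; the field j_foo_stat and the averaging math are made-up stand-ins
for the one real write in jbd2_journal_stop:

/*
 * Before: a single-word statistics update done under j_state_lock.
 * (j_foo_stat is hypothetical, used only for illustration.)
 */
static void journal_update_stat_locked(journal_t *journal, unsigned long sample)
{
	spin_lock(&journal->j_state_lock);
	journal->j_foo_stat = (journal->j_foo_stat + sample) / 2;
	spin_unlock(&journal->j_state_lock);
}

/*
 * After: retry with cmpxchg() until our update lands, so the write stays
 * "sort of atomic" without taking the lock.  cmpxchg() returns the value
 * it found at the address; if that isn't the value we computed the
 * update from, someone raced with us and we go around again.
 */
static void journal_update_stat_lockless(journal_t *journal, unsigned long sample)
{
	unsigned long old, new;

	do {
		old = ACCESS_ONCE(journal->j_foo_stat);
		new = (old + sample) / 2;
	} while (cmpxchg(&journal->j_foo_stat, old, new) != old);
}

Note that this only covers updates to a single machine word; anything
that has to keep multiple fields consistent with each other still needs
a real lock (or one of the schemes in the questions below).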

Doing this more than doubled performance (a 2.5x increase).

Now, I know it's terribly, terribly wrong, probably in more ways than I
imagine, and I'm in no way suggesting this "locks are for data integrity
nerds, look at this performance!" mentality is valid. But it does show
exactly where contention is hurting us, and maybe what is achievable if
we can rework some of the locking here.


The terrible terrible disk-eating* patch can be found here:
http://sr71.net/~jstultz/dbench-scalability/patches/2.6.33-ext4-hack/state_lock_hack.patch

Here's a chart comparing 2.6.33 vs 2.6.33-rt11-vfs, both with and
without the terrible terrible disk-eating* locking hack:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33-ext4-hack/2.6.33-ext4-hack.png


All the perf logs for the 4 cases above:
http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.33-ext4-hack/


As you can see in the charts, the hack doesn't seem to give any benefit
to mainline at only 8 cpus, but looking at the perf logs it does seem to
lower the overhead of the journal code.

The charts also show that start_this_handle still sees some contention,
so there are still more gains to be had by reworking the j_state_lock.


So my questions from here are:
1) Some of the values protected by the j_state_lock are only protected
for a quick modification or read. Could these be converted to atomic_t?
If not, why? (See the sketch after question 3 for the sort of conversion
I mean.)

2) Alternatively, could some of the values protected by the j_state_lock
that look mostly-read be converted to something like a seqlock (also
covered in the sketch below)? Again, I'm looking more for the
locking-dependency reasons than just a yes or no.

3) Is RCU too crazy for filesystems?
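
To make 1) and 2) a bit more concrete, here's the sort of conversion I
have in mind. This is only a sketch against a made-up structure:
foo_state, its fields, and the functions below are hypothetical
stand-ins for pieces of journal_t, not a proposed patch.

#include <asm/atomic.h>
#include <linux/seqlock.h>

/* Hypothetical stand-in for a slice of journal_t. */
struct foo_state {
	atomic_t	use_count;	/* 1) a count j_state_lock only guards briefly */
	seqlock_t	seq;		/* 2) protects the mostly-read foo_tid below */
	unsigned long	foo_tid;
};

/* Assume atomic_set(&s->use_count, 0) and seqlock_init(&s->seq) at setup. */

/* 1) Quick modification and read, no spinlock needed. */
static void foo_get(struct foo_state *s)
{
	atomic_inc(&s->use_count);
}

static int foo_users(struct foo_state *s)
{
	return atomic_read(&s->use_count);
}

/* 2) Readers never block; they just retry if a writer slipped in. */
static unsigned long foo_read_tid(struct foo_state *s)
{
	unsigned int start;
	unsigned long tid;

	do {
		start = read_seqbegin(&s->seq);
		tid = s->foo_tid;
	} while (read_seqretry(&s->seq, start));

	return tid;
}

/* Writers still serialize against each other, but readers stay lockless. */
static void foo_set_tid(struct foo_state *s, unsigned long tid)
{
	write_seqlock(&s->seq);
	s->foo_tid = tid;
	write_sequnlock(&s->seq);
}

The attraction for -rt is that the read side never sleeps on a lock at
all; the open question is whether the j_state_lock readers really are
just reading independent values, or whether there are ordering
dependencies between fields that would make this unsafe.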


If there's any better documentation on the locking rules here, I'd also
appreciate a pointer.

Any other thoughts or feedback would be greatly appreciated.

thanks
-john

* No disks were actually eaten in the creation of this data. The patch
actually ran fine the whole time, but that said, DON'T TRY IT AT HOME!

