Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large

From: Andrew Morton <akpm@zip.com.au>
To: Andreas Dilger <adilger@clusterfs.com>
Cc: Dave Hansen <haveblue@us.ibm.com>,
	mgross@unix-os.sc.intel.com,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	lse-tech@lists.sourceforge.net, richard.a.griffiths@intel.com
Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
Date: Wed, 19 Jun 2002 23:53:45 -0700
Message-ID: <3D117BF9.3657DA1E@zip.com.au>
In-Reply-To: 20020620060337.GJ22427@clusterfs.com

Andreas Dilger wrote:
> 
> On Jun 19, 2002  21:09 -0700, Dave Hansen wrote:
> > Andrew Morton wrote:
> > >The vague plan there is to replace lock_kernel with lock_journal
> > >where appropriate.  But ext3 scalability work of this nature
> > >will be targeted at the 2.5 kernel, most probably.
> >
> > I really doubt that dropping in lock_journal will help this case very
> > much.  Every single kernel_flag entry in the lockmeter output where
> > Util > 0.00% is caused by ext3.  The schedule entry is probably caused
> > by something in ext3 grabbing BKL, getting scheduled out for some
> > reason, then having it implicitly released in schedule().  The
> > schedule() contention comes from the reacquire_kernel_lock().
> >
> > We used to see plenty of ext2 BKL contention, but Al Viro did a good
> > job fixing that early in 2.5 using a per-inode rwlock.  I think that
> > this is the required level of lock granularity, another global lock
> > just won't cut it.
> > http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
> 
> There are a variety of different efforts that could be made towards
> removing the BKL from ext2 and ext3.  The first, of course, would be
> to have a per-filesystem lock instead of taking the BKL (I don't know
> if Al has changed lock_super() in 2.5 to be a real semaphore or not).

lock_super() has been `down()' for a long time.  In 2.4, too.

> As Andrew mentioned, there would also need to be a per-journal lock to
> ensure coherency of the journal data.  Currently the per-filesystem and
> per-journal lock would be equivalent, but when a single journal device
> can be shared among multiple filesystems they would be different locks.

Well.  First I want to know if block-highmem is in there.  If not,
then yep, we'll spend ages spinning on the BKL.  Because ext3 _is_
BKL-happy, and if a CPU takes a disk interrupt while holding the BKL
and then sits there in interrupt context copying tons of cache-cold
memory around, guess what the other CPUs will be doing?

> I will leave it up to Andrew and Stephen to discuss locking scalability
> within the journal layer.

ext3 is about 700x as complex as ext2.  It will need to be done with
some care.
 
> Within the filesystem there can be a large number of increasingly fine
> locks added - a superblock-only lock with per-group locks, or even
> per-bitmap and per-inode-table(-block) locks if needed.  This would
> allow multi-threaded inode and block allocations, but a sane lock
> ranking strategy would have to be developed.  The bitmap locks would
> only need to be 2-state locks, because you only look at the bitmaps
> when you want to modify them.  The inode table locks would be read/write
> locks.

The next steps for ext2 are: stare at Anton's next set of graphs and
then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer
LRUs to avoid blockdev mapping lock contention,  per-blockgroup locks
and removal of lock_super from the block allocator.

But there's no point in doing that while zone->lock and pagemap_lru_lock
are top of the list.  Fixes for both of those are in progress.

ext2 is bog-simple.  It will scale up the wazoo in 2.6.
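
For illustration only, a minimal user-space sketch of what "per-blockgroup
locks" in a block allocator could look like. This is not ext2 code: the
struct, the function names, and the tiny group size are all hypothetical,
and pthread mutexes stand in for whatever kernel primitive would be used.
The point is that an allocation takes only the lock of the group it
touches, so allocations in different groups do not serialise on one
filesystem-wide lock such as lock_super().

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NGROUPS          4   /* hypothetical: number of block groups */
#define BLOCKS_PER_GROUP 32  /* hypothetical: one 32-bit bitmap per group */

/* Hypothetical stand-in for a block group and its allocation bitmap. */
struct block_group {
    pthread_mutex_t lock;    /* protects only this group's bitmap */
    uint32_t bitmap;         /* bit set = block in use */
};

static struct block_group groups[NGROUPS];

static void groups_init(void)
{
    for (int g = 0; g < NGROUPS; g++) {
        pthread_mutex_init(&groups[g].lock, NULL);
        groups[g].bitmap = 0;
    }
}

/*
 * Allocate the first free block in group g. Returns the filesystem-wide
 * block number, or -1 if the group is full. Only group g's lock is held.
 */
static long alloc_block_in_group(int g)
{
    long blk = -1;

    assert(g >= 0 && g < NGROUPS);
    pthread_mutex_lock(&groups[g].lock);
    for (int bit = 0; bit < BLOCKS_PER_GROUP; bit++) {
        if (!(groups[g].bitmap & (UINT32_C(1) << bit))) {
            groups[g].bitmap |= UINT32_C(1) << bit;
            blk = (long)g * BLOCKS_PER_GROUP + bit;
            break;
        }
    }
    pthread_mutex_unlock(&groups[g].lock);
    return blk;
}
```

Whether this granularity pays off in practice depends on the profiles, as
noted above; the sketch only shows the shape of the locking.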
 
> If there is a try-writelock mechanism for the individual inode table
> blocks you can avoid write lock contention for creations by simply
> finding the first un-write-locked block in the target group's inode table
> (usually in the hundreds of blocks per group for default parameters).

Depends on what the profiles say, Andreas.  And I mean profiles - lockmeter
tends to tell you "what", not "why".   Start at the top of the list.  Fix
them by design if possible.  If not, tweak it!

