From: Andrew Morton <akpm@zip.com.au>
To: Andreas Dilger <adilger@clusterfs.com>
Cc: Dave Hansen <haveblue@us.ibm.com>,
mgross@unix-os.sc.intel.com,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
lse-tech@lists.sourceforge.net, richard.a.griffiths@intel.com
Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
Date: Wed, 19 Jun 2002 23:53:45 -0700
Message-ID: <3D117BF9.3657DA1E@zip.com.au>
In-Reply-To: 20020620060337.GJ22427@clusterfs.com
Andreas Dilger wrote:
>
> On Jun 19, 2002 21:09 -0700, Dave Hansen wrote:
> > Andrew Morton wrote:
> > >The vague plan there is to replace lock_kernel with lock_journal
> > >where appropriate. But ext3 scalability work of this nature
> > >will be targeted at the 2.5 kernel, most probably.
> >
> > I really doubt that dropping in lock_journal will help this case very
> > much. Every single kernel_flag entry in the lockmeter output where
> > Util > 0.00% is caused by ext3. The schedule entry is probably caused
> > by something in ext3 grabbing BKL, getting scheduled out for some
> > reason, then having it implicitly released in schedule(). The
> > schedule() contention comes from the reacquire_kernel_lock().
> >
> > We used to see plenty of ext2 BKL contention, but Al Viro did a good
> > job fixing that early in 2.5 using a per-inode rwlock. I think that
> > this is the required level of lock granularity, another global lock
> > just won't cut it.
> > http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
>
> There are a variety of different efforts that could be made towards
> removing the BKL from ext2 and ext3. The first, of course, would be
> to have a per-filesystem lock instead of taking the BKL (I don't know
> if Al has changed lock_super() in 2.5 to be a real semaphore or not).
lock_super() has been `down()' for a long time. In 2.4, too.
> As Andrew mentioned, there would also need to be a per-journal lock to
> ensure coherency of the journal data. Currently the per-filesystem and
> per-journal lock would be equivalent, but when a single journal device
> can be shared among multiple filesystems they would be different locks.
Well. First I want to know if block-highmem is in there. If not,
then yep, we'll spend ages spinning on the BKL. Because ext3 _is_
BKL-happy, and if a CPU takes a disk interrupt while holding the BKL
and then sits there in interrupt context copying tons of cache-cold
memory around, guess what the other CPUs will be doing?
> I will leave it up to Andrew and Stephen to discuss locking scalability
> within the journal layer.
ext3 is about 700x as complex as ext2. It will need to be done with
some care.
> Within the filesystem there can be a large number of increasingly fine
> locks added - a superblock-only lock with per-group locks, or even
> per-bitmap and per-inode-table(-block) locks if needed. This would
> allow multi-threaded inode and block allocations, but a sane lock
> ranking strategy would have to be developed. The bitmap locks would
> only need to be 2-state locks, because you only look at the bitmaps
> when you want to modify them. The inode table locks would be read/write
> locks.
The next steps for ext2 are: stare at Anton's next set of graphs and
then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer
LRUs to avoid blockdev mapping lock contention, per-blockgroup locks
and removal of lock_super from the block allocator.
But there's no point in doing that while zone->lock and pagemap_lru_lock
are top of the list. Fixes for both of those are in progress.
ext2 is bog-simple. It will scale up the wazoo in 2.6.
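
To make the per-blockgroup locking idea concrete, here is a minimal
userspace sketch using pthreads. It is not ext2 code: the names
(block_group, alloc_block, lock_group_pair) and the sizes are
hypothetical, and a single 64-bit word stands in for each group's block
bitmap. It assumes the "2-state" bitmap lock is a plain mutex and that
the lock ranking is "lower group index first".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NGROUPS 4            /* hypothetical number of block groups */
#define BLOCKS_PER_GROUP 64  /* hypothetical: one bitmap word per group */

struct block_group {
    pthread_mutex_t bitmap_lock; /* 2-state lock guarding this group's bitmap */
    uint64_t bitmap;             /* bit set = block in use */
};

static struct block_group groups[NGROUPS];

static void groups_init(void)
{
    for (int g = 0; g < NGROUPS; g++) {
        pthread_mutex_init(&groups[g].bitmap_lock, NULL);
        groups[g].bitmap = 0;
    }
}

/*
 * Allocate one block, trying the goal group first and then the others.
 * Only one group lock is held at any time, so allocations in different
 * groups do not contend. Returns a global block number, or -1 if full.
 */
static long alloc_block(int goal)
{
    assert(goal >= 0 && goal < NGROUPS);
    for (int i = 0; i < NGROUPS; i++) {
        int g = (goal + i) % NGROUPS;
        struct block_group *bg = &groups[g];
        long found = -1;

        pthread_mutex_lock(&bg->bitmap_lock);
        for (int bit = 0; bit < BLOCKS_PER_GROUP; bit++) {
            if (!(bg->bitmap & (UINT64_C(1) << bit))) {
                bg->bitmap |= UINT64_C(1) << bit;
                found = (long)g * BLOCKS_PER_GROUP + bit;
                break;
            }
        }
        pthread_mutex_unlock(&bg->bitmap_lock);
        if (found >= 0)
            return found;
    }
    return -1;
}

/* Lock ranking: when two groups are needed at once, lock the lower index first. */
static void lock_group_pair(int a, int b)
{
    assert(a != b);
    int lo = a < b ? a : b;
    int hi = a < b ? b : a;
    pthread_mutex_lock(&groups[lo].bitmap_lock);
    pthread_mutex_lock(&groups[hi].bitmap_lock);
}

static void unlock_group_pair(int a, int b)
{
    pthread_mutex_unlock(&groups[a].bitmap_lock);
    pthread_mutex_unlock(&groups[b].bitmap_lock);
}
```

Because every caller takes a pair of group locks in the same order, two
threads cannot deadlock by each holding one group and waiting for the
other.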
> If there is a try-writelock mechanism for the individual inode table
> blocks you can avoid write lock contention for creations by simply
> finding the first un-write-locked block in the target group's inode table
> (usually in the hundreds of blocks per group for default parameters).
Depends on what the profiles say, Andreas. And I mean profiles - lockmeter
tends to tell you "what", not "why". Start at the top of the list. Fix
them by design if possible. If not, tweak it!
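
For reference, the try-writelock scan Andreas describes could look
roughly like the userspace sketch below. It is only an illustration
under stated assumptions: POSIX rwlocks stand in for whatever lock the
kernel would use, and itable_block, itable_grab_block and the block
count are hypothetical names and values, not anything from ext2 or ext3.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

#define ITABLE_BLOCKS 8  /* hypothetical: inode table blocks in one group */

struct itable_block {
    pthread_rwlock_t lock; /* read/write lock on one inode table block */
    int free_inodes;       /* free inode slots left in this block */
};

static struct itable_block itable[ITABLE_BLOCKS];

static void itable_init(int free_per_block)
{
    for (int i = 0; i < ITABLE_BLOCKS; i++) {
        pthread_rwlock_init(&itable[i].lock, NULL);
        itable[i].free_inodes = free_per_block;
    }
}

/*
 * Find the first block that is not currently locked and still has a
 * free inode. Busy blocks are skipped instead of waited on, so
 * concurrent creators spread across the table. On success the block is
 * returned write-locked with one inode reserved; returns -1 otherwise.
 */
static int itable_grab_block(void)
{
    for (int i = 0; i < ITABLE_BLOCKS; i++) {
        if (pthread_rwlock_trywrlock(&itable[i].lock) != 0)
            continue; /* someone else holds it: move on */
        if (itable[i].free_inodes > 0) {
            itable[i].free_inodes--;
            return i;
        }
        pthread_rwlock_unlock(&itable[i].lock); /* full block, keep looking */
    }
    return -1;
}

/* Drop the write lock taken by itable_grab_block(). */
static void itable_release(int i)
{
    pthread_rwlock_unlock(&itable[i].lock);
}
```

Whether skipping busy blocks actually pays off is exactly the kind of
question the profiles would have to answer.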