* RE: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
@ 2002-06-20 15:24 Griffiths, Richard A
0 siblings, 0 replies; 4+ messages in thread
From: Griffiths, Richard A @ 2002-06-20 15:24 UTC (permalink / raw)
To: 'Andrew Morton', Andreas Dilger
Cc: Dave Hansen, mgross, Linux Kernel Mailing List, lse-tech,
Griffiths, Richard A
The Sparc only had 1GB of memory so for parity we ran the Linux machine
without highmem enabled.
Richard
-----Original Message-----
From: Andrew Morton [mailto:akpm@zip.com.au]
Sent: Wednesday, June 19, 2002 11:54 PM
To: Andreas Dilger
Cc: Dave Hansen; mgross@unix-os.sc.intel.com; Linux Kernel Mailing List;
lse-tech@lists.sourceforge.net; richard.a.griffiths@intel.com
Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of
spindles gets large
Andreas Dilger wrote:
>
> On Jun 19, 2002 21:09 -0700, Dave Hansen wrote:
> > Andrew Morton wrote:
> > >The vague plan there is to replace lock_kernel with lock_journal
> > >where appropriate. But ext3 scalability work of this nature
> > >will be targeted at the 2.5 kernel, most probably.
> >
> > I really doubt that dropping in lock_journal will help this case very
> > much. Every single kernel_flag entry in the lockmeter output where
> > Util > 0.00% is caused by ext3. The schedule entry is probably caused
> > by something in ext3 grabbing BKL, getting scheduled out for some
> > reason, then having it implicitly released in schedule(). The
> > schedule() contention comes from the reacquire_kernel_lock().
> >
> > We used to see plenty of ext2 BKL contention, but Al Viro did a good
> > job fixing that early in 2.5 using a per-inode rwlock. I think that
> > this is the required level of lock granularity, another global lock
> > just won't cut it.
> > http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
>
> There are a variety of different efforts that could be made towards
> removing the BKL from ext2 and ext3. The first, of course, would be
> to have a per-filesystem lock instead of taking the BKL (I don't know
> if Al has changed lock_super() in 2.5 to be a real semaphore or not).
lock_super() has been `down()' for a long time. In 2.4, too.
> As Andrew mentioned, there would also need to be a per-journal lock to
> ensure coherency of the journal data. Currently the per-filesystem and
> per-journal lock would be equivalent, but when a single journal device
> can be shared among multiple filesystems they would be different locks.
Well. First I want to know if block-highmem is in there. If not,
then yep, we'll spend ages spinning on the BKL. Because ext3 _is_
BKL-happy, and if a CPU takes a disk interrupt while holding the BKL
and then sits there in interrupt context copying tons of cache-cold
memory around, guess what the other CPUs will be doing?
> I will leave it up to Andrew and Stephen to discuss locking scalability
> within the journal layer.
ext3 is about 700x as complex as ext2. It will need to be done with
some care.
> Within the filesystem there can be a large number of increasingly fine
> locks added - a superblock-only lock with per-group locks, or even
> per-bitmap and per-inode-table(-block) locks if needed. This would
> allow multi-threaded inode and block allocations, but a sane lock
> ranking strategy would have to be developed. The bitmap locks would
> only need to be 2-state locks, because you only look at the bitmaps
> when you want to modify them. The inode table locks would be read/write
> locks.
The next steps for ext2 are: stare at Anton's next set of graphs and
then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer
LRUs to avoid blockdev mapping lock contention, per-blockgroup locks
and removal of lock_super from the block allocator.
But there's no point in doing that while zone->lock and pagemap_lru_lock
are top of the list. Fixes for both of those are in progress.
ext2 is bog-simple. It will scale up the wazoo in 2.6.
> If there is a try-writelock mechanism for the individual inode table
> blocks you can avoid write lock contention for creations by simply
> finding the first un-write-locked block in the target group's inode table
> (usually in the hundreds of blocks per group for default parameters).
Depends on what the profiles say, Andreas. And I mean profiles - lockmeter
tends to tell you "what", not "why". Start at the top of the list. Fix
them by design if possible. If not, tweak it!
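The try-writelock idea Andreas describes above can be sketched in user space. This is only an illustration, not ext2/ext3 code: it assumes one lock per inode-table block, uses a plain `threading.Lock` in place of a kernel read/write lock, and the names `claim_first_unlocked` and `inode_table_locks` are made up for the example.

```python
import threading

def claim_first_unlocked(block_locks):
    """Return (index, lock) for the first block whose lock could be taken
    without waiting, or None if every block is currently locked."""
    for index, lock in enumerate(block_locks):
        # acquire(blocking=False) is the try-lock: it returns immediately
        # with True (lock taken) or False (someone else holds it).
        if lock.acquire(blocking=False):
            return index, lock
    return None

# Hypothetical inode table for one block group: one lock per table block.
inode_table_locks = [threading.Lock() for _ in range(4)]
inode_table_locks[0].acquire()  # simulate another creator holding block 0

claimed = claim_first_unlocked(inode_table_locks)
if claimed is not None:
    index, lock = claimed
    try:
        print("allocating inode in table block", index)
    finally:
        lock.release()
```

A creator that finds block 0 busy moves on to block 1 instead of waiting, which is the contention-avoidance effect described in the quoted paragraph. Whether it pays off in practice is exactly what the profiles would have to show.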
* RE: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
@ 2002-06-20 16:24 Gross, Mark
0 siblings, 0 replies; 4+ messages in thread
From: Gross, Mark @ 2002-06-20 16:24 UTC (permalink / raw)
To: 'Dave Hansen', Gross, Mark
Cc: 'Russell Leighton', Andrew Morton, mgross,
Linux Kernel Mailing List, lse-tech, Griffiths, Richard A
I don't have much visibility into this platform's journaling requirements.
I suspect it's to enable fast reboot / recovery from some klutz bumping the
power cord or a crash of some sort.
I will raise the issue with the platform folks. However, for now I'm
looking for ways to make it scale competitively WRT adapters and spindles
for writes without changing the file system. If this turns out to be a dead
end then, hopefully, we'll move to a more spindle-friendly file system.
The workload is http://www.coker.com.au/bonnie++/ (one of the newer versions ;)
--mgross
(W) 503-712-8218
MS: JF1-05
2111 N.E. 25th Ave.
Hillsboro, OR 97124
> -----Original Message-----
> From: Dave Hansen [mailto:haveblue@us.ibm.com]
> Sent: Thursday, June 20, 2002 9:10 AM
> To: Gross, Mark
> Cc: 'Russell Leighton'; Andrew Morton; mgross@unix-os.sc.intel.com;
> Linux Kernel Mailing List; lse-tech@lists.sourceforge.net; Griffiths,
> Richard A
> Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of
> spindles gets large
>
>
> Gross, Mark wrote:
> > We will get around to reformatting our spindles to some other FS after
> > we get as much data and analysis out of our current configuration as we
> > can get.
> >
> > We'll report out our findings on the lock contention, and throughput
> > data for some other FS then. I'd like recommendations on what file
> > systems to try, besides ext2.
>
> Do you really need a journaling FS? If not, I think ext2 is a sure
> bet to be the fastest. If you do need journaling, try reiserfs and jfs.
>
> BTW, what kind of workload are you running under?
>
> --
> Dave Hansen
> haveblue@us.ibm.com
>
* RE: [Lse-tech] RE: ext3 performance bottleneck as the number of spindles gets large
@ 2002-06-24 16:33 Gross, Mark
2002-06-24 17:14 ` Eric W. Biederman
0 siblings, 1 reply; 4+ messages in thread
From: Gross, Mark @ 2002-06-24 16:33 UTC (permalink / raw)
To: 'Christopher E. Brown', Griffiths, Richard A
Cc: 'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
We are running the tests with the following motherboard.
http://www.intel.com/design/servers/scb2/index.htm?iid=ipp_browse+motherbd_scb2&
It's a very nice box with 2 independent 64/66 PCI buses.
Capable of 2x503MB/sec, using your logic ;)
Regardless, the 640MB/s number was computed without considering the PCI bus
limitations, or the dual-port nature of the Adaptec SCSI-39160 and its base
160MB/sec rate.
http://www.adaptec.com/worldwide/product/proddetail.html?sess=no&prodkey=ASC-39160&cat=Products
Realistically, we are looking for a max throughput of about 320MB/sec with 4
adapters with enough drives attached.
--mgross
(W) 503-712-8218
MS: JF1-05
2111 N.E. 25th Ave.
Hillsboro, OR 97124
> -----Original Message-----
> From: Christopher E. Brown [mailto:cbrown@woods.net]
> Sent: Saturday, June 22, 2002 9:03 PM
> To: Griffiths, Richard A
> Cc: 'Andrew Morton'; mgross@unix-os.sc.intel.com; 'Jens Axboe'; Linux
> Kernel Mailing List; lse-tech@lists.sourceforge.net
> Subject: [Lse-tech] RE: ext3 performance bottleneck as the number of
> spindles gets large
>
>
> On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
>
> > I should have mentioned the throughput we saw on 4 adapters 6 drives was
> > 126KB/s. The max theoretical bus bandwidth is 640MB/s.
>
>
> This is *NOT* correct. Assuming a 64bit 66Mhz PCI bus your MAX is
> 503MB/sec minus PCI overhead...
>
> This of course assumes nothing else is using the PCI bus.
>
>
> 120 something MB/sec sounds a hell of a lot like topping out a 32bit
> 33Mhz PCI bus, but IIRC the earlier posting listed 39160 cards, PCI
> 64bit w/ backward compat to 32bit.
>
> You do have *ALL* of these cards plugged into a full PCI 64bit/66Mhz
> slot right? Not plugging them into a 32bit/33Mhz slot?
>
>
> 32bit/33Mhz (32 * 33,000,000) / (1024 * 1024 * 8) = 125.89 MByte/sec
> 64bit/33Mhz (64 * 33,000,000) / (1024 * 1024 * 8) = 251.77 MByte/sec
> 64bit/66Mhz (64 * 66,000,000) / (1024 * 1024 * 8) = 503.54 MByte/sec
>
>
> NOTE: PCI transfer rates are often listed as
>
> 32bit/33Mhz, 132 MByte/sec
> 64bit/33Mhz, 264 MByte/sec
> 64bit/66Mhz, 528 MByte/sec
>
> This is somewhat true, but only if we start with Mbit rates as used in
> transmission rates (1,000,000 bits/sec) and work from there, instead
> of 2^20 (1,048,576). I will not argue about PCI 32bit/33Mhz being
> 1056Mbit, if talking about line rate, but when we are talking about
> storage media and transfers to/from as measured by files remember to
> convert.
>
> --
> I route, therefore you are.
>
>
>
>
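The PCI figures in the quoted message can be reproduced with a few lines of arithmetic. The sketch below uses the same bus widths and clock values as the post; the function name `pci_peak` is only for illustration, and the results are theoretical peaks before any PCI protocol overhead.

```python
MB = 1000 * 1000    # decimal megabyte, as used for line rates
MIB = 1024 * 1024   # 2^20 bytes, as used in the quoted table

def pci_peak(width_bits, clock_hz):
    """Return the peak transfer rate as (decimal MB/sec, 2^20-based MByte/sec)."""
    bytes_per_sec = width_bits * clock_hz / 8
    return bytes_per_sec / MB, bytes_per_sec / MIB

for width, clock in [(32, 33_000_000), (64, 33_000_000), (64, 66_000_000)]:
    decimal, binary = pci_peak(width, clock)
    print(f"{width}bit/{clock // 1_000_000}MHz: "
          f"{decimal:.0f} MB/sec (10^6), {binary:.2f} MByte/sec (2^20)")
```

The two columns differ only in the divisor, which is the whole of the 132 versus 125.89 discrepancy discussed in the message.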
* Re: [Lse-tech] RE: ext3 performance bottleneck as the number of spindles gets large
2002-06-24 16:33 [Lse-tech] RE: ext3 performance bottleneck as the number of spindles gets large Gross, Mark
@ 2002-06-24 17:14 ` Eric W. Biederman
0 siblings, 0 replies; 4+ messages in thread
From: Eric W. Biederman @ 2002-06-24 17:14 UTC (permalink / raw)
To: Gross, Mark
Cc: 'Christopher E. Brown', Griffiths, Richard A,
'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
"Gross, Mark" <mark.gross@intel.com> writes:
> We are running the tests with the following motherboard.
> http://www.intel.com/design/servers/scb2/index.htm?iid=ipp_browse+motherbd_scb2&
>
> It's a very nice box with 2 independent 64/66 PCI buses.
> Capable of 2x503MB/sec, using your logic ;)
>
> Regardless, the 640MB/s number was computed without considering the PCI bus
> limitations, or the dual-port nature of the Adaptec SCSI-39160 and its base
> 160MB/sec rate.
> http://www.adaptec.com/worldwide/product/proddetail.html?sess=no&prodkey=ASC-39160&cat=Products
>
> Realistically, we are looking for a max throughput of about 320MB/sec with 4
> adapters with enough drives attached.
Careful: even at 320MB/sec this requires 50% of your system's theoretical
memory bandwidth in DMA transfers.
Application level benchmarks like streams can only achieve memory copy numbers
on a PIII platform of about 320MB/sec. A highly tuned MMX or SSE memory
copy can do better, but it is a challenge.
That close to the hardware limits, finding the actual bottleneck can
get very tricky. At the very least I would run the system with
just one processor, and attempt to get the numbers that way. I can
trivially see spinlock hold times staying high simply because
the memory is busy with a DMA transfer, and so cannot be used to
transfer the new contents of the lock.
Eric
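For a rough feel of the memory-copy number on a given machine, a crude timing loop is enough. This sketch is not the streams benchmark mentioned above: it times Python's own buffer copy, which includes allocation cost, and the buffer size, repeat count and function name are arbitrary choices for the example.

```python
import time

def copy_bandwidth_mb_per_sec(size_bytes=64 * 1024 * 1024, repeats=5):
    """Time a plain buffer copy and return the best observed rate in MB/sec."""
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full copy of the source buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return size_bytes / best / 1_000_000

print(f"copy bandwidth: about {copy_bandwidth_mb_per_sec():.0f} MB/sec")
```

Taking the best of several runs reduces noise from scheduling and cache warm-up; a dedicated benchmark would still be the better tool for comparing against the DMA rate.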