RE: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large


* RE: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
From: Griffiths, Richard A @ 2002-06-20 15:24 UTC (permalink / raw)
  To: 'Andrew Morton', Andreas Dilger
  Cc: Dave Hansen, mgross, Linux Kernel Mailing List, lse-tech,
	Griffiths, Richard A

The Sparc only had 1GB of memory so for parity we ran the Linux machine
without highmem enabled.

Richard

-----Original Message-----
From: Andrew Morton [mailto:akpm@zip.com.au]
Sent: Wednesday, June 19, 2002 11:54 PM
To: Andreas Dilger
Cc: Dave Hansen; mgross@unix-os.sc.intel.com; Linux Kernel Mailing List;
lse-tech@lists.sourceforge.net; richard.a.griffiths@intel.com
Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of
spindles gets large


Andreas Dilger wrote:
> 
> On Jun 19, 2002  21:09 -0700, Dave Hansen wrote:
> > Andrew Morton wrote:
> > >The vague plan there is to replace lock_kernel with lock_journal
> > >where appropriate.  But ext3 scalability work of this nature
> > >will be targeted at the 2.5 kernel, most probably.
> >
> > I really doubt that dropping in lock_journal will help this case very
> > much.  Every single kernel_flag entry in the lockmeter output where
> > Util > 0.00% is caused by ext3.  The schedule entry is probably caused
> > by something in ext3 grabbing BKL, getting scheduled out for some
> > reason, then having it implicitly released in schedule().  The
> > schedule() contention comes from the reacquire_kernel_lock().
> >
> > We used to see plenty of ext2 BKL contention, but Al Viro did a good
> > job fixing that early in 2.5 using a per-inode rwlock.  I think that
> > this is the required level of lock granularity, another global lock
> > just won't cut it.
> > http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
> 
> There are a variety of different efforts that could be made towards
> removing the BKL from ext2 and ext3.  The first, of course, would be
> to have a per-filesystem lock instead of taking the BKL (I don't know
> if Al has changed lock_super() in 2.5 to be a real semaphore or not).

lock_super() has been `down()' for a long time.  In 2.4, too.

> As Andrew mentioned, there would also need to be a per-journal lock to
> ensure coherency of the journal data.  Currently the per-filesystem and
> per-journal lock would be equivalent, but when a single journal device
> can be shared among multiple filesystems they would be different locks.

Well.  First I want to know if block-highmem is in there.  If not,
then yep, we'll spend ages spinning on the BKL.  Because ext3 _is_
BKL-happy, and if a CPU takes a disk interrupt while holding the BKL
and then sits there in interrupt context copying tons of cache-cold
memory around, guess what the other CPUs will be doing?

> I will leave it up to Andrew and Stephen to discuss locking scalability
> within the journal layer.

ext3 is about 700x as complex as ext2.  It will need to be done with
some care.
 
> Within the filesystem there can be a large number of increasingly fine
> locks added - a superblock-only lock with per-group locks, or even
> per-bitmap and per-inode-table(-block) locks if needed.  This would
> allow multi-threaded inode and block allocations, but a sane lock
> ranking strategy would have to be developed.  The bitmap locks would
> only need to be 2-state locks, because you only look at the bitmaps
> when you want to modify them.  The inode table locks would be read/write
> locks.

The next steps for ext2 are: stare at Anton's next set of graphs and
then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer
LRUs to avoid blockdev mapping lock contention,  per-blockgroup locks
and removal of lock_super from the block allocator.

But there's no point in doing that while zone->lock and pagemap_lru_lock
are top of the list.  Fixes for both of those are in progress.

ext2 is bog-simple.  It will scale up the wazoo in 2.6.
 
> If there is a try-writelock mechanism for the individual inode table
> blocks you can avoid write lock contention for creations by simply
> finding the first un-write-locked block in the target group's inode table
> (usually in the hundreds of blocks per group for default parameters).

Depends on what the profiles say, Andreas.  And I mean profiles - lockmeter
tends to tell you "what", not "why".   Start at the top of the list.  Fix
them by design if possible.  If not, tweak it!
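
The try-writelock scan Andreas describes above can be sketched in user space.
This is only an illustration, not ext3 code: InodeTableBlock and
claim_unlocked_block are hypothetical names, and a plain threading.Lock
stands in for the write side of a read/write lock.

```python
import threading


class InodeTableBlock:
    """Hypothetical stand-in for one inode table block and its lock."""

    def __init__(self, index):
        self.index = index
        # Assumption: a plain mutex models the write side of a rwlock.
        self.lock = threading.Lock()


def claim_unlocked_block(blocks):
    """Return the first block whose lock could be taken without waiting.

    The caller owns the returned block's lock and must release it.
    Returns None if every block is currently write-locked.
    """
    for block in blocks:
        # Non-blocking attempt: skip blocks that someone else holds.
        if block.lock.acquire(blocking=False):
            return block
    return None


blocks = [InodeTableBlock(i) for i in range(4)]
blocks[0].lock.acquire()  # pretend two other creators hold blocks 0 and 1
blocks[1].lock.acquire()

claimed = claim_unlocked_block(blocks)
print(claimed.index)  # 2
claimed.lock.release()
```

A creator that gets None back would have to fall back to blocking on some
block or trying another group; which is better is exactly the kind of
question the profiles would need to answer.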


-


* RE: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
From: Gross, Mark @ 2002-06-20 16:24 UTC (permalink / raw)
  To: 'Dave Hansen', Gross, Mark
  Cc: 'Russell Leighton', Andrew Morton, mgross,
	Linux Kernel Mailing List, lse-tech, Griffiths, Richard A

I don't have much visibility into this platform's journaling requirements.
I suspect it's to enable fast reboot / recovery from some klutz bumping the
power cord or a crash of some sort.

I will raise the issue with the platform folks.  However, for now I'm
looking for ways to make it scale competitively WRT adapters and spindles
for writes without changing the file system.  If this turns out to be a dead
end then, hopefully, we'll move to a more spindle-friendly file system.

The workload is http://www.coker.com.au/bonnie++/ (one of the newer versions
;)

--mgross

(W) 503-712-8218
MS: JF1-05
2111 N.E. 25th Ave.
Hillsboro, OR 97124


> -----Original Message-----
> From: Dave Hansen [mailto:haveblue@us.ibm.com]
> Sent: Thursday, June 20, 2002 9:10 AM
> To: Gross, Mark
> Cc: 'Russell Leighton'; Andrew Morton; mgross@unix-os.sc.intel.com;
> Linux Kernel Mailing List; lse-tech@lists.sourceforge.net; Griffiths,
> Richard A
> Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of
> spindles gets large
> 
> 
> Gross, Mark wrote:
> > We will get around to reformatting our spindles to some other FS after
> > we get as much data and analysis out of our current configuration as we
> > can get.
> >
> > We'll report out our findings on the lock contention, and throughput
> > data for some other FS then.  I'd like recommendations on what file
> > systems to try, besides ext2.
> 
> Do you really need a journaling FS?  If not, I think ext2 is a sure 
> bet to be the fastest.  If you do need journaling, try 
> reiserfs and jfs.
> 
> BTW, what kind of workload are you running under?
> 
> -- 
> Dave Hansen
> haveblue@us.ibm.com
> 


* RE: [Lse-tech] RE: ext3 performance bottleneck as the number of spindles gets large
From: Gross, Mark @ 2002-06-24 16:33 UTC (permalink / raw)
  To: 'Christopher E. Brown', Griffiths, Richard A
  Cc: 'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

We are running the tests with the following motherboard:
http://www.intel.com/design/servers/scb2/index.htm?iid=ipp_browse+motherbd_scb2&

It's a very nice box with 2 independent 64/66 PCI buses.
Capable of 2x503MB/sec, using your logic ;)

Regardless, the 640MB/s number was computed without considering the PCI bus
limitations, or the dual-port nature of the Adaptec SCSI-39160, whose base
rate is 160MB/sec.
http://www.adaptec.com/worldwide/product/proddetail.html?sess=no&prodkey=ASC-39160&cat=Products

Realistically, we are looking for a max throughput of about 320MB/sec with 4
adapters with enough drives attached.

--mgross

(W) 503-712-8218
MS: JF1-05
2111 N.E. 25th Ave.
Hillsboro, OR 97124


> -----Original Message-----
> From: Christopher E. Brown [mailto:cbrown@woods.net]
> Sent: Saturday, June 22, 2002 9:03 PM
> To: Griffiths, Richard A
> Cc: 'Andrew Morton'; mgross@unix-os.sc.intel.com; 'Jens Axboe'; Linux
> Kernel Mailing List; lse-tech@lists.sourceforge.net
> Subject: [Lse-tech] RE: ext3 performance bottleneck as the number of
> spindles gets large
> 
> 
> On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> 
> > I should have mentioned the throughput we saw on 4 adapters 6 drives was
> > 126KB/s.  The max theoretical bus bandwidth is 640MB/s.
> 
> 
> This is *NOT* correct.  Assuming a 64bit 66Mhz PCI bus your MAX is
> 503MB/sec minus PCI overhead...
> 
> This of course assumes nothing else is using the PCI bus.
> 
> 
> 120 something MB/sec sounds a hell of a lot like topping out a 32bit
> 33Mhz PCI bus, but IIRC the earlier posting listed 39160 cards, PCI
> 64bit w/ backward compat to 32bit.
> 
> You do have *ALL* of these cards plugged into a full PCI 64bit/66Mhz
> slot right?  Not plugging them into a 32bit/33Mhz slot?
> 
> 
> 32bit/33Mhz	(32 * 33,000,000) / (1024 * 1024 * 8) = 125.89 MByte/sec
> 64bit/33Mhz	(64 * 33,000,000) / (1024 * 1024 * 8) = 251.77 MByte/sec
> 64bit/66Mhz	(64 * 66,000,000) / (1024 * 1024 * 8) = 503.54 MByte/sec
> 
> 
> NOTE: PCI transfer rates are often listed as
> 
> 32bit/33Mhz, 132 MByte/sec
> 64bit/33Mhz, 264 MByte/sec
> 64bit/66Mhz, 528 MByte/sec
> 
> This is somewhat true, but only if we start with Mbit rates as used in
> transmission rates (1,000,000 bits/sec) and work from there, instead
> of 2^20 (1,048,576).  I will not argue about PCI 32bit/33Mhz being
> 1056Mbit, if talking about line rate, but when we are talking about
> storage media and transfers to/from as measured by files remember to
> convert.
> 
> -- 
> I route, therefore you are.
> 
> 
> 
> 
> 
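
The PCI figures quoted above can be reproduced with a few lines of Python.
This is just a sketch of the arithmetic in Christopher's message; the names
(pci_peak_bytes_per_sec, pci_table, BUSES) are made up for illustration, and
PCI protocol overhead is ignored, as it is in the quoted table.

```python
MIB = 1024 * 1024  # 2^20 bytes, the unit used when measuring file transfers
MB = 1_000_000     # 10^6 bytes, the unit used for line rates

# (label, bus width in bits, clock in Hz), as listed in the quoted message
BUSES = [
    ("32bit/33Mhz", 32, 33_000_000),
    ("64bit/33Mhz", 64, 33_000_000),
    ("64bit/66Mhz", 64, 66_000_000),
]


def pci_peak_bytes_per_sec(width_bits, clock_hz):
    """Peak transfer rate in bytes/sec: one bus-width transfer per clock."""
    return width_bits * clock_hz // 8


def pci_table():
    """Return (label, rate in 2^20-byte units, rate in 10^6-byte units)."""
    rows = []
    for label, width_bits, clock_hz in BUSES:
        peak = pci_peak_bytes_per_sec(width_bits, clock_hz)
        rows.append((label, round(peak / MIB, 2), peak // MB))
    return rows


for label, mib_rate, mb_rate in pci_table():
    print(f"{label}\t{mib_rate:.2f} (2^20 bytes)/sec\t{mb_rate} (10^6 bytes)/sec")
```

The two columns differ only in the divisor: 125.89 versus 132 for the
32bit/33Mhz bus is the same 132,000,000 bytes/sec expressed in 2^20-byte
and 10^6-byte units respectively.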


* Re: [Lse-tech] RE: ext3 performance bottleneck as the number of spindles gets large
From: Eric W. Biederman @ 2002-06-24 17:14 UTC (permalink / raw)
  To: Gross, Mark
  Cc: 'Christopher E. Brown', Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

"Gross, Mark" <mark.gross@intel.com> writes:

> We are running the tests with the following motherboard:
> http://www.intel.com/design/servers/scb2/index.htm?iid=ipp_browse+motherbd_scb2&
> 
> It's a very nice box with 2 independent 64/66 PCI buses.
> Capable of 2x503MB/sec, using your logic ;)
> 
> Regardless, the 640MB/s number was computed without considering the PCI bus
> limitations, or the dual-port nature of the Adaptec SCSI-39160, whose base
> rate is 160MB/sec.
> http://www.adaptec.com/worldwide/product/proddetail.html?sess=no&prodkey=ASC-39160&cat=Products
> 
> Realistically, we are looking for a max throughput of about 320MB/sec with 4
> adapters with enough drives attached.

Careful: even at 320MB/sec this requires 50% of your system's theoretical
memory bandwidth in DMA transfers.

Application level benchmarks like streams can only achieve memory copy numbers
on a PIII platform of about 320MB/sec.  A highly tuned mmx, or sse memory
copy can do better but it is a challenge.
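
For a rough feel of what an application-level copy number looks like on a
given box, something like the sketch below can be used.  It is not the
streams benchmark itself, only an assumption-laden stand-in: the function
name is hypothetical, the buffers are plain Python bytearrays, and the
result includes interpreter and cache effects, so treat it as an order of
magnitude at best.

```python
import time


def copy_bandwidth_mb_per_sec(size_bytes, repeats=5):
    """Time a buffer-to-buffer copy and return the best rate in 10^6 bytes/sec."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment copies src into dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a timer reading of zero on very small buffers.
    best = max(best, 1e-9)
    return size_bytes / best / 1_000_000


# Use a buffer much larger than the CPU caches so main memory is exercised.
print(f"{copy_bandwidth_mb_per_sec(16 * 1024 * 1024):.0f} MB/sec")
```

Taking the best of several runs reduces noise from scheduling, which matters
when the goal is to compare against a hardware limit.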

That close to the hardware limits, finding the actual bottleneck can
get very tricky.  At the very least I would run the system with
just one processor, and attempt to get the numbers that way.  I can
trivially see spinlock hold times staying high simply because
the memory is busy with a DMA transfer, and so cannot be used to
transfer the new contents of the lock.

Eric
