* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-20 0:54 ` Andrew Morton
@ 2002-06-20 4:09 ` Dave Hansen
2002-06-20 6:03 ` Andreas Dilger
0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2002-06-20 4:09 UTC (permalink / raw)
To: Andrew Morton
Cc: mgross, Linux Kernel Mailing List, lse-tech, richard.a.griffiths
Andrew Morton wrote:
> mgross wrote:
>>Has anyone done any work looking into the I/O scaling of Linux / ext3 per
>>spindle or per adapter? We would like to compare notes.
>
> No. ext3 scalability is very poor, I'm afraid. The fs really wasn't
> up and running until kernel 2.4.5 and we just didn't have time to
> address that issue.
Ick. That takes the prize for the highest BKL contention I've ever
seen, except for some horribly contrived torture tests of mine. I've
had data like this sent to me a few times to analyze and the only
thing I've been able to suggest up to this point is not to use ext3.
>>I've only just started to look at the ext3 code, but it seems to me that replacing the
>>BKL with a per-ext3-filesystem lock could remove some of the contention that's
>>being measured. What data are the BKL protecting in these ext3 functions? Could a
>>lock-per-FS approach work?
>
> The vague plan there is to replace lock_kernel with lock_journal
> where appropriate. But ext3 scalability work of this nature
> will be targeted at the 2.5 kernel, most probably.
I really doubt that dropping in lock_journal will help this case very
much. Every single kernel_flag entry in the lockmeter output where
Util > 0.00% is caused by ext3. The schedule entry is probably caused
by something in ext3 grabbing BKL, getting scheduled out for some
reason, then having it implicitly released in schedule(). The
schedule() contention comes from the reacquire_kernel_lock().
We used to see plenty of ext2 BKL contention, but Al Viro did a good
job fixing that early in 2.5 using a per-inode rwlock. I think that
this is the required level of lock granularity; another global lock
just won't cut it.
http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
--
Dave Hansen
haveblue@us.ibm.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
From: Andreas Dilger @ 2002-06-20 6:03 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, mgross, Linux Kernel Mailing List, lse-tech,
richard.a.griffiths
On Jun 19, 2002 21:09 -0700, Dave Hansen wrote:
> Andrew Morton wrote:
> >The vague plan there is to replace lock_kernel with lock_journal
> >where appropriate. But ext3 scalability work of this nature
> >will be targeted at the 2.5 kernel, most probably.
>
> I really doubt that dropping in lock_journal will help this case very
> much. Every single kernel_flag entry in the lockmeter output where
> Util > 0.00% is caused by ext3. The schedule entry is probably caused
> by something in ext3 grabbing BKL, getting scheduled out for some
> reason, then having it implicitly released in schedule(). The
> schedule() contention comes from the reacquire_kernel_lock().
>
> We used to see plenty of ext2 BKL contention, but Al Viro did a good
> job fixing that early in 2.5 using a per-inode rwlock. I think that
> this is the required level of lock granularity, another global lock
> just won't cut it.
> http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
There are a variety of different efforts that could be made towards
removing the BKL from ext2 and ext3. The first, of course, would be
to have a per-filesystem lock instead of taking the BKL (I don't know
if Al has changed lock_super() in 2.5 to be a real semaphore or not).
As Andrew mentioned, there would also need to be a per-journal lock to
ensure coherency of the journal data. Currently the per-filesystem and
per-journal lock would be equivalent, but when a single journal device
can be shared among multiple filesystems they would be different locks.
I will leave it up to Andrew and Stephen to discuss locking scalability
within the journal layer.
Within the filesystem a large number of increasingly fine-grained
locks could be added - a superblock-only lock with per-group locks, or
even per-bitmap and per-inode-table(-block) locks if needed. This would
allow multi-threaded inode and block allocations, but a sane lock
ranking strategy would have to be developed. The bitmap locks would
only need to be 2-state locks, because you only look at the bitmaps
when you want to modify them. The inode table locks would be read/write
locks.
If there is a try-writelock mechanism for the individual inode table
blocks you can avoid write lock contention for creations by simply
finding the first un-write-locked block in the target group's inode table
(usually in the hundreds of blocks per group for default parameters).
For inode allocation you don't really care which inode you get, as long
as you get one in the preferred group (even that isn't critical for
directory creation). For inode deletions you will get essentially random
block locking, which is actually improved by the find-first-unlocked
allocation policy (at the expense of dirtying more inode table blocks).
Contention for the superblock lock for updates to the superblock free
block and free inode counts could be mitigated by keeping "per-group
delta buckets" in memory, that are written into the superblock only
once every few seconds or at statfs time instead of needing multiple
locks for each block/inode alloc/free. The groups already keep their
own summary counts for free blocks and inodes. The coherency of these
fields with the superblock on recovery would be handled at journal
recovery time (either in the kernel or e2fsck*). Other than these two
fields there are few write updates to the superblock (on ext3 there
is also the orphan list, modified at truncate and when an open file is
unlinked and when such a file is closed).
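The "per-group delta buckets" idea can be sketched as follows; the names
here are invented for illustration and do not correspond to real ext3
fields. Hot paths touch only their own group's counter under a per-group
lock, and the superblock count is folded up lazily:

```c
#include <pthread.h>

#define NR_GROUPS 4

static struct {
	pthread_mutex_t lock;
	long delta;            /* pending free-block change for this group */
} gd[NR_GROUPS];

static long sb_free_blocks;    /* the superblock's (possibly stale) count */

void deltas_init(long initial_free)
{
	sb_free_blocks = initial_free;
	for (int g = 0; g < NR_GROUPS; g++)
		pthread_mutex_init(&gd[g].lock, NULL);
}

void block_freed(int g)
{
	pthread_mutex_lock(&gd[g].lock);
	gd[g].delta++;
	pthread_mutex_unlock(&gd[g].lock);
}

void block_alloced(int g)
{
	pthread_mutex_lock(&gd[g].lock);
	gd[g].delta--;
	pthread_mutex_unlock(&gd[g].lock);
}

/* The periodic/statfs-time fold: the only place sb_free_blocks is
 * written, so allocation and freeing never contend on it. */
long flush_deltas(void)
{
	for (int g = 0; g < NR_GROUPS; g++) {
		pthread_mutex_lock(&gd[g].lock);
		sb_free_blocks += gd[g].delta;
		gd[g].delta = 0;
		pthread_mutex_unlock(&gd[g].lock);
	}
	return sb_free_blocks;
}
```

Allocations in different groups never take the same lock, which is the
whole point of the scheme.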
I have even been thinking about multi-threaded directory-entry creation
in a single directory. One nice thing about ext2/ext3 directory blocks
is that each one is self-contained and can be modified independently.
For regular ext2/ext3 directories you would only be able to do
multi-threaded deletes by having a lock for each directory block.
For creations you would need to lock the entire directory to ensure
exclusive access for a create, which is the same single-threaded behaviour
for a single directory we have today with the directory i_sem.
However, if you are using the htree indexed directory layout (which you
will be, if you care about scalable filesystem performance) then there
is only a single[**] block into which a given filename can be added, so
you can have per-block locks even for file creation. As the number of
directory entries grows (and hence the number of directory blocks) the
locking becomes increasingly fine-grained, so you get better scalability
with larger directories, which is what you want.
Cheers, Andreas
[*] If we think that we will go to any kind of per-group locking in the
near future, the support for this could be added into e2fsck and
existing kernels today with read support for a COMPAT flag to
ensure maximal forwards compatibility. On e2fsck runs we already
validate the superblock on each boot, and the group descriptor table
is contiguous with the superblock, so the amount of extra checking
at boot time would be very minimal.
The kernel already has ext[23]_count_free_{blocks,inodes} functions
that just need a bit of tweaking to check only the descriptor
summaries unless mounted with debug and check options, and to update
the superblock counts at mount time if the COMPAT flag is set.
[**] In rare circumstances you may have a large number of hash collisions
for a single hash value which fill more than one block, so an entry
with that hash value could live in more than a single block. This
would need to be handled somehow (e.g. always getting the locks on
all such blocks in order at create time; you only need a single
block lock at delete time).
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
From: Andrew Morton @ 2002-06-20 6:53 UTC (permalink / raw)
To: Andreas Dilger
Cc: Dave Hansen, mgross, Linux Kernel Mailing List, lse-tech,
richard.a.griffiths
Andreas Dilger wrote:
>
> On Jun 19, 2002 21:09 -0700, Dave Hansen wrote:
> > Andrew Morton wrote:
> > >The vague plan there is to replace lock_kernel with lock_journal
> > >where appropriate. But ext3 scalability work of this nature
> > >will be targeted at the 2.5 kernel, most probably.
> >
> > I really doubt that dropping in lock_journal will help this case very
> > much. Every single kernel_flag entry in the lockmeter output where
> > Util > 0.00% is caused by ext3. The schedule entry is probably caused
> > by something in ext3 grabbing BKL, getting scheduled out for some
> > reason, then having it implicitly released in schedule(). The
> > schedule() contention comes from the reacquire_kernel_lock().
> >
> > We used to see plenty of ext2 BKL contention, but Al Viro did a good
> > job fixing that early in 2.5 using a per-inode rwlock. I think that
> > this is the required level of lock granularity, another global lock
> > just won't cut it.
> > http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
>
> There are a variety of different efforts that could be made towards
> removing the BKL from ext2 and ext3. The first, of course, would be
> to have a per-filesystem lock instead of taking the BKL (I don't know
> if Al has changed lock_super() in 2.5 to be a real semaphore or not).
lock_super() has been `down()' for a long time. In 2.4, too.
> As Andrew mentioned, there would also need to be a per-journal lock to
> ensure coherency of the journal data. Currently the per-filesystem and
> per-journal lock would be equivalent, but when a single journal device
> can be shared among multiple filesystems they would be different locks.
Well. First I want to know if block-highmem is in there. If not,
then yep, we'll spend ages spinning on the BKL. Because ext3 _is_
BKL-happy, and if a CPU takes a disk interrupt while holding the BKL
and then sits there in interrupt context copying tons of cache-cold
memory around, guess what the other CPUs will be doing?
> I will leave it up to Andrew and Stephen to discuss locking scalability
> within the journal layer.
ext3 is about 700x as complex as ext2. It will need to be done with
some care.
> Within the filesystem there can be a large number of increasingly fine
> locks added - a superblock-only lock with per-group locks, or even
> per-bitmap and per-inode-table(-block) locks if needed. This would
> allow multi-threaded inode and block allocations, but a sane lock
> ranking strategy would have to be developed. The bitmap locks would
> only need to be 2-state locks, because you only look at the bitmaps
> when you want to modify them. The inode table locks would be read/write
> locks.
The next steps for ext2 are: stare at Anton's next set of graphs and
then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer
LRUs to avoid blockdev mapping lock contention, per-blockgroup locks
and removal of lock_super from the block allocator.
But there's no point in doing that while zone->lock and pagemap_lru_lock
are top of the list. Fixes for both of those are in progress.
ext2 is bog-simple. It will scale up the wazoo in 2.6.
> If there is a try-writelock mechanism for the individual inode table
> blocks you can avoid write lock contention for creations by simply
> finding the first un-write-locked block in the target group's inode table
> (usually in the hundreds of blocks per group for default parameters).
Depends on what the profiles say, Andreas. And I mean profiles - lockmeter
tends to tell you "what", not "why". Start at the top of the list. Fix
them by design if possible. If not, tweak it!
-
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
From: Dave Hansen @ 2002-06-20 16:10 UTC (permalink / raw)
To: Gross, Mark
Cc: 'Russell Leighton', Andrew Morton, mgross,
Linux Kernel Mailing List, lse-tech, Griffiths, Richard A
Gross, Mark wrote:
> We will get around to reformatting our spindles to some other FS after
> we get as much data and analysis out of our current configuration as we
> can get.
>
> We'll report out our findings on the lock contention, and throughput
> data for some other FS then. I'd like recommendations on what file
> systems to try, besides ext2.
Do you really need a journaling FS? If not, I think ext2 is a sure
bet to be the fastest. If you do need journaling, try reiserfs and jfs.
BTW, what kind of workload are you running under?
--
Dave Hansen
haveblue@us.ibm.com
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
From: John Hawkes @ 2002-06-20 20:47 UTC (permalink / raw)
To: Dave Hansen, Gross, Mark
Cc: 'Russell Leighton', Andrew Morton, mgross,
Linux Kernel Mailing List, lse-tech, Griffiths, Richard A
From: "Dave Hansen" <haveblue@us.ibm.com>
> > We'll report out our findings on the lock contention, and throughput
> > data for some other FS then. I'd like recommendations on what file
> > systems to try, besides ext2.
>
> Do you really need a journaling FS? If not, I think ext2 is a sure
> bet to be the fastest. If you do need journaling, try reiserfs and
jfs.
XFS in 2.4.x scales much better on larger CPU counts than do ext3 or
ReiserFS. That's because XFS is a much lighter user of the BKL in 2.4.x
than ext3, ReiserFS, or ext2.
John Hawkes
hawkes@sgi.com
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
From: Andrew Morton @ 2002-06-20 21:11 UTC (permalink / raw)
To: Gross, Mark
Cc: 'Dave Hansen', 'Russell Leighton', mgross,
Linux Kernel Mailing List, lse-tech, Griffiths, Richard A
"Gross, Mark" wrote:
>
> ...
> The workload is http://www.coker.com.au/bonnie++/ (one of the newer versions
> ;)
>
Please tell me exactly how you're using it: how many filesystems, how
many controllers, disk topology, physical memory, size of filesystems,
etc. Sufficient for me to be able to reproduce it and find out what
is happening.
Also: what is your best-case aggregate bandwidth? Platter-speed of disks
multiplied by number of disks, please?
Thanks to the BKL you've effectively got 1.3 to 1.5 CPUs, but we should be
able to saturate six or eight disks on a uniprocessor kernel. It's
possible that we're looking at the wrong thing.
-
* RE: ext3 performance bottleneck as the number of spindles gets large
From: Griffiths, Richard A @ 2002-06-20 21:50 UTC (permalink / raw)
To: 'Andrew Morton', mgross
Cc: Griffiths, Richard A, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
I should have mentioned the throughput we saw on 4 adapters, 6 drives was
126KB/s. The max theoretical bus bandwidth is 640MB/s.
-----Original Message-----
From: Andrew Morton [mailto:akpm@zip.com.au]
Sent: Thursday, June 20, 2002 2:26 PM
To: mgross@unix-os.sc.intel.com
Cc: Griffiths, Richard A; 'Jens Axboe'; Linux Kernel Mailing List;
lse-tech@lists.sourceforge.net
Subject: Re: ext3 performance bottleneck as the number of spindles gets
large
mgross wrote:
>
> On Thursday 20 June 2002 04:18 pm, Andrew Morton wrote:
> > Yup. I take it back - high ext3 lock contention happens on 2.5
> > as well, which has block-highmem. With heavy write traffic onto
> > six disks, two controllers, six filesystems, four CPUs the machine
> > spends about 40% of the time spinning on locks in fs/ext3/inode.c
> > You're on dual CPU, so the contention is less.
> >
> > Not very nice. But given that the longest spin time was some
> > tens of milliseconds, with the average much lower, it shouldn't
> > affect overall I/O throughput.
>
> How could losing 40% of your CPUs to spin locks NOT spank your
> throughput?
The limiting factor is usually disk bandwidth, seek latency, rotational
latency. That's why I want to know your bandwidth.
> Can you copy your lockmeter data from its kernel_flag section? I'd like to
> see it.
I don't find lockmeter very useful. Here's oprofile output for 2.5.23:
c013ec08 873 1.07487 rmqueue
c018a8e4 950 1.16968 do_get_write_access
c013b00c 969 1.19307 kmem_cache_alloc_batch
c018165c 1120 1.37899 ext3_writepage
c0193120 1457 1.79392 journal_add_journal_head
c0180e30 1458 1.79515 ext3_prepare_write
c0136948 6546 8.05969 generic_file_write
c01838ac 42608 52.4606 .text.lock.inode
So I lost two CPUs on the BKL in fs/ext3/inode.c. The remaining
two should be enough to saturate all but the most heroic disk
subsystems.
A couple of possibilities come to mind:
1: Processes which should be submitting I/O against disk "A" are
instead spending tons of time asleep in the page allocator waiting
for I/O to complete against disk "B".
2: ext3 is just too slow for the rate of data which you're trying to
push at it. This exhibits as lock contention, but the root cause
is the cost of things like ext3_mark_inode_dirty(). And *that*
is something we can fix - we can shave 75% off the cost of that.
Need more data...
> >
> > Possibly something else is happening. Have you tested ext2?
>
> No. We're attempting to see if we can scale to large numbers of spindles
> with EXT3 at the moment. Perhaps we can effect positive changes to ext3
> before giving up on it and moving to another Journaled FS.
Have you tried *any* other fs?
-
* Re: ext3 performance bottleneck as the number of spindles gets large
From: Andrew Morton @ 2002-06-21 7:58 UTC (permalink / raw)
To: Griffiths, Richard A
Cc: mgross, 'Jens Axboe', Linux Kernel Mailing List, lse-tech
"Griffiths, Richard A" wrote:
>
> I should have mentioned the throughput we saw on 4 adapters 6 drives was
> 126KB/s. The max theoretical bus bandwidth is 640MB/s.
I hope that was 128MB/s?
Please try the below patch (against 2.4.19-pre10). It halves the lock
contention, and it does that by making the fs twice as efficient, so
that's a bonus.
I wouldn't be surprised if it made no difference. I'm not seeing
much difference between ext2 and ext3 here.
If you have time, please test ext2 and/or reiserfs and/or ext3
in writeback mode.
And please tell us some more details regarding the performance bottleneck.
I assume that you mean that the IO rate per disk slows as more
disks are added to an adapter? Or does the total throughput through
the adapter fall as more disks are added?
Thanks.
--- 2.4.19-pre10/fs/ext3/inode.c~ext3-speedup-1 Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/fs/ext3/inode.c Fri Jun 21 00:28:59 2002
@@ -1016,21 +1016,20 @@ static int ext3_prepare_write(struct fil
int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
handle_t *handle;
- lock_kernel();
handle = ext3_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
}
- unlock_kernel();
ret = block_prepare_write(page, from, to, ext3_get_block);
- lock_kernel();
if (ret != 0)
goto prepare_write_failed;
if (ext3_should_journal_data(inode)) {
+ lock_kernel();
ret = walk_page_buffers(handle, page->buffers,
from, to, NULL, do_journal_get_write_access);
+ unlock_kernel();
if (ret) {
/*
* We're going to fail this prepare_write(),
@@ -1043,10 +1042,12 @@ static int ext3_prepare_write(struct fil
}
}
prepare_write_failed:
- if (ret)
+ if (ret) {
+ lock_kernel();
ext3_journal_stop(handle, inode);
+ unlock_kernel();
+ }
out:
- unlock_kernel();
return ret;
}
@@ -1094,7 +1095,6 @@ static int ext3_commit_write(struct file
struct inode *inode = page->mapping->host;
int ret = 0, ret2;
- lock_kernel();
if (ext3_should_journal_data(inode)) {
/*
* Here we duplicate the generic_commit_write() functionality
@@ -1102,22 +1102,43 @@ static int ext3_commit_write(struct file
int partial = 0;
loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ lock_kernel();
ret = walk_page_buffers(handle, page->buffers,
from, to, &partial, commit_write_fn);
+ unlock_kernel();
if (!partial)
SetPageUptodate(page);
kunmap(page);
if (pos > inode->i_size)
inode->i_size = pos;
EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
+ if (inode->i_size > inode->u.ext3_i.i_disksize) {
+ inode->u.ext3_i.i_disksize = inode->i_size;
+ lock_kernel();
+ ret2 = ext3_mark_inode_dirty(handle, inode);
+ unlock_kernel();
+ if (!ret)
+ ret = ret2;
+ }
} else {
if (ext3_should_order_data(inode)) {
+ lock_kernel();
ret = walk_page_buffers(handle, page->buffers,
from, to, NULL, journal_dirty_sync_data);
+ unlock_kernel();
}
/* Be careful here if generic_commit_write becomes a
* required invocation after block_prepare_write. */
if (ret == 0) {
+ /*
+ * generic_commit_write() will run mark_inode_dirty()
+ * if i_size changes. So let's piggyback the
+ * i_disksize mark_inode_dirty into that.
+ */
+ loff_t new_i_size =
+ ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ if (new_i_size > EXT3_I(inode)->i_disksize)
+ EXT3_I(inode)->i_disksize = new_i_size;
ret = generic_commit_write(file, page, from, to);
} else {
/*
@@ -1129,12 +1150,7 @@ static int ext3_commit_write(struct file
kunmap(page);
}
}
- if (inode->i_size > inode->u.ext3_i.i_disksize) {
- inode->u.ext3_i.i_disksize = inode->i_size;
- ret2 = ext3_mark_inode_dirty(handle, inode);
- if (!ret)
- ret = ret2;
- }
+ lock_kernel();
ret2 = ext3_journal_stop(handle, inode);
unlock_kernel();
if (!ret)
@@ -2165,9 +2181,11 @@ bad_inode:
/*
* Post the struct inode info into an on-disk inode location in the
* buffer-cache. This gobbles the caller's reference to the
- * buffer_head in the inode location struct.
+ * buffer_head in the inode location struct.
+ *
+ * On entry, the caller *must* have journal write access to the inode's
+ * backing block, at iloc->bh.
*/
-
static int ext3_do_update_inode(handle_t *handle,
struct inode *inode,
struct ext3_iloc *iloc)
@@ -2176,12 +2194,6 @@ static int ext3_do_update_inode(handle_t
struct buffer_head *bh = iloc->bh;
int err = 0, rc, block;
- if (handle) {
- BUFFER_TRACE(bh, "get_write_access");
- err = ext3_journal_get_write_access(handle, bh);
- if (err)
- goto out_brelse;
- }
raw_inode->i_mode = cpu_to_le16(inode->i_mode);
if(!(test_opt(inode->i_sb, NO_UID32))) {
raw_inode->i_uid_low = cpu_to_le16(low_16_bits(inode->i_uid));
--- 2.4.19-pre10/mm/filemap.c~ext3-speedup-1 Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/mm/filemap.c Fri Jun 21 00:28:59 2002
@@ -2924,6 +2924,7 @@ generic_file_write(struct file *file,con
long status = 0;
int err;
unsigned bytes;
+ time_t time_now;
if ((ssize_t) count < 0)
return -EINVAL;
@@ -3026,8 +3027,12 @@ generic_file_write(struct file *file,con
goto out;
remove_suid(inode);
- inode->i_ctime = inode->i_mtime = CURRENT_TIME;
- mark_inode_dirty_sync(inode);
+ time_now = CURRENT_TIME;
+ if (inode->i_ctime != time_now || inode->i_mtime != time_now) {
+ inode->i_ctime = time_now;
+ inode->i_mtime = time_now;
+ mark_inode_dirty_sync(inode);
+ }
if (file->f_flags & O_DIRECT)
goto o_direct;
--- 2.4.19-pre10/fs/jbd/transaction.c~ext3-speedup-1 Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/fs/jbd/transaction.c Fri Jun 21 00:28:59 2002
@@ -237,7 +237,9 @@ handle_t *journal_start(journal_t *journ
handle->h_ref = 1;
current->journal_info = handle;
+ lock_kernel();
err = start_this_handle(journal, handle);
+ unlock_kernel();
if (err < 0) {
kfree(handle);
current->journal_info = NULL;
@@ -1388,8 +1390,10 @@ int journal_stop(handle_t *handle)
transaction->t_outstanding_credits -= handle->h_buffer_credits;
transaction->t_updates--;
if (!transaction->t_updates) {
- wake_up(&journal->j_wait_updates);
- if (journal->j_barrier_count)
+ if (waitqueue_active(&journal->j_wait_updates))
+ wake_up(&journal->j_wait_updates);
+ if (journal->j_barrier_count &&
+ waitqueue_active(&journal->j_wait_transaction_locked))
wake_up(&journal->j_wait_transaction_locked);
}
-
* Re: ext3 performance bottleneck as the number of spindles gets large
From: mgross @ 2002-06-21 18:46 UTC (permalink / raw)
To: Andrew Morton
Cc: Griffiths, Richard A, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
Andrew Morton wrote:
>"Griffiths, Richard A" wrote:
>
>>I should have mentioned the throughput we saw on 4 adapters 6 drives was
>>126KB/s. The max theoretical bus bandwidth is 640MB/s.
>>
>
>I hope that was 128MB/s?
>
Yes, that was MB/s; the data was taken in KB and a set of 3 zeros was missing.
>
>
>Please try the below patch (against 2.4.19-pre10). It halves the lock
>contention, and it does that by making the fs twice as efficient, so
>that's a bonus.
>
We'll give it a try. I'm on travel right now, so it may be a few days if
Richard doesn't get to it before I get back.
>
>
>I wouldn't be surprised if it made no difference. I'm not seeing
>much difference between ext2 and ext3 here.
>
>If you have time, please test ext2 and/or reiserfs and/or ext3
>in writeback mode.
>
Soon after we finish beating the ext3 file system up I'll take a swing
at some other file systems.
>
>And please tell us some more details regarding the performance bottleneck.
>I assume that you mean that the IO rate per disk slows as more
>disks are added to an adapter? Or does the total throughput through
>the adapter fall as more disks are added?
>
No, the IO block write throughput for the system goes down as drives are
added under this work load. We measure the system throughput not the
per drive throughput, but one could infer the per drive throughput by
dividing.
Running bonnie++ with 300MB files doing 8KB sequential writes we get
the following system-wide throughput as a function of the number of
drives attached and the number of adapters.
One adapter
1 drive per adapter 127,702 KB/Sec
2 drives per adapter 93,283 KB/Sec
6 drives per adapter 85,626 KB/Sec
2 adapters
1 drive per adapter 92,095 KB/Sec
2 drives per adapter 110,956 KB/Sec
6 drives per adapter 106,883 KB/Sec
4 adapters
1 drive per adapter 121,125 KB/Sec
2 drives per adapter 117,575 KB/Sec
6 drives per adapter 116,570 KB/Sec
Not too pretty.
--mgross
* Re: ext3 performance bottleneck as the number of spindles gets large
From: Chris Mason @ 2002-06-21 19:26 UTC (permalink / raw)
To: mgross
Cc: Andrew Morton, Griffiths, Richard A, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
On Fri, 2002-06-21 at 14:46, mgross wrote:
> Andrew Morton wrote:
> >
> >Please try the below patch (againt 2.4.19-pre10). It halves the lock
> >contention, and it does that by making the fs twice as efficient, so
> >that's a bonus.
> >
> We'll give it a try. I'm on travel right now, so it may be a few days if
> Richard doesn't get to it before I get back.
You might want to try this too, Andrew fixed UPDATE_ATIME() to only call
the dirty_inode method once per second, but generic_file_write should do
the same. It reduces BKL contention by cutting down on ext3 and
reiserfs dirty_inode calls, which are much more expensive than simply
marking the inode dirty.
-chris
--- linux/mm/filemap.c Mon, 28 Jan 2002 09:51:50 -0500
+++ linux/mm/filemap.c Sun, 12 May 2002 16:16:59 -0400
@@ -2826,6 +2826,14 @@
}
}
+static void update_inode_times(struct inode *inode)
+{
+ time_t now = CURRENT_TIME;
+ if (inode->i_ctime != now || inode->i_mtime != now) {
+ inode->i_ctime = inode->i_mtime = now;
+ mark_inode_dirty_sync(inode);
+ }
+}
/*
* Write to a file through the page cache.
*
@@ -2955,8 +2963,7 @@
goto out;
remove_suid(inode);
- inode->i_ctime = inode->i_mtime = CURRENT_TIME;
- mark_inode_dirty_sync(inode);
+ update_inode_times(inode);
if (file->f_flags & O_DIRECT)
goto o_direct;
* Re: ext3 performance bottleneck as the number of spindles gets large
From: Andrew Morton @ 2002-06-21 19:56 UTC (permalink / raw)
To: mgross
Cc: Griffiths, Richard A, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
mgross wrote:
>
> ...
> >And please tell us some more details regarding the performance bottleneck.
> >I assume that you mean that the IO rate per disk slows as more
> >disks are added to an adapter? Or does the total throughput through
> >the adapter fall as more disks are added?
> >
> No, the IO block write throughput for the system goes down as drives are
> added under this work load. We measure the system throughput not the
> per drive throughput, but one could infer the per drive throughput by
> dividing.
>
> Running bonnie++ with 300MB files doing 8KB sequential writes we get
> the following system-wide throughput as a function of the number of
> drives attached and the number of adapters.
>
> One adapter
> 1 drive per adapter 127,702 KB/Sec
> 2 drives per adapter 93,283 KB/Sec
> 6 drives per adapter 85,626 KB/Sec
127 megabytes/sec to a single disk? Either that's a very
fast disk, or you're using very small bytes :)
> 2 adapters
> 1 drive per adapter 92,095 KB/Sec
> 2 drives per adapter 110,956 KB/Sec
> 6 drives per adapter 106,883 KB/Sec
>
> 4 adapters
> 1 drive per adapter 121,125 KB/Sec
> 2 drives per adapter 117,575 KB/Sec
> 6 drives per adapter 116,570 KB/Sec
>
Possibly what is happening here is that a significant amount
of dirty data is being left in memory and is escaping the
measurement period. When you run the test against more disks,
the *total* amount of dirty memory is increased, so the kernel
is forced to perform more writeback within the measurement period.
So with two filesystems, you're actually performing more I/O.
You need to either ensure that all I/O is occurring *within the
measurement interval*, or make the test write so much data (wrt
main memory size) that any leftover unwritten stuff is insignificant.
bonnie++ is too complex for this work. Suggest you use
http://www.zip.com.au/~akpm/linux/write-and-fsync.c
which will just write and fsync a file. Time how long that
takes. Or you could experiment with bonnie++'s fsync option.
My suggestion is to work with this workload:
for i in /mnt/1 /mnt/2 /mnt/3 /mnt/4 ...
do
        write-and-fsync $i/foo 4000 &
done
which will write a 4 gig file to each disk. This will defeat
any caching effects and is just a way simpler workload, which
will allow you to test one thing in isolation.
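The write-and-fsync tool itself isn't shown here, but the same measurement discipline, forcing every byte to disk inside the timed region, can be sketched with dd's conv=fsync flag. This is an assumption-laden stand-in: the /tmp file names and the small 4MB size are placeholders, whereas the recipe above writes a 4GB file per mounted disk.

```shell
# Sketch of the suggested workload: write a file to each target and
# fsync it, so all the I/O lands inside the measurement interval.
# dd's conv=fsync stands in for write-and-fsync; targets and sizes
# below are illustrative only (the real test would use $i/foo at 4000MB).
write_and_fsync() {
    # $1 = output file, $2 = size in MB
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=fsync 2>/dev/null
}

for f in /tmp/wf1 /tmp/wf2        # would be one file per mounted disk
do
    write_and_fsync "$f" 4 &      # one writer per disk, in parallel
done
wait                              # time this whole section
```

Timing the loop plus the `wait` (e.g. under `time sh script.sh`) gives a number that cannot be flattered by dirty data left in memory.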
So anyway. All this possibly explains the "negative scalability"
in the single-adapter case. For four adapters with one disk on
each, 120 megs/sec seems reasonable, assuming the sustained
write bandwidth of a single disk is 30 megs/sec.
For four adapters, six disks on each you should be doing better.
Something does appear to be wrong there.
-
^ permalink raw reply [flat|nested] 27+ messages in thread
* [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
@ 2002-06-21 22:03 Duc Vianney
2002-06-21 23:11 ` Andrew Morton
2002-06-22 0:19 ` kwijibo
0 siblings, 2 replies; 27+ messages in thread
From: Duc Vianney @ 2002-06-21 22:03 UTC (permalink / raw)
To: Andrew Morton, mgross, Griffiths, Richard A, Jens Axboe,
Linux Kernel Mailing List, lse-tech
Andrew Morton wrote:
>If you have time, please test ext2 and/or reiserfs and/or ext3
>in writeback mode.
I ran IOzone on ext2fs, ext3fs, JFS, and Reiserfs on a 4-way SMP,
500MHz, 2.5GB RAM, with two 9.1GB SCSI drives. The test partition is
1GB, the test file size is 128MB, the test block size is 4KB, and the
number of IO threads varies from 1 to 6. Compared with the other file
systems in this test environment, the results on a 2.5.19 SMP kernel
show that ext3fs has a performance problem with writes, and in
particular with random writes. I think the BKL contention patch would
help ext3fs, but I need to verify that first.
The following data are throughput in KB/sec obtained from the IOzone
benchmark running on all file systems installed with default options.
Kernels 2519smp4 2519smp4 2519smp4 2519smp4
No of threads=1 ext2-1t jfs-1t ext3-1t reiserfs-1t
Initial write 138010 111023 29808 48170
Rewrite 205736 204538 119543 142765
Read 236500 237235 231860 236959
Re-read 242927 243577 240284 242776
Random read 204292 206010 201664 207219
Random write 180144 180461 1090 121676
No of threads=2 ext2-2t jfs-2t ext3-2t reiserfs-2t
Initial write 196477 143395 62248 55260
Rewrite 261641 261441 126604 205076
Read 292566 292796 313562 291434
Re-read 302239 306423 341416 303424
Random read 296152 295430 316966 288584
Random write 253026 251013 958 203358
No of threads=4 ext2-4t jfs-4t ext3-4t reiserfs-4t
Initial write 79513 172302 42051 48782
Rewrite 256568 269840 124912 231395
Read 290599 303669 327066 283793
Re-read 289578 303644 327362 287531
Random read 354011 353455 353806 351671
Random write 279704 279922 2482 250498
No of threads=6 ext2-6t jfs-6t ext3-6t reiserfs-6t
Initial write 98559 69825 59728 15576
Rewrite 274993 286987 126048 232193
Read 330522 326143 332147 326163
Re-read 339672 328890 333094 326725
Random read 348059 346154 347901 344927
Random write 281613 280213 3659 227579
Cheers,
Duc J Vianney, dvianney@us.ibm.com
home page: http://www-124.ibm.com/developerworks/opensource/linuxperf/
project page: http://www-124.ibm.com/developerworks/projects/linuxperf
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-21 22:03 Duc Vianney
@ 2002-06-21 23:11 ` Andrew Morton
2002-06-22 0:19 ` kwijibo
1 sibling, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2002-06-21 23:11 UTC (permalink / raw)
To: Duc Vianney
Cc: mgross, Griffiths, Richard A, Jens Axboe,
Linux Kernel Mailing List, lse-tech
Duc Vianney wrote:
>
> Andrew Morton wrote:
> >If you have time, please test ext2 and/or reiserfs and/or ext3
> >in writeback mode.
> I ran IOzone on ext2fs, ext3fs, JFS, and Reiserfs on a 4-way SMP,
> 500MHz, 2.5GB RAM, with two 9.1GB SCSI drives. The test partition is
> 1GB, the test file size is 128MB, the test block size is 4KB, and the
> number of IO threads varies from 1 to 6. Compared with the other file
> systems in this test environment, the results on a 2.5.19 SMP kernel
> show that ext3fs has a performance problem with writes, and in
> particular with random writes. I think the BKL contention patch would
> help ext3fs, but I need to verify that first.
>
> The following data are throughput in KB/sec obtained from the IOzone
> benchmark running on all file systems installed with default options.
>
> Kernels 2519smp4 2519smp4 2519smp4 2519smp4
> No of threads=1 ext2-1t jfs-1t ext3-1t reiserfs-1t
>
> Initial write 138010 111023 29808 48170
> Rewrite 205736 204538 119543 142765
> Read 236500 237235 231860 236959
> Re-read 242927 243577 240284 242776
> Random read 204292 206010 201664 207219
> Random write 180144 180461 1090 121676
ext3 only allows dirty data to remain in memory for five seconds,
whereas the other filesystems allow it for thirty. This is
a reasonable thing to do, but it hurts badly in benchmarks.
If you run a benchmark which takes ext2 ten seconds to
complete, ext2 will do it all in-RAM. But after five
seconds, ext3 will go to disk and the test takes vastly longer.
I suspect that is what is happening here - we're seeing the
difference between disk bandwidth and memory bandwidth.
If you choose a larger file, a shorter test or a longer-running
one, then the difference will not be so gross.
You can confirm this by trying a one-gigabyte file instead.
The "Initial write" is fishy. I wonder if the same thing
is happening here - there may have been lots of dirty memory
left in-core (and unaccounted for) after the test completed.
iozone has a `-e' option which causes it to include the fsync()
time in the timing calculations. Using that would give a
better comparison, unless you are specifically trying to test
in-memory performance. And we're not doing that here.
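Both remedies can be sketched concretely. This is a hedged example, assuming ext3's commit= mount option and iozone's -e flag behave as described above; the device-free remount form and the /mnt/test mount point are placeholders for the real test setup.

```shell
# Option 1: make ext3's flush behaviour comparable to the others by
# lengthening its journal commit interval (seconds; the default is 5,
# the other filesystems effectively get ~30).
mount -o remount,commit=30 /mnt/test

# Option 2: charge every filesystem for its sync time instead.
# -e includes fsync/fflush in the timing; -s and -r set file and
# record size; -i 0 -i 2 select the write/rewrite and random tests.
iozone -e -s 128m -r 4k -i 0 -i 2 -f /mnt/test/iozone.tmp
```

Either approach removes the five-versus-thirty-second flush asymmetry from the comparison.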
-
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-21 22:03 Duc Vianney
2002-06-21 23:11 ` Andrew Morton
@ 2002-06-22 0:19 ` kwijibo
2002-06-22 8:10 ` kwijibo
1 sibling, 1 reply; 27+ messages in thread
From: kwijibo @ 2002-06-22 0:19 UTC (permalink / raw)
To: Duc Vianney
Cc: Andrew Morton, mgross, Griffiths, Richard A, Jens Axboe,
Linux Kernel Mailing List, lse-tech
This web site may be of interest for this discussion:
http://labs.zianet.com. I have benchmarks using NFS
with ext3 there. It also compares ext3 with ReiserFS.
The page is not quite complete but it has the
benchmarks up.
Steven
Duc Vianney wrote:
>Andrew Morton wrote:
>
>
>>If you have time, please test ext2 and/or reiserfs and/or ext3
>>in writeback mode.
>>
>>
>I ran IOzone on ext2fs, ext3fs, JFS, and Reiserfs on an SMP 4-way
>500MHz, 2.5GB RAM, two 9.1GB SCSI drives. The test partition is 1GB,
>test file size is 128MB, test block size is 4KB, and IO threads varies
>from 1 to 6. When comparing with other file system for this test
>environment, the results on a 2.5.19 SMP kernel show ext3fs is having
>performance problem with Writes and in particularly, with Random Write.
>I think the BKL contention patch would help ext3fs, but I need to verify
>it first.
> [IOzone tables, signature and list footer snipped]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-22 0:19 ` kwijibo
@ 2002-06-22 8:10 ` kwijibo
0 siblings, 0 replies; 27+ messages in thread
From: kwijibo @ 2002-06-22 8:10 UTC (permalink / raw)
To: kwijibo
Cc: Duc Vianney, Andrew Morton, mgross, Griffiths, Richard A,
Jens Axboe, Linux Kernel Mailing List, lse-tech
If you tried the link earlier and it didn't work, I'm sorry - I had a
brain fart with the web server configuration. It should work now.
Steven
kwijibo@zianet.com wrote:
> This web site may be of interest for this discussion:
> http://labs.zianet.com. I have benchmarks using NFS
> with ext3 there. It also compares ext3 with ReiserFS.
> The page is not quite complete but it has the
> benchmarks up.
>
> Steven
>
> Duc Vianney wrote:
> [quoted benchmark report and list footers snipped]
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: ext3 performance bottleneck as the number of spindles gets large
2002-06-20 21:50 ext3 performance bottleneck as the number of spindles gets large Griffiths, Richard A
2002-06-21 7:58 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
@ 2002-06-23 4:02 ` Christopher E. Brown
2002-06-23 4:33 ` Andreas Dilger
1 sibling, 1 reply; 27+ messages in thread
From: Christopher E. Brown @ 2002-06-23 4:02 UTC (permalink / raw)
To: Griffiths, Richard A
Cc: 'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> I should have mentioned the throughput we saw on 4 adapters 6 drives was
> 126MB/s. The max theoretical bus bandwidth is 640MB/s.
This is *NOT* correct. Assuming a 64-bit/66MHz PCI bus, your max is
503MB/sec minus PCI overhead...
This of course assumes nothing else is using the PCI bus.
120-something MB/sec sounds a hell of a lot like topping out a
32-bit/33MHz PCI bus, but IIRC the earlier posting listed 39160 cards,
which are PCI 64-bit with backward compatibility to 32-bit.
You do have *ALL* of these cards plugged into full PCI 64-bit/66MHz
slots, right? Not plugged into 32-bit/33MHz slots?
32-bit/33MHz (32 * 33,000,000) / (1024 * 1024 * 8) = 125.89 MByte/sec
64-bit/33MHz (64 * 33,000,000) / (1024 * 1024 * 8) = 251.77 MByte/sec
64-bit/66MHz (64 * 66,000,000) / (1024 * 1024 * 8) = 503.54 MByte/sec
NOTE: PCI transfer rates are often listed as
32-bit/33MHz, 132 MByte/sec
64-bit/33MHz, 264 MByte/sec
64-bit/66MHz, 528 MByte/sec
That is somewhat true, but only if we start from Mbit rates as used for
transmission (1,000,000 bits/sec) and work from there, instead of from
2^20 (1,048,576). I will not argue with PCI 32-bit/33MHz being
1056Mbit if we are talking about line rate, but when we are talking
about storage media, and transfers to/from them as measured in files,
remember to convert.
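The table above falls straight out of bus width (bits) times clock (Hz) divided by 8, floored into 2^20-byte megabytes. A quick sketch of the same arithmetic (function name is illustrative):

```shell
# Theoretical PCI burst bandwidth: width_bits * clock_hz / 8 bytes/sec,
# reported in 2^20-byte MB/sec as in the table above (integer floor).
pci_mb() {
    # $1 = bus width in bits, $2 = clock in Hz, $3 = clock label
    echo "${1}bit/${3}: $(( $1 * $2 / 8 / 1048576 )) MB/sec"
}

pci_mb 32 33000000 33MHz   # -> 125 MB/sec
pci_mb 64 33000000 33MHz   # -> 251 MB/sec
pci_mb 64 66000000 66MHz   # -> 503 MB/sec
```

These are ceilings for the whole bus; real throughput loses more to arbitration and protocol overhead, and everything on the bus shares them.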
--
I route, therefore you are.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 4:02 ` Christopher E. Brown
@ 2002-06-23 4:33 ` Andreas Dilger
2002-06-23 6:00 ` Christopher E. Brown
0 siblings, 1 reply; 27+ messages in thread
From: Andreas Dilger @ 2002-06-23 4:33 UTC (permalink / raw)
To: Christopher E. Brown
Cc: Griffiths, Richard A, 'Andrew Morton', mgross,
'Jens Axboe', Linux Kernel Mailing List, lse-tech
On Jun 22, 2002 22:02 -0600, Christopher E. Brown wrote:
> On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
>
> > I should have mentioned the throughput we saw on 4 adapters 6 drives was
> > 126MB/s. The max theoretical bus bandwidth is 640MB/s.
>
> This is *NOT* correct. Assuming a 64bit 66Mhz PCI bus your MAX is
> 503MB/sec minus PCI overhead...
Assuming you only have a single PCI bus...
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 4:33 ` Andreas Dilger
@ 2002-06-23 6:00 ` Christopher E. Brown
2002-06-23 6:35 ` [Lse-tech] " William Lee Irwin III
0 siblings, 1 reply; 27+ messages in thread
From: Christopher E. Brown @ 2002-06-23 6:00 UTC (permalink / raw)
To: Andreas Dilger
Cc: Griffiths, Richard A, 'Andrew Morton', mgross,
'Jens Axboe', Linux Kernel Mailing List, lse-tech
On Sat, 22 Jun 2002, Andreas Dilger wrote:
> On Jun 22, 2002 22:02 -0600, Christopher E. Brown wrote:
> > On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> >
> > > I should have mentioned the throughput we saw on 4 adapters 6 drives was
> > > 126MB/s. The max theoretical bus bandwidth is 640MB/s.
> >
> > This is *NOT* correct. Assuming a 64bit 66Mhz PCI bus your MAX is
> > 503MB/sec minus PCI overhead...
>
> Assuming you only have a single PCI bus...
Yes, we could (for example) assume a DP264 board; it features 2/4/8-way
memory interleave, dual 21264 CPUs, and 2 separate PCI 64-bit/66MHz
buses.
However, multiple buses are *rare* on x86. There are a lot of chained
buses hanging off PCI-to-PCI bridges, but few systems have 2 or more
PCI buses of any type with parallel access to the CPU.
--
I route, therefore you are.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 6:00 ` Christopher E. Brown
@ 2002-06-23 6:35 ` William Lee Irwin III
2002-06-23 7:29 ` Dave Hansen
2002-06-23 17:06 ` Eric W. Biederman
0 siblings, 2 replies; 27+ messages in thread
From: William Lee Irwin III @ 2002-06-23 6:35 UTC (permalink / raw)
To: Christopher E. Brown
Cc: Andreas Dilger, Griffiths, Richard A, 'Andrew Morton',
mgross, 'Jens Axboe', Linux Kernel Mailing List, lse-tech
On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> However, multiple busses are *rare* on x86. There are alot of chained
> busses via PCI to PCI bridge, but few systems with 2 or more PCI
> busses of any type with parallel access to the CPU.
NUMA-Q has them.
Cheers,
Bill
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 6:35 ` [Lse-tech] " William Lee Irwin III
@ 2002-06-23 7:29 ` Dave Hansen
2002-06-23 7:36 ` William Lee Irwin III
2002-06-23 17:06 ` Eric W. Biederman
1 sibling, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2002-06-23 7:29 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
William Lee Irwin III wrote:
> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
>
>>However, multiple busses are *rare* on x86. There are alot of chained
>>busses via PCI to PCI bridge, but few systems with 2 or more PCI
>>busses of any type with parallel access to the CPU.
>
> NUMA-Q has them.
>
Yep, 2 independent busses per quad. That's a _lot_ of busses when you
have an 8 or 16 quad system. (I wonder who has one of those... ;)
Almost all of the server-type boxes that we play with have multiple
PCI busses. Even my old dual-PPro has 2.
--
Dave Hansen
haveblue@us.ibm.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 7:29 ` Dave Hansen
@ 2002-06-23 7:36 ` William Lee Irwin III
2002-06-23 7:45 ` Dave Hansen
0 siblings, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2002-06-23 7:36 UTC (permalink / raw)
To: Dave Hansen
Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
>> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
>>> However, multiple busses are *rare* on x86. There are alot of chained
>>> busses via PCI to PCI bridge, but few systems with 2 or more PCI
>>> busses of any type with parallel access to the CPU.
William Lee Irwin III wrote:
>> NUMA-Q has them.
On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> Yep, 2 independent busses per quad. That's a _lot_ of busses when you
> have an 8 or 16 quad system. (I wonder who has one of those... ;)
> Almost all of the server-type boxes that we play with have multiple
> PCI busses. Even my old dual-PPro has 2.
I thought I saw 3 PCI and 1 ISA per quad, but maybe that's the
"independent" bit coming into play.
Cheers,
Bill
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 7:36 ` William Lee Irwin III
@ 2002-06-23 7:45 ` Dave Hansen
2002-06-23 7:55 ` Christopher E. Brown
2002-06-23 16:21 ` Martin J. Bligh
0 siblings, 2 replies; 27+ messages in thread
From: Dave Hansen @ 2002-06-23 7:45 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
William Lee Irwin III wrote:
> On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
>> Yep, 2 independent busses per quad. That's a _lot_ of busses
>> when you have an 8 or 16 quad system. (I wonder who has one of
>> those... ;) Almost all of the server-type boxes that we play with
>> have multiple PCI busses. Even my old dual-PPro has 2.
>
> I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
> "independent" bit coming into play.
>
Hmmmm. Maybe there is another one for the onboard devices. I thought
that there were 8 slots, 4 per bus, but I could be wrong. BTW, the ISA
slot is EISA and as far as I can tell is only used for the MDC.
--
Dave Hansen
haveblue@us.ibm.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 7:45 ` Dave Hansen
@ 2002-06-23 7:55 ` Christopher E. Brown
2002-06-23 8:11 ` David Lang
2002-06-23 8:31 ` Dave Hansen
2002-06-23 16:21 ` Martin J. Bligh
1 sibling, 2 replies; 27+ messages in thread
From: Christopher E. Brown @ 2002-06-23 7:55 UTC (permalink / raw)
To: Dave Hansen
Cc: William Lee Irwin III, Andreas Dilger, Griffiths, Richard A,
'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
On Sun, 23 Jun 2002, Dave Hansen wrote:
> William Lee Irwin III wrote:
> > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> >> Yep, 2 independent busses per quad. That's a _lot_ of busses
> >> when you have an 8 or 16 quad system. (I wonder who has one of
> >> those... ;) Almost all of the server-type boxes that we play with
> >> have multiple PCI busses. Even my old dual-PPro has 2.
> >
> > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
> > "independent" bit coming into play.
> >
> Hmmmm. Maybe there is another one for the onboard devices. I thought
> that there were 8 slots and 4 per bus. I could
> be wrong. BTW, the ISA slot is EISA and as far as I can tell is only
> used for the MDC.
Do you mean independent in that there are 2 sets of 4 slots, each
detected as a separate PCI bus, or independent in that each set of 4
has *direct* access to the CPU side and *does not* go via a
PCI:PCI bridge?
I have stacks of PPro/PII/Xeon boards around, but 9 out of 10 have
chained buses. Even the old PPro x 6 (Avion 6600/ALR 6x6/Unisys
HR/HS6000) had 2 PCI buses; however, the second bus hung off a
PCI:PCI bridge.
--
I route, therefore you are.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 7:55 ` Christopher E. Brown
@ 2002-06-23 8:11 ` David Lang
2002-06-23 8:31 ` Dave Hansen
1 sibling, 0 replies; 27+ messages in thread
From: David Lang @ 2002-06-23 8:11 UTC (permalink / raw)
To: Christopher E. Brown
Cc: Dave Hansen, William Lee Irwin III, Andreas Dilger,
Griffiths, Richard A, 'Andrew Morton', mgross,
'Jens Axboe', Linux Kernel Mailing List, lse-tech
most chipsets only have one PCI bus on them so any others need to be
bridged to that one.
David Lang
On Sun, 23 Jun 2002, Christopher E. Brown wrote:
> Date: Sun, 23 Jun 2002 01:55:28 -0600 (MDT)
> From: Christopher E. Brown <cbrown@woods.net>
> To: Dave Hansen <haveblue@us.ibm.com>
> Cc: William Lee Irwin III <wli@holomorphy.com>,
> Andreas Dilger <adilger@clusterfs.com>,
> "Griffiths, Richard A" <richard.a.griffiths@intel.com>,
> 'Andrew Morton' <akpm@zip.com.au>, mgross@unix-os.sc.intel.com,
> 'Jens Axboe' <axboe@suse.de>,
> Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
> lse-tech@lists.sourceforge.net
> Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of
> spindles gets large
>
> On Sun, 23 Jun 2002, Dave Hansen wrote:
>
> > William Lee Irwin III wrote:
> > > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> > >> Yep, 2 independent busses per quad. That's a _lot_ of busses
> > >> when you have an 8 or 16 quad system. (I wonder who has one of
> > >> those... ;) Almost all of the server-type boxes that we play with
> > >> have multiple PCI busses. Even my old dual-PPro has 2.
> > >
> > > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
> > > "independent" bit coming into play.
> > >
> > Hmmmm. Maybe there is another one for the onboard devices. I thought
> > that there were 8 slots and 4 per bus. I could
> > be wrong. BTW, the ISA slot is EISA and as far as I can tell is only
> > used for the MDC.
>
>
> Do you mean independent in that there are 2 sets of 4 slots each
> detected as a seperate PCI bus, or independent in that each set of 4
> had *direct* access to the cpu side, and *does not* access via a
> PCI:PCI bridge?
>
>
>
> I have stacks of PPro/PII/Xeon boards around, but 9 out of 10 have
> chianed buses. Even the old PPro x 6 (Avion 6600/ALR 6x6/Unisys
> HR/HS6000) had 2 PCI buses, however the second BUS hung off of a
> PCI:PCI bridge.
>
>
> --
> I route, therefore you are.
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 7:55 ` Christopher E. Brown
2002-06-23 8:11 ` David Lang
@ 2002-06-23 8:31 ` Dave Hansen
1 sibling, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2002-06-23 8:31 UTC (permalink / raw)
To: Christopher E. Brown
Cc: William Lee Irwin III, Andreas Dilger, Griffiths, Richard A,
'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
Christopher E. Brown wrote:
> Do you mean independent in that there are 2 sets of 4 slots each
> detected as a seperate PCI bus, or independent in that each set of 4
> had *direct* access to the cpu side, and *does not* access via a
> PCI:PCI bridge?
No PCI:PCI bridges, at least for NUMA-Q.
http://telia.dl.sourceforge.net/sourceforge/lse/linux_on_numaq.pdf
--
Dave Hansen
haveblue@us.ibm.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 7:45 ` Dave Hansen
2002-06-23 7:55 ` Christopher E. Brown
@ 2002-06-23 16:21 ` Martin J. Bligh
1 sibling, 0 replies; 27+ messages in thread
From: Martin J. Bligh @ 2002-06-23 16:21 UTC (permalink / raw)
To: Dave Hansen, William Lee Irwin III
Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
> >> Yep, 2 independent busses per quad. That's a _lot_ of busses
> >> when you have an 8 or 16 quad system. (I wonder who has one of
> >> those... ;) Almost all of the server-type boxes that we play with
> >> have multiple PCI busses. Even my old dual-PPro has 2.
> >
> > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
> > "independent" bit coming into play.
> >
> Hmmmm. Maybe there is another one for the onboard devices. I thought
> that there were 8 slots and 4 per bus. I could
> be wrong. BTW, the ISA slot is EISA and as far as I can tell is only
> used for the MDC.
NUMA-Q has 2 PCI buses per quad, 3 slots in one, 4 in the other,
plus the EISA slots.
Multiple independent PCI buses are also available on other more
common architectures, e.g. the Netfinity 8500R, x360, x440, etc.
Anything with the Intel Profusion chipset will have this feature;
the bottleneck becomes the "P6 system bus" backplane they're all
connected to, which has a theoretical limit of 800MB/s IIRC, though
nobody's been able to get more than 420MB/s out of it in practice,
as far as I know.
The thing that makes the NUMA-Q a massive IO-shovelling engine is
having one of these IO backplanes per quad too ... 16 x 800MB/s
= 12.8GB/s ;-)
M.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
2002-06-23 6:35 ` [Lse-tech] " William Lee Irwin III
2002-06-23 7:29 ` Dave Hansen
@ 2002-06-23 17:06 ` Eric W. Biederman
1 sibling, 0 replies; 27+ messages in thread
From: Eric W. Biederman @ 2002-06-23 17:06 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
'Andrew Morton', mgross, 'Jens Axboe',
Linux Kernel Mailing List, lse-tech
William Lee Irwin III <wli@holomorphy.com> writes:
> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> > However, multiple busses are *rare* on x86. There are alot of chained
> > busses via PCI to PCI bridge, but few systems with 2 or more PCI
> > busses of any type with parallel access to the CPU.
>
> NUMA-Q has them.
As do the latest round of dual P4 Xeon chipsets. The Intel E7500 and
the Serverworks Grand Champion.
So on new systems this is easy to get if you want it.
Eric
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2002-06-23 17:17 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-06-20 21:50 ext3 performance bottleneck as the number of spindles gets large Griffiths, Richard A
2002-06-21 7:58 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
2002-06-21 18:46 ` mgross
2002-06-21 19:26 ` Chris Mason
2002-06-21 19:56 ` Andrew Morton
2002-06-23 4:02 ` Christopher E. Brown
2002-06-23 4:33 ` Andreas Dilger
2002-06-23 6:00 ` Christopher E. Brown
2002-06-23 6:35 ` [Lse-tech] " William Lee Irwin III
2002-06-23 7:29 ` Dave Hansen
2002-06-23 7:36 ` William Lee Irwin III
2002-06-23 7:45 ` Dave Hansen
2002-06-23 7:55 ` Christopher E. Brown
2002-06-23 8:11 ` David Lang
2002-06-23 8:31 ` Dave Hansen
2002-06-23 16:21 ` Martin J. Bligh
2002-06-23 17:06 ` Eric W. Biederman
-- strict thread matches above, loose matches on Subject: below --
2002-06-21 22:03 Duc Vianney
2002-06-21 23:11 ` Andrew Morton
2002-06-22 0:19 ` kwijibo
2002-06-22 8:10 ` kwijibo
2002-06-20 16:24 [Lse-tech] Re: ext3 performance bottleneck as the number of spindles " Gross, Mark
2002-06-20 21:11 ` [Lse-tech] Re: ext3 performance bottleneck as the number of spindles " Andrew Morton
[not found] <59885C5E3098D511AD690002A5072D3C057B499E@orsmsx111.jf.intel.com>
2002-06-20 16:10 ` Dave Hansen
2002-06-20 20:47 ` John Hawkes
2002-06-19 21:29 mgross
2002-06-20 0:54 ` Andrew Morton
2002-06-20 4:09 ` [Lse-tech] " Dave Hansen
2002-06-20 6:03 ` Andreas Dilger
2002-06-20 6:53 ` Andrew Morton