* ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
@ 2004-02-11 19:04 Jon Burgess
2004-02-11 20:28 ` Rik van Riel
2004-02-12 9:56 ` Andrew Morton
0 siblings, 2 replies; 25+ messages in thread
From: Jon Burgess @ 2004-02-11 19:04 UTC (permalink / raw)
To: linux kernel
[-- Attachment #1: Type: text/plain, Size: 1034 bytes --]
I wrote a small benchmark tool to simulate the pattern of writes which
occur when slowly streaming files to disk.
This is trying to replicate the filesystem activity when I record
multiple TV and radio channels to disk.
I have attached a copy of the test program. It measures how long it
takes to write a number of files in parallel, writing a small amount of
data to each file at a time. I noticed that results for ext2 on 2.6.2
are much slower than 2.4.22:
Write speed in MB/s using an ext2 filesystem for 1 and 2 streams:
Num streams:    1      2
linux-2.4.22  10.47   6.98
linux-2.6.2    9.71   0.34
"vmstat" agrees with the performance figures. It seems that the pattern
of small interleaved writes to two files really upsets something in the
2.6 code.
During the test the disk light is on solid and it really slows any other
disk access. It looks like the disk is continuously seeking backwards and
forwards, perhaps re-writing the metadata.
Does this look like a problem, or is the test unrealistic?
Thanks,
Jon
[-- Attachment #2: trial.c --]
[-- Type: text/plain, Size: 1895 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/param.h>
#include <string.h>
#include <sys/time.h>
#define BSIZE (4 * 1024)
#define MAX_NAME (256)

char base_name[MAX_NAME];

char *get_name(int n)
{
    static char name[MAX_NAME + 16];

    snprintf(name, sizeof(name), "%s%d", base_name, n);
    return name;
}

void display_rate(struct timeval start, struct timeval end, int len)
{
    int d_s, d_us;
    float sec;

    d_s = end.tv_sec - start.tv_sec;
    d_us = end.tv_usec - start.tv_usec;
    sec = d_s + d_us / 1000000.0;
    printf("Transferred %dMB of data in %.2f seconds (%.2fMB/s)\n",
           len, sec, len / sec);
    fflush(NULL);
}

void create_files(int n, int sz)
{
    int out[n], i;
    char buf[BSIZE];
    int pos;
    struct timeval start, end;

    printf("Writing %dMB of data to %d files in parallel\n", sz, n);
    fflush(NULL);
    for (i = 0; i < n; i++) {
        out[i] = open(get_name(i), O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (out[i] < 0) {
            perror("Creating output file");
            exit(1);
        }
    }
    memset(buf, 0, BSIZE);
    gettimeofday(&start, NULL);
    for (pos = 0; pos < (sz * 1024 * 1024); pos += BSIZE) {
        for (i = 0; i < n; i++) {
            if (write(out[i], buf, BSIZE) != BSIZE) {
                fprintf(stderr, "Problem writing output file\n");
                exit(2);
            }
        }
    }
    for (i = 0; i < n; i++) {
        fdatasync(out[i]);
        close(out[i]);
    }
    gettimeofday(&end, NULL);
    display_rate(start, end, n * pos / (1024 * 1024));
}

void delete_files(int n)
{
    int i;

    for (i = 0; i < n; i++)
        unlink(get_name(i));
}

void run_test(int n, int s)
{
    delete_files(n);
    create_files(n, s);
    delete_files(n);
    printf("\n");
    fflush(NULL);
}

int main(int argc, char *argv[])
{
    unsigned int s = 16;

    strcpy(base_name, "temp_");
    run_test(1, s);
    run_test(2, s / 2);
    return 0;
}
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-11 19:04 ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes Jon Burgess
@ 2004-02-11 20:28 ` Rik van Riel
2004-02-11 21:02 ` Michael Frank
2004-02-12 10:40 ` Jon Burgess
2004-02-12 9:56 ` Andrew Morton
1 sibling, 2 replies; 25+ messages in thread
From: Rik van Riel @ 2004-02-11 20:28 UTC (permalink / raw)
To: Jon Burgess; +Cc: linux kernel
On Wed, 11 Feb 2004, Jon Burgess wrote:
> Write speed in MB/s using an ext2 filesystem for 1 and 2 streams:
> Num streams:    1      2
> linux-2.4.22  10.47   6.98
> linux-2.6.2    9.71   0.34
> During the test the disk light is on solid and it really slows any other
> disk access. It looks like the disk is continuously seeking backwards and
> forwards, perhaps re-writing the metadata.
Just for fun, could you also try measuring how long it takes
to read back the files in question ?
Both individually and in parallel...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-11 20:28 ` Rik van Riel
@ 2004-02-11 21:02 ` Michael Frank
2004-02-11 21:18 ` Diego Calleja
2004-02-13 12:15 ` ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes Jon Burgess
2004-02-12 10:40 ` Jon Burgess
1 sibling, 2 replies; 25+ messages in thread
From: Michael Frank @ 2004-02-11 21:02 UTC (permalink / raw)
To: Rik van Riel, Jon Burgess; +Cc: linux kernel
On Thursday 12 February 2004 04:28, Rik van Riel wrote:
> On Wed, 11 Feb 2004, Jon Burgess wrote:
>
> > Write speed in MB/s using an ext2 filesystem for 1 and 2 streams:
> > Num streams:    1      2
> > linux-2.4.22  10.47   6.98
> > linux-2.6.2    9.71   0.34
>
> > During the test the disk light is on solid and it really slows any other
> > disk access. It looks like the disk is continuously seeking backwards and
> > forwards, perhaps re-writing the metadata.
>
> Just for fun, could you also try measuring how long it takes
> to read back the files in question ?
>
> Both individually and in parallel...
>
2.4 has a deadline scheduler. 2.6 default is anticipatory.
Could you please boot with elevator=deadline to compare apples with apples.
Regards
Michael
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-11 21:02 ` Michael Frank
@ 2004-02-11 21:18 ` Diego Calleja
2004-02-12 2:00 ` Dave Olien
2004-02-13 12:15 ` ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes Jon Burgess
1 sibling, 1 reply; 25+ messages in thread
From: Diego Calleja @ 2004-02-11 21:18 UTC (permalink / raw)
To: Michael Frank; +Cc: linux-kernel
El Thu, 12 Feb 2004 05:02:39 +0800 Michael Frank <mhf@linuxmail.org> escribió:
> 2.4 has a deadline scheduler. 2.6 default is anticipatory.
I thought the 2.4 io scheduler wasn't "deadline" based; I think the first
"deadline" io scheduler was the one merged around 2.5.39.
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-11 21:18 ` Diego Calleja
@ 2004-02-12 2:00 ` Dave Olien
2004-02-12 2:23 ` Andrea Arcangeli
0 siblings, 1 reply; 25+ messages in thread
From: Dave Olien @ 2004-02-12 2:00 UTC (permalink / raw)
To: Diego Calleja; +Cc: Michael Frank, linux-kernel
2.4 does not have a deadline scheduler. But the 2.6 deadline scheduler
is more similar to 2.4's scheduler than the anticipatory scheduler is.
Re-trying 2.6 with the deadline scheduler will remove some of the additional
scheduling policies that are present in the anticipatory scheduler.
On Wed, Feb 11, 2004 at 10:18:06PM +0100, Diego Calleja wrote:
> El Thu, 12 Feb 2004 05:02:39 +0800 Michael Frank <mhf@linuxmail.org> escribió:
>
>
> > 2.4 has a deadline scheduler. 2.6 default is anticipatory.
>
> I thought the 2.4 io scheduler wasn't "deadline" based; I think the first
> "deadline" io scheduler was the one merged around 2.5.39.
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-12 2:00 ` Dave Olien
@ 2004-02-12 2:23 ` Andrea Arcangeli
2004-02-12 9:42 ` ext2/3 performance regression in 2.6 vs 2.4 for small interl Giuliano Pochini
0 siblings, 1 reply; 25+ messages in thread
From: Andrea Arcangeli @ 2004-02-12 2:23 UTC (permalink / raw)
To: Dave Olien; +Cc: Diego Calleja, Michael Frank, linux-kernel
On Wed, Feb 11, 2004 at 06:00:19PM -0800, Dave Olien wrote:
>
> 2.4 does not have deadline scheduler. But the 2.6 deadline scheduler
> is more similar to 2.4's scheduler than is the anticipatory scheduler.
the main difference is that 2.4 works as a function of requests, not of
time: no matter how long it takes to write a request, it's potentially
optimizing slow devices when you don't care about latency (deadline can
be tuned for each device via /sys/block/*/queue/iosched/).
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-12 2:23 ` Andrea Arcangeli
@ 2004-02-12 9:42 ` Giuliano Pochini
2004-02-12 10:15 ` John Bradford
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Giuliano Pochini @ 2004-02-12 9:42 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel
On 12-Feb-2004 Andrea Arcangeli wrote:
> the main difference is that 2.4 isn't in function of time, it's in
> function of requests, no matter how long it takes to write a request,
> so it's potentially optimizing slow devices when you don't care about
> latency (deadline can be tuned for each dev via
> /sys/block/*/queue/iosched/).
IMHO it's the opposite: the transfer speed * seek time product of some
slow devices is lower than that of fast devices. For example:
Hard disk  raw speed = 40MB/s   seek time = 8ms
MO/ZIP     raw speed = 3MB/s    seek time = 25ms
One seek on the HD costs about 320KB of forgone transfer, while on a slow
drive it's only 75KB. 2.4 has terrible latency on slow devices, and it
has only a very small advantage in terms of speed. On CDs and DVDs
the cost of a seek is much higher, but since the data is
usually accessed sequentially you get the high latency
penalty with no appreciable speed gain in this case too.
--
Giuliano.
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-11 19:04 ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes Jon Burgess
2004-02-11 20:28 ` Rik van Riel
@ 2004-02-12 9:56 ` Andrew Morton
2004-02-12 20:20 ` Jon Burgess
` (2 more replies)
1 sibling, 3 replies; 25+ messages in thread
From: Andrew Morton @ 2004-02-12 9:56 UTC (permalink / raw)
To: Jon Burgess; +Cc: linux-kernel
Jon Burgess <lkml@jburgess.uklinux.net> wrote:
>
> I wrote a small benchmark tool to simulate the pattern of writes which
> occur when slowly streaming files to disk.
> This is trying to replicate the filesystem activity when I record
> multiple TV and radio channels to disk.
>
> I have attached a copy of the test program. It measures how long it
> takes to write a number of files in parallel, writing a small amount of
> data to each file at a time. I noticed that results for ext2 on 2.6.2
> are much slower than 2.4.22:
>
> Write speed in MB/s using an ext2 filesystem for 1 and 2 streams:
> Num streams:    1      2
> linux-2.4.22  10.47   6.98
> linux-2.6.2    9.71   0.34
I don't know why the single-stream case would be slower, but the two-stream
case is probably due to writeback changes interacting with a weakness in
the block allocator. 10 megs/sec is pretty awful either way.
You have two files, each allocating blocks from the same part of the disk.
So the blocks of the two files are intermingled.
The same happens in 2.4, although the effect can be worse in 2.6 if the two
files are in different directories (because 2.6 will still start these file
out in the same blockgroup, usually - 2.4 will spread different directories
around).
Either way, you have intermingled blocks in the files.
In 2.4, we write these blocks out in time-of-dirtying-the-block order, so
these blocks are written out to nice big linear chunks of disk - the block
write order is 1,2,3,4,5,6,7...
However in 2.6, we write the data out on a per-file basis. So we write
file 1 (blocks 1,3,5,7,9,...) and then we write file 2 (blocks
2,4,6,8,10,...). So you'll see that instead of a single full-bandwidth
write, we do two half-bandwidth writes. If it weren't for disk writeback
caching, it would be as much as 4x slower.
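[Editorial aside: the write-ordering difference described above can be sketched with a toy model. The assumptions are illustrative only (two files whose blocks strictly alternate on disk; any non-contiguous step between consecutively written blocks counts as one seek); this is a sketch, not kernel code:]

```c
#include <assert.h>

/* Toy model of the two writeback orderings, for two files whose blocks
 * alternate on disk: file 1 owns blocks 1,3,5,..., file 2 owns blocks
 * 2,4,6,...  Count how many consecutively written blocks are not
 * adjacent on disk, i.e. force a seek. */

/* 2.4 style: blocks go out in time-of-dirtying order 1,2,3,4,... */
static int seeks_time_order(int nblocks)
{
    int seeks = 0, prev = 0, b;

    for (b = 1; b <= nblocks; b++) {
        if (b != prev + 1)
            seeks++;
        prev = b;
    }
    return seeks;
}

/* 2.6 style: per-file writeback - all of file 1, then all of file 2. */
static int seeks_per_file_order(int nblocks)
{
    int seeks = 0, prev = 0, b;

    for (b = 1; b <= nblocks; b += 2) {   /* file 1: odd blocks */
        if (b != prev + 1)
            seeks++;
        prev = b;
    }
    for (b = 2; b <= nblocks; b += 2) {   /* file 2: even blocks */
        if (b != prev + 1)
            seeks++;
        prev = b;
    }
    return seeks;
}
```

With 100 interleaved blocks the time-ordered writeback never seeks, while the per-file writeback seeks on nearly every block, which is consistent with the disk light staying on solid.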
Reads will be slower too - you will probably find that reading back a file
which was created at the same time as a second stream is significantly
slower than reading a file which was created all on its own. 2.4 and 2.6
shouldn't behave significantly differently here.
It's an unfortunate interaction. The 2.6 writeback design is better,
really, because it is optimised for well-laid out files - the better your
filesystem is at laying the files out, the faster it all goes. But in this
particular case, the poor layout decisions trip it up.
The ideal fix for this of course is to just fix the dang filesystems to not
do such a silly thing. But nobody got to that in time. Delayed allocation
would fix it too. You can probably address it quite well within the
application itself by buffering up a good amount of data for each write()
call. Maybe a megabyte.
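[Editorial aside: the buffering suggestion above can be sketched as a thin userspace wrapper. The names (`struct wbuf`, `buf_write`, `buf_flush`) and the 1MB size are illustrative assumptions, not anything from the thread:]

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>

/* Accumulate small frames in a large userspace buffer and issue one big
 * write() when it fills, so the filesystem sees large requests. */

#define WBUF_SIZE (1024 * 1024)

struct wbuf {
    int fd;
    size_t used;
    char data[WBUF_SIZE];
};

static int buf_flush(struct wbuf *b)
{
    if (b->used == 0)
        return 0;
    if (write(b->fd, b->data, b->used) != (ssize_t)b->used)
        return -1;
    b->used = 0;
    return 0;
}

/* Queue one frame; hit the disk only when the buffer would overflow. */
static int buf_write(struct wbuf *b, const void *frame, size_t len)
{
    if (len > WBUF_SIZE)
        return -1;    /* frame larger than the buffer: not handled */
    if (b->used + len > WBUF_SIZE && buf_flush(b) < 0)
        return -1;
    memcpy(b->data + b->used, frame, len);
    b->used += len;
    return 0;
}
```

A recorder would call buf_write() once per frame and buf_flush() before closing the file.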
XFS will do well at this.
You might be able to improve things significantly on ext2 by increasing
EXT2_DEFAULT_PREALLOC_BLOCKS by a lot - make it 64 or 128. I don't recall
anyone trying that.
But I must say, a 21x difference is pretty wild. What filesystem was that
with, and how much memory do you have, and what was the bandwidth of each
stream, and how much data is the application passing to write()?
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-12 9:42 ` ext2/3 performance regression in 2.6 vs 2.4 for small interl Giuliano Pochini
@ 2004-02-12 10:15 ` John Bradford
2004-02-12 10:27 ` Nick Piggin
2004-02-12 14:59 ` Andrea Arcangeli
2 siblings, 0 replies; 25+ messages in thread
From: John Bradford @ 2004-02-12 10:15 UTC (permalink / raw)
To: Giuliano Pochini, Andrea Arcangeli; +Cc: linux-kernel
Quote from Giuliano Pochini <pochini@shiny.it>:
>
> On 12-Feb-2004 Andrea Arcangeli wrote:
>
> > the main difference is that 2.4 isn't in function of time, it's in
> > function of requests, no matter how long it takes to write a request,
> > so it's potentially optimizing slow devices when you don't care about
> > latency (deadline can be tuned for each dev via
> > /sys/block/*/queue/iosched/).
>
> IMHO it's the opposite. Transfer speed * seek time of some
> slow devices is lower than fast devices. For example:
>
> Hard disk raw speed= 40MB/s seek time = 8ms
> MO/ZIP raw speed= 3MB/s seek time = 25ms
>
> One seek of HD costs about 320KB, while on a slow drive it's
> only 75KB.
Hmmm, but I would imagine that most hard disks have much larger caches
than are popular on removable cartridge drives...
John.
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-12 9:42 ` ext2/3 performance regression in 2.6 vs 2.4 for small interl Giuliano Pochini
2004-02-12 10:15 ` John Bradford
@ 2004-02-12 10:27 ` Nick Piggin
2004-02-12 17:05 ` Michael Frank
2004-02-12 14:59 ` Andrea Arcangeli
2 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2004-02-12 10:27 UTC (permalink / raw)
To: Giuliano Pochini; +Cc: Andrea Arcangeli, linux-kernel
Giuliano Pochini wrote:
>On 12-Feb-2004 Andrea Arcangeli wrote:
>
>
>>the main difference is that 2.4 isn't in function of time, it's in
>>function of requests, no matter how long it takes to write a request,
>>so it's potentially optimizing slow devices when you don't care about
>>latency (deadline can be tuned for each dev via
>>/sys/block/*/queue/iosched/).
>>
>
>IMHO it's the opposite. Transfer speed * seek time of some
>slow devices is lower than fast devices. For example:
>
>Hard disk raw speed= 40MB/s seek time = 8ms
>MO/ZIP raw speed= 3MB/s seek time = 25ms
>
>
I like accounting by time better because it's accurate
and fair for all types of devices, though I admit an
auto-tuning feature would be nice.
Say you allow 16 128K requests before seeking:
The HD will run the requests for 50ms then seek (8ms).
So this gives you about 86% efficiency.
On your zip drive it takes 666ms, giving you 96%.
Now with AS, allowing 50ms of requests before a seek
gives you the same for an HD, but only 66% for the MO
drive. A CD-ROM will be much worse.
Auto-tuning wouldn't be too hard. Just measure the time
it takes for your seeking requests to complete and you
can use the simple formula to allow users to specify an
efficiency vs latency percentage.
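[Editorial aside: the percentages above follow from a simple ratio, time spent transferring divided by transfer time plus one seek. A small sketch reproduces the figures; treating the rates as binary KB/s is my assumption:]

```c
#include <assert.h>

/* Fraction of time spent transferring when service_ms of requests is
 * followed by one seek of seek_ms, as a whole percentage. */
static int efficiency_pct(int service_ms, int seek_ms)
{
    return 100 * service_ms / (service_ms + seek_ms);
}

/* Milliseconds needed to transfer nreq requests of req_kb each at
 * rate_kb_per_s. */
static int service_ms(int nreq, int req_kb, int rate_kb_per_s)
{
    return 1000 * nreq * req_kb / rate_kb_per_s;
}
```

16 requests of 128K at 40MB/s take 50ms, giving 50/(50+8) ~ 86%; the same batch at 3MB/s takes 666ms, giving ~96%. A fixed 50ms window gives the slow drive only 50/(50+25) ~ 66%.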
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-11 20:28 ` Rik van Riel
2004-02-11 21:02 ` Michael Frank
@ 2004-02-12 10:40 ` Jon Burgess
2004-02-12 20:17 ` Hans Reiser
1 sibling, 1 reply; 25+ messages in thread
From: Jon Burgess @ 2004-02-12 10:40 UTC (permalink / raw)
To: Rik van Riel; +Cc: Jon Burgess, linux kernel
Rik van Riel wrote:
> Just for fun, could you also try measuring how long it takes
> to read back the files in question ?
>
> Both individually and in parallel...
>
The original code did the read back as well; I stripped it out to make
the code smaller to post.
It was the read-back performance that I was most interested in. I found
that ext2/3 interleave all the blocks on the disk. With 2 streams the
read performance is 50%, 4 streams give 25%, etc.
I have one really bad case where I record a TV stream at 500kByte/s plus a
radio one at 25kByte/s. These blocks are interleaved on the disk and the
read performance of the radio stream is reduced by the data ratio, i.e.
1:20, so I get a miserable read performance of ~1MB/s.
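[Editorial aside: the slowdown described above follows from a simple model. Assuming, purely for illustration, that every foreign block the head passes over costs as much as reading it and that no read-ahead helps, a stream interleaved 1:n with another gets 1/(n+1) of the raw bandwidth:]

```c
#include <assert.h>

/* Effective read rate (KB/s) for one stream whose blocks are interleaved
 * 1:n with another stream's blocks, under the toy assumption that each
 * skipped foreign block costs as much as reading it. */
static int interleaved_read_kb(int raw_kb_per_s, int n)
{
    return raw_kb_per_s / (n + 1);
}
```

Two equal streams give half the raw rate, and a 1:20 radio/TV layout leaves the radio stream around 1/21 of raw bandwidth, roughly 1MB/s on a 20MB/s disk.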
I found that ext2, ext3 and Reiserfs behave similarly. XFS and JFS
appear to coalesce the data blocks during the write phase and can read
the data back at near maximum performance.
Jon
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-12 9:42 ` ext2/3 performance regression in 2.6 vs 2.4 for small interl Giuliano Pochini
2004-02-12 10:15 ` John Bradford
2004-02-12 10:27 ` Nick Piggin
@ 2004-02-12 14:59 ` Andrea Arcangeli
2 siblings, 0 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2004-02-12 14:59 UTC (permalink / raw)
To: Giuliano Pochini; +Cc: linux-kernel
On Thu, Feb 12, 2004 at 10:42:15AM +0100, Giuliano Pochini wrote:
>
> On 12-Feb-2004 Andrea Arcangeli wrote:
>
> > the main difference is that 2.4 isn't in function of time, it's in
> > function of requests, no matter how long it takes to write a request,
> > so it's potentially optimizing slow devices when you don't care about
> > latency (deadline can be tuned for each dev via
> > /sys/block/*/queue/iosched/).
>
> IMHO it's the opposite. Transfer speed * seek time of some
> slow devices is lower than fast devices. For example:
>
> Hard disk raw speed= 40MB/s seek time = 8ms
> MO/ZIP raw speed= 3MB/s seek time = 25ms
>
> One seek of HD costs about 320KB, while on a slow drive it's
> only 75KB. 2.4 has a terrible latency on slow devices, and it
> has very small advantage in terms of speed. On CDs and DVDs
> the cost of a seek is much higher, but since the data is
> usually accessed sequentially you have the high latency
> penalty with no appreciable speed gain in this case too.
I was thinking of old slow hard disks (5MB/sec), and I don't think all
data on CDs is always accessed sequentially; you only need two tasks
reading two files.
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-12 10:27 ` Nick Piggin
@ 2004-02-12 17:05 ` Michael Frank
2004-02-12 17:18 ` Valdis.Kletnieks
0 siblings, 1 reply; 25+ messages in thread
From: Michael Frank @ 2004-02-12 17:05 UTC (permalink / raw)
To: Nick Piggin, Giuliano Pochini; +Cc: Andrea Arcangeli, linux-kernel
On Thursday 12 February 2004 18:27, Nick Piggin wrote:
>
> Giuliano Pochini wrote:
>
> >On 12-Feb-2004 Andrea Arcangeli wrote:
> >
> >
> >>the main difference is that 2.4 isn't in function of time, it's in
> >>function of requests, no matter how long it takes to write a request,
> >>so it's potentially optimizing slow devices when you don't care about
> >>latency (deadline can be tuned for each dev via
> >>/sys/block/*/queue/iosched/).
> >>
> >
> >IMHO it's the opposite. Transfer speed * seek time of some
> >slow devices is lower than fast devices. For example:
> >
> >Hard disk raw speed= 40MB/s seek time = 8ms
> >MO/ZIP raw speed= 3MB/s seek time = 25ms
> >
> >
>
> I like accounting by time better because its accurate
> and fair for all types of devices, however I admit an
> auto tuning feature would be nice.
>
> Say you allow 16 128K requests before seeking:
> The HD will run the requests for 50ms then seek (8ms).
> So this gives you about 86% efficiency.
> On your zip drive it takes 666ms, giving you 96%.
>
> Now with AS, allowing 50ms of requests before a seek
> gives you the same for an HD, but only 66% for the MO
> drive. A CD-ROM will be much worse.
>
> Auto tuning wouldn't be too hard. Just measure the time
> it takes for your seeking requests to complete and you
> can use the simple formula to allow users to specify a
> efficiency vs latency %age.
>
This triggers me to ask about "io niceness", which has been on
my mind for some time.
A disk-intensive example is updatedb, which since the earlier
days of Linux on [34]86s is usually reniced to 19. At that time a
CPU did 10-50 bogomips and disks transferred 5-20MB/s at seek times of
10ms or so.
Today, CPUs are 100 times as fast but disks are effectively only
2-5 times as fast.
What I am getting at is being annoyed with updatedb ___saturating___
the disk so easily, as the "ancient" method of renicing does not
consider the fact that CPU performance has increased 20-50 fold
over disk performance.
Bottom line: what about assigning "io niceness" to processes, which
would also help with actively scheduling io toward processes
needing it.
Michael
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-12 17:05 ` Michael Frank
@ 2004-02-12 17:18 ` Valdis.Kletnieks
2004-02-12 20:55 ` Helge Hafting
0 siblings, 1 reply; 25+ messages in thread
From: Valdis.Kletnieks @ 2004-02-12 17:18 UTC (permalink / raw)
To: Michael Frank
Cc: Nick Piggin, Giuliano Pochini, Andrea Arcangeli, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]
On Fri, 13 Feb 2004 01:05:20 +0800, Michael Frank said:
> What I am getting at is being annoyed with updatedb ___saturating___
> the disk so easily, as the "ancient" method of renicing does not
> consider the fact that CPU performance has increased 20-50 fold
> over disk performance.
>
> Bottom line: what about assigning "io niceness" to processes, which
> would also help with actively scheduling io toward processes
> needing it.
The problem is that unlike CPU niceness, where you can literally rip the
CPU away from a hog and give it to some more deserving process, it's not
as easy to rip an active disk I/O away so somebody else can have it.
If updatedb issues a seek/read combo to the disk, and your process gets
into the I/O queue even 200 nanoseconds later, it *still* has to wait for that
I/O to finish before it can start its own seeks and reads.
For an extreme example, consider those IDE interfaces where fixating or
blanking a CD-RW will cause *all* the disks to lock up for the duration.
No matter how high your priority is, you *can't* get that I/O out the door
for the next 60-70 seconds unless you're willing to create a coaster.
[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-12 10:40 ` Jon Burgess
@ 2004-02-12 20:17 ` Hans Reiser
0 siblings, 0 replies; 25+ messages in thread
From: Hans Reiser @ 2004-02-12 20:17 UTC (permalink / raw)
To: Jon Burgess; +Cc: Rik van Riel, linux kernel, reiserfs-dev
Jon Burgess wrote:
> Rik van Riel wrote:
>
>> Just for fun, could you also try measuring how long it takes
>> to read back the files in question ?
>>
>> Both individually and in parallel...
>>
> The original code did the read back as well, I stripped it out to make
> the code smaller to post.
> It was the read back performance that I was most interested in. I
> found that ext2/3 interleave all the blocks on the disk. With 2 streams
> the read performance is 50%, 4 streams give 25% etc.
>
> I have one really bad case where I record a TV stream at 500kByte/s +
> a radio one at 25kByte/s. These blocks are interleaved on the disk and
> the read performance of the radio stream is reduced by the data ratio,
> i.e. 1:20, so I get a miserable read performance of ~ 1MB/s.
>
> I found that ext2, ext3 and Reiserfs behave similarly.
Try Reiser4, it will probably cure it for you nicely. I will be
surprised if it does not, please let me know your results. I would
expect that tweaking the block preallocation code to preallocate more
should cure it for ReiserFS V3 also....
> XFS and JFS appear to coalesce the data blocks during the write phase
> and can read the data back at near maximum performance.
>
> Jon
>
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-12 9:56 ` Andrew Morton
@ 2004-02-12 20:20 ` Jon Burgess
2004-02-13 8:28 ` Juan Piernas Canovas
2004-02-16 17:51 ` Alex Zarochentsev
2004-02-13 12:35 ` Jon Burgess
2004-02-14 15:00 ` Jon Burgess
2 siblings, 2 replies; 25+ messages in thread
From: Jon Burgess @ 2004-02-12 20:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jon Burgess, linux-kernel
Andrew Morton wrote:
> I don't know why the single-stream case would be slower, but the two-stream
> case is probably due to writeback changes interacting with a weakness in
> the block allocator. 10 megs/sec is pretty awful either way.
10MB/s is just because I did the test on an old machine; it maxes out at
15MB/s with "hdparm -t".
I didn't want to do it on my main PC because I was using it to record a TV
program at the time :-)
> Either way, you have intermingled blocks in the files.
Yes the blocks are intermingled. Thanks for the explanation of the
2.4/2.6 difference.
> Reads will be slower too - you will probably find that reading back a file
Yes reads are 50% for 2 streams, 25% for 4 etc. 2.4 and 2.6 perform the
same.
I did a debugfs "stat" and it clearly shows the fragmented file blocks.
> You can probably address it quite well within the application itself by
> buffering up a good amount of data for each write() call. Maybe a megabyte.
Writes in the 256kB - 1MB region do avoid the problem. Unfortunately, the
way the application is written makes this tricky to do. It wants to write
the data out one frame at a time, typically 10-50kB.
> XFS will do well at this.
Yes, both XFS and JFS perform much better. Here is a summary of some
tests done on 2.6; these were done on a faster machine/disk combination.
This was the original test program, which also measured the read speeds;
you can get it from http://www.jburgess.uklinux.net/slow.c
The ext2 result is a bit slow, but ext3 is really bad.
Num streams   |  1      1     |  2      2
Filesystem    | Write  Read   | Write  Read
--------------|---------------|--------------
ext2          | 27.70  29.17  |  5.89  14.43
ext3-ordered  | 25.73  29.21  |  0.48   1.10
Reiserfs      | 25.31  26.25  |  7.47  13.55
JFS           | 26.27  26.95  | 26.92  28.50
XFS           | 27.51  26.00  | 27.35  27.42
> You might be able to improve things significantly on ext2 by increasing
> EXT2_DEFAULT_PREALLOC_BLOCKS by a lot - make it 64 or 128. I don't recall
> anyone trying that.
I'll give it a go.
> But I must say, a 21x difference is pretty wild. What filesystem was that
> with, and how much memory do you have, and what was the bandwidth of each
> stream, and how much data is the application passing to write()?
The results were from running the test program I attached to the
original email. It was writing 4kB at a time on an ext2 filesystem. It
tries to write the data in a tight loop, taking as much bandwidth as it
can get.
In the real application it records MPEG2 DVB streams from TV and radio.
The bandwidths are as follows:
TV ~ 500kByte/s, in 10 - 50kB blocks (one MPEG frame at a time).
Radio ~ 24kByte/s, in blocks of around 2-4kB.
The write performance is not too critical. Even at 5MB/s I would still
be able to record 10 TV channels.
I wrote the benchmark primarily to see why the read performance was so
bad. I noticed that when I started moving the files between disks the
transfer rate would be really erratic. Sometimes 40MB/s, sometimes 5MB/s.
The worst case that I found was when I record a TV and radio stream
simultaneously. The data blocks are recorded in a pattern of 1 block of
radio data followed by 20 blocks of TV. When I read back the radio
stream I get only 1/20th of the disk performance (1-2 MB/s).
Jon
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-12 17:18 ` Valdis.Kletnieks
@ 2004-02-12 20:55 ` Helge Hafting
2004-02-13 1:57 ` Jamie Lokier
0 siblings, 1 reply; 25+ messages in thread
From: Helge Hafting @ 2004-02-12 20:55 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Michael Frank, Nick Piggin, Giuliano Pochini, Andrea Arcangeli,
linux-kernel
On Thu, Feb 12, 2004 at 12:18:29PM -0500, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 13 Feb 2004 01:05:20 +0800, Michael Frank said:
>
> > What I am getting at is being annoyed with updatedb ___saturating___
> > the disk so easily, as the "ancient" method of renicing does not
> > consider the fact that CPU performance has increased 20-50 fold
> > over disk performance.
> >
> > Bottom line: what about assigning "io niceness" to processes, which
> > would also help with actively scheduling io toward processes
> > needing it.
>
> The problem is that unlike CPU niceness, where you can literally rip the
> CPU away from a hog and give it to some more deserving process, it's not
> as easy to rip an active disk I/O away so somebody else can have it.
>
You can't take the disk away, but you can be careful with it.
The anticipatory scheduler already does that - it avoids seeking
away from a read for a while, just in case the reader will submit
an adjacent read in short order. (Wonder if it ought to read ahead
a little instead of just waiting?)
Something similar could be done for io niceness. If we run out of
normal-priority io, how about not issuing the low-priority io
right away? Anticipate that there will be more high-priority io
and wait for some idle time before letting low-priority
requests through. And of course some maximum wait to prevent
total starvation.
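[Editorial aside: the policy sketched above can be written as a dispatch predicate. Everything here is an illustrative assumption (the names, IDLE_MS, MAX_WAIT_MS), not a real kernel interface:]

```c
#include <assert.h>

/* Sketch of "anticipatory io niceness": a low-priority request is held
 * back while normal-priority io is pending, dispatched after the disk
 * has been idle for IDLE_MS, and forced out after MAX_WAIT_MS so it
 * never starves.  Times in ms; all values illustrative. */

#define IDLE_MS     100
#define MAX_WAIT_MS 2000

static int dispatch_low_prio(int high_pending, int now_ms,
                             int last_high_done_ms, int queued_ms)
{
    if (high_pending)
        return 0;        /* normal-priority io always wins */
    if (now_ms - queued_ms >= MAX_WAIT_MS)
        return 1;        /* starvation bound */
    return now_ms - last_high_done_ms >= IDLE_MS;   /* idle window */
}
```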
> If the updatedb issues a seek/read combo to the disk, and your process gets
> into the I/O queue even 200 nanoseconds later, it *still* has to wait for that
> I/O to finish before it can start its seeks and reads.
>
Anticipatory io niceness might solve that.
> For an extreme example, consider those IDE interfaces where fixating or
> blanking a CD/RW will cause *all* the disks to lock up for the duration.
> No matter how high your priority is, you *cant* get that I/O out the door
> for the next 60-70 seconds unless you're willing to create a coaster.
There's no cure for truly stupid hw. Sometimes I forget why I buy
these expensive scsi drives . . .
Helge Hafting
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-12 20:55 ` Helge Hafting
@ 2004-02-13 1:57 ` Jamie Lokier
2004-02-13 2:05 ` Nick Piggin
0 siblings, 1 reply; 25+ messages in thread
From: Jamie Lokier @ 2004-02-13 1:57 UTC (permalink / raw)
To: Helge Hafting
Cc: Valdis.Kletnieks, Michael Frank, Nick Piggin, Giuliano Pochini,
Andrea Arcangeli, linux-kernel
Helge Hafting wrote:
> Something similar could be done for io niceness. If we run out of
> normal priority io, how about not issuing the low priority io
> right away. Anticipate there will be more high-priority io
> and wait for some idle time before letting low-priority
> requests through. And of course some maximum wait to prevent
> total starvation.
The problem is quite similar to scheduling for quality of service on a
network device. Once a packet has started going out, usually you cannot
abort the packet for a higher-priority one.
I thought there was a CBQ I/O scheduling patch or such to offer some
kind of I/O niceness these days?
-- Jamie
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interl
2004-02-13 1:57 ` Jamie Lokier
@ 2004-02-13 2:05 ` Nick Piggin
0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2004-02-13 2:05 UTC (permalink / raw)
To: Jamie Lokier
Cc: Helge Hafting, Valdis.Kletnieks, Michael Frank, Giuliano Pochini,
Andrea Arcangeli, linux-kernel
Jamie Lokier wrote:
>Helge Hafting wrote:
>
>>Something similar could be done for io niceness. If we run out of
>>normal priority io, how about not issuing the low priority io
>>right away. Anticipate there will be more high-priority io
>>and wait for some idle time before letting low-priority
>>requests through. And of course some maximum wait to prevent
>>total starvation.
>>
>
>The problem is quite similar to scheduling for quality of service on a
>network device. Once a packet has started going out, usually you cannot
>abort the packet for a higher-priority one.
>
>I thought there was a CBQ I/O scheduling patch or such to offer some
>kind of I/O niceness these days?
>
>
Yeah, it's Jens' CFQ io scheduler. It is in -mm, and I think it
has adjustable priorities now.
I have plans to do IO priorities in the anticipatory scheduler
(significantly differently to CFQ). One day...
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-12 20:20 ` Jon Burgess
@ 2004-02-13 8:28 ` Juan Piernas Canovas
2004-02-16 17:51 ` Alex Zarochentsev
1 sibling, 0 replies; 25+ messages in thread
From: Juan Piernas Canovas @ 2004-02-13 8:28 UTC (permalink / raw)
To: Jon Burgess; +Cc: linux-kernel
On Thu, 12 Feb 2004, Jon Burgess wrote:
[snip]
> >You might be able to improve things significantly on ext2 by increasing
> >EXT2_DEFAULT_PREALLOC_BLOCKS by a lot - make it 64 or 128. I don't recall
> >anyone trying that.
> >
> >
> I'll give it a go.
I think that a better choice for Ext2 and Ext3 is to put each stream
in a different directory (but that only makes sense if those directories
are in different groups). In that way, file blocks will not be interleaved on
disk, which is the problem. Try this, if you can, and let us know your
results.
Regards,
Juan.
[snip]
>
> Jon
>
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657 Fax: +34968364151
email: piernas@ditec.um.es
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index
*** Please send me your documents in text, HTML, PDF or PostScript format :-) ***
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-11 21:02 ` Michael Frank
2004-02-11 21:18 ` Diego Calleja
@ 2004-02-13 12:15 ` Jon Burgess
1 sibling, 0 replies; 25+ messages in thread
From: Jon Burgess @ 2004-02-13 12:15 UTC (permalink / raw)
To: Michael Frank, linux kernel
Michael Frank wrote:
>Could you please boot with scheduler=deadline to compare apples with apples.
>
>
The scheduler did not make a significant difference; both returned
0.34MB/s for the worst case.
Jon
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-12 9:56 ` Andrew Morton
2004-02-12 20:20 ` Jon Burgess
@ 2004-02-13 12:35 ` Jon Burgess
2004-02-14 15:00 ` Jon Burgess
2 siblings, 0 replies; 25+ messages in thread
From: Jon Burgess @ 2004-02-13 12:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jon Burgess, linux-kernel
Andrew Morton wrote:
>What filesytem was that with?
>
>
I re-ran the tests again last night and found that I had made one
mistake in my description.
The really poor results occurred with the *ext3* filesystem, not ext2.
"mount" was reporting the contents of /etc/fstab, which said ext2 -
but the kernel actually had it mounted as ext3.
I think I might be able to give a little insight to the "0.34MB/s" and
"0.48MB/s" numbers. I think these numbers closely match the theoretical
performance rate when a single 4kB write occurs per disk rotation.
4kB * 5400RPM / 60 seconds = 360 kB/s
4kB * 7200RPM / 60 seconds = 480 kB/s
Perhaps the drives that I am running the test on do not have
write-caching enabled.
By the time the first 4kB write has completed the drive may need to wait
a complete rotation before it can do the next write. I don't think it
quite explains the difference between ext2 and ext3. Any ideas?
Below are the results of ext2/ext3 tests on a new Seagate 80GB SATA, 8MB
cache, model ST380023AS.
The ext3 results are a lot better; perhaps this drive has write caching
enabled.
Num streams    |1     1     |2     2     |4     4
Filesystem     |Write Read  |Write Read  |Write Read
---------------|------------|------------|------------
Ext2           |40.17 43.07 |10.88 21.49 |10.13 11.41
ext3-journal   |16.06 42.24 | 7.56 16.28 | 7.17 11.25
ext3-ordered   |37.31 43.12 | 4.64 15.33 | 5.25 11.28
ext3-writeback |37.33 42.93 | 4.00 14.88 | 2.97 11.26
Jon
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-12 9:56 ` Andrew Morton
2004-02-12 20:20 ` Jon Burgess
2004-02-13 12:35 ` Jon Burgess
@ 2004-02-14 15:00 ` Jon Burgess
2 siblings, 0 replies; 25+ messages in thread
From: Jon Burgess @ 2004-02-14 15:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jon Burgess, linux-kernel
Andrew Morton wrote:
>So you'll see that instead of a single full-bandwidth
>write, we do two half-bandwidth writes. If it weren't for disk writeback
>caching, it would be as much as 4x slower.
>
Write caching does indeed make a big difference. Here is a test run on a
drive with and without write caching (hdparm -W 0/1). The test was done
on 2.6.2 with ext3 and shows the write speed in MB/s:
Write Cache 1 Stream 2 Streams
Enabled 21.54 3.66
Disabled 18.11 0.46
The two-stream case is almost 10x slower without write caching.
I don't think this explains the difference between 2.4 and 2.6, unless
one of them changes the drive's write cache mode.
Jon
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-12 20:20 ` Jon Burgess
2004-02-13 8:28 ` Juan Piernas Canovas
@ 2004-02-16 17:51 ` Alex Zarochentsev
2004-02-16 20:03 ` Jon Burgess
1 sibling, 1 reply; 25+ messages in thread
From: Alex Zarochentsev @ 2004-02-16 17:51 UTC (permalink / raw)
To: Jon Burgess; +Cc: Andrew Morton, linux-kernel
On Thu, Feb 12, 2004 at 08:20:46PM +0000, Jon Burgess wrote:
> Andrew Morton wrote:
>
> >I don't know why the single-stream case would be slower, but the
> >two-stream
> >
> >case is probably due to writeback changes interacting with a weakness in
> >the block allocator. 10 megs/sec is pretty awful either way.
> >
> >
> >
> 10MB/s is just because I did the test on an old machine, it maxes out at
> 15MB/s with "hdparm -t".
> I didn't want to do it on my main PC because I was using it to record a TV
> program at the time :-)
>
> >Either way, you have intermingled blocks in the files.
> >
> >
> Yes the blocks are intermingled. Thanks for the explanation of the
> 2.4/2.6 difference.
>
> >Reads will be slower too - you will probably find that reading back a file
> >
> >
> Yes, reads run at 50% for 2 streams, 25% for 4, etc. 2.4 and 2.6 perform
> the same.
> I did a debugfs "stat" and it clearly shows the fragmented file blocks.
>
> >You can probably address it quite well within the
> >application itself by buffering up a good amount of data for each write()
> >call. Maybe a megabyte.
> >
> >
> Writes in the 256kB - 1MB range do avoid the problem. Unfortunately, the
> way the application is written makes this tricky to do. It wants to
> write out the data one frame at a time, typically 10 - 50kB.
>
> >XFS will do well at this.
> >
> >
> Yes, both XFS and JFS perform much better. Here is a summary of some
> tests done on 2.6, these were done on a faster machine / disk
> combination. This was the original test program which also measured the
> read speeds, you can get this from http://www.jburgess.uklinux.net/slow.c
>
> The ext2 result is a bit slow, but ext3 is really bad.
>
> Num streams |1 1 |2 2
> Filesystem |Write Read |Write Read
> --------------|----------------|--------------
> Ext2 |27.7 29.17 | 5.89 14.43
> ext3-ordered |25.73 29.21 | 0.48 1.1
> Reiserfs |25.31 26.25 | 7.47 13.55
> JFS |26.27 26.95 |26.92 28.5
> XFS |27.51 26.00 |27.35 27.42
I ran slow.c on Reiser4. It is different hardware, so the ext2 results
are included for comparison:
server config:
2xXeon, hyperthreading, 256 MB RAM
Linux-2.6.2
Test:
/tests/slow foo 1024
it writes 1GB to one or two files.
Results:
---------+------------+------------+------------+------------+
         |         1 stream        |        2 streams        |
---------+------------+------------+------------+------------+
         |   WRITE    |    READ    |   WRITE    |    READ    |
---------+------------+------------+------------+------------+
REISER4  | 33.67 MB/s | 40.97 MB/s | 30.78 MB/s | 38.37 MB/s |
---------+------------+------------+------------+------------+
EXT2     | 33.32 MB/s | 40.82 MB/s |  9.45 MB/s | 20.39 MB/s |
---------+------------+------------+------------+------------+
The filesystems with delayed block allocation (Reiser4, XFS, and it
seems JFS too) look much better.
--
Alex.
* Re: ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes
2004-02-16 17:51 ` Alex Zarochentsev
@ 2004-02-16 20:03 ` Jon Burgess
0 siblings, 0 replies; 25+ messages in thread
From: Jon Burgess @ 2004-02-16 20:03 UTC (permalink / raw)
To: Alex Zarochentsev; +Cc: Jon Burgess, linux-kernel
Alex Zarochentsev wrote:
>The fs with delayed block allocation (Reiser4, XFS, seems JFS too) look much
>better.
>
>
Yes, those results are in line with what I found on Reiser4 as well. I
also tried increasing the number of streams to see when things start to
break. Reiser4 seems to do well here as well. I stopped some tests
early because some filesystems were just too slow.
Streams:    1     1     2     2     4     4     8     8    16    16    32    32
            Write Read  Write Read  Write Read  Write Read  Write Read  Write Read
----------------------------------------------------------------------------------
ext2        26.10 29.22  8.27 14.51  6.91  7.31  ---------------------------------
ext3-order  25.45 28.21  4.96 14.29  ---------------------------------------------
JFS         27.76 29.17 26.72 28.93 25.72 28.86 24.76 29.01 22.94 28.49  4.25  6.03
Reiser4     27.08 29.28 27.02 28.69 27.09 28.47 27.26 27.26 27.09 25.52 26.94 22.59
XFS         28.09 29.16 28.15 28.11 27.60 27.19 26.81 26.23 25.68 24.04 22.59 21.45
It would appear that with XFS and Reiser4 I would be able to
simultaneously record >32 MPEG TV channels onto a single disk. I think
that exceeds my TV recording requirements by some considerable margin :-)
Jon
Thread overview: 25+ messages
2004-02-11 19:04 ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes Jon Burgess
2004-02-11 20:28 ` Rik van Riel
2004-02-11 21:02 ` Michael Frank
2004-02-11 21:18 ` Diego Calleja
2004-02-12 2:00 ` Dave Olien
2004-02-12 2:23 ` Andrea Arcangeli
2004-02-12 9:42 ` ext2/3 performance regression in 2.6 vs 2.4 for small interl Giuliano Pochini
2004-02-12 10:15 ` John Bradford
2004-02-12 10:27 ` Nick Piggin
2004-02-12 17:05 ` Michael Frank
2004-02-12 17:18 ` Valdis.Kletnieks
2004-02-12 20:55 ` Helge Hafting
2004-02-13 1:57 ` Jamie Lokier
2004-02-13 2:05 ` Nick Piggin
2004-02-12 14:59 ` Andrea Arcangeli
2004-02-13 12:15 ` ext2/3 performance regression in 2.6 vs 2.4 for small interleaved writes Jon Burgess
2004-02-12 10:40 ` Jon Burgess
2004-02-12 20:17 ` Hans Reiser
2004-02-12 9:56 ` Andrew Morton
2004-02-12 20:20 ` Jon Burgess
2004-02-13 8:28 ` Juan Piernas Canovas
2004-02-16 17:51 ` Alex Zarochentsev
2004-02-16 20:03 ` Jon Burgess
2004-02-13 12:35 ` Jon Burgess
2004-02-14 15:00 ` Jon Burgess