* Random file I/O regressions in 2.6
@ 2004-05-02 19:57 Alexey Kopytov
  2004-05-03 11:14 ` Nick Piggin
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kopytov @ 2004-05-02 19:57 UTC (permalink / raw)
  To: linux-kernel

Hello!

I tried to compare random file I/O performance in the 2.4 and 2.6 kernels and
found some regressions that I failed to explain. I tested 2.4.25, 2.6.5-bk2
and 2.6.6-rc3 with my own utility SysBench, which was written to generate
workloads similar to a database under intensive load.

For the 2.6.x kernels the anticipatory, deadline, CFQ and noop I/O schedulers
were tested, with AS giving the best results for this workload, but it is
still about 1.5 times worse than the result for the 2.4.25 kernel.

The SysBench 'fileio' test was configured to generate the following workload:
16 worker threads are created, each issuing random read/write file requests in
blocks of 16 KB with a read/write ratio of 1.5. All I/O operations are evenly
distributed over 128 files with a total size of 3 GB. Every 100 requests, an
fsync() operation is performed sequentially on each file. The total number of
requests is limited to 10,000.

The FS used for the test was ext3 with data=ordered.

Here are the results (values are the number of seconds to complete the test):

2.4.25:                   77.5377

2.6.5-bk2(noop):         165.3393
2.6.5-bk2(anticipatory): 118.7450
2.6.5-bk2(deadline):     130.3254
2.6.5-bk2(CFQ):          146.4286

2.6.6-rc3(noop):         164.9486
2.6.6-rc3(anticipatory): 125.1776
2.6.6-rc3(deadline):     131.8903
2.6.6-rc3(CFQ):          152.9280

I have published the results as well as the hardware and kernel setups at the
SysBench home page: http://sysbench.sourceforge.net/results/fileio/

Any comments or suggestions would be highly appreciated.

-- 
Alexey Kopytov, Software Developer
MySQL AB, www.mysql.com

Are you MySQL certified? www.mysql.com/certification

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-02 19:57 Random file I/O regressions in 2.6 Alexey Kopytov
@ 2004-05-03 11:14 ` Nick Piggin
  2004-05-03 18:08   ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2004-05-03 11:14 UTC (permalink / raw)
  To: Alexey Kopytov; +Cc: linux-kernel, Jens Axboe, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 3200 bytes --]

Alexey Kopytov wrote:
> Hello!
>
> I tried to compare random file I/O performance in 2.4 and 2.6 kernels and
> found some regressions that I failed to explain. I tested 2.4.25, 2.6.5-bk2
> and 2.6.6-rc3 with my own utility SysBench which was written to generate
> workloads similar to a database under intensive load.
>
> For 2.6.x kernels anticipatory, deadline, CFQ and noop I/O schedulers were
> tested with AS giving the best results for this workload, but it's still about
> 1.5 times worse than the results for 2.4.25 kernel.
>
> The SysBench 'fileio' test was configured to generate the following workload:
> 16 worker threads are created, each running random read/write file requests in
> blocks of 16 KB with a read/write ratio of 1.5. All I/O operations are evenly
> distributed over 128 files with a total size of 3 GB. Each 100 requests, an
> fsync() operations is performed sequentially on each file. The total number of
> requests is limited by 10000.
>
> The FS used for the test was ext3 with data=ordered.
>

I am able to reproduce this here. 2.6 isn't improved by increasing
nr_requests, relaxing IO scheduler deadlines, or turning off readahead.
It looks like 2.6 is submitting a lot of the IO in 4KB sized requests...

Hmm, oh dear. It looks like the readahead logic shat itself and/or
do_generic_mapping_read doesn't know how to handle multipage reads
properly.

What ends up happening is that readahead gets turned off, then the
16K read ends up being done in 4 synchronous 4K chunks. Because they
are synchronous, they have no chance of being merged with one another
either.
I have attached a proof of concept hack... I think what should really
happen is that page_cache_readahead should be taught about the size
of the requested read, and ensures that a decent amount of reading is
done while within the read request window, even if
beyond-request-window-readahead has been previously unsuccessful.

Numbers with an IDE disk, 256MB RAM:

2.4.24:        81s
2.6.6-rc3-mm1: 126s
rc3-mm1+patch: 87s

The small remaining regression might be explained by 2.6's smaller
nr_requests, IDE driver, io scheduler tuning, etc.

> Here are the results (values are number of seconds to complete the test):
>
> 2.4.25: 77.5377
>
> 2.6.5-bk2(noop): 165.3393
> 2.6.5-bk2(anticipatory): 118.7450
> 2.6.5-bk2(deadline): 130.3254
> 2.6.5-bk2(CFQ): 146.4286
>
> 2.6.6-rc3(noop): 164.9486
> 2.6.6-rc3(anticipatory): 125.1776
> 2.6.6-rc3(deadline): 131.8903
> 2.6.6-rc3(CFQ): 152.9280
>
> I have published the results as well as the hardware and kernel setups at the
> SysBench home page: http://sysbench.sourceforge.net/results/fileio/
>
> Any comments or suggestions would be highly appreciated.
>

From your website: "Another interesting fact is that AS gives the best
results for this workload, though it's believed to give worse results
for this kind of workloads as compared to other I/O schedulers
available in 2.6.x kernels."

The anticipatory scheduler is actually in a fairly good state of tune,
and can often beat deadline even for random read/write/fsync tests.
The infamous database regression problem is when this sort of workload
is combined with TCQ disk drives.
Nick

[-- Attachment #2: read-populate.patch --]
[-- Type: text/x-patch, Size: 1010 bytes --]

 include/linux/mm.h             |    0
 linux-2.6-npiggin/mm/filemap.c |    5 ++++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff -puN mm/readahead.c~read-populate mm/readahead.c
diff -puN mm/filemap.c~read-populate mm/filemap.c
--- linux-2.6/mm/filemap.c~read-populate	2004-05-03 19:56:00.000000000 +1000
+++ linux-2.6-npiggin/mm/filemap.c	2004-05-03 20:51:37.000000000 +1000
@@ -627,6 +627,9 @@ void do_generic_mapping_read(struct addr
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
+	force_page_cache_readahead(mapping, filp, index,
+			max_sane_readahead(desc->count >> PAGE_CACHE_SHIFT));
+
 	for (;;) {
 		struct page *page;
 		unsigned long end_index, nr, ret;
@@ -644,7 +647,7 @@ void do_generic_mapping_read(struct addr
 		}
 		cond_resched();
-		page_cache_readahead(mapping, ra, filp, index);
+		page_cache_readahead(mapping, ra, filp, index + desc->count);
 		nr = nr - offset;
 
 find_page:
diff -puN include/linux/mm.h~read-populate include/linux/mm.h
_

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 11:14 ` Nick Piggin
@ 2004-05-03 18:08   ` Andrew Morton
  2004-05-03 20:22     ` Ram Pai
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2004-05-03 18:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: alexeyk, linux-kernel, axboe

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> What ends up happening is that readahead gets turned off, then the
> 16K read ends up being done in 4 synchronous 4K chunks. Because they
> are synchronous, they have no chance of being merged with one another
> either.

yup.

> I have attached a proof of concept hack... I think what should really
> happen is that page_cache_readahead should be taught about the size
> of the requested read, and ensures that a decent amount of reading is
> done while within the read request window, even if
> beyond-request-window-readahead has been previously unsuccessful.

The "readahead turned itself off" thing is there to avoid doing lots of
pagecache lookups in the very common case where the file is fully cached.

The place which needs attention is handle_ra_miss().  But first I'd like to
reacquaint myself with the intent behind the lazy-readahead patch.  Was
never happy with the complexity and special-cases which that introduced.

> 	cond_resched();
> -	page_cache_readahead(mapping, ra, filp, index);
> +	page_cache_readahead(mapping, ra, filp, index + desc->count);
>

`index' is a pagecache index and desc->count is a byte counter.

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 18:08 ` Andrew Morton
@ 2004-05-03 20:22   ` Ram Pai
  2004-05-03 20:57     ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Ram Pai @ 2004-05-03 20:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, alexeyk, linux-kernel, axboe

On Mon, 2004-05-03 at 11:08, Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > What ends up happening is that readahead gets turned off, then the
> > 16K read ends up being done in 4 synchronous 4K chunks. Because they
> > are synchronous, they have no chance of being merged with one another
> > either.
>
> yup.
>
> > I have attached a proof of concept hack... I think what should really
> > happen is that page_cache_readahead should be taught about the size
> > of the requested read, and ensures that a decent amount of reading is
> > done while within the read request window, even if
> > beyond-request-window-readahead has been previously unsuccessful.
>
> The "readahead turned itself off" thing is there to avoid doing lots of
> pagecache lookups in the very common case where the file is fully cached.
>
> The place which needs attention is handle_ra_miss().  But first I'd like to
> reacquaint myself with the intent behind the lazy-readahead patch.  Was
> never happy with the complexity and special-cases which that introduced.

lazy-readahead has no role to play here.  The readahead window got closed
because the I/O pattern was totally random.  My guess is that multiple
threads are generating 16K I/O on the same fd.  In such a case the I/Os
can get interleaved, and the readahead window size goes for a toss (which
is the expected behavior).

Well, if this is in fact the case, the question is:

 1. does the I/O pattern really have some sequentiality to
    deserve a readahead?
 2. or should we ensure that the interleaved case be somehow
    handled, by including the size parameter?
I know Nick has implied option (2), but I think from the readahead's
point of view it is (1).

RP

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 20:22 ` Ram Pai
@ 2004-05-03 20:57   ` Andrew Morton
  2004-05-03 21:37     ` Peter Zaitsev
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2004-05-03 20:57 UTC (permalink / raw)
  To: Ram Pai; +Cc: nickpiggin, alexeyk, linux-kernel, axboe

Ram Pai <linuxram@us.ibm.com> wrote:
>
> > The place which needs attention is handle_ra_miss().  But first I'd like to
> > reacquaint myself with the intent behind the lazy-readahead patch.  Was
> > never happy with the complexity and special-cases which that introduced.
>
> lazy-readahead has no role to play here.

Sure.  But lazy-readahead is bolted on the side and is generally not to my
liking.  I'd like to find a solution to the sysbench problem which also
solves the thing which lazy-readahead addressed.

> The readahead window got closed
> because the i/o pattern was totally random. My guess is multiple threads
> are generating 16k i/o on the same fd. In such a case the i/os can get
> interleaved and the readahead window size goes for a toss (which is
> expected behavior)

I don't think it's that.  The app is doing well-aligned 16k reads and
writes.  If we get enough pagecache hits on the reads, readahead turns
itself off (fair enough) but fails to turn itself on again.

The readahead logic _should_ be able to adapt to the fixed-sized I/Os and
issue correct-sized reads immediately after each seek.  I _think_ this will
fix the problem which lazy-readahead addressed, but as usual we don't have
a rigorous description of that problem :(

> Well if this is infact the case: the question is
> 1. does the i/o pattern really has some sequentiality to
>    deserve a readahead?
> 2. or should we ensure that the interleaved case be somehow
>    handled, by including the size parameter?
>
> I know Nick has implied option (2) but I think from the readahead's
> point of view it is (1),

Readahead has got too complex and is getting band-aidy.  I'd prefer to
tear it down and rethink things.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 20:57 ` Andrew Morton
@ 2004-05-03 21:37   ` Peter Zaitsev
  2004-05-03 21:50     ` Ram Pai
  2004-05-03 21:59     ` Andrew Morton
  0 siblings, 2 replies; 56+ messages in thread
From: Peter Zaitsev @ 2004-05-03 21:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ram Pai, nickpiggin, alexeyk, linux-kernel, axboe

On Mon, 2004-05-03 at 13:57, Andrew Morton wrote:
> Ram Pai <linuxram@us.ibm.com> wrote:
> >
> > > The place which needs attention is handle_ra_miss().  But first I'd like to
> > > reacquaint myself with the intent behind the lazy-readahead patch.  Was
> > > never happy with the complexity and special-cases which that introduced.
> >
> > lazy-readahead has no role to play here.
>

Andrew,

Could you please clarify how these things came to depend on read-ahead
at all?

In my understanding, read-ahead is there to catch sequential (or other)
access patterns and do some advance reading, so that instead of a 16K
request we do a 128K request, or something similar.

But how could disabled read-ahead end up converting a 16K request into
several sequential synchronous 4K requests?  It all looks pretty strange.

-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 21:37 ` Peter Zaitsev
@ 2004-05-03 21:50   ` Ram Pai
  2004-05-03 22:01     ` Peter Zaitsev
  2004-05-03 21:59   ` Andrew Morton
  1 sibling, 1 reply; 56+ messages in thread
From: Ram Pai @ 2004-05-03 21:50 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Andrew Morton, nickpiggin, alexeyk, linux-kernel, axboe

On Mon, 2004-05-03 at 14:37, Peter Zaitsev wrote:
> On Mon, 2004-05-03 at 13:57, Andrew Morton wrote:
> > Ram Pai <linuxram@us.ibm.com> wrote:
> > >
> > > > The place which needs attention is handle_ra_miss().  But first I'd like to
> > > > reacquaint myself with the intent behind the lazy-readahead patch.  Was
> > > > never happy with the complexity and special-cases which that introduced.
> > >
> > > lazy-readahead has no role to play here.
> >
> Andrew,
>
> Could you please clarify how this things become to be dependent on
> read-ahead at all.
>
> At my understanding read-ahead it to catch sequential (or other) access
> pattern and do some advance reading, so instead of 16K request we do
> 128K request, or something similar.
>
> But how could read-ahead disabled end up in 16K request converted to
> several sequential synchronous 4K requests ?

When the readahead window gets closed, the code goes into slow-read mode.
In this mode, all requests are broken up into page-sized pieces, so a 16K
request gets broken into 4 4K requests.  This continues until enough
sequential I/O has been requested (i.e., around ra->ra_pages pages), at
which point the readahead window gets re-activated.

Looking at it the other way: without the readahead code, all requests are
satisfied through 4K I/Os.  Readahead is what generates larger I/Os.

RP

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 21:50 ` Ram Pai
@ 2004-05-03 22:01   ` Peter Zaitsev
  0 siblings, 0 replies; 56+ messages in thread
From: Peter Zaitsev @ 2004-05-03 22:01 UTC (permalink / raw)
  To: Ram Pai; +Cc: Andrew Morton, nickpiggin, alexeyk, linux-kernel, axboe

On Mon, 2004-05-03 at 14:50, Ram Pai wrote:
>
> Looking at it the other way, without readahead code, all requests
> satisfied through 4k i/os. Readahead helps in generating larger size
> i/os.

Huh, this is really rather strange.  In the database world, random I/O
is quite frequent, and database page sizes are normally larger than the
OS page size.  Furthermore, even if a request is split into 4K blocks,
why are they not submitted in parallel and merged at a lower level?

Anyway, we all seem to agree this is not very good behavior and it
should be fixed :)

-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 21:37 ` Peter Zaitsev
  2004-05-03 21:50   ` Ram Pai
@ 2004-05-03 21:59 ` Andrew Morton
  2004-05-03 22:07   ` Ram Pai
  2004-05-03 23:58   ` Nick Piggin
  1 sibling, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2004-05-03 21:59 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linuxram, nickpiggin, alexeyk, linux-kernel, axboe

Peter Zaitsev <peter@mysql.com> wrote:
>
> On Mon, 2004-05-03 at 13:57, Andrew Morton wrote:
> > Ram Pai <linuxram@us.ibm.com> wrote:
> > >
> > > > The place which needs attention is handle_ra_miss().  But first I'd like to
> > > > reacquaint myself with the intent behind the lazy-readahead patch.  Was
> > > > never happy with the complexity and special-cases which that introduced.
> > >
> > > lazy-readahead has no role to play here.
> >
> Andrew,
>
> Could you please clarify how this things become to be dependent on
> read-ahead at all.

readahead is currently the only means by which we build up nice large
multi-page BIOs.

> At my understanding read-ahead it to catch sequential (or other) access
> pattern and do some advance reading, so instead of 16K request we do
> 128K request, or something similar.

That's one of its usage patterns.  It's also supposed to detect the
fixed-sized-reads-seeking-all-over-the-place situation.  In which case it's
supposed to submit correctly-sized multi-page BIOs.  But it's not working
right for this workload.

A naive solution would be to add special-case code which always does the
fixed-size readahead after a seek.  Basically that's

	if (ra->next_size == -1UL)
		force_page_cache_readahead(...)

in filemap.c.  But this means that the kernel does lots of pointless
pagecache lookups when everything is in pagecache.  We should detect this
situation and stop doing readahead completely, until we start getting
pagecache lookup misses again.

> But how could read-ahead disabled end up in 16K request converted to
> several sequential synchronous 4K requests ?
Readahead got itself turned off because of pagecache hits and didn't turn
itself on again.

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 21:59 ` Andrew Morton
@ 2004-05-03 22:07   ` Ram Pai
  0 siblings, 0 replies; 56+ messages in thread
From: Ram Pai @ 2004-05-03 22:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Zaitsev, nickpiggin, alexeyk, linux-kernel, axboe

On Mon, 2004-05-03 at 14:59, Andrew Morton wrote:
>
> > But how could read-ahead disabled end up in 16K request converted to
> > several sequential synchronous 4K requests ?
>
> Readahead got itself turned off because of pagecache hits and didn't turn
> itself on again.

Andrew,

In the slow-read path, every contiguous access increases ra->size by 1
and every non-contiguous access decreases ra->size by 1.  Now, in the
case of a random 16K request, we have 1 non-contiguous page access and
3 contiguous ones.  As a result, ra->size should have been incremented
by -1+1+1+1 = 2 per request.  So at the end of 16 such 16K requests we
should have had ra->size at 32, and from that point onwards readahead
should get turned on.  Right?

I strongly feel the readahead got closed because of misses and not
because of hits.  Moreover, if we are closing the readahead window
because of hits, that implies we have pretty good caching going on,
which implies I/O should rarely hit the disk, and hence performance
should not degrade.  Agree?

RP

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 21:59 ` Andrew Morton
  2004-05-03 22:07   ` Ram Pai
@ 2004-05-03 23:58 ` Nick Piggin
  2004-05-04  0:10   ` Andrew Morton
  1 sibling, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2004-05-03 23:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Zaitsev, linuxram, alexeyk, linux-kernel, axboe

Andrew Morton wrote:
> Peter Zaitsev <peter@mysql.com> wrote:
>
>>On Mon, 2004-05-03 at 13:57, Andrew Morton wrote:
>>
>>>Ram Pai <linuxram@us.ibm.com> wrote:
>>>
>>>>>The place which needs attention is handle_ra_miss().  But first I'd like to
>>>>>reacquaint myself with the intent behind the lazy-readahead patch.  Was
>>>>>never happy with the complexity and special-cases which that introduced.
>>>>
>>>>lazy-readahead has no role to play here.
>>>
>>Andrew,
>>
>>Could you please clarify how this things become to be dependent on
>>read-ahead at all.
>
> readahead is currently the only means by which we build up nice large
> multi-page BIOs.
>
>>At my understanding read-ahead it to catch sequential (or other) access
>>pattern and do some advance reading, so instead of 16K request we do
>>128K request, or something similar.
>
> That's one of its usage patterns.  It's also supposed to detect the
> fixed-sized-reads-seeking-all-over-the-place situation.  In which case it's
> supposed to submit correctly-sized multi-page BIOs.  But it's not working
> right for this workload.
>
> A naive solution would be to add special-case code which always does the
> fixed-size readahead after a seek.  Basically that's
>
> 	if (ra->next_size == -1UL)
> 		force_page_cache_readahead(...)
>

I think a better solution to this case would be to ensure the
readahead window is always min(size of read, some large number).
The size of the read is basically a free and accurate "hint" to the
minimum size of the required readahead.

Either that or do a simple "preread" while you're still in the read
request window, and run readahead when that completes.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-03 23:58 ` Nick Piggin
@ 2004-05-04  0:10   ` Andrew Morton
  2004-05-04  0:19     ` Nick Piggin
  2004-05-04  8:27     ` Arjan van de Ven
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2004-05-04 0:10 UTC (permalink / raw)
  To: Nick Piggin; +Cc: peter, linuxram, alexeyk, linux-kernel, axboe

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > That's one of its usage patterns.  It's also supposed to detect the
> > fixed-sized-reads-seeking-all-over-the-place situation.  In which case it's
> > supposed to submit correctly-sized multi-page BIOs.  But it's not working
> > right for this workload.
> >
> > A naive solution would be to add special-case code which always does the
> > fixed-size readahead after a seek.  Basically that's
> >
> > 	if (ra->next_size == -1UL)
> > 		force_page_cache_readahead(...)
> >
>
> I think a better solution to this case would be to ensure the
> readahead window is always min(size of read, some large number);
>

That would cause the kernel to perform lots of pointless pagecache lookups
when the file is already 100% cached.

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-04  0:10 ` Andrew Morton
@ 2004-05-04  0:19   ` Nick Piggin
  2004-05-04  0:50     ` Ram Pai
  2004-05-04  1:15     ` Andrew Morton
  1 sibling, 2 replies; 56+ messages in thread
From: Nick Piggin @ 2004-05-04 0:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: peter, linuxram, alexeyk, linux-kernel, axboe

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>>That's one of its usage patterns.  It's also supposed to detect the
>>>fixed-sized-reads-seeking-all-over-the-place situation.  In which case it's
>>>supposed to submit correctly-sized multi-page BIOs.  But it's not working
>>>right for this workload.
>>>
>>>A naive solution would be to add special-case code which always does the
>>>fixed-size readahead after a seek.  Basically that's
>>>
>>>	if (ra->next_size == -1UL)
>>>		force_page_cache_readahead(...)
>>>
>>
>>I think a better solution to this case would be to ensure the
>>readahead window is always min(size of read, some large number);
>>
>
> That would cause the kernel to perform lots of pointless pagecache lookups
> when the file is already 100% cached.
>

That's pretty sad. You need a "preread" or something which
sends the pages back... or uses the actor itself. readahead
would then have to be reworked to only run off the end of
the read window, but that is what it should be doing anyway.

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-04  0:19 ` Nick Piggin
@ 2004-05-04  0:50   ` Ram Pai
  2004-05-04  6:29     ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Ram Pai @ 2004-05-04 0:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, peter, alexeyk, linux-kernel, axboe

On Mon, 2004-05-03 at 17:19, Nick Piggin wrote:
> Andrew Morton wrote:
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> >>>That's one of its usage patterns.  It's also supposed to detect the
> >>>fixed-sized-reads-seeking-all-over-the-place situation.  In which case it's
> >>>supposed to submit correctly-sized multi-page BIOs.  But it's not working
> >>>right for this workload.
> >>>
> >>>A naive solution would be to add special-case code which always does the
> >>>fixed-size readahead after a seek.  Basically that's
> >>>
> >>>	if (ra->next_size == -1UL)
> >>>		force_page_cache_readahead(...)
> >>>
> >>
> >>I think a better solution to this case would be to ensure the
> >>readahead window is always min(size of read, some large number);
> >>
> >
> > That would cause the kernel to perform lots of pointless pagecache lookups
> > when the file is already 100% cached.
> >
>
> That's pretty sad. You need a "preread" or something which
> sends the pages back... or uses the actor itself. readahead
> would then have to be reworked to only run off the end of
> the read window, but that is what it should be doing anyway.

Sorry if I am repeating myself.  I have checked the behaviour of the
readahead code using my user-level simulator, as well as by running a
DSS benchmark and the iozone benchmark.  It generates a steady stream
of large I/Os for large random reads and should not exhibit the bad
behavior that we are seeing.  I feel this bad behavior is caused by
interleaved access by multiple threads.
To illustrate with an example: t1 requests reads from page 100 to 104;
simultaneously t2 requests reads on the same fd from page 200 to 204.
So do_page_cache_readahead() can be called in the following pattern:

	100, 200, 101, 201, 102, 202, 103, 203, 104, 204

Because of this pattern, the readahead code assumes that the read
pattern is absolutely random and hence closes the readahead window.

I think I should generate a patch to validate this behavior; I will.
How about having some /proc counters that keep track of the number of
window closes caused by cache hits and by cache misses?

RP

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-04  0:50 ` Ram Pai
@ 2004-05-04  6:29   ` Andrew Morton
  2004-05-04 15:03     ` Ram Pai
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2004-05-04 6:29 UTC (permalink / raw)
  To: Ram Pai; +Cc: nickpiggin, peter, alexeyk, linux-kernel, axboe

Ram Pai <linuxram@us.ibm.com> wrote:
>
> Sorry, If I am saying this again. I have checked the behaviour of the
> readahead code using my user level simulator as well as running some
> DSS benchmark and iozone benchmark. It generates a steady stream of
> large i/o for large-random-reads and should not exhibit the bad behavior
> that we are seeing. I feel this bad behavior is because of interleaved
> access by multiple thread.

you're right - the benchmark has multiple threads issuing concurrent
pread()s against the same fd.  For some reason this mucks up the 2.6
readahead state more than 2.4's.

Putting a semaphore around do_generic_file_read() or maintaining the state
as below fixes it up.

I wonder if we should bother fixing this?  I guess as long as the app is
using pread() it is a legitimate thing to be doing, so I guess we should...
--- 25/mm/filemap.c~readahead-seralisation	2004-05-03 23:14:43.399947720 -0700
+++ 25-akpm/mm/filemap.c	2004-05-03 23:14:43.404946960 -0700
@@ -612,7 +612,7 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
  * - note the struct file * is only passed for the use of readpage
  */
 void do_generic_mapping_read(struct address_space *mapping,
-			struct file_ra_state *ra,
+			struct file_ra_state *_ra,
 			struct file * filp,
 			loff_t *ppos,
 			read_descriptor_t * desc,
@@ -622,6 +622,7 @@ void do_generic_mapping_read(struct addr
 	unsigned long index, offset;
 	struct page *cached_page;
 	int error;
+	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
@@ -644,13 +645,13 @@ void do_generic_mapping_read(struct addr
 		}
 		cond_resched();
-		page_cache_readahead(mapping, ra, filp, index);
+		page_cache_readahead(mapping, &ra, filp, index);
 		nr = nr - offset;
 
 find_page:
 		page = find_get_page(mapping, index);
 		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, ra, index);
+			handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
 		if (!PageUptodate(page))
@@ -752,6 +753,8 @@ no_cached_page:
 		goto readpage;
 	}
 
+	*_ra = ra;
+
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
 		page_cache_release(cached_page);
_

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6
  2004-05-04  6:29 ` Andrew Morton
@ 2004-05-04 15:03   ` Ram Pai
  2004-05-04 19:39     ` Ram Pai
  0 siblings, 1 reply; 56+ messages in thread
From: Ram Pai @ 2004-05-04 15:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, peter, alexeyk, linux-kernel, axboe

On Mon, 2004-05-03 at 23:29, Andrew Morton wrote:
>
> Putting a semaphore around do_generic_file_read() or maintaining the state
> as below fixes it up.
>
> I wonder if we should bother fixing this?  I guess as long as the app is
> using pread() it is a legitimate thing to be doing, so I guess we should...

Yes, this patch makes sense.  I have set up sysbench on my lab machine;
let me see how much improvement the patch provides.

RP

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 15:03 ` Ram Pai @ 2004-05-04 19:39 ` Ram Pai 2004-05-04 19:48 ` Andrew Morton 2004-05-04 23:01 ` Alexey Kopytov 0 siblings, 2 replies; 56+ messages in thread From: Ram Pai @ 2004-05-04 19:39 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, peter, alexeyk, linux-kernel, axboe [-- Attachment #1: Type: text/plain, Size: 2111 bytes --] On Tue, 2004-05-04 at 08:03, Ram Pai wrote: > On Mon, 2004-05-03 at 23:29, Andrew Morton wrote: > > > > > Putting a semaphore around do_generic_file_read() or maintaining the state > > as below fixes it up. > > > > I wonder if we should bother fixing this? I guess as long as the app is > > using pread() it is a legitimate thing to be doing, so I guess we should... > > > > > > > Yes this patch makes sense. I have setup sysbench on my lab machine. Let > me see how much improvement the patch provides. I ran the following command: /root/sysbench-0.2.5/sysbench/sysbench --num-threads=256 --test=fileio --file-total-size=2800M --file-test-mode=rndrw run Without the patch: ------------------ Operations performed: 5959 Read, 4041 Write, 10752 Other = 20752 Total Read 93Mb Written 63Mb Total Transferred 156Mb 7.549Mb/sec Transferred 483.89 Requests/sec executed Test execution Statistics summary: Time spent for test: 20.6661s no of times window reset because of hits: 0 no of times window reset because of misses: 7 no of times window was shrunk because of hits: 6716 no of times the page request was non-contiguous: 5880 no of times the page request was contiguous : 19639 With the patch: -------------- Operations performed: 5960 Read, 4040 Write, 10880 Other = 20880 Total Read 93Mb Written 63Mb Total Transferred 156Mb 7.985Mb/sec Transferred 511.85 Requests/sec executed Test execution Statistics summary: Time spent for test: 19.5370s no of times window got reset because of hits: 0 no of times window got reset because of misses: 0 no of times window was shrunk because of hits: 5844 no of 
times the page request was non-contiguous: 5830 no of times the page request was contiguous : 20232 I have enclosed the patch that collects the hit/miss related counts. In general I am not seeing any major difference with or without Andrew's ra-copy patch, except for the readahead window getting closed because of misses when run without the patch. It would be nice if Alexey tried the patch on his machine and checked for any major difference. RP [-- Attachment #2: ra_instrumentation.patch --] [-- Type: text/x-patch, Size: 4017 bytes --] diff -urNp linux-2.6.6-rc3/include/linux/sysctl.h linux-2.6.6-rc3.new/include/linux/sysctl.h --- linux-2.6.6-rc3/include/linux/sysctl.h 2004-04-27 18:35:49.000000000 -0700 +++ linux-2.6.6-rc3.new/include/linux/sysctl.h 2004-05-04 18:26:37.911973080 -0700 @@ -643,6 +643,11 @@ enum FS_XFS=17, /* struct: control xfs parameters */ FS_AIO_NR=18, /* current system-wide number of aio requests */ FS_AIO_MAX_NR=19, /* system-wide maximum number of aio requests */ + FS_READ_MISS_RESET=20, + FS_READ_HIT_RESET=21, + FS_CONTIGUOUS_CNT=22, + FS_NON_CONTIGUOUS_CNT=23, + FS_HIT_COUNT=24, }; /* /proc/sys/fs/quota/ */ diff -urNp linux-2.6.6-rc3/kernel/sysctl.c linux-2.6.6-rc3.new/kernel/sysctl.c --- linux-2.6.6-rc3/kernel/sysctl.c 2004-04-27 18:35:08.000000000 -0700 +++ linux-2.6.6-rc3.new/kernel/sysctl.c 2004-05-04 18:38:58.774344880 -0700 @@ -64,6 +64,11 @@ extern int sysctl_lower_zone_protection; extern int min_free_kbytes; extern int printk_ratelimit_jiffies; extern int printk_ratelimit_burst; +extern atomic_t hit_reset; +extern atomic_t miss_reset; +extern atomic_t hit_count; +extern atomic_t contiguous_cnt; +extern atomic_t non_contiguous_cnt; /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ static int maxolduid = 65535; @@ -897,6 +902,46 @@ static ctl_table fs_table[] = { .mode = 0644, .proc_handler = &proc_dointvec, }, + { + .ctl_name = FS_READ_MISS_RESET,
+ .procname = "read-miss-reset", + .data = &miss_reset, + .maxlen = sizeof(miss_reset), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = FS_READ_HIT_RESET, + .procname = "read-hit-reset", + .data = &hit_reset, + .maxlen = sizeof(hit_reset), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = FS_CONTIGUOUS_CNT, + .procname = "read-contiguous-cnt", + .data = &contiguous_cnt, + .maxlen = sizeof(contiguous_cnt), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = FS_NON_CONTIGUOUS_CNT, + .procname = "read-non-contiguous-cnt", + .data = &non_contiguous_cnt, + .maxlen = sizeof(non_contiguous_cnt), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = FS_HIT_COUNT, + .procname = "read-hit-count", + .data = &hit_count, + .maxlen = sizeof(hit_count), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, { .ctl_name = 0 } }; diff -urNp linux-2.6.6-rc3/mm/readahead.c linux-2.6.6-rc3.new/mm/readahead.c --- linux-2.6.6-rc3/mm/readahead.c 2004-04-27 18:35:06.000000000 -0700 +++ linux-2.6.6-rc3.new/mm/readahead.c 2004-05-04 18:37:20.681257296 -0700 @@ -316,6 +316,12 @@ int do_page_cache_readahead(struct addre return 0; } +atomic_t hit_reset= ATOMIC_INIT(0); +atomic_t miss_reset= ATOMIC_INIT(0); +atomic_t hit_count= ATOMIC_INIT(0); +atomic_t contiguous_cnt = ATOMIC_INIT(0); +atomic_t non_contiguous_cnt= ATOMIC_INIT(0); + /* * Check how effective readahead is being. If the amount of started IO is * less than expected then the file is partly or fully in pagecache and @@ -331,11 +337,13 @@ check_ra_success(struct file_ra_state *r if (actual == 0) { if (orig_next_size > 1) { ra->next_size = orig_next_size - 1; + atomic_inc(&hit_count); if (ra->ahead_size) ra->ahead_size = ra->next_size; } else { ra->next_size = -1UL; ra->size = 0; + atomic_inc(&hit_reset); } } } @@ -406,17 +414,20 @@ page_cache_readahead(struct address_spac * page beyond the end. Expand the next readahead size. 
*/ ra->next_size += 2; + atomic_inc(&contiguous_cnt); } else { /* * A miss - lseek, pagefault, pread, etc. Shrink the readahead * window. */ ra->next_size -= 2; + atomic_inc(&non_contiguous_cnt); } if ((long)ra->next_size > (long)max) ra->next_size = max; if ((long)ra->next_size <= 0L) { + atomic_inc(&miss_reset); ra->next_size = -1UL; ra->size = 0; goto out; /* Readahead is off */ ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 19:39 ` Ram Pai @ 2004-05-04 19:48 ` Andrew Morton 2004-05-04 19:58 ` Ram Pai 2004-05-04 23:01 ` Alexey Kopytov 1 sibling, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-04 19:48 UTC (permalink / raw) To: Ram Pai; +Cc: nickpiggin, peter, alexeyk, linux-kernel, axboe Ram Pai <linuxram@us.ibm.com> wrote: > > I ran the following command: > > /root/sysbench-0.2.5/sysbench/sysbench --num-threads=256 --test=fileio > --file-total-size=2800M --file-test-mode=rndrw run > Alexey and I have been using 16 threads. You don't tell us how much memory your lab machine has. The above command only makes sense if it is less than 400 megabytes. Otherwise many or all of the reads are satisfied from pagecache. I've been testing with mem=256M, --file-total-size=2G. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 19:48 ` Andrew Morton @ 2004-05-04 19:58 ` Ram Pai 2004-05-04 21:51 ` Ram Pai 0 siblings, 1 reply; 56+ messages in thread From: Ram Pai @ 2004-05-04 19:58 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, peter, alexeyk, linux-kernel, axboe On Tue, 2004-05-04 at 12:48, Andrew Morton wrote: > Ram Pai <linuxram@us.ibm.com> wrote: > > > > I ran the following command: > > > > /root/sysbench-0.2.5/sysbench/sysbench --num-threads=256 --test=fileio > > --file-total-size=2800M --file-test-mode=rndrw run > > > > Alexey and I have been using 16 threads. > > You don't tell us how much memory your lab machine has. It has 8GB but only 4GB is being used. I will try with 256MB and 16 threads. RP ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 19:58 ` Ram Pai @ 2004-05-04 21:51 ` Ram Pai 2004-05-04 22:29 ` Ram Pai 0 siblings, 1 reply; 56+ messages in thread From: Ram Pai @ 2004-05-04 21:51 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, peter, alexeyk, linux-kernel, axboe On Tue, 2004-05-04 at 12:58, Ram Pai wrote: > On Tue, 2004-05-04 at 12:48, Andrew Morton wrote: > > Ram Pai <linuxram@us.ibm.com> wrote: > > > > > > I ran the following command: > > > > > > /root/sysbench-0.2.5/sysbench/sysbench --num-threads=256 --test=fileio > > > --file-total-size=2800M --file-test-mode=rndrw run > > > > > > > Alexey and I have been using 16 threads. > > /root/sysbench-0.2.5/sysbench/sysbench --num-threads=16 --test=fileio --file-total-size=2800M --file-test-mode=rndrw run Without the patch: ------------------ Operations performed: 6002 Read, 3998 Write, 12800 Other = 22800 Total Read 93Mb Written 62Mb Total Transferred 156Mb 1.967Mb/sec Transferred 126.11 Requests/sec executed Test execution Statistics summary: Time spent for test: 79.2986s no of times window reset because of hits: 0 no of times window reset because of misses: 119 no of times window was shrunk because of hits: 417 no of times the page request was non-contiguous: 3809 no of times the page request was contiguous : 12745 With the patch: -------------- Operations performed: 6002 Read, 3999 Write, 12672 Other = 22673 Total Read 93Mb Written 62Mb Total Transferred 156Mb 2.927Mb/sec Transferred 187.65 Requests/sec executed Test execution Statistics summary: Time spent for test: 53.2949s no of times window reset because of hits: 0 no of times window reset because of misses: 0 no of times window was shrunk because of hits: 360 no of times the page request was non-contiguous: 5860 no of times the page request was contiguous : 20378 Impressive results. Would be nice to get a confirmation from Alexey. RP ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 21:51 ` Ram Pai @ 2004-05-04 22:29 ` Ram Pai 0 siblings, 0 replies; 56+ messages in thread From: Ram Pai @ 2004-05-04 22:29 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, peter, alexeyk, linux-kernel, axboe On Tue, 2004-05-04 at 14:51, Ram Pai wrote: memory used is ***256MB***. > /root/sysbench-0.2.5/sysbench/sysbench --num-threads=16 --test=fileio > --file-total-size=2800M --file-test-mode=rndrw run > RP ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 19:39 ` Ram Pai 2004-05-04 19:48 ` Andrew Morton @ 2004-05-04 23:01 ` Alexey Kopytov 2004-05-04 23:20 ` Andrew Morton 1 sibling, 1 reply; 56+ messages in thread From: Alexey Kopytov @ 2004-05-04 23:01 UTC (permalink / raw) To: Ram Pai; +Cc: Andrew Morton, nickpiggin, peter, linux-kernel, axboe Ram Pai wrote: >Without the patch: >------------------ >Time spent for test: 20.6661s > >no of times window reset because of hits: 0 >no of times window reset because of misses: 7 >no of times window was shrunk because of hits: 6716 >no of times the page request was non-contiguous: 5880 >no of times the page request was contiguous : 19639 > >With the patch: >-------------- >Time spent for test: 19.5370s > >no of times window got reset because of hits: 0 >no of times window got reset because of misses: 0 >no of times window was shrunk because of hits: 5844 >no of times the page request was non-contiguous: 5830 >no of times the page request was contiguous : 20232 > >Would be nice if Alexey tries the patch on his machine and sees any >major difference. Here's what I have (same hardware and test setups): Without the patch (but with Ram's patch applied): ------------------ Time spent for test: 125.4429s no of times window reset because of hits: 0 no of times window reset because of misses: 127 no of times window was shrunk because of hits: 1153 no of times the page request was non-contiguous: 3968 no of times the page request was contiguous : 10686 With the patch: --------------- Time spent for test: 86.5459s no of times window reset because of hits: 0 no of times window reset because of misses: 0 no of times window was shrunk because of hits: 1066 no of times the page request was non-contiguous: 5860 no of times the page request was contiguous : 18099 I wonder if there are some plans to further improve 2.6 behavior on this workload to match that of 2.4? 
Is the remaining regression a result of the different readahead handling, or might it be caused by the IDE driver or I/O scheduler tuning? -- Alexey Kopytov, Software Developer MySQL AB, www.mysql.com Are you MySQL certified? www.mysql.com/certification ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 23:01 ` Alexey Kopytov @ 2004-05-04 23:20 ` Andrew Morton 2004-05-05 22:04 ` Alexey Kopytov 0 siblings, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-04 23:20 UTC (permalink / raw) To: Alexey Kopytov; +Cc: linuxram, nickpiggin, peter, linux-kernel, axboe Alexey Kopytov <alexeyk@mysql.com> wrote: > > Without the patch (but with Ram's patch applied): > ------------------ > Time spent for test: 125.4429s > > no of times window reset because of hits: 0 > no of times window reset because of misses: 127 > no of times window was shrunk because of hits: 1153 > no of times the page request was non-contiguous: 3968 > no of times the page request was contiguous : 10686 > > With the patch: > --------------- > Time spent for test: 86.5459s > > no of times window reset because of hits: 0 > no of times window reset because of misses: 0 > no of times window was shrunk because of hits: 1066 > no of times the page request was non-contiguous: 5860 > no of times the page request was contiguous : 18099 > The patch brought my test box to the same speed as 2.4. With the deadline scheduler it was a bit faster than 2.4. I didn't do a lot of testing though. I was using ext2. Please try deadline. > I wonder if there are some plans to further improve 2.6 behavior on this > workload to match that of 2.4? Of course... Tuning work is being done on the anticipatory scheduler which we hope will bring it up to deadline throughput for this sort of workload. > Is the remaing regression a result of the > different readahead handling, or it might be caused by IDE driver or I/O > scheduler tuning? Don't know yet. On 2.6 the test actually does about 5% fewer reads than under 2.4, so the VM page replacement is working a bit better in this case. And 2.6 does about 40% fewer context switches for some reason. So we should be a little bit faster - it's a matter of finding where the additional seeks or idle time are coming from. 
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 23:20 ` Andrew Morton @ 2004-05-05 22:04 ` Alexey Kopytov 2004-05-06 8:43 ` Andrew Morton 0 siblings, 1 reply; 56+ messages in thread From: Alexey Kopytov @ 2004-05-05 22:04 UTC (permalink / raw) To: Andrew Morton; +Cc: linuxram, nickpiggin, peter, linux-kernel, axboe Andrew Morton wrote: >Alexey Kopytov <alexeyk@mysql.com> wrote: >> With the patch: >> --------------- >> Time spent for test: 86.5459s >> >> no of times window reset because of hits: 0 >> no of times window reset because of misses: 0 >> no of times window was shrunk because of hits: 1066 >> no of times the page request was non-contiguous: 5860 >> no of times the page request was contiguous : 18099 > >The patch brought my test box to the same speed as 2.4. With the deadline >scheduler it was a bit faster than 2.4. I didn't do a lot of testing >though. I was using ext2. Please try deadline. > Results with the deadline scheduler on my hardware: Time spent for test: 92.8340s no of times window reset because of hits: 0 no of times window reset because of misses: 0 no of times window was shrunk because of hits: 1108 no of times the page request was non-contiguous: 5860 no of times the page request was contiguous : 18091 I have updated the results on the SysBench home page with 2.6.6-rc3 with the patch applied. -- Alexey Kopytov, Software Developer MySQL AB, www.mysql.com Are you MySQL certified? www.mysql.com/certification ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-05 22:04 ` Alexey Kopytov @ 2004-05-06 8:43 ` Andrew Morton 2004-05-06 18:13 ` Peter Zaitsev 2004-05-10 19:50 ` Ram Pai 0 siblings, 2 replies; 56+ messages in thread From: Andrew Morton @ 2004-05-06 8:43 UTC (permalink / raw) To: Alexey Kopytov; +Cc: linuxram, nickpiggin, peter, linux-kernel, axboe Alexey Kopytov <alexeyk@mysql.com> wrote: > > Results with the deadline scheduler on my hardware: > > Time spent for test: 92.8340s Now we're into unreproducible results, alas. On a 256MB uniprocessor machine: ext3: sysbench --num-threads=16 --test=fileio --file-total-size=2G --file-test-mode=rndrw run 2.6.6-rc3-mm2, deadline: Time spent for test: 66.7536s Time spent for test: 67.9000s 0.04s user 6.41s system 4% cpu 2:14.74 total 2.6.6-rc2-mm2, as: Time spent for test: 66.7576s 0.07s user 6.68s system 5% cpu 2:14.18 total Time spent for test: 66.3216s 0.06s user 6.28s system 4% cpu 2:12.25 total 2.4.27-pre2: Time spent for test: 64.9766s 0.09s user 11.57s system 8% cpu 2:17.43 total Time spent for test: 64.2852s 0.11s user 11.18s system 8% cpu 2:14.63 total so 2.6 is a shade slower. 2.6 has tons less system CPU time, probably due to ext3 improvements. The reason for the difference appears to be the thing which Ram added to readahead which causes it to usually read one page too many. With this exciting patch: --- 25/mm/readahead.c~a 2004-05-06 01:24:26.230330464 -0700 +++ 25-akpm/mm/readahead.c 2004-05-06 01:24:26.234329856 -0700 @@ -475,7 +475,7 @@ do_io: ra->ahead_start = 0; /* Invalidate these */ ra->ahead_size = 0; actual = do_page_cache_readahead(mapping, filp, offset, - ra->size); + ra->size == 5 ? 4 : ra->size); if(!first_access) { /* * do not adjust the readahead window size the first _ I get: Time spent for test: 63.9435s 0.07s user 6.69s system 5% cpu 2:11.02 total which is a good result. Ram, can you take a look at fixing that up please? 
Something clean, not more hacks ;) I'd also be interested in an explanation of what the extra page is for. The little comment in there doesn't really help. One thing I note about this test is that it generates a huge number of inode writes: atime updates from the reads and mtime updates from the writes. Suppressing them doesn't actually make a lot of performance difference, but that is with writeback caching enabled. I expect that with a writethrough cache these will really hurt. The test uses 128 files, which seems excessive. I assume that four or eight files is a more likely real-life setup, and in this case the atime/mtime update volume will be proportionately less. Alexey, I do not know why you're seeing such a disparity. I assume that IDE DMA is enabled - the difference seems too small for that to be an explanation, but please check it. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-06 8:43 ` Andrew Morton @ 2004-05-06 18:13 ` Peter Zaitsev 2004-05-06 21:49 ` Andrew Morton 2004-05-10 19:50 ` Ram Pai 1 sibling, 1 reply; 56+ messages in thread From: Peter Zaitsev @ 2004-05-06 18:13 UTC (permalink / raw) To: Andrew Morton; +Cc: Alexey Kopytov, linuxram, nickpiggin, linux-kernel, axboe On Thu, 2004-05-06 at 01:43, Andrew Morton wrote: > > One thing I note about this test is that it generates a huge number of > inode writes. atime updates from the reads and mtime updates from the > writes. Suppressing them doesn't actually make a lot of performance > difference, but that is with writeback caching enabled. I expect that with > a writethrough cache these will really hurt. Perhaps. By the way, is there a way to disable modification-time updates as well? It would make quite good sense for a partition used for database needs - you do not need the last modification time in most cases. > > The test uses 128 files, which seems excessive. I assume that four or > eight files is a more likely real-life setup, and in this case the > atime/mtime update volume will be proportionately less. Actually, both a single file (or very few files) and a large number of files are practical setups. In MySQL 4.1 we have the option to store each Innodb table in its own file, which will mean scattered random IO to many files for OLTP workloads. You might think 128 actively used tables is still too much, but in practice we see even larger numbers - some customers partition data, creating a huge number of tables with the same structure, for example a table per customer. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-06 18:13 ` Peter Zaitsev @ 2004-05-06 21:49 ` Andrew Morton 2004-05-06 23:49 ` Nick Piggin 0 siblings, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-06 21:49 UTC (permalink / raw) To: Peter Zaitsev; +Cc: alexeyk, linuxram, nickpiggin, linux-kernel, axboe Peter Zaitsev <peter@mysql.com> wrote: > > On Thu, 2004-05-06 at 01:43, Andrew Morton wrote: > > > > > One thing I note about this test is that it generates a huge number of > > inode writes. atime updates from the reads and mtime updates from the > > writes. Suppressing them doesn't actually make a lot of performance > > difference, but that is with writeback caching enabled. I expect that with > > a writethrough cache these will really hurt. > > Perhaps. By the way is there a way to disable update time modification > as well ? No, there is not. > It would make quite a good sense for partition used for > Database needs - you do not need last modification time in most cases. First up, one needs to remove the inode_update_time() call from generic_file_aio_write_nolock() and run the tests. If this (and noatime) indeed makes a significant difference (probably on writethrough-caching disks) then yup, we should do something. `nomtime' would be simple enough. But another option would be to arrange for a/m/ctime dirtiness to not cause an inode writeout in fsync(). Instead, only sync the a/m/ctime-dirty inodes via sync, umount and pdflush. That way, the inodes get written every thirty seconds rather than once per second. It's probably not standards-compliant, but shoot me. Who cares if the mtimes come up 30 seconds out of date after a system crash? `nomtime' would be simpler and safer to implement, but not as nice. But we need those numbers first. I'll take a look. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-06 21:49 ` Andrew Morton @ 2004-05-06 23:49 ` Nick Piggin 2004-05-07 1:29 ` Peter Zaitsev 0 siblings, 1 reply; 56+ messages in thread From: Nick Piggin @ 2004-05-06 23:49 UTC (permalink / raw) To: Andrew Morton; +Cc: Peter Zaitsev, alexeyk, linuxram, linux-kernel, axboe Andrew Morton wrote: > Peter Zaitsev <peter@mysql.com> wrote: > >>On Thu, 2004-05-06 at 01:43, Andrew Morton wrote: >> >> >>>One thing I note about this test is that it generates a huge number of >>>inode writes. atime updates from the reads and mtime updates from the >>>writes. Suppressing them doesn't actually make a lot of performance >>>difference, but that is with writeback caching enabled. I expect that with >>>a writethrough cache these will really hurt. >> >>Perhaps. By the way is there a way to disable update time modification >>as well ? > > > No, there is not. > > >>It would make quite a good sense for partition used for >>Database needs - you do not need last modification time in most cases. > > > First up, one needs to remove the inode_update_time() call from > generic_file_aio_write_nolock() and run the tests. If this (and noatime) > indeed makes a significant difference (probably on writethrough-caching > disks) then yup, we should do something. > > `nomtime' would be simple enough. But another option would be to arrange > for a/m/ctime dirtiness to not cause an inode writeout in fsync(). > Instead, only sync the a/m/ctime-dirty inodes via sync, umount and pdflush. > > That way, the inodes get written every thirty seconds rather than once per > second. > > It's probably not standards-compliant, but shoot me. Who cares if the > mtimes come up 30 seconds out of date after a system crash? > > `nomtime' would be simpler and safer to implement, but not as nice. > > But we need those numbers first. I'll take a look. > Can they use fdatasync? Does it do the right thing on Linux? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-06 23:49 ` Nick Piggin @ 2004-05-07 1:29 ` Peter Zaitsev 0 siblings, 0 replies; 56+ messages in thread From: Peter Zaitsev @ 2004-05-07 1:29 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, alexeyk, linuxram, linux-kernel, axboe On Thu, 2004-05-06 at 16:49, Nick Piggin wrote: > > > > `nomtime' would be simpler and safer to implement, but not as nice. > > > > But we need those numbers first. I'll take a look. > > > > Can they use fdatasync? Does it do the right thing on Linux? Nick, You're right. fdatasync is supposed to be the solution in this case, and the test actually supports this mode, as does MySQL :) On the other hand, if you'd rather use O_DSYNC, it does not seem to work, being mapped to O_SYNC. But the thing I'm mostly interested in is O_DIRECT. It seems to be the best solution for many database needs, especially when used together with asynchronous IO. There is, however, no matching option which does not flush metadata. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-06 8:43 ` Andrew Morton 2004-05-06 18:13 ` Peter Zaitsev @ 2004-05-10 19:50 ` Ram Pai 2004-05-10 20:21 ` Andrew Morton 1 sibling, 1 reply; 56+ messages in thread From: Ram Pai @ 2004-05-10 19:50 UTC (permalink / raw) To: Andrew Morton; +Cc: Alexey Kopytov, nickpiggin, peter, linux-kernel, axboe On Thu, 2004-05-06 at 01:43, Andrew Morton wrote: Sorry, I am out for 10 days, hence the late replies. > The reason for the difference appears to be the thing which Ram added to > readahead which causes it to usually read one page too many. With this > exciting patch: > > --- 25/mm/readahead.c~a 2004-05-06 01:24:26.230330464 -0700 > +++ 25-akpm/mm/readahead.c 2004-05-06 01:24:26.234329856 -0700 > @@ -475,7 +475,7 @@ do_io: > ra->ahead_start = 0; /* Invalidate these */ > ra->ahead_size = 0; > actual = do_page_cache_readahead(mapping, filp, offset, > - ra->size); > + ra->size == 5 ? 4 : ra->size); > if(!first_access) { > /* > * do not adjust the readahead window size the first > > _ > > > I get: > > Time spent for test: 63.9435s > 0.07s user 6.69s system 5% cpu 2:11.02 total > > which is a good result. > > Ram, can you take a look at fixing that up please? Something clean, not > more hacks ;) I'd also be interested in an explanation of what the extra > page is for. The little comment in there doesn't really help. The reason for the extra page read is as follows: Consider random 16k read I/Os. Reads are generated 4 pages at a time. The readahead is triggered when the 4th page in the 'current-window' is touched. However, the data which is read in through the 'readahead window' gets thrown away, because the next 16k read I/O will not access anything read in the readahead window. As a result I put in that optimization to handle these wasted readahead pages. The idea is: when we miss the current-window, read one more page than the number of pages accessed in the current-window.
Here is an example scenario of random 16k I/Os and with Andrew's code actual = do_page_cache_readahead(mapping, filp, offset, - ra->size); + ra->size == 5 ? 4 : ra->size); Consider that the application accesses pages {1,2,3,4} {100,101,102,103} {200,201,202,203} Consider that the current-window holds 4 pages, i.e. pages 1,2,3,4. When the application asks for {1,2,3,4} we happily satisfy them through the current-window. However, when the application touches page 4, the lazy-readahead kicks in and brings in pages {5,6,7,8,9,10,11,12}, but now the application wants to access {100,101,102,103}. This waste of effort is probably bearable as long as we don't commit the same mistake in the future. When the application tries to access {100,101,102,103} the code then scraps both the current-window and the readahead-window and reads in a new current-window of size 4, i.e. {100,101,102,103}. However, when the application touches page 103, the lazy-readahead gets triggered and brings in 8 more pages {104,105,106,107,108,109,110,111}, and as always all these pages go wasted. This wastage continues forever. My optimization [I mean hack ;)] was meant to avoid this bad behavior. Instead of reading in 'the number of pages accessed in the current-window', I read in 'one more page than the number of pages accessed in the current-window'. With this optimization the behavior changes as follows: when the application asks for {1,2,3,4} we happily satisfy them through the current-window. However, when the application touches page 4, the lazy-readahead triggers and brings in pages {5,6,7,8,9,10,11,12}, but now the application wants to access {100,101,102,103}. This bad behavior is probably OK since the optimization ensures that we do not commit the same mistake in the future. When the application tries to access {100,101,102,103} the code then scraps both the current window and the readahead window and reads in a new current window of size 4+1, i.e. {100,101,102,103,104}.
However, since the application does not touch page 104, lazy-readahead does not get triggered and we do not waste effort bringing in pages. And this nice behavior continues forever. We may see marginal degradation from this optimization with 16k I/O, but the amount of wastage avoided by this optimization (hack) is great when the random I/O is of larger size. I think it was 4% better performance on a DSS workload with 64k random reads. Do you still think it's a hack? Also, I think with the sysbench workload and Andrew's ra-copy patch, we might be losing some benefits of the optimization, because if two threads simultaneously work with copies of the same ra structure and update it, the optimization effect reflected in one of the ra structures is lost, depending on which ra structure gets copied back last. RP ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-10 19:50 ` Ram Pai @ 2004-05-10 20:21 ` Andrew Morton 2004-05-10 22:39 ` Ram Pai 0 siblings, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-10 20:21 UTC (permalink / raw) To: Ram Pai; +Cc: alexeyk, nickpiggin, peter, linux-kernel, axboe Ram Pai <linuxram@us.ibm.com> wrote: > > > Ram, can you take a look at fixing that up please? Something clean, not > > more hacks ;) I'd also be interested in an explanation of what the extra > > page is for. The little comment in there doesn't really help. > > > The reason for the extra page read is as follows: > > Consider 16k random reads i/os. Reads are generated 4pages at a time. > > the readahead is triggered when the 4th page in the 'current-window' is > touched. Right. We've added two whole unsigned longs to the file_struct to track the access patterns. That should be sufficient for us to detect when the access pattern is random, and to then not perform readahead due to a current-window miss *at all*. So that extra page can go away, and: --- 25/mm/readahead.c~a Mon May 10 13:16:59 2004 +++ 25-akpm/mm/readahead.c Mon May 10 13:17:22 2004 @@ -492,21 +492,17 @@ do_io: */ if (ra->ahead_start == 0) { /* - * if the average io-size is less than maximum + * If the average io-size is less than maximum * readahead size of the file the io pattern is * sequential. Hence bring in the readahead window * immediately. - * Else the i/o pattern is random. Bring - * in the readahead window only if the last page of - * the current window is accessed (lazy readahead). 
*/ unsigned long average = ra->average; if (ra->serial_cnt > average) average = (ra->serial_cnt + ra->average) / 2; - if ((average >= max) || (offset == (ra->start + - ra->size - 1))) { + if (average >= max) { ra->ahead_start = ra->start + ra->size; ra->ahead_size = ra->next_size; actual = do_page_cache_readahead(mapping, filp, _ That way, we read the correct amount of data, and we only start I/O when we know the application is going to actually use the data. This may cause problems when the application transitions from seeky-access to linear-access. Does it sound feasible? > > Probably we may see marginal degradation of this optimization with 16k > i/o but the amount of wastage avoided by this optimization (hack) > is great when random i/o is of larger size. I think it was 4% better > performance on DSS workload with 64k random reads. 64k sounds unusually large. We need top performance at 8k too. > Do you still think its a hack? yup ;) > Also I think with sysbench workload and Andrew's ra-copy patch, we > might be loosing some benefits of some of the optimization because > if two threads simulteously work with copies of the same ra structure > and update it, the optimization effect reflected in one of the > ra-structure is lost depending on which ra structure gets copied back > last. hm, maybe. That only makes a difference if two threads are accessing the same fd at the same time, and it was really bad before the patch. The IO patterns seemed OK to me with the patch. Except it's reading one page too many. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-10 20:21 ` Andrew Morton @ 2004-05-10 22:39 ` Ram Pai 2004-05-10 23:07 ` Andrew Morton 0 siblings, 1 reply; 56+ messages in thread From: Ram Pai @ 2004-05-10 22:39 UTC (permalink / raw) To: Andrew Morton; +Cc: alexeyk, nickpiggin, peter, linux-kernel, axboe On Mon, 2004-05-10 at 13:21, Andrew Morton wrote: > Ram Pai <linuxram@us.ibm.com> wrote: > > > > > Ram, can you take a look at fixing that up please? Something clean, not > > > more hacks ;) I'd also be interested in an explanation of what the extra > > > page is for. The little comment in there doesn't really help. > > > > > > The reason for the extra page read is as follows: > > > > Consider 16k random reads i/os. Reads are generated 4pages at a time. > > > > the readahead is triggered when the 4th page in the 'current-window' is > > touched. > > Right. We've added two whole unsigned longs to the file_struct to track > the access patterns. That should be sufficient for us to detect when the > access pattern is random, and to then not perform readahead due to a > current-window miss *at all*. > > So that extra page can go away, and: > > --- 25/mm/readahead.c~a Mon May 10 13:16:59 2004 > +++ 25-akpm/mm/readahead.c Mon May 10 13:17:22 2004 > @@ -492,21 +492,17 @@ do_io: > */ > if (ra->ahead_start == 0) { > /* > - * if the average io-size is less than maximum > + * If the average io-size is less than maximum > * readahead size of the file the io pattern is > * sequential. Hence bring in the readahead window > * immediately. > - * Else the i/o pattern is random. Bring > - * in the readahead window only if the last page of > - * the current window is accessed (lazy readahead). 
> */ > unsigned long average = ra->average; > > if (ra->serial_cnt > average) > average = (ra->serial_cnt + ra->average) / 2; > > - if ((average >= max) || (offset == (ra->start + > - ra->size - 1))) { > + if (average >= max) { > ra->ahead_start = ra->start + ra->size; > ra->ahead_size = ra->next_size; > actual = do_page_cache_readahead(mapping, filp, > > _ > > > That way, we read the correct amount of data, and we only start I/O when we > know the application is going to actually use the data. > > This may cause problems when the application transitions from seeky-access > to linear-access. > > Does it sound feasible? I am nervous about this change. You are totally getting rid of lazy-readahead and that was the optimization which gave the best possible boost in performance. Let me see how this patch does with a DSS benchmark. > > > > > Probably we may see marginal degradation of this optimization with 16k > > i/o but the amount of wastage avoided by this optimization (hack) > > is great when random i/o is of larger size. I think it was 4% better > > performance on DSS workload with 64k random reads. > > 64k sounds unusually large. We need top performance at 8k too. > > > Do you still think its a hack? > > yup ;) > :-( > > Also I think with sysbench workload and Andrew's ra-copy patch, we > > might be loosing some benefits of some of the optimization because > > if two threads simulteously work with copies of the same ra structure > > and update it, the optimization effect reflected in one of the > > ra-structure is lost depending on which ra structure gets copied back > > last. > > hm, maybe. That only makes a difference if two threads are accessing the > same fd at the same time, and it was really bad before the patch. The IO > patterns seemed OK to me with the patch. Except it's reading one page too > many. In the normal large random workload this extra page would have compensated for all the wasted readaheads.
However in the case of sysbench with Andrew's ra-copy patch the readahead calculation is not happening quite right. Is it worth trying to get a marginal gain with sysbench at the cost of getting a big hit on DSS benchmarks, aio-tests, iozone and probably others? Or am I making an unsubstantiated claim? I will get back with results. RP ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-10 22:39 ` Ram Pai @ 2004-05-10 23:07 ` Andrew Morton 2004-05-11 20:51 ` Ram Pai 2004-05-11 22:26 ` Random file I/O regressions in 2.6 Bill Davidsen 0 siblings, 2 replies; 56+ messages in thread From: Andrew Morton @ 2004-05-10 23:07 UTC (permalink / raw) To: Ram Pai; +Cc: alexeyk, nickpiggin, peter, linux-kernel, axboe Ram Pai <linuxram@us.ibm.com> wrote: > > I am nervous about this change. You are totally getting rid of > lazy-readahead and that was the optimization which gave the best > possible boost in performance. Because it disabled the large readahead outside the area which the app is reading. But it's still reading too much. > Let me see how this patch does with a DSS benchmark. That was not a real patch. More work is surely needed to get that right. > In the normal large random workload this extra page would have > compesated for all the wasted readaheads. I disagree that 64k is "normal"! > However in the case of > sysbench with Andrew's ra-copy patch the readahead calculation is not > happening quiet right. Is it worth trying to get a marginal gain > with sysbench at the cost of getting a big hit on DSS benchmarks, > aio-tests,iozone and probably others. Or am I making an unsubstantiated > claim? I will get back with results. It shouldn't hurt at all - the app does a seek, we perform the correctly-sized read. As I say, my main concern is that we correctly transition from seeky access to linear access and resume readahead. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-10 23:07 ` Andrew Morton @ 2004-05-11 20:51 ` Ram Pai 2004-05-11 21:17 ` Andrew Morton 2004-05-11 22:26 ` Random file I/O regressions in 2.6 Bill Davidsen 1 sibling, 1 reply; 56+ messages in thread From: Ram Pai @ 2004-05-11 20:51 UTC (permalink / raw) To: Andrew Morton; +Cc: alexeyk, nickpiggin, peter, linux-kernel, axboe [-- Attachment #1: Type: text/plain, Size: 1634 bytes --] On Mon, 2004-05-10 at 16:07, Andrew Morton wrote: > Ram Pai <linuxram@us.ibm.com> wrote: > > > > I am nervous about this change. You are totally getting rid of > > lazy-readahead and that was the optimization which gave the best > > possible boost in performance. > > Because it disabled the large readahead outside the area which the app is > reading. But it's still reading too much. > > Let me see how this patch does with a DSS benchmark. > > That was not a real patch. More work is surely needed to get that right. > > > In the normal large random workload this extra page would have > > compesated for all the wasted readaheads. > > I disagree that 64k is "normal"! > > > However in the case of > > sysbench with Andrew's ra-copy patch the readahead calculation is not > > happening quiet right. Is it worth trying to get a marginal gain > > with sysbench at the cost of getting a big hit on DSS benchmarks, > > aio-tests,iozone and probably others. Or am I making an unsubstantiated > > claim? I will get back with results. > > It shouldn't hurt at all - the app does a seek, we perform the > correctly-sized read. Looks like you are right on all counts! I did some modifications to your patch and did a preliminary run with my user-level simulator. With these changes I am able to get rid of that extra page. Also code looks much simpler and adapts well to sequential and random patterns. However I have to run this under some benchmarks and see how it fares. Its a pre-alpha level patch. Can you take a quick look at the changes and see if you like it? 
I am sure you won't consider these changes a hack ;) RP [-- Attachment #2: readahead_trim.patch --] [-- Type: text/x-patch, Size: 3130 bytes --] diff -urNp linux-2.6.6/mm/readahead.c linux-2.6.6.new/mm/readahead.c --- linux-2.6.6/mm/readahead.c 2004-05-09 19:32:00.000000000 -0700 +++ linux-2.6.6.new/mm/readahead.c 2004-05-11 20:26:51.288797696 -0700 @@ -353,7 +353,7 @@ page_cache_readahead(struct address_spac unsigned orig_next_size; unsigned actual; int first_access=0; - unsigned long preoffset=0; + unsigned long average=0; /* * Here we detect the case where the application is performing @@ -394,10 +394,17 @@ page_cache_readahead(struct address_spac if (ra->serial_cnt <= (max * 2)) ra->serial_cnt++; } else { - ra->average = (ra->average + ra->serial_cnt) / 2; + /* to avoid rounding errors, ensure that 'average' + * tends towards the value of ra->serial_cnt. + */ + if(ra->average > ra->serial_cnt) { + average = ra->average - 1; + } else { + average = ra->average + 1; + } + ra->average = (average + ra->serial_cnt) / 2; ra->serial_cnt = 1; } - preoffset = ra->prev_page; ra->prev_page = offset; if (offset >= ra->start && offset <= (ra->start + ra->size)) { @@ -457,18 +464,14 @@ do_io: * ahead window and get some I/O underway for the new * current window. */ - if (!first_access && preoffset >= ra->start && - preoffset < (ra->start + ra->size)) { - /* Heuristic: If 'n' pages were - * accessed in the current window, there - * is a high probability that around 'n' pages - * shall be used in the next current window. - * - * To minimize lazy-readahead triggered - * in the next current window, read in - * an extra page. + if (!first_access) { + /* Heuristic: there is a high probability + * that around ra->average number of + * pages shall be accessed in the next + * current window. */ - ra->next_size = preoffset - ra->start + 2; + ra->next_size = (ra->average > max ? 
+ max : ra->average); } ra->start = offset; ra->size = ra->next_size; @@ -492,21 +495,19 @@ do_io: */ if (ra->ahead_start == 0) { /* - * if the average io-size is less than maximum + * If the average io-size is more than maximum * readahead size of the file the io pattern is * sequential. Hence bring in the readahead window - * immediately. - * Else the i/o pattern is random. Bring - * in the readahead window only if the last page of - * the current window is accessed (lazy readahead). + * immediately. + * If the average io-size is less than maximum + * readahead size of the file the io pattern is + * random. Hence don't bother to readahead. */ - unsigned long average = ra->average; - + average = ra->average; if (ra->serial_cnt > average) - average = (ra->serial_cnt + ra->average) / 2; + average = (ra->serial_cnt + ra->average + 1) / 2; - if ((average >= max) || (offset == (ra->start + - ra->size - 1))) { + if (average > max) { ra->ahead_start = ra->start + ra->size; ra->ahead_size = ra->next_size; actual = do_page_cache_readahead(mapping, filp, ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-11 20:51 ` Ram Pai @ 2004-05-11 21:17 ` Andrew Morton 2004-05-13 20:41 ` Ram Pai 0 siblings, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-11 21:17 UTC (permalink / raw) To: Ram Pai; +Cc: alexeyk, nickpiggin, peter, linux-kernel, axboe Ram Pai <linuxram@us.ibm.com> wrote: > > Looks like you are right on all counts! It's a probabilistic thing. > I did some modifications to your > patch and did a preliminary run with my user-level simulator. With these > changes I am able to get rid of that extra page. Also code looks much > simpler and adapts well to sequential and random patterns. That is good news. > However I have to run this under some benchmarks and see how it fares. > Its a pre-alpha level patch. It is nicer, thanks. I'll add it to -mm and hopefully Meredith and co will include it in regular performance testing. > Can you take a quick look at the changes and see if you like it? I am > sure you won't consider these changes a hack ;) Couple of minor things: > - unsigned long preoffset=0; yay! > + unsigned long average=0; Please add spaces around '='. But I don't think this needs to be initialised at all. > /* > * Here we detect the case where the application is performing > @@ -394,10 +394,17 @@ page_cache_readahead(struct address_spac > if (ra->serial_cnt <= (max * 2)) > ra->serial_cnt++; > } else { > - ra->average = (ra->average + ra->serial_cnt) / 2; > + /* to avoid rounding errors, ensure that 'average' > + * tends towards the value of ra->serial_cnt. > + */ multiline comment layout: /* * To avoid rounding errors, ensure that 'average' tends * towards the value of ra->serial_cnt. */ (I said "minor"). I can't say that I immediately understand what is the issue here with rounding errors? > + if(ra->average > ra->serial_cnt) { space between "if" and "(" > + ra->next_size = (ra->average > max ? > + max : ra->average); min(max, ra->average) ? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-11 21:17 ` Andrew Morton @ 2004-05-13 20:41 ` Ram Pai 2004-05-17 17:30 ` Random file I/O regressions in 2.6 [patch+results] Ram Pai 0 siblings, 1 reply; 56+ messages in thread From: Ram Pai @ 2004-05-13 20:41 UTC (permalink / raw) To: Andrew Morton; +Cc: alexeyk, nickpiggin, peter, linux-kernel, axboe On Tue, 2004-05-11 at 14:17, Andrew Morton wrote: > Ram Pai <linuxram@us.ibm.com> wrote: I am yet to get my machine fully set up to run a DSS benchmark. But thought I will update you on the following comment. > > multiline comment layout: > > /* > * To avoid rounding errors, ensure that 'average' tends > * towards the value of ra->serial_cnt. > */ > > (I said "minor"). > > I can't say that I immediately understand what is the issue here with > rounding errors? Say the i/o size is 20 pages. Our algorithm starts with an initial average i/o size of 'ra_pages/2', which is typically, say, 16. Now every time we take an average, the 'average' progresses as follows: (16+20)/2=18 (18+20)/2=19 (19+20)/2=19 (19+20)/2=19..... and the rounding error makes it never touch 20. However the code can be further optimized to: /* * to avoid rounding errors, ensure that 'average' * tends towards the value of ra->serial_cnt. */ if (ra->average < ra->serial_cnt) { average = ra->average + 1; } I will send an updated patch with all your comments incorporated as soon as I see good benchmark numbers (probably by tomorrow). RP > > > > + if(ra->average > ra->serial_cnt) { > > space between "if" and "(" > > > + ra->next_size = (ra->average > max ? > > + max : ra->average); > > min(max, ra->average) ? > > > ^ permalink raw reply [flat|nested] 56+ messages in thread
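Ram's arithmetic above is easy to reproduce. The following user-space sketch (illustrative Python, not the kernel code; the function name is ours) shows the truncating running average stalling one page short of the true I/O size, and the one-step nudge letting it converge:

```python
# Demonstrates the rounding error Ram describes: with integer division,
# a running average of the form  average = (average + sample) / 2
# started below the true value never reaches it, because every step
# truncates. Nudging 'average' one step toward the sample first fixes it.

def running_average(samples, start, nudge=False):
    average = start
    for s in samples:
        if nudge and average < s:
            average += 1          # the proposed fix
        average = (average + s) // 2  # integer division, as in the kernel
    return average

# An application reads 20 pages at a time; the initial guess is 16.
print(running_average([20] * 10, 16))              # stalls at 19
print(running_average([20] * 10, 16, nudge=True))  # reaches 20
```

Since `ra->next_size` is later derived from this average, the stalled value is what forced the "read one extra page" workaround in the earlier patch.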
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-13 20:41 ` Ram Pai @ 2004-05-17 17:30 ` Ram Pai 2004-05-20 1:06 ` Alexey Kopytov 0 siblings, 1 reply; 56+ messages in thread From: Ram Pai @ 2004-05-17 17:30 UTC (permalink / raw) To: Andrew Morton; +Cc: alexeyk, nickpiggin, peter, linux-kernel, axboe [-- Attachment #1: Type: text/plain, Size: 651 bytes --] On Thu, 2004-05-13 at 13:41, Ram Pai wrote: > On Tue, 2004-05-11 at 14:17, Andrew Morton wrote: > > Ram Pai <linuxram@us.ibm.com> wrote: > > I am yet to get my machine fully set up to run a DSS benchmark. But > thought I will update you on the following comment. Attached the cleaned-up patch and the performance results of the patch. Overall Observation: 1. Small improvement with iozone with the patch, and overall much better performance than 2.4 2. Small/negligible improvement with DSS workload. 3. Negligible impact with sysbench, but results worse than 2.4 kernels RP [-- Attachment #2: seeky-readahead-speedups.patch --] [-- Type: text/plain, Size: 7487 bytes --] Results of iozone, sysbench and DSS workload with the seeky-readahead-speedups.patch ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Overall Observation: 1. Small improvement with iozone with the patch, and overall much better performance than 2.4 2. Small/negligible improvement with DSS workload. 3. Negligible impact with sysbench, but results worse than 2.4 kernels The cleaned-up patch is included towards the end of this report.
Details:
**********************************************************************
IOZONE run on a nfs mounted filesystem:
  client machine: 2proc, 733MHz, 2GB memory
  server machine: 8proc, 700Mhz, 8GB memory

  ./iozone -c -t1 -s 4096m -r 128k

 ---------------------------------------------------------
 |               | throughput | throughput | throughput  |
 |               | KB/sec     | KB/sec     | KB/sec      |
 |               | 266        | 266+patch  | 2.4.20      |
 ---------------------------------------------------------
 |sequential read|  11697.55  |  11700.98  |  10846.87   |
 |re-read        |  11698.39  |  11691.84  |  10865.39   |
 |reverse read   |  20002.71  |  20099.86  |  10340.34   |
 |stride read    |  13813.01  |  13850.28  |  10193.87   |
 |random read    |  19705.06  |  19978.00  |  10839.57   |
 |random mix     |  28465.68  |  29964.38  |  10779.17   |
 |pread          |  11692.95  |  11697.29  |  10863.56   |
 ---------------------------------------------------------
**************************************************************
SYSBENCH run on machine 2proc, 733MHz, 256MB memory
 ---------------------------------------------------------
 |              | 266         | 266+patch   | 2.4.21      |
 ---------------------------------------------------------
 |time spent    | 79.6253s    | 79.8176s    | 73.2605s    |
 |Mb/sec        | 1.959Mb/sec | 1.954Mb/sec | 2.129Mb/sec |
 |requests/sec  | 125.59      | 125.29      | 136.54      |
 |no of Reads   | 6001        | 6001        | 6008        |
 |no of Writes  | 3999        | 3999        | 3995        |
 ---------------------------------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
266 sysbench output:
Operations performed: 6001 Read, 3999 Write, 12800 Other = 22800 Total
Read 93Mb Written 62Mb Total Transferred 156Mb
1.959Mb/sec Transferred
125.59 Requests/sec executed
Test execution Statistics summary:
    Time spent for test: 79.6253s
Per Request statistics:
    Min: 0.0000s Avg: 0.0467s Max: 0.9802s
Events tracked: 10000
Total time taken by event execution: 467.1493s
Threads fairness: 87.41/94.20 distribution,
88.68/94.45 execution ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 266+patch sysbench output: Operations performed: 6001 Read, 3999 Write, 12800 Other = 22800 Total Read 93Mb Written 62Mb Total Transferred 156Mb 1.954Mb/sec Transferred 125.29 Requests/sec executed Test execution Statistics summary: Time spent for test: 79.8176s Per Request statistics: Min: 0.0000s Avg: 0.0482s Max: 0.8481s Events tracked: 10000 Total time taken by event execution: 481.7572s Threads fairness: 85.27/93.25 distribution, 85.15/94.91 execution ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2.4.21 sysbench output: Operations performed: 6008 Read, 3995 Write, 12800 Other = 22803 Total Read 93Mb Written 62Mb Total Transferred 156Mb 2.129Mb/sec Transferred 136.54 Requests/sec executed Test execution Statistics summary: Time spent for test: 73.2605s Per Request statistics: Min: 0.0000s Avg: 0.0380s Max: 0.3712s Events tracked: 10003 Total time taken by event execution: 380.4081s Threads fairness: 79.04/91.95 distribution, 82.52/92.44 execution ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ************************************************************** DSS WORKLOAD Got 1% improvement with the patch ************************************************************** diff -urNp linux-2.6.6/mm/readahead.c linux-2.6.6.new/mm/readahead.c --- linux-2.6.6/mm/readahead.c 2004-05-11 20:41:28.000000000 -0700 +++ linux-2.6.6.new/mm/readahead.c 2004-05-17 17:33:51.145040472 -0700 @@ -353,7 +353,7 @@ page_cache_readahead(struct address_spac unsigned orig_next_size; unsigned actual; int first_access=0; - unsigned long preoffset=0; + unsigned long average; /* * Here we detect the case where the application is performing @@ -394,10 +394,17 @@ page_cache_readahead(struct address_spac if (ra->serial_cnt <= (max * 2)) ra->serial_cnt++; } else { - ra->average = (ra->average 
+ ra->serial_cnt) / 2; + /* + * to avoid rounding errors, ensure that 'average' + * tends towards the value of ra->serial_cnt. + */ + average = ra->average; + if (average < ra->serial_cnt) { + average++; + } + ra->average = (average + ra->serial_cnt) / 2; ra->serial_cnt = 1; } - preoffset = ra->prev_page; ra->prev_page = offset; if (offset >= ra->start && offset <= (ra->start + ra->size)) { @@ -457,18 +464,13 @@ do_io: * ahead window and get some I/O underway for the new * current window. */ - if (!first_access && preoffset >= ra->start && - preoffset < (ra->start + ra->size)) { - /* Heuristic: If 'n' pages were - * accessed in the current window, there - * is a high probability that around 'n' pages - * shall be used in the next current window. - * - * To minimize lazy-readahead triggered - * in the next current window, read in - * an extra page. + if (!first_access) { + /* Heuristic: there is a high probability + * that around ra->average number of + * pages shall be accessed in the next + * current window. */ - ra->next_size = preoffset - ra->start + 2; + ra->next_size = min(ra->average , (unsigned long)max); } ra->start = offset; ra->size = ra->next_size; @@ -492,21 +494,19 @@ do_io: */ if (ra->ahead_start == 0) { /* - * if the average io-size is less than maximum + * If the average io-size is more than maximum * readahead size of the file the io pattern is * sequential. Hence bring in the readahead window - * immediately. - * Else the i/o pattern is random. Bring - * in the readahead window only if the last page of - * the current window is accessed (lazy readahead). + * immediately. + * If the average io-size is less than maximum + * readahead size of the file the io pattern is + * random. Hence don't bother to readahead. 
*/ - unsigned long average = ra->average; - + average = ra->average; if (ra->serial_cnt > average) - average = (ra->serial_cnt + ra->average) / 2; + average = (ra->serial_cnt + ra->average + 1) / 2; - if ((average >= max) || (offset == (ra->start + - ra->size - 1))) { + if (average > max) { ra->ahead_start = ra->start + ra->size; ra->ahead_size = ra->next_size; actual = do_page_cache_readahead(mapping, filp, ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-17 17:30 ` Random file I/O regressions in 2.6 [patch+results] Ram Pai @ 2004-05-20 1:06 ` Alexey Kopytov 2004-05-20 1:31 ` Ram Pai ` (2 more replies) 0 siblings, 3 replies; 56+ messages in thread From: Alexey Kopytov @ 2004-05-20 1:06 UTC (permalink / raw) To: Ram Pai; +Cc: Andrew Morton, nickpiggin, peter, linux-kernel, axboe Ram Pai wrote: >Attached the cleaned up patch and the performance results of the patch. > >Overall Observation: > 1.Small improvement with iozone with the patch, and overall > much better performance than 2.4 > 2.Small/neglegible improvement with DSS workload. > 3.Negligible impact with sysbench, but results worser than > 2.4 kernels Ram, can you clarify the status of this patch please? I ran the same sysbench test on my hardware with patched 2.6.6 and got 122.2348s execution time, i.e. almost the same results as in the original tests. Is this patch an intermediate step to improve the sysbench workload on 2.6, or does it just address another problem? -- Alexey Kopytov, Software Developer MySQL AB, www.mysql.com Are you MySQL certified? www.mysql.com/certification ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-20 1:06 ` Alexey Kopytov @ 2004-05-20 1:31 ` Ram Pai 2004-05-21 19:32 ` Alexey Kopytov 2004-05-20 5:49 ` Andrew Morton 2004-05-20 21:59 ` Andrew Morton 2 siblings, 1 reply; 56+ messages in thread From: Ram Pai @ 2004-05-20 1:31 UTC (permalink / raw) To: Alexey Kopytov; +Cc: Andrew Morton, nickpiggin, peter, linux-kernel, axboe On Wed, 2004-05-19 at 18:06, Alexey Kopytov wrote: > Ram Pai wrote: > > >Attached the cleaned up patch and the performance results of the patch. > > > >Overall Observation: > > 1.Small improvement with iozone with the patch, and overall > > much better performance than 2.4 > > 2.Small/neglegible improvement with DSS workload. > > 3.Negligible impact with sysbench, but results worser than > > 2.4 kernels > > Ram, can you clarify the status of this patch please? > > I ran the same sysbench test on my hardware with patched 2.6.6 and got > 122.2348s execution time, i.e. almost the same results as in the original > tests. Is this patch an intermediate step to improve the sysbench workload on > 2.6, or it just addresses another problem? this patch by itself does not address your problem. Your problem is better addressed by Andrew's 'readahead-private' patch. However; this patch applied on top of Andrew's 'readahead-private' patch may get you some extra performance. Can you confirm this please? RP ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-20 1:31 ` Ram Pai @ 2004-05-21 19:32 ` Alexey Kopytov 0 siblings, 0 replies; 56+ messages in thread From: Alexey Kopytov @ 2004-05-21 19:32 UTC (permalink / raw) To: Ram Pai; +Cc: Andrew Morton, nickpiggin, peter, linux-kernel, axboe On Thursday 20 May 2004 05:31, Ram Pai wrote: >On Wed, 2004-05-19 18:06, Alexey Kopytov wrote: >> Ram, can you clarify the status of this patch please? >> >> I ran the same sysbench test on my hardware with patched 2.6.6 and got >> 122.2348s execution time, i.e. almost the same results as in the original >> tests. Is this patch an intermediate step to improve the sysbench workload >> on 2.6, or it just addresses another problem? > >this patch by itself does not address your problem. Your problem is >better addressed by Andrew's 'readahead-private' patch. > >However; this patch applied on top of Andrew's 'readahead-private' patch >may get you some extra performance. > >Can you confirm this please? Yes. 2.6.6-rc3 + Andrew's patch: Time spent for test: 86.5459s 2.6.6-bk: Time spent for test: 83.1929s Thanks for clarifying! -- Alexey Kopytov, Software Developer MySQL AB, www.mysql.com Are you MySQL certified? www.mysql.com/certification ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-20 1:06 ` Alexey Kopytov 2004-05-20 1:31 ` Ram Pai @ 2004-05-20 5:49 ` Andrew Morton 2004-05-20 21:59 ` Andrew Morton 2 siblings, 0 replies; 56+ messages in thread From: Andrew Morton @ 2004-05-20 5:49 UTC (permalink / raw) To: Alexey Kopytov; +Cc: linuxram, nickpiggin, peter, linux-kernel, axboe Alexey Kopytov <alexeyk@mysql.com> wrote: > > Ram Pai wrote: > > >Attached the cleaned up patch and the performance results of the patch. > > > >Overall Observation: > > 1.Small improvement with iozone with the patch, and overall > > much better performance than 2.4 > > 2.Small/neglegible improvement with DSS workload. > > 3.Negligible impact with sysbench, but results worser than > > 2.4 kernels > > Ram, can you clarify the status of this patch please? Everything we have is now in Linus's tree. And in 2.6.6-mm4. > I ran the same sysbench test on my hardware with patched 2.6.6 and got > 122.2348s execution time, i.e. almost the same results as in the original > tests. Is this patch an intermediate step to improve the sysbench workload on > 2.6, or it just addresses another problem? The patches in Linus's tree improve sysbench significantly here. 
It's a 256MB 2-way with IDE disks, writeback caching enabled: sysbench --num-threads=16 --test=fileio --file-total-size=2G --file-test-mode=rndrw run 2.4.27-pre2, ext2: Time spent for test: 61.0240s 0.06s user 6.03s system 4% cpu 2:05.95 total Time spent for test: 60.8456s 0.11s user 5.49s system 4% cpu 2:04.94 total 2.6.6, CFQ, ext2: Time spent for test: 85.6614s 0.05s user 5.66s system 3% cpu 2:26.75 total Time spent for test: 85.2090s 0.06s user 5.32s system 3% cpu 2:24.75 total 2.6.6-bk, CFQ, ext2: Time spent for test: 66.7717s 0.04s user 5.54s system 4% cpu 2:06.19 total Time spent for test: 67.5666s 0.04s user 5.10s system 4% cpu 2:06.72 total 2.6.6, as, ext2: Time spent for test: 83.8358s 0.07s user 5.89s system 4% cpu 2:22.92 total Time spent for test: 83.8068s 0.06s user 5.34s system 3% cpu 2:21.33 total 2.6.6-bk, AS, ext2: Time spent for test: 62.5316s 0.05s user 5.27s system 4% cpu 2:01.28 total Time spent for test: 62.7401s 0.04s user 5.17s system 4% cpu 2:00.50 total 2.6.6, deadline, ext2: Time spent for test: 103.0084s 0.06s user 5.76s system 3% cpu 2:40.74 total Time spent for test: 101.9648s 0.07s user 5.35s system 3% cpu 2:38.83 total 2.6.6-bk, deadline, ext2: Time spent for test: 63.3405s 0.03s user 5.49s system 4% cpu 2:01.05 total Time spent for test: 63.5288s 0.03s user 5.05s system 4% cpu 2:00.78 total There's still something wrong here. 2.6.6-bk+deadline is pretty equivalent to 2.4 from an IO scheduler point of view in this test. Yet it's a couple of percent slower. I don't know why you're still seeing significant discrepancies. What sort of disk+controller system are you using? If scsi, what is the tag queue depth set to? Is writeback caching enabled on the disk? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-20 1:06 ` Alexey Kopytov 2004-05-20 1:31 ` Ram Pai 2004-05-20 5:49 ` Andrew Morton @ 2004-05-20 21:59 ` Andrew Morton 2004-05-20 22:23 ` Andrew Morton 2004-05-21 21:13 ` Alexey Kopytov 2 siblings, 2 replies; 56+ messages in thread From: Andrew Morton @ 2004-05-20 21:59 UTC (permalink / raw) To: Alexey Kopytov; +Cc: linuxram, nickpiggin, peter, linux-kernel, axboe (Resend due to osdl<->vger smtp bunfight) Alexey Kopytov <alexeyk@mysql.com> wrote: > > Ram Pai wrote: > > >Attached the cleaned up patch and the performance results of the patch. > > > >Overall Observation: > > 1.Small improvement with iozone with the patch, and overall > > much better performance than 2.4 > > 2.Small/neglegible improvement with DSS workload. > > 3.Negligible impact with sysbench, but results worser than > > 2.4 kernels > > Ram, can you clarify the status of this patch please? Everything we have is now in Linus's tree. And in 2.6.6-mm4. > I ran the same sysbench test on my hardware with patched 2.6.6 and got > 122.2348s execution time, i.e. almost the same results as in the original > tests. Is this patch an intermediate step to improve the sysbench workload on > 2.6, or it just addresses another problem? The patches in Linus's tree improve sysbench significantly here. 
It's a 256MB 2-way with IDE disks, writeback caching enabled: sysbench --num-threads=16 --test=fileio --file-total-size=2G --file-test-mode=rndrw run 2.4.27-pre2, ext2: Time spent for test: 61.0240s 0.06s user 6.03s system 4% cpu 2:05.95 total Time spent for test: 60.8456s 0.11s user 5.49s system 4% cpu 2:04.94 total 2.6.6, CFQ, ext2: Time spent for test: 85.6614s 0.05s user 5.66s system 3% cpu 2:26.75 total Time spent for test: 85.2090s 0.06s user 5.32s system 3% cpu 2:24.75 total 2.6.6-bk, CFQ, ext2: Time spent for test: 66.7717s 0.04s user 5.54s system 4% cpu 2:06.19 total Time spent for test: 67.5666s 0.04s user 5.10s system 4% cpu 2:06.72 total 2.6.6, as, ext2: Time spent for test: 83.8358s 0.07s user 5.89s system 4% cpu 2:22.92 total Time spent for test: 83.8068s 0.06s user 5.34s system 3% cpu 2:21.33 total 2.6.6-bk, AS, ext2: Time spent for test: 62.5316s 0.05s user 5.27s system 4% cpu 2:01.28 total Time spent for test: 62.7401s 0.04s user 5.17s system 4% cpu 2:00.50 total 2.6.6, deadline, ext2: Time spent for test: 103.0084s 0.06s user 5.76s system 3% cpu 2:40.74 total Time spent for test: 101.9648s 0.07s user 5.35s system 3% cpu 2:38.83 total 2.6.6-bk, deadline, ext2: Time spent for test: 63.3405s 0.03s user 5.49s system 4% cpu 2:01.05 total Time spent for test: 63.5288s 0.03s user 5.05s system 4% cpu 2:00.78 total There's still something wrong here. 2.6.6-bk+deadline is pretty equivalent to 2.4 from an IO scheduler point of view in this test. Yet it's a couple of percent slower. I don't know why you're still seeing significant discrepancies. What sort of disk+controller system are you using? If scsi, what is the tag queue depth set to? Is writeback caching enabled on the disk? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-20 21:59 ` Andrew Morton @ 2004-05-20 22:23 ` Andrew Morton 2004-05-21 7:31 ` Nick Piggin 2004-05-21 21:13 ` Alexey Kopytov 1 sibling, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-20 22:23 UTC (permalink / raw) To: alexeyk, linuxram, nickpiggin, peter, linux-kernel, axboe Andrew Morton <akpm@osdl.org> wrote: > > There's still something wrong here. 2.6.6-bk+deadline is pretty equivalent > to 2.4 from an IO scheduler point of view in this test. Yet it's a couple > of percent slower. > > I don't know why you're still seeing significant discrepancies. > > What sort of disk+controller system are you using? If scsi, what is the > tag queue depth set to? Is writeback caching enabled on the disk? If the 2.4 and 2.6 disk accounting statistics are to be believed, they show something interesting. Workload is one run of sysbench --num-threads=16 --test=fileio --file-total-size=2G --file-test-mode=rndrw run on ext2.

2.4.27-pre2:
  rio: 5549 (Read requests issued)
  rblk: 259680 (Total sectors read)
  wio: 42398 (Write requests issued)
  wblk: 4368056 (Total sectors written)

2.6.6-bk, as:
  reads: 5983
  readsectors: 201192
  writes: 22548
  writesectors: 4343184

- Note that 2.6 read 20% less data from the disk. We observed this before. It appears that 2.6 page replacement decisions are working better for this workload.
- Despite that, 2.6 issued *more* read requests. So it is submitting more, and smaller, I/Os.
- Both kernels wrote basically the same amount of data. 2.6 a little less, perhaps because of fsync() optimisations.
- But 2.6 issued far fewer write requests. Half as many as 2.4 - a huge difference. There are a number of reasons why this could happen but frankly, I don't have a clue what's going on in there.

Given that 2.6 is issuing fewer IO requests it should be performing faster than 2.4.
The reason that the two kernels are achieving about the same throughput despite this is that the disk is performing writeback caching and is absorbing 2.4's smaller write requests. I set the IDE disk to do writethrough (hdparm -W0): 2.6.6-bk, as: Time spent for test: 89.9427s 0.04s user 5.24s system 1% cpu 4:51.62 total 2.4.27-pre2: Time spent for test: 107.8293s 0.04s user 6.00s system 1% cpu 7:26.47 total as expected. Open questions are: a) Why is 2.6 write coalescing so superior to 2.4? b) Why is 2.6 issuing more read requests, for less data? c) Why is Alexey seeing dissimilar results? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-20 22:23 ` Andrew Morton @ 2004-05-21 7:31 ` Nick Piggin 2004-05-21 7:50 ` Jens Axboe 0 siblings, 1 reply; 56+ messages in thread From: Nick Piggin @ 2004-05-21 7:31 UTC (permalink / raw) To: Andrew Morton; +Cc: alexeyk, linuxram, peter, linux-kernel, axboe Andrew Morton wrote: > Open questions are: > > a) Why is 2.6 write coalescing so superior to 2.4? > > b) Why is 2.6 issuing more read requests, for less data? > > c) Why is Alexey seeing dissimilar results? > Interesting. I am not too familiar with 2.4's IO scheduler, but 2.6's have pretty comprehensive merging systems. Could that be helping, Jens? Or is 2.4 pretty equivalent? What about things like maximum request size for 2.4 vs 2.6 for example? This is another thing that can have an impact, especially for writes. I'll take a guess at b, and say it could be as-iosched.c. Another thing might be that 2.6 has smaller nr_requests than 2.4, although you are unlikely to hit the read side limit with only 16 threads if they are doing sync IO. As for question c, has Alexey confirmed that it is indeed 2.6-bk which has problems? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-21 7:31 ` Nick Piggin @ 2004-05-21 7:50 ` Jens Axboe 2004-05-21 8:40 ` Nick Piggin 2004-05-21 8:56 ` Spam: " Andrew Morton 0 siblings, 2 replies; 56+ messages in thread From: Jens Axboe @ 2004-05-21 7:50 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, alexeyk, linuxram, peter, linux-kernel On Fri, May 21 2004, Nick Piggin wrote: > Andrew Morton wrote: > > >Open questions are: > > > >a) Why is 2.6 write coalescing so superior to 2.4? > > > >b) Why is 2.6 issuing more read requests, for less data? > > > >c) Why is Alexey seeing dissimilar results? > > > > > Interesting. I am not too familiar with 2.4's IO scheduler, > but 2.6's have pretty comprehensive merging systems. Could > that be helping, Jens? Or is 2.4 pretty equivalent? 2.4 will give up merging faster than 2.6, elevator_linus will stop looking for a merge point if the sequence drops to zero. 2.6 will always merge. So that could explain the fewer writes. > What about things like maximum request size for 2.4 vs 2.6 > for example? This is another thing that can have an impact, > especially for writes. I think that's pretty similar. Andrew didn't say what device he was testing on, but 2.4 ide defaults to max 64k where 2.6 defaults to 128k. > I'll take a guess at b, and say it could be as-iosched.c. > Another thing might be that 2.6 has smaller nr_requests than > 2.4, although you are unlikely to hit the read side limit > with only 16 threads if they are doing sync IO. Andrew, you did numbers for deadline previously as well, but no rq statistics there? As for nr_requests that's true, would be worth a shot to bump available requests in 2.6. -- Jens Axboe ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-21 7:50 ` Jens Axboe @ 2004-05-21 8:40 ` Nick Piggin 2004-05-21 8:56 ` Spam: " Andrew Morton 1 sibling, 0 replies; 56+ messages in thread From: Nick Piggin @ 2004-05-21 8:40 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, alexeyk, linuxram, peter, linux-kernel Jens Axboe wrote: > On Fri, May 21 2004, Nick Piggin wrote: > >>Andrew Morton wrote: >> >> >>>Open questions are: >>> >>>a) Why is 2.6 write coalescing so superior to 2.4? >>> >>>b) Why is 2.6 issuing more read requests, for less data? >>> >>>c) Why is Alexey seeing dissimilar results? >>> >> >> >>Interesting. I am not too familiar with 2.4's IO scheduler, >>but 2.6's have pretty comprehensive merging systems. Could >>that be helping, Jens? Or is 2.4 pretty equivalent? > > > 2.4 will give up merging faster than 2.6, elevator_linus will stop > looking for a merge point if the sequence drops to zero. 2.6 will always > merge. So that could explain the fewer writes. > Yep OK, that could be one thing. > >>What about things like maximum request size for 2.4 vs 2.6 >>for example? This is another thing that can have an impact, >>especially for writes. > > > I think that's pretty similar. Andrew didn't say what device he was > testing on, but 2.4 ide defaults to max 64k where 2.6 defaults to 128k. > This could be another. If Andrew's using IDE, this alone could make up the entire difference *if* writes are nicely sequential. I guess they probably aren't, but it could still help. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Spam: Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-21 7:50 ` Jens Axboe 2004-05-21 8:40 ` Nick Piggin @ 2004-05-21 8:56 ` Andrew Morton 2004-05-21 22:24 ` Alexey Kopytov 1 sibling, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-21 8:56 UTC (permalink / raw) To: Jens Axboe; +Cc: nickpiggin, alexeyk, linuxram, peter, linux-kernel Jens Axboe <axboe@suse.de> wrote: > > I think that's pretty similar. Andrew didn't say what device he was > testing on, but 2.4 ide defaults to max 64k where 2.6 defaults to 128k. IDE. I was being silly, sorry. Those I/O stats include the (huge linear) initial write of the "database" files, so the larger IDE request size will be dominating. What I need is a way of getting sysbench to create and remove the database files in separate invocations, but the syntax for that is defeating me at present. > > I'll take a guess at b, and say it could be as-iosched.c. > > Another thing might be that 2.6 has smaller nr_requests than > > 2.4, although you are unlikely to hit the read side limit > > with only 16 threads if they are doing sync IO. > > Andrew, you did numbers for deadline previously as well, but no rq > statistics there? As for nr_requests that's true, would be worth a shot > to bump available requests in 2.6. Doubling the request queue size makes no difference. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Spam: Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-21 8:56 ` Spam: " Andrew Morton @ 2004-05-21 22:24 ` Alexey Kopytov 0 siblings, 0 replies; 56+ messages in thread From: Alexey Kopytov @ 2004-05-21 22:24 UTC (permalink / raw) To: Andrew Morton; +Cc: Jens Axboe, nickpiggin, linuxram, peter, linux-kernel On Friday 21 May 2004 12:56, Andrew Morton wrote: > >What I need is a way of getting sysbench to create and remove the database >files in separate invocations, but the syntax for that is defeating me at >present. > I have changed the syntax to allow creating/removing test files and test running in separate stages: sysbench --test=fileio --file-total-size=3G prepare sysbench --num-threads=16 --test=fileio --file-total-size=3G --file-test-mode=rndrw run sysbench --test=fileio cleanup The updated version is available from the SysBench page at http://sourceforge.net/projects/sysbench/ -- Alexey Kopytov, Software Developer MySQL AB, www.mysql.com Are you MySQL certified? www.mysql.com/certification ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-20 21:59 ` Andrew Morton 2004-05-20 22:23 ` Andrew Morton @ 2004-05-21 21:13 ` Alexey Kopytov 2004-05-26 4:43 ` Alexey Kopytov 1 sibling, 1 reply; 56+ messages in thread From: Alexey Kopytov @ 2004-05-21 21:13 UTC (permalink / raw) To: Andrew Morton; +Cc: linuxram, nickpiggin, peter, linux-kernel, axboe On Friday 21 May 2004 01:59, Andrew Morton wrote: >The patches in Linus's tree improve sysbench significantly here. It's a >256MB 2-way with IDE disks, writeback caching enabled: > >sysbench --num-threads=16 --test=fileio --file-total-size=2G > --file-test-mode=rndrw run > >2.4.27-pre2, ext2: > > Time spent for test: 61.0240s > 0.06s user 6.03s system 4% cpu 2:05.95 total > Time spent for test: 60.8456s > 0.11s user 5.49s system 4% cpu 2:04.94 total > >2.6.6-bk, AS, ext2: > > Time spent for test: 62.5316s > 0.05s user 5.27s system 4% cpu 2:01.28 total > Time spent for test: 62.7401s > 0.04s user 5.17s system 4% cpu 2:00.50 total I ran the tests with a configuration as close to yours as possible. Here are the results for mem=256M, 2G total file size (ext3): 2.4.25: Time spent for test: 79.4146s 0.20user 16.08system 3:20.29elapsed 8%CPU Time spent for test: 78.9797s 0.11user 15.84system 3:19.76elapsed 7%CPU 2.6.6-bk, AS: Time spent for test: 81.2208s 0.13user 17.97system 3:13.30elapsed 9%CPU Time spent for test: 82.5538s 0.14user 18.00system 3:14.88elapsed 9%CPU This correlates very well with your results. But when I returned to my original configuration (mem=640M, 3G total file size), I got the following: 2.4.25: Time spent for test: 77.5377s 2.6.6-bk, AS: Time spent for test: 83.1929s It seems like the smaller file size just hides the regression, but I have to run some more tests to confirm this. >I don't know why you're still seeing significant discrepancies. > >What sort of disk+controller system are you using? If scsi, what is the >tag queue depth set to? Is writeback caching enabled on the disk? 
It's an IDE disk without TCQ support, with writeback caching enabled. -- Alexey Kopytov, Software Developer MySQL AB, www.mysql.com Are you MySQL certified? www.mysql.com/certification ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 [patch+results] 2004-05-21 21:13 ` Alexey Kopytov @ 2004-05-26 4:43 ` Alexey Kopytov 0 siblings, 0 replies; 56+ messages in thread From: Alexey Kopytov @ 2004-05-26 4:43 UTC (permalink / raw) To: Andrew Morton; +Cc: linuxram, nickpiggin, peter, linux-kernel, axboe On Saturday 22 May 2004 01:13, Alexey Kopytov wrote: >I ran the tests with a configuration as close to yours as possible. Here are >the results for mem=256M, 2G total file size (ext3): > >2.4.25: > Time spent for test: 79.4146s > 0.20user 16.08system 3:20.29elapsed 8%CPU > Time spent for test: 78.9797s > 0.11user 15.84system 3:19.76elapsed 7%CPU > >2.6.6-bk, AS: > Time spent for test: 81.2208s > 0.13user 17.97system 3:13.30elapsed 9%CPU > Time spent for test: 82.5538s > 0.14user 18.00system 3:14.88elapsed 9%CPU > >This correlates very well your results. But when I returned back to my >original configuration (mem=640M, 3G total file size), I got the following: > >2.4.25: > Time spent for test: 77.5377s > >2.6.6-bk, AS: > Time spent for test: 83.1929s > >It seems like the smaller file size just hides the regression, but I have to >run some more tests to ensure this. > The assumption appears to be true. I tried to vary the total file size and got the following results (tests were done on another IDE disk): 2.4.27-pre3: 2 GB: 58.2707s 4 GB: 72.3313s 8 GB: 83.082s 2.6.7-rc1, AS: 2 GB: 60.6792s 4 GB: 82.8023s 8 GB: 99.4398s Varying the number of files while keeping the total file size constant also gives some interesting results: 2.4.27-pre3, 4 GB total file size: 1 file: 71.7288s 128 files: 72.3313s 256 files: 73.9268 2.6.7-rc1, AS, 4 GB total file size: 1 file: 76.443 128 files: 82.8023 256 files: 81.9618 -- Alexey Kopytov, Software Developer MySQL AB, www.mysql.com Are you MySQL certified? www.mysql.com/certification ^ permalink raw reply [flat|nested] 56+ messages in thread
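[Editorial aside, not part of the original message: expressed as ratios, the file-size results above show the gap widening as the working set outgrows the page cache. A tiny sketch of that arithmetic:]

```python
# 2.4.27-pre3 vs 2.6.7-rc1(AS) test times from the runs above,
# keyed by total file size: (2.4 seconds, 2.6 seconds).
times = {
    "2 GB": (58.2707, 60.6792),
    "4 GB": (72.3313, 82.8023),
    "8 GB": (83.082, 99.4398),
}

for size, (t24, t26) in times.items():
    print(f"{size}: 2.6 takes {t26 / t24:.2f}x as long as 2.4")
# The ratio grows from ~1.04x at 2 GB to ~1.20x at 8 GB, i.e. the
# regression scales with how much of the workload misses the page cache.
```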
* Re: Random file I/O regressions in 2.6 2004-05-10 23:07 ` Andrew Morton 2004-05-11 20:51 ` Ram Pai @ 2004-05-11 22:26 ` Bill Davidsen 0 siblings, 0 replies; 56+ messages in thread From: Bill Davidsen @ 2004-05-11 22:26 UTC (permalink / raw) To: linux-kernel Andrew Morton wrote: > Ram Pai <linuxram@us.ibm.com> wrote: > >>I am nervous about this change. You are totally getting rid of >>lazy-readahead and that was the optimization which gave the best >>possible boost in performance. > > > Because it disabled the large readahead outside the area which the app is > reading. But it's still reading too much. > > >>Let me see how this patch does with a DSS benchmark. > > > That was not a real patch. More work is surely needed to get that right. > > >>In the normal large random workload this extra page would have >>compensated for all the wasted readaheads. > > > I disagree that 64k is "normal"! > > >> However in the case of >>sysbench with Andrew's ra-copy patch the readahead calculation is not >>happening quite right. Is it worth trying to get a marginal gain >>with sysbench at the cost of getting a big hit on DSS benchmarks, >>aio-tests, iozone and probably others. Or am I making an unsubstantiated >>claim? I will get back with results. > > > It shouldn't hurt at all - the app does a seek, we perform the > correctly-sized read. > > As I say, my main concern is that we correctly transition from seeky access > to linear access and resume readahead. One real problem is that you are trying to do in the kernel what would be best done in the application and better done in glibc... Because the benefit of readahead varies based on fd rather than device. Consider a program reading data from a file and putting it in a database. The benefit of readahead for the sequentially accessed data file is higher than for seek-read combinations. The library could do readahead based on the bytes read since the last seek on a per-file basis, something the kernel can't. 
This is not to say the kernel work hasn't been a benefit, but note that with all the patches 2.4 still seems to outperform 2.6. And that's a problem since other parts of 2.6 scale so well. I do see that 2.4 seems to outperform 2.6 for usenet news, where you have small reads against a modest database, a few TB or so, and 400-2000 processes doing random reads against the data. Settings and schedulers seem to have only modest effect there. -- -bill davidsen (davidsen@tmr.com) "The secret to procrastination is to put things off until the last possible moment - but no longer" -me ^ permalink raw reply [flat|nested] 56+ messages in thread
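[Editorial aside, not part of the original thread: the per-file-descriptor idea Bill raises has a standard interface today — posix_fadvise(), which lets an application that knows its own access pattern tell the kernel to expect random access (and hence skip readahead) on a single descriptor. A minimal hypothetical sketch, using Python's os.posix_fadvise wrapper as an assumption of convenience; the temp file and sizes are purely illustrative:]

```python
import os
import tempfile

# An application that knows its accesses are random can hint the kernel
# per descriptor instead of relying on the kernel's readahead heuristics.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 65536)  # 64 KB scratch file for illustration
    # POSIX_FADV_RANDOM: expect random access on this fd; on Linux this
    # disables readahead for the descriptor.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    os.lseek(fd, 16384, os.SEEK_SET)  # seek to a "random" offset
    chunk = os.read(fd, 16384)        # a 16 KB read, as in the benchmark
    print(len(chunk))
finally:
    os.close(fd)
    os.unlink(path)
```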
* Re: Random file I/O regressions in 2.6 2004-05-04 0:19 ` Nick Piggin 2004-05-04 0:50 ` Ram Pai @ 2004-05-04 1:15 ` Andrew Morton 2004-05-04 11:39 ` Nick Piggin 1 sibling, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-04 1:15 UTC (permalink / raw) To: Nick Piggin; +Cc: peter, linuxram, alexeyk, linux-kernel, axboe Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Andrew Morton wrote: > > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > >>>That's one of its usage patterns. It's also supposed to detect the > >>>fixed-sized-reads-seeking-all-over-the-place situation. In which case it's > >>>supposed to submit correctly-sized multi-page BIOs. But it's not working > >>>right for this workload. > >>> > >>>A naive solution would be to add special-case code which always does the > >>>fixed-size readahead after a seek. Basically that's > >>> > >>> if (ra->next_size == -1UL) > >>> force_page_cache_readahead(...) > >>> > >> > >>I think a better solution to this case would be to ensure the > >>readahead window is always min(size of read, some large number); > >> > > > > > > That would cause the kernel to perform lots of pointless pagecache lookups > > when the file is already 100% cached. > > > > > That's pretty sad. You need a "preread" or something which > sends the pages back... or uses the actor itself. readahead > would then have to be reworked to only run off the end of > the read window, but that is what it should be doing anyway. Sorry, I do not understand that paragraph at all. All forms of pagecache population need to examine the pagecache to find out if the page is already there. This involves pagecache lookups. We want the read code to "learn" that the requested pages are all coming from cache and to stop doing those lookups altogether. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 1:15 ` Andrew Morton @ 2004-05-04 11:39 ` Nick Piggin 0 siblings, 0 replies; 56+ messages in thread From: Nick Piggin @ 2004-05-04 11:39 UTC (permalink / raw) To: Andrew Morton; +Cc: peter, linuxram, alexeyk, linux-kernel, axboe Andrew Morton wrote: > > > Sorry, I do not understand that paragraph at all. > > All forms of pagecache population need to examine the pagecache to find out > if the page is already there. This involves pagecache lookups. We want > the read code to "learn" that the requested pages are all coming from cache > and to stop doing those lookups altogether. Yeah I think I have an idea of what the basic problems are, but I'd have to understand things better before I know if I am really on the right track. My idea would probably also involve redoing some of the code too, so at the moment I don't think I have time. If your simple fix works though, then that sounds good. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 0:10 ` Andrew Morton 2004-05-04 0:19 ` Nick Piggin @ 2004-05-04 8:27 ` Arjan van de Ven 2004-05-04 8:47 ` Andrew Morton 1 sibling, 1 reply; 56+ messages in thread From: Arjan van de Ven @ 2004-05-04 8:27 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, peter, linuxram, alexeyk, linux-kernel, axboe [-- Attachment #1: Type: text/plain, Size: 305 bytes --] > > That would cause the kernel to perform lots of pointless pagecache lookups > when the file is already 100% cached. well surely the read itself will do those AGAIN anyway, so in the fully cached case this is just warming up the cpu cache ;) (and thus really cheap as nett cost I suspect) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 8:27 ` Arjan van de Ven @ 2004-05-04 8:47 ` Andrew Morton 2004-05-04 8:50 ` Arjan van de Ven 0 siblings, 1 reply; 56+ messages in thread From: Andrew Morton @ 2004-05-04 8:47 UTC (permalink / raw) To: arjanv; +Cc: nickpiggin, peter, linuxram, alexeyk, linux-kernel, axboe Arjan van de Ven <arjanv@redhat.com> wrote: > > > > > > That would cause the kernel to perform lots of pointless pagecache lookups > > when the file is already 100% cached. > > well surely the read itself will do those AGAIN anyway, so in the fully > cached case this is just warming up the cpu cache ;) (and thus really > cheap as nett cost I suspect) Probably true for x86, but the cost is noticeable on ppc64, for example. Anton fixed some things in there shortly after it went in, but it's still apparent on profiles. We could perhaps speed things up a little bit by using gang lookup in both __do_page_cache_readahead() and in do_generic_file_read(). ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Random file I/O regressions in 2.6 2004-05-04 8:47 ` Andrew Morton @ 2004-05-04 8:50 ` Arjan van de Ven 0 siblings, 0 replies; 56+ messages in thread From: Arjan van de Ven @ 2004-05-04 8:50 UTC (permalink / raw) To: Andrew Morton; +Cc: nickpiggin, peter, linuxram, alexeyk, linux-kernel, axboe [-- Attachment #1: Type: text/plain, Size: 1050 bytes --] On Tue, May 04, 2004 at 01:47:29AM -0700, Andrew Morton wrote: > Arjan van de Ven <arjanv@redhat.com> wrote: > > > > > > > > > > That would cause the kernel to perform lots of pointless pagecache lookups > > > when the file is already 100% cached. > > > > well surely the read itself will do those AGAIN anyway, so in the fully > > cached case this is just warming up the cpu cache ;) (and thus really > > cheap as nett cost I suspect) > > Probably true for x86, but the cost is noticeable on ppc64, for example. > Anton fixed some things in there shortly after it went in, but it's still > apparent on profiles. well do the profiles also show that the actual later lookup becomes near free due to a warm cpu cache? > > We could perhaps speed things up a little bit by using gang lookup in both > __do_page_cache_readahead() and in do_generic_file_read(). or go into the readahead path only when the first miss occurs; for the fully cached case you can then avoid the cost while when you're doing IO, well, a few premature cache misses... [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2004-05-26 4:43 UTC | newest] Thread overview: 56+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-05-02 19:57 Random file I/O regressions in 2.6 Alexey Kopytov 2004-05-03 11:14 ` Nick Piggin 2004-05-03 18:08 ` Andrew Morton 2004-05-03 20:22 ` Ram Pai 2004-05-03 20:57 ` Andrew Morton 2004-05-03 21:37 ` Peter Zaitsev 2004-05-03 21:50 ` Ram Pai 2004-05-03 22:01 ` Peter Zaitsev 2004-05-03 21:59 ` Andrew Morton 2004-05-03 22:07 ` Ram Pai 2004-05-03 23:58 ` Nick Piggin 2004-05-04 0:10 ` Andrew Morton 2004-05-04 0:19 ` Nick Piggin 2004-05-04 0:50 ` Ram Pai 2004-05-04 6:29 ` Andrew Morton 2004-05-04 15:03 ` Ram Pai 2004-05-04 19:39 ` Ram Pai 2004-05-04 19:48 ` Andrew Morton 2004-05-04 19:58 ` Ram Pai 2004-05-04 21:51 ` Ram Pai 2004-05-04 22:29 ` Ram Pai 2004-05-04 23:01 ` Alexey Kopytov 2004-05-04 23:20 ` Andrew Morton 2004-05-05 22:04 ` Alexey Kopytov 2004-05-06 8:43 ` Andrew Morton 2004-05-06 18:13 ` Peter Zaitsev 2004-05-06 21:49 ` Andrew Morton 2004-05-06 23:49 ` Nick Piggin 2004-05-07 1:29 ` Peter Zaitsev 2004-05-10 19:50 ` Ram Pai 2004-05-10 20:21 ` Andrew Morton 2004-05-10 22:39 ` Ram Pai 2004-05-10 23:07 ` Andrew Morton 2004-05-11 20:51 ` Ram Pai 2004-05-11 21:17 ` Andrew Morton 2004-05-13 20:41 ` Ram Pai 2004-05-17 17:30 ` Random file I/O regressions in 2.6 [patch+results] Ram Pai 2004-05-20 1:06 ` Alexey Kopytov 2004-05-20 1:31 ` Ram Pai 2004-05-21 19:32 ` Alexey Kopytov 2004-05-20 5:49 ` Andrew Morton 2004-05-20 21:59 ` Andrew Morton 2004-05-20 22:23 ` Andrew Morton 2004-05-21 7:31 ` Nick Piggin 2004-05-21 7:50 ` Jens Axboe 2004-05-21 8:40 ` Nick Piggin 2004-05-21 8:56 ` Spam: " Andrew Morton 2004-05-21 22:24 ` Alexey Kopytov 2004-05-21 21:13 ` Alexey Kopytov 2004-05-26 4:43 ` Alexey Kopytov 2004-05-11 22:26 ` Random file I/O regressions in 2.6 Bill Davidsen 2004-05-04 1:15 ` Andrew Morton 2004-05-04 11:39 ` Nick Piggin 2004-05-04 8:27 ` Arjan van de Ven 2004-05-04 8:47 ` 
Andrew Morton 2004-05-04 8:50 ` Arjan van de Ven