From: Steven Pratt <slpratt@austin.ibm.com>
To: linuxram@us.ibm.com
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH/RFC] Simplified Readahead
Date: Mon, 27 Sep 2004 10:30:45 -0500
Message-ID: <41583225.4040901@austin.ibm.com>
In-Reply-To: <Pine.LNX.4.44.0409242123110.14902-100000@dyn319181.beaverton.ibm.com>
Ram Pai wrote:
>On Fri, 24 Sep 2004, Ram Pai wrote:
>
>
>>On Thu, 23 Sep 2004, Steven Pratt wrote:
>>
>>
>>>The readahead code has undergone many changes in the 2.6 kernel, and
>>>the current implementation is, in my opinion, obtuse and hard to
>>>maintain. We would like to offer up an alternative, simplified design
>>>which will not only make the code easier to maintain but, as
>>>performance tests have shown, also results in better performance in
>>>many cases.
>>>
>>>We are very interested in having others review and try out the code and
>>>run whatever performance tests they see fit.
>>>
>>>Quick overview of the new design:
>>>
>>>The key point of the new design is to make the readahead code aware
>>>of the size of the I/O request. This change eliminates the need to
>>>treat large random I/O as sequential, and all of the averaging code
>>>that exists just to support that. In addition, the new design ramps
>>>up more quickly and shuts off faster. This, combined with the
>>>request-size awareness, eliminates the so-called "slow read path"
>>>that we try so hard to avoid in the current code. For complete
>>>details on the design of the new readahead logic, please refer to
>>>http://www-124.ibm.com/developerworks/opensource/linuxperf/readahead/read-ahead-design.pdf
>>>
>>>There are a few exception cases which still concern me.
>>>
>>>1. pages already in cache
>>>2. I/O queue congestion.
>>>3. page stealing
>>>
>>>The first of these is a file already residing in page cache. If we
>>>do not code for this case, we will end up doing multiple page lookups
>>>for each page. The current code tries to handle this using the
>>>check_ra_success function, but this code does not work.
>>>check_ra_success will subtract 1 page each time an I/O is completely
>>>contained in page cache; however, on the main path we increment the
>>>window size by 2 for each page in the request (up to max_readahead),
>>>thus negating the reduction. My question is: what is the right
>>>behavior?
>>>
>>>
>>Not exactly true. If you look at the parameters of check_ra_success(),
>>it takes orig_next_size as its third parameter, and orig_next_size is
>>initialized to next_size right at the beginning of the function.
>>
>>So if the pages are already in the page cache, next_size is
>>effectively decremented by 1.
>>
>>
>
>Ah... I misread what you said. Right, the first time next_size is
>decremented, but since those pages will reside in the current window,
>from then on, because the pages are already in the page cache,
>next_size keeps incrementing. Hence check_ra_success() effectively
>becomes useless.
>
>I looked at your code to see how you handled page cache hits, and if
>my reading is correct, you don't handle it at all?
>
>
>
Correct, in the first version I sent I did not handle this. In the
second version I do.
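To make the accounting problem concrete, here is a stripped-down
sketch of the interaction being discussed. The names are modeled on
the 2.6 file_ra_state fields, but this is an illustration, not the
actual kernel code:

/*
 * Sketch only: on the main sequential path every page grows
 * next_size by 2 (capped at max_readahead), while a request found
 * entirely in page cache shrinks it by only 1, so the shrink is
 * always overwhelmed by the growth and readahead never turns off.
 */
struct ra_sketch {
        unsigned long next_size;
};

static void main_path_grow(struct ra_sketch *ra, unsigned long nr_pages,
                           unsigned long max_readahead)
{
        ra->next_size += 2 * nr_pages;  /* +2 per page in the request */
        if (ra->next_size > max_readahead)
                ra->next_size = max_readahead;
}

static void fully_cached_shrink(struct ra_sketch *ra)
{
        if (ra->next_size > 0)
                ra->next_size -= 1;     /* -1 per fully-cached I/O */
}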
>>>Reducing the size of the ahead window doesn't help; you must turn
>>>off readahead to have any effect. Once we believe all pages to be in
>>>page cache, we should just turn off readahead immediately. What is
>>>the trigger point? 4 I/Os in a row? 400? My concern is that the
>>>savings from skipping the double lookup appear to be on the order of
>>>0.5% CPU in my tests, but the penalty for small I/Os in the
>>>sequential read case can be substantial. Currently the new code does
>>>not handle this case, but it could be enhanced to do so.
>>>
>>>
>
>OK, so you don't handle this case either.
>
Right, the code currently ignores queue congestion, but based on Andrew
and Nick's comments I think we will change that slightly.
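(For anyone comparing: the stock 2.6 behavior under discussion is
roughly the following; a simplified paraphrase, not a verbatim copy of
the readahead code:)

/*
 * Simplified paraphrase: when the queue reports read congestion the
 * readahead I/O is skipped entirely, which leaves each page to be
 * read serially on the cache-miss path.
 */
static int sketch_readahead(struct address_space *mapping,
                            struct file *filp, unsigned long offset,
                            unsigned long nr_to_read)
{
        if (bdi_read_congested(mapping->backing_dev_info))
                return 0;       /* no I/O submitted; pages fault in one at a time */
        return do_page_cache_readahead(mapping, filp, offset, nr_to_read);
}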
>>Currently the code does handle this automatically. If next_size is
>>'n', then if the next 'n' window reads are already in the page cache,
>>readahead turns off. The problem with shutting down readahead the
>>first time all the pages are in the cache is that you end up in the
>>slow read path, and it becomes miserable: from then on, only one page
>>gets read at a time, even though the read request is for a larger
>>number of pages. This behavior needs to be fixed.
>>
>>
>>
>>
>>>The second case is queue congestion. The current code does not
>>>submit the I/O if the queue is congested. This will result in each
>>>page being read serially on the cache-miss path. Does submitting 32
>>>4k I/Os instead of 1 128k I/O help queue congestion? Here is one
>>>place where the current code gets really confusing. We reduce the
>>>window by 1 page for queue congestion (treated the same as a page
>>>cache hit), but we leave the information about the current and ahead
>>>windows alone even though we did not issue the I/O to populate them,
>>>so we will never issue that I/O from the readahead code. Eventually
>>>we will start taking cache misses since no one read the pages. That
>>>code decrements the window size by 3, but as in the cache-hit case,
>>>since we are still reading sequentially, we keep incrementing by 2
>>>for each page; net effect: -1 for each page not in cache. Again, the
>>>new code ignores the congestion case and still tries to do
>>>readahead, thus minimizing/optimizing the I/O requests sent to the
>>>device. Is this right? If not, what should we do?
>>>
>>>
>>Yes. The code currently does not differentiate queue congestion from
>>'pages already in the page cache'.
>>
>>I think if the queue is congested and we are trying to populate the
>>current window, we had better wait by calling schedule(); and if we
>>are trying to populate the ahead window, we can return immediately,
>>resetting ahead_size and ahead_start?
>>
>>
>>
>>
>>
>>>The third exception case is page stealing, where the page into which
>>>readahead was done is reclaimed before the data is copied to user
>>>space. This would seem to be a somewhat rare case, only happening
>>>under severe memory pressure, but my tests have shown that it can
>>>occur quite frequently with as few as 4 threads doing 1M readahead,
>>>or 16 threads doing 128k readahead, on a machine with 1GB of memory
>>>of which 950MB is page cache. Here it would seem the right thing to
>>>do is shrink the window size and reduce the chance of page stealing;
>>>this, however, kills I/O performance if done too aggressively.
>>>Again, the current code may not
>>>
>>>
>>I have seen pages not up-to-date in most of my stress tests, but I
>>have not observed page frames being stolen before the pages are
>>accessed.
>>
>>
>>
>>
>>>perform as expected. As in the two previous cases, the -3 is offset
>>>by a +2, so unless more than 2/3 of the pages in a given window are
>>>stolen, the net effect is to ignore the page stealing. The new code
>>>will slowly shrink the window as long as stealing occurs, but will
>>>quickly regrow it once stealing stops.
>>>
>>>
>
>Both excessive readahead thrashing and page-cache hits imply that
>readahead needs to be temporarily halted. How about decrementing a
>counter for every page that is thrashed or hit in the page cache, and
>incrementing the counter for every page miss or un-thrashed page? If
>the counter reaches 0, stop readahead. If the counter reaches
>max-readahead, resume readahead. Something along these lines...
>
>
>
What I do now for page cache hits is count how many pages in a row are
found in the page cache when trying to do readahead. If any page is
missing, I reset the count to 0 and start over, since we want to avoid
fragmented/small I/Os. The limit on how many pages in a row disables
readahead is arbitrarily set at 2560 (10MB). I don't know if this is a
good number or not.
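In pseudo-C, the cutoff logic is roughly this (the constant and field
names are made up for the example, not taken from the patch):

#define RA_CACHE_HIT_LIMIT      2560    /* 2560 4k pages == 10MB */

struct ra_hit_sketch {
        unsigned long consec_cache_hits;
        int ra_disabled;
};

static void account_lookup(struct ra_hit_sketch *ra, int page_was_cached)
{
        if (!page_was_cached) {
                ra->consec_cache_hits = 0;      /* any miss restarts the count */
                return;
        }
        if (++ra->consec_cache_hits >= RA_CACHE_HIT_LIMIT)
                ra->ra_disabled = 1;    /* file looks fully cached; stop readahead */
}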
>To summarize you noticed 3 problems:
>
>1. page cache hits not handled properly.
>2. readahead thrashing not accounted.
>3. read congestion not accounted.
>
>
Yes.
>Currently both the patches do not handle all the above cases.
>
>
No, thrashing was handled in the first patch, and both thrashing and
page cache hits are handled in the second. Also, it seems to be the
consensus that ignoring queue congestion on normal I/O is the right
behavior.
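To be concrete about "shrink slowly, regrow quickly" for thrashing,
the policy is along these lines (names and step sizes invented for
illustration; see the patch for the real values):

static void window_on_page_stolen(unsigned long *ra_pages,
                                  unsigned long min_pages)
{
        if (*ra_pages > min_pages)
                (*ra_pages)--;          /* shrink slowly while stealing persists */
}

static void window_on_clean_read(unsigned long *ra_pages,
                                 unsigned long max_pages)
{
        *ra_pages *= 2;                 /* regrow quickly once stealing stops */
        if (*ra_pages > max_pages)
                *ra_pages = max_pages;
}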
>So if your patch performs much better than the current one, then it
>is the winner anyway. But past experience has shown that some
>benchmark takes a hit for any small change, and this happens to be
>too big a change.
>
I agree, we need more people to test this.
>
>The code looks familiar ;)
>RP
>
>
Steve