From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id ; Tue, 28 May 2002 05:25:29 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id ; Tue, 28 May 2002 05:25:28 -0400
Received: from ns.virtualhost.dk ([195.184.98.160]:31941 "EHLO virtualhost.dk")
	by vger.kernel.org with ESMTP id ;
	Tue, 28 May 2002 05:25:28 -0400
Date: Tue, 28 May 2002 11:25:03 +0200
From: Jens Axboe 
To: Andrew Morton 
Cc: William Lee Irwin III ,
	Giuliano Pochini ,
	linux-kernel@vger.kernel.org,
	"chen, xiangping" 
Subject: Re: Poor read performance when sequential write presents
Message-ID: <20020528092503.GJ17674@suse.de>
In-Reply-To: <3CED4843.2783B568@zip.com.au>
	<3CEE0758.27110CAD@zip.com.au>
	<20020524094606.GH14918@holomorphy.com>
	<3CEE1035.1E67E1B8@zip.com.au>
	<20020527080632.GC17674@suse.de>
	<3CF1ECD1.A1BB2CF1@zip.com.au>
	<20020527085414.GD17674@suse.de>
	<3CF1FDF8.B775DF44@zip.com.au>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, May 27 2002, Andrew Morton wrote:
> > On Mon, May 27 2002, Andrew Morton wrote:
> > > Jens Axboe wrote:
> > > >
> > > > ...
> > > > > But in 2.5, head-activeness went away and as far as I know, IDE and SCSI are
> > > > > treated the same. Odd.
> > > >
> > > > It didn't really go away, it just gets handled automatically now.
> > > > elv_next_request() marks the request as started, in which case the i/o
> > > > scheduler won't consider it for merging etc. SCSI removes the request
> > > > directly after it has been marked started, while IDE leaves it on the
> > > > queue until it completes. For IDE TCQ, the behaviour is the same as with
> > > > SCSI.
> > >
> > > It won't consider the active request at the head of the queue for
> > > merging (making the request larger). But it _could_ consider the
> > > request when making decisions about insertion (adding a new request
> > > at the head of the queue because it's close-on-disk to the active
> > > one). Does it do that?
> >
> > Only when the front request isn't active is it safe to consider
> > insertion in front of it. 2.5 does that exactly because it knows if the
> > request has been started, while 2.4 has to guess by looking at the
> > head-active flag and the plug status.
> >
> > If the request is started, we will only consider placing in front of the
> > 2nd request, not after the 1st. We could consider in between 1st and 2nd;
> > that should be safe. In fact that should be perfectly safe, just move
> > the barrier and started test down after the insert test. *req is the
> > insert-after point.
>
> Makes sense. I suspect it may even worsen the problem I observed
> with the mpage code. Set the readahead to 256k with `blockdev --setra 512'
> and then run tiobench. The read latencies are massive - one thread
> gets hold of the disk head and hogs it for 30-60 seconds.
>
> The readahead code has a sort of double-window design. The idea is that
> if the disk does 50 megs/sec and your application processes data at
> 49 megs/sec, the application will never block on I/O. At 256k readahead,
> the readahead code will be laying out four BIOs at a time. It's probable
> that the application is actually submitting BIOs for a new readahead
> window before all of the BIOs for the old one are complete. So it's
> performing merging against its own reads.
>
> Given all this, what I would expect to see is for thread "A" to capture
> the disk head for some period of time, until eventually one of thread "B"'s
> requests expires its latency. Then thread "B" gets to hog the disk head.
> That's reasonable behaviour, but the latencies are *enormous*. Almost
> like the latency stuff isn't working. But it sure looks OK.
I'm still waiting for some time to implement some nicer i/o scheduling
algorithms; I'd be sad to see elevator_linus be the default for 2.6. For
now it's just receiving the odd fixes here and there, which do make
small improvements.

> Not super-high priority at this time. I'll play with it some more.
> (Some userspace tunables for the elevator would be nice. Hint. ;))

Agreed :-)

> hmm. Actually the code looks a bit odd:
>
> 	if (elv_linus_sequence(__rq)-- <= 0)
> 		break;
> 	if (!(__rq->flags & REQ_CMD))
> 		continue;
> 	if (elv_linus_sequence(__rq) < bio_sectors(bio))
> 		break;
>
> The first decrement is saying that elv_linus_sequence is in units of
> requests, but the comparison (and the later `-= bio_sectors()') seems
> to be saying it's in units of sectors.

Well, it really is in units of sectors in 2.5; the first decrement is a
scan-aging measure.

> I think calculating the latency in terms of requests makes more sense - just
> ignore the actual size of those requests (or weight it down in some manner).
> But I don't immediately see what the above code is up to?

That might make more sense, but again it's not likely to make
elevator_linus too tolerable anyway. You can easily change the
read/write initial sequences to be >> 2 what they are now, and just
account seeks. The end result would be very similar, though :-)

-- 
Jens Axboe
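[The sector-based scan aging discussed above can be illustrated with a
minimal userspace sketch. This is an editorial model only, not the
actual 2.5 elevator code: `struct req`, `scan_depth`, and the field
names are invented for illustration. Each queued request carries a
remaining latency budget in sectors; every scan past it by a new bio
costs one tick, and a request whose budget falls below the incoming
bio's size may no longer be passed.]

```c
/* Sketch of elevator_linus-style latency scanning (hypothetical
 * userspace model, not kernel code). */
#include <assert.h>

struct req {
	int started;	/* already issued to the driver (head-active) */
	int sequence;	/* remaining latency budget, in sectors */
};

/* Scan the queue from the tail toward the head and return how many
 * requests a new bio of 'bio_sectors' sectors may be inserted in
 * front of.  The scan stops at a started request (never reorder
 * around the active head), at a request whose sequence has aged to
 * zero, or at one whose remaining budget is smaller than the bio. */
static int scan_depth(struct req *q, int nr, int bio_sectors)
{
	int depth = 0;

	for (int i = nr - 1; i >= 0; i--) {
		if (q[i].started)		/* never pass a started request */
			break;
		if (q[i].sequence-- <= 0)	/* scan-aging tick */
			break;
		if (q[i].sequence < bio_sectors)/* latency budget exhausted */
			break;
		depth++;
	}
	return depth;
}
```

[In this model a bio can sink past idle requests with plenty of budget
left, but a request that has been scanned many times, or a large bio
against a nearly drained budget, stops the scan, which is the
latency bound the thread is discussing.]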