public inbox for linux-kernel@vger.kernel.org
From: Andrew Morton <akpm@zip.com.au>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Rik van Riel <riel@conectiva.com.br>,
	Marcelo Tosatti <marcelo@conectiva.com.br>,
	lkml <linux-kernel@vger.kernel.org>
Subject: Re: 2.4.16 & OOM killer screw up (fwd)
Date: Wed, 12 Dec 2001 01:59:38 -0800	[thread overview]
Message-ID: <3C172A8A.3760C553@zip.com.au> (raw)
In-Reply-To: <20011212102141.Q4801@athlon.random>

Andrea Arcangeli wrote:
> 
> ...
> > The swapstorm I agree is uninteresting.  The slowdown with a heavy write
> > load impacts a very common usage, and I've told you how to mostly fix
> > it.  You need to back out the change to bdflush.
> 
> I guess I should drop the run_task_queue(&tq_disk) instead of replacing
> it back with a wait_for_some_buffers().

hum.  Nope, it definitely wants the wait_for_locked_buffers() in there.
36 seconds versus 25.  (21 on stock kernel)

My theory is that balance_dirty() is directing heaps of wakeups
to bdflush, so bdflush just keeps on running.  I'll take a look
tomorrow.

(If we're sending that many wakeups, we should do a waitqueue_active
test in wakeup_bdflush...)

> ...
>
> Note that the first elevator (not elevator_linus) could handle this
> case, but it was too complicated and I've been told it was hurting
> the performance of things like dbench etc. too much. It would have
> let your test number 2 take only a few seconds, for example. Quite
> frankly, all my benchmarks were latency oriented, and I couldn't
> notice a huge drop in performance, but OTOH at that time my test box
> had a 10 MB/sec HD, and I know from experience that on such an HD the
> numbers tend to be very different than on fast SCSI (my current test
> HD is 33 MB/sec IDE), so I think they were right.

OK, well I think I'll make it so the feature defaults to "off" - no
change in behaviour.  People need to run `elvtune -b non-zero-value'
to turn it on.

So what is then needed is testing to determine the latency-versus-throughput
tradeoff.  Andries takes manpage patches :)

> ...
> > - Your attempt to address read latencies didn't work out, and should
> >   be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))
> 
> It should not be dropped. And it's not a hack; I only enabled the
> code that was basically disabled due to the huge numbers. It will
> work as in 2.2.20.

Sorry, I was referring to the elevator-bypass patch.  Jens called
it a hack ;)

> Now what you want to add is a hack that moves reads to the top of the
> request_queue, and if you go back to 2.3.5x you'll see I was doing
> this; it's the first thing I did while playing with the elevator. And
> latency-wise it was working great. I'm sure somebody remembers the
> kind of latency you could get with such an elevator.
> 
> Then I got flames from Linus and Ingo claiming that I screwed up the
> elevator and that I was the source of the bad 2.3.x I/O performance,
> so they demanded that the elevator be nearly rewritten in a way that
> obviously couldn't hurt the benchmarks. So Jens dropped part of my
> latency-capable elevator and did elevator_linus, which of course
> cannot hurt benchmark performance, but which has the usual problem:
> you need to wait a minute for an xterm to be started under a write
> flood.
> 
> However, my objective was to avoid nearly infinite starvation, and
> elevator_linus avoids it (you can start the xterm in a minute;
> previously, in early 2.3 and 2.2, you'd have to wait for the disk to
> be full, which could take days with a terabyte of data). So I was
> pretty much fine with elevator_linus too, but we knew very well that
> reads would again be starved significantly (even if not indefinitely).
> 

OK, thanks.

As long as the elevator-bypass tunable gives a good range of
latency-versus-throughput tuning then I'll be happy.  It's a
bit sad that even in the best case, reads are penalised by a
factor of ten when there are writes happening.

But fixing that would require major readahead surgery, and perhaps
implementation of anticipatory scheduling, as described in
http://www.cse.ucsc.edu/~sbrandt/290S/anticipatoryscheduling.pdf
which is out of scope.

-

