From: Jakob Oestergaard <jakob@unthought.net>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: Arjan van de Ven <arjan@infradead.org>,
	"Phetteplace, Thad (GE Healthcare, consultant)" <Thad.Phetteplace@ge.com>,
	linux-kernel@vger.kernel.org
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler
Date: Wed, 18 Oct 2006 15:35:30 +0200
Message-ID: <20061018133530.GY23492@unthought.net>
In-Reply-To: <20061018124253.GH24452@kernel.dk>

On Wed, Oct 18, 2006 at 02:42:53PM +0200, Jens Axboe wrote:
...
> > impossible.
> 
> But you can say you want to give the db 90% of the disk bandwidth, and
> at least 50%. The iops/sec metric doesn't help you.

I think we're misunderstanding each other...

What I am trying to say is that being able to specify "90% of the disk
bandwidth" does not help me.

Because the DB would probably be happy with just 1% of the 100MiB/sec
theoretical bandwidth I could get from sequentially reading the disk -
but if it needs to do, say, 160 seeks per second to get that 1% of
100MiB/sec, then with a 6ms seek time the seeks alone take
160 x 6ms = 960ms of every second. That is 96% of the available disk
time before a single byte has been transferred.
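
To make that arithmetic concrete, here is a minimal sketch in Python.
It uses only the figures from the paragraph above (6ms seek, 100MiB/sec
sequential bandwidth, 160 seeks/sec, 1% of the bandwidth). The model
itself (disk time = seek time + transfer time, with nothing overlapping)
and all the names are assumptions made for illustration, not anything
the kernel computes.

```python
# Illustrative model only; constants are the figures from the mail.
SEEK_TIME_S = 0.006              # 6 ms per seek
SEQ_BANDWIDTH = 100 * 2**20      # 100 MiB/sec sequential bandwidth


def disk_time_fraction(seeks_per_sec, bytes_per_sec):
    """Fraction of each second the disk is busy for one consumer.

    Assumes busy time is simply seek time plus transfer time.
    """
    seek_time = seeks_per_sec * SEEK_TIME_S
    transfer_time = bytes_per_sec / SEQ_BANDWIDTH
    return seek_time + transfer_time


# The database: 1% of the sequential bandwidth, fetched with 160 seeks/sec.
db_bytes_per_sec = 0.01 * SEQ_BANDWIDTH
db_fraction = disk_time_fraction(160, db_bytes_per_sec)

print(f"bandwidth share: 1%, disk time share: {db_fraction:.0%}")
```

Under this model the database uses 1% of the bandwidth but roughly 97%
of the disk time: 96% seeking plus 1% transferring.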

So, I believe we need something that takes into account the general
performance of the disk - not just the single-user-sequential-read/write
bandwidth.  And, as I shall soon argue, this is where I do think the
iops/sec metric does help - I probably just explained it very poorly to
begin with.

> > 
> > Would you want to limit bandwidth on a per-file or per-process basis?
> > You're talking files, above, I was thinking about processes (consumers
> > if you like) the whole time.
> 
> You need to define your workload for the kernel to know what to do. So
> for the bandwidth case, you need to tell the kernel against what file
> you want to allocate that bandwidth. If you go the percentage route, you
> don't need that. The percentage route doesn't care about sequential or
> random io, it just gets you foo % of the disk time. If the slice given
> is large enough, with 10% of the disk time you may have 90% of the total
> bandwidth if the remaining 90% of the time is spent doing random io. But
> you still have 10% of the time allocated.

I like the time allocation for several reasons:
1) It's presumably simple to implement
2) It will suit both your mp3 player and my database reasonably well
3) It's intuitive to the user - you can understand wall-clock time a lot
   more easily than all the little things that influence whether or not
   you get a number of bytes written in a number of places on the disk
   in more or less than the time you had available...

I think "reasonably well" is good enough for a kernel that isn't
hard-real-time anyway  :)
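
The quoted point that 10% of the disk time can amount to 90% of the
total bandwidth can be checked with a small sketch. The 6ms seek time
and 100MiB/sec figure are the ones used earlier in this mail; the 4KiB
random read size and the two-consumer setup are assumptions picked for
illustration only.

```python
# Illustrative model only: one sequential consumer, one random consumer.
SEEK_TIME_S = 0.006              # 6 ms per seek
SEQ_BANDWIDTH = 100 * 2**20      # 100 MiB/sec sequential bandwidth
RANDOM_OP_BYTES = 4 * 1024       # assumed size of each random read


def bandwidth_shares(seq_time_share):
    """Return (sequential, random) shares of the total bytes moved.

    The sequential consumer gets seq_time_share of the disk time;
    the random consumer gets the rest and pays one seek per read.
    """
    seq_bytes = seq_time_share * SEQ_BANDWIDTH
    op_time = SEEK_TIME_S + RANDOM_OP_BYTES / SEQ_BANDWIDTH
    rand_bytes = (1.0 - seq_time_share) / op_time * RANDOM_OP_BYTES
    total = seq_bytes + rand_bytes
    return seq_bytes / total, rand_bytes / total


seq_share, rand_share = bandwidth_shares(0.10)
print(f"sequential: 10% of time, {seq_share:.0%} of bandwidth")
print(f"random:     90% of time, {rand_share:.0%} of bandwidth")
```

With these assumed numbers the sequential consumer ends up with well
over 90% of the bytes while holding only 10% of the time, which is why
a time share and a bandwidth share are not interchangeable.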


...
[snip - good arguments, response will follow]
...

> > > with a magic iops/sec metric that is both
> > > hard to understand and impossible to quantify.
> > 
> > iops/sec is what you get from your disks. In real world scenarios. It's
> > no more magic than the real world, and no harder to understand than real
> > world disks. Although I admit real-world disks can be a bitch at times ;)
> 
> Again, iops/sec doesn't make sense unless you say how big the iops is

1 OSIOP (oestergaard standard input/output operation) is hereby defined
to be:
  1 optional seek
plus
  1 (read or write) of no more than 256 KiB  (*)

(*): The size limit should be adjusted every 10 years as disk technology
     evolves.

There you have it  :)

So, a single 1MiB read on a disk is 4 OSIOPs, for example.
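
As a sketch, the definition above translates into a ceiling division.
The function name and the handling of non-positive sizes are my own
choices for illustration; only the 256 KiB limit comes from the
definition.

```python
OSIOP_MAX_BYTES = 256 * 1024     # size limit from the OSIOP definition


def osiops_for(size_bytes):
    """Number of OSIOPs needed for one read or write of size_bytes."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    # Ceiling division: a partial chunk still costs a whole OSIOP.
    return -(-size_bytes // OSIOP_MAX_BYTES)


one_mib = osiops_for(1024 * 1024)
print(one_mib)
```

A 4 KiB read counts as 1 OSIOP, and a read of 256 KiB plus one byte
counts as 2.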

> and what your stream of iops look like. That's why I say it's a
> benchmark metric.

I state that the total OSIOPs/second you can get out of a given disk
will not change by much, no matter which disk operations you perform and
how you mix them.

That was the whole point of using OSIOPs/sec rather than bandwidth to
begin with.

I know I did not properly define the iop to begin with - my bad, sorry.

> 
> > My argument is that it is simpler to understand than bandwidth.
> 
> And mine is that that is nonsense :-)

Still?  :)

I hope the above clears up some of the misunderstandings.

...
...
> > The total iops/sec "available" from a given disk will not vary a lot,
> > compared to how the total bandwidth available from a given disk will
> > vary.
> 
> That's only true if you scale your iops. And how are you going to give
> that number? You need to define what an iop is for it to be meaningful.

Done :)

A basic OSIOP is useful for the application, because it maps very
closely to the read/write/seek API that applications are built over.
Thus, the application will know very well how many OSIOPs it needs in
order to complete a given job.

The total number of OSIOPs/sec available in the system, however, will
vary depending on the characteristics of the disk subsystem.  Just like
available cycles/sec vary with the speed of your processor.

You are correct in that the total number of OSIOPs/sec will not be
strictly constant over time - it will depend *somewhat* on the nature of
the operations performed. But it will not change completely - or at
least this is what I claim  :)
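
A rough way to test that claim on paper: take the same simple disk
model (6ms seek, 100MiB/sec transfer rate, figures from earlier in this
mail) and compare how much OSIOPs/sec and bytes/sec each vary across a
few workloads. The three workloads and all names below are assumptions
chosen for illustration, and a real disk will not follow this model
exactly.

```python
# Illustrative model only: time per OSIOP = optional seek + transfer.
SEEK_TIME_S = 0.006              # 6 ms per seek
SEQ_BANDWIDTH = 100 * 2**20      # 100 MiB/sec transfer rate
OSIOP_MAX_BYTES = 256 * 1024     # size limit from the OSIOP definition


def rates(op_bytes, seek):
    """Return (OSIOPs/sec, bytes/sec) for a stream of identical OSIOPs."""
    op_time = (SEEK_TIME_S if seek else 0.0) + op_bytes / SEQ_BANDWIDTH
    return 1.0 / op_time, op_bytes / op_time


workloads = {
    "random 4 KiB": rates(4 * 1024, seek=True),
    "random 256 KiB": rates(OSIOP_MAX_BYTES, seek=True),
    "sequential 256 KiB": rates(OSIOP_MAX_BYTES, seek=False),
}

for name, (ops, bw) in workloads.items():
    print(f"{name:20s} {ops:7.1f} OSIOPs/sec {bw / 2**20:7.2f} MiB/sec")

# Spread = best case divided by worst case across the workloads.
ops_values = [ops for ops, _ in workloads.values()]
bw_values = [bw for _, bw in workloads.values()]
ops_spread = max(ops_values) / min(ops_values)
bw_spread = max(bw_values) / min(bw_values)
print(f"OSIOPs/sec spread: {ops_spread:.1f}x, bandwidth spread: {bw_spread:.0f}x")
```

In this model OSIOPs/sec varies by a factor of a few between the
workloads, while bandwidth varies by a factor of more than a hundred.
That is the sense in which the metric is "not strictly constant" but
much more stable than bandwidth.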

...
> > With more than 1 client, you get seeks, and then bandwidth is no longer
> > a sensible measure.
> 
> And neither is iops/sec.

We agree that neither is "correct".

I still claim that one of them (OSIOPs/sec) is "not strictly correct
but probably close enough to be useful".

> But things don't deteriorate that quickly, if
> you can tolerate higher latency, it's quite possible to have most of the
> potential bandwidth available for > 1 client workloads.

True.

I do wonder, though, how often that would be practically useful. Seek
times are *huge* (milliseconds) compared to almost anything else we work
with.


-- 

 / jakob


