From: Jakob Oestergaard <jakob@unthought.net>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: Arjan van de Ven <arjan@infradead.org>,
	"Phetteplace, Thad (GE Healthcare,
	consultant)"  <Thad.Phetteplace@ge.com>,
	linux-kernel@vger.kernel.org
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler
Date: Wed, 18 Oct 2006 15:35:30 +0200
Message-ID: <20061018133530.GY23492@unthought.net>
In-Reply-To: <20061018124253.GH24452@kernel.dk>

On Wed, Oct 18, 2006 at 02:42:53PM +0200, Jens Axboe wrote:
...
> > impossible.
> 
> But you can say you want to give the db 90% of the disk bandwidth, and
> at least 50%. The iops/sec metric doesn't help you.

I think we're misunderstanding each other...

I am trying to say that being able to specify "90% of the disk
bandwidth" does not help me.

The DB would probably be happy with just 1% of the 100MiB/sec
theoretical bandwidth I could get from sequentially reading the disk -
but if it needs to do, say, 160 seeks per second to get that 1% of
100MiB/sec, then with a 6ms seek time the seeks alone take 160 * 6ms =
960ms of every second. That is 96% of the available disk time, before
any data is transferred.
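
As a rough sanity check of those numbers, here is a small sketch. The
figures are the ones from the example above; the function and variable
names are only for illustration, and the model ignores rotational
latency, caching and request merging.

```python
# Back-of-the-envelope: how much disk time do 160 seeks/sec consume?
SEEK_TIME_S = 0.006          # 6 ms per seek (from the example above)
SEEKS_PER_SEC = 160          # seek rate the DB needs (from the example)
SEQ_BANDWIDTH_MIB_S = 100.0  # theoretical sequential bandwidth
DB_SHARE = 0.01              # the DB wants 1% of that bandwidth


def seek_time_fraction(seeks_per_sec, seek_time_s):
    """Fraction of each wall-clock second spent seeking."""
    return seeks_per_sec * seek_time_s


def transfer_time_fraction(mib_per_sec, seq_bandwidth_mib_s):
    """Fraction of each second spent actually moving data."""
    return mib_per_sec / seq_bandwidth_mib_s


seek_frac = seek_time_fraction(SEEKS_PER_SEC, SEEK_TIME_S)
transfer_frac = transfer_time_fraction(DB_SHARE * SEQ_BANDWIDTH_MIB_S,
                                       SEQ_BANDWIDTH_MIB_S)

print(f"seeking: {seek_frac:.0%}, transferring: {transfer_frac:.0%}")
```

So a consumer asking for 1% of the bandwidth can keep the disk busy
nearly all of the time.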

So, I believe we need something that takes into account the general
performance of the disk - not just the single-user-sequential-read/write
bandwidth.  And, as I shall soon argue, this is where I do think the
iops/sec metric does help - I probably just explained it very poorly to
begin with.

> > 
> > Would you want to limit bandwidth on a per-file or per-process basis?
> > You're talking files, above, I was thinking about processes (consumers
> > if you like) the whole time.
> 
> You need to define your workload for the kernel to know what to do. So
> for the bandwidth case, you need to tell the kernel against what file
> you want to allocate that bandwidth. If you go the percentage route, you
> don't need that. The percentage route doesn't care about sequential or
> random io, it just gets you foo % of the disk time. If the slice given
> is large enough, with 10% of the disk time you may have 90% of the total
> bandwidth if the remaining 90% of the time is spent doing random io. But
> you still have 10% of the time allocated.
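
A toy calculation of that 10%/90% scenario, as a sketch only. The 6ms
seek and 100MiB/sec figures are the ones used earlier in this mail; the
4 KiB size for the random operations is purely my assumption, as are
all the names.

```python
# One consumer streams sequentially for a slice of the disk time;
# the rest of the time goes to small random I/O from other consumers.
SEEK_S = 0.006        # 6 ms per seek
SEQ_MIB_S = 100.0     # sequential bandwidth of the disk
RANDOM_OP_KIB = 4     # ASSUMPTION: size of each random operation


def bandwidth_share(seq_time_frac):
    """Share of total bytes moved by the sequential consumer."""
    seq_mib = seq_time_frac * SEQ_MIB_S
    op_mib = RANDOM_OP_KIB / 1024.0
    op_time_s = SEEK_S + op_mib / SEQ_MIB_S       # seek + transfer
    rand_mib = (1.0 - seq_time_frac) * op_mib / op_time_s
    return seq_mib / (seq_mib + rand_mib)


share = bandwidth_share(0.10)
print(f"10% of the disk time -> {share:.0%} of the bandwidth")
```

Under these assumptions the 10% time slice ends up with more than 90%
of the bytes moved, while still only using 10% of the disk time.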

I like the time allocation for several reasons:
1) It's presumably simple to implement
2) It will suit both your mp3 player and my database reasonably well
3) It's intuitive to the user - you can understand wall-clock time a lot
   easier than all the little things that influence whether or not you
   get a number of bytes written in a number of places on the disk in
   more or less than the time you had available...

I think "reasonably well" is good enough for a kernel that isn't
hard-real-time anyway  :)


...
[snip - good arguments, response will follow]
...

> > > with a magic iops/sec metric that is both
> > > hard to understand and impossible to quantify.
> > 
> > iops/sec is what you get from your disks. In real world scenarios. It's
> > no more magic than the real world, and no harder to understand than real
> > world disks. Although I admit real-world disks can be a bitch at times ;)
> 
> Again, iops/sec doesn't make sense unless you say how big the iops is

1 OSIOP (oestergaard standard input/output operation) is hereby defined
to be:
  1 optional seek
plus
  1 (read or write) of no more than 256 KiB  (*)

(*): The size limit should be adjusted every 10 years as disk technology
     evolves.

There you have it  :)

So, a single 1MiB read on a disk is 4 OSIOPs, for example.
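
In code, counting OSIOPs for one contiguous transfer could look like
the sketch below. The helper name is made up, and rounding a partial
chunk up to a whole OSIOP is my reading of "no more than 256 KiB".

```python
KIB = 1024
MIB = 1024 * KIB
OSIOP_MAX_BYTES = 256 * KIB   # size limit from the definition above


def osiops(nbytes):
    """OSIOPs needed for one contiguous read or write of nbytes."""
    if nbytes <= 0:
        raise ValueError("transfer size must be positive")
    # Ceiling division: a partial chunk still costs one operation.
    return -(-nbytes // OSIOP_MAX_BYTES)


print(osiops(1 * MIB))     # 4
print(osiops(4 * KIB))     # 1
print(osiops(257 * KIB))   # 2
```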

> and what your stream of iops look like. That's why I say it's a
> benchmark metric.

I state that the total OSIOPs/second you can get out of a given disk
will not change by much, no matter which disk operations you perform and
how you mix them.

That was the whole point of using OSIOPs/sec rather than bandwidth to
begin with.

I know I did not properly define the iop to begin with - my bad, sorry.

> 
> > My argument is that it is simpler to understand than bandwidth.
> 
> And mine is that that is nonsense :-)

Still?  :)

I hope the above clears up some of the misunderstandings.

...
...
> > The total iops/sec "available" from a given disk will not vary a lot,
> > compared to how the total bandwidth available from a given disk will
> > vary.
> 
> That's only true if you scale your iops. And how are you going to give
> that number? You need to define what an iop is for it to be meaningful.

Done :)

A basic OSIOP is useful for the application, because it maps very
closely to the read/write/seek API that applications are built on.
Thus, the application will know very well how many OSIOPs it needs in
order to complete a given job.

The total number of OSIOPs/sec available in the system, however, will
vary depending on the characteristics of the disk subsystem.  Just like
available cycles/sec vary with the speed of your processor.

You are correct in that the total number of OSIOPs/sec will not be
strictly constant over time - it will depend *somewhat* on the nature of
the operations performed. But it will not change completely - or at
least this is what I claim  :)
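
To illustrate the claim, here is a toy model, not a measurement. It
assumes a flat 6ms per seek, 100MiB/sec transfer speed, and that
sequential streams issue full 256 KiB operations; the three workloads
and all names are made up for the example. A stream of tiny sequential
operations would look much cheaper per OSIOP, so treat this as a first
approximation only.

```python
SEEK_S = 0.006                    # 6 ms per seek
SEQ_BYTES_S = 100 * 1024 * 1024   # 100 MiB/sec transfer speed
MIB = 1024 * 1024


def rates(op_bytes, seeks):
    """Return (OSIOPs/sec, MiB/sec) for a stream of identical ops."""
    op_time_s = (SEEK_S if seeks else 0.0) + op_bytes / SEQ_BYTES_S
    ops_per_sec = 1.0 / op_time_s
    return ops_per_sec, ops_per_sec * op_bytes / MIB


# ASSUMPTION: three illustrative workloads, one op size each.
workloads = {
    "random 4 KiB": (4 * 1024, True),
    "random 256 KiB": (256 * 1024, True),
    "sequential 256 KiB": (256 * 1024, False),
}

results = {name: rates(*args) for name, args in workloads.items()}
for name, (ops, mib) in results.items():
    print(f"{name:20s} {ops:7.1f} OSIOPs/sec {mib:7.2f} MiB/sec")

ops_spread = (max(r[0] for r in results.values())
              / min(r[0] for r in results.values()))
bw_spread = (max(r[1] for r in results.values())
             / min(r[1] for r in results.values()))
print(f"OSIOPs/sec spread: {ops_spread:.1f}x, "
      f"bandwidth spread: {bw_spread:.0f}x")
```

In this model the OSIOPs/sec figure varies by a factor of roughly 3
across the workloads, while the bandwidth varies by a factor of roughly
150.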

...
> > With more than 1 client, you get seeks, and then bandwidth is no longer
> > a sensible measure.
> 
> And neither is iops/sec.

We agree that neither is "correct".

I still claim that one is "not strictly correct but probably close
enough to be useful".

> But things don't deteriorate that quickly, if
> you can tolerate higher latency, it's quite possible to have most of the
> potential bandwidth available for > 1 client workloads.

True.

I do wonder, though, how often that would be practically useful. Seek
times are *huge* (milliseconds) compared to almost anything else we work
with.


-- 

 / jakob

