public inbox for linux-fsdevel@vger.kernel.org
From: Andrea Arcangeli <aarcange@redhat.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	"lsf@lists.linux-foundation.org" <lsf@lists.linux-foundation.org>,
	Dave Chinner <david@fromorbit.com>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
Date: Fri, 22 Apr 2011 18:28:29 +0200	[thread overview]
Message-ID: <20110422162829.GX5611@random.random> (raw)
In-Reply-To: <20110422152531.GA8255@redhat.com>

On Fri, Apr 22, 2011 at 11:25:31AM -0400, Vivek Goyal wrote:
> It is and we have modified CFQ a lot to tackle that but still... 
> 
> Just do a "dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root
> disk and then try to launch firefox and browse a few websites and see if
> you are happy with the response of firefox. It took me more than
> a minute to launch firefox and be able to input and load the first website.
> 
> But I agree that READ latencies in presence of WRITES can be a problem
> independent of IO controller.

Reading this I get some déjà vu: this is literally a decade-old problem,
so old that when I first worked on it the elevator had no notion of
latency and could starve any I/O (read or write) near the end of the
disk indefinitely, as long as I/O closer to the start kept coming in ;).

We're orders of magnitude better these days, but one thing I didn't
see mentioned is that, as I recall, a lot of it had to do with the way
the DMA command size grows to the maximum allowed by the sg table for
writes, while reads (especially metadata and small files, where
readahead is less effective) won't grow to the maximum. Even when a
read does grow to the maximum, the readahead may not be useful
(userland will seek again instead of reading into the readahead
window), so even without synchronous metadata reads involved, the
kernel submits another physical readahead after satisfying only a
small userland read.

So even with a totally unfair I/O scheduler that always places the
next read request at the top of the queue (ignoring any fairness
requirement), the small synchronous read DMA still sits at the top of
the queue waiting for the large write DMA to complete.
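To make the effect concrete, here is a back-of-the-envelope sketch (the 100 MB/s throughput figure is an illustrative assumption, not a measurement): even a read that jumps straight to the head of the queue must wait out the write DMA already in flight, and that wait scales with the DMA size.

```shell
# dma_ms <bytes> <MB_per_s>: how long one DMA keeps the device busy,
# i.e. the minimum wait for a read queued right behind it.
dma_ms() { awk -v b="$1" -v t="$2" 'BEGIN { printf "%.2f", b / (t * 1024 * 1024) * 1000 }'; }

# Illustrative disk doing ~100 MB/s sequential writes.
for kb in 4 64 256 512; do
  echo "${kb}k write DMA in flight -> read waits >= $(dma_ms $((kb * 1024)) 100) ms"
done
```

At these assumed numbers a 512k write DMA holds the device for 5 ms, a 4k one for well under 0.1 ms, which is the whole argument for shrinking write DMAs when read latency matters.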

The time I got dd if=/dev/zero behaving best was when I broke
throughput by massively reducing the DMA size (by error or
intentionally, frankly I don't remember). SATA requires roughly 64k
DMAs to run at peak speed, and I expect that if you reduce it to 4k it
will behave a lot better than the current 256k. Some very old SCSI
device I had performed best at 512k DMAs (much faster than at 64k).
The max sector size is still 512k today, probably 256k (or only 128k)
for SATA, but likely above 64k (as it saves CPU even if throughput can
be maxed out at ~64k DMAs as far as the platter is concerned).
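These limits are visible per device in sysfs (the device name sda below is an assumption; substitute your own): max_hw_sectors_kb is the hardware ceiling and max_sectors_kb the soft limit the block layer actually merges up to.

```shell
Q=/sys/block/sda/queue
cat $Q/max_hw_sectors_kb   # hardware ceiling for a single request
cat $Q/max_sectors_kb      # current soft limit requests are merged up to
cat $Q/read_ahead_kb       # readahead window
```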

> Also it is only CFQ which gives READS so much preference over WRITES.
> deadline and noop do not, and those are what we typically use on faster
> storage. There we might take a bigger hit on READ latencies depending on
> what the storage is and how affected it is by a burst of WRITES.
> 
> I guess it boils down to better system control and better predictability.

I tend to think that to get even better read latency and
predictability, the IO scheduler could dynamically and temporarily
reduce the max sector size of the write DMA (and also ensure any
readahead is reduced to the same dynamically reduced sector size, or
it would be detrimental to the number of read DMAs issued for each
userland read).

Maybe with tagged queuing things are better and the DMA size doesn't
make a difference anymore; I don't know. Surely Jens knows this best
and can tell me if I'm wrong.

Anyway it should be really easy to test: just a two-liner reducing the
max sector size in scsi_lib and the max readahead should let you see
how fast firefox starts with CFQ while dd if=/dev/zero is running, and
whether there's any difference at all.
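On a current kernel you don't even need the two-liner in scsi_lib; the same knobs are writable in sysfs. A sketch of the experiment (the device name sda, the file path, and the value 4 are assumptions; run as root, and the old values are saved so they can be restored):

```shell
Q=/sys/block/sda/queue
orig_sectors=$(cat $Q/max_sectors_kb); orig_ra=$(cat $Q/read_ahead_kb)

echo 4 > $Q/max_sectors_kb   # shrink write DMAs to one page
echo 4 > $Q/read_ahead_kb    # keep readahead from re-inflating read requests

dd if=/dev/zero of=/zerofile bs=1M count=4K &
time firefox                 # how long until it is usable?
wait

echo $orig_sectors > $Q/max_sectors_kb; echo $orig_ra > $Q/read_ahead_kb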

I've seen huge work on CFQ, but the max merge size still stays at the
top limit and doesn't decrease dynamically, and I doubt you can make
writeback truly unnoticeable to reads without such a change, no matter
how the IO scheduler is otherwise implemented.

I'm unsure if this will ever be really viable in a single-user
environment (often absolute throughput is more important, and that is
clearly higher - at least for the writeback - by keeping the max
sector size fixed at the maximum), but if cgroup wants to make a dd
if=/dev/zero of=zero bs=10M oflag=direct from one group unnoticeable
to the other cgroups that are reading, it's worth researching whether
this is still an actual issue with today's hardware. I guess SSDs
won't change it much, as it's a DMA-duration issue, not a seek issue;
in fact it may be even more noticeable on SSDs, since seeks cost less
there, leaving the duration effect more visible.

> So throttling is happening at two layers. One throttling is in
> balance_dirty_pages() which is actually not dependent on user-inputted
> parameters. It is more dependent on what's the page cache share of
> this cgroup and what's the effective IO rate this cgroup is getting.
> The real IO throttling is happening at the device level, which is
> dependent on parameters inputted by the user and which in turn
> indirectly should decide how tasks are throttled in
> balance_dirty_pages().

This sounds like a fine design to me.
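As a toy model of that two-level design (purely illustrative, not kernel code): the device level enforces the user-set hard rate, and balance_dirty_pages() then paces dirtiers so a cgroup's dirtying rate tracks the IO rate it is actually getting, scaled by its page cache share.

```shell
# dirty_ratelimit <device_bytes_per_sec> <cache_share_percent>
# Pace dirtying to the writeback rate this cgroup can actually sustain.
dirty_ratelimit() { echo $(( $1 * $2 / 100 )); }

# A cgroup granted 10 MB/s at the device, owning 25% of the page cache,
# gets paced to roughly 2.5 MB/s of dirtying.
rate=$(dirty_ratelimit $((10 * 1024 * 1024)) 25)
echo "$((rate / 1024)) KB/s"   # -> 2560 KB/s
```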

Thanks,
Andrea

Thread overview: 138+ messages
     [not found] <1301373398.2590.20.camel@mulgrave.site>
2011-03-29  5:14 ` [Lsf] Preliminary Agenda and Activities for LSF Amir Goldstein
2011-03-29 11:16 ` Ric Wheeler
2011-03-29 11:22   ` Matthew Wilcox
2011-03-29 12:17     ` Jens Axboe
2011-03-29 13:09       ` Martin K. Petersen
2011-03-29 13:12         ` Ric Wheeler
2011-03-29 13:38         ` James Bottomley
2011-03-29 17:20   ` Shyam_Iyer
2011-03-29 17:33     ` Vivek Goyal
2011-03-29 18:10       ` Shyam_Iyer
2011-03-29 18:45         ` Vivek Goyal
2011-03-29 19:13           ` Shyam_Iyer
2011-03-29 19:57             ` Vivek Goyal
2011-03-29 19:59             ` Mike Snitzer
2011-03-29 20:12               ` Shyam_Iyer
2011-03-29 20:23                 ` Mike Snitzer
2011-03-29 23:09                   ` Shyam_Iyer
2011-03-30  5:58                     ` [Lsf] " Hannes Reinecke
2011-03-30 14:02                       ` James Bottomley
2011-03-30 14:10                         ` Hannes Reinecke
2011-03-30 14:26                           ` James Bottomley
2011-03-30 14:55                             ` Hannes Reinecke
2011-03-30 15:33                               ` James Bottomley
2011-03-30 15:46                                 ` Shyam_Iyer
2011-03-30 20:32                                 ` Giridhar Malavali
2011-03-30 20:45                                   ` James Bottomley
2011-03-29 19:47   ` Nicholas A. Bellinger
2011-03-29 20:29   ` Jan Kara
2011-03-29 20:31     ` Ric Wheeler
2011-03-30  0:33   ` Mingming Cao
2011-03-30  2:17     ` Dave Chinner
2011-03-30 11:13       ` Theodore Tso
2011-03-30 11:28         ` Ric Wheeler
2011-03-30 14:07           ` Chris Mason
2011-04-01 15:19           ` Ted Ts'o
2011-04-01 16:30             ` Amir Goldstein
2011-04-01 21:46               ` Joel Becker
2011-04-02  3:26                 ` Amir Goldstein
2011-04-01 21:43             ` Joel Becker
2011-03-30 21:49       ` Mingming Cao
2011-03-31  0:05         ` Matthew Wilcox
2011-03-31  1:00         ` Joel Becker
2011-04-01 21:34           ` Mingming Cao
2011-04-01 21:49             ` Joel Becker
2011-03-29 17:35 ` Chad Talbott
2011-03-29 19:09   ` Vivek Goyal
2011-03-29 20:14     ` Chad Talbott
2011-03-29 20:35     ` Jan Kara
2011-03-29 21:08       ` Greg Thelen
2011-03-30  4:18   ` Dave Chinner
2011-03-30 15:37     ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) Vivek Goyal
2011-03-30 22:20       ` Dave Chinner
2011-03-30 22:49         ` Chad Talbott
2011-03-31  3:00           ` Dave Chinner
2011-03-31 14:16         ` Vivek Goyal
2011-03-31 14:34           ` Chris Mason
2011-03-31 22:14             ` Dave Chinner
2011-03-31 23:43               ` Chris Mason
2011-04-01  0:55                 ` Dave Chinner
2011-04-01  1:34               ` Vivek Goyal
2011-04-01  4:36                 ` Dave Chinner
2011-04-01  6:32                   ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Christoph Hellwig
2011-04-01  7:23                     ` Dave Chinner
2011-04-01 12:56                       ` Christoph Hellwig
2011-04-21 15:07                         ` Vivek Goyal
2011-04-01 14:49                   ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] " Vivek Goyal
2011-03-31 22:25             ` Vivek Goyal
2011-03-31 14:50           ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Greg Thelen
2011-03-31 22:27             ` Dave Chinner
2011-04-01 17:18               ` Vivek Goyal
2011-04-01 21:49                 ` Dave Chinner
2011-04-02  7:33                   ` Greg Thelen
2011-04-02  7:34                     ` Greg Thelen
2011-04-05 13:13                   ` Vivek Goyal
2011-04-05 22:56                     ` Dave Chinner
2011-04-06 14:49                       ` Curt Wohlgemuth
2011-04-06 15:39                         ` Vivek Goyal
2011-04-06 19:49                           ` Greg Thelen
2011-04-06 23:07                           ` [Lsf] IO less throttling and cgroup aware writeback Greg Thelen
2011-04-06 23:36                             ` Dave Chinner
2011-04-07 19:24                               ` Vivek Goyal
2011-04-07 20:33                                 ` Christoph Hellwig
2011-04-07 21:34                                   ` Vivek Goyal
2011-04-07 23:42                                 ` Dave Chinner
2011-04-08  0:59                                   ` Greg Thelen
2011-04-08  1:25                                     ` Dave Chinner
2011-04-12  3:17                                       ` KAMEZAWA Hiroyuki
2011-04-08 13:43                                   ` Vivek Goyal
2011-04-06 23:08                         ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner
2011-04-07 20:04                           ` Vivek Goyal
2011-04-07 23:47                             ` Dave Chinner
2011-04-08 13:50                               ` Vivek Goyal
2011-04-11  1:05                                 ` Dave Chinner
2011-04-06 15:37                       ` Vivek Goyal
2011-04-06 16:08                         ` Vivek Goyal
2011-04-06 17:10                           ` Jan Kara
2011-04-06 17:14                             ` Curt Wohlgemuth
2011-04-08  1:58                             ` Dave Chinner
2011-04-19 14:26                               ` Wu Fengguang
2011-04-06 23:50                         ` Dave Chinner
2011-04-07 17:55                           ` Vivek Goyal
2011-04-11  1:36                             ` Dave Chinner
2011-04-15 21:07                               ` Vivek Goyal
2011-04-16  3:06                                 ` Vivek Goyal
2011-04-18 21:58                                   ` Jan Kara
2011-04-18 22:51                                     ` cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) Vivek Goyal
2011-04-19  0:33                                       ` Dave Chinner
2011-04-19 14:30                                         ` Vivek Goyal
2011-04-19 14:45                                           ` Jan Kara
2011-04-19 17:17                                           ` Vivek Goyal
2011-04-19 18:30                                             ` Vivek Goyal
2011-04-21  0:32                                               ` Dave Chinner
2011-04-21  0:29                                           ` Dave Chinner
2011-04-19 14:17                               ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Wu Fengguang
2011-04-19 14:34                                 ` Vivek Goyal
2011-04-19 14:48                                   ` Jan Kara
2011-04-19 15:11                                     ` Vivek Goyal
2011-04-19 15:22                                       ` Wu Fengguang
2011-04-19 15:31                                         ` Vivek Goyal
2011-04-19 16:58                                           ` Wu Fengguang
2011-04-19 17:05                                             ` Vivek Goyal
2011-04-19 20:58                                               ` Jan Kara
2011-04-20  1:21                                                 ` Wu Fengguang
2011-04-20 10:56                                                   ` Jan Kara
2011-04-20 11:19                                                     ` Wu Fengguang
2011-04-20 14:42                                                       ` Jan Kara
2011-04-20  1:16                                               ` Wu Fengguang
2011-04-20 18:44                                                 ` Vivek Goyal
2011-04-20 19:16                                                   ` Jan Kara
2011-04-21  0:17                                                   ` Dave Chinner
2011-04-21 15:06                                                   ` Wu Fengguang
2011-04-21 15:10                                                     ` Wu Fengguang
2011-04-21 17:20                                                     ` Vivek Goyal
2011-04-22  4:21                                                       ` Wu Fengguang
2011-04-22 15:25                                                         ` Vivek Goyal
2011-04-22 16:28                                                           ` Andrea Arcangeli [this message]
2011-04-25 18:19                                                             ` Vivek Goyal
2011-04-26 14:37                                                               ` Vivek Goyal
