Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)

From: Vivek Goyal <vgoyal@redhat.com>
To: Nauman Rafique <nauman@google.com>
Cc: Andrea Righi <righi.andrea@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	dpshah@google.com, lizf@cn.fujitsu.com, mikew@google.com,
	fchecconi@gmail.com, paolo.valente@unimore.it,
	jens.axboe@oracle.com, ryov@valinux.co.jp,
	fernando@intellilink.co.jp, s-uchida@ap.jp.nec.com,
	taka@valinux.co.jp, guijianfeng@cn.fujitsu.com,
	arozansk@redhat.com, jmoyer@redhat.com, oz-kernel@redhat.com,
	dhaval@linux.vnet.ibm.com, balbir@linux.vnet.ibm.com,
	linux-kernel@vger.kernel.org,
	containers@lists.linux-foundation.org, menage@google.com,
	peterz@infradead.org
Subject: Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
Date: Sun, 19 Apr 2009 09:08:49 -0400
Message-ID: <20090419130849.GC8493@redhat.com>
In-Reply-To: <e98e18940904171109r17ccb054kb7879f8d02ac26b5@mail.gmail.com>

On Fri, Apr 17, 2009 at 11:09:51AM -0700, Nauman Rafique wrote:
> On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> >> > > I think it would be possible to implement both proportional and limiting
> >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> >> > > address the memory consumption problem (I still need to review your
> >> > > patchset in details and I'm going to test it soon :), so I don't know if
> >> > > you already addressed this issue).
> >> > >
> >> >
> >> > Can you please elaborate a bit on this? Are you concerned that the data
> >> > structures created to solve the problem consume a lot of memory?
> >>
> >> Sorry I was not very clear here. With memory consumption I mean wasting
> >> the memory with hard/slow reclaimable dirty pages or pending IO
> >> requests.
> >>
> >> If there's only a global limit on dirty pages, any cgroup can exhaust
> >> that limit and cause other cgroups/processes to block when they try to
> >> write to disk.
> >>
> >> But, ok, the IO controller is probably not the best place to implement
> >> such functionality. I should rework the per cgroup dirty_ratio:
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >>
> >> Last time we focused too much on the best interfaces to define dirty
> >> pages limit, and I never re-posted an updated version of this patchset.
> >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> >> interface that we provide globally, but per cgroup.
> >>
> >> >
> >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> >> > > memory due to the succeeding IO requests and the increasingly dirty
> >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> >> > > but I think we talked about this problem in a previous email... sorry I
> >> > > don't find the discussion in my mail archives.
> >> > >
> >> > > IMHO a nice approach would be to measure IO consumption at the IO
> >> > > scheduler level, and control IO applying proportional weights / absolute
> >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> >> > > time block the tasks from dirtying memory that will generate additional
> >> > > IO requests.
> >> > >
> >> > > Anyway, there's no need to provide this with a single IO controller, we
> >> > > could split the problem in two parts: 1) provide a proportional /
> >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> >> > > example, a maximum limit of dirty pages for each cgroup.
> >> > >
> >> >
> >> > I think setting a maximum limit on dirty pages is an interesting thought.
> >> > It sounds as if the memory controller can handle it?
> >>
> >> Exactly, same as above.
> >
> > Thinking more about it: the memory controller can probably enforce the
> > higher limit, but that would not easily translate into a fixed upper async
> > write rate. Until the process hits the page cache limit or is slowed down
> > by dirty page writeout, it can get a very high async write BW.
> >
> > So the memory controller's page cache limit will help, but it would not
> > directly translate into what the max bw limit patches are doing.
> >
> > Even if we do max bw control at the IO scheduler level, async writes are
> > problematic again. The IO controller will not be able to throttle the
> > process until it sees an actual write request. On big-memory systems,
> > writeout might not happen for some time, and until then the process will
> > see a high throughput.
> >
> > So doing async write throttling at a higher layer, and not at the IO
> > scheduler layer, gives us the opportunity to produce more accurate results.
> 
> Wouldn't 'doing control on writes at a higher layer' have the same
> problems as the ones we talk about in dm-ioband? What if the cgroup
> being throttled for dirtying pages has a high weight assigned to it at
> the IO scheduler level? What if there are threads of different classes
> within that cgroup, and we would want to let RT task dirty the pages
> before BE tasks? I am not sure all these questions make sense, but
> just wanted to raise issues that might pop up.
> 
> If the whole system is designed with cgroups in mind, then throttling
> at IO scheduler layer should lead to backlog, that could be seen at
> higher level. For example, if a cgroup is not getting service at IO
> scheduler level, it should run out of request descriptors, and thus
> the thread writing back dirty pages should notice it (if it's pdflush,
> blocking it is probably not the best idea). And that should mean the
> cgroup should hit the dirty threshold, and disallow the task to dirty
> further pages. There is a possibility though that getting all this
> right might be an overkill and we can get away with a simpler
> solution.

Currently, if pdflush can't keep up and processes are dirtying the
page cache at a higher rate, then we will cross vm_dirty_ratio and the
dirtying process will be made to write back some of the dirty pages. That
should make sure that processes are automatically throttled at the IO
scheduler (assuming the process writes back its own pages and does not
pick some other process's pages at random). Currently, though, I think a
process can pick any inode for writeback, and not necessarily the inode
it is dirtying.
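
To make the throttling argument above a bit more concrete, here is a rough
toy sketch in Python. It is not kernel code: the function name, its
parameters and the tick-based model are all made up for illustration. It
assumes a single writer, a background flusher standing in for pdflush, and
a global dirty threshold expressed as a percentage of memory, the way
vm_dirty_ratio is. It also ignores the inode-selection caveat and simply
assumes the writer is held back on its own pages.

```python
def simulate(total_pages, dirty_ratio_pct, dirty_rate, flush_rate, ticks):
    """Toy model of global dirty-page throttling (illustrative only).

    Each tick a background flusher (standing in for pdflush) cleans up to
    `flush_rate` pages, then the writer tries to dirty `dirty_rate` pages.
    The writer may only dirty as many pages as still fit under the
    threshold, which models it being held back once the limit is crossed.
    Returns the number of pages the writer managed to dirty on each tick.
    """
    threshold = total_pages * dirty_ratio_pct // 100
    dirty = 0
    accepted = []
    for _ in range(ticks):
        # Background writeback cleans what it can.
        dirty -= min(dirty, flush_rate)
        # The writer can only fill the room left under the threshold.
        room = threshold - dirty
        done = min(dirty_rate, room)
        dirty += done
        accepted.append(done)
    return accepted


if __name__ == "__main__":
    # 1000 pages of memory, 10% dirty limit, writer wants 50 pages per
    # tick, the flusher can clean only 10 pages per tick.
    print(simulate(1000, 10, 50, 10, 6))
```

With these made-up numbers the writer dirties 50, 50, 20, 10, 10, 10 pages
on successive ticks: it runs at full speed until the threshold is reached
and is then held to the rate at which pages are actually written out. The
same sketch shows why an async writer can see a very high throughput for a
while on a big-memory system, where the threshold is far away.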

Thanks
Vivek
