From: Andrea Righi <righi.andrea@gmail.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Tejun Heo <tj@kernel.org>, Li Zefan <lizefan@huawei.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Jens Axboe <axboe@kernel.dk>, Vivek Goyal <vgoyal@redhat.com>,
Dennis Zhou <dennis@kernel.org>,
cgroups@vger.kernel.org, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
Date: Sat, 19 Jan 2019 11:08:27 +0100 [thread overview]
Message-ID: <20190119100827.GA1630@xps-13> (raw)
In-Reply-To: <20190118194652.gg5j2yz3h2llecpj@macbook-pro-91.dhcp.thefacebook.com>
On Fri, Jan 18, 2019 at 02:46:53PM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > > This is a redesign of my old cgroup-io-throttle controller:
> > > > https://lwn.net/Articles/330531/
> > > >
> > > > I'm resuming this old patch to point out a problem that I think is still
> > > > not solved completely.
> > > >
> > > > = Problem =
> > > >
> > > > The io.max controller works really well at limiting synchronous I/O
> > > > (READs), but a lot of I/O requests are initiated outside the context of
> > > > the process that is ultimately responsible for its creation (e.g.,
> > > > WRITEs).
> > > >
> > > > Throttling at the block layer in some cases is too late and we may end
> > > > up slowing down processes that are not responsible for the I/O that
> > > > is being processed at that level.
> > >
> > > How so? The writeback threads are per-cgroup and have the cgroup stuff set
> > > properly. So if you dirty a bunch of pages, they are associated with your
> > > cgroup, and then writeback happens and it's done in the writeback thread
> > > associated with your cgroup and then that is throttled. Then you are throttled
> > > at balance_dirty_pages() because the writeout is taking longer.
> >
> > Right, writeback is per-cgroup and slowing down writeback affects only
> > that specific cgroup, but, there are cases where other processes from
> > other cgroups may require to wait on that writeback to complete before
> > doing I/O (for example an fsync() to a file shared among different
> > cgroups). In this case we may end up blocking cgroups that shouldn't be
> > blocked, that looks like a priority-inversion problem. This is the
> > problem that I'm trying to address.
>
> Well this case is a misconfiguration, you shouldn't be sharing files between
> cgroups. But even if you are, fsync() is synchronous, we should be getting the
> context from the process itself and thus should have its own rules applied.
> There's nothing we can do for outstanding IO, but that shouldn't be that much.
> That would need to be dealt with on a per-contoller basis.
OK, fair point. We shouldn't be sharing files between cgroups.
I'm still not sure if we can have similar issues with metadata I/O (that
may introduce latencies like the sync() scenario), I have to investigate
more and do more tests.
>
> >
> > >
> > > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > > clearly tie IO to the thing generating the IO, such as readahead and such. If
> > > you are running into this case that may be something worth using. Course it
> > > only works for io.latency now but there's no reason you can't add support to it
> > > for io.max or whatever.
> >
> > IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> > memcg), something like this: if the cgroup is already congested don't
> > generate extra I/O due to readahead. Am I right?
>
> Yeah, but that's just how it's currently used, it can be used any which way we
> feel like.
I think it'd be very interesting to have the possibility to either
throttle I/O before writeback or during writeback. Right now we can only
throttle writeback. Maybe we can try to introduce some kind of dirty
page throttling controller using blk_cgroup_congested()... Opinions?
>
> >
> > >
> > > >
> > > > = Proposed solution =
> > > >
> > > > The main idea of this controller is to split I/O measurement and I/O
> > > > throttling: I/O is measured at the block layer for READS, at page cache
> > > > (dirty pages) for WRITEs, and processes are limited while they're
> > > > generating I/O at the VFS level, based on the measured I/O.
> > > >
> > >
> > > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > > looking into that route and simply changing the existing io controller you are
> > > using to take advantage of that so it will actually throttle things. Then just
> > > sprinkle it around the areas where we indirectly generate IO. Thanks,
> >
> > Absolutely, I can probably use blk_cgroup_congested() as a method to
> > determine when a cgroup should be throttled (instead of doing my own
> > I/O measuring), but to prevent the "slow writeback slowing down other
> > cgroups" issue I still need to apply throttling when pages are dirtied
> > in page cache.
>
> Again this is just a fuckup from a configuration stand point. The argument
> could be made that sync() is probably broken here, but I think the right
> solution here is to just pass the cgroup context along with the writeback
> information and use that if it's set instead. Thanks,
Alright, let's skip the root cgroup for now. I think the point here is
if we want to provide sync() isolation among cgroups or not.
According to the manpage:
sync() causes all pending modifications to filesystem metadata and cached file data to be
written to the underlying filesystems.
And:
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
may return before the actual writing is done. However Linux waits for I/O completions, and
thus sync() or syncfs() provide the same guarantees as fsync called on every file in the sys‚Äê
tem or filesystem respectively.
Excluding the root cgroup, do you think a sync() issued inside a
specific cgroup should wait for I/O completions only for the writes that
have been generated by that cgroup?
Thanks,
-Andrea
WARNING: multiple messages have this Message-ID (diff)
From: Andrea Righi <righi.andrea@gmail.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Tejun Heo <tj@kernel.org>, Li Zefan <lizefan@huawei.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Jens Axboe <axboe@kernel.dk>, Vivek Goyal <vgoyal@redhat.com>,
Dennis Zhou <dennis@kernel.org>,
cgroups@vger.kernel.org, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
Date: Sat, 19 Jan 2019 11:08:27 +0100 [thread overview]
Message-ID: <20190119100827.GA1630@xps-13> (raw)
In-Reply-To: <20190118194652.gg5j2yz3h2llecpj@macbook-pro-91.dhcp.thefacebook.com>
On Fri, Jan 18, 2019 at 02:46:53PM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > > This is a redesign of my old cgroup-io-throttle controller:
> > > > https://lwn.net/Articles/330531/
> > > >
> > > > I'm resuming this old patch to point out a problem that I think is still
> > > > not solved completely.
> > > >
> > > > = Problem =
> > > >
> > > > The io.max controller works really well at limiting synchronous I/O
> > > > (READs), but a lot of I/O requests are initiated outside the context of
> > > > the process that is ultimately responsible for its creation (e.g.,
> > > > WRITEs).
> > > >
> > > > Throttling at the block layer in some cases is too late and we may end
> > > > up slowing down processes that are not responsible for the I/O that
> > > > is being processed at that level.
> > >
> > > How so? The writeback threads are per-cgroup and have the cgroup stuff set
> > > properly. So if you dirty a bunch of pages, they are associated with your
> > > cgroup, and then writeback happens and it's done in the writeback thread
> > > associated with your cgroup and then that is throttled. Then you are throttled
> > > at balance_dirty_pages() because the writeout is taking longer.
> >
> > Right, writeback is per-cgroup and slowing down writeback affects only
> > that specific cgroup, but, there are cases where other processes from
> > other cgroups may require to wait on that writeback to complete before
> > doing I/O (for example an fsync() to a file shared among different
> > cgroups). In this case we may end up blocking cgroups that shouldn't be
> > blocked, that looks like a priority-inversion problem. This is the
> > problem that I'm trying to address.
>
> Well this case is a misconfiguration, you shouldn't be sharing files between
> cgroups. But even if you are, fsync() is synchronous, we should be getting the
> context from the process itself and thus should have its own rules applied.
> There's nothing we can do for outstanding IO, but that shouldn't be that much.
> That would need to be dealt with on a per-contoller basis.
OK, fair point. We shouldn't be sharing files between cgroups.
I'm still not sure if we can have similar issues with metadata I/O (that
may introduce latencies like the sync() scenario), I have to investigate
more and do more tests.
>
> >
> > >
> > > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > > clearly tie IO to the thing generating the IO, such as readahead and such. If
> > > you are running into this case that may be something worth using. Course it
> > > only works for io.latency now but there's no reason you can't add support to it
> > > for io.max or whatever.
> >
> > IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> > memcg), something like this: if the cgroup is already congested don't
> > generate extra I/O due to readahead. Am I right?
>
> Yeah, but that's just how it's currently used, it can be used any which way we
> feel like.
I think it'd be very interesting to have the possibility to either
throttle I/O before writeback or during writeback. Right now we can only
throttle writeback. Maybe we can try to introduce some kind of dirty
page throttling controller using blk_cgroup_congested()... Opinions?
>
> >
> > >
> > > >
> > > > = Proposed solution =
> > > >
> > > > The main idea of this controller is to split I/O measurement and I/O
> > > > throttling: I/O is measured at the block layer for READS, at page cache
> > > > (dirty pages) for WRITEs, and processes are limited while they're
> > > > generating I/O at the VFS level, based on the measured I/O.
> > > >
> > >
> > > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > > looking into that route and simply changing the existing io controller you are
> > > using to take advantage of that so it will actually throttle things. Then just
> > > sprinkle it around the areas where we indirectly generate IO. Thanks,
> >
> > Absolutely, I can probably use blk_cgroup_congested() as a method to
> > determine when a cgroup should be throttled (instead of doing my own
> > I/O measuring), but to prevent the "slow writeback slowing down other
> > cgroups" issue I still need to apply throttling when pages are dirtied
> > in page cache.
>
> Again this is just a fuckup from a configuration stand point. The argument
> could be made that sync() is probably broken here, but I think the right
> solution here is to just pass the cgroup context along with the writeback
> information and use that if it's set instead. Thanks,
Alright, let's skip the root cgroup for now. I think the point here is
if we want to provide sync() isolation among cgroups or not.
According to the manpage:
sync() causes all pending modifications to filesystem metadata and cached file data to be
written to the underlying filesystems.
And:
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
may return before the actual writing is done. However Linux waits for I/O completions, and
thus sync() or syncfs() provide the same guarantees as fsync called on every file in the sys‐
tem or filesystem respectively.
Excluding the root cgroup, do you think a sync() issued inside a
specific cgroup should wait for I/O completions only for the writes that
have been generated by that cgroup?
Thanks,
-Andrea
next prev parent reply other threads:[~2019-01-19 10:08 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-01-18 10:31 [RFC PATCH 0/3] cgroup: fsio throttle controller Andrea Righi
2019-01-18 10:31 ` [RFC PATCH 1/3] fsio-throttle: documentation Andrea Righi
2019-01-18 10:31 ` [RFC PATCH 2/3] fsio-throttle: controller infrastructure Andrea Righi
2019-01-18 10:31 ` [RFC PATCH 3/3] fsio-throttle: instrumentation Andrea Righi
2019-01-18 11:04 ` [RFC PATCH 0/3] cgroup: fsio throttle controller Paolo Valente
2019-01-18 11:10 ` Andrea Righi
2019-01-18 11:11 ` Paolo Valente
2019-01-18 16:35 ` Josef Bacik
2019-01-18 17:07 ` Paolo Valente
2019-01-18 17:12 ` Josef Bacik
2019-01-18 19:02 ` Andrea Righi
2019-01-18 18:44 ` Andrea Righi
2019-01-18 19:46 ` Josef Bacik
2019-01-19 10:08 ` Andrea Righi [this message]
2019-01-19 10:08 ` Andrea Righi
2019-01-21 21:47 ` Vivek Goyal
2019-01-28 17:41 ` Andrea Righi
2019-01-28 19:26 ` Vivek Goyal
2019-01-29 18:39 ` Andrea Righi
2019-01-29 18:50 ` Josef Bacik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190119100827.GA1630@xps-13 \
--to=righi.andrea@gmail.com \
--cc=axboe@kernel.dk \
--cc=cgroups@vger.kernel.org \
--cc=dennis@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=josef@toxicpanda.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lizefan@huawei.com \
--cc=tj@kernel.org \
--cc=vgoyal@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.