From: Vivek Goyal <vgoyal@redhat.com>
To: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: xen-devel@lists.xensource.com,
containers@lists.linux-foundation.org, jens.axboe@oracle.com,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, dm-devel@redhat.com,
righi.andrea@gmail.com, agk@sourceware.org, xemul@openvz.org,
fernando@oss.ntt.co.jp, balbir@linux.vnet.ibm.com
Subject: Re: dm-ioband + bio-cgroup benchmarks
Date: Wed, 24 Sep 2008 10:52:02 -0400 [thread overview]
Message-ID: <20080924145202.GC547@redhat.com> (raw)
In-Reply-To: <20080924.191803.100102323.taka@valinux.co.jp>
On Wed, Sep 24, 2008 at 07:18:03PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > > > > To avoid stacking another device (dm-ioband) on top of every
> > > > > > device we want to subject to rules, I was thinking of maintaining an
> > > > > > rb-tree per request queue. Requests would first go into this rb-tree upon
> > > > > > __make_request() and then filter down to the elevator associated with the
> > > > > > queue (if there is one). This gives us control over releasing
> > > > > > bios to the elevator based on policies (proportional weight, max
> > > > > > bandwidth, etc.) with no need to stack an additional block device.
> > > > >
> > > > > I think it's a bit late to control I/O requests there, since a process
> > > > > may be blocked in get_request_wait when the I/O load is high.
> > > > > Imagine the situation where cgroups with low bandwidths are
> > > > > consuming most of the "struct request"s while another cgroup with a
> > > > > high bandwidth is blocked and can't get enough of them.
> > > > >
> > > > > It means cgroups that issue lots of I/O requests can win the game.
> > > > >
> > > >
> > > > Ok, this is a good point. Because the number of struct requests is limited
> > > > and they seem to be allocated on a first-come, first-served basis, a
> > > > cgroup generating a lot of IO might win.
> > > >
> > > > But dm-ioband will face the same issue.
> > >
> > > Nope. Dm-ioband doesn't have this issue since it works before the
> > > descriptors are allocated. Only I/O requests that dm-ioband has passed
> > > can allocate a descriptor.
> > >
> >
> > Ok. Got it. dm-ioband does not block on allocation of request descriptors.
> > It does seem to be blocking in prevent_burst_bios() but that would be
> > per group so it should be fine.
>
> Yes. There is also another little mechanism: prevent_burst_bios()
> tries not to block kernel threads if possible.
>
> > That means, for the lower layers, one would have to do request descriptor
> > allocation as per the cgroup weight, to make sure a cgroup with lower
> > weight does not get a higher % of the disk just because it is generating
> > more requests.
>
> Yes. But when cgroups with higher weight aren't issuing a lot of I/Os,
> even a cgroup with lower weight can allocate a lot of request descriptors.
>
Ok. On second thought, I am dropping the idea of queuing request
descriptors entirely. Instead, I am now thinking of capturing the bios and
buffering them in the rb-tree as soon as they enter the request queue via
the associated request function. All request descriptor allocation would
then happen later, when the bios are actually released from the rb-tree to
the elevator. That way we should be able to get rid of this issue.
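As a rough illustration of the scheme (all names here are hypothetical, not
real kernel APIs; the kernel code would use the rb-tree and block-layer
interfaces, and a flat array stands in for the tree), bios would be buffered
at submission time with no descriptor allocated, and a descriptor would only
be "allocated" when policy picks a bio to release to the elevator:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BIOS 64

/* Toy stand-ins: a bio tagged with its cgroup's weight, and a per-device
 * queue that buffers bios before any request descriptor exists. */
struct bio_stub { int cgroup_weight; int id; };

struct ioqueue {
    struct bio_stub buf[MAX_BIOS];
    int nr;
    int descriptors_allocated; /* "struct request" allocation happens here */
};

/* Capture a bio at submission time: buffer it, allocate nothing yet. */
static int ioqueue_capture(struct ioqueue *q, int cgroup_weight, int id)
{
    if (q->nr >= MAX_BIOS)
        return -1;
    q->buf[q->nr].cgroup_weight = cgroup_weight;
    q->buf[q->nr].id = id;
    q->nr++;
    return 0;
}

/* Release one buffered bio to the elevator according to policy (here,
 * simply the highest cgroup weight). The request descriptor is only
 * accounted for at this point, so a low-weight cgroup flooding the queue
 * cannot exhaust descriptors ahead of a high-weight cgroup. */
static int ioqueue_release_one(struct ioqueue *q)
{
    int best = -1, i, id;

    for (i = 0; i < q->nr; i++)
        if (best < 0 || q->buf[i].cgroup_weight > q->buf[best].cgroup_weight)
            best = i;
    if (best < 0)
        return -1;
    id = q->buf[best].id;
    q->buf[best] = q->buf[--q->nr]; /* remove from the buffer */
    q->descriptors_allocated++;
    return id;
}
```

The point of the sketch is only the ordering: capture and buffering happen
before descriptor allocation, so the limited descriptor pool is consumed in
policy order rather than first-come, first-served.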
> > One additional issue with my scheme I just noticed is that I am putting
> > the bio-cgroup in an rb-tree. If there are stacked devices, then bios/requests
> > from the same cgroup can be at multiple levels of processing at the same
> > time. That would mean a single cgroup needs to be in multiple rb-trees in
> > various layers at the same time. So I might have to create a temporary
> > object associated with the cgroup and get rid of that object once I no
> > longer have any requests...
>
> You mean each layer should have its own rb-tree? Is it per device?
> One LVM logical volume will probably consist of several physical
> volumes, which may be shared with other logical volumes.
> And some layers may split one bio into several bios.
> I can hardly imagine what these structures will look like.
>
Yes, one rb-tree per device, be it a physical or a logical device
(because there is one request queue per physical/logical block device).

I was thinking of getting hold of/hijacking the bios as soon as they are
submitted to the device, via the associated request function. So if a
logical device is built on top of two physical devices, the associated
bio-copy or other logic should not even see a bio the moment it is
submitted to the logical device. It will see the bio only when it is
released to it from the associated rb-tree. Do you think this will not
work? To me this is logically what dm-ioband is doing; the only
difference is that it does so with the help of a separate request queue.
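To make the stacked-device picture concrete, here is a tiny sketch (again,
hypothetical names and a toy model, not dm-ioband or block-layer code): each
device holds submitted bios in its own per-device structure, and a logical
device remaps a bio to an underlying physical device only on release, so the
lower layer never sees the bio while it is buffered above:

```c
#include <assert.h>
#include <stddef.h>

/* Toy device: counts bios buffered in its own (rb-)tree and bios
 * dispatched to its elevator. A logical device has underlying targets. */
struct dev {
    int buffered;           /* bios held in this device's tree */
    int dispatched;         /* bios released to this device's elevator */
    struct dev *targets[2]; /* underlying devices, NULL for a leaf */
};

/* Submission: the bio is held in the device's own tree; lower layers
 * do not see it yet. */
static void dev_submit(struct dev *d)
{
    d->buffered++;
}

/* Release: the bio leaves this device's tree. A logical device remaps it
 * to an underlying device (round-robin here), which captures it in its
 * own tree in turn; a leaf device sends it to its elevator. */
static void dev_release(struct dev *d)
{
    static int rr;

    if (d->buffered == 0)
        return;
    d->buffered--;
    if (d->targets[0])
        dev_submit(d->targets[rr++ % 2]);
    else
        d->dispatched++;
}
```

Each layer repeats the same capture/release cycle independently, which is
why one tree per request queue suffices even when a bio traverses several
stacked devices.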
Thanks
Vivek