* dm-ioband + bio-cgroup benchmarks
@ 2008-09-18 12:04 Ryo Tsuruta
0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2008-09-18 12:04 UTC (permalink / raw)
To: linux-kernel, dm-devel, containers, virtualization, xen-devel
Cc: fernando, balbir, xemul, kamezawa.hiroyu, agk
Hi All,
I have obtained excellent results with dm-ioband, which controls disk I/O
bandwidth even when it accepts delayed write requests.
This time, I ran some benchmarks on high-end storage. The reason was to
avoid a performance bottleneck due to mechanical factors such as seek time.
You can see the details of the benchmarks at:
http://people.valinux.co.jp/~ryov/dm-ioband/hps/
Thanks,
Ryo Tsuruta
* Re: dm-ioband + bio-cgroup benchmarks
[not found] <20080918.210418.226794540.ryov@valinux.co.jp>
@ 2008-09-18 13:15 ` Vivek Goyal
2008-09-19 8:49 ` Takuya Yoshikawa
` (2 subsequent siblings)
3 siblings, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2008-09-18 13:15 UTC (permalink / raw)
To: Ryo Tsuruta
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, Andrea Righi, agk, xemul, fernando, balbir
On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
> Hi All,
>
> I have got excellent results of dm-ioband, that controls the disk I/O
> bandwidth even when it accepts delayed write requests.
>
> In this time, I ran some benchmarks with a high-end storage. The
> reason was to avoid a performance bottleneck due to mechanical factors
> such as seek time.
>
> You can see the details of the benchmarks at:
> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
>
Hi Ryo,
I have a query about the dm-ioband patches. IIUC, dm-ioband will break
the notion of process priority in CFQ, because the dm-ioband device now
holds bios and issues them to the lower layers later, based on which bios
become ready. Hence the actual bio-submitting context may differ from the
originating task, and because CFQ derives the io_context from the current
task, priorities will be broken.
To mitigate that problem, we probably need to implement Fernando's
suggestion of putting an io_context pointer in the bio.
Have you already done something to solve this issue?
Secondly, why do we have to create an additional dm-ioband device for
every device we want to control with rules? This looks a little odd, at
least to me. Can't we keep it in line with the rest of the controllers,
where task grouping is done with cgroups and the rules are specified in
the cgroup itself (the way Andrea Righi does in the io-throttling patches)?
To avoid stacking another device (dm-ioband) on top of every device we
want to subject to rules, I was thinking of maintaining an rb-tree per
request queue. Requests would first go into this rb-tree upon
__make_request() and then filter down to the elevator associated with the
queue (if there is one). This gives us control over releasing bios to the
elevator based on policies (proportional weight, max bandwidth, etc.)
without stacking an additional block device.
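To make the idea concrete, here is a rough sketch of the data structures
I have in mind. This is purely illustrative and not from any posted
patch: the names (ioq_tree, ioq_node), the key choice and the bookkeeping
fields are all made up, and locking and the actual release policy are
omitted.

#include <linux/rbtree.h>
#include <linux/bio.h>

/*
 * One buffering node per cgroup, hung off an rb-tree in the request
 * queue.  Incoming bios are parked here by __make_request() and a
 * separate release path hands them to the elevator according to the
 * configured policy (proportional weight, max bandwidth, ...).
 */
struct ioq_node {
	struct rb_node	rb;		/* linkage in the per-queue tree */
	unsigned long	key;		/* cgroup identifier used as rb key */
	struct bio	*bios;		/* buffered bios, chained via bi_next */
	unsigned int	weight;		/* share assigned through cgroup rules */
	u64		served;		/* service received so far */
};

struct ioq_tree {
	struct rb_root	root;		/* one instance per request_queue */
};

static struct ioq_node *ioq_insert(struct ioq_tree *tree, struct ioq_node *node)
{
	struct rb_node **p = &tree->root.rb_node, *parent = NULL;

	while (*p) {
		struct ioq_node *entry = rb_entry(*p, struct ioq_node, rb);

		parent = *p;
		if (node->key < entry->key)
			p = &(*p)->rb_left;
		else if (node->key > entry->key)
			p = &(*p)->rb_right;
		else
			return entry;	/* this cgroup already has a node */
	}
	rb_link_node(&node->rb, parent, p);
	rb_insert_color(&node->rb, &tree->root);
	return node;
}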
I am working on some experimental proof-of-concept patches. It will take
some time, though.
I was thinking of the following:
- Adopt Andrea Righi's style of specifying rules for devices and
group the tasks using cgroups.
- To begin with, adopt dm-ioband's approach of a proportional bandwidth
controller. It makes sense to me to limit bandwidth usage only in
case of contention. If there is really a need to limit maximum bandwidth,
then we can probably implement additional rules or some policy switcher
where the user can decide what kind of policy should be applied.
- Get rid of dm-ioband and instead buffer requests on an rb-tree on every
request queue, controlled by some kind of cgroup rules.
It would be good to discuss the above approach now and decide whether it
makes sense or not. I think it is a kind of fusion of the io-throttling
and dm-ioband patches, with the additional idea of doing I/O control just
above the elevator on the request queue using an rb-tree.
Thanks
Vivek
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080918131554.GB20640@redhat.com>
@ 2008-09-18 14:37 ` Andrea Righi
[not found] ` <48D267B5.20402@gmail.com>
` (4 subsequent siblings)
5 siblings, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2008-09-18 14:37 UTC (permalink / raw)
To: Vivek Goyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, balbir
Vivek Goyal wrote:
> On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
>> Hi All,
>>
>> I have got excellent results of dm-ioband, that controls the disk I/O
>> bandwidth even when it accepts delayed write requests.
>>
>> In this time, I ran some benchmarks with a high-end storage. The
>> reason was to avoid a performance bottleneck due to mechanical factors
>> such as seek time.
>>
>> You can see the details of the benchmarks at:
>> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
>>
>
> Hi Ryo,
>
> I had a query about dm-ioband patches. IIUC, dm-ioband patches will break
> the notion of process priority in CFQ because now dm-ioband device will
> hold the bio and issue these to lower layers later based on which bio's
> become ready. Hence actual bio submitting context might be different and
> because cfq derives the io_context from current task, it will be broken.
>
> To mitigate that problem, we probably need to implement Fernando's
> suggestion of putting io_context pointer in bio.
>
> Have you already done something to solve this issue?
>
> Secondly, why do we have to create an additional dm-ioband device for
> every device we want to control using rules. This looks little odd
> atleast to me. Can't we keep it in line with rest of the controllers
> where task grouping takes place using cgroup and rules are specified in
> cgroup itself (The way Andrea Righi does for io-throttling patches)?
>
> To avoid creation of stacking another device (dm-ioband) on top of every
> device we want to subject to rules, I was thinking of maintaining an
> rb-tree per request queue. Requests will first go into this rb-tree upon
> __make_request() and then will filter down to elevator associated with the
> queue (if there is one). This will provide us the control of releasing
> bio's to elevaor based on policies (proportional weight, max bandwidth
> etc) and no need of stacking additional block device.
>
> I am working on some experimental proof of concept patches. It will take
> some time though.
>
> I was thinking of following.
>
> - Adopt the Andrea Righi's style of specifying rules for devices and
> group the tasks using cgroups.
>
> - To begin with, adopt dm-ioband's approach of proportional bandwidth
> controller. It makes sense to me limit the bandwidth usage only in
> case of contention. If there is really a need to limit max bandwidth,
> then probably we can do something to implement additional rules or
> implement some policy switcher where user can decide what kind of
> policies need to be implemented.
>
> - Get rid of dm-ioband and instead buffer requests on an rb-tree on every
> request queue which is controlled by some kind of cgroup rules.
>
> It would be good to discuss above approach now whether it makes sense or
> not. I think it is kind of fusion of io-throttling and dm-ioband patches
> with additional idea of doing io-control just above elevator on the request
> queue using an rb-tree.
Thanks Vivek. It all sounds reasonable to me and I think this is the
right way to proceed.
I'll try to design and implement your rb-tree-per-request-queue idea in my
io-throttle controller; maybe we can also reuse it for a more generic
solution. Feel free to send me your experimental proof of concept if you
want; even if it's not yet complete, I can review it, test it and
contribute.
-Andrea
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48D267B5.20402@gmail.com>
@ 2008-09-18 15:06 ` Vivek Goyal
2008-09-18 15:18 ` Andrea Righi
[not found] ` <48D2715A.6060002@gmail.com>
0 siblings, 2 replies; 40+ messages in thread
From: Vivek Goyal @ 2008-09-18 15:06 UTC (permalink / raw)
To: Andrea Righi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, balbir
On Thu, Sep 18, 2008 at 04:37:41PM +0200, Andrea Righi wrote:
> Vivek Goyal wrote:
> > On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
> >> Hi All,
> >>
> >> I have got excellent results of dm-ioband, that controls the disk I/O
> >> bandwidth even when it accepts delayed write requests.
> >>
> >> In this time, I ran some benchmarks with a high-end storage. The
> >> reason was to avoid a performance bottleneck due to mechanical factors
> >> such as seek time.
> >>
> >> You can see the details of the benchmarks at:
> >> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> >>
> >
> > Hi Ryo,
> >
> > I had a query about dm-ioband patches. IIUC, dm-ioband patches will break
> > the notion of process priority in CFQ because now dm-ioband device will
> > hold the bio and issue these to lower layers later based on which bio's
> > become ready. Hence actual bio submitting context might be different and
> > because cfq derives the io_context from current task, it will be broken.
> >
> > To mitigate that problem, we probably need to implement Fernando's
> > suggestion of putting io_context pointer in bio.
> >
> > Have you already done something to solve this issue?
> >
> > Secondly, why do we have to create an additional dm-ioband device for
> > every device we want to control using rules. This looks little odd
> > atleast to me. Can't we keep it in line with rest of the controllers
> > where task grouping takes place using cgroup and rules are specified in
> > cgroup itself (The way Andrea Righi does for io-throttling patches)?
> >
> > To avoid creation of stacking another device (dm-ioband) on top of every
> > device we want to subject to rules, I was thinking of maintaining an
> > rb-tree per request queue. Requests will first go into this rb-tree upon
> > __make_request() and then will filter down to elevator associated with the
> > queue (if there is one). This will provide us the control of releasing
> > bio's to elevaor based on policies (proportional weight, max bandwidth
> > etc) and no need of stacking additional block device.
> >
> > I am working on some experimental proof of concept patches. It will take
> > some time though.
> >
> > I was thinking of following.
> >
> > - Adopt the Andrea Righi's style of specifying rules for devices and
> > group the tasks using cgroups.
> >
> > - To begin with, adopt dm-ioband's approach of proportional bandwidth
> > controller. It makes sense to me limit the bandwidth usage only in
> > case of contention. If there is really a need to limit max bandwidth,
> > then probably we can do something to implement additional rules or
> > implement some policy switcher where user can decide what kind of
> > policies need to be implemented.
> >
> > - Get rid of dm-ioband and instead buffer requests on an rb-tree on every
> > request queue which is controlled by some kind of cgroup rules.
> >
> > It would be good to discuss above approach now whether it makes sense or
> > not. I think it is kind of fusion of io-throttling and dm-ioband patches
> > with additional idea of doing io-control just above elevator on the request
> > queue using an rb-tree.
>
> Thanks Vivek. All sounds reasonable to me and I think this is be the right way
> to proceed.
>
> I'll try to design and implement your rb-tree per request-queue idea into my
> io-throttle controller, maybe we can reuse it also for a more generic solution.
> Feel free to send me your experimental proof of concept if you want, even if
> it's not yet complete, I can review it, test and contribute.
Currently I have taken code from bio-cgroup to implement the cgroup and
to provide the functionality to associate a bio with a cgroup. I need
this to be able to queue bios at the right node in the rb-tree and also
to decide when is the right time to release a few requests.
Right now the implementation is crude and I am working on making the
system boot. Once the patches are at least in somewhat working shape, I
will send them to you to have a look.
Thanks
Vivek
* Re: dm-ioband + bio-cgroup benchmarks
2008-09-18 15:06 ` Vivek Goyal
@ 2008-09-18 15:18 ` Andrea Righi
[not found] ` <48D2715A.6060002@gmail.com>
1 sibling, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2008-09-18 15:18 UTC (permalink / raw)
To: Vivek Goyal, Ryo Tsuruta
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, balbir
Vivek Goyal wrote:
> On Thu, Sep 18, 2008 at 04:37:41PM +0200, Andrea Righi wrote:
>> Vivek Goyal wrote:
>>> On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
>>>> Hi All,
>>>>
>>>> I have got excellent results of dm-ioband, that controls the disk I/O
>>>> bandwidth even when it accepts delayed write requests.
>>>>
>>>> In this time, I ran some benchmarks with a high-end storage. The
>>>> reason was to avoid a performance bottleneck due to mechanical factors
>>>> such as seek time.
>>>>
>>>> You can see the details of the benchmarks at:
>>>> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
>>>>
>>> Hi Ryo,
>>>
>>> I had a query about dm-ioband patches. IIUC, dm-ioband patches will break
>>> the notion of process priority in CFQ because now dm-ioband device will
>>> hold the bio and issue these to lower layers later based on which bio's
>>> become ready. Hence actual bio submitting context might be different and
>>> because cfq derives the io_context from current task, it will be broken.
>>>
>>> To mitigate that problem, we probably need to implement Fernando's
>>> suggestion of putting io_context pointer in bio.
>>>
>>> Have you already done something to solve this issue?
>>>
>>> Secondly, why do we have to create an additional dm-ioband device for
>>> every device we want to control using rules. This looks little odd
>>> atleast to me. Can't we keep it in line with rest of the controllers
>>> where task grouping takes place using cgroup and rules are specified in
>>> cgroup itself (The way Andrea Righi does for io-throttling patches)?
>>>
>>> To avoid creation of stacking another device (dm-ioband) on top of every
>>> device we want to subject to rules, I was thinking of maintaining an
>>> rb-tree per request queue. Requests will first go into this rb-tree upon
>>> __make_request() and then will filter down to elevator associated with the
>>> queue (if there is one). This will provide us the control of releasing
>>> bio's to elevaor based on policies (proportional weight, max bandwidth
>>> etc) and no need of stacking additional block device.
>>>
>>> I am working on some experimental proof of concept patches. It will take
>>> some time though.
>>>
>>> I was thinking of following.
>>>
>>> - Adopt the Andrea Righi's style of specifying rules for devices and
>>> group the tasks using cgroups.
>>>
>>> - To begin with, adopt dm-ioband's approach of proportional bandwidth
>>> controller. It makes sense to me limit the bandwidth usage only in
>>> case of contention. If there is really a need to limit max bandwidth,
>>> then probably we can do something to implement additional rules or
>>> implement some policy switcher where user can decide what kind of
>>> policies need to be implemented.
>>>
>>> - Get rid of dm-ioband and instead buffer requests on an rb-tree on every
>>> request queue which is controlled by some kind of cgroup rules.
>>>
>>> It would be good to discuss above approach now whether it makes sense or
>>> not. I think it is kind of fusion of io-throttling and dm-ioband patches
>>> with additional idea of doing io-control just above elevator on the request
>>> queue using an rb-tree.
>> Thanks Vivek. All sounds reasonable to me and I think this is be the right way
>> to proceed.
>>
>> I'll try to design and implement your rb-tree per request-queue idea into my
>> io-throttle controller, maybe we can reuse it also for a more generic solution.
>> Feel free to send me your experimental proof of concept if you want, even if
>> it's not yet complete, I can review it, test and contribute.
>
> Currently I have taken code from bio-cgroup to implement cgroups and to
> provide functionality to associate a bio to a cgroup. I need this to be
> able to queue the bio's at right node in the rb-tree and then also to be
> able to take a decision when is the right time to release few requests.
>
> Right now in crude implementation, I am working on making system boot.
> Once patches are at least in little bit working shape, I will send it to you
> to have a look.
>
> Thanks
> Vivek
I wonder... wouldn't it be simpler to just use the memory controller
to retrieve this information, starting from struct page?
I mean, following this path (in short, obviously using the appropriate
interfaces for locking and referencing the different objects):
cgrp = page->page_cgroup->mem_cgroup->css.cgroup
Once you get the cgrp it's very easy to use the corresponding controller
structure.
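For instance, a minimal sketch of what I mean, under the assumption that
the page_cgroup and mem_cgroup fields are reachable (in practice this has
to go through the memory controller's accessors and take the proper
locks/references, all omitted here). The helper name is mine, and struct
iothrottle stands in for whatever cgroup_to_iothrottle() returns in my
patches:

static struct iothrottle *page_to_iothrottle(struct page *page)
{
	struct page_cgroup *pc = page_get_page_cgroup(page);
	struct cgroup *cgrp;

	if (!pc)
		return NULL;
	/* owner of the page -> memory cgroup -> generic cgroup */
	cgrp = pc->mem_cgroup->css.cgroup;
	/* then map the cgroup to the io-throttle controller state */
	return cgroup_to_iothrottle(cgrp);
}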
Actually, this is how I do it in cgroup-io-throttle to associate a bio
with a cgroup. What other functionality/advantages does bio-cgroup
provide in addition to that?
Thanks,
-Andrea
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48D2715A.6060002@gmail.com>
@ 2008-09-18 16:20 ` Vivek Goyal
2008-09-18 19:54 ` Andrea Righi
2008-09-19 3:34 ` [dm-devel] " Hirokazu Takahashi
[not found] ` <20080919.123405.91829935.taka@valinux.co.jp>
2 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2008-09-18 16:20 UTC (permalink / raw)
To: Andrea Righi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, balbir
On Thu, Sep 18, 2008 at 05:18:50PM +0200, Andrea Righi wrote:
> Vivek Goyal wrote:
> > On Thu, Sep 18, 2008 at 04:37:41PM +0200, Andrea Righi wrote:
> >> Vivek Goyal wrote:
> >>> On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
> >>>> Hi All,
> >>>>
> >>>> I have got excellent results of dm-ioband, that controls the disk I/O
> >>>> bandwidth even when it accepts delayed write requests.
> >>>>
> >>>> In this time, I ran some benchmarks with a high-end storage. The
> >>>> reason was to avoid a performance bottleneck due to mechanical factors
> >>>> such as seek time.
> >>>>
> >>>> You can see the details of the benchmarks at:
> >>>> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> >>>>
> >>> Hi Ryo,
> >>>
> >>> I had a query about dm-ioband patches. IIUC, dm-ioband patches will break
> >>> the notion of process priority in CFQ because now dm-ioband device will
> >>> hold the bio and issue these to lower layers later based on which bio's
> >>> become ready. Hence actual bio submitting context might be different and
> >>> because cfq derives the io_context from current task, it will be broken.
> >>>
> >>> To mitigate that problem, we probably need to implement Fernando's
> >>> suggestion of putting io_context pointer in bio.
> >>>
> >>> Have you already done something to solve this issue?
> >>>
> >>> Secondly, why do we have to create an additional dm-ioband device for
> >>> every device we want to control using rules. This looks little odd
> >>> atleast to me. Can't we keep it in line with rest of the controllers
> >>> where task grouping takes place using cgroup and rules are specified in
> >>> cgroup itself (The way Andrea Righi does for io-throttling patches)?
> >>>
> >>> To avoid creation of stacking another device (dm-ioband) on top of every
> >>> device we want to subject to rules, I was thinking of maintaining an
> >>> rb-tree per request queue. Requests will first go into this rb-tree upon
> >>> __make_request() and then will filter down to elevator associated with the
> >>> queue (if there is one). This will provide us the control of releasing
> >>> bio's to elevaor based on policies (proportional weight, max bandwidth
> >>> etc) and no need of stacking additional block device.
> >>>
> >>> I am working on some experimental proof of concept patches. It will take
> >>> some time though.
> >>>
> >>> I was thinking of following.
> >>>
> >>> - Adopt the Andrea Righi's style of specifying rules for devices and
> >>> group the tasks using cgroups.
> >>>
> >>> - To begin with, adopt dm-ioband's approach of proportional bandwidth
> >>> controller. It makes sense to me limit the bandwidth usage only in
> >>> case of contention. If there is really a need to limit max bandwidth,
> >>> then probably we can do something to implement additional rules or
> >>> implement some policy switcher where user can decide what kind of
> >>> policies need to be implemented.
> >>>
> >>> - Get rid of dm-ioband and instead buffer requests on an rb-tree on every
> >>> request queue which is controlled by some kind of cgroup rules.
> >>>
> >>> It would be good to discuss above approach now whether it makes sense or
> >>> not. I think it is kind of fusion of io-throttling and dm-ioband patches
> >>> with additional idea of doing io-control just above elevator on the request
> >>> queue using an rb-tree.
> >> Thanks Vivek. All sounds reasonable to me and I think this is be the right way
> >> to proceed.
> >>
> >> I'll try to design and implement your rb-tree per request-queue idea into my
> >> io-throttle controller, maybe we can reuse it also for a more generic solution.
> >> Feel free to send me your experimental proof of concept if you want, even if
> >> it's not yet complete, I can review it, test and contribute.
> >
> > Currently I have taken code from bio-cgroup to implement cgroups and to
> > provide functionality to associate a bio to a cgroup. I need this to be
> > able to queue the bio's at right node in the rb-tree and then also to be
> > able to take a decision when is the right time to release few requests.
> >
> > Right now in crude implementation, I am working on making system boot.
> > Once patches are at least in little bit working shape, I will send it to you
> > to have a look.
> >
> > Thanks
> > Vivek
>
> I wonder... wouldn't be simpler to just use the memory controller
> to retrieve this information starting from struct page?
>
> I mean, following this path (in short, obviously using the appropriate
> interfaces for locking and referencing the different objects):
>
> cgrp = page->page_cgroup->mem_cgroup->css.cgroup
>
Andrea,
Ok, you are first retrieving the cgroup associated with the page owner
and then retrieving the respective iothrottle state using that cgroup
(cgroup_to_iothrottle). I have yet to dive deeper into the cgroup data
structures, but does this work if iothrottle and the memory controller
are mounted on separate hierarchies?
The bio-cgroup folks are doing a similar thing, in the sense of
retrieving the relevant pointer through page and page_cgroup and using it
to reach the bio_cgroup structure. The difference is that they don't
first retrieve the css object of the mem_cgroup; instead they store the
pointer to the bio_cgroup directly in the page_cgroup (when the page is
being charged in the memory controller). That is, while the page is being
charged, they determine the bio_cgroup associated with the task and store
it in page->page_cgroup->bio_cgroup:
static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
{
	return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
			    struct bio_cgroup, css);
}
At any later point, one can look at a bio and reach the respective
bio_cgroup via:
bio->page->page_cgroup->bio_cgroup
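In code, that lookup would look roughly like the sketch below. This is
illustrative only: the helper name is mine, not from the bio-cgroup
patches, and the locking/refcounting around page_cgroup is ignored.

static struct bio_cgroup *get_bio_cgroup_from_bio(struct bio *bio)
{
	/* first page the bio targets */
	struct page *page = bio_page(bio);
	struct page_cgroup *pc = page_get_page_cgroup(page);

	/* pc->bio_cgroup was filled in when the page was charged */
	return pc ? pc->bio_cgroup : NULL;
}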
It looks like we are now getting rid of the page_cgroup pointer in
"struct page", so we shall have to change the implementation accordingly.
Thanks
Vivek
* Re: dm-ioband + bio-cgroup benchmarks
2008-09-18 16:20 ` Vivek Goyal
@ 2008-09-18 19:54 ` Andrea Righi
0 siblings, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2008-09-18 19:54 UTC (permalink / raw)
To: Vivek Goyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, balbir
Vivek Goyal wrote:
> On Thu, Sep 18, 2008 at 05:18:50PM +0200, Andrea Righi wrote:
>> Vivek Goyal wrote:
>>> On Thu, Sep 18, 2008 at 04:37:41PM +0200, Andrea Righi wrote:
>>>> Vivek Goyal wrote:
>>>>> On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I have got excellent results of dm-ioband, that controls the disk I/O
>>>>>> bandwidth even when it accepts delayed write requests.
>>>>>>
>>>>>> In this time, I ran some benchmarks with a high-end storage. The
>>>>>> reason was to avoid a performance bottleneck due to mechanical factors
>>>>>> such as seek time.
>>>>>>
>>>>>> You can see the details of the benchmarks at:
>>>>>> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
>>>>>>
>>>>> Hi Ryo,
>>>>>
>>>>> I had a query about dm-ioband patches. IIUC, dm-ioband patches will break
>>>>> the notion of process priority in CFQ because now dm-ioband device will
>>>>> hold the bio and issue these to lower layers later based on which bio's
>>>>> become ready. Hence actual bio submitting context might be different and
>>>>> because cfq derives the io_context from current task, it will be broken.
>>>>>
>>>>> To mitigate that problem, we probably need to implement Fernando's
>>>>> suggestion of putting io_context pointer in bio.
>>>>>
>>>>> Have you already done something to solve this issue?
>>>>>
>>>>> Secondly, why do we have to create an additional dm-ioband device for
>>>>> every device we want to control using rules. This looks little odd
>>>>> atleast to me. Can't we keep it in line with rest of the controllers
>>>>> where task grouping takes place using cgroup and rules are specified in
>>>>> cgroup itself (The way Andrea Righi does for io-throttling patches)?
>>>>>
>>>>> To avoid creation of stacking another device (dm-ioband) on top of every
>>>>> device we want to subject to rules, I was thinking of maintaining an
>>>>> rb-tree per request queue. Requests will first go into this rb-tree upon
>>>>> __make_request() and then will filter down to elevator associated with the
>>>>> queue (if there is one). This will provide us the control of releasing
>>>>> bio's to elevaor based on policies (proportional weight, max bandwidth
>>>>> etc) and no need of stacking additional block device.
>>>>>
>>>>> I am working on some experimental proof of concept patches. It will take
>>>>> some time though.
>>>>>
>>>>> I was thinking of following.
>>>>>
>>>>> - Adopt the Andrea Righi's style of specifying rules for devices and
>>>>> group the tasks using cgroups.
>>>>>
>>>>> - To begin with, adopt dm-ioband's approach of proportional bandwidth
>>>>> controller. It makes sense to me limit the bandwidth usage only in
>>>>> case of contention. If there is really a need to limit max bandwidth,
>>>>> then probably we can do something to implement additional rules or
>>>>> implement some policy switcher where user can decide what kind of
>>>>> policies need to be implemented.
>>>>>
>>>>> - Get rid of dm-ioband and instead buffer requests on an rb-tree on every
>>>>> request queue which is controlled by some kind of cgroup rules.
>>>>>
>>>>> It would be good to discuss above approach now whether it makes sense or
>>>>> not. I think it is kind of fusion of io-throttling and dm-ioband patches
>>>>> with additional idea of doing io-control just above elevator on the request
>>>>> queue using an rb-tree.
>>>> Thanks Vivek. All sounds reasonable to me and I think this is be the right way
>>>> to proceed.
>>>>
>>>> I'll try to design and implement your rb-tree per request-queue idea into my
>>>> io-throttle controller, maybe we can reuse it also for a more generic solution.
>>>> Feel free to send me your experimental proof of concept if you want, even if
>>>> it's not yet complete, I can review it, test and contribute.
>>> Currently I have taken code from bio-cgroup to implement cgroups and to
>>> provide functionality to associate a bio to a cgroup. I need this to be
>>> able to queue the bio's at right node in the rb-tree and then also to be
>>> able to take a decision when is the right time to release few requests.
>>>
>>> Right now in crude implementation, I am working on making system boot.
>>> Once patches are at least in little bit working shape, I will send it to you
>>> to have a look.
>>>
>>> Thanks
>>> Vivek
>> I wonder... wouldn't be simpler to just use the memory controller
>> to retrieve this information starting from struct page?
>>
>> I mean, following this path (in short, obviously using the appropriate
>> interfaces for locking and referencing the different objects):
>>
>> cgrp = page->page_cgroup->mem_cgroup->css.cgroup
>>
>
> Andrea,
>
> Ok, you are first retrieving cgroup associated page owner and then
> retrieving repsective iothrottle state using that
> cgroup, (cgroup_to_iothrottle). I have yet to dive deeper into cgroup
Correct.
> data structures but does it work if iothrottle and memory controller
> are mounted on separate hierarchies?
Ehm... I have to check. I usually mount all the controllers in the same
hierarchy. :P
> bio-cgroup guys are also doing similar thing in the sense retrieving
> relevant pointer through page and page_cgroup and use that to reach
> bio_cgroup strucutre. The difference is that they don't retrieve first
> css object of mem_cgroup instead they directly store the pointer of
> bio_cgroup in page_cgroup (When page is being charged in memory controller).
>
> While page is being charged, determine the bio_cgroup, associated with
> the task and store this info in page->page_cgroup->bio_cgroup.
>
> static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct
> *p)
> {
> return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
> struct bio_cgroup, css);
> }
>
> At any later point, one can look at bio and reach respective bio_cgroup
> by.
>
> bio->page->page_cgroup->bio_cgroup.
>
> Looks like now we are getting rid of page_cgroup pointer in "struct page"
> and we shall have to change the implementation accordingly.
Actually, only the page_get_page_cgroup() implementation would change.
And we don't have to worry about the particular implementation (hash,
radix tree, whatever...); in any case bio-cgroup simply has to use the
appropriate interface: page_get_page_cgroup(struct page *).
-Andrea
* Re: [dm-devel] Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48D2715A.6060002@gmail.com>
2008-09-18 16:20 ` Vivek Goyal
@ 2008-09-19 3:34 ` Hirokazu Takahashi
[not found] ` <20080919.123405.91829935.taka@valinux.co.jp>
2 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-19 3:34 UTC (permalink / raw)
To: righi.andrea, dm-devel
Cc: xen-devel, containers, agk, linux-kernel, virtualization,
jens.axboe, balbir, fernando, vgoyal, xemul
Hi,
> >> Vivek Goyal wrote:
> >>> On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
> >>>> Hi All,
> >>>>
> >>>> I have got excellent results of dm-ioband, that controls the disk I/O
> >>>> bandwidth even when it accepts delayed write requests.
> >>>>
> >>>> In this time, I ran some benchmarks with a high-end storage. The
> >>>> reason was to avoid a performance bottleneck due to mechanical factors
> >>>> such as seek time.
> >>>>
> >>>> You can see the details of the benchmarks at:
> >>>> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> >>>>
> >>> Hi Ryo,
> >>>
> >>> I had a query about dm-ioband patches. IIUC, dm-ioband patches will break
> >>> the notion of process priority in CFQ because now dm-ioband device will
> >>> hold the bio and issue these to lower layers later based on which bio's
> >>> become ready. Hence actual bio submitting context might be different and
> >>> because cfq derives the io_context from current task, it will be broken.
> >>>
> >>> To mitigate that problem, we probably need to implement Fernando's
> >>> suggestion of putting io_context pointer in bio.
> >>>
> >>> Have you already done something to solve this issue?
> >>>
> >>> Secondly, why do we have to create an additional dm-ioband device for
> >>> every device we want to control using rules. This looks little odd
> >>> atleast to me. Can't we keep it in line with rest of the controllers
> >>> where task grouping takes place using cgroup and rules are specified in
> >>> cgroup itself (The way Andrea Righi does for io-throttling patches)?
> >>>
> >>> To avoid creation of stacking another device (dm-ioband) on top of every
> >>> device we want to subject to rules, I was thinking of maintaining an
> >>> rb-tree per request queue. Requests will first go into this rb-tree upon
> >>> __make_request() and then will filter down to elevator associated with the
> >>> queue (if there is one). This will provide us the control of releasing
> >>> bio's to elevaor based on policies (proportional weight, max bandwidth
> >>> etc) and no need of stacking additional block device.
> >>>
> >>> I am working on some experimental proof of concept patches. It will take
> >>> some time though.
> >>>
> >>> I was thinking of following.
> >>>
> >>> - Adopt the Andrea Righi's style of specifying rules for devices and
> >>> group the tasks using cgroups.
> >>>
> >>> - To begin with, adopt dm-ioband's approach of proportional bandwidth
> >>> controller. It makes sense to me limit the bandwidth usage only in
> >>> case of contention. If there is really a need to limit max bandwidth,
> >>> then probably we can do something to implement additional rules or
> >>> implement some policy switcher where user can decide what kind of
> >>> policies need to be implemented.
> >>>
> >>> - Get rid of dm-ioband and instead buffer requests on an rb-tree on every
> >>> request queue which is controlled by some kind of cgroup rules.
> >>>
> >>> It would be good to discuss above approach now whether it makes sense or
> >>> not. I think it is kind of fusion of io-throttling and dm-ioband patches
> >>> with additional idea of doing io-control just above elevator on the request
> >>> queue using an rb-tree.
> >> Thanks Vivek. All sounds reasonable to me and I think this is be the right way
> >> to proceed.
> >>
> >> I'll try to design and implement your rb-tree per request-queue idea into my
> >> io-throttle controller, maybe we can reuse it also for a more generic solution.
> >> Feel free to send me your experimental proof of concept if you want, even if
> >> it's not yet complete, I can review it, test and contribute.
> >
> > Currently I have taken code from bio-cgroup to implement cgroups and to
> > provide functionality to associate a bio to a cgroup. I need this to be
> > able to queue the bio's at right node in the rb-tree and then also to be
> > able to take a decision when is the right time to release few requests.
> >
> > Right now in crude implementation, I am working on making system boot.
> > Once patches are at least in little bit working shape, I will send it to you
> > to have a look.
> >
> > Thanks
> > Vivek
>
> I wonder... wouldn't be simpler to just use the memory controller
> to retrieve this information starting from struct page?
>
> I mean, following this path (in short, obviously using the appropriate
> interfaces for locking and referencing the different objects):
>
> cgrp = page->page_cgroup->mem_cgroup->css.cgroup
>
> Once you get the cgrp it's very easy to use the corresponding controller
> structure.
>
> Actually, this is how I'm doing in cgroup-io-throttle to associate a bio
> to a cgroup. What other functionalities/advantages bio-cgroup provide in
> addition to that?
I've decided to have Ryo post the accurate dirty-page tracking patch
for bio-cgroup, although it isn't perfect yet. The memory controller
doesn't want to support this kind of tracking, because migrating a page
between memory cgroups is really heavy.
I also thought enhancing the memory controller would be good enough,
but a lot of people said they wanted to control the memory resource and
the block I/O resource separately. So you can create several bio-cgroups
inside one memory cgroup, or you can use bio-cgroup without memory-cgroup
at all.
I also plan to implement a more accurate tracking mechanism in
bio-cgroup after the memory cgroup team re-implements the infrastructure;
this mechanism won't be supported by memory-cgroup.
When a process is moved into another memory cgroup, the pages belonging
to the process don't move to the new cgroup, because migrating pages is
so heavy: it's hard to find the pages belonging to the process, and
migrating them may cause some memory pressure. I'll implement this
feature only in bio-cgroup, with minimum overhead.
Thanks,
Hirokazu Takahashi.
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080918131554.GB20640@redhat.com>
2008-09-18 14:37 ` Andrea Righi
[not found] ` <48D267B5.20402@gmail.com>
@ 2008-09-19 6:12 ` Hirokazu Takahashi
2008-09-19 11:20 ` Hirokazu Takahashi
` (2 subsequent siblings)
5 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-19 6:12 UTC (permalink / raw)
To: vgoyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
Hi,
> > Hi All,
> >
> > I have got excellent results of dm-ioband, that controls the disk I/O
> > bandwidth even when it accepts delayed write requests.
> >
> > In this time, I ran some benchmarks with a high-end storage. The
> > reason was to avoid a performance bottleneck due to mechanical factors
> > such as seek time.
> >
> > You can see the details of the benchmarks at:
> > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> >
>
> Hi Ryo,
>
> I had a query about dm-ioband patches. IIUC, dm-ioband patches will break
> the notion of process priority in CFQ because now dm-ioband device will
> hold the bio and issue these to lower layers later based on which bio's
> become ready. Hence actual bio submitting context might be different and
> because cfq derives the io_context from current task, it will be broken.
This is really another problem that we have to solve.
The CFQ scheduler makes the really bad assumption that the current
process must be the owner of the I/O. The same problem occurs when you
use some of the device-mapper devices or use Linux AIO.
> To mitigate that problem, we probably need to implement Fernando's
> suggestion of putting io_context pointer in bio.
>
> Have you already done something to solve this issue?
Actually, I already have a patch to solve this problem, which makes
each bio carry a pointer to the io_context of its owner process.
Would you take a look at the thread whose subject is "I/O context
inheritance" at:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0804.2/index.html#2850
Fernando also knows about this.
Thank you,
Hirokazu Takahashi.
* Re: dm-ioband + bio-cgroup benchmarks
[not found] <20080918.210418.226794540.ryov@valinux.co.jp>
2008-09-18 13:15 ` dm-ioband + bio-cgroup benchmarks Vivek Goyal
@ 2008-09-19 8:49 ` Takuya Yoshikawa
[not found] ` <20080918131554.GB20640@redhat.com>
[not found] ` <48D36794.6010002@oss.ntt.co.jp>
3 siblings, 0 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2008-09-19 8:49 UTC (permalink / raw)
To: Ryo Tsuruta
Cc: xen-devel, containers, linux-kernel, virtualization, dm-devel,
agk, xemul, fernando, kamezawa.hiroyu, balbir
Hi Tsuruta-san,
Ryo Tsuruta wrote:
> Hi All,
>
> I have got excellent results of dm-ioband, that controls the disk I/O
> bandwidth even when it accepts delayed write requests.
>
> In this time, I ran some benchmarks with a high-end storage. The
> reason was to avoid a performance bottleneck due to mechanical factors
> such as seek time.
>
> You can see the details of the benchmarks at:
> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
>
I took a look at your beautiful results!
When you have time, would you explain to me how you managed to measure
the time and bandwidth, especially for the write() tests? I tried
similar tests and failed to measure the bandwidth correctly. Did you
insert something into the kernel source?
Thanks,
Takuya Yoshikawa
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080918131554.GB20640@redhat.com>
` (2 preceding siblings ...)
2008-09-19 6:12 ` Hirokazu Takahashi
@ 2008-09-19 11:20 ` Hirokazu Takahashi
[not found] ` <20080919.202031.86647893.taka@valinux.co.jp>
[not found] ` <20080919.151221.49666828.taka@valinux.co.jp>
5 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-19 11:20 UTC (permalink / raw)
To: vgoyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
Hi,
> > Hi All,
> >
> > I have got excellent results of dm-ioband, that controls the disk I/O
> > bandwidth even when it accepts delayed write requests.
> >
> > In this time, I ran some benchmarks with a high-end storage. The
> > reason was to avoid a performance bottleneck due to mechanical factors
> > such as seek time.
> >
> > You can see the details of the benchmarks at:
> > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
(snip)
> Secondly, why do we have to create an additional dm-ioband device for
> every device we want to control using rules. This looks little odd
> atleast to me. Can't we keep it in line with rest of the controllers
> where task grouping takes place using cgroup and rules are specified in
> cgroup itself (The way Andrea Righi does for io-throttling patches)?
It isn't essential that dm-ioband be implemented as one of the
device-mappers. I've also been considering that the algorithm itself
could be implemented directly in the block layer.
The current implementation does have merits, though. It is flexible:
- Dm-ioband can be placed anywhere you like, which may be right before
the I/O schedulers or on top of LVM devices.
- It supports partition-based bandwidth control, which can work without
cgroups and is quite easy to use.
- It is independent of any I/O scheduler, including ones that will be
introduced in the future.
I also understand it will be hard to set up without some tools such as
the lvm commands.
> To avoid creation of stacking another device (dm-ioband) on top of every
> device we want to subject to rules, I was thinking of maintaining an
> rb-tree per request queue. Requests will first go into this rb-tree upon
> __make_request() and then will filter down to elevator associated with the
> queue (if there is one). This will provide us the control of releasing
> bio's to elevaor based on policies (proportional weight, max bandwidth
> etc) and no need of stacking additional block device.
I think it's a bit late to control I/O requests there, since processes
may already be blocked in get_request_wait() when the I/O load is high.
Please imagine the situation where cgroups with low bandwidth consume
most of the "struct request"s while another cgroup with high bandwidth
is blocked and can't get enough of them.
It means that cgroups which issue lots of I/O requests can win the game.
> I am working on some experimental proof of concept patches. It will take
> some time though.
>
> I was thinking of following.
>
> - Adopt the Andrea Righi's style of specifying rules for devices and
> group the tasks using cgroups.
>
> - To begin with, adopt dm-ioband's approach of proportional bandwidth
> controller. It makes sense to me limit the bandwidth usage only in
> case of contention. If there is really a need to limit max bandwidth,
> then probably we can do something to implement additional rules or
> implement some policy switcher where user can decide what kind of
> policies need to be implemented.
>
> - Get rid of dm-ioband and instead buffer requests on an rb-tree on every
> request queue which is controlled by some kind of cgroup rules.
>
> It would be good to discuss above approach now whether it makes sense or
> not. I think it is kind of fusion of io-throttling and dm-ioband patches
> with additional idea of doing io-control just above elevator on the request
> queue using an rb-tree.
>
> Thanks
> Vivek
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48D36794.6010002@oss.ntt.co.jp>
@ 2008-09-19 11:31 ` Ryo Tsuruta
0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2008-09-19 11:31 UTC (permalink / raw)
To: yoshikawa.takuya
Cc: xen-devel, containers, linux-kernel, virtualization, dm-devel,
agk, xemul, fernando, kamezawa.hiroyu, balbir
Hi Yoshikawa-san,
> When you have time, would you explain me how you succeeded to check the
> time, bandwidth, especially when you did write() tests? Actually, I tried
> similar tests and failed to check the bandwidth correctly. Did you insert
> something in the kernel source?
I'm using our own tool, which issues I/Os in parallel for a specified
period and counts how many I/Os are issued and how many bytes are
transferred during that period.
I'm also using another tool of ours to measure throughput variation by
looking at the internal data of dm-ioband; this tool is implemented as a
kernel module.
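For reference, the basic measurement idea is simple. Here is a minimal
userspace sketch of it (our real tool is more elaborate and issues many
such streams in parallel); it assumes O_DIRECT writes so that the page
cache does not hide the actual device bandwidth:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const size_t bs = 64 * 1024;		/* I/O size per request */
	const int period = 60;			/* measurement period in seconds */
	unsigned long long bytes = 0;
	time_t end;
	void *buf;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, bs)) {
		perror("setup");
		return 1;
	}
	memset(buf, 0, bs);

	/* issue writes for a fixed period and count the bytes transferred */
	end = time(NULL) + period;
	while (time(NULL) < end) {
		if (write(fd, buf, bs) != (ssize_t)bs)
			break;			/* end of device or error */
		bytes += bs;
	}
	printf("%llu bytes in %d seconds (%.1f MB/s)\n",
	       bytes, period, bytes / (double)period / (1024 * 1024));
	close(fd);
	return 0;
}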
Thanks,
Ryo Tsuruta
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080919.202031.86647893.taka@valinux.co.jp>
@ 2008-09-19 13:10 ` Vivek Goyal
[not found] ` <20080919131019.GA3606@redhat.com>
1 sibling, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2008-09-19 13:10 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
On Fri, Sep 19, 2008 at 08:20:31PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > Hi All,
> > >
> > > I have got excellent results of dm-ioband, that controls the disk I/O
> > > bandwidth even when it accepts delayed write requests.
> > >
> > > In this time, I ran some benchmarks with a high-end storage. The
> > > reason was to avoid a performance bottleneck due to mechanical factors
> > > such as seek time.
> > >
> > > You can see the details of the benchmarks at:
> > > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
>
> (snip)
>
> > Secondly, why do we have to create an additional dm-ioband device for
> > every device we want to control using rules. This looks little odd
> > atleast to me. Can't we keep it in line with rest of the controllers
> > where task grouping takes place using cgroup and rules are specified in
> > cgroup itself (The way Andrea Righi does for io-throttling patches)?
>
> It isn't essential dm-band is implemented as one of the device-mappers.
> I've been also considering that this algorithm itself can be implemented
> in the block layer directly.
>
> Although, the current implementation has merits. It is flexible.
> - Dm-ioband can be place anywhere you like, which may be right before
> the I/O schedulers or may be placed on top of LVM devices.
Hi,
An rb-tree per request queue should also be able to give us this
flexibility. Because the logic is implemented per request queue, the
rules can be placed at any layer: either at the bottom-most layer, where
requests are passed to the elevator, or at a higher layer, where requests
are passed to the lower-level block devices in the stack. We would just
have to modify some of the higher-level dm/md drivers to make use of
queuing cgroup requests and releasing them to the lower layers.
> - It supports partition based bandwidth control which can work without
> cgroups, which is quite easy to use of.
> - It is independent to any I/O schedulers including ones which will
> be introduced in the future.
This scheme should also be independent of the I/O schedulers. We might
have to make small changes to decouple things from __make_request() a
bit, in order to insert the rb-tree between __make_request() and the I/O
scheduler; but fundamentally, this approach should not require any major
modifications to the I/O schedulers.
>
> I also understand it's will be hard to set up without some tools
> such as lvm commands.
>
That's something I wish to avoid. If we can keep it simple by doing the
grouping with cgroups and allowing one-line rules in the cgroup, that
would be nice.
> > To avoid creation of stacking another device (dm-ioband) on top of every
> > device we want to subject to rules, I was thinking of maintaining an
> > rb-tree per request queue. Requests will first go into this rb-tree upon
> > __make_request() and then will filter down to elevator associated with the
> > queue (if there is one). This will provide us the control of releasing
> > bio's to elevaor based on policies (proportional weight, max bandwidth
> > etc) and no need of stacking additional block device.
>
> I think it's a bit late to control I/O requests there, since process
> may be blocked in get_request_wait when the I/O load is high.
> Please imagine the situation that cgroups with low bandwidths are
> consuming most of "struct request"s while another cgroup with a high
> bandwidth is blocked and can't get enough "struct request"s.
>
> It means cgroups that issues lot of I/O request can win the game.
>
Ok, this is a good point. The number of struct requests is limited and
they seem to be allocated on a first-come, first-served basis, so a
cgroup that generates a lot of I/O might win.
But dm-ioband will face the same issue: essentially it is also a request
queue and it has a limited number of request descriptors. Have you
modified the logic somewhere to allocate request descriptors to the
waiting processes based on their weights? If so, the same logic could
probably be implemented here too.
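For illustration, the kind of check I have in mind would gate request
descriptor allocation on a per-cgroup share of q->nr_requests. This is a
purely hypothetical sketch, not from any posted patch; the structure and
field names are made up, and the sleeping/wakeup logic is omitted:

#include <linux/blkdev.h>

/* per-cgroup bookkeeping for request descriptors (hypothetical) */
struct io_group_rqs {
	unsigned int weight;		/* share from cgroup rules */
	unsigned int nr_allocated;	/* descriptors currently held */
};

/*
 * Allow a new descriptor only if this group holds less than its
 * weighted share of the queue's request descriptors, so a low-weight
 * group flooding the device cannot starve the others.
 * total_weight is the sum of all groups' weights and assumed > 0.
 */
static int io_group_may_allocate(struct request_queue *q,
				 struct io_group_rqs *grp,
				 unsigned int total_weight)
{
	unsigned long share = q->nr_requests * grp->weight / total_weight;

	return grp->nr_allocated < (share ? share : 1);
}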
Thanks
Vivek
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080919.151221.49666828.taka@valinux.co.jp>
@ 2008-09-19 13:12 ` Vivek Goyal
0 siblings, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2008-09-19 13:12 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
On Fri, Sep 19, 2008 at 03:12:21PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > Hi All,
> > >
> > > I have got excellent results of dm-ioband, that controls the disk I/O
> > > bandwidth even when it accepts delayed write requests.
> > >
> > > In this time, I ran some benchmarks with a high-end storage. The
> > > reason was to avoid a performance bottleneck due to mechanical factors
> > > such as seek time.
> > >
> > > You can see the details of the benchmarks at:
> > > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> > >
> >
> > Hi Ryo,
> >
> > I had a query about dm-ioband patches. IIUC, dm-ioband patches will break
> > the notion of process priority in CFQ because now dm-ioband device will
> > hold the bio and issue these to lower layers later based on which bio's
> > become ready. Hence actual bio submitting context might be different and
> > because cfq derives the io_context from current task, it will be broken.
>
> This is completely another problem we have to solve.
> The CFQ scheduler has really bad assumption that the current process
> must be the owner. This problem occurs when you use some of device
> mapper devices or use linux aio.
>
> > To mitigate that problem, we probably need to implement Fernando's
> > suggestion of putting io_context pointer in bio.
> >
> > Have you already done something to solve this issue?
>
> Actually, I already have a patch to solve this problem, which make
> each bio have a pointer to the io_context of the owner process.
> Would you take a look at the thread whose subject is "I/O context
> inheritance" in:
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0804.2/index.html#2850
>
> Fernando also knows this.
Great, I will certainly have a look at that thread. This is something we
shall have to implement regardless of whether we go for the dm-ioband
approach or the rb-tree-per-request-queue approach.
Thanks
Vivek
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080919131019.GA3606@redhat.com>
@ 2008-09-19 20:28 ` Andrea Righi
2008-09-22 9:36 ` Hirokazu Takahashi
` (2 subsequent siblings)
3 siblings, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2008-09-19 20:28 UTC (permalink / raw)
To: Vivek Goyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
Hirokazu Takahashi, dm-devel, agk, xemul, fernando, balbir
Vivek Goyal wrote:
> On Fri, Sep 19, 2008 at 08:20:31PM +0900, Hirokazu Takahashi wrote:
>>> To avoid creation of stacking another device (dm-ioband) on top of every
>>> device we want to subject to rules, I was thinking of maintaining an
>>> rb-tree per request queue. Requests will first go into this rb-tree upon
>>> __make_request() and then will filter down to elevator associated with the
>>> queue (if there is one). This will provide us the control of releasing
>>> bio's to elevaor based on policies (proportional weight, max bandwidth
>>> etc) and no need of stacking additional block device.
>> I think it's a bit late to control I/O requests there, since process
>> may be blocked in get_request_wait when the I/O load is high.
>> Please imagine the situation that cgroups with low bandwidths are
>> consuming most of "struct request"s while another cgroup with a high
>> bandwidth is blocked and can't get enough "struct request"s.
>>
>> It means cgroups that issues lot of I/O request can win the game.
>>
>
> Ok, this is a good point. Because number of struct requests are limited
> and they seem to be allocated on first come first serve basis, so if a
> cgroup is generating lot of IO, then it might win.
>
> But dm-ioband will face the same issue. Essentially it is also a request
> queue and it will have limited number of request descriptors. Have you
> modified the logic somewhere for allocation of request descriptors to the
> waiting processes based on their weights? If yes, the logic probably can
> be implemented here too.
Maybe throttling the dirty-page ratio in memory could help to avoid this
problem. I mean, if a cgroup is exceeding its I/O limits, do ehm...
something... also at the balance_dirty_pages() level.
-Andrea
* Re: [dm-devel] Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080919.123405.91829935.taka@valinux.co.jp>
@ 2008-09-20 4:27 ` KAMEZAWA Hiroyuki
[not found] ` <20080920132703.e74c8f89.kamezawa.hiroyu@jp.fujitsu.com>
` (2 subsequent siblings)
3 siblings, 0 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-20 4:27 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, xemul, containers, jens.axboe, linux-kernel,
virtualization, dm-devel, agk, balbir, fernando, righi.andrea
On Fri, 19 Sep 2008 12:34:05 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> I've decided to get Ryo to post the accurate dirty-page tracking patch
> for bio-cgroup, which isn't perfect yet though. The memory controller
> never wants to support this tracking because migrating a page between
> memory cgroups is really heavy.
>
> I also thought enhancing the memory controller would be good enough,
> but a lot of people said they wanted to control memory resource and
> block I/O resource separately.
> So you can create several bio-cgroup in one memory-cgroup,
> or you can use bio-cgroup without memory-cgroup.
>
> I also have a plan to implement more acurate tracking mechanism
> on bio-cgroup after the memory cgroup team re-implement the infrastructure,
> which won't be supported by memory-cgroup.
> When a process are moved into another memory cgroup,
> the pages belonging to the process don't move to the new cgroup
> because migrating pages is so heavy. It's hard to find the pages
> from the process and migrating pages may cause some memory pressure.
> I'll implement this feature only on bio-cgroup with minimum overhead
>
I really would like to move the page_cgroup to the new cgroup when the
process moves... but that is just a plan and I'm not sure whether I can
do it or not.
Anyway, what's next for me is:
1. settle the current discussion about removing the page->page_cgroup pointer,
2. reduce locks,
3. support swap and swap-cache.
I think the algorithms for (1) and (2) are now getting smart.
Thanks,
-Kame
* Re: [dm-devel] Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080920132703.e74c8f89.kamezawa.hiroyu@jp.fujitsu.com>
@ 2008-09-20 5:18 ` Balbir Singh
[not found] ` <48D48789.8000606@linux.vnet.ibm.com>
1 sibling, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2008-09-20 5:18 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
Hirokazu Takahashi, dm-devel, agk, xemul, fernando, righi.andrea
KAMEZAWA Hiroyuki wrote:
> On Fri, 19 Sep 2008 12:34:05 +0900 (JST)
> Hirokazu Takahashi <taka@valinux.co.jp> wrote:
>
>> I've decided to get Ryo to post the accurate dirty-page tracking patch
>> for bio-cgroup, which isn't perfect yet though. The memory controller
>> never wants to support this tracking because migrating a page between
>> memory cgroups is really heavy.
>>
>> I also thought enhancing the memory controller would be good enough,
>> but a lot of people said they wanted to control memory resource and
>> block I/O resource separately.
>> So you can create several bio-cgroup in one memory-cgroup,
>> or you can use bio-cgroup without memory-cgroup.
>>
>> I also have a plan to implement more acurate tracking mechanism
>> on bio-cgroup after the memory cgroup team re-implement the infrastructure,
>> which won't be supported by memory-cgroup.
>> When a process are moved into another memory cgroup,
>> the pages belonging to the process don't move to the new cgroup
>> because migrating pages is so heavy. It's hard to find the pages
>> from the process and migrating pages may cause some memory pressure.
>> I'll implement this feature only on bio-cgroup with minimum overhead
>>
> I really would like to move page_cgroup to new cgroup when the process moves...
> But it's just in my plan and I'm not sure I can do it or not.
>
Kamezawa-San, I am not dead against it, but I would provide a knob/control
point so the system administrator can decide whether movement matters for
their applications and, if so, trigger it explicitly (like force_empty).
> Anyway what's next for me is
> 1. fix current discussion to remove page->page_cgroup pointer.
> 2. reduce locks.
Are you planning on reposting these? I've been trying other approaches at my end:
1. Use a radix tree per-node per-zone (roughly sketched below)
2. Use radix trees only for 32-bit systems
3. Depend on CONFIG_HAVE_MEMORY_PRESENT, build a sparse data structure and
use pre-allocation
I've posted (1) and I'll take a look at your patches as well.
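As a rough illustration of (1) only, and with names that are not from the
actual patches, a per-node, per-zone radix tree keyed by pfn could replace
the page->page_cgroup pointer for lookups:

    /* Illustrative sketch: the pc_tree field in struct zone is invented
     * here; radix_tree_lookup() itself is the stock kernel API. */
    #include <linux/radix-tree.h>
    #include <linux/mmzone.h>
    #include <linux/mm.h>

    static struct page_cgroup *lookup_page_cgroup(struct page *page)
    {
            struct zone *zone = page_zone(page);

            /* One tree per zone keeps each tree small and the lookup
             * NUMA-local; the pfn is the key. */
            return radix_tree_lookup(&zone->pc_tree, page_to_pfn(page));
    }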
> 3. support swap and swap-cache.
>
> I think algorithm for (1), (2) is now getting smart.
>
Yes, it is getting better
> Thanks,
> -Kame
>
--
Balbir
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [dm-devel] Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48D48789.8000606@linux.vnet.ibm.com>
@ 2008-09-20 9:25 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-20 9:25 UTC (permalink / raw)
To: balbir
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
Hirokazu Takahashi, dm-devel, agk, xemul, fernando, righi.andrea
On Fri, 19 Sep 2008 22:18:01 -0700
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Kamezawa-San, I am not dead against it, but I would provide a knob/control point
> for system administrator to decide if movement is important for applications,
> then let them do so (like force_empty).
>
Makes sense.
> > Anyway what's next for me is
> > 1. fix current discussion to remove page->page_cgroup pointer.
> > 2. reduce locks.
>
> Are you planning on reposting these. I've been trying other approaches at my end
>
I'll post next Monday. It's obvious that I should do more tests/fixes...
As for performance, I'll stop chasing it at some reasonable point.
> 1. Use radix tree per-node per-zone
> 2. Use radix trees only for 32 bit systems
> 3. Depend on CONFIG_HAVE_MEMORY_PRESENT and build a sparse data structure and
> use pre-allocation
>
> I've posted (1) and I'll take a look at your patches as well
>
My patch has (many) bugs. Several are fixed but there will still be more ;)
SwapCache beats me again because it easily reuses uncharged pages...
BTW, why do you like the radix-tree? It's not very good for our purpose.
FLATMEM support for small systems will be easy work.
> > 3. support swap and swap-cache.
> >
> > I think algorithm for (1), (2) is now getting smart.
> >
>
> Yes, it is getting better
>
Thanks,
-Kame
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080919131019.GA3606@redhat.com>
2008-09-19 20:28 ` Andrea Righi
@ 2008-09-22 9:36 ` Hirokazu Takahashi
[not found] ` <48D40B78.6060709@gmail.com>
[not found] ` <20080922.183651.62951479.taka@valinux.co.jp>
3 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-22 9:36 UTC (permalink / raw)
To: vgoyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
Hi,
> > > > I have got excellent results of dm-ioband, that controls the disk I/O
> > > > bandwidth even when it accepts delayed write requests.
> > > >
> > > > In this time, I ran some benchmarks with a high-end storage. The
> > > > reason was to avoid a performance bottleneck due to mechanical factors
> > > > such as seek time.
> > > >
> > > > You can see the details of the benchmarks at:
> > > > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> >
> > (snip)
> >
> > > Secondly, why do we have to create an additional dm-ioband device for
> > > every device we want to control using rules. This looks little odd
> > > atleast to me. Can't we keep it in line with rest of the controllers
> > > where task grouping takes place using cgroup and rules are specified in
> > > cgroup itself (The way Andrea Righi does for io-throttling patches)?
> >
> > It isn't essential dm-band is implemented as one of the device-mappers.
> > I've been also considering that this algorithm itself can be implemented
> > in the block layer directly.
> >
> > Although, the current implementation has merits. It is flexible.
> > - Dm-ioband can be place anywhere you like, which may be right before
> > the I/O schedulers or may be placed on top of LVM devices.
>
> Hi,
>
> An rb-tree per request queue also should be able to give us this
> flexibility. Because logic is implemented per request queue, rules can be
> placed at any layer. Either at bottom most layer where requests are
> passed to elevator or at higher layer where requests will be passed to
> lower level block devices in the stack. Just that we shall have to do
> modifications to some of the higher level dm/md drivers to make use of
> queuing cgroup requests and releasing cgroup requests to lower layers.
Request descriptors are allocated just before passing I/O requests
to the elevators. Even if you move the descriptor allocation point
before calling the dm/md drivers, the drivers can't make use of them.
When one of the dm drivers accepts an I/O request, the request
doesn't yet have a real device number or a real sector number.
The request will be re-mapped to another sector of another device
in every dm driver, and it may even be replicated there.
So it is really hard to find the right request queue to put
the request into and to sort requests on that queue.
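For reference, this is roughly what a device-mapper target's map callback
does to a bio before any request descriptor exists (the callback signature
is the one from kernels of that era; the context structure is simplified
for this sketch):

    /* A dm-linear-style remap: the bio is pointed at a different device
     * and sector, so the queue it finally lands on isn't known to the
     * layer that saw the original bio. */
    struct remap_ctx {                      /* sketch-only target context */
            struct dm_dev *dev;
            sector_t       start;
    };

    static int remap_map(struct dm_target *ti, struct bio *bio,
                         union map_info *map_context)
    {
            struct remap_ctx *rc = ti->private;

            bio->bi_bdev   = rc->dev->bdev;
            bio->bi_sector = rc->start + (bio->bi_sector - ti->begin);

            return DM_MAPIO_REMAPPED;       /* dm core resubmits the bio */
    }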
> > - It supports partition based bandwidth control which can work without
> > cgroups, which is quite easy to use of.
>
> > - It is independent to any I/O schedulers including ones which will
> > be introduced in the future.
>
> This scheme should also be independent of any of the IO schedulers. We
> might have to do small changes in IO-schedulers to decouple the things
> from __make_request() a bit to insert rb-tree in between __make_request()
> and IO-scheduler. Otherwise fundamentally, this approach should not
> require any major modifications to IO-schedulers.
>
> >
> > I also understand it's will be hard to set up without some tools
> > such as lvm commands.
> >
>
> That's something I wish to avoid. If we can keep it simple by doing
> grouping using cgroup and allow one line rules in cgroup it would be nice.
It's possible that the algorithm of dm-ioband can be placed in the block
layer if it is really a big problem.
But I doubt it can control every block I/O as we wish, since
the interface that cgroup supports is quite poor.
> > > To avoid creation of stacking another device (dm-ioband) on top of every
> > > device we want to subject to rules, I was thinking of maintaining an
> > > rb-tree per request queue. Requests will first go into this rb-tree upon
> > > __make_request() and then will filter down to elevator associated with the
> > > queue (if there is one). This will provide us the control of releasing
> > > bio's to elevaor based on policies (proportional weight, max bandwidth
> > > etc) and no need of stacking additional block device.
> >
> > I think it's a bit late to control I/O requests there, since process
> > may be blocked in get_request_wait when the I/O load is high.
> > Please imagine the situation that cgroups with low bandwidths are
> > consuming most of "struct request"s while another cgroup with a high
> > bandwidth is blocked and can't get enough "struct request"s.
> >
> > It means cgroups that issues lot of I/O request can win the game.
> >
>
> Ok, this is a good point. Because number of struct requests are limited
> and they seem to be allocated on first come first serve basis, so if a
> cgroup is generating lot of IO, then it might win.
>
> But dm-ioband will face the same issue.
Nope. Dm-ioband doesn't have this issue since it works before the
descriptors are allocated. Only I/O requests that dm-ioband has already
passed on can allocate request descriptors.
> Essentially it is also a request
> queue and it will have limited number of request descriptors. Have you
> modified the logic somewhere for allocation of request descriptors to the
> waiting processes based on their weights? If yes, the logic probably can
> be implemented here too.
I feel this is almost what dm-ioband is doing.
> Thanks
> Vivek
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48D40B78.6060709@gmail.com>
@ 2008-09-22 9:45 ` Hirokazu Takahashi
0 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-22 9:45 UTC (permalink / raw)
To: righi.andrea
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, vgoyal, balbir
Hi,
> >>> To avoid creation of stacking another device (dm-ioband) on top of every
> >>> device we want to subject to rules, I was thinking of maintaining an
> >>> rb-tree per request queue. Requests will first go into this rb-tree upon
> >>> __make_request() and then will filter down to elevator associated with the
> >>> queue (if there is one). This will provide us the control of releasing
> >>> bio's to elevaor based on policies (proportional weight, max bandwidth
> >>> etc) and no need of stacking additional block device.
> >> I think it's a bit late to control I/O requests there, since process
> >> may be blocked in get_request_wait when the I/O load is high.
> >> Please imagine the situation that cgroups with low bandwidths are
> >> consuming most of "struct request"s while another cgroup with a high
> >> bandwidth is blocked and can't get enough "struct request"s.
> >>
> >> It means cgroups that issues lot of I/O request can win the game.
> >>
> >
> > Ok, this is a good point. Because number of struct requests are limited
> > and they seem to be allocated on first come first serve basis, so if a
> > cgroup is generating lot of IO, then it might win.
> >
> > But dm-ioband will face the same issue. Essentially it is also a request
> > queue and it will have limited number of request descriptors. Have you
> > modified the logic somewhere for allocation of request descriptors to the
> > waiting processes based on their weights? If yes, the logic probably can
> > be implemented here too.
>
> Maybe throttling dirty page ratio in memory could help to avoid this problem.
> I mean, if a cgroup is exceeding the i/o limits do ehm... something.. also at
> the balance_dirty_pages() level.
That is one of the important features to be implemented for controlling I/O.
Controlling the dirty page ratio can help to avoid this issue, but it
isn't guaranteed. So both of them should be implemented.
What do you think happens when some cgroups have tons of threads issuing
a lot of direct I/O, while others have huge amounts of memory?
Thanks,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080922.183651.62951479.taka@valinux.co.jp>
@ 2008-09-22 14:30 ` Vivek Goyal
[not found] ` <20080922143042.GA19222@redhat.com>
1 sibling, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2008-09-22 14:30 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
On Mon, Sep 22, 2008 at 06:36:51PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > > > I have got excellent results of dm-ioband, that controls the disk I/O
> > > > > bandwidth even when it accepts delayed write requests.
> > > > >
> > > > > In this time, I ran some benchmarks with a high-end storage. The
> > > > > reason was to avoid a performance bottleneck due to mechanical factors
> > > > > such as seek time.
> > > > >
> > > > > You can see the details of the benchmarks at:
> > > > > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> > >
> > > (snip)
> > >
> > > > Secondly, why do we have to create an additional dm-ioband device for
> > > > every device we want to control using rules. This looks little odd
> > > > atleast to me. Can't we keep it in line with rest of the controllers
> > > > where task grouping takes place using cgroup and rules are specified in
> > > > cgroup itself (The way Andrea Righi does for io-throttling patches)?
> > >
> > > It isn't essential dm-band is implemented as one of the device-mappers.
> > > I've been also considering that this algorithm itself can be implemented
> > > in the block layer directly.
> > >
> > > Although, the current implementation has merits. It is flexible.
> > > - Dm-ioband can be place anywhere you like, which may be right before
> > > the I/O schedulers or may be placed on top of LVM devices.
> >
> > Hi,
> >
> > An rb-tree per request queue also should be able to give us this
> > flexibility. Because logic is implemented per request queue, rules can be
> > placed at any layer. Either at bottom most layer where requests are
> > passed to elevator or at higher layer where requests will be passed to
> > lower level block devices in the stack. Just that we shall have to do
> > modifications to some of the higher level dm/md drivers to make use of
> > queuing cgroup requests and releasing cgroup requests to lower layers.
>
> Request descriptors are allocated just right before passing I/O requests
> to the elevators. Even if you move the descriptor allocation point
> before calling the dm/md drivers, the drivers can't make use of them.
>
You are right. Request descriptors are currently allocated at the
bottom-most layer. Anyway, in the rb-tree we put bio-cgroups as logical
elements, and every bio-cgroup then contains a list of either bios or
request descriptors. So what kind of list a bio-cgroup maintains can
depend on whether it is a higher-layer driver (which will maintain bios)
or a lower-layer driver (which will maintain a list of request descriptors
per bio-cgroup).
So basically the mechanism of maintaining an rb-tree can be completely
ignorant of whether a driver is keeping track of bios or keeping track of
requests per cgroup.
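As a bare-bones sketch (all names invented), such a per-queue tree node
might look like:

    /* One node per cgroup per request queue; which of the two members is
     * used depends on the layer, as described above. */
    struct biog_node {
            struct rb_node    rb;         /* linked into the queue's rb-tree  */
            struct cgroup    *cgrp;       /* the cgroup this node represents  */
            unsigned int      weight;     /* proportional share of the queue  */
            struct bio       *bio_head,   /* higher layers: buffered bios,    */
                             *bio_tail;   /* chained via bio->bi_next         */
            struct list_head  requests;   /* bottom layer: buffered requests  */
    };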
> When one of the dm drivers accepts a I/O request, the request
> won't have either a real device number or a real sector number.
> The request will be re-mapped to another sector of another device
> in every dm drivers. The request may even be replicated there.
> So it is really hard to find the right request queue to put
> the request into and sort them on the queue.
Hmm.., I thought that all the incoming requests to a dm/md driver would
remain in a single queue maintained by that driver (irrespective of which
request queues these requests end up in at lower layers after replication
or other operations). I am not very familiar with the dm/md
implementation. I will read more about it....
>
> > > - It supports partition based bandwidth control which can work without
> > > cgroups, which is quite easy to use of.
> >
> > > - It is independent to any I/O schedulers including ones which will
> > > be introduced in the future.
> >
> > This scheme should also be independent of any of the IO schedulers. We
> > might have to do small changes in IO-schedulers to decouple the things
> > from __make_request() a bit to insert rb-tree in between __make_request()
> > and IO-scheduler. Otherwise fundamentally, this approach should not
> > require any major modifications to IO-schedulers.
> >
> > >
> > > I also understand it's will be hard to set up without some tools
> > > such as lvm commands.
> > >
> >
> > That's something I wish to avoid. If we can keep it simple by doing
> > grouping using cgroup and allow one line rules in cgroup it would be nice.
>
> It's possible the algorithm of dm-ioband can be placed in the block layer
> if it is really a big problem.
> But I doubt it can control every control block I/O as we wish since
> the interface the cgroup supports is quite poor.
I had a question regarding the cgroup interface. I am assuming that in a
system one will be using other controllers as well, apart from the
IO-controller. Those controllers will be using cgroup as the grouping
mechanism.
Now coming up with an additional grouping mechanism only for the
io-controller seems a little odd to me. It will make the job of
higher-level management software harder.
Looking at the dm-ioband grouping examples given in the patches, I think
the cases of grouping based on pid, pgrp, uid and kvm can be handled by
creating the right cgroups and making sure applications are launched/moved
into the right cgroup by user-space tools.
I think keeping the grouping mechanism in line with the rest of the
controllers should help, because a uniform grouping mechanism should make
life simpler.
I am not very sure about moving the dm-ioband algorithm into the block
layer. It looks like it would make life simpler at least in terms of
configuration.
>
> > > > To avoid creation of stacking another device (dm-ioband) on top of every
> > > > device we want to subject to rules, I was thinking of maintaining an
> > > > rb-tree per request queue. Requests will first go into this rb-tree upon
> > > > __make_request() and then will filter down to elevator associated with the
> > > > queue (if there is one). This will provide us the control of releasing
> > > > bio's to elevaor based on policies (proportional weight, max bandwidth
> > > > etc) and no need of stacking additional block device.
> > >
> > > I think it's a bit late to control I/O requests there, since process
> > > may be blocked in get_request_wait when the I/O load is high.
> > > Please imagine the situation that cgroups with low bandwidths are
> > > consuming most of "struct request"s while another cgroup with a high
> > > bandwidth is blocked and can't get enough "struct request"s.
> > >
> > > It means cgroups that issues lot of I/O request can win the game.
> > >
> >
> > Ok, this is a good point. Because number of struct requests are limited
> > and they seem to be allocated on first come first serve basis, so if a
> > cgroup is generating lot of IO, then it might win.
> >
> > But dm-ioband will face the same issue.
>
> Nope. Dm-ioband doesn't have this issue since it works before allocating
> the descriptors. Only I/O requests dm-ioband has passed can allocate its
> descriptor.
>
Ok. Got it. Dm-ioband does not block on allocation of request descriptors.
It does seem to block in prevent_burst_bios(), but that is per group, so
it should be fine.
That means that at the lower layers one will have to do request descriptor
allocation as per the cgroup weight, to make sure a cgroup with a lower
weight does not get a higher share of the disk just because it is
generating more requests.
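A hand-wavy sketch of what that could mean in get_request(): cap each
group at its weighted share of q->nr_requests (the biog_* names and the
rq_in_flight field are invented; today's get_request() only enforces the
queue-wide limit):

    /* Sketch only: would be consulted before allocating a descriptor. */
    static bool biog_may_alloc_request(struct request_queue *q,
                                       struct biog_node *node)
    {
            unsigned long share = q->nr_requests * node->weight /
                                  biog_total_weight(q);         /* invented */

            /* A group may hold at most its weighted share of descriptors;
             * past that it waits, so a low-weight group flooding the queue
             * cannot starve a high-weight group of struct requests. */
            return node->rq_in_flight < share;
    }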
One additional issue with my scheme I just noticed is that I am putting
bio-cgroups in the rb-tree. If there are stacked devices, then bios/requests
from the same cgroup can be at multiple levels of processing at the same
time. That would mean that a single cgroup needs to be in multiple rb-trees
at the same time in various layers. So I might have to create a temporary
object which is associated with the cgroup and get rid of that object once
I don't have its requests any more...
Well, implementing an rb-tree per request queue seems to be harder than I
had thought, especially taking care of decoupling the elevator and request
descriptor logic at the lower layers. Long way to go..
Thanks
Vivek
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080922143042.GA19222@redhat.com>
@ 2008-09-24 8:29 ` Hirokazu Takahashi
2008-09-24 10:18 ` Hirokazu Takahashi
` (4 subsequent siblings)
5 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-24 8:29 UTC (permalink / raw)
To: vgoyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
Hi,
> > > > > > I have got excellent results of dm-ioband, that controls the disk I/O
> > > > > > bandwidth even when it accepts delayed write requests.
> > > > > >
> > > > > > In this time, I ran some benchmarks with a high-end storage. The
> > > > > > reason was to avoid a performance bottleneck due to mechanical factors
> > > > > > such as seek time.
> > > > > >
> > > > > > You can see the details of the benchmarks at:
> > > > > > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> > > >
> > > > (snip)
> > > >
> > > > > Secondly, why do we have to create an additional dm-ioband device for
> > > > > every device we want to control using rules. This looks little odd
> > > > > atleast to me. Can't we keep it in line with rest of the controllers
> > > > > where task grouping takes place using cgroup and rules are specified in
> > > > > cgroup itself (The way Andrea Righi does for io-throttling patches)?
> > > >
> > > > It isn't essential dm-band is implemented as one of the device-mappers.
> > > > I've been also considering that this algorithm itself can be implemented
> > > > in the block layer directly.
> > > >
> > > > Although, the current implementation has merits. It is flexible.
> > > > - Dm-ioband can be place anywhere you like, which may be right before
> > > > the I/O schedulers or may be placed on top of LVM devices.
> > >
> > > Hi,
> > >
> > > An rb-tree per request queue also should be able to give us this
> > > flexibility. Because logic is implemented per request queue, rules can be
> > > placed at any layer. Either at bottom most layer where requests are
> > > passed to elevator or at higher layer where requests will be passed to
> > > lower level block devices in the stack. Just that we shall have to do
> > > modifications to some of the higher level dm/md drivers to make use of
> > > queuing cgroup requests and releasing cgroup requests to lower layers.
> >
> > Request descriptors are allocated just right before passing I/O requests
> > to the elevators. Even if you move the descriptor allocation point
> > before calling the dm/md drivers, the drivers can't make use of them.
> >
>
> You are right. request descriptors are currently allocated at bottom
> most layer. Anyway, in the rb-tree, we put bio cgroups as logical elements
> and every bio cgroup then contains the list of either bios or requeust
> descriptors. So what kind of list bio-cgroup maintains can depend on
> whether it is a higher layer driver (will maintain bios) or a lower layer
> driver (will maintain list of request descriptors per bio-cgroup).
I'm getting confused about your idea.
I thought you wanted to make each cgroup have its own rb-tree,
and wanted to make all the layers share the same rb-tree.
If so, are you going to put different things into the same tree?
Do you even want all the I/O schedulers to use the same tree?
Are you going to block request descriptors in the tree?
From the viewpoint of performance, all the request descriptors
should be passed to the I/O schedulers, since the maximum number
of request descriptors is limited.
And I still don't understand: if you want to make your rb-tree
work efficiently, you need to put a lot of bios or request descriptors
into the tree. Is that what you are going to do?
On the other hand, dm-ioband tries to minimize the number of bios it
blocks, and I have a plan to reduce the maximum number that can be
blocked there.
Sorry to bother you; I just don't understand the concept clearly.
> So basically mechanism of maintaining an rb-tree can be completely
> ignorant of the fact whether a driver is keeping track of bios or keeping
> track of requests per cgroup.
I don't care whether the queue is implemented as an rb-tree or some
kind of list, because they are logically the same thing.
> > When one of the dm drivers accepts a I/O request, the request
> > won't have either a real device number or a real sector number.
> > The request will be re-mapped to another sector of another device
> > in every dm drivers. The request may even be replicated there.
> > So it is really hard to find the right request queue to put
> > the request into and sort them on the queue.
>
> Hmm.., I thought that all the incoming requests to dm/md driver will
> remain in a single queue maintained by that drvier (irrespective of the
> fact in which request queue these requests go in lower layers after
> replication or other operation). I am not very familiar with dm/md
> implementation. I will read more about it....
They never look into the queues maintained inside the drivers.
Some of them have their own little queues and others don't.
Some may just modify the sector numbers of I/O requests, or may
create new I/O requests themselves. Others, such as md-raid5,
have their own queues to control I/O, where a write request may
cause several read requests and has to wait for their completion
before the actual write starts.
Thanks,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080922143042.GA19222@redhat.com>
2008-09-24 8:29 ` Hirokazu Takahashi
@ 2008-09-24 10:18 ` Hirokazu Takahashi
2008-09-24 10:34 ` Hirokazu Takahashi
` (3 subsequent siblings)
5 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-24 10:18 UTC (permalink / raw)
To: vgoyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
Hi,
> > > > > To avoid creation of stacking another device (dm-ioband) on top of every
> > > > > device we want to subject to rules, I was thinking of maintaining an
> > > > > rb-tree per request queue. Requests will first go into this rb-tree upon
> > > > > __make_request() and then will filter down to elevator associated with the
> > > > > queue (if there is one). This will provide us the control of releasing
> > > > > bio's to elevaor based on policies (proportional weight, max bandwidth
> > > > > etc) and no need of stacking additional block device.
> > > >
> > > > I think it's a bit late to control I/O requests there, since process
> > > > may be blocked in get_request_wait when the I/O load is high.
> > > > Please imagine the situation that cgroups with low bandwidths are
> > > > consuming most of "struct request"s while another cgroup with a high
> > > > bandwidth is blocked and can't get enough "struct request"s.
> > > >
> > > > It means cgroups that issues lot of I/O request can win the game.
> > > >
> > >
> > > Ok, this is a good point. Because number of struct requests are limited
> > > and they seem to be allocated on first come first serve basis, so if a
> > > cgroup is generating lot of IO, then it might win.
> > >
> > > But dm-ioband will face the same issue.
> >
> > Nope. Dm-ioband doesn't have this issue since it works before allocating
> > the descriptors. Only I/O requests dm-ioband has passed can allocate its
> > descriptor.
> >
>
> Ok. Got it. dm-ioband does not block on allocation of request descriptors.
> It does seem to be blocking in prevent_burst_bios() but that would be
> per group so it should be fine.
Yes. There is also another small mechanism: prevent_burst_bios()
tries not to block kernel threads if possible.
> That means for lower layers, one shall have to do request descritor
> allocation as per the cgroup weight to make sure a cgroup with lower
> weight does not get higher % of disk because it is generating more
> requests.
Yes. But when cgroups with higher weight aren't issuing a lot of I/O,
even a cgroup with lower weight can allocate a lot of request descriptors.
> One additional issue with my scheme I just noticed is that I am putting
> bio-cgroup in rb-tree. If there are stacked devices then bio/requests from
> same cgroup can be at multiple levels of processing at same time. That
> would mean that a single cgroup needs to be in multiple rb-trees at the
> same time in various layers. So I might have to create a temporary object
> which can associate with cgroup and get rid of that object once I don't
> have the requests any more...
You mean each layer should have its own rb-tree? Is it per device?
One lvm logical volume will probably consist of several physical
volumes, which may be shared with other logical volumes.
And some layers may split one bio into several bios.
I can hardly imagine what these structures will look like.
But I guess it is a good thing that we are going to support
a general infrastructure for I/O requests.
> Well, implementing rb-tree per request queue seems to be harder than I
> had thought. Especially taking care of decoupling the elevator and reqeust
> descriptor logic at lower layers. Long way to go..
Thanks,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080922143042.GA19222@redhat.com>
2008-09-24 8:29 ` Hirokazu Takahashi
2008-09-24 10:18 ` Hirokazu Takahashi
@ 2008-09-24 10:34 ` Hirokazu Takahashi
[not found] ` <20080924.193414.22923673.taka@valinux.co.jp>
` (2 subsequent siblings)
5 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-24 10:34 UTC (permalink / raw)
To: vgoyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
Hi,
> > It's possible the algorithm of dm-ioband can be placed in the block layer
> > if it is really a big problem.
> > But I doubt it can control every control block I/O as we wish since
> > the interface the cgroup supports is quite poor.
>
> Had a question regarding cgroup interface. I am assuming that in a system,
> one will be using other controllers as well apart from IO-controller.
> Other controllers will be using cgroup as a grouping mechanism.
> Now coming up with additional grouping mechanism for only io-controller seems
> little odd to me. It will make the job of higher level management software
> harder.
>
> Looking at the dm-ioband grouping examples given in patches, I think cases
> of grouping based in pid, pgrp, uid and kvm can be handled by creating right
> cgroup and making sure applications are launched/moved into right cgroup by
> user space tools.
Grouping by pid, pgrp and uid is not the point; I've been thinking it
can be replaced with cgroups once the implementation of bio-cgroup is done.
I think the problem with cgroups is that they can't handle lots of storage
and hotplug devices; they just treat them as if they were a single resource.
I don't insist that the interface of dm-ioband is the best. I just hope
the cgroup infrastructure will support this kind of resource.
> I think keeping grouping mechanism in line with rest of the controllers
> should help because a uniform grouping mechanism should make life simpler.
>
> I am not very sure about moving dm-ioband algorithm in block layer. Looks
> like it will make life simpler at least in terms of configuration.
Thanks,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [Xen-devel] Re: [dm-devel] Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080919.123405.91829935.taka@valinux.co.jp>
2008-09-20 4:27 ` KAMEZAWA Hiroyuki
[not found] ` <20080920132703.e74c8f89.kamezawa.hiroyu@jp.fujitsu.com>
@ 2008-09-24 11:04 ` Balbir Singh
[not found] ` <661de9470809240404i62300942o15337ecec335fe22@mail.gmail.com>
3 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2008-09-24 11:04 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, vgoyal, righi.andrea
On Fri, Sep 19, 2008 at 9:04 AM, Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> Hi,
>
> > >> Vivek Goyal wrote:
> > >>> On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
> > >>>> Hi All,
> > >>>>
> > >>>> I have got excellent results of dm-ioband, that controls the disk
> I/O
> > >>>> bandwidth even when it accepts delayed write requests.
> > >>>>
> > >>>> In this time, I ran some benchmarks with a high-end storage. The
> > >>>> reason was to avoid a performance bottleneck due to mechanical
> factors
> > >>>> such as seek time.
> > >>>>
> > >>>> You can see the details of the benchmarks at:
> > >>>> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> > >>>>
> > >>> Hi Ryo,
> > >>>
> > >>> I had a query about dm-ioband patches. IIUC, dm-ioband patches will
> break
> > >>> the notion of process priority in CFQ because now dm-ioband device
> will
> > >>> hold the bio and issue these to lower layers later based on which
> bio's
> > >>> become ready. Hence actual bio submitting context might be different
> and
> > >>> because cfq derives the io_context from current task, it will be
> broken.
> > >>>
> > >>> To mitigate that problem, we probably need to implement Fernando's
> > >>> suggestion of putting io_context pointer in bio.
> > >>>
> > >>> Have you already done something to solve this issue?
> > >>>
> > >>> Secondly, why do we have to create an additional dm-ioband device for
> > >>> every device we want to control using rules. This looks little odd
> > >>> atleast to me. Can't we keep it in line with rest of the controllers
> > >>> where task grouping takes place using cgroup and rules are specified
> in
> > >>> cgroup itself (The way Andrea Righi does for io-throttling patches)?
> > >>>
> > >>> To avoid creation of stacking another device (dm-ioband) on top of
> every
> > >>> device we want to subject to rules, I was thinking of maintaining an
> > >>> rb-tree per request queue. Requests will first go into this rb-tree
> upon
> > >>> __make_request() and then will filter down to elevator associated
> with the
> > >>> queue (if there is one). This will provide us the control of
> releasing
> > >>> bio's to elevaor based on policies (proportional weight, max
> bandwidth
> > >>> etc) and no need of stacking additional block device.
> > >>>
> > >>> I am working on some experimental proof of concept patches. It will
> take
> > >>> some time though.
> > >>>
> > >>> I was thinking of following.
> > >>>
> > >>> - Adopt the Andrea Righi's style of specifying rules for devices and
> > >>> group the tasks using cgroups.
> > >>>
> > >>> - To begin with, adopt dm-ioband's approach of proportional bandwidth
> > >>> controller. It makes sense to me limit the bandwidth usage only in
> > >>> case of contention. If there is really a need to limit max
> bandwidth,
> > >>> then probably we can do something to implement additional rules or
> > >>> implement some policy switcher where user can decide what kind of
> > >>> policies need to be implemented.
> > >>>
> > >>> - Get rid of dm-ioband and instead buffer requests on an rb-tree on
> every
> > >>> request queue which is controlled by some kind of cgroup rules.
> > >>>
> > >>> It would be good to discuss above approach now whether it makes sense
> or
> > >>> not. I think it is kind of fusion of io-throttling and dm-ioband
> patches
> > >>> with additional idea of doing io-control just above elevator on the
> request
> > >>> queue using an rb-tree.
> > >> Thanks Vivek. All sounds reasonable to me and I think this is be the
> right way
> > >> to proceed.
> > >>
> > >> I'll try to design and implement your rb-tree per request-queue idea
> into my
> > >> io-throttle controller, maybe we can reuse it also for a more generic
> solution.
> > >> Feel free to send me your experimental proof of concept if you want,
> even if
> > >> it's not yet complete, I can review it, test and contribute.
> > >
> > > Currently I have taken code from bio-cgroup to implement cgroups and to
> > > provide functionality to associate a bio to a cgroup. I need this to be
> > > able to queue the bio's at right node in the rb-tree and then also to
> be
> > > able to take a decision when is the right time to release few requests.
> > >
> > > Right now in crude implementation, I am working on making system boot.
> > > Once patches are at least in little bit working shape, I will send it
> to you
> > > to have a look.
> > >
> > > Thanks
> > > Vivek
> >
> > I wonder... wouldn't be simpler to just use the memory controller
> > to retrieve this information starting from struct page?
> >
> > I mean, following this path (in short, obviously using the appropriate
> > interfaces for locking and referencing the different objects):
> >
> > cgrp = page->page_cgroup->mem_cgroup->css.cgroup
> >
> > Once you get the cgrp it's very easy to use the corresponding controller
> > structure.
> >
> > Actually, this is how I'm doing in cgroup-io-throttle to associate a bio
> > to a cgroup. What other functionalities/advantages bio-cgroup provide in
> > addition to that?
>
> I've decided to get Ryo to post the accurate dirty-page tracking patch
> for bio-cgroup, which isn't perfect yet though. The memory controller
> never wants to support this tracking because migrating a page between
> memory cgroups is really heavy.
>
It depends on the migration. The cost is proportional to the number of
pages moved. The cost can be brought down (I do have a design on paper --
from long ago) by moving mm's instead, which would reduce the cost of
migration but adds an additional dereference in the common path.
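One way to picture that trade-off (purely a sketch of the on-paper idea,
not code from any posted patch): instead of each page_cgroup pointing
straight at its mem_cgroup, it would point at a small shared per-mm object,
so moving an mm between cgroups updates one pointer instead of every page.

    /* Current layout (simplified): one load to find the group. */
    struct page_cgroup {
            struct mem_cgroup *mem_cgroup;
            /* ... */
    };

    /* Sketched "move the mm, not the pages" layout (hypothetical):
     * pc->owner->mem_cgroup costs one extra dereference on every
     * charge/uncharge, but re-pointing owner->mem_cgroup migrates the
     * whole mm at once. */
    struct mm_owner {
            struct mem_cgroup *mem_cgroup;
    };

    struct page_cgroup_alt {
            struct mm_owner *owner;
            /* ... */
    };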
>
> I also thought enhancing the memory controller would be good enough,
> but a lot of people said they wanted to control memory resource and
> block I/O resource separately.
Yes, ideally we do want that.
>
> So you can create several bio-cgroup in one memory-cgroup,
> or you can use bio-cgroup without memory-cgroup.
>
> I also have a plan to implement more acurate tracking mechanism
> on bio-cgroup after the memory cgroup team re-implement the infrastructure,
> which won't be supported by memory-cgroup.
> When a process are moved into another memory cgroup,
> the pages belonging to the process don't move to the new cgroup
> because migrating pages is so heavy. It's hard to find the pages
> from the process and migrating pages may cause some memory pressure.
> I'll implement this feature only on bio-cgroup with minimum overhead
>
Kamezawa has also wanted the page migration feature, and we've agreed to
provide a per-cgroup flag to turn migration on/off. I would not mind
refactoring memcontrol.c if that can help the IO controller; if you want
migration, force the migration flag on and warn the user if they try to
turn it off.
Balbir
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [Xen-devel] Re: [dm-devel] Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <661de9470809240404i62300942o15337ecec335fe22@mail.gmail.com>
@ 2008-09-24 11:07 ` Balbir Singh
[not found] ` <661de9470809240407m7f50b6dav897fef3b37295bb2@mail.gmail.com>
1 sibling, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2008-09-24 11:07 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, vgoyal, righi.andrea
On Wed, Sep 24, 2008 at 4:34 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>
> On Fri, Sep 19, 2008 at 9:04 AM, Hirokazu Takahashi <taka@valinux.co.jp>
> wrote:
>>
>> Hi,
>>
>> > >> Vivek Goyal wrote:
>> > >>> On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
>> > >>>> Hi All,
>> > >>>>
>> > >>>> I have got excellent results of dm-ioband, that controls the disk
>> > >>>> I/O
>> > >>>> bandwidth even when it accepts delayed write requests.
>> > >>>>
>> > >>>> In this time, I ran some benchmarks with a high-end storage. The
>> > >>>> reason was to avoid a performance bottleneck due to mechanical
>> > >>>> factors
>> > >>>> such as seek time.
>> > >>>>
>> > >>>> You can see the details of the benchmarks at:
>> > >>>> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
>> > >>>>
>> > >>> Hi Ryo,
>> > >>>
>> > >>> I had a query about dm-ioband patches. IIUC, dm-ioband patches will
>> > >>> break
>> > >>> the notion of process priority in CFQ because now dm-ioband device
>> > >>> will
>> > >>> hold the bio and issue these to lower layers later based on which
>> > >>> bio's
>> > >>> become ready. Hence actual bio submitting context might be different
>> > >>> and
>> > >>> because cfq derives the io_context from current task, it will be
>> > >>> broken.
>> > >>>
>> > >>> To mitigate that problem, we probably need to implement Fernando's
>> > >>> suggestion of putting io_context pointer in bio.
>> > >>>
>> > >>> Have you already done something to solve this issue?
>> > >>>
>> > >>> Secondly, why do we have to create an additional dm-ioband device
>> > >>> for
>> > >>> every device we want to control using rules. This looks little odd
>> > >>> atleast to me. Can't we keep it in line with rest of the controllers
>> > >>> where task grouping takes place using cgroup and rules are specified
>> > >>> in
>> > >>> cgroup itself (The way Andrea Righi does for io-throttling patches)?
>> > >>>
>> > >>> To avoid creation of stacking another device (dm-ioband) on top of
>> > >>> every
>> > >>> device we want to subject to rules, I was thinking of maintaining an
>> > >>> rb-tree per request queue. Requests will first go into this rb-tree
>> > >>> upon
>> > >>> __make_request() and then will filter down to elevator associated
>> > >>> with the
>> > >>> queue (if there is one). This will provide us the control of
>> > >>> releasing
>> > >>> bio's to elevaor based on policies (proportional weight, max
>> > >>> bandwidth
>> > >>> etc) and no need of stacking additional block device.
>> > >>>
>> > >>> I am working on some experimental proof of concept patches. It will
>> > >>> take
>> > >>> some time though.
>> > >>>
>> > >>> I was thinking of following.
>> > >>>
>> > >>> - Adopt the Andrea Righi's style of specifying rules for devices and
>> > >>> group the tasks using cgroups.
>> > >>>
>> > >>> - To begin with, adopt dm-ioband's approach of proportional
>> > >>> bandwidth
>> > >>> controller. It makes sense to me limit the bandwidth usage only in
>> > >>> case of contention. If there is really a need to limit max
>> > >>> bandwidth,
>> > >>> then probably we can do something to implement additional rules or
>> > >>> implement some policy switcher where user can decide what kind of
>> > >>> policies need to be implemented.
>> > >>>
>> > >>> - Get rid of dm-ioband and instead buffer requests on an rb-tree on
>> > >>> every
>> > >>> request queue which is controlled by some kind of cgroup rules.
>> > >>>
>> > >>> It would be good to discuss above approach now whether it makes
>> > >>> sense or
>> > >>> not. I think it is kind of fusion of io-throttling and dm-ioband
>> > >>> patches
>> > >>> with additional idea of doing io-control just above elevator on the
>> > >>> request
>> > >>> queue using an rb-tree.
>> > >> Thanks Vivek. All sounds reasonable to me and I think this is be the
>> > >> right way
>> > >> to proceed.
>> > >>
>> > >> I'll try to design and implement your rb-tree per request-queue idea
>> > >> into my
>> > >> io-throttle controller, maybe we can reuse it also for a more generic
>> > >> solution.
>> > >> Feel free to send me your experimental proof of concept if you want,
>> > >> even if
>> > >> it's not yet complete, I can review it, test and contribute.
>> > >
>> > > Currently I have taken code from bio-cgroup to implement cgroups and
>> > > to
>> > > provide functionality to associate a bio to a cgroup. I need this to
>> > > be
>> > > able to queue the bio's at right node in the rb-tree and then also to
>> > > be
>> > > able to take a decision when is the right time to release few
>> > > requests.
>> > >
>> > > Right now in crude implementation, I am working on making system boot.
>> > > Once patches are at least in little bit working shape, I will send it
>> > > to you
>> > > to have a look.
>> > >
>> > > Thanks
>> > > Vivek
>> >
>> > I wonder... wouldn't be simpler to just use the memory controller
>> > to retrieve this information starting from struct page?
>> >
>> > I mean, following this path (in short, obviously using the appropriate
>> > interfaces for locking and referencing the different objects):
>> >
>> > cgrp = page->page_cgroup->mem_cgroup->css.cgroup
>> >
>> > Once you get the cgrp it's very easy to use the corresponding controller
>> > structure.
>> >
>> > Actually, this is how I'm doing in cgroup-io-throttle to associate a bio
>> > to a cgroup. What other functionalities/advantages bio-cgroup provide in
>> > addition to that?
>>
>> I've decided to get Ryo to post the accurate dirty-page tracking patch
>> for bio-cgroup, which isn't perfect yet though. The memory controller
>> never wants to support this tracking because migrating a page between
>> memory cgroups is really heavy.
It depends on the migration. The cost is proportional to the number of
pages moved. The cost can be brought down (I do have a design on
paper -- from long ago) by moving mm's instead, which would reduce the
cost of migration but adds an additional dereference in the common
path.
>
>>
>> I also thought enhancing the memory controller would be good enough,
>> but a lot of people said they wanted to control memory resource and
>> block I/O resource separately.
>
> Yes, ideally we do want that.
>
>>
>> So you can create several bio-cgroup in one memory-cgroup,
>> or you can use bio-cgroup without memory-cgroup.
>>
>> I also have a plan to implement more acurate tracking mechanism
>> on bio-cgroup after the memory cgroup team re-implement the
>> infrastructure,
>> which won't be supported by memory-cgroup.
>> When a process are moved into another memory cgroup,
>> the pages belonging to the process don't move to the new cgroup
>> because migrating pages is so heavy. It's hard to find the pages
>> from the process and migrating pages may cause some memory pressure.
>> I'll implement this feature only on bio-cgroup with minimum overhead
>
Kamezawa has also wanted the page migration feature, and we've agreed
to provide a per-cgroup flag to turn migration on/off. I would not
mind refactoring memcontrol.c if that can help the IO controller; if
you want migration, force the migration flag on and warn the user if
they try to turn it off.
Balbir
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080924.193414.22923673.taka@valinux.co.jp>
@ 2008-09-24 12:38 ` Balbir Singh
2008-09-24 14:53 ` Vivek Goyal
[not found] ` <20080924145331.GD547@redhat.com>
2 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2008-09-24 12:38 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, righi.andrea, fernando, vgoyal, xemul
Hirokazu Takahashi wrote:
> Hi,
>
>>> It's possible the algorithm of dm-ioband can be placed in the block layer
>>> if it is really a big problem.
>>> But I doubt it can control every control block I/O as we wish since
>>> the interface the cgroup supports is quite poor.
>> Had a question regarding cgroup interface. I am assuming that in a system,
>> one will be using other controllers as well apart from IO-controller.
>> Other controllers will be using cgroup as a grouping mechanism.
>> Now coming up with additional grouping mechanism for only io-controller seems
>> little odd to me. It will make the job of higher level management software
>> harder.
>>
>> Looking at the dm-ioband grouping examples given in patches, I think cases
>> of grouping based in pid, pgrp, uid and kvm can be handled by creating right
>> cgroup and making sure applications are launched/moved into right cgroup by
>> user space tools.
>
> Grouping in pid, pgrp and uid is not the point, which I've been thinking
> can be replaced with cgroup once the implementation of bio-cgroup is done.
>
> I think problems of cgroup are that they can't support lots of storages
> and hotplug devices, it just handle them as if they were just one resource.
Could you elaborate on this please?
> I don't insist the interface of dm-ioband is the best. I just hope the
> cgroup infrastructure support this kind of resources.
>
What sort of support will help you?
--
Balbir
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080924.172937.72827863.taka@valinux.co.jp>
@ 2008-09-24 14:03 ` Vivek Goyal
[not found] ` <20080924140355.GB547@redhat.com>
1 sibling, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2008-09-24 14:03 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
On Wed, Sep 24, 2008 at 05:29:37PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > > > > > I have got excellent results of dm-ioband, that controls the disk I/O
> > > > > > > bandwidth even when it accepts delayed write requests.
> > > > > > >
> > > > > > > In this time, I ran some benchmarks with a high-end storage. The
> > > > > > > reason was to avoid a performance bottleneck due to mechanical factors
> > > > > > > such as seek time.
> > > > > > >
> > > > > > > You can see the details of the benchmarks at:
> > > > > > > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> > > > >
> > > > > (snip)
> > > > >
> > > > > > Secondly, why do we have to create an additional dm-ioband device for
> > > > > > every device we want to control using rules. This looks little odd
> > > > > > atleast to me. Can't we keep it in line with rest of the controllers
> > > > > > where task grouping takes place using cgroup and rules are specified in
> > > > > > cgroup itself (The way Andrea Righi does for io-throttling patches)?
> > > > >
> > > > > It isn't essential dm-band is implemented as one of the device-mappers.
> > > > > I've been also considering that this algorithm itself can be implemented
> > > > > in the block layer directly.
> > > > >
> > > > > Although, the current implementation has merits. It is flexible.
> > > > > - Dm-ioband can be place anywhere you like, which may be right before
> > > > > the I/O schedulers or may be placed on top of LVM devices.
> > > >
> > > > Hi,
> > > >
> > > > An rb-tree per request queue also should be able to give us this
> > > > flexibility. Because logic is implemented per request queue, rules can be
> > > > placed at any layer. Either at bottom most layer where requests are
> > > > passed to elevator or at higher layer where requests will be passed to
> > > > lower level block devices in the stack. Just that we shall have to do
> > > > modifications to some of the higher level dm/md drivers to make use of
> > > > queuing cgroup requests and releasing cgroup requests to lower layers.
> > >
> > > Request descriptors are allocated just right before passing I/O requests
> > > to the elevators. Even if you move the descriptor allocation point
> > > before calling the dm/md drivers, the drivers can't make use of them.
> > >
> >
> > You are right. request descriptors are currently allocated at bottom
> > most layer. Anyway, in the rb-tree, we put bio cgroups as logical elements
> > and every bio cgroup then contains the list of either bios or requeust
> > descriptors. So what kind of list bio-cgroup maintains can depend on
> > whether it is a higher layer driver (will maintain bios) or a lower layer
> > driver (will maintain list of request descriptors per bio-cgroup).
>
> I'm getting confused about your idea.
>
> I thought you wanted to make each cgroup have its own rb-tree,
> and wanted to make all the layers share the same rb-tree.
> If so, are you going to put different things into the same tree?
> Do you even want all the I/O schedlers use the same tree?
>
Ok, I will give more details of the thought process.
I was thinking of maintaining an rb-tree per request queue, not an
rb-tree per cgroup. This tree can contain all the bios submitted to that
request queue through __make_request(). Every node in the tree will
represent one cgroup and will contain a list of bios issued by the tasks
of that cgroup.
Every bio entering the request queue through the __make_request() function
will first be queued in one of the nodes in this rb-tree, depending on
which cgroup that bio belongs to.
Once the bios are buffered in the rb-tree, we release them to the
underlying elevator according to the proportionate weights of the
nodes/cgroups.
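In rough pseudo-kernel-C, the flow above would be something like the
following (every biog_* name is invented; only __make_request() and the
elevator are existing kernel concepts):

    /* Release side: weighted round robin over the per-cgroup nodes
     * buffered on this queue, handing bios down to the elevator. */
    static void biog_dispatch(struct request_queue *q)
    {
            struct biog_node *node;

            for_each_biog_node(node, q) {              /* invented iterator */
                    int quota = node->weight;          /* bios per round    */

                    while (quota--) {
                            struct bio *bio = biog_pop_bio(node);
                            if (!bio)
                                    break;
                            /* hand it to the layer below the buffer,
                             * i.e. the elevator for this queue */
                            biog_pass_to_elevator(q, bio);   /* stand-in */
                    }
            }
    }

    /* Entry side: __make_request() would park the bio on its cgroup's
     * node instead of handing it straight to the elevator. */
    static void biog_queue_bio(struct request_queue *q, struct bio *bio)
    {
            struct biog_node *node = biog_find_node(q, bio);  /* invented */

            biog_add_bio(node, bio);     /* chained via bio->bi_next       */
            biog_dispatch(q);            /* may release some buffered bios */
    }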
Some more details of what I was trying to implement yesterday.
There will be one bio_cgroup object per cgroup. This object will contain
many bio_group objects; a bio_group object will be created for each
request queue on which a bio from that bio_cgroup is queued. Essentially
the idea is that bios belonging to a cgroup can be on various request
queues in the system, so a single object cannot serve the purpose, as it
cannot be on many rb-trees at the same time. Hence create one sub-object
which will keep track of the bios belonging to one cgroup on a particular
request queue.
Each bio_group will contain a list of bios, and this bio_group object will
be a node in the rb-tree of a request queue. For example, let's say there
are two request queues in the system, q1 and q2 (say they belong to
/dev/sda and /dev/sdb), and a task t1 in /cgroup/io/test1 is issuing I/O
both to /dev/sda and to /dev/sdb.
The bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
bios issued by task t1 for /dev/sdb. I thought the same can be extended
to stacked devices as well.
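Written out as (sketch-only) data structures, the relationship described
above might be:

    /* Field names are invented; only the relationship mirrors the text. */
    struct bio_group {                        /* one per (cgroup, queue) pair */
            struct rb_node        rb;         /* node in that queue's rb-tree */
            struct request_queue *q;          /* e.g. q1 for /dev/sda         */
            struct bio_cgroup    *owner;      /* back-pointer to the cgroup   */
            struct bio           *head, *tail;/* bios buffered for this queue,
                                                 chained via bio->bi_next     */
            struct list_head      sibling;    /* linked on owner->groups      */
    };

    struct bio_cgroup {                       /* one per cgroup (e.g. test1)  */
            struct cgroup_subsys_state css;
            unsigned int              weight;
            struct list_head          groups; /* its per-queue bio_groups     */
    };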
I am still trying to implement it, and hopefully it is a doable idea.
I think at the end of the day it will be something very close to the
dm-ioband algorithm, just that there will be no lvm driver and no notion
of a separate dm-ioband device.
> Are you going to block request descriptors in the tree?
> >From the view point of performance, all the request descriptors
> should be passed to the I/O schedulers, since the maximum number
> of request descriptors is limited.
>
In my initial implementation I was queuing the request descriptors. Then
you mentioned that this is not a good idea because a cgroup issuing more
requests might win the race.
Last night I thought: then why not start queuing the bios as they are
submitted to the request_queue, in __make_request(), and then release
them to the underlying elevator or the underlying request queue (in the
case of a stacked device). This removes a few issues:
- All the layers can uniformly queue bios, with no intermixing of queued
bios and request descriptors.
- It gets rid of the issue of one cgroup winning the race because of the
limited number of request descriptors.
> And I still don't understand if you want to make your rb-tree
> work efficiently, you need to put a lot of bios or request descriptors
> into the tree. Is that what you are going to do?
> On the other hand, dm-ioband tries to minimize to have bios blocked.
> And I have a plan on reducing the maximum number that can be
> blocked there.
>
Now I am planning to queue bios, and there is probably no need to queue
request descriptors. I think that's what dm-ioband is doing: queueing
bios per cgroup per ioband device.
Thinking more about it, in the dm-ioband case you seem to be buffering bios
from various cgroups on a separate request queue belonging to the dm-ioband
device. I was thinking of moving all that buffering logic into the existing
request queues instead of creating another request queue (the dm-ioband
device) on top of the request queue I want to control.
> Sorry to bother you that I just don't understand the concept clearly.
>
> > So basically mechanism of maintaining an rb-tree can be completely
> > ignorant of the fact whether a driver is keeping track of bios or keeping
> > track of requests per cgroup.
>
> I don't care whether the queue is implemented as a rb-tee or some
> kind of list because they are logically the same thing.
That's true. rb-tree or list is just a data structure detail; it is not
important. The core thing I am trying to achieve is a way to get rid of
the notion of creating a separate dm-ioband device for every device I
want to control.
Is it just me who finds the creation of dm-ioband devices odd and
difficult to manage, or are there other people who think it would be
nice if we could get rid of it?
Thanks
Vivek
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080924.191803.100102323.taka@valinux.co.jp>
@ 2008-09-24 14:52 ` Vivek Goyal
[not found] ` <20080924145202.GC547@redhat.com>
1 sibling, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2008-09-24 14:52 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
On Wed, Sep 24, 2008 at 07:18:03PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > > > > To avoid creation of stacking another device (dm-ioband) on top of every
> > > > > > device we want to subject to rules, I was thinking of maintaining an
> > > > > > rb-tree per request queue. Requests will first go into this rb-tree upon
> > > > > > __make_request() and then will filter down to elevator associated with the
> > > > > > queue (if there is one). This will provide us the control of releasing
> > > > > > bio's to elevaor based on policies (proportional weight, max bandwidth
> > > > > > etc) and no need of stacking additional block device.
> > > > >
> > > > > I think it's a bit late to control I/O requests there, since process
> > > > > may be blocked in get_request_wait when the I/O load is high.
> > > > > Please imagine the situation that cgroups with low bandwidths are
> > > > > consuming most of "struct request"s while another cgroup with a high
> > > > > bandwidth is blocked and can't get enough "struct request"s.
> > > > >
> > > > > It means cgroups that issues lot of I/O request can win the game.
> > > > >
> > > >
> > > > Ok, this is a good point. Because number of struct requests are limited
> > > > and they seem to be allocated on first come first serve basis, so if a
> > > > cgroup is generating lot of IO, then it might win.
> > > >
> > > > But dm-ioband will face the same issue.
> > >
> > > Nope. Dm-ioband doesn't have this issue since it works before allocating
> > > the descriptors. Only I/O requests dm-ioband has passed can allocate its
> > > descriptor.
> > >
> >
> > Ok. Got it. dm-ioband does not block on allocation of request descriptors.
> > It does seem to be blocking in prevent_burst_bios() but that would be
> > per group so it should be fine.
>
> Yes. There is also another little mechanism that prevent_burst_bios()
> tries not to block kernel threads if possible.
>
> > That means for lower layers, one shall have to do request descritor
> > allocation as per the cgroup weight to make sure a cgroup with lower
> > weight does not get higher % of disk because it is generating more
> > requests.
>
> Yes. But when cgroups with higher weight aren't issueing a lot of I/Os,
> even a cgroup with lower weight can allocate a lot of request descriptors.
>
Ok. With this new thought, I am completely dropping the idea of queuing
the request descriptors. Now I am thinking of capturing the bios and
buffering them in the rb-tree as soon as they enter the request queue
through its associated request function. All request descriptor
allocation will come later, when the bios are actually released from the
rb-tree to the elevator. That way we should be able to get rid of this
issue.
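As a rough illustration of that capture point (the buffer, its helpers
and the attach step below are all assumptions, not existing block layer
interfaces):

#include <linux/blkdev.h>
#include <linux/bio.h>

/*
 * Sketch only: intercept every bio before any "struct request" is
 * allocated, park it in a per-queue weighted buffer, and feed it to the
 * original make_request function (__make_request for ordinary queues)
 * only when the proportional-weight logic releases it.
 *
 * io_buffer_add_bio(), io_buffer_next_bio() and bio_cgroup_from_bio()
 * are assumed helpers; the sketch also assumes a single controlled
 * queue for brevity.
 */
struct bio_cgroup;
extern void io_buffer_add_bio(struct request_queue *q,
			      struct bio_cgroup *bcg, struct bio *bio);
extern struct bio *io_buffer_next_bio(struct request_queue *q);
extern struct bio_cgroup *bio_cgroup_from_bio(struct bio *bio);

static make_request_fn *orig_mrf;	/* saved original request function */

static int io_capture_make_request(struct request_queue *q, struct bio *bio)
{
	/*
	 * No request descriptor is allocated here, so a cgroup flooding
	 * the device cannot exhaust the descriptors of other cgroups.
	 */
	io_buffer_add_bio(q, bio_cgroup_from_bio(bio), bio);
	return 0;
}

/* Called by the weighting logic when the queue may dispatch more IO. */
static void io_release_bios(struct request_queue *q)
{
	struct bio *bio;

	/* Descriptors get allocated only now, in the original path. */
	while ((bio = io_buffer_next_bio(q)) != NULL)
		orig_mrf(q, bio);
}

static void io_capture_attach(struct request_queue *q)
{
	orig_mrf = q->make_request_fn;		/* e.g. __make_request */
	q->make_request_fn = io_capture_make_request;
}

For a stacked device the same wrapper would sit in front of the logical
device's request function, so the driver underneath never sees a bio
until it has been released.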
> > One additional issue with my scheme I just noticed is that I am putting
> > bio-cgroup in rb-tree. If there are stacked devices then bio/requests from
> > same cgroup can be at multiple levels of processing at same time. That
> > would mean that a single cgroup needs to be in multiple rb-trees at the
> > same time in various layers. So I might have to create a temporary object
> > which can associate with cgroup and get rid of that object once I don't
> > have the requests any more...
>
> You mean each layer should have its rb-tree? Is it per device?
> One lvm logical volume may probably consist from several physical
> volumes, which will be shared with other logical volumes.
> And some layers may split one bio into several bios.
> I hardly can imagine how these structures will be.
>
Yes, one rb-tree per device, be it a physical device or a logical device
(because there is one request queue associated with each physical/logical
block device).
I was thinking of getting hold of / hijacking the bios as soon as they
are submitted to the device through its associated request function. So
if there is a logical device built on top of two physical devices, the
associated bio-copying or other logic should not even see a bio the
moment it is submitted to the device; it will see the bio only when it is
released to it from the associated rb-tree. Do you think this will not
work? To me this is what dm-ioband is doing logically; the only
difference is that it does it with the help of a separate request queue.
Thanks
Vivek
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080924.193414.22923673.taka@valinux.co.jp>
2008-09-24 12:38 ` Balbir Singh
@ 2008-09-24 14:53 ` Vivek Goyal
[not found] ` <20080924145331.GD547@redhat.com>
2 siblings, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2008-09-24 14:53 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
On Wed, Sep 24, 2008 at 07:34:14PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > It's possible the algorithm of dm-ioband can be placed in the block layer
> > > if it is really a big problem.
> > > But I doubt it can control every control block I/O as we wish since
> > > the interface the cgroup supports is quite poor.
> >
> > Had a question regarding cgroup interface. I am assuming that in a system,
> > one will be using other controllers as well apart from IO-controller.
> > Other controllers will be using cgroup as a grouping mechanism.
> > Now coming up with additional grouping mechanism for only io-controller seems
> > little odd to me. It will make the job of higher level management software
> > harder.
> >
> > Looking at the dm-ioband grouping examples given in patches, I think cases
> > of grouping based in pid, pgrp, uid and kvm can be handled by creating right
> > cgroup and making sure applications are launched/moved into right cgroup by
> > user space tools.
>
> Grouping in pid, pgrp and uid is not the point, which I've been thinking
> can be replaced with cgroup once the implementation of bio-cgroup is done.
>
> I think problems of cgroup are that they can't support lots of storages
> and hotplug devices, it just handle them as if they were just one resource.
> I don't insist the interface of dm-ioband is the best. I just hope the
> cgroup infrastructure support this kind of resources.
>
Sorry, I did not fully understand. Can you please explain in detail what
kind of situation will not be covered by the cgroup interface?
Thanks
Vivek
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [Xen-devel] Re: [dm-devel] Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <661de9470809240407m7f50b6dav897fef3b37295bb2@mail.gmail.com>
@ 2008-09-26 10:54 ` Hirokazu Takahashi
0 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-26 10:54 UTC (permalink / raw)
To: balbir
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, vgoyal, righi.andrea
Hi,
> >> > > Currently I have taken code from bio-cgroup to implement cgroups and
> >> > > to
> >> > > provide functionality to associate a bio to a cgroup. I need this to
> >> > > be
> >> > > able to queue the bio's at right node in the rb-tree and then also to
> >> > > be
> >> > > able to take a decision when is the right time to release few
> >> > > requests.
> >> > >
> >> > > Right now in crude implementation, I am working on making system boot.
> >> > > Once patches are at least in little bit working shape, I will send it
> >> > > to you
> >> > > to have a look.
> >> > >
> >> > > Thanks
> >> > > Vivek
> >> >
> >> > I wonder... wouldn't be simpler to just use the memory controller
> >> > to retrieve this information starting from struct page?
> >> >
> >> > I mean, following this path (in short, obviously using the appropriate
> >> > interfaces for locking and referencing the different objects):
> >> >
> >> > cgrp = page->page_cgroup->mem_cgroup->css.cgroup
> >> >
> >> > Once you get the cgrp it's very easy to use the corresponding controller
> >> > structure.
> >> >
> >> > Actually, this is how I'm doing in cgroup-io-throttle to associate a bio
> >> > to a cgroup. What other functionalities/advantages bio-cgroup provide in
> >> > addition to that?
> >>
> >> I've decided to get Ryo to post the accurate dirty-page tracking patch
> >> for bio-cgroup, which isn't perfect yet though. The memory controller
> >> never wants to support this tracking because migrating a page between
> >> memory cgroups is really heavy.
>
> It depends on the migration. The cost is proportional to the number of
> pages moved. The cost can be brought down (I do have a design on
> paper -- from long long ago), where moving mm's will reduce the cost
> of migration, but it adds an additional dereference in the common
> path.
Okay, this will help to track anonymous pages even after processes are
migrated between memory cgroups.
My remaining concern is pages in the pagecache, which might potentially
be dirtied by processes in other cgroups. I think bio-cgroup should also
handle this case.
> >> I also thought enhancing the memory controller would be good enough,
> >> but a lot of people said they wanted to control memory resource and
> >> block I/O resource separately.
> >
> > Yes, ideally we do want that.
> >
> >>
> >> So you can create several bio-cgroup in one memory-cgroup,
> >> or you can use bio-cgroup without memory-cgroup.
> >>
> >> I also have a plan to implement more acurate tracking mechanism
> >> on bio-cgroup after the memory cgroup team re-implement the
> >> infrastructure,
> >> which won't be supported by memory-cgroup.
> >> When a process are moved into another memory cgroup,
> >> the pages belonging to the process don't move to the new cgroup
> >> because migrating pages is so heavy. It's hard to find the pages
> >> from the process and migrating pages may cause some memory pressure.
> >> I'll implement this feature only on bio-cgroup with minimum overhead
> >
>
> Kamezawa has also wanted the page migration feature and we've agreed
> to provide a per-cgroup flag to decide to turn migration on/off. I
> would not mind refactoring memcontrol.c if that can help the IO
> controller and if you want migration, force the migration flag to on
> and warn the user if they try to turn it off.
Good news! But I've been wondering whether the IO controller should
have the same feature.
Once Kamezawa-san finishes implementing the new page_cgroup
infrastructure, which pre-allocates all the memory it needs, I think I
can minimize the cost of migrating pages between bio-cgroups, since this
migration won't cause any page reclaim, unlike that of memory-cgroup.
In that case I might design it so that it only moves pages between
bio-cgroups and won't move them between memory-cgroups.
Thanks,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080924145202.GC547@redhat.com>
@ 2008-09-26 12:42 ` Hirokazu Takahashi
0 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-26 12:42 UTC (permalink / raw)
To: vgoyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
Hi,
> > > One additional issue with my scheme I just noticed is that I am putting
> > > bio-cgroup in rb-tree. If there are stacked devices then bio/requests from
> > > same cgroup can be at multiple levels of processing at same time. That
> > > would mean that a single cgroup needs to be in multiple rb-trees at the
> > > same time in various layers. So I might have to create a temporary object
> > > which can associate with cgroup and get rid of that object once I don't
> > > have the requests any more...
> >
> > You mean each layer should have its rb-tree? Is it per device?
> > One lvm logical volume may probably consist from several physical
> > volumes, which will be shared with other logical volumes.
> > And some layers may split one bio into several bios.
> > I hardly can imagine how these structures will be.
> >
>
> Yes, one rb-tree per device, be it physical device or logical device
> (because there is one request queue associated per physical/logical block
> device).
No, logical block devices don't have any request queues, and they
essentially won't block any bios unless it is impossible to handle them
at the moment. Device-mapper devices never touch any request queues.
> I was thinking of getting hold/hijack the bios as soon as they are
> submitted to the device using associated request function. So if there
> is a logical device built on top of two physical device, the associated
> bio copy or other logic should not even see the bio the moment it is
> submitted to the deivce. It will see the bio only when it is released
> from associated rb-tree to them. Do you think this will not work? To me
> this is what dm-ioband is doing logically. The only difference is that it
> does this with the help of a separate request queue.
I think it's easy to just make all logical devices --- device-mapper
devices --- and all physical devices have their own bandwidth control
mechanisms.
But I'm not clear on how your algorithm works to control the bandwidth.
At which level are you going to guarantee the bandwidth: at the logical
volume layer, such as lvm, or at the physical device layer?
Thanks,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080924145331.GD547@redhat.com>
@ 2008-09-26 13:04 ` Hirokazu Takahashi
[not found] ` <20080926.220418.83079316.taka@valinux.co.jp>
1 sibling, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-26 13:04 UTC (permalink / raw)
To: vgoyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, righi.andrea, agk, xemul, fernando, balbir
Hi,
> > > > It's possible the algorithm of dm-ioband can be placed in the block layer
> > > > if it is really a big problem.
> > > > But I doubt it can control every control block I/O as we wish since
> > > > the interface the cgroup supports is quite poor.
> > >
> > > Had a question regarding cgroup interface. I am assuming that in a system,
> > > one will be using other controllers as well apart from IO-controller.
> > > Other controllers will be using cgroup as a grouping mechanism.
> > > Now coming up with additional grouping mechanism for only io-controller seems
> > > little odd to me. It will make the job of higher level management software
> > > harder.
> > >
> > > Looking at the dm-ioband grouping examples given in patches, I think cases
> > > of grouping based in pid, pgrp, uid and kvm can be handled by creating right
> > > cgroup and making sure applications are launched/moved into right cgroup by
> > > user space tools.
> >
> > Grouping in pid, pgrp and uid is not the point, which I've been thinking
> > can be replaced with cgroup once the implementation of bio-cgroup is done.
> >
> > I think problems of cgroup are that they can't support lots of storages
> > and hotplug devices, it just handle them as if they were just one resource.
> > I don't insist the interface of dm-ioband is the best. I just hope the
> > cgroup infrastructure support this kind of resources.
> >
>
> Sorry, I did not understand fully. Can you please explain in detail what
> kind of situation will not be covered by cgroup interface.
From the concept of the cgroup, if you want to control several disks
independently, you would have to make each disk have its own cgroup
subsystem, which can only be defined when compiling the kernel. This is
impossible because every Linux box has a different number of disks.
So you may think it is possible, as a workaround, to make each cgroup
have lots of control files, one per device. But control files cannot be
added or removed when devices are hot-added or hot-removed.
Thanks,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080926.220418.83079316.taka@valinux.co.jp>
@ 2008-09-26 15:56 ` Andrea Righi
[not found] ` <48DD0617.3050403@gmail.com>
1 sibling, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2008-09-26 15:56 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, vgoyal, balbir
Hirokazu Takahashi wrote:
> Hi,
>
>>>>> It's possible the algorithm of dm-ioband can be placed in the block layer
>>>>> if it is really a big problem.
>>>>> But I doubt it can control every control block I/O as we wish since
>>>>> the interface the cgroup supports is quite poor.
>>>> Had a question regarding cgroup interface. I am assuming that in a system,
>>>> one will be using other controllers as well apart from IO-controller.
>>>> Other controllers will be using cgroup as a grouping mechanism.
>>>> Now coming up with additional grouping mechanism for only io-controller seems
>>>> little odd to me. It will make the job of higher level management software
>>>> harder.
>>>>
>>>> Looking at the dm-ioband grouping examples given in patches, I think cases
>>>> of grouping based in pid, pgrp, uid and kvm can be handled by creating right
>>>> cgroup and making sure applications are launched/moved into right cgroup by
>>>> user space tools.
>>> Grouping in pid, pgrp and uid is not the point, which I've been thinking
>>> can be replaced with cgroup once the implementation of bio-cgroup is done.
>>>
>>> I think problems of cgroup are that they can't support lots of storages
>>> and hotplug devices, it just handle them as if they were just one resource.
>>> I don't insist the interface of dm-ioband is the best. I just hope the
>>> cgroup infrastructure support this kind of resources.
>>>
>> Sorry, I did not understand fully. Can you please explain in detail what
>> kind of situation will not be covered by cgroup interface.
>
> From the concept of the cgroup, if you want control several disks
> independently, you should make each disk have its own cgroup subsystem,
> which only can be defined when compiling the kernel. This is impossible
> because every linux box has various number of disks.
Mmh? Not true. You can define a single cgroup subsystem that implements
the appropriate interfaces for your type of control, and use many
dynamically allocated structures, one for each controlled object (each
block device, disk, partition, ... or any other grouping/splitting
policy). Actually, this is how cgroup-io-throttle, as well as any other
cgroup subsystem, is implemented.
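For illustration, a minimal sketch of that approach --- a single cgroup
subsystem whose per-cgroup state holds dynamically allocated per-device
rules, filled in through one control file that accepts "major:minor
value" lines. All names below are placeholders, not actual io-throttle
code:

#include <linux/cgroup.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/kdev_t.h>

/* One rule per controlled device, allocated on demand. */
struct iorule {
	struct list_head node;
	dev_t dev;			/* device this rule applies to */
	unsigned int value;		/* weight or max-bandwidth limit */
};

struct io_cgroup {
	struct cgroup_subsys_state css;
	struct list_head rules;		/* per-device rules of this cgroup */
	spinlock_t lock;
};

/* io_subsys_id is assumed to be this controller's cgroup subsystem id. */
static inline struct io_cgroup *cgroup_to_iocg(struct cgroup *cgrp)
{
	return container_of(cgroup_subsys_state(cgrp, io_subsys_id),
			    struct io_cgroup, css);
}

/* Write handler of the single control file, e.g. "echo 8:0 100 > io.rule". */
static int io_cgroup_rule_write(struct cgroup *cgrp, struct cftype *cft,
				const char *buf)
{
	struct io_cgroup *iocg = cgroup_to_iocg(cgrp);
	unsigned int major, minor, value;
	struct iorule *rule;

	if (sscanf(buf, "%u:%u %u", &major, &minor, &value) != 3)
		return -EINVAL;

	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
	if (!rule)
		return -ENOMEM;
	rule->dev = MKDEV(major, minor);
	rule->value = value;

	spin_lock(&iocg->lock);
	list_add_tail(&rule->node, &iocg->rules);
	spin_unlock(&iocg->lock);
	return 0;
}

A hot-added device simply gets a new iorule entry the first time a rule
is written for it; nothing in the cgroup filesystem itself has to change.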
> So you think it may be possible to make each cgroup have lots of control
> files for each device as a workaround. But it isn't allowed to add/remove
> control files when some devices are hot-added or hot-removed.
Why not a single control file for all the devices?
-Andrea
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080924140355.GB547@redhat.com>
@ 2008-09-26 16:11 ` Andrea Righi
[not found] ` <48DD09AD.2010200@gmail.com>
1 sibling, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2008-09-26 16:11 UTC (permalink / raw)
To: Vivek Goyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
Hirokazu Takahashi, dm-devel, agk, xemul, fernando, balbir
Vivek Goyal wrote:
[snip]
> Ok, I will give more details of the thought process.
>
> I was thinking of maintaing an rb-tree per request queue and not an
> rb-tree per cgroup. This tree can contain all the bios submitted to that
> request queue through __make_request(). Every node in the tree will represent
> one cgroup and will contain a list of bios issued from the tasks from that
> cgroup.
>
> Every bio entering the request queue through __make_request() function
> first will be queued in one of the nodes in this rb-tree, depending on which
> cgroup that bio belongs to.
>
> Once the bios are buffered in rb-tree, we release these to underlying
> elevator depending on the proportionate weight of the nodes/cgroups.
>
> Some more details which I was trying to implement yesterday.
>
> There will be one bio_cgroup object per cgroup. This object will contain
> many bio_group objects. Each bio_group object will be created for each
> request queue where a bio from bio_cgroup is queued. Essentially the idea
> is that bios belonging to a cgroup can be on various request queues in the
> system. So a single object can not serve the purpose as it can not be on
> many rb-trees at the same time. Hence create one sub object which will keep
> track of bios belonging to one cgroup on a particular request queue.
>
> Each bio_group will contain a list of bios and this bio_group object will
> be a node in the rb-tree of request queue. For example. Lets say there are
> two request queues in the system q1 and q2 (lets say they belong to /dev/sda
> and /dev/sdb). Let say a task t1 in /cgroup/io/test1 is issueing io both
> for /dev/sda and /dev/sdb.
>
> bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
> objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
> and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
> bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
> bios issued by task t1 for /dev/sdb. I thought the same can be extended
> for stacked devices also.
>
> I am still trying to implementing it and hopefully this is doable idea.
> I think at the end of the day it will be something very close to dm-ioband
> algorithm just that there will be no lvm driver and no notion of separate
> dm-ioband device.
Vivek, thanks for the detailed explanation. Only a comment: I guess if
we don't also change the per-process optimizations/improvements made by
some IO schedulers, we can get undesirable behaviours.
For example, CFQ uses the per-process io_context to improve fairness
between *all* the processes in a system, but it doesn't have the concept
of a cgroup context on top of the processes.
So some optimizations made to guarantee fairness among processes could
conflict with algorithms implemented at the cgroup layer, and
potentially lead to undesirable behaviours.
For example, an issue I'm experiencing with my cgroup-io-throttle
patchset is that a cgroup can consistently increase its IO rate (always
respecting the max limits) simply by increasing its number of IO worker
tasks with respect to another cgroup with fewer IO workers. This is
probably due to the fact that CFQ tries to give the same amount of
"IO time" to all the tasks, without considering that they're organized
in cgroups.
I don't see this behaviour with noop or deadline, because they don't
have the concept of io_context.
-Andrea
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48DD09AD.2010200@gmail.com>
@ 2008-09-26 17:11 ` Andrea Righi
[not found] ` <48DD17A9.9080607@gmail.com>
1 sibling, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2008-09-26 17:11 UTC (permalink / raw)
To: Vivek Goyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
Hirokazu Takahashi, dm-devel, agk, xemul, fernando, balbir
Andrea Righi wrote:
> Vivek Goyal wrote:
> [snip]
>> Ok, I will give more details of the thought process.
>>
>> I was thinking of maintaing an rb-tree per request queue and not an
>> rb-tree per cgroup. This tree can contain all the bios submitted to that
>> request queue through __make_request(). Every node in the tree will represent
>> one cgroup and will contain a list of bios issued from the tasks from that
>> cgroup.
>>
>> Every bio entering the request queue through __make_request() function
>> first will be queued in one of the nodes in this rb-tree, depending on which
>> cgroup that bio belongs to.
>>
>> Once the bios are buffered in rb-tree, we release these to underlying
>> elevator depending on the proportionate weight of the nodes/cgroups.
>>
>> Some more details which I was trying to implement yesterday.
>>
>> There will be one bio_cgroup object per cgroup. This object will contain
>> many bio_group objects. Each bio_group object will be created for each
>> request queue where a bio from bio_cgroup is queued. Essentially the idea
>> is that bios belonging to a cgroup can be on various request queues in the
>> system. So a single object can not serve the purpose as it can not be on
>> many rb-trees at the same time. Hence create one sub object which will keep
>> track of bios belonging to one cgroup on a particular request queue.
>>
>> Each bio_group will contain a list of bios and this bio_group object will
>> be a node in the rb-tree of request queue. For example. Lets say there are
>> two request queues in the system q1 and q2 (lets say they belong to /dev/sda
>> and /dev/sdb). Let say a task t1 in /cgroup/io/test1 is issueing io both
>> for /dev/sda and /dev/sdb.
>>
>> bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
>> objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
>> and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
>> bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
>> bios issued by task t1 for /dev/sdb. I thought the same can be extended
>> for stacked devices also.
>>
>> I am still trying to implementing it and hopefully this is doable idea.
>> I think at the end of the day it will be something very close to dm-ioband
>> algorithm just that there will be no lvm driver and no notion of separate
>> dm-ioband device.
>
> Vivek, thanks for the detailed explanation. Only a comment. I guess, if
> we don't change also the per-process optimizations/improvements made by
> some IO scheduler, I think we can have undesirable behaviours.
>
> For example: CFQ uses the per-process iocontext to improve fairness
> between *all* the processes in a system. But it doesn't have the concept
> that there's a cgroup context on-top-of the processes.
>
> So, some optimizations made to guarantee fairness among processes could
> conflict with algorithms implemented at the cgroup layer. And
> potentially lead to undesirable behaviours.
>
> For example an issue I'm experiencing with my cgroup-io-throttle
> patchset is that a cgroup can consistently increase the IO rate (always
> respecting the max limits), simply increasing the number of IO worker
> tasks respect to another cgroup with a lower number of IO workers. This
> is probably due to the fact the CFQ tries to give the same amount of
> "IO time" to all the tasks, without considering that they're organized
> in cgroup.
BTW, this is why I proposed using a single shared io_context for all the
processes running in the same cgroup. Anyway, this is not the best
solution, because that way all the IO requests coming from a cgroup will
be queued to the same cfq queue. If I'm not wrong, this would
effectively implement noop (FIFO) between tasks belonging to the same
cgroup and CFQ between cgroups. But, at least for this particular case,
we would be able to provide fairness among cgroups.
-Andrea
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48DD17A9.9080607@gmail.com>
@ 2008-09-26 17:30 ` Andrea Righi
2008-09-29 12:07 ` Hirokazu Takahashi
[not found] ` <20080929.210729.117112710.taka@valinux.co.jp>
2 siblings, 0 replies; 40+ messages in thread
From: Andrea Righi @ 2008-09-26 17:30 UTC (permalink / raw)
To: Vivek Goyal
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
Hirokazu Takahashi, dm-devel, agk, xemul, fernando, balbir
Andrea Righi wrote:
> Andrea Righi wrote:
>> Vivek Goyal wrote:
>> [snip]
>>> Ok, I will give more details of the thought process.
>>>
>>> I was thinking of maintaing an rb-tree per request queue and not an
>>> rb-tree per cgroup. This tree can contain all the bios submitted to that
>>> request queue through __make_request(). Every node in the tree will represent
>>> one cgroup and will contain a list of bios issued from the tasks from that
>>> cgroup.
>>>
>>> Every bio entering the request queue through __make_request() function
>>> first will be queued in one of the nodes in this rb-tree, depending on which
>>> cgroup that bio belongs to.
>>>
>>> Once the bios are buffered in rb-tree, we release these to underlying
>>> elevator depending on the proportionate weight of the nodes/cgroups.
>>>
>>> Some more details which I was trying to implement yesterday.
>>>
>>> There will be one bio_cgroup object per cgroup. This object will contain
>>> many bio_group objects. Each bio_group object will be created for each
>>> request queue where a bio from bio_cgroup is queued. Essentially the idea
>>> is that bios belonging to a cgroup can be on various request queues in the
>>> system. So a single object can not serve the purpose as it can not be on
>>> many rb-trees at the same time. Hence create one sub object which will keep
>>> track of bios belonging to one cgroup on a particular request queue.
>>>
>>> Each bio_group will contain a list of bios and this bio_group object will
>>> be a node in the rb-tree of request queue. For example. Lets say there are
>>> two request queues in the system q1 and q2 (lets say they belong to /dev/sda
>>> and /dev/sdb). Let say a task t1 in /cgroup/io/test1 is issueing io both
>>> for /dev/sda and /dev/sdb.
>>>
>>> bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
>>> objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
>>> and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
>>> bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
>>> bios issued by task t1 for /dev/sdb. I thought the same can be extended
>>> for stacked devices also.
>>>
>>> I am still trying to implementing it and hopefully this is doable idea.
>>> I think at the end of the day it will be something very close to dm-ioband
>>> algorithm just that there will be no lvm driver and no notion of separate
>>> dm-ioband device.
>> Vivek, thanks for the detailed explanation. Only a comment. I guess, if
>> we don't change also the per-process optimizations/improvements made by
>> some IO scheduler, I think we can have undesirable behaviours.
>>
>> For example: CFQ uses the per-process iocontext to improve fairness
>> between *all* the processes in a system. But it doesn't have the concept
>> that there's a cgroup context on-top-of the processes.
>>
>> So, some optimizations made to guarantee fairness among processes could
>> conflict with algorithms implemented at the cgroup layer. And
>> potentially lead to undesirable behaviours.
>>
>> For example an issue I'm experiencing with my cgroup-io-throttle
>> patchset is that a cgroup can consistently increase the IO rate (always
>> respecting the max limits), simply increasing the number of IO worker
>> tasks respect to another cgroup with a lower number of IO workers. This
>> is probably due to the fact the CFQ tries to give the same amount of
>> "IO time" to all the tasks, without considering that they're organized
>> in cgroup.
>
> BTW this is why I proposed to use a single shared iocontext for all the
> processes running in the same cgroup. Anyway, this is not the best
> solution, because in this way all the IO requests coming from a cgroup
> will be queued to the same cfq queue. If I'm not wrong in this way we
> would implement noop (FIFO) between tasks belonging to the same cgroup
> and CFQ between cgroups. But, at least for this particular case, we
> would be able to provide fairness among cgroups.
Ah! also have a look at this:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/benchmark/graph/effect-of-per-process-cfq-fairness-on-the-cgroup-context.png
The graph highlights the dependency between the IO rate and the number
of tasks running in a cgroup. For this testcase I used 2 cgroups:
- cgroup A, with a single task doing IO (a large O_DIRECT read stream)
- cgroup B, with a variable number of tasks, ranging from 1 to 16, doing
IO in parallel
If we want to be "fair", the gap in IO performance between the cgroups
should be close to 0.
Using "plain" cfq (red line), the performance gap grows as the number of
tasks in a cgroup increases.
Using cgroup-io-throttle on top of cfq (green line), the gap is lower
(the asymptotic curve is due to the bandwidth capping provided by
cgroup-io-throttle).
Using cgroup-io-throttle and a single shared io_context for each cgroup
(blue line), the gap is really close to 0.
Anyway, I repeat, I don't think this is a wonderful solution; it is just
to highlight this issue and share with you the results of some tests I
did.
-Andrea
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48DD0617.3050403@gmail.com>
@ 2008-09-29 10:40 ` Hirokazu Takahashi
0 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-29 10:40 UTC (permalink / raw)
To: righi.andrea
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, vgoyal, balbir
Hi,
> >>>>> It's possible the algorithm of dm-ioband can be placed in the block layer
> >>>>> if it is really a big problem.
> >>>>> But I doubt it can control every control block I/O as we wish since
> >>>>> the interface the cgroup supports is quite poor.
> >>>> Had a question regarding cgroup interface. I am assuming that in a system,
> >>>> one will be using other controllers as well apart from IO-controller.
> >>>> Other controllers will be using cgroup as a grouping mechanism.
> >>>> Now coming up with additional grouping mechanism for only io-controller seems
> >>>> little odd to me. It will make the job of higher level management software
> >>>> harder.
> >>>>
> >>>> Looking at the dm-ioband grouping examples given in patches, I think cases
> >>>> of grouping based in pid, pgrp, uid and kvm can be handled by creating right
> >>>> cgroup and making sure applications are launched/moved into right cgroup by
> >>>> user space tools.
> >>> Grouping in pid, pgrp and uid is not the point, which I've been thinking
> >>> can be replaced with cgroup once the implementation of bio-cgroup is done.
> >>>
> >>> I think problems of cgroup are that they can't support lots of storages
> >>> and hotplug devices, it just handle them as if they were just one resource.
> >>> I don't insist the interface of dm-ioband is the best. I just hope the
> >>> cgroup infrastructure support this kind of resources.
> >>>
> >> Sorry, I did not understand fully. Can you please explain in detail what
> >> kind of situation will not be covered by cgroup interface.
> >
> > From the concept of the cgroup, if you want control several disks
> > independently, you should make each disk have its own cgroup subsystem,
> > which only can be defined when compiling the kernel. This is impossible
> > because every linux box has various number of disks.
>
> mmh? not true. You can define a single cgroup subsystem that implements
> the opportune interfaces to apply your type of control, and use many
> structures allocated dynamically for each controlled object (one for
> each block device, disk, partition, ... or using any kind of
> grouping/splitting policy). Actually, this is how cgroup-io-throttle, as
> well as any other cgroup subsystem, is implemented.
>
> > So you think it may be possible to make each cgroup have lots of control
> > files for each device as a workaround. But it isn't allowed to add/remove
> > control files when some devices are hot-added or hot-removed.
>
> Why not a single control file for all the devices?
This is possible, but I wonder if this is really the way we should go.
It looks like you have implemented another ioctl-like interface on top
of the cgroup control file interface. You can do anything you want with
such an interface, though.
I guess there should at least be some rules for implementing this kind
of ioctl-like interface if people don't want to enhance the cgroup
interface itself.
Thank you,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <48DD17A9.9080607@gmail.com>
2008-09-26 17:30 ` Andrea Righi
@ 2008-09-29 12:07 ` Hirokazu Takahashi
[not found] ` <20080929.210729.117112710.taka@valinux.co.jp>
2 siblings, 0 replies; 40+ messages in thread
From: Hirokazu Takahashi @ 2008-09-29 12:07 UTC (permalink / raw)
To: righi.andrea
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, xemul, fernando, vgoyal, balbir
Hi, Andrea,
> >> Ok, I will give more details of the thought process.
> >>
> >> I was thinking of maintaing an rb-tree per request queue and not an
> >> rb-tree per cgroup. This tree can contain all the bios submitted to that
> >> request queue through __make_request(). Every node in the tree will represent
> >> one cgroup and will contain a list of bios issued from the tasks from that
> >> cgroup.
> >>
> >> Every bio entering the request queue through __make_request() function
> >> first will be queued in one of the nodes in this rb-tree, depending on which
> >> cgroup that bio belongs to.
> >>
> >> Once the bios are buffered in rb-tree, we release these to underlying
> >> elevator depending on the proportionate weight of the nodes/cgroups.
> >>
> >> Some more details which I was trying to implement yesterday.
> >>
> >> There will be one bio_cgroup object per cgroup. This object will contain
> >> many bio_group objects. Each bio_group object will be created for each
> >> request queue where a bio from bio_cgroup is queued. Essentially the idea
> >> is that bios belonging to a cgroup can be on various request queues in the
> >> system. So a single object can not serve the purpose as it can not be on
> >> many rb-trees at the same time. Hence create one sub object which will keep
> >> track of bios belonging to one cgroup on a particular request queue.
> >>
> >> Each bio_group will contain a list of bios and this bio_group object will
> >> be a node in the rb-tree of request queue. For example. Lets say there are
> >> two request queues in the system q1 and q2 (lets say they belong to /dev/sda
> >> and /dev/sdb). Let say a task t1 in /cgroup/io/test1 is issueing io both
> >> for /dev/sda and /dev/sdb.
> >>
> >> bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
> >> objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
> >> and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
> >> bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
> >> bios issued by task t1 for /dev/sdb. I thought the same can be extended
> >> for stacked devices also.
> >>
> >> I am still trying to implementing it and hopefully this is doable idea.
> >> I think at the end of the day it will be something very close to dm-ioband
> >> algorithm just that there will be no lvm driver and no notion of separate
> >> dm-ioband device.
> >
> > Vivek, thanks for the detailed explanation. Only a comment. I guess, if
> > we don't change also the per-process optimizations/improvements made by
> > some IO scheduler, I think we can have undesirable behaviours.
> >
> > For example: CFQ uses the per-process iocontext to improve fairness
> > between *all* the processes in a system. But it doesn't have the concept
> > that there's a cgroup context on-top-of the processes.
> >
> > So, some optimizations made to guarantee fairness among processes could
> > conflict with algorithms implemented at the cgroup layer. And
> > potentially lead to undesirable behaviours.
> >
> > For example an issue I'm experiencing with my cgroup-io-throttle
> > patchset is that a cgroup can consistently increase the IO rate (always
> > respecting the max limits), simply increasing the number of IO worker
> > tasks respect to another cgroup with a lower number of IO workers. This
> > is probably due to the fact the CFQ tries to give the same amount of
> > "IO time" to all the tasks, without considering that they're organized
> > in cgroup.
>
> BTW this is why I proposed to use a single shared iocontext for all the
> processes running in the same cgroup. Anyway, this is not the best
> solution, because in this way all the IO requests coming from a cgroup
> will be queued to the same cfq queue. If I'm not wrong in this way we
> would implement noop (FIFO) between tasks belonging to the same cgroup
> and CFQ between cgroups. But, at least for this particular case, we
> would be able to provide fairness among cgroups.
>
> -Andrea
I once thought the same thing, but this approach breaks compatibility.
I think we should make ionice effective only between processes in the
same cgroup.
A system gives some amount of bandwidth to each of its cgroups, and the
processes in a cgroup fairly share the bandwidth given to it. I think
this is the straightforward approach. What do you think?
I think the CFQ-cgroup work the NEC guys are doing, the OpenVZ team's
CFQ scheduler and dm-ioband with bio-cgroup all work like this.
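Expressed as a rough two-level selection loop (purely illustrative; the
structures, helpers and constants below are assumptions): bandwidth is
first split between cgroups by weight, and only within the chosen cgroup
do the per-task ionice values matter.

#include <linux/types.h>

/* One io_group per cgroup per device; purely a sketch. */
struct io_group {
	unsigned int weight;	/* configured share, assumed non-zero */
	u64 vtime;		/* grows as the group consumes service */
};

/* Level 1: pick the cgroup that is furthest below its fair share. */
static struct io_group *pick_group(struct io_group **groups, int nr)
{
	struct io_group *best = NULL;
	int i;

	for (i = 0; i < nr; i++) {
		if (groups[i] && (!best || groups[i]->vtime < best->vtime))
			best = groups[i];
	}
	return best;
}

/*
 * Level 2 would be a CFQ-like pass inside the chosen group, honouring
 * per-task ionice. After dispatch, charge the group so heavier-weighted
 * groups accumulate virtual time more slowly and get picked more often.
 */
static void charge_group(struct io_group *grp, unsigned int sectors)
{
	grp->vtime += (u64)sectors * 100 / grp->weight;
}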
Thank you,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: dm-ioband + bio-cgroup benchmarks
[not found] ` <20080929.210729.117112710.taka@valinux.co.jp>
@ 2008-09-29 12:13 ` Pavel Emelyanov
0 siblings, 0 replies; 40+ messages in thread
From: Pavel Emelyanov @ 2008-09-29 12:13 UTC (permalink / raw)
To: Hirokazu Takahashi
Cc: xen-devel, containers, jens.axboe, linux-kernel, virtualization,
dm-devel, agk, balbir, fernando, vgoyal, righi.andrea
Hirokazu Takahashi wrote:
> Hi, Andrea,
>
>>>> Ok, I will give more details of the thought process.
>>>>
>>>> I was thinking of maintaing an rb-tree per request queue and not an
>>>> rb-tree per cgroup. This tree can contain all the bios submitted to that
>>>> request queue through __make_request(). Every node in the tree will represent
>>>> one cgroup and will contain a list of bios issued from the tasks from that
>>>> cgroup.
>>>>
>>>> Every bio entering the request queue through __make_request() function
>>>> first will be queued in one of the nodes in this rb-tree, depending on which
>>>> cgroup that bio belongs to.
>>>>
>>>> Once the bios are buffered in rb-tree, we release these to underlying
>>>> elevator depending on the proportionate weight of the nodes/cgroups.
>>>>
>>>> Some more details which I was trying to implement yesterday.
>>>>
>>>> There will be one bio_cgroup object per cgroup. This object will contain
>>>> many bio_group objects. Each bio_group object will be created for each
>>>> request queue where a bio from bio_cgroup is queued. Essentially the idea
>>>> is that bios belonging to a cgroup can be on various request queues in the
>>>> system. So a single object can not serve the purpose as it can not be on
>>>> many rb-trees at the same time. Hence create one sub object which will keep
>>>> track of bios belonging to one cgroup on a particular request queue.
>>>>
>>>> Each bio_group will contain a list of bios and this bio_group object will
>>>> be a node in the rb-tree of request queue. For example. Lets say there are
>>>> two request queues in the system q1 and q2 (lets say they belong to /dev/sda
>>>> and /dev/sdb). Let say a task t1 in /cgroup/io/test1 is issueing io both
>>>> for /dev/sda and /dev/sdb.
>>>>
>>>> bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
>>>> objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
>>>> and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
>>>> bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
>>>> bios issued by task t1 for /dev/sdb. I thought the same can be extended
>>>> for stacked devices also.
>>>>
>>>> I am still trying to implementing it and hopefully this is doable idea.
>>>> I think at the end of the day it will be something very close to dm-ioband
>>>> algorithm just that there will be no lvm driver and no notion of separate
>>>> dm-ioband device.
>>> Vivek, thanks for the detailed explanation. Only a comment. I guess, if
>>> we don't change also the per-process optimizations/improvements made by
>>> some IO scheduler, I think we can have undesirable behaviours.
>>>
>>> For example: CFQ uses the per-process iocontext to improve fairness
>>> between *all* the processes in a system. But it doesn't have the concept
>>> that there's a cgroup context on-top-of the processes.
>>>
>>> So, some optimizations made to guarantee fairness among processes could
>>> conflict with algorithms implemented at the cgroup layer. And
>>> potentially lead to undesirable behaviours.
>>>
>>> For example an issue I'm experiencing with my cgroup-io-throttle
>>> patchset is that a cgroup can consistently increase the IO rate (always
>>> respecting the max limits), simply increasing the number of IO worker
>>> tasks respect to another cgroup with a lower number of IO workers. This
>>> is probably due to the fact the CFQ tries to give the same amount of
>>> "IO time" to all the tasks, without considering that they're organized
>>> in cgroup.
>> BTW this is why I proposed to use a single shared iocontext for all the
>> processes running in the same cgroup. Anyway, this is not the best
>> solution, because in this way all the IO requests coming from a cgroup
>> will be queued to the same cfq queue. If I'm not wrong in this way we
>> would implement noop (FIFO) between tasks belonging to the same cgroup
>> and CFQ between cgroups. But, at least for this particular case, we
>> would be able to provide fairness among cgroups.
>>
>> -Andrea
>
> I ever thought the same thing but this approach breaks the compatibility.
> I think we should make ionice only effective for the processes in the
> same cgroup.
>
> A system gives some amount of bandwidths to its cgroups, and
> the processes in one of the cgroups fairly share the given bandwidth.
> I think this is the straight approach. What do you think?
>
> I think all the CFQ-cgroup the NEC guys are working, OpenVZ team's CFQ
> scheduler and dm-ioband with bio-cgroup work like this.
If by "fairly share the given bandwidth" you mean "share according to their
IO-nice values" then you're right on this, Hirokazu. We always use a two-level
schedulers and would like to see the same behavior in anything that will be
the IO-bandwidth-controller in the mainline :)
> Thank you,
> Hirokazu Takahashi.
>
>
^ permalink raw reply [flat|nested] 40+ messages in thread