Updated Elvis Upstreaming Roadmap

public inbox for kvm@vger.kernel.org

* Updated Elvis Upstreaming Roadmap
@ 2013-12-17 10:04 Razya Ladelsky
  2013-12-18 10:43 ` Michael S. Tsirkin
  2013-12-24 16:21 ` Gleb Natapov
  0 siblings, 2 replies; 13+ messages in thread
From: Razya Ladelsky @ 2013-12-17 10:04 UTC
  To: Michael S. Tsirkin
  Cc: abel.gordon, Anthony Liguori, asias, digitaleric, Eran Raichstein,
	gleb, jasowang, Joel Nider, kvm, kvm-owner, pbonzini,
	Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

Hi,

Thank you all for your comments.
I'm sorry for taking this long to reply; I was away on vacation.

It was a good, long discussion. Many issues were raised, which we'd like 
to address with the following proposed roadmap for Elvis patches.
In general, we believe it would be best to start with patches that are 
as simple as possible, providing the basic Elvis functionality, 
and attend to the more complicated issues in subsequent patches.

Here's the road map for Elvis patches: 

1. Shared vhost thread for multiple devices.

The way to go here, we believe, is to start with a patch having a shared 
vhost thread for multiple devices of the SAME vm.
The next step/patch may be handling vms belonging to the same cgroup.

Finally, we need to extend the functionality so that the shared vhost 
thread serves multiple vms (not necessarily belonging to the same cgroup).

There was a lot of discussion about the way to address the enforcement 
of cgroup policies, and we will consider the various solutions in a 
future patch.

2. Creation of vhost threads

We suggested two ways of controlling the creation and removal of vhost
threads: 
- statically determining the maximum number of virtio devices per worker 
via a kernel module parameter 
- dynamically: Sysfs mechanism to add and remove vhost threads 

It seems that it would be simplest to take the static approach as
a first stage. At a second stage (next patch), we'll advance to dynamically 
changing the number of vhost threads, using the static module parameter 
only as a default value. 
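
As a rough illustration of the static approach, here is a toy user-space
model in Python. It is not vhost code: the Worker class, assign_device()
and the devs_per_worker parameter are hypothetical names standing in for
the proposed module parameter.

```python
# Toy user-space model; not vhost code. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Worker:
    devices: list = field(default_factory=list)


def assign_device(workers, device, devs_per_worker):
    """Attach device to the first worker with spare capacity, else create one."""
    for w in workers:
        if len(w.devices) < devs_per_worker:
            w.devices.append(device)
            return w
    w = Worker([device])
    workers.append(w)
    return w


workers = []
for dev in ["net0", "net1", "net2", "blk0", "blk1"]:
    assign_device(workers, dev, devs_per_worker=2)

# Devices grouped per worker, in creation order.
layout = [w.devices for w in workers]
```

With devs_per_worker=1 the model degenerates to one worker per device,
which is how a default value could leave existing behavior unchanged.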

Regarding cmwq (concurrency-managed workqueues), it is an interesting 
mechanism, which we need to explore further.
At the moment we prefer not to change the vhost model to use cmwq, as some 
of the issues that were discussed, such as cgroups, are not supported by 
cmwq, and this would add more complexity.
However, we'll look further into it, and consider it at a later stage.

3. Adding polling mode to vhost 

Making polling adaptive is a good idea, based on various factors such as 
the I/O rate, the guest kick overhead (which is the tradeoff of polling), 
or the amount of wasted cycles (cycles we kept polling but no new work was 
added).
However, as a beginning polling patch, we would prefer having a naive 
polling approach, which could be tuned with later patches.
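
To make "naive polling" and "wasted cycles" concrete, here is a toy model
in Python. It is only a sketch under stated assumptions: naive_poll(),
poll_budget and the arrivals map are hypothetical, and a poll iteration
stands in for real CPU cycles.

```python
# Toy model of naive polling; not vhost code. All names are hypothetical.
from collections import deque


def naive_poll(queue, arrivals, poll_budget):
    """Poll `queue` until `poll_budget` consecutive polls find no work.

    `arrivals` maps a poll iteration number to the requests that show up
    at that iteration. Returns (handled, wasted_polls).
    """
    handled, wasted, idle, tick = [], 0, 0, 0
    while idle < poll_budget:
        queue.extend(arrivals.get(tick, []))
        if queue:
            handled.append(queue.popleft())
            idle = 0
        else:
            idle += 1
            wasted += 1
        tick += 1
    return handled, wasted


handled, wasted = naive_poll(deque(["r0"]), {2: ["r1"], 10: ["r2"]},
                             poll_budget=3)
```

In this run the request arriving at iteration 2 is picked up without a
kick, while the one arriving at iteration 10 comes after polling has
stopped and would need a guest kick. An adaptive version would tune
poll_budget from the observed wasted count.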

4. vhost statistics 

The issue that was raised for the vhost statistics was using ftrace 
instead of the debugfs mechanism.
However, looking further into the kvm stat mechanism, we learned that 
ftrace didn't replace the plain debugfs mechanism, but was used in 
addition to it.

We propose to continue using debugfs for statistics, in a manner similar 
to kvm, and at some point in the future ftrace can be added to vhost as well.

Does this plan look o.k.?
If there are no further comments, I'll start preparing the patches 
according to what we've agreed on thus far.
Thank you,
Razya


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-17 10:04 Updated Elvis Upstreaming Roadmap Razya Ladelsky
@ 2013-12-18 10:43 ` Michael S. Tsirkin
  2013-12-19  6:40   ` Abel Gordon
  2013-12-24 16:21 ` Gleb Natapov
  1 sibling, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2013-12-18 10:43 UTC
  To: Razya Ladelsky
  Cc: abel.gordon, Anthony Liguori, asias, digitaleric, Eran Raichstein,
	gleb, jasowang, Joel Nider, kvm, kvm-owner, pbonzini,
	Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> Hi,
> 
> Thank you all for your comments.
> I'm sorry for taking this long to reply, I was away on vacation..
> 
> It was a good, long discussion, many issues were raised, which we'd like 
> to address with the following proposed roadmap for Elvis patches.
> In general, we believe it would be best to start with patches that are 
> as simple as possible, providing the basic Elvis functionality, 
> and attend to the more complicated issues in subsequent patches.
> 
> Here's the road map for Elvis patches: 

Thanks for the follow up. Some suggestions below.
Please note the suggestions below merely represent
thoughts on merging upstream.
If as the first step you are content with keeping this
work as out of tree patches, in order to have
the freedom to experiment with interfaces and
performance, please feel free to ignore them.

> 1. Shared vhost thread for multiple devices.
> 
> The way to go here, we believe, is to start with a patch having a shared 
> vhost thread for multiple devices of the SAME vm.
> The next step/patch may be handling vms belonging to the same cgroup.
> 
> Finally, we need to extend the functionality so that the shared vhost 
> thread 
> serves multiple vms (not necessarily belonging to the same cgroup).
> 
> There was a lot of discussion about the way to address the enforcement 
> of cgroup policies, and we will consider the various solutions with a 
> future
> patch.

With respect to the upstream kernel,
I'm not sure a bunch of changes just for the sake of guests with
multiple virtual NIC cards makes sense.
And I wonder how this step, in isolation, will affect e.g.
multiqueue workloads.
But I guess if the numbers are convincing, this can be mergeable.

> 
> 2. Creation of vhost threads
> 
> We suggested two ways of controlling the creation and removal of vhost
> threads: 
> - statically determining the maximum number of virtio devices per worker 
> via a kernel module parameter 
> - dynamically: Sysfs mechanism to add and remove vhost threads 
> 
> It seems that it would be simplest to take the static approach as
> a first stage. At a second stage (next patch), we'll advance to 
> dynamically 
> changing the number of vhost threads, using the static module parameter 
> only as a default value. 

I'm not sure how independent this is from 1.
With respect to the upstream kernel,
Introducing interfaces (which we'll have to maintain
forever) just for the sake of guests with
multiple virtual NIC cards does not look like a good tradeoff.

So I'm unlikely to merge this upstream without making it useful cross-VM,
and yes this means isolation and accounting with cgroups need to
work properly.

> Regarding cwmq, it is an interesting mechanism, which we need to explore 
> further.
> At the moment we prefer not to change the vhost model to use cwmq, as some 
> of the issues that were discussed, such as cgroups, are not supported by 
> cwmq, and this is adding more complexity.
> However, we'll look further into it, and consider it at a later stage.

Hmm that's still assuming some smart management tool configuring
this correctly.  Can't this be determined automatically depending
on the workload?
This is what the cmwq suggestion was really about: detect
that we need more threads and spawn them.
It's less about sharing the implementation with workqueues -
would be very nice but not a must.

> 3. Adding polling mode to vhost 
> 
> It is a good idea making polling adaptive based on various factors such as 
> the I/O rate, the guest kick overhead(which is the tradeoff of polling), 
> or the amount of wasted cycles (cycles we kept polling but no new work was 
> added).
> However, as a beginning polling patch, we would prefer having a naive 
> polling approach, which could be tuned with later patches.
> 

While any polling approach would still need a lot of testing to prove we
don't, for example, steal CPU from a guest which could be doing other useful
work, given that an exit is at least 1.5K cycles, at least in theory it
seems like something that can improve performance.  I'm not sure how
naive we can be without introducing regressions for some workloads.
For example, if we are on the same host CPU, there's no
chance busy waiting will help us make progress.
How about detecting that the VCPU thread that kicked us
is currently running on another CPU, and only polling in
this case?
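
The check suggested here could be sketched as the following predicate.
This is a toy Python model, not vhost code; should_poll() and its
parameters are hypothetical, and how vhost would learn the kicker's CPU
is left open.

```python
# Toy model of the suggested check; not vhost code. All names are hypothetical.
def should_poll(vhost_cpu, kicker_cpu, kicker_running):
    """Poll only if the kicking VCPU is running on a different host CPU."""
    return kicker_running and kicker_cpu != vhost_cpu


decisions = {
    "same CPU": should_poll(vhost_cpu=2, kicker_cpu=2, kicker_running=True),
    "other CPU, running": should_poll(vhost_cpu=2, kicker_cpu=5,
                                      kicker_running=True),
    "other CPU, descheduled": should_poll(vhost_cpu=2, kicker_cpu=5,
                                          kicker_running=False),
}
```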

> 4. vhost statistics 
> 
> The issue that was raised for the vhost statistics was using ftrace 
> instead of the debugfs mechanism.
> However, looking further into the kvm stat mechanism, we learned that 
> ftrace didn't replace the plain debugfs mechanism, but was used in 
> addition to it.
>  
> We propose to continue using debugfs for statistics, in a manner similar 
> to kvm,
> and at some point in the future ftrace can be added to vhost as well.

IMHO while kvm stat is a useful script, the best tool
for perf stats is still perf. So I would try to integrate with that.
How it works internally is IMHO less important.

> Does this plan look o.k.?
> If there are no further comments, I'll start preparing the patches 
> according to what we've agreed on thus far.
> Thank you,
> Razya

I think a good place to try to start merging upstream would be 3 and 4.
So if you want to make it easier to merge things upstream, try to keep 3
and 4 independent from 1 and 2.

Thanks again,

-- 
MST


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-18 10:43 ` Michael S. Tsirkin
@ 2013-12-19  6:40   ` Abel Gordon
  2013-12-19 10:13     ` Michael S. Tsirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Abel Gordon @ 2013-12-19  6:40 UTC
  To: Michael S. Tsirkin
  Cc: Razya Ladelsky, Anthony Liguori, asias, digitaleric,
	Eran Raichstein, gleb, jasowang, Joel Nider, kvm, kvm-owner,
	pbonzini, Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
>> Hi,
>>
>> Thank you all for your comments.
>> I'm sorry for taking this long to reply, I was away on vacation..
>>
>> It was a good, long discussion, many issues were raised, which we'd like
>> to address with the following proposed roadmap for Elvis patches.
>> In general, we believe it would be best to start with patches that are
>> as simple as possible, providing the basic Elvis functionality,
>> and attend to the more complicated issues in subsequent patches.
>>
>> Here's the road map for Elvis patches:
>
> Thanks for the follow up. Some suggestions below.
> Please note they suggestions below merely represent
> thoughts on merging upstream.
> If as the first step you are content with keeping this
> work as out of tree patches, in order to have
> the freedom to experiment with interfaces and
> performance, please feel free to ignore them.
>
>> 1. Shared vhost thread for multiple devices.
>>
>> The way to go here, we believe, is to start with a patch having a shared
>> vhost thread for multiple devices of the SAME vm.
>> The next step/patch may be handling vms belonging to the same cgroup.
>>
>> Finally, we need to extend the functionality so that the shared vhost
>> thread
>> serves multiple vms (not necessarily belonging to the same cgroup).
>>
>> There was a lot of discussion about the way to address the enforcement
>> of cgroup policies, and we will consider the various solutions with a
>> future
>> patch.
>
> With respect to the upstream kernel,
> I'm not sure a bunch of changes just for the sake of guests with
> multiple virtual NIC cards makes sense.
> And I wonder how this step, in isolation, will affect e.g.
> multiqueue workloads.
> But I guess if the numbers are convincing, this can be mergeable.

Even if you have a single multiqueue device, this change allows
creating one vhost thread for all the queues, one vhost thread per
queue, or any other combination. I guess that depending on the workload
and on the system utilization (free cycles/cores, density)
you would prefer to use one or more vhost threads.

>
>>
>> 2. Creation of vhost threads
>>
>> We suggested two ways of controlling the creation and removal of vhost
>> threads:
>> - statically determining the maximum number of virtio devices per worker
>> via a kernel module parameter
>> - dynamically: Sysfs mechanism to add and remove vhost threads
>>
>> It seems that it would be simplest to take the static approach as
>> a first stage. At a second stage (next patch), we'll advance to
>> dynamically
>> changing the number of vhost threads, using the static module parameter
>> only as a default value.
>
> I'm not sure how independent this is from 1.
> With respect to the upstream kernel,
> Introducing interfaces (which we'll have to maintain
> forever) just for the sake of guests with
> multiple virtual NIC cards does not look like a good tradeoff.
>
> So I'm unlikely to merge this upstream without making it useful cross-VM,
> and yes this means isolation and accounting with cgroups need to
> work properly.

Agreed, but even if you use a single multiqueue device, having the
ability to use 1 thread to serve all the queues or multiple threads to
serve all the queues looks like a useful feature.

>
>> Regarding cwmq, it is an interesting mechanism, which we need to explore
>> further.
>> At the moment we prefer not to change the vhost model to use cwmq, as some
>> of the issues that were discussed, such as cgroups, are not supported by
>> cwmq, and this is adding more complexity.
>> However, we'll look further into it, and consider it at a later stage.
>
> Hmm that's still assuming some smart management tool configuring
> this correctly.  Can't this be determined automatically depending
> on the workload?
> This is what the cwmq suggestion was really about: detect
> that we need more threads and spawn them.
> It's less about sharing the implementation with workqueues -
> would be very nice but not a must.

But how can cmwq handle cgroup accounting?
In any case, IMHO, the kernel should first provide the "mechanism" so
later on a user-space management application (the "policy") can
orchestrate it.
>
>
>
>> 3. Adding polling mode to vhost
>>
>> It is a good idea making polling adaptive based on various factors such as
>> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
>> or the amount of wasted cycles (cycles we kept polling but no new work was
>> added).
>> However, as a beginning polling patch, we would prefer having a naive
>> polling approach, which could be tuned with later patches.
>>
>
> While any polling approach would still need a lot of testing to prove we
> don't for example steal CPU from guest which could be doing other useful
> work, given that an exit is at least 1.5K cycles at least in theory it
> seems like something that can improve performance.  I'm not sure how
> naive we can be without introducing regressions  for some workloads.
> For example, if we are on the same host CPU, there's no
> chance busy waiting will help us make progress.
> How about detecting that the VCPU thread that kicked us
> is currently running on another CPU, and only polling in
> this case?
>
>> 4. vhost statistics
>>
>> The issue that was raised for the vhost statistics was using ftrace
>> instead of the debugfs mechanism.
>> However, looking further into the kvm stat mechanism, we learned that
>> ftrace didn't replace the plain debugfs mechanism, but was used in
>> addition to it.
>>
>> We propose to continue using debugfs for statistics, in a manner similar
>> to kvm,
>> and at some point in the future ftrace can be added to vhost as well.
>
> IMHO which kvm stat is a useful script, the best tool
> for perf stats is still perf. So I would try to integrate with that.
> How it works internally is IMHO less important.
>
>> Does this plan look o.k.?
>> If there are no further comments, I'll start preparing the patches
>> according to what we've agreed on thus far.
>> Thank you,
>> Razya
>
> I think a good place to try to start merging upstream would be 3 and 4.
> So if you want to make it easier to merge things upstream, try to keep 3
> and 4 independent from 1 and 2.

Note -1- and -3- are strongly related. If you have a thread that
serves multiple queues (whether they belong to the same device/vm or
not) then this thread will be polling multiple queues at the same
time. This increases the chances you will find pending work to do in
some queue. In other words, you reduce the cycles wasted for polling.
On the other hand, if you run multiple threads and these threads do
polling simultaneously then the threads may starve each other and
reduce performance (if they are scheduled to run on the same core). In
addition, a shared thread can decide when it should stop processing a
given queue and switch to another queue because by polling the thread
knows when new requests were added to a queue (this is what we called
fine-grained I/O scheduling heuristics).

So, seems like polling makes more sense when you serve multiple queues
with the same thread.
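
A back-of-the-envelope model of this argument, under the simplifying
assumption (not from the thread) that each queue independently has pending
work with probability p at any given poll: a thread polling n queues then
finds work with probability 1 - (1 - p)^n. The Python helper below is
purely illustrative and its name is hypothetical.

```python
# Illustrative only: assumes queues are independent, each with pending
# work with probability p at the moment of a poll.
def hit_probability(p, n_queues):
    """Chance that a poll over n_queues finds work in at least one queue."""
    return 1.0 - (1.0 - p) ** n_queues


one_queue = hit_probability(0.1, 1)
eight_queues = hit_probability(0.1, 8)
```

With p = 0.1, one queue gives a 10% chance per poll of finding work, and
eight queues give roughly 57%, so fewer polls come back empty.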

Abel.

>
> Thanks again,
>
> --
> MST


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-19  6:40   ` Abel Gordon
@ 2013-12-19 10:13     ` Michael S. Tsirkin
  2013-12-19 10:36       ` Abel Gordon
  0 siblings, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2013-12-19 10:13 UTC
  To: Abel Gordon
  Cc: Razya Ladelsky, Anthony Liguori, asias, digitaleric,
	Eran Raichstein, gleb, jasowang, Joel Nider, kvm, kvm-owner,
	pbonzini, Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> >> Hi,
> >>
> >> Thank you all for your comments.
> >> I'm sorry for taking this long to reply, I was away on vacation..
> >>
> >> It was a good, long discussion, many issues were raised, which we'd like
> >> to address with the following proposed roadmap for Elvis patches.
> >> In general, we believe it would be best to start with patches that are
> >> as simple as possible, providing the basic Elvis functionality,
> >> and attend to the more complicated issues in subsequent patches.
> >>
> >> Here's the road map for Elvis patches:
> >
> > Thanks for the follow up. Some suggestions below.
> > Please note they suggestions below merely represent
> > thoughts on merging upstream.
> > If as the first step you are content with keeping this
> > work as out of tree patches, in order to have
> > the freedom to experiment with interfaces and
> > performance, please feel free to ignore them.
> >
> >> 1. Shared vhost thread for multiple devices.
> >>
> >> The way to go here, we believe, is to start with a patch having a shared
> >> vhost thread for multiple devices of the SAME vm.
> >> The next step/patch may be handling vms belonging to the same cgroup.
> >>
> >> Finally, we need to extend the functionality so that the shared vhost
> >> thread
> >> serves multiple vms (not necessarily belonging to the same cgroup).
> >>
> >> There was a lot of discussion about the way to address the enforcement
> >> of cgroup policies, and we will consider the various solutions with a
> >> future
> >> patch.
> >
> > With respect to the upstream kernel,
> > I'm not sure a bunch of changes just for the sake of guests with
> > multiple virtual NIC cards makes sense.
> > And I wonder how this step, in isolation, will affect e.g.
> > multiqueue workloads.
> > But I guess if the numbers are convincing, this can be mergeable.
> 
> Even if you have a single multiqueue device this change allows
> to create one vhost thread for all the queues, one vhost thread per
> queue or any other combination. I guess that depending on the workload
> and depending on the system utilization (free cycles/cores, density)
> you would prefer
> to use one or more vhost threads.

That is already controllable from the guest though, which likely has a better
idea about the workload.

> >
> >>
> >> 2. Creation of vhost threads
> >>
> >> We suggested two ways of controlling the creation and removal of vhost
> >> threads:
> >> - statically determining the maximum number of virtio devices per worker
> >> via a kernel module parameter
> >> - dynamically: Sysfs mechanism to add and remove vhost threads
> >>
> >> It seems that it would be simplest to take the static approach as
> >> a first stage. At a second stage (next patch), we'll advance to
> >> dynamically
> >> changing the number of vhost threads, using the static module parameter
> >> only as a default value.
> >
> > I'm not sure how independent this is from 1.
> > With respect to the upstream kernel,
> > Introducing interfaces (which we'll have to maintain
> > forever) just for the sake of guests with
> > multiple virtual NIC cards does not look like a good tradeoff.
> >
> > So I'm unlikely to merge this upstream without making it useful cross-VM,
> > and yes this means isolation and accounting with cgroups need to
> > work properly.
> 
> Agree, but even if you use a single multiqueue device having the
> ability to use 1 thread to serve all the queues or multiple threads to
> serve all the queues looks like a useful feature.

Could be.  At the moment, multiqueue is off by default because it causes
regressions for some workloads as compared to a single queue.
If we have heuristics in vhost that fix this by auto-tuning threading, that
would be nice.  But if you need to tune it manually anyway,
then from upstream perspective it does not seem to be worth it - you can just
turn multiqueue on/off in the guest.


> >
> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
> >> further.
> >> At the moment we prefer not to change the vhost model to use cwmq, as some
> >> of the issues that were discussed, such as cgroups, are not supported by
> >> cwmq, and this is adding more complexity.
> >> However, we'll look further into it, and consider it at a later stage.
> >
> > Hmm that's still assuming some smart management tool configuring
> > this correctly.  Can't this be determined automatically depending
> > on the workload?
> > This is what the cwmq suggestion was really about: detect
> > that we need more threads and spawn them.
> > It's less about sharing the implementation with workqueues -
> > would be very nice but not a must.
> 
> But how cwmq can consider cgroup accounting ?

I think cmwq is just a replacement for our own thread pool.
It doesn't make cgroup accounting easier or harder.

> In any case, IMHO, the kernel should first provide the "mechanism" so
> later on a user-space management application (the "policy") can
> orchestrate it.

I think policy would be something coarse-grained, like setting priority.
Making detailed scheduling decisions in userspace seems wrong somehow:
what does a management application know that the kernel doesn't?


> >
> >
> >
> >> 3. Adding polling mode to vhost
> >>
> >> It is a good idea making polling adaptive based on various factors such as
> >> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
> >> or the amount of wasted cycles (cycles we kept polling but no new work was
> >> added).
> >> However, as a beginning polling patch, we would prefer having a naive
> >> polling approach, which could be tuned with later patches.
> >>
> >
> > While any polling approach would still need a lot of testing to prove we
> > don't for example steal CPU from guest which could be doing other useful
> > work, given that an exit is at least 1.5K cycles at least in theory it
> > seems like something that can improve performance.  I'm not sure how
> > naive we can be without introducing regressions  for some workloads.
> > For example, if we are on the same host CPU, there's no
> > chance busy waiting will help us make progress.
> > How about detecting that the VCPU thread that kicked us
> > is currently running on another CPU, and only polling in
> > this case?
> >
> >> 4. vhost statistics
> >>
> >> The issue that was raised for the vhost statistics was using ftrace
> >> instead of the debugfs mechanism.
> >> However, looking further into the kvm stat mechanism, we learned that
> >> ftrace didn't replace the plain debugfs mechanism, but was used in
> >> addition to it.
> >>
> >> We propose to continue using debugfs for statistics, in a manner similar
> >> to kvm,
> >> and at some point in the future ftrace can be added to vhost as well.
> >
> > IMHO which kvm stat is a useful script, the best tool
> > for perf stats is still perf. So I would try to integrate with that.
> > How it works internally is IMHO less important.
> >
> >> Does this plan look o.k.?
> >> If there are no further comments, I'll start preparing the patches
> >> according to what we've agreed on thus far.
> >> Thank you,
> >> Razya
> >
> > I think a good place to try to start merging upstream would be 3 and 4.
> > So if you want to make it easier to merge things upstream, try to keep 3
> > and 4 independent from 1 and 2.
> 
> Note -1- and -3- are strongly related. If you have a thread that
> serves multiple queues (whenever they belong to the same device/vm or
> not) then this thread will be polling multiple queues at the same
> time. This increases the chances you will find pending work to do in
> some queue. In other words, you reduce the cycles wasted for polling.
> In the other hand,  if you run multiple threads and these threads do
> polling simultaneously then the threads may starve each other and
> reduce performance (if they are scheduled to run in the same core). In
> addition, a shared thread can decide when it should stop processing a
> given queue and switch to other queue because by polling the thread
> knows when new requests were added to a queue (this is what we called
> fine-grained I/O scheduled heuristics)
> 
> So, seems like polling makes more sense when you serve multiple queues
> with the same thread.
> 
> Abel.

A combination might bring gains in more workloads, but it should work on its
own too. It's quite possible that only a single VM is active, others are
idle. So either polling should handle that well or be smart enough to turn
itself off in this case.
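
One way "smart enough to turn itself off" could look, as a toy Python
model rather than vhost code. PollGovernor, its window size and the
wasted-ratio threshold are all hypothetical; a real implementation would
also need a rule for re-enabling polling, e.g. when kicks start arriving
again.

```python
# Toy model of self-disabling polling; not vhost code. Names are hypothetical.
from collections import deque


class PollGovernor:
    """Disable polling when too many recent polls found no work."""

    def __init__(self, window=8, max_wasted_ratio=0.75):
        self.recent = deque(maxlen=window)
        self.max_wasted_ratio = max_wasted_ratio
        self.polling_enabled = True

    def record(self, found_work):
        """Record one poll result and return whether polling stays enabled."""
        self.recent.append(found_work)
        if len(self.recent) == self.recent.maxlen:
            wasted = self.recent.count(False) / len(self.recent)
            self.polling_enabled = wasted <= self.max_wasted_ratio
        return self.polling_enabled


gov = PollGovernor(window=4, max_wasted_ratio=0.5)
# Four busy polls followed by three empty ones.
states = [gov.record(hit) for hit in [True, True, True, True,
                                      False, False, False]]
```

Here polling stays on while at most half of the last four polls were
empty, and switches off on the third consecutive empty poll.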

> >
> > Thanks again,
> >
> > --
> > MST


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-19 10:13     ` Michael S. Tsirkin
@ 2013-12-19 10:36       ` Abel Gordon
  2013-12-19 11:37         ` Michael S. Tsirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Abel Gordon @ 2013-12-19 10:36 UTC
  To: Michael S. Tsirkin
  Cc: Razya Ladelsky, Anthony Liguori, asias, digitaleric,
	Eran Raichstein, gleb, jasowang, Joel Nider, kvm, kvm-owner,
	pbonzini, Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Thu, Dec 19, 2013 at 12:13 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
>> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
>> >> Hi,
>> >>
>> >> Thank you all for your comments.
>> >> I'm sorry for taking this long to reply, I was away on vacation..
>> >>
>> >> It was a good, long discussion, many issues were raised, which we'd like
>> >> to address with the following proposed roadmap for Elvis patches.
>> >> In general, we believe it would be best to start with patches that are
>> >> as simple as possible, providing the basic Elvis functionality,
>> >> and attend to the more complicated issues in subsequent patches.
>> >>
>> >> Here's the road map for Elvis patches:
>> >
>> > Thanks for the follow up. Some suggestions below.
>> > Please note they suggestions below merely represent
>> > thoughts on merging upstream.
>> > If as the first step you are content with keeping this
>> > work as out of tree patches, in order to have
>> > the freedom to experiment with interfaces and
>> > performance, please feel free to ignore them.
>> >
>> >> 1. Shared vhost thread for multiple devices.
>> >>
>> >> The way to go here, we believe, is to start with a patch having a shared
>> >> vhost thread for multiple devices of the SAME vm.
>> >> The next step/patch may be handling vms belonging to the same cgroup.
>> >>
>> >> Finally, we need to extend the functionality so that the shared vhost
>> >> thread
>> >> serves multiple vms (not necessarily belonging to the same cgroup).
>> >>
>> >> There was a lot of discussion about the way to address the enforcement
>> >> of cgroup policies, and we will consider the various solutions with a
>> >> future
>> >> patch.
>> >
>> > With respect to the upstream kernel,
>> > I'm not sure a bunch of changes just for the sake of guests with
>> > multiple virtual NIC cards makes sense.
>> > And I wonder how this step, in isolation, will affect e.g.
>> > multiqueue workloads.
>> > But I guess if the numbers are convincing, this can be mergeable.
>>
>> Even if you have a single multiqueue device this change allows
>> to create one vhost thread for all the queues, one vhost thread per
>> queue or any other combination. I guess that depending on the workload
>> and depending on the system utilization (free cycles/cores, density)
>> you would prefer
>> to use one or more vhost threads.
>
> That is already controllable from the guest though, which likely has a better
> idea about the workload.

But the guest has no idea about what's going on in the host system
(e.g. other VMs' I/O, CPU utilization of the host cores...).

>
>> >
>> >>
>> >> 2. Creation of vhost threads
>> >>
>> >> We suggested two ways of controlling the creation and removal of vhost
>> >> threads:
>> >> - statically determining the maximum number of virtio devices per worker
>> >> via a kernel module parameter
>> >> - dynamically: Sysfs mechanism to add and remove vhost threads
>> >>
>> >> It seems that it would be simplest to take the static approach as
>> >> a first stage. At a second stage (next patch), we'll advance to
>> >> dynamically
>> >> changing the number of vhost threads, using the static module parameter
>> >> only as a default value.
>> >
>> > I'm not sure how independent this is from 1.
>> > With respect to the upstream kernel,
>> > Introducing interfaces (which we'll have to maintain
>> > forever) just for the sake of guests with
>> > multiple virtual NIC cards does not look like a good tradeoff.
>> >
>> > So I'm unlikely to merge this upstream without making it useful cross-VM,
>> > and yes this means isolation and accounting with cgroups need to
>> > work properly.
>>
>> Agree, but even if you use a single multiqueue device having the
>> ability to use 1 thread to serve all the queues or multiple threads to
>> serve all the queues looks like a useful feature.
>
> Could be.  At the moment, multiqueue is off by default because it causes
> regressions for some workloads as compared to a single queue.
> If we have heuristics in vhost that fix this by auto-tuning threading, that
> would be nice.  But if you need to tune it manually anyway,
> then from upstream perspective it does not seem to be worth it - you can just
> turn multiqueue on/off in the guest.

I see. But we are again mixing policy and mechanism.
We first need a mechanism to control the system and then we need to
implement the policy to orchestrate it (whether it is
implemented in the kernel as part of vhost or outside in user-space).
I don't see why we should wait for a policy before upstreaming the
mechanism. If we upstream the mechanism in a manner that the defaults do
not affect today's vhost behavior, then it will be possible to play with
the policies and upstream them later.

>
>
>> >
>> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
>> >> further.
>> >> At the moment we prefer not to change the vhost model to use cwmq, as some
>> >> of the issues that were discussed, such as cgroups, are not supported by
>> >> cwmq, and this is adding more complexity.
>> >> However, we'll look further into it, and consider it at a later stage.
>> >
>> > Hmm that's still assuming some smart management tool configuring
>> > this correctly.  Can't this be determined automatically depending
>> > on the workload?
>> > This is what the cwmq suggestion was really about: detect
>> > that we need more threads and spawn them.
>> > It's less about sharing the implementation with workqueues -
>> > would be very nice but not a must.
>>
>> But how cwmq can consider cgroup accounting ?
>
> I think cwmq is just a replacement for our own thread pool.
> It doesn't make cgroup accounting easier or harder.

cwmq doesn't solve the cgroup issue, that's the problem. IMHO, it will
be much simpler to consider cgroups with our own threading model.

>
>> In any case, IMHO, the kernel should first provide the "mechanism" so
>> later on a user-space management application (the "policy") can
>> orchestrate it.
>
> I think policy would be something coarse-grained, like setting priority.
> Making detailed scheduling decisions in userspace seems wrong somehow:
> what does management application know that kernel doesn't?

The code of the management application can be changed and
customized. The management application can also
consider additional inputs (like per-VM SLAs) that are not available
to the kernel.
Think of MOM (memory over-commit manager) but for vhost (a vhost I/O
manager, VHIOM?)

>
>
>> >
>> >
>> >
>> >> 3. Adding polling mode to vhost
>> >>
>> >> It is a good idea making polling adaptive based on various factors such as
>> >> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
>> >> or the amount of wasted cycles (cycles we kept polling but no new work was
>> >> added).
>> >> However, as a beginning polling patch, we would prefer having a naive
>> >> polling approach, which could be tuned with later patches.
>> >>
>> >
>> > While any polling approach would still need a lot of testing to prove we
>> > don't for example steal CPU from guest which could be doing other useful
>> > work, given that an exit is at least 1.5K cycles at least in theory it
>> > seems like something that can improve performance.  I'm not sure how
>> > naive we can be without introducing regressions  for some workloads.
>> > For example, if we are on the same host CPU, there's no
>> > chance busy waiting will help us make progress.
>> > How about detecting that the VCPU thread that kicked us
>> > is currently running on another CPU, and only polling in
>> > this case?
>> >
>> >> 4. vhost statistics
>> >>
>> >> The issue that was raised for the vhost statistics was using ftrace
>> >> instead of the debugfs mechanism.
>> >> However, looking further into the kvm stat mechanism, we learned that
>> >> ftrace didn't replace the plain debugfs mechanism, but was used in
>> >> addition to it.
>> >>
>> >> We propose to continue using debugfs for statistics, in a manner similar
>> >> to kvm,
>> >> and at some point in the future ftrace can be added to vhost as well.
>> >
>> > IMHO which kvm stat is a useful script, the best tool
>> > for perf stats is still perf. So I would try to integrate with that.
>> > How it works internally is IMHO less important.
>> >
>> >> Does this plan look o.k.?
>> >> If there are no further comments, I'll start preparing the patches
>> >> according to what we've agreed on thus far.
>> >> Thank you,
>> >> Razya
>> >
>> > I think a good place to try to start merging upstream would be 3 and 4.
>> > So if you want to make it easier to merge things upstream, try to keep 3
>> > and 4 independent from 1 and 2.
>>
>> Note -1- and -3- are strongly related. If you have a thread that
>> serves multiple queues (whenever they belong to the same device/vm or
>> not) then this thread will be polling multiple queues at the same
>> time. This increases the chances you will find pending work to do in
>> some queue. In other words, you reduce the cycles wasted for polling.
>> In the other hand,  if you run multiple threads and these threads do
>> polling simultaneously then the threads may starve each other and
>> reduce performance (if they are scheduled to run in the same core). In
>> addition, a shared thread can decide when it should stop processing a
>> given queue and switch to other queue because by polling the thread
>> knows when new requests were added to a queue (this is what we called
>> fine-grained I/O scheduled heuristics)
>>
>> So, seems like polling makes more sense when you serve multiple queues
>> with the same thread.
>>
>> Abel.
>
> A combination might bring gains in more workloads, but it should work on its
> own too. It's quite possible that only a single VM is active, others are
> idle. So either polling should handle that well or be smart enough to turn
> itself off in this case.

Agreed, but we are discussing the policy again. We first need
the mechanism :)
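
To make the point about -1- and -3- quoted above concrete, here is a
toy model in Python (not kernel code). It assumes time is split into
polling rounds and that a round is wasted when the polling thread finds
nothing to do; the function names and the layout of arrivals are
invented for this sketch.

```python
def wasted_rounds_thread_per_queue(arrivals):
    """One polling thread per queue.

    arrivals[t][q] is truthy if queue q has new work in round t.
    Each thread wastes a round whenever its own queue is empty.
    """
    return sum(1 for round_ in arrivals for pending in round_ if not pending)


def wasted_rounds_shared_thread(arrivals):
    """One polling thread scanning all queues.

    A round is wasted only when every queue is empty.
    """
    return sum(1 for round_ in arrivals if not any(round_))


# Three queues, four rounds, bursty traffic on different queues.
arrivals = [
    (1, 0, 0),
    (0, 0, 0),
    (0, 1, 0),
    (0, 0, 1),
]
print(wasted_rounds_thread_per_queue(arrivals))  # 9
print(wasted_rounds_shared_thread(arrivals))     # 1
```

The model ignores the cost of scanning the extra queues and says
nothing about latency, so it only shows the direction of the effect.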

Abel.

>
>> >
>> > Thanks again,
>> >
>> > --
>> > MST


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-19 10:36       ` Abel Gordon
@ 2013-12-19 11:37         ` Michael S. Tsirkin
  2013-12-19 12:56           ` Abel Gordon
  0 siblings, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2013-12-19 11:37 UTC (permalink / raw)
  To: Abel Gordon
  Cc: Razya Ladelsky, Anthony Liguori, asias, digitaleric,
	Eran Raichstein, gleb, jasowang, Joel Nider, kvm, kvm-owner,
	pbonzini, Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Thu, Dec 19, 2013 at 12:36:30PM +0200, Abel Gordon wrote:
> On Thu, Dec 19, 2013 at 12:13 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
> >> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> >> >> Hi,
> >> >>
> >> >> Thank you all for your comments.
> >> >> I'm sorry for taking this long to reply, I was away on vacation..
> >> >>
> >> >> It was a good, long discussion, many issues were raised, which we'd like
> >> >> to address with the following proposed roadmap for Elvis patches.
> >> >> In general, we believe it would be best to start with patches that are
> >> >> as simple as possible, providing the basic Elvis functionality,
> >> >> and attend to the more complicated issues in subsequent patches.
> >> >>
> >> >> Here's the road map for Elvis patches:
> >> >
> >> > Thanks for the follow up. Some suggestions below.
> >> > Please note they suggestions below merely represent
> >> > thoughts on merging upstream.
> >> > If as the first step you are content with keeping this
> >> > work as out of tree patches, in order to have
> >> > the freedom to experiment with interfaces and
> >> > performance, please feel free to ignore them.
> >> >
> >> >> 1. Shared vhost thread for multiple devices.
> >> >>
> >> >> The way to go here, we believe, is to start with a patch having a shared
> >> >> vhost thread for multiple devices of the SAME vm.
> >> >> The next step/patch may be handling vms belonging to the same cgroup.
> >> >>
> >> >> Finally, we need to extend the functionality so that the shared vhost
> >> >> thread
> >> >> serves multiple vms (not necessarily belonging to the same cgroup).
> >> >>
> >> >> There was a lot of discussion about the way to address the enforcement
> >> >> of cgroup policies, and we will consider the various solutions with a
> >> >> future
> >> >> patch.
> >> >
> >> > With respect to the upstream kernel,
> >> > I'm not sure a bunch of changes just for the sake of guests with
> >> > multiple virtual NIC cards makes sense.
> >> > And I wonder how this step, in isolation, will affect e.g.
> >> > multiqueue workloads.
> >> > But I guess if the numbers are convincing, this can be mergeable.
> >>
> >> Even if you have a single multiqueue device this change allows
> >> to create one vhost thread for all the queues, one vhost thread per
> >> queue or any other combination. I guess that depending on the workload
> >> and depending on the system utilization (free cycles/cores, density)
> >> you would prefer
> >> to use one or more vhost threads.
> >
> > That is already controllable from the guest though, which likely has a better
> > idea about the workload.
> 
> but the guest has no idea about what's going on in the host system
> (e.g. other VMs I/O, cpu utilization of the host cores...)

But again, you want to do things per VM now, so you will have no idea
about other VMs, right? Host CPU utilization could be a useful input
for some heuristics, I agree, but nothing prevents us from sending
this info to a guest agent and controlling multiqueue based on that
(kind of like the balloon).

> >
> >> >
> >> >>
> >> >> 2. Creation of vhost threads
> >> >>
> >> >> We suggested two ways of controlling the creation and removal of vhost
> >> >> threads:
> >> >> - statically determining the maximum number of virtio devices per worker
> >> >> via a kernel module parameter
> >> >> - dynamically: Sysfs mechanism to add and remove vhost threads
> >> >>
> >> >> It seems that it would be simplest to take the static approach as
> >> >> a first stage. At a second stage (next patch), we'll advance to
> >> >> dynamically
> >> >> changing the number of vhost threads, using the static module parameter
> >> >> only as a default value.
> >> >
> >> > I'm not sure how independent this is from 1.
> >> > With respect to the upstream kernel,
> >> > Introducing interfaces (which we'll have to maintain
> >> > forever) just for the sake of guests with
> >> > multiple virtual NIC cards does not look like a good tradeoff.
> >> >
> >> > So I'm unlikely to merge this upstream without making it useful cross-VM,
> >> > and yes this means isolation and accounting with cgroups need to
> >> > work properly.
> >>
> >> Agree, but even if you use a single multiqueue device having the
> >> ability to use 1 thread to serve all the queues or multiple threads to
> >> serve all the queues looks like a useful feature.
> >
> > Could be.  At the moment, multiqueue is off by default because it causes
> > regressions for some workloads as compared to a single queue.
> > If we have heuristics in vhost that fix this by auto-tuning threading, that
> > would be nice.  But if you need to tune it manually anyway,
> > then from upstream perspective it does not seem to be worth it - you can just
> > turn multiqueue on/off in the guest.
> 
> I see. But we are mixing again between the policy and the mechanism.
> We first need a mechanism to control the system and then we need to
> implement the policy to orchestrate it (whenever it will be
> implemented in the kernel as part of vhost or outside in user-space).
> I don't see why to wait to have a policy to upstream the mechanism. If
> we upstream the mechanism in a manner that the defaults do not affect
> today's vhost behavior, then it will be possible to play with the
> policies and upstream them later.

Well, it all hinges on whether it's in userspace, actually.
Interfaces to userspace must be maintained forever, and there's no way
to know they are right and make sense before they are used, so we must
not merge the interface before there is userspace using it correctly.

If it's in-kernel, we can merge a bit of unused code into the kernel
just to make it easier for you to make progress, provided it's
well isolated and doesn't make life much harder for others
working on the same code-base.

> >
> >
> >> >
> >> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
> >> >> further.
> >> >> At the moment we prefer not to change the vhost model to use cwmq, as some
> >> >> of the issues that were discussed, such as cgroups, are not supported by
> >> >> cwmq, and this is adding more complexity.
> >> >> However, we'll look further into it, and consider it at a later stage.
> >> >
> >> > Hmm that's still assuming some smart management tool configuring
> >> > this correctly.  Can't this be determined automatically depending
> >> > on the workload?
> >> > This is what the cwmq suggestion was really about: detect
> >> > that we need more threads and spawn them.
> >> > It's less about sharing the implementation with workqueues -
> >> > would be very nice but not a must.
> >>
> >> But how cwmq can consider cgroup accounting ?
> >
> > I think cwmq is just a replacement for our own thread pool.
> > It doesn't make cgroup accounting easier or harder.
> 
> cwmq doesn't solve the cgroup issue, that's the problem. IMHO, it will
> be much simpler consider cgrouups with our own threading model.

Bandan here arrived at the same conclusion.

> >
> >> In any case, IMHO, the kernel should first provide the "mechanism" so
> >> later on a user-space management application (the "policy") can
> >> orchestrate it.
> >
> > I think policy would be something coarse-grained, like setting priority.
> > Making detailed scheduling decisions in userspace seems wrong somehow:
> > what does management application know that kernel doesn't?
> 
> The code of the management application can be
> changed/modified/customized... also the management application can
> consider additional inputs (like per VM SLAs) that are not available
> to the kernel.

An alternative is to express these in terms of cgroups.

> Think about MOM (memory over-commit manager) but for vhost (vhost I/O
> manager ?VHIOM ?)

I worry that CPU and network load are much more dynamic than
memory load, and that in real life userspace simply won't be able to
react to changes fast enough.
I also worry that this will interact with the scheduler in strange
ways, e.g. the manager wants to free up a CPU and moves some vhost
threads off it, then the scheduler sees there's a free CPU and moves
some more jobs there.

After all it might be the best we can do for over-committed systems,
but it seems that it will have to be inter-VM to be really useful.


Overall, one of the main points of vhost is a single interface
that lets us access all kinds of kernel APIs in a uniform way:
we tell the kernel what we want it to do, and let it go about
executing that in the most efficient manner possible.
If you insist on managing everything from userspace you lose
some of these advantages.
That's why I keep suggesting we try to give the scheduler more hints
about our expected behaviour.

> >
> >
> >> >
> >> >
> >> >
> >> >> 3. Adding polling mode to vhost
> >> >>
> >> >> It is a good idea making polling adaptive based on various factors such as
> >> >> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
> >> >> or the amount of wasted cycles (cycles we kept polling but no new work was
> >> >> added).
> >> >> However, as a beginning polling patch, we would prefer having a naive
> >> >> polling approach, which could be tuned with later patches.
> >> >>
> >> >
> >> > While any polling approach would still need a lot of testing to prove we
> >> > don't for example steal CPU from guest which could be doing other useful
> >> > work, given that an exit is at least 1.5K cycles at least in theory it
> >> > seems like something that can improve performance.  I'm not sure how
> >> > naive we can be without introducing regressions  for some workloads.
> >> > For example, if we are on the same host CPU, there's no
> >> > chance busy waiting will help us make progress.
> >> > How about detecting that the VCPU thread that kicked us
> >> > is currently running on another CPU, and only polling in
> >> > this case?
> >> >
> >> >> 4. vhost statistics
> >> >>
> >> >> The issue that was raised for the vhost statistics was using ftrace
> >> >> instead of the debugfs mechanism.
> >> >> However, looking further into the kvm stat mechanism, we learned that
> >> >> ftrace didn't replace the plain debugfs mechanism, but was used in
> >> >> addition to it.
> >> >>
> >> >> We propose to continue using debugfs for statistics, in a manner similar
> >> >> to kvm,
> >> >> and at some point in the future ftrace can be added to vhost as well.
> >> >
> >> > IMHO which kvm stat is a useful script, the best tool
> >> > for perf stats is still perf. So I would try to integrate with that.
> >> > How it works internally is IMHO less important.
> >> >
> >> >> Does this plan look o.k.?
> >> >> If there are no further comments, I'll start preparing the patches
> >> >> according to what we've agreed on thus far.
> >> >> Thank you,
> >> >> Razya
> >> >
> >> > I think a good place to try to start merging upstream would be 3 and 4.
> >> > So if you want to make it easier to merge things upstream, try to keep 3
> >> > and 4 independent from 1 and 2.
> >>
> >> Note -1- and -3- are strongly related. If you have a thread that
> >> serves multiple queues (whenever they belong to the same device/vm or
> >> not) then this thread will be polling multiple queues at the same
> >> time. This increases the chances you will find pending work to do in
> >> some queue. In other words, you reduce the cycles wasted for polling.
> >> In the other hand,  if you run multiple threads and these threads do
> >> polling simultaneously then the threads may starve each other and
> >> reduce performance (if they are scheduled to run in the same core). In
> >> addition, a shared thread can decide when it should stop processing a
> >> given queue and switch to other queue because by polling the thread
> >> knows when new requests were added to a queue (this is what we called
> >> fine-grained I/O scheduled heuristics)
> >>
> >> So, seems like polling makes more sense when you serve multiple queues
> >> with the same thread.
> >>
> >> Abel.
> >
> > A combination might bring gains in more workloads, but it should work on its
> > own too. It's quite possible that only a single VM is active, others are
> > idle. So either polling should handle that well or be smart enough to turn
> > itself off in this case.
> 
> Agree, but we are  discussing again about the policy. We first need
> the mechanism :)
> 
> Abel.

So you want to control this from a management application as well?
That's a very low level detail to expose to management.

Why would we ever want to poll? It burns CPU uselessly.
There are several cases where polling can be a win:

	1. Scheduler overhead is too large: we are thrashing
	   between multiple threads running on the same CPU.
	   This is the one you are thinking of, I guess: it can
	   be controlled from the management application.
	   So here some management interface might make some sense.
	   But to know how long to poll you would need to know
	   what the scheduler overhead is, then poll for
	   a fraction of that.

	2. KVM exit overhead is too large: we are wasting cycles
	   on exit/reentry.
	   Userspace doesn't easily know about exits though.
	   Again, we'd need to measure how long an exit takes,
	   then poll for a fraction of that.
	   This one is easier to measure from the guest, so maybe
	   we should let the guest control this.

	3. If we halt the CPU it might take too long to wake up.
	   If we expect to be woken up soon we'd be better off
	   busy waiting.
	   This looks like it would be better addressed by an in-kernel API
	   that says "I expect to be woken up in X microseconds".
	   This way if some other event occurs instead, it will be
	   handled instead of waiting for us to finish polling.

As you see, there are many factors in play here.
That's why we can't just add a userspace interface "poll for X cycles"
to support item 1: maintaining this interface would block
support for items 2 and 3.
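
A rough sketch of what "poll for a fraction of the overhead" in items 1
and 2 could look like, plus the decision in item 3. This is a Python
model, not kernel code, and every name and number in it is an
assumption made up for illustration; in particular the 1500 cycle exit
cost is just the "at least 1.5K cycles" figure quoted earlier in this
thread.

```python
def poll_budget_cycles(overhead_cycles, fraction=0.5):
    """Cycles to busy-poll before giving up (items 1 and 2).

    overhead_cycles is the measured cost we are trying to avoid
    (a scheduler switch or a KVM exit/reentry). Following the list
    above, the budget is a fraction of that cost. The default
    fraction of 0.5 is arbitrary.
    """
    if overhead_cycles < 0:
        raise ValueError("overhead_cycles must be non-negative")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return int(overhead_cycles * fraction)


def should_busy_wait(expected_wakeup_us, halt_wake_latency_us):
    """Item 3: busy-wait only if work is expected sooner than the
    time it would take to wake up from a halt."""
    return expected_wakeup_us < halt_wake_latency_us


# Exit cost of 1500 cycles, poll for a quarter of it.
print(poll_budget_cycles(1500, 0.25))  # 375
print(should_busy_wait(5, 50))         # True
print(should_busy_wait(200, 50))       # False
```

The inputs come from different places (scheduler, KVM exits, idle
wake-up latency), which is why a single fixed "poll for X cycles" knob
does not cover all three items.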


> >
> >> >
> >> > Thanks again,
> >> >
> >> > --
> >> > MST


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-19 11:37         ` Michael S. Tsirkin
@ 2013-12-19 12:56           ` Abel Gordon
  2013-12-19 13:48             ` Michael S. Tsirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Abel Gordon @ 2013-12-19 12:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Razya Ladelsky, Anthony Liguori, asias, Eric Northup,
	Eran Raichstein, gleb, jasowang, Joel Nider, kvm, kvm-owner,
	pbonzini, Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Thu, Dec 19, 2013 at 1:37 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Dec 19, 2013 at 12:36:30PM +0200, Abel Gordon wrote:
>> On Thu, Dec 19, 2013 at 12:13 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
>> >> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
>> >> >> Hi,
>> >> >>
>> >> >> Thank you all for your comments.
>> >> >> I'm sorry for taking this long to reply, I was away on vacation..
>> >> >>
>> >> >> It was a good, long discussion, many issues were raised, which we'd like
>> >> >> to address with the following proposed roadmap for Elvis patches.
>> >> >> In general, we believe it would be best to start with patches that are
>> >> >> as simple as possible, providing the basic Elvis functionality,
>> >> >> and attend to the more complicated issues in subsequent patches.
>> >> >>
>> >> >> Here's the road map for Elvis patches:
>> >> >
>> >> > Thanks for the follow up. Some suggestions below.
>> >> > Please note they suggestions below merely represent
>> >> > thoughts on merging upstream.
>> >> > If as the first step you are content with keeping this
>> >> > work as out of tree patches, in order to have
>> >> > the freedom to experiment with interfaces and
>> >> > performance, please feel free to ignore them.
>> >> >
>> >> >> 1. Shared vhost thread for multiple devices.
>> >> >>
>> >> >> The way to go here, we believe, is to start with a patch having a shared
>> >> >> vhost thread for multiple devices of the SAME vm.
>> >> >> The next step/patch may be handling vms belonging to the same cgroup.
>> >> >>
>> >> >> Finally, we need to extend the functionality so that the shared vhost
>> >> >> thread
>> >> >> serves multiple vms (not necessarily belonging to the same cgroup).
>> >> >>
>> >> >> There was a lot of discussion about the way to address the enforcement
>> >> >> of cgroup policies, and we will consider the various solutions with a
>> >> >> future
>> >> >> patch.
>> >> >
>> >> > With respect to the upstream kernel,
>> >> > I'm not sure a bunch of changes just for the sake of guests with
>> >> > multiple virtual NIC cards makes sense.
>> >> > And I wonder how this step, in isolation, will affect e.g.
>> >> > multiqueue workloads.
>> >> > But I guess if the numbers are convincing, this can be mergeable.
>> >>
>> >> Even if you have a single multiqueue device this change allows
>> >> to create one vhost thread for all the queues, one vhost thread per
>> >> queue or any other combination. I guess that depending on the workload
>> >> and depending on the system utilization (free cycles/cores, density)
>> >> you would prefer
>> >> to use one or more vhost threads.
>> >
>> > That is already controllable from the guest though, which likely has a better
>> > idea about the workload.
>>
>> but the guest has no idea about what's going on in the host system
>> (e.g. other VMs I/O, cpu utilization of the host cores...)
>
> But again, you want to do things per VM now so you will have no idea
> about other VMs, right? Host cpu utilization could be a useful input

Razya shared a roadmap. The first step is to support sharing a thread
within a single VM, but the goal is to later extend the mechanism to
support multiple VMs and cgroups.

> for some heuristics, I agree, but nothing prevents us from sending
> this info to guest agent and controlling multiqueue based on that
> (kind of like balloon).

IMHO, we should never share host internal information (e.g. resource
utilization) with the guest. That's supposed to be confidential
information :)
The balloon is a bit different... kvm asks the guest OS to give up (if
possible) some pages, but kvm never sends the balloon information
about the memory utilization of the host. If the guest wishes to send
information about its own memory consumption (like it does for MOM),
that's OK.

So, the guest can share information with the host, but the host should
be the one to make the decisions. KVM should never share host
information with the guest.

>
>> >
>> >> >
>> >> >>
>> >> >> 2. Creation of vhost threads
>> >> >>
>> >> >> We suggested two ways of controlling the creation and removal of vhost
>> >> >> threads:
>> >> >> - statically determining the maximum number of virtio devices per worker
>> >> >> via a kernel module parameter
>> >> >> - dynamically: Sysfs mechanism to add and remove vhost threads
>> >> >>
>> >> >> It seems that it would be simplest to take the static approach as
>> >> >> a first stage. At a second stage (next patch), we'll advance to
>> >> >> dynamically
>> >> >> changing the number of vhost threads, using the static module parameter
>> >> >> only as a default value.
>> >> >
>> >> > I'm not sure how independent this is from 1.
>> >> > With respect to the upstream kernel,
>> >> > Introducing interfaces (which we'll have to maintain
>> >> > forever) just for the sake of guests with
>> >> > multiple virtual NIC cards does not look like a good tradeoff.
>> >> >
>> >> > So I'm unlikely to merge this upstream without making it useful cross-VM,
>> >> > and yes this means isolation and accounting with cgroups need to
>> >> > work properly.
>> >>
>> >> Agree, but even if you use a single multiqueue device having the
>> >> ability to use 1 thread to serve all the queues or multiple threads to
>> >> serve all the queues looks like a useful feature.
>> >
>> > Could be.  At the moment, multiqueue is off by default because it causes
>> > regressions for some workloads as compared to a single queue.
>> > If we have heuristics in vhost that fix this by auto-tuning threading, that
>> > would be nice.  But if you need to tune it manually anyway,
>> > then from upstream perspective it does not seem to be worth it - you can just
>> > turn multiqueue on/off in the guest.
>>
>> I see. But we are mixing again between the policy and the mechanism.
>> We first need a mechanism to control the system and then we need to
>> implement the policy to orchestrate it (whenever it will be
>> implemented in the kernel as part of vhost or outside in user-space).
>> I don't see why to wait to have a policy to upstream the mechanism. If
>> we upstream the mechanism in a manner that the defaults do not affect
>> today's vhost behavior, then it will be possible to play with the
>> policies and upstream them later.
>
> Well it all hinges on whether it's in userspace actually.
> Interfaces to userspace must be maintained forever, there's no way to know
> they are right and make sense before they are used,  so we must not merge
> the interface before there is userspace using it correctly.
>
> If it's in-kernel, we can merge a bit of unused code into kernel
> just to make it easier for you to make progress, provided it's
> well isolated and doesn't make life much harder for others
> working on same code-base.

I see your concerns. We can also expose the mechanism as exported
symbols and implement the policy in a different kernel module. Is that
OK? Or are you also concerned about exporting symbols to other
kernel modules because they may change as well?

>
>> >
>> >
>> >> >
>> >> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
>> >> >> further.
>> >> >> At the moment we prefer not to change the vhost model to use cwmq, as some
>> >> >> of the issues that were discussed, such as cgroups, are not supported by
>> >> >> cwmq, and this is adding more complexity.
>> >> >> However, we'll look further into it, and consider it at a later stage.
>> >> >
>> >> > Hmm that's still assuming some smart management tool configuring
>> >> > this correctly.  Can't this be determined automatically depending
>> >> > on the workload?
>> >> > This is what the cwmq suggestion was really about: detect
>> >> > that we need more threads and spawn them.
>> >> > It's less about sharing the implementation with workqueues -
>> >> > would be very nice but not a must.
>> >>
>> >> But how cwmq can consider cgroup accounting ?
>> >
>> > I think cwmq is just a replacement for our own thread pool.
>> > It doesn't make cgroup accounting easier or harder.
>>
>> cwmq doesn't solve the cgroup issue, that's the problem. IMHO, it will
>> be much simpler consider cgrouups with our own threading model.
>
> Bandan here arrived at the same conclusion.

Does "same conclusion" mean no cwmq?

>
>> >
>> >> In any case, IMHO, the kernel should first provide the "mechanism" so
>> >> later on a user-space management application (the "policy") can
>> >> orchestrate it.
>> >
>> > I think policy would be something coarse-grained, like setting priority.
>> > Making detailed scheduling decisions in userspace seems wrong somehow:
>> > what does management application know that kernel doesn't?
>>
>> The code of the management application can be
>> changed/modified/customized... also the management application can
>> consider additional inputs (like per VM SLAs) that are not available
>> to the kernel.
>
> An alternative is to express these in terms of cgroups.

True, but there is a semantic gap between cgroups (low-level system
metrics) and SLAs (high-level application performance metrics). I
don't believe everything can be expressed as cgroups...

>
>> Think about MOM (memory over-commit manager) but for vhost (vhost I/O
>> manager ?VHIOM ?)
>
> I worry that CPU and network load is much more dynamic than
> memory load, and that in real life userspace simply won't be able to
> react to changes fast enough.
> Also that this will interact with scheduler in strange ways, e.g.
> manager wants to free up the CPU and moves some vhost threads
> off it, scheduler sees there's free CPU and moves some more jobs there.

I believe we can make very good decisions in user-space, but obviously
this requires some research and experimentation. I think Eyal Moscovici
is working on this.

> After all it might be the best we can do for over-committed systems,
> but it seems that it will have to be inter-VM to be really useful.

Agree 100%. The goal is to have inter-VM support. Single VM is the
first step towards inter-VM.

>
>
> Overall one of the main points of vhost is a single interface
> that let us access all kind of kernel APIs in a uniform way:
> we tell kernel what we want it to do, and let it get about
> executing that in the most efficient manner possible.
> If you insist on managing everything from userspace you lose
> some of these advantages.
> That's why I keep suggesting we try to give scheduler more hints
> about our expected behaviour.
>
>> >
>> >
>> >> >
>> >> >
>> >> >
>> >> >> 3. Adding polling mode to vhost
>> >> >>
>> >> >> It is a good idea making polling adaptive based on various factors such as
>> >> >> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
>> >> >> or the amount of wasted cycles (cycles we kept polling but no new work was
>> >> >> added).
>> >> >> However, as a beginning polling patch, we would prefer having a naive
>> >> >> polling approach, which could be tuned with later patches.
>> >> >>
>> >> >
>> >> > While any polling approach would still need a lot of testing to prove we
>> >> > don't for example steal CPU from guest which could be doing other useful
>> >> > work, given that an exit is at least 1.5K cycles at least in theory it
>> >> > seems like something that can improve performance.  I'm not sure how
>> >> > naive we can be without introducing regressions  for some workloads.
>> >> > For example, if we are on the same host CPU, there's no
>> >> > chance busy waiting will help us make progress.
>> >> > How about detecting that the VCPU thread that kicked us
>> >> > is currently running on another CPU, and only polling in
>> >> > this case?
>> >> >
>> >> >> 4. vhost statistics
>> >> >>
>> >> >> The issue that was raised for the vhost statistics was using ftrace
>> >> >> instead of the debugfs mechanism.
>> >> >> However, looking further into the kvm stat mechanism, we learned that
>> >> >> ftrace didn't replace the plain debugfs mechanism, but was used in
>> >> >> addition to it.
>> >> >>
>> >> >> We propose to continue using debugfs for statistics, in a manner similar
>> >> >> to kvm,
>> >> >> and at some point in the future ftrace can be added to vhost as well.
>> >> >
>> >> > IMHO while kvm stat is a useful script, the best tool
>> >> > for perf stats is still perf. So I would try to integrate with that.
>> >> > How it works internally is IMHO less important.
>> >> >
>> >> >> Does this plan look o.k.?
>> >> >> If there are no further comments, I'll start preparing the patches
>> >> >> according to what we've agreed on thus far.
>> >> >> Thank you,
>> >> >> Razya
>> >> >
>> >> > I think a good place to try to start merging upstream would be 3 and 4.
>> >> > So if you want to make it easier to merge things upstream, try to keep 3
>> >> > and 4 independent from 1 and 2.
>> >>
>> >> Note -1- and -3- are strongly related. If you have a thread that
>> >> serves multiple queues (whenever they belong to the same device/vm or
>> >> not) then this thread will be polling multiple queues at the same
>> >> time. This increases the chances you will find pending work to do in
>> >> some queue. In other words, you reduce the cycles wasted for polling.
>> >> On the other hand, if you run multiple threads and these threads do
>> >> polling simultaneously then the threads may starve each other and
>> >> reduce performance (if they are scheduled to run in the same core). In
>> >> addition, a shared thread can decide when it should stop processing a
>> >> given queue and switch to other queue because by polling the thread
>> >> knows when new requests were added to a queue (this is what we called
>> >> fine-grained I/O scheduled heuristics)
>> >>
>> >> So, seems like polling makes more sense when you serve multiple queues
>> >> with the same thread.
>> >>
>> >> Abel.
>> >
>> > A combination might bring gains in more workloads, but it should work on its
>> > own too. It's quite possible that only a single VM is active, others are
>> > idle. So either polling should handle that well or be smart enough to turn
>> > itself off in this case.
>>
>> Agree, but we are  discussing again about the policy. We first need
>> the mechanism :)
>>
>> Abel.
>
> So you want to control this from a management application as well?
> That's a very low level detail to expose to management.
>
> Why do we ever want to poll? It burns CPU uselessly.
> There are several reasons for polling to be good:
>
>         1. scheduler overhead is too large, we are thrashing
>           between multiple threads running on the same CPU.
>           This is the one you are thinking of I guess: it can
>           be controlled from the management application.
>           So here some management interface might make some sense.
>           But to know how long to poll you would need to know
>           what the scheduler overhead is, then poll for
>           a fraction of that.
>
>         2. KVM exit overhead is too large, we are wasting cycles
>           on exit/reentry.
>           Userspace doesn't easily know about exits though.
>           Again, we'd need to measure how long does an exit take,
>           then poll for a fraction of that.
>           This one is easier to measure from guest, so maybe
>           we should let guest control this.
>
>         3. If we halt the CPU it might take too long to wake up.
>           If we expect to be woken up soon we'd be better off
>           busy waiting.
>           This looks like it would be better addressed by an in-kernel API
>           that says "I expect to be woken up in X microseconds".
>           This way if some other event occurs instead, it will be
>           handled instead of waiting for us to finish polling.
>
> As you see there are many factors in play here.
> That's why we can't just add userspace interface "poll for X cycles":
> to support item 1.  maintaining this interface would block
> support for items 2 and 3.

It doesn't need to be pure user-space. Some parts (like "when" to do
polling) can be controlled in kernel but other parts (like number of
threads) can be controlled by user-space. Also some of the polling
settings (completely disable polling, min/max polling limitations...)
can be controlled by user-space.
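
A minimal sketch of that split, written as a user-space model in Python
rather than as vhost code: the kernel side proposes how long to poll, and
user-space only supplies limits. PollLimits, effective_poll_cycles and the
field names are hypothetical and do not correspond to any existing vhost
interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PollLimits:
    """Hypothetical user-space settings; not an existing vhost interface."""
    enabled: bool = True
    min_cycles: int = 0
    max_cycles: int = 0  # 0 means "no upper limit" in this sketch


def effective_poll_cycles(kernel_choice: int, limits: PollLimits) -> int:
    """Clamp the kernel's own polling decision to the user-space limits."""
    if not limits.enabled:
        return 0  # polling completely disabled by user-space
    cycles = max(kernel_choice, limits.min_cycles)
    if limits.max_cycles:
        cycles = min(cycles, limits.max_cycles)
    return cycles
```

Here the "when" and "how long" stay with the kernel's choice, and the
user-space settings can only disable polling or bound it.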

>
>
>> >
>> >> >
>> >> > Thanks again,
>> >> >
>> >> > --
>> >> > MST


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-19 12:56           ` Abel Gordon
@ 2013-12-19 13:48             ` Michael S. Tsirkin
  2013-12-19 14:19               ` Abel Gordon
  0 siblings, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2013-12-19 13:48 UTC (permalink / raw)
  To: Abel Gordon
  Cc: Razya Ladelsky, Anthony Liguori, asias, Eric Northup,
	Eran Raichstein, gleb, jasowang, Joel Nider, kvm, kvm-owner,
	pbonzini, Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Thu, Dec 19, 2013 at 02:56:10PM +0200, Abel Gordon wrote:
> On Thu, Dec 19, 2013 at 1:37 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Thu, Dec 19, 2013 at 12:36:30PM +0200, Abel Gordon wrote:
> >> On Thu, Dec 19, 2013 at 12:13 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
> >> >> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> >> >> >> Hi,
> >> >> >>
> >> >> >> Thank you all for your comments.
> >> >> >> I'm sorry for taking this long to reply, I was away on vacation..
> >> >> >>
> >> >> >> It was a good, long discussion, many issues were raised, which we'd like
> >> >> >> to address with the following proposed roadmap for Elvis patches.
> >> >> >> In general, we believe it would be best to start with patches that are
> >> >> >> as simple as possible, providing the basic Elvis functionality,
> >> >> >> and attend to the more complicated issues in subsequent patches.
> >> >> >>
> >> >> >> Here's the road map for Elvis patches:
> >> >> >
> >> >> > Thanks for the follow up. Some suggestions below.
> >> >> > Please note the suggestions below merely represent
> >> >> > thoughts on merging upstream.
> >> >> > If as the first step you are content with keeping this
> >> >> > work as out of tree patches, in order to have
> >> >> > the freedom to experiment with interfaces and
> >> >> > performance, please feel free to ignore them.
> >> >> >
> >> >> >> 1. Shared vhost thread for multiple devices.
> >> >> >>
> >> >> >> The way to go here, we believe, is to start with a patch having a shared
> >> >> >> vhost thread for multiple devices of the SAME vm.
> >> >> >> The next step/patch may be handling vms belonging to the same cgroup.
> >> >> >>
> >> >> >> Finally, we need to extend the functionality so that the shared vhost
> >> >> >> thread
> >> >> >> serves multiple vms (not necessarily belonging to the same cgroup).
> >> >> >>
> >> >> >> There was a lot of discussion about the way to address the enforcement
> >> >> >> of cgroup policies, and we will consider the various solutions with a
> >> >> >> future
> >> >> >> patch.
> >> >> >
> >> >> > With respect to the upstream kernel,
> >> >> > I'm not sure a bunch of changes just for the sake of guests with
> >> >> > multiple virtual NIC cards makes sense.
> >> >> > And I wonder how this step, in isolation, will affect e.g.
> >> >> > multiqueue workloads.
> >> >> > But I guess if the numbers are convincing, this can be mergeable.
> >> >>
> >> >> Even if you have a single multiqueue device this change allows
> >> >> to create one vhost thread for all the queues, one vhost thread per
> >> >> queue or any other combination. I guess that depending on the workload
> >> >> and depending on the system utilization (free cycles/cores, density)
> >> >> you would prefer
> >> >> to use one or more vhost threads.
> >> >
> >> > That is already controllable from the guest though, which likely has a better
> >> > idea about the workload.
> >>
> >> but the guest has no idea about what's going on in the host system
> >> (e.g. other VMs I/O, cpu utilization of the host cores...)
> >
> > But again, you want to do things per VM now so you will have no idea
> > about other VMs, right? Host cpu utilization could be a useful input
> 
> Razya shared a roadmap. The first step was to support sharing a thread
> for a single VM but the goal is to later on extend the mechanism to
> support multiple VMs and cgroups

Yes, I got that. What I'm not sure of is whether this is just a
development roadmap, or do you expect to be able to merge things
upstream in this order as well.
If the latter, all I'm saying is that I think you are doing this in the wrong
order: we'll likely have to merge first 4 then 3 then possibly 1+2 together -
but maybe 1+2 will have to wait until cgroups are sorted out.
That's just a hunch of course until you actually try to do it.
If the former, most of my comments don't really apply.

> > for some heuristics, I agree, but nothing prevents us from sending
> > this info to guest agent and controlling multiqueue based on that
> > (kind of like balloon).
> 
> IMHO, we should never share host internal information (e.g. resource
> utilization) with the guest. That's supposed to be confidential
> information :)
> The balloon is a bit different... kvm asks the guest OS to give  (if
> possible) some pages but  kvm never sends to the balloon information
> about the memory utilization of the host. If the guest wishes to send
> information about its own memory consumption (like it does for MOM),
> that's OK.
> 
> So, the guest can share information with the host but the host should
> be the one to make the decisions. KVM should never share host
> information with the guest.

It's also easy to just tell guest agent to turn multiqueue on/off if you have
a mind to.


> >
> >> >
> >> >> >
> >> >> >>
> >> >> >> 2. Creation of vhost threads
> >> >> >>
> >> >> >> We suggested two ways of controlling the creation and removal of vhost
> >> >> >> threads:
> >> >> >> - statically determining the maximum number of virtio devices per worker
> >> >> >> via a kernel module parameter
> >> >> >> - dynamically: Sysfs mechanism to add and remove vhost threads
> >> >> >>
> >> >> >> It seems that it would be simplest to take the static approach as
> >> >> >> a first stage. At a second stage (next patch), we'll advance to
> >> >> >> dynamically
> >> >> >> changing the number of vhost threads, using the static module parameter
> >> >> >> only as a default value.
> >> >> >
> >> >> > I'm not sure how independent this is from 1.
> >> >> > With respect to the upstream kernel,
> >> >> > Introducing interfaces (which we'll have to maintain
> >> >> > forever) just for the sake of guests with
> >> >> > multiple virtual NIC cards does not look like a good tradeoff.
> >> >> >
> >> >> > So I'm unlikely to merge this upstream without making it useful cross-VM,
> >> >> > and yes this means isolation and accounting with cgroups need to
> >> >> > work properly.
> >> >>
> >> >> Agree, but even if you use a single multiqueue device having the
> >> >> ability to use 1 thread to serve all the queues or multiple threads to
> >> >> serve all the queues looks like a useful feature.
> >> >
> >> > Could be.  At the moment, multiqueue is off by default because it causes
> >> > regressions for some workloads as compared to a single queue.
> >> > If we have heuristics in vhost that fix this by auto-tuning threading, that
> >> > would be nice.  But if you need to tune it manually anyway,
> >> > then from upstream perspective it does not seem to be worth it - you can just
> >> > turn multiqueue on/off in the guest.
> >>
> >> I see. But we are mixing again between the policy and the mechanism.
> >> We first need a mechanism to control the system and then we need to
> >> implement the policy to orchestrate it (whenever it will be
> >> implemented in the kernel as part of vhost or outside in user-space).
> >> I don't see why to wait to have a policy to upstream the mechanism. If
> >> we upstream the mechanism in a manner that the defaults do not affect
> >> today's vhost behavior, then it will be possible to play with the
> >> policies and upstream them later.
> >
> > Well it all hinges on whether it's in userspace actually.
> > Interfaces to userspace must be maintained forever, there's no way to know
> > they are right and make sense before they are used,  so we must not merge
> > the interface before there is userspace using it correctly.
> >
> > If it's in-kernel, we can merge a bit of unused code into kernel
> > just to make it easier for you to make progress, provided it's
> > well isolated and doesn't make life much harder for others
> > working on same code-base.
> 
> I see your concerns. We can also expose the mechanism as exported
> symbols and implement the policy in a different kernel module.

Just in case I didn't make this clear previously - my gut feeling is
that the split between mechanism and policy is in the wrong place
here.  Policy is what to do, mechanism is how to do it.
What is the policy in aid of? What is the "what to do" here? "process packets
more efficiently" right?  Whether to poll and which threads process which
packets sounds more like how to do it - a mechanism.

> Is that
> OK ? or you are also concerned about  exporting symbols to other
> kernel modules because they may change as well ?

Sorry, I do not understand what is suggested here.

> >
> >> >
> >> >
> >> >> >
> >> >> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
> >> >> >> further.
> >> >> >> At the moment we prefer not to change the vhost model to use cwmq, as some
> >> >> >> of the issues that were discussed, such as cgroups, are not supported by
> >> >> >> cwmq, and this is adding more complexity.
> >> >> >> However, we'll look further into it, and consider it at a later stage.
> >> >> >
> >> >> > Hmm that's still assuming some smart management tool configuring
> >> >> > this correctly.  Can't this be determined automatically depending
> >> >> > on the workload?
> >> >> > This is what the cwmq suggestion was really about: detect
> >> >> > that we need more threads and spawn them.
> >> >> > It's less about sharing the implementation with workqueues -
> >> >> > would be very nice but not a must.
> >> >>
> >> >> But how cwmq can consider cgroup accounting ?
> >> >
> >> > I think cwmq is just a replacement for our own thread pool.
> >> > It doesn't make cgroup accounting easier or harder.
> >>
> >> cwmq doesn't solve the cgroup issue, that's the problem. IMHO, it will
> >> be much simpler to consider cgroups with our own threading model.
> >
> > Bandan here arrived at the same conclusion.
> 
> Same conclusion means no cwmq  ?

Yes.

> >
> >> >
> >> >> In any case, IMHO, the kernel should first provide the "mechanism" so
> >> >> later on a user-space management application (the "policy") can
> >> >> orchestrate it.
> >> >
> >> > I think policy would be something coarse-grained, like setting priority.
> >> > Making detailed scheduling decisions in userspace seems wrong somehow:
> >> > what does management application know that kernel doesn't?
> >>
> >> The code of the management application can be
> >> changed/modified/customized... also the management application can
> >> consider additional inputs (like per VM SLAs) that are not available
> >> to the kernel.
> >
> > An alternative is to express these in terms of cgroups.
> 
> True but there is a semantic gap between cgroups (low level system
> metrics) and SLAs (high level application performance metrics). I
> don't believe everything can be expressed as cgroups...

Not existing cgroups, but if there's another resource we need to
partition, it can be expressed as another kind of cgroup, I think.

> >
> >> Think about MOM (memory over-commit manager) but for vhost (vhost I/O
> >> manager ?VHIOM ?)
> >
> > I worry that CPU and network load is much more dynamic than
> > memory load, and that in real life userspace simply won't be able to
> > react to changes fast enough.
> > Also that this will interact with scheduler in strange ways, e.g.
> > manager wants to free up the CPU and moves some vhost threads
> > off it, scheduler sees there's free CPU and moves some more jobs there.
> 
> I believe we can make very good decisions in user-space but obviously
> this requires some research and experimentation. I think Eyal Moscovici
> is working on this.
> 
> > After all it might be the best we can do for over-committed systems,
> > but it seems that it will have to be inter-VM to be really useful.
> 
> Agree 100%. The goal is to have inter-VM support. Single VM is the
> first step to implement inter-VM.

So I have my doubts about this being mergeable at the single VM stage.
But we'll see.


> >
> >
> > Overall one of the main points of vhost is a single interface
> > that let us access all kind of kernel APIs in a uniform way:
> > we tell kernel what we want it to do, and let it get about
> > executing that in the most efficient manner possible.
> > If you insist on managing everything from userspace you lose
> > some of these advantages.
> > That's why I keep suggesting we try to give scheduler more hints
> > about our expected behaviour.
> >
> >> >
> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >> 3. Adding polling mode to vhost
> >> >> >>
> >> >> >> It is a good idea making polling adaptive based on various factors such as
> >> >> >> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
> >> >> >> or the amount of wasted cycles (cycles we kept polling but no new work was
> >> >> >> added).
> >> >> >> However, as a beginning polling patch, we would prefer having a naive
> >> >> >> polling approach, which could be tuned with later patches.
> >> >> >>
> >> >> >
> >> >> > While any polling approach would still need a lot of testing to prove we
> >> >> > don't for example steal CPU from guest which could be doing other useful
> >> >> > work, given that an exit is at least 1.5K cycles at least in theory it
> >> >> > seems like something that can improve performance.  I'm not sure how
> >> >> > naive we can be without introducing regressions  for some workloads.
> >> >> > For example, if we are on the same host CPU, there's no
> >> >> > chance busy waiting will help us make progress.
> >> >> > How about detecting that the VCPU thread that kicked us
> >> >> > is currently running on another CPU, and only polling in
> >> >> > this case?
> >> >> >
> >> >> >> 4. vhost statistics
> >> >> >>
> >> >> >> The issue that was raised for the vhost statistics was using ftrace
> >> >> >> instead of the debugfs mechanism.
> >> >> >> However, looking further into the kvm stat mechanism, we learned that
> >> >> >> ftrace didn't replace the plain debugfs mechanism, but was used in
> >> >> >> addition to it.
> >> >> >>
> >> >> >> We propose to continue using debugfs for statistics, in a manner similar
> >> >> >> to kvm,
> >> >> >> and at some point in the future ftrace can be added to vhost as well.
> >> >> >
> >> >> > IMHO while kvm stat is a useful script, the best tool
> >> >> > for perf stats is still perf. So I would try to integrate with that.
> >> >> > How it works internally is IMHO less important.
> >> >> >
> >> >> >> Does this plan look o.k.?
> >> >> >> If there are no further comments, I'll start preparing the patches
> >> >> >> according to what we've agreed on thus far.
> >> >> >> Thank you,
> >> >> >> Razya
> >> >> >
> >> >> > I think a good place to try to start merging upstream would be 3 and 4.
> >> >> > So if you want to make it easier to merge things upstream, try to keep 3
> >> >> > and 4 independent from 1 and 2.
> >> >>
> >> >> Note -1- and -3- are strongly related. If you have a thread that
> >> >> serves multiple queues (whenever they belong to the same device/vm or
> >> >> not) then this thread will be polling multiple queues at the same
> >> >> time. This increases the chances you will find pending work to do in
> >> >> some queue. In other words, you reduce the cycles wasted for polling.
> >> >> On the other hand, if you run multiple threads and these threads do
> >> >> polling simultaneously then the threads may starve each other and
> >> >> reduce performance (if they are scheduled to run in the same core). In
> >> >> addition, a shared thread can decide when it should stop processing a
> >> >> given queue and switch to other queue because by polling the thread
> >> >> knows when new requests were added to a queue (this is what we called
> >> >> fine-grained I/O scheduled heuristics)
> >> >>
> >> >> So, seems like polling makes more sense when you serve multiple queues
> >> >> with the same thread.
> >> >>
> >> >> Abel.
> >> >
> >> > A combination might bring gains in more workloads, but it should work on its
> >> > own too. It's quite possible that only a single VM is active, others are
> >> > idle. So either polling should handle that well or be smart enough to turn
> >> > itself off in this case.
> >>
> >> Agree, but we are  discussing again about the policy. We first need
> >> the mechanism :)
> >>
> >> Abel.
> >
> > So you want to control this from a management application as well?
> > That's a very low level detail to expose to management.
> >
> > Why do we ever want to poll? It burns CPU uselessly.
> > There are several reasons for polling to be good:
> >
> >         1. scheduler overhead is too large, we are thrashing
> >           between multiple threads running on the same CPU.
> >           This is the one you are thinking of I guess: it can
> >           be controlled from the management application.
> >           So here some management interface might make some sense.
> >           But to know how long to poll you would need to know
> >           what the scheduler overhead is, then poll for
> >           a fraction of that.
> >
> >         2. KVM exit overhead is too large, we are wasting cycles
> >           on exit/reentry.
> >           Userspace doesn't easily know about exits though.
> >           Again, we'd need to measure how long does an exit take,
> >           then poll for a fraction of that.
> >           This one is easier to measure from guest, so maybe
> >           we should let guest control this.
> >
> >         3. If we halt the CPU it might take too long to wake up.
> >           If we expect to be woken up soon we'd be better off
> >           busy waiting.
> >           This looks like it would be better addressed by an in-kernel API
> >           that says "I expect to be woken up in X microseconds".
> >           This way if some other event occurs instead, it will be
> >           handled instead of waiting for us to finish polling.
> >
> > As you see there are many factors in play here.
> > That's why we can't just add userspace interface "poll for X cycles":
> > to support item 1.  maintaining this interface would block
> > support for items 2 and 3.
> 
> It doesn't need to be pure user-space. Some parts (like "when" to do
> polling) can be controlled in kernel but other parts (like number of
> threads) can be controlled by user-space.  Also some of the polling
> settings  (completely disable polling, min/max polling limitations...)
> can be controlled  by user-space

So above concern would apply to min polling as well.
max polling could be expressed as some kind of latency/responsiveness
requirement and in that case it might start to make some sense.
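
To make the two ideas concrete (poll for a fraction of a measured overhead,
and bound it by a latency requirement), here is a rough user-space model in
Python. It is only a sketch: poll_budget_cycles and all of its parameters
are hypothetical, and nothing here is an existing vhost or KVM interface.

```python
def poll_budget_cycles(overhead_cycles: int, fraction: float,
                       max_latency_us: float, cpu_mhz: float) -> int:
    """Poll for a fraction of the measured overhead, capped by a latency bound.

    All parameters are assumptions of this sketch: overhead_cycles is a
    measured exit or scheduler cost, max_latency_us is the responsiveness
    requirement, and cpu_mhz converts microseconds to cycles.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    wanted = int(overhead_cycles * fraction)
    # 1 MHz is one cycle per microsecond, so us * MHz gives cycles.
    cap = int(max_latency_us * cpu_mhz)
    return min(wanted, cap)
```

For example, with the 1.5K-cycle exit cost mentioned earlier in the thread,
a fraction of 0.5 and a generous latency bound, the budget is 750 cycles;
a tight latency bound lowers it further.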

> >
> >
> >> >
> >> >> >
> >> >> > Thanks again,
> >> >> >
> >> >> > --
> >> >> > MST


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-19 13:48             ` Michael S. Tsirkin
@ 2013-12-19 14:19               ` Abel Gordon
  2013-12-19 14:48                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Abel Gordon @ 2013-12-19 14:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Razya Ladelsky, Anthony Liguori, asias, Eric Northup,
	Eran Raichstein, gleb, jasowang, Joel Nider, kvm, kvm-owner,
	pbonzini, Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Thu, Dec 19, 2013 at 3:48 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Dec 19, 2013 at 02:56:10PM +0200, Abel Gordon wrote:
>> On Thu, Dec 19, 2013 at 1:37 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Thu, Dec 19, 2013 at 12:36:30PM +0200, Abel Gordon wrote:
>> >> On Thu, Dec 19, 2013 at 12:13 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
>> >> >> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> Thank you all for your comments.
>> >> >> >> I'm sorry for taking this long to reply, I was away on vacation..
>> >> >> >>
>> >> >> >> It was a good, long discussion, many issues were raised, which we'd like
>> >> >> >> to address with the following proposed roadmap for Elvis patches.
>> >> >> >> In general, we believe it would be best to start with patches that are
>> >> >> >> as simple as possible, providing the basic Elvis functionality,
>> >> >> >> and attend to the more complicated issues in subsequent patches.
>> >> >> >>
>> >> >> >> Here's the road map for Elvis patches:
>> >> >> >
>> >> >> > Thanks for the follow up. Some suggestions below.
>> >> >> > Please note the suggestions below merely represent
>> >> >> > thoughts on merging upstream.
>> >> >> > If as the first step you are content with keeping this
>> >> >> > work as out of tree patches, in order to have
>> >> >> > the freedom to experiment with interfaces and
>> >> >> > performance, please feel free to ignore them.
>> >> >> >
>> >> >> >> 1. Shared vhost thread for multiple devices.
>> >> >> >>
>> >> >> >> The way to go here, we believe, is to start with a patch having a shared
>> >> >> >> vhost thread for multiple devices of the SAME vm.
>> >> >> >> The next step/patch may be handling vms belonging to the same cgroup.
>> >> >> >>
>> >> >> >> Finally, we need to extend the functionality so that the shared vhost
>> >> >> >> thread
>> >> >> >> serves multiple vms (not necessarily belonging to the same cgroup).
>> >> >> >>
>> >> >> >> There was a lot of discussion about the way to address the enforcement
>> >> >> >> of cgroup policies, and we will consider the various solutions with a
>> >> >> >> future
>> >> >> >> patch.
>> >> >> >
>> >> >> > With respect to the upstream kernel,
>> >> >> > I'm not sure a bunch of changes just for the sake of guests with
>> >> >> > multiple virtual NIC cards makes sense.
>> >> >> > And I wonder how this step, in isolation, will affect e.g.
>> >> >> > multiqueue workloads.
>> >> >> > But I guess if the numbers are convincing, this can be mergeable.
>> >> >>
>> >> >> Even if you have a single multiqueue device this change allows
>> >> >> to create one vhost thread for all the queues, one vhost thread per
>> >> >> queue or any other combination. I guess that depending on the workload
>> >> >> and depending on the system utilization (free cycles/cores, density)
>> >> >> you would prefer
>> >> >> to use one or more vhost threads.
>> >> >
>> >> > That is already controllable from the guest though, which likely has a better
>> >> > idea about the workload.
>> >>
>> >> but the guest has no idea about what's going on in the host system
>> >> (e.g. other VMs I/O, cpu utilization of the host cores...)
>> >
>> > But again, you want to do things per VM now so you will have no idea
>> > about other VMs, right? Host cpu utilization could be a useful input
>>
>> Razya shared a roadmap. The first step was to support sharing a thread
>> for a single VM but the goal is to later on extend the mechanism to
>> support multiple VMs and cgroups
>
> Yes, I got that. What I'm not sure of is whether this is just a
> development roadmap, or do you expect to be able to merge things
> upstream in this order as well.
> If the latter, all I'm saying is that I think you are doing this in the wrong
> order: we'll likely have to merge first 4 then 3 then possibly 1+2 together -
> but maybe 1+2 will have to wait until cgroups are sorted out.
> That's just a hunch of course until you actually try to do it.
> If the former, most of my comments don't really apply.
>
>> > for some heuristics, I agree, but nothing prevents us from sending
>> > this info to guest agent and controlling multiqueue based on that
>> > (kind of like balloon).
>>
>> IMHO, we should never share host internal information (e.g. resource
>> utilization) with the guest. That's supposed to be confidential
>> information :)
>> The balloon is a bit different... kvm asks the guest OS to give  (if
>> possible) some pages but  kvm never sends to the balloon information
>> about the memory utilization of the host. If the guest wishes to send
>> information about its own memory consumption (like it does for MOM),
>> that's OK.
>>
>> So, the guest can share information with the host but the host should
>> be the one to make the decisions. KVM should never share host
>> information with the guest.
>
> It's also easy to just tell guest agent to turn multiqueue on/off if you have
> a mind to.
>
>
>> >
>> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> 2. Creation of vhost threads
>> >> >> >>
>> >> >> >> We suggested two ways of controlling the creation and removal of vhost
>> >> >> >> threads:
>> >> >> >> - statically determining the maximum number of virtio devices per worker
>> >> >> >> via a kernel module parameter
>> >> >> >> - dynamically: Sysfs mechanism to add and remove vhost threads
>> >> >> >>
>> >> >> >> It seems that it would be simplest to take the static approach as
>> >> >> >> a first stage. At a second stage (next patch), we'll advance to
>> >> >> >> dynamically
>> >> >> >> changing the number of vhost threads, using the static module parameter
>> >> >> >> only as a default value.
>> >> >> >
>> >> >> > I'm not sure how independent this is from 1.
>> >> >> > With respect to the upstream kernel,
>> >> >> > Introducing interfaces (which we'll have to maintain
>> >> >> > forever) just for the sake of guests with
>> >> >> > multiple virtual NIC cards does not look like a good tradeoff.
>> >> >> >
>> >> >> > So I'm unlikely to merge this upstream without making it useful cross-VM,
>> >> >> > and yes this means isolation and accounting with cgroups need to
>> >> >> > work properly.
>> >> >>
>> >> >> Agree, but even if you use a single multiqueue device having the
>> >> >> ability to use 1 thread to serve all the queues or multiple threads to
>> >> >> serve all the queues looks like a useful feature.
>> >> >
>> >> > Could be.  At the moment, multiqueue is off by default because it causes
>> >> > regressions for some workloads as compared to a single queue.
>> >> > If we have heuristics in vhost that fix this by auto-tuning threading, that
>> >> > would be nice.  But if you need to tune it manually anyway,
>> >> > then from upstream perspective it does not seem to be worth it - you can just
>> >> > turn multiqueue on/off in the guest.
>> >>
>> >> I see. But we are mixing again between the policy and the mechanism.
>> >> We first need a mechanism to control the system and then we need to
>> >> implement the policy to orchestrate it (whenever it will be
>> >> implemented in the kernel as part of vhost or outside in user-space).
>> >> I don't see why to wait to have a policy to upstream the mechanism. If
>> >> we upstream the mechanism in a manner that the defaults do not affect
>> >> today's vhost behavior, then it will be possible to play with the
>> >> policies and upstream them later.
>> >
>> > Well it all hinges on whether it's in userspace actually.
>> > Interfaces to userspace must be maintained forever, there's no way to know
>> > they are right and make sense before they are used,  so we must not merge
>> > the interface before there is userspace using it correctly.
>> >
>> > If it's in-kernel, we can merge a bit of unused code into kernel
>> > just to make it easier for you to make progress, provided it's
>> > well isolated and doesn't make life much harder for others
>> > working on same code-base.
>>
>> I see your concerns. We can also expose the mechanism as exported
>> symbols and implement the policy in a different kernel module.
>
> Just in case I didn't make this clear previously - my gut feeling is
> that the split between mechanism and policy is in the wrong place
> here.  Policy is what to do, mechanism is how to do it.
> What is the policy in aid of? What is the "what to do" here? "process packets
> more efficiently" right?  Whether to poll and which threads process which
> packets sounds more like how to do it - a mechanism.
>
>> Is that
>> OK ? or you are also concerned about  exporting symbols to other
>> kernel modules because they may change as well ?
>
> Sorry, I do not understand what is suggested here.

Export a set of control functions from the vhost kernel module so that a
new kernel module can call these functions to orchestrate/manage
vhost.
In other words, keep the control interface internal to the kernel so that
other kernel modules can use it to implement the policy.
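
To make the suggestion concrete, here is a minimal sketch of the split,
written as a small Python model rather than kernel code. All names
(VhostControl, add_worker, remove_worker, assign_device, static_policy) are
hypothetical and are not existing vhost symbols; in the kernel the control
functions would be exported symbols and the policy would live in a separate
module. The policy shown is the static "maximum number of devices per worker"
approach from the roadmap.

```python
class VhostControl:
    """Model of the mechanism: the control functions vhost would export.

    Hypothetical names for illustration only; not an existing vhost API.
    """

    def __init__(self):
        self.workers = {}   # worker id -> set of device names it serves
        self._next_id = 0

    def add_worker(self):
        wid = self._next_id
        self._next_id += 1
        self.workers[wid] = set()
        return wid

    def remove_worker(self, wid):
        if self.workers[wid]:
            raise ValueError("worker still serves devices")
        del self.workers[wid]

    def assign_device(self, device, wid):
        # A device is served by exactly one worker at a time.
        for devices in self.workers.values():
            devices.discard(device)
        self.workers[wid].add(device)


def static_policy(ctl, devices, max_devices_per_worker):
    """Model of one policy: pack devices onto workers up to a fixed limit."""
    if max_devices_per_worker < 1:
        raise ValueError("max_devices_per_worker must be at least 1")
    wid = None
    for i, dev in enumerate(devices):
        if i % max_devices_per_worker == 0:
            wid = ctl.add_worker()
        ctl.assign_device(dev, wid)
    return ctl.workers
```

A different policy (for example a dynamic one) would call the same three
control functions, which is the point of keeping the two apart.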

>
>> >
>> >> >
>> >> >
>> >> >> >
>> >> >> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
>> >> >> >> further.
>> >> >> >> At the moment we prefer not to change the vhost model to use cwmq, as some
>> >> >> >> of the issues that were discussed, such as cgroups, are not supported by
>> >> >> >> cwmq, and this is adding more complexity.
>> >> >> >> However, we'll look further into it, and consider it at a later stage.
>> >> >> >
>> >> >> > Hmm that's still assuming some smart management tool configuring
>> >> >> > this correctly.  Can't this be determined automatically depending
>> >> >> > on the workload?
>> >> >> > This is what the cwmq suggestion was really about: detect
>> >> >> > that we need more threads and spawn them.
>> >> >> > It's less about sharing the implementation with workqueues -
>> >> >> > would be very nice but not a must.
>> >> >>
>> >> >> But how can cwmq take cgroup accounting into account?
>> >> >
>> >> > I think cwmq is just a replacement for our own thread pool.
>> >> > It doesn't make cgroup accounting easier or harder.
>> >>
>> >> cwmq doesn't solve the cgroup issue, that's the problem. IMHO, it will
>> >> be much simpler to consider cgroups with our own threading model.
>> >
>> > Bandan here arrived at the same conclusion.
>>
>> Same conclusion means no cwmq  ?
>
> Yes.
>
>> >
>> >> >
>> >> >> In any case, IMHO, the kernel should first provide the "mechanism" so
>> >> >> later on a user-space management application (the "policy") can
>> >> >> orchestrate it.
>> >> >
>> >> > I think policy would be something coarse-grained, like setting priority.
>> >> > Making detailed scheduling decisions in userspace seems wrong somehow:
>> >> > what does management application know that kernel doesn't?
>> >>
>> >> The code of the management application can be
>> >> changed/modified/customized... also the management application can
>> >> consider additional inputs (like per VM SLAs) that are not available
>> >> to the kernel.
>> >
>> > An alternative is to express these in terms of cgroups.
>>
>> True but there is a semantic gap between cgroups (low level system
>> metrics) and SLAs (high level application performance metrics). I
>> don't believe everything can be expressed as cgroups...
>
> Not existing cgroups but if there's another resource
> we need to partition it can be expressed as another kind of a cgroup
> I think.
>
>> >
>> >> Think about MOM (memory over-commit manager) but for vhost (vhost I/O
>> >> manager ?VHIOM ?)
>> >
>> > I worry that CPU and network load is much more dynamic than
>> > memory load, and that in real life userspace simply won't be able to
>> > react to changes fast enough.
>> > Also that this will interact with scheduler in strange ways, e.g.
>> > manager wants to free up the CPU and moves some vhost threads
>> > off it, scheduler sees there's free CPU and moves some more jobs there.
>>
>> I believe we can make very good decisions in user-space but obviously
>> this requires some research and experimentation. I think Eyal Moscovici
>> is working on this.
>>
>> > After all it might be the best we can do for over-committed systems,
>> > but it seems that it will have to be inter-VM to be really useful.
>>
>> Agree 100%. The goal is to have inter-VM support. Single VM is the
>> first step to implement inter-VM.
>
> So I have my doubts about this being mergeable at the single VM stage.
> But we'll see.
>
>
>> >
>> >
>> > Overall one of the main points of vhost is a single interface
>> > that lets us access all kinds of kernel APIs in a uniform way:
>> > we tell kernel what we want it to do, and let it get about
>> > executing that in the most efficient manner possible.
>> > If you insist on managing everything from userspace you lose
>> > some of these advantages.
>> > That's why I keep suggesting we try to give scheduler more hints
>> > about our expected behaviour.
>> >
>> >> >
>> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >> 3. Adding polling mode to vhost
>> >> >> >>
>> >> >> >> It is a good idea making polling adaptive based on various factors such as
>> >> >> >> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
>> >> >> >> or the amount of wasted cycles (cycles we kept polling but no new work was
>> >> >> >> added).
>> >> >> >> However, as a beginning polling patch, we would prefer having a naive
>> >> >> >> polling approach, which could be tuned with later patches.
>> >> >> >>
>> >> >> >
>> >> >> > While any polling approach would still need a lot of testing to prove we
>> >> >> > don't for example steal CPU from guest which could be doing other useful
>> >> >> > work, given that an exit is at least 1.5K cycles at least in theory it
>> >> >> > seems like something that can improve performance.  I'm not sure how
>> >> >> > naive we can be without introducing regressions  for some workloads.
>> >> >> > For example, if we are on the same host CPU, there's no
>> >> >> > chance busy waiting will help us make progress.
>> >> >> > How about detecting that the VCPU thread that kicked us
>> >> >> > is currently running on another CPU, and only polling in
>> >> >> > this case?
>> >> >> >
>> >> >> >> 4. vhost statistics
>> >> >> >>
>> >> >> >> The issue that was raised for the vhost statistics was using ftrace
>> >> >> >> instead of the debugfs mechanism.
>> >> >> >> However, looking further into the kvm stat mechanism, we learned that
>> >> >> >> ftrace didn't replace the plain debugfs mechanism, but was used in
>> >> >> >> addition to it.
>> >> >> >>
>> >> >> >> We propose to continue using debugfs for statistics, in a manner similar
>> >> >> >> to kvm,
>> >> >> >> and at some point in the future ftrace can be added to vhost as well.
>> >> >> >
>> >> >> > IMHO while kvm stat is a useful script, the best tool
>> >> >> > for perf stats is still perf. So I would try to integrate with that.
>> >> >> > How it works internally is IMHO less important.
>> >> >> >
>> >> >> >> Does this plan look o.k.?
>> >> >> >> If there are no further comments, I'll start preparing the patches
>> >> >> >> according to what we've agreed on thus far.
>> >> >> >> Thank you,
>> >> >> >> Razya
>> >> >> >
>> >> >> > I think a good place to try to start merging upstream would be 3 and 4.
>> >> >> > So if you want to make it easier to merge things upstream, try to keep 3
>> >> >> > and 4 independent from 1 and 2.
>> >> >>
>> >> >> Note -1- and -3- are strongly related. If you have a thread that
>> >> >> serves multiple queues (whether they belong to the same device/vm or
>> >> >> not) then this thread will be polling multiple queues at the same
>> >> >> time. This increases the chances you will find pending work to do in
>> >> >> some queue. In other words, you reduce the cycles wasted for polling.
>> >> >> On the other hand, if you run multiple threads and these threads do
>> >> >> polling simultaneously then the threads may starve each other and
>> >> >> reduce performance (if they are scheduled to run in the same core). In
>> >> >> addition, a shared thread can decide when it should stop processing a
>> >> >> given queue and switch to other queue because by polling the thread
>> >> >> knows when new requests were added to a queue (this is what we called
>> >> >> fine-grained I/O scheduled heuristics)
>> >> >>
>> >> >> So, seems like polling makes more sense when you serve multiple queues
>> >> >> with the same thread.
>> >> >>
>> >> >> Abel.
>> >> >
>> >> > A combination might bring gains in more workloads, but it should work on its
>> >> > own too. It's quite possible that only a single VM is active, others are
>> >> > idle. So either polling should handle that well or be smart enough to turn
>> >> > itself off in this case.
>> >>
>> >> Agree, but we are again discussing the policy. We first need
>> >> the mechanism :)
>> >>
>> >> Abel.
>> >
>> > So you want to control this from a management application as well?
>> > That's a very low level detail to expose to management.
>> >
>> > Why do we ever want to poll? It burns CPU uselessly.
>> > There are several reasons for polling to be good:
>> >
>> >         1. scheduler overhead is too large, we are thrashing
>> >           between multiple threads running on the same CPU.
>> >           This is the one you are thinking of I guess: it can
>> >           be controlled from the management application.
>> >           So here some management interface might make some sense.
>> >           But to know how long to poll you would need to know
>> >           what the scheduler overhead is, then poll for
>> >           a fraction of that.
>> >
>> >         2. KVM exit overhead is too large, we are wasting cycles
>> >           on exit/reentry.
>> >           Userspace doesn't easily know about exits though.
>> >           Again, we'd need to measure how long does an exit take,
>> >           then poll for a fraction of that.
>> >           This one is easier to measure from guest, so maybe
>> >           we should let guest control this.
>> >
>> >         3. If we halt the CPU it might take too long to wake up.
>> >           If we expect to be woken up soon we'd be better off
>> >           busy waiting.
>> >           This looks like it would be better addressed by an in-kernel API
>> >           that says "I expect to be woken up in X microseconds".
>> >           This way if some other event occurs instead, it will be
>> >           handled instead of waiting for us to finish polling.
>> >
>> > As you see there are many factors in play here.
>> > That's why we can't just add userspace interface "poll for X cycles":
>> > to support item 1.  maintaining this interface would block
>> > support for items 2 and 3.
>>
>> It doesn't need to be pure user-space. Some parts (like "when" to do
>> polling) can be controlled in kernel but other parts (like number of
>> threads) can be controlled by user-space.  Also some of the polling
>> settings  (completely disable polling, min/max polling limitations...)
>> can be controlled  by user-space
>
> So above concern would apply to min polling as well.
> max polling could be expressed as some kind of latency/responsiveness
> requirement and in that case it might start to make some sense.
>
>> >
>> >
>> >> >
>> >> >> >
>> >> >> > Thanks again,
>> >> >> >
>> >> >> > --
>> >> >> > MST


* Re: Updated Elvis Upstreaming Roadmap
  2013-12-19 14:19               ` Abel Gordon
@ 2013-12-19 14:48                 ` Michael S. Tsirkin
  2013-12-24 12:50                   ` Razya Ladelsky
  0 siblings, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2013-12-19 14:48 UTC (permalink / raw)
  To: Abel Gordon
  Cc: Razya Ladelsky, Anthony Liguori, asias, Eric Northup,
	Eran Raichstein, gleb, jasowang, Joel Nider, kvm, kvm-owner,
	pbonzini, Stefan Hajnoczi, Yossi Kuperman1, Eyal Moscovici, bsd

On Thu, Dec 19, 2013 at 04:19:47PM +0200, Abel Gordon wrote:
> On Thu, Dec 19, 2013 at 3:48 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Thu, Dec 19, 2013 at 02:56:10PM +0200, Abel Gordon wrote:
> >> On Thu, Dec 19, 2013 at 1:37 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Thu, Dec 19, 2013 at 12:36:30PM +0200, Abel Gordon wrote:
> >> >> On Thu, Dec 19, 2013 at 12:13 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> > On Thu, Dec 19, 2013 at 08:40:44AM +0200, Abel Gordon wrote:
> >> >> >> On Wed, Dec 18, 2013 at 12:43 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> >> > On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> >> >> >> >> Hi,
> >> >> >> >>
> >> >> >> >> Thank you all for your comments.
> >> >> >> >> I'm sorry for taking this long to reply, I was away on vacation..
> >> >> >> >>
> >> >> >> >> It was a good, long discussion, many issues were raised, which we'd like
> >> >> >> >> to address with the following proposed roadmap for Elvis patches.
> >> >> >> >> In general, we believe it would be best to start with patches that are
> >> >> >> >> as simple as possible, providing the basic Elvis functionality,
> >> >> >> >> and attend to the more complicated issues in subsequent patches.
> >> >> >> >>
> >> >> >> >> Here's the road map for Elvis patches:
> >> >> >> >
> >> >> >> > Thanks for the follow up. Some suggestions below.
> >> >> >> > Please note the suggestions below merely represent
> >> >> >> > thoughts on merging upstream.
> >> >> >> > If as the first step you are content with keeping this
> >> >> >> > work as out of tree patches, in order to have
> >> >> >> > the freedom to experiment with interfaces and
> >> >> >> > performance, please feel free to ignore them.
> >> >> >> >
> >> >> >> >> 1. Shared vhost thread for multiple devices.
> >> >> >> >>
> >> >> >> >> The way to go here, we believe, is to start with a patch having a shared
> >> >> >> >> vhost thread for multiple devices of the SAME vm.
> >> >> >> >> The next step/patch may be handling vms belonging to the same cgroup.
> >> >> >> >>
> >> >> >> >> Finally, we need to extend the functionality so that the shared vhost
> >> >> >> >> thread
> >> >> >> >> serves multiple vms (not necessarily belonging to the same cgroup).
> >> >> >> >>
> >> >> >> >> There was a lot of discussion about the way to address the enforcement
> >> >> >> >> of cgroup policies, and we will consider the various solutions with a
> >> >> >> >> future
> >> >> >> >> patch.
> >> >> >> >
> >> >> >> > With respect to the upstream kernel,
> >> >> >> > I'm not sure a bunch of changes just for the sake of guests with
> >> >> >> > multiple virtual NIC cards makes sense.
> >> >> >> > And I wonder how this step, in isolation, will affect e.g.
> >> >> >> > multiqueue workloads.
> >> >> >> > But I guess if the numbers are convincing, this can be mergeable.
> >> >> >>
> >> >> >> Even if you have a single multiqueue device this change allows
> >> >> >> to create one vhost thread for all the queues, one vhost thread per
> >> >> >> queue or any other combination. I guess that depending on the workload
> >> >> >> and depending on the system utilization (free cycles/cores, density)
> >> >> >> you would prefer
> >> >> >> to use one or more vhost threads.
> >> >> >
> >> >> > That is already controllable from the guest though, which likely has a better
> >> >> > idea about the workload.
> >> >>
> >> >> but the guest has no idea about what's going on in the host system
> >> >> (e.g. other VMs I/O, cpu utilization of the host cores...)
> >> >
> >> > But again, you want to do things per VM now so you will have no idea
> >> > about other VMs, right? Host cpu utilization could be a useful input
> >>
> >> Razya shared a roadmap. The first step was to support sharing a thread
> >> for a single VM but the goal is to later on extend the mechanism to
> >> support multiple VMs and cgroups
> >
> > Yes, I got that. What I'm not sure of is whether this is just a
> > development roadmap, or do you expect to be able to merge things
> > upstream in this order as well.
> > If the latter, all I'm saying is that I think you are doing this in the wrong
> > order: we'll likely have to merge first 4 then 3 then possibly 1+2 together -
> > but maybe 1+2 will have to wait until cgroups are sorted out.
> > That's just a hunch of course until you actually try to do it.
> > If the former, most of my comments don't really apply.
> >
> >> > for some heuristics, I agree, but nothing prevents us from sending
> >> > this info to guest agent and controlling multiqueue based on that
> >> > (kind of like balloon).
> >>
> >> IMHO, we should never share host internal information (e.g. resource
> >> utilization) with the guest. That's supposed to be confidential
> >> information :)
> >> The balloon is a bit different... kvm asks the guest OS to give  (if
> >> possible) some pages but  kvm never sends to the balloon information
> >> about the memory utilization of the host. If the guest wishes to send
> >> information about its own memory consumption (like it does for MOM),
> >> that's OK.
> >>
> >> So, the guest can share information with the host but the host should
> >> be the one to make the decisions. KVM should never share host
> >> information with the guest.
> >
> > It's also easy to just tell guest agent to turn multiqueue on/off if you have
> > a mind to.
> >
> >
> >> >
> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >> 2. Creation of vhost threads
> >> >> >> >>
> >> >> >> >> We suggested two ways of controlling the creation and removal of vhost
> >> >> >> >> threads:
> >> >> >> >> - statically determining the maximum number of virtio devices per worker
> >> >> >> >> via a kernel module parameter
> >> >> >> >> - dynamically: Sysfs mechanism to add and remove vhost threads
> >> >> >> >>
> >> >> >> >> It seems that it would be simplest to take the static approach as
> >> >> >> >> a first stage. At a second stage (next patch), we'll advance to
> >> >> >> >> dynamically
> >> >> >> >> changing the number of vhost threads, using the static module parameter
> >> >> >> >> only as a default value.
> >> >> >> >
> >> >> >> > I'm not sure how independent this is from 1.
> >> >> >> > With respect to the upstream kernel,
> >> >> >> > Introducing interfaces (which we'll have to maintain
> >> >> >> > forever) just for the sake of guests with
> >> >> >> > multiple virtual NIC cards does not look like a good tradeoff.
> >> >> >> >
> >> >> >> > So I'm unlikely to merge this upstream without making it useful cross-VM,
> >> >> >> > and yes this means isolation and accounting with cgroups need to
> >> >> >> > work properly.
> >> >> >>
> >> >> >> Agree, but even if you use a single multiqueue device having the
> >> >> >> ability to use 1 thread to serve all the queues or multiple threads to
> >> >> >> serve all the queues looks like a useful feature.
> >> >> >
> >> >> > Could be.  At the moment, multiqueue is off by default because it causes
> >> >> > regressions for some workloads as compared to a single queue.
> >> >> > If we have heuristics in vhost that fix this by auto-tuning threading, that
> >> >> > would be nice.  But if you need to tune it manually anyway,
> >> >> > then from upstream perspective it does not seem to be worth it - you can just
> >> >> > turn multiqueue on/off in the guest.
> >> >>
> >> >> I see. But we are mixing again between the policy and the mechanism.
> >> >> We first need a mechanism to control the system and then we need to
> >> >> implement the policy to orchestrate it (whether it is
> >> >> implemented in the kernel as part of vhost or outside in user-space).
> >> >> I don't see why to wait to have a policy to upstream the mechanism. If
> >> >> we upstream the mechanism in a manner that the defaults do not affect
> >> >> today's vhost behavior, then it will be possible to play with the
> >> >> policies and upstream them later.
> >> >
> >> > Well it all hinges on whether it's in userspace actually.
> >> > Interfaces to userspace must be maintained forever, there's no way to know
> >> > they are right and make sense before they are used,  so we must not merge
> >> > the interface before there is userspace using it correctly.
> >> >
> >> > If it's in-kernel, we can merge a bit of unused code into kernel
> >> > just to make it easier for you to make progress, provided it's
> >> > well isolated and doesn't make life much harder for others
> >> > working on same code-base.
> >>
> >> I see your concerns. We can also expose the mechanism as exported
> >> symbols and implement the policy in a different kernel module.
> >
> > Just in case I didn't make this clear previously - my gut feeling is
> > that the split between mechanism and policy is in the wrong place
> > here.  Policy is what to do, mechanism is how to do it.
> > What is the policy in aid of? What is the "what to do" here? "process packets
> > more efficiently" right?  Whether to poll and which threads process which
> > packets sounds more like how to do it - a mechanism.
> >
> >> Is that
> >> OK ? or you are also concerned about  exporting symbols to other
> >> kernel modules because they may change as well ?
> >
> > Sorry, I do not understand what is suggested here.
> 
> Export a set of control functions from the vhost kernel module so that a
> new kernel module can call these functions to orchestrate/manage
> vhost.
> In other words, keep the control interface internal to the kernel so that
> other kernel modules can use it to implement the policy.

It seems too early for this.  It would be useful when we see many
different heuristics to the point where we want to be able to unload
unused ones to save some kernel memory.

This does not get around the fact that kernel/userspace interfaces
need to be designed carefully as they need to be maintained forever.

> >
> >> >
> >> >> >
> >> >> >
> >> >> >> >
> >> >> >> >> Regarding cwmq, it is an interesting mechanism, which we need to explore
> >> >> >> >> further.
> >> >> >> >> At the moment we prefer not to change the vhost model to use cwmq, as some
> >> >> >> >> of the issues that were discussed, such as cgroups, are not supported by
> >> >> >> >> cwmq, and this is adding more complexity.
> >> >> >> >> However, we'll look further into it, and consider it at a later stage.
> >> >> >> >
> >> >> >> > Hmm that's still assuming some smart management tool configuring
> >> >> >> > this correctly.  Can't this be determined automatically depending
> >> >> >> > on the workload?
> >> >> >> > This is what the cwmq suggestion was really about: detect
> >> >> >> > that we need more threads and spawn them.
> >> >> >> > It's less about sharing the implementation with workqueues -
> >> >> >> > would be very nice but not a must.
> >> >> >>
> >> >> >> But how can cwmq take cgroup accounting into account?
> >> >> >
> >> >> > I think cwmq is just a replacement for our own thread pool.
> >> >> > It doesn't make cgroup accounting easier or harder.
> >> >>
> >> >> cwmq doesn't solve the cgroup issue, that's the problem. IMHO, it will
> >> >> be much simpler to consider cgroups with our own threading model.
> >> >
> >> > Bandan here arrived at the same conclusion.
> >>
> >> Same conclusion means no cwmq  ?
> >
> > Yes.
> >
> >> >
> >> >> >
> >> >> >> In any case, IMHO, the kernel should first provide the "mechanism" so
> >> >> >> later on a user-space management application (the "policy") can
> >> >> >> orchestrate it.
> >> >> >
> >> >> > I think policy would be something coarse-grained, like setting priority.
> >> >> > Making detailed scheduling decisions in userspace seems wrong somehow:
> >> >> > what does management application know that kernel doesn't?
> >> >>
> >> >> The code of the management application can be
> >> >> changed/modified/customized... also the management application can
> >> >> consider additional inputs (like per VM SLAs) that are not available
> >> >> to the kernel.
> >> >
> >> > An alternative is to express these in terms of cgroups.
> >>
> >> True but there is a semantic gap between cgroups (low level system
> >> metrics) and SLAs (high level application performance metrics). I
> >> don't believe everything can be expressed as cgroups...
> >
> > Not existing cgroups but if there's another resource
> > we need to partition it can be expressed as another kind of a cgroup
> > I think.
> >
> >> >
> >> >> Think about MOM (memory over-commit manager) but for vhost (vhost I/O
> >> >> manager ?VHIOM ?)
> >> >
> >> > I worry that CPU and network load is much more dynamic than
> >> > memory load, and that in real life userspace simply won't be able to
> >> > react to changes fast enough.
> >> > Also that this will interact with scheduler in strange ways, e.g.
> >> > manager wants to free up the CPU and moves some vhost threads
> >> > off it, scheduler sees there's free CPU and moves some more jobs there.
> >>
> >> I believe we can make very good decisions in user-space but obviously
> >> this requires some research and experimentation. I think Eyal Moscovici
> >> is working on this.
> >>
> >> > After all it might be the best we can do for over-committed systems,
> >> > but it seems that it will have to be inter-VM to be really useful.
> >>
> >> Agree 100%. The goal is to have inter-VM support. Single VM is the
> >> first step to implement inter-VM.
> >
> > So I have my doubts about this being mergeable at the single VM stage.
> > But we'll see.
> >
> >
> >> >
> >> >
> >> > Overall one of the main points of vhost is a single interface
> >> > that lets us access all kinds of kernel APIs in a uniform way:
> >> > we tell kernel what we want it to do, and let it get about
> >> > executing that in the most efficient manner possible.
> >> > If you insist on managing everything from userspace you lose
> >> > some of these advantages.
> >> > That's why I keep suggesting we try to give scheduler more hints
> >> > about our expected behaviour.
> >> >
> >> >> >
> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >> 3. Adding polling mode to vhost
> >> >> >> >>
> >> >> >> >> It is a good idea making polling adaptive based on various factors such as
> >> >> >> >> the I/O rate, the guest kick overhead(which is the tradeoff of polling),
> >> >> >> >> or the amount of wasted cycles (cycles we kept polling but no new work was
> >> >> >> >> added).
> >> >> >> >> However, as a beginning polling patch, we would prefer having a naive
> >> >> >> >> polling approach, which could be tuned with later patches.
> >> >> >> >>
> >> >> >> >
> >> >> >> > While any polling approach would still need a lot of testing to prove we
> >> >> >> > don't for example steal CPU from guest which could be doing other useful
> >> >> >> > work, given that an exit is at least 1.5K cycles at least in theory it
> >> >> >> > seems like something that can improve performance.  I'm not sure how
> >> >> >> > naive we can be without introducing regressions  for some workloads.
> >> >> >> > For example, if we are on the same host CPU, there's no
> >> >> >> > chance busy waiting will help us make progress.
> >> >> >> > How about detecting that the VCPU thread that kicked us
> >> >> >> > is currently running on another CPU, and only polling in
> >> >> >> > this case?
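
A minimal sketch of the check suggested above, as a plain Python predicate
rather than kernel code. The parameter names are hypothetical, and the
vcpu_running input is an added assumption (a VCPU that is scheduled out
cannot add new work, so polling for it is also pointless).

```python
def should_poll(vhost_cpu, vcpu_cpu, vcpu_running):
    """Decide whether busy-polling can help at all.

    vhost_cpu:    host CPU the vhost thread runs on
    vcpu_cpu:     host CPU the kicking VCPU thread was last seen on
    vcpu_running: True if that VCPU thread is currently running
    """
    # On the same host CPU the VCPU cannot make progress while we spin,
    # so polling only pays off when the VCPU runs somewhere else.
    return vcpu_running and vcpu_cpu != vhost_cpu
```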
> >> >> >> >
> >> >> >> >> 4. vhost statistics
> >> >> >> >>
> >> >> >> >> The issue that was raised for the vhost statistics was using ftrace
> >> >> >> >> instead of the debugfs mechanism.
> >> >> >> >> However, looking further into the kvm stat mechanism, we learned that
> >> >> >> >> ftrace didn't replace the plain debugfs mechanism, but was used in
> >> >> >> >> addition to it.
> >> >> >> >>
> >> >> >> >> We propose to continue using debugfs for statistics, in a manner similar
> >> >> >> >> to kvm,
> >> >> >> >> and at some point in the future ftrace can be added to vhost as well.
> >> >> >> >
> >> >> >> > IMHO while kvm stat is a useful script, the best tool
> >> >> >> > for perf stats is still perf. So I would try to integrate with that.
> >> >> >> > How it works internally is IMHO less important.
> >> >> >> >
> >> >> >> >> Does this plan look o.k.?
> >> >> >> >> If there are no further comments, I'll start preparing the patches
> >> >> >> >> according to what we've agreed on thus far.
> >> >> >> >> Thank you,
> >> >> >> >> Razya
> >> >> >> >
> >> >> >> > I think a good place to try to start merging upstream would be 3 and 4.
> >> >> >> > So if you want to make it easier to merge things upstream, try to keep 3
> >> >> >> > and 4 independent from 1 and 2.
> >> >> >>
> >> >> >> Note -1- and -3- are strongly related. If you have a thread that
> >> >> >> serves multiple queues (whether they belong to the same device/vm or
> >> >> >> not) then this thread will be polling multiple queues at the same
> >> >> >> time. This increases the chances you will find pending work to do in
> >> >> >> some queue. In other words, you reduce the cycles wasted for polling.
> >> >> >> On the other hand, if you run multiple threads and these threads do
> >> >> >> polling simultaneously then the threads may starve each other and
> >> >> >> reduce performance (if they are scheduled to run in the same core). In
> >> >> >> addition, a shared thread can decide when it should stop processing a
> >> >> >> given queue and switch to other queue because by polling the thread
> >> >> >> knows when new requests were added to a queue (this is what we called
> >> >> >> fine-grained I/O scheduled heuristics)
> >> >> >>
> >> >> >> So, seems like polling makes more sense when you serve multiple queues
> >> >> >> with the same thread.
> >> >> >>
> >> >> >> Abel.
> >> >> >
> >> >> > A combination might bring gains in more workloads, but it should work on its
> >> >> > own too. It's quite possible that only a single VM is active, others are
> >> >> > idle. So either polling should handle that well or be smart enough to turn
> >> >> > itself off in this case.
> >> >>
> >> >> Agree, but we are again discussing the policy. We first need
> >> >> the mechanism :)
> >> >>
> >> >> Abel.
> >> >
> >> > So you want to control this from a management application as well?
> >> > That's a very low level detail to expose to management.
> >> >
> >> > Why do we ever want to poll? It burns CPU uselessly.
> >> > There are several reasons for polling to be good:
> >> >
> >> >         1. scheduler overhead is too large, we are thrashing
> >> >           between multiple threads running on the same CPU.
> >> >           This is the one you are thinking of I guess: it can
> >> >           be controlled from the management application.
> >> >           So here some management interface might make some sense.
> >> >           But to know how long to poll you would need to know
> >> >           what the scheduler overhead is, then poll for
> >> >           a fraction of that.
> >> >
> >> >         2. KVM exit overhead is too large, we are wasting cycles
> >> >           on exit/reentry.
> >> >           Userspace doesn't easily know about exits though.
> >> >           Again, we'd need to measure how long does an exit take,
> >> >           then poll for a fraction of that.
> >> >           This one is easier to measure from guest, so maybe
> >> >           we should let guest control this.
> >> >
> >> >         3. If we halt the CPU it might take too long to wake up.
> >> >           If we expect to be woken up soon we'd be better off
> >> >           busy waiting.
> >> >           This looks like it would be better addressed by an in-kernel API
> >> >           that says "I expect to be woken up in X microseconds".
> >> >           This way if some other event occurs instead, it will be
> >> >           handled instead of waiting for us to finish polling.
> >> >
> >> > As you see there are many factors in play here.
> >> > That's why we can't just add a userspace interface "poll for X cycles"
> >> > to support item 1: maintaining this interface would block
> >> > support for items 2 and 3.
> >>
> >> It doesn't need to be pure user-space. Some parts (like "when" to do
> >> polling) can be controlled in kernel but other parts (like number of
> >> threads) can be controlled by user-space. Also, some of the polling
> >> settings (completely disable polling, min/max polling limitations...)
> >> can be controlled by user-space.
> >
> > So the above concern would apply to min polling as well.
> > Max polling could be expressed as some kind of latency/responsiveness
> > requirement, and in that case it might start to make some sense.
> >
> >> >
> >> >
> >> >> >
> >> >> >> >
> >> >> >> > Thanks again,
> >> >> >> >
> >> >> >> > --
> >> >> >> > MST
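
The "poll for a fraction of that" idea in items 1 and 2 of the quoted list can be sketched in a few lines. This is only an illustration in Python, not vhost code: the names `poll_budget_ns` and `should_keep_polling`, the `fraction` parameter, and its default of 0.5 are all assumptions introduced here. The thread does not say how the overhead would be measured or which fraction is appropriate.

```python
def poll_budget_ns(overhead_ns, fraction=0.5):
    """Return how long to busy-poll, in nanoseconds.

    overhead_ns: measured cost of the alternative to polling, for example a
                 scheduler switch or a KVM exit/reentry (hypothetical input).
    fraction:    share of that cost we accept burning on polling (assumed).
    """
    if overhead_ns < 0:
        raise ValueError("overhead_ns must be non-negative")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return int(overhead_ns * fraction)


def should_keep_polling(elapsed_ns, overhead_ns, fraction=0.5):
    """True while the time already spent polling is below the budget."""
    return elapsed_ns < poll_budget_ns(overhead_ns, fraction)
```

The sketch also shows why a single "poll for X cycles" knob is a poor fit: the right `overhead_ns` differs for each of the three cases, so one fixed value cannot serve all of them.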

* Re: Updated Elvis Upstreaming Roadmap
  2013-12-19 14:48                 ` Michael S. Tsirkin
@ 2013-12-24 12:50                   ` Razya Ladelsky
  0 siblings, 0 replies; 13+ messages in thread
From: Razya Ladelsky @ 2013-12-24 12:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Abel Gordon, Anthony Liguori, asias, bsd, Eric Northup,
	Eran Raichstein, Eyal Moscovici, gleb, jasowang, Joel Nider, kvm,
	kvm-owner, pbonzini, Stefan Hajnoczi, Yossi Kuperman1

Hi,

To summarize the issues raised and the next steps:

1. The shared vhost thread will support multiple VMs while also
supporting cgroups.
As soon as we have a design that supports cgroups with multiple VMs,
we'll share it.

2. Adding a vhost polling mode: this patch can be submitted independently
of (1).
We'll add a condition, checked periodically, that stops polling if the
guest is not running (scheduled out) at that time.

3. Implement good heuristics (policies) in the vhost module for
adding/removing vhost threads. We will not expose an interface to
user-space at this time.

Thank you,
Razya
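
The periodic check described in item 2 of the summary above can be sketched as a bounded polling loop. This is a minimal Python illustration under stated assumptions, not the planned vhost implementation: `poll_virtqueue`, the `has_work` and `guest_is_running` callbacks, and the `max_iterations` and `check_every` parameters are hypothetical names chosen for the example.

```python
def poll_virtqueue(has_work, guest_is_running,
                   max_iterations=1000, check_every=100):
    """Busy-poll until work arrives, the guest is scheduled out, or the
    iteration budget runs out.

    has_work:         callable returning True when the queue has work.
    guest_is_running: callable returning True while the guest is scheduled.
    Returns "work", "guest_idle", or "timeout".
    """
    if check_every <= 0:
        raise ValueError("check_every must be positive")
    for i in range(max_iterations):
        if has_work():
            return "work"
        # Periodic check only: testing the guest state on every iteration
        # would add cost to the hot polling path.
        if i % check_every == 0 and not guest_is_running():
            return "guest_idle"
    return "timeout"
```

A caller would fall back to notification-based wakeups whenever the result is "guest_idle" or "timeout", so that an idle guest does not keep a core spinning.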

* Re: Updated Elvis Upstreaming Roadmap
  2013-12-17 10:04 Updated Elvis Upstreaming Roadmap Razya Ladelsky
  2013-12-18 10:43 ` Michael S. Tsirkin
@ 2013-12-24 16:21 ` Gleb Natapov
  2013-12-25  7:38   ` Razya Ladelsky
  1 sibling, 1 reply; 13+ messages in thread
From: Gleb Natapov @ 2013-12-24 16:21 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: Michael S. Tsirkin, abel.gordon, Anthony Liguori, asias,
	digitaleric, Eran Raichstein, gleb, jasowang, Joel Nider, kvm,
	kvm-owner, pbonzini, Stefan Hajnoczi, Yossi Kuperman1,
	Eyal Moscovici, bsd

On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> 4. vhost statistics 
> 
> The issue that was raised for the vhost statistics was using ftrace 
> instead of the debugfs mechanism.
> However, looking further into the kvm stat mechanism, we learned that 
> ftrace didn't replace the plain debugfs mechanism, but was used in 
> addition to it.
>  
It did. Statistics in debugfs are deprecated. No new statistics are
added there.  kvm_stat is using ftrace now (if available) and of course
ftrace gives seamless integration with perf.

--
			Gleb.

* Re: Updated Elvis Upstreaming Roadmap
  2013-12-24 16:21 ` Gleb Natapov
@ 2013-12-25  7:38   ` Razya Ladelsky
  0 siblings, 0 replies; 13+ messages in thread
From: Razya Ladelsky @ 2013-12-25  7:38 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: abel.gordon, Anthony Liguori, asias, bsd, digitaleric,
	Eran Raichstein, Eyal Moscovici, Gleb Natapov, gleb, jasowang,
	Joel Nider, kvm, kvm-owner, Michael S. Tsirkin, pbonzini,
	Stefan Hajnoczi, Yossi Kuperman1

Gleb Natapov <gleb@minantech.com> wrote on 24/12/2013 06:21:03 PM:

> On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:
> > 4. vhost statistics 
> > 
> > The issue that was raised for the vhost statistics was using ftrace 
> > instead of the debugfs mechanism.
> > However, looking further into the kvm stat mechanism, we learned that 
> > ftrace didn't replace the plain debugfs mechanism, but was used in 
> > addition to it.
> > 
> It did. Statistics in debugfs are deprecated. No new statistics are
> added there.  kvm_stat is using ftrace now (if available) and of course
> ftrace gives seamless integration with perf.
>

OK, I understand.
We'll look further into ftrace to confirm that it fully supports our vhost
statistics requirements.
Thank you,
Razya
 
> --
>          Gleb.
> 


end of thread, other threads:[~2013-12-25  7:38 UTC | newest]

Thread overview: 13+ messages
2013-12-17 10:04 Updated Elvis Upstreaming Roadmap Razya Ladelsky
2013-12-18 10:43 ` Michael S. Tsirkin
2013-12-19  6:40   ` Abel Gordon
2013-12-19 10:13     ` Michael S. Tsirkin
2013-12-19 10:36       ` Abel Gordon
2013-12-19 11:37         ` Michael S. Tsirkin
2013-12-19 12:56           ` Abel Gordon
2013-12-19 13:48             ` Michael S. Tsirkin
2013-12-19 14:19               ` Abel Gordon
2013-12-19 14:48                 ` Michael S. Tsirkin
2013-12-24 12:50                   ` Razya Ladelsky
2013-12-24 16:21 ` Gleb Natapov
2013-12-25  7:38   ` Razya Ladelsky
