kvm.vger.kernel.org archive mirror
* Fw: Benchmarking for vhost polling patch
@ 2015-01-01 12:59 Razya Ladelsky
  2015-01-05 12:35 ` Michael S. Tsirkin
  0 siblings, 1 reply; 7+ messages in thread
From: Razya Ladelsky @ 2015-01-01 12:59 UTC (permalink / raw)
  To: mst
  Cc: Alex Glikson, Eran Raichstein, Yossi Kuperman1, Joel Nider,
	abel.gordon, kvm, Eyal Moscovici, Razya Ladelsky

Hi Michael,
Just a follow-up on the polling patch numbers.
Please let me know whether you find these numbers convincing enough to 
continue with submitting this patch.
Otherwise, we'll submit this patch as part of the larger Elvis patch set 
rather than independently.
Thank you,
Razya 

----- Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM -----

From:   Razya Ladelsky/Haifa/IBM@IBMIL
To:     mst@redhat.com
Cc: 
Date:   25/11/2014 02:43 PM
Subject:        Re: Benchmarking for vhost polling patch
Sent by:        kvm-owner@vger.kernel.org



Hi Michael,

> Hi Razya,
> On the netperf benchmark, it looks like polling=10 gives a modest but
> measurable gain.  So from that perspective it might be worth it if it's
> not too much code, though we'll need to spend more time checking the
> macro effect - we barely moved the needle on the macro benchmark and
> that is suspicious.

I ran memcached with various values for the key and value arguments, and 
saw a bigger impact from polling than with the default values.
Here are the numbers:

key=250     TPS      net    vhost vm   TPS/cpu  TPS/CPU
value=2048           rate   util  util          change

polling=0   101540   103.0  46   100   695.47
polling=5   136747   123.0  83   100   747.25   0.074440609
polling=7   140722   125.7  84   100   764.79   0.099663658
polling=10  141719   126.3  87   100   757.85   0.089688003
polling=15  142430   127.1  90   100   749.63   0.077863015
polling=25  146347   128.7  95   100   750.49   0.079107993
polling=50  150882   131.1  100  100   754.41   0.084733701

Macro benchmarks are less I/O intensive than the micro benchmark, which is 
why we can expect less impact from polling as compared to netperf.
However, as shown above, we managed to get a 10% TPS/CPU improvement with 
the polling patch.
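
(The TPS/CPU change column is the relative gain in TPS/cpu over the 
polling=0 baseline, e.g. for polling=5:

    747.25 / 695.47 - 1 = 0.0744

and the ~10% figure corresponds to the best row, polling=7.)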

> Is there a chance you are actually trading latency for throughput?
> do you observe any effect on latency?

No.

> How about trying some other benchmark, e.g. NFS?
> 

Tried, but it didn't produce enough I/O (vhost was at most at 15% util).

> 
> Also, I am wondering:
> 
> since vhost thread is polling in kernel anyway, shouldn't
> we try and poll the host NIC?
> that would likely reduce at least the latency significantly,
> won't it?
> 

Yes, it could be a great addition at some point, but needs a thorough 
investigation. In any case, not a part of this patch...

Thanks,
Razya

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fw: Benchmarking for vhost polling patch
  2015-01-01 12:59 Fw: Benchmarking for vhost polling patch Razya Ladelsky
@ 2015-01-05 12:35 ` Michael S. Tsirkin
  2015-01-11 12:44   ` Razya Ladelsky
  0 siblings, 1 reply; 7+ messages in thread
From: Michael S. Tsirkin @ 2015-01-05 12:35 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: Alex Glikson, Eran Raichstein, Yossi Kuperman1, Joel Nider,
	abel.gordon, kvm, Eyal Moscovici

Hi Razya,
Thanks for the update.
So that's reasonable I think, and I think it makes sense
to keep working on this in isolation - it's more
manageable at this size.

The big questions in my mind:
- What happens if system is lightly loaded?
  E.g. a ping/pong benchmark. How much extra CPU are
  we wasting?
- We see the best performance on your system is with 10usec worth of polling.
  It's OK to be able to tune it for best performance, but
  most people don't have the time or the inclination.
  So what would be the best value for other CPUs?
- Should this be tunable from userspace per vhost instance?
  Why is it only tunable globally?
- How bad is it if you don't pin vhost and vcpu threads?
  Is the scheduler smart enough to pull them apart?
- What happens in overcommit scenarios? Does polling make things
  much worse?
  Clearly polling will work worse if e.g. vhost and vcpu
  share the host cpu. How can we avoid conflicts?

  For two last questions, better cooperation with host scheduler will
  likely help here.
  See e.g.  http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
  I'm currently looking at pushing something similar upstream,
  if it goes in vhost polling can do something similar.

Any data points to shed light on these questions?

On Thu, Jan 01, 2015 at 02:59:21PM +0200, Razya Ladelsky wrote:
> Hi Michael,
> Just a follow-up on the polling patch numbers.
> Please let me know whether you find these numbers convincing enough to 
> continue with submitting this patch.
> Otherwise, we'll submit this patch as part of the larger Elvis patch set 
> rather than independently.
> Thank you,
> Razya 
> 
> ----- Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM -----
> 
> From:   Razya Ladelsky/Haifa/IBM@IBMIL
> To:     mst@redhat.com
> Cc: 
> Date:   25/11/2014 02:43 PM
> Subject:        Re: Benchmarking for vhost polling patch
> Sent by:        kvm-owner@vger.kernel.org
> 
> 
> 
> Hi Michael,
> 
> > Hi Razya,
> > On the netperf benchmark, it looks like polling=10 gives a modest but
> > measurable gain.  So from that perspective it might be worth it if it's
> > not too much code, though we'll need to spend more time checking the
> > macro effect - we barely moved the needle on the macro benchmark and
> > that is suspicious.
> 
> I ran memcached with various values for the key and value arguments, and 
> saw a bigger impact from polling than with the default values.
> Here are the numbers:
> 
> key=250     TPS      net    vhost vm   TPS/cpu  TPS/CPU
> value=2048           rate   util  util          change
> 
> polling=0   101540   103.0  46   100   695.47
> polling=5   136747   123.0  83   100   747.25   0.074440609
> polling=7   140722   125.7  84   100   764.79   0.099663658
> polling=10  141719   126.3  87   100   757.85   0.089688003
> polling=15  142430   127.1  90   100   749.63   0.077863015
> polling=25  146347   128.7  95   100   750.49   0.079107993
> polling=50  150882   131.1  100  100   754.41   0.084733701
> 
> Macro benchmarks are less I/O intensive than the micro benchmark, which is 
> why we can expect less impact from polling as compared to netperf.
> However, as shown above, we managed to get a 10% TPS/CPU improvement with 
> the polling patch.
> 
> > Is there a chance you are actually trading latency for throughput?
> > do you observe any effect on latency?
> 
> No.
> 
> > How about trying some other benchmark, e.g. NFS?
> > 
> 
> Tried, but it didn't produce enough I/O (vhost was at most at 15% util).

OK but was there a regression in this case?


> > 
> > Also, I am wondering:
> > 
> > since vhost thread is polling in kernel anyway, shouldn't
> > we try and poll the host NIC?
> > that would likely reduce at least the latency significantly,
> > won't it?
> > 
> 
> Yes, it could be a great addition at some point, but needs a thorough 
> investigation. In any case, not a part of this patch...
> 
> Thanks,
> Razya
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fw: Benchmarking for vhost polling patch
  2015-01-05 12:35 ` Michael S. Tsirkin
@ 2015-01-11 12:44   ` Razya Ladelsky
  2015-01-12 10:36     ` Michael S. Tsirkin
  0 siblings, 1 reply; 7+ messages in thread
From: Razya Ladelsky @ 2015-01-11 12:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Glikson, Eran Raichstein, Yossi Kuperman1, Joel Nider,
	abel.gordon, kvm, Eyal Moscovici, Razya Ladelsky

> Hi Razya,
> Thanks for the update.
> So that's reasonable I think, and I think it makes sense
> to keep working on this in isolation - it's more
> manageable at this size.
> 
> The big questions in my mind:
> - What happens if system is lightly loaded?
>   E.g. a ping/pong benchmark. How much extra CPU are
>   we wasting?
> - We see the best performance on your system is with 10usec worth of polling.
>   It's OK to be able to tune it for best performance, but
>   most people don't have the time or the inclination.
>   So what would be the best value for other CPUs?

The extra cpu waste vs. throughput gain depends on the polling timeout 
value (poll_stop_idle).
The best value to choose is dependent on the workload and the system 
hardware and configuration.
There is nothing we can say about this value in advance. The system 
manager/administrator should use this optimization with the awareness 
that polling consumes extra cpu cycles, as documented.
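
For illustration only (this is a rough sketch, not the code from the patch; 
vq_has_work() and handle_vq_work() are made-up placeholders), the mechanism 
is essentially a bounded busy-wait on the virtqueue, where poll_stop_idle 
caps how long the vhost worker keeps polling without finding work:

    /* Rough sketch, not the patch itself. */
    static void vhost_busy_poll(struct vhost_virtqueue *vq)
    {
            cycles_t start = get_cycles();

            /* Poll the ring instead of sleeping and waiting for a kick. */
            while (get_cycles() - start < poll_stop_idle) {
                    if (vq_has_work(vq)) {
                            handle_vq_work(vq);
                            start = get_cycles();  /* restart the idle window */
                    }
                    cpu_relax();
            }
            /* Idle for too long: re-enable guest notifications and sleep. */
    }

The larger poll_stop_idle is, the more requests are caught without an exit, 
but also the more cycles are burned when the queue stays idle.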

> - Should this be tunable from userspace per vhost instance?
>   Why is it only tunable globally?

It should be tunable per vhost thread.
We can do it in a subsequent patch.

> - How bad is it if you don't pin vhost and vcpu threads?
>   Is the scheduler smart enough to pull them apart?
> - What happens in overcommit scenarios? Does polling make things
>   much worse?
>   Clearly polling will work worse if e.g. vhost and vcpu
>   share the host cpu. How can we avoid conflicts?
> 
>   For two last questions, better cooperation with host scheduler will
>   likely help here.
>   See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
>   I'm currently looking at pushing something similar upstream,
>   if it goes in vhost polling can do something similar.
> 
> Any data points to shed light on these questions?

I ran a simple apache benchmark with an overcommit scenario, where both 
the vcpu and vhost share the same core.
In some cases (c>4 in my test cases) polling surprisingly produced better 
throughput.
Therefore, it is hard to predict how polling will impact performance 
in advance.
It is up to whoever is using this optimization to use it wisely.
Thanks,
Razya 



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fw: Benchmarking for vhost polling patch
  2015-01-11 12:44   ` Razya Ladelsky
@ 2015-01-12 10:36     ` Michael S. Tsirkin
  2015-01-14 15:01       ` Razya Ladelsky
  0 siblings, 1 reply; 7+ messages in thread
From: Michael S. Tsirkin @ 2015-01-12 10:36 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: Alex Glikson, Eran Raichstein, Yossi Kuperman1, Joel Nider,
	abel.gordon, kvm, Eyal Moscovici

On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote:
> > Hi Razya,
> > Thanks for the update.
> > So that's reasonable I think, and I think it makes sense
> > to keep working on this in isolation - it's more
> > manageable at this size.
> > 
> > The big questions in my mind:
> > - What happens if system is lightly loaded?
> >   E.g. a ping/pong benchmark. How much extra CPU are
> >   we wasting?
> > - We see the best performance on your system is with 10usec worth of polling.
> >   It's OK to be able to tune it for best performance, but
> >   most people don't have the time or the inclination.
> >   So what would be the best value for other CPUs?
> 
> The extra cpu waste vs. throughput gain depends on the polling timeout 
> value (poll_stop_idle).
> The best value to choose is dependent on the workload and the system 
> hardware and configuration.
> There is nothing we can say about this value in advance. The system 
> manager/administrator should use this optimization with the awareness 
> that polling consumes extra cpu cycles, as documented.
> 
> > - Should this be tunable from userspace per vhost instance?
> >   Why is it only tunable globally?
> 
> It should be tunable per vhost thread.
> We can do it in a subsequent patch.

So I think whether the patchset is appropriate upstream
will depend exactly on coming up with a reasonable
interface for enabling and tuning the functionality.

I was hopeful some reasonable default value can be
derived from e.g. cost of the exit.
If that is not the case, it becomes that much harder
for users to select good default values.

There are some cases where networking stack already
exposes low-level hardware detail to userspace, e.g.
tcp polling configuration. If we can't come up with
a way to abstract hardware, maybe we can at least tie
it to these existing controls rather than introducing
new ones?
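
(To be concrete, the kind of existing controls I have in mind are along the 
lines of the busy-poll knobs -- the net.core.busy_read / net.core.busy_poll 
sysctls and the per-socket option below; shown only for comparison, not as 
a proposal:)

    /* Existing socket-level busy polling (kernel >= 3.11 with
     * CONFIG_NET_RX_BUSY_POLL); the value is in microseconds. */
    #include <sys/socket.h>

    static int enable_busy_poll(int fd, unsigned int usecs)
    {
            return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                              &usecs, sizeof(usecs));
    }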


> > - How bad is it if you don't pin vhost and vcpu threads?
> >   Is the scheduler smart enough to pull them apart?
> > - What happens in overcommit scenarios? Does polling make things
> >   much worse?
> >   Clearly polling will work worse if e.g. vhost and vcpu
> >   share the host cpu. How can we avoid conflicts?
> > 
> >   For two last questions, better cooperation with host scheduler will
> >   likely help here.
> >   See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
> >   I'm currently looking at pushing something similar upstream,
> >   if it goes in vhost polling can do something similar.
> > 
> > Any data points to shed light on these questions?
> 
> I ran a simple apache benchmark with an overcommit scenario, where both 
> the vcpu and vhost share the same core.
> In some cases (c>4 in my test cases) polling surprisingly produced better 
> throughput.

Likely because latency is hurt, so you get better batching?

> Therefore, it is hard to predict how polling will impact performance 
> in advance.

If it's so hard, users will struggle to configure this properly.
Looks like an argument for us developers to do the hard work,
and expose simpler controls to users?

> It is up to whoever is using this optimization to use it wisely.
> Thanks,
> Razya 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fw: Benchmarking for vhost polling patch
  2015-01-12 10:36     ` Michael S. Tsirkin
@ 2015-01-14 15:01       ` Razya Ladelsky
  2015-01-14 15:23         ` Michael S. Tsirkin
  0 siblings, 1 reply; 7+ messages in thread
From: Razya Ladelsky @ 2015-01-14 15:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Eyal Moscovici,
	Joel Nider, kvm, Yossi Kuperman1, Razya Ladelsky

"Michael S. Tsirkin" <mst@redhat.com> wrote on 12/01/2015 12:36:13 PM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: Razya Ladelsky/Haifa/IBM@IBMIL
> Cc: Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, 
> Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, 
> abel.gordon@gmail.com, kvm@vger.kernel.org, Eyal Moscovici/Haifa/IBM@IBMIL
> Date: 12/01/2015 12:36 PM
> Subject: Re: Fw: Benchmarking for vhost polling patch
> 
> On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote:
> > > Hi Razya,
> > > Thanks for the update.
> > > So that's reasonable I think, and I think it makes sense
> > > to keep working on this in isolation - it's more
> > > manageable at this size.
> > > 
> > > The big questions in my mind:
> > > - What happens if system is lightly loaded?
> > >   E.g. a ping/pong benchmark. How much extra CPU are
> > >   we wasting?
> > > - We see the best performance on your system is with 10usec worth of polling.
> > >   It's OK to be able to tune it for best performance, but
> > >   most people don't have the time or the inclination.
> > >   So what would be the best value for other CPUs?
> > 
> > The extra cpu waste vs. throughput gain depends on the polling timeout 
> > value (poll_stop_idle).
> > The best value to choose is dependent on the workload and the system 
> > hardware and configuration.
> > There is nothing we can say about this value in advance. The system 
> > manager/administrator should use this optimization with the awareness 
> > that polling consumes extra cpu cycles, as documented.
> > 
> > > - Should this be tunable from userspace per vhost instance?
> > >   Why is it only tunable globally?
> > 
> > It should be tunable per vhost thread.
> > We can do it in a subsequent patch.
> 
> So I think whether the patchset is appropriate upstream
> will depend exactly on coming up with a reasonable
> interface for enabling and tuning the functionality.
> 

How about adding a new ioctl for each vhost device that 
sets the poll_stop_idle (the timeout)? 
This should be aligned with the QEMU "way" of doing things.
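
Something along these lines (the ioctl name and number below are made up 
for illustration, following the style of the existing vhost ioctls in 
include/uapi/linux/vhost.h):

    /* Hypothetical addition to include/uapi/linux/vhost.h.
     * Set how long (in usecs) a vhost worker may busy-poll an idle
     * virtqueue before going back to sleep; 0 disables polling. */
    #define VHOST_SET_POLL_STOP_IDLE _IOW(VHOST_VIRTIO, 0x50, __u32)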

> I was hopeful some reasonable default value can be
> derived from e.g. cost of the exit.
> If that is not the case, it becomes that much harder
> for users to select good default values.
> 

Our suggestion would be to use the maximum (a large enough) value,
so that vhost is polling 100% of the time.
The polling optimization mainly addresses users who want to maximize their 
performance, even at the expense of wasting cpu cycles. The maximum value 
will produce the biggest impact on performance.
However, using the maximum default value will be valuable even for users 
who care more about the normalized throughput/cpu criteria. Such users, 
interested in finer tuning of the polling timeout, need to look for an 
optimal timeout value for their system. The maximum value serves as the 
upper limit of the range that needs to be searched for such an optimal 
timeout value.


> There are some cases where networking stack already
> exposes low-level hardware detail to userspace, e.g.
> tcp polling configuration. If we can't come up with
> a way to abstract hardware, maybe we can at least tie
> it to these existing controls rather than introducing
> new ones?
> 

We've spent time thinking about the possible interfaces that 
could be appropriate for such an optimization (including tcp polling).
We think that using an ioctl as the interface to "configure" the virtual 
device/vhost, in the same manner that e.g. SET_NET_BACKEND is configured, 
makes a lot of sense and is consistent with the existing mechanism.
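
As a sketch of the userspace side (assuming the hypothetical 
VHOST_SET_POLL_STOP_IDLE ioctl from above), QEMU would program it on the 
open vhost fd much like it configures the backend today:

    #include <sys/ioctl.h>
    #include <linux/types.h>

    /* Sketch only; VHOST_SET_POLL_STOP_IDLE is hypothetical. */
    static int vhost_set_polling(int vhost_fd, __u32 usecs)
    {
            return ioctl(vhost_fd, VHOST_SET_POLL_STOP_IDLE, &usecs);
    }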

Thanks,
Razya



> 
> > > - How bad is it if you don't pin vhost and vcpu threads?
> > >   Is the scheduler smart enough to pull them apart?
> > > - What happens in overcommit scenarios? Does polling make things
> > >   much worse?
> > >   Clearly polling will work worse if e.g. vhost and vcpu
> > >   share the host cpu. How can we avoid conflicts?
> > > 
> > >   For two last questions, better cooperation with host scheduler will
> > >   likely help here.
> > >   See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
> > >   I'm currently looking at pushing something similar upstream,
> > >   if it goes in vhost polling can do something similar.
> > > 
> > > Any data points to shed light on these questions?
> > 
> > I ran a simple apache benchmark with an overcommit scenario, where both 
> > the vcpu and vhost share the same core.
> > In some cases (c>4 in my test cases) polling surprisingly produced better 
> > throughput.
> 
> Likely because latency is hurt, so you get better batching?
> 
> > Therefore, it is hard to predict how polling will impact performance 
> > in advance.
> 
> If it's so hard, users will struggle to configure this properly.
> Looks like an argument for us developers to do the hard work,
> and expose simpler controls to users?
> 
> > It is up to whoever is using this optimization to use it wisely.
> > Thanks,
> > Razya 
> > 
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fw: Benchmarking for vhost polling patch
  2015-01-14 15:01       ` Razya Ladelsky
@ 2015-01-14 15:23         ` Michael S. Tsirkin
  2015-01-18  7:40           ` Razya Ladelsky
  0 siblings, 1 reply; 7+ messages in thread
From: Michael S. Tsirkin @ 2015-01-14 15:23 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Eyal Moscovici,
	Joel Nider, kvm, Yossi Kuperman1

On Wed, Jan 14, 2015 at 05:01:05PM +0200, Razya Ladelsky wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 12/01/2015 12:36:13 PM:
> 
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > To: Razya Ladelsky/Haifa/IBM@IBMIL
> > Cc: Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, 
> > Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, 
> > abel.gordon@gmail.com, kvm@vger.kernel.org, Eyal Moscovici/Haifa/IBM@IBMIL
> > Date: 12/01/2015 12:36 PM
> > Subject: Re: Fw: Benchmarking for vhost polling patch
> > 
> > On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote:
> > > > Hi Razya,
> > > > Thanks for the update.
> > > > So that's reasonable I think, and I think it makes sense
> > > > to keep working on this in isolation - it's more
> > > > manageable at this size.
> > > > 
> > > > The big questions in my mind:
> > > > - What happens if system is lightly loaded?
> > > >   E.g. a ping/pong benchmark. How much extra CPU are
> > > >   we wasting?
> > > > - We see the best performance on your system is with 10usec worth of polling.
> > > >   It's OK to be able to tune it for best performance, but
> > > >   most people don't have the time or the inclination.
> > > >   So what would be the best value for other CPUs?
> > > 
> > > The extra cpu waste vs. throughput gain depends on the polling timeout 
> > > value (poll_stop_idle).
> > > The best value to choose is dependent on the workload and the system 
> > > hardware and configuration.
> > > There is nothing we can say about this value in advance. The system 
> > > manager/administrator should use this optimization with the awareness 
> > > that polling consumes extra cpu cycles, as documented.
> > > 
> > > > - Should this be tunable from userspace per vhost instance?
> > > >   Why is it only tunable globally?
> > > 
> > > It should be tunable per vhost thread.
> > > We can do it in a subsequent patch.
> > 
> > So I think whether the patchset is appropriate upstream
> > will depend exactly on coming up with a reasonable
> > interface for enabling and tuning the functionality.
> > 
> 
> How about adding a new ioctl for each vhost device that 
> sets the poll_stop_idle (the timeout)? 
> This should be aligned with the QEMU "way" of doing things.
>
> > I was hopeful some reasonable default value can be
> > derived from e.g. cost of the exit.
> > If that is not the case, it becomes that much harder
> > for users to select good default values.
> > 
> 
> Our suggestion would be to use the maximum (a large enough) value,
> so that vhost is polling 100% of the time.
>
> The polling optimization mainly addresses users who want to maximize their 
> performance, even at the expense of wasting cpu cycles. The maximum value 
> will produce the biggest impact on performance.

*Everyone* is interested in getting maximum performance from
their systems.

> However, using the maximum default value will be valuable even for users 
> who care more about the normalized throughput/cpu criteria. Such users, 
> interested in finer tuning of the polling timeout, need to look for an 
> optimal timeout value for their system. The maximum value serves as the 
> upper limit of the range that needs to be searched for such an optimal 
> timeout value.

Number of users who are going to do this kind of tuning
can be counted on one hand.

The "poll all the time" also only works well
only if you have dedicated CPUs for VMs, and no HT.

I'm concerned you didn't really try to do something more widely useful,
and easier to use, being too focused on getting your high netperf
number.


> 
> > There are some cases where networking stack already
> > exposes low-level hardware detail to userspace, e.g.
> > tcp polling configuration. If we can't come up with
> > a way to abstract hardware, maybe we can at least tie
> > it to these existing controls rather than introducing
> > new ones?
> > 
> 
> We've spent time thinking about the possible interfaces that 
> could be appropriate for such an optimization (including tcp polling).
> We think that using an ioctl as the interface to "configure" the virtual 
> device/vhost, in the same manner that e.g. SET_NET_BACKEND is configured, 
> makes a lot of sense and is consistent with the existing mechanism.
> 
> Thanks,
> Razya

The guest is giving up its share of CPU for the benefit of vhost, right?
So maybe exposing this to the guest is appropriate, and then
add e.g. an ethtool interface for the guest admin to set this.
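
(One possible shape for that -- purely illustrative, none of this exists 
today -- would be to ride on the standard coalescing interface, so that 
e.g. "ethtool -C eth0 tx-usecs 10" in the guest ends up programming the 
host-side polling timeout:)

    /* Illustrative guest-side hook in virtio_net; virtnet_send_poll_cmd()
     * is made up and stands in for a new virtio control command. */
    static int virtnet_set_coalesce(struct net_device *dev,
                                    struct ethtool_coalesce *ec)
    {
            struct virtnet_info *vi = netdev_priv(dev);

            /* Reuse tx-usecs as "how long the host may poll for us". */
            return virtnet_send_poll_cmd(vi, ec->tx_coalesce_usecs);
    }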

This means we'll want virtio and qemu patches for this.

But really, you want to find a way to enable it by default.


> > > > - How bad is it if you don't pin vhost and vcpu threads?
> > > >   Is the scheduler smart enough to pull them apart?
> > > > - What happens in overcommit scenarios? Does polling make things
> > > >   much worse?
> > > >   Clearly polling will work worse if e.g. vhost and vcpu
> > > >   share the host cpu. How can we avoid conflicts?
> > > > 
> > > >   For two last questions, better cooperation with host scheduler will
> > > >   likely help here.
> > > >   See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
> > > >   I'm currently looking at pushing something similar upstream,
> > > >   if it goes in vhost polling can do something similar.
> > > > 
> > > > Any data points to shed light on these questions?
> > > 
> > > I ran a simple apache benchmark with an overcommit scenario, where both 
> > > the vcpu and vhost share the same core.
> > > In some cases (c>4 in my test cases) polling surprisingly produced better 
> > > throughput.
> > 
> > Likely because latency is hurt, so you get better batching?
> > 
> > > Therefore, it is hard to predict how polling will impact performance 
> > > in advance.
> > 
> > If it's so hard, users will struggle to configure this properly.
> > Looks like an argument for us developers to do the hard work,
> > and expose simpler controls to users?
> > 
> > > It is up to whoever is using this optimization to use it wisely.
> > > Thanks,
> > > Razya 
> > > 
> > 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fw: Benchmarking for vhost polling patch
  2015-01-14 15:23         ` Michael S. Tsirkin
@ 2015-01-18  7:40           ` Razya Ladelsky
  0 siblings, 0 replies; 7+ messages in thread
From: Razya Ladelsky @ 2015-01-18  7:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Eyal Moscovici,
	Joel Nider, kvm, Yossi Kuperman1

> > 
> > Our suggestion would be to use the maximum (a large enough) value,
> > so that vhost is polling 100% of the time.
> >
> > The polling optimization mainly addresses users who want to maximize their 
> > performance, even at the expense of wasting cpu cycles. The maximum value 
> > will produce the biggest impact on performance.
> 
> *Everyone* is interested in getting maximum performance from
> their systems.
> 

Maybe so, but not everyone is willing to pay the price.
That is also the reason why this optimization should not be enabled by 
default. 

> > However, using the maximum default value will be valuable even for users 
> > who care more about the normalized throughput/cpu criteria. Such users, 
> > interested in finer tuning of the polling timeout, need to look for an 
> > optimal timeout value for their system. The maximum value serves as the 
> > upper limit of the range that needs to be searched for such an optimal 
> > timeout value.
> 
> Number of users who are going to do this kind of tuning
> can be counted on one hand.
> 

If the optimization is not enabled by default, the default value is almost 
irrelevant, because when users turn on the feature they should understand 
that there's an associated cost and they have to tune their system if they 
want to get the maximum benefit (depending on how they define their maximum 
benefit).
The maximum value is a good starting point that will work in most cases 
and can be used to start the tuning. 

> > 
> > > There are some cases where networking stack already
> > > exposes low-level hardware detail to userspace, e.g.
> > > tcp polling configuration. If we can't come up with
> > > a way to abstract hardware, maybe we can at least tie
> > > it to these existing controls rather than introducing
> > > new ones?
> > > 
> > 
> > We've spent time thinking about the possible interfaces that 
> > could be appropriate for such an optimization(including tcp polling).
> > We think that using the ioctl as interface to "configure" the virtual 
> > device/vhost, 
> > in the same manner that e.g. SET_NET_BACKEND is configured, makes a 
lot of 
> > sense, and
> > is consistent with the existing mechanism. 
> > 
> > Thanks,
> > Razya
> 
> The guest is giving up its share of CPU for the benefit of vhost, right?
> So maybe exposing this to the guest is appropriate, and then
> add e.g. an ethtool interface for the guest admin to set this.
> 

The decision of whether to turn polling on (and at what rate)
should be made by the system administrator, who has a broad view of the 
system and workload, and not by the guest administrator.
Polling should be a tunable parameter from the host side; the guest should 
not be aware of it.
The guest is not necessarily giving up its time. It may be that there's 
just an extra dedicated core or free cpu cycles on a different cpu.
We provide a mechanism and an interface that can be tuned by some other 
program to implement its policy.
This patch is all about the mechanism and not the policy of how to use it.

Thank you,
Razya 


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-01-18  7:40 UTC | newest]

Thread overview: 7+ messages
2015-01-01 12:59 Fw: Benchmarking for vhost polling patch Razya Ladelsky
2015-01-05 12:35 ` Michael S. Tsirkin
2015-01-11 12:44   ` Razya Ladelsky
2015-01-12 10:36     ` Michael S. Tsirkin
2015-01-14 15:01       ` Razya Ladelsky
2015-01-14 15:23         ` Michael S. Tsirkin
2015-01-18  7:40           ` Razya Ladelsky
