* [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
@ 2013-11-26 12:40 Zhanghaoyu (A)
2013-11-26 12:47 ` Paolo Bonzini
2013-11-26 12:48 ` Gleb Natapov
0 siblings, 2 replies; 60+ messages in thread
From: Zhanghaoyu (A) @ 2013-11-26 12:40 UTC (permalink / raw)
To: KVM, qemu-devel@nongnu.org, Gleb Natapov, Michael S. Tsirkin,
Eric Blake, Paolo Bonzini
Cc: Huangweidong (C), Zanghongyong, Luonengjun, Jinxin (F)
Hi all,
When the guest sets an irq's smp_affinity, a VMEXIT occurs, the vcpu thread returns from the hypervisor to QEMU via ioctl, and the vcpu thread then asks the hypervisor to update the irq routing table.
In kvm_set_irq_routing, synchronize_rcu is called, so the current vcpu thread blocks for a long time waiting for the RCU grace period, and during this period the vcpu cannot provide service to the VM,
so interrupts delivered to this vcpu cannot be handled in time, and the applications running on this vcpu cannot be serviced either.
This is unacceptable in some real-time scenarios, e.g. telecom.
So, I want to create a single workqueue for each VM to perform the RCU synchronization for the irq routing table asynchronously,
and let the vcpu thread return and VMENTER to service the VM immediately, with no need to block waiting for the RCU grace period.
I have implemented a raw patch and tested it in our telecom environment; the problem described above disappeared.
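Roughly, the idea looks like the sketch below (hypothetical names such as irq_routing_wq; this is not the actual patch): the vcpu thread swaps in the new table as it does today, but the grace-period wait and the free of the old table are pushed to a per-VM workqueue.

struct irq_routing_free_work {
        struct work_struct work;
        struct kvm_irq_routing_table *old;
};

static void irq_routing_free_fn(struct work_struct *work)
{
        struct irq_routing_free_work *w =
                container_of(work, struct irq_routing_free_work, work);

        synchronize_rcu();      /* only the worker waits for the grace period */
        kfree(w->old);
        kfree(w);
}

/* replaces "synchronize_rcu(); kfree(old);" in kvm_set_irq_routing() */
static void queue_irq_routing_free(struct kvm *kvm,
                                   struct kvm_irq_routing_table *old)
{
        struct irq_routing_free_work *w = kzalloc(sizeof(*w), GFP_KERNEL);

        if (!w) {
                /* allocation failed: fall back to the current synchronous path */
                synchronize_rcu();
                kfree(old);
                return;
        }
        INIT_WORK(&w->work, irq_routing_free_fn);
        w->old = old;
        queue_work(kvm->irq_routing_wq, &w->work);  /* hypothetical per-VM workqueue */
}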
Any better ideas?
Thanks,
Zhang Haoyu
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 12:40 [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table Zhanghaoyu (A)
@ 2013-11-26 12:47 ` Paolo Bonzini
2013-11-26 12:56 ` Gleb Natapov
2013-11-26 13:18 ` Avi Kivity
2013-11-26 12:48 ` Gleb Natapov
1 sibling, 2 replies; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 12:47 UTC (permalink / raw)
To: Zhanghaoyu (A)
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Jinxin (F), Luonengjun, qemu-devel@nongnu.org, Zanghongyong
Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
> When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor to update the irq routing table,
> in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is blocked for so much time to wait RCU grace period, and during this period, this vcpu cannot provide service to VM,
> so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced too.
> It's unacceptable in some real-time scenario, e.g. telecom.
>
> So, I want to create a single workqueue for each VM, to asynchronously performing the RCU synchronization for irq routing table,
> and let the vcpu thread return and VMENTRY to service VM immediately, no more need to blocked to wait RCU grace period.
> And, I have implemented a raw patch, took a test in our telecom environment, above problem disappeared.
I don't think a workqueue is even needed. You just need to use call_rcu
to free "old" after releasing kvm->irq_lock.
What do you think?
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 12:40 [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table Zhanghaoyu (A)
2013-11-26 12:47 ` Paolo Bonzini
@ 2013-11-26 12:48 ` Gleb Natapov
2013-11-26 12:50 ` Gleb Natapov
1 sibling, 1 reply; 60+ messages in thread
From: Gleb Natapov @ 2013-11-26 12:48 UTC (permalink / raw)
To: Zhanghaoyu (A)
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Jinxin (F), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Paolo Bonzini
On Tue, Nov 26, 2013 at 12:40:36PM +0000, Zhanghaoyu (A) wrote:
> Hi all,
>
> When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor to update the irq routing table,
Why does the vcpu thread ask the hypervisor to update the irq routing table on
pcpu migration?
> in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is blocked for so much time to wait RCU grace period, and during this period, this vcpu cannot provide service to VM,
> so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced too.
> It's unacceptable in some real-time scenario, e.g. telecom.
>
> So, I want to create a single workqueue for each VM, to asynchronously performing the RCU synchronization for irq routing table,
> and let the vcpu thread return and VMENTRY to service VM immediately, no more need to blocked to wait RCU grace period.
> And, I have implemented a raw patch, took a test in our telecom environment, above problem disappeared.
>
> Any better ideas?
>
> Thanks,
> Zhang Haoyu
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 12:48 ` Gleb Natapov
@ 2013-11-26 12:50 ` Gleb Natapov
0 siblings, 0 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-26 12:50 UTC (permalink / raw)
To: Zhanghaoyu (A)
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Jinxin (F), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Paolo Bonzini
On Tue, Nov 26, 2013 at 02:48:10PM +0200, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 12:40:36PM +0000, Zhanghaoyu (A) wrote:
> > Hi all,
> >
> > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor to update the irq routing table,
> Why vcpu thread ask the hypervisor to update the irq routing table on
> pcpu migration?
>
Ah, I misread. The guest sets irq smp_affinity, not the host.
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 12:47 ` Paolo Bonzini
@ 2013-11-26 12:56 ` Gleb Natapov
2013-11-26 13:13 ` Paolo Bonzini
2013-11-26 16:05 ` Michael S. Tsirkin
2013-11-26 13:18 ` Avi Kivity
1 sibling, 2 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-26 12:56 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Jinxin (F),
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong
On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote:
> Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
> > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor to update the irq routing table,
> > in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is blocked for so much time to wait RCU grace period, and during this period, this vcpu cannot provide service to VM,
> > so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced too.
> > It's unacceptable in some real-time scenario, e.g. telecom.
> >
> > So, I want to create a single workqueue for each VM, to asynchronously performing the RCU synchronization for irq routing table,
> > and let the vcpu thread return and VMENTRY to service VM immediately, no more need to blocked to wait RCU grace period.
> > And, I have implemented a raw patch, took a test in our telecom environment, above problem disappeared.
>
> I don't think a workqueue is even needed. You just need to use call_rcu
> to free "old" after releasing kvm->irq_lock.
>
> What do you think?
>
It should be rate limited somehow. Since it is guest triggerable, a guest may cause
the host to allocate a lot of memory this way.
Is this about MSI interrupt affinity? IIRC changing INTx interrupt
affinity should not trigger a kvm_set_irq_routing update. If this is about
MSI only, then what about changing userspace to use KVM_SIGNAL_MSI for
MSI injection?
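For reference, injecting an MSI from userspace with KVM_SIGNAL_MSI (available when KVM_CAP_SIGNAL_MSI is present) looks roughly like the sketch below, and it bypasses the GSI routing table entirely:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: direct MSI injection, no irq routing entry involved. */
static int inject_msi(int vm_fd, uint64_t address, uint32_t data)
{
        struct kvm_msi msi;

        memset(&msi, 0, sizeof(msi));
        msi.address_lo = (uint32_t)address;
        msi.address_hi = (uint32_t)(address >> 32);
        msi.data = data;

        /* > 0: delivered, 0: blocked by the guest, < 0: error */
        return ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);
}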
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 12:56 ` Gleb Natapov
@ 2013-11-26 13:13 ` Paolo Bonzini
2013-11-28 3:46 ` Zhanghaoyu (A)
2013-11-26 16:05 ` Michael S. Tsirkin
1 sibling, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 13:13 UTC (permalink / raw)
To: Gleb Natapov
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Jinxin (F),
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong
Il 26/11/2013 13:56, Gleb Natapov ha scritto:
> > I don't think a workqueue is even needed. You just need to use call_rcu
> > to free "old" after releasing kvm->irq_lock.
> >
> > What do you think?
>
> It should be rate limited somehow. Since it guest triggarable guest may cause
> host to allocate a lot of memory this way.
True, though if I understand Zhanghaoyu's proposal correctly, a workqueue would be
even worse.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 12:47 ` Paolo Bonzini
2013-11-26 12:56 ` Gleb Natapov
@ 2013-11-26 13:18 ` Avi Kivity
2013-11-26 13:47 ` Paolo Bonzini
1 sibling, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 13:18 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Huangweidong (C), Gleb Natapov, KVM, Michael S. Tsirkin,
Jinxin (F), Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org,
Zanghongyong
On Tue, Nov 26, 2013 at 2:47 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
>> When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread
>> will IOCTL return to QEMU from hypervisor, then vcpu thread ask the
>> hypervisor to update the irq routing table,
>> in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread
>> is blocked for so much time to wait RCU grace period, and during this
>> period, this vcpu cannot provide service to VM,
>> so those interrupts delivered to this vcpu cannot be handled in time,
>> and the apps running on this vcpu cannot be serviced too.
>> It's unacceptable in some real-time scenario, e.g. telecom.
>>
>> So, I want to create a single workqueue for each VM, to asynchronously
>> performing the RCU synchronization for irq routing table,
>> and let the vcpu thread return and VMENTRY to service VM immediately, no
>> more need to blocked to wait RCU grace period.
>> And, I have implemented a raw patch, took a test in our telecom
>> environment, above problem disappeared.
>
> I don't think a workqueue is even needed. You just need to use call_rcu
> to free "old" after releasing kvm->irq_lock.
>
> What do you think?
Can this cause an interrupt to be delivered to the wrong (old) vcpu?
The way Linux sets interrupt affinity, it cannot, since changing the
affinity is (IIRC) done in the interrupt handler, so the next interrupt
cannot be in flight and thus pick up the old interrupt routing table.
However it may be vulnerable in other ways.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 13:18 ` Avi Kivity
@ 2013-11-26 13:47 ` Paolo Bonzini
2013-11-26 14:36 ` Avi Kivity
0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 13:47 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), Gleb Natapov, KVM, Michael S. Tsirkin,
Jinxin (F), Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org,
Zanghongyong
Il 26/11/2013 14:18, Avi Kivity ha scritto:
>
>> I don't think a workqueue is even needed. You just need to use call_rcu
>> to free "old" after releasing kvm->irq_lock.
>>
>> What do you think?
>
> Can this cause an interrupt to be delivered to the wrong (old) vcpu?
No, this would be exactly the same code that is running now:
mutex_lock(&kvm->irq_lock);
old = kvm->irq_routing;
kvm_irq_routing_update(kvm, new);
mutex_unlock(&kvm->irq_lock);
synchronize_rcu();
kfree(old);
return 0;
Except that the kfree would run in the call_rcu kernel thread instead of
the vcpu thread. But the vcpus already see the new routing table after
the rcu_assign_pointer that is in kvm_irq_routing_update.
There is still the problem that Gleb pointed out, though.
Paolo
> The way Linux sets interrupt affinity, it cannot, since changing the
> affinity is (IIRC) done in the interrupt handler, so the next interrupt
> cannot be in flight and thus pick up the old interrupt routing table.
>
> However it may be vulnerable in other ways.
>
>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 13:47 ` Paolo Bonzini
@ 2013-11-26 14:36 ` Avi Kivity
2013-11-26 14:46 ` Paolo Bonzini
0 siblings, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 14:36 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On Tue, Nov 26, 2013 at 3:47 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 26/11/2013 14:18, Avi Kivity ha scritto:
> >
> >> I don't think a workqueue is even needed. You just need to use call_rcu
> >> to free "old" after releasing kvm->irq_lock.
> >>
> >> What do you think?
> >
> > Can this cause an interrupt to be delivered to the wrong (old) vcpu?
>
> No, this would be exactly the same code that is running now:
>
> mutex_lock(&kvm->irq_lock);
> old = kvm->irq_routing;
> kvm_irq_routing_update(kvm, new);
> mutex_unlock(&kvm->irq_lock);
>
> synchronize_rcu();
> kfree(old);
> return 0;
>
> Except that the kfree would run in the call_rcu kernel thread instead of
> the vcpu thread. But the vcpus already see the new routing table after
> the rcu_assign_pointer that is in kvm_irq_routing_update.
>
>
I understood the proposal was also to eliminate the synchronize_rcu(), so
while new interrupts would see the new routing table, interrupts already in
flight could pick up the old one.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 14:36 ` Avi Kivity
@ 2013-11-26 14:46 ` Paolo Bonzini
2013-11-26 14:54 ` Avi Kivity
0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 14:46 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
Il 26/11/2013 15:36, Avi Kivity ha scritto:
>
> No, this would be exactly the same code that is running now:
>
> mutex_lock(&kvm->irq_lock);
> old = kvm->irq_routing;
> kvm_irq_routing_update(kvm, new);
> mutex_unlock(&kvm->irq_lock);
>
> synchronize_rcu();
> kfree(old);
> return 0;
>
> Except that the kfree would run in the call_rcu kernel thread instead of
> the vcpu thread. But the vcpus already see the new routing table after
> the rcu_assign_pointer that is in kvm_irq_routing_update.
>
> I understood the proposal was also to eliminate the synchronize_rcu(),
> so while new interrupts would see the new routing table, interrupts
> already in flight could pick up the old one.
Isn't that always the case with RCU? (See my answer above: "the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update").
If you eliminate the synchronize_rcu, new interrupts would see the new
routing table, while interrupts already in flight will get a dangling
pointer.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 14:46 ` Paolo Bonzini
@ 2013-11-26 14:54 ` Avi Kivity
2013-11-26 15:03 ` Gleb Natapov
0 siblings, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 14:54 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On 11/26/2013 04:46 PM, Paolo Bonzini wrote:
> Il 26/11/2013 15:36, Avi Kivity ha scritto:
>> No, this would be exactly the same code that is running now:
>>
>> mutex_lock(&kvm->irq_lock);
>> old = kvm->irq_routing;
>> kvm_irq_routing_update(kvm, new);
>> mutex_unlock(&kvm->irq_lock);
>>
>> synchronize_rcu();
>> kfree(old);
>> return 0;
>>
>> Except that the kfree would run in the call_rcu kernel thread instead of
>> the vcpu thread. But the vcpus already see the new routing table after
>> the rcu_assign_pointer that is in kvm_irq_routing_update.
>>
>> I understood the proposal was also to eliminate the synchronize_rcu(),
>> so while new interrupts would see the new routing table, interrupts
>> already in flight could pick up the old one.
> Isn't that always the case with RCU? (See my answer above: "the vcpus
> already see the new routing table after the rcu_assign_pointer that is
> in kvm_irq_routing_update").
With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed. Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old destination
(see kvm_set_irq_inatomic()).
It's another question whether the hardware provides the same guarantee.
> If you eliminate the synchronize_rcu, new interrupts would see the new
> routing table, while interrupts already in flight will get a dangling
> pointer.
Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 14:54 ` Avi Kivity
@ 2013-11-26 15:03 ` Gleb Natapov
2013-11-26 15:20 ` Paolo Bonzini
2013-11-26 15:24 ` Avi Kivity
0 siblings, 2 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-26 15:03 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Paolo Bonzini, Jinxin (F)
On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:
> On 11/26/2013 04:46 PM, Paolo Bonzini wrote:
> >Il 26/11/2013 15:36, Avi Kivity ha scritto:
> >> No, this would be exactly the same code that is running now:
> >>
> >> mutex_lock(&kvm->irq_lock);
> >> old = kvm->irq_routing;
> >> kvm_irq_routing_update(kvm, new);
> >> mutex_unlock(&kvm->irq_lock);
> >>
> >> synchronize_rcu();
> >> kfree(old);
> >> return 0;
> >>
> >> Except that the kfree would run in the call_rcu kernel thread instead of
> >> the vcpu thread. But the vcpus already see the new routing table after
> >> the rcu_assign_pointer that is in kvm_irq_routing_update.
> >>
> >>I understood the proposal was also to eliminate the synchronize_rcu(),
> >>so while new interrupts would see the new routing table, interrupts
> >>already in flight could pick up the old one.
> >Isn't that always the case with RCU? (See my answer above: "the vcpus
> >already see the new routing table after the rcu_assign_pointer that is
> >in kvm_irq_routing_update").
>
> With synchronize_rcu(), you have the additional guarantee that any
> parallel accesses to the old routing table have completed. Since we
> also trigger the irq from rcu context, you know that after
> synchronize_rcu() you won't get any interrupts to the old
> destination (see kvm_set_irq_inatomic()).
We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while the vcpu
or iothread that performed the table update sits in synchronize_rcu().
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:03 ` Gleb Natapov
@ 2013-11-26 15:20 ` Paolo Bonzini
2013-11-26 15:25 ` Avi Kivity
` (2 more replies)
2013-11-26 15:24 ` Avi Kivity
1 sibling, 3 replies; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 15:20 UTC (permalink / raw)
To: Gleb Natapov
Cc: Avi Kivity, Huangweidong (C), KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
Il 26/11/2013 16:03, Gleb Natapov ha scritto:
>>>> > >>I understood the proposal was also to eliminate the synchronize_rcu(),
>>>> > >>so while new interrupts would see the new routing table, interrupts
>>>> > >>already in flight could pick up the old one.
>>> > >Isn't that always the case with RCU? (See my answer above: "the vcpus
>>> > >already see the new routing table after the rcu_assign_pointer that is
>>> > >in kvm_irq_routing_update").
>> >
>> > With synchronize_rcu(), you have the additional guarantee that any
>> > parallel accesses to the old routing table have completed. Since we
>> > also trigger the irq from rcu context, you know that after
>> > synchronize_rcu() you won't get any interrupts to the old
>> > destination (see kvm_set_irq_inatomic()).
> We do not have this guaranty for other vcpus that do not call
> synchronize_rcu(). They may still use outdated routing table while a vcpu
> or iothread that performed table update sits in synchronize_rcu().
Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.
Without synchronize_rcu you could have
VCPU writes to routing table
e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
kvm_set_msi_irq(e, &irq);
kvm_irq_delivery_to_apic_fast();
where the entry is stale but the VCPU has already resumed execution.
If we want to ensure this, we need to use a different mechanism for
synchronization than the global RCU. QRCU would work; readers are not
wait-free, but only if there is a concurrent synchronize_qrcu, which
should be rare.
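As a sketch (kvm->irq_srcu here is a hypothetical SRCU domain dedicated to the routing table, not existing code):

        /* reader side, e.g. around MSI delivery */
        idx = srcu_read_lock(&kvm->irq_srcu);
        irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
        /* ... look up the entry and deliver ... */
        srcu_read_unlock(&kvm->irq_srcu, idx);

        /* writer side in kvm_set_irq_routing() */
        mutex_lock(&kvm->irq_lock);
        old = kvm->irq_routing;
        rcu_assign_pointer(kvm->irq_routing, new);
        mutex_unlock(&kvm->irq_lock);

        /* waits only for routing-table readers, not for all RCU readers */
        synchronize_srcu(&kvm->irq_srcu);
        kfree(old);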
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:03 ` Gleb Natapov
2013-11-26 15:20 ` Paolo Bonzini
@ 2013-11-26 15:24 ` Avi Kivity
2013-11-28 9:14 ` Zhanghaoyu (A)
1 sibling, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 15:24 UTC (permalink / raw)
To: Gleb Natapov
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Paolo Bonzini, Jinxin (F)
On 11/26/2013 05:03 PM, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:
>> On 11/26/2013 04:46 PM, Paolo Bonzini wrote:
>>> Il 26/11/2013 15:36, Avi Kivity ha scritto:
>>>> No, this would be exactly the same code that is running now:
>>>>
>>>> mutex_lock(&kvm->irq_lock);
>>>> old = kvm->irq_routing;
>>>> kvm_irq_routing_update(kvm, new);
>>>> mutex_unlock(&kvm->irq_lock);
>>>>
>>>> synchronize_rcu();
>>>> kfree(old);
>>>> return 0;
>>>>
>>>> Except that the kfree would run in the call_rcu kernel thread instead of
>>>> the vcpu thread. But the vcpus already see the new routing table after
>>>> the rcu_assign_pointer that is in kvm_irq_routing_update.
>>>>
>>>> I understood the proposal was also to eliminate the synchronize_rcu(),
>>>> so while new interrupts would see the new routing table, interrupts
>>>> already in flight could pick up the old one.
>>> Isn't that always the case with RCU? (See my answer above: "the vcpus
>>> already see the new routing table after the rcu_assign_pointer that is
>>> in kvm_irq_routing_update").
>> With synchronize_rcu(), you have the additional guarantee that any
>> parallel accesses to the old routing table have completed. Since we
>> also trigger the irq from rcu context, you know that after
>> synchronize_rcu() you won't get any interrupts to the old
>> destination (see kvm_set_irq_inatomic()).
> We do not have this guaranty for other vcpus that do not call
> synchronize_rcu(). They may still use outdated routing table while a vcpu
> or iothread that performed table update sits in synchronize_rcu().
>
Consider this guest code:
write msi entry, directing the interrupt away from this vcpu
nop
memset(&idt, 0, sizeof(idt));
Currently, this code will never trigger a triple fault. With the change
to call_rcu(), it may.
Now it may be that the guest does not expect this to work (PCI writes
are posted; and interrupts can be delayed indefinitely by the pci
fabric), but we don't know if there's a path that guarantees the guest
something that we're taking away with this change.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:20 ` Paolo Bonzini
@ 2013-11-26 15:25 ` Avi Kivity
2013-11-26 15:28 ` Paolo Bonzini
2013-11-26 16:24 ` Gleb Natapov
2013-11-28 6:27 ` Zhanghaoyu (A)
2 siblings, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 15:25 UTC (permalink / raw)
To: Paolo Bonzini, Gleb Natapov
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Jinxin (F)
On 11/26/2013 05:20 PM, Paolo Bonzini wrote:
> Il 26/11/2013 16:03, Gleb Natapov ha scritto:
>>>>>>>> I understood the proposal was also to eliminate the synchronize_rcu(),
>>>>>>>> so while new interrupts would see the new routing table, interrupts
>>>>>>>> already in flight could pick up the old one.
>>>>>> Isn't that always the case with RCU? (See my answer above: "the vcpus
>>>>>> already see the new routing table after the rcu_assign_pointer that is
>>>>>> in kvm_irq_routing_update").
>>>> With synchronize_rcu(), you have the additional guarantee that any
>>>> parallel accesses to the old routing table have completed. Since we
>>>> also trigger the irq from rcu context, you know that after
>>>> synchronize_rcu() you won't get any interrupts to the old
>>>> destination (see kvm_set_irq_inatomic()).
>> We do not have this guaranty for other vcpus that do not call
>> synchronize_rcu(). They may still use outdated routing table while a vcpu
>> or iothread that performed table update sits in synchronize_rcu().
> Avi's point is that, after the VCPU resumes execution, you know that no
> interrupt will be sent to the old destination because
> kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
> also called within the RCU read-side critical section.
>
> Without synchronize_rcu you could have
>
> VCPU writes to routing table
> e = entry from IRQ routing table
> kvm_irq_routing_update(kvm, new);
> VCPU resumes execution
> kvm_set_msi_irq(e, &irq);
> kvm_irq_delivery_to_apic_fast();
>
> where the entry is stale but the VCPU has already resumed execution.
>
> If we want to ensure, we need to use a different mechanism for
> synchronization than the global RCU. QRCU would work; readers are not
> wait-free but only if there is a concurrent synchronize_qrcu, which
> should be rare.
An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:25 ` Avi Kivity
@ 2013-11-26 15:28 ` Paolo Bonzini
2013-11-26 15:35 ` Avi Kivity
0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 15:28 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
Il 26/11/2013 16:25, Avi Kivity ha scritto:
>> If we want to ensure, we need to use a different mechanism for
>> synchronization than the global RCU. QRCU would work; readers are not
>> wait-free but only if there is a concurrent synchronize_qrcu, which
>> should be rare.
>
> An alternative path is to convince ourselves that the hardware does not
> provide the guarantees that the current code provides, and so we can
> relax them.
No, I think it's a reasonable guarantee to provide.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:28 ` Paolo Bonzini
@ 2013-11-26 15:35 ` Avi Kivity
2013-11-26 15:58 ` Paolo Bonzini
0 siblings, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 15:35 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On 11/26/2013 05:28 PM, Paolo Bonzini wrote:
> Il 26/11/2013 16:25, Avi Kivity ha scritto:
>>> If we want to ensure, we need to use a different mechanism for
>>> synchronization than the global RCU. QRCU would work; readers are not
>>> wait-free but only if there is a concurrent synchronize_qrcu, which
>>> should be rare.
>> An alternative path is to convince ourselves that the hardware does not
>> provide the guarantees that the current code provides, and so we can
>> relax them.
> No, I think it's a reasonable guarantee to provide.
>
Why?
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:35 ` Avi Kivity
@ 2013-11-26 15:58 ` Paolo Bonzini
2013-11-26 16:06 ` Avi Kivity
2013-11-26 16:08 ` Michael S. Tsirkin
0 siblings, 2 replies; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 15:58 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
Il 26/11/2013 16:35, Avi Kivity ha scritto:
>>>> If we want to ensure, we need to use a different mechanism for
>>>> synchronization than the global RCU. QRCU would work; readers are not
>>>> wait-free but only if there is a concurrent synchronize_qrcu, which
>>>> should be rare.
>>> An alternative path is to convince ourselves that the hardware does not
>>> provide the guarantees that the current code provides, and so we can
>>> relax them.
>> No, I think it's a reasonable guarantee to provide.
>
> Why?
Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device. It seems safer to
assume that at least one device/driver pair wants this to work.
(BTW, PCI memory writes are posted, but configuration writes are not).
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 12:56 ` Gleb Natapov
2013-11-26 13:13 ` Paolo Bonzini
@ 2013-11-26 16:05 ` Michael S. Tsirkin
2013-11-26 16:14 ` Gleb Natapov
1 sibling, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2013-11-26 16:05 UTC (permalink / raw)
To: Gleb Natapov
Cc: Huangweidong (C), KVM, Jinxin (F), Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Paolo Bonzini
On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote:
> > Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
> > > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor to update the irq routing table,
> > > in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is blocked for so much time to wait RCU grace period, and during this period, this vcpu cannot provide service to VM,
> > > so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced too.
> > > It's unacceptable in some real-time scenario, e.g. telecom.
> > >
> > > So, I want to create a single workqueue for each VM, to asynchronously performing the RCU synchronization for irq routing table,
> > > and let the vcpu thread return and VMENTRY to service VM immediately, no more need to blocked to wait RCU grace period.
> > > And, I have implemented a raw patch, took a test in our telecom environment, above problem disappeared.
> >
> > I don't think a workqueue is even needed. You just need to use call_rcu
> > to free "old" after releasing kvm->irq_lock.
> >
> > What do you think?
> >
> It should be rate limited somehow. Since it guest triggarable guest may cause
> host to allocate a lot of memory this way.
The checks in __call_rcu() should handle this, I think. These keep a per-CPU
counter, which can be adjusted via rcutree.blimit, which defaults
to taking evasive action if more than 10K callbacks are waiting on a
given CPU.
> Is this about MSI interrupt affinity? IIRC changing INT interrupt
> affinity should not trigger kvm_set_irq_routing update. If this is about
> MSI only then what about changing userspace to use KVM_SIGNAL_MSI for
> MSI injection?
>
> --
> Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:58 ` Paolo Bonzini
@ 2013-11-26 16:06 ` Avi Kivity
2013-11-26 16:11 ` Michael S. Tsirkin
2013-11-26 16:08 ` Michael S. Tsirkin
1 sibling, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 16:06 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On 11/26/2013 05:58 PM, Paolo Bonzini wrote:
> Il 26/11/2013 16:35, Avi Kivity ha scritto:
>>>>> If we want to ensure, we need to use a different mechanism for
>>>>> synchronization than the global RCU. QRCU would work; readers are not
>>>>> wait-free but only if there is a concurrent synchronize_qrcu, which
>>>>> should be rare.
>>>> An alternative path is to convince ourselves that the hardware does not
>>>> provide the guarantees that the current code provides, and so we can
>>>> relax them.
>>> No, I think it's a reasonable guarantee to provide.
>> Why?
> Because IIUC the semantics may depend not just on the interrupt
> controller, but also on the specific PCI device. It seems safer to
> assume that at least one device/driver pair wants this to work.
It's indeed safe, but I think there's a nice win to be had if we drop
the assumption.
> (BTW, PCI memory writes are posted, but configuration writes are not).
MSIs are configured via PCI memory writes.
By itself, that doesn't buy us anything, since the guest could flush the
write via a read. But I think the fact that the interrupt messages
themselves are posted proves that it is safe. The fact that Linux does
interrupt migration from within the interrupt handler also shows that
someone else believes that it is the only safe place to do it.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:58 ` Paolo Bonzini
2013-11-26 16:06 ` Avi Kivity
@ 2013-11-26 16:08 ` Michael S. Tsirkin
1 sibling, 0 replies; 60+ messages in thread
From: Michael S. Tsirkin @ 2013-11-26 16:08 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Avi Kivity, Huangweidong (C), KVM, Gleb Natapov, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Jinxin (F)
On Tue, Nov 26, 2013 at 04:58:53PM +0100, Paolo Bonzini wrote:
> Il 26/11/2013 16:35, Avi Kivity ha scritto:
> >>>> If we want to ensure, we need to use a different mechanism for
> >>>> synchronization than the global RCU. QRCU would work; readers are not
> >>>> wait-free but only if there is a concurrent synchronize_qrcu, which
> >>>> should be rare.
> >>> An alternative path is to convince ourselves that the hardware does not
> >>> provide the guarantees that the current code provides, and so we can
> >>> relax them.
> >> No, I think it's a reasonable guarantee to provide.
> >
> > Why?
>
> Because IIUC the semantics may depend not just on the interrupt
> controller, but also on the specific PCI device. It seems safer to
> assume that at least one device/driver pair wants this to work.
>
> (BTW, PCI memory writes are posted, but configuration writes are not).
>
> Paolo
You can also do a PCI read and flush out the writes.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:06 ` Avi Kivity
@ 2013-11-26 16:11 ` Michael S. Tsirkin
2013-11-26 16:21 ` Avi Kivity
0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2013-11-26 16:11 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Gleb Natapov, Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Paolo Bonzini,
Jinxin (F)
On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote:
> On 11/26/2013 05:58 PM, Paolo Bonzini wrote:
> >Il 26/11/2013 16:35, Avi Kivity ha scritto:
> >>>>>If we want to ensure, we need to use a different mechanism for
> >>>>>synchronization than the global RCU. QRCU would work; readers are not
> >>>>>wait-free but only if there is a concurrent synchronize_qrcu, which
> >>>>>should be rare.
> >>>>An alternative path is to convince ourselves that the hardware does not
> >>>>provide the guarantees that the current code provides, and so we can
> >>>>relax them.
> >>>No, I think it's a reasonable guarantee to provide.
> >>Why?
> >Because IIUC the semantics may depend not just on the interrupt
> >controller, but also on the specific PCI device. It seems safer to
> >assume that at least one device/driver pair wants this to work.
>
> It's indeed safe, but I think there's a nice win to be had if we
> drop the assumption.
I'm not arguing with that, but a minor comment below:
> >(BTW, PCI memory writes are posted, but configuration writes are not).
>
> MSIs are configured via PCI memory writes.
>
> By itself, that doesn't buy us anything, since the guest could flush
> the write via a read. But I think the fact that the interrupt
> messages themselves are posted proves that it is safe.
FYI, a PCI read flushes in the interrupt itself, too.
> The fact
> that Linux does interrupt migration from within the interrupt
> handler also shows that someone else believes that it is the only
> safe place to do it.
>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:05 ` Michael S. Tsirkin
@ 2013-11-26 16:14 ` Gleb Natapov
2013-11-26 16:24 ` Michael S. Tsirkin
0 siblings, 1 reply; 60+ messages in thread
From: Gleb Natapov @ 2013-11-26 16:14 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Huangweidong (C), KVM, Jinxin (F), Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Paolo Bonzini
On Tue, Nov 26, 2013 at 06:05:37PM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote:
> > On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote:
> > > Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
> > > > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor to update the irq routing table,
> > > > in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is blocked for so much time to wait RCU grace period, and during this period, this vcpu cannot provide service to VM,
> > > > so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced too.
> > > > It's unacceptable in some real-time scenario, e.g. telecom.
> > > >
> > > > So, I want to create a single workqueue for each VM, to asynchronously performing the RCU synchronization for irq routing table,
> > > > and let the vcpu thread return and VMENTRY to service VM immediately, no more need to blocked to wait RCU grace period.
> > > > And, I have implemented a raw patch, took a test in our telecom environment, above problem disappeared.
> > >
> > > I don't think a workqueue is even needed. You just need to use call_rcu
> > > to free "old" after releasing kvm->irq_lock.
> > >
> > > What do you think?
> > >
> > It should be rate limited somehow. Since it guest triggarable guest may cause
> > host to allocate a lot of memory this way.
>
> The checks in __call_rcu(), should handle this I think. These keep a per-CPU
> counter, which can be adjusted via rcutree.blimit, which defaults
> to taking evasive action if more than 10K callbacks are waiting on a
> given CPU.
>
>
Documentation/RCU/checklist.txt has:
An especially important property of the synchronize_rcu()
primitive is that it automatically self-limits: if grace periods
are delayed for whatever reason, then the synchronize_rcu()
primitive will correspondingly delay updates. In contrast,
code using call_rcu() should explicitly limit update rate in
cases where grace periods are delayed, as failing to do so can
result in excessive realtime latencies or even OOM conditions.
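Explicit limiting could look something like this sketch (irq_routing_pending, the rcu_head, and the back-pointer to kvm are all hypothetical fields): fall back to a synchronous grace period once too many deferred frees are already in flight for this VM.

#define IRQ_ROUTING_PENDING_MAX 16

static void free_irq_routing_rcu(struct rcu_head *rcu)
{
        struct kvm_irq_routing_table *rt =
                container_of(rcu, struct kvm_irq_routing_table, rcu);

        atomic_dec(&rt->kvm->irq_routing_pending);
        kfree(rt);
}

        if (atomic_inc_return(&kvm->irq_routing_pending) > IRQ_ROUTING_PENDING_MAX) {
                /* too many callbacks pending: throttle by waiting synchronously */
                atomic_dec(&kvm->irq_routing_pending);
                synchronize_rcu();
                kfree(old);
        } else {
                call_rcu(&old->rcu, free_irq_routing_rcu);
        }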
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:11 ` Michael S. Tsirkin
@ 2013-11-26 16:21 ` Avi Kivity
2013-11-26 16:42 ` Paolo Bonzini
0 siblings, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 16:21 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Huangweidong (C), KVM, Gleb Natapov, Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Paolo Bonzini,
Jinxin (F)
On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote:
>> On 11/26/2013 05:58 PM, Paolo Bonzini wrote:
>>> Il 26/11/2013 16:35, Avi Kivity ha scritto:
>>>>>>> If we want to ensure, we need to use a different mechanism for
>>>>>>> synchronization than the global RCU. QRCU would work; readers are not
>>>>>>> wait-free but only if there is a concurrent synchronize_qrcu, which
>>>>>>> should be rare.
>>>>>> An alternative path is to convince ourselves that the hardware does not
>>>>>> provide the guarantees that the current code provides, and so we can
>>>>>> relax them.
>>>>> No, I think it's a reasonable guarantee to provide.
>>>> Why?
>>> Because IIUC the semantics may depend not just on the interrupt
>>> controller, but also on the specific PCI device. It seems safer to
>>> assume that at least one device/driver pair wants this to work.
>> It's indeed safe, but I think there's a nice win to be had if we
>> drop the assumption.
> I'm not arguing with that, but a minor commoent below:
>
>>> (BTW, PCI memory writes are posted, but configuration writes are not).
>> MSIs are configured via PCI memory writes.
>>
>> By itself, that doesn't buy us anything, since the guest could flush
>> the write via a read. But I think the fact that the interrupt
>> messages themselves are posted proves that it is safe.
> FYI, PCI read flushes the interrupt itself in, too.
>
I guess that kills the optimization then. Maybe you can do qrcu,
whatever that is.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:20 ` Paolo Bonzini
2013-11-26 15:25 ` Avi Kivity
@ 2013-11-26 16:24 ` Gleb Natapov
2013-11-26 16:27 ` Avi Kivity
2013-11-26 16:28 ` Paolo Bonzini
2013-11-28 6:27 ` Zhanghaoyu (A)
2 siblings, 2 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-26 16:24 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Avi Kivity, Huangweidong (C), KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:
> Il 26/11/2013 16:03, Gleb Natapov ha scritto:
> >>>> > >>I understood the proposal was also to eliminate the synchronize_rcu(),
> >>>> > >>so while new interrupts would see the new routing table, interrupts
> >>>> > >>already in flight could pick up the old one.
> >>> > >Isn't that always the case with RCU? (See my answer above: "the vcpus
> >>> > >already see the new routing table after the rcu_assign_pointer that is
> >>> > >in kvm_irq_routing_update").
> >> >
> >> > With synchronize_rcu(), you have the additional guarantee that any
> >> > parallel accesses to the old routing table have completed. Since we
> >> > also trigger the irq from rcu context, you know that after
> >> > synchronize_rcu() you won't get any interrupts to the old
> >> > destination (see kvm_set_irq_inatomic()).
> > We do not have this guaranty for other vcpus that do not call
> > synchronize_rcu(). They may still use outdated routing table while a vcpu
> > or iothread that performed table update sits in synchronize_rcu().
>
> Avi's point is that, after the VCPU resumes execution, you know that no
> interrupt will be sent to the old destination because
> kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
> also called within the RCU read-side critical section.
>
> Without synchronize_rcu you could have
>
> VCPU writes to routing table
> e = entry from IRQ routing table
> kvm_irq_routing_update(kvm, new);
> VCPU resumes execution
> kvm_set_msi_irq(e, &irq);
> kvm_irq_delivery_to_apic_fast();
>
> where the entry is stale but the VCPU has already resumed execution.
>
So how is it different from what we have now:
disable_irq()
VCPU writes to routing table
e = entry from IRQ routing table
kvm_set_msi_irq(e, &irq);
kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:14 ` Gleb Natapov
@ 2013-11-26 16:24 ` Michael S. Tsirkin
2013-11-30 2:46 ` Zhanghaoyu (A)
0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2013-11-26 16:24 UTC (permalink / raw)
To: Gleb Natapov
Cc: Huangweidong (C), KVM, Jinxin (F), Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Paolo Bonzini
On Tue, Nov 26, 2013 at 06:14:27PM +0200, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 06:05:37PM +0200, Michael S. Tsirkin wrote:
> > On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote:
> > > On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote:
> > > > Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
> > > > > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor to update the irq routing table,
> > > > > in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is blocked for so much time to wait RCU grace period, and during this period, this vcpu cannot provide service to VM,
> > > > > so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced too.
> > > > > It's unacceptable in some real-time scenario, e.g. telecom.
> > > > >
> > > > > So, I want to create a single workqueue for each VM, to asynchronously performing the RCU synchronization for irq routing table,
> > > > > and let the vcpu thread return and VMENTRY to service VM immediately, no more need to blocked to wait RCU grace period.
> > > > > And, I have implemented a raw patch, took a test in our telecom environment, above problem disappeared.
> > > >
> > > > I don't think a workqueue is even needed. You just need to use call_rcu
> > > > to free "old" after releasing kvm->irq_lock.
> > > >
> > > > What do you think?
> > > >
> > > It should be rate limited somehow. Since it guest triggarable guest may cause
> > > host to allocate a lot of memory this way.
> >
> > The checks in __call_rcu(), should handle this I think. These keep a per-CPU
> > counter, which can be adjusted via rcutree.blimit, which defaults
> > to taking evasive action if more than 10K callbacks are waiting on a
> > given CPU.
> >
> >
> Documentation/RCU/checklist.txt has:
>
> An especially important property of the synchronize_rcu()
> primitive is that it automatically self-limits: if grace periods
> are delayed for whatever reason, then the synchronize_rcu()
> primitive will correspondingly delay updates. In contrast,
> code using call_rcu() should explicitly limit update rate in
> cases where grace periods are delayed, as failing to do so can
> result in excessive realtime latencies or even OOM conditions.
I just asked Paul what this means.
> --
> Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:24 ` Gleb Natapov
@ 2013-11-26 16:27 ` Avi Kivity
2013-11-26 16:38 ` Gleb Natapov
2013-11-26 16:28 ` Paolo Bonzini
1 sibling, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 16:27 UTC (permalink / raw)
To: Gleb Natapov, Paolo Bonzini
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Jinxin (F)
On 11/26/2013 06:24 PM, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:
>> Il 26/11/2013 16:03, Gleb Natapov ha scritto:
>>>>>>>>> I understood the proposal was also to eliminate the synchronize_rcu(),
>>>>>>>>> so while new interrupts would see the new routing table, interrupts
>>>>>>>>> already in flight could pick up the old one.
>>>>>>> Isn't that always the case with RCU? (See my answer above: "the vcpus
>>>>>>> already see the new routing table after the rcu_assign_pointer that is
>>>>>>> in kvm_irq_routing_update").
>>>>> With synchronize_rcu(), you have the additional guarantee that any
>>>>> parallel accesses to the old routing table have completed. Since we
>>>>> also trigger the irq from rcu context, you know that after
>>>>> synchronize_rcu() you won't get any interrupts to the old
>>>>> destination (see kvm_set_irq_inatomic()).
>>> We do not have this guaranty for other vcpus that do not call
>>> synchronize_rcu(). They may still use outdated routing table while a vcpu
>>> or iothread that performed table update sits in synchronize_rcu().
>> Avi's point is that, after the VCPU resumes execution, you know that no
>> interrupt will be sent to the old destination because
>> kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
>> also called within the RCU read-side critical section.
>>
>> Without synchronize_rcu you could have
>>
>> VCPU writes to routing table
>> e = entry from IRQ routing table
>> kvm_irq_routing_update(kvm, new);
>> VCPU resumes execution
>> kvm_set_msi_irq(e, &irq);
>> kvm_irq_delivery_to_apic_fast();
>>
>> where the entry is stale but the VCPU has already resumed execution.
>>
> So how is it different from what we have now:
>
> disable_irq()
> VCPU writes to routing table
> e = entry from IRQ routing table
> kvm_set_msi_irq(e, &irq);
> kvm_irq_delivery_to_apic_fast();
> kvm_irq_routing_update(kvm, new);
> synchronize_rcu()
> VCPU resumes execution
> enable_irq()
> receive stale irq
>
>
>
Suppose the guest did not disable_irq() and enable_irq(), but instead
had a pci read where you have the enable_irq(). After the read you
cannot have a stale irq (assuming the read flushes the irq all the way
to the APIC).
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:24 ` Gleb Natapov
2013-11-26 16:27 ` Avi Kivity
@ 2013-11-26 16:28 ` Paolo Bonzini
2013-11-26 16:29 ` Avi Kivity
2013-11-26 16:37 ` Gleb Natapov
1 sibling, 2 replies; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 16:28 UTC (permalink / raw)
To: Gleb Natapov
Cc: Avi Kivity, Huangweidong (C), KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
Il 26/11/2013 17:24, Gleb Natapov ha scritto:
>> VCPU writes to routing table
>> e = entry from IRQ routing table
>> kvm_irq_routing_update(kvm, new);
>> VCPU resumes execution
>> kvm_set_msi_irq(e, &irq);
>> kvm_irq_delivery_to_apic_fast();
>>
>> where the entry is stale but the VCPU has already resumed execution.
>
> So how is it different from what we have now:
>
> disable_irq()
> VCPU writes to routing table
> e = entry from IRQ routing table
> kvm_set_msi_irq(e, &irq);
> kvm_irq_delivery_to_apic_fast();
> kvm_irq_routing_update(kvm, new);
> synchronize_rcu()
> VCPU resumes execution
> enable_irq()
> receive stale irq
Adding a "disable/enable IRQs" looks like a relatively big change. But
perhaps it's not for some reason I'm missing.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:28 ` Paolo Bonzini
@ 2013-11-26 16:29 ` Avi Kivity
2013-11-26 16:37 ` Gleb Natapov
1 sibling, 0 replies; 60+ messages in thread
From: Avi Kivity @ 2013-11-26 16:29 UTC (permalink / raw)
To: Paolo Bonzini, Gleb Natapov
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Jinxin (F)
On 11/26/2013 06:28 PM, Paolo Bonzini wrote:
> Il 26/11/2013 17:24, Gleb Natapov ha scritto:
>>> VCPU writes to routing table
>>> e = entry from IRQ routing table
>>> kvm_irq_routing_update(kvm, new);
>>> VCPU resumes execution
>>> kvm_set_msi_irq(e, &irq);
>>> kvm_irq_delivery_to_apic_fast();
>>>
>>> where the entry is stale but the VCPU has already resumed execution.
>> So how is it different from what we have now:
>>
>> disable_irq()
>> VCPU writes to routing table
>> e = entry from IRQ routing table
>> kvm_set_msi_irq(e, &irq);
>> kvm_irq_delivery_to_apic_fast();
>> kvm_irq_routing_update(kvm, new);
>> synchronize_rcu()
>> VCPU resumes execution
>> enable_irq()
>> receive stale irq
> Adding a "disable/enable IRQs" looks like a relatively big change. But
> perhaps it's not for some reason I'm missing.
>
Those are guest operations, which may not be there at all.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:28 ` Paolo Bonzini
2013-11-26 16:29 ` Avi Kivity
@ 2013-11-26 16:37 ` Gleb Natapov
1 sibling, 0 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-26 16:37 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Avi Kivity, Huangweidong (C), KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On Tue, Nov 26, 2013 at 05:28:23PM +0100, Paolo Bonzini wrote:
> Il 26/11/2013 17:24, Gleb Natapov ha scritto:
> >> VCPU writes to routing table
> >> e = entry from IRQ routing table
> >> kvm_irq_routing_update(kvm, new);
> >> VCPU resumes execution
> >> kvm_set_msi_irq(e, &irq);
> >> kvm_irq_delivery_to_apic_fast();
> >>
> >> where the entry is stale but the VCPU has already resumed execution.
> >
> > So how is it different from what we have now:
> >
> > disable_irq()
> > VCPU writes to routing table
> > e = entry from IRQ routing table
> > kvm_set_msi_irq(e, &irq);
> > kvm_irq_delivery_to_apic_fast();
> > kvm_irq_routing_update(kvm, new);
> > synchronize_rcu()
> > VCPU resumes execution
> > enable_irq()
> > receive stale irq
>
> Adding a "disable/enable IRQs" looks like a relatively big change. But
> perhaps it's not for some reason I'm missing.
>
You will receive a stale irq even without disable/enable IRQs, of course. I
put them there so that the guest would have a chance to do stupid things like
zeroing the idt before receiving the interrupt, but on real HW the timing is
different from what we emulate, so the same race may happen even without
disable/enable IRQs.
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:27 ` Avi Kivity
@ 2013-11-26 16:38 ` Gleb Natapov
0 siblings, 0 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-26 16:38 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Paolo Bonzini, Jinxin (F)
On Tue, Nov 26, 2013 at 06:27:47PM +0200, Avi Kivity wrote:
> On 11/26/2013 06:24 PM, Gleb Natapov wrote:
> >On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:
> >>Il 26/11/2013 16:03, Gleb Natapov ha scritto:
> >>>>>>>>>I understood the proposal was also to eliminate the synchronize_rcu(),
> >>>>>>>>>so while new interrupts would see the new routing table, interrupts
> >>>>>>>>>already in flight could pick up the old one.
> >>>>>>>Isn't that always the case with RCU? (See my answer above: "the vcpus
> >>>>>>>already see the new routing table after the rcu_assign_pointer that is
> >>>>>>>in kvm_irq_routing_update").
> >>>>>With synchronize_rcu(), you have the additional guarantee that any
> >>>>>parallel accesses to the old routing table have completed. Since we
> >>>>>also trigger the irq from rcu context, you know that after
> >>>>>synchronize_rcu() you won't get any interrupts to the old
> >>>>>destination (see kvm_set_irq_inatomic()).
> >>>We do not have this guaranty for other vcpus that do not call
> >>>synchronize_rcu(). They may still use outdated routing table while a vcpu
> >>>or iothread that performed table update sits in synchronize_rcu().
> >>Avi's point is that, after the VCPU resumes execution, you know that no
> >>interrupt will be sent to the old destination because
> >>kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
> >>also called within the RCU read-side critical section.
> >>
> >>Without synchronize_rcu you could have
> >>
> >> VCPU writes to routing table
> >> e = entry from IRQ routing table
> >> kvm_irq_routing_update(kvm, new);
> >> VCPU resumes execution
> >> kvm_set_msi_irq(e, &irq);
> >> kvm_irq_delivery_to_apic_fast();
> >>
> >>where the entry is stale but the VCPU has already resumed execution.
> >>
> >So how is it different from what we have now:
> >
> >disable_irq()
> >VCPU writes to routing table
> > e = entry from IRQ routing table
> > kvm_set_msi_irq(e, &irq);
> > kvm_irq_delivery_to_apic_fast();
> >kvm_irq_routing_update(kvm, new);
> >synchronize_rcu()
> >VCPU resumes execution
> >enable_irq()
> >receive stale irq
> >
> >
>
> Suppose the guest did not disable_irq() and enable_irq(), but
> instead had a pci read where you have the enable_irq(). After the
> read you cannot have a stale irq (assuming the read flushes the irq
> all the way to the APIC).
There may still be a race between the pci read and an MSI already registered
in the IRR. I do not believe such a read can undo IRR changes.
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:21 ` Avi Kivity
@ 2013-11-26 16:42 ` Paolo Bonzini
0 siblings, 0 replies; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-26 16:42 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), Gleb Natapov, KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
Il 26/11/2013 17:21, Avi Kivity ha scritto:
>>> It's indeed safe, but I think there's a nice win to be had if we
>>> drop the assumption.
>> I'm not arguing with that, but a minor comment below:
>>
>>>> (BTW, PCI memory writes are posted, but configuration writes are not).
>>> MSIs are configured via PCI memory writes.
>>>
>>> By itself, that doesn't buy us anything, since the guest could flush
>>> the write via a read. But I think the fact that the interrupt
>>> messages themselves are posted proves that it is safe.
>> FYI, PCI read flushes the interrupt itself in, too.
>
> I guess that kills the optimization then. Maybe you can do qrcu,
> whatever that is.
It's "srcu" (a separate SRCU instance specific to the irq routing
table), which I managed to misspell twice.
As it turns out, qrcu actually exists
(http://lwn.net/Articles/223752/) and has extremely fast grace periods,
but its read_lock/read_unlock are also more expensive. So it was probably
some kind of Freudian slip.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 13:13 ` Paolo Bonzini
@ 2013-11-28 3:46 ` Zhanghaoyu (A)
0 siblings, 0 replies; 60+ messages in thread
From: Zhanghaoyu (A) @ 2013-11-28 3:46 UTC (permalink / raw)
To: Paolo Bonzini, Gleb Natapov
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Jinxin (F), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong
>> > I don't think a workqueue is even needed. You just need to use
>> > call_rcu to free "old" after releasing kvm->irq_lock.
>> >
>> > What do you think?
>>
>> It should be rate limited somehow. Since it is guest triggerable, a guest
>> may cause the host to allocate a lot of memory this way.
>
Why does "use call_rcu to free "old" after releasing kvm->irq_lock" may cause host to allocate a lot of memory?
Do you mean that malicious guest's frequent irq-routing-table updating operations will result in too many delayed mem-free of old irq-routing-tables?
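(For reference, a minimal sketch of the call_rcu() shape being discussed. It assumes the routing table struct gains a struct rcu_head field and a hypothetical free_irq_routing_table() callback, neither of which exists in the current code. Each update defers one kfree() to the RCU callback machinery, which is where the concern about piling up memory comes from if a guest updates the table in a tight loop:)

    static void free_irq_routing_table(struct rcu_head *head)
    {
            struct kvm_irq_routing_table *rt =
                    container_of(head, struct kvm_irq_routing_table, rcu);
            kfree(rt);
    }

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);

    /* no synchronize_rcu(): the vcpu thread returns immediately and the
     * old table is freed later, after a grace period */
    call_rcu(&old->rcu, free_irq_routing_table);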
Thanks,
Zhang Haoyu
>True, though if I understand Zhanghaoyu's proposal a workqueue would be even worse.
>
>Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:20 ` Paolo Bonzini
2013-11-26 15:25 ` Avi Kivity
2013-11-26 16:24 ` Gleb Natapov
@ 2013-11-28 6:27 ` Zhanghaoyu (A)
2013-11-28 8:55 ` Paolo Bonzini
2 siblings, 1 reply; 60+ messages in thread
From: Zhanghaoyu (A) @ 2013-11-28 6:27 UTC (permalink / raw)
To: Paolo Bonzini, Gleb Natapov
Cc: Avi Kivity, Huangweidong (C), KVM, Michael S. Tsirkin, Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Jinxin (F)
>>>>> > >>I understood the proposal was also to eliminate the
>>>>> > >>synchronize_rcu(), so while new interrupts would see the new
>>>>> > >>routing table, interrupts already in flight could pick up the old one.
>>>> > >Isn't that always the case with RCU? (See my answer above: "the
>>>> > >vcpus already see the new routing table after the
>>>> > >rcu_assign_pointer that is in kvm_irq_routing_update").
>>> >
>>> > With synchronize_rcu(), you have the additional guarantee that any
>>> > parallel accesses to the old routing table have completed. Since
>>> > we also trigger the irq from rcu context, you know that after
>>> > synchronize_rcu() you won't get any interrupts to the old
>>> > destination (see kvm_set_irq_inatomic()).
>> We do not have this guaranty for other vcpus that do not call
>> synchronize_rcu(). They may still use outdated routing table while a
>> vcpu or iothread that performed table update sits in synchronize_rcu().
>
>Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section.
>
>Without synchronize_rcu you could have
>
> VCPU writes to routing table
> e = entry from IRQ routing table
> kvm_irq_routing_update(kvm, new);
> VCPU resumes execution
> kvm_set_msi_irq(e, &irq);
> kvm_irq_delivery_to_apic_fast();
>
>where the entry is stale but the VCPU has already resumed execution.
>
If we use call_rcu() (setting aside for now the problem that Gleb pointed out) instead of synchronize_rcu(), do we still need to ensure this?
Thanks,
Zhang Haoyu
>If we want to ensure, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare.
>
>Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 6:27 ` Zhanghaoyu (A)
@ 2013-11-28 8:55 ` Paolo Bonzini
2013-11-28 9:19 ` Gleb Natapov
0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-28 8:55 UTC (permalink / raw)
To: Zhanghaoyu (A)
Cc: Avi Kivity, Huangweidong (C), KVM, Gleb Natapov,
Michael S. Tsirkin, Luonengjun, qemu-devel@nongnu.org,
Zanghongyong, Avi Kivity, Jinxin (F)
Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
>> >Without synchronize_rcu you could have
>> >
>> > VCPU writes to routing table
>> > e = entry from IRQ routing table
>> > kvm_irq_routing_update(kvm, new);
>> > VCPU resumes execution
>> > kvm_set_msi_irq(e, &irq);
>> > kvm_irq_delivery_to_apic_fast();
>> >
>> >where the entry is stale but the VCPU has already resumed execution.
>> >
> If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
The problem is that we should ensure this, so using call_rcu is not
possible (even without considering the memory allocation problem).
Can you try using SRCU and synchronize_srcu?
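(For illustration, a minimal sketch of that SRCU variant. It assumes a dedicated srcu_struct embedded in struct kvm, called irq_srcu here, which is a hypothetical field name that would be initialized with init_srcu_struct() at VM creation:)

    /* writer, replacing synchronize_rcu() in kvm_set_irq_routing(): */
    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);

    synchronize_srcu_expedited(&kvm->irq_srcu);
    kfree(old);

    /* reader, e.g. around the table lookup in kvm_set_irq_inatomic(): */
    idx = srcu_read_lock(&kvm->irq_srcu);
    irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
    /* ... resolve the entry and deliver using irq_rt ... */
    srcu_read_unlock(&kvm->irq_srcu, idx);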
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 15:24 ` Avi Kivity
@ 2013-11-28 9:14 ` Zhanghaoyu (A)
2013-11-28 9:20 ` Gleb Natapov
0 siblings, 1 reply; 60+ messages in thread
From: Zhanghaoyu (A) @ 2013-11-28 9:14 UTC (permalink / raw)
To: Avi Kivity, Gleb Natapov
Cc: Huangweidong (C), KVM, Michael S. Tsirkin, Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Paolo Bonzini,
Jinxin (F)
>>>>> No, this would be exactly the same code that is running now:
>>>>>
>>>>> mutex_lock(&kvm->irq_lock);
>>>>> old = kvm->irq_routing;
>>>>> kvm_irq_routing_update(kvm, new);
>>>>> mutex_unlock(&kvm->irq_lock);
>>>>>
>>>>> synchronize_rcu();
>>>>> kfree(old);
>>>>> return 0;
>>>>>
>>>>> Except that the kfree would run in the call_rcu kernel thread instead of
>>>>> the vcpu thread. But the vcpus already see the new routing table after
>>>>> the rcu_assign_pointer that is in kvm_irq_routing_update.
>>>>>
>>>>> I understood the proposal was also to eliminate the
>>>>> synchronize_rcu(), so while new interrupts would see the new
>>>>> routing table, interrupts already in flight could pick up the old one.
>>>> Isn't that always the case with RCU? (See my answer above: "the
>>>> vcpus already see the new routing table after the rcu_assign_pointer
>>>> that is in kvm_irq_routing_update").
>>> With synchronize_rcu(), you have the additional guarantee that any
>>> parallel accesses to the old routing table have completed. Since we
>>> also trigger the irq from rcu context, you know that after
>>> synchronize_rcu() you won't get any interrupts to the old destination
>>> (see kvm_set_irq_inatomic()).
>> We do not have this guaranty for other vcpus that do not call
>> synchronize_rcu(). They may still use outdated routing table while a
>> vcpu or iothread that performed table update sits in synchronize_rcu().
>>
>
>Consider this guest code:
>
> write msi entry, directing the interrupt away from this vcpu
> nop
> memset(&idt, 0, sizeof(idt));
>
>Currently, this code will never trigger a triple fault. With the change to call_rcu(), it may.
>
>Now it may be that the guest does not expect this to work (PCI writes are posted; and interrupts can be delayed indefinitely by the pci fabric),
>but we don't know if there's a path that guarantees the guest something that we're taking away with this change.
In a native environment, if a CPU's LAPIC has many interrupts pending in its IRR and ISR, and the OS zeroes this CPU's IDT before those interrupts are delivered,
will the same problem happen?
Zhang Haoyu
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 8:55 ` Paolo Bonzini
@ 2013-11-28 9:19 ` Gleb Natapov
2013-11-28 9:29 ` Paolo Bonzini
2013-11-28 9:49 ` Avi Kivity
0 siblings, 2 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 9:19 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Avi Kivity, Huangweidong (C), Gleb Natapov, KVM,
Michael S. Tsirkin, Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Jinxin (F)
On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >> >Without synchronize_rcu you could have
> >> >
> >> > VCPU writes to routing table
> >> > e = entry from IRQ routing table
> >> > kvm_irq_routing_update(kvm, new);
> >> > VCPU resumes execution
> >> > kvm_set_msi_irq(e, &irq);
> >> > kvm_irq_delivery_to_apic_fast();
> >> >
> >> >where the entry is stale but the VCPU has already resumed execution.
> >> >
> > If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
>
> The problem is that we should ensure this, so using call_rcu is not
> possible (even not considering the memory allocation problem).
>
Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.
Suppose guest does:
1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt
I am pretty certain that this code can get an interrupt after step 2 on real HW,
but I cannot tell whether the guest can rely on it being delivered exactly after
the read instruction or whether it can be delayed by a couple of instructions. It
seems to me it would be fragile for an OS to depend on this behaviour. AFAIK
Linux does not.
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 9:14 ` Zhanghaoyu (A)
@ 2013-11-28 9:20 ` Gleb Natapov
0 siblings, 0 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 9:20 UTC (permalink / raw)
To: Zhanghaoyu (A)
Cc: Avi Kivity, Huangweidong (C), KVM, Gleb Natapov,
Michael S. Tsirkin, Luonengjun, qemu-devel@nongnu.org,
Zanghongyong, Avi Kivity, Paolo Bonzini, Jinxin (F)
On Thu, Nov 28, 2013 at 09:14:22AM +0000, Zhanghaoyu (A) wrote:
> >>>>> No, this would be exactly the same code that is running now:
> >>>>>
> >>>>> mutex_lock(&kvm->irq_lock);
> >>>>> old = kvm->irq_routing;
> >>>>> kvm_irq_routing_update(kvm, new);
> >>>>> mutex_unlock(&kvm->irq_lock);
> >>>>>
> >>>>> synchronize_rcu();
> >>>>> kfree(old);
> >>>>> return 0;
> >>>>>
> >>>>> Except that the kfree would run in the call_rcu kernel thread instead of
> >>>>> the vcpu thread. But the vcpus already see the new routing table after
> >>>>> the rcu_assign_pointer that is in kvm_irq_routing_update.
> >>>>>
> >>>>> I understood the proposal was also to eliminate the
> >>>>> synchronize_rcu(), so while new interrupts would see the new
> >>>>> routing table, interrupts already in flight could pick up the old one.
> >>>> Isn't that always the case with RCU? (See my answer above: "the
> >>>> vcpus already see the new routing table after the rcu_assign_pointer
> >>>> that is in kvm_irq_routing_update").
> >>> With synchronize_rcu(), you have the additional guarantee that any
> >>> parallel accesses to the old routing table have completed. Since we
> >>> also trigger the irq from rcu context, you know that after
> >>> synchronize_rcu() you won't get any interrupts to the old destination
> >>> (see kvm_set_irq_inatomic()).
> >> We do not have this guaranty for other vcpus that do not call
> >> synchronize_rcu(). They may still use outdated routing table while a
> >> vcpu or iothread that performed table update sits in synchronize_rcu().
> >>
> >
> >Consider this guest code:
> >
> > write msi entry, directing the interrupt away from this vcpu
> > nop
> > memset(&idt, 0, sizeof(idt));
> >
> >Currently, this code will never trigger a triple fault. With the change to call_rcu(), it may.
> >
> >Now it may be that the guest does not expect this to work (PCI writes are posted; and interrupts can be delayed indefinitely by the pci fabric),
> >but we don't know if there's a path that guarantees the guest something that we're taking away with this change.
>
> In native environment, if a CPU's LAPIC's IRR and ISR have been pending many interrupts, then OS perform zeroing this CPU's IDT before receiving interrupts,
> will the same problem happen?
>
This is just an example. The OS could ensure that there is no other pending interrupt by some other means.
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 9:19 ` Gleb Natapov
@ 2013-11-28 9:29 ` Paolo Bonzini
2013-11-28 9:43 ` Gleb Natapov
2013-11-28 9:49 ` Avi Kivity
1 sibling, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-28 9:29 UTC (permalink / raw)
To: Gleb Natapov
Cc: Avi Kivity, Huangweidong (C), Gleb Natapov, KVM,
Michael S. Tsirkin, Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Jinxin (F)
Il 28/11/2013 10:19, Gleb Natapov ha scritto:
> Not changing current behaviour is certainly safer, but I am still not 100%
> convinced we have to ensure this.
>
> Suppose guest does:
>
> 1: change msi interrupt by writing to pci register
> 2: read the pci register to flush the write
> 3: zero idt
>
> I am pretty certain that this code can get interrupt after step 2 on real HW,
> but I cannot tell if guest can rely on it to be delivered exactly after
> read instruction or it can be delayed by couple of instructions.
I agree it's fragile, but if a dedicated SRCU can meet the requirements
(possibly with synchronize_srcu_expedited), I prefer not to break it.
Paolo
> Seems to me
> it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 9:29 ` Paolo Bonzini
@ 2013-11-28 9:43 ` Gleb Natapov
0 siblings, 0 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 9:43 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Avi Kivity, Huangweidong (C), KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On Thu, Nov 28, 2013 at 10:29:36AM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 10:19, Gleb Natapov ha scritto:
> > Not changing current behaviour is certainly safer, but I am still not 100%
> > convinced we have to ensure this.
> >
> > Suppose guest does:
> >
> > 1: change msi interrupt by writing to pci register
> > 2: read the pci register to flush the write
> > 3: zero idt
> >
> > I am pretty certain that this code can get interrupt after step 2 on real HW,
> > but I cannot tell if guest can rely on it to be delivered exactly after
> > read instruction or it can be delayed by couple of instructions.
>
> I agree it's fragile, but if a dedicated SRCU can meet the requirements
> (possibly with synchronize_srcu_expedited), I prefer not to break it.
>
Definitely. If we can find a reasonable solution that preserves the current
semantics, that's the preferable path to take.
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 9:19 ` Gleb Natapov
2013-11-28 9:29 ` Paolo Bonzini
@ 2013-11-28 9:49 ` Avi Kivity
2013-11-28 9:53 ` Paolo Bonzini
2013-11-28 10:11 ` Gleb Natapov
1 sibling, 2 replies; 60+ messages in thread
From: Avi Kivity @ 2013-11-28 9:49 UTC (permalink / raw)
To: Gleb Natapov, Paolo Bonzini
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
>> Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
>>>>> Without synchronize_rcu you could have
>>>>>
>>>>> VCPU writes to routing table
>>>>> e = entry from IRQ routing table
>>>>> kvm_irq_routing_update(kvm, new);
>>>>> VCPU resumes execution
>>>>> kvm_set_msi_irq(e, &irq);
>>>>> kvm_irq_delivery_to_apic_fast();
>>>>>
>>>>> where the entry is stale but the VCPU has already resumed execution.
>>>>>
>>> If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
>> The problem is that we should ensure this, so using call_rcu is not
>> possible (even not considering the memory allocation problem).
>>
>
> Not changing current behaviour is certainly safer, but I am still not 100%
> convinced we have to ensure this.
>
> Suppose guest does:
>
> 1: change msi interrupt by writing to pci register
> 2: read the pci register to flush the write
> 3: zero idt
>
> I am pretty certain that this code can get interrupt after step 2 on real HW,
> but I cannot tell if guest can rely on it to be delivered exactly after
> read instruction or it can be delayed by couple of instructions. Seems to me
> it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
>
Linux is safe, it does interrupt migration from within the interrupt
handler. If you do that before the device-specific EOI, you won't get
another interrupt until programming the MSI is complete.
Is virtio safe? IIRC it can post multiple interrupts without guest acks.
Using call_rcu() is a better solution than srcu IMO: fewer code changes,
consistently faster.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 9:49 ` Avi Kivity
@ 2013-11-28 9:53 ` Paolo Bonzini
2013-11-28 10:16 ` Avi Kivity
2013-11-28 10:11 ` Gleb Natapov
1 sibling, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-28 9:53 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), Gleb Natapov, KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Gleb Natapov, Jinxin (F)
Il 28/11/2013 10:49, Avi Kivity ha scritto:
> Linux is safe, it does interrupt migration from within the interrupt
> handler. If you do that before the device-specific EOI, you won't get
> another interrupt until programming the MSI is complete.
>
> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
>
> Using call_rcu() is a better solution than srcu IMO. Less code changes,
> consistently faster.
call_rcu() has the problem of rate limiting, too. It wasn't such a
great idea, I think.
The QRCU I linked would work great latency-wise (it has roughly the same
latency as an rwsem, but readers are lock-free). However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 9:49 ` Avi Kivity
2013-11-28 9:53 ` Paolo Bonzini
@ 2013-11-28 10:11 ` Gleb Natapov
2013-11-28 10:12 ` Avi Kivity
1 sibling, 1 reply; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 10:11 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Paolo Bonzini, Jinxin (F)
On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> >>Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >>>>>Without synchronize_rcu you could have
> >>>>>
> >>>>> VCPU writes to routing table
> >>>>> e = entry from IRQ routing table
> >>>>> kvm_irq_routing_update(kvm, new);
> >>>>> VCPU resumes execution
> >>>>> kvm_set_msi_irq(e, &irq);
> >>>>> kvm_irq_delivery_to_apic_fast();
> >>>>>
> >>>>>where the entry is stale but the VCPU has already resumed execution.
> >>>>>
> >>>If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
> >>The problem is that we should ensure this, so using call_rcu is not
> >>possible (even not considering the memory allocation problem).
> >>
> >Not changing current behaviour is certainly safer, but I am still not 100%
> >convinced we have to ensure this.
> >
> >Suppose guest does:
> >
> >1: change msi interrupt by writing to pci register
> >2: read the pci register to flush the write
> >3: zero idt
> >
> >I am pretty certain that this code can get interrupt after step 2 on real HW,
> >but I cannot tell if guest can rely on it to be delivered exactly after
> >read instruction or it can be delayed by couple of instructions. Seems to me
> >it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
> >
>
> Linux is safe, it does interrupt migration from within the interrupt
> handler. If you do that before the device-specific EOI, you won't
> get another interrupt until programming the MSI is complete.
>
> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
>
> Using call_rcu() is a better solution than srcu IMO. Less code
> changes, consistently faster.
Why not fix userspace to use KVM_SIGNAL_MSI instead?
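(For reference, a rough userspace sketch of that path. struct kvm_msi and the KVM_SIGNAL_MSI ioctl come from the KVM API, guarded by KVM_CAP_SIGNAL_MSI; vm_fd is the VM file descriptor, and the msi_addr_lo/msi_addr_hi/msi_data values are placeholders for whatever the emulated device's MSI address and data registers currently hold:)

    struct kvm_msi msi = {
            .address_lo = msi_addr_lo,  /* MSI address programmed by the guest */
            .address_hi = msi_addr_hi,
            .data       = msi_data,     /* MSI data programmed by the guest */
    };

    /* delivers the MSI directly, with no irq routing table entry, so
     * reprogramming the MSI does not trigger a routing table update
     * (and therefore no synchronize_rcu) */
    if (ioctl(vm_fd, KVM_SIGNAL_MSI, &msi) < 0)
            perror("KVM_SIGNAL_MSI");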
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 10:11 ` Gleb Natapov
@ 2013-11-28 10:12 ` Avi Kivity
2013-11-28 11:02 ` Gleb Natapov
0 siblings, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-28 10:12 UTC (permalink / raw)
To: Gleb Natapov
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Paolo Bonzini, Jinxin (F)
On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
>> On 11/28/2013 11:19 AM, Gleb Natapov wrote:
>>> On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
>>>> Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
>>>>>>> Without synchronize_rcu you could have
>>>>>>>
>>>>>>> VCPU writes to routing table
>>>>>>> e = entry from IRQ routing table
>>>>>>> kvm_irq_routing_update(kvm, new);
>>>>>>> VCPU resumes execution
>>>>>>> kvm_set_msi_irq(e, &irq);
>>>>>>> kvm_irq_delivery_to_apic_fast();
>>>>>>>
>>>>>>> where the entry is stale but the VCPU has already resumed execution.
>>>>>>>
>>>>> If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
>>>> The problem is that we should ensure this, so using call_rcu is not
>>>> possible (even not considering the memory allocation problem).
>>>>
>>> Not changing current behaviour is certainly safer, but I am still not 100%
>>> convinced we have to ensure this.
>>>
>>> Suppose guest does:
>>>
>>> 1: change msi interrupt by writing to pci register
>>> 2: read the pci register to flush the write
>>> 3: zero idt
>>>
>>> I am pretty certain that this code can get interrupt after step 2 on real HW,
>>> but I cannot tell if guest can rely on it to be delivered exactly after
>>> read instruction or it can be delayed by couple of instructions. Seems to me
>>> it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
>>>
>> Linux is safe, it does interrupt migration from within the interrupt
>> handler. If you do that before the device-specific EOI, you won't
>> get another interrupt until programming the MSI is complete.
>>
>> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
>>
>> Using call_rcu() is a better solution than srcu IMO. Less code
>> changes, consistently faster.
> Why not fix userspace to use KVM_SIGNAL_MSI instead?
>
>
Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 9:53 ` Paolo Bonzini
@ 2013-11-28 10:16 ` Avi Kivity
2013-11-28 10:40 ` Paolo Bonzini
0 siblings, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-28 10:16 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Huangweidong (C), Gleb Natapov, KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Gleb Natapov, Jinxin (F)
On 11/28/2013 11:53 AM, Paolo Bonzini wrote:
> Il 28/11/2013 10:49, Avi Kivity ha scritto:
>> Linux is safe, it does interrupt migration from within the interrupt
>> handler. If you do that before the device-specific EOI, you won't get
>> another interrupt until programming the MSI is complete.
>>
>> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
>>
>> Using call_rcu() is a better solution than srcu IMO. Less code changes,
>> consistently faster.
> call_rcu() has the problem of rate limiting, too. It wasn't such a
> great idea, I think.
>
> The QRCU I linked would work great latency-wise (it has roughly the same
> latency of an rwsem but readers are lock-free). However, the locked
> operations in the read path would hurt because of cache misses, so it's
> not good either.
>
I guess srcu would work. Do you know what the typical write-side
latency is there? (In terms of what it waits for, not nanoseconds.)
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 10:16 ` Avi Kivity
@ 2013-11-28 10:40 ` Paolo Bonzini
2013-11-28 10:47 ` Avi Kivity
2013-11-28 11:09 ` Gleb Natapov
0 siblings, 2 replies; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-28 10:40 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), Gleb Natapov, KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Gleb Natapov, Jinxin (F)
Il 28/11/2013 11:16, Avi Kivity ha scritto:
>> The QRCU I linked would work great latency-wise (it has roughly the same
>> latency of an rwsem but readers are lock-free). However, the locked
>> operations in the read path would hurt because of cache misses, so it's
>> not good either.
>
> I guess srcu would work. Do you know what's the typical write-side
> latency there? (in terms of what it waits for, not nanoseconds).
If there's no concurrent reader, it's zero if I read the code right.
Otherwise it depends:
- if there are many callbacks, only 10 of them are processed per
millisecond. But unless there are concurrent synchronize_srcu calls
there should not be any callback at all. If all VCPUs were to furiously
change the MSIs, the latency could go up to #vcpu/10 milliseconds.
- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete. After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond. This should never happen unless
So, given the very restricted usage this SRCU would have, we probably
can expect synchronize_srcu_expedited to finish its job in the
busy-looping phase, and 120 us should be the expected maximum
latency---more likely to be an order of magnitude smaller, and in very
rare cases higher.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 10:40 ` Paolo Bonzini
@ 2013-11-28 10:47 ` Avi Kivity
2013-11-28 11:09 ` Gleb Natapov
1 sibling, 0 replies; 60+ messages in thread
From: Avi Kivity @ 2013-11-28 10:47 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Huangweidong (C), Gleb Natapov, KVM, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Gleb Natapov, Jinxin (F)
On 11/28/2013 12:40 PM, Paolo Bonzini wrote:
> Il 28/11/2013 11:16, Avi Kivity ha scritto:
>>> The QRCU I linked would work great latency-wise (it has roughly the same
>>> latency of an rwsem but readers are lock-free). However, the locked
>>> operations in the read path would hurt because of cache misses, so it's
>>> not good either.
>> I guess srcu would work. Do you know what's the typical write-side
>> latency there? (in terms of what it waits for, not nanoseconds).
> If there's no concurrent reader, it's zero if I read the code right.
> Otherwise it depends:
>
> - if there are many callbacks, only 10 of them are processed per
> millisecond. But unless there are concurrent synchronize_srcu calls
> there should not be any callback at all. If all VCPUs were to furiously
> change the MSIs, the latency could go up to #vcpu/10 milliseconds.
>
> - if there are no callbacks, but there are readers, synchronize_srcu
> busy-loops for some time checking if the readers complete. After a
> while (20 us for synchronize_srcu, 120 us for
> synchronize_srcu_expedited) it gives up and starts using a workqueue to
> poll every millisecond. This should never happen unless
>
> So, given the very restricted usage this SRCU would have, we probably
> can expect synchronize_srcu_expedited to finish its job in the
> busy-looping phase, and 120 us should be the expected maximum
> latency---more likely to be an order of magnitude smaller, and in very
> rare cases higher.
>
Looks like it's a good fit then.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 10:12 ` Avi Kivity
@ 2013-11-28 11:02 ` Gleb Natapov
2013-11-28 11:18 ` Avi Kivity
0 siblings, 1 reply; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 11:02 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Paolo Bonzini, Jinxin (F)
On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> >>On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >>>On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> >>>>Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >>>>>>>Without synchronize_rcu you could have
> >>>>>>>
> >>>>>>> VCPU writes to routing table
> >>>>>>> e = entry from IRQ routing table
> >>>>>>> kvm_irq_routing_update(kvm, new);
> >>>>>>> VCPU resumes execution
> >>>>>>> kvm_set_msi_irq(e, &irq);
> >>>>>>> kvm_irq_delivery_to_apic_fast();
> >>>>>>>
> >>>>>>>where the entry is stale but the VCPU has already resumed execution.
> >>>>>>>
> >>>>>If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
> >>>>The problem is that we should ensure this, so using call_rcu is not
> >>>>possible (even not considering the memory allocation problem).
> >>>>
> >>>Not changing current behaviour is certainly safer, but I am still not 100%
> >>>convinced we have to ensure this.
> >>>
> >>>Suppose guest does:
> >>>
> >>>1: change msi interrupt by writing to pci register
> >>>2: read the pci register to flush the write
> >>>3: zero idt
> >>>
> >>>I am pretty certain that this code can get interrupt after step 2 on real HW,
> >>>but I cannot tell if guest can rely on it to be delivered exactly after
> >>>read instruction or it can be delayed by couple of instructions. Seems to me
> >>>it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
> >>>
> >>Linux is safe, it does interrupt migration from within the interrupt
> >>handler. If you do that before the device-specific EOI, you won't
> >>get another interrupt until programming the MSI is complete.
> >>
> >>Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> >>
> >>Using call_rcu() is a better solution than srcu IMO. Less code
> >>changes, consistently faster.
> >Why not fix userspace to use KVM_SIGNAL_MSI instead?
> >
> >
>
> Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
Zhanghaoyu said that the problem mostly hurts in a real-time telecom
environment, so I am proposing how he can fix the problem in his specific
environment. It will not fix older userspace obviously, but a kernel
fix will also require a kernel update, and usually updating userspace
is easier.
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 10:40 ` Paolo Bonzini
2013-11-28 10:47 ` Avi Kivity
@ 2013-11-28 11:09 ` Gleb Natapov
2013-11-28 11:10 ` Paolo Bonzini
1 sibling, 1 reply; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 11:09 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Avi Kivity, Huangweidong (C), Gleb Natapov, KVM,
Michael S. Tsirkin, Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Jinxin (F)
On Thu, Nov 28, 2013 at 11:40:06AM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 11:16, Avi Kivity ha scritto:
> >> The QRCU I linked would work great latency-wise (it has roughly the same
> >> latency of an rwsem but readers are lock-free). However, the locked
> >> operations in the read path would hurt because of cache misses, so it's
> >> not good either.
> >
> > I guess srcu would work. Do you know what's the typical write-side
> > latency there? (in terms of what it waits for, not nanoseconds).
>
> If there's no concurrent reader, it's zero if I read the code right.
> Otherwise it depends:
>
> - if there are many callbacks, only 10 of them are processed per
> millisecond. But unless there are concurrent synchronize_srcu calls
> there should not be any callback at all. If all VCPUs were to furiously
> change the MSIs, the latency could go up to #vcpu/10 milliseconds.
>
> - if there are no callbacks, but there are readers, synchronize_srcu
> busy-loops for some time checking if the readers complete. After a
> while (20 us for synchronize_srcu, 120 us for
> synchronize_srcu_expedited) it gives up and starts using a workqueue to
> poll every millisecond. This should never happen unless
>
Unless what? :) Unless the reader is scheduled out?
> So, given the very restricted usage this SRCU would have, we probably
> can expect synchronize_srcu_expedited to finish its job in the
> busy-looping phase, and 120 us should be the expected maximum
> latency---more likely to be an order of magnitude smaller, and in very
> rare cases higher.
>
> Paolo
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:09 ` Gleb Natapov
@ 2013-11-28 11:10 ` Paolo Bonzini
2013-11-28 11:16 ` Avi Kivity
2013-11-28 11:23 ` Gleb Natapov
0 siblings, 2 replies; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-28 11:10 UTC (permalink / raw)
To: Gleb Natapov
Cc: Avi Kivity, Huangweidong (C), Gleb Natapov, KVM,
Michael S. Tsirkin, Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Jinxin (F)
Il 28/11/2013 12:09, Gleb Natapov ha scritto:
> > - if there are no callbacks, but there are readers, synchronize_srcu
> > busy-loops for some time checking if the readers complete. After a
> > while (20 us for synchronize_srcu, 120 us for
> > synchronize_srcu_expedited) it gives up and starts using a workqueue to
> > poll every millisecond. This should never happen unless
>
> Unless what ? :) Unless reader is scheduled out?
Yes. Or unless my brain is scheduled out in the middle of a sentence.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:10 ` Paolo Bonzini
@ 2013-11-28 11:16 ` Avi Kivity
2013-11-28 11:23 ` Gleb Natapov
1 sibling, 0 replies; 60+ messages in thread
From: Avi Kivity @ 2013-11-28 11:16 UTC (permalink / raw)
To: Paolo Bonzini, Gleb Natapov
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On 11/28/2013 01:10 PM, Paolo Bonzini wrote:
> Il 28/11/2013 12:09, Gleb Natapov ha scritto:
>>> - if there are no callbacks, but there are readers, synchronize_srcu
>>> busy-loops for some time checking if the readers complete. After a
>>> while (20 us for synchronize_srcu, 120 us for
>>> synchronize_srcu_expedited) it gives up and starts using a workqueue to
>>> poll every millisecond. This should never happen unless
>> Unless what ? :) Unless reader is scheduled out?
> Yes. Or unless my brain is scheduled out in the middle of a sentence.
>
You need a grace period. Or just sleep().
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:02 ` Gleb Natapov
@ 2013-11-28 11:18 ` Avi Kivity
2013-11-28 11:22 ` Gleb Natapov
0 siblings, 1 reply; 60+ messages in thread
From: Avi Kivity @ 2013-11-28 11:18 UTC (permalink / raw)
To: Gleb Natapov
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Paolo Bonzini, Jinxin (F)
On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
>> On 11/28/2013 12:11 PM, Gleb Natapov wrote:
>>> On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
>>>> On 11/28/2013 11:19 AM, Gleb Natapov wrote:
>>>>> On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
>>>>>> Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
>>>>>>>>> Without synchronize_rcu you could have
>>>>>>>>>
>>>>>>>>> VCPU writes to routing table
>>>>>>>>> e = entry from IRQ routing table
>>>>>>>>> kvm_irq_routing_update(kvm, new);
>>>>>>>>> VCPU resumes execution
>>>>>>>>> kvm_set_msi_irq(e, &irq);
>>>>>>>>> kvm_irq_delivery_to_apic_fast();
>>>>>>>>>
>>>>>>>>> where the entry is stale but the VCPU has already resumed execution.
>>>>>>>>>
>>>>>>> If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
>>>>>> The problem is that we should ensure this, so using call_rcu is not
>>>>>> possible (even not considering the memory allocation problem).
>>>>>>
>>>>> Not changing current behaviour is certainly safer, but I am still not 100%
>>>>> convinced we have to ensure this.
>>>>>
>>>>> Suppose guest does:
>>>>>
>>>>> 1: change msi interrupt by writing to pci register
>>>>> 2: read the pci register to flush the write
>>>>> 3: zero idt
>>>>>
>>>>> I am pretty certain that this code can get interrupt after step 2 on real HW,
>>>>> but I cannot tell if guest can rely on it to be delivered exactly after
>>>>> read instruction or it can be delayed by couple of instructions. Seems to me
>>>>> it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
>>>>>
>>>> Linux is safe, it does interrupt migration from within the interrupt
>>>> handler. If you do that before the device-specific EOI, you won't
>>>> get another interrupt until programming the MSI is complete.
>>>>
>>>> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
>>>>
>>>> Using call_rcu() is a better solution than srcu IMO. Less code
>>>> changes, consistently faster.
>>> Why not fix userspace to use KVM_SIGNAL_MSI instead?
>>>
>>>
>> Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
> Zhanghaoyu said that the problem mostly hurts in real-time telecom
> environment, so I propose how he can fix the problem in his specific
> environment. It will not fix older userspace obviously, but kernel
> fix will also require kernel update and usually updating userspace
> is easier.
>
>
Isn't the latency due to interrupt migration causing long
synchronize_rcu()s? How does KVM_SIGNAL_MSI help?
The problem occurs with assigned devices too AFAICT.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:18 ` Avi Kivity
@ 2013-11-28 11:22 ` Gleb Natapov
2013-11-28 11:30 ` Avi Kivity
2013-11-28 11:33 ` Michael S. Tsirkin
0 siblings, 2 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 11:22 UTC (permalink / raw)
To: Avi Kivity
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Paolo Bonzini, Jinxin (F)
On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
> On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> >>On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> >>>On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> >>>>On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >>>>>On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> >>>>>>Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >>>>>>>>>Without synchronize_rcu you could have
> >>>>>>>>>
> >>>>>>>>> VCPU writes to routing table
> >>>>>>>>> e = entry from IRQ routing table
> >>>>>>>>> kvm_irq_routing_update(kvm, new);
> >>>>>>>>> VCPU resumes execution
> >>>>>>>>> kvm_set_msi_irq(e, &irq);
> >>>>>>>>> kvm_irq_delivery_to_apic_fast();
> >>>>>>>>>
> >>>>>>>>>where the entry is stale but the VCPU has already resumed execution.
> >>>>>>>>>
> >>>>>>>If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
> >>>>>>The problem is that we should ensure this, so using call_rcu is not
> >>>>>>possible (even not considering the memory allocation problem).
> >>>>>>
> >>>>>Not changing current behaviour is certainly safer, but I am still not 100%
> >>>>>convinced we have to ensure this.
> >>>>>
> >>>>>Suppose guest does:
> >>>>>
> >>>>>1: change msi interrupt by writing to pci register
> >>>>>2: read the pci register to flush the write
> >>>>>3: zero idt
> >>>>>
> >>>>>I am pretty certain that this code can get interrupt after step 2 on real HW,
> >>>>>but I cannot tell if guest can rely on it to be delivered exactly after
> >>>>>read instruction or it can be delayed by couple of instructions. Seems to me
> >>>>>it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
> >>>>>
> >>>>Linux is safe, it does interrupt migration from within the interrupt
> >>>>handler. If you do that before the device-specific EOI, you won't
> >>>>get another interrupt until programming the MSI is complete.
> >>>>
> >>>>Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> >>>>
> >>>>Using call_rcu() is a better solution than srcu IMO. Less code
> >>>>changes, consistently faster.
> >>>Why not fix userspace to use KVM_SIGNAL_MSI instead?
> >>>
> >>>
> >>Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
> >Zhanghaoyu said that the problem mostly hurts in real-time telecom
> >environment, so I propose how he can fix the problem in his specific
> >environment. It will not fix older userspace obviously, but kernel
> >fix will also require kernel update and usually updating userspace
> >is easier.
> >
> >
>
> Isn't the latency due to interrupt migration causing long
> synchronize_rcu()s? How does KVM_SIGNAL_MSI help?
>
If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in
the irq routing table, changing the MSI configuration should not cause an
update to the irq routing table (not saying this is what happens with current
QEMU, but theoretically there is no reason to update the routing table in this
case).
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:10 ` Paolo Bonzini
2013-11-28 11:16 ` Avi Kivity
@ 2013-11-28 11:23 ` Gleb Natapov
2013-11-28 11:31 ` Paolo Bonzini
1 sibling, 1 reply; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 11:23 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Avi Kivity, Huangweidong (C), Gleb Natapov, KVM,
Michael S. Tsirkin, Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Jinxin (F)
On Thu, Nov 28, 2013 at 12:10:40PM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 12:09, Gleb Natapov ha scritto:
> > > - if there are no callbacks, but there are readers, synchronize_srcu
> > > busy-loops for some time checking if the readers complete. After a
> > > while (20 us for synchronize_srcu, 120 us for
> > > synchronize_srcu_expedited) it gives up and starts using a workqueue to
> > > poll every millisecond. This should never happen unless
> >
> > Unless what ? :) Unless reader is scheduled out?
>
> Yes. Or unless my brain is scheduled out in the middle of a sentence.
>
So we will have to disable preemption in a reader to prevent big latencies for
a writer, no?
--
Gleb.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:22 ` Gleb Natapov
@ 2013-11-28 11:30 ` Avi Kivity
2013-11-28 11:33 ` Michael S. Tsirkin
1 sibling, 0 replies; 60+ messages in thread
From: Avi Kivity @ 2013-11-28 11:30 UTC (permalink / raw)
To: Gleb Natapov
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Paolo Bonzini, Jinxin (F)
On 11/28/2013 01:22 PM, Gleb Natapov wrote:
> On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
>> On 11/28/2013 01:02 PM, Gleb Natapov wrote:
>>> On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
>>>> On 11/28/2013 12:11 PM, Gleb Natapov wrote:
>>>>> On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
>>>>>> On 11/28/2013 11:19 AM, Gleb Natapov wrote:
>>>>>>> On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
>>>>>>>> Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
>>>>>>>>>>> Without synchronize_rcu you could have
>>>>>>>>>>>
>>>>>>>>>>> VCPU writes to routing table
>>>>>>>>>>> e = entry from IRQ routing table
>>>>>>>>>>> kvm_irq_routing_update(kvm, new);
>>>>>>>>>>> VCPU resumes execution
>>>>>>>>>>> kvm_set_msi_irq(e, &irq);
>>>>>>>>>>> kvm_irq_delivery_to_apic_fast();
>>>>>>>>>>>
>>>>>>>>>>> where the entry is stale but the VCPU has already resumed execution.
>>>>>>>>>>>
>>>>>>>>> If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
>>>>>>>> The problem is that we should ensure this, so using call_rcu is not
>>>>>>>> possible (even not considering the memory allocation problem).
>>>>>>>>
>>>>>>> Not changing current behaviour is certainly safer, but I am still not 100%
>>>>>>> convinced we have to ensure this.
>>>>>>>
>>>>>>> Suppose guest does:
>>>>>>>
>>>>>>> 1: change msi interrupt by writing to pci register
>>>>>>> 2: read the pci register to flush the write
>>>>>>> 3: zero idt
>>>>>>>
>>>>>>> I am pretty certain that this code can get interrupt after step 2 on real HW,
>>>>>>> but I cannot tell if guest can rely on it to be delivered exactly after
>>>>>>> read instruction or it can be delayed by couple of instructions. Seems to me
>>>>>>> it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
>>>>>>>
>>>>>> Linux is safe, it does interrupt migration from within the interrupt
>>>>>> handler. If you do that before the device-specific EOI, you won't
>>>>>> get another interrupt until programming the MSI is complete.
>>>>>>
>>>>>> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
>>>>>>
>>>>>> Using call_rcu() is a better solution than srcu IMO. Less code
>>>>>> changes, consistently faster.
>>>>> Why not fix userspace to use KVM_SIGNAL_MSI instead?
>>>>>
>>>>>
>>>> Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
>>> Zhanghaoyu said that the problem mostly hurts in real-time telecom
>>> environment, so I propose how he can fix the problem in his specific
>>> environment. It will not fix older userspace obviously, but kernel
>>> fix will also require kernel update and usually updating userspace
>>> is easier.
>>>
>>>
>> Isn't the latency due to interrupt migration causing long
>> synchronize_rcu()s? How does KVM_SIGNAL_MSI help?
>>
> If MSI is delivered using KVM_SIGNAL_MSI as opposite to via an entry in
> irq routing table changing MSI configuration should not cause update to
> irq routing table (not saying this is what happens with current QEMU, but
> theoretically there is not reason to update routing table in this case).
I see. That pushes the problem to userspace, which uses traditional
locking, so the problem disappears until qemu starts using rcu too to
manage this.
There is also irqfd, however. We could also do a KVM_UPDATE_IRQFD to
change the payload it delivers, but that has exactly the same problems.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:23 ` Gleb Natapov
@ 2013-11-28 11:31 ` Paolo Bonzini
2013-11-28 11:33 ` Avi Kivity
0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2013-11-28 11:31 UTC (permalink / raw)
To: Gleb Natapov
Cc: Avi Kivity, Huangweidong (C), Gleb Natapov, KVM,
Michael S. Tsirkin, Zhanghaoyu (A), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Avi Kivity, Jinxin (F)
Il 28/11/2013 12:23, Gleb Natapov ha scritto:
>>> > > Unless what ? :) Unless reader is scheduled out?
>> >
>> > Yes. Or unless my brain is scheduled out in the middle of a sentence.
>> >
> So we will have to disable preemption in a reader to prevent big latencies for
> a writer, no?
I don't think that's necessary. The writer itself could also be
scheduled out, and the reader critical sections are really small. Let's
wait for Zhang to try SRCU and report back.
Paolo
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:31 ` Paolo Bonzini
@ 2013-11-28 11:33 ` Avi Kivity
0 siblings, 0 replies; 60+ messages in thread
From: Avi Kivity @ 2013-11-28 11:33 UTC (permalink / raw)
To: Paolo Bonzini, Gleb Natapov
Cc: Huangweidong (C), KVM, Gleb Natapov, Michael S. Tsirkin,
Zhanghaoyu (A), Luonengjun, qemu-devel@nongnu.org, Zanghongyong,
Avi Kivity, Jinxin (F)
On 11/28/2013 01:31 PM, Paolo Bonzini wrote:
> Il 28/11/2013 12:23, Gleb Natapov ha scritto:
>>>>>> Unless what ? :) Unless reader is scheduled out?
>>>> Yes. Or unless my brain is scheduled out in the middle of a sentence.
>>>>
>> So we will have to disable preemption in a reader to prevent big latencies for
>> a writer, no?
> I don't think that's necessary. The writer itself could also be
> scheduled out, and the reader critical sections are really small. Let's
> wait for Zhang to try SRCU and report back.
>
I think readers have preemption disabled already in that context (irqfd
writes).
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:33 ` Michael S. Tsirkin
@ 2013-11-28 11:33 ` Gleb Natapov
0 siblings, 0 replies; 60+ messages in thread
From: Gleb Natapov @ 2013-11-28 11:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Avi Kivity, Huangweidong (C), KVM, Gleb Natapov, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Paolo Bonzini, Jinxin (F)
On Thu, Nov 28, 2013 at 01:33:48PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 28, 2013 at 01:22:45PM +0200, Gleb Natapov wrote:
> > On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
> > > On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> > > >On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> > > >>On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> > > >>>On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> > > >>>>On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> > > >>>>>On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> > > >>>>>>Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> > > >>>>>>>>>Without synchronize_rcu you could have
> > > >>>>>>>>>
> > > >>>>>>>>> VCPU writes to routing table
> > > >>>>>>>>> e = entry from IRQ routing table
> > > >>>>>>>>> kvm_irq_routing_update(kvm, new);
> > > >>>>>>>>> VCPU resumes execution
> > > >>>>>>>>> kvm_set_msi_irq(e, &irq);
> > > >>>>>>>>> kvm_irq_delivery_to_apic_fast();
> > > >>>>>>>>>
> > > >>>>>>>>>where the entry is stale but the VCPU has already resumed execution.
> > > >>>>>>>>>
> > > >>>>>>>If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this?
> > > >>>>>>The problem is that we should ensure this, so using call_rcu is not
> > > >>>>>>possible (even not considering the memory allocation problem).
> > > >>>>>>
> > > >>>>>Not changing current behaviour is certainly safer, but I am still not 100%
> > > >>>>>convinced we have to ensure this.
> > > >>>>>
> > > >>>>>Suppose guest does:
> > > >>>>>
> > > >>>>>1: change msi interrupt by writing to pci register
> > > >>>>>2: read the pci register to flush the write
> > > >>>>>3: zero idt
> > > >>>>>
> > > >>>>>I am pretty certain that this code can get an interrupt after step 2 on real HW,
> > > >>>>>but I cannot tell whether the guest can rely on it being delivered exactly after
> > > >>>>>the read instruction, or whether it can be delayed by a couple of instructions. It seems to me
> > > >>>>>it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
> > > >>>>>
> > > >>>>Linux is safe, it does interrupt migration from within the interrupt
> > > >>>>handler. If you do that before the device-specific EOI, you won't
> > > >>>>get another interrupt until programming the MSI is complete.
> > > >>>>
> > > >>>>Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> > > >>>>
> > > >>>>Using call_rcu() is a better solution than srcu IMO. Less code
> > > >>>>changes, consistently faster.
> > > >>>Why not fix userspace to use KVM_SIGNAL_MSI instead?
> > > >>>
> > > >>>
> > > >>Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
> > > >Zhanghaoyu said that the problem mostly hurts in a real-time telecom
> > > >environment, so I proposed how he can fix the problem in his specific
> > > >environment. It will not fix older userspace obviously, but a kernel
> > > >fix will also require a kernel update, and usually updating userspace
> > > >is easier.
> > > >
> > > >
> > >
> > > Isn't the latency due to interrupt migration causing long
> > > synchronize_rcu()s? How does KVM_SIGNAL_MSI help?
> > >
> > If MSI is delivered using KVM_SIGNAL_MSI as opposed to via an entry in
> > the irq routing table, changing the MSI configuration should not cause an update to
> > the irq routing table (not saying this is what happens with current QEMU, but
> > theoretically there is no reason to update the routing table in this case).
> >
> > --
> > Gleb.
>
> Unfortunately all high performance users (vhost net,
> vhost scsi, virtio-blk data plane, vfio) switched to using
> eventfd.
>
Right :(
> KVM_SIGNAL_MSI is used as a simple mechanism to avoid routing
> table hassles e.g. for hotplug MSIs.
>
--
Gleb.
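
To make the call_rcu() idea concrete, here is a rough sketch (not a posted patch) of what deferring the free would look like; it assumes a struct rcu_head is added to struct kvm_irq_routing_table. The vcpu thread no longer blocks in synchronize_rcu(), but, as quoted above, it can then resume while in-flight readers still see a stale entry, and since the update is guest triggerable the number of outstanding callbacks needs some bound:

#include <linux/kvm_host.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Assumed (illustrative) addition: struct rcu_head rcu; in
 * struct kvm_irq_routing_table. */

static void free_irq_routing_table_rcu(struct rcu_head *head)
{
	struct kvm_irq_routing_table *rt =
		container_of(head, struct kvm_irq_routing_table, rcu);

	kfree(rt);
}

/* In kvm_set_irq_routing(), after the new table has been built: */
static void publish_routing_call_rcu_sketch(struct kvm *kvm,
					    struct kvm_irq_routing_table *new)
{
	struct kvm_irq_routing_table *old;

	mutex_lock(&kvm->irq_lock);
	old = kvm->irq_routing;
	rcu_assign_pointer(kvm->irq_routing, new);
	mutex_unlock(&kvm->irq_lock);

	/* Previously: synchronize_rcu(); kfree(old);
	 * Now the old table is freed after a grace period without
	 * blocking the updater (the vcpu thread). */
	call_rcu(&old->rcu, free_irq_routing_table_rcu);
}
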
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-28 11:22 ` Gleb Natapov
2013-11-28 11:30 ` Avi Kivity
@ 2013-11-28 11:33 ` Michael S. Tsirkin
2013-11-28 11:33 ` Gleb Natapov
1 sibling, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2013-11-28 11:33 UTC (permalink / raw)
To: Gleb Natapov
Cc: Avi Kivity, Huangweidong (C), KVM, Gleb Natapov, Zhanghaoyu (A),
Luonengjun, qemu-devel@nongnu.org, Zanghongyong, Avi Kivity,
Paolo Bonzini, Jinxin (F)
On Thu, Nov 28, 2013 at 01:22:45PM +0200, Gleb Natapov wrote:
> On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
> > On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> > >On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> > >>On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> > >>>On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> > >>>>On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> > >>>>>On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> > >>>>>>On 28/11/2013 07:27, Zhanghaoyu (A) wrote:
> > >>>>>>>>>Without synchronize_rcu you could have
> > >>>>>>>>>
> > >>>>>>>>> VCPU writes to routing table
> > >>>>>>>>> e = entry from IRQ routing table
> > >>>>>>>>> kvm_irq_routing_update(kvm, new);
> > >>>>>>>>> VCPU resumes execution
> > >>>>>>>>> kvm_set_msi_irq(e, &irq);
> > >>>>>>>>> kvm_irq_delivery_to_apic_fast();
> > >>>>>>>>>
> > >>>>>>>>>where the entry is stale but the VCPU has already resumed execution.
> > >>>>>>>>>
> > >>>>>>>If we use call_rcu() (temporarily setting aside the problem that Gleb pointed out) instead of synchronize_rcu(), do we still have to ensure this?
> > >>>>>>The problem is that we should ensure this, so using call_rcu is not
> > >>>>>>possible (even not considering the memory allocation problem).
> > >>>>>>
> > >>>>>Not changing current behaviour is certainly safer, but I am still not 100%
> > >>>>>convinced we have to ensure this.
> > >>>>>
> > >>>>>Suppose guest does:
> > >>>>>
> > >>>>>1: change msi interrupt by writing to pci register
> > >>>>>2: read the pci register to flush the write
> > >>>>>3: zero idt
> > >>>>>
> > >>>>>I am pretty certain that this code can get an interrupt after step 2 on real HW,
> > >>>>>but I cannot tell whether the guest can rely on it being delivered exactly after
> > >>>>>the read instruction, or whether it can be delayed by a couple of instructions. It seems to me
> > >>>>>it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.
> > >>>>>
> > >>>>Linux is safe, it does interrupt migration from within the interrupt
> > >>>>handler. If you do that before the device-specific EOI, you won't
> > >>>>get another interrupt until programming the MSI is complete.
> > >>>>
> > >>>>Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> > >>>>
> > >>>>Using call_rcu() is a better solution than srcu IMO. Less code
> > >>>>changes, consistently faster.
> > >>>Why not fix userspace to use KVM_SIGNAL_MSI instead?
> > >>>
> > >>>
> > >>Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
> > >Zhanghaoyu said that the problem mostly hurts in a real-time telecom
> > >environment, so I proposed how he can fix the problem in his specific
> > >environment. It will not fix older userspace obviously, but a kernel
> > >fix will also require a kernel update, and usually updating userspace
> > >is easier.
> > >
> > >
> >
> > Isn't the latency due to interrupt migration causing long
> > synchronize_rcu()s? How does KVM_SIGNAL_MSI help?
> >
> If MSI is delivered using KVM_SIGNAL_MSI as opposed to via an entry in
> the irq routing table, changing the MSI configuration should not cause an update to
> the irq routing table (not saying this is what happens with current QEMU, but
> theoretically there is no reason to update the routing table in this case).
>
> --
> Gleb.
Unfortunately all high performance users (vhost net,
vhost scsi, virtio-blk data plane, vfio) switched to using
eventfd.
KVM_SIGNAL_MSI is used as a simple mechanism to avoid routing
table hassles e.g. for hotplug MSIs.
--
MST
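
For reference, KVM_SIGNAL_MSI injects a single MSI directly from userspace, bypassing the routing table entirely; a minimal usage sketch (error handling trimmed, the address/data values are whatever the guest programmed into the device):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd: VM file descriptor from KVM_CREATE_VM.
 * Returns >0 if the MSI was delivered, 0 if the guest blocked it,
 * <0 on error. */
static int inject_msi(int vm_fd, uint64_t address, uint32_t data)
{
	struct kvm_msi msi = {
		.address_lo = (uint32_t)address,
		.address_hi = (uint32_t)(address >> 32),
		.data       = data,
		.flags      = 0,
	};

	return ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);
}
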
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
2013-11-26 16:24 ` Michael S. Tsirkin
@ 2013-11-30 2:46 ` Zhanghaoyu (A)
0 siblings, 0 replies; 60+ messages in thread
From: Zhanghaoyu (A) @ 2013-11-30 2:46 UTC (permalink / raw)
To: Michael S. Tsirkin, Gleb Natapov
Cc: Huangweidong (C), KVM, Jinxin (F), Luonengjun,
qemu-devel@nongnu.org, Zanghongyong, Paolo Bonzini
>On Tue, Nov 26, 2013 at 06:14:27PM +0200, Gleb Natapov wrote:
>> On Tue, Nov 26, 2013 at 06:05:37PM +0200, Michael S. Tsirkin wrote:
>> > On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote:
>> > > On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote:
>> > > > On 26/11/2013 13:40, Zhanghaoyu (A) wrote:
>> > > > > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu
>> > > > > thread will IOCTL return to QEMU from hypervisor, then vcpu
>> > > > > thread ask the hypervisor to update the irq routing table, in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is blocked for so much time to wait RCU grace period, and during this period, this vcpu cannot provide service to VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced too.
>> > > > > It's unacceptable in some real-time scenario, e.g. telecom.
>> > > > >
>> > > > > So, I want to create a single workqueue for each VM, to
>> > > > > asynchronously performing the RCU synchronization for irq routing table, and let the vcpu thread return and VMENTRY to service VM immediately, no more need to blocked to wait RCU grace period.
>> > > > > And, I have implemented a raw patch, took a test in our telecom environment, above problem disappeared.
>> > > >
>> > > > I don't think a workqueue is even needed. You just need to use
>> > > > call_rcu to free "old" after releasing kvm->irq_lock.
>> > > >
>> > > > What do you think?
>> > > >
>> > > It should be rate limited somehow. Since it is guest triggerable,
>> > > a guest may cause the host to allocate a lot of memory this way.
>> >
>> > The checks in __call_rcu() should handle this, I think. These keep
>> > a per-CPU counter, which can be adjusted via rcutree.blimit, which
>> > defaults to taking evasive action if more than 10K callbacks are
>> > waiting on a given CPU.
>> >
>> >
>> Documentation/RCU/checklist.txt has:
>>
>> An especially important property of the synchronize_rcu()
>> primitive is that it automatically self-limits: if grace periods
>> are delayed for whatever reason, then the synchronize_rcu()
>> primitive will correspondingly delay updates. In contrast,
>> code using call_rcu() should explicitly limit update rate in
>> cases where grace periods are delayed, as failing to do so can
>> result in excessive realtime latencies or even OOM conditions.
>
>I just asked Paul what this means.
My understanding is as follows:
The synchronous grace-period API synchronize_rcu() keeps the current thread from generating a large number of subsequent RCU updates (the "self-limits" property described above in Documentation/RCU/checklist.txt), and so avoids memory exhaustion; the asynchronous API call_rcu() cannot limit the update rate by itself, so an explicit rate limit is needed.
Thanks,
Zhang Haoyu
>
>> --
>> Gleb.
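
One possible shape for such an explicit limit, purely illustrative and not from any posted patch: keep a per-VM count of outstanding deferred frees and fall back to the synchronous path when it grows too large, so the updater self-limits the way synchronize_rcu() does. The counter field, back-pointer and threshold below are all assumptions:

#include <linux/atomic.h>
#include <linux/kvm_host.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

#define IRQ_ROUTING_PENDING_MAX	16	/* illustrative threshold */

/* Assumed additions: atomic_t irq_routing_pending in struct kvm, and a
 * struct kvm *kvm back-pointer plus struct rcu_head rcu in
 * struct kvm_irq_routing_table. */

static void free_routing_rcu(struct rcu_head *head)
{
	struct kvm_irq_routing_table *rt =
		container_of(head, struct kvm_irq_routing_table, rcu);

	atomic_dec(&rt->kvm->irq_routing_pending);
	kfree(rt);
}

static void free_old_routing_sketch(struct kvm *kvm,
				    struct kvm_irq_routing_table *old)
{
	if (atomic_inc_return(&kvm->irq_routing_pending) >
	    IRQ_ROUTING_PENDING_MAX) {
		/* Too many deferred frees outstanding: block this once,
		 * like synchronize_rcu() would, to bound memory usage. */
		atomic_dec(&kvm->irq_routing_pending);
		synchronize_rcu();
		kfree(old);
	} else {
		call_rcu(&old->rcu, free_routing_rcu);
	}
}
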
^ permalink raw reply [flat|nested] 60+ messages in thread
end of thread, other threads:[~2013-11-30 2:47 UTC | newest]
Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-26 12:40 [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table Zhanghaoyu (A)
2013-11-26 12:47 ` Paolo Bonzini
2013-11-26 12:56 ` Gleb Natapov
2013-11-26 13:13 ` Paolo Bonzini
2013-11-28 3:46 ` Zhanghaoyu (A)
2013-11-26 16:05 ` Michael S. Tsirkin
2013-11-26 16:14 ` Gleb Natapov
2013-11-26 16:24 ` Michael S. Tsirkin
2013-11-30 2:46 ` Zhanghaoyu (A)
2013-11-26 13:18 ` Avi Kivity
2013-11-26 13:47 ` Paolo Bonzini
2013-11-26 14:36 ` Avi Kivity
2013-11-26 14:46 ` Paolo Bonzini
2013-11-26 14:54 ` Avi Kivity
2013-11-26 15:03 ` Gleb Natapov
2013-11-26 15:20 ` Paolo Bonzini
2013-11-26 15:25 ` Avi Kivity
2013-11-26 15:28 ` Paolo Bonzini
2013-11-26 15:35 ` Avi Kivity
2013-11-26 15:58 ` Paolo Bonzini
2013-11-26 16:06 ` Avi Kivity
2013-11-26 16:11 ` Michael S. Tsirkin
2013-11-26 16:21 ` Avi Kivity
2013-11-26 16:42 ` Paolo Bonzini
2013-11-26 16:08 ` Michael S. Tsirkin
2013-11-26 16:24 ` Gleb Natapov
2013-11-26 16:27 ` Avi Kivity
2013-11-26 16:38 ` Gleb Natapov
2013-11-26 16:28 ` Paolo Bonzini
2013-11-26 16:29 ` Avi Kivity
2013-11-26 16:37 ` Gleb Natapov
2013-11-28 6:27 ` Zhanghaoyu (A)
2013-11-28 8:55 ` Paolo Bonzini
2013-11-28 9:19 ` Gleb Natapov
2013-11-28 9:29 ` Paolo Bonzini
2013-11-28 9:43 ` Gleb Natapov
2013-11-28 9:49 ` Avi Kivity
2013-11-28 9:53 ` Paolo Bonzini
2013-11-28 10:16 ` Avi Kivity
2013-11-28 10:40 ` Paolo Bonzini
2013-11-28 10:47 ` Avi Kivity
2013-11-28 11:09 ` Gleb Natapov
2013-11-28 11:10 ` Paolo Bonzini
2013-11-28 11:16 ` Avi Kivity
2013-11-28 11:23 ` Gleb Natapov
2013-11-28 11:31 ` Paolo Bonzini
2013-11-28 11:33 ` Avi Kivity
2013-11-28 10:11 ` Gleb Natapov
2013-11-28 10:12 ` Avi Kivity
2013-11-28 11:02 ` Gleb Natapov
2013-11-28 11:18 ` Avi Kivity
2013-11-28 11:22 ` Gleb Natapov
2013-11-28 11:30 ` Avi Kivity
2013-11-28 11:33 ` Michael S. Tsirkin
2013-11-28 11:33 ` Gleb Natapov
2013-11-26 15:24 ` Avi Kivity
2013-11-28 9:14 ` Zhanghaoyu (A)
2013-11-28 9:20 ` Gleb Natapov
2013-11-26 12:48 ` Gleb Natapov
2013-11-26 12:50 ` Gleb Natapov