From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:48775) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Wjpki-0000e4-5E for qemu-devel@nongnu.org; Mon, 12 May 2014 08:54:50 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Wjpkb-0007VO-Iu for qemu-devel@nongnu.org; Mon, 12 May 2014 08:54:44 -0400 Received: from mx1.redhat.com ([209.132.183.28]:29966) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Wjpkb-0007VK-AR for qemu-devel@nongnu.org; Mon, 12 May 2014 08:54:37 -0400 Date: Mon, 12 May 2014 15:53:23 +0300 From: "Michael S. Tsirkin" Message-ID: <20140512125323.GA16846@redhat.com> References: <33183CC9F5247A488A2544077AF19020815E7B70@SZXEMA503-MBS.china.huawei.com> <53709B0C.4030808@redhat.com> <20140512100814.GA15514@redhat.com> <53709F01.8090204@redhat.com> <20140512101844.GC15514@redhat.com> <5370A19F.5000900@redhat.com> <20140512110723.GB15684@redhat.com> <5370B48B.6080001@redhat.com> <20140512121252.GA16576@redhat.com> <5370C2B6.6080605@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5370C2B6.6080605@redhat.com> Subject: Re: [Qemu-devel] [RFC] vhost: Can we change synchronize_rcu to call_rcu in vhost_set_memory() in vhost kernel module? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Paolo Bonzini Cc: "Huangweidong (C)" , "gleb@redhat.com" , Radim Krcmar , "qemu-devel@nongnu.org" , "Gonglei (Arei)" , "avi.kivity@gmail.com" , "Herongguang (Stephen)" On Mon, May 12, 2014 at 02:46:46PM +0200, Paolo Bonzini wrote: > Il 12/05/2014 14:12, Michael S. Tsirkin ha scritto: > >On Mon, May 12, 2014 at 01:46:19PM +0200, Paolo Bonzini wrote: > >>Il 12/05/2014 13:07, Michael S. Tsirkin ha scritto: > >>>On Mon, May 12, 2014 at 12:25:35PM +0200, Paolo Bonzini wrote: > >>>>Il 12/05/2014 12:18, Michael S. Tsirkin ha scritto: > >>>>>On Mon, May 12, 2014 at 12:14:25PM +0200, Paolo Bonzini wrote: > >>>>>>Il 12/05/2014 12:08, Michael S. Tsirkin ha scritto: > >>>>>>>On Mon, May 12, 2014 at 11:57:32AM +0200, Paolo Bonzini wrote: > >>>>>>>>Perhaps we can check for cases where only the address is changing, > >>>>>>>>and poke at an existing struct kvm_kernel_irq_routing_entry without > >>>>>>>>doing any RCU synchronization? > >>>>>>> > >>>>>>>I suspect interrupts can get lost then: e.g. if address didn't match any > >>>>>>>cpus, now it matches some. No? > >>>>>> > >>>>>>Can you explain the problem more verbosely? :) > >>>>>> > >>>>>>Multiple writers would still be protected by the mutex, so you > >>>>>>cannot have an "in-place update" writer racing with a "copy the > >>>>>>array" writer. > >>>>> > >>>>>I am not sure really. > >>>>>I'm worried about reader vs writer. > >>>>>If reader sees a stale msi value msi will be sent to a wrong > >>>>>address. > >>>> > >>>>That shouldn't happen on any cache-coherent system, no? > >>>> > >>>>Or at least, it shouldn't become any worse than what can already > >>>>happen with RCU. > >>> > >>>Meaning guest must do some synchronization anyway? > >> > >>Yes, I think so. The simplest would be to mask the interrupt around > >>MSI configuration changes. Radim was looking at a similar bug. > > > >Was this with classic assignment or with vfio? > > No, it was with QEMU emulated devices (AHCI). Not this path, but it > did involve MSI configuration changes while the interrupt was > unmasked in the device. > > >>He couldn't replicate on bare metal, > >>but I don't see why it shouldn't > >>be possible there too. > > > >I doubt linux does something tricky here, so > >I'm not talking about something linux guest does, > >I'm talking about an abstract guarantee of the > >APIC. > > APIC or PCI? The chipset is involved here, not the APICs (the APICs > just see interrupt transactions). > > >If baremetal causes a synchronisation point > >and we don't, at some point linux will use it and > >this will bite us. > > Yes, and I agree. > > It's complicated; on one hand I'm not sure we can entirely fix this, > on the other hand it may be that the guest cannot have too high > expectations. > > The invalid scenario is an interrupt happening before the MSI > reconfiguration and using the new value. It's invalid because MSI > reconfiguration is a non-posted write and must push previously > posted writes. But we cannot do that at all! QEMU has the big lock > that implicitly orders transactions, but this does not apply to > high-performance users of irqfd (vhost, VFIO, even QEMU's own > virtio-blk dataplane). Even if we removed all RCU uses for the > routing table, we don't have a "bottleneck" that orders transactions > and removes this invalid scenario. > > At the same time, it's obviously okay if an interrupt happens after > the MSI reconfiguration and uses the old value. That's a posted > write passing an earlier non-posted write, which is allowed. This > is the main reason why I believe that masking/unmasking is required > on bare metal. In addition, if the masking/unmasking is done with a > posted write (MMIO BAR), the guest needs to do a read for > synchronization before it unmasks the interrupt. > > In any case, whether writes synchronize with RCU or bypass it > doesn't change the picture. In either case, writes are ordered > against each other but not against reads. RCU does nothing except > preventing dangling pointer accesses. > > Paolo This is the only part I don't get. RCU will make sure no VCPUs are running, won't it? So it's a kind of full barrier. -- MST