From: Avi Kivity
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Date: Thu, 01 Oct 2009 10:34:17 +0200
To: Gregory Haskins
Cc: "Ira W. Snyder", "Michael S. Tsirkin", netdev@vger.kernel.org,
 virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, mingo@elte.hu, linux-mm@kvack.org,
 akpm@linux-foundation.org, hpa@zytor.com, Rusty Russell,
 s.hetze@linux-ag.com, alacrityvm-devel@lists.sourceforge.net
Message-ID: <4AC46989.7030502@redhat.com>
In-Reply-To: <4AC3B9C6.5090408@gmail.com>

On 09/30/2009 10:04 PM, Gregory Haskins wrote:
>> A 2.6.27 guest, or Windows guest with the existing virtio drivers,
>> won't work over vbus.
>
> Binary compatibility with existing virtio drivers, while nice to have,
> is not a specific requirement nor goal.  We will simply load an updated
> KMP/MSI into those guests and they will work again.  As previously
> discussed, this is how more or less any system works today.  It's like
> we are removing an old adapter card and adding a new one to "uprev the
> silicon".

Virtualization is about not doing that.  Sometimes it's necessary (when
you have made unfixable design mistakes), but replacing a bus with no
advantage to the guest that has to be changed is not (other hypervisors
or hypervisor-less deployment scenarios are not advantages to that
guest).

>> Further, non-shmem virtio can't work over vbus.
>
> Actually I misspoke earlier when I said virtio works over non-shmem.
> Thinking about it some more, both virtio and vbus fundamentally require
> shared memory, since sharing their metadata concurrently on both sides
> is their raison d'être.
>
> The difference is that virtio utilizes a pre-translation/mapping (via
> ->add_buf) from the guest side.  OTOH, vbus uses a post-translation
> scheme (via memctx) from the host side.  If anything, vbus is actually
> more flexible because it doesn't assume the entire guest address space
> is directly mappable.
>
> In summary, your statement is incorrect (though it is my fault for
> putting that idea in your head).

Well, Xen requires pre-translation (since the guest has to give the host
(which is just another guest) permission to access the data).  So
neither is a superset of the other, they're just different.

It doesn't really matter since Xen is unlikely to adopt virtio.
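
For readers following along, this is roughly what the guest-side
pre-translation looks like with the in-tree virtqueue API; treat it as a
sketch from memory of the 2.6.31-era ops, with the wrapper struct and
buffer layout made up for illustration:

    #include <linux/virtio.h>
    #include <linux/virtio_net.h>
    #include <linux/scatterlist.h>
    #include <linux/skbuff.h>

    struct my_vnet_priv {                  /* hypothetical driver state */
            struct virtqueue *rvq;         /* receive virtqueue */
    };

    /* Post a receive buffer: the scatterlist is turned into guest-physical
     * addresses in the ring right here, on the guest side (the
     * "pre-translation").  The host only ever sees translated descriptors. */
    static int my_vnet_add_recv_buf(struct my_vnet_priv *priv,
                                    struct sk_buff *skb)
    {
            struct scatterlist sg[2];
            int err;

            sg_init_table(sg, 2);
            sg_set_buf(&sg[0], skb->cb, sizeof(struct virtio_net_hdr));
            sg_set_buf(&sg[1], skb->data, skb_tailroom(skb));

            /* 0 readable (out) buffers, 2 writable (in) buffers */
            err = priv->rvq->vq_ops->add_buf(priv->rvq, sg, 0, 2, skb);
            if (err < 0)
                    return err;

            priv->rvq->vq_ops->kick(priv->rvq);   /* one notify per batch */
            return 0;
    }
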
> An interesting thing here is that you don't even need a fancy
> multi-homed setup to see the effects of my exit-ratio reduction work:
> even single-port configurations suffer from the phenomenon since many
> devices have multiple signal flows (e.g. network adapters tend to have
> at least 3 flows: rx-ready, tx-complete, and control-events
> (link-state, etc)).  What's worse is that the flows often are
> indirectly related (for instance, many host adapters will free tx skbs
> during rx operations, so you tend to get bursts of tx-completes at the
> same time as rx-ready).  If the flows map 1:1 with the IDT, they will
> suffer the same problem.

You can simply use the same vector for both rx and tx and poll both at
every interrupt.
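
Something like the following, sketched with a hypothetical single-vector
driver just to show the idea (the clean_tx/clean_rx helpers are
stand-ins):

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    struct onevec_nic {                    /* hypothetical driver state */
            struct napi_struct napi;
            int irq;                       /* the one shared MSI vector */
    };

    static void onevec_clean_tx(struct onevec_nic *nic);             /* stand-in */
    static int  onevec_clean_rx(struct onevec_nic *nic, int budget); /* stand-in */

    /* One vector for everything: no demultiplexing, just kick the poller. */
    static irqreturn_t onevec_intr(int irq, void *data)
    {
            struct onevec_nic *nic = data;

            napi_schedule(&nic->napi);
            return IRQ_HANDLED;
    }

    /* The poll routine reaps tx completions and rx packets in the same
     * pass, so a tx-complete burst arriving together with rx-ready costs
     * no extra interrupt. */
    static int onevec_poll(struct napi_struct *napi, int budget)
    {
            struct onevec_nic *nic = container_of(napi, struct onevec_nic, napi);
            int done;

            onevec_clean_tx(nic);
            done = onevec_clean_rx(nic, budget);

            if (done < budget)
                    napi_complete(napi);
            return done;
    }

    /* registered once: request_irq(nic->irq, onevec_intr, 0, "onevec", nic); */
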
> In any case, here is an example run of a simple single-homed guest over
> standard GigE.  What's interesting here is the .qnotify to .notify
> ratio, as this is the interrupt-to-signal ratio.  In this case, it's
> 170047/151918, which comes out to about 11% savings in interrupt
> injections:
>
> vbus-guest:/home/ghaskins # netperf -H dev
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> dev.laurelwood.net (192.168.1.10) port 0 AF_INET
> Recv     Send    Send
> Socket   Socket  Message  Elapsed
> Size     Size    Size     Time     Throughput
> bytes    bytes   bytes    secs.    10^6bits/sec
>
> 1048576  16384   16384    10.01    940.77
> vbus-guest:/home/ghaskins # cat /sys/kernel/debug/pci-to-vbus-bridge
>   .events       : 170048
>   .qnotify      : 151918
>   .qinject      : 0
>   .notify       : 170047
>   .inject       : 18238
>   .bridgecalls  : 18
>   .buscalls     : 12
> vbus-guest:/home/ghaskins # cat /proc/interrupts
>            CPU0
>   0:         87   IO-APIC-edge      timer
>   1:          6   IO-APIC-edge      i8042
>   4:        733   IO-APIC-edge      serial
>   6:          2   IO-APIC-edge      floppy
>   7:          0   IO-APIC-edge      parport0
>   8:          0   IO-APIC-edge      rtc0
>   9:          0   IO-APIC-fasteoi   acpi
>  10:          0   IO-APIC-fasteoi   virtio1
>  12:         90   IO-APIC-edge      i8042
>  14:       3041   IO-APIC-edge      ata_piix
>  15:       1008   IO-APIC-edge      ata_piix
>  24:     151933   PCI-MSI-edge      vbus
>  25:          0   PCI-MSI-edge      virtio0-config
>  26:        190   PCI-MSI-edge      virtio0-input
>  27:         28   PCI-MSI-edge      virtio0-output
> NMI:          0   Non-maskable interrupts
> LOC:       9854   Local timer interrupts
> SPU:          0   Spurious interrupts
> CNT:          0   Performance counter interrupts
> PND:          0   Performance pending work
> RES:          0   Rescheduling interrupts
> CAL:          0   Function call interrupts
> TLB:          0   TLB shootdowns
> TRM:          0   Thermal event interrupts
> THR:          0   Threshold APIC interrupts
> MCE:          0   Machine check exceptions
> MCP:          1   Machine check polls
> ERR:          0
> MIS:          0
>
> It's important to note here that we are actually looking at the
> interrupt rate, not the exit rate (which is usually a multiple of the
> interrupt rate, since you have to factor in as many as three exits per
> interrupt (IPI, window, EOI)).  Therefore we saved about 18k interrupts
> in this 10 second burst, but we may have actually saved up to 54k exits
> in the process.  This is only over a 10 second window at GigE rates, so
> YMMV.  These numbers get even more dramatic on higher end hardware, but
> I haven't had a chance to generate new numbers yet.

(irq window exits should only be required on a small percentage of
interrupt injections, since the guest will try to disable interrupts for
short periods only)

> Looking at some external stats paints an even bleaker picture: "exits"
> as reported by kvm_stat for virtio-pci based virtio-net tip the scales
> at 65k/s vs 36k/s for vbus-based venet.  And virtio is consuming ~30%
> of my quad-core's cpu, vs 19% for venet during the test.  It's hard to
> know which innovation or innovations may be responsible for the entire
> reduction, but certainly the interrupt-to-signal ratio mentioned above
> is probably helping.

Can you please stop comparing userspace-based virtio hosts to
kernel-based venet hosts?  We know the userspace implementation sucks.

> The even worse news for 1:1 models is that the ratio of
> exits-per-interrupt climbs with load (exactly when it hurts the most),
> since that is when the probability that the vcpu will need all three
> exits is the highest.

Requiring all three exits means the guest is spending most of its time
with interrupts disabled; that's unlikely.

Thanks for the numbers.  Are those 11% attributable to rx/tx
piggybacking from the same interface?

Also, 170K interrupts -> 17K interrupts/sec -> 55kbit/interrupt ->
6.8kB/interrupt.  Ignoring interrupt merging and assuming equal rx/tx
distribution, that's about 13kB/interrupt.  Seems rather low for a
saturated link.

>>> and prioritizable/nestable signals.
>>
>> That doesn't belong in a bus.
>
> Everyone is of course entitled to an opinion, but the industry as a
> whole would disagree with you.  Signal path routing (1:1, aggregated,
> etc) is at the discretion of the bus designer.  Most buses actually do
> _not_ support 1:1 with the IDT (think USB, SCSI, IDE, etc).

With standard PCI, they do not.  But all modern host adapters support
MSI and they will happily give you one interrupt per queue.

> PCI is somewhat of an outlier in that regard afaict.  It's actually a
> nice feature of PCI when it's used within its design spec (HW).  For
> SW/PV, 1:1 suffers from, among other issues, that "triple-exit scaling"
> issue in the signal path I mentioned above.  This is one of the many
> reasons I think PCI is not the best choice for PV.

Look at the vmxnet3 submission (recently posted on virtualization@).
It's a perfectly ordinary PCI NIC driver, apart from having so many 'V's
in the code.  16 rx queues, 8 tx queues, 25 MSIs, BARs for the
registers.  So while the industry as a whole might disagree with me, it
seems VMware does not.
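
That per-queue MSI path is plain PCI; a driver just asks for one vector
per queue and binds a handler to each.  A rough sketch, with hypothetical
names and the error unwinding omitted:

    #include <linux/pci.h>
    #include <linux/interrupt.h>

    #define NR_QUEUES 4

    static irqreturn_t my_queue_intr(int irq, void *queue);   /* stand-in */

    /* Allocate one MSI-X vector per queue, the way vmxnet3 and most
     * modern multiqueue NICs do, so each queue signals independently. */
    static int setup_per_queue_vectors(struct pci_dev *pdev,
                                       void *queues[NR_QUEUES])
    {
            struct msix_entry entries[NR_QUEUES];
            int i, err;

            for (i = 0; i < NR_QUEUES; i++)
                    entries[i].entry = i;

            err = pci_enable_msix(pdev, entries, NR_QUEUES);
            if (err)
                    return err;     /* couldn't get all NR_QUEUES vectors */

            for (i = 0; i < NR_QUEUES; i++) {
                    err = request_irq(entries[i].vector, my_queue_intr, 0,
                                      "myqueue", queues[i]);
                    if (err)
                            return err;     /* proper unwinding omitted */
            }
            return 0;
    }
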
>>> http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png
>>
>> That's a red herring.  The problem is not with virtio as an ABI, but
>> with its implementation in userspace.  vhost-net should offer
>> equivalent performance to vbus.
>
> That's pure speculation.  I would advise you to reserve such statements
> until after a proper bakeoff can be completed.

Let's do that then.  Please reserve the corresponding comparisons from
your side as well.

> This is not to mention
> that vhost-net does nothing to address our other goals, like scheduler
> coordination and non-802.x fabrics.

What are scheduler coordination and non-802.x fabrics?

>> Right, when you ignore the points where they don't fit, it's a perfect
>> mesh.
>
> Where doesn't it fit?

(avoiding infinite loop)

>>>> But that's not a strong argument for vbus; instead of adding vbus
>>>> you could make virtio more friendly to non-virt
>>>
>>> Actually, it _is_ a strong argument then because adding vbus is what
>>> helps make virtio friendly to non-virt, at least for when performance
>>> matters.
>>
>> As vhost-net shows, you can do that without vbus
>
> Citation please.  Afaict, the one use case that we looked at for vhost
> outside of KVM failed to adapt properly, so I do not see how this is
> true.

I think Ira said he can make vhost work?

>> and without breaking compatibility.
>
> Compatibility with what?  vhost hasn't even been officially deployed in
> KVM environments afaict, nevermind non-virt.  Therefore, how could it
> possibly have compatibility constraints with something non-virt
> already?  Citation please.

virtio-net over pci is deployed.  Replacing the backend with vhost-net
will require no guest modifications.  Replacing the frontend with venet
or virt-net/vbus-pci will require guest modifications.

Obviously virtio-net isn't deployed in non-virt.  But if we adopt vbus,
we have to migrate guests.

>> Of course there is such a thing as native, a pci-ready guest has tons
>> of support built into it
>
> I specifically mentioned that already ([1]).
>
> You are also overstating its role, since the basic OS is what
> implements the native support for bus-objects, hotswap, etc, _not_ PCI.
> PCI just rides underneath and feeds trivial events up, as do other
> bus-types (usb, scsi, vbus, etc).

But we have to implement vbus for each guest we want to support.  That
includes Windows and older Linux, which has a different internal API, so
we have to port the code multiple times, just to get existing
functionality.

> And once those events are fed, you still need a
> PV layer to actually handle the bus interface in a high-performance
> manner, so it's not like you really have a "native" stack in either
> case.

virtio-net doesn't use any pv layer.

>> that doesn't need to be retrofitted.
>
> No, that is incorrect.  You have to heavily modify the pci model with
> layers on top to get any kind of performance out of it.  Otherwise, we
> would just use realtek emulation, which is technically the native PCI
> you are apparently so enamored with.

virtio-net doesn't modify the PCI model.  And if you look at vmxnet3,
they mention that it conforms to something called UPT, which allows
hardware vendors to implement parts of their NIC model.  So vmxnet3 is
apparently suitable for both hardware and software implementations.

> Not to mention there are things you just plain can't do in PCI today,
> like dynamically assign signal-paths,

You can have dynamic MSI/queue routing with virtio, and each MSI can be
routed to a vcpu at will.

> priority, and coalescing, etc.

Do you mean interrupt priority?  Well, the APIC allows interrupt
priorities and Windows uses them; Linux doesn't.  I don't see a reason
to provide more than native hardware.

>> Since
>> practically everyone (including Xen) does their paravirt drivers atop
>> pci, the claim that pci isn't suitable for high performance is
>> incorrect.
>
> Actually IIUC, I think Xen bridges to their own bus as well (and only
> where they have to), just like vbus.  They don't use PCI natively.  PCI
> is perfectly suited as a bridge transport for PV, as I think the Xen
> and vbus examples have demonstrated.  It's the 1:1 device-model where
> PCI has the most problems.

N:1 breaks down on large guests since one vcpu will have to process all
events.  You could do N:M, with commands to change routings, but where's
your userspace interface?  You can't tell from /proc/interrupts which
vbus interrupts are active, and irqbalance can't steer them towards less
busy cpus since they're invisible to the interrupt controller.
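
To make the visibility point concrete: because every virtio queue is an
ordinary per-queue MSI, the guest admin (or irqbalance) can steer it with
nothing more than the standard irq interface.  A sketch, using the
virtio0-input vector (irq 26) from the dump above; the irq and cpu
numbers are just examples:

    #include <stdio.h>
    #include <stdlib.h>

    /* Pin one per-queue MSI to one vcpu by writing its affinity mask --
     * exactly what irqbalance does.  A multiplexed N:1 interrupt cannot
     * be steered this way, since its flows all share a single irq. */
    static int pin_irq_to_cpu(int irq, int cpu)
    {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
            f = fopen(path, "w");
            if (!f)
                    return -1;

            fprintf(f, "%x\n", 1u << cpu);   /* cpumask with only 'cpu' set */
            return fclose(f);
    }

    int main(void)
    {
            /* e.g. move virtio0-input (irq 26 above) to vcpu 0 */
            return pin_irq_to_cpu(26, 0) ? EXIT_FAILURE : EXIT_SUCCESS;
    }
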
>>> And lastly, why would you _need_ to use the so-called "native"
>>> mechanism?  The short answer is, "you don't".  Any given system
>>> (guest or bare-metal) already has a wide range of buses (try running
>>> "tree /sys/bus" in Linux).  More importantly, the concept of adding
>>> new buses is widely supported in both the Windows and Linux driver
>>> model (and probably any other guest-type that matters).  Therefore,
>>> despite claims to the contrary, it's not hard or even unusual to add
>>> a new bus to the mix.
>>
>> The short answer is "compatibility".
>
> There was a point in time where the same could be said for virtio-pci
> based drivers vs realtek and e1000, so that argument is demonstrably
> silly.  No one tried to make virtio work in a binary compatible way
> with realtek emulation, yet we all survived the requirement for loading
> a virtio driver to my knowledge.

The larger your installed base, the more difficult it is.  Of course
it's doable, but I prefer not doing it and instead improving things in a
binary backwards-compatible manner.  If there is no choice we will bow
to the inevitable and make our users upgrade.  But at this point there
is a choice, and I prefer to stick with vhost-net until it is proven
that it won't work.

> The bottom line is: Binary device compatibility is not required in any
> other system (as long as you follow sensible versioning/id rules), so
> why is KVM considered special?

One of the benefits of virtualization is that the guest model is
stable.  You can live-migrate guests and upgrade the hardware
underneath.  You can have a single guest image that you clone to
provision new guests.  If you switch to a new model, you give up those
benefits, or you support both models indefinitely.

Note that even hardware nowadays is binary compatible.  One e1000 driver
supports a ton of different cards, and I think (not sure) newer cards
will work with older drivers, just without all their features.

> The fact is, it isn't special (at least not in this regard).  What _is_
> required is "support" and we fully intend to support these proposed
> components.  I assure you that at least the users that care about
> maximum performance will not generally mind loading a driver.  Most of
> them would have to anyway if they want to get beyond realtek emulation.

For a new install, sure.  I'm talking about existing deployments (and
those that will exist by the time vbus is ready for rollout).

> I am certainly in no position to tell you how to feel, but this
> declaration would seem from my perspective to be more of a means to an
> end than a legitimate concern.  Otherwise we would never have had
> virtio support in the first place, since it was not "compatible" with
> previous releases.

virtio was certainly not pain-free, needing Windows drivers, updates to
management tools (you can't enable it by default, so you have to offer
it as a choice), mkinitrd, etc.  I'd rather not have to go through that
again.

>> Especially if the device changed is your boot disk.
>
> If and when that becomes a priority concern, that would be a function
> transparently supported in the BIOS shipped with the hypervisor, and
> would thus be invisible to the user.

No, you have to update the driver in your initrd (for Linux) or properly
install the new driver (for Windows).  It's especially difficult for
Windows.

>> You may not care about the pain caused to users, but I do, so I will
>> continue to insist on compatibility.
>
> For the users that don't care about maximum performance, there is no
> change (and thus zero pain) required.  They can use realtek or virtio
> if they really want to.  Neither is going away to my knowledge, and
> let's face it: 2.6Gb/s out of virtio to userspace isn't *that* bad.
> But "good enough" isn't good enough, and I won't rest till we get to
> native performance.

I don't want to support both virtio and vbus in parallel.  There's
enough work already.  If we adopt vbus, we'll have to deprecate and
eventually kill off virtio.

> 2) True pain to users is not caused by lack of binary compatibility.
> It's caused by lack of support.  And it's a good thing, or we would all
> be emulating the 8086 architecture forever...
>
> ..oh wait, I guess we kind of do that already ;).  But at least we can
> slip in something more advanced once in a while (APIC vs PIC, USB vs
> uart, iso9660 vs floppy, for instance) and update the guest stack
> instead of insisting it must look like ISA forever for compatibility's
> sake.

PCI is continuously updated, with MSI, MSI-X, and IOMMU support being
some recent updates.  I'd like to ride on top of that instead of having
to clone it for every guest I support.

>> So we have: vbus needs a connector, vhost needs a connector.  vbus
>> doesn't need userspace to program the addresses (but does need
>> userspace to instantiate the devices and to program the bus address
>> decode)
>
> First of all, bus-decode is substantially easier than per-device decode
> (you have to track all those per-device/per-signal fds somewhere,
> integrate with hotswap, etc), and it's only done once per guest at
> startup and left alone.  So it's already not apples to apples.

Right, it means you can hand off those eventfds to other qemus or other
pure userspace servers.  It's more flexible.

> Second, while it's true that the general kvm-connector bus-decode needs
> to be programmed, that is a function of adapting to the environment
> that _you_ created for me.  The original kvm-connector was discovered
> via cpuid and hypercalls, and didn't need userspace at all to set it
> up.  Therefore it would be entirely unfair of you to turn around and
> somehow try to use that trait of the design against me since you
> yourself imposed it.

No kvm feature will ever be exposed to a guest without userspace
intervention.  It's a basic requirement.  If it causes complexity (and
it does) we have to live with it.

>> Does it work on Windows?
>
> This question doesn't make sense.  Hotswap control occurs on the host,
> which is always Linux.
>
> If you were asking about whether a Windows guest will support hotswap:
> the answer is "yes".  Our Windows driver presents a unique PDO/FDO pair
> for each logical device instance that is pushed out (just like the
> built-in usb, pci, and scsi bus drivers that Windows supports
> natively).

Ah, you have a Windows venet driver?

>>> As an added bonus, its device-model is modular.  A developer can
>>> write a new device model, compile it, insmod it to the host kernel,
>>> hotplug it to the running guest with mkdir/ln, and then come back out
>>> again (hotunplug with rmdir, rmmod, etc).  They may do all this
>>> without taking the guest down, and while eating QEMU-based IO
>>> solutions for breakfast performance-wise.
>>>
>>> Afaict, qemu can't do either of those things.
>>
>> We've seen that herring before,
>
> Citation?

It's the compare-venet-in-kernel-to-virtio-in-userspace thing again.
Let's defer that until mst completes vhost-net mergeable buffers, at
which time we can compare vhost-net to venet and see how much vbus
contributes to performance and how much of it comes from being
in-kernel.

>>>> Refactor instead of duplicating.
>>>
>>> There is no duplicating.  vbus has no equivalent today as virtio
>>> doesn't define these layers.
>>
>> So define them if they're missing.
>
> I just did.

Since this is getting confusing to me, I'll start from scratch looking
at the vbus layers, top to bottom:

Guest side:
1. venet guest kernel driver - AFAICT, duplicates the virtio-net guest
driver functionality
2. vbus guest driver (config and hotplug) - duplicates pci, or if you
need non-pci support, virtio config and its pci bindings; needs
reimplementation for all supported guests
3. vbus guest driver (interrupt coalescing, priority) - if needed,
should be implemented as an irqchip (and be totally orthogonal to the
driver); needs reimplementation for all supported guests
4. vbus guest driver (shm/ioq) - finer-grained layering than virtio
(which only supports the combination, due to the need for Xen support);
can be retrofitted to virtio at some cost

Host side:
1. venet host kernel driver - is duplicated by vhost-net; doesn't
support live migration, unprivileged users, or slirp
2. vbus host driver (config and hotplug) - duplicates pci support in
userspace (which will need to be kept in any case); already has two
userspace interfaces
3. vbus host driver (interrupt coalescing, priority) - if we think we
need it (and I don't), should be part of the kvm core, not a bus
4. vbus host driver (shm) - partially duplicated by vhost memory slots
5. vbus host driver (ioq) - duplicates userspace virtio, duplicated by
vhost

>>> There is no rewriting.  vbus has no equivalent today as virtio
>>> doesn't define these layers.
>>>
>>> By your own admission, you said if you wanted that capability, use a
>>> library.  What I think you are not understanding is vbus _is_ that
>>> library.  So what is the problem, exactly?
>>
>> It's not compatible.
>
> No, that is incorrect.  What you are apparently not understanding is
> that not only is vbus that library, but it's extensible.  So even if
> compatibility is your goal (it doesn't need to be IMO) it can be
> accommodated by how you interface to the library.

To me, compatible means I can live-migrate an image to a new system
without the user knowing about the change.  You'll be able to do that
with vhost-net.

>>> No, it does not.  vbus just needs a relatively simple single message
>>> pipe between the guest and host (think "hypercall tunnel", if you
>>> will).
>>
>> That's ioeventfd.  So far so similar.
>
> No, that is incorrect.  For one, vhost uses them on a per-signal-path
> basis, whereas vbus only has one channel for the entire guest->host.

You'll probably need to change that as you start running smp guests.

> Second, I do not use ioeventfd anymore because it has too many problems
> with the surrounding technology.  However, that is a topic for a
> different thread.

Please post your issues.  I see ioeventfd/irqfd as critical kvm
interfaces.
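
For reference, the amount of plumbing a connector needs from userspace
with those two interfaces is small.  A sketch of the userspace side; the
port number, gsi, and error handling are illustrative only, so check the
current linux/kvm.h rather than trusting my memory of the flags:

    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Wire a guest pio doorbell to an eventfd (kick path) and an eventfd
     * to a guest interrupt (completion path).  'vmfd' is an already
     * configured VM fd from KVM_CREATE_VM. */
    static int wire_doorbell_and_irq(int vmfd)
    {
            int kick = eventfd(0, 0);
            int call = eventfd(0, 0);
            struct kvm_ioeventfd io = {
                    .addr  = 0xc050,        /* hypothetical doorbell port */
                    .len   = 2,
                    .fd    = kick,
                    .flags = KVM_IOEVENTFD_FLAG_PIO,
            };
            struct kvm_irqfd irq = {
                    .fd  = call,
                    .gsi = 26,              /* hypothetical MSI gsi */
            };

            if (kick < 0 || call < 0)
                    return -1;
            if (ioctl(vmfd, KVM_IOEVENTFD, &io) < 0)
                    return -1;
            if (ioctl(vmfd, KVM_IRQFD, &irq) < 0)
                    return -1;

            /* The guest's port write is now completed in the kernel and
             * the completion signal comes back as an injected interrupt;
             * 'kick' and 'call' can be handed to vhost or to any other
             * in-kernel or userspace backend without touching the device
             * model again. */
            return 0;
    }
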
>> vbus devices aren't magically instantiated.  Userspace needs to
>> instantiate them too.  Sure, there's less work on the host side since
>> you're using vbus instead of the native interface, but more work on
>> the guest side since you're using vbus instead of the native
>> interface.
>
> No, that is incorrect.  The amount of "work" that a guest does is
> actually the same in both cases, since the guest OS performs the
> hotswap handling natively for all bus types (at least for Linux and
> Windows).  You still need to have a PV layer to interface with those
> objects in both cases, as well, so there is no such thing as a "native
> interface" for PV.  It's only a matter of where it occurs in the stack.

I'm missing something.  Where's the pv layer for virtio-net?

Linux drivers have an abstraction layer to deal with non-pci.  But the
Windows drivers are ordinary pci drivers with nothing that looks
pv-ish.  You could implement virtio-net hardware if you wanted to.

>> non-privileged-user capable?
>
> The short answer is "not yet (I think)".  I need to write a patch to
> properly set the mode attribute in sysfs, but I think this will be
> trivial.

(and selinux label)

>> Ah, so you have two control planes.
>
> So what?  If anything, it goes to show how extensible the framework is
> that a new plane could be added in 119 lines of code:
>
> ~/git/linux-2.6> stg show vbus-add-admin-ioctls.patch | diffstat
>  Makefile       |    3 -
>  config-ioctl.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 119 insertions(+), 1 deletion(-)
>
> if and when having two control planes exceeds its utility, I will
> submit a simple patch that removes the useless one.

It always begins with a 119-line patch and then grows, that's life.

>> kvm didn't have an existing counterpart in Linux when it was
>> proposed/merged.
>
> And likewise, neither does vbus.

For virt uses, I don't see the need.  For non-virt, I have no opinion.

--
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.