* [PATCH v3 REPOST] xen-netfront: delay gARP until backend switches to Connected
From: Laszlo Ersek @ 2011-12-09 11:38 UTC (permalink / raw)
To: ian.campbell, konrad.wilk, jeremy, xen-devel, virtualization,
netdev, linux-kernel
After a guest is live migrated, the xen-netfront driver emits a gratuitous
ARP message, so that networking hardware on the target host's subnet can
take notice, and public routing to the guest is re-established. However,
if the packet appears on the backend interface before the backend is added
to the target host's bridge, the packet is lost, and the migrated guest's
peers become unable to talk to the guest.
A sufficient two-parts condition to prevent the above is:
(1) ensure that the backend only moves to Connected xenbus state after its
hotplug scripts completed, ie. the netback interface got added to the
bridge; and
(2) ensure the frontend only queues the gARP when it sees the backend move
to Connected.
These two together provide complete ordering. Sub-condition (1) is
satisfied by pvops commit 43223efd9bfd.
In general, the full condition is sufficient, not necessary, because,
according to [1], live migration has been working for a long time without
satisfying sub-condition (2). However, after 43223efd9bfd was backported
to the RHEL-5 host to ensure (1), (2) still proved necessary in the RHEL-6
guest. This patch intends to provide (2) for upstream.
The Reviewed-by line comes from [2].
[1] http://old-list-archives.xen.org/xen-devel/2011-06/msg01969.html
[2] http://old-list-archives.xen.org/xen-devel/2011-07/msg00484.html
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
---
drivers/net/xen-netfront.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index d29365a..f033656 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1646,7 +1646,6 @@ static void netback_changed(struct xenbus_device *dev,
case XenbusStateInitialised:
case XenbusStateReconfiguring:
case XenbusStateReconfigured:
- case XenbusStateConnected:
case XenbusStateUnknown:
case XenbusStateClosed:
break;
@@ -1657,6 +1656,9 @@ static void netback_changed(struct xenbus_device *dev,
if (xennet_connect(netdev) != 0)
break;
xenbus_switch_state(dev, XenbusStateConnected);
+ break;
+
+ case XenbusStateConnected:
netif_notify_peers(netdev);
break;
--
1.7.4.4
^ permalink raw reply related
* Re: [PATCH v3 REPOST] xen-netfront: delay gARP until backend switches to Connected
From: David Miller @ 2011-12-09 18:45 UTC (permalink / raw)
To: lersek
Cc: jeremy, xen-devel, ian.campbell, konrad.wilk, netdev,
linux-kernel, virtualization
In-Reply-To: <1323430738-3798-1-git-send-email-lersek@redhat.com>
From: Laszlo Ersek <lersek@redhat.com>
Date: Fri, 9 Dec 2011 12:38:58 +0100
> These two together provide complete ordering. Sub-condition (1) is
> satisfied by pvops commit 43223efd9bfd.
I don't see this commit in Linus's tree, so I doubt it's valid for
me to apply this as a bug fix to my 'net' tree since the precondition
pvops commit isn't upstream as far as I can tell.
Where did you intend me to apply this patch, and how did you expect
the dependent commit to make it's way into the tree so that this
fix is complete?
BTW, you should always explicitly specificy the answers to all the
questions in the previous paragraph, otherwise (like right now) we
go back and forth wasting time establishing these facts.
The way to say which tree the patch is intended for is to specify
it in the Subject like, f.e. "[PATCH net-next v3 REPOST] ..."
^ permalink raw reply
* Re: [PATCH v3 REPOST] xen-netfront: delay gARP until backend switches to Connected
From: Ian Campbell @ 2011-12-09 21:23 UTC (permalink / raw)
To: David Miller
Cc: jeremy@goop.org, xen-devel@lists.xensource.com,
konrad.wilk@oracle.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, lersek@redhat.com
In-Reply-To: <20111209.134514.2071890208094978847.davem@davemloft.net>
On Fri, 2011-12-09 at 18:45 +0000, David Miller wrote:
> From: Laszlo Ersek <lersek@redhat.com>
> Date: Fri, 9 Dec 2011 12:38:58 +0100
>
> > These two together provide complete ordering. Sub-condition (1) is
> > satisfied by pvops commit 43223efd9bfd.
>
> I don't see this commit in Linus's tree,
The referenced commit is in
git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git#xen/next-2.6.32 some people call the "pvops tree" but there's no reason to expect someone outside the Xen world to know that...
A better reference would have been 6b0b80ca7165 in
git://xenbits.xen.org/people/ianc/linux-2.6.git#upstream/dom0/backend/netback-history which is the precise branch that was flattened to make f942dc2552b8, which is the upstream commit that added netback, so this change is already in upstream.
Ian.
^ permalink raw reply
* Re: [PATCH v3 REPOST] xen-netfront: delay gARP until backend switches to Connected
From: David Miller @ 2011-12-10 0:24 UTC (permalink / raw)
To: Ian.Campbell
Cc: jeremy, xen-devel, konrad.wilk, netdev, linux-kernel,
virtualization, lersek
In-Reply-To: <1323465781.20936.49.camel@dagon.hellion.org.uk>
From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Fri, 9 Dec 2011 21:23:00 +0000
> On Fri, 2011-12-09 at 18:45 +0000, David Miller wrote:
>> From: Laszlo Ersek <lersek@redhat.com>
>> Date: Fri, 9 Dec 2011 12:38:58 +0100
>>
>> > These two together provide complete ordering. Sub-condition (1) is
>> > satisfied by pvops commit 43223efd9bfd.
>>
>> I don't see this commit in Linus's tree,
>
> The referenced commit is in
> git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git#xen/next-2.6.32 some people call the "pvops tree" but there's no reason to expect someone outside the Xen world to know that...
>
> A better reference would have been 6b0b80ca7165 in
> git://xenbits.xen.org/people/ianc/linux-2.6.git#upstream/dom0/backend/netback-history which is the precise branch that was flattened to make f942dc2552b8, which is the upstream commit that added netback, so this change is already in upstream.
I want the commit message fixed so someone seeing the commit ID can
figure out what it actually refers to.
^ permalink raw reply
* Re: [RFC 4/11] virtio-pci: define layout for virtio vendor-specific capabilities.
From: Sasha Levin @ 2011-12-10 21:14 UTC (permalink / raw)
To: Rusty Russell; +Cc: Michael S. Tsirkin, virtualization
In-Reply-To: <874nxbgxmr.fsf@rustcorp.com.au>
On Thu, 2011-12-08 at 21:04 +1030, Rusty Russell wrote:
> Based on patch by Michael S. Tsirkin <mst@redhat.com>, but I found it
> hard to follow so changed to use structures which are more
> self-documenting.
>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> ---
> include/linux/virtio_pci.h | 41 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
> --- a/include/linux/virtio_pci.h
> +++ b/include/linux/virtio_pci.h
> @@ -92,4 +92,45 @@
> /* The alignment to use between consumer and producer parts of vring.
> * x86 pagesize again. */
> #define VIRTIO_PCI_VRING_ALIGN 4096
> +
> +/* IDs for different capabilities. Must all exist. */
> +/* FIXME: Do we win from separating ISR, NOTIFY and COMMON? */
> +/* Common configuration */
> +#define VIRTIO_PCI_CAP_COMMON_CFG 1
> +/* Notifications */
> +#define VIRTIO_PCI_CAP_NOTIFY_CFG 2
> +/* ISR access */
> +#define VIRTIO_PCI_CAP_ISR_CFG 3
> +/* Device specific confiuration */
> +#define VIRTIO_PCI_CAP_DEVICE_CFG 4
> +
> +/* This is the PCI capability header: */
> +struct virtio_pci_cap {
> + u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */
> + u8 cap_next; /* Generic PCI field: next ptr. */
There should be a cap_len field here, which is mandatory for
PCI_CAP_ID_VNDR capabilities.
> + u8 cfg_type; /* One of the VIRTIO_PCI_CAP_*_CFG. */
> +/* FIXME: Should we use a bir, instead of raw bar number? */
> + u8 bar; /* Where to find it. */
> + __le32 offset; /* Offset within bar. */
> + __le32 length; /* Length. */
> +};
> +
> +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> +struct virtio_pci_common_cfg {
> + /* About the whole device. */
> + __le32 device_feature_select; /* read-write */
> + __le32 device_feature; /* read-only */
> + __le32 guest_feature_select; /* read-write */
> + __le32 guest_feature; /* read-only */
> + __le16 msix_config; /* read-write */
> + __u8 device_status; /* read-write */
> + __u8 unused;
> +
> + /* About a specific virtqueue. */
> + __le16 queue_select; /* read-write */
> + __le16 queue_align; /* read-write, power of 2. */
> + __le16 queue_size; /* read-write, power of 2. */
> + __le16 queue_msix_vector;/* read-write */
> + __le64 queue_address; /* read-write: 0xFFFFFFFFFFFFFFFF == DNE. */
This is now a 64bit address, but we can't do an atomic iowrite64(). We
should make it clear to the device when it should initialize the vq.
--
Sasha.
^ permalink raw reply
* Re: [RFC 6/11] virtio_pci: don't use the legacy driver if we find the new PCI capabilities.
From: Sasha Levin @ 2011-12-10 21:18 UTC (permalink / raw)
To: Rusty Russell; +Cc: Michael S. Tsirkin, virtualization
In-Reply-To: <871usfgxg9.fsf@rustcorp.com.au>
On Thu, 2011-12-08 at 21:08 +1030, Rusty Russell wrote:
> With module option to override. I assume I can call
> pci_find_capability() before pci_request_regions?
I don't think thats safe.
--
Sasha.
^ permalink raw reply
* Re: [PATCH 0/11] RFC: PCI using capabilitities
From: Sasha Levin @ 2011-12-10 21:32 UTC (permalink / raw)
To: Rusty Russell; +Cc: Michael S. Tsirkin, Avi Kivity, kvm, virtualization
In-Reply-To: <87d3byfev0.fsf@rustcorp.com.au>
On Fri, 2011-12-09 at 16:47 +1030, Rusty Russell wrote:
> On Thu, 08 Dec 2011 17:37:37 +0200, Sasha Levin <levinsasha928@gmail.com> wrote:
> > Which leads me to the question: Are MMIO vs MMIO reads/writes not
> > ordered?
>
> That seems really odd, especially being repeatable.
Happens every single time. Can't be a coincidence.
I even went into paranoia mode and made sure that both IO requests come
from the same vcpu.
Another weird thing I've noticed is that mb() doesn't fix it, while if I
replace the mb() with a printk() it works well.
> BTW, that's an address, not a pfn now.
Fixed :)
--
Sasha.
^ permalink raw reply
* Re: [RFC 6/11] virtio_pci: don't use the legacy driver if we find the new PCI capabilities.
From: Rusty Russell @ 2011-12-11 5:15 UTC (permalink / raw)
To: Sasha Levin; +Cc: Michael S. Tsirkin, virtualization
In-Reply-To: <1323551918.32487.34.camel@lappy>
On Sat, 10 Dec 2011 23:18:38 +0200, Sasha Levin <levinsasha928@gmail.com> wrote:
> On Thu, 2011-12-08 at 21:08 +1030, Rusty Russell wrote:
> > With module option to override. I assume I can call
> > pci_find_capability() before pci_request_regions?
>
> I don't think thats safe.
Actually, a number of drivers do it first thing in probe()
(eg. drivers/char/agp/intel-agp.c).
So I think it's easiest right at the start. I've changed that.
Thanks,
Rusty.
^ permalink raw reply
* Re: [PATCH 0/11] RFC: PCI using capabilitities
From: Avi Kivity @ 2011-12-11 9:05 UTC (permalink / raw)
To: Sasha Levin; +Cc: Michael S. Tsirkin, kvm, virtualization
In-Reply-To: <1323358657.32487.9.camel@lappy>
On 12/08/2011 05:37 PM, Sasha Levin wrote:
> On Thu, 2011-12-08 at 20:52 +1030, Rusty Russell wrote:
> > Here's the patch series I ended up with. I haven't coded up the QEMU
> > side yet, so no idea if the new driver works.
> >
> > Questions:
> > (1) Do we win from separating ISR, NOTIFY and COMMON?
> > (2) I used a "u8 bar"; should I use a bir and pack it instead? BIR
> > seems a little obscure (noone else in the kernel source seems to
> > refer to it).
>
> I started implementing it for KVM tools, when I noticed a strange thing:
> my vq creating was failing because the driver was reading a value other
> than 0 from the address field of a new vq, and failing.
>
> I've added simple prints in the usermode code, and saw the following
> ordering:
>
> 1. queue select vq 0
> 2. queue read address (returns 0 - new vq)
> 3. queue write address (good address of vq)
> 4. queue read address (returns !=0, fails)
> 4. queue select vq 1
>
> From that I understood that the ordering is wrong, the driver was trying
> to read address before selecting the correct vq.
>
> At that point, I've added simple prints to the driver. Initially it
> looked as follows:
>
> iowrite16(index, &vp_dev->common->queue_select);
>
> switch (ioread64(&vp_dev->common->queue_address)) {
> [...]
> };
>
> So I added prints before the iowrite16() and after the ioread64(), and
> saw that while the driver prints were ordered, the device ones weren't:
>
> [ 1.264052] before iowrite index=1
> kvmtool: net returning pfn (vq=0): 310706176
> kvmtool: queue selected: 1
> [ 1.264890] after ioread index=1
>
> Suspecting that something was wrong with ordering, I've added a print
> between the iowrite and the ioread, and it finally started working well.
>
> Which leads me to the question: Are MMIO vs MMIO reads/writes not
> ordered?
mmios are strictly ordered.
Perhaps your printfs are reordered by buffering? Are they from
different threads? Are you using coalesced mmio (which is still
strictly ordered, if used correctly)?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: [RFC 6/11] virtio_pci: don't use the legacy driver if we find the new PCI capabilities.
From: Michael S. Tsirkin @ 2011-12-11 9:37 UTC (permalink / raw)
To: Sasha Levin; +Cc: virtualization
In-Reply-To: <1323551918.32487.34.camel@lappy>
On Sat, Dec 10, 2011 at 11:18:38PM +0200, Sasha Levin wrote:
> On Thu, 2011-12-08 at 21:08 +1030, Rusty Russell wrote:
> > With module option to override. I assume I can call
> > pci_find_capability() before pci_request_regions?
>
> I don't think thats safe.
Why wouldn't it be? pci_find_capability is only
using config cycles, it never needs any regions.
> --
>
> Sasha.
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2011-12-11 9:42 UTC (permalink / raw)
To: Rusty Russell; +Cc: Sasha Levin, virtualization
In-Reply-To: <87zkf3fiu2.fsf@rustcorp.com.au>
On Thu, Dec 08, 2011 at 09:09:33PM +1030, Rusty Russell wrote:
> +/* There is no iowrite64. We use two 32-bit ops. */
> +static void iowrite64(u64 val, const __le64 *addr)
> +{
> + iowrite32((u32)val, (__le32 *)addr);
> + iowrite32(val >> 32, (__le32 *)addr + 1);
> +}
> +
Let's put addr_lo/addr_hi in the structure then,
to make the fact this field is not atomic explicit?
^ permalink raw reply
* Re: [PATCH 0/11] RFC: PCI using capabilitities
From: Sasha Levin @ 2011-12-11 10:03 UTC (permalink / raw)
To: Avi Kivity; +Cc: Michael S. Tsirkin, kvm, virtualization
In-Reply-To: <4EE4726F.3010503@redhat.com>
On Sun, 2011-12-11 at 11:05 +0200, Avi Kivity wrote:
> mmios are strictly ordered.
>
> Perhaps your printfs are reordered by buffering? Are they from
> different threads? Are you using coalesced mmio (which is still
> strictly ordered, if used correctly)?
I print the queue_selector and queue_address in the printfs, even if
printfs were reordered they would be printing the data right, unlike
they do now. It's the data in the printfs that matters, not their order.
Same vcpu thread with both accesses.
Not using coalesced mmio.
--
Sasha.
^ permalink raw reply
* [PATCH net-next v4] xen-netfront: delay gARP until backend switches to Connected
From: Laszlo Ersek @ 2011-12-11 11:48 UTC (permalink / raw)
To: konrad.wilk, jeremy, xen-devel, virtualization, netdev,
linux-kernel
In-Reply-To: <20111209.192439.275865894437319638.davem@davemloft.net>
After a guest is live migrated, the xen-netfront driver emits a gratuitous
ARP message, so that networking hardware on the target host's subnet can
take notice, and public routing to the guest is re-established. However,
if the packet appears on the backend interface before the backend is added
to the target host's bridge, the packet is lost, and the migrated guest's
peers become unable to talk to the guest.
A sufficient two-parts condition to prevent the above is:
(1) ensure that the backend only moves to Connected xenbus state after its
hotplug scripts completed, ie. the netback interface got added to the
bridge; and
(2) ensure the frontend only queues the gARP when it sees the backend move
to Connected.
These two together provide complete ordering. Sub-condition (1) is already
satisfied by commit f942dc2552b8 in Linus' tree, based on commit
6b0b80ca7165 from [1].
In general, the full condition is sufficient, not necessary, because,
according to [2], live migration has been working for a long time without
satisfying sub-condition (2). However, after 6b0b80ca7165 was backported
to the RHEL-5 host to ensure (1), (2) still proved necessary in the RHEL-6
guest. This patch intends to provide (2) for upstream.
The Reviewed-by line comes from [3].
[1] git://xenbits.xen.org/people/ianc/linux-2.6.git#upstream/dom0/backend/netback-history
[2] http://old-list-archives.xen.org/xen-devel/2011-06/msg01969.html
[3] http://old-list-archives.xen.org/xen-devel/2011-07/msg00484.html
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
---
drivers/net/xen-netfront.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index d29365a..f033656 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1646,7 +1646,6 @@ static void netback_changed(struct xenbus_device *dev,
case XenbusStateInitialised:
case XenbusStateReconfiguring:
case XenbusStateReconfigured:
- case XenbusStateConnected:
case XenbusStateUnknown:
case XenbusStateClosed:
break;
@@ -1657,6 +1656,9 @@ static void netback_changed(struct xenbus_device *dev,
if (xennet_connect(netdev) != 0)
break;
xenbus_switch_state(dev, XenbusStateConnected);
+ break;
+
+ case XenbusStateConnected:
netif_notify_peers(netdev);
break;
--
1.7.4.4
^ permalink raw reply related
* Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Michael S. Tsirkin @ 2011-12-11 12:25 UTC (permalink / raw)
To: Rusty Russell, akong
Cc: kvm, Benjamin Herrenschmidt, linux-kernel, virtualization,
linux-arm-kernel
In-Reply-To: <87hb1iqls3.fsf@rustcorp.com.au>
[-- Attachment #1: Type: text/plain, Size: 2147 bytes --]
On Sat, Dec 03, 2011 at 03:44:36PM +1030, Rusty Russell wrote:
> On Sat, 03 Dec 2011 10:09:44 +1100, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > On Tue, 2011-11-29 at 14:31 +0200, Ohad Ben-Cohen wrote:
> > > A trivial, albeit sub-optimal, solution would be to simply revert
> > > commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
> > > that's going to have a negative impact on performance of SMP-based
> > > virtualization use cases.
> >
> > Have you measured the impact of using normal barriers (non-SMP ones)
> > like we use on normal HW drivers unconditionally ?
> >
> > IE. If the difference is small enough I'd say just go for it and avoid
> > the bloat.
>
> Yep. Plan is:
> 1) Measure the difference.
> 2) Difference unmeassurable? Use normal barriers (ie. revert d57ed95).
> 3) Difference small? Revert d57ed95 for 3.2, revisit for 3.3.
> 4) Difference large? Runtime switch based on "if you're PCI" for 3.2,
> revisit for 3.3.
>
> Cheers,
> Rusty.
Forwarding some results by Amos, who run multiple netperf streams in
parallel, from an external box to the guest. TCP_STREAM results were
noisy. This could be due to buffering done by TCP, where packet size
varies even as message size is constant.
TCP_RR results were consistent. In this benchmark, after switching
to mandatory barriers, CPU utilization increased by up to 35% while
throughput went down by up to 14%. the normalized throughput/cpu
regressed consistently, between 7 and 35%
The "fix" applied was simply this:
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3198f2e..fdccb77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -23,7 +23,7 @@
/* virtio guest is communicating with a virtual "device" that actually runs on
* a host processor. Memory barriers are used to control SMP effects. */
-#ifdef CONFIG_SMP
+#if 0
/* Where possible, use SMP barriers which are more lightweight than mandatory
* barriers, because mandatory barriers control MMIO effects on accesses
* through relaxed memory I/O windows (which virtio does not use). */
--
MST
[-- Attachment #2: compare-external_host-guest.txt --]
[-- Type: text/plain, Size: 18832 bytes --]
Fri Dec 9 23:57:33 2011
1 - old-exhost_guest.txt
2 - fixed-exhost_guest.txt
======================================================================================================================================================
TCP_STREAM
sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq
1 1| 64| 949.64| 10.64| 89| 1170134| 1368739| 0| 17| 487392| 488820| 504716| 2.39| 2.71
2 1| 64| 946.03| 10.87| 87| 1119582| 1325851| 0| 17| 493763| 485865| 516161| 2.30| 2.57
% | 0.0| -0.4| +2.2| -2.2| -4.3| -3.1| 0| 0.0| +1.3| -0.6| +2.3| -3.8| -5.2
1 2| 64| 1877.15| 15.45| 121| 2151267| 2561929| 0| 33| 923916| 971093| 969360| 2.22| 2.64
2 2| 64| 1867.63| 15.06| 124| 2212457| 2607606| 0| 33| 836160| 927721| 883964| 2.38| 2.95
% | 0.0| -0.5| -2.5| +2.5| +2.8| +1.8| 0| 0.0| -9.5| -4.5| -8.8| +7.2| +11.7
1 4| 64| 3577.38| 19.62| 182| 4176151| 5036661| 0| 64| 1677417| 1412979| 1859101| 2.96| 2.71
2 4| 64| 3583.17| 20.05| 178| 4215327| 5063534| 0| 65| 1682582| 1549394| 1759033| 2.72| 2.88
% | 0.0| +0.2| +2.2| -2.2| +0.9| +0.5| 0| +1.6| +0.3| +9.7| -5.4| -8.1| +6.3
1 1| 256| 2654.52| 11.41| 232| 925787| 1029214| 0| 14| 597763| 670927| 619414| 1.38| 1.66
2 1| 256| 2632.22| 20.32| 129| 977446| 1036094| 0| 15| 742699| 715460| 764512| 1.37| 1.36
% | 0.0| -0.8| +78.1| -44.4| +5.6| +0.7| 0| +7.1| +24.2| +6.6| +23.4| -0.7| -18.1
1 2| 256| 5228.76| 16.94| 308| 1949442| 2082492| 0| 30| 1230329| 1323945| 1274262| 1.47| 1.63
2 2| 256| 5140.98| 19.58| 262| 1991090| 2093206| 0| 30| 1400232| 1271363| 1441564| 1.57| 1.45
% | 0.0| -1.7| +15.6| -14.9| +2.1| +0.5| 0| 0.0| +13.8| -4.0| +13.1| +6.8| -11.0
1 4| 256| 9412.61| 24.04| 391| 2292404| 2351356| 0| 35| 1669864| 555786| 1741742| 4.12| 1.35
2 4| 256| 9408.92| 22.80| 412| 2303267| 2350153| 0| 35| 1665622| 889883| 1760349| 2.59| 1.34
% | 0.0| -0.0| -5.2| +5.4| +0.5| -0.1| 0| 0.0| -0.3| +60.1| +1.1| -37.1| -0.7
1 1| 512| 3683.30| 13.18| 279| 914301| 1024803| 0| 14| 633219| 676673| 661047| 1.35| 1.55
2 1| 512| 3639.29| 14.48| 251| 997516| 1039428| 0| 15| 682321| 754864| 708906| 1.32| 1.47
% | 0.0| -1.2| +9.9| -10.0| +9.1| +1.4| 0| +7.1| +7.8| +11.6| +7.2| -2.2| -5.2
1 2| 512| 7476.00| 19.81| 377| 1938184| 2018640| 0| 29| 1473562| 864841| 1513069| 2.24| 1.33
2 2| 512| 7220.90| 19.18| 376| 1785571| 2017992| 0| 28| 1177162| 1113322| 1232894| 1.60| 1.64
% | 0.0| -3.4| -3.2| -0.3| -7.9| -0.0| 0| -3.4| -20.1| +28.7| -18.5| -28.6| +23.3
1 4| 512| 8858.46| 23.02| 384| 2312952| 2376210| 0| 38| 1757540| 92819| 1803434| 24.92| 1.32
2 4| 512| 9412.01| 24.78| 379| 1984425| 2308926| 0| 30| 1597209| 682183| 1808295| 2.91| 1.28
% | 0.0| +6.2| +7.6| -1.3| -14.2| -2.8| 0| -21.1| -9.1| +635.0| +0.3| -88.3| -3.0
1 1| 1024| 4208.94| 14.76| 285| 849941| 921136| 0| 13| 674738| 613922| 700105| 1.38| 1.32
2 1| 1024| 4192.50| 14.03| 298| 817761| 922081| 0| 12| 583808| 566704| 609916| 1.44| 1.51
% | 0.0| -0.4| -4.9| +4.6| -3.8| +0.1| 0| -7.7| -13.5| -7.7| -12.9| +4.3| +14.4
1 2| 1024| 9408.36| 21.04| 447| 1767352| 1918373| 0| 27| 1124248| 860371| 1179068| 2.05| 1.63
2 2| 1024| 8694.83| 18.20| 477| 1784228| 1872733| 0| 28| 1116901| 1220318| 1161491| 1.46| 1.61
% | 0.0| -7.6| -13.5| +6.7| +1.0| -2.4| 0| +3.7| -0.7| +41.8| -1.5| -28.8| -1.2
1 4| 1024| 9411.98| 24.04| 391| 1961925| 2292825| 0| 30| 1398947| 965506| 1835604| 2.03| 1.25
2 4| 1024| 9409.97| 26.15| 359| 2072379| 2317492| 0| 34| 1494789| 855245| 1874374| 2.42| 1.24
% | 0.0| -0.0| +8.8| -8.2| +5.6| +1.1| 0| +13.3| +6.9| -11.4| +2.1| +19.2| -0.8
TCP_MAERTS
sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq
1 1| 64| 1092.21| 17.68| 61| 5855912| 1155517| 4| 89| 476428| 4912055| 538322| 1.19| 2.15
2 1| 64| 1031.43| 17.42| 59| 5533459| 1114148| 1| 85| 476884| 4701241| 538789| 1.18| 2.07
% | 0.0| -5.6| -1.5| -3.3| -5.5| -3.6| -75.0| -4.5| +0.1| -4.3| +0.1| -0.8| -3.7
1 2| 64| 2540.09| 32.76| 77| 13466718| 2509747| 1| 206| 473959| 2026564| 594762| 6.65| 4.22
2 2| 64| 2523.91| 32.72| 77| 13371187| 2486851| 1| 204| 477741| 1909387| 598663| 7.00| 4.15
% | 0.0| -0.6| -0.1| 0.0| -0.7| -0.9| 0.0| -1.0| +0.8| -5.8| +0.7| +5.3| -1.7
1 4| 64| 2532.23| 32.96| 76| 13444352| 2555416| 1| 205| 477843| 2048074| 598584| 6.56| 4.27
2 4| 64| 2519.36| 33.10| 76| 13371867| 2530066| 3| 207| 476255| 1917578| 596967| 6.97| 4.24
% | 0.0| -0.5| +0.4| 0.0| -0.5| -1.0| +200.0| +1.0| -0.3| -6.4| -0.3| +6.3| -0.7
1 1| 256| 3481.21| 24.08| 144| 18046440| 1051754| 0| 277| 95147| 115092| 157259| 156.80| 6.69
2 1| 256| 3582.56| 25.00| 143| 18555128| 1079983| 2| 285| 82839| 69704| 144987| 266.20| 7.45
% | 0.0| +2.9| +3.8| -0.7| +2.8| +2.7| 0| +2.9| -12.9| -39.4| -7.8| +69.8| +11.4
1 2| 256| 7835.95| 29.20| 268| 9215513| 988831| 2| 13609| 242276| 235627| 359178| 39.11| 2.75
2 2| 256| 7227.12| 26.14| 276| 8224494| 974662| 0| 3787| 267604| 257681| 362507| 31.92| 2.69
% | 0.0| -7.8| -10.5| +3.0| -10.8| -1.4| -100.0| -72.2| +10.5| +9.4| +0.9| -18.4| -2.2
1 4| 256| 9286.71| 29.59| 313| 3119380| 1125149| 1585| 2794| 365162| 334103| 483517| 9.34| 2.33
2 4| 256| 9365.28| 28.90| 324| 2609986| 1116090| 971| 2064| 373494| 342389| 488519| 7.62| 2.28
% | 0.0| +0.8| -2.3| +3.5| -16.3| -0.8| -38.7| -26.1| +2.3| +2.5| +1.0| -18.4| -2.1
1 1| 512| 9067.78| 17.76| 510| 2489492| 926754| 0| 3072| 350716| 318289| 415029| 7.82| 2.23
2 1| 512| 9363.32| 17.00| 550| 1239455| 934632| 0| 360| 444912| 419859| 506000| 2.95| 1.85
% | 0.0| +3.3| -4.3| +7.8| -50.2| +0.9| 0| -88.3| +26.9| +31.9| +21.9| -62.3| -17.0
1 2| 512| 8897.24| 20.18| 440| 4677263| 765728| 0| 8964| 312268| 266093| 384872| 17.58| 1.99
2 2| 512| 8994.28| 19.60| 458| 4068494| 777077| 0| 8040| 324208| 279281| 394930| 14.57| 1.97
% | 0.0| +1.1| -2.9| +4.1| -13.0| +1.5| 0| -10.3| +3.8| +5.0| +2.6| -17.1| -1.0
1 4| 512| 8660.59| 22.89| 378| 6274162| 1004294| 0| 9581| 300087| 248234| 381499| 25.28| 2.63
2 4| 512| 8720.95| 22.42| 388| 6321071| 996665| 2| 9307| 293339| 239522| 372874| 26.39| 2.67
% | 0.0| +0.7| -2.1| +2.6| +0.7| -0.8| 0| -2.9| -2.2| -3.5| -2.3| +4.4| +1.5
1 1| 1024| 9382.57| 16.55| 566| 1098014| 943968| 0| 63| 411612| 385473| 463009| 2.85| 2.04
2 1| 1024| 9375.36| 16.46| 569| 1113185| 962141| 0| 109| 434974| 419286| 486685| 2.65| 1.98
% | 0.0| -0.1| -0.5| +0.5| +1.4| +1.9| 0| +73.0| +5.7| +8.8| +5.1| -7.0| -2.9
1 2| 1024| 9341.87| 17.32| 539| 1449472| 807439| 0| 897| 380053| 354028| 433351| 4.09| 1.86
2 2| 1024| 9364.12| 17.13| 546| 1344190| 794110| 0| 581| 379684| 352227| 431694| 3.82| 1.84
% | 0.0| +0.2| -1.1| +1.3| -7.3| -1.7| 0| -35.2| -0.1| -0.5| -0.4| -6.6| -1.1
1 4| 1024| 9391.41| 18.70| 502| 1509582| 1104681| 2166| 658| 399083| 369721| 453543| 4.08| 2.44
2 4| 1024| 9404.67| 18.45| 509| 1361411| 1083647| 386| 398| 400368| 374415| 453718| 3.64| 2.39
% | 0.0| +0.1| -1.3| +1.4| -9.8| -1.9| -82.2| -39.5| +0.3| +1.3| +0.0| -10.8| -2.0
TCP_RR
sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq
1 50| 64| 10874.30| 6.75| 1611| 652525| 652529| 0| 10| 652518| 652582| 663468| 1.00| 0.98
2 50| 64| 9568.49| 9.13| 1048| 574176| 574180| 0| 8| 574166| 574255| 593988| 1.00| 0.97
% | 0.0| -12.0| +35.3| -34.9| -12.0| -12.0| 0| -20.0| -12.0| -12.0| -10.5| 0.0| -1.0
1 100| 64| 10880.83| 6.77| 1607| 652919| 652926| 0| 10| 652912| 652981| 664406| 1.00| 0.98
2 100| 64| 9586.65| 7.67| 1249| 575267| 575270| 0| 9| 575257| 575326| 593343| 1.00| 0.97
% | 0.0| -11.9| +13.3| -22.3| -11.9| -11.9| 0| -10.0| -11.9| -11.9| -10.7| 0.0| -1.0
1 250| 64| 10930.11| 6.81| 1605| 655873| 655884| 0| 10| 655869| 655951| 664979| 1.00| 0.99
2 250| 64| 9556.15| 7.65| 1249| 573436| 573440| 0| 9| 573427| 573488| 591151| 1.00| 0.97
% | 0.0| -12.6| +12.3| -22.2| -12.6| -12.6| 0| -10.0| -12.6| -12.6| -11.1| 0.0| -2.0
1 500| 64| 10930.33| 6.77| 1614| 655886| 655890| 0| 10| 655877| 655949| 664983| 1.00| 0.99
2 500| 64| 9598.61| 8.46| 1134| 575983| 575987| 0| 9| 575975| 576057| 595875| 1.00| 0.97
% | 0.0| -12.2| +25.0| -29.7| -12.2| -12.2| 0| -10.0| -12.2| -12.2| -10.4| 0.0| -2.0
1 50| 256| 10435.24| 8.87| 1176| 626181| 626184| 0| 9| 626173| 626240| 641870| 1.00| 0.98
2 50| 256| 8948.58| 10.06| 889| 536981| 536985| 0| 8| 536971| 537025| 552906| 1.00| 0.97
% | 0.0| -14.2| +13.4| -24.4| -14.2| -14.2| 0| -11.1| -14.2| -14.2| -13.9| 0.0| -1.0
1 100| 256| 10401.87| 8.73| 1191| 624179| 624184| 0| 10| 624168| 624248| 640003| 1.00| 0.98
2 100| 256| 9060.39| 10.46| 866| 543692| 543696| 0| 8| 543683| 543752| 559009| 1.00| 0.97
% | 0.0| -12.9| +19.8| -27.3| -12.9| -12.9| 0| -20.0| -12.9| -12.9| -12.7| 0.0| -1.0
1 250| 256| 10369.91| 8.79| 1179| 622261| 622267| 0| 9| 622252| 622325| 638523| 1.00| 0.97
2 250| 256| 9011.48| 10.88| 828| 540755| 540758| 0| 8| 540743| 540830| 555336| 1.00| 0.97
% | 0.0| -13.1| +23.8| -29.8| -13.1| -13.1| 0| -11.1| -13.1| -13.1| -13.0| 0.0| 0.0
1 500| 256| 10400.42| 8.70| 1195| 624094| 624099| 0| 10| 624085| 624155| 639254| 1.00| 0.98
2 500| 256| 9178.08| 10.83| 847| 550753| 550757| 0| 9| 550744| 550811| 566165| 1.00| 0.97
% | 0.0| -11.8| +24.5| -29.1| -11.8| -11.8| 0| -10.0| -11.8| -11.8| -11.4| 0.0| -1.0
1 50| 512| 8231.77| 6.22| 1323| 493974| 493978| 0| 7| 493966| 494049| 507228| 1.00| 0.97
2 50| 512| 7718.30| 5.81| 1328| 463164| 463168| 0| 7| 463155| 463218| 478833| 1.00| 0.97
% | 0.0| -6.2| -6.6| +0.4| -6.2| -6.2| 0| 0.0| -6.2| -6.2| -5.6| 0.0| 0.0
1 100| 512| 8267.63| 6.16| 1342| 496125| 496130| 0| 8| 496115| 496186| 509422| 1.00| 0.97
2 100| 512| 7713.78| 8.29| 930| 462894| 462900| 0| 7| 462885| 462977| 488732| 1.00| 0.95
% | 0.0| -6.7| +34.6| -30.7| -6.7| -6.7| 0| -12.5| -6.7| -6.7| -4.1| 0.0| -2.1
1 250| 512| 8227.23| 6.30| 1305| 493701| 493704| 0| 7| 493691| 493751| 507689| 1.00| 0.97
2 250| 512| 7737.14| 8.35| 926| 464296| 464303| 0| 7| 464291| 464358| 486086| 1.00| 0.96
% | 0.0| -6.0| +32.5| -29.0| -6.0| -6.0| 0| 0.0| -6.0| -6.0| -4.3| 0.0| -1.0
1 500| 512| 8259.80| 6.20| 1332| 495656| 495661| 0| 8| 495649| 495711| 509070| 1.00| 0.97
2 500| 512| 7614.72| 7.47| 1019| 456950| 456959| 0| 7| 456941| 457005| 476914| 1.00| 0.96
% | 0.0| -7.8| +20.5| -23.5| -7.8| -7.8| 0| -12.5| -7.8| -7.8| -6.3| 0.0| -1.0
1 50| 1024| 7879.43| 6.80| 1158| 472832| 472835| 0| 7| 472822| 472906| 488163| 1.00| 0.97
2 50| 1024| 7302.28| 7.35| 993| 438203| 438207| 0| 7| 438193| 438261| 455627| 1.00| 0.96
% | 0.0| -7.3| +8.1| -14.2| -7.3| -7.3| 0| 0.0| -7.3| -7.3| -6.7| 0.0| -1.0
1 100| 1024| 7916.07| 6.84| 1157| 475031| 475038| 0| 8| 475024| 475097| 490797| 1.00| 0.97
2 100| 1024| 7491.90| 6.96| 1076| 449582| 449587| 0| 7| 449573| 449627| 462560| 1.00| 0.97
% | 0.0| -5.4| +1.8| -7.0| -5.4| -5.4| 0| -12.5| -5.4| -5.4| -5.8| 0.0| 0.0
1 250| 1024| 7865.20| 6.82| 1153| 471978| 471984| 0| 7| 471969| 472035| 486978| 1.00| 0.97
2 250| 1024| 7472.83| 7.07| 1056| 448436| 448442| 0| 6| 448431| 448496| 463680| 1.00| 0.97
% | 0.0| -5.0| +3.7| -8.4| -5.0| -5.0| 0| -14.3| -5.0| -5.0| -4.8| 0.0| 0.0
1 500| 1024| 7931.30| 6.99| 1134| 475944| 475954| 0| 7| 475939| 476016| 491021| 1.00| 0.97
2 500| 1024| 7407.54| 7.57| 978| 444519| 444524| 0| 7| 444509| 444567| 462360| 1.00| 0.96
% | 0.0| -6.6| +8.3| -13.8| -6.6| -6.6| 0| 0.0| -6.6| -6.6| -5.8| 0.0| -1.0
[-- Attachment #3: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply related
* Re: [PATCH 0/11] RFC: PCI using capabilitities
From: Michael S. Tsirkin @ 2011-12-11 12:30 UTC (permalink / raw)
To: Sasha Levin; +Cc: Avi Kivity, kvm, virtualization
In-Reply-To: <1323597832.4063.4.camel@lappy>
On Sun, Dec 11, 2011 at 12:03:52PM +0200, Sasha Levin wrote:
> On Sun, 2011-12-11 at 11:05 +0200, Avi Kivity wrote:
> > mmios are strictly ordered.
> >
> > Perhaps your printfs are reordered by buffering? Are they from
> > different threads? Are you using coalesced mmio (which is still
> > strictly ordered, if used correctly)?
>
> I print the queue_selector and queue_address in the printfs, even if
> printfs were reordered they would be printing the data right, unlike
> they do now. It's the data in the printfs that matters, not their order.
>
> Same vcpu thread with both accesses.
>
> Not using coalesced mmio.
Not sure why this would matter, but is the BAR a prefetcheable one?
Rusty's patch uses pci_iomap which maps a prefetcheable BAR
as cacheable.
> --
>
> Sasha.
^ permalink raw reply
* Re: [PATCH 0/11] RFC: PCI using capabilitities
From: Michael S. Tsirkin @ 2011-12-11 12:47 UTC (permalink / raw)
To: Sasha Levin; +Cc: Avi Kivity, kvm, virtualization
In-Reply-To: <1323358657.32487.9.camel@lappy>
On Thu, Dec 08, 2011 at 05:37:37PM +0200, Sasha Levin wrote:
> On Thu, 2011-12-08 at 20:52 +1030, Rusty Russell wrote:
> > Here's the patch series I ended up with. I haven't coded up the QEMU
> > side yet, so no idea if the new driver works.
> >
> > Questions:
> > (1) Do we win from separating ISR, NOTIFY and COMMON?
> > (2) I used a "u8 bar"; should I use a bir and pack it instead? BIR
> > seems a little obscure (noone else in the kernel source seems to
> > refer to it).
>
> I started implementing it for KVM tools, when I noticed a strange thing:
> my vq creating was failing because the driver was reading a value other
> than 0 from the address field of a new vq, and failing.
>
> I've added simple prints in the usermode code, and saw the following
> ordering:
>
> 1. queue select vq 0
> 2. queue read address (returns 0 - new vq)
> 3. queue write address (good address of vq)
> 4. queue read address (returns !=0, fails)
> 4. queue select vq 1
>
> >From that I understood that the ordering is wrong, the driver was trying
> to read address before selecting the correct vq.
>
> At that point, I've added simple prints to the driver. Initially it
> looked as follows:
>
> iowrite16(index, &vp_dev->common->queue_select);
>
> switch (ioread64(&vp_dev->common->queue_address)) {
> [...]
> };
>
> So I added prints before the iowrite16() and after the ioread64(), and
> saw that while the driver prints were ordered, the device ones weren't:
>
> [ 1.264052] before iowrite index=1
> kvmtool: net returning pfn (vq=0): 310706176
> kvmtool: queue selected: 1
> [ 1.264890] after ioread index=1
>
> Suspecting that something was wrong with ordering, I've added a print
> between the iowrite and the ioread, and it finally started working well.
>
> Which leads me to the question: Are MMIO vs MMIO reads/writes not
> ordered?
First, I'd like to answer your questions from the PCI side.
Look for PCI rules in the PCI spec.
You will notices that a write is required to be able to
pass a read request. It might also pass read completion.
A read request will not pass a write request.
There's more or less no ordering between different types of transactions
(memory versus io/configuration).
That's wrt to the question you asked.
But this is not your setup: you have a single vcpu so
you will not initiate a write (select vq) until you get
a read completion.
So what you are really describing is this setup: guest reads a value,
gets the response, then writes out another one, and kvm tool reports the
write before the read.
> --
>
> Sasha.
^ permalink raw reply
* Re: [PATCH 0/11] RFC: PCI using capabilitities
From: Sasha Levin @ 2011-12-11 12:48 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm, virtualization
In-Reply-To: <20111211123057.GD11504@redhat.com>
On Sun, 2011-12-11 at 14:30 +0200, Michael S. Tsirkin wrote:
> On Sun, Dec 11, 2011 at 12:03:52PM +0200, Sasha Levin wrote:
> > On Sun, 2011-12-11 at 11:05 +0200, Avi Kivity wrote:
> > > mmios are strictly ordered.
> > >
> > > Perhaps your printfs are reordered by buffering? Are they from
> > > different threads? Are you using coalesced mmio (which is still
> > > strictly ordered, if used correctly)?
> >
> > I print the queue_selector and queue_address in the printfs, even if
> > printfs were reordered they would be printing the data right, unlike
> > they do now. It's the data in the printfs that matters, not their order.
> >
> > Same vcpu thread with both accesses.
> >
> > Not using coalesced mmio.
>
> Not sure why this would matter, but is the BAR a prefetcheable one?
> Rusty's patch uses pci_iomap which maps a prefetcheable BAR
> as cacheable.
Wasn't defined as prefetchable, but I'm seeing same thing with or
without it.
--
Sasha.
^ permalink raw reply
* Re: [PATCH 0/11] RFC: PCI using capabilitities
From: Sasha Levin @ 2011-12-11 12:53 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm, virtualization
In-Reply-To: <20111211124703.GE11504@redhat.com>
On Sun, 2011-12-11 at 14:47 +0200, Michael S. Tsirkin wrote:
> First, I'd like to answer your questions from the PCI side.
> Look for PCI rules in the PCI spec.
> You will notices that a write is required to be able to
> pass a read request. It might also pass read completion.
> A read request will not pass a write request.
> There's more or less no ordering between different types of transactions
> (memory versus io/configuration).
>
> That's wrt to the question you asked.
>
> But this is not your setup: you have a single vcpu so
> you will not initiate a write (select vq) until you get
> a read completion.
>
> So what you are really describing is this setup: guest reads a value,
> gets the response, then writes out another one, and kvm tool reports the
> write before the read.
No, it's exactly the opposite. Guest writes a value first and then reads
one (writes queue_select and reads queue_address) and kvm tool reporting
the read before the write.
I must add here that the kvm tool doesn't do anything fancy with simple
IO/MMIO. Theres no thread games or anything similar there. The vcpu
thread is doing all the IO/MMIO work.
--
Sasha.
^ permalink raw reply
* Re: [PATCH RFC] virtio_net: fix refill related races
From: Michael S. Tsirkin @ 2011-12-11 14:44 UTC (permalink / raw)
To: Rusty Russell; +Cc: Amit Shah, netdev, linux-kernel, virtualization
In-Reply-To: <8739cvisqe.fsf@rustcorp.com.au>
On Thu, Dec 08, 2011 at 03:07:29PM +1030, Rusty Russell wrote:
> On Wed, 7 Dec 2011 17:21:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > Fix theoretical races related to refill work:
> > 1. After napi is disabled by ndo_stop, refill work
> > can run and re-enable it.
> > 2. Refill can reschedule itself, if this happens
> > it can run after cancel_delayed_work_sync,
> > and will access device after it is destroyed.
> >
> > As a solution, add flags to track napi state and
> > to disable refill, and toggle them on start, stop
> > and remove; check these flags on refill.
>
> Why isn't a "dont-readd" flag sufficient?
>
> Cheers,
> Rusty.
I started with that, but here's the problem I wanted to
address:
- we run out of descriptors and schedule refill work
- ndo_close runs
- refill work runs
- ndo_open runs
Now if we just disable refill, refill work will not add buffers and will
not reschedule. Now we'll never get more buffers.
We can try starting refill work from ndo_open but overall
this seems to me more risky than just splitting flags.
--
MST
^ permalink raw reply
* Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Benjamin Herrenschmidt @ 2011-12-11 22:27 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, virtualization, linux-arm-kernel
In-Reply-To: <20111211122544.GC11504@redhat.com>
On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:
> Forwarding some results by Amos, who run multiple netperf streams in
> parallel, from an external box to the guest. TCP_STREAM results were
> noisy. This could be due to buffering done by TCP, where packet size
> varies even as message size is constant.
>
> TCP_RR results were consistent. In this benchmark, after switching
> to mandatory barriers, CPU utilization increased by up to 35% while
> throughput went down by up to 14%. the normalized throughput/cpu
> regressed consistently, between 7 and 35%
>
> The "fix" applied was simply this:
What machine & processor was this ?
Cheers,
Ben.
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 3198f2e..fdccb77 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -23,7 +23,7 @@
>
> /* virtio guest is communicating with a virtual "device" that actually runs on
> * a host processor. Memory barriers are used to control SMP effects. */
> -#ifdef CONFIG_SMP
> +#if 0
> /* Where possible, use SMP barriers which are more lightweight than mandatory
> * barriers, because mandatory barriers control MMIO effects on accesses
> * through relaxed memory I/O windows (which virtio does not use). */
>
>
>
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Rusty Russell @ 2011-12-11 22:45 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Sasha Levin, virtualization
In-Reply-To: <20111211094256.GB11504@redhat.com>
On Sun, 11 Dec 2011 11:42:56 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Dec 08, 2011 at 09:09:33PM +1030, Rusty Russell wrote:
> > +/* There is no iowrite64. We use two 32-bit ops. */
> > +static void iowrite64(u64 val, const __le64 *addr)
> > +{
> > + iowrite32((u32)val, (__le32 *)addr);
> > + iowrite32(val >> 32, (__le32 *)addr + 1);
> > +}
> > +
>
> Let's put addr_lo/addr_hi in the structure then,
> to make the fact this field is not atomic explicit?
Good point, assuming I haven't missed something.
Are 64-bit accesses actually unknown in PCI-land? Or is this a limited
availability thing?
Thanks,
Rusty.
^ permalink raw reply
* Re: [PATCH RFC] virtio_net: fix refill related races
From: Rusty Russell @ 2011-12-11 22:55 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Amit Shah, netdev, linux-kernel, virtualization
In-Reply-To: <20111211144428.GB14381@redhat.com>
On Sun, 11 Dec 2011 16:44:29 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Dec 08, 2011 at 03:07:29PM +1030, Rusty Russell wrote:
> > On Wed, 7 Dec 2011 17:21:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > Fix theoretical races related to refill work:
> > > 1. After napi is disabled by ndo_stop, refill work
> > > can run and re-enable it.
> > > 2. Refill can reschedule itself, if this happens
> > > it can run after cancel_delayed_work_sync,
> > > and will access device after it is destroyed.
> > >
> > > As a solution, add flags to track napi state and
> > > to disable refill, and toggle them on start, stop
> > > and remove; check these flags on refill.
> >
> > Why isn't a "dont-readd" flag sufficient?
> >
> > Cheers,
> > Rusty.
>
> I started with that, but here's the problem I wanted to
> address:
>
> - we run out of descriptors and schedule refill work
> - ndo_close runs
> - refill work runs
> - ndo_open runs
(s/ndo_close/ndo_stop/)
You don't think we should do any refills on a closed device? If so, we
simply move the refill-stop code into ndo_stop (aka virtnet_close), and
the refill-start code into ndo_open (aka. virtnet_open). Right?
Orthogonally, the refill-stop code is still buggy, as you noted. And
for self-rearming timers the pattern I've always used is a flag.
Or am I being obtuse again? :)
Rusty.
^ permalink raw reply
* Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Amos Kong @ 2011-12-12 3:06 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: kvm, Michael S. Tsirkin, linux-kernel, virtualization,
linux-arm-kernel
In-Reply-To: <1323642447.19891.8.camel@pasglop>
On 12/12/11 06:27, Benjamin Herrenschmidt wrote:
> On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:
>
>> Forwarding some results by Amos, who run multiple netperf streams in
>> parallel, from an external box to the guest. TCP_STREAM results were
>> noisy. This could be due to buffering done by TCP, where packet size
>> varies even as message size is constant.
>>
>> TCP_RR results were consistent. In this benchmark, after switching
>> to mandatory barriers, CPU utilization increased by up to 35% while
>> throughput went down by up to 14%. the normalized throughput/cpu
>> regressed consistently, between 7 and 35%
>>
>> The "fix" applied was simply this:
>
> What machine& processor was this ?
pined guest memory to numa node 1
# numactl -m 1 qemu-kvm ...
pined guest vcpu threads and vhost thread to single cpu of numa node 1
# taskset -p 0x10 8348 (vhost_net_thread)
# taskset -p 0x20 8353 (vcpu 1 thread)
# taskset -p 0x40 8357 (vcpu 2 thread)
pined cpu/memory of netperf client process to node 1
# numactl --cpunodebind=1 --membind=1 netperf ...
8 cores
-------
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0xc
cpu MHz : 1596.000
cache size : 12288 KB
physical id : 1
siblings : 4
core id : 10
cpu cores : 4
apicid : 52
initial apicid : 52
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes
lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 4787.76
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
# cat /proc/meminfo
MemTotal: 16446616 kB
MemFree: 15874092 kB
Buffers: 30404 kB
Cached: 238640 kB
SwapCached: 0 kB
Active: 100204 kB
Inactive: 184312 kB
Active(anon): 15724 kB
Inactive(anon): 4 kB
Active(file): 84480 kB
Inactive(file): 184308 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 8388604 kB
SwapFree: 8388604 kB
Dirty: 56 kB
Writeback: 0 kB
AnonPages: 15548 kB
Mapped: 11540 kB
Shmem: 256 kB
Slab: 82444 kB
SReclaimable: 19220 kB
SUnreclaim: 63224 kB
KernelStack: 1224 kB
PageTables: 2256 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 16611912 kB
Committed_AS: 209068 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 224244 kB
VmallocChunk: 34351073668 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 9876 kB
DirectMap2M: 2070528 kB
DirectMap1G: 14680064 kB
# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 8175 MB
node 0 free: 7706 MB
node 1 cpus: 4 5 6 7
node 1 size: 8192 MB
node 1 free: 7796 MB
node distances:
node 0 1
0: 10 20
1: 20 10
# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1
> Cheers,
> Ben.
>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 3198f2e..fdccb77 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -23,7 +23,7 @@
>>
>> /* virtio guest is communicating with a virtual "device" that actually runs on
>> * a host processor. Memory barriers are used to control SMP effects. */
>> -#ifdef CONFIG_SMP
>> +#if 0
>> /* Where possible, use SMP barriers which are more lightweight than mandatory
>> * barriers, because mandatory barriers control MMIO effects on accesses
>> * through relaxed memory I/O windows (which virtio does not use). */
>>
>>
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Rusty Russell @ 2011-12-12 5:12 UTC (permalink / raw)
To: Amos Kong, Benjamin Herrenschmidt
Cc: kvm, Michael S. Tsirkin, linux-kernel, virtualization,
linux-arm-kernel
In-Reply-To: <4EE56FCD.9030609@redhat.com>
On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong <akong@redhat.com> wrote:
> On 12/12/11 06:27, Benjamin Herrenschmidt wrote:
> > On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:
> >
> >> Forwarding some results by Amos, who run multiple netperf streams in
> >> parallel, from an external box to the guest. TCP_STREAM results were
> >> noisy. This could be due to buffering done by TCP, where packet size
> >> varies even as message size is constant.
> >>
> >> TCP_RR results were consistent. In this benchmark, after switching
> >> to mandatory barriers, CPU utilization increased by up to 35% while
> >> throughput went down by up to 14%. the normalized throughput/cpu
> >> regressed consistently, between 7 and 35%
> >>
> >> The "fix" applied was simply this:
> >
> > What machine& processor was this ?
>
> pined guest memory to numa node 1
Please try this patch. How much does the branch cost us?
(Compiles, untested).
Thanks,
Rusty.
From: Rusty Russell <rusty@rustcorp.com.au>
Subject: virtio: harsher barriers for virtio-mmio.
We were cheating with our barriers; using the smp ones rather than the
real device ones. That was fine, until virtio-mmio came along, which
could be talking to a real device (a non-SMP CPU).
Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci. In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.
By comparison, this branch costs us???
Reference: https://lkml.org/lkml/2011/12/11/22
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/lguest/lguest_device.c | 10 ++++++----
drivers/s390/kvm/kvm_virtio.c | 2 +-
drivers/virtio/virtio_mmio.c | 7 ++++---
drivers/virtio/virtio_pci.c | 4 ++--
drivers/virtio/virtio_ring.c | 34 +++++++++++++++++++++-------------
include/linux/virtio_ring.h | 1 +
tools/virtio/linux/virtio.h | 1 +
tools/virtio/virtio_test.c | 3 ++-
8 files changed, 38 insertions(+), 24 deletions(-)
diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
}
/*
- * OK, tell virtio_ring.c to set up a virtqueue now we know its size
- * and we've got a pointer to its pages.
+ * OK, tell virtio_ring.c to set up a virtqueue now we know its size
+ * and we've got a pointer to its pages. Note that we set weak_barriers
+ * to 'true': the host just a(nother) SMP CPU, so we only need inter-cpu
+ * barriers.
*/
- vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
- vdev, lvq->pages, lg_notify, callback, name);
+ vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+ true, lvq->pages, lg_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
goto out;
vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
- vdev, (void *) config->address,
+ vdev, true, (void *) config->address,
kvm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
- /* Create the vring */
- vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
- vdev, info->queue, vm_notify, callback, name);
+ /* Create the vring: no weak barriers, the other side is could
+ * be an independent "device". */
+ vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+ false, info->queue, vm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
/* create the vring */
- vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
- vdev, info->queue, vp_notify, callback, name);
+ vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
+ true, info->queue, vp_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto out_activate_queue;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -28,17 +28,20 @@
#ifdef CONFIG_SMP
/* Where possible, use SMP barriers which are more lightweight than mandatory
* barriers, because mandatory barriers control MMIO effects on accesses
- * through relaxed memory I/O windows (which virtio does not use). */
-#define virtio_mb() smp_mb()
-#define virtio_rmb() smp_rmb()
-#define virtio_wmb() smp_wmb()
+ * through relaxed memory I/O windows (which virtio-pci does not use). */
+#define virtio_mb(vq) \
+ do { if ((vq)->weak_barriers) smp_mb(); else mb(); } while(0)
+#define virtio_rmb(vq) \
+ do { if ((vq)->weak_barriers) smp_rmb(); else rmb(); } while(0)
+#define virtio_wmb(vq) \
+ do { if ((vq)->weak_barriers) smp_rmb(); else rmb(); } while(0)
#else
/* We must force memory ordering even if guest is UP since host could be
* running on another CPU, but SMP barriers are defined to barrier() in that
* configuration. So fall back to mandatory barriers instead. */
-#define virtio_mb() mb()
-#define virtio_rmb() rmb()
-#define virtio_wmb() wmb()
+#define virtio_mb(vq) mb()
+#define virtio_rmb(vq) rmb()
+#define virtio_wmb(vq) wmb()
#endif
#ifdef DEBUG
@@ -77,6 +80,9 @@ struct vring_virtqueue
/* Actual memory layout for this queue */
struct vring vring;
+ /* Can we use weak barriers? */
+ bool weak_barriers;
+
/* Other side has made a mess, don't try any more. */
bool broken;
@@ -245,14 +251,14 @@ void virtqueue_kick(struct virtqueue *_v
START_USE(vq);
/* Descriptors and available array need to be set before we expose the
* new available array entries. */
- virtio_wmb();
+ virtio_wmb(vq);
old = vq->vring.avail->idx;
new = vq->vring.avail->idx = old + vq->num_added;
vq->num_added = 0;
/* Need to update avail index before checking if we should notify */
- virtio_mb();
+ virtio_mb(vq);
if (vq->event ?
vring_need_event(vring_avail_event(&vq->vring), new, old) :
@@ -314,7 +320,7 @@ void *virtqueue_get_buf(struct virtqueue
}
/* Only get used array entries after they have been exposed by host. */
- virtio_rmb();
+ virtio_rmb(vq);
i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
@@ -337,7 +343,7 @@ void *virtqueue_get_buf(struct virtqueue
* the read in the next get_buf call. */
if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
vring_used_event(&vq->vring) = vq->last_used_idx;
- virtio_mb();
+ virtio_mb(vq);
}
END_USE(vq);
@@ -366,7 +372,7 @@ bool virtqueue_enable_cb(struct virtqueu
* entry. Always do both to keep code simple. */
vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
vring_used_event(&vq->vring) = vq->last_used_idx;
- virtio_mb();
+ virtio_mb(vq);
if (unlikely(more_used(vq))) {
END_USE(vq);
return false;
@@ -393,7 +399,7 @@ bool virtqueue_enable_cb_delayed(struct
/* TODO: tune this threshold */
bufs = (u16)(vq->vring.avail->idx - vq->last_used_idx) * 3 / 4;
vring_used_event(&vq->vring) = vq->last_used_idx + bufs;
- virtio_mb();
+ virtio_mb(vq);
if (unlikely((u16)(vq->vring.used->idx - vq->last_used_idx) > bufs)) {
END_USE(vq);
return false;
@@ -453,6 +459,7 @@ EXPORT_SYMBOL_GPL(vring_interrupt);
struct virtqueue *vring_new_virtqueue(unsigned int num,
unsigned int vring_align,
struct virtio_device *vdev,
+ bool weak_barriers,
void *pages,
void (*notify)(struct virtqueue *),
void (*callback)(struct virtqueue *),
@@ -476,6 +483,7 @@ struct virtqueue *vring_new_virtqueue(un
vq->vq.vdev = vdev;
vq->vq.name = name;
vq->notify = notify;
+ vq->weak_barriers = weak_barriers;
vq->broken = false;
vq->last_used_idx = 0;
vq->num_added = 0;
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -168,6 +168,7 @@ struct virtqueue;
struct virtqueue *vring_new_virtqueue(unsigned int num,
unsigned int vring_align,
struct virtio_device *vdev,
+ bool weak_barriers,
void *pages,
void (*notify)(struct virtqueue *vq),
void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
--- a/tools/virtio/linux/virtio.h
+++ b/tools/virtio/linux/virtio.h
@@ -214,6 +214,7 @@ void *virtqueue_detach_unused_buf(struct
struct virtqueue *vring_new_virtqueue(unsigned int num,
unsigned int vring_align,
struct virtio_device *vdev,
+ bool weak_barriers,
void *pages,
void (*notify)(struct virtqueue *vq),
void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -92,7 +92,8 @@ static void vq_info_add(struct vdev_info
assert(r >= 0);
memset(info->ring, 0, vring_size(num, 4096));
vring_init(&info->vring, num, info->ring, 4096);
- info->vq = vring_new_virtqueue(info->vring.num, 4096, &dev->vdev, info->ring,
+ info->vq = vring_new_virtqueue(info->vring.num, 4096, &dev->vdev,
+ true, info->ring,
vq_notify, vq_callback, "test");
assert(info->vq);
info->vq->priv = info;
^ permalink raw reply
* Re: [Qemu-devel] [RFC 0/4] virtio-mmio transport
From: Pawel Moll @ 2011-12-12 11:16 UTC (permalink / raw)
To: Paul Brook
Cc: Peter Maydell, qemu-devel@nongnu.org, virtualization,
Anthony Liguori, Paolo Bonzini
In-Reply-To: <201112120252.35295.paul@codesourcery.com>
On Mon, 2011-12-12 at 02:52 +0000, Paul Brook wrote:
> I've taken a look at the virtion-mmio spec, and it looks fairly
> reasonable.
>
> The only thing I'd change is the GuestPageSize/QueuePFN mess. Seems like just
> using straight 64-bit addresses would be a better solution. Maybe split into
> a high/low pair to keep all registers as 32-bit registers.
This can be done fairly simple by:
1. Bumping Version register content.
2. Adding two registers: QueueAddrLow (0x44) and QueueAddrHigh (0x48).
3. Describing the QueuePFN as obsolete.
Than the driver would just use Addr or PFN depending on the device's
version.
I can do that, but not this year (on holiday from Friday 16th, without
any access to Internet whatsoever :-) One think to be decided is in what
order the halfs should be filled? Low first, then high? High then low?
Does it matter at all? :-)
Cheers!
Paweł
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox