Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH 4/4] xen kconfig: describe xen tmem in the config menu
From: Andrew Jones @ 2012-01-12 10:54 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: konrad, jeremy, xen-devel, virtualization, stefano stabellini
In-Reply-To: <20120111173536.GD4449@phenom.dumpdata.com>



----- Original Message -----
> On Wed, Jan 11, 2012 at 05:36:41PM +0100, Andrew Jones wrote:
> > Add a description to the config menu for xen tmem.
> 
> We will have another config option asked during 'make oldconfig'
> right?
> 
> I thought part of this cleanup patch is to remove some of this
> complexity - not make it _more_ complex.
> 

Another candidate for 'if EXPERT' then, IMHO.

> > 
> > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > ---
> >  drivers/xen/Kconfig |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> > index 1d24061..7e8d728 100644
> > --- a/drivers/xen/Kconfig
> > +++ b/drivers/xen/Kconfig
> > @@ -143,7 +143,7 @@ config SWIOTLB_XEN
> >  	select SWIOTLB
> >  
> >  config XEN_TMEM
> > -	bool
> > +	bool "Xen Transcendent Memory (tmem)"
> >  	default y if (CLEANCACHE || FRONTSWAP)
> >  	help
> >  	  Shim to interface in-kernel Transcendent Memory hooks
> > --
> > 1.7.7.5
> 

^ permalink raw reply

* Re: [Xen-devel] [PATCH] xen: remove CONFIG_XEN_DOM0 compile option
From: Andrew Jones @ 2012-01-12 10:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, xen-devel,
	virtualization, Stefano Stabellini
In-Reply-To: <20120111172700.GA4449@phenom.dumpdata.com>



----- Original Message -----
> On Wed, Jan 11, 2012 at 12:19:11PM -0400, Konrad Rzeszutek Wilk
> wrote:
> > > > If the root complaint is that "customers think that anything
> > > > set in
> > > > .config is a supported feature", then the solutions are to
> > > > support
> > > > all
> > > > the features in .config, re-educate the customers that they're
> > > > wrong,
> > > > or
> > > > maintain a local patch to do this stuff.
> > > 
> > > If only re-educating people was free, like preempting questions
> > > is.
> > > Local patches are of course always an option, and perhaps in this
> > > case it's the best one. However, I think we already made a case
> > > for
> > > better xen configurability for the driver domains, so I'm not
> > > 100%
> > 
> > Could you repost those backend patches please? At this point I am
> > not
> > sure which one we have discarded?
> 
> hm, I was thinking in terms of the XenBus ones. We had somewhere in
> this
> converstion something about seperating the backend's from depending
> on
> CONFIG_XEN_DOM0 as they can be run in any domain nowadays.
> 

Ah, I still haven't posted anything like that. So nothing got lost :-)
It's a good idea, but it's probably best done and tested by someone
working with driver domains, or even with the backends at all, which
I'm not.

Drew

> > 
> > > convinced my initial patch (making dom0 configurable) isn't
> > > worthy
> > > of upstream. Also, I didn't see any comments on my v2[*] of that
> > > patch, which I believe satisfies the menu complexity issue and
> > > brings in more configurability. That said, I'm about to reply to
> > > that patch myself, since there's an issue with it.
> > > 
> > > Drew
> > > 
> > > [*]
> > > http://article.gmane.org/gmane.linux.kernel.virtualization/14635
> > > 
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@lists.xensource.com
> > > http://lists.xensource.com/xen-devel
> 

^ permalink raw reply

* Re: [Xen-devel] [PATCH 1/4] xen kconfig: keep XEN_XENBUS_FRONTEND builtin
From: Andrew Jones @ 2012-01-12 10:49 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: konrad, jeremy, xen-devel, stefano stabellini, virtualization
In-Reply-To: <20120111172832.GB4449@phenom.dumpdata.com>



----- Original Message -----
> On Wed, Jan 11, 2012 at 05:36:38PM +0100, Andrew Jones wrote:
> > When XEN_XENBUS_FRONTEND gets selected as a module it can lead to
> > unbootable configs. If we need it, then we should just build it in.
> 
> Hm, don't the frontends by themsevles load this module? So if you
> do 'modprobe xen-pcifront' it would load this automatically?
> 

The problem is on boot, getting it there before we need xen-blkfront.
I think it might solvable if you make sure your tooling pushes the
xenbus module into your initrd. However we've had problems with
platform_pci being a module in the past, and if I'm not mistaken
that was one of the motivators for building it in (5fbdc10395cd). I
see this as a similar story.

Drew

> > 
> > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > ---
> >  drivers/xen/Kconfig |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> > index 8795480..1d24061 100644
> > --- a/drivers/xen/Kconfig
> > +++ b/drivers/xen/Kconfig
> > @@ -118,7 +118,7 @@ config XEN_SYS_HYPERVISOR
> >  	 but will have no xen contents.
> >  
> >  config XEN_XENBUS_FRONTEND
> > -	tristate
> > +	bool
> >  
> >  config XEN_GNTDEV
> >  	tristate "userspace grant access device driver"
> > --
> > 1.7.7.5
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
> 

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Stefan Hajnoczi @ 2012-01-12 10:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Pawel Moll, Michael S. Tsirkin, virtualization,
	Christian Borntraeger, Sasha Levin
In-Reply-To: <1326314764.23910.128.camel@pasglop>

On Wed, Jan 11, 2012 at 8:46 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2012-01-11 at 14:28 +0000, Stefan Hajnoczi wrote:
>> On Wed, Jan 11, 2012 at 9:10 AM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org> wrote:
>> > On Wed, 2012-01-11 at 08:47 +0000, Stefan Hajnoczi wrote:
>> >>
>> >> This is also an opportunity to stop using CPU physical addresses in
>> >> the ring and instead perform DMA like a normal PCI device (use bus
>> >> addresses).
>> >
>> > Euh why ?
>>
>> Because it's a paravirt hack that ends up hitting corner cases.  It's
>> not possible to do virtio-pci passthrough under nested virtualization
>> unless we use an IOMMU.  Imagine passing virtio-net from L0 into the
>> L2 guest (i.e. PCI-passthrough).  If virtio-pci is really "PCI" this
>> should be possible but it's not when we use physical addresses instead
>> of bus addresses.
>
> Is this just an academic exercise or is there any actual value in doing
> this ?

It's a corner case, the value is small.  I also hit this with virtio
on SPARC which is made difficult by the fact that the Solaris kernel
assumes there is an IOMMU for scatter-gather and doesn't provide
functions for allocating physically contiguous memory in drivers.
It's another instance where this shortcut comes up against problems
and behaving like a real PCI device would work fine.

> Using an iommu is going to slaugher your performance, so at the very
> least it should be kept an option.

That's a good idea.  By default it can continue to use physical addresses.

I guess there's no point in worrying about it until we have a real user.

Stefan

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-12  7:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <20120112065132.GF10319@redhat.com>

On Thu, 2012-01-12 at 08:51 +0200, Michael S. Tsirkin wrote:

> One othe rinteresting thing is that
> in practice host often runs on the same CPU as guest.

That's surprising. On Power with our huge exit cost, I would think that
would hurt more than anything else.

> This is the fastest way to run the host for when it
> does not do a lot of work as most data is cached. And there amount of sharing
> is less important than reducing cache consumption.
> 
> Again this is different from real hardware.

Yes indeed.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-12  6:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <1326349714.23910.207.camel@pasglop>

On Thu, Jan 12, 2012 at 05:28:34PM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2012-01-12 at 08:09 +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 12, 2012 at 03:31:59PM +1100, Benjamin Herrenschmidt wrote:
> > > However I can see at least one advantage of what you've done :-) You
> > > never have to deal with holes in the ring.
> > 
> > Another advantage is the design goal for that ring:
> > host never needs to copy even if it completes
> > descriptors out of order. And out of order is something that does not
> > happen at all with hardware drivers. This is where paravirt is
> > different.
> 
> Actually out of order can happen with tagged command queue for SCSI or
> ATA, tho I'm not 100% familiar with how things like AHCI handle this.
> 
> > > > > Two rings do have the advantage of not requiring host side copy, which
> > > > > copy would surely add to cache pressure.
> > > > 
> > > > Well, a simple host could process in-order and leave stuff in the ring I
> > > > guess.  A smarter host would copy and queue, maybe leave one queue entry
> > > > in so it doesn't get flooded?
> > > 
> > > What's wrong with a ring of descriptors + a ring of completion, with a
> > > single toggle valid bit to indicate whether a given descriptor is valid
> > > or not (to avoid the nasty ping pong on the ring head/tails).
> > 
> > First, I don't understand how a valid bit avoids ping poing on the last
> > descriptor. Second, how do you handle out of order completions?
> 
> A toggle means the valid bit is never cleared, it just changes polarity
> every time you go around the ring. So there's never a write back to 0.

Ah, OK. Still we might have a cache line bouncing back
and forth as host rechecks this valid bit while guest
toggles it.

This also gets nasty with e.g. inline data.



> Out of order is something I hadn't thought about (I was most probably
> too focused on virtio-net) and is indeed a PITA. It's doable with rings
> but can get nasty.
> 
> I'll give that more thought in the next week, and Rusty and I shall play
> with userspace models based on your tool.

You might want to add more workloads - I have modeled stream transmit,
but e.g. ping pong is also interesting.

One othe rinteresting thing is that
in practice host often runs on the same CPU as guest.
This is the fastest way to run the host for when it
does not do a lot of work as most data is cached. And there amount of sharing
is less important than reducing cache consumption.

Again this is different from real hardware.

> > > > > About inline - it can only help very small buffers.
> > > > > Which workloads do you have in mind exactly?
> > > > 
> > > > It was suggested by others, but I think TCP Acks are the classic one.
> > > 
> > > Split headers + data too, tho that means supporting immediate +
> > > indirect. 
> > > 
> > > It makes a lot of sense for command rings as well if we're going to go
> > > down that route.
> > 
> > I don't see why it makes sense for commands. It's a performance
> > optimization and commands are off the data path.
> 
> Oh just code simplification not having to dequeue a descriptor, allocate
> a buffer, etc... but a lot of that can be buried in helpers indeed.
> 
> > We can't, legal PCI ProgIf values are defined in PCI spec.
> 
> Hrm, more or less yes, I suppose we should stay away from that then,
> do we use revision ID for anything in virtio-land ?
> 
> Cheers,
> Ben.
> 

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-12  6:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <1326348822.23910.202.camel@pasglop>

On Thu, Jan 12, 2012 at 05:13:42PM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2012-01-12 at 07:29 +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 12, 2012 at 09:56:39AM +1100, Benjamin Herrenschmidt wrote:
> > > I'd suggest doing a simple user space app that creates such a ring,
> > > forks, and produce / consume. We can then run that through perf and
> > > analyze the cache behaviour, maximum throughput, etc....
> > > 
> > > Cheers,
> > > Ben.
> > 
> > Sure. You can also use my tools/virtio hack: this rebuilds virtio
> > ring code in userspace, has the advantage of reusing
> > actual kernel code, so it's up to date, has same barriers, etc.
> 
> Yes, Rusty mentioned it today :-)
> 
> We'll play around in the next couple of weeks. With LCA, I'm not sure
> how much I'll get done next week but heh...
> 
> I'm curious to see if we can still have a one-size-fit-all between
> unordered blk requests with large data/descriptor access ratio and
> ordered network buffers with a much smaller one.
> 
> Cheers,
> Ben.
> 

network buffers aren't ordered when the backend is zero copy.
Another idea is to add different sized bufs in the RX ring,
and have the device use the size that fits best.
This can improve small packet memory utilization without
need for data copies, but it also means that
descriptors are completed out of order in -net.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-12  6:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <20120112060954.GD10319@redhat.com>

On Thu, 2012-01-12 at 08:09 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 12, 2012 at 03:31:59PM +1100, Benjamin Herrenschmidt wrote:
> > However I can see at least one advantage of what you've done :-) You
> > never have to deal with holes in the ring.
> 
> Another advantage is the design goal for that ring:
> host never needs to copy even if it completes
> descriptors out of order. And out of order is something that does not
> happen at all with hardware drivers. This is where paravirt is
> different.

Actually out of order can happen with tagged command queue for SCSI or
ATA, tho I'm not 100% familiar with how things like AHCI handle this.

> > > > Two rings do have the advantage of not requiring host side copy, which
> > > > copy would surely add to cache pressure.
> > > 
> > > Well, a simple host could process in-order and leave stuff in the ring I
> > > guess.  A smarter host would copy and queue, maybe leave one queue entry
> > > in so it doesn't get flooded?
> > 
> > What's wrong with a ring of descriptors + a ring of completion, with a
> > single toggle valid bit to indicate whether a given descriptor is valid
> > or not (to avoid the nasty ping pong on the ring head/tails).
> 
> First, I don't understand how a valid bit avoids ping poing on the last
> descriptor. Second, how do you handle out of order completions?

A toggle means the valid bit is never cleared, it just changes polarity
every time you go around the ring. So there's never a write back to 0.

Out of order is something I hadn't thought about (I was most probably
too focused on virtio-net) and is indeed a PITA. It's doable with rings
but can get nasty.

I'll give that more thought in the next week, and Rusty and I shall play
with userspace models based on your tool.

> > > > About inline - it can only help very small buffers.
> > > > Which workloads do you have in mind exactly?
> > > 
> > > It was suggested by others, but I think TCP Acks are the classic one.
> > 
> > Split headers + data too, tho that means supporting immediate +
> > indirect. 
> > 
> > It makes a lot of sense for command rings as well if we're going to go
> > down that route.
> 
> I don't see why it makes sense for commands. It's a performance
> optimization and commands are off the data path.

Oh just code simplification not having to dequeue a descriptor, allocate
a buffer, etc... but a lot of that can be buried in helpers indeed.

> We can't, legal PCI ProgIf values are defined in PCI spec.

Hrm, more or less yes, I suppose we should stay away from that then,
do we use revision ID for anything in virtio-land ?

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-12  6:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <20120112052953.GB10319@redhat.com>

On Thu, 2012-01-12 at 07:29 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 12, 2012 at 09:56:39AM +1100, Benjamin Herrenschmidt wrote:
> > I'd suggest doing a simple user space app that creates such a ring,
> > forks, and produce / consume. We can then run that through perf and
> > analyze the cache behaviour, maximum throughput, etc....
> > 
> > Cheers,
> > Ben.
> 
> Sure. You can also use my tools/virtio hack: this rebuilds virtio
> ring code in userspace, has the advantage of reusing
> actual kernel code, so it's up to date, has same barriers, etc.

Yes, Rusty mentioned it today :-)

We'll play around in the next couple of weeks. With LCA, I'm not sure
how much I'll get done next week but heh...

I'm curious to see if we can still have a one-size-fit-all between
unordered blk requests with large data/descriptor access ratio and
ordered network buffers with a much smaller one.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-12  6:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <1326342719.23910.199.camel@pasglop>

On Thu, Jan 12, 2012 at 03:31:59PM +1100, Benjamin Herrenschmidt wrote:
> However I can see at least one advantage of what you've done :-) You
> never have to deal with holes in the ring.

Another advantage is the design goal for that ring:
host never needs to copy even if it completes
descriptors out of order. And out of order is something that does not
happen at all with hardware drivers. This is where paravirt is
different.

> > > Two rings do have the advantage of not requiring host side copy, which
> > > copy would surely add to cache pressure.
> > 
> > Well, a simple host could process in-order and leave stuff in the ring I
> > guess.  A smarter host would copy and queue, maybe leave one queue entry
> > in so it doesn't get flooded?
> 
> What's wrong with a ring of descriptors + a ring of completion, with a
> single toggle valid bit to indicate whether a given descriptor is valid
> or not (to avoid the nasty ping pong on the ring head/tails).

First, I don't understand how a valid bit avoids ping poing on the last
descriptor. Second, how do you handle out of order completions?


> > > About inline - it can only help very small buffers.
> > > Which workloads do you have in mind exactly?
> > 
> > It was suggested by others, but I think TCP Acks are the classic one.
> 
> Split headers + data too, tho that means supporting immediate +
> indirect. 
> 
> It makes a lot of sense for command rings as well if we're going to go
> down that route.

I don't see why it makes sense for commands. It's a performance
optimization and commands are off the data path.

> > 12 + 14 + 20 + 40 = 86 bytes with virtio_net_hdr_mrg_rxbuf at the front.
> > 
> > > BTW this seems to be the reverse from what you have in Mar 2001,
> > > see 87mxkjls61.fsf@rustcorp.com.au :)
> > 
> > (s/2001/2011).  Indeed.  Noone shared my optimism that having an open
> > process for a virtio2 would bring more players on board (my original
> > motivation).  But technical requirements are mounting up, which means
> > we're going to get there anyway.
> > 
> > > I am much less concerned with what we do for configuration,
> > > but I do not believe we have learned all performance lessons
> > > from virtio ring1. Is there any reason why we shouldn't be
> > > able to experiment with inline within virtio1 and see
> > > whether that gets us anything?
> > 
> > Inline in the used ring is possible, but those descriptors 8 bytes, vs
> > 24/32.
> > 
> > > If we do a bunch of changes to the ring at once, we can't
> > > figure out what's right, what's wrong, or back out of
> > > mistakes later.
> > > 
> > > Since there are non PCI transports that use the ring,
> > > we really shouldn't make both the configuration and
> > > the ring changes depend on the same feature bit.
> > 
> > Yes, I'm thinking #define VIRTIO_F_VIRTIO2 (-1).  For PCI, this gets
> > mapped into a "are we using the new config layout?".  For others, it
> > gets mapped into a transport-specific feature.
> 
> Or we can use the PCI ProgIf to indicate a different programming
> interface, that way we also use that as an excuse to say that the first
> BAR can either be PIO or MMIO :-)

We can't, legal PCI ProgIf values are defined in PCI spec.

> > (I'm sure you get it, but for the others) This is because I want to be
> > draw a clear line between all the legacy stuff at the same time, not
> > have to support part of it later because someone might not flip the
> > feature bit.
> 
> Cheers,
> Ben.
> 

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-12  6:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christian Borntraeger, Benjamin Herrenschmidt, Sasha Levin,
	Pawel Moll, virtualization
In-Reply-To: <87vcoh4r2y.fsf@rustcorp.com.au>

On Thu, Jan 12, 2012 at 12:31:09PM +1030, Rusty Russell wrote:
> On Wed, 11 Jan 2012 12:21:30 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Wed, Jan 11, 2012 at 10:55:52AM +1030, Rusty Russell wrote:
> > > On Tue, 10 Jan 2012 19:03:36 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > On Wed, Dec 21, 2011 at 11:03:25AM +1030, Rusty Russell wrote:
> > > > > Yes.  The idea that we can alter fields in the device-specific config
> > > > > area is flawed.  There may be cases where it doesn't matter, but as an
> > > > > idea it was holed to begin with.
> > > > > 
> > > > > We can reduce probability by doing a double read to check, but there are
> > > > > still cases where it will fail.
> > > > 
> > > > Okay - want me to propose an interface for that?
> > > 
> > > Had a brief chat with BenH (CC'd).
> > > 
> > > I think we should deprecate writing to the config space.  Only balloon
> > > does it AFAICT, and I can't quite figure out *why* it has an 'active'
> > > field.
> > 
> > Are you sure? I think net writes a mac address.
> 
> True.  We'll need to disable that, and come up with another mechanism if
> we want it back (a new feature and a VIRTIO_NET_HDR_F_SET_MAC flag in
> the virtio_net header?  Or would that mess up vhost_net?).

vhost net is only datapath, so no problem. copying in host is tricky for
vhost_net.

> > > This solves half the problem, of sync guest writes.  For the
> > > other half, I suggest a generation counter; odd means inconsistent.  The
> > > guest can poll.
> > 
> > So we get the counter until it's even, get the config, if it's changed
> > repeat? Yes it works. However, I would like to have a way to detect
> > config change just by looking at memory. ATM we need to read ISR to
> > know.  If we used a VQ, the advantage would be that the device can work
> > with a single MSIX vector shared by all VQs.
> 
> If we use a 32-bit counter, we also get this though, right?
> 
> If counter has changed, it's a config interrupt...

But we need an exit to read the counter. We can put the counter
in memory but this looks suspiciously like a simplified VQ -
so why not use a VQ then?

> > If we do require config VQ anyway, why not use it to notify
> > guest of config changes? Guest could pre-post an in buffer
> > and host uses that.
> 
> We could, but it's an additional burden on each device.  vqs are cheap,
> but not free.  And the config area is so damn convenient...

Not if you start playing with counters, checking it twice,
reinvent all kind of barriers ...

> > > BenH also convinced me we should finally make the config space LE if
> > > we're going to change things.  Since PCI is the most common transport,
> > > guest-endian confuses people.  And it sucks for really weird machines.
> > 
> > Are we going to keep guest endian for e.g. virtio net header?
> > If yes the benefit of switching config space is not that big.
> > And changes in devices would affect non-PCI transports.
> 
> Yep.  It would only make sense if we do it for everything.  And yes,
> it'll mess up everyone who is BE, so it needs to be a feature bit for
> them.
> 
> > > We should also change the ring (to a single ring, I think).  Descriptors
> > > to 24 bytes long (8 byte cookie, 8 byte addr, 4 byte len, 4 byte flags).
> > > We might be able to squeeze it into 20 bytes but that means packing.  We
> > > should support inline, chained or indirect.  Let the other side ack by
> > > setting flag, cookie and len (if written).
> > 
> > Quite possibly all or some of these things help performance
> > but do we have to change the spec before we have experimental
> > proof?
> 
> We change the spec last, once we know what we're doing, ideally.
> 
> > I did experiment with a single ring using tools/virtio and
> > I didn't see a measureable performance gain.
> 
> Interesting.  It is simpler and more standard than our current design,
> but that's not sufficient unless there are other reasons.  Needs further
> discussion and testing.
> 
> > Two rings do have the advantage of not requiring host side copy, which
> > copy would surely add to cache pressure.
> 
> Well, a simple host could process in-order and leave stuff in the ring I
> guess.  A smarter host would copy and queue, maybe leave one queue entry
> in so it doesn't get flooded?
> 
> >  Since
> > host doesn't change desriptors, we could also
> > preformat some descriptors in the current design.
> >
> > There is a fragmentation problem in theory but it can be alleviated with
> > a smart allocator.
> 
> Yeah, the complexity scares me...
> 
> > About inline - it can only help very small buffers.
> > Which workloads do you have in mind exactly?
> 
> It was suggested by others, but I think TCP Acks are the classic one.
> 12 + 14 + 20 + 40 = 86 bytes with virtio_net_hdr_mrg_rxbuf at the front.

That's only the simplest IPv4, right?
Anyway, this spans multiple descriptors so this complicates allocation
significantly.

> > BTW this seems to be the reverse from what you have in Mar 2001,
> > see 87mxkjls61.fsf@rustcorp.com.au :)
> 
> (s/2001/2011).  Indeed.  Noone shared my optimism that having an open
> process for a virtio2 would bring more players on board (my original
> motivation).  But technical requirements are mounting up, which means
> we're going to get there anyway.
> 
> > I am much less concerned with what we do for configuration,
> > but I do not believe we have learned all performance lessons
> > from virtio ring1. Is there any reason why we shouldn't be
> > able to experiment with inline within virtio1 and see
> > whether that gets us anything?
> 
> Inline in the used ring is possible, but those descriptors 8 bytes, vs
> 24/32.

Hmm, 86 > 32 anyway, right?

> > If we do a bunch of changes to the ring at once, we can't
> > figure out what's right, what's wrong, or back out of
> > mistakes later.
> > 
> > Since there are non PCI transports that use the ring,
> > we really shouldn't make both the configuration and
> > the ring changes depend on the same feature bit.
> 
> Yes, I'm thinking #define VIRTIO_F_VIRTIO2 (-1).  For PCI, this gets
> mapped into a "are we using the new config layout?".  For others, it
> gets mapped into a transport-specific feature.
> 
> (I'm sure you get it, but for the others) This is because I want to be
> draw a clear line between all the legacy stuff at the same time, not
> have to support part of it later because someone might not flip the
> feature bit.
> 
> Cheers,
> Rusty.

So my point is, config stuff and ring are completely separate,
they are different layers.

-- 
MST

^ permalink raw reply

* [PATCH] vhost-net: add module alias (v2.1)
From: Stephen Hemminger @ 2012-01-12  5:30 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kvm, Michael S. Tsirkin, netdev, Kay Sievers, virtualization,
	Alan Cox
In-Reply-To: <CAEH94LiF_F51_avA9UWMneo1N1wY0NzJmfohpxdrkAv2WkWT5Q@mail.gmail.com>

Subject: vhost-net: add module alias (v2.1)

By adding some module aliases, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.

Also:
  - use C99 style initialization.
  - add missing entry in documentation for loop-control

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
2.1 - add missing documentation for loop control as well

 Documentation/devices.txt  |    3 +++
 drivers/vhost/net.c        |    8 +++++---
 include/linux/miscdevice.h |    1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

--- a/drivers/vhost/net.c	2012-01-10 10:56:58.883179194 -0800
+++ b/drivers/vhost/net.c	2012-01-10 19:48:23.650225892 -0800
@@ -856,9 +856,9 @@ static const struct file_operations vhos
 };
 
 static struct miscdevice vhost_net_misc = {
-	MISC_DYNAMIC_MINOR,
-	"vhost-net",
-	&vhost_net_fops,
+	.minor = VHOST_NET_MINOR,
+	.name = "vhost-net",
+	.fops = &vhost_net_fops,
 };
 
 static int vhost_net_init(void)
@@ -879,3 +879,5 @@ MODULE_VERSION("0.0.1");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Michael S. Tsirkin");
 MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
+MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
+MODULE_ALIAS("devname:vhost-net");
--- a/include/linux/miscdevice.h	2012-01-10 10:56:59.779189436 -0800
+++ b/include/linux/miscdevice.h	2012-01-11 09:13:20.803694316 -0800
@@ -42,6 +42,7 @@
 #define AUTOFS_MINOR		235
 #define MAPPER_CTRL_MINOR	236
 #define LOOP_CTRL_MINOR		237
+#define VHOST_NET_MINOR		238
 #define MISC_DYNAMIC_MINOR	255
 
 struct device;
--- a/Documentation/devices.txt	2012-01-10 10:56:53.399116518 -0800
+++ b/Documentation/devices.txt	2012-01-11 13:17:07.882113340 -0800
@@ -447,6 +447,9 @@ Your cooperation is appreciated.
 		234 = /dev/btrfs-control	Btrfs control device
 		235 = /dev/autofs	Autofs control device
 		236 = /dev/mapper/control	Device-Mapper control device
+		237 = /dev/loop-control Loopback control device
+		238 = /dev/vhost-net	Host kernel accelerator for virtio net
+
 		240-254			Reserved for local use
 		255			Reserved for MISC_DYNAMIC_MINOR

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-12  5:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <1326322599.23910.167.camel@pasglop>

On Thu, Jan 12, 2012 at 09:56:39AM +1100, Benjamin Herrenschmidt wrote:
> I'd suggest doing a simple user space app that creates such a ring,
> forks, and produce / consume. We can then run that through perf and
> analyze the cache behaviour, maximum throughput, etc....
> 
> Cheers,
> Ben.

Sure. You can also use my tools/virtio hack: this rebuilds virtio
ring code in userspace, has the advantage of reusing
actual kernel code, so it's up to date, has same barriers, etc.

-- 
MST

^ permalink raw reply

* [PULL] virtio and lguest
From: Rusty Russell @ 2012-01-12  5:22 UTC (permalink / raw)
  To: torvalds
  Cc: Stratos Psomadakis, Michael S. Tsirkin,
	lkml - Kernel Mailing List, virtualization, Sasha Levin,
	Amit Shah, Jacek Galowicz, Christoph Hellwig, Davidlohr Bueso

(I called the tag to-linus.  I think it worked).

To git@github.com:rustyrussell/linux.git
 + 3ed0016...b6c96c0 master -> master (forced update)
 + 815645d...f8e8df5 to-linus -> to-linus (forced update)
+ git request-pull remotes/origin/master git://github.com/rustyrussell/linux.git
The following changes since commit e343a895a9f342f239c5e3c5ffc6c0b1707e6244:

  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost (2012-01-10 18:04:27 -0800)

are available in the git repository at:

  git://github.com/rustyrussell/linux.git master

Amit Shah (12):
      virtio: pci: switch to new PM API
      virtio: pci: add PM notification handlers for restore, freeze, thaw, poweroff
      virtio: console: Move vq and vq buf removal into separate functions
      virtio: console: Add freeze and restore handlers to support S4
      virtio: console: Disable callbacks for virtqueues at start of S4 freeze
      virtio: blk: Move vq initialization to separate function
      virtio: blk: Add freeze, restore handlers to support S4
      virtio: net: Move vq initialization into separate function
      virtio: net: Move vq and vq buf removal into separate function
      virtio: net: Add freeze, restore handlers to support S4
      virtio: balloon: Move vq initialization into separate function
      virtio: balloon: Add freeze, restore handlers to support S4

Davidlohr Bueso (1):
      lguest: move the lguest tool to the tools directory

Jacek Galowicz (1):
      lguest: switch segment-voodoo-numbers to readable symbols

Michael S. Tsirkin (1):
      virtio_blk: fix config handler race

Rusty Russell (7):
      virtio: harsher barriers for rpmsg.
      virtio: document functions better.
      virtio: rename virtqueue_add_buf_gfp to virtqueue_add_buf
      virtio: support unlocked queue kick
      virtio: avoid modulus operation.
      virtio: expose added descriptors immediately.
      virtio: add debugging if driver doesn't kick.

Sasha Levin (1):
      virtio-balloon: Trivial cleanups

Stratos Psomadakis (1):
      lguest: Make sure interrupt is allocated ok by lguest_setup_irq

 arch/x86/lguest/boot.c                             |   21 +-
 drivers/block/virtio_blk.c                         |   87 +++++++-
 drivers/char/hw_random/virtio-rng.c                |    2 +-
 drivers/char/virtio_console.c                      |  140 +++++++++---
 drivers/lguest/Makefile                            |    2 +-
 drivers/lguest/lguest_device.c                     |   18 +-
 drivers/lguest/segments.c                          |   28 ++-
 drivers/net/virtio_net.c                           |  125 ++++++++---
 drivers/s390/kvm/kvm_virtio.c                      |    2 +-
 drivers/virtio/virtio_balloon.c                    |  108 +++++++--
 drivers/virtio/virtio_mmio.c                       |    4 +-
 drivers/virtio/virtio_pci.c                        |  110 +++++++++-
 drivers/virtio/virtio_ring.c                       |  245 +++++++++++++++++---
 include/linux/virtio.h                             |   75 ++-----
 include/linux/virtio_ring.h                        |    1 +
 net/9p/trans_virtio.c                              |    6 +-
 {Documentation/virtual => tools}/lguest/.gitignore |    0
 {Documentation/virtual => tools}/lguest/Makefile   |    0
 {Documentation/virtual => tools}/lguest/extract    |    0
 {Documentation/virtual => tools}/lguest/lguest.c   |    2 +-
 {Documentation/virtual => tools}/lguest/lguest.txt |    0
 tools/virtio/linux/virtio.h                        |   22 +--
 tools/virtio/virtio_test.c                         |    6 +-
 23 files changed, 757 insertions(+), 247 deletions(-)
 rename {Documentation/virtual => tools}/lguest/.gitignore (100%)
 rename {Documentation/virtual => tools}/lguest/Makefile (100%)
 rename {Documentation/virtual => tools}/lguest/extract (100%)
 rename {Documentation/virtual => tools}/lguest/lguest.c (99%)
 rename {Documentation/virtual => tools}/lguest/lguest.txt (100%)

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-12  4:31 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christian Borntraeger, virtualization, Sasha Levin, Pawel Moll,
	Michael S. Tsirkin
In-Reply-To: <87vcoh4r2y.fsf@rustcorp.com.au>

On Thu, 2012-01-12 at 12:31 +1030, Rusty Russell wrote:

> > Are we going to keep guest endian for e.g. virtio net header?
> > If yes the benefit of switching config space is not that big.
> > And changes in devices would affect non-PCI transports.
> 
> Yep.  It would only make sense if we do it for everything.  And yes,
> it'll mess up everyone who is BE, so it needs to be a feature bit for
> them.

One thing we should do (I might give it a go after LCA) is start
providing sized accessors for it and start converting the guest drivers
in Linux at least to use those.

That way, the accessors could then do the byteswap in the future
transparently if they see the feature bit for the "fixed endian".

We'd have accessors for 8,16,32 and 64-bit quantities, and accessors for
"raw" blobs (already endian neutral such as MAC addresses).

> Interesting.  It is simpler and more standard than our current design,
> but that's not sufficient unless there are other reasons.  Needs further
> discussion and testing.

I think completions shall remain separate. As for having a separate
"available" vs. "descriptors", well, I can find pro and cons.

As it is today, it's more complex than it should be and it would make
things simpler to just have an available ring that contains descriptors
like mostly everything else does.

It would also be slightly more cache friendly (cache misses are a
significant part of the performance issues for things like high speed
networking).

However I can see at least one advantage of what you've done :-) You
never have to deal with holes in the ring.

For example, a typical network driver should always try to allocate a
new skb before it "consumes" one, because otherwise, there's a chance
that it fails to allocate it, leaving a hole in the ring. Many drivers
do it wrong with consequences going all the way to leaving stale DMA
pointers in there....

With your scheme, that problem doesn't exist, and you can batch the
refill which might be more efficient under some circumstances.

But is that worth the gain and the cost in cache line accesses ?
Probably not.

> > Two rings do have the advantage of not requiring host side copy, which
> > copy would surely add to cache pressure.
> 
> Well, a simple host could process in-order and leave stuff in the ring I
> guess.  A smarter host would copy and queue, maybe leave one queue entry
> in so it doesn't get flooded?

What's wrong with a ring of descriptors + a ring of completion, with a
single toggle valid bit to indicate whether a given descriptor is valid
or not (to avoid the nasty ping pong on the ring head/tails).

> > About inline - it can only help very small buffers.
> > Which workloads do you have in mind exactly?
> 
> It was suggested by others, but I think TCP Acks are the classic one.

Split headers + data too, tho that means supporting immediate +
indirect. 

It makes a lot of sense for command rings as well if we're going to go
down that route.

> 12 + 14 + 20 + 40 = 86 bytes with virtio_net_hdr_mrg_rxbuf at the front.
> 
> > BTW this seems to be the reverse from what you have in Mar 2001,
> > see 87mxkjls61.fsf@rustcorp.com.au :)
> 
> (s/2001/2011).  Indeed.  Noone shared my optimism that having an open
> process for a virtio2 would bring more players on board (my original
> motivation).  But technical requirements are mounting up, which means
> we're going to get there anyway.
> 
> > I am much less concerned with what we do for configuration,
> > but I do not believe we have learned all performance lessons
> > from virtio ring1. Is there any reason why we shouldn't be
> > able to experiment with inline within virtio1 and see
> > whether that gets us anything?
> 
> Inline in the used ring is possible, but those descriptors 8 bytes, vs
> 24/32.
> 
> > If we do a bunch of changes to the ring at once, we can't
> > figure out what's right, what's wrong, or back out of
> > mistakes later.
> > 
> > Since there are non PCI transports that use the ring,
> > we really shouldn't make both the configuration and
> > the ring changes depend on the same feature bit.
> 
> Yes, I'm thinking #define VIRTIO_F_VIRTIO2 (-1).  For PCI, this gets
> mapped into a "are we using the new config layout?".  For others, it
> gets mapped into a transport-specific feature.

Or we can use the PCI ProgIf to indicate a different programming
interface, that way we also use that as an excuse to say that the first
BAR can either be PIO or MMIO :-)
 
> (I'm sure you get it, but for the others) This is because I want to be
> draw a clear line between all the legacy stuff at the same time, not
> have to support part of it later because someone might not flip the
> feature bit.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] vhost-net: add module alias (v2)
From: Zhi Yong Wu @ 2012-01-12  2:15 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: kvm, Michael S. Tsirkin, netdev, Kay Sievers, virtualization,
	Alan Cox
In-Reply-To: <20120111091653.188b24ab@nehalam.linuxnetplumber.net>

On Thu, Jan 12, 2012 at 1:16 AM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> By adding the correct module alias, programs won't have to explicitly
> call modprobe. Vhost-net will always be available if built into the kernel.
> It does require assigning a permanent minor number for depmod to work.
> Choose one next to TUN since this driver is related to it.
>
> Also, use C99 style initialization.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> ---
> v2 - document minor number and make sure to not overlap
>
>  Documentation/devices.txt  |    2 ++
>  drivers/vhost/net.c        |    8 +++++---
>  include/linux/miscdevice.h |    1 +
>  3 files changed, 8 insertions(+), 3 deletions(-)
>
> --- a/drivers/vhost/net.c       2012-01-10 10:56:58.883179194 -0800
> +++ b/drivers/vhost/net.c       2012-01-10 19:48:23.650225892 -0800
> @@ -856,9 +856,9 @@ static const struct file_operations vhos
>  };
>
>  static struct miscdevice vhost_net_misc = {
> -       MISC_DYNAMIC_MINOR,
> -       "vhost-net",
> -       &vhost_net_fops,
> +       .minor = VHOST_NET_MINOR,
> +       .name = "vhost-net",
> +       .fops = &vhost_net_fops,
>  };
>
>  static int vhost_net_init(void)
> @@ -879,3 +879,5 @@ MODULE_VERSION("0.0.1");
>  MODULE_LICENSE("GPL v2");
>  MODULE_AUTHOR("Michael S. Tsirkin");
>  MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
> +MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
> +MODULE_ALIAS("devname:vhost-net");
> --- a/include/linux/miscdevice.h        2012-01-10 10:56:59.779189436 -0800
> +++ b/include/linux/miscdevice.h        2012-01-11 09:13:20.803694316 -0800
> @@ -42,6 +42,7 @@
>  #define AUTOFS_MINOR           235
>  #define MAPPER_CTRL_MINOR      236
>  #define LOOP_CTRL_MINOR                237
> +#define VHOST_NET_MINOR                238
>  #define MISC_DYNAMIC_MINOR     255
>
>  struct device;
> --- a/Documentation/devices.txt 2012-01-10 10:56:53.399116518 -0800
> +++ b/Documentation/devices.txt 2012-01-11 09:12:49.251197653 -0800
> @@ -447,6 +447,8 @@ Your cooperation is appreciated.
>                234 = /dev/btrfs-control        Btrfs control device
>                235 = /dev/autofs       Autofs control device
>                236 = /dev/mapper/control       Device-Mapper control device
> +               237 = /dev/vhost-net    Host kernel accelerator for virtio net
> +
238? The stuff for LOOP_CTRL seems to be missing?

>                240-254                 Reserved for local use
>                255                     Reserved for MISC_DYNAMIC_MINOR
>
>
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Rusty Russell @ 2012-01-12  2:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Benjamin Herrenschmidt, Sasha Levin,
	Pawel Moll, virtualization
In-Reply-To: <20120111102129.GC20988@redhat.com>

On Wed, 11 Jan 2012 12:21:30 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Jan 11, 2012 at 10:55:52AM +1030, Rusty Russell wrote:
> > On Tue, 10 Jan 2012 19:03:36 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > On Wed, Dec 21, 2011 at 11:03:25AM +1030, Rusty Russell wrote:
> > > > Yes.  The idea that we can alter fields in the device-specific config
> > > > area is flawed.  There may be cases where it doesn't matter, but as an
> > > > idea it was holed to begin with.
> > > > 
> > > > We can reduce probability by doing a double read to check, but there are
> > > > still cases where it will fail.
> > > 
> > > Okay - want me to propose an interface for that?
> > 
> > Had a brief chat with BenH (CC'd).
> > 
> > I think we should deprecate writing to the config space.  Only balloon
> > does it AFAICT, and I can't quite figure out *why* it has an 'active'
> > field.
> 
> Are you sure? I think net writes a mac address.

True.  We'll need to disable that, and come up with another mechanism if
we want it back (a new feature and a VIRTIO_NET_HDR_F_SET_MAC flag in
the virtio_net header?  Or would that mess up vhost_net?).

> > This solves half the problem, of sync guest writes.  For the
> > other half, I suggest a generation counter; odd means inconsistent.  The
> > guest can poll.
> 
> So we get the counter until it's even, get the config, if it's changed
> repeat? Yes it works. However, I would like to have a way to detect
> config change just by looking at memory. ATM we need to read ISR to
> know.  If we used a VQ, the advantage would be that the device can work
> with a single MSIX vector shared by all VQs.

If we use a 32-bit counter, we also get this though, right?

If counter has changed, it's a config interrupt...

> If we do require config VQ anyway, why not use it to notify
> guest of config changes? Guest could pre-post an in buffer
> and host uses that.

We could, but it's an additional burden on each device.  vqs are cheap,
but not free.  And the config area is so damn convenient...

> > BenH also convinced me we should finally make the config space LE if
> > we're going to change things.  Since PCI is the most common transport,
> > guest-endian confuses people.  And it sucks for really weird machines.
> 
> Are we going to keep guest endian for e.g. virtio net header?
> If yes the benefit of switching config space is not that big.
> And changes in devices would affect non-PCI transports.

Yep.  It would only make sense if we do it for everything.  And yes,
it'll mess up everyone who is BE, so it needs to be a feature bit for
them.

> > We should also change the ring (to a single ring, I think).  Descriptors
> > to 24 bytes long (8 byte cookie, 8 byte addr, 4 byte len, 4 byte flags).
> > We might be able to squeeze it into 20 bytes but that means packing.  We
> > should support inline, chained or indirect.  Let the other side ack by
> > setting flag, cookie and len (if written).
> 
> Quite possibly all or some of these things help performance
> but do we have to change the spec before we have experimental
> proof?

We change the spec last, once we know what we're doing, ideally.

> I did experiment with a single ring using tools/virtio and
> I didn't see a measureable performance gain.

Interesting.  It is simpler and more standard than our current design,
but that's not sufficient unless there are other reasons.  Needs further
discussion and testing.

> Two rings do have the advantage of not requiring host side copy, which
> copy would surely add to cache pressure.

Well, a simple host could process in-order and leave stuff in the ring I
guess.  A smarter host would copy and queue, maybe leave one queue entry
in so it doesn't get flooded?

>  Since
> host doesn't change desriptors, we could also
> preformat some descriptors in the current design.
>
> There is a fragmentation problem in theory but it can be alleviated with
> a smart allocator.

Yeah, the complexity scares me...

> About inline - it can only help very small buffers.
> Which workloads do you have in mind exactly?

It was suggested by others, but I think TCP Acks are the classic one.
12 + 14 + 20 + 40 = 86 bytes with virtio_net_hdr_mrg_rxbuf at the front.

> BTW this seems to be the reverse from what you have in Mar 2001,
> see 87mxkjls61.fsf@rustcorp.com.au :)

(s/2001/2011).  Indeed.  Noone shared my optimism that having an open
process for a virtio2 would bring more players on board (my original
motivation).  But technical requirements are mounting up, which means
we're going to get there anyway.

> I am much less concerned with what we do for configuration,
> but I do not believe we have learned all performance lessons
> from virtio ring1. Is there any reason why we shouldn't be
> able to experiment with inline within virtio1 and see
> whether that gets us anything?

Inline in the used ring is possible, but those descriptors 8 bytes, vs
24/32.

> If we do a bunch of changes to the ring at once, we can't
> figure out what's right, what's wrong, or back out of
> mistakes later.
> 
> Since there are non PCI transports that use the ring,
> we really shouldn't make both the configuration and
> the ring changes depend on the same feature bit.

Yes, I'm thinking #define VIRTIO_F_VIRTIO2 (-1).  For PCI, this gets
mapped into a "are we using the new config layout?".  For others, it
gets mapped into a transport-specific feature.

(I'm sure you get it, but for the others) This is because I want to be
draw a clear line between all the legacy stuff at the same time, not
have to support part of it later because someone might not flip the
feature bit.

Cheers,
Rusty.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Rusty Russell @ 2012-01-12  1:42 UTC (permalink / raw)
  To: Michael S. Tsirkin, Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Pawel Moll, Sasha Levin, Anthony Liguori,
	virtualization
In-Reply-To: <20120111220232.GC27292@redhat.com>

On Thu, 12 Jan 2012 00:02:33 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Look, we have a race currently. Let us not tie a bug fix to a huge
> rewrite with unclear performance benefits, please.

In theory, yes.  In practice, we bandaid it.

I think in the short term we change ->get to get the entire sequence
twice, and check it's the same.  Theoretically, still racy, but it does
cut the window.  And we haven't seen the bug yet, either.

In the longer term, we fix it properly:

1) Make it readonly, prevents one class of problems.
2) Treat it as constant if drv->config_changed is NULL (we can do this
   now, in fact) and ignore the config interrupt.
3) Use a generation counter on the config, odd means wait.

Cheers,
Rusty.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Rusty Russell @ 2012-01-12  1:35 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Pawel Moll, Michael S. Tsirkin, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Sasha Levin
In-Reply-To: <4F0D8EFA.3010503@codemonkey.ws>

On Wed, 11 Jan 2012 07:30:34 -0600, Anthony Liguori <anthony@codemonkey.ws> wrote:
> I think the more important thing to do is require accesses to integers in the 
> config space to always be aligned and to use the appropriate accessor. 
> Non-integer fields should be restricted to byte access.
> 
> That limits config space entries to 32-bit but also means that there is no need 
> for a generation counter.

Unfortunately not, see virtio_blk.  capacity is 64 bits, and geometry is
multiple fields.

> If we're already making the change, the endianness ought to be a feature bit.

I'd rather tie it to the new PCI config layout (and ring format).  Then
we can simply have two backends, legacy and modern.

For non-PCI, we would need a feature bit.

Cheers,
Rusty.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 22:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <20120111221349.GD27292@redhat.com>

On Thu, 2012-01-12 at 00:13 +0200, Michael S. Tsirkin wrote:

> > Well, I would argue that the network driver world has proven countless
> > times that those are good ideas :-)
> 
> Below you seem to suggest that separate rings like
> virtio has now is better than a single ring like Rusty
> suggested.

I was merely pointing at examples. My understanding (but I may have
misparsed him) is that Rusty wants to collate the descriptors and the
"available" ring. I personally don't think the completions (the used
ring) should be merged, it should remain separate. If Rusty was hinting
at doing that then I disagree with him :-)

> Are you familiar with current virtio ring structure?  How is this
> different?

Vaguely yes :-) It does have this uni-directional model, I didn't
express myself very well or I got confused by the statements about a
single ring. Yes, I think we should keep the ring at least split into
two directions as it is today.

But we can certainly merge available ring and descriptors ring.

Another option is having the ring entry size be a configurable power of
two. This is trivial and comes at pretty much no cost. That would allow
virtio-net for example to do efficient things like putting the headers
as immediate data in the ring and the data as indirect.

This means a large part of the packet processing can happen without
touching additional cache lines.

Anyway, at this stage, I plan to sit down with Rusty next week and hash
out some kind of proposal, which we can experiment with & benchmark.

I'd suggest doing a simple user space app that creates such a ring,
forks, and produce / consume. We can then run that through perf and
analyze the cache behaviour, maximum throughput, etc....

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 22:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <20120111221349.GD27292@redhat.com>

On Thu, 2012-01-12 at 00:13 +0200, Michael S. Tsirkin wrote:

> > We typically pre-populate the data rings with skb's for 1500 and 9000
> > bytes packets. Small packets come in immediately in the completion ring,
> > and large packets via the data ring. 
> 
> Won't real workloads suffer from packet reordering?

No, they aren't re-ordered. The completion ring has an entry for each
packet, in order. Those entries eventually reference the entry index in
the data rings if the data was put there instead of being immediate.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 22:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Pawel Moll, virtualization, Christian Borntraeger, Sasha Levin,
	Anthony Liguori
In-Reply-To: <20120111220232.GC27292@redhat.com>

On Thu, 2012-01-12 at 00:02 +0200, Michael S. Tsirkin wrote:
> > We could probably have a helper library for sending control messages
> > which could handle waiting for a ring slot to be free (practically
> > always the case on control queues), writing the message, sending it
> and
> > waiting for a status queue confirmation message.
> > 
> > Cheers,
> > Ben.
> > 
> 
> Look, we have a race currently. Let us not tie a bug fix to a huge
> rewrite with unclear performance benefits, please.

Well, if we change endian, change the way config works, I think we are
doing enough dramatic changes to go all the way and change the ring
format too.

I don't think there's anything "unclear" about improving the rings in
the direction that all network cards have been chosing so far :-) But
for the control ring, performance is clearly not an issue.

I also like the simplicity of immediate data, it limits the failure
path, no allocation -does- make it easier, not having to wait for the
previous buffer to be complete if you pre-allocate etc...

In any case, if we do a lot of changes at the same time, we are probably
better off cutting the cord and making an incompatible vfio v2 with a
new set of drivers.

Having to deal with all the variants in a single driver will result in
unmaintainable drivers imho.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-11 22:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <1326316422.23910.154.camel@pasglop>

On Thu, Jan 12, 2012 at 08:13:42AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2012-01-11 at 12:21 +0200, Michael S. Tsirkin wrote:
> > 
> > > BenH also convinced me we should finally make the config space LE if
> > > we're going to change things.  Since PCI is the most common transport,
> > > guest-endian confuses people.  And it sucks for really weird machines.
> > 
> > Are we going to keep guest endian for e.g. virtio net header?
> > If yes the benefit of switching config space is not that big.
> > And changes in devices would affect non-PCI transports.
> 
> I think the concept of "guest endian" is broken by design. What does
> that mean when running for example an ARM or a ppc 440 "guest" which
> could be either endian ? Since you can't hard code your guest endian how
> do you obtain/negociate it ? Also you now have to deal with dual endian
> in the host, makes everything trickier.
> 
> Just make everything LE.

Yea. But it's not a pure transport issue, just fixing configuration
won't be enough.  E.g. we have structures like virtio net header.

> > Quite possibly all or some of these things help performance
> > but do we have to change the spec before we have experimental
> > proof?
> 
> Well, I would argue that the network driver world has proven countless
> times that those are good ideas :-)

Below you seem to suggest that separate rings like
virtio has now is better than a single ring like Rusty
suggested.

> But by all mean, let's do a
> prototype implementation with virtio-net for example and bench it.
> 
> I don't think you need a single ring. For multiqueue net, you definitely
> want multiple rings and you do want rings to remain uni-directional.
> 
> One other thing that can be useful is to separate the completion ring
> from the actual ring of DMA descriptors, making the former completely
> read-only by the guest and the later completely read only by the host.

Are you familiar with current virtio ring structure?  How is this
different?

> For example take the ehea ethernet rx model. It has 3 rx "rings" per
> queue. One contains the completions, it's a toggle-valid model so we
> never write back to clear valid, it contains infos from the parser, the
> tokenID of the packet and the index as to where in which ring the data
> is, wich is either inline in the completion ring (small packet), header
> inline & data in a data ring or completely in a data ring. Then you have
> two data rings which are simply rings of SG list entries (more or less).
> 
> We typically pre-populate the data rings with skb's for 1500 and 9000
> bytes packets. Small packets come in immediately in the completion ring,
> and large packets via the data ring. 

Won't real workloads suffer from packet reordering?

> That's just -an- example. There's many other to take inspiration from.
> Network folks have beaten to death the problem of ring efficiency vs.
> CPU caches.
> 
> > > Moreover, I think we should make all these changes at once (at least, in
> > > the spec).  That makes it a big change, and it'll take longer to
> > > develop, but makes it easy in the long run to differentiate legacy and
> > > modern virtio.
> > > 
> > > Thoughts?
> > > Rusty.
> > 
> > BTW this seems to be the reverse from what you have in Mar 2001,
> > see 87mxkjls61.fsf@rustcorp.com.au :)
> 
> That was 10 years ago... 

Sorry, typo. It was Mar 2010 :)

> > I am much less concerned with what we do for configuration,
> > but I do not believe we have learned all performance lessons
> > from virtio ring1. 
> 
> Maybe we have learned some more since then ? :-)

There was 1 change in ring layout.

> > Is there any reason why we shouldn't be
> > able to experiment with inline within virtio1 and see
> > whether that gets us anything?
> > If we do a bunch of changes to the ring at once, we can't
> > figure out what's right, what's wrong, or back out of
> > mistakes later.
> > 
> > Since there are non PCI transports that use the ring,
> > we really shouldn't make both the configuration and
> > the ring changes depend on the same feature bit.
> 
> Another advantage of inline data is that it makes things a lot easier
> for cases where only small amount of data need to be exchanged, such as
> control/status rings, maybe virtio-tty (which I'm working on), etc... 
> 
> Cheers,
> Ben.

Is that getting you a lot of speedup? Note you want to add more code on
data path for everyone.  Why can't you have a fixed buffer in memory and
just point to that?

-- 
MST

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-11 22:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Pawel Moll, virtualization, Christian Borntraeger, Sasha Levin,
	Anthony Liguori
In-Reply-To: <1326315726.23910.143.camel@pasglop>

On Thu, Jan 12, 2012 at 08:02:06AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2012-01-11 at 14:26 -0600, Anthony Liguori wrote:
> > 
> > I'd say that's a special case but I see what you're getting at here.
> > 
> > So what about keeping the config space read-only and using control
> > queues for 
> > everything else?
> 
> Which is exactly what Rusty and I are proposing :-)
> I would go further
> and eliminate the idea of a seqlock and instead of a status queue with
> precise messages indicating what changed.
>
> I would couple that with the new queue format allowing immediate data in
> the descriptor to avoid having to use indirect buffers for these, which
> means no allocation, no buffer pool etc... which makes everything a lot
> easier to deal with as well.

We just need a couple of buffers outstanding. It can't be easier,
and a single buf descriptors already do not use indirection.

> We could probably have a helper library for sending control messages
> which could handle waiting for a ring slot to be free (practically
> always the case on control queues), writing the message, sending it and
> waiting for a status queue confirmation message.
> 
> Cheers,
> Ben.
> 

Look, we have a race currently. Let us not tie a bug fix to a huge
rewrite with unclear performance benefits, please.

-- 
MST

^ permalink raw reply

* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-11 21:58 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
	Christian Borntraeger, Sasha Levin
In-Reply-To: <4F0DF08F.7010706@codemonkey.ws>

On Wed, Jan 11, 2012 at 02:26:55PM -0600, Anthony Liguori wrote:
> On 01/11/2012 02:14 PM, Michael S. Tsirkin wrote:
> >On Wed, Jan 11, 2012 at 01:42:39PM -0600, Anthony Liguori wrote:
> >>On 01/11/2012 11:08 AM, Michael S. Tsirkin wrote:
> >>>
> >>>Not sure what you mean. Using VQ is DMA which is pretty common for PCI.
> >>
> >>Do you know of a network device that obtains it's mac address via a DMA transaction?
> >
> >Sure.
> >See mlx4_replace_mac in drivers/net/ethernet/mellanox/mlx4/port.c
> 
> I'd say that's a special case but I see what you're getting at here.
> 
> So what about keeping the config space read-only and using control
> queues for everything else?

Not just read-only - constant.

> Regards,
> 
> Anthony Liguori
> 
> >
> >>Regards,
> >>
> >>Anthony Liguori
> >>
> >>>
> >>>>>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox