* [Qemu-devel] Proposal for virtio standardization.
  From: Rusty Russell @ 2012-09-27  0:29 UTC
  To: Anthony Liguori, Adam Litke, Amit Shah, Avi Kivity, Avishay Traeger,
      Jason Wang, Michael S. Tsirkin, Ohad Ben-Cohen, Paolo Bonzini,
      Pawel Moll, Sasha Levin, Cornelia Huck
  Cc: qemu-devel, LKML, kvm, virtualization

Hi all,

        I've had several requests for a more formal approach to the virtio
draft spec, and (after some soul-searching) I'd like to try that.

        The proposal is to use OASIS as the standards body, as it's fairly
light-weight as these things go.  For me this means paperwork and setting
up a Working Group and getting the right people involved as Voting members
starting with the current contributors; for most of you it just means a new
mailing list, though I'll be cross-posting any drafts and major changes
here anyway.

        I believe that a documented standard (aka virtio 1.0) will increase
visibility and adoption in areas outside our normal linux/kvm universe.
There's been some of that already, but this is the clearest path to
accelerate it.  Not the easiest path, but I believe that a solid I/O
standard is a Good Thing for everyone.

        Yet I also want to decouple new and experimental development from
the standards effort; running code comes first.  New feature bits and new
device numbers should be reservable without requiring a full spec change.

So the essence of my proposal is:
1) I start a Working Group within OASIS where we can aim for virtio spec
   1.0.

2) The current spec is textually reordered so the core is clearly
   bus-independent, with PCI, mmio, etc appendices.

3) Various clarifications, formalizations and cleanups to the spec text,
   and possibly elimination of old deprecated features.

4) The only significant change to the spec is that we use PCI capabilities,
   so we can have infinite feature bits.
   (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html)

5) Changes to the ring layout and other such things are deferred to a
   future virtio version; whether this is done within OASIS or externally
   depends on how well this works for the 1.0 release.

Thoughts?
Rusty.
* Re: [Qemu-devel] Proposal for virtio standardization.
  From: Anthony Liguori @ 2012-10-04 18:49 UTC

Rusty Russell <rusty@rustcorp.com.au> writes:

> So the essence of my proposal is:
> 1) I start a Working Group within OASIS where we can aim for virtio spec
>    1.0.
>
> 2) The current spec is textually reordered so the core is clearly
>    bus-independent, with PCI, mmio, etc appendices.
>
> 3) Various clarifications, formalizations and cleanups to the spec text,
>    and possibly elimination of old deprecated features.
>
> 4) The only significant change to the spec is that we use PCI
>    capabilities, so we can have infinite feature bits.
>    (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html)

We discussed this on IRC last night.  I don't think PCI capabilities are
a good mechanism to use...

PCI capabilities are there to organize how the PCI config space is
allocated to allow vendor extensions to co-exist with future PCI
extensions.

But we've never used the PCI config space within virtio-pci.  We do
everything in BAR0.  I don't think there's any real advantage of using
the config space vs. a BAR for virtio-pci.

The nice thing about using a BAR is that the region is MMIO or PIO.
This maps really nicely to non-PCI transports too.

But extending the PCI config space (especially dealing with capability
allocation) is pretty gnarly, and there isn't an obvious equivalent
outside of PCI.

There are very devices that we emulate today that make use of extended
PCI device registers outside the platform devices (that have no BARs).

Regards,

Anthony Liguori

> 5) Changes to the ring layout and other such things are deferred to a
>    future virtio version; whether this is done within OASIS or
>    externally depends on how well this works for the 1.0 release.
>
> Thoughts?
> Rusty.
* [Qemu-devel] Using PCI config space to indicate config location
  From: Rusty Russell @ 2012-10-08  2:21 UTC
  To: Anthony Liguori
  Cc: Michael S. Tsirkin, qemu-devel, kvm, virtualization

(Topic updated, cc's trimmed).

Anthony Liguori <aliguori@us.ibm.com> writes:
> Rusty Russell <rusty@rustcorp.com.au> writes:
>> 4) The only significant change to the spec is that we use PCI
>>    capabilities, so we can have infinite feature bits.
>>    (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html)
>
> We discussed this on IRC last night.  I don't think PCI capabilities are
> a good mechanism to use...
>
> PCI capabilities are there to organize how the PCI config space is
> allocated to allow vendor extensions to co-exist with future PCI
> extensions.
>
> But we've never used the PCI config space within virtio-pci.  We do
> everything in BAR0.  I don't think there's any real advantage of using
> the config space vs. a BAR for virtio-pci.

Note before anyone gets confused; we were talking about using the PCI
config space to indicate what BAR(s) the virtio stuff is in.  An
alternative would be to simply specify a new layout format in BAR1.

The arguments for a more flexible format that I know of:

1) virtio-pci has already extended the pci-specific part of the
   configuration once (for MSI-X), so I don't want to assume it won't
   happen again.

2) ISTR an argument about mapping the ISR register separately, for
   performance, but I can't find a reference to it.

> This maps really nicely to non-PCI transports too.

This isn't right.  No one else can use the PCI layout.  While parts are
common, other parts are pci-specific (MSI-X and ISR for example), and
yet other parts are specified by PCI elsewhere (eg interrupt numbers).

> But extending the PCI config space (especially dealing with capability
> allocation) is pretty gnarly, and there isn't an obvious equivalent
> outside of PCI.

That's OK, because general changes should be done with feature bits, and
the others all have an infinite number.  Being the first, virtio-pci has
some unique limitations we'd like to fix.

> There are very devices that we emulate today that make use of extended
> PCI device registers outside the platform devices (that have no BARs).

This sentence confused me?

Thanks,
Rusty.
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Anthony Liguori @ 2012-10-08 13:58 UTC

Rusty Russell <rusty@rustcorp.com.au> writes:

> Note before anyone gets confused; we were talking about using the PCI
> config space to indicate what BAR(s) the virtio stuff is in.  An
> alternative would be to simply specify a new layout format in BAR1.
>
> The arguments for a more flexible format that I know of:
>
> 1) virtio-pci has already extended the pci-specific part of the
>    configuration once (for MSI-X), so I don't want to assume it won't
>    happen again.

"configuration" is the wrong word here.  The virtio-pci BAR0 layout is:

  0..19  virtio-pci registers
  20+    virtio configuration space

MSI-X needed to add additional virtio-pci registers, so now we have:

  0..19  virtio-pci registers
  if MSI-X:
     20..23  virtio-pci MSI-X registers
     24+     virtio configuration space
  else:
     20+     virtio configuration space

I agree, this stinks.

But I think we could solve this in a different way.  I think we could
just move the virtio configuration space to BAR1 by using a transport
feature bit.

That then frees up the entire BAR0 for use as virtio-pci registers.  We
can then always include the virtio-pci MSI-X register space and
introduce all new virtio-pci registers as simply being appended.

This new feature bit then becomes essentially a virtio configuration
latch.  When unacked, virtio configuration hides new registers; when
acked, those new registers are exposed.

Another option is to simply put new registers after the virtio
configuration blob.

> 2) ISTR an argument about mapping the ISR register separately, for
>    performance, but I can't find a reference to it.

I think the rationale is that ISR really needs to be PIO but everything
else doesn't.  PIO is much faster on x86 because it doesn't require
walking page tables or instruction emulation to handle the exit.

The argument to move the remaining registers to MMIO is to allow 64-bit
accesses to registers, which isn't possible with PIO.

>> This maps really nicely to non-PCI transports too.
>
> This isn't right.  No one else can use the PCI layout.  While parts are
> common, other parts are pci-specific (MSI-X and ISR for example), and
> yet other parts are specified by PCI elsewhere (eg interrupt numbers).
>
>> But extending the PCI config space (especially dealing with capability
>> allocation) is pretty gnarly, and there isn't an obvious equivalent
>> outside of PCI.
>
> That's OK, because general changes should be done with feature bits, and
> the others all have an infinite number.  Being the first, virtio-pci has
> some unique limitations we'd like to fix.
>
>> There are very devices that we emulate today that make use of extended
>> PCI device registers outside the platform devices (that have no BARs).
>
> This sentence confused me?

There is a missing "few".  "There are very few devices..."

Extending the PCI configuration space is unusual for PCI devices.  That
was the point.

Regards,

Anthony Liguori
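[ For concreteness, the floating offset Anthony describes is what a legacy
  driver has to compute today: the device-specific config starts at byte 20
  of BAR0, or at byte 24 once MSI-X is enabled.  A minimal sketch; the
  constant and function names are invented for illustration, not taken from
  any real header: ]

        #include <stdbool.h>

        #define LEGACY_HDR_LEN        20  /* common virtio-pci registers */
        #define LEGACY_MSIX_REGS_LEN   4  /* config + queue MSI-X vectors */

        /* Offset of the device-specific config inside the legacy BAR0
         * layout described above. */
        static inline unsigned legacy_device_config_offset(bool msix_enabled)
        {
                /* The offset moves when MSI-X is toggled, which is the
                 * "this stinks" part. */
                return LEGACY_HDR_LEN +
                       (msix_enabled ? LEGACY_MSIX_REGS_LEN : 0);
        }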
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Gerd Hoffmann @ 2012-10-08 14:58 UTC

  Hi,

> But I think we could solve this in a different way.  I think we could
> just move the virtio configuration space to BAR1 by using a transport
> feature bit.

Why hard-code stuff?

I think it makes a lot of sense to have a capability similar to msi-x
which simply specifies bar and offset of the register sets:

  [root@fedora ~]# lspci -vvs4
  00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
          [ ... ]
          Region 0: I/O ports at c000 [size=64]
          Region 1: Memory at fc029000 (32-bit) [size=4K]
          Capabilities: [40] MSI-X: Enable+ Count=2 Masked-
                  Vector table: BAR=1 offset=00000000
                  PBA: BAR=1 offset=00000800

So we could have for virtio something like this:

  Capabilities: [??] virtio-regs:
          legacy: BAR=0 offset=0
          virtio-pci: BAR=1 offset=1000
          virtio-cfg: BAR=1 offset=1800

> That then frees up the entire BAR0 for use as virtio-pci registers.  We
> can then always include the virtio-pci MSI-X register space and
> introduce all new virtio-pci registers as simply being appended.

BAR0 needs to stay as-is for compatibility reasons.  New devices which
don't have to care about old guests don't need to provide a 'legacy'
register region.

Most devices have mmio at BAR1 for msi-x support anyway, we can place
the virtio-pci and virtio configuration registers there too by default.
I wouldn't hardcode that though.

> This new feature bit then becomes essentially a virtio configuration
> latch.  When unacked, virtio configuration hides new registers; when
> acked, those new registers are exposed.

I'd just expose them all, all the time.

>> 2) ISTR an argument about mapping the ISR register separately, for
>>    performance, but I can't find a reference to it.
>
> I think the rationale is that ISR really needs to be PIO but everything
> else doesn't.  PIO is much faster on x86 because it doesn't require
> walking page tables or instruction emulation to handle the exit.

Is this still a pressing issue?  With MSI-X enabled ISR isn't needed,
correct?  Which would imply that pretty much only old guests without
MSI-X support need this, and we don't need to worry that much when
designing something new ...

cheers,
  Gerd
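[ A sketch of what such a vendor-specific "virtio-regs" capability could
  look like in C.  Only the first two bytes (capability ID and next pointer)
  are dictated by the PCI spec; the struct name, field names and per-set
  entry layout are invented here for illustration: ]

        #include <stdint.h>

        struct virtio_regs_cap {
                uint8_t  cap_vndr;   /* PCI_CAP_ID_VNDR (0x09) */
                uint8_t  cap_next;   /* link to the next capability */
                uint8_t  cap_len;    /* total length of this capability */
                uint8_t  num_sets;   /* number of register sets that follow */
                struct {
                        uint8_t  type;     /* legacy / virtio-pci / virtio-cfg */
                        uint8_t  bar;      /* which BAR the set lives in */
                        uint16_t reserved;
                        uint32_t offset;   /* byte offset within that BAR */
                } set[];                   /* one entry per register set */
        };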
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Anthony Liguori @ 2012-10-08 15:09 UTC

Gerd Hoffmann <kraxel@redhat.com> writes:

>   Hi,
>
>> But I think we could solve this in a different way.  I think we could
>> just move the virtio configuration space to BAR1 by using a transport
>> feature bit.
>
> Why hard-code stuff?
>
> I think it makes a lot of sense to have a capability similar to msi-x
> which simply specifies bar and offset of the register sets:
>
>   [root@fedora ~]# lspci -vvs4
>   00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
>           [ ... ]
>           Region 0: I/O ports at c000 [size=64]
>           Region 1: Memory at fc029000 (32-bit) [size=4K]
>           Capabilities: [40] MSI-X: Enable+ Count=2 Masked-
>                   Vector table: BAR=1 offset=00000000
>                   PBA: BAR=1 offset=00000800

MSI-X capability is a standard PCI capability which is why lspci can
parse it.

> So we could have for virtio something like this:
>
>   Capabilities: [??] virtio-regs:
>           legacy: BAR=0 offset=0
>           virtio-pci: BAR=1 offset=1000
>           virtio-cfg: BAR=1 offset=1800

This would be a vendor specific PCI capability so lspci wouldn't
automatically know how to parse it.

You could just as well teach lspci to parse BAR0 to figure out what
features are supported.

>> That then frees up the entire BAR0 for use as virtio-pci registers.  We
>> can then always include the virtio-pci MSI-X register space and
>> introduce all new virtio-pci registers as simply being appended.
>
> BAR0 needs to stay as-is for compatibility reasons.  New devices which
> don't have to care about old guests don't need to provide a 'legacy'
> register region.

A latch feature bit would allow the format to change without impacting
compatibility at all.

>>> 2) ISTR an argument about mapping the ISR register separately, for
>>>    performance, but I can't find a reference to it.
>>
>> I think the rationale is that ISR really needs to be PIO but everything
>> else doesn't.  PIO is much faster on x86 because it doesn't require
>> walking page tables or instruction emulation to handle the exit.
>
> Is this still a pressing issue?  With MSI-X enabled ISR isn't needed,
> correct?  Which would imply that pretty much only old guests without
> MSI-X support need this, and we don't need to worry that much when
> designing something new ...

It wasn't that long ago that MSI-X wasn't supported.  I think we should
continue to keep ISR as PIO as it is a fast path.

Regards,

Anthony Liguori
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Gerd Hoffmann @ 2012-10-08 20:13 UTC

  Hi,

>> So we could have for virtio something like this:
>>
>>   Capabilities: [??] virtio-regs:
>>           legacy: BAR=0 offset=0
>>           virtio-pci: BAR=1 offset=1000
>>           virtio-cfg: BAR=1 offset=1800
>
> This would be a vendor specific PCI capability so lspci wouldn't
> automatically know how to parse it.

Sure, would need a patch to actually parse+print the cap, /me was just
trying to make my point clear in a simple way.

>>>> 2) ISTR an argument about mapping the ISR register separately, for
>>>>    performance, but I can't find a reference to it.
>>>
>>> I think the rationale is that ISR really needs to be PIO but everything
>>> else doesn't.  PIO is much faster on x86 because it doesn't require
>>> walking page tables or instruction emulation to handle the exit.
>>
>> Is this still a pressing issue?  With MSI-X enabled ISR isn't needed,
>> correct?  Which would imply that pretty much only old guests without
>> MSI-X support need this, and we don't need to worry that much when
>> designing something new ...
>
> It wasn't that long ago that MSI-X wasn't supported.  I think we should
> continue to keep ISR as PIO as it is a fast path.

No problem if we allow to have both legacy layout and new layout at the
same time.  Guests can continue to use ISR @ BAR0 in PIO space for
existing virtio devices, even in case they want to use mmio for other
registers -> all fine.

New virtio devices can support MSI-X from day one and decide to not
expose a legacy layout PIO bar.

cheers,
  Gerd
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Anthony Liguori @ 2012-10-08 20:55 UTC

Gerd Hoffmann <kraxel@redhat.com> writes:

>> It wasn't that long ago that MSI-X wasn't supported.  I think we should
>> continue to keep ISR as PIO as it is a fast path.
>
> No problem if we allow to have both legacy layout and new layout at the
> same time.  Guests can continue to use ISR @ BAR0 in PIO space for
> existing virtio devices, even in case they want to use mmio for other
> registers -> all fine.
>
> New virtio devices can support MSI-X from day one and decide to not
> expose a legacy layout PIO bar.

I think having BAR1 be an MMIO mirror of the registers + a BAR2 for
virtio configuration space is probably not that bad of a solution.

Regards,

Anthony Liguori
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Rusty Russell @ 2012-10-08 23:56 UTC

Anthony Liguori <aliguori@us.ibm.com> writes:
> Gerd Hoffmann <kraxel@redhat.com> writes:
>> No problem if we allow to have both legacy layout and new layout at the
>> same time.  Guests can continue to use ISR @ BAR0 in PIO space for
>> existing virtio devices, even in case they want to use mmio for other
>> registers -> all fine.
>>
>> New virtio devices can support MSI-X from day one and decide to not
>> expose a legacy layout PIO bar.
>
> I think having BAR1 be an MMIO mirror of the registers + a BAR2 for
> virtio configuration space is probably not that bad of a solution.

Well, we also want to clean up the registers, so how about:

BAR0: legacy, as is.  If you access this, don't use the others.
BAR1: new format virtio-pci layout.  If you use this, don't use BAR0.
BAR2: virtio-cfg.  If you use this, don't use BAR0.
BAR3: ISR.  If you use this, don't use BAR0.

I prefer the cases exclusive (ie. use one or the other) as a clear path
to remove the legacy layout; and leaving the ISR in BAR0 leaves us with
an ugly corner case in future (ISR is BAR0 + 19?  WTF?).

As to MMIO vs PIO, the BARs are self-describing, so we should explicitly
endorse that and leave it to the devices.

The detection is simple: if BAR1 has non-zero length, it's new-style,
otherwise legacy.

Thoughts?
Rusty.
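[ The detection rule Rusty states, as a driver might spell it.  Purely
  illustrative (Gerd points out further down that BAR1 is already used for
  MSI-X today); the enum names are invented: ]

        #include <linux/pci.h>

        enum vp_layout { VP_LAYOUT_LEGACY, VP_LAYOUT_MODERN };

        static enum vp_layout vp_detect_layout(struct pci_dev *pdev)
        {
                /* Non-zero BAR1 length => new-style layout (BAR1 registers,
                 * BAR2 config, BAR3 ISR); otherwise legacy BAR0. */
                return pci_resource_len(pdev, 1) ? VP_LAYOUT_MODERN
                                                 : VP_LAYOUT_LEGACY;
        }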
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Anthony Liguori @ 2012-10-09  1:51 UTC

Rusty Russell <rusty@rustcorp.com.au> writes:

> Well, we also want to clean up the registers, so how about:
>
> BAR0: legacy, as is.  If you access this, don't use the others.
> BAR1: new format virtio-pci layout.  If you use this, don't use BAR0.
> BAR2: virtio-cfg.  If you use this, don't use BAR0.
> BAR3: ISR.  If you use this, don't use BAR0.
>
> I prefer the cases exclusive (ie. use one or the other) as a clear path
> to remove the legacy layout; and leaving the ISR in BAR0 leaves us with
> an ugly corner case in future (ISR is BAR0 + 19?  WTF?).

We'll never remove legacy so we shouldn't plan on it.  There are
literally hundreds of thousands of VMs out there with the current virtio
drivers installed in them.  We'll be supporting them for a very, very
long time :-)

I don't think we gain a lot by moving the ISR into a separate BAR.
Splitting up registers like that seems weird to me too.

It's very normal to have a mirrored set of registers that are PIO in one
bar and MMIO in a different BAR.

If we added an additional constraint that BAR1 was mirrored except for
the config space and the MSI section was always there, I think the end
result would be nice.  IOW:

BAR0[pio]:  virtio-pci registers + optional MSI section + virtio-config
BAR1[mmio]: virtio-pci registers + MSI section + future extensions
BAR2[mmio]: virtio-config

We can continue to do ISR access via BAR0 for performance reasons.

> As to MMIO vs PIO, the BARs are self-describing, so we should explicitly
> endorse that and leave it to the devices.
>
> The detection is simple: if BAR1 has non-zero length, it's new-style,
> otherwise legacy.

I agree that this is the best way to extend, but I think we should still
use a transport feature bit.  We want to be able to detect within QEMU
whether a guest is using these new features because we need to adjust
migration state accordingly.

Otherwise we would have to detect reads/writes to the new BARs to
maintain whether the extended register state needs to be saved.  This
gets nasty dealing with things like reset.

A feature bit simplifies this all pretty well.

Regards,

Anthony Liguori

> Thoughts?
> Rusty.
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Rusty Russell @ 2012-10-09  3:16 UTC

Anthony Liguori <aliguori@us.ibm.com> writes:
> We'll never remove legacy so we shouldn't plan on it.  There are
> literally hundreds of thousands of VMs out there with the current virtio
> drivers installed in them.  We'll be supporting them for a very, very
> long time :-)

You will be supporting this for qemu on x86, sure.  As I think we're
still in the growth phase for virtio, I prioritize future spec
cleanliness pretty high.

But I think you'll be surprised how fast this is deprecated:
1) Bigger queues for block devices (guest-specified ringsize)
2) Smaller rings for openbios (guest-specified alignment)
3) All-mmio mode (powerpc)
4) Whatever network features get numbers > 31.

> I don't think we gain a lot by moving the ISR into a separate BAR.
> Splitting up registers like that seems weird to me too.

Confused.  I proposed the same split as you have, just ISR by itself.

> It's very normal to have a mirrored set of registers that are PIO in one
> bar and MMIO in a different BAR.
>
> If we added an additional constraint that BAR1 was mirrored except for
> the config space and the MSI section was always there, I think the end
> result would be nice.  IOW:

But it won't be the same, because we want all that extra stuff, like
more feature bits and queue size alignment.  (Admittedly queues past
16TB aren't a killer feature).

To make it concrete:

Current:
        struct {
                __le32 host_features;       /* read-only */
                __le32 guest_features;      /* read/write */
                __le32 queue_pfn;           /* read/write */
                __le16 queue_size;          /* read-only */
                __le16 queue_sel;           /* read/write */
                __le16 queue_notify;        /* read/write */
                u8 status;                  /* read/write */
                u8 isr;                     /* read-only, clear on read */
                /* Optional */
                __le16 msi_config_vector;   /* read/write */
                __le16 msi_queue_vector;    /* read/write */
                /* ... device features */
        };

Proposed:
        struct virtio_pci_cfg {
                /* About the whole device. */
                __le32 device_feature_select;   /* read-write */
                __le32 device_feature;          /* read-only */
                __le32 guest_feature_select;    /* read-write */
                __le32 guest_feature;           /* read-only */
                __le16 msix_config;             /* read-write */
                __u8 device_status;             /* read-write */
                __u8 unused;

                /* About a specific virtqueue. */
                __le16 queue_select;        /* read-write */
                __le16 queue_align;         /* read-write, power of 2. */
                __le16 queue_size;          /* read-write, power of 2. */
                __le16 queue_msix_vector;   /* read-write */
                __le64 queue_address;       /* read-write: 0xFFFFFFFFFFFFFFFF == DNE. */
        };

        struct virtio_pci_isr {
                __u8 isr;                   /* read-only, clear on read */
        };

We could also enforce LE in the per-device config space in this case,
another nice cleanup for PCI people.

> BAR0[pio]:  virtio-pci registers + optional MSI section + virtio-config
> BAR1[mmio]: virtio-pci registers + MSI section + future extensions
> BAR2[mmio]: virtio-config
>
> We can continue to do ISR access via BAR0 for performance reasons.

But powerpc explicitly *doesn't* want a pio bar.  So let it be its own
bar, which can be either.

>> As to MMIO vs PIO, the BARs are self-describing, so we should explicitly
>> endorse that and leave it to the devices.
>>
>> The detection is simple: if BAR1 has non-zero length, it's new-style,
>> otherwise legacy.
>
> I agree that this is the best way to extend, but I think we should still
> use a transport feature bit.  We want to be able to detect within QEMU
> whether a guest is using these new features because we need to adjust
> migration state accordingly.
>
> Otherwise we would have to detect reads/writes to the new BARs to
> maintain whether the extended register state needs to be saved.  This
> gets nasty dealing with things like reset.

I don't think it'll be that bad; reset clears the device to unknown,
bar0 moves it from unknown->legacy mode, bar1/2/3 changes it from
unknown->modern mode, and anything else is bad (I prefer being strict so
we catch bad implementations from the beginning).

But I'm happy to implement it and see what it's like.

> A feature bit simplifies this all pretty well.

I suspect it will be quite ugly, actually.  The guest has to use BAR0 to
get the features to see if they can use BAR1.  Do they ack the feature
(via BAR0) before accessing BAR1?  If not, qemu can't rely on the
feature bit.

Cheers,
Rusty.
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Avi Kivity @ 2012-10-09 10:17 UTC

On 10/09/2012 05:16 AM, Rusty Russell wrote:
> Anthony Liguori <aliguori@us.ibm.com> writes:
>> We'll never remove legacy so we shouldn't plan on it.  There are
>> literally hundreds of thousands of VMs out there with the current virtio
>> drivers installed in them.  We'll be supporting them for a very, very
>> long time :-)
>
> You will be supporting this for qemu on x86, sure.  As I think we're
> still in the growth phase for virtio, I prioritize future spec
> cleanliness pretty high.

If a pure ppc hypervisor was on the table, this might have been
worthwhile.  As it is the codebase is shared, and the Linux drivers are
shared, so cleaning up the spec doesn't help the code.

> But I think you'll be surprised how fast this is deprecated:
> 1) Bigger queues for block devices (guest-specified ringsize)
> 2) Smaller rings for openbios (guest-specified alignment)
> 3) All-mmio mode (powerpc)
> 4) Whatever network features get numbers > 31.
>
>> I don't think we gain a lot by moving the ISR into a separate BAR.
>> Splitting up registers like that seems weird to me too.
>
> Confused.  I proposed the same split as you have, just ISR by itself.

I believe Anthony objects to having the ISR by itself.  What is the
motivation for that?

-- 
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Anthony Liguori @ 2012-10-09 14:03 UTC

Avi Kivity <avi@redhat.com> writes:

> On 10/09/2012 05:16 AM, Rusty Russell wrote:
>> Anthony Liguori <aliguori@us.ibm.com> writes:
>>> We'll never remove legacy so we shouldn't plan on it.  There are
>>> literally hundreds of thousands of VMs out there with the current virtio
>>> drivers installed in them.  We'll be supporting them for a very, very
>>> long time :-)
>>
>> You will be supporting this for qemu on x86, sure.  As I think we're
>> still in the growth phase for virtio, I prioritize future spec
>> cleanliness pretty high.
>
> If a pure ppc hypervisor was on the table, this might have been
> worthwhile.  As it is the codebase is shared, and the Linux drivers are
> shared, so cleaning up the spec doesn't help the code.

Note that distros have been (perhaps unknowingly) shipping virtio-pci
for PPC for some time now.  So even though there wasn't a hypervisor
that supported virtio-pci, the guests already support it and are out
there in the wild.

There's a lot of value in maintaining "legacy" support even for PPC.

>> But I think you'll be surprised how fast this is deprecated:
>> 1) Bigger queues for block devices (guest-specified ringsize)
>> 2) Smaller rings for openbios (guest-specified alignment)
>> 3) All-mmio mode (powerpc)
>> 4) Whatever network features get numbers > 31.
>>
>>> I don't think we gain a lot by moving the ISR into a separate BAR.
>>> Splitting up registers like that seems weird to me too.
>>
>> Confused.  I proposed the same split as you have, just ISR by itself.
>
> I believe Anthony objects to having the ISR by itself.  What is the
> motivation for that?

Right, BARs are a precious resource not to be spent lightly.  Having an
entire BAR dedicated to a 1-byte register seems like a waste to me.

Regards,

Anthony Liguori
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Anthony Liguori @ 2012-10-09 13:56 UTC

Rusty Russell <rusty@rustcorp.com.au> writes:

> Anthony Liguori <aliguori@us.ibm.com> writes:
>> We'll never remove legacy so we shouldn't plan on it.  There are
>> literally hundreds of thousands of VMs out there with the current virtio
>> drivers installed in them.  We'll be supporting them for a very, very
>> long time :-)
>
> You will be supporting this for qemu on x86, sure.

And PPC.

> As I think we're still in the growth phase for virtio, I prioritize
> future spec cleanliness pretty high.
>
> But I think you'll be surprised how fast this is deprecated:
> 1) Bigger queues for block devices (guest-specified ringsize)
> 2) Smaller rings for openbios (guest-specified alignment)
> 3) All-mmio mode (powerpc)
> 4) Whatever network features get numbers > 31.

We can do all of these things with incremental change to the existing
layout.  That's the only way what I'm suggesting is different.

You want to reorder all of the fields and have a driver flag day.  But I
strongly suspect we'll decide we need to do the same exercise again in 4
years when we now need to figure out how to take advantage of
transactional memory or some other whiz-bang hardware feature.

There are a finite number of BARs but each BAR has an almost infinite
size.  So extending BARs instead of introducing new ones seems like the
conservative approach moving forward.

>> I don't think we gain a lot by moving the ISR into a separate BAR.
>> Splitting up registers like that seems weird to me too.
>
> Confused.  I proposed the same split as you have, just ISR by itself.

I disagree with moving the ISR into a separate BAR.  That's what seems
weird to me.

>> It's very normal to have a mirrored set of registers that are PIO in one
>> bar and MMIO in a different BAR.
>>
>> If we added an additional constraint that BAR1 was mirrored except for
>> the config space and the MSI section was always there, I think the end
>> result would be nice.  IOW:
>
> But it won't be the same, because we want all that extra stuff, like
> more feature bits and queue size alignment.  (Admittedly queues past
> 16TB aren't a killer feature).
>
> To make it concrete:
>
> Current:
>         struct {
>                 __le32 host_features;       /* read-only */
>                 __le32 guest_features;      /* read/write */
>                 __le32 queue_pfn;           /* read/write */
>                 __le16 queue_size;          /* read-only */
>                 __le16 queue_sel;           /* read/write */
>                 __le16 queue_notify;        /* read/write */
>                 u8 status;                  /* read/write */
>                 u8 isr;                     /* read-only, clear on read */
>                 /* Optional */
>                 __le16 msi_config_vector;   /* read/write */
>                 __le16 msi_queue_vector;    /* read/write */
>                 /* ... device features */
>         };
>
> Proposed:
>         struct virtio_pci_cfg {
>                 /* About the whole device. */
>                 __le32 device_feature_select;   /* read-write */
>                 __le32 device_feature;          /* read-only */
>                 __le32 guest_feature_select;    /* read-write */
>                 __le32 guest_feature;           /* read-only */
>                 __le16 msix_config;             /* read-write */
>                 __u8 device_status;             /* read-write */
>                 __u8 unused;
>
>                 /* About a specific virtqueue. */
>                 __le16 queue_select;        /* read-write */
>                 __le16 queue_align;         /* read-write, power of 2. */
>                 __le16 queue_size;          /* read-write, power of 2. */
>                 __le16 queue_msix_vector;   /* read-write */
>                 __le64 queue_address;       /* read-write: 0xFFFFFFFFFFFFFFFF == DNE. */
>         };
>
>         struct virtio_pci_isr {
>                 __u8 isr;                   /* read-only, clear on read */
>         };

What I'm suggesting is:

        struct {
                __le32 host_features;       /* read-only */
                __le32 guest_features;      /* read/write */
                __le32 queue_pfn;           /* read/write */
                __le16 queue_size;          /* read-only */
                __le16 queue_sel;           /* read/write */
                __le16 queue_notify;        /* read/write */
                u8 status;                  /* read/write */
                u8 isr;                     /* read-only, clear on read */
                __le16 msi_config_vector;   /* read/write */
                __le16 msi_queue_vector;    /* read/write */
                __le32 host_feature_select; /* read/write */
                __le32 guest_feature_select;/* read/write */
                __le32 queue_pfn_hi;        /* read/write */
        };

With the additional semantic that the virtio-config space is overlayed
on top of the register set in BAR0 unless the
VIRTIO_PCI_F_SEPARATE_CONFIG feature is acknowledged.  This feature acts
as a latch and when set, removes the config space overlay.

If the config space overlays the registers, the offset in BAR0 of the
overlay depends on whether MSI is enabled or not in the PCI device.

BAR1 is an MMIO mirror of BAR0 except that the config space is never
overlayed in BAR1 regardless of VIRTIO_PCI_F_SEPARATE_CONFIG.

BAR2 contains the config space.  A guest can look at BAR1 and BAR2 to
determine whether they exist.

> We could also enforce LE in the per-device config space in this case,
> another nice cleanup for PCI people.
>
>> BAR0[pio]:  virtio-pci registers + optional MSI section + virtio-config
>> BAR1[mmio]: virtio-pci registers + MSI section + future extensions
>> BAR2[mmio]: virtio-config
>>
>> We can continue to do ISR access via BAR0 for performance reasons.
>
> But powerpc explicitly *doesn't* want a pio bar.  So let it be its own
> bar, which can be either.

Does it really care?  It all ends up being MMIO for PPC anyway, so I
don't think there's any real pressing need for it not to be PIO.

>>> As to MMIO vs PIO, the BARs are self-describing, so we should explicitly
>>> endorse that and leave it to the devices.
>>>
>>> The detection is simple: if BAR1 has non-zero length, it's new-style,
>>> otherwise legacy.
>>
>> I agree that this is the best way to extend, but I think we should still
>> use a transport feature bit.  We want to be able to detect within QEMU
>> whether a guest is using these new features because we need to adjust
>> migration state accordingly.
>>
>> Otherwise we would have to detect reads/writes to the new BARs to
>> maintain whether the extended register state needs to be saved.  This
>> gets nasty dealing with things like reset.
>
> I don't think it'll be that bad; reset clears the device to unknown,
> bar0 moves it from unknown->legacy mode, bar1/2/3 changes it from
> unknown->modern mode, and anything else is bad (I prefer being strict so
> we catch bad implementations from the beginning).

You mean, "accesses to bar1/2/3" change it to modern mode.  That's a
pretty big side effect of a read.

> But I'm happy to implement it and see what it's like.
>
>> A feature bit simplifies this all pretty well.
>
> I suspect it will be quite ugly, actually.  The guest has to use BAR0 to
> get the features to see if they can use BAR1.  Do they ack the feature
> (via BAR0) before accessing BAR1?  If not, qemu can't rely on the
> feature bit.

See above.  A guest could happily just use BAR1/BAR2 and completely
ignore BAR0 provided that BAR1/BAR2 are present.

It also means the guest drivers don't need separate code paths for
"legacy" vs. "modern".  They can switch between the two by just changing
a single pointer.

The implementation ends up being pretty trivial too.  We can use the
same MemoryRegion in QEMU for both PIO and MMIO.  The kernel code stays
the same.  It just takes a couple if()s to handle whether there's a
config overlay or not.

Regards,

Anthony Liguori
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Rusty Russell @ 2012-10-10  3:44 UTC

Anthony Liguori <aliguori@us.ibm.com> writes:
> Rusty Russell <rusty@rustcorp.com.au> writes:
>
>> Anthony Liguori <aliguori@us.ibm.com> writes:
>>> We'll never remove legacy so we shouldn't plan on it.  There are
>>> literally hundreds of thousands of VMs out there with the current virtio
>>> drivers installed in them.  We'll be supporting them for a very, very
>>> long time :-)
>>
>> You will be supporting this for qemu on x86, sure.
>
> And PPC.

Yeah, Ben was a bit early fixing the virtio bugs I guess.  Oh well.

>> As I think we're still in the growth phase for virtio, I prioritize
>> future spec cleanliness pretty high.
>>
>> But I think you'll be surprised how fast this is deprecated:
>> 1) Bigger queues for block devices (guest-specified ringsize)
>> 2) Smaller rings for openbios (guest-specified alignment)
>> 3) All-mmio mode (powerpc)
>> 4) Whatever network features get numbers > 31.
>
> We can do all of these things with incremental change to the existing
> layout.  That's the only way what I'm suggesting is different.

Now extend your proposal to add all those features we want.  We add
if()s to the driver that we need to keep forever.

Let's just look at what your proposal gets us:

        bool has_feature(struct virtio_pci_dev *dev, u32 num)
        {
                /* Do we need to set selector? */
                if (dev->extended_features) {
                        unsigned long selector_off = VIRTIO_PCI_HOST_FEATURES_SELECT;
                        if (dev->using_bar0 && !dev->msi_active)
                                selector_off -= 32;
                        writel(num / 32, dev->config + selector_off);
                        num &= 31;
                } else if (num > 31)
                        return false;
                return readl(dev->config + VIRTIO_PCI_HOST_FEATURES) & (1 << num);
        }

vs two cases which are independent, with methods set at detection time:

virtio_pci_legacy.c:
        bool has_feature(struct virtio_pci_dev *dev, u32 num)
        {
                if (num > 31)
                        return false;
                return readl(dev->config + VIRTIO_PCI_LEGACY_HOST_FEATURES) & (1 << num);
        }

virtio_pci_new.c:
        bool has_feature(struct virtio_pci_dev *dev, u32 num)
        {
                writel(num / 32, dev->config + VIRTIO_PCI_HOST_FEATURES_SELECT);
                num &= 31;
                return readl(dev->config + VIRTIO_PCI_HOST_FEATURES) & (1 << num);
        }

> You want to reorder all of the fields and have a driver flag day.  But I
> strongly suspect we'll decide we need to do the same exercise again in 4
> years when we now need to figure out how to take advantage of
> transactional memory or some other whiz-bang hardware feature.

It's not a flag day, as it's backwards compatible.

Sure, we might have a clean cut again.  I'm not afraid to rewrite this
code to make it cleaner; perhaps this is somewhere qemu could benefit
from a similar approach.

> What I'm suggesting is:
>
>         struct {
>                 __le32 host_features;       /* read-only */
>                 __le32 guest_features;      /* read/write */
>                 __le32 queue_pfn;           /* read/write */
>                 __le16 queue_size;          /* read-only */
>                 __le16 queue_sel;           /* read/write */
>                 __le16 queue_notify;        /* read/write */
>                 u8 status;                  /* read/write */
>                 u8 isr;                     /* read-only, clear on read */
>                 __le16 msi_config_vector;   /* read/write */
>                 __le16 msi_queue_vector;    /* read/write */
>                 __le32 host_feature_select; /* read/write */
>                 __le32 guest_feature_select;/* read/write */
>                 __le32 queue_pfn_hi;        /* read/write */
>         };
>
> With the additional semantic that the virtio-config space is overlayed
> on top of the register set in BAR0 unless the
> VIRTIO_PCI_F_SEPARATE_CONFIG feature is acknowledged.  This feature acts
> as a latch and when set, removes the config space overlay.
>
> If the config space overlays the registers, the offset in BAR0 of the
> overlay depends on whether MSI is enabled or not in the PCI device.
>
> BAR1 is an MMIO mirror of BAR0 except that the config space is never
> overlayed in BAR1 regardless of VIRTIO_PCI_F_SEPARATE_CONFIG.

Creating a horrible mess for new code in order to support old code.  But
I have to maintain the new code, and this is horrible.

> You mean, "accesses to bar1/2/3" change it to modern mode.  That's a
> pretty big side effect of a read.

They don't need to, but I'd prefer they did, to keep implementations
from mixing old and new.

> See above.  A guest could happily just use BAR1/BAR2 and completely
> ignore BAR0 provided that BAR1/BAR2 are present.

But x86 guests want BAR0 because IO space is faster than MMIO.  Right?
So, not "happily".

> It also means the guest drivers don't need separate code paths for
> "legacy" vs. "modern".  They can switch between the two by just changing
> a single pointer.

This is wrong.  Of course they have separate code paths, but now you've
got lots of crossover, rather than two distinct ones.

> The implementation ends up being pretty trivial too.  We can use the
> same MemoryRegion in QEMU for both PIO and MMIO.  The kernel code stays
> the same.  It just takes a couple if()s to handle whether there's a
> config overlay or not.

"A couple of ifs" for every feature, and they nest.  You end up with
exponential complexity and an untestable mess.

That's why this change is an opportunity.  Not a feature bit for 64 bit
device offsets, and another for extended features and another for size
negotiation and another for alignment setting.  That would be a
maintenance nightmare for no good reason.

Cheers,
Rusty.
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Michael S. Tsirkin @ 2012-10-10 11:37 UTC

On Wed, Oct 10, 2012 at 02:14:11PM +1030, Rusty Russell wrote:
>> See above.  A guest could happily just use BAR1/BAR2 and completely
>> ignore BAR0 provided that BAR1/BAR2 are present.
>
> But x86 guests want BAR0 because IO space is faster than MMIO.  Right?

Or to be a bit more precise, ATM on x86 emulating IO instructions is
faster than MMIO.  IIUC this is since you get the address/data in
registers and don't need to decode them.

-- 
MST
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Jamie Lokier @ 2012-10-09 21:09 UTC

Rusty Russell wrote:
> I don't think it'll be that bad; reset clears the device to unknown,
> bar0 moves it from unknown->legacy mode, bar1/2/3 changes it from
> unknown->modern mode, and anything else is bad (I prefer being strict so
> we catch bad implementations from the beginning).

Will that work if a guest kernel that uses modern mode kexecs to an
older (but presumed reliable) kernel that only knows about legacy mode?

I.e. will the replacement kernel, or (ideally) replacement driver on
the rare occasion that is needed on a running kernel, be able to reset
the device hard enough?

-- Jamie
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Rusty Russell @ 2012-10-10  3:44 UTC

Jamie Lokier <jamie@shareable.org> writes:

> Rusty Russell wrote:
>> I don't think it'll be that bad; reset clears the device to unknown,
>> bar0 moves it from unknown->legacy mode, bar1/2/3 changes it from
>> unknown->modern mode, and anything else is bad (I prefer being strict so
>> we catch bad implementations from the beginning).
>
> Will that work if a guest kernel that uses modern mode kexecs to an
> older (but presumed reliable) kernel that only knows about legacy mode?
>
> I.e. will the replacement kernel, or (ideally) replacement driver on
> the rare occasion that is needed on a running kernel, be able to reset
> the device hard enough?

Well, you need to reset the device, so yes.

Cheers,
Rusty.
* Re: [Qemu-devel] Using PCI config space to indicate config location
  From: Rusty Russell @ 2012-10-11  0:08 UTC

Rusty Russell <rusty@rustcorp.com.au> writes:
> Jamie Lokier <jamie@shareable.org> writes:
>
>> Rusty Russell wrote:
>>> I don't think it'll be that bad; reset clears the device to unknown,
>>> bar0 moves it from unknown->legacy mode, bar1/2/3 changes it from
>>> unknown->modern mode, and anything else is bad (I prefer being strict so
>>> we catch bad implementations from the beginning).
>>
>> Will that work if a guest kernel that uses modern mode kexecs to an
>> older (but presumed reliable) kernel that only knows about legacy mode?
>>
>> I.e. will the replacement kernel, or (ideally) replacement driver on
>> the rare occasion that is needed on a running kernel, be able to reset
>> the device hard enough?
>
> Well, you need to reset the device, so yes.

MST said something which made me think harder about this case.

Either there needs to be a way to tell what mode the device is in, or
legacy reset has to work, even in modern mode.  The latter is
preferable, since it allows an older kernel to do the reset.

Now, since qemu would almost certainly have to add code to stop that
working, it'll probably be fine.  But I'd really like to add a "strict
mode" to qemu virtio which does extra sanity checking for driver
authors, and that might break this.  That's OK.

Thanks,
Rusty.
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-09 1:51 ` Anthony Liguori 2012-10-09 3:16 ` Rusty Russell @ 2012-10-09 6:33 ` Gerd Hoffmann 2012-10-09 15:26 ` Anthony Liguori 1 sibling, 1 reply; 39+ messages in thread From: Gerd Hoffmann @ 2012-10-09 6:33 UTC (permalink / raw) To: Anthony Liguori Cc: Rusty Russell, virtualization, qemu-devel, kvm, Michael S. Tsirkin Hi, >> Well, we also want to clean up the registers, so how about: >> >> BAR0: legacy, as is. If you access this, don't use the others. Ok. >> BAR1: new format virtio-pci layout. If you use this, don't use BAR0. >> BAR2: virtio-cfg. If you use this, don't use BAR0. Why use two bars for this? You can put them into one mmio bar, together with the msi-x vector table and PBA. Of course a pci capability describing the location is helpful for that ;) >> BAR3: ISR. If you use this, don't use BAR0. Again, I wouldn't hardcode that but use a capability. >> I prefer the cases exclusive (ie. use one or the other) as a clear path >> to remove the legacy layout; and leaving the ISR in BAR0 leaves us with >> an ugly corner case in future (ISR is BAR0 + 19? WTF?). Ok, so we have four register sets: (1) legacy layout (2) new virtio-pci (3) new virtio-config (4) new virtio-isr We can have a vendor pci capability, with a dword for each register set: bit 31 -- present bit bits 26-24 -- bar bits 23-0 -- offset So current drivers which must support legacy can use this: legacy layout -- present, bar 0, offset 0 new virtio-pci -- present, bar 1, offset 0 new virtio-config -- present, bar 1, offset 256 new virtio-isr -- present, bar 0, offset 19 [ For completeness: msi-x capability could add this: ] msi-x vector table bar 1, offset 512 msi-x pba bar 1, offset 768 > We'll never remove legacy so we shouldn't plan on it. There are > literally hundreds of thousands of VMs out there with the current virtio > drivers installed in them. We'll be supporting them for a very, very > long time :-) But new devices (virtio-qxl being a candidate) don't have old guests and don't need to worry. They could use this if they care about fast isr: legacy layout -- not present new virtio-pci -- present, bar 1, offset 0 new virtio-config -- present, bar 1, offset 256 new virtio-isr -- present, bar 0, offset 0 Or this if they don't worry about isr performance: legacy layout -- not present new virtio-pci -- present, bar 0, offset 0 new virtio-config -- present, bar 0, offset 256 new virtio-isr -- not present > I don't think we gain a lot by moving the ISR into a separate BAR. > Splitting up registers like that seems weird to me too. Main advantage of defining a register set with just isr is that it reduces pio address space consumtion for new virtio devices which don't have to worry about the legacy layout (8 bytes which is minimum size for io bars instead of 64 bytes). > If we added an additional constraints that BAR1 was mirrored except for Why add constraints? We want something future-proof, don't we? >> The detection is simple: if BAR1 has non-zero length, it's new-style, >> otherwise legacy. Doesn't fly. BAR1 is in use today for MSI-X support. > I agree that this is the best way to extend, but I think we should still > use a transport feature bit. We want to be able to detect within QEMU > whether a guest is using these new features because we need to adjust > migration state accordingly. Why does migration need adjustments? 
[ Not that I want to veto a feature bit, but I don't see the need yet ] cheers, Gerd ^ permalink raw reply [flat|nested] 39+ messages in thread
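As a rough illustration of the per-register-set dword Gerd describes above (bit 31 = present, bits 26-24 = BAR, bits 23-0 = offset), a guest could decode it roughly as follows. This is only a sketch of the proposed encoding; the macro, struct and function names are invented for the example and do not come from any existing header:

#include <stdint.h>
#include <stdbool.h>

/* Proposed encoding: bit 31 = present, bits 26-24 = BAR, bits 23-0 = offset. */
#define VIRTIO_REGSET_PRESENT    (1u << 31)
#define VIRTIO_REGSET_BAR_SHIFT  24
#define VIRTIO_REGSET_BAR_MASK   0x7u
#define VIRTIO_REGSET_OFF_MASK   0xffffffu

struct virtio_regset {
	bool     present;
	uint8_t  bar;      /* which PCI BAR holds this register set */
	uint32_t offset;   /* byte offset of the register set within that BAR */
};

static struct virtio_regset virtio_regset_decode(uint32_t dword)
{
	struct virtio_regset r = {
		.present = (dword & VIRTIO_REGSET_PRESENT) != 0,
		.bar     = (dword >> VIRTIO_REGSET_BAR_SHIFT) & VIRTIO_REGSET_BAR_MASK,
		.offset  = dword & VIRTIO_REGSET_OFF_MASK,
	};
	return r;
}

With four such dwords (legacy, virtio-pci, virtio-config, virtio-isr) the capability stays small, and a new device can simply leave the legacy dword marked not-present.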
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-09 6:33 ` Gerd Hoffmann @ 2012-10-09 15:26 ` Anthony Liguori 2012-10-09 20:24 ` Gerd Hoffmann 0 siblings, 1 reply; 39+ messages in thread From: Anthony Liguori @ 2012-10-09 15:26 UTC (permalink / raw) To: Gerd Hoffmann Cc: Rusty Russell, virtualization, qemu-devel, kvm, Michael S. Tsirkin Gerd Hoffmann <kraxel@redhat.com> writes: > Hi, > >>> Well, we also want to clean up the registers, so how about: >>> >>> BAR0: legacy, as is. If you access this, don't use the others. > > Ok. > >>> BAR1: new format virtio-pci layout. If you use this, don't use BAR0. >>> BAR2: virtio-cfg. If you use this, don't use BAR0. > > Why use two bars for this? You can put them into one mmio bar, together > with the msi-x vector table and PBA. Of course a pci capability > describing the location is helpful for that ;) You don't need a capability. You can also just add a "config offset" field to the register set and then make the semantics that it occurs in the same region. > >>> BAR3: ISR. If you use this, don't use BAR0. > > Again, I wouldn't hardcode that but use a capability. > >>> I prefer the cases exclusive (ie. use one or the other) as a clear path >>> to remove the legacy layout; and leaving the ISR in BAR0 leaves us with >>> an ugly corner case in future (ISR is BAR0 + 19? WTF?). > > Ok, so we have four register sets: > > (1) legacy layout > (2) new virtio-pci > (3) new virtio-config > (4) new virtio-isr > > We can have a vendor pci capability, with a dword for each register set: > > bit 31 -- present bit > bits 26-24 -- bar > bits 23-0 -- offset > > So current drivers which must support legacy can use this: > > legacy layout -- present, bar 0, offset 0 > new virtio-pci -- present, bar 1, offset 0 > new virtio-config -- present, bar 1, offset 256 > new virtio-isr -- present, bar 0, offset 19 > > [ For completeness: msi-x capability could add this: ] > > msi-x vector table bar 1, offset 512 > msi-x pba bar 1, offset 768 > >> We'll never remove legacy so we shouldn't plan on it. There are >> literally hundreds of thousands of VMs out there with the current virtio >> drivers installed in them. We'll be supporting them for a very, very >> long time :-) > > But new devices (virtio-qxl being a candidate) don't have old guests and > don't need to worry. > > They could use this if they care about fast isr: > > legacy layout -- not present > new virtio-pci -- present, bar 1, offset 0 > new virtio-config -- present, bar 1, offset 256 > new virtio-isr -- present, bar 0, offset 0 > > Or this if they don't worry about isr performance: > > legacy layout -- not present > new virtio-pci -- present, bar 0, offset 0 > new virtio-config -- present, bar 0, offset 256 > new virtio-isr -- not present > >> I don't think we gain a lot by moving the ISR into a separate BAR. >> Splitting up registers like that seems weird to me too. > > Main advantage of defining a register set with just isr is that it > reduces pio address space consumtion for new virtio devices which don't > have to worry about the legacy layout (8 bytes which is minimum size for > io bars instead of 64 bytes). Doing some rough math, we should have at least 16k of PIO space. That let's us have well over 500 virtio-pci devices with the current register layout. I don't think we're at risk of running out of space... >> If we added an additional constraints that BAR1 was mirrored except for > > Why add constraints? We want something future-proof, don't we? 
> >>> The detection is simple: if BAR1 has non-zero length, it's new-style, >>> otherwise legacy. > > Doesn't fly. BAR1 is in use today for MSI-X support. But the location is specified via capabilities so we can change the location to be within BAR1 at a non-conflicting offset. >> I agree that this is the best way to extend, but I think we should still >> use a transport feature bit. We want to be able to detect within QEMU >> whether a guest is using these new features because we need to adjust >> migration state accordingly. > > Why does migration need adjustments? Because there is additional state in the "new" layout. We need to understand whether a guest is relying on that state or not. For instance, extended virtio features. If a guest is in the process of reading extended virtio features, it may not have changed any state but we must ensure that we don't migrate to an older verison of QEMU w/o the extended virtio features. This cannot be handled by subsections today because there is no guest written state that's affected. Regards, Anthony Liguori > > [ Not that I want veto a feature bit, but I don't see the need yet ] > > cheers, > Gerd > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-09 15:26 ` Anthony Liguori @ 2012-10-09 20:24 ` Gerd Hoffmann 2012-10-10 2:54 ` Rusty Russell 0 siblings, 1 reply; 39+ messages in thread From: Gerd Hoffmann @ 2012-10-09 20:24 UTC (permalink / raw) To: Anthony Liguori Cc: Rusty Russell, virtualization, qemu-devel, kvm, Michael S. Tsirkin Hi, >> Why use two bars for this? You can put them into one mmio bar, together >> with the msi-x vector table and PBA. Of course a pci capability >> describing the location is helpful for that ;) > > You don't need a capability. You can also just add a "config offset" > field to the register set and then make the semantics that it occurs in > the same region. Yes, that will work too. Removes some freedom to place the register ranges, but given that we don't want burn bars and thus prefer to place everything into a single mmio bar that shouldn't be an issue. Real hardware does this too btw (xhci for example). >> Main advantage of defining a register set with just isr is that it >> reduces pio address space consumtion for new virtio devices which don't >> have to worry about the legacy layout (8 bytes which is minimum size for >> io bars instead of 64 bytes). > > Doing some rough math, we should have at least 16k of PIO space. That > let's us have well over 500 virtio-pci devices with the current register > layout. I've seen worries nevertheless, but given we have virtio-scsi now which can handle lots of disks without needing lots of virtio-pci devices it is probably not that a big issue any more. >>>> The detection is simple: if BAR1 has non-zero length, it's new-style, >>>> otherwise legacy. >> >> Doesn't fly. BAR1 is in use today for MSI-X support. > > But the location is specified via capabilities so we can change the > location to be within BAR1 at a non-conflicting offset. Sure. Nevertheless "BAR1 has non-zero length" can't be used to detect new-style virtio as old-style devices already have BAR1 with a non-zero length. So how about this: (1) Add a vendor specific pci capability for new-style virtio. Specifies the pci bar used for new-style virtio registers. Guests can use it to figure whenever new-style virtio is supported and to map the correct bar (which will probably be bar 1 in most cases). (2) Have virtio-offsets register set, located at the new-style bar, offset 0: struct virtio_offsets { __le32 num_offsets; // so we can extend this later __le32 virtio_pci_offset; // location of virtio-pci registers __le32 virtio_cfg_offset; // location of virtio config space }; (3) place virtio-pci registers (same as 0..23 in bar 0) in the new-style bar, offset virtio_pci_offset (4) place virtio config space (same as 20+/24+ in bar 0) in the new-style bar, offset virtio_cfg_offset cheers, Gerd ^ permalink raw reply [flat|nested] 39+ messages in thread
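To make the discovery sequence concrete: once a guest has found the vendor capability and mapped the new-style BAR, it could derive the two register regions roughly like this. This is only a sketch; the helper and function names are invented, and it assumes num_offsets counts the offset fields that follow it (field order as in struct virtio_offsets above):

#include <stdint.h>
#include <stddef.h>

static inline uint32_t read_le32(const volatile uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* 'bar' is the mapped new-style BAR, with struct virtio_offsets at offset 0. */
static void virtio_locate_regions(volatile uint8_t *bar,
				  volatile uint8_t **pci_regs,
				  volatile uint8_t **cfg_space)
{
	uint32_t n = read_le32(bar);		/* num_offsets */

	*pci_regs  = (n >= 1) ? bar + read_le32(bar + 4) : NULL;
	*cfg_space = (n >= 2) ? bar + read_le32(bar + 8) : NULL;
}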
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-09 20:24 ` Gerd Hoffmann @ 2012-10-10 2:54 ` Rusty Russell 2012-10-10 13:36 ` Anthony Liguori 0 siblings, 1 reply; 39+ messages in thread From: Rusty Russell @ 2012-10-10 2:54 UTC (permalink / raw) To: Gerd Hoffmann, Anthony Liguori Cc: virtualization, qemu-devel, kvm, Michael S. Tsirkin Gerd Hoffmann <kraxel@redhat.com> writes: > So how about this: > > (1) Add a vendor specific pci capability for new-style virtio. > Specifies the pci bar used for new-style virtio registers. > Guests can use it to figure whenever new-style virtio is > supported and to map the correct bar (which will probably > be bar 1 in most cases). This was closer to the original proposal[1], which I really liked (you can lay out bars however you want). Anthony thought that vendor capabilities were a PCI-e feature, but it seems they're blessed in PCI 2.3. So let's return to that proposal, giving something like this:

/* IDs for different capabilities. Must all exist. */
/* FIXME: Do we win from separating ISR, NOTIFY and COMMON? */
/* Common configuration */
#define VIRTIO_PCI_CAP_COMMON_CFG 1
/* Notifications */
#define VIRTIO_PCI_CAP_NOTIFY_CFG 2
/* ISR access */
#define VIRTIO_PCI_CAP_ISR_CFG 3
/* Device specific configuration */
#define VIRTIO_PCI_CAP_DEVICE_CFG 4

/* This is the PCI capability header: */
struct virtio_pci_cap {
	u8 cap_vndr;    /* Generic PCI field: PCI_CAP_ID_VNDR */
	u8 cap_next;    /* Generic PCI field: next ptr. */
	u8 cap_len;     /* Generic PCI field: sizeof(struct virtio_pci_cap). */
	u8 cfg_type;    /* One of the VIRTIO_PCI_CAP_*_CFG. */
	u8 bar;         /* Where to find it. */
	u8 unused;
	__le16 offset;  /* Offset within bar. */
	__le32 length;  /* Length. */
};

This means qemu can point the isr_cfg into the legacy area if it wants. In fact, it can put everything in BAR0 if it wants. Thoughts? Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
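To show how the lookup itself might look on the guest side, here is a rough, self-contained sketch that scans a raw 256-byte copy of the device's PCI configuration space for a vendor capability with the wanted cfg_type. The standard PCI offsets (status register, capability list pointer, vendor capability ID 0x09) are per the PCI spec; the function name itself is made up:

#include <stdint.h>

#define PCI_STATUS           0x06
#define PCI_STATUS_CAP_LIST  0x10   /* status bit 4: capability list valid */
#define PCI_CAPABILITY_LIST  0x34
#define PCI_CAP_ID_VNDR      0x09

/* Find the vendor capability whose cfg_type matches 'want' and return its
 * bar/offset/length fields (layout as in struct virtio_pci_cap above). */
static int virtio_pci_find_cap(const uint8_t cfg[256], uint8_t want,
			       uint8_t *bar, uint16_t *off, uint32_t *len)
{
	uint16_t status = cfg[PCI_STATUS] | (cfg[PCI_STATUS + 1] << 8);
	unsigned pos, ttl = 48;		/* guard against malformed cap lists */

	if (!(status & PCI_STATUS_CAP_LIST))
		return -1;		/* no capability list at all */

	for (pos = cfg[PCI_CAPABILITY_LIST] & ~3u; pos && ttl--;
	     pos = cfg[pos + 1] & ~3u) {
		if (pos > 256 - 12)
			return -1;	/* capability would run off the end */
		if (cfg[pos] != PCI_CAP_ID_VNDR || cfg[pos + 3] != want)
			continue;
		*bar = cfg[pos + 4];
		*off = cfg[pos + 6] | (cfg[pos + 7] << 8);
		*len = (uint32_t)cfg[pos + 8] | ((uint32_t)cfg[pos + 9] << 8) |
		       ((uint32_t)cfg[pos + 10] << 16) | ((uint32_t)cfg[pos + 11] << 24);
		return 0;
	}
	return -1;
}

A driver would call this once per VIRTIO_PCI_CAP_*_CFG type and map the indicated BAR region, which is exactly the layout freedom this proposal is meant to give the device.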
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-10 2:54 ` Rusty Russell @ 2012-10-10 13:36 ` Anthony Liguori 2012-10-10 13:41 ` Michael S. Tsirkin 0 siblings, 1 reply; 39+ messages in thread From: Anthony Liguori @ 2012-10-10 13:36 UTC (permalink / raw) To: Rusty Russell, Gerd Hoffmann Cc: virtualization, qemu-devel, kvm, Michael S. Tsirkin Rusty Russell <rusty@rustcorp.com.au> writes: > Gerd Hoffmann <kraxel@redhat.com> writes: >> So how about this: >> >> (1) Add a vendor specific pci capability for new-style virtio. >> Specifies the pci bar used for new-style virtio registers. >> Guests can use it to figure whenever new-style virtio is >> supported and to map the correct bar (which will probably >> be bar 1 in most cases). > > This was closer to the original proposal[1], which I really liked (you > can layout bars however you want). Anthony thought that vendor > capabilities were a PCI-e feature, but it seems they're blessed in PCI > 2.3. 2.3 was standardized in 2002. Are we confident that vendor extensions play nice with pre-2.3 OSes like Win2k, WinXP, etc? I still think it's a bad idea to rely on something so "new" in something as fundamental as virtio-pci unless we have to. Regards, Anthony Liguori > > So let's return to that proposal, giving something like this: > > /* IDs for different capabilities. Must all exist. */ > /* FIXME: Do we win from separating ISR, NOTIFY and COMMON? */ > /* Common configuration */ > #define VIRTIO_PCI_CAP_COMMON_CFG 1 > /* Notifications */ > #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 > /* ISR access */ > #define VIRTIO_PCI_CAP_ISR_CFG 3 > /* Device specific confiuration */ > #define VIRTIO_PCI_CAP_DEVICE_CFG 4 > > /* This is the PCI capability header: */ > struct virtio_pci_cap { > u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > u8 cap_next; /* Generic PCI field: next ptr. */ > u8 cap_len; /* Generic PCI field: sizeof(struct virtio_pci_cap). */ > u8 cfg_type; /* One of the VIRTIO_PCI_CAP_*_CFG. */ > u8 bar; /* Where to find it. */ > u8 unused; > __le16 offset; /* Offset within bar. */ > __le32 length; /* Length. */ > }; > > This means qemu can point the isr_cfg into the legacy area if it wants. > In fact, it can put everything in BAR0 if it wants. > > Thoughts? > Rusty. > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-10 13:36 ` Anthony Liguori @ 2012-10-10 13:41 ` Michael S. Tsirkin 2012-10-11 0:43 ` Rusty Russell 0 siblings, 1 reply; 39+ messages in thread From: Michael S. Tsirkin @ 2012-10-10 13:41 UTC (permalink / raw) To: Anthony Liguori Cc: Rusty Russell, virtualization, Gerd Hoffmann, kvm, qemu-devel On Wed, Oct 10, 2012 at 08:36:12AM -0500, Anthony Liguori wrote: > Rusty Russell <rusty@rustcorp.com.au> writes: > > > Gerd Hoffmann <kraxel@redhat.com> writes: > >> So how about this: > >> > >> (1) Add a vendor specific pci capability for new-style virtio. > >> Specifies the pci bar used for new-style virtio registers. > >> Guests can use it to figure whenever new-style virtio is > >> supported and to map the correct bar (which will probably > >> be bar 1 in most cases). > > > > This was closer to the original proposal[1], which I really liked (you > > can layout bars however you want). Anthony thought that vendor > > capabilities were a PCI-e feature, but it seems they're blessed in PCI > > 2.3. > > 2.3 was standardized in 2002. Are we confident that vendor extensions > play nice with pre-2.3 OSes like Win2k, WinXP, etc? > > I still think it's a bad idea to rely on something so "new" in something > as fundamental as virtio-pci unless we have to. > > Regards, > > Anthony Liguori Pre 2.3 it was the case that *all* space outside the capability linked list, and any capability not recognized by space, was considered vendor specific. So there's no problem really. > > > > So let's return to that proposal, giving something like this: > > > > /* IDs for different capabilities. Must all exist. */ > > /* FIXME: Do we win from separating ISR, NOTIFY and COMMON? */ > > /* Common configuration */ > > #define VIRTIO_PCI_CAP_COMMON_CFG 1 > > /* Notifications */ > > #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 > > /* ISR access */ > > #define VIRTIO_PCI_CAP_ISR_CFG 3 > > /* Device specific confiuration */ > > #define VIRTIO_PCI_CAP_DEVICE_CFG 4 > > > > /* This is the PCI capability header: */ > > struct virtio_pci_cap { > > u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > > u8 cap_next; /* Generic PCI field: next ptr. */ > > u8 cap_len; /* Generic PCI field: sizeof(struct virtio_pci_cap). */ > > u8 cfg_type; /* One of the VIRTIO_PCI_CAP_*_CFG. */ > > u8 bar; /* Where to find it. */ > > u8 unused; > > __le16 offset; /* Offset within bar. */ > > __le32 length; /* Length. */ > > }; > > > > This means qemu can point the isr_cfg into the legacy area if it wants. > > In fact, it can put everything in BAR0 if it wants. > > > > Thoughts? > > Rusty. > > -- > > To unsubscribe from this list: send the line "unsubscribe kvm" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-10 13:41 ` Michael S. Tsirkin @ 2012-10-11 0:43 ` Rusty Russell 0 siblings, 0 replies; 39+ messages in thread From: Rusty Russell @ 2012-10-11 0:43 UTC (permalink / raw) To: Michael S. Tsirkin, Anthony Liguori Cc: virtualization, Gerd Hoffmann, kvm, qemu-devel "Michael S. Tsirkin" <mst@redhat.com> writes: > On Wed, Oct 10, 2012 at 08:36:12AM -0500, Anthony Liguori wrote: >> Rusty Russell <rusty@rustcorp.com.au> writes: >> >> > Gerd Hoffmann <kraxel@redhat.com> writes: >> >> So how about this: >> >> >> >> (1) Add a vendor specific pci capability for new-style virtio. >> >> Specifies the pci bar used for new-style virtio registers. >> >> Guests can use it to figure whenever new-style virtio is >> >> supported and to map the correct bar (which will probably >> >> be bar 1 in most cases). >> > >> > This was closer to the original proposal[1], which I really liked (you >> > can layout bars however you want). Anthony thought that vendor >> > capabilities were a PCI-e feature, but it seems they're blessed in PCI >> > 2.3. >> >> 2.3 was standardized in 2002. Are we confident that vendor extensions >> play nice with pre-2.3 OSes like Win2k, WinXP, etc? 2.2 (1998) had the capability IDs linked list, indicated by bit 4 in the status register, but reserved ids 7 and above. ID 9 (vendor specific) was added in 2.3; it should be ignored, but will require testing of course, like any change. 2.1 didn't have the capability ID linked list at all; bit 4 in the status register was reserved. QEMU's pci.c has capability support, and sets the capabiliy status bit and list pointer when the driver requests (eg. the eepro100). > Pre 2.3 it was the case that *all* space outside > the capability linked list, and any capability not > recognized by space, was considered vendor specific. > So there's no problem really. Oh in theory, sure. In practice, almost certainly. But this needs to be proved with actual testing. Cheers, Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-08 23:56 ` Rusty Russell 2012-10-09 1:51 ` Anthony Liguori @ 2012-10-10 8:34 ` Michael S. Tsirkin 1 sibling, 0 replies; 39+ messages in thread From: Michael S. Tsirkin @ 2012-10-10 8:34 UTC (permalink / raw) To: Rusty Russell Cc: Anthony Liguori, virtualization, Gerd Hoffmann, kvm, qemu-devel On Tue, Oct 09, 2012 at 10:26:12AM +1030, Rusty Russell wrote: > Anthony Liguori <aliguori@us.ibm.com> writes: > > Gerd Hoffmann <kraxel@redhat.com> writes: > > > >> Hi, > >> > >>>> So we could have for virtio something like this: > >>>> > >>>> Capabilities: [??] virtio-regs: > >>>> legacy: BAR=0 offset=0 > >>>> virtio-pci: BAR=1 offset=1000 > >>>> virtio-cfg: BAR=1 offset=1800 > >>> > >>> This would be a vendor specific PCI capability so lspci wouldn't > >>> automatically know how to parse it. > >> > >> Sure, would need a patch to actually parse+print the cap, > >> /me was just trying to make my point clear in a simple way. > >> > >>>>>> 2) ISTR an argument about mapping the ISR register separately, for > >>>>>> performance, but I can't find a reference to it. > >>>>> > >>>>> I think the rationale is that ISR really needs to be PIO but everything > >>>>> else doesn't. PIO is much faster on x86 because it doesn't require > >>>>> walking page tables or instruction emulation to handle the exit. > >>>> > >>>> Is this still a pressing issue? With MSI-X enabled ISR isn't needed, > >>>> correct? Which would imply that pretty much only old guests without > >>>> MSI-X support need this, and we don't need to worry that much when > >>>> designing something new ... > >>> > >>> It wasn't that long ago that MSI-X wasn't supported.. I think we should > >>> continue to keep ISR as PIO as it is a fast path. > >> > >> No problem if we allow to have both legacy layout and new layout at the > >> same time. Guests can continue to use ISR @ BAR0 in PIO space for > >> existing virtio devices, even in case they want use mmio for other > >> registers -> all fine. > >> > >> New virtio devices can support MSI-X from day one and decide to not > >> expose a legacy layout PIO bar. > > > > I think having BAR1 be an MMIO mirror of the registers + a BAR2 for > > virtio configuration space is probably not that bad of a solution. > > Well, we also want to clean up the registers, so how about: > > BAR0: legacy, as is. If you access this, don't use the others. > BAR1: new format virtio-pci layout. If you use this, don't use BAR0. > BAR2: virtio-cfg. If you use this, don't use BAR0. > BAR3: ISR. If you use this, don't use BAR0. One problem here is there are only 3 64-bit BARs under PCI. There's no point to make IO BAR 64-bit on x86, but there might be for memory. > I prefer the cases exclusive (ie. use one or the other) as a clear path > to remove the legacy layout; and leaving the ISR in BAR0 leaves us with > an ugly corner case in future (ISR is BAR0 + 19? WTF?). > > As to MMIO vs PIO, the BARs are self-describing, so we should explicitly > endorse that and leave it to the devices. > > The detection is simple: if BAR1 has non-zero length, it's new-style, > otherwise legacy. > > Thoughts? > Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-08 2:21 ` [Qemu-devel] Using PCI config space to indicate config location Rusty Russell 2012-10-08 13:58 ` Anthony Liguori @ 2012-10-10 8:30 ` Michael S. Tsirkin 2012-10-11 1:18 ` Rusty Russell 1 sibling, 1 reply; 39+ messages in thread From: Michael S. Tsirkin @ 2012-10-10 8:30 UTC (permalink / raw) To: Rusty Russell; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization On Mon, Oct 08, 2012 at 12:51:25PM +1030, Rusty Russell wrote: > (Topic updated, cc's trimmed). > > Anthony Liguori <aliguori@us.ibm.com> writes: > > Rusty Russell <rusty@rustcorp.com.au> writes: > >> 4) The only significant change to the spec is that we use PCI > >> capabilities, so we can have infinite feature bits. > >> (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html) > > > > We discussed this on IRC last night. I don't think PCI capabilites are > > a good mechanism to use... > > > > PCI capabilities are there to organize how the PCI config space is > > allocated to allow vendor extensions to co-exist with future PCI > > extensions. > > > > But we've never used the PCI config space within virtio-pci. We do > > everything in BAR0. I don't think there's any real advantage of using > > the config space vs. a BAR for virtio-pci. > > Note before anyone gets confused; we were talking about using the PCI > config space to indicate what BAR(s) the virtio stuff is in. An > alternative would be to simply specify a new layout format in BAR1. One problem we are still left with is this: device specific config accesses are still non atomic. This is a problem for multibyte fields such as MAC address where MAC could change while we are accessing it. I was thinking about some backwards compatible way to solve this, but if we are willing to break compatiblity or use some mode switch, how about we give up on virtio config space completely, and do everything besides IO and ISR through guest memory? -- MST ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-10 8:30 ` Michael S. Tsirkin @ 2012-10-11 1:18 ` Rusty Russell 2012-10-11 10:23 ` Michael S. Tsirkin 0 siblings, 1 reply; 39+ messages in thread From: Rusty Russell @ 2012-10-11 1:18 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization "Michael S. Tsirkin" <mst@redhat.com> writes: > On Mon, Oct 08, 2012 at 12:51:25PM +1030, Rusty Russell wrote: >> Note before anyone gets confused; we were talking about using the PCI >> config space to indicate what BAR(s) the virtio stuff is in. An >> alternative would be to simply specify a new layout format in BAR1. > > One problem we are still left with is this: device specific > config accesses are still non atomic. > This is a problem for multibyte fields such as MAC address > where MAC could change while we are accessing it. It's also a problem for related fields, eg. console width and height, or disk geometry. > I was thinking about some backwards compatible way to solve this, but if > we are willing to break compatiblity or use some mode switch, how about > we give up on virtio config space completely, and do everything besides > IO and ISR through guest memory? I think there's still a benefit in the simple publishing of information: I don't really want to add a control queue for the console. But inevitably, once-static information can change in later versions, and it's horrible to have config information plus a bit that says "don't use this, use the control queue". Here's a table from a quick audit: Driver Config Device changes Driver writes... after init? net Y Y N N block Y Y Y Y console Y Y N N rng N N N N balloon Y Y Y Y scsi Y N Y N 9p Y N N N For config space reads, I suggest the driver publish a generation count. For writes, the standard seems to be a commit latch. We could abuse the generation count for this: the driver writes to it to commit config changes. ie: /* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */ struct virtio_pci_common_cfg { /* About the whole device. */ __le32 device_feature_select; /* read-write */ __le32 device_feature; /* read-only */ __le32 guest_feature_select; /* read-write */ __le32 guest_feature; /* read-only */ __le32 config_gen_and_latch; /* read-write */ __le16 msix_config; /* read-write */ __u8 device_status; /* read-write */ __u8 unused; /* About a specific virtqueue. */ __le16 queue_select; /* read-write */ __le16 queue_align; /* read-write, power of 2. */ __le16 queue_size; /* read-write, power of 2. */ __le16 queue_msix_vector;/* read-write */ __le64 queue_address; /* read-write: 0xFFFFFFFFFFFFFFFF == DNE. */ }; Thoughts? Rusty. PS. Let's make all the virtio-device config LE, too... ^ permalink raw reply [flat|nested] 39+ messages in thread
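As an illustration of how a driver might use the proposed generation count to avoid torn reads of multi-byte or related fields (MAC, console width/height, disk geometry), a minimal sketch follows. Plain volatile loads stand in for real MMIO accessors, and only the read side of config_gen_and_latch is used:

#include <stdint.h>
#include <stddef.h>

/* Copy 'len' bytes of device config, retrying until the device's
 * generation count reads the same before and after the copy. */
static void virtio_config_read_stable(void *dst,
				      const volatile uint8_t *cfg,
				      size_t off, size_t len,
				      const volatile uint32_t *config_gen)
{
	uint8_t *out = dst;
	uint32_t before, after;
	size_t i;

	do {
		before = *config_gen;
		for (i = 0; i < len; i++)
			out[i] = cfg[off + i];
		after = *config_gen;
	} while (before != after);	/* config changed mid-read: try again */
}

Provided the device bumps the count whenever it changes config, a 6-byte MAC or a width/height pair read this way can never be observed half-updated.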
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-11 1:18 ` Rusty Russell @ 2012-10-11 10:23 ` Michael S. Tsirkin 2012-10-11 22:29 ` Rusty Russell 0 siblings, 1 reply; 39+ messages in thread From: Michael S. Tsirkin @ 2012-10-11 10:23 UTC (permalink / raw) To: Rusty Russell; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization On Thu, Oct 11, 2012 at 11:48:22AM +1030, Rusty Russell wrote: > "Michael S. Tsirkin" <mst@redhat.com> writes: > > On Mon, Oct 08, 2012 at 12:51:25PM +1030, Rusty Russell wrote: > >> Note before anyone gets confused; we were talking about using the PCI > >> config space to indicate what BAR(s) the virtio stuff is in. An > >> alternative would be to simply specify a new layout format in BAR1. > > > > One problem we are still left with is this: device specific > > config accesses are still non atomic. > > This is a problem for multibyte fields such as MAC address > > where MAC could change while we are accessing it. > > It's also a problem for related fields, eg. console width and height, or > disk geometry. > > > I was thinking about some backwards compatible way to solve this, but if > > we are willing to break compatiblity or use some mode switch, how about > > we give up on virtio config space completely, and do everything besides > > IO and ISR through guest memory? > > I think there's still a benefit in the simple publishing of information: > I don't really want to add a control queue for the console. One reason I thought using a vq is handy is because this would let us get by with a single MSI vector. Currently we need at least 2 for config changes + a shared one for vqs. But I don't insist. > But > inevitably, once-static information can change in later versions, and > it's horrible to have config information plus a bit that says "don't use > this, use the control queue". > > Here's a table from a quick audit: > > Driver Config Device changes Driver writes... after init? > net Y Y N N > block Y Y Y Y > console Y Y N N > rng N N N N > balloon Y Y Y Y > scsi Y N Y N > 9p Y N N N > > For config space reads, I suggest the driver publish a generation count. You mean device? > For writes, the standard seems to be a commit latch. We could abuse the > generation count for this: the driver writes to it to commit config > changes. I think this will work. There are a couple of things that bother me: This assumes read accesses have no side effects, and these are sometimes handy. Also the semantics for write aren't very clear to me. I guess device must buffer data until generation count write? This assumes the device has a buffer to store writes, and it must track each byte written. I kind of dislike this tracking of accessed bytes. Also, device would need to resolve conflicts if any in some device specific way. Maybe it's a good idea to make the buffer accesses explicit instead? IOW require driver to declare intent to read/request write of a specific config range. We could for example do it like this: __le32 config_offset; __le32 config_len; __u8 config_cmd; /* write-only: 0 - read, 1 - write config_len bytes at config_offset from/to config space to/from device memory */ But maybe this is over-engineering? > ie: > /* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */ > struct virtio_pci_common_cfg { > /* About the whole device. 
*/ > __le32 device_feature_select; /* read-write */ > __le32 device_feature; /* read-only */ > __le32 guest_feature_select; /* read-write */ > __le32 guest_feature; /* read-only */ > __le32 config_gen_and_latch; /* read-write */ > __le16 msix_config; /* read-write */ > __u8 device_status; /* read-write */ > __u8 unused; > > /* About a specific virtqueue. */ > __le16 queue_select; /* read-write */ > __le16 queue_align; /* read-write, power of 2. */ > __le16 queue_size; /* read-write, power of 2. */ > __le16 queue_msix_vector;/* read-write */ > __le64 queue_address; /* read-write: 0xFFFFFFFFFFFFFFFF == DNE. */ > }; > > Thoughts? > Rusty. > PS. Let's make all the virtio-device config LE, too... We'll need some new API for devices then: currently we pass bytes. -- MST ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-11 10:23 ` Michael S. Tsirkin @ 2012-10-11 22:29 ` Rusty Russell 2012-10-12 9:33 ` Michael S. Tsirkin 0 siblings, 1 reply; 39+ messages in thread From: Rusty Russell @ 2012-10-11 22:29 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization "Michael S. Tsirkin" <mst@redhat.com> writes: > On Thu, Oct 11, 2012 at 11:48:22AM +1030, Rusty Russell wrote: >> "Michael S. Tsirkin" <mst@redhat.com> writes: >> > On Mon, Oct 08, 2012 at 12:51:25PM +1030, Rusty Russell wrote: >> >> Note before anyone gets confused; we were talking about using the PCI >> >> config space to indicate what BAR(s) the virtio stuff is in. An >> >> alternative would be to simply specify a new layout format in BAR1. >> > >> > One problem we are still left with is this: device specific >> > config accesses are still non atomic. >> > This is a problem for multibyte fields such as MAC address >> > where MAC could change while we are accessing it. >> >> It's also a problem for related fields, eg. console width and height, or >> disk geometry. >> >> > I was thinking about some backwards compatible way to solve this, but if >> > we are willing to break compatiblity or use some mode switch, how about >> > we give up on virtio config space completely, and do everything besides >> > IO and ISR through guest memory? >> >> I think there's still a benefit in the simple publishing of information: >> I don't really want to add a control queue for the console. > > One reason I thought using a vq is handy is because this would > let us get by with a single MSI vector. Currently we need at least 2 > for config changes + a shared one for vqs. > But I don't insist. Hmmm, that is true. >> Here's a table from a quick audit: >> >> Driver Config Device changes Driver writes... after init? >> net Y Y N N >> block Y Y Y Y >> console Y Y N N >> rng N N N N >> balloon Y Y Y Y >> scsi Y N Y N >> 9p Y N N N >> >> For config space reads, I suggest the driver publish a generation count. > > You mean device? Yes, sorry. >> For writes, the standard seems to be a commit latch. We could abuse the >> generation count for this: the driver writes to it to commit config >> changes. > > I think this will work. There are a couple of things that bother me: > > This assumes read accesses have no side effects, and these are sometimes handy. > Also the semantics for write aren't very clear to me. > I guess device must buffer data until generation count write? > This assumes the device has a buffer to store writes, > and it must track each byte written. I kind of dislike this > tracking of accessed bytes. Also, device would need to resolve conflicts > if any in some device specific way. It should be trivial to implement: you keep a scratch copy of the config space, and copy it to the master copy when they hit the latch. Implementation of this will show whether I've missed anything here, I think. > Maybe it's a good idea to make the buffer accesses explicit instead? > IOW require driver to declare intent to read/request write of a specific > config range. We could for example do it like this: > __le32 config_offset; > __le32 config_len; > __u8 config_cmd; /* write-only: 0 - read, 1 - write > config_len bytes at config_offset > from/to config space to/from device memory */ > But maybe this is over-engineering? Seems overengineering since the config space is quite small in practice. >> PS. Let's make all the virtio-device config LE, too... 
> > We'll need some new API for devices then: currently we pass bytes. Yes, but driver authors expect that anyway. We can retro-define virito-mmio to be LE (since all current users are), so this is v. tempting. Cheers, Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-11 22:29 ` Rusty Russell @ 2012-10-12 9:33 ` Michael S. Tsirkin 2012-10-12 9:51 ` Rusty Russell 0 siblings, 1 reply; 39+ messages in thread From: Michael S. Tsirkin @ 2012-10-12 9:33 UTC (permalink / raw) To: Rusty Russell; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote: > >> For writes, the standard seems to be a commit latch. We could abuse the > >> generation count for this: the driver writes to it to commit config > >> changes. > > > > I think this will work. There are a couple of things that bother me: > > > > This assumes read accesses have no side effects, and these are sometimes handy. > > Also the semantics for write aren't very clear to me. > > I guess device must buffer data until generation count write? > > This assumes the device has a buffer to store writes, > > and it must track each byte written. I kind of dislike this > > tracking of accessed bytes. Also, device would need to resolve conflicts > > if any in some device specific way. > > It should be trivial to implement: you keep a scratch copy of the config > space, and copy it to the master copy when they hit the latch. > > Implementation of this will show whether I've missed anything here, I > think. What I refer to: what happens if driver does: - write offset 1 - write offset 3 - hit commit latch ? -- MST ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-12 9:33 ` Michael S. Tsirkin @ 2012-10-12 9:51 ` Rusty Russell 2012-10-12 10:02 ` Michael S. Tsirkin 0 siblings, 1 reply; 39+ messages in thread From: Rusty Russell @ 2012-10-12 9:51 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization "Michael S. Tsirkin" <mst@redhat.com> writes: > On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote: >> >> For writes, the standard seems to be a commit latch. We could abuse the >> >> generation count for this: the driver writes to it to commit config >> >> changes. >> > >> > I think this will work. There are a couple of things that bother me: >> > >> > This assumes read accesses have no side effects, and these are sometimes handy. >> > Also the semantics for write aren't very clear to me. >> > I guess device must buffer data until generation count write? >> > This assumes the device has a buffer to store writes, >> > and it must track each byte written. I kind of dislike this >> > tracking of accessed bytes. Also, device would need to resolve conflicts >> > if any in some device specific way. >> >> It should be trivial to implement: you keep a scratch copy of the config >> space, and copy it to the master copy when they hit the latch. >> >> Implementation of this will show whether I've missed anything here, I >> think. > > What I refer to: what happens if driver does: > - write offset 1 > - write offset 3 > - hit commit latch - nothing - nothing - effect of offset 1 and offset 3 writes Now, since there's nothing published by the *driver* at the moment which can't be trivially atomically written, this scheme is overkill (sure, it means you could do a byte-at-a-time write to some 4-byte field, but why?). But perhaps it's overkill: no other bus has this feature, so we'd need a feature bit for them anyway in future if we create a device which needs such atomicity. Cheers, Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-12 9:51 ` Rusty Russell @ 2012-10-12 10:02 ` Michael S. Tsirkin 2012-10-16 13:15 ` Rusty Russell 0 siblings, 1 reply; 39+ messages in thread From: Michael S. Tsirkin @ 2012-10-12 10:02 UTC (permalink / raw) To: Rusty Russell; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization On Fri, Oct 12, 2012 at 08:21:50PM +1030, Rusty Russell wrote: > "Michael S. Tsirkin" <mst@redhat.com> writes: > > On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote: > >> >> For writes, the standard seems to be a commit latch. We could abuse the > >> >> generation count for this: the driver writes to it to commit config > >> >> changes. > >> > > >> > I think this will work. There are a couple of things that bother me: > >> > > >> > This assumes read accesses have no side effects, and these are sometimes handy. > >> > Also the semantics for write aren't very clear to me. > >> > I guess device must buffer data until generation count write? > >> > This assumes the device has a buffer to store writes, > >> > and it must track each byte written. I kind of dislike this > >> > tracking of accessed bytes. Also, device would need to resolve conflicts > >> > if any in some device specific way. > >> > >> It should be trivial to implement: you keep a scratch copy of the config > >> space, and copy it to the master copy when they hit the latch. > >> > >> Implementation of this will show whether I've missed anything here, I > >> think. > > > > What I refer to: what happens if driver does: > > - write offset 1 > > - write offset 3 > > - hit commit latch > > - nothing > - nothing > - effect of offset 1 and offset 3 writes OK so this means that you also need to track which bytes were written in order to know to skip byte 2. This is what I referred to. If instead we ask the driver to specify offset/length explicitly, the device only needs to remember that. Not a big deal anyway, just pointing this out. > Now, since there's nothing published by the *driver* at the moment > which can't be trivially atomically written, this scheme is overkill > (sure, it means you could do a byte-at-a-time write to some 4-byte > field, but why?). > > But perhaps it's overkill: no other bus has this feature, so we'd need a > feature bit for them anyway in future if we create a device which needs > such atomicity. > > Cheers, > Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-12 10:02 ` Michael S. Tsirkin @ 2012-10-16 13:15 ` Rusty Russell 2012-10-16 13:30 ` Michael S. Tsirkin 0 siblings, 1 reply; 39+ messages in thread From: Rusty Russell @ 2012-10-16 13:15 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization "Michael S. Tsirkin" <mst@redhat.com> writes: > On Fri, Oct 12, 2012 at 08:21:50PM +1030, Rusty Russell wrote: >> "Michael S. Tsirkin" <mst@redhat.com> writes: >> > On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote: >> >> >> For writes, the standard seems to be a commit latch. We could abuse the >> >> >> generation count for this: the driver writes to it to commit config >> >> >> changes. >> >> > >> >> > I think this will work. There are a couple of things that bother me: >> >> > >> >> > This assumes read accesses have no side effects, and these are sometimes handy. >> >> > Also the semantics for write aren't very clear to me. >> >> > I guess device must buffer data until generation count write? >> >> > This assumes the device has a buffer to store writes, >> >> > and it must track each byte written. I kind of dislike this >> >> > tracking of accessed bytes. Also, device would need to resolve conflicts >> >> > if any in some device specific way. >> >> >> >> It should be trivial to implement: you keep a scratch copy of the config >> >> space, and copy it to the master copy when they hit the latch. >> >> >> >> Implementation of this will show whether I've missed anything here, I >> >> think. >> > >> > What I refer to: what happens if driver does: >> > - write offset 1 >> > - write offset 3 >> > - hit commit latch >> >> - nothing >> - nothing >> - effect of offset 1 and offset 3 writes > > OK so this means that you also need to track which bytes where written > in order to know to skip byte 2. > This is what I referred to. If instead we ask driver to specify > offset/length explicitly device only needs to remember that. I was assuming the implementation would keep two complete copies of the config space: writes go to the scratch version, which gets copied to the master version upon latch write. But I do wonder if we should just skip this for now, since we don't have any immediate need. Cheers, Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
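One way the device side could implement the latch scheme while also covering the byte-tracking concern raised above is a scratch copy plus a per-byte dirty map, folded into the master copy when the latch is written. This is purely a sketch with illustrative names and sizes, not an implementation from QEMU or the spec:

#include <stdint.h>
#include <stdbool.h>

#define CFG_SIZE 256			/* illustrative device config size */

struct cfg_state {
	uint8_t  master[CFG_SIZE];	/* what reads return; host may update it */
	uint8_t  scratch[CFG_SIZE];	/* buffered driver writes */
	bool     dirty[CFG_SIZE];	/* which bytes the driver actually wrote */
	uint32_t generation;
};

static void cfg_write(struct cfg_state *s, unsigned off, uint8_t val)
{
	s->scratch[off] = val;
	s->dirty[off] = true;		/* writes to 1 and 3 leave byte 2 untouched */
}

static void cfg_latch(struct cfg_state *s)
{
	for (unsigned i = 0; i < CFG_SIZE; i++) {
		if (!s->dirty[i])
			continue;	/* host-updated bytes are not clobbered */
		s->master[i] = s->scratch[i];
		s->dirty[i] = false;
	}
	s->generation++;		/* doubles as the read-side change signal */
}

Skipping clean bytes is what keeps host-modifiable registers from being overwritten by a stale scratch copy.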
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-16 13:15 ` Rusty Russell @ 2012-10-16 13:30 ` Michael S. Tsirkin 2012-10-16 13:52 ` Rusty Russell 0 siblings, 1 reply; 39+ messages in thread From: Michael S. Tsirkin @ 2012-10-16 13:30 UTC (permalink / raw) To: Rusty Russell; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization On Tue, Oct 16, 2012 at 11:45:41PM +1030, Rusty Russell wrote: > "Michael S. Tsirkin" <mst@redhat.com> writes: > > On Fri, Oct 12, 2012 at 08:21:50PM +1030, Rusty Russell wrote: > >> "Michael S. Tsirkin" <mst@redhat.com> writes: > >> > On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote: > >> >> >> For writes, the standard seems to be a commit latch. We could abuse the > >> >> >> generation count for this: the driver writes to it to commit config > >> >> >> changes. > >> >> > > >> >> > I think this will work. There are a couple of things that bother me: > >> >> > > >> >> > This assumes read accesses have no side effects, and these are sometimes handy. > >> >> > Also the semantics for write aren't very clear to me. > >> >> > I guess device must buffer data until generation count write? > >> >> > This assumes the device has a buffer to store writes, > >> >> > and it must track each byte written. I kind of dislike this > >> >> > tracking of accessed bytes. Also, device would need to resolve conflicts > >> >> > if any in some device specific way. > >> >> > >> >> It should be trivial to implement: you keep a scratch copy of the config > >> >> space, and copy it to the master copy when they hit the latch. > >> >> > >> >> Implementation of this will show whether I've missed anything here, I > >> >> think. > >> > > >> > What I refer to: what happens if driver does: > >> > - write offset 1 > >> > - write offset 3 > >> > - hit commit latch > >> > >> - nothing > >> - nothing > >> - effect of offset 1 and offset 3 writes > > > > OK so this means that you also need to track which bytes where written > > in order to know to skip byte 2. > > This is what I referred to. If instead we ask driver to specify > > offset/length explicitly device only needs to remember that. > > I was assuming the implementation would keep two complete copies of the > config space: writes go to the scratch version, which gets copied to the > master version upon latch write. Yes but config space has some host modifiable registers too. So host needs to be careful to avoid overwriting these. If accesses have side effects that of course breaks too ... > But I do wonder if we should just skip this for now, since we don't > have any immediate need. > > Cheers, > Rusty. MAC setting from guest needs this right now, no? -- MST ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Using PCI config space to indicate config location 2012-10-16 13:30 ` Michael S. Tsirkin @ 2012-10-16 13:52 ` Rusty Russell 0 siblings, 0 replies; 39+ messages in thread From: Rusty Russell @ 2012-10-16 13:52 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Anthony Liguori, qemu-devel, kvm, virtualization "Michael S. Tsirkin" <mst@redhat.com> writes: > On Tue, Oct 16, 2012 at 11:45:41PM +1030, Rusty Russell wrote: >> "Michael S. Tsirkin" <mst@redhat.com> writes: >> I was assuming the implementation would keep two complete copies of the >> config space: writes go to the scratch version, which gets copied to the >> master version upon latch write. > > Yes but config space has some host modifiable registers too. > So host needs to be careful to avoid overwriting these. > > If accesses have side effects that of course breaks too ... Yes. > >> But I do wonder if we should just skip this for now, since we don't >> have any immediate need. >> >> Cheers, >> Rusty. > > MAC setting from guest needs this right now, no? Ah, I missed that in my table: Driver Config Device changes Driver writes... after init? net Y Y N N block Y Y Y Y console Y Y N N rng N N N N balloon Y Y Y Y scsi Y N Y N 9p Y N N N First line should be: net Y Y Y N So we could add a new cvq command (eg. #define VIRTIO_NET_CTRL_MAC_SET 1) and a VIRTIO_NET_F_CVQ_MAC_SET feature. Or go ahead with the latching scheme (which doesn't really help other busses). Cheers, Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
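Such a command would slot into the existing control-virtqueue format (class/command header, command-specific payload, then a device-writable ack byte). The layout below is only a sketch of that idea; the command value is the placeholder from the suggestion above and the feature bit number is deliberately left unallocated:

#include <stdint.h>

#define VIRTIO_NET_CTRL_MAC      1	/* existing MAC control class */
#define VIRTIO_NET_CTRL_MAC_SET  1	/* proposed: set the primary MAC */
/* Gated by a new feature bit, e.g. VIRTIO_NET_F_CVQ_MAC_SET (number TBD). */

struct virtio_net_ctrl_hdr {
	uint8_t class;
	uint8_t cmd;
} __attribute__((packed));

/* Shown as one contiguous buffer for illustration; in practice the driver
 * queues hdr + mac as device-readable descriptors and ack as device-writable,
 * with the device returning VIRTIO_NET_OK or VIRTIO_NET_ERR in ack. */
struct virtio_net_ctrl_mac_set {
	struct virtio_net_ctrl_hdr hdr;	/* { VIRTIO_NET_CTRL_MAC, VIRTIO_NET_CTRL_MAC_SET } */
	uint8_t mac[6];
	uint8_t ack;
} __attribute__((packed));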
* Re: [Qemu-devel] Proposal for virtio standardization. 2012-09-27 0:29 [Qemu-devel] Proposal for virtio standardization Rusty Russell 2012-10-04 18:49 ` Anthony Liguori @ 2012-10-09 14:02 ` Cornelia Huck 2012-10-10 3:46 ` Rusty Russell 1 sibling, 1 reply; 39+ messages in thread From: Cornelia Huck @ 2012-10-09 14:02 UTC (permalink / raw) To: Rusty Russell Cc: Ohad Ben-Cohen, Anthony Liguori, Avishay Traeger, kvm, Pawel Moll, Michael S. Tsirkin, qemu-devel, Jason Wang, LKML, virtualization, Avi Kivity, Adam Litke, Amit Shah, Paolo Bonzini, Sasha Levin On Thu, 27 Sep 2012 09:59:33 +0930 Rusty Russell <rusty@rustcorp.com.au> wrote: > Hi all, > > I've had several requests for a more formal approach to the > virtio draft spec, and (after some soul-searching) I'd like to try that. > > The proposal is to use OASIS as the standards body, as it's > fairly light-weight as these things go. For me this means paperwork and > setting up a Working Group and getting the right people involved as > Voting members starting with the current contributors; for most of you > it just means a new mailing list, though I'll be cross-posting any > drafts and major changes here anyway. > > I believe that a documented standard (aka virtio 1.0) will > increase visibility and adoption in areas outside our normal linux/kvm > universe. There's been some of that already, but this is the clearest > path to accelerate it. Not the easiest path, but I believe that a solid > I/O standard is a Good Thing for everyone. > > Yet I also want to decouple new and experimental development > from the standards effort; running code comes first. New feature bits > and new device numbers should be reservable without requiring a full > spec change. > > So the essence of my proposal is: > 1) I start a Working Group within OASIS where we can aim for virtio spec > 1.0. > > 2) The current spec is textually reordered so the core is clearly > bus-independent, with PCI, mmio, etc appendices. > > 3) Various clarifications, formalizations and cleanups to the spec text, > and possibly elimination of old deprecated features. > > 4) The only significant change to the spec is that we use PCI > capabilities, so we can have infinite feature bits. > (see > http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html) "Infinite" only applies to virtio-pci, no? > > 5) Changes to the ring layout and other such things are deferred to a > future virtio version; whether this is done within OASIS or > externally depends on how well this works for the 1.0 release. > > Thoughts? > Rusty. > Sounds like a good idea. I'll be happy to review the spec with an eye to virtio-ccw. Cornelia ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Proposal for virtio standardization. 2012-10-09 14:02 ` [Qemu-devel] Proposal for virtio standardization Cornelia Huck @ 2012-10-10 3:46 ` Rusty Russell 0 siblings, 0 replies; 39+ messages in thread From: Rusty Russell @ 2012-10-10 3:46 UTC (permalink / raw) To: Cornelia Huck Cc: Ohad Ben-Cohen, Anthony Liguori, Avishay Traeger, kvm, Pawel Moll, Michael S. Tsirkin, qemu-devel, Jason Wang, LKML, virtualization, Avi Kivity, Adam Litke, Amit Shah, Paolo Bonzini, Sasha Levin Cornelia Huck <cornelia.huck@de.ibm.com> writes: > On Thu, 27 Sep 2012 09:59:33 +0930 > Rusty Russell <rusty@rustcorp.com.au> wrote: >> 3) Various clarifications, formalizations and cleanups to the spec text, >> and possibly elimination of old deprecated features. >> >> 4) The only significant change to the spec is that we use PCI >> capabilities, so we can have infinite feature bits. >> (see >> http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html) > > "Infinite" only applies to virtio-pci, no? Yes, you already have "infinite" feature bits for ccw, as does every bus we did since PCI. > Sounds like a good idea. I'll be happy to review the spec with an eye > to virtio-ccw. Thanks! Rusty. ^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2012-10-16 13:53 UTC | newest] Thread overview: 39+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-09-27 0:29 [Qemu-devel] Proposal for virtio standardization Rusty Russell 2012-10-04 18:49 ` Anthony Liguori 2012-10-08 2:21 ` [Qemu-devel] Using PCI config space to indicate config location Rusty Russell 2012-10-08 13:58 ` Anthony Liguori 2012-10-08 14:58 ` Gerd Hoffmann 2012-10-08 15:09 ` Anthony Liguori 2012-10-08 20:13 ` Gerd Hoffmann 2012-10-08 20:55 ` Anthony Liguori 2012-10-08 23:56 ` Rusty Russell 2012-10-09 1:51 ` Anthony Liguori 2012-10-09 3:16 ` Rusty Russell 2012-10-09 10:17 ` Avi Kivity 2012-10-09 14:03 ` Anthony Liguori 2012-10-09 13:56 ` Anthony Liguori 2012-10-10 3:44 ` Rusty Russell 2012-10-10 11:37 ` Michael S. Tsirkin 2012-10-09 21:09 ` Jamie Lokier 2012-10-10 3:44 ` Rusty Russell 2012-10-11 0:08 ` Rusty Russell 2012-10-09 6:33 ` Gerd Hoffmann 2012-10-09 15:26 ` Anthony Liguori 2012-10-09 20:24 ` Gerd Hoffmann 2012-10-10 2:54 ` Rusty Russell 2012-10-10 13:36 ` Anthony Liguori 2012-10-10 13:41 ` Michael S. Tsirkin 2012-10-11 0:43 ` Rusty Russell 2012-10-10 8:34 ` Michael S. Tsirkin 2012-10-10 8:30 ` Michael S. Tsirkin 2012-10-11 1:18 ` Rusty Russell 2012-10-11 10:23 ` Michael S. Tsirkin 2012-10-11 22:29 ` Rusty Russell 2012-10-12 9:33 ` Michael S. Tsirkin 2012-10-12 9:51 ` Rusty Russell 2012-10-12 10:02 ` Michael S. Tsirkin 2012-10-16 13:15 ` Rusty Russell 2012-10-16 13:30 ` Michael S. Tsirkin 2012-10-16 13:52 ` Rusty Russell 2012-10-09 14:02 ` [Qemu-devel] Proposal for virtio standardization Cornelia Huck 2012-10-10 3:46 ` Rusty Russell