From mboxrd@z Thu Jan  1 00:00:00 1970
From: Scott Wood <scottwood@freescale.com>
Subject: Re: [RFC PATCH 1/6] kvm: add device control API
Date: Thu, 21 Feb 2013 20:17:54 -0600
Message-ID: <1361499474.1574.9@snotra>
References: <20130221082209.GX3600@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1;
	delsp=Yes	format=Flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Alexander Graf <agraf@suse.de>, <kvm-ppc@vger.kernel.org>,
	<kvm@vger.kernel.org>, Christoffer Dall <cdall@cs.columbia.edu>,
	Paul Mackerras <paulus@samba.org>
To: Gleb Natapov <gleb@redhat.com>
Return-path: <kvm-ppc-owner@vger.kernel.org>
In-Reply-To: <20130221082209.GX3600@redhat.com> (from gleb@redhat.com on Thu
	Feb 21 02:22:09 2013)
Content-Disposition: inline
Sender: kvm-ppc-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

  On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
> >  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> > >On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > >> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> >> The ability to set/get attributes is needed.  Sorry, but "get
> > >or set
> > >> >> one blob of data, up to 512 bytes, for the entire irqchip" is
> > >just
> > >> >> not good enough -- assuming you don't want us to start =20
> sticking
> > >> >> pointers and commands in *that* data. :-)
> > >> >>
> > >> >Proposed interface sticks pointers into ioctl data, so why doin=
g
> > >> >the same
> > >> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> > >>
> > >> There's a difference between putting a pointer in an ioctl =20
> control
> > >> structure that is specifically documented as being that way (as =
=20
> in
> > >> ONE_REG), versus taking an ioctl that claims to be =20
> setting/getting a
> > >> blob of state and embedding pointers in it.  It would be like
> > >> sticking a pointer in the attribute payload of this API, which I
> > >> think is something to be discouraged.
> > >If documentation is what differentiate for you between silly and =20
> smart
> > >then write documentation instead of new interfaces.
> >
> > You mean like what we did with SREGS, that got deprecated and
> > replaced with ONE_REG?
> >
> SREGS is implemented by ppc. I see ONE_REG as addition to REGS
> interface. You can access all register at once or you can access them
> one by one. If there is a need we can add MULTIPLE_REGS that will get
> list of requested REGS.

http://www.spinics.net/lists/kvm-ppc/msg04876.html
http://www.spinics.net/lists/kvm-ppc/msg05842.html

> The interface is not over generic i.e it does
> not try to replace KVM_RUN by writing special register.

Sigh.

> >                                                    And we
> > don't necessarily want to set *everything*.
> What are those cases? You do need to on reset/migration.

Why do we want to set all the registers on reset, rather than tell the =
=20
in-kernel device to reset?  The default state came from the kernel in =20
the first place on irqchip creation...

> > >>                                        It'd also be using
> > >> KVM_SET_IRQCHIP to read data, which is the sort of thing you =20
> object
> > >> to later on regarding KVM_IRQ_LINE_STATUS.
> > >>
> > >Do not see why.
> >
> > It's either that, or have the data direction of the "chip" field in
> > KVM_GET_IRQCHIP not be entirely in the "get" direction.
> >
> Still do not follow. Example?

struct kvm_irqchip has "chip_id", "pad", and "chip".  "pad" is =20
insufficient to communicate attribute type plus a pointer.  So if we =20
want to provide a pointer for the kernel to write the attribute into, =20
it has to read from memory that the ioctl definition suggests should =20
only be written to.

> > >Setting the address is setting an attribute. Sending MSI is a =20
> command.
> > >Things you set/get during init/migration are attributes. Things =20
> you do
> > >to cause side-effects are commands.
> >
> > What if I set the address at a time that isn't init/migration (the
> > hardware allows moving it, like a PCI BAR)?  Suddenly it becomes no=
t
> > an attribute due to how the caller uses it?
> >
> What's the interface for guest to move it?

Some non-MPIC registers called CCSRBARH, CCSRBARL, and CCSRBAR.

> Why it goes via userspace?

Because the mechanism in question doesn't just move MPIC.  It moves a =20
big block of a bunch of different devices all at once.

> You can move APIC base too, but this does not involve userspace. But
> even if you do go via userspace, it is just a guest asking to change =
=20
> device
> configuration, so using SET_ATTR to set new configuration is fine.
>=20
> > >> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What =20
> real
> > >> (non-glue/wrapper) code can become common?
> > >>
> > >No new ioctl with exactly same result (well actually even faster =20
> since
> > >less copying is done).
> >
> > Which ioctl would go away?
> >
> Those that you propose in your new interface.

No, they wouldn't.  At most one MPIC attribute group would go away =20
(though as I've noted it would still be useful to be able to "get" =20
those attributes for debugging).

> > >> And I really hope you don't want us to do MSIs the x86 way.
> > >>
> > >What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
> > >glue it
> > >to mpic.
> >
> > We'll have to write extra code for it compared to the current way
> > where it works with *zero* code beyond what is wanted for other
> > purposes such as debug and (eventually) migration.  At least it's
> > more direct than having to establish a GSI route...
> If just writing a register cause MSI to be send how do you distinguis=
h
> between write that should send MSI and write that is done on migratio=
n
> to transfer current value?

It is a write-only command register.  The registers that contain the =20
state are elsewhere.

Again, we do not currently support migration on MPIC.  It is a very low=
 =20
priority for embedded.  We do not wish to rule it out entirely, but it =
=20
most likely would require adding more state accesors.

> We had that problem with MSRs on x86. We had
> to, eventually, add a flag that tells us the reason of MSR access.

The equivalent to that flag would be using the right kind of accessor =20
for what you want to do (simulated guest access versus backdoor state =20
access).

> > >> In the XICS thread, Paul brought up the possibliity of cascaded
> > >> MPICs.  It's not relevant to the systems we're trying to model, =
=20
> but
> > >> if one did want to use the in-kernel irqchip interface for that,=
 =20
> it
> > >> would be really nice to be able to operate on a specific MPIC fo=
r
> > >> injection rather than have to come up with some sort of global
> > >> identifier (above and beyond the minor flattening we'd need to =20
> do to
> > >> represent a single MPIC's interrupts in a flat numberspace).
> > >>
> > >ARM encodes information in irq field of KVM_IRQ_LINE like that:
> > > =A0bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
> > >  field: | irq_type  | vcpu_index |   irq_number    |
> > >Why will similar approach not work?
> >
> > Well, it conflicts with the GSI routing stuff, and I don't see an
> > irq chip ID field...
> It does :( Can't say I am happy about it, but I skipped the discussio=
n
> about the interface back then and it is too late to complain now. =20
> Since,
> as you notices, irqfd interfaces with irq routing I wonder what's ARM
> plan about it. But if you choose to go ARM way the format is ARM =20
> specific,
> so you can use your own encoding and put irq chip information there.

Well, we do want to support irqfd, so I don't think we'll be following =
=20
ARM here.

> > But otherwise (or assuming you mean to use such an encoding when
> > setting up a GSI route), I didn't say this part couldn't be made to
> > work.  It will require new kernel code for managing a GSI table in =
a
> > non-APIC way, and a new callback into the device code, but as I've
> > said elsewhere I think we need it for irqfd anyway.  If I use
> > KVM_IRQ_LINE for injecting interrupts, do you still object to the
> > rest of it?
> The rest of what, proposed interface? There are two separate =20
> discussions
> happening here interleaved. First is "do we need to introduce new =20
> generic
> interface for device creation when existing one, albeit not ideal, =20
> can be
> used" and I am OK with that as long as ARM moves to it for 3.10, =20
> although
> I would prefer to have some example of what this interface will be =20
> used
> for besides irq chips otherwise it will be just another way to create
> irqchip.

We need a new way to create irqchips anyway, even if it's just what the=
 =20
XICS patchset adds (KVM_CREATE_IRQCHIP_ARGS, which is similar to =20
KVM_CREATE_DEVICE except it doesn't return an identifier for operating =
=20
on a specific device).  And of course we want to sort this out before =20
either patchset gets merged, so we don't end up adding both methods.  I=
 =20
suspect the XICS patchset flew under your radar because it has "PPC:" =20
in front of it and the subject line doesn't mention the ioctl, but I'm =
=20
not the only one that felt the need for a few new ioctls.

As for other types of devices, x86 has i8254.c emulated in-kernel -- I =
=20
know that's not going to switch to the new interface, but it could have=
 =20
if it existed back then.  I can also see creating an in-kernel =20
emulation device for doing MMIO filtering on some piece of embedded =20
hardware that guests need to access with reasonable performance, but =20
the hardware desginers screwed up the protection slightly (e.g. put =20
other things in the same 4K page).  We've done such filtering before in=
 =20
our standalone hypervisor; the question is whether it happens to =20
anything with enough performance requirements to be done in the kernel.

> Second one is "how the interface should look like". And here I
> think that strong distinction is needed between setting the attribute=
s
> and sending commands with side effects for reasons explained all over
> this ml thread.

OK, so let's just call them "commands".  I like the split into "read" =20
and "write" commands, especially when most of the commands naturally =20
come in such pairs, but if you don't like that part it can be reduced =20
to a single read/write command (and then we'd define separate set/get =20
commands where appropriate).

Note that the XICS patchset also involves device commands.  It does it =
=20
by passing any unknown vm ioctl to the irqchip (XICS implements =20
KVM_IRQCHIP_GET_SOURCES and KVM_IRQCHIP_SET_SOURCES in addition to =20
KVM_IRQ_LINE).  Obviously both ways can work; I've given my reasons =20
elsewhere in the thread for preferring something that doesn't require a=
 =20
new ioctl for every device command.

> > It's not about "silliness" as that this new thing I added for other
> > reasons did the job just as well (again, except when it comes to
> > irqfd), and avoided the need for a GSI table, etc.  IRQ injection
> > was not the main point of the new interface.
> Having generic interface for device creation and then make some =20
> devices
> special by allowing them to be used with KVM_IRQ_LINE makes little
> sense,

Well, we'd want to document which devices tie into which generic =20
interfaces, and which device is selected if multiple such devices are =20
created (if that is allowed for a particular device class).

=46or KVM_IRQ_LINE, we could perhaps use the device id as the irqchip i=
d =20
in kvm_irq_routing_irqchip (or, have an attribute/command that =20
retrieves the id to be used there).  Unfortunately there is no irqchip =
=20
field in kvm_irq_routing_msi, though since it's basically a command to =
=20
write to an arbitrary MMIO address, maybe it could just be implemented =
=20
that way?

> > >> I see no need for a separate ioctl in terms of the underlying
> > >> infrastructure for distinguishing "attribute" from "write-only
> > >> command".  I'm open to improvements on what the ioctl is called.
> > >> It's basically like setting a register on a device, except I was
> > >> concerned that if we actually called it a "register" that people
> > >> would take it too literally and think it's only for the =20
> architected
> > >> register state of the emulated device.
> > >I agree "attribute" is better name than "register", but injecting
> > >interrupt is not setting an attribute.
> >
> > It's a dynamic attribute -- the state of the input line.  Better
> > names are welcome.  I don't see this difference as enough to warran=
t
> > separate ioctls.
> As long as you use the same attribute for migration and interrupt =20
> injection
> purpose I do. If you use separate attributes for migration and =20
> interrupt
> injection then not having separate ioctl is just a hack.

Why is it a hack?  Is it also a hack to not use a separate ioctl to =20
reset the device, to move its address map, etc?

-Scott