qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
@ 2013-10-16 12:59 Christian Borntraeger
  2013-10-16 13:06 ` Paolo Bonzini
  2013-10-16 15:44 ` Gleb Natapov
  0 siblings, 2 replies; 6+ messages in thread
From: Christian Borntraeger @ 2013-10-16 12:59 UTC (permalink / raw)
  To: Alexander Graf, Paolo Bonzini, Gleb Natapov
  Cc: Valgrind Developers, Jens Freimann, qemu-devel@nongnu.org,
	KVM list

Folks,

from time to time I update valgrind or qemu to work reasonably well
with KVM.

Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an in kernel
kvm interrupt controller) with the following ioctl:

#define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct kvm_create_device)

qemu can then work on these devices with the ioctls 

/* ioctls for fds returned by KVM_CREATE_DEVICE */
#define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
#define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)

struct kvm_device_attr {
        __u32   flags;          /* no flags currently defined */
        __u32   group;          /* device-defined */
        __u64   attr;           /* group-defined */
        __u64   addr;           /* userspace address of attr data */
};

to communicate with the in-kvm devices to exchange data. There is an interesting
problem here: Properly defined ioctls allow to deduct the written or read area
of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind will
decode the ioctl number according to the _IOxx prefixes and deducts the memory
area that the kernels reads/writes. This is very helpful for definedness checkins.

For specific or more complex ioctls valgrind has special handling. The problem is
now, that looking at the current implementations of kvm device attr, all use 0,1,2,
etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what device
it is talking to without tracking the original create ioctl.

So it seems that by using a multiplexing ioctls we actually make the interface
less well defined and more error prone than individual ioctls.

Another hint to look at are system calls. Older archs (like i386) have an IPC system
call that multiplexes several things. But time has shown that having a system for each
and every subfunctions is better that a multiplexer. (so newer archs like amd64 dont
have an ipc system call they have semop, semget etc. This also makes tools like strace
easier to implement.

Whats even more puzzling to me: The main complaint about ioctls is the possibility to
create arbitrary interfaces into the kernel. Now with the IORW family, we actually
had some limited form of control. With an additional uncoordinated group/attr 
parameter we dropped all of that.


So how to process from here? 
a) leave it as is and accept false positives from valgrind and undecoded straces
b) We could enhance the device_addr group/attr values with size macros, e.g
the same as _IOWR (do we need direction?) for future users
c) Avoid the device api for new code and go back to individual ioctls in KVM

The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model everything 
as a device. Having an in-kernel KVM device seemed logical. The problem is, that we
should really have a 2nd look at the Linux system call ABI. Does it have to follow the
device model? especially if the interface has obvious draw backs.

Christian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
  2013-10-16 12:59 [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful? Christian Borntraeger
@ 2013-10-16 13:06 ` Paolo Bonzini
  2013-10-16 14:23   ` Christian Borntraeger
  2013-10-16 15:44 ` Gleb Natapov
  1 sibling, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2013-10-16 13:06 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: KVM list, Gleb Natapov, Valgrind Developers,
	qemu-devel@nongnu.org, Alexander Graf, Jens Freimann

Il 16/10/2013 14:59, Christian Borntraeger ha scritto:
> 
> Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an in kernel
> kvm interrupt controller) with the following ioctl:
> 
> #define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> 
> qemu can then work on these devices with the ioctls 
> 
> /* ioctls for fds returned by KVM_CREATE_DEVICE */
> #define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
> #define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> #define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> 
> struct kvm_device_attr {
>         __u32   flags;          /* no flags currently defined */
>         __u32   group;          /* device-defined */
>         __u64   attr;           /* group-defined */
>         __u64   addr;           /* userspace address of attr data */
> };

Would it work to simply add an "__u64 size;" field to kvm_device_attr,
that is filled on exit by KVM_GET/HAS_DEVICE_ADDR, and filled on entry
to KVM_SET_DEVICE_ADDR?

Paolo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
  2013-10-16 13:06 ` Paolo Bonzini
@ 2013-10-16 14:23   ` Christian Borntraeger
  2013-10-16 14:59     ` Paolo Bonzini
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Borntraeger @ 2013-10-16 14:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel@nongnu.org, Jens Freimann, Alexander Graf,
	Gleb Natapov, KVM list

dropping valgrind devel since its subscribers only...

On 16/10/13 15:06, Paolo Bonzini wrote:
> Il 16/10/2013 14:59, Christian Borntraeger ha scritto:
>>
>> Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an in kernel
>> kvm interrupt controller) with the following ioctl:
>>
>> #define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct kvm_create_device)
>>
>> qemu can then work on these devices with the ioctls 
>>
>> /* ioctls for fds returned by KVM_CREATE_DEVICE */
>> #define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
>> #define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
>> #define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)
>>
>> struct kvm_device_attr {
>>         __u32   flags;          /* no flags currently defined */
>>         __u32   group;          /* device-defined */
>>         __u64   attr;           /* group-defined */
>>         __u64   addr;           /* userspace address of attr data */
>> };
> 
> Would it work to simply add an "__u64 size;" field to kvm_device_attr,
> that is filled on exit by KVM_GET/HAS_DEVICE_ADDR, and filled on entry
> to KVM_SET_DEVICE_ADDR?

That would work, but it would change the ioctl number of KVM_*_DEVICE_ADDR,
due to the changed size of struct kvm_device_attr. We would then need compat
 handlers in the kernel.

If we could encode it in the existing interface the impact would be smaller.
e.g. 

#define ATTR_ATTR_MASK 0xffffffffULL
#define ATTR_LEN_MASK 0xffffffff00000000ULL


    switch (attr->attr) {
---> 
    switch (attr->attr & ATTR_ATTR_MASK) {

Then we could keep the device model abstraction.

Just thinking here..better proposals are welcome

Christian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
  2013-10-16 14:23   ` Christian Borntraeger
@ 2013-10-16 14:59     ` Paolo Bonzini
  0 siblings, 0 replies; 6+ messages in thread
From: Paolo Bonzini @ 2013-10-16 14:59 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: qemu-devel@nongnu.org, Jens Freimann, Alexander Graf,
	Gleb Natapov, KVM list

Il 16/10/2013 16:23, Christian Borntraeger ha scritto:
> That would work, but it would change the ioctl number of KVM_*_DEVICE_ADDR,
> due to the changed size of struct kvm_device_attr. We would then need compat
>  handlers in the kernel.

Actually I did that on purpose :) but perhaps you're right that the
complication would be too high.

> If we could encode it in the existing interface the impact would be smaller.
> e.g. 
> 
> #define ATTR_ATTR_MASK 0xffffffffULL
> #define ATTR_LEN_MASK 0xffffffff00000000ULL
> 
> 
>     switch (attr->attr) {
> ---> 
>     switch (attr->attr & ATTR_ATTR_MASK) {
> 
> Then we could keep the device model abstraction.

That makes sense too.

Paolo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
  2013-10-16 12:59 [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful? Christian Borntraeger
  2013-10-16 13:06 ` Paolo Bonzini
@ 2013-10-16 15:44 ` Gleb Natapov
  2013-10-16 19:47   ` Christian Borntraeger
  1 sibling, 1 reply; 6+ messages in thread
From: Gleb Natapov @ 2013-10-16 15:44 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paolo Bonzini, Jens Freimann, Alexander Graf, KVM list,
	qemu-devel@nongnu.org

On Wed, Oct 16, 2013 at 02:59:47PM +0200, Christian Borntraeger wrote:
> Folks,
> 
> from time to time I update valgrind or qemu to work reasonably well
> with KVM.
> 
> Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an in kernel
> kvm interrupt controller) with the following ioctl:
> 
> #define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> 
> qemu can then work on these devices with the ioctls 
> 
> /* ioctls for fds returned by KVM_CREATE_DEVICE */
> #define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
> #define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> #define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> 
> struct kvm_device_attr {
>         __u32   flags;          /* no flags currently defined */
>         __u32   group;          /* device-defined */
>         __u64   attr;           /* group-defined */
>         __u64   addr;           /* userspace address of attr data */
> };
> 
> to communicate with the in-kvm devices to exchange data. There is an interesting
> problem here: Properly defined ioctls allow to deduct the written or read area
> of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind will
> decode the ioctl number according to the _IOxx prefixes and deducts the memory
> area that the kernels reads/writes. This is very helpful for definedness checkins.
> 
And is not very helpful if you need to change structure size or it
changes by itself because of 32bit->64bit move.

> For specific or more complex ioctls valgrind has special handling. The problem is
> now, that looking at the current implementations of kvm device attr, all use 0,1,2,
> etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what device
> it is talking to without tracking the original create ioctl.
> 
> So it seems that by using a multiplexing ioctls we actually make the interface
> less well defined and more error prone than individual ioctls.
I do not see why it is more error prone. Yes, it is harder for valgrind
to track it, but this by itself does not make it error prone.

> 
> Another hint to look at are system calls. Older archs (like i386) have an IPC system
> call that multiplexes several things. But time has shown that having a system for each
> and every subfunctions is better that a multiplexer. (so newer archs like amd64 dont
> have an ipc system call they have semop, semget etc. This also makes tools like strace
> easier to implement.
> 
The difference is in amount of those subfunctions. We can define
separate ioctl for each cpu/device register of each architecture and it
will be nicely straceable, but how much new ioctls this will create? It
will easily reach 1000s quickly. So at some point you start multiplex.
ONE_REG interface is multiplexer, DEVICE_ATTR is same.

> Whats even more puzzling to me: The main complaint about ioctls is the possibility to
> create arbitrary interfaces into the kernel. Now with the IORW family, we actually
> had some limited form of control. With an additional uncoordinated group/attr 
> parameter we dropped all of that.
How this control is actually used in the kernel?

> 
> 
> So how to process from here? 
> a) leave it as is and accept false positives from valgrind and undecoded straces
strace still cannot decode kvm ioctls after all this years. I am not
concerned about straces at all, it is less then useless to debug KVM.

> b) We could enhance the device_addr group/attr values with size macros, e.g
> the same as _IOWR (do we need direction?) for future users
I am fine with that. But even in KVM this is not the first interface
that has this drawback. See KVM_GET_DIRTY_LOG for instance.

> c) Avoid the device api for new code and go back to individual ioctls in KVM
Individual ioctls were a mess. People are adding devices without even
knowing beforehand how final interface will look like. To accommodate
such "design" strategy devices need to have as fine grained interfaces as
possible otherwise ioctls will be deprecated faster than added (that was
justification for ONE_REG too BTW). If you want fine grained interface
you need to multiplex at some point or add hundreds of ioctls, if you
need to multiplex anyway multiplexing at device level is logical thing
to do. I understand valgrind sentiment, but it should not be a central
point of interface design.

> 
> The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model everything 
> as a device. Having an in-kernel KVM device seemed logical. The problem is, that we
> should really have a 2nd look at the Linux system call ABI. Does it have to follow the
> device model? especially if the interface has obvious draw backs.
> 
Try to add new Linux system call. It's very hard and for a good reason,
you do not want to have 1000s of them where most of then is one-trick
pony and mostly deprecated (but you cannot remove them or reuse them
anyway). If we apply the same constrains to kvm ioctl additions as Linux
has for system calls, KVM development will stall and people will be less
then happy.

--
			Gleb.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
  2013-10-16 15:44 ` Gleb Natapov
@ 2013-10-16 19:47   ` Christian Borntraeger
  0 siblings, 0 replies; 6+ messages in thread
From: Christian Borntraeger @ 2013-10-16 19:47 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Paolo Bonzini, Jens Freimann, Alexander Graf, KVM list,
	qemu-devel@nongnu.org

On 16/10/13 17:44, Gleb Natapov wrote:
> On Wed, Oct 16, 2013 at 02:59:47PM +0200, Christian Borntraeger wrote:
>> Folks,
>>
>> from time to time I update valgrind or qemu to work reasonably well
>> with KVM.
>>
>> Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an in kernel
>> kvm interrupt controller) with the following ioctl:
>>
>> #define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct kvm_create_device)
>>
>> qemu can then work on these devices with the ioctls 
>>
>> /* ioctls for fds returned by KVM_CREATE_DEVICE */
>> #define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
>> #define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
>> #define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)
>>
>> struct kvm_device_attr {
>>         __u32   flags;          /* no flags currently defined */
>>         __u32   group;          /* device-defined */
>>         __u64   attr;           /* group-defined */
>>         __u64   addr;           /* userspace address of attr data */
>> };
>>
>> to communicate with the in-kvm devices to exchange data. There is an interesting
>> problem here: Properly defined ioctls allow to deduct the written or read area
>> of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind will
>> decode the ioctl number according to the _IOxx prefixes and deducts the memory
>> area that the kernels reads/writes. This is very helpful for definedness checkins.
>>
> And is not very helpful if you need to change structure size or it
> changes by itself because of 32bit->64bit move.

Well, it has pros and cons. The con is, that you need all kind of compat handling
when changing structures, the pro is, that unintentional changes are detected if
the size changes. 

> 
>> For specific or more complex ioctls valgrind has special handling. The problem is
>> now, that looking at the current implementations of kvm device attr, all use 0,1,2,
>> etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what device
>> it is talking to without tracking the original create ioctl.
>>
>> So it seems that by using a multiplexing ioctls we actually make the interface
>> less well defined and more error prone than individual ioctls.
> I do not see why it is more error prone. Yes, it is harder for valgrind
> to track it, but this by itself does not make it error prone.

By itself it looks like not more error prone. But it is easier to to mess up - e.g. 
making a structure bigger would cause an immediate error with old user space in case of
ioctls, but changing the target structure of get/set_attr would be unnoticed. On 
the other hand, changing the structure without a size change is not detected - so yeah
the extra checking is not fully sufficient.

> 
>>
>> Another hint to look at are system calls. Older archs (like i386) have an IPC system
>> call that multiplexes several things. But time has shown that having a system for each
>> and every subfunctions is better that a multiplexer. (so newer archs like amd64 dont
>> have an ipc system call they have semop, semget etc. This also makes tools like strace
>> easier to implement.
>>
> The difference is in amount of those subfunctions. We can define
> separate ioctl for each cpu/device register of each architecture and it
> will be nicely straceable, but how much new ioctls this will create? It
> will easily reach 1000s quickly. So at some point you start multiplex.
> ONE_REG interface is multiplexer, DEVICE_ATTR is same.

Agreed sometimes you need multiplexing. ONE_REG seems like a good example.
> 
>> Whats even more puzzling to me: The main complaint about ioctls is the possibility to
>> create arbitrary interfaces into the kernel. Now with the IORW family, we actually
>> had some limited form of control. With an additional uncoordinated group/attr 
>> parameter we dropped all of that.
> How this control is actually used in the kernel?

Only implicit, a change in the argument would cause a different ioctl number. 
> 
>>
>>
>> So how to process from here? 
>> a) leave it as is and accept false positives from valgrind and undecoded straces
> strace still cannot decode kvm ioctls after all this years. I am not
> concerned about straces at all, it is less then useless to debug KVM.

Yes, the code of strace of totally messed up. :-( 
But on my fedora 18 it actually decodes some of the KVM ioctls (e.g. KVM_RUN).


> 
>> b) We could enhance the device_addr group/attr values with size macros, e.g
>> the same as _IOWR (do we need direction?) for future users
> I am fine with that. But even in KVM this is not the first interface
> that has this drawback. See KVM_GET_DIRTY_LOG for instance.

There is additional difference. _IOW(KVMIO,  0x42, struct kvm_dirty_log) is 
unique, so valgrind can have special code to handle that. 1,2,3 is not unique.



> 
>> c) Avoid the device api for new code and go back to individual ioctls in KVM
> Individual ioctls were a mess. People are adding devices without even
> knowing beforehand how final interface will look like. To accommodate
> such "design" strategy devices need to have as fine grained interfaces as
> possible otherwise ioctls will be deprecated faster than added (that was
> justification for ONE_REG too BTW). If you want fine grained interface
> you need to multiplex at some point or add hundreds of ioctls, if you
> need to multiplex anyway multiplexing at device level is logical thing
> to do. I understand valgrind sentiment, but it should not be a central
> point of interface design.

I agree, that valgrind should not be a primary target for an interface.
I also agree that multiplexing at the right place is necessary and good
(e.g. ONE_REG). Multiplexing at a device level also looked right when Alex
showed me the interface - maybe its just the problem with valgrind that
makes me feel different now.

I more or less wanted to bring up this point to start the discussion, if we can
make this interface better (thats why the subject has RFC and "?")
Valgrind was the trigger that actually made me look closer. 
In the end we might just leave it as is, if we cant make the forward/backward
compatibility without lots of compat code.
We can always provide a suppression file for qemu under valgrind.

> 
>>
>> The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model everything 
>> as a device. Having an in-kernel KVM device seemed logical. The problem is, that we
>> should really have a 2nd look at the Linux system call ABI. Does it have to follow the
>> device model? especially if the interface has obvious draw backs.
>>
> Try to add new Linux system call. It's very hard and for a good reason,
> you do not want to have 1000s of them where most of then is one-trick
> pony and mostly deprecated (but you cannot remove them or reuse them
> anyway). If we apply the same constrains to kvm ioctl additions as Linux
> has for system calls, KVM development will stall and people will be less
> then happy.

Agree and disagree :-) ioctls have a bad reputation because they were used to create pony
interfaces in the past but ioctls are ABI as well.


> 
> --
> 			Gleb.

Thanks for your quick response.

Christian 
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-10-16 19:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-10-16 12:59 [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful? Christian Borntraeger
2013-10-16 13:06 ` Paolo Bonzini
2013-10-16 14:23   ` Christian Borntraeger
2013-10-16 14:59     ` Paolo Bonzini
2013-10-16 15:44 ` Gleb Natapov
2013-10-16 19:47   ` Christian Borntraeger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).