qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
@ 2013-10-16 12:59 Christian Borntraeger
  2013-10-16 13:06 ` Paolo Bonzini
  2013-10-16 15:44 ` Gleb Natapov
  0 siblings, 2 replies; 6+ messages in thread
From: Christian Borntraeger @ 2013-10-16 12:59 UTC (permalink / raw)
  To: Alexander Graf, Paolo Bonzini, Gleb Natapov
  Cc: Valgrind Developers, Jens Freimann, qemu-devel@nongnu.org,
	KVM list

Folks,

from time to time I update valgrind or qemu to work reasonably well
with KVM.

Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an in kernel
kvm interrupt controller) with the following ioctl:

#define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct kvm_create_device)

qemu can then work on these devices with the ioctls 

/* ioctls for fds returned by KVM_CREATE_DEVICE */
#define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
#define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)

struct kvm_device_attr {
        __u32   flags;          /* no flags currently defined */
        __u32   group;          /* device-defined */
        __u64   attr;           /* group-defined */
        __u64   addr;           /* userspace address of attr data */
};

to communicate with the in-kvm devices to exchange data. There is an interesting
problem here: Properly defined ioctls allow to deduct the written or read area
of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind will
decode the ioctl number according to the _IOxx prefixes and deducts the memory
area that the kernels reads/writes. This is very helpful for definedness checkins.

For specific or more complex ioctls valgrind has special handling. The problem is
now, that looking at the current implementations of kvm device attr, all use 0,1,2,
etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what device
it is talking to without tracking the original create ioctl.

So it seems that by using a multiplexing ioctls we actually make the interface
less well defined and more error prone than individual ioctls.

Another hint to look at are system calls. Older archs (like i386) have an IPC system
call that multiplexes several things. But time has shown that having a system for each
and every subfunctions is better that a multiplexer. (so newer archs like amd64 dont
have an ipc system call they have semop, semget etc. This also makes tools like strace
easier to implement.

Whats even more puzzling to me: The main complaint about ioctls is the possibility to
create arbitrary interfaces into the kernel. Now with the IORW family, we actually
had some limited form of control. With an additional uncoordinated group/attr 
parameter we dropped all of that.


So how to process from here? 
a) leave it as is and accept false positives from valgrind and undecoded straces
b) We could enhance the device_addr group/attr values with size macros, e.g
the same as _IOWR (do we need direction?) for future users
c) Avoid the device api for new code and go back to individual ioctls in KVM

The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model everything 
as a device. Having an in-kernel KVM device seemed logical. The problem is, that we
should really have a 2nd look at the Linux system call ABI. Does it have to follow the
device model? especially if the interface has obvious draw backs.

Christian

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-10-16 19:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-10-16 12:59 [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful? Christian Borntraeger
2013-10-16 13:06 ` Paolo Bonzini
2013-10-16 14:23   ` Christian Borntraeger
2013-10-16 14:59     ` Paolo Bonzini
2013-10-16 15:44 ` Gleb Natapov
2013-10-16 19:47   ` Christian Borntraeger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).