public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Anthony Liguori <anthony@codemonkey.ws>
To: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>, Avi Kivity <avi@redhat.com>,
	linux-kernel@vger.kernel.org, aarcange@redhat.com,
	mtosatti@redhat.com, kvm@vger.kernel.org, joro@8bytes.org,
	penberg@cs.helsinki.fi, asias.hejun@gmail.com,
	gorcunov@gmail.com
Subject: Re: [ANNOUNCE] Native Linux KVM tool
Date: Fri, 08 Apr 2011 09:00:43 -0500	[thread overview]
Message-ID: <4D9F150B.1030809@codemonkey.ws> (raw)
In-Reply-To: <BANLkTimEwAuL4PM9vBfogkrua-Z44AZQvg@mail.gmail.com>

On 04/08/2011 12:14 AM, Pekka Enberg wrote:
> Hey, feel free to help out! ;-)
>
> I don't agree that a working 2500 LOC program is 'repeating the same
> architectural mistakes' as QEMU. I hope you realize that we've gotten
> here with just three part-time hackers working from their proverbial
> basements. So what you call mistakes, we call features for the sake of
> simplicity.

And by all means, it's a good accomplishment.

But the mistakes I'm referring to aren't missing bits of code.  It's 
that the current code makes really bad assumptions.

An example is ioport_ops.  This maps directly to 
ioport_{read,write}_table in QEMU.  Then you use ioport__register() to 
register entries in this table similar register_ioport_{read,write}() in 
QEMU.

The use of a struct is a small improvement but the fundamental design is 
flawed because it models a view of hardware where all devices are 
directly connected to the CPU.  This is not how hardware works at all.

On the PC QEMU tries to emulate, a PIO operation flows from the CPU to 
the i440fx.  The i440fx will do the first level of decoding treating the 
PCI host controller ports specially and then posting any I/Os in the PCI 
port range to the PCI bus.  If no device selects these ports, or the 
ports fall into the non-PCI range, the I/O request is then posted to the 
PIIX3.

The PIIX3 will handle a good chunk of the I/O requests (via it's Super 
I/O chipset) and the remainder will be posted to the ISA bus.  One or 
more ISA devices may then react to these posted I/O operation.

Really, having a flat table doesn't make sense.  You should just send 
everything to an i440fx directly.  Then the i440fx should decode what it 
can, and send it to the next level, and so forth.

You can get 90% of the way to working device model without modelling 
this type of flow, but you hit a wall pretty quickly as it's not unusual 
for PCI controllers to manipulate I/O requests in some fashion 
(particularly on non-x86 platforms).  If you treat everything as 
directly attached to the CPU, it's impossible to model this.

Likewise, the same flow is true in the opposite direction.  You use 
guest_flat_to_host() which assumes a linear mapping of guest memory to 
host memory.  We used to do that too in QEMU (phys_ram_base + X).  It 
took a long time to get rid of that assumption in QEMU.

There are multiple problems with this sort of assumption.  The first is 
that you treat all devices as being directly attached to the memory 
controller.  As with I/O instruction dispatch, this is not the case, and 
there are many PCI controllers that will munge these accesses (think 
IOMMU, for instance).  The second is you assume that you're not doing 
I/O to device memory, but this does happen in practice.  The 
cpu_physical_memory_rw() API is careful to support cases where you're 
writing data to I/O memory.

The other big problem here is that if you have open access to guest 
memory like this, you cannot easily track dirty information.  Userspace 
accesses to guest memory will not result in KVM updating the guest dirty 
bitmap.  You can add another API to explicitly set dirty bits (and 
that's exactly what we did a few years ago) but then you'll get 
extremely subtle bugs in migration if you're missing a dirty update 
somewhere.  This is exactly how our API evolved in QEMU.

As I said earlier, there are very good reasons we do the things we do in 
QEMU.  We're a large code base and there's far too much of the code base 
that noone cares about enough but that users are happy with.  It's far 
too hard to make broad sweeping changes right now (although that's 
something we're trying to improve).

But I'd strongly suggest taking some of the advise being offered here.  
Don't ignore the hard problems to start out with because as the code 
base grows, it'll become more difficult to fix those.  That's not to say 
that you need to implement migration tomorrow, but at least keep the 
constraints in mind and make sure that you're designing interfaces that 
let you do things like keep an updated dirty bitmap when you do memory 
accesses in userspace.

> I also don't agree with this sentiment that unless we have SMP,
> migration, yadda yadda yadda, now, it's impossible to change that in
> the future. It ignores the fact that this is exactly how the Linux
> kernel evolved

Over the course of 20 years.  By my count, we still have another decade 
of refactoring before I can get on top of my ivory tower and call every 
other project terrible.

>   and the fact that we're aggressively trying to keep the
> code size as small and tidy as possible so that changing things is as
> easy as possible.
>
> I've looked at QEMU sources over the years and especially over the
> past year and I think you might be way too familiar with its inner
> workings to see how complex (even the core code) has become for
> someone who isn't familiar with it.

I have no doubts about the complexity of QEMU.  But the 'goo' factor is 
not due to complexity, it's due to the fact that there's a lot of code 
that basically needs to be removed.  But removing features from an 
existing project is never a popular thing to do particularly when the 
work well enough for a lot of people.

Regards,

Anthony Liguori

  parent reply	other threads:[~2011-04-08 14:00 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-03-31 17:30 [ANNOUNCE] Native Linux KVM tool Pekka Enberg
     [not found] ` <1B1AE097-4524-4026-85EC-F9A0E274FFF2@suse.de>
2011-04-01  7:07   ` Carsten Otte
2011-04-01  7:37     ` Cyrill Gorcunov
2011-04-01 14:26 ` Steven Rostedt
2011-04-02 20:38 ` Anthony Liguori
2011-04-03  6:21   ` Ingo Molnar
2011-04-03  8:24     ` Avi Kivity
2011-04-03  8:53       ` Pekka Enberg
2011-04-03  9:06         ` Cyrill Gorcunov
2011-04-03  9:37         ` CaT
2011-04-04 10:31         ` Ingo Molnar
2011-04-03  8:51   ` Pekka Enberg
2011-04-03  9:17     ` Avi Kivity
2011-04-03  8:23 ` Avi Kivity
2011-04-03  9:59   ` Pekka Enberg
2011-04-03 10:11     ` Avi Kivity
2011-04-03 10:17       ` Pekka Enberg
2011-04-03 10:22         ` Avi Kivity
2011-04-03 10:32           ` Pekka Enberg
2011-04-03 13:09       ` Anthony Liguori
2011-04-03 13:19         ` Avi Kivity
2011-04-06  9:33           ` Ingo Molnar
2011-04-06  9:36             ` Gleb Natapov
2011-04-06  9:46               ` Ingo Molnar
2011-04-06  9:49                 ` Avi Kivity
2011-04-06  9:51                 ` Gleb Natapov
2011-04-06 10:14             ` Olivier Galibert
2011-04-06 10:55               ` Ingo Molnar
2011-04-08  2:04                 ` Anthony Liguori
2011-04-08  2:14             ` Anthony Liguori
2011-04-08  5:14               ` Pekka Enberg
2011-04-08  6:19                 ` Cyrill Gorcunov
2011-04-08  6:47                 ` Takuya Yoshikawa
2011-04-08  6:51                   ` Pekka Enberg
2011-04-08  7:10                     ` Takuya Yoshikawa
2011-04-08  7:39                 ` Jan Kiszka
2011-04-08  8:27                   ` Pekka Enberg
2011-04-08  9:11                     ` Jan Kiszka
2011-04-08  9:32                       ` Cyrill Gorcunov
2011-04-08 10:42                         ` Jan Kiszka
2011-04-08 12:27                           ` Alexander Graf
2011-04-08 12:33                             ` Cyrill Gorcunov
2011-04-08 14:39                         ` Ted Ts'o
2011-04-08 14:00                 ` Anthony Liguori [this message]
2011-04-08 19:20                   ` Andrea Arcangeli
2011-04-08 22:59                     ` Anthony Liguori
2011-04-10  8:05                       ` Avi Kivity
2011-04-09  7:40                     ` Ingo Molnar
2011-04-12  0:58                       ` Andrea Arcangeli
2011-04-09 18:23                   ` Olivier Galibert
2011-04-10  2:54                     ` Anthony Liguori
2011-04-08 15:59               ` Scott Wood
2011-04-08 22:58                 ` Anthony Liguori
2011-04-06  8:59         ` Markus Armbruster
2011-04-06  9:29           ` Gleb Natapov
2011-04-03  9:01 ` Alon Levy
2011-04-03 10:01   ` Pekka Enberg
2011-04-03 10:15     ` Alon Levy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4D9F150B.1030809@codemonkey.ws \
    --to=anthony@codemonkey.ws \
    --cc=aarcange@redhat.com \
    --cc=asias.hejun@gmail.com \
    --cc=avi@redhat.com \
    --cc=gorcunov@gmail.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=mtosatti@redhat.com \
    --cc=penberg@cs.helsinki.fi \
    --cc=penberg@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox