* Questioning the Xen Design of the VMM
@ 2006-08-07 15:01 Al Boldi
  2006-08-08  9:10 ` Keir Fraser
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Al Boldi @ 2006-08-07 15:01 UTC (permalink / raw)
  To: xen-devel

Greetings!

The Xen project caught my attention on LKML discussing hypervisors, so I took 
a look at Xen and read the README, where it says:

	This install tree contains source for a Linux 2.6 guest

This immediately turned me off, as I hoped Xen would be a bit more 
transparent, by simply exposing native hw tunneled thru some multiplexed Xen 
patched host-kernel driver.

I may be missing something, but why should the Xen-design require the guest to 
be patched?


Thanks!

--
Al

* Re: Questioning the Xen Design of the VMM
  2006-08-07 15:01 Questioning the Xen Design of the VMM Al Boldi
@ 2006-08-08  9:10 ` Keir Fraser
  2006-08-08  9:17 ` Harry Butterworth
  2006-08-08  9:20 ` Petersson, Mats
  2 siblings, 0 replies; 20+ messages in thread
From: Keir Fraser @ 2006-08-08  9:10 UTC (permalink / raw)
  To: Al Boldi, xen-devel




On 7/8/06 4:01 pm, "Al Boldi" <a1426z@gawab.com> wrote:

> The Xen project caught my attention on LKML discussing hypervisors, so I took
> a look at Xen and read the README, where it says:
> 
> This install tree contains source for a Linux 2.6 guest
> 
> This immediately turned me off, as I hoped Xen would be a bit more
> transparent, by simply exposing native hw tunneled thru some multiplexed Xen
> patched host-kernel driver.
> 
> I may be missing something, but why should the Xen-design require the guest to
> be patched?

You can run fully-virtualised guests on VT-x and AMDV hardware these days.

 -- Keir

* Re: Questioning the Xen Design of the VMM
  2006-08-07 15:01 Questioning the Xen Design of the VMM Al Boldi
  2006-08-08  9:10 ` Keir Fraser
@ 2006-08-08  9:17 ` Harry Butterworth
  2006-08-08  9:20 ` Petersson, Mats
  2 siblings, 0 replies; 20+ messages in thread
From: Harry Butterworth @ 2006-08-08  9:17 UTC (permalink / raw)
  To: Al Boldi; +Cc: xen-devel

On Mon, 2006-08-07 at 18:01 +0300, Al Boldi wrote:
> Greetings!
> 
> The Xen project caught my attention on LKML discussing hypervisors, so I took 
> a look at Xen and read the README, where it says:
> 
> 	This install tree contains source for a Linux 2.6 guest
> 
> This immediately turned me off, as I hoped Xen would be a bit more 
> transparent, by simply exposing native hw tunneled thru some multiplexed Xen 
> patched host-kernel driver.
> 
> I may be missing something, but why should the Xen-design require the guest to 
> be patched?

Xen runs with high performance without binary translation on hardware
without virtualization support.  This requires patching the guest.

With hardware virtualization support Xen can run the guest unmodified.

* RE: Questioning the Xen Design of the VMM
  2006-08-07 15:01 Questioning the Xen Design of the VMM Al Boldi
  2006-08-08  9:10 ` Keir Fraser
  2006-08-08  9:17 ` Harry Butterworth
@ 2006-08-08  9:20 ` Petersson, Mats
  2006-08-08 14:10   ` Al Boldi
  2 siblings, 1 reply; 20+ messages in thread
From: Petersson, Mats @ 2006-08-08  9:20 UTC (permalink / raw)
  To: Al Boldi, xen-devel

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Al Boldi
> Sent: 07 August 2006 16:01
> To: xen-devel@lists.xensource.com
> Subject: [Xen-devel] Questioning the Xen Design of the VMM
> 
> Greetings!
> 
> The Xen project caught my attention on LKML discussing 
> hypervisors, so I took 
> a look at Xen and read the README, where it says:
> 
> 	This install tree contains source for a Linux 2.6 guest
> 
> This immediately turned me off, as I hoped Xen would be a bit more 
> transparent, by simply exposing native hw tunneled thru some 
> multiplexed Xen 
> patched host-kernel driver.

The actual hardware isn't exposed to the guest at all [unless you
explicitly ask for it in the configuration]. There are drivers that are
virtual versions of the real hardware, but there is no way that the
guest OS is ever touching any network or hard-disk, unless you've
explicitly configured it so - and then it uses a driver that is the
native driver [with some minor modifications to deal with the
virtualization - those modifications are generally in header files (at
least for well-behaved drivers)]. 

On the other hand, to reduce the size of the actual hypervisor (VMM),
the approach of Xen is to use Linux as a driver-domain (commonly
combined as the management "domain" of Dom0). This means that Xen
hypervisor itself can be driver-less, but of course also relies on
having another OS on top of itself to make up for this. Currently Linux
is the only available option for a driver-domain, but there's nothing in
the interface between Xen and the driver domain that says it HAS to be
so - it's just much easier to do with a well-known, open-source,
driver-rich kernel, than with a closed-source or driver-poor kernel...

> 
> I may be missing something, but why should the Xen-design 
> require the guest to 
> be patched?

There are two flavours of Xen guests:
Para-virtual guests. Those are patched kernels, and have (in past
versions of Xen) been implemented for Linux 2.4, Linux 2.6, Windows,
<some version of>BSD and perhaps other versions that I don't know of.
Current Xen is "Linux only" supplied with the Xen kernel. Other kernels
are being worked on. 

HVM guests. These are fully virtualized guests, where the guest contains
the same binary as you would use on a non-virtual system. You can run
Windows or Linux, or most other OS's on this. It does require "new"
hardware that has virtualization support in hardware (AMD's AMDV (SVM)
or Intel VT) to use this flavour of guest though, so the older model is
still maintained. 

I hope this is of use to you. 

Please feel free to ask any further questions... 

--
Mats

> 
> 
> Thanks!
> 
> --
> Al
> 

* Re: Questioning the Xen Design of the VMM
  2006-08-08  9:20 ` Petersson, Mats
@ 2006-08-08 14:10   ` Al Boldi
  2006-08-08 15:07     ` Petersson, Mats
  2006-08-09 12:49     ` Daniel Stodden
  0 siblings, 2 replies; 20+ messages in thread
From: Al Boldi @ 2006-08-08 14:10 UTC (permalink / raw)
  To: Petersson, Mats; +Cc: xen-devel

Petersson, Mats wrote:
> Al Boldi wrote:
> > I hoped Xen would be a bit more
> > transparent, by simply exposing native hw tunneled thru some
> > multiplexed Xen patched host-kernel driver.
>
> On the other hand, to reduce the size of the actual hypervisor (VMM),
> the approach of Xen is to use Linux as a driver-domain (commonly
> combined as the management "domain" of Dom0). This means that Xen
> hypervisor itself can be driver-less, but of course also relies on
> having another OS on top of itself to make up for this. Currently Linux
> is the only available option for a driver-domain, but there's nothing in
> the interface between Xen and the driver domain that says it HAS to be
> so - it's just much easier to do with a well-known, open-source,
> driver-rich kernel, than with a closed-source or driver-poor kernel...

Ok, you are probably describing the state of the host-kernel, which I agree 
needs to be patched for performance reasons.

> > I may be missing something, but why should the Xen-design
> > require the guest to be patched?
>
> There are two flavours of Xen guests:
> Para-virtual guests. Those are patched kernels, and have (in past
> versions of Xen) been implemented for Linux 2.4, Linux 2.6, Windows,
> <some version of>BSD and perhaps other versions that I don't know of.
> Current Xen is "Linux only" supplied with the Xen kernel. Other kernels
> are being worked on.

This is the part I am questioning.

> HVM guests. These are fully virtualized guests, where the guest contains
> the same binary as you would use on a non-virtual system. You can run
> Windows or Linux, or most other OS's on this. It does require "new"
> hardware that has virtualization support in hardware (AMD's AMDV (SVM)
> or Intel VT) to use this flavour of guest though, so the older model is
> still maintained.

So HVM solves the problem, but why can't this layer be implemented in 
software?

I'm sure there can't be a performance issue, as this virtualization doesn't 
occur on the physical resource level, but is (should be) rather implemented 
as some sort of a multiplexed routing algorithm, I think :)

> I hope this is of use to you.
> 
> Please feel free to ask any further questions...

Thanks a lot for your detailed response!

--
Al

* RE: Questioning the Xen Design of the VMM
  2006-08-08 14:10   ` Al Boldi
@ 2006-08-08 15:07     ` Petersson, Mats
  2006-08-08 16:39       ` Steven Rostedt
  2006-08-09 12:53       ` Al Boldi
  2006-08-09 12:49     ` Daniel Stodden
  1 sibling, 2 replies; 20+ messages in thread
From: Petersson, Mats @ 2006-08-08 15:07 UTC (permalink / raw)
  To: Al Boldi; +Cc: xen-devel

 

> -----Original Message-----
> From: Al Boldi [mailto:a1426z@gawab.com] 
> Sent: 08 August 2006 15:10
> To: Petersson, Mats
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
> 
> Petersson, Mats wrote:
> > Al Boldi wrote:
> > > I hoped Xen would be a bit more
> > > transparent, by simply exposing native hw tunneled thru some
> > > multiplexed Xen patched host-kernel driver.
> >
> > On the other hand, to reduce the size of the actual 
> hypervisor (VMM),
> > the approach of Xen is to use Linux as a driver-domain (commonly
> > combined as the management "domain" of Dom0). This means that Xen
> > hypervisor itself can be driver-less, but of course also relies on
> > having another OS on top of itself to make up for this. 
> Currently Linux
> > is the only available option for a driver-domain, but 
> there's nothing in
> > the interface between Xen and the driver domain that says 
> it HAS to be
> > so - it's just much easier to do with a well-known, open-source,
> > driver-rich kernel, than with a closed-source or 
> driver-poor kernel...
> 
> Ok, you are probably describing the state of the host-kernel, 
> which I agree 
> needs to be patched for performance reasons.

Yes, but you could have more than one driver domain, that is isolated in
all aspects from other driver domains (host-kernel implies, to me, that
it's also the management of the other domains).

Why would you want to have more than one driver domain? For separation
of course... 
1. Competing Company A and Company B are sharing the same hardware - you
don't want Company A to have even the remotest chance of seeing any data
that belongs to B or the other way around, so you definitely want them
to be separated in as many ways as possible. 
2. Let's assume that someone finds a way to "hack" into a system by
sending some particular pattern on the network (TCP/IP to a particular
port, causing buffer overflow, seems to have been popular on Windows at
least). If you have multiple driver domains, you would only get ONE
domain broken (into) by this approach - of course, if it's widespread it
would still break all ports, but if it's targeted towards one
particular domain, the others will survive [let's say one of your client
companies is attacked with a targeted attack - other companies will
then be unaffected]. 
> 
> > > I may be missing something, but why should the Xen-design
> > > require the guest to be patched?
> >
> > There are two flavours of Xen guests:
> > Para-virtual guests. Those are patched kernels, and have (in past
> > versions of Xen) been implemented for Linux 2.4, Linux 2.6, Windows,
> > <some version of>BSD and perhaps other versions that I 
> don't know of.
> > Current Xen is "Linux only" supplied with the Xen kernel. 
> Other kernels
> > are being worked on.
> 
> This is the part I am questioning.

The main reason to use a para-virtual kernel is that it performs better
than the fully virtualized version.
> 
> > HVM guests. These are fully virtualized guests, where the 
> guest contains
> > the same binary as you would use on a non-virtual system. 
> You can run
> > Windows or Linux, or most other OS's on this. It does require "new"
> > hardware that has virtualization support in hardware (AMD's 
> AMDV (SVM)
> > or Intel VT) to use this flavour of guest though, so the 
> older model is
> > still maintained.
> 
> So HVM solves the problem, but why can't this layer be implemented in 
> software?

It CAN, and has been done. It is, however, a little bit difficult to
cover some of the "strange" corner cases, as the x86 processor wasn't
really designed to handle virtualization natively [until these
extensions were added]. This is why you end up with binary translation
in VMware for example. For example, let's say that we use the method of
"ring compression" (which is when the guest-OS is moved from Ring 0
[full privileges] to Ring 1 [less than full privileges]), and the
hypervisor wants to have full control of interrupt flags:

some_function:
	...
	pushf			// Save interrupt flag.
	cli			// Disable interrupts
	... 
	...
	...
	popf			// Restore interrupt flag. 
	...

In Ring 0, all this works just fine - but of course, we don't know that
the guest-OS tried to disable interrupts, so we have to change
something. In Ring 1, the guest can't disable interrupts, so the CLI
instruction can be intercepted. Great. But pushf/popf is a valid
instruction in all four rings - it just doesn't change the interrupt
enable flag in the flags register if you're not allowed to use the
CLI/STI instructions! So, that means that interrupts are disabled
forever after [until an STI instruction gets found by chance, at least].


And if the next bit of code is:

	mov	someaddress, eax		// someaddress is updated by an interrupt!
$1:
	cmp	someaddress, eax		// Check it... 
	jz	$1

Then we'd very likely never get out of there, since the actual interrupt
causing someaddress to change is believed by the VMM to be disabled. 

There is no real way to make popf trap [other than supplying it with
invalid arguments in virtual 8086 mode, which isn't really a practical
thing to do here!]
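
The pushf/cli/popf sequence above can be sketched as a toy model in C.
This is only an illustration, not Xen or VMware code: struct vcpu,
hw_eflags, vmm_virtual_if and the guest_* helpers are hypothetical
names, and the model assumes the guest kernel runs in ring 1 with IOPL
below its privilege level, so that cli faults into the VMM while popf
silently leaves the interrupt flag alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_IF 0x200u          /* interrupt-enable bit in EFLAGS */

/* Toy model of one deprivileged guest CPU (all names are hypothetical). */
struct vcpu {
    uint32_t hw_eflags;         /* real flags register; IF stays set so the VMM keeps control */
    bool     vmm_virtual_if;    /* VMM's belief: does the guest accept interrupts? */
    uint32_t stack[8];          /* tiny guest stack */
    int      sp;                /* number of entries on the stack */
};

static void vcpu_init(struct vcpu *v)
{
    v->hw_eflags = FLAG_IF | 0x2u;  /* bit 1 of EFLAGS is always set */
    v->vmm_virtual_if = true;
    v->sp = 0;
}

/* pushf is legal in every ring: it never traps, it just stores the flags. */
static void guest_pushf(struct vcpu *v)
{
    assert(v->sp < 8);
    v->stack[v->sp++] = v->hw_eflags;
}

/* cli faults in ring 1, so the VMM gets to see it and records the request. */
static void guest_cli(struct vcpu *v)
{
    v->vmm_virtual_if = false;
}

/* popf is also legal in every ring. Without sufficient privilege it
 * restores every bit except IF and does not trap, so the VMM is never
 * told that the guest wanted its interrupts back. */
static void guest_popf(struct vcpu *v)
{
    assert(v->sp > 0);
    uint32_t popped = v->stack[--v->sp];
    v->hw_eflags = (popped & ~FLAG_IF) | (v->hw_eflags & FLAG_IF);
    /* note: vmm_virtual_if is deliberately left untouched */
}
```

Calling guest_pushf, guest_cli and guest_popf in that order leaves
vmm_virtual_if false even though the saved flags had IF set, which is
the "disabled forever" state described above.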

Another problem is "hidden bits" in registers. 

Let's say this:

	mov	cr0, eax
	mov	eax, ecx
	or	$1, eax
	mov	eax, cr0
	mov	$0x10, eax
	mov	eax, fs
	mov	ecx, cr0
	
	mov	$0xF000000, eax
	mov	$10000, ecx
$1:
	mov	$0, fs:eax
	add	$4, eax
	dec	ecx
	jnz	$1

Let's now say that we have an interrupt that the hypervisor would handle
in the loop in the above code. The hypervisor itself uses FS for some
special purpose, and thus needs to save/restore the FS register. When it
returns, the system will crash (GP fault) because the FS register limit
is 0xFFFF (64KB) and eax is greater than the limit - but the limit of FS
was set to 0xFFFFFFFF before we took the interrupt... Incorrect
behaviour like this is terribly difficult to deal with, and there really
isn't any good way to solve these issues [other than not allowing the
code to run when it does "funny" things like this - or to perform the
necessary code in "translation mode" - i.e. emulate each instruction ->
slow(ish)]. 

> 
> I'm sure there can't be a performance issue, as this 
> virtualization doesn't 
> occur on the physical resource level, but is (should be) 
> rather implemented 
> as some sort of a multiplexed routing algorithm, I think :)

I'm not entirely sure what this statement is trying to say, but as I
understand the situation, performance is entirely the reason why the Xen
paravirtual model was implemented - all other VMMs are slower [although
it's often hard to prove that, since for example VMware have the rule
that they have to give permission before publishing benchmarks of their
product, and of course that permission would only be given in cases
where there is some benefit to them]. 

One of the obvious reasons for para-virtual being better than full
virtualization is that it can be used in a "batched" mode. Let's say we
have some code that does this:

...
	p = malloc(2000 * 4096);
... 

Let's then say that the guts of malloc ends up in something like this:

map_pages_to_user(...)
{
	for(v = random_virtual_address, p = start_page; p < end_page; p++, v+=4096)
		map_one_page_to_user(p, v);
}

In full virtualization, we have no way to understand that someone is
mapping 2000 pages to the same user-process in one guest, we'd just see
writes to the page-table one page at a time. 

In the para-virtual case, we could do something like:
map_pages_to_user(...)
{
	hypervisor_map_pages_to_user(current_process, start_page, end_page,
				     random_virtual_address);
}

Now, the hypervisor knows "the full story" and can map all those pages
in one go - much quicker, I would say. There's still more work than in
the native case, but it's much closer to the native case. 
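
The difference can be made concrete with a small counting model in C. It
is a sketch under simple assumptions, not Xen's actual interface: struct
vmm_stats and both map_pages_* functions are hypothetical, and the only
cost modelled is the number of transitions from the guest into the
hypervisor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters kept by a toy VMM. */
struct vmm_stats {
    unsigned long entries;       /* transitions from the guest into the VMM */
    unsigned long pages_mapped;  /* pages the VMM has mapped so far */
};

/* Fully virtualized guest: the VMM only sees individual page-table
 * writes, so it is entered once per page. */
void map_pages_trapping(struct vmm_stats *s, size_t npages)
{
    assert(s != NULL);
    for (size_t i = 0; i < npages; i++) {
        s->entries++;
        s->pages_mapped++;
    }
}

/* Para-virtual guest: one explicit call describes the whole range,
 * so the VMM is entered once and maps everything in that visit. */
void map_pages_batched(struct vmm_stats *s, size_t npages)
{
    assert(s != NULL);
    s->entries++;
    s->pages_mapped += npages;
}
```

For the 2000-page malloc above, the first function enters the VMM 2000
times and the second enters it once, while both end up with the same
number of pages mapped.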


> 
> > I hope this is of use to you.
> > 
> > Please feel free to ask any further questions...
> 
> Thanks a lot for your detailed response!
> 
> --
> Al
> 
> 
> 
> 

* Re: Questioning the Xen Design of the VMM
  2006-08-08 15:07     ` Petersson, Mats
@ 2006-08-08 16:39       ` Steven Rostedt
  2006-08-08 17:14         ` Petersson, Mats
  2006-08-09 12:53       ` Al Boldi
  1 sibling, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2006-08-08 16:39 UTC (permalink / raw)
  To: Petersson, Mats; +Cc: Al Boldi, xen-devel

Mats, thanks for the examples of where the hypervisor needs to know 
otherwise x86 guest doesn't do what it expects to be done.

I've just recently started working with Xen, but my background has been 
more with other architectures than x86.  I understand all that you 
explained, but one: see below. (I'm posting to the list so that others 
can learn too ;)

Petersson, Mats wrote:
>  

[ snipped a lot of good info ]

> 
> Another problem is "hidden bits" in registers. 
> 
> Let's say this:
> 
> 	mov	cr0, eax
> 	mov	eax, ecx
> 	or	$1, eax
> 	mov	eax, cr0
> 	mov	$0x10, eax
> 	mov	eax, fs
> 	mov	ecx, cr0
> 	
> 	mov	$0xF000000, eax
> 	mov	$10000, ecx
> $1:
> 	mov	$0, fs:eax
> 	add	$4, eax
> 	dec	ecx
> 	jnz	$1
> 
> Let's now say that we have an interrupt that the hypervisor would handle
> in the loop in the above code. The hypervisor itself uses FS for some
> special purpose, and thus needs to save/restore the FS register. When it
> returns, the system will crash (GP fault) because the FS register limit
> is 0xFFFF (64KB) and eax is greater than the limit - but the limit of FS
> was set to 0xFFFFFFFF before we took the interrupt... Incorrect
> behaviour like this is terribly difficult to deal with, and there really
> isn't any good way to solve these issues [other than not allowing the
> code to run when it does "funny" things like this - or to perform the
> necessary code in "translation mode" - i.e. emulate each instruction ->
> slow(ish)]. 
> 

The above I'm confused on.  In x86, the hypervisor can't store the fs 
register fully before returning from the interrupt??  You stated that 
the fs register limit was 0xffffffff before the interrupt, but ends up 
being 0xffff afterwards.  As I mentioned, I'm just learning the 
internals of x86, so my full comprehension on segment registers of x86 
is still a little fuzzy.

Could you explain further here?

Thanks,

-- Steve

* RE: Questioning the Xen Design of the VMM
  2006-08-08 16:39       ` Steven Rostedt
@ 2006-08-08 17:14         ` Petersson, Mats
  2006-08-08 18:22           ` Steven Rostedt
  0 siblings, 1 reply; 20+ messages in thread
From: Petersson, Mats @ 2006-08-08 17:14 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Al Boldi, xen-devel

 

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
> Steven Rostedt
> Sent: 08 August 2006 17:39
> To: Petersson, Mats
> Cc: Al Boldi; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
> 
> Mats, thanks for the examples of where the hypervisor needs to know 
> otherwise x86 guest doesn't do what it expects to be done.
> 
> I've just recently started working with Xen, but my 
> background has been 
> more with other architectures than x86.  I understand all that you 
> explained, but one: see below. (I'm posting to the list so 
> that others 
> can learn too ;)
> 
> Petersson, Mats wrote:
> >  
> 
> [ snipped a lot of good info ]
> 
> > 
> > Another problem is "hidden bits" in registers. 
> > 
> > Let's say this:
> > 
> > 	mov	cr0, eax
> > 	mov	eax, ecx
> > 	or	$1, eax
> > 	mov	eax, cr0
> > 	mov	$0x10, eax
> > 	mov	eax, fs
> > 	mov	ecx, cr0
> > 	
> > 	mov	$0xF000000, eax
> > 	mov	$10000, ecx
> > $1:
> > 	mov	$0, fs:eax
> > 	add	$4, eax
> > 	dec	ecx
> > 	jnz	$1
> > 
> > Let's now say that we have an interrupt that the hypervisor 
> would handle
> > in the loop in the above code. The hypervisor itself uses 
> FS for some
> > special purpose, and thus needs to save/restore the FS 
> register. When it
> > returns, the system will crash (GP fault) because the FS 
> register limit
> > is 0xFFFF (64KB) and eax is greater than the limit - but 
> the limit of FS
> > was set to 0xFFFFFFFF before we took the interrupt... Incorrect
> > behaviour like this is terribly difficult to deal with, and 
> there really
> > isn't any good way to solve these issues [other than not 
> allowing the
> > code to run when it does "funny" things like this - or to 
> perform the
> > necessary code in "translation mode" - i.e. emulate each 
> instruction ->
> > slow(ish)]. 
> > 
> 
> The above I'm confused on.  In x86, the hypervisor can't store the fs 
> register fully before returning from the interrupt??  You stated that 
> the fs register limit was 0xffffffff before the interrupt, 
> but ends up 
> being 0xffff afterwards.  As I mentioned, I'm just learning the 
> internals of x86, so my full comprehension on segment 
> registers of x86 
> is still a little fuzzy.
> 
> Could you explain further here?

Sure, this code-snippet enters protected mode (bit 0 of CR0) and sets up
FS from the Global Descriptor Table. FS visible part (16 bits) gets set
to the value 0x10, and the limit is set to whatever happens to be in the
descriptor table, and I didn't actually specify what that value is, but
rather implied that the value for the limit is (0xfffff << 12 | 0xFFF)
(i.e. the limit is 2^20 - 1 and the granularity bit is set to 1 ->
multiply by 4096 and set lower bits to one). 
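
The limit arithmetic can be checked with a few lines of C. The helper
name effective_limit is made up for this illustration; it only encodes
the rule stated above, namely that a 20-bit limit with the granularity
bit set is scaled by 4096 with the low 12 bits filled with ones.

```c
#include <assert.h>
#include <stdint.h>

/* Expand the raw 20-bit descriptor limit into the byte limit that the
 * CPU enforces. With granularity set, the unit is 4 KB pages. */
uint32_t effective_limit(uint32_t raw_limit20, int granularity)
{
    assert(raw_limit20 <= 0xFFFFFu);   /* the field is only 20 bits wide */
    if (granularity)
        return (raw_limit20 << 12) | 0xFFFu;
    return raw_limit20;
}
```

With a raw limit of 0xFFFFF and the granularity bit set this yields
0xFFFFFFFF (4 GB); a raw limit of 0xFFFF with byte granularity stays at
0xFFFF (64 KB), the value mentioned in the crash scenario.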

As we leave protected mode, the contents of FS are still maintained,
including the 80 bits of hidden information (limit, base and
attributes). 

However, if we then take an interrupt (or otherwise need to save/restore
FS), we'd lose all the hidden bits, and restoring it later would need
to figure out "how it got loaded" to make sure its hidden parts are
re-loaded. 

It's unlikely that you'd see this scenario in Xen, since Xen works on
para-virtual kernels [unless we've got virtualization hardware, in which
case the hypervisor CAN SEE the internal parts of FS (or any other
segment register)]. 

Another tricky situation is:

	GDT[5] = {base = 0x1000, limit=0x1000, attr=<something> }
	FS = GDT[5];
	CLI();
	GDT[5] = {base = 0x2000, limit = 0x1000, attr=<something> }
	... 
	...
	...
	FS = GDT[5];
	STI();

Now, whilst this tricky code is unreliable on real hardware too (if
interrupts were enabled), if you have a situation where the guest can
not accept interrupts, but the hypervisor can, it would break if the
code with ... in it were to have an interrupt, because we'd have lost
the value of FS (we'd reload the NEW value of GDT[5] at the end of
interrupt, assuming it saves FS). 
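
A toy model in C shows why the reload goes wrong. It is a sketch, not
hypervisor code: struct descriptor, struct segreg, load_segreg and
vmm_save_restore are hypothetical names, and the model assumes the
interrupt path can save only the visible 16-bit selector and must reload
the hidden part from the descriptor table as it stands at that moment.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified descriptor: only base and limit are modelled. */
struct descriptor {
    uint32_t base;
    uint32_t limit;
};

/* A segment register: a visible selector plus a hidden cached copy of
 * the descriptor that was in the table when the register was loaded. */
struct segreg {
    uint16_t          selector;
    struct descriptor hidden;
};

/* Loading a segment register copies the table entry into the hidden part. */
void load_segreg(struct segreg *r, const struct descriptor *gdt, uint16_t index)
{
    r->selector = (uint16_t)(index << 3);  /* selector = index * 8, TI and RPL zero */
    r->hidden = gdt[index];
}

/* Interrupt handled by the VMM: it borrows the register, then restores
 * it from the saved selector, which re-reads the current table entry. */
void vmm_save_restore(struct segreg *r, const struct descriptor *gdt)
{
    uint16_t saved = r->selector;          /* the hidden part cannot be read out */
    /* ... the VMM would use the register for its own purposes here ... */
    load_segreg(r, gdt, (uint16_t)(saved >> 3));
}
```

If GDT[5] is changed from base 0x1000 to base 0x2000 after FS was
loaded, the hidden base stays 0x1000 until vmm_save_restore runs, and is
0x2000 afterwards, so the guest code in the "..." section suddenly
addresses different memory.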


Hidden parts of segment registers are one of the "security features" of
the 286 architecture, but it also creates some pretty interesting
scenarios for us programmers... 

--
Mats

* Re: Questioning the Xen Design of the VMM
  2006-08-08 17:14         ` Petersson, Mats
@ 2006-08-08 18:22           ` Steven Rostedt
  0 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2006-08-08 18:22 UTC (permalink / raw)
  To: Petersson, Mats; +Cc: Al Boldi, xen-devel

Mats, thanks for the reply.

Petersson, Mats wrote:
> 
> 
> Hidden parts of segment registers are one of the "security features" of
> the 286 architecture, but it also creates some pretty interesting
> scenarios for us programmers... 

The missing part in my mind was that I didn't know that the segment 
registers can't be completely read.  A colleague of mine told me a 
little more about them.  Yuck!

Thanks for the nice write ups though.

-- Steve

* Re: Questioning the Xen Design of the VMM
  2006-08-08 14:10   ` Al Boldi
  2006-08-08 15:07     ` Petersson, Mats
@ 2006-08-09 12:49     ` Daniel Stodden
  2006-08-10 14:57       ` Al Boldi
  1 sibling, 1 reply; 20+ messages in thread
From: Daniel Stodden @ 2006-08-09 12:49 UTC (permalink / raw)
  To: Al Boldi; +Cc: Petersson, Mats, xen-devel


On Tue, 2006-08-08 at 17:10 +0300, Al Boldi wrote:

> > There are two flavours of Xen guests:
> > Para-virtual guests. Those are patched kernels, and have (in past
> > versions of Xen) been implemented for Linux 2.4, Linux 2.6, Windows,
> > <some version of>BSD and perhaps other versions that I don't know of.
> > Current Xen is "Linux only" supplied with the Xen kernel. Other kernels
> > are being worked on.
> 
> This is the part I am questioning.
> 
> > HVM guests. These are fully virtualized guests, where the guest contains
> > the same binary as you would use on a non-virtual system. You can run
> > Windows or Linux, or most other OS's on this. It does require "new"
> > hardware that has virtualization support in hardware (AMD's AMDV (SVM)
> > or Intel VT) to use this flavour of guest though, so the older model is
> > still maintained.
> 
> So HVM solves the problem, but why can't this layer be implemented in 
> software?

the short answer at the cpu level is "because of the arcane nature of
the x86 architecture" :/

it can be done, but it requires mechanisms xen developers currently do
not and wouldn't be willing to apply. non-paravirtualized guests may
perform operations which on bare x86 hardware are hard/impossible to
track. one way to work around this would be patching guest code segments
before executing them. that's where systems like e.g. vmware come into
play. xen-style paravirtualization at the cpu level basically resolves
that efficiently by teaching the guest system not to use the critical
stuff, but be aware of the vmm to do it instead.

once the cpu problem has been solved, you'd need to emulate hardware
resources an unmodified guest system attempts to drive. that again takes
additional cycles. elimination of the peripheral hardware interfaces by
putting the I/O layers on top of an abstract low-level path into the VMM
is one of the reasons why xen is faster than others. many systems do
this quite successfully, even for 'non-modified' guests like e.g.
windows, by installing dedicated, virtualization aware drivers once the
base installation went ok.

> I'm sure there can't be a performance issue, as this virtualization doesn't 
> occur on the physical resource level, but is (should be) rather implemented 
> as some sort of a multiplexed routing algorithm, I think :)

few device classes support resource sharing in that manner efficiently.
peripheral devices in commodity platforms are inherently single-hosted
and won't support unfiltered access by multiple driver instances in
several guests.

from the vmm perspective, it always boils down to emulating the device.
however, with varying degrees of complexity regarding the translation
of guest requests to physical access. it depends. ide, afaik is known to
work comparatively well. an example of an area where it's getting more
sportive would be network adapters.

this is basically the whole problem when building virtualization layers
for cots platforms: the device/driver landscape spreads to infinity :)
since you'll have a hard time driving any possible combination by
yourself, you need something else to do it. one solution is hosted
vmms, running on top of an existing operating system. a second solution
is what xen does: offload drivers to a modified guest system which can
then carry the I/O load from the additional, nonprivileged guests as
well.

regards,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

* Re: Questioning the Xen Design of the VMM
  2006-08-08 15:07     ` Petersson, Mats
  2006-08-08 16:39       ` Steven Rostedt
@ 2006-08-09 12:53       ` Al Boldi
  2006-08-09 13:28         ` Petersson, Mats
  2006-08-10 11:20         ` Daniel Stodden
  1 sibling, 2 replies; 20+ messages in thread
From: Al Boldi @ 2006-08-09 12:53 UTC (permalink / raw)
  To: Petersson, Mats; +Cc: xen-devel

Petersson, Mats wrote:
> > > Al Boldi wrote:
> > > > I may be missing something, but why should the Xen-design
> > > > require the guest to be patched?
>
> The main reason to use a para-virtual kernel is that it performs better
> than the fully virtualized version.
>
> > So HVM solves the problem, but why can't this layer be implemented in
> > software?
>
> It CAN, and has been done.

You mean full virtualization using binary translation in software?

My understanding was, that HVM implies full virtualization without the need 
for binary translation in software.

> It is, however, a little bit difficult to
> cover some of the "strange" corner cases, as the x86 processor wasn't
> really designed to handle virtualization natively [until these
> extensions were added].

You mean AMDV/IntelVT extensions?

If so, then these extensions don't actively participate in the act of 
virtualization, but rather fix some x86-arch shortcomings, that make it 
easier for software (i.e. Xen) to virtualize, thus circumventing the need to 
do binary translation.  Is this a correct reading?

> This is why you end up with binary translation
> in VMware for example. For example, let's say that we use the method of
> "ring compression" (which is when the guest-OS is moved from Ring 0
> [full privileges] to Ring 1 [less than full privileges]), and the
> hypervisor wants to have full control of interrupt flags:
>
> some_function:
> 	...
> 	pushf			// Save interrupt flag.
> 	cli			// Disable interrupts
> 	...
> 	...
> 	...
> 	popf			// Restore interrupt flag.
> 	...
>
> In Ring 0, all this works just fine - but of course, we don't know that
> the guest-OS tried to disable interrupts, so we have to change
> something. In Ring 1, the guest can't disable interrupts, so the CLI
> instruction can be intercepted. Great. But pushf/popf is a valid
> instruction in all four rings - it just doesn't change the interrupt
> enable flag in the flags register if you're not allowed to use the
> CLI/STI instructions! So, that means that interrupts are disabled
> forever after [until an STI instruction gets found by chance, at least].
>
>
> And if the next bit of code is:
>
> 	mov	someaddress, eax		// someaddress is updated by an interrupt!
> $1:
> 	cmp	someaddress, eax		// Check it...
> 	jz	$1
>
> Then we'd very likely never get out of there, since the actual interrupt
> causing someaddress to change is believed by the VMM to be disabled.
>
> There is no real way to make popf trap [other than supplying it with
> invalid arguments in virtual 8086 mode, which isn't really a practical
> thing to do here!]
>
> Another problem is "hidden bits" in registers.
>
> Let's say this:
>
> 	mov	cr0, eax
> 	mov	eax, ecx
> 	or	$1, eax
> 	mov	eax, cr0
> 	mov	$0x10, eax
> 	mov	eax, fs
> 	mov	ecx, cr0
>
> 	mov	$0xF000000, eax
> 	mov	$10000, ecx
> $1:
> 	mov	$0, fs:eax
> 	add	$4, eax
> 	dec	ecx
> 	jnz	$1
>
> Let's now say that we have an interrupt that the hypervisor would handle
> in the loop in the above code. The hypervisor itself uses FS for some
> special purpose, and thus needs to save/restore the FS register. When it
> returns, the system will crash (GP fault) because the FS register limit
> is 0xFFFF (64KB) and eax is greater than the limit - but the limit of FS
> was set to 0xFFFFFFFF before we took the interrupt... Incorrect
> behaviour like this is terribly difficult to deal with, and there really
> isn't any good way to solve these issues [other than not allowing the
> code to run when it does "funny" things like this - or to perform the
> necessary code in "translation mode" - i.e. emulate each instruction ->
> slow(ish)].

Or introduce AMDV/IntelVT extensions?

> > I'm sure there can't be a performance issue, as this
> > virtualization doesn't
> > occur on the physical resource level, but is (should be)
> > rather implemented
> > as some sort of a multiplexed routing algorithm, I think :)
>
> I'm not entirely sure what this statement is trying to say, but as I
> understand the situation, performance is entirely the reason why the Xen
> paravirtual model was implemented - all other VMM's are slower [although
> it's often hard to prove that, since for example Vmware have the rule
> that they have to give permission before publishing benchmarks of their
> product, and of course that permission would only be given in cases
> where there is some benefit to them].
>
> One of the obvious reasons for para-virtual being better than full
> virtualization is that it can be used in a "batched" mode. Let's say we
> have some code that does this:
>
> ...
> 	p = malloc(2000 * 4096);
> ...
>
> Let's then say that the guts of malloc ends up in something like this:
>
> map_pages_to_user(...)
> {
> 	for(v = random_virtual_address, p = start_page; p < end_page; p++, v+=4096)
> 		map_one_page_to_user(p, v);
> }
>
> In full virtualization, we have no way to understand that someone is
> mapping 2000 pages to the same user-process in one guest, we'd just see
> writes to the page-table one page at a time.
>
> In the para-virtual case, we could do something like:
> map_pages_to_user(...)
> {
> 	hypervisor_map_pages_to_user(current_process, start_page, end_page,
> 	                             random_virtual_address);
> }
>
> Now, the hypervisor knows "the full story" and can map all those pages
> in one go - much quicker, I would say. There's still more work than in
> the native case, but it's much closer to the native case.

Sure, but wouldn't this be for the price of losing guest-OS transparency?


Thanks!

--
Al

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Questioning the Xen Design of the VMM
  2006-08-09 12:53       ` Al Boldi
@ 2006-08-09 13:28         ` Petersson, Mats
  2006-08-10 14:55           ` Al Boldi
  2006-08-10 11:20         ` Daniel Stodden
  1 sibling, 1 reply; 20+ messages in thread
From: Petersson, Mats @ 2006-08-09 13:28 UTC (permalink / raw)
  To: Al Boldi; +Cc: xen-devel

> -----Original Message-----
> From: Al Boldi [mailto:a1426z@gawab.com] 
> Sent: 09 August 2006 13:53
> To: Petersson, Mats
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
> 
> Petersson, Mats wrote:
> > > > Al Boldi wrote:
> > > > > I may be missing something, but why should the Xen-design
> > > > > require the guest to be patched?
> >
> > The main reason to use a para-virtual kernel is that it performs better
> > than the fully virtualized version.
> >
> > > So HVM solves the problem, but why can't this layer be 
> implemented in
> > > software?
> >
> > It CAN, and has been done.
> 
> You mean full virtualization using binary translation in software?

Yes, exactly - or other types of "full virtualization" using software -
I haven't made a complete inventory of "different technologies used for
virtualization on x86", so I can't really say - my job with AMD and Xen
is to implement into Xen the parts that support the AMD virtualization,
not to understand every VM architecture available in the world... 
> 
> My understanding was, that HVM implies full virtualization 
> without the need 
> for binary translation in software.

Yes, that's generally correct. In detail, there are some variants
of execution where this breaks down, but those are obscure corner cases
rather than the normal behaviour. In particular, Intel's VT doesn't
support running real-mode inside a virtual machine, so if the guest is
run in real-mode, it requires some form of "emulation" (actually, the
current solution uses the VM86 mode of the processor, and then only
has to emulate the opcodes that fault when run in VM86 mode). There
are some things that we (AMD) didn't get perfectly right either, and as
such could be improved... 
> 
> > It is however, a little bit difficult to
> > cover some of the "strange" corner cases, as the x86 
> processor wasn't
> > really designed to handle virtualization natively [until these
> > extensions were added].
> 
> You mean AMDV/IntelVT extensions?

Yes. 
> 
> If so, then these extensions don't actively participate in the act of 
> virtualization, but rather fix some x86-arch shortcomings, 
> that make it 
> easier for software (i.e. Xen) to virtualize, thus 
> circumventing the need to 
> do binary translation.  Is this a correct reading?

Not sure what your exact meaning is here. 

What do you mean by "actively participate in the act of virtualization"?
Please clarify, and give an example of an architecture where the hardware
is ACTIVELY taking part in the virtualization - do you mean a hardware
implementation of a hypervisor? [As, again, I haven't spent an awful lot
of time trying to understand how/what can and can't be done in other
architectures - as far as I understand it, both AMD and Intel's
virtualization technologies are fairly close "copies" of IBM's original
implementation on the 360 series machines, so I expect that what
can/can't be done in that, is what can/can't be done in the x86 world.] 

I do agree that it removes the need for binary translation and
emulation, and makes the software that manages the VMs easier to
write. It also allows more selective intercepts than, for example,
ring compression, where all protected instructions fault whether or
not the hypervisor actually needs to intercept them. For example, it's
completely useless for the hypervisor to know when the guest reads or
writes CR2 - but CR2 is a protected register, so every access from a
ring-compressed kernel gets intercepted. So there are fewer intercepts.
It's also easier to determine the actual intercept reason on a
virtualization-enhanced processor, since it gives an "exitcode" to
indicate the reason for the
"exit" back to the hypervisor. 
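
As a rough illustration of the two points above (selective intercepts
and an exit code), here is a minimal C sketch. Everything in it is
hypothetical: the event names, the action names and the bitmask layout
are made up for this example and do not match the real SVM or VT-x
encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest events; real exit codes are defined by the CPU vendor. */
enum guest_event { EV_CR0_WRITE, EV_CR2_ACCESS, EV_IO_PORT, EV_HLT, EV_COUNT };

/* Hypothetical actions the hypervisor might take on an exit. */
enum vmm_action { ACT_EMULATE_CR0, ACT_EMULATE_IO, ACT_YIELD_CPU, ACT_INJECT_FAULT };

/* One bit per event: set means "exit to the hypervisor when this happens". */
struct vcpu_ctl {
    uint32_t intercept_mask;
};

/* Ask for an exit on the given event. */
void set_intercept(struct vcpu_ctl *v, enum guest_event ev)
{
    assert(ev < EV_COUNT);
    v->intercept_mask |= UINT32_C(1) << ev;
}

/* Would this event leave the guest? Events that are not selected run natively. */
int causes_exit(const struct vcpu_ctl *v, enum guest_event ev)
{
    assert(ev < EV_COUNT);
    return (int)((v->intercept_mask >> ev) & 1u);
}

/* The exit code says directly why we left the guest; no instruction decoding. */
enum vmm_action dispatch_exit(enum guest_event exitcode)
{
    switch (exitcode) {
    case EV_CR0_WRITE: return ACT_EMULATE_CR0;
    case EV_IO_PORT:   return ACT_EMULATE_IO;
    case EV_HLT:       return ACT_YIELD_CPU;
    default:           return ACT_INJECT_FAULT;
    }
}
```

In this model a hypervisor that never calls set_intercept() for
EV_CR2_ACCESS simply never sees CR2 accesses, which is the "fewer
intercepts" case; under ring compression there is no such choice.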

> 
> > This is why you end up with binary translation
> > in VMWare for example. For example, let's say that we use 
> the method of
> > "ring compression" (which is when the guest-OS is moved from Ring 0
> > [full privileges] to Ring 1 [less than full privileges]), and the
> > hypervisor wants to have full control of interrupt flags:
> >
> > some_function:
> > 	...
> > 	pushf			// Save interrupt flag.
> > 	cli			// Disable interrupts
> > 	...
> > 	...
> > 	...
> > 	popf			// Restore interrupt flag.
> > 	...
> >
> > In Ring 0, all this works just fine - but of course, we 
> don't know that
> > the guest-OS tried to disable interrupts, so we have to change
> > something. In Ring 1, the guest can't disable interrupts, so the CLI
> > instruction can be intercepted. Great. But pushf/popf is a valid
> > instruction in all four rings - it just doesn't change the interrupt
> > enable flag in the flags register if you're not allowed to use the
> > CLI/STI instructions! So, that means that interrupts are disabled
> > forever after [until an STI instruction gets found by 
> chance, at least].
> >
> >
> > And if the next bit of code is:
> >
> > 	mov	someaddress, eax		// someaddress is updated by an interrupt!
> > $1:
> > 	cmp	someaddress, eax		// Check it...
> > 	jz	$1
> >
> > Then we'd very likely never get out of there, since the 
> actual interrupt
> > causing someaddress to change is believed by the VMM to be disabled.
> >
> > There is no real way to make popf trap [other than supplying it with
> > invalid arguments in virtual 8086 mode, which isn't really 
> a practical
> > thing to do here!]
> >
> > Another problem is "hidden bits" in registers.
> >
> > Let's say this:
> >
> > 	mov	cr0, eax
> > 	mov	eax, ecx
> > 	or	$1, eax
> > 	mov	eax, cr0
> > 	mov	$0x10, eax
> > 	mov	eax, fs
> > 	mov	ecx, cr0
> >
> > 	mov	$0xF000000, eax
> > 	mov	$10000, ecx
> > $1:
> > 	mov	$0, fs:eax
> > 	add	$4, eax
> > 	dec	ecx
> > 	jnz	$1
> >
> > Let's now say that we have an interrupt that the hypervisor 
> would handle
> > in the loop in the above code. The hypervisor itself uses 
> FS for some
> > special purpose, and thus needs to save/restore the FS 
> register. When it
> > returns, the system will crash (GP fault) because the FS 
> register limit
> > is 0xFFFF (64KB) and eax is greater than the limit - but 
> the limit of FS
> > was set to 0xFFFFFFFF before we took the interrupt... Incorrect
> > behaviour like this is terribly difficult to deal with, and 
> there really
> > isn't any good way to solve these issues [other than not 
> allowing the
> > code to run when it does "funny" things like this - or to 
> perform the
> > necessary code in "translation mode" - i.e. emulate each 
> instruction ->
> > slow(ish)].
> 
> Or introduce AMDV/IntelVT extensions?
> 
> > > I'm sure there can't be a performance issue, as this
> > > virtualization doesn't
> > > occur on the physical resource level, but is (should be)
> > > rather implemented
> > > as some sort of a multiplexed routing algorithm, I think :)
> >
> > I'm not entirely sure what this statement is trying to say, but as I
> > understand the situation, performance is entirely the 
> reason why the Xen
> > paravirtual model was implemented - all other VMM's are 
> slower [although
> > it's often hard to prove that, since for example Vmware 
> have the rule
> > that they have to give permission before publishing 
> benchmarks of their
> > product, and of course that permission would only be given in cases
> > where there is some benefit to them].
> >
> > One of the obvious reasons for para-virtual being better than full
> > virtualization is that it can be used in a "batched" mode. 
> Let's say we
> > have some code that does this:
> >
> > ...
> > 	p = malloc(2000 * 4096);
> > ...
> >
> > Let's then say that the guts of malloc ends up in something 
> like this:
> >
> > map_pages_to_user(...)
> > {
> > 	for(v = random_virtual_address, p = start_page; p < end_page; p++, v+=4096)
> > 		map_one_page_to_user(p, v);
> > }
> >
> > In full virtualization, we have no way to understand that someone is
> > mapping 2000 pages to the same user-process in one guest, 
> we'd just see
> > writes to the page-table one page at a time.
> >
> > In the para-virtual case, we could do something like:
> > map_pages_to_user(...)
> > {
> > 	hypervisor_map_pages_to_user(current_process, start_page, end_page,
> > 	                             random_virtual_address);
> > }
> >
> > Now, the hypervisor knows "the full story" and can map all 
> those pages
> > in one go - much quicker, I would say. There's still more 
> work than in
> > the native case, but it's much closer to the native case.
> 
> Sure, but wouldn't this be for the price of losing guest-OS 
> transparency?

Life is full of compromises between one ideal solution and another. In
an ideal world, virtualization wouldn't cost anything, but it does.

Losing guest-OS transparency when the guest-OS is open-source isn't
really a big issue, in my opinion. However, if you haven't got
source-code readily available, it becomes a big issue - since without
source code, it gets much harder to make the necessary modifications
(probably to the extent that it's actually IMPOSSIBLE to make them in a
sane and reliable manner). 

There is no doubt that para-virtualization is one viable solution to the
virtualization problem, but it's not the ONLY solution. Each user has a
choice: Recompile and get performance, or run unmodified code at lower
performance. 
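
To put a number on the batching argument quoted earlier in this
message, here is a small C model that only counts transitions into the
hypervisor. It is a sketch under assumptions: the function names and
the hv_stats structure are invented for illustration, and a real
implementation would of course do actual page-table work in both paths.

```c
#include <assert.h>
#include <stddef.h>

/* Bookkeeping only: how often we entered the hypervisor, and for how many pages. */
struct hv_stats {
    unsigned long entries;
    unsigned long pages_mapped;
};

/* Unmodified guest: each page-table write is seen (and trapped) on its own. */
void map_pages_trapped(struct hv_stats *s, size_t npages)
{
    assert(s != NULL);
    for (size_t i = 0; i < npages; i++) {
        s->entries++;       /* one trap per page-table entry written */
        s->pages_mapped++;
    }
}

/* Para-virtual guest: one (hypothetical) batched call describes the whole range. */
void map_pages_batched(struct hv_stats *s, size_t npages)
{
    assert(s != NULL);
    s->entries++;           /* a single entry into the hypervisor */
    s->pages_mapped += npages;
}
```

For the 2000-page malloc example, the first function enters the
hypervisor 2000 times and the second once, for the same number of
pages mapped. The model says nothing about the cost per entry; it only
shows where the saving comes from.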

--
Mats

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Questioning the Xen Design of the VMM
  2006-08-09 12:53       ` Al Boldi
  2006-08-09 13:28         ` Petersson, Mats
@ 2006-08-10 11:20         ` Daniel Stodden
  1 sibling, 0 replies; 20+ messages in thread
From: Daniel Stodden @ 2006-08-10 11:20 UTC (permalink / raw)
  To: Al Boldi; +Cc: xen-devel


On Wed, 2006-08-09 at 15:53 +0300, Al Boldi wrote:
> Petersson, Mats wrote:
> > > > Al Boldi wrote:
> > > > > I may be missing something, but why should the Xen-design
> > > > > require the guest to be patched?
> >
> > The main reason to use a para-virtual kernel is that it performs better
> > than the fully virtualized version.
> >
> > > So HVM solves the problem, but why can't this layer be implemented in
> > > software?
> >
> > It CAN, and has been done.
> 
> You mean full virtualization using binary translation in software?
> 
> My understanding was, that HVM implies full virtualization without the need 
> for binary translation in software.
> 
> > It is however, a little bit difficult to
> > cover some of the "strange" corner cases, as the x86 processor wasn't
> > really designed to handle virtualization natively [until these
> > extensions were added].
> 
> You mean AMDV/IntelVT extensions?
> 
> If so, then these extensions don't actively participate in the act of 
> virtualization, but rather fix some x86-arch shortcomings, that make it 
> easier for software (i.e. Xen) to virtualize, thus circumventing the need to 
> do binary translation.  Is this a correct reading?

they fix the issues, removing the general need for binary translation,
but go well beyond that as well.

a comparatively simple example of where it goes beyond is privilege
levels. basic system virtualization would just move the guest kernel to
a nonprivileged level to maintain control in the vmm. so you'd have the
hypervisor in supervisor mode (that's why it's called a hypervisor), and
both guest kernel and applications in user mode [1]. [should note that
xen makes a difference here, using x86 privilege levels which are more
complex].

what vtx does is keep the privilege rings in protected mode untouched
by the virtualization features. instead, two whole new modes are added:
'vmx root' and 'vmx non-root'. the former applies to the vmm, the latter
to the guests. _both_ of these basically implement the protected mode as
it used to be. so hardware virtualization won't have to muck around with
the regular privilege system.

one example where this is particularly useful is hosted vmms, e.g.
vmware workstation. imagine a natively-running operating system and a
machine monitor running on top of (or integrated with) that. the system
would run in vmx-root mode, with regular application processes in ring3
as they used to be. additionally, one may start guest systems on top of
the vmm, which again are implemented on top of a regular x86 protected
mode, but in non-root mode.

all of the above 
 - can be functionally achieved _efficiently_ without hardware 
   extensions like vmx
 - but ONLY as long as the privilege architecture supports   
   virtualization
 - x86 does NOT [2]
   the pushf/popf outlined is an example of where the problems are
   - binary translation is a way to do it anyway, but does not count
     as 'efficient'.

with vmx
  - efficient virtualization is achieved.
  - some things just get additional flexibility. 
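
the pushf/popf problem mentioned in the list above can be modelled in a
few lines of C. this is a deliberately simplified sketch: the struct
and function names are made up, only the interrupt flag is modelled,
and other eflags bits and modes (vm86, pvi) are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_IF (1u << 9)   /* interrupt-enable flag, bit 9 of EFLAGS */

/* Minimal CPU state for this model. */
struct cpu_model {
    unsigned cpl;       /* current privilege level (ring), 0..3 */
    unsigned iopl;      /* I/O privilege level, 0..3 */
    uint32_t eflags;
};

/* CLI: returns false to stand for a #GP fault when CPL > IOPL. */
bool do_cli(struct cpu_model *c)
{
    assert(c->cpl <= 3 && c->iopl <= 3);
    if (c->cpl > c->iopl)
        return false;           /* faults, so a vmm gets to see it */
    c->eflags &= ~FLAG_IF;
    return true;
}

/* POPF: no fault in this model; at CPL > IOPL the old IF is silently kept. */
void do_popf(struct cpu_model *c, uint32_t popped)
{
    uint32_t keep = (c->cpl > c->iopl) ? FLAG_IF : 0;
    c->eflags = (popped & ~keep) | (c->eflags & keep);
}
```

with a guest kernel pushed to ring 1 and iopl 0, do_cli() reports a
fault the vmm can act on, but do_popf() returns normally and leaves IF
as it was. nothing tells the vmm that the guest tried to change the
flag, which is the non-virtualizable behaviour [2] describes.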

related reading:

[1] popek & goldberg: Formal Requirements for Virtualizable Third
Generation Architectures, 1974 (!)

[2] robin & irvine: Analysis of the Intel Pentium's Ability to Support
a Secure Virtual Machine Monitor, 2000

both should be available from the web if you dig around long enough. :)

> > This is why you end up with binary translation
> > in VMWare for example. For example, let's say that we use the method of
> > "ring compression" (which is when the guest-OS is moved from Ring 0
> > [full privileges] to Ring 1 [less than full privileges]), and the
> > hypervisor wants to have full control of interrupt flags:
> >
> > some_function:
> > 	...
> > 	pushf			// Save interrupt flag.
> > 	cli			// Disable interrupts
> > 	...


regards,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Questioning the Xen Design of the VMM
  2006-08-09 13:28         ` Petersson, Mats
@ 2006-08-10 14:55           ` Al Boldi
  2006-08-10 15:42             ` Petersson, Mats
  0 siblings, 1 reply; 20+ messages in thread
From: Al Boldi @ 2006-08-10 14:55 UTC (permalink / raw)
  To: Petersson, Mats, Daniel Stodden; +Cc: xen-devel

Petersson, Mats wrote:
> > Al Boldi wrote:
> > You mean AMDV/IntelVT extensions?
>
> Yes.
>
> > If so, then these extensions don't actively participate in the act of
> > virtualization, but rather fix some x86-arch shortcomings, that make it
> > easier for software (i.e. Xen) to virtualize, thus circumventing the need
> > to do binary translation.  Is this a correct reading?
>
> Not sure what your exact meaning is here.
>
> What do you mean by "actively participate in the act of virtualization".

Is there any logic involved, that does some kind of a translation/control?

It seems not.

Daniel Stodden wrote:
>
> they fix the issues, removing the general need for binary translation,
> but go well beyond that as well.
>
> a comparatively simple example of where it goes beyond are privilege
> levels. basic system virtualization would just move the guest kernel to
> a nonprivileged level to maintain control in the vmm. so you'd have the
> hypervisor in supervisor mode (that's why it's called a hypervisor), and
> both guest kernel and applications in user mode [1]. [should note that
> xen makes a difference here, using x86 privilege levels which are more
> complex].
>
> what vtx does is keeping the privilege rings in protected mode untouched
> by the virtualization features. instead, two whole new modes are added:
> 'vmx root' and 'vmx non-root'. the former applies to the vmm, the latter
> to the guests. _both_ of these basically implement the protected mode as
> it used to be. so hardware virtualization won't have to muck around with
> the regular privilege system.
>
> one example where this is particularly useful are hosted vmms, e.g.
> vmware workstation. imagine a natively-running operating system and a
> machine monitor running on top of (or integrated with) that. the system
> would run in vmx-root mode. regular application processes there in ring3
> as they used to. additionally, one may start guest systems on top of the
> vmm, which again are implemented on top a regular x86 protected mode,
> but in non-root mode.
>
> all of the above
>  - can be functionally achieved _efficiently_ without hardware
>    extensions like vmx
>  - but ONLY as long as the privilege architecture supports
>    virtualization
>  - x86 does NOT [2]
>    the pushf/popf outlined is an example of where the problems are
>    - binary translation is a way to do it anyway, but does not count
>      as 'efficient'.
>
> with vmx
>   - efficient virtualization is achieved.
>   - some things just get additional flexibility.

So VMX doesn't really virtualize anything, but rather enables software to 
perform virtualization more efficiently.

Petersson, Mats wrote:
> There is no doubt that para-virtualization is one viable solution to the
> virtualization problem, but it's not the ONLY solution. Each user has a
> choice: Recompile and get performance, or run unmodified code at lower
> performance.

Agreed, but how much lower performance are we talking about in an HVM vs 
para-virtualized scenario?


Thanks!

--
Al

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Questioning the Xen Design of the VMM
  2006-08-09 12:49     ` Daniel Stodden
@ 2006-08-10 14:57       ` Al Boldi
  2006-08-10 15:53         ` Daniel Stodden
  0 siblings, 1 reply; 20+ messages in thread
From: Al Boldi @ 2006-08-10 14:57 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Petersson, Mats, xen-devel

Daniel Stodden wrote:
> On Tue, 2006-08-08 at 17:10 +0300, Al Boldi wrote:
> > So HVM solves the problem, but why can't this layer be implemented in
> > software?
>
> the short answer at the cpu level is "because of the arcane nature of
> the x86 architecture" :/

Which AMDV/IntelVT supposedly solves?

> once the cpu problem has been solved, you'd need to emulate hardware
> resources an unmodified guest system attempts to drive. that again takes
> additional cycles. elimination of the peripheral hardware interfaces by
> putting the I/O layers on top of an abstract low-level path into the VMM
> is one of the reasons why xen is faster than others. many systems do
> this quite successfully, even for 'non-modified' guests like e.g.
> windows, by installing dedicated, virtualization aware drivers once the
> base installation went ok.

You mean "virtualization aware" drivers in the guest-OS?  Wouldn't this 
amount to a form of patching?

> > I'm sure there can't be a performance issue, as this virtualization
> > doesn't occur on the physical resource level, but is (should be) rather
> > implemented as some sort of a multiplexed routing algorithm, I think :)
>
> few device classes support resource sharing in that manner efficiently.
> peripheral devices in commodity platforms are inherently single-hosted
> and won't support unfiltered access by multiple driver instances in
> several guests.

Would this be due to the inability of the peripheral to switch contexts fast 
enough?

If so, how about an "AMDV/IntelVT" for peripherals?

> from the vmm perspective, it always boils down to emulating the device.
> however, with varying degrees of complexity regarding the translation
> of guest requests to physical access. it depends. ide, afaik is known to
> work comparatively well.

Probably because IDE follows a well defined API?

> an example of an area where it's getting more
> sportive would be network adapters.
>
> this is basically the whole problem when building virtualization layers
> for cots platforms: the device/driver landscape spreads to infinity :)
> since you'll have a hard time driving any possible combination by
> yourself, you need something else to do it. one solution are hosted
> vmms, running on top of an existing operating system. a second solution
> is what xen does: offload drivers to a modified guest system which can
> then carry the I/O load from the additional, nonprivileged guests as
> well.

Agreed; so let me rephrase the dilemma like this:
The PC platform was never intended to be used in a virtualizing scenario, and 
therefore does not contain the infrastructure to support this kind of a 
scenario efficiently, but this could easily be rectified by introducing 
simple extensions, akin to AMDV/IntelVT, on all levels of the PC hardware.

Is this a correct reading?

If so, has this been considered in the Xen design, so as to accommodate any 
future hwV/VT/VMX extensions easily and quickly?


Thanks for your input!

--
Al

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Questioning the Xen Design of the VMM
  2006-08-10 14:55           ` Al Boldi
@ 2006-08-10 15:42             ` Petersson, Mats
  0 siblings, 0 replies; 20+ messages in thread
From: Petersson, Mats @ 2006-08-10 15:42 UTC (permalink / raw)
  To: Al Boldi, Daniel Stodden; +Cc: xen-devel

 

> -----Original Message-----
> From: Al Boldi [mailto:a1426z@gawab.com] 
> Sent: 10 August 2006 15:55
> To: Petersson, Mats; Daniel Stodden
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
> 
> Petersson, Mats wrote:
> > > Al Boldi wrote:
> > > You mean AMDV/IntelVT extensions?
> >
> > Yes.
> >
> > > If so, then these extensions don't actively participate 
> in the act of
> > > virtualization, but rather fix some x86-arch 
> shortcomings, that make it
> > > easier for software (i.e. Xen) to virtualize, thus 
> circumventing the need
> > > to do binary translation.  Is this a correct reading?
> >
> > Not sure what your exact meaning is here.
> >
> > What do you mean by "actively participate in the act of 
> virtualization".
> 
> Is there any logic involved, that does some kind of a 
> translation/control?
> 
> It seems not.


AMD has announced a feature called "Nested page tables", which will allow
the translation of page-table lookups, essentially adding another layer
of address translation, so we can give the guest a "physical" map of
[0..256MB], whilst we're actually giving it some (completely) random set
of physical pages that it actually gets to use. This is not available in
the current generation of chips, but it will be in the next... 
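
The extra layer of translation can be pictured with a small C sketch.
This is a toy model under stated assumptions: single-level tables of
eight pages, and invented names throughout (nested_mmu,
nested_translate). Real nested paging walks multi-level tables in
hardware and also translates the guest's own page-table accesses, which
this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPT_PAGE_SHIFT 12
#define NPT_PAGE_MASK  ((UINT32_C(1) << NPT_PAGE_SHIFT) - 1)
#define NPT_PAGES      8            /* tiny address space for illustration */
#define NPT_INVALID    UINT32_MAX   /* marks an unmapped entry */

/* Flat tables: index is a page number, value is the page it maps to. */
struct nested_mmu {
    uint32_t guest_pt[NPT_PAGES];   /* guest virtual page -> guest "physical" page */
    uint32_t nested_pt[NPT_PAGES];  /* guest "physical" page -> machine frame */
};

/* Two-stage lookup; returns 0 and fills *machine_addr, or -1 if unmapped. */
int nested_translate(const struct nested_mmu *t, uint32_t gva,
                     uint32_t *machine_addr)
{
    assert(t != NULL && machine_addr != NULL);

    uint32_t vpn = gva >> NPT_PAGE_SHIFT;
    if (vpn >= NPT_PAGES)
        return -1;

    uint32_t gpn = t->guest_pt[vpn];        /* stage 1: owned by the guest */
    if (gpn >= NPT_PAGES)                   /* also catches NPT_INVALID */
        return -1;

    uint32_t mfn = t->nested_pt[gpn];       /* stage 2: owned by the hypervisor */
    if (mfn == NPT_INVALID)
        return -1;

    *machine_addr = (mfn << NPT_PAGE_SHIFT) | (gva & NPT_PAGE_MASK);
    return 0;
}
```

The guest only ever edits guest_pt and believes its "physical" pages
are contiguous from zero; the hypervisor alone decides what nested_pt
says, so the machine frames behind them can be anywhere.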

I believe that Intel has at least publicly stated that they have a
similar solution in the pipeline. 

We (AMD) have also publicly talked about IOMMU, which will help hardware
virtualization. I'll make more comments on that in reply to your other
posting. 

So, in the current generation, shadow-page tables are used, so the
actual page-table used by the guest is write-protected, and when
write-faults occur, we replace the data written by the guest with a
translated value in a second page-table, which the guest never sees, but
the processor uses to translate the memory accesses. It's a fair bit
more work, but the guest is entirely unaware of the REAL PHYSICAL
address it lives at. 
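
The shadow page-table scheme described above might be sketched in C as
follows. This is an illustrative model, not Xen's code: the structure,
the p2m array name and the handler name are assumptions, tables are
flat and tiny, and a real shadow implementation also has to handle
multi-level tables, permissions and invalidation.

```c
#include <assert.h>
#include <stdint.h>

#define SPT_ENTRIES 8
#define SPT_INVALID UINT32_MAX

struct shadow_mmu {
    uint32_t guest_pt[SPT_ENTRIES];   /* what the guest wrote (write-protected) */
    uint32_t shadow_pt[SPT_ENTRIES];  /* what the processor actually walks */
    uint32_t p2m[SPT_ENTRIES];        /* guest "physical" page -> machine frame */
    unsigned long write_faults;       /* how often we had to step in */
};

/*
 * Called from the write-fault path when the guest stores guest-physical
 * page number `gpn` into page-table slot `idx`.
 * Returns 0 on success, -1 if the guest named a page it does not own.
 */
int on_guest_pte_write(struct shadow_mmu *m, unsigned idx, uint32_t gpn)
{
    assert(idx < SPT_ENTRIES);
    m->write_faults++;

    if (gpn >= SPT_ENTRIES || m->p2m[gpn] == SPT_INVALID)
        return -1;

    m->guest_pt[idx]  = gpn;           /* the guest's view keeps its own value */
    m->shadow_pt[idx] = m->p2m[gpn];   /* the hardware sees the real frame */
    return 0;
}
```

Every guest page-table write costs one fault in this model, which is
the "fair bit more work" compared with a hardware second translation
layer.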

> 
> Daniel Stodden wrote:
> >
> > they fix the issues, removing the general need for binary 
> translation,
> > but go well beyond that as well.
> >
> > a comparatively simple example of where it goes beyond are privilege
> > levels. basic system virtualization would just move the 
> guest kernel to
> > a nonprivileged level to maintain control in the vmm. so 
> you'd have the
> > hypervisor in supervisor mode (that's why it's called a 
> hypervisor), and
> > both guest kernel and applications in user mode [1]. 
> [should note that
> > xen makes a difference here, using x86 privilege levels 
> which are more
> > complex].
> >
> > what vtx does is keeping the privilege rings in protected 
> mode untouched
> > by the virtualization features. instead, two whole new 
> modes are added:
> > 'vmx root' and 'vmx non-root'. the former applies to the 
> vmm, the latter
> > to the guests. _both_ of these basically implement the 
> protected mode as
> > it used to be. so hardware virtualization won't have to 
> muck around with
> > the regular privilege system.
> >
> > one example where this is particularly useful are hosted vmms, e.g.
> > vmware workstation. imagine a natively-running operating 
> system and a
> > machine monitor running on top of (or integrated with) 
> that. the system
> > would run in vmx-root mode. regular application processes 
> there in ring3
> > as they used to. additionally, one may start guest systems 
> on top of the
> > vmm, which again are implemented on top a regular x86 
> protected mode,
> > but in non-root mode.
> >
> > all of the above
> >  - can be functionally achieved _efficiently_ without hardware
> >    extensions like vmx
> >  - but ONLY as long as the privilege architecture supports
> >    virtualization
> >  - x86 does NOT [2]
> >    the pushf/popf outlined is an example of where the problems are
> >    - binary translation is a way to do it anyway, but does not count
> >      as 'efficient'.
> >
> > with vmx
> >   - efficient virtualization is achieved.
> >   - some things just get additional flexibility.
> 
> So VMX doesn't really virtualize anything, but rather enables 
> software to 
> perform virtualization more efficiently.

Yes. 

> 
> Petersson, Mats wrote:
> > There is no doubt that para-virtualization is one viable 
> solution to the
> > virtualization problem, but it's not the ONLY solution. 
> Each user has a
> > choice: Recompile and get performance, or run unmodified 
> code at lower
> > performance.
> 
> Agreed, but how much lower performance are we talking about 
> in an HVM vs 
> para-virtualized scenario?

Unfortunately, this is not a trivial question to answer, since it
depends very much on how much hardware access is involved in the
system. I'm sure one can conceive both cases that are 10x slower and
other cases where you get 98-99.9% of the original performance in
the virtual machine. 

Para-virtual is supposed to be around 95-98% of native performance - but
again it depends on the workload what the exact figures are -
pathological cases can probably be found. 

A large percentage of any slowdown from HVM is caused by the way
hardware is emulated - using qemu-dm to model the virtual hardware. If
you have a disk benchmark, it's quite feasible that the native machine
has 10x or so the throughput of the HVM system. 

--
Mats

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Questioning the Xen Design of the VMM
  2006-08-10 14:57       ` Al Boldi
@ 2006-08-10 15:53         ` Daniel Stodden
  2006-08-10 16:34           ` Petersson, Mats
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Stodden @ 2006-08-10 15:53 UTC (permalink / raw)
  To: Al Boldi; +Cc: xen-devel


On Thu, 2006-08-10 at 17:57 +0300, Al Boldi wrote:

> > > So HVM solves the problem, but why can't this layer be implemented in
> > > software?
> >
> > the short answer at the cpu level is "because of the arcane nature of
> > the x86 architecture" :/
> 
> Which AMDV/IntelVT supposedly solves?

regarding the virtualization issue, yes.

> > once the cpu problem has been solved, you'd need to emulate hardware
> > resources an unmodified guest system attempts to drive. that again takes
> > additional cycles. elimination of the peripheral hardware interfaces by
> > putting the I/O layers on top of an abstract low-level path into the VMM
> > is one of the reasons why xen is faster than others. many systems do
> > this quite successfully, even for 'non-modified' guests like e.g.
> > windows, by installing dedicated, virtualization aware drivers once the
> > base installation went ok.
> 
> You mean "virtualization aware" drivers in the guest-OS?  Wouldn't this 
> amount to a form of patching?

yes, strictly speaking it is a modification. but one based upon usually
well-defined interfaces, and it does not require parsing opcodes and
patching code segments.

otoh, one which obviously needs to be reiterated for any additional
guest os family.

> > > I'm sure there can't be a performance issue, as this virtualization
> > > doesn't occur on the physical resource level, but is (should be) rather
> > > implemented as some sort of a multiplexed routing algorithm, I think :)
> >
> > few device classes support resource sharing in that manner efficiently.
> > peripheral devices in commodity platforms are inherently single-hosted
> > and won't support unfiltered access by multiple driver instances in
> > several guests.
> 
> Would this be due to the inability of the peripheral to switch contexts fast 
> enough?

maybe. more important: commodity peripherals typically wouldn't
sufficiently implement security and isolation. you certainly won't
'route' arbitrary block I/O from a guest system to your disk controller
without further investigation and translation. it may gladly overwrite
your host partition or whatever resource you granted elsewhere.

> If so, how about a "AMDV/IntelVT" for peripherals?

good idea, and actually practical. unfortunately, this is where it's
getting expensive.

> > from the vmm perspective, it always boils down to emulating the device.
> > however, with varying degrees of complexity regarding the translation
> > of guest requests to physical access. it depends. ide, afaik is known to
> > work comparatively well.
> 
> Probably because IDE follows a well defined API?

yes. however, i'm not an ide guy. 
 
> > an example of an area where it's getting more
> > sportive would be network adapters.
> >
> > this is basically the whole problem when building virtualization layers
> > for cots platforms: the device/driver landscape spreads to infinity :)
> > since you'll have a hard time driving any possible combination by
> > yourself, you need something else to do it. one solution are hosted
> > vmms, running on top of an existing operating system. a second solution
> > is what xen does: offload drivers to a modified guest system which can
> > then carry the I/O load from the additional, nonprivileged guests as
> > well.
> 
> Agreed; so let me rephrase the dilemma like this:
> The PC platform was never intended to be used in a virtualizing scenario, and 
> therefore does not contain the infrastructure to support this kind of a 
> scenario efficiently, but this could easily be rectified by introducing 
> simple extensions, akin to AMDV/IntelVT, on all levels of the PC hardware.
> 
> Is this a correct reading?

yes, with restrictions. at this point in time, it is not correct from an
economic standpoint. the whole "virtualization renaissance" we've
been experiencing for the last 3 years or so builds upon the fact that
PC hardware has

	1. become terribly powerful, compared to what the workloads most
	   software systems run actually require.

	2. remained comparatively cheap, as it always has been.

if you start to redesign the I/O system, you're likely to raise the cost
for the overall system.

I/O virtualization down to the device level may come, but like with
processor prices, it's all a matter of "economy of scale".

hardware-assisted virtualization at various places in the architecture,
including I/O, is however a well-understood topic as well.

may i again point you to some reading matter in that area:

nair/smith: virtual machines.
http://www.amazon.de/gp/product/1558609105/028-2651277-1478934?v=glance&n=52044011

excellent textbook on many aspects of system virtualization, including
those covered by this conversation so far.

> If so, has this been considered in the Xen design, so as to accommodate any 
> future hwV/VT/VMX extensions easily and quickly?

vmx is all about processor virtualization. additional topics would
include memory virtualization (required, and available in the form of
regular virtual memory; but might see additional improvements.) and I/O
virtualization. i see no reason why those could not be supported by
xen, as they are subsystems which have been backed in a portable and
scalable fashion in the operating system landscape for many years now. so
the topic of how to accommodate changes in that area is not particularly
new.

regards,
daniel

 
-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Questioning the Xen Design of the VMM
  2006-08-10 15:53         ` Daniel Stodden
@ 2006-08-10 16:34           ` Petersson, Mats
  2006-08-10 18:07             ` Daniel Stodden
  0 siblings, 1 reply; 20+ messages in thread
From: Petersson, Mats @ 2006-08-10 16:34 UTC (permalink / raw)
  To: Daniel Stodden, Al Boldi; +Cc: xen-devel

 

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
> Daniel Stodden
> Sent: 10 August 2006 16:54
> To: Al Boldi
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
> 
> On Thu, 2006-08-10 at 17:57 +0300, Al Boldi wrote:
> 
> > > > So HVM solves the problem, but why can't this layer be implemented in
> > > > software?
> > >
> > > the short answer at the cpu level is "because of the arcane nature of
> > > the x86 architecture" :/
> > 
> > Which AMDV/IntelVT supposedly solves?
> 
> regarding the virtualization issue, yes.
> 
> > > once the cpu problem has been solved, you'd need to emulate hardware
> > > resources an unmodified guest system attempts to drive. that again takes
> > > additional cycles. elimination of the peripheral hardware interfaces by
> > > putting the I/O layers on top of an abstract low-level path into the VMM
> > > is one of the reasons why xen is faster than others. many systems do
> > > this quite successfully, even for 'non-modified' guests like e.g.
> > > windows, by installing dedicated, virtualization aware drivers once the
> > > base installation went ok.
> > 
> > You mean "virtualization aware" drivers in the guest-OS?  Wouldn't this 
> > amount to a form of patching?
> 
> yes, strictly speaking it is a modification. but one based upon usually
> well-defined interfaces, and it does not require parsing opcodes and
> patching code segments.

Exactly. There's a big difference between applying patches to the existing binary, and adding new code by installing a driver once the system is running. 

Compare, for example, installing Windows: it may not know how to drive nVidia's or ATI's latest graphics card, but you can install a new driver for it. You could, alternatively, perhaps patch the existing nVidia driver to make it work for the latest card, but most people prefer to grab the latest driver from www.nvidia.com or so...  

In this case, we're installing a driver that can talk to the hypervisor via a defined interface, and doing so allows us to get "fast" disk access, network access or even graphics.
> 
> otoh, one which obviously needs to be reiterated for any additional
> guest os family.
> 
> > > > I'm sure there can't be a performance issue, as this virtualization
> > > > doesn't occur on the physical resource level, but is (should be) rather
> > > > implemented as some sort of a multiplexed routing algorithm, I think :)
> > >
> > > few device classes support resource sharing in that manner efficiently.
> > > peripheral devices in commodity platforms are inherently single-hosted
> > > and won't support unfiltered access by multiple driver instances in
> > > several guests.
> > 
> > Would this be due to the inability of the peripheral to switch contexts fast 
> > enough?
> 
> maybe. more important: commodity peripherals typically wouldn't
> sufficiently implement security and isolation. you certainly won't
> 'route' arbitrary block I/O from a guest system to your disk controller
> without further investigation and translation. it may gladly overwrite
> your host partition or whatever resource you granted elsewhere.

Context-switching is only part of the problem, as Daniel says. IOMMU is a technology that is coming in future products from AMD (and I'm sure Intel are working on such products as well; IBM already have a chipset in production for some of the PowerPC and x86-based servers). This will solve address translation, but it won't solve problems with sharing devices - that will require either some form of context-switching of the device (which may be acceptable for some devices) or hardware changes to allow multi-porting, i.e. multiple ports within the device that each present a separate interface. 

However, context switching of external devices is DIFFICULT for several reasons, one being: it's not always possible to read the "context" of a device... Many devices have write-only fields, and other types of "can't read it back" type of behaviour. 

For example, in an IDE controller, if the system has just issued a non-DMA transfer of a sector, waited for the READY to come back from the IDE controller, and started writing bytes to the IDE interface, it can't stop writing bytes until it has reached the number the interface expects (usually 512 bytes). There is also, AFAIK, no way to tell how many bytes are left to write (or read, for transfers in the opposite direction). This is obviously "braindead" hardware, but it just so happens that much of the PC hardware, even in modern varieties, is pretty much "braindead" - i.e. it has no more intelligence in the device than absolutely necessary. I'm not sure how easy it is to interrogate the status of a DMA transfer, as I've never really dealt much with those. 

Another complexity in context-switching devices is that it really can't be done on-the-fly, but must be implemented on an "on-demand" basis [anything else would be FAR too slow - we don't want to do that many operations over the PCI bus that often].  
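
To make the two points above a bit more concrete (no way to read back the transfer state, and switching only on demand), here is a minimal toy model in Python. It is only a sketch under stated assumptions: ToyPioDisk and OnDemandSwitch are hypothetical names made up for illustration, they do not correspond to any real IDE register layout or Xen code, and the only visible device state is a single "transfer in progress" flag.

```python
SECTOR_BYTES = 512  # the usual sector size mentioned above


class ToyPioDisk:
    """Hypothetical PIO device: a sector is written one 16-bit word at a time."""

    def __init__(self):
        self._bytes_left = 0    # internal counter, deliberately not readable
        self._buf = bytearray()
        self.sectors = []       # completed sectors, kept for inspection only

    @property
    def data_request(self):
        """The only visible state: is a transfer still in progress?"""
        return self._bytes_left > 0

    def start_write(self):
        if self.data_request:
            raise RuntimeError("previous transfer not finished")
        self._bytes_left = SECTOR_BYTES
        self._buf = bytearray()

    def write_word(self, word):
        if not self.data_request:
            raise RuntimeError("no transfer in progress")
        self._buf += word.to_bytes(2, "little")
        self._bytes_left -= 2
        if self._bytes_left == 0:
            self.sectors.append(bytes(self._buf))


class OnDemandSwitch:
    """Hand the device to another guest only when asked, and only when idle."""

    def __init__(self, disk):
        self.disk = disk
        self.owner = None
        self.switches = 0

    def acquire(self, guest):
        if self.owner == guest:
            return True     # same owner: no switch, no extra bus traffic
        if self.disk.data_request:
            return False    # mid-sector: the state cannot be saved or restored
        self.owner = guest
        self.switches += 1
        return True
```

With this model, a second guest asking for the device in the middle of a sector is simply refused until the first guest has written all 512 bytes, which is the only safe point to switch when the remaining count cannot be read back.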

> 
> > If so, how about a "AMDV/IntelVT" for peripherals?
> 
> good idea, and actually practical. unfortunately, this is where it's
> getting expensive.

IOMMU isn't particularly expensive, but multiple ports within a device can get pretty complicated - and that is only really suitable for higher end devices in the first place.

> 
> > > from the vmm perspective, it always boils down to emulating the device.
> > > however, with varying degrees of complexity regarding the translation
> > > of guest requests to physical access. it depends. ide, afaik is known to
> > > work comparatively well.
> > 
> > Probably because IDE follows a well defined API?
> 
> yes. however, i'm not an ide guy. 

Strictly speaking, I'm not an IDE guy either, but I do know a fair bit about it, as I've written a bunch of test-code that uses the IDE interface to exercise the Xen-HVM/SVM code-paths involved with IO operations.

The IDE interface is pretty straightforward and simple, which makes it easy to emulate. 

Other devices may have more complex interfaces, that are harder to write emulation code for. 
>  
> > > an example of an area where it's getting more
> > > sportive would be network adapters.
> > >
> > > this is basically the whole problem when building virtualization layers
> > > for cots platforms: the device/driver landscape spreads to infinity :)
> > > since you'll have a hard time driving any possible combination by
> > > yourself, you need something else to do it. one solution are hosted
> > > vmms, running on top of an existing operating system. a second solution
> > > is what xen does: offload drivers to a modified guest system which can
> > > then carry the I/O load from the additional, nonprivileged guests as
> > > well.
> > 
> > Agreed; so let me rephrase the dilemma like this:
> > The PC platform was never intended to be used in a virtualizing scenario, and 
> > therefore does not contain the infrastructure to support this kind of a 
> > scenario efficiently, but this could easily be rectified by introducing 
> > simple extensions, akin to AMDV/IntelVT, on all levels of the PC hardware.
> > 
> > Is this a correct reading?
> 
> yes, with restrictions. at this point in time, it is not correct from an
> economic standpoint. the whole "virtualization renaissance" we've
> been experiencing for the last 3 years or so builds upon the fact that
> PC hardware has
> 
> 	1. become terribly powerful, compared to what the workloads most
> 	   software systems run actually require.
> 
> 	2. remained comparatively cheap, as it always has been.
> 
> if you start to redesign the I/O system, you're likely to raise the cost
> for the overall system.
> 
> I/O virtualization down to the device level may come, but like with
> processor prices, it's all a matter of "economy of scale".
> 
> hardware-assisted virtualization at various places in the architecture,
> including I/O, is however a well-understood topic as well.
> 
> may i again point you to some reading matter in that area:
> 
> nair/smith: virtual machines.
> http://www.amazon.de/gp/product/1558609105/028-2651277-1478934?v=glance&n=52044011
> 
> excellent textbook on many aspects of system virtualization, including
> those covered by this conversation so far.
> 
> > If so, has this been considered in the Xen design, so as to accommodate any 
> > future hwV/VT/VMX extensions easily and quickly?
> 
> vmx is all about processor virtualization. additional topics would
> include memory virtualization (required, and available in the form of
> regular virtual memory; but might see additional improvements.) and I/O
> virtualization. i see no reason why those could not be supported by
> xen, as they are subsystems which have been backed in a portable and
> scalable fashion in the operating system landscape for many years now. so
> the topic of how to accommodate changes in that area is not particularly
> new.
> 
> regards,
> daniel
> 
>  
> -- 
> Daniel Stodden
> LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
> Institut für Informatik der TU München             D-85748 Garching
> http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
> PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Questioning the Xen Design of the VMM
  2006-08-10 16:34           ` Petersson, Mats
@ 2006-08-10 18:07             ` Daniel Stodden
  2006-08-11  8:41               ` Petersson, Mats
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Stodden @ 2006-08-10 18:07 UTC (permalink / raw)
  To: Al Boldi, Petersson, Mats; +Cc: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1579 bytes --]

On Thu, 2006-08-10 at 18:34 +0200, Petersson, Mats wrote:

> Context-switching is only part of the problem, as Daniel says. 
> IOMMU is a technology that is coming in future products from AMD
> (and I'm sure Intel are working on such products as well. 
> IBM already have a chipset in production for some of the PowerPC 
> and x86-based servers).

i didn't have a look yet at the papers from amd, but
it may be of interest that the PCI interfaces (1998, maybe even earlier)
built by sun for their ultrasparc processors already implemented such a
beast. al, docs on the bridge should be available from sun online, if
you're interested in such things.

the basic idea being virtualization of the I/O address space, this
feature is quite cool even if you don't give a single thought to
system virtualization (sun probably didn't at that point). getting your
hands on contiguous, dma-able memory areas can be a permanent headache
in os and device driver design if your peripheral bus accesses physical
memory untranslated. put a translation table in between and upstream
transactions become a non-issue, without offloading any additional logic
into the peripheral bus interface.
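
as a rough illustration of the translation-table idea, here is a minimal
python sketch. it is a toy model only: ToyIoTranslationTable, the 4 KB
page size and the page numbers are assumptions made up for this example,
and it does not describe the actual sun bridge or any amd/intel iommu.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ToyIoTranslationTable:
    """Hypothetical table sitting between the peripheral bus and memory."""

    def __init__(self):
        self._entries = {}  # I/O page number -> physical page number

    def map_buffer(self, io_base, phys_pages):
        """Map scattered physical pages at consecutive I/O addresses."""
        if io_base % PAGE_SIZE:
            raise ValueError("io_base must be page aligned")
        first = io_base // PAGE_SIZE
        for i, phys_page in enumerate(phys_pages):
            self._entries[first + i] = phys_page

    def translate(self, io_addr):
        """Translate one bus address, as the bridge would for a DMA access."""
        page, offset = divmod(io_addr, PAGE_SIZE)
        if page not in self._entries:
            raise LookupError(f"unmapped I/O address {io_addr:#x}")
        return self._entries[page] * PAGE_SIZE + offset


# three scattered physical pages appear as one contiguous 12 KB I/O range
table = ToyIoTranslationTable()
table.map_buffer(0x100000, [7, 3, 42])
```

the device only ever sees the contiguous range starting at 0x100000,
while the pages behind it may live anywhere. an access outside the
mapped range fails the lookup, which is also the hook for the isolation
concern raised earlier in the thread.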

mats, i suppose amd's iommu solves this as well?

regards,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Questioning the Xen Design of the VMM
  2006-08-10 18:07             ` Daniel Stodden
@ 2006-08-11  8:41               ` Petersson, Mats
  0 siblings, 0 replies; 20+ messages in thread
From: Petersson, Mats @ 2006-08-11  8:41 UTC (permalink / raw)
  To: Daniel Stodden, Al Boldi, Petersson; +Cc: xen-devel

 

> -----Original Message-----
> From: Daniel Stodden [mailto:stodden@cs.tum.edu] 
> Sent: 10 August 2006 19:08
> To: Al Boldi; Petersson@nmail.informatik.tu-muenchen.de; 
> Petersson, Mats
> Cc: xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] Questioning the Xen Design of the VMM
> 
> On Thu, 2006-08-10 at 18:34 +0200, Petersson, Mats wrote:
> 
> > Context-switching is only part of the problem, as Daniel says. 
> > IOMMU is a technology that is coming in future products from AMD
> > (and I'm sure Intel are working on such products as well. 
> > IBM already have a chipset in production for some of the PowerPC 
> > and x86-based servers).
> 
> i didn't have a look yet at the papers from amd, but
> it may be of interest that the PCI interfaces (1998, maybe even earlier)
> built by sun for their ultrasparc processors already implemented such a
> beast. al, docs on the bridge should be available from sun online, if
> you're interested in such things.
> 
> the basic idea being virtualization of the I/O address space, this
> feature is quite cool even if you don't give a single thought to
> system virtualization (sun probably didn't at that point). getting your
> hands on contiguous, dma-able memory areas can be a permanent headache
> in os and device driver design if your peripheral bus accesses physical
> memory untranslated. put a translation table in between and upstream
> transactions become a non-issue, without offloading any additional logic
> into the peripheral bus interface.
> 
> mats, i suppose amd's iommu solves this as well?

Yes, of course [see note below]. The only thing it doesn't solve is if the OS decides to swap the pages out - so there still needs to be a call to say "lock this area into memory, don't allow it to move or be swapped out" - but that's trivial compared to "make sure this [large] block of memory is contiguous so that it can be transferred to the hard-disk as one transfer". 

Of course, modern devices cope with this by using scatter/gather technology... 

Note: It does somewhat depend on how you implement the software to control the IOMMU and how you deal with memory allocation above and below this layer. Since the idea of the IOMMU, when used in conjunction with a VMM, is to translate guest physical addresses to machine physical addresses, it doesn't necessarily help driver-writers as such, because all it does is present the guest OS and the physical device with "the same view" of physical memory. So let's say that we give a guest OS a mapping of 0..256MB that isn't contiguous at the machine physical level: the guest's physical view would still be contiguous [aside from the regular PC hardware holes, of course], but the OS would still have to give contiguous regions to the hardware [assuming the HW hasn't got scatter/gather], since the guest doesn't have control over the IOMMU itself. Just like nested paging gives the guest its own level of paging on top of an already virtual address, the IOMMU gives the guest a "virtual" PCI-space that matches its guest-physical view. 

So, let's make a trivial example [using contiguous machine physical range - which may not be the case in real life]: 
Guest       Machine
0..256MB    256..512MB

IOMMU would then map the 256MB of guest to the relevant machine physical address. 

In a driver, we are given the virtual address 0x12345000 and a length of 12K (three pages) as a buffer for a PCI device. The driver will do a virt_to_phys() call to the OS, which gives it an address in the 0..256MB range, say 0x1005000 - this address can then be given to the PCI device, and the IOMMU translates it to the machine physical address. But if the page at 0x12346000 isn't mapped to the next guest-physical address (0x1006000), then you'd still have to deal with that in some way [presumably by allocating a new buffer with a "please make this contiguous" flag and copying the data, or by sending the data in 4KB chunks]. 
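
The example above can be sketched in a few lines of Python, just to show the two translation levels side by side. This is a toy model and not driver code: make_virt_to_phys, dma_chunks and iommu_translate are hypothetical helpers, the page table contents are made up so that the second page is not guest-physically contiguous with the first, and the IOMMU is reduced to the fixed 256MB offset from the table above.

```python
PAGE = 0x1000                 # 4KB pages, as in the example
MB = 1024 * 1024
GUEST_SIZE = 256 * MB         # guest physical range 0..256MB
IOMMU_OFFSET = 256 * MB       # machine physical range 256..512MB


def make_virt_to_phys(page_table):
    """Build a toy virt_to_phys() from {virtual page: guest-physical page}."""
    def virt_to_phys(virt):
        return page_table[virt // PAGE] * PAGE + virt % PAGE
    return virt_to_phys


def dma_chunks(virt, length, virt_to_phys):
    """Split a buffer into runs that are contiguous in guest-physical space."""
    chunks = []
    addr, end = virt, virt + length
    while addr < end:
        step = min(PAGE - addr % PAGE, end - addr)
        phys = virt_to_phys(addr)
        if chunks and chunks[-1][0] + chunks[-1][1] == phys:
            chunks[-1] = (chunks[-1][0], chunks[-1][1] + step)  # extend run
        else:
            chunks.append((phys, step))                         # start new run
        addr += step
    return chunks


def iommu_translate(guest_phys):
    """Guest-physical to machine-physical; outside the guest's control."""
    if not 0 <= guest_phys < GUEST_SIZE:
        raise ValueError("address outside the guest's 0..256MB range")
    return guest_phys + IOMMU_OFFSET


# Hypothetical mapping: the page at 0x12346000 is NOT at 0x1006000.
v2p = make_virt_to_phys({0x12345: 0x1005, 0x12346: 0x3000, 0x12347: 0x3001})
for guest_phys, size in dma_chunks(0x12345000, 3 * PAGE, v2p):
    print(hex(guest_phys), hex(size), hex(iommu_translate(guest_phys)))
```

With this mapping the 12K buffer falls apart into two guest-physical runs, so a device without scatter/gather needs either two transfers or a bounce buffer, even though the IOMMU translation itself is a trivial offset.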

I hope that's clear - it's rather confusing to think about all these things, because there are several levels of translation, which makes life pretty complicated. At least the IOMMU mapping should be pretty static. 

--
Mats
> 
> regards,
> daniel
> 
> -- 
> Daniel Stodden
> LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
> Institut für Informatik der TU München             D-85748 Garching
> http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
> PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2006-08-11  8:41 UTC | newest]

Thread overview: 20+ messages
2006-08-07 15:01 Questioning the Xen Design of the VMM Al Boldi
2006-08-08  9:10 ` Keir Fraser
2006-08-08  9:17 ` Harry Butterworth
2006-08-08  9:20 ` Petersson, Mats
2006-08-08 14:10   ` Al Boldi
2006-08-08 15:07     ` Petersson, Mats
2006-08-08 16:39       ` Steven Rostedt
2006-08-08 17:14         ` Petersson, Mats
2006-08-08 18:22           ` Steven Rostedt
2006-08-09 12:53       ` Al Boldi
2006-08-09 13:28         ` Petersson, Mats
2006-08-10 14:55           ` Al Boldi
2006-08-10 15:42             ` Petersson, Mats
2006-08-10 11:20         ` Daniel Stodden
2006-08-09 12:49     ` Daniel Stodden
2006-08-10 14:57       ` Al Boldi
2006-08-10 15:53         ` Daniel Stodden
2006-08-10 16:34           ` Petersson, Mats
2006-08-10 18:07             ` Daniel Stodden
2006-08-11  8:41               ` Petersson, Mats
