* [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Andre Przywara @ 2008-12-05 13:29 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm, Daniel P. Berrange
Hi,
this patch series introduces multiple NUMA nodes support within KVM guests.
This is the second try incorporating several requests from the list:
- use the QEMU firmware configuration interface instead of CMOS-RAM
- detect presence of libnuma automatically, can be disabled with
./configure --disable-numa
This only applies to the host side; the command line and guest (BIOS)
side are always built and functional, although this configuration
is only useful for research and debugging
- use a more flexible command line interface allowing:
- specifying the distribution of memory across the guest nodes:
mem:1536M;512M
- specifying the distribution of the CPUs:
cpu:0-2;3
- specifying the host nodes the guest nodes should be pinned to:
pin:3;2
All of these options are optional; if mem and cpu are omitted, the resources
are split equally across all guest nodes. Please note that at
least in Linux SRAT takes precedence over E820, so the total usable
memory will be the sum specified at the mem: option (although QEMU will
still allocate the amount at -m).
If pin: is omitted, the guest nodes will be pinned to those host nodes
where the threads happen to be scheduled at start-up time. This
requires the (v)getcpu (v)syscall to be usable, which is true for
kernels from 2.6.19 on and glibc >= 2.6 (sched_getcpu()). I have a hack
if glibc doesn't support this; tell me if you are interested.
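For illustration only, the detection boils down to something like this
minimal sketch (not part of the patches; error handling kept to a minimum):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  int main(void)
  {
      unsigned int cpu = 0, node = 0;

      /* getcpu() reports the CPU and the NUMA node the caller is currently
       * running on; it exists as a raw syscall since Linux 2.6.19.
       * glibc >= 2.6 also offers sched_getcpu(), which returns only the
       * CPU number. */
      if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0) {
          perror("getcpu");
          return 1;
      }
      printf("running on cpu %u, node %u\n", cpu, node);
      return 0;
  }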
The only non-optional argument is the number of guest nodes; a possible
command line looks like:
-numa 3,mem:1024M;512M;512M,cpu:0-1;2;3
Please note that you have to quote the semicolons on the shell.
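For example (an illustrative invocation, other options omitted), quoting the
whole option keeps the shell from treating the semicolons as command
separators; escaping each semicolon with a backslash works as well:

  qemu-system-x86_64 ... -numa "3,mem:1024M;512M;512M,cpu:0-1;2;3"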
The monitor command is left out for now and will be sent later.
Please apply.
Regards,
Andre.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
----to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG,
Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register Court Dresden: HRA 4896, General Partner authorized
to represent: AMD Saxony LLC (Wilmington, Delaware, US)
General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Anthony Liguori @ 2008-12-05 14:28 UTC (permalink / raw)
To: Andre Przywara; +Cc: Avi Kivity, kvm, Daniel P. Berrange
Hi Andre,
This patch series needs to be posted to qemu-devel. I know qemu doesn't
do true SMP yet, but it will in the relatively near future. Either way,
some of the design points need review from a larger audience than is
present on kvm-devel.
I'm not a big fan of the libnuma dependency. I'm willing to concede
this if there's wide agreement that we should support this directly in
QEMU.
I don't think there's such a thing as a casual NUMA user. The default
NUMA policy in Linux is node-local memory. As long as a VM is smaller
than a single node, everything will work out fine.
In the event that the VM is larger than a single node, if a user is
creating it via qemu-system-x86_64, they're going to either not care at
all about NUMA, or be familiar enough with the numactl tools that
they'll probably just want to use that. Once you've got your head
around the fact that VCPUs are just threads and the memory is just a
shared memory segment, any knowledgeable sysadmin will have no problem
doing whatever sort of NUMA layout they want.
The other case is where management tools are creating VMs. In this
case, it's probably better to use numactl as an external tool because
then it keeps things consistent wrt CPU pinning.
There's also a good argument for not introducing CPU pinning directly to
QEMU. There are multiple ways to effectively do CPU pinning. You can
use taskset, you can use cpusets or even something like libcgroup.
If you refactor the series so that the libnuma patch is the very last
one and submit to qemu-devel, I'll review and apply all of the first
patches. We can continue to discuss the last patch independently of the
first three if needed.
Regards,
Anthony Liguori
Andre Przywara wrote:
> Hi,
>
> this patch series introduces multiple NUMA nodes support within KVM
> guests.
> This is the second try incorporating several requests from the list:
> - use the QEMU firmware configuration interface instead of CMOS-RAM
> - detect presence of libnuma automatically, can be disabled with
> ./configure --disable-numa
> This only applies to the host side; the command line and guest (BIOS)
> side are always built and functional, although this configuration
> is only useful for research and debugging
> - use a more flexible command line interface allowing:
> - specifying the distribution of memory across the guest nodes:
> mem:1536M;512M
> - specifying the distribution of the CPUs:
> cpu:0-2;3
> - specifying the host nodes the guest nodes should be pinned to:
> pin:3;2
> All of these options are optional; if mem and cpu are omitted, the
> resources are split equally across all guest nodes. Please
> note that at least in Linux SRAT takes precedence over E820, so the
> total usable memory will be the sum specified at the mem: option
> (although QEMU will still allocate the amount at -m).
> If pin: is omitted, the guest nodes will be pinned to those host nodes
> where the threads happen to be scheduled at start-up time. This
> requires the (v)getcpu (v)syscall to be usable, which is true for
> kernels from 2.6.19 on and glibc >= 2.6 (sched_getcpu()). I have a
> hack if glibc doesn't support this; tell me if you are interested.
> The only non-optional argument is the number of guest nodes; a
> possible command line looks like:
> -numa 3,mem:1024M;512M;512M,cpu:0-1;2;3
> Please note that you have to quote the semicolons on the shell.
>
> The monitor command is left out for now and will be sent later.
>
> Please apply.
>
> Regards,
> Andre.
>
> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
>
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Andre Przywara @ 2008-12-05 15:22 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Avi Kivity, kvm, Daniel P. Berrange
Anthony,
> This patch series needs to be posted to qemu-devel. I know qemu doesn't
> do true SMP yet, but it will in the relatively near future. Either way,
> some of the design points need review from a larger audience than is
> present on kvm-devel.
OK, I already started looking at that. The first patch applies with only
some fuzz, so no problems here. The second patch could be changed to
propagate the values via the firmware configuration interface only,
leaving the host side pinning alone (which wouldn't make much sense
without true SMP anyway).
The third patch is actually against the BOCHS BIOS, and I am confused here:
I see the host side of the firmware config interface in QEMU SVN, but
neither the BOCHS CVS nor qemu/pc-bios/bios.diff shows any sign of it
being used from the BIOS side. Is the kvm-patched qemu the only user
of the interface? If so, I would have to introduce the interface in
QEMU's bios.diff (or better, send it to bochs-developers?)
Do you know what BOCHS version the bios.diff applies against? Is that
the 2.3.7 release?
> I'm not a big fan of the libnuma dependency. I'm willing to concede
> this if there's wide agreement that we should support this directly in
> QEMU.
As long as QEMU is not true SMP, libnuma is rather useless. One could
pin the memory to the appropriate host nodes, but without the proper
scheduling this doesn't make much sense. And rescheduling the qemu
process each time a new VCPU is scheduled doesn't seem so smart, either.
So for qemu we could just drop libnuma, at least for now (which matches
the current KVM patches: if there is no libnuma, the whole host-side
pinning is skipped).
> I don't think there's such a thing as a casual NUMA user. The default
> NUMA policy in Linux is node-local memory. As long as a VM is smaller
> than a single node, everything will work out fine.
Almost right, but simply calling qemu-system-x86_64 can lead to bad
situations. I lately saw that VCPU #0 was scheduled on one node and VCPU
#1 on another. This leads to random (probably excessive) remote accesses
from the VCPUs, since the guest assumes uniform memory. Of course one
could cure this small guest case with numactl, but in my experience the
existence of this tool isn't as well-known as one would expect.
>
> In the event that the VM is larger than a single node, if a user is
> creating it via qemu-system-x86_64, they're going to either not care at
> all about NUMA, or be familiar enough with the numactl tools that
> they'll probably just want to use that. Once you've got your head
> around the fact that VCPUs are just threads and the memory is just a
> shared memory segment, any knowledgeable sysadmin will have no problem
> doing whatever sort of NUMA layout they want.
Really? How do you want to assign certain _parts_ of guest memory with
numactl? (Let alone the rather weird way of using -mem-path, which is
much easier done within QEMU). The same applies to the threads. You can
assign _all_ the threads to certain nodes, but pinning individual threads
requires some tedious work (QEMU monitor or top, then taskset -p).
Wouldn't it be OK if qemu did this automatically (or at least gave some
support here)?
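Just to sketch what "doing this automatically" could mean inside qemu (a
hypothetical helper, not the actual patch code): once a host CPU for a
guest node is chosen, pinning the calling VCPU thread is a single affinity
call:

  #define _GNU_SOURCE
  #include <sched.h>

  /* Hypothetical helper: pin the calling VCPU thread to one host CPU.
   * A real implementation would pin to all CPUs of the chosen host node
   * rather than to a single CPU. */
  static int pin_vcpu_to_cpu(int host_cpu)
  {
      cpu_set_t mask;

      CPU_ZERO(&mask);
      CPU_SET(host_cpu, &mask);
      /* pid 0 means "the calling thread" */
      return sched_setaffinity(0, sizeof(mask), &mask);
  }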
> The other case is where management tools are creating VMs. In this
> case, it's probably better to use numactl as an external tool because
> then it keeps things consistent wrt CPU pinning.
>
> There's also a good argument for not introducing CPU pinning directly to
> QEMU. There are multiple ways to effectively do CPU pinning. You can
> use taskset, you can use cpusets or even something like libcgroup.
I agree that pinning isn't the last word on the subject, but it works
pretty well. But I wouldn't load the admin with the burden of pinning,
but let this be done by QEMU/KVM. Maybe one could introduce a way to
tell QEMU/KVM to not pin the threads.
I also had the idea to start with some sort of pinning (either
automatically or user-chosen) and lift the affinity later (after the
thread has done something and touched some memory). In this case Linux
could (but probably will not easily) move the thread to another node.
One could think about triggering this from a management app: if the app
detects congestion on one node, it could first lift the affinity
restriction of some VCPU threads to achieve better load balancing. If
the situation persists (and doesn't turn out to be a short-term peak),
the manager could migrate the memory too and pin the VCPUs to the new
node. I thought the migration and temporary un-pinning could be
implemented in the monitor.
> If you refactor the series so that the libnuma patch is the very last
> one and submit to qemu-devel, I'll review and apply all of the first
> patches. We can continue to discuss the last patch independently of the
> first three if needed.
Sounds like a plan. I will start with this and hope for some advice on
the BOCHS BIOS issue.
Thanks for your ideas!
Regards,
Andre.
>
> Andre Przywara wrote:
>> Hi,
>>
>> this patch series introduces multiple NUMA nodes support within KVM
>> guests.
>> This is the second try incorporating several requests from the list:
>> - use the QEMU firmware configuration interface instead of CMOS-RAM
>> - detect presence of libnuma automatically, can be disabled with
>> ./configure --disable-numa
>> This only applies to the host side; the command line and guest (BIOS)
>> side are always built and functional, although this configuration
>> is only useful for research and debugging
>> - use a more flexible command line interface allowing:
>> - specifying the distribution of memory across the guest nodes:
>> mem:1536M;512M
>> - specifying the distribution of the CPUs:
>> cpu:0-2;3
>> - specifying the host nodes the guest nodes should be pinned to:
>> pin:3;2
>> All of these options are optional; if mem and cpu are omitted, the
>> resources are split equally across all guest nodes. Please
>> note that at least in Linux SRAT takes precedence over E820, so the
>> total usable memory will be the sum specified at the mem: option
>> (although QEMU will still allocate the amount at -m).
>> If pin: is omitted, the guest nodes will be pinned to those host nodes
>> where the threads happen to be scheduled at start-up time. This
>> requires the (v)getcpu (v)syscall to be usable, which is true for
>> kernels from 2.6.19 on and glibc >= 2.6 (sched_getcpu()). I have a
>> hack if glibc doesn't support this; tell me if you are interested.
>> The only non-optional argument is the number of guest nodes; a
>> possible command line looks like:
>> -numa 3,mem:1024M;512M;512M,cpu:0-1;2;3
>> Please note that you have to quote the semicolons on the shell.
>>
>> The monitor command is left out for now and will be sent later.
>>
>> Please apply.
>>
>> Regards,
>> Andre.
>>
>> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
>>
>
>
--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x84917
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Avi Kivity @ 2008-12-05 15:27 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Andre Przywara, kvm, Daniel P. Berrange
Anthony Liguori wrote:
>
> In the event that the VM is larger than a single node, if a user is
> creating it via qemu-system-x86_64, they're going to either not care
> at all about NUMA, or be familiar enough with the numactl tools that
> they'll probably just want to use that. Once you've got your head
> around the fact that VCPUs are just threads and the memory is just a
> shared memory segment, any knowledgeable sysadmin will have no problem
> doing whatever sort of NUMA layout they want.
>
The vast majority of production VMs will be created by management tools.
> The other case is where management tools are creating VMs. In this
> case, it's probably better to use numactl as an external tool because
> then it keeps things consistent wrt CPU pinning.
>
> There's also a good argument for not introducing CPU pinning directly
> to QEMU. There are multiple ways to effectively do CPU pinning. You
> can use taskset, you can use cpusets or even something like libcgroup.
>
> If you refactor the series so that the libnuma patch is the very last
> one and submit to qemu-devel, I'll review and apply all of the first
> patches. We can continue to discuss the last patch independently of
> the first three if needed.
We need libnuma integrated in qemu. Using numactl outside of qemu means
we need to start exposing more and more qemu internals (vcpu->thread
mapping, memory in /dev/shm, phys_addr->ram_addr mapping) and lose out
on optimization opportunities (having multiple numa-aware iothreads,
numa-aware kvm mmu). It also means duplicating the numa logic in
management tools instead of consolidating it in qemu.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Anthony Liguori @ 2008-12-05 15:34 UTC (permalink / raw)
To: Avi Kivity; +Cc: Andre Przywara, kvm, Daniel P. Berrange
Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> In the event that the VM is larger than a single node, if a user is
>> creating it via qemu-system-x86_64, they're going to either not care
>> at all about NUMA, or be familiar enough with the numactl tools that
>> they'll probably just want to use that. Once you've got your head
>> around the fact that VCPUs are just threads and the memory is just a
>> shared memory segment, any knowledgeable sysadmin will have no problem
>> doing whatever sort of NUMA layout they want.
>>
>
> The vast majority of production VMs will be created by management tools.
I agree.
> We need libnuma integrated in qemu. Using numactl outside of qemu
> means we need to start exposing more and more qemu internals
> (vcpu->thread mapping, memory in /dev/shm, phys_addr->ram_addr
> mapping) and lose out on optimization opportunities (having multiple
> numa-aware iothreads, numa-aware kvm mmu). It also means we cause
> duplication of the numa logic in management tools instead of
> consolidation in qemu.
I think it's the opposite. Integrating libnuma in QEMU means
duplication of numactl functionality in QEMU. What you'd really want, I
think, is to be able to use numactl but say -qemu-guest-memory-offset 1G
-qemu-guest-memory-size 1G.
The /dev/shm approach approximates that pretty well. Also, the current
patches don't do the most useful thing: they don't provide an interface
for dynamically changing numa attributes.
But, as I said, if there's agreement that we should bake this into QEMU,
then so be it. But let's make this a separate conversation from the
rest of the patches.
Regards,
Anthony Liguori
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Anthony Liguori @ 2008-12-05 15:41 UTC (permalink / raw)
To: Andre Przywara; +Cc: Avi Kivity, kvm, Daniel P. Berrange
Andre Przywara wrote:
> Anthony,
>
>> This patch series needs to be posted to qemu-devel. I know qemu
>> doesn't do true SMP yet, but it will in the relatively near future.
>> Either way, some of the design points need review from a larger
>> audience than is present on kvm-devel.
> OK, I already started looking at that. The first patch applies with
> only some fuzz, so no problems here. The second patch could be changed
> to propagate the values via the firmware configuration interface only,
> leaving the host side pinning alone (which wouldn't make much sense
> without true SMP anyway).
> The third patch is actually against the BOCHS BIOS, and I am confused
> here: I see the host side of the firmware config interface in QEMU SVN,
> but neither the BOCHS CVS nor qemu/pc-bios/bios.diff shows any sign of
> it being used from the BIOS side.
Really? I assumed it was there. I'll look this afternoon and if it
isn't, I'll apply those patches to bios.diff and update the bios.
> Is the kvm-patched qemu the only user of the interface? If so, I would
> have to introduce the interface in QEMU's bios.diff (or better, send it
> to bochs-developers?)
> Do you know what BOCHS version the bios.diff applies against? Is that
> the 2.3.7 release?
Unfortunately, we don't track what version of the BOCHS BIOS is in the
tree. Usually it's an SVN snapshot. I'm going to change this the next
time I update the BIOS, though.
>> I'm not a big fan of the libnuma dependency. I'm willing to concede
>> this if there's wide agreement that we should support this directly
>> in QEMU.
> As long as QEMU is not true SMP, libnuma is rather useless. One could
> pin the memory to the appropriate host nodes, but without the proper
> scheduling this doesn't make much sense. And rescheduling the qemu
> process each time a new VCPU is scheduled doesn't seem so smart, either.
Even if it's not useful, I'd still like to add it to QEMU. That's one
less thing that has to be merged from KVM into QEMU.
>> I don't think there's such a thing as a casual NUMA user. The
>> default NUMA policy in Linux is node-local memory. As long as a VM
>> is smaller than a single node, everything will work out fine.
> Almost right, but simply calling qemu-system-x86_64 can lead to bad
> situations. I lately saw that VCPU #0 was scheduled on one node and
> VCPU #1 on another. This leads to random (probably excessive) remote
> accesses from the VCPUs, since the guest assumes uniform memory
That seems like Linux is behaving badly, no? Can you describe the
situation more?
> Of course one could cure this small guest case with numactl, but in my
> experience the existence of this tool isn't as well-known as one would
> expect.
NUMA systems are expensive. If a customer cares about performance (as
opposed to just getting more memory), then I think tools like numactl
are pretty well known.
>>
>> In the event that the VM is larger than a single node, if a user is
>> creating it via qemu-system-x86_64, they're going to either not care
>> at all about NUMA, or be familiar enough with the numactl tools that
>> they'll probably just want to use that. Once you've got your head
>> around the fact that VCPUs are just threads and the memory is just a
> shared memory segment, any knowledgeable sysadmin will have no problem
>> doing whatever sort of NUMA layout they want.
> Really? How do you want to assign certain _parts_ of guest memory with
> numactl? (Let alone the rather weird way of using -mem-path, which is
> much easier done within QEMU).
I don't think -mem-path is weird at all. In fact, I'd be inclined to
use shared memory by default and create a temporary file name. Then
provide a monitor interface to look up that file name so that an explicit
-mem-path isn't required anymore.
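A rough sketch of that idea (the helper name and layout are invented for
illustration, not QEMU's actual implementation):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Hypothetical sketch: back guest RAM with a temporary file under
   * /dev/shm so that external tools (numactl, management apps) can find
   * and act on it; the file name would then be exposed via the monitor. */
  static void *alloc_shared_guest_ram(size_t size, char *name, size_t len)
  {
      char template[] = "/dev/shm/qemu-ram-XXXXXX";
      int fd = mkstemp(template);
      void *ram;

      if (fd < 0)
          return NULL;
      if (ftruncate(fd, size) < 0) {
          close(fd);
          unlink(template);
          return NULL;
      }
      ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      close(fd);                  /* the mapping stays valid without the fd */
      if (ram == MAP_FAILED) {
          unlink(template);
          return NULL;
      }
      snprintf(name, len, "%s", template);
      return ram;
  }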
> The same applies to the threads. You can assign _all_ the threads to
> certain nodes, but pinning individual threads requires some tedious
> work (QEMU monitor or top, then taskset -p). Wouldn't it be OK if qemu
> did this automatically (or at least gave some support here)?
Most VMs are going to be created through management tools so I don't
think it's an issue. I'd rather provide the best mechanisms for
management tools to have the most flexibility.
>> The other case is where management tools are creating VMs. In this
>> case, it's probably better to use numactl as an external tool because
>> then it keeps things consistent wrt CPU pinning.
>>
>> There's also a good argument for not introducing CPU pinning directly
>> to QEMU. There are multiple ways to effectively do CPU pinning. You
>> can use taskset, you can use cpusets or even something like libcgroup.
> I agree that pinning isn't the last word on the subject, but it works
> pretty well. But I wouldn't load the admin with the burden of pinning,
> but let this be done by QEMU/KVM. Maybe one could introduce a way to
> tell QEMU/KVM to not pin the threads.
This is where things start to get ugly...
> I also had the idea to start with some sort of pinning (either
> automatically or user-chosen) and lift the affinity later (after the
> thread has done something and touched some memory). In this case Linux
> could (but probably will not easily) move the thread to another node.
> One could think about triggering this from a management app: if the
> app detects congestion on one node, it could first lift the affinity
> restriction of some VCPU threads to achieve better load balancing.
> If the situation persists (and doesn't turn out to be a short-term
> peak), the manager could migrate the memory too and pin the VCPUs to
> the new node. I thought the migration and temporary un-pinning could
> be implemented in the monitor.
The other issue with pinning is what happens after live migration? What
about single-machine load balancing? Regardless of whether we bake in
libnuma control or not, I think an interface on the command line is not
terribly interesting because it's too static. I think a monitor
interface is what we'd really want if we integrated with libnuma.
Regards,
Anthony Liguori
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: André Przywara @ 2008-12-08 21:46 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Avi Kivity, kvm
Hi Anthony,
>>...
>> The third patch is actually against the BOCHS BIOS, and I am confused
>> here: I see the host side of the firmware config interface in QEMU SVN,
>> but neither the BOCHS CVS nor qemu/pc-bios/bios.diff shows any sign of
>> it being used from the BIOS side.
> Really? I assumed it was there. I'll look this afternoon and if it
> isn't, I'll apply those patches to bios.diff and update the bios.
I was partly wrong: the code is in BOCHS CVS, but not in qemu. It wasn't
in the BOCHS 2.3.7 release, which qemu is currently based on. Could you
pull the latest BIOS code from BOCHS CVS into qemu? This would give us
the firmware interface for free, and I could port my patches more easily.
>>> I'm not a big fan of the libnuma dependency. I'm willing to concede
>>> this if there's wide agreement that we should support this directly
>>> in QEMU.
What's actually bothering you about the libnuma dependency? I
could directly use the Linux mbind syscall, but I think using a library
is more sane (and probably more portable).
>> As long as QEMU is not true SMP, libnuma is rather useless....
> Even if it's not useful, I'd still like to add it to QEMU. That's one
> less thing that has to be merged from KVM into QEMU.
OK, but since QEMU is portable, I have to use libnuma (and a configure
check). If there is no libnuma (and no Linux), host pinning is disabled
and you actually don't lose anything. Other QEMU host OSes could be
added later (Solaris comes to mind). I would try to assign guest memory
to different nodes. Although this doesn't help QEMU, this should be the
same code as in KVM, so more easily mergeable. Since there are no
threads (and no CPUState->threadid) I cannot add VCPU pinning here, but
that is not a great loss.
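To make concrete what I mean by assigning guest memory to host nodes: a
stripped-down, purely illustrative helper (no error handling) could look
like this; the mbind syscall could do the same job without the library:

  #include <numa.h>          /* libnuma, link with -lnuma */

  /* Hypothetical helper: bind one guest node's memory range to a host
   * node.  When the host has no NUMA support we silently skip the
   * pinning, as described above for the no-libnuma case. */
  static void pin_guest_node_memory(void *start, size_t size, int host_node)
  {
      if (numa_available() < 0)
          return;
      numa_tonode_memory(start, size, host_node);
  }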
>> Almost right, but simply calling qemu-system-x86_64 can lead to bad
>> situations. I lately saw that VCPU #0 was scheduled on one node and
>> VCPU #1 on another. This leads to random (probably excessive) remote
>> accesses from the VCPUs, since the guest assumes uniform memory
> That seems like Linux is behaving badly, no? Can you describe the
> situation more?
That is just my observation. I have to do more research to get a decent
explanation, but I think the problem is that in this early state the
threads barely touch any memory, so Linux tries to distribute them as
best as possible. Just a quick run on a quad node machine with 16 cores
in total:
qemu-system-x86_64 -smp 4 -S: VCPUs running on pCPUs: 5,9,13,5
after continue in the monitor: 5,9,10,6: VCPU 2 changed the node
after booting has finished: 5,1,11,6: VCPU 1 changed the node
copying a file over the network: 7,5,11,6: VCPU 1 changed the node again
some load on the host, guest idle: 5,4,1,7: VCPU 2 changed the node
starting bunzipping: 1,4,2,7: VCPU 0 changed the node
bunzipping ended: 7,1,2,4: VCPU 0 and 1 changed the node
make -j8 on /dev/shm: 1,2,3,4: VCPU 0 changed the node
You can see that Linux happily changes the assignments and even the
nodes, not to mention the rather arbitrary assignment at the beginning.
After some load (at the end) the scheduling comes closer to a single
node, but the memory was actually split between node 0 and node 1 (plus
a few thousand pages on nodes 2 and 3).
> NUMA systems are expensive. If a customer cares about performance (as
> opposed to just getting more memory), then I think tools like numactl
> are pretty well known.
Well, expensive depends, especially if I think of your employer ;-) In
fact every AMD dual socket server is NUMA, and Intel will join the game
next year.
>> ...
>> Really? How do you want to assign certain _parts_ of guest memory with
>> numactl? (Let alone the rather weird way of using -mem-path, which is
>> much easier done within QEMU).
> I don't think -mem-path is weird at all. In fact, I'd be inclined to
> use shared memory by default and create a temporary file name. Then
> provide a monitor interface to look up that file name so that an explicit
> -mem-path isn't required anymore.
I didn't want to decry -mem-path; what I meant was that the way of
accomplishing a NUMA-aware setup with -mem-path seems quite complicated
to me. Why not use a rather fool-proof way within QEMU?
>> ...
>> But I wouldn't load the admin with the burden of pinning,
>> but let this be done by QEMU/KVM. Maybe one could introduce a way to
>> tell QEMU/KVM to not pin the threads.
> This is where things start to get ugly...
Why? qemu-system-x86_64 -numa 2,pin:none and then use whatever method
you prefer (taskset, monitor) to pin the VCPUs (or leave them unpinned).
> The other issue with pinning is what happens after live migration? What
> about single-machine load balancing? Regardless of whether we bake in
> libnuma control or not, I think an interface on the command line is not
> terribly interesting because it's too static.
I agree, but only with regard to the pinning mechanism. AFAIK the NUMA
topology itself (CPU->nodes, mem->nodes) is quite static (due to its
ACPI-based nature).
> I think a monitor
> interface is what we'd really want if we integrated with libnuma.
OK, I will implement a monitor interface with emphasis on pinning to
host nodes. What about this:
> info numa
2 nodes
node 0 cpus: 0 1 2
node 0 size: 1536 MB
node 0 host: 2
node 1 cpus: 3
node 1 size: 512 MB
node 1 host: *
// similar to numactl --hardware, * means all nodes (no pinning)
> numa pin:0;3
// static pinning: guest 0 -> host 0, guest 1 -> host 3
> numa pin:*;
// guest node 0 -> all nodes, guest node 1: keep as it is
// or maybe: numa pin:0-3;
> numa migrate:1;2
// like pin, but moving all the memory, too
Additionally, one could use some kind of home node, so one could
temporarily change the VCPUs' affinity and later return to the optimal
affinity (where the memory is located) without specifying it again.
Comments are welcome.
Regards,
Andre.
--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
----to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG,
Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register Court Dresden: HRA 4896, General Partner authorized
to represent: AMD Saxony LLC (Wilmington, Delaware, US)
General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Anthony Liguori @ 2008-12-08 22:01 UTC (permalink / raw)
To: André Przywara; +Cc: Avi Kivity, kvm
André Przywara wrote:
> I was partly wrong: the code is in BOCHS CVS, but not in qemu. It wasn't
> in the BOCHS 2.3.7 release, which qemu is currently based on. Could you
> pull the latest BIOS code from BOCHS CVS into qemu? This would give us
> the firmware interface for free, and I could port my patches more easily.
Working on that right now. BOCHS CVS has diverged a fair bit from what
we have so I'm adjusting our current patches and doing regression testing.
> What's actually bothering you about the libnuma dependency? I
> could directly use the Linux mbind syscall, but I think using a library
> is more sane (and probably more portable).
You're making a default policy decision (pin nodes and pin cpus). You're
assuming that Linux will do the wrong thing by default and that the
decision we'll be making is better.
That policy decision requires more validation. We need benchmarks
showing what the performance is like with and without pinning, and we
need to understand whether the bad performance is a Linux bug that can
be fixed or whether it's something fundamental.
What I'm concerned about is that it'll make the default situation
worse. I advocated punting to management tools because that at least
gives the user the ability to make their own decisions, which means you
don't have to prove that this is the correct default decision.
I don't care about a libnuma dependency. Library dependencies are fine
as long as they're optional.
>>> Almost right, but simply calling qemu-system-x86_64 can lead to bad
>>> situations. I lately saw that VCPU #0 was scheduled on one node and
>>> VCPU #1 on another. This leads to random (probably excessive) remote
>>> accesses from the VCPUs, since the guest assumes uniform memory
>> That seems like Linux is behaving badly, no? Can you describe the
>> situation more?
> That is just my observation. I have to do more research to get a decent
> explanation, but I think the problem is that in this early state the
> threads barely touch any memory, so Linux tries to distribute them as
> best as possible. Just a quick run on a quad node machine with 16 cores
> in total:
How does memory migration fit into all of this though? Statistically
speaking, if your NUMA guest is behaving well, it should be easy to
recognize the groupings and perform the appropriate page migration. I
would think even the most naive page migration tool would be able to do
the right thing.
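For reference, the kernel primitive such a tool would build on is
migrate_pages(2), available since Linux 2.6.16. A bare-bones, purely
illustrative use (hypothetical helper, no error handling):

  #include <numaif.h>        /* raw NUMA syscall wrappers from libnuma */
  #include <sys/types.h>

  /* Illustrative only: ask the kernel to move all pages of a process
   * (e.g. a qemu instance) from host node `from` to host node `to`. */
  static long migrate_process_pages(pid_t pid, int from, int to)
  {
      unsigned long old_nodes = 1UL << from;
      unsigned long new_nodes = 1UL << to;

      /* the second argument is the number of bits in the node masks */
      return migrate_pages(pid, 8 * sizeof(unsigned long),
                           &old_nodes, &new_nodes);
  }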
>> NUMA systems are expensive. If a customer cares about performance
>> (as opposed to just getting more memory), then I think tools like
>> numactl are pretty well known.
> Well, expensive depends, especially if I think of your employer ;-) In
> fact every AMD dual socket server is NUMA, and Intel will join the
> game next year.
But the NUMA characteristics on an AMD system are relatively minor. I
doubt that doing static pinning would be what most users wanted, since
it could reduce overall system performance noticeably.
Even with more traditional NUMA systems, the cost of remote memory
access is often dwarfed by the opportunity cost of leaving a CPU idle.
That's what pinning does: it leaves CPUs potentially idle.
> Additionally, one could use some kind of home node, so one could
> temporarily change the VCPUs' affinity and later return to the optimal
> affinity (where the memory is located) without specifying it again.
Please resubmit with the first three patches at the front. I don't
think exposing NUMA attributes to a guest is at all controversial, so
that's relatively easy to apply.
I'm not saying that the last patch can't be applied, but I don't think
it's as obvious that it's going to be a win when you start doing
performance tests.
Regards,
Anthony Liguori
> Comments are welcome.
>
> Regards,
> Andre.
>
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Avi Kivity @ 2008-12-09 14:24 UTC (permalink / raw)
To: André Przywara; +Cc: Anthony Liguori, kvm
André Przywara wrote:
>>> But I wouldn't load the admin with the burden of pinning, but let
>>> this be done by QEMU/KVM. Maybe one could introduce a way to tell
>>> QEMU/KVM to not pin the threads.
>> This is where things start to get ugly...
> Why? qemu-system-x86_64 -numa 2,pin:none and then use whatever method
> you prefer (taskset, monitor) to pin the VCPUs (or leave them unpinned).
I agree that for e.g. -numa 2, no host binding should occur. Pinning
memory or cpus to nodes should only occur if the user explicitly
requested it. Otherwise we run the risk of breaking load balancing.
If the user chooses to pin, the responsibility is on them. If not, we
should allow the host to do its thing.
> // similar to numactl --hardware, * means all nodes (no pinning)
> > numa pin:0;3
> // static pinning: guest 0 -> host 0, guest 1 -> host 3
> > numa pin:*;
> // guest node 0 -> all nodes, guest node 1: keep as it is
> // or maybe: numa pin:0-3;
> > numa migrate:1;2
I suggest using exactly the same syntax as the command line option.
Qemu would compute the difference between the current configuration and
the desired configuration and migrate vcpus and memory as needed.
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
From: Anthony Liguori @ 2008-12-09 14:55 UTC (permalink / raw)
To: Avi Kivity; +Cc: André Przywara, kvm
Avi Kivity wrote:
> André Przywara wrote:
>>>> But I wouldn't load the admin with the burden of pinning, but let
>>>> this be done by QEMU/KVM. Maybe one could introduce a way to tell
>>>> QEMU/KVM to not pin the threads.
>>> This is where things start to get ugly...
>> Why? qemu-system-x86_64 -numa 2,pin:none and then use whatever method
>> you prefer (taskset, monitor) to pin the VCPUs (or leave them unpinned).
>
> I agree that for e.g. -numa 2, no host binding should occur. Pinning
> memory or cpus to nodes should only occur if the user explicitly
> requested it. Otherwise we run the risk of breaking load balancing.
>
> If the user chooses to pin, the responsibility is on them. If not, we
> should allow the host to do its thing.
Agreed.
Regards,
Anthony Liguori