From: Anthony Liguori <anthony@codemonkey.ws>
To: Andre Przywara <andre.przywara@amd.com>
Cc: Avi Kivity <avi@redhat.com>,
kvm@vger.kernel.org, "Daniel P. Berrange" <berrange@redhat.com>
Subject: Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
Date: Fri, 05 Dec 2008 09:41:39 -0600
Message-ID: <49394BB3.9080509@codemonkey.ws>
In-Reply-To: <4939473D.6080606@amd.com>
Andre Przywara wrote:
> Anthony,
>
>> This patch series needs to be posted to qemu-devel. I know qemu
>> doesn't do true SMP yet, but it will in the relatively near future.
>> Either way, some of the design points need review from a larger
>> audience than present on kvm-devel.
> OK, I already started looking at that. The first patch applies with
> only some fuzz, so no problems here. The second patch could be changed
> to pass the values through the firmware configuration interface only,
> leaving the host-side pinning alone (which wouldn't make much sense
> without true SMP anyway).
> The third patch is actually against the BOCHS BIOS, and I am confused
> here: I see the host side of the firmware configuration interface in
> QEMU SVN, but neither in the BOCHS CVS nor in qemu/pc-bios/bios.diff is
> there any sign of it being used from the BIOS side.
Really? I assumed it was there. I'll look this afternoon and if it
isn't, I'll apply those patches to bios.diff and update the bios.
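For reference, the BIOS-side consumer ends up being tiny: select an entry by
writing its 16-bit key to the fw_cfg selector port (0x510), then stream the
bytes from the data port (0x511).  Here's a rough sketch of what I mean; the
key constant and helper names below are made up for illustration, and the
real key is whatever the host side registers:

/* Minimal sketch of a BIOS-side firmware-config read, assuming QEMU's
 * fw_cfg I/O interface: selector port 0x510 (16-bit write), data port
 * 0x511 (8-bit reads).  FW_CFG_NUMA_DATA is a hypothetical key. */
#include <stdint.h>

#define FW_CFG_CTL_PORT    0x510   /* selector (16-bit write) */
#define FW_CFG_DATA_PORT   0x511   /* data (8-bit reads)      */
#define FW_CFG_NUMA_DATA   0x8003  /* hypothetical entry key  */

static inline void fw_outw(uint16_t port, uint16_t val)
{
    asm volatile("outw %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint8_t fw_inb(uint16_t port)
{
    uint8_t val;
    asm volatile("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

static void fw_cfg_read(uint16_t key, void *buf, int len)
{
    uint8_t *p = buf;

    fw_outw(FW_CFG_CTL_PORT, key);        /* select the entry */
    while (len--)
        *p++ = fw_inb(FW_CFG_DATA_PORT);  /* stream its bytes */
}

/* e.g. fw_cfg_read(FW_CFG_NUMA_DATA, numa_table, sizeof(numa_table)); */

The BIOS side would then just copy that data into whatever table it builds
from it (presumably the SRAT, in the NUMA case).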
> Is the kvm-patched qemu the only user of the interface? If so, I would
> have to introduce the interface in QEMU's bios.diff (or better, send it
> to bochs-developers?)
> Do you know what BOCHS version the bios.diff applies against? Is that
> the 2.3.7 release?
Unfortunately, we don't track which version of the BOCHS BIOS is in the
tree. Usually, it's an SVN snapshot. I'm going to change that the next
time I update the BIOS, though.
>> I'm not a big fan of the libnuma dependency. I'm willing to concede
>> this if there's a wide agreement that we should support this directly
>> in QEMU.
> As long as QEMU is not true SMP, libnuma is rather useless. One could
> pin the memory to the appropriate host nodes, but without the proper
> scheduling this doesn't make much sense. And rescheduling the qemu
> process each time a new VCPU is scheduled doesn't seem so smart, either.
Even if it's not useful, I'd still like to add it to QEMU. That's one
less thing that has to be merged from KVM into QEMU.
>> I don't think there's such a thing as a casual NUMA user. The
>> default NUMA policy in Linux is node-local memory. As long as a VM
>> is smaller than a single node, everything will work out fine.
> Almost right, but simply calling qemu-system-x86_64 can lead to bad
> situations. I recently saw that VCPU #0 was scheduled on one node and
> VCPU #1 on another. This leads to random (probably excessive) remote
> accesses from the VCPUs, since the guest assumes uniform memory access.
That seems like Linux is behaving badly, no? Can you describe the
situation more?
> Of course one could cure this small guest case with numactl, but in my
> experience the existence of this tool isn't as well-known as one would
> expect.
NUMA systems are expensive. If a customer cares about performance (as
opposed to just getting more memory), then I think tools like numactl
are pretty well known.
>>
>> In the event that the VM is larger than a single node, if a user is
>> creating it via qemu-system-x86_64, they're going to either not care
>> at all about NUMA, or be familiar enough with the numactl tools that
>> they'll probably just want to use that. Once you've got your head
>> around the fact that VCPUs are just threads and the memory is just a
>> shared memory segment, any knowledgeable sysadmin will have no problem
>> doing whatever sort of NUMA layout they want.
> Really? How do you want to assign certain _parts_ of guest memory with
> numactl? (Let alone the rather weird route via -mem-path, which would
> be much easier to do within QEMU.)
I don't think -mem-path is weird at all. In fact, I'd be inclined to
use shared memory by default and create a temporary file name. Then
provide a monitor interface to look up that file name so that an explicit
-mem-path isn't required anymore.
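To make the external-tool idea concrete: once guest RAM is backed by a file
(whether via -mem-path or a monitor-reported temporary file), a management
tool can mmap() just the slice it cares about and apply a node policy to that
range with mbind().  Rough sketch using libnuma; the tool itself is
hypothetical and assumes the caller knows the guest's memory layout:

/* bind-guest-range.c: bind one slice of a guest's backing file to a
 * host NUMA node.  Hypothetical external tool; build with -lnuma.
 * Usage: bind-guest-range <mem-file> <offset> <length> <node>
 * (offset and length are assumed to be page-aligned) */
#include <fcntl.h>
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 5) {
        fprintf(stderr, "usage: %s <mem-file> <offset> <length> <node>\n",
                argv[0]);
        return 1;
    }

    off_t offset = strtoll(argv[2], NULL, 0);
    size_t length = strtoull(argv[3], NULL, 0);
    int node = atoi(argv[4]);

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* MAP_SHARED so the policy applies to the same pages the guest uses. */
    void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, offset);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, node);

    /* Bind just this range of guest memory to the chosen node.
     * (Add MPOL_MF_MOVE to the flags to also migrate pages that are
     * already faulted in.) */
    if (mbind(addr, length, MPOL_BIND, nodes->maskp, nodes->size + 1, 0) < 0)
        perror("mbind");

    numa_free_nodemask(nodes);
    close(fd);
    return 0;
}

That's the sort of thing I'd rather see live in a management tool than in
QEMU proper.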
> The same applies to the threads. You can assign _all_ the threads to
> certain nodes, but pinning individual threads requires some tedious
> manual work (QEMU monitor or top, then taskset -p). Wouldn't it be OK
> if qemu did this automatically (or at least gave some support here)?
Most VMs are going to be created through management tools so I don't
think it's an issue. I'd rather provide the best mechanisms for
management tools to have the most flexibility.
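And for the record, the per-thread step isn't much work once you have the
TID (which the monitor could report): the moral equivalent of taskset -p is
a single sched_setaffinity() call on that thread.  Hypothetical helper, not
QEMU code:

/* Pin one VCPU thread, given its kernel TID (from the monitor or
 * /proc/<pid>/task/), to a single host CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <tid> <cpu>\n", argv[0]);
        return 1;
    }

    pid_t tid = atoi(argv[1]);
    int cpu = atoi(argv[2]);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* Called with a TID, sched_setaffinity() affects just that thread. */
    if (sched_setaffinity(tid, sizeof(set), &set) < 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}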
>> The other case is where management tools are creating VMs. In this
>> case, it's probably better to use numactl as an external tool because
>> then it keeps things consistent wrt CPU pinning.
>>
>> There's also a good argument for not introducing CPU pinning directly
>> to QEMU. There are multiple ways to effectively do CPU pinning. You
>> can use taskset, you can use cpusets or even something like libcgroup.
> I agree that pinning isn't the last word on the subject, but it works
> pretty well. Still, I wouldn't put the burden of pinning on the admin;
> I'd rather let QEMU/KVM do it. Maybe one could introduce a way to tell
> QEMU/KVM not to pin the threads.
This is where things start to get ugly...
> I also had the idea to start with some sort of pinning (either
> automatically or user-chosen) and lift the affinity later (after the
> thread has done something and touched some memory). In this case Linux
> could (but probably will not easily) move the thread to another node.
> One could think about triggering this from a management app: if the
> app detects congestion on one node, it could first lift the affinity
> restriction of some VCPU threads to achieve better load balancing.
> If the situation persists (and doesn't turn out to be a short-lived
> peak), the manager could migrate the memory too and pin the VCPUs to
> the new node. I thought the migration and temporary un-pinning could
> be implemented in the monitor.
The other issue with pinning is what happens after live migration? What
about single-machine load balancing? Regardless of whether we bake in
libnuma control or not, I think an interface on the command line is not
terribly interesting because it's too static. I think a monitor
interface is what we'd really want if we integrated with libnuma.
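If we did go the monitor route, the primitives Andre describes map onto very
little code: widen the thread's affinity mask again to un-pin it, and move
the guest's pages between nodes with numa_migrate_pages().  A rough libnuma
sketch (hypothetical management-tool code, not something I'm proposing for
QEMU as-is; build with -lnuma):

#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

/* Lift the pin: let the thread run anywhere again. */
static int unpin_thread(pid_t tid)
{
    cpu_set_t all;
    int cpu;

    CPU_ZERO(&all);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        CPU_SET(cpu, &all);
    return sched_setaffinity(tid, sizeof(all), &all);
}

/* If the imbalance persists, move the guest's pages to the new node
 * (re-pinning the VCPUs afterwards is not shown). */
static int migrate_guest_memory(pid_t qemu_pid, int from_node, int to_node)
{
    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to = numa_allocate_nodemask();
    int ret;

    numa_bitmask_setbit(from, from_node);
    numa_bitmask_setbit(to, to_node);

    ret = numa_migrate_pages(qemu_pid, from, to);
    if (ret < 0)
        perror("numa_migrate_pages");

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return ret;
}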
Regards,
Anthony Liguori