From: "André Przywara" <osp@andrep.de>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Avi Kivity <avi@redhat.com>, kvm@vger.kernel.org
Subject: Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
Date: Mon, 08 Dec 2008 22:46:17 +0100
Message-ID: <493D95A9.1090307@andrep.de>
In-Reply-To: <49394BB3.9080509@codemonkey.ws>
Hi Anthony,
>>...
>> The third patch is actually BOCHS BIOS, and I am confused here:
>> I see the host side of the firmware config interface in QEMU SVN, but
>> neither in the BOCHS CVS nor in the qemu/pc-bios/bios.diff there is
>> any sign of usage from the BIOS side.
> Really? I assumed it was there. I'll look this afternoon and if it
> isn't, I'll apply those patches to bios.diff and update the bios.
I was partly wrong: the code is in BOCHS CVS, but not in qemu. It wasn't
in the BOCHS 2.3.7 release, which qemu is currently based on. Could you
pull the latest BIOS code from BOCHS CVS into qemu? That would give us
the firmware interface for free, and I could port my patches more easily.
>>> I'm not a big fan of the libnuma dependency. I'll willing to concede
>>> this if there's a wide agreement that we should support this directly
>>> in QEMU.
What actually bothers you about the libnuma dependency? I could use the
Linux mbind syscall directly, but using a library seems saner to me (and
probably more portable).
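Just to illustrate what I mean, a minimal sketch of both variants (not
actual QEMU code; ram_base, ram_size and host_node are made-up names):

    #include <stddef.h>
    #include <numa.h>      /* numa_available(), numa_tonode_memory() */
    #include <numaif.h>    /* mbind(), MPOL_BIND */

    /* bind a guest RAM region to one host node */
    static int bind_region(void *ram_base, size_t ram_size, int host_node)
    {
        if (numa_available() < 0)
            return -1;                   /* host has no NUMA support */

        /* libnuma variant: one call, no nodemask bookkeeping */
        numa_tonode_memory(ram_base, ram_size, host_node);

        /* equivalent raw syscall variant:
         *   unsigned long nodemask = 1UL << host_node;
         *   mbind(ram_base, ram_size, MPOL_BIND, &nodemask,
         *         sizeof(nodemask) * 8, 0);
         */
        return 0;
    }

The library hides the nodemask handling and the syscall details, which is
exactly the kind of thing I'd rather not replicate in QEMU.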
>> As long as QEMU is not true SMP, libnuma is rather useless....
> Even if it's not useful, I'd still like to add it to QEMU. That's one
> less thing that has to be merged from KVM into QEMU.
OK, but since QEMU is portable, I have to use libnuma (and a configure
check). If there is no libnuma (and no Linux), host pinning is simply
disabled and you don't actually lose anything. Other QEMU host OSes could
be added later (Solaris comes to mind). I would still try to assign guest
memory to different nodes. Although this doesn't help QEMU itself, it
would be the same code as in KVM, so more easily mergeable. Since there
are no threads (and no CPUState->threadid) I cannot add VCPU pinning
here, but that is no great loss.
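The fallback would look roughly like this (CONFIG_NUMA is just a
placeholder for whatever the configure check ends up defining):

    #include <stddef.h>

    #ifdef CONFIG_NUMA
    #include <numa.h>
    /* pin one guest node's memory to its configured host node */
    static void host_pin_memory(void *start, size_t size, int host_node)
    {
        if (numa_available() >= 0)
            numa_tonode_memory(start, size, host_node);
    }
    #else
    static void host_pin_memory(void *start, size_t size, int host_node)
    {
        /* no libnuma (or non-Linux host): the guest still sees the
         * emulated NUMA topology, only host-side pinning is skipped */
    }
    #endif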
>> Almost right, but simply calling qemu-system-x86_64 can lead to bad
>> situations. I lately saw that VCPU #0 was scheduled on one node and
>> VCPU #1 on another. This leads to random (probably excessive) remote
>> accesses from the VCPUs, since the guest assumes uniform memory
> That seems like Linux is behaving badly, no? Can you describe the
> situation more?
That is just my observation. I have to do more research to get a decent
explanation, but I think the problem is that at this early stage the
threads barely touch any memory, so Linux tries to distribute them as
evenly as possible. Just a quick run on a quad-node machine with 16 cores
in total:
qemu-system-x86_64 -smp 4 -S: VCPUs running on pCPUs: 5,9,13,5
after continue in the monitor: 5,9,10,6: VCPU 2 changed nodes
after booting has finished: 5,1,11,6: VCPU 1 changed nodes
copying a file over the network: 7,5,11,6: VCPU 1 changed nodes again
some load on the host, guest idle: 5,4,1,7: VCPU 2 changed nodes
starting bunzipping: 1,4,2,7: VCPU 0 changed nodes
bunzipping ended: 7,1,2,4: VCPUs 0 and 1 changed nodes
make -j8 on /dev/shm: 1,2,3,4: VCPU 0 changed nodes
You can see that Linux happily changes the assignments and even the
nodes, not to mention the rather arbitrary assignment at the beginning.
After some load (at the end) the scheduling comes closer to a single
node, but the memory was actually split between node 0 and node 1 (plus
a few thousand pages on nodes 2 and 3).
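(For reference, one way to see that split is move_pages(2) in query
mode; a sketch, names are made up:)

    #include <numaif.h>
    #include <stdio.h>
    #include <unistd.h>

    /* print which host node each page of a region currently lives on;
     * passing nodes == NULL makes move_pages() query instead of move */
    static void dump_page_nodes(void *base, unsigned long npages)
    {
        long psize = sysconf(_SC_PAGESIZE);
        void *pages[npages];
        int status[npages];
        unsigned long i;

        for (i = 0; i < npages; i++)
            pages[i] = (char *)base + i * psize;

        if (move_pages(0 /* self */, npages, pages, NULL, status, 0) == 0)
            for (i = 0; i < npages; i++)
                printf("page %lu -> node %d\n", i, status[i]);
    }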
> NUMA systems are expensive. If a customer cares about performance (as
> opposed to just getting more memory), then I think tools like numactl
> are pretty well known.
Well, how expensive they are depends, especially if I think of your
employer ;-) In fact every AMD dual-socket server is NUMA, and Intel
will join the game next year.
>> ...
>> Really? How do you want to assign certain _parts_ of guest memory with
>> numactl? (Let alone the rather weird way of using -mempath, which is
>> much easier done within QEMU).
> I don't think -mem-path is weird at all. In fact, I'd be inclined to
> use shared memory by default and create a temporary file name. Then
> provide a monitor interface to lookup that file name so that an explicit
> -mem-path isn't required anymore.
I didn't want to decry -mem-path; what I meant was that accomplishing a
NUMA-aware setup via -mem-path seems quite complicated to me. Why not
use a rather fool-proof way within QEMU?
>> ...
>> But I wouldn't load the admin with the burden of pinning,
>> but let this be done by QEMU/KVM. Maybe one could introduce a way to
>> tell QEMU/KVM to not pin the threads.
> This is where things start to get ugly...
Why? qemu-system-x86_64 -numa 2,pin:none and then use whatever method
you prefer (taskset, the monitor) to pin the VCPUs (or leave them
unpinned).
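Internally the per-VCPU pinning itself would be trivial; a sketch,
assuming it runs inside each VCPU thread and host_node comes from the
-numa/monitor configuration (pin:none would map to host_node < 0):

    #include <numa.h>

    static void vcpu_pin_self(int host_node)
    {
        if (numa_available() >= 0 && host_node >= 0)
            numa_run_on_node(host_node); /* affinity := CPUs of that node */
        /* host_node < 0: leave the thread unpinned */
    }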
> The other issue with pinning is what happens after live migration? What
> about single-machine load balancing? Regardless of whether we bake in
> libnuma control or not, I think an interface on the command line is not
> terribly interesting because it's too static.
I agree, but only with regard to the pinning mechanism. AFAIK the NUMA
topology itself (CPU->nodes, mem->nodes) is quite static (due to its
ACPI-based nature).
> I think a monitor
> interface is what we'd really want if we integrated with libnuma.
OK, I will implement a monitor interface with emphasis on pinning to
host nodes. What about this:
> info numa
2 nodes
node 0 cpus: 0 1 2
node 0 size: 1536 MB
node 0 host: 2
node 1 cpus: 3
node 1 size: 512 MB
node 1 host: *
// similar to numactl --hardware, * means all nodes (no pinning)
> numa pin:0;3
// static pinning: guest 0 -> host 0, guest 1 -> host 3
> numa pin:*;
// guest node 0 -> all nodes, guest node 1: keep as it is
// or maybe: numa pin:0-3;
> numa migrate:1;2
// like pin, but moving all the memory, too
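The migrate variant would essentially boil down to re-binding the range
and asking the kernel to move the already resident pages as well; a
sketch, names made up:

    #include <stddef.h>
    #include <numaif.h>

    static int numa_migrate_range(void *start, size_t size,
                                  int new_host_node)
    {
        unsigned long nodemask = 1UL << new_host_node;

        /* MPOL_MF_MOVE also migrates pages that are already allocated */
        return mbind(start, size, MPOL_BIND, &nodemask,
                     sizeof(nodemask) * 8, MPOL_MF_MOVE);
    }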
Additionally one could use some kind of home node, so one could
temporarily change the VCPUs' affinity and later return to the optimal
affinity (where the memory is located) without specifying it again.
Comments are welcome.
Regards,
Andre.
--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
----to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG,
Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register Court Dresden: HRA 4896, General Partner authorized
to represent: AMD Saxony LLC (Wilmington, Delaware, US)
General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy