From: Anthony Liguori
Subject: Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
Date: Mon, 08 Dec 2008 16:01:02 -0600
Message-ID: <493D991E.1000008@codemonkey.ws>
In-Reply-To: <493D95A9.1090307@andrep.de>
References: <49392CB6.9000000@amd.com> <49393A78.5030601@codemonkey.ws> <4939473D.6080606@amd.com> <49394BB3.9080509@codemonkey.ws> <493D95A9.1090307@andrep.de>
To: André Przywara
Cc: Avi Kivity, kvm@vger.kernel.org

André Przywara wrote:
> I was partly wrong, the code is in BOCHS CVS, but not in qemu. It wasn't
> in the BOCHS 2.3.7 release, which qemu is currently based on. Could you
> pull the latest BIOS code from BOCHS CVS to qemu? This would give us the
> firmware interface for free and I could more easily port my patches.

Working on that right now. BOCHS CVS has diverged a fair bit from what
we have, so I'm adjusting our current patches and doing regression
testing.

> What's actually bothering you about the libnuma dependency? I could
> directly use the Linux mbind syscall, but I think using a library is
> more sane (and probably more portable).

You're making a default policy decision (pin nodes and pin CPUs).
You're assuming that Linux will do the wrong thing by default and that
the decision we'll be making is better.

That policy decision requires more validation. We need benchmarks
showing what performance is like with pinning vs. without, and we need
to understand whether the bad performance is a Linux bug that can be
fixed or whether it's something fundamental.

What I'm concerned about is that it'll make the default situation
worse. I advocated punting to management tools because that at least
gives the user the ability to make their own decisions, which means you
don't have to prove that this is the correct default.

I don't care about a libnuma dependency. Library dependencies are fine
as long as they're optional.

>>> Almost right, but simply calling qemu-system-x86_64 can lead to bad
>>> situations. I lately saw that VCPU #0 was scheduled on one node and
>>> VCPU #1 on another. This leads to random (probably excessive) remote
>>> accesses from the VCPUs, since the guest assumes uniform memory

>> That seems like Linux is behaving badly, no? Can you describe the
>> situation more?

> That is just my observation. I have to do more research to get a decent
> explanation, but I think the problem is that in this early state the
> threads barely touch any memory, so Linux tries to distribute them as
> best as possible. Just a quick run on a quad-node machine with 16 cores
> in total:

How does memory migration fit into all of this, though? Statistically
speaking, if your NUMA guest is behaving well, it should be easy to
recognize the groupings and perform the appropriate page migration. I
would think even the most naive page migration tool would be able to do
the right thing.
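Even something as dumb as the following would probably get the common
case right. This is an untested sketch of a migration pass using the
move_pages() syscall; the region pointer, page count, and target node
are all made up for illustration:

    /* Link with -lnuma; move_pages() needs Linux >= 2.6.18. */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int migrate_to_node(void *region, size_t npages, int node)
    {
        long psz = sysconf(_SC_PAGESIZE);
        void **pages = malloc(npages * sizeof(*pages));
        int *nodes   = malloc(npages * sizeof(*nodes));
        int *status  = malloc(npages * sizeof(*status));
        size_t i;
        int ret;

        for (i = 0; i < npages; i++) {
            pages[i] = (char *)region + i * psz;  /* page to move */
            nodes[i] = node;                      /* destination node */
        }

        /* pid 0 == this process; status[] gets a per-page result. */
        ret = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
        if (ret < 0)
            perror("move_pages");

        free(pages);
        free(nodes);
        free(status);
        return ret;
    }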
>> NUMA systems are expensive. If a customer cares about performance
>> (as opposed to just getting more memory), then I think tools like
>> numactl are pretty well known.

> Well, expensive depends, especially if I think of your employer ;-) In
> fact every AMD dual-socket server is NUMA, and Intel will join the
> game next year.

But the NUMA characteristics of an AMD system are relatively minor. I
doubt that static pinning would be what most users want, since it could
reduce overall system performance noticeably.

Even with more traditional NUMA systems, the cost of remote memory
access is often outweighed by the opportunity cost of leaving a CPU
idle. That's what pinning does: it leaves CPUs potentially idle.

> Additionally one could use some kind of home node, so one could
> temporarily change the VCPUs' affinity and later return to the optimal
> affinity (where the memory is located) without specifying it again.

Please resubmit with the first three patches at the front. I don't
think exposing NUMA attributes to a guest is at all controversial, so
that's relatively easy to apply.

I'm not saying that the last patch can't be applied, but I don't think
it's as obvious that it's going to be a win when you start doing
performance tests.

Regards,

Anthony Liguori

> Comments are welcome.
>
> Regards,
> Andre.
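P.S. For reference, the two approaches you mention for binding guest
memory, libnuma vs. the raw mbind syscall, look roughly like this.
Untested sketch; the region pointer and node number are made up:

    /* Link with -lnuma. */
    #include <numa.h>      /* libnuma wrapper: numa_tonode_memory() */
    #include <numaif.h>    /* raw syscall wrapper: mbind() */
    #include <stdio.h>

    /* Either call alone does the job; they're shown side by side. */
    void bind_region(void *region, size_t len, int node)
    {
        /* Option 1: libnuma builds the nodemask for you. */
        numa_tonode_memory(region, len, node);

        /* Option 2: the equivalent raw mbind() call. */
        unsigned long nodemask = 1UL << node;
        if (mbind(region, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0) < 0)
            perror("mbind");
    }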