From: Chegu Vinod <chegu_vinod@hp.com>
To: kvm@vger.kernel.org
Subject: Re: How to determine the backing host physical memory for a given guest ?
Date: Fri, 11 May 2012 01:22:26 +0000 (UTC)
Message-ID: <loom.20120511T013646-106@post.gmane.org>
In-Reply-To: <4FABE004.9010808@linux.vnet.ibm.com>
Andrew Theurer <habanero@linux.vnet.ibm.com> writes:
>
> On 05/09/2012 08:46 AM, Avi Kivity wrote:
> > On 05/09/2012 04:05 PM, Chegu Vinod wrote:
> >> Hello,
> >>
> >> On an 8 socket Westmere host I am attempting to run a single guest and
> >> characterize the virtualization overhead for a system intensive
> >> workload (AIM7-high_systime) as the size of the guest scales (10way/64G,
> >> 20way/128G, ... 80way/512G).
> >>
> >> To do some comparisons between the native and guest runs, I have
> >> been using "numactl" to control the cpu node & memory node bindings for
> >> the qemu instance. For larger guest sizes I end up binding across multiple
> >> localities, e.g. for a 40-way guest:
> >>
> >> numactl --cpunodebind=0,1,2,3 --membind=0,1,2,3 \
> >> qemu-system-x86_64 -smp 40 -m 262144 \
> >> <....>
> >>
> >> I understand that actual mappings from a guest virtual address to host
> >> physical address could change.
> >>
> >> Is there a way to determine [at a given instant] which host NUMA node is
> >> providing the backing physical memory for the active guest's kernel and
> >> also for the apps actively running in the guest?
> >>
> >> Guessing that there is a better way (some tool available?) than just
> >> diff'ing the per-node memory usage from the before and after output of
> >> "numactl --hardware" on the host.
> >>
> >
> > Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.
> >
>
> You can look at /proc/<pid>/numa_maps and see all the mappings for the
> qemu process. There should be one really large mapping for the guest
> memory, and in that line a count of dirty pages listed for each NUMA
> node. This will tell you how much memory comes from each node, but not
> specifically "which page is mapped where".
Thanks. I will look at this in more detail.
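For reference, something along these lines is what I plan to try on the host
(a rough sketch; the pgrep pattern is just a placeholder for however the qemu
process is identified on my setup):

    # locate the qemu process backing the guest
    pid=$(pgrep -f qemu-system-x86_64)
    # dump its mappings with the per-NUMA-node page counts
    grep anon /proc/$pid/numa_maps
    # the largest anonymous mapping should be the guest RAM; the N0=, N1=, ...
    # fields on that line show how many pages each host node is backing

If I read the format right, that gives the per-node breakdown of the guest
memory at that instant, though (as you say) not which specific page is mapped
where.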
>
> Keep in mind that with the numactl invocation you are using, you will likely
> not get the benefits of the NUMA enhancements found in the linux kernel in
> your guest (or host). There are a couple of reasons: (1) your guest does
> not have a NUMA topology defined (based on what I see from the qemu
> command above), so it will not do anything special based on the host
> topology. Also, things that are broken down per NUMA node, like some
> spin-locks and sched-domains, are now system-wide/flat. This is a big
> deal for the scheduler and for other things like kmem allocation. With a
> single 80-way VM with no NUMA, you will likely have massive spin-lock
> contention on some workloads.
We had seen evidence of increased lock contention (via lockstat etc.) as the
guest size increased.
[On a related note: given the nature of the system-intensive workload, the
combination of the ticket-based locks in the guest OS and the PLE handling code
in the host kernel was not helping, so I temporarily worked around this. I hope
to try out the PV lock changes soon.]
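In case the details are useful, this is roughly how the lock statistics were
collected (a sketch, assuming the guest kernel is built with CONFIG_LOCK_STAT=y;
the workload script name is only a placeholder for the actual AIM7 run):

    # inside the guest: clear and enable lock statistics
    echo 0 > /proc/lock_stat
    echo 1 > /proc/sys/kernel/lock_stat
    # run the workload (placeholder for the real AIM7 high_systime run)
    ./run_aim7_high_systime.sh
    # stop collection and look at the statistics
    echo 0 > /proc/sys/kernel/lock_stat
    head -60 /proc/lock_stat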
Regarding the -numa option:
I had tried the -numa option earlier (about a month ago). The layout I
specified didn't match the layout the guest saw. I haven't yet looked into the
exact reason, but I learned that there was already an open issue:
https://bugzilla.redhat.com/show_bug.cgi?id=816804
I also remember noticing a warning message when the -numa option was used for a
guest with more than 64 VCPUs (in my case, 80 VCPUs). I will be looking at the
code soon to see if there is any limitation...
I have been using [more or less the upstream version of] qemu directly to start
the guest. For the 10-way, 20-way, 40-way and 60-way guest sizes I had been
using numactl just to control the NUMA nodes where the guest ends up running.
After the guest booted up I used to set the affinity of the VCPUs (to specific
cores on the host) via "taskset" (this was a touch painful compared to virsh
vcpupin). For the 80-way guest (on an 80-way host) I don't use numactl.
I noticed that doing a "taskset" to pin the VCPUs didn't always give better
performance... perhaps this is due to the absence of a NUMA layout in the guest.
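For what it's worth, the pinning itself was done roughly like this (a sketch;
the thread IDs come from the qemu monitor's "info cpus" output, and the
vcpu-to-core mapping below is only an example):

    # pin vcpu thread N to host core N, one taskset call per vcpu, e.g.:
    taskset -pc 0 <thread_id_of_vcpu0>
    taskset -pc 1 <thread_id_of_vcpu1>
    # ...repeated for all 40 (or 80) vcpus, which is the painful part

virsh vcpupin at least avoids having to dig the thread IDs out of the monitor
first.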
> (2) Once the VM does have a NUMA topology (via qemu
> -numa), one still cannot manually set the mempolicy for the portion of the VM
> memory that represents each NUMA node in the VM (or have this done
> automatically with something like autoNUMA). Therefore, it's difficult
> to forcefully map each VM node's memory to the corresponding
> host node.
>
> There are some things you can do to mitigate some of this. Definitely
> define the VM to match the NUMA topology found on the host.
The native/host platform has multiple levels of NUMA...
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  14  23  23  27  27  27  27
  1:  14  10  23  23  27  27  27  27
  2:  23  23  10  14  27  27  27  27
  3:  23  23  14  10  27  27  27  27
  4:  27  27  27  27  10  14  23  23
  5:  27  27  27  27  14  10  23  23
  6:  27  27  27  27  23  23  10  14
  7:  27  27  27  27  23  23  14  10
Qemu's -numa option seems to allow for only one level (i.e., specifying
multiple sockets, etc.). Am I missing something?
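For reference, what I had tried was roughly the following (the cpu ranges and
sizes here are only illustrative, for a 40-way/256G guest split across four
nodes); as far as I can tell it only describes flat nodes, with no way to
express the two distance levels shown above:

    numactl --cpunodebind=0,1,2,3 --membind=0,1,2,3 \
    qemu-system-x86_64 -smp 40 -m 262144 \
        -numa node,cpus=0-9,mem=65536 \
        -numa node,cpus=10-19,mem=65536 \
        -numa node,cpus=20-29,mem=65536 \
        -numa node,cpus=30-39,mem=65536 \
        <....>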
> at least allow good scaling wrt locks and scheduler in the guest. As
> for getting memory placement close (a page in VM node x actually resides
> in host node x), you have to rely on vcpu pinning + guest NUMA topology,
> combined with default mempolicy in the guest and host.
I did recompile both the kernels with the SLUB allocator enabled...
> As pages are
> faulted in the guest, the hope is that the vcpu which did the faulting
> is running on the right node (guest and host), its guest OS mempolicy
> ensures this page is allocated in the guest-local node, and that
> allocation causes a fault in qemu, which is -also- running on the -host-
> node X. The vcpu pinning is critical to get qemu to fault that memory
> in on the correct node.
In the absence of a NUMA layout in the guest it doesn't look like pinning
helped... but I think I understand what you are saying. Thanks!
> Make sure you do not use numactl for any of this.
> I would suggest using libvirt and defining the vcpu pinning and the NUMA
> topology in the XML.
I will try this in the coming days (waiting to get back on the system :)).
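Just so I try the right thing: I'm assuming you mean something along these
lines in the domain XML (a sketch; the cpusets and cell sizes below are only
placeholders, with one vcpupin entry per vcpu and one cell per guest node, and
the cell memory given in KiB, so 67108864 = 64G per node):

    <vcpu>40</vcpu>
    <cputune>
      <vcpupin vcpu='0' cpuset='0'/>
      <vcpupin vcpu='1' cpuset='1'/>
      <!-- ...one entry per vcpu... -->
    </cputune>
    <cpu>
      <numa>
        <cell cpus='0-9'   memory='67108864'/>
        <cell cpus='10-19' memory='67108864'/>
        <cell cpus='20-29' memory='67108864'/>
        <cell cpus='30-39' memory='67108864'/>
      </numa>
    </cpu>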
Thanks for the detailed response!
Vinod
>
> -Andrew Theurer
>