From: Alexander Graf <agraf@suse.de>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
kvm list <kvm@vger.kernel.org>,
dipankar@in.ibm.com,
qemu-devel Developers <qemu-devel@nongnu.org>,
Chris Wright <chrisw@sous-sol.org>,
bharata@linux.vnet.ibm.com, Vaidyanathan S <svaidy@in.ibm.com>
Subject: Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 23 Nov 2011 19:34:37 +0100
Message-ID: <4ECD3CBD.7010902@suse.de>
In-Reply-To: <20111123150300.GH8397@redhat.com>
On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:
> Hi!
>
> On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
>> Fundamentally, the entity that should be deciding what memory should be present
>> and where it should be located is the kernel. I'm fundamentally opposed to trying
>> to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
>>
>> From what I can tell about ms_mbind(), it just uses process knowledge to bind
>> specific areas of memory to a memsched group and lets the kernel decide what to
>> do with that knowledge. This is exactly the type of interface that QEMU should
>> be using.
>>
>> QEMU should tell the kernel enough information such that the kernel can make
>> good decisions. QEMU should not be the one making the decisions.
> True, QEMU won't have to decide where the memory and vcpus should be
> located (but hey, it wouldn't need to decide that even if you used
> cpusets: you can use relative mbind with cpusets, and the admin or a
> cpuset job scheduler could decide), but it's still QEMU making the
> decision of which memory and which vcpu threads to
> ms_mbind/ms_tbind. Think about how you're going to create the input to
> those syscalls...
>
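(Just to make that scanning concrete: today it would be roughly the
following sketch using libnuma, which wraps the sysfs parsing. Error
handling is elided.)

#include <numa.h>   /* libnuma: link with -lnuma */
#include <stdio.h>

int main(void)
{
    int node;

    if (numa_available() < 0)
        return 1;   /* no NUMA support on this host */

    /* walk the host nodes to build the ms_mbind/ms_tbind inputs */
    for (node = 0; node <= numa_max_node(); node++) {
        long long freemem;
        long long size = numa_node_size64(node, &freemem);
        printf("node %d: %lld bytes total, %lld free\n",
               node, size, freemem);
    }
    return 0;
}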
> If it weren't QEMU deciding that, QEMU wouldn't be required to scan
> the whole host physical NUMA (cpu/memory) topology in order to create
> the "input" arguments of ms_mbind/ms_tbind. And when you migrate the
> VM to another host, the whole vtopology may be counter-productive,
> because the kernel isn't automatically detecting the numa affinity
> between threads, and the guest vtopology will stick to whatever
> _physical_ NUMA topology was seen on the first host where the VM was
> created.
>
> I doubt the assumption that all cloud nodes will have the same
> physical NUMA topology is a reasonable one.
>
> Furthermore, to get the same benefits that QEMU gets on the host by
> using ms_mbind/ms_tbind, every single guest application would have to
> be modified to scan the guest vtopology and call ms_mbind/ms_tbind too
> (or use the hard bindings, which is what we're trying to avoid).
>
> I think it's unreasonable to expect all applications to use
> ms_mbind/ms_tbind in the guest; at best guest apps will use cpusets or
> wrappers, and few apps will be modified for sys_ms_tbind/mbind.
>
> You can always have the supercomputer case with just one app that is
> optimized and a single VM spanning the whole host, but in that
> scenario hard bindings would work perfectly too.
>
> In my view the trouble with the numa hard bindings is not the fact
> that they're hard and that qemu also has to decide the location (in
> fact it doesn't need to decide the location if you use cpusets and
> relative mbinds). The bigger problem is that either the admin or the
> app developer has to explicitly scan the physical NUMA topology (both
> cpus and memory) and tell the kernel how much memory to bind to each
> thread. ms_mbind/ms_tbind only partially solve that problem. They're
> similar to mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> don't need an admin or a cpuset-job-scheduler (or a perl script) to
> redistribute the hardware resources.
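For reference, the relative mbind you mention looks roughly like this
today (a minimal sketch; the nodemask is relative to the caller's
cpuset, not to physical node numbers):

#include <numaif.h> /* mbind(), MPOL_BIND, MPOL_F_RELATIVE_NODES */
#include <stdio.h>

static void bind_region(void *addr, unsigned long len)
{
    unsigned long nodemask = 0x3; /* cpuset-relative nodes 0 and 1 */

    if (mbind(addr, len, MPOL_BIND | MPOL_F_RELATIVE_NODES,
              &nodemask, 8 * sizeof(nodemask), MPOL_MF_MOVE))
        perror("mbind");
}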
Well yeah, of course the guest needs to see some topology. I don't see
why we'd have to actually scan the host for this though. All we need to
tell the kernel is "this memory region is close to that thread".
So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able to
tell the kernel that this GB of RAM actually is close to that vCPU thread.
Of course the admin still needs to decide how to split up memory. That's
the deal with emulating real hardware: you get the interfaces hardware
gets :). However, if you follow a reasonable default strategy, such as
splitting your RAM into equal NUMA chunks across guest vCPUs, you're
probably close enough to the optimal usage models. Or at least you'd
have a close enough approximation of how this mapping could work for the
_guest_ regardless of the host, and when you migrate the VM somewhere
else it should also work reasonably well.
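Concretely, I'd picture the default as something like this hand-wavy
sketch. I'm assuming the prototypes proposed in this RFC, roughly
ms_tbind(tid, ms_id) and ms_mbind(addr, len, ms_id); the exact
signatures aren't set in stone and ms_bind_default() is made up:

#include <sys/types.h>

/* assumed prototypes for the proposed syscalls */
long ms_tbind(pid_t tid, int ms_id);
long ms_mbind(void *addr, unsigned long len, int ms_id);

/* carve guest RAM into one equal chunk per vCPU and tell the kernel
 * each chunk is close to its vCPU thread */
static void ms_bind_default(char *guest_ram, unsigned long ram_size,
                            pid_t *vcpu_tids, int nr_vcpus)
{
    unsigned long chunk = ram_size / nr_vcpus;
    int i;

    for (i = 0; i < nr_vcpus; i++) {
        ms_tbind(vcpu_tids[i], i);                 /* vCPU i -> group i */
        ms_mbind(guest_ram + i * chunk, chunk, i); /* its RAM -> group i */
    }
}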
> Now dealing with bindings isn't a big deal for qemu; in fact this API
> is pretty much ideal for qemu, but it won't make life substantially
> easier compared to hard bindings. It's simply that the management code
> that is now done with a perl script will have to be moved into the
> kernel. It looks like an incremental improvement compared to relative
> mbind+cpusets, but I'm unsure if it's the best we could aim for and
> what we really need in virt, considering we deal with VM migration too.
>
> The real long term design to me is not to add more syscalls, but to
> initially handle the case of a process/VM spanning not more than one
> node in thread count and amount of memory. That's not too hard, and in
> fact I have benchmarks for the scheduler already showing it to work
> pretty well (it creates too strict an affinity, but it can be relaxed
> to be more useful). Then later add some mechanism (the simplest is
> page faults at low frequency) to create a
> guest_vcpu_thread<->host_memory affinity, and have a paravirtualized
> interface that tells the guest scheduler to group CPUs.
>
> If the guest scheduler runs free and is allowed to move threads
> randomly, without any paravirtualized interface that controls the CPU
> thread migration in the guest scheduler, the thread<->memory affinity
> on the host will be hopeless. But a paravirtualized interface that
> makes a guest thread stick to vcpu0/1/2/3 and not go onto vcpu4/5/6/7
> will allow creating a more meaningful guest_thread<->physical_ram
> affinity on the host through KVM page faults. And then this will also
> work with VM migration and without having to create a vtopology in the
> guest.
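If I understand the idea, the guest-facing side could be as thin as
this (the interface is completely made up, just to picture the
concept):

/* Hypothetical paravirt interface, purely illustrative: the host
 * hands the guest a vCPU grouping, and the guest scheduler keeps a
 * task's threads inside one group instead of migrating them freely
 * across all vCPUs. Nothing like this exists today. */
struct pv_cpu_group {
    unsigned long vcpu_mask; /* e.g. 0x0f = vcpu0-3, 0xf0 = vcpu4-7 */
};

/* hypothetical hypercall: ask the host how vCPUs should be grouped */
int pv_get_cpu_groups(struct pv_cpu_group *groups, int max_groups);

The guest scheduler would then treat each group like a scheduling
domain, so the host-side sampling sees a stable thread<->vcpu mapping.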
So you basically want to create NUMA topologies dynamically from the
runtime behavior of the guest? What if that behavior changes over time?
> And for apps running in the guest, no paravirt will be needed of course.
>
> The reason paravirt would be needed for qemu-kvm with a fully automatic
> thread<->memory affinity is that the vcpu threads are magic. What runs
> in the vcpu threads are guest threads, and those can move through the
> guest CPU scheduler from vcpu0 to vcpu7. If that happens and we have 4
> physical cpus for each physical node, any affinity we measure on the
> host will be meaningless. Normal threads using NPTL won't behave like
> that. Maybe some other thread library could have a "scheduler" inside
> that would make it behave like a vcpu thread (one thread really,
> with several threads inside), but those existed mostly to simulate
> multiple threads in a single thread, so they don't matter. And in this
> respect sys_tbind also requires the tid to have meaningful memory
> affinity. sys_tbind/mbind gets away with it by creating a vtopology in
> the guest, so the guest scheduler then follows the vtopology (but the
> vtopology breaks across VM migration, and to really be followed well
> with sys_mbind/tbind it would require all apps to be modified).
>
> Grouping guest threads to stick to some set of vcpus sounds immensely
> simpler than changing the whole guest vtopology at runtime, which
> would involve changing the memory layout too.
>
> NOTE: the paravirt cpu grouping interface would also handle the case
> of 3 guests of 2.5G each on an 8G host (4G per node). One of the three
> guests will have memory spanning two nodes, and the guest
> vtopology created by sys_mbind/tbind can't handle it. Paravirt
> cpu grouping and automatic thread<->memory affinity on the host will
> handle it, just like they will handle VM migration across hosts with
> different physical topologies. The problem is that to create a
> thread<->memory affinity we'll have to issue some page faults in KVM in
> the background. How harmful that is I don't know at this point. So the
> fully automatic thread<->memory affinity is a bit of a vapourware
> concept at this point (process<->memory affinity seems to work
> already, though).
>
> But Peter's migration code was already driven by page faults (not
> included in the patch he posted), and the other existing patch,
> called migrate-on-fault, also depended on page faults. So I am
> optimistic we could have thread<->memory affinity working too in the
> longer term. The plan would be to run the faults at low frequency, and
> only if we can't fit a process into one node (in terms of both number
> of threads and memory). If the process fits in one node, we wouldn't
> even need any page faults, and the information in the pagetables would
> be enough to make a good decision. The downside is that it's
> significantly more difficult to implement the thread<->memory
> affinity, and that's why I'm focusing initially on the simpler case of
> considering only the process<->memory affinity. That's fairly easy.
>
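The fault sampling, if I picture it right, would be roughly the
following pseudo-kernel-code. unmap_sample_batch() and
account_remote_access() are made up for illustration; the rest are
existing kernel helpers:

/* made-up helpers for this sketch */
void unmap_sample_batch(struct mm_struct *mm);
void account_remote_access(struct task_struct *task, struct page *page);

/* periodically unmap a small batch of the process's pages; the
 * resulting minor faults reveal which thread touches which page */
static void numa_affinity_sample(struct mm_struct *mm)
{
    unmap_sample_batch(mm); /* clear present bits on a few pages */
}

/* called from the fault path for sampled pages */
static void numa_affinity_fault(struct task_struct *task,
                                struct page *page)
{
    int tnode = cpu_to_node(task_cpu(task)); /* where the thread runs */
    int pnode = page_to_nid(page);           /* where the page lives */

    if (tnode != pnode)
        account_remote_access(task, page); /* candidate for migration */
}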
> So for the time being this incremental improvement may be justified;
> it moves the logic from a perl script into the kernel, but I'm just
> skeptical it provides a big advantage compared to the numa bindings we
> already have in the kernel, especially if in the long term we can get
> rid of a vtopology completely.
I actually like the idea of just telling the kernel how close memory
will be to a thread. Sure, you can handle this basically by shoving your
scheduler into user space, but isn't managing processes what a kernel is
supposed to do in the first place?
You can always argue for a microkernel, but having a scheduler in user
space (perl script) and another one in the kernel doesn't sound very
appealing to me. If you want to go full-on user space, sure, I can see
why :).
Either way, your approach sounds like it's very much in the concept
phase, while this is something that can actually be tested and
benchmarked today. So yes, I want the interim solution - just in case
your plan doesn't work out :). Oh, and then there are the non-PV guests too...
Alex