From: Andrew Theurer <habanero@linux.vnet.ibm.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Andre Przywara <andre.przywara@amd.com>,
Marcelo Tosatti <mtosatti@redhat.com>,
"avi@redhat.com" <avi@redhat.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
Date: Tue, 31 Aug 2010 22:38:45 -0500
Message-ID: <1283312325.3936.20.camel@localhost.localdomain>
In-Reply-To: <4C7D7C2A.7000205@codemonkey.ws>
On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> >
> >> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> >>
> >>> Anthony Liguori wrote:
> >>>
> >>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>>>
> >>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >>>>>
> >>>>>> According to the user-provided assignment bind the respective part
> >>>>>> of the guest's memory to the given host node. This uses Linux'
> >>>>>> mbind syscall (which is wrapped only in libnuma) to realize the
> >>>>>> pinning right after the allocation.
> >>>>>> Failures are not fatal, but produce a warning.
> >>>>>>
> >>>>>> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
> >>>>>> ...
> >>>>>>
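(For context, the mechanism the patch describes boils down to roughly the
sketch below. This is illustrative only, not the patch's actual code: the
function name and the single-region, single-node shape are my assumptions.
Build against libnuma, i.e. link with -lnuma:

    #include <stdio.h>
    #include <stddef.h>
    #include <numaif.h>   /* mbind(), MPOL_BIND -- libnuma's syscall wrappers */

    /* Pin a freshly allocated guest RAM region to one host node.
     * As in the patch description, failure is a warning, not fatal. */
    static void pin_guest_region(void *addr, size_t len, int host_node)
    {
        unsigned long nodemask = 1UL << host_node;

        /* Future page faults in [addr, addr+len) allocate on host_node. */
        if (mbind(addr, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0) != 0) {
            perror("warning: mbind failed, memory not pinned");
        }
    }
)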
> >>>>> Why is it not possible (or perhaps not desired) to change the binding
> >>>>> after the guest is started?
> >>>>>
> >>>>> Sounds inflexible.
> >>>>>
> >>> The solution is to introduce a monitor interface to adjust the
> >>> pinning later, allowing both changing the affinity alone (which only
> >>> takes effect for future fault-ins) and actually copying the memory
> >>> (more costly).
> >>>
> >> This is just duplicating numactl.
> >>
> >>
> >>> Actually this is the next item on my list, but I wanted to bring up
> >>> the basics first to avoid recoding parts afterwards. Also I am not
> >>> (yet) familiar with the QMP protocol.
> >>>
> >>>> We really need a solution that lets a user use a tool like numactl
> >>>> outside of the QEMU instance.
> >>>>
> >>> I fear that is not how it's meant to work with the Linux NUMA API. In
> >>> contrast to the VCPU threads, which are externally visible entities
> >>> (PIDs), the memory should be private to the QEMU process. While you
> >>> can change the NUMA allocation policy of the _whole_ process, there is
> >>> no way to externally distinguish parts of the process's memory.
> >>> Although you could later (and externally) migrate already-faulted
> >>> pages (via move_pages(2) and by looking at /proc/$$/numa_maps), you
> >>> would let an external tool interfere with QEMU's internal memory
> >>> management. Take for instance a change of the allocation policy
> >>> regarding the 1MB and 3.5-4GB holes. An external tool would have to
> >>> either track such changes, or you simply could not change such things
> >>> in QEMU.
> >>>
> >> It's extremely likely that if you're doing NUMA pinning, you're also
> >> doing large pages via hugetlbfs. numactl can already set policies for
> >> files in hugetlbfs, so all you need to do is have a separate hugetlbfs
> >> file for each NUMA node.
> >>
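(Concretely, Anthony's suggestion would look something like this from
outside qemu; the hugetlbfs mount point, file names, and sizes here are
made up for illustration:

    # one hugetlbfs-backed file per guest node, policy set externally
    numactl --membind=0 --offset=0 --length=2G --file /hugepages/guest-node0
    numactl --membind=1 --offset=0 --length=2G --file /hugepages/guest-node1
)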
> > Why would we resort to hugetlbfs when we have transparent hugepages?
> >
>
> If you care about NUMA pinning, I can't believe you don't want
> guaranteed large page allocation, which THP does not provide.
I personally want a more automatic approach to placing VMs in NUMA nodes
(not directed by the qemu process itself), but I'd also like to support
a user's desire to pin and place CPUs and memory, especially for large
VMs that need to be defined as multi-node. For user-defined pinning,
libhugetlbfs will probably be fine, but for most VMs I'd like to ensure
we can do things like ballooning well, and I am not so sure that will be
easy with libhugetlbfs.
> The general point though is that we should find a way to partition
> memory in qemu such that an external process can control the actual NUMA
> placement. This gives us maximum flexibility.
>
> Otherwise, what do we implement in QEMU? Direct pinning of memory to
> nodes? Can we migrate memory between nodes? Should we support
> interleaving memory between two virtual nodes? Why pick and choose when
> we can have it all?
If there were a better way to do this than hugetlbfs, then I don't think
I would shy away from it. Is there another way to change NUMA
policies on mappings from a user tool? We can already inspect them
with /proc/<pid>/numa_maps. Is this something that could be added to
numactl?
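The closest thing I'm aware of is move_pages(2), which does take a pid,
though it migrates pages once rather than installing a policy for future
allocations. A rough sketch (link with -lnuma); the page list here is
assumed to have been parsed out of numa_maps beforehand:

    #include <stdio.h>
    #include <sys/types.h>
    #include <numaif.h>   /* move_pages(), MPOL_MF_MOVE via libnuma */

    /* Migrate already-faulted pages of another process to target_node.
     * 'pages' holds page-aligned addresses in the target's address space,
     * e.g. gathered from /proc/<pid>/numa_maps. */
    static int migrate_to_node(pid_t pid, void **pages,
                               unsigned long count, int target_node)
    {
        int nodes[count];     /* desired node for each page */
        int status[count];    /* per-page result, or negative errno */
        unsigned long i;

        for (i = 0; i < count; i++)
            nodes[i] = target_node;

        if (move_pages(pid, count, pages, nodes, status,
                       MPOL_MF_MOVE) != 0) {
            perror("move_pages");
            return -1;
        }
        return 0;
    }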
>
> > FWIW, large apps like databases have set a precedent for managing their
> > own NUMA policies.
>
> Of course, because they know what their NUMA policy should be. They live
> in a simple world where they assume they're the only application on the
> system; they read the distance tables, figure they'll use XX% of all
> physical memory, and then pin how they see fit.
>
> But an individual QEMU process lives in a complex world. It's almost
> never the only thing on the system, and it's only allowed to use a subset
> of resources. It's not sure what set of resources it can and can't use,
> and that often changes. The topology chosen for a guest is
> static, but its host topology may be dynamic due to things like live
> migration.
True, that's why this would require support for changing it via the monitor.
> In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.
> Instead, it needs to let something with a larger view of the system
> determine a NUMA policy that makes sense overall.
I agree.
> There are two ways we can do this. We can implement monitor commands
> that attempt to expose every single NUMA tunable possible. Or we can
> tie into the existing commands, which guarantees that we support every
> possible tunable and that, as NUMA support in Linux evolves, we get all
> the new features for free.
Assuming there's nothing new that needs to be exposed in qemu to work with
whatever new features numactl/libnuma gain. But perhaps that's a lot
less likely.
> And, since numactl already supports setting policies on files in
> hugetlbfs, all we need is a simple change to qemu to allow -mem-path to
> work per-node instead of globally. And it's useful for implementing other
> kinds of things, like having one node with guaranteed large pages and
> another with THP, or some other fanciness.
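(Hypothetically -- this per-node syntax does not exist today and is purely
an illustration of the proposal:

    qemu -numa node,nodeid=0,mem=2048,mem-path=/hugepages/node0 \
         -numa node,nodeid=1,mem=2048,mem-path=/hugepages/node1

with numactl then setting a different policy on each backing file.)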
If it were not dependent on hugetlbfs, then I don't think I would have
an issue.
> Sounds awfully appealing to me.
>
> > I don't see why qemu should be any different.
> > Numactl is great for small apps that need to be pinned to one node, or
> > spread evenly on all nodes. Having to get hugetlbfs involved just to
> > work around a shortcoming of numactl seems like a bad idea.
> >
>
> You seem to be asserting that we should implement a full NUMA policy in
> QEMU. What should it be when we don't (in QEMU) know what else is
> running on the system?
I don't think qemu itself should decide where to "be" on the system.
I would like to have -something- else make those decisions, either a
user or some management daemon that looks at the whole picture. Or <gulp>
get the scheduler involved (with new algorithms).
I am still quite curious whether numactl/libnuma could be extended to set
some policies on individual mappings. Then we would not even need to
have multiple -mem-path's.
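For what it's worth, libnuma can already set per-mapping policy, but only
from inside the owning process. A minimal sketch (link with -lnuma), with
addr/len/node standing in for a real mapping:

    #include <stddef.h>
    #include <numa.h>

    /* In-process only: there is no pid-taking equivalent today,
     * which is exactly the gap discussed above. */
    static void bind_mapping(void *addr, size_t len, int node)
    {
        numa_tonode_memory(addr, len, node);
    }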
-Andrew
>
> Regards,
>
> Anthony Liguori
>
Thread overview: 13+ messages
2010-08-11 13:52 [PATCH 0/4]: NUMA: add host binding Andre Przywara
2010-08-11 13:52 ` [PATCH 1/4] NUMA: change existing NUMA guest code to use new bitmap implementation Andre Przywara
2010-08-11 13:52 ` [PATCH 2/4] NUMA: add Linux libnuma detection Andre Przywara
2010-08-11 13:52 ` [PATCH 3/4] NUMA: parse new host dependent command line options Andre Przywara
2010-08-11 13:52 ` [PATCH 4/4] NUMA: realize NUMA memory pinning Andre Przywara
2010-08-23 18:59 ` Marcelo Tosatti
2010-08-23 19:27 ` Anthony Liguori
2010-08-23 21:16 ` Andre Przywara
2010-08-23 21:27 ` Anthony Liguori
2010-08-31 20:54 ` Andrew Theurer
2010-08-31 22:03 ` Anthony Liguori
2010-09-01 3:38 ` Andrew Theurer [this message]
2010-09-09 20:00 ` Andre Przywara