qemu-devel.nongnu.org archive mirror
From: Wanlong Gao <gaowanlong@cn.fujitsu.com>
To: Eduardo Habkost <ehabkost@redhat.com>,
	Anthony Liguori <aliguori@us.ibm.com>
Cc: aarcange@redhat.com, andre.przywara@amd.com,
	qemu-devel@nongnu.org, mgorman@suse.de, pbonzini@redhat.com,
	Wanlong Gao <gaowanlong@cn.fujitsu.com>
Subject: Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes
Date: Thu, 06 Jun 2013 17:30:16 +0800
Message-ID: <51B056A8.6060908@cn.fujitsu.com>
In-Reply-To: <20130605155409.GU2580@otherpad.lan.raisama.net>

On 06/05/2013 11:54 PM, Eduardo Habkost wrote:
> On Wed, Jun 05, 2013 at 07:57:42AM -0500, Anthony Liguori wrote:
>> Wanlong Gao <gaowanlong@cn.fujitsu.com> writes:
>>
>>> Add monitor command mem-nodes to show the NUMA node locations of
>>> huge-page-mapped guest memory.
>>>
>>> (qemu) info mem-nodes
>>> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
>>> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
>>> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
>>> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1
>>
>> This creates an ABI that we don't currently support.  Memory hotplug or
>> a variety of things can break this mapping and then we'd have to provide
>> an interface to describe that the mapping was broken.
> 
> What do you mean by "breaking this mapping", exactly? Would the backing
> file of existing guest RAM ever change? (It would require a memory copy
> from one file to another, why would QEMU ever do that?)
> 
>>
>> Also, it only works with hugetlbfs which is probably not widely used
>> given the existence of THP.
> 
> Quoting yourself at
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/58227:
> 
>>> It's extremely likely that if you're doing NUMA pinning, you're also 
>>> doing large pages via hugetlbfs.  numactl can already set policies for 
>>> files in hugetlbfs so all you need to do is have a separate hugetlbfs 
>>> file for each numa node.
>>>
>>> Then you have all the flexibility of numactl and you can implement node 
>>> migration external to QEMU if you so desire.
> 
> And if we simply report where the backing files and offsets used for
> guest RAM are, one could simply use
> 'numactl --file --offset --length', so we don't even need separate
> files/mem-paths for each node.

Does "numactl" still work after the QEMU process has mapped the hugetlbfs file?
I'm afraid a mempolicy set after that point will have no effect.
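
To make the question concrete, here is a minimal sketch (not from the patch)
of applying a policy to a range that is already mapped. It assumes Linux with
the kernel uapi header <linux/mempolicy.h> and calls mbind(2) through
syscall() so that libnuma is not required. single_node_mask() and
bind_range_to_node() are hypothetical helper names. Without MPOL_MF_MOVE,
mbind() only affects pages allocated later; with it, the kernel is asked to
migrate pages that are already faulted in, and whether that succeeds for
hugetlbfs-backed or pinned pages depends on the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/mempolicy.h>   /* MPOL_BIND, MPOL_MF_MOVE (kernel uapi header) */
#include <sys/syscall.h>       /* SYS_mbind */
#include <unistd.h>            /* syscall() */

/* Hypothetical helper: one-word node mask with a single bit set.
 * Returns 0 when the node id does not fit in the mask. */
static unsigned long single_node_mask(unsigned int node)
{
    if (node >= 8 * sizeof(unsigned long) - 1)
        return 0;
    return 1UL << node;
}

/* Hypothetical helper: bind [addr, addr + len) to one node.
 * If move_existing is non-zero, also ask the kernel to migrate pages
 * that were faulted in before the policy was set.
 * Returns 0 on success or a negative errno value. */
static int bind_range_to_node(void *addr, size_t len, unsigned int node,
                              int move_existing)
{
    unsigned long mask = single_node_mask(node);
    unsigned long flags = move_existing ? MPOL_MF_MOVE : 0;

    if (mask == 0)
        return -EINVAL;

    /* Raw mbind(2): addr, len, mode, nodemask, maxnode, flags. */
    if (syscall(SYS_mbind, addr, (unsigned long)len, (long)MPOL_BIND,
                &mask, (unsigned long)(8 * sizeof(mask)), flags) < 0)
        return -errno;
    return 0;
}
```

Calling bind_range_to_node(host, length, 1, 0) on an already-touched region
and then re-reading the placement would show whether a late policy has any
effect; passing 1 for move_existing tests the migration path instead.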

And if PCI passthrough is used, the directly assigned device does DMA transfers
between the device and the QEMU process, so all pages of the guest are pinned
by get_user_pages().

KVM_ASSIGN_PCI_DEVICE ioctl
  kvm_vm_ioctl_assign_device()
    => kvm_assign_device()
      => kvm_iommu_map_memslots()
        => kvm_iommu_map_pages()
          => kvm_pin_pages()

So, with a directly assigned device, the page count of every guest page is
raised by one and page migration will not work. AutoNUMA won't work either.
Any NUMA directives are then ignored as well.


Mel? Andrea?

> 
> Does THP work with tmpfs, already? If it does, people who don't want
> hugetlbfs and want numa tuning to work with THP could just use tmpfs for
> -mem-path.
> 
>>
>> I had hoped that we would get proper userspace interfaces for describing
>> memory groups but that appears to have stalled out.
> 
> I would love to have it. But while we don't have it, sharing the
> tmpfs/hugetlbfs backing files seems to work just fine as a mechanism to
> let other tools manipulate guest memory policy. We just need to let
> external tools know where the backing files are.
> 
>>
>> Does anyone know if this is still on the table?
>>
>> If we can't get a proper kernel interface, then perhaps we need to add
>> full libnuma support but that would really be unfortunate...
> 
> Why isn't the "info mem-nodes" solution (I mean: not this version, but a
> proper QMP version that exposes all the information we need) an option?
> 

Another shortcoming of hugetlbfs is that we can't know in advance how many
virtual machines will run, so we can't determine how many huge pages to
reserve for them.

In order to set NUMA mempolicies on THP or normal memory, as Anthony said,
I think we should add full libnuma support to QEMU, and allow mempolicies to
be set manually on the QEMU command line and through the QEMU monitor after
QEMU has started. External tools can't do exactly the right thing.
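
As a rough illustration of what such a command-line or monitor option could
carry, the sketch below parses a made-up "mode:first-last" string such as
"bind:0-1" into a policy mode and a node mask of the kind mbind(2) and
set_mempolicy(2) take. The syntax, struct mem_policy, and parse_mem_policy()
are all hypothetical; this is not an existing QEMU option, nor the interface
of the earlier proposals. It assumes Linux with <linux/mempolicy.h>.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <linux/mempolicy.h>   /* MPOL_BIND, MPOL_INTERLEAVE, MPOL_PREFERRED */

/* Highest node id + 1 that fits in the one-word mask used here. */
#define MAX_NODES (8 * sizeof(unsigned long) - 1)

/* Hypothetical container for a parsed policy. */
struct mem_policy {
    int mode;                /* one of the MPOL_* modes */
    unsigned long nodemask;  /* bit n set means node n is included */
};

/* Parse "mode:first[-last]" (hypothetical syntax), e.g. "bind:0-1".
 * Returns 0 on success and fills *out; returns -1 on any parse error,
 * leaving *out untouched. */
static int parse_mem_policy(const char *spec, struct mem_policy *out)
{
    const char *colon = strchr(spec, ':');
    char *end;
    unsigned long first, last, n, mask = 0;
    size_t name_len;
    int mode;

    if (!colon)
        return -1;
    name_len = (size_t)(colon - spec);

    if (name_len == 4 && strncmp(spec, "bind", 4) == 0)
        mode = MPOL_BIND;
    else if (name_len == 10 && strncmp(spec, "interleave", 10) == 0)
        mode = MPOL_INTERLEAVE;
    else if (name_len == 9 && strncmp(spec, "preferred", 9) == 0)
        mode = MPOL_PREFERRED;
    else
        return -1;

    /* First node id: must start with a digit (rejects "", "-1", " 1"). */
    if (!isdigit((unsigned char)colon[1]))
        return -1;
    first = strtoul(colon + 1, &end, 10);
    last = first;

    /* Optional "-last" part of the range. */
    if (*end == '-') {
        if (!isdigit((unsigned char)end[1]))
            return -1;
        last = strtoul(end + 1, &end, 10);
    }
    if (*end != '\0' || first > last || last >= MAX_NODES)
        return -1;

    for (n = first; n <= last; n++)
        mask |= 1UL << n;

    out->mode = mode;
    out->nodemask = mask;
    return 0;
}
```

Keeping the parsed form as a mode plus bitmask means the same value could be
handed to the kernel at startup or again later from a monitor command.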

So IMO the right direction is the earlier proposal that was objected to:
 - Message-ID: <1281534738-8310-1-git-send-email-andre.przywara@amd.com>
   http://article.gmane.org/gmane.comp.emulators.kvm.devel/57684
 - Message-ID: <4C7D7C2A.7000205@codemonkey.ws>
   http://article.gmane.org/gmane.comp.emulators.kvm.devel/58835

If you agree, I'll rework this patch set for review.

Thanks,
Wanlong Gao

> 
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>>
>>> Refer to the proposal of Eduardo and Daniel.
>>> http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476
>>>
>>> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
>>> ---
>>>  monitor.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 45 insertions(+)
>>>
>>> diff --git a/monitor.c b/monitor.c
>>> index eefc7f0..85c865f 100644
>>> --- a/monitor.c
>>> +++ b/monitor.c
>>> @@ -74,6 +74,10 @@
>>>  #endif
>>>  #include "hw/lm32/lm32_pic.h"
>>>  
>>> +#if defined(CONFIG_NUMA)
>>> +#include <numaif.h>
>>> +#endif
>>> +
>>>  //#define DEBUG
>>>  //#define DEBUG_COMPLETION
>>>  
>>> @@ -1759,6 +1763,38 @@ static void mem_info(Monitor *mon, const QDict *qdict)
>>>  }
>>>  #endif
>>>  
>>> +#if defined(CONFIG_NUMA)
>>> +static void mem_nodes(Monitor *mon, const QDict *qdict)
>>> +{
>>> +    RAMBlock *block;
>>> +    int prevnode, node;
>>> +    unsigned long long c, start, area;
>>> +    int fd;
>>> +    int pid = getpid();
>>> +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
>>> +        if (!(fd = block->fd))
>>> +            continue;
>>> +        prevnode = -1;
>>> +        start = 0;
>>> +        area = (unsigned long long)block->host;
>>> +        for (c = 0; c < block->length; c += TARGET_PAGE_SIZE) {
>>> +            if (get_mempolicy(&node, NULL, 0, c + block->host,
>>> +                              MPOL_F_ADDR | MPOL_F_NODE) < 0)
>>> +                continue;
>>> +            if (node == prevnode)
>>> +                continue;
>>> +            if (prevnode != -1)
>>> +                monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n",
>>> +                               pid, fd, start + area, c + area, prevnode);
>>> +            prevnode = node;
>>> +            start = c;
>>> +         }
>>> +         monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n",
>>> +                        pid, fd, start + area, c + area, prevnode);
>>> +    }
>>> +}
>>> +#endif
>>> +
>>>  #if defined(TARGET_SH4)
>>>  
>>>  static void print_tlb(Monitor *mon, int idx, tlb_t *tlb)
>>> @@ -2567,6 +2603,15 @@ static mon_cmd_t info_cmds[] = {
>>>          .mhandler.cmd = mem_info,
>>>      },
>>>  #endif
>>> +#if defined(CONFIG_NUMA)
>>> +    {
>>> +        .name       = "mem-nodes",
>>> +        .args_type  = "",
>>> +        .params     = "",
>>> +        .help       = "show the huge mapped memory nodes location",
>>> +        .mhandler.cmd = mem_nodes,
>>> +    },
>>> +#endif
>>>      {
>>>          .name       = "mtree",
>>>          .args_type  = "",
>>> -- 
>>> 1.8.3.rc2.10.g0c2b1cf
>>
> 
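
For reference, the range-merging loop in the quoted mem_nodes() can be
sketched as a standalone function, separated from get_mempolicy() and the
monitor so that it can be checked on its own. This is an illustration, not
the patch's code: struct node_range and coalesce_node_ranges() are
hypothetical names, and the per-page node ids are assumed to have been
collected beforehand, with a negative value meaning the lookup failed. One
deliberate difference: the quoted loop skips failed pages with "continue"
and prints its final range unconditionally, so it can report "node-1"; the
sketch ends the current run at a failed page and never emits a range without
a node.

```c
#include <stddef.h>

/* Hypothetical result type: pages [first_page, end_page) live on `node`. */
struct node_range {
    size_t first_page;
    size_t end_page;   /* exclusive */
    int node;
};

/* Merge consecutive pages that sit on the same node into ranges.
 * node_of_page[i] < 0 means "node unknown" for page i; such pages
 * terminate the current run and are not reported.
 * Writes at most max_out ranges to out and returns how many were written. */
static size_t coalesce_node_ranges(const int *node_of_page, size_t npages,
                                   struct node_range *out, size_t max_out)
{
    size_t n = 0, start = 0, i;
    int cur = -1;   /* node of the run being built, -1 if none */

    /* Iterate one step past the end so the last run gets flushed. */
    for (i = 0; i <= npages; i++) {
        int node = (i < npages) ? node_of_page[i] : -1;

        if (node < 0)
            node = -1;          /* normalise all failures */
        if (node == cur)
            continue;           /* still inside the same run */

        if (cur >= 0) {         /* close the run that just ended */
            if (n == max_out)
                return n;
            out[n++] = (struct node_range){ start, i, cur };
        }
        cur = node;
        start = i;
    }
    return n;
}
```

A caller would turn each range back into addresses as
host + first_page * page_size up to host + end_page * page_size.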


Thread overview: 19+ messages
2013-06-05  3:58 [Qemu-devel] [PATCH 1/2] Add Linux libnuma detection Wanlong Gao
2013-06-05  3:58 ` [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes Wanlong Gao
2013-06-05 12:39   ` Eric Blake
2013-06-05 12:57   ` Anthony Liguori
2013-06-05 15:54     ` Eduardo Habkost
2013-06-06  9:30       ` Wanlong Gao [this message]
2013-06-06 16:15         ` Eduardo Habkost
2013-06-14  1:04       ` Anthony Liguori
2013-06-14 13:56         ` Eduardo Habkost
2013-06-05 13:46   ` Eduardo Habkost
2013-06-11  7:22     ` Wanlong Gao
2013-06-11 13:40       ` Eduardo Habkost
2013-06-13  1:40         ` Wanlong Gao
2013-06-13 12:50           ` Eduardo Habkost
2013-06-13 22:32             ` Paolo Bonzini
2013-06-14  1:05               ` Anthony Liguori
2013-06-14  1:16                 ` Wanlong Gao
2013-06-15 17:23                   ` Paolo Bonzini
2013-06-05 10:02 ` [Qemu-devel] [PATCH 1/2] Add Linux libnuma detection Andreas Färber
