From: Andrew Jones
To: gaowanlong@cn.fujitsu.com
Cc: aliguori@us.ibm.com, ehabkost@redhat.com, hutao@cn.fujitsu.com, peter huangpeng, qemu-devel@nongnu.org, bsd@redhat.com, Paolo Bonzini, y-goto@jp.fujitsu.com, lcapitulino@redhat.com, lersek@redhat.com, afaerber@suse.de
Date: Thu, 29 Aug 2013 04:31:53 -0400 (EDT)
Subject: Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
Message-ID: <221143207.3452239.1377765113416.JavaMail.root@redhat.com>
In-Reply-To: <994970720.3441149.1377764149516.JavaMail.root@redhat.com>

----- Original Message -----
>
>
> ----- Original Message -----
> > On 08/28/2013 09:44 PM, Paolo Bonzini wrote:
> > > On 26/08/2013 10:43, Andrew Jones wrote:
> > >>
> > >> ----- Original Message -----
> > >>>> On 08/26/2013 03:46 PM, Andrew Jones wrote:
> > >>>>>>>>>> Is this patch still necessary?
> > >>>>>>>>>> I thought that dropping the
> > >>>>>>>>>>>>>> numa_num_configured_nodes() calls from patch 8/12 got rid
> > >>>>>>>>>>>>>> of the need for this library. Maybe I missed other uses?
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, in 08/12 we also use mbind(),
> > >>>>>> You don't need a whole library for mbind(); it's a syscall. See
> > >>>>>> syscall(2).
> > >>>>>>
> > >>>>>>>>>> and in 09/12 we use max_numa_node().
> > >>>>>> Really? I didn't see it there. And anyway, that goes back to our
> > >>>>>> discussion about setting qemu's MAX_NODES to whatever we think
> > >>>>>> qemu should support, and then just checking that we don't blow
> > >>>>>> that limit whenever reading host node info, i.e.
> > >>>>>>
> > >>>>>>     maxnode = 0;
> > >>>>>>     while (maxnode < MAX_NODES && host_nodes[maxnode])
> > >>>>>>         node_read(&info[maxnode++]);
> > >>>>>>
> > >>>>>> type of a thing.
> > >>>>>>
> > >>>>>> And, if there's a place where you really need to know the current
> > >>>>>> number of online host nodes, then, like I said earlier, you should
> > >>>>>> just go to sysfs yourself. libnuma's numa_max_node() returns an int
> > >>>>>> that it only initializes at library load time, so it's not going
> > >>>>>> to adapt to onlining/offlining.
> > >>>>
> > >>>> OK, thank you.
> > >>>> Then I should define MPOL_* macros in QEMU and use the mbind(2)
> > >>>> syscall directly, right?
> > >> Hmm, yeah, it's too bad that numaif.h is part of libnuma, and not a
> > >> more general lib. Whether or not we want to redefine those symbols
> > >> within qemu, in order to avoid the dependency on installing
> > >> numactl-devel, isn't something I can answer. That's a better question
> > >> for Anthony. Anthony? Paolo, any opinions? Maybe we should pick up
> > >> uapi/linux/mempolicy.h with the linux-headers sync script?
> > >>
> > >
> > > I think using libnuma is fine. In principle this could be used on other
In principle this could be used on other > > > OSes than Linux, I think? > > > > But seems that mbind(2) is Linux-specific syscall, right? > > > > You would need to avoid directly calling mbind, i.e. use libnuma for all > numa related calls. Then, if libnuma were to support more OSes, qemu would > automatically (wrt to numa) as well. Your mbind() with libnuma would look > like this > > numa_set_bind_policy(strict) > numa_tonodemask_memory(addr, size, nodemask) > > The problem is that set_bind_policy only takes a bool, and thus only > allows two of the four possibly policies > > MPOL_BIND strict == 1 > MPOL_PREFERRED strict == 0 > Ah, there is a way to get interleave policy if (policy == interleave) { numa_interleave_memory(addr, size, nodemask) } else { numa_set_bind_policy(strict) numa_tonodemask_memory(addr, size, nodemask) } a bit clunky. And I still don't see a way to select MPOL_DEFAULT, nor a way to use any additional flags, such as MPOL_F_RELATIVE_NODES. > So, due to libnuma's policy setting limitations, and the fact it doesn't > currently support more OSes than Linux, then I prefer your current > series version that drops libnuma. If qemu will need to support NUMA on > another OS, then we can cross this bridge when we get there.