Date: Fri, 18 Jul 2014 11:47:08 -0700
Subject: Re: [RFC 0/2] Memoryless nodes and kworker
From: Nish Aravamudan
To: Tejun Heo
Cc: Fenghua Yu, Tony Luck, linux-ia64@vger.kernel.org, Nishanth Aravamudan,
 linux-kernel@vger.kernel.org, Linux Memory Management List, David Rientjes,
 Joonsoo Kim, linuxppc-dev@lists.ozlabs.org, Jiang Liu, Wanpeng Li
List-Id: Linux on PowerPC Developers Mail List

On Fri, Jul 18, 2014 at 11:19 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote:
> > why aren't these callers using kthread_create_on_cpu()? That API was
>
> It is using that.  There just are other data structures too.

Sorry, I might not have been clear.

Why are any callers of the form kthread_create_on_node(...,
cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)?

In Linus' tree there are only two callers of kthread_create_on_cpu() --
smpboot_create_threads() and smpboot_register_percpu_thread(). Neither of
those appears to be used by the workqueue code, as far as I can see.
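To be concrete, here is the pattern I mean. This is an illustrative sketch
rather than a quote of any particular call site; start_worker() and
worker_fn are made-up names:

#include <linux/kthread.h>
#include <linux/topology.h>	/* cpu_to_node() */

/* Made-up example caller, not actual workqueue code. */
static struct task_struct *start_worker(int (*worker_fn)(void *data),
					void *data, unsigned int cpu)
{
	/*
	 * The pattern in question: the caller maps cpu -> node by hand and
	 * passes the node down.
	 */
	return kthread_create_on_node(worker_fn, data, cpu_to_node(cpu),
				      "worker/%u", cpu);

	/*
	 * versus letting kthread_create_on_cpu() derive the node (and bind
	 * the thread to the CPU) on the caller's behalf:
	 *
	 *	return kthread_create_on_cpu(worker_fn, data, cpu,
	 *				     "worker/%u");
	 */
}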
> > already changed to use cpu_to_mem() [so one change, rather than all over
> > the kernel source]. We could change it back to cpu_to_node() and push
> > down the knowledge about the fallback.
>
> And once it's properly solved, please convert kthread back to using
> cpu_to_node() too.  We really shouldn't be sprinkling the new, subtly
> different variant across the kernel.  It's wrong and confusing.

I understand what you mean, but it's equally wrong for the kernel to be
wasting GBs of slab. Different kinds of wrongness :)

> > Yes, this is a good point. But honestly, we're not really even to the
> > point of talking about fallback here; at least in my testing, going
> > off-node at all causes SLUB-configured slabs to deactivate, which then
> > leads to an explosion in the unreclaimable slab.
>
> I don't think moving the logic inside the allocator proper is a huge
> amount of work, and this isn't the first spillage of this subtlety out
> of the allocator proper.  Fortunately, it hasn't spread too much yet.
> Let's please stop it here.  I'm not saying you shouldn't or can't fix
> the off-node allocation.

It seems like an additional reasonable approach would be to provide a
suitable _cpu() API for the allocators. I'm not sure why requiring callers
to know about NUMA (so that every call site does cpu_to_node()) is any
better than requiring them to know about memoryless nodes (so that every
call site does cpu_to_mem() instead), when in several of the cases I've
seen, the relevant piece of information is which CPU we expect to run on,
or are running on. A _cpu() API would let the caller say exactly that:
please allocate memory local to this CPU, wherever it is (rough sketch
below).

Thanks,
Nish
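P.S. Something along these lines is what I have in mind. It is completely
untested, and kzalloc_cpu() is a name made up purely for illustration, not
an existing interface; it just wraps the existing kzalloc_node() and
cpu_to_mem() helpers:

#include <linux/slab.h>
#include <linux/topology.h>	/* cpu_to_mem() */

/*
 * Hypothetical helper, not an existing API: "give me zeroed memory local
 * to this CPU, wherever that memory actually lives."  The memoryless-node
 * handling is hidden here (or, better still, inside the allocator proper)
 * instead of being open-coded at every call site.
 */
static inline void *kzalloc_cpu(size_t size, gfp_t flags, unsigned int cpu)
{
	return kzalloc_node(size, flags, cpu_to_mem(cpu));
}

Callers that currently spell out kzalloc_node(size, flags, cpu_to_node(cpu))
or cpu_to_mem(cpu) would just pass the CPU and stop caring which node backs
it.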