From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754898AbcDDJT6 (ORCPT ); Mon, 4 Apr 2016 05:19:58 -0400
Received: from mail-lb0-f195.google.com ([209.85.217.195]:35833 "EHLO
	mail-lb0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750810AbcDDJT4 (ORCPT ); Mon, 4 Apr 2016 05:19:56 -0400
Date: Mon, 4 Apr 2016 11:19:51 +0200
From: Ingo Molnar
To: Jiri Olsa
Cc: Peter Zijlstra , James Hartsock , Rik van Riel , Srivatsa Vaddagiri ,
	Kirill Tkhai , linux-kernel@vger.kernel.org
Subject: Re: [RFC] sched: unused cpu in affine workload
Message-ID: <20160404091951.GA10360@gmail.com>
References: <20160404082302.GB2137@krava.local> <20160404085944.GA3030@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160404085944.GA3030@gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Ingo Molnar wrote:

> - if you want to come up with a 'complete' solution then please don't put it into
>   hot paths such as wakeup or context switching, or any of the hardirq methods,
>   but try to integrate it with the NUMA scheduling slow path.
>
> The NUMA balancing slow path: that is softirq driven and reasonably low freq to
> not cause many performance problems.
>
> The two problems (NUMA affinity and user affinity) are also loosely related on a
> conceptual level: the NUMA affinity optimization problem can be considered as a
> workload determined, arbitrary 'NUMA mask' being optimized from first
> principles.
>
> There's one ABI detail: this is true only as long as SMP affinity masks follow
> node boundaries - the current NUMA balancing code is very much node granular, so
> the two can only be merged if the ->cpus_allowed mask follows node boundaries as
> well.
>
> A third approach would be to extend the NUMA balancing code to be CPU granular
> (without changing any task placement behavior of the current NUMA balancing code
> of course), with node granular being a special case. This would fit the cgroups
> (and virtualization) usecases, but that would be a major change.

So my thinking here is: if the NUMA balancing code (which is node granular at the
moment and uses node masks, etc.) is extended to be CPU granular (which is a big
task in itself), then the two problems can be 'unified':

 - the NUMA balancing code inputs arbitrary CPU (node) affinity masks from the MM
   code into the scheduler.

 - the scheduler syscall ABI (and other configuration sources) inputs arbitrary
   CPU affinity masks into the scheduler.

It's a similar problem, with two (minor looking) complications:

 - the NUMA code right now is 'statistical', while ->cpus_allowed masks are hard
   constraints that must never be violated. So there always has to be a final
   layer to implement the hard constraint - which does not exist in the NUMA
   balancing case. This should be relatively easy I think, as we already do it
   with the regular balancer.

 - the balancing slowpath would have to be activated on non-NUMA systems as well,
   so that it can handle ->cpus_allowed balancing.

... once all that is solved, I can see several advantages from unifying the NUMA
balancing and SMP affinity balancing code:

 - the NUMA balancer would improve: cpus_allowed isolation is used more
   frequently, so fixes from those workloads would benefit the NUMA balancing
   case as well.

 - testing the NUMA balancer would become easier: we'd simply set cpus_allowed
   and would watch how it balances. No need to coax workloads into actual MM NUMA
   usage patterns to set up interesting scenarios.

 - our existing half-hearted ways to deal with cpus_allowed balancing could be
   outsourced to the NUMA slow path, which would simplify the SMP balancing fast
   path.
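To illustrate the 'final layer' idea from the first complication, here is a
minimal userspace sketch, not kernel code. It assumes a hypothetical per-CPU
preference score (standing in for the statistical NUMA placement data) and a
64-bit mask standing in for ->cpus_allowed. The names pick_cpu(), score and
allowed are made up for this example.

```c
#include <stdint.h>

/* Hypothetical limit for this sketch: one bit per CPU in a 64-bit mask. */
#define SKETCH_MAX_CPUS 64

/*
 * Pick the CPU with the highest (soft, statistical) preference score among
 * the CPUs permitted by the (hard) affinity mask.
 *
 * score[cpu] is an assumed stand-in for whatever the balancer prefers;
 * 'allowed' is an assumed stand-in for ->cpus_allowed.
 *
 * Returns the chosen CPU, or -1 if the mask permits no CPU below nr_cpus.
 */
static int pick_cpu(const int *score, int nr_cpus, uint64_t allowed)
{
	int best = -1;

	for (int cpu = 0; cpu < nr_cpus && cpu < SKETCH_MAX_CPUS; cpu++) {
		/* Hard constraint: a disallowed CPU is never a candidate. */
		if (!(allowed & (UINT64_C(1) << cpu)))
			continue;

		/* Soft preference: best effort among the remaining CPUs. */
		if (best < 0 || score[cpu] > score[best])
			best = cpu;
	}

	return best;
}
```

With scores {10, 50, 30, 20} and all four CPUs allowed, this picks CPU 1. If
CPU 1 is masked out, it falls back to CPU 2, the best of what is left. The
preference only ever orders the candidates; it never widens the mask.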
But it's a major piece of work, and I might be missing implementation details.
It would be the biggest new scheduler feature since NUMA balancing for sure.

Thanks,

	Ingo