Subject: Re: [PATCH] mm: mempolicy: N:M interleave policy for tiered memory nodes
From: Tim Chen
To: Johannes Weiner
Cc: linux-mm@kvack.org, Hao Wang, Abhishek Dhanotia, "Huang, Ying",
 Dave Hansen, Yang Shi, Davidlohr Bueso, Adam Manzanares,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, Hasan Al Maruf
Date: Wed, 08 Jun 2022 16:40:53 -0700
References: <20220607171949.85796-1-hannes@cmpxchg.org>
 <6096c96086187e51706898e58610fc0148b4ca23.camel@linux.intel.com>
On Wed, 2022-06-08 at 15:14 -0400, Johannes Weiner wrote:
> Hi Tim,
> 
> On Wed, Jun 08, 2022 at 11:15:27AM -0700, Tim Chen wrote:
> > On Tue, 2022-06-07 at 13:19 -0400, Johannes Weiner wrote:
> > > /* Do dynamic interleaving for a process */
> > > static unsigned interleave_nodes(struct mempolicy *policy)
> > > {
> > > 	unsigned next;
> > > 	struct task_struct *me = current;
> > > 
> > > -	next = next_node_in(me->il_prev, policy->nodes);
> > > +	if (numa_tier_interleave[0] > 1 || numa_tier_interleave[1] > 1) {
> > 
> > When we have three memory tiers, do we expect an N:M:K policy?
> > Like interleaving between DDR5, DDR4 and PMEM memory.
> > Or do we still expect an N:M policy, interleaving between two
> > specific tiers?
> 
> In the context of the proposed 'explicit tiers' interface, I think it
> would make sense to have a per-tier 'interleave_ratio' knob. Because
> the ratio is configured based on hardware properties, it can be
> configured meaningfully for the entire tier hierarchy, even if
> individual tasks or vmas interleave over only a subset of nodes.

I think that makes sense. So if we have three tiers of memory whose
bandwidth ratios are 4:2:1, then it makes sense to interleave according
to that ratio, even if we choose to interleave over only a subset of
the nodes. Say we interleave between tier 1 and tier 3 only: the
interleave ratio should be 4:1, since I can read 4 lines of data from
tier 1 in the time it takes to read 1 line of data from tier 3.
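Just to make the ratio bookkeeping concrete, here is a rough userspace
sketch of what I mean (the 4:2:1 weights, the node subset and all names
below are made up for illustration; this is not the patch's code):

/*
 * Toy model of bandwidth-proportional interleaving: pick each allowed
 * tier 'weight' times in a row before moving on to the next one.  With
 * weights 4:2:1 and only tiers 0 and 2 allowed, allocations land on
 * them in a 4:1 ratio.
 */
#include <stdio.h>

#define NTIERS 3

static const unsigned int tier_weight[NTIERS] = { 4, 2, 1 };

static int next_tier(const int *allowed, int nallowed)
{
	static int cur = -1;		/* index into allowed[] */
	static unsigned int left;	/* picks left on the current tier */

	if (left == 0) {
		cur = (cur + 1) % nallowed;
		left = tier_weight[allowed[cur]];
	}
	left--;
	return allowed[cur];
}

int main(void)
{
	int allowed[] = { 0, 2 };	/* interleave over a subset only */
	int hits[NTIERS] = { 0 };

	for (int i = 0; i < 1000; i++)
		hits[next_tier(allowed, 2)]++;

	for (int t = 0; t < NTIERS; t++)
		printf("tier %d: %d pages\n", t, hits[t]);

	return 0;
}

Per-task state like il_prev would of course live in task_struct rather
than in function statics, but the ratio math stays the same whether we
interleave over all tiers or just two of them.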
> > The other question is whether we will need multiple interleave
> > policies depending on cgroup?
> > One policy could be interleave between tier1, tier2, tier3.
> > Another could be interleave between tier1 and tier2.
> 
> This is a good question.
> 
> One thing that has defined cgroup development in recent years is the
> concept of "work conservation". Moving away from fixed limits and hard
> partitioning, cgroups are increasingly configured with weights,
> priorities, and guarantees (cpu.weight, io.latency/io.cost.qos,
> memory.low). These weights and priorities are enforced when cgroups
> are directly competing over a resource; but if there is no contention,
> any active cgroup, regardless of priority, has full access to the
> surplus (which could be the entire host if the main load is idle).
> 
> With that background, yes, we likely want some way of prioritizing
> tier access when multiple cgroups are competing. But we ALSO want the
> ability to say that if resources are NOT contended, a cgroup should
> interleave memory over all tiers according to optimal bandwidth.
> 
> That means that regardless of how the competitive cgroup rules for
> tier access end up looking, it makes sense to have global interleaving
> weights based on hardware properties as proposed here.
> 
> The effective cgroup IL ratio for each tier could then be something
> like cgroup.tier_weight[tier] * tier/interleave_weight.

Thanks. I agree that an interleave ratio proportional to the hardware
properties of each tier will suffice.

Tim
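P.S. A tiny worked illustration of that effective-weight idea (both
weight arrays below are made-up values, not anything from the patch or
an existing cgroup interface):

#include <stdio.h>

int main(void)
{
	/* Global, hardware-derived interleave weights per tier (4:2:1). */
	const unsigned int tier_interleave_weight[] = { 4, 2, 1 };
	/* Hypothetical per-cgroup tier weights. */
	const unsigned int cgroup_tier_weight[] = { 100, 100, 25 };

	/* Effective IL weight = cgroup weight * tier interleave weight. */
	for (int t = 0; t < 3; t++)
		printf("tier %d: effective weight %u\n",
		       t, cgroup_tier_weight[t] * tier_interleave_weight[t]);

	return 0;
}

That is, the hardware ratio sets the baseline and the per-cgroup knob
only scales it per tier, which keeps the two mechanisms independent.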