From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying"
To: Gregory Price
Cc: Gregory Price, Aneesh Kumar K.V, Wei Xu, Alistair Popple,
    Dan Williams, Dave Hansen, Johannes Weiner, Jonathan Cameron,
    Michal Hocko, Tim Chen, Yang Shi
Subject: Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
In-Reply-To: (Gregory Price's message of "Mon, 16 Oct 2023 22:52:58 -0400")
References: <20231009204259.875232-1-gregory.price@memverge.com>
 <87o7gzm22n.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87pm1cwcz5.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 19 Oct 2023 14:28:42 +0800
Message-ID: <87edhrunvp.fsf@yhuang6-desk2.ccr.corp.intel.com>
X-Mailing-List: linux-cxl@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii

Gregory Price writes:

> On Wed, Oct 18, 2023 at 04:29:02PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > There are at least 5 proposals that I know of at the moment:
>> >
>> > 1) mempolicy
>> > 2) memory-tiers
>> > 3) memory-block interleaving? (weighting among blocks inside a node)
>> >    Maybe relevant if Dynamic Capacity devices arrive, but it seems
>> >    like the wrong place to do this.
>> > 4) multi-device nodes (e.g. cxl create-region ... mem0 mem1 ...)
>> > 5) "just do it in hardware"
>>
>> It may be easier to start with the use case.  What are the practical
>> use cases in your mind that cannot be satisfied with a simple
>> per-memory-tier weight?  Can you compare the memory layout with the
>> different proposals?
>>
>
> Before I delve in, one clarifying question: when you asked whether
> weights should be part of node or memory-tiers, I took that to mean
> whether it should be part of mempolicy or memory-tiers.
>
> Were you suggesting that weights should actually be part of
> drivers/base/node.c?

Yes.  drivers/base/node.c vs. memory tiers.
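[For context, the "weighted interleave" semantics under discussion can be
sketched as follows.  This is a hypothetical Python model, not the kernel
implementation; the function name and the 4:1 DRAM-to-CXL weighting are
made up for illustration.  Plain MPOL_INTERLEAVE spreads pages round-robin
across the nodemask; a weighted variant spreads them in proportion to
per-node weights, wherever those weights end up living.]

```python
def weighted_interleave(weights, num_pages):
    """Model the node chosen for each of num_pages allocations.

    weights: dict mapping node id -> integer weight.  In each round,
    a node receives `weight` consecutive pages before moving on, so
    the long-run split matches the weight ratio.  (Hypothetical model,
    not drivers/base/node.c or mm/mempolicy.c code.)
    """
    placement = []
    nodes = sorted(weights)
    while len(placement) < num_pages:
        for node in nodes:
            for _ in range(weights[node]):
                if len(placement) == num_pages:
                    return placement
                placement.append(node)
    return placement

# Example: DRAM node 0 weighted 4:1 against a slower CXL node 1.
print(weighted_interleave({0: 4, 1: 1}, 10))
# [0, 0, 0, 0, 1, 0, 0, 0, 0, 1] -- 8 of 10 pages land on node 0
```

With weights of 1 everywhere this degenerates to today's unweighted
interleave, which is why option (d) below can keep --interleave untouched
and add the weighted behavior behind a separate flag.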
> Because I had not considered that, and this seems reasonable, easy to
> implement, and would not require tying mempolicy.c to memory-tiers.c.
>
> Beyond this, I think there have been 3 imagined use cases (now,
> including this one):
>
> a)
> numactl --weighted-interleave=Node:weight,0:16,1:4,...
>
> b)
> echo weight > /sys/.../memory-tiers/memtier/access0/interleave_weight
> numactl --interleave=0,1
>
> c)
> echo weight > /sys/bus/node/node0/access0/interleave_weight
> numactl --interleave=0,1
>
> d)
> option b or c, but with --weighted-interleave=0,1 instead.
> This requires libnuma changes to pick up, but it retains --interleave
> as-is to avoid user confusion.
>
> The downside of an approach like (a), which was my original approach,
> is that the weights cannot really change should a node be hotplugged.
> Tasks would need to detect this and change the policy themselves.
> That's not a good solution.
>
> However, in both (b)'s and (c)'s designs, weights can be rebalanced in
> response to any number of events.  Ultimately (b) and (c) are
> equivalent, but the placement in nodes is cleaner and more intuitive.
> If memory-tiers wants to use/change this information, there's nothing
> that prevents it.
>
> Assuming this is your meaning, I agree and I will pivot to this.

Can you give a not-so-abstract example?  For example, on a system with
nodes 0, 1, 2, 3 and memory tiers 4 (0, 1), 22 (2, 3), ..., a workload
runs on the CPUs of node 0, ..., and interleaves memory on nodes 0, 1,
....  Then compare the different behavior (including memory bandwidth)
of the node-based and memory-tier-based solutions.

--
Best Regards,
Huang, Ying