From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id D785EC43334 for ; Wed, 8 Jun 2022 01:24:35 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 523CA6B0071; Tue, 7 Jun 2022 21:24:35 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4D39B6B0072; Tue, 7 Jun 2022 21:24:35 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3C28E6B0073; Tue, 7 Jun 2022 21:24:35 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 2DE866B0071 for ; Tue, 7 Jun 2022 21:24:35 -0400 (EDT) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 093416112F for ; Wed, 8 Jun 2022 01:24:35 +0000 (UTC) X-FDA: 79553323710.23.5615494 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by imf11.hostedemail.com (Postfix) with ESMTP id 7118E40059 for ; Wed, 8 Jun 2022 01:24:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654651473; x=1686187473; h=message-id:subject:from:to:cc:date:in-reply-to: references:mime-version:content-transfer-encoding; bh=vtC+x3VoF0hLY5wXut7y+fi1xuSuwdoRGPO6T7AZRfU=; b=eJUEkT4UbnvVYljKHTMP0jXrZ942T2FowB27rdF2EZFG6q+ZZo0U10hW pkMMJSRQEa6gcNcWwMxowKV6ZNgzveZDUZ7f4FKRoawtPrDGKHbrIQnk/ s0UgtyPzECh1a47k3+HjYvtl93GlDAAqicjYMe1c3bV0n3SfvSrrSggq5 zI4K5y53RtMoOuefGfubUgw//7plhwGZDk8O+M2KmQnsKb7Rghw4dUsfU dcy7+5uLK5ugh0ytjI/t/jeguDQgaiSbbuiZfZ39kP2EYKp9VdgiD7Fnb 6nvDJtODTPgLteHgQNdD+W/IYXP++hONrtFT7jAn+caDawwm/eyAwniOZ w==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="259839343" X-IronPort-AV: E=Sophos;i="5.91,284,1647327600"; d="scan'208";a="259839343" Received: from fmsmga007.fm.intel.com ([10.253.24.52]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Jun 2022 18:24:32 -0700 X-IronPort-AV: E=Sophos;i="5.91,284,1647327600"; d="scan'208";a="584522217" Received: from wantingz-mobl.ccr.corp.intel.com ([10.254.214.193]) by fmsmga007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Jun 2022 18:24:27 -0700 Message-ID: <8e826a0ae730f6f6e43e82a26a9e22059a5a1682.camel@intel.com> Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers From: Ying Huang To: Aneesh Kumar K V Cc: Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , linux-mm@kvack.org, akpm@linux-foundation.org Date: Wed, 08 Jun 2022 09:24:25 +0800 In-Reply-To: <1301311f-12f0-0fda-1245-82bb4c3f5e93@linux.ibm.com> References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com> <20220527122528.129445-2-aneesh.kumar@linux.ibm.com> <352ae5f408b6d7d4d3d820d68e2f2c6b494e95e1.camel@intel.com> <143e40bcf46097d14514504518fdc1870fd8d4a1.camel@intel.com> <87ilpe8fxh.fsf@linux.ibm.com> <1301311f-12f0-0fda-1245-82bb4c3f5e93@linux.ibm.com> Content-Type: text/plain; charset="UTF-8" User-Agent: Evolution 3.38.3-1 MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Rspamd-Queue-Id: 7118E40059 X-Stat-Signature: oukxq3r7b3zej9sqtbefougrhicbeedw X-Rspam-User: Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=eJUEkT4U; spf=none (imf11.hostedemail.com: domain of ying.huang@intel.com has no SPF policy when checking 134.134.136.126) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam08 X-HE-Tag: 1654651473-823670 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, 2022-06-06 at 14:32 +0530, Aneesh Kumar K V wrote: > On 6/6/22 2:22 PM, Ying Huang wrote: > .... > > > > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to > > > > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems > > > > > with two memory tiers (DRAM and pmem) the demotion continues to work > > > > > as expected after patch 3 ("mm/demotion: Build demotion targets based on > > > > > explicit memory tiers"). With that, there will not be any regression in > > > > > between the patch series. > > > > > > > > > > > > > Thanks! Please do that. And I think you can add sysfs interface after > > > > that patch too. That is, in [1/7] > > > > > > > > > > I am not sure why you insist on moving sysfs interfaces later. They are > > > introduced based on the helper added. It make patch review easier to > > > look at both the helpers and the user of the helper together in a patch. > > > > Yes. We should introduce a function and its user in one patch for > > review. But this doesn't mean that we should introduce the user space > > interface as the first step. I think the user space interface should > > output correct information when we expose it. > > > > If you look at this patchset we are not exposing any wrong information. > > patch 1 -> adds ability to register the memory tiers and expose details > of registered memory tier. At this point the patchset only support DRAM > tier and hence only one tier is shown But inside kernel, we actually work with 2 tiers and demote/prmote pages between them. With the information from your interface, users would think that there is no any demotion/promotion in kernel because there's only 1 tier. > patch 2 -> adds per node memtier attribute. So only DRAM nodes shows the > details, because the patchset yet has not introduced a slower memory > tier like PMEM. > > patch 4 -> introducing demotion. Will make that patch 5 > > patch 5 -> add dax kmem numa nodes as slower memory tier. Now this > becomes patch 4 at which point we will correctly show two memory tiers > in the system. > > > > > > +struct memory_tier { > > > > + nodemask_t nodelist; > > > > +}; > > > > > > > > And struct device can be added after the kernel has switched the > > > > implementation based on explicit memory tiers. > > > > > > > > +struct memory_tier { > > > > + struct device dev; > > > > + nodemask_t nodelist; > > > > +}; > > > > > > > > > > > > > Can you elaborate on this? or possibly review the v5 series indicating > > > what change you are suggesting here? > > > > > > > > > > But I don't think it's a good idea to have "struct device" embedded in > > > > "struct memory_tier". We don't have "struct device" embedded in "struct > > > > pgdata_list"... > > > > > > > > > > I avoided creating an array for memory_tier (memory_tier[]) so that we > > > can keep it dynamic. Keeping dev embedded in struct memory_tier simplify > > > the life cycle management of that dynamic list. We free the struct > > > memory_tier allocation via device release function (memtier->dev.release > > > = memory_tier_device_release ) > > > > > > Why do you think it is not a good idea? > > > > I think that we shouldn't bind our kernel internal implementation with > > user space interface too much. Yes. We can expose kernel internal > > implementation to user space in a direct way. I suggest you to follow > > the style of "struct pglist_data" and "struct node". If we decouple > > "struct memory_tier" and "struct memory_tier_dev" (or some other name), > > we can refer to "struct memory_tier" without depending on all device > > core. Memory tier should be accessible inside the kernel even without a > > user interface. And memory tier isn't a device in concept. > > > > memory_tiers are different from pglist_data and struct node in that we > also allow the creation of them from userspace. I don't think that there's much difference. struct pglist_data and struct node can be created/destroyed dynamically too. Please take a look at __try_online_node() register_one_node() try_offline_node() unregister_one_node() > That is the life time of > a memory tier is driven from userspace and it is much easier to manage > them via sysfs file lifetime mechanism rather than inventing an > independent and more complex way of doing the same. You needs to manage the lifetime of struct memory_tier in kernel too. Because there are kernel users. And even if you use device core lifetime mechanism, you don't need to embed struct device in struct memory_tier too, you can free "separate" struct memory_tier in "release" callback of struct device. > > For life cycle management, I think that we can do that without sysfs > > too. > > > > unless there are specific details that you think will be broken by > embedding struct device inside struct memory_tier, IMHO I still consider > the embedded implementation much simpler and in accordance with other > kernel design patterns. In concept, struct memory_tier isn't a device. Although we expose it as a device in sysfs. That's just an implementation detail. So I think it's better to make struct memory_tier independent of struct device if possible. Via not embeding struct device in struct memory_tier, it's much easier to dereference struct memory_tier directly in inline function in ".h". We don't need to introduce one accessor function for each field of struct memory_tier for that. Best Regards, Huang, Ying