Date: Tue, 19 Jul 2022 15:00:32 +0100
From: Jonathan Cameron
To: "Huang, Ying"
CC: Wei Xu, Aneesh Kumar K V, Johannes Weiner, Linux MM, Andrew Morton, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Alistair Popple, Dan Williams
Subject: Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
Message-ID: <20220719150032.000066a6@Huawei.com>
In-Reply-To: <875yk15svi.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com> <87r130b2rh.fsf@yhuang6-desk2.ccr.corp.intel.com> <60e97fa2-0b89-cf42-5307-5a57c956f741@linux.ibm.com> <87r12r5dwu.fsf@yhuang6-desk2.ccr.corp.intel.com> <0a55e48a-b4b7-4477-a72f-73644b5fc4cb@linux.ibm.com> <87mtde6cla.fsf@yhuang6-desk2.ccr.corp.intel.com> <87ilo267jl.fsf@yhuang6-desk2.ccr.corp.intel.com> <87edyp67m1.fsf@yhuang6-desk2.ccr.corp.intel.com> <875yk15svi.fsf@yhuang6-desk2.ccr.corp.intel.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Wed, 13 Jul 2022 16:17:21 +0800 "Huang, Ying" wrote:

> Wei Xu writes:
>
> > On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying wrote:
> >>
> >> Aneesh Kumar K V writes:
> >>
> >> > On 7/12/22 2:18 PM, Huang, Ying wrote:
> >> >> Aneesh Kumar K V writes:
> >> >>
> >> >>> On 7/12/22 12:29 PM, Huang, Ying wrote:
> >> >>>> Aneesh Kumar K V writes:
> >> >>>>
> >> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
> >> >>>>>> Aneesh Kumar K V writes:
> >> >>>>>>
> >> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
> >> >>>>>>>> Hi, Aneesh,
> >> >>>>>>>>
> >> >>>>>>>> "Aneesh Kumar K.V" writes:
> >> >>>>>>>>
> >> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive pages on a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to make room for new allocations on the higher tier NUMA node. Frequently accessed pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA node to improve the performance.
> >> >>>>>>>>>
> >> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed.
> >> >>>>>>>>> The current implementation puts all nodes with CPU into the top tier, and builds the tier hierarchy tier-by-tier by establishing the per-node demotion targets based on the distances between nodes.
> >> >>>>>>>>>
> >> >>>>>>>>> This current memory tier kernel interface needs to be improved for several important use cases:
> >> >>>>>>>>>
> >> >>>>>>>>> * The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM device attached via CXL.mem or a DRAM-backed memory-only node on a virtual machine) and should be put into a higher tier.
> >> >>>>>>>>>
> >> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier.
> >> >>>>>>>>>
> >> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes into the top tier, when a CPU is hot-added (or hot-removed) and triggers a memory node from CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets changed, even though no memory node is added or removed. This can make the tier hierarchy unstable and make it difficult to support tier-based memory accounting.
> >> >>>>>>>>>
> >> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space), and has resulted in the feature request for an interface to override the system-wide, per-node demotion order from the userspace. This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that.
> >> >>>>>>>>>
> >> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory tier hierarchy in order to optimize its memory allocations.
> >> >>>>>>>>>
> >> >>>>>>>>> This patch series make the creation of memory tiers explicit under the control of userspace or device driver.
> >> >>>>>>>>>
> >> >>>>>>>>> Memory Tier Initialization
> >> >>>>>>>>> ==========================
> >> >>>>>>>>>
> >> >>>>>>>>> By default, all memory nodes are assigned to the default tier with tier ID value 200.
> >> >>>>>>>>>
> >> >>>>>>>>> A device driver can move up or down its memory nodes from the default tier. For example, PMEM can move down its memory nodes below the default tier, whereas GPU can move up its memory nodes above the default tier.
> >> >>>>>>>>>
> >> >>>>>>>>> The kernel initialization code makes the decision on which exact tier a memory node should be assigned to based on the requests from the device drivers as well as the memory device hardware information provided by the firmware.
> >> >>>>>>>>>
> >> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> >> >>>>>>>>>
> >> >>>>>>>>> Memory Allocation for Demotion
> >> >>>>>>>>> ==============================
> >> >>>>>>>>> This patch series keep the demotion target page allocation logic same. The demotion page allocation pick the closest NUMA node in the next lower tier to the current NUMA node allocating pages from.
> >> >>>>>>>>>
> >> >>>>>>>>> This will be later improved to use the same page allocation strategy using fallback list.
> >> >>>>>>>>>
> >> >>>>>>>>> Sysfs Interface:
> >> >>>>>>>>> -------------
> >> >>>>>>>>> Listing current list of memory tiers details:
> >> >>>>>>>>>
> >> >>>>>>>>> :/sys/devices/system/memtier$ ls
> >> >>>>>>>>> default_tier max_tier memtier1 power uevent
> >> >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
> >> >>>>>>>>> memtier200
> >> >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier
> >> >>>>>>>>> 400
> >> >>>>>>>>> :/sys/devices/system/memtier$
> >> >>>>>>>>>
> >> >>>>>>>>> Per node memory tier details:
> >> >>>>>>>>>
> >> >>>>>>>>> For a cpu only NUMA node:
> >> >>>>>>>>>
> >> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
> >> >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier
> >> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
> >> >>>>>>>>> :/sys/devices/system/node#
> >> >>>>>>>>>
> >> >>>>>>>>> For a NUMA node with memory:
> >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>>>>>> 1
> >> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
> >> >>>>>>>>> default_tier max_tier memtier1 power uevent
> >> >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier
> >> >>>>>>>>> :/sys/devices/system/node#
> >> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
> >> >>>>>>>>> default_tier max_tier memtier1 memtier2 power uevent
> >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>>>>>> 2
> >> >>>>>>>>> :/sys/devices/system/node#
> >> >>>>>>>>>
> >> >>>>>>>>> Removing a memory tier
> >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>>>>>> 2
> >> >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
> >> >>>>>>>>
> >> >>>>>>>> Thanks a lot for your patchset.
> >> >>>>>>>>
> >> >>>>>>>> Per my understanding, we haven't reach consensus on
> >> >>>>>>>>
> >> >>>>>>>> - how to create the default memory tiers in kernel (via abstract distance provided by drivers? Or use SLIT as the first step?)
> >> >>>>>>>>
> >> >>>>>>>> - how to override the default memory tiers from user space
> >> >>>>>>>>
> >> >>>>>>>> As in the following thread and email,
> >> >>>>>>>>
> >> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >> >>>>>>>>
> >> >>>>>>>> I think that we need to finalized on that firstly?
> >> >>>>>>>
> >> >>>>>>> I did list the proposal here
> >> >>>>>>>
> >> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >> >>>>>>>
> >> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated if the user wants a different tier topology.
> >> >>>>>>>
> >> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
> >> >>>>>>>
> >> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100. Later as we learn more about the device attributes (HMAT or something similar) that we might want to use to control the tier assignment this can be a range of memory tiers.
> >> >>>>>>>
> >> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update the memory tier assignment based on device attributes.
> >> >>>>>>
> >> >>>>>> Sorry for late reply.
> >> >>>>>>
> >> >>>>>> As the first step, it may be better to skip the parts that we haven't reached consensus yet, for example, the user space interface to override the default memory tiers. And we can use 0, 1, 2 as the default memory tier IDs. We can refine/revise the in-kernel implementation, but we cannot change the user space ABI.
> >> >>>>>>
> >> >>>>>
> >> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series? One of the details that were mentioned earlier was the need to track top-tier memory usage in a memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com can work with tier IDs too. Let me know if you think otherwise. So at this point I am not sure which area we are still debating w.r.t the userspace interface.
> >> >>>>
> >> >>>> In
> >> >>>>
> >> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >> >>>>
> >> >>>> per my understanding, Johannes suggested to override the kernel default memory tiers with "abstract distance" via drivers implementing memory devices. As you said in another email, that is related to [7/12] of the series. And we can table it for future.
> >> >>>>
> >> >>>> And per my understanding, he also suggested to make memory tier IDs dynamic. For example, after the "abstract distance" of a driver is overridden by users, the total number of memory tiers may be changed, and the memory tier ID of some nodes may be changed too. This will make memory tier ID easier to be understood, but more unstable. For example, this will make it harder to specify the per-memory-tier memory partition for a cgroup.
> >> >>>>
> >> >>>
> >> >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed. ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches posted here https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/ doesn't consider the node movement from one memory tier to another. If we need a stable pgdat->memtier we will have to prevent a node memory tier reassignment while we have pages from the memory tier charged to a cgroup. This patchset should not prevent such a restriction.
> >> >>
> >> >> Absolute stableness doesn't exist even in "rank" based solution. But "rank" can improve the stableness at some degree. For example, if we move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM nodes can keep its memory tier ID stable.
> >> >> This may be not a real issue finally. But we need to discuss that.
> >> >>
> >> >
> >> > I agree that using ranks gives us the flexibility to change demotion order without being blocked by cgroup usage. But how frequently do we expect the tier assignment to change? My expectation was these reassignments are going to be rare and won't happen frequently after a system is up and running? Hence using tierID for demotion order won't prevent a node reassignment much because we don't expect to change the node tierID during runtime. In the rare case we do, we will have to make sure there is no cgroup usage from the specific memory tier.
> >> >
> >> > Even if we use ranks, we will have to avoid a rank update, if such an update can change the meaning of top tier? ie, if a rank update can result in a node being moved from top tier to non top tier.
> >> >
> >> >> Tim has suggested to use top-tier(s) memory partition among cgroups. But I don't think that has been finalized. We may use per-memory-tier memory partition among cgroups. I don't know whether Wei will use that (may be implemented in the user space).
> >> >>
> >> >> And, if we thought stableness between nodes and memory tier ID isn't important. Why should we use sparse memory device IDs (that is, 100, 200, 300)? Why not just 0, 1, 2, ...? That looks more natural.
> >> >>
> >> >
> >> >
> >> > The range allows us to use memtier ID for demotion order. ie, as we start initializing devices with different attributes via dax kmem, there will be a desire to assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables us to put these devices in the range [0 - 200) without updating the node to memtier mapping of existing NUMA nodes (ie, without updating default memtier).
> >>
> >> I believe that sparse memory tier IDs can make memory tier more stable in some cases. But this is different from the system suggested by Johannes. Per my understanding, with Johannes' system, we will
> >>
> >> - one driver may online different memory types (such as kmem_dax may online HBM, PMEM, etc.)
> >>
> >> - one memory type manages several memory nodes (NUMA nodes)
> >>
> >> - one "abstract distance" for each memory type
> >>
> >> - the "abstract distance" can be offset by user space override knob
> >>
> >> - memory tiers generated dynamic from different memory types according "abstract distance" and overridden "offset"
> >>
> >> - the granularity to group several memory types into one memory tier can be overridden via user space knob
> >>
> >> In this way, the memory tiers may be changed totally after user space overridden. It may be hard to link memory tiers before/after the overridden. So we may need to reset all per-memory-tier configuration, such as cgroup paritation limit or interleave weight, etc.
> >>
> >> Personally, I think the system above makes sense. But I think we need to make sure whether it satisfies the requirements.
> >>
> >> Best Regards,
> >> Huang, Ying
> >>
> >
> > Th "memory type" and "abstract distance" concepts sound to me similar to the memory tier "rank" idea.
>
> Yes. "abstract distance" is similar as "rank".
>
> > We can have some well-defined type/distance/rank values, e.g. HBM, DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with. The memory tiers will build from these values.
> > It can be configurable to whether/how to collapse several values into a single tier.
>
> The memory types are registered by drivers (such as kmem_dax). And the distances can come from SLIT, HMAT, and other firmware or driver specific information sources.
>
> Per my understanding, this solution may make memory tier IDs more unstable. For example, the memory ID of a node may be changed after the user override the distance of a memory type. Although I think the overriding should be a rare operations, will it be a real issue for your use cases?

Not sure how common it is, but I'm aware of systems that have dynamic access characteristics, i.e. the bandwidth and latency of an access to a given memory device will change dynamically at runtime (typically due to something like hardware degradation, power saving, etc.), potentially leading to in-use memory needing to move in 'demotion order'. We could handle that with a per-device tier and rank that changes...

Just thought I'd throw that out there to add to the complexity ;) I don't consider it important to support initially, but just wanted to point out that this will only get more complex over time.

Jonathan

>
> Best Regards,
> Huang, Ying
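
For readers following the thread, the demotion allocation the cover letter describes ("pick the closest NUMA node in the next lower tier to the current NUMA node") amounts to roughly the sketch below. This is only an illustration of that one sentence, not code from the series; next_lower_tier_nodes() is a hypothetical helper standing in for whatever the memory tier code would actually expose.

/*
 * Illustrative sketch only -- not code from this series.  Pick the NUMA
 * node in the next lower memory tier that is closest (by node distance)
 * to the node being demoted from.  next_lower_tier_nodes() is a made-up
 * helper returning the nodemask of that lower tier.
 */
static int pick_demotion_target(int from_node)
{
	nodemask_t lower = next_lower_tier_nodes(from_node);	/* hypothetical */
	int node, best_node = NUMA_NO_NODE;
	int best_dist = INT_MAX;

	for_each_node_mask(node, lower) {
		int dist = node_distance(from_node, node);

		if (dist < best_dist) {
			best_dist = dist;
			best_node = node;
		}
	}

	return best_node;
}

As the cover letter notes, this closest-node choice is intended to be replaced later by a fallback-list walk mirroring the normal page allocation order.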