Date: Mon, 6 Jun 2022 17:16:22 +0100
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Aneesh Kumar K V
Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
 Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
 Hesham Almatary, Dave Hansen, Alistair Popple, Dan Williams,
 Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes
Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
Message-ID: <20220606171622.000036ed@Huawei.com>
In-Reply-To: <3a557f74-cc3a-c0ee-78e8-2cf50bee5f2d@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
 <20220527122528.129445-3-aneesh.kumar@linux.ibm.com>
 <20220527151531.00002a0c@Huawei.com>
 <20220606155920.00004ce9@Huawei.com>
 <3a557f74-cc3a-c0ee-78e8-2cf50bee5f2d@linux.ibm.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.
On Mon, 6 Jun 2022 21:31:16 +0530
Aneesh Kumar K V wrote:

> On 6/6/22 8:29 PM, Jonathan Cameron wrote:
> > On Fri, 3 Jun 2022 14:10:47 +0530
> > Aneesh Kumar K V wrote:
> >
> >> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
> >>> On Fri, 27 May 2022 17:55:23 +0530
> >>> "Aneesh Kumar K.V" wrote:
> >>>
> >>>> From: Jagdish Gediya
> >>>>
> >>>> Add support to read/write the memory tierindex for a NUMA node.
> >>>>
> >>>> /sys/devices/system/node/nodeN/memtier
> >>>>
> >>>> where N = node id
> >>>>
> >>>> When read, It list the memory tier that the node belongs to.
> >>>>
> >>>> When written, the kernel moves the node into the specified
> >>>> memory tier, the tier assignment of all other nodes are not
> >>>> affected.
> >>>>
> >>>> If the memory tier does not exist, writing to the above file
> >>>> create the tier and assign the NUMA node to that tier.
> >>>
> >>> creates
> >>>
> >>> There was some discussion in v2 of Wei Xu's RFC that what matter
> >>> for creation is the rank, not the tier number.
> >>>
> >>> My suggestion is move to an explicit creation file such as
> >>> memtier/create_tier_from_rank
> >>> to which writing the rank gives results in a new tier
> >>> with the next device ID and requested rank.
> >>
> >> I think the below workflow is much simpler.
> >>
> >> :/sys/devices/system# cat memtier/memtier1/nodelist
> >> 1-3
> >> :/sys/devices/system# cat node/node1/memtier
> >> 1
> >> :/sys/devices/system# ls memtier/memtier*
> >> nodelist power rank subsystem uevent
> >> :/sys/devices/system# ls memtier/
> >> default_rank max_tier memtier1 power uevent
> >> :/sys/devices/system# echo 2 > node/node1/memtier
> >> :/sys/devices/system#
> >>
> >> :/sys/devices/system# ls memtier/
> >> default_rank max_tier memtier1 memtier2 power uevent
> >> :/sys/devices/system# cat memtier/memtier1/nodelist
> >> 2-3
> >> :/sys/devices/system# cat memtier/memtier2/nodelist
> >> 1
> >> :/sys/devices/system#
> >>
> >> ie, to create a tier we just write the tier id/tier index to
> >> node/nodeN/memtier file. That will create a new memory tier if needed
> >> and add the node to that specific memory tier. Since for now we are
> >> having 1:1 mapping between tier index to rank value, we can derive the
> >> rank value from the memory tier index.
> >>
> >> For dynamic memory tier support, we can assign a rank value such that
> >> new memory tiers are always created such that it comes last in the
> >> demotion order.
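(As an aside, my reading of that interface's contract from a script's
point of view is roughly the following - an illustrative, untested
sketch; the rank value checked at the end is implied rather than
chosen, since for now it is derived 1:1 from the tier index:)

    # Writing a tier index to a node's memtier file moves the node,
    # implicitly creating memtier2 if it does not exist yet.
    echo 2 > /sys/devices/system/node/node1/memtier

    # The new tier's rank is derived from its index, so check both
    # the membership and the rank before trusting the resulting
    # demotion order.
    cat /sys/devices/system/memtier/memtier2/nodelist
    cat /sys/devices/system/memtier/memtier2/rank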
> >
> > I'm not keen on having to pass through an intermediate state where
> > the rank may well be wrong, but I guess it's not that harmful even
> > if it feels wrong ;)
> >
>
> Any new memory tier added can be of lowest rank (rank - 0) and hence
> will appear as the highest memory tier in demotion order.

Depends on driver interaction - if new memory is CXL attached or
GPU attached, chances are the driver has an input on which tier
it is put in by default.

> User can then
> assign the right rank value to the memory tier? Also the actual demotion
> target paths are built during memory block online which in most case
> would happen after we properly verify that the device got assigned to
> the right memory tier with correct rank value?

Agreed, though that may change the model of how memory is brought
online somewhat.

> >
> > Races are potentially a bit of a pain though depending on what we
> > expect the usage model to be.
> >
> > There are patterns (CXL regions for example) of guaranteeing the
> > 'right' device is created by doing something like
> >
> > cat create_tier > temp.txt
> > #(temp gets 2 for example on first call then
> > # next read of this file gets 3 etc)
> >
> > cat temp.txt > create_tier
> > # will fail if there hasn't been a read of the same value
> >
> > Assuming all software keeps to the model, then there are no
> > race conditions over creation. Otherwise we have two new
> > devices turn up very close to each other and userspace scripting
> > tries to create two new tiers - if it races they may end up in
> > the same tier when that wasn't the intent. Then code to set
> > the rank also races and we get two potentially very different
> > memories in a tier with a randomly selected rank.
> >
> > Fun and games... And a fine illustration why sysfs based 'device'
> > creation is tricky to get right (and lots of cases in the kernel
> > don't).
> >
>
> I would expect userspace to be careful and verify the memory tier and
> rank value before we online the memory blocks backed by the device. Even
> if we race, the result would be two device not intended to be part of
> the same memory tier appearing at the same tier. But then we won't be
> building demotion targets yet. So userspace could verify this, move the
> nodes out of the memory tier. Once it is verified, memory blocks can be
> onlined.

The race is there and not avoidable as far as I can see. Two
processes, A and B:

A checks for a spare tier number
B checks for a spare tier number
A tries to assign node 3 to new tier 2 (new tier created)
B tries to assign node 4 to new tier 2 (accidentally hits the
existing tier - as this is the same method we'd use to put a node
in an existing tier, we can't tell this write was meant to create
a new tier)
A writes rank 100 to tier 2
A checks the rank for tier 2 and finds it is 100, as expected
B writes rank 200 to tier 2 (it could check whether the rank is
still the default, but even that is racy)
B checks the rank for tier 2 and finds it is 200, as expected
A onlines memory
B onlines memory

Both think they got what they wanted, but A definitely didn't.

One workaround is the read / write approach with create_tier:

A reads create_tier - gets 2.
B reads create_tier - gets 3.
A writes 2 to create_tier as that's what it read.
B writes 3 to create_tier as that's what it read.

Both then continue with the tiers they created. Obviously this can
exhaust tiers, but if the interface is root only, root could just
create lots of tiers anyway, so we are no worse off.
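To make that concrete, the userspace side of the handshake would be
something like the below (an untested sketch - create_tier with
read-allocates / write-commits semantics is the hypothetical
interface described above, not anything in the current patches):

    #!/bin/sh
    # A read hands out the next free tier id; a concurrent reader
    # gets a different id (A sees 2 while B sees 3).
    id=$(cat /sys/devices/system/memtier/create_tier)

    # Writing the id back commits the creation; the write fails if
    # there hasn't been a read returning that same value.
    if ! echo "$id" > /sys/devices/system/memtier/create_tier; then
        echo "failed to create tier $id" >&2
        exit 1
    fi

    # memtier$id is now ours alone, so setting the rank and moving
    # nodes in cannot race with the other process.
    echo 100 > "/sys/devices/system/memtier/memtier$id/rank"
    echo "$id" > /sys/devices/system/node/node3/memtier

Both processes run the same script and end up with distinct tiers,
which is the point.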
>
> Having said that can you outline the usage of
> memtier/create_tier_from_rank ?

There are corner cases to deal with...

A writes 100 to create_tier_from_rank.
A goes looking for the matching tier - finds it: tier2.
B writes 200 to create_tier_from_rank.
B goes looking for the matching tier - finds it: tier3.

The rest is fine, as they are operating on different tiers.

Trickier is:

A writes 100 to create_tier_from_rank - succeeds.
B writes 100 to create_tier_from_rank - could fail, or could just
eat it?

Logically this is the same as a separate create_tier followed by a
write of the rank, but done in one operation - except that you then
need to search for the tier that was created. As such, perhaps a
create_tier that does the read / write pair as above is the best
solution.

Jonathan
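P.S. In script form, the create_tier_from_rank flow I sketched above
would be something like this (again a hypothetical interface,
untested):

    # Ask for a tier with rank 100; this either creates a new tier
    # or (depending on the semantics chosen) reuses or rejects one
    # that already has that rank.
    echo 100 > /sys/devices/system/memtier/create_tier_from_rank

    # Nothing tells us which tier we got, so go looking for the
    # one whose rank matches what we wrote.
    for t in /sys/devices/system/memtier/memtier*; do
        if [ "$(cat "$t/rank")" = "100" ]; then
            echo "got $t"
            break
        fi
    done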