From: Wei Xu <weixugc@google.com>
Date: Thu, 8 Dec 2022 16:59:36 -0800
To: Michal Hocko
Cc: Mina Almasry, Andrew Morton, Johannes Weiner, Roman Gushchin,
    Shakeel Butt, Muchun Song, Huang Ying, Yang Shi, Yosry Ahmed,
    fvdl@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems
References: <20221206023406.3182800-1-almasrymina@google.com>

On Thu, Dec 8, 2022 at 3:54 AM Michal Hocko wrote:
>
> On Thu 08-12-22 01:00:40, Mina Almasry wrote:
> > On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko wrote:
> > >
> > > On Wed 07-12-22 13:43:55, Mina Almasry wrote:
> > > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko wrote:
> > > [...]
> > > > > Anyway a proper nr_reclaimed tracking should be rather straightforward
> > > > > but I do not expect it to make a big difference in practice
> > > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 026199c047e0..1b7f2d8cb128 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > > >         LIST_HEAD(ret_folios);
> > > > >         LIST_HEAD(free_folios);
> > > > >         LIST_HEAD(demote_folios);
> > > > > -       unsigned int nr_reclaimed = 0;
> > > > > +       unsigned int nr_reclaimed = 0, nr_demoted = 0;
> > > > >         unsigned int pgactivate = 0;
> > > > >         bool do_demote_pass;
> > > > >         struct swap_iocb *plug = NULL;
> > > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > > >         }
> > > > >         /* 'folio_list' is always empty here */
> > > > >
> > > > > -       /* Migrate folios selected for demotion */
> > > > > -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > > > > +       /*
> > > > > +        * Migrate folios selected for demotion.
> > > > > +        * Do not consider demoted pages to be reclaimed for the memcg reclaim
> > > > > +        * because no charges are really freed during the migration. Global
> > > > > +        * reclaim aims at releasing memory from nodes/zones so consider
> > > > > +        * demotion to reclaim memory.
> > > > > +        */
> > > > > +       nr_demoted += demote_folio_list(&demote_folios, pgdat);
> > > > > +       if (!cgroup_reclaim(sc))
> > > > > +               nr_reclaimed += nr_demoted;
> > > > > +
> > > > >         /* Folios that could not be demoted are still in @demote_folios */
> > > > >         if (!list_empty(&demote_folios)) {
> > > > >                 /* Folios which weren't demoted go back on @folio_list for retry: */
> > > > >
> > > > > [...]
> > > > Thank you again, but this patch breaks the memory.reclaim nodes arg
> > > > for me. This is my test case. I run it on a machine with 2 memory
> > > > tiers.
> > > >
> > > > Memory tier 1 = nodes 0-2
> > > > Memory tier 2 = node 3
> > > >
> > > > mkdir -p /sys/fs/cgroup/unified/test
> > > > cd /sys/fs/cgroup/unified/test
> > > > echo $$ > cgroup.procs
> > > > head -c 500m /dev/random > /tmp/testfile
> > > > echo $$ > /sys/fs/cgroup/unified/cgroup.procs
> > > > echo "1m nodes=0-2" > memory.reclaim
> > > >
> > > > In my opinion the expected behavior is for the kernel to demote 1mb of
> > > > memory from nodes 0-2 to node 3.
> > > >
> > > > Actual behavior on the tip of mm-unstable is as expected.
> > > >
> > > > Actual behavior with your patch cherry-picked to mm-unstable is that
> > > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and
> > > > returns -EAGAIN to the user. This may be the correct behavior you're
> > > > intending, but it completely breaks the use case I implemented the
> > > > nodes= arg for and listed in the commit message of that change.
> > >
> > > Yes, strictly speaking the behavior is correct albeit unexpected. You
> > > have told the kernel to _reclaim_ that much memory, but demotion is
> > > simply aging handling rather than reclaim if the demotion target has a
> > > lot of memory free.
> >
> > Yes, by the strict definition of reclaim, you're completely correct.
> > But in reality, earlier I proposed a patch to the kernel that disables
> > demotion in proactive reclaim.
> > That would have been a correct change by the strict definition of
> > reclaim, but Johannes informed me that Meta already has a dependency
> > on proactive reclaim triggering demotion and directed me to add a
> > nodes= arg to memory.reclaim to trigger demotion instead, to satisfy
> > both use cases. It seems both we and Meta are using this interface to
> > trigger both reclaim and demotion, despite the strict definition of
> > the word?
>
> Well, demotion is a part of aging and that is a part of the reclaim, so
> I believe we want both, and demotion is mostly an implementation detail.
> If you want to have very precise control, then the nodemask should get
> you there.
>
> [...]
> > > I am worried this will keep popping up again and again. I thought your
> > > nodes subset approach could deal with this, but I have overlooked one
> > > important thing in your patch. The user-provided nodemask controls
> > > where to reclaim from, but it doesn't constrain demotion targets. Is
> > > this intentional? Would it actually make more sense to control demotion
> > > by adding demotion nodes into the nodemask?
> > >
> >
> > IMO, yes it is intentional, and no, I would not recommend adding
> > demotion nodes (I assume you mean adding both demote_from_nodes and
> > demote_to_nodes as args).
>
> What I really mean is to add demotion nodes to the nodemask along with
> the set of nodes you want to reclaim from. To me that sounds like a
> more natural interface allowing for all sorts of usecases:
> - free up demotion targets (only specify demotion nodes in the mask)
> - control where to demote (e.g. select specific demotion target(s))
> - do not demote at all (skip demotion nodes from the node mask)

For clarification, do you mean to add another argument (e.g.
demotion_nodes) in addition to the "nodes" argument? Also, what meaning
do these arguments have?

- Option 1: The "nodes" argument specifies the source nodes where pages
  should be reclaimed (but not demoted) from, and the "demotion_nodes"
  argument specifies the source nodes where pages should be demoted (but
  not reclaimed) from.

- Option 2: The "nodes" argument specifies the source nodes where pages
  should be reclaimed or demoted from, and the "demotion_nodes" argument
  specifies the allowed demotion target nodes.

Option 1 is more flexible in allowing various use cases. Option 2 can
also support the use cases of reclaim-only without demotion and of
demotion with fallback to reclaim, but it cannot support demotion-only
without reclaim.
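To make the two readings concrete, here is a rough sketch, reusing the
tier layout from the test case above (tier 1 = nodes 0-2, tier 2 = node 3)
and assuming a hypothetical "demotion_nodes" argument that does not exist
in the current memory.reclaim interface:

# Option 1: "nodes" selects reclaim-only sources,
#           "demotion_nodes" selects demote-only sources.
echo "1m nodes=0-2" > memory.reclaim            # reclaim 1m from nodes 0-2, no demotion
echo "1m demotion_nodes=0-2" > memory.reclaim   # demote 1m from nodes 0-2 to node 3, no reclaim

# Option 2: "nodes" selects the sources for reclaim or demotion,
#           "demotion_nodes" restricts the demotion targets.
echo "1m nodes=0-2" > memory.reclaim                    # reclaim or demote 1m from nodes 0-2
echo "1m nodes=0-2 demotion_nodes=3" > memory.reclaim   # same, but demoted pages may only go to node 3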
> > My opinion is based on 2 reasons:
> >
> > 1. We control proactive reclaim by looking for nodes/tiers approaching
> > pressure and triggering reclaim/demotion from said nodes/tiers. So we
> > know the node/tier we would like to reclaim from, but we do not
> > necessarily have a requirement on where the memory should go. I think
> > that should be up to the kernel.
> > 2. Currently I think most tiered machines will have 2 memory tiers,
> > but I think the code is designed to support N memory tiers. What
> > happens if N=3 and the user asks you to demote from the top-tier nodes
> > to the bottom-tier nodes (skipping the middle tier)? The user in this
> > case is explicitly asking to break the aging pipeline. From my short
> > while on the mailing list I see great interest in respecting the aging
> > pipeline, so it seems to me the demotion target should be decided by
> > the kernel based on the aging pipeline, and the user should not be
> > able to override it? I don't know. Maybe there is a valid use case for
> > that somewhere.
>
> I agree that the aging should be preserved as much as possible unless
> there is an explicit requirement to do otherwise, which might be
> something application specific.
>
> It is really hard to anticipate all the usecases at this stage, but we
> should keep in mind that the semantics of the interface will be cast
> into stone once it is released. As of now, I see a great point of
> confusion in the nodemask semantics, which pretend to allow some fine
> control while being largely hard to predict, because they make some
> assumptions about the reclaim while having very limited control of the
> demotion.
>
> --
> Michal Hocko
> SUSE Labs