Date: Mon, 14 Feb 2022 12:28:56 +0200
From: Mike Rapoport
To: Yu Zhao
Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko, Andi Kleen,
 Aneesh Kumar, Barry Song <21cnbao@gmail.com>, Catalin Marinas, Dave Hansen,
 Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet, Linus Torvalds,
 Matthew Wilcox, Michael Larabel, Rik van Riel, Vlastimil Babka, Will Deacon,
 Ying Huang, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, page-reclaim@google.com,
 x86@kernel.org, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
 Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
 Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, Sofia Trinh
Subject: Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
References: <20220208081902.3550911-1-yuzhao@google.com>
 <20220208081902.3550911-13-yuzhao@google.com>
In-Reply-To: <20220208081902.3550911-13-yuzhao@google.com>

Hi,

On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
>
> Signed-off-by: Yu Zhao
> Acked-by: Brian Geffon
> Acked-by: Jan Alexander Steffens (heftig)
> Acked-by: Oleksandr Natalenko
> Acked-by: Steven Barrett
> Acked-by: Suleiman Souhlal
> Tested-by: Daniel Byrne
> Tested-by: Donald Carr
> Tested-by: Holger Hoffstätte
> Tested-by: Konstantin Kharlamov
> Tested-by: Shuang Zhai
> Tested-by: Sofia Trinh
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++

Please consider splitting this patch into Documentation/admin-guide and
Documentation/vm parts. For now I only had time to review the admin-guide
part.

>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst
>
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..2cf5bae62036 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   multigen_lru
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..16a543c8b886
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,121 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigenerational LRU
> +=====================
> +
> +Quick start
> +===========

There is no explanation of why one would want to use the multigenerational
LRU until the next section. I think there should be an overview that
explains why users would want to enable it.

> +Build configurations
> +--------------------
> +:Required: Set ``CONFIG_LRU_GEN=y``.

Maybe

  Set ``CONFIG_LRU_GEN=y`` to build the kernel with the multigenerational LRU

> +
> +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> +    multigenerational LRU by default.
> +
> +Runtime configurations
> +----------------------
> +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> +    ``CONFIG_LRU_GEN_ENABLED=n``.
> +
> +This file accepts different values to enabled or disabled the
> +following features:

Maybe

  After the multigenerational LRU is enabled, this file accepts different
  values to enable or disable the following features:

> +======  ========
> +Values  Features
> +======  ========
> +0x0001  the multigenerational LRU

The multigenerational LRU what?

What will happen if I write 0x2 to this file?
Please consider splitting the "enable" and "features" attributes.

> +0x0002  clear the accessed bit in leaf page table entries **in large
> +        batches**, when MMU sets it (e.g., on x86)

Is extra markup really needed here...

> +0x0004  clear the accessed bit in non-leaf page table entries **as
> +        well**, when MMU sets it (e.g., on x86)

... and here?

As for the descriptions, what is the user-visible effect of these features?
How are the different modes of clearing the accessed bit reflected in, say,
GUI responsiveness, database TPS, or probability of OOM?

> +[yYnN]  apply to all the features above
> +======  ========
> +
> +E.g.,
> +::
> +
> +    echo y >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0007
> +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0005
> +
> +Most users should enable or disable all the features unless some of
> +them have unforeseen side effects.
> +
> +Recipes
> +=======
> +Personal computers
> +------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multigenerational LRU offers thrashing prevention to
> +the majority of laptop and desktop users who don't have oomd.

I'd expect something like this paragraph in the overview.

> +
> +:Thrashing prevention: Write ``N`` to
> +    ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> +    ``N`` milliseconds from getting evicted. The OOM killer is triggered
> +    if this working set can't be kept in memory. Based on the average
> +    human detectable lag (~100ms), ``N=1000`` usually eliminates
> +    intolerable janks due to thrashing. Larger values like ``N=3000``
> +    make janks less noticeable at the risk of premature OOM kills.
> +
> +Data centers
> +------------
> +Data centers want to optimize job scheduling (bin packing) to improve
> +memory utilizations. Job schedulers need to estimate whether a server
> +can allocate a certain amount of memory for a new job, and this step
> +is known as working set estimation, which doesn't impact the existing
> +jobs running on this server. They also want to attempt freeing some
> +cold memory from the existing jobs, and this step is known as proactive
> +reclaim, which improves the chance of landing a new job successfully.

This paragraph also fits the overview.

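For reference, the values in the table quoted above read to me as bit flags,
which would explain the 0x0007 and 0x0005 outputs in the example. A minimal
Python sketch of that reading follows. The flag labels and function names are
my own and are not identifiers from the patch, and treating y/Y as "all bits"
and n/N as "no bits" is an assumption based on the ``[yYnN]`` row.

```python
# Bit values are taken from the table in the quoted patch. The labels are
# hypothetical names chosen for this sketch only.
LRU_GEN_FLAGS = {
    0x0001: "multigenerational LRU",
    0x0002: "batched clearing of the accessed bit in leaf entries",
    0x0004: "clearing of the accessed bit in non-leaf entries",
}


def decode_enabled(text):
    """Return the labels of the bits set in a value such as '0x0005'."""
    value = int(text.strip(), 16)
    return [label for bit, label in sorted(LRU_GEN_FLAGS.items()) if value & bit]


def parse_write(text):
    """Map what a user writes to a bit mask.

    Assumption: y/Y selects every feature in the table and n/N selects none.
    Anything else is parsed as a number.
    """
    text = text.strip()
    if text in ("y", "Y"):
        return 0x0007
    if text in ("n", "N"):
        return 0x0000
    return int(text, 0)
```

With this reading, ``decode_enabled("0x0002")`` lists only the leaf-entry item,
which is exactly the combination I asked about above: the table does not say
what that means when the 0x0001 bit is clear.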
> +
> +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> +    for working set estimation and proactive reclaim.

Please add a note that this is a build-time option.

> +
> +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following

Is the debugfs interface relevant only for data centers?

> +    format:
> +    ::
> +
> +        memcg  memcg_id  memcg_path
> +          node  node_id
> +            min_gen  birth_time  anon_size  file_size
> +            ...
> +            max_gen  birth_time  anon_size  file_size
> +
> +    ``min_gen`` is the oldest generation number and ``max_gen`` is the
> +    youngest generation number. ``birth_time`` is in milliseconds.

It's unclear what the reference point of birth_time is. Is it milliseconds
since system start, or is it measured some other way?

> +    ``anon_size`` and ``file_size`` are in pages. The youngest generation
> +    represents the group of the MRU pages and the oldest generation
> +    represents the group of the LRU pages. For working set estimation, a

Please spell out MRU and LRU fully.

> +    job scheduler writes to this file at a certain time interval to
> +    create new generations, and it ranks available servers based on the
> +    sizes of their cold memory defined by this time interval. For
> +    proactive reclaim, a job scheduler writes to this file before it
> +    tries to land a new job, and if it fails to materialize the cold
> +    memory without impacting the existing jobs, it retries on the next
> +    server according to the ranking result.

Is this knob only relevant for a job scheduler? Or can it be used in other
use cases as well?

> +
> +    This file accepts commands in the following subsections. Multiple
                                 ^ described
> +    command lines are supported, so does concatenation with delimiters
> +    ``,`` and ``;``.
> +
> +    ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
> +    debugging.
> +
> +:Working set estimation: Write ``+ memcg_id node_id max_gen
> +    [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke
> +    the aging. It scans PTEs for hot pages and promotes them to the
> +    youngest generation ``max_gen``. Then it creates a new generation
> +    ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages
> +    when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead
> +    as well as the coverage when scanning PTEs.
> +
> +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
> +    [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
> +    eviction. It evicts generations less than or equal to ``min_gen``.
> +    ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
> +    ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
> +    ``nr_to_reclaim`` to limit the number of pages to evict.

I feel that /sys/kernel/debug/lru_gen is too overloaded.

> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 44365c4574a3..b48434300226 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the
>     ksm
>     memory-model
>     mmu_notifier
> +   multigen_lru
>     numa
>     overcommit-accounting
>     page_migration

-- 
Sincerely yours,
Mike.
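
To illustrate how much is packed into the single debugfs file, here is a
minimal Python sketch that only builds the two command strings from the
formats quoted above (``+ memcg_id node_id max_gen [can_swap [full_scan]]``
and ``- memcg_id node_id min_gen [swappiness [nr_to_reclaim]]``). The function
names are hypothetical, the ordering checks on the optional fields are my
reading of the nested brackets, and nothing here writes to
/sys/kernel/debug/lru_gen.

```python
def aging_command(memcg_id, node_id, max_gen, can_swap=None, full_scan=None):
    """Build '+ memcg_id node_id max_gen [can_swap [full_scan]]'."""
    # The nested brackets suggest full_scan can only be given after can_swap.
    if full_scan is not None and can_swap is None:
        raise ValueError("full_scan can only follow can_swap")
    fields = ["+", memcg_id, node_id, max_gen]
    fields += [f for f in (can_swap, full_scan) if f is not None]
    return " ".join(str(f) for f in fields)


def eviction_command(memcg_id, node_id, min_gen, max_gen,
                     swappiness=None, nr_to_reclaim=None):
    """Build '- memcg_id node_id min_gen [swappiness [nr_to_reclaim]]'.

    max_gen is used only to check the rule stated in the quoted text:
    min_gen should be less than max_gen-1.
    """
    if min_gen >= max_gen - 1:
        raise ValueError("min_gen should be less than max_gen-1")
    if nr_to_reclaim is not None and swappiness is None:
        raise ValueError("nr_to_reclaim can only follow swappiness")
    fields = ["-", memcg_id, node_id, min_gen]
    fields += [f for f in (swappiness, nr_to_reclaim) if f is not None]
    return " ".join(str(f) for f in fields)


def join_commands(commands, delimiter=";"):
    """Concatenate commands with one of the delimiters named in the text."""
    if delimiter not in (",", ";"):
        raise ValueError("delimiter must be ',' or ';'")
    return delimiter.join(commands)
```

For example, ``aging_command(1, 0, 7, can_swap=1, full_scan=0)`` gives
``+ 1 0 7 1 0``. Even in this small sketch the file carries listing, aging and
eviction semantics at once, which is why separate attributes might be easier
to document.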