Date: Tue, 15 Feb 2022 20:22:10 -0700
From: Yu Zhao
To: Mike Rapoport
Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song <21cnbao@gmail.com>,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Jonathan Corbet, Linus Torvalds, Matthew Wilcox,
	Michael Larabel, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel@lists.infradead.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, page-reclaim@google.com, x86@kernel.org,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, Sofia Trinh
Subject: Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
References: <20220208081902.3550911-1-yuzhao@google.com>
	<20220208081902.3550911-13-yuzhao@google.com>
X-Mailing-List: linux-doc@vger.kernel.org

On Mon, Feb 14, 2022 at 12:28:56PM +0200,
Mike Rapoport wrote:

Thanks for reviewing.

> >  Documentation/admin-guide/mm/index.rst        |   1 +
> >  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
> >  Documentation/vm/index.rst                    |   1 +
> >  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
>
> Please consider splitting this patch into Documentation/admin-guide and
> Documentation/vm parts.

Will do.

> > +=====================
> > +Multigenerational LRU
> > +=====================
> +
> > +Quick start
> > +===========
>
> There is no explanation why one would want to use multigenerational LRU
> until the next section.
>
> I think there should be an overview that explains why users would want to
> enable multigenerational LRU.

Will do.

> > +Build configurations
> > +--------------------
> > +:Required: Set ``CONFIG_LRU_GEN=y``.
>
> Maybe
>
>   Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

Will do.

> > +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> > +  multigenerational LRU by default.
> > +
> > +Runtime configurations
> > +----------------------
> > +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> > +  ``CONFIG_LRU_GEN_ENABLED=n``.
> > +
> > +This file accepts different values to enabled or disabled the
> > +following features:
>
> Maybe
>
>   After multigenerational LRU is enabled, this file accepts different
>   values to enable or disable the following feaures:

Will do.

> > +====== ========
> > +Values Features
> > +====== ========
> > +0x0001 the multigenerational LRU
>
> The multigenerational LRU what?

Itself? This depends on the POV, and I'm trying to determine what would
be the natural way to present it.

MGLRU itself could be seen as an add-on atop the existing page reclaim
or an alternative in parallel. The latter would be similar to sl[aou]b,
and that's how I personally see it.

But here I presented it more like the former because I feel this way is
more natural to users because they are like switches on a single panel.
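
As a reading aid for the table quoted from the patch, here is a minimal
Python sketch of how a value written to the ``enabled`` file maps onto the
three bits (0x0001, 0x0002, 0x0004). It is not kernel code and makes no
claim about how the kernel parses the file. The component names, the helper
names, and the mapping of ``n`` to 0x0000 are illustrative assumptions; the
only data points taken from the patch are that ``y`` reads back as 0x0007
and ``5`` reads back as 0x0005.

```python
# Bit values come from the quoted table; the descriptive names are
# illustrative only and do not appear in the patch.
COMPONENTS = {
    0x0001: "multigenerational LRU",
    0x0002: "batched clearing of the accessed bit in leaf entries",
    0x0004: "clearing of the accessed bit in non-leaf entries",
}
ALL_COMPONENTS = 0x0007  # what the quoted example reads back after "echo y"


def parse_enabled(text):
    """Turn a string written to the file into a bitmask (hypothetical helper)."""
    text = text.strip()
    if text in ("y", "Y"):
        return ALL_COMPONENTS
    if text in ("n", "N"):
        return 0x0000  # assumption: "n" clears every bit
    return int(text, 0)  # base 0 accepts both "5" and "0x0005"


def format_mask(mask):
    """Render a mask the way the quoted example prints it, e.g. 0x0005."""
    return f"0x{mask:04x}"


def describe(mask):
    """List the components whose bits are set in the mask."""
    return [name for bit, name in COMPONENTS.items() if mask & bit]
```

For instance, ``describe(parse_enabled("0x2"))`` names only the leaf
batching bit. The sketch says nothing about what the kernel does when that
bit is set while 0x0001 is clear.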
> What will happen if I write 0x2 to this file?

Just like turning on a branch breaker while leaving the main breaker
off in a circuit breaker box. This is how I see it, and I'm totally fine
with changing it to whatever you'd recommend.

> Please consider splitting "enable" and "features" attributes.

How about s/Features/Components/?

> > +0x0002 clear the accessed bit in leaf page table entries **in large
> > +       batches**, when MMU sets it (e.g., on x86)
>
> Is extra markup really needed here...
>
> > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > +       well**, when MMU sets it (e.g., on x86)
>
> ... and here?

Will do.

> As for the descriptions, what is the user-visible effect of these features?
> How different modes of clearing the access bit are reflected in, say, GUI
> responsiveness, database TPS, or probability of OOM?

These remain to be seen :) I just added these switches in v7, per Mel's
request from the meeting we had. These were never tested in the field.

> > +[yYnN] apply to all the features above
> > +====== ========
> > +
> > +E.g.,
> > +::
> > +
> > +  echo y >/sys/kernel/mm/lru_gen/enabled
> > +  cat /sys/kernel/mm/lru_gen/enabled
> > +  0x0007
> > +  echo 5 >/sys/kernel/mm/lru_gen/enabled
> > +  cat /sys/kernel/mm/lru_gen/enabled
> > +  0x0005
> > +
> > +Most users should enable or disable all the features unless some of
> > +them have unforeseen side effects.
> > +
> > +Recipes
> > +=======
> > +Personal computers
> > +------------------
> > +Personal computers are more sensitive to thrashing because it can
> > +cause janks (lags when rendering UI) and negatively impact user
> > +experience. The multigenerational LRU offers thrashing prevention to
> > +the majority of laptop and desktop users who don't have oomd.
>
> I'd expect something like this paragraph in overview.
>
> > +
> > +:Thrashing prevention: Write ``N`` to
> > +  ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > +  ``N`` milliseconds from getting evicted. The OOM killer is triggered
> > +  if this working set can't be kept in memory. Based on the average
> > +  human detectable lag (~100ms), ``N=1000`` usually eliminates
> > +  intolerable janks due to thrashing. Larger values like ``N=3000``
> > +  make janks less noticeable at the risk of premature OOM kills.
>
> > +
> > +Data centers
> > +------------
> > +Data centers want to optimize job scheduling (bin packing) to improve
> > +memory utilizations. Job schedulers need to estimate whether a server
> > +can allocate a certain amount of memory for a new job, and this step
> > +is known as working set estimation, which doesn't impact the existing
> > +jobs running on this server. They also want to attempt freeing some
> > +cold memory from the existing jobs, and this step is known as proactive
> > +reclaim, which improves the chance of landing a new job successfully.
>
> This paragraph also fits overview.

Will do.

> > +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> > +  for working set estimation and proactive reclaim.
>
> Please add a note that this is build time option.

Will do.

> > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
>
> Is debugfs interface relevant only for datacenters?

For the moment, yes.

> > +  format:
> > +  ::
> > +
> > +    memcg  memcg_id  memcg_path
> > +      node  node_id
> > +        min_gen  birth_time  anon_size  file_size
> > +        ...
> > +        max_gen  birth_time  anon_size  file_size
> > +
> > +  ``min_gen`` is the oldest generation number and ``max_gen`` is the
> > +  youngest generation number. ``birth_time`` is in milliseconds.
>
> It's unclear what is birth_time reference point. Is it milliseconds from
> the system start or it is measured some other way?

Good point. Will clarify.

> > +  ``anon_size`` and ``file_size`` are in pages. The youngest generation
> > +  represents the group of the MRU pages and the oldest generation
> > +  represents the group of the LRU pages. For working set estimation, a
>
> Please spell out MRU and LRU fully.

Will do.

> > +  job scheduler writes to this file at a certain time interval to
> > +  create new generations, and it ranks available servers based on the
> > +  sizes of their cold memory defined by this time interval. For
> > +  proactive reclaim, a job scheduler writes to this file before it
> > +  tries to land a new job, and if it fails to materialize the cold
> > +  memory without impacting the existing jobs, it retries on the next
> > +  server according to the ranking result.
>
> Is this knob only relevant for a job scheduler? Or it can be used in other
> use-cases as well?

There are other concrete use cases but I'm not ready to discuss them
yet.

> > +  This file accepts commands in the following subsections. Multiple
>
>                                  ^ described

Will do.
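
To make the debugfs format quoted above more concrete, here is a rough
Python sketch of how a job scheduler might read it. It assumes that the
output is whitespace-separated, that each ``memcg`` line is followed by its
``node`` lines, and that each ``node`` line is followed by one line per
generation, as the quoted layout suggests. The ``Generation`` type, the
``parse_lru_gen`` helper, and every number in ``SAMPLE`` are made up for
illustration; the reference point of ``birth_time`` is left open, as it is
in the thread.

```python
from typing import List, NamedTuple


class Generation(NamedTuple):
    """One generation line, tagged with its memcg and node (hypothetical type)."""
    memcg_id: int
    memcg_path: str
    node_id: int
    gen: int
    birth_time_ms: int
    anon_pages: int
    file_pages: int


def parse_lru_gen(text: str) -> List[Generation]:
    """Parse text laid out like the quoted /sys/kernel/debug/lru_gen format."""
    records = []
    memcg_id, memcg_path, node_id = -1, "", -1
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            memcg_id = int(fields[1])
            # Assumption: the path may be absent, so do not require it.
            memcg_path = fields[2] if len(fields) > 2 else ""
        elif fields[0] == "node":
            node_id = int(fields[1])
        else:
            gen, birth, anon, file_ = (int(f) for f in fields[:4])
            records.append(
                Generation(memcg_id, memcg_path, node_id, gen, birth, anon, file_)
            )
    return records


# Invented sample data, not captured from a real system.
SAMPLE = """\
memcg 1 /example
 node 0
  4 20000 100 2000
  5 10000 50 1000
  6 500 10 300
"""

records = parse_lru_gen(SAMPLE)
oldest = min(records, key=lambda r: r.gen)  # min_gen is the oldest generation
cold_pages = oldest.anon_pages + oldest.file_pages
```

With the invented sample, ``cold_pages`` is the size in pages of the oldest
generation, which is the kind of figure the working set estimation
described in the patch would rank servers by.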