Re: [PATCH v3 0/7] mm: workingset reporting

public inbox for linux-mm@kvack.org

From: Gregory Price <gourry@gourry.net>
To: Yuanchu Xie <yuanchu@google.com>
Cc: David Hildenbrand <david@redhat.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Khalid Aziz <khalid.aziz@oracle.com>,
	Henry Huang <henry.hj@antgroup.com>, Yu Zhao <yuzhao@google.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Gregory Price <gregory.price@memverge.com>,
	Huang Ying <ying.huang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Lance Yang <ioworker0@gmail.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Muhammad Usama Anjum <usama.anjum@collabora.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	Wei Xu <weixugc@google.com>, David Rientjes <rientjes@google.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Shuah Khan <shuah@kernel.org>,
	Yosry Ahmed <yosryahmed@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>,
	Kairui Song <kasong@tencent.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Vasily Averin <vasily.averin@linux.dev>,
	Nhat Pham <nphamcs@gmail.com>, Miaohe Lin <linmiaohe@huawei.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Abel Wu <wuyun.abel@bytedance.com>,
	"Vishal Moola (Oracle)" <vishal.moola@gmail.com>,
	Kefeng Wang <wangkefeng.wang@huawei.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v3 0/7] mm: workingset reporting
Date: Tue, 20 Aug 2024 09:00:37 -0400
Message-ID: <ZsSTdY5hsv05jcj-@PC2K9PVX.TheFacebook.com>
In-Reply-To: <20240813165619.748102-1-yuanchu@google.com>

On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> This patch series provides workingset reporting of user pages in
> lruvecs, of which coldness can be tracked by accessed bits and fd
> references. However, the concept of workingset applies generically to
> all types of memory, which could be kernel slab caches, discardable
> userspace caches (databases), or CXL.mem. Therefore, data sources might
> come from slab shrinkers, device drivers, or the userspace. IMO, the
> kernel should provide a set of workingset interfaces that should be
> generic enough to accommodate the various use cases, and be extensible
> to potential future use cases. The current proposed interfaces are not
> sufficient in that regard, but I would like to start somewhere, solicit
> feedback, and iterate.
>
... snip ... 
> Use cases
> ==========
> Promotion/Demotion
> If different mechanisms are used for promotion and demotion, workingset
> information can help connect the two and avoid pages being migrated back
> and forth.
> For example, take a promotion hot page threshold defined as a reaccess
> distance of N seconds (promote pages accessed more often than every N
> seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> the fast memory node pass the threshold. This calculation can be done
> with workingset reports.
> To be directly useful for promotion policies, the workingset report
> interfaces need to be extended to report hotness and gather hotness
> information from the devices[1].
> 
> [1]
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> 
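
To make the threshold calculation above concrete, here is a minimal
sketch. It assumes each bin is (upper bound of idle age in ms, page
count) for the fast node, and that idle age is an acceptable stand-in
for reaccess distance. The names pick_threshold_ms and FAST_NODE_BINS
are hypothetical and the counts are invented; none of this comes from
the series itself.

```python
# Hypothetical per-bin page counts for the fast node:
# (upper bound of idle age in ms, pages in that bin).
# The numbers are made up for illustration only.
FAST_NODE_BINS = [
    (1_000, 500),
    (10_000, 350),
    (30_000, 100),
    (2**63 - 1, 50),  # catch-all bin for everything older
]


def pick_threshold_ms(bins, target_fraction=0.8):
    """Return the smallest bin upper bound that covers target_fraction
    of all pages, or None if the report is empty."""
    total = sum(count for _, count in bins)
    if total == 0:
        return None
    ordered = sorted(bins)
    covered = 0
    for upper_ms, count in ordered:
        covered += count
        if covered >= target_fraction * total:
            return upper_ms
    # Only reachable through rounding; fall back to the last bin.
    return ordered[-1][0]
```

With these invented counts, the first two bins hold 850 of 1000 pages,
so an 80% target selects the 10000 ms bound. The result can only be as
fine-grained as the bin boundaries in the report.
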
> Sysfs and Cgroup Interfaces
> ==========
> The interfaces are detailed in the patches that introduce them. The main
> idea here is we break down the workingset per-node per-memcg into time
> intervals (ms), e.g.
> 
> 1000 anon=137368 file=24530
> 20000 anon=34342 file=0
> 30000 anon=353232 file=333608
> 40000 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
> 
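
For anyone consuming this from userspace, a rough parsing sketch
follows. It assumes every line has exactly the shape shown in the
sample above ("<interval> anon=<n> file=<n>"); the actual file paths
and any additional fields are defined in the patches and are not
assumed here. parse_report and SAMPLE are hypothetical names.

```python
import re

# Matches one report line of the shape shown in the sample,
# e.g. "1000 anon=137368 file=24530". This shape is an assumption.
LINE_RE = re.compile(r"^(\d+)\s+anon=(\d+)\s+file=(\d+)$")


def parse_report(text):
    """Parse report text into a list of (interval, anon, file) tuples."""
    bins = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        interval, anon, file_ = (int(group) for group in match.groups())
        bins.append((interval, anon, file_))
    return bins


# The sample from the cover letter, copied verbatim.
SAMPLE = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""
```

Calling parse_report(SAMPLE) yields five tuples. The last interval,
9223372036854775807, equals 2**63 - 1 and presumably acts as the
catch-all bin for everything older than the previous bound.
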
> I realize this does not generalize well to hotness information, but I
> lack the intuition for an abstraction that presents hotness in a useful
> way. Based on a recent proposal for move_phys_pages[2], it seems like
> userspace tiering software would like to move specific physical pages,
> instead of informing the kernel "move x number of hot pages to y
> device". Please advise.
> 
> [2]
> https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> 

Just as a note on move_phys_pages: it is really a testing interface.
The end goal is not to merge a user-facing interface like
move_phys_pages, but instead to have something like a triggered kernel
task that has a directive of "Promote X pages from Device A".

This work is more of an open collaboration for prototyping, so that we
can assess the usefulness of the hardware hotness collection mechanism
without having to plumb it through the kernel from the start.

---

More generally on promotion, I have recently been considering a problem
with promoting unmapped pagecache pages: they are not subject to NUMA
hint faults.  I started looking at PG_accessed and PG_workingset as a
potential mechanism to trigger promotion, but I'm starting to see a
pattern of competing priorities between reclaim (LRU/MGLRU) logic and
promotion logic.

Reclaim is triggered largely under memory pressure - which means co-opting
reclaim logic for promotion is at best logically confusing, and at worst
likely to introduce regressions.  The LRU/MGLRU logic is written largely
for reclaim, not promotion.  This makes hacking promotion in after the
fact rather dubious - the design choices don't match.

One example: if a page moves from inactive->active (or old->young), we
could treat this as a page "becoming hot" and mark it for promotion, but
this potentially punishes pages on the "active/younger" lists which are
themselves hotter.
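
A toy userspace model of that example, purely illustrative: it treats
the first access to an inactive page as its activation (a
simplification of the real LRU rules) and marks a page for promotion
only on that transition. transition_candidates and the page names are
hypothetical.

```python
from collections import Counter


def transition_candidates(initial_active, accesses):
    """Return pages marked for promotion, where a page is marked only
    when it moves from the inactive set to the active set."""
    active = set(initial_active)
    candidates = []
    for page in accesses:
        if page not in active:
            # Toy rule: the first access activates the page.
            active.add(page)
            candidates.append(page)
    return candidates


# Page "A" starts on the active list and is accessed 100 times;
# page "B" starts inactive and is accessed once.
ACCESSES = ["A"] * 100 + ["B"]
MARKED = transition_candidates({"A"}, ACCESSES)
HOTTEST = Counter(ACCESSES).most_common(1)[0][0]
```

Here MARKED contains only "B", while HOTTEST is "A": the page that was
already active never transitions, so a transition-based trigger never
selects it even though it is the hotter page.
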

I'm starting to think separate demotion/reclaim and promotion components
are warranted. This could take the form of a separate kernel worker that
occasionally gets scheduled to manage a promotion list, or even the
addition of a PG_promote flag to decouple reclaim and promotion logic
completely.  Separating the structures entirely would be good to allow
both demotion/reclaim and promotion to occur concurrently (although this
seems problematic under memory pressure).
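
As a rough sketch of what a decoupled promotion list might look like,
here is a toy userspace model. It is not kernel code and not a proposed
design: PromotionList, note_access and promote are hypothetical names,
and "hottest first by access count" is just one possible policy. The
point is only that the structure is independent of any reclaim list.

```python
class PromotionList:
    """Toy stand-in for a per-device promotion list kept separately
    from the reclaim (LRU/MGLRU) lists."""

    def __init__(self):
        self._counts = {}  # page id -> accesses seen since last pass

    def note_access(self, page):
        """Record one access to a page on the slow device."""
        self._counts[page] = self._counts.get(page, 0) + 1

    def promote(self, budget):
        """Remove and return up to `budget` pages, hottest first, as a
        single pass of a hypothetical promotion worker might."""
        ranked = sorted(self._counts, key=self._counts.get, reverse=True)
        chosen = ranked[:budget]
        for page in chosen:
            del self._counts[page]
        return chosen
```

A pass with a budget of X corresponds to the "Promote X pages from
Device A" directive mentioned earlier; reclaim never has to look at
this structure, and the worker never has to look at the LRU.
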

Would like to know your thoughts here.  If we can decide to segregate
promotion and demotion logic, it might go a long way to simplify the
existing interfaces and formalize transactions between the two.

(also if you're going to LPC, might be worth a chat in person)

~Gregory
