From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
To: "Huang, Ying" <ying.huang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Mina Almasry
<almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
Zefan Li <lizefan.x-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>,
Jonathan Corbet <corbet-T1hC0tSOHrs@public.gmane.org>,
Roman Gushchin
<roman.gushchin-fxUVXftIFDnyG1zEObXtfA@public.gmane.org>,
Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
Muchun Song <songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>,
Andrew Morton
<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
Yang Shi
<yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>,
Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
weixugc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
fvdl-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
bagasdotme-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
Date: Thu, 15 Dec 2022 10:21:25 +0100
Message-ID: <Y5rnFbOqHQUT5da7@dhcp22.suse.cz>
In-Reply-To: <87mt7pdxm1.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
On Thu 15-12-22 13:50:14, Huang, Ying wrote:
> Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> writes:
>
> > On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> >> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> >> >
> >> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> >> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> >> > [...]
> >> > > > After these discussions, I think the solution may be to use different
> >> > > > interfaces for "proactive demote" and "proactive reclaim". That is,
> >> > > > reconsider "memory.demote". In this way, we will always uncharge the
> >> > > > cgroup for "memory.reclaim". This avoids the possible confusion there.
> >> > > > And, because demotion is considered aging, we don't need to disable
> >> > > > demotion for "memory.reclaim", just don't count it.
> >> > >
> >> > > Hm, so in summary:
> >> > >
> >> > > 1) memory.reclaim would demote and reclaim like today, but it would
> >> > > change to only count reclaimed pages against the goal.
> >> > >
> >> > > 2) memory.demote would only demote.
> >> > >
> >>
> >> If the above 2 points are agreeable then yes, this sounds good to me
> >> and does address our use case.
> >>
> >> > > a) What if the demotion targets are full? Would it reclaim or fail?
> >> > >
> >>
> >> Wei will chime in if he disagrees, but I think we _require_ that it
> >> fails, not falls back to reclaim. The interface is asking for
> >> demotion, and is called memory.demote. For such an interface to fall
> >> back to reclaim would be very confusing to userspace and may trigger
> >> reclaim on a high priority job that we want to shield from proactive
> >> reclaim.
> >
> > But what should happen if the immediate demotion target is full but
> > lower tiers are still usable? Should the first one demote before
> > allowing demotion from the top tier?
> >
> >> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
> >>
> >> memory.demote will need a nodemask, for sure. Today the nodemask would
> >> be useful if there is a specific node in the top tier that is
> >> overloaded and we want to reduce the pressure by demoting. In the
> >> future there will be N tiers and the nodemask says which tier to
> >> demote from.
> >
> > OK, so what are the exact semantics of the node mask? Does it control
> > where to demote from, where to demote to, or both?
> >
> >> I don't think memory.reclaim would need a nodemask anymore? At least I
> >> no longer see the use for it for us.
> >>
> >> > > Would
> >> > > they return -EINVAL if a) memory.reclaim gets passed only toptier
> >> > > nodes or b) memory.demote gets passed any lasttier nodes?
> >> >
> >>
> >> Honestly it would be great if memory.reclaim can force reclaim from
> >> top tier nodes. It breaks the aging pipeline, yes, but if the user is
> >> specifically asking for that because they decided in their usecase
> >> it's a good idea then the kernel should comply IMO. Not a strict
> >> requirement for us. Wei will chime in if he disagrees.
> >
> > That would require a nodemask to say which nodes to reclaim, no? The
> > default behavior should be in line with what standard memory reclaim
> > does. If demotion is a part of that process, then it should be a part
> > of memory.reclaim as well. If we want to have finer control then a
> > nodemask is really a must, and then the nodemask should constrain both
> > aging and reclaim.
> >
> >> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
> >>
> >> > I would also add
> >> > 4) Do we want to allow controlling the demotion path (e.g. which node to
> >> > demote from and to) and how to achieve that?
> >>
> >> We care deeply about specifying which node to demote _from_. That
> >> would be some node that is approaching pressure and we're looking for
> >> proactive saving from. So far I haven't seen any reason to control
> >> which nodes to demote _to_. The kernel deciding that based on the
> >> aging pipeline and the node distances sounds good to me. Obviously
> >> someone else may find that useful.
> >
> > Please keep in mind that the interface should be really prepared for
> > future extensions so try to abstract from your immediate usecases.
>
> I see two requirements here. One is to control the demotion source, that
> is, which nodes to free memory from. The other is to control the demotion
> path. I think that we can use two different parameters for them, for
> example, "from=<demotion source nodes>" and "to=<demotion target
> nodes>". In most cases we don't need to control the demotion path,
> because in the current implementation the nodes in the lower tiers in the
> same socket (local nodes) will be preferred. I think that this is
> the desired behavior in most cases.
Even if control over the demotion path is not really required at the
moment, we should keep potential future extensions in mind, e.g. when
userspace-based balancing is to be implemented because the default
behavior cannot capture userspace policies. One example would be
enforcing a prioritization of containers, where some container's demoted
pages would need to be demoted further to free up space for a different
workload.
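To make the "from="/"to=" idea above a bit more concrete, here is a
minimal userspace-side sketch in Python of how such a request string
could be parsed. Everything in it is an assumption for illustration
only: no memory.demote interface exists, the key names are simply the
ones suggested in the quoted text, and parse_nodelist() and
parse_demote_request() are hypothetical helpers, not kernel code.

```python
def parse_nodelist(text):
    """Parse a node list such as "0-1,3" into a set of node ids."""
    nodes = set()
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            lo, hi = int(lo), int(hi)
            if lo > hi:
                raise ValueError("descending range: " + part)
            nodes.update(range(lo, hi + 1))
        else:
            # int() raises ValueError on empty or non-numeric input.
            nodes.add(int(part))
    return nodes


def parse_demote_request(line):
    """Parse a hypothetical request such as "from=0-1 to=2,3".

    "from=" is required; "to=" is optional and None means that the
    demotion targets are left to the kernel's default policy.
    """
    args = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep or key not in ("from", "to") or key in args:
            raise ValueError("bad argument: " + token)
        args[key] = parse_nodelist(value)
    if "from" not in args:
        raise ValueError("missing from=")
    return args["from"], args.get("to")
```

Keeping "to=" optional matches the point that most users would only
name the source nodes, while leaving room for the path control
discussed here.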
> >> > 5) Is the demotion api restricted to multi-tier systems or any numa
> >> > configuration allowed as well?
> >> >
> >>
> >> demotion will of course not work on single tiered systems. The
> >> interface may return some failure on such systems or not be available
> >> at all.
> >
> > Is there any strong reason for that? We do not have any interface to
> > control NUMA balancing from userspace. Why cannot we use the interface
> > for that purpose?
>
> Do you mean to demote the cold pages from the specified source nodes to
> the specified target nodes in different sockets? We don't do that to
> avoid loops in the demotion path. If we prevent the target nodes from
> demoting cold pages to the source nodes at the same time, it seems
> doable.
Loops could be avoided by properly specifying from and to nodes if this
is going to be a fine-grained interface to control demotion.
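As a rough illustration of what "properly specifying from and to
nodes" could mean for a userspace balancer, the Python sketch below
checks a set of planned (source nodes, target nodes) demotion requests
for loops before issuing any of them. The request representation and
the has_demotion_loop() helper are assumptions made for this example;
nothing like this exists in the kernel interface under discussion.

```python
def has_demotion_loop(requests):
    """Return True if the planned demotions form a cycle.

    requests is an iterable of (sources, targets) pairs, each a set of
    node ids, meaning "demote from every source to any target".
    """
    # Build a directed graph: node -> set of nodes it may demote to.
    graph = {}
    for sources, targets in requests:
        for src in sources:
            graph.setdefault(src, set()).update(targets)

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the DFS stack, finished
    state = {}

    def visit(node):
        state[node] = GREY
        for nxt in graph.get(node, ()):
            seen = state.get(nxt, WHITE)
            if seen == GREY:       # back edge: nxt is still on the stack
                return True
            if seen == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in graph)
```

For example, demoting node 0 to node 2 and node 2 to node 3 is loop
free, while adding a request that demotes node 3 back to node 0 would
be rejected by this check.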
--
Michal Hocko
SUSE Labs