From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: David Rientjes <rientjes@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	<kbusch@kernel.org>, <ying.huang@intel.com>,
	<dan.j.williams@intel.com>
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
Date: Thu, 2 Jul 2020 11:02:35 +0100
Message-ID: <20200702110235.00005f2f@Huawei.com>
In-Reply-To: <alpine.DEB.2.23.453.2007011226240.1908531@chino.kir.corp.google.com>

On Wed, 1 Jul 2020 12:45:17 -0700
David Rientjes <rientjes@google.com> wrote:

> On Wed, 1 Jul 2020, Yang Shi wrote:
> 
> > > We can do this if we consider pmem not to be a separate memory tier from
> > > the system perspective, however, but rather the socket perspective.  In
> > > other words, a node can only demote to a series of exclusive pmem ranges
> > > and promote to the same series of ranges in reverse order.  So DRAM node 0
> > > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > > one DRAM node.
> > > 
> > > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > > just to be slower volatile memory and we don't need to deal with the
> > > latency concerns of cross socket migration.  A user page will never be
> > > demoted to a pmem range across the socket and will never be promoted to a
> > > different DRAM node that it doesn't have access to.  
> > 
> > But I don't see too much benefit to limit the migration target to the
> > so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> > different socket) pmem node since even the cross socket access should be much
> > faster than refault or swap from disk.
> >   
> 
> Hi Yang,
> 
> Right, but any eventual promotion path would allow this to subvert the 
> user mempolicy or cpuset.mems if the demoted memory is eventually promoted 
> to a DRAM node on its socket.  We've discussed not having the ability to 
> map from the demoted page to either of these contexts and it becomes more 
> difficult for shared memory.  We have page_to_nid() and page_zone() so we 
> can always find the appropriate demotion or promotion node for a given 
> page if there is a 1:1 relationship.
> 
> Do we lose anything with the strict 1:1 relationship between DRAM and PMEM 
> nodes?  It seems much simpler in terms of implementation and is more 
> intuitive.

Hi David, Yang,

The 1:1 mapping implies a particular system topology.  In the medium
term we are likely to see systems with a central pool of persistent memory
that has equal access characteristics from multiple CPU-containing nodes,
each with its own local DRAM.

Clearly we could fake a split of such a pmem pool to keep the 1:1 mapping,
but it's certainly not elegant and may be very wasteful of resources.
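A toy userspace model may make the trade-off concrete. It is not kernel code: the four-node layouts, the two demotion tables and promotion_target() are all hypothetical, only in the spirit of a per-node demotion table. With a strict 1:1 pairing the table can be inverted, so the promotion node follows from the page's node alone (as with page_to_nid()). With one shared pmem pool, two DRAM nodes demote to the same node and the inverse is ambiguous, which is exactly the information a shared pool gives up.

```c
/* Hypothetical four-node layouts; NO_NODE marks "no demotion target". */
#define NR_NODES 4
#define NO_NODE  (-1)

/*
 * Paired topology: DRAM node 0 -> PMEM node 2, DRAM node 1 -> PMEM node 3.
 * The PMEM nodes themselves demote nowhere.
 */
static const int paired_demotion[NR_NODES] = { 2, 3, NO_NODE, NO_NODE };

/*
 * Shared-pool topology: DRAM nodes 0 and 1 both demote to one central
 * PMEM node 2; node 3 is unused in this layout.
 */
static const int shared_demotion[NR_NODES] = { 2, 2, NO_NODE, NO_NODE };

/*
 * Return the single DRAM node that demotes to pmem_nid, or NO_NODE if
 * zero or several nodes do. Only a unique answer gives an unambiguous
 * promotion target derived from the page's current node alone.
 */
static int promotion_target(const int demotion[NR_NODES], int pmem_nid)
{
    int found = NO_NODE;

    for (int nid = 0; nid < NR_NODES; nid++) {
        if (demotion[nid] != pmem_nid)
            continue;
        if (found != NO_NODE)
            return NO_NODE;     /* a second source node: ambiguous */
        found = nid;
    }
    return found;
}
```

In this model promotion_target(paired_demotion, 2) yields node 0, while promotion_target(shared_demotion, 2) yields NO_NODE, so a shared-pool design needs some other source of truth (mempolicy, cpuset or the faulting CPU's node) to pick a promotion target.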

Can a zone based approach work well without such a hard wall?

Jonathan

> 
> > I think using pmem as a node is more natural than zone and less intrusive
> > since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> > think the implementation may be more intrusive and complicated (i.e. need a
> > new gfp flag) and user can't control the memory placement.
> >   
> 
> This is an important decision to make, I'm not sure that we actually 
> *want* all of these NUMA APIs :)  If my memory is demoted, I can simply do 
> migrate_pages() back to DRAM and cause other memory to be demoted in its 
> place.  Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.  
> Kswapd for a DRAM node putting pressure on a PMEM node for demotion that 
> then puts the kswapd for the PMEM node under pressure to reclaim it serves 
> *only* to spend unnecessary cpu cycles.
> 
> Users could control the memory placement through a new mempolicy flag, 
> which I think is needed anyway for explicit allocation policies for PMEM
> nodes.  Consider if PMEM is a zone so that it has the natural 1:1 
> relationship with DRAM, now your system only has nodes {0,1} as today, no 
> new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that 
> specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I 
> can then mlock() if I want to disable demotion on memory pressure).
> 
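The zone-based alternative quoted above can be modelled roughly as follows. This is a userspace sketch under loud assumptions: MPOL_F_TOPTIER is only a proposal in this thread and no such flag exists in the kernel, so SIM_MPOL_F_TOPTIER, the SIM_ZONE_PMEM zone and pick_zone() are hypothetical stand-ins. It shows that a "top tier" allocation only ever considers the NORMAL/MOVABLE zones of its own node, so no new NUMA node appears in the policy at all.

```c
#include <stdbool.h>

/*
 * Userspace model only. SIM_ZONE_NORMAL and SIM_ZONE_MOVABLE mirror the
 * kernel's ZONE_NORMAL and ZONE_MOVABLE; SIM_ZONE_PMEM and
 * SIM_MPOL_F_TOPTIER are hypothetical stand-ins for the proposed PMEM
 * zone and mempolicy flag.
 */
enum sim_zone { SIM_ZONE_NORMAL, SIM_ZONE_MOVABLE, SIM_ZONE_PMEM };

#define SIM_MPOL_F_TOPTIER (1u << 0)    /* hypothetical flag bit */

/* May an allocation with these policy flags be satisfied from zone z? */
static bool zone_allowed(unsigned int mpol_flags, enum sim_zone z)
{
    if (mpol_flags & SIM_MPOL_F_TOPTIER)
        return z == SIM_ZONE_NORMAL || z == SIM_ZONE_MOVABLE;
    return true;    /* unrestricted: PMEM is just a slower zone */
}

/*
 * Walk a node's zones in the given preference order and return the first
 * zone the policy permits that still has free pages, or -1 if none does.
 */
static int pick_zone(unsigned int mpol_flags, const enum sim_zone *zones,
                     const unsigned long *free_pages, int nr)
{
    for (int i = 0; i < nr; i++)
        if (zone_allowed(mpol_flags, zones[i]) && free_pages[i] > 0)
            return (int)zones[i];
    return -1;
}
```

With the top-tier zones exhausted, pick_zone(0, ...) falls back to the PMEM zone while pick_zone(SIM_MPOL_F_TOPTIER, ...) returns -1. A real implementation would presumably reclaim or demote other memory first rather than fail immediately; the model deliberately ignores that, as well as the mlock() interaction mentioned above.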


Thread overview: 52+ messages
2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
2020-06-29 23:45 ` Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup Dave Hansen
2020-06-29 23:45   ` Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed Dave Hansen
2020-06-29 23:45   ` Dave Hansen
2020-07-01  8:47   ` Greg Thelen
2020-07-01 14:46     ` Dave Hansen
2020-07-01 18:32       ` Yang Shi
2020-06-29 23:45 ` [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
2020-06-29 23:45   ` Dave Hansen
2020-07-01  0:47   ` David Rientjes
2020-07-01  1:29     ` Yang Shi
2020-07-01  5:41       ` David Rientjes
2020-07-01  8:54         ` Huang, Ying
2020-07-01 18:20           ` Dave Hansen
2020-07-01 19:50             ` David Rientjes
2020-07-02  1:50               ` Huang, Ying
2020-07-01 15:15         ` Dave Hansen
2020-07-01 17:21         ` Yang Shi
2020-07-01 19:45           ` David Rientjes
2020-07-02 10:02             ` Jonathan Cameron [this message]
2020-07-01  1:40     ` Huang, Ying
2020-07-01 16:48     ` Dave Hansen
2020-07-01 19:25       ` David Rientjes
2020-07-02  5:02         ` Huang, Ying
2020-06-29 23:45 ` [RFC][PATCH 4/8] mm/vmscan: add page demotion counter Dave Hansen
2020-06-29 23:45   ` Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 5/8] mm/numa: automatically generate node migration order Dave Hansen
2020-06-29 23:45   ` Dave Hansen
2020-06-30  8:22   ` Huang, Ying
2020-07-01 18:23     ` Dave Hansen
2020-07-02  1:20       ` Huang, Ying
2020-06-29 23:45 ` [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap Dave Hansen
2020-06-29 23:45   ` Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim Dave Hansen
2020-06-29 23:45   ` Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration Dave Hansen
2020-06-29 23:45   ` Dave Hansen
2020-06-30  7:23   ` Huang, Ying
2020-06-30 17:50     ` Yang Shi
2020-07-01  0:48       ` Huang, Ying
2020-07-01  1:12         ` Yang Shi
2020-07-01  1:28           ` Huang, Ying
2020-07-01 16:02       ` Dave Hansen
2020-07-03  9:30   ` Huang, Ying
2020-06-30 18:36 ` [RFC][PATCH 0/8] Migrate Pages in lieu of discard Shakeel Butt
2020-06-30 18:51   ` Dave Hansen
2020-06-30 19:25     ` Shakeel Butt
2020-06-30 19:31       ` Dave Hansen
2020-07-01 14:24         ` [RFC] [PATCH " Zi Yan
2020-07-01 14:32           ` Dave Hansen