public inbox for linux-mm@kvack.org
From: Gregory Price <gregory.price@memverge.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org, akpm@linux-foundation.org,
	sthanneeru@micron.com,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Wei Xu <weixugc@google.com>, Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Michal Hocko <mhocko@kernel.org>, Tim Chen <tim.c.chen@intel.com>,
	Yang Shi <shy828301@gmail.com>
Subject: Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
Date: Mon, 30 Oct 2023 00:19:39 -0400
Message-ID: <ZT8u2246+vkA/4F+@memverge.com>
In-Reply-To: <87a5s0df6p.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, Oct 30, 2023 at 10:20:14AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> The extending adds complexity to the kernel code and changes the kernel
> ABI.  So, IMHO, we need some real life use case to prove the added
> complexity is necessary.
> 
> For example, in [1], Johannes showed the use case to support to add
> per-memory-tier interleave weight.
> 
> [1] https://lore.kernel.org/all/20220607171949.85796-1-hannes@cmpxchg.org/
> 
> --
> Best Regards,
> Huang, Ying

Sorry, I misunderstood your question.

The use case is the same as the N:M interleave strategy between tiers,
and in fact the proposal for weights was directly inspired by the patch
you posted. We're searching for the best way to implement weights.

We've discussed placing these weights in:

1) mempolicy
   https://lore.kernel.org/linux-cxl/20230914235457.482710-1-gregory.price@memverge.com/

2) tiers
   https://lore.kernel.org/linux-cxl/20231009204259.875232-1-gregory.price@memverge.com/

and now

3) the nodes themselves
   (RFC not posted yet)

The use case is the exact same as the patch you posted, which is to enable
optimal distribution of memory to maximize memory bandwidth usage.

The use case is straightforward. Consider a machine with the following
NUMA nodes:

1) Socket 0 - DRAM - ~400GB/s bandwidth local, less cross-socket
2) Socket 1 - DRAM - ~400GB/s bandwidth local, less cross-socket
3) CXL memory attached to Socket 0 with ~64GB/s per link.
4) CXL memory attached to Socket 1 with ~64GB/s per link.

The goal is to enable mempolicy to implement weighted interleave so
that a thread running on socket 0 can spread its memory across each
NUMA node (or some subset thereof) in a way that maximizes its
bandwidth usage across the various devices.

For example, let's consider a system with only 1 & 2 (2 sockets w/ DRAM).

On an Intel system with UPI, the "effective" bandwidth available to a
task on Socket 0 is not 800GB/s. It's about 450-500GB/s, split about
300/200 between the sockets (you never get the full amount, and UPI limits
cross-socket bandwidth).

Today `numactl --interleave` will split your memory 50:50 between
sockets, which is just blatantly suboptimal.  In this case you would
prefer a 3:2 distribution (literally weights of 3 and 2 respectively).
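
To make the arithmetic concrete, here is a minimal sketch (not the
kernel implementation) that reduces the ~300/200 split to integer
weights and hands out pages in a weighted round-robin.  The helper
names weights_from_bandwidth() and weighted_interleave() are made up
for this illustration.

```python
from functools import reduce
from math import gcd


def weights_from_bandwidth(bandwidth_gbps):
    """Reduce per-node bandwidth estimates to the smallest integer ratio."""
    divisor = reduce(gcd, bandwidth_gbps.values())
    return {node: bw // divisor for node, bw in bandwidth_gbps.items()}


def weighted_interleave(weights, num_pages):
    """Give each node `weight` consecutive pages, then move to the next node."""
    order = [node for node, w in sorted(weights.items()) for _ in range(w)]
    return [order[page % len(order)] for page in range(num_pages)]


# Effective bandwidth (GB/s) seen from socket 0, using the rough 300/200 split.
weights = weights_from_bandwidth({0: 300, 1: 200})

# Node chosen for each of the first 10 pages of an allocation.
placement = weighted_interleave(weights, 10)
```

With these inputs, 6 of every 10 pages land on node 0 and 4 on node 1,
instead of the 5/5 split that plain interleave produces.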

The extension to CXL becomes obvious then, as each individual node,
respective to its CPU placement, has a different optimal weight.
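
As a rough sketch of why the weights depend on where the task runs, the
snippet below computes the fraction of pages each node would receive
from each socket's point of view.  The DRAM figures follow the ~300/200
split above; the CXL figures, in particular the cross-socket ones, are
placeholders chosen only for illustration and are not measurements.
Treating the share as simply proportional to per-node bandwidth is also
an assumption.

```python
# Hypothetical effective bandwidth (GB/s) from each socket to each node.
# DRAM values follow the ~300/200 split; CXL values are placeholders.
EFFECTIVE_BW = {
    "socket0": {"dram0": 300, "dram1": 200, "cxl0": 64, "cxl1": 32},
    "socket1": {"dram0": 200, "dram1": 300, "cxl0": 32, "cxl1": 64},
}


def page_share(bandwidth):
    """Fraction of pages per node if placement is proportional to bandwidth."""
    total = sum(bandwidth.values())
    return {node: bw / total for node, bw in bandwidth.items()}


# One weight table per socket: the same node gets a different share
# depending on which socket the task is running on.
shares = {socket: page_share(bw) for socket, bw in EFFECTIVE_BW.items()}
```

The same node ends up with a different share in each table, which is
the reason a single global weight per node is not enough on its own.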


Of course the question becomes "what if a task uses more threads than a
single socket has to offer", and the answer there is essentially the
same as the answer today: that process must become "NUMA-aware" to
make the best use of the available resources.

However, for software capable of exhausting bandwidth from a single
socket (which on Intel takes about 16-20 threads with certain access
patterns), a weighted-interleave system provided via some interface
like `numactl --weighted-interleave`, with weights set either in the NUMA
nodes or in mempolicy, is sufficient.


~Gregory


