From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Gregory Price <gourry@gourry.net>
Cc: <lsf-pc@lists.linux-foundation.org>, <linux-mm@kvack.org>,
<linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity
Date: Thu, 13 Mar 2025 17:20:04 +0000 [thread overview]
Message-ID: <20250313172004.00002236@huawei.com> (raw)
In-Reply-To: <Z8u4GTrr-UytqXCB@gourry-fedora-PF4VCD3F>
On Fri, 7 Mar 2025 22:23:05 -0500
Gregory Price <gourry@gourry.net> wrote:
> In the last section we discussed how the CEDT CFMWS and SRAT Memory
> Affinity structures are used by linux to "create" NUMA nodes (or at
> least mark them as possible). However, the examples I used suggested
> that there was a 1-to-1 relationship between CFMWS and devices or
> host bridges.
>
> This is not true - in fact, CFMWS are a simply a carve out of System
> Physical Address space which may be used to map any number of endpoint
> devices behind the associated Host Bridge(s).
>
> The limiting factor is what your platform vendor BIOS supports.
>
> This section describes a handful of *possible* configurations, what NUMA
> structure they will create, and what flexibility this provides.
>
> All of these CFMWS configurations are made up, and may or may not exist
> in real machines. They are a conceptual teching tool, not a roadmap.
>
> (When discussing interleave in this section, please note that I am
> intentionally omitting details about decoder programming, as this
> will be covered later.)
>
>
> -------------------------------
> One 2GB Device, Multiple CFMWS.
> -------------------------------
> Lets imagine we have one 2GB device attached to a host bridge.
>
> In this example, the device hosts 2GB of persistent memory - but we
> might want the flexibility to map capacity as volatile or persistent.
Fairly sure we block persistent in a volatile CFMWS in the kernel.
Any bios actually does this?
You might have a variable partition device but I thought in kernel at
least we decided that no one was building that crazy?
Maybe a QoS split is a better example to motivate one range, two places?
>
> The platform vendor may decide that they want to reserve two entirely
> separate system physical address ranges to represent the capacity.
>
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000200000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 000A <- Bit(3) - Persistant
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> You might have a CEDT with two CFMWS as above, where the base addresses
> are `0x100000000` and `0x200000000` respectively, but whose window sizes
> cover the entire 2GB capacity of the device. This affords the user
> flexibility in where the memory is mapped depending on if it is mapped
> as volatile or persistent while keeping the two SPA ranges separate.
>
> This is allowed because the endpoint decoders commit device physical
> address space *in order*, meaning no two regions of device physical
> address space can be mapped to more than one system physical address.
>
> i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)
>
> (See Section 2a - decoder programming).
>
> -------------------------------------------------------------
> Two Devices On One Host Bridge - With and Without Interleave.
> -------------------------------------------------------------
> What if we wanted some capacity on each endpoint hosted on its own NUMA
> node, and wanted to interleave a portion of each device capacity?
If anyone hits the lock on commit (i.e. annoying BIOS) the ordering
checks on HPA kick in here and restrict flexibility a lot
(assuming I understand them correctly that is)
This is a good illustration of why we should at some point revisit
multiple NUMA nodes per CFMWS. We have to burn SPA space just
to get nodes. From a spec point of view all that is needed here
is a single CFMWS.
>
> We could produce the following CFMWS configuration.
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000200000000 <- Memory Region 2
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000300000000 <- Memory Region 3
> Window size : 0000000100000000 <- 4GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> In this configuration, we could still do what we did with the prior
> configuration (2 CFMWS), but we could also use the third root decoder
> to simplify decoder programming of interleave.
>
> Since the third region has sufficient capacity (4GB) to cover both
> devices (2GB/each), we can actually associate the entire capacity of
> both devices in that region.
>
> We'll discuss this decoder structure in-depth in Section 4.
>
next prev parent reply other threads:[~2025-03-13 17:20 UTC|newest]
Thread overview: 81+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
2025-02-18 10:12 ` Yuquan Wang
2025-02-18 16:11 ` Gregory Price
2025-02-20 16:30 ` Jonathan Cameron
2025-02-20 16:52 ` Gregory Price
2025-03-04 0:32 ` Gregory Price
2025-03-13 16:12 ` Jonathan Cameron
2025-03-13 17:20 ` Gregory Price
2025-03-10 10:45 ` Yuquan Wang
2025-03-10 14:19 ` Gregory Price
2025-02-05 16:06 ` CXL Boot to Bash - Section 2: The Drivers Gregory Price
2025-02-06 0:47 ` Dan Williams
2025-02-06 15:59 ` Gregory Price
2025-03-04 1:32 ` Gregory Price
2025-03-06 23:56 ` CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming Gregory Price
2025-03-07 0:57 ` Zhijian Li (Fujitsu)
2025-03-07 15:07 ` Gregory Price
2025-03-11 2:48 ` Zhijian Li (Fujitsu)
2025-04-02 6:45 ` Zhijian Li (Fujitsu)
2025-04-02 14:18 ` Gregory Price
2025-04-08 3:10 ` Zhijian Li (Fujitsu)
2025-04-08 4:14 ` Gregory Price
2025-04-08 5:37 ` Zhijian Li (Fujitsu)
2025-02-17 20:05 ` CXL Boot to Bash - Section 3: Memory (block) Hotplug Gregory Price
2025-02-18 16:24 ` David Hildenbrand
2025-02-18 17:03 ` Gregory Price
2025-02-18 17:49 ` Yang Shi
2025-02-18 18:04 ` Gregory Price
2025-02-18 19:25 ` David Hildenbrand
2025-02-18 20:25 ` Gregory Price
2025-02-18 20:57 ` David Hildenbrand
2025-02-19 1:10 ` Gregory Price
2025-02-19 8:53 ` David Hildenbrand
2025-02-19 16:14 ` Gregory Price
2025-02-20 17:50 ` Yang Shi
2025-02-20 18:43 ` Gregory Price
2025-02-20 19:26 ` David Hildenbrand
2025-02-20 19:35 ` Gregory Price
2025-02-20 19:44 ` David Hildenbrand
2025-02-20 20:06 ` Gregory Price
2025-03-11 14:53 ` Zi Yan
2025-03-11 15:58 ` Gregory Price
2025-03-11 16:08 ` Zi Yan
2025-03-11 16:15 ` Gregory Price
2025-03-11 16:35 ` Oscar Salvador
2025-03-05 22:20 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Gregory Price
2025-03-05 22:44 ` Dave Jiang
2025-03-05 23:34 ` Gregory Price
2025-03-05 23:41 ` Dave Jiang
2025-03-06 0:09 ` Gregory Price
2025-03-06 1:37 ` Yuquan Wang
2025-03-06 17:08 ` Gregory Price
2025-03-07 2:20 ` Yuquan Wang
2025-03-07 15:12 ` Gregory Price
2025-03-13 17:00 ` Jonathan Cameron
2025-03-08 3:23 ` [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity Gregory Price
2025-03-13 17:20 ` Jonathan Cameron [this message]
2025-03-13 18:17 ` Gregory Price
2025-03-14 11:09 ` Jonathan Cameron
2025-03-14 13:46 ` Gregory Price
2025-03-13 16:55 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Jonathan Cameron
2025-03-13 17:30 ` Gregory Price
2025-03-14 11:14 ` Jonathan Cameron
2025-03-27 9:34 ` Yuquan Wang
2025-03-27 12:36 ` Gregory Price
2025-03-27 13:21 ` Dan Williams
2025-03-27 16:36 ` Gregory Price
2025-03-31 23:49 ` [Lsf-pc] " Dan Williams
2025-03-12 0:09 ` [LSF/MM] CXL Boot to Bash - Section 4: Interleave Gregory Price
2025-03-13 8:31 ` Yuquan Wang
2025-03-13 16:48 ` Gregory Price
2025-03-26 9:28 ` Yuquan Wang
2025-03-26 12:53 ` Gregory Price
2025-03-27 2:20 ` Yuquan Wang
2025-03-27 2:51 ` [Lsf-pc] " Dan Williams
2025-03-27 6:29 ` Yuquan Wang
2025-03-14 3:21 ` [LSF/MM] CXL Boot to Bash - Section 6: Page allocation Gregory Price
2025-03-18 17:09 ` [LSFMM] Updated: Linux Management of Volatile CXL Memory Devices Gregory Price
2025-04-02 4:49 ` Gregory Price
2025-04-07 16:14 ` Adam Manzanares
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250313172004.00002236@huawei.com \
--to=jonathan.cameron@huawei.com \
--cc=gourry@gourry.net \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lsf-pc@lists.linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.