From: Reinette Chatre <reinette.chatre@intel.com>
To: Tony Luck <tony.luck@intel.com>,
Fenghua Yu <fenghua.yu@intel.com>,
"Maciej Wieczor-Retman" <maciej.wieczor-retman@intel.com>,
Peter Newman <peternewman@google.com>,
James Morse <james.morse@arm.com>,
Babu Moger <babu.moger@amd.com>,
Drew Fustini <dfustini@baylibre.com>,
Dave Martin <Dave.Martin@arm.com>
Cc: <x86@kernel.org>, <linux-kernel@vger.kernel.org>,
<patches@lists.linux.dev>
Subject: Re: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache
Date: Wed, 22 May 2024 14:08:01 -0700 [thread overview]
Message-ID: <beee3369-0075-462e-8449-88fee807463e@intel.com> (raw)
In-Reply-To: <20240515222326.74166-7-tony.luck@intel.com>
Hi Tony,
On 5/15/2024 3:23 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
>
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a global integer "snc_nodes_per_l3_cache" that shows how many
> SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
> SNC mode is either not implemented, or not enabled.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
> number of physical RMIDs divided by the number of SNC nodes.
Should it then perhaps be "number of logical RMIDs per SNC node"?
The way this feature is introduced makes it hard to understand how
RMIDs are used when SNC is enabled since the implementation appears
to distinguish between (a) RMIDs assigned to monitor group and written
to PQR register and (b) RMIDs for which the event counters are read.
The former ((a)) is used directly after the adjustment described in (1) but
the latter needs an adjustment that I did not notice being mentioned.
I think it will help to move the get_node_rmid() helper that is
introduced later here and explain why it is needed.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
> nodes.
> 3) Disable the "-o mba_MBps" mount option in SNC mode
> because the monitoring is being done per SNC node, while the
> bandwidth allocation is still done at the L3 cache scope.
> Trying to use this feedback loop might result in contradictory
> changes to the throttling level coming from each of the SNC
> node bandwidth measurements.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 ++-
> 4 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 135190e0711c..49440f194253 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -484,6 +484,8 @@ extern struct rdt_hw_resource rdt_resources_all[];
> extern struct rdtgroup rdtgroup_default;
> extern struct dentry *debugfs_resctrl;
>
> +extern unsigned int snc_nodes_per_l3_cache;
> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 395bac851f6e..bfa9d3a429fd 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -331,6 +331,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
> return r->default_ctrl;
> }
>
> +/*
> + * Number of SNC nodes that share each L3 cache. Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +unsigned int snc_nodes_per_l3_cache = 1;
> +
> static void mba_wrmsr_intel(struct msr_param *m)
> {
> struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 89d7e6fcbaa1..0f66825a1ac9 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1022,8 +1022,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> int ret;
>
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index cc31ede1a1e7..0923492a8bd0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2346,7 +2346,8 @@ static bool supports_mba_mbps(void)
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> return (is_mbm_local_enabled() &&
> - r->alloc_capable && is_mba_linear());
> + r->alloc_capable && is_mba_linear() &&
> + snc_nodes_per_l3_cache == 1);
> }
>
> /*
Since the software controller is a filesystem feature the above
now requires that snc_nodes_per_l3_cache becomes part of the resctrl
filesystem code and every architecture will need to set snc_nodes_per_l3_cache.
Every architecture will thus need to interpret what "SNC" means for them
using the term introduced here. That may be ok ... but the term "SNC"
will then surely not identify an Intel feature and Intel needs to be ok
that any architecture calls their "similar to SNC but not quite identical"
"SNC".
I assume now that as part of the fs/arch split there needs to be
a new helper that allows different architectures to set this
filesystem variable?
Reinette
next prev parent reply other threads:[~2024-05-22 21:08 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-15 22:23 [PATCH v18 00/17] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2024-05-15 22:23 ` [PATCH v18 01/17] x86/resctrl: Prepare for new domain scope Tony Luck
2024-05-15 22:23 ` [PATCH v18 02/17] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2024-05-15 22:23 ` [PATCH v18 03/17] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2024-05-15 22:23 ` [PATCH v18 04/17] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2024-05-15 22:23 ` [PATCH v18 05/17] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2024-05-15 22:23 ` [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2024-05-22 21:08 ` Reinette Chatre [this message]
2024-05-23 19:04 ` Luck, Tony
2024-05-23 20:56 ` Reinette Chatre
2024-05-23 21:25 ` Luck, Tony
2024-05-23 22:31 ` Reinette Chatre
2024-05-23 23:18 ` Luck, Tony
2024-05-23 23:48 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 07/17] x86/resctrl: Prepare for new Sub-NUMA (SNC) cluster monitor files Tony Luck
2024-05-22 21:12 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 08/17] x86/resctrl: Add and initialize display_id field to struct rdt_mon_domain Tony Luck
2024-05-22 21:13 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 09/17] x86/resctrl: Add new fields to struct rmid_read for summation of domains Tony Luck
2024-05-22 21:14 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 10/17] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function Tony Luck
2024-05-22 21:15 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 11/17] x86/resctrl: Allocate a new bit in union mon_data_bits Tony Luck
2024-05-22 21:16 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 12/17] x86/resctrl: Create Sub-NUMA (SNC) monitor files Tony Luck
2024-05-22 21:19 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 13/17] x86/resctrl: Handle removing directories in Sub-NUMA (SNC) mode Tony Luck
2024-05-22 21:20 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 14/17] x86/resctrl: Sum monitor data acrss Sub-NUMA (SNC) nodes when needed Tony Luck
2024-05-22 21:24 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 15/17] x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode Tony Luck
2024-05-16 8:59 ` Maciej Wieczor-Retman
2024-05-22 21:25 ` Reinette Chatre
2024-05-22 23:47 ` Tony Luck
2024-05-23 17:03 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 16/17] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2024-05-22 21:26 ` Reinette Chatre
2024-05-15 22:23 ` [PATCH v18 17/17] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2024-05-16 8:57 ` Maciej Wieczor-Retman
2024-05-22 21:27 ` Reinette Chatre
2024-05-16 9:28 ` [PATCH v18 00/17] Add support for Sub-NUMA cluster (SNC) systems Maciej Wieczor-Retman
2024-05-16 15:52 ` Luck, Tony
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=beee3369-0075-462e-8449-88fee807463e@intel.com \
--to=reinette.chatre@intel.com \
--cc=Dave.Martin@arm.com \
--cc=babu.moger@amd.com \
--cc=dfustini@baylibre.com \
--cc=fenghua.yu@intel.com \
--cc=james.morse@arm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=maciej.wieczor-retman@intel.com \
--cc=patches@lists.linux.dev \
--cc=peternewman@google.com \
--cc=tony.luck@intel.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox