From: "Luck, Tony" <tony.luck@intel.com>
To: Fenghua Yu <fenghuay@nvidia.com>,
Reinette Chatre <reinette.chatre@intel.com>,
Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
Peter Newman <peternewman@google.com>,
James Morse <james.morse@arm.com>,
Babu Moger <babu.moger@amd.com>,
Drew Fustini <dfustini@baylibre.com>,
Dave Martin <Dave.Martin@arm.com>,
Anil Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Region aware RDT options for resctrl
Date: Fri, 11 Apr 2025 13:56:26 -0700
Message-ID: <Z_mB-gmQe_LR4FWP@agluck-desk3>
In-Reply-To: <Z_mBcnAcGzMMvfxV@agluck-desk3>
On Fri, Apr 11, 2025 at 01:54:12PM -0700, Luck, Tony wrote:
Add Cc: lkml
> A future CPU from Intel will implement "region aware" memory bandwidth
> monitoring and bandwidth allocation. This will provide for more granular
> monitoring and control for heterogeneous memory configurations. BIOS
> will populate an ACPI table that describes which system physical address
> ranges belong to each region. E.g. for a two socket system with both
> DDR and CXL memory, regions could be assigned like this:
>
> Region 0: Local DDR
> Region 1: Remote DDR
> Region 2: Local CXL
> Region 3: Remote CXL
>
> Details of the ACPI tables and MMIO registers are in the "Intel(R)
> Resource Director Technology Architecture Specification" here:
> https://cdrdv2.intel.com/v1/dl/getContent/789566
>
> The existing Linux resctrl user interface will need some extensions
> to handle these new hardware monitors and controls. Here are some
> options for discussion with the goal of aligning on a user interface
> that meets the current and near-future needs of all architectures.
>
> Memory bandwidth monitoring
> ---------------------------
>
> The existing interface provides two files in each of the per-domain
> directories under "mon_data":
>
> mbm_local_bytes: Count of bytes transferred to/from "local" memory
> mbm_total_bytes: Count of bytes transferred to/from all memory
>
> Proposal is to provide a new file to report traffic for each region
> for however many regions are implemented on a system:
>
> mbm_region_0_bytes
> ...
> mbm_region_N_bytes
>
> Potentially a compatibility file:
>
> mbm_total_bytes
>
> could be included which provides data for the sum across all regions.
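
A minimal sketch of how such a compatibility total could be derived from
the proposed per-region files, e.g. by a user-space tool. The file names
follow the proposal above; the helper names and the assumption that each
file holds a single decimal byte count are illustrative only.

```python
import re
from pathlib import Path

# File name pattern taken from the proposal: mbm_region_<N>_bytes
REGION_FILE = re.compile(r"^mbm_region_(\d+)_bytes$")


def sum_region_bytes(files):
    """Given {file name: file contents}, return (per_region, total).

    Files that do not match the per-region pattern (for example an
    existing mbm_total_bytes) are ignored.
    """
    per_region = {}
    for name, text in files.items():
        match = REGION_FILE.match(name)
        if match:
            # Assumption: each file holds one decimal byte count.
            per_region[int(match.group(1))] = int(text.strip())
    return per_region, sum(per_region.values())


def read_domain_dir(path):
    """Hypothetical helper: load every regular file in one mon_data domain directory."""
    return {p.name: p.read_text() for p in Path(path).iterdir() if p.is_file()}
```

Something like sum_region_bytes(read_domain_dir(...)) would then give the
per-region counts and their sum for one domain.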
>
> Providing a similar mbm_local_bytes file would be challenging as the
> BIOS controls the region numbering and it may be difficult/impossible
> for Linux to determine which regions report "local" memory traffic.
> A future implementation may allow the OS to define the region mapping
> which makes things even more complex as the mappings could be changed
> at run time.
>
> Memory bandwidth allocation
> ---------------------------
>
> This is more complex because, besides separate controls for each region,
> there are some additional capability improvements to expose. Resctrl
> already supports controlling bandwidth to "slow" memory on AMD systems,
> providing separate controls for "regular" and "slow" memory in the
> schemata file:
>
> $ cat schemata
> MB: 0=100;1=100
> SMBA:0=100;1=100
>
> It would be tricky for resctrl to build on this for regions for the same
> reason the mbm_local_bytes file would be difficult: there is no way for
> Linux to determine which regions are CXL vs. DDR. This approach would also
> lose the ability to control local vs. remote bandwidth, and it is not
> extensible to future memory configuration options.
>
> Option 1: Per-memory regions might be described individually like this:
>
> $ cat schemata
> RMB0:0=100;1=100
> RMB1:0=75;1=75
> RMB2:0=25;1=25
>
> Option 2: Extend the schemata per-line syntax to keep a single line, but
> give the value for each region in a comma separated list:
>
> $ cat schemata
> RMB:0=100,75,50,25;1=100,75,50,25
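
To compare the two layouts, here is a rough sketch that parses each into
the same table keyed by (region, domain). It assumes integer values and
that the position in the Option 2 list is the region number; neither
parser reflects an existing resctrl implementation.

```python
import re

REGION_LINE = re.compile(r"^RMB(\d+):(.*)$")


def parse_per_region_lines(text):
    """Option 1: one 'RMB<region>:<domain>=<value>;...' line per region."""
    table = {}
    for line in text.strip().splitlines():
        match = REGION_LINE.match(line.strip())
        if not match:
            continue  # not a per-region line
        region = int(match.group(1))
        for item in match.group(2).split(";"):
            domain, _, value = item.partition("=")
            table[(region, int(domain))] = int(value)
    return table


def parse_single_line(line):
    """Option 2: 'RMB:<domain>=<v0>,<v1>,...;...' with one value per region."""
    name, _, body = line.strip().partition(":")
    if name != "RMB":
        raise ValueError("expected a line starting with 'RMB:'")
    table = {}
    for item in body.split(";"):
        domain, _, values = item.partition("=")
        # Assumption: list position is the region number.
        for region, value in enumerate(values.split(",")):
            table[(region, int(domain))] = int(value)
    return table
```

Both forms carry the same information; the difference is whether the
region is encoded in the resource name or in the position within the list.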
>
> But there are additional capabilities that would be useful to expose that
> may influence decisions.
>
> 1) Better than 1% throttle granularity
>
> Existing Intel implementations provide throttle controls in 10% steps. The
> architectural enumeration allows for at best 1% steps. But this may still be
> inadequate to provide distinct controls when very high levels of throttling
> are needed for low priority workloads. The RDT architecture specification
> allows for bandwidth limits to be specified from 1 (maximum throttle) to 511
> (no throttle) though implementations may provide other ranges, e.g. 1..255.
>
> Option 1: Specify bandwidth in schemata with floating point values
>
> $ cat info/MB/min_bandwidth
> 0.1957
> $ cat info/MB/bandwidth_gran
> 0.1957
> $ cat schemata
> RMB0:0=100;1=100
> RMB1:0=0.75;1=1.25
>
> Option 2: Change from "percentage" to some enumerated range
>
> $ cat schemata
> RMB0:0=511;1=511
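
The two options can be related with a small conversion sketch. It assumes
a linear mapping in which the top of the range (511 here) means 100%, i.e.
no throttle; that would be consistent with the 0.1957 granularity shown
above (100 / 511), but the actual hardware mapping is not stated in this
mail. Function names are hypothetical.

```python
MAX_RAW = 511  # "no throttle" value from the range 1..511 mentioned above


def granularity(max_raw=MAX_RAW):
    """Size of one step as a percentage, assuming a linear mapping."""
    return 100.0 / max_raw


def percent_to_raw(percent, max_raw=MAX_RAW):
    """Round a percentage to the nearest step, clamped to 1..max_raw."""
    raw = round(percent * max_raw / 100.0)
    return min(max(raw, 1), max_raw)


def raw_to_percent(raw, max_raw=MAX_RAW):
    """Percentage represented by a raw step under the same assumption."""
    return raw * 100.0 / max_raw
```

Under this assumption a floating point schemata value is rounded to a step
on write, so the value read back may differ slightly from what was written.
An implementation with a 1..255 range would pass max_raw=255.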
>
> 2) Min/max ranges for bandwidth
>
> When a single fixed value for bandwidth limits is provided, users are
> forced to be overly conservative when assigning limits in the schemata
> file in order to keep memory controllers within capacity limits. This
> can result in jobs being throttled unnecessarily at times when there is
> plenty of bandwidth capacity available.
>
> The latest RDT architecture specification allows for setting a minimum
> and maximum bandwidth in addition to the normal limit. Example usage
> would be to set a higher maximum value for low priority jobs to allow
> them to run faster when the system has available memory bandwidth capacity.
> High priority jobs can have a minimum bandwidth setting so that when
> the system is running close to capacity limits, those jobs are not
> throttled as much (or at all) while lower priority jobs are throttled.
>
> Syntax option:
>
> $ cat schemata
> RMB0:0=25<50<100;1=25<50<100
>
> Combining some of these options for new capabilities we could have:
>
> $ cat schemata
> RMB0:0=25<50<100;1=25<50<100
> RMB1:0=2.5<30<40;1=2.5<30<40
> RMB2:0=80<90<100;1=80<90<100
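
A sketch of how the combined syntax might be parsed. It assumes the three
fields are ordered minimum < limit < maximum and that each may be a
floating point percentage; the mail does not spell out the ordering, so
treat both the type and the validation as assumptions.

```python
from typing import NamedTuple


class Bandwidth(NamedTuple):
    minimum: float
    limit: float
    maximum: float


def parse_bandwidth(field):
    """Parse one '<min><<limit><<max>' field such as '2.5<30<40'."""
    parts = [float(p) for p in field.split("<")]
    if len(parts) != 3:
        raise ValueError(f"expected three '<' separated values: {field!r}")
    bw = Bandwidth(*parts)
    # Assumption: the values must be non-decreasing.
    if not bw.minimum <= bw.limit <= bw.maximum:
        raise ValueError(f"values out of order: {field!r}")
    return bw


def parse_line(line):
    """Parse 'RMB<n>:<domain>=<field>;...' into (name, {domain: Bandwidth})."""
    name, _, body = line.strip().partition(":")
    domains = {}
    for item in body.split(";"):
        domain, _, field = item.partition("=")
        domains[int(domain)] = parse_bandwidth(field)
    return name, domains
```

For example, parse_line("RMB1:0=2.5<30<40;1=2.5<30<40") yields the name
"RMB1" and one Bandwidth tuple per domain.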
>
> -Tony