[PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

* [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
@ 2025-01-22 20:20 Babu Moger
  2025-01-22 20:20 ` [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init() Babu Moger
                   ` (25 more replies)
  0 siblings, 26 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

This series adds support for Assignable Bandwidth Monitoring Counters
(ABMC), also known as the QoS RMID Pinning feature.

The series is written so that assignable counter features from other
vendors can be supported easily.

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

The patches are based on top of commit
d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'

# Introduction

Users can create as many monitor groups as there are RMIDs supported by the
hardware. However, the bandwidth monitoring feature on AMD systems only
guarantees that RMIDs currently assigned to a processor will be tracked by
hardware. The counters of any other RMIDs which are no longer being tracked
will be reset to zero. The MBM event counters return "Unavailable" for the
RMIDs that are not tracked by hardware. So, only a limited number of groups
can provide guaranteed monitoring numbers. With ever-changing configurations
there is no way to know for certain which of these groups are being tracked
at a given point in time. Users do not have the option to monitor a group or
set of groups for a certain period of time without worrying about counters
being reset in between.

The ABMC feature lets the user assign a hardware counter to an (RMID, event)
pair and monitor the bandwidth for as long as it is assigned. The assigned
RMID will be tracked by the hardware until the user unassigns it manually.
There is no need to worry about counters being reset during this period.
Additionally, the user can specify a bitmask identifying the specific
bandwidth types from the given source to track with the counter.

Without ABMC enabled, monitoring works in the current 'default' mode, without
the assignment option.

# Linux Implementation

Create a generic interface to support user space assignment of scarce
counters used for monitoring. The first user of the interface is ABMC,
with the option to extend it to "soft-ABMC" and MPAM counters in the
future.

The feature adds the following interface files:

/sys/fs/resctrl/info/L3_MON/mbm_assign_mode: Reports the list of assignable
monitoring features supported. The enclosed brackets indicate which
feature is enabled.

/sys/fs/resctrl/info/L3_MON/num_mbm_cntrs: Reports the number of monitoring
counters available for assignment.

/sys/fs/resctrl/info/L3_MON/available_mbm_cntrs: Reports the number of monitoring
counters free in each domain.

/sys/fs/resctrl/info/L3_MON/mbm_assign_control: Reports the resctrl groups and the
monitoring state of each group. The assignment state can be updated by writing to
the interface.
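
As a rough illustration of how user space might consume these files, the
Python sketch below reads whichever of the four files exist. The helper name
read_l3_mon_info() is hypothetical and not part of this series, and the
default path assumes resctrl is mounted at /sys/fs/resctrl.

```python
from pathlib import Path

# Interface file names as listed in this cover letter.
INFO_FILES = (
    "mbm_assign_mode",
    "num_mbm_cntrs",
    "available_mbm_cntrs",
    "mbm_assign_control",
)


def read_l3_mon_info(info_dir="/sys/fs/resctrl/info/L3_MON"):
    """Return {file name: stripped contents} for the files that exist.

    Hypothetical helper: missing files are skipped, so the result also
    shows which parts of the interface the running kernel exposes.
    """
    result = {}
    for name in INFO_FILES:
        path = Path(info_dir) / name
        if path.is_file():
            result[name] = path.read_text().strip()
    return result
```

On a kernel without this series the returned dictionary would simply be
empty, which gives a caller a cheap way to detect the interface.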

# Examples

a. Check if ABMC support is available.

	# mount -t resctrl resctrl /sys/fs/resctrl/

	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
	[mbm_cntr_assign]
	default

	The ABMC feature is detected and it is enabled.
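
A script could extract the enabled mode from this output with a few lines of
Python. This is only a sketch: parse_assign_mode() is a hypothetical name,
and it assumes one mode per line with the enabled mode in square brackets,
as in the output above.

```python
def parse_assign_mode(text):
    """Return (enabled_mode, supported_modes) from mbm_assign_mode contents.

    Assumes one mode per line, with the enabled mode wrapped in brackets.
    """
    enabled = None
    supported = []
    for line in text.splitlines():
        name = line.strip()
        if not name:
            continue
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            enabled = name
        supported.append(name)
    return enabled, supported
```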

b. Check how many ABMC counters are available. 

	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
	32

c. Check how many ABMC counters are available in each domain.

	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs 
	0=30;1=30
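
The per-domain "<domain_id>=<value>" list is easy to split in user space.
A minimal Python sketch, assuming the ';' separated format shown above
(parse_domain_counts() is a hypothetical helper name):

```python
def parse_domain_counts(text):
    """Parse "0=30;1=30" into {0: 30, 1: 30}."""
    counts = {}
    for item in text.strip().split(";"):
        if not item:
            continue
        dom, _, value = item.partition("=")
        counts[int(dom)] = int(value)
    return counts
```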

d. Create a few resctrl groups.

	# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp

e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   to list and modify the monitoring state of any group. The file provides a single
   place to list the monitoring states of all the resctrl groups. It makes it easier
   for user space to learn about the used counters without needing to traverse all
   the groups, thus reducing the number of file system calls.

	The list has the following format:

	"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

	Format for specific types of groups:

	* Default CTRL_MON group:
	        "//<domain_id>=<flags>"

	* Non-default CTRL_MON group:
	        "<CTRL_MON group>//<domain_id>=<flags>"

	* Child MON group of default CTRL_MON group:
	        "/<MON group>/<domain_id>=<flags>"

	* Child MON group of non-default CTRL_MON group:
	        "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

	Flags can be one of the following:

	t  MBM total event is enabled.
	l  MBM local event is enabled.
	tl Both total and local MBM events are enabled.
	_  None of the MBM events are enabled.

	Examples:

	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control 
	non_default_ctrl_mon_grp//0=tl;1=tl
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
	//0=tl;1=tl
	/child_default_mon_grp/0=tl;1=tl

	There are four groups, and all of them have the local and total
	events enabled on domains 0 and 1.
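
The list format above can be parsed in user space roughly as follows. This is
a Python sketch, not code from the series; parse_assign_control() is a
hypothetical name, and it relies on group names being directory names and
therefore never containing '/'.

```python
def parse_assign_control(text):
    """Map (ctrl_group, mon_group) -> {domain_id: set of flags}.

    Empty strings stand for the default CTRL_MON group and for "no MON
    group". The '_' flag is represented as an empty set.
    """
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Exactly two '/' separators precede the domain list.
        ctrl, mon, domains = line.split("/", 2)
        states = {}
        for item in domains.split(";"):
            dom, _, flags = item.partition("=")
            states[int(dom)] = set() if flags == "_" else set(flags)
        groups[(ctrl, mon)] = states
    return groups
```

With the listing above, the key ("", "") is the default CTRL_MON group and
("", "child_default_mon_grp") is its child MON group.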

f. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control.

	The write format is similar to the above list format, with the addition
	of an opcode for the assignment operation.

	"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

	* Default CTRL_MON group:
	        "//<domain_id><opcode><flags>"

	* Non-default CTRL_MON group:
	        "<CTRL_MON group>//<domain_id><opcode><flags>"

	* Child MON group of default CTRL_MON group:
	        "/<MON group>/<domain_id><opcode><flags>"

	* Child MON group of non-default CTRL_MON group:
	        "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

	Opcode can be one of the following:

	= Update the assignment to match the flags.
	+ Assign a new MBM event without impacting existing assignments.
	- Unassign an MBM event from the currently assigned events.

	Flags can be one of the following:

	t  MBM total event.
	l  MBM local event.
	tl Both total and local MBM events.
	_  None of the MBM events. Only works with the '=' opcode. This flag
	   cannot be combined with other flags.

	Initial group status:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
	//0=tl;1=tl
	/child_default_mon_grp/0=tl;1=tl

	To update the default group to enable only total event on domain 0:
	# echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
	//0=t;1=tl
	/child_default_mon_grp/0=tl;1=tl

	To update the MON group child_default_mon_grp to remove total event on domain 1:
	# echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
	//0=t;1=tl
	/child_default_mon_grp/0=tl;1=l

	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
	remove both local and total events on domain 1:
	# echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" > \
	       /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
	//0=t;1=tl
	/child_default_mon_grp/0=tl;1=l

	To update the default group to add the local event on domain 0:
	# echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
	//0=tl;1=tl
	/child_default_mon_grp/0=tl;1=l

	To update the non-default CTRL_MON group non_default_ctrl_mon_grp to unassign all
	the MBM events on all the domains:
	# echo "non_default_ctrl_mon_grp//*=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=_;1=_
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
	//0=tl;1=tl
	/child_default_mon_grp/0=tl;1=l
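
The opcode semantics can be checked against the transcript above with a small
user-space model. The Python sketch below is not the kernel implementation;
apply_write() and format_state() are hypothetical names, and the model only
handles a single "<domain_id><opcode><flags>" request per write, as in the
examples.

```python
import re

# <CTRL_MON group>/<MON group>/<domain_id><opcode><flags>, domain may be '*'.
REQUEST = re.compile(r"^([^/]*)/([^/]*)/(\*|\d+)([=+-])([tl_]+)$")


def apply_write(state, request):
    """Apply one write to state: {(ctrl, mon): {domain_id: set of flags}}."""
    m = REQUEST.match(request)
    if m is None:
        raise ValueError("malformed request: %r" % request)
    ctrl, mon, dom, opcode, flags = m.groups()
    # '_' only works with '=' and cannot be combined with other flags.
    if "_" in flags and (opcode != "=" or flags != "_"):
        raise ValueError("invalid use of '_': %r" % request)
    group = state[(ctrl, mon)]
    events = set() if flags == "_" else set(flags)
    targets = list(group) if dom == "*" else [int(dom)]
    for d in targets:
        if d not in group:
            raise ValueError("unknown domain %d" % d)
        if opcode == "=":
            group[d] = set(events)   # match the flags exactly
        elif opcode == "+":
            group[d] |= events       # keep existing assignments
        else:
            group[d] -= events       # drop only the named events


def format_state(state):
    """Render the state in the mbm_assign_control list format."""
    lines = []
    for (ctrl, mon), domains in state.items():
        parts = []
        for dom in sorted(domains):
            flags = "".join(f for f in "tl" if f in domains[dom]) or "_"
            parts.append("%d=%s" % (dom, flags))
        lines.append("%s/%s/%s" % (ctrl, mon, ";".join(parts)))
    return "\n".join(lines)
```

Replaying the five writes from this section against a state where every group
starts with "tl" on both domains yields the final listing shown above.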

g. Read the events mbm_total_bytes and mbm_local_bytes of the default group.
   Reading the events does not change with ABMC. If an event is unassigned
   when it is read, the read returns "Unassigned".

	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	779247936
	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
	765207488

h. Check the bandwidth configuration for the group. Note that the bandwidth
   configuration has domain scope. The total event defaults to 0x7f (to
   count all the events) and the local event defaults to 0x15 (to count all
   the local NUMA events). The event bitmap decoding is available at
   https://www.kernel.org/doc/Documentation/x86/resctrl.rst
   in sections "mbm_total_bytes_config" and "mbm_local_bytes_config":

	# cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
	0=0x7f;1=0x7f

	# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
	0=0x15;1=0x15

i. Change the bandwidth source for domain 0 for the total event to count only reads.
   Note that this change affects the total event on domain 0.

	# echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
	# cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
	0=0x33;1=0x7f
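
To see why 0x15 selects the local events and 0x33 selects only reads, the
bitmap can be decoded bit by bit. The Python sketch below uses bit
descriptions transcribed from the "mbm_total_bytes_config" section of
resctrl.rst as best recalled; treat the table as an assumption and verify it
against the documentation. decode_event_config() is a hypothetical helper.

```python
# Bit meanings per resctrl.rst (assumed here; verify against the document).
EVENT_BITS = {
    0: "Reads to memory in the local NUMA domain",
    1: "Reads to memory in the non-local NUMA domain",
    2: "Non-temporal writes to local NUMA domain",
    3: "Non-temporal writes to non-local NUMA domain",
    4: "Reads to slow memory in the local NUMA domain",
    5: "Reads to slow memory in the non-local NUMA domain",
    6: "Dirty Victims from the QOS domain to all types of memory",
}


def decode_event_config(value):
    """Return the descriptions of all bits set in an event config value."""
    return [desc for bit, desc in EVENT_BITS.items() if value & (1 << bit)]
```

Under that table, 0x33 sets bits 0, 1, 4 and 5, which are the four read
sources, and 0x15 sets bits 0, 2 and 4, which are the local NUMA sources.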

j. Now read the total event again. The first read may come back with an
   "Unavailable" status. Subsequent reads of mbm_total_bytes will report only
   the read events.

	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	Unavailable
	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	314101
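
A monitoring tool therefore has to expect a status string in place of a
number. A minimal Python sketch, assuming "Unavailable" and "Unassigned" are
the only non-numeric values of interest (parse_mbm_read() is a hypothetical
name):

```python
def parse_mbm_read(text):
    """Return (byte_count, status).

    Exactly one of the two is None: a numeric read gives (int, None),
    a status read gives (None, "Unavailable") or (None, "Unassigned").
    """
    value = text.strip()
    if value in ("Unavailable", "Unassigned"):
        return None, value
    return int(value), None
```

A caller would typically retry after "Unavailable" and skip the group, or
assign a counter first, after "Unassigned".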

k. Users have the option to go back to the 'default' mbm_assign_mode if required.
   This can be done using the following command. Note that switching the
   mbm_assign_mode resets all the MBM counters of all resctrl groups.

	# echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
	mbm_cntr_assign
	[default]

l. Unmount resctrl.

	# umount /sys/fs/resctrl/
---
v11:
   The commit 2937f9c361f7a ("x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags")
   is already merged. Removed from the series.

   Resolved minor conflicts due to code displacement in latest code.

   Moved the monitoring related calls to monitor.c file when possible.
   Moved some of the changes from include/linux/resctrl.h to arch/x86/kernel/cpu/resctrl/internal.h
   as requested by Reinette. These changes will be moved back when arch and non-arch code is separated.

   Renamed rdtgroup_mbm_assign_mode_show() to resctrl_mbm_assign_mode_show().
   Renamed rdtgroup_num_mbm_cntrs_show() to resctrl_num_mbm_cntrs_show().

   Moved the mon_config_info structure definition to internal.h.
   Moved resctrl_arch_mon_event_config_get() and resctrl_arch_mon_event_config_set()
   to monitor.c file.

   Moved resctrl_arch_assign_cntr() and resctrl_abmc_config_one_amd() to monitor.c.
   Added the code to reset the arch state in resctrl_arch_assign_cntr().
   Also removed resctrl_arch_reset_rmid() inside IPI as the counters are reset from the callers.

   Renamed rdtgroup_assign_cntr_event() to resctrl_assign_cntr_event().
   Refactored the resctrl_assign_cntr_event().
   Added functionality to exit on the first error during assignment.
   Simplified mbm_cntr_free().
   Removed the function mbm_cntr_assigned(). Will be using mbm_cntr_get() to
   figure out if the counter is assigned or not.

   Renamed rdtgroup_unassign_cntr_event() to resctrl_unassign_cntr_event().
   Refactored the resctrl_unassign_cntr_event().

   Moved mbm_cntr_reset() to monitor.c.
   Added code to reset non-architectural state in mbm_cntr_reset().
   Added missing rdtgroup_unassign_cntrs() calls on failure path.

   Domain can be NULL with SNC support so moved the unassign check in rdtgroup_mondata_show().

   Renamed rdtgroup_mbm_assign_mode_write() to resctrl_mbm_assign_mode_write().
   Added more details in resctrl.rst about mbm_cntr_assign mode.
   Re-arranged the text in resctrl.rst file in section mbm_cntr_assign.

   Moved resctrl_arch_mbm_cntr_assign_set_one() to monitor.c

   Added non-arch RMID reset in mbm_config_write_domain().
   Removed resctrl_arch_reset_rmid() call in resctrl_abmc_config_one_amd(). Not required
   as the reset of arch and non-arch RMID counters is done from the callers. It simplifies the IPI code.

   Fixed printing the separator after each domain while listing the group assignments.
   Renamed rdtgroup_mbm_assign_control_show to resctrl_mbm_assign_control_show().

   Fixed the static check warning with initializing dom_id in resctrl_process_flags()

   Added change log in each patch for specific changes.

v10:
   Major change is related to domain specific assignment.
   Added struct mbm_cntr_cfg inside mon domains. This will handle
   the domain specific assignments as discussed in the thread below.
   https://lore.kernel.org/lkml/CALPaoCj+zWq1vkHVbXYP0znJbe6Ke3PXPWjtri5AFgD9cQDCUg@mail.gmail.com/
   I did not see the need to add cntr_id in mbm_state structure. Not used in the code.
   Following patches take care of these changes.
   Patch 12, 13, 15, 16, 17, 18.

   Added __init attribute to cache_alloc_hsw_probe(). Followed function
   prototype rules (preferred order is storage class before return type).

   Moved the mon_config_info structure definition to resctrl.h

   Added call resctrl_arch_reset_rmid() to reset the RMID in the domain inside IPI call
   resctrl_abmc_config_one_amd.

   SMP and non-SMP call support is not required in resctrl_arch_config_cntr with new
   domain specific assign approach/data structure.

   Assigned the counter before exposing the event files.
   Moved the call rdtgroup_assign_cntrs() inside mkdir_rdt_prepare_rmid_alloc().
   This is called on both CTRL_MON and MON group creation.

   Call mbm_cntr_reset() when unmounted to clear all the assignments.

   Fixed the issue with finding the domain in multiple iterations in rdtgroup_process_flags().

   Printed full error message with domain information when assign fails.

   Taken care of other text comments in all the patches. Patch specific changes are in each patch.

   If I missed something, please point it out; it is not intentional.

v9:
   Patch 14 is a new addition.
   Major change in patch 24.
   Moved the fix patch to address __init attribute to the beginning of the series.
   Fixed all the call sequences. Added additional Fixes tags.

   Added Reviewed-by where applicable.

   Took care of a couple of minor merge conflicts with the latest code.
   Re-ordered the MSR in a couple of instances.
   Added available_mbm_cntrs (patch 14) to print the number of counters in a domain.

   Used MBM_EVENT_ARRAY_INDEX macro to get the event index.
   Introduced rdtgroup_cntr_id_init() to initialize the cntr_id

   Introduced new function resctrl_config_cntr to assign the counter, update
   the bitmap and reset the architectural state.
   Taken care of error handling(freeing the counter) when assignment fails.

   Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
   Updated couple of rdtgroup_unassign_cntrs() calls properly.

   Fixed problem changing the mode to mbm_cntr_assign mode when it is
   not supported. Added extra checks to detect if systems supports it.

   https://lore.kernel.org/lkml/03b278b5-6c15-4d09-9ab7-3317e84a409e@intel.com/
   As discussed in the above comment, introduced resctrl_mon_event_config_set to
   handle IPI. But sending another IPI inside IPI causes problem. Kernel
   reports SMP warning. So, introduced resctrl_arch_update_cntr() to send the
   command directly.

   Fixed handling of the special cases '//0=' and '//'.
   Removed extra strstr() call in rdtgroup_mbm_assign_control_write().
   Added generic failure text when assignment operation fails.
   Corrected user documentation format texts.

v8:
  Patches are getting into the final stages.
  A couple of changes in Patch 8, Patch 19 and Patch 23.
  Most of the other changes are related to rename and text message updates.

  Details are in each patch. Here is the summary.

  Added __init attribute to dom_data_init() in patch 8/25.
  Moved the mbm_cntrs_init() and mbm_cntrs_exit() functionality inside
  dom_data_init() and dom_data_exit() respectively.

  Renamed resctrl_mbm_evt_config_init() to arch_mbm_evt_config_init()
  Renamed resctrl_arch_event_config_get() to resctrl_arch_mon_event_config_get().
          resctrl_arch_event_config_set() to resctrl_arch_mon_event_config_set().

  Rename resctrl_arch_assign_cntr to resctrl_arch_config_cntr.
  Renamed rdtgroup_assign_cntr() to rdtgroup_assign_cntr_event().
  Added the code to return the error if rdtgroup_assign_cntr_event fails.
  Moved definition of MBM_EVENT_ARRAY_INDEX to resctrl/internal.h.
  Renamed rdtgroup_mbm_cntr_is_assigned to mbm_cntr_assigned_to_domain
  Added return error handling in resctrl_arch_config_cntr().
  Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
  Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
  Fixed the problem with unassigning the child MON groups of CTRL_MON group.
  Reset the internal counters after mbm_cntr_assign mode is changed.
  Renamed rdtgroup_mbm_cntr_reset() to mbm_cntr_reset()
  Renamed resctrl_arch_mbm_cntr_assign_configure to
            resctrl_arch_mbm_cntr_assign_set_one.

  Used the same IPI as event update to modify the assignment.
  Could not do the way we discussed in the thread.
  https://lore.kernel.org/lkml/f77737ac-d3f6-3e4b-3565-564f79c86ca8@amd.com/
  Needed to figure out event type to update the configuration.

  Moved unassign first and assign during the assign modification.
  Assign none "_" takes priority. Cannot be mixed with other flags.
  Updated the documentation and .rst file format. htmldoc looks ok.

v7:
   Major changes are related to FS and arch code separation.
   Changed a few interface names based on feedback.
   Here is the summary; each patch contains the changes specific to that patch.

   Removed WARN_ON for num_mbm_cntrs. Decided to dynamically allocate the bitmap.
   WARN_ON is not required anymore.

   Renamed the function resctrl_arch_get_abmc_enabled() to resctrl_arch_mbm_cntr_assign_enabled().

   Merged resctrl_arch_mbm_cntr_assign_enable() and resctrl_arch_mbm_cntr_assign_disable()
   and renamed to resctrl_arch_mbm_cntr_assign_set(). Passed the struct rdt_resource
   to these functions.

   Removed resctrl_arch_reset_rmid_all() from arch code. This will be done by the FS caller.

   Updated the descriptions/commit log in resctrl.rst to generic text. Removed ABMC references.
   Renamed mbm_mode to mbm_assign_mode.
   Renamed mbm_control to mbm_assign_control.
   Introduced mutex lock in rdtgroup_mbm_mode_show().

   The 'legacy' mode is called 'default' mode. 

   Removed the static allocation and now allocating bitmap mbm_cntr_free_map dynamically.

   Merged rdtgroup_assign_cntr(), rdtgroup_alloc_cntr() into one.
   Merged rdtgroup_unassign_cntr(), rdtgroup_free_cntr() into one.

  Added struct rdt_resource to the interface functions resctrl_arch_assign_cntr ()
  and resctrl_arch_unassign_cntr().
  Rename rdtgroup_abmc_cfg() to resctrl_abmc_config_one_amd().

  Added a new patch to fix counter assignment on event config changes.

  Removed the references of ABMC from user interfaces.

  Simplified the parsing (strsep(&token, "//")) in rdtgroup_mbm_assign_control_write().
  Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.

  Thomas Gleixner asked us to update https://gitlab.com/x86-cpuid.org/x86-cpuid-db.
  It needs internal approval. We are working on it.

v6:
  We still need to finalize a few interface details on mbm_assign_mode and mbm_assign_control
  in case of ABMC and Soft-ABMC. We can continue the discussion with this series.

  Added support for domain-id '*' to update all the domains at once.
  Fixed assign interface to allocate the counter if counter is
  not assigned.   
  Fixed unassign interface to free the counter if the counter is not
  assigned in any of the domains.

  Renamed abmc_capable to mbm_cntr_assignable.

  Renamed abmc_enabled to mbm_cntr_assign_enabled.
  Used msr_set_bit and msr_clear_bit for msr updates.
  Renamed resctrl_arch_abmc_enable() to resctrl_arch_mbm_cntr_assign_enable().
  Renamed resctrl_arch_abmc_disable() to resctrl_arch_mbm_cntr_assign_disable().

  Changed the display name from num_cntrs to num_mbm_cntrs.

  Removed the variable mbm_cntrs_free_map_len. This is not required.
  Removed the call mbm_cntrs_init() in arch code. This needs to be done at higher level.
  Used DECLARE_BITMAP to initialize mbm_cntrs_free_map.
  Removed unused config value definitions.

  Introduced mbm_cntr_map to track counters at domain level. With this
  we don't need to send an MSR read to read the counter configuration.

  Separated all the counter id management to upper level in FS code.

  Added checks to detect "Unassigned" before reading the RMID.

  More details in each patch.

v5:
  Rebase changes (because of SNC support)

  Interface changes.
   /sys/fs/resctrl/mbm_assign to /sys/fs/resctrl/mbm_assign_mode.
   /sys/fs/resctrl/mbm_assign_control to /sys/fs/resctrl/mbm_assign_control.

  Added few arch specific routines.
  resctrl_arch_get_abmc_enabled.
  resctrl_arch_abmc_enable.
  resctrl_arch_abmc_disable.

  Few renames
   num_cntrs_free_map -> mbm_cntrs_free_map
   num_cntrs_init -> mbm_cntrs_init
   arch_domain_mbm_evt_config -> resctrl_arch_mbm_evt_config

  Introduced resctrl_arch_event_config_get and
    resctrl_arch_event_config_set() to update event configuration.

  Removed mon_state field mongroup. Added MON_CNTR_UNSET to initialize counters.

  Renamed ctr_id to cntr_id for the hardware counter.

  Report "Unassigned" in case the user attempts to read the events without assigning the counter.

  ABMC is enabled during the boot up. Can be enabled or disabled later.

  Fixed opcode and flags combination.
    "=_" is valid.
    "-_" and "+_" are not valid.

  Addressed all the comments as far as I know. If I missed something, it is not intentional.

v4: 
  Main change is domain specific event assignment.
  Kept the ABMC feature as a default.
  Dynamic switching between ABMC and mbm_legacy is still allowed.
  We are still not clear about mount option.
  Moved the monitoring related data in resctrl_mon structure from rdt_resource.
  Fixed the display of legacy and ABMC mode.
  Used bitmap APIs when possible.
  Removed event configuration read from MSRs. We can use the
  internal saved data (patch 12).
  Added more comments about L3_QOS_ABMC_CFG MSR.
  Added IPIs to read the assignment status for each domain (patch 18 and 19)
  More details in each patch.

v3:
   This series adds the support for global assignment mode discussed in
   the thread. https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
   Removed the individual assignment mode and included the global assignment interface.
   Added following interface files.
   a. /sys/fs/resctrl/info/L3_MON/mbm_assign
      Used for displaying the current assignment mode and switch between
      ABMC and legacy mode.
   b. /sys/fs/resctrl/info/L3_MON/mbm_assign_control
      Used for listing the groups' assignment mode and modifying the assignment states.
   c. Most of the changes are related to the new interface.
   d. Addressed the comments from Reinette, James and Peter.
   e. Hope I have addressed most of the major feedback discussed. If I missed
      something then it is not intentional. Please feel free to comment.
   f. Sending this as an RFC as per Reinette's comment. So, this is still open
      for discussion.

v2:
   a. Major change is the way ABMC is enabled. Earlier, user needed to remount
      with -o abmc to enable ABMC feature. Removed that option now.
      Now users can enable ABMC by "$echo 1 to /sys/fs/resctrl/info/L3_MON/mbm_assign_enable".

   b. Added new word 21 to x86/cpufeatures.h.

   c. Display unsupported if user attempts to read the events when ABMC is enabled
      and event is not assigned.

   d. Display monitor_state as "Unsupported" when ABMC is disabled.

   e. Text updates and rebase to latest tip tree (as of Jan 18).

   f. This series is still work in progress. I am yet to hear from ARM developers. 

--------------------------------------------------------------------------------------

Previous revisions:

v10: https://lore.kernel.org/lkml/cover.1734034524.git.babu.moger@amd.com/
v9: https://lore.kernel.org/lkml/cover.1730244116.git.babu.moger@amd.com/
v8: https://lore.kernel.org/lkml/cover.1728495588.git.babu.moger@amd.com/
v7: https://lore.kernel.org/lkml/cover.1725488488.git.babu.moger@amd.com/
v6: https://lore.kernel.org/lkml/cover.1722981659.git.babu.moger@amd.com/
v5: https://lore.kernel.org/lkml/cover.1720043311.git.babu.moger@amd.com/
v4: https://lore.kernel.org/lkml/cover.1716552602.git.babu.moger@amd.com/
v3: https://lore.kernel.org/lkml/cover.1711674410.git.babu.moger@amd.com/  
v2: https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
v1: https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/

Babu Moger (23):
  x86/resctrl: Add __init attribute to functions called from
    resctrl_late_init()
  x86/cpufeatures: Add support for Assignable Bandwidth Monitoring
    Counters (ABMC)
  x86/resctrl: Add ABMC feature in the command line options
  x86/resctrl: Consolidate monitoring related data from rdt_resource
  x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
  x86/resctrl: Add support to enable/disable AMD ABMC feature
  x86/resctrl: Introduce the interface to display monitor mode
  x86/resctrl: Introduce interface to display number of monitoring
    counters
  x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct
    rdt_hw_mon_domain
  x86/resctrl: Remove MSR reading of event configuration value
  x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at
    domain
  x86/resctrl: Introduce interface to display number of free counters
  x86/resctrl: Add data structures and definitions for ABMC assignment
  x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter
    with ABMC
  x86/resctrl: Add the functionality to assigm MBM events
  x86/resctrl: Add the functionality to unassigm MBM events
  x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is
    enabled
  x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign
    mode
  x86/resctrl: Introduce the interface to switch between monitor modes
  x86/resctrl: Configure mbm_cntr_assign mode if supported
  x86/resctrl: Update assignments on event configuration changes
  x86/resctrl: Introduce interface to list assignment states of all the
    groups
  x86/resctrl: Introduce interface to modify assignment states of the
    groups

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/arch/x86/resctrl.rst            | 242 +++++++
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/msr-index.h              |   2 +
 arch/x86/kernel/cpu/cpuid-deps.c              |   3 +
 arch/x86/kernel/cpu/resctrl/core.c            |  23 +-
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c     |  13 +
 arch/x86/kernel/cpu/resctrl/internal.h        |  91 ++-
 arch/x86/kernel/cpu/resctrl/monitor.c         | 402 +++++++++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        | 620 ++++++++++++++++--
 arch/x86/kernel/cpu/scattered.c               |   1 +
 include/linux/resctrl.h                       |  34 +-
 12 files changed, 1350 insertions(+), 84 deletions(-)

-- 
2.34.1


* [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init()
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-05 22:22   ` Reinette Chatre
  2025-02-19 13:28   ` Dave Martin
  2025-01-22 20:20 ` [PATCH v11 02/23] x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (24 subsequent siblings)
  25 siblings, 2 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

resctrl_late_init() has the __init attribute, but some of the functions
called from it do not have the __init attribute.

Add the __init attribute to all the functions in the call sequences to
maintain consistency throughout.

Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
Fixes: def10853930a ("x86/intel_rdt: Add two new resources for L2 Code and Data Prioritization (CDP)")
Fixes: bd334c86b5d7 ("x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: No changes.

v10: Text changes.
     Added __init attribute to cache_alloc_hsw_probe()
     Followed function prototype rules (preferred order is storage
     class before return type).

v9: Moved the patch to the beginning of the series.
    Fixed all the call sequences. Added additional Fixes tags.

v8: New patch.
---
 arch/x86/kernel/cpu/resctrl/core.c     | 10 +++++-----
 arch/x86/kernel/cpu/resctrl/internal.h |  2 +-
 arch/x86/kernel/cpu/resctrl/monitor.c  |  4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3d1735ed8d1f..f0a331287979 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -145,7 +145,7 @@ u32 resctrl_arch_system_num_rmid_idx(void)
  * is always 20 on hsw server parts. The minimum cache bitmask length
  * allowed for HSW server is always 2 bits. Hardcode all of them.
  */
-static inline void cache_alloc_hsw_probe(void)
+static inline __init void cache_alloc_hsw_probe(void)
 {
 	struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_L3];
 	struct rdt_resource *r  = &hw_res->r_resctrl;
@@ -277,7 +277,7 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
 	return true;
 }
 
-static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
+static __init void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
 	union cpuid_0x10_1_eax eax;
@@ -296,7 +296,7 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
 	r->alloc_capable = true;
 }
 
-static void rdt_get_cdp_config(int level)
+static __init void rdt_get_cdp_config(int level)
 {
 	/*
 	 * By default, CDP is disabled. CDP can be enabled by mount parameter
@@ -306,12 +306,12 @@ static void rdt_get_cdp_config(int level)
 	rdt_resources_all[level].r_resctrl.cdp_capable = true;
 }
 
-static void rdt_get_cdp_l3_config(void)
+static __init void rdt_get_cdp_l3_config(void)
 {
 	rdt_get_cdp_config(RDT_RESOURCE_L3);
 }
 
-static void rdt_get_cdp_l2_config(void)
+static __init void rdt_get_cdp_l2_config(void)
 {
 	rdt_get_cdp_config(RDT_RESOURCE_L2);
 }
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 20c898f09b7e..05358e78147b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -634,7 +634,7 @@ int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
 void free_rmid(u32 closid, u32 rmid);
-int rdt_get_mon_l3_config(struct rdt_resource *r);
+int __init rdt_get_mon_l3_config(struct rdt_resource *r);
 void __exit rdt_put_mon_l3_config(void);
 bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 94a1d9780461..1c7b574bf0cd 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -979,7 +979,7 @@ void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_
 		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
 
-static int dom_data_init(struct rdt_resource *r)
+static __init int dom_data_init(struct rdt_resource *r)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	u32 num_closid = resctrl_arch_get_num_closid(r);
@@ -1077,7 +1077,7 @@ static struct mon_evt mbm_local_event = {
  * because as per the SDM the total and local memory bandwidth
  * are enumerated as part of L3 monitoring.
  */
-static void l3_mon_evt_init(struct rdt_resource *r)
+static __init void l3_mon_evt_init(struct rdt_resource *r)
 {
 	INIT_LIST_HEAD(&r->evt_list);
 
-- 
2.34.1



* [PATCH v11 02/23] x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC)
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
  2025-01-22 20:20 ` [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init() Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-01-22 20:20 ` [PATCH v11 03/23] x86/resctrl: Add ABMC feature in the command line options Babu Moger
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

Users can create as many monitor groups as the hardware supports RMIDs.
However, the bandwidth monitoring feature on AMD systems only guarantees
that RMIDs currently assigned to a processor will be tracked by hardware.
The counters of any other RMIDs that are no longer being tracked will be
reset to zero. The MBM event counters return "Unavailable" for RMIDs that
are not tracked by hardware. As a result, only a limited number of groups
can provide guaranteed monitoring numbers. With ever-changing
configurations, there is no way to know definitively which of these groups
are being tracked at a given point in time. Users do not have the option to
monitor a group or set of groups for a certain period of time without
worrying about the RMID being reset in between.

The ABMC feature gives the user the option to assign a hardware counter to
an (RMID, event) pair and monitor the bandwidth for as long as it is
assigned. The assigned RMID will be tracked by the hardware until the user
unassigns it manually. There is no need to worry about counters being reset
during this period. Additionally, the user can specify a bitmask
identifying the specific bandwidth types from the given source to track
with the counter.

When ABMC is not enabled, monitoring works in the current mode, without
the assignment option.

The Linux resctrl subsystem provides an interface to count a maximum of two
memory bandwidth events per group, from a combination of the available
total and local events. Keeping the current interface, users can enable a
maximum of 2 ABMC counters per group. Users also have the option to enable
only one counter for a group. If the system runs out of assignable ABMC
counters, the kernel will display an error. Users need to disable an
already enabled counter to make space for new assignments.

The feature can be detected via CPUID_Fn80000020_EBX_x00 bit 5.
Bits Description
5    ABMC (Assignable Bandwidth Monitoring Counters)
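
As a rough userspace illustration of the detection described above, the
check reduces to testing bit 5 of the EBX value returned by CPUID leaf
0x80000020, subleaf 0. The helper name ebx_has_abmc() and the DEMO_ macro
are made up for this sketch and are not part of the patch; the kernel
itself does this through the cpuid_bits[] table in scattered.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position of ABMC in CPUID Fn8000_0020 EBX, subleaf 0 (from the text above). */
#define DEMO_ABMC_EBX_BIT	5u

/* Hypothetical helper: true when the given EBX value advertises ABMC. */
static inline bool ebx_has_abmc(uint32_t ebx)
{
	return (ebx >> DEMO_ABMC_EBX_BIT) & 1u;
}
```

On an x86 build with GCC or Clang, the EBX value could be obtained with
__get_cpuid_count(0x80000020, 0, &eax, &ebx, &ecx, &edx) from <cpuid.h>;
that call is left out here so the sketch stays portable.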

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---

Note: Checkpatch checks/warnings are ignored to maintain coding style.

v11: No changes.

v10: No changes.

v9: Took care of a couple of minor merge conflicts. No other changes.

v8: No changes.

v7: Removed "" from feature flags. Not required anymore.
    https://lore.kernel.org/lkml/20240817145058.GCZsC40neU4wkPXeVR@fat_crate.local/

v6: Added Reinette's Reviewed-by. Moved the Checkpatch note below ---.

v5: Minor rebase change and subject line update.

v4: Changes because of rebase. Feature word 21 has a few more additions now.
    Changed the text to "tracked by hardware" instead of active.

v3: Change because of rebase. Actual patch did not change.

v2: Added dependency on X86_FEATURE_BMEC.
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kernel/cpu/cpuid-deps.c   | 3 +++
 arch/x86/kernel/cpu/scattered.c    | 1 +
 3 files changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 508c0dad116b..7950a420170f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -483,6 +483,7 @@
 #define X86_FEATURE_AMD_FAST_CPPC	(21*32 + 5) /* Fast CPPC */
 #define X86_FEATURE_AMD_HETEROGENEOUS_CORES (21*32 + 6) /* Heterogeneous Core Topology */
 #define X86_FEATURE_AMD_WORKLOAD_CLASS	(21*32 + 7) /* Workload Classification */
+#define X86_FEATURE_ABMC		(21*32 + 8) /* Assignable Bandwidth Monitoring Counters */

 /*
  * BUG word(s)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 8bd84114c2d9..7e4d63b381d6 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -70,6 +70,9 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_CQM_MBM_LOCAL,		X86_FEATURE_CQM_LLC   },
 	{ X86_FEATURE_BMEC,			X86_FEATURE_CQM_MBM_TOTAL   },
 	{ X86_FEATURE_BMEC,			X86_FEATURE_CQM_MBM_LOCAL   },
+	{ X86_FEATURE_ABMC,			X86_FEATURE_CQM_MBM_TOTAL   },
+	{ X86_FEATURE_ABMC,			X86_FEATURE_CQM_MBM_LOCAL   },
+	{ X86_FEATURE_ABMC,			X86_FEATURE_BMEC      },
 	{ X86_FEATURE_AVX512_BF16,		X86_FEATURE_AVX512VL  },
 	{ X86_FEATURE_AVX512_FP16,		X86_FEATURE_AVX512BW  },
 	{ X86_FEATURE_ENQCMD,			X86_FEATURE_XSAVES    },
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 16f3ca30626a..3b72b72270f1 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -49,6 +49,7 @@ static const struct cpuid_bit cpuid_bits[] = {
 	{ X86_FEATURE_MBA,			CPUID_EBX,  6, 0x80000008, 0 },
 	{ X86_FEATURE_SMBA,			CPUID_EBX,  2, 0x80000020, 0 },
 	{ X86_FEATURE_BMEC,			CPUID_EBX,  3, 0x80000020, 0 },
+	{ X86_FEATURE_ABMC,			CPUID_EBX,  5, 0x80000020, 0 },
 	{ X86_FEATURE_AMD_WORKLOAD_CLASS,	CPUID_EAX, 22, 0x80000021, 0 },
 	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
-- 
2.34.1


* [PATCH v11 03/23] x86/resctrl: Add ABMC feature in the command line options
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
  2025-01-22 20:20 ` [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init() Babu Moger
  2025-01-22 20:20 ` [PATCH v11 02/23] x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-01-22 20:20 ` [PATCH v11 04/23] x86/resctrl: Consolidate monitoring related data from rdt_resource Babu Moger
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

Add the command line option to enable or disable exposing the ABMC
(Assignable Bandwidth Monitoring Counters) hardware feature to resctrl.
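
For illustration only, the rdt= list syntax documented in
kernel-parameters.txt (for example rdt=cmt,!mba) could be modelled in
userspace roughly as below: a comma-separated list of names, where a
leading '!' turns a feature off. The names demo_rdt_opt and
demo_parse_rdt() are hypothetical, and this is a simplified sketch, not
the kernel's parser in core.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the kernel's rdt_options[] table. */
struct demo_rdt_opt {
	const char	*name;
	bool		force_on;
	bool		force_off;
};

/*
 * Walk a comma-separated option string. A token prefixed with '!'
 * marks the matching feature as forced off, any other matching token
 * marks it as forced on. Unknown tokens are silently skipped.
 */
static inline void demo_parse_rdt(const char *arg,
				  struct demo_rdt_opt *opts, size_t n)
{
	while (*arg) {
		bool off = (*arg == '!');
		size_t len, i;

		if (off)
			arg++;
		len = strcspn(arg, ",");

		for (i = 0; i < n; i++) {
			if (strlen(opts[i].name) == len &&
			    !strncmp(arg, opts[i].name, len)) {
				opts[i].force_off = off;
				opts[i].force_on = !off;
			}
		}

		arg += len;
		if (*arg == ',')
			arg++;
	}
}
```

With this model, the string "cmt,!mba,abmc" would mark cmt and abmc as
forced on and mba as forced off.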

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v11: No changes.

v10: No changes.

v9: No code changes. Added Reviewed-by.

v8: Commit message update.

v7: No changes

v6: No changes

v5: No changes

v4: No changes

v3: No changes

v2: No changes
---
 Documentation/admin-guide/kernel-parameters.txt | 2 +-
 Documentation/arch/x86/resctrl.rst              | 1 +
 arch/x86/kernel/cpu/resctrl/core.c              | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05f5935eeac8..154a93c080f5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5854,7 +5854,7 @@
 	rdt=		[HW,X86,RDT]
 			Turn on/off individual RDT features. List is:
 			cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
-			mba, smba, bmec.
+			mba, smba, bmec, abmc.
 			E.g. to turn on cmt and turn off mba use:
 				rdt=cmt,!mba
 
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 6768fc1fad16..fb90f08e564e 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring)		"cqm_mbm_total", "cqm_mbm_local"
 MBA (Memory Bandwidth Allocation)		"mba"
 SMBA (Slow Memory Bandwidth Allocation)         ""
 BMEC (Bandwidth Monitoring Event Configuration) ""
+ABMC (Assignable Bandwidth Monitoring Counters) ""
 ===============================================	================================
 
 Historically, new features were made visible by default in /proc/cpuinfo. This
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f0a331287979..97511cc132d6 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -811,6 +811,7 @@ enum {
 	RDT_FLAG_MBA,
 	RDT_FLAG_SMBA,
 	RDT_FLAG_BMEC,
+	RDT_FLAG_ABMC,
 };
 
 #define RDT_OPT(idx, n, f)	\
@@ -836,6 +837,7 @@ static struct rdt_options rdt_options[]  __initdata = {
 	RDT_OPT(RDT_FLAG_MBA,	    "mba",	X86_FEATURE_MBA),
 	RDT_OPT(RDT_FLAG_SMBA,	    "smba",	X86_FEATURE_SMBA),
 	RDT_OPT(RDT_FLAG_BMEC,	    "bmec",	X86_FEATURE_BMEC),
+	RDT_OPT(RDT_FLAG_ABMC,	    "abmc",	X86_FEATURE_ABMC),
 };
 #define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
 
-- 
2.34.1



* [PATCH v11 04/23] x86/resctrl: Consolidate monitoring related data from rdt_resource
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (2 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 03/23] x86/resctrl: Add ABMC feature in the command line options Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-01-22 20:20 ` [PATCH v11 05/23] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details Babu Moger
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

The cache allocation and memory bandwidth allocation feature properties
are consolidated into struct resctrl_cache and struct resctrl_membw
respectively.

In preparation for more monitoring properties that would further clutter
the existing resource struct, reorganize the monitoring-specific
properties into a separate structure as well.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v11: No changes.

v10: No changes.

v9: No changes.

v8: Added Reviewed-by from Reinette. No other changes.

v7: Added kernel doc for data structure. Minor text update.

v6: Update commit message and update kernel doc for rdt_resource.

v5: Commit message update.
    Also changes related to data structure updates due to SNC support.

v4: New patch.
---
 arch/x86/kernel/cpu/resctrl/core.c     |  4 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c  | 18 +++++++++---------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  8 ++++----
 include/linux/resctrl.h                | 16 ++++++++++++----
 4 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 97511cc132d6..cb7feb7c990f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -124,7 +124,7 @@ u32 resctrl_arch_system_num_rmid_idx(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 
 	/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
-	return r->num_rmid;
+	return r->mon.num_rmid;
 }
 
 /*
@@ -627,7 +627,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 
 	arch_mon_domain_online(r, d);
 
-	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
 		mon_domain_free(hw_dom);
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 1c7b574bf0cd..0c481501b8c5 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -222,7 +222,7 @@ static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
 	if (snc_nodes_per_l3_cache == 1)
 		return lrmid;
 
-	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->mon.num_rmid;
 }
 
 static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
@@ -297,11 +297,11 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
-		       sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
+		       sizeof(*hw_dom->arch_mbm_total) * r->mon.num_rmid);
 
 	if (is_mbm_local_enabled())
 		memset(hw_dom->arch_mbm_local, 0,
-		       sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
+		       sizeof(*hw_dom->arch_mbm_local) * r->mon.num_rmid);
 }
 
 static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
@@ -1079,14 +1079,14 @@ static struct mon_evt mbm_local_event = {
  */
 static __init void l3_mon_evt_init(struct rdt_resource *r)
 {
-	INIT_LIST_HEAD(&r->evt_list);
+	INIT_LIST_HEAD(&r->mon.evt_list);
 
 	if (is_llc_occupancy_enabled())
-		list_add_tail(&llc_occupancy_event.list, &r->evt_list);
+		list_add_tail(&llc_occupancy_event.list, &r->mon.evt_list);
 	if (is_mbm_total_enabled())
-		list_add_tail(&mbm_total_event.list, &r->evt_list);
+		list_add_tail(&mbm_total_event.list, &r->mon.evt_list);
 	if (is_mbm_local_enabled())
-		list_add_tail(&mbm_local_event.list, &r->evt_list);
+		list_add_tail(&mbm_local_event.list, &r->mon.evt_list);
 }
 
 /*
@@ -1183,7 +1183,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
 	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
-	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
+	r->mon.num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
@@ -1198,7 +1198,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	 *
 	 * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
 	 */
-	threshold = resctrl_rmid_realloc_limit / r->num_rmid;
+	threshold = resctrl_rmid_realloc_limit / r->mon.num_rmid;
 
 	/*
 	 * Because num_rmid may not be a power of two, round the value
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6419e04d8a7b..f91fe605766f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1106,7 +1106,7 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
 {
 	struct rdt_resource *r = of->kn->parent->priv;
 
-	seq_printf(seq, "%d\n", r->num_rmid);
+	seq_printf(seq, "%d\n", r->mon.num_rmid);
 
 	return 0;
 }
@@ -1117,7 +1117,7 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
 	struct rdt_resource *r = of->kn->parent->priv;
 	struct mon_evt *mevt;
 
-	list_for_each_entry(mevt, &r->evt_list, list) {
+	list_for_each_entry(mevt, &r->mon.evt_list, list) {
 		seq_printf(seq, "%s\n", mevt->name);
 		if (mevt->configurable)
 			seq_printf(seq, "%s_config\n", mevt->name);
@@ -3068,13 +3068,13 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
 	struct mon_evt *mevt;
 	int ret;
 
-	if (WARN_ON(list_empty(&r->evt_list)))
+	if (WARN_ON(list_empty(&r->mon.evt_list)))
 		return -EPERM;
 
 	priv.u.rid = r->rid;
 	priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
 	priv.u.sum = do_sum;
-	list_for_each_entry(mevt, &r->evt_list, list) {
+	list_for_each_entry(mevt, &r->mon.evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
 		if (ret)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d94abba1c716..3c2307c7c106 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -182,16 +182,26 @@ enum resctrl_scope {
 	RESCTRL_L3_NODE,
 };
 
+/**
+ * struct resctrl_mon - Monitoring related data of a resctrl resource
+ * @num_rmid:		Number of RMIDs available
+ * @evt_list:		List of monitoring events
+ */
+struct resctrl_mon {
+	int			num_rmid;
+	struct list_head	evt_list;
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
- * @num_rmid:		Number of RMIDs available
  * @ctrl_scope:		Scope of this resource for control functions
  * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
+ * @mon:		Monitoring related data.
  * @ctrl_domains:	RCU list of all control domains for this resource
  * @mon_domains:	RCU list of all monitor domains for this resource
  * @name:		Name to use in "schemata" file.
@@ -199,7 +209,6 @@ enum resctrl_scope {
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
  * @format_str:		Per resource format string to show domain value
  * @parse_ctrlval:	Per resource function pointer to parse control values
- * @evt_list:		List of monitoring events
  * @fflags:		flags to choose base and info files
  * @cdp_capable:	Is the CDP feature available on this resource
  */
@@ -207,11 +216,11 @@ struct rdt_resource {
 	int			rid;
 	bool			alloc_capable;
 	bool			mon_capable;
-	int			num_rmid;
 	enum resctrl_scope	ctrl_scope;
 	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
+	struct resctrl_mon	mon;
 	struct list_head	ctrl_domains;
 	struct list_head	mon_domains;
 	char			*name;
@@ -221,7 +230,6 @@ struct rdt_resource {
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
 						 struct rdt_ctrl_domain *d);
-	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
 };
-- 
2.34.1



* [PATCH v11 05/23] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (3 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 04/23] x86/resctrl: Consolidate monitoring related data from rdt_resource Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-01-22 20:20 ` [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature Babu Moger
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
Bits Description
15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
     Monitoring Counter ID + 1

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Detect the feature and number of assignable monitoring counters supported.
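
As a small worked sketch of the counter-count computation, the code in
this patch masks bits 15:0 of EBX and adds one. The userspace version
below mirrors that arithmetic. DEMO_MASK_15_0 stands in for the kernel's
GENMASK(15, 0), and demo_num_mbm_cntrs() is a name invented for this
example.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(15, 0). */
#define DEMO_MASK_15_0	0xffffu

/*
 * Hypothetical helper: derive the number of assignable counters from the
 * EBX value of CPUID Fn8000_0020 subleaf 5, the same way the patch does:
 * take the low 16 bits and add one.
 */
static inline uint32_t demo_num_mbm_cntrs(uint32_t ebx)
{
	return (ebx & DEMO_MASK_15_0) + 1u;
}
```

For example, an EBX value whose low 16 bits are 0x1f yields 32 counters,
regardless of what the upper 16 bits contain.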

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v11: No changes.

v10: No changes.

v9: Added Reviewed-by tag. No code changes

v8: Used GENMASK for the mask.

v7: Removed WARN_ON for num_mbm_cntrs. Decided to dynamically allocate the
    bitmap. WARN_ON is not required anymore.
    Removed redundant comments.

v6: Commit message update.
    Renamed abmc_capable to mbm_cntr_assignable.

v5: Name change num_cntrs to num_mbm_cntrs.
    Moved abmc_capable to resctrl_mon.

v4: Removed resctrl_arch_has_abmc(). Added all the code inline. We don't
    need to separate this as arch code.

v3: Removed changes related to mon_features.
    Moved rdt_cpu_has to core.c and added new function resctrl_arch_has_abmc.
    Also moved the fields mbm_assign_capable and mbm_assign_cntrs to
    rdt_resource. (James)

v2: Changed the field name to mbm_assign_capable from abmc_capable.
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
 include/linux/resctrl.h               | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 0c481501b8c5..c3d7d4c3009a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1228,6 +1228,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			resctrl_file_fflags_init("mbm_local_bytes_config",
 						 RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
 		}
+
+		if (rdt_cpu_has(X86_FEATURE_ABMC)) {
+			r->mon.mbm_cntr_assignable = true;
+			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
+			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
+		}
 	}
 
 	l3_mon_evt_init(r);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 3c2307c7c106..511cfce8fc21 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -185,10 +185,14 @@ enum resctrl_scope {
 /**
  * struct resctrl_mon - Monitoring related data of a resctrl resource
  * @num_rmid:		Number of RMIDs available
+ * @num_mbm_cntrs:	Number of assignable monitoring counters
+ * @mbm_cntr_assignable:Is system capable of supporting monitor assignment?
  * @evt_list:		List of monitoring events
  */
 struct resctrl_mon {
 	int			num_rmid;
+	int			num_mbm_cntrs;
+	bool			mbm_cntr_assignable;
 	struct list_head	evt_list;
 };
 
-- 
2.34.1



* [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (4 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 05/23] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-05 22:49   ` Reinette Chatre
  2025-02-21 18:05   ` James Morse
  2025-01-22 20:20 ` [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode Babu Moger
                   ` (19 subsequent siblings)
  25 siblings, 2 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

Add the functionality to enable/disable the AMD ABMC feature.

The AMD ABMC feature is enabled by setting the enable bit (bit 0) in the
L3_QOS_EXT_CFG MSR. When the state of ABMC is changed, the MSR needs to be
updated on all the logical processors in the QOS Domain.

Hardware counters will reset when ABMC state is changed.
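
The bit manipulation involved can be sketched in userspace as a pure
function on a 64-bit register value. This only models the effect of
setting or clearing bit 0; it does not touch a real MSR, and
demo_abmc_update() and DEMO_ABMC_ENABLE_BIT are names assumed for this
example (the patch itself uses msr_set_bit()/msr_clear_bit() with
ABMC_ENABLE_BIT on each CPU of the domain).

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of L3_QOS_EXT_CFG enables ABMC, per the description above. */
#define DEMO_ABMC_ENABLE_BIT	0

/*
 * Hypothetical model: return the register value with the ABMC enable bit
 * set or cleared, leaving all other bits untouched.
 */
static inline uint64_t demo_abmc_update(uint64_t msr_val, bool enable)
{
	uint64_t mask = UINT64_C(1) << DEMO_ABMC_ENABLE_BIT;

	return enable ? (msr_val | mask) : (msr_val & ~mask);
}
```

The operation is idempotent: enabling an already enabled value, or
disabling an already disabled one, returns the value unchanged.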

The ABMC feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Moved the monitoring related calls to monitor.c file.
     Moved the changes from include/linux/resctrl.h to
     arch/x86/kernel/cpu/resctrl/internal.h.
     Removed the Reviewed-by tag as patch changed.
     Actual code did not change.

v10: No changes.

v9: Re-ordered the MSR and added Reviewed-by tag.

v8: Commit message update and moved around the comments about L3_QOS_EXT_CFG
    to _resctrl_abmc_enable.

v7: Renamed the function
    resctrl_arch_get_abmc_enabled() to resctrl_arch_mbm_cntr_assign_enabled().

    Merged resctrl_arch_mbm_cntr_assign_enable() and
    resctrl_arch_mbm_cntr_assign_disable() into
    resctrl_arch_mbm_cntr_assign_set().

    Moved the function definition to linux/resctrl.h.

    Passed the struct rdt_resource to these functions.
    Removed resctrl_arch_reset_rmid_all() from arch code. This will be done
    from the caller.

v6: Renamed abmc_enabled to mbm_cntr_assign_enabled.
    Used msr_set_bit and msr_clear_bit for msr updates.
    Renamed resctrl_arch_abmc_enable() to resctrl_arch_mbm_cntr_assign_enable().
    Renamed resctrl_arch_abmc_disable() to resctrl_arch_mbm_cntr_assign_disable().
    Made _resctrl_abmc_enable to return void.

v5: Renamed resctrl_abmc_enable to resctrl_arch_abmc_enable.
    Renamed resctrl_abmc_disable to resctrl_arch_abmc_disable.
    Introduced resctrl_arch_get_abmc_enabled to get abmc state from
    non-arch code.
    Renamed resctrl_abmc_set_all to _resctrl_abmc_enable().
    Modified commit log to make it clear about AMD ABMC feature.

v3: No changes.

v2: A few text changes in commit message.
---
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kernel/cpu/resctrl/core.c     |  5 ++++
 arch/x86/kernel/cpu/resctrl/internal.h |  7 +++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 36 ++++++++++++++++++++++++++
 4 files changed, 49 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9a71880eec07..fea1f3afe197 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1197,6 +1197,7 @@
 /* - AMD: */
 #define MSR_IA32_MBA_BW_BASE		0xc0000200
 #define MSR_IA32_SMBA_BW_BASE		0xc0000280
+#define MSR_IA32_L3_QOS_EXT_CFG		0xc00003ff
 #define MSR_IA32_EVT_CFG_BASE		0xc0000400
 
 /* AMD-V MSRs */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cb7feb7c990f..3f847728aa7a 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -405,6 +405,11 @@ void rdt_ctrl_update(void *arg)
 	hw_res->msr_update(m);
 }
 
+bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r)
+{
+	return resctrl_to_arch_res(r)->mbm_cntr_assign_enabled;
+}
+
 /*
  * rdt_find_domain - Search for a domain id in a resource domain list.
  *
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 05358e78147b..ca69f2e0909f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -56,6 +56,9 @@
 /* Max event bits supported */
 #define MAX_EVT_CONFIG_BITS		GENMASK(6, 0)
 
+/* Setting bit 0 in L3_QOS_EXT_CFG enables the ABMC feature. */
+#define ABMC_ENABLE_BIT			0
+
 /**
  * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
  *			        aren't marked nohz_full
@@ -479,6 +482,7 @@ struct rdt_parse_data {
  * @mbm_cfg_mask:	Bandwidth sources that can be tracked when Bandwidth
  *			Monitoring Event Configuration (BMEC) is supported.
  * @cdp_enabled:	CDP state of this resource
+ * @mbm_cntr_assign_enabled:	ABMC feature is enabled
  *
  * Members of this structure are either private to the architecture
  * e.g. mbm_width, or accessed via helpers that provide abstraction. e.g.
@@ -493,6 +497,7 @@ struct rdt_hw_resource {
 	unsigned int		mbm_width;
 	unsigned int		mbm_cfg_mask;
 	bool			cdp_enabled;
+	bool			mbm_cntr_assign_enabled;
 };
 
 static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
@@ -658,4 +663,6 @@ void resctrl_file_fflags_init(const char *config, unsigned long fflags);
 void rdt_staged_configs_clear(void);
 bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
+int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
+bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c3d7d4c3009a..a7526306f5e4 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1261,3 +1261,39 @@ void __init intel_rdt_mbm_apply_quirk(void)
 	mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
 	mbm_cf = mbm_cf_table[cf_index].cf;
 }
+
+static void resctrl_abmc_set_one_amd(void *arg)
+{
+	bool *enable = arg;
+
+	if (*enable)
+		msr_set_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
+	else
+		msr_clear_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
+}
+
+/*
+ * Update L3_QOS_EXT_CFG MSR on all the CPUs associated with the monitor
+ * domain.
+ */
+static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
+{
+	struct rdt_mon_domain *d;
+
+	list_for_each_entry(d, &r->mon_domains, hdr.list)
+		on_each_cpu_mask(&d->hdr.cpu_mask,
+				 resctrl_abmc_set_one_amd, &enable, 1);
+}
+
+int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
+{
+	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+	if (r->mon.mbm_cntr_assignable &&
+	    hw_res->mbm_cntr_assign_enabled != enable) {
+		_resctrl_abmc_enable(r, enable);
+		hw_res->mbm_cntr_assign_enabled = enable;
+	}
+
+	return 0;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (5 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-06 18:01   ` Reinette Chatre
  2025-02-21 18:06   ` James Morse
  2025-01-22 20:20 ` [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters Babu Moger
                   ` (18 subsequent siblings)
  25 siblings, 2 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

Introduce the interface file "mbm_assign_mode" to list the supported
monitor modes.

The "mbm_cntr_assign" mode provides the option to assign a counter to
an RMID, event pair and monitor the bandwidth as long as it is assigned.

On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
Bandwidth Monitoring Counters) hardware feature and is enabled by default.

The "default" mode is the existing monitoring mode that works without
explicit counter assignment. It relies instead on dynamic counter
assignment by hardware, which may leave an event without a dedicated
counter, in which case reads of the monitoring data return "Unavailable".

Provide an interface to display the monitor mode on the system.
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
[mbm_cntr_assign]
default
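
For illustration only, and not part of this patch: a minimal user-space
sketch of how a tool might pick the enabled mode out of the output above.
The helper name mbm_enabled_mode() is hypothetical; the sketch assumes only
the format shown, one mode per line with the active one in brackets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the bracketed (enabled) mode from the text of
 * "mbm_assign_mode" into @out. Returns 0 on success, -1 if no bracketed
 * line is found or @out is too small.
 */
static int mbm_enabled_mode(const char *text, char *out, size_t outlen)
{
	assert(text && out);

	while (*text) {
		size_t len = strcspn(text, "\n");	/* length of this line */

		if (len >= 2 && text[0] == '[' && text[len - 1] == ']') {
			if (len - 2 >= outlen)
				return -1;	/* no room for name plus NUL */
			memcpy(out, text + 1, len - 2);
			out[len - 2] = '\0';
			return 0;
		}

		text += len;
		if (*text == '\n')
			text++;
	}

	return -1;
}
```

A caller would read the file into a buffer and pass it in; with the output
shown above the helper would store "mbm_cntr_assign" in @out.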

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Renamed rdtgroup_mbm_assign_mode_show() to resctrl_mbm_assign_mode_show().
     Removed some text in resctrl.rst about AMD-specific information.
     Updated few texts.

v10: Added more text to the user documentation to clarify the default mode.

v9: Updated user documentation based on comments.

v8: Commit message update.

v7: Updated the descriptions/commit log in resctrl.rst to generic text.
    Thanks to James and Reinette.
    Rename mbm_mode to mbm_assign_mode.
    Introduced mutex lock in rdtgroup_mbm_mode_show().

v6: Added documentation for mbm_cntr_assign and legacy mode.
    Moved mbm_mode fflags initialization to static initialization.

v5: Changed interface name to mbm_mode.
    It will be always available even if ABMC feature is not supported.
    Added description in resctrl.rst about ABMC mode.
    Fixed display of abmc and legacy consistently.

v4: Fixed the checks for legacy and abmc mode. Default is ABMC.

v3: New patch to display ABMC capability.
---
 Documentation/arch/x86/resctrl.rst     | 26 +++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 31 ++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index fb90f08e564e..b5defc5bce0e 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -257,6 +257,32 @@ with the following files:
 	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
 	    0=0x30;1=0x30;3=0x15;4=0x15
 
+"mbm_assign_mode":
+	Reports the list of monitoring modes supported. The enclosed brackets
+	indicate which mode is enabled.
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+	  [mbm_cntr_assign]
+	  default
+
+	"mbm_cntr_assign":
+
+	In mbm_cntr_assign mode, a monitoring event can only accumulate data
+	while it is backed by a hardware counter. User space is able to specify
+	which of the events in CTRL_MON or MON groups should have a counter
+	assigned using the "mbm_assign_control" file. The number of counters
+	available is described in the "num_mbm_cntrs" file. Changing the mode
+	may cause all counters on a resource to reset.
+
+	"default":
+
+	In default mode, resctrl assumes there is a hardware counter for each
+	event within every CTRL_MON and MON group. On AMD platforms, it is
+	recommended to use mbm_cntr_assign mode if supported, because reading
+	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
+	there is no counter associated with that event.
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f91fe605766f..3880480a41d2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -854,6 +854,30 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
 	return ret;
 }
 
+static int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
+					struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+
+	mutex_lock(&rdtgroup_mutex);
+
+	if (r->mon.mbm_cntr_assignable) {
+		if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
+			seq_puts(s, "[mbm_cntr_assign]\n");
+			seq_puts(s, "default\n");
+		} else {
+			seq_puts(s, "mbm_cntr_assign\n");
+			seq_puts(s, "[default]\n");
+		}
+	} else {
+		seq_puts(s, "[default]\n");
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1910,6 +1934,13 @@ static struct rftype res_common_files[] = {
 		.seq_show	= mbm_local_bytes_config_show,
 		.write		= mbm_local_bytes_config_write,
 	},
+	{
+		.name		= "mbm_assign_mode",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= resctrl_mbm_assign_mode_show,
+		.fflags		= RFTYPE_MON_INFO,
+	},
 	{
 		.name		= "cpus",
 		.mode		= 0644,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (6 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-05 23:17   ` Reinette Chatre
  2025-01-22 20:20 ` [PATCH v11 09/23] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct rdt_hw_mon_domain Babu Moger
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

The mbm_cntr_assign mode provides an option to the user to assign a
counter to an RMID, event pair and monitor the bandwidth as long as
the counter is assigned. The number of assignments depends on the number
of monitoring counters available.

Provide the interface to display the number of monitoring counters
supported. The resctrl file 'num_mbm_cntrs' is visible to user space
when the system supports mbm_cntr_assign mode.
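
As a rough sketch of the arithmetic involved (user-space C, not code from
this series): the first helper mirrors how the patch derives the count from
CPUID leaf 0x80000020, subleaf 5, where EBX[15:0] holds the count minus one.
The second helper, max_monitored_groups(), is a hypothetical name used here
only to show how the counter count bounds the number of groups that can be
tracked in one domain when each group uses one or two counters.

```c
#include <assert.h>
#include <stdint.h>

/* Same decode as the patch: (ebx & GENMASK(15, 0)) + 1. */
static unsigned int num_mbm_cntrs_from_ebx(uint32_t ebx)
{
	return (ebx & 0xffffu) + 1;
}

/*
 * Hypothetical helper: how many monitoring groups can have all of their
 * tracked events backed by a counter in one domain, given that each
 * group tracks one or two MBM events.
 */
static unsigned int max_monitored_groups(unsigned int num_cntrs,
					 unsigned int events_per_group)
{
	assert(events_per_group == 1 || events_per_group == 2);

	return num_cntrs / events_per_group;
}
```

With 32 counters, for example, this gives 16 groups when both
mbm_total_bytes and mbm_local_bytes are tracked, or 32 groups when only one
event per group is tracked.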

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Renamed rdtgroup_num_mbm_cntrs_show() to resctrl_num_mbm_cntrs_show().
     Few minor text updates.

v10: No changes.

v9: Updated user document based on the comments.
    Will add a new file available_mbm_cntrs later in the series.

v8: Commit message update and documentation update.

v7: Minor commit log text changes.

v6: No changes.

v5: Changed the display name from num_cntrs to num_mbm_cntrs.
    Updated the commit message.
    Moved the patch after mbm_mode is introduced.

v4: Changed the counter name to num_cntrs. And few text changes.

v3: Changed the field name to mbm_assign_cntrs.

v2: Changed the field name to mbm_assignable_counters from abmc_counter.
---
 Documentation/arch/x86/resctrl.rst     | 16 ++++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 ++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index b5defc5bce0e..31ff764deeeb 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -283,6 +283,22 @@ with the following files:
 	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
 	there is no counter associated with that event.
 
+"num_mbm_cntrs":
+	The number of monitoring counters available for assignment when the
+	system supports mbm_cntr_assign mode.
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
+	  32
+
+	The resctrl file system supports tracking up to two memory bandwidth
+	events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
+	Up to two counters can be assigned per monitoring group, one for each
+	memory bandwidth event. More monitoring groups can be tracked by
+	assigning one counter per monitoring group. However, doing so limits
+	memory bandwidth tracking to a single memory bandwidth event per
+	monitoring group.
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index a7526306f5e4..5f87fc1650e5 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1233,6 +1233,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			r->mon.mbm_cntr_assignable = true;
 			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
 			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
+			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
 		}
 	}
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3880480a41d2..9b09189ef2d1 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -878,6 +878,16 @@ static int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
+				      struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+
+	seq_printf(s, "%d\n", r->mon.num_mbm_cntrs);
+
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1941,6 +1951,12 @@ static struct rftype res_common_files[] = {
 		.seq_show	= resctrl_mbm_assign_mode_show,
 		.fflags		= RFTYPE_MON_INFO,
 	},
+	{
+		.name		= "num_mbm_cntrs",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= resctrl_num_mbm_cntrs_show,
+	},
 	{
 		.name		= "cpus",
 		.mode		= 0644,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH v11 09/23] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct rdt_hw_mon_domain
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (7 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-01-22 20:20 ` [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value Babu Moger
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured to track specific
events. The event configuration is domain specific. The ABMC (Assignable
Bandwidth Monitoring Counters) feature needs the event configuration
information to assign a hardware counter to an RMID. Currently, event
configurations are not stored in resctrl but are instead always read from
or written to hardware directly when prompted by user space.

Read the event configuration from the hardware during the domain
initialization. Save the configuration value in struct rdt_hw_mon_domain,
so it can be used for counter assignment.
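
The initialization rule can be modelled in a few lines of user-space C.
This is only a sketch under stated assumptions: evt_cfg_to_cache() is a
hypothetical name, the raw register value is passed in as a plain integer
instead of being read with rdmsrl(), and the two constants simply restate
GENMASK(6, 0) and U32_MAX from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_EVT_CONFIG_BITS	0x7fu		/* GENMASK(6, 0) in the patch */
#define INVALID_CONFIG_VALUE	UINT32_MAX	/* U32_MAX in the patch */

/*
 * Value to cache for one MBM event: the valid configuration bits of the
 * raw register value if the event is configurable, otherwise the
 * "invalid" marker.
 */
static uint32_t evt_cfg_to_cache(bool configurable, uint64_t msrval)
{
	if (!configurable)
		return INVALID_CONFIG_VALUE;

	return (uint32_t)(msrval & MAX_EVT_CONFIG_BITS);
}

static void evt_cfg_example(void)
{
	/* Bits above bit 6 of the raw value are dropped. */
	assert(evt_cfg_to_cache(true, 0xffffff15ull) == 0x15);
	/* A non-configurable event never caches a real value. */
	assert(evt_cfg_to_cache(false, 0x15) == INVALID_CONFIG_VALUE);
}
```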

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v11: Resolved minor conflicts due to code displacement. Actual code didn't
     change.
v10: Conflicts due to code displacement. Actual code didn't change.

v9: Added Reviewed-by tag. No other changes.

v8: Renamed resctrl_mbm_evt_config_init() to arch_mbm_evt_config_init()
    Minor commit message update.

v7: Fixed initializing INVALID_CONFIG_VALUE to mbm_local_cfg in case of error.

v6: Renamed resctrl_arch_mbm_evt_config -> resctrl_mbm_evt_config_init
    Initialized value to INVALID_CONFIG_VALUE if it is not configurable.
    Minor commit message update.

v5: Exported mon_event_config_index_get.
    Renamed arch_domain_mbm_evt_config to resctrl_arch_mbm_evt_config.

v4: Read the configuration information from the hardware to initialize.
    Added few commit messages.
    Fixed the tab spaces.

v3: Minor changes related to rebase in mbm_config_write_domain.

v2: No changes.
---
 arch/x86/kernel/cpu/resctrl/core.c     |  2 ++
 arch/x86/kernel/cpu/resctrl/internal.h |  9 +++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 26 ++++++++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +---
 4 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3f847728aa7a..22399f19810f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -632,6 +632,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 
 	arch_mon_domain_online(r, d);
 
+	arch_mbm_evt_config_init(hw_dom);
+
 	if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
 		mon_domain_free(hw_dom);
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ca69f2e0909f..ab28b9340ee7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -56,6 +56,9 @@
 /* Max event bits supported */
 #define MAX_EVT_CONFIG_BITS		GENMASK(6, 0)
 
+#define INVALID_CONFIG_VALUE		U32_MAX
+#define INVALID_CONFIG_INDEX		UINT_MAX
+
 /* Setting bit 0 in L3_QOS_EXT_CFG enables the ABMC feature. */
 #define ABMC_ENABLE_BIT			0
 
@@ -403,6 +406,8 @@ struct rdt_hw_ctrl_domain {
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
+ * @mbm_total_cfg:	MBM total bandwidth configuration
+ * @mbm_local_cfg:	MBM local bandwidth configuration
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
@@ -410,6 +415,8 @@ struct rdt_hw_mon_domain {
 	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
+	u32				mbm_total_cfg;
+	u32				mbm_local_cfg;
 };
 
 static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
@@ -665,4 +672,6 @@ bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
 int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
 bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
+void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom);
+unsigned int mon_event_config_index_get(u32 evtid);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 5f87fc1650e5..8917c7261680 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1244,6 +1244,32 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	return 0;
 }
 
+void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom)
+{
+	unsigned int index;
+	u64 msrval;
+
+	/*
+	 * Read the configuration registers QOS_EVT_CFG_n, where <n> is
+	 * the BMEC event number (EvtID).
+	 */
+	if (mbm_total_event.configurable) {
+		index = mon_event_config_index_get(QOS_L3_MBM_TOTAL_EVENT_ID);
+		rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
+		hw_dom->mbm_total_cfg = msrval & MAX_EVT_CONFIG_BITS;
+	} else {
+		hw_dom->mbm_total_cfg = INVALID_CONFIG_VALUE;
+	}
+
+	if (mbm_local_event.configurable) {
+		index = mon_event_config_index_get(QOS_L3_MBM_LOCAL_EVENT_ID);
+		rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
+		hw_dom->mbm_local_cfg = msrval & MAX_EVT_CONFIG_BITS;
+	} else {
+		hw_dom->mbm_local_cfg = INVALID_CONFIG_VALUE;
+	}
+}
+
 void __exit rdt_put_mon_l3_config(void)
 {
 	dom_data_exit();
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9b09189ef2d1..ddecaa51cec4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1591,8 +1591,6 @@ struct mon_config_info {
 	u32 mon_config;
 };
 
-#define INVALID_CONFIG_INDEX   UINT_MAX
-
 /**
  * mon_event_config_index_get - get the hardware index for the
  *                              configurable event
@@ -1602,7 +1600,7 @@ struct mon_config_info {
  *         1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID
  *         INVALID_CONFIG_INDEX for invalid evtid
  */
-static inline unsigned int mon_event_config_index_get(u32 evtid)
+unsigned int mon_event_config_index_get(u32 evtid)
 {
 	switch (evtid) {
 	case QOS_L3_MBM_TOTAL_EVENT_ID:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (8 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 09/23] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct rdt_hw_mon_domain Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-05 23:58   ` Reinette Chatre
  2025-02-06  6:24   ` Xin Li
  2025-01-22 20:20 ` [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain Babu Moger
                   ` (15 subsequent siblings)
  25 siblings, 2 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

The event configuration is domain specific and initialized during domain
initialization. The values are stored in struct rdt_hw_mon_domain.

There is no need to read the configuration register every time the user asks
for it. Use the value stored in struct rdt_hw_mon_domain instead.

Introduce resctrl_arch_mon_event_config_get() and
resctrl_arch_mon_event_config_set() to get/set architecture domain specific
mbm_total_cfg/mbm_local_cfg values.
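
The caching idea can be sketched as a small user-space model. Everything
below is illustrative: struct dom_model, cfg_get() and cfg_write() are
hypothetical names, a plain array stands in for the per-domain
configuration registers, and the skip-if-unchanged check that the patch
keeps in mbm_config_write_domain() is folded into cfg_write() for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_CONFIG_VALUE	UINT32_MAX

enum evt { EVT_TOTAL, EVT_LOCAL, EVT_COUNT };

struct dom_model {
	uint32_t hw_reg[EVT_COUNT];	/* stand-in for the hardware registers */
	uint32_t cached[EVT_COUNT];	/* stand-in for mbm_total_cfg/mbm_local_cfg */
	unsigned int hw_writes;		/* number of simulated register writes */
};

/* Reads never touch the simulated hardware, only the cached copy. */
static uint32_t cfg_get(const struct dom_model *d, enum evt e)
{
	assert(e < EVT_COUNT);

	return d->cached[e];
}

/* Write the register and refresh the cache, unless nothing would change. */
static void cfg_write(struct dom_model *d, enum evt e, uint32_t val)
{
	uint32_t cur = cfg_get(d, e);

	if (cur == INVALID_CONFIG_VALUE || cur == val)
		return;

	d->hw_reg[e] = val;
	d->hw_writes++;
	d->cached[e] = val;
}
```

Writing the value that is already cached leaves hw_writes untouched, which
is the point of keeping the copy in the domain structure.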

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Moved the mon_config_info structure definition to internal.h.
     Moved resctrl_arch_mon_event_config_get() and resctrl_arch_mon_event_config_set()
     to monitor.c file.
     Renamed local variable from val to config_val.

v10: Moved the mon_config_info structure definition to resctrl.h.

v9: Removed QOS_L3_OCCUP_EVENT_ID switch case in resctrl_arch_mon_event_config_set.
    Fixed an unnecessary space.

v8: Renamed
    resctrl_arch_event_config_get() to resctrl_arch_mon_event_config_get().
    resctrl_arch_event_config_set() to resctrl_arch_mon_event_config_set().

v7: Removed check if (val == INVALID_CONFIG_VALUE) as resctrl_arch_event_config_get
    already prints warning.
    Kept the Event config value definitions as is.

v6: Fixed inconsistency with types. Made all the types u32 for config
    value.
    Removed few rdt_last_cmd_puts as it is not necessary.
    Removed unused config value definitions.
    Few more updates to commit message.

v5: Introduced resctrl_arch_event_config_get() and
    resctrl_arch_event_config_set() based on our discussion.
    https://lore.kernel.org/lkml/68e861f9-245d-4496-a72e-46fc57d19c62@amd.com/

v4: New patch.
---
 arch/x86/kernel/cpu/resctrl/internal.h | 15 +++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 46 +++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++---------------------
 3 files changed, 72 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ab28b9340ee7..cfaea20145d0 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -605,6 +605,18 @@ union cpuid_0x10_x_edx {
 	unsigned int full;
 };
 
+/**
+ * struct mon_config_info - Monitoring event configuration details
+ * @d:			Domain for the event
+ * @evtid:		Event type
+ * @mon_config:		Event configuration value
+ */
+struct mon_config_info {
+	struct rdt_mon_domain *d;
+	enum resctrl_event_id evtid;
+	u32 mon_config;
+};
+
 void rdt_last_cmd_clear(void);
 void rdt_last_cmd_puts(const char *s);
 __printf(1, 2)
@@ -674,4 +686,7 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
 bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
 void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom);
 unsigned int mon_event_config_index_get(u32 evtid);
+void resctrl_arch_mon_event_config_set(void *info);
+u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
+				      enum resctrl_event_id eventid);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 8917c7261680..6fe9e610e9a0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1324,3 +1324,49 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
 
 	return 0;
 }
+
+u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
+				      enum resctrl_event_id eventid)
+{
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+
+	switch (eventid) {
+	case QOS_L3_OCCUP_EVENT_ID:
+		break;
+	case QOS_L3_MBM_TOTAL_EVENT_ID:
+		return hw_dom->mbm_total_cfg;
+	case QOS_L3_MBM_LOCAL_EVENT_ID:
+		return hw_dom->mbm_local_cfg;
+	}
+
+	/* Never expect to get here */
+	WARN_ON_ONCE(1);
+
+	return INVALID_CONFIG_VALUE;
+}
+
+void resctrl_arch_mon_event_config_set(void *info)
+{
+	struct mon_config_info *mon_info = info;
+	struct rdt_hw_mon_domain *hw_dom;
+	unsigned int index;
+
+	index = mon_event_config_index_get(mon_info->evtid);
+	if (index == INVALID_CONFIG_INDEX)
+		return;
+
+	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
+
+	hw_dom = resctrl_to_arch_mon_dom(mon_info->d);
+
+	switch (mon_info->evtid) {
+	case QOS_L3_MBM_TOTAL_EVENT_ID:
+		hw_dom->mbm_total_cfg = mon_info->mon_config;
+		break;
+	case QOS_L3_MBM_LOCAL_EVENT_ID:
+		hw_dom->mbm_local_cfg = mon_info->mon_config;
+		break;
+	default:
+		break;
+	}
+}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ddecaa51cec4..18110a1afb6d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1586,11 +1586,6 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 	return ret;
 }
 
-struct mon_config_info {
-	u32 evtid;
-	u32 mon_config;
-};
-
 /**
  * mon_event_config_index_get - get the hardware index for the
  *                              configurable event
@@ -1613,33 +1608,11 @@ unsigned int mon_event_config_index_get(u32 evtid)
 	}
 }
 
-static void mon_event_config_read(void *info)
-{
-	struct mon_config_info *mon_info = info;
-	unsigned int index;
-	u64 msrval;
-
-	index = mon_event_config_index_get(mon_info->evtid);
-	if (index == INVALID_CONFIG_INDEX) {
-		pr_warn_once("Invalid event id %d\n", mon_info->evtid);
-		return;
-	}
-	rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
-
-	/* Report only the valid event configuration bits */
-	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
-}
-
-static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
-{
-	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
-}
-
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
-	struct mon_config_info mon_info;
 	struct rdt_mon_domain *dom;
 	bool sep = false;
+	u32 config_val;
 
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
@@ -1648,11 +1621,8 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		if (sep)
 			seq_puts(s, ";");
 
-		memset(&mon_info, 0, sizeof(struct mon_config_info));
-		mon_info.evtid = evtid;
-		mondata_config_read(dom, &mon_info);
-
-		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
+		config_val = resctrl_arch_mon_event_config_get(dom, evtid);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, config_val);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1683,33 +1653,23 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
 	return 0;
 }
 
-static void mon_event_config_write(void *info)
-{
-	struct mon_config_info *mon_info = info;
-	unsigned int index;
-
-	index = mon_event_config_index_get(mon_info->evtid);
-	if (index == INVALID_CONFIG_INDEX) {
-		pr_warn_once("Invalid event id %d\n", mon_info->evtid);
-		return;
-	}
-	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
-}
 
 static void mbm_config_write_domain(struct rdt_resource *r,
 				    struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
+	u32 config_val;
 
 	/*
-	 * Read the current config value first. If both are the same then
+	 * Check the current config value first. If both are the same then
 	 * no need to write it again.
 	 */
-	mon_info.evtid = evtid;
-	mondata_config_read(d, &mon_info);
-	if (mon_info.mon_config == val)
+	config_val = resctrl_arch_mon_event_config_get(d, evtid);
+	if (config_val == INVALID_CONFIG_VALUE || config_val == val)
 		return;
 
+	mon_info.d = d;
+	mon_info.evtid = evtid;
 	mon_info.mon_config = val;
 
 	/*
@@ -1718,7 +1678,8 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask,
+			      resctrl_arch_mon_event_config_set,
 			      &mon_info, 1);
 
 	/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (9 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-05 23:57   ` Reinette Chatre
  2025-02-21 18:07   ` James Morse
  2025-01-22 20:20 ` [PATCH v11 12/23] x86/resctrl: Introduce interface to display number of free counters Babu Moger
                   ` (14 subsequent siblings)
  25 siblings, 2 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

In mbm_cntr_assign mode hardware counters are assigned/unassigned to an
MBM event of a monitor group. Hardware counters are assigned/unassigned
at monitoring domain level.

Manage a monitoring domain's hardware counters using a per monitoring
domain array of struct mbm_cntr_cfg that is indexed by the hardware
counter ID. A hardware counter's configuration contains the MBM event
ID and points to the monitoring group that it is assigned to, with a
NULL pointer meaning that the hardware counter is available for assignment.

There is no direct way to determine which hardware counters are assigned
to a particular monitoring group. Check every entry of every hardware
counter configuration array in every monitoring domain to query which
MBM events of a monitoring group are tracked by hardware. Such queries
are acceptable because of a very small number of assignable counters.
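
To make the lookup cost concrete, here is a minimal user-space sketch of
the per-domain array described above. It is not code from this series:
struct group stands in for struct rdtgroup, the event ID values are
arbitrary, and cntr_alloc() and cntr_find() are hypothetical names. Both
helpers scan one domain's array linearly, which is the kind of query the
commit message considers acceptable for a small number of counters.

```c
#include <assert.h>
#include <stddef.h>

enum evt_id { EVT_MBM_TOTAL, EVT_MBM_LOCAL };	/* illustrative values */

struct group { int id; };	/* stand-in for struct rdtgroup */

struct cntr_cfg {
	enum evt_id evtid;		/* only meaningful if grp is set */
	const struct group *grp;	/* NULL means the counter is free */
};

/* Claim the first free counter for (grp, evtid); return its ID or -1. */
static int cntr_alloc(struct cntr_cfg *cfg, size_t n,
		      const struct group *grp, enum evt_id evtid)
{
	assert(cfg && grp);

	for (size_t i = 0; i < n; i++) {
		if (!cfg[i].grp) {
			cfg[i].grp = grp;
			cfg[i].evtid = evtid;
			return (int)i;
		}
	}

	return -1;
}

/* Return the counter ID assigned to (grp, evtid) in this domain, or -1. */
static int cntr_find(const struct cntr_cfg *cfg, size_t n,
		     const struct group *grp, enum evt_id evtid)
{
	assert(cfg && grp);

	for (size_t i = 0; i < n; i++) {
		if (cfg[i].grp == grp && cfg[i].evtid == evtid)
			return (int)i;
	}

	return -1;
}
```

Answering the same question for a whole system would repeat cntr_find()
once per monitoring domain.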

Suggested-by: Peter Newman <peternewman@google.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Refined the change log based on Reinette's feedback.
     Fixed few style issues.

v10: Patch changed completely to handle the counters at domain level.
     https://lore.kernel.org/lkml/CALPaoCj+zWq1vkHVbXYP0znJbe6Ke3PXPWjtri5AFgD9cQDCUg@mail.gmail.com/
     Removed Reviewed-by tag.
     Did not see the need to add cntr_id in mbm_state structure. Not used in the code.

v9: Added Reviewed-by tag. No other changes.

v8: Minor commit message changes.

v7: Added check mbm_cntr_assignable for allocating bitmap mbm_cntr_map

v6: New patch to add domain level assignment.
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
 include/linux/resctrl.h                | 14 ++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 18110a1afb6d..75a3b56996ca 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -4009,6 +4009,7 @@ static void __init rdtgroup_setup_default(void)
 
 static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
+	kfree(d->cntr_cfg);
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
@@ -4082,6 +4083,16 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain
 			return -ENOMEM;
 		}
 	}
+	if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
+		tsize = sizeof(*d->cntr_cfg);
+		d->cntr_cfg = kcalloc(r->mon.num_mbm_cntrs, tsize, GFP_KERNEL);
+		if (!d->cntr_cfg) {
+			bitmap_free(d->rmid_busy_llc);
+			kfree(d->mbm_total);
+			kfree(d->mbm_local);
+			return -ENOMEM;
+		}
+	}
 
 	return 0;
 }
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 511cfce8fc21..9a54e307d340 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -94,6 +94,18 @@ struct rdt_ctrl_domain {
 	u32				*mbps_val;
 };
 
+/**
+ * struct mbm_cntr_cfg - assignable counter configuration
+ * @evtid:		 MBM event to which the counter is assigned. Only valid
+ *			 if @rdtgrp is not NULL.
+ * @rdtgrp:		 resctrl group assigned to the counter. NULL if the
+ *			 counter is free.
+ */
+struct mbm_cntr_cfg {
+	enum resctrl_event_id	evtid;
+	struct rdtgroup		*rdtgrp;
+};
+
 /**
  * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
@@ -105,6 +117,7 @@ struct rdt_ctrl_domain {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
+ * @cntr_cfg:		assignable counters configuration
  */
 struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
@@ -116,6 +129,7 @@ struct rdt_mon_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
+	struct mbm_cntr_cfg		*cntr_cfg;
 };
 
 /**
-- 
2.34.1



* [PATCH v11 12/23] x86/resctrl: Introduce interface to display number of free counters
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (10 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-06  0:19   ` Reinette Chatre
  2025-01-22 20:20 ` [PATCH v11 13/23] x86/resctrl: Add data structures and definitions for ABMC assignment Babu Moger
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

Provide the interface to display the number of monitoring counters
available for assignment in each domain when mbm_cntr_assign is enabled.
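
For user space consuming this file, a parsing sketch may help. It assumes
the "<domain id>=<count>" pairs separated by ';' that the documentation
hunk in this patch shows (for example "0=30;1=30");
parse_available_cntrs() is a hypothetical helper, not part of resctrl.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Parse "id=count;id=count..." into @ids and @counts.
 * Returns the number of pairs stored (at most @max).
 */
size_t parse_available_cntrs(const char *line, int *ids,
			     unsigned int *counts, size_t max)
{
	size_t n = 0;
	int used;

	while (n < max &&
	       sscanf(line, "%d=%u%n", &ids[n], &counts[n], &used) == 2) {
		n++;
		line += used;
		if (*line != ';')	/* end of list or trailing newline */
			break;
		line++;			/* skip the separator */
	}

	return n;
}
```

Parsing stops at the first malformed pair, so a caller can compare the
returned count with the number of domains it expects.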

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Rename rdtgroup_available_mbm_cntrs_show() to resctrl_available_mbm_cntrs_show().
     Few minor text changes.

v10: Patch changed to handle the counters at domain level.
     https://lore.kernel.org/lkml/CALPaoCj+zWq1vkHVbXYP0znJbe6Ke3PXPWjtri5AFgD9cQDCUg@mail.gmail.com/
     So, display logic also changed now.

v9: New patch
---
 Documentation/arch/x86/resctrl.rst     |  8 +++++
 arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 46 ++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 31ff764deeeb..99cae75559b0 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -299,6 +299,14 @@ with the following files:
 	memory bandwidth tracking to a single memory bandwidth event per
 	monitoring group.
 
+"available_mbm_cntrs":
+	The number of monitoring counters available for assignment in each
+	domain when mbm_cntr_assign mode is enabled on the system.
+	::
+
+	 # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
+	 0=30;1=30
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 6fe9e610e9a0..f2bf5b13465d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1234,6 +1234,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
 			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
 			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
+			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
 		}
 	}
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 75a3b56996ca..2b86124c336b 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -888,6 +888,46 @@ static int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
+					    struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	struct rdt_mon_domain *dom;
+	bool sep = false;
+	u32 cntrs, i;
+	int ret = 0;
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
+		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
+		ret = -EINVAL;
+		goto unlock_cntrs_show;
+	}
+
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+		if (sep)
+			seq_puts(s, ";");
+
+		cntrs = 0;
+		for (i = 0; i < r->mon.num_mbm_cntrs; i++) {
+			if (!dom->cntr_cfg[i].rdtgrp)
+				cntrs++;
+		}
+
+		seq_printf(s, "%d=%u", dom->hdr.id, cntrs);
+		sep = true;
+	}
+	seq_puts(s, "\n");
+
+unlock_cntrs_show:
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+
+	return ret;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1916,6 +1956,12 @@ static struct rftype res_common_files[] = {
 		.kf_ops		= &rdtgroup_kf_single_ops,
 		.seq_show	= resctrl_num_mbm_cntrs_show,
 	},
+	{
+		.name		= "available_mbm_cntrs",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= resctrl_available_mbm_cntrs_show,
+	},
 	{
 		.name		= "cpus",
 		.mode		= 0644,
-- 
2.34.1



* [PATCH v11 13/23] x86/resctrl: Add data structures and definitions for ABMC assignment
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (11 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 12/23] x86/resctrl: Introduce interface to display number of free counters Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-01-22 20:20 ` [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

The ABMC feature lets the user assign a hardware counter to an RMID,
event pair and monitor the bandwidth for as long as the counter is
assigned. The hardware tracks the bandwidth events until the user changes
the configuration. Each resctrl group can configure a maximum of two
counters, one for the total event and one for the local event.

The ABMC feature implements an MSR L3_QOS_ABMC_CFG (C000_03FDh).
Configuration is done by setting the counter id, bandwidth source (RMID)
and bandwidth configuration supported by BMEC (Bandwidth Monitoring Event
Configuration).

Attempts to read or write the MSR when ABMC is not enabled will result
in a #GP(0) exception.

Introduce the data structures and definitions for MSR L3_QOS_ABMC_CFG
(C000_03FDh):
==========================================================================
Bits    Mnemonic    Description                 Access  Reset
                                                Type    Value
==========================================================================
63      CfgEn       Configuration Enable        R/W     0

62      CtrEn       Enable/disable counting     R/W     0

61:53   –           Reserved                    MBZ     0

52:48   CtrID       Counter Identifier          R/W     0

47      IsCOS       BwSrc field is a CLOSID     R/W     0
                    (not an RMID)

46:44   –           Reserved                    MBZ     0

43:32   BwSrc       Bandwidth Source            R/W     0
                    (RMID or CLOSID)

31:0    BwType      Bandwidth configuration     R/W     0
                    to track for this counter
==========================================================================

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
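
The register layout in the table above can also be expressed with explicit
shifts and masks. The following is a sketch only: the ABMC_* macro names,
abmc_cfg_encode() and abmc_cfg_cntr_id() are illustrative and do not appear
in this series, which uses a bit-field union instead.

```c
#include <stdint.h>

/* Field positions taken from the table above; macro names are illustrative. */
#define ABMC_CFG_EN		(UINT64_C(1) << 63)
#define ABMC_CTR_EN		(UINT64_C(1) << 62)
#define ABMC_CTR_ID_SHIFT	48
#define ABMC_CTR_ID_MASK	UINT64_C(0x1f)		/* bits 52:48 */
#define ABMC_IS_COS		(UINT64_C(1) << 47)
#define ABMC_BW_SRC_SHIFT	32
#define ABMC_BW_SRC_MASK	UINT64_C(0xfff)		/* bits 43:32 */

/* Pack the fields into a 64-bit value; reserved bits stay zero. */
uint64_t abmc_cfg_encode(uint32_t bw_type, uint32_t bw_src, uint32_t cntr_id,
			 int is_cos, int cntr_en, int cfg_en)
{
	uint64_t val = bw_type;				/* bits 31:0 */

	val |= ((uint64_t)bw_src & ABMC_BW_SRC_MASK) << ABMC_BW_SRC_SHIFT;
	val |= ((uint64_t)cntr_id & ABMC_CTR_ID_MASK) << ABMC_CTR_ID_SHIFT;
	if (is_cos)
		val |= ABMC_IS_COS;
	if (cntr_en)
		val |= ABMC_CTR_EN;
	if (cfg_en)
		val |= ABMC_CFG_EN;

	return val;
}

/* Extract the counter ID back out of a packed value. */
uint32_t abmc_cfg_cntr_id(uint64_t val)
{
	return (uint32_t)((val >> ABMC_CTR_ID_SHIFT) & ABMC_CTR_ID_MASK);
}
```

Explicit shifts do not depend on how the compiler lays out bit-fields,
at the cost of being more verbose than the union used in the patch.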

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v11: No changes.

v10: No changes.

v9: Removed the references of L3_QOS_ABMC_DSC.
    Text changes about configuration in kernel doc.

v8: Update the configuration notes in kernel_doc.
    Few commit message update.

v7: Removed the reference of L3_QOS_ABMC_DSC as it is not used anymore.
    Moved the configuration notes to kernel_doc.
    Adjusted the tabs for l3_qos_abmc_cfg and checkpatch seems happy.

v6: Removed all the fs related changes.
    Added note on CfgEn,CtrEn.
    Removed the definitions which are not used.
    Removed cntr_id initialization.

v5: Moved assignment flags here (patch 10/19 of v4).
    Added MON_CNTR_UNSET definition to initialize cntr_id's.
    More details in commit log.
    Renamed few fields in l3_qos_abmc_cfg for readability.

v4: Added more descriptions.
    Changed the name abmc_ctr_id to ctr_id.
    Added L3_QOS_ABMC_DSC. Used for reading the configuration.

v3: No changes.

v2: No changes.
---
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kernel/cpu/resctrl/internal.h | 35 ++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index fea1f3afe197..e753a3332496 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1197,6 +1197,7 @@
 /* - AMD: */
 #define MSR_IA32_MBA_BW_BASE		0xc0000200
 #define MSR_IA32_SMBA_BW_BASE		0xc0000280
+#define MSR_IA32_L3_QOS_ABMC_CFG	0xc00003fd
 #define MSR_IA32_L3_QOS_EXT_CFG		0xc00003ff
 #define MSR_IA32_EVT_CFG_BASE		0xc0000400
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index cfaea20145d0..acac7972cea4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -617,6 +617,41 @@ struct mon_config_info {
 	u32 mon_config;
 };
 
+/*
+ * ABMC counters are configured by writing to L3_QOS_ABMC_CFG.
+ * @bw_type		: Bandwidth configuration (supported by BMEC)
+ *			  tracked by the @cntr_id.
+ * @bw_src		: Bandwidth source (RMID or CLOSID).
+ * @reserved1		: Reserved.
+ * @is_clos		: @bw_src field is a CLOSID (not an RMID).
+ * @cntr_id		: Counter identifier.
+ * @reserved		: Reserved.
+ * @cntr_en		: Counting enable bit.
+ * @cfg_en		: Configuration enable bit.
+ *
+ * Configuration and counting:
+ * Counter can be configured across multiple writes to MSR. Configuration
+ * is applied only when @cfg_en = 1. Counter @cntr_id is reset when the
+ * configuration is applied.
+ * @cfg_en = 1, @cntr_en = 0 : Apply @cntr_id configuration but do not
+ *                             count events.
+ * @cfg_en = 1, @cntr_en = 1 : Apply @cntr_id configuration and start
+ *                             counting events.
+ */
+union l3_qos_abmc_cfg {
+	struct {
+		unsigned long bw_type  :32,
+			      bw_src   :12,
+			      reserved1: 3,
+			      is_clos  : 1,
+			      cntr_id  : 5,
+			      reserved : 9,
+			      cntr_en  : 1,
+			      cfg_en   : 1;
+	} split;
+	unsigned long full;
+};
+
 void rdt_last_cmd_clear(void);
 void rdt_last_cmd_puts(const char *s);
 __printf(1, 2)
-- 
2.34.1



* [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (12 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 13/23] x86/resctrl: Add data structures and definitions for ABMC assignment Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-19 13:32   ` Dave Martin
  2025-02-21 18:06   ` James Morse
  2025-01-22 20:20 ` [PATCH v11 15/23] x86/resctrl: Add the functionality to assign MBM events Babu Moger
                   ` (11 subsequent siblings)
  25 siblings, 2 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

The ABMC feature lets the user assign a hardware counter to an RMID,
event pair and monitor the bandwidth for as long as the counter is
assigned. The hardware tracks the assigned RMID until the user unassigns
it manually.

Implement an architecture-specific handler to assign and unassign the
counter. Configure counters by writing to the L3_QOS_ABMC_CFG MSR,
specifying the counter ID, bandwidth source (RMID), and event
configuration.

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
    Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
    Monitoring (ABMC).
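
The handler in this patch also clears the cached architectural MBM state
after reprogramming a counter. A minimal sketch of why that matters,
assuming a free-running counter of "width" bits that is read as a delta
against a cached previous value; counter_delta() and the 44-bit width used
in the note below are illustrative assumptions, not values taken from this
patch:

```c
#include <stdint.h>

/*
 * Delta between two reads of a @width-bit counter (1 <= width <= 64),
 * treating cur < prev as a wrap-around of the hardware counter.
 */
uint64_t counter_delta(uint64_t prev, uint64_t cur, unsigned int width)
{
	unsigned int shift = 64 - width;

	return ((cur << shift) - (prev << shift)) >> shift;
}
```

With a width of 44, a stale previous value of 1000 and a freshly reset
hardware counter reading 10 produce a delta of 2^44 - 990, which would be
reported as a large burst of traffic. Clearing the previous value to 0
first yields the expected delta of 10.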

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Moved resctrl_arch_assign_cntr() and resctrl_abmc_config_one_amd() to
     monitor.c.
     Added the code to reset the arch state in resctrl_arch_assign_cntr().
     Also removed resctrl_arch_reset_rmid() inside IPI as the counters are
     reset from the callers.
     Re-wrote commit message.

v10: Added call resctrl_arch_reset_rmid() to reset the RMID in the domain
     inside IPI call.
     SMP and non-SMP call support is not required in resctrl_arch_config_cntr
     with new domain specific assign approach/data structure.
     Commit message update.

v9: Removed the code to reset the architectural state. It will be done
    in another patch.

v8: Rename resctrl_arch_assign_cntr to resctrl_arch_config_cntr.

v7: Separated arch and fs functions. This patch only has arch implementation.
    Added struct rdt_resource to the interface resctrl_arch_assign_cntr.
    Rename rdtgroup_abmc_cfg() to resctrl_abmc_config_one_amd().

v6: Removed mbm_cntr_alloc() from this patch to keep fs and arch code
    separate.
    Added code to update the counter assignment at domain level.

v5: Few name changes to match cntr_id.
    Changed the function names to
      rdtgroup_assign_cntr
      resctrl_arch_assign_cntr
      More comments on commit log.
      Added function summary.

v4: Commit message update.
      Used bitmap APIs where applicable.
      Changed the interfaces considering MPAM(arm).
      Added domain specific assignment.

v3: Removed the static from the prototype of rdtgroup_assign_abmc.
      The function is not called directly from user anymore. These
      changes are related to global assignment interface.

v2: Minor text changes in commit message.
---
 arch/x86/kernel/cpu/resctrl/internal.h |  3 ++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 42 ++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index acac7972cea4..161d3feb567c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -724,4 +724,7 @@ unsigned int mon_event_config_index_get(u32 evtid);
 void resctrl_arch_mon_event_config_set(void *info);
 u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
 				      enum resctrl_event_id eventid);
+int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
+			     u32 cntr_id, bool assign);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f2bf5b13465d..ef836bb69b9b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1371,3 +1371,45 @@ void resctrl_arch_mon_event_config_set(void *info)
 		break;
 	}
 }
+
+static void resctrl_abmc_config_one_amd(void *info)
+{
+	union l3_qos_abmc_cfg *abmc_cfg = info;
+
+	wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg->full);
+}
+
+/*
+ * Send an IPI to the domain to assign the counter to RMID, event pair.
+ */
+int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
+			     u32 cntr_id, bool assign)
+{
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+	union l3_qos_abmc_cfg abmc_cfg = { 0 };
+	struct arch_mbm_state *am;
+
+	abmc_cfg.split.cfg_en = 1;
+	abmc_cfg.split.cntr_en = assign ? 1 : 0;
+	abmc_cfg.split.cntr_id = cntr_id;
+	abmc_cfg.split.bw_src = rmid;
+
+	/* Update the event configuration from the domain */
+	if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
+		abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
+	else
+		abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
+
+	smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd, &abmc_cfg, 1);
+
+	/*
+	 * Reset the architectural state so that reading of hardware
+	 * counter is not considered as an overflow in next update.
+	 */
+	am = get_arch_mbm_state(hw_dom, rmid, evtid);
+	if (am)
+		memset(am, 0, sizeof(*am));
+
+	return 0;
+}
-- 
2.34.1



* [PATCH v11 15/23] x86/resctrl: Add the functionality to assign MBM events
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (13 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-06  1:05   ` Reinette Chatre
  2025-01-22 20:20 ` [PATCH v11 16/23] x86/resctrl: Add the functionality to unassign " Babu Moger
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

The mbm_cntr_assign mode offers several counters, each of which can be
assigned to an RMID, event pair to monitor the bandwidth for as long as
it is assigned.

Add the functionality to allocate and assign the counters to an RMID,
event pair in the domain.

If all counters are in use when a new assignment is requested, the kernel
reports the error "Out of MBM assignable counters" in
/sys/fs/resctrl/info/last_cmd_status. Exit on the first failure when
assigning counters across all the domains.
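
The allocation policy described above, including the exit on the first
failing domain, can be sketched as follows. This is a user-space model
under stated assumptions: struct group_model, struct domain_model,
cntr_get_or_alloc() and assign_all_domains() are hypothetical names, and
the sketch does not program any hardware.

```c
#include <errno.h>
#include <stddef.h>

struct group_model { int id; };			/* hypothetical group */

struct cntr_cfg_model {
	int evtid;
	struct group_model *grp;		/* NULL: counter is free */
};

struct domain_model {
	struct cntr_cfg_model *cfg;		/* indexed by counter ID */
	int num_cntrs;
};

/* Return the counter already tracking (grp, evtid), else claim a free one. */
int cntr_get_or_alloc(struct domain_model *d, struct group_model *grp,
		      int evtid)
{
	int i, free_id = -1;

	for (i = 0; i < d->num_cntrs; i++) {
		if (d->cfg[i].grp == grp && d->cfg[i].evtid == evtid)
			return i;
		if (!d->cfg[i].grp && free_id < 0)
			free_id = i;
	}

	if (free_id < 0)
		return -ENOSPC;

	d->cfg[free_id].grp = grp;
	d->cfg[free_id].evtid = evtid;
	return free_id;
}

/* Assign in every domain, stopping at the first domain that fails. */
int assign_all_domains(struct domain_model *doms, size_t ndoms,
		       struct group_model *grp, int evtid)
{
	size_t i;
	int ret;

	for (i = 0; i < ndoms; i++) {
		ret = cntr_get_or_alloc(&doms[i], grp, evtid);
		if (ret < 0)
			return ret;
	}

	return 0;
}
```

In this sketch, domains handled before the failing one keep their
assignment; whether to roll those back is a separate policy decision.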

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Patch changed again quite a bit.
     Moved the functions to monitor.c.
     Renamed rdtgroup_assign_cntr_event() to resctrl_assign_cntr_event().
     Refactored the resctrl_assign_cntr_event().
     Added functionality to exit on the first error during assignment.
     Simplified mbm_cntr_free().
     Removed the function mbm_cntr_assigned(). Will be using mbm_cntr_get() to
     figure out if the counter is assigned or not.
     Updated commit message and code comments.

v10: Patch changed completely.
     Counters are managed at the domain based on the discussion.
     https://lore.kernel.org/lkml/CALPaoCj+zWq1vkHVbXYP0znJbe6Ke3PXPWjtri5AFgD9cQDCUg@mail.gmail.com/
     Reset non-architectural MBM state.
     Commit message update.

v9: Introduced new function resctrl_config_cntr to assign the counter, update
    the bitmap and reset the architectural state.
    Taken care of error handling(freeing the counter) when assignment fails.
    Moved mbm_cntr_assigned_to_domain here as it is used in this patch.
    Minor text changes.

v8: Renamed rdtgroup_assign_cntr() to rdtgroup_assign_cntr_event().
    Added the code to return the error if rdtgroup_assign_cntr_event fails.
    Moved definition of MBM_EVENT_ARRAY_INDEX to resctrl/internal.h.
    Updated typo in the comments.

v7: New patch. Moved all the FS code here.
    Merged rdtgroup_assign_cntr and rdtgroup_alloc_cntr.
    Added new #define MBM_EVENT_ARRAY_INDEX.
---
 arch/x86/kernel/cpu/resctrl/internal.h |   2 +
 arch/x86/kernel/cpu/resctrl/monitor.c  | 105 +++++++++++++++++++++++++
 2 files changed, 107 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 161d3feb567c..547d8a4c8aba 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -727,4 +727,6 @@ u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
 int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
 			     u32 cntr_id, bool assign);
+int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ef836bb69b9b..127c4000a81a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1413,3 +1413,108 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 
 	return 0;
 }
+
+/*
+ * Configure the counter for the event, RMID pair for the domain. Reset the
+ * non-architectural state to clear all the event counters.
+ */
+static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
+			       u32 cntr_id, bool assign)
+{
+	struct mbm_state *m;
+	int ret;
+
+	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
+	if (ret)
+		return ret;
+
+	m = get_mbm_state(d, closid, rmid, evtid);
+	if (m)
+		memset(m, 0, sizeof(struct mbm_state));
+
+	return ret;
+}
+
+static int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
+			struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
+{
+	int cntr_id;
+
+	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
+		if (d->cntr_cfg[cntr_id].rdtgrp == rdtgrp &&
+		    d->cntr_cfg[cntr_id].evtid == evtid)
+			return cntr_id;
+	}
+
+	return -ENOENT;
+}
+
+static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_mon_domain *d,
+			  struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
+{
+	int cntr_id;
+
+	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
+		if (!d->cntr_cfg[cntr_id].rdtgrp) {
+			d->cntr_cfg[cntr_id].rdtgrp = rdtgrp;
+			d->cntr_cfg[cntr_id].evtid = evtid;
+			return cntr_id;
+		}
+	}
+
+	return -ENOSPC;
+}
+
+static void mbm_cntr_free(struct rdt_mon_domain *d, int cntr_id)
+{
+	memset(&d->cntr_cfg[cntr_id], 0, sizeof(struct mbm_cntr_cfg));
+}
+
+/*
+ * Allocate a fresh counter and configure the event if not assigned already
+ * else return success.
+ */
+static int resctrl_alloc_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+				     struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
+{
+	int cntr_id, ret = 0;
+
+	if (mbm_cntr_get(r, d, rdtgrp, evtid) == -ENOENT) {
+		cntr_id = mbm_cntr_alloc(r, d, rdtgrp, evtid);
+		if (cntr_id < 0) {
+			rdt_last_cmd_printf("Domain %d is Out of MBM assignable counter\n",
+					    d->hdr.id);
+			return -ENOSPC;
+		}
+
+		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid, rdtgrp->closid,
+					  cntr_id, true);
+		if (ret) {
+			rdt_last_cmd_printf("Assignment failed on domain %d\n", d->hdr.id);
+			mbm_cntr_free(d, cntr_id);
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Assign a hardware counter to event @evtid of group @rdtgrp.
+ * Counter will be assigned to all the domains if @d is NULL else
+ * the counter will be assigned to @d.
+ */
+int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
+{
+	int ret = 0;
+
+	if (!d) {
+		list_for_each_entry(d, &r->mon_domains, hdr.list)
+			ret = resctrl_alloc_config_cntr(r, d, rdtgrp, evtid);
+	} else {
+		ret = resctrl_alloc_config_cntr(r, d, rdtgrp, evtid);
+	}
+
+	return ret;
+}
-- 
2.34.1



* [PATCH v11 16/23] x86/resctrl: Add the functionality to unassign MBM events
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (14 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 15/23] x86/resctrl: Add the functionality to assign MBM events Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-06  3:54   ` Reinette Chatre
  2025-01-22 20:20 ` [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

The mbm_cntr_assign mode provides a limited number of hardware counters
that can be assigned to an RMID, event pair to monitor bandwidth while
assigned. If all counters are in use, the kernel will show an error
message: "Out of MBM assignable counters" when a new assignment is
requested. To make space for a new assignment, users must unassign an
already assigned counter and retry the assignment.

Add the functionality to unassign and free the counters in the domain.
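
A small model of the unassign path, showing that freeing a counter makes
room for a later assignment and that unassigning an event with no counter
is harmless. The names struct cntr_cfg_model, cntr_find(), cntr_unassign()
and cntr_free_count() are hypothetical and exist only in this sketch.

```c
#include <stddef.h>
#include <string.h>

struct group_model { int id; };			/* hypothetical group */

struct cntr_cfg_model {
	int evtid;
	struct group_model *grp;		/* NULL: counter is free */
};

/* Find the counter tracking (grp, evtid); -1 if there is none. */
int cntr_find(const struct cntr_cfg_model *cfg, int n,
	      const struct group_model *grp, int evtid)
{
	int i;

	for (i = 0; i < n; i++)
		if (cfg[i].grp == grp && cfg[i].evtid == evtid)
			return i;

	return -1;
}

/* Release the counter if one is assigned. Returns 1 if a counter was freed. */
int cntr_unassign(struct cntr_cfg_model *cfg, int n,
		  const struct group_model *grp, int evtid)
{
	int id = cntr_find(cfg, n, grp, evtid);

	if (id < 0)
		return 0;

	/* Zeroing the entry clears the group pointer, marking it free. */
	memset(&cfg[id], 0, sizeof(cfg[id]));
	return 1;
}

/* Number of counters currently available for assignment. */
int cntr_free_count(const struct cntr_cfg_model *cfg, int n)
{
	int i, nfree = 0;

	for (i = 0; i < n; i++)
		if (!cfg[i].grp)
			nfree++;

	return nfree;
}
```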

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Moved the functions to monitor.c.
     Renamed rdtgroup_unassign_cntr_event() to resctrl_unassign_cntr_event().
     Refactored the resctrl_unassign_cntr_event().
     Updated commit message and code comments.

v10: Patch changed again.
     Counters are managed at the domain based on the discussion.
     https://lore.kernel.org/lkml/CALPaoCj+zWq1vkHVbXYP0znJbe6Ke3PXPWjtri5AFgD9cQDCUg@mail.gmail.com/
     commit message update.

v9: Changes related to addition of new function resctrl_config_cntr().
    Removed rdtgroup_mbm_cntr_is_assigned() as it was introduced
    already.
    Text changes to take care of comments.

v8: Renamed rdtgroup_mbm_cntr_is_assigned to mbm_cntr_assigned_to_domain
    Added return error handling in resctrl_arch_config_cntr().

v7: Merged rdtgroup_unassign_cntr and rdtgroup_free_cntr functions.
    Renamed rdtgroup_mbm_cntr_test() to rdtgroup_mbm_cntr_is_assigned().
    Reworded the commit log a little bit.

v6: Removed mbm_cntr_free from this patch.
    Added counter test in all the domains and free if it is not assigned to
    any domains.

v5: Few name changes to match cntr_id.
    Changed the function names to rdtgroup_unassign_cntr
    More comments on commit log.

v4: Added domain specific unassign feature.
    Few name changes.

v3: Removed the static from the prototype of rdtgroup_unassign_abmc.
    The function is not called directly from user anymore. These
    changes are related to global assignment interface.

v2: No changes.
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 39 ++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 547d8a4c8aba..a5b8eadc7f5c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -729,4 +729,6 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 cntr_id, bool assign);
 int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
 			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
+int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+				struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 127c4000a81a..b6d188d0f9b7 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1518,3 +1518,42 @@ int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
 
 	return ret;
 }
+
+/*
+ * Unassign and free the counter if assigned else return success.
+ */
+static int resctrl_free_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+				    struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
+{
+	int cntr_id, ret = 0;
+
+	cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
+	if (cntr_id != -ENOENT) {
+		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
+					  rdtgrp->closid, cntr_id, false);
+		if (!ret)
+			mbm_cntr_free(d, cntr_id);
+	}
+
+	return ret;
+}
+
+/*
+ * Unassign a hardware counter associated with @evtid from the domain and
+ * the group. Unassign the counters from all the domains if @d is NULL else
+ * unassign from @d.
+ */
+int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+				struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
+{
+	int ret = 0;
+
+	if (!d) {
+		list_for_each_entry(d, &r->mon_domains, hdr.list)
+			ret = resctrl_free_config_cntr(r, d, rdtgrp, evtid);
+	} else {
+		ret = resctrl_free_config_cntr(r, d, rdtgrp, evtid);
+	}
+
+	return ret;
+}
-- 
2.34.1



* [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (15 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 16/23] x86/resctrl: Add the functionality to unassign " Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-06 18:03   ` Reinette Chatre
  2025-02-19 13:41   ` Dave Martin
  2025-01-22 20:20 ` [PATCH v11 18/23] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode Babu Moger
                   ` (8 subsequent siblings)
  25 siblings, 2 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

Assign/unassign counters on resctrl group creation/deletion. Two counters
are required per group, one for MBM total event and one for MBM local
event.

There are a limited number of counters available for assignment. If these
counters are exhausted, the kernel will display the error message: "Out of
MBM assignable counters". However, it is not necessary to fail the
creation of a group due to assignment failures. Users have the flexibility
to modify the assignments at a later time.
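
The best-effort policy described above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions: the pool size,
the struct names and the helper functions are all hypothetical and are not
the kernel's data structures. It shows that group creation always succeeds,
even when one or both events end up without a counter.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CNTRS 4	/* hypothetical pool size, not a hardware value */
#define NO_CNTR   (-1)

/* Hypothetical stand-in for a per-domain counter pool. */
struct cntr_pool {
	bool in_use[NUM_CNTRS];
};

/* Hypothetical stand-in for a monitor group: one slot per MBM event. */
struct mon_group {
	int total_cntr;	/* counter backing the MBM total event, or NO_CNTR */
	int local_cntr;	/* counter backing the MBM local event, or NO_CNTR */
};

/* Return a free counter index, or NO_CNTR when the pool is exhausted. */
static int pool_alloc(struct cntr_pool *p)
{
	for (int i = 0; i < NUM_CNTRS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}
	return NO_CNTR;
}

/* Release a counter; releasing NO_CNTR is a no-op. */
static void pool_free(struct cntr_pool *p, int id)
{
	assert(id < NUM_CNTRS);
	if (id != NO_CNTR)
		p->in_use[id] = false;
}

/*
 * Group creation: try to assign both counters, but never fail the
 * creation itself. An event left at NO_CNTR can be assigned later.
 */
static void group_create(struct cntr_pool *p, struct mon_group *g)
{
	g->total_cntr = pool_alloc(p);
	g->local_cntr = pool_alloc(p);
}

/* Group deletion: release whatever was assigned. */
static void group_delete(struct cntr_pool *p, struct mon_group *g)
{
	pool_free(p, g->total_cntr);
	pool_free(p, g->local_cntr);
	g->total_cntr = g->local_cntr = NO_CNTR;
}
```

With a pool of four counters, two groups consume everything and a third
group is created with both events unassigned; deleting an earlier group
frees its counters for reuse.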

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Moved mbm_cntr_reset() to monitor.c.
     Added code to reset non-architectural state in mbm_cntr_reset().
     Added missing rdtgroup_unassign_cntrs() calls on failure path.

v10: Assigned the counter before exposing the event files.
    Moved the rdtgroup_assign_cntrs() call inside mkdir_rdt_prepare_rmid_alloc().
    This is called for both CTRL_MON and MON group creation.
    Call mbm_cntr_reset() when unmounted to clear all the assignments.
    Addressed a few other feedback comments.

v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
    Updated couple of rdtgroup_unassign_cntrs() calls properly.
    Updated function comments.

v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
    Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
    Fixed the problem with unassigning the child MON groups of CTRL_MON group.

v7: Reworded the commit message.
    Removed the reference of ABMC with mbm_cntr_assign.
    Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.

v6: Removed the redundant comments on all the calls of
    rdtgroup_assign_cntrs. Updated the commit message.
    Dropped printing error message on every call of rdtgroup_assign_cntrs.

v5: Removed the code to enable/disable ABMC during the mount.
    That will be another patch.
    Added arch callers to get the arch specific data.
    Renamed functions to match the other abmc function.
    Added code comments for assignment failures.

v4: Few name changes based on the upstream discussion.
    Commit message update.

v3: This is a new patch. Patch addresses the upstream comment to enable
    ABMC feature by default if the feature is available.
---
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/monitor.c  | 27 +++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 63 +++++++++++++++++++++++++-
 3 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a5b8eadc7f5c..c979abb3d3b0 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -731,4 +731,5 @@ int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
 			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
 int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
 				struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
+void mbm_cntr_reset(struct rdt_resource *r);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index b6d188d0f9b7..118b39fbb01e 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1557,3 +1557,30 @@ int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d
 
 	return ret;
 }
+
+void mbm_cntr_reset(struct rdt_resource *r)
+{
+	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+	struct rdt_mon_domain *dom;
+
+	/*
+	 * Reset the domain counter configuration. Hardware counters
+	 * will reset after switching the monitor mode. So, reset the
+	 * architectural and non-architectural state so that reading
+	 * of hardware counter is not considered as an overflow in the
+	 * next update.
+	 */
+	if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
+		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+			memset(dom->cntr_cfg, 0,
+			       sizeof(*dom->cntr_cfg) * r->mon.num_mbm_cntrs);
+			if (is_mbm_total_enabled())
+				memset(dom->mbm_total, 0,
+				       sizeof(struct mbm_state) * idx_limit);
+			if (is_mbm_local_enabled())
+				memset(dom->mbm_local, 0,
+				       sizeof(struct mbm_state) * idx_limit);
+			resctrl_arch_reset_rmid_all(r, dom);
+		}
+	}
+}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2b86124c336b..f61f0cd032ef 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2687,6 +2687,46 @@ static void schemata_list_destroy(void)
 	}
 }
 
+/*
+ * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
+ * counters are automatically assigned. Each group can accommodate two counters:
+ * one for the total event and one for the local event. Assignments may fail
+ * due to the limited number of counters. However, it is not necessary to fail
+ * the group creation and thus no failure is returned. Users have the option
+ * to modify the counter assignments after the group has been created.
+ */
+static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
+		return;
+
+	if (is_mbm_total_enabled())
+		resctrl_assign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
+
+	if (is_mbm_local_enabled())
+		resctrl_assign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
+}
+
+/*
+ * Called when a group is deleted. Counters are unassigned if they were in
+ * the assigned state.
+ */
+static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
+		return;
+
+	if (is_mbm_total_enabled())
+		resctrl_unassign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
+
+	if (is_mbm_local_enabled())
+		resctrl_unassign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
+}
+
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -2741,6 +2781,8 @@ static int rdt_get_tree(struct fs_context *fc)
 		if (ret < 0)
 			goto out_info;
 
+		rdtgroup_assign_cntrs(&rdtgroup_default);
+
 		ret = mkdir_mondata_all(rdtgroup_default.kn,
 					&rdtgroup_default, &kn_mondata);
 		if (ret < 0)
@@ -2779,8 +2821,10 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (resctrl_arch_mon_capable())
 		kernfs_remove(kn_mondata);
 out_mongrp:
-	if (resctrl_arch_mon_capable())
+	if (resctrl_arch_mon_capable()) {
+		rdtgroup_unassign_cntrs(&rdtgroup_default);
 		kernfs_remove(kn_mongrp);
+	}
 out_info:
 	kernfs_remove(kn_info);
 out_schemata_free:
@@ -2956,6 +3000,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
 
 	head = &rdtgrp->mon.crdtgrp_list;
 	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
+		rdtgroup_unassign_cntrs(sentry);
 		free_rmid(sentry->closid, sentry->mon.rmid);
 		list_del(&sentry->mon.crdtgrp_list);
 
@@ -2996,6 +3041,8 @@ static void rmdir_all_sub(void)
 		cpumask_or(&rdtgroup_default.cpu_mask,
 			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
 
+		rdtgroup_unassign_cntrs(rdtgrp);
+
 		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 
 		kernfs_remove(rdtgrp->kn);
@@ -3027,6 +3074,8 @@ static void rdt_kill_sb(struct super_block *sb)
 	for_each_alloc_capable_rdt_resource(r)
 		reset_all_ctrls(r);
 	rmdir_all_sub();
+	rdtgroup_unassign_cntrs(&rdtgroup_default);
+	mbm_cntr_reset(&rdt_resources_all[RDT_RESOURCE_L3].r_resctrl);
 	rdt_pseudo_lock_release();
 	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
 	schemata_list_destroy();
@@ -3490,9 +3539,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
 	}
 	rdtgrp->mon.rmid = ret;
 
+	rdtgroup_assign_cntrs(rdtgrp);
+
 	ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
 	if (ret) {
 		rdt_last_cmd_puts("kernfs subdir error\n");
+		rdtgroup_unassign_cntrs(rdtgrp);
 		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 		return ret;
 	}
@@ -3502,8 +3554,10 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
 
 static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
 {
-	if (resctrl_arch_mon_capable())
+	if (resctrl_arch_mon_capable()) {
+		rdtgroup_unassign_cntrs(rgrp);
 		free_rmid(rgrp->closid, rgrp->mon.rmid);
+	}
 }
 
 static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
@@ -3764,6 +3818,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 	update_closid_rmid(tmpmask, NULL);
 
 	rdtgrp->flags = RDT_DELETED;
+
+	rdtgroup_unassign_cntrs(rdtgrp);
+
 	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 
 	/*
@@ -3810,6 +3867,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
 	update_closid_rmid(tmpmask, NULL);
 
+	rdtgroup_unassign_cntrs(rdtgrp);
+
 	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 	closid_free(rdtgrp->closid);
 
-- 
2.34.1



* [PATCH v11 18/23] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (16 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-06 18:04   ` Reinette Chatre
  2025-01-22 20:20 ` [PATCH v11 19/23] x86/resctrl: Introduce the interface to switch between monitor modes Babu Moger
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

In mbm_cntr_assign mode, a hardware counter must be assigned to an MBM
event before the event can be read.

Report 'Unassigned' in case the user attempts to read the events without
assigning the counter.
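
The reporting rule can be sketched as a small userspace function that maps
a read result to the text shown to the user. The function name and the
buffer interface are hypothetical. The -EINVAL and -ENOENT cases follow the
hunk in this patch; the -EIO case for "Error" is an assumption, since that
condition lies outside the lines shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the result of an MBM event read into @buf.
 * Returns the number of characters written (excluding the NUL).
 */
static int format_mon_value(char *buf, size_t len, int err,
			    unsigned long long val)
{
	assert(buf && len > 0);

	if (err == -EIO)	/* assumed mapping, not visible in this hunk */
		return snprintf(buf, len, "Error\n");
	if (err == -EINVAL)
		return snprintf(buf, len, "Unavailable\n");
	if (err == -ENOENT)	/* no counter assigned to the event */
		return snprintf(buf, len, "Unassigned\n");

	return snprintf(buf, len, "%llu\n", val);
}
```

Checking for the unassigned case before reading the hardware means the
reader never sees a stale or meaningless value for an event that has no
counter behind it.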

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Domain can be NULL with SNC support so moved the unassign check in
     rdtgroup_mondata_show().

v10: Moved the code to check the assign state inside mon_event_read().
     Fixed few text comments.

v9: Used is_mbm_event() to check the event type.
    Minor user documentation update.

v8: Used MBM_EVENT_ARRAY_INDEX to get the index for the MBM event.
    Documentation update to make the text generic.

v7: Moved the documentation under "mon_data".
    Updated the text little bit.

v6: Added more explanation in resctrl.rst.
    Added checks to detect "Unassigned" before reading RMID.

v5: New patch.
---
 Documentation/arch/x86/resctrl.rst        | 10 ++++++++++
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 13 +++++++++++++
 arch/x86/kernel/cpu/resctrl/internal.h    |  2 ++
 arch/x86/kernel/cpu/resctrl/monitor.c     |  4 ++--
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 99cae75559b0..072b15550ff7 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -431,6 +431,16 @@ When monitoring is enabled all MON groups will also contain:
 	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
 	where "YY" is the node number.
 
+	When supported, the mbm_cntr_assign mode allows users to assign a
+	counter to a (mon_hw_id, event) pair, enabling bandwidth monitoring for
+	as long as the counter remains assigned. The hardware will continue
+	tracking the assigned mon_hw_id until the user manually unassigns
+	it, ensuring that counters are not reset during this period. With
+	a limited number of counters, the system may run out of assignable
+	counters. In that case, MBM event counters will return 'Unassigned'
+	when the event is read. Users must manually assign a counter to read
+	the events.
+
 "mon_hw_id":
 	Available only with debug option. The identifier used by hardware
 	for the monitor group. On x86 this is the RMID.
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 536351159cc2..d00e77d08ef4 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -679,6 +679,17 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 			goto out;
 		}
 		d = container_of(hdr, struct rdt_mon_domain, hdr);
+
+		/*
+		 * Report 'Unassigned' if the mbm_cntr_assign mode is enabled and
+		 * counter is unassigned
+		 */
+		if (resctrl_arch_mbm_cntr_assign_enabled(r) && is_mbm_event(evtid) &&
+		    (mbm_cntr_get(r, d, rdtgrp, evtid) == -ENOENT)) {
+			rr.err = -ENOENT;
+			goto checkresult;
+		}
+
 		mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
 	}
 
@@ -688,6 +699,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		seq_puts(m, "Error\n");
 	else if (rr.err == -EINVAL)
 		seq_puts(m, "Unavailable\n");
+	else if (rr.err == -ENOENT)
+		seq_puts(m, "Unassigned\n");
 	else
 		seq_printf(m, "%llu\n", rr.val);
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c979abb3d3b0..c006c4d8d6ff 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -732,4 +732,6 @@ int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
 int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
 				struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
 void mbm_cntr_reset(struct rdt_resource *r);
+int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
+		 struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 118b39fbb01e..3d748fdbcb5f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1436,8 +1436,8 @@ static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 	return ret;
 }
 
-static int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
-			struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
+int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
+		 struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
 {
 	int cntr_id;
 
-- 
2.34.1



* [PATCH v11 19/23] x86/resctrl: Introduce the interface to switch between monitor modes
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (17 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 18/23] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-06 18:05   ` Reinette Chatre
  2025-01-22 20:20 ` [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported Babu Moger
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

The resctrl subsystem can support two monitoring modes, 'mbm_cntr_assign'
and 'default'. In 'mbm_cntr_assign' mode, a monitoring event can only
accumulate data while it is backed by a hardware counter. In 'default'
mode, resctrl assumes there is a hardware counter for each event within
every CTRL_MON and MON group.

Introduce an interface to switch between mbm_cntr_assign and default modes.

$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
[mbm_cntr_assign]
default
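
A userspace tool that wants to know which mode is active only has to find
the bracketed entry in that output. The following is a minimal sketch; the
function name is hypothetical, and it assumes the file contents have
already been read into a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (enabled) mode name from @contents into @out.
 * Returns 0 on success, or -1 if no complete "[...]" is found or
 * @out is too small to hold the name.
 */
static int enabled_mode(const char *contents, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t n;

	assert(contents && out);

	lb = strchr(contents, '[');
	rb = lb ? strchr(lb, ']') : NULL;
	if (!rb)
		return -1;

	n = (size_t)(rb - lb - 1);	/* length of the name between the brackets */
	if (n + 1 > outlen)
		return -1;

	memcpy(out, lb + 1, n);
	out[n] = '\0';
	return 0;
}
```

Given the output shown above, this yields "mbm_cntr_assign"; if the brackets
were around the second line instead, it would yield "default".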

To enable the "mbm_cntr_assign" mode:
$ echo "mbm_cntr_assign" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode

To enable the default monitoring mode:
$ echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode

MBM event counters are automatically reset as part of changing the mode.
Clear both architectural and non-architectural event states to prevent
overflow conditions during the next event read.
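
To see why stale software state matters, consider a simplified model of
counter accumulation. This is a sketch, not the kernel's implementation:
the 24-bit counter width, the struct and the function names are all
hypothetical. The delta helper treats a current reading smaller than the
previous one as a wraparound, which is exactly what goes wrong when the
hardware counter has been reset to zero but the saved previous reading
has not.

```c
#include <assert.h>
#include <stdint.h>

#define CNTR_WIDTH 24	/* hypothetical counter width in bits */

struct mbm_sketch_state {
	uint64_t prev;	/* last raw hardware reading */
	uint64_t total;	/* accumulated count */
};

/* Delta between two raw readings, treating cur < prev as a wraparound. */
static uint64_t wrap_delta(uint64_t prev, uint64_t cur)
{
	unsigned int shift = 64 - CNTR_WIDTH;

	return ((cur << shift) - (prev << shift)) >> shift;
}

/* Fold a new raw reading into the accumulated total. */
static void state_update(struct mbm_sketch_state *s, uint64_t raw)
{
	assert(s);
	s->total += wrap_delta(s->prev, raw);
	s->prev = raw;
}

/* Clear the software state, as must happen when the hardware resets. */
static void state_reset(struct mbm_sketch_state *s)
{
	assert(s);
	s->prev = 0;
	s->total = 0;
}
```

If the previous reading was 1000 and the hardware restarts from zero, a
next reading of 10 without a software reset is interpreted as a wrap and
adds 2^24 - 990 = 16776226 to the total. With the reset, the total is 10.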

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Changed the name of the function rdtgroup_mbm_assign_mode_write() to
     resctrl_mbm_assign_mode_write().
     Rewrote the commit message with context.
     Added few more details in resctrl.rst about mbm_cntr_assign mode.
     Re-arranged the text in resctrl.rst file.

v10: The call mbm_cntr_reset() has been moved to earlier patch.
     Minor documentation update.

v9: Fixed extra spaces in user documentation.
    Fixed a problem with changing the mode to mbm_cntr_assign when it is
    not supported. Added extra checks to detect if the system supports it.
    Used the rdtgroup_cntr_id_init to initialize cntr_id.

v8: Reset the internal counters after mbm_cntr_assign mode is changed.
    Renamed rdtgroup_mbm_cntr_reset() to mbm_cntr_reset()
    Updated the documentation to make text generic.

v7: Changed the interface name to mbm_assign_mode.
    Removed the references of ABMC.
    Added the changes to reset global and domain bitmaps.
    Added the changes to reset rmid.

v6: Changed the mode name to mbm_cntr_assign.
    Moved all the FS related code here.
    Added changes to reset mbm_cntr_map and resctrl group counters.

v5: Change log and mode description text correction.

v4: Minor commit text changes. Keep the default to ABMC when supported.
    Fixed comments to reflect changed interface "mbm_mode".

v3: New patch to address the review comments from upstream.
---
 Documentation/arch/x86/resctrl.rst     | 25 ++++++++++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 50 +++++++++++++++++++++++++-
 2 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 072b15550ff7..5d18c4c8bc48 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -259,7 +259,10 @@ with the following files:
 
 "mbm_assign_mode":
 	Reports the list of monitoring modes supported. The enclosed brackets
-	indicate which mode is enabled.
+	indicate which mode is enabled. The MBM events (mbm_total_bytes and/or
+	mbm_local_bytes) associated with counters may reset when "mbm_assign_mode"
+	is changed.
+
 	::
 
 	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
@@ -275,6 +278,16 @@ with the following files:
 	available is described in the "num_mbm_cntrs" file. Changing the mode
 	may cause all counters on a resource to reset.
 
+	Moving to mbm_cntr_assign mode requires users to assign the counters to
+	the events. Otherwise, the MBM event counters will return "Unassigned"
+	when read.
+
+	The mode is beneficial for AMD platforms that support more CTRL_MON
+	and MON groups than available hardware counters. By default, this
+	feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth
+	Monitoring Counters) capability, ensuring counters remain assigned even
+	when the corresponding RMID is not actively used by any processor.
+
 	"default":
 
 	In default mode, resctrl assumes there is a hardware counter for each
@@ -283,6 +296,16 @@ with the following files:
 	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
 	there is no counter associated with that event.
 
+	* To enable "mbm_cntr_assign" mode:
+	  ::
+
+	    # echo "mbm_cntr_assign" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+	* To enable default monitoring mode:
+	  ::
+
+	    # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
 "num_mbm_cntrs":
 	The number of monitoring counters available for assignment when the
 	system supports mbm_cntr_assign mode.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f61f0cd032ef..6922173c4f8f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -928,6 +928,53 @@ static int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
 	return ret;
 }
 
+static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,
+					     char *buf, size_t nbytes, loff_t off)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	int ret = 0;
+	bool enable;
+
+	/* Valid input requires a trailing newline */
+	if (nbytes == 0 || buf[nbytes - 1] != '\n')
+		return -EINVAL;
+
+	buf[nbytes - 1] = '\0';
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+
+	rdt_last_cmd_clear();
+
+	if (!strcmp(buf, "default")) {
+		enable = false;
+	} else if (!strcmp(buf, "mbm_cntr_assign")) {
+		if (r->mon.mbm_cntr_assignable) {
+			enable = true;
+		} else {
+			ret = -EINVAL;
+			rdt_last_cmd_puts("mbm_cntr_assign mode is not supported\n");
+			goto write_exit;
+		}
+	} else {
+		ret = -EINVAL;
+		rdt_last_cmd_puts("Unsupported assign mode\n");
+		goto write_exit;
+	}
+
+	if (enable != resctrl_arch_mbm_cntr_assign_enabled(r)) {
+		ret = resctrl_arch_mbm_cntr_assign_set(r, enable);
+		if (!ret)
+			mbm_cntr_reset(r);
+	}
+
+write_exit:
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+
+	return ret ?: nbytes;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1945,9 +1992,10 @@ static struct rftype res_common_files[] = {
 	},
 	{
 		.name		= "mbm_assign_mode",
-		.mode		= 0444,
+		.mode		= 0644,
 		.kf_ops		= &rdtgroup_kf_single_ops,
 		.seq_show	= resctrl_mbm_assign_mode_show,
+		.write		= resctrl_mbm_assign_mode_write,
 		.fflags		= RFTYPE_MON_INFO,
 	},
 	{
-- 
2.34.1



* [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (18 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 19/23] x86/resctrl: Introduce the interface to switch between monitor modes Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-21 18:06   ` James Morse
  2025-01-22 20:20 ` [PATCH v11 21/23] x86/resctrl: Update assignments on event configuration changes Babu Moger
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

Configure mbm_cntr_assign mode on AMD platforms. On AMD platforms, it
is recommended to use mbm_cntr_assign mode if supported, because in
default mode reading "mbm_total_bytes" or "mbm_local_bytes" will report
'Unavailable' if there is no counter associated with that event.

The mbm_cntr_assign mode, referred to as ABMC (Assignable Bandwidth
Monitoring Counters) on AMD, is enabled by default when supported by the
system.

Update ABMC across all logical processors within the resctrl domain to
ensure proper functionality.

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Commit text in imperative tone. Added few more details.
     Moved resctrl_arch_mbm_cntr_assign_set_one() to monitor.c.

v10: Commit text in imperative tone.

v9: Minor code change due to merge. Actual code did not change.

v8: Renamed resctrl_arch_mbm_cntr_assign_configure to
	resctrl_arch_mbm_cntr_assign_set_one.
    Added r->mon_capable check.
    Commit message update.

v7: Introduced resctrl_arch_mbm_cntr_assign_configure() to configure.
    Moved the default settings to rdt_get_mon_l3_config(). It should be
    done before the hotplug handler is called. It cannot be done at
    rdtgroup_init().

v6: Keeping the default enablement in arch init code for now.
     This may need some discussion.
     Renamed resctrl_arch_configure_abmc to resctrl_arch_mbm_cntr_assign_configure.

v5: New patch to enable ABMC by default.
---
 arch/x86/kernel/cpu/resctrl/internal.h | 1 +
 arch/x86/kernel/cpu/resctrl/monitor.c  | 8 ++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 ++++
 3 files changed, 13 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c006c4d8d6ff..2480698b643d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -734,4 +734,5 @@ int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d
 void mbm_cntr_reset(struct rdt_resource *r);
 int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
 		 struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
+void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
 #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3d748fdbcb5f..a9a5dc626a1e 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1233,6 +1233,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			r->mon.mbm_cntr_assignable = true;
 			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
 			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
+			hw_res->mbm_cntr_assign_enabled = true;
 			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
 			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
 		}
@@ -1313,6 +1314,13 @@ static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
 				 resctrl_abmc_set_one_amd, &enable, 1);
 }
 
+void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r)
+{
+	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+	resctrl_abmc_set_one_amd(&hw_res->mbm_cntr_assign_enabled);
+}
+
 int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6922173c4f8f..515969c5f64f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -4302,9 +4302,13 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 
 void resctrl_online_cpu(unsigned int cpu)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
 	mutex_lock(&rdtgroup_mutex);
 	/* The CPU is set in default rdtgroup after online. */
 	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+	if (r->mon_capable && r->mon.mbm_cntr_assignable)
+		resctrl_arch_mbm_cntr_assign_set_one(r);
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-- 
2.34.1



* [PATCH v11 21/23] x86/resctrl: Update assignments on event configuration changes
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (19 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-01-22 20:20 ` [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups Babu Moger
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

When BMEC (Bandwidth Monitoring Event Configuration) is supported,
resctrl provides an option to configure events by writing to the interfaces
/sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config or
/sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config.

Update MBM event assignments for all monitor groups in the affected domains
whenever the event configuration is changed.

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11:
   Added non-arch RMID reset code in mbm_config_write_domain() which was missing.
   Removed resctrl_arch_reset_rmid() call in resctrl_abmc_config_one_amd().
   Not required as the reset of arch and non-arch RMID counters is done by the callers.
   This simplifies the IPI code.
   Updated the code comments with Reinette's feedback.
   Updated the commit message in imperative mode.

v10: Code changed completely with domain specific counter assignment.
     Rewrite the commit message.
     Added few more code comments.

v9: Again patch changed completely based on the comment.
    https://lore.kernel.org/lkml/03b278b5-6c15-4d09-9ab7-3317e84a409e@intel.com/
    Introduced resctrl_mon_event_config_set to handle IPI.
    But sending another IPI inside IPI causes problem. Kernel reports SMP
    warning. So, introduced resctrl_arch_update_cntr() to send the command directly.

v8: Patch changed completely.
    Updated the assignment on same IPI as the event is updated.
    Could not do the way we discussed in the thread.
    https://lore.kernel.org/lkml/f77737ac-d3f6-3e4b-3565-564f79c86ca8@amd.com/
    Needed to figure out event type to update the configuration.

v7: New patch to update the assignments. Missed it earlier.
---
 arch/x86/kernel/cpu/resctrl/internal.h |  4 +-
 arch/x86/kernel/cpu/resctrl/monitor.c  | 58 ++++++++++++++++++++++----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 ++++-
 3 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 2480698b643d..aec564fa2833 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -607,11 +607,13 @@ union cpuid_0x10_x_edx {
 
 /**
  * struct mon_config_info - Monitoring event configuratiin details
+ * @r:			Resource for monitoring
  * @d:			Domain for the event
  * @evtid:		Event type
  * @mon_config:		Event configuration value
  */
 struct mon_config_info {
+	struct rdt_resource *r;
 	struct rdt_mon_domain *d;
 	enum resctrl_event_id evtid;
 	u32 mon_config;
@@ -721,12 +723,12 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
 bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
 void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom);
 unsigned int mon_event_config_index_get(u32 evtid);
-void resctrl_arch_mon_event_config_set(void *info);
 u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
 				      enum resctrl_event_id eventid);
 int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
 			     u32 cntr_id, bool assign);
+void resctrl_mon_event_config_set(void *info);
 int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
 			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
 int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index a9a5dc626a1e..024aabbecbb5 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1354,26 +1354,26 @@ u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
 	return INVALID_CONFIG_VALUE;
 }
 
-void resctrl_arch_mon_event_config_set(void *info)
+static void resctrl_arch_mon_event_config_set(struct rdt_mon_domain *d,
+					      enum resctrl_event_id eventid, u32 val)
 {
-	struct mon_config_info *mon_info = info;
 	struct rdt_hw_mon_domain *hw_dom;
 	unsigned int index;
 
-	index = mon_event_config_index_get(mon_info->evtid);
+	index = mon_event_config_index_get(eventid);
 	if (index == INVALID_CONFIG_INDEX)
 		return;
 
-	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
+	wrmsr(MSR_IA32_EVT_CFG_BASE + index, val, 0);
 
-	hw_dom = resctrl_to_arch_mon_dom(mon_info->d);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
-	switch (mon_info->evtid) {
+	switch (eventid) {
 	case QOS_L3_MBM_TOTAL_EVENT_ID:
-		hw_dom->mbm_total_cfg = mon_info->mon_config;
+		hw_dom->mbm_total_cfg = val;
 		break;
 	case QOS_L3_MBM_LOCAL_EVENT_ID:
-		hw_dom->mbm_local_cfg = mon_info->mon_config;
+		hw_dom->mbm_local_cfg = val;
 		break;
 	default:
 		break;
@@ -1592,3 +1592,45 @@ void mbm_cntr_reset(struct rdt_resource *r)
 		}
 	}
 }
+
+/*
+ * Update hardware counter configuration after event configuration change.
+ * Walk the hardware counters of domain @d to reconfigure all assigned
+ * counters that are monitoring @evtid with the event's new configuration
+ * value.
+ * This is run on a CPU belonging to domain @d so call
+ * resctrl_abmc_config_one_amd() directly.
+ */
+static void resctrl_arch_update_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+				     enum resctrl_event_id evtid, u32 val)
+{
+	union l3_qos_abmc_cfg abmc_cfg = { 0 };
+	struct rdtgroup *rdtgrp;
+	u32 cntr_id;
+
+	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
+		rdtgrp = d->cntr_cfg[cntr_id].rdtgrp;
+		if (rdtgrp && d->cntr_cfg[cntr_id].evtid == evtid) {
+			abmc_cfg.split.cfg_en = 1;
+			abmc_cfg.split.cntr_en = 1;
+			abmc_cfg.split.cntr_id = cntr_id;
+			abmc_cfg.split.bw_src = rdtgrp->mon.rmid;
+			abmc_cfg.split.bw_type = val;
+			resctrl_abmc_config_one_amd(&abmc_cfg);
+		}
+	}
+}
+
+void resctrl_mon_event_config_set(void *info)
+{
+	struct mon_config_info *mon_info = info;
+	struct rdt_mon_domain *d = mon_info->d;
+	struct rdt_resource *r = mon_info->r;
+
+	resctrl_arch_mon_event_config_set(d, mon_info->evtid, mon_info->mon_config);
+
+	/* Check if assignments need to be updated */
+	if (resctrl_arch_mbm_cntr_assign_enabled(r))
+		resctrl_arch_update_cntr(r, d, mon_info->evtid,
+					 mon_info->mon_config);
+}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 515969c5f64f..5d305d0ac053 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1740,10 +1740,10 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
 	return 0;
 }
 
-
 static void mbm_config_write_domain(struct rdt_resource *r,
 				    struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
+	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	struct mon_config_info mon_info = {0};
 	u32 config_val;
 
@@ -1755,6 +1755,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	if (config_val == INVALID_CONFIG_VALUE || config_val == val)
 		return;
 
+	mon_info.r = r;
 	mon_info.d = d;
 	mon_info.evtid = evtid;
 	mon_info.mon_config = val;
@@ -1766,7 +1767,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
 	smp_call_function_any(&d->hdr.cpu_mask,
-			      resctrl_arch_mon_event_config_set,
+			      resctrl_mon_event_config_set,
 			      &mon_info, 1);
 
 	/*
@@ -1779,6 +1780,11 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	 * mbm_local and mbm_total counts for all the RMIDs.
 	 */
 	resctrl_arch_reset_rmid_all(r, d);
+
+	if (is_mbm_total_enabled())
+		memset(d->mbm_total, 0, sizeof(struct mbm_state) * idx_limit);
+	if (is_mbm_local_enabled())
+		memset(d->mbm_local, 0, sizeof(struct mbm_state) * idx_limit);
 }
 
 static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
-- 
2.34.1



* [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (20 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 21/23] x86/resctrl: Update assignments on event configuration changes Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-19 13:53   ` Dave Martin
  2025-01-22 20:20 ` [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
                   ` (3 subsequent siblings)
  25 siblings, 1 reply; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

Provide the interface to list the assignment states of all the resctrl
groups in mbm_cntr_assign mode.

Example:
$ mount -t resctrl resctrl /sys/fs/resctrl/
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=tl;1=tl

List follows the following format:

"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

Format for specific type of groups:

- Default CTRL_MON group:
  "//<domain_id>=<flags>"

- Non-default CTRL_MON group:
  "<CTRL_MON group>//<domain_id>=<flags>"

- Child MON group of default CTRL_MON group:
  "/<MON group>/<domain_id>=<flags>"

- Child MON group of non-default CTRL_MON group:
  "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

Flags can be one of the following:
t  MBM total event is assigned
l  MBM local event is assigned
tl Both total and local MBM events are assigned
_  None of the MBM events are assigned
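
The flag rendering described above can be sketched in userspace C. This is
only an illustration under stated assumptions, not the kernel code in this
patch: flags_to_str() and its two boolean parameters are hypothetical names
standing in for the per-domain assignment lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: write the flag string for
 * one domain into buf. buf must hold at least 3 bytes ("tl" plus NUL).
 */
const char *flags_to_str(bool total_assigned, bool local_assigned,
			 char *buf, size_t len)
{
	char *p = buf;

	assert(buf && len >= 3);

	if (total_assigned)
		*p++ = 't';
	if (local_assigned)
		*p++ = 'l';

	/* Neither event has a counter: report '_' */
	if (p == buf)
		*p++ = '_';

	*p = '\0';
	return buf;
}
```

Building the string incrementally keeps the "tl" ordering fixed and makes
'_' appear only when no event is assigned.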

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Fixed printing the separator after each domain while listing the group assignments.
     Renamed rdtgroup_mbm_assign_control_show to resctrl_mbm_assign_control_show().

v10: Changes mostly due to domain specific counter assignment.

v9: Minor parameter update in resctrl_mbm_event_assigned().

v8: Moved resctrl_mbm_event_assigned() in here as it is first used here.
    Moved rdt_last_cmd_clear() before making any call.
    Updated the commit log.
    Corrected the doc format.

v7: Renamed the interface name from 'mbm_control' to 'mbm_assign_control'
    to match 'mbm_assign_mode'.
    Removed Arch references from FS code.
    Added rdt_last_cmd_clear() before the command processing.
    Added rdtgroup_mutex before all the calls.
    Removed references of ABMC from FS code.

v6: The domain specific assignment can be determined looking at mbm_cntr_map.
    Removed rdtgroup_abmc_dom_cfg() and rdtgroup_abmc_dom_state().
    Removed the switch statement for the domain_state detection.
    Determined the flags incrementally.
    Removed special handling of default group while printing.

v5: Replaced "assignment flags" with "flags".
    Changes related to mon structure.
    Changes related to renaming the interface from mbm_assign_control to
    mbm_control.

v4: Added functionality to query domain specific assignment in
    rdtgroup_abmc_dom_state().

v3: New patch.
    Addresses the feedback to provide the global assignment interface.
    https://lore.kernel.org/lkml/c73f444b-83a1-4e9a-95d3-54c5165ee782@intel.com/
---
 Documentation/arch/x86/resctrl.rst     | 44 ++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 81 ++++++++++++++++++++++++++
 3 files changed, 126 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 5d18c4c8bc48..3040e5c4cd76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -330,6 +330,50 @@ with the following files:
 	 # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
 	 0=30;1=30
 
+"mbm_assign_control":
+	Reports the resctrl group and monitor status of each group.
+
+	List follows the following format:
+		"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
+
+	Format for specific type of groups:
+
+	* Default CTRL_MON group:
+		"//<domain_id>=<flags>"
+
+	* Non-default CTRL_MON group:
+		"<CTRL_MON group>//<domain_id>=<flags>"
+
+	* Child MON group of default CTRL_MON group:
+		"/<MON group>/<domain_id>=<flags>"
+
+	* Child MON group of non-default CTRL_MON group:
+		"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
+
+	Flags can be one of the following:
+	::
+
+	 t  MBM total event is assigned.
+	 l  MBM local event is assigned.
+	 tl Both MBM total and local events are assigned.
+	 _  None of the MBM events are assigned.
+
+	Examples:
+	::
+
+	 # mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
+	 # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
+	 # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
+
+	 # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	 non_default_ctrl_mon_grp//0=tl;1=tl
+	 non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
+	 //0=tl;1=tl
+	 /child_default_mon_grp/0=tl;1=tl
+
+	There are four resctrl groups. All the groups have total and local MBM events
+	assigned on domain 0 and 1.
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 024aabbecbb5..2dd6c47c9276 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1236,6 +1236,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			hw_res->mbm_cntr_assign_enabled = true;
 			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
 			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
+			resctrl_file_fflags_init("mbm_assign_control", RFTYPE_MON_INFO);
 		}
 	}
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5d305d0ac053..6e29827239e0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -975,6 +975,81 @@ static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static char *rdtgroup_mon_state_to_str(struct rdt_resource *r,
+				       struct rdt_mon_domain *d,
+				       struct rdtgroup *rdtgrp, char *str)
+{
+	char *tmp = str;
+
+	/* Query the total and local event flags for the domain */
+	if (mbm_cntr_get(r, d, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID) != -ENOENT)
+		*tmp++ = 't';
+
+	if (mbm_cntr_get(r, d, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID) != -ENOENT)
+		*tmp++ = 'l';
+
+	if (tmp == str)
+		*tmp++ = '_';
+
+	*tmp = '\0';
+	return str;
+}
+
+static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
+					   struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	struct rdtgroup *rdtg, *crg;
+	struct rdt_mon_domain *dom;
+	char str[10];
+	bool sep;
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+	rdt_last_cmd_clear();
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
+		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
+		mutex_unlock(&rdtgroup_mutex);
+		cpus_read_unlock();
+		return -EINVAL;
+	}
+
+	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+		seq_printf(s, "%s//", rdtg->kn->name);
+
+		sep = false;
+		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+			if (sep)
+				seq_puts(s, ";");
+
+			seq_printf(s, "%d=%s", dom->hdr.id,
+				   rdtgroup_mon_state_to_str(r, dom, rdtg, str));
+
+			sep = true;
+		}
+		seq_putc(s, '\n');
+
+		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
+			seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);
+
+			sep = false;
+			list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+				if (sep)
+					seq_puts(s, ";");
+				seq_printf(s, "%d=%s", dom->hdr.id,
+					   rdtgroup_mon_state_to_str(r, dom, crg, str));
+				sep = true;
+			}
+			seq_putc(s, '\n');
+		}
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1996,6 +2071,12 @@ static struct rftype res_common_files[] = {
 		.seq_show	= mbm_local_bytes_config_show,
 		.write		= mbm_local_bytes_config_write,
 	},
+	{
+		.name		= "mbm_assign_control",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= resctrl_mbm_assign_control_show,
+	},
 	{
 		.name		= "mbm_assign_mode",
 		.mode		= 0644,
-- 
2.34.1



* [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (21 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups Babu Moger
@ 2025-01-22 20:20 ` Babu Moger
  2025-02-06 18:48   ` Reinette Chatre
                     ` (2 more replies)
  2025-02-03 14:54 ` [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Peter Newman
                   ` (2 subsequent siblings)
  25 siblings, 3 replies; 209+ messages in thread
From: Babu Moger @ 2025-01-22 20:20 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	babu.moger, xin3.li, andrew.cooper3, ebiggers, mario.limonciello,
	james.morse, tan.shaopeng, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian

When mbm_cntr_assign mode is enabled, users can designate which of the MBM
events in the CTRL_MON or MON groups should have counters assigned.

Provide an interface for assigning MBM events by writing to the file:
/sys/fs/resctrl/info/L3_MON/mbm_assign_control. Using this interface,
events can be assigned or unassigned as needed.

Format is similar to the list format with addition of opcode for the
assignment operation.
 "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

Format for specific type of groups:

 * Default CTRL_MON group:
         "//<domain_id><opcode><flags>"

 * Non-default CTRL_MON group:
         "<CTRL_MON group>//<domain_id><opcode><flags>"

 * Child MON group of default CTRL_MON group:
         "/<MON group>/<domain_id><opcode><flags>"

 * Child MON group of non-default CTRL_MON group:
         "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

Domain_id '*' will apply the flags on all the domains.
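
As a rough illustration of the group formats above, the following userspace
sketch splits one input line at its first two '/' characters and classifies
the group type. It is not the parser from this patch: split_group_path(),
struct group_path and enum group_kind are hypothetical names, and rejecting
input with fewer than two '/' is a choice made by this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum group_kind { GROUP_CTRL_MON, GROUP_MON };

struct group_path {
	char *ctrl_mon;		/* "" means the default CTRL_MON group */
	char *mon;		/* "" means the CTRL_MON group itself */
	char *rest;		/* "<domain_id><opcode><flags>[;...]" */
	enum group_kind kind;
};

/*
 * Split line in place at the first two '/' characters.
 * Returns 0 on success, -1 if fewer than two '/' are present.
 */
int split_group_path(char *line, struct group_path *out)
{
	char *first, *second;

	assert(line && out);

	first = strchr(line, '/');
	if (!first)
		return -1;

	second = strchr(first + 1, '/');
	if (!second)
		return -1;

	/* Terminate both group names */
	*first = '\0';
	*second = '\0';

	out->ctrl_mon = line;
	out->mon = first + 1;
	out->rest = second + 1;
	out->kind = (*out->mon == '\0') ? GROUP_CTRL_MON : GROUP_MON;
	return 0;
}
```

With this split, "//0=tl" yields two empty names (the default CTRL_MON
group) and "grp/child/1-t" yields a MON group named "child" under "grp".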

Opcode can be one of the following:

 = Update the assignment to match the flags.
 + Assign a new MBM event without impacting existing assignments.
 - Unassign an MBM event from currently assigned events.

Assignment flags can be one of the following:
 t  MBM total event
 l  MBM local event
 tl Both total and local MBM events
 _  None of the MBM events. Valid only with '=' opcode. This flag cannot
    be combined with other flags.
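
The opcode and flag rules above can be sketched as a translation into two
event sets, one to assign and one to unassign. This is a userspace
illustration under stated assumptions, not the code in this patch:
op_to_masks(), EV_TOTAL and EV_LOCAL are hypothetical names, and rejecting
'_' combined with other flags is this sketch's reading of the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EV_TOTAL 0x1u	/* hypothetical bit for the MBM total event */
#define EV_LOCAL 0x2u	/* hypothetical bit for the MBM local event */

/*
 * Translate an opcode and a flag string into the events to assign and
 * unassign. Returns 0 on success, -1 on invalid input.
 */
int op_to_masks(char op, const char *flags,
		unsigned int *assign, unsigned int *unassign)
{
	unsigned int mask = 0;
	int none = 0;
	size_t i, n;

	assert(flags && assign && unassign);

	n = strlen(flags);
	if (n == 0)
		return -1;

	for (i = 0; i < n; i++) {
		switch (flags[i]) {
		case 't':
			mask |= EV_TOTAL;
			break;
		case 'l':
			mask |= EV_LOCAL;
			break;
		case '_':
			none = 1;
			break;
		default:
			return -1;
		}
	}

	/* '_' must stand alone */
	if (none && n != 1)
		return -1;

	*assign = 0;
	*unassign = 0;

	switch (op) {
	case '=':
		/* Assign exactly the listed events, unassign the rest */
		*assign = mask;
		*unassign = (EV_TOTAL | EV_LOCAL) & ~mask;
		return 0;
	case '+':
		if (none)
			return -1;
		*assign = mask;
		return 0;
	case '-':
		if (none)
			return -1;
		*unassign = mask;
		return 0;
	default:
		return -1;
	}
}
```

Under these rules "=t" assigns total and unassigns local, "=_" unassigns
both events, and "+_" or "-_" is rejected.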

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Fixed the static check warning with initializing dom_id in resctrl_process_flags().

v10: Fixed the issue with finding the domain in multiple iterations.
     Printed error message with domain information when assign fails.
     Changed the variables to unsigned for processing assign state.
     Taken care of few format corrections.

v9: Fixed handling special case '//0=' and '//'.
    Removed extra strstr() call.
    Added generic failure text when assignment operation fails.
    Corrected user documentation format texts.

v8: Moved unassign as the first action during the assign modification.
    Assign none "_" takes priority. Cannot be mixed with other flags.
    Updated the documentation and .rst file format. htmldoc looks ok.

v7: Simplified the parsing (strsep(&token, "//")) in rdtgroup_mbm_assign_control_write().
    Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.
    Renamed rdtgroup_find_grp to rdtgroup_find_grp_by_name.
    Fixed rdtgroup_str_to_mon_state to return error for invalid flags.
    Simplified the calls to rdtgroup_assign_cntr by merging a few functions earlier.
    Removed ABMC reference in FS code.
    Reinette commented about handling the combination of flags like 'lt_' and '_lt'.
    Not sure if we need to change the behaviour here. Processed them sequentially right now.
    Users have the liberty to pass the flags. Restricting it might be a problem later.

v6: Added support to assign all domains if domain id is '*'.
    Fixed the allocation of counter id if it is not assigned already.

v5: Interface name changed from mbm_assign_control to mbm_control.
    Fixed opcode and flags combination.
    '=_' is valid.
    '-_' and '+_' are not valid.
    Minor message update.
    Renamed the function with prefix - rdtgroup_.
    Corrected few documentation mistakes.
    Rebase related changes after SNC support.

v4: Added domain specific assignments. Fixed the opcode parsing.

v3: New patch.
    Addresses the feedback to provide the global assignment interface.
---
 Documentation/arch/x86/resctrl.rst     | 116 +++++++++++-
 arch/x86/kernel/cpu/resctrl/internal.h |  10 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 241 ++++++++++++++++++++++++-
 3 files changed, 365 insertions(+), 2 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 3040e5c4cd76..47e15b48d951 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -356,7 +356,8 @@ with the following files:
 	 t  MBM total event is assigned.
 	 l  MBM local event is assigned.
 	 tl Both MBM total and local events are assigned.
-	 _  None of the MBM events are assigned.
+	 _  None of the MBM events are assigned. Only works with opcode '=' for write
+	    and cannot be combined with other flags.
 
 	Examples:
 	::
@@ -374,6 +375,119 @@ with the following files:
 	There are four resctrl groups. All the groups have total and local MBM events
 	assigned on domain 0 and 1.
 
+	Assignment state can be updated by writing to "mbm_assign_control".
+
+	Format is similar to the list format with addition of opcode for the
+	assignment operation.
+
+		"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
+
+	Format for each type of group:
+
+	* Default CTRL_MON group:
+		"//<domain_id><opcode><flags>"
+
+	* Non-default CTRL_MON group:
+		"<CTRL_MON group>//<domain_id><opcode><flags>"
+
+	* Child MON group of default CTRL_MON group:
+		"/<MON group>/<domain_id><opcode><flags>"
+
+	* Child MON group of non-default CTRL_MON group:
+		"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
+
+	Domain_id '*' will apply the flags to all the domains.
+
+	Opcode can be one of the following:
+	::
+
+	 = Update the assignment to match the flags.
+	 + Assign a new MBM event without impacting existing assignments.
+	 - Unassign an MBM event from currently assigned events.
+
+	Examples:
+	Initial group status:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
+	  //0=tl;1=tl
+	  /child_default_mon_grp/0=tl;1=tl
+
+	To update the default group to assign only total MBM event on domain 0:
+	::
+
+	  # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
+	  //0=t;1=tl
+	  /child_default_mon_grp/0=tl;1=tl
+
+	To update the MON group child_default_mon_grp to remove total MBM event on domain 1:
+	::
+
+	  # echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
+	  //0=t;1=tl
+	  /child_default_mon_grp/0=tl;1=l
+
+	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to unassign
+	both local and total MBM events on domain 1:
+	::
+
+	  # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" > \
+			/sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
+	  //0=t;1=tl
+	  /child_default_mon_grp/0=tl;1=l
+
+	To update the default group to add a local MBM event on domain 0:
+	::
+
+	  # echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
+	  //0=tl;1=tl
+	  /child_default_mon_grp/0=tl;1=l
+
+	To update the non-default CTRL_MON group non_default_ctrl_mon_grp to unassign all the
+	MBM events on all the domains:
+	::
+
+	  # echo "non_default_ctrl_mon_grp//*=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=_;1=_
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
+	  //0=tl;1=tl
+	  /child_default_mon_grp/0=tl;1=l
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index aec564fa2833..377b5db66793 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -62,6 +62,16 @@
 /* Setting bit 0 in L3_QOS_EXT_CFG enables the ABMC feature. */
 #define ABMC_ENABLE_BIT			0
 
+/*
+ * Assignment flags for mbm_cntr_assign mode
+ */
+enum {
+	ASSIGN_NONE	= 0,
+	ASSIGN_TOTAL	= BIT(QOS_L3_MBM_TOTAL_EVENT_ID),
+	ASSIGN_LOCAL	= BIT(QOS_L3_MBM_LOCAL_EVENT_ID),
+	ASSIGN_INVALID,
+};
+
 /**
  * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
  *			        aren't marked nohz_full
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6e29827239e0..299839bcf23f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static unsigned int resctrl_str_to_mon_state(char *flag)
+{
+	unsigned int i, mon_state = ASSIGN_NONE;
+
+	if (!strlen(flag))
+		return ASSIGN_INVALID;
+
+	for (i = 0; i < strlen(flag); i++) {
+		switch (*(flag + i)) {
+		case 't':
+			mon_state |= ASSIGN_TOTAL;
+			break;
+		case 'l':
+			mon_state |= ASSIGN_LOCAL;
+			break;
+		case '_':
+			return ASSIGN_NONE;
+		default:
+			return ASSIGN_INVALID;
+		}
+	}
+
+	return mon_state;
+}
+
+static struct rdtgroup *rdtgroup_find_grp_by_name(enum rdt_group_type rtype,
+						  char *p_grp, char *c_grp)
+{
+	struct rdtgroup *rdtg, *crg;
+
+	if (rtype == RDTCTRL_GROUP && *p_grp == '\0') {
+		return &rdtgroup_default;
+	} else if (rtype == RDTCTRL_GROUP) {
+		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list)
+			if (!strcmp(p_grp, rdtg->kn->name))
+				return rdtg;
+	} else if (rtype == RDTMON_GROUP) {
+		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+			if (!strcmp(p_grp, rdtg->kn->name)) {
+				list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
+						    mon.crdtgrp_list) {
+					if (!strcmp(c_grp, crg->kn->name))
+						return crg;
+				}
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static int resctrl_process_flags(struct rdt_resource *r,
+				 enum rdt_group_type rtype,
+				 char *p_grp, char *c_grp, char *tok)
+{
+	unsigned int op, mon_state, assign_state, unassign_state;
+	char *dom_str, *id_str, *op_str;
+	struct rdt_mon_domain *d;
+	unsigned long dom_id = 0;
+	struct rdtgroup *rdtgrp;
+	char domain[10];
+	bool found;
+	int ret;
+
+	rdtgrp = rdtgroup_find_grp_by_name(rtype, p_grp, c_grp);
+
+	if (!rdtgrp) {
+		rdt_last_cmd_puts("Not a valid resctrl group\n");
+		return -EINVAL;
+	}
+
+next:
+	if (!tok || tok[0] == '\0')
+		return 0;
+
+	/* Start processing the strings for each domain */
+	dom_str = strim(strsep(&tok, ";"));
+
+	op_str = strpbrk(dom_str, "=+-");
+
+	if (op_str) {
+		op = *op_str;
+	} else {
+		rdt_last_cmd_puts("Missing operation =, +, - character\n");
+		return -EINVAL;
+	}
+
+	id_str = strsep(&dom_str, "=+-");
+
+	/* Check for domain id '*' which means all domains */
+	if (id_str && *id_str == '*') {
+		d = NULL;
+		goto check_state;
+	} else if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
+		rdt_last_cmd_puts("Missing domain id\n");
+		return -EINVAL;
+	}
+
+	/* Verify if the dom_id is valid */
+	found = false;
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		rdt_last_cmd_printf("Invalid domain id %lu\n", dom_id);
+		return -EINVAL;
+	}
+
+check_state:
+	mon_state = resctrl_str_to_mon_state(dom_str);
+
+	if (mon_state == ASSIGN_INVALID) {
+		rdt_last_cmd_puts("Invalid assign flag\n");
+		goto out_fail;
+	}
+
+	assign_state = 0;
+	unassign_state = 0;
+
+	switch (op) {
+	case '+':
+		if (mon_state == ASSIGN_NONE) {
+			rdt_last_cmd_puts("Invalid assign opcode\n");
+			goto out_fail;
+		}
+		assign_state = mon_state;
+		break;
+	case '-':
+		if (mon_state == ASSIGN_NONE) {
+			rdt_last_cmd_puts("Invalid assign opcode\n");
+			goto out_fail;
+		}
+		unassign_state = mon_state;
+		break;
+	case '=':
+		assign_state = mon_state;
+		unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
+		break;
+	default:
+		break;
+	}
+
+	if (unassign_state & ASSIGN_TOTAL) {
+		ret = resctrl_unassign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
+		if (ret)
+			goto out_fail;
+	}
+
+	if (unassign_state & ASSIGN_LOCAL) {
+		ret = resctrl_unassign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
+		if (ret)
+			goto out_fail;
+	}
+
+	if (assign_state & ASSIGN_TOTAL) {
+		ret = resctrl_assign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
+		if (ret)
+			goto out_fail;
+	}
+
+	if (assign_state & ASSIGN_LOCAL) {
+		ret = resctrl_assign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
+		if (ret)
+			goto out_fail;
+	}
+
+	goto next;
+
+out_fail:
+	sprintf(domain, d ? "%lu" : "*", dom_id);
+
+	rdt_last_cmd_printf("Assign operation '%s%c%s' failed on the group %s/%s/\n",
+			    domain, op, dom_str, p_grp, c_grp);
+
+	return -EINVAL;
+}
+
+static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
+						char *buf, size_t nbytes, loff_t off)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	char *token, *cmon_grp, *mon_grp;
+	enum rdt_group_type rtype;
+	int ret;
+
+	/* Valid input requires a trailing newline */
+	if (nbytes == 0 || buf[nbytes - 1] != '\n')
+		return -EINVAL;
+
+	buf[nbytes - 1] = '\0';
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+
+	rdt_last_cmd_clear();
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
+		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
+		mutex_unlock(&rdtgroup_mutex);
+		cpus_read_unlock();
+		return -EINVAL;
+	}
+
+	while ((token = strsep(&buf, "\n")) != NULL) {
+		/*
+		 * The write command follows the following format:
+		 * "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
+		 * Extract the CTRL_MON group.
+		 */
+		cmon_grp = strsep(&token, "/");
+
+		/*
+		 * Extract the MON_GROUP.
+		 * strsep returns empty string for contiguous delimiters.
+		 * Empty mon_grp here means it is a RDTCTRL_GROUP.
+		 */
+		mon_grp = strsep(&token, "/");
+
+		if (!mon_grp || *mon_grp == '\0')
+			rtype = RDTCTRL_GROUP;
+		else
+			rtype = RDTMON_GROUP;
+
+		ret = resctrl_process_flags(r, rtype, cmon_grp, mon_grp, token);
+		if (ret)
+			break;
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+
+	return ret ?: nbytes;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -2073,9 +2311,10 @@ static struct rftype res_common_files[] = {
 	},
 	{
 		.name		= "mbm_assign_control",
-		.mode		= 0444,
+		.mode		= 0644,
 		.kf_ops		= &rdtgroup_kf_single_ops,
 		.seq_show	= resctrl_mbm_assign_control_show,
+		.write		= resctrl_mbm_assign_control_write,
 	},
 	{
 		.name		= "mbm_assign_mode",
-- 
2.34.1



* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (22 preceding siblings ...)
  2025-01-22 20:20 ` [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
@ 2025-02-03 14:54 ` Peter Newman
  2025-02-03 20:49   ` Moger, Babu
  2025-02-12 17:46 ` Dave Martin
  2025-02-21 18:07 ` James Morse
  25 siblings, 1 reply; 209+ messages in thread
From: Peter Newman @ 2025-02-03 14:54 UTC (permalink / raw)
  To: Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On Wed, Jan 22, 2025 at 9:20 PM Babu Moger <babu.moger@amd.com> wrote:
>
>
> This series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
>
> Series is written such that it is easier to support other assignable
> features supported from different vendors.
>
> The feature details are documented in the  APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> The patches are based on top of commit
> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
>
> # Introduction
>
> Users can create as many monitor groups as RMIDs supported by the hardware.
> However, bandwidth monitoring feature on AMD system only guarantees that
> RMIDs currently assigned to a processor will be tracked by hardware.
> The counters of any other RMIDs which are no longer being tracked will be
> reset to zero. The MBM event counters return "Unavailable" for the RMIDs
> that are not tracked by hardware. So, there can be only limited number of
> groups that can give guaranteed monitoring numbers. With ever changing
> configurations there is no way to definitely know which of these groups
> are being tracked for certain point of time. Users do not have the option
> to monitor a group or set of groups for certain period of time without
> worrying about counter being reset in between.
>
> The ABMC feature provides an option to the user to assign a hardware
> counter to an RMID, event pair and monitor the bandwidth as long as it is
> assigned.  The assigned RMID will be tracked by the hardware until the user
> unassigns it manually. There is no need to worry about counters being reset
> during this period. Additionally, the user can specify a bitmask identifying
> the specific bandwidth types from the given source to track with the counter.
>
> Without ABMC enabled, monitoring will work in current 'default' mode without
> assignment option.
>
> # Linux Implementation
>
> Create a generic interface aimed to support user space assignment
> of scarce counters used for monitoring. First usage of interface
> is by ABMC with option to expand usage to "soft-ABMC" and MPAM
> counters in future.

As a reminder of the work related to this, please take a look at the
thread where Reinette proposed a "shared counters" mode in
mbm_assign_control[1]. I am currently working to demonstrate that this
combined with the mbm_*_bytes_per_second events discussed earlier in
the same thread will address my users' concerns about the overhead of
reading a large number of MBM counters, resulting from a maximal
number of monitoring groups whose jobs are not isolated to any L3
monitoring domain.

ABMC will add to the number of registers which need to be programmed
in each domain, so I will need to demonstrate that ABMC combined with
these additional features addresses their performance concerns and
that the resulting interface is user-friendly enough that they will
not need a detailed understanding of the implementation to avoid an
unacceptable performance degradation (i.e., needing to understand what
conditions will increase the number of IPIs required).

If all goes well, soft-ABMC will try to extend this usage model to the
existing, pre-ABMC, AMD platforms I support.

Thanks,
-Peter

[1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-03 14:54 ` [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Peter Newman
@ 2025-02-03 20:49   ` Moger, Babu
  2025-02-13 17:51     ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-03 20:49 UTC (permalink / raw)
  To: Peter Newman
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 2/3/25 08:54, Peter Newman wrote:
> Hi Babu,
> 
> On Wed, Jan 22, 2025 at 9:20 PM Babu Moger <babu.moger@amd.com> wrote:
>>
>>
>> This series adds the support for Assignable Bandwidth Monitoring Counters
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> Series is written such that it is easier to support other assignable
>> features supported from different vendors.
>>
>> The feature details are documented in the  APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
>>
>> # Introduction
>>
>> Users can create as many monitor groups as RMIDs supported by the hardware.
>> However, bandwidth monitoring feature on AMD system only guarantees that
>> RMIDs currently assigned to a processor will be tracked by hardware.
>> The counters of any other RMIDs which are no longer being tracked will be
>> reset to zero. The MBM event counters return "Unavailable" for the RMIDs
>> that are not tracked by hardware. So, there can be only limited number of
>> groups that can give guaranteed monitoring numbers. With ever changing
>> configurations there is no way to definitely know which of these groups
>> are being tracked for certain point of time. Users do not have the option
>> to monitor a group or set of groups for certain period of time without
>> worrying about counter being reset in between.
>>
>> The ABMC feature provides an option to the user to assign a hardware
>> counter to an RMID, event pair and monitor the bandwidth as long as it is
>> assigned.  The assigned RMID will be tracked by the hardware until the user
>> unassigns it manually. There is no need to worry about counters being reset
>> during this period. Additionally, the user can specify a bitmask identifying
>> the specific bandwidth types from the given source to track with the counter.
>>
>> Without ABMC enabled, monitoring will work in current 'default' mode without
>> assignment option.
>>
>> # Linux Implementation
>>
>> Create a generic interface aimed to support user space assignment
>> of scarce counters used for monitoring. First usage of interface
>> is by ABMC with option to expand usage to "soft-ABMC" and MPAM
>> counters in future.
> 
> As a reminder of the work related to this, please take a look at the
> thread where Reinette proposed a "shared counters" mode in
> mbm_assign_control[1]. I am currently working to demonstrate that this
> combined with the mbm_*_bytes_per_second events discussed earlier in
> the same thread will address my users' concerns about the overhead of
> reading a large number of MBM counters, resulting from a maximal
> number of monitoring groups whose jobs are not isolated to any L3
> monitoring domain.
> 
> ABMC will add to the number of registers which need to be programmed
> in each domain, so I will need to demonstrate that ABMC combined with
> these additional features addresses their performance concerns and
> that the resulting interface is user-friendly enough that they will
> not need a detailed understanding of the implementation to avoid an
> unacceptable performance degradation (i.e., needing to understand what
> conditions will increase the number of IPIs required).
> 
> If all goes well, soft-ABMC will try to extend this usage model to the
> existing, pre-ABMC, AMD platforms I support.
> 
> Thanks,
> -Peter
> 
> [1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/
> 

Thanks for the heads-up. I understand what's going on and have an idea of
the plan. Please keep us updated on the progress. Also, if any changes are
needed in this series to meet your requirements, feel free to share your
feedback.
-- 
Thanks
Babu Moger


* Re: [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init()
  2025-01-22 20:20 ` [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init() Babu Moger
@ 2025-02-05 22:22   ` Reinette Chatre
  2025-02-19 13:28   ` Dave Martin
  1 sibling, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-05 22:22 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> resctrl_late_init() has the __init attribute, but some of the functions
> called from it do not have the __init attribute.
> 
> Add the __init attribute to all the functions in the call sequences to
> maintain consistency throughout.
> 
> Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
> Fixes: def10853930a ("x86/intel_rdt: Add two new resources for L2 Code and Data Prioritization (CDP)")
> Fixes: bd334c86b5d7 ("x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()")
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

Thank you.

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette



* Re: [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2025-01-22 20:20 ` [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature Babu Moger
@ 2025-02-05 22:49   ` Reinette Chatre
  2025-02-06 16:15     ` Moger, Babu
  2025-02-21 18:05   ` James Morse
  1 sibling, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-05 22:49 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> Add the functionality to enable/disable AMD ABMC feature.
> 
> AMD ABMC feature is enabled by setting enabled bit(0) in MSR
> L3_QOS_EXT_CFG. When the state of ABMC is changed, the MSR needs
> to be updated on all the logical processors in the QOS Domain.
> 
> Hardware counters will reset when ABMC state is changed.

I find that the state management in this series is better organized
and easier to understand. I do think it can be simplified further. A
hint is that the counter reset is mentioned here but not done in the
code introduced here; it is instead required from the caller. Would it
not be simpler to reset the architectural state at the same time as
ABMC is enabled/disabled?

> 
> The ABMC feature details are documented in APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

...

> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index c3d7d4c3009a..a7526306f5e4 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1261,3 +1261,39 @@ void __init intel_rdt_mbm_apply_quirk(void)
>  	mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
>  	mbm_cf = mbm_cf_table[cf_index].cf;
>  }
> +
> +static void resctrl_abmc_set_one_amd(void *arg)
> +{
> +	bool *enable = arg;
> +
> +	if (*enable)
> +		msr_set_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
> +	else
> +		msr_clear_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
> +}
> +
> +/*
> + * Update L3_QOS_EXT_CFG MSR on all the CPUs associated with the monitor
> + * domain.

All monitor domains are impacted and the above does not clearly state "why".
How about:
 * ABMC enable/disable requires update of L3_QOS_EXT_CFG MSR on all the CPUs
 * associated with all monitor domains.


> + */
> +static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
> +{
> +	struct rdt_mon_domain *d;
> +
> +	list_for_each_entry(d, &r->mon_domains, hdr.list)
> +		on_each_cpu_mask(&d->hdr.cpu_mask,
> +				 resctrl_abmc_set_one_amd, &enable, 1);
> +}
> +
> +int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
> +{
> +	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> +
> +	if (r->mon.mbm_cntr_assignable &&
> +	    hw_res->mbm_cntr_assign_enabled != enable) {
> +		_resctrl_abmc_enable(r, enable);
> +		hw_res->mbm_cntr_assign_enabled = enable;

An added benefit of resetting the architectural state within this if statement
(perhaps more simply done within _resctrl_abmc_enable()) is that it will
not be done unnecessarily if ABMC is already in the requested state.

> +	}
> +
> +	return 0;
> +}
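
To illustrate the suggestion, here is a minimal userspace sketch (not kernel
code) in which the helper that toggles the feature also resets the cached
counter state, so the reset happens only on an actual state change. All names
(fake_resource, fake_abmc_enable(), fake_cntr_assign_set(), FAKE_NUM_CNTRS) are
hypothetical stand-ins, and the MSR update on each CPU is modeled by a plain
flag.

```c
#include <stdbool.h>
#include <string.h>

#define FAKE_NUM_CNTRS 4	/* hypothetical; real hardware reports its own count */

/* Hypothetical stand-in for the per-resource hardware state. */
struct fake_resource {
	bool assignable;	/* models r->mon.mbm_cntr_assignable */
	bool enabled;		/* models hw_res->mbm_cntr_assign_enabled */
	unsigned long long arch_state[FAKE_NUM_CNTRS]; /* cached counter state */
	unsigned int resets;	/* number of times the state was reset */
};

/* Toggle the feature and reset the architectural state in one place. */
static void fake_abmc_enable(struct fake_resource *r, bool enable)
{
	/* The kernel would update the MSR on every CPU of every domain here. */
	r->enabled = enable;

	/* Hardware counters reset on a mode change, so drop cached state too. */
	memset(r->arch_state, 0, sizeof(r->arch_state));
	r->resets++;
}

/* Only act when the requested state differs from the current one. */
static int fake_cntr_assign_set(struct fake_resource *r, bool enable)
{
	if (r->assignable && r->enabled != enable)
		fake_abmc_enable(r, enable);

	return 0;
}
```

With this shape the caller no longer has to remember the reset, and a request
for the state that is already active leaves the cached state untouched.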

Reinette


* Re: [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters
  2025-01-22 20:20 ` [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters Babu Moger
@ 2025-02-05 23:17   ` Reinette Chatre
  2025-02-07 17:18     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-05 23:17 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> The mbm_cntr_assign mode provides an option to the user to assign a
> counter to an RMID, event pair and monitor the bandwidth as long as
> the counter is assigned. Number of assignments depend on number of
> monitoring counters available.
> 
> Provide the interface to display the number of monitoring counters
> supported. The resctrl file 'num_mbm_cntrs' is visible to user space
> when the system supports mbm_cntr_assign mode.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

...

> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index b5defc5bce0e..31ff764deeeb 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -283,6 +283,22 @@ with the following files:
>  	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
>  	there is no counter associated with that event.
>  
> +"num_mbm_cntrs":
> +	The number of monitoring counters available for assignment when the
> +	system supports mbm_cntr_assign mode.
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> +	  32
> +
> +	The resctrl file system supports tracking up to two memory bandwidth
> +	events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
> +	Up to two counters can be assigned per monitoring group, one for each
> +	memory bandwidth event. More monitoring groups can be tracked by
> +	assigning one counter per monitoring group. However, doing so limits
> +	memory bandwidth tracking to a single memory bandwidth event per
> +	monitoring group.
> +

This text needs an update to reflect the switch to per-domain counter assignment.

Reinette


* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-01-22 20:20 ` [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain Babu Moger
@ 2025-02-05 23:57   ` Reinette Chatre
  2025-02-07 18:23     ` Moger, Babu
  2025-02-21 18:07   ` James Morse
  1 sibling, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-05 23:57 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> In mbm_cntr_assign mode hardware counters are assigned/unassigned to an
> MBM event of a monitor group. Hardware counters are assigned/unassigned
> at monitoring domain level.
> 
> Manage a monitoring domain's hardware counters using a per monitoring
> domain array of struct mbm_cntr_cfg that is indexed by the hardware
> counter	ID. A hardware counter's configuration contains the MBM event

Something strange in this changelog with a few random \t in the text.

> ID and points to the monitoring group that it is assigned to, with a
> NULL pointer meaning that the hardware counter is available for assignment.
> 
> There is no direct way to determine which hardware counters are	assigned

... another \t above 

> to a particular monitoring group. Check every entry of every hardware
> counter	configuration array in every monitoring domain to query which

... one more \t above

> MBM events of a monitoring group is tracked by hardware. Such queries
> are acceptable because of a very small number of assignable counters.

It is not obvious what "very small number" means. Is it possible to give
a range to help the reader understand the motivation?

> 
> Suggested-by: Peter Newman <peternewman@google.com>
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
>  include/linux/resctrl.h                | 14 ++++++++++++++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 18110a1afb6d..75a3b56996ca 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -4009,6 +4009,7 @@ static void __init rdtgroup_setup_default(void)
>  
>  static void domain_destroy_mon_state(struct rdt_mon_domain *d)
>  {
> +	kfree(d->cntr_cfg);
>  	bitmap_free(d->rmid_busy_llc);
>  	kfree(d->mbm_total);
>  	kfree(d->mbm_local);
> @@ -4082,6 +4083,16 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain
>  			return -ENOMEM;
>  		}
>  	}
> +	if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
> +		tsize = sizeof(*d->cntr_cfg);
> +		d->cntr_cfg = kcalloc(r->mon.num_mbm_cntrs, tsize, GFP_KERNEL);
> +		if (!d->cntr_cfg) {
> +			bitmap_free(d->rmid_busy_llc);
> +			kfree(d->mbm_total);
> +			kfree(d->mbm_local);
> +			return -ENOMEM;
> +		}
> +	}
>  
>  	return 0;
>  }
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 511cfce8fc21..9a54e307d340 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -94,6 +94,18 @@ struct rdt_ctrl_domain {
>  	u32				*mbps_val;
>  };
>  
> +/**
> + * struct mbm_cntr_cfg - assignable counter configuration
> + * @evtid:		 MBM event to which the counter is assigned. Only valid
> + *			 if @rdtgroup is not NULL.
> + * @rdtgroup:		 resctrl group assigned to the counter. NULL if the
> + *			 counter is free.
> + */
> +struct mbm_cntr_cfg {
> +	enum resctrl_event_id	evtid;
> +	struct rdtgroup		*rdtgrp;
> +};
> +

$ scripts/kernel-doc -v -none include/linux/resctrl.h
...
include/linux/resctrl.h:107: warning: Function parameter or struct member 'rdtgrp' not described in 'mbm_cntr_cfg'
include/linux/resctrl.h:107: warning: Excess struct member 'rdtgroup' description in 'mbm_cntr_cfg'
...

>  /**
>   * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
>   * @hdr:		common header for different domain types
> @@ -105,6 +117,7 @@ struct rdt_ctrl_domain {
>   * @cqm_limbo:		worker to periodically read CQM h/w counters
>   * @mbm_work_cpu:	worker CPU for MBM h/w counters
>   * @cqm_work_cpu:	worker CPU for CQM h/w counters
> + * @cntr_cfg:		assignable counters configuration
>   */
>  struct rdt_mon_domain {
>  	struct rdt_domain_hdr		hdr;
> @@ -116,6 +129,7 @@ struct rdt_mon_domain {
>  	struct delayed_work		cqm_limbo;
>  	int				mbm_work_cpu;
>  	int				cqm_work_cpu;
> +	struct mbm_cntr_cfg		*cntr_cfg;
>  };
>  
>  /**
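
The lookup described in the changelog can be sketched in userspace as a linear
scan over one domain's counter array. This is only an illustration under
assumed names: demo_cntr_cfg mirrors the quoted struct mbm_cntr_cfg, while
demo_rdtgroup, the demo_event_id values, demo_cntr_find() and
demo_cntr_alloc() are hypothetical and not part of the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types quoted above. */
enum demo_event_id { DEMO_MBM_TOTAL = 2, DEMO_MBM_LOCAL = 3 };

struct demo_rdtgroup { int id; };

struct demo_cntr_cfg {
	enum demo_event_id evtid;	/* only valid if rdtgrp is not NULL */
	struct demo_rdtgroup *rdtgrp;	/* NULL means the counter is free */
};

/*
 * Linear scan of one domain's counter array. Returns the counter ID that
 * tracks (grp, evtid), or -1 if no counter is assigned to that pair.
 */
static int demo_cntr_find(const struct demo_cntr_cfg *cfg, size_t num_cntrs,
			  const struct demo_rdtgroup *grp,
			  enum demo_event_id evtid)
{
	size_t i;

	if (!grp)
		return -1;

	for (i = 0; i < num_cntrs; i++) {
		if (cfg[i].rdtgrp == grp && cfg[i].evtid == evtid)
			return (int)i;
	}

	return -1;
}

/* Return the first free counter ID, or -1 if every counter is in use. */
static int demo_cntr_alloc(const struct demo_cntr_cfg *cfg, size_t num_cntrs)
{
	size_t i;

	for (i = 0; i < num_cntrs; i++) {
		if (!cfg[i].rdtgrp)
			return (int)i;
	}

	return -1;
}
```

A query across the whole system would repeat demo_cntr_find() for each
monitoring domain, so its cost grows with the number of domains times the
number of counters per domain.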

Reinette



* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-01-22 20:20 ` [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value Babu Moger
@ 2025-02-05 23:58   ` Reinette Chatre
  2025-02-06  0:51     ` Luck, Tony
  2025-02-07 17:30     ` Moger, Babu
  2025-02-06  6:24   ` Xin Li
  1 sibling, 2 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-05 23:58 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> The event configuration is domain specific and initialized during domain
> initialization. The values are stored in struct rdt_hw_mon_domain.
> 
> It is not required to read the configuration register every time user asks
> for it. Use the value stored in struct rdt_hw_mon_domain instead.
> 
> Introduce resctrl_arch_mon_event_config_get() and
> resctrl_arch_mon_event_config_set() to get/set architecture domain specific
> mbm_total_cfg/mbm_local_cfg values.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

> ---
>  arch/x86/kernel/cpu/resctrl/internal.h | 15 +++++++
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 46 +++++++++++++++++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++---------------------
>  3 files changed, 72 insertions(+), 50 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index ab28b9340ee7..cfaea20145d0 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -605,6 +605,18 @@ union cpuid_0x10_x_edx {
>  	unsigned int full;
>  };
>  
> +/**
> + * struct mon_config_info - Monitoring event configuratiin details

Same typo as previous version. 

> + * @d:			Domain for the event
> + * @evtid:		Event type
> + * @mon_config:		Event configuration value
> + */
> +struct mon_config_info {
> +	struct rdt_mon_domain *d;
> +	enum resctrl_event_id evtid;
> +	u32 mon_config;
> +};
> +
>  void rdt_last_cmd_clear(void);
>  void rdt_last_cmd_puts(const char *s);
>  __printf(1, 2)
> @@ -674,4 +686,7 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
>  bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
>  void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom);
>  unsigned int mon_event_config_index_get(u32 evtid);
> +void resctrl_arch_mon_event_config_set(void *info);
> +u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
> +				      enum resctrl_event_id eventid);
>  #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 8917c7261680..6fe9e610e9a0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1324,3 +1324,49 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
>  
>  	return 0;
>  }
> +
> +u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
> +				      enum resctrl_event_id eventid)
> +{
> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> +
> +	switch (eventid) {
> +	case QOS_L3_OCCUP_EVENT_ID:
> +		break;
> +	case QOS_L3_MBM_TOTAL_EVENT_ID:
> +		return hw_dom->mbm_total_cfg;
> +	case QOS_L3_MBM_LOCAL_EVENT_ID:
> +		return hw_dom->mbm_local_cfg;
> +	}
> +
> +	/* Never expect to get here */
> +	WARN_ON_ONCE(1);
> +
> +	return INVALID_CONFIG_VALUE;
> +}
> +
> +void resctrl_arch_mon_event_config_set(void *info)
> +{
> +	struct mon_config_info *mon_info = info;
> +	struct rdt_hw_mon_domain *hw_dom;
> +	unsigned int index;
> +
> +	index = mon_event_config_index_get(mon_info->evtid);
> +	if (index == INVALID_CONFIG_INDEX)
> +		return;
> +
> +	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
> +
> +	hw_dom = resctrl_to_arch_mon_dom(mon_info->d);
> +
> +	switch (mon_info->evtid) {
> +	case QOS_L3_MBM_TOTAL_EVENT_ID:
> +		hw_dom->mbm_total_cfg = mon_info->mon_config;
> +		break;
> +	case QOS_L3_MBM_LOCAL_EVENT_ID:
> +		hw_dom->mbm_local_cfg = mon_info->mon_config;
> +		break;
> +	default:
> +		break;
> +	}
> +}

This new arch API has sharp corners because of asymmetry of where resctrl
runs the arch function. I do not think it is required to change this since we
can only speculate about how this may be used in the future but I do think
it will be helpful to add comments that highlight:

resctrl_arch_mon_event_config_get() ->  May run on CPU that does not belong to domain.
resctrl_arch_mon_event_config_set() ->  Runs on CPU that belongs to domain.

... 

> @@ -1683,33 +1653,23 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
>  	return 0;
>  }
>  
> -static void mon_event_config_write(void *info)
> -{
> -	struct mon_config_info *mon_info = info;
> -	unsigned int index;
> -
> -	index = mon_event_config_index_get(mon_info->evtid);
> -	if (index == INVALID_CONFIG_INDEX) {
> -		pr_warn_once("Invalid event id %d\n", mon_info->evtid);
> -		return;
> -	}
> -	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
> -}
>  
>  static void mbm_config_write_domain(struct rdt_resource *r,
>  				    struct rdt_mon_domain *d, u32 evtid, u32 val)
>  {
>  	struct mon_config_info mon_info = {0};

As discussed in the previous version, it is unnecessary to explicitly initialize
the structure if it is fully initialized in the code. This avoids the need for
future cleanups like commit 29eaa7958367 ("x86/resctrl: Slightly clean-up mbm_config_show()").
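
A rough userspace model of both points, assuming hypothetical names throughout
(model_mon_domain, model_config_info, model_config_get(), model_config_set(),
model_config_write_domain(), MODEL_INVALID_CONFIG): the comments record where
each function may run, and the caller sets every member so no "= {0}" is
needed. The MSR write is reduced to the cache update.

```c
#include <stdint.h>

#define MODEL_INVALID_CONFIG UINT32_MAX	/* hypothetical error value */

enum model_event_id { MODEL_MBM_TOTAL, MODEL_MBM_LOCAL };

struct model_mon_domain {
	uint32_t mbm_total_cfg;
	uint32_t mbm_local_cfg;
};

struct model_config_info {
	struct model_mon_domain *d;
	enum model_event_id evtid;
	uint32_t mon_config;
};

/*
 * Reads only the cached value, so it may run on a CPU that does not
 * belong to the domain.
 */
static uint32_t model_config_get(const struct model_mon_domain *d,
				 enum model_event_id evtid)
{
	switch (evtid) {
	case MODEL_MBM_TOTAL:
		return d->mbm_total_cfg;
	case MODEL_MBM_LOCAL:
		return d->mbm_local_cfg;
	}

	return MODEL_INVALID_CONFIG;
}

/*
 * Models the register write plus cache update. The kernel counterpart
 * writes an MSR, so it runs on a CPU that belongs to the domain.
 */
static void model_config_set(void *info)
{
	struct model_config_info *mon_info = info;

	switch (mon_info->evtid) {
	case MODEL_MBM_TOTAL:
		mon_info->d->mbm_total_cfg = mon_info->mon_config;
		break;
	case MODEL_MBM_LOCAL:
		mon_info->d->mbm_local_cfg = mon_info->mon_config;
		break;
	}
}

static void model_config_write_domain(struct model_mon_domain *d,
				      enum model_event_id evtid, uint32_t val)
{
	struct model_config_info mon_info;	/* no "= {0}": all members set below */

	mon_info.d = d;
	mon_info.evtid = evtid;
	mon_info.mon_config = val;

	/* The kernel would dispatch this to a CPU in the domain. */
	model_config_set(&mon_info);
}
```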

Reinette


* Re: [PATCH v11 12/23] x86/resctrl: Introduce interface to display number of free counters
  2025-01-22 20:20 ` [PATCH v11 12/23] x86/resctrl: Introduce interface to display number of free counters Babu Moger
@ 2025-02-06  0:19   ` Reinette Chatre
  2025-02-07 18:59     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06  0:19 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> Provide the interface to display the number of monitoring counters
> available for assignment in each domain when mbm_cntr_assign is enabled.

"when mbm_cntr_assign is enabled" -> "when mbm_cntr_assign mode is enabled"?

> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---


> ---
>  Documentation/arch/x86/resctrl.rst     |  8 +++++
>  arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 46 ++++++++++++++++++++++++++
>  3 files changed, 55 insertions(+)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 31ff764deeeb..99cae75559b0 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -299,6 +299,14 @@ with the following files:
>  	memory bandwidth tracking to a single memory bandwidth event per
>  	monitoring group.
>  
> +"available_mbm_cntrs":
> +	The number of monitoring counters available for assignment in each
> +	domain when mbm_cntr_assign mode is enabled on the system.
> +	::
> +

Documentation jumps in with some hardcoded values that may cause confusion.
It looks to be missing something like (and looking back this also applies
to "num_mbm_cntrs"):
"For example, on a system with 30 available monitoring/(hardware?) counters in
each of its L3 domains:"


> +	 # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
> +	 0=30;1=30
> +


>  "max_threshold_occupancy":
>  		Read/write file provides the largest value (in
>  		bytes) at which a previously used LLC_occupancy
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 6fe9e610e9a0..f2bf5b13465d 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1234,6 +1234,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>  			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>  			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
>  			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
> +			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
>  		}
>  	}
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 75a3b56996ca..2b86124c336b 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -888,6 +888,46 @@ static int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
>  	return 0;
>  }
>  
> +static int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
> +					    struct seq_file *s, void *v)
> +{
> +	struct rdt_resource *r = of->kn->parent->priv;
> +	struct rdt_mon_domain *dom;
> +	bool sep = false;
> +	u32 cntrs, i;
> +	int ret = 0;
> +
> +	cpus_read_lock();
> +	mutex_lock(&rdtgroup_mutex);
> +

Missing rdt_last_cmd_clear()?

> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
> +		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
> +		ret = -EINVAL;
> +		goto unlock_cntrs_show;
> +	}
> +
> +	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> +		if (sep)
> +			seq_puts(s, ";");

The one-character prints can be simplified with seq_putc().

> +
> +		cntrs = 0;
> +		for (i = 0; i < r->mon.num_mbm_cntrs; i++) {
> +			if (!dom->cntr_cfg[i].rdtgrp)
> +				cntrs++;
> +		}
> +
> +		seq_printf(s, "%d=%d", dom->hdr.id, cntrs);

I expect cntrs to need %u?

> +		sep = true;
> +	}
> +	seq_puts(s, "\n");
> +
> +unlock_cntrs_show:
> +	mutex_unlock(&rdtgroup_mutex);
> +	cpus_read_unlock();
> +
> +	return ret;
> +}
> +
>  #ifdef CONFIG_PROC_CPU_RESCTRL
>  
>  /*
> @@ -1916,6 +1956,12 @@ static struct rftype res_common_files[] = {
>  		.kf_ops		= &rdtgroup_kf_single_ops,
>  		.seq_show	= resctrl_num_mbm_cntrs_show,
>  	},
> +	{
> +		.name		= "available_mbm_cntrs",
> +		.mode		= 0444,
> +		.kf_ops		= &rdtgroup_kf_single_ops,
> +		.seq_show	= resctrl_available_mbm_cntrs_show,
> +	},
>  	{
>  		.name		= "cpus",
>  		.mode		= 0644,
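
For reference, the output format of this file can be modeled in userspace as
below. This is a sketch under assumed names (sim_domain,
sim_available_cntrs_show(), SIM_NUM_CNTRS), with snprintf() standing in for the
seq_file helpers; it prints the free count with %u as suggested above.

```c
#include <stdio.h>
#include <stddef.h>

#define SIM_NUM_CNTRS 32	/* hypothetical number of counters per domain */

struct sim_domain {
	int id;
	const void *cntr_owner[SIM_NUM_CNTRS];	/* NULL means the counter is free */
};

/*
 * Write "<id>=<free>;<id>=<free>\n" into buf. Returns the number of
 * characters written (excluding the NUL), or -1 if buf is too small.
 */
static int sim_available_cntrs_show(char *buf, size_t len,
				    const struct sim_domain *doms, size_t ndoms)
{
	size_t pos = 0, d, i;

	for (d = 0; d < ndoms; d++) {
		unsigned int free_cntrs = 0;
		int n;

		for (i = 0; i < SIM_NUM_CNTRS; i++) {
			if (!doms[d].cntr_owner[i])
				free_cntrs++;
		}

		/* Separator goes before every entry except the first. */
		n = snprintf(buf + pos, len - pos, "%s%d=%u",
			     d ? ";" : "", doms[d].id, free_cntrs);
		if (n < 0 || (size_t)n >= len - pos)
			return -1;
		pos += (size_t)n;
	}

	if (len - pos < 2)
		return -1;
	buf[pos++] = '\n';
	buf[pos] = '\0';

	return (int)pos;
}
```

With two domains that each have 2 of their 32 counters assigned, the buffer
holds "0=30;1=30" followed by a newline, matching the documentation example.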

Reinette


* RE: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-05 23:58   ` Reinette Chatre
@ 2025-02-06  0:51     ` Luck, Tony
  2025-02-06  1:41       ` Reinette Chatre
  2025-02-07 17:30     ` Moger, Babu
  1 sibling, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-02-06  0:51 UTC (permalink / raw)
  To: Chatre, Reinette, Babu Moger, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com
  Cc: x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

> This new arch API has sharp corners because of asymmetry of where resctrl
> runs the arch function. I do not think it is required to change this since we
> can only speculate about how this may be used in the future but I do think
> it will be helpful to add comments that highlight:
>
> resctrl_arch_mon_event_config_get() ->  May run on CPU that does not belong to domain.
> resctrl_arch_mon_event_config_set() ->  Runs on CPU that belongs to domain.

Here's a vague data point about the future to help with speculation.

I have something coming along the pipeline that also can run on any CPU.

I am contemplating a flag in the rdt_resource structure (in appropriate substructure
resctrl_cache/resctrl_membw) to indicate "domain" vs. "any" for operations.

Would something like that be useful here?

-Tony

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 15/23] x86/resctrl: Add the functionality to assigm MBM events
  2025-01-22 20:20 ` [PATCH v11 15/23] x86/resctrl: Add the functionality to assigm MBM events Babu Moger
@ 2025-02-06  1:05   ` Reinette Chatre
  2025-02-07 21:10     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06  1:05 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

subject: "assigm" -> "assign"

On 1/22/25 12:20 PM, Babu Moger wrote:
> The mbm_cntr_assign mode offers several counters that can be assigned

This "several counters" contradicts the "very small number of assignable
counters" used in an earlier patch to justify how counters are managed.

> to an RMID, event pair and monitor the bandwidth as long as it is
> assigned.
> 
> Add the functionality to allocate and assign the counters to RMID, event
> pair in the domain.
> 
> If all counters are in use, the kernel will show an error message: "Out
> of MBM assignable counters" when a new assignment is requested. Exit on
> the first failure when assigning counters across all the domains.
> Report the error in /sys/fs/resctrl/info/last_cmd_status.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

..

> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |   2 +
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 105 +++++++++++++++++++++++++
>  2 files changed, 107 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 161d3feb567c..547d8a4c8aba 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -727,4 +727,6 @@ u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>  int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>  			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>  			     u32 cntr_id, bool assign);
> +int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
>  #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index ef836bb69b9b..127c4000a81a 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1413,3 +1413,108 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>  
>  	return 0;
>  }
> +
> +/*
> + * Configure the counter for the event, RMID pair for the domain. Reset the
> + * non-architectural state to clear all the event counters.
> + */
> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
> +			       u32 cntr_id, bool assign)
> +{
> +	struct mbm_state *m;
> +	int ret;
> +
> +	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
> +	if (ret)
> +		return ret;
> +
> +	m = get_mbm_state(d, closid, rmid, evtid);
> +	if (m)
> +		memset(m, 0, sizeof(struct mbm_state));
> +
> +	return ret;
> +}
> +
> +static int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
> +{
> +	int cntr_id;
> +
> +	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
> +		if (d->cntr_cfg[cntr_id].rdtgrp == rdtgrp &&
> +		    d->cntr_cfg[cntr_id].evtid == evtid)
> +			return cntr_id;
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			  struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
> +{
> +	int cntr_id;
> +
> +	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
> +		if (!d->cntr_cfg[cntr_id].rdtgrp) {
> +			d->cntr_cfg[cntr_id].rdtgrp = rdtgrp;
> +			d->cntr_cfg[cntr_id].evtid = evtid;
> +			return cntr_id;
> +		}
> +	}
> +
> +	return -ENOSPC;
> +}
> +
> +static void mbm_cntr_free(struct rdt_mon_domain *d, int cntr_id)
> +{
> +	memset(&d->cntr_cfg[cntr_id], 0, sizeof(struct mbm_cntr_cfg));
> +}
> +
> +/*
> + * Allocate a fresh counter and configure the event if not assigned already
> + * else return success.

I find this confusing. I think the "else return success" can just be dropped.

> + */
> +static int resctrl_alloc_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +				     struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
> +{
> +	int cntr_id, ret = 0;
> +
> +	if (mbm_cntr_get(r, d, rdtgrp, evtid) == -ENOENT) {

This can be simplified while reducing a level of indent with:

	/* No need to allocate and configure if counter already assigned to this event. */
	if (mbm_cntr_get(r, d, rdtgrp, evtid) >= 0)
		return 0;

> +		cntr_id = mbm_cntr_alloc(r, d, rdtgrp, evtid);
> +		if (cntr_id <  0) {
> +			rdt_last_cmd_printf("Domain %d is Out of MBM assignable counter\n",

"Domain %d is Out of MBM assignable counter" -> "Domain %d is out of MBM assignable counters"
or the message can be something like "Unable to allocate counter in domain %d" so that it
does not assume the cause, with the error then returned directly. resctrl_process_flags()
can in turn not override the error, so that -ENOSPC reaches user space where it can be
interpreted appropriately, instead of always returning -EINVAL and requiring user space
to check last_cmd_status?

> +					    d->hdr.id);
> +			return -ENOSPC;

Please do not override the error returned by a function.
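
To illustrate the point, here is a minimal user-space sketch (not kernel code)
of handing the allocator's error back unchanged. The names cntr_alloc() and
alloc_and_report(), the cntr_owner[] array and the NUM_CNTRS value are all made
up for this example; only the idea of returning cntr_id as-is comes from the
comment above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NUM_CNTRS 2	/* hypothetical: counters per domain */

/* Hypothetical stand-in for d->cntr_cfg[]: a NULL owner means "free". */
static const void *cntr_owner[NUM_CNTRS];

/* Returns a counter ID on success, or -ENOSPC when all are in use. */
static int cntr_alloc(const void *owner)
{
	int id;

	assert(owner != NULL);

	for (id = 0; id < NUM_CNTRS; id++) {
		if (!cntr_owner[id]) {
			cntr_owner[id] = owner;
			return id;
		}
	}

	return -ENOSPC;
}

/*
 * The caller hands back whatever cntr_alloc() reported instead of
 * substituting its own error code.
 */
static int alloc_and_report(const void *owner)
{
	int cntr_id = cntr_alloc(owner);

	if (cntr_id < 0)
		return cntr_id;	/* propagate, do not override */

	return 0;
}
```

With this shape, a later change to the allocator's error code would reach the
caller without the wrapper needing to be touched.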

> +		}
> +
> +		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid, rdtgrp->closid,
> +					  cntr_id, true);
> +		if (ret) {
> +			rdt_last_cmd_printf("Assignment failed on domain %d\n", d->hdr.id);

I assume this targets the scenario when user space requests "all" domains to be changed
and the error message in resctrl_process_flags() will then print "*" instead of the
actual domain ID. If the goal is to give more detail about the error, then the event
can be displayed also?

> +			mbm_cntr_free(d, cntr_id);
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Assign a hardware counter to event @evtid of group @rdtgrp.
> + * Counter will be assigned to all the domains if @d is NULL else
> + * the counter will be assigned to @d.

Please use available 80 chars.

> + */
> +int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
> +{
> +	int ret = 0;
> +
> +	if (!d) {
> +		list_for_each_entry(d, &r->mon_domains, hdr.list)
> +			ret = resctrl_alloc_config_cntr(r, d, rdtgrp, evtid);

This does not "exit on first failure" as the changelog claims. It actually looks like
the request is considered successful as long as the last domain succeeds, even if all
other domains fail.
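
A minimal user-space sketch (not kernel code) of a loop that does stop at the
first failure is shown below. The helper for_each_domain_until_error(), the
domain_op_t type, the example_ids[] array and fail_on_domain_1() are
hypothetical names used only for illustration; a plain array stands in for the
r->mon_domains list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-domain operation: returns 0 or a negative errno. */
typedef int (*domain_op_t)(int domain_id);

/*
 * Apply @op to each domain in @ids and stop at the first failure so the
 * error is not masked by a later domain that succeeds.
 */
static int for_each_domain_until_error(const int *ids, size_t n, domain_op_t op)
{
	size_t i;
	int ret;

	assert(ids && op);

	for (i = 0; i < n; i++) {
		ret = op(ids[i]);
		if (ret)
			return ret;
	}

	return 0;
}

/* Example data: three domains, of which domain 1 has no free counters. */
static const int example_ids[] = { 0, 1, 2 };

static int fail_on_domain_1(int domain_id)
{
	return domain_id == 1 ? -ENOSPC : 0;
}
```

Here the failure on domain 1 is what the caller sees, even though domain 2
would have succeeded had the loop continued.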

> +	} else {
> +		ret = resctrl_alloc_config_cntr(r, d, rdtgrp, evtid);
> +	}
> +
> +	return ret;
> +}

Reinette

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-06  0:51     ` Luck, Tony
@ 2025-02-06  1:41       ` Reinette Chatre
  2025-02-06 15:56         ` Luck, Tony
  2025-02-19 13:28         ` Dave Martin
  0 siblings, 2 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06  1:41 UTC (permalink / raw)
  To: Luck, Tony, Babu Moger, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com
  Cc: x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

Hi Tony,

On 2/5/25 4:51 PM, Luck, Tony wrote:
>> This new arch API has sharp corners because of asymmetry of where resctrl
>> runs the arch function. I do not think it is required to change this since we
>> can only speculate about how this may be used in the future but I do think
>> it will be helpful to add comments that highlight:
>>
>> resctrl_arch_mon_event_config_get() ->  May run on CPU that does not belong to domain.
>> resctrl_arch_mon_event_config_set() ->  Runs on CPU that belongs to domain.
> 
> Here's a vague data point about the future to help with speculation.
> 
> I have something coming along the pipeline that also can run on any CPU.
> 
> I am contemplating a flag in the rdt_resource structure (in appropriate substructure
> resctrl_cache/resctrl_membw) to indicate "domain" vs. "any" for operations.
> 
> Would something like that be useful here?

hmm ... I cannot envision how this may look. Could you please elaborate?

You mention "a" (singular) flag in rdt_resource while this scenario involves
different ops having different scope. This makes me think that this flag may
have to be per operation that in turn would need additional infrastructure to
manage and track operations.

These "arch" functions are evolving as the work to support MPAM is done and
so far I think it has been quite ad-hoc to just refactor arch specific code
into "arch" helpers instead of keeping track of which scope they are running in.
This currently requires any arch implementing an "arch" helper to be well aware
of how resctrl will call it.

Reinette

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 16/23] x86/resctrl: Add the functionality to unassigm MBM events
  2025-01-22 20:20 ` [PATCH v11 16/23] x86/resctrl: Add the functionality to unassigm " Babu Moger
@ 2025-02-06  3:54   ` Reinette Chatre
  2025-02-10 16:23     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06  3:54 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

subject: unassigm -> unassign

On 1/22/25 12:20 PM, Babu Moger wrote:
> The mbm_cntr_assign mode provides a limited number of hardware counters

(now back to "limited number of hardware counters")

> that can be assigned to an RMID, event pair to monitor bandwidth while
> assigned. If all counters are in use, the kernel will show an error
> message: "Out of MBM assignable counters" when a new assignment is
> requested. To make space for a new assignment, users must unassign an

To me "kernel will show an error" implies the kernel ring buffer. Please make
the message accurate and mention that it will be in last_cmd_status, while
also considering the use of -ENOSPC to help user space.

> already assigned counter and retry the assignment again..

".." -> "."

> 
> Add the functionality to unassign and free the counters in the domain.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>

...

> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 127c4000a81a..b6d188d0f9b7 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1518,3 +1518,42 @@ int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
>  
>  	return ret;
>  }
> +
> +/*
> + * Unassign and free the counter if assigned else return success.
> + */
> +static int resctrl_free_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +				    struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
> +{
> +	int cntr_id, ret = 0;
> +
> +	cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
> +	if (cntr_id != -ENOENT) {

This can be simplified and indent level reduced with:

	cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
	if (cntr_id < 0)
		return ret;

> +		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
> +					  rdtgrp->closid, cntr_id, false);
> +		if (!ret)
> +			mbm_cntr_free(d, cntr_id);
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Unassign a hardware counter associated with @evtid from the domain and
> + * the group. Unassign the counters from all the domains if @d is NULL else
> + * unassign from @d.
> + */
> +int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
> +				struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
> +{
> +	int ret = 0;
> +
> +	if (!d) {
> +		list_for_each_entry(d, &r->mon_domains, hdr.list)
> +			ret = resctrl_free_config_cntr(r, d, rdtgrp, evtid);

Same issue as previous patch wrt error handling.

> +	} else {
> +		ret = resctrl_free_config_cntr(r, d, rdtgrp, evtid);
> +	}
> +
> +	return ret;
> +}

Reinette

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-01-22 20:20 ` [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value Babu Moger
  2025-02-05 23:58   ` Reinette Chatre
@ 2025-02-06  6:24   ` Xin Li
  2025-02-06 16:17     ` Reinette Chatre
  1 sibling, 1 reply; 209+ messages in thread
From: Xin Li @ 2025-02-06  6:24 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On 1/22/2025 12:20 PM, Babu Moger wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 8917c7261680..6fe9e610e9a0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1324,3 +1324,49 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
>   
>   	return 0;
>   }
> +
> +u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
> +				      enum resctrl_event_id eventid)
> +{
> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> +
> +	switch (eventid) {
> +	case QOS_L3_OCCUP_EVENT_ID:
> +		break;
> +	case QOS_L3_MBM_TOTAL_EVENT_ID:
> +		return hw_dom->mbm_total_cfg;
> +	case QOS_L3_MBM_LOCAL_EVENT_ID:
> +		return hw_dom->mbm_local_cfg;
> +	}
> +
> +	/* Never expect to get here */
> +	WARN_ON_ONCE(1);
> +
> +	return INVALID_CONFIG_VALUE;
> +}
> +
> +void resctrl_arch_mon_event_config_set(void *info)
> +{
> +	struct mon_config_info *mon_info = info;
> +	struct rdt_hw_mon_domain *hw_dom;
> +	unsigned int index;
> +
> +	index = mon_event_config_index_get(mon_info->evtid);
> +	if (index == INVALID_CONFIG_INDEX)
> +		return;
> +
> +	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);

This is the existing code, however it would be better to use wrmsrl()
when the upper 32 bits are all 0s:

	wrmsrl(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config);
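
For reference, the two forms differ only in how the 64-bit value is passed:
wrmsr() takes the low and high 32-bit halves separately, while wrmsrl() takes
one 64-bit value. The user-space model below (not kernel code) illustrates
that they produce the same result when the high half is zero. model_msr,
model_wrmsr() and model_wrmsrl() are made-up stand-ins and do not touch any
real MSR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the contents of a single MSR. */
static uint64_t model_msr;

/* Models wrmsr(msr, low, high): value passed as two 32-bit halves. */
static void model_wrmsr(uint32_t low, uint32_t high)
{
	model_msr = ((uint64_t)high << 32) | low;
}

/* Models wrmsrl(msr, val): value passed as one 64-bit quantity. */
static void model_wrmsrl(uint64_t val)
{
	model_msr = val;
}
```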

Thanks!
     Xin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-06  1:41       ` Reinette Chatre
@ 2025-02-06 15:56         ` Luck, Tony
  2025-02-21 18:08           ` James Morse
  2025-02-19 13:28         ` Dave Martin
  1 sibling, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-02-06 15:56 UTC (permalink / raw)
  To: Chatre, Reinette, Babu Moger, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com
  Cc: x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

> >> This new arch API has sharp corners because of asymmetry of where resctrl
> >> runs the arch function. I do not think it is required to change this since we
> >> can only speculate about how this may be used in the future but I do think
> >> it will be helpful to add comments that highlight:
> >>
> >> resctrl_arch_mon_event_config_get() ->  May run on CPU that does not belong to domain.
> >> resctrl_arch_mon_event_config_set() ->  Runs on CPU that belongs to domain.
> >
> > Here's a vague data point about the future to help with speculation.
> >
> > I have something coming along the pipeline that also can run on any CPU.
> >
> > I am contemplating a flag in the rdt_resource structure (in appropriate substructure
> > resctrl_cache/resctrl_membw) to indicate "domain" vs. "any" for operations.
> >
> > Would something like that be useful here?
>
> hmm ... I cannot envision how this may look. Could you please elaborate?
>
> You mention "a" (singular) flag in rdt_resource while this scenario involves
> different ops having different scope. This makes me think that this flag may
> have to be per operation that in turn would need additional infrastructure to
> manage and track operations.
>
> These "arch" functions are evolving as the work to support MPAM is done and
> so far I think it has been quite ad-hoc to just refactor arch specific code
> into "arch" helpers instead of keeping track of which scope they are running in.
> This currently requires any arch implementing an "arch" helper to be well aware
> of how resctrl will call it.

Reinette,

I haven't fleshed it out yet. One option would be to have a "choose_cpu_mask()"
function that takes resource and domain parameters (and given your comment
about this case an operation code). Then use that as the mask in an smp_call*().

Operations that can run anywhere would return a mask with just the bit for the
current CPU. Those tied to a domain would return a copy of the domain mask.
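
As a rough user-space illustration of that idea (not kernel code, and not a
proposal for the final interface), the selection could look like the sketch
below. The enum op_scope type and its values are hypothetical, the signature
of choose_cpu_mask() is an assumption, and CPU masks are modelled as bits in
an unsigned long rather than a struct cpumask.

```c
#include <assert.h>

/* Hypothetical scope of an operation; not an existing resctrl type. */
enum op_scope {
	OP_SCOPE_ANY,		/* may run on any CPU */
	OP_SCOPE_DOMAIN,	/* must run on a CPU in the domain */
};

/*
 * Pick the CPUs to run on. CPU masks are modelled as one bit per CPU in
 * an unsigned long, which only covers small CPU counts.
 */
static unsigned long choose_cpu_mask(enum op_scope scope,
				     unsigned long domain_mask,
				     unsigned int this_cpu)
{
	assert(this_cpu < 8 * sizeof(unsigned long));

	if (scope == OP_SCOPE_ANY)
		return 1UL << this_cpu;

	return domain_mask;
}
```

The returned mask would then be what gets passed to the smp_call*() variant.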

-Tony

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2025-02-05 22:49   ` Reinette Chatre
@ 2025-02-06 16:15     ` Moger, Babu
  2025-02-06 18:42       ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-06 16:15 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/5/2025 4:49 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> Add the functionality to enable/disable AMD ABMC feature.
>>
>> AMD ABMC feature is enabled by setting enabled bit(0) in MSR
>> L3_QOS_EXT_CFG. When the state of ABMC is changed, the MSR needs
>> to be updated on all the logical processors in the QOS Domain.
>>
>> Hardware counters will reset when ABMC state is changed.
> 
> I find that the state management in this series is organized better
> and easier to understand. I do think that it can be simplified more
> and a hint to this is that it is mentioned here but not done in the
> code introduced here but instead required from the caller. It seems
> simpler to me that the architectural state can just be reset at the
> same time as enable/disable of ABMC?

Right now, it is done from mbm_cntr_reset(). It does both arch and 
non-arch state reset for all the RMIDs in all the domains. It is called 
in two places.

1 rdtgroup.c resctrl_mbm_assign_mode_write -> mbm_cntr_reset();
2 rdtgroup.c rdt_kill_sb()-> mbm_cntr_reset();

I will have to introduce another function to reset RMIDs in all the 
domains. Also, make sure it is called from both these places.

list_for_each_entry(dom, &r->mon_domains, hdr.list)
             resctrl_arch_reset_rmid_all(r, dom);


I feel the current code is much cleaner. What do you think?

> 
>>
>> The ABMC feature details are documented in APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC).
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> ...
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index c3d7d4c3009a..a7526306f5e4 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1261,3 +1261,39 @@ void __init intel_rdt_mbm_apply_quirk(void)
>>   	mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
>>   	mbm_cf = mbm_cf_table[cf_index].cf;
>>   }
>> +
>> +static void resctrl_abmc_set_one_amd(void *arg)
>> +{
>> +	bool *enable = arg;
>> +
>> +	if (*enable)
>> +		msr_set_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
>> +	else
>> +		msr_clear_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
>> +}
>> +
>> +/*
>> + * Update L3_QOS_EXT_CFG MSR on all the CPUs associated with the monitor
>> + * domain.
> 
> All monitor domains are impacted and above does not clearly state "why".
> How about
>   * ABMC enable/disable requires update of L3_QOS_EXT_CFG MSR on all the CPUs
>   * associated with all monitor domains.

Sure.

> 
> 
>> + */
>> +static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
>> +{
>> +	struct rdt_mon_domain *d;
>> +
>> +	list_for_each_entry(d, &r->mon_domains, hdr.list)
>> +		on_each_cpu_mask(&d->hdr.cpu_mask,
>> +				 resctrl_abmc_set_one_amd, &enable, 1);
>> +}
>> +
>> +int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
>> +{
>> +	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> +
>> +	if (r->mon.mbm_cntr_assignable &&
>> +	    hw_res->mbm_cntr_assign_enabled != enable) {
>> +		_resctrl_abmc_enable(r, enable);
>> +		hw_res->mbm_cntr_assign_enabled = enable;
> 
> Added benefit of resetting architectural state within this if statement
> (perhaps simpler to be done within _resctrl_abmc_enable()) is that it will
> not be done unnecessarily if ABMC is already in requested state.

It will be
       list_for_each_entry(dom, &r->mon_domains, hdr.list)
             resctrl_arch_reset_rmid_all(r, dom);
> 
>> +	}
>> +
>> +	return 0;
>> +}
> 
> Reinette
> 

Thanks
Babu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-06  6:24   ` Xin Li
@ 2025-02-06 16:17     ` Reinette Chatre
  2025-02-07 10:07       ` Xin Li
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06 16:17 UTC (permalink / raw)
  To: Xin Li, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Xin,

On 2/5/25 10:24 PM, Xin Li wrote:
> On 1/22/2025 12:20 PM, Babu Moger wrote:
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 8917c7261680..6fe9e610e9a0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1324,3 +1324,49 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
>>         return 0;
>>   }
>> +
>> +u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>> +                      enum resctrl_event_id eventid)
>> +{
>> +    struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>> +
>> +    switch (eventid) {
>> +    case QOS_L3_OCCUP_EVENT_ID:
>> +        break;
>> +    case QOS_L3_MBM_TOTAL_EVENT_ID:
>> +        return hw_dom->mbm_total_cfg;
>> +    case QOS_L3_MBM_LOCAL_EVENT_ID:
>> +        return hw_dom->mbm_local_cfg;
>> +    }
>> +
>> +    /* Never expect to get here */
>> +    WARN_ON_ONCE(1);
>> +
>> +    return INVALID_CONFIG_VALUE;
>> +}
>> +
>> +void resctrl_arch_mon_event_config_set(void *info)
>> +{
>> +    struct mon_config_info *mon_info = info;
>> +    struct rdt_hw_mon_domain *hw_dom;
>> +    unsigned int index;
>> +
>> +    index = mon_event_config_index_get(mon_info->evtid);
>> +    if (index == INVALID_CONFIG_INDEX)
>> +        return;
>> +
>> +    wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
> 
> This is the existing code, however it would be better to use wrmsrl()
> when the upper 32 bits are all 0s:
> 
>     wrmsrl(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config);
> 

Could you please elaborate what makes this change better?

Thank you!

Reinette


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode
  2025-01-22 20:20 ` [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode Babu Moger
@ 2025-02-06 18:01   ` Reinette Chatre
  2025-02-06 23:41     ` Moger, Babu
  2025-02-21 18:06   ` James Morse
  1 sibling, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06 18:01 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> Introduce the interface file "mbm_assign_mode" to list monitor modes
> supported.
> 
> The "mbm_cntr_assign" mode provides the option to assign a counter to
> an RMID, event pair and monitor the bandwidth as long as it is assigned.
> 
> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable

""mbm_cntr_assign" is backed" -> ""mbm_cntr_assign" mode is backed"?

> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
> 
> The "default" mode is the existing monitoring mode that works without the
> explicit counter assignment, instead relying on dynamic counter assignment
> by hardware that may result in hardware not dedicating a counter resulting
> in monitoring data reads returning "Unavailable".
> 
> Provide an interface to display the monitor mode on the system.
> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> [mbm_cntr_assign]
> default
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---


> ---
>  Documentation/arch/x86/resctrl.rst     | 26 +++++++++++++++++++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 31 ++++++++++++++++++++++++++
>  2 files changed, 57 insertions(+)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index fb90f08e564e..b5defc5bce0e 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -257,6 +257,32 @@ with the following files:
>  	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>  	    0=0x30;1=0x30;3=0x15;4=0x15
>  
> +"mbm_assign_mode":
> +	Reports the list of monitoring modes supported. The enclosed brackets
> +	indicate which mode is enabled.
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> +	  [mbm_cntr_assign]
> +	  default
> +
> +	"mbm_cntr_assign":
> +
> +	In mbm_cntr_assign, monitoring event can only accumulate data while

"In mbm_cntr_assign, monitoring event" -> "In mbm_cntr_assign mode, a monitoring event"?

> +	it is backed by a hardware counter. The user-space is able to specify
> +	which of the events in CTRL_MON or MON groups should have a counter
> +	assigned using the "mbm_assign_control" file. The number of counters
> +	available is described in the "num_mbm_cntrs" file. Changing the mode
> +	may cause all counters on a resource to reset.
> +
> +	"default":
> +
> +	In default mode, resctrl assumes there is a hardware counter for each
> +	event within every CTRL_MON and MON group. On AMD platforms, it is
> +	recommended to use mbm_cntr_assign mode if supported, because reading
> +	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if

Reinette

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-01-22 20:20 ` [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
@ 2025-02-06 18:03   ` Reinette Chatre
  2025-02-10 17:27     ` Moger, Babu
  2025-02-19 13:41   ` Dave Martin
  1 sibling, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06 18:03 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> Assign/unassign counters on resctrl group creation/deletion. Two counters
> are required per group, one for MBM total event and one for MBM local
> event.
> 
> There are a limited number of counters available for assignment. If these
> counters are exhausted, the kernel will display the error message: "Out of
> MBM assignable counters". However, it is not necessary to fail the
> creation of a group due to assignment failures. Users have the flexibility
> to modify the assignments at a later time.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index b6d188d0f9b7..118b39fbb01e 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1557,3 +1557,30 @@ int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d
>  
>  	return ret;
>  }
> +
> +void mbm_cntr_reset(struct rdt_resource *r)
> +{
> +	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> +	struct rdt_mon_domain *dom;
> +
> +	/*
> +	 * Reset the domain counter configuration. Hardware counters
> +	 * will reset after switching the monitor mode. So, reset the
> +	 * architectural amd non-architectural state so that reading

"amd" -> "and"

> +	 * of hardware counter is not considered as an overflow in the
> +	 * next update.
> +	 */
> +	if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
> +		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> +			memset(dom->cntr_cfg, 0,
> +			       sizeof(*dom->cntr_cfg) * r->mon.num_mbm_cntrs);
> +			if (is_mbm_total_enabled())
> +				memset(dom->mbm_total, 0,
> +				       sizeof(struct mbm_state) * idx_limit);
> +			if (is_mbm_local_enabled())
> +				memset(dom->mbm_local, 0,
> +				       sizeof(struct mbm_state) * idx_limit);
> +			resctrl_arch_reset_rmid_all(r, dom);
> +		}
> +	}
> +}

I looked back at the previous versions to better understand how this function
came about and I do not think it actually solves the problem it aims to solve.

The unassignment done in rdtgroup_unassign_cntrs() can fail and when it does the
counter is not freed. That leaves a monitoring domain's array with an entry that
points to a resource group that no longer exists (unless it is the default resource
group) since rdtgroup_unassign_cntrs() does not check the return value and its callers
proceed to remove the resource group. mbm_cntr_reset() is called on umount of resctrl
but rdtgroup_unassign_cntrs() is called on every group remove and those scenarios
are not handled.

To address this I believe that I need to go back on a previous request to have
resctrl_arch_config_cntr() return an error code. AMD does not need this and
it is difficult to predict what will work for MPAM. I originally wanted to be
flexible here but this appears to be impractical. With a new requirement that
resctrl_arch_config_cntr() always succeeds the counter will in turn always
be freed and not leave dangling pointers. I believe doing so eliminates
the need for mbm_cntr_reset() as used in this patch. My apologies for the
misdirection. We can re-evaluate these flows if MPAM needs anything different.
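
The dangling-entry concern can be illustrated with a minimal user-space sketch.
This is not the resctrl code: struct group, cntr_cfg[], config_cntr(),
assign_cntr() and unassign_cntr() are hypothetical stand-ins, and the only
point shown is that once the configuration hook cannot fail, the slot is
always cleared before the group goes away.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CNTRS 4

/* Hypothetical stand-in for a resource group. */
struct group {
	int rmid;
};

/* Hypothetical per-domain table: which group owns each counter. */
static struct group *cntr_cfg[NUM_CNTRS];

/*
 * Stand-in for the architecture hook. It returns void, so callers
 * have no error path on which they could skip the free.
 */
void config_cntr(int cntr_id, int enable)
{
	(void)cntr_id;
	(void)enable;
}

/* Give the first free counter to @g. Returns its index, or -1 if none is free. */
int assign_cntr(struct group *g)
{
	assert(g);
	for (int i = 0; i < NUM_CNTRS; i++) {
		if (!cntr_cfg[i]) {
			cntr_cfg[i] = g;
			config_cntr(i, 1);
			return i;
		}
	}
	return -1;
}

/* Release every counter owned by @g. The slot is always cleared. */
void unassign_cntr(struct group *g)
{
	assert(g);
	for (int i = 0; i < NUM_CNTRS; i++) {
		if (cntr_cfg[i] == g) {
			config_cntr(i, 0);
			cntr_cfg[i] = NULL;
		}
	}
}

/* Count table entries that still point at @g. */
int refs_to(const struct group *g)
{
	int n = 0;

	for (int i = 0; i < NUM_CNTRS; i++)
		if (cntr_cfg[i] == g)
			n++;
	return n;
}
```

In this model refs_to() is zero after unassign_cntr() regardless of what the
hook does, which is the property needed before the group's memory is released.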

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 2b86124c336b..f61f0cd032ef 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2687,6 +2687,46 @@ static void schemata_list_destroy(void)
>  	}
>  }
>  
> +/*
> + * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
> + * counters are automatically assigned. Each group can accommodate two counters:
> + * one for the total event and one for the local event. Assignments may fail
> + * due to the limited number of counters. However, it is not necessary to fail
> + * the group creation and thus no failure is returned. Users have the option
> + * to modify the counter assignments after the group has been created.
> + */
> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
> +		return;
> +
> +	if (is_mbm_total_enabled())
> +		resctrl_assign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
> +
> +	if (is_mbm_local_enabled())
> +		resctrl_assign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
> +/*
> + * Called when a group is deleted. Counters are unassigned if it was in
> + * assigned state.
> + */
> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
> +		return;
> +

It looks to me as though there are a couple of places (rmdir_all_sub(), rdt_kill_sb(),
and rdtgroup_rmdir_ctrl()) where rdtgroup_unassign_cntrs() could be called on a system that
does not support monitoring and/or only supports cache allocation monitoring.

In these paths it is only the architecture's resctrl_arch_mbm_cntr_assign_enabled(r) that
gates the resctrl flow. I think rdtgroup_unassign_cntrs() and, to match, rdtgroup_assign_cntrs()
could both do with at least an r->mon_capable check.
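
A small user-space sketch of such a gate, under stated assumptions: struct
res_model and its two fields are hypothetical simplifications of the resource
state, and unassign_calls only exists so the effect of the guard can be
observed. It is not the proposed kernel change itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, pared-down view of a resource's monitoring state. */
struct res_model {
	bool mon_capable;		/* resource supports monitoring at all */
	bool cntr_assign_enabled;	/* assignable-counter mode is active */
};

/* Counts how often the unassign body actually ran. */
static int unassign_calls;

/* Counter operations make sense only if both conditions hold. */
bool cntr_ops_allowed(const struct res_model *r)
{
	assert(r);
	return r->mon_capable && r->cntr_assign_enabled;
}

/* Model of the unassign path: return early unless the gate passes. */
void unassign_cntrs(const struct res_model *r)
{
	if (!cntr_ops_allowed(r))
		return;
	unassign_calls++;
}
```

With the generic capability checked first, an allocation-only system never
reaches the mode-specific logic, whatever the architecture helper reports.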

> +	if (is_mbm_total_enabled())
> +		resctrl_unassign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
> +
> +	if (is_mbm_local_enabled())
> +		resctrl_unassign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
>  static int rdt_get_tree(struct fs_context *fc)
>  {
>  	struct rdt_fs_context *ctx = rdt_fc2context(fc);
> @@ -2741,6 +2781,8 @@ static int rdt_get_tree(struct fs_context *fc)
>  		if (ret < 0)
>  			goto out_info;
>  
> +		rdtgroup_assign_cntrs(&rdtgroup_default);
> +
>  		ret = mkdir_mondata_all(rdtgroup_default.kn,
>  					&rdtgroup_default, &kn_mondata);
>  		if (ret < 0)
> @@ -2779,8 +2821,10 @@ static int rdt_get_tree(struct fs_context *fc)
>  	if (resctrl_arch_mon_capable())
>  		kernfs_remove(kn_mondata);
>  out_mongrp:
> -	if (resctrl_arch_mon_capable())
> +	if (resctrl_arch_mon_capable()) {
> +		rdtgroup_unassign_cntrs(&rdtgroup_default);
>  		kernfs_remove(kn_mongrp);
> +	}
>  out_info:
>  	kernfs_remove(kn_info);
>  out_schemata_free:
> @@ -2956,6 +3000,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
>  
>  	head = &rdtgrp->mon.crdtgrp_list;
>  	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
> +		rdtgroup_unassign_cntrs(sentry);
>  		free_rmid(sentry->closid, sentry->mon.rmid);
>  		list_del(&sentry->mon.crdtgrp_list);
>  
> @@ -2996,6 +3041,8 @@ static void rmdir_all_sub(void)
>  		cpumask_or(&rdtgroup_default.cpu_mask,
>  			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>  
> +		rdtgroup_unassign_cntrs(rdtgrp);
> +
>  		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>  
>  		kernfs_remove(rdtgrp->kn);
> @@ -3027,6 +3074,8 @@ static void rdt_kill_sb(struct super_block *sb)
>  	for_each_alloc_capable_rdt_resource(r)
>  		reset_all_ctrls(r);
>  	rmdir_all_sub();
> +	rdtgroup_unassign_cntrs(&rdtgroup_default);
> +	mbm_cntr_reset(&rdt_resources_all[RDT_RESOURCE_L3].r_resctrl);
>  	rdt_pseudo_lock_release();
>  	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>  	schemata_list_destroy();
> @@ -3490,9 +3539,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
>  	}
>  	rdtgrp->mon.rmid = ret;
>  
> +	rdtgroup_assign_cntrs(rdtgrp);
> +
>  	ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
>  	if (ret) {
>  		rdt_last_cmd_puts("kernfs subdir error\n");
> +		rdtgroup_unassign_cntrs(rdtgrp);
>  		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>  		return ret;
>  	}
> @@ -3502,8 +3554,10 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
>  
>  static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
>  {
> -	if (resctrl_arch_mon_capable())
> +	if (resctrl_arch_mon_capable()) {
> +		rdtgroup_unassign_cntrs(rgrp);
>  		free_rmid(rgrp->closid, rgrp->mon.rmid);
> +	}
>  }
>  
>  static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
> @@ -3764,6 +3818,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>  	update_closid_rmid(tmpmask, NULL);
>  
>  	rdtgrp->flags = RDT_DELETED;
> +
> +	rdtgroup_unassign_cntrs(rdtgrp);
> +
>  	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>  
>  	/*
> @@ -3810,6 +3867,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>  	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>  	update_closid_rmid(tmpmask, NULL);
>  
> +	rdtgroup_unassign_cntrs(rdtgrp);
> +
>  	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>  	closid_free(rdtgrp->closid);
>  

Reinette


* Re: [PATCH v11 18/23] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode
  2025-01-22 20:20 ` [PATCH v11 18/23] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode Babu Moger
@ 2025-02-06 18:04   ` Reinette Chatre
  2025-02-10 17:39     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06 18:04 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> In mbm_cntr_assign mode, the hardware counter should be assigned to read
> the MBM events.
> 
> Report 'Unassigned' in case the user attempts to read the events without
> assigning the counter.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 99cae75559b0..072b15550ff7 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -431,6 +431,16 @@ When monitoring is enabled all MON groups will also contain:
>  	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
>  	where "YY" is the node number.
>  
> +	When supported the mbm_cntr_assign mode allows users to assign a

Same comment as previous version.

> +	counter to mon_hw_id, event pair enabling bandwidth monitoring for
> +	as long as the counter remains assigned. The hardware will continue
> +	tracking the assigned mon_hw_id until the user manually unassigns
> +	it, ensuring that counters are not reset during this period. With
> +	a limited number of counters, the system may run out of assignable
> +	counters. In that case, MBM event counters will return 'Unassigned'
> +	when the event is read. Users must manually assign a counter to read
> +	the events.
> +
>  "mon_hw_id":
>  	Available only with debug option. The identifier used by hardware
>  	for the monitor group. On x86 this is the RMID.

Reinette


* Re: [PATCH v11 19/23] x86/resctrl: Introduce the interface to switch between monitor modes
  2025-01-22 20:20 ` [PATCH v11 19/23] x86/resctrl: Introduce the interface to switch between monitor modes Babu Moger
@ 2025-02-06 18:05   ` Reinette Chatre
  2025-02-10 18:54     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06 18:05 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> Resctrl subsystem can support two monitoring modes, 'mbm_cntr_assign' or
> 'default'. In mbm_cntr_assign, monitoring event can only accumulate data
> while it is backed by a hardware counter. In 'default' mode, resctrl
> assumes there is a hardware counter for each event within every CTRL_MON
> and MON group.
> 
> Introduce interface to switch between mbm_cntr_assign and default modes.
> 
> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> [mbm_cntr_assign]
> default
> 
> To enable the "mbm_cntr_assign" mode:
> $ echo "mbm_cntr_assign" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> 
> To enable the default monitoring mode:
> $ echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> 
> MBM event counters are automatically reset as part of changing the mode.
> Clear both architectural and non-architectural event states to prevent
> overflow conditions during the next event read.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---


> ---
>  Documentation/arch/x86/resctrl.rst     | 25 ++++++++++++-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 50 +++++++++++++++++++++++++-
>  2 files changed, 73 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 072b15550ff7..5d18c4c8bc48 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -259,7 +259,10 @@ with the following files:
>  
>  "mbm_assign_mode":
>  	Reports the list of monitoring modes supported. The enclosed brackets
> -	indicate which mode is enabled.
> +	indicate which mode is enabled. The MBM events (mbm_total_bytes and/or
> +	mbm_local_bytes) associated with counters may reset when "mbm_assign_mode"
> +	is changed.
> +
>  	::
>  
>  	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> @@ -275,6 +278,16 @@ with the following files:
>  	available is described in the "num_mbm_cntrs" file. Changing the mode
>  	may cause all counters on a resource to reset.
>  
> +	Moving to mbm_cntr_assign mode require users to assign the counters to
> +	the events. Otherwise, the MBM event counters will return "Unassigned"
> +	when read.

Again ... please be consistent in using single or double quotes for information
returned from file.

> +
> +	The mode is beneficial for AMD platforms that support more CTRL_MON
> +	and MON groups than available hardware counters. By default, this
> +	feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth
> +	Monitoring Counters) capability, ensuring counters remain assigned even
> +	when the corresponding RMID is not actively used by any processor.
> +
>  	"default":
>  
>  	In default mode, resctrl assumes there is a hardware counter for each
> @@ -283,6 +296,16 @@ with the following files:
>  	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
>  	there is no counter associated with that event.
>  
> +	* To enable "mbm_cntr_assign" mode:
> +	  ::
> +
> +	    # echo "mbm_cntr_assign" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> +
> +	* To enable default monitoring mode:
> +	  ::
> +
> +	    # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> +

Please be consistent in the documentation.

To enable "mbm_cntr_assign" mode:
To enable "default" mode:
or
To enable "mbm_cntr_assign" monitoring mode:
To enable "default" monitoring mode:
or 
...?



>  "num_mbm_cntrs":
>  	The number of monitoring counters available for assignment when the
>  	system supports mbm_cntr_assign mode.
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index f61f0cd032ef..6922173c4f8f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -928,6 +928,53 @@ static int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
>  	return ret;
>  }
>  
> +static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,
> +					     char *buf, size_t nbytes, loff_t off)
> +{
> +	struct rdt_resource *r = of->kn->parent->priv;
> +	int ret = 0;
> +	bool enable;
> +
> +	/* Valid input requires a trailing newline */
> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
> +		return -EINVAL;
> +
> +	buf[nbytes - 1] = '\0';
> +
> +	cpus_read_lock();
> +	mutex_lock(&rdtgroup_mutex);
> +
> +	rdt_last_cmd_clear();
> +
> +	if (!strcmp(buf, "default")) {
> +		enable = 0;
> +	} else if (!strcmp(buf, "mbm_cntr_assign")) {
> +		if (r->mon.mbm_cntr_assignable) {
> +			enable = 1;
> +		} else {
> +			ret = -EINVAL;
> +			rdt_last_cmd_puts("mbm_cntr_assign mode is not supported\n");
> +			goto write_exit;
> +		}
> +	} else {
> +		ret = -EINVAL;
> +		rdt_last_cmd_puts("Unsupported assign mode\n");
> +		goto write_exit;
> +	}
> +
> +	if (enable != resctrl_arch_mbm_cntr_assign_enabled(r)) {
> +		ret = resctrl_arch_mbm_cntr_assign_set(r, enable);
> +		if (!ret)
> +			mbm_cntr_reset(r);

The following APIs interact with the MBM assignable counters:

mbm_cntr_alloc()
mbm_cntr_get()
mbm_cntr_free()

mbm_cntr_reset() appears to be related but does significantly more
than interact with the MBM assignable counters and that creates a
confusing API.

How about introducing mbm_cntr_free_all() that _only_ releases all
MBM assignable counters and match with mbm_cntr_free() that releases
a single MBM assignable counter? mbm_cntr_free_all() lives with the
other functions operating on MBM assignable counters, thus not
hiding its functionality in other parts of resctrl.

This series open codes reset of non-architectural state in two places,
within mbm_cntr_reset() and within mbm_config_write_domain(). That
can be turned into a new helper that only resets non-architectural state,
for example resctrl_reset_rmid_all() to match existing
resctrl_arch_reset_rmid_all().

resctrl_arch_mbm_cntr_assign_set() can also reset any architectural
state leaving mbm_cntr_free_all() and resctrl_reset_rmid_all() to be called
here and from within mbm_config_write_domain().
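
The split described above can be sketched in user space as follows. This is a
rough model under assumptions, not the kernel implementation: struct
dom_model, the array sizes, and the helper names cntr_free_all(),
reset_rmid_all() and mode_switch_reset() are hypothetical, and "software
state" here stands for the non-architectural per-RMID bookkeeping.

```c
#include <assert.h>
#include <string.h>

#define NUM_CNTRS 4
#define NUM_RMIDS 8

/* Hypothetical counter-assignment entry. */
struct cntr_cfg_model {
	int rmid;
	int evtid;
	int in_use;
};

/* Hypothetical software (non-architectural) per-RMID state. */
struct mbm_state_model {
	unsigned long long chunks;
	unsigned long long prev_bw_bytes;
};

/* Hypothetical monitoring domain. */
struct dom_model {
	struct cntr_cfg_model cntr_cfg[NUM_CNTRS];
	struct mbm_state_model mbm_total[NUM_RMIDS];
	struct mbm_state_model mbm_local[NUM_RMIDS];
};

/* Release every assignable counter in @d and touch nothing else. */
void cntr_free_all(struct dom_model *d)
{
	assert(d);
	memset(d->cntr_cfg, 0, sizeof(d->cntr_cfg));
}

/* Clear only the software per-RMID state in @d. */
void reset_rmid_all(struct dom_model *d)
{
	assert(d);
	memset(d->mbm_total, 0, sizeof(d->mbm_total));
	memset(d->mbm_local, 0, sizeof(d->mbm_local));
}

/* Mode switch: compose the two single-purpose helpers for each domain. */
void mode_switch_reset(struct dom_model *doms, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		cntr_free_all(&doms[i]);
		reset_rmid_all(&doms[i]);
	}
}
```

Keeping each helper to one job lets a caller that only needs the software
state cleared use reset_rmid_all() without also dropping counter assignments.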

What do you think?

> +	}
> +
> +write_exit:
> +	mutex_unlock(&rdtgroup_mutex);
> +	cpus_read_unlock();
> +
> +	return ret ?: nbytes;
> +}
> +
>  #ifdef CONFIG_PROC_CPU_RESCTRL
>  
>  /*
> @@ -1945,9 +1992,10 @@ static struct rftype res_common_files[] = {
>  	},
>  	{
>  		.name		= "mbm_assign_mode",
> -		.mode		= 0444,
> +		.mode		= 0644,
>  		.kf_ops		= &rdtgroup_kf_single_ops,
>  		.seq_show	= resctrl_mbm_assign_mode_show,
> +		.write		= resctrl_mbm_assign_mode_write,
>  		.fflags		= RFTYPE_MON_INFO,
>  	},
>  	{

Reinette


* Re: [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2025-02-06 16:15     ` Moger, Babu
@ 2025-02-06 18:42       ` Reinette Chatre
  2025-02-06 22:57         ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06 18:42 UTC (permalink / raw)
  To: Moger, Babu, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/6/25 8:15 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 2/5/2025 4:49 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 1/22/25 12:20 PM, Babu Moger wrote:
>>> Add the functionality to enable/disable AMD ABMC feature.
>>>
>>> AMD ABMC feature is enabled by setting enabled bit(0) in MSR
>>> L3_QOS_EXT_CFG. When the state of ABMC is changed, the MSR needs
>>> to be updated on all the logical processors in the QOS Domain.
>>>
>>> Hardware counters will reset when ABMC state is changed.
>>
>> I find that the state management in this series is organized better
>> and easier to understand. I do think that it can be simplified more
>> and a hint to this is that it is mentioned here but not done in the
>> code introduced here but instead required from the caller. It seems
>> simpler to me that the architectural state can just be reset at the
>> same time as enable/disable of ABMC?
> 
> Right now, it is done from mbm_cntr_reset(). It does both arch and non-arch state reset for all the RMIDs in all the domains. It is called in two places.
> 
> 1 rdtgroup.c resctrl_mbm_assign_mode_write -> mbm_cntr_reset();
Please see my response to this usage in the related patch:
https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/
In summary, I find mbm_cntr_reset() ended up being a catch-all for random
cleanup and creates confusion with the other mbm_cntr_*() calls.

> 2 rdtgroup.c rdt_kill_sb()-> mbm_cntr_reset();
Please see my response to this usage in the related patch:
https://lore.kernel.org/lkml/8d04f824-d1cc-461c-9c57-0f26c6aa96e0@intel.com/
In summary, it does not solve the problem it originally set out to solve
and it can be eliminated.

> 
> I will have to introduce another function to reset RMIDs in all the domains. Also, make sure it is called from both these places.
> 
> list_for_each_entry(dom, &r->mon_domains, hdr.list)
>             resctrl_arch_reset_rmid_all(r, dom);

I do not see a need for new functions, except the one I mention in
https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/
that suggests a new helper for reset of non-architectural state that does not
exist and ends up being open coded in two places in this series.

With only one place (resctrl_mbm_assign_mode_write()) remaining that needs
all state reset I think it will be easier to understand if the state reset
is open coded within it, replacing mbm_cntr_reset() with:

	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
		mbm_cntr_free_all()
		resctrl_reset_rmid_all() // Just for non-architectural state
	}

I would not insist on reset of architectural state within the
architectural helper. I find that it is best for architecture to
maintain its state but I also see there is much precedent for
resctrl explicitly managing the state.

> I feel current code is much more cleaner.  What do you think?

It is better than previous versions, yes.

> 
>>
>>>
>>> The ABMC feature details are documented in APM listed below [1].
>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>> Monitoring (ABMC).
>>>
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>> ---
>>
>> ...
>>

...

>>> + */
>>> +static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
>>> +{
>>> +    struct rdt_mon_domain *d;
>>> +
>>> +    list_for_each_entry(d, &r->mon_domains, hdr.list)
>>> +        on_each_cpu_mask(&d->hdr.cpu_mask,
>>> +                 resctrl_abmc_set_one_amd, &enable, 1);
>>> +}
>>> +
>>> +int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
>>> +{
>>> +    struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>>> +
>>> +    if (r->mon.mbm_cntr_assignable &&
>>> +        hw_res->mbm_cntr_assign_enabled != enable) {
>>> +        _resctrl_abmc_enable(r, enable);
>>> +        hw_res->mbm_cntr_assign_enabled = enable;
>>
>> Added benefit of resetting architectural state within this if statement
>> (perhaps simpler to be done within _resctrl_abmc_enable()) is that it will
>> not be done unnecessarily if ABMC is already in requested state.
> 
> It will be
>       list_for_each_entry(dom, &r->mon_domains, hdr.list)
>             resctrl_arch_reset_rmid_all(r, dom);

I am not sure if you are actually planning a new loop here ... as
I suggested above this can be added to _resctrl_abmc_enable() where
there is already a loop over all monitor domains and all that is
needed is to add a call to resctrl_arch_reset_rmid_all(r, dom).
Even so, as I mentioned above, if after fixing automatic counter
unassignment you still find that resetting architectural and
non-architectural state together is preferable then we can go with
that to match the other flows (e.g. mbm_config_write_domain()).
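
A minimal user-space sketch of folding the reset into the existing
per-domain loop, assuming a simplified model: struct res_model,
abmc_enable_all() and cntr_assign_set() are hypothetical names, the MSR
update is reduced to a comment, and arch_resets[] merely counts how often
each domain's architectural state would be reset.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_DOMS 2

/* Hypothetical, simplified resource model. */
struct res_model {
	bool assignable;		/* hardware offers assignable counters */
	bool enabled;			/* current mode */
	int arch_resets[NUM_DOMS];	/* per-domain count of arch state resets */
};

/*
 * One pass over the domains: change the mode (the real code would update
 * an MSR on the domain's CPUs here) and reset that domain's architectural
 * state in the same iteration.
 */
static void abmc_enable_all(struct res_model *r, bool enable)
{
	(void)enable;
	for (int i = 0; i < NUM_DOMS; i++)
		r->arch_resets[i]++;
}

/* Only act, and therefore only reset, when the requested mode differs. */
int cntr_assign_set(struct res_model *r, bool enable)
{
	assert(r);
	if (r->assignable && r->enabled != enable) {
		abmc_enable_all(r, enable);
		r->enabled = enable;
	}
	return 0;
}
```

Because the reset sits behind the state comparison, requesting the mode that
is already active leaves the counters untouched.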

Reinette



* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-01-22 20:20 ` [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
@ 2025-02-06 18:48   ` Reinette Chatre
  2025-02-10 19:46     ` Moger, Babu
  2025-02-19 16:07   ` Dave Martin
  2025-02-21 18:07   ` James Morse
  2 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06 18:48 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 1/22/25 12:20 PM, Babu Moger wrote:
> When mbm_cntr_assign mode is enabled, users can designate which of the MBM
> events in the CTRL_MON or MON groups should have counters assigned.
> 
> Provide an interface for assigning MBM events by writing to the file:
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control. Using this interface,
> events can be assigned or unassigned as needed.
> 
> Format is similar to the list format with addition of opcode for the
> assignment operation.
>  "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> 
> Format for specific type of groups:
> 
>  * Default CTRL_MON group:
>          "//<domain_id><opcode><flags>"
> 
>  * Non-default CTRL_MON group:
>          "<CTRL_MON group>//<domain_id><opcode><flags>"
> 
>  * Child MON group of default CTRL_MON group:
>          "/<MON group>/<domain_id><opcode><flags>"
> 
>  * Child MON group of non-default CTRL_MON group:
>          "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> 
> Domain_id '*' will apply the flags on all the domains.
> 
> Opcode can be one of the following:
> 
>  = Update the assignment to match the flags
>  + Assign a new MBM event without impacting existing assignments.
>  - Unassign a MBM event from currently assigned events.
> 
> Assignment flags can be one of the following:
>  t  MBM total event
>  l  MBM local event
>  tl Both total and local MBM events
>  _  None of the MBM events. Valid only with '=' opcode. This flag cannot
>     be combined with other flags.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v11: Fixed the static check warning with initializing dom_id in resctrl_process_flags().
> 
> v10: Fixed the issue with finding the domain in multiple iterations.
>      Printed error message with domain information when assign fails.
>      Changed the variables to unsigned for processing assign state.
>      Taken care of few format corrections.
> 
> v9: Fixed handling special case '//0=' and '//".
>     Removed extra strstr() call.
>     Added generic failure text when assignment operation fails.
>     Corrected user documentation format texts.
> 
> v8: Moved unassign as the first action during the assign modification.
>     Assign none "_" takes priority. Cannot be mixed with other flags.
>     Updated the documentation and .rst file format. htmldoc looks ok.
> 
> v7: Simplified the parsing (strsep(&token, "//") in rdtgroup_mbm_assign_control_write().
>     Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.
>     Renamed rdtgroup_find_grp to rdtgroup_find_grp_by_name.
>     Fixed rdtgroup_str_to_mon_state to return error for invalid flags.
>     Simplified the calls rdtgroup_assign_cntr by merging few functions earlier.
>     Removed ABMC reference in FS code.
>     Reinette commented about handling the combination of flags like 'lt_' and '_lt'.
>     Not sure if we need to change the behaviour here. Processed them sequencially right now.
>     Users have the liberty to pass the flags. Restricting it might be a problem later.
> 
> v6: Added support assign all if domain id is '*'
>     Fixed the allocation of counter id if it not assigned already.
> 
> v5: Interface name changed from mbm_assign_control to mbm_control.
>     Fixed opcode and flags combination.
>     '=_" is valid.
>     "-_" amd "+_" is not valid.
>     Minor message update.
>     Renamed the function with prefix - rdtgroup_.
>     Corrected few documentation mistakes.
>     Rebase related changes after SNC support.
> 
> v4: Added domain specific assignments. Fixed the opcode parsing.
> 
> v3: New patch.
>     Addresses the feedback to provide the global assignment interface.
> ---
>  Documentation/arch/x86/resctrl.rst     | 116 +++++++++++-
>  arch/x86/kernel/cpu/resctrl/internal.h |  10 +
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 241 ++++++++++++++++++++++++-
>  3 files changed, 365 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 3040e5c4cd76..47e15b48d951 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -356,7 +356,8 @@ with the following files:
>  	 t  MBM total event is assigned.
>  	 l  MBM local event is assigned.
>  	 tl Both MBM total and local events are assigned.
> -	 _  None of the MBM events are assigned.
> +	 _  None of the MBM events are assigned. Only works with opcode '=' for write
> +	    and cannot be combined with other flags.
>  
>  	Examples:
>  	::
> @@ -374,6 +375,119 @@ with the following files:
>  	There are four resctrl groups. All the groups have total and local MBM events
>  	assigned on domain 0 and 1.
>  
> +	Assignment state can be updated by writing to "mbm_assign_control".
> +
> +	Format is similar to the list format with addition of opcode for the
> +	assignment operation.
> +
> +		"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> +
> +	Format for each type of group:
> +
> +        * Default CTRL_MON group:
> +                "//<domain_id><opcode><flags>"
> +
> +        * Non-default CTRL_MON group:
> +                "<CTRL_MON group>//<domain_id><opcode><flags>"
> +
> +        * Child MON group of default CTRL_MON group:
> +                "/<MON group>/<domain_id><opcode><flags>"
> +
> +        * Child MON group of non-default CTRL_MON group:
> +                "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> +
> +	Domain_id '*' will apply the flags to all the domains.
> +
> +	Opcode can be one of the following:
> +	::
> +
> +	 = Update the assignment to match the MBM event.
> +	 + Assign a new MBM event without impacting existing assignments.
> +	 - Unassign a MBM event from currently assigned events.
> +
> +	Examples:
> +	Initial group status:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
> +	  //0=tl;1=tl
> +	  /child_default_mon_grp/0=tl;1=tl
> +
> +	To update the default group to assign only total MBM event on domain 0:
> +	::
> +
> +	  # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> +	Assignment status after the update:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
> +	  //0=t;1=tl
> +	  /child_default_mon_grp/0=tl;1=tl
> +
> +	To update the MON group child_default_mon_grp to remove total MBM event on domain 1:
> +	::
> +
> +	  # echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> +	Assignment status after the update:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
> +	  //0=t;1=tl
> +	  /child_default_mon_grp/0=tl;1=l
> +
> +	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to unassign
> +	both local and total MBM events on domain 1:
> +	::
> +
> +	  # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
> +			/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> +	Assignment status after the update:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
> +	  //0=t;1=tl
> +	  /child_default_mon_grp/0=tl;1=l
> +
> +	To update the default group to add a local MBM event domain 0:

"local MBM event domain 0" -> "local MBM event on domain 0"?

Taking a step back to look at the completed "mbm_assign_control" section
it is noteworthy that all this work is about assigning counters to events
but after this large section is complete the word "counter" does not appear
a single time.

The section starts with a brief:
"Reports the resctrl group and monitor status of each group." and then
moves to terms like "assigning events"/"assignment status" without defining
what that means.

Instead of rewriting this, what do you think of adding some definition
of what "assignment state" means to the start of the section. For example,
(I am sure it can be improved):

"Use "mbm_assign_control" to manage monitoring counter assignment to
monitoring events when mbm_cntr_assign_mode is enabled."
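
One way to picture the "assignment state" such a definition refers to is as a small set of flags per domain. The sketch below is a hypothetical user-space model, not code from this series: apply_assign_op() and the EVT_* names are made up for illustration, and the "+" operator is an assumption based on the "add a local MBM event" wording, since its command is not shown above.

```c
/* Hypothetical flag bits: one per MBM event that may hold a counter. */
enum { EVT_TOTAL = 1 << 0, EVT_LOCAL = 1 << 1 };

/*
 * Apply one "<op><flags>" request (for example "=tl", "-t", "=_") to the
 * assignment state of a single domain. Returns 0 on success and -1 on
 * malformed input, leaving *state untouched in that case.
 */
static int apply_assign_op(unsigned int *state, const char *req)
{
	unsigned int mask = 0;
	const char *p;

	if (!req || !req[0] || !req[1])
		return -1;

	/* Collect the requested events before touching the state. */
	for (p = req + 1; *p; p++) {
		switch (*p) {
		case 't': mask |= EVT_TOTAL; break;
		case 'l': mask |= EVT_LOCAL; break;
		case '_': break;	/* "_" stands for no event */
		default: return -1;
		}
	}

	switch (req[0]) {
	case '=': *state = mask; break;		/* replace the assignment */
	case '+': *state |= mask; break;	/* assumed: add events */
	case '-': *state &= ~mask; break;	/* remove events */
	default: return -1;
	}
	return 0;
}
```

Starting from both events assigned, the request "-t" leaves only the local event, which corresponds to the "1=l" shown for /child_default_mon_grp/ in the example output.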

Reinette



* Re: [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2025-02-06 18:42       ` Reinette Chatre
@ 2025-02-06 22:57         ` Moger, Babu
  2025-02-06 23:28           ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-06 22:57 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

Lots of things in here.

On 2/6/2025 12:42 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/6/25 8:15 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 2/5/2025 4:49 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 1/22/25 12:20 PM, Babu Moger wrote:
>>>> Add the functionality to enable/disable AMD ABMC feature.
>>>>
>>>> AMD ABMC feature is enabled by setting enabled bit(0) in MSR
>>>> L3_QOS_EXT_CFG. When the state of ABMC is changed, the MSR needs
>>>> to be updated on all the logical processors in the QOS Domain.
>>>>
>>>> Hardware counters will reset when ABMC state is changed.
>>>
>>> I find that the state management in this series is organized better
>>> and easier to understand. I do think that it can be simplified more
>>> and a hint to this is that it is mentioned here but not done in the
>>> code introduced here but instead required from the caller. It seems
>>> simpler to me that the architectural state can just be reset at the
>>> same time as enable/disable of ABMC?
>>
>> Right now, it is done from mbm_cntr_reset(). It does both arch and non-arch state reset for all the RMIDs in all the domains. It is called in two places.
>>
>> 1 rdtgroup.c resctrl_mbm_assign_mode_write -> mbm_cntr_reset();
> Please see my response to this usage in the related patch:
> https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/
> In summary, I find mbm_cntr_reset() ended up being a catch-all for random
> cleanup and creates confusion with the other mbm_cntr_*() calls.

Yes. It should work. Will respond to that comment later.

>> 2 rdtgroup.c rdt_kill_sb()-> mbm_cntr_reset();
> Please see my response to this usage in the related patch:
> https://lore.kernel.org/lkml/8d04f824-d1cc-461c-9c57-0f26c6aa96e0@intel.com/
> In summary, it does not solve the problem it originally set out to solve
> and it can be eliminated.

Yes. Should be fine. mbm_cntr_reset() can be completely removed.
Will respond to that comment later.

> 
>>
>> I will have to introduce another function to reset RMIDs in all the domains. Also, make sure it is called from both these places.
>>
>> list_for_each_entry(dom, &r->mon_domains, hdr.list)
>>              resctrl_arch_reset_rmid_all(r, dom);
> 
> I do not see need for new functions, except the one I mention in
> https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/
> that suggests a new helper for reset of architectural state that does not
> exist and ends up being open coded in two places in this series.
> 
> With only one place (resctrl_mbm_assign_mode_write()) remaining that needs
> all state reset I think it will be easier to understand if the state reset
> is open coded within it, replacing mbm_cntr_reset() with:
> 
> 	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> 		mbm_cntr_free_all()
> 		resctrl_reset_rmid_all() // Just for architectural state
> 	}

You meant "Just for non-architectural state" ?


> I would not insist on reset of architectural state within the
> architectural helper. I find that it is best for architecture to
> maintain its state but I also see there are many precedents for
> resctrl explicitly managing the state.
> 
>> I feel current code is much cleaner.  What do you think?
> 
> It is better than previous versions, yes.
> 
>>
>>>
>>>>
>>>> The ABMC feature details are documented in APM listed below [1].
>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>> Monitoring (ABMC).
>>>>
>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>> ---
>>>
>>> ...
>>>
> 
> ...
> 
>>>> + */
>>>> +static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
>>>> +{
>>>> +    struct rdt_mon_domain *d;
>>>> +
>>>> +    list_for_each_entry(d, &r->mon_domains, hdr.list)
>>>> +        on_each_cpu_mask(&d->hdr.cpu_mask,
>>>> +                 resctrl_abmc_set_one_amd, &enable, 1);
>>>> +}
>>>> +
>>>> +int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
>>>> +{
>>>> +    struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>>>> +
>>>> +    if (r->mon.mbm_cntr_assignable &&
>>>> +        hw_res->mbm_cntr_assign_enabled != enable) {
>>>> +        _resctrl_abmc_enable(r, enable);
>>>> +        hw_res->mbm_cntr_assign_enabled = enable;
>>>
>>> Added benefit of resetting architectural state within this if statement
>>> (perhaps simpler to be done within _resctrl_abmc_enable()) is that it will
>>> not be done unnecessarily if ABMC is already in requested state.
>>
>> It will be
>>        list_for_each_entry(dom, &r->mon_domains, hdr.list)
>>              resctrl_arch_reset_rmid_all(r, dom);
> 
> I am not sure if you are actually planning a new loop here ... as
> I suggested above this can be added to _resctrl_abmc_enable() where
> there is already a loop over all monitor domains and all that is
> needed is to add a call to resctrl_arch_reset_rmid_all(r, dom).

Sure.
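
The suggestion above, resetting architectural state in the same pass that toggles ABMC and only when the state actually changes, can be sketched in user space as follows. This is an illustrative model under assumptions, not the kernel code: the model_* types, the fixed sizes and the resets counter are hypothetical stand-ins for the MSR write and for resctrl_arch_reset_rmid_all().

```c
#include <stdbool.h>
#include <string.h>

#define NUM_DOMAINS 2	/* hypothetical sizes for the model */
#define NUM_RMIDS   4

struct model_domain {
	bool abmc_on;				/* stands in for the MSR enable bit */
	unsigned long arch_state[NUM_RMIDS];	/* stands in for per-RMID arch state */
};

struct model_resource {
	bool assignable;
	bool assign_enabled;
	struct model_domain dom[NUM_DOMAINS];
	unsigned int resets;			/* counts reset passes, for illustration */
};

static void model_abmc_enable(struct model_resource *r, bool enable)
{
	for (int i = 0; i < NUM_DOMAINS; i++) {
		r->dom[i].abmc_on = enable;
		/* Reset architectural state in the same pass over the domains. */
		memset(r->dom[i].arch_state, 0, sizeof(r->dom[i].arch_state));
	}
	r->resets++;
}

static int model_cntr_assign_set(struct model_resource *r, bool enable)
{
	/* Nothing is toggled or reset if already in the requested state. */
	if (r->assignable && r->assign_enabled != enable) {
		model_abmc_enable(r, enable);
		r->assign_enabled = enable;
	}
	return 0;
}
```

Calling model_cntr_assign_set() twice with the same value performs one reset pass only, which is the "not done unnecessarily" benefit mentioned above.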

> Even so, as I mentioned above, if after fixing automatic counter
> unassignment you still find that resetting architectural and
> non-architectural state together is cleaner then we can go with that to match
> the other flows (eg. mbm_config_write_domain()).
> 

Sure.
Thanks
Babu


* Re: [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2025-02-06 22:57         ` Moger, Babu
@ 2025-02-06 23:28           ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-06 23:28 UTC (permalink / raw)
  To: Moger, Babu, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/6/25 2:57 PM, Moger, Babu wrote:
> On 2/6/2025 12:42 PM, Reinette Chatre wrote:
>> On 2/6/25 8:15 AM, Moger, Babu wrote:
>>> On 2/5/2025 4:49 PM, Reinette Chatre wrote:
>>>> On 1/22/25 12:20 PM, Babu Moger wrote:


>>>
>>> I will have to introduce another function to reset RMIDs in all the domains. Also, make sure it is called from both these places.
>>>
>>> list_for_each_entry(dom, &r->mon_domains, hdr.list)
>>>              resctrl_arch_reset_rmid_all(r, dom);
>>
>> I do not see need for new functions, except the one I mention in
>> https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/
>> that suggests a new helper for reset of architectural state that does not
>> exist and ends up being open coded in two places in this series.
>>
>> With only one place (resctrl_mbm_assign_mode_write()) remaining that needs
>> all state reset I think it will be easier to understand if the state reset
>> is open coded within it, replacing mbm_cntr_reset() with:
>>
>>     list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>>         mbm_cntr_free_all()
>>         resctrl_reset_rmid_all() // Just for architectural state
>>     }
> 
> You meant "Just for non-architectural state" ?

Yes, thank you, what a confusion causing typo.

Reinette


* Re: [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode
  2025-02-06 18:01   ` Reinette Chatre
@ 2025-02-06 23:41     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-06 23:41 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/6/2025 12:01 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>> supported.
>>
>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>
>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
> 
> ""mbm_cntr_assign" is backed" -> ""mbm_cntr_assign" mode is backed"?

Sure.
> 
>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>
>> The "default" mode is the existing monitoring mode that works without the
>> explicit counter assignment, instead relying on dynamic counter assignment
>> by hardware that may result in hardware not dedicating a counter resulting
>> in monitoring data reads returning "Unavailable".
>>
>> Provide an interface to display the monitor mode on the system.
>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> [mbm_cntr_assign]
>> default
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> 
>> ---
>>   Documentation/arch/x86/resctrl.rst     | 26 +++++++++++++++++++++
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 31 ++++++++++++++++++++++++++
>>   2 files changed, 57 insertions(+)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index fb90f08e564e..b5defc5bce0e 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -257,6 +257,32 @@ with the following files:
>>   	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>>   	    0=0x30;1=0x30;3=0x15;4=0x15
>>   
>> +"mbm_assign_mode":
>> +	Reports the list of monitoring modes supported. The enclosed brackets
>> +	indicate which mode is enabled.
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> +	  [mbm_cntr_assign]
>> +	  default
>> +
>> +	"mbm_cntr_assign":
>> +
>> +	In mbm_cntr_assign, monitoring event can only accumulate data while
> 
> "In mbm_cntr_assign, monitoring event" -> "In mbm_cntr_assign mode, a monitoring event"?

Sure.

> 
>> +	it is backed by a hardware counter. The user-space is able to specify
>> +	which of the events in CTRL_MON or MON groups should have a counter
>> +	assigned using the "mbm_assign_control" file. The number of counters
>> +	available is described in the "num_mbm_cntrs" file. Changing the mode
>> +	may cause all counters on a resource to reset.
>> +
>> +	"default":
>> +
>> +	In default mode, resctrl assumes there is a hardware counter for each
>> +	event within every CTRL_MON and MON group. On AMD platforms, it is
>> +	recommended to use mbm_cntr_assign mode if supported, because reading
>> +	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
> 
> Reinette
> 

Thanks
Babu


* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-06 16:17     ` Reinette Chatre
@ 2025-02-07 10:07       ` Xin Li
  2025-02-11 19:44         ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Xin Li @ 2025-02-07 10:07 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On 2/6/2025 8:17 AM, Reinette Chatre wrote:
>>> +    wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
>> This is the existing code, however it would be better to use wrmsrl()
>> when the higher 32-bit are all 0s:
>>
>>      wrmsrl(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config);
>>
> Could you please elaborate what makes this change better?

In short, it takes one less argument, and doesn't pass an argument of 0.

The longer story is that hpa and I are refactoring the MSR access APIs
to accommodate the immediate form of MSR access instructions.  And we
are not happy that there are too many MSR access APIs and that their
uses are *random*.  The native wrmsr() and wrmsrl() are essentially the 
same and the only difference is that wrmsr() passes a 64-bit value to be
written into a MSR in *2* u32 arguments.  But we already have struct msr
defined in asm/shared/msr.h as:
	struct msr {
		union {
			struct {
				u32 l;
				u32 h;
			};
			u64 q;
		};
	};

it's more natural to do the same job with this data structure in most
cases.  And we want to remove wrmsr() and only keep wrmsrl(), thus a
developer won't have to figure out which one is better to use :-P.

For that to happen, one cleanup is to replace wrmsr(msr, low, 0) with
wrmsrl(msr, low) (low is automatically converted to u64 from u32).
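
To illustrate why the two call styles carry the same value when the upper half is zero, here is a small user-space sketch. It does not touch any MSR; value_from_halves() and value_from_u64() are hypothetical helpers that only model how the arguments combine, and struct msr_model mirrors the struct quoted above with fixed-width types.

```c
#include <stdint.h>

/* Same layout as the struct msr quoted above, using fixed-width C types. */
struct msr_model {
	union {
		struct {
			uint32_t l;	/* low half on little-endian x86 */
			uint32_t h;	/* high half on little-endian x86 */
		};
		uint64_t q;
	};
};

/* Value carried by a two-argument call in the style of wrmsr(msr, low, high). */
static uint64_t value_from_halves(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}

/* Value carried by a one-argument call in the style of wrmsrl(msr, val). */
static uint64_t value_from_u64(uint64_t val)
{
	return val;
}
```

For any 32-bit configuration value cfg, value_from_halves(cfg, 0) equals value_from_u64(cfg), since the implicit conversion to a 64-bit type zero-extends.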

However, I'm fine if Babu wants to keep it as-is.

Thanks!
     Xin


* Re: [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters
  2025-02-05 23:17   ` Reinette Chatre
@ 2025-02-07 17:18     ` Moger, Babu
  2025-02-07 18:52       ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-07 17:18 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/5/2025 5:17 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> The mbm_cntr_assign mode provides an option to the user to assign a
>> counter to an RMID, event pair and monitor the bandwidth as long as
>> the counter is assigned. Number of assignments depend on number of
>> monitoring counters available.
>>
>> Provide the interface to display the number of monitoring counters
>> supported. The resctrl file 'num_mbm_cntrs' is visible to user space
>> when the system supports mbm_cntr_assign mode.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> ...
> 
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index b5defc5bce0e..31ff764deeeb 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -283,6 +283,22 @@ with the following files:
>>   	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
>>   	there is no counter associated with that event.
>>   
>> +"num_mbm_cntrs":
>> +	The number of monitoring counters available for assignment when the
>> +	system supports mbm_cntr_assign mode.
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>> +	  32
>> +
>> +	The resctrl file system supports tracking up to two memory bandwidth
>> +	events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
>> +	Up to two counters can be assigned per monitoring group, one for each
>> +	memory bandwidth event. More monitoring groups can be tracked by
>> +	assigning one counter per monitoring group. However, doing so limits
>> +	memory bandwidth tracking to a single memory bandwidth event per
>> +	monitoring group.
>> +
> 
> This text needs an update to reflect the switch to per-domain counter assignment.

Does this look ok? Just added domain in the text.

"The number of monitoring counters available in each domain for 
assignment when the system supports mbm_cntr_assign mode.
::
   # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
   32

The resctrl file system supports tracking up to two memory bandwidth
events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
Up to two counters can be assigned per monitoring group, one for each
memory bandwidth event in each domain. More monitoring groups can be 
tracked by assigning one counter per monitoring group. However, doing so 
limits memory bandwidth tracking to a single memory bandwidth event per
monitoring group."


Thanks
Babu


* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-05 23:58   ` Reinette Chatre
  2025-02-06  0:51     ` Luck, Tony
@ 2025-02-07 17:30     ` Moger, Babu
  1 sibling, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-07 17:30 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/5/2025 5:58 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> The event configuration is domain specific and initialized during domain
>> initialization. The values are stored in struct rdt_hw_mon_domain.
>>
>> It is not required to read the configuration register every time user asks
>> for it. Use the value stored in struct rdt_hw_mon_domain instead.
>>
>> Introduce resctrl_arch_mon_event_config_get() and
>> resctrl_arch_mon_event_config_set() to get/set architecture domain specific
>> mbm_total_cfg/mbm_local_cfg values.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
>> ---
>>   arch/x86/kernel/cpu/resctrl/internal.h | 15 +++++++
>>   arch/x86/kernel/cpu/resctrl/monitor.c  | 46 +++++++++++++++++++
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++---------------------
>>   3 files changed, 72 insertions(+), 50 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index ab28b9340ee7..cfaea20145d0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -605,6 +605,18 @@ union cpuid_0x10_x_edx {
>>   	unsigned int full;
>>   };
>>   
>> +/**
>> + * struct mon_config_info - Monitoring event configuratiin details
> 
> Same typo as previous version.

I am really sorry about this.

> 
>> + * @d:			Domain for the event
>> + * @evtid:		Event type
>> + * @mon_config:		Event configuration value
>> + */
>> +struct mon_config_info {
>> +	struct rdt_mon_domain *d;
>> +	enum resctrl_event_id evtid;
>> +	u32 mon_config;
>> +};
>> +
>>   void rdt_last_cmd_clear(void);
>>   void rdt_last_cmd_puts(const char *s);
>>   __printf(1, 2)
>> @@ -674,4 +686,7 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
>>   bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
>>   void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom);
>>   unsigned int mon_event_config_index_get(u32 evtid);
>> +void resctrl_arch_mon_event_config_set(void *info);
>> +u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>> +				      enum resctrl_event_id eventid);
>>   #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 8917c7261680..6fe9e610e9a0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1324,3 +1324,49 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
>>   
>>   	return 0;
>>   }
>> +
>> +u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>> +				      enum resctrl_event_id eventid)
>> +{
>> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>> +
>> +	switch (eventid) {
>> +	case QOS_L3_OCCUP_EVENT_ID:
>> +		break;
>> +	case QOS_L3_MBM_TOTAL_EVENT_ID:
>> +		return hw_dom->mbm_total_cfg;
>> +	case QOS_L3_MBM_LOCAL_EVENT_ID:
>> +		return hw_dom->mbm_local_cfg;
>> +	}
>> +
>> +	/* Never expect to get here */
>> +	WARN_ON_ONCE(1);
>> +
>> +	return INVALID_CONFIG_VALUE;
>> +}
>> +
>> +void resctrl_arch_mon_event_config_set(void *info)
>> +{
>> +	struct mon_config_info *mon_info = info;
>> +	struct rdt_hw_mon_domain *hw_dom;
>> +	unsigned int index;
>> +
>> +	index = mon_event_config_index_get(mon_info->evtid);
>> +	if (index == INVALID_CONFIG_INDEX)
>> +		return;
>> +
>> +	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
>> +
>> +	hw_dom = resctrl_to_arch_mon_dom(mon_info->d);
>> +
>> +	switch (mon_info->evtid) {
>> +	case QOS_L3_MBM_TOTAL_EVENT_ID:
>> +		hw_dom->mbm_total_cfg = mon_info->mon_config;
>> +		break;
>> +	case QOS_L3_MBM_LOCAL_EVENT_ID:
>> +		hw_dom->mbm_local_cfg = mon_info->mon_config;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +}
> 
> This new arch API has sharp corners because of asymmetry of where resctrl
> runs the arch function. I do not think it is required to change this since we
> can only speculate about how this may be used in the future but I do think
> it will be helpful to add comments that highlight:
> 
> resctrl_arch_mon_event_config_get() ->  May run on CPU that does not belong to domain.
> resctrl_arch_mon_event_config_set() ->  Runs on CPU that belongs to domain.

Sure. will do.

> 
> ...
> 
>> @@ -1683,33 +1653,23 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
>>   	return 0;
>>   }
>>   
>> -static void mon_event_config_write(void *info)
>> -{
>> -	struct mon_config_info *mon_info = info;
>> -	unsigned int index;
>> -
>> -	index = mon_event_config_index_get(mon_info->evtid);
>> -	if (index == INVALID_CONFIG_INDEX) {
>> -		pr_warn_once("Invalid event id %d\n", mon_info->evtid);
>> -		return;
>> -	}
>> -	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
>> -}
>>   
>>   static void mbm_config_write_domain(struct rdt_resource *r,
>>   				    struct rdt_mon_domain *d, u32 evtid, u32 val)
>>   {
>>   	struct mon_config_info mon_info = {0};
> 
> As discussed in previous version it is unnecessary to explicitly initialize
> the structure if it is fully initialized in the code. This avoids need for
> future cleanups like commit 29eaa7958367 ("x86/resctrl: Slightly clean-up mbm_config_show()")

Yes. Need to change it.

thanks
Babu


* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-02-05 23:57   ` Reinette Chatre
@ 2025-02-07 18:23     ` Moger, Babu
  2025-02-10 18:10       ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-07 18:23 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/5/2025 5:57 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> In mbm_cntr_assign mode hardware counters are assigned/unassigned to an
>> MBM event of a monitor group. Hardware counters are assigned/unassigned
>> at monitoring domain level.
>>
>> Manage a monitoring domain's hardware counters using a per monitoring
>> domain array of struct mbm_cntr_cfg that is indexed by the hardware
>> counter	ID. A hardware counter's configuration contains the MBM event
> 
> Something strange in this changelog with a few random \t in the text.

Not sure how it got in there. I can only see it with the "set list" option.

I will remove it.

> 
>> ID and points to the monitoring group that it is assigned to, with a
>> NULL pointer meaning that the hardware counter is available for assignment.
>>
>> There is no direct way to determine which hardware counters are	assigned
> 
> ... another \t above

Sure.
> 
>> to a particular monitoring group. Check every entry of every hardware
>> counter	configuration array in every monitoring domain to query which
> 
> ... one more \t above

Sure

> 
>> MBM events of a monitoring group is tracked by hardware. Such queries
>> are acceptable because of a very small number of assignable counters.
> 
> It is not obvious what "very small number" means. Is it possible to give
> a range to help reader understand the motivation?

How about?

MBM events of a monitoring group are tracked by hardware. Such queries
are acceptable because of a very small number of assignable counters
(32 to 64).
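
For a sense of what such a query costs, the scan over one domain's array can be modelled in user space as below. This is a sketch under assumptions: the model_* types and find_cntr() are hypothetical stand-ins for struct mbm_cntr_cfg and whatever helper the series ends up with.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types named in the patch. */
enum model_event_id { MODEL_MBM_TOTAL, MODEL_MBM_LOCAL };
struct model_group { int id; };

struct model_cntr_cfg {
	enum model_event_id evtid;	/* only valid if grp is not NULL */
	struct model_group *grp;	/* NULL means the counter is free */
};

/*
 * Return the counter ID assigned to (grp, evtid) in one domain's array,
 * or -1 if that event of the group holds no counter there.
 */
static int find_cntr(const struct model_cntr_cfg *cfg, size_t num_cntrs,
		     const struct model_group *grp, enum model_event_id evtid)
{
	for (size_t i = 0; i < num_cntrs; i++) {
		if (cfg[i].grp == grp && cfg[i].evtid == evtid)
			return (int)i;
	}
	return -1;
}
```

With 32 to 64 entries per domain, each lookup is a few dozen comparisons at most, which is the reason a linear scan is considered acceptable.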

> 
>>
>> Suggested-by: Peter Newman <peternewman@google.com>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
>> ---
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
>>   include/linux/resctrl.h                | 14 ++++++++++++++
>>   2 files changed, 25 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 18110a1afb6d..75a3b56996ca 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -4009,6 +4009,7 @@ static void __init rdtgroup_setup_default(void)
>>   
>>   static void domain_destroy_mon_state(struct rdt_mon_domain *d)
>>   {
>> +	kfree(d->cntr_cfg);
>>   	bitmap_free(d->rmid_busy_llc);
>>   	kfree(d->mbm_total);
>>   	kfree(d->mbm_local);
>> @@ -4082,6 +4083,16 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain
>>   			return -ENOMEM;
>>   		}
>>   	}
>> +	if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
>> +		tsize = sizeof(*d->cntr_cfg);
>> +		d->cntr_cfg = kcalloc(r->mon.num_mbm_cntrs, tsize, GFP_KERNEL);
>> +		if (!d->cntr_cfg) {
>> +			bitmap_free(d->rmid_busy_llc);
>> +			kfree(d->mbm_total);
>> +			kfree(d->mbm_local);
>> +			return -ENOMEM;
>> +		}
>> +	}
>>   
>>   	return 0;
>>   }
>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index 511cfce8fc21..9a54e307d340 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -94,6 +94,18 @@ struct rdt_ctrl_domain {
>>   	u32				*mbps_val;
>>   };
>>   
>> +/**
>> + * struct mbm_cntr_cfg - assignable counter configuration
>> + * @evtid:		 MBM event to which the counter is assigned. Only valid
>> + *			 if @rdtgroup is not NULL.
>> + * @rdtgroup:		 resctrl group assigned to the counter. NULL if the
>> + *			 counter is free.
>> + */
>> +struct mbm_cntr_cfg {
>> +	enum resctrl_event_id	evtid;
>> +	struct rdtgroup		*rdtgrp;
>> +};
>> +
> 
> $ scripts/kernel-doc -v -none include/linux/resctrl.h
> ...
> include/linux/resctrl.h:107: warning: Function parameter or struct member 'rdtgrp' not described in 'mbm_cntr_cfg'
> include/linux/resctrl.h:107: warning: Excess struct member 'rdtgroup' description in 'mbm_cntr_cfg'


Yes. Will fix this.

> ...
> 
>>   /**
>>    * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
>>    * @hdr:		common header for different domain types
>> @@ -105,6 +117,7 @@ struct rdt_ctrl_domain {
>>    * @cqm_limbo:		worker to periodically read CQM h/w counters
>>    * @mbm_work_cpu:	worker CPU for MBM h/w counters
>>    * @cqm_work_cpu:	worker CPU for CQM h/w counters
>> + * @cntr_cfg:		assignable counters configuration
>>    */
>>   struct rdt_mon_domain {
>>   	struct rdt_domain_hdr		hdr;
>> @@ -116,6 +129,7 @@ struct rdt_mon_domain {
>>   	struct delayed_work		cqm_limbo;
>>   	int				mbm_work_cpu;
>>   	int				cqm_work_cpu;
>> +	struct mbm_cntr_cfg		*cntr_cfg;
>>   };
>>   
>>   /**
> 
> Reinette
> 
> 

Thanks
Babu



* Re: [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters
  2025-02-07 17:18     ` Moger, Babu
@ 2025-02-07 18:52       ` Moger, Babu
  2025-02-10 18:08         ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-07 18:52 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/7/2025 11:18 AM, Moger, Babu wrote:
> Does this look ok? Just added domain in the text.
> 
> "The number of monitoring counters available in each domain for 
> assignment when the system supports mbm_cntr_assign mode.
> ::
>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>    32
> 
> The resctrl file system supports tracking up to two memory bandwidth
> events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
> Up to two counters can be assigned per monitoring group, one for each
> memory bandwidth event in each domain. More monitoring groups can be 
> tracked by assigning one counter per monitoring group. However, doing so 
> limits memory bandwidth tracking to a single memory bandwidth event per
> monitoring group."

Revised again:

"The number of monitoring counters available in each domain for 
assignment when the system supports mbm_cntr_assign mode. For example, 
on a system with 32 monitoring counters:
::
   # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
   32

The resctrl file system supports tracking up to two memory bandwidth
events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
Up to two counters can be assigned per monitoring group, one for each
memory bandwidth event in each domain. More monitoring groups can be 
tracked by assigning one counter per monitoring group. However, doing so 
limits memory bandwidth tracking to a single memory bandwidth event per
monitoring group."
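
The trade-off described in the text can be put into numbers with a small sketch. It assumes that counters in a domain are consumed only by these assignments; groups_trackable() is a hypothetical helper written for illustration, not something from the series.

```c
/* Events that can hold a counter in one group: total and local. */
#define MBM_EVENTS_PER_GROUP 2

/*
 * Number of monitoring groups that can be backed in one domain when each
 * group uses events_per_group counters (1 or 2). Returns 0 for any other
 * value of events_per_group.
 */
static unsigned int groups_trackable(unsigned int num_cntrs,
				     unsigned int events_per_group)
{
	if (events_per_group < 1 || events_per_group > MBM_EVENTS_PER_GROUP)
		return 0;
	return num_cntrs / events_per_group;
}
```

With the 32 counters from the example, groups_trackable(32, 2) gives 16 groups with both events tracked, while groups_trackable(32, 1) gives 32 groups with a single event each.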

Thanks
Babu


* Re: [PATCH v11 12/23] x86/resctrl: Introduce interface to display number of free counters
  2025-02-06  0:19   ` Reinette Chatre
@ 2025-02-07 18:59     ` Moger, Babu
  2025-02-19 13:31       ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-07 18:59 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/5/2025 6:19 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> Provide the interface to display the number of monitoring counters
>> available for assignment in each domain when mbm_cntr_assign is enabled.
> 
> "when mbm_cntr_assign is enabled" -> "when mbm_cntr_assign mode is enabled"?

Sure.

> 
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> 
>> ---
>>   Documentation/arch/x86/resctrl.rst     |  8 +++++
>>   arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 46 ++++++++++++++++++++++++++
>>   3 files changed, 55 insertions(+)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 31ff764deeeb..99cae75559b0 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -299,6 +299,14 @@ with the following files:
>>   	memory bandwidth tracking to a single memory bandwidth event per
>>   	monitoring group.
>>   
>> +"available_mbm_cntrs":
>> +	The number of monitoring counters available for assignment in each
>> +	domain when mbm_cntr_assign mode is enabled on the system.
>> +	::
>> +
> 
> Documentation jumps in with some hardcoded values that may cause confusion.
> It looks to be missing something like (and looking back this also applies
> to "num_mbm_cntrs"):
> "For example, on a system with 30 available monitoring/(hardware?) counters in
> each of its L3 domains:"

Sure.

> 
> 
>> +	 # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
>> +	 0=30;1=30
>> +
> 
> 
>>   "max_threshold_occupancy":
>>   		Read/write file provides the largest value (in
>>   		bytes) at which a previously used LLC_occupancy
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 6fe9e610e9a0..f2bf5b13465d 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1234,6 +1234,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>   			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>>   			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
>>   			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
>> +			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
>>   		}
>>   	}
>>   
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 75a3b56996ca..2b86124c336b 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -888,6 +888,46 @@ static int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
>>   	return 0;
>>   }
>>   
>> +static int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
>> +					    struct seq_file *s, void *v)
>> +{
>> +	struct rdt_resource *r = of->kn->parent->priv;
>> +	struct rdt_mon_domain *dom;
>> +	bool sep = false;
>> +	u32 cntrs, i;
>> +	int ret = 0;
>> +
>> +	cpus_read_lock();
>> +	mutex_lock(&rdtgroup_mutex);
>> +
> 
> Missing rdt_last_cmd_clear()?

Yes.

> 
>> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
>> +		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
>> +		ret = -EINVAL;
>> +		goto unlock_cntrs_show;
>> +	}
>> +
>> +	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>> +		if (sep)
>> +			seq_puts(s, ";");
> 
> The one character prints can be simplified with a seq_putc().

Sure.
     seq_putc(s, ';');

> 
>> +
>> +		cntrs = 0;
>> +		for (i = 0; i < r->mon.num_mbm_cntrs; i++) {
>> +			if (!dom->cntr_cfg[i].rdtgrp)
>> +				cntrs++;
>> +		}
>> +
>> +		seq_printf(s, "%d=%d", dom->hdr.id, cntrs);
> 
> I expect cntrs to need %u?

Sure.
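
To make the resulting output format concrete, here is a rough userspace
sketch (not kernel code) of the per-domain "id=count" listing with a
single-character separator and an unsigned count. Every demo_* name and
the snprintf()-into-a-buffer approach are assumptions for illustration;
the real code prints through the seq_file helpers.

```c
#include <stdio.h>
#include <stddef.h>

#define DEMO_CNTRS_PER_DOM 32	/* hypothetical; mirrors the "32" example */

/* Hypothetical stand-in for a monitoring domain. */
struct demo_dom {
	int id;
	const void *owner[DEMO_CNTRS_PER_DOM];	/* NULL means unassigned */
};

/*
 * Write "id=free;id=free\n" into @buf. Returns 0 on success and -1 if
 * @buf is too small.
 */
static int demo_show_available(const struct demo_dom *doms, size_t n,
			       char *buf, size_t len)
{
	unsigned int i, free_cntrs;
	size_t off = 0, d;
	int w;

	for (d = 0; d < n; d++) {
		/* Count the counters that no group currently owns. */
		free_cntrs = 0;
		for (i = 0; i < DEMO_CNTRS_PER_DOM; i++) {
			if (!doms[d].owner[i])
				free_cntrs++;
		}

		/* Separator before every entry except the first. */
		w = snprintf(buf + off, len - off, "%s%d=%u",
			     d ? ";" : "", doms[d].id, free_cntrs);
		if (w < 0 || (size_t)w >= len - off)
			return -1;
		off += (size_t)w;
	}

	w = snprintf(buf + off, len - off, "\n");
	if (w < 0 || (size_t)w >= len - off)
		return -1;

	return 0;
}
```

With two domains where domain 0 has two counters in use, the buffer
would hold "0=30;1=32".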

> 
>> +		sep = true;
>> +	}
>> +	seq_puts(s, "\n");
>> +
>> +unlock_cntrs_show:
>> +	mutex_unlock(&rdtgroup_mutex);
>> +	cpus_read_unlock();
>> +
>> +	return ret;
>> +}
>> +
>>   #ifdef CONFIG_PROC_CPU_RESCTRL
>>   
>>   /*
>> @@ -1916,6 +1956,12 @@ static struct rftype res_common_files[] = {
>>   		.kf_ops		= &rdtgroup_kf_single_ops,
>>   		.seq_show	= resctrl_num_mbm_cntrs_show,
>>   	},
>> +	{
>> +		.name		= "available_mbm_cntrs",
>> +		.mode		= 0444,
>> +		.kf_ops		= &rdtgroup_kf_single_ops,
>> +		.seq_show	= resctrl_available_mbm_cntrs_show,
>> +	},
>>   	{
>>   		.name		= "cpus",
>>   		.mode		= 0644,
> 
> Reinette
> 

Thanks
Babu

* Re: [PATCH v11 15/23] x86/resctrl: Add the functionality to assigm MBM events
  2025-02-06  1:05   ` Reinette Chatre
@ 2025-02-07 21:10     ` Moger, Babu
  2025-02-10 18:25       ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-07 21:10 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/5/2025 7:05 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> subject: "assigm" -> "assign"

Sure.

> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> The mbm_cntr_assign mode offers several counters that can be assigned
> 
> This "several counters" contradicts the "very small number of assignable
> counters" used in earlier patch to justify how counters are managed.

How about?

The mbm_cntr_assign mode offers "num_mbm_cntrs" counters, each of which
can be assigned to an RMID, event pair to monitor bandwidth for as long
as it remains assigned.

> 
>> to an RMID, event pair and monitor the bandwidth as long as it is
>> assigned.
>>
>> Add the functionality to allocate and assign the counters to RMID, event
>> pair in the domain.
>>
>> If all counters are in use, the kernel will show an error message: "Out
>> of MBM assignable counters" when a new assignment is requested. Exit on
>> the first failure when assigning counters across all the domains.
>> Report the error in /sys/fs/resctrl/info/last_cmd_status.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> ..
> 
>> ---
>>   arch/x86/kernel/cpu/resctrl/internal.h |   2 +
>>   arch/x86/kernel/cpu/resctrl/monitor.c  | 105 +++++++++++++++++++++++++
>>   2 files changed, 107 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index 161d3feb567c..547d8a4c8aba 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -727,4 +727,6 @@ u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>>   int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>   			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>   			     u32 cntr_id, bool assign);
>> +int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
>>   #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index ef836bb69b9b..127c4000a81a 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1413,3 +1413,108 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>   
>>   	return 0;
>>   }
>> +
>> +/*
>> + * Configure the counter for the event, RMID pair for the domain. Reset the
>> + * non-architectural state to clear all the event counters.
>> + */
>> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
>> +			       u32 cntr_id, bool assign)
>> +{
>> +	struct mbm_state *m;
>> +	int ret;
>> +
>> +	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
>> +	if (ret)
>> +		return ret;
>> +
>> +	m = get_mbm_state(d, closid, rmid, evtid);
>> +	if (m)
>> +		memset(m, 0, sizeof(struct mbm_state));
>> +
>> +	return ret;
>> +}
>> +
>> +static int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +			struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
>> +{
>> +	int cntr_id;
>> +
>> +	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
>> +		if (d->cntr_cfg[cntr_id].rdtgrp == rdtgrp &&
>> +		    d->cntr_cfg[cntr_id].evtid == evtid)
>> +			return cntr_id;
>> +	}
>> +
>> +	return -ENOENT;
>> +}
>> +
>> +static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +			  struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
>> +{
>> +	int cntr_id;
>> +
>> +	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
>> +		if (!d->cntr_cfg[cntr_id].rdtgrp) {
>> +			d->cntr_cfg[cntr_id].rdtgrp = rdtgrp;
>> +			d->cntr_cfg[cntr_id].evtid = evtid;
>> +			return cntr_id;
>> +		}
>> +	}
>> +
>> +	return -ENOSPC;
>> +}
>> +
>> +static void mbm_cntr_free(struct rdt_mon_domain *d, int cntr_id)
>> +{
>> +	memset(&d->cntr_cfg[cntr_id], 0, sizeof(struct mbm_cntr_cfg));
>> +}
>> +
>> +/*
>> + * Allocate a fresh counter and configure the event if not assigned already
>> + * else return success.
> 
> I find this confusing. I think the "else return success" can just be dropped.

Sure.

> 
>> + */
>> +static int resctrl_alloc_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +				     struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
>> +{
>> +	int cntr_id, ret = 0;
>> +
>> +	if (mbm_cntr_get(r, d, rdtgrp, evtid) == -ENOENT) {
> 
> This can be simplified while reducing a level of indent with:
> 
> 	/* No need to allocate and configure if counter already assigned to this event. */
> 	if (mbm_cntr_get(r, d, rdtgrp, evtid) >= 0)
> 		return 0;

Sure.

> 
>> +		cntr_id = mbm_cntr_alloc(r, d, rdtgrp, evtid);
>> +		if (cntr_id <  0) {
>> +			rdt_last_cmd_printf("Domain %d is Out of MBM assignable counter\n",
> 
> "Domain %d is Out of MBM assignable counter" -> "Domain %d is out of MBM assignable counters"
> or, the message can be something like "Unable to allocate counter in domain %d" to not

Yes. "Unable to allocate counter in domain %d" sounds better.


> assume the error and just return the error directly. resctrl_process_flags() can in turn
> not override the error resulting in -ENOSPC returned to userspace that can be interpreted
> appropriately instead of always returning -EINVAL and requiring user space to check
> last_cmd_status?

Sure.

> 
>> +					    d->hdr.id);
>> +			return -ENOSPC;
> 
> Please do not override error of a function.

Ok
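
Putting the three points above together (early return when already
assigned, passing the allocator's error through unchanged, and freeing
the slot when configuration fails), the flow could look roughly like the
userspace sketch below. It is not the kernel code: the demo_* names are
hypothetical, and the hw_ret parameter merely simulates the result of
the architecture hook.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_NUM_SLOTS 2	/* hypothetical number of counters */

/* Hypothetical counter slot; a NULL owner marks it as free. */
struct demo_slot {
	const void *owner;
	int evtid;
};

/* Return the slot already held by (@owner, @evtid), or -ENOENT. */
static int demo_get(struct demo_slot *s, const void *owner, int evtid)
{
	int i;

	for (i = 0; i < DEMO_NUM_SLOTS; i++) {
		if (s[i].owner == owner && s[i].evtid == evtid)
			return i;
	}

	return -ENOENT;
}

/* Claim the first free slot, or return -ENOSPC when all are in use. */
static int demo_alloc(struct demo_slot *s, const void *owner, int evtid)
{
	int i;

	for (i = 0; i < DEMO_NUM_SLOTS; i++) {
		if (!s[i].owner) {
			s[i].owner = owner;
			s[i].evtid = evtid;
			return i;
		}
	}

	return -ENOSPC;
}

static int demo_alloc_config(struct demo_slot *s, const void *owner,
			     int evtid, int hw_ret)
{
	int id;

	/* Nothing to do if a counter is already assigned to this event. */
	if (demo_get(s, owner, evtid) >= 0)
		return 0;

	id = demo_alloc(s, owner, evtid);
	if (id < 0)
		return id;	/* pass the allocator's error through as-is */

	if (hw_ret) {
		/* Configuration failed: release the slot again. */
		s[id].owner = NULL;
		s[id].evtid = 0;
		return hw_ret;
	}

	return 0;
}
```

A caller would then see -ENOSPC directly when no slot is left, instead
of a generic error that has to be decoded from last_cmd_status.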

> 
>> +		}
>> +
>> +		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid, rdtgrp->closid,
>> +					  cntr_id, true);
>> +		if (ret) {
>> +			rdt_last_cmd_printf("Assignment failed on domain %d\n", d->hdr.id);
> 
> I assume this targets the scenario when user space requests "all" domains to be changed
> and the error message in resctrl_process_flags() will then print "*" instead of the
> actual domain ID. If this is the goal to give more detail to error then the event
> can be displayed also?

Sure. Will change it to:

rdt_last_cmd_printf("Assignment of event %d failed on domain %d\n",
		    evtid, d->hdr.id);

> 
>> +			mbm_cntr_free(d, cntr_id);
>> +		}
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Assign a hardware counter to event @evtid of group @rdtgrp.
>> + * Counter will be assigned to all the domains if @d is NULL else
>> + * the counter will be assigned to @d.
> 
> Please use available 80 chars.

Sure.

> 
>> + */
>> +int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +			      struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
>> +{
>> +	int ret = 0;
>> +
>> +	if (!d) {
>> +		list_for_each_entry(d, &r->mon_domains, hdr.list)
>> +			ret = resctrl_alloc_config_cntr(r, d, rdtgrp, evtid);
> 
> This does not "exit on first failure" as the changelog claims. It actually looks like
> as long as the last domain succeeds, while all other domains fail, this request is
> considered successful.


Yes. That is correct. I have to check the return status in each iteration.
Will fix it.

list_for_each_entry(d, &r->mon_domains, hdr.list) {
	ret = resctrl_alloc_config_cntr(r, d, rdtgrp, evtid);
	if (ret)
		return ret;
}
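
For reference, the exit-on-first-failure behaviour of the loop above can
be modelled in plain userspace C. This is only a sketch under
assumptions: the demo_* types and helpers are hypothetical stand-ins,
not the resctrl structures.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a monitoring domain. */
struct demo_domain {
	int id;
	int free_cntrs;	/* counters still unassigned in this domain */
};

/* Take one counter from @d, or return -ENOSPC when none is left. */
static int demo_alloc_cntr(struct demo_domain *d)
{
	if (d->free_cntrs == 0)
		return -ENOSPC;

	d->free_cntrs--;
	return 0;
}

/*
 * Assign in every domain, stopping at the first failure so that a later
 * success cannot overwrite an earlier error.
 */
static int demo_assign_all(struct demo_domain *doms, size_t n)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		ret = demo_alloc_cntr(&doms[i]);
		if (ret)
			return ret;
	}

	return 0;
}
```

With three domains where only the middle one is exhausted, this returns
-ENOSPC and never touches the third domain. Like the loop above, the
sketch does not roll back the counter already taken in the first domain.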



> 
>> +	} else {
>> +		ret = resctrl_alloc_config_cntr(r, d, rdtgrp, evtid);
>> +	}
>> +
>> +	return ret;
>> +}
> 
> Reinette
> 

Thanks
Babu

* Re: [PATCH v11 16/23] x86/resctrl: Add the functionality to unassigm MBM events
  2025-02-06  3:54   ` Reinette Chatre
@ 2025-02-10 16:23     ` Moger, Babu
  2025-02-10 18:30       ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-10 16:23 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/5/25 21:54, Reinette Chatre wrote:
> Hi Babu,
> 
> subject: unassigm -> unassign

Sure.

> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> The mbm_cntr_assign mode provides a limited number of hardware counters
> 
> (now back to "limited number of hardware counters")

How about?

The mbm_cntr_assign mode provides "num_mbm_cntrs" hardware counters

> 
>> that can be assigned to an RMID, event pair to monitor bandwidth while
>> assigned. If all counters are in use, the kernel will show an error
>> message: "Out of MBM assignable counters" when a new assignment is
>> requested. To make space for a new assignment, users must unassign an
> 
> To me "kernel will show an error" implies the kernel ring buffer. Please make
> the message accurate and mention that it will be in 
> last_cmd_status while also considering to use -ENOSPC to help user space.

If all the counters are in use, the kernel will log the error message
"Unable to allocate counter in domain" in
/sys/fs/resctrl/info/last_cmd_status when a new assignment is requested.
To make space for a new assignment, users must unassign an already
assigned counter and retry the assignment.

> 
>> already assigned counter and retry the assignment again..
> 
> ".." -> "."
> 

Sure.

>>
>> Add the functionality to unassign and free the counters in the domain.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
> 
> ...
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 127c4000a81a..b6d188d0f9b7 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1518,3 +1518,42 @@ int resctrl_assign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
>>  
>>  	return ret;
>>  }
>> +
>> +/*
>> + * Unassign and free the counter if assigned else return success.
>> + */
>> +static int resctrl_free_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +				    struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
>> +{
>> +	int cntr_id, ret = 0;
>> +
>> +	cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
>> +	if (cntr_id != -ENOENT) {
> 
> This can be simplified and indent level reduced with:
> 
> 	cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
> 	if (cntr_id < 0)
> 		return ret;
> 

Sure.

>> +		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
>> +					  rdtgrp->closid, cntr_id, false);
>> +		if (!ret)
>> +			mbm_cntr_free(d, cntr_id);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Unassign a hardware counter associated with @evtid from the domain and
>> + * the group. Unassign the counters from all the domains if @d is NULL else
>> + * unassign from @d.
>> + */
>> +int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +				struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
>> +{
>> +	int ret = 0;
>> +
>> +	if (!d) {
>> +		list_for_each_entry(d, &r->mon_domains, hdr.list)
>> +			ret = resctrl_free_config_cntr(r, d, rdtgrp, evtid);
> 
> Same issue as previous patch wrt error handling.

Yes.

list_for_each_entry(d, &r->mon_domains, hdr.list) {
	ret = resctrl_free_config_cntr(r, d, rdtgrp, evtid);
	if (ret)
		return ret;
}

> 
>> +	} else {
>> +		ret = resctrl_free_config_cntr(r, d, rdtgrp, evtid);
>> +	}
>> +
>> +	return ret;
>> +}
> 
> Reinette
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-06 18:03   ` Reinette Chatre
@ 2025-02-10 17:27     ` Moger, Babu
  2025-02-10 18:34       ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-10 17:27 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/6/25 12:03, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> Assign/unassign counters on resctrl group creation/deletion. Two counters
>> are required per group, one for MBM total event and one for MBM local
>> event.
>>
>> There are a limited number of counters available for assignment. If these
>> counters are exhausted, the kernel will display the error message: "Out of
>> MBM assignable counters". However, it is not necessary to fail the
>> creation of a group due to assignment failures. Users have the flexibility
>> to modify the assignments at a later time.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index b6d188d0f9b7..118b39fbb01e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1557,3 +1557,30 @@ int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d
>>  
>>  	return ret;
>>  }
>> +
>> +void mbm_cntr_reset(struct rdt_resource *r)
>> +{
>> +	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>> +	struct rdt_mon_domain *dom;
>> +
>> +	/*
>> +	 * Reset the domain counter configuration. Hardware counters
>> +	 * will reset after switching the monitor mode. So, reset the
>> +	 * architectural amd non-architectural state so that reading
> 
> "amd" -> "and"

Sure.

> 
>> +	 * of hardware counter is not considered as an overflow in the
>> +	 * next update.
>> +	 */
>> +	if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
>> +		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>> +			memset(dom->cntr_cfg, 0,
>> +			       sizeof(*dom->cntr_cfg) * r->mon.num_mbm_cntrs);
>> +			if (is_mbm_total_enabled())
>> +				memset(dom->mbm_total, 0,
>> +				       sizeof(struct mbm_state) * idx_limit);
>> +			if (is_mbm_local_enabled())
>> +				memset(dom->mbm_local, 0,
>> +				       sizeof(struct mbm_state) * idx_limit);
>> +			resctrl_arch_reset_rmid_all(r, dom);
>> +		}
>> +	}
>> +}
> 
> I looked back at the previous versions to better understand how this function
> came about and I do not think it actually solves the problem it aims to solve.
> 
> rdtgroup_unassign_cntrs() can fail and when it does the counter is not free'd. That
> leaves a monitoring domain's array with an entry that points to a resource group
> that no longer exists (unless it is the default resource group) since
> rdtgroup_unassign_cntrs() does not check the return and proceeds to remove the
> resource group. mbm_cntr_reset() is called on umount of resctrl but
> rdtgroup_unassign_cntrs() is called on every  group remove and those scenarios
> are not handled.
> 
> To address this I believe that I need to go back on a previous request to have
> resctrl_arch_config_cntr() return an error code. AMD does not need this and
> it is difficult to predict what will work for MPAM. I originally wanted to be
> flexible here but this appears to be impractical. With a new requirement that 
> resctrl_arch_config_cntr() always succeeds the counter will in turn always
> be free'd and not leave dangling pointers. I believe doing so eliminates
> the need for mbm_cntr_reset() as used in this patch. My apologies for the
> misdirection. We can re-evaluate these flows if MPAM needs anything different.

So, the new requirement is to free the counter even if the
resctrl_arch_config_cntr() call fails. That way, after calling
rdtgroup_unassign_cntrs(), the counter is freed and left in a clean state.
So, we don't need to call mbm_cntr_reset() at the end to clear all the
entries.

Here is the call sequence.

rdtgroup_unassign_cntrs() -> resctrl_unassign_cntr_event() ->
resctrl_free_config_cntr() -> resctrl_config_cntr() ->
resctrl_arch_config_cntr().

So, the only change here is:

/*
 * Unassign and free the counter if assigned else return success.
 */
static int resctrl_free_config_cntr(struct rdt_resource *r,
				    struct rdt_mon_domain *d,
				    struct rdtgroup *rdtgrp,
				    enum resctrl_event_id evtid)
{
	int cntr_id, ret = 0;

	cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
	if (cntr_id < 0)
		return ret;

	/* Free the counter even if unassigning it in hardware fails. */
	ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
				  rdtgrp->closid, cntr_id, false);
	mbm_cntr_free(d, cntr_id);

	return ret;
}
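
To illustrate why freeing unconditionally avoids the dangling-pointer
case, here is a small userspace model. It is a sketch only: the demo_*
names are hypothetical, and hw_ret stands in for whatever the
architecture hook would return.

```c
#include <errno.h>
#include <string.h>

#define DEMO_NUM_CNTRS 4	/* hypothetical number of counters */

/* Hypothetical counter slot; a NULL owner marks it as free. */
struct demo_cntr_cfg {
	const void *owner;
	int evtid;
};

/* Find the counter held by (@owner, @evtid), or return -ENOENT. */
static int demo_cntr_get(struct demo_cntr_cfg *cfg, const void *owner,
			 int evtid)
{
	int i;

	for (i = 0; i < DEMO_NUM_CNTRS; i++) {
		if (cfg[i].owner == owner && cfg[i].evtid == evtid)
			return i;
	}

	return -ENOENT;
}

/*
 * Unassign the counter of (@owner, @evtid). The slot is cleared whether
 * or not the simulated hardware step (@hw_ret) succeeded, so no slot can
 * keep pointing at a group that is about to be removed.
 */
static int demo_unassign(struct demo_cntr_cfg *cfg, const void *owner,
			 int evtid, int hw_ret)
{
	int id = demo_cntr_get(cfg, owner, evtid);

	if (id < 0)
		return 0;	/* nothing assigned, nothing to do */

	memset(&cfg[id], 0, sizeof(cfg[id]));

	return hw_ret;
}
```

Even when hw_ret is an error, a second lookup for the same group and
event finds nothing, which is the property the group removal paths rely
on.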


> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 2b86124c336b..f61f0cd032ef 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -2687,6 +2687,46 @@ static void schemata_list_destroy(void)
>>  	}
>>  }
>>  
>> +/*
>> + * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
>> + * counters are automatically assigned. Each group can accommodate two counters:
>> + * one for the total event and one for the local event. Assignments may fail
>> + * due to the limited number of counters. However, it is not necessary to fail
>> + * the group creation and thus no failure is returned. Users have the option
>> + * to modify the counter assignments after the group has been created.
>> + */
>> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> +		return;
>> +
>> +	if (is_mbm_total_enabled())
>> +		resctrl_assign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +
>> +	if (is_mbm_local_enabled())
>> +		resctrl_assign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +}
>> +
>> +/*
>> + * Called when a group is deleted. Counters are unassigned if it was in
>> + * assigned state.
>> + */
>> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> +		return;
>> +
> 
> It looks to me as though there are a couple of places (rmdir_all_sub(), rdt_kill_sb(),
> and rdtgroup_rmdir_ctrl()) where rdtgroup_unassign_cntrs() could be called on a system that
> does not support monitoring and/or only supports cache allocation monitoring.
> 
> In these paths it is only the architecture's resctrl_arch_mbm_cntr_assign_enabled(r) that
> gates the resctrl flow. I think rdtgroup_unassign_cntrs() and to match rdtgroup_assign_cntrs()
> can do with at least a r->mon_capable check.

OK. Will add the following check:

if (!r->mon_capable || !resctrl_arch_mbm_cntr_assign_enabled(r))
	return;

> 
>> +	if (is_mbm_total_enabled())
>> +		resctrl_unassign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +
>> +	if (is_mbm_local_enabled())
>> +		resctrl_unassign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +}
>> +
>>  static int rdt_get_tree(struct fs_context *fc)
>>  {
>>  	struct rdt_fs_context *ctx = rdt_fc2context(fc);
>> @@ -2741,6 +2781,8 @@ static int rdt_get_tree(struct fs_context *fc)
>>  		if (ret < 0)
>>  			goto out_info;
>>  
>> +		rdtgroup_assign_cntrs(&rdtgroup_default);
>> +
>>  		ret = mkdir_mondata_all(rdtgroup_default.kn,
>>  					&rdtgroup_default, &kn_mondata);
>>  		if (ret < 0)
>> @@ -2779,8 +2821,10 @@ static int rdt_get_tree(struct fs_context *fc)
>>  	if (resctrl_arch_mon_capable())
>>  		kernfs_remove(kn_mondata);
>>  out_mongrp:
>> -	if (resctrl_arch_mon_capable())
>> +	if (resctrl_arch_mon_capable()) {
>> +		rdtgroup_unassign_cntrs(&rdtgroup_default);
>>  		kernfs_remove(kn_mongrp);
>> +	}
>>  out_info:
>>  	kernfs_remove(kn_info);
>>  out_schemata_free:
>> @@ -2956,6 +3000,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
>>  
>>  	head = &rdtgrp->mon.crdtgrp_list;
>>  	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
>> +		rdtgroup_unassign_cntrs(sentry);
>>  		free_rmid(sentry->closid, sentry->mon.rmid);
>>  		list_del(&sentry->mon.crdtgrp_list);
>>  
>> @@ -2996,6 +3041,8 @@ static void rmdir_all_sub(void)
>>  		cpumask_or(&rdtgroup_default.cpu_mask,
>>  			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>>  
>> +		rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>  		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>  
>>  		kernfs_remove(rdtgrp->kn);
>> @@ -3027,6 +3074,8 @@ static void rdt_kill_sb(struct super_block *sb)
>>  	for_each_alloc_capable_rdt_resource(r)
>>  		reset_all_ctrls(r);
>>  	rmdir_all_sub();
>> +	rdtgroup_unassign_cntrs(&rdtgroup_default);
>> +	mbm_cntr_reset(&rdt_resources_all[RDT_RESOURCE_L3].r_resctrl);
>>  	rdt_pseudo_lock_release();
>>  	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>>  	schemata_list_destroy();
>> @@ -3490,9 +3539,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
>>  	}
>>  	rdtgrp->mon.rmid = ret;
>>  
>> +	rdtgroup_assign_cntrs(rdtgrp);
>> +
>>  	ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
>>  	if (ret) {
>>  		rdt_last_cmd_puts("kernfs subdir error\n");
>> +		rdtgroup_unassign_cntrs(rdtgrp);
>>  		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>  		return ret;
>>  	}
>> @@ -3502,8 +3554,10 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
>>  
>>  static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
>>  {
>> -	if (resctrl_arch_mon_capable())
>> +	if (resctrl_arch_mon_capable()) {
>> +		rdtgroup_unassign_cntrs(rgrp);
>>  		free_rmid(rgrp->closid, rgrp->mon.rmid);
>> +	}
>>  }
>>  
>>  static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
>> @@ -3764,6 +3818,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>>  	update_closid_rmid(tmpmask, NULL);
>>  
>>  	rdtgrp->flags = RDT_DELETED;
>> +
>> +	rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>  	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>  
>>  	/*
>> @@ -3810,6 +3867,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>>  	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>>  	update_closid_rmid(tmpmask, NULL);
>>  
>> +	rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>  	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>  	closid_free(rdtgrp->closid);
>>  
> 
> Reinette
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v11 18/23] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode
  2025-02-06 18:04   ` Reinette Chatre
@ 2025-02-10 17:39     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-10 17:39 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/6/25 12:04, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> In mbm_cntr_assign mode, the hardware counter should be assigned to read
>> the MBM events.
>>
>> Report 'Unassigned' in case the user attempts to read the events without
>> assigning the counter.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 99cae75559b0..072b15550ff7 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -431,6 +431,16 @@ When monitoring is enabled all MON groups will also contain:
>>  	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
>>  	where "YY" is the node number.
>>  
>> +	When supported the mbm_cntr_assign mode allows users to assign a
> 
> Same comment as previous version.

Sorry about that.

"mbm_cntr_assign mode allows users to assign a"


> 
>> +	counter to mon_hw_id, event pair enabling bandwidth monitoring for
>> +	as long as the counter remains assigned. The hardware will continue
>> +	tracking the assigned mon_hw_id until the user manually unassigns
>> +	it, ensuring that counters are not reset during this period. With
>> +	a limited number of counters, the system may run out of assignable
>> +	counters. In that case, MBM event counters will return 'Unassigned'
>> +	when the event is read. Users must manually assign a counter to read
>> +	the events.
>> +
>>  "mon_hw_id":
>>  	Available only with debug option. The identifier used by hardware
>>  	for the monitor group. On x86 this is the RMID.
> 
> Reinette
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters
  2025-02-07 18:52       ` Moger, Babu
@ 2025-02-10 18:08         ` Reinette Chatre
  2025-02-10 20:26           ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-10 18:08 UTC (permalink / raw)
  To: Moger, Babu, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/7/25 10:52 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 2/7/2025 11:18 AM, Moger, Babu wrote:
>> Does this look ok? Just added domain in the text.
>>
>> "The number of monitoring counters available in each domain for assignment when the system supports mbm_cntr_assign mode.
>> ::
>>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>>    32
>>
>> The resctrl file system supports tracking up to two memory bandwidth
>> events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
>> Up to two counters can be assigned per monitoring group, one for each
>> memory bandwidth event in each domain. More monitoring groups can be tracked by assigning one counter per monitoring group. However, doing so limits memory bandwidth tracking to a single memory bandwidth event per
>> monitoring group."
> 
> Revised again:
> 
> "The number of monitoring counters available in each domain for assignment when the system supports mbm_cntr_assign mode. For example, on a system with 32 monitoring counters:

I think we need to be careful with "available" since all these counters
may not be available. That is why "available_mbm_cntrs" exists.

How about something like (please feel free to improve):
"The maximum number of monitoring counters (total of available and assigned counters)
 in each domain when the system supports mbm_cntr_assign mode." 

Could you please make the "For example" a new paragraph (this follows existing style in the
docs). It could also be made more specific, for example,

"For example, on a system with 32 monitoring counters in each domain:"

> ::
>   # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>   32
> 

The rest of the documentation seems like a repeat of what can be found in
the "mbm_assign_mode" section right above it. It does not look as though
any information will be lost by dropping the text below?

> The resctrl file system supports tracking up to two memory bandwidth
> events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
> Up to two counters can be assigned per monitoring group, one for each
> memory bandwidth event in each domain. More monitoring groups can be tracked by assigning one counter per monitoring group. However, doing so limits memory bandwidth tracking to a single memory bandwidth event per
> monitoring group."
> 
> Thanks
> Babu

Reinette
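
For illustration only (not part of the series): a minimal userspace C sketch of the relationship described above, assuming num_mbm_cntrs reports a single decimal value and that the assigned count in a domain is the maximum minus what is currently available. parse_cntr_count() and assigned_cntrs() are hypothetical helper names, not resctrl functions.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the single decimal value that a file such as
 * info/L3_MON/num_mbm_cntrs reports (for example "32\n").
 * Returns the value, or -1 on malformed input.
 */
static long parse_cntr_count(const char *text)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(text, &end, 10);
	if (errno || end == text || val < 0)
		return -1;
	if (*end != '\n' && *end != '\0')
		return -1;
	return val;
}

/* Assigned counters in a domain: maximum minus currently available. */
static long assigned_cntrs(long num_mbm_cntrs, long available_mbm_cntrs)
{
	return num_mbm_cntrs - available_mbm_cntrs;
}
```

With 32 counters per domain and 30 still available, assigned_cntrs(32, 30) gives the 2 counters already in use.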


* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-02-07 18:23     ` Moger, Babu
@ 2025-02-10 18:10       ` Reinette Chatre
  2025-02-19 13:30         ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-10 18:10 UTC (permalink / raw)
  To: Moger, Babu, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/7/25 10:23 AM, Moger, Babu wrote:
> On 2/5/2025 5:57 PM, Reinette Chatre wrote:
>> On 1/22/25 12:20 PM, Babu Moger wrote:
>>
>>> to a particular monitoring group. Check every entry of every hardware
>>> counter    configuration array in every monitoring domain to query which
>>
>> ... one more \t above
> 
> Sure
> 
>>
>>> MBM events of a monitoring group is tracked by hardware. Such queries
>>> are acceptable because of a very small number of assignable counters.
>>
>> It is not obvious what "very small number" means. Is it possible to give
>> a range to help reader understand the motivation?
> 
> How about?
> 
> MBM events of a monitoring group is tracked by hardware. Such queries
> are acceptable because of a very small number of assignable counters (32 to 64).

Yes, thank you. This helps to understand the claim.

Reinette
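
For illustration only: a userspace C sketch of why a linear query is cheap with this few counters. struct cntr_cfg and cntr_lookup() are hypothetical stand-ins and do not reproduce the layout of mbm_cntr_cfg in the patch; NUM_MBM_CNTRS is an assumed value from the 32 to 64 range mentioned above.

```c
#include <stddef.h>

#define NUM_MBM_CNTRS 32	/* assumed; the thread cites 32 to 64 */

/* Hypothetical per-counter entry: which group/event owns the counter. */
struct cntr_cfg {
	int in_use;
	unsigned int rmid;
	int evtid;
};

/* Return the counter index tracking (rmid, evtid), or -1 if none. */
static int cntr_lookup(const struct cntr_cfg *cfg, size_t n,
		       unsigned int rmid, int evtid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (cfg[i].in_use && cfg[i].rmid == rmid &&
		    cfg[i].evtid == evtid)
			return (int)i;
	}
	return -1;
}
```

Each query touches at most NUM_MBM_CNTRS entries per domain, so no index structure is needed at this size.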



* Re: [PATCH v11 15/23] x86/resctrl: Add the functionality to assigm MBM events
  2025-02-07 21:10     ` Moger, Babu
@ 2025-02-10 18:25       ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-10 18:25 UTC (permalink / raw)
  To: Moger, Babu, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/7/25 1:10 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 2/5/2025 7:05 PM, Reinette Chatre wrote:
>> On 1/22/25 12:20 PM, Babu Moger wrote:
>>> The mbm_cntr_assign mode offers several counters that can be assigned
>>
>> This "several counters" contradicts the "very small number of assignable
>> counters" used in earlier patch to justify how counters are managed.
> 
> How about?
> 
> The mbm_cntr_assign mode offers "num_mbm_cntrs" number of counters that can be assigned to an RMID, event pair and monitor the bandwidth as long as it is assigned.

Sure. The word "several" can just be dropped from original also. The concern is not
the language but instead that the description moves from "several" in one patch
and then "limited" in the next patch.


...

>>> +        }
>>> +
>>> +        ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid, rdtgrp->closid,
>>> +                      cntr_id, true);
>>> +        if (ret) {
>>> +            rdt_last_cmd_printf("Assignment failed on domain %d\n", d->hdr.id);
>>
>> I assume this targets the scenario when user space requests "all" domains to be changed
>> and the error message in resctrl_process_flags() will then print "*" instead of the
>> actual domain ID. If this is the goal to give more detail to error then the event
>> can be displayed also?
> 
> Sure. Will change it to.
> 
> rdt_last_cmd_printf("Assignment of event %d failed on domain %d\n", d->hdr.id, evtid);

ok, printing the event ID should be OK since the ID will be part of resctrl fs and
not architecture specific. Please just swap last two parameters.

Reinette
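
For illustration only: a userspace C sketch of the argument order requested above, using snprintf() in place of rdt_last_cmd_printf(). format_assign_error() is a hypothetical helper; the point is only that the event ID must precede the domain ID to match the format string.

```c
#include <stdio.h>

/* Format the proposed message with arguments in format-string order. */
static int format_assign_error(char *buf, size_t len, int evtid, int dom_id)
{
	return snprintf(buf, len,
			"Assignment of event %d failed on domain %d\n",
			evtid, dom_id);
}
```

Both conversions are %d, so the compiler cannot flag swapped arguments; the order has to be checked by eye.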


* Re: [PATCH v11 16/23] x86/resctrl: Add the functionality to unassigm MBM events
  2025-02-10 16:23     ` Moger, Babu
@ 2025-02-10 18:30       ` Reinette Chatre
  2025-02-22  0:36         ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-10 18:30 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/10/25 8:23 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 2/5/25 21:54, Reinette Chatre wrote:
>> Hi Babu,
>>
>> subject: unassigm -> unassign
> 
> Sure.
> 
>>
>> On 1/22/25 12:20 PM, Babu Moger wrote:
>>> The mbm_cntr_assign mode provides a limited number of hardware counters
>>
>> (now back to "limited number of hardware counters")
> 
> How about?
> 
> The mbm_cntr_assign mode provides "num_mbm_cntrs" number of hardware counters

ok.

> 
>>
>>> that can be assigned to an RMID, event pair to monitor bandwidth while
>>> assigned. If all counters are in use, the kernel will show an error
>>> message: "Out of MBM assignable counters" when a new assignment is
>>> requested. To make space for a new assignment, users must unassign an
>>
>> To me "kernel will show an error" implies the kernel ring buffer. Please make
>> the message accurate and mention that it will be in 
>> last_cmd_status while also considering to use -ENOSPC to help user space.
> 
> If all the counters are in use, the kernel will log the error message
> "Unable to allocate counter in domain" in /sys/fs/resctrl/info/
> last_cmd_status when a new assignment is requested. To make space for a
> new assignment, users must unassign an already assigned counter and retry
> the assignment again.
> 

This is better, but can user space receive -ENOSPC to avoid needing to check
and parse last_cmd_status on every error?

Reinette
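
For illustration only: a userspace C sketch of an allocator that reports exhaustion with -ENOSPC, so that a caller can tell "no counters left" apart from other failures without parsing last_cmd_status. struct cntr_pool, cntr_alloc() and cntr_free() are hypothetical and do not claim to match the functions in the series.

```c
#include <errno.h>

#define NUM_CNTRS 4	/* small value for illustration only */

/* Hypothetical per-domain bitmap of counters in use. */
struct cntr_pool {
	unsigned int used;	/* bit i set: counter i is assigned */
};

/* Return a free counter index, or -ENOSPC when all are assigned. */
static int cntr_alloc(struct cntr_pool *p)
{
	int i;

	for (i = 0; i < NUM_CNTRS; i++) {
		if (!(p->used & (1u << i))) {
			p->used |= 1u << i;
			return i;
		}
	}
	return -ENOSPC;
}

/* Release a counter so that a later assignment can reuse it. */
static void cntr_free(struct cntr_pool *p, int id)
{
	p->used &= ~(1u << id);
}
```

A caller that sees -ENOSPC knows to unassign an existing counter and retry.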


* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-10 17:27     ` Moger, Babu
@ 2025-02-10 18:34       ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-10 18:34 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/10/25 9:27 AM, Moger, Babu wrote:
> On 2/6/25 12:03, Reinette Chatre wrote:
>> On 1/22/25 12:20 PM, Babu Moger wrote:

>>
>>> +	 * of hardware counter is not considered as an overflow in the
>>> +	 * next update.
>>> +	 */
>>> +	if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
>>> +		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>>> +			memset(dom->cntr_cfg, 0,
>>> +			       sizeof(*dom->cntr_cfg) * r->mon.num_mbm_cntrs);
>>> +			if (is_mbm_total_enabled())
>>> +				memset(dom->mbm_total, 0,
>>> +				       sizeof(struct mbm_state) * idx_limit);
>>> +			if (is_mbm_local_enabled())
>>> +				memset(dom->mbm_local, 0,
>>> +				       sizeof(struct mbm_state) * idx_limit);
>>> +			resctrl_arch_reset_rmid_all(r, dom);
>>> +		}
>>> +	}
>>> +}
>>
>> I looked back at the previous versions to better understand how this function
>> came about and I do not think it actually solves the problem it aims to solve.
>>
>> rdtgroup_unassign_cntrs() can fail and when it does the counter is not free'd. That
>> leaves a monitoring domain's array with an entry that points to a resource group
>> that no longer exists (unless it is the default resource group) since
>> rdtgroup_unassign_cntrs() does not check the return and proceeds to remove the
>> resource group. mbm_cntr_reset() is called on umount of resctrl but
>> rdtgroup_unassign_cntrs() is called on every group remove and those scenarios
>> are not handled.
>>
>> To address this I believe that I need to go back on a previous request to have
>> resctrl_arch_config_cntr() return an error code. AMD does not need this and
>> it is difficult to predict what will work for MPAM. I originally wanted to be
>> flexible here but this appears to be impractical. With a new requirement that 
>> resctrl_arch_config_cntr() always succeeds the counter will in turn always
>> be free'd and not leave dangling pointers. I believe doing so eliminates
>> the need for mbm_cntr_reset() as used in this patch. My apologies for the
>> misdirection. We can re-evaluate these flows if MPAM needs anything different.
> 
> So, new requirement is to free the counter even if the
> resctrl_arch_config_cntr() call fails. That way after calling

No. Quoting above: "new requirement that resctrl_arch_config_cntr() always succeeds".
As I see it this will eliminate a lot of error checking on the calling path,
not ignore errors.

Reinette
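
For illustration only: a userspace C sketch of the flow described above, where the arch hook cannot fail and the counter is therefore always released. All names (struct dom_cntrs, arch_config_cntr(), unassign_cntr()) are hypothetical and only model the idea; they are not the signatures used in the series.

```c
#include <stdbool.h>

#define MAX_CNTRS 8	/* illustrative size */

struct dom_cntrs {
	bool assigned[MAX_CNTRS];
	int hw_writes;	/* stands in for the hardware programming step */
};

/* Hypothetical arch hook: returns void, so it cannot report failure. */
static void arch_config_cntr(struct dom_cntrs *d, int cntr_id, bool assign)
{
	(void)cntr_id;
	(void)assign;
	d->hw_writes++;
}

/* Unassign: with no error path, the counter is always released. */
static void unassign_cntr(struct dom_cntrs *d, int cntr_id)
{
	arch_config_cntr(d, cntr_id, false);
	d->assigned[cntr_id] = false;
}
```

Because unassign_cntr() has no failure branch, no entry can be left pointing at a group that has since been removed.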


* Re: [PATCH v11 19/23] x86/resctrl: Introduce the interface to switch between monitor modes
  2025-02-06 18:05   ` Reinette Chatre
@ 2025-02-10 18:54     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-10 18:54 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/6/25 12:05, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> Resctrl subsystem can support two monitoring modes, 'mbm_cntr_assign' or
>> 'default'. In mbm_cntr_assign, monitoring event can only accumulate data
>> while it is backed by a hardware counter. In 'default' mode, resctrl
>> assumes there is a hardware counter for each event within every CTRL_MON
>> and MON group.
>>
>> Introduce interface to switch between mbm_cntr_assign and default modes.
>>
>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> [mbm_cntr_assign]
>> default
>>
>> To enable the "mbm_cntr_assign" mode:
>> $ echo "mbm_cntr_assign" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>
>> To enable the default monitoring mode:
>> $ echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>
>> MBM event counters are automatically reset as part of changing the mode.
>> Clear both architectural and non-architectural event states to prevent
>> overflow conditions during the next event read.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> 
>> ---
>>  Documentation/arch/x86/resctrl.rst     | 25 ++++++++++++-
>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 50 +++++++++++++++++++++++++-
>>  2 files changed, 73 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 072b15550ff7..5d18c4c8bc48 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -259,7 +259,10 @@ with the following files:
>>  
>>  "mbm_assign_mode":
>>  	Reports the list of monitoring modes supported. The enclosed brackets
>> -	indicate which mode is enabled.
>> +	indicate which mode is enabled. The MBM events (mbm_total_bytes and/or
>> +	mbm_local_bytes) associated with counters may reset when "mbm_assign_mode"
>> +	is changed.
>> +
>>  	::
>>  
>>  	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> @@ -275,6 +278,16 @@ with the following files:
>>  	available is described in the "num_mbm_cntrs" file. Changing the mode
>>  	may cause all counters on a resource to reset.
>>  
>> +	Moving to mbm_cntr_assign mode requires users to assign the counters to
>> +	the events. Otherwise, the MBM event counters will return "Unassigned"
>> +	when read.
> 
> Again ... please be consistent in using single or double quotes for information
> returned from file.

ok. Will change it to 'Unassigned'.

> 
>> +
>> +	The mode is beneficial for AMD platforms that support more CTRL_MON
>> +	and MON groups than available hardware counters. By default, this
>> +	feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth
>> +	Monitoring Counters) capability, ensuring counters remain assigned even
>> +	when the corresponding RMID is not actively used by any processor.
>> +
>>  	"default":
>>  
>>  	In default mode, resctrl assumes there is a hardware counter for each
>> @@ -283,6 +296,16 @@ with the following files:
>>  	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
>>  	there is no counter associated with that event.
>>  
>> +	* To enable "mbm_cntr_assign" mode:
>> +	  ::
>> +
>> +	    # echo "mbm_cntr_assign" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> +
>> +	* To enable default monitoring mode:
>> +	  ::
>> +
>> +	    # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> +
> 
> Please be consistent in the documentation.
> 
> To enable "mbm_cntr_assign" mode:
> To enable "default" mode:
> or
> To enable "mbm_cntr_assign" monitoring mode:
> To enable "default" monitoring mode:

The "monitoring mode" variant sounds better.

> or 
> ...?
> 
> 
> 
>>  "num_mbm_cntrs":
>>  	The number of monitoring counters available for assignment when the
>>  	system supports mbm_cntr_assign mode.
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index f61f0cd032ef..6922173c4f8f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -928,6 +928,53 @@ static int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
>>  	return ret;
>>  }
>>  
>> +static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,
>> +					     char *buf, size_t nbytes, loff_t off)
>> +{
>> +	struct rdt_resource *r = of->kn->parent->priv;
>> +	int ret = 0;
>> +	bool enable;
>> +
>> +	/* Valid input requires a trailing newline */
>> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
>> +		return -EINVAL;
>> +
>> +	buf[nbytes - 1] = '\0';
>> +
>> +	cpus_read_lock();
>> +	mutex_lock(&rdtgroup_mutex);
>> +
>> +	rdt_last_cmd_clear();
>> +
>> +	if (!strcmp(buf, "default")) {
>> +		enable = 0;
>> +	} else if (!strcmp(buf, "mbm_cntr_assign")) {
>> +		if (r->mon.mbm_cntr_assignable) {
>> +			enable = 1;
>> +		} else {
>> +			ret = -EINVAL;
>> +			rdt_last_cmd_puts("mbm_cntr_assign mode is not supported\n");
>> +			goto write_exit;
>> +		}
>> +	} else {
>> +		ret = -EINVAL;
>> +		rdt_last_cmd_puts("Unsupported assign mode\n");
>> +		goto write_exit;
>> +	}
>> +
>> +	if (enable != resctrl_arch_mbm_cntr_assign_enabled(r)) {
>> +		ret = resctrl_arch_mbm_cntr_assign_set(r, enable);
>> +		if (!ret)
>> +			mbm_cntr_reset(r);
> 
> The following APIs interact with the MBM assignable counters:
> 
> mbm_cntr_alloc()
> mbm_cntr_get()
> mbm_cntr_free()
> 
> mbm_cntr_reset() appears to be related but does significantly more
> than interact with the MBM assignable counters and that creates a
> confusing API.
> 
> How about introducing mbm_cntr_free_all() that _only_ releases all
> MBM assignable counters and match with mbm_cntr_free() that releases
> a single MBM assignable counter? mbm_cntr_free_all() lives with the
> other functions operating on MBM assignable counters, thus not
> hiding its functionality in other parts of resctrl.
> 
> This series open codes reset of non-architectural state in two places,
> within mbm_cntr_reset() and within mbm_config_write_domain(). That
> can be turned into a new helper that only resets non-architectural state,
> for example resctrl_reset_rmid_all() to match existing
> resctrl_arch_reset_rmid_all().
> 
> resctrl_arch_mbm_cntr_assign_set() can also reset any architectural
> state leaving mbm_cntr_free_all() and resctrl_reset_rmid_all() to be called
> here and from within mbm_config_write_domain().
> 
> What do you think?

Sounds like a good code separation. It should be fine.
Will let you know if there are any issues.

> 
>> +	}
>> +
>> +write_exit:
>> +	mutex_unlock(&rdtgroup_mutex);
>> +	cpus_read_unlock();
>> +
>> +	return ret ?: nbytes;
>> +}
>> +
>>  #ifdef CONFIG_PROC_CPU_RESCTRL
>>  
>>  /*
>> @@ -1945,9 +1992,10 @@ static struct rftype res_common_files[] = {
>>  	},
>>  	{
>>  		.name		= "mbm_assign_mode",
>> -		.mode		= 0444,
>> +		.mode		= 0644,
>>  		.kf_ops		= &rdtgroup_kf_single_ops,
>>  		.seq_show	= resctrl_mbm_assign_mode_show,
>> +		.write		= resctrl_mbm_assign_mode_write,
>>  		.fflags		= RFTYPE_MON_INFO,
>>  	},
>>  	{
> 
> Reinette
> 

-- 
Thanks
Babu Moger
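
For illustration only: a userspace C model of the input checks in resctrl_mbm_assign_mode_write() as quoted above (trailing newline required, then a match on the mode name). parse_assign_mode() is a hypothetical name, the 'assignable' parameter stands in for r->mon.mbm_cntr_assignable, and locking and last_cmd_status reporting are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Require a trailing newline, strip it, then match the mode name.
 * Returns 0 and sets *enable on success, or -EINVAL otherwise.
 */
static int parse_assign_mode(char *buf, size_t nbytes, bool assignable,
			     bool *enable)
{
	if (nbytes == 0 || buf[nbytes - 1] != '\n')
		return -EINVAL;
	buf[nbytes - 1] = '\0';

	if (!strcmp(buf, "default")) {
		*enable = false;
		return 0;
	}
	if (!strcmp(buf, "mbm_cntr_assign") && assignable) {
		*enable = true;
		return 0;
	}
	return -EINVAL;
}
```

This is why the documented `echo "default" > .../mbm_assign_mode` works as written: echo appends the newline that the check requires.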


* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-06 18:48   ` Reinette Chatre
@ 2025-02-10 19:46     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-10 19:46 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/6/25 12:48, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> When mbm_cntr_assign mode is enabled, users can designate which of the MBM
>> events in the CTRL_MON or MON groups should have counters assigned.
>>
>> Provide an interface for assigning MBM events by writing to the file:
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control. Using this interface,
>> events can be assigned or unassigned as needed.
>>
>> Format is similar to the list format with addition of opcode for the
>> assignment operation.
>>  "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>>
>> Format for specific type of groups:
>>
>>  * Default CTRL_MON group:
>>          "//<domain_id><opcode><flags>"
>>
>>  * Non-default CTRL_MON group:
>>          "<CTRL_MON group>//<domain_id><opcode><flags>"
>>
>>  * Child MON group of default CTRL_MON group:
>>          "/<MON group>/<domain_id><opcode><flags>"
>>
>>  * Child MON group of non-default CTRL_MON group:
>>          "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>>
>> Domain_id '*' will apply the flags on all the domains.
>>
>> Opcode can be one of the following:
>>
>>  = Update the assignment to match the flags
>>  + Assign a new MBM event without impacting existing assignments.
>>  - Unassign an MBM event from currently assigned events.
>>
>> Assignment flags can be one of the following:
>>  t  MBM total event
>>  l  MBM local event
>>  tl Both total and local MBM events
>>  _  None of the MBM events. Valid only with '=' opcode. This flag cannot
>>     be combined with other flags.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v11: Fixed the static check warning with initializing dom_id in resctrl_process_flags().
>>
>> v10: Fixed the issue with finding the domain in multiple iterations.
>>      Printed error message with domain information when assign fails.
>>      Changed the variables to unsigned for processing assign state.
>>      Taken care of few format corrections.
>>
>> v9: Fixed handling special case '//0=' and '//'.
>>     Removed extra strstr() call.
>>     Added generic failure text when assignment operation fails.
>>     Corrected user documentation format texts.
>>
>> v8: Moved unassign as the first action during the assign modification.
>>     Assign none "_" takes priority. Cannot be mixed with other flags.
>>     Updated the documentation and .rst file format. htmldoc looks ok.
>>
>> v7: Simplified the parsing (strsep(&token, "//")) in rdtgroup_mbm_assign_control_write().
>>     Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.
>>     Renamed rdtgroup_find_grp to rdtgroup_find_grp_by_name.
>>     Fixed rdtgroup_str_to_mon_state to return error for invalid flags.
>>     Simplified the calls rdtgroup_assign_cntr by merging few functions earlier.
>>     Removed ABMC reference in FS code.
>>     Reinette commented about handling the combination of flags like 'lt_' and '_lt'.
>>     Not sure if we need to change the behaviour here. Processed them sequentially right now.
>>     Users have the liberty to pass the flags. Restricting it might be a problem later.
>>
>> v6: Added support assign all if domain id is '*'
>>     Fixed the allocation of counter id if it not assigned already.
>>
>> v5: Interface name changed from mbm_assign_control to mbm_control.
>>     Fixed opcode and flags combination.
>>     "=_" is valid.
>>     "-_" and "+_" are not valid.
>>     Minor message update.
>>     Renamed the function with prefix - rdtgroup_.
>>     Corrected few documentation mistakes.
>>     Rebase related changes after SNC support.
>>
>> v4: Added domain specific assignments. Fixed the opcode parsing.
>>
>> v3: New patch.
>>     Addresses the feedback to provide the global assignment interface.
>> ---
>>  Documentation/arch/x86/resctrl.rst     | 116 +++++++++++-
>>  arch/x86/kernel/cpu/resctrl/internal.h |  10 +
>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 241 ++++++++++++++++++++++++-
>>  3 files changed, 365 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 3040e5c4cd76..47e15b48d951 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -356,7 +356,8 @@ with the following files:
>>  	 t  MBM total event is assigned.
>>  	 l  MBM local event is assigned.
>>  	 tl Both MBM total and local events are assigned.
>> -	 _  None of the MBM events are assigned.
>> +	 _  None of the MBM events are assigned. Only works with opcode '=' for write
>> +	    and cannot be combined with other flags.
>>  
>>  	Examples:
>>  	::
>> @@ -374,6 +375,119 @@ with the following files:
>>  	There are four resctrl groups. All the groups have total and local MBM events
>>  	assigned on domain 0 and 1.
>>  
>> +	Assignment state can be updated by writing to "mbm_assign_control".
>> +
>> +	Format is similar to the list format with addition of opcode for the
>> +	assignment operation.
>> +
>> +		"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>> +
>> +	Format for each type of group:
>> +
>> +        * Default CTRL_MON group:
>> +                "//<domain_id><opcode><flags>"
>> +
>> +        * Non-default CTRL_MON group:
>> +                "<CTRL_MON group>//<domain_id><opcode><flags>"
>> +
>> +        * Child MON group of default CTRL_MON group:
>> +                "/<MON group>/<domain_id><opcode><flags>"
>> +
>> +        * Child MON group of non-default CTRL_MON group:
>> +                "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>> +
>> +	Domain_id '*' will apply the flags to all the domains.
>> +
>> +	Opcode can be one of the following:
>> +	::
>> +
>> +	 = Update the assignment to match the MBM event.
>> +	 + Assign a new MBM event without impacting existing assignments.
>> +	 - Unassign an MBM event from currently assigned events.
>> +
>> +	Examples:
>> +	Initial group status:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
>> +	  //0=tl;1=tl
>> +	  /child_default_mon_grp/0=tl;1=tl
>> +
>> +	To update the default group to assign only total MBM event on domain 0:
>> +	::
>> +
>> +	  # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +
>> +	Assignment status after the update:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
>> +	  //0=t;1=tl
>> +	  /child_default_mon_grp/0=tl;1=tl
>> +
>> +	To update the MON group child_default_mon_grp to remove total MBM event on domain 1:
>> +	::
>> +
>> +	  # echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +
>> +	Assignment status after the update:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
>> +	  //0=t;1=tl
>> +	  /child_default_mon_grp/0=tl;1=l
>> +
>> +	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to unassign
>> +	both local and total MBM events on domain 1:
>> +	::
>> +
>> +	  # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
>> +			/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +
>> +	Assignment status after the update:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
>> +	  //0=t;1=tl
>> +	  /child_default_mon_grp/0=tl;1=l
>> +
>> +	To update the default group to add a local MBM event domain 0:
> 
> "local MBM event domain 0" -> "local MBM event on domain 0"?

Sure.

> 
> Taking a step back to look at the completed "mbm_assign_control" section
> it is noteworthy that all this work is about assigning counters to events
> but after this large section is complete the word "counter" does not appear
> a single time.
> 
> The section starts with a brief:
> "Reports the resctrl group and monitor status of each group." and then
> moves to terms like "assigning events"/"assignment status" without defining
> what that means.
> 
> Instead of rewriting this, what do you think of adding some definition
> of what "assignment state" means to the start of the section. For example,
> (I am sure it can be improved):
> 
> "Use "mbm_assign_control" to manage monitoring counter assignment to
> monitoring events when mbm_cntr_assign mode is enabled."


Sure.

-- 
Thanks
Babu Moger
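
For illustration only: a userspace C sketch of splitting one "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>" request into its parts, following the format quoted above. It is not the parser from the patch: struct assign_req, DOM_ALL and parse_assign_req() are hypothetical, only a single domain per request is handled, and only the four documented flag strings are accepted.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define DOM_ALL (-1)	/* illustrative marker for domain_id '*' */

struct assign_req {
	char ctrl[64];	/* empty string: default CTRL_MON group */
	char mon[64];	/* empty string: no child MON group */
	int dom;	/* domain id, or DOM_ALL */
	char opcode;	/* '=', '+' or '-' */
	char flags[4];	/* "t", "l", "tl" or "_" */
};

/* Parse one request; returns 0 on success, -1 on malformed input. */
static int parse_assign_req(const char *s, struct assign_req *r)
{
	const char *s1 = strchr(s, '/');
	const char *s2 = s1 ? strchr(s1 + 1, '/') : NULL;
	const char *p;
	char *end;

	if (!s2 || (size_t)(s1 - s) >= sizeof(r->ctrl) ||
	    (size_t)(s2 - s1 - 1) >= sizeof(r->mon))
		return -1;
	memcpy(r->ctrl, s, (size_t)(s1 - s));
	r->ctrl[s1 - s] = '\0';
	memcpy(r->mon, s1 + 1, (size_t)(s2 - s1 - 1));
	r->mon[s2 - s1 - 1] = '\0';

	p = s2 + 1;
	if (*p == '*') {
		r->dom = DOM_ALL;
		p++;
	} else {
		if (!isdigit((unsigned char)*p))
			return -1;
		r->dom = (int)strtol(p, &end, 10);
		p = end;
	}

	if (*p != '=' && *p != '+' && *p != '-')
		return -1;
	r->opcode = *p++;

	if (strcmp(p, "t") && strcmp(p, "l") && strcmp(p, "tl") &&
	    strcmp(p, "_"))
		return -1;
	/* '_' (no events) is only valid with the '=' opcode. */
	if (!strcmp(p, "_") && r->opcode != '=')
		return -1;
	strcpy(r->flags, p);
	return 0;
}
```

For example, "/child_default_mon_grp/1-t" yields an empty CTRL_MON name (the default group), MON group "child_default_mon_grp", domain 1, opcode '-' and flags "t".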


* Re: [PATCH v11 08/23] x86/resctrl: Introduce interface to display number of monitoring counters
  2025-02-10 18:08         ` Reinette Chatre
@ 2025-02-10 20:26           ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-10 20:26 UTC (permalink / raw)
  To: Reinette Chatre, Moger, Babu, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/10/25 12:08, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/7/25 10:52 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 2/7/2025 11:18 AM, Moger, Babu wrote:
>>> Does this look ok? Just added domain in the text.
>>>
>>> "The number of monitoring counters available in each domain for assignment when the system supports mbm_cntr_assign mode.
>>> ::
>>>    # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>>>    32
>>>
>>> The resctrl file system supports tracking up to two memory bandwidth
>>> events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
>>> Up to two counters can be assigned per monitoring group, one for each
>>> memory bandwidth event in each domain. More monitoring groups can be tracked by assigning one counter per monitoring group. However, doing so limits memory bandwidth tracking to a single memory bandwidth event per
>>> monitoring group."
>>
>> Revised again:
>>
>> "The number of monitoring counters available in each domain for assignment when the system supports mbm_cntr_assign mode. For example, on a system with 32 monitoring counters:
> 
> I think we need to be careful with "available" since all these counters
> may not be available. That is why "available_mbm_cntrs" exists.
> 
> How about something like (please feel free to improve):
> "The maximum number of monitoring counters (total of available and assigned counters)
>  in each domain when the system supports mbm_cntr_assign mode." 

Sure.

> Could you please make the "For example" a new paragraph (this follows existing style in the
> docs). It could also be made more specific, for example,
> 
> "For example, on a system with 32 monitoring counters in each domain:"

Yes.

> 
>> ::
>>   # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>>   32
>>
> 
> The rest of the documentation seems like a repeat of what can be found in
> the "mbm_assign_mode" section right above it. It does not look as though
> any information will be lost by dropping the text below?

Sure.

> 
>> The resctrl file system supports tracking up to two memory bandwidth
>> events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
>> Up to two counters can be assigned per monitoring group, one for each
>> memory bandwidth event in each domain. More monitoring groups can be tracked by assigning one counter per monitoring group. However, doing so limits memory bandwidth tracking to a single memory bandwidth event per
>> monitoring group."
>>
>> Thanks
>> Babu
> 
> Reinette
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-07 10:07       ` Xin Li
@ 2025-02-11 19:44         ` Moger, Babu
  2025-02-12  8:33           ` Xin Li
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-11 19:44 UTC (permalink / raw)
  To: Xin Li, Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Xin,

On 2/7/25 04:07, Xin Li wrote:
> On 2/6/2025 8:17 AM, Reinette Chatre wrote:
>>>> +    wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
>>> This is the existing code, however it would be better to use wrmsrl()
>>> when the higher 32-bit are all 0s:
>>>
>>>      wrmsrl(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config);
>>>
>> Could you please elaborate what makes this change better?
> 
> In short, it takes one less argument, and doesn't pass an argument of 0.
> 
> The longer story is that hpa and I are refactoring the MSR access APIs
> to accommodate the immediate form of MSR access instructions.  And we
> are not happy about that there are too many MSR access APIs and their
> uses are *random*.  The native wrmsr() and wrmsrl() are essentially the
> same and the only difference is that wrmsr() passes a 64-bit value to be
> written into a MSR in *2* u32 arguments.  But we already have struct msr
> defined in asm/shared/msr.h as:
>     struct msr {
>             union {
>                     struct {
>                             u32 l;
>                             u32 h;
>                     };
>                     u64 q;
>             };
>     };
> 
> it's more natural to do the same job with this data structure in most
> cases.  And we want to remove wrmsr() and only keep wrmsrl(), thus a
> developer won't have to figure out which one is better to use :-P.
> 
> For that to happen, one cleanup is to replace wrmsr(msr, low, 0) with
> wrmsrl(msr, low) (low is automatically converted to u64 from u32).
> 
> However, I'm fine if Babu wants to keep it as-is.

Thanks for the explanation.  Changed it to use wrmsrl().

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-11 19:44         ` Moger, Babu
@ 2025-02-12  8:33           ` Xin Li
  0 siblings, 0 replies; 209+ messages in thread
From: Xin Li @ 2025-02-12  8:33 UTC (permalink / raw)
  To: babu.moger, Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On 2/11/2025 11:44 AM, Moger, Babu wrote:
> Hi Xin,
> 
> On 2/7/25 04:07, Xin Li wrote:
>> On 2/6/2025 8:17 AM, Reinette Chatre wrote:
>>>>> +    wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
>>>> This is the existing code, however it would be better to use wrmsrl()
>>>> when the higher 32-bit are all 0s:
>>>>
>>>>       wrmsrl(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config);
>>>>
>>> Could you please elaborate what makes this change better?
>>
>> In short, it takes one less argument, and doesn't pass an argument of 0.
>>
>> The longer story is that hpa and I are refactoring the MSR access APIs
>> to accommodate the immediate form of MSR access instructions.  And we
>> are not happy about that there are too many MSR access APIs and their
>> uses are *random*.  The native wrmsr() and wrmsrl() are essentially the
>> same and the only difference is that wrmsr() passes a 64-bit value to be
>> written into a MSR in *2* u32 arguments.  But we already have struct msr
>> defined in asm/shared/msr.h as:
>>      struct msr {
>>              union {
>>                      struct {
>>                              u32 l;
>>                              u32 h;
>>                      };
>>                      u64 q;
>>              };
>>      };
>>
>> it's more natural to do the same job with this data structure in most
>> cases.  And we want to remove wrmsr() and only keep wrmsrl(), thus a
>> developer won't have to figure out which one is better to use :-P.
>>
>> For that to happen, one cleanup is to replace wrmsr(msr, low, 0) with
>> wrmsrl(msr, low) (low is automatically converted to u64 from u32).
>>
>> However, I'm fine if Babu wants to keep it as-is.
> 
> Thanks for the explanation.  Changed it to use wrmsrl().
> 

You're welcome.  And thanks for making the change.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (23 preceding siblings ...)
  2025-02-03 14:54 ` [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Peter Newman
@ 2025-02-12 17:46 ` Dave Martin
  2025-02-12 23:33   ` Reinette Chatre
  2025-02-13 16:19   ` Moger, Babu
  2025-02-21 18:07 ` James Morse
  25 siblings, 2 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-12 17:46 UTC (permalink / raw)
  To: Babu Moger, peternewman
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi there,

On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> 
> This series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
> 
> Series is written such that it is easier to support other assignable
> features supported from different vendors.
> 
> The feature details are documented in the  APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> 
> The patches are based on top of commit
> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
> 
> # Introduction

[...]

> # Examples
> 
> a. Check if ABMC support is available
> 	#mount -t resctrl resctrl /sys/fs/resctrl/
> 
> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> 	[mbm_cntr_assign]
> 	default

(Nit: can this be called "mbm_counter_assign"?  The name is already
long, so I wonder whether anything is gained by using a cryptic
abbreviation for "counter".  Same with all the "cntrs" elsewhere.
This is purely cosmetic, though -- the interface works either way.)

> 	ABMC feature is detected and it is enabled.
> 
> b. Check how many ABMC counters are available. 
> 
> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
> 	32

Is this file needed?

With MPAM, it is more difficult to promise that the same number of
counters will be available everywhere.

Rather than lie, or report a "safe" value here that may waste some
counters, can we just allow the number of counters to be discovered
per domain via available_mbm_cntrs?

num_closids and num_rmids are already problematic for MPAM, so it would
be good to avoid any more parameters of this sort from being reported
to userspace unless there is a clear understanding of why they are
needed.

Reporting number of counters per monitoring domain is a more natural
fit for MPAM, as below:

> c. Check how many ABMC counters are available in each domain.
> 
> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs 
> 	0=30;1=30

For MPAM, this seems supportable.  Each monitoring domain will have
some counters, and a well-defined number of them will be available for
allocation at any one time.
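
To illustrate how user space could consume the per-domain format quoted
above, here is a minimal shell sketch.  It only parses a string of the
form "<domain_id>=<count>;..." and does not read resctrl itself.  The
helper names parse_mbm_cntrs and min_mbm_cntrs are made up for
illustration, and the sample input is the string from the cover letter,
not output from real hardware.

```shell
# Hypothetical helper: print one "domain count" pair per line from a string
# in the "<domain_id>=<count>;..." format shown for available_mbm_cntrs.
parse_mbm_cntrs() {
	printf '%s\n' "$1" | tr ';' '\n' | while IFS='=' read -r dom cnt; do
		if [ -n "$dom" ]; then
			printf '%s %s\n' "$dom" "$cnt"
		fi
	done
}

# Hypothetical helper: smallest per-domain count, i.e. a conservative
# single value for tools that insist on one number for the whole system.
min_mbm_cntrs() {
	parse_mbm_cntrs "$1" | sort -k2,2n | head -n 1 | cut -d' ' -f2
}

# Sample input copied from the cover letter example.
parse_mbm_cntrs '0=30;1=30'
```

A tool that still wants a single system-wide number could take the
minimum across domains, as min_mbm_cntrs does, at the cost of leaving
counters unused in the larger domains.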

> d. Create few resctrl groups.
> 
> 	# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
> 	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
> 	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
> 
> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    to list and modify any group's monitoring states. File provides single place
>    to list monitoring states of all the resctrl groups. It makes it easier for
>    user space to learn about the used counters without needing to traverse all
>    the groups thus reducing the number of file system calls.
> 
> 	The list follows the following format:
> 
> 	"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
> 
> 	Format for specific type of groups:
> 
> 	* Default CTRL_MON group:
> 	 "//<domain_id>=<flags>"

[...]

>        Flags can be one of the following:
> 
>         t  MBM total event is enabled.
>         l  MBM local event is enabled.
>         tl Both total and local MBM events are enabled.
>         _  None of the MBM events are enabled
> 
> 	Examples:

[...]

I think that this basically works for MPAM.

The local/total distinction doesn't map in a consistent way onto MPAM,
but this problem is not specific to ABMC.  It feels sensible for ABMC
to be built around the same concepts that resctrl already has elsewhere
in the interface.  MPAM will do its best to fit (as already).
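
For reference, the line format quoted above can be split with plain
shell parameter expansion.  This is only a sketch under assumptions: the
function name parse_assign_line is hypothetical, the ';' separator
between domains is taken from the echo examples in this thread, and "-"
is just my placeholder for an empty (default) group name.

```shell
# Hypothetical helper: split one mbm_assign_control line of the form
#   <CTRL_MON group>/<MON group>/<domain_id>=<flags>[;<domain_id>=<flags>...]
# and print "ctrl_mon mon domain flags" for each domain.
parse_assign_line() {
	line=$1
	ctrl=${line%%/*}   # text before the first '/'
	rest=${line#*/}    # everything after the first '/'
	mon=${rest%%/*}    # text between the first and second '/'
	doms=${rest#*/}    # "<domain_id>=<flags>;..."
	printf '%s\n' "$doms" | tr ';' '\n' | while IFS='=' read -r dom flags; do
		# "-" stands in for an empty (default) group name.
		printf '%s %s %s %s\n' "${ctrl:--}" "${mon:--}" "$dom" "$flags"
	done
}

# Default CTRL_MON group: both events in domain 0, none in domain 1.
parse_assign_line '//0=tl;1=_'
```

The flags field is passed through untouched, so the same split would
keep working if new flag letters were added later.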

Regarding Peter's use case of assigning multiple counters to a
monitoring group [1], I feel that it's probably good enough to make
sure that the ABMC interface can be extended in future in a backwards
compatible way so as to support this, without trying to support it
immediately.

[1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/

For example, if we added new generic "letters" -- say, "0" to "9",
combined with new counter files in resctrlfs, that feels like a
possible approach.  ABMC (as in this series) should just reject such
assignments, and the new counter files wouldn't exist.

Availability of this feature could also be reported as a distinct mode
in mbm_assign_mode, say "mbm_cntr_generic", or whatever.

A _sketch_ of this follows.  This is NOT a proposal -- the key
question is whether we are confident that we can extend the interface
in this way in the future without breaking anything.

If "yes", then the ABMC interface (as proposed by this series) works as
a foundation to build on.

--8<--

[artist's impression]

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
 	mbm_cntr_generic
 	[mbm_cntr_assign]
 	default

# echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
# echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
# echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type 
# echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type 
# echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type 
# echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type 

...

# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes

etc.

-->8--

Any thoughts on this, Peter?

[...]

Cheers
---Dave

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-12 17:46 ` Dave Martin
@ 2025-02-12 23:33   ` Reinette Chatre
  2025-02-12 23:40     ` Reinette Chatre
                       ` (2 more replies)
  2025-02-13 16:19   ` Moger, Babu
  1 sibling, 3 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-12 23:33 UTC (permalink / raw)
  To: Dave Martin, Babu Moger, peternewman
  Cc: corbet, tglx, mingo, bp, dave.hansen, tony.luck, fenghua.yu, x86,
	hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/12/25 9:46 AM, Dave Martin wrote:
> Hi there,
> 
> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>
>> This series adds the support for Assignable Bandwidth Monitoring Counters
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> Series is written such that it is easier to support other assignable
>> features supported from different vendors.
>>
>> The feature details are documented in the  APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
>>
>> # Introduction
> 
> [...]
> 
>> # Examples
>>
>> a. Check if ABMC support is available
>> 	#mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> 	[mbm_cntr_assign]
>> 	default
> 
> (Nit: can this be called "mbm_counter_assign"?  The name is already
> long, so I wonder whether anything is gained by using a cryptic
> abbreviation for "counter".  Same with all the "cntrs" elsewhere.
> This is purely cosmetic, though -- the interface works either way.)
> 
>> 	ABMC feature is detected and it is enabled.
>>
>> b. Check how many ABMC counters are available. 
>>
>> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
>> 	32
> 
> Is this file needed?
> 
> With MPAM, it is more difficult to promise that the same number of
> counters will be available everywhere.
> 
> Rather than lie, or report a "safe" value here that may waste some
> counters, can we just allow the number of counters to be be discovered
> per domain via available_mbm_cntrs?

This sounds reasonable to me. I think us having trouble with the
user documentation of this file so late in development should also have been
a sign to rethink its value.

For a user to discover the number of counters supported via available_mbm_cntrs
would require the file's contents to be captured right after mount. Since we've
had scenarios where new userspace needs to discover an up-and-running system's
configuration this may not be possible. I thus wonder whether, instead of removing
num_mbm_cntrs, it could be modified to return the per-domain supported counters
instead of a single value?

> num_closids and num_rmids are already problematic for MPAM, so it would
> be good to avoid any more parameters of this sort from being reported
> to userspace unless there is a clear understanding of why they are
> needed.

Yes. Appreciate your help in identifying what could be problematic for MPAM.

> 
> Reporting number of counters per monitoring domain is a more natural
> fit for MPAM, as below:
> 
>> c. Check how many ABMC counters are available in each domain.
>>
>> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs 
>> 	0=30;1=30
> 
> For MPAM, this seems supportable.  Each monitoring domain will have
> some counters, and a well-defined number of them will be available for
> allocation at any one time.
> 
>> d. Create few resctrl groups.
>>
>> 	# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
>> 	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
>> 	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
>>
>> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>    to list and modify any group's monitoring states. File provides single place
>>    to list monitoring states of all the resctrl groups. It makes it easier for
>>    user space to learn about the used counters without needing to traverse all
>>    the groups thus reducing the number of file system calls.
>>
>> 	The list follows the following format:
>>
>> 	"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
>>
>> 	Format for specific type of groups:
>>
>> 	* Default CTRL_MON group:
>> 	 "//<domain_id>=<flags>"
> 
> [...]
> 
>>        Flags can be one of the following:
>>
>>         t  MBM total event is enabled.
>>         l  MBM local event is enabled.
>>         tl Both total and local MBM events are enabled.
>>         _  None of the MBM events are enabled
>>
>> 	Examples:
> 
> [...]
> 
> I think that this basically works for MPAM.
> 
> The local/total distinction doesn't map in a consistent way onto MPAM,
> but this problem is not specific to ABMC.  It feels sensible for ABMC
> to be built around the same concepts that resctrl already has elsewhere
> in the interface.  MPAM will do its best to fit (as already).
> 
> Regarding Peter's use case of assiging multiple counters to a
> monitoring group [1], I feel that it's probably good enough to make
> sure that the ABMC interface can be extended in future in a backwards
> compatible way so as to support this, without trying to support it
> immediately.
> 
> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
> 

I do not think that resctrl's current support of the mbm_total_bytes and
mbm_local_bytes should be considered as the "only" two available "slots"
into which all possible events should be forced. "mon_features" exists
to guide user space to which events are supported and as I see it new events
can be listed here to inform user space of their availability, with their
associated event files available in the resource groups.

> 
> For example, if we added new generic "letters" -- say, "0" to "9",
> combined with new counter files in resctrlfs, that feels like a
> possible approach.  ABMC (as in this series) should just reject such
> such assignments, and the new counter files wouldn't exist.
> 
> Availability of this feature could also be reported as a distinct mode
> in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
> 
> 
> A _sketch_ of this follows.  This is NOT a proposal -- the key
> question is whether we are confident that we can extend the interface
> in this way in the future without breaking anything.
> 
> If "yes", then the ABMC interface (as proposed by this series) works as
> a foundation to build on.
> 
> --8<--
> 
> [artists's impression]
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>  	mbm_cntr_generic
>  	[mbm_cntr_assign]
>  	default
> 
> # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type 
> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type 
> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type 
> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type 
> 
> ...
> 
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
> 
> etc.
> 

It is not clear to me what additional features such an interface enables. It
also looks like user space will need to track and manage counter IDs?

It sounds to me as though the issue starts with your statement
"The local/total distinction doesn't map in a consistent way onto MPAM". To
address this I expect that an MPAM system will neither support nor list
mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
it would list the events that are appropriate to the system? Trying to match
with what Peter said [1] in the message you refer to, this may be possible:

# cat /sys/fs/resctrl/info/L3_MON/mon_features
mbm_local_read_bytes
mbm_local_write_bytes
mbm_local_bytes

(*) I am including mbm_local_bytes since it could be an event that can be software
defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
counted.

I see the support for MPAM events as distinct from the support of assignable counters.
Once the MPAM events are sorted, I think that they can be assigned with the existing interface.
Please help me understand if you see it differently.

Doing so would require coming up with letters for these events,
which seems to be needed for your proposal also? If we use possible flags of:

mbm_local_read_bytes a
mbm_local_write_bytes b

Then mbm_assign_control can be used as:
# echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
<value>
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
<sum of mbm_local_read_bytes and mbm_local_write_bytes>
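
To make the flag idea concrete, here is a small shell sketch of how a
flags string could be expanded into event names.  The a/b mapping is
only the example mapping from this mail and the function names are
hypothetical; none of this is an existing resctrl interface.

```shell
# Hypothetical flag table, using the example letters from this mail.
flag_to_event() {
	case $1 in
	a) printf '%s\n' mbm_local_read_bytes ;;
	b) printf '%s\n' mbm_local_write_bytes ;;
	*) return 1 ;;   # unknown flag
	esac
}

# Expand a flags string such as "ab" into one event name per line.
# Fails on the first letter that has no event assigned.
expand_flags() {
	flags=$1
	while [ -n "$flags" ]; do
		rest=${flags#?}                             # all but the first character
		flag_to_event "${flags%"$rest"}" || return 1
		flags=$rest
	done
}

expand_flags ab
```

With one letter per event, the table in flag_to_event is the place
where the letter limit discussed below would show up.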

One issue would be when resctrl needs to support more than 26 events (no more
flags available), assuming that upper case would be used for "shared" counters
(unless this interface is defined differently and only a few uppercase letters
are used for it). Would this be too low of a limit?

Reinette

[1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-12 23:33   ` Reinette Chatre
@ 2025-02-12 23:40     ` Reinette Chatre
  2025-02-13  0:11     ` Luck, Tony
  2025-02-13 17:37     ` Dave Martin
  2 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-12 23:40 UTC (permalink / raw)
  To: Dave Martin, Babu Moger, peternewman
  Cc: corbet, tglx, mingo, bp, dave.hansen, tony.luck, x86, hpa,
	paulmck, akpm, thuth, rostedt, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, jpoimboe, perry.yuan, sandipan.das, kai.huang,
	xiaoyao.li, seanjc, xin3.li, andrew.cooper3, ebiggers,
	mario.limonciello, james.morse, tan.shaopeng, linux-doc,
	linux-kernel, maciej.wieczor-retman, eranian

-Fenghua (his email address does not work anymore)


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-12 23:33   ` Reinette Chatre
  2025-02-12 23:40     ` Reinette Chatre
@ 2025-02-13  0:11     ` Luck, Tony
  2025-02-13 17:56       ` Dave Martin
  2025-02-13 17:37     ` Dave Martin
  2 siblings, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-02-13  0:11 UTC (permalink / raw)
  To: Chatre, Reinette, Dave Martin, Babu Moger, peternewman@google.com
  Cc: corbet@lwn.net, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, fenghua.yu@intel.com,
	x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

> I do not think that resctrl's current support of the mbm_total_bytes and
> mbm_local_bytes should be considered as the "only" two available "slots"
> into which all possible events should be forced into. "mon_features" exists
> to guide user space to which events are supported and as I see it new events
> can be listed here to inform user space of their availability, with their
> associated event files available in the resource groups.

100%  I have a number of "events" in the pipeline that do not fit these
names. I'm planning on new files with descriptive[1] names for the events
they report.

-Tony

[1] When these are ready to post we can discuss the names I chose and
change them if there are better names that work across architectures.


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-12 17:46 ` Dave Martin
  2025-02-12 23:33   ` Reinette Chatre
@ 2025-02-13 16:19   ` Moger, Babu
  2025-02-13 18:18     ` Dave Martin
  1 sibling, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-13 16:19 UTC (permalink / raw)
  To: Dave Martin, peternewman
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

Thanks for your help. Reinette has already asked a few questions. I have a few
more questions on top of that.

On 2/12/25 11:46, Dave Martin wrote:
> Hi there,
> 
> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>
>> This series adds the support for Assignable Bandwidth Monitoring Counters
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> Series is written such that it is easier to support other assignable
>> features supported from different vendors.
>>
>> The feature details are documented in the  APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
>>
>> # Introduction
> 
> [...]
> 
>> # Examples
>>
>> a. Check if ABMC support is available
>> 	#mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> 	[mbm_cntr_assign]
>> 	default
> 
> (Nit: can this be called "mbm_counter_assign"?  The name is already
> long, so I wonder whether anything is gained by using a cryptic
> abbreviation for "counter".  Same with all the "cntrs" elsewhere.
> This is purely cosmetic, though -- the interface works either way.)

Yes. We can do that.

> 
>> 	ABMC feature is detected and it is enabled.
>>
>> b. Check how many ABMC counters are available. 
>>
>> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
>> 	32
> 
> Is this file needed?
> 
> With MPAM, it is more difficult to promise that the same number of
> counters will be available everywhere.
> 
> Rather than lie, or report a "safe" value here that may waste some
> counters, can we just allow the number of counters to be be discovered
> per domain via available_mbm_cntrs?

As Reinette suggested below, we can display the per-domain supported counters
here.
https://lore.kernel.org/lkml/9e849476-7c4b-478b-bd2a-185024def3a3@intel.com/

> 
> num_closids and num_rmids are already problematic for MPAM, so it would
> be good to avoid any more parameters of this sort from being reported
> to userspace unless there is a clear understanding of why they are
> needed.
> 
> Reporting number of counters per monitoring domain is a more natural
> fit for MPAM, as below:
> 
>> c. Check how many ABMC counters are available in each domain.
>>
>> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs 
>> 	0=30;1=30
> 
> For MPAM, this seems supportable.  Each monitoring domain will have
> some counters, and a well-defined number of them will be available for
> allocation at any one time.
> 
>> d. Create few resctrl groups.
>>
>> 	# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
>> 	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
>> 	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
>>
>> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>    to list and modify any group's monitoring states. File provides single place
>>    to list monitoring states of all the resctrl groups. It makes it easier for
>>    user space to learn about the used counters without needing to traverse all
>>    the groups thus reducing the number of file system calls.
>>
>> 	The list follows the following format:
>>
>> 	"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
>>
>> 	Format for specific type of groups:
>>
>> 	* Default CTRL_MON group:
>> 	 "//<domain_id>=<flags>"
> 
> [...]
> 
>>        Flags can be one of the following:
>>
>>         t  MBM total event is enabled.
>>         l  MBM local event is enabled.
>>         tl Both total and local MBM events are enabled.
>>         _  None of the MBM events are enabled
>>
>> 	Examples:
> 
> [...]
> 
> I think that this basically works for MPAM.
> 
> The local/total distinction doesn't map in a consistent way onto MPAM,
> but this problem is not specific to ABMC.  It feels sensible for ABMC
> to be built around the same concepts that resctrl already has elsewhere
> in the interface.  MPAM will do its best to fit (as already).
> 
> Regarding Peter's use case of assiging multiple counters to a
> monitoring group [1], I feel that it's probably good enough to make
> sure that the ABMC interface can be extended in future in a backwards
> compatible way so as to support this, without trying to support it
> immediately.
> 
> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
> 
> 
> For example, if we added new generic "letters" -- say, "0" to "9",
> combined with new counter files in resctrlfs, that feels like a
> possible approach.  ABMC (as in this series) should just reject such
> such assignments, and the new counter files wouldn't exist.

What does "combined with new counter files" mean? Is MPAM going to add new
files to support counter assignment on ARM?

Also, what are "0" to "9"? Are these counter IDs?


> 
> Availability of this feature could also be reported as a distinct mode
> in mbm_assign_mode, say "mbm_cntr_generic", or whatever.

Yes. That should be fine.

> 
> 
> A _sketch_ of this follows.  This is NOT a proposal -- the key
> question is whether we are confident that we can extend the interface
> in this way in the future without breaking anything.
> 
> If "yes", then the ABMC interface (as proposed by this series) works as
> a foundation to build on.
> 
> --8<--
> 
> [artists's impression]
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>  	mbm_cntr_generic
>  	[mbm_cntr_assign]
>  	default

Yes. This looks good.


> # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control

It looks like you are assigning counter IDs to domains here. That is
different from ABMC. In ABMC, we assign events (local or total) to the
domain. We handle the counter IDs internally, based on availability.

Can MPAM follow the same concept? Is that possible?
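
To illustrate the distinction, here is a minimal sketch (a toy model in
Python, not the code in this series) of event-based assignment: the caller
names a group and an event, and the counter ID is picked internally from
whatever is free in the domain. The class and method names are hypothetical.

```python
class DomainCounters:
    """Toy model of one monitoring domain's pool of assignable counters."""

    def __init__(self, num_counters):
        self.free = list(range(num_counters))   # counter IDs not in use
        self.assigned = {}                      # (group, event) -> counter ID

    def assign(self, group, event):
        """Assign event 't' (total) or 'l' (local) to a group.

        The counter ID is chosen here, not by the caller.
        """
        if event not in ("t", "l"):
            raise ValueError(f"unknown event {event!r}")
        key = (group, event)
        if key in self.assigned:
            return self.assigned[key]           # already assigned: no-op
        if not self.free:
            raise RuntimeError("no free counters in this domain")
        self.assigned[key] = self.free.pop(0)
        return self.assigned[key]

    def unassign(self, group, event):
        """Release the counter backing an event, if one is assigned."""
        cntr = self.assigned.pop((group, event), None)
        if cntr is not None:
            self.free.append(cntr)

    def available(self):
        """Number of counters still free in this domain."""
        return len(self.free)
```

With 32 counters in a domain, assigning both events to one group leaves 30
free in this model; the caller never sees or chooses the IDs.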


> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type 
> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type 
> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type 
> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type 

This also looks different from what we have right now in resctrl fs.

Are you creating a separate file for each counter ID in
/sys/fs/resctrl/info/L3_MON/?


> 
> ...
> 
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
> 
> etc.
> 
> -->8--
> 
> Any thoughts on this, Peter?
> 
> [...]
> 
> Cheers
> ---Dave
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-12 23:33   ` Reinette Chatre
  2025-02-12 23:40     ` Reinette Chatre
  2025-02-13  0:11     ` Luck, Tony
@ 2025-02-13 17:37     ` Dave Martin
  2025-02-14  6:26       ` Reinette Chatre
  2 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-13 17:37 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Babu Moger, peternewman, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> Hi Dave,
> 
> On 2/12/25 9:46 AM, Dave Martin wrote:
> > Hi there,
> > 
> > On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>
> >> This series adds the support for Assignable Bandwidth Monitoring Counters
> >> (ABMC). It is also called QoS RMID Pinning feature
> >>
> >> Series is written such that it is easier to support other assignable
> >> features supported from different vendors.
> >>
> >> The feature details are documented in the  APM listed below [1].
> >> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> >> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> >> Monitoring (ABMC). The documentation is available at
> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> >>
> >> The patches are based on top of commit
> >> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'

[...]

> >> b. Check how many ABMC counters are available. 
> >>
> >> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
> >> 	32
> > 
> > Is this file needed?
> > 
> > With MPAM, it is more difficult to promise that the same number of
> > counters will be available everywhere.
> > 
> > Rather than lie, or report a "safe" value here that may waste some
> > counters, can we just allow the number of counters to be be discovered
> > per domain via available_mbm_cntrs?
> 
> This sounds reasonable to me. I think us having trouble with the
> user documentation of this file so late in development should also have been
> a sign to rethink its value.
> 
> For a user to discover the number of counters supported via available_mbm_cntrs
> would require the file's contents to be captured right after mount. Since we've
> had scenarios where new userspace needs to discover an up-and-running system's
> configuration this may not be possible. I thus wonder instead of removing
> num_mbm_cntrs, it could be modified to return the per-domain supported counters
> instead of a single value? 

Is it actually useful to be able to discover the number of counters
that exist?  A counter that exists but is not available cannot be used,
so perhaps it is not useful to know about it in the first place.

But if we keep this file but make it report the number of counters for
each domain (similarly to available_mbm_cntrs), then I think the MPAM
driver should be able to work with that.
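
If num_mbm_cntrs did switch to a per-domain layout, user space could
presumably read it the same way as available_mbm_cntrs. A minimal sketch,
assuming the "<domain_id>=<count>" fields separated by ";" shown in the
cover letter (the function name is made up for illustration):

```python
def parse_per_domain(text):
    """Parse a per-domain counter listing such as '0=30;1=30'.

    Returns a dict mapping domain ID to counter count.
    """
    counts = {}
    for field in text.strip().split(";"):
        if not field:
            continue                    # tolerate an empty or trailing field
        domain_id, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in field {field!r}")
        counts[int(domain_id)] = int(value)
    return counts
```

Domains with different counts would simply show up as different values in
the returned dict, so nothing here relies on a single global number.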

> > num_closids and num_rmids are already problematic for MPAM, so it would
> > be good to avoid any more parameters of this sort from being reported
> > to userspace unless there is a clear understanding of why they are
> > needed.
> 
> Yes. Appreciate your help in identifying what could be problematic for MPAM.

For clarity: this is a background issue, mostly orthogonal to this
series.

If this series is merged as-is, with a global per-resource
num_mbm_cntrs property, then this is not really worse than the current
situation -- it's just a bit annoying from the MPAM perspective.

In a nutshell, the num_closids / num_rmids parameters seem to expose
RDT-specific hardware semantics to userspace, implying a specific
allocation model for control group and monitoring group identifiers.

The guarantees that userspace is entitled to assume when resctrl
reports particular values do not seem to be well described and are hard
to map onto the nearest-equivalent MPAM implementation.  A combination
of control and monitoring groups that can be created on x86 may not be
creatable on MPAM, even when the number of supportable control and
monitoring partitions is the same.

Even with the ABMC series, we may still be constrained on what we can
report for num_rmids: we can't know in advance whether or not the user
is going to use mbm_cntr_assign mode -- if not, we can't promise to
create more monitoring groups than the number of counters in the
hardware.

It seems natural for the counts reported by "available_mbm_cntrs" to
change dynamically when the ABMC assignment mode is changed, but I
think userspace is likely to expect the global "num_rmids" parameter
to be fixed for the lifetime of the resctrl mount (and possibly fixed
for all time on a given hardware platform -- at least, modulo CDP).

I think it might be possible to tighten up the documentation of
num_closids in particular in a way that doesn't conflict with x86 and
may make it easier for MPAM to fit in with, but that feels like a
separate conversation.

None of this should be considered a blocker for this series, either way.

> > 
> > Reporting number of counters per monitoring domain is a more natural
> > fit for MPAM, as below:
> > 
> >> c. Check how many ABMC counters are available in each domain.
> >>
> >> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs 
> >> 	0=30;1=30
> > 
> > For MPAM, this seems supportable.  Each monitoring domain will have
> > some counters, and a well-defined number of them will be available for
> > allocation at any one time.

[...]

> >> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control

[...]

> >>        Flags can be one of the following:
> >>
> >>         t  MBM total event is enabled.
> >>         l  MBM local event is enabled.
> >>         tl Both total and local MBM events are enabled.
> >>         _  None of the MBM events are enabled
> >>
> >> 	Examples:
> > 
> > [...]
> > 
> > I think that this basically works for MPAM.
> > 
> > The local/total distinction doesn't map in a consistent way onto MPAM,
> > but this problem is not specific to ABMC.  It feels sensible for ABMC
> > to be built around the same concepts that resctrl already has elsewhere
> > in the interface.  MPAM will do its best to fit (as already).
> > 
> > Regarding Peter's use case of assiging multiple counters to a
> > monitoring group [1], I feel that it's probably good enough to make
> > sure that the ABMC interface can be extended in future in a backwards
> > compatible way so as to support this, without trying to support it
> > immediately.
> > 
> > [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
> > 
> 
> I do not think that resctrl's current support of the mbm_total_bytes and
> mbm_local_bytes should be considered as the "only" two available "slots"
> into which all possible events should be forced into. "mon_features" exists
> to guide user space to which events are supported and as I see it new events
> can be listed here to inform user space of their availability, with their
> associated event files available in the resource groups.

That's fair.  I wasn't sure how (or if) the set of countable
events was expected to grow / evolve via this route.

Either way, I think this confirms that there is at least one viable way
to enable more counters for a single control group, on top of this
series.

(If there is more than one way, that seems fine?)

> > 
> > For example, if we added new generic "letters" -- say, "0" to "9",
> > combined with new counter files in resctrlfs, that feels like a
> > possible approach.  ABMC (as in this series) should just reject such
> > such assignments, and the new counter files wouldn't exist.
> > 
> > Availability of this feature could also be reported as a distinct mode
> > in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
> > 
> > 
> > A _sketch_ of this follows.  This is NOT a proposal -- the key
> > question is whether we are confident that we can extend the interface
> > in this way in the future without breaking anything.
> > 
> > If "yes", then the ABMC interface (as proposed by this series) works as
> > a foundation to build on.
> > 
> > --8<--
> > 
> > [artists's impression]
> > 
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> >  	mbm_cntr_generic
> >  	[mbm_cntr_assign]
> >  	default
> > 
> > # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> > # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type 
> > # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type 
> > # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type 
> > # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type 
> > 
> > ...
> > 
> > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
> > 
> > etc.
> > 
> 
> It is not clear to me what additional features such an interface enables. It
> also looks like user space will need to track and manage counter IDs?

My idea was that for these generic counters, new files could be exposed
to configure what they actually count (the ..._type files shown above;
or possibly via the ..._config files that already exist).

The "IDs" were intended as abstract; the number only relates the
assignments in mbm_assign_control to the files created elsewhere.  This
wouldn't be related to IDs assigned by the hardware.

If there are multiple resctrl users then using numeric IDs might be
problematic; though if we go eventually in the direction of making
resctrlfs multi-mountable then each mount could have its own namespace.

Allowing counters to be named and configured with a mkdir()-style
interface might be possible too; that might make it easier for users to
coexist within a single resctrl mount (if we think that's important
enough).

> It sounds to me as though the issue starts with your statement
> "The local/total distinction doesn't map in a consistent way onto MPAM". To
> address this I expect that an MPAM system will not support nor list
> mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
> it would list the events that are appropriate to the system? Trying to match
> with what Peter said [1] in the message you refer to, this may be possible:
> 
> # cat /sys/fs/resctrl/info/L3_MON/mon_features
> mbm_local_read_bytes
> mbm_local_write_bytes
> mbm_local_bytes
> 
> (*) I am including mbm_local_bytes since it could be an event that can be software
> defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
> counted.
> 
> I see the support for MPAM events distinct from the support of assignable counters.
> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> Please help me understand if you see it differently.
> 	
> Doing so would need to come up with alphabetical letters for these events,
> which seems to be needed for your proposal also? If we use possible flags of:
> 
> mbm_local_read_bytes a
> mbm_local_write_bytes b
> 
> Then mbm_assign_control can be used as:
> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> <value>
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> 
> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> assuming that upper case would be used for "shared" counters (unless this interface is defined
> differently and only few uppercase letters used for it). Would this be too low of a limit?
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/

That approach would also work, where an MPAM system has events that are not
a reasonable approximation of the generic "total" or "local".

For now we would probably stick with "total" and "local" anyway though,
because the MPAM architecture doesn't natively allow the mapping onto
the memory system topology to be discovered, and the information in
ACPI / device tree is insufficient to tell us everything we'd need to
know.  But I guess what counts as "local" in particular will be quite
hardware and topology dependent even on x86, so perhaps we shouldn't
worry about having the behaviour match exactly (?)

Regarding the code letters, my idea was that the event type might be
configured by a separate file, instead of in mbm_assign_control
directly, in which case running out of letters wouldn't be a problem.

Alternatively, if we want to be able to expand beyond single letters,
could we reserve one or more characters for extension purposes?

If braces are forbidden by the syntax today, could we add support for
something like the following later on, without breaking anything?

# echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control

For now, my main concern would be whether this series prevents that
sort of thing being added in a backwards compatible way later.
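
As a rough check that such an extension could coexist with today's
single-letter flags, here is a sketch of a tokenizer for the <flags> part
only. It assumes that braces are not valid flag characters today and that a
braced name is one event; both the function name and the brace grammar are
hypothetical, nothing in this series defines them.

```python
def split_flags(flags):
    """Split a flags string into a list of event names.

    A bare character is one event; '{name}' is one multi-character event.
    """
    events = []
    i = 0
    while i < len(flags):
        ch = flags[i]
        if ch == "{":
            end = flags.find("}", i + 1)
            if end == -1:
                raise ValueError("unterminated '{' in flags")
            name = flags[i + 1:end]
            if not name:
                raise ValueError("empty event name in braces")
            events.append(name)
            i = end + 1
        elif ch == "}":
            raise ValueError("unmatched '}' in flags")
        else:
            events.append(ch)           # existing single-letter form
            i += 1
    return events
```

Existing strings such as "tl" or "_" tokenize exactly as before, while
"{foo}{bar}" yields two named events, so old writers would not need to
change under these assumptions.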

I don't really see anything that is a blocker.

What do you think?

Cheers
---Dave


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-03 20:49   ` Moger, Babu
@ 2025-02-13 17:51     ` Dave Martin
  2025-02-13 18:08       ` Luck, Tony
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-13 17:51 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Peter Newman, corbet, reinette.chatre, tglx, mingo, bp,
	dave.hansen, tony.luck, fenghua.yu, x86, hpa, paulmck, akpm,
	thuth, rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Mon, Feb 03, 2025 at 02:49:27PM -0600, Moger, Babu wrote:
> Hi Peter,
> 
> On 2/3/25 08:54, Peter Newman wrote:

[...]

> >> # Linux Implementation
> >>
> >> Create a generic interface aimed to support user space assignment
> >> of scarce counters used for monitoring. First usage of interface
> >> is by ABMC with option to expand usage to "soft-ABMC" and MPAM
> >> counters in future.
> > 
> > As a reminder of the work related to this, please take a look at the
> > thread where Reinette proposed a "shared counters" mode in
> > mbm_assign_control[1]. I am currently working to demonstrate that this
> > combined with the mbm_*_bytes_per_second events discussed earlier in
> > the same thread will address my users' concerns about the overhead of
> > reading a large number of MBM counters, resulting from a maximal
> > number of monitoring groups whose jobs are not isolated to any L3
> > monitoring domain.
> > 
> > ABMC will add to the number of registers which need to be programmed
> > in each domain, so I will need to demonstrate that ABMC combined with
> > these additional features addresses their performance concerns and
> > that the resulting interface is user-friendly enough that they will
> > not need a detailed understanding of the implementation to avoid an
> > unacceptable performance degradation (i.e., needing to understand what
> > conditions will increase the number of IPIs required).
> > 
> > If all goes well, soft-ABMC will try to extend this usage model to the
> > existing, pre-ABMC, AMD platforms I support.
> > 
> > Thanks,
> > -Peter
> > 
> > [1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/
> > 
> 
> Thanks for the heads-up. I understand what's going on and have an idea of
> the plan. Please keep us updated on the progress. Also, if any changes are
> needed in this series to meet your requirements, feel free to share your
> feedback.

Playing devil's advocate, I wonder whether there is a point beyond
which it would be better to have an interface to hand over some of the
counters to perf?

The logic for round-robin scheduling of events onto counters, dealing
with overflows etc. has already been invented over there, and it's
fiddly to get right.  Ideally resctrl wouldn't have its own special
implementation of that kind of stuff.

(Said by someone who has never tried to hack up an uncore event source
in perf.)

Cheers
---Dave


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-13  0:11     ` Luck, Tony
@ 2025-02-13 17:56       ` Dave Martin
  0 siblings, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-13 17:56 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Chatre, Reinette, Babu Moger, peternewman@google.com,
	corbet@lwn.net, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, paulmck@kernel.org, akpm@linux-foundation.org,
	thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

Hi Tony,

On Thu, Feb 13, 2025 at 12:11:13AM +0000, Luck, Tony wrote:
> > I do not think that resctrl's current support of the mbm_total_bytes and
> > mbm_local_bytes should be considered as the "only" two available "slots"
> > into which all possible events should be forced into. "mon_features" exists
> > to guide user space to which events are supported and as I see it new events
> > can be listed here to inform user space of their availability, with their
> > associated event files available in the resource groups.
> 
> 100%  I have a number of "events" in the pipeline that do not fit these
> names. I'm planning on new files with descriptive[1] names for the events
> they report.
> 
> -Tony
> 
> [1] When these are ready to post we can discuss the names I chose and
> change them if there are better names that work across architectures.

Do any of the approaches discussed in [2] look viable for this?

(Ideally, reply over there.)

Cheers
---Dave

[2] https://lore.kernel.org/lkml/Z64tw2NbJXbKpLrH@e133380.arm.com/


* RE: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-13 17:51     ` Dave Martin
@ 2025-02-13 18:08       ` Luck, Tony
  0 siblings, 0 replies; 209+ messages in thread
From: Luck, Tony @ 2025-02-13 18:08 UTC (permalink / raw)
  To: Dave Martin, Moger, Babu
  Cc: Peter Newman, corbet@lwn.net, Chatre, Reinette,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, fenghua.yu@intel.com, x86@kernel.org,
	hpa@zytor.com, paulmck@kernel.org, akpm@linux-foundation.org,
	thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

> Playing devil's advocate, I wonder whether there is a point beyond
> which it would be better to have an interface to hand over some of the
> counters to perf?
>
> The logic for round-robin scheduling of events onto counters, dealing
> with overflows etc. has already been invented over there, and it's
> fiddly to get right.  Ideally resctrl wouldn't have its own special
> implementation of that kind of stuff.
>
> (Said my someone who has never tried to hack up an uncore event source
> in perf.)

Initial implementation on Intel RDT tried to use perf ... it all went badly and
was reverted.

There are some very un-perf-like properties that we couldn't find a
workaround for at the time.

E.g.

1) Cache occupancy counters. These change even when your workload
isn't running (downward due to evictions).

2) Counters based on RMIDs show the aggregated values from multiple
CPUs as tasks are scheduled on cores.

But maybe you meant "don't let resctrl use all those counters" ... hand some
of them to perf to use in some other way?

-Tony


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-13 16:19   ` Moger, Babu
@ 2025-02-13 18:18     ` Dave Martin
  2025-02-13 18:39       ` Luck, Tony
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-13 18:18 UTC (permalink / raw)
  To: Moger, Babu
  Cc: peternewman, corbet, reinette.chatre, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Thu, Feb 13, 2025 at 10:19:29AM -0600, Moger, Babu wrote:
> Hi Dave,
> 
> Thanks for your help. Reinette has asked few questions already. I have few
> more questions on top of that.
> 
> On 2/12/25 11:46, Dave Martin wrote:
> > Hi there,
> > 
> > On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>
> >> This series adds the support for Assignable Bandwidth Monitoring Counters
> >> (ABMC). It is also called QoS RMID Pinning feature

[...]

> >> a. Check if ABMC support is available
> >> 	#mount -t resctrl resctrl /sys/fs/resctrl/
> >>
> >> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> >> 	[mbm_cntr_assign]
> >> 	default
> > 
> > (Nit: can this be called "mbm_counter_assign"?  The name is already
> > long, so I wonder whether anything is gained by using a cryptic
> > abbreviation for "counter".  Same with all the "cntrs" elsewhere.
> > This is purely cosmetic, though -- the interface works either way.)
> 
> Yes. We can do that.

Thanks (note, I'm also happy without this change, if you aren't
planning to do a substantial respin of the series.)

[...]

> >> b. Check how many ABMC counters are available. 
> >>
> >> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
> >> 	32
> > 
> > Is this file needed?
> > 
> > With MPAM, it is more difficult to promise that the same number of
> > counters will be available everywhere.
> > 
> > Rather than lie, or report a "safe" value here that may waste some
> > counters, can we just allow the number of counters to be be discovered
> > per domain via available_mbm_cntrs?
> 
> As  Reinette suggested below we can display per domain supported counters
> here.
> https://lore.kernel.org/lkml/9e849476-7c4b-478b-bd2a-185024def3a3@intel.com/

Although I'm still not convinced that this file is necessary, MPAM
should be able to work with this.

(I'm assuming that ABMC hardware has a set of counters for each
monitoring domain, of course -- otherwise this doesn't make sense.)

[...]

> >> c. Check how many ABMC counters are available in each domain.
> >>
> >> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs 
> >> 	0=30;1=30
> > 
> > For MPAM, this seems supportable.  Each monitoring domain will have
> > some counters, and a well-defined number of them will be available for
> > allocation at any one time.

[...]

> >>        Flags can be one of the following:
> >>
> >>         t  MBM total event is enabled.
> >>         l  MBM local event is enabled.
> >>         tl Both total and local MBM events are enabled.
> >>         _  None of the MBM events are enabled
> >>
> >> 	Examples:
> > 
> > [...]
> > 
> > I think that this basically works for MPAM.
> > 
> > The local/total distinction doesn't map in a consistent way onto MPAM,
> > but this problem is not specific to ABMC.  It feels sensible for ABMC
> > to be built around the same concepts that resctrl already has elsewhere
> > in the interface.  MPAM will do its best to fit (as already).
> > 
> > Regarding Peter's use case of assigning multiple counters to a
> > monitoring group [1], I feel that it's probably good enough to make
> > sure that the ABMC interface can be extended in future in a backwards
> > compatible way so as to support this, without trying to support it
> > immediately.
> > 
> > [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
> > 
> > 
> > For example, if we added new generic "letters" -- say, "0" to "9",
> > combined with new counter files in resctrlfs, that feels like a
> > possible approach.  ABMC (as in this series) should just reject such
> > assignments, and the new counter files wouldn't exist.
> 
> What is "combined with new counter files"? Is MPAM going to add new
> files to support counter assignment on ARM?
> 
> Also what is "0" to "9"? Are these counter ids?
> 
> 
> > 
> > Availability of this feature could also be reported as a distinct mode
> > in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
> 
> Yes. That should be fine.
> 
> > 
> > 
> > A _sketch_ of this follows.  This is NOT a proposal -- the key
> > question is whether we are confident that we can extend the interface
> > in this way in the future without breaking anything.
> > 
> > If "yes", then the ABMC interface (as proposed by this series) works as
> > a foundation to build on.
> > 
> > --8<--
> > 
> > [artist's impression]
> > 
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> >  	mbm_cntr_generic
> >  	[mbm_cntr_assign]
> >  	default
> 
> Yes. This looks good.

Good to know, thanks.  (Just to be clear, I am *not* suggesting adding
anything like this just now -- just checking whether the idea works
at all.)


> > # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> > # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
> Looks like you are assigning counter ids to domains here. That is
> different than ABMC. In ABMC, we assign events (local or total) to the
> domain. We internally handle the counter ids based on the availability.

The numbers are not supposed to have any hardware significance.

	'//0=6'

just means "assign some unused counter for domain 0, and create files
in resctrl so I can configure and read it".

The "6" is really just a tag for labelling the resulting resctrl
file names so that the user can tell them apart.  It's not supposed
to imply any specific hardware counter or event.
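
To make the "tag, not hardware ID" idea concrete, here is a rough Python
sketch. Everything in it is hypothetical: the DomainCounters class, the choice
of which free counter to hand out, and the file naming (modelled on the
mbm_counter<N>_bytes names in the artist's impression) are assumptions for
illustration, not anything this series or MPAM implements.

```python
class DomainCounters:
    """Hypothetical per-domain counter pool; a tag is only a user-visible label."""

    def __init__(self, num_counters):
        self.free = list(range(num_counters))  # hardware indices, never shown to the user
        self.by_tag = {}                       # tag -> hardware index

    def assign(self, tag):
        """Bind a tag to any unused counter; repeating a tag is a no-op."""
        if tag in self.by_tag:
            return self.by_tag[tag]
        if not self.free:
            raise RuntimeError("no free counter in this domain")
        hw_index = self.free.pop()  # any unused counter will do
        self.by_tag[tag] = hw_index
        return hw_index

    def file_name(self, tag):
        """Name of the (hypothetical) resctrl file that would expose this counter."""
        return f"mbm_counter{tag}_bytes"


domain0 = DomainCounters(32)
domain0.assign("6")
print(domain0.file_name("6"))  # mbm_counter6_bytes
```

The point of the sketch is that the tag only selects a file name; which
hardware counter backs it is an internal decision.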

> Can MPAM follow the same concept?  Is it possible?

[...]

> Thanks
> Babu Moger

Yes, although there is some hard-to-avoid fuzz about the precise
meaning of "local" and "total".

As Reinette pointed out, there is also the possibility of adding
new named events other than "local" and "total" if we find that some
kinds of event don't fit these categories.

Cheers
---Dave


* RE: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-13 18:18     ` Dave Martin
@ 2025-02-13 18:39       ` Luck, Tony
  2025-02-14  6:34         ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-02-13 18:39 UTC (permalink / raw)
  To: Dave Martin, Moger, Babu
  Cc: peternewman@google.com, corbet@lwn.net, Chatre, Reinette,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	paulmck@kernel.org, akpm@linux-foundation.org, thuth@redhat.com,
	rostedt@goodmis.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	jpoimboe@kernel.org, perry.yuan@amd.com, sandipan.das@amd.com,
	Huang, Kai, Li, Xiaoyao, seanjc@google.com, Li, Xin3,
	andrew.cooper3@citrix.com, ebiggers@google.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Wieczor-Retman, Maciej,
	Eranian, Stephane

> Yes, although there is some hard-to-avoid fuzz about the precise
> meaning of "local" and "total".

Things are only getting fuzzier with mixed DDR and CXL memory.

> As Reinette pointed out, there is also the possibility of adding
> new named events other than "local" and "total" if we find that some
> kinds of event don't fit these categories.

Not just new names, new scopes too. Patches coming later this year
that would present:

$ cd /sys/fs/resctrl
$ cat mon_data/mon_PKG_00/llc_stalls
779762866739

I.e. a way to cheaply collect some "perf" like events across
all CPUs on a package that executed jobs with a specific RMID.

Of course this can be done with perf today, but the cost to collect
this data from heavily multi-threaded workloads that context switch
rapidly is very high.

-Tony


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-13 17:37     ` Dave Martin
@ 2025-02-14  6:26       ` Reinette Chatre
  2025-02-14 18:31         ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-14  6:26 UTC (permalink / raw)
  To: Dave Martin
  Cc: Babu Moger, peternewman, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/13/25 9:37 AM, Dave Martin wrote:
> Hi Reinette,
> 
> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>> Hi there,
>>>
>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>
>>>> This series adds the support for Assignable Bandwidth Monitoring Counters
>>>> (ABMC). It is also called QoS RMID Pinning feature
>>>>
>>>> Series is written such that it is easier to support other assignable
>>>> features supported from different vendors.
>>>>
>>>> The feature details are documented in the  APM listed below [1].
>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>> Monitoring (ABMC). The documentation is available at
>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>>
>>>> The patches are based on top of commit
>>>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
> 
> [...]
> 
>>>> b. Check how many ABMC counters are available. 
>>>>
>>>> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
>>>> 	32
>>>
>>> Is this file needed?
>>>
>>> With MPAM, it is more difficult to promise that the same number of
>>> counters will be available everywhere.
>>>
>>> Rather than lie, or report a "safe" value here that may waste some
>>> counters, can we just allow the number of counters to be discovered
>>> per domain via available_mbm_cntrs?
>>
>> This sounds reasonable to me. I think us having trouble with the
>> user documentation of this file so late in development should also have been
>> a sign to rethink its value.
>>
>> For a user to discover the number of counters supported via available_mbm_cntrs
>> would require the file's contents to be captured right after mount. Since we've
>> had scenarios where new userspace needs to discover an up-and-running system's
>> configuration this may not be possible. I thus wonder instead of removing
>> num_mbm_cntrs, it could be modified to return the per-domain supported counters
>> instead of a single value? 
> 
> Is it actually useful to be able to discover the number of counters
> that exist?  A counter that exists but is not available cannot be used,
> so perhaps it is not useful to know about it in the first place.

An alternative perspective of what "available" means is "how many counters
could I possibly get to do this new monitoring task". A user may be willing
to re-assign counters if the new monitoring task is important. Knowing
how many counters are already free and available for assignment would be
easy from available_mbm_cntrs but to get an idea of how many counters
could be re-assigned to help out with the new task would require
some intricate parsing of mbm_assign_control.
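
As a rough illustration of that parsing, a Python sketch follows. It assumes
each line of mbm_assign_control has the form
"<ctrl group>/<mon group>/<domain>=<flags>;..." with the flags described in
the cover letter (t, l, tl, _), and it assumes that each enabled event consumes
exactly one counter in its domain. The helper name counters_in_use and the
sample text are made up for the example.

```python
from collections import Counter


def counters_in_use(assign_control_text):
    """Count assigned counters per domain from mbm_assign_control-style text."""
    used = Counter()
    for line in assign_control_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Group names precede the last '/'; the domain assignments follow it.
        _, _, assignments = line.rpartition("/")
        for field in assignments.split(";"):
            domain, _, flags = field.partition("=")
            # '_' means no event is enabled, so it consumes no counter.
            used[int(domain)] += sum(1 for flag in flags if flag != "_")
    return dict(used)


sample = "//0=tl;1=tl\n/child_default_mon_grp/0=t;1=_\n"
print(counters_in_use(sample))  # {0: 3, 1: 2}
```

Under the same assumptions, adding these numbers to the per-domain values in
available_mbm_cntrs would give the per-domain totals that a per-domain
num_mbm_cntrs could report directly.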


> But if we keep this file but make it report the number of counters for
> each domain (similarly to available_mbm_cntrs), then I think the MPAM
> driver should be able to work with that.
> 
>>> num_closids and num_rmids are already problematic for MPAM, so it would
>>> be good to avoid any more parameters of this sort from being reported
>>> to userspace unless there is a clear understanding of why they are
>>> needed.
>>
>> Yes. Appreciate your help in identifying what could be problematic for MPAM.
> 
> For clarity: this is a background issue, mostly orthogonal to this
> series.
> 
> If this series is merged as-is, with a global per-resource
> num_mbm_cntrs property, then this is not really worse than the current
> situation -- it's just a bit annoying from the MPAM perspective.
> 
> 
> In a nutshell, the num_closids / num_rmids parameters seem to expose
> RDT-specific hardware semantics to userspace, implying a specific
> allocation model for control group and monitoring group identifiers.
> 
> The guarantees that userspace is entitled to assume when resctrl
> reports particular values do not seem to be well described and are hard
> to map onto the nearest-equivalent MPAM implementation.  A combination
> of control and monitoring groups that can be created on x86 may not be
> creatable on MPAM, even when the number of supportable control and
> monitoring partitions is the same.

I understand. This interface was created almost a decade ago. It would have been
wonderful if the user interface could have been created with a clear vision
of all the use cases it would end up needing to support. I am trying to be
very careful with this new user interface as I try to consider all the things I
learned while working on resctrl. All help to get this new interface right is
greatly appreciated.

Since you specifically mention issues that MPAM has with num_rmids, please
note that we have been trying (see [1], but maybe start reading thread at [2])
to find ways to make this work with MPAM, but there has been no word from the
MPAM side. I see that you were not cc'd on the discussion so this is not a
criticism of you personally, but I would like to highlight that we do try to
make things work well for MPAM, yet so far this work seems to be ignored while
also being criticized for not being done. I expect that the more use cases are
thrown at an interface as it is developed, the better it gets, and I would
gladly work with MPAM folks to improve things.

> Even with the ABMC series, we may still be constrained on what we can
> report for num_rmids: we can't know in advance whether or not the user
> is going to use mbm_cntr_assign mode -- if not, we can't promise to
> create more monitoring groups than the number of counters in the
> hardware.

It is the architecture that decides which modes are supported and
which is default.

> It seems natural for the counts reported by "available_mbm_cntrs" to
> change dynamically when the ABMC assignment mode is changed, but I
> think userspace are likely to expect the global "num_rmids" parameters
> to be fixed for the lifetime of the resctrl mount (and possibly fixed
> for all time on a given hardware platform -- at least, modulo CDP).
> 
> 
> I think it might be possible to tighten up the documentation of
> num_closids in particular in a way that doesn't conflict with x86 and
> may make it easier for MPAM to fit in with, but that feels like a
> separate conversation.
> 
> None of this should be considered a blocker for this series, either way.
> 
>>>
>>> Reporting number of counters per monitoring domain is a more natural
>>> fit for MPAM, as below:
>>>
>>>> c. Check how many ABMC counters are available in each domain.
>>>>
>>>> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs 
>>>> 	0=30;1=30
>>>
>>> For MPAM, this seems supportable.  Each monitoring domain will have
>>> some counters, and a well-defined number of them will be available for
>>> allocation at any one time.
> 
> [...]
> 
>>>> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
> [...]
> 
>>>>        Flags can be one of the following:
>>>>
>>>>         t  MBM total event is enabled.
>>>>         l  MBM local event is enabled.
>>>>         tl Both total and local MBM events are enabled.
>>>>         _  None of the MBM events are enabled
>>>>
>>>> 	Examples:
>>>
>>> [...]
>>>
>>> I think that this basically works for MPAM.
>>>
>>> The local/total distinction doesn't map in a consistent way onto MPAM,
>>> but this problem is not specific to ABMC.  It feels sensible for ABMC
>>> to be built around the same concepts that resctrl already has elsewhere
>>> in the interface.  MPAM will do its best to fit (as already).
>>>
>>> Regarding Peter's use case of assigning multiple counters to a
>>> monitoring group [1], I feel that it's probably good enough to make
>>> sure that the ABMC interface can be extended in future in a backwards
>>> compatible way so as to support this, without trying to support it
>>> immediately.
>>>
>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>>>
>>
>> I do not think that resctrl's current support of the mbm_total_bytes and
>> mbm_local_bytes should be considered as the "only" two available "slots"
>> into which all possible events should be forced into. "mon_features" exists
>> to guide user space to which events are supported and as I see it new events
>> can be listed here to inform user space of their availability, with their
>> associated event files available in the resource groups.
> 
> That's fair.  I wasn't currently sure how (or if) the set of countable
> events was expected to grow / evolve via this route.
> 
> Either way, I think this confirms that there is at least one viable way
> to enable more counters for a single control group, on top of this
> series.
> 
> (If there is more than one way, that seems fine?)
> 
>>>
>>> For example, if we added new generic "letters" -- say, "0" to "9",
>>> combined with new counter files in resctrlfs, that feels like a
>>> possible approach.  ABMC (as in this series) should just reject such
>>> assignments, and the new counter files wouldn't exist.
>>>
>>> Availability of this feature could also be reported as a distinct mode
>>> in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
>>>
>>>
>>> A _sketch_ of this follows.  This is NOT a proposal -- the key
>>> question is whether we are confident that we can extend the interface
>>> in this way in the future without breaking anything.
>>>
>>> If "yes", then the ABMC interface (as proposed by this series) works as
>>> a foundation to build on.
>>>
>>> --8<--
>>>
>>> [artist's impression]
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>  	mbm_cntr_generic
>>>  	[mbm_cntr_assign]
>>>  	default
>>>
>>> # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>> # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type 
>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type 
>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type 
>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type 
>>>
>>> ...
>>>
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
>>>
>>> etc.
>>>
>>
>> It is not clear to me what additional features such an interface enables. It
>> also looks like user space will need to track and manage counter IDs?
> 
> My idea was that for these generic counters, new files could be exposed
> to configure what they actually count (the ..._type files shown above;
> or possibly via the ..._config files that already exist).
> 
> The "IDs" were intended as abstract; the number only relates the
> assignments in mbm_assign_control to the files created elsewhere.  This
> wouldn't be related to IDs assigned by the hardware.

I see. Yes, this sounds related to and a generalization of the AMD
configurable event feature.

> 
> If there are multiple resctrl users then using numeric IDs might be
> problematic; though if we go eventually in the direction of making
> resctrlfs multi-mountable then each mount could have its own namespace.

I am not aware of "multi-mountable" direction.

> 
> Allowing counters to be named and configured with a mkdir()-style
> interface might be possible too; that might make it easier for users to
> coexist within a single resctrl mount (if we think that's important
> enough).
> 
>> It sounds to me as though the issue starts with your statement
>> "The local/total distinction doesn't map in a consistent way onto MPAM". To
>> address this I expect that an MPAM system will not support nor list
>> mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
>> it would list the events that are appropriate to the system? Trying to match
>> with what Peter said [1] in the message you refer to, this may be possible:
>>
>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>> mbm_local_read_bytes
>> mbm_local_write_bytes
>> mbm_local_bytes
>>
>> (*) I am including mbm_local_bytes since it could be an event that can be software
>> defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
>> counted.
>>
>> I see the support for MPAM events distinct from the support of assignable counters.
>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>> Please help me understand if you see it differently.
>> 	
>> Doing so would need to come up with alphabetical letters for these events,
>> which seems to be needed for your proposal also? If we use possible flags of:
>>
>> mbm_local_read_bytes a
>> mbm_local_write_bytes b
>>
>> Then mbm_assign_control can be used as:
>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>> <value>
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>
>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
> 
> That approach would also work, where an MPAM system has events that are not
> a reasonable approximation of the generic "total" or "local".
> 
> For now we would probably stick with "total" and "local" anyway though,
> because the MPAM architecture doesn't natively allow the mapping onto
> the memory system topology to be discovered, and the information in
> ACPI / device tree is insufficient to tell us everything we'd need to
> know.  But I guess what counts as "local" in particular will be quite
> hardware and topology dependent even on x86, so perhaps we shouldn't
> worry about having the behaviour match exactly (?)
> 
> Regarding the code letters, my idea was that the event type might be
> configured by a separate file, instead of in mbm_assign_control
> directly, in which case running out of letters wouldn't be a problem.

This work started with individual files for counters but the issue was
raised that this will require a large number of filesystem calls when, for
example, a user wants to move a group of counters associated with the events
of one set of monitoring groups to another set of monitoring groups. This
is for the use case where there are a significant number of monitor groups
for which there are not sufficient counters. With mbm_assign_control this
can be done in a single write and such a monitoring transition can thus
be accomplished more efficiently.
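
A sketch of what such a single write could contain, in Python for
illustration. It assumes the "<ctrl group>/<mon group>/<domain>=<flags>"
syntax from the cover letter, and it additionally assumes that several groups
may be passed in one write as newline-separated entries. That second
assumption should be checked against the series' documentation. The helper
name build_reassignment and the group names are hypothetical.

```python
def build_reassignment(from_groups, to_groups, domains, flags="tl"):
    """Build one mbm_assign_control payload that moves counters between groups.

    from_groups / to_groups are lists of (ctrl_group, mon_group) name pairs;
    an empty string stands for the default group.
    """
    lines = []
    # Unassign first so that the counters are free before they are reassigned.
    for ctrl, mon in from_groups:
        fields = ";".join(f"{d}=_" for d in domains)
        lines.append(f"{ctrl}/{mon}/{fields}")
    for ctrl, mon in to_groups:
        fields = ";".join(f"{d}={flags}" for d in domains)
        lines.append(f"{ctrl}/{mon}/{fields}")
    return "\n".join(lines) + "\n"


payload = build_reassignment([("", "m1")], [("", "m2")], domains=[0, 1])
print(payload, end="")
# /m1/0=_;1=_
# /m2/0=tl;1=tl
```

The payload would then be handed to the kernel with one write() to
mbm_assign_control, instead of one filesystem call per counter file.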

> 
> Alternatively, if we want to be able to expand beyond single letters,
> could we reserve one or more characters for extension purposes?
> 
> If braces are forbidden by the syntax today, could we add support for
> something like the following later on, without breaking anything?
> 
> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 

Thank you for the suggestion. I think we may need something like this.
Babu, what do you think?
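
To check that the brace extension can coexist with today's single-letter
flags, here is a small tokenizer sketch in Python. It is only an illustration
of the idea: the function name split_flags is hypothetical, and the event
names "foo" and "bar" are the placeholders from Dave's example.

```python
def split_flags(flags):
    """Split a flags string into event tokens.

    Single characters are one-letter flags (as in "tl"); a "{name}" run is
    one multi-character event name.
    """
    tokens = []
    i = 0
    while i < len(flags):
        ch = flags[i]
        if ch == "{":
            end = flags.find("}", i)
            if end == -1:
                raise ValueError("unterminated '{' in flags")
            name = flags[i + 1:end]
            if not name:
                raise ValueError("empty '{}' in flags")
            tokens.append(name)
            i = end + 1
        elif ch == "}":
            raise ValueError("unexpected '}' in flags")
        else:
            tokens.append(ch)
            i += 1
    return tokens


print(split_flags("tl"))          # ['t', 'l']
print(split_flags("{foo}{bar}"))  # ['foo', 'bar']
```

Existing strings such as "tl" or "_" tokenize exactly as before, so, as long
as braces are rejected today, adding them later should not change the meaning
of anything already accepted.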

> 
> For now, my main concern would be whether this series prevents that
> sort of thing being added in a backwards compatible way later.
> 
> I don't really see anything that is a blocker.
> 
> What do you think?

I do not fully understand the MPAM counter feature. It almost sounds like
every counter could be configured independently with the expectation to
configure and assign each counter independently to a domain. As I understand it,
these capabilities match AMD's ABMC feature, but the planned implementation
to support ABMC first configures events per-domain and then assigns these
events to counters. Hmmm ... but in your example a file like
"mbm_counter0_bytes_type" is global. Could you please elaborate how in
your example writing a single letter to that file will be interpreted?


Reinette

[1] https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
[2] https://lore.kernel.org/lkml/CALPaoChad6=xqz+BQQd=dB915xhj1gusmcrS9ya+T2GyhTQc5Q@mail.gmail.com/


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-13 18:39       ` Luck, Tony
@ 2025-02-14  6:34         ` Reinette Chatre
  2025-02-14  7:23           ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-14  6:34 UTC (permalink / raw)
  To: Luck, Tony, Dave Martin, Moger, Babu
  Cc: peternewman@google.com, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

Hi Tony,

On 2/13/25 10:39 AM, Luck, Tony wrote:
>> Yes, although there is some hard-to-avoid fuzz about the precise
>> meaning of "local" and "total".
> 
> Things are only getting fuzzier with mixed DDR and CXL memory.
> 
>> As Reinette pointed out, there is also the possibility of adding
>> new named events other than "local" and "total" if we find that some
>> kinds of event don't fit these categories.
> 
> Not just new names, new scopes too. Patches coming later this year
> that would present:
> 
> $ cd /sys/fs/resctrl
> $ cat mon_data/mon_PKG_00/llc_stalls
> 779762866739

Thank you for catching this. To support this would not be possible for
the current plan for mbm_assign_control since it does not have a way
to distinguish domain X of the PKG resource from domain X of the L3 resource.
Sounds like we need to include the resource name in the mbm_assign_control
syntax?

Reinette


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-14  6:34         ` Reinette Chatre
@ 2025-02-14  7:23           ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-14  7:23 UTC (permalink / raw)
  To: Luck, Tony, Dave Martin, Moger, Babu
  Cc: peternewman@google.com, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane



On 2/13/25 10:34 PM, Reinette Chatre wrote:
> Hi Tony,
> 
> On 2/13/25 10:39 AM, Luck, Tony wrote:
>>> Yes, although there is some hard-to-avoid fuzz about the precise
>>> meaning of "local" and "total".
>>
>> Things are only getting fuzzier with mixed DDR and CXL memory.
>>
>>> As Reinette pointed out, there is also the possibility of adding
>>> new named events other than "local" and "total" if we find that some
>>> kinds of event don't fit these categories.
>>
>> Not just new names, new scopes too. Patches coming later this year
>> that would present:
>>
>> $ cd /sys/fs/resctrl
>> $ cat mon_data/mon_PKG_00/llc_stalls
>> 779762866739
> 
> Thank you for catching this. To support this would not be possible for
> the current plan for mbm_assign_control since it does not have a way
> to distinguish domain X of the PKG resource from domain X of the L3 resource.
> Sounds like we need to include the resource name in the mbm_assign_control
> syntax?

ugh ... please ignore this message. This is not needed since mbm_assign_control
is already associated with the resource.

Reinette


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-14  6:26       ` Reinette Chatre
@ 2025-02-14 18:31         ` Moger, Babu
  2025-02-14 19:18           ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-14 18:31 UTC (permalink / raw)
  To: Reinette Chatre, Dave Martin
  Cc: Babu Moger, peternewman, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave/Reinette,

On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> Hi Dave,
> 
> On 2/13/25 9:37 AM, Dave Martin wrote:
>> Hi Reinette,
>>
>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>> Hi Dave,
>>>
>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>> Hi there,
>>>>
>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>
>>>>> This series adds the support for Assignable Bandwidth Monitoring Counters
>>>>> (ABMC). It is also called QoS RMID Pinning feature
>>>>>
>>>>> Series is written such that it is easier to support other assignable
>>>>> features supported from different vendors.
>>>>>
>>>>> The feature details are documented in the  APM listed below [1].
>>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>>> Monitoring (ABMC). The documentation is available at
>>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>>>
>>>>> The patches are based on top of commit
>>>>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
>>
>> [...]
>>
>>>>> b. Check how many ABMC counters are available.
>>>>>
>>>>> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>>>>> 	32
>>>>
>>>> Is this file needed?
>>>>
>>>> With MPAM, it is more difficult to promise that the same number of
>>>> counters will be available everywhere.
>>>>
>>>> Rather than lie, or report a "safe" value here that may waste some
>>>> counters, can we just allow the number of counters to be discovered
>>>> per domain via available_mbm_cntrs?
>>>
>>> This sounds reasonable to me. I think us having trouble with the
>>> user documentation of this file so late in development should also have been
>>> a sign to rethink its value.
>>>
>>> For a user to discover the number of counters supported via available_mbm_cntrs
>>> would require the file's contents to be captured right after mount. Since we've
>>> had scenarios where new userspace needs to discover an up-and-running system's
>>> configuration this may not be possible. I thus wonder instead of removing
>>> num_mbm_cntrs, it could be modified to return the per-domain supported counters
>>> instead of a single value?
>>
>> Is it actually useful to be able to discover the number of counters
>> that exist?  A counter that exists but is not available cannot be used,
>> so perhaps it is not useful to know about it in the first place.
> 
> An alternative perspective of what "available" means is "how many counters
> could I possibly get to do this new monitoring task". A user may be willing
> to re-assign counters if the new monitoring task is important. Knowing
> how many counters are already free and available for assignment would be
> easy from available_mbm_cntrs but to get an idea of how many counters
> could be re-assigned to help out with the new task would require
> some intricate parsing of mbm_assign_control.
> 
> 
>> But if we keep this file but make it report the number of counters for
>> each domain (similarly to available_mbm_cntrs), then I think the MPAM
>> driver should be able to work with that.
>>
>>>> num_closids and num_rmids are already problematic for MPAM, so it would
>>>> be good to avoid any more parameters of this sort from being reported
>>>> to userspace unless there is a clear understanding of why they are
>>>> needed.
>>>
>>> Yes. Appreciate your help in identifying what could be problematic for MPAM.
>>
>> For clarity: this is a background issue, mostly orthogonal to this
>> series.
>>
>> If this series is merged as-is, with a global per-resource
>> num_mbm_cntrs property, then this is not really worse than the current
>> situation -- it's just a bit annoying from the MPAM perspective.
>>
>>
>> In a nutshell, the num_closids / num_rmids parameters seem to expose
>> RDT-specific hardware semantics to userspace, implying a specific
>> allocation model for control group and monitoring group identifiers.
>>
>> The guarantees that userspace is entitled to assume when resctrl
>> reports particular values do not seem to be well described and are hard
>> to map onto the nearest-equivalent MPAM implementation.  A combination
>> of control and monitoring groups that can be created on x86 may not be
>> creatable on MPAM, even when the number of supportable control and
>> monitoring partitions is the same.
> 
> I understand. This interface was created almost a decade ago. It would have been
> wonderful if the user interface could have been created with a clear vision
> of all the use cases it would end up needing to support. I am trying to be
> very careful with this new user interface as I try to consider all the things I
> learned while working on resctrl. All help to get this new interface right is
> greatly appreciated.
> 
> Since your specifically mention issues that MPAM has with num_rmids, please
> note that we have been trying (see [1], but maybe start reading thread at [2])
> to find ways to make this work with MPAM but no word from MPAM side.
> I see that you were not cc'd on the discussion so this is not a criticism of
> you personally but I would like to highlight that we do try to make things
> work well for MPAM but so far this work seems ignored, yet criticized
> for not being done. I expect the more use cases are thrown at an interface
> as it is developed the better it would get and I would gladly work with MPAM
> folks to improve things.
> 
>> Even with the ABMC series, we may still be constrained on what we can
>> report for num_rmids: we can't know in advance whether or not the user
>> is going to use mbm_cntr_assign mode -- if not, we can't promise to
>> create more monitoring groups than the number of counters in the
>> hardware.
> 
> It is the architecture that decides which modes are supported and
> which is default.
> 
>> It seems natural for the counts reported by "available_mbm_cntrs" to
>> change dynamically when the ABMC assignment mode is changed, but I
>> think userspace are likely to expect the global "num_rmids" parameters
>> to be fixed for the lifetime of the resctrl mount (and possibly fixed
>> for all time on a given hardware platform -- at least, modulo CDP).
>>
>>
>> I think it might be possible to tighten up the documentation of
>> num_closids in particular in a way that doesn't conflict with x86 and
>> may make it easier for MPAM to fit in with, but that feels like a
>> separate conversation.
>>
>> None of this should be considered a blocker for this series, either way.
>>
>>>>
>>>> Reporting number of counters per monitoring domain is a more natural
>>>> fit for MPAM, as below:
>>>>
>>>>> c. Check how many ABMC counters are available in each domain.
>>>>>
>>>>> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
>>>>> 	0=30;1=30
>>>>
>>>> For MPAM, this seems supportable.  Each monitoring domain will have
>>>> some counters, and a well-defined number of them will be available for
>>>> allocation at any one time.
>>
>> [...]
>>
>>>>> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> [...]
>>
>>>>>         Flags can be one of the following:
>>>>>
>>>>>          t  MBM total event is enabled.
>>>>>          l  MBM local event is enabled.
>>>>>          tl Both total and local MBM events are enabled.
>>>>>          _  None of the MBM events are enabled
>>>>>
>>>>> 	Examples:
>>>>
>>>> [...]
>>>>
>>>> I think that this basically works for MPAM.
>>>>
>>>> The local/total distinction doesn't map in a consistent way onto MPAM,
>>>> but this problem is not specific to ABMC.  It feels sensible for ABMC
>>>> to be built around the same concepts that resctrl already has elsewhere
>>>> in the interface.  MPAM will do its best to fit (as already).
>>>>
>>>> Regarding Peter's use case of assigning multiple counters to a
>>>> monitoring group [1], I feel that it's probably good enough to make
>>>> sure that the ABMC interface can be extended in future in a backwards
>>>> compatible way so as to support this, without trying to support it
>>>> immediately.
>>>>
>>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>>>>
>>>
>>> I do not think that resctrl's current support of the mbm_total_bytes and
>>> mbm_local_bytes should be considered as the "only" two available "slots"
>>> into which all possible events should be forced into. "mon_features" exists
>>> to guide user space to which events are supported and as I see it new events
>>> can be listed here to inform user space of their availability, with their
>>> associated event files available in the resource groups.
>>
>> That's fair.  I wasn't currently sure how (or if) the set of countable
>> events was expected to grow / evolve via this route.
>>
>> Either way, I think this confirms that there is at least one viable way
>> to enable more counters for a single control group, on top of this
>> series.
>>
>> (If there is more than one way, that seems fine?)
>>
>>>>
>>>> For example, if we added new generic "letters" -- say, "0" to "9",
>>>> combined with new counter files in resctrlfs, that feels like a
>>>> possible approach.  ABMC (as in this series) should just reject such
>>>> assignments, and the new counter files wouldn't exist.
>>>>
>>>> Availability of this feature could also be reported as a distinct mode
>>>> in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
>>>>
>>>>
>>>> A _sketch_ of this follows.  This is NOT a proposal -- the key
>>>> question is whether we are confident that we can extend the interface
>>>> in this way in the future without breaking anything.
>>>>
>>>> If "yes", then the ABMC interface (as proposed by this series) works as
>>>> a foundation to build on.
>>>>
>>>> --8<--
>>>>
>>>> [artist's impression]
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>>   	mbm_cntr_generic
>>>>   	[mbm_cntr_assign]
>>>>   	default
>>>>
>>>> # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>> # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type
>>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type
>>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type
>>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type
>>>>
>>>> ...
>>>>
>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
>>>>
>>>> etc.
>>>>
>>>
>>> It is not clear to me what additional features such an interface enables. It
>>> also looks like user space will need to track and manage counter IDs?
>>
>> My idea was that for these generic counters, new files could be exposed
>> to configure what they actually count (the ..._type files shown above;
>> or possibly via the ..._config files that already exist).
>>
>> The "IDs" were intended as abstract; the number only relates the
>> assignments in mbm_assign_control to the files created elsewhere.  This
>> wouldn't be related to IDs assigned by the hardware.
> 
> I see. Yes, this sounds related to and a generalization of the AMD
> configurable event feature.
> 
>>
>> If there are multiple resctrl users then using numeric IDs might be
>> problematic; though if we go eventually in the direction of making
>> resctrlfs multi-mountable then each mount could have its own namespace.
> 
> I am not aware of "multi-mountable" direction.
> 
>>
>> Allowing counters to be named and configured with a mkdir()-style
>> interface might be possible too; that might make it easier for users to
>> coexist within a single resctrl mount (if we think that's important
>> enough).
>>
>>> It sounds to me as though the issue starts with your statement
>>> "The local/total distinction doesn't map in a consistent way onto MPAM". To
>>> address this I expect that an MPAM system will not support nor list
>>> mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
>>> it would list the events that are appropriate to the system? Trying to match
>>> with what Peter said [1] in the message you refer to, this may be possible:
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>>> mbm_local_read_bytes
>>> mbm_local_write_bytes
>>> mbm_local_bytes
>>>
>>> (*) I am including mbm_local_bytes since it could be an event that can be software
>>> defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
>>> counted.
>>>
>>> I see the support for MPAM events distinct from the support of assignable counters.
>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>> Please help me understand if you see it differently.
>>> 	
>>> Doing so would need to come up with alphabetical letters for these events,
>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>
>>> mbm_local_read_bytes a
>>> mbm_local_write_bytes b
>>>
>>> Then mbm_assign_control can be used as:
>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>> <value>
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>
>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>
>>> Reinette
>>>
>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>>
>> That approach would also work, where an MPAM system has events that
>> are not a reasonable approximation of the generic "total" or "local".
>>
>> For now we would probably stick with "total" and "local" anyway though,
>> because the MPAM architecture doesn't natively allow the mapping onto
>> the memory system topology to be discovered, and the information in
>> ACPI / device tree is insufficient to tell us everything we'd need to
>> know.  But I guess what counts as "local" in particular will be quite
>> hardware and topology dependent even on x86, so perhaps we shouldn't
>> worry about having the behaviour match exactly (?)
>>
>> Regarding the code letters, my idea was that the event type might be
>> configured by a separate file, instead of in mbm_assign_control
>> directly, in which case running out of letters wouldn't be a problem.
> 
> This work started with individual files for counters but the issue was
> raised that this will require a large number of filesystem calls when, for
> example, a user wants to move a group of counters associated with the events
> of one set of monitoring groups to another set of monitoring groups. This
> is for the use case where there are a significant number of monitor groups
> for which there are not sufficient counters. With mbm_assign_control this
> can be done in a single write and such a monitoring transition can thus
> be accomplished more efficiently.
> 
>>
>> Alternatively, if we want to be able to expand beyond single letters,
>> could we reserve one or more characters for extension purposes?
>>
>> If braces are forbidden by the syntax today, could we add support for
>> something like the following later on, without breaking anything?
>>
>> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
> 
> Thank you for the suggestion. I think we may need something like this.
> Babu, what do you think?

I'm not quite clear on this. Do we know what 'foo' and 'bar' refer to?
Is it random text?

In his example from
https://lore.kernel.org/lkml/Z643WdXYARTADSBy@e133380.arm.com/
--------------------------------------------------------------
The numbers are not supposed to have any hardware significance.

	'//0=6'

just "means assign some unused counter for domain 0, and create files
in resctrl so I can configure and read it".

The "6" is really just a tag for labelling the resulting resctrl
file names so that the user can tell them apart.  It's not supposed
to imply any specific hardware counter or event.
------------------------------------------------------------------

It seems that 'foo' and 'bar' are tags used to create files in 
/sys/fs/resctrl/info/L3_MON/.

Given that, it looks like we're discussing entirely different things.

> 
>>
>> For now, my main concern would be whether this series prevents that
>> sort of thing being added in a backwards compatible way later.
>>
>> I don't really see anything that is a blocker.
>>
>> What do you think?
> 
> I do not fully understand the MPAM counter feature. It almost sounds like
> every counter could be configured independently with the expectation to
> configure and assign each counter independently to a domain. As I understand
> these capabilities match AMD's ABMC feature, but the planned implementation
> to support ABMC first configures events per-domain and then assigns these
> events to counters. hmmm ... but in your example a file like
> "mbm_counter0_bytes_type" is global. Could you please elaborate how in
> your example writing a single letter to that file will be interpreted?
> 
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
> [2] https://lore.kernel.org/lkml/CALPaoChad6=xqz+BQQd=dB915xhj1gusmcrS9ya+T2GyhTQc5Q@mail.gmail.com/
> 



* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-14 18:31         ` Moger, Babu
@ 2025-02-14 19:18           ` Reinette Chatre
  2025-02-14 19:51             ` Moger, Babu
  2025-02-17 10:26             ` Peter Newman
  0 siblings, 2 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-14 19:18 UTC (permalink / raw)
  To: Moger, Babu, Dave Martin
  Cc: Babu Moger, peternewman, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/14/25 10:31 AM, Moger, Babu wrote:
> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:

(quoting relevant parts with goal to focus discussion on new possible syntax)

>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>> Please help me understand if you see it differently.
>>>>     
>>>> Doing so would need to come up with alphabetical letters for these events,
>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>
>>>> mbm_local_read_bytes a
>>>> mbm_local_write_bytes b
>>>>
>>>> Then mbm_assign_control can be used as:
>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>> <value>
>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>
>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?

As mentioned above, one possible issue with existing interface is that
it is limited to 26 events (assuming only lower case letters are used). The limit
is low enough to be of concern.

....

>>>
>>> Alternatively, if we want to be able to expand beyond single letters,
>>> could we reserve one or more characters for extension purposes?
>>>
>>> If braces are forbidden by the syntax today, could we add support for
>>> something like the following later on, without breaking anything?
>>>
>>> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>

Dave proposed a change in syntax that can (a) support unlimited events,
(b) be more intuitive than the one letter flags that may be hard to match
to the events they correspond to.

>> Thank you for the suggestion. I think we may need something like this.
>> Babu, what do you think?
> 
> I'm not quite clear on this. Do we know what 'foo' and 'bar' refer to?
> Is it random text?

Not random text. It refers to the events.

I do not know if braces are what will be settled on, but a slight change to
the example to make it match your series can be:

# echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control

With syntax like above there is no concern that we will run out of
flags and the events assigned are clear without needing to parse separate flags.
For a system with a lot of events and domains this will become quite a lot
to parse though.
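
To make the parsing cost concrete, here is a minimal sketch of a parser that
accepts both the single-letter flags from the series and the brace form. The
brace syntax is only a suggestion in this thread, and the function names
parse_events() and parse_assign_line() are hypothetical; the sketch also does
not check event names against mon_features, which a real implementation
would presumably need to do.

```python
import re

# Single-letter flags from the cover letter ("_" means no event assigned).
FLAG_EVENTS = {"t": "mbm_total_bytes", "l": "mbm_local_bytes"}


def parse_events(spec):
    """Return the event names selected by one per-domain assignment."""
    if spec == "_":
        return []
    if spec.startswith("{"):
        # Hypothetical brace form: a run of "{event_name}" tokens.
        names = re.findall(r"\{([^{}]+)\}", spec)
        if "".join("{" + name + "}" for name in names) != spec:
            raise ValueError("malformed event list: " + repr(spec))
        return names
    # Flag form: each character must be a known flag letter.
    if not spec or any(c not in FLAG_EVENTS for c in spec):
        raise ValueError("unknown flags: " + repr(spec))
    return [FLAG_EVENTS[c] for c in spec]


def parse_assign_line(line):
    """Split "<ctrl>/<mon>/<dom>=<events>;..." into its parts."""
    ctrl, mon, domains = line.strip().split("/", 2)
    assignments = {}
    for item in domains.split(";"):
        dom, _, spec = item.partition("=")
        assignments[int(dom)] = parse_events(spec)
    return ctrl, mon, assignments
```

With this sketch, '//0=tl;1=l' and the brace example above produce the same
per-domain event lists, so the two forms could coexist as long as "{" and "}"
are never valid flag letters.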

> 
> In his example from
> https://lore.kernel.org/lkml/Z643WdXYARTADSBy@e133380.arm.com/
> --------------------------------------------------------------
> The numbers are not supposed to have any hardware significance.
> 
>     '//0=6'
> 
> just "means assign some unused counter for domain 0, and create files
> in resctrl so I can configure and read it".

Thanks for pointing this out. I missed that the idea was that the
configuration files are dynamically created.

> 
> The "6" is really just a tag for labelling the resulting resctrl
> file names so that the user can tell them apart.  It's not supposed
> to imply any specific hardware counter or event.

Right.

> ------------------------------------------------------------------
> 
> It seems that 'foo' and 'bar' are tags used to create files in /sys/fs/resctrl/info/L3_MON/.
> 
> Given that, it looks like we're discussing entirely different things.

I am still trying to understand how MPAM counters can be supported.

Reinette


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-14 19:18           ` Reinette Chatre
@ 2025-02-14 19:51             ` Moger, Babu
  2025-02-17 10:26             ` Peter Newman
  1 sibling, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-14 19:51 UTC (permalink / raw)
  To: Reinette Chatre, Dave Martin
  Cc: Babu Moger, peternewman, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/14/2025 1:18 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/14/25 10:31 AM, Moger, Babu wrote:
>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> 
> (quoting relevant parts with goal to focus discussion on new possible syntax)
> 
>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>> Please help me understand if you see it differently.
>>>>>      
>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>
>>>>> mbm_local_read_bytes a
>>>>> mbm_local_write_bytes b
>>>>>
>>>>> Then mbm_assign_control can be used as:
>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>> <value>
>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>
>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> 
> As mentioned above, one possible issue with existing interface is that
> it is limited to 26 events (assuming only lower case letters are used). The limit
> is low enough to be of concern.

Yes. Agree.

> 
> ....
> 
>>>>
>>>> Alternatively, if we want to be able to expand beyond single letters,
>>>> could we reserve one or more characters for extension purposes?
>>>>
>>>> If braces are forbidden by the syntax today, could we add support for
>>>> something like the following later on, without breaking anything?
>>>>
>>>> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>
> 
> Dave proposed a change in syntax that can (a) support unlimited events,
> (b) be more intuitive than the one letter flags that may be hard to match
> to the events they correspond to.

Yea. Sounds good.

> 
>>> Thank you for the suggestion. I think we may need something like this.
>>> Babu, what do you think?
>>
>> I'm not quite clear on this. Do we know what 'foo' and 'bar' refer to?
>> Is it random text?
> 
> Not random text. It refers to the events.
> 
> I do not know if braces are what will be settled on, but a slight change to
> the example to make it match your series can be:
> 
> # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
> With syntax like above there is no concern that we will run out of
> flags and the events assigned are clear without needing to parse separate flags.

Yes. We need to change our current "flag parsing". It should not be a 
problem.

> For a system with a lot of events and domains this will become quite a lot
> to parse though.
> 
>>
>> In his example from
>> https://lore.kernel.org/lkml/Z643WdXYARTADSBy@e133380.arm.com/
>> --------------------------------------------------------------
>> The numbers are not supposed to have any hardware significance.
>>
>>      '//0=6'
>>
>> just "means assign some unused counter for domain 0, and create files
>> in resctrl so I can configure and read it".
> 
> Thanks for pointing this out. I missed that the idea was that the
> configuration files are dynamically created.
> 
>>
>> The "6" is really just a tag for labelling the resulting resctrl
>> file names so that the user can tell them apart.  It's not supposed
>> to imply any specific hardware counter or event.
> 
> Right.
> 
>> ------------------------------------------------------------------
>>
>> It seems that 'foo' and 'bar' are tags used to create files in /sys/fs/resctrl/info/L3_MON/.
>>
>> Given that, it looks like we're discussing entirely different things.
> 
> I am still trying to understand how MPAM counters can be supported.
> 
> Reinette


Thanks
Babu




* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-14 19:18           ` Reinette Chatre
  2025-02-14 19:51             ` Moger, Babu
@ 2025-02-17 10:26             ` Peter Newman
  2025-02-17 16:45               ` Moger, Babu
  2025-02-18 17:49               ` Reinette Chatre
  1 sibling, 2 replies; 209+ messages in thread
From: Peter Newman @ 2025-02-17 10:26 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Babu,
>
> On 2/14/25 10:31 AM, Moger, Babu wrote:
> > On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>
> (quoting relevant parts with goal to focus discussion on new possible syntax)
>
> >>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>> Please help me understand if you see it differently.
> >>>>
> >>>> Doing so would need to come up with alphabetical letters for these events,
> >>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>
> >>>> mbm_local_read_bytes a
> >>>> mbm_local_write_bytes b
> >>>>
> >>>> Then mbm_assign_control can be used as:
> >>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>> <value>
> >>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>
> >>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>
> As mentioned above, one possible issue with existing interface is that
> it is limited to 26 events (assuming only lower case letters are used). The limit
> is low enough to be of concern.

The events which can be monitored by a single counter on ABMC and MPAM
so far are combinable, so 26 counters per group today means that each
group's MBM traffic can be broken down at most 26 ways. If a user complained
that a 26-way breakdown of a group's MBM traffic was limiting their
investigation, I would question whether they know what they're looking
for.

-Peter


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-17 10:26             ` Peter Newman
@ 2025-02-17 16:45               ` Moger, Babu
  2025-02-18 12:30                 ` Dave Martin
  2025-02-18 16:51                 ` Luck, Tony
  2025-02-18 17:49               ` Reinette Chatre
  1 sibling, 2 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-17 16:45 UTC (permalink / raw)
  To: Peter Newman, Reinette Chatre
  Cc: Moger, Babu, Dave Martin, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi All,

On 2/17/25 04:26, Peter Newman wrote:
> Hi Reinette,
> 
> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Babu,
>>
>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>
>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>
>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>> Please help me understand if you see it differently.
>>>>>>
>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>
>>>>>> mbm_local_read_bytes a
>>>>>> mbm_local_write_bytes b
>>>>>>
>>>>>> Then mbm_assign_control can be used as:
>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>> <value>
>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>
>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>
>> As mentioned above, one possible issue with existing interface is that
>> it is limited to 26 events (assuming only lower case letters are used). The limit
>> is low enough to be of concern.
> 
> The events which can be monitored by a single counter on ABMC and MPAM
> so far are combinable, so 26 counters per group today means that each
> group's MBM traffic can be broken down at most 26 ways. If a user complained
> that a 26-way breakdown of a group's MBM traffic was limiting their
> investigation, I would question whether they know what they're looking
> for.

Based on the discussion so far, it felt like it is not a group level
breakdown. It is kind of global level breakdown. I could be wrong here.

My understanding so far is that MPAM has a number of global counters. They
can be assigned to any domain in the system to monitor events.

They also have a way to configure the events (read, write or both).

Both of these features are in line with the current resctrl implementation
and can be easily adapted.

One thing I am not clear on is why the MPAM implementation plans to create
separate files (dynamically) in the /sys/fs/resctrl/info/L3_MON/ directory to
read the events. We already have files in each group to read the events.

# ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
total 0
-r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
-r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
-r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes
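
As a rough illustration of reading the events through the existing per-group
files, here is a minimal sketch. The helper name read_events() is
hypothetical, and it assumes each file holds either a decimal count or a
status string such as "Unavailable", which is passed through unchanged.

```python
from pathlib import Path


def read_events(mon_dir):
    """Read every event file in one mon_data domain directory."""
    values = {}
    for entry in sorted(Path(mon_dir).iterdir()):
        if not entry.is_file():
            continue
        text = entry.read_text().strip()
        # Counters are decimal numbers; keep any status string as text.
        values[entry.name] = int(text) if text.isdigit() else text
    return values
```

For example, read_events("/sys/fs/resctrl/mon_data/mon_L3_00") would return
one entry per file listed above, with no new files needed under info/L3_MON/.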

-- 
Thanks
Babu Moger


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-17 16:45               ` Moger, Babu
@ 2025-02-18 12:30                 ` Dave Martin
  2025-02-18 15:39                   ` Moger, Babu
  2025-02-18 16:51                 ` Luck, Tony
  1 sibling, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-18 12:30 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Peter Newman, Reinette Chatre, Moger, Babu, corbet, tglx, mingo,
	bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Mon, Feb 17, 2025 at 10:45:29AM -0600, Moger, Babu wrote:
> Hi All,
> 
> On 2/17/25 04:26, Peter Newman wrote:
> > Hi Reinette,
> > 
> > On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >>
> >> Hi Babu,
> >>
> >> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>
> >> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>
> >>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>> Please help me understand if you see it differently.
> >>>>>>
> >>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>
> >>>>>> mbm_local_read_bytes a
> >>>>>> mbm_local_write_bytes b
> >>>>>>
> >>>>>> Then mbm_assign_control can be used as:
> >>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>> <value>
> >>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>
> >>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>
> >> As mentioned above, one possible issue with existing interface is that
> >> it is limited to 26 events (assuming only lower case letters are used). The limit
> >> is low enough to be of concern.
> > 
> > The events which can be monitored by a single counter on ABMC and MPAM
> > so far are combinable, so 26 counters per group today means it limits
> > breaking down MBM traffic for each group 26 ways. If a user complained
> > that a 26-way breakdown of a group's MBM traffic was limiting their
> > investigation, I would question whether they know what they're looking
> > for.
> 
> Based on the discussion so far, it felt like it is not a group level
> breakdown. It is kind of global level breakdown. I could be wrong here.
> 
> My understanding so far, MPAM has a number of global counters. It can be
> assigned to any domain in the system and monitor events.
> 
> They also have a way to configure the events (read, write or both).
> 
> Both these feature are inline with current resctrl implementation and can
> be easily adapted.
> 
> One thing I am not clear why MPAM implementation plans to create separate
> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
> events. We already have files in each group to read the events.
> 
> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
> total 0
> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes


To be clear, we have no current plan to do this from the Arm side.

My sketch was just a thought experiment to test whether we would have
difficulties _if_ a decision were made to extend the interface in that
direction.

But it looks OK to me: the interface proposed in this series seems to
leave enough possibilities for extension open that we could do
something like what I described later on, if we decide to.


Overall, the interface proposed in this series seems a reasonable way
to support ABMC systems while keeping the consumer-side interface
(i.e., reading the mbm_total_bytes files etc.) as similar to the
classic / Intel RDT situation as possible.

MPAM can fit in with this approach, as demonstrated by James' past
branches porting the MPAM driver on top of previous versions of the
ABMC series.

As I understand it, he's almost done with porting onto this v11,
with no significant issues.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 12:30                 ` Dave Martin
@ 2025-02-18 15:39                   ` Moger, Babu
  2025-02-18 18:14                     ` Reinette Chatre
  2025-02-19 12:24                     ` Dave Martin
  0 siblings, 2 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-18 15:39 UTC (permalink / raw)
  To: Dave Martin, reinette.chatre@intel.com
  Cc: Peter Newman, Reinette Chatre, Moger, Babu, corbet, tglx, mingo,
	bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi All,

On 2/18/25 06:30, Dave Martin wrote:
> On Mon, Feb 17, 2025 at 10:45:29AM -0600, Moger, Babu wrote:
>> Hi All,
>>
>> On 2/17/25 04:26, Peter Newman wrote:
>>> Hi Reinette,
>>>
>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>>
>>>> Hi Babu,
>>>>
>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>
>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>
>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>
>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>
>>>>>>>> mbm_local_read_bytes a
>>>>>>>> mbm_local_write_bytes b
>>>>>>>>
>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>> <value>
>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>
>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>
>>>> As mentioned above, one possible issue with existing interface is that
>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>> is low enough to be of concern.
>>>
>>> The events which can be monitored by a single counter on ABMC and MPAM
>>> so far are combinable, so 26 counters per group today means it limits
>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>> investigation, I would question whether they know what they're looking
>>> for.
>>
>> Based on the discussion so far, it felt like it is not a group level
>> breakdown. It is kind of global level breakdown. I could be wrong here.
>>
>> My understanding so far, MPAM has a number of global counters. It can be
>> assigned to any domain in the system and monitor events.
>>
>> They also have a way to configure the events (read, write or both).
>>
>> Both these feature are inline with current resctrl implementation and can
>> be easily adapted.
>>
>> One thing I am not clear why MPAM implementation plans to create separate
>> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
>> events. We already have files in each group to read the events.
>>
>> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
>> total 0
>> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes
> 
> 
> To be clear, we have no current plan to do this from the Arm side.
> 
> My sketch was just a thought experiment to test whether we would have
> difficulties _if_ a decision were made to extend the interface in that
> direction.
> 
> But it looks OK to me: the interface proposed in this series seems to
> leave enough possibilities for extension open that we could do
> something like what I described later in if we decide to.
> 
> 
> Overall, the interface proposed in this series seems a reasonable way
> to support ABMC systems while keeping the consumer-side interface
> (i.e., reading the mbm_total_bytes files etc.) as similar to the
> classic / Intel RDT situation as possible.
> 
> MPAM can fit in with this approach, as demonstrated by James' past
> branches porting the MPAM driver on top of previous versions of the
> ABMC series.

Thanks Dave.
> 
> As I understand it, he's almost done with porting onto this v11,
> with no significant issues.
> 
Good to know. Thanks

I am working on v12 of ABMC with a few changes from Reinette's earlier
review comments.

Most of the changes are commit message and user documentation updates.

Introduced a couple of new functions, resctrl_reset_rmid_all() and
mbm_cntr_free_all(), to organize the code better based on this comment:
https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/


On top of that I have a few comments from Dave.

1.  Change "mbm_cntr_assign" to "mbm_counter_assign".

This will require search and replace in a lot of places. There are
variables and names like num_mbm_cntrs, mbm_cntr_assignable,
resctrl_arch_mbm_cntr_assign_enabled, resctrl_arch_mbm_cntr_assign_set,
mbm_cntr_assign_enabled, resctrl_num_mbm_cntrs_show, mbm_cntr_cfg and the
list goes on.

This is mostly cosmetic and does not add much value. I will drop this
change if Dave has no objections.

2. Change /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs to display per-domain
supported counters instead of a single value.


3. Use the actual events instead of flags based on the below comment.

https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/

 Something like this.
 # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}' \
       >/sys/fs/resctrl/info/L3_MON/mbm_assign_control

 Are we ready to go with this approach? I am still not clear on this.

 Reinette, What do you think?


-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-17 16:45               ` Moger, Babu
  2025-02-18 12:30                 ` Dave Martin
@ 2025-02-18 16:51                 ` Luck, Tony
  2025-02-18 18:27                   ` Reinette Chatre
  1 sibling, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-02-18 16:51 UTC (permalink / raw)
  To: babu.moger@amd.com, Peter Newman, Chatre, Reinette
  Cc: Moger, Babu, Dave Martin, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

> Based on the discussion so far, it felt like it is not a group level
> breakdown. It is kind of global level breakdown. I could be wrong here.
>
> My understanding so far, MPAM has a number of global counters. It can be
> assigned to any domain in the system and monitor events.
>
> They also have a way to configure the events (read, write or both).
>
> Both these feature are inline with current resctrl implementation and can
> be easily adapted.
>
> One thing I am not clear why MPAM implementation plans to create separate
> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
> events. We already have files in each group to read the events.
>
> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
> total 0
> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes

It would be nice if the filenames here reflected the reconfigured
events. From what I can tell on AMD with BMEC it is possible to change the
underlying events so that local b/w is reported in the mbm_total_bytes
file, and vice versa. Or an event like:

   6       Dirty Victims from the QOS domain to all types of memory

is counted.

Though maybe we'd need to create a lot of filenames for the 2**6
combinations of bits.

-Tony

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-17 10:26             ` Peter Newman
  2025-02-17 16:45               ` Moger, Babu
@ 2025-02-18 17:49               ` Reinette Chatre
  2025-02-19 11:28                 ` Peter Newman
  1 sibling, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-18 17:49 UTC (permalink / raw)
  To: Peter Newman
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 2/17/25 2:26 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Babu,
>>
>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>
>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>
>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>> Please help me understand if you see it differently.
>>>>>>
>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>
>>>>>> mbm_local_read_bytes a
>>>>>> mbm_local_write_bytes b
>>>>>>
>>>>>> Then mbm_assign_control can be used as:
>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>> <value>
>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>
>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>
>> As mentioned above, one possible issue with existing interface is that
>> it is limited to 26 events (assuming only lower case letters are used). The limit
>> is low enough to be of concern.
> 
> The events which can be monitored by a single counter on ABMC and MPAM
> so far are combinable, so 26 counters per group today means it limits
> breaking down MBM traffic for each group 26 ways. If a user complained
> that a 26-way breakdown of a group's MBM traffic was limiting their
> investigation, I would question whether they know what they're looking
> for.

The key here is "so far" as well as the focus on MBM only. 

It is impossible for me to predict what we will see in a couple of years
from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on the resctrl interface
to support their users. Just looking at the Intel RDT spec the event register
has space for 32 events for each "CPU agent" resource. That does not take into
account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
that he is working on patches [1] that will add new events and shared the idea
that we may be trending to support "perf" like events associated with RMID. I
expect AMD PQoS and Arm MPAM to provide related enhancements to support their
customers.
This all makes me think that resctrl should be ready to support more events than 26.

My goal is for resctrl to have a user interface that can as much as possible
be ready for whatever may be required from it years down the line. Of course,
I may be wrong and resctrl would never need to support more than 26 events per
resource (*). The risk is that resctrl *may* need to support more than 26 events
and how could resctrl support that?

What is the risk of supporting more than 26 events? As I highlighted earlier
the interface I used as demonstration may become unwieldy to parse on a system
with many domains that supports many events. This is a concern for me. Any suggestions
will be appreciated, especially from you since I know that you are very familiar with
issues related to large scale use of resctrl interfaces.

Reinette

[1] https://lore.kernel.org/lkml/SJ1PR11MB6083759CCE59FF2FE931471EFCFF2@SJ1PR11MB6083.namprd11.prod.outlook.com/

(*) There is also the scenario where, combined across resources, more than
26 events are supported, which would require the same one-letter flag to be
used for different events of different resources. This may be confusing.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 15:39                   ` Moger, Babu
@ 2025-02-18 18:14                     ` Reinette Chatre
  2025-02-18 19:32                       ` Moger, Babu
  2025-02-19 12:24                     ` Dave Martin
  1 sibling, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-18 18:14 UTC (permalink / raw)
  To: babu.moger, Dave Martin
  Cc: Peter Newman, Moger, Babu, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/18/25 7:39 AM, Moger, Babu wrote:
 
> 3. Use the actual events instead of flags based on the below comment.
> 
> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
> 
>  Something like this.
>  # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
>  Are we ready to go with this approach? I am still not clear on this.
> 
>  Reinette, What do you think?

I was actually expecting some push back or at least discussion on this interface
because the braces seem difficult to parse when compared to, for example, using
commas to separate the events of a domain. Peter [1] has some reservations about
going this direction and since he would end up using this interface significantly
I would prefer to resolve that first.

Reinette


[1] https://lore.kernel.org/lkml/CALPaoCh7WpohzpXhSAbumjSZBv1_+1bXON7_V1pwG4bdEBr52Q@mail.gmail.com/

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 16:51                 ` Luck, Tony
@ 2025-02-18 18:27                   ` Reinette Chatre
  2025-02-18 19:08                     ` Luck, Tony
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-18 18:27 UTC (permalink / raw)
  To: Luck, Tony, babu.moger@amd.com, Peter Newman
  Cc: Moger, Babu, Dave Martin, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

Hi Tony,

On 2/18/25 8:51 AM, Luck, Tony wrote:
>> Based on the discussion so far, it felt like it is not a group level
>> breakdown. It is kind of global level breakdown. I could be wrong here.
>>
>> My understanding so far, MPAM has a number of global counters. It can be
>> assigned to any domain in the system and monitor events.
>>
>> They also have a way to configure the events (read, write or both).
>>
>> Both these feature are inline with current resctrl implementation and can
>> be easily adapted.
>>
>> One thing I am not clear why MPAM implementation plans to create separate
>> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
>> events. We already have files in each group to read the events.
>>
>> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
>> total 0
>> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes
> 
> It would be nice if the filenames here reflected the reconfigured
> events. From what I can tell on AMD with BMEC it is possible to change the
> underlying events so that local b/w is reported in the mbm_total_bytes
> file, and vice versa. Or an event like:
> 
>    6       Dirty Victims from the QOS domain to all types of memory
> 
> is counted.
> 
> Though maybe we'd need to create a lot of filenames for the 2**6
> combinations of bits.

Instead of accommodating all possible names, resctrl could support
"generic" names as hinted in Dave Martin's proposal [1].

The complication with BMEC is that these are the underlying
mbm_local_bytes and mbm_total_bytes events on which configuration
was built. Specifically, by default and at hardware reset mbm_local_bytes
counts exactly that. The event is fixed if BMEC is not supported and
configurable if it is.

Reinette

[1] https://lore.kernel.org/lkml/Z6zeXby8ajh0ax6i@e133380.arm.com/

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 18:27                   ` Reinette Chatre
@ 2025-02-18 19:08                     ` Luck, Tony
  2025-02-18 21:32                       ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-02-18 19:08 UTC (permalink / raw)
  To: Chatre, Reinette, babu.moger@amd.com, Peter Newman
  Cc: Moger, Babu, Dave Martin, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

> >> Based on the discussion so far, it felt like it is not a group level
> >> breakdown. It is kind of global level breakdown. I could be wrong here.
> >>
> >> My understanding so far, MPAM has a number of global counters. It can be
> >> assigned to any domain in the system and monitor events.
> >>
> >> They also have a way to configure the events (read, write or both).
> >>
> >> Both these feature are inline with current resctrl implementation and can
> >> be easily adapted.
> >>
> >> One thing I am not clear why MPAM implementation plans to create separate
> >> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
> >> events. We already have files in each group to read the events.
> >>
> >> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
> >> total 0
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes
> >
> > It would be nice if the filenames here reflected the reconfigured
> > events. From what I can tell on AMD with BMEC it is possible to change the
> > underlying events so that local b/w is reported in the mbm_total_bytes
> > file, and vice versa. Or an event like:
> >
> >    6       Dirty Victims from the QOS domain to all types of memory
> >
> > is counted.
> >
> > Though maybe we'd need to create a lot of filenames for the 2**6
> > combinations of bits.
>
> Instead of accommodating all possible names resctrl could support
> "generic" names as hinted in Dave Martin's proposal.
>
> The complication with BMEC is that these are the underlying
> mbm_local_bytes and mbm_total_bytes events on which configuration
> was built. Specifically, by default and at hardware reset mbm_local_bytes
> counts exactly that. The event is fixed if BMEC is not supported and
> configurable if it is.

Would it be possible to rename the files if the config changed?

I.e. initially they are named mbm_local_bytes and mbm_total_bytes.

But when the user changes the config for mbm_total_bytes using the
BMEC config file, that file is renamed everywhere to "user_config1"

-Tony

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 18:14                     ` Reinette Chatre
@ 2025-02-18 19:32                       ` Moger, Babu
  2025-02-18 21:29                         ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-18 19:32 UTC (permalink / raw)
  To: Reinette Chatre, Dave Martin
  Cc: Peter Newman, Moger, Babu, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/18/25 12:14, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/18/25 7:39 AM, Moger, Babu wrote:
>  
>> 3. Use the actual events instead of flags based on the below comment.
>>
>> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
>>
>>  Something like this.
>>  # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>>  Are we ready to go with this approach? I am still not clear on this.
>>
>>  Reinette, What do you think?
> 
> I was actually expecting some push back or at least discussion on this interface
> because the braces seem difficult to parse when compared to, for example, using

I have yet to work on it. I will start after confirmation.

Here is the output from a system with 12 domains. I created one "test" group.

Output is definitely harder to parse for human eyes.

# cat info/L3_MON/mbm_assign_control
test//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}

It is also harder to parse in code. We should consider this format only if
it adds value.

Otherwise I prefer our current flag format.

# cat info/L3_MON/mbm_assign_control
test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
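
For illustration only (not part of the series), here is a minimal Python
sketch of how a user-space tool might consume one line of the flag format
shown above. The function name parse_assign_line() is hypothetical, and the
sketch assumes each line has the shape
"<ctrl group>/<mon group>/<domain>=<flags>;..." as in the output above.

```python
def parse_assign_line(line):
    """Parse one line of the flag format, e.g. 'test//0=tl;1=tl'.

    Returns (ctrl_group, mon_group, {domain_id: set of flag letters}).
    Hypothetical helper for illustration; not a kernel or resctrl API.
    """
    # The first two '/' separate the control group and monitor group names;
    # everything after them is the per-domain assignment list.
    ctrl_group, mon_group, assignments = line.strip().split("/", 2)
    domains = {}
    for entry in assignments.split(";"):
        dom, flags = entry.split("=", 1)
        # Each character is one single-letter event flag.
        domains[int(dom)] = set(flags)
    return ctrl_group, mon_group, domains


ctrl, mon, doms = parse_assign_line("test//0=tl;1=tl;2=l")
print(ctrl, mon, sorted(doms[2]))
```

With single-letter flags the per-domain field needs no tokenizing beyond
iterating over its characters, which is what keeps this form short.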


> commas to separate the events of a domain. Peter [1] has some reservations about

Yes. I would like to hear from Peter.

> going this direction and since he would end up using this interface significantly
> I would prefer to resolve that first.
> 
> Reinette
> 
> 
> [1] https://lore.kernel.org/lkml/CALPaoCh7WpohzpXhSAbumjSZBv1_+1bXON7_V1pwG4bdEBr52Q@mail.gmail.com/
> 
> 

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 19:32                       ` Moger, Babu
@ 2025-02-18 21:29                         ` Reinette Chatre
  2025-02-19 12:26                           ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-18 21:29 UTC (permalink / raw)
  To: babu.moger, Dave Martin
  Cc: Peter Newman, Moger, Babu, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/18/25 11:32 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 2/18/25 12:14, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 2/18/25 7:39 AM, Moger, Babu wrote:
>>  
>>> 3. Use the actual events instead of flags based on the below comment.
>>>
>>> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
>>>
>>>  Something like this.
>>>  # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>>  Are we ready to go with this approach? I am still not clear on this.
>>>
>>>  Reinette, What do you think?
>>
>> I was actually expecting some push back or at least discussion on this interface
>> because the braces seem difficult to parse when compared to, for example, using
> 
> I am yet to work on it. Will work on it after confirmation.
> 
> Here is the output from a system with 12 domains. I created one "test" group.
> 
> Output is definitely harder to parse for human eyes.
> 
> #cat info/L3_MON/mbm_assign_control
> test//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
> //0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
> 
> It is harder to parse in code also. We should consider only if there is a
> value-add with this format.

Please see my comments in [2] for some motivations.

> 
> Otherwise I prefer our current flag format.
> 
> # cat info/L3_MON/mbm_assign_control
> test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl

We could possibly consider some middle ground where flags are separated by
commas and, when the number of flags in use reaches 26, the interface can use
"two letter flags" or "longer names" or "the actual event name" or ....

> 
> 
>> commas to separate the events of a domain. Peter [1] has some reservations about
> 
> Yes. I would like to hear from Peter.
> 

Reinette


[2] https://lore.kernel.org/lkml/ccd9c5d7-0266-4054-879e-e084b6972ad5@intel.com/


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 19:08                     ` Luck, Tony
@ 2025-02-18 21:32                       ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-18 21:32 UTC (permalink / raw)
  To: Luck, Tony, babu.moger@amd.com, Peter Newman
  Cc: Moger, Babu, Dave Martin, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

Hi Tony,

On 2/18/25 11:08 AM, Luck, Tony wrote:
>>>> Based on the discussion so far, it felt like it is not a group level
>>>> breakdown. It is kind of global level breakdown. I could be wrong here.
>>>>
>>>> My understanding so far, MPAM has a number of global counters. It can be
>>>> assigned to any domain in the system and monitor events.
>>>>
>>>> They also have a way to configure the events (read, write or both).
>>>>
>>>> Both these feature are inline with current resctrl implementation and can
>>>> be easily adapted.
>>>>
>>>> One thing I am not clear why MPAM implementation plans to create separate
>>>> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
>>>> events. We already have files in each group to read the events.
>>>>
>>>> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
>>>> total 0
>>>> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
>>>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
>>>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes
>>>
>>> It would be nice if the filenames here reflected the reconfigured
>>> events. From what I can tell on AMD with BMEC it is possible to change the
>>> underlying events so that local b/w is reported in the mbm_total_bytes
>>> file, and vice versa. Or an event like:
>>>
>>>    6       Dirty Victims from the QOS domain to all types of memory
>>>
>>> is counted.
>>>
>>> Though maybe we'd need to create a lot of filenames for the 2**6
>>> combinations of bits.
>>
>> Instead of accommodating all possible names resctrl could support
>> "generic" names as hinted in Dave Martin's proposal.
>>
>> The complication with BMEC is that these are the underlying
>> mbm_local_bytes and mbm_total_bytes events on which configuration
>> was built. Specifically, by default and at hardware reset mbm_local_bytes
>> counts exactly that. The event is fixed if BMEC is not supported and
>> configurable if it is.
> 
> Would it be possible to rename the files if the config changed?
> 
> I.e. initially they are named mbm_local_bytes and mbm_total_bytes.
> 
> But when the user changes the config for mbm_total_bytes using the
> BMEC config file, that file is renamed everywhere to "user_config1"
> 

The motivation for doing this to an existing interface is not clear. On
its own I think it will add confusion. It sounds to me as though there is
some future (similar to BMEC) feature that needs to be supported for which
such a change would make things compatible. For this I think it would be easier to
discuss that future feature and ensure everybody is clear on what interface
would work for that new feature before making changes to existing feature to
be compatible with it.

Reinette




* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 17:49               ` Reinette Chatre
@ 2025-02-19 11:28                 ` Peter Newman
  2025-02-19 12:26                   ` Dave Martin
  2025-02-19 17:56                   ` Reinette Chatre
  0 siblings, 2 replies; 209+ messages in thread
From: Peter Newman @ 2025-02-19 11:28 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 2/17/25 2:26 AM, Peter Newman wrote:
> > Hi Reinette,
> >
> > On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >>
> >> Hi Babu,
> >>
> >> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>
> >> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>
> >>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>> Please help me understand if you see it differently.
> >>>>>>
> >>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>
> >>>>>> mbm_local_read_bytes a
> >>>>>> mbm_local_write_bytes b
> >>>>>>
> >>>>>> Then mbm_assign_control can be used as:
> >>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>> <value>
> >>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>
> >>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>
> >> As mentioned above, one possible issue with existing interface is that
> >> it is limited to 26 events (assuming only lower case letters are used). The limit
> >> is low enough to be of concern.
> >
> > The events which can be monitored by a single counter on ABMC and MPAM
> > so far are combinable, so 26 counters per group today means it limits
> > breaking down MBM traffic for each group 26 ways. If a user complained
> > that a 26-way breakdown of a group's MBM traffic was limiting their
> > investigation, I would question whether they know what they're looking
> > for.
>
> The key here is "so far" as well as the focus on MBM only.
>
> It is impossible for me to predict what we will see in a couple of years
> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> to support their users. Just looking at the Intel RDT spec the event register
> has space for 32 events for each "CPU agent" resource. That does not take into
> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> that he is working on patches [1] that will add new events and shared the idea
> that we may be trending to support "perf" like events associated with RMID. I
> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> customers.
> This all makes me think that resctrl should be ready to support more events than 26.

I was thinking of the letters as representing a reusable, user-defined
event-set for applying to a single counter rather than as individual
events, since MPAM and ABMC allow us to choose the set of events each
one counts. Wherever we define the letters, we could use more symbolic
event names.

In the letters as events model, choosing the events assigned to a
group wouldn't be enough information, since we would want to control
which events should share a counter and which should be counted by
separate counters. I think the amount of information that would need
to be encoded into mbm_assign_control to represent the level of
configurability supported by hardware would quickly get out of hand.

Maybe as an example, one counter for all reads, one counter for all
writes in ABMC would look like...

(L3_QOS_ABMC_CFG.BwType field names below)

(per domain)
group 0:
 counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
 counter 1: VictimBW,LclNTWr,RmtNTWr
group 1:
 counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
 counter 3: VictimBW,LclNTWr,RmtNTWr
...

I assume packing all of this info for a group's desired counter
configuration into a single line (with 32 domains per line on many
dual-socket AMD configurations I see) would be difficult to look at,
even if we could settle on a single letter to represent each
universally.
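
To put rough numbers on that, here is a small user-space sketch (not part
of the series) that renders one group's line for 32 domains in three
forms: the current flag form, the event-name form quoted earlier in the
thread, and a per-counter event-set form. render_line() is a made-up
helper, and the "one brace group per counter" encoding in the last case
is purely hypothetical; only the BwType field names come from the example
above.

```python
def render_line(group, n_domains, per_domain):
    """Build one mbm_assign_control-style line: '<group>//<dom>=<cfg>;...'."""
    items = [f"{dom}={per_domain}" for dom in range(n_domains)]
    return f"{group}//" + ";".join(items)


# Event sets from the example above (L3_QOS_ABMC_CFG.BwType field names).
READS = "LclFill,RmtFill,LclSlowFill,RmtSlowFill"
WRITES = "VictimBW,LclNTWr,RmtNTWr"

# Current single-letter flag form.
flag_line = render_line("test", 32, "tl")
# Event-name form discussed elsewhere in the thread.
name_line = render_line("test", 32, "{mbm_total_bytes}{mbm_local_bytes}")
# Hypothetical encoding: one brace group per counter, listing its events.
split_line = render_line("test", 32, "{" + READS + "}{" + WRITES + "}")

for label, line in (
    ("flags", flag_line),
    ("event names", name_line),
    ("per-counter sets", split_line),
):
    print(f"{label}: {len(line)} characters")
```

The point is only the relative growth: each step multiplies the line
length for every group, before any additional counters are added.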

>
> My goal is for resctrl to have a user interface that can as much as possible
> be ready for whatever may be required from it years down the line. Of course,
> I may be wrong and resctrl would never need to support more than 26 events per
> resource (*). The risk is that resctrl *may* need to support more than 26 events
> and how could resctrl support that?
>
> What is the risk of supporting more than 26 events? As I highlighted earlier
> the interface I used as demonstration may become unwieldy to parse on a system
> with many domains that supports many events. This is a concern for me. Any suggestions
> will be appreciated, especially from you since I know that you are very familiar with
> issues related to large scale use of resctrl interfaces.

It's mainly just the unwieldiness of all the information in one file.
It's already at the limit of what I can visually look through.

I believe that shared assignments will take care of all the
high-frequency and performance-intensive batch configuration updates I
was originally concerned about, so I no longer see much benefit in
finding ways to textually encode all this information in a single file
when it would be more manageable to distribute it around the
filesystem hierarchy.

-Peter


>
> Reinette
>
> [1] https://lore.kernel.org/lkml/SJ1PR11MB6083759CCE59FF2FE931471EFCFF2@SJ1PR11MB6083.namprd11.prod.outlook.com/
>
> (*) There is also the scenario where combined between resources there may be
> more than 26 events supported that will require the same one letter flag to be
> used for different events of different resources. This may potentially be
> confusing.


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 15:39                   ` Moger, Babu
  2025-02-18 18:14                     ` Reinette Chatre
@ 2025-02-19 12:24                     ` Dave Martin
  1 sibling, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-19 12:24 UTC (permalink / raw)
  To: Moger, Babu
  Cc: reinette.chatre@intel.com, Peter Newman, Moger, Babu, corbet,
	tglx, mingo, bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm,
	thuth, rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi there,

On Tue, Feb 18, 2025 at 09:39:43AM -0600, Moger, Babu wrote:
> Hi All,
> 
> On 2/18/25 06:30, Dave Martin wrote:
> > On Mon, Feb 17, 2025 at 10:45:29AM -0600, Moger, Babu wrote:
> >> Hi All,

[...]

> >> One thing I am not clear why MPAM implementation plans to create separate
> >> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
> >> events. We already have files in each group to read the events.
> >>
> >> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
> >> total 0
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes
> > 
> > 
> > To be clear, we have no current plan to do this from the Arm side.
> > 
> > My sketch was just a thought experiment to test whether we would have
> > difficulties _if_ a decision were made to extend the interface in that
> > direction.
> > 
> > But it looks OK to me: the interface proposed in this series seems to
> > leave enough possibilities for extension open that we could do
> > something like what I described later on if we decide to.
> > 
> > 
> > Overall, the interface proposed in this series seems a reasonable way
> > to support ABMC systems while keeping the consumer-side interface
> > (i.e., reading the mbm_total_bytes files etc.) as similar to the
> > classic / Intel RDT situation as possible.
> > 
> > MPAM can fit in with this approach, as demonstrated by James' past
> > branches porting the MPAM driver on top of previous versions of the
> > ABMC series.
> 
> Thanks Dave.
> > 
> > As I understand it, he's almost done with porting onto this v11,
> > with no significant issues.
> > 
> Good to know. Thanks
> 
> I am working on v12 of ABMC with few changes from Reinette's earlier
> review comments.
> 
> Most of the changes are related to commit message update and user
> documentation update.
> 
> Introduced couple of new functions resctrl_reset_rmid_all() and
> mbm_cntr_free_all() to organize the code better based on the comment.
> https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/
> 
> 
> On top of that I have a few comments from Dave.
> 
> 1.  Change "mbm_cntr_assign" to "mbm_counter_assign".
> 
> This will require me to search and replace lot of places. There are
> variables, names like num_mbm_cntrs, mbm_cntr_assignable,
> resctrl_arch_mbm_cntr_assign_enabled, resctrl_arch_mbm_cntr_assign_set,
> mbm_cntr_assign_enabled, resctrl_num_mbm_cntrs_show, mbm_cntr_cfg and list
> goes on.
> 
>  This is mostly cosmetic and not much value add. Will drop this change if
> Dave has no objections.

There is no need to change the names of kernel symbols -- this was just
about the interface presented to userspace.

So, if you rename only the affected file names in resctrlfs (I think
there weren't any others) then I'm happy with that.

But if you prefer to avoid this inconsistency, the file name can stay
as-is.  It's not a huge deal.


> 2. Change /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs to display per-domain
> supported counters instead of a single value.

Ack; thanks (we could always add it back in later without an ABI break,
if people feel strongly about it and it looks feasible).


> 3. Use the actual events instead of flags based on the below comment.
> 
> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
> 
>  Something like this.
>  # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
> >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
>  Are we ready to go with this approach? I am still not clear on this.

[...]

> -- 
> Thanks
> Babu Moger

On this point, I'll defer to discussions elsewhere on the thread.


I have a few other minor comments pending to post, but it looks like
there may be a more serious issue with how the mbm_assign_control file
is handled in the kernel -- I'll try to post comments on that today.

Cheers
---Dave


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-18 21:29                         ` Reinette Chatre
@ 2025-02-19 12:26                           ` Dave Martin
  0 siblings, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-19 12:26 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: babu.moger, Peter Newman, Moger, Babu, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Tue, Feb 18, 2025 at 01:29:09PM -0800, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/18/25 11:32 AM, Moger, Babu wrote:
> > Hi Reinette,
> > 
> > On 2/18/25 12:14, Reinette Chatre wrote:
> >> Hi Babu,
> >>
> >> On 2/18/25 7:39 AM, Moger, Babu wrote:
> >>  
> >>> 3. Use the actual events instead of flags based on the below comment.
> >>>
> >>> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
> >>>
> >>>  Something like this.
> >>>  # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
> >>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>
> >>>  Are we ready to go with this approach? I am still not clear on this.
> >>>
> >>>  Reinette, What do you think?
> >>
> >> I was actually expecting some push back or at least discussion on this interface
> >> because the braces seem difficult to parse when compared to, for example, using
> > 
> > I am yet to work on it. Will work on it after confirmation.
> > 
> > Here is the output from a system with 12 domains. I created one "test" group.
> > 
> > Output is definitely harder to parse for human eyes.
> > 
> > #cat info/L3_MON/mbm_assign_control
> > test//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
> > //0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
> > 
> > It is harder to parse in code also. We should consider only if there is a
> > value-add with this format.
> 
> Please see my comments in [2] for some motivations.
> 
> > 
> > Otherwise I prefer our current flag format.
> > 
> > # cat info/L3_MON/mbm_assign_control
> > test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
> > //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
> 
> We could possibly consider some middle ground where flags are separated by
> commas and when the amount of used flags reach 26 the interface can use
> "two letter flags" or "longer names" or "the actual event name" or ....
> 
> > 
> > 
> >> commas to separate the events of a domain. Peter [1] has some reservations about
> > 
> > Yes. I would like to hear from Peter.
> > 
> 
> Reinette

Ack; see also my reply to Peter on the other subthread.

I think the single-letter names provide a much less cumbersome
interface.

From the Arm side, I'd be happy to see just "t" and "l" for now, with
their current fixed mappings to event names, provided that we are
confident that we can add flexibility later without breaking the ABI.
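
For illustration only, here is a rough user-space sketch of how a single
parser could accept today's single-letter flags as well as a
separator-based form such as the comma idea quoted above. Nothing here is
from the series: parse_assign_line(), its sep parameter and the "lr"/"lw"
flags are made up for the example, and the line layout is assumed to be
<ctrl group>/<mon group>/<domain>=<flags>;... as in the output quoted
above.

```python
def parse_assign_line(line, sep=None):
    """Parse '<ctrl_group>/<mon_group>/<dom>=<flags>;<dom>=<flags>...'.

    With sep=None every character of <flags> is one flag (current format).
    With a separator such as ',' flags may be longer than one character.
    Returns (ctrl_group, mon_group, {domain_id: [flag, ...]}).
    """
    ctrl_group, mon_group, assignments = line.strip().split("/", 2)
    domains = {}
    for item in assignments.split(";"):
        dom, flags = item.split("=", 1)
        domains[int(dom)] = list(flags) if sep is None else flags.split(sep)
    return ctrl_group, mon_group, domains


# Current format: one character per flag.
print(parse_assign_line("test//0=tl;1=tl"))
# Hypothetical comma-separated form with made-up two-letter flags.
print(parse_assign_line("//0=t,l;1=lr,lw", sep=","))
```

The sketch also shows where the ambiguity sits: without a separator,
"lr" can only mean two flags, so any move to longer names has to be
signalled by the syntax itself rather than guessed by the reader.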

In case this has got lost in the noise, I still think that the v11
proposal for the ABMC interface looks fine as a first step -- I just
wanted to kick the tires re extensibility.

Cheers
---Dave


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-19 11:28                 ` Peter Newman
@ 2025-02-19 12:26                   ` Dave Martin
  2025-02-19 17:56                   ` Reinette Chatre
  1 sibling, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-19 12:26 UTC (permalink / raw)
  To: Peter Newman
  Cc: Reinette Chatre, Moger, Babu, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Wed, Feb 19, 2025 at 12:28:16PM +0100, Peter Newman wrote:
> Hi Reinette,
> 
> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
> >
> > Hi Peter,
> >
> > On 2/17/25 2:26 AM, Peter Newman wrote:
> > > Hi Reinette,
> > >
> > > On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> > > <reinette.chatre@intel.com> wrote:

[...]

> > >> As mentioned above, one possible issue with existing interface is that
> > >> it is limited to 26 events (assuming only lower case letters are used). The limit
> > >> is low enough to be of concern.
> > >
> > > The events which can be monitored by a single counter on ABMC and MPAM
> > > so far are combinable, so 26 counters per group today means it limits
> > > breaking down MBM traffic for each group 26 ways. If a user complained
> > > that a 26-way breakdown of a group's MBM traffic was limiting their
> > > investigation, I would question whether they know what they're looking
> > > for.
> >
> > The key here is "so far" as well as the focus on MBM only.
> >
> > It is impossible for me to predict what we will see in a couple of years
> > from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> > to support their users. Just looking at the Intel RDT spec the event register
> > has space for 32 events for each "CPU agent" resource. That does not take into
> > account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> > that he is working on patches [1] that will add new events and shared the idea
> > that we may be trending to support "perf" like events associated with RMID. I
> > expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> > customers.
> > This all makes me think that resctrl should be ready to support more events than 26.
> 
> I was thinking of the letters as representing a reusable, user-defined
> event-set for applying to a single counter rather than as individual
> events, since MPAM and ABMC allow us to choose the set of events each
> one counts. Wherever we define the letters, we could use more symbolic
> event names.
> 
> In the letters as events model, choosing the events assigned to a
> group wouldn't be enough information, since we would want to control
> which events should share a counter and which should be counted by
> separate counters. I think the amount of information that would need
> to be encoded into mbm_assign_control to represent the level of
> configurability supported by hardware would quickly get out of hand.
> 
> Maybe as an example, one counter for all reads, one counter for all
> writes in ABMC would look like...
> 
> (L3_QOS_ABMC_CFG.BwType field names below)
> 
> (per domain)
> group 0:
>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>  counter 1: VictimBW,LclNTWr,RmtNTWr
> group 1:
>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>  counter 3: VictimBW,LclNTWr,RmtNTWr
> ...
> 
> I assume packing all of this info for a group's desired counter
> configuration into a single line (with 32 domains per line on many
> dual-socket AMD configurations I see) would be difficult to look at,
> even if we could settle on a single letter to represent each
> universally.
> 
> >
> > My goal is for resctrl to have a user interface that can as much as possible
> > be ready for whatever may be required from it years down the line. Of course,
> > I may be wrong and resctrl would never need to support more than 26 events per
> > resource (*). The risk is that resctrl *may* need to support more than 26 events
> > and how could resctrl support that?
> >
> > What is the risk of supporting more than 26 events? As I highlighted earlier
> > the interface I used as demonstration may become unwieldy to parse on a system
> > with many domains that supports many events. This is a concern for me. Any suggestions
> > will be appreciated, especially from you since I know that you are very familiar with
> > issues related to large scale use of resctrl interfaces.
> 
> It's mainly just the unwieldiness of all the information in one file.
> It's already at the limit of what I can visually look through.
> 
> I believe that shared assignments will take care of all the
> high-frequency and performance-intensive batch configuration updates I
> was originally concerned about, so I no longer see much benefit in
> finding ways to textually encode all this information in a single file
> when it would be more manageable to distribute it around the
> filesystem hierarchy.
> 
> -Peter

This was sort of what I had in my mind.

I think it may make some sense to support "t" and "l" out of the box,
as intuitively backwards-compatible event names, but provide a way to
create new "letters" as needed, with a well-defined way (customisable or
not) of mapping these to event names visible in resctrlfs.  I just used
the digits for this purpose, but we could have an explicit interface
for it.
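
As a strawman of what such a mapping could look like when modelled in
user space (this is not a proposed kernel interface, and FlagRegistry,
define() and expand() are hypothetical names), "t" and "l" are
predefined and further single-character flags are bound to event names
on demand. The extra event name is the one used as an example earlier in
the thread.

```python
class FlagRegistry:
    """Toy model: single-character flags mapped to resctrl event names."""

    def __init__(self):
        # Built-in, backwards-compatible flags.
        self._names = {"t": "mbm_total_bytes", "l": "mbm_local_bytes"}

    def define(self, flag, event_name):
        """Bind a new one-character flag (letter or digit) to an event name."""
        if len(flag) != 1 or not flag.isalnum():
            raise ValueError(f"flag must be one letter or digit: {flag!r}")
        if flag in self._names:
            raise ValueError(f"flag already defined: {flag!r}")
        self._names[flag] = event_name

    def expand(self, flags):
        """Translate a flag string such as 'tla' into event names."""
        return [self._names[flag] for flag in flags]


registry = FlagRegistry()
registry.define("a", "mbm_local_read_bytes")
print(registry.expand("tla"))
```

Keeping "t" and "l" fixed and rejecting redefinition is the design choice
that would keep existing users of the flag format working.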

In order for this series to stabilise though, does it make sense to put
this out of scope just for now?

The current series provides a way to provide the mbm_total_bytes and
mbm_local_bytes counters on ABMC and MPAM systems, without having to
limit the total number of monitoring groups (MPAM's current approach)
or overcommit the counters so that they may not be continuously
reliable when there are too many groups (AMD?).

That seems immediately useful.

The ability to assign arbitrarily many counters to a group is a new
feature however.  Does it make sense to consider this on its own merits
when the baseline ABMC interface has been settled?

My main concern right now (from the Arm side) is to be confident that
the initial ABMC interface definition doesn't paint us into a corner.

Cheers
---Dave


* Re: [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init()
  2025-01-22 20:20 ` [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init() Babu Moger
  2025-02-05 22:22   ` Reinette Chatre
@ 2025-02-19 13:28   ` Dave Martin
  2025-02-19 16:53     ` Moger, Babu
  1 sibling, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-19 13:28 UTC (permalink / raw)
  To: Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Wed, Jan 22, 2025 at 02:20:09PM -0600, Babu Moger wrote:
> resctrl_late_init() has the __init attribute, but some of the functions
> called from it do not have the __init attribute.
> 
> Add the __init attribute to all the functions in the call sequences to
> maintain consistency throughout.

(BTW, did you just find these cases by inspection, or were you getting
build warnings?

Even with CONFIG_DEBUG_SECTION_MISMATCH=y, I struggle to get build
warnings about section mismatches on inlined functions.  Even building
with -fno-inline doesn't flag them all up (though I don't think this
suppresses all inlining).

If you have a way of tracking these cases down automatically, I'd be
interested to know so that I can apply it elsewhere.)

Cheers
---Dave


> 
> Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
> Fixes: def10853930a ("x86/intel_rdt: Add two new resources for L2 Code and Data Prioritization (CDP)")
> Fixes: bd334c86b5d7 ("x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()")
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

[...]

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 3d1735ed8d1f..f0a331287979 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -145,7 +145,7 @@ u32 resctrl_arch_system_num_rmid_idx(void)
>   * is always 20 on hsw server parts. The minimum cache bitmask length
>   * allowed for HSW server is always 2 bits. Hardcode all of them.
>   */
> -static inline void cache_alloc_hsw_probe(void)
> +static inline __init void cache_alloc_hsw_probe(void)
>  {
>  	struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_L3];
>  	struct rdt_resource *r  = &hw_res->r_resctrl;
> @@ -277,7 +277,7 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)

[...]


* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-06  1:41       ` Reinette Chatre
  2025-02-06 15:56         ` Luck, Tony
@ 2025-02-19 13:28         ` Dave Martin
  2025-02-21 18:08           ` James Morse
  1 sibling, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-19 13:28 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Luck, Tony, Babu Moger, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com, x86@kernel.org, hpa@zytor.com,
	paulmck@kernel.org, akpm@linux-foundation.org, thuth@redhat.com,
	rostedt@goodmis.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	jpoimboe@kernel.org, perry.yuan@amd.com, sandipan.das@amd.com,
	Huang, Kai, Li, Xiaoyao, seanjc@google.com, Li, Xin3,
	andrew.cooper3@citrix.com, ebiggers@google.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Wieczor-Retman, Maciej,
	Eranian, Stephane

On Wed, Feb 05, 2025 at 05:41:53PM -0800, Reinette Chatre wrote:
> Hi Tony,
> 
> On 2/5/25 4:51 PM, Luck, Tony wrote:
> >> This new arch API has sharp corners because of asymmetry of where resctrl
> >> runs the arch function. I do not think it is required to change this since we
> >> can only speculate about how this may be used in the future but I do think
> >> it will be helpful to add comments that highlight:
> >>
> >> resctrl_arch_mon_event_config_get() ->  May run on CPU that does not belong to domain.
> >> resctrl_arch_mon_event_config_set() ->  Runs on CPU that belongs to domain.
> > 
> > Here's a vague data point about the future to help with speculation.
> > 
> > I have something coming along the pipeline that also can run on any CPU.
> > 
> > I am contemplating a flag in the rdt_resource structure (in appropriate substructure
> > resctrl_cache/resctrl_membw) to indicate "domain" vs. "any" for operations.
> > 
> > Would something like that be useful here?
> 
> hmm ... I cannot envision how this may look. Could you please elaborate?
> 
> You mention "a" (singular) flag in rdt_resource while this scenario involves
> different ops having different scope. This makes me think that this flag may
> have to be per operation that in turn would need additional infrastructure to
> manage and track operations.
> 
> These "arch" functions are evolving as the work to support MPAM is done and
> so far I think it has been quite ad-hoc to just refactor arch specific code
> into "arch" helpers instead of keeping track of which scope they are running in.
> This currently requires any arch implementing an "arch" helper to be well aware 
> of how resctrl will call it.
> 
> Reinette

For MPAM, we must typically do all configuration access from a CPU in a
power domain that depends on the power domain of the relevant MPAM MSC,
including reads of the configuration.

In the MPAM case, the required topology knowledge is not necessarily
identical to the resctrl domain topology, so it doesn't feel right to
have the resctrl core code making the decisions.

So, in the interest of keeping the arch interface simple, should cross-
calling be delegated to the arch code, at least for now?

Cheers
---Dave


* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-02-10 18:10       ` Reinette Chatre
@ 2025-02-19 13:30         ` Dave Martin
  2025-02-19 18:07           ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-19 13:30 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Moger, Babu, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Mon, Feb 10, 2025 at 10:10:26AM -0800, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/7/25 10:23 AM, Moger, Babu wrote:
> > On 2/5/2025 5:57 PM, Reinette Chatre wrote:
> >> On 1/22/25 12:20 PM, Babu Moger wrote:

[...]

> >>> MBM events of a monitoring group is tracked by hardware. Such queries
> >>> are acceptable because of a very small number of assignable counters.
> >>
> >> It is not obvious what "very small number" means. Is it possible to give
> >> a range to help reader understand the motivation?
> > 
> > How about?
> > 
> > MBM events of a monitoring group is tracked by hardware. Such queries
> > are acceptable because of a very small number of assignable counters(32 to 64).
> 
> Yes, thank you. This helps to understand the claim.
> 
> Reinette

Do these queries only happen when userspace reads an mbm_assign_control
file?

It might be worth documenting somewhere that writing and (especially)
reading an mbm_assign_control file is not intended to be super-fast.

It feels like userspace should not generally rely on reading
mbm_assign_control files except for diagnostic purposes, or occasional
read-modify-write transformations.  Or do you expect some other usage model
that makes this a hotter path?

Cheers
---Dave


* Re: [PATCH v11 12/23] x86/resctrl: Introduce interface to display number of free counters
  2025-02-07 18:59     ` Moger, Babu
@ 2025-02-19 13:31       ` Dave Martin
  0 siblings, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-19 13:31 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Fri, Feb 07, 2025 at 12:59:55PM -0600, Moger, Babu wrote:
> Hi Reinette,
> 
> On 2/5/2025 6:19 PM, Reinette Chatre wrote:
> > Hi Babu,
> > 
> > On 1/22/25 12:20 PM, Babu Moger wrote:

[...]

> > > diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> > > index 31ff764deeeb..99cae75559b0 100644
> > > --- a/Documentation/arch/x86/resctrl.rst
> > > +++ b/Documentation/arch/x86/resctrl.rst
> > > @@ -299,6 +299,14 @@ with the following files:
> > >   	memory bandwidth tracking to a single memory bandwidth event per
> > >   	monitoring group.
> > > +"available_mbm_cntrs":
> > > +	The number of monitoring counters available for assignment in each
> > > +	domain when mbm_cntr_assign mode is enabled on the system.
> > > +	::
> > > +
> > 
> > Documentation jumps in with some hardcoded values that may cause confusion.
> > It looks to be missing something like (and looking back this also applies
> > to "num_mbm_cntrs"):
> > "For example, on a system with 30 available monitoring/(hardware?) counters in
> > each of its L3 domains:"
> 
> Sure.

It could make sense to write something like
"... 30 available [hardware] memory bandwidth counters in each ..."

MPAM has different kinds of counters, at least in theory.

(No big deal, though.)

Cheers
---Dave


* Re: [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2025-01-22 20:20 ` [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
@ 2025-02-19 13:32   ` Dave Martin
  2025-02-19 21:00     ` Moger, Babu
  2025-02-21 18:06   ` James Morse
  1 sibling, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-19 13:32 UTC (permalink / raw)
  To: Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Wed, Jan 22, 2025 at 02:20:22PM -0600, Babu Moger wrote:
> The ABMC feature provides an option to the user to assign a hardware
> counter to an RMID, event pair and monitor the bandwidth as long as it
> is assigned. The assigned RMID will be tracked by the hardware until the
> user unassigns it manually.
> 
> Implement an architecture-specific handler to assign and unassign the
> counter. Configure counters by writing to the L3_QOS_ABMC_CFG MSR,
> specifying the counter ID, bandwidth source (RMID), and event
> configuration.
> 
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>     Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>     Monitoring (ABMC).
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

[...]

> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f2bf5b13465d..ef836bb69b9b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1371,3 +1371,45 @@ void resctrl_arch_mon_event_config_set(void *info)

[...]

> +/*
> + * Send an IPI to the domain to assign the counter to RMID, event pair.
> + */
> +int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
> +			     u32 cntr_id, bool assign)
> +{
> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> +	union l3_qos_abmc_cfg abmc_cfg = { 0 };
> +	struct arch_mbm_state *am;
> +
> +	abmc_cfg.split.cfg_en = 1;
> +	abmc_cfg.split.cntr_en = assign ? 1 : 0;
> +	abmc_cfg.split.cntr_id = cntr_id;
> +	abmc_cfg.split.bw_src = rmid;
> +
> +	/* Update the event configuration from the domain */
> +	if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
> +		abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
> +	else
> +		abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
> +
> +	smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd, &abmc_cfg, 1);
> +
> +	/*
> +	 * Reset the architectural state so that reading of hardware
> +	 * counter is not considered as an overflow in next update.
> +	 */
> +	am = get_arch_mbm_state(hw_dom, rmid, evtid);

Is this necessary when unassigning the counter, or only when assigning?

> +	if (am)
> +		memset(am, 0, sizeof(*am));
> +
> +	return 0;
> +}

[...]

Cheers
---Dave


* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-01-22 20:20 ` [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
  2025-02-06 18:03   ` Reinette Chatre
@ 2025-02-19 13:41   ` Dave Martin
  2025-02-19 14:09     ` Peter Newman
  1 sibling, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-19 13:41 UTC (permalink / raw)
  To: Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Wed, Jan 22, 2025 at 02:20:25PM -0600, Babu Moger wrote:
> Assign/unassign counters on resctrl group creation/deletion. Two counters
> are required per group, one for MBM total event and one for MBM local
> event.
> 
> There are a limited number of counters available for assignment. If these
> counters are exhausted, the kernel will display the error message: "Out of
> MBM assignable counters". However, it is not necessary to fail the
> creation of a group due to assignment failures. Users have the flexibility
> to modify the assignments at a later time.

If we are doing this, should turning mbm_cntr_assign mode on also
trigger auto-assignment for all extant monitoring groups?

Either way though, this auto-assignment feels like a potential nuisance
for userspace.

If the userspace use-case requires too many monitoring groups for the
available counters, then the kernel will auto-assign counters to a
random subset of groups which may or may not be the ones that userspace
wanted to monitor; then userspace must manually look for the assigned
counters and unassign some of them before they can be assigned where
userspace actually wanted them.

This is not impossible for userspace to cope with, but it feels
awkward.

Is there a way to inhibit auto-assignment?

Or could automatic assignments be considered somehow "weak", so that
new explicit assignments can steal automatically assigned counters
without the need to unassign them explicitly?

[...]

Cheers
---Dave


* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-01-22 20:20 ` [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups Babu Moger
@ 2025-02-19 13:53   ` Dave Martin
  2025-02-19 21:09     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-19 13:53 UTC (permalink / raw)
  To: Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Wed, Jan 22, 2025 at 02:20:30PM -0600, Babu Moger wrote:
> Provide the interface to list the assignment states of all the resctrl
> groups in mbm_cntr_assign mode.

[...]

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5d305d0ac053..6e29827239e0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -975,6 +975,81 @@ static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,

[...]

> +static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
> +					   struct seq_file *s, void *v)
> +{

[...]

> +	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
> +		seq_printf(s, "%s//", rdtg->kn->name);
> +
> +		sep = false;
> +		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> +			if (sep)
> +				seq_puts(s, ";");
> +
> +			seq_printf(s, "%d=%s", dom->hdr.id,
> +				   rdtgroup_mon_state_to_str(r, dom, rdtg, str));
> +
> +			sep = true;
> +		}
> +		seq_putc(s, '\n');
> +
> +		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
> +			seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);
> +
> +			sep = false;
> +			list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> +				if (sep)
> +					seq_puts(s, ";");
> +				seq_printf(s, "%d=%s", dom->hdr.id,
> +					   rdtgroup_mon_state_to_str(r, dom, crg, str));

Unlike the other resctrl files, it looks like the total size of this
data will scale up with the number of existing monitoring groups
and the lengths of the group names (in addition to the number of
monitoring domains).

So, this can easily be more than a page, overflowing internal limits
in the seq_file and kernfs code.

Do we need to track some state between read() calls?  This can be done
by overriding the kernfs .open() and .release() methods and hanging
some state data (or an rdtgroup_file pointer) on of->priv.

Also, if we allow the data to be read out in chunks, then we would
either have to snapshot all the data in one go and stash the unread
tail in the kernel, or we would need to move over to RCU-based
enumeration or similar -- otherwise releasing rdtgroup_mutex in the
middle of the enumeration in order to return data to userspace is going
to be a problem...

> +				sep = true;
> +			}
> +			seq_putc(s, '\n');
> +		}
> +	}
> +
> +	mutex_unlock(&rdtgroup_mutex);
> +	cpus_read_unlock();
> +	return 0;
> +}

[...]

Cheers
---Dave


* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-19 13:41   ` Dave Martin
@ 2025-02-19 14:09     ` Peter Newman
  2025-02-19 17:55       ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Peter Newman @ 2025-02-19 14:09 UTC (permalink / raw)
  To: Dave Martin
  Cc: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On Wed, Feb 19, 2025 at 2:41 PM Dave Martin <Dave.Martin@arm.com> wrote:
>
> Hi,
>
> On Wed, Jan 22, 2025 at 02:20:25PM -0600, Babu Moger wrote:
> > Assign/unassign counters on resctrl group creation/deletion. Two counters
> > are required per group, one for MBM total event and one for MBM local
> > event.
> >
> > There are a limited number of counters available for assignment. If these
> > counters are exhausted, the kernel will display the error message: "Out of
> > MBM assignable counters". However, it is not necessary to fail the
> > creation of a group due to assignment failures. Users have the flexibility
> > to modify the assignments at a later time.
>
> If we are doing this, should turning mbm_cntr_assign mode on also
> trigger auto-assignment for all extant monitoring groups?
>
> Either way though, this auto-assignment feels like a potential nuisance
> for userspace.
>
> If the userspace use-case requires too many monitoring groups for the
> available counters, then the kernel will auto-assign counters to a
> random subset of groups which may or may not be the ones that userspace
> wanted to monitor; then userspace must manually look for the assigned
> counters and unassign some of them before they can be assigned where
> userspace actually wanted them.
>
> This is not impossible for userspace to cope with, but it feels
> awkward.
>
> Is there a way to inhibit auto-assignment?
>
> Or could automatic assignments be considered somehow "weak", so that
> new explicit assignments can steal automatically assigned counters
> without the need to unassign them explicitly?

We had an incomplete discussion about this early on[1]. I guess I
didn't revisit it because I found it was trivial to add a flag that
inhibits the assignment behavior during mkdir and had moved on to
bigger issues.

If an agent creating directories isn't coordinated with the agent
managing counters, a series of creating and destroying a group could
prevent a monitor assignment from ever succeeding because it's not
possible to atomically discover the name of the new directory that
stole the previously-available counter and reassign it.

However, if the counter-manager can get all the counters assigned once
and only move them with atomic reassignments, it will become
impossible to snatch them with a mkdir.

-Peter

[1] https://lore.kernel.org/lkml/CALPaoCihfQ9VtLYzyHB9-PsQzXLc06BW8bzhBXwj9-i+Q8RVFQ@mail.gmail.com/


* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-01-22 20:20 ` [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
  2025-02-06 18:48   ` Reinette Chatre
@ 2025-02-19 16:07   ` Dave Martin
  2025-02-19 17:43     ` Luck, Tony
  2025-02-20  0:34     ` Moger, Babu
  2025-02-21 18:07   ` James Morse
  2 siblings, 2 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-19 16:07 UTC (permalink / raw)
  To: Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Wed, Jan 22, 2025 at 02:20:31PM -0600, Babu Moger wrote:
> When mbm_cntr_assign mode is enabled, users can designate which of the MBM
> events in the CTRL_MON or MON groups should have counters assigned.
> 
> Provide an interface for assigning MBM events by writing to the file:
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control. Using this interface,
> events can be assigned or unassigned as needed.
> 
> Format is similar to the list format with addition of opcode for the
> assignment operation.
>  "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

[...]
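
For reference, the command format quoted above can be split in userspace
roughly as follows. This is only a sketch: the names struct assign_cmd and
parse_assign_cmd() are hypothetical, it handles a single domain field per
line, and the opcode set "=+-" is an assumption rather than something taken
from the quoted text.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed command (not a kernel type). */
struct assign_cmd {
	const char *ctrl_grp;	/* CTRL_MON group name, may be empty */
	const char *mon_grp;	/* MON group name, may be empty */
	unsigned long dom_id;	/* decimal domain id */
	char opcode;		/* assumed to be one of '=', '+', '-' */
	const char *flags;	/* rest of the field, e.g. "tl" */
};

/*
 * Split "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>" in
 * place. Returns 0 on success, or -1 if the two '/' separators, the
 * decimal domain id or the opcode are missing.
 */
int parse_assign_cmd(char *line, struct assign_cmd *cmd)
{
	char *s1, *s2, *end;

	assert(line != NULL && cmd != NULL);

	s1 = strchr(line, '/');
	if (!s1)
		return -1;
	s2 = strchr(s1 + 1, '/');
	if (!s2)
		return -1;

	/* Reject signs and whitespace that strtoul() would accept. */
	if (!isdigit((unsigned char)s2[1]))
		return -1;

	*s1 = '\0';
	*s2 = '\0';
	cmd->ctrl_grp = line;
	cmd->mon_grp = s1 + 1;

	cmd->dom_id = strtoul(s2 + 1, &end, 10);
	if (*end == '\0' || !strchr("=+-", *end))
		return -1;

	cmd->opcode = *end;
	cmd->flags = end + 1;
	return 0;
}
```

The caller must pass a writable buffer, since the separators are overwritten
with NUL bytes in the same way strsep() does in the kernel code below.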

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 6e29827239e0..299839bcf23f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,

[...]

> +static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
> +						char *buf, size_t nbytes, loff_t off)
> +{
> +	struct rdt_resource *r = of->kn->parent->priv;
> +	char *token, *cmon_grp, *mon_grp;
> +	enum rdt_group_type rtype;
> +	int ret;
> +
> +	/* Valid input requires a trailing newline */
> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
> +		return -EINVAL;
> +
> +	buf[nbytes - 1] = '\0';
> +
> +	cpus_read_lock();
> +	mutex_lock(&rdtgroup_mutex);
> +
> +	rdt_last_cmd_clear();
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
> +		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
> +		mutex_unlock(&rdtgroup_mutex);
> +		cpus_read_unlock();
> +		return -EINVAL;
> +	}
> +
> +	while ((token = strsep(&buf, "\n")) != NULL) {
> +		/*
> +		 * The write command follows the following format:
> +		 * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
> +		 * Extract the CTRL_MON group.
> +		 */
> +		cmon_grp = strsep(&token, "/");
> +

As when reading this file, I think that the data can grow larger than a
page and get split into multiple write() calls.

I don't currently think the file needs to be redesigned, but there are
some concerns about how userspace will work with it that need to be
sorted out.

Every monitoring group can contribute a line to this file:

	CTRL_GROUP / MON_GROUP / DOMAIN = [t][l] [ ; DOMAIN = [t][l] ]* LF

so, 2 * (NAME_MAX + 1) + NUM_DOMAINS * 5 - 1 + 1

NAME_MAX on Linux is 255, so with, say, up to 16 domains, that's about
600 bytes per monitoring group in the worst case.

We don't need to have many control and monitoring groups for this to
grow potentially over 4K.
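
The arithmetic above can be checked with a small sketch. The helper names
and the RESCTRL_NAME_MAX constant are hypothetical, and the 5 bytes per
domain field follow the estimate above (which assumes a single-digit
domain id).

```c
#include <assert.h>
#include <stddef.h>

#define RESCTRL_NAME_MAX	255	/* NAME_MAX on Linux */
#define BYTES_PER_DOMAIN	5	/* "N=tl" plus one ';' separator */

/*
 * Worst-case length in bytes of one line:
 *   2 * (NAME_MAX + 1)    two group names, each followed by '/'
 *   num_domains * 5 - 1   domain fields, no ';' after the last one
 *   + 1                   trailing newline
 */
size_t worst_case_line_bytes(size_t num_domains)
{
	assert(num_domains > 0);
	return 2 * (RESCTRL_NAME_MAX + 1) +
	       num_domains * BYTES_PER_DOMAIN - 1 + 1;
}

/* Smallest number of worst-case lines whose total size exceeds limit. */
size_t groups_to_exceed(size_t limit, size_t num_domains)
{
	return limit / worst_case_line_bytes(num_domains) + 1;
}
```

With 16 domains this gives 592 bytes per line, so seven worst-case
monitoring groups already exceed a 4096-byte page.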

We could simply place a limit on how much userspace is allowed to write
to this file in one go, although this restriction feels difficult for
userspace to follow -- but maybe this is workable in the short term, on
current systems (?)

Otherwise, since we expect this interface to be written using scripting
languages, I think we need to be prepared to accept fully-buffered
I/O.  That means that the data may be cut at random places, not
necessarily at newlines.  (For smaller files such as schemata this is
not such an issue, since the whole file is likely to be small enough to
fit into the default stdio buffers -- this is how sysfs gets away with
it IIUC.)

For fully-buffered I/O, we may have to cache an incomplete line in
between write() calls.  If there is a dangling incomplete line when the
file is closed then it is hard to tell userspace, because people often
don't bother to check the return value of close(), fclose() etc.
However, since it's an ABI violation for userspace to end this file
with a partial line, I think it's sufficient to report that via
last_cmd_status.  (Making close() return -EIO still seems a good idea
though, just in case userspace is listening.)

I hacked up something a bit like this so that schemata could be written
interactively from the shell, so I can try to port that onto this series
as an illustration, if it helps.
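
To make the idea of caching an incomplete line between write() calls
concrete, here is a minimal userspace sketch. It is not the hack mentioned
above and not kernel code; struct line_buf, line_buf_feed(),
line_buf_dangling(), keep_last_line() and the 1024-byte cap are all
hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_MAX_BYTES 1024	/* hypothetical cap on one command line */

struct line_buf {
	char data[LINE_MAX_BYTES];
	size_t len;		/* bytes of a pending, incomplete line */
	unsigned int lines;	/* complete lines seen so far */
};

typedef int (*line_fn)(const char *line, void *ctx);

/*
 * Feed one write() chunk, which may be cut anywhere. Each complete line
 * is passed to fn, NUL-terminated and without its newline. A trailing
 * partial line stays in lb until the next call. Returns 0 on success,
 * -1 if a line is too long or fn reports an error.
 */
int line_buf_feed(struct line_buf *lb, const char *chunk, size_t n,
		  line_fn fn, void *ctx)
{
	size_t i;

	assert(lb != NULL && fn != NULL);

	for (i = 0; i < n; i++) {
		if (chunk[i] == '\n') {
			lb->data[lb->len] = '\0';
			lb->len = 0;
			lb->lines++;
			if (fn(lb->data, ctx))
				return -1;
			continue;
		}
		/* Always leave room for the terminating NUL. */
		if (lb->len + 1 >= sizeof(lb->data))
			return -1;
		lb->data[lb->len++] = chunk[i];
	}
	return 0;
}

/* At close time: nonzero means the input ended with a partial line. */
int line_buf_dangling(const struct line_buf *lb)
{
	return lb->len != 0;
}

/* Example callback: remember the most recent complete line in ctx. */
int keep_last_line(const char *line, void *ctx)
{
	strncpy((char *)ctx, line, LINE_MAX_BYTES - 1);
	((char *)ctx)[LINE_MAX_BYTES - 1] = '\0';
	return 0;
}
```

In a kernel version, the equivalent of line_buf_dangling() would be checked
in the release path, which is where a dangling partial line could be
reported via last_cmd_status.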

Cheers
---Dave


* Re: [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init()
  2025-02-19 13:28   ` Dave Martin
@ 2025-02-19 16:53     ` Moger, Babu
  2025-02-20 13:29       ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-19 16:53 UTC (permalink / raw)
  To: Dave Martin
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/19/25 07:28, Dave Martin wrote:
> Hi,
> 
> On Wed, Jan 22, 2025 at 02:20:09PM -0600, Babu Moger wrote:
>> resctrl_late_init() has the __init attribute, but some of the functions
>> called from it do not have the __init attribute.
>>
>> Add the __init attribute to all the functions in the call sequences to
>> maintain consistency throughout.
> 
> (BTW, did you just find these cases by inspection, or were you getting
> build warnings?
> 
> Even with CONFIG_DEBUG_SECTION_MISMATCH=y, I struggle to get build
> warnings about section mismatches on inlined functions.  Even building
> with -fno-inline doesn't flag them all up (though I don't think this
> suppresses all inlining).
> 
> If you have a way of tracking these cases down automatically, I'd be
> interested to know so that I can apply it elsewhere.)

It is mostly by code inspection at this point.

You can refer to this commit [1].

We used to see section mismatch warnings when non-init functions call
__init functions.

MODPOST Module.symvers
WARNING: modpost: vmlinux: section mismatch in reference:
rdt_get_mon_l3_config+0x2b5 (section: .text) -> rdt_cpu_has (section:
.init.text)
WARNING: modpost: vmlinux: section mismatch in reference:
rdt_get_mon_l3_config+0x408 (section: .text) -> rdt_cpu_has (section:
.init.text)


[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.14-rc3&id=bd334c86b5d70e5d1c6169991802e62c828d6f38

> 
> Cheers
> ---Dave
> 
> 
>>
>> Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
>> Fixes: def10853930a ("x86/intel_rdt: Add two new resources for L2 Code and Data Prioritization (CDP)")
>> Fixes: bd334c86b5d7 ("x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()")
>> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> [...]
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 3d1735ed8d1f..f0a331287979 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -145,7 +145,7 @@ u32 resctrl_arch_system_num_rmid_idx(void)
>>   * is always 20 on hsw server parts. The minimum cache bitmask length
>>   * allowed for HSW server is always 2 bits. Hardcode all of them.
>>   */
>> -static inline void cache_alloc_hsw_probe(void)
>> +static inline __init void cache_alloc_hsw_probe(void)
>>  {
>>  	struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_L3];
>>  	struct rdt_resource *r  = &hw_res->r_resctrl;
>> @@ -277,7 +277,7 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
> 
> [...]
> 

-- 
Thanks
Babu Moger


* RE: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-19 16:07   ` Dave Martin
@ 2025-02-19 17:43     ` Luck, Tony
  2025-02-20 14:57       ` Dave Martin
  2025-02-20  0:34     ` Moger, Babu
  1 sibling, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-02-19 17:43 UTC (permalink / raw)
  To: Dave Martin, Babu Moger
  Cc: corbet@lwn.net, Chatre, Reinette, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com, fenghua.yu@intel.com, x86@kernel.org,
	hpa@zytor.com, paulmck@kernel.org, akpm@linux-foundation.org,
	thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

> I hacked up something a bit like this so that schemata could be written
> interactively from the shell, so I can try to port that onto this series
> as an illustration, if it helps.

Note that schemata will accept writes that just change the bits you want to change.

So from the shell:

# cat schemata
MB:0=100;1=100
L3:0=fff;1=fff

# echo "MB:1=90" > schemata

# cat schemata
MB:0=100;1= 90
L3:0=fff;1=fff

-Tony



* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-19 14:09     ` Peter Newman
@ 2025-02-19 17:55       ` Reinette Chatre
  2025-02-20 10:35         ` Peter Newman
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-19 17:55 UTC (permalink / raw)
  To: Peter Newman, Dave Martin
  Cc: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave and Peter,

On 2/19/25 6:09 AM, Peter Newman wrote:
> Hi Dave,
> 
> On Wed, Feb 19, 2025 at 2:41 PM Dave Martin <Dave.Martin@arm.com> wrote:
>>
>> Hi,
>>
>> On Wed, Jan 22, 2025 at 02:20:25PM -0600, Babu Moger wrote:
>>> Assign/unassign counters on resctrl group creation/deletion. Two counters
>>> are required per group, one for MBM total event and one for MBM local
>>> event.
>>>
>>> There are a limited number of counters available for assignment. If these
>>> counters are exhausted, the kernel will display the error message: "Out of
>>> MBM assignable counters". However, it is not necessary to fail the
>>> creation of a group due to assignment failures. Users have the flexibility
>>> to modify the assignments at a later time.
>>
>> If we are doing this, should turning mbm_cntr_assign mode on also
>> trigger auto-assignment for all extant monitoring groups?
>>
>> Either way though, this auto-assignment feels like a potential nuisance
>> for userspace.

hmmm ... this auto-assignment was created with the goal to help userspace.
In mbm_cntr_assign mode the user will only see data when a counter is assigned
to an event. mbm_cntr_assign mode is selected as default on a system that
supports ABMC. Without auto assignment a user will thus see different
behavior when reading the monitoring events after switching to a kernel with
assignable counter support: before assignable counter support, events will have
data; with assignable counter support, the events will not have data.

I understood that interfaces should not behave differently when user space
switches kernels and that is what the auto assignment aims to solve.

>>
>> If the userspace use-case requires too many monitoring groups for the
>> available counters, then the kernel will auto-assign counters to a
>> random subset of groups which may or may not be the ones that userspace
>> wanted to monitor; then userspace must manually look for the assigned
>> counters and unassign some of them before they can be assigned where
>> userspace actually wanted them.
>>
>> This is not impossible for userspace to cope with, but it feels
>> awkward.
>>
>> Is there a way to inhibit auto-assignment?
>>
>> Or could automatic assignments be considered somehow "weak", so that
>> new explicit assignments can steal automatically assigned counters
>> without the need to unassign them explicitly?
> 
> We had an incomplete discussion about this early on[1]. I guess I
> didn't revisit it because I found it was trivial to add a flag that
> inhibits the assignment behavior during mkdir and had moved on to
> bigger issues.

Could you please remind me how a user will set this flag?

> 
> If an agent creating directories isn't coordinated with the agent
> managing counters, a series of creating and destroying a group could
> prevent a monitor assignment from ever succeeding because it's not
> possible to atomically discover the name of the new directory that
> stole the previously-available counter and reassign it.
> 
> However, if the counter-manager can get all the counters assigned once
> and only move them with atomic reassignments, it will become
> impossible to snatch them with a mkdir.
> 

You raise many points that make auto-assignment less than ideal, but I
remain concerned that not doing something like this will break
existing users who are not as familiar with resctrl internals.

Reinette



* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-19 11:28                 ` Peter Newman
  2025-02-19 12:26                   ` Dave Martin
@ 2025-02-19 17:56                   ` Reinette Chatre
  2025-02-20 14:53                     ` Peter Newman
  2025-02-20 16:46                     ` Dave Martin
  1 sibling, 2 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-19 17:56 UTC (permalink / raw)
  To: Peter Newman
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 2/19/25 3:28 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>> Hi Reinette,
>>>
>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>>
>>>> Hi Babu,
>>>>
>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>
>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>
>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>
>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>
>>>>>>>> mbm_local_read_bytes a
>>>>>>>> mbm_local_write_bytes b
>>>>>>>>
>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>> <value>
>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>
>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>
>>>> As mentioned above, one possible issue with existing interface is that
>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>> is low enough to be of concern.
>>>
>>> The events which can be monitored by a single counter on ABMC and MPAM
>>> so far are combinable, so 26 counters per group today means it limits
>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>> investigation, I would question whether they know what they're looking
>>> for.
>>
>> The key here is "so far" as well as the focus on MBM only.
>>
>> It is impossible for me to predict what we will see in a couple of years
>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>> to support their users. Just looking at the Intel RDT spec the event register
>> has space for 32 events for each "CPU agent" resource. That does not take into
>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>> that he is working on patches [1] that will add new events and shared the idea
>> that we may be trending to support "perf" like events associated with RMID. I
>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>> customers.
>> This all makes me think that resctrl should be ready to support more events than 26.
> 
> I was thinking of the letters as representing a reusable, user-defined
> event-set for applying to a single counter rather than as individual
> events, since MPAM and ABMC allow us to choose the set of events each
> one counts. Wherever we define the letters, we could use more symbolic
> event names.

Thank you for clarifying.

> 
> In the letters as events model, choosing the events assigned to a
> group wouldn't be enough information, since we would want to control
> which events should share a counter and which should be counted by
> separate counters. I think the amount of information that would need
> to be encoded into mbm_assign_control to represent the level of
> configurability supported by hardware would quickly get out of hand.
> 
> Maybe as an example, one counter for all reads, one counter for all
> writes in ABMC would look like...
> 
> (L3_QOS_ABMC_CFG.BwType field names below)
> 
> (per domain)
> group 0:
>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>  counter 1: VictimBW,LclNTWr,RmtNTWr
> group 1:
>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>  counter 3: VictimBW,LclNTWr,RmtNTWr
> ...
> 

I think this may also be what Dave was heading towards in [2] but in that
example and above the counter configuration appears to be global. You do mention
"configurability supported by hardware" so I wonder if per-domain counter
configuration is a requirement?

Until now I viewed counter configuration as separate from counter assignment,
similar to how AMD's counters can be configured via mbm_total_bytes_config and
mbm_local_bytes_config before they are assigned. That is still per-domain
counter configuration though, not per-counter.
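
As a rough illustration of the "one counter for reads, one for writes"
layout quoted above, here is a minimal sketch. The bit positions given to
the BwType field names are an assumption made for this example and should
be checked against the APM; struct cntr_cfg and
assign_read_write_counters() are hypothetical names, not part of resctrl.

```c
#include <stdint.h>

/* Illustrative BwType bits. The positions are an assumption; check the APM. */
enum bw_type {
	BW_LCL_FILL      = 1u << 0,
	BW_RMT_FILL      = 1u << 1,
	BW_LCL_NT_WR     = 1u << 2,
	BW_RMT_NT_WR     = 1u << 3,
	BW_LCL_SLOW_FILL = 1u << 4,
	BW_RMT_SLOW_FILL = 1u << 5,
	BW_VICTIM        = 1u << 6,
};

#define BW_ALL_READS  (BW_LCL_FILL | BW_RMT_FILL | \
		       BW_LCL_SLOW_FILL | BW_RMT_SLOW_FILL)
#define BW_ALL_WRITES (BW_VICTIM | BW_LCL_NT_WR | BW_RMT_NT_WR)

/* Hypothetical per-counter record: owning group and the events it counts. */
struct cntr_cfg {
	uint32_t group;
	uint32_t bw_type;
};

/*
 * Give every group two counters in one domain: counter 2*g counts reads,
 * counter 2*g+1 counts writes. Returns the number of counters used, or -1
 * when the domain does not have enough counters.
 */
static int assign_read_write_counters(struct cntr_cfg *cntrs, int num_cntrs,
				      int num_groups)
{
	if (num_groups < 0 || 2 * num_groups > num_cntrs)
		return -1;

	for (int g = 0; g < num_groups; g++) {
		cntrs[2 * g].group = (uint32_t)g;
		cntrs[2 * g].bw_type = BW_ALL_READS;
		cntrs[2 * g + 1].group = (uint32_t)g;
		cntrs[2 * g + 1].bw_type = BW_ALL_WRITES;
	}
	return 2 * num_groups;
}
```

The sketch keeps the event set per counter rather than per domain, which
is the distinction raised in the paragraph above.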

> I assume packing all of this info for a group's desired counter
> configuration into a single line (with 32 domains per line on many
> dual-socket AMD configurations I see) would be difficult to look at,
> even if we could settle on a single letter to represent each
> universally.
> 
>>
>> My goal is for resctrl to have a user interface that can as much as possible
>> be ready for whatever may be required from it years down the line. Of course,
>> I may be wrong and resctrl would never need to support more than 26 events per
>> resource (*). The risk is that resctrl *may* need to support more than 26 events
>> and how could resctrl support that?
>>
>> What is the risk of supporting more than 26 events? As I highlighted earlier
>> the interface I used as demonstration may become unwieldy to parse on a system
>> with many domains that supports many events. This is a concern for me. Any suggestions
>> will be appreciated, especially from you since I know that you are very familiar with
>> issues related to large scale use of resctrl interfaces.
> 
> It's mainly just the unwieldiness of all the information in one file.
> It's already at the limit of what I can visually look through.

I agree.

> 
> I believe that shared assignments will take care of all the
> high-frequency and performance-intensive batch configuration updates I
> was originally concerned about, so I no longer see much benefit in
> finding ways to textually encode all this information in a single file
> when it would be more manageable to distribute it around the
> filesystem hierarchy.

This is significant. The motivation for the single file was to support
the "high-frequency and performance-intensive" usage. Would "shared assignments"
not also depend on the same files that, if distributed, will require many
filesystem operations? 
Having the files distributed will be significantly simpler while also
avoiding the file size issue that Dave Martin exposed. 

Reinette

>> [1] https://lore.kernel.org/lkml/SJ1PR11MB6083759CCE59FF2FE931471EFCFF2@SJ1PR11MB6083.namprd11.prod.outlook.com/
>>
>> (*) There is also the scenario where combined between resources there may be
>> more than 26 events supported that will require the same one letter flag to be
>> used for different events of different resources. This may potentially be
>> confusing.

[2] https://lore.kernel.org/lkml/Z6zeXby8ajh0ax6i@e133380.arm.com/


* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-02-19 13:30         ` Dave Martin
@ 2025-02-19 18:07           ` Moger, Babu
  2025-02-20 13:33             ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-19 18:07 UTC (permalink / raw)
  To: Dave Martin, Reinette Chatre
  Cc: Moger, Babu, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/19/25 07:30, Dave Martin wrote:
> Hi,
> 
> On Mon, Feb 10, 2025 at 10:10:26AM -0800, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 2/7/25 10:23 AM, Moger, Babu wrote:
>>> On 2/5/2025 5:57 PM, Reinette Chatre wrote:
>>>> On 1/22/25 12:20 PM, Babu Moger wrote:
> 
> [...]
> 
>>>>> MBM events of a monitoring group is tracked by hardware. Such queries
>>>>> are acceptable because of a very small number of assignable counters.
>>>>
>>>> It is not obvious what "very small number" means. Is it possible to give
>>>> a range to help reader understand the motivation?
>>>
>>> How about?
>>>
>>> MBM events of a monitoring group is tracked by hardware. Such queries
>>> are acceptable because of a very small number of assignable counters (32 to 64).
>>
>> Yes, thank you. This helps to understand the claim.
>>
>> Reinette
> 
> Do these queries only happen when userspace reads an mbm_assign_control
> file?

Yes. All these queries are initiated by userspace in the form of
individual assignments or creating a group (mkdir).

> 
> It might be worth documenting somewhere that writing and (especially)
> reading an mbm_assign_control file is not intended to be super-fast.


We can drop the last sentence if it is creating confusion.

> 
> It feels like userspace should not generally rely on reading
> mbm_assign_control files except for diagnostic purposes, or occasional
> read-modify-write transformations.  Or do you expect some other usage model
> that makes this a hotter path?
> 
> Cheers
> ---Dave

Our earlier interface was intended to query each group separately. After
input from Peter, we changed it to a batched query. One query from
userspace can list all the assignments. I am not aware of any other usage
model.
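
To make the batched listing concrete, here is a minimal userspace sketch
of how one line of the list might be built, following the
"<CTRL_MON group>/<MON group>/<domain_id>=<flags>" layout used by the
show function in this series. struct dom_state and format_assign_line()
are hypothetical names for this example only.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical view of one domain's assignment state, e.g. "tl". */
struct dom_state {
	int id;
	const char *flags;
};

/*
 * Build "<ctrl>/<mon>/<id>=<flags>;<id>=<flags>..." into buf.
 * Returns the line length (without a trailing newline), or -1 if the
 * line does not fit in len bytes.
 */
static int format_assign_line(char *buf, size_t len, const char *ctrl,
			      const char *mon, const struct dom_state *doms,
			      int ndoms)
{
	int n = snprintf(buf, len, "%s/%s/", ctrl, mon);
	size_t pos;

	if (n < 0 || (size_t)n >= len)
		return -1;
	pos = (size_t)n;

	for (int i = 0; i < ndoms; i++) {
		/* Domains after the first are separated by ';'. */
		n = snprintf(buf + pos, len - pos, "%s%d=%s",
			     i ? ";" : "", doms[i].id, doms[i].flags);
		if (n < 0 || (size_t)n >= len - pos)
			return -1;
		pos += (size_t)n;
	}
	return (int)pos;
}
```

One such line is produced per group, so the size of a full listing grows
with both the number of groups and the number of domains.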
-- 
Thanks
Babu Moger


* Re: [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2025-02-19 13:32   ` Dave Martin
@ 2025-02-19 21:00     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-19 21:00 UTC (permalink / raw)
  To: Dave Martin
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/19/25 07:32, Dave Martin wrote:
> On Wed, Jan 22, 2025 at 02:20:22PM -0600, Babu Moger wrote:
>> The ABMC feature provides an option to the user to assign a hardware
>> counter to an RMID, event pair and monitor the bandwidth as long as it
>> is assigned. The assigned RMID will be tracked by the hardware until the
>> user unassigns it manually.
>>
>> Implement an architecture-specific handler to assign and unassign the
>> counter. Configure counters by writing to the L3_QOS_ABMC_CFG MSR,
>> specifying the counter ID, bandwidth source (RMID), and event
>> configuration.
>>
>> The feature details are documented in the APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>     Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>     Monitoring (ABMC).
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> [...]
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index f2bf5b13465d..ef836bb69b9b 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1371,3 +1371,45 @@ void resctrl_arch_mon_event_config_set(void *info)
> 
> [...]
> 
>> +/*
>> + * Send an IPI to the domain to assign the counter to RMID, event pair.
>> + */
>> +int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>> +			     u32 cntr_id, bool assign)
>> +{
>> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>> +	union l3_qos_abmc_cfg abmc_cfg = { 0 };
>> +	struct arch_mbm_state *am;
>> +
>> +	abmc_cfg.split.cfg_en = 1;
>> +	abmc_cfg.split.cntr_en = assign ? 1 : 0;
>> +	abmc_cfg.split.cntr_id = cntr_id;
>> +	abmc_cfg.split.bw_src = rmid;
>> +
>> +	/* Update the event configuration from the domain */
>> +	if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
>> +		abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
>> +	else
>> +		abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
>> +
>> +	smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd, &abmc_cfg, 1);
>> +
>> +	/*
>> +	 * Reset the architectural state so that reading of hardware
>> +	 * counter is not considered as an overflow in next update.
>> +	 */
>> +	am = get_arch_mbm_state(hw_dom, rmid, evtid);
> 
> Is this necessary when unassigning the counter, or only when assigning?

It is only required when assigning. Will add a check. Thanks.
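
A minimal sketch of what such a check could look like, outside the
kernel. The struct below is only a stand-in for the kernel's
arch_mbm_state (its fields are illustrative), and
reset_arch_state_on_assign() is a hypothetical helper name, not the
change that will be posted.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for the kernel's struct arch_mbm_state; fields are illustrative. */
struct arch_mbm_state {
	unsigned long long chunks;
	unsigned long long prev_msr;
};

/*
 * Clear the saved architectural state only when a counter is being
 * assigned, so the first read after assignment is not mistaken for an
 * overflow. On unassign the state is left untouched.
 */
static void reset_arch_state_on_assign(struct arch_mbm_state *am, bool assign)
{
	if (assign && am)
		memset(am, 0, sizeof(*am));
}
```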


> 
>> +	if (am)
>> +		memset(am, 0, sizeof(*am));
>> +
>> +	return 0;
>> +}
> 
> [...]
> 
> Cheers
> ---Dave
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-19 13:53   ` Dave Martin
@ 2025-02-19 21:09     ` Moger, Babu
  2025-02-20 15:44       ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-19 21:09 UTC (permalink / raw)
  To: Dave Martin
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/19/25 07:53, Dave Martin wrote:
> On Wed, Jan 22, 2025 at 02:20:30PM -0600, Babu Moger wrote:
>> Provide the interface to list the assignment states of all the resctrl
>> groups in mbm_cntr_assign mode.
> 
> [...]
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 5d305d0ac053..6e29827239e0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -975,6 +975,81 @@ static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,
> 
> [...]
> 
>> +static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>> +					   struct seq_file *s, void *v)
>> +{
> 
> [...]
> 
>> +	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
>> +		seq_printf(s, "%s//", rdtg->kn->name);
>> +
>> +		sep = false;
>> +		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>> +			if (sep)
>> +				seq_puts(s, ";");
>> +
>> +			seq_printf(s, "%d=%s", dom->hdr.id,
>> +				   rdtgroup_mon_state_to_str(r, dom, rdtg, str));
>> +
>> +			sep = true;
>> +		}
>> +		seq_putc(s, '\n');
>> +
>> +		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
>> +			seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);
>> +
>> +			sep = false;
>> +			list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>> +				if (sep)
>> +					seq_puts(s, ";");
>> +				seq_printf(s, "%d=%s", dom->hdr.id,
>> +					   rdtgroup_mon_state_to_str(r, dom, crg, str));
> 
> Unlike the other resctrl files, it looks like the total size of this
> data will scale up with the number of existing monitoring groups
> and the lengths of the group names (in addition to the number of
> monitoring domains).
> 
> So, this can easily be more than a page, overflowing internal limits
> in the seq_file and kernfs code.
> 
> Do we need to track some state between read() calls?  This can be done
> by overriding the kernfs .open() and .release() methods and hanging
> some state data (or an rdtgroup_file pointer) on of->priv.
> 
> Also, if we allow the data to be read out in chunks, then we would
> either have to snapshot all the data in one go and stash the unread
> tail in the kernel, or we would need to move over to RCU-based
> enumeration or similar -- otherwise releasing rdtgroup_mutex in the
> middle of the enumeration in order to return data to userspace is going
> to be a problem...

Good catch.

I see that a similar buffer overflow is handled by calling seq_buf_clear()
(see process_durations() or show_user_instructions()).

How about handling this by calling rdt_last_cmd_clear() before printing
each group?

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 484d6009869f..1828f59eb723 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1026,6 +1026,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
        }

        list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+               rdt_last_cmd_clear();
                seq_printf(s, "%s//", rdtg->kn->name);

                sep = false;
@@ -1041,6 +1042,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
                seq_putc(s, '\n');

                list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
+                       rdt_last_cmd_clear();
                        seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);

                        sep = false;


> 
>> +				sep = true;
>> +			}
>> +			seq_putc(s, '\n');
>> +		}
>> +	}
>> +
>> +	mutex_unlock(&rdtgroup_mutex);
>> +	cpus_read_unlock();
>> +	return 0;
>> +}
> 
> [...]
> 
> Cheers
> ---Dave
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-19 16:07   ` Dave Martin
  2025-02-19 17:43     ` Luck, Tony
@ 2025-02-20  0:34     ` Moger, Babu
  2025-02-20 15:21       ` Dave Martin
  1 sibling, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-20  0:34 UTC (permalink / raw)
  To: Dave Martin, Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/19/2025 10:07 AM, Dave Martin wrote:
> Hi,
> 
> On Wed, Jan 22, 2025 at 02:20:31PM -0600, Babu Moger wrote:
>> When mbm_cntr_assign mode is enabled, users can designate which of the MBM
>> events in the CTRL_MON or MON groups should have counters assigned.
>>
>> Provide an interface for assigning MBM events by writing to the file:
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control. Using this interface,
>> events can be assigned or unassigned as needed.
>>
>> Format is similar to the list format with addition of opcode for the
>> assignment operation.
>>   "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> 
> [...]
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 6e29827239e0..299839bcf23f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
> 
> [...]
> 
>> +static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
>> +						char *buf, size_t nbytes, loff_t off)
>> +{
>> +	struct rdt_resource *r = of->kn->parent->priv;
>> +	char *token, *cmon_grp, *mon_grp;
>> +	enum rdt_group_type rtype;
>> +	int ret;
>> +
>> +	/* Valid input requires a trailing newline */
>> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
>> +		return -EINVAL;
>> +
>> +	buf[nbytes - 1] = '\0';
>> +
>> +	cpus_read_lock();
>> +	mutex_lock(&rdtgroup_mutex);
>> +
>> +	rdt_last_cmd_clear();
>> +
>> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
>> +		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
>> +		mutex_unlock(&rdtgroup_mutex);
>> +		cpus_read_unlock();
>> +		return -EINVAL;
>> +	}
>> +
>> +	while ((token = strsep(&buf, "\n")) != NULL) {
>> +		/*
>> +		 * The write command follows the following format:
>> +		 * "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>> +		 * Extract the CTRL_MON group.
>> +		 */
>> +		cmon_grp = strsep(&token, "/");
>> +
> 
> As when reading this file, I think that the data can grow larger than a
> page and get split into multiple write() calls.
> 
> I don't currently think the file needs to be redesigned, but there are
> some concerns about how userspace will work with it that need to be
> sorted out.
> 
> Every monitoring group can contribute a line to this file:
> 
> 	CTRL_GROUP / MON_GROUP / DOMAIN = [t][l] [ ; DOMAIN = [t][l] ]* LF
> 
> so, 2 * (NAME_MAX + 1) + NUM_DOMAINS * 5 - 1 + 1
> 
> NAME_MAX on Linux is 255, so with, say, up to 16 domains, that's about
> 600 bytes per monitoring group in the worst case.
> 
> We don't need to have many control and monitoring groups for this to
> grow potentially over 4K.
> 
> 
> We could simply place a limit on how much userspace is allowed to write
> to this file in one go, although this restriction feels difficult for
> userspace to follow -- but maybe this is workable in the short term, on
> current systems (?)
> 
> Otherwise, since we expect this interface to be written using scripting
> languages, I think we need to be prepared to accept fully-buffered
> I/O.  That means that the data may be cut at random places, not
> necessarily at newlines.  (For smaller files such as schemata this is
> not such an issue, since the whole file is likely to be small enough to
> fit into the default stdio buffers -- this is how sysfs gets away with
> it IIUC.)
> 
> For fully-buffered I/O, we may have to cache an incomplete line in
> between write() calls.  If there is a dangling incomplete line when the
> file is closed then it is hard to tell userspace, because people often
> don't bother to check the return value of close(), fclose() etc.
> However, since it's an ABI violation for userspace to end this file
> with a partial line, I think it's sufficient to report that via
> last_cmd_status.  (Making close() return -EIO still seems a good idea
> though, just in case userspace is listening.)

It seems we can add a check in resctrl_mbm_assign_control_write() to
test whether nbytes > PAGE_SIZE.

But do we really need this? I have no way of testing this. Help me 
understand.
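
For reference, the worst-case arithmetic from the quoted estimate can be
written out as a small sketch. Like that estimate, it assumes
single-digit domain ids and at most the two flags "tl" per domain, so it
is only approximate for larger systems. The function names are
hypothetical.

```c
#include <stddef.h>

#define NAME_MAX_BYTES 255	/* NAME_MAX on Linux */

/*
 * Worst-case bytes for one line of mbm_assign_control:
 *   two group names, each followed by '/'  -> 2 * (NAME_MAX + 1)
 *   "<id>=tl" plus ';' per domain          -> num_domains * 5
 *   no ';' after the last domain           -> - 1
 *   trailing newline                       -> + 1
 */
static size_t worst_case_line_bytes(size_t num_domains)
{
	return 2 * (NAME_MAX_BYTES + 1) + num_domains * 5 - 1 + 1;
}

/* How many worst-case lines fit in one buffer of buf_bytes. */
static size_t groups_per_buffer(size_t buf_bytes, size_t num_domains)
{
	return buf_bytes / worst_case_line_bytes(num_domains);
}
```

With 16 domains this gives 592 bytes per line, so a 4096-byte buffer
holds only 6 worst-case lines.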

All these file operations go through the generic kernfs_fop_write_iter() call.
Doesn't it take care of the buffer check and overflow?
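
The "cache an incomplete line between write() calls" idea from the
quoted text can be sketched in plain userspace C as below. This is not
kernfs code: struct line_buf, line_buf_feed() and the 1024-byte cap are
assumptions chosen for the example, and a kernel version would hang the
state on the open file instead.

```c
#include <stddef.h>
#include <string.h>

#define LINE_MAX_BYTES 1024	/* arbitrary cap chosen for this sketch */

/* Pending partial line carried over between chunks. */
struct line_buf {
	char data[LINE_MAX_BYTES];
	size_t len;
};

typedef void (*line_fn)(const char *line, void *ctx);

/*
 * Feed one chunk (as delivered by one write() call). fn is called once
 * for every complete, newline-terminated line; any trailing partial line
 * stays cached in lb. Returns 0, or -1 if a line exceeds the cap.
 */
static int line_buf_feed(struct line_buf *lb, const char *chunk, size_t n,
			 line_fn fn, void *ctx)
{
	for (size_t i = 0; i < n; i++) {
		if (chunk[i] == '\n') {
			lb->data[lb->len] = '\0';
			fn(lb->data, ctx);
			lb->len = 0;
		} else {
			if (lb->len + 1 >= LINE_MAX_BYTES)
				return -1;
			lb->data[lb->len++] = chunk[i];
		}
	}
	return 0;
}

/* At close time: nonzero means the input ended with a dangling line. */
static int line_buf_has_partial(const struct line_buf *lb)
{
	return lb->len != 0;
}

/* Example consumer: counts lines and remembers the most recent one. */
struct line_stats {
	int count;
	char last[LINE_MAX_BYTES];
};

static void record_line(const char *line, void *ctx)
{
	struct line_stats *st = ctx;

	st->count++;
	strcpy(st->last, line);	/* safe: line is shorter than LINE_MAX_BYTES */
}
```

A check with line_buf_has_partial() at release time is where a dangling
line could be reported, for example through last_cmd_status.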


> 
> I hacked up something a bit like this so that schemata could be written
> interactively from the shell, so I can try to port that onto this series
> as an illustration, if it helps.
> 
> Cheers
> ---Dave
> 

Thanks
Babu


* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-19 17:55       ` Reinette Chatre
@ 2025-02-20 10:35         ` Peter Newman
  2025-02-20 13:40           ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Peter Newman @ 2025-02-20 10:35 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Dave Martin, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Wed, Feb 19, 2025 at 6:55 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Dave and Peter,
>
> On 2/19/25 6:09 AM, Peter Newman wrote:
> > Hi Dave,
> >
> > On Wed, Feb 19, 2025 at 2:41 PM Dave Martin <Dave.Martin@arm.com> wrote:
> >>
> >> Hi,
> >>
> >> On Wed, Jan 22, 2025 at 02:20:25PM -0600, Babu Moger wrote:
> >>> Assign/unassign counters on resctrl group creation/deletion. Two counters
> >>> are required per group, one for MBM total event and one for MBM local
> >>> event.
> >>>
> >>> There are a limited number of counters available for assignment. If these
> >>> counters are exhausted, the kernel will display the error message: "Out of
> >>> MBM assignable counters". However, it is not necessary to fail the
> >>> creation of a group due to assignment failures. Users have the flexibility
> >>> to modify the assignments at a later time.
> >>
> >> If we are doing this, should turning mbm_cntr_assign mode on also
> >> trigger auto-assignment for all extant monitoring groups?
> >>
> >> Either way though, this auto-assignment feels like a potential nuisance
> >> for userspace.
>
> hmmm ... this auto-assignment was created with the goal to help userspace.
> In mbm_cntr_assign mode the user will only see data when a counter is assigned
> to an event. mbm_cntr_assign mode is selected as default on a system that
> supports ABMC. Without auto assignment a user will thus see different
> behavior when reading the monitoring events when the user switches to a kernel with
> assignable counter support: Before assignable counter support events will have
> data, with assignable counter support the events will not have data.
>
> I understood that interfaces should not behave differently when user space
> switches kernels and that is what the auto assignment aims to solve.
>
> >>
> >> If the userspace use-case requires too many monitoring groups for the
> >> available counters, then the kernel will auto-assign counters to a
> >> random subset of groups which may or may not be the ones that userspace
> >> wanted to monitor; then userspace must manually look for the assigned
> >> counters and unassign some of them before they can be assigned where
> >> userspace actually wanted them.
> >>
> >> This is not impossible for userspace to cope with, but it feels
> >> awkward.
> >>
> >> Is there a way to inhibit auto-assignment?
> >>
> >> Or could automatic assignments be considered somehow "weak", so that
> >> new explicit assignments can steal automatically assigned counters
> >> without the need to unassign them explicitly?
> >
> > We had an incomplete discussion about this early on[1]. I guess I
> > didn't revisit it because I found it was trivial to add a flag that
> > inhibits the assignment behavior during mkdir and had moved on to
> > bigger issues.
>
> Could you please remind me how a user will set this flag?

Quoting my original suggestion[1]:

 "info/L3_MON/mbm_assign_on_mkdir?

  boolean (parsed with kstrtobool()), defaulting to true?"

After mount, any groups that got counters on creation would have to be
cleaned up, but at least that can be done with forward progress once
the flag is cleared.

I was able to live with that as long as there aren't users polling for
resctrl to be mounted and immediately creating groups. For us, a
single container manager service manages resctrl.
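
For readers unfamiliar with kstrtobool(), here is a userspace
approximation of the inputs it is understood to accept, which is what
writing to such a flag file would be parsed with. The accepted set shown
here is an assumption and should be checked against lib/kstrtox.c;
parse_bool() is a hypothetical name.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Approximation of kstrtobool(): only the first one or two characters
 * are examined, so "1\n", "yes" and "on" all parse as true.
 * Returns 0 on success or -EINVAL for unrecognised input.
 */
static int parse_bool(const char *s, bool *res)
{
	if (!s)
		return -EINVAL;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" / "off": the second character decides. */
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		}
		break;
	}
	return -EINVAL;
}
```

With a parser like this, writing "0" or "off" to the proposed file would
clear the flag, and "1" or "on" would set it again.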

>
> >
> > If an agent creating directories isn't coordinated with the agent
> > managing counters, a series of creating and destroying a group could
> > prevent a monitor assignment from ever succeeding because it's not
> > possible to atomically discover the name of the new directory that
> > stole the previously-available counter and reassign it.
> >
> > However, if the counter-manager can get all the counters assigned once
> > and only move them with atomic reassignments, it will become
> > impossible to snatch them with a mkdir.
> >
>
> You make many points that show auto-assignment is not ideal, but I
> remain concerned that not doing something like this will break
> existing users who are not as familiar with resctrl internals.

I agree auto-assignment should be the default. I just want an official
way to turn it off.

Thanks!
-Peter

[1] https://lore.kernel.org/lkml/CALPaoCiJ9ELXkij-zsAhxC1hx8UUR+KMPJH6i8c8AT6_mtXs+Q@mail.gmail.com/


* Re: [PATCH v11 01/23] x86/resctrl: Add __init attribute to functions called from resctrl_late_init()
  2025-02-19 16:53     ` Moger, Babu
@ 2025-02-20 13:29       ` Dave Martin
  0 siblings, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-20 13:29 UTC (permalink / raw)
  To: Moger, Babu
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Wed, Feb 19, 2025 at 10:53:41AM -0600, Moger, Babu wrote:
> Hi Dave,
> 
> On 2/19/25 07:28, Dave Martin wrote:
> > Hi,
> > 
> > On Wed, Jan 22, 2025 at 02:20:09PM -0600, Babu Moger wrote:
> >> resctrl_late_init() has the __init attribute, but some of the functions
> >> called from it do not have the __init attribute.
> >>
> >> Add the __init attribute to all the functions in the call sequences to
> >> maintain consistency throughout.
> > 
> > (BTW, did you just find these cases by inspection, or were you getting
> > build warnings?
> > 
> > Even with CONFIG_DEBUG_SECTION_MISMATCH=y, I struggle to get build
> > warnings about section mismatches on inlined functions.  Even building
> > with -fno-inline doesn't flag them all up (though I don't think this
> > suppresses all inlining).
> > 
> > If you have a way of tracking these cases down automatically, I'd be
> > interested to know so that I can apply it elsewhere.)
> 
> It is mostly by code inspection at this point.
> 
> You can refer to this commit [1].
> 
> We used to see section mismatch warnings when non-init functions call
> __init functions.
> 
> MODPOST Module.symvers
> WARNING: modpost: vmlinux: section mismatch in reference:
> rdt_get_mon_l3_config+0x2b5 (section: .text) -> rdt_cpu_has (section:
> .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference:
> rdt_get_mon_l3_config+0x408 (section: .text) -> rdt_cpu_has (section:
> .init.text)
> 
> 
> 1.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.14-rc3&id=bd334c86b5d70e5d1c6169991802e62c828d6f38

Right.

No problem with this patch, but I'll bear in mind for the future that
CONFIG_DEBUG_SECTION_MISMATCH has limitations...

Cheers
---Dave


* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-02-19 18:07           ` Moger, Babu
@ 2025-02-20 13:33             ` Dave Martin
  0 siblings, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-20 13:33 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Reinette Chatre, Moger, Babu, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman, x86, hpa, paulmck, akpm,
	thuth, rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Wed, Feb 19, 2025 at 12:07:30PM -0600, Moger, Babu wrote:
> Hi Dave,
> 
> On 2/19/25 07:30, Dave Martin wrote:
> > Hi,
> > 
> > On Mon, Feb 10, 2025 at 10:10:26AM -0800, Reinette Chatre wrote:
> >> Hi Babu,
> >>
> >> On 2/7/25 10:23 AM, Moger, Babu wrote:
> >>> On 2/5/2025 5:57 PM, Reinette Chatre wrote:
> >>>> On 1/22/25 12:20 PM, Babu Moger wrote:
> > 
> > [...]
> > 
> >>>>> MBM events of a monitoring group is tracked by hardware. Such queries
> >>>>> are acceptable because of a very small number of assignable counters.
> >>>>
> >>>> It is not obvious what "very small number" means. Is it possible to give
> >>>> a range to help reader understand the motivation?
> >>>
> >>> How about?
> >>>
> >>> MBM events of a monitoring group is tracked by hardware. Such queries
> >>> are acceptable because of a very small number of assignable counters(32 to 64).
> >>
> >> Yes, thank you. This helps to understand the claim.
> >>
> >> Reinette
> > 
> > Do these queries only happen when userspace reads an mbm_assign_control
> > file?
> 
> Yes. All these queries are initiated by userspace in the form of
> individual assignments or creating a group(mkdir).
> 
> > 
> > It might be worth documenting somewhere that writing and (especially)
> > reading an mbm_assign_control file is not intended to be super-fast.
> 
> 
> We can drop the last sentence if it is creating confusion.
> 
> > 
> > It feels like userspace should not generally rely on reading
> > mbm_assign_control files except for diagnostic purposes, or occasional
> > read-modify-write transformations.  Or do expect some other usage model
> > that makes this a hotter path?
> > 
> > Cheers
> > ---Dave
> 
> Our earlier interface was intended to query each group separately. After
> the input from Peter, we changed it to batched query. One query from
> userspace can list all the assignments. I am not aware of any other usage
> model.

Right, that's what I thought.

I'll defer to Reinette on whether it's important to keep the statement
about rationale -- it might indeed be easier to drop it if it just
raises more questions.

Cheers
---Dave

* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-20 10:35         ` Peter Newman
@ 2025-02-20 13:40           ` Dave Martin
  2025-02-20 17:08             ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-20 13:40 UTC (permalink / raw)
  To: Peter Newman
  Cc: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Thu, Feb 20, 2025 at 11:35:56AM +0100, Peter Newman wrote:
> Hi Reinette,
> 
> On Wed, Feb 19, 2025 at 6:55 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
> >
> > Hi Dave and Peter,
> >
> > On 2/19/25 6:09 AM, Peter Newman wrote:
> > > Hi Dave,
> > >
> > > On Wed, Feb 19, 2025 at 2:41 PM Dave Martin <Dave.Martin@arm.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> On Wed, Jan 22, 2025 at 02:20:25PM -0600, Babu Moger wrote:
> > >>> Assign/unassign counters on resctrl group creation/deletion. Two counters
> > >>> are required per group, one for MBM total event and one for MBM local
> > >>> event.
> > >>>
> > >>> There are a limited number of counters available for assignment. If these
> > >>> counters are exhausted, the kernel will display the error message: "Out of
> > >>> MBM assignable counters". However, it is not necessary to fail the
> > >>> creation of a group due to assignment failures. Users have the flexibility
> > >>> to modify the assignments at a later time.
> > >>
> > >> If we are doing this, should turning mbm_cntr_assign mode on also
> > >> trigger auto-assignment for all extant monitoring groups?
> > >>
> > >> Either way though, this auto-assignment feels like a potential nuisance
> > >> for userspace.
> >
> > hmmm ... this auto-assignment was created with the goal to help userspace.
> > In mbm_cntr_assign mode the user will only see data when a counter is assigned
> > to an event. mbm_cntr_assign mode is selected as default on a system that
> > supports ABMC. Without auto assignment a user will thus see different
> > behavior when reading the monitoring events when the user switches to a kernel with
> > assignable counter support: Before assignable counter support events will have
> > data, with assignable counter support the events will not have data.
> >
> > I understood that interfaces should not behave differently when user space
> > switches kernels and that is what the auto assignment aims to solve.
> >
> > >>
> > >> If the userspace use-case requires too many monitoring groups for the
> > >> available counters, then the kernel will auto-assign counters to a
> > >> random subset of groups which may or may not be the ones that userspace
> > >> wanted to monitor; then userspace must manually look for the assigned
> > >> counters and unassign some of them before they can be assigned where
> > >> userspace actually wanted them.
> > >>
> > >> This is not impossible for userspace to cope with, but it feels
> > >> awkward.
> > >>
> > >> Is there a way to inhibit auto-assignment?
> > >>
> > >> Or could automatic assignments be considered somehow "weak", so that
> > >> new explicit assignments can steal automatically assigned counters
> > >> without the need to unassign them explicitly?
> > >
> > > We had an incomplete discussion about this early on[1]. I guess I
> > > didn't revisit it because I found it was trivial to add a flag that
> > > inhibits the assignment behavior during mkdir and had moved on to
> > > bigger issues.
> >
> > Could you please remind me how a user will set this flag?
> 
> Quoting my original suggestion[1]:
> 
>  "info/L3_MON/mbm_assign_on_mkdir?
> 
>   boolean (parsed with kstrtobool()), defaulting to true?"
> 
> After mount, any groups that got counters on creation would have to be
> cleaned up, but at least that can be done with forward progress once
> the flag is cleared.
> 
> I was able to live with that as long as there aren't users polling for
> resctrl to be mounted and immediately creating groups. For us, a
> single container manager service manages resctrl.
> 
> >
> > >
> > > If an agent creating directories isn't coordinated with the agent
> > > managing counters, a series of creating and destroying a group could
> > > prevent a monitor assignment from ever succeeding because it's not
> > > possible to atomically discover the name of the new directory that
> > > stole the previously-available counter and reassign it.
> > >
> > > However, if the counter-manager can get all the counters assigned once
> > > and only move them with atomic reassignments, it will become
> > > impossible to snatch them with a mkdir.
> > >
> >
> > You raise many points that make auto-assignment less than ideal, but I
> > remain concerned that not doing something like this will break
> > existing users who are not as familiar with resctrl internals.
> 
> I agree auto-assignment should be the default. I just want an official
> way to turn it off.
> 
> Thanks!
> -Peter
> 
> [1] https://lore.kernel.org/lkml/CALPaoCiJ9ELXkij-zsAhxC1hx8UUR+KMPJH6i8c8AT6_mtXs+Q@mail.gmail.com/
> 

+1

That's basically my position -- the auto-assignment feels like a
_potential_ nuisance for ABMC-aware users, but it depends on what they
are trying to do.  Migration of non-ABMC-aware users will be easier for
basic use cases if auto-assignment occurs by default (as in this
series).

Having an explicit way to turn this off seems perfectly reasonable
(and could be added later on, if not provided in this series).


What about the question re whether turning mbm_cntr_assign mode on
should trigger auto-assignment?

Currently turning this mode off and then on again has the effect of
removing all automatic assignments for extant groups.  This feels
surprising and/or unintentional (?)

Cheers
---Dave

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-19 17:56                   ` Reinette Chatre
@ 2025-02-20 14:53                     ` Peter Newman
  2025-02-20 18:36                       ` Reinette Chatre
  2025-02-20 16:46                     ` Dave Martin
  1 sibling, 1 reply; 209+ messages in thread
From: Peter Newman @ 2025-02-20 14:53 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 2/19/25 3:28 AM, Peter Newman wrote:
> > Hi Reinette,
> >
> > On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>> Hi Reinette,
> >>>
> >>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>> <reinette.chatre@intel.com> wrote:
> >>>>
> >>>> Hi Babu,
> >>>>
> >>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>
> >>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>
> >>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>
> >>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>
> >>>>>>>> mbm_local_read_bytes a
> >>>>>>>> mbm_local_write_bytes b
> >>>>>>>>
> >>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>> <value>
> >>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>
> >>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>
> >>>> As mentioned above, one possible issue with existing interface is that
> >>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>> is low enough to be of concern.
> >>>
> >>> The events which can be monitored by a single counter on ABMC and MPAM
> >>> so far are combinable, so 26 counters per group today means it limits
> >>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>> investigation, I would question whether they know what they're looking
> >>> for.
> >>
> >> The key here is "so far" as well as the focus on MBM only.
> >>
> >> It is impossible for me to predict what we will see in a couple of years
> >> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >> to support their users. Just looking at the Intel RDT spec the event register
> >> has space for 32 events for each "CPU agent" resource. That does not take into
> >> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >> that he is working on patches [1] that will add new events and shared the idea
> >> that we may be trending to support "perf" like events associated with RMID. I
> >> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >> customers.
> >> This all makes me think that resctrl should be ready to support more events than 26.
> >
> > I was thinking of the letters as representing a reusable, user-defined
> > event-set for applying to a single counter rather than as individual
> > events, since MPAM and ABMC allow us to choose the set of events each
> > one counts. Wherever we define the letters, we could use more symbolic
> > event names.
>
> Thank you for clarifying.
>
> >
> > In the letters as events model, choosing the events assigned to a
> > group wouldn't be enough information, since we would want to control
> > which events should share a counter and which should be counted by
> > separate counters. I think the amount of information that would need
> > to be encoded into mbm_assign_control to represent the level of
> > configurability supported by hardware would quickly get out of hand.
> >
> > Maybe as an example, one counter for all reads, one counter for all
> > writes in ABMC would look like...
> >
> > (L3_QOS_ABMC_CFG.BwType field names below)
> >
> > (per domain)
> > group 0:
> >  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >  counter 1: VictimBW,LclNTWr,RmtNTWr
> > group 1:
> >  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >  counter 3: VictimBW,LclNTWr,RmtNTWr
> > ...
> >
>
> I think this may also be what Dave was heading towards in [2] but in that
> example and above the counter configuration appears to be global. You do mention
> "configurability supported by hardware" so I wonder if per-domain counter
> configuration is a requirement?

If it's global and we want a particular group to be watched by more
counters, I wouldn't want this to result in allocating more counters
for that group in all domains, or allocating counters in domains where
they're not needed. I want to encourage my users to avoid allocating
monitoring resources in domains where a job is not allowed to run so
there's less pressure on the counters.

In Dave's proposal it looks like global configuration means
globally-defined "named counter configurations", which works because
it's really per-domain assignment of the configurations to however
many counters the group needs in each domain.

>
> Until now I viewed counter configuration separate from counter assignment,
> similar to how AMD's counters can be configured via mbm_total_bytes_config and
> mbm_local_bytes_config before they are assigned. That is still per-domain
> counter configuration though, not per-counter.
>
> > I assume packing all of this info for a group's desired counter
> > configuration into a single line (with 32 domains per line on many
> > dual-socket AMD configurations I see) would be difficult to look at,
> > even if we could settle on a single letter to represent each
> > universally.
> >
> >>
> >> My goal is for resctrl to have a user interface that can as much as possible
> >> be ready for whatever may be required from it years down the line. Of course,
> >> I may be wrong and resctrl would never need to support more than 26 events per
> >> resource (*). The risk is that resctrl *may* need to support more than 26 events
> >> and how could resctrl support that?
> >>
> >> What is the risk of supporting more than 26 events? As I highlighted earlier
> >> the interface I used as demonstration may become unwieldy to parse on a system
> >> with many domains that supports many events. This is a concern for me. Any suggestions
> >> will be appreciated, especially from you since I know that you are very familiar with
> >> issues related to large scale use of resctrl interfaces.
> >
> > It's mainly just the unwieldiness of all the information in one file.
> > It's already at the limit of what I can visually look through.
>
> I agree.
>
> >
> > I believe that shared assignments will take care of all the
> > high-frequency and performance-intensive batch configuration updates I
> > was originally concerned about, so I no longer see much benefit in
> > finding ways to textually encode all this information in a single file
> > when it would be more manageable to distribute it around the
> > filesystem hierarchy.
>
> This is significant. The motivation for the single file was to support
> the "high-frequency and performance-intensive" usage. Would "shared assignments"
> not also depend on the same files that, if distributed, will require many
> filesystem operations?
> Having the files distributed will be significantly simpler while also
> avoiding the file size issue that Dave Martin exposed.

The remaining filesystem operations will be assigning or removing
shared counter assignments in the applicable domains, which would
normally correspond to mkdir/rmdir of groups or changing their CPU
affinity. The shared assignments are more "program and forget", while
the exclusive assignment approach requires updates for every counter
(in every domain) every few seconds to cover a large number of groups.

When they want to pay extra attention to a particular group, I expect
they'll ask for exclusive counters and leave them assigned for a while
as they collect extra data.

-Peter

* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-19 17:43     ` Luck, Tony
@ 2025-02-20 14:57       ` Dave Martin
  0 siblings, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-02-20 14:57 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Babu Moger, corbet@lwn.net, Chatre, Reinette, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com, x86@kernel.org, hpa@zytor.com,
	paulmck@kernel.org, akpm@linux-foundation.org, thuth@redhat.com,
	rostedt@goodmis.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	jpoimboe@kernel.org, perry.yuan@amd.com, sandipan.das@amd.com,
	Huang, Kai, Li, Xiaoyao, seanjc@google.com, Li, Xin3,
	andrew.cooper3@citrix.com, ebiggers@google.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Wieczor-Retman, Maciej,
	Eranian, Stephane

[Dropped Cc: fenghua.yu@intel.com <fenghua.yu@intel.com> (bounces)]

On Wed, Feb 19, 2025 at 05:43:43PM +0000, Luck, Tony wrote:
> > I hacked up something a bit like this so that schemata could be written
> > interactively from the shell, so I can try to port that onto this series
> > as an illustration, if it helps.
> 
> Note that schemata will accept writes that just change the bits you want to change.
> 
> So from the shell:
> 
> # cat schemata
> MB:0=100;1=100
> L3:0=fff;1=fff
> 
> # echo "MB:1=90" > schemata
> 
> # cat schemata
> MB:0=100;1= 90
> L3:0=fff;1=fff
> 
> -Tony
> 

Yes, but not:

# {
	p=:
	echo -n MB
	for ((d = 0; d < 2; d++)); do
		echo -n "$p$d=100"
		p=';'
	done
	echo
  } >schemata

(Or at least, it depends on the shell.  Each simple command that
generates output can result in a separate write() call -- certainly
there is no guarantee that it won't.)

Doing the same thing from C will "work", because by default I/O on the
schemata file will be fully buffered in userspace... unless the whole
output exceeds the default buffer size.
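
A minimal sketch of the single-write approach in C, assuming a two-domain
MB line like the one in the example above.  The helper names
format_mb_line() and write_whole_line() are made up for illustration and
are not part of any resctrl tooling.  The point is only that the line is
assembled in memory first and then handed to write() once, so nothing
depends on how the shell or stdio happens to split the output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Format "MB:0=<val>;1=<val>;...\n" for ndomains domains into buf.
 * Returns the line length, or -1 if buf is too small.
 */
static int format_mb_line(char *buf, size_t size, int ndomains, int val)
{
	size_t used;
	int d, n;

	assert(buf != NULL && size > 0);

	n = snprintf(buf, size, "MB");
	if (n < 0 || (size_t)n >= size)
		return -1;
	used = (size_t)n;

	for (d = 0; d < ndomains; d++) {
		/* ':' before the first domain, ';' between the rest */
		n = snprintf(buf + used, size - used, "%c%d=%d",
			     d ? ';' : ':', d, val);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}

	/* Need room for the newline and the terminating NUL */
	if (used + 2 > size)
		return -1;
	buf[used++] = '\n';
	buf[used] = '\0';
	return (int)used;
}

/* Hand the finished line over in exactly one write() call. */
static int write_whole_line(int fd, const char *line, size_t len)
{
	ssize_t ret = write(fd, line, len);

	return (ret == (ssize_t)len) ? 0 : -1;
}
```

With fd opened on the schemata file, the shell loop above would collapse
into one format_mb_line() call followed by one write_whole_line() call.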

The difference from sysfs here is that it would be insane to write a
small, single formatted value in pieces when it is natural to generate
it from a single format specifier -- whereas the syntax of some of
resctrl's files has a multilevel internal structure that has to be
built up in a piecemeal fashion (whether or not it is written to the
file in one go).

I'm not saying that this is an issue for realistic uses though, and
anyway, the schemata file is nothing to do with this series.

Cheers
---Dave

* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-20  0:34     ` Moger, Babu
@ 2025-02-20 15:21       ` Dave Martin
  2025-02-20 20:57         ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-20 15:21 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Wed, Feb 19, 2025 at 06:34:42PM -0600, Moger, Babu wrote:
> Hi Dave,
> 
> On 2/19/2025 10:07 AM, Dave Martin wrote:
> > Hi,
> > 
> > On Wed, Jan 22, 2025 at 02:20:31PM -0600, Babu Moger wrote:

> > [...]
> > 
> > > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > > index 6e29827239e0..299839bcf23f 100644
> > > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > > @@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
> > 
> > [...]
> > 
> > > +static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
> > > +						char *buf, size_t nbytes, loff_t off)
> > > +{

[...]

> > > +	while ((token = strsep(&buf, "\n")) != NULL) {
> > > +		/*
> > > +		 * The write command follows the following format:
> > > +		 * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
> > > +		 * Extract the CTRL_MON group.
> > > +		 */
> > > +		cmon_grp = strsep(&token, "/");
> > > +
> > 
> > As when reading this file, I think that the data can grow larger than a
> > page and get split into multiple write() calls.
> > 
> > I don't currently think the file needs to be redesigned, but there are
> > some concerns about how userspace will work with it that need to be
> > sorted out.
> > 
> > Every monitoring group can contribute a line to this file:
> > 
> > 	CTRL_GROUP / MON_GROUP / DOMAIN = [t][l] [ ; DOMAIN = [t][l] ]* LF
> > 
> > so, 2 * (NAME_MAX + 1) + NUM_DOMAINS * 5 - 1 + 1
> > 
> > NAME_MAX on Linux is 255, so with, say, up to 16 domains, that's about
> > 600 bytes per monitoring group in the worst case.
> > 
> > We don't need to have many control and monitoring groups for this to
> > grow potentially over 4K.
> > 
> > 
> > We could simply place a limit on how much userspace is allowed to write
> > to this file in one go, although this restriction feels difficult for
> > userspace to follow -- but maybe this is workable in the short term, on
> > current systems (?)
> > 
> > Otherwise, since we expect this interface to be written using scripting
> > languages, I think we need to be prepared to accept fully-buffered
> > I/O.  That means that the data may be cut at random places, not
> > necessarily at newlines.  (For smaller files such as schemata this is
> > not such an issue, since the whole file is likely to be small enough to
> > fit into the default stdio buffers -- this is how sysfs gets away with
> > it IIUC.)
> > 
> > For fully-buffered I/O, we may have to cache an incomplete line in
> > between write() calls.  If there is a dangling incomplete line when the
> > file is closed then it is hard to tell userspace, because people often
> > don't bother to check the return value of close(), fclose() etc.
> > However, since it's an ABI violation for userspace to end this file
> > with a partial line, I think it's sufficient to report that via
> > last_cmd_status.  (Making close() return -EIO still seems a good idea
> > though, just in case userspace is listening.)
> 
> Seems like we can add a check in resctrl_mbm_assign_control_write() to
> compare nbytes > PAGE_SIZE.

This might be a reasonable stopgap approach, if we are confident that the
number of RMIDs and monitoring domains is small enough on known
platforms that the problem is unlikely to be hit.  I can't really judge
on this.
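
For a feel of the numbers, the worst-case estimate quoted above can be
written out as a small calculation.  This is only a sketch: the constants
and function names are illustrative, the 5 bytes per domain assume a
single-digit domain ID, and the 4096-byte buffer in the note below is an
assumed page size.

```c
#include <assert.h>
#include <stddef.h>

#define NAME_MAX_BYTES		255	/* NAME_MAX on Linux */
#define BYTES_PER_DOMAIN	5	/* "<id>=tl" plus ';', one-digit id */

/*
 * Worst-case size of one mbm_assign_control line: two group names, each
 * followed by '/', then the domain entries, minus the last ';', plus LF.
 */
static size_t worst_case_line_bytes(size_t ndomains)
{
	assert(ndomains > 0);
	return 2 * (NAME_MAX_BYTES + 1) + ndomains * BYTES_PER_DOMAIN - 1 + 1;
}

/* How many worst-case lines fit into a buffer of bufsize bytes. */
static size_t worst_case_groups_per_buffer(size_t bufsize, size_t ndomains)
{
	return bufsize / worst_case_line_bytes(ndomains);
}
```

With 16 domains this gives 592 bytes per line, so a 4096-byte buffer
holds only 6 worst-case lines.  Realistic group names are much shorter,
which is why the limit may rarely bite in practice.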

> But do we really need this? I have no way of testing this. Help me
> understand.

It's easy to demonstrate this using the schemata file (which works in a
similar way).  Open f in /sys/fs/resctrl/schemata, then:

	int n = 0;

	for (n = 0; n < 1000; n++)
		if (fputs("MB:0=100;1=100\n", f) == EOF)
			fprintf(stderr, "Failed on iteration %d\n", n);

This will succeed a certain number of times (272, for me) and then fail
when the stdio buffer for f overflows, triggering a write().

Putting an explicit fflush() after every fputs() call (or doing a
setlinebuf(f) before the loop) makes it work.  But this is awkward and
unexpected for the user, and doing the right thing from a scripting
language may be tricky.
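
The line-buffering workaround can be sketched with the standard setvbuf()
call, which is the portable equivalent of setlinebuf().  The helper names
below are illustrative only.  The sketch assumes every line is shorter
than the stdio buffer; a longer line could still be split across several
write() calls.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Switch f to line-buffered mode so that stdio flushes at each '\n'.
 * Must be called before any other I/O on f.  Returns 0 on success.
 */
static int make_line_buffered(FILE *f)
{
	assert(f != NULL);
	return setvbuf(f, NULL, _IOLBF, 0) ? -1 : 0;
}

/*
 * Write count copies of line (which must end in '\n').  Returns the
 * number of lines accepted, stopping at the first failure.
 */
static int write_lines(FILE *f, const char *line, int count)
{
	int n;

	for (n = 0; n < count; n++) {
		if (fputs(line, f) == EOF)
			break;
	}
	return n;
}
```

Calling make_line_buffered(f) right after fopen() and before the loop
above should make each fputs() reach the kernel as a complete line.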

In this example I am doing something a bit artificial -- we don't
officially say what happens when a pre-opened schemata file handle is
reused in this way, AFAICT.  But for mbm_assign_control it is
legitimate to write many lines, and we can hit this kind of problem.


I'll leave it to others to judge whether we _need_ to fix this, but it
feels like a problem waiting to happen.


> All these file operations go thru generic call kernfs_fop_write_iter().
> Doesn't it take care of buffer check and overflow?

No, this is called for each iovec segment (where userspace used one of
the iovec based I/O syscalls).  But there is no buffering or
concatenation of the data read in: each segment gets passed down to the
individual kernfs_file_operations write method for the file:

	len = ops->write(of, buf, len, iocb->ki_pos)

calls down to

	resctrl_mbm_assign_control_write(of, buf, len, iocb->ki_pos).


I'll try to port my buffering hack on top of the series -- that should
help to illustrate what I mean.
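
To make the idea concrete, here is a minimal userspace sketch of carrying
an incomplete line across write() chunks.  It is not the buffering hack
mentioned above and not kernel code: struct line_buf, line_buf_feed(),
the 1024-byte line cap and the example callback are all hypothetical
names and values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_MAX_BYTES 1024	/* hypothetical cap on one line */

struct line_buf {
	char data[LINE_MAX_BYTES];
	size_t len;		/* bytes of incomplete line held so far */
};

typedef void (*line_fn)(const char *line, void *ctx);

/*
 * Feed one write() chunk.  Calls fn for every complete line (newline
 * stripped) and keeps any trailing partial line for the next call.
 * Returns 0, or -1 if a line does not fit in LINE_MAX_BYTES.
 */
static int line_buf_feed(struct line_buf *lb, const char *chunk, size_t n,
			 line_fn fn, void *ctx)
{
	size_t i;

	assert(lb != NULL && chunk != NULL && fn != NULL);
	for (i = 0; i < n; i++) {
		if (chunk[i] == '\n') {
			lb->data[lb->len] = '\0';
			fn(lb->data, ctx);
			lb->len = 0;
		} else if (lb->len + 1 < sizeof(lb->data)) {
			lb->data[lb->len++] = chunk[i];
		} else {
			return -1;
		}
	}
	return 0;
}

/* At close time: nonzero means the input ended with a partial line. */
static int line_buf_has_partial(const struct line_buf *lb)
{
	return lb->len != 0;
}

/* Example callback: counts lines and remembers the last one. */
struct line_log {
	int count;
	char last[LINE_MAX_BYTES];
};

static void log_line(const char *line, void *ctx)
{
	struct line_log *log = ctx;

	log->count++;
	strcpy(log->last, line);	/* line is always < LINE_MAX_BYTES */
}
```

A nonzero line_buf_has_partial() result when the file is closed
corresponds to the dangling partial line discussed earlier, which could
then be reported through last_cmd_status.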

Cheers
---Dave

* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-19 21:09     ` Moger, Babu
@ 2025-02-20 15:44       ` Dave Martin
  2025-02-20 21:29         ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-20 15:44 UTC (permalink / raw)
  To: Moger, Babu
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Wed, Feb 19, 2025 at 03:09:51PM -0600, Moger, Babu wrote:
> Hi Dave,
> 
> On 2/19/25 07:53, Dave Martin wrote:
> > On Wed, Jan 22, 2025 at 02:20:30PM -0600, Babu Moger wrote:
> >> Provide the interface to list the assignment states of all the resctrl
> >> groups in mbm_cntr_assign mode.

[...]

> >> +static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
> >> +					   struct seq_file *s, void *v)
> >> +{

[...]

> > Unlike the other resctrl files, it looks like the total size of this
> > data will scale up with the number of existing monitoring groups
> > and the lengths of the group names (in addition to the number of
> > monitoring domains).
> > 
> > So, this can easily be more than a page, overflowing internal limits
> > in the seq_file and kernfs code.
> > 
> > Do we need to track some state between read() calls?  This can be done
> > by overriding the kernfs .open() and .release() methods and hanging
> > some state data (or an rdtgroup_file pointer) on of->priv.
> > 
> > Also, if we allow the data to be read out in chunks, then we would
> > either have to snapshot all the data in one go and stash the unread
> > tail in the kernel, or we would need to move over to RCU-based
> > enumeration or similar -- otherwise releasing rdtgroup_mutex in the
> > middle of the enumeration in order to return data to userspace is going
> > to be a problem...
> 
> Good catch.
> 
> I see similar buffer overflow is handled by calling seq_buf_clear()
> (look at process_durations() or in show_user_instructions()).
> 
> How about handling this by calling rdt_last_cmd_clear() before printing
> each group?

Does this work?

Once seq_buf_has_overflowed() returns nonzero, data has been lost, no?
So far as I can see, show_user_instructions() just gives up on printing
the affected line, while process_durations() tries to anticipate
overflow and prints out the accumulated text to dmesg before clearing
the buffer.

In our case, we cannot send more data to userspace than was requested
in the read() call, so we might have nowhere to drain the seq_buf
contents to in order to free up space.

sysfs "expects" userspace to do a big enough read() that this problem
doesn't happen.  In practice this is OK because people usually read
through a buffered I/O layer like stdio, and in realistic
implementations the user-side I/O buffer is large enough to hide this
issue.

But mbm_assign_control data is dynamically generated and potentially
much bigger than a typical sysfs file.
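
The "snapshot once, then serve the unread tail" option mentioned earlier
can be sketched in userspace C as follows.  This is an illustration only:
snapshot_read() is a hypothetical helper, loosely analogous to what the
kernel's simple_read_from_buffer() does for a fixed buffer, and it says
nothing about where such a snapshot would be stored or freed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy up to count bytes of a previously captured snapshot into dst,
 * starting at *pos, and advance *pos.  Returns the number of bytes
 * copied; 0 means the whole snapshot has been consumed.
 */
static size_t snapshot_read(const char *snap, size_t snap_len,
			    char *dst, size_t count, size_t *pos)
{
	size_t avail;

	assert(snap != NULL && dst != NULL && pos != NULL);
	if (*pos >= snap_len)
		return 0;

	avail = snap_len - *pos;
	if (count > avail)
		count = avail;
	memcpy(dst, snap + *pos, count);
	*pos += count;
	return count;
}
```

Because every read() is served from the same snapshot, the group list
cannot change between chunks, at the cost of holding the full text in
memory until the file is closed.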

> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 484d6009869f..1828f59eb723 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1026,6 +1026,7 @@ static int resctrl_mbm_assign_control_show(struct
> kernfs_open_file *of,
>         }
> 
>         list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
> +               rdt_last_cmd_clear();
>                 seq_printf(s, "%s//", rdtg->kn->name);
> 
>                 sep = false;
> @@ -1041,6 +1042,7 @@ static int resctrl_mbm_assign_control_show(struct
> kernfs_open_file *of,
>                 seq_putc(s, '\n');
> 
>                 list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
> mon.crdtgrp_list) {
> +                       rdt_last_cmd_clear();

I don't see how this helps.

Surely last_cmd_status has nothing to do with s?

[...]

Cheers
---Dave

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-19 17:56                   ` Reinette Chatre
  2025-02-20 14:53                     ` Peter Newman
@ 2025-02-20 16:46                     ` Dave Martin
  2025-02-20 17:46                       ` Dave Martin
  1 sibling, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-20 16:46 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Peter Newman, Moger, Babu, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Wed, Feb 19, 2025 at 09:56:29AM -0800, Reinette Chatre wrote:
> Hi Peter,
> 
> On 2/19/25 3:28 AM, Peter Newman wrote:

[...]

> > In the letters as events model, choosing the events assigned to a
> > group wouldn't be enough information, since we would want to control
> > which events should share a counter and which should be counted by
> > separate counters. I think the amount of information that would need
> > to be encoded into mbm_assign_control to represent the level of
> > configurability supported by hardware would quickly get out of hand.
> > 
> > Maybe as an example, one counter for all reads, one counter for all
> > writes in ABMC would look like...
> > 
> > (L3_QOS_ABMC_CFG.BwType field names below)
> > 
> > (per domain)
> > group 0:
> >  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >  counter 1: VictimBW,LclNTWr,RmtNTWr
> > group 1:
> >  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >  counter 3: VictimBW,LclNTWr,RmtNTWr
> > ...
> > 
> 
> I think this may also be what Dave was heading towards in [2] but in that
> example and above the counter configuration appears to be global. You do mention
> "configurability supported by hardware" so I wonder if per-domain counter
> configuration is a requirement?
> 
> Until now I viewed counter configuration separate from counter assignment,
> similar to how AMD's counters can be configured via mbm_total_bytes_config and
> mbm_local_bytes_config before they are assigned. That is still per-domain
> counter configuration though, not per-counter.

I hadn't tried to work the design through in any detail: it wasn't
intended as a suggestion for something we should definitely do right
now; rather, it was just an incomplete sketch of one possible future
evolution of the interface.

Either way these feel like future concerns, if the first iteration of
ABMC is just to provide the basics so that ABMC hardware can implement
resctrl without userspace seeing counters randomly stopping and
resetting...

Peter, can you give a view on whether the ABMC as proposed in this series
is a useful stepping-stone?  Or are there things that you need that you
feel could not be added as a later extension without ABI breakage?

[...]

> > I believe that shared assignments will take care of all the
> > high-frequency and performance-intensive batch configuration updates I
> > was originally concerned about, so I no longer see much benefit in
> > finding ways to textually encode all this information in a single file
> > when it would be more manageable to distribute it around the
> > filesystem hierarchy.
> 
> This is significant. The motivation for the single file was to support
> the "high-frequency and performance-intensive" usage. Would "shared assignments"
> not also depend on the same files that, if distributed, will require many
> filesystem operations? 
> Having the files distributed will be significantly simpler while also
> avoiding the file size issue that Dave Martin exposed. 
> 
> Reinette

I still haven't fully understood the "shared assignments" proposal;
I need to go back and look at it.

If we split the file, it will be more closely aligned with the design
of the rest of the resctrlfs interface.

OTOH, the current interface seems workable and I think the file size
issue can be addressed without major re-engineering.

So, from my side, I would not consider the current interface design
a blocker.

[...]

Cheers
---Dave


* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-20 13:40           ` Dave Martin
@ 2025-02-20 17:08             ` Reinette Chatre
  2025-02-21 17:14               ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-20 17:08 UTC (permalink / raw)
  To: Dave Martin, Peter Newman
  Cc: Babu Moger, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/20/25 5:40 AM, Dave Martin wrote:
> On Thu, Feb 20, 2025 at 11:35:56AM +0100, Peter Newman wrote:
>> Hi Reinette,
>>
>> On Wed, Feb 19, 2025 at 6:55 PM Reinette Chatre
>> <reinette.chatre@intel.com> wrote:
>>>
>>> Hi Dave and Peter,
>>>
>>> On 2/19/25 6:09 AM, Peter Newman wrote:
>>>> Hi Dave,
>>>>
>>>> On Wed, Feb 19, 2025 at 2:41 PM Dave Martin <Dave.Martin@arm.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Wed, Jan 22, 2025 at 02:20:25PM -0600, Babu Moger wrote:
>>>>>> Assign/unassign counters on resctrl group creation/deletion. Two counters
>>>>>> are required per group, one for MBM total event and one for MBM local
>>>>>> event.
>>>>>>
>>>>>> There are a limited number of counters available for assignment. If these
>>>>>> counters are exhausted, the kernel will display the error message: "Out of
>>>>>> MBM assignable counters". However, it is not necessary to fail the
>>>>>> creation of a group due to assignment failures. Users have the flexibility
>>>>>> to modify the assignments at a later time.
>>>>>
>>>>> If we are doing this, should turning mbm_cntr_assign mode on also
>>>>> trigger auto-assignment for all extant monitoring groups?
>>>>>
>>>>> Either way though, this auto-assignment feels like a potential nuisance
>>>>> for userspace.
>>>
>>> hmmm ... this auto-assignment was created with the goal to help userspace.
>>> In mbm_cntr_assign mode the user will only see data when a counter is assigned
>>> to an event. mbm_cntr_assign mode is selected as default on a system that
>>> supports ABMC. Without auto assignment a user will thus see different
>>> behavior when reading the monitoring events when the user switches to a kernel with
>>> assignable counter support: Before assignable counter support events will have
>>> data, with assignable counter support the events will not have data.
>>>
>>> I understood that interfaces should not behave differently when user space
>>> switches kernels and that is what the auto assignment aims to solve.
>>>
>>>>>
>>>>> If the userspace use-case requires too many monitoring groups for the
>>>>> available counters, then the kernel will auto-assign counters to a
>>>>> random subset of groups which may or may not be the ones that userspace
>>>>> wanted to monitor; then userspace must manually look for the assigned
>>>>> counters and unassign some of them before they can be assigned where
>>>>> userspace actually wanted them.
>>>>>
>>>>> This is not impossible for userspace to cope with, but it feels
>>>>> awkward.
>>>>>
>>>>> Is there a way to inhibit auto-assignment?
>>>>>
>>>>> Or could automatic assignments be considered somehow "weak", so that
>>>>> new explicit assignments can steal automatically assigned counters
>>>>> without the need to unassign them explicitly?
>>>>
>>>> We had an incomplete discussion about this early on[1]. I guess I
>>>> didn't revisit it because I found it was trivial to add a flag that
>>>> inhibits the assignment behavior during mkdir and had moved on to
>>>> bigger issues.
>>>
>>> Could you please remind me how a user will set this flag?
>>
>> Quoting my original suggestion[1]:
>>
>>  "info/L3_MON/mbm_assign_on_mkdir?
>>
>>   boolean (parsed with kstrtobool()), defaulting to true?"
>>
>> After mount, any groups that got counters on creation would have to be
>> cleaned up, but at least that can be done with forward progress once
>> the flag is cleared.
>>
>> I was able to live with that as long as there aren't users polling for
>> resctrl to be mounted and immediately creating groups. For us, a
>> single container manager service manages resctrl.
>>
>>>
>>>>
>>>> If an agent creating directories isn't coordinated with the agent
>>>> managing counters, a series of creating and destroying a group could
>>>> prevent a monitor assignment from ever succeeding because it's not
>>>> possible to atomically discover the name of the new directory that
>>>> stole the previously-available counter and reassign it.
>>>>
>>>> However, if the counter-manager can get all the counters assigned once
>>>> and only move them with atomic reassignments, it will become
>>>> impossible to snatch them with a mkdir.
>>>>
>>>
>>> You have many points that make auto-assignment less than ideal but I
>>> remain concerned that not doing something like this will break
>>> existing users who are not as familiar with resctrl internals.
>>
>> I agree auto-assignment should be the default. I just want an official
>> way to turn it off.
>>
>> Thanks!
>> -Peter
>>
>> [1] https://lore.kernel.org/lkml/CALPaoCiJ9ELXkij-zsAhxC1hx8UUR+KMPJH6i8c8AT6_mtXs+Q@mail.gmail.com/
>>
> 
> +1
> 
> That's basically my position -- the auto-assignment feels like a
> _potential_ nuisance for ABMC-aware users, but it depends on what they
> are trying to do.  Migration of non-ABMC-aware users will be easier for
> basic use cases if auto-assignment occurs by default (as in this
> series).
> 
> Having an explicit way to turn this off seems perfectly reasonable
> (and could be added later on, if not provided in this series).
> 
> 
> What about the question re whether turning mbm_cntr_assign mode on
> should trigger auto-assignment?
> 
> Currently turning this mode off and then on again has the effect of
> removing all automatic assignments for extant groups.  This feels
> surprising and/or unintentional (?)

Connecting to what you start off by saying I also see auto-assignment
as the way to provide a smooth transition for "non-ABMC-aware" users.

To me a user that turns this mode off and then on again can be
considered as a user that is "ABMC-aware" and turning it "off and then
on again" seems like an intuitive way to get to a "clean slate"
wrt counter assignments. This may also be a convenient way for
an "ABMC-aware" user space to unassign all counters and thus also
helpful if resctrl supports the flag that Peter proposed. The flag
seems to already keep something like this in its context with
a name of "mbm_assign_on_mkdir" that could be interpreted as
"only auto assign on mkdir"?
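
To make the behaviour under discussion concrete, here is a toy user-space
model, not kernel code. It assumes two counters per group (MBM total and MBM
local, as in the patch description), that running out of counters does not
fail group creation, and that turning the mode off and on again drops every
assignment without re-creating any. The pool size, the group limit, and all
names (struct model, model_mkdir(), model_toggle_mode()) are hypothetical;
assign_on_mkdir stands in for the flag Peter proposed.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CNTRS       4  /* hypothetical size of the counter pool */
#define MAX_GROUPS      8  /* hypothetical limit on monitoring groups */
#define CNTRS_PER_GROUP 2  /* one for MBM total, one for MBM local */

/* Toy state: none of these names exist in resctrl. */
struct model {
	bool assign_on_mkdir;        /* stand-in for the proposed flag */
	int free_cntrs;              /* counters not assigned to any group */
	int ngroups;                 /* groups created so far */
	bool has_cntrs[MAX_GROUPS];  /* does group i hold its two counters? */
};

static void model_init(struct model *m)
{
	m->assign_on_mkdir = true;   /* proposed default: auto-assign */
	m->free_cntrs = NUM_CNTRS;
	m->ngroups = 0;
	for (int i = 0; i < MAX_GROUPS; i++)
		m->has_cntrs[i] = false;
}

/*
 * Create a group and return its index, or -1 if the group table is full.
 * Running out of counters does not fail the creation; the group simply
 * starts without counters.
 */
static int model_mkdir(struct model *m)
{
	assert(m->free_cntrs >= 0 && m->free_cntrs <= NUM_CNTRS);

	if (m->ngroups == MAX_GROUPS)
		return -1;

	int g = m->ngroups++;

	if (m->assign_on_mkdir && m->free_cntrs >= CNTRS_PER_GROUP) {
		m->free_cntrs -= CNTRS_PER_GROUP;
		m->has_cntrs[g] = true;
	}
	return g;
}

/*
 * Model of turning the mode off and then on again: every assignment is
 * dropped and nothing is re-assigned to the groups that already exist.
 */
static void model_toggle_mode(struct model *m)
{
	for (int i = 0; i < m->ngroups; i++)
		m->has_cntrs[i] = false;
	m->free_cntrs = NUM_CNTRS;
}
```

In this model, three calls to model_mkdir() with a pool of four counters leave
the third group without counters, and model_toggle_mode() returns all four
counters to the pool, which is the "clean slate" reading described above.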

I am not taking a stand for one or the other approach but instead
trying to be more specific about pros/cons. Could you please provide
more insight into the use case you have in mind so that we can see how
resctrl could behave with few surprises? 

Reinette




* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-20 16:46                     ` Dave Martin
@ 2025-02-20 17:46                       ` Dave Martin
  2025-02-20 18:36                         ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-20 17:46 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Peter Newman, Moger, Babu, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi again,

On Thu, Feb 20, 2025 at 04:46:40PM +0000, Dave Martin wrote:
> Hi,
> 
> On Wed, Feb 19, 2025 at 09:56:29AM -0800, Reinette Chatre wrote:
> > Hi Peter,
> > 
> > On 2/19/25 3:28 AM, Peter Newman wrote:
> 
> [...]
> 
> > > In the letters as events model, choosing the events assigned to a
> > > group wouldn't be enough information, since we would want to control
> > > which events should share a counter and which should be counted by
> > > separate counters. I think the amount of information that would need
> > > to be encoded into mbm_assign_control to represent the level of
> > > configurability supported by hardware would quickly get out of hand.
> > > 
> > > Maybe as an example, one counter for all reads, one counter for all
> > > writes in ABMC would look like...
> > > 
> > > (L3_QOS_ABMC_CFG.BwType field names below)
> > > 
> > > (per domain)
> > > group 0:
> > >  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > >  counter 1: VictimBW,LclNTWr,RmtNTWr
> > > group 1:
> > >  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > >  counter 3: VictimBW,LclNTWr,RmtNTWr
> > > ...
> > > 
> > 
> > I think this may also be what Dave was heading towards in [2] but in that
> > example and above the counter configuration appears to be global. You do mention
> > "configurability supported by hardware" so I wonder if per-domain counter
> > configuration is a requirement?
> > 
> > Until now I viewed counter configuration separate from counter assignment,
> > similar to how AMD's counters can be configured via mbm_total_bytes_config and
> > mbm_local_bytes_config before they are assigned. That is still per-domain
> > counter configuration though, not per-counter.
> 
> I hadn't tried to work the design through in any detail: it wasn't
> intended as a suggestion for something we should definitely do right
> now; rather, it was just an incomplete sketch of one possible future
> evolution of the interface.
> 
> Either way these feel like future concerns, if the first iteration of
> ABMC is just to provide the basics so that ABMC hardware can implement
> resctrl without userspace seeing counters randomly stopping and
> resetting...
> 
> Peter, can you give a view on whether the ABMC as proposed in this series
> is a useful stepping-stone?  Or are there things that you need that you
> feel could not be added as a later extension without ABI breakage?
> 
> [...]
> 
> > > I believe that shared assignments will take care of all the
> > > high-frequency and performance-intensive batch configuration updates I
> > > was originally concerned about, so I no longer see much benefit in
> > > finding ways to textually encode all this information in a single file
> > > > when it would be more manageable to distribute it around the
> > > filesystem hierarchy.
> > 
> > This is significant. The motivation for the single file was to support
> > the "high-frequency and performance-intensive" usage. Would "shared assignments"
> > not also depend on the same files that, if distributed, will require many
> > filesystem operations? 
> > Having the files distributed will be significantly simpler while also
> > avoiding the file size issue that Dave Martin exposed. 
> > 
> > Reinette
> 
> I still haven't fully understood the "shared assignments" proposal;
> I need to go back and look at it.

Having taken a quick look at that now, this all seems to duplicate
perf's design journey (again).

"rate" events make some sense.  The perf equivalent is to keep an
accumulated count of the amount of time a counter has been assigned to
an event, and another accumulated count of the events counted by the
counter during assignment.  Only userspace knows what it wants to do
with this information: perf exposes the raw accumulated counts.
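
As a minimal sketch of what userspace might do with two such raw
accumulators, assuming a record that holds the events counted and the time a
counter was assigned: the struct and both helpers below are hypothetical and
are not part of perf or resctrl. The extrapolation assumes the rate seen
while the counter was assigned also held while it was not, which is the same
assumption perf tools make when they scale a multiplexed count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical raw accumulators as userspace might read them. */
struct shared_count {
	uint64_t events;       /* events counted while a counter was assigned */
	uint64_t assigned_ns;  /* total time a counter was assigned, in ns */
};

/* Average rate in events per second over the assigned time (0 if never assigned). */
static double rate_per_sec(const struct shared_count *c)
{
	if (c->assigned_ns == 0)
		return 0.0;
	return (double)c->events * 1e9 / (double)c->assigned_ns;
}

/*
 * Estimate the event count over a whole observation window, assuming the
 * rate measured during assignment also applied while unassigned.
 */
static double estimate_over(const struct shared_count *c, uint64_t window_ns)
{
	/* The counter cannot have been assigned for longer than the window. */
	assert(window_ns >= c->assigned_ns);
	return rate_per_sec(c) * (double)window_ns / 1e9;
}
```

Whether to report the rate, the extrapolated total, or the raw pair is a
policy choice, which is why leaving it to userspace seems reasonable.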

Perf events can also be pinned so that they are prioritised for
assignment to counters; that sounds a lot like the regular, non-shared
resctrl counters.


Playing devil's advocate:

It does feel like we are doomed to reinvent perf if we go too far down
this road...

> If we split the file, it will be more closely aligned with the design
> of the rest of the resctrlfs interface.
> 
> OTOH, the current interface seems workable and I think the file size
> issue can be addressed without major re-engineering.
> 
> So, from my side, I would not consider the current interface design
> a blocker.

...so, drawing a hard line around the use cases that we intend to
address with this interface and avoiding feature creep seems desirable.

resctrlfs is already in the wild, so providing reasonable baseline
compatibility with that interface for ABMC hardware is a sensible goal.
The current series does that.

But I wonder how much additional functionality we should really be
adding via the mbm_assign_control interface, once this series is
settled.

Cheers
---Dave


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-20 14:53                     ` Peter Newman
@ 2025-02-20 18:36                       ` Reinette Chatre
  2025-02-21 13:12                         ` Peter Newman
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-20 18:36 UTC (permalink / raw)
  To: Peter Newman
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 2/20/25 6:53 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>> Hi Reinette,
>>>
>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>> Hi Reinette,
>>>>>
>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>
>>>>>> Hi Babu,
>>>>>>
>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>
>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>
>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>
>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>
>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>
>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>> <value>
>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>
>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>
>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>> is low enough to be of concern.
>>>>>
>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>> investigation, I would question whether they know what they're looking
>>>>> for.
>>>>
>>>> The key here is "so far" as well as the focus on MBM only.
>>>>
>>>> It is impossible for me to predict what we will see in a couple of years
>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>> that he is working on patches [1] that will add new events and shared the idea
>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>> customers.
>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>
>>> I was thinking of the letters as representing a reusable, user-defined
>>> event-set for applying to a single counter rather than as individual
>>> events, since MPAM and ABMC allow us to choose the set of events each
>>> one counts. Wherever we define the letters, we could use more symbolic
>>> event names.
>>
>> Thank you for clarifying.
>>
>>>
>>> In the letters as events model, choosing the events assigned to a
>>> group wouldn't be enough information, since we would want to control
>>> which events should share a counter and which should be counted by
>>> separate counters. I think the amount of information that would need
>>> to be encoded into mbm_assign_control to represent the level of
>>> configurability supported by hardware would quickly get out of hand.
>>>
>>> Maybe as an example, one counter for all reads, one counter for all
>>> writes in ABMC would look like...
>>>
>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>
>>> (per domain)
>>> group 0:
>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>> group 1:
>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>> ...
>>>
>>
>> I think this may also be what Dave was heading towards in [2] but in that
>> example and above the counter configuration appears to be global. You do mention
>> "configurability supported by hardware" so I wonder if per-domain counter
>> configuration is a requirement?
> 
> If it's global and we want a particular group to be watched by more
> counters, I wouldn't want this to result in allocating more counters
> for that group in all domains, or allocating counters in domains where
> they're not needed. I want to encourage my users to avoid allocating
> monitoring resources in domains where a job is not allowed to run so
> there's less pressure on the counters.
> 
> In Dave's proposal it looks like global configuration means
> globally-defined "named counter configurations", which works because
> it's really per-domain assignment of the configurations to however
> many counters the group needs in each domain.

I think I am becoming lost. Would a global configuration not break your
view of "event-set applied to a single counter"? If a counter is configured
globally then it would not make it possible to support the full configurability
of the hardware. 
Before I add more confusion, let me try with an example that builds on your
earlier example copied below:

>>> (per domain)
>>> group 0:
>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>> group 1:
>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>> ...

Since the above states "per domain" I rewrite the example to highlight that as
I understand it:

group 0:
 domain 0:
  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 1: VictimBW,LclNTWr,RmtNTWr
 domain 1:
  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 1: VictimBW,LclNTWr,RmtNTWr
group 1:
 domain 0:
  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 3: VictimBW,LclNTWr,RmtNTWr
 domain 1:
  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 3: VictimBW,LclNTWr,RmtNTWr

You mention that you do not want counters to be allocated in domains that they
are not needed in. So, let's say group 0 does not need counter 0 and counter 1
in domain 1, resulting in:

group 0:
 domain 0:
  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 1: VictimBW,LclNTWr,RmtNTWr
group 1:
 domain 0:
  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 3: VictimBW,LclNTWr,RmtNTWr
 domain 1:
  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 3: VictimBW,LclNTWr,RmtNTWr

With counter 0 and counter 1 available in domain 1, these counters could
theoretically be configured to give group 1 more data in domain 1:

group 0:
 domain 0:
  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 1: VictimBW,LclNTWr,RmtNTWr
group 1:
 domain 0:
  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 3: VictimBW,LclNTWr,RmtNTWr
 domain 1:
  counter 0: LclFill,RmtFill
  counter 1: LclNTWr,RmtNTWr
  counter 2: LclSlowFill,RmtSlowFill
  counter 3: VictimBW

The counters are shown with different per-domain configurations that seem to
match with earlier goals of (a) choose events counted by each counter and
(b) do not allocate counters in domains where they are not needed. As I
understand the above does contradict global counter configuration though.
Or do you mean that only the *name* of the counter is global and then
that it is reconfigured as part of every assignment?
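
The last table above can be written down as data to check the two goals
mechanically. This is only a sketch: the bit values are illustrative and are
not the hardware encoding of L3_QOS_ABMC_CFG.BwType, and struct cntr_assign,
events_covered() and counters_used() are hypothetical names, not an existing
or proposed resctrl interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values only, NOT the hardware BwType encoding. */
#define LCL_FILL      (1u << 0)
#define RMT_FILL      (1u << 1)
#define LCL_SLOW_FILL (1u << 2)
#define RMT_SLOW_FILL (1u << 3)
#define VICTIM_BW     (1u << 4)
#define LCL_NT_WR     (1u << 5)
#define RMT_NT_WR     (1u << 6)

#define ALL_READS  (LCL_FILL | RMT_FILL | LCL_SLOW_FILL | RMT_SLOW_FILL)
#define ALL_WRITES (VICTIM_BW | LCL_NT_WR | RMT_NT_WR)

/* One row per (group, domain, counter); hypothetical layout. */
struct cntr_assign {
	unsigned int group;
	unsigned int domain;
	unsigned int counter;
	uint32_t events;      /* set of event types this counter counts */
};

/* The final example: group 0 has no counters in domain 1. */
static const struct cntr_assign example[] = {
	{ 0, 0, 0, ALL_READS },
	{ 0, 0, 1, ALL_WRITES },
	{ 1, 0, 2, ALL_READS },
	{ 1, 0, 3, ALL_WRITES },
	{ 1, 1, 0, LCL_FILL | RMT_FILL },
	{ 1, 1, 1, LCL_NT_WR | RMT_NT_WR },
	{ 1, 1, 2, LCL_SLOW_FILL | RMT_SLOW_FILL },
	{ 1, 1, 3, VICTIM_BW },
};

/* Union of the event types counted for a group in one domain. */
static uint32_t events_covered(const struct cntr_assign *t, size_t n,
			       unsigned int group, unsigned int domain)
{
	uint32_t mask = 0;

	assert(t != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (t[i].group == group && t[i].domain == domain)
			mask |= t[i].events;
	return mask;
}

/* Number of counters a group occupies in one domain. */
static size_t counters_used(const struct cntr_assign *t, size_t n,
			    unsigned int group, unsigned int domain)
{
	size_t used = 0;

	assert(t != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (t[i].group == group && t[i].domain == domain)
			used++;
	return used;
}
```

With this table, group 1 in domain 1 covers every event type using four
counters while group 0 uses none there, so the same counter number carries a
different event set in each domain. That is the per-domain, per-assignment
configuration that a purely global counter configuration could not express.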

>> Until now I viewed counter configuration separate from counter assignment,
>> similar to how AMD's counters can be configured via mbm_total_bytes_config and
>> mbm_local_bytes_config before they are assigned. That is still per-domain
>> counter configuration though, not per-counter.
>>
>>> I assume packing all of this info for a group's desired counter
>>> configuration into a single line (with 32 domains per line on many
>>> dual-socket AMD configurations I see) would be difficult to look at,
>>> even if we could settle on a single letter to represent each
>>> universally.
>>>
>>>>
>>>> My goal is for resctrl to have a user interface that can as much as possible
>>>> be ready for whatever may be required from it years down the line. Of course,
>>>> I may be wrong and resctrl would never need to support more than 26 events per
>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events
>>>> and how could resctrl support that?
>>>>
>>>> What is the risk of supporting more than 26 events? As I highlighted earlier
>>>> the interface I used as demonstration may become unwieldy to parse on a system
>>>> with many domains that supports many events. This is a concern for me. Any suggestions
>>>> will be appreciated, especially from you since I know that you are very familiar with
>>>> issues related to large scale use of resctrl interfaces.
>>>
>>> It's mainly just the unwieldiness of all the information in one file.
>>> It's already at the limit of what I can visually look through.
>>
>> I agree.
>>
>>>
>>> I believe that shared assignments will take care of all the
>>> high-frequency and performance-intensive batch configuration updates I
>>> was originally concerned about, so I no longer see much benefit in
>>> finding ways to textually encode all this information in a single file
>>> when it would be more manageable to distribute it around the
>>> filesystem hierarchy.
>>
>> This is significant. The motivation for the single file was to support
>> the "high-frequency and performance-intensive" usage. Would "shared assignments"
>> not also depend on the same files that, if distributed, will require many
>> filesystem operations?
>> Having the files distributed will be significantly simpler while also
>> avoiding the file size issue that Dave Martin exposed.
> 
> The remaining filesystem operations will be assigning or removing
> shared counter assignments in the applicable domains, which would
> normally correspond to mkdir/rmdir of groups or changing their CPU
> affinity. The shared assignments are more "program and forget", while
> the exclusive assignment approach requires updates for every counter
> (in every domain) every few seconds to cover a large number of groups.
> 
> When they want to pay extra attention to a particular group, I expect
> they'll ask for exclusive counters and leave them assigned for a while
> as they collect extra data.

The single file approach is already unwieldy. The demands that will be
placed on it to support the usages currently being discussed would make this
interface even harder to use and manage. If the single file is not required 
then I think we should go back to smaller files distributed in resctrl.
This may not even be an either/or argument. One way to view mbm_assign_control
could be as a way for the user to interact with the distributed counter
related files with a single file system operation. Although, without
knowing how counter configuration is expected to work this remains unclear.

Reinette




* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-20 17:46                       ` Dave Martin
@ 2025-02-20 18:36                         ` Reinette Chatre
  2025-02-21 16:47                           ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-20 18:36 UTC (permalink / raw)
  To: Dave Martin
  Cc: Peter Newman, Moger, Babu, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/20/25 9:46 AM, Dave Martin wrote:
> Hi again,
> 
> On Thu, Feb 20, 2025 at 04:46:40PM +0000, Dave Martin wrote:
>> Hi,
>>
>> On Wed, Feb 19, 2025 at 09:56:29AM -0800, Reinette Chatre wrote:
>>> Hi Peter,
>>>
>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>
>> [...]
>>
>>>> In the letters as events model, choosing the events assigned to a
>>>> group wouldn't be enough information, since we would want to control
>>>> which events should share a counter and which should be counted by
>>>> separate counters. I think the amount of information that would need
>>>> to be encoded into mbm_assign_control to represent the level of
>>>> configurability supported by hardware would quickly get out of hand.
>>>>
>>>> Maybe as an example, one counter for all reads, one counter for all
>>>> writes in ABMC would look like...
>>>>
>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>
>>>> (per domain)
>>>> group 0:
>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>> ...
>>>>
>>>
>>> I think this may also be what Dave was heading towards in [2] but in that
>>> example and above the counter configuration appears to be global. You do mention
>>> "configurability supported by hardware" so I wonder if per-domain counter
>>> configuration is a requirement?
>>>
>>> Until now I viewed counter configuration separate from counter assignment,
>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and
>>> mbm_local_bytes_config before they are assigned. That is still per-domain
>>> counter configuration though, not per-counter.
>>
>> I hadn't tried to work the design through in any detail: it wasn't
>> intended as a suggestion for something we should definitely do right
>> now; rather, it was just an incomplete sketch of one possible future
>> evolution of the interface.
>>
>> Either way these feel like future concerns, if the first iteration of
>> ABMC is just to provide the basics so that ABMC hardware can implement
>> resctrl without userspace seeing counters randomly stopping and
>> resetting...
>>
>> Peter, can you give a view on whether the ABMC as proposed in this series
>> is a useful stepping-stone?  Or are there things that you need that you
>> feel could not be added as a later extension without ABI breakage?
>>
>> [...]
>>
>>>> I believe that shared assignments will take care of all the
>>>> high-frequency and performance-intensive batch configuration updates I
>>>> was originally concerned about, so I no longer see much benefit in
>>>> finding ways to textually encode all this information in a single file
>>>> when it would be more manageable to distribute it around the
>>>> filesystem hierarchy.
>>>
>>> This is significant. The motivation for the single file was to support
>>> the "high-frequency and performance-intensive" usage. Would "shared assignments"
>>> not also depend on the same files that, if distributed, will require many
>>> filesystem operations? 
>>> Having the files distributed will be significantly simpler while also
>>> avoiding the file size issue that Dave Martin exposed. 
>>>
>>> Reinette
>>
>> I still haven't fully understood the "shared assignments" proposal;
>> I need to go back and look at it.
> 
> Having taken a quick look at that now, this all seems to duplicate
> perf's design journey (again).
> 
> "rate" events make some sense.  The perf equivalent is to keep an
> accumulated count of the amount of time a counter has been assigned to
> an event, and another accumulated count of the events counted by the
> counter during assignment.  Only userspace knows what it wants to do
> with this information: perf exposes the raw accumulated counts.
> 
> Perf events can be also pinned so that they are prioritised for
> assignment to counters; that sounds a lot like the regular, non-shared
> resctrl counters.
> 
> 
> Playing devil's advocate:
> 
> It does feel like we are doomed to reinvent perf if we go too far down
> this road...
> 
>> If we split the file, it will be more closely aligned with the design
>> of the rest of the resctrlfs interface.
>>
>> OTOH, the current interface seems workable and I think the file size
>> issue can be addressed without major re-engineering.
>>
>> So, from my side, I would not consider the current interface design
>> a blocker.
> 
> ...so, drawing a hard line around the use cases that we intend to
> address with this interface and avoiding feature creep seems desirable.

This is exactly what I am trying to do ... to understand what use cases
the interface is expected to support.

You have mentioned a couple of times now that this interface is sufficient, but
at the same time you hinted at some features from MPAM that I do not see how
this interface could accommodate.
 
> resctrlfs is already in the wild, so providing reasonable baseline
> compatibility with that interface for ABMC hardware is a sensible goal.
> The current series does that.
> 
> But I wonder how much additional functionality we should really be
> adding via the mbm_assign_control interface, once this series is
> settled.

Are you speculating that MPAM counters may not make use of this interface?

Reinette



* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-20 15:21       ` Dave Martin
@ 2025-02-20 20:57         ` Moger, Babu
  2025-02-21 15:53           ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-20 20:57 UTC (permalink / raw)
  To: Dave Martin, Moger, Babu
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/20/25 09:21, Dave Martin wrote:
> Hi,
> 
> On Wed, Feb 19, 2025 at 06:34:42PM -0600, Moger, Babu wrote:
>> Hi Dave,
>>
>> On 2/19/2025 10:07 AM, Dave Martin wrote:
>>> Hi,
>>>
>>> On Wed, Jan 22, 2025 at 02:20:31PM -0600, Babu Moger wrote:
> 
>>> [...]
>>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> index 6e29827239e0..299839bcf23f 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> @@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>>>
>>> [...]
>>>
>>>> +static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
>>>> +						char *buf, size_t nbytes, loff_t off)
>>>> +{
> 
> [...]
> 
>>>> +	while ((token = strsep(&buf, "\n")) != NULL) {
>>>> +		/*
>>>> +		 * The write command follows the following format:
>>>> +		 * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
>>>> +		 * Extract the CTRL_MON group.
>>>> +		 */
>>>> +		cmon_grp = strsep(&token, "/");
>>>> +
>>>
>>> As when reading this file, I think that the data can grow larger than a
>>> page and get split into multiple write() calls.
>>>
>>> I don't currently think the file needs to be redesigned, but there are
>>> some concerns about how userspace will work with it that need to be
>>> sorted out.
>>>
>>> Every monitoring group can contribute a line to this file:
>>>
>>> 	CTRL_GROUP / MON_GROUP / DOMAIN = [t][l] [ ; DOMAIN = [t][l] ]* LF
>>>
>>> so, 2 * (NAME_MAX + 1) + NUM_DOMAINS * 5 - 1 + 1
>>>
>>> NAME_MAX on Linux is 255, so with, say, up to 16 domains, that's about
>>> 600 bytes per monitoring group in the worst case.
>>>
>>> We don't need to have many control and monitoring groups for this to
>>> grow potentially over 4K.
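
To put numbers on the estimate above, here is a minimal sketch that just
evaluates the quoted formula. The function names are hypothetical, and it
keeps the estimate's assumption of 5 bytes per domain (single-digit domain
IDs, "N=tl;"), so treat the result as a rough bound rather than an exact
limit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Worst-case size in bytes of one mbm_assign_control line, following the
 * estimate quoted above: two group names with their '/' separators, then
 * 5 bytes per domain ("N=tl;"), minus the last ';', plus the newline.
 */
static size_t worst_case_line_bytes(size_t name_max, size_t num_domains)
{
	return 2 * (name_max + 1) + num_domains * 5 - 1 + 1;
}

/* Smallest number of worst-case lines whose total size exceeds limit. */
static size_t groups_to_exceed(size_t limit, size_t line_bytes)
{
	assert(line_bytes > 0);
	return limit / line_bytes + 1;
}
```

With NAME_MAX = 255 and 16 domains this gives 592 bytes per line, so 7
worst-case groups are already enough to go past a 4096-byte page.
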
>>>
>>>
>>> We could simply place a limit on how much userspace is allowed to write
>>> to this file in one go, although this restriction feels difficult for
>>> userspace to follow -- but maybe this is workable in the short term, on
>>> current systems (?)
>>>
>>> Otherwise, since we expect this interface to be written using scripting
>>> languages, I think we need to be prepared to accept fully-buffered
>>> I/O.  That means that the data may be cut at random places, not
>>> necessarily at newlines.  (For smaller files such as schemata this is
>>> not such an issue, since the whole file is likely to be small enough to
>>> fit into the default stdio buffers -- this is how sysfs gets away with
>>> it IIUC.)
>>>
>>> For fully-buffered I/O, we may have to cache an incomplete line in
>>> between write() calls.  If there is a dangling incomplete line when the
>>> file is closed then it is hard to tell userspace, because people often
>>> don't bother to check the return value of close(), fclose() etc.
>>> However, since it's an ABI violation for userspace to end this file
>>> with a partial line, I think it's sufficient to report that via
>>> last_cmd_status.  (Making close() return -EIO still seems a good idea
>>> though, just in case userspace is listening.)
>>
>> Seems like we can add a check in resctrl_mbm_assign_control_write() to
>> compare nbytes > PAGE_SIZE.
> 
> This might be a reasonable stopgap approach, if we are confident that the
> number of RMIDs and monitoring domains is small enough on known
> platforms that the problem is unlikely to be hit.  I can't really judge
> on this.
> 
>> But do we really need this? I have no way of testing this. Help me
>> understand.
> 
> It's easy to demonstrate this using the schemata file (which works in a
> similar way).  Open f in /sys/fs/resctrl/schemata, then:
> 
> 	int n = 0;
> 
> 	for (n = 0; n < 1000; n++)
> 		if (fputs("MB:0=100;1=100\n", f) == EOF)
> 			fprintf(stderr, "Failed on iteration %d\n", n);
> 
> This will succeed a certain number of times (272, for me) and then fail
> when the stdio buffer for f overflows, triggering a write().
> 
> Putting an explicit fflush() after every fputs() call (or doing a
> setlinebuf(f) before the loop) makes it work.  But this is awkward and
> unexpected for the user, and doing the right thing from a scripting
> language may be tricky.
> 
> In this example I am doing something a bit artificial -- we don't
> officially say what happens when a pre-opened schemata file handle is
> reused in this way, AFAICT.  But for mbm_assign_control it is
> legitimate to write many lines, and we can hit this kind of problem.
> 
> 
> I'll leave it to others to judge whether we _need_ to fix this, but it
> feels like a problem waiting to happen.

I reproduced the problem with the following code, using a "test" group.

#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void)
{
        FILE *file;
        int n;

        file = fopen("/sys/fs/resctrl/info/L3_MON/mbm_assign_control", "w");

        if (file == NULL) {
                printf("Error opening file!\n");
                return 1;
        }

        printf("File opened successfully.\n");

        for (n = 0; n < 100; n++)
                if (fputs("test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;9=tl;10=tl;11=tl\n", file) == EOF)
                        fprintf(stderr, "Failed on iteration %d error %s\n", n, strerror(errno));

        if (fclose(file) == 0) {
                printf("File closed successfully.\n");
        } else {
                printf("Error closing file!\n");
        }

        return 0;
}


When the stdio buffer overflows, the data passed to write() does not end with
a newline. I have added an error for this case via rdt_last_cmd_puts(), so at
least the user knows there is an error.

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 484d6009869f..70a96976e3ab 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1250,8 +1252,10 @@ static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
        int ret;

        /* Valid input requires a trailing newline */
-       if (nbytes == 0 || buf[nbytes - 1] != '\n')
+       if (nbytes == 0 || buf[nbytes - 1] != '\n') {
+               rdt_last_cmd_puts("mbm_cntr_assign: buffer invalid\n");
                return -EINVAL;
+       }

        buf[nbytes - 1] = '\0';



I am open to other ideas to handle this case.
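
For reference, the userspace-side workaround mentioned earlier in the thread
(setlinebuf() or an fflush() after every line) could look roughly like the
sketch below. The helper names are hypothetical. It relies only on standard
stdio behaviour: a line-buffered stream is flushed at each '\n', so each
line reaches the kernel in its own write() as long as it fits in the stdio
buffer.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Open path for writing with line buffering, so that stdio flushes at
 * every '\n' instead of when its buffer happens to fill up.
 * Returns NULL on failure.
 */
static FILE *open_line_buffered(const char *path)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return NULL;

	/* _IOLBF must be selected before any other I/O on the stream. */
	if (setvbuf(f, NULL, _IOLBF, 0) != 0) {
		fclose(f);
		return NULL;
	}
	return f;
}

/* Write one newline-terminated line; 0 on success, -1 on failure. */
static int write_line(FILE *f, const char *line)
{
	assert(f && line);

	if (fputs(line, f) == EOF)
		return -1;
	/* The explicit flush reports a rejected write() right away. */
	return fflush(f) == 0 ? 0 : -1;
}
```

This only helps callers that know to do it, which is the awkwardness
already pointed out for scripting languages.
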


> 
> 
>> All these file operations go thru generic call kernfs_fop_write_iter().
>> Doesn't it take care of buffer check and overflow?
> 
> No, this is called for each iovec segment (where userspace used one of
> the iovec based I/O syscalls).  But there is no buffering or
> concatenation of the data read in: each segment gets passed down to the
> individual kernfs_file_operations write method for the file:
> 
> 	len = ops->write(of, buf, len, iocb->ki_pos)
> 
> calls down to
> 
> 	resctrl_mbm_assign_control_write(of, buf, len, iocb->ki_pos).
> 
> 
> I'll try to port my buffering hack on top of the series -- that should
> help to illustrate what I mean.
> 
> Cheers
> ---Dave
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-20 15:44       ` Dave Martin
@ 2025-02-20 21:29         ` Moger, Babu
  2025-02-21 16:00           ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-20 21:29 UTC (permalink / raw)
  To: Dave Martin
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/20/25 09:44, Dave Martin wrote:
> Hi,
> 
> On Wed, Feb 19, 2025 at 03:09:51PM -0600, Moger, Babu wrote:
>> Hi Dave,
>>
>> On 2/19/25 07:53, Dave Martin wrote:
>>> On Wed, Jan 22, 2025 at 02:20:30PM -0600, Babu Moger wrote:
>>>> Provide the interface to list the assignment states of all the resctrl
>>>> groups in mbm_cntr_assign mode.
> 
> [...]
> 
>>>> +static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>>>> +					   struct seq_file *s, void *v)
>>>> +{
> 
> [...]
> 
>>> Unlike the other resctrl files, it looks like the total size of this
>>> data will scale up with the number of existing monitoring groups
>>> and the lengths of the group names (in addition to the number of
>>> monitoring domains).
>>>
>>> So, this can easily be more than a page, overflowing internal limits
>>> in the seq_file and kernfs code.
>>>
>>> Do we need to track some state between read() calls?  This can be done
>>> by overriding the kernfs .open() and .release() methods and hanging
>>> some state data (or an rdtgroup_file pointer) on of->priv.
>>>
>>> Also, if we allow the data to be read out in chunks, then we would
>>> either have to snapshot all the data in one go and stash the unread
>>> tail in the kernel, or we would need to move over to RCU-based
>>> enumeration or similar -- otherwise releasing rdtgroup_mutex in the
>>> middle of the enumeration in order to return data to userspace is going
>>> to be a problem...
>>
>> Good catch.
>>
>> I see similar buffer overflow is handled by calling seq_buf_clear()
>> (look at process_durations() or in show_user_instructions()).
>>
>> How about handling this by calling rdt_last_cmd_clear() before printing
>> each group?
> 
> Does this work?
> 
> Once seq_buf_has_overflowed() returns nonzero, data has been lost, no?
> So far as I can see, show_user_instructions() just gives up on printing
> the affected line, while process_durations() tries to anticipate
> overflow and prints out the accumulated text to dmesg before clearing
> the buffer.

Yes, agreed.

> 
> In our case, we cannot send more data to userspace than was requested
> in the read() call, so we might have nowhere to drain the seq_buf
> contents to in order to free up space.
> 
> sysfs "expects" userspace to do a big enough read() that this problem
> doesn't happen.  In practice this is OK because people usually read
> through a buffered I/O layer like stdio, and in realistic
> implementations the user-side I/O buffer is large enough to hide this
> issue.
> 
> But mbm_assign_control data is dynamically generated and potentially
> much bigger than a typical sysfs file.

I have no idea how to handle this case. We may have to live with this
problem. Let us know if there are any ideas.

> 
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 484d6009869f..1828f59eb723 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1026,6 +1026,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>>         }
>>
>>         list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
>> +               rdt_last_cmd_clear();
>>                 seq_printf(s, "%s//", rdtg->kn->name);
>>
>>                 sep = false;
>> @@ -1041,6 +1042,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>>                 seq_putc(s, '\n');
>>
>>                 list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
>> +                       rdt_last_cmd_clear();
> 
> I don't see how this helps.
> 
> Surely last_cmd_status has nothing to do with s?

Correct. Clearly, I misunderstood the problem.

> 
> [...]
> 
> Cheers
> ---Dave
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-20 18:36                       ` Reinette Chatre
@ 2025-02-21 13:12                         ` Peter Newman
  2025-02-21 22:43                           ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Peter Newman @ 2025-02-21 13:12 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 2/20/25 6:53 AM, Peter Newman wrote:
> > Hi Reinette,
> >
> > On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>> Hi Reinette,
> >>>
> >>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>> <reinette.chatre@intel.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>> Hi Reinette,
> >>>>>
> >>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>
> >>>>>> Hi Babu,
> >>>>>>
> >>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>
> >>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>
> >>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>
> >>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>
> >>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>
> >>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>> <value>
> >>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>
> >>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>
> >>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>> is low enough to be of concern.
> >>>>>
> >>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>> investigation, I would question whether they know what they're looking
> >>>>> for.
> >>>>
> >>>> The key here is "so far" as well as the focus on MBM only.
> >>>>
> >>>> It is impossible for me to predict what we will see in a couple of years
> >>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>> that he is working on patches [1] that will add new events and shared the idea
> >>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>> customers.
> >>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>
> >>> I was thinking of the letters as representing a reusable, user-defined
> >>> event-set for applying to a single counter rather than as individual
> >>> events, since MPAM and ABMC allow us to choose the set of events each
> >>> one counts. Wherever we define the letters, we could use more symbolic
> >>> event names.
> >>
> >> Thank you for clarifying.
> >>
> >>>
> >>> In the letters as events model, choosing the events assigned to a
> >>> group wouldn't be enough information, since we would want to control
> >>> which events should share a counter and which should be counted by
> >>> separate counters. I think the amount of information that would need
> >>> to be encoded into mbm_assign_control to represent the level of
> >>> configurability supported by hardware would quickly get out of hand.
> >>>
> >>> Maybe as an example, one counter for all reads, one counter for all
> >>> writes in ABMC would look like...
> >>>
> >>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>
> >>> (per domain)
> >>> group 0:
> >>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>> group 1:
> >>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>> ...
> >>>
> >>
> >> I think this may also be what Dave was heading towards in [2] but in that
> >> example and above the counter configuration appears to be global. You do mention
> >> "configurability supported by hardware" so I wonder if per-domain counter
> >> configuration is a requirement?
> >
> > If it's global and we want a particular group to be watched by more
> > counters, I wouldn't want this to result in allocating more counters
> > for that group in all domains, or allocating counters in domains where
> > they're not needed. I want to encourage my users to avoid allocating
> > monitoring resources in domains where a job is not allowed to run so
> > there's less pressure on the counters.
> >
> > In Dave's proposal it looks like global configuration means
> > globally-defined "named counter configurations", which works because
> > it's really per-domain assignment of the configurations to however
> > many counters the group needs in each domain.
>
> I think I am becoming lost. Would a global configuration not break your
> view of "event-set applied to a single counter"? If a counter is configured
> globally then it would not make it possible to support the full configurability
> of the hardware.
> Before I add more confusion, let me try with an example that builds on your
> earlier example copied below:
>
> >>> (per domain)
> >>> group 0:
> >>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>> group 1:
> >>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>> ...
>
> Since the above states "per domain" I rewrite the example to highlight that as
> I understand it:
>
> group 0:
>  domain 0:
>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 1: VictimBW,LclNTWr,RmtNTWr
>  domain 1:
>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 1: VictimBW,LclNTWr,RmtNTWr
> group 1:
>  domain 0:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>  domain 1:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>
> You mention that you do not want counters to be allocated in domains that they
> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> in domain 1, resulting in:
>
> group 0:
>  domain 0:
>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 1: VictimBW,LclNTWr,RmtNTWr
> group 1:
>  domain 0:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>  domain 1:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>
> With counter 0 and counter 1 available in domain 1, these counters could
> theoretically be configured to give group 1 more data in domain 1:
>
> group 0:
>  domain 0:
>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 1: VictimBW,LclNTWr,RmtNTWr
> group 1:
>  domain 0:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>  domain 1:
>   counter 0: LclFill,RmtFill
>   counter 1: LclNTWr,RmtNTWr
>   counter 2: LclSlowFill,RmtSlowFill
>   counter 3: VictimBW
>
> The counters are shown with different per-domain configurations that seem to
> match the earlier goals of (a) choose events counted by each counter and
> (b) do not allocate counters in domains where they are not needed. As I
> understand it, the above does contradict global counter configuration though.
> Or do you mean that only the *name* of the counter is global and then
> that it is reconfigured as part of every assignment?

Yes, I meant only the *name* is global. I assume based on a particular
system configuration, the user will settle on a handful of useful
groupings to count.

Perhaps mbm_assign_control syntax is the clearest way to express an example...

 # define global configurations (in ABMC terms), not necessarily in this
 # syntax and probably not in the mbm_assign_control file.

 r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
 w=VictimBW,LclNTWr,RmtNTWr

 # legacy "total" configuration, effectively r+w
 t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr

 /group0/0=t;1=t
 /group1/0=t;1=t
 /group2/0=_;1=t
 /group3/0=rw;1=_

- group2 is restricted to domain 1
- group3 is restricted to domain 0
- the rest are unrestricted
- In group3, we decided we need to separate read and write traffic

This consumes 4 counters in domain 0 and 3 counters in domain 1.
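
The counter totals can be checked mechanically. The sketch below is only an
illustration of the hypothetical syntax used in this example: it assumes
each letter after "<domain>=" names one counter configuration and therefore
costs one counter, that '_' means no counter, and that lines carry no
trailing newline. The function names are made up.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Count the counters one assignment line consumes in the given domain.
 * Returns 0 if the domain is not listed or the line is malformed.
 */
static int counters_in_domain(const char *line, long domain)
{
	const char *p;

	assert(line);
	p = strrchr(line, '/');
	if (!p)
		return 0;
	p++;				/* first "<domain>=<flags>" entry */

	while (*p) {
		char *end;
		long id = strtol(p, &end, 10);
		int n = 0;

		if (end == p || *end != '=')
			return 0;	/* malformed entry */
		for (p = end + 1; *p && *p != ';'; p++)
			if (*p != '_')
				n++;
		if (id == domain)
			return n;
		if (*p == ';')
			p++;
	}
	return 0;
}

/* Sum the counters consumed in one domain over a set of lines. */
static int total_counters(const char *const *lines, size_t nlines, long domain)
{
	int total = 0;

	for (size_t i = 0; i < nlines; i++)
		total += counters_in_domain(lines[i], domain);
	return total;
}
```

Feeding it the four group lines above gives 4 for domain 0 and 3 for
domain 1, matching the totals stated.
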

>
> >> Until now I viewed counter configuration separate from counter assignment,
> >> similar to how AMD's counters can be configured via mbm_total_bytes_config and
> >> mbm_local_bytes_config before they are assigned. That is still per-domain
> >> counter configuration though, not per-counter.
> >>
> >>> I assume packing all of this info for a group's desired counter
> >>> configuration into a single line (with 32 domains per line on many
> >>> dual-socket AMD configurations I see) would be difficult to look at,
> >>> even if we could settle on a single letter to represent each
> >>> universally.
> >>>
> >>>>
> >>>> My goal is for resctrl to have a user interface that can as much as possible
> >>>> be ready for whatever may be required from it years down the line. Of course,
> >>>> I may be wrong and resctrl would never need to support more than 26 events per
> >>>> resource (*). The risk is that resctrl *may* need to support more than 26 events
> >>>> and how could resctrl support that?
> >>>>
> >>>> What is the risk of supporting more than 26 events? As I highlighted earlier
> >>>> the interface I used as demonstration may become unwieldy to parse on a system
> >>>> with many domains that supports many events. This is a concern for me. Any suggestions
> >>>> will be appreciated, especially from you since I know that you are very familiar with
> >>>> issues related to large scale use of resctrl interfaces.
> >>>
> >>> It's mainly just the unwieldiness of all the information in one file.
> >>> It's already at the limit of what I can visually look through.
> >>
> >> I agree.
> >>
> >>>
> >>> I believe that shared assignments will take care of all the
> >>> high-frequency and performance-intensive batch configuration updates I
> >>> was originally concerned about, so I no longer see much benefit in
> >>> finding ways to textually encode all this information in a single file
> >>> when it would be more manageable to distribute it around the
> >>> filesystem hierarchy.
> >>
> >> This is significant. The motivation for the single file was to support
> >> the "high-frequency and performance-intensive" usage. Would "shared assignments"
> >> not also depend on the same files that, if distributed, will require many
> >> filesystem operations?
> >> Having the files distributed will be significantly simpler while also
> >> avoiding the file size issue that Dave Martin exposed.
> >
> > The remaining filesystem operations will be assigning or removing
> > shared counter assignments in the applicable domains, which would
> > normally correspond to mkdir/rmdir of groups or changing their CPU
> > affinity. The shared assignments are more "program and forget", while
> > the exclusive assignment approach requires updates for every counter
> > (in every domain) every few seconds to cover a large number of groups.
> >
> > When they want to pay extra attention to a particular group, I expect
> > they'll ask for exclusive counters and leave them assigned for a while
> > as they collect extra data.
>
> The single file approach is already unwieldy. The demands that will be
> placed on it to support the usages currently being discussed would make this
> interface even harder to use and manage. If the single file is not required
> then I think we should go back to smaller files distributed in resctrl.
> This may not even be an either/or argument. One way to view mbm_assign_control
> could be as a way for user to interact with the distributed counter
> related files with a single file system operation. Although, without
> knowing how counter configuration is expected to work this remains unclear.

If we do both interfaces and the multi-file model gives us more
capability to express configurations, we could find situations where
there are configurations we cannot represent when reading back from
mbm_assign_control, or updates through mbm_assign_control have
ambiguous effects on existing configurations which were created with
other files.

However, the example I gave above seems to be adequately represented
by a minor extension to mbm_assign_control and we all seem to
understand it now, so maybe it's not broken yet. It's unfortunate that
work went into a requirement that's no longer relevant, but I don't
think that on its own is a blocker.

-Peter


* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-20 20:57         ` Moger, Babu
@ 2025-02-21 15:53           ` Dave Martin
  2025-02-21 20:16             ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-21 15:53 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Moger, Babu, corbet, reinette.chatre, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman, x86, hpa, paulmck, akpm,
	thuth, rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Thu, Feb 20, 2025 at 02:57:31PM -0600, Moger, Babu wrote:
> Hi Dave,

[...]

> I reproduced the problem with the following code, using a "test" group.
> 
> #include <stdio.h>
> #include <errno.h>
> #include <string.h>
> 
> int main(void)
> {
>         FILE *file;
>         int n;
> 
>         file = fopen("/sys/fs/resctrl/info/L3_MON/mbm_assign_control", "w");
> 
>         if (file == NULL) {
>                 printf("Error opening file!\n");
>                 return 1;
>         }
> 
>         printf("File opened successfully.\n");
> 
>         for (n = 0; n < 100; n++)
>                 if (fputs("test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;9=tl;10=tl;11=tl\n", file) == EOF)
>                         fprintf(stderr, "Failed on iteration %d error %s\n", n, strerror(errno));
> 
>         if (fclose(file) == 0) {
>                 printf("File closed successfully.\n");
>         } else {
>                 printf("Error closing file!\n");
>         }
> 
>         return 0;
> }

Right.

> When the buffer overflow happens the newline will not be there. I have
> added this error via rdt_last_cmd_puts. At least user knows there is an error.
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 484d6009869f..70a96976e3ab 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1250,8 +1252,10 @@ static ssize_t
> resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
>         int ret;
> 
>         /* Valid input requires a trailing newline */
> -       if (nbytes == 0 || buf[nbytes - 1] != '\n')
> +       if (nbytes == 0 || buf[nbytes - 1] != '\n') {
> +               rdt_last_cmd_puts("mbm_cntr_assign: buffer invalid\n");
>                 return -EINVAL;
> +       }
> 
>         buf[nbytes - 1] = '\0';
> 
> 
> 
> I am open to other ideas to handle this case.

Reinette, what do you think about this as a stopgap approach?

The worst that happens is that userspace gets an unexpected failure in
scenarios that seem unlikely in the near future (i.e., where there are
a lot of RMIDs available, and at the same time groups have been given
stupidly long names).

Since this is an implementation issue rather than an interface issue,
we could fix it later on.
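
As a rough user-space illustration of the constraint discussed above (not
part of the series): a writer can make sure each request carries its
trailing newline and fits a chosen size limit before handing it to the
kernel in a single write(2). The helper names and the caller-supplied
limit are assumptions made for this sketch; the request text follows the
"test//0=tl;..." form used in the test program quoted above.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: format one mbm_assign_control request into buf.
 * Returns the number of bytes to write, or -1 if the request (including
 * its trailing newline) does not fit in cap bytes.
 */
static long format_assign_request(char *buf, size_t cap,
				  const char *group_spec, const char *domains)
{
	int n = snprintf(buf, cap, "%s%s\n", group_spec, domains);

	if (n < 0 || (size_t)n >= cap)
		return -1;	/* truncated: the newline would be lost */
	return n;
}

/*
 * Hand the whole request to the kernel with one write(2), so it is not
 * split at an arbitrary stdio buffer boundary. Returns 0 on success.
 */
static int send_assign_request(int fd, const char *buf, size_t len)
{
	ssize_t ret = write(fd, buf, len);

	return (ret == (ssize_t)len) ? 0 : -1;
}
```

Rejecting an oversized request in user space gives the caller a clear
error instead of an -EINVAL from a write that was cut short.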


Longer term, we may want to define some stuff along the lines of

	struct rdtgroup_file {
		/* persistent data for an rdtgroup open file instance */
	};

	static int rdtgroup_file_open(struct kernfs_open_file *of)
	{
		struct rdtgroup_file *rf;

		rf = kzalloc(sizeof(*rf), GFP_KERNEL);
		if (!rf)
			return -ENOMEM;

		of->priv = rf;

		return 0;
	}

	static void rdtgroup_file_release(struct kernfs_open_file *of)
	{
		/*
		 * Deal with dangling data and do cleanup appropriate
		 * for whatever kind of file this is, then:
		 */
		kfree(of->priv);
	}


Then we'd have somewhere to stash data that needs to be carried over
from one read/write call to the next.
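
To make the "carry over" idea concrete, here is a small user-space
sketch of the buffering logic only. It keeps an incomplete line between
calls and hands complete lines to a callback. The names (struct
line_carry, line_carry_feed(), remember_last()) and the CARRY_MAX limit
are invented for illustration; nothing here is taken from the series or
from kernfs.

```c
#include <stddef.h>
#include <stdio.h>

#define CARRY_MAX 256	/* illustrative limit, not from the series */

/* Hypothetical per-open-file state: bytes of an incomplete line. */
struct line_carry {
	char buf[CARRY_MAX];
	size_t len;
};

/* Example callback: remember the most recent complete line in ctx. */
static void remember_last(const char *line, void *ctx)
{
	snprintf(ctx, CARRY_MAX, "%s", line);
}

/*
 * Feed one chunk of input. Each complete line (NUL-terminated, without
 * its newline) is passed to handle(); a trailing partial line stays in
 * lc for the next call. Returns the number of complete lines seen, or
 * -1 if a line exceeds CARRY_MAX - 1 bytes.
 */
static int line_carry_feed(struct line_carry *lc, const char *data, size_t n,
			   void (*handle)(const char *line, void *ctx),
			   void *ctx)
{
	int lines = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (data[i] == '\n') {
			lc->buf[lc->len] = '\0';
			handle(lc->buf, ctx);
			lc->len = 0;
			lines++;
		} else if (lc->len < CARRY_MAX - 1) {
			lc->buf[lc->len++] = data[i];
		} else {
			return -1;
		}
	}
	return lines;
}
```

In the kernel the equivalent state would live in the structure hung off
of->priv, and the release hook would have to decide what to do with a
dangling partial line.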

I tried to port my schemata buffering hack over, but the requirements
are not exactly the same as for mbm_assign_control, so it wasn't
trivial.  It feels do-able, but it might be better to stabilise this
series before going down that road.

(I'm happy to spend some time trying to wire this up if it would be
useful, though.)

Cheers
---Dave 


* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-20 21:29         ` Moger, Babu
@ 2025-02-21 16:00           ` Dave Martin
  2025-02-21 20:10             ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-21 16:00 UTC (permalink / raw)
  To: Moger, Babu
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Thu, Feb 20, 2025 at 03:29:12PM -0600, Moger, Babu wrote:
> Hi Dave,
> 
> On 2/20/25 09:44, Dave Martin wrote:
> > Hi,
> > 
> > On Wed, Feb 19, 2025 at 03:09:51PM -0600, Moger, Babu wrote:

[...]

> >> Good catch.
> >>
> >> I see similar buffer overflow is handled by calling seq_buf_clear()
> >> (look at process_durations() or in show_user_instructions()).
> >>
> >> How about handling this by calling rdt_last_cmd_clear() before printing
> >> each group?
> > 
> > Does this work?
> > 
> > Once seq_buf_has_overflowed() returns nonzero, data has been lost, no?
> > So far as I can see, show_user_instructions() just gives up on printing
> > the affected line, while process_durations() tries to anticipate
> > overflow and prints out the accumulated text to dmesg before clearing
> > the buffer.
> 
> Yea. Agree,
> 
> > 
> > In our case, we cannot send more data to userspace than was requested
> > in the read() call, so we might have nowhere to drain the seq_buf
> > contents to in order to free up space.
> > 
> > sysfs "expects" userspace to do a big enough read() that this problem
> > doesn't happen.  In practice this is OK because people usually read
> > through a buffered I/O layer like stdio, and in realistic
> > implementations the user-side I/O buffer is large enough to hide this
> > issue.
> > 
> > But mbm_assign_control data is dynamically generated and potentially
> > much bigger than a typical sysfs file.
> 
> I have no idea how to handle this case. We may have to live with this
> problem. Let us know if there are any ideas.

I think the current implication is that this will work for now provided
that the generated text fits in a page.
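
For reference, the reader side of that assumption looks roughly like the
sketch below: one read(2) with a page-sized buffer. The function name is
made up, and whether a single read really returns the whole file depends
on the kernel-side implementation under discussion.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read a file with a single read(2) into a page-sized buffer. Returns a
 * NUL-terminated heap buffer (caller frees) and stores the byte count
 * in *len, or returns NULL on error.
 */
static char *read_in_one_call(const char *path, ssize_t *len)
{
	long page = sysconf(_SC_PAGESIZE);
	char *buf;
	int fd;

	if (page <= 0)
		return NULL;

	buf = malloc((size_t)page + 1);
	if (!buf)
		return NULL;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		free(buf);
		return NULL;
	}

	*len = read(fd, buf, (size_t)page);
	close(fd);
	if (*len < 0) {
		free(buf);
		return NULL;
	}

	buf[*len] = '\0';
	return buf;
}
```

A caller that gets back exactly a page worth of data cannot tell whether
the output was truncated, which is the corner case in question.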


Reinette, what's your view on accepting this limitation in the interest
of stabilising this series, and tidying up this corner case later?

As for writes to this file, we're unlikely to hit the limit unless
there are a lot of RMIDs available and many groups with excessively
long names.

This looks perfectly fixable, but it might be better to settle the
design of this series first before we worry too much about it.

[...]

Cheers
---Dave


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-20 18:36                         ` Reinette Chatre
@ 2025-02-21 16:47                           ` Dave Martin
  2025-02-21 22:43                             ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-21 16:47 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Peter Newman, Moger, Babu, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Thu, Feb 20, 2025 at 10:36:18AM -0800, Reinette Chatre wrote:
> Hi Dave,
> 
> On 2/20/25 9:46 AM, Dave Martin wrote:
> > Hi again,
> > 
> > On Thu, Feb 20, 2025 at 04:46:40PM +0000, Dave Martin wrote:

[...]

> > Having taken a quick look at that now, this all seems to duplicate
> > perf's design journey (again).
> > 
> > "rate" events make some sense.  The perf equivalent is to keep an
> > accumulated count of the amount of time a counter has been assigned to
> > an event, and another accumulated count of the events counted by the
> > counter during assignment.  Only userspace knows what it wants to do
> > with this information: perf exposes the raw accumulated counts.
> > 
> > Perf events can be also pinned so that they are prioritised for
> > assignment to counters; that sounds a lot like the regular, non-shared
> > resctrl counters.
> > 
> > 
> > Playing devil's advocate:
> > 
> > It does feel like we are doomed to reinvent perf if we go too far down
> > this road...
> > 
> >> If we split the file, it will be more closely aligned with the design
> >> of the rest of the resctrlfs interface.
> >>
> >> OTOH, the current interface seems workable and I think the file size
> >> issue can be addressed without major re-engineering.
> >>
> >> So, from my side, I would not consider the current interface design
> >> a blocker.
> > 
> > ...so, drawing a hard line around the use cases that we intend to
> > address with this interface and avoiding feature creep seems desirable.
> 
> This is exactly what I am trying to do ... to understand what use cases
> the interface is expected to support.
> 
> You have mentioned a couple of times now that this interface is sufficient but
> at the same time you hinted at some features from MPAM that I do not see
> possible to accommodate with this interface.

It's kind of both.

I think the interface is sufficient to be useful, and therefore has
value.

The problem being addressed here (shortage of counters) is fully
relevant to MPAM (at least on some hardware).

Any architecture may define new metrics and types of event that can be
counted, and they're not going to match up exactly between arches -- so
I don't think we can expect everything to fit perfectly within a
generic interface.  But having a generic interface is still useful for
making common features convenient to use.

So the interface is useful but not universal, but that doesn't feel
like a bug.

Hopefully that makes my position a bit clearer.

> > resctrlfs is already in the wild, so providing reasonable baseline
> > compatibility with that interface for ABMC hardware is a sensible goal.
> > The current series does that.
> > 
> > But I wonder how much additional functionality we should really be
> > adding via the mbm_assign_control interface, once this series is
> > settled.
> 
> Are you speculating that MPAM counters may not make use of this interface?
> 
> Reinette

No, I think it makes sense for MPAM to follow this interface, at least
as far as what has been proposed so far here.

I think James got his updated rebase working. [1]


perf support would be for the future if we do it, but the ABMC
interface may be a useful starting point anyway, because it allows
counters to be assigned explicitly -- that provides a natural way to
hand over some counters to perf, either because that interface may be a
more natural fit for what the user is trying to do, or perhaps to count
weird, platform-specific event types that do not merit the effort of
integration into resctrlfs proper.

Does that make sense?

Cheers
---Dave

[1] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/abmc/v11


* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-20 17:08             ` Reinette Chatre
@ 2025-02-21 17:14               ` Dave Martin
  2025-02-21 18:23                 ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-21 17:14 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Peter Newman, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi,

On Thu, Feb 20, 2025 at 09:08:17AM -0800, Reinette Chatre wrote:
> Hi Dave,
> 
> On 2/20/25 5:40 AM, Dave Martin wrote:
> > On Thu, Feb 20, 2025 at 11:35:56AM +0100, Peter Newman wrote:
> >> Hi Reinette,
> >>
> >> On Wed, Feb 19, 2025 at 6:55 PM Reinette Chatre
> >> <reinette.chatre@intel.com> wrote:

[...]

> >>> Could you please remind me how a user will set this flag?
> >>
> >> Quoting my original suggestion[1]:
> >>
> >>  "info/L3_MON/mbm_assign_on_mkdir?
> >>
> >>   boolean (parsed with kstrtobool()), defaulting to true?"
> >>
> >> After mount, any groups that got counters on creation would have to be
> >> cleaned up, but at least that can be done with forward progress once
> >> the flag is cleared.
> >>
> >> I was able to live with that as long as there aren't users polling for
> >> resctrl to be mounted and immediately creating groups. For us, a
> >> single container manager service manages resctrl.

[...]

> > +1
> > 
> > That's basically my position -- the auto-assignment feels like a
> > _potential_ nuisance for ABMC-aware users, but it depends on what they
> > are trying to do.  Migration of non-ABMC-aware users will be easier for
> > basic use cases if auto-assignment occurs by default (as in this
> > series).
> > 
> > Having an explicit way to turn this off seems perfectly reasonable
> > (and could be added later on, if not provided in this series).
> > 
> > 
> > What about the question re whether turning mbm_cntr_assign mode on
> > should trigger auto-assignment?
> > 
> > Currently turning this mode off and then on again has the effect of
> > removing all automatic assignments for extant groups.  This feels
> > surprising and/or unintentional (?)
> 
> Connecting to what you start off by saying I also see auto-assignment
> as the way to provide a smooth transition for "non-ABMC-aware" users.

I agree, and having this on by default also helps non-ABMC-aware users.

> To me a user that turns this mode off and then on again can be
> considered as a user that is "ABMC-aware" and turning it "off and then
> on again" seems like an intuitive way to get to a "clean slate"
> wrt counter assignments. This may also be a convenient way for
> an "ABMC-aware" user space to unassign all counters and thus also
> helpful if resctrl supports the flag that Peter proposed. The flag
> seems to already keep something like this in its context with
> a name of "mbm_assign_on_mkdir" that could be interpreted as
> "only auto assign on mkdir"?

Yes, that's reasonable.  It could be a good idea to document this
behaviour of switching the mbm_cntr_assign mode, if we think it is
useful and people are likely to rely on it.

Since mkdir is an implementation detail of the resctrl interface, I'd
be tempted to go for a more generic name, say,
"mbm_assign_new_mon_groups".  But that's just bikeshedding.
The proposed behaviour seems fine.

Either way, if this is not included in this series, it could be added
later without breaking anything.


> I am not taking a stand for one or the other approach but instead
> trying to be more specific about pros/cons. Could you please provide
> more insight in the use case you have in mind so that we can see how
> resctrl could behave with few surprises? 
> 
> Reinette

I don't have a strong view either.

I don't have a concrete use case here -- I was just trying to imagine
the experience of an ABMC-aware user who wants full control over what
counters get assigned.

I agree that the convenience of the non-ABMC-aware user should probably
take priority over that of the ABMC-aware user, at least in situations
where the expected behaviour is achievable (i.e., where we didn't run
out of counters to auto-assign.)

Cheers
---Dave


* Re: [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2025-01-22 20:20 ` [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature Babu Moger
  2025-02-05 22:49   ` Reinette Chatre
@ 2025-02-21 18:05   ` James Morse
  2025-02-21 18:25     ` Reinette Chatre
  1 sibling, 1 reply; 209+ messages in thread
From: James Morse @ 2025-02-21 18:05 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Babu,

On 22/01/2025 20:20, Babu Moger wrote:
> Add the functionality to enable/disable AMD ABMC feature.
> 
> AMD ABMC feature is enabled by setting enabled bit(0) in MSR
> L3_QOS_EXT_CFG. When the state of ABMC is changed, the MSR needs
> to be updated on all the logical processors in the QOS Domain.
> 
> Hardware counters will reset when ABMC state is changed.
> 
> The ABMC feature details are documented in APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).


> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 05358e78147b..ca69f2e0909f 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -658,4 +663,6 @@ void resctrl_file_fflags_init(const char *config, unsigned long fflags);
>  void rdt_staged_configs_clear(void);
>  bool closid_allocated(unsigned int closid);
>  int resctrl_find_cleanest_closid(void);
> +int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
> +bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
>  #endif /* _ASM_X86_RESCTRL_INTERNAL_H */

A minor nit - but could these be added to include/linux/resctrl.h instead?
This is where they need to end up after the arch/fs split, and it's harmless to do it from
the beginning.


Thanks,

James


* Re: [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode
  2025-01-22 20:20 ` [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode Babu Moger
  2025-02-06 18:01   ` Reinette Chatre
@ 2025-02-21 18:06   ` James Morse
  2025-02-21 19:44     ` Moger, Babu
  1 sibling, 1 reply; 209+ messages in thread
From: James Morse @ 2025-02-21 18:06 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Babu,

On 22/01/2025 20:20, Babu Moger wrote:
> Introduce the interface file "mbm_assign_mode" to list monitor modes
> supported.
> 
> The "mbm_cntr_assign" mode provides the option to assign a counter to
> an RMID, event pair and monitor the bandwidth as long as it is assigned.
> 
> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
> 
> The "default" mode is the existing monitoring mode that works without the
> explicit counter assignment, instead relying on dynamic counter assignment
> by hardware that may result in hardware not dedicating a counter resulting
> in monitoring data reads returning "Unavailable".
> 
> Provide an interface to display the monitor mode on the system.
> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> [mbm_cntr_assign]
> default

> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index fb90f08e564e..b5defc5bce0e 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -257,6 +257,32 @@ with the following files:
>  	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>  	    0=0x30;1=0x30;3=0x15;4=0x15
>  
> +"mbm_assign_mode":
> +	Reports the list of monitoring modes supported. The enclosed brackets
> +	indicate which mode is enabled.
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> +	  [mbm_cntr_assign]
> +	  default
> +
> +	"mbm_cntr_assign":
> +
> +	In mbm_cntr_assign, monitoring event can only accumulate data while
> +	it is backed by a hardware counter. The user-space is able to specify
> +	which of the events in CTRL_MON or MON groups should have a counter
> +	assigned using the "mbm_assign_control" file. The number of counters
> +	available is described in the "num_mbm_cntrs" file. Changing the mode
> +	may cause all counters on a resource to reset.

> +	"default":
> +
> +	In default mode, resctrl assumes there is a hardware counter for each
> +	event within every CTRL_MON and MON group. On AMD platforms, it is
> +	recommended to use mbm_cntr_assign mode if supported, because reading
> +	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
> +	there is no counter associated with that event.

But if you read a value instead of "Unavailable", that doesn't mean the value is correct.
For two reads that succeed, the counter may have been reset in the middle.

I'm suggesting something like:
| it is recommended to use mbm_cntr_assign mode if supported, to avoid counters
| being re-allocated by hardware. This can cause a misleading value to be read,
| or if no counter is associated with that event "Unavailable".
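
To illustrate why a successful read is not enough, here is a user-space
sketch of the usual "delta between two reads" calculation. It assumes the
reported byte count normally only grows between reads; mbm_delta() is a
made-up name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the byte delta between two successful reads of an MBM counter
 * file. Returns false if the second value is lower than the first, which
 * this sketch treats as evidence of a reset. A reset followed by enough
 * traffic to exceed the first value is NOT detected, so a "true" result
 * can still carry a misleading delta.
 */
static bool mbm_delta(uint64_t first, uint64_t second, uint64_t *delta)
{
	if (second < first)
		return false;

	*delta = second - first;
	return true;
}
```

The undetectable case is the one the suggested wording is meant to warn
about.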


>  "max_threshold_occupancy":
>  		Read/write file provides the largest value (in
>  		bytes) at which a previously used LLC_occupancy
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index f91fe605766f..3880480a41d2 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -854,6 +854,30 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
>  	return ret;
>  }
>  
> +static int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
> +					struct seq_file *s, void *v)
> +{
> +	struct rdt_resource *r = of->kn->parent->priv;
> +
> +	mutex_lock(&rdtgroup_mutex);
> +
> +	if (r->mon.mbm_cntr_assignable) {
> +		if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
> +			seq_puts(s, "[mbm_cntr_assign]\n");
> +			seq_puts(s, "default\n");
> +		} else {
> +			seq_puts(s, "mbm_cntr_assign\n");
> +			seq_puts(s, "[default]\n");
> +		}

What do you think to an architecture being able to opt-out of this flexibility?

If there aren't enough counters I can expose what the hardware has through this interface
- but if user-space turns it off ... then what?

For MPAM this would need to be some best-effort software allocation strategy that I'd
rather not write - it's not a problem that can be solved, and any value that is reported is
likely to be wrong. For ABMC platforms, existing stable kernels expose a value, so being
able to preserve the existing behaviour makes sense. MPAM doesn't have this problem.

Something like this:
----------%<----------
@@ -861,16 +861,21 @@ static int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
                                        struct seq_file *s, void *v)
 {
        struct rdt_resource *r = of->kn->parent->priv;
+       bool enabled = resctrl_arch_mbm_cntr_assign_enabled(r);

        mutex_lock(&rdtgroup_mutex);

        if (r->mon.mbm_cntr_assignable) {
-               if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
+               if (enabled)
                        seq_puts(s, "[mbm_cntr_assign]\n");
-                       seq_puts(s, "default\n");
-               } else {
-                       seq_puts(s, "mbm_cntr_assign\n");
+               else
                        seq_puts(s, "[default]\n");
+
+               if (!IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED)) {
+                       if (enabled)
+                               seq_puts(s, "default\n");
+                       else
+                               seq_puts(s, "mbm_cntr_assign\n");
                }
        } else {
                seq_puts(s, "[default]\n");
----------%<----------

x86 wouldn't define CONFIG_RESCTRL_ASSIGN_FIXED, arm64 would, meaning for MPAM the file
would be either:
| [default]
or
| [mbm_cntr_assign]
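
Either way, a user-space consumer only needs to find the bracketed line,
so it would cope with one-line and two-line output alike. A minimal
sketch, assuming the file format shown in the patch;
active_assign_mode() is a made-up name:

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) mode name from the text of
 * mbm_assign_mode into out. Returns 0 on success, -1 if no line of the
 * form "[name]" is found or out is too small.
 */
static int active_assign_mode(const char *text, char *out, size_t cap)
{
	const char *line = text;

	while (*line) {
		const char *end = strchr(line, '\n');
		size_t len = end ? (size_t)(end - line) : strlen(line);

		if (len >= 2 && line[0] == '[' && line[len - 1] == ']') {
			if (len - 2 >= cap)
				return -1;
			memcpy(out, line + 1, len - 2);
			out[len - 2] = '\0';
			return 0;
		}
		if (!end)
			break;
		line = end + 1;
	}
	return -1;
}
```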


> +	} else {
> +		seq_puts(s, "[default]\n");
> +	}
> +
> +	mutex_unlock(&rdtgroup_mutex);
> +
> +	return 0;
> +}


Thanks,

James


* Re: [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2025-01-22 20:20 ` [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
  2025-02-19 13:32   ` Dave Martin
@ 2025-02-21 18:06   ` James Morse
  2025-02-21 22:24     ` Moger, Babu
  1 sibling, 1 reply; 209+ messages in thread
From: James Morse @ 2025-02-21 18:06 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Babu,

On 22/01/2025 20:20, Babu Moger wrote:
> The ABMC feature provides an option to the user to assign a hardware
> counter to an RMID, event pair and monitor the bandwidth as long as it
> is assigned. The assigned RMID will be tracked by the hardware until the
> user unassigns it manually.
> 
> Implement an architecture-specific handler to assign and unassign the
> counter. Configure counters by writing to the L3_QOS_ABMC_CFG MSR,
> specifying the counter ID, bandwidth source (RMID), and event
> configuration.
> 
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>     Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>     Monitoring (ABMC).

> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index acac7972cea4..161d3feb567c 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -724,4 +724,7 @@ unsigned int mon_event_config_index_get(u32 evtid);
>  void resctrl_arch_mon_event_config_set(void *info);
>  u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>  				      enum resctrl_event_id eventid);
> +int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
> +			     u32 cntr_id, bool assign);
>  #endif /* _ASM_X86_RESCTRL_INTERNAL_H */


Could this be added to include/linux/resctrl.h instead? It's where it needs to end up
eventually.


Thanks,

James


* Re: [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2025-01-22 20:20 ` [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported Babu Moger
@ 2025-02-21 18:06   ` James Morse
  2025-02-24 15:49     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: James Morse @ 2025-02-21 18:06 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Babu,

On 22/01/2025 20:20, Babu Moger wrote:
> Configure mbm_cntr_assign mode on AMD platforms. On AMD platforms, it
> is recommended to use mbm_cntr_assign mode if supported, because
> reading "mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable'
> if there is no counter associated with that event.

(If you agree with my comment on patch 7, it would be good to update this
wording to match.)


> The mbm_cntr_assign mode, referred to as ABMC (Assignable Bandwidth
> Monitoring Counters) on AMD, is enabled by default when supported by the
> system.
> 
> Update ABMC across all logical processors within the resctrl domain to
> ensure proper functionality.
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index c006c4d8d6ff..2480698b643d 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -734,4 +734,5 @@ int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d
>  void mbm_cntr_reset(struct rdt_resource *r);
>  int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
>  		 struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
> +void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
>  #endif /* _ASM_X86_RESCTRL_INTERNAL_H */

Could this be put in include/linux/resctrl.h? It's where it needs to end up eventually.



This sequence has me confused:

> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3d748fdbcb5f..a9a5dc626a1e 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1233,6 +1233,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>  			r->mon.mbm_cntr_assignable = true;
>  			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>  			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;

> +			hw_res->mbm_cntr_assign_enabled = true;

Here the arch code sets ABMC to be enabled by default at boot.


> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 6922173c4f8f..515969c5f64f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -4302,9 +4302,13 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
>  
>  void resctrl_online_cpu(unsigned int cpu)
>  {
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
>  	mutex_lock(&rdtgroup_mutex);
>  	/* The CPU is set in default rdtgroup after online. */
>  	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
> +	if (r->mon_capable && r->mon.mbm_cntr_assignable)
> +		resctrl_arch_mbm_cntr_assign_set_one(r);
>  	mutex_unlock(&rdtgroup_mutex);
>  }

But here, resctrl has to call back to the arch code to make sure the hardware is in the
same state as hw_res->mbm_cntr_assign_enabled.

Could this be done in resctrl_arch_online_cpu() instead? That way resctrl doesn't get CPUs
in an inconsistent state that it has to fix up...


Thanks,

James


* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-01-22 20:20 ` [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain Babu Moger
  2025-02-05 23:57   ` Reinette Chatre
@ 2025-02-21 18:07   ` James Morse
  2025-02-21 18:35     ` Reinette Chatre
  1 sibling, 1 reply; 209+ messages in thread
From: James Morse @ 2025-02-21 18:07 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Babu,

On 22/01/2025 20:20, Babu Moger wrote:
> In mbm_cntr_assign mode hardware counters are assigned/unassigned to an
> MBM event of a monitor group. Hardware counters are assigned/unassigned
> at monitoring domain level.
> 
> Manage a monitoring domain's hardware counters using a per monitoring
> domain array of struct mbm_cntr_cfg that is indexed by the hardware
> counter ID. A hardware counter's configuration contains the MBM event
> ID and points to the monitoring group that it is assigned to, with a
> NULL pointer meaning that the hardware counter is available for assignment.
> 
> There is no direct way to determine which hardware counters are assigned
> to a particular monitoring group. Check every entry of every hardware
> counter configuration array in every monitoring domain to query which
> MBM events of a monitoring group are tracked by hardware. Such queries
> are acceptable because of a very small number of assignable counters.

> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 511cfce8fc21..9a54e307d340 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -94,6 +94,18 @@ struct rdt_ctrl_domain {
>  	u32				*mbps_val;
>  };
>  
> +/**
> + * struct mbm_cntr_cfg - assignable counter configuration
> + * @evtid:		 MBM event to which the counter is assigned. Only valid
> + *			 if @rdtgroup is not NULL.
> + * @rdtgroup:		 resctrl group assigned to the counter. NULL if the
> + *			 counter is free.
> + */
> +struct mbm_cntr_cfg {
> +	enum resctrl_event_id	evtid;
> +	struct rdtgroup		*rdtgrp;
> +};

struct rdtgroup here suggests this shouldn't be something the arch code is touching.

If it's not needed by any arch specific code (I couldn't find a resctrl_arch helper that
takes this), could it be moved to resctrl's internal.h?

(If this does need to be visible to the arch code, one option would be to replace rdtgroup
with the closid/rmid, and a valid flag so that memset() continues to reset these entries.)
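
To make the closid/rmid option concrete, here is a minimal userspace sketch.
The field names closid, rmid and valid, and both helper functions, are
assumptions made for illustration; they are not taken from the patch, and the
enum is only a stand-in for the kernel's definition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the kernel's enum resctrl_event_id; values are illustrative. */
enum resctrl_event_id {
	QOS_L3_MBM_TOTAL_EVENT_ID = 2,
	QOS_L3_MBM_LOCAL_EVENT_ID = 3,
};

/*
 * Hypothetical arch-visible layout: the group is identified by closid/rmid
 * instead of a struct rdtgroup pointer. An all-zero entry means "free".
 */
struct mbm_cntr_cfg {
	enum resctrl_event_id	evtid;	/* only meaningful if @valid is set */
	uint32_t		closid;
	uint32_t		rmid;
	bool			valid;	/* false: the counter is free */
};

/* Mark every counter in a domain's array as free in one go. */
static void mbm_cntr_reset_all(struct mbm_cntr_cfg *cfg, size_t num_cntrs)
{
	memset(cfg, 0, num_cntrs * sizeof(*cfg));
}

/* Return the index of the first free counter, or -1 if none is free. */
static int mbm_cntr_find_free(const struct mbm_cntr_cfg *cfg, size_t num_cntrs)
{
	for (size_t i = 0; i < num_cntrs; i++) {
		if (!cfg[i].valid)
			return (int)i;
	}
	return -1;
}
```

The point of the valid flag is that closid 0 and rmid 0 are legitimate values,
so they cannot double as the "free" marker the way a NULL pointer does today.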


Thanks,

James

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-01-22 20:20 ` [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
  2025-02-06 18:48   ` Reinette Chatre
  2025-02-19 16:07   ` Dave Martin
@ 2025-02-21 18:07   ` James Morse
  2025-02-24 20:49     ` Moger, Babu
  2 siblings, 1 reply; 209+ messages in thread
From: James Morse @ 2025-02-21 18:07 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Babu,

On 22/01/2025 20:20, Babu Moger wrote:
> When mbm_cntr_assign mode is enabled, users can designate which of the MBM
> events in the CTRL_MON or MON groups should have counters assigned.
> 
> Provide an interface for assigning MBM events by writing to the file:
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control. Using this interface,
> events can be assigned or unassigned as needed.
> 
> Format is similar to the list format with addition of opcode for the
> assignment operation.
>  "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> 
> Format for specific type of groups:
> 
>  * Default CTRL_MON group:
>          "//<domain_id><opcode><flags>"
> 
>  * Non-default CTRL_MON group:
>          "<CTRL_MON group>//<domain_id><opcode><flags>"
> 
>  * Child MON group of default CTRL_MON group:
>          "/<MON group>/<domain_id><opcode><flags>"
> 
>  * Child MON group of non-default CTRL_MON group:
>          "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> 
> Domain_id '*' will apply the flags on all the domains.
> 
> Opcode can be one of the following:
> 
>  = Update the assignment to match the flags
>  + Assign a new MBM event without impacting existing assignments.
>  - Unassign a MBM event from currently assigned events.
> 
> Assignment flags can be one of the following:
>  t  MBM total event
>  l  MBM local event
>  tl Both total and local MBM events
>  _  None of the MBM events. Valid only with '=' opcode. This flag cannot
>     be combined with other flags.

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 6e29827239e0..299839bcf23f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,

> +static int resctrl_process_flags(struct rdt_resource *r,
> +				 enum rdt_group_type rtype,
> +				 char *p_grp, char *c_grp, char *tok)
> +{
> +	unsigned int op, mon_state, assign_state, unassign_state;
> +	char *dom_str, *id_str, *op_str;
> +	struct rdt_mon_domain *d;
> +	unsigned long dom_id = 0;
> +	struct rdtgroup *rdtgrp;
> +	char domain[10];
> +	bool found;
> +	int ret;
> +
> +	rdtgrp = rdtgroup_find_grp_by_name(rtype, p_grp, c_grp);
> +
> +	if (!rdtgrp) {
> +		rdt_last_cmd_puts("Not a valid resctrl group\n");
> +		return -EINVAL;
> +	}
> +
> +next:
> +	if (!tok || tok[0] == '\0')
> +		return 0;
> +
> +	/* Start processing the strings for each domain */
> +	dom_str = strim(strsep(&tok, ";"));
> +
> +	op_str = strpbrk(dom_str, "=+-");
> +
> +	if (op_str) {
> +		op = *op_str;
> +	} else {
> +		rdt_last_cmd_puts("Missing operation =, +, - character\n");
> +		return -EINVAL;
> +	}
> +
> +	id_str = strsep(&dom_str, "=+-");
> +
> +	/* Check for domain id '*' which means all domains */
> +	if (id_str && *id_str == '*') {
> +		d = NULL;
> +		goto check_state;
> +	} else if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
> +		rdt_last_cmd_puts("Missing domain id\n");
> +		return -EINVAL;
> +	}
> +
> +	/* Verify if the dom_id is valid */
> +	found = false;
> +	list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +		if (d->hdr.id == dom_id) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	if (!found) {
> +		rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
> +		return -EINVAL;
> +	}
> +
> +check_state:
> +	mon_state = resctrl_str_to_mon_state(dom_str);
> +
> +	if (mon_state == ASSIGN_INVALID) {
> +		rdt_last_cmd_puts("Invalid assign flag\n");
> +		goto out_fail;
> +	}
> +
> +	assign_state = 0;
> +	unassign_state = 0;
> +
> +	switch (op) {
> +	case '+':
> +		if (mon_state == ASSIGN_NONE) {
> +			rdt_last_cmd_puts("Invalid assign opcode\n");
> +			goto out_fail;
> +		}
> +		assign_state = mon_state;
> +		break;
> +	case '-':
> +		if (mon_state == ASSIGN_NONE) {
> +			rdt_last_cmd_puts("Invalid assign opcode\n");
> +			goto out_fail;
> +		}
> +		unassign_state = mon_state;
> +		break;
> +	case '=':
> +		assign_state = mon_state;
> +		unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
> +		break;
> +	default:
> +		break;
> +	}


> +	if (unassign_state & ASSIGN_TOTAL) {
> +		ret = resctrl_unassign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
> +		if (ret)
> +			goto out_fail;
> +	}
> +
> +	if (unassign_state & ASSIGN_LOCAL) {
> +		ret = resctrl_unassign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
> +		if (ret)
> +			goto out_fail;
> +	}
> +
> +	if (assign_state & ASSIGN_TOTAL) {
> +		ret = resctrl_assign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
> +		if (ret)
> +			goto out_fail;
> +	}
> +
> +	if (assign_state & ASSIGN_LOCAL) {
> +		ret = resctrl_assign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
> +		if (ret)
> +			goto out_fail;
> +	}

This sequence of if's allows the helpers to be called on platforms that don't support
both local and total. Could we reject such misconfiguration here in the parsing code?
You have these checks in rdtgroup_assign_cntrs() added in patch 17.


What do you think about trying to group these four by event type, and passing the event type
in as an argument? ... it ends up with a helper that takes a large number of arguments,
(both assign_state and unassign_state), but there is less repetition...
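
A rough userspace sketch of what grouping by event could look like, with the
unsupported-event check done up front. The flag values, the supported_events
and assigned_events variables and the two stub functions are assumptions for
illustration; the real helpers take the resource, domain, group and event ID.

```c
#include <errno.h>

/* Flag values and the two stubs below are illustrative, not from the patch. */
#define ASSIGN_TOTAL	(1U << 0)
#define ASSIGN_LOCAL	(1U << 1)

/* Events the modelled platform supports; real code would ask resctrl. */
static unsigned int supported_events = ASSIGN_TOTAL | ASSIGN_LOCAL;
/* Events that currently hold a counter in this model. */
static unsigned int assigned_events;

static int stub_assign_event(unsigned int flag)
{
	assigned_events |= flag;
	return 0;
}

static int stub_unassign_event(unsigned int flag)
{
	assigned_events &= ~flag;
	return 0;
}

/* Apply the requested change for a single event, if one was requested. */
static int update_one_event(unsigned int flag, unsigned int assign_state,
			    unsigned int unassign_state)
{
	if (unassign_state & flag)
		return stub_unassign_event(flag);
	if (assign_state & flag)
		return stub_assign_event(flag);
	return 0;
}

static int update_events(unsigned int assign_state, unsigned int unassign_state)
{
	static const unsigned int flags[] = { ASSIGN_TOTAL, ASSIGN_LOCAL };
	int ret;

	/* Reject events the platform lacks before touching any counter. */
	if ((assign_state | unassign_state) & ~supported_events)
		return -EINVAL;

	for (unsigned int i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		ret = update_one_event(flags[i], assign_state, unassign_state);
		if (ret)
			return ret;
	}
	return 0;
}
```

One trade-off to keep in mind: the patch performs every unassignment before any
assignment, whereas a per-event loop like this may assign total before it
unassigns local (for example with "=t"), which could matter when counters are
scarce.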


Thanks,

James

> +	goto next;
> +
> +out_fail:
> +	sprintf(domain, d ? "%ld" : "*", dom_id);
> +
> +	rdt_last_cmd_printf("Assign operation '%s%c%s' failed on the group %s/%s/\n",
> +			    domain, op, dom_str, p_grp, c_grp);
> +
> +	return -EINVAL;
> +}
> +

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-01-22 20:20 [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (24 preceding siblings ...)
  2025-02-12 17:46 ` Dave Martin
@ 2025-02-21 18:07 ` James Morse
  25 siblings, 0 replies; 209+ messages in thread
From: James Morse @ 2025-02-21 18:07 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Babu,

On 22/01/2025 20:20, Babu Moger wrote:
> This series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
> 
> Series is written such that it is easier to support other assignable
> features supported from different vendors.
> 
> The feature details are documented in the  APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> 
> The patches are based on top of commit
> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'

I've rebased the MPAM tree on top of this v11, here:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/abmc/v11
Hopefully this is sufficient evidence that this interface works for MPAM.

It would be convenient for MPAM platforms to not have to support a 'default' mode if they
are emulating ABMC - this was something that was never supported, and it's not a problem
that can be solved (comments on the relevant patches).


Thanks,

James

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-06 15:56         ` Luck, Tony
@ 2025-02-21 18:08           ` James Morse
  0 siblings, 0 replies; 209+ messages in thread
From: James Morse @ 2025-02-21 18:08 UTC (permalink / raw)
  To: Luck, Tony, Chatre, Reinette, Babu Moger, corbet@lwn.net,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, peternewman@google.com
  Cc: x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	tan.shaopeng@fujitsu.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Wieczor-Retman, Maciej,
	Eranian, Stephane

Hi Tony, Reinette,

On 2/6/25 15:56, Luck, Tony wrote:
>>>> This new arch API has sharp corners because of asymmetry of where resctrl
>>>> runs the arch function. I do not think it is required to change this since we
>>>> can only speculate about how this may be used in the future but I do think
>>>> it will be helpful to add comments that highlight:
>>>>
>>>> resctrl_arch_mon_event_config_get() ->  May run on CPU that does not belong to domain.
>>>> resctrl_arch_mon_event_config_set() ->  Runs on CPU that belongs to domain.
>>>
>>> Here's a vague data point about the future to help with speculation.
>>>
>>> I have something coming along the pipeline that also can run on any CPU.

RISC-V has this - all their controls/monitors are accessible from any CPU.
Some MPAM platforms can do this too - but the code has to be structured for those
that need the IPI.

Having this be something resctrl can be told sounds like a great idea.

It sounds like all or nothing suits x86/riscv.
The MPAM driver has an accessibility cpumask for each thing it accesses that determines
if it needs to do an IPI.


>>> I am contemplating a flag in the rdt_resource structure (in appropriate substructure
>>> resctrl_cache/resctrl_membw) to indicate "domain" vs. "any" for operations.
>>>
>>> Would something like that be useful here?
>>
>> hmm ... I cannot envision how this may look. Could you please elaborate?
>>
>> You mention "a" (singular) flag in rdt_resource while this scenario involves
>> different ops having different scope. This makes me think that this flag may
>> have to be per operation that in turn would need additional infrastructure to
>> manage and track operations.
>>
>> These "arch" functions are evolving as the work to support MPAM is done and
>> so far I think it has been quite ad-hoc to just refactor arch specific code
>> into "arch" helpers instead of keeping track of which scope they are running in.
>> This currently requires any arch implementing an "arch" helper to be well aware
>> of how resctrl will call it.

This is how APIs in Linux evolve - only the immediate problem needs solving.
Arch code being aware how resctrl uses a function shouldn't be surprising - it is the only user.


> I haven't fleshed it out yet. One option would be to have a "choose_cpu_mask()"
> function that takes resource and domain parameters (and given your comment
> about this case an operation code). Then use that as the mask in an smp_call*().
> 
> Operations that can run anywhere would return a mask with just bit for the
> current CPU.

That sounds like extra work. We already have a cpumask, if you set it to the cpu_possible mask
at boot, then smp_call_function() and friends will always prefer the current CPU, and it all falls
out in the wash.
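
As a userspace model of that argument: the sketch below treats a cpumask as a
64-bit bitmap and mimics the selection preference smp_call_function_any()
documents (current CPU first if it is in the mask). The names pick_cpu() and
needs_ipi() are made up for this illustration, and the fallback to the lowest
set bit is a simplification of the kernel's node-aware choice.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the CPU to run on: the current CPU if the mask allows it, otherwise
 * the lowest-numbered CPU in the mask. Returns -1 for an empty mask.
 */
static int pick_cpu(uint64_t mask, int this_cpu)
{
	if (this_cpu >= 0 && this_cpu < 64 && (mask & (UINT64_C(1) << this_cpu)))
		return this_cpu;

	for (int cpu = 0; cpu < 64; cpu++) {
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}

/* A cross-call is only needed when the work cannot run where we already are. */
static bool needs_ipi(uint64_t mask, int this_cpu)
{
	return pick_cpu(mask, this_cpu) != this_cpu;
}
```

With the mask set to all possible CPUs, needs_ipi() is false for every caller,
which is the "falls out in the wash" case; a narrower mask only forces a
cross-call when the caller is outside it.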


Thanks,

James

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 10/23] x86/resctrl: Remove MSR reading of event configuration value
  2025-02-19 13:28         ` Dave Martin
@ 2025-02-21 18:08           ` James Morse
  0 siblings, 0 replies; 209+ messages in thread
From: James Morse @ 2025-02-21 18:08 UTC (permalink / raw)
  To: Dave Martin, Reinette Chatre
  Cc: Luck, Tony, Babu Moger, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com, x86@kernel.org, hpa@zytor.com,
	paulmck@kernel.org, akpm@linux-foundation.org, thuth@redhat.com,
	rostedt@goodmis.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	jpoimboe@kernel.org, perry.yuan@amd.com, sandipan.das@amd.com,
	Huang, Kai, Li, Xiaoyao, seanjc@google.com, Li, Xin3,
	andrew.cooper3@citrix.com, ebiggers@google.com,
	mario.limonciello@amd.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

Hi Dave,

On 2/19/25 13:28, Dave Martin wrote:
> On Wed, Feb 05, 2025 at 05:41:53PM -0800, Reinette Chatre wrote:
>> On 2/5/25 4:51 PM, Luck, Tony wrote:
>>>> This new arch API has sharp corners because of asymmetry of where resctrl
>>>> runs the arch function. I do not think it is required to change this since we
>>>> can only speculate about how this may be used in the future but I do think
>>>> it will be helpful to add comments that highlight:
>>>>
>>>> resctrl_arch_mon_event_config_get() ->  May run on CPU that does not belong to domain.
>>>> resctrl_arch_mon_event_config_set() ->  Runs on CPU that belongs to domain.
>>>
>>> Here's a vague data point about the future to help with speculation.
>>>
>>> I have something coming along the pipeline that also can run on any CPU.
>>>
>>> I am contemplating a flag in the rdt_resource structure (in appropriate substructure
>>> resctrl_cache/resctrl_membw) to indicate "domain" vs. "any" for operations.
>>>
>>> Would something like that be useful here?
>>
>> hmm ... I cannot envision how this may look. Could you please elaborate?
>>
>> You mention "a" (singular) flag in rdt_resource while this scenario involves
>> different ops having different scope. This makes me think that this flag may
>> have to be per operation that in turn would need additional infrastructure to
>> manage and track operations.
>>
>> These "arch" functions are evolving as the work to support MPAM is done and
>> so far I think it has been quite ad-hoc to just refactor arch specific code
>> into "arch" helpers instead of keeping track of which scope they are running in.
>> This currently requires any arch implementing an "arch" helper to be well aware
>> of how resctrl will call it.

> For MPAM, we must typically do all configuration access from a CPU in a
> power domain that depends on the power domain of the relevant MPAM MSC,
> including reads of the configuration.

This is the worst case - but the firmware can describe an MSC as being globally accessible.


> In the MPAM case, the required topology knowledge is not necessarily
> identical to the resctrl domain topology, so it doesn't feel right to
> have the resctrl core code making the decisions.

This is up to the MPAM driver to hide. The upshot is that for some calls
resctrl needs to schedule work so that the MPAM driver can in turn IPI to get
the rest of the work done. The changes for this are already upstream.


> So, in the interest of keeping the arch interface simple, should cross-
> calling be delegated to the arch code, at least for now?


That's invasive on top of what we already have. Take smp_mon_event_count() as an
example: there is a bunch of work that resctrl can do on a CPU that can access
the hardware registers. If the cpumask was hidden, then we'd either need more IPIs,
or to allocate a structure that can describe a long list of monitors to read.



Thanks,

James

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-21 17:14               ` Dave Martin
@ 2025-02-21 18:23                 ` Moger, Babu
  2025-02-21 22:48                   ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-21 18:23 UTC (permalink / raw)
  To: Dave Martin, Reinette Chatre
  Cc: Peter Newman, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi All,

On 2/21/2025 11:14 AM, Dave Martin wrote:
> Hi,
> 
> On Thu, Feb 20, 2025 at 09:08:17AM -0800, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 2/20/25 5:40 AM, Dave Martin wrote:
>>> On Thu, Feb 20, 2025 at 11:35:56AM +0100, Peter Newman wrote:
>>>> Hi Reinette,
>>>>
>>>> On Wed, Feb 19, 2025 at 6:55 PM Reinette Chatre
>>>> <reinette.chatre@intel.com> wrote:
> 
> [...]
> 
>>>>> Could you please remind me how a user will set this flag?
>>>>
>>>> Quoting my original suggestion[1]:
>>>>
>>>>   "info/L3_MON/mbm_assign_on_mkdir?
>>>>
>>>>    boolean (parsed with kstrtobool()), defaulting to true?"
>>>>
>>>> After mount, any groups that got counters on creation would have to be
>>>> cleaned up, but at least that can be done with forward progress once
>>>> the flag is cleared.
>>>>
>>>> I was able to live with that as long as there aren't users polling for
>>>> resctrl to be mounted and immediately creating groups. For us, a
>>>> single container manager service manages resctrl.
> 
> [...]
> 
>>> +1
>>>
>>> That's basically my position -- the auto-assignment feels like a
>>> _potential_ nuisance for ABMC-aware users, but it depends on what they
>>> are trying to do.  Migration of non-ABMC-aware users will be easier for
>>> basic use cases if auto-assignment occurs by default (as in this
>>> series).
>>>
>>> Having an explicit way to turn this off seems perfectly reasonable
>>> (and could be added later on, if not provided in this series).
>>>
>>>
>>> What about the question re whether turning mbm_cntr_assign mode on
>>> should trigger auto-assignment?
>>>
>>> Currently turning this mode off and then on again has the effect of
>>> removing all automatic assignments for extant groups.  This feels
>>> surprising and/or unintentional (?)
>>
>> Connecting to what you start off by saying I also see auto-assignment
>> as the way to provide a smooth transition for "non-ABMC-aware" users.
> 
> I agree, and having this on by default also helps non-ABMC-aware users.
> 
>> To me a user that turns this mode off and then on again can be
>> considered as a user that is "ABMC-aware" and turning it "off and then
>> on again" seems like an intuitive way to get to a "clean slate"
>> wrt counter assignments. This may also be a convenient way for
>> an "ABMC-aware" user space to unassign all counters and thus also
>> helpful if resctrl supports the flag that Peter proposed. The flag
>> seems to already keep something like this in its context with
>> a name of "mbm_assign_on_mkdir" that could be interpreted as
>> "only auto assign on mkdir"?
> 
> Yes, that's reasonable.  It could be a good idea to document this
> behaviour of switching the mbm_cntr_assign mode, if we think it is
> useful and people are likely to rely on it.
> 
> Since mkdir is an implementation detail of the resctrl interface, I'd
> be tempted to go for a more generic name, say,
> "mbm_assign_new_mon_groups".  But that's just bikeshedding.
> The proposed behaviour seems fine.
> 
> Either way, if this is not included in this series, it could be added
> later without breaking anything.

How about a more generic "mbm_cntr_assign_auto"?

We can add it as part of "struct resctrl_mon" and set it "on" when ABMC
is detected. It will be part of the check in rdtgroup_assign_cntrs(), which
is called when new groups are created. Also, provide a user interface to
disable it. Seems simple to me.
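
Roughly, as a userspace sketch: the struct below is a hypothetical cut-down
stand-in for struct resctrl_mon, and free_cntrs plus both function names exist
only for this model. It assumes a new group wants one counter each for the
total and local events.

```c
#include <stdbool.h>

/* Hypothetical subset of struct resctrl_mon; only what this sketch needs. */
struct resctrl_mon_sketch {
	bool	mbm_cntr_assignable;	/* ABMC-like feature detected */
	bool	mbm_cntr_assign_auto;	/* proposed flag: assign on group creation */
	int	free_cntrs;		/* counters still available (model only) */
};

/* The flag defaults to "on" whenever the feature is detected. */
static void mon_sketch_init(struct resctrl_mon_sketch *mon, bool abmc_detected,
			    int num_cntrs)
{
	mon->mbm_cntr_assignable = abmc_detected;
	mon->mbm_cntr_assign_auto = abmc_detected;
	mon->free_cntrs = abmc_detected ? num_cntrs : 0;
}

/*
 * Called when a new group is created. Returns how many counters were
 * assigned: 0 when auto-assignment is off, otherwise up to 2.
 */
static int assign_cntrs_on_create(struct resctrl_mon_sketch *mon)
{
	int wanted = 2;	/* one for total, one for local */
	int got;

	if (!mon->mbm_cntr_assignable || !mon->mbm_cntr_assign_auto)
		return 0;

	got = mon->free_cntrs < wanted ? mon->free_cntrs : wanted;
	mon->free_cntrs -= got;
	return got;
}
```

Clearing mbm_cntr_assign_auto from user space would then leave new groups
without counters until they are assigned explicitly.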

Thanks
Babu


> 
> 
>> I am not taking a stand for one or the other approach but instead
>> trying to be more specific about pros/cons. Could you please provide
>> more insight in the use case you have in mind so that we can see how
>> resctrl could behave with few surprises?
>>
>> Reinette
> 
> I don't have a strong view either.
> 
> I don't have a concrete use case here -- I was just trying to imagine
> the experience of an ABMC-aware user who wants full control over what
> counters get assigned.
> 
> I agree that the convenience of the non-ABMC-aware user should probably
> take priority over that of the ABMC-aware user, at least in situations
> where the expected behaviour is achievable (i.e., where we didn't run
> out of counters to auto-assign.)
> 
> Cheers
> ---Dave
> 


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 06/23] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2025-02-21 18:05   ` James Morse
@ 2025-02-21 18:25     ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-21 18:25 UTC (permalink / raw)
  To: James Morse, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi James,

On 2/21/25 10:05 AM, James Morse wrote:
> Hi Babu,
> 
> On 22/01/2025 20:20, Babu Moger wrote:
>> Add the functionality to enable/disable AMD ABMC feature.
>>
>> AMD ABMC feature is enabled by setting enabled bit(0) in MSR
>> L3_QOS_EXT_CFG. When the state of ABMC is changed, the MSR needs
>> to be updated on all the logical processors in the QOS Domain.
>>
>> Hardware counters will reset when ABMC state is changed.
>>
>> The ABMC feature details are documented in APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC).
> 
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index 05358e78147b..ca69f2e0909f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -658,4 +663,6 @@ void resctrl_file_fflags_init(const char *config, unsigned long fflags);
>>  void rdt_staged_configs_clear(void);
>>  bool closid_allocated(unsigned int closid);
>>  int resctrl_find_cleanest_closid(void);
>> +int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
>> +bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
>>  #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> 
> A minor nit - but could these be added to include/linux/resctrl.h instead?
> This is where they need to end up after the arch/fs split, and its harmless to do it from
> the beginning.

These prototypes were moved back to the internal header to follow guidance
from Boris that was received during recent software controller enhancement.
Boris advised [1] that items needed by other architecture should only be
moved to include/linux/resctrl.h when that support is added.

Reinette

[1] https://lore.kernel.org/lkml/20241209222047.GKZ1dtPxIu5_Hxs1fp@fat_crate.local/


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-02-21 18:07   ` James Morse
@ 2025-02-21 18:35     ` Reinette Chatre
  2025-02-21 20:10       ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-21 18:35 UTC (permalink / raw)
  To: James Morse, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi James,

On 2/21/25 10:07 AM, James Morse wrote:
> Hi Babu,
> 
> On 22/01/2025 20:20, Babu Moger wrote:
>> In mbm_cntr_assign mode hardware counters are assigned/unassigned to an
>> MBM event of a monitor group. Hardware counters are assigned/unassigned
>> at monitoring domain level.
>>
>> Manage a monitoring domain's hardware counters using a per monitoring
>> domain array of struct mbm_cntr_cfg that is indexed by the hardware
>> counter ID. A hardware counter's configuration contains the MBM event
>> ID and points to the monitoring group that it is assigned to, with a
>> NULL pointer meaning that the hardware counter is available for assignment.
>>
>> There is no direct way to determine which hardware counters are assigned
>> to a particular monitoring group. Check every entry of every hardware
>> counter configuration array in every monitoring domain to query which
>> MBM events of a monitoring group are tracked by hardware. Such queries
>> are acceptable because of a very small number of assignable counters.
> 
>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index 511cfce8fc21..9a54e307d340 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -94,6 +94,18 @@ struct rdt_ctrl_domain {
>>  	u32				*mbps_val;
>>  };
>>  
>> +/**
>> + * struct mbm_cntr_cfg - assignable counter configuration
>> + * @evtid:		 MBM event to which the counter is assigned. Only valid
>> + *			 if @rdtgrp is not NULL.
>> + * @rdtgrp:		 resctrl group assigned to the counter. NULL if the
>> + *			 counter is free.
>> + */
>> +struct mbm_cntr_cfg {
>> +	enum resctrl_event_id	evtid;
>> +	struct rdtgroup		*rdtgrp;
>> +};
> 
> struct rdtgroup here suggests this shouldn't be something the arch code is touching.
> 
> If its not needed by any arch specific code, (I couldn't find a resctrl_arch helper that
> takes this) - could it be moved to resctrl's internal.h.
> 
> (If this does need to be visible to the arch code, one option would be to replace rdtgroup
> with the closid/rmid, and a valid flag so that memset() continues to reset these entries)
> 

Thank you for catching this!

Reinette


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 07/23] x86/resctrl: Introduce the interface to display monitor mode
  2025-02-21 18:06   ` James Morse
@ 2025-02-21 19:44     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-21 19:44 UTC (permalink / raw)
  To: James Morse, Babu Moger, corbet, reinette.chatre, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi James,

On 2/21/2025 12:06 PM, James Morse wrote:
> Hi Babu,
> 
> On 22/01/2025 20:20, Babu Moger wrote:
>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>> supported.
>>
>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>
>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>
>> The "default" mode is the existing monitoring mode that works without the
>> explicit counter assignment, instead relying on dynamic counter assignment
>> by hardware that may result in hardware not dedicating a counter resulting
>> in monitoring data reads returning "Unavailable".
>>
>> Provide an interface to display the monitor mode on the system.
>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> [mbm_cntr_assign]
>> default
> 
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index fb90f08e564e..b5defc5bce0e 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -257,6 +257,32 @@ with the following files:
>>   	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>>   	    0=0x30;1=0x30;3=0x15;4=0x15
>>   
>> +"mbm_assign_mode":
>> +	Reports the list of monitoring modes supported. The enclosed brackets
>> +	indicate which mode is enabled.
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> +	  [mbm_cntr_assign]
>> +	  default
>> +
>> +	"mbm_cntr_assign":
>> +
>> +	In mbm_cntr_assign, monitoring event can only accumulate data while
>> +	it is backed by a hardware counter. The user-space is able to specify
>> +	which of the events in CTRL_MON or MON groups should have a counter
>> +	assigned using the "mbm_assign_control" file. The number of counters
>> +	available is described in the "num_mbm_cntrs" file. Changing the mode
>> +	may cause all counters on a resource to reset.
> 
>> +	"default":
>> +
>> +	In default mode, resctrl assumes there is a hardware counter for each
>> +	event within every CTRL_MON and MON group. On AMD platforms, it is
>> +	recommended to use mbm_cntr_assign mode if supported, because reading
>> +	"mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable' if
>> +	there is no counter associated with that event.
> 
> But if you read a value instead of "Unavailable", that doesn't mean the value is correct.
> For two reads that succeed, the counter may have been reset in the middle.
> 
> I'm suggesting something like:
> | it is recommended to use mbm_cntr_assign mode if supported, to avoid counters
> | being re-allocated by hardware. This can cause a misleading value to be read,
> | or if no counter is associated with that event "Unavailable".
> 
Looks good with a slight modification.

On AMD platforms, it is recommended to use the mbm_cntr_assign mode, if
supported, to prevent the hardware from resetting counters between
reads. Such resets can result in misleading values, or in "Unavailable"
being reported if no counter is assigned to the event.
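
To illustrate why a successful read can still mislead, here is a small
arithmetic sketch. Both function names are invented for this example, and it
deliberately ignores counter-width wrap-around, which real consumers also have
to handle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes counted between two reads, as a naive consumer would compute it. */
static uint64_t mbm_delta(uint64_t prev, uint64_t cur)
{
	return cur - prev;
}

/*
 * A backwards-moving count hints at a reset, but a reset followed by enough
 * traffic to climb past the previous value is indistinguishable from normal
 * progress.
 */
static bool mbm_delta_plausible(uint64_t prev, uint64_t cur)
{
	return cur >= prev;
}
```

Suppose the first read returns 100, another 400 bytes flow, the hardware
resets the counter, and 150 more bytes flow before the second read returns
150. The delta looks like a plausible 50 bytes although 550 were transferred.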


> 
>>   "max_threshold_occupancy":
>>   		Read/write file provides the largest value (in
>>   		bytes) at which a previously used LLC_occupancy
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index f91fe605766f..3880480a41d2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -854,6 +854,30 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
>>   	return ret;
>>   }
>>   
>> +static int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
>> +					struct seq_file *s, void *v)
>> +{
>> +	struct rdt_resource *r = of->kn->parent->priv;
>> +
>> +	mutex_lock(&rdtgroup_mutex);
>> +
>> +	if (r->mon.mbm_cntr_assignable) {
>> +		if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
>> +			seq_puts(s, "[mbm_cntr_assign]\n");
>> +			seq_puts(s, "default\n");
>> +		} else {
>> +			seq_puts(s, "mbm_cntr_assign\n");
>> +			seq_puts(s, "[default]\n");
>> +		}
> 
> What do you think to an architecture being able to opt-out of this flexibility?
> 
> If there aren't enough counters I can expose what the hardware has through this interface
> - but if user-space turns it off ... then what?
> 
> For MPAM this would need to be some best-effort software allocation strategy that I'd
> rather not write - it's not a problem that can be solved, and any value that is reported is
> likely to be wrong. For ABMC platforms, existing stable kernels expose a value, so being
> able to preserve the existing behaviour makes sense. MPAM doesn't have this problem.
> 
> Something like this:
> ----------%<----------
> @@ -861,16 +861,21 @@ static int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
>                                          struct seq_file *s, void *v)
>   {
>          struct rdt_resource *r = of->kn->parent->priv;
> +       bool enabled = resctrl_arch_mbm_cntr_assign_enabled(r);
> 
>          mutex_lock(&rdtgroup_mutex);
> 
>          if (r->mon.mbm_cntr_assignable) {
> -               if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
> +               if (enabled)
>                          seq_puts(s, "[mbm_cntr_assign]\n");
> -                       seq_puts(s, "default\n");
> -               } else {
> -                       seq_puts(s, "mbm_cntr_assign\n");
> +               else
>                          seq_puts(s, "[default]\n");
> +
> +               if (!IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED)) {
> +                       if (enabled)
> +                               seq_puts(s, "default\n");
> +                       else
> +                               seq_puts(s, "mbm_cntr_assign\n");
>                  }
>          } else {
>                  seq_puts(s, "[default]\n");
> ----------%<----------
> 
> x86 wouldn't define CONFIG_RESCTRL_ASSIGN_FIXED, arm64 would, meaning for MPAM the file
> would be either:
>   | [default]
> or
> | [mbm_cntr_assign]
> 

Looks mostly good.

resctrl_arch_mbm_cntr_assign_enabled(r) needs to be called with
rdtgroup_mutex held.

Also, there is no reference to CONFIG_RESCTRL_ASSIGN_FIXED in this
series. Hope that is not an issue.

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3880480a41d2..2907bb7bfa56 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -858,16 +858,22 @@ static int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
                                         struct seq_file *s, void *v)
  {
         struct rdt_resource *r = of->kn->parent->priv;
+       bool enabled;

         mutex_lock(&rdtgroup_mutex);
+       enabled = resctrl_arch_mbm_cntr_assign_enabled(r);

         if (r->mon.mbm_cntr_assignable) {
-               if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
+               if (enabled)
                         seq_puts(s, "[mbm_cntr_assign]\n");
-                       seq_puts(s, "default\n");
-               } else {
-                       seq_puts(s, "mbm_cntr_assign\n");
+               else
                         seq_puts(s, "[default]\n");
+
+               if (!IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED)) {
+                       if (enabled)
+                               seq_puts(s, "default\n");
+                       else
+                               seq_puts(s, "mbm_cntr_assign\n");
                 }
         } else {
                 seq_puts(s, "[default]\n");
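
For reference, the output this revised show logic would produce can be
modelled in user space. This is only a sketch: mode_show_model() and its
"fixed" parameter are made-up names standing in for the kernel function
and for IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED), which this series does
not define.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * User-space model of the revised show logic; not kernel code.
 * "fixed" stands in for IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED).
 * Returns the number of bytes placed in out.
 */
static size_t mode_show_model(char *out, size_t len, bool assignable,
			      bool enabled, bool fixed)
{
	const char *active, *other;

	assert(out && len > 0);

	if (!assignable) {
		/* No assignable counters: only the default mode exists */
		active = "[default]\n";
		other = "";
	} else {
		/* The active mode is printed first, in brackets */
		active = enabled ? "[mbm_cntr_assign]\n" : "[default]\n";
		/* The alternative is listed only if the mode can be switched */
		if (fixed)
			other = "";
		else
			other = enabled ? "default\n" : "mbm_cntr_assign\n";
	}

	snprintf(out, len, "%s%s", active, other);
	return strlen(out);
}
```

One visible side effect of the restructuring: the active mode is now always
printed first, so with counter assignment disabled the file would read
"[default]" then "mbm_cntr_assign", the reverse of the order in the
original patch.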




> 
>> +	} else {
>> +		seq_puts(s, "[default]\n");
>> +	}
>> +
>> +	mutex_unlock(&rdtgroup_mutex);
>> +
>> +	return 0;
>> +}
> 
> 
> Thanks,
> 
> James
> 

Thanks
Babu

^ permalink raw reply related	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 11/23] x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at domain
  2025-02-21 18:35     ` Reinette Chatre
@ 2025-02-21 20:10       ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-21 20:10 UTC (permalink / raw)
  To: Reinette Chatre, James Morse, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian



On 2/21/2025 12:35 PM, Reinette Chatre wrote:
> Hi James,
> 
> On 2/21/25 10:07 AM, James Morse wrote:
>> Hi Babu,
>>
>> On 22/01/2025 20:20, Babu Moger wrote:
>>> In mbm_cntr_assign mode hardware counters are assigned/unassigned to an
>>> MBM event of a monitor group. Hardware counters are assigned/unassigned
>>> at monitoring domain level.
>>>
>>> Manage a monitoring domain's hardware counters using a per monitoring
>>> domain array of struct mbm_cntr_cfg that is indexed by the hardware
>>> counter ID. A hardware counter's configuration contains the MBM event
>>> ID and points to the monitoring group that it is assigned to, with a
>>> NULL pointer meaning that the hardware counter is available for assignment.
>>>
>>> There is no direct way to determine which hardware counters are assigned
>>> to a particular monitoring group. Check every entry of every hardware
>>> counter configuration array in every monitoring domain to query which
>>> MBM events of a monitoring group are tracked by hardware. Such queries
>>> are acceptable because of a very small number of assignable counters.
>>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 511cfce8fc21..9a54e307d340 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -94,6 +94,18 @@ struct rdt_ctrl_domain {
>>>   	u32				*mbps_val;
>>>   };
>>>   
>>> +/**
>>> + * struct mbm_cntr_cfg - assignable counter configuration
>>> + * @evtid:		 MBM event to which the counter is assigned. Only valid
>>> + *			 if @rdtgrp is not NULL.
>>> + * @rdtgrp:		 resctrl group assigned to the counter. NULL if the
>>> + *			 counter is free.
>>> + */
>>> +struct mbm_cntr_cfg {
>>> +	enum resctrl_event_id	evtid;
>>> +	struct rdtgroup		*rdtgrp;
>>> +};
>>
>> struct rdtgroup here suggests this shouldn't be something the arch code is touching.
>>
>> If it's not needed by any arch-specific code (I couldn't find a resctrl_arch helper that
>> takes this), could it be moved to resctrl's internal.h?
>>
>> (If this does need to be visible to the arch code, one option would be to replace rdtgroup
>> with the closid/rmid, and a valid flag so that memset() continues to reset these entries)
>>
> 
> Thank you for catching this!
> 
> Reinette
> 
> 

Sure. Will move it to arch/x86/kernel/cpu/resctrl/internal.h.
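
For comparison, the alternative James mentions (closid/rmid plus a valid
flag instead of a struct rdtgroup pointer) could look roughly like the
user-space model below. All names here are hypothetical and nothing in
this series defines them; the point is only that a zeroed entry still
means "free" and that lookups remain a linear scan of a small array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical arch-visible layout: no struct rdtgroup pointer */
struct cntr_cfg_model {
	int		evtid;	/* only meaningful when valid is true */
	uint32_t	closid;
	uint32_t	rmid;
	bool		valid;	/* false (zero) means the counter is free */
};

/* Claim the first free counter; returns its ID or -1 if none is free */
static int cntr_alloc(struct cntr_cfg_model *cfg, int num, int evtid,
		      uint32_t closid, uint32_t rmid)
{
	for (int i = 0; i < num; i++) {
		if (!cfg[i].valid) {
			cfg[i] = (struct cntr_cfg_model){ evtid, closid, rmid, true };
			return i;
		}
	}
	return -1;
}

/* Linear scan: which counter, if any, tracks this group's event? */
static int cntr_find(const struct cntr_cfg_model *cfg, int num, int evtid,
		     uint32_t closid, uint32_t rmid)
{
	for (int i = 0; i < num; i++) {
		if (cfg[i].valid && cfg[i].evtid == evtid &&
		    cfg[i].closid == closid && cfg[i].rmid == rmid)
			return i;
	}
	return -1;
}

/* Zeroing the entry clears valid, so memset() still frees the counter */
static void cntr_free(struct cntr_cfg_model *cfg, int num, int id)
{
	assert(id >= 0 && id < num);
	memset(&cfg[id], 0, sizeof(cfg[id]));
}
```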

thanks
Babu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-21 16:00           ` Dave Martin
@ 2025-02-21 20:10             ` Reinette Chatre
  2025-02-24 17:17               ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-21 20:10 UTC (permalink / raw)
  To: Dave Martin, Moger, Babu
  Cc: corbet, tglx, mingo, bp, dave.hansen, tony.luck, peternewman, x86,
	hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/21/25 8:00 AM, Dave Martin wrote:
> On Thu, Feb 20, 2025 at 03:29:12PM -0600, Moger, Babu wrote:
>> Hi Dave,
>>
>> On 2/20/25 09:44, Dave Martin wrote:
>>> Hi,
>>>
>>> On Wed, Feb 19, 2025 at 03:09:51PM -0600, Moger, Babu wrote:
> 
> [...]
> 
>>>> Good catch.
>>>>
>>>> I see similar buffer overflow is handled by calling seq_buf_clear()
>>>> (look at process_durations() or in show_user_instructions()).
>>>>
>>>> How about handling this by calling rdt_last_cmd_clear() before printing
>>>> each group?
>>>
>>> Does this work?
>>>
>>> Once seq_buf_has_overflowed() returns nonzero, data has been lost, no?
>>> So far as I can see, show_user_instructions() just gives up on printing
>>> the affected line, while process_durations() tries to anticipate
>>> overflow and prints out the accumulated text to dmesg before clearing
>>> the buffer.
>>
>> Yea. Agree,
>>
>>>
>>> In our case, we cannot send more data to userspace than was requested
>>> in the read() call, so we might have nowhere to drain the seq_buf
>>> contents to in order to free up space.
>>>
>>> sysfs "expects" userspace to do a big enough read() that this problem
>>> doesn't happen.  In practice this is OK because people usually read
>>> through a buffered I/O layer like stdio, and in realistic
>>> implementations the user-side I/O buffer is large enough to hide this
>>> issue.
>>>
>>> But mbm_assign_control data is dynamically generated and potentially
>>> much bigger than a typical sysfs file.
>>
>> I have no idea how to handle this case. We may have to live with this
>> problem. Let us know if there are any ideas.
> 
> I think the current implication is that this will work for now provided
> that the generated text fits in a page.
> 
> 
> Reinette, what's your view on accepting this limitation in the interest
> of stabilising this series, and tidying up this corner case later?
> 
> As for writes to this file, we're unlikely to hit the limit unless
> there are a lot of RMIDs available and many groups with excessively
> long names.

I am more concerned about reads to this file. If only 4K writes are
supported then user space can reconfigure the system in page sized
portions. It may not be efficient if the user wants to reconfigure the
entire system but it will work. The problem with reads is that if
reads larger than 4K are required but not supported, it will be
impossible for user space to learn the state of the system.

We may already be at that limit. Peter described [1] how AMD systems
already have 32 L3 monitoring domains and 256 RMIDs. With conservative
resource group names of 10 characters I see one line per monitoring group
that could look like below and thus easily be above 200 characters:

resgroupAA/mongroupAA/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl;12=tl;13=tl;14=tl;15=tl;16=tl;17=tl;18=tl;19=tl;20=tl;21=tl;22=tl;23=tl;24=tl;25=tl;26=tl;27=tl;28=tl;29=tl;30=tl;31=tl

Multiplying that with the existing possible 256 monitor groups the limit
is exceeded today.
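
A rough check of this arithmetic, as a user-space sketch. It assumes 32
domains numbered 0 to 31, one line per monitor group in the
"ctrl/mon/0=flags;1=flags;..." layout shown above, and a 4096-byte page;
assign_line_len() is a made-up helper, not part of resctrl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Length in bytes of one "ctrl/mon/0=fl;1=fl;...\n" line, given the
 * lengths of the two group names, the number of domains and the number
 * of flag characters per domain.
 */
static size_t assign_line_len(size_t ctrl_len, size_t mon_len,
			      unsigned int ndomains, size_t flags_len)
{
	size_t len = ctrl_len + 1 + mon_len + 1;	/* "ctrl/mon/" */

	assert(ndomains > 0);

	for (unsigned int d = 0; d < ndomains; d++) {
		/* Number of decimal digits in the domain ID */
		len += (size_t)snprintf(NULL, 0, "%u", d);
		len += 1 + flags_len;	/* "=" plus the flags */
		len += 1;		/* ";" separator or final newline */
	}

	return len;
}
```

With 10-character names this gives assign_line_len(10, 10, 32, 2) = 204
bytes per line, or 52224 bytes for 256 monitor groups. Replacing "tl" with
"_" gives 172 bytes per line and 44032 bytes in total, still more than ten
pages.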

I understand that having "tl" flags in all domains is not possible today,
but even if those are changed to "_" the resulting output would still
easily exceed a page when many RMIDs are in use.

> 
> This looks perfectly fixable, but it might be better to settle the
> design of this series first before we worry too much about it.

I think it is fair to delay support for writing more than a page of
data, but it looks to me like we need a solution to support displaying
more than a page of data to user space.

Reinette

[1] https://lore.kernel.org/lkml/20241106154306.2721688-2-peternewman@google.com/

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-21 15:53           ` Dave Martin
@ 2025-02-21 20:16             ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-21 20:16 UTC (permalink / raw)
  To: Dave Martin, Moger, Babu
  Cc: Moger, Babu, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/21/25 7:53 AM, Dave Martin wrote:
> Hi,
> 
> On Thu, Feb 20, 2025 at 02:57:31PM -0600, Moger, Babu wrote:
>> Hi Dave,
> 
> [...]
> 
>> Reproduced the problem with this code, using a "test" group.
>>
>> #include <stdio.h>
>> #include <errno.h>
>> #include <string.h>
>>
>> int main()
>> {
>>         FILE *file;
>>         int n;
>>
>>         file = fopen("/sys/fs/resctrl/info/L3_MON/mbm_assign_control", "w");
>>
>>         if (file == NULL) {
>>                 printf("Error opening file!\n");
>>                 return 1;
>>         }
>>
>>         printf("File opened successfully.\n");
>>
>>         for (n = 0; n < 100; n++)
>>                 if (fputs("test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;9=tl;10=tl;11=tl\n", file) == EOF)
>>                         fprintf(stderr, "Failed on iteration %d error %s\n", n, strerror(errno));
>>
>>         if (fclose(file) == 0) {
>>                 printf("File closed successfully.\n");
>>         } else {
>>                 printf("Error closing file!\n");
>>         }
>> }
> 
> Right.
> 
>> When the buffer overflow happens the newline will not be there. I have
>> added this error via rdt_last_cmd_puts. At least the user knows there is an error.
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 484d6009869f..70a96976e3ab 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1250,8 +1252,10 @@ static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
>>         int ret;
>>
>>         /* Valid input requires a trailing newline */
>> -       if (nbytes == 0 || buf[nbytes - 1] != '\n')
>> +       if (nbytes == 0 || buf[nbytes - 1] != '\n') {
>> +               rdt_last_cmd_puts("mbm_cntr_assign: buffer invalid\n");
>>                 return -EINVAL;
>> +       }
>>
>>         buf[nbytes - 1] = '\0';
>>
>>
>>
>> I am open to other ideas to handle this case.
> 
> Reinette, what do you think about this as a stopgap approach?

This seems fair. I expect that we need to document somewhere that writes
"may" (to leave wiggle room for improvements) require page sized writes. 

> 
> The worst that happens is that userspace gets an unexpected failure in
> scenarios that seem unlikely in the near future (i.e., where there are
> a lot of RMIDs available, and at the same time groups have been given
> stupidly long names).
> 
> Since this is an implementation issue rather than an interface issue,
> we could fix it later on.
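
In the meantime, user space can sidestep the failure by handing the
kernel one complete line per write() call rather than going through
stdio buffering. The sketch below assumes the failure in the test
program comes from stdio flushing a full buffer in the middle of a line;
write_line() is a made-up helper name.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Write one newline-terminated line with a single write() call so the
 * receiver never sees a partial line. Returns 0 on success, or -1 with
 * errno set on failure.
 */
static int write_line(int fd, const char *line)
{
	size_t len;
	ssize_t n;

	assert(line != NULL);

	len = strlen(line);
	if (len == 0 || line[len - 1] != '\n') {
		/* Mirror the kernel-side check: a trailing newline is required */
		errno = EINVAL;
		return -1;
	}

	n = write(fd, line, len);
	if (n < 0)
		return -1;
	if ((size_t)n != len) {
		/* A short write would split the line; report it as an error */
		errno = EIO;
		return -1;
	}

	return 0;
}
```

A caller would open mbm_assign_control with open(2) and call
write_line() once per group. Each line still has to fit in a page.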
> 
> 
> Longer term, we may want to define some stuff along the lines of
> 
> 	struct rdtgroup_file {
> 		/* persistent data for an rdtgroup open file instance */
> 	};
> 
> 	static int rdtgroup_file_open(struct kernfs_open_file *of)
> 	{
> 		struct rdtgroup_file *rf;
> 
> 		rf = kzalloc(sizeof(*rf), GFP_KERNEL);
> 		if (!rf)
> 			return -ENOMEM;
> 
> 		of->priv = rf;
> 		return 0;
> 	}
> 
> 	static void rdtgroup_file_release(struct kernfs_open_file *of)
> 	{
> 		/*
> 		 * Deal with dangling data and do cleanup appropriate
> 		 * for whatever kind of file this is, then:
> 		 */
> 		kfree(of->priv);
> 	}
> 
> 
> Then we'd have somewhere to stash data that needs to be carried over
> from one read/write call to the next.

Something like this seems needed for reading from this file. 
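
To illustrate what carrying state between read() calls could mean here,
below is a user-space model in which the per-open state holds a snapshot
of the generated text and a cursor, so that a reader with a small buffer
still sees everything. The struct and function names are hypothetical
and this is not a proposal for the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-open state: snapshot of the output plus a cursor */
struct open_state_model {
	const char	*text;	/* text generated when the file was opened */
	size_t		len;	/* total length of text */
	size_t		pos;	/* how much has been handed out so far */
};

/*
 * Copy up to count bytes from the current position into out and advance
 * the cursor. Returns the number of bytes copied; 0 means end of file.
 */
static size_t read_chunk(struct open_state_model *st, char *out, size_t count)
{
	size_t left, n;

	assert(st && out && st->pos <= st->len);

	left = st->len - st->pos;
	n = count < left ? count : left;
	memcpy(out, st->text + st->pos, n);
	st->pos += n;

	return n;
}
```

Because the snapshot is taken once, consecutive reads return a consistent
view even if assignments change in between, at the cost of holding the
whole text in memory for the lifetime of the open file.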

> 
> I tried to port my schemata buffering hack over, but the requirements
> are not exactly the same as for mbm_assign_control, so it wasn't
> trivial.  It feels do-able, but it might be better to stabilise this
> series before going down that road.
> 
> (I'm happy to spend some time trying to wire this up if it would be
> useful, though.)

I was hoping that we would not need to re-invent something here. This does
not seem like a new problem.

Reinette


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 14/23] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2025-02-21 18:06   ` James Morse
@ 2025-02-21 22:24     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-21 22:24 UTC (permalink / raw)
  To: James Morse, Babu Moger, corbet, reinette.chatre, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi James,


On 2/21/2025 12:06 PM, James Morse wrote:
> Hi Babu,
> 
> On 22/01/2025 20:20, Babu Moger wrote:
>> The ABMC feature provides an option to the user to assign a hardware
>> counter to an RMID, event pair and monitor the bandwidth as long as it
>> is assigned. The assigned RMID will be tracked by the hardware until the
>> user unassigns it manually.
>>
>> Implement an architecture-specific handler to assign and unassign the
>> counter. Configure counters by writing to the L3_QOS_ABMC_CFG MSR,
>> specifying the counter ID, bandwidth source (RMID), and event
>> configuration.
>>
>> The feature details are documented in the APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>      Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>      Monitoring (ABMC).
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index acac7972cea4..161d3feb567c 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -724,4 +724,7 @@ unsigned int mon_event_config_index_get(u32 evtid);
>>   void resctrl_arch_mon_event_config_set(void *info);
>>   u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>>   				      enum resctrl_event_id eventid);
>> +int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>> +			     u32 cntr_id, bool assign);
>>   #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> 
> 
> Could this be added to include/linux/resctrl.h instead? Its where it needs to end up
> eventually.
> 

As Reinette already mentioned in [1], Boris wanted this moved when the
arch/fs code separation is integrated. Let's keep it in resctrl/internal.h
for now.

[1] https://lore.kernel.org/lkml/e524c376-9ef8-488e-8053-b49ccafd306d@intel.com/


Thanks
Babu

> Thanks,
> 
> James
> 


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-21 13:12                         ` Peter Newman
@ 2025-02-21 22:43                           ` Reinette Chatre
  2025-02-25 17:11                             ` Peter Newman
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-21 22:43 UTC (permalink / raw)
  To: Peter Newman
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 2/21/25 5:12 AM, Peter Newman wrote:
> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>
>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>
>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>
>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>
>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>
>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>> <value>
>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>
>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>
>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>> is low enough to be of concern.
>>>>>>>
>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>> for.
>>>>>>
>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>
>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>> customers.
>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>
>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>> event-set for applying to a single counter rather than as individual
>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>> event names.
>>>>
>>>> Thank you for clarifying.
>>>>
>>>>>
>>>>> In the letters as events model, choosing the events assigned to a
>>>>> group wouldn't be enough information, since we would want to control
>>>>> which events should share a counter and which should be counted by
>>>>> separate counters. I think the amount of information that would need
>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>
>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>> writes in ABMC would look like...
>>>>>
>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>
>>>>> (per domain)
>>>>> group 0:
>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>> group 1:
>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>> ...
>>>>>
>>>>
>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>> example and above the counter configuration appears to be global. You do mention
>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>> configuration is a requirement?
>>>
>>> If it's global and we want a particular group to be watched by more
>>> counters, I wouldn't want this to result in allocating more counters
>>> for that group in all domains, or allocating counters in domains where
>>> they're not needed. I want to encourage my users to avoid allocating
>>> monitoring resources in domains where a job is not allowed to run so
>>> there's less pressure on the counters.
>>>
>>> In Dave's proposal it looks like global configuration means
>>> globally-defined "named counter configurations", which works because
>>> it's really per-domain assignment of the configurations to however
>>> many counters the group needs in each domain.
>>
>> I think I am becoming lost. Would a global configuration not break your
>> view of "event-set applied to a single counter"? If a counter is configured
>> globally then it would not make it possible to support the full configurability
>> of the hardware.
>> Before I add more confusion, let me try with an example that builds on your
>> earlier example copied below:
>>
>>>>> (per domain)
>>>>> group 0:
>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>> group 1:
>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>> ...
>>
>> Since the above states "per domain" I rewrite the example to highlight that as
>> I understand it:
>>
>> group 0:
>>  domain 0:
>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>  domain 1:
>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>> group 1:
>>  domain 0:
>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>  domain 1:
>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>
>> You mention that you do not want counters to be allocated in domains that they
>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>> in domain 1, resulting in:
>>
>> group 0:
>>  domain 0:
>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>> group 1:
>>  domain 0:
>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>  domain 1:
>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>
>> With counter 0 and counter 1 available in domain 1, these counters could
>> theoretically be configured to give group 1 more data in domain 1:
>>
>> group 0:
>>  domain 0:
>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>> group 1:
>>  domain 0:
>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>  domain 1:
>>   counter 0: LclFill,RmtFill
>>   counter 1: LclNTWr,RmtNTWr
>>   counter 2: LclSlowFill,RmtSlowFill
>>   counter 3: VictimBW
>>
>> The counters are shown with different per-domain configurations that seem to
>> match with earlier goals of (a) choose events counted by each counter and
>> (b) do not allocate counters in domains where they are not needed. As I
>> understand the above does contradict global counter configuration though.
>> Or do you mean that only the *name* of the counter is global and then
>> that it is reconfigured as part of every assignment?
> 
> Yes, I meant only the *name* is global. I assume based on a particular
> system configuration, the user will settle on a handful of useful
> groupings to count.
> 
> Perhaps mbm_assign_control syntax is the clearest way to express an example...
> 
>  # define global configurations (in ABMC terms), not necessarily in this
>  # syntax and probably not in the mbm_assign_control file.
> 
>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>  w=VictimBW,LclNTWr,RmtNTWr
> 
>  # legacy "total" configuration, effectively r+w
>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> 
>  /group0/0=t;1=t
>  /group1/0=t;1=t
>  /group2/0=_;1=t
>  /group3/0=rw;1=_
> 
> - group2 is restricted to domain 1
> - group3 is restricted to domain 0
> - the rest are unrestricted
> - In group3, we decided we need to separate read and write traffic
> 
> This consumes 4 counters in domain 0 and 3 counters in domain 1.
> 

I see. Thank you for the example.
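
The counter totals in the example can be checked with a small user-space
sketch. It assumes that every flag letter other than "_" consumes one
counter in the domain it is listed for, which is how I read "rw" costing
two counters; count_line() is a made-up helper, not an existing parser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one "<ctrl>/<mon>/<dom>=<flags>;<dom>=<flags>..." line and add
 * the counters it consumes to counts[dom]. Every flag character other
 * than '_' is assumed to cost one counter. Returns 0 on success, -1 on
 * a malformed line or an out-of-range domain ID.
 */
static int count_line(const char *line, unsigned int *counts,
		      unsigned int ndomains)
{
	const char *p;

	assert(line && counts);

	/* Domain assignments start after the last '/' */
	p = strrchr(line, '/');
	if (!p)
		return -1;
	p++;

	while (*p && *p != '\n') {
		char *end;
		unsigned long dom = strtoul(p, &end, 10);

		if (end == p || *end != '=' || dom >= ndomains)
			return -1;

		/* Walk the flags up to the next ';' or the end of the line */
		for (p = end + 1; *p && *p != ';' && *p != '\n'; p++) {
			if (*p != '_')
				counts[dom]++;
		}
		if (*p == ';')
			p++;
	}

	return 0;
}
```

Feeding it the four lines of the example yields 4 counters for domain 0
(group0, group1 and two for group3) and 3 for domain 1 (group0, group1,
group2), matching the totals quoted above.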

resctrl supports per-domain configurations with the following possible when
using mbm_total_bytes_config and mbm_local_bytes_config:

t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr

   /group0/0=t;1=t
   /group1/0=t;1=t

Even though the flags are identical in all domains, the assigned counters will
be configured differently in each domain.

With this supported by hardware and currently also supported by resctrl it seems
reasonable to carry this forward to what will be supported next.

>>
>>>> Until now I viewed counter configuration separate from counter assignment,
>>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and
>>>> mbm_local_bytes_config before they are assigned. That is still per-domain
>>>> counter configuration though, not per-counter.
>>>>
>>>>> I assume packing all of this info for a group's desired counter
>>>>> configuration into a single line (with 32 domains per line on many
>>>>> dual-socket AMD configurations I see) would be difficult to look at,
>>>>> even if we could settle on a single letter to represent each
>>>>> universally.
>>>>>
>>>>>>
>>>>>> My goal is for resctrl to have a user interface that can as much as possible
>>>>>> be ready for whatever may be required from it years down the line. Of course,
>>>>>> I may be wrong and resctrl would never need to support more than 26 events per
>>>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events
>>>>>> and how could resctrl support that?
>>>>>>
>>>>>> What is the risk of supporting more than 26 events? As I highlighted earlier
>>>>>> the interface I used as demonstration may become unwieldy to parse on a system
>>>>>> with many domains that supports many events. This is a concern for me. Any suggestions
>>>>>> will be appreciated, especially from you since I know that you are very familiar with
>>>>>> issues related to large scale use of resctrl interfaces.
>>>>>
>>>>> It's mainly just the unwieldiness of all the information in one file.
>>>>> It's already at the limit of what I can visually look through.
>>>>
>>>> I agree.
>>>>
>>>>>
>>>>> I believe that shared assignments will take care of all the
>>>>> high-frequency and performance-intensive batch configuration updates I
>>>>> was originally concerned about, so I no longer see much benefit in
>>>>> finding ways to textually encode all this information in a single file
>>>>> when it would be more manageable to distribute it around the
>>>>> filesystem hierarchy.
>>>>
>>>> This is significant. The motivation for the single file was to support
>>>> the "high-frequency and performance-intensive" usage. Would "shared assignments"
>>>> not also depend on the same files that, if distributed, will require many
>>>> filesystem operations?
>>>> Having the files distributed will be significantly simpler while also
>>>> avoiding the file size issue that Dave Martin exposed.
>>>
>>> The remaining filesystem operations will be assigning or removing
>>> shared counter assignments in the applicable domains, which would
>>> normally correspond to mkdir/rmdir of groups or changing their CPU
>>> affinity. The shared assignments are more "program and forget", while
>>> the exclusive assignment approach requires updates for every counter
>>> (in every domain) every few seconds to cover a large number of groups.
>>>
>>> When they want to pay extra attention to a particular group, I expect
>>> they'll ask for exclusive counters and leave them assigned for a while
>>> as they collect extra data.
>>
>> The single file approach is already unwieldy. The demands that will be
>> placed on it to support the usages currently being discussed would make this
>> interface even harder to use and manage. If the single file is not required
>> then I think we should go back to smaller files distributed in resctrl.
>> This may not even be an either/or argument. One way to view mbm_assign_control
>> could be as a way for user to interact with the distributed counter
>> related files with a single file system operation. Although, without
>> knowing how counter configuration is expected to work this remains unclear.
> 
> If we do both interfaces and the multi-file model gives us more
> capability to express configurations, we could find situations where
> there are configurations we cannot represent when reading back from
> mbm_assign_control, or updates through mbm_assign_control have
> ambiguous effects on existing configurations which were created with
> other files.

Right. My assumption was that the syntax would be identical.

> 
> However, the example I gave above seems to be adequately represented
> by a minor extension to mbm_assign_control and we all seem to

To confirm what you mean with "minor extension to mbm_assign_control",
is this where the flags are associated with counter configurations? At this
time this is done separately from mbm_assign_control with the hardcoded "t"
and "l" flags configured via mbm_total_bytes_config and mbm_local_bytes_config
respectively. I think it would be simpler to keep these configurations
separate from mbm_assign_control. How it would look without better
understanding of MPAM is not clear to me at this time, unless the
requirement is to enhance support for ABMC and BMEC. I do see that
this can be added later to build on what is supported by mbm_assign_control
with the syntax in this version.

> understand it now, so maybe it's not broken yet. It's unfortunate that
> work went into a requirement that's no longer relevant, but I don't
> think that on its own is a blocker.

I understand that requirements may change as we get new information.
Digesting it now is significantly easier than trying to adapt after
the user interface is merged and essentially set in stone.

Reinette
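
For illustration only: a minimal user-space sketch, not taken from the
series, of how a single "<domain_id>=<flags>" token using the "t" and "l"
flags discussed above might be decoded. The names parse_assign_token(),
ASSIGN_TOTAL and ASSIGN_LOCAL are hypothetical, and treating "_" as
"nothing assigned" is an assumption based on how it is used elsewhere in
this thread.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical flag bits; the kernel's internal representation may differ. */
#define ASSIGN_TOTAL 0x1u	/* "t": total bandwidth event */
#define ASSIGN_LOCAL 0x2u	/* "l": local bandwidth event */

/*
 * Decode one "<domain_id>=<flags>" token, e.g. "3=tl" or "7=_".
 * Returns 0 on success and -EINVAL on malformed input.
 */
static int parse_assign_token(const char *tok, unsigned long *domain,
			      unsigned int *flags)
{
	char *end;

	/* Require a leading digit so strtoul() cannot accept signs or spaces. */
	if (*tok < '0' || *tok > '9')
		return -EINVAL;

	errno = 0;
	*domain = strtoul(tok, &end, 10);
	if (errno || *end != '=')
		return -EINVAL;

	end++;
	if (*end == '\0')
		return -EINVAL;	/* at least one flag character is expected */

	*flags = 0;
	for (; *end; end++) {
		switch (*end) {
		case 't':
			*flags |= ASSIGN_TOTAL;
			break;
		case 'l':
			*flags |= ASSIGN_LOCAL;
			break;
		case '_':
			/* Assumed to mean "no event assigned". */
			break;
		default:
			return -EINVAL;
		}
	}
	return 0;
}
```

Any additional per-event configuration would need more than one letter
per event, which is part of why keeping it in separate files looks
simpler.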



* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-21 16:47                           ` Dave Martin
@ 2025-02-21 22:43                             ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-21 22:43 UTC (permalink / raw)
  To: Dave Martin
  Cc: Peter Newman, Moger, Babu, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Dave,

On 2/21/25 8:47 AM, Dave Martin wrote:
> Hi Reinette,
> 
> On Thu, Feb 20, 2025 at 10:36:18AM -0800, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 2/20/25 9:46 AM, Dave Martin wrote:
>>> Hi again,
>>>
>>> On Thu, Feb 20, 2025 at 04:46:40PM +0000, Dave Martin wrote:
> 
> [...]
> 
>>> Having taken a quick look at that now, this all seems to duplicate
>>> perf's design journey (again).
>>>
>>> "rate" events make some sense.  The perf equivalent is to keep an
>>> accumulated count of the amount of time a counter has been assigned to
>>> an event, and another accumulated count of the events counted by the
>>> counter during assignment.  Only userspace knows what it wants to do
>>> with this information: perf exposes the raw accumulated counts.
>>>
>>> Perf events can be also pinned so that they are prioritised for
>>> assignment to counters; that sounds a lot like the regular, non-shared
>>> resctrl counters.
>>>
>>>
>>> Playing devil's advocate:
>>>
>>> It does feel like we are doomed to reinvent perf if we go too far down
>>> this road...
>>>
>>>> If we split the file, it will be more closely aligned with the design
>>>> of the rest of the resctrlfs interface.
>>>>
>>>> OTOH, the current interface seems workable and I think the file size
>>>> issue can be addressed without major re-engineering.
>>>>
>>>> So, from my side, I would not consider the current interface design
>>>> a blocker.
>>>
>>> ...so, drawing a hard line around the use cases that we intend to
>>> address with this interface and avoiding feature creep seems desirable.
>>
>> This is exactly what I am trying to do ... to understand what use cases
>> the interface is expected to support.
>>
>> You have mentioned a couple of times now that this interface is sufficient but
>> at the same time you hinted at some features from MPAM that I do not see
>> possible to accommodate with this interface.
> 
> It's kind of both.
> 
> I think the interface is sufficient to be useful, and therefore has
> value.
> 
> The problem being addressed here (shortage of counters) is fully
> relevant to MPAM (at last on some hardware).
> 
> Any architecture may define new metrics and types of event that can be
> counted, and they're not going to match up exactly between arches -- so
> I don't think we can expect everything to fit perfectly within a
> generic interface.  But having a generic interface is still useful for
> making common features convenient to use.
> 
> So the interface is useful but not universal, but that doesn't feel
> like a bug.
> 
> Hopefully that makes my position a bit clearer.
> 
>>> resctrlfs is already in the wild, so providing reasonable baseline
>>> compatiblity with that interface for ABMC hardware is a sensible goal.
>>> The current series does that.
>>>
>>> But I wonder how much additional functionality we should really be
>>> adding via the mbm_assign_control interface, once this series is
>>> settled.
>>
>> Are you speculating that MPAM counters may not make use of this interface?
>>
>> Reinette
> 
> No, I think it makes sense for MPAM to follow this interface, as least
> as far as what has been proposed so far here.
> 
> I think James got his updated rebase working. [1]
> 
> 
> perf support would be for the future if we do it, but the ABMC
> interface may be a useful starting point anyway, because it allows
> counters to be assigned explicitly -- that provides a natural way to
> hand over some counters to perf, either because that interface may be a
> more natural fit for what the user is trying to do, or perhaps to count
> weird, platform-specific event types that do not merit the effort of
> integration into resctrlfs proper.
> 
> Does that make sense?
> 

This is reasonable. You did state earlier that we should aim to draw
hard lines around the use cases we aim to address and I think one
way this work is doing this is by being explicit in the user interface that
this is all about "memory bandwidth monitoring". This is not intended to
be a fully generic interface for all possible counters for all possible
resources.

Apart from that, time will tell how many blind spots there were while
creating this interface.

Thank you very much for all your very valuable insights.

Reinette
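
For illustration only: a minimal sketch of the perf-style accounting Dave
describes above, where one accumulated value tracks how long a counter
was assigned and another tracks what was counted during that time. The
struct and function names are hypothetical and nothing here is proposed
resctrl code; deriving a rate is shown only as one thing user space
might choose to do with the two raw values.

```c
#include <stdint.h>

/* Hypothetical accounting for a counter that is assigned only part of the time. */
struct shared_count {
	uint64_t assigned_ns;	/* accumulated time a counter was assigned */
	uint64_t bytes;		/* accumulated bytes counted while assigned */
};

/* Fold one assignment window into the running totals. */
static void shared_count_add(struct shared_count *sc, uint64_t window_ns,
			     uint64_t window_bytes)
{
	sc->assigned_ns += window_ns;
	sc->bytes += window_bytes;
}

/* Average bytes per second over the assigned time; 0 if never assigned. */
static double shared_count_rate(const struct shared_count *sc)
{
	if (!sc->assigned_ns)
		return 0.0;
	return (double)sc->bytes * 1e9 / (double)sc->assigned_ns;
}
```

Exposing both raw totals, as perf does, leaves the choice of what to
compute from them to user space.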



* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-21 18:23                 ` Moger, Babu
@ 2025-02-21 22:48                   ` Reinette Chatre
  2025-02-21 23:42                     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-21 22:48 UTC (permalink / raw)
  To: Moger, Babu, Dave Martin
  Cc: Peter Newman, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 2/21/25 10:23 AM, Moger, Babu wrote:
> Hi All,
> 
> On 2/21/2025 11:14 AM, Dave Martin wrote:
>> Hi,
>>
>> On Thu, Feb 20, 2025 at 09:08:17AM -0800, Reinette Chatre wrote:
>>> Hi Dave,
>>>
>>> On 2/20/25 5:40 AM, Dave Martin wrote:
>>>> On Thu, Feb 20, 2025 at 11:35:56AM +0100, Peter Newman wrote:
>>>>> Hi Reinette,
>>>>>
>>>>> On Wed, Feb 19, 2025 at 6:55 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>
>> [...]
>>
>>>>>> Could you please remind me how a user will set this flag?
>>>>>
>>>>> Quoting my original suggestion[1]:
>>>>>
>>>>>   "info/L3_MON/mbm_assign_on_mkdir?
>>>>>
>>>>>    boolean (parsed with kstrtobool()), defaulting to true?"
>>>>>
>>>>> After mount, any groups that got counters on creation would have to be
>>>>> cleaned up, but at least that can be done with forward progress once
>>>>> the flag is cleared.
>>>>>
>>>>> I was able to live with that as long as there aren't users polling for
>>>>> resctrl to be mounted and immediately creating groups. For us, a
>>>>> single container manager service manages resctrl.
>>
>> [...]
>>
>>>> +1
>>>>
>>>> That's basically my position -- the auto-assignment feels like a
>>>> _potential_ nuisance for ABMC-aware users, but it depends on what they
>>>> are trying to do.  Migration of non-ABMC-aware users will be easier for
>>>> basic use cases if auto-assignment occurs by default (as in this
>>>> series).
>>>>
>>>> Having an explicit way to turn this off seems perfectly reasonable
>>>> (and could be added later on, if not provided in this series).
>>>>
>>>>
>>>> What about the question re whether turning mbm_cntr_assign mode on
>>>> should trigger auto-assignment?
>>>>
>>>> Currently turning this mode off and then on again has the effect of
>>>> removing all automatic assignments for extant groups.  This feels
>>>> surprising and/or unintentional (?)
>>>
>>> Connecting to what you start off by saying I also see auto-assignment
>>> as the way to provide a smooth transition for "non-ABMC-aware" users.
>>
>> I agree, and having this on by default also helps non-ABMC-aware users.
>>
>>> To me a user that turns this mode off and then on again can be
>>> considered as a user that is "ABMC-aware" and turning it "off and then
>>> on again" seems like an intuitive way to get to a "clean slate"
>>> wrt counter assignments. This may also be a convenient way for
>>> an "ABMC-aware" user space to unassign all counters and thus also
>>> helpful if resctrl supports the flag that Peter proposed. The flag
>>> seems to already keep something like this in its context with
>>> a name of "mbm_assign_on_mkdir" that could be interpreted as
>>> "only auto assign on mkdir"?
>>
>> Yes, that's reasonable.  It could be a good idea to document this
>> behaviour of switching the mbm_cntr_assign mode, if we think it is
>> useful and people are likely to rely on it.
>>
>> Since mkdir is an implementation detail of the resctrl interface, I'd
>> be tempted to go for a more generic name, say,
>> "mbm_assign_new_mon_groups".  But that's just bikeshedding.
>> The proposed behaviour seems fine.
>>
>> Either way, if this is not included in this series, it could be added
>> later without breaking anything.
> 
> How about more generic "mbm_cntr_assign_auto" ?

I would like to be careful to not make it _too_ generic. Dave already pointed
out that users may be surprised that counters are not auto-assigned when switching
between the different modes so using the name to help highlight when this
auto-assignment can be expected to happen seems very useful.

Reinette



* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-21 22:48                   ` Reinette Chatre
@ 2025-02-21 23:42                     ` Moger, Babu
  2025-02-27 11:07                       ` Peter Newman
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-21 23:42 UTC (permalink / raw)
  To: Reinette Chatre, Dave Martin
  Cc: Peter Newman, Babu Moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/21/2025 4:48 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/21/25 10:23 AM, Moger, Babu wrote:
>> Hi All,
>>
>> On 2/21/2025 11:14 AM, Dave Martin wrote:
>>> Hi,
>>>
>>> On Thu, Feb 20, 2025 at 09:08:17AM -0800, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> On 2/20/25 5:40 AM, Dave Martin wrote:
>>>>> On Thu, Feb 20, 2025 at 11:35:56AM +0100, Peter Newman wrote:
>>>>>> Hi Reinette,
>>>>>>
>>>>>> On Wed, Feb 19, 2025 at 6:55 PM Reinette Chatre
>>>>>> <reinette.chatre@intel.com> wrote:
>>>
>>> [...]
>>>
>>>>>>> Could you please remind me how a user will set this flag?
>>>>>>
>>>>>> Quoting my original suggestion[1]:
>>>>>>
>>>>>>    "info/L3_MON/mbm_assign_on_mkdir?
>>>>>>
>>>>>>     boolean (parsed with kstrtobool()), defaulting to true?"
>>>>>>
>>>>>> After mount, any groups that got counters on creation would have to be
>>>>>> cleaned up, but at least that can be done with forward progress once
>>>>>> the flag is cleared.
>>>>>>
>>>>>> I was able to live with that as long as there aren't users polling for
>>>>>> resctrl to be mounted and immediately creating groups. For us, a
>>>>>> single container manager service manages resctrl.
>>>
>>> [...]
>>>
>>>>> +1
>>>>>
>>>>> That's basically my position -- the auto-assignment feels like a
>>>>> _potential_ nuisance for ABMC-aware users, but it depends on what they
>>>>> are trying to do.  Migration of non-ABMC-aware users will be easier for
>>>>> basic use cases if auto-assignment occurs by default (as in this
>>>>> series).
>>>>>
>>>>> Having an explicit way to turn this off seems perfectly reasonable
>>>>> (and could be added later on, if not provided in this series).
>>>>>
>>>>>
>>>>> What about the question re whether turning mbm_cntr_assign mode on
>>>>> should trigger auto-assignment?
>>>>>
>>>>> Currently turning this mode off and then on again has the effect of
>>>>> removing all automatic assignments for extant groups.  This feels
>>>>> surprising and/or unintentional (?)
>>>>
>>>> Connecting to what you start off by saying I also see auto-assignment
>>>> as the way to provide a smooth transition for "non-ABMC-aware" users.
>>>
>>> I agree, and having this on by default also helps non-ABMC-aware users.
>>>
>>>> To me a user that turns this mode off and then on again can be
>>>> considered as a user that is "ABMC-aware" and turning it "off and then
>>>> on again" seems like an intuitive way to get to a "clean slate"
>>>> wrt counter assignments. This may also be a convenient way for
>>>> an "ABMC-aware" user space to unassign all counters and thus also
>>>> helpful if resctrl supports the flag that Peter proposed. The flag
>>>> seems to already keep something like this in its context with
>>>> a name of "mbm_assign_on_mkdir" that could be interpreted as
>>>> "only auto assign on mkdir"?
>>>
>>> Yes, that's reasonable.  It could be a good idea to document this
>>> behaviour of switching the mbm_cntr_assign mode, if we think it is
>>> useful and people are likely to rely on it.
>>>
>>> Since mkdir is an implementation detail of the resctrl interface, I'd
>>> be tempted to go for a more generic name, say,
>>> "mbm_assign_new_mon_groups".  But that's just bikeshedding.
>>> The proposed behaviour seems fine.
>>>
>>> Either way, if this is not included in this series, it could be added
>>> later without breaking anything.
>>
>> How about more generic "mbm_cntr_assign_auto" ?
> 
> I would like to be careful to not make it _too_ generic. Dave already pointed
> out that users may be surprised that counters are not auto-assigned when switching
> between the different modes so using the the name to help highlight when this
> auto-assignment can be expected to happen seems very useful.

In that case "mbm_assign_on_mkdir" seems on point and precise.
Thanks
Babu
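
For illustration only: a user-space approximation of the kind of input a
kstrtobool()-parsed "mbm_assign_on_mkdir" file would accept. This is a
sketch of the commonly documented behaviour (only the first one or two
characters are examined), not the kernel's implementation; the exact set
of accepted spellings may differ between kernel versions, and the
function name is hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Approximate kstrtobool(): accept y/n, t/f, 1/0 and on/off in either
 * case. Returns 0 and stores the value in *res, or -EINVAL.
 */
static int parse_bool_like_kstrtobool(const char *s, bool *res)
{
	if (!s)
		return -EINVAL;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" / "off": the second character decides. */
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		default:
			break;
		}
		break;
	default:
		break;
	}
	return -EINVAL;
}
```

Because only the leading characters are examined, a trailing newline
from "echo 0 > ..." needs no special handling.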


* Re: [PATCH v11 16/23] x86/resctrl: Add the functionality to unassigm MBM events
  2025-02-10 18:30       ` Reinette Chatre
@ 2025-02-22  0:36         ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-22  0:36 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 2/10/2025 12:30 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/10/25 8:23 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 2/5/25 21:54, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> subject: unassigm -> unassign
>>
>> Sure.
>>
>>>
>>> On 1/22/25 12:20 PM, Babu Moger wrote:
>>>> The mbm_cntr_assign mode provides a limited number of hardware counters
>>>
>>> (now back to "limited number of hardware counters")
>>
>> How about?
>>
>> The mbm_cntr_assign mode provides "num_mbm_cntrs" number of hardware counters
> 
> ok.
> 
>>
>>>
>>>> that can be assigned to an RMID, event pair to monitor bandwidth while
>>>> assigned. If all counters are in use, the kernel will show an error
>>>> message: "Out of MBM assignable counters" when a new assignment is
>>>> requested. To make space for a new assignment, users must unassign an
>>>
>>> To me "kernel will show an error" implies the kernel ring buffer. Please make
>>> the message accurate and mention that it will be in
>>> last_cmd_status while also considering to use -ENOSPC to help user space.
>>
>> If all the counters are in use, the kernel will log the error message
>> "Unable to allocate counter in domain" in /sys/fs/resctrl/info/
>> last_cmd_status when a new assignment is requested. To make space for a
>> new assignment, users must unassign an already assigned counter and retry
>> the assignment again.
>>
> 
> This is better, but can user space receive -ENOSPC to avoid needing to check
> and parse last_cmd_status on every error?

Yes. There was a problem in passing the error in 
resctrl_process_flags(). Took care of it now.

Thanks
Babu
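
For illustration only: a minimal user-space sketch of why an errno such
as -ENOSPC is convenient. The helper below is hypothetical; it writes a
command string to a control file and returns a negative errno, so the
caller can distinguish "out of counters" from other failures without
reading and parsing last_cmd_status. Whether the kernel actually returns
ENOSPC for this case depends on the final version of the series.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write one command string to a control file.
 * Returns 0 on success or a negative errno value.
 */
static int resctrl_write_cmd(const char *path, const char *cmd)
{
	size_t len = strlen(cmd);
	ssize_t ret;
	int fd, err = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	ret = write(fd, cmd, len);
	if (ret < 0)
		err = -errno;
	else if ((size_t)ret != len)
		err = -EIO;	/* short write: treated as a failure in this sketch */

	close(fd);
	return err;
}
```

A caller could then test for -ENOSPC, unassign a counter and retry, and
fall back to reading /sys/fs/resctrl/info/last_cmd_status only for
unexpected errors. The location of the control file itself (assumed here
to be under /sys/fs/resctrl/info/L3_MON/) is left to the caller.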


* Re: [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2025-02-21 18:06   ` James Morse
@ 2025-02-24 15:49     ` Moger, Babu
  2025-02-24 17:01       ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-24 15:49 UTC (permalink / raw)
  To: James Morse, corbet, reinette.chatre, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman
  Cc: fenghua.yu, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi James,

On 2/21/25 12:06, James Morse wrote:
> Hi Babu,
> 
> On 22/01/2025 20:20, Babu Moger wrote:
>> Configure mbm_cntr_assign mode on AMD platforms. On AMD platforms, it
>> is recommended to use mbm_cntr_assign mode if supported, because
>> reading "mbm_total_bytes" or "mbm_local_bytes" will report 'Unavailable'
>> if there is no counter associated with that event.
> 
> (If you agree with my comment on patch 7, it would be good to update this
> wording to match.)

Sure.

> 
> 
>> The mbm_cntr_assign mode, referred to as ABMC (Assignable Bandwidth
>> Monitoring Counters) on AMD, is enabled by default when supported by the
>> system.
>>
>> Update ABMC across all logical processors within the resctrl domain to
>> ensure proper functionality.
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index c006c4d8d6ff..2480698b643d 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -734,4 +734,5 @@ int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d
>>  void mbm_cntr_reset(struct rdt_resource *r);
>>  int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
>>  		 struct rdtgroup *rdtgrp, enum resctrl_event_id evtid);
>> +void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
>>  #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> 
> Could this be put in include/linux/resctrl.h, its where it needs to end up eventually.
> 

As Reinette already mentioned in [1], Boris wanted this moved when the
arch/fs code separation is integrated. Let's keep it in resctrl/internal.h
for now.

[1]
https://lore.kernel.org/lkml/e524c376-9ef8-488e-8053-b49ccafd306d@intel.com/

> 
> 
> This sequence has me confused:
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 3d748fdbcb5f..a9a5dc626a1e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1233,6 +1233,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>  			r->mon.mbm_cntr_assignable = true;
>>  			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>>  			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
> 
>> +			hw_res->mbm_cntr_assign_enabled = true;
> 
> Here the arch code sets ABMC to be enabled by default at boot.
> 
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 6922173c4f8f..515969c5f64f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -4302,9 +4302,13 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
>>  
>>  void resctrl_online_cpu(unsigned int cpu)
>>  {
>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>>  	mutex_lock(&rdtgroup_mutex);
>>  	/* The CPU is set in default rdtgroup after online. */
>>  	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
>> +	if (r->mon_capable && r->mon.mbm_cntr_assignable)
>> +		resctrl_arch_mbm_cntr_assign_set_one(r);
>>  	mutex_unlock(&rdtgroup_mutex);
>>  }
> 
> But here, resctrl has to call back to the arch code to make sure the hardware is in the
> same state as hw_res->mbm_cntr_assign_enabled.
> 
> Could this be done in resctrl_arch_online_cpu() instead? That way resctrl doesn't get CPUs
> in an inconsistent state that it has to fix up...
> 

Sure. Here is the diff.

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 22399f19810f..f48b298413bc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -771,6 +771,12 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
                domain_add_cpu(cpu, r);
        mutex_unlock(&domain_list_lock);

+       r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+       mutex_lock(&rdtgroup_mutex);
+       if (r->mon_capable && r->mon.mbm_cntr_assignable)
+               resctrl_arch_mbm_cntr_assign_set_one(r);
+       mutex_unlock(&rdtgroup_mutex);
+
        clear_closid_rmid(cpu);
        resctrl_online_cpu(cpu);


-- 
Thanks
Babu Moger


* Re: [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2025-02-24 15:49     ` Moger, Babu
@ 2025-02-24 17:01       ` Reinette Chatre
  2025-02-24 21:18         ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-24 17:01 UTC (permalink / raw)
  To: babu.moger, James Morse, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi James and Babu,

On 2/24/25 7:49 AM, Moger, Babu wrote:
> Hi James,
> 
> On 2/21/25 12:06, James Morse wrote:
>> Hi Babu,
>>
>> On 22/01/2025 20:20, Babu Moger wrote:

>> This sequence has me confused:
>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index 3d748fdbcb5f..a9a5dc626a1e 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> @@ -1233,6 +1233,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>>  			r->mon.mbm_cntr_assignable = true;
>>>  			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>>>  			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
>>
>>> +			hw_res->mbm_cntr_assign_enabled = true;
>>
>> Here the arch code sets ABMC to be enabled by default at boot.
>>
>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index 6922173c4f8f..515969c5f64f 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -4302,9 +4302,13 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
>>>  
>>>  void resctrl_online_cpu(unsigned int cpu)
>>>  {
>>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>> +
>>>  	mutex_lock(&rdtgroup_mutex);
>>>  	/* The CPU is set in default rdtgroup after online. */
>>>  	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
>>> +	if (r->mon_capable && r->mon.mbm_cntr_assignable)
>>> +		resctrl_arch_mbm_cntr_assign_set_one(r);
>>>  	mutex_unlock(&rdtgroup_mutex);
>>>  }
>>
>> But here, resctrl has to call back to the arch code to make sure the hardware is in the
>> same state as hw_res->mbm_cntr_assign_enabled.

Another scenario that this flow needs to support is when CPUs come online
later, after resctrl is mounted and potentially after the user has modified
the assignable counter mode.

>>
>> Could this be done in resctrl_arch_online_cpu() instead? That way resctrl doesn't get CPUs
>> in an inconsistent state that it has to fix up...

Could you please elaborate on the inconsistent state that would need to be fixed up?

>>
> 
> Sure. Here is the diff.
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c
> b/arch/x86/kernel/cpu/resctrl/core.c
> index 22399f19810f..f48b298413bc 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -771,6 +771,12 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
>                 domain_add_cpu(cpu, r);
>         mutex_unlock(&domain_list_lock);
> 
> +       r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +       mutex_lock(&rdtgroup_mutex);
> +       if (r->mon_capable && r->mon.mbm_cntr_assignable)
> +               resctrl_arch_mbm_cntr_assign_set_one(r);
> +       mutex_unlock(&rdtgroup_mutex);
> +
>         clear_closid_rmid(cpu);
>         resctrl_online_cpu(cpu);

This would require every architecture to duplicate the above, no?

Also, please note there is the more appropriate domain_add_cpu_mon().

Reinette
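
For illustration only: a toy user-space model of the late-hotplug
scenario mentioned above. All names are hypothetical and this is not
kernel code; it only shows why a CPU that comes online after the user
has changed the mode must be programmed from the current software state
rather than from the boot-time default.

```c
#include <stdbool.h>

#define TOY_NR_CPUS 8

/* Toy stand-ins: one software "desired mode" flag plus per-CPU hardware state. */
struct toy_resctrl {
	bool assign_mode_enabled;		/* what software currently wants */
	bool cpu_online[TOY_NR_CPUS];
	bool cpu_hw_enabled[TOY_NR_CPUS];	/* what each CPU's hardware holds */
};

/* User changes the mode: update the flag and every CPU that is online now. */
static void toy_set_mode(struct toy_resctrl *t, bool enable)
{
	t->assign_mode_enabled = enable;
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (t->cpu_online[cpu])
			t->cpu_hw_enabled[cpu] = enable;
}

/* CPU comes online: program it with the current mode, not the boot default. */
static void toy_online_cpu(struct toy_resctrl *t, int cpu)
{
	t->cpu_online[cpu] = true;
	t->cpu_hw_enabled[cpu] = t->assign_mode_enabled;
}
```

In this model the only open question is which layer calls the
equivalent of toy_online_cpu(), which is the point under discussion.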



* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-21 20:10             ` Reinette Chatre
@ 2025-02-24 17:17               ` Dave Martin
  2025-02-24 17:23                 ` Luck, Tony
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-24 17:17 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Moger, Babu, corbet, tglx, mingo, bp, dave.hansen, tony.luck,
	peternewman, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Fri, Feb 21, 2025 at 12:10:44PM -0800, Reinette Chatre wrote:
> Hi Dave,
> 
> On 2/21/25 8:00 AM, Dave Martin wrote:
> > On Thu, Feb 20, 2025 at 03:29:12PM -0600, Moger, Babu wrote:
> >> Hi Dave,
> >>
> >> On 2/20/25 09:44, Dave Martin wrote:

[...]

> >>> But mbm_assign_control data is dynamically generated and potentially
> >>> much bigger than a typical sysfs file.
> >>
> >> I have no idea how to handle this case. We may have to live with this
> >> problem. Let us know if there are any ideas.
> > 
> > I think the current implication is that this will work for now provided
> > that the generated text fits in a page.
> > 
> > 
> > Reinette, what's your view on accepting this limitation in the interest
> > of stabilising this series, and tidying up this corner case later?
> > 
> > As for writes to this file, we're unlikely to hit the limit unless
> > there are a lot of RMIDs available and many groups with excessively
> > long names.
> 
> I am more concerned about reads to this file. If only 4K writes are
> supported then user space can reconfigure the system in page sized
> portions. It may not be efficient if the user wants to reconfigure the
> entire system but it will work. The problem with reads is that if
> larger than 4K reads are required but not supported then it will be
> impossible for user space to learn state of the system.
> 
> We may already be at that limit. Peter described [1] how AMD systems
> already have 32 L3 monitoring domains and 256 RMIDs. With conservative
> resource group names of 10 characters I see one line per monitoring group
> that could look like below and thus easily be above 200 characters:
> 
> resgroupAA/mongroupAA/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl;12=tl;13=tl;14=tl;15=tl;16=tl;17=tl;18=tl;19=tl;20=tl;21=tl;22=tl;23=tl;24=tl;25=tl;26=tl;27=tl;28=tl;29=tl;30=tl;31=tl;32=tl
> 
> Multiplying that with the existing possible 256 monitor groups the limit
> is exceeded today.

That's useful to know.  I guess we probably shouldn't just kick this
issue down the road, then -- at least on the read side (as you say).
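
For illustration only: a small user-space sketch of the arithmetic
behind the estimate above. It computes the length of one listing line of
the form shown in the quoted example, assuming every domain carries the
same flags string and domain IDs run from 0 to ndomains - 1. The
function name is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Length in bytes of one line of the form
 *   "<ctrl group>/<mon group>/0=<flags>;1=<flags>;...\n"
 * with the same flags string for every domain.
 */
static size_t assign_line_len(const char *ctrl, const char *mon,
			      unsigned int ndomains, const char *flags)
{
	size_t len;
	unsigned int i;

	/* snprintf(NULL, 0, ...) returns the length that would be written. */
	len = (size_t)snprintf(NULL, 0, "%s/%s/", ctrl, mon);
	for (i = 0; i < ndomains; i++) {
		len += (size_t)snprintf(NULL, 0, "%u=%s", i, flags);
		if (i + 1 < ndomains)
			len++;	/* ';' separator between domains */
	}
	return len + 1;	/* trailing newline */
}
```

With two 10-character group names, 32 domains and "tl" flags this gives
204 bytes per line, so 256 such groups need 52224 bytes. Even with "_"
for every domain a line is 172 bytes and 256 groups need 44032 bytes,
well beyond a single 4096-byte page.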

> I understand that all domains having "tl" flags are not possible today, but
> even if that is changed to "_" the resulting display still looks to
> easily exceed a page when many RMIDs are in use.
> 
> > 
> > This looks perfectly fixable, but it might be better to settle the
> > design of this series first before we worry too much about it.
> 
> I think it fair to delay support of writing more than a page of
> data but it looks to me like we need a solution to support displaying
> more than a page of data to user space.
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/20241106154306.2721688-2-peternewman@google.com/

Ack; if I can't find an off-the-shelf solution for this, I'll try to
hack something as minimal as possible to provide the required
behaviour, but I won't try to make it optimal or pretty for now.

It has just occurred to me that ftrace has large, multi-line text files
in sysfs, so I'll try to find out how they handle that there.  Maybe
there is some infrastructure we can re-use.

Either way, hopefully that will move the discussion forward (unless
someone else comes up with a better idea first!)

Cheers
---Dave

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-24 17:17               ` Dave Martin
@ 2025-02-24 17:23                 ` Luck, Tony
  2025-02-28 17:50                   ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-02-24 17:23 UTC (permalink / raw)
  To: Dave Martin, Chatre, Reinette
  Cc: Moger, Babu, corbet@lwn.net, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, peternewman@google.com,
	x86@kernel.org, hpa@zytor.com, paulmck@kernel.org,
	akpm@linux-foundation.org, thuth@redhat.com, rostedt@goodmis.org,
	xiongwei.song@windriver.com, pawan.kumar.gupta@linux.intel.com,
	daniel.sneddon@linux.intel.com, jpoimboe@kernel.org,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Li, Xin3, andrew.cooper3@citrix.com,
	ebiggers@google.com, mario.limonciello@amd.com,
	james.morse@arm.com, tan.shaopeng@fujitsu.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Wieczor-Retman, Maciej, Eranian, Stephane

> It has just occurred to me that ftrace has large, multi-line text files
> in sysfs, so I'll try to find out how they handle that there.  Maybe
> there is some infrastructure we can re-use.

Resctrl was built on top of "kernfs" because that was a simple base
that met needs at the time.

Do we need to look at either extending the capabilities of kernfs or
moving to sysfs?

-Tony

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups
  2025-02-21 18:07   ` James Morse
@ 2025-02-24 20:49     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-24 20:49 UTC (permalink / raw)
  To: James Morse, corbet, reinette.chatre, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi James,

On 2/21/25 12:07, James Morse wrote:
> Hi Babu,
> 
> On 22/01/2025 20:20, Babu Moger wrote:
>> When mbm_cntr_assign mode is enabled, users can designate which of the MBM
>> events in the CTRL_MON or MON groups should have counters assigned.
>>
>> Provide an interface for assigning MBM events by writing to the file:
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control. Using this interface,
>> events can be assigned or unassigned as needed.
>>
>> Format is similar to the list format with addition of opcode for the
>> assignment operation.
>>  "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>>
>> Format for specific type of groups:
>>
>>  * Default CTRL_MON group:
>>          "//<domain_id><opcode><flags>"
>>
>>  * Non-default CTRL_MON group:
>>          "<CTRL_MON group>//<domain_id><opcode><flags>"
>>
>>  * Child MON group of default CTRL_MON group:
>>          "/<MON group>/<domain_id><opcode><flags>"
>>
>>  * Child MON group of non-default CTRL_MON group:
>>          "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>>
>> Domain_id '*' will apply the flags on all the domains.
>>
>> Opcode can be one of the following:
>>
>>  = Update the assignment to match the flags
>>  + Assign a new MBM event without impacting existing assignments.
>>  - Unassign a MBM event from currently assigned events.
>>
>> Assignment flags can be one of the following:
>>  t  MBM total event
>>  l  MBM local event
>>  tl Both total and local MBM events
>>  _  None of the MBM events. Valid only with '=' opcode. This flag cannot
>>     be combined with other flags.
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 6e29827239e0..299839bcf23f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
> 
>> +static int resctrl_process_flags(struct rdt_resource *r,
>> +				 enum rdt_group_type rtype,
>> +				 char *p_grp, char *c_grp, char *tok)
>> +{
>> +	unsigned int op, mon_state, assign_state, unassign_state;
>> +	char *dom_str, *id_str, *op_str;
>> +	struct rdt_mon_domain *d;
>> +	unsigned long dom_id = 0;
>> +	struct rdtgroup *rdtgrp;
>> +	char domain[10];
>> +	bool found;
>> +	int ret;
>> +
>> +	rdtgrp = rdtgroup_find_grp_by_name(rtype, p_grp, c_grp);
>> +
>> +	if (!rdtgrp) {
>> +		rdt_last_cmd_puts("Not a valid resctrl group\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +next:
>> +	if (!tok || tok[0] == '\0')
>> +		return 0;
>> +
>> +	/* Start processing the strings for each domain */
>> +	dom_str = strim(strsep(&tok, ";"));
>> +
>> +	op_str = strpbrk(dom_str, "=+-");
>> +
>> +	if (op_str) {
>> +		op = *op_str;
>> +	} else {
>> +		rdt_last_cmd_puts("Missing operation =, +, - character\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	id_str = strsep(&dom_str, "=+-");
>> +
>> +	/* Check for domain id '*' which means all domains */
>> +	if (id_str && *id_str == '*') {
>> +		d = NULL;
>> +		goto check_state;
>> +	} else if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
>> +		rdt_last_cmd_puts("Missing domain id\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	/* Verify if the dom_id is valid */
>> +	found = false;
>> +	list_for_each_entry(d, &r->mon_domains, hdr.list) {
>> +		if (d->hdr.id == dom_id) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (!found) {
>> +		rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
>> +		return -EINVAL;
>> +	}
>> +
>> +check_state:
>> +	mon_state = resctrl_str_to_mon_state(dom_str);
>> +
>> +	if (mon_state == ASSIGN_INVALID) {
>> +		rdt_last_cmd_puts("Invalid assign flag\n");
>> +		goto out_fail;
>> +	}
>> +
>> +	assign_state = 0;
>> +	unassign_state = 0;
>> +
>> +	switch (op) {
>> +	case '+':
>> +		if (mon_state == ASSIGN_NONE) {
>> +			rdt_last_cmd_puts("Invalid assign opcode\n");
>> +			goto out_fail;
>> +		}
>> +		assign_state = mon_state;
>> +		break;
>> +	case '-':
>> +		if (mon_state == ASSIGN_NONE) {
>> +			rdt_last_cmd_puts("Invalid assign opcode\n");
>> +			goto out_fail;
>> +		}
>> +		unassign_state = mon_state;
>> +		break;
>> +	case '=':
>> +		assign_state = mon_state;
>> +		unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
>> +		break;
>> +	default:
>> +		break;
>> +	}
> 
> 
>> +	if (unassign_state & ASSIGN_TOTAL) {
>> +		ret = resctrl_unassign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +		if (ret)
>> +			goto out_fail;
>> +	}
>> +
>> +	if (unassign_state & ASSIGN_LOCAL) {
>> +		ret = resctrl_unassign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +		if (ret)
>> +			goto out_fail;
>> +	}
>> +
>> +	if (assign_state & ASSIGN_TOTAL) {
>> +		ret = resctrl_assign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +		if (ret)
>> +			goto out_fail;
>> +	}
>> +
>> +	if (assign_state & ASSIGN_LOCAL) {
>> +		ret = resctrl_assign_cntr_event(r, d, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +		if (ret)
>> +			goto out_fail;
>> +	}
> 
> This sequence of if's allows the helpers to be called on platforms that don't support
> both local and total. Could we reject such misconfiguration here in the parsing code?
> You have these checks in rdtgroup_assign_cntrs() added in patch 17.
> 

Yes. Added the is_mbm_total_enabled() and is_mbm_local_enabled() checks.
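
As an illustration only, rejecting unsupported events at parse time could
look like the userspace sketch below. It is not the code from the series:
compute_masks(), struct assign_masks and the numeric flag values are
hypothetical, and the "supported" mask stands in for whatever
is_mbm_total_enabled() and is_mbm_local_enabled() report.

```c
/* Assumed flag values; the series defines its own ASSIGN_* constants. */
enum {
	ASSIGN_NONE  = 0,
	ASSIGN_TOTAL = 1 << 0,
	ASSIGN_LOCAL = 1 << 1,
};

/* Hypothetical result type: which events to assign and which to unassign. */
struct assign_masks {
	unsigned int assign;
	unsigned int unassign;
};

/*
 * Turn an opcode ('=', '+', '-') and the parsed flags into assign and
 * unassign masks. 'supported' is the set of events the platform has.
 * Returns 0 on success, -1 for a bad opcode, for '+'/'-' without flags,
 * or for a request that names an unsupported event.
 */
static int compute_masks(char op, unsigned int mon_state,
			 unsigned int supported, struct assign_masks *m)
{
	m->assign = 0;
	m->unassign = 0;

	/* Reject events the platform does not have before touching anything. */
	if (mon_state & ~supported)
		return -1;

	switch (op) {
	case '+':
		if (mon_state == ASSIGN_NONE)
			return -1;
		m->assign = mon_state;
		break;
	case '-':
		if (mon_state == ASSIGN_NONE)
			return -1;
		m->unassign = mon_state;
		break;
	case '=':
		/* Only unassign events that exist on this platform. */
		m->assign = mon_state;
		m->unassign = supported & ~mon_state;
		break;
	default:
		return -1;
	}
	return 0;
}
```

Masking the '=' case with "supported" means a platform with only one of
the two events never has the helper for the missing event called.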

> 
> What do you think to trying to group these four by event type, and passing the event type
> in as an argument? ... it ends up with a helper that takes a large number of arguments,
> (both assign_state and unassign_state), but there is less repetition...

Added a new helper function resctrl_process_cntr_event(). I could not
avoid the repetitions.
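
For comparison, one table-driven way to cut the repetition is sketched
below in userspace C. This is not resctrl_process_cntr_event(); the table,
the stub callbacks, the trace buffer and the numeric values are all
hypothetical stand-ins for the kernel helpers. It keeps the order used in
the patch, where every unassignment happens before any assignment.

```c
#include <stddef.h>

enum { ASSIGN_TOTAL = 1 << 0, ASSIGN_LOCAL = 1 << 1 };	/* assumed values */
enum { EVT_MBM_TOTAL = 2, EVT_MBM_LOCAL = 3 };		/* stand-in event IDs */

/* One row per MBM event: the flag bit that selects it and its event ID. */
struct evt_map {
	unsigned int flag;
	int evtid;
	char letter;
};

static const struct evt_map evt_table[] = {
	{ ASSIGN_TOTAL, EVT_MBM_TOTAL, 't' },
	{ ASSIGN_LOCAL, EVT_MBM_LOCAL, 'l' },
};

/* Records the operations performed, e.g. "-l+t", so they can be inspected. */
struct trace {
	char buf[16];
	size_t n;
};

typedef int (*evt_fn)(const struct evt_map *evt, struct trace *tr);

static int trace_put(struct trace *tr, char op, char letter)
{
	if (tr->n + 2 >= sizeof(tr->buf))
		return -1;
	tr->buf[tr->n++] = op;
	tr->buf[tr->n++] = letter;
	tr->buf[tr->n] = '\0';
	return 0;
}

/* Stand-ins for the real assign and unassign helpers. */
static int stub_assign(const struct evt_map *evt, struct trace *tr)
{
	return trace_put(tr, '+', evt->letter);
}

static int stub_unassign(const struct evt_map *evt, struct trace *tr)
{
	return trace_put(tr, '-', evt->letter);
}

/* Call fn once for every event selected in state; stop at the first error. */
static int apply_state(unsigned int state, evt_fn fn, struct trace *tr)
{
	size_t i;

	for (i = 0; i < sizeof(evt_table) / sizeof(evt_table[0]); i++) {
		int ret;

		if (!(state & evt_table[i].flag))
			continue;
		ret = fn(&evt_table[i], tr);
		if (ret)
			return ret;
	}
	return 0;
}

/* Two passes: all unassignments first, then all assignments. */
static int process_cntr_events(unsigned int assign_state,
			       unsigned int unassign_state, struct trace *tr)
{
	int ret = apply_state(unassign_state, stub_unassign, tr);

	if (ret)
		return ret;
	return apply_state(assign_state, stub_assign, tr);
}
```

Handling each event's unassign and assign together would reorder the
calls, which is presumably undesirable when a counter freed by an
unassignment is needed by a later assignment.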



> 
> 
> Thanks,
> 
> James
> 
>> +	goto next;
>> +
>> +out_fail:
>> +	sprintf(domain, d ? "%ld" : "*", dom_id);
>> +
>> +	rdt_last_cmd_printf("Assign operation '%s%c%s' failed on the group %s/%s/\n",
>> +			    domain, op, dom_str, p_grp, c_grp);
>> +
>> +	return -EINVAL;
>> +}
>> +
> 

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2025-02-24 17:01       ` Reinette Chatre
@ 2025-02-24 21:18         ` Moger, Babu
  2025-02-24 22:20           ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-24 21:18 UTC (permalink / raw)
  To: Reinette Chatre, James Morse, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Reinette,


On 2/24/25 11:01, Reinette Chatre wrote:
> Hi James and Babu,
> 
> On 2/24/25 7:49 AM, Moger, Babu wrote:
>> Hi James,
>>
>> On 2/21/25 12:06, James Morse wrote:
>>> Hi Babu,
>>>
>>> On 22/01/2025 20:20, Babu Moger wrote:
> 
>>> This sequence has me confused:
>>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> index 3d748fdbcb5f..a9a5dc626a1e 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> @@ -1233,6 +1233,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>>>  			r->mon.mbm_cntr_assignable = true;
>>>>  			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>>>>  			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
>>>
>>>> +			hw_res->mbm_cntr_assign_enabled = true;
>>>
>>> Here the arch code sets ABMC to be enabled by default at boot.
>>>
>>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> index 6922173c4f8f..515969c5f64f 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> @@ -4302,9 +4302,13 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
>>>>  
>>>>  void resctrl_online_cpu(unsigned int cpu)
>>>>  {
>>>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>>> +
>>>>  	mutex_lock(&rdtgroup_mutex);
>>>>  	/* The CPU is set in default rdtgroup after online. */
>>>>  	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
>>>> +	if (r->mon_capable && r->mon.mbm_cntr_assignable)
>>>> +		resctrl_arch_mbm_cntr_assign_set_one(r);
>>>>  	mutex_unlock(&rdtgroup_mutex);
>>>>  }
>>>
>>> But here, resctrl has to call back to the arch code to make sure the hardware is in the
>>> same state as hw_res->mbm_cntr_assign_enabled.
> 
> Another scenario needing to be supported by this flow is when CPUs come online later ...
> after resctrl is mounted and potentially after the user modified the assignable counter
> mode.

If the user modifies the assignable counter mode, it is already recorded
in mbm_cntr_assign_enabled. When a new CPU comes online, the hotplug
handler (resctrl_arch_online_cpu) will update the CPU to the new mode
after checking mbm_cntr_assign_enabled.

Are you talking about a different case here? Please elaborate.
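
The hotplug behaviour described above can be modelled with a toy
userspace sketch. It assumes one recorded software flag plus a
per-CPU hardware state that must be brought in line with it; the arrays
and function names are hypothetical and only mimic the roles of
mbm_cntr_assign_enabled and resctrl_arch_mbm_cntr_assign_set_one().

```c
#include <stdbool.h>

#define NR_CPUS 4	/* arbitrary size for the model */

/* Recorded mode; enabled by default, as the arch code does at boot. */
static bool mbm_cntr_assign_enabled = true;

/* Model state: which CPUs are online and what each CPU's hardware holds. */
static bool cpu_is_online[NR_CPUS];
static bool cpu_hw_mode[NR_CPUS];

/* Stand-in for the arch helper: program one CPU from the recorded mode. */
static void arch_set_one(int cpu)
{
	cpu_hw_mode[cpu] = mbm_cntr_assign_enabled;
}

/* User changes the mode: record it, then update every online CPU. */
static void set_assign_mode(bool enable)
{
	int cpu;

	mbm_cntr_assign_enabled = enable;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_is_online[cpu])
			arch_set_one(cpu);
}

/* Hotplug: a CPU arriving later picks up whatever mode is recorded now. */
static void online_cpu(int cpu)
{
	cpu_is_online[cpu] = true;
	arch_set_one(cpu);
}
```

In this model a CPU that comes online after the user disabled the mode
ends up disabled too, which is the late-hotplug case raised above.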

> 
>>>
>>> Could this be done in resctrl_arch_online_cpu() instead? That way resctrl doesn't get CPUs
>>> in an inconsistent state that it has to fix up...
> 
> Could you please elaborate the inconsistent state that would need to be fixed up?
> 
>>>
>>
>> Sure. Here is the diff.
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c
>> b/arch/x86/kernel/cpu/resctrl/core.c
>> index 22399f19810f..f48b298413bc 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -771,6 +771,12 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
>>                 domain_add_cpu(cpu, r);
>>         mutex_unlock(&domain_list_lock);
>>
>> +       r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +       mutex_lock(&rdtgroup_mutex);
>> +       if (r->mon_capable && r->mon.mbm_cntr_assignable)
>> +               resctrl_arch_mbm_cntr_assign_set_one(r);
>> +       mutex_unlock(&rdtgroup_mutex);
>> +
>>         clear_closid_rmid(cpu);
>>         resctrl_online_cpu(cpu);
> 
> This would require every architecture to duplicate the above, no?
> 
> Also, please note there is more appropriate domain_add_cpu_mon().

Yes. This may be a better place to add this code. I will wait until James
clarifies the "inconsistent state".
-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 20/23] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2025-02-24 21:18         ` Moger, Babu
@ 2025-02-24 22:20           ` Reinette Chatre
  0 siblings, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-24 22:20 UTC (permalink / raw)
  To: babu.moger, James Morse, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, peternewman
  Cc: x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, tan.shaopeng,
	linux-doc, linux-kernel, maciej.wieczor-retman, eranian

Hi Babu,

On 2/24/25 1:18 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> 
> On 2/24/25 11:01, Reinette Chatre wrote:
>> Hi James and Babu,
>>
>> On 2/24/25 7:49 AM, Moger, Babu wrote:
>>> Hi James,
>>>
>>> On 2/21/25 12:06, James Morse wrote:
>>>> Hi Babu,
>>>>
>>>> On 22/01/2025 20:20, Babu Moger wrote:
>>
>>>> This sequence has me confused:
>>>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> index 3d748fdbcb5f..a9a5dc626a1e 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> @@ -1233,6 +1233,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>>>>  			r->mon.mbm_cntr_assignable = true;
>>>>>  			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>>>>>  			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
>>>>
>>>>> +			hw_res->mbm_cntr_assign_enabled = true;
>>>>
>>>> Here the arch code sets ABMC to be enabled by default at boot.
>>>>
>>>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> index 6922173c4f8f..515969c5f64f 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> @@ -4302,9 +4302,13 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
>>>>>  
>>>>>  void resctrl_online_cpu(unsigned int cpu)
>>>>>  {
>>>>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>>>> +
>>>>>  	mutex_lock(&rdtgroup_mutex);
>>>>>  	/* The CPU is set in default rdtgroup after online. */
>>>>>  	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
>>>>> +	if (r->mon_capable && r->mon.mbm_cntr_assignable)
>>>>> +		resctrl_arch_mbm_cntr_assign_set_one(r);
>>>>>  	mutex_unlock(&rdtgroup_mutex);
>>>>>  }
>>>>
>>>> But here, resctrl has to call back to the arch code to make sure the hardware is in the
>>>> same state as hw_res->mbm_cntr_assign_enabled.
>>
>> Another scenario needing to be supported by this flow is when CPUs come online later ...
>> after resctrl is mounted and potentially after the user modified the assignable counter
>> mode.
> 
> If the user modifies the assignable counter mode, it is already recorded
> in mbm_cntr_assign_enabled. When a new CPU comes online, the hotplug
> handler (resctrl_arch_online_cpu) will update the CPU to the new mode
> after checking mbm_cntr_assign_enabled.
> 
> Are you talking about a different case here? Please elaborate.

I am talking about the same case. James started with "This sequence has me confused",
where the "Here the arch code sets ABMC to be enabled by default at boot." snippet did
not seem to match the later snippet where "resctrl has to call back to the arch code".
My goal was to highlight why resctrl would need to call back into arch code even though
arch code establishes the default.

Reinette

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-21 22:43                           ` Reinette Chatre
@ 2025-02-25 17:11                             ` Peter Newman
  2025-02-25 21:31                               ` Moger, Babu
  2025-02-25 21:41                               ` Reinette Chatre
  0 siblings, 2 replies; 209+ messages in thread
From: Peter Newman @ 2025-02-25 17:11 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 2/21/25 5:12 AM, Peter Newman wrote:
> > On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>> <reinette.chatre@intel.com> wrote:
> >>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>> <reinette.chatre@intel.com> wrote:
> >>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>
> >>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>
> >>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>
> >>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>> <value>
> >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>
> >>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>
> >>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>> is low enough to be of concern.
> >>>>>>>
> >>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>> for.
> >>>>>>
> >>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>
> >>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>> customers.
> >>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>
> >>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>> event-set for applying to a single counter rather than as individual
> >>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>> event names.
> >>>>
> >>>> Thank you for clarifying.
> >>>>
> >>>>>
> >>>>> In the letters as events model, choosing the events assigned to a
> >>>>> group wouldn't be enough information, since we would want to control
> >>>>> which events should share a counter and which should be counted by
> >>>>> separate counters. I think the amount of information that would need
> >>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>
> >>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>> writes in ABMC would look like...
> >>>>>
> >>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>
> >>>>> (per domain)
> >>>>> group 0:
> >>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>> group 1:
> >>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>> ...
> >>>>>
> >>>>
> >>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>> example and above the counter configuration appears to be global. You do mention
> >>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>> configuration is a requirement?
> >>>
> >>> If it's global and we want a particular group to be watched by more
> >>> counters, I wouldn't want this to result in allocating more counters
> >>> for that group in all domains, or allocating counters in domains where
> >>> they're not needed. I want to encourage my users to avoid allocating
> >>> monitoring resources in domains where a job is not allowed to run so
> >>> there's less pressure on the counters.
> >>>
> >>> In Dave's proposal it looks like global configuration means
> >>> globally-defined "named counter configurations", which works because
> >>> it's really per-domain assignment of the configurations to however
> >>> many counters the group needs in each domain.
> >>
> >> I think I am becoming lost. Would a global configuration not break your
> >> view of "event-set applied to a single counter"? If a counter is configured
> >> globally then it would not make it possible to support the full configurability
> >> of the hardware.
> >> Before I add more confusion, let me try with an example that builds on your
> >> earlier example copied below:
> >>
> >>>>> (per domain)
> >>>>> group 0:
> >>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>> group 1:
> >>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>> ...
> >>
> >> Since the above states "per domain" I rewrite the example to highlight that as
> >> I understand it:
> >>
> >> group 0:
> >>  domain 0:
> >>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>  domain 1:
> >>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >>  domain 0:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>  domain 1:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>
> >> You mention that you do not want counters to be allocated in domains that they
> >> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >> in domain 1, resulting in:
> >>
> >> group 0:
> >>  domain 0:
> >>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >>  domain 0:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>  domain 1:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>
> >> With counter 0 and counter 1 available in domain 1, these counters could
> >> theoretically be configured to give group 1 more data in domain 1:
> >>
> >> group 0:
> >>  domain 0:
> >>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >>  domain 0:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>  domain 1:
> >>   counter 0: LclFill,RmtFill
> >>   counter 1: LclNTWr,RmtNTWr
> >>   counter 2: LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW
> >>
> >> The counters are shown with different per-domain configurations that seems to
> >> match with earlier goals of (a) choose events counted by each counter and
> >> (b) do not allocate counters in domains where they are not needed. As I
> >> understand the above does contradict global counter configuration though.
> >> Or do you mean that only the *name* of the counter is global and then
> >> that it is reconfigured as part of every assignment?
> >
> > Yes, I meant only the *name* is global. I assume based on a particular
> > system configuration, the user will settle on a handful of useful
> > groupings to count.
> >
> > Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >
> >  # define global configurations (in ABMC terms), not necessarily in this
> >  # syntax and probably not in the mbm_assign_control file.
> >
> >  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >  w=VictimBW,LclNTWr,RmtNTWr
> >
> >  # legacy "total" configuration, effectively r+w
> >  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >
> >  /group0/0=t;1=t
> >  /group1/0=t;1=t
> >  /group2/0=_;1=t
> >  /group3/0=rw;1=_
> >
> > - group2 is restricted to domain 1
> > - group3 is restricted to domain 0
> > - the rest are unrestricted
> > - In group3, we decided we need to separate read and write traffic
> >
> > This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >
>
> I see. Thank you for the example.
>
> resctrl supports per-domain configurations with the following possible when
> using mbm_total_bytes_config and mbm_local_bytes_config:
>
> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>
>    /group0/0=t;1=t
>    /group1/0=t;1=t
>
> Even though the flags are identical in all domains, the assigned counters will
> be configured differently in each domain.
>
> With this supported by hardware and currently also supported by resctrl it seems
> reasonable to carry this forward to what will be supported next.

The hardware supports both a per-domain mode, where all groups in a
domain use the same configurations and are limited to two events per
group, and a per-group mode, where every group can be configured and
assigned freely. This series is using the legacy counter access mode
where only counters whose BwType matches an instance of QOS_EVT_CFG_n
in the domain can be read. If we chose to read the assigned counter
directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
rather than asking the hardware to find the counter by RMID, we would
not be limited to 2 counters per group/domain and the hardware would
have the same flexibility as on MPAM.

(I might have said something confusing in my last messages because I
had forgotten that I switched to the extended assignment mode when
prototyping with soft-ABMC and MPAM.)

Forcing all groups on a domain to share the same 2 counter
configurations would not be acceptable for us, as the example I gave
earlier is one I've already been asked about.
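
For reference, the counter cost of that earlier example can be tallied
with a small userspace sketch. It assumes the list format
"<ctrl>/<mon>/<id>=<flags>;..." with only the '=' opcode, that each flag
letter names a configuration occupying one counter in that domain, and
that '_' occupies none. tally_counters() and MAX_DOMAINS are hypothetical
names used only here.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_DOMAINS 32	/* arbitrary bound for the sketch */

/*
 * Parse one "<ctrl>/<mon>/<id>=<flags>;<id>=<flags>..." line and add the
 * number of counters it consumes to per_domain[]. Every flag character
 * other than '_' is counted as one counter in that domain.
 * Returns 0 on success, -1 on a malformed line.
 */
static int tally_counters(const char *line, unsigned int per_domain[MAX_DOMAINS])
{
	const char *p = strrchr(line, '/');

	if (!p)
		return -1;
	p++;					/* first "<id>=<flags>" entry */

	while (*p) {
		char *end;
		unsigned long dom = strtoul(p, &end, 10);

		if (end == p || *end != '=' || dom >= MAX_DOMAINS)
			return -1;
		p = end + 1;			/* skip '=' */

		while (*p && *p != ';') {
			if (*p != '_')
				per_domain[dom]++;
			p++;
		}
		if (*p == ';')
			p++;
	}
	return 0;
}
```

Fed the four lines "/group0/0=t;1=t", "/group1/0=t;1=t", "/group2/0=_;1=t"
and "/group3/0=rw;1=_", this yields 4 counters in domain 0 and 3 in
domain 1, matching the totals quoted above.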

I'm worried about requiring support for domain-level
mbm_total_bytes_config and mbm_local_bytes_config files to be carried
forward, because this conflicts with the configuration being per
group/domain. (i.e., what would be read back from the domain files if
per-group customizations have already been applied?)

>
> >>
> >>>> Until now I viewed counter configuration separate from counter assignment,
> >>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and
> >>>> mbm_local_bytes_config before they are assigned. That is still per-domain
> >>>> counter configuration though, not per-counter.
> >>>>
> >>>>> I assume packing all of this info for a group's desired counter
> >>>>> configuration into a single line (with 32 domains per line on many
> >>>>> dual-socket AMD configurations I see) would be difficult to look at,
> >>>>> even if we could settle on a single letter to represent each
> >>>>> universally.
> >>>>>
> >>>>>>
> >>>>>> My goal is for resctrl to have a user interface that can as much as possible
> >>>>>> be ready for whatever may be required from it years down the line. Of course,
> >>>>>> I may be wrong and resctrl would never need to support more than 26 events per
> >>>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events
> >>>>>> and how could resctrl support that?
> >>>>>>
> >>>>>> What is the risk of supporting more than 26 events? As I highlighted earlier
> >>>>>> the interface I used as demonstration may become unwieldy to parse on a system
> >>>>>> with many domains that supports many events. This is a concern for me. Any suggestions
> >>>>>> will be appreciated, especially from you since I know that you are very familiar with
> >>>>>> issues related to large scale use of resctrl interfaces.
> >>>>>
> >>>>> It's mainly just the unwieldiness of all the information in one file.
> >>>>> It's already at the limit of what I can visually look through.
> >>>>
> >>>> I agree.
> >>>>
> >>>>>
> >>>>> I believe that shared assignments will take care of all the
> >>>>> high-frequency and performance-intensive batch configuration updates I
> >>>>> was originally concerned about, so I no longer see much benefit in
> >>>>> finding ways to textually encode all this information in a single file
> >>>>> when it would be more manageable to distribute it around the
> >>>>> filesystem hierarchy.
> >>>>
> >>>> This is significant. The motivation for the single file was to support
> >>>> the "high-frequency and performance-intensive" usage. Would "shared assignments"
> >>>> not also depend on the same files that, if distributed, will require many
> >>>> filesystem operations?
> >>>> Having the files distributed will be significantly simpler while also
> >>>> avoiding the file size issue that Dave Martin exposed.
> >>>
> >>> The remaining filesystem operations will be assigning or removing
> >>> shared counter assignments in the applicable domains, which would
> >>> normally correspond to mkdir/rmdir of groups or changing their CPU
> >>> affinity. The shared assignments are more "program and forget", while
> >>> the exclusive assignment approach requires updates for every counter
> >>> (in every domain) every few seconds to cover a large number of groups.
> >>>
> >>> When they want to pay extra attention to a particular group, I expect
> >>> they'll ask for exclusive counters and leave them assigned for a while
> >>> as they collect extra data.
> >>
> >> The single file approach is already unwieldy. The demands that will be
> >> placed on it to support the usages currently being discussed would make this
> >> interface even harder to use and manage. If the single file is not required
> >> then I think we should go back to smaller files distributed in resctrl.
> >> This may not even be an either/or argument. One way to view mbm_assign_control
> >> could be as a way for user to interact with the distributed counter
> >> related files with a single file system operation. Although, without
> >> knowing how counter configuration is expected to work this remains unclear.
> >
> > If we do both interfaces and the multi-file model gives us more
> > capability to express configurations, we could find situations where
> > there are configurations we cannot represent when reading back from
> > mbm_assign_control, or updates through mbm_assign_control have
> > ambiguous effects on existing configurations which were created with
> > other files.
>
> Right. My assumption was that the syntax would be identical.
>
> >
> > However, the example I gave above seems to be adequately represented
> > by a minor extension to mbm_assign_control and we all seem to
>
> To confirm what you mean with "minor extension to mbm_assign_control",
> is this where the flags are associated with counter configurations? At this
> time this is done separately from mbm_assign_control with the hardcoded "t"
> and "l" flags configured via mbm_total_bytes_config and mbm_local_bytes_config
> respectively. I think it would be simpler to keep these configurations
> separate from mbm_assign_control. How it would look without better
> understanding of MPAM is not clear to me at this time, unless if the
> requirement is to enhance support for ABMC and BMEC. I do see that
> this can be added later to build on what is supported by mbm_assign_control
> with the syntax in this version.

As I explained above, I was looking at this from the perspective of
the extended event assignment mode.

Thanks,
-Peter

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-25 17:11                             ` Peter Newman
@ 2025-02-25 21:31                               ` Moger, Babu
  2025-02-26 13:27                                 ` Peter Newman
  2025-02-25 21:41                               ` Reinette Chatre
  1 sibling, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-02-25 21:31 UTC (permalink / raw)
  To: Peter Newman, Reinette Chatre
  Cc: Moger, Babu, Dave Martin, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 2/25/25 11:11, Peter Newman wrote:
> Hi Reinette,
> 
> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>
>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>
>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>
>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>
>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>> for.
>>>>>>>>
>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>
>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>> customers.
>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>
>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>> event names.
>>>>>>
>>>>>> Thank you for clarifying.
>>>>>>
>>>>>>>
>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>> which events should share a counter and which should be counted by
>>>>>>> separate counters. I think the amount of information that would need
>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>
>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>> writes in ABMC would look like...
>>>>>>>
>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>
>>>>>>> (per domain)
>>>>>>> group 0:
>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> ...
>>>>>>>
>>>>>>
>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>> configuration is a requirement?
>>>>>
>>>>> If it's global and we want a particular group to be watched by more
>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>> for that group in all domains, or allocating counters in domains where
>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>> there's less pressure on the counters.
>>>>>
>>>>> In Dave's proposal it looks like global configuration means
>>>>> globally-defined "named counter configurations", which works because
>>>>> it's really per-domain assignment of the configurations to however
>>>>> many counters the group needs in each domain.
>>>>
>>>> I think I am becoming lost. Would a global configuration not break your
>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>> globally then it would not make it possible to support the full configurability
>>>> of the hardware.
>>>> Before I add more confusion, let me try with an example that builds on your
>>>> earlier example copied below:
>>>>
>>>>>>> (per domain)
>>>>>>> group 0:
>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> ...
>>>>
>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>> I understand it:
>>>>
>>>> group 0:
>>>>  domain 0:
>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>  domain 1:
>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>>  domain 0:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>  domain 1:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>
>>>> You mention that you do not want counters to be allocated in domains that they
>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>> in domain 1, resulting in:
>>>>
>>>> group 0:
>>>>  domain 0:
>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>>  domain 0:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>  domain 1:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>
>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>
>>>> group 0:
>>>>  domain 0:
>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>>  domain 0:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>  domain 1:
>>>>   counter 0: LclFill,RmtFill
>>>>   counter 1: LclNTWr,RmtNTWr
>>>>   counter 2: LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW
>>>>
>>>> The counters are shown with different per-domain configurations that seems to
>>>> match with earlier goals of (a) choose events counted by each counter and
>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>> understand the above does contradict global counter configuration though.
>>>> Or do you mean that only the *name* of the counter is global and then
>>>> that it is reconfigured as part of every assignment?
>>>
>>> Yes, I meant only the *name* is global. I assume based on a particular
>>> system configuration, the user will settle on a handful of useful
>>> groupings to count.
>>>
>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>
>>>  # define global configurations (in ABMC terms), not necessarily in this
>>>  # syntax and probably not in the mbm_assign_control file.
>>>
>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>  w=VictimBW,LclNTWr,RmtNTWr
>>>
>>>  # legacy "total" configuration, effectively r+w
>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>
>>>  /group0/0=t;1=t
>>>  /group1/0=t;1=t
>>>  /group2/0=_;1=t
>>>  /group3/0=rw;1=_
>>>
>>> - group2 is restricted to domain 1
>>> - group3 is restricted to domain 0
>>> - the rest are unrestricted
>>> - In group3, we decided we need to separate read and write traffic
>>>
>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>
>>
>> I see. Thank you for the example.
>>
>> resctrl supports per-domain configurations with the following possible when
>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>
>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>
>>    /group0/0=t;1=t
>>    /group1/0=t;1=t
>>
>> Even though the flags are identical in all domains, the assigned counters will
>> be configured differently in each domain.
>>
>> With this supported by hardware and currently also supported by resctrl it seems
>> reasonable to carry this forward to what will be supported next.
> 
> The hardware supports both a per-domain mode, where all groups in a
> domain use the same configurations and are limited to two events per
> group and a per-group mode where every group can be configured and
> assigned freely. This series is using the legacy counter access mode
> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> in the domain can be read. If we chose to read the assigned counter
> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> rather than asking the hardware to find the counter by RMID, we would
> not be limited to 2 counters per group/domain and the hardware would
> have the same flexibility as on MPAM.

In extended mode, the contents of a specific counter can be read by
setting QM_EVTSEL[ExtendedEvtID]=1 and QM_EVTSEL[EvtID]=L3CacheABMC, and
setting QM_EVTSEL[RMID] to the desired counter ID. Reading QM_CTR then
returns the contents of the specified counter.

This is documented in the AMD64 Architecture Programmer's Manual Volume 2
(Publication # 24593), section 19.3.3.3 Assignable Bandwidth Monitoring
(ABMC):
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
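
For illustration only, a minimal sketch of how the QM_EVTSEL value for
such a direct counter read could be composed. The bit positions (EvtID in
bits 7:0, ExtendedEvtID at bit 31, a 12-bit RMID/counter ID field starting
at bit 32) and the L3CacheABMC encoding of 1 are assumptions here and
should be checked against the APM. The macro and function names are
hypothetical and not taken from this series. The sketch only builds the
value; it does not touch any MSR.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed QM_EVTSEL layout; verify against the APM before relying on it. */
#define QM_EVTSEL_EVTID_L3CACHEABMC  0x1ULL        /* assumed EvtID encoding */
#define QM_EVTSEL_EXTENDED_EVTID     (1ULL << 31)  /* assumed ExtendedEvtID bit */
#define QM_EVTSEL_RMID_SHIFT         32            /* assumed start of RMID field */
#define QM_EVTSEL_RMID_BITS          12            /* assumed width of RMID field */

/*
 * Hypothetical helper: build the 64-bit QM_EVTSEL value that selects an
 * ABMC counter by its counter ID instead of by RMID. The counter ID is
 * placed in the RMID field, as described in the text above.
 */
static uint64_t abmc_qm_evtsel(uint32_t cntr_id)
{
	/* Reject IDs that would not fit in the assumed RMID field width. */
	assert(cntr_id < (1u << QM_EVTSEL_RMID_BITS));

	return ((uint64_t)cntr_id << QM_EVTSEL_RMID_SHIFT) |
	       QM_EVTSEL_EXTENDED_EVTID |
	       QM_EVTSEL_EVTID_L3CACHEABMC;
}
```

Under these assumptions, abmc_qm_evtsel(2) evaluates to 0x0000000280000001.
In the kernel the value would be written to QM_EVTSEL and the count then
read back from QM_CTR.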

We previously discussed this with you (off the public list) and I
initially proposed the extended assignment mode.

Yes, the extended mode allows greater flexibility by enabling multiple
counters to be assigned to the same group, rather than being limited to
just two.

However, the challenge is that we currently lack the necessary interfaces
to configure multiple events per group. Without these interfaces, the
extended mode is not practical at this time.

Therefore, we ultimately agreed to use the legacy mode, as it does not
require modifications to the existing interface, allowing us to continue
using it as is.

> 
> (I might have said something confusing in my last messages because I
> had forgotten that I switched to the extended assignment mode when
> prototyping with soft-ABMC and MPAM.)
> 
> Forcing all groups on a domain to share the same 2 counter
> configurations would not be acceptable for us, as the example I gave
> earlier is one I've already been asked about.

I don't see this as a blocker. It can be treated as an extension to the
current ABMC series. We can easily build on top of this series once we
finalize the interface for configuring multiple events per group.

> 
> I'm worried about requiring support for domain-level
> mbm_total_bytes_config and mbm_local_bytes_config files to be carried
> forward, because this conflicts with the configuration being per
> group/domain. (i.e., what would be read back from the domain files if
> per-group customizations have already been applied?)
> 
>>
>>>>
>>>>>> Until now I viewed counter configuration separate from counter assignment,
>>>>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and
>>>>>> mbm_local_bytes_config before they are assigned. That is still per-domain
>>>>>> counter configuration though, not per-counter.
>>>>>>
>>>>>>> I assume packing all of this info for a group's desired counter
>>>>>>> configuration into a single line (with 32 domains per line on many
>>>>>>> dual-socket AMD configurations I see) would be difficult to look at,
>>>>>>> even if we could settle on a single letter to represent each
>>>>>>> universally.
>>>>>>>
>>>>>>>>
>>>>>>>> My goal is for resctrl to have a user interface that can as much as possible
>>>>>>>> be ready for whatever may be required from it years down the line. Of course,
>>>>>>>> I may be wrong and resctrl would never need to support more than 26 events per
>>>>>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events
>>>>>>>> and how could resctrl support that?
>>>>>>>>
>>>>>>>> What is the risk of supporting more than 26 events? As I highlighted earlier
>>>>>>>> the interface I used as demonstration may become unwieldy to parse on a system
>>>>>>>> with many domains that supports many events. This is a concern for me. Any suggestions
>>>>>>>> will be appreciated, especially from you since I know that you are very familiar with
>>>>>>>> issues related to large scale use of resctrl interfaces.
>>>>>>>
>>>>>>> It's mainly just the unwieldiness of all the information in one file.
>>>>>>> It's already at the limit of what I can visually look through.
>>>>>>
>>>>>> I agree.
>>>>>>
>>>>>>>
>>>>>>> I believe that shared assignments will take care of all the
>>>>>>> high-frequency and performance-intensive batch configuration updates I
>>>>>>> was originally concerned about, so I no longer see much benefit in
>>>>>>> finding ways to textually encode all this information in a single file
>>>>>>> when it would be more manageable to distribute it around the
>>>>>>> filesystem hierarchy.
>>>>>>
>>>>>> This is significant. The motivation for the single file was to support
>>>>>> the "high-frequency and performance-intensive" usage. Would "shared assignments"
>>>>>> not also depend on the same files that, if distributed, will require many
>>>>>> filesystem operations?
>>>>>> Having the files distributed will be significantly simpler while also
>>>>>> avoiding the file size issue that Dave Martin exposed.
>>>>>
>>>>> The remaining filesystem operations will be assigning or removing
>>>>> shared counter assignments in the applicable domains, which would
>>>>> normally correspond to mkdir/rmdir of groups or changing their CPU
>>>>> affinity. The shared assignments are more "program and forget", while
>>>>> the exclusive assignment approach requires updates for every counter
>>>>> (in every domain) every few seconds to cover a large number of groups.
>>>>>
>>>>> When they want to pay extra attention to a particular group, I expect
>>>>> they'll ask for exclusive counters and leave them assigned for a while
>>>>> as they collect extra data.
>>>>
>>>> The single file approach is already unwieldy. The demands that will be
>>>> placed on it to support the usages currently being discussed would make this
>>>> interface even harder to use and manage. If the single file is not required
>>>> then I think we should go back to smaller files distributed in resctrl.
>>>> This may not even be an either/or argument. One way to view mbm_assign_control
>>>> could be as a way for user to interact with the distributed counter
>>>> related files with a single file system operation. Although, without
>>>> knowing how counter configuration is expected to work this remains unclear.
>>>
>>> If we do both interfaces and the multi-file model gives us more
>>> capability to express configurations, we could find situations where
>>> there are configurations we cannot represent when reading back from
>>> mbm_assign_control, or updates through mbm_assign_control have
>>> ambiguous effects on existing configurations which were created with
>>> other files.
>>
>> Right. My assumption was that the syntax would be identical.
>>
>>>
>>> However, the example I gave above seems to be adequately represented
>>> by a minor extension to mbm_assign_control and we all seem to
>>
>> To confirm what you mean with "minor extension to mbm_assign_control",
>> is this where the flags are associated with counter configurations? At this
>> time this is done separately from mbm_assign_control with the hardcoded "t"
>> and "l" flags configured via mbm_total_bytes_config and mbm_local_bytes_config
>> respectively. I think it would be simpler to keep these configurations
>> separate from mbm_assign_control. How it would look without better
>> understanding of MPAM is not clear to me at this time, unless if the
>> requirement is to enhance support for ABMC and BMEC. I do see that
>> this can be added later to build on what is supported by mbm_assign_control
>> with the syntax in this version.
> 
> As I explained above, I was looking at this from the perspective of
> the extended event assignment mode.
> 
> Thanks,
> -Peter
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-25 17:11                             ` Peter Newman
  2025-02-25 21:31                               ` Moger, Babu
@ 2025-02-25 21:41                               ` Reinette Chatre
  1 sibling, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-02-25 21:41 UTC (permalink / raw)
  To: Peter Newman
  Cc: Moger, Babu, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 2/25/25 9:11 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>
>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>
>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>
>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>
>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>> for.
>>>>>>>>
>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>
>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>> customers.
>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>
>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>> event names.
>>>>>>
>>>>>> Thank you for clarifying.
>>>>>>
>>>>>>>
>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>> which events should share a counter and which should be counted by
>>>>>>> separate counters. I think the amount of information that would need
>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>
>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>> writes in ABMC would look like...
>>>>>>>
>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>
>>>>>>> (per domain)
>>>>>>> group 0:
>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> ...
>>>>>>>
>>>>>>
>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>> configuration is a requirement?
>>>>>
>>>>> If it's global and we want a particular group to be watched by more
>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>> for that group in all domains, or allocating counters in domains where
>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>> there's less pressure on the counters.
>>>>>
>>>>> In Dave's proposal it looks like global configuration means
>>>>> globally-defined "named counter configurations", which works because
>>>>> it's really per-domain assignment of the configurations to however
>>>>> many counters the group needs in each domain.
>>>>
>>>> I think I am becoming lost. Would a global configuration not break your
>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>> globally then it would not make it possible to support the full configurability
>>>> of the hardware.
>>>> Before I add more confusion, let me try with an example that builds on your
>>>> earlier example copied below:
>>>>
>>>>>>> (per domain)
>>>>>>> group 0:
>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> ...
>>>>
>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>> I understand it:
>>>>
>>>> group 0:
>>>>  domain 0:
>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>  domain 1:
>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>>  domain 0:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>  domain 1:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>
>>>> You mention that you do not want counters to be allocated in domains that they
>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>> in domain 1, resulting in:
>>>>
>>>> group 0:
>>>>  domain 0:
>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>>  domain 0:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>  domain 1:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>
>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>
>>>> group 0:
>>>>  domain 0:
>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>>  domain 0:
>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>  domain 1:
>>>>   counter 0: LclFill,RmtFill
>>>>   counter 1: LclNTWr,RmtNTWr
>>>>   counter 2: LclSlowFill,RmtSlowFill
>>>>   counter 3: VictimBW
>>>>
>>>> The counters are shown with different per-domain configurations that seem to
>>>> match with earlier goals of (a) choose events counted by each counter and
>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>> understand the above does contradict global counter configuration though.
>>>> Or do you mean that only the *name* of the counter is global and then
>>>> that it is reconfigured as part of every assignment?
>>>
>>> Yes, I meant only the *name* is global. I assume based on a particular
>>> system configuration, the user will settle on a handful of useful
>>> groupings to count.
>>>
>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>
>>>  # define global configurations (in ABMC terms), not necessarily in this
>>>  # syntax and probably not in the mbm_assign_control file.
>>>
>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>  w=VictimBW,LclNTWr,RmtNTWr
>>>
>>>  # legacy "total" configuration, effectively r+w
>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>
>>>  /group0/0=t;1=t
>>>  /group1/0=t;1=t
>>>  /group2/0=_;1=t
>>>  /group3/0=rw;1=_
>>>
>>> - group2 is restricted to domain 1
>>> - group3 is restricted to domain 0
>>> - the rest are unrestricted
>>> - In group3, we decided we need to separate read and write traffic
>>>
>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>
>>
>> I see. Thank you for the example.
>>
>> resctrl supports per-domain configurations with the following possible when
>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>
>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>
>>    /group0/0=t;1=t
>>    /group1/0=t;1=t
>>
>> Even though the flags are identical in all domains, the assigned counters will
>> be configured differently in each domain.
>>
>> With this supported by hardware and currently also supported by resctrl it seems
>> reasonable to carry this forward to what will be supported next.
> 
> The hardware supports both a per-domain mode, where all groups in a
> domain use the same configurations and are limited to two events per
> group and a per-group mode where every group can be configured and
> assigned freely. This series is using the legacy counter access mode
> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> in the domain can be read. If we chose to read the assigned counter
> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> rather than asking the hardware to find the counter by RMID, we would
> not be limited to 2 counters per group/domain and the hardware would
> have the same flexibility as on MPAM.
> 
> (I might have said something confusing in my last messages because I
> had forgotten that I switched to the extended assignment mode when
> prototyping with soft-ABMC and MPAM.)
> 
> Forcing all groups on a domain to share the same 2 counter
> configurations would not be acceptable for us, as the example I gave
> earlier is one I've already been asked about.

I am surprised to hear this at this point in this work. It sounds like
we need to go back a couple of steps to determine how best to support
user requirements that now include per-group counter assignment.
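
To make the per-group requirement concrete, here is a minimal sketch that
tallies the counters consumed by the quoted assignment example. It assumes
the hypothetical syntax from that example (one counter per flag letter,
"_" meaning no counter in that domain); the function name is made up for
illustration and none of this is an existing resctrl interface.

```python
from collections import Counter

# Assignment lines copied from the quoted example. Each flag letter stands
# for one named counter configuration; "_" means no counter is assigned.
ASSIGNMENTS = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]


def counters_per_domain(lines):
    """Count how many counters each domain needs for the given lines."""
    used = Counter()
    for line in lines:
        # "/group3/0=rw;1=_" -> "", "group3", "0=rw;1=_"
        _, _group, domains = line.split("/")
        for entry in domains.split(";"):
            domain_id, flags = entry.split("=")
            # One counter per flag letter; "_" allocates nothing.
            used[int(domain_id)] += sum(1 for flag in flags if flag != "_")
    return dict(used)


print(counters_per_domain(ASSIGNMENTS))  # {0: 4, 1: 3}
```

The result matches the "4 counters in domain 0 and 3 counters in domain 1"
stated in the example, and it shows that the cost of splitting a group into
"r" and "w" is paid only in the domain where that split is requested.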

Have you perhaps looked into how users access the counter data as
part of your prototyping?

Reinette


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-25 21:31                               ` Moger, Babu
@ 2025-02-26 13:27                                 ` Peter Newman
  2025-02-26 16:25                                   ` Reinette Chatre
  2025-03-03 19:16                                   ` Moger, Babu
  0 siblings, 2 replies; 209+ messages in thread
From: Peter Newman @ 2025-02-26 13:27 UTC (permalink / raw)
  To: babu.moger
  Cc: Reinette Chatre, Moger, Babu, Dave Martin, corbet, tglx, mingo,
	bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>
> Hi Peter,
>
> On 2/25/25 11:11, Peter Newman wrote:
> > Hi Reinette,
> >
> > On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/21/25 5:12 AM, Peter Newman wrote:
> >>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> >>> <reinette.chatre@intel.com> wrote:
> >>>> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>>>> <reinette.chatre@intel.com> wrote:
> >>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>>>
> >>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>>>
> >>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>>>> <value>
> >>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>>>
> >>>>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>>>> is low enough to be of concern.
> >>>>>>>>>
> >>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>>>> for.
> >>>>>>>>
> >>>>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>>>
> >>>>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>>>> customers.
> >>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>>>
> >>>>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>>>> event-set for applying to a single counter rather than as individual
> >>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>>>> event names.
> >>>>>>
> >>>>>> Thank you for clarifying.
> >>>>>>
> >>>>>>>
> >>>>>>> In the letters as events model, choosing the events assigned to a
> >>>>>>> group wouldn't be enough information, since we would want to control
> >>>>>>> which events should share a counter and which should be counted by
> >>>>>>> separate counters. I think the amount of information that would need
> >>>>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>>>
> >>>>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>>>> writes in ABMC would look like...
> >>>>>>>
> >>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>>>
> >>>>>>> (per domain)
> >>>>>>> group 0:
> >>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>> group 1:
> >>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>> ...
> >>>>>>>
> >>>>>>
> >>>>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>>>> example and above the counter configuration appears to be global. You do mention
> >>>>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>>>> configuration is a requirement?
> >>>>>
> >>>>> If it's global and we want a particular group to be watched by more
> >>>>> counters, I wouldn't want this to result in allocating more counters
> >>>>> for that group in all domains, or allocating counters in domains where
> >>>>> they're not needed. I want to encourage my users to avoid allocating
> >>>>> monitoring resources in domains where a job is not allowed to run so
> >>>>> there's less pressure on the counters.
> >>>>>
> >>>>> In Dave's proposal it looks like global configuration means
> >>>>> globally-defined "named counter configurations", which works because
> >>>>> it's really per-domain assignment of the configurations to however
> >>>>> many counters the group needs in each domain.
> >>>>
> >>>> I think I am becoming lost. Would a global configuration not break your
> >>>> view of "event-set applied to a single counter"? If a counter is configured
> >>>> globally then it would not make it possible to support the full configurability
> >>>> of the hardware.
> >>>> Before I add more confusion, let me try with an example that builds on your
> >>>> earlier example copied below:
> >>>>
> >>>>>>> (per domain)
> >>>>>>> group 0:
> >>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>> group 1:
> >>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>> ...
> >>>>
> >>>> Since the above states "per domain" I rewrite the example to highlight that as
> >>>> I understand it:
> >>>>
> >>>> group 0:
> >>>>  domain 0:
> >>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>  domain 1:
> >>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>> group 1:
> >>>>  domain 0:
> >>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>  domain 1:
> >>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>
> >>>> You mention that you do not want counters to be allocated in domains that they
> >>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >>>> in domain 1, resulting in:
> >>>>
> >>>> group 0:
> >>>>  domain 0:
> >>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>> group 1:
> >>>>  domain 0:
> >>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>  domain 1:
> >>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>
> >>>> With counter 0 and counter 1 available in domain 1, these counters could
> >>>> theoretically be configured to give group 1 more data in domain 1:
> >>>>
> >>>> group 0:
> >>>>  domain 0:
> >>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>> group 1:
> >>>>  domain 0:
> >>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>  domain 1:
> >>>>   counter 0: LclFill,RmtFill
> >>>>   counter 1: LclNTWr,RmtNTWr
> >>>>   counter 2: LclSlowFill,RmtSlowFill
> >>>>   counter 3: VictimBW
> >>>>
> >>>> The counters are shown with different per-domain configurations that seem to
> >>>> match with earlier goals of (a) choose events counted by each counter and
> >>>> (b) do not allocate counters in domains where they are not needed. As I
> >>>> understand the above does contradict global counter configuration though.
> >>>> Or do you mean that only the *name* of the counter is global and then
> >>>> that it is reconfigured as part of every assignment?
> >>>
> >>> Yes, I meant only the *name* is global. I assume based on a particular
> >>> system configuration, the user will settle on a handful of useful
> >>> groupings to count.
> >>>
> >>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >>>
> >>>  # define global configurations (in ABMC terms), not necessarily in this
> >>>  # syntax and probably not in the mbm_assign_control file.
> >>>
> >>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  w=VictimBW,LclNTWr,RmtNTWr
> >>>
> >>>  # legacy "total" configuration, effectively r+w
> >>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>
> >>>  /group0/0=t;1=t
> >>>  /group1/0=t;1=t
> >>>  /group2/0=_;1=t
> >>>  /group3/0=rw;1=_
> >>>
> >>> - group2 is restricted to domain 1
> >>> - group3 is restricted to domain 0
> >>> - the rest are unrestricted
> >>> - In group3, we decided we need to separate read and write traffic
> >>>
> >>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >>>
> >>
> >> I see. Thank you for the example.
> >>
> >> resctrl supports per-domain configurations with the following possible when
> >> using mbm_total_bytes_config and mbm_local_bytes_config:
> >>
> >> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> >>
> >>    /group0/0=t;1=t
> >>    /group1/0=t;1=t
> >>
> >> Even though the flags are identical in all domains, the assigned counters will
> >> be configured differently in each domain.
> >>
> >> With this supported by hardware and currently also supported by resctrl it seems
> >> reasonable to carry this forward to what will be supported next.
> >
> > The hardware supports both a per-domain mode, where all groups in a
> > domain use the same configurations and are limited to two events per
> > group and a per-group mode where every group can be configured and
> > assigned freely. This series is using the legacy counter access mode
> > where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> > in the domain can be read. If we chose to read the assigned counter
> > directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> > rather than asking the hardware to find the counter by RMID, we would
> > not be limited to 2 counters per group/domain and the hardware would
> > have the same flexibility as on MPAM.
>
> In extended mode, the contents of a specific counter can be read by
> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> QM_CTR will then return the contents of the specified counter.
>
> It is documented below.
> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>  Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>
> We previously discussed this with you (off the public list) and I
> initially proposed the extended assignment mode.
>
> Yes, the extended mode allows greater flexibility by enabling multiple
> counters to be assigned to the same group, rather than being limited to
> just two.
>
> However, the challenge is that we currently lack the necessary interfaces
> to configure multiple events per group. Without these interfaces, the
> extended mode is not practical at this time.
>
> Therefore, we ultimately agreed to use the legacy mode, as it does not
> require modifications to the existing interface, allowing us to continue
> using it as is.
>
> >
> > (I might have said something confusing in my last messages because I
> > had forgotten that I switched to the extended assignment mode when
> > prototyping with soft-ABMC and MPAM.)
> >
> > Forcing all groups on a domain to share the same 2 counter
> > configurations would not be acceptable for us, as the example I gave
> > earlier is one I've already been asked about.
>
> I don’t see this as a blocker. It should be considered an extension to the
> current ABMC series. We can easily build on top of this series once we
> finalize how to configure the multiple event interface for each group.

I don't think it is, either. Only being able to use ABMC to assign
counters is fine for our use as an incremental step. My longer-term
concern is the domain-scoped mbm_total_bytes_config and
mbm_local_bytes_config files, but they were introduced with BMEC, so
there's already an expectation that the files are present when BMEC is
supported.
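
As a rough illustration of what "domain-scoped" means here, the sketch
below models the quoted t(domain 0)/t(domain 1) example: the same flag
resolves to a different event list in each domain, and every group in a
domain shares that list. The table and helper names are hypothetical and
only mirror the example; they are not an existing kernel or resctrl API.

```python
# Per-domain event lists for the flag "t", copied from the quoted example.
# In the domain-scoped model every group in a domain shares this list.
T_CONFIG = {
    0: ["LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill",
        "VictimBW", "LclNTWr", "RmtNTWr"],
    1: ["LclFill", "RmtFill", "VictimBW", "LclNTWr", "RmtNTWr"],
}


def resolve(assignment, config):
    """Map {domain: flag} to the events counted in each assigned domain."""
    return {dom: config[dom] for dom, flag in assignment.items() if flag == "t"}


def missing_in(domain, reference, config):
    """Events counted in the reference domain but not in the given domain."""
    return [event for event in config[reference] if event not in config[domain]]


# "/group0/0=t;1=t": identical flags, different events per domain.
group0 = resolve({0: "t", 1: "t"}, T_CONFIG)
print(len(group0[0]), len(group0[1]))  # 7 5
print(missing_in(1, 0, T_CONFIG))      # ['LclSlowFill', 'RmtSlowFill']
```

A per-group model would instead key the table by group and domain, which is
the part the BMEC-style files cannot express.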

On ABMC hardware that also supports BMEC, I'm concerned about enabling
ABMC when only the BMEC-style event configuration interface exists.
The scope of my issue is just whether enabling "full" ABMC support
will require an additional opt-in, since that could remove the BMEC
interface. If it does, it's something we can live with.

-Peter


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-26 13:27                                 ` Peter Newman
@ 2025-02-26 16:25                                   ` Reinette Chatre
  2025-02-26 17:12                                     ` Moger, Babu
  2025-03-03 19:16                                   ` Moger, Babu
  1 sibling, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-02-26 16:25 UTC (permalink / raw)
  To: Peter Newman, babu.moger
  Cc: Moger, Babu, Dave Martin, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 2/26/25 5:27 AM, Peter Newman wrote:
> Hi Babu,
> 
> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/25/25 11:11, Peter Newman wrote:
>>> Hi Reinette,
>>>
>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>
>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>
>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>> for.
>>>>>>>>>>
>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>
>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>> customers.
>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>
>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>> event names.
>>>>>>>>
>>>>>>>> Thank you for clarifying.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>
>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>
>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>
>>>>>>>>> (per domain)
>>>>>>>>> group 0:
>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> group 1:
>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>> configuration is a requirement?
>>>>>>>
>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>> there's less pressure on the counters.
>>>>>>>
>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>> many counters the group needs in each domain.
>>>>>>
>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>> globally then it would not make it possible to support the full configurability
>>>>>> of the hardware.
>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>> earlier example copied below:
>>>>>>
>>>>>>>>> (per domain)
>>>>>>>>> group 0:
>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> group 1:
>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> ...
>>>>>>
>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>> I understand it:
>>>>>>
>>>>>> group 0:
>>>>>>  domain 0:
>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>  domain 1:
>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>> group 1:
>>>>>>  domain 0:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>  domain 1:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>> in domain 1, resulting in:
>>>>>>
>>>>>> group 0:
>>>>>>  domain 0:
>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>> group 1:
>>>>>>  domain 0:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>  domain 1:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>
>>>>>> group 0:
>>>>>>  domain 0:
>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>> group 1:
>>>>>>  domain 0:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>  domain 1:
>>>>>>   counter 0: LclFill,RmtFill
>>>>>>   counter 1: LclNTWr,RmtNTWr
>>>>>>   counter 2: LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW
>>>>>>
>>>>>> The counters are shown with different per-domain configurations that seem to
>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>> understand the above does contradict global counter configuration though.
>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>> that it is reconfigured as part of every assignment?
>>>>>
>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>> system configuration, the user will settle on a handful of useful
>>>>> groupings to count.
>>>>>
>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>
>>>>>  # define global configurations (in ABMC terms), not necessarily in this
>>>>>  # syntax and probably not in the mbm_assign_control file.
>>>>>
>>>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>  w=VictimBW,LclNTWr,RmtNTWr
>>>>>
>>>>>  # legacy "total" configuration, effectively r+w
>>>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>
>>>>>  /group0/0=t;1=t
>>>>>  /group1/0=t;1=t
>>>>>  /group2/0=_;1=t
>>>>>  /group3/0=rw;1=_
>>>>>
>>>>> - group2 is restricted to domain 1
>>>>> - group3 is restricted to domain 0
>>>>> - the rest are unrestricted
>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>
>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>
>>>>
>>>> I see. Thank you for the example.
>>>>
>>>> resctrl supports per-domain configurations with the following possible when
>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>
>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>
>>>>    /group0/0=t;1=t
>>>>    /group1/0=t;1=t
>>>>
>>>> Even though the flags are identical in all domains, the assigned counters will
>>>> be configured differently in each domain.
>>>>
>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>> reasonable to carry this forward to what will be supported next.
>>>
>>> The hardware supports both a per-domain mode, where all groups in a
>>> domain use the same configurations and are limited to two events per
>>> group and a per-group mode where every group can be configured and
>>> assigned freely. This series is using the legacy counter access mode
>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>> in the domain can be read. If we chose to read the assigned counter
>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>> rather than asking the hardware to find the counter by RMID, we would
>>> not be limited to 2 counters per group/domain and the hardware would
>>> have the same flexibility as on MPAM.
>>
>> In extended mode, the contents of a specific counter can be read by
>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>> QM_CTR will then return the contents of the specified counter.
>>
>> It is documented below.
>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>  Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>
>> We previously discussed this with you (off the public list) and I
>> initially proposed the extended assignment mode.
>>
>> Yes, the extended mode allows greater flexibility by enabling multiple
>> counters to be assigned to the same group, rather than being limited to
>> just two.
>>
>> However, the challenge is that we currently lack the necessary interfaces
>> to configure multiple events per group. Without these interfaces, the
>> extended mode is not practical at this time.
>>
>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>> require modifications to the existing interface, allowing us to continue
>> using it as is.
>>
>>>
>>> (I might have said something confusing in my last messages because I
>>> had forgotten that I switched to the extended assignment mode when
>>> prototyping with soft-ABMC and MPAM.)
>>>
>>> Forcing all groups on a domain to share the same 2 counter
>>> configurations would not be acceptable for us, as the example I gave
>>> earlier is one I've already been asked about.
>>
>> I don’t see this as a blocker. It should be considered an extension to the
>> current ABMC series. We can easily build on top of this series once we
>> finalize how to configure the multiple event interface for each group.
> 
> I don't think it is, either. Only being able to use ABMC to assign
> counters is fine for our use as an incremental step. My longer-term
> concern is the domain-scoped mbm_total_bytes_config and
> mbm_local_bytes_config files, but they were introduced with BMEC, so
> there's already an expectation that the files are present when BMEC is
> supported.
> 
> On ABMC hardware that also supports BMEC, I'm concerned about enabling
> ABMC when only the BMEC-style event configuration interface exists.

ABMC currently depends on BMEC, making the current implementation the
one you are concerned about?
https://lore.kernel.org/lkml/e4111779ebb0e7004dbedc258eeae2677f578ab1.1737577229.git.babu.moger@amd.com/

> The scope of my issue is just whether enabling "full" ABMC support
> will require an additional opt-in, since that could remove the BMEC
> interface. If it does, it's something we can live with.


Reinette


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-26 16:25                                   ` Reinette Chatre
@ 2025-02-26 17:12                                     ` Moger, Babu
  0 siblings, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-02-26 17:12 UTC (permalink / raw)
  To: Reinette Chatre, Peter Newman
  Cc: Moger, Babu, Dave Martin, corbet, tglx, mingo, bp, dave.hansen,
	tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter/Reinette,

On 2/26/25 10:25, Reinette Chatre wrote:
> Hi Peter,
> 
> On 2/26/25 5:27 AM, Peter Newman wrote:
>> Hi Babu,
>>
>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>
>>> Hi Peter,
>>>
>>> On 2/25/25 11:11, Peter Newman wrote:
>>>> Hi Reinette,
>>>>
>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>> <reinette.chatre@intel.com> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>
>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>
>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>> for.
>>>>>>>>>>>
>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>
>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>> customers.
>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>
>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>> event names.
>>>>>>>>>
>>>>>>>>> Thank you for clarifying.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>
>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>
>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>
>>>>>>>>>> (per domain)
>>>>>>>>>> group 0:
>>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>> configuration is a requirement?
>>>>>>>>
>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>> there's less pressure on the counters.
>>>>>>>>
>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>> many counters the group needs in each domain.
>>>>>>>
>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>> of the hardware.
>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>> earlier example copied below:
>>>>>>>
>>>>>>>>>> (per domain)
>>>>>>>>>> group 0:
>>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> ...
>>>>>>>
>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>> I understand it:
>>>>>>>
>>>>>>> group 0:
>>>>>>>  domain 0:
>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>  domain 1:
>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>>  domain 0:
>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>  domain 1:
>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>> in domain 1, resulting in:
>>>>>>>
>>>>>>> group 0:
>>>>>>>  domain 0:
>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>>  domain 0:
>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>  domain 1:
>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>
>>>>>>> group 0:
>>>>>>>  domain 0:
>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>>  domain 0:
>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>  domain 1:
>>>>>>>   counter 0: LclFill,RmtFill
>>>>>>>   counter 1: LclNTWr,RmtNTWr
>>>>>>>   counter 2: LclSlowFill,RmtSlowFill
>>>>>>>   counter 3: VictimBW
>>>>>>>
>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>
>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>> system configuration, the user will settle on a handful of useful
>>>>>> groupings to count.
>>>>>>
>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>
>>>>>>  # define global configurations (in ABMC terms), not necessarily in this
>>>>>>  # syntax and probably not in the mbm_assign_control file.
>>>>>>
>>>>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>  w=VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>>  # legacy "total" configuration, effectively r+w
>>>>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>>  /group0/0=t;1=t
>>>>>>  /group1/0=t;1=t
>>>>>>  /group2/0=_;1=t
>>>>>>  /group3/0=rw;1=_
>>>>>>
>>>>>> - group2 is restricted to domain 1
>>>>>> - group3 is restricted to domain 0
>>>>>> - the rest are unrestricted
>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>
>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>
>>>>>
>>>>> I see. Thank you for the example.
>>>>>
>>>>> resctrl supports per-domain configurations with the following possible when
>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>
>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>
>>>>>    /group0/0=t;1=t
>>>>>    /group1/0=t;1=t
>>>>>
>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>> be configured differently in each domain.
>>>>>
>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>> reasonable to carry this forward to what will be supported next.
>>>>
>>>> The hardware supports both a per-domain mode, where all groups in a
>>>> domain use the same configurations and are limited to two events per
>>>> group and a per-group mode where every group can be configured and
>>>> assigned freely. This series is using the legacy counter access mode
>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>> in the domain can be read. If we chose to read the assigned counter
>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>> rather than asking the hardware to find the counter by RMID, we would
>>>> not be limited to 2 counters per group/domain and the hardware would
>>>> have the same flexibility as on MPAM.
>>>
>>> In extended mode, the contents of a specific counter can be read by
>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>> QM_CTR will then return the contents of the specified counter.
>>>
>>> It is documented below.
>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>  Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>
>>> We previously discussed this with you (off the public list) and I
>>> initially proposed the extended assignment mode.
>>>
>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>> counters to be assigned to the same group, rather than being limited to
>>> just two.
>>>
>>> However, the challenge is that we currently lack the necessary interfaces
>>> to configure multiple events per group. Without these interfaces, the
>>> extended mode is not practical at this time.
>>>
>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>> require modifications to the existing interface, allowing us to continue
>>> using it as is.
>>>
>>>>
>>>> (I might have said something confusing in my last messages because I
>>>> had forgotten that I switched to the extended assignment mode when
>>>> prototyping with soft-ABMC and MPAM.)
>>>>
>>>> Forcing all groups on a domain to share the same 2 counter
>>>> configurations would not be acceptable for us, as the example I gave
>>>> earlier is one I've already been asked about.
>>>
>>> I don’t see this as a blocker. It should be considered an extension to the
>>> current ABMC series. We can easily build on top of this series once we
>>> finalize how to configure the multiple event interface for each group.
>>
>> I don't think it is, either. Only being able to use ABMC to assign
>> counters is fine for our use as an incremental step. My longer-term
>> concern is the domain-scoped mbm_total_bytes_config and
>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>> there's already an expectation that the files are present when BMEC is
>> supported.

It's good that we at least know about this concern now. Let's take a step
back and figure out how we can address it.

>>
>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>> ABMC when only the BMEC-style event configuration interface exists.
> 
> ABMC currently depends on BMEC, making the current implementation the
> one you are concerned about?
> https://lore.kernel.org/lkml/e4111779ebb0e7004dbedc258eeae2677f578ab1.1737577229.git.babu.moger@amd.com/

I think it is more than that.

The ABMC feature allows event configuration by writing to L3_QOS_ABMC_CFG,
where we can set cntr_id, RMID, and event configuration. Currently, we
derive event configuration from BMEC settings (either
mbm_total_bytes_config or mbm_local_bytes_config).

If we don’t use BMEC values, we would need to require users to manually
specify event configuration settings.

struct mbm_cntr_cfg {
        enum resctrl_event_id   evtid;
        struct rdtgroup         *rdtgrp;
};

Currently, we determine the RMID from the rdtgroup and the event type,
while event configuration relies on BMEC.


To make event configuration independent of BMEC, we can include an
explicit event configuration field:

struct mbm_cntr_cfg {
        enum resctrl_event_id   evtid;
        u32                     evt_cfg;  /* User-provided config value */
        struct rdtgroup         *rdtgrp;
};

Key Considerations

1.  Counter Management: Managing counters globally (like CLOSID
management) would be simpler than handling them at the domain level,
though domain-level management is feasible.

2. User Input: Users will need to specify event configuration when
assigning events.


Here is a quick example using our current interface:
a. List the group:

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=t:0x1F,l:0x15;1=t:0x1F,l:0x15

b. Unassign an Event:

# echo "//0-l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=t:0x1F;1=t:0x1F,l:0x15

c. Assign an Event:

# echo "//0+l:0x15" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
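
For illustration only, a "<flag>:<config>" token such as "l:0x15" could be
parsed roughly as follows. This is a userspace sketch, not code from the
series: struct evt_assign and parse_evt_token() are hypothetical names, and
it assumes only the 't' and 'l' flags followed by a numeric configuration
value.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical: one parsed "<flag>:<config>" entry, e.g. "t:0x1F". */
struct evt_assign {
	char flag;          /* 't' (total) or 'l' (local) */
	unsigned long cfg;  /* event configuration bitmask */
};

/*
 * Parse a single token of the form "<flag>:<config>".
 * Returns 0 on success and -EINVAL on malformed input.
 */
int parse_evt_token(const char *tok, struct evt_assign *out)
{
	char *end;

	/* Only the two flags used in the example are accepted here. */
	if (!tok || (tok[0] != 't' && tok[0] != 'l') || tok[1] != ':')
		return -EINVAL;

	/* Require a digit so that "", "-1" and " 5" are rejected. */
	if (!isdigit((unsigned char)tok[2]))
		return -EINVAL;

	errno = 0;
	out->cfg = strtoul(tok + 2, &end, 0);  /* base 0: accepts 0x.. too */
	if (errno || *end != '\0')
		return -EINVAL;

	out->flag = tok[0];
	return 0;
}
```

With this, parse_evt_token("l:0x15", &a) fills in flag 'l' and cfg 0x15,
while an unknown flag, a missing value or trailing characters are rejected
with -EINVAL.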

Note that I don't want to rush here.

Peter, can you please spend some time and propose the interface you are
thinking of, based on both ABMC and MPAM?

> 
>> The scope of my issue is just whether enabling "full" ABMC support
>> will require an additional opt-in, since that could remove the BMEC
>> interface. If it does, it's something we can live with.
> 
> 
> Reinette
> 
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2025-02-21 23:42                     ` Moger, Babu
@ 2025-02-27 11:07                       ` Peter Newman
  0 siblings, 0 replies; 209+ messages in thread
From: Peter Newman @ 2025-02-27 11:07 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Reinette Chatre, Dave Martin, Babu Moger, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On Sat, Feb 22, 2025 at 12:43 AM Moger, Babu <bmoger@amd.com> wrote:
>
> Hi Reinette,
>
> On 2/21/2025 4:48 PM, Reinette Chatre wrote:
> > Hi Babu,
> >
> > On 2/21/25 10:23 AM, Moger, Babu wrote:
> >> Hi All,
> >>
> >> On 2/21/2025 11:14 AM, Dave Martin wrote:
> >>> Hi,
> >>>
> >>> On Thu, Feb 20, 2025 at 09:08:17AM -0800, Reinette Chatre wrote:
> >>>> Hi Dave,
> >>>>
> >>>> On 2/20/25 5:40 AM, Dave Martin wrote:
> >>>>> On Thu, Feb 20, 2025 at 11:35:56AM +0100, Peter Newman wrote:
> >>>>>> Hi Reinette,
> >>>>>>
> >>>>>> On Wed, Feb 19, 2025 at 6:55 PM Reinette Chatre
> >>>>>> <reinette.chatre@intel.com> wrote:
> >>>
> >>> [...]
> >>>
> >>>>>>> Could you please remind me how a user will set this flag?
> >>>>>>
> >>>>>> Quoting my original suggestion[1]:
> >>>>>>
> >>>>>>    "info/L3_MON/mbm_assign_on_mkdir?
> >>>>>>
> >>>>>>     boolean (parsed with kstrtobool()), defaulting to true?"
> >>>>>>
> >>>>>> After mount, any groups that got counters on creation would have to be
> >>>>>> cleaned up, but at least that can be done with forward progress once
> >>>>>> the flag is cleared.
> >>>>>>
> >>>>>> I was able to live with that as long as there aren't users polling for
> >>>>>> resctrl to be mounted and immediately creating groups. For us, a
> >>>>>> single container manager service manages resctrl.
> >>>
> >>> [...]
> >>>
> >>>>> +1
> >>>>>
> >>>>> That's basically my position -- the auto-assignment feels like a
> >>>>> _potential_ nuisance for ABMC-aware users, but it depends on what they
> >>>>> are trying to do.  Migration of non-ABMC-aware users will be easier for
> >>>>> basic use cases if auto-assignment occurs by default (as in this
> >>>>> series).
> >>>>>
> >>>>> Having an explicit way to turn this off seems perfectly reasonable
> >>>>> (and could be added later on, if not provided in this series).
> >>>>>
> >>>>>
> >>>>> What about the question re whether turning mbm_cntr_assign mode on
> >>>>> should trigger auto-assignment?
> >>>>>
> >>>>> Currently turning this mode off and then on again has the effect of
> >>>>> removing all automatic assignments for extant groups.  This feels
> >>>>> surprising and/or unintentional (?)
> >>>>
> >>>> Connecting to what you start off by saying I also see auto-assignment
> >>>> as the way to provide a smooth transition for "non-ABMC-aware" users.
> >>>
> >>> I agree, and having this on by default also helps non-ABMC-aware users.
> >>>
> >>>> To me a user that turns this mode off and then on again can be
> >>>> considered as a user that is "ABMC-aware" and turning it "off and then
> >>>> on again" seems like an intuitive way to get to a "clean slate"
> >>>> wrt counter assignments. This may also be a convenient way for
> >>>> an "ABMC-aware" user space to unassign all counters and thus also
> >>>> helpful if resctrl supports the flag that Peter proposed. The flag
> >>>> seems to already keep something like this in its context with
> >>>> a name of "mbm_assign_on_mkdir" that could be interpreted as
> >>>> "only auto assign on mkdir"?
> >>>
> >>> Yes, that's reasonable.  It could be a good idea to document this
> >>> behaviour of switching the mbm_cntr_assign mode, if we think it is
> >>> useful and people are likely to rely on it.
> >>>
> >>> Since mkdir is an implementation detail of the resctrl interface, I'd
> >>> be tempted to go for a more generic name, say,
> >>> "mbm_assign_new_mon_groups".  But that's just bikeshedding.
> >>> The proposed behaviour seems fine.
> >>>
> >>> Either way, if this is not included in this series, it could be added
> >>> later without breaking anything.
> >>
> >> How about more generic "mbm_cntr_assign_auto" ?
> >
> > I would like to be careful to not make it _too_ generic. Dave already pointed
> > out that users may be surprised that counters are not auto-assigned when switching
> > between the different modes so using the name to help highlight when this
> > auto-assignment can be expected to happen seems very useful.
>
> In that case "mbm_assign_on_mkdir" seems on point and precise.
> Thanks

It also looks like counters are not assigned when a domain is
hotplugged, so explicitly stating that it's on mkdir gets us off the
hook for that.

-Peter

* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-24 17:23                 ` Luck, Tony
@ 2025-02-28 17:50                   ` Dave Martin
  2025-03-03 19:30                     ` Luck, Tony
  0 siblings, 1 reply; 209+ messages in thread
From: Dave Martin @ 2025-02-28 17:50 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Chatre, Reinette, Moger, Babu, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com, x86@kernel.org, hpa@zytor.com,
	paulmck@kernel.org, akpm@linux-foundation.org, thuth@redhat.com,
	rostedt@goodmis.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	jpoimboe@kernel.org, perry.yuan@amd.com, sandipan.das@amd.com,
	Huang, Kai, Li, Xiaoyao, seanjc@google.com, Li, Xin3,
	andrew.cooper3@citrix.com, ebiggers@google.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Wieczor-Retman, Maciej,
	Eranian, Stephane

Hi,

On Tue, Feb 25, 2025 at 12:39:04PM +0000, Dave Martin wrote:
> Hi Tony,
> 
> On Mon, Feb 24, 2025 at 05:23:06PM +0000, Luck, Tony wrote:
> > > It has just occurred to me that ftrace has large, multi-line text files
> > > in sysfs, so I'll try to find out how they handle that there.  Maybe
> > > there is some infrastructure we can re-use.
> > 
> > Resctrl was built on top of "kernfs" because that was a simple base
> > that met needs at the time.
> > 
> > Do we need to look at either extending capabilities of kernfs? Or
> > move to sysfs?
> > 
> > -Tony
> 
> I took a look at what ftrace does: it basically rolls its own buffering
> implementation, sufficient for its needs.
> 
> The ftrace code is internal and not trivial to pick up and plonk into
> resctrl.  We also have another possible requirement that ftrace doesn't
> have (whole-file atomicity).  But ftrace's example at least confirms
> that there is probably no off-the-shelf implementation for this in the
> kernel.

[...]

After having spent a bit of time looking into this, I think we are probably
OK, at least for reading these files.

seq_file will loop over the file's show() callback, growing the seq_file
buffer until show() can run without overrunning the buffer.

This means that the show() callback receives a buffer that is magically big
enough, but there may be some "speculative" calls whose output never goes
to userspace.  Once seq_file has the data, it deals with the userspace-
facing I/O buffering internally, so we shouldn't have to worry about that.

I'll try to hack up a test next week to confirm that this works.

The seq_file approach appears sound, but may be inefficient if the initial
guess at the buffer size (= PAGE_SIZE) is frequently too small.  (There is
single_open_size() though, which allows the buffer to be preallocated with
a specified size and may be useful.)
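
As a rough model of that grow-and-retry behaviour, the userspace C sketch
below mimics what the read path does. It is a simulation only: struct
sim_buf, sim_puts(), sim_read() and show_lines() are made-up names rather
than the seq_file API, and the doubling policy is an assumption of the
sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a seq_file buffer; not the kernel structure. */
struct sim_buf {
	char *data;
	size_t size;   /* capacity in bytes */
	size_t count;  /* bytes used; count == size marks an overflow */
};

/* Append a string, or mark the buffer as overflowed if it does not fit. */
void sim_puts(struct sim_buf *b, const char *s)
{
	size_t len = strlen(s);

	if (b->count + len < b->size) {
		memcpy(b->data + b->count, s, len);
		b->count += len;
	} else {
		b->count = b->size;  /* overflow marker */
	}
}

typedef void (*show_fn)(struct sim_buf *b, void *arg);

/*
 * Call show() repeatedly, doubling the buffer until the output fits.
 * Returns the number of show() calls made, or -1 if allocation fails.
 * On success the caller owns out->data and must free() it.
 */
int sim_read(show_fn show, void *arg, size_t initial, struct sim_buf *out)
{
	size_t size = initial;
	int calls = 0;

	for (;;) {
		out->data = malloc(size);
		if (!out->data)
			return -1;
		out->size = size;
		out->count = 0;

		show(out, arg);
		calls++;
		if (out->count < out->size)
			return calls;  /* everything fitted */

		free(out->data);       /* "speculative" output is dropped */
		size *= 2;
	}
}

/* Example show(): emits *arg lines of 8 bytes each. */
void show_lines(struct sim_buf *b, void *arg)
{
	int n = *(int *)arg;

	for (int i = 0; i < n; i++)
		sim_puts(b, "/g/0=t;\n");
}
```

With a 16-byte initial buffer and 80 bytes of output, show_lines() ends up
being called four times (for 16, 32, 64 and 128 bytes) and only the last
call's output is kept. That repeated work is the inefficiency that
preallocating a suitably sized buffer would avoid.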

seq_file doesn't help with the write side at all, but I think we agreed
that handling large file writes properly is less urgent.

Cheers
---Dave

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-02-26 13:27                                 ` Peter Newman
  2025-02-26 16:25                                   ` Reinette Chatre
@ 2025-03-03 19:16                                   ` Moger, Babu
  2025-03-04 16:44                                     ` Peter Newman
  1 sibling, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-03-03 19:16 UTC (permalink / raw)
  To: Peter Newman
  Cc: Reinette Chatre, Moger, Babu, Dave Martin, corbet, tglx, mingo,
	bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter/Reinette,

On 2/26/25 07:27, Peter Newman wrote:
> Hi Babu,
> 
> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/25/25 11:11, Peter Newman wrote:
>>> Hi Reinette,
>>>
>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>
>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>
>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>> for.
>>>>>>>>>>
>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>
>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>> customers.
>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>
>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>> event names.
>>>>>>>>
>>>>>>>> Thank you for clarifying.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>
>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>
>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>
>>>>>>>>> (per domain)
>>>>>>>>> group 0:
>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> group 1:
>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>> configuration is a requirement?
>>>>>>>
>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>> there's less pressure on the counters.
>>>>>>>
>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>> many counters the group needs in each domain.
>>>>>>
>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>> globally then it would not make it possible to support the full configurability
>>>>>> of the hardware.
>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>> earlier example copied below:
>>>>>>
>>>>>>>>> (per domain)
>>>>>>>>> group 0:
>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> group 1:
>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> ...
>>>>>>
>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>> I understand it:
>>>>>>
>>>>>> group 0:
>>>>>>  domain 0:
>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>  domain 1:
>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>> group 1:
>>>>>>  domain 0:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>  domain 1:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>> in domain 1, resulting in:
>>>>>>
>>>>>> group 0:
>>>>>>  domain 0:
>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>> group 1:
>>>>>>  domain 0:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>  domain 1:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>
>>>>>> group 0:
>>>>>>  domain 0:
>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>> group 1:
>>>>>>  domain 0:
>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>  domain 1:
>>>>>>   counter 0: LclFill,RmtFill
>>>>>>   counter 1: LclNTWr,RmtNTWr
>>>>>>   counter 2: LclSlowFill,RmtSlowFill
>>>>>>   counter 3: VictimBW
>>>>>>
>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>> understand the above does contradict global counter configuration though.
>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>> that it is reconfigured as part of every assignment?
>>>>>
>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>> system configuration, the user will settle on a handful of useful
>>>>> groupings to count.
>>>>>
>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>
>>>>>  # define global configurations (in ABMC terms), not necessarily in this
>>>>>  # syntax and probably not in the mbm_assign_control file.
>>>>>
>>>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>  w=VictimBW,LclNTWr,RmtNTWr
>>>>>
>>>>>  # legacy "total" configuration, effectively r+w
>>>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>
>>>>>  /group0/0=t;1=t
>>>>>  /group1/0=t;1=t
>>>>>  /group2/0=_;1=t
>>>>>  /group3/0=rw;1=_
>>>>>
>>>>> - group2 is restricted to domain 0
>>>>> - group3 is restricted to domain 1
>>>>> - the rest are unrestricted
>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>
>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>
>>>>
>>>> I see. Thank you for the example.
>>>>
>>>> resctrl supports per-domain configurations with the following possible when
>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>
>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>
>>>>    /group0/0=t;1=t
>>>>    /group1/0=t;1=t
>>>>
>>>> Even though the flags are identical in all domains, the assigned counters will
>>>> be configured differently in each domain.
>>>>
>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>> reasonable to carry this forward to what will be supported next.
>>>
>>> The hardware supports both a per-domain mode, where all groups in a
>>> domain use the same configurations and are limited to two events per
>>> group and a per-group mode where every group can be configured and
>>> assigned freely. This series is using the legacy counter access mode
>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>> in the domain can be read. If we chose to read the assigned counter
>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>> rather than asking the hardware to find the counter by RMID, we would
>>> not be limited to 2 counters per group/domain and the hardware would
>>> have the same flexibility as on MPAM.
>>
>> In extended mode, the contents of a specific counter can be read by
>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>> QM_CTR will then return the contents of the specified counter.
>>
>> It is documented below.
>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>  Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>
>> We previously discussed this with you (off the public list) and I
>> initially proposed the extended assignment mode.
>>
>> Yes, the extended mode allows greater flexibility by enabling multiple
>> counters to be assigned to the same group, rather than being limited to
>> just two.
>>
>> However, the challenge is that we currently lack the necessary interfaces
>> to configure multiple events per group. Without these interfaces, the
>> extended mode is not practical at this time.
>>
>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>> require modifications to the existing interface, allowing us to continue
>> using it as is.
>>
>>>
>>> (I might have said something confusing in my last messages because I
>>> had forgotten that I switched to the extended assignment mode when
>>> prototyping with soft-ABMC and MPAM.)
>>>
>>> Forcing all groups on a domain to share the same 2 counter
>>> configurations would not be acceptable for us, as the example I gave
>>> earlier is one I've already been asked about.
>>
>> I don’t see this as a blocker. It should be considered an extension to the
>> current ABMC series. We can easily build on top of this series once we
>> finalize how to configure the multiple event interface for each group.
> 
> I don't think it is, either. Only being able to use ABMC to assign
> counters is fine for our use as an incremental step. My longer-term
> concern is the domain-scoped mbm_total_bytes_config and
> mbm_local_bytes_config files, but they were introduced with BMEC, so
> there's already an expectation that the files are present when BMEC is
> supported.
> 
> On ABMC hardware that also supports BMEC, I'm concerned about enabling
> ABMC when only the BMEC-style event configuration interface exists.
> The scope of my issue is just whether enabling "full" ABMC support
> will require an additional opt-in, since that could remove the BMEC
> interface. If it does, it's something we can live with.

As you know, this series is currently blocked without further feedback.

I’d like to begin reworking these patches to incorporate Peter’s feedback.
Any input or suggestions would be appreciated.

Here’s what we’ve learned so far:

1. Assignments should be independent of BMEC.
2. We should be able to assign multiple event types to a counter (e.g.,
read, write, VictimBW, etc.). This is also called a shared counter.
3. There should be an option to assign events per domain.
4. Currently, only two counters can be assigned per group, but the design
should allow flexibility to assign more in the future as the interface
evolves.
5. Utilize the extended RMID read mode.


Here is my proposal using Peter's earlier example:

# define event configurations

====    ============    ========================================================
Bits    Mnemonics       Description
====    ============    ========================================================
6       VictimBW        Dirty Victims from all types of memory
5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
4       LclSlowFill     Reads to slow memory in the local NUMA domain
3       RmtNTWr         Non-temporal writes to non-local NUMA domain
2       LclNTWr         Non-temporal writes to local NUMA domain
1       RmtFill         Reads to memory in the non-local NUMA domain
0       LclFill         Reads to memory in the local NUMA domain
====    ============    ========================================================

# define flags based on combinations of the above event types

t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
l = LclFill,LclNTWr,LclSlowFill
r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
w = VictimBW,LclNTWr,RmtNTWr
v = VictimBW
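
To make the flag definitions concrete, here is a minimal sketch that turns
each flag into a bitmask using the bit positions from the table (bit 1 is
taken to be RmtFill, as in the flag lists). The names EVENT_BITS, FLAGS and
flag_mask() are made up for illustration; nothing like this exists in the
series.

```python
# Bit positions copied from the event table (bit 1 assumed to be RmtFill).
EVENT_BITS = {
    "LclFill": 0,
    "RmtFill": 1,
    "LclNTWr": 2,
    "RmtNTWr": 3,
    "LclSlowFill": 4,
    "RmtSlowFill": 5,
    "VictimBW": 6,
}

# Flag letters as proposed in this mail; they are not an existing interface.
FLAGS = {
    "t": ["LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill",
          "VictimBW", "LclNTWr", "RmtNTWr"],
    "l": ["LclFill", "LclNTWr", "LclSlowFill"],
    "r": ["LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill"],
    "w": ["VictimBW", "LclNTWr", "RmtNTWr"],
    "v": ["VictimBW"],
}


def flag_mask(flag):
    """OR together the bits of every event type named by one flag letter."""
    mask = 0
    for name in FLAGS[flag]:
        mask |= 1 << EVENT_BITS[name]
    return mask


for flag in FLAGS:
    print(f"{flag} = {flag_mask(flag):#04x}")
```

With these definitions 'r' and 'w' do not overlap and together cover the
same bits as 't', which matches the earlier description of 't' as
effectively r+w.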

Peter suggested the following format earlier:

/group0/0=t;1=t
/group1/0=t;1=t
/group2/0=_;1=t
/group3/0=rw;1=_

Interpretation:
/group0/0=t;1=t  : Assign a counter with event configuration 't' to domain
0 and 1 on the resctrl group0.
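
As a rough illustration of how this format consumes counters, the sketch
below counts the counters requested in each domain for the four lines
above. It assumes that '_' means "no counter" and that each flag letter
asks for one counter, which is how I read Peter's example;
counters_per_domain() is a hypothetical helper, not kernel code.

```python
from collections import Counter

EXAMPLE = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]


def counters_per_domain(lines):
    """Count requested counters per domain id for lines in Peter's format."""
    used = Counter()
    for line in lines:
        # "/group3/0=rw;1=_" splits into "", "group3", "0=rw;1=_";
        # the group path itself is not needed for counting.
        _, _group, assignments = line.split("/", 2)
        for item in assignments.split(";"):
            domain, flags = item.split("=")
            # Assumption: "_" requests nothing, each letter requests one counter.
            used[int(domain)] += 0 if flags == "_" else len(flags)
    return dict(used)


print(counters_per_domain(EXAMPLE))
```

For the example this gives 4 counters in domain 0 and 3 in domain 1, the
same totals Peter quoted.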

Peter's format does not indicate which index should be used for the
assignment. Based on the index, we can read the events from either
mbm_total_bytes or mbm_local_bytes.

Currently, we can assign two counters to a group and events can be read
from mon_data/mon_L3_00/mbm_total_bytes (index 0) and
mon_data/mon_L3_00/mbm_local_bytes (index 1).

To address this, we need to include the index in some form. One approach
is to incorporate this information into the group's name.

Like below:

/group0:0/0=t;1=t
/group0:1/0=l;1=l
/group1:0/0=t;1=t
/group2:1/0=_;1=t
/group3:0/0=rw;1=_


Interpretation:
/group0:0/0=t;1=t : Assign a counter with event configuration 't' to
domains 0 and 1 of resctrl group0, using index 0. The events can
be read from group0/mon_data/mon_L3_00/mbm_total_bytes and
group0/mon_data/mon_L3_01/mbm_total_bytes.


/group0:1/0=l;1=l : Assign a counter with event configuration 'l' to
domains 0 and 1 of resctrl group0, using index 1. The events can
be read from group0/mon_data/mon_L3_00/mbm_local_bytes and
group0/mon_data/mon_L3_01/mbm_local_bytes.
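
The interpretation above can be sketched as a small parser that maps one
line of the proposed "group:index" syntax to the files the events would be
read from. This is only an illustration of the proposal: INDEX_FILES and
readback_paths() are hypothetical names, and the sketch does not settle how
a multi-flag entry such as 'rw' would map onto a single index.

```python
# Index-to-file mapping as described in this mail (index 0 / index 1).
INDEX_FILES = {0: "mbm_total_bytes", 1: "mbm_local_bytes"}


def readback_paths(line):
    """Return the mon_data files that would hold the events for one line."""
    # "/group0:1/0=l;1=l" splits into "", "group0:1", "0=l;1=l"
    _, group_and_index, assignments = line.split("/", 2)
    group, index = group_and_index.rsplit(":", 1)
    event_file = INDEX_FILES[int(index)]

    paths = []
    for item in assignments.split(";"):
        domain, flags = item.split("=")
        if flags == "_":
            # Assumption: "_" means no counter is assigned in this domain.
            continue
        paths.append(f"{group}/mon_data/mon_L3_{int(domain):02d}/{event_file}")
    return paths


for example in ("/group0:0/0=t;1=t", "/group0:1/0=l;1=l", "/group2:1/0=_;1=t"):
    print(example, "->", readback_paths(example))
```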


What are your thoughts?
-- 
Thanks
Babu Moger


* RE: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-02-28 17:50                   ` Dave Martin
@ 2025-03-03 19:30                     ` Luck, Tony
  2025-03-05 18:06                       ` Dave Martin
  0 siblings, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-03-03 19:30 UTC (permalink / raw)
  To: Dave Martin
  Cc: Chatre, Reinette, Moger, Babu, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com, x86@kernel.org, hpa@zytor.com,
	paulmck@kernel.org, akpm@linux-foundation.org, thuth@redhat.com,
	rostedt@goodmis.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	jpoimboe@kernel.org, perry.yuan@amd.com, sandipan.das@amd.com,
	Huang, Kai, Li, Xiaoyao, seanjc@google.com, Li, Xin3,
	andrew.cooper3@citrix.com, ebiggers@google.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Wieczor-Retman, Maciej,
	Eranian, Stephane

> After having spent a bit of time looking into this, I think we are probably
> OK, at least for reading these files.
>
> seq_file will loop over the file's show() callback, growing the seq_file
> buffer until show() can run without overrunning the buffer.
>
> This means that the show() callback receives a buffer that is magically big
> enough, but there may be some "speculative" calls whose output never goes
> to userspace.  Once seq_file has the data, it deals with the userspace-
> facing I/O buffering internally, so we shouldn't have to worry about that.

Doesn't this depend on the size of the user read(2) syscall request?

If the total size of the resctrl file is very large, we have a potential issue:

1) User asks for 4KB, owns the resctrl mutex.

2) resctrl uses seq_file and fills with more than 4KB

3) User gets the first 4KB, releases the resctrl mutex

4) Some other pending resctrl operation now gets the mutex and makes changes that affect the contents of this file

5) User asks for next 4KB (when it reacquires resctrl mutex)

6) resctrl uses seq_file() to construct new image of file incorporating changes because of step 4

7) User gets the second 4KB from the seq_file buffer (which doesn't fit cleanly next to data it got in step 3).
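
A toy model of steps 1-7, in case it helps to see the effect. It only
replays the sequence described above (regenerate the whole image on every
read, then return the slice at the caller's offset); it makes no claim
about what seq_file really does internally, which is exactly the open
question. render(), read_chunk() and read_all() are made-up helpers, and a
16-byte chunk stands in for the 4KB read.

```python
def render(state):
    """Build the whole file image, one line per group (stand-in for show())."""
    return "".join(f"/{group}/0={flags};1={flags}\n"
                   for group, flags in state.items())


def read_chunk(state, offset, size):
    """Regenerate the image and return `size` bytes starting at `offset`."""
    return render(state)[offset:offset + size]


def read_all(state, chunk, mutate_after_first=None):
    """Read the file in chunks; optionally change state after the first one."""
    out, offset = [], 0
    while True:
        data = read_chunk(state, offset, chunk)
        if not data:
            break
        out.append(data)
        offset += len(data)
        if mutate_after_first is not None and len(out) == 1:
            # Step 4: another resctrl operation runs between two reads.
            mutate_after_first(state)
    return "".join(out)


state = {"group0": "t", "group1": "t", "group2": "t"}
before = render(state)
torn = read_all(state, 16, lambda s: s.update(group0="rw"))
after = render(state)

print(torn)
```

In this model the reader ends up with group0's old line followed by a stray
fragment of its new line, so the result matches neither the image before
the change nor the image after it.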

-Tony


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-03 19:16                                   ` Moger, Babu
@ 2025-03-04 16:44                                     ` Peter Newman
  2025-03-04 21:49                                       ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Peter Newman @ 2025-03-04 16:44 UTC (permalink / raw)
  To: babu.moger
  Cc: Reinette Chatre, Moger, Babu, Dave Martin, corbet, tglx, mingo,
	bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>
> Hi Peter/Reinette,
>
> On 2/26/25 07:27, Peter Newman wrote:
> > Hi Babu,
> >
> > On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/25/25 11:11, Peter Newman wrote:
> >>> Hi Reinette,
> >>>
> >>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> >>> <reinette.chatre@intel.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> On 2/21/25 5:12 AM, Peter Newman wrote:
> >>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> >>>>> <reinette.chatre@intel.com> wrote:
> >>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>>>>>> <value>
> >>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>>>>>
> >>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>>>>>> is low enough to be of concern.
> >>>>>>>>>>>
> >>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>>>>>> for.
> >>>>>>>>>>
> >>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>>>>>
> >>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>>>>>> customers.
> >>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>>>>>
> >>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>>>>>> event-set for applying to a single counter rather than as individual
> >>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>>>>>> event names.
> >>>>>>>>
> >>>>>>>> Thank you for clarifying.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> In the letters as events model, choosing the events assigned to a
> >>>>>>>>> group wouldn't be enough information, since we would want to control
> >>>>>>>>> which events should share a counter and which should be counted by
> >>>>>>>>> separate counters. I think the amount of information that would need
> >>>>>>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>>>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>>>>>
> >>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>>>>>> writes in ABMC would look like...
> >>>>>>>>>
> >>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>>>>>
> >>>>>>>>> (per domain)
> >>>>>>>>> group 0:
> >>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> group 1:
> >>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>>>>>> example and above the counter configuration appears to be global. You do mention
> >>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>>>>>> configuration is a requirement?
> >>>>>>>
> >>>>>>> If it's global and we want a particular group to be watched by more
> >>>>>>> counters, I wouldn't want this to result in allocating more counters
> >>>>>>> for that group in all domains, or allocating counters in domains where
> >>>>>>> they're not needed. I want to encourage my users to avoid allocating
> >>>>>>> monitoring resources in domains where a job is not allowed to run so
> >>>>>>> there's less pressure on the counters.
> >>>>>>>
> >>>>>>> In Dave's proposal it looks like global configuration means
> >>>>>>> globally-defined "named counter configurations", which works because
> >>>>>>> it's really per-domain assignment of the configurations to however
> >>>>>>> many counters the group needs in each domain.
> >>>>>>
> >>>>>> I think I am becoming lost. Would a global configuration not break your
> >>>>>> view of "event-set applied to a single counter"? If a counter is configured
> >>>>>> globally then it would not make it possible to support the full configurability
> >>>>>> of the hardware.
> >>>>>> Before I add more confusion, let me try with an example that builds on your
> >>>>>> earlier example copied below:
> >>>>>>
> >>>>>>>>> (per domain)
> >>>>>>>>> group 0:
> >>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> group 1:
> >>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> ...
> >>>>>>
> >>>>>> Since the above states "per domain" I rewrite the example to highlight that as
> >>>>>> I understand it:
> >>>>>>
> >>>>>> group 0:
> >>>>>>  domain 0:
> >>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>  domain 1:
> >>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>> group 1:
> >>>>>>  domain 0:
> >>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>  domain 1:
> >>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>
> >>>>>> You mention that you do not want counters to be allocated in domains that they
> >>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >>>>>> in domain 1, resulting in:
> >>>>>>
> >>>>>> group 0:
> >>>>>>  domain 0:
> >>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>> group 1:
> >>>>>>  domain 0:
> >>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>  domain 1:
> >>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>
> >>>>>> With counter 0 and counter 1 available in domain 1, these counters could
> >>>>>> theoretically be configured to give group 1 more data in domain 1:
> >>>>>>
> >>>>>> group 0:
> >>>>>>  domain 0:
> >>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>> group 1:
> >>>>>>  domain 0:
> >>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>  domain 1:
> >>>>>>   counter 0: LclFill,RmtFill
> >>>>>>   counter 1: LclNTWr,RmtNTWr
> >>>>>>   counter 2: LclSlowFill,RmtSlowFill
> >>>>>>   counter 3: VictimBW
> >>>>>>
> >>>>>> The counters are shown with different per-domain configurations that seems to
> >>>>>> match with earlier goals of (a) choose events counted by each counter and
> >>>>>> (b) do not allocate counters in domains where they are not needed. As I
> >>>>>> understand the above does contradict global counter configuration though.
> >>>>>> Or do you mean that only the *name* of the counter is global and then
> >>>>>> that it is reconfigured as part of every assignment?
> >>>>>
> >>>>> Yes, I meant only the *name* is global. I assume based on a particular
> >>>>> system configuration, the user will settle on a handful of useful
> >>>>> groupings to count.
> >>>>>
> >>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >>>>>
> >>>>>  # define global configurations (in ABMC terms), not necessarily in this
> >>>>>  # syntax and probably not in the mbm_assign_control file.
> >>>>>
> >>>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  w=VictimBW,LclNTWr,RmtNTWr
> >>>>>
> >>>>>  # legacy "total" configuration, effectively r+w
> >>>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>
> >>>>>  /group0/0=t;1=t
> >>>>>  /group1/0=t;1=t
> >>>>>  /group2/0=_;1=t
> >>>>>  /group3/0=rw;1=_
> >>>>>
> >>>>> - group2 is restricted to domain 0
> >>>>> - group3 is restricted to domain 1
> >>>>> - the rest are unrestricted
> >>>>> - In group3, we decided we need to separate read and write traffic
> >>>>>
> >>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >>>>>
> >>>>
> >>>> I see. Thank you for the example.
> >>>>
> >>>> resctrl supports per-domain configurations with the following possible when
> >>>> using mbm_total_bytes_config and mbm_local_bytes_config:
> >>>>
> >>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> >>>>
> >>>>    /group0/0=t;1=t
> >>>>    /group1/0=t;1=t
> >>>>
> >>>> Even though the flags are identical in all domains, the assigned counters will
> >>>> be configured differently in each domain.
> >>>>
> >>>> With this supported by hardware and currently also supported by resctrl it seems
> >>>> reasonable to carry this forward to what will be supported next.
> >>>
> >>> The hardware supports both a per-domain mode, where all groups in a
> >>> domain use the same configurations and are limited to two events per
> >>> group and a per-group mode where every group can be configured and
> >>> assigned freely. This series is using the legacy counter access mode
> >>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> >>> in the domain can be read. If we chose to read the assigned counter
> >>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> >>> rather than asking the hardware to find the counter by RMID, we would
> >>> not be limited to 2 counters per group/domain and the hardware would
> >>> have the same flexibility as on MPAM.
> >>
> >> In extended mode, the contents of a specific counter can be read by
> >> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> >> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> >> QM_CTR will then return the contents of the specified counter.
> >>
> >> It is documented below.
> >> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
> >>  Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
> >>
> >> We previously discussed this with you (off the public list) and I
> >> initially proposed the extended assignment mode.
> >>
> >> Yes, the extended mode allows greater flexibility by enabling multiple
> >> counters to be assigned to the same group, rather than being limited to
> >> just two.
> >>
> >> However, the challenge is that we currently lack the necessary interfaces
> >> to configure multiple events per group. Without these interfaces, the
> >> extended mode is not practical at this time.
> >>
> >> Therefore, we ultimately agreed to use the legacy mode, as it does not
> >> require modifications to the existing interface, allowing us to continue
> >> using it as is.
> >>
> >>>
> >>> (I might have said something confusing in my last messages because I
> >>> had forgotten that I switched to the extended assignment mode when
> >>> prototyping with soft-ABMC and MPAM.)
> >>>
> >>> Forcing all groups on a domain to share the same 2 counter
> >>> configurations would not be acceptable for us, as the example I gave
> >>> earlier is one I've already been asked about.
> >>
> >> I don’t see this as a blocker. It should be considered an extension to the
> >> current ABMC series. We can easily build on top of this series once we
> >> finalize how to configure the multiple event interface for each group.
> >
> > I don't think it is, either. Only being able to use ABMC to assign
> > counters is fine for our use as an incremental step. My longer-term
> > concern is the domain-scoped mbm_total_bytes_config and
> > mbm_local_bytes_config files, but they were introduced with BMEC, so
> > there's already an expectation that the files are present when BMEC is
> > supported.
> >
> > On ABMC hardware that also supports BMEC, I'm concerned about enabling
> > ABMC when only the BMEC-style event configuration interface exists.
> > The scope of my issue is just whether enabling "full" ABMC support
> > will require an additional opt-in, since that could remove the BMEC
> > interface. If it does, it's something we can live with.
>
> As you know, this series is currently blocked without further feedback.
>
> I’d like to begin reworking these patches to incorporate Peter’s feedback.
> Any input or suggestions would be appreciated.
>
> Here’s what we’ve learned so far:
>
> 1. Assignments should be independent of BMEC.
> 2. We should be able to specify multiple event types to a counter (e.g.,
> read, write, VictimBW, etc.). This is also called a shared counter.
> 3. There should be an option to assign events per domain.
> 4. Currently, only two counters can be assigned per group, but the design
> should allow flexibility to assign more in the future as the interface
> evolves.
> 5. Utilize the extended RMID read mode.
>
>
> Here is my proposal using Peter's earlier example:
>
> # define event configurations
>
> ====    ============    =====================================================
> Bits    Mnemonics       Description
> ====    ============    =====================================================
> 6       VictimBW        Dirty Victims from all types of memory
> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
> 2       LclNTWr         Non-temporal writes to local NUMA domain
> 1       RmtFill         Reads to memory in the non-local NUMA domain
> 0       LclFill         Reads to memory in the local NUMA domain
> ====    ============    =====================================================
>
> #Define flags based on combination of above event types.
>
> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> l = LclFill, LclNTWr, LclSlowFill
> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
> w = VictimBW,LclNTWr,RmtNTWr
> v = VictimBW
>
> Peter suggested the following format earlier :
>
> /group0/0=t;1=t
> /group1/0=t;1=t
> /group2/0=_;1=t
> /group3/0=rw;1=_

After some inquiries within Google, it sounds like nobody has invested
much into the current mbm_assign_control format yet, so it would be
best to drop it and distribute the configuration around the filesystem
hierarchy[1], which should allow us to produce something more flexible
and cleaner to implement.

Roughly what I had in mind:

Use mkdir in an info/<resource>_MON subdirectory to create free-form
names for the assignable configurations rather than being restricted
to single letters.  In the resulting directory, populate a file where
we can specify the set of events the config should represent. I think
we should use symbolic names for the events rather than raw BMEC field
values. Moving forward we could come up with portable names for common
events and only support the BMEC names on AMD machines for users who
want specific events and don't care about portability.

Next, put assignment-control file nodes in per-domain directories
(e.g., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
counter-configuration name into the file would then allocate a counter
in the domain, apply the named configuration, and monitor the parent
group-directory. We can also put a group/resource-scoped assign_* file
higher in the hierarchy to make it easier for users who want to
configure all domains the same for a group.

The configuration names listed in assign_* would result in files of
the same name in the appropriate mon_data domain directories from
which the count values can be read.

 # mkdir info/L3_MON/counter_configs/mbm_local_bytes
 # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
 # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
 # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
 # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill
LclNTWr
LclSlowFill
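
As a minimal sketch of how a named configuration could be turned into a
hardware event mask: the bit positions below are the ones listed in the
event table quoted earlier in this thread. Everything else (the Python
names EVENT_BITS, COUNTER_CONFIGS and config_mask, and the choice to make
mbm_total_bytes cover all seven events like the "t" flag) is an assumption
made for illustration, not part of resctrl.

```python
# Bit positions as listed in the event table quoted earlier in this thread.
EVENT_BITS = {
    "LclFill": 0,
    "RmtFill": 1,
    "LclNTWr": 2,
    "RmtNTWr": 3,
    "LclSlowFill": 4,
    "RmtSlowFill": 5,
    "VictimBW": 6,
}


def config_mask(event_names):
    """Return the bitmask for a list of symbolic event names."""
    mask = 0
    for name in event_names:
        if name not in EVENT_BITS:
            raise ValueError(f"unknown event: {name}")
        mask |= 1 << EVENT_BITS[name]
    return mask


# Hypothetical stand-in for counter_configs/<name>/event_filter contents.
COUNTER_CONFIGS = {
    "mbm_local_bytes": ["LclFill", "LclNTWr", "LclSlowFill"],
    # Assumption: "total" covers every event, like the "t" flag in the thread.
    "mbm_total_bytes": list(EVENT_BITS),
}

LOCAL_MASK = config_mask(COUNTER_CONFIGS["mbm_local_bytes"])
TOTAL_MASK = config_mask(COUNTER_CONFIGS["mbm_total_bytes"])
print(hex(LOCAL_MASK), hex(TOTAL_MASK))  # prints: 0x15 0x7f
```

Rejecting unknown names up front is one way the portable-name idea could
fail cleanly on hardware that lacks a given event.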

Note that we could also pre-populate info/L3_MON/counter_configs with
the expected configuration for mbm_local_bytes and mbm_total_bytes for
backwards compatibility.

To manually allocate counters for "mbm_local_bytes":

 # mkdir test
 # echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
 # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive
 # echo mbm_local_bytes > test/mon_data/mon_L3_02/assign_exclusive
[..]

Which would result in the creation of test/mon_data/mon_L3_*/mbm_local_bytes

For unassignment, we can just make an "unassign" node alongside
"assign_exclusive" and "assign_shared". These should provide enough
context to form resctrl_arch_config_cntr() calls.
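
To make the counter accounting concrete, here is a toy model of the
assign/unassign flow, written under stated assumptions: each domain owns
a fixed pool of counters, and assigning a configuration to a group in a
domain takes one counter from that pool. The class DomainCounters, its
methods, and the pool size of 32 are hypothetical. The exclusive/shared
distinction is not modeled, and this is not how resctrl or
resctrl_arch_config_cntr() is implemented.

```python
class DomainCounters:
    """Toy model of one monitoring domain's pool of assignable counters."""

    def __init__(self, num_counters):
        self.free = list(range(num_counters))
        self.assigned = {}  # (group, config) -> counter id

    def assign(self, group, config):
        """Allocate a counter for (group, config); idempotent if already assigned."""
        key = (group, config)
        if key in self.assigned:
            return self.assigned[key]
        if not self.free:
            raise RuntimeError("no free counters in this domain")
        cntr = self.free.pop(0)
        self.assigned[key] = cntr
        return cntr

    def unassign(self, group, config):
        """Release the counter held by (group, config)."""
        cntr = self.assigned.pop((group, config))
        self.free.append(cntr)
        self.free.sort()

    def in_use(self):
        return len(self.assigned)


# 32 counters per domain is an arbitrary number chosen for the example.
domains = {0: DomainCounters(32), 1: DomainCounters(32)}

# The assignments from the example quoted earlier in the thread.
example = {
    "group0": {0: ["t"], 1: ["t"]},
    "group1": {0: ["t"], 1: ["t"]},
    "group2": {0: [], 1: ["t"]},
    "group3": {0: ["r", "w"], 1: []},
}
for group, per_domain in example.items():
    for dom_id, configs in per_domain.items():
        for config in configs:
            domains[dom_id].assign(group, config)

print(domains[0].in_use(), domains[1].in_use())  # prints: 4 3
```

This reproduces the "4 counters in domain 0 and 3 counters in domain 1"
figure from the earlier example.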

-Peter

[1] https://lore.kernel.org/lkml/CALPaoCj1TH+GN6+dFnt5xuN406u=tB-8mj+UuMRSm5KWPJW2wg@mail.gmail.com/


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-04 16:44                                     ` Peter Newman
@ 2025-03-04 21:49                                       ` Moger, Babu
  2025-03-05 10:40                                         ` Peter Newman
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-03-04 21:49 UTC (permalink / raw)
  To: Peter Newman
  Cc: Reinette Chatre, Moger, Babu, Dave Martin, corbet, tglx, mingo,
	bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 3/4/25 10:44, Peter Newman wrote:
> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>
>> Hi Peter/Reinette,
>>
>> On 2/26/25 07:27, Peter Newman wrote:
>>> Hi Babu,
>>>
>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>> Hi Reinette,
>>>>>
>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>> for.
>>>>>>>>>>>>
>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>
>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>> customers.
>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>
>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>> event names.
>>>>>>>>>>
>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>
>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>
>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>
>>>>>>>>>>> (per domain)
>>>>>>>>>>> group 0:
>>>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> group 1:
>>>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>> configuration is a requirement?
>>>>>>>>>
>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>
>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>
>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>> of the hardware.
>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>> earlier example copied below:
>>>>>>>>
>>>>>>>>>>> (per domain)
>>>>>>>>>>> group 0:
>>>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> group 1:
>>>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> ...
>>>>>>>>
>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>> I understand it:
>>>>>>>>
>>>>>>>> group 0:
>>>>>>>>  domain 0:
>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>  domain 1:
>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> group 1:
>>>>>>>>  domain 0:
>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>  domain 1:
>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>
>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>> in domain 1, resulting in:
>>>>>>>>
>>>>>>>> group 0:
>>>>>>>>  domain 0:
>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> group 1:
>>>>>>>>  domain 0:
>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>  domain 1:
>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>
>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>
>>>>>>>> group 0:
>>>>>>>>  domain 0:
>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> group 1:
>>>>>>>>  domain 0:
>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>  domain 1:
>>>>>>>>   counter 0: LclFill,RmtFill
>>>>>>>>   counter 1: LclNTWr,RmtNTWr
>>>>>>>>   counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>   counter 3: VictimBW
>>>>>>>>
>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>
>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>> groupings to count.
>>>>>>>
>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>
>>>>>>>  # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>  # syntax and probably not in the mbm_assign_control file.
>>>>>>>
>>>>>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>  w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>>  # legacy "total" configuration, effectively r+w
>>>>>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>>  /group0/0=t;1=t
>>>>>>>  /group1/0=t;1=t
>>>>>>>  /group2/0=_;1=t
>>>>>>>  /group3/0=rw;1=_
>>>>>>>
>>>>>>> - group2 is restricted to domain 0
>>>>>>> - group3 is restricted to domain 1
>>>>>>> - the rest are unrestricted
>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>
>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>
>>>>>>
>>>>>> I see. Thank you for the example.
>>>>>>
>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>
>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>>    /group0/0=t;1=t
>>>>>>    /group1/0=t;1=t
>>>>>>
>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>> be configured differently in each domain.
>>>>>>
>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>
>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>> domain use the same configurations and are limited to two events per
>>>>> group and a per-group mode where every group can be configured and
>>>>> assigned freely. This series is using the legacy counter access mode
>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>> have the same flexibility as on MPAM.
>>>>
>>>> In extended mode, the contents of a specific counter can be read by
>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>> QM_CTR will then return the contents of the specified counter.
>>>>
>>>> It is documented below.
>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>  Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>
>>>> We previously discussed this with you (off the public list) and I
>>>> initially proposed the extended assignment mode.
>>>>
>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>> counters to be assigned to the same group, rather than being limited to
>>>> just two.
>>>>
>>>> However, the challenge is that we currently lack the necessary interfaces
>>>> to configure multiple events per group. Without these interfaces, the
>>>> extended mode is not practical at this time.
>>>>
>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>> require modifications to the existing interface, allowing us to continue
>>>> using it as is.
>>>>
>>>>>
>>>>> (I might have said something confusing in my last messages because I
>>>>> had forgotten that I switched to the extended assignment mode when
>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>
>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>> configurations would not be acceptable for us, as the example I gave
>>>>> earlier is one I've already been asked about.
>>>>
>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>> current ABMC series. We can easily build on top of this series once we
>>>> finalize how to configure the multiple event interface for each group.
>>>
>>> I don't think it is, either. Only being able to use ABMC to assign
>>> counters is fine for our use as an incremental step. My longer-term
>>> concern is the domain-scoped mbm_total_bytes_config and
>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>> there's already an expectation that the files are present when BMEC is
>>> supported.
>>>
>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>> ABMC when only the BMEC-style event configuration interface exists.
>>> The scope of my issue is just whether enabling "full" ABMC support
>>> will require an additional opt-in, since that could remove the BMEC
>>> interface. If it does, it's something we can live with.
>>
>> As you know, this series is currently blocked without further feedback.
>>
>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>> Any input or suggestions would be appreciated.
>>
>> Here’s what we’ve learned so far:
>>
>> 1. Assignments should be independent of BMEC.
>> 2. We should be able to specify multiple event types to a counter (e.g.,
>> read, write, VictimBW, etc.). This is also called a shared counter.
>> 3. There should be an option to assign events per domain.
>> 4. Currently, only two counters can be assigned per group, but the design
>> should allow flexibility to assign more in the future as the interface
>> evolves.
>> 5. Utilize the extended RMID read mode.
>>
>>
>> Here is my proposal using Peter's earlier example:
>>
>> # define event configurations
>>
>> ====    ============    =====================================================
>> Bits    Mnemonics       Description
>> ====    ============    =====================================================
>> 6       VictimBW        Dirty Victims from all types of memory
>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>> 0       LclFill         Reads to memory in the local NUMA domain
>> ====    ============    =====================================================
>>
>> #Define flags based on combination of above event types.
>>
>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>> l = LclFill, LclNTWr, LclSlowFill
>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>> w = VictimBW,LclNTWr,RmtNTWr
>> v = VictimBW
>>
>> Peter suggested the following format earlier :
>>
>> /group0/0=t;1=t
>> /group1/0=t;1=t
>> /group2/0=_;1=t
>> /group3/0=rw;1=_
> 
> After some inquiries within Google, it sounds like nobody has invested
> much into the current mbm_assign_control format yet, so it would be
> best to drop it and distribute the configuration around the filesystem
> hierarchy[1], which should allow us to produce something more flexible
> and cleaner to implement.
> 
> Roughly what I had in mind:
> 
> Use mkdir in an info/<resource>_MON subdirectory to create free-form
> names for the assignable configurations rather than being restricted
> to single letters.  In the resulting directory, populate a file where
> we can specify the set of events the config should represent. I think
> we should use symbolic names for the events rather than raw BMEC field
> values. Moving forward we could come up with portable names for common
> events and only support the BMEC names on AMD machines for users who
> want specific events and don't care about portability.


I’m still processing this. Let me start with some initial questions.

So, we are creating event configurations here, which seems reasonable.

Yes, we should use portable names rather than being limited to BMEC names.

How many configurations should we allow? Do we know?

> 
> Next, put assignment-control file nodes in per-domain directories
> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
> counter-configuration name into the file would then allocate a counter
> in the domain, apply the named configuration, and monitor the parent
> group-directory. We can also put a group/resource-scoped assign_* file
> higher in the hierarchy to make it easier for users who want to
> configure all domains the same for a group.

What is the difference between shared and exclusive?

Having three files (assign_shared, assign_exclusive, and unassign) for each
domain seems excessive. In a system with 32 groups and 12 domains, this
results in 32 × 12 × 3 = 1152 files, which is quite large.

There should be a more efficient way to handle this.

Initially, we started with a group-level file for this interface, but it
was rejected because of the high number of sysfs calls it required, which
made it inefficient.

Additionally, how can we list all assignments with a single sysfs call?

That is another problem we need to address.
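
To illustrate the single-read listing requirement, here is a rough
sketch assuming a one-line-per-group text format modeled on the
"/group0/0=t;1=t" example quoted earlier in the thread. The helpers
format_assignments and parse_assignments, the single-letter flags, and
"_" for "nothing assigned" are assumptions made for illustration only.

```python
def format_assignments(assignments):
    """Render {group: {domain_id: [flags]}} as one line per group."""
    lines = []
    for group in sorted(assignments):
        per_domain = assignments[group]
        fields = []
        for dom_id in sorted(per_domain):
            # "_" marks a domain with no counters assigned.
            flags = "".join(per_domain[dom_id]) or "_"
            fields.append(f"{dom_id}={flags}")
        lines.append(f"/{group}/" + ";".join(fields))
    return "\n".join(lines)


def parse_assignments(text):
    """Inverse of format_assignments(); assumes single-letter flags."""
    assignments = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        _, group, rest = line.split("/", 2)
        per_domain = {}
        for field in rest.split(";"):
            dom_id, flags = field.split("=")
            per_domain[int(dom_id)] = [] if flags == "_" else list(flags)
        assignments[group] = per_domain
    return assignments


state = {
    "group0": {0: ["t"], 1: ["t"]},
    "group1": {0: ["t"], 1: ["t"]},
    "group2": {0: [], 1: ["t"]},
    "group3": {0: ["r", "w"], 1: []},
}
listing = format_assignments(state)
print(listing)
```

With a format like this, one read returns every group's assignments, and
the same text can be written back to restore them.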


> 
> The configuration names listed in assign_* would result in files of
> the same name in the appropriate mon_data domain directories from
> which the count values can be read.
> 
>  # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>  # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>  # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>  # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>  # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclFill
> LclNTWr
> LclSlowFill

I feel we can just have the config files themselves. A separate
event_filter file is not required.

# cat info/L3_MON/counter_configs/mbm_local_bytes
LclFill        <- rename these to generic names
LclNTWr
LclSlowFill


> 
> Note that we could also pre-populate info/L3_MON/counter_configs with
> the expected configuration for mbm_local_bytes and mbm_total_bytes for
> backwards compatibility.
> 
> To manually allocate counters for "mbm_local_bytes":
> 
>  # mkdir test
>  # echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>  # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive
>  # echo mbm_local_bytes > test/mon_data/mon_L3_02/assign_exclusive
> [..]
> 
> Which would result in the creation of test/mon_data/mon_L3_*/mbm_local_bytes
> 
> For unassignment, we can just make an "unassign" node alongside
> "assign_exclusive" and "assign_shared". These should provide enough
> context to form resctrl_arch_config_cntr() calls.
> 
> -Peter
> 
> [1] https://lore.kernel.org/lkml/CALPaoCj1TH+GN6+dFnt5xuN406u=tB-8mj+UuMRSm5KWPJW2wg@mail.gmail.com/
> 

Let's keep discussing.
-- 
Thanks
Babu Moger


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-04 21:49                                       ` Moger, Babu
@ 2025-03-05 10:40                                         ` Peter Newman
  2025-03-05 19:34                                           ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Peter Newman @ 2025-03-05 10:40 UTC (permalink / raw)
  To: babu.moger
  Cc: Reinette Chatre, Moger, Babu, Dave Martin, corbet, tglx, mingo,
	bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>
> Hi Peter,
>
> On 3/4/25 10:44, Peter Newman wrote:
> > On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>
> >> Hi Peter/Reinette,
> >>
> >> On 2/26/25 07:27, Peter Newman wrote:
> >>> Hi Babu,
> >>>
> >>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> On 2/25/25 11:11, Peter Newman wrote:
> >>>>> Hi Reinette,
> >>>>>
> >>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> >>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>
> >>>>>> Hi Peter,
> >>>>>>
> >>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
> >>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> >>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>>>>>>>> <value>
> >>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>>>>>>>> is low enough to be of concern.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>>>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>>>>>>>> for.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>>>>>>>> customers.
> >>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>>>>>>>
> >>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>>>>>>>> event-set for applying to a single counter rather than as individual
> >>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>>>>>>>> event names.
> >>>>>>>>>>
> >>>>>>>>>> Thank you for clarifying.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> In the letters as events model, choosing the events assigned to a
> >>>>>>>>>>> group wouldn't be enough information, since we would want to control
> >>>>>>>>>>> which events should share a counter and which should be counted by
> >>>>>>>>>>> separate counters. I think the amount of information that would need
> >>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>>>>>>>
> >>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>>>>>>>> writes in ABMC would look like...
> >>>>>>>>>>>
> >>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>>>>>>>
> >>>>>>>>>>> (per domain)
> >>>>>>>>>>> group 0:
> >>>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>> group 1:
> >>>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>>>>>>>> example and above the counter configuration appears to be global. You do mention
> >>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>>>>>>>> configuration is a requirement?
> >>>>>>>>>
> >>>>>>>>> If it's global and we want a particular group to be watched by more
> >>>>>>>>> counters, I wouldn't want this to result in allocating more counters
> >>>>>>>>> for that group in all domains, or allocating counters in domains where
> >>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
> >>>>>>>>> monitoring resources in domains where a job is not allowed to run so
> >>>>>>>>> there's less pressure on the counters.
> >>>>>>>>>
> >>>>>>>>> In Dave's proposal it looks like global configuration means
> >>>>>>>>> globally-defined "named counter configurations", which works because
> >>>>>>>>> it's really per-domain assignment of the configurations to however
> >>>>>>>>> many counters the group needs in each domain.
> >>>>>>>>
> >>>>>>>> I think I am becoming lost. Would a global configuration not break your
> >>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
> >>>>>>>> globally then it would not make it possible to support the full configurability
> >>>>>>>> of the hardware.
> >>>>>>>> Before I add more confusion, let me try with an example that builds on your
> >>>>>>>> earlier example copied below:
> >>>>>>>>
> >>>>>>>>>>> (per domain)
> >>>>>>>>>>> group 0:
> >>>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>> group 1:
> >>>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>> ...
> >>>>>>>>
> >>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
> >>>>>>>> I understand it:
> >>>>>>>>
> >>>>>>>> group 0:
> >>>>>>>>  domain 0:
> >>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>  domain 1:
> >>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> group 1:
> >>>>>>>>  domain 0:
> >>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>  domain 1:
> >>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>
> >>>>>>>> You mention that you do not want counters to be allocated in domains that they
> >>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >>>>>>>> in domain 1, resulting in:
> >>>>>>>>
> >>>>>>>> group 0:
> >>>>>>>>  domain 0:
> >>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> group 1:
> >>>>>>>>  domain 0:
> >>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>  domain 1:
> >>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>
> >>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
> >>>>>>>> theoretically be configured to give group 1 more data in domain 1:
> >>>>>>>>
> >>>>>>>> group 0:
> >>>>>>>>  domain 0:
> >>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> group 1:
> >>>>>>>>  domain 0:
> >>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>  domain 1:
> >>>>>>>>   counter 0: LclFill,RmtFill
> >>>>>>>>   counter 1: LclNTWr,RmtNTWr
> >>>>>>>>   counter 2: LclSlowFill,RmtSlowFill
> >>>>>>>>   counter 3: VictimBW
> >>>>>>>>
> >>>>>>>> The counters are shown with different per-domain configurations that seems to
> >>>>>>>> match with earlier goals of (a) choose events counted by each counter and
> >>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
> >>>>>>>> understand the above does contradict global counter configuration though.
> >>>>>>>> Or do you mean that only the *name* of the counter is global and then
> >>>>>>>> that it is reconfigured as part of every assignment?
> >>>>>>>
> >>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
> >>>>>>> system configuration, the user will settle on a handful of useful
> >>>>>>> groupings to count.
> >>>>>>>
> >>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >>>>>>>
> >>>>>>>  # define global configurations (in ABMC terms), not necessarily in this
> >>>>>>>  # syntax and probably not in the mbm_assign_control file.
> >>>>>>>
> >>>>>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>  w=VictimBW,LclNTWr,RmtNTWr
> >>>>>>>
> >>>>>>>  # legacy "total" configuration, effectively r+w
> >>>>>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>
> >>>>>>>  /group0/0=t;1=t
> >>>>>>>  /group1/0=t;1=t
> >>>>>>>  /group2/0=_;1=t
> >>>>>>>  /group3/0=rw;1=_
> >>>>>>>
> >>>>>>> - group2 is restricted to domain 1
> >>>>>>> - group3 is restricted to domain 0
> >>>>>>> - the rest are unrestricted
> >>>>>>> - In group3, we decided we need to separate read and write traffic
> >>>>>>>
> >>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >>>>>>>
> >>>>>>
> >>>>>> I see. Thank you for the example.
> >>>>>>
> >>>>>> resctrl supports per-domain configurations with the following possible when
> >>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
> >>>>>>
> >>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>
> >>>>>>    /group0/0=t;1=t
> >>>>>>    /group1/0=t;1=t
> >>>>>>
> >>>>>> Even though the flags are identical in all domains, the assigned counters will
> >>>>>> be configured differently in each domain.
> >>>>>>
> >>>>>> With this supported by hardware and currently also supported by resctrl it seems
> >>>>>> reasonable to carry this forward to what will be supported next.
> >>>>>
> >>>>> The hardware supports both a per-domain mode, where all groups in a
> >>>>> domain use the same configurations and are limited to two events per
> >>>>> group and a per-group mode where every group can be configured and
> >>>>> assigned freely. This series is using the legacy counter access mode
> >>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> >>>>> in the domain can be read. If we chose to read the assigned counter
> >>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> >>>>> rather than asking the hardware to find the counter by RMID, we would
> >>>>> not be limited to 2 counters per group/domain and the hardware would
> >>>>> have the same flexibility as on MPAM.
> >>>>
> >>>> In extended mode, the contents of a specific counter can be read by
> >>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> >>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> >>>> QM_CTR will then return the contents of the specified counter.
> >>>>
> >>>> It is documented below.
> >>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
> >>>>  Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
> >>>>
> >>>> We previously discussed this with you (off the public list) and I
> >>>> initially proposed the extended assignment mode.
> >>>>
> >>>> Yes, the extended mode allows greater flexibility by enabling multiple
> >>>> counters to be assigned to the same group, rather than being limited to
> >>>> just two.
> >>>>
> >>>> However, the challenge is that we currently lack the necessary interfaces
> >>>> to configure multiple events per group. Without these interfaces, the
> >>>> extended mode is not practical at this time.
> >>>>
> >>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
> >>>> require modifications to the existing interface, allowing us to continue
> >>>> using it as is.
> >>>>
> >>>>>
> >>>>> (I might have said something confusing in my last messages because I
> >>>>> had forgotten that I switched to the extended assignment mode when
> >>>>> prototyping with soft-ABMC and MPAM.)
> >>>>>
> >>>>> Forcing all groups on a domain to share the same 2 counter
> >>>>> configurations would not be acceptable for us, as the example I gave
> >>>>> earlier is one I've already been asked about.
> >>>>
> >>>> I don’t see this as a blocker. It should be considered an extension to the
> >>>> current ABMC series. We can easily build on top of this series once we
> >>>> finalize how to configure the multiple event interface for each group.
> >>>
> >>> I don't think it is, either. Only being able to use ABMC to assign
> >>> counters is fine for our use as an incremental step. My longer-term
> >>> concern is the domain-scoped mbm_total_bytes_config and
> >>> mbm_local_bytes_config files, but they were introduced with BMEC, so
> >>> there's already an expectation that the files are present when BMEC is
> >>> supported.
> >>>
> >>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
> >>> ABMC when only the BMEC-style event configuration interface exists.
> >>> The scope of my issue is just whether enabling "full" ABMC support
> >>> will require an additional opt-in, since that could remove the BMEC
> >>> interface. If it does, it's something we can live with.
> >>
> >> As you know, this series is currently blocked without further feedback.
> >>
> >> I’d like to begin reworking these patches to incorporate Peter’s feedback.
> >> Any input or suggestions would be appreciated.
> >>
> >> Here’s what we’ve learned so far:
> >>
> >> 1. Assignments should be independent of BMEC.
> >> 2. We should be able to specify multiple event types to a counter (e.g.,
> >> read, write, VictimBW, etc.). This is also called a shared counter.
> >> 3. There should be an option to assign events per domain.
> >> 4. Currently, only two counters can be assigned per group, but the design
> >> should allow flexibility to assign more in the future as the interface
> >> evolves.
> >> 5. Utilize the extended RMID read mode.
> >>
> >>
> >> Here is my proposal using Peter's earlier example:
> >>
> >> # define event configurations
> >>
> >> ====    ===========     ========================================================
> >> Bits    Mnemonics       Description
> >> ====    ===========     ========================================================
> >> 6       VictimBW        Dirty Victims from all types of memory
> >> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
> >> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
> >> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
> >> 2       LclNTWr         Non-temporal writes to local NUMA domain
> >> 1       RmtFill         Reads to memory in the non-local NUMA domain
> >> 0       LclFill         Reads to memory in the local NUMA domain
> >> ====    ===========     ========================================================
> >>
> >> #Define flags based on combination of above event types.
> >>
> >> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >> l = LclFill, LclNTWr, LclSlowFill
> >> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> w = VictimBW,LclNTWr,RmtNTWr
> >> v = VictimBW
> >>
> >> Peter suggested the following format earlier :
> >>
> >> /group0/0=t;1=t
> >> /group1/0=t;1=t
> >> /group2/0=_;1=t
> >> /group3/0=rw;1=_
> >
> > After some inquiries within Google, it sounds like nobody has invested
> > much into the current mbm_assign_control format yet, so it would be
> > best to drop it and distribute the configuration around the filesystem
> > hierarchy[1], which should allow us to produce something more flexible
> > and cleaner to implement.
> >
> > Roughly what I had in mind:
> >
> > Use mkdir in a info/<resource>_MON subdirectory to create free-form
> > names for the assignable configurations rather than being restricted
> > to single letters.  In the resulting directory, populate a file where
> > we can specify the set of events the config should represent. I think
> > we should use symbolic names for the events rather than raw BMEC field
> > values. Moving forward we could come up with portable names for common
> > events and only support the BMEC names on AMD machines for users who
> > want specific events and don't care about portability.
>
>
> I’m still processing this. Let me start with some initial questions.
>
> So, we are creating event configurations here, which seems reasonable.
>
> Yes, we should use portable names and are not limited to BMEC names.
>
> How many configurations should we allow? Do we know?

Do we need an upper limit?

>
> >
> > Next, put assignment-control file nodes in per-domain directories
> > (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
> > counter-configuration name into the file would then allocate a counter
> > in the domain, apply the named configuration, and monitor the parent
> > group-directory. We can also put a group/resource-scoped assign_* file
> > higher in the hierarchy to make it easier for users who want to
> > configure all domains the same for a group.
>
> What is the difference between shared and exclusive?

Shared assignment[1] means that non-exclusively-assigned counters in
each domain will be scheduled round-robin to the groups requesting
shared access to a counter. In my tests, I assigned the counters long
enough to produce a single 1-second MB/s sample for the per-domain
aggregation files[2].

These do not need to be implemented immediately, but knowing that they
work addresses the overhead and scalability concerns of reassigning
counters and reading their values.

>
> Having three files—assign_shared, assign_exclusive, and unassign—for each
> domain seems excessive. In a system with 32 groups and 12 domains, this
> results in 32 × 12 × 3 files, which is quite large.
>
> There should be a more efficient way to handle this.
>
> Initially, we started with a group-level file for this interface, but it
> was rejected due to the high number of sysfs calls, making it inefficient.

I had rejected it due to the high frequency of accesses to a large
number of files, which has since been addressed by shared assignment
(or automatic reassignment) and aggregated mbps files.

>
> Additionally, how can we list all assignments with a single sysfs call?
>
> That was another problem we need to address.

This is not a requirement I was aware of. If the user forgot where
they assigned counters (or forgot to disable auto-assignment), they
can read multiple sysfs nodes to remind themselves.

>
>
> >
> > The configuration names listed in assign_* would result in files of
> > the same name in the appropriate mon_data domain directories from
> > which the count values can be read.
> >
> >  # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> >  # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >  # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >  # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >  # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > LclFill
> > LclNTWr
> > LclSlowFill
>
> I feel we can just have the configs. event_filter file is not required.

That's right, I forgot that we can implement kernfs_ops::open(). I was
only looking at struct kernfs_syscall_ops.

>
> #cat info/L3_MON/counter_configs/mbm_local_bytes
> LclFill <-rename these to generic names.
> LclNTWr
> LclSlowFill
>

I think portable and non-portable event names should both be available
as options. There are simple bandwidth measurement mechanisms that
will be applied in general, but when they turn up an issue, it can
often lead to a more focused investigation, requiring more precise
events.
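
As a rough sketch of what the non-portable (ABMC-specific) names would
resolve to, the snippet below maps a comma-separated list of the
mnemonics from the event table quoted above onto a bitmask, using the
bit positions from that table. The struct, the table name and
bw_mask_from_names() are made up for illustration and are not part of
the series; the parser also assumes there is no whitespace in the list.

```c
#include <string.h>

/* Bit positions taken from the event table quoted earlier in this thread. */
struct bw_event {
	const char *name;
	unsigned int bit;
};

static const struct bw_event bw_events[] = {
	{ "LclFill",     0 },
	{ "RmtFill",     1 },
	{ "LclNTWr",     2 },
	{ "RmtNTWr",     3 },
	{ "LclSlowFill", 4 },
	{ "RmtSlowFill", 5 },
	{ "VictimBW",    6 },
};

/*
 * Hypothetical helper: convert "LclFill,RmtFill,..." into a bitmask.
 * Returns -1 if any name in the list is not recognised.
 */
static int bw_mask_from_names(const char *list)
{
	size_t n = sizeof(bw_events) / sizeof(bw_events[0]);
	int mask = 0;

	while (*list) {
		size_t len = strcspn(list, ",");
		size_t i;

		for (i = 0; i < n; i++) {
			if (strlen(bw_events[i].name) == len &&
			    !strncmp(bw_events[i].name, list, len)) {
				mask |= 1 << bw_events[i].bit;
				break;
			}
		}
		if (i == n)
			return -1;	/* unknown event name */

		list += len;
		if (*list == ',')
			list++;
	}

	return mask;
}
```

With these bit positions the "r" set from the earlier example comes out
as 0x33 and the "w" set as 0x4c; their union is 0x7f, which is the "t"
set.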

-Peter

* Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
  2025-03-03 19:30                     ` Luck, Tony
@ 2025-03-05 18:06                       ` Dave Martin
  0 siblings, 0 replies; 209+ messages in thread
From: Dave Martin @ 2025-03-05 18:06 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Chatre, Reinette, Moger, Babu, corbet@lwn.net, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	peternewman@google.com, x86@kernel.org, hpa@zytor.com,
	paulmck@kernel.org, akpm@linux-foundation.org, thuth@redhat.com,
	rostedt@goodmis.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	jpoimboe@kernel.org, perry.yuan@amd.com, sandipan.das@amd.com,
	Huang, Kai, Li, Xiaoyao, seanjc@google.com, Li, Xin3,
	andrew.cooper3@citrix.com, ebiggers@google.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Wieczor-Retman, Maciej,
	Eranian, Stephane

On Mon, Mar 03, 2025 at 07:30:48PM +0000, Luck, Tony wrote:
> > After having spent a bit of time looking into this, I think we are probably
> > OK, at least for reading these files.
> >
> > seq_file will loop over the file's show() callback, growing the seq_file
> > buffer until show() can run without overrunning the buffer.
> >
> > This means that the show() callback receives a buffer that is magically big
> > enough, but there may be some "speculative" calls whose output never goes
> > to userspace.  Once seq_file has the data, it deals with the userspace-
> > facing I/O buffering internally, so we shouldn't have to worry about that.
> 
> Doesn't this depend on the size of the user read(2) syscall request?

Yes and no.

If I've understood correctly:

To service a given read() call, seq_file calls down into the backend to
generate some whole record, then copies it out to userspace, then
repeats this process so long as there is any space left in the user
buffer.

For resctrl files, we don't implement a seq_file iterator: there is no
.next(), no .llseek(), and we don't implement any notion of file
position.  So our _show() functions generate a single big record that
contains the whole dump -- frequently multiple lines of text.

(This might or might not be desirable, but it is at least simple.)

If a _show() function in resctrl holds rdtgroup_mutex throughout, then
whatever it dumps will be dumped atomically with respect to other
resctrl operations that take this mutex.

So, to flesh out your scenario:

> 
> If the total size of the resctrl file is very large, we have a potential issue:

Let's say it's 5KB.

> 1) User asks for 4KB, owns the resctrl mutex.

(Note, rdtgroup_mutex is only held temporarily inside the resctrlfs
backend to these operations; at the start of the process, it is not
held.)

> 2) resctrl uses seq_file and fills with more than 4KB

(It's actually seq_file that uses resctrl here via callbacks: seq_file
sits in between the VFS layer and resctrl.)

When a .show() callback is called, resctrl doesn't know how much data
to generate; it just writes stuff out with seq_printf() etc.

If there's too much to fit in the default seq_file buffer, the data
gets truncated and the seq_file will get internally marked as having
overflowed.  resctrl could check for this condition in order to avoid
formatting text that will get thrown away due to truncation, but this
is not required.  When the .show() callback returns, the seq_file
implementation will respond to the overflow by growing the buffer and
retrying the whole thing until this doesn't occur (see the loop
preceding the "Fill" label in seq_file.c:seq_read_iter().)

This terminates with a seq_file buffer that contains all the output
(untruncated), or with an -ENOMEM failure (which would be punted to
userspace).
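
To illustrate the grow-and-retry behaviour described above, here is a
small userspace model. It is a sketch of my understanding rather than
the kernel code: struct toy_seq, toy_printf(), toy_show() and
toy_fill() are all invented stand-ins, and the real logic lives in
seq_read_iter().

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for struct seq_file: a buffer plus an overflow flag. */
struct toy_seq {
	char *buf;
	size_t size;	/* capacity of buf */
	size_t count;	/* bytes used so far */
	int overflow;	/* set once output no longer fits */
};

/* Append formatted text; on truncation, mark the buffer as overflowed. */
static void toy_printf(struct toy_seq *m, const char *fmt, ...)
{
	va_list ap;
	int len;

	if (m->overflow)
		return;

	va_start(ap, fmt);
	len = vsnprintf(m->buf + m->count, m->size - m->count, fmt, ap);
	va_end(ap);

	if (len < 0 || (size_t)len >= m->size - m->count) {
		m->overflow = 1;
		m->count = m->size;
		return;
	}
	m->count += len;
}

/* Hypothetical show() callback: the whole dump is one record. */
static int toy_show(struct toy_seq *m, void *data)
{
	int ngroups = *(int *)data;
	int i;

	for (i = 0; i < ngroups; i++)
		toy_printf(m, "/group%d/0=t;1=t\n", i);
	return 0;
}

/*
 * Call show() with a growing buffer until the record fits.
 * Returns the number of show() calls made, or -1 if allocation fails.
 */
static int toy_fill(struct toy_seq *m, int (*show)(struct toy_seq *, void *),
		    void *data, size_t initial)
{
	int calls = 0;

	m->size = initial;
	m->buf = malloc(m->size);
	if (!m->buf)
		return -1;

	for (;;) {
		m->count = 0;
		m->overflow = 0;
		show(m, data);
		calls++;
		if (!m->overflow)
			return calls;

		/* Truncated: throw the output away, double the buffer, retry. */
		free(m->buf);
		m->size *= 2;
		m->buf = malloc(m->size);
		if (!m->buf)
			return -1;
	}
}
```

With 10 groups (160 bytes of output) and a 64-byte initial buffer,
toy_show() runs three times: the first two passes are the
"speculative" calls whose output is discarded, and only the third
pass, with a 256-byte buffer, produces data that would reach userspace.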

So, assuming nothing went wrong, the seq_file buffer now has the 5KB of
data.  rdtgroup_mutex is not held (it was only held in the _show()
callback).

> 3) User gets the first 4KB, releases the resctrl mutex

Userspace gets the first 4KB, and seq_file's notion of the file
position is advanced by this amount, and the generated text is
kept in the seq_file's buffer.

> 4) Some other pending resctrl operation now gets the mutex and makes changes that affect the contents of this file

The un-read data remains buffered in seq_file.  Other resctrl
operations can happen, so the buffered data may become stale, but it is
still an atomic snapshot.

> 5) User asks for next 4K (when it reacquires resctrl mutex)

If an iterator is implemented, seq_file might try to generate another
record to fill the requested space.  But we don't have an iterator, so
the generated data remains as-is.

> 6) resctrl uses seq_file() to construct new image of file incorporating changes because of step 4

I think this happens only if the file is reopened or lseek()'d, and
only if .llseek() is wired up in struct file_operations.  Resctrl
doesn't seem to do this (whether by accident or by design).

So userspace just sees a non-seekable file.

> 7) User gets the second 4KB from the seq_file buffer (which doesn't fit cleanly next to data it got in step 3).

Userspace gets the final 1K of the data that was generated in response
to the original read() call.

If userspace tries to read again, it will get EOF (again, because we
don't have an iterator -- meaning that no additional records can be
generated).
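
The read sequence above can be modelled in a few lines. Again, this is
only a sketch of the behaviour as I understand it: struct toy_file and
toy_read() are hypothetical names, and the point is simply that reads
drain one previously generated snapshot without regenerating anything.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of an open file whose single record was already generated. */
struct toy_file {
	const char *snapshot;	/* text produced by the one show() call */
	size_t len;		/* total length of the snapshot */
	size_t pos;		/* how much userspace has consumed so far */
};

/*
 * Copy out up to 'want' bytes of what is left of the snapshot.
 * Nothing is regenerated, so a return value of 0 means EOF.
 */
static size_t toy_read(struct toy_file *f, char *dst, size_t want)
{
	size_t left = f->len - f->pos;
	size_t n = want < left ? want : left;

	memcpy(dst, f->snapshot + f->pos, n);
	f->pos += n;
	return n;
}
```

For a 5KB snapshot read in 4KB chunks, the successive calls return
4096, 1024 and then 0, regardless of what other resctrl operations ran
in between.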

I haven't traced in detail through the code, but that's my
understanding.

Cheers
---Dave

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-05 10:40                                         ` Peter Newman
@ 2025-03-05 19:34                                           ` Moger, Babu
  2025-03-10 22:48                                             ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-03-05 19:34 UTC (permalink / raw)
  To: Peter Newman
  Cc: Reinette Chatre, Moger, Babu, Dave Martin, corbet, tglx, mingo,
	bp, dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Peter,

On 3/5/25 04:40, Peter Newman wrote:
> Hi Babu,
> 
> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>
>> Hi Peter,
>>
>> On 3/4/25 10:44, Peter Newman wrote:
>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>
>>>> Hi Peter/Reinette,
>>>>
>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>> Hi Reinette,
>>>>>>>
>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>> event names.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>
>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>
>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>
>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>
>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>
>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>> of the hardware.
>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>> earlier example copied below:
>>>>>>>>>>
>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>> I understand it:
>>>>>>>>>>
>>>>>>>>>> group 0:
>>>>>>>>>>  domain 0:
>>>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>  domain 1:
>>>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>>  domain 0:
>>>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>  domain 1:
>>>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>
>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>
>>>>>>>>>> group 0:
>>>>>>>>>>  domain 0:
>>>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>>  domain 0:
>>>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>  domain 1:
>>>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>
>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>
>>>>>>>>>> group 0:
>>>>>>>>>>  domain 0:
>>>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>>  domain 0:
>>>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>  domain 1:
>>>>>>>>>>   counter 0: LclFill,RmtFill
>>>>>>>>>>   counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>   counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>   counter 3: VictimBW
>>>>>>>>>>
>>>>>>>>>> The counters are shown with different per-domain configurations that seem to
>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>
>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>> groupings to count.
>>>>>>>>>
>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>
>>>>>>>>>  # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>  # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>
>>>>>>>>>  r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>  w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>
>>>>>>>>>  # legacy "total" configuration, effectively r+w
>>>>>>>>>  t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>
>>>>>>>>>  /group0/0=t;1=t
>>>>>>>>>  /group1/0=t;1=t
>>>>>>>>>  /group2/0=_;1=t
>>>>>>>>>  /group3/0=rw;1=_
>>>>>>>>>
>>>>>>>>> - group2 is restricted to domain 1
>>>>>>>>> - group3 is restricted to domain 0
>>>>>>>>> - the rest are unrestricted
>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>
>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I see. Thank you for the example.
>>>>>>>>
>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>
>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>
>>>>>>>>    /group0/0=t;1=t
>>>>>>>>    /group1/0=t;1=t
>>>>>>>>
>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>> be configured differently in each domain.
>>>>>>>>
>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>
>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>> have the same flexibility as on MPAM.
>>>>>>
>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>
>>>>>> It is documented below.
>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>  Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>
>>>>>> We previously discussed this with you (off the public list) and I
>>>>>> initially proposed the extended assignment mode.
>>>>>>
>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>> just two.
>>>>>>
>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>> extended mode is not practical at this time.
>>>>>>
>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>> using it as is.
>>>>>>
>>>>>>>
>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>
>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>> earlier is one I've already been asked about.
>>>>>>
>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>
>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>> there's already an expectation that the files are present when BMEC is
>>>>> supported.
>>>>>
>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>> interface. If it does, it's something we can live with.
>>>>
>>>> As you know, this series is currently blocked without further feedback.
>>>>
>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>> Any input or suggestions would be appreciated.
>>>>
>>>> Here’s what we’ve learned so far:
>>>>
>>>> 1. Assignments should be independent of BMEC.
>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>> 3. There should be an option to assign events per domain.
>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>> should allow flexibility to assign more in the future as the interface
>>>> evolves.
>>>> 5. Utilize the extended RMID read mode.
>>>>
>>>>
>>>> Here is my proposal using Peter's earlier example:
>>>>
>>>> # define event configurations
>>>>
>>>> ====    ========================================================
>>>> Bits    Mnemonics       Description
>>>> ====    ========================================================
>>>> 6       VictimBW        Dirty Victims from all types of memory
>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>>>> 0       LclFill         Reads to memory in the local NUMA domain
>>>> ====    ========================================================
>>>>
>>>> #Define flags based on combination of above event types.
>>>>
>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>> l = LclFill, LclNTWr, LclSlowFill
>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>> v = VictimBW
>>>>
>>>> Peter suggested the following format earlier :
>>>>
>>>> /group0/0=t;1=t
>>>> /group1/0=t;1=t
>>>> /group2/0=_;1=t
>>>> /group3/0=rw;1=_
>>>
>>> After some inquiries within Google, it sounds like nobody has invested
>>> much into the current mbm_assign_control format yet, so it would be
>>> best to drop it and distribute the configuration around the filesystem
>>> hierarchy[1], which should allow us to produce something more flexible
>>> and cleaner to implement.
>>>
>>> Roughly what I had in mind:
>>>
>>> Use mkdir in an info/<resource>_MON subdirectory to create free-form
>>> names for the assignable configurations rather than being restricted
>>> to single letters.  In the resulting directory, populate a file where
>>> we can specify the set of events the config should represent. I think
>>> we should use symbolic names for the events rather than raw BMEC field
>>> values. Moving forward we could come up with portable names for common
>>> events and only support the BMEC names on AMD machines for users who
>>> want specific events and don't care about portability.
>>
>>
>> I’m still processing this. Let me start with some initial questions.
>>
>> So, we are creating event configurations here, which seems reasonable.
>>
>> Yes, we should use portable names and are not limited to BMEC names.
>>
>> How many configurations should we allow? Do we know?
> 
> Do we need an upper limit?

I think so. This needs to be maintained in some data structure. We can
start with 2 default configurations for now.

> 
>>
>>>
>>> Next, put assignment-control file nodes in per-domain directories
>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>> counter-configuration name into the file would then allocate a counter
>>> in the domain, apply the named configuration, and monitor the parent
>>> group-directory. We can also put a group/resource-scoped assign_* file
>>> higher in the hierarchy to make it easier for users who want to
>>> configure all domains the same for a group.
>>
>> What is the difference between shared and exclusive?
> 
> Shared assignment[1] means that non-exclusively-assigned counters in
> each domain will be scheduled round-robin to the groups requesting
> shared access to a counter. In my tests, I assigned the counters long
> enough to produce a single 1-second MB/s sample for the per-domain
> aggregation files[2].
> 
> These do not need to be implemented immediately, but knowing that they
> work addresses the overhead and scalability concerns of reassigning
> counters and reading their values.

OK. Let's focus on exclusive assignments for now.

> 
>>
>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>> domain seems excessive. In a system with 32 groups and 12 domains, this
>> results in 32 × 12 × 3 files, which is quite large.
>>
>> There should be a more efficient way to handle this.
>>
>> Initially, we started with a group-level file for this interface, but it
>> was rejected due to the high number of sysfs calls, making it inefficient.
> 
> I had rejected it due to the high-frequency of access of a large
> number of files, which has since been addressed by shared assignment
> (or automatic reassignment) and aggregated mbps files.

I think we should address this as well. Creating three extra files for
each group isn’t ideal when there are more efficient alternatives.

> 
>>
>> Additionally, how can we list all assignments with a single sysfs call?
>>
>> That was another problem we need to address.
> 
> This is not a requirement I was aware of. If the user forgot where
> they assigned counters (or forgot to disable auto-assignment), they
> can read multiple sysfs nodes to remind themselves.

I suggest we provide users with an option to list the assignments of all
groups in a single command. As the number of groups increases, it becomes
cumbersome to query each group individually.

We can reuse the existing mbm_assign_control interface for this purpose.
More details on this below.

>>
>>
>>>
>>> The configuration names listed in assign_* would result in files of
>>> the same name in the appropriate mon_data domain directories from
>>> which the count values can be read.
>>>
>>>  # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>  # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>  # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>  # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>  # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> LclFill
>>> LclNTWr
>>> LclSlowFill
>>
>> I feel we can just have the configs. event_filter file is not required.
> 
> That's right, I forgot that we can implement kernfs_ops::open(). I was
> only looking at struct kernfs_syscall_ops
> 
>>
>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>> LclFill <-rename these to generic names.
>> LclNTWr
>> LclSlowFill
>>
> 
> I think portable and non-portable event names should both be available
> as options. There are simple bandwidth measurement mechanisms that
> will be applied in general, but when they turn up an issue, it can
> often lead to a more focused investigation, requiring more precise
> events.

I agree. We should provide both portable and non-portable event names.

Here is my draft proposal based on the discussion so far, reusing some
of the current interface. The idea is to start with a basic assignment
feature, with options to enhance it in the future. Feel free to
comment/suggest.

1. Event configurations will be in
   /sys/fs/resctrl/info/L3_MON/counter_configs/.

   There will be two pre-defined configurations by default.

   #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
   LclFill, LclNTWr, LclSlowFill, VictimBW, RmtSlowFill, RmtNTWr, RmtFill

   #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
   LclFill, LclNTWr, LclSlowFill

2. Users will have options to update these configurations.

   #echo "LclFill, LclNTWr, RmtFill" >
      /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes

   #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
   LclFill, LclNTWr, RmtFill

3. The default configurations will be used when the user mounts resctrl.

   mount  -t resctrl resctrl /sys/fs/resctrl/
   mkdir /sys/fs/resctrl/test/

4. The resctrl group/domains can be in one of these assignment states.
   e: Exclusive
   s: Shared
   u: Unassigned

   Exclusive mode is supported now. Shared mode will be supported in the
future.

5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
to list the assignment state of all the groups.

   Format:
   "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"

  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   test//mbm_total_bytes:0=e;1=e
   test//mbm_local_bytes:0=e;1=e
   //mbm_total_bytes:0=e;1=e
   //mbm_local_bytes:0=e;1=e

6. Users can modify the assignment state by writing to mbm_assign_control.

   Format:
   "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"

   #echo "test//mbm_local_bytes:0=e;1=e" >
/sys/fs/resctrl/info/L3_MON/mbm_assign_control

   #echo "test//mbm_local_bytes:0=u;1=u" >
/sys/fs/resctrl/info/L3_MON/mbm_assign_control

   # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   test//mbm_total_bytes:0=u;1=u
   test//mbm_local_bytes:0=u;1=u
   //mbm_total_bytes:0=e;1=e
   //mbm_local_bytes:0=e;1=e

   The corresponding events will be read in

   /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
   /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
   /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
   /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
   /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
   /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
   /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
   /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes

7. In the first stage, only two configurations (mbm_total_bytes and
mbm_local_bytes) will be supported.

8. In the future, there will be options to create multiple configurations,
and the corresponding directory will be created in
/sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
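
As a rough illustration of items 1 and 2, a configuration string can be
reduced to a bit mask using the bit positions from the event table quoted
earlier (bit 0 is LclFill, bit 1 is RmtFill, and so on up to bit 6 for
VictimBW). This is only a sketch in Python, not kernel code. The helper
names config_to_mask() and mask_to_config() are made up for this example,
and the parsing rules (comma-separated names, optional spaces) are an
assumption based on the samples above.

```python
# Bit positions as listed in the event table quoted earlier in this thread.
EVENT_BITS = {
    "LclFill": 0,
    "RmtFill": 1,
    "LclNTWr": 2,
    "RmtNTWr": 3,
    "LclSlowFill": 4,
    "RmtSlowFill": 5,
    "VictimBW": 6,
}


def config_to_mask(config: str) -> int:
    """Turn 'LclFill, LclNTWr' style text into an event bit mask.

    Hypothetical helper: accepts comma-separated event names with
    optional surrounding spaces and rejects unknown names.
    """
    mask = 0
    for name in config.split(","):
        name = name.strip()
        if not name:
            continue
        if name not in EVENT_BITS:
            raise ValueError(f"unknown event name: {name!r}")
        mask |= 1 << EVENT_BITS[name]
    return mask


def mask_to_config(mask: int) -> str:
    """Inverse of config_to_mask(), listing names from bit 0 upwards."""
    by_bit = sorted(EVENT_BITS.items(), key=lambda item: item[1])
    return ",".join(name for name, bit in by_bit if mask & (1 << bit))
```

With these bit positions, a configuration naming all seven events comes
out as 0x7f, and "LclFill, LclNTWr, LclSlowFill" comes out as 0x15.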

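The format in items 5 and 6 is simple enough to parse and regenerate
mechanically. The Python sketch below does that for lines such as
"test//mbm_local_bytes:0=e;1=e". It is illustrative only: the names
Assignment, parse_assign_line(), format_assign_line() and
exclusive_per_domain() are hypothetical, and it assumes that group and
configuration names never contain '/' or ':'.

```python
from typing import Dict, Iterable, NamedTuple

# Assignment states from item 4: exclusive, shared, unassigned.
VALID_STATES = {"e", "s", "u"}


class Assignment(NamedTuple):
    ctrl_group: str         # "" means the default CTRL_MON group
    mon_group: str          # "" means no MON sub-group
    config: str             # e.g. "mbm_total_bytes"
    states: Dict[int, str]  # domain id -> "e", "s" or "u"


def parse_assign_line(line: str) -> Assignment:
    """Parse '<ctrl>/<mon>/<config>:<dom>=<state>;<dom>=<state>...'."""
    path, sep, domains = line.strip().partition(":")
    if not sep:
        raise ValueError(f"missing ':' in {line!r}")
    parts = path.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected <ctrl>/<mon>/<config> in {line!r}")
    ctrl_group, mon_group, config = parts
    states: Dict[int, str] = {}
    for item in domains.split(";"):
        dom, eq, state = item.partition("=")
        if not eq or state not in VALID_STATES:
            raise ValueError(f"bad domain entry {item!r}")
        states[int(dom)] = state
    return Assignment(ctrl_group, mon_group, config, states)


def format_assign_line(a: Assignment) -> str:
    """Regenerate the text form, with domains in ascending order."""
    doms = ";".join(f"{dom}={state}" for dom, state in sorted(a.states.items()))
    return f"{a.ctrl_group}/{a.mon_group}/{a.config}:{doms}"


def exclusive_per_domain(lines: Iterable[str]) -> Dict[int, int]:
    """Count exclusive ('e') assignments in each domain."""
    counts: Dict[int, int] = {}
    for line in lines:
        for dom, state in parse_assign_line(line).states.items():
            if state == "e":
                counts[dom] = counts.get(dom, 0) + 1
    return counts
```

For the listing shown after the unassign in item 6, this counts two
exclusive assignments in each of domains 0 and 1 (the default group's
mbm_total_bytes and mbm_local_bytes).
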
-- 
Thanks
Babu Moger


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-05 19:34                                           ` Moger, Babu
@ 2025-03-10 22:48                                             ` Moger, Babu
  2025-03-10 23:22                                               ` Luck, Tony
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-03-10 22:48 UTC (permalink / raw)
  To: babu.moger, Peter Newman, Chatre, Reinette
  Cc: Reinette Chatre, Dave Martin, corbet, tglx, mingo, bp,
	dave.hansen, tony.luck, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi All,

On 3/5/2025 1:34 PM, Moger, Babu wrote:
> Hi Peter,
> 
> On 3/5/25 04:40, Peter Newman wrote:
>> Hi Babu,
>>
>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>
>>> Hi Peter,
>>>
>>> On 3/4/25 10:44, Peter Newman wrote:
>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>
>>>>> Hi Peter/Reinette,
>>>>>
>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>> Hi Babu,
>>>>>>
>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>> Hi Reinette,
>>>>>>>>
>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>
>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>
>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>
>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>> of the hardware.
>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>
>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>   counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>   counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>> I understand it:
>>>>>>>>>>>
>>>>>>>>>>> group 0:
>>>>>>>>>>>   domain 0:
>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>   domain 1:
>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> group 1:
>>>>>>>>>>>   domain 0:
>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>   domain 1:
>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>
>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>
>>>>>>>>>>> group 0:
>>>>>>>>>>>   domain 0:
>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> group 1:
>>>>>>>>>>>   domain 0:
>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>   domain 1:
>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>
>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>
>>>>>>>>>>> group 0:
>>>>>>>>>>>   domain 0:
>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> group 1:
>>>>>>>>>>>   domain 0:
>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>   domain 1:
>>>>>>>>>>>    counter 0: LclFill,RmtFill
>>>>>>>>>>>    counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>    counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>    counter 3: VictimBW
>>>>>>>>>>>
>>>>>>>>>>> The counters are shown with different per-domain configurations that seem to
>>>>>>>>>>> match the earlier goals of (a) choosing the events counted by each counter and
>>>>>>>>>>> (b) not allocating counters in domains where they are not needed. As I
>>>>>>>>>>> understand it, the above does contradict global counter configuration though.
>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>
>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>> groupings to count.
>>>>>>>>>>
>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>
>>>>>>>>>>   # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>   # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>
>>>>>>>>>>   r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>   w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>
>>>>>>>>>>   # legacy "total" configuration, effectively r+w
>>>>>>>>>>   t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>
>>>>>>>>>>   /group0/0=t;1=t
>>>>>>>>>>   /group1/0=t;1=t
>>>>>>>>>>   /group2/0=_;1=t
>>>>>>>>>>   /group3/0=rw;1=_
>>>>>>>>>>
>>>>>>>>>> - group2 is restricted to domain 1
>>>>>>>>>> - group3 is restricted to domain 0
>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>
>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
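
The counter totals above can be checked mechanically. The following is a minimal sketch, assuming each flag letter in an assignment costs one counter in that domain and that "_" means no assignment; the function name is hypothetical and not part of resctrl.

```python
from collections import Counter


def counters_per_domain(assignments):
    """Count the counters consumed in each domain.

    Each line looks like '/group3/0=rw;1=_': after the group path come
    ';'-separated '<domain>=<flags>' entries. Every flag letter names one
    counter configuration and costs one counter; '_' means no assignment.
    """
    usage = Counter()
    for line in assignments:
        # Everything after the last '/' is the per-domain assignment list.
        _group, _, domains = line.strip().rpartition("/")
        for entry in domains.split(";"):
            domain, flags = entry.split("=")
            usage[int(domain)] += sum(1 for flag in flags if flag != "_")
    return dict(usage)


example = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]
print(counters_per_domain(example))
```

For the four example lines this prints {0: 4, 1: 3}, which agrees with the totals stated above.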
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>
>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>
>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>
>>>>>>>>>     /group0/0=t;1=t
>>>>>>>>>     /group1/0=t;1=t
>>>>>>>>>
>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>> be configured differently in each domain.
>>>>>>>>>
>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>> reasonable to carry this forward to what will be supported next.
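
A small sketch of the point above: the same flag can expand to a different event list in each domain. The dictionary and function names are hypothetical; the event lists are the two t(domain N) lines from the example, and only one flag per domain is handled.

```python
# Per-domain definition of the 't' flag, copied from the example above.
CONFIGS = {
    "t": {
        0: ["LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill",
            "VictimBW", "LclNTWr", "RmtNTWr"],
        1: ["LclFill", "RmtFill", "VictimBW", "LclNTWr", "RmtNTWr"],
    },
}


def resolve(line, configs):
    """Expand '/group0/0=t;1=t' into (group, {domain: event list})."""
    group, _, domains = line.strip().rpartition("/")
    resolved = {}
    for entry in domains.split(";"):
        domain, flag = entry.split("=")
        # Same flag letter, but the event list is looked up per domain.
        resolved[int(domain)] = configs[flag][int(domain)]
    return group, resolved


group, events = resolve("/group0/0=t;1=t", CONFIGS)
for domain, names in sorted(events.items()):
    print(group, "domain", domain, ",".join(names))
```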
>>>>>>>>
>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>
>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>
>>>>>>> It is documented below.
>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>   Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>
>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>> initially proposed the extended assignment mode.
>>>>>>>
>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>> just two.
>>>>>>>
>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>> extended mode is not practical at this time.
>>>>>>>
>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>> using it as is.
>>>>>>>
>>>>>>>>
>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>
>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>> earlier is one I've already been asked about.
>>>>>>>
>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>
>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>> supported.
>>>>>>
>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>> interface. If it does, it's something we can live with.
>>>>>
>>>>> As you know, this series is currently blocked without further feedback.
>>>>>
>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>> Any input or suggestions would be appreciated.
>>>>>
>>>>> Here’s what we’ve learned so far:
>>>>>
>>>>> 1. Assignments should be independent of BMEC.
>>>>> 2. We should be able to specify multiple event types for a counter (e.g.,
>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>> 3. There should be an option to assign events per domain.
>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>> should allow flexibility to assign more in the future as the interface
>>>>> evolves.
>>>>> 5. Utilize the extended RMID read mode.
>>>>>
>>>>>
>>>>> Here is my proposal using Peter's earlier example:
>>>>>
>>>>> # define event configurations
>>>>>
>>>>> ====    ========================================================
>>>>> Bits    Mnemonics       Description
>>>>> ====    ========================================================
>>>>> 6       VictimBW        Dirty Victims from all types of memory
>>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>>>>> 0       LclFill         Reads to memory in the local NUMA domain
>>>>> ====    ========================================================
>>>>>
>>>>> # Define flags based on combinations of the above event types.
>>>>>
>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>> v = VictimBW
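
To make the relation between the flags and the bit table concrete, here is a minimal sketch that turns each flag definition into a bitmask. The bit positions come from the table above (RmtFill is taken as bit 1); the names EVENT_BITS, FLAGS and event_mask are hypothetical and only serve the illustration.

```python
# Bit positions from the event table above.
EVENT_BITS = {
    "LclFill": 0,
    "RmtFill": 1,
    "LclNTWr": 2,
    "RmtNTWr": 3,
    "LclSlowFill": 4,
    "RmtSlowFill": 5,
    "VictimBW": 6,
}

# Flag definitions as listed above.
FLAGS = {
    "t": "LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr",
    "l": "LclFill, LclNTWr, LclSlowFill",
    "r": "LclFill,RmtFill,LclSlowFill,RmtSlowFill",
    "w": "VictimBW,LclNTWr,RmtNTWr",
    "v": "VictimBW",
}


def event_mask(spec):
    """Return the bitmask for a comma-separated list of event mnemonics."""
    mask = 0
    for name in spec.split(","):
        mask |= 1 << EVENT_BITS[name.strip()]
    return mask


for flag, spec in FLAGS.items():
    print(f"{flag} = {event_mask(spec):#04x}")
```

With these definitions the r and w masks are disjoint and their union equals t, which is what "effectively r+w" in the earlier example relies on.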
>>>>>
>>>>> Peter suggested the following format earlier :
>>>>>
>>>>> /group0/0=t;1=t
>>>>> /group1/0=t;1=t
>>>>> /group2/0=_;1=t
>>>>> /group3/0=rw;1=_
>>>>
>>>> After some inquiries within Google, it sounds like nobody has invested
>>>> much into the current mbm_assign_control format yet, so it would be
>>>> best to drop it and distribute the configuration around the filesystem
>>>> hierarchy[1], which should allow us to produce something more flexible
>>>> and cleaner to implement.
>>>>
>>>> Roughly what I had in mind:
>>>>
>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>> names for the assignable configurations rather than being restricted
>>>> to single letters.  In the resulting directory, populate a file where
>>>> we can specify the set of events the config should represent. I think
>>>> we should use symbolic names for the events rather than raw BMEC field
>>>> values. Moving forward we could come up with portable names for common
>>>> events and only support the BMEC names on AMD machines for users who
>>>> want specific events and don't care about portability.
>>>
>>>
>>> I’m still processing this. Let me start with some initial questions.
>>>
>>> So, we are creating event configurations here, which seems reasonable.
>>>
>>> Yes, we should use portable names and are not limited to BMEC names.
>>>
>>> How many configurations should we allow? Do we know?
>>
>> Do we need an upper limit?
> 
> I think so. This needs to be maintained in some data structure. We can
> start with 2 default configurations for now.
> 
>>
>>>
>>>>
>>>> Next, put assignment-control file nodes in per-domain directories
>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>> counter-configuration name into the file would then allocate a counter
>>>> in the domain, apply the named configuration, and monitor the parent
>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>> higher in the hierarchy to make it easier for users who want to
>>>> configure all domains the same for a group.
>>>
>>> What is the difference between shared and exclusive?
>>
>> Shared assignment[1] means that non-exclusively-assigned counters in
>> each domain will be scheduled round-robin to the groups requesting
>> shared access to a counter. In my tests, I assigned the counters long
>> enough to produce a single 1-second MB/s sample for the per-domain
>> aggregation files[2].
>>
>> These do not need to be implemented immediately, but knowing that they
>> work addresses the overhead and scalability concerns of reassigning
>> counters and reading their values.
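
As a toy model of the round-robin policy described above (not the kernel implementation), the sketch below hands a fixed number of shared counters to the requesting groups for one sampling interval at a time. The function name, the group names and the interval granularity are assumptions made for illustration.

```python
from collections import deque


def round_robin_schedule(groups, num_counters, intervals):
    """Return, for each interval, the groups that hold a shared counter.

    groups       -- groups requesting shared access in one domain
    num_counters -- counters in that domain not exclusively assigned
    intervals    -- number of sampling intervals to simulate
    """
    queue = deque(groups)
    schedule = []
    for _ in range(intervals):
        # The groups at the head of the queue get the counters this interval.
        active = [queue[i] for i in range(min(num_counters, len(queue)))]
        schedule.append(active)
        # Move the groups that were just served to the back of the queue.
        queue.rotate(-len(active))
    return schedule


for tick, active in enumerate(round_robin_schedule(
        ["g0", "g1", "g2", "g3", "g4"], num_counters=2, intervals=3)):
    print(tick, active)
```

In this model every group is sampled once before any group is sampled twice, so the time between samples for a group grows with the ratio of requesting groups to shared counters.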
> 
> OK. Let's focus on exclusive assignments for now.
> 
>>
>>>
>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>> results in 32 × 12 × 3 files, which is quite large.
>>>
>>> There should be a more efficient way to handle this.
>>>
>>> Initially, we started with a group-level file for this interface, but it
>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>
>> I had rejected it due to the high-frequency of access of a large
>> number of files, which has since been addressed by shared assignment
>> (or automatic reassignment) and aggregated mbps files.
> 
> I think we should address this as well. Creating three extra files for
> each group isn’t ideal when there are more efficient alternatives.
> 
>>
>>>
>>> Additionally, how can we list all assignments with a single sysfs call?
>>>
>>> That was another problem we need to address.
>>
>> This is not a requirement I was aware of. If the user forgot where
>> they assigned counters (or forgot to disable auto-assignment), they
>> can read multiple sysfs nodes to remind themselves.
> 
> I suggest we provide users with an option to list the assignments
> of all groups in a single command. As the number of groups increases, it
> becomes cumbersome to query each group individually.
> 
> To achieve this, we can reuse our existing mbm_assign_control interface.
> More details on this below.
> 
>>>
>>>
>>>>
>>>> The configuration names listed in assign_* would result in files of
>>>> the same name in the appropriate mon_data domain directories from
>>>> which the count values can be read.
>>>>
>>>>   # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>   # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>   # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>   # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>   # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>> LclFill
>>>> LclNTWr
>>>> LclSlowFill
>>>
>>> I feel we can just have the configs. event_filter file is not required.
>>
>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>> only looking at struct kernfs_syscall_ops
>>
>>>
>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>> LclFill <-rename these to generic names.
>>> LclNTWr
>>> LclSlowFill
>>>
>>
>> I think portable and non-portable event names should both be available
>> as options. There are simple bandwidth measurement mechanisms that
>> will be applied in general, but when they turn up an issue, it can
>> often lead to a more focused investigation, requiring more precise
>> events.
> 
> I agree. We should provide both portable and non-portable event names.
> 
> Here is my draft proposal based on the discussion so far and reusing some
> of the current interface. The idea here is to start with a basic assignment
> feature with options to enhance it in the future. Feel free to
> comment/suggest.
> 
> 1. Event configurations will be in
>     /sys/fs/resctrl/info/L3_MON/counter_configs/.
> 
>     There will be two pre-defined configurations by default.
> 
>     #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>     LclFill, LclNTWr, LclSlowFill, VictimBW, RmtSlowFill, RmtNTWr, RmtFill
> 
>     #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>     LclFill, LclNTWr, LclSlowFill
> 
> 2. Users will have options to update these configurations.
> 
>     #echo "LclFill, LclNTWr, RmtFill" >
>        /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> 
>     #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>     LclFill, LclNTWr, RmtFill
> 
> 3. The default configurations will be used when the user mounts resctrl.
> 
>     mount  -t resctrl resctrl /sys/fs/resctrl/
>     mkdir /sys/fs/resctrl/test/
> 
> 4. Each resctrl group/domain can be in one of these assignment states.
>     e: Exclusive
>     s: Shared
>     u: Unassigned
> 
>     Exclusive mode is supported now. Shared mode will be supported in the
> future.
> 
> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> to list the assignment state of all the groups.
> 
>     Format:
>     "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
> 
>    # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>     test//mbm_total_bytes:0=e;1=e
>     test//mbm_local_bytes:0=e;1=e
>     //mbm_total_bytes:0=e;1=e
>     //mbm_local_bytes:0=e;1=e
> 
> 6. Users can modify the assignment state by writing to mbm_assign_control.
> 
>     Format:
>     "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
> 
>     #echo "test//mbm_local_bytes:0=e;1=e" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
>     #echo "test//mbm_local_bytes:0=u;1=u" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
>     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>     test//mbm_total_bytes:0=u;1=u
>     test//mbm_local_bytes:0=u;1=u
>     //mbm_total_bytes:0=e;1=e
>     //mbm_local_bytes:0=e;1=e
> 
>     The corresponding events will be read in
> 
>     /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>     /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>     /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>     /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>     /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>     /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>     /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>     /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
> 
> 7. In the first stage, only two configurations (mbm_total_bytes and
> mbm_local_bytes) will be supported.
> 
> 8. In the future, there will be options to create multiple configurations,
> and a corresponding directory will be created in
> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
> 
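
To illustrate the format proposed in items 5 and 6, here is a minimal user-space sketch that parses one line, formats it back, and derives the mon_data paths where the counts would be read. All function names are hypothetical, and the "mon_groups" path component for MON groups is assumed from the existing resctrl layout rather than stated in the proposal.

```python
VALID_STATES = ("e", "s", "u")  # exclusive, shared, unassigned


def parse_assign_line(line):
    """Split '<ctrl_mon>/<mon>/<config>:<dom>=<state>;...' into its parts."""
    path, _, domains = line.strip().partition(":")
    ctrl_mon, mon, config = path.split("/")
    states = {}
    for entry in domains.split(";"):
        domain, state = entry.split("=")
        if state not in VALID_STATES:
            raise ValueError(f"unknown assignment state: {state!r}")
        states[int(domain)] = state
    return ctrl_mon, mon, config, states


def format_assign_line(ctrl_mon, mon, config, states):
    """Inverse of parse_assign_line()."""
    domains = ";".join(f"{d}={s}" for d, s in sorted(states.items()))
    return f"{ctrl_mon}/{mon}/{config}:{domains}"


def mon_data_paths(ctrl_mon, mon, config, states, root="/sys/fs/resctrl"):
    """List the per-domain files from which the configuration is read."""
    parts = [root]
    if ctrl_mon:
        parts.append(ctrl_mon)
    if mon:
        # Assumption: MON groups live under 'mon_groups', as in resctrl today.
        parts += ["mon_groups", mon]
    return ["/".join(parts + ["mon_data", f"mon_L3_{d:02d}", config])
            for d in sorted(states)]


parsed = parse_assign_line("test//mbm_local_bytes:0=e;1=e")
print(format_assign_line(*parsed))
for path in mon_data_paths(*parsed):
    print(path)
```

An empty CTRL_MON field denotes the default group, so "//mbm_total_bytes:0=e;1=e" resolves to files directly under /sys/fs/resctrl/mon_data/.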

I know you are all busy with multiple series going on in parallel. I am
still waiting for input on this. It would be great if you could spend
some time on this to see if we can find common ground on the interface.

Thanks
Babu


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-10 22:48                                             ` Moger, Babu
@ 2025-03-10 23:22                                               ` Luck, Tony
  2025-03-11  1:44                                                 ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Luck, Tony @ 2025-03-10 23:22 UTC (permalink / raw)
  To: Moger, Babu
  Cc: babu.moger, Peter Newman, Chatre, Reinette, Dave Martin, corbet,
	tglx, mingo, bp, dave.hansen, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
> Hi All,
> 
> On 3/5/2025 1:34 PM, Moger, Babu wrote:
> > Hi Peter,
> > 
> > On 3/5/25 04:40, Peter Newman wrote:
> > > Hi Babu,
> > > 
> > > On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
> > > > 
> > > > Hi Peter,
> > > > 
> > > > On 3/4/25 10:44, Peter Newman wrote:
> > > > > On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
> > > > > > 
> > > > > > Hi Peter/Reinette,
> > > > > > 
> > > > > > On 2/26/25 07:27, Peter Newman wrote:
> > > > > > > Hi Babu,
> > > > > > > 
> > > > > > > On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
> > > > > > > > 
> > > > > > > > Hi Peter,
> > > > > > > > 
> > > > > > > > On 2/25/25 11:11, Peter Newman wrote:
> > > > > > > > > Hi Reinette,
> > > > > > > > > 
> > > > > > > > > On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > Hi Peter,
> > > > > > > > > > 
> > > > > > > > > > On 2/21/25 5:12 AM, Peter Newman wrote:
> > > > > > > > > > > On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> > > > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > > > On 2/20/25 6:53 AM, Peter Newman wrote:
> > > > > > > > > > > > > On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> > > > > > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > > > > > On 2/19/25 3:28 AM, Peter Newman wrote:
> > > > > > > > > > > > > > > On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> > > > > > > > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > > > > > > > On 2/17/25 2:26 AM, Peter Newman wrote:
> > > > > > > > > > > > > > > > > On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> > > > > > > > > > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > > > > > > > > > On 2/14/25 10:31 AM, Moger, Babu wrote:
> > > > > > > > > > > > > > > > > > > On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> > > > > > > > > > > > > > > > > > > > On 2/13/25 9:37 AM, Dave Martin wrote:
> > > > > > > > > > > > > > > > > > > > > On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> > > > > > > > > > > > > > > > > > > > > > On 2/12/25 9:46 AM, Dave Martin wrote:
> > > > > > > > > > > > > > > > > > > > > > > On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > (quoting relevant parts with goal to focus discussion on new possible syntax)
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > I see the support for MPAM events distinct from the support of assignable counters.
> > > > > > > > > > > > > > > > > > > > > > Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> > > > > > > > > > > > > > > > > > > > > > Please help me understand if you see it differently.
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > Doing so would need to come up with alphabetical letters for these events,
> > > > > > > > > > > > > > > > > > > > > > which seems to be needed for your proposal also? If we use possible flags of:
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > mbm_local_read_bytes a
> > > > > > > > > > > > > > > > > > > > > > mbm_local_write_bytes b
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > Then mbm_assign_control can be used as:
> > > > > > > > > > > > > > > > > > > > > > # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > > > > > > > > > > > > > > > > > > > > > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> > > > > > > > > > > > > > > > > > > > > > <value>
> > > > > > > > > > > > > > > > > > > > > > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> > > > > > > > > > > > > > > > > > > > > > <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > One issue would be when resctrl needs to support more than 26 events (no more flags available),
> > > > > > > > > > > > > > > > > > > > > > assuming that upper case would be used for "shared" counters (unless this interface is defined
> > > > > > > > > > > > > > > > > > > > > > differently and only few uppercase letters used for it). Would this be too low of a limit?
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > As mentioned above, one possible issue with existing interface is that
> > > > > > > > > > > > > > > > > > it is limited to 26 events (assuming only lower case letters are used). The limit
> > > > > > > > > > > > > > > > > > is low enough to be of concern.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > The events which can be monitored by a single counter on ABMC and MPAM
> > > > > > > > > > > > > > > > > so far are combinable, so 26 counters per group today means it limits
> > > > > > > > > > > > > > > > > breaking down MBM traffic for each group 26 ways. If a user complained
> > > > > > > > > > > > > > > > > that a 26-way breakdown of a group's MBM traffic was limiting their
> > > > > > > > > > > > > > > > > investigation, I would question whether they know what they're looking
> > > > > > > > > > > > > > > > > for.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The key here is "so far" as well as the focus on MBM only.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > It is impossible for me to predict what we will see in a couple of years
> > > > > > > > > > > > > > > > from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> > > > > > > > > > > > > > > > to support their users. Just looking at the Intel RDT spec the event register
> > > > > > > > > > > > > > > > has space for 32 events for each "CPU agent" resource. That does not take into
> > > > > > > > > > > > > > > > account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> > > > > > > > > > > > > > > > that he is working on patches [1] that will add new events and shared the idea
> > > > > > > > > > > > > > > > that we may be trending to support "perf" like events associated with RMID. I
> > > > > > > > > > > > > > > > expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> > > > > > > > > > > > > > > > customers.
> > > > > > > > > > > > > > > > This all makes me think that resctrl should be ready to support more events than 26.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I was thinking of the letters as representing a reusable, user-defined
> > > > > > > > > > > > > > > event-set for applying to a single counter rather than as individual
> > > > > > > > > > > > > > > events, since MPAM and ABMC allow us to choose the set of events each
> > > > > > > > > > > > > > > one counts. Wherever we define the letters, we could use more symbolic
> > > > > > > > > > > > > > > event names.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thank you for clarifying.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > In the letters as events model, choosing the events assigned to a
> > > > > > > > > > > > > > > group wouldn't be enough information, since we would want to control
> > > > > > > > > > > > > > > which events should share a counter and which should be counted by
> > > > > > > > > > > > > > > separate counters. I think the amount of information that would need
> > > > > > > > > > > > > > > to be encoded into mbm_assign_control to represent the level of
> > > > > > > > > > > > > > > configurability supported by hardware would quickly get out of hand.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Maybe as an example, one counter for all reads, one counter for all
> > > > > > > > > > > > > > > writes in ABMC would look like...
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > (L3_QOS_ABMC_CFG.BwType field names below)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > (per domain)
> > > > > > > > > > > > > > > group 0:
> > > > > > > > > > > > > > >   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > > > >   counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > > > > group 1:
> > > > > > > > > > > > > > >   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > > > >   counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I think this may also be what Dave was heading towards in [2] but in that
> > > > > > > > > > > > > > example and above the counter configuration appears to be global. You do mention
> > > > > > > > > > > > > > "configurability supported by hardware" so I wonder if per-domain counter
> > > > > > > > > > > > > > configuration is a requirement?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > If it's global and we want a particular group to be watched by more
> > > > > > > > > > > > > counters, I wouldn't want this to result in allocating more counters
> > > > > > > > > > > > > for that group in all domains, or allocating counters in domains where
> > > > > > > > > > > > > they're not needed. I want to encourage my users to avoid allocating
> > > > > > > > > > > > > monitoring resources in domains where a job is not allowed to run so
> > > > > > > > > > > > > there's less pressure on the counters.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > In Dave's proposal it looks like global configuration means
> > > > > > > > > > > > > globally-defined "named counter configurations", which works because
> > > > > > > > > > > > > it's really per-domain assignment of the configurations to however
> > > > > > > > > > > > > many counters the group needs in each domain.
> > > > > > > > > > > > 
> > > > > > > > > > > > I think I am becoming lost. Would a global configuration not break your
> > > > > > > > > > > > view of "event-set applied to a single counter"? If a counter is configured
> > > > > > > > > > > > globally then it would not make it possible to support the full configurability
> > > > > > > > > > > > of the hardware.
> > > > > > > > > > > > Before I add more confusion, let me try with an example that builds on your
> > > > > > > > > > > > earlier example copied below:
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > (per domain)
> > > > > > > > > > > > > > > group 0:
> > > > > > > > > > > > > > >   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > > > >   counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > > > > group 1:
> > > > > > > > > > > > > > >   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > > > >   counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > 
> > > > > > > > > > > > Since the above states "per domain" I rewrite the example to highlight that as
> > > > > > > > > > > > I understand it:
> > > > > > > > > > > > 
> > > > > > > > > > > > group 0:
> > > > > > > > > > > >   domain 0:
> > > > > > > > > > > >    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > >   domain 1:
> > > > > > > > > > > >    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > group 1:
> > > > > > > > > > > >   domain 0:
> > > > > > > > > > > >    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > >   domain 1:
> > > > > > > > > > > >    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > 
> > > > > > > > > > > > You mention that you do not want counters to be allocated in domains that they
> > > > > > > > > > > > are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> > > > > > > > > > > > in domain 1, resulting in:
> > > > > > > > > > > > 
> > > > > > > > > > > > group 0:
> > > > > > > > > > > >   domain 0:
> > > > > > > > > > > >    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > group 1:
> > > > > > > > > > > >   domain 0:
> > > > > > > > > > > >    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > >   domain 1:
> > > > > > > > > > > >    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > 
> > > > > > > > > > > > With counter 0 and counter 1 available in domain 1, these counters could
> > > > > > > > > > > > theoretically be configured to give group 1 more data in domain 1:
> > > > > > > > > > > > 
> > > > > > > > > > > > group 0:
> > > > > > > > > > > >   domain 0:
> > > > > > > > > > > >    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > group 1:
> > > > > > > > > > > >   domain 0:
> > > > > > > > > > > >    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > >   domain 1:
> > > > > > > > > > > >    counter 0: LclFill,RmtFill
> > > > > > > > > > > >    counter 1: LclNTWr,RmtNTWr
> > > > > > > > > > > >    counter 2: LclSlowFill,RmtSlowFill
> > > > > > > > > > > >    counter 3: VictimBW
> > > > > > > > > > > > 
> > > > > > > > > > > > The counters are shown with different per-domain configurations that seem to
> > > > > > > > > > > > match with earlier goals of (a) choose events counted by each counter and
> > > > > > > > > > > > (b) do not allocate counters in domains where they are not needed. As I
> > > > > > > > > > > > understand the above does contradict global counter configuration though.
> > > > > > > > > > > > Or do you mean that only the *name* of the counter is global and then
> > > > > > > > > > > > that it is reconfigured as part of every assignment?
> > > > > > > > > > > 
> > > > > > > > > > > Yes, I meant only the *name* is global. I assume based on a particular
> > > > > > > > > > > system configuration, the user will settle on a handful of useful
> > > > > > > > > > > groupings to count.
> > > > > > > > > > > 
> > > > > > > > > > > Perhaps mbm_assign_control syntax is the clearest way to express an example...
> > > > > > > > > > > 
> > > > > > > > > > >   # define global configurations (in ABMC terms), not necessarily in this
> > > > > > > > > > >   # syntax and probably not in the mbm_assign_control file.
> > > > > > > > > > > 
> > > > > > > > > > >   r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > >   w=VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > 
> > > > > > > > > > >   # legacy "total" configuration, effectively r+w
> > > > > > > > > > >   t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > 
> > > > > > > > > > >   /group0/0=t;1=t
> > > > > > > > > > >   /group1/0=t;1=t
> > > > > > > > > > >   /group2/0=_;1=t
> > > > > > > > > > >   /group3/0=rw;1=_
> > > > > > > > > > > 
> > > > > > > > > > > - group2 is restricted to domain 1
> > > > > > > > > > > - group3 is restricted to domain 0
> > > > > > > > > > > - the rest are unrestricted
> > > > > > > > > > > - In group3, we decided we need to separate read and write traffic
> > > > > > > > > > > 
> > > > > > > > > > > This consumes 4 counters in domain 0 and 3 counters in domain 1.
> > > > > > > > > > > 
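The counter arithmetic in the example above can be checked with a rough sketch. It assumes that each flag letter in a domain's assignment consumes one counter and that "_" means no assignment; the function name and the parsing rules are hypothetical and not part of resctrl.

```python
from collections import Counter

def counters_per_domain(lines):
    """Count counters consumed in each domain.

    Assumes each line looks like "/group/dom=flags;dom=flags", that every
    flag letter consumes one counter, and that "_" means no assignment.
    """
    usage = Counter()
    for line in lines:
        # Everything after the last "/" is the per-domain assignment list.
        _, _, assignments = line.strip().rpartition("/")
        for item in assignments.split(";"):
            domain, _, flags = item.partition("=")
            usage[int(domain)] += len(flags.replace("_", ""))
    return dict(usage)

example = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]
print(counters_per_domain(example))  # {0: 4, 1: 3}
```

With the four groups listed, this gives 4 counters in domain 0 and 3 in domain 1, matching the totals stated above.
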
> > > > > > > > > > 
> > > > > > > > > > I see. Thank you for the example.
> > > > > > > > > > 
> > > > > > > > > > resctrl supports per-domain configurations with the following possible when
> > > > > > > > > > using mbm_total_bytes_config and mbm_local_bytes_config:
> > > > > > > > > > 
> > > > > > > > > > t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > 
> > > > > > > > > >     /group0/0=t;1=t
> > > > > > > > > >     /group1/0=t;1=t
> > > > > > > > > > 
> > > > > > > > > > Even though the flags are identical in all domains, the assigned counters will
> > > > > > > > > > be configured differently in each domain.
> > > > > > > > > > 
> > > > > > > > > > With this supported by hardware and currently also supported by resctrl it seems
> > > > > > > > > > reasonable to carry this forward to what will be supported next.
> > > > > > > > > 
> > > > > > > > > The hardware supports both a per-domain mode, where all groups in a
> > > > > > > > > domain use the same configurations and are limited to two events per
> > > > > > > > > group and a per-group mode where every group can be configured and
> > > > > > > > > assigned freely. This series is using the legacy counter access mode
> > > > > > > > > where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> > > > > > > > > in the domain can be read. If we chose to read the assigned counter
> > > > > > > > > directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> > > > > > > > > rather than asking the hardware to find the counter by RMID, we would
> > > > > > > > > not be limited to 2 counters per group/domain and the hardware would
> > > > > > > > > have the same flexibility as on MPAM.
> > > > > > > > 
> > > > > > > > In extended mode, the contents of a specific counter can be read by
> > > > > > > > setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> > > > > > > > [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> > > > > > > > QM_CTR will then return the contents of the specified counter.
> > > > > > > > 
> > > > > > > > It is documented below.
> > > > > > > > https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
> > > > > > > >   Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
> > > > > > > > 
> > > > > > > > We previously discussed this with you (off the public list) and I
> > > > > > > > initially proposed the extended assignment mode.
> > > > > > > > 
> > > > > > > > Yes, the extended mode allows greater flexibility by enabling multiple
> > > > > > > > counters to be assigned to the same group, rather than being limited to
> > > > > > > > just two.
> > > > > > > > 
> > > > > > > > However, the challenge is that we currently lack the necessary interfaces
> > > > > > > > to configure multiple events per group. Without these interfaces, the
> > > > > > > > extended mode is not practical at this time.
> > > > > > > > 
> > > > > > > > Therefore, we ultimately agreed to use the legacy mode, as it does not
> > > > > > > > require modifications to the existing interface, allowing us to continue
> > > > > > > > using it as is.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > (I might have said something confusing in my last messages because I
> > > > > > > > > had forgotten that I switched to the extended assignment mode when
> > > > > > > > > prototyping with soft-ABMC and MPAM.)
> > > > > > > > > 
> > > > > > > > > Forcing all groups on a domain to share the same 2 counter
> > > > > > > > > configurations would not be acceptable for us, as the example I gave
> > > > > > > > > earlier is one I've already been asked about.
> > > > > > > > 
> > > > > > > > I don’t see this as a blocker. It should be considered an extension to the
> > > > > > > > current ABMC series. We can easily build on top of this series once we
> > > > > > > > finalize how to configure the multiple event interface for each group.
> > > > > > > 
> > > > > > > I don't think it is, either. Only being able to use ABMC to assign
> > > > > > > counters is fine for our use as an incremental step. My longer-term
> > > > > > > concern is the domain-scoped mbm_total_bytes_config and
> > > > > > > mbm_local_bytes_config files, but they were introduced with BMEC, so
> > > > > > > there's already an expectation that the files are present when BMEC is
> > > > > > > supported.
> > > > > > > 
> > > > > > > On ABMC hardware that also supports BMEC, I'm concerned about enabling
> > > > > > > ABMC when only the BMEC-style event configuration interface exists.
> > > > > > > The scope of my issue is just whether enabling "full" ABMC support
> > > > > > > will require an additional opt-in, since that could remove the BMEC
> > > > > > > interface. If it does, it's something we can live with.
> > > > > > 
> > > > > > As you know, this series is currently blocked without further feedback.
> > > > > > 
> > > > > > I’d like to begin reworking these patches to incorporate Peter’s feedback.
> > > > > > Any input or suggestions would be appreciated.
> > > > > > 
> > > > > > Here’s what we’ve learned so far:
> > > > > > 
> > > > > > 1. Assignments should be independent of BMEC.
> > > > > > > 2. We should be able to assign multiple event types to a counter (e.g.,
> > > > > > > read, write, VictimBW, etc.). This is also called a shared counter.
> > > > > > 3. There should be an option to assign events per domain.
> > > > > > 4. Currently, only two counters can be assigned per group, but the design
> > > > > > should allow flexibility to assign more in the future as the interface
> > > > > > evolves.
> > > > > > 5. Utilize the extended RMID read mode.
> > > > > > 
> > > > > > 
> > > > > > Here is my proposal using Peter's earlier example:
> > > > > > 
> > > > > > # define event configurations
> > > > > > 
> > > > > > ========================================================
> > > > > > Bits    Mnemonics       Description
> > > > > > ====   ========================================================
> > > > > > 6       VictimBW        Dirty Victims from all types of memory
> > > > > > 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
> > > > > > 4       LclSlowFill     Reads to slow memory in the local NUMA domain
> > > > > > 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
> > > > > > 2       LclNTWr         Non-temporal writes to local NUMA domain
> > > > > > 1       mtFill          Reads to memory in the non-local NUMA domain
> > > > > > 0       LclFill         Reads to memory in the local NUMA domain
> > > > > > ====    ========================================================
> > > > > > 
> > > > > > #Define flags based on combination of above event types.
> > > > > > 
> > > > > > t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> > > > > > l = LclFill, LclNTWr, LclSlowFill
> > > > > > r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > w = VictimBW,LclNTWr,RmtNTWr
> > > > > > v = VictimBW
> > > > > > 
> > > > > > > Peter suggested the following format earlier:
> > > > > > 
> > > > > > /group0/0=t;1=t
> > > > > > /group1/0=t;1=t
> > > > > > /group2/0=_;1=t
> > > > > > /group3/0=rw;1=_
> > > > > 
> > > > > After some inquiries within Google, it sounds like nobody has invested
> > > > > much into the current mbm_assign_control format yet, so it would be
> > > > > best to drop it and distribute the configuration around the filesystem
> > > > > hierarchy[1], which should allow us to produce something more flexible
> > > > > and cleaner to implement.
> > > > > 
> > > > > Roughly what I had in mind:
> > > > > 
> > > > > Use mkdir in a info/<resource>_MON subdirectory to create free-form
> > > > > names for the assignable configurations rather than being restricted
> > > > > to single letters.  In the resulting directory, populate a file where
> > > > > we can specify the set of events the config should represent. I think
> > > > > we should use symbolic names for the events rather than raw BMEC field
> > > > > values. Moving forward we could come up with portable names for common
> > > > > events and only support the BMEC names on AMD machines for users who
> > > > > want specific events and don't care about portability.
> > > > 
> > > > 
> > > > I’m still processing this. Let me start with some initial questions.
> > > > 
> > > > So, we are creating event configurations here, which seems reasonable.
> > > > 
> > > > Yes, we should use portable names and are not limited to BMEC names.
> > > > 
> > > > How many configurations should we allow? Do we know?
> > > 
> > > Do we need an upper limit?
> > 
> > I think so. This needs to be maintained in some data structure. We can
> > start with 2 default configurations for now.
> > 
> > > 
> > > > 
> > > > > 
> > > > > Next, put assignment-control file nodes in per-domain directories
> > > > > (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
> > > > > counter-configuration name into the file would then allocate a counter
> > > > > in the domain, apply the named configuration, and monitor the parent
> > > > > group-directory. We can also put a group/resource-scoped assign_* file
> > > > > higher in the hierarchy to make it easier for users who want to
> > > > > configure all domains the same for a group.
> > > > 
> > > > What is the difference between shared and exclusive?
> > > 
> > > Shared assignment[1] means that non-exclusively-assigned counters in
> > > each domain will be scheduled round-robin to the groups requesting
> > > shared access to a counter. In my tests, I assigned the counters long
> > > enough to produce a single 1-second MB/s sample for the per-domain
> > > aggregation files[2].
> > > 
> > > These do not need to be implemented immediately, but knowing that they
> > > work addresses the overhead and scalability concerns of reassigning
> > > counters and reading their values.
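The round-robin idea behind shared assignment can be sketched as follows. This is only an illustration of rotating a fixed pool of counters among the groups that requested shared access, one sampling interval at a time; it is not the prototype's implementation, and all names are hypothetical.

```python
from collections import deque

def round_robin_schedule(groups, num_counters, intervals):
    """Return, per interval, the groups holding a shared counter.

    Each interval the first `num_counters` groups in the queue get a
    counter, then move to the back so the remaining groups get a turn.
    """
    queue = deque(groups)
    schedule = []
    for _ in range(intervals):
        active = [queue[i] for i in range(min(num_counters, len(queue)))]
        schedule.append(active)
        queue.rotate(-len(active))
    return schedule

groups = ["g0", "g1", "g2", "g3", "g4"]
for tick, active in enumerate(round_robin_schedule(groups, 2, 3)):
    print(tick, active)
```

With five groups sharing two counters, every group is sampled once within three intervals, so the per-group sampling period grows with the ratio of groups to shared counters.
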
> > 
> > OK. Let's focus on exclusive assignments for now.
> > 
> > > 
> > > > 
> > > > Having three files—assign_shared, assign_exclusive, and unassign—for each
> > > > domain seems excessive. In a system with 32 groups and 12 domains, this
> > > > results in 32 × 12 × 3 = 1,152 files, which is quite large.
> > > > 
> > > > There should be a more efficient way to handle this.
> > > > 
> > > > Initially, we started with a group-level file for this interface, but it
> > > > was rejected due to the high number of sysfs calls, making it inefficient.
> > > 
> > > I had rejected it due to the high-frequency of access of a large
> > > number of files, which has since been addressed by shared assignment
> > > (or automatic reassignment) and aggregated mbps files.
> > 
> > I think we should address this as well. Creating three extra files for
> > each group isn’t ideal when there are more efficient alternatives.
> > 
> > > 
> > > > 
> > > > Additionally, how can we list all assignments with a single sysfs call?
> > > > 
> > > > That was another problem we need to address.
> > > 
> > > This is not a requirement I was aware of. If the user forgot where
> > > they assigned counters (or forgot to disable auto-assignment), they
> > > can read multiple sysfs nodes to remind themselves.
> > 
> > I suggest we provide users with an option to list the assignments
> > of all groups in a single command. As the number of groups increases, it
> > becomes cumbersome to query each group individually.
> > 
> > To achieve this, we can reuse our existing mbm_assign_control interface
> > for this purpose. More details on this below.
> > 
> > > > 
> > > > 
> > > > > 
> > > > > The configuration names listed in assign_* would result in files of
> > > > > the same name in the appropriate mon_data domain directories from
> > > > > which the count values can be read.
> > > > > 
> > > > >   # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> > > > >   # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > > > >   # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > > > >   # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > > > >   # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > > > > LclFill
> > > > > LclNTWr
> > > > > LclSlowFill
> > > > 
> > > > I feel we can just have the configs. event_filter file is not required.
> > > 
> > > That's right, I forgot that we can implement kernfs_ops::open(). I was
> > > only looking at struct kernfs_syscall_ops.
> > > 
> > > > 
> > > > #cat info/L3_MON/counter_configs/mbm_local_bytes
> > > > LclFill <-rename these to generic names.
> > > > LclNTWr
> > > > LclSlowFill
> > > > 
> > > 
> > > I think portable and non-portable event names should both be available
> > > as options. There are simple bandwidth measurement mechanisms that
> > > will be applied in general, but when they turn up an issue, it can
> > > often lead to a more focused investigation, requiring more precise
> > > events.
> > 
> > I agree. We should provide both portable and non-portable event names.
> > 
> > Here is my draft proposal based on the discussion so far and reusing some
> > of the current interface. The idea is to start with a basic assignment
> > feature with options to enhance it in the future. Feel free to
> > comment/suggest.
> > 
> > 1. Event configurations will be in
> >     /sys/fs/resctrl/info/L3_MON/counter_configs/.
> > 
> >     There will be two pre-defined configurations by default.
> > 
> >     #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
> >     LclFill,LclNTWr,LclSlowFill,VictimBW,RmtSlowFill,RmtNTWr,RmtFill
> > 
> >     #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >     LclFill, LclNTWr, LclSlowFill
> > 
> > 2. Users will have options to update these configurations.
> > 
> >     #echo "LclFill, LclNTWr, RmtFill" >
> >        /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes

This part seems odd to me. Now the "mbm_local_bytes" files aren't
reporting "local_bytes" any more. They report something different,
and users only know if they come to check the options currently
configured in this file. Changing the contents without changing
the name seems confusing to me.

> > 
> >     #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >     LclFill, LclNTWr, RmtFill
> > 
> > 3. The default configurations will be used when the user mounts resctrl.
> > 
> >     mount  -t resctrl resctrl /sys/fs/resctrl/
> >     mkdir /sys/fs/resctrl/test/
> > 
> > 4. The resctrl group/domains can be in one of these assignment states.
> >     e: Exclusive
> >     s: Shared
> >     u: Unassigned
> > 
> >     Exclusive mode is supported now. Shared mode will be supported in the
> > future.
> > 
> > 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > to list the assignment state of all the groups.
> > 
> >     Format:
> >     "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
> > 
> >    # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >     test//mbm_total_bytes:0=e;1=e
> >     test//mbm_local_bytes:0=e;1=e
> >     //mbm_total_bytes:0=e;1=e
> >     //mbm_local_bytes:0=e;1=e
> > 
> > 6. Users can modify the assignment state by writing to mbm_assign_control.
> > 
> >     Format:
> >     "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
> > 
> >     #echo "test//mbm_local_bytes:0=e;1=e" >
> > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > 
> >     #echo "test//mbm_local_bytes:0=u;1=u" >
> > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > 
> >     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >     test//mbm_total_bytes:0=u;1=u
> >     test//mbm_local_bytes:0=u;1=u
> >     //mbm_total_bytes:0=e;1=e
> >     //mbm_local_bytes:0=e;1=e
> > 
> >     The corresponding events will be read in
> > 
> >     /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> >     /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
> >     /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >     /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
> >     /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
> >     /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
> >     /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
> >     /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
> > 
> > 7. In the first stage, only two configurations (mbm_total_bytes and
> > mbm_local_bytes) will be supported.
> > 
> > 8. In the future, there will be options to create multiple configurations,
> > and a corresponding directory will be created in
> > /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.

Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
directory? Like this:

# echo "LclFill, LclNTWr, RmtFill" >
        /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff

This seems OK (dependent on the user picking meaningful names for
the set of attributes picked ... but if they want to name this
monitor file "brian" then they have to live with any confusion
that they bring on themselves).

Would this involve an extension to kernfs? I don't see a function
pointer callback for file creation in kernfs_syscall_ops.
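
For reference, the mbm_assign_control line format proposed in items 5 and 6 above could be parsed roughly as follows. This is a sketch under the assumption that the format stays exactly "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>" with states e, s and u; the function name and error handling are hypothetical and not taken from the series.

```python
VALID_STATES = {"e", "s", "u"}  # exclusive, shared, unassigned

def parse_assign_line(line):
    """Parse "<CTRL_MON>/<MON>/<config>:<dom>=<state>;..." into a tuple."""
    path, sep, assignments = line.strip().rpartition(":")
    if not sep:
        raise ValueError(f"missing ':' in {line!r}")
    parts = path.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected two '/' separators in {path!r}")
    ctrl_mon, mon, config = parts
    states = {}
    for item in assignments.split(";"):
        domain, _, state = item.partition("=")
        if state not in VALID_STATES:
            raise ValueError(f"unknown assignment state {state!r}")
        states[int(domain)] = state
    return ctrl_mon, mon, config, states

print(parse_assign_line("test//mbm_local_bytes:0=e;1=u"))
print(parse_assign_line("//mbm_total_bytes:0=e;1=e"))
```

An empty CTRL_MON or MON field denotes the default group, as in the "//mbm_total_bytes" lines of the example listing.
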

> > 
> 
> I know you are all busy with multiple series going on in parallel. I am still
> waiting for input on this. It would be great if you could spend some time
> on this to see if we can find common ground on the interface.
> 
> Thanks
> Babu

-Tony

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-10 23:22                                               ` Luck, Tony
@ 2025-03-11  1:44                                                 ` Moger, Babu
  2025-03-11  3:51                                                   ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-03-11  1:44 UTC (permalink / raw)
  To: Luck, Tony
  Cc: babu.moger, Peter Newman, Chatre, Reinette, Dave Martin, corbet,
	tglx, mingo, bp, dave.hansen, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Tony,

On 3/10/2025 6:22 PM, Luck, Tony wrote:
> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>> Hi Peter,
>>>
>>> On 3/5/25 04:40, Peter Newman wrote:
>>>> Hi Babu,
>>>>
>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>
>>>>>>> Hi Peter/Reinette,
>>>>>>>
>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>> Hi Reinette,
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>     counter 0: LclFill,RmtFill
>>>>>>>>>>>>>     counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>     counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>     counter 3: VictimBW
>>>>>>>>>>>>>
>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seem to
>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>
>>>>>>>>>>>>    # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>    # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>
>>>>>>>>>>>>    r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>    w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>
>>>>>>>>>>>>    # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>    t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>
>>>>>>>>>>>>    /group0/0=t;1=t
>>>>>>>>>>>>    /group1/0=t;1=t
>>>>>>>>>>>>    /group2/0=_;1=t
>>>>>>>>>>>>    /group3/0=rw;1=_
>>>>>>>>>>>>
>>>>>>>>>>>> - group2 is restricted to domain 1
>>>>>>>>>>>> - group3 is restricted to domain 0
>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>
>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
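
As a sanity check on the counter totals above, here is a minimal Python sketch. It is not part of the series; the helper name counters_per_domain is made up for illustration. It assumes each flag letter in an assignment consumes one counter in that domain and that "_" consumes none.

```python
from collections import Counter


def counters_per_domain(lines):
    """Count counters consumed in each domain.

    Each line looks like '/group3/0=rw;1=_': a group path followed by
    ';'-separated '<domain>=<flags>' pairs. Assumption: every flag
    letter costs one counter and '_' means nothing is assigned.
    """
    used = Counter()
    for line in lines:
        # Everything after the last '/' is the per-domain assignment list.
        _group, _, assignments = line.strip().rpartition("/")
        for item in assignments.split(";"):
            domain, _, flags = item.partition("=")
            used[int(domain)] += len(flags.replace("_", ""))
    return dict(used)


example = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]
print(counters_per_domain(example))
```

With the four example lines this gives 4 counters in domain 0 and 3 in domain 1, matching the totals quoted above.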
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>
>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>
>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>
>>>>>>>>>>>      /group0/0=t;1=t
>>>>>>>>>>>      /group1/0=t;1=t
>>>>>>>>>>>
>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>
>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>
>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>
>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>
>>>>>>>>> It is documented below.
>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>    Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>
>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>
>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>> just two.
>>>>>>>>>
>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>
>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>> using it as is.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>
>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>
>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>
>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>> supported.
>>>>>>>>
>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>
>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>
>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>
>>>>>>> Here’s what we’ve learned so far:
>>>>>>>
>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>> evolves.
>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>
>>>>>>>
>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>
>>>>>>> # define event configurations
>>>>>>>
>>>>>>> ====    ============    ========================================================
>>>>>>> Bits    Mnemonics       Description
>>>>>>> ====    ============    ========================================================
>>>>>>> 6       VictimBW        Dirty Victims from all types of memory
>>>>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>>>>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>>>>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>>>>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>>>>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>>>>>>> 0       LclFill         Reads to memory in the local NUMA domain
>>>>>>> ====    ============    ========================================================
>>>>>>>
>>>>>>> #Define flags based on combination of above event types.
>>>>>>>
>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>> v = VictimBW
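
To make the relationship between the flags and the event bits concrete, here is a rough Python sketch that turns each flag definition into a bit mask. It is illustration only: EVENT_BITS, FLAGS and event_mask are hypothetical names, and bit 1 is read as RmtFill.

```python
# Bit positions as listed in the event table (bit 1 read as RmtFill).
EVENT_BITS = {
    "LclFill": 0,
    "RmtFill": 1,
    "LclNTWr": 2,
    "RmtNTWr": 3,
    "LclSlowFill": 4,
    "RmtSlowFill": 5,
    "VictimBW": 6,
}

# Flag definitions copied from the proposal.
FLAGS = {
    "t": "LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr",
    "l": "LclFill, LclNTWr, LclSlowFill",
    "r": "LclFill,RmtFill,LclSlowFill,RmtSlowFill",
    "w": "VictimBW,LclNTWr,RmtNTWr",
    "v": "VictimBW",
}


def event_mask(spec):
    """OR together the bits of a comma-separated list of mnemonics."""
    mask = 0
    for name in spec.split(","):
        mask |= 1 << EVENT_BITS[name.strip()]
    return mask


for flag, spec in FLAGS.items():
    print(flag, hex(event_mask(spec)))
```

Under these definitions r and w are disjoint and together cover the same bits as t, which is what "t is effectively r+w" means in mask terms.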
>>>>>>>
>>>>>>> Peter suggested the following format earlier :
>>>>>>>
>>>>>>> /group0/0=t;1=t
>>>>>>> /group1/0=t;1=t
>>>>>>> /group2/0=_;1=t
>>>>>>> /group3/0=rw;1=_
>>>>>>
>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>> and cleaner to implement.
>>>>>>
>>>>>> Roughly what I had in mind:
>>>>>>
>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>> names for the assignable configurations rather than being restricted
>>>>>> to single letters.  In the resulting directory, populate a file where
>>>>>> we can specify the set of events the config should represent. I think
>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>> values. Moving forward we could come up with portable names for common
>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>> want specific events and don't care about portability.
>>>>>
>>>>>
>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>
>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>
>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>
>>>>> How many configurations should we allow? Do we know?
>>>>
>>>> Do we need an upper limit?
>>>
>>> I think so. This needs to be maintained in some data structure. We can
>>> start with 2 default configurations for now.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>> configure all domains the same for a group.
>>>>>
>>>>> What is the difference between shared and exclusive?
>>>>
>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>> each domain will be scheduled round-robin to the groups requesting
>>>> shared access to a counter. In my tests, I assigned the counters long
>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>> aggregation files[2].
>>>>
>>>> These do not need to be implemented immediately, but knowing that they
>>>> work addresses the overhead and scalability concerns of reassigning
>>>> counters and reading their values.
>>>
>>> Ok. Let's focus on exclusive assignments for now.
>>>
>>>>
>>>>>
>>>>> Having three files (assign_shared, assign_exclusive, and unassign) for each
>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>> results in 32 × 12 × 3 = 1,152 files, which is quite large.
>>>>>
>>>>> There should be a more efficient way to handle this.
>>>>>
>>>>> Initially, we started with a group-level file for this interface, but it
>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>
>>>> I had rejected it due to the high-frequency of access of a large
>>>> number of files, which has since been addressed by shared assignment
>>>> (or automatic reassignment) and aggregated mbps files.
>>>
>>> I think we should address this as well. Creating three extra files for
>>> each group isn’t ideal when there are more efficient alternatives.
>>>
>>>>
>>>>>
>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>
>>>>> That was another problem we need to address.
>>>>
>>>> This is not a requirement I was aware of. If the user forgot where
>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>> can read multiple sysfs nodes to remind themselves.
>>>
>>> I suggest we provide users with an option to list the assignments
>>> of all groups in a single command. As the number of groups increases, it
>>> becomes cumbersome to query each group individually.
>>>
>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>> for this purpose. More details on this below.
>>>
>>>>>
>>>>>
>>>>>>
>>>>>> The configuration names listed in assign_* would result in files of
>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>> which the count values can be read.
>>>>>>
>>>>>>    # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>    # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>    # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>    # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>    # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> LclFill
>>>>>> LclNTWr
>>>>>> LclSlowFill
>>>>>
>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>
>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>> only looking at struct kernfs_syscall_ops
>>>>
>>>>>
>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>> LclFill <-rename these to generic names.
>>>>> LclNTWr
>>>>> LclSlowFill
>>>>>
>>>>
>>>> I think portable and non-portable event names should both be available
>>>> as options. There are simple bandwidth measurement mechanisms that
>>>> will be applied in general, but when they turn up an issue, it can
>>>> often lead to a more focused investigation, requiring more precise
>>>> events.
>>>
>>> I agree. We should provide both portable and non-portable event names.
>>>
>>> Here is my draft proposal based on the discussion so far and reusing some
>>> of the current interface. The idea here is to start with a basic assignment
>>> feature with options to enhance it in the future. Feel free to
>>> comment/suggest.
>>>
>>> 1. Event configurations will be in
>>>      /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>
>>>      There will be two pre-defined configurations by default.
>>>
>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>      LclFill, LclNTWr,LclSlowFill,VictimBW,RmtSlowFill,RmtNTWr,RmtFill
>>>
>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>      LclFill, LclNTWr, LclSlowFill
>>>
>>> 2. Users will have options to update these configurations.
>>>
>>>      #echo "LclFill, LclNTWr, RmtFill" >
>>>         /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> 
> This part seems odd to me. Now the "mbm_local_bytes" files aren't
> reporting "local_bytes" any more. They report something different,
> and users only know if they come to check the options currently
> configured in this file. Changing the contents without changing
> the name seems confusing to me.

It is the same behaviour right now with BMEC. It is configurable.
By default it is mbm_local_bytes, but users can configure whatever they 
want to monitor using /info/L3_MON/mbm_local_bytes_config.

We can continue the same behaviour with ABMC, but the configuration will 
be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.

> 
>>>
>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>      LclFill, LclNTWr, RmtFill
>>>
>>> 3. The default configurations will be used when the user mounts resctrl.
>>>
>>>      mount  -t resctrl resctrl /sys/fs/resctrl/
>>>      mkdir /sys/fs/resctrl/test/
>>>
>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>      e: Exclusive
>>>      s: Shared
>>>      u: Unassigned
>>>
>>>      Exclusive mode is supported now. Shared mode will be supported in the
>>> future.
>>>
>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> to list the assignment state of all the groups.
>>>
>>>      Format:
>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>
>>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>      test//mbm_total_bytes:0=e;1=e
>>>      test//mbm_local_bytes:0=e;1=e
>>>      //mbm_total_bytes:0=e;1=e
>>>      //mbm_local_bytes:0=e;1=e
>>>
>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>
>>>      Format:
>>>      “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>
>>>      #echo "test//mbm_local_bytes:0=e;1=e" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>>      #echo "test//mbm_local_bytes:0=u;1=u" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>>      # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>      test//mbm_total_bytes:0=u;1=u
>>>      test//mbm_local_bytes:0=u;1=u
>>>      //mbm_total_bytes:0=e;1=e
>>>      //mbm_local_bytes:0=e;1=e
>>>
>>>      The corresponding events will be read in
>>>
>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
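
For anyone who wants to experiment with the proposed line format from user space, here is a minimal Python sketch of a parser. Nothing here is kernel code, and the helper names parse_assign_line and mon_data_paths are hypothetical. It assumes the states are limited to e, s and u as listed in item 4, that MON groups live under mon_groups/ as in the existing resctrl layout, and that only domains not in state "u" have a count worth reading.

```python
def parse_assign_line(line):
    """Split '<ctrl_mon>/<mon>/<config>:<dom>=<state>;...' into parts."""
    path, _, assignments = line.strip().partition(":")
    ctrl_mon, mon, config = path.split("/")
    states = {}
    for item in assignments.split(";"):
        domain, _, state = item.partition("=")
        if state not in ("e", "s", "u"):
            raise ValueError(f"unknown assignment state {state!r}")
        states[int(domain)] = state
    return ctrl_mon, mon, config, states


def mon_data_paths(line, root="/sys/fs/resctrl"):
    """List the mon_data files for domains that are not unassigned."""
    ctrl_mon, mon, config, states = parse_assign_line(line)
    parts = [root]
    if ctrl_mon:
        parts.append(ctrl_mon)
    if mon:
        parts += ["mon_groups", mon]
    base = "/".join(parts)
    return [
        f"{base}/mon_data/mon_L3_{dom:02d}/{config}"
        for dom, state in sorted(states.items())
        if state != "u"
    ]


print(parse_assign_line("test//mbm_local_bytes:0=e;1=e"))
print(mon_data_paths("//mbm_total_bytes:0=e;1=e"))
```

An empty CTRL_MON or MON field falls out of the split as an empty string, so "//" naturally maps to the default group at the resctrl root.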
>>>
>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>> mbm_local_bytes) will be supported.
>>>
>>> 8. In the future, there will be options to create multiple configurations,
>>> and a corresponding directory will be created in
>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
> 
> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
> directory? Like this:
> 
> # echo "LclFill, LclNTWr, RmtFill" >
>          /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
> 
> This seems OK (dependent on the user picking meaningful names for
> the set of attributes picked ... but if they want to name this
> monitor file "brian" then they have to live with any confusion
> that they bring on themselves).
> 
> Would this involve an extension to kernfs? I don't see a function
> pointer callback for file creation in kernfs_syscall_ops.
> 
>>>
>>
>> I know you are all busy with multiple series going on in parallel. I am still
>> waiting for input on this. It would be great if you could spend some time
>> on this to see if we can find common ground on the interface.
>>
>> Thanks
>> Babu
> 
> -Tony
> 


thanks
Babu

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-11  1:44                                                 ` Moger, Babu
@ 2025-03-11  3:51                                                   ` Reinette Chatre
  2025-03-11 20:35                                                     ` Moger, Babu
  0 siblings, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-03-11  3:51 UTC (permalink / raw)
  To: Moger, Babu, Luck, Tony
  Cc: babu.moger, Peter Newman, Dave Martin, corbet, tglx, mingo, bp,
	dave.hansen, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian



On 3/10/25 6:44 PM, Moger, Babu wrote:
> Hi Tony,
> 
> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>> Hi All,
>>>
>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>> Hi Peter,
>>>>
>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter/Reinette,
>>>>>>>>
>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>> Hi Babu,
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>     counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>     counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>     counter 3: VictimBW
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seem to
>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>
>>>>>>>>>>>>>    # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>    # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>    w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>>    # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>    t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>>    /group0/0=t;1=t
>>>>>>>>>>>>>    /group1/0=t;1=t
>>>>>>>>>>>>>    /group2/0=_;1=t
>>>>>>>>>>>>>    /group3/0=rw;1=_
>>>>>>>>>>>>>
>>>>>>>>>>>>> - group2 is restricted to domain 1
>>>>>>>>>>>>> - group3 is restricted to domain 0
>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>
>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>
>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>
>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>
>>>>>>>>>>>>      /group0/0=t;1=t
>>>>>>>>>>>>      /group1/0=t;1=t
>>>>>>>>>>>>
>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>
>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>
>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>
>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>
>>>>>>>>>> It is documented below.
>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>    Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>
>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>
>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>> just two.
>>>>>>>>>>
>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>
>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>> using it as is.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>
>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>
>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>
>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>> supported.
>>>>>>>>>
>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>
>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>
>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>
>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>
>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>> evolves.
>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>
>>>>>>>>
>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>
>>>>>>>> # define event configurations
>>>>>>>>
>>>>>>>> ====    ===========     ==================================================
>>>>>>>> Bits    Mnemonics       Description
>>>>>>>> ====    ===========     ==================================================
>>>>>>>> 6       VictimBW        Dirty Victims from all types of memory
>>>>>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>>>>>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>>>>>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>>>>>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>>>>>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>>>>>>>> 0       LclFill         Reads to memory in the local NUMA domain
>>>>>>>> ====    ===========     ==================================================
>>>>>>>>
>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>
>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>> v = VictimBW
>>>>>>>>
>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>
>>>>>>>> /group0/0=t;1=t
>>>>>>>> /group1/0=t;1=t
>>>>>>>> /group2/0=_;1=t
>>>>>>>> /group3/0=rw;1=_
>>>>>>>
>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>> and cleaner to implement.
>>>>>>>
>>>>>>> Roughly what I had in mind:
>>>>>>>
>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>> to single letters.  In the resulting directory, populate a file where
>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>> want specific events and don't care about portability.
>>>>>>
>>>>>>
>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>
>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>
>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>
>>>>>> How many configurations should we allow? Do we know?
>>>>>
>>>>> Do we need an upper limit?
>>>>
>>>> I think so. This needs to be maintained in some data structure. We can
>>>> start with 2 default configurations for now.

There is a big difference between no upper limit and 2. The hardware is
capable of supporting per-domain configurations so more flexibility is
certainly possible. Consider the example presented by Peter in:
https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
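
To make the per-domain counter cost concrete, here is a rough sketch that
counts the counters consumed in each domain by the group0..group3
assignment example quoted earlier in this thread. The helper name and the
rule "one counter per flag letter, '_' means none" are assumptions made
for illustration, not part of any agreed interface.

```python
# Hypothetical helper: count counters consumed per domain by assignments
# written as "/<group>/<domain>=<flags>;...". "_" means no counter is
# assigned in that domain; each flag letter is assumed to use one counter.
from collections import Counter


def counters_per_domain(lines):
    used = Counter()
    for line in lines:
        # "/group3/0=rw;1=_" -> group "group3", assignments "0=rw;1=_"
        _, _group, assignments = line.strip().split("/", 2)
        for item in assignments.split(";"):
            domain, flags = item.split("=")
            if flags != "_":
                used[int(domain)] += len(flags)
    return dict(used)


example = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]
print(counters_per_domain(example))
```

Under those assumptions the result agrees with the "4 counters in domain 0
and 3 counters in domain 1" stated for that example.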

>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>> configure all domains the same for a group.
>>>>>>
>>>>>> What is the difference between shared and exclusive?
>>>>>
>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>> aggregation files[2].
>>>>>
>>>>> These do not need to be implemented immediately, but knowing that they
>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>> counters and reading their values.
>>>>
>>>> OK. Let's focus on exclusive assignments for now.
>>>>
>>>>>
>>>>>>
>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>
>>>>>> There should be a more efficient way to handle this.
>>>>>>
>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>
>>>>> I had rejected it due to the high-frequency of access of a large
>>>>> number of files, which has since been addressed by shared assignment
>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>
>>>> I think we should address this as well. Creating three extra files for
>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>
>>>>>
>>>>>>
>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>
>>>>>> That was another problem we need to address.
>>>>>
>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>> can read multiple sysfs nodes to remind themselves.
>>>>
>>>> I suggest we provide users with an option to list the assignments
>>>> of all groups in a single command. As the number of groups increases, it
>>>> becomes cumbersome to query each group individually.
>>>>
>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>> for this purpose. More details on this below.
>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>> which the count values can be read.
>>>>>>>
>>>>>>>    # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>    # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>    # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>    # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>    # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>> LclFill
>>>>>>> LclNTWr
>>>>>>> LclSlowFill
>>>>>>
>>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>>
>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>> only looking at struct kernfs_syscall_ops
>>>>>
>>>>>>
>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>> LclFill <-rename these to generic names.
>>>>>> LclNTWr
>>>>>> LclSlowFill
>>>>>>
>>>>>
>>>>> I think portable and non-portable event names should both be available
>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>> will be applied in general, but when they turn up an issue, it can
>>>>> often lead to a more focused investigation, requiring more precise
>>>>> events.
>>>>
>>>> I agree. We should provide both portable and non-portable event names.
>>>>
>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>> of the current interface. The idea here is to start with a basic assignment
>>>> feature with options to enhance it in the future. Feel free to
>>>> comment/suggest.
>>>>
>>>> 1. Event configurations will be in
>>>>      /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>
>>>>      There will be two pre-defined configurations by default.
>>>>
>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>      LclFill, LclNTWr,LclSlowFill,VictimBW,RmtSlowFill,RmtNTWr,RmtFill
>>>>
>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>      LclFill, LclNTWr, LclSlowFill
>>>>
>>>> 2. Users will have options to update these configurations.
>>>>
>>>>      #echo "LclFill, LclNTWr, RmtFill" >
>>>>         /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>
>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>> reporting "local_bytes" any more. They report something different,
>> and users only know if they come to check the options currently
>> configured in this file. Changing the contents without changing
>> the name seems confusing to me.
> 
> It is the same behaviour right now with BMEC. It is configurable.
> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
> 
> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.

This could be supported by following Peter's original proposal where the name
of the counter configuration is provided by the user via a mkdir:
https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/

As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
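
As a rough illustration of what pre-populated configurations could hold,
the sketch below maps event mnemonics to the bit positions from the event
table quoted earlier and computes the mask for each named configuration.
The dictionary layout and function name are hypothetical; only the
mnemonics and bit numbers come from this thread.

```python
# Bit positions taken from the event table quoted in this thread.
EVENT_BITS = {
    "LclFill": 0, "RmtFill": 1, "LclNTWr": 2, "RmtNTWr": 3,
    "LclSlowFill": 4, "RmtSlowFill": 5, "VictimBW": 6,
}


def config_mask(events):
    """Return the bitmask for a comma-separated list of event mnemonics."""
    mask = 0
    for name in events.split(","):
        mask |= 1 << EVENT_BITS[name.strip()]
    return mask


# Illustrative pre-populated configurations keyed by their proposed names.
configs = {
    "mbm_total_bytes":
        "LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr",
    "mbm_local_bytes": "LclFill, LclNTWr, LclSlowFill",
}
for name, events in configs.items():
    print(f"{name}: {config_mask(events):#04x}")
```

A user-created configuration (via mkdir) would simply be another entry
with its own event list.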

> 
>>
>>>>
>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>      LclFill, LclNTWr, RmtFill
>>>>
>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>
>>>>      mount  -t resctrl resctrl /sys/fs/resctrl/
>>>>      mkdir /sys/fs/resctrl/test/
>>>>
>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>      e: Exclusive
>>>>      s: Shared
>>>>      u: Unassigned
>>>>
>>>>      Exclusive mode is supported now. Shared mode will be supported in the
>>>> future.
>>>>
>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> to list the assignment state of all the groups.
>>>>
>>>>      Format:
>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>
>>>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>      test//mbm_total_bytes:0=e;1=e
>>>>      test//mbm_local_bytes:0=e;1=e
>>>>      //mbm_total_bytes:0=e;1=e
>>>>      //mbm_local_bytes:0=e;1=e

This would make mbm_assign_control even more unwieldy and quicker to exceed a
page of data (these examples never seem to reflect those AMD systems with the many
L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
and solved when/if going this route.
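
A back-of-the-envelope sketch of how quickly such a listing grows, assuming
one line per group/configuration pair in the proposed format, every domain
shown as exclusively assigned, and hypothetical group names "group<N>":

```python
PAGE_SIZE = 4096  # the 4KB limit mentioned above


def listing_size(num_groups, num_domains,
                 configs=("mbm_total_bytes", "mbm_local_bytes")):
    """Bytes needed to list every group/configuration pair, one per line."""
    domains = ";".join(f"{d}=e" for d in range(num_domains))
    total = 0
    for g in range(num_groups):
        for cfg in configs:
            # Hypothetical CTRL_MON group name with an empty MON group.
            line = f"group{g}//{cfg}:{domains}\n"
            total += len(line)
    return total


for groups, doms in [(2, 2), (32, 12)]:
    size = listing_size(groups, doms)
    verdict = "fits in one page" if size <= PAGE_SIZE else "exceeds one page"
    print(f"{groups} groups, {doms} domains: {size} bytes, {verdict}")
```

The 32 groups and 12 domains case is the one mentioned earlier in this
thread; longer group names or additional configurations only make it worse.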

There seem to be two opinions about this file at the moment. Would it be possible to
summarize the discussion with pros/cons raised to make an informed selection?
I understand that Google as represented by Peter no longer requires/requests this
file but the motivation for this change seems new and does not seem to reduce the
original motivation for this file. We may also want to separate requirements for reading
from and writing to this file.
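
For the write side, a minimal parsing sketch of one line in the proposed
"<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
format. The grammar is inferred from the "Format:" line in the proposal and
the function name is hypothetical; this is not a statement about how
resctrl parses the file.

```python
import re

# Grammar assumed from the proposal's "Format:" line; not a final ABI.
LINE_RE = re.compile(
    r"^(?P<ctrl>[^/]*)/(?P<mon>[^/]*)/(?P<config>[^:]+):(?P<domains>.+)$"
)
VALID_STATES = {"e", "s", "u"}  # exclusive, shared, unassigned


def parse_assignment(line):
    """Split one line into (ctrl group, mon group, config, {domain: state})."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"malformed line: {line!r}")
    states = {}
    for item in m.group("domains").split(";"):
        domain, state = item.split("=")
        if state not in VALID_STATES:
            raise ValueError(f"unknown assignment state: {state!r}")
        states[int(domain)] = state
    return m.group("ctrl"), m.group("mon"), m.group("config"), states


print(parse_assignment("test//mbm_local_bytes:0=e;1=u"))
print(parse_assignment("//mbm_total_bytes:0=e;1=e"))
```

Reading could reuse the same tuple shape, which may help when comparing the
requirements for the two directions.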

>>>>
>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>
>>>>      Format:
>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>
>>>>      #echo "test//mbm_local_bytes:0=e;1=e" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>>      #echo "test//mbm_local_bytes:0=u;1=u" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>>      # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>      test//mbm_total_bytes:0=u;1=u
>>>>      test//mbm_local_bytes:0=u;1=u
>>>>      //mbm_total_bytes:0=e;1=e
>>>>      //mbm_local_bytes:0=e;1=e
>>>>
>>>>      The corresponding events will be read in
>>>>
>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>
>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>> mbm_local_bytes) will be supported.
>>>>
>>>> 8. In the future, there will be options to create multiple configurations
>>>> and corresponding directory will be created in
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>
>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>> directory? Like this:
>>
>> # echo "LclFill, LclNTWr, RmtFill" >
>>          /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>
>> This seems OK (dependent on the user picking meaningful names for
>> the set of attributes picked ... but if they want to name this
>> monitor file "brian" then they have to live with any confusion
>> that they bring on themselves).
>>
>> Would this involve an extension to kernfs? I don't see a function
>> pointer callback for file creation in kernfs_syscall_ops.
>>
>>>>
>>>
>>> I know you are all busy with multiple series going on in parallel. I am still
>>> waiting for input on this. It would be great if you could spend some time
>>> on this to see if we can find common ground on the interface.
>>>
>>> Thanks
>>> Babu
>>
>> -Tony
>>
> 
> 
> thanks
> Babu

Reinette



* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-11  3:51                                                   ` Reinette Chatre
@ 2025-03-11 20:35                                                     ` Moger, Babu
  2025-03-11 20:53                                                       ` Luck, Tony
  2025-03-12 15:07                                                       ` Reinette Chatre
  0 siblings, 2 replies; 209+ messages in thread
From: Moger, Babu @ 2025-03-11 20:35 UTC (permalink / raw)
  To: Reinette Chatre, Moger, Babu, Luck, Tony
  Cc: Peter Newman, Dave Martin, corbet, tglx, mingo, bp, dave.hansen,
	x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi All,

On 3/10/25 22:51, Reinette Chatre wrote:
> 
> 
> On 3/10/25 6:44 PM, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>> Hi All,
>>>>
>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>> Hi Peter,
>>>>>
>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>> Hi Babu,
>>>>>>
>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>
>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>> Hi Babu,
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>     counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>     counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>     counter 3: VictimBW
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>    # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>    w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>    t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    /group0/0=t;1=t
>>>>>>>>>>>>>>    /group1/0=t;1=t
>>>>>>>>>>>>>>    /group2/0=_;1=t
>>>>>>>>>>>>>>    /group3/0=rw;1=_
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - group2 is restricted to domain 1
>>>>>>>>>>>>>> - group3 is restricted to domain 0
>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>
>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>
>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>>      /group0/0=t;1=t
>>>>>>>>>>>>>      /group1/0=t;1=t
>>>>>>>>>>>>>
>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>
>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>
>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>
>>>>>>>>>>> It is documented below.
>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>    Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>
>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>
>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>> just two.
>>>>>>>>>>>
>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>
>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>> using it as is.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>
>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>
>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>
>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>> supported.
>>>>>>>>>>
>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>
>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>
>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>
>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>
>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>> evolves.
>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>
>>>>>>>>> # define event configurations
>>>>>>>>>
>>>>>>>>> ====    ============    ==================================================
>>>>>>>>> Bits    Mnemonics       Description
>>>>>>>>> ====    ============    ==================================================
>>>>>>>>> 6       VictimBW        Dirty Victims from all types of memory
>>>>>>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>>>>>>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>>>>>>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>>>>>>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>>>>>>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>>>>>>>>> 0       LclFill         Reads to memory in the local NUMA domain
>>>>>>>>> ====    ============    ==================================================
>>>>>>>>>
>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>
>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> v = VictimBW
>>>>>>>>>
>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>
>>>>>>>>> /group0/0=t;1=t
>>>>>>>>> /group1/0=t;1=t
>>>>>>>>> /group2/0=_;1=t
>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>
>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>> and cleaner to implement.
>>>>>>>>
>>>>>>>> Roughly what I had in mind:
>>>>>>>>
>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>> to single letters.  In the resulting directory, populate a file where
>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>> want specific events and don't care about portability.
>>>>>>>
>>>>>>>
>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>
>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>
>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>
>>>>>>> How many configurations should we allow? Do we know?
>>>>>>
>>>>>> Do we need an upper limit?
>>>>>
>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>> start with 2 default configurations for now.
> 
> There is a big difference between no upper limit and 2. The hardware is
> capable of supporting per-domain configurations so more flexibility is
> certainly possible. Consider the example presented by Peter in:
> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
> 
>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>> configure all domains the same for a group.
>>>>>>>
>>>>>>> What is the difference between shared and exclusive?
>>>>>>
>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>> aggregation files[2].
>>>>>>
>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>> counters and reading their values.
>>>>>
>>>>> Ok. Let's focus on exclusive assignments for now.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>
>>>>>>> There should be a more efficient way to handle this.
>>>>>>>
>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>
>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>> number of files, which has since been addressed by shared assignment
>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>
>>>>> I think we should address this as well. Creating three extra files for
>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>
>>>>>>> That was another problem we need to address.
>>>>>>
>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>
>>>>> I suggest we provide users with an option to list the assignments
>>>>> of all groups in a single command. As the number of groups increases, it
>>>>> becomes cumbersome to query each group individually.
>>>>>
>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>> for this purpose. More details on this below.
>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>> which the count values can be read.
>>>>>>>>
>>>>>>>>    # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>    # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>    # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>    # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>    # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>> LclFill
>>>>>>>> LclNTWr
>>>>>>>> LclSlowFill
>>>>>>>
>>>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>>>
>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>
>>>>>>>
>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> LclFill <-rename these to generic names.
>>>>>>> LclNTWr
>>>>>>> LclSlowFill
>>>>>>>
>>>>>>
>>>>>> I think portable and non-portable event names should both be available
>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>> events.
>>>>>
>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>
>>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>> feature with options to enhance it in the future. Feel free to
>>>>> comment/suggest.
>>>>>
>>>>> 1. Event configurations will be in
>>>>>      /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>
>>>>>      There will be two pre-defined configurations by default.
>>>>>
>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>      LclFill,LclNTWr,LclSlowFill,VictimBW,RmtSlowFill,RmtNTWr,RmtFill
>>>>>
>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>      LclFill, LclNTWr, LclSlowFill
>>>>>
>>>>> 2. Users will have options to update these configurations.
>>>>>
>>>>>      #echo "LclFill, LclNTWr, RmtFill" >
>>>>>         /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>
>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>> reporting "local_bytes" any more. They report something different,
>>> and users only know if they come to check the options currently
>>> configured in this file. Changing the contents without changing
>>> the name seems confusing to me.
>>
>> It is the same behaviour right now with BMEC. It is configurable.
>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>
>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
> 
> This could be supported by following Peter's original proposal where the name
> of the counter configuration is provided by the user via a mkdir:
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
> 
> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.

Sure, we can do that. For the first phase, I was thinking of providing just
the default pre-defined configurations and an option to update them.

We can add the mkdir support later. That way we can provide basic ABMC
support without the extra code complexity that mkdir support brings.

> 
>>
>>>
>>>>>
>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>      LclFill, LclNTWr, RmtFill
>>>>>
>>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>>
>>>>>      mount  -t resctrl resctrl /sys/fs/resctrl/
>>>>>      mkdir /sys/fs/resctrl/test/
>>>>>
>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>      e: Exclusive
>>>>>      s: Shared
>>>>>      u: Unassigned
>>>>>
>>>>>      Exclusive mode is supported now. Shared mode will be supported in the
>>>>> future.
>>>>>
>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> to list the assignment state of all the groups.
>>>>>
>>>>>      Format:
>>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>
>>>>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>      test//mbm_total_bytes:0=e;1=e
>>>>>      test//mbm_local_bytes:0=e;1=e
>>>>>      //mbm_total_bytes:0=e;1=e
>>>>>      //mbm_local_bytes:0=e;1=e
> 
> This would make mbm_assign_control even more unwieldy and quicker to exceed a
> page of data (these examples never seem to reflect those AMD systems with the many
> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
> and solved when/if going this route.

This problem is not specific to this series. I feel it is a generic problem
for many similar interfaces. I don't know how it is addressed elsewhere; I
may have to investigate this. Any pointers would be helpful.


> 
> There seem to be two opinions about this file at the moment. Would it be possible to
> summarize the discussion with pros/cons raised to make an informed selection?
> I understand that Google as represented by Peter no longer requires/requests this
> file but the motivation for this change seems new and does not seem to reduce the
> original motivation for this file. We may also want to separate requirements for reading
> from and writing to this file.

Yes. We can just use mbm_assign_control for reading the assignment states.

Summary: We have two proposals.

First one from Peter:

https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/


Pros:
a.  Allows flexible creation of free-form names for assignable
configurations, stored in info/L3_MON/counter_configs/.

b.  Events can be accessed using corresponding free-form names in the
mon_data directory, making it clear to users what each event represents.


Cons:
a. Requires three separate files for assignment in each group
(assign_exclusive, assign_shared, unassign), which might be excessive.

b. No built-in listing support, meaning users must query each group
individually to check assignment states.
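
To illustrate con (b), a user-space listing under the first proposal could
look roughly like the sketch below. It assumes that reading an
assign_exclusive file returns the assigned configuration names separated by
whitespace, which the thread does not specify, and it runs against a mock
directory tree rather than a real resctrl mount.

```python
import tempfile
from pathlib import Path


def collect_assignments(root):
    """Return {(group, domain): [configuration names]} by reading every
    assign_exclusive file below root. The default group is reported as "".
    The file contents (whitespace-separated names) are an assumption.
    """
    root = Path(root)
    result = {}
    for f in sorted(root.glob("**/mon_data/mon_L3_*/assign_exclusive")):
        domain_dir = f.parent
        group_dir = domain_dir.parent.parent
        group = "" if group_dir == root else group_dir.relative_to(root).as_posix()
        result[(group, domain_dir.name)] = f.read_text().split()
    return result


# Mock layout standing in for /sys/fs/resctrl (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    mock = [
        ("", "mon_L3_00", "mbm_total_bytes\n"),
        ("test", "mon_L3_00", "mbm_local_bytes\n"),
        ("test", "mon_L3_01", ""),
    ]
    for group, domain, names in mock:
        d = Path(tmp, group, "mon_data", domain)
        d.mkdir(parents=True)
        (d / "assign_exclusive").write_text(names)

    assignments = collect_assignments(tmp)
    for (group, domain), names in assignments.items():
        print(f"{group or '/'} {domain}: {', '.join(names) or '(none)'}")
```

Each group and domain costs one open/read/close, which is the per-group
querying overhead mentioned above.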


Second Proposal (Mine)

https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/

Pros:

a. Maintains the flexibility of free-form names for assignable
configurations (info/L3_MON/counter_configs/).

b. Events remain accessible via free-form names in mon_data, ensuring
clarity on their purpose.

c. Adds the ability to list assignment states for all groups in a single
command.

Cons:
a. The listing can exceed a single page (4KB) when there are a large number
of groups and domains, and handling that adds code complexity.


Third Option: A Hybrid Approach

We could combine elements from both proposals:

a. Retain the free-form naming approach for assignable configurations in
info/L3_MON/counter_configs/.

b. Use the assignment method from the first proposal:
   $mkdir test
   $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive

c. Introduce listing support via the info/L3_MON/mbm_assign_control
interface, enabling users to read assignment states for all groups in one
place. Only reads would be supported.
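
As a sketch of how a tool could consume such a read-only listing, the
snippet below parses lines in the
"<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
format and tallies exclusive assignments per domain. The function names are
made up for illustration, and the tally assumes each exclusive assignment
uses exactly one counter in that domain.

```python
from collections import Counter


def parse_assign_line(line):
    """Split one listing line into (ctrl_mon, mon, config, {domain_id: state})."""
    path, _, domain_part = line.strip().rpartition(":")
    ctrl_mon, mon, config = path.split("/")
    states = {}
    for item in domain_part.split(";"):
        domain_id, state = item.split("=")
        states[int(domain_id)] = state
    return ctrl_mon, mon, config, states


def counters_in_use(lines):
    """Count exclusive ('e') assignments per domain (one counter each, assumed)."""
    used = Counter()
    for line in lines:
        *_, states = parse_assign_line(line)
        for domain_id, state in states.items():
            if state == "e":
                used[domain_id] += 1
    return used


# Listing taken from the example in item 6 of the draft proposal.
sample = [
    "test//mbm_total_bytes:0=u;1=u",
    "test//mbm_local_bytes:0=u;1=u",
    "//mbm_total_bytes:0=e;1=e",
    "//mbm_local_bytes:0=e;1=e",
]

for domain_id, count in sorted(counters_in_use(sample).items()):
    print(f"domain {domain_id}: {count} exclusive assignment(s)")
```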


> 
>>>>>
>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>
>>>>>      Format:
>>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>
>>>>>      #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>
>>>>>      #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>
>>>>>      # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>      test//mbm_total_bytes:0=u;1=u
>>>>>      test//mbm_local_bytes:0=u;1=u
>>>>>      //mbm_total_bytes:0=e;1=e
>>>>>      //mbm_local_bytes:0=e;1=e
>>>>>
>>>>>      The corresponding events will be read in
>>>>>
>>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>
>>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>>> mbm_local_bytes) will be supported.
>>>>>
>>>>> 8. In the future, there will be options to create multiple configurations
>>>>> and corresponding directory will be created in
>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>
>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>> directory? Like this:
>>>
>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>          /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>
>>> This seems OK (dependent on the user picking meaningful names for
>>> the set of attributes picked ... but if they want to name this
>>> monitor file "brian" then they have to live with any confusion
>>> that they bring on themselves).
>>>
>>> Would this involve an extension to kernfs? I don't see a function
>>> pointer callback for file creation in kernfs_syscall_ops.
>>>
>>>>>
>>>>
>>>> I know you are all busy with multiple series going on in parallel. I am still
>>>> waiting for input on this. It would be great if you could spend some time
>>>> on this to see if we can find common ground on the interface.
>>>>
>>>> Thanks
>>>> Babu
>>>
>>> -Tony
>>>
>>
>>
>> thanks
>> Babu
> 
> Reinette
> 
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-11 20:35                                                     ` Moger, Babu
@ 2025-03-11 20:53                                                       ` Luck, Tony
  2025-03-12 15:14                                                         ` Moger, Babu
  2025-03-12 15:15                                                         ` Reinette Chatre
  2025-03-12 15:07                                                       ` Reinette Chatre
  1 sibling, 2 replies; 209+ messages in thread
From: Luck, Tony @ 2025-03-11 20:53 UTC (permalink / raw)
  To: Moger, Babu
  Cc: Reinette Chatre, Moger, Babu, Peter Newman, Dave Martin, corbet,
	tglx, mingo, bp, dave.hansen, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

On Tue, Mar 11, 2025 at 03:35:28PM -0500, Moger, Babu wrote:
> Hi All,
> 
> On 3/10/25 22:51, Reinette Chatre wrote:
> > 
> > 
> > On 3/10/25 6:44 PM, Moger, Babu wrote:
> >> Hi Tony,
> >>
> >> On 3/10/2025 6:22 PM, Luck, Tony wrote:
> >>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
> >>>> Hi All,
> >>>>
> >>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
> >>>>> Hi Peter,
> >>>>>
> >>>>> On 3/5/25 04:40, Peter Newman wrote:
> >>>>>> Hi Babu,
> >>>>>>
> >>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>>>>>>
> >>>>>>> Hi Peter,
> >>>>>>>
> >>>>>>> On 3/4/25 10:44, Peter Newman wrote:
> >>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Peter/Reinette,
> >>>>>>>>>
> >>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
> >>>>>>>>>> Hi Babu,
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>
> >>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
> >>>>>>>>>>>> Hi Reinette,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> >>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
> >>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> >>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>>>>>>>>>>>>>>> <value>
> >>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>>>>>>>>>>>>>>> for.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>>>>>>>>>>>>>>> customers.
> >>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
> >>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>>>>>>>>>>>>>>> event names.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thank you for clarifying.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
> >>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
> >>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
> >>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
> >>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>>>>>>>>>>>>>>> writes in ABMC would look like...
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (per domain)
> >>>>>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
> >>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>>>>>>>>>>>>>>> configuration is a requirement?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
> >>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
> >>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
> >>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
> >>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
> >>>>>>>>>>>>>>>> there's less pressure on the counters.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
> >>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
> >>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
> >>>>>>>>>>>>>>>> many counters the group needs in each domain.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
> >>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
> >>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
> >>>>>>>>>>>>>>> of the hardware.
> >>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
> >>>>>>>>>>>>>>> earlier example copied below:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (per domain)
> >>>>>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
> >>>>>>>>>>>>>>> I understand it:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>>    domain 0:
> >>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>    domain 1:
> >>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>>    domain 0:
> >>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>    domain 1:
> >>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
> >>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >>>>>>>>>>>>>>> in domain 1, resulting in:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>>    domain 0:
> >>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>>    domain 0:
> >>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>    domain 1:
> >>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
> >>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>>    domain 0:
> >>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>>    domain 0:
> >>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>    domain 1:
> >>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill
> >>>>>>>>>>>>>>>     counter 1: LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>     counter 2: LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>     counter 3: VictimBW
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seem to
> >>>>>>>>>>>>>>> match the earlier goals of (a) choose events counted by each counter and
> >>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
> >>>>>>>>>>>>>>> understand it, the above does contradict global counter configuration though.
> >>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
> >>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
> >>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
> >>>>>>>>>>>>>> groupings to count.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    # define global configurations (in ABMC terms), not necessarily in this
> >>>>>>>>>>>>>>    # syntax and probably not in the mbm_assign_control file.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>    w=VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    # legacy "total" configuration, effectively r+w
> >>>>>>>>>>>>>>    t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    /group0/0=t;1=t
> >>>>>>>>>>>>>>    /group1/0=t;1=t
> >>>>>>>>>>>>>>    /group2/0=_;1=t
> >>>>>>>>>>>>>>    /group3/0=rw;1=_
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> - group2 is restricted to domain 0
> >>>>>>>>>>>>>> - group3 is restricted to domain 1
> >>>>>>>>>>>>>> - the rest are unrestricted
> >>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I see. Thank you for the example.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
> >>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>      /group0/0=t;1=t
> >>>>>>>>>>>>>      /group1/0=t;1=t
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
> >>>>>>>>>>>>> be configured differently in each domain.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
> >>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
> >>>>>>>>>>>> domain use the same configurations and are limited to two events per
> >>>>>>>>>>>> group and a per-group mode where every group can be configured and
> >>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
> >>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> >>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
> >>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> >>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
> >>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
> >>>>>>>>>>>> have the same flexibility as on MPAM.
> >>>>>>>>>>>
> >>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
> >>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> >>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> >>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
> >>>>>>>>>>>
> >>>>>>>>>>> It is documented below.
> >>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
> >>>>>>>>>>>    Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
> >>>>>>>>>>>
> >>>>>>>>>>> We previously discussed this with you (off the public list) and I
> >>>>>>>>>>> initially proposed the extended assignment mode.
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
> >>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
> >>>>>>>>>>> just two.
> >>>>>>>>>>>
> >>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
> >>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
> >>>>>>>>>>> extended mode is not practical at this time.
> >>>>>>>>>>>
> >>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
> >>>>>>>>>>> require modifications to the existing interface, allowing us to continue
> >>>>>>>>>>> using it as is.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> (I might have said something confusing in my last messages because I
> >>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
> >>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
> >>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
> >>>>>>>>>>>> earlier is one I've already been asked about.
> >>>>>>>>>>>
> >>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
> >>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
> >>>>>>>>>>> finalize how to configure the multiple event interface for each group.
> >>>>>>>>>>
> >>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
> >>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
> >>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
> >>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
> >>>>>>>>>> there's already an expectation that the files are present when BMEC is
> >>>>>>>>>> supported.
> >>>>>>>>>>
> >>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
> >>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
> >>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
> >>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
> >>>>>>>>>> interface. If it does, it's something we can live with.
> >>>>>>>>>
> >>>>>>>>> As you know, this series is currently blocked without further feedback.
> >>>>>>>>>
> >>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
> >>>>>>>>> Any input or suggestions would be appreciated.
> >>>>>>>>>
> >>>>>>>>> Here’s what we’ve learned so far:
> >>>>>>>>>
> >>>>>>>>> 1. Assignments should be independent of BMEC.
> >>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
> >>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
> >>>>>>>>> 3. There should be an option to assign events per domain.
> >>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
> >>>>>>>>> should allow flexibility to assign more in the future as the interface
> >>>>>>>>> evolves.
> >>>>>>>>> 5. Utilize the extended RMID read mode.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Here is my proposal using Peter's earlier example:
> >>>>>>>>>
> >>>>>>>>> # define event configurations
> >>>>>>>>>
> >>>>>>>>> ====    ===========     ==================================================
> >>>>>>>>> Bits    Mnemonics       Description
> >>>>>>>>> ====    ===========     ==================================================
> >>>>>>>>> 6       VictimBW        Dirty Victims from all types of memory
> >>>>>>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
> >>>>>>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
> >>>>>>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
> >>>>>>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
> >>>>>>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
> >>>>>>>>> 0       LclFill         Reads to memory in the local NUMA domain
> >>>>>>>>> ====    ===========     ==================================================
> >>>>>>>>>
> >>>>>>>>> #Define flags based on combination of above event types.
> >>>>>>>>>
> >>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
> >>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> v = VictimBW
> >>>>>>>>>
> >>>>>>>>> Peter suggested the following format earlier:
> >>>>>>>>>
> >>>>>>>>> /group0/0=t;1=t
> >>>>>>>>> /group1/0=t;1=t
> >>>>>>>>> /group2/0=_;1=t
> >>>>>>>>> /group3/0=rw;1=_
> >>>>>>>>
> >>>>>>>> After some inquiries within Google, it sounds like nobody has invested
> >>>>>>>> much into the current mbm_assign_control format yet, so it would be
> >>>>>>>> best to drop it and distribute the configuration around the filesystem
> >>>>>>>> hierarchy[1], which should allow us to produce something more flexible
> >>>>>>>> and cleaner to implement.
> >>>>>>>>
> >>>>>>>> Roughly what I had in mind:
> >>>>>>>>
> >>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
> >>>>>>>> names for the assignable configurations rather than being restricted
> >>>>>>>> to single letters.  In the resulting directory, populate a file where
> >>>>>>>> we can specify the set of events the config should represent. I think
> >>>>>>>> we should use symbolic names for the events rather than raw BMEC field
> >>>>>>>> values. Moving forward we could come up with portable names for common
> >>>>>>>> events and only support the BMEC names on AMD machines for users who
> >>>>>>>> want specific events and don't care about portability.
> >>>>>>>
> >>>>>>>
> >>>>>>> I’m still processing this. Let me start with some initial questions.
> >>>>>>>
> >>>>>>> So, we are creating event configurations here, which seems reasonable.
> >>>>>>>
> >>>>>>> Yes, we should use portable names and are not limited to BMEC names.
> >>>>>>>
> >>>>>>> How many configurations should we allow? Do we know?
> >>>>>>
> >>>>>> Do we need an upper limit?
> >>>>>
> >>>>> I think so. This needs to be maintained in some data structure. We can
> >>>>> start with 2 default configurations for now.
> > 
> > There is a big difference between no upper limit and 2. The hardware is
> > capable of supporting per-domain configurations so more flexibility is
> > certainly possible. Consider the example presented by Peter in:
> > https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
> > 
> >>>>>>>> Next, put assignment-control file nodes in per-domain directories
> >>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
> >>>>>>>> counter-configuration name into the file would then allocate a counter
> >>>>>>>> in the domain, apply the named configuration, and monitor the parent
> >>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
> >>>>>>>> higher in the hierarchy to make it easier for users who want to
> >>>>>>>> configure all domains the same for a group.
> >>>>>>>
> >>>>>>> What is the difference between shared and exclusive?
> >>>>>>
> >>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
> >>>>>> each domain will be scheduled round-robin to the groups requesting
> >>>>>> shared access to a counter. In my tests, I assigned the counters long
> >>>>>> enough to produce a single 1-second MB/s sample for the per-domain
> >>>>>> aggregation files[2].
> >>>>>>
> >>>>>> These do not need to be implemented immediately, but knowing that they
> >>>>>> work addresses the overhead and scalability concerns of reassigning
> >>>>>> counters and reading their values.
> >>>>>
> >>>>> Ok. Let's focus on exclusive assignments for now.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
> >>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
> >>>>>>> results in 32 × 12 × 3 files, which is quite large.
> >>>>>>>
> >>>>>>> There should be a more efficient way to handle this.
> >>>>>>>
> >>>>>>> Initially, we started with a group-level file for this interface, but it
> >>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
> >>>>>>
> >>>>>> I had rejected it due to the high-frequency of access of a large
> >>>>>> number of files, which has since been addressed by shared assignment
> >>>>>> (or automatic reassignment) and aggregated mbps files.
> >>>>>
> >>>>> I think we should address this as well. Creating three extra files for
> >>>>> each group isn’t ideal when there are more efficient alternatives.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Additionally, how can we list all assignments with a single sysfs call?
> >>>>>>>
> >>>>>>> That was another problem we need to address.
> >>>>>>
> >>>>>> This is not a requirement I was aware of. If the user forgot where
> >>>>>> they assigned counters (or forgot to disable auto-assignment), they
> >>>>>> can read multiple sysfs nodes to remind themselves.
> >>>>>
> >>>>> I suggest we provide users with an option to list the assignments
> >>>>> of all groups in a single command. As the number of groups increases, it
> >>>>> becomes cumbersome to query each group individually.
> >>>>>
> >>>>> To achieve this, we can reuse our existing mbm_assign_control interface
> >>>>> for this purpose. More details on this below.
> >>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> The configuration names listed in assign_* would result in files of
> >>>>>>>> the same name in the appropriate mon_data domain directories from
> >>>>>>>> which the count values can be read.
> >>>>>>>>
> >>>>>>>>    # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> >>>>>>>>    # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>>>>>>>    # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>>>>>>>    # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>>>>>>>    # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>>>>>>> LclFill
> >>>>>>>> LclNTWr
> >>>>>>>> LclSlowFill
> >>>>>>>
> >>>>>>> I feel we can just have the configs. event_filter file is not required.
> >>>>>>
> >>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
> >>>>>> only looking at struct kernfs_syscall_ops
> >>>>>>
> >>>>>>>
> >>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
> >>>>>>> LclFill <-rename these to generic names.
> >>>>>>> LclNTWr
> >>>>>>> LclSlowFill
> >>>>>>>
> >>>>>>
> >>>>>> I think portable and non-portable event names should both be available
> >>>>>> as options. There are simple bandwidth measurement mechanisms that
> >>>>>> will be applied in general, but when they turn up an issue, it can
> >>>>>> often lead to a more focused investigation, requiring more precise
> >>>>>> events.
> >>>>>
> >>>>> I agree. We should provide both portable and non-portable event names.
> >>>>>
> >>>>> Here is my draft proposal based on the discussion so far and reusing some
> >>>>> of the current interface. The idea here is to start with a basic assignment
> >>>>> feature with options to enhance it in the future. Feel free to
> >>>>> comment/suggest.
> >>>>>
> >>>>> 1. Event configurations will be in
> >>>>>      /sys/fs/resctrl/info/L3_MON/counter_configs/.
> >>>>>
> >>>>>      There will be two pre-defined configurations by default.
> >>>>>
> >>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
> >>>>>      LclFill, LclNTWr, LclSlowFill, VictimBW, RmtSlowFill, RmtNTWr, RmtFill
> >>>>>
> >>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >>>>>      LclFill, LclNTWr, LclSlowFill
> >>>>>
> >>>>> 2. Users will have options to update these configurations.
> >>>>>
> >>>>>      #echo "LclFill, LclNTWr, RmtFill" >
> >>>>>         /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >>>
> >>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
> >>> reporting "local_bytes" any more. They report something different,
> >>> and users only know if they come to check the options currently
> >>> configured in this file. Changing the contents without changing
> >>> the name seems confusing to me.
> >>
> >> It is the same behaviour right now with BMEC. It is configurable.
> >> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
> >>
> >> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
> > 
> > This could be supported by following Peter's original proposal where the name
> > of the counter configuration is provided by the user via a mkdir:
> > https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
> > 
> > As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
> 
> Sure. We can do that. I was thinking in the first phase, just provide the
> default pre-defined configuration and option to update the configuration.
> 
> We can add the mkdir support later. That way we can provide basic ABMC
> support without too much code complexity with mkdir support.
> 
> > 
> >>
> >>>
> >>>>>
> >>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >>>>>      LclFill, LclNTWr, RmtFill
> >>>>>
> >>>>> 3. The default configurations will be used when the user mounts resctrl.
> >>>>>
> >>>>>      mount  -t resctrl resctrl /sys/fs/resctrl/
> >>>>>      mkdir /sys/fs/resctrl/test/
> >>>>>
> >>>>> 4. The resctrl group/domains can be in one of these assignment states.
> >>>>>      e: Exclusive
> >>>>>      s: Shared
> >>>>>      u: Unassigned
> >>>>>
> >>>>>      Exclusive mode is supported now. Shared mode will be supported in the
> >>>>> future.
> >>>>>
> >>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>> to list the assignment state of all the groups.
> >>>>>
> >>>>>      Format:
> >>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
> >>>>>
> >>>>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>      test//mbm_total_bytes:0=e;1=e
> >>>>>      test//mbm_local_bytes:0=e;1=e
> >>>>>      //mbm_total_bytes:0=e;1=e
> >>>>>      //mbm_local_bytes:0=e;1=e
> > 
> > This would make mbm_assign_control even more unwieldy and quicker to exceed a
> > page of data (these examples never seem to reflect those AMD systems with the many
> > L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
> > and solved when/if going this route.
> 
> This problem is not specific to this series. I feel it is a generic problem
> for many similar interfaces. I don't know how it is addressed. I may
> have to investigate this. Any pointers would be helpful.
> 
> 
> > 
> > There seem to be two opinions about this file at the moment. Would it be possible to
> > summarize the discussion with pros/cons raised to make an informed selection?
> > I understand that Google as represented by Peter no longer requires/requests this
> > file but the motivation for this change seems new and does not seem to reduce the
> > original motivation for this file. We may also want to separate requirements for reading
> > from and writing to this file.
> 
> Yea. We can just use mbm_assign_control for reading the assignment states.
> 
> Summary: We have two proposals.
> 
> First one from Peter:
> 
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
> 
> 
> Pros:
> a.  Allows flexible creation of free-form names for assignable
> configurations, stored in info/L3_MON/counter_configs/.
> 
> b.  Events can be accessed using corresponding free-form names in the
> mon_data directory, making it clear to users what each event represents.
> 
> 
> Cons:
> a. Requires three separate files for assignment in each group
> (assign_exclusive, assign_shared, unassign), which might be excessive.
> 
> b. No built-in listing support, meaning users must query each group
> individually to check assignment states.

How big of a problem is this in reality? I'd assume that users of this
feature would only reassign counter attributes at some slow rate (set
up counters, measure for at least a few seconds, then set up for next
measurement). Cost to open/read/close a few hundred kernfs files isn't
very high. Biggest cost might be hogging the resctrl mutex which would
cause jitter in the tasks reading data from resctrl monitors.

Anyone doing this at scale should be able to keep track of what they set,
so wouldn't need to read at all. I'm not a big believer in "multiple
agents independently tweaking resctrl without knowledge of each other".
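
As a rough illustration of the "read a few hundred kernfs files" approach, here is a minimal Python sketch that rebuilds a listing from per-domain files. Everything about the layout is an assumption taken from the proposals quoted in this thread, not an existing interface: the `assign_exclusive` file name, its contents (one configuration name per whitespace-separated token), and the helper names `read_group()` and `list_assignments()` are all hypothetical. Only the exclusive and unassigned states are modelled.

```python
import os
import re
from collections import defaultdict

# Assumed per-domain directory naming, e.g. mon_data/mon_L3_00.
DOMAIN_RE = re.compile(r"mon_L3_(\d+)")


def read_group(group_dir):
    """Return (domain_ids, {config_name: set(domain_ids)}) for one group.

    Reads the hypothetical mon_data/mon_L3_XX/assign_exclusive file in
    every domain directory of the group.
    """
    domains = []
    assigned = defaultdict(set)
    mon_data = os.path.join(group_dir, "mon_data")
    if not os.path.isdir(mon_data):
        return domains, assigned
    for entry in sorted(os.listdir(mon_data)):
        m = DOMAIN_RE.fullmatch(entry)
        if not m:
            continue
        dom = int(m.group(1))
        domains.append(dom)
        try:
            with open(os.path.join(mon_data, entry, "assign_exclusive")) as f:
                names = f.read().split()
        except FileNotFoundError:
            names = []
        for name in names:
            assigned[name].add(dom)
    return domains, assigned


def list_assignments(root):
    """Build "<ctrl>/<mon>/<config>:<dom>=<state>;..." lines for all groups."""
    skip = {"info", "mon_data", "mon_groups"}
    ctrl_groups = [""] + sorted(
        d for d in os.listdir(root)
        if d not in skip and os.path.isdir(os.path.join(root, d))
    )
    lines = []
    for ctrl in ctrl_groups:
        ctrl_dir = os.path.join(root, ctrl) if ctrl else root
        mon_root = os.path.join(ctrl_dir, "mon_groups")
        mons = [""] + (sorted(os.listdir(mon_root)) if os.path.isdir(mon_root) else [])
        for mon in mons:
            group_dir = os.path.join(mon_root, mon) if mon else ctrl_dir
            domains, assigned = read_group(group_dir)
            for cfg in sorted(assigned):
                # 'e' where the config is assigned, 'u' in the other domains.
                states = ";".join(
                    f"{d}={'e' if d in assigned[cfg] else 'u'}" for d in domains
                )
                lines.append(f"{ctrl}/{mon}/{cfg}:{states}")
    return lines
```

Calling `list_assignments("/sys/fs/resctrl")` would issue one open/read/close per group and domain, which is the cost being weighed above.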

> 
> Second Proposal (Mine)
> 
> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
> 
> Pros:
> 
> a. Maintains the flexibility of free-form names for assignable
> configurations (info/L3_MON/counter_configs/).
> 
> b. Events remain accessible via free-form names in mon_data, ensuring
> clarity on their purpose.
> 
> c. Adds the ability to list assignment states for all groups in a single
> command.
> 
> Cons:
> a.  Potential buffer overflow issues when handling a large number of
> groups and domains and code complexity to fix the issue.
> 
> 
> Third Option: A Hybrid Approach
> 
> We could combine elements from both proposals:
> 
> a. Retain the free-form naming approach for assignable configurations in
> info/L3_MON/counter_configs/.
> 
> b. Use the assignment method from the first proposal:
>    $mkdir test
>    $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
> 
> c. Introduce listing support via the info/L3_MON/mbm_assign_control
> interface, enabling users to read assignment states for all groups in one
> place. Only reading support.
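
For a read-only listing like this, a monitoring script would mostly need a parser for the line format quoted earlier in the thread, "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>". The sketch below is only an illustration of that format under the assumption that the states are the single letters e, s and u proposed above; the `Assignment` type and `parse_assign_line()` are hypothetical names, not part of any existing tool.

```python
from typing import Dict, NamedTuple


class Assignment(NamedTuple):
    ctrl_group: str        # empty string for the default CTRL_MON group
    mon_group: str         # empty string when the line names no MON group
    config: str            # counter configuration name, e.g. mbm_total_bytes
    states: Dict[int, str]  # domain id -> 'e', 's' or 'u'


# Assumed state letters: exclusive, shared, unassigned.
VALID_STATES = {"e", "s", "u"}


def parse_assign_line(line: str) -> Assignment:
    """Parse one "<ctrl>/<mon>/<config>:<dom>=<state>;..." line."""
    path, sep, doms = line.strip().rpartition(":")
    if not sep:
        raise ValueError(f"missing ':' in {line!r}")
    parts = path.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected <ctrl>/<mon>/<config> in {line!r}")
    ctrl, mon, config = parts
    states = {}
    for item in doms.split(";"):
        dom, eq, state = item.partition("=")
        if not eq or state not in VALID_STATES:
            raise ValueError(f"bad domain state {item!r} in {line!r}")
        states[int(dom)] = state
    return Assignment(ctrl, mon, config, states)
```

For example, `parse_assign_line("test//mbm_local_bytes:0=e;1=u")` yields the group `test`, no MON group, the configuration `mbm_local_bytes`, and the per-domain states `{0: "e", 1: "u"}`.
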
> 
> 
> > 
> >>>>>
> >>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
> >>>>>
> >>>>>      Format:
> >>>>>      “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
> >>>>>
> >>>>>      #echo "test//mbm_local_bytes:0=e;1=e" >
> >>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>
> >>>>>      #echo "test//mbm_local_bytes:0=u;1=u" >
> >>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>
> >>>>>      # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>      test//mbm_total_bytes:0=u;1=u
> >>>>>      test//mbm_local_bytes:0=u;1=u
> >>>>>      //mbm_total_bytes:0=e;1=e
> >>>>>      //mbm_local_bytes:0=e;1=e
> >>>>>
> >>>>>      The corresponding events will be read in
> >>>>>
> >>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> >>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
> >>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
> >>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
> >>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
> >>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
> >>>>>
> >>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
> >>>>> mbm_local_bytes) will be supported.
> >>>>>
> >>>>> 8. In the future, there will be options to create multiple configurations
> >>>>> and a corresponding directory will be created in
> >>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
> >>>
> >>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
> >>> directory? Like this:
> >>>
> >>> # echo "LclFill, LclNTWr, RmtFill" >
> >>>          /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
> >>>
> >>> This seems OK (dependent on the user picking meaningful names for
> >>> the set of attributes picked ... but if they want to name this
> >>> monitor file "brian" then they have to live with any confusion
> >>> that they bring on themselves).
> >>>
> >>> Would this involve an extension to kernfs? I don't see a function
> >>> pointer callback for file creation in kernfs_syscall_ops.
> >>>
> >>>>>
> >>>>
> >>>> I know you are all busy with multiple series going on in parallel. I am still
> >>>> waiting for input on this. It would be great if you could spend some time
> >>>> on this to see if we can find common ground on the interface.
> >>>>
> >>>> Thanks
> >>>> Babu
> >>>
> >>> -Tony
> >>>
> >>
> >>
> >> thanks
> >> Babu
> > 
> > Reinette
> > 
> > 
> 
> -- 
> Thanks
> Babu Moger

-Tony


* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-11 20:35                                                     ` Moger, Babu
  2025-03-11 20:53                                                       ` Luck, Tony
@ 2025-03-12 15:07                                                       ` Reinette Chatre
  2025-03-12 16:03                                                         ` Moger, Babu
  1 sibling, 1 reply; 209+ messages in thread
From: Reinette Chatre @ 2025-03-12 15:07 UTC (permalink / raw)
  To: babu.moger, Moger, Babu, Luck, Tony
  Cc: Peter Newman, Dave Martin, corbet, tglx, mingo, bp, dave.hansen,
	x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Babu,

On 3/11/25 1:35 PM, Moger, Babu wrote:
> Hi All,
> 
> On 3/10/25 22:51, Reinette Chatre wrote:
>>
>>
>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>> Hi Tony,
>>>
>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>> Hi All,
>>>>>
>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>
>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>     counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>     counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>     counter 3: VictimBW
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seem to
>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>    # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>    w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>    t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    /group0/0=t;1=t
>>>>>>>>>>>>>>>    /group1/0=t;1=t
>>>>>>>>>>>>>>>    /group2/0=_;1=t
>>>>>>>>>>>>>>>    /group3/0=rw;1=_
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      /group0/0=t;1=t
>>>>>>>>>>>>>>      /group1/0=t;1=t
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>
>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>
>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>    Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>
>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>> just two.
>>>>>>>>>>>>
>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>
>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>
>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>> supported.
>>>>>>>>>>>
>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>
>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>
>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>
>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>
>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>> evolves.
>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>
>>>>>>>>>> # define event configurations
>>>>>>>>>>
>>>>>>>>>> ====    ===========     ========================================================
>>>>>>>>>> Bits    Mnemonics       Description
>>>>>>>>>> ====    ===========     ========================================================
>>>>>>>>>> 6       VictimBW        Dirty Victims from all types of memory
>>>>>>>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>>>>>>>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>>>>>>>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>>>>>>>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>>>>>>>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>>>>>>>>>> 0       LclFill         Reads to memory in the local NUMA domain
>>>>>>>>>> ====    ===========     ========================================================
>>>>>>>>>>
>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>
>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> v = VictimBW
>>>>>>>>>>
>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>
>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>
>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>> and cleaner to implement.
>>>>>>>>>
>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>
>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>> to single letters.  In the resulting directory, populate a file where
>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>
>>>>>>>>
>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>
>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>
>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>
>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>
>>>>>>> Do we need an upper limit?
>>>>>>
>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>> start with 2 default configurations for now.
>>
>> There is a big difference between no upper limit and 2. The hardware is
>> capable of supporting per-domain configurations so more flexibility is
>> certainly possible. Consider the example presented by Peter in:
>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>
>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>> configure all domains the same for a group.
>>>>>>>>
>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>
>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>> aggregation files[2].
>>>>>>>
>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>> counters and reading their values.
>>>>>>
>>>>>> Ok. Let's focus on exclusive assignments for now.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>
>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>
>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>
>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>
>>>>>> I think we should address this as well. Creating three extra files for
>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>
>>>>>>>> That was another problem we need to address.
>>>>>>>
>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>
>>>>>> I suggest we provide users with an option to list the assignments
>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>> becomes cumbersome to query each group individually.
>>>>>>
>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>> for this purpose. More details on this below.
>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>> which the count values can be read.
>>>>>>>>>
>>>>>>>>>    # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>    # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>    # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>    # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>    # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> LclFill
>>>>>>>>> LclNTWr
>>>>>>>>> LclSlowFill
>>>>>>>>
>>>>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>>>>
>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>
>>>>>>>>
>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>> LclNTWr
>>>>>>>> LclSlowFill
>>>>>>>>
>>>>>>>
>>>>>>> I think portable and non-portable event names should both be available
>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>> events.
>>>>>>
>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>
>>>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>>> feature with options to enhance it in the future. Feel free to
>>>>>> comment/suggest.
>>>>>>
>>>>>> 1. Event configurations will be in
>>>>>>      /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>
>>>>>>      There will be two pre-defined configurations by default.
>>>>>>
>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>      LclFill,LclNTWr,LclSlowFill,VictimBW,RmtSlowFill,RmtNTWr,RmtFill
>>>>>>
>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>      LclFill, LclNTWr, LclSlowFill
>>>>>>
>>>>>> 2. Users will have options to update these configurations.
>>>>>>
>>>>>>      #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>         /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>
>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>> reporting "local_bytes" any more. They report something different,
>>>> and users only know if they come to check the options currently
>>>> configured in this file. Changing the contents without changing
>>>> the name seems confusing to me.
>>>
>>> It is the same behaviour right now with BMEC. It is configurable.
>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>
>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>
>> This could be supported by following Peter's original proposal where the name
>> of the counter configuration is provided by the user via a mkdir:
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
> 
> Sure. We can do that. I was thinking in the first phase, just provide the
> default pre-defined configuration and option to update the configuration.
> 
> We can add the mkdir support later. That way we can provide basic ABMC
> support without too much code complexity with mkdir support.

It is not clear to me how you envision the "first phase". Is it what you
proposed above, for example:
      #echo "LclFill, LclNTWr, RmtFill" >
         /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes

In the above, the counter configuration name is a file.

How could mkdir support be added to this later if there are already files present?
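
To make the concern concrete, here is a minimal sketch in Python. It runs on an
ordinary temporary directory, not on resctrl or kernfs, so it only illustrates
the general name clash. The directory layout and the event list written to the
file are taken from the examples in this thread. Treating the later phase as
"mkdir of the same name" is my assumption about how the extension would look.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    configs = os.path.join(root, "counter_configs")
    os.mkdir(configs)

    # "First phase": the configuration name is a regular file.
    name = os.path.join(configs, "mbm_local_bytes")
    with open(name, "w") as f:
        f.write("LclFill,LclNTWr,RmtFill\n")

    # Later phase (assumed): the same name becomes a directory that
    # would hold a file such as event_filter.
    try:
        os.mkdir(name)
        clash = False
    except FileExistsError:
        clash = True

    print("mkdir over an existing file fails:", clash)
```

A name cannot be both a file and a directory, so the pre-defined entries and
any user-created entries would need the same shape from the start.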

> 
>>
>>>
>>>>
>>>>>>
>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>      LclFill, LclNTWr, RmtFill
>>>>>>
>>>>>> 3. The default configurations will be used when the user mounts resctrl.
>>>>>>
>>>>>>      mount  -t resctrl resctrl /sys/fs/resctrl/
>>>>>>      mkdir /sys/fs/resctrl/test/
>>>>>>
>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>      e: Exclusive
>>>>>>      s: Shared
>>>>>>      u: Unassigned
>>>>>>
>>>>>>      Exclusive mode is supported now. Shared mode will be supported in the
>>>>>> future.
>>>>>>
>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> to list the assignment state of all the groups.
>>>>>>
>>>>>>      Format:
>>>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>
>>>>>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>      test//mbm_total_bytes:0=e;1=e
>>>>>>      test//mbm_local_bytes:0=e;1=e
>>>>>>      //mbm_total_bytes:0=e;1=e
>>>>>>      //mbm_local_bytes:0=e;1=e
>>
>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>> page of data (these examples never seem to reflect those AMD systems with the many
>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>> and solved when/if going this route.
> 
> This problem is not specific to this series. I feel it is a generic problem
> for many similar interfaces. I don't know how it is addressed. I may
> have to investigate this. Any pointers would be helpful.

Dave Martin already did a lot of analysis here. What other pointers do you need?
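
For a rough sense of scale, here is a sketch in Python that builds the listing
in the format proposed above and compares its size with one page. The 4096-byte
page size, the group names "grp1".."grp31" and the assumption that every
group has both default configurations assigned exclusively in every domain are
all hypothetical. The 32-group, 12-domain figures are the ones used earlier in
this thread.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def listing_line(ctrl_group, mon_group, config, states):
    """One line: <CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<state>;..."""
    domains = ";".join(f"{dom}={state}" for dom, state in sorted(states.items()))
    return f"{ctrl_group}/{mon_group}/{config}:{domains}"


def listing(groups, configs, num_domains, state="e"):
    """Full listing with every configuration in the same state in every domain."""
    states = {dom: state for dom in range(num_domains)}
    lines = [listing_line(ctrl, mon, cfg, states)
             for ctrl, mon in groups
             for cfg in configs]
    return "\n".join(lines) + "\n"


configs = ["mbm_total_bytes", "mbm_local_bytes"]

# The two-domain example from the proposal: group "test" plus the default group.
small = listing([("test", ""), ("", "")], configs, num_domains=2)
print(small, end="")

# Hypothetical larger system: default group plus 31 CTRL_MON groups, 12 domains.
groups = [("", "")] + [(f"grp{i}", "") for i in range(1, 32)]
big = listing(groups, configs, num_domains=12)
verdict = "exceeds" if len(big) > PAGE_SIZE else "fits in"
print(len(big), "bytes,", verdict, "one page")
```

Even with short group names and no MON groups, this configuration does not fit
in a single page, and longer names or more domains only make it larger.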

> 
> 
>>
>> There seem to be two opinions about this file at the moment. Would it be possible to
>> summarize the discussion with pros/cons raised to make an informed selection?
>> I understand that Google as represented by Peter no longer requires/requests this
>> file but the motivation for this change seems new and does not seem to reduce the
>> original motivation for this file. We may also want to separate requirements for reading
>> from and writing to this file.
> 
> Yea. We can just use mbm_assign_control for reading the assignment states.
> 
> Summary: We have two proposals.
> 
> First one from Peter:
> 
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
> 
> 
> Pros
> a.  Allows flexible creation of free-form names for assignable
> configurations, stored in info/L3_MON/counter_configs/.
> 
> b.  Events can be accessed using corresponding free-form names in the
> mon_data directory, making it clear to users what each event represents.
> 
> 
> Cons:
> a. Requires three separate files for assignment in each group
> (assign_exclusive, assign_shared, unassign), which might be excessive.
> 
> b. No built-in listing support, meaning users must query each group
> individually to check assignment states.
> 
> 
> Second Proposal (Mine)
> 
> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
> 
> Pros:
> 
> a. Maintains the flexibility of free-form names for assignable
> configurations (info/L3_MON/counter_configs/).
> 
> b. Events remain accessible via free-form names in mon_data, ensuring
> clarity on their purpose.
> 
> c. Adds the ability to list assignment states for all groups in a single
> command.
> 
> Cons:
> a.  Potential buffer overflow issues when handling a large number of
> groups and domains and code complexity to fix the issue.
> 
> 
> Third Option: A Hybrid Approach
> 
> We could combine elements from both proposals:
> 
> a. Retain the free-form naming approach for assignable configurations in
> info/L3_MON/counter_configs/.
> 
> b. Use the assignment method from the first proposal:
>    $mkdir test
>    $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
> 
> c. Introduce listing support via the info/L3_MON/mbm_assign_control
> interface, enabling users to read assignment states for all groups in one
> place. Only reading support.
> 
> 
>>
>>>>>>
>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>
>>>>>>      Format:
>>>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>
>>>>>>      #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>
>>>>>>      #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>
>>>>>>      # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>      test//mbm_total_bytes:0=u;1=u
>>>>>>      test//mbm_local_bytes:0=u;1=u
>>>>>>      //mbm_total_bytes:0=e;1=e
>>>>>>      //mbm_local_bytes:0=e;1=e
>>>>>>
>>>>>>      The corresponding events will be read in
>>>>>>
>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>
>>>>>> 7. In the first stage, only two configurations (mbm_total_bytes and
>>>>>> mbm_local_bytes) will be supported.
>>>>>>
>>>>>> 8. In the future, there will be options to create multiple configurations,
>>>>>> and a corresponding directory will be created in
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>
>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>> directory? Like this:
>>>>
>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>          /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>
>>>> This seems OK (dependent on the user picking meaningful names for
>>>> the set of attributes picked ... but if they want to name this
>>>> monitor file "brian" then they have to live with any confusion
>>>> that they bring on themselves).
>>>>
>>>> Would this involve an extension to kernfs? I don't see a function
>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>
>>>>>>
>>>>>
>>>>> I know you are all busy with multiple series going on in parallel. I am still
>>>>> waiting for the inputs on this. It will be great if you can spend some time
>>>>> on this to see if we can find common ground on the interface.
>>>>>
>>>>> Thanks
>>>>> Babu
>>>>
>>>> -Tony
>>>>
>>>
>>>
>>> thanks
>>> Babu
>>
>> Reinette
>>
>>
> 



* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-11 20:53                                                       ` Luck, Tony
@ 2025-03-12 15:14                                                         ` Moger, Babu
  2025-03-12 15:15                                                         ` Reinette Chatre
  1 sibling, 0 replies; 209+ messages in thread
From: Moger, Babu @ 2025-03-12 15:14 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Reinette Chatre, Moger, Babu, Peter Newman, Dave Martin, corbet,
	tglx, mingo, bp, dave.hansen, x86, hpa, paulmck, akpm, thuth,
	rostedt, xiongwei.song, pawan.kumar.gupta, daniel.sneddon,
	jpoimboe, perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc,
	xin3.li, andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Tony,

On 3/11/25 15:53, Luck, Tony wrote:
> On Tue, Mar 11, 2025 at 03:35:28PM -0500, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>
>>>
>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>> Hi Tony,
>>>>
>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>
>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>>     counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>     counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seem to
>>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>>    # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>    w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>>    t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    /group0/0=t;1=t
>>>>>>>>>>>>>>>>    /group1/0=t;1=t
>>>>>>>>>>>>>>>>    /group2/0=_;1=t
>>>>>>>>>>>>>>>>    /group3/0=rw;1=_
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - group2 is restricted to domain 1
>>>>>>>>>>>>>>>> - group3 is restricted to domain 0
>>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
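The counter totals quoted above can be checked with a short tally. This is a sketch only, assuming that each configuration letter assigned in a domain consumes one counter there and that '_' means no assignment; the function name `counters_per_domain` is hypothetical:

```python
from collections import Counter


def counters_per_domain(assignments):
    """Tally counters per domain, one per configuration letter ('_' = none)."""
    used = Counter()
    for line in assignments:
        # '/group3/0=rw;1=_' -> group path '/group3', domain list '0=rw;1=_'
        _group, domains = line.rsplit("/", 1)
        for item in domains.split(";"):
            dom_id, flags = item.split("=")
            used[int(dom_id)] += sum(1 for flag in flags if flag != "_")
    return dict(used)


example = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]
print(counters_per_domain(example))  # {0: 4, 1: 3}
```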
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      /group0/0=t;1=t
>>>>>>>>>>>>>>>      /group1/0=t;1=t
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>>    Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>>
>>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>>> just two.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>>> supported.
>>>>>>>>>>>>
>>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>>
>>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>>
>>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>>
>>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>>> evolves.
>>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>>
>>>>>>>>>>> # define event configurations
>>>>>>>>>>>
>>>>>>>>>>> ====    ============    ========================================================
>>>>>>>>>>> Bits    Mnemonics       Description
>>>>>>>>>>> ====    ============    ========================================================
>>>>>>>>>>> 6       VictimBW        Dirty Victims from all types of memory
>>>>>>>>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>>>>>>>>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>>>>>>>>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>>>>>>>>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>>>>>>>>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>>>>>>>>>>> 0       LclFill         Reads to memory in the local NUMA domain
>>>>>>>>>>> ====    ============    ========================================================
>>>>>>>>>>>
>>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>>
>>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> v = VictimBW
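The flag definitions above map directly onto bitmasks over the event table. A minimal sketch, assuming the bit positions from the table (taking bit 1 to be RmtFill); the names `EVENT_BITS`, `FLAGS` and `flag_mask` are illustrative and not existing kernel symbols:

```python
# Bit positions taken from the event table above (bit 1 assumed to be RmtFill).
EVENT_BITS = {
    "LclFill": 0,
    "RmtFill": 1,
    "LclNTWr": 2,
    "RmtNTWr": 3,
    "LclSlowFill": 4,
    "RmtSlowFill": 5,
    "VictimBW": 6,
}

# Flag letters as defined in the proposal.
FLAGS = {
    "t": ["LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill",
          "VictimBW", "LclNTWr", "RmtNTWr"],
    "l": ["LclFill", "LclNTWr", "LclSlowFill"],
    "r": ["LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill"],
    "w": ["VictimBW", "LclNTWr", "RmtNTWr"],
    "v": ["VictimBW"],
}


def flag_mask(events):
    """OR together the bit of every named event."""
    mask = 0
    for name in events:
        mask |= 1 << EVENT_BITS[name]
    return mask


for letter, events in FLAGS.items():
    print(f"{letter} = {flag_mask(events):#04x}")
```

With these definitions the r and w masks are disjoint and together cover every bit in t, which matches the earlier description of "total" as effectively r+w.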
>>>>>>>>>>>
>>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>>
>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>
>>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>>> and cleaner to implement.
>>>>>>>>>>
>>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>>
>>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>>> to single letters.  In the resulting directory, populate a file where
>>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>>
>>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>>
>>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>>
>>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>>
>>>>>>>> Do we need an upper limit?
>>>>>>>
>>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>>> start with 2 default configurations for now.
>>>
>>> There is a big difference between no upper limit and 2. The hardware is
>>> capable of supporting per-domain configurations so more flexibility is
>>> certainly possible. Consider the example presented by Peter in:
>>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>
>>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>>> configure all domains the same for a group.
>>>>>>>>>
>>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>>
>>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>>> aggregation files[2].
>>>>>>>>
>>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>>> counters and reading their values.
>>>>>>>
>>>>>>> Ok. Let's focus on exclusive assignments for now.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>>
>>>>>>>>> There should be a more efficient way to handle this.
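For scale, the file count in the example above works out as follows. This is plain arithmetic on the numbers already given (32 groups, 12 domains, three files per domain directory); the function name is hypothetical:

```python
def assignment_file_count(groups, domains, files_per_domain=3):
    """Files needed when every group has assign_exclusive, assign_shared
    and unassign in each per-domain directory."""
    return groups * domains * files_per_domain


print(assignment_file_count(32, 12))  # 1152
```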
>>>>>>>>>
>>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>>
>>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>>
>>>>>>> I think we should address this as well. Creating three extra files for
>>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>>
>>>>>>>>> That was another problem we need to address.
>>>>>>>>
>>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>>
>>>>>>> I suggest we provide users with an option to list the assignments
>>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>>> becomes cumbersome to query each group individually.
>>>>>>>
>>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>>> for this purpose. More details on this below.
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>>> which the count values can be read.
>>>>>>>>>>
>>>>>>>>>>    # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>>    # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>    # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>    # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>    # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> LclFill
>>>>>>>>>> LclNTWr
>>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>> I feel we can just have the configs. The event_filter file is not required.
>>>>>>>>
>>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>>> only looking at struct kernfs_syscall_ops.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>>> LclNTWr
>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think portable and non-portable event names should both be available
>>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>>> events.
>>>>>>>
>>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>>
>>>>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>>>> feature with options to enhance it in the future. Feel free to
>>>>>>> comment/suggest.
>>>>>>>
>>>>>>> 1. Event configurations will be in
>>>>>>>      /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>>
>>>>>>>      There will be two pre-defined configurations by default.
>>>>>>>
>>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>>      LclFill, LclNTWr, LclSlowFill, VictimBW, RmtSlowFill, RmtNTWr, RmtFill
>>>>>>>
>>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>      LclFill, LclNTWr, LclSlowFill
>>>>>>>
>>>>>>> 2. Users will have options to update these configurations.
>>>>>>>
>>>>>>>      #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>>         /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>
>>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>>> reporting "local_bytes" any more. They report something different,
>>>>> and users only know if they come to check the options currently
>>>>> configured in this file. Changing the contents without changing
>>>>> the name seems confusing to me.
>>>>
>>>> It is the same behaviour right now with BMEC. It is configurable.
>>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>>
>>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>>
>>> This could be supported by following Peter's original proposal where the name
>>> of the counter configuration is provided by the user via a mkdir:
>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>
>>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>>
>> Sure. We can do that. I was thinking in the first phase, just provide the
>> default pre-defined configuration and option to update the configuration.
>>
>> We can add the mkdir support later. That way we can provide basic ABMC
>> support without too much code complexity with mkdir support.
>>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>      LclFill, LclNTWr, RmtFill
>>>>>>>
>>>>>>> 3. The default configurations will be used when the user mounts resctrl.
>>>>>>>
>>>>>>>      mount  -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>      mkdir /sys/fs/resctrl/test/
>>>>>>>
>>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>>      e: Exclusive
>>>>>>>      s: Shared
>>>>>>>      u: Unassigned
>>>>>>>
>>>>>>>      Exclusive mode is supported now. Shared mode will be supported in the
>>>>>>> future.
>>>>>>>
>>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> to list the assignment state of all the groups.
>>>>>>>
>>>>>>>      Format:
>>>>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>>
>>>>>>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>      test//mbm_total_bytes:0=e;1=e
>>>>>>>      test//mbm_local_bytes:0=e;1=e
>>>>>>>      //mbm_total_bytes:0=e;1=e
>>>>>>>      //mbm_local_bytes:0=e;1=e
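A minimal sketch of how a user-space tool might parse one line of the listing format proposed above. The helper name parse_assign_line is made up for illustration, and the syntax is only the draft format from this proposal, not a merged kernel interface.

```python
def parse_assign_line(line):
    """Parse '<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<state>;...'.

    Hypothetical helper: the syntax is the draft format proposed in this
    thread, not an existing kernel ABI.
    """
    path, _, assignments = line.strip().partition(":")
    ctrl_mon, mon, config = path.split("/")
    states = {}
    for item in assignments.split(";"):
        domain_id, state = item.split("=")
        # e: Exclusive, s: Shared, u: Unassigned (states listed in item 4)
        if state not in ("e", "s", "u"):
            raise ValueError(f"unknown assignment state: {state!r}")
        states[int(domain_id)] = state
    return ctrl_mon, mon, config, states


print(parse_assign_line("test//mbm_total_bytes:0=e;1=e"))
print(parse_assign_line("//mbm_local_bytes:0=e;1=u"))
```

An empty group name stands for the default group, matching the "//mbm_total_bytes" lines in the example output.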
>>>
>>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>>> page of data (these examples never seem to reflect those AMD systems with the many
>>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>>> and solved when/if going this route.
>>
>> This problem is not specific to this series. I feel it is a generic problem
>> for many similar interfaces. I don't know how it is addressed; I may
>> have to investigate this. Any pointers would be helpful.
>>
>>
>>>
>>> There seem to be two opinions about this file at the moment. Would it be possible to
>>> summarize the discussion with pros/cons raised to make an informed selection?
>>> I understand that Google as represented by Peter no longer requires/requests this
>>> file but the motivation for this change seems new and does not seem to reduce the
>>> original motivation for this file. We may also want to separate requirements for reading
>>> from and writing to this file.
>>
>> Yea. We can just use mbm_assign_control for reading the assignment states.
>>
>> Summary: We have two proposals.
>>
>> First one from Peter:
>>
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>>
>> Pros
>> a.  Allows flexible creation of free-form names for assignable
>> configurations, stored in info/L3_MON/counter_configs/.
>>
>> b.  Events can be accessed using corresponding free-form names in the
>> mon_data directory, making it clear to users what each event represents.
>>
>>
>> Cons:
>> a. Requires three separate files for assignment in each group
>> (assign_exclusive, assign_shared, unassign), which might be excessive.
>>
>> b. No built-in listing support, meaning users must query each group
>> individually to check assignment states.
> 
> How big of a problem is this in reality? I'd assume that users of this
> feature would only reassign counter attributes at some slow rate (set
> up counters, measure for at least a few seconds, then set up for next
> measurement). Cost to open/read/close a few hundred kernfs files isn't
> very high. Biggest cost might be hogging the resctrl mutex which would
> cause jitter in the tasks reading data from resctrl monitors.

Yes. That is a good point. I don't know how big a problem it is.

But we all need to agree that group listing is not a requirement. Then we can
go ahead with that route.

Let's hear from all the parties.
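For reference, listing without a central file could look roughly like the sketch below: walk the groups and read each per-domain assignment file. The file names come from the first proposal and do not exist in current kernels, and the sketch assumes the files can be read back, which the proposal does not state. The demo runs against a temporary directory rather than a real resctrl mount.

```python
import os
import tempfile


def list_assignments(resctrl_root, filenames=("assign_exclusive", "assign_shared")):
    """Collect per-group, per-domain assignment files under a resctrl-like tree.

    The file names are the hypothetical ones from the first proposal.
    """
    result = {}
    for dirpath, _dirnames, files in os.walk(resctrl_root):
        # Only look inside mon_data/<domain> directories.
        if os.path.basename(os.path.dirname(dirpath)) != "mon_data":
            continue
        for name in filenames:
            if name in files:
                rel = os.path.relpath(dirpath, resctrl_root)
                with open(os.path.join(dirpath, name)) as f:
                    result[(rel, name)] = f.read().split()
    return result


# Demo against a throwaway directory tree instead of /sys/fs/resctrl.
with tempfile.TemporaryDirectory() as root:
    domain_dir = os.path.join(root, "test", "mon_data", "mon_L3_00")
    os.makedirs(domain_dir)
    with open(os.path.join(domain_dir, "assign_exclusive"), "w") as f:
        f.write("mbm_local_bytes\n")
    demo = list_assignments(root)
    print(demo)
```

Each group and domain costs one open/read/close, which is the overhead being weighed against a single-file listing.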

> 
> Anyone doing this at scale should be able to keep track of what they set,
> so wouldn't need to read at all. I'm not a big believer in "multiple
> agents independently tweaking resctrl without knowledge of each other".
> 
>>
>> Second Proposal (Mine)
>>
>> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>>
>> Pros:
>>
>> a. Maintains the flexibility of free-form names for assignable
>> configurations (info/L3_MON/counter_configs/).
>>
>> b. Events remain accessible via free-form names in mon_data, ensuring
>> clarity on their purpose.
>>
>> c. Adds the ability to list assignment states for all groups in a single
>> command.
>>
>> Cons:
>> a.  Potential buffer overflow issues when handling a large number of
>> groups and domains and code complexity to fix the issue.
>>
>>
>> Third Option: A Hybrid Approach
>>
>> We could combine elements from both proposals:
>>
>> a. Retain the free-form naming approach for assignable configurations in
>> info/L3_MON/counter_configs/.
>>
>> b. Use the assignment method from the first proposal:
>>    $mkdir test
>>    $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>>
>> c. Introduce listing support via the info/L3_MON/mbm_assign_control
>> interface, enabling users to read assignment states for all groups in one
>> place. Only reading support.
>>
>>
>>>
>>>>>>>
>>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>>
>>>>>>>      Format:
>>>>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>>
>>>>>>>      #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>>      #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>>      # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>      test//mbm_total_bytes:0=u;1=u
>>>>>>>      test//mbm_local_bytes:0=u;1=u
>>>>>>>      //mbm_total_bytes:0=e;1=e
>>>>>>>      //mbm_local_bytes:0=e;1=e
>>>>>>>
>>>>>>>      The corresponding events will be read in
>>>>>>>
>>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>
>>>>>>> 7. In the first stage, only two configurations (mbm_total_bytes and
>>>>>>> mbm_local_bytes) will be supported.
>>>>>>>
>>>>>>> 8. In the future, there will be options to create multiple configurations,
>>>>>>> and a corresponding directory will be created in
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>>
>>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>>> directory? Like this:
>>>>>
>>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>>          /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>>
>>>>> This seems OK (dependent on the user picking meaningful names for
>>>>> the set of attributes picked ... but if they want to name this
>>>>> monitor file "brian" then they have to live with any confusion
>>>>> that they bring on themselves).
>>>>>
>>>>> Would this involve an extension to kernfs? I don't see a function
>>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>>
>>>>>>>
>>>>>>
>>>>>> I know you are all busy with multiple series going on in parallel. I am still
>>>>>> waiting for the inputs on this. It will be great if you can spend some time
>>>>>> on this to see if we can find common ground on the interface.
>>>>>>
>>>>>> Thanks
>>>>>> Babu
>>>>>
>>>>> -Tony
>>>>>
>>>>
>>>>
>>>> thanks
>>>> Babu
>>>
>>> Reinette
>>>
>>>
>>
>> -- 
>> Thanks
>> Babu Moger
> 
> -Tony
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-11 20:53                                                       ` Luck, Tony
  2025-03-12 15:14                                                         ` Moger, Babu
@ 2025-03-12 15:15                                                         ` Reinette Chatre
  1 sibling, 0 replies; 209+ messages in thread
From: Reinette Chatre @ 2025-03-12 15:15 UTC (permalink / raw)
  To: Luck, Tony, Moger, Babu
  Cc: Moger, Babu, Peter Newman, Dave Martin, corbet, tglx, mingo, bp,
	dave.hansen, x86, hpa, paulmck, akpm, thuth, rostedt,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, jpoimboe,
	perry.yuan, sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Tony,

On 3/11/25 1:53 PM, Luck, Tony wrote:
> On Tue, Mar 11, 2025 at 03:35:28PM -0500, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>
>>>
>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>> Hi Tony,
>>>>
>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>
>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>    counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>    counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>    counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>    domain 0:
>>>>>>>>>>>>>>>>>     counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>    domain 1:
>>>>>>>>>>>>>>>>>     counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>>     counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>     counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>     counter 3: VictimBW
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seem to
>>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>>    # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>    w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>>    t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    /group0/0=t;1=t
>>>>>>>>>>>>>>>>    /group1/0=t;1=t
>>>>>>>>>>>>>>>>    /group2/0=_;1=t
>>>>>>>>>>>>>>>>    /group3/0=rw;1=_
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - group2 is restricted to domain 1
>>>>>>>>>>>>>>>> - group3 is restricted to domain 0
>>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
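The counter totals above can be checked with a short sketch. It assumes each flag letter consumes one counter in that domain and that "_" means no counter is assigned; the function name is made up for this example.

```python
from collections import Counter


def counters_per_domain(assignments):
    """Count counters used per domain in '/group/<dom>=<flags>;...' lines.

    Assumes one counter per flag letter and '_' for "no counter".
    """
    used = Counter()
    for line in assignments:
        _group, domains = line.rsplit("/", 1)
        for item in domains.split(";"):
            domain_id, flags = item.split("=")
            used[int(domain_id)] += len(flags.replace("_", ""))
    return dict(used)


example = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]
print(counters_per_domain(example))  # {0: 4, 1: 3}
```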
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      /group0/0=t;1=t
>>>>>>>>>>>>>>>      /group1/0=t;1=t
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>>    Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>>
>>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>>> just two.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>>> supported.
>>>>>>>>>>>>
>>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>>
>>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>>
>>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>>
>>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>>> evolves.
>>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>>
>>>>>>>>>>> # define event configurations
>>>>>>>>>>>
>>>>>>>>>>> ====    ===========     ====================================================
>>>>>>>>>>> Bits    Mnemonics       Description
>>>>>>>>>>> ====    ===========     ====================================================
>>>>>>>>>>> 6       VictimBW        Dirty Victims from all types of memory
>>>>>>>>>>> 5       RmtSlowFill     Reads to slow memory in the non-local NUMA domain
>>>>>>>>>>> 4       LclSlowFill     Reads to slow memory in the local NUMA domain
>>>>>>>>>>> 3       RmtNTWr         Non-temporal writes to non-local NUMA domain
>>>>>>>>>>> 2       LclNTWr         Non-temporal writes to local NUMA domain
>>>>>>>>>>> 1       RmtFill         Reads to memory in the non-local NUMA domain
>>>>>>>>>>> 0       LclFill         Reads to memory in the local NUMA domain
>>>>>>>>>>> ====    ===========     ====================================================
>>>>>>>>>>>
>>>>>>>>>>> # Define flags based on combinations of the above event types.
>>>>>>>>>>>
>>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> v = VictimBW
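A minimal sketch of how the flag letters above map onto the event bits,
assuming the bit positions from the table and the event names used in the
flag definitions. The helper names (event_mask, flags_mask) are
hypothetical and are not part of resctrl.

```python
# Bit positions from the event table quoted above.
EVENT_BITS = {
    "LclFill": 0,
    "RmtFill": 1,
    "LclNTWr": 2,
    "RmtNTWr": 3,
    "LclSlowFill": 4,
    "RmtSlowFill": 5,
    "VictimBW": 6,
}

# Flag letters from the proposal; each letter names a set of events.
FLAGS = {
    "t": ["LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill",
          "VictimBW", "LclNTWr", "RmtNTWr"],
    "l": ["LclFill", "LclNTWr", "LclSlowFill"],
    "r": ["LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill"],
    "w": ["VictimBW", "LclNTWr", "RmtNTWr"],
    "v": ["VictimBW"],
}


def event_mask(events):
    """OR together the bits for a list of event names."""
    mask = 0
    for name in events:
        mask |= 1 << EVENT_BITS[name]
    return mask


def flags_mask(flags):
    """OR together the masks of every flag letter in a string like 'rw'."""
    mask = 0
    for letter in flags:
        mask |= event_mask(FLAGS[letter])
    return mask


if __name__ == "__main__":
    for letter in FLAGS:
        print(letter, hex(flags_mask(letter)))
```

With these definitions "rw" covers the same events as "t", which is why a
group assigned "rw" on one counter counts all traffic.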
>>>>>>>>>>>
>>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>>
>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>
>>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>>> and cleaner to implement.
>>>>>>>>>>
>>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>>
>>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>>> to single letters.  In the resulting directory, populate a file where
>>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>>
>>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>>
>>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>>
>>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>>
>>>>>>>> Do we need an upper limit?
>>>>>>>
>>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>>> start with 2 default configurations for now.
>>>
>>> There is a big difference between no upper limit and 2. The hardware is
>>> capable of supporting per-domain configurations so more flexibility is
>>> certainly possible. Consider the example presented by Peter in:
>>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>
>>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>>> configure all domains the same for a group.
>>>>>>>>>
>>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>>
>>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>>> aggregation files[2].
>>>>>>>>
>>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>>> counters and reading their values.
>>>>>>>
>>>>>>> OK. Let's focus on exclusive assignments for now.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>>
>>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>>
>>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>>
>>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>>
>>>>>>> I think we should address this as well. Creating three extra files for
>>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>>
>>>>>>>>> That was another problem we need to address.
>>>>>>>>
>>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>>
>>>>>>> I suggest we provide users with an option to list the assignments
>>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>>> becomes cumbersome to query each group individually.
>>>>>>>
>>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>>> for this purpose. More details on this below.
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>>> which the count values can be read.
>>>>>>>>>>
>>>>>>>>>>    # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>>    # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>    # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>    # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>    # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> LclFill
>>>>>>>>>> LclNTWr
>>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>> I feel we can just have the configs. The event_filter file is not required.
>>>>>>>>
>>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>>
>>>>>>>>>
>>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>>> LclNTWr
>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think portable and non-portable event names should both be available
>>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>>> events.
>>>>>>>
>>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>>
>>>>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>>>> feature with options to enhance it in the future. Feel free to
>>>>>>> comment/suggest.
>>>>>>>
>>>>>>> 1. Event configurations will be in
>>>>>>>      /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>>
>>>>>>>      There will be two pre-defined configurations by default.
>>>>>>>
>>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>>      LclFill, LclNTWr, LclSlowFill, VictimBW, RmtSlowFill, RmtNTWr, RmtFill
>>>>>>>
>>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>      LclFill, LclNTWr, LclSlowFill
>>>>>>>
>>>>>>> 2. Users will have options to update these configurations.
>>>>>>>
>>>>>>>      #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>>         /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>
>>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>>> reporting "local_bytes" any more. They report something different,
>>>>> and users only know if they come to check the options currently
>>>>> configured in this file. Changing the contents without changing
>>>>> the name seems confusing to me.
>>>>
>>>> It is the same behaviour right now with BMEC. It is configurable.
>>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>>
>>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>>
>>> This could be supported by following Peter's original proposal where the name
>>> of the counter configuration is provided by the user via a mkdir:
>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>
>>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>>
>> Sure. We can do that. I was thinking in the first phase, just provide the
>> default pre-defined configuration and option to update the configuration.
>>
>> We can add the mkdir support later. That way we can provide basic ABMC
>> support without the added code complexity of mkdir support.
>>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>>      #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>      LclFill, LclNTWr, RmtFill
>>>>>>>
>>>>>>> 3. The default configurations will be used when the user mounts resctrl.
>>>>>>>
>>>>>>>      mount  -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>      mkdir /sys/fs/resctrl/test/
>>>>>>>
>>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>>      e: Exclusive
>>>>>>>      s: Shared
>>>>>>>      u: Unassigned
>>>>>>>
>>>>>>>      Exclusive mode is supported now. Shared mode will be supported in the
>>>>>>> future.
>>>>>>>
>>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> to list the assignment state of all the groups.
>>>>>>>
>>>>>>>      Format:
>>>>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>>
>>>>>>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>      test//mbm_total_bytes:0=e;1=e
>>>>>>>      test//mbm_local_bytes:0=e;1=e
>>>>>>>      //mbm_total_bytes:0=e;1=e
>>>>>>>      //mbm_local_bytes:0=e;1=e
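A minimal sketch of how a user-space tool might parse one line of the
proposed listing, assuming the format shown above. The Assignment type
and parse_line() are hypothetical names for illustration only; the state
letters are the e/s/u states listed in item 4.

```python
from typing import NamedTuple


class Assignment(NamedTuple):
    ctrl_group: str   # empty string means the default CTRL_MON group
    mon_group: str    # empty string means no MON group
    config: str       # counter configuration name
    states: dict      # domain id -> assignment state letter


VALID_STATES = {"e", "s", "u"}  # exclusive, shared, unassigned


def parse_line(line):
    """Parse '<CTRL_MON group>/<MON group>/<configuration>:<dom>=<state>;...'."""
    path, _, domains = line.strip().partition(":")
    ctrl_group, mon_group, config = path.split("/")
    states = {}
    for item in domains.split(";"):
        dom, _, state = item.partition("=")
        if state not in VALID_STATES:
            raise ValueError(f"unknown assignment state {state!r} in {line!r}")
        states[int(dom)] = state
    return Assignment(ctrl_group, mon_group, config, states)


if __name__ == "__main__":
    print(parse_line("test//mbm_total_bytes:0=e;1=e"))
    print(parse_line("//mbm_local_bytes:0=e;1=e"))
```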
>>>
>>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>>> page of data (these examples never seem to reflect those AMD systems with the many
>>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>>> and solved when/if going this route.
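A rough estimate of how quickly the proposed listing grows, assuming the
line format from the example above and every domain in state "e". The
group names g0..g30 are hypothetical; the counts of 32 groups and 12
domains are the figures used earlier in this thread.

```python
PAGE_SIZE = 4096  # the 4KB limit mentioned above


def line_length(ctrl_group, mon_group, config, num_domains, state="e"):
    """Length in bytes of one listing line, including the newline."""
    entries = ";".join(f"{d}={state}" for d in range(num_domains))
    return len(f"{ctrl_group}/{mon_group}/{config}:{entries}\n")


def listing_size(groups, configs, num_domains):
    """Total bytes for one line per (group, configuration) pair."""
    return sum(
        line_length(ctrl, mon, cfg, num_domains)
        for ctrl, mon in groups
        for cfg in configs
    )


if __name__ == "__main__":
    # Default group plus 31 hypothetical CTRL_MON groups, no MON groups.
    groups = [("", "")] + [(f"g{i}", "") for i in range(31)]
    configs = ["mbm_total_bytes", "mbm_local_bytes"]
    size = listing_size(groups, configs, num_domains=12)
    print(size, "bytes; exceeds one page:", size > PAGE_SIZE)
```

Under these assumptions the listing already comes to about 4.5 KB, so two
configurations across 32 groups and 12 domains no longer fit in one page.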
>>
>> This problem is not specific to this series. I feel it is a generic problem
>> for many similar interfaces. I don't know how it is addressed. We may
>> have to investigate this. Any pointers would be helpful.
>>
>>
>>>
>>> There seem to be two opinions about this file at the moment. Would it be possible to
>>> summarize the discussion with pros/cons raised to make an informed selection?
>>> I understand that Google as represented by Peter no longer requires/requests this
>>> file but the motivation for this change seems new and does not seem to reduce the
>>> original motivation for this file. We may also want to separate requirements for reading
>>> from and writing to this file.
>>
>> Yea. We can just use mbm_assign_control for reading the assignment states.
>>
>> Summary: We have two proposals.
>>
>> First one from Peter:
>>
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>>
>> Pros
>> a.  Allows flexible creation of free-form names for assignable
>> configurations, stored in info/L3_MON/counter_configs/.
>>
>> b.  Events can be accessed using corresponding free-form names in the
>> mon_data directory, making it clear to users what each event represents.
>>
>>
>> Cons:
>> a. Requires three separate files for assignment in each group
>> (assign_exclusive, assign_shared, unassign), which might be excessive.
>>
>> b. No built-in listing support, meaning users must query each group
>> individually to check assignment states.
> 
> How big of a problem is this in reality? I'd assume that users of this
> feature would only reassign counter attributes at some slow rate (set
> up counters, measure for at least a few seconds, then set up for next
> measurement). Cost to open/read/close a few hundred kernfs files isn't
> very high. Biggest cost might be hogging the resctrl mutex which would
> cause jitter in the tasks reading data from resctrl monitors.

Good point. The length of holding the resctrl mutex should also be
considered when exploring the mbm_assign_control file. If a user attempts
to make many changes using a single file like that then holding the resctrl
mutex during entire configuration may also have a big impact. This may be
of more concern with the additional automation being added to resctrl, for
example the upcoming "shared assignment" that does automatic assignment of
counters.
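
A rough user-space sketch of the per-group query discussed above, assuming
the hypothetical assign_exclusive files from Peter's proposal exist in
each mon_data domain directory and contain whitespace-separated
configuration names. Neither the file name nor its contents are an
existing resctrl interface; list_assignments() is an illustrative helper.
Each file is opened and read separately, so every read is short, in
contrast to one large read of a single listing file.

```python
from pathlib import Path


def list_assignments(root, filename="assign_exclusive"):
    """Collect assignments by reading one small file per group and domain.

    Returns {(group, domain): [configuration names]}, where group is the
    path of the group directory relative to root ("." for the default
    group) and domain is the mon_L3_* directory name.
    """
    root = Path(root)
    result = {}
    # "**" also matches zero directories, so the default group is included.
    for path in sorted(root.glob(f"**/mon_data/mon_L3_*/{filename}")):
        domain = path.parent.name
        group = path.parent.parent.parent.relative_to(root).as_posix()
        result[(group, domain)] = path.read_text().split()
    return result


if __name__ == "__main__":
    for key, configs in list_assignments("/sys/fs/resctrl").items():
        print(key, configs)
```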

> 
> Anyone doing this at scale should be able to keep track of what they set,
> so wouldn't need to read at all. I'm not a big believer in "multiple
> agents independently tweaking resctrl without knowledge of each other".
> 
>>
>> Second Proposal (Mine)
>>
>> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>>
>> Pros:
>>
>> a. Maintains the flexibility of free-form names for assignable
>> configurations (info/L3_MON/counter_configs/).
>>
>> b. Events remain accessible via free-form names in mon_data, ensuring
>> clarity on their purpose.
>>
>> c. Adds the ability to list assignment states for all groups in a single
>> command.
>>
>> Cons:
>> a.  Potential buffer overflow issues when handling a large number of
>> groups and domains and code complexity to fix the issue.
>>
>>
>> Third Option: A Hybrid Approach
>>
>> We could combine elements from both proposals:
>>
>> a. Retain the free-form naming approach for assignable configurations in
>> info/L3_MON/counter_configs/.
>>
>> b. Use the assignment method from the first proposal:
>>    $mkdir test
>>    $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>>
>> c. Introduce listing support via the info/L3_MON/mbm_assign_control
>> interface, enabling users to read assignment states for all groups in one
>> place. Read-only support.
>>
>>
>>>
>>>>>>>
>>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>>
>>>>>>>      Format:
>>>>>>>      "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>>
>>>>>>>      #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>>      #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>>      # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>      test//mbm_total_bytes:0=u;1=u
>>>>>>>      test//mbm_local_bytes:0=u;1=u
>>>>>>>      //mbm_total_bytes:0=e;1=e
>>>>>>>      //mbm_local_bytes:0=e;1=e
>>>>>>>
>>>>>>>      The corresponding events will be read in
>>>>>>>
>>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>      /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>      /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>
>>>>>>> 7. In the first stage, only two configurations (mbm_total_bytes and
>>>>>>> mbm_local_bytes) will be supported.
>>>>>>>
>>>>>>> 8. In the future, there will be options to create multiple configurations
>>>>>>> and a corresponding directory will be created in
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>>
>>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>>> directory? Like this:
>>>>>
>>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>>          /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>>
>>>>> This seems OK (dependent on the user picking meaningful names for
>>>>> the set of attributes picked ... but if they want to name this
>>>>> monitor file "brian" then they have to live with any confusion
>>>>> that they bring on themselves).
>>>>>
>>>>> Would this involve an extension to kernfs? I don't see a function
>>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>>
>>>>>>>
>>>>>>
>>>>>> I know you are all busy with multiple series going on in parallel. I am still
>>>>>> waiting for input on this. It will be great if you can spend some time
>>>>>> on this to see if we can find common ground on the interface.
>>>>>>
>>>>>> Thanks
>>>>>> Babu
>>>>>
>>>>> -Tony
>>>>>
>>>>
>>>>
>>>> thanks
>>>> Babu
>>>
>>> Reinette
>>>
>>>
>>
>> -- 
>> Thanks
>> Babu Moger
> 
> -Tony



* Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
  2025-03-12 15:07                                                       ` Reinette Chatre
@ 2025-03-12 16:03                                                         ` Moger, Babu
  2025-03-12 17:14                                                           ` Reinette Chatre
  0 siblings, 1 reply; 209+ messages in thread
From: Moger, Babu @ 2025-03-12 16:03 UTC (permalink / raw)
  To: Reinette Chatre, Moger, Babu, Luck, Tony
  Cc: Peter Newman, Dave Martin, corbet, tglx, mingo, bp, dave.hansen,
	x86, hpa, paulmck, akpm, thuth, rostedt, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, jpoimboe, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, xin3.li,
	andrew.cooper3, ebiggers, mario.limonciello, james.morse,
	tan.shaopeng, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian

Hi Reinette,

On 3/12/25 10:07, Reinette Chatre wrote:
> Hi Babu,
> 
> On 3/11/25 1:35 PM, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>
>>>
>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>> Hi Tony,
>>>>
>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>
>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>    counter 0: LclFill,RmtFill