linux-doc.vger.kernel.org archive mirror
* [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
@ 2024-10-29 23:21 Babu Moger
  2024-10-29 23:21 ` [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init Babu Moger
                   ` (25 more replies)
  0 siblings, 26 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky


This series adds support for Assignable Bandwidth Monitoring Counters
(ABMC). The feature is also called QoS RMID Pinning.

The series is written so that it is easy to add support for other
assignable-counter features from different vendors.

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

The patches are based on top of commit
da99b80a7f5f1 (tip/master) Merge branch into tip/master: 'x86/sev'

# Introduction

Users can create as many monitor groups as there are RMIDs supported by the
hardware. However, the bandwidth monitoring feature on AMD systems only
guarantees that RMIDs currently assigned to a processor will be tracked by
hardware. The counters of any other RMIDs that are no longer being tracked
will be reset to zero. The MBM event counters return "Unavailable" for the
RMIDs that are not tracked by hardware. So, only a limited number of groups
can give guaranteed monitoring numbers. With ever-changing configurations
there is no way to know for certain which of these groups are being tracked
at a given point in time. Users do not have the option to monitor a group or
set of groups for a certain period of time without worrying about counters
being reset in between.
    
The ABMC feature provides an option to the user to assign a hardware
counter to an (RMID, event) pair and monitor the bandwidth as long as it is
assigned. The assigned RMID will be tracked by the hardware until the user
unassigns it manually. There is no need to worry about counters being reset
during this period. Additionally, the user can specify a bitmask identifying
the specific bandwidth types from the given source to track with the counter.

Without ABMC enabled, monitoring works in the current 'default' mode,
without the assignment option.

# Linux Implementation

Create a generic interface aimed at supporting user space assignment
of scarce counters used for monitoring. The first user of the interface
is ABMC, with the option to expand usage to "soft-ABMC" and MPAM
counters in the future.

The feature adds the following interface files:

/sys/fs/resctrl/info/L3_MON/mbm_assign_mode: Reports the list of assignable
monitoring features supported. The enclosed brackets indicate which
feature is enabled.

/sys/fs/resctrl/info/L3_MON/num_mbm_cntrs: Reports the number of monitoring
counters available for assignment.

/sys/fs/resctrl/info/L3_MON/available_mbm_cntrs: Reports the number of monitoring
counters free in each domain.

/sys/fs/resctrl/info/L3_MON/mbm_assign_control: Reports the resctrl group and monitor
status of each group. Assignment state can be updated by writing to the
interface.

# Examples

a. Check if ABMC support is available
	#mount -t resctrl resctrl /sys/fs/resctrl/

	#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
	[mbm_cntr_assign]
	default

	ABMC feature is detected and it is enabled.

b. Check how many ABMC counters are available. 

	#cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
	32

c. Create a few resctrl groups.

	# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp


d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   to list and modify any group's monitoring state. The file provides a single place
   to list the monitoring states of all the resctrl groups. It makes it easier for
   user space to learn about the used counters without needing to traverse all
   the groups, thus reducing the number of file system calls.

	The list uses the following format:

	"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

	Format for specific type of groups:

	* Default CTRL_MON group:
	        "//<domain_id>=<flags>"

	* Non-default CTRL_MON group:
	        "<CTRL_MON group>//<domain_id>=<flags>"

	* Child MON group of default CTRL_MON group:
	        "/<MON group>/<domain_id>=<flags>"

	* Child MON group of non-default CTRL_MON group:
	        "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

	Flags can be one of the following:

	t  MBM total event is enabled.
	l  MBM local event is enabled.
	tl Both total and local MBM events are enabled.
	_  None of the MBM events are enabled.

	Examples:

	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control 
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
	//0=tl;1=tl;
	/child_default_mon_grp/0=tl;1=tl;
	
	There are four groups, and all of them have the local and total
	events enabled on domains 0 and 1.
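
For user space tooling, the list format above can be split mechanically.
The following is only a minimal sketch, not part of this series: the
helper name parse_assign_line() is hypothetical, and it assumes that group
names never contain '/' and that every line ends in a ';'-separated list
of <domain_id>=<flags> entries, as in the example output.

```python
def parse_assign_line(line):
    """Split one mbm_assign_control line into its three parts.

    Returns (ctrl_mon_group, mon_group, {domain_id: set_of_flags}).
    An empty group name stands for the default group, and an empty
    flag set stands for '_' (no MBM event assigned).
    """
    line = line.strip()
    # The first two '/' separate the CTRL_MON group, the MON group and
    # the per-domain list (assumption: group names contain no '/').
    ctrl_group, mon_group, domains = line.split("/", 2)
    states = {}
    for entry in domains.split(";"):
        if not entry:  # the trailing ';' leaves one empty entry
            continue
        domain_id, flags = entry.split("=", 1)
        states[int(domain_id)] = set() if flags == "_" else set(flags)
    return ctrl_group, mon_group, states


if __name__ == "__main__":
    sample = [
        "non_default_ctrl_mon_grp//0=tl;1=tl;",
        "non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;",
        "//0=tl;1=tl;",
        "/child_default_mon_grp/0=tl;1=tl;",
    ]
    for text in sample:
        print(parse_assign_line(text))
```

Returning a set of flags per domain keeps "tl" and "lt" equivalent, which
may be convenient when comparing states before and after an update.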

e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control.

	The write format is similar to the above list format with the addition
	of an opcode for the assignment operation:

	"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

	
	* Default CTRL_MON group:
	        "//<domain_id><opcode><flags>"
	
	* Non-default CTRL_MON group:
	        "<CTRL_MON group>//<domain_id><opcode><flags>"
	
	* Child MON group of default CTRL_MON group:
	        "/<MON group>/<domain_id><opcode><flags>"
	
	* Child MON group of non-default CTRL_MON group:
	        "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
	
	Opcode can be one of the following:
	
	= Update the assignment to match the flags.
	+ Assign a new MBM event without impacting existing assignments.
	- Unassign an MBM event from currently assigned events.

	Flags can be one of the following:

        t  MBM total event.
        l  MBM local event.
        tl Both total and local MBM events.
        _  None of the MBM events. Only works with '=' opcode. This flag cannot be combined with other flags.
	
	Initial group status:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
	//0=tl;1=tl;
	/child_default_mon_grp/0=tl;1=tl;

	To update the default group to enable only total event on domain 0:
	# echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
	//0=t;1=tl;
	/child_default_mon_grp/0=tl;1=tl;

	To update the MON group child_default_mon_grp to remove total event on domain 1:
	# echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
	//0=t;1=tl;
	/child_default_mon_grp/0=tl;1=l;

	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
	remove both local and total events on domain 1:
	# echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" > \
	       /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
	//0=t;1=tl;
	/child_default_mon_grp/0=tl;1=l;

	To update the default group to add the local event on domain 0:
	# echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
	//0=tl;1=tl;
	/child_default_mon_grp/0=tl;1=l;

	To update the non-default CTRL_MON group non_default_ctrl_mon_grp to unassign all
	the MBM events on all the domains:
	# echo "non_default_ctrl_mon_grp//*=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
	non_default_ctrl_mon_grp//0=_;1=_;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
	//0=tl;1=tl;
	/child_default_mon_grp/0=tl;1=l;
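
The opcode semantics described above can be modelled in a few lines. This
is a sketch of the documented behaviour only, not the kernel
implementation: the helper names parse_write() and apply_assignment() are
hypothetical, and the model ignores counter exhaustion and any other error
the kernel may report.

```python
import re

VALID_FLAGS = {"t", "l"}

# <domain_id><opcode><flags>, where domain_id may be '*' for all domains.
_REQUEST = re.compile(r"^(\*|\d+)([=+-])([tl_]+)$")


def parse_write(text):
    """Split a write string into (ctrl_mon_group, mon_group, domain, opcode, flags)."""
    ctrl_group, mon_group, request = text.strip().split("/", 2)
    match = _REQUEST.match(request)
    if match is None:
        raise ValueError("malformed request: %r" % request)
    domain, opcode, flags = match.groups()
    return ctrl_group, mon_group, domain, opcode, flags


def apply_assignment(current, opcode, flags):
    """Return the set of assigned events after applying opcode and flags."""
    if flags == "_":
        # '_' only works with '=' and cannot be combined with other flags.
        if opcode != "=":
            raise ValueError("'_' is only valid with the '=' opcode")
        return set()
    requested = set(flags)
    if not requested <= VALID_FLAGS:
        raise ValueError("invalid flags: %r" % flags)
    if opcode == "=":
        return requested            # match the flags exactly
    if opcode == "+":
        return current | requested  # keep existing assignments
    if opcode == "-":
        return current - requested  # drop only the named events
    raise ValueError("invalid opcode: %r" % opcode)


if __name__ == "__main__":
    state = {"t", "l"}
    for request in ("//0=t", "//0+l"):
        _, _, domain, opcode, flags = parse_write(request)
        state = apply_assignment(state, opcode, flags)
        print(request, "->", "".join(sorted(state, reverse=True)) or "_")
```

Replaying "//0=t" followed by "//0+l" on a group that starts with both
events assigned ends with both events assigned again, which matches the
sequence shown in the examples above.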


f. Read the events mbm_total_bytes and mbm_local_bytes of the default group.
   Reading the events does not change with ABMC. If an event is unassigned
   when it is read, the read returns "Unassigned".
	
	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	779247936
	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
	765207488
	
g. Check the bandwidth configuration for the group. Note that bandwidth
   configuration has a domain scope. The total event defaults to 0x7f (to
   count all the events) and the local event defaults to 0x15 (to count all
   the local NUMA events). The event bitmap decoding is available at
   https://www.kernel.org/doc/Documentation/x86/resctrl.rst
   in the sections "mbm_total_bytes_config" and "mbm_local_bytes_config":
	
	#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	0=0x7f;1=0x7f
	
	#cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 
	0=0x15;1=0x15
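
To make the bitmask values easier to read, the config files can be decoded
in user space. The sketch below is not part of this series. The bit
descriptions are reproduced from memory of the resctrl documentation
referenced above and should be checked against it; the helper names
parse_config_line() and decode_config() are hypothetical.

```python
# Bit positions of the event configuration bitmask (assumption: as listed
# in resctrl.rst under "mbm_total_bytes_config"; verify before relying on it).
EVENT_BITS = {
    0: "Reads to memory in the local NUMA domain",
    1: "Reads to memory in the non-local NUMA domain",
    2: "Non-temporal writes to local NUMA domain",
    3: "Non-temporal writes to non-local NUMA domain",
    4: "Reads to slow memory in the local NUMA domain",
    5: "Reads to slow memory in the non-local NUMA domain",
    6: "Dirty Victims from the QOS domain to all types of memory",
}


def parse_config_line(text):
    """Turn '0=0x7f;1=0x7f' into {0: 0x7f, 1: 0x7f}."""
    config = {}
    for entry in text.strip().split(";"):
        if not entry:
            continue
        domain_id, value = entry.split("=", 1)
        config[int(domain_id)] = int(value, 16)
    return config


def decode_config(value):
    """List the descriptions of all bits set in a config value."""
    return [EVENT_BITS[bit] for bit in sorted(EVENT_BITS) if value & (1 << bit)]


if __name__ == "__main__":
    for domain_id, value in parse_config_line("0=0x15;1=0x15").items():
        print("domain %d: 0x%02x" % (domain_id, value))
        for description in decode_config(value):
            print("   ", description)
```

Under this bit layout, 0x15 selects the three local NUMA sources and 0x33
selects the four read sources, consistent with how those values are
described in this cover letter.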
	
h. Change the bandwidth source for domain 0 for the total event to count only reads.
   Note that this change affects total events on domain 0.
	
	#echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	0=0x33;1=0x7f
	
i. Now read the total event again. The first read will come back with "Unavailable"
   status. The subsequent read of mbm_total_bytes will display only the read events.
	
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	Unavailable
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	314101

j. Users will have the option to go back to 'default' mbm_assign_mode if required.
   This can be done using the following command. Note that switching the
   mbm_assign_mode will reset all the MBM counters of all resctrl groups.

	# echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
	# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
	mbm_cntr_assign
	[default]

	
k. Unmount resctrl
	 
	#umount /sys/fs/resctrl/
---
v9:
   Patch 14 is a new addition. 
   Major change in patch 24.
   Moved the fix patch that addresses the __init attribute to the beginning of the series.
   Fixed all the call sequences. Added additional Fixes tags.

   Added Reviewed-by where applicable.

   Took care of a couple of minor merge conflicts with the latest code.
   Re-ordered the MSR in a couple of instances.
   Added available_mbm_cntrs (patch 14) to print the number of free counters in a domain.

   Used MBM_EVENT_ARRAY_INDEX macro to get the event index.
   Introduced rdtgroup_cntr_id_init() to initialize the cntr_id

   Introduced new function resctrl_config_cntr to assign the counter, update
   the bitmap and reset the architectural state.
   Took care of error handling (freeing the counter) when assignment fails.
  
   Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
   Updated a couple of rdtgroup_unassign_cntrs() calls.

   Fixed a problem when changing the mode to mbm_cntr_assign while it is
   not supported. Added extra checks to detect whether the system supports it.
   
   https://lore.kernel.org/lkml/03b278b5-6c15-4d09-9ab7-3317e84a409e@intel.com/
   As discussed in the above comment, introduced resctrl_mon_event_config_set to
   handle IPI. But sending another IPI from inside an IPI causes a problem: the
   kernel reports an SMP warning. So, introduced resctrl_arch_update_cntr() to
   send the command directly.

   Fixed handling of the special cases '//0=' and '//'.
   Removed extra strstr() call in rdtgroup_mbm_assign_control_write().
   Added generic failure text when assignment operation fails.
   Corrected user documentation format texts.

v8:
  Patches are getting into final stages.
  A couple of changes in Patch 8, Patch 19 and Patch 23.
  Most of the other changes are related to rename and text message updates.

  Details are in each patch. Here is the summary.

  Added __init attribute to dom_data_init() in patch 8/25.
  Moved the mbm_cntrs_init() and mbm_cntrs_exit() functionality inside
  dom_data_init() and dom_data_exit() respectively.

  Renamed resctrl_mbm_evt_config_init() to arch_mbm_evt_config_init()
  Renamed resctrl_arch_event_config_get() to resctrl_arch_mon_event_config_get().
          resctrl_arch_event_config_set() to resctrl_arch_mon_event_config_set().

  Rename resctrl_arch_assign_cntr to resctrl_arch_config_cntr.
  Renamed rdtgroup_assign_cntr() to rdtgroup_assign_cntr_event().
  Added the code to return the error if rdtgroup_assign_cntr_event fails.
  Moved definition of MBM_EVENT_ARRAY_INDEX to resctrl/internal.h.
  Renamed rdtgroup_mbm_cntr_is_assigned to mbm_cntr_assigned_to_domain
  Added return error handling in resctrl_arch_config_cntr().
  Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
  Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
  Fixed the problem with unassigning the child MON groups of CTRL_MON group.
  Reset the internal counters after mbm_cntr_assign mode is changed.
  Renamed rdtgroup_mbm_cntr_reset() to mbm_cntr_reset()
  Renamed resctrl_arch_mbm_cntr_assign_configure to
            resctrl_arch_mbm_cntr_assign_set_one.

  Used the same IPI as event update to modify the assignment.
  Could not do the way we discussed in the thread.
  https://lore.kernel.org/lkml/f77737ac-d3f6-3e4b-3565-564f79c86ca8@amd.com/
  Needed to figure out event type to update the configuration.

  Moved unassign first and assign during the assign modification.
  Assign none "_" takes priority. Cannot be mixed with other flags.
  Updated the documentation and .rst file format. htmldoc looks ok.

v7:
   Major changes are related to FS and arch codes separation.
   Changed few interface names based on feedback.
   Here is the summary; each patch lists the changes specific to that patch.

   Removed WARN_ON for num_mbm_cntrs. Decided to dynamically allocate the bitmap.
   WARN_ON is not required anymore.
 
   Renamed the function resctrl_arch_get_abmc_enabled() to resctrl_arch_mbm_cntr_assign_enabled().

   Merged resctrl_arch_mbm_cntr_assign_enable(), resctrl_arch_mbm_cntr_assign_disable()
   and renamed to resctrl_arch_mbm_cntr_assign_set(). Passed the struct rdt_resource
   to these functions.

   Removed resctrl_arch_reset_rmid_all() from arch code. This will be done by the caller in FS code.

   Updated the descriptions/commit log in resctrl.rst to generic text. Removed ABMC references.
   Renamed mbm_mode to mbm_assign_mode.
   Renamed mbm_control to mbm_assign_control.
   Introduced mutex lock in rdtgroup_mbm_mode_show().
 
   The 'legacy' mode is called 'default' mode. 

   Removed the static allocation and now allocating bitmap mbm_cntr_free_map dynamically.

   Merged rdtgroup_assign_cntr(), rdtgroup_alloc_cntr() into one.
   Merged rdtgroup_unassign_cntr(), rdtgroup_free_cntr() into one.
   
  Added struct rdt_resource to the interface functions resctrl_arch_assign_cntr ()
  and resctrl_arch_unassign_cntr().
  Rename rdtgroup_abmc_cfg() to resctrl_abmc_config_one_amd().
   
  Added a new patch to fix counter assignment on event config changes.

  Removed the references of ABMC from user interfaces.

  Simplified the parsing (strsep(&token, "//")) in rdtgroup_mbm_assign_control_write().
  Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.

  Thomas Gleixner asked us to update https://gitlab.com/x86-cpuid.org/x86-cpuid-db.
  It needs internal approval. We are working on it.

v6:
  We still need to finalize few interface details on mbm_assign_mode and mbm_assign_control
  in case of ABMC and Soft-ABMC. We can continue the discussion with this series.

  Added support for domain-id '*' to update all the domains at once.
  Fixed assign interface to allocate the counter if counter is
  not assigned.   
  Fixed unassign interface to free the counter if the counter is not
  assigned in any of the domains.

  Renamed abmc_capable to mbm_cntr_assignable.

  Renamed abmc_enabled to mbm_cntr_assign_enabled.
  Used msr_set_bit and msr_clear_bit for msr updates.
  Renamed resctrl_arch_abmc_enable() to resctrl_arch_mbm_cntr_assign_enable().
  Renamed resctrl_arch_abmc_disable() to resctrl_arch_mbm_cntr_assign_disable().

  Changed the display name from num_cntrs to num_mbm_cntrs.

  Removed the variable mbm_cntrs_free_map_len. This is not required.
  Removed the call mbm_cntrs_init() in arch code. This needs to be done at higher level.
  Used DECLARE_BITMAP to initialize mbm_cntrs_free_map.
  Removed unused config value definitions.

  Introduced mbm_cntr_map to track counters at domain level. With this
  we don't need to send MSR read to read the counter configuration.

  Separated all the counter id management to upper level in FS code.

  Added checks to detect "Unassigned" before reading the RMID.

  More details in each patch.

v5:
  Rebase changes (because of SNC support)

  Interface changes.
   /sys/fs/resctrl/mbm_assign to /sys/fs/resctrl/mbm_assign_mode.
   /sys/fs/resctrl/mbm_assign_control to /sys/fs/resctrl/mbm_assign_control.

  Added few arch specific routines.
  resctrl_arch_get_abmc_enabled.
  resctrl_arch_abmc_enable.
  resctrl_arch_abmc_disable.

  Few renames
   num_cntrs_free_map -> mbm_cntrs_free_map
   num_cntrs_init -> mbm_cntrs_init
   arch_domain_mbm_evt_config -> resctrl_arch_mbm_evt_config

  Introduced resctrl_arch_event_config_get and
    resctrl_arch_event_config_set() to update event configuration.

  Removed mon_state field mongroup. Added MON_CNTR_UNSET to initialize counters.

  Renamed ctr_id to cntr_id for the hardware counter.
 
  Report "Unassigned" in case the user attempts to read the events without assigning the counter.
  
  ABMC is enabled during the boot up. Can be enabled or disabled later.

  Fixed opcode and flags combination.
    "=_" is valid.
    "-_" and "+_" are not valid.

 Added all the comments as far as I know. If I missed something, it is not intentional.

v4: 
  Main change is domain specific event assignment.
  Kept the ABMC feature as a default.
  Dynamic switching between ABMC and mbm_legacy is still allowed.
  We are still not clear about mount option.
  Moved the monitoring related data in resctrl_mon structure from rdt_resource.
  Fixed the display of legacy and ABMC mode.
  Used bitmap APIs when possible.
  Removed event configuration read from MSRs. We can use the
  internal saved data. (patch 12)
  Added more comments about L3_QOS_ABMC_CFG MSR.
  Added IPIs to read the assignment status for each domain (patch 18 and 19)
  More details in each patch.

v3:
   This series adds the support for global assignment mode discussed in
   the thread. https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
   Removed the individual assignment mode and included the global assignment interface.
   Added following interface files.
   a. /sys/fs/resctrl/info/L3_MON/mbm_assign
      Used for displaying the current assignment mode and switch between
      ABMC and legacy mode.
   b. /sys/fs/resctrl/info/L3_MON/mbm_assign_control
      Used for listing the groups' assignment states and modifying them.
   c. Most of the changes are related to the new interface.
   d. Addressed the comments from Reinette, James and Peter.
   e. Hope I have addressed most of the major feedbacks discussed. If I missed
      something then it is not intentional. Please feel free to comment.
   f. Sending this as an RFC as per Reinette's comment. So, this is still open
      for discussion.

v2:
   a. Major change is the way ABMC is enabled. Earlier, user needed to remount
      with -o abmc to enable ABMC feature. Removed that option now.
      Now users can enable ABMC by "$echo 1 to /sys/fs/resctrl/info/L3_MON/mbm_assign_enable".
     
   b. Added new word 21 to x86/cpufeatures.h.

   c. Display unsupported if user attempts to read the events when ABMC is enabled
      and event is not assigned.

   d. Display monitor_state as "Unsupported" when ABMC is disabled.
  
   e. Text updates and rebase to latest tip tree (as of Jan 18).
 
   f. This series is still work in progress. I am yet to hear from ARM developers. 

v8: https://lore.kernel.org/lkml/cover.1728495588.git.babu.moger@amd.com/
v7: https://lore.kernel.org/lkml/cover.1725488488.git.babu.moger@amd.com/
v6: https://lore.kernel.org/lkml/cover.1722981659.git.babu.moger@amd.com/
v5: https://lore.kernel.org/lkml/cover.1720043311.git.babu.moger@amd.com/
v4: https://lore.kernel.org/lkml/cover.1716552602.git.babu.moger@amd.com/
v3: https://lore.kernel.org/lkml/cover.1711674410.git.babu.moger@amd.com/  
v2: https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
v1: https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/


Babu Moger (26):
  x86/resctrl: Add __init attribute for the functions called in
    resctrl_late_init
  x86/cpufeatures: Add support for Assignable Bandwidth Monitoring
    Counters (ABMC)
  x86/resctrl: Add ABMC feature in the command line options
  x86/resctrl: Consolidate monitoring related data from rdt_resource
  x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
  x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags
  x86/resctrl: Add support to enable/disable AMD ABMC feature
  x86/resctrl: Introduce the interface to display monitor mode
  x86/resctrl: Introduce interface to display number of monitoring
    counters
  x86/resctrl: Introduce bitmap mbm_cntr_free_map to track assignable
    counters
  x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct
    rdt_hw_mon_domain
  x86/resctrl: Remove MSR reading of event configuration value
  x86/resctrl: Introduce mbm_cntr_map to track assignable counters at
    domain
  x86/resctrl: Introduce interface to display number of free counters
  x86/resctrl: Add data structures and definitions for ABMC assignment
  x86/resctrl: Introduce cntr_id in mongroup for assignments
  x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter
    with ABMC
  x86/resctrl: Add the interface to assign/update counter assignment
  x86/resctrl: Add the interface to unassign a MBM counter
  x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is
    enabled
  x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign
    mode
  x86/resctrl: Introduce the interface to switch between monitor modes
  x86/resctrl: Configure mbm_cntr_assign mode if supported
  x86/resctrl: Update assignments on event configuration changes
  x86/resctrl: Introduce interface to list assignment states of all the
    groups
  x86/resctrl: Introduce interface to modify assignment states of the
    groups

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/arch/x86/resctrl.rst            | 233 +++++
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/msr-index.h              |   2 +
 arch/x86/kernel/cpu/cpuid-deps.c              |   3 +
 arch/x86/kernel/cpu/resctrl/core.c            |  27 +-
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c     |  12 +-
 arch/x86/kernel/cpu/resctrl/internal.h        |  91 +-
 arch/x86/kernel/cpu/resctrl/monitor.c         | 113 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        | 967 ++++++++++++++++--
 arch/x86/kernel/cpu/scattered.c               |   1 +
 include/linux/resctrl.h                       |  32 +-
 12 files changed, 1389 insertions(+), 95 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-15 23:21   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 02/26] x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (24 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The function resctrl_late_init() has the __init attribute, but some
functions it calls do not. Add the __init attribute to all the functions
to maintain consistency throughout the call sequence.

Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
Fixes: def10853930a ("x86/intel_rdt: Add two new resources for L2 Code and Data Prioritization (CDP)")
Fixes: bd334c86b5d7 ("x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Moved the patch to the beginning of the series.
    Fixed all the call sequences. Added additional Fixes tags.

v8: New patch.
---
 arch/x86/kernel/cpu/resctrl/core.c     | 8 ++++----
 arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
 arch/x86/kernel/cpu/resctrl/monitor.c  | 4 ++--
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b681c2e07dbf..f845d0590429 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -275,7 +275,7 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
 	return true;
 }
 
-static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
+static __init void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
 	union cpuid_0x10_1_eax eax;
@@ -294,7 +294,7 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
 	r->alloc_capable = true;
 }
 
-static void rdt_get_cdp_config(int level)
+static __init void rdt_get_cdp_config(int level)
 {
 	/*
 	 * By default, CDP is disabled. CDP can be enabled by mount parameter
@@ -304,12 +304,12 @@ static void rdt_get_cdp_config(int level)
 	rdt_resources_all[level].r_resctrl.cdp_capable = true;
 }
 
-static void rdt_get_cdp_l3_config(void)
+static __init void rdt_get_cdp_l3_config(void)
 {
 	rdt_get_cdp_config(RDT_RESOURCE_L3);
 }
 
-static void rdt_get_cdp_l2_config(void)
+static __init void rdt_get_cdp_l2_config(void)
 {
 	rdt_get_cdp_config(RDT_RESOURCE_L2);
 }
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 955999aecfca..16181b90159a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -627,7 +627,7 @@ int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
 void free_rmid(u32 closid, u32 rmid);
-int rdt_get_mon_l3_config(struct rdt_resource *r);
+int __init rdt_get_mon_l3_config(struct rdt_resource *r);
 void __exit rdt_put_mon_l3_config(void);
 bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 851b561850e0..17790f92ef51 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -983,7 +983,7 @@ void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_
 		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
 
-static int dom_data_init(struct rdt_resource *r)
+static __init int dom_data_init(struct rdt_resource *r)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	u32 num_closid = resctrl_arch_get_num_closid(r);
@@ -1081,7 +1081,7 @@ static struct mon_evt mbm_local_event = {
  * because as per the SDM the total and local memory bandwidth
  * are enumerated as part of L3 monitoring.
  */
-static void l3_mon_evt_init(struct rdt_resource *r)
+static void __init l3_mon_evt_init(struct rdt_resource *r)
 {
 	INIT_LIST_HEAD(&r->evt_list);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v9 02/26] x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC)
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
  2024-10-29 23:21 ` [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 03/26] x86/resctrl: Add ABMC feature in the command line options Babu Moger
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Users can create as many monitor groups as there are RMIDs supported by the
hardware. However, the bandwidth monitoring feature on AMD systems only
guarantees that RMIDs currently assigned to a processor will be tracked by
hardware. The counters of any other RMIDs that are no longer being tracked
will be reset to zero. The MBM event counters return "Unavailable" for the
RMIDs that are not tracked by hardware. So, only a limited number of groups
can give guaranteed monitoring numbers. With ever-changing configurations
there is no way to know for certain which of these groups are being tracked
at a given point in time. Users do not have the option to monitor a group or
set of groups for a certain period of time without worrying about an RMID
being reset in between.

The ABMC feature provides an option to the user to assign a hardware
counter to an (RMID, event) pair and monitor the bandwidth as long as it is
assigned. The assigned RMID will be tracked by the hardware until the user
unassigns it manually. There is no need to worry about counters being reset
during this period. Additionally, the user can specify a bitmask
identifying the specific bandwidth types from the given source to track
with the counter.

Without ABMC enabled, monitoring works in the current mode, without the
assignment option.

The Linux resctrl subsystem provides an interface to count a maximum of two
memory bandwidth events per group, from a combination of available total
and local events. Keeping the current interface, users can enable a maximum
of 2 ABMC counters per group. Users also have the option to enable only
one counter for the group. If the system runs out of assignable ABMC
counters, the kernel will display an error. Users need to disable an already
enabled counter to make space for new assignments.

The feature can be detected via CPUID_Fn80000020_EBX_x00 bit 5.
Bits Description
5    ABMC (Assignable Bandwidth Monitoring Counters)
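
As a rough illustration of the detection described above, the bit test can
be sketched in plain C. This is not the kernel's code path (the kernel
uses the scattered.c table in the diff below); ebx_has_abmc() and
ABMC_CPUID_BIT are hypothetical names, and the EBX value is assumed to
have been read from CPUID Fn8000_0020, subleaf 0, by the caller.

```c
#include <assert.h>	/* assert() for callers that want to self-check */
#include <stdbool.h>
#include <stdint.h>

/* Bit position of ABMC in CPUID Fn8000_0020 EBX, subleaf 0 (from the text above). */
#define ABMC_CPUID_BIT	5

/*
 * Hypothetical helper: report whether the ABMC bit is set in an EBX value
 * that the caller has already obtained from CPUID.
 */
static bool ebx_has_abmc(uint32_t ebx)
{
	return ((ebx >> ABMC_CPUID_BIT) & 1u) != 0;
}
```

A caller would pass the raw EBX value; any value with bit 5 clear is
reported as not supporting ABMC, regardless of the other bits.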

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
Note: Checkpatch checks/warnings are ignored to maintain coding style.

v9: Took care of a couple of minor merge conflicts. No other changes.

v8: No changes.

v7: Removed "" from feature flags. Not required anymore.
    https://lore.kernel.org/lkml/20240817145058.GCZsC40neU4wkPXeVR@fat_crate.local/

v6: Added Reinette's Reviewed-by. Moved the Checkpatch note below ---.

v5: Minor rebase change and subject line update.

v4: Changes because of rebase. Feature word 21 has a few more additions now.
    Changed the text to "tracked by hardware" instead of active.

v3: Change because of rebase. Actual patch did not change.

v2: Added dependency on X86_FEATURE_BMEC.
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kernel/cpu/cpuid-deps.c   | 3 +++
 arch/x86/kernel/cpu/scattered.c    | 1 +
 3 files changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ea33439a5d00..f52e5f8fe24f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -476,6 +476,7 @@
 #define X86_FEATURE_AMD_FAST_CPPC	(21*32 + 5) /* Fast CPPC */
 #define X86_FEATURE_AMD_HETEROGENEOUS_CORES (21*32 + 6) /* Heterogeneous Core Topology */
 #define X86_FEATURE_AMD_WORKLOAD_CLASS	(21*32 + 7) /* Workload Classification */
+#define X86_FEATURE_ABMC		(21*32 + 8) /* Assignable Bandwidth Monitoring Counters */
 
 /*
  * BUG word(s)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 8bd84114c2d9..7e4d63b381d6 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -70,6 +70,9 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_CQM_MBM_LOCAL,		X86_FEATURE_CQM_LLC   },
 	{ X86_FEATURE_BMEC,			X86_FEATURE_CQM_MBM_TOTAL   },
 	{ X86_FEATURE_BMEC,			X86_FEATURE_CQM_MBM_LOCAL   },
+	{ X86_FEATURE_ABMC,			X86_FEATURE_CQM_MBM_TOTAL   },
+	{ X86_FEATURE_ABMC,			X86_FEATURE_CQM_MBM_LOCAL   },
+	{ X86_FEATURE_ABMC,			X86_FEATURE_BMEC      },
 	{ X86_FEATURE_AVX512_BF16,		X86_FEATURE_AVX512VL  },
 	{ X86_FEATURE_AVX512_FP16,		X86_FEATURE_AVX512BW  },
 	{ X86_FEATURE_ENQCMD,			X86_FEATURE_XSAVES    },
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 16f3ca30626a..3b72b72270f1 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -49,6 +49,7 @@ static const struct cpuid_bit cpuid_bits[] = {
 	{ X86_FEATURE_MBA,			CPUID_EBX,  6, 0x80000008, 0 },
 	{ X86_FEATURE_SMBA,			CPUID_EBX,  2, 0x80000020, 0 },
 	{ X86_FEATURE_BMEC,			CPUID_EBX,  3, 0x80000020, 0 },
+	{ X86_FEATURE_ABMC,			CPUID_EBX,  5, 0x80000020, 0 },
 	{ X86_FEATURE_AMD_WORKLOAD_CLASS,	CPUID_EAX, 22, 0x80000021, 0 },
 	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
-- 
2.34.1



* [PATCH v9 03/26] x86/resctrl: Add ABMC feature in the command line options
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
  2024-10-29 23:21 ` [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init Babu Moger
  2024-10-29 23:21 ` [PATCH v9 02/26] x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 04/26] x86/resctrl: Consolidate monitoring related data from rdt_resource Babu Moger
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Add the command line option to enable or disable exposing the ABMC
(Assignable Bandwidth Monitoring Counters) hardware feature to resctrl.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v9: No code changes. Added Reviewed-by.

v8: Commit message update.

v7: No changes

v6: No changes

v5: No changes

v4: No changes

v3: No changes

v2: No changes
---
 Documentation/admin-guide/kernel-parameters.txt | 2 +-
 Documentation/arch/x86/resctrl.rst              | 1 +
 arch/x86/kernel/cpu/resctrl/core.c              | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1518343bbe22..b3b3ca564220 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5677,7 +5677,7 @@
 	rdt=		[HW,X86,RDT]
 			Turn on/off individual RDT features. List is:
 			cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
-			mba, smba, bmec.
+			mba, smba, bmec, abmc.
 			E.g. to turn on cmt and turn off mba use:
 				rdt=cmt,!mba
 
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a824affd741d..30586728a4cd 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring)		"cqm_mbm_total", "cqm_mbm_local"
 MBA (Memory Bandwidth Allocation)		"mba"
 SMBA (Slow Memory Bandwidth Allocation)         ""
 BMEC (Bandwidth Monitoring Event Configuration) ""
+ABMC (Assignable Bandwidth Monitoring Counters) ""
 ===============================================	================================
 
 Historically, new features were made visible by default in /proc/cpuinfo. This
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f845d0590429..25616d82c0cc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -809,6 +809,7 @@ enum {
 	RDT_FLAG_MBA,
 	RDT_FLAG_SMBA,
 	RDT_FLAG_BMEC,
+	RDT_FLAG_ABMC,
 };
 
 #define RDT_OPT(idx, n, f)	\
@@ -834,6 +835,7 @@ static struct rdt_options rdt_options[]  __initdata = {
 	RDT_OPT(RDT_FLAG_MBA,	    "mba",	X86_FEATURE_MBA),
 	RDT_OPT(RDT_FLAG_SMBA,	    "smba",	X86_FEATURE_SMBA),
 	RDT_OPT(RDT_FLAG_BMEC,	    "bmec",	X86_FEATURE_BMEC),
+	RDT_OPT(RDT_FLAG_ABMC,	    "abmc",	X86_FEATURE_ABMC),
 };
 #define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
 
-- 
2.34.1



* [PATCH v9 04/26] x86/resctrl: Consolidate monitoring related data from rdt_resource
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (2 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 03/26] x86/resctrl: Add ABMC feature in the command line options Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 05/26] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details Babu Moger
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The cache allocation and memory bandwidth allocation feature properties
are consolidated into struct resctrl_cache and struct resctrl_membw
respectively.

In preparation for additional monitoring properties that would further
clutter the existing resource struct, reorganize the monitoring-specific
properties into a separate structure as well.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v9: No changes.

v8: Added Reviewed-by from Reinette. No other changes.

v7: Added kernel doc for data structure. Minor text update.

v6: Update commit message and update kernel doc for rdt_resource.

v5: Commit message update.
    Also changes related to data structure updates due to SNC support.

v4: New patch.
---
 arch/x86/kernel/cpu/resctrl/core.c     |  4 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c  | 18 +++++++++---------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  8 ++++----
 include/linux/resctrl.h                | 16 ++++++++++++----
 4 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 25616d82c0cc..468af203ca69 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -124,7 +124,7 @@ u32 resctrl_arch_system_num_rmid_idx(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 
 	/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
-	return r->num_rmid;
+	return r->mon.num_rmid;
 }
 
 /*
@@ -625,7 +625,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 
 	arch_mon_domain_online(r, d);
 
-	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
 		mon_domain_free(hw_dom);
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 17790f92ef51..f7fdf2c767f0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -222,7 +222,7 @@ static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
 	if (snc_nodes_per_l3_cache == 1)
 		return lrmid;
 
-	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->mon.num_rmid;
 }
 
 static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
@@ -297,11 +297,11 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
-		       sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
+		       sizeof(*hw_dom->arch_mbm_total) * r->mon.num_rmid);
 
 	if (is_mbm_local_enabled())
 		memset(hw_dom->arch_mbm_local, 0,
-		       sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
+		       sizeof(*hw_dom->arch_mbm_local) * r->mon.num_rmid);
 }
 
 static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
@@ -1083,14 +1083,14 @@ static struct mon_evt mbm_local_event = {
  */
 static void __init l3_mon_evt_init(struct rdt_resource *r)
 {
-	INIT_LIST_HEAD(&r->evt_list);
+	INIT_LIST_HEAD(&r->mon.evt_list);
 
 	if (is_llc_occupancy_enabled())
-		list_add_tail(&llc_occupancy_event.list, &r->evt_list);
+		list_add_tail(&llc_occupancy_event.list, &r->mon.evt_list);
 	if (is_mbm_total_enabled())
-		list_add_tail(&mbm_total_event.list, &r->evt_list);
+		list_add_tail(&mbm_total_event.list, &r->mon.evt_list);
 	if (is_mbm_local_enabled())
-		list_add_tail(&mbm_local_event.list, &r->evt_list);
+		list_add_tail(&mbm_local_event.list, &r->mon.evt_list);
 }
 
 /*
@@ -1186,7 +1186,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
 	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
-	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
+	r->mon.num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
@@ -1201,7 +1201,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	 *
 	 * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
 	 */
-	threshold = resctrl_rmid_realloc_limit / r->num_rmid;
+	threshold = resctrl_rmid_realloc_limit / r->mon.num_rmid;
 
 	/*
 	 * Because num_rmid may not be a power of two, round the value
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d906a1cd8491..1647ad9145ef 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1097,7 +1097,7 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
 {
 	struct rdt_resource *r = of->kn->parent->priv;
 
-	seq_printf(seq, "%d\n", r->num_rmid);
+	seq_printf(seq, "%d\n", r->mon.num_rmid);
 
 	return 0;
 }
@@ -1108,7 +1108,7 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
 	struct rdt_resource *r = of->kn->parent->priv;
 	struct mon_evt *mevt;
 
-	list_for_each_entry(mevt, &r->evt_list, list) {
+	list_for_each_entry(mevt, &r->mon.evt_list, list) {
 		seq_printf(seq, "%s\n", mevt->name);
 		if (mevt->configurable)
 			seq_printf(seq, "%s_config\n", mevt->name);
@@ -3057,13 +3057,13 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
 	struct mon_evt *mevt;
 	int ret;
 
-	if (WARN_ON(list_empty(&r->evt_list)))
+	if (WARN_ON(list_empty(&r->mon.evt_list)))
 		return -EPERM;
 
 	priv.u.rid = r->rid;
 	priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
 	priv.u.sum = do_sum;
-	list_for_each_entry(mevt, &r->evt_list, list) {
+	list_for_each_entry(mevt, &r->mon.evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
 		if (ret)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d94abba1c716..3c2307c7c106 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -182,16 +182,26 @@ enum resctrl_scope {
 	RESCTRL_L3_NODE,
 };
 
+/**
+ * struct resctrl_mon - Monitoring related data of a resctrl resource
+ * @num_rmid:		Number of RMIDs available
+ * @evt_list:		List of monitoring events
+ */
+struct resctrl_mon {
+	int			num_rmid;
+	struct list_head	evt_list;
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
- * @num_rmid:		Number of RMIDs available
  * @ctrl_scope:		Scope of this resource for control functions
  * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
+ * @mon:		Monitoring related data.
  * @ctrl_domains:	RCU list of all control domains for this resource
  * @mon_domains:	RCU list of all monitor domains for this resource
  * @name:		Name to use in "schemata" file.
@@ -199,7 +209,6 @@ enum resctrl_scope {
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
  * @format_str:		Per resource format string to show domain value
  * @parse_ctrlval:	Per resource function pointer to parse control values
- * @evt_list:		List of monitoring events
  * @fflags:		flags to choose base and info files
  * @cdp_capable:	Is the CDP feature available on this resource
  */
@@ -207,11 +216,11 @@ struct rdt_resource {
 	int			rid;
 	bool			alloc_capable;
 	bool			mon_capable;
-	int			num_rmid;
 	enum resctrl_scope	ctrl_scope;
 	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
+	struct resctrl_mon	mon;
 	struct list_head	ctrl_domains;
 	struct list_head	mon_domains;
 	char			*name;
@@ -221,7 +230,6 @@ struct rdt_resource {
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
 						 struct rdt_ctrl_domain *d);
-	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
 };
-- 
2.34.1



* [PATCH v9 05/26] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (3 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 04/26] x86/resctrl: Consolidate monitoring related data from rdt_resource Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 06/26] x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags Babu Moger
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
Bits Description
15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
     Monitoring Counter ID + 1

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Detect the feature and number of assignable monitoring counters supported.
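
The counter-count computation can be sketched outside the kernel as
follows. It mirrors the arithmetic used in the diff below, (ebx &
GENMASK(15, 0)) + 1, but abmc_num_cntrs() and ABMC_MAX_ID_MASK are
hypothetical names, and the EBX value is assumed to come from CPUID
Fn8000_0020, subleaf 5.

```c
#include <assert.h>	/* assert() for callers that want to self-check */
#include <stdint.h>

/* EBX bits 15:0 of CPUID Fn8000_0020, subleaf 5 (same as GENMASK(15, 0)). */
#define ABMC_MAX_ID_MASK	0xffffu

/*
 * Hypothetical helper: derive the number of assignable counters from EBX
 * by masking the low 16 bits and adding one, as the patch does.
 */
static unsigned int abmc_num_cntrs(uint32_t ebx)
{
	return (ebx & ABMC_MAX_ID_MASK) + 1;
}
```

With this arithmetic, an EBX value whose low 16 bits are 31 yields 32
counters; the upper 16 bits are ignored.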

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v9: Added Reviewed-by tag. No code changes

v8: Used GENMASK for the mask.

v7: Removed WARN_ON for num_mbm_cntrs. Decided to dynamically allocate the
    bitmap. WARN_ON is not required anymore.
    Removed redundant comments.

v6: Commit message update.
    Renamed abmc_capable to mbm_cntr_assignable.

v5: Name change num_cntrs to num_mbm_cntrs.
    Moved abmc_capable to resctrl_mon.

v4: Removed resctrl_arch_has_abmc(). Added all the code inline. We don't
    need to separate this as arch code.

v3: Removed changes related to mon_features.
    Moved rdt_cpu_has to core.c and added new function resctrl_arch_has_abmc.
    Also moved the fields mbm_assign_capable and mbm_assign_cntrs to
    rdt_resource. (James)

v2: Changed the field name to mbm_assign_capable from abmc_capable.
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
 include/linux/resctrl.h               | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f7fdf2c767f0..383bf3ad2dcf 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1229,6 +1229,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			mbm_local_event.configurable = true;
 			mbm_config_rftype_init("mbm_local_bytes_config");
 		}
+
+		if (rdt_cpu_has(X86_FEATURE_ABMC)) {
+			r->mon.mbm_cntr_assignable = true;
+			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
+			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
+		}
 	}
 
 	l3_mon_evt_init(r);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 3c2307c7c106..511cfce8fc21 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -185,10 +185,14 @@ enum resctrl_scope {
 /**
  * struct resctrl_mon - Monitoring related data of a resctrl resource
  * @num_rmid:		Number of RMIDs available
+ * @num_mbm_cntrs:	Number of assignable monitoring counters
+ * @mbm_cntr_assignable:Is system capable of supporting monitor assignment?
  * @evt_list:		List of monitoring events
  */
 struct resctrl_mon {
 	int			num_rmid;
+	int			num_mbm_cntrs;
+	bool			mbm_cntr_assignable;
 	struct list_head	evt_list;
 };
 
-- 
2.34.1



* [PATCH v9 06/26] x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (4 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 05/26] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 07/26] x86/resctrl: Add support to enable/disable AMD ABMC feature Babu Moger
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

thread_throttle_mode_init() and mbm_config_rftype_init() both initialize
fflags for resctrl files.

Adding new files will involve adding another function to initialize
the fflags. This can be simplified by adding a new function
resctrl_file_fflags_init() and passing the file name and flags
to be initialized.

Consolidate fflags initialization into resctrl_file_fflags_init() and
remove thread_throttle_mode_init() and mbm_config_rftype_init().
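
The consolidation can be pictured with a small standalone sketch: one
helper looks a file up by name and stores whatever flags the caller
passes, instead of one helper per hard-coded flag set. Everything prefixed
demo_ or DEMO_ is hypothetical, including the flag values, which are
placeholders and not the kernel's RFTYPE_* definitions.

```c
#include <assert.h>	/* assert() for callers that want to self-check */
#include <stddef.h>
#include <string.h>

/* Placeholder flag values for illustration only. */
#define DEMO_RFTYPE_MON_INFO	0x1UL
#define DEMO_RFTYPE_CTRL_INFO	0x2UL
#define DEMO_RFTYPE_RES_CACHE	0x4UL
#define DEMO_RFTYPE_RES_MB	0x8UL

struct demo_rftype {
	const char	*name;
	unsigned long	fflags;
};

/* Stand-in for the table of resctrl files; flags start out unset. */
static struct demo_rftype demo_files[] = {
	{ "thread_throttle_mode",	0 },
	{ "mbm_total_bytes_config",	0 },
	{ "mbm_local_bytes_config",	0 },
};

/* Linear search by file name; returns NULL when no entry matches. */
static struct demo_rftype *demo_get_rftype_by_name(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_files) / sizeof(demo_files[0]); i++) {
		if (!strcmp(demo_files[i].name, name))
			return &demo_files[i];
	}

	return NULL;
}

/* Single entry point: the caller supplies both the name and the flags. */
static void demo_file_fflags_init(const char *name, unsigned long fflags)
{
	struct demo_rftype *rft = demo_get_rftype_by_name(name);

	if (rft)
		rft->fflags = fflags;
}
```

An unknown file name is silently ignored, matching the behaviour of the
helper in the diff below, which only assigns fflags when the lookup
succeeds.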

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v9: No changes.

v8: No changes.

v7: No changes.

v6: Added Reviewed-by from Reinette.

v5: Commit message update.

v4: Commit message update.

v3: New patch to display ABMC capability.
---
 arch/x86/kernel/cpu/resctrl/core.c     |  4 +++-
 arch/x86/kernel/cpu/resctrl/internal.h |  4 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c  |  6 ++++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 +++-------------
 4 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 468af203ca69..7beac735c8e5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -234,7 +234,9 @@ static __init bool __get_mem_config_intel(struct rdt_resource *r)
 		r->membw.throttle_mode = THREAD_THROTTLE_PER_THREAD;
 	else
 		r->membw.throttle_mode = THREAD_THROTTLE_MAX;
-	thread_throttle_mode_init();
+
+	resctrl_file_fflags_init("thread_throttle_mode",
+				 RFTYPE_CTRL_INFO | RFTYPE_RES_MB);
 
 	r->alloc_capable = true;
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 16181b90159a..9dd1799adba3 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -647,8 +647,8 @@ void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_mon_domain *d);
 void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
-void __init thread_throttle_mode_init(void);
-void __init mbm_config_rftype_init(const char *config);
+void __init resctrl_file_fflags_init(const char *config,
+				     unsigned long fflags);
 void rdt_staged_configs_clear(void);
 bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 383bf3ad2dcf..9209fb3dc78f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1223,11 +1223,13 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 
 		if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
 			mbm_total_event.configurable = true;
-			mbm_config_rftype_init("mbm_total_bytes_config");
+			resctrl_file_fflags_init("mbm_total_bytes_config",
+						 RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
 		}
 		if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
 			mbm_local_event.configurable = true;
-			mbm_config_rftype_init("mbm_local_bytes_config");
+			resctrl_file_fflags_init("mbm_local_bytes_config",
+						 RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
 		}
 
 		if (rdt_cpu_has(X86_FEATURE_ABMC)) {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 1647ad9145ef..687d9d8d82a4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2020,24 +2020,14 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name)
 	return NULL;
 }
 
-void __init thread_throttle_mode_init(void)
-{
-	struct rftype *rft;
-
-	rft = rdtgroup_get_rftype_by_name("thread_throttle_mode");
-	if (!rft)
-		return;
-
-	rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB;
-}
-
-void __init mbm_config_rftype_init(const char *config)
+void __init resctrl_file_fflags_init(const char *config,
+				     unsigned long fflags)
 {
 	struct rftype *rft;
 
 	rft = rdtgroup_get_rftype_by_name(config);
 	if (rft)
-		rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE;
+		rft->fflags = fflags;
 }
 
 /**
-- 
2.34.1



* [PATCH v9 07/26] x86/resctrl: Add support to enable/disable AMD ABMC feature
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (5 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 06/26] x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode Babu Moger
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Add the functionality to enable/disable the AMD ABMC feature.

The AMD ABMC feature is enabled by setting the enable bit (bit 0) in the
L3_QOS_EXT_CFG MSR. When the state of ABMC is changed, the MSR needs
to be updated on all the logical processors in the QOS Domain.

Hardware counters will reset when the ABMC state is changed.
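
The bit manipulation involved can be modelled on a plain 64-bit value, as
in the sketch below. It does not touch any MSR; abmc_cfg_update() is a
hypothetical name, and the function only shows what setting or clearing
bit 0 does to a register image while leaving the other bits alone.

```c
#include <assert.h>	/* assert() for callers that want to self-check */
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of L3_QOS_EXT_CFG enables ABMC (from the text above). */
#define ABMC_ENABLE_BIT	0

/*
 * Hypothetical helper: return the register image with the ABMC enable bit
 * set or cleared; all other bits are preserved.
 */
static uint64_t abmc_cfg_update(uint64_t cfg, bool enable)
{
	const uint64_t mask = UINT64_C(1) << ABMC_ENABLE_BIT;

	return enable ? (cfg | mask) : (cfg & ~mask);
}
```

In the patch itself the read-modify-write is done per CPU with
msr_set_bit() and msr_clear_bit(), invoked on every CPU of each monitor
domain.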

The ABMC feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v9: Re-ordered the MSR and added Reviewed-by tag.

v8: Commit message update and moved around the comments about L3_QOS_EXT_CFG
    to _resctrl_abmc_enable.

v7: Renamed the function
    resctrl_arch_get_abmc_enabled() to resctrl_arch_mbm_cntr_assign_enabled().

    Merged resctrl_arch_mbm_cntr_assign_enable, resctrl_arch_mbm_cntr_assign_disable
    and renamed to resctrl_arch_mbm_cntr_assign_set().

    Moved the function definition to linux/resctrl.h.

    Passed the struct rdt_resource to these functions.
    Removed resctrl_arch_reset_rmid_all() from arch code. This will be done
    from the caller.

v6: Renamed abmc_enabled to mbm_cntr_assign_enabled.
    Used msr_set_bit and msr_clear_bit for msr updates.
    Renamed resctrl_arch_abmc_enable() to resctrl_arch_mbm_cntr_assign_enable().
    Renamed resctrl_arch_abmc_disable() to resctrl_arch_mbm_cntr_assign_disable().
    Made _resctrl_abmc_enable to return void.

v5: Renamed resctrl_abmc_enable to resctrl_arch_abmc_enable.
    Renamed resctrl_abmc_disable to resctrl_arch_abmc_disable.
    Introduced resctrl_arch_get_abmc_enabled to get abmc state from
    non-arch code.
    Renamed resctrl_abmc_set_all to _resctrl_abmc_enable().
    Modified commit log to make it clear about AMD ABMC feature.

v3: No changes.

v2: A few text changes in commit message.
---
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kernel/cpu/resctrl/core.c     |  5 ++++
 arch/x86/kernel/cpu/resctrl/internal.h |  5 ++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 36 ++++++++++++++++++++++++++
 include/linux/resctrl.h                |  3 +++
 5 files changed, 50 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..bdc95b7cd1b0 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1194,6 +1194,7 @@
 /* - AMD: */
 #define MSR_IA32_MBA_BW_BASE		0xc0000200
 #define MSR_IA32_SMBA_BW_BASE		0xc0000280
+#define MSR_IA32_L3_QOS_EXT_CFG		0xc00003ff
 #define MSR_IA32_EVT_CFG_BASE		0xc0000400
 
 /* AMD-V MSRs */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7beac735c8e5..9603f5cb483c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -405,6 +405,11 @@ void rdt_ctrl_update(void *arg)
 	hw_res->msr_update(m);
 }
 
+bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r)
+{
+	return resctrl_to_arch_res(r)->mbm_cntr_assign_enabled;
+}
+
 /*
  * rdt_find_domain - Search for a domain id in a resource domain list.
  *
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 9dd1799adba3..c07a93da31cc 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -56,6 +56,9 @@
 /* Max event bits supported */
 #define MAX_EVT_CONFIG_BITS		GENMASK(6, 0)
 
+/* Setting bit 0 in L3_QOS_EXT_CFG enables the ABMC feature. */
+#define ABMC_ENABLE_BIT			0
+
 /**
  * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
  *			        aren't marked nohz_full
@@ -477,6 +480,7 @@ struct rdt_parse_data {
  * @mbm_cfg_mask:	Bandwidth sources that can be tracked when Bandwidth
  *			Monitoring Event Configuration (BMEC) is supported.
  * @cdp_enabled:	CDP state of this resource
+ * @mbm_cntr_assign_enabled:	ABMC feature is enabled
  *
  * Members of this structure are either private to the architecture
  * e.g. mbm_width, or accessed via helpers that provide abstraction. e.g.
@@ -491,6 +495,7 @@ struct rdt_hw_resource {
 	unsigned int		mbm_width;
 	unsigned int		mbm_cfg_mask;
 	bool			cdp_enabled;
+	bool			mbm_cntr_assign_enabled;
 };
 
 static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 687d9d8d82a4..d54c2701c09c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2402,6 +2402,42 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
 	return 0;
 }
 
+static void resctrl_abmc_set_one_amd(void *arg)
+{
+	bool *enable = arg;
+
+	if (*enable)
+		msr_set_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
+	else
+		msr_clear_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
+}
+
+/*
+ * Update L3_QOS_EXT_CFG MSR on all the CPUs associated with the monitor
+ * domain.
+ */
+static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
+{
+	struct rdt_mon_domain *d;
+
+	list_for_each_entry(d, &r->mon_domains, hdr.list)
+		on_each_cpu_mask(&d->hdr.cpu_mask,
+				 resctrl_abmc_set_one_amd, &enable, 1);
+}
+
+int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
+{
+	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+	if (r->mon.mbm_cntr_assignable &&
+	    hw_res->mbm_cntr_assign_enabled != enable) {
+		_resctrl_abmc_enable(r, enable);
+		hw_res->mbm_cntr_assign_enabled = enable;
+	}
+
+	return 0;
+}
+
 /*
  * We don't allow rdtgroup directories to be created anywhere
  * except the root directory. Thus when looking for the rdtgroup
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 511cfce8fc21..f11d6fdfd977 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -355,4 +355,7 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
 
+int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
+bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
+
 #endif /* _RESCTRL_H */
-- 
2.34.1



* [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (6 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 07/26] x86/resctrl: Add support to enable/disable AMD ABMC feature Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-16  0:00   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 09/26] x86/resctrl: Introduce interface to display number of monitoring counters Babu Moger
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Introduce the interface file "mbm_assign_mode" to list the supported
monitor modes.

The "mbm_cntr_assign" mode provides the option to assign a counter to
an RMID, event pair and monitor the bandwidth as long as it is assigned.

On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
Bandwidth Monitoring Counters) hardware feature and is enabled by default.

The "default" mode is the existing monitoring mode that works without
explicit counter assignment. It instead relies on dynamic counter
assignment by hardware, which may leave an RMID without a dedicated
counter and cause monitoring data reads to return "Unavailable".

Provide an interface to display the monitor mode on the system.
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
[mbm_cntr_assign]
default
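
A user-space consumer could pick out the enabled mode by looking for the
bracketed entry in the output above. The sketch below is illustrative only
and is not part of this patch: the helper name mbm_enabled_mode() is
hypothetical, and it assumes the file keeps the format shown above, with
exactly one entry enclosed in brackets.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of this patch): copy the name of the
 * enabled mode, i.e. the entry enclosed in brackets, into @out.
 * @text is the full content read from mbm_assign_mode.
 * Returns 0 on success, -1 if there is no bracketed entry or it does
 * not fit in @out.
 */
int mbm_enabled_mode(const char *text, char *out, size_t outlen)
{
	const char *lb = strchr(text, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t len;

	if (!rb)
		return -1;

	/* Number of characters strictly between '[' and ']' */
	len = (size_t)(rb - lb - 1);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

On a system without counter assignment support the file contains only
"[default]", so the same helper would report "default".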

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Updated user documentation based on comments.

v8: Commit message update.

v7: Updated the descriptions/commit log in resctrl.rst to generic text.
    Thanks to James and Reinette.
    Rename mbm_mode to mbm_assign_mode.
    Introduced mutex lock in rdtgroup_mbm_mode_show().

v6: Added documentation for mbm_cntr_assign and legacy mode.
    Moved mbm_mode fflags initialization to static initialization.

v5: Changed interface name to mbm_mode.
    It will be always available even if ABMC feature is not supported.
    Added description in resctrl.rst about ABMC mode.
    Fixed display of abmc and legacy consistently.

v4: Fixed the checks for legacy and abmc mode. Default is ABMC.

v3: New patch to display ABMC capability.
---
 Documentation/arch/x86/resctrl.rst     | 33 ++++++++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 31 ++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 30586728a4cd..a93d7980e25f 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -257,6 +257,39 @@ with the following files:
 	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
 	    0=0x30;1=0x30;3=0x15;4=0x15
 
+"mbm_assign_mode":
+	Reports the list of monitoring modes supported. The enclosed brackets
+	indicate which mode is enabled.
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+	  [mbm_cntr_assign]
+	  default
+
+	"mbm_cntr_assign":
+
+	In mbm_cntr_assign mode user-space is able to specify which of the
+	events in CTRL_MON or MON groups should have a counter assigned using the
+	"mbm_assign_control" file. The number of counters available is described
+	in the "num_mbm_cntrs" file. Changing the mode may cause all counters on
+	a resource to reset.
+
+	The mode is useful on platforms which support more CTRL_MON and MON
+	groups than the hardware counters, meaning 'unassigned' events on CTRL_MON or
+	MON groups will report 'Unavailable' or count the traffic in an unpredictable
+	way.
+
+	AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
+	enable this mode by default so that counters remain assigned even when the
+	corresponding RMID is not in use by any processor.
+
+	"default":
+
+	In default mode resctrl assumes there is a hardware counter for each
+	event within every CTRL_MON and MON group. Reading mbm_total_bytes or
+	mbm_local_bytes may report 'Unavailable' if there is no counter associated
+	with that event.
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d54c2701c09c..f25ff1430014 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -845,6 +845,30 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
 	return ret;
 }
 
+static int rdtgroup_mbm_assign_mode_show(struct kernfs_open_file *of,
+					 struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+
+	mutex_lock(&rdtgroup_mutex);
+
+	if (r->mon.mbm_cntr_assignable) {
+		if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
+			seq_puts(s, "[mbm_cntr_assign]\n");
+			seq_puts(s, "default\n");
+		} else {
+			seq_puts(s, "mbm_cntr_assign\n");
+			seq_puts(s, "[default]\n");
+		}
+	} else {
+		seq_puts(s, "[default]\n");
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1901,6 +1925,13 @@ static struct rftype res_common_files[] = {
 		.seq_show	= mbm_local_bytes_config_show,
 		.write		= mbm_local_bytes_config_write,
 	},
+	{
+		.name		= "mbm_assign_mode",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdtgroup_mbm_assign_mode_show,
+		.fflags		= RFTYPE_MON_INFO,
+	},
 	{
 		.name		= "cpus",
 		.mode		= 0644,
-- 
2.34.1



* [PATCH v9 09/26] x86/resctrl: Introduce interface to display number of monitoring counters
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (7 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-16  0:06   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 10/26] x86/resctrl: Introduce bitmap mbm_cntr_free_map to track assignable counters Babu Moger
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The mbm_cntr_assign mode provides an option to the user to assign a
counter to an RMID, event pair and monitor the bandwidth as long as
the counter is assigned. The number of assignments depends on the number
of monitoring counters available.

Provide the interface to display the number of monitoring counters
supported. The interface file 'num_mbm_cntrs' is available when an
architecture supports mbm_cntr_assign mode.
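
As a rough illustration of how the counter count bounds the number of
assignments, the sketch below decodes the count the same way the existing
detection code in rdt_get_mon_l3_config() does (low 16 bits of EBX from
CPUID 0x80000020, subleaf 5, plus one) and derives how many monitoring
groups can be tracked. The function names are hypothetical and the code is
a user-space approximation, not part of this patch.

```c
#include <stdint.h>

/*
 * Decode the number of assignable counters from the EBX value returned
 * by CPUID 0x80000020, subleaf 5: bits 15:0 hold the count minus one.
 */
unsigned int abmc_num_cntrs(uint32_t ebx)
{
	return (ebx & 0xffffu) + 1;
}

/*
 * Hypothetical helper: number of monitoring groups that can be tracked
 * when each group uses @events_per_group counters (1 or 2, one counter
 * per memory bandwidth event).
 */
unsigned int max_tracked_groups(unsigned int num_cntrs,
				unsigned int events_per_group)
{
	if (events_per_group == 0)
		return 0;

	return num_cntrs / events_per_group;
}
```

For example, assuming a part that reports 32 counters, 16 groups could
track both mbm_total_bytes and mbm_local_bytes, or 32 groups could track a
single event each.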

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Updated user document based on the comments.
    Will add a new file available_mbm_cntrs later in the series.

v8: Commit message update and documentation update.

v7: Minor commit log text changes.

v6: No changes.

v5: Changed the display name from num_cntrs to num_mbm_cntrs.
    Updated the commit message.
    Moved the patch after mbm_mode is introduced.

v4: Changed the counter name to num_cntrs. And few text changes.

v3: Changed the field name to mbm_assign_cntrs.

v2: Changed the field name to mbm_assignable_counters from abmc_counter.
---
 Documentation/arch/x86/resctrl.rst     | 12 ++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 ++++++++++++++++
 3 files changed, 29 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a93d7980e25f..2f3a86278e84 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -290,6 +290,18 @@ with the following files:
 	mbm_local_bytes may report 'Unavailable' if there is no counter associated
 	with that event.
 
+"num_mbm_cntrs":
+	The number of monitoring counters available for assignment when the
+	architecture supports mbm_cntr_assign mode.
+
+	The resctrl file system supports tracking up to two memory bandwidth
+	events per monitoring group: mbm_total_bytes and/or mbm_local_bytes.
+	Up to two counters can be assigned per monitoring group, one for each
+	memory bandwidth event. More monitoring groups can be tracked by
+	assigning one counter per monitoring group. However, doing so limits
+	memory bandwidth tracking to a single memory bandwidth event per
+	monitoring group.
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 9209fb3dc78f..e0d8080dcdcf 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1236,6 +1236,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			r->mon.mbm_cntr_assignable = true;
 			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
 			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
+			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
 		}
 	}
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f25ff1430014..339bb0b09a82 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -869,6 +869,16 @@ static int rdtgroup_mbm_assign_mode_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int rdtgroup_num_mbm_cntrs_show(struct kernfs_open_file *of,
+				       struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+
+	seq_printf(s, "%d\n", r->mon.num_mbm_cntrs);
+
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1940,6 +1950,12 @@ static struct rftype res_common_files[] = {
 		.seq_show	= rdtgroup_cpus_show,
 		.fflags		= RFTYPE_BASE,
 	},
+	{
+		.name		= "num_mbm_cntrs",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdtgroup_num_mbm_cntrs_show,
+	},
 	{
 		.name		= "cpus_list",
 		.mode		= 0644,
-- 
2.34.1



* [PATCH v9 10/26] x86/resctrl: Introduce bitmap mbm_cntr_free_map to track assignable counters
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (8 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 09/26] x86/resctrl: Introduce interface to display number of monitoring counters Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-16  0:11   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 11/26] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct rdt_hw_mon_domain Babu Moger
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hardware provides a set of counters when mbm_cntr_assign mode is supported.
These counters are assigned to the MBM monitoring events of a CTRL_MON or
MON group that needs to be tracked. The kernel must manage and track the
available counters.

Introduce the mbm_cntr_free_map bitmap to track available counters and a
set of routines to allocate and free the counters.

dom_data_init() requires mbm_cntr_assign state to initialize
mbm_cntr_free_map bitmap. Move dom_data_init() after mbm_cntr_assign
detection.
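
The free-map scheme can be sketched in user space as follows. This is an
approximation for illustration, not the kernel code: struct cntr_pool and
the cntr_*() names are hypothetical, a single 64-bit word stands in for
the dynamically allocated bitmap (so at most 64 counters are modelled),
and a plain loop replaces find_first_bit(). As in the patch, a set bit
means the counter is free.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the counter state kept in struct resctrl_mon. */
struct cntr_pool {
	unsigned int	num_cntrs;	/* clamped to 64 in this sketch */
	uint64_t	free_map;	/* bit n set => counter n is free */
};

/* Mark every counter free, mirroring bitmap_fill() in mbm_cntrs_init(). */
void cntr_pool_init(struct cntr_pool *p, unsigned int num_cntrs)
{
	if (num_cntrs > 64)
		num_cntrs = 64;

	p->num_cntrs = num_cntrs;
	p->free_map = num_cntrs == 64 ? UINT64_MAX
				      : (UINT64_C(1) << num_cntrs) - 1;
}

/* Return the lowest free counter id, or -ENOSPC when none is left. */
int cntr_alloc(struct cntr_pool *p)
{
	unsigned int id;

	for (id = 0; id < p->num_cntrs; id++) {
		if (p->free_map & (UINT64_C(1) << id)) {
			p->free_map &= ~(UINT64_C(1) << id);
			return (int)id;
		}
	}

	return -ENOSPC;
}

/* Return counter @id to the pool; out-of-range ids are ignored. */
void cntr_free(struct cntr_pool *p, unsigned int id)
{
	if (id < p->num_cntrs)
		p->free_map |= UINT64_C(1) << id;
}
```

With two counters, two allocations return ids 0 and 1, a third returns
-ENOSPC, and freeing id 0 makes it the next id handed out.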

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Commit message update and kernel doc update.

v8: Moved the init and exit functionality inside dom_data_init()
    and dom_data_exit() respectively.

v7: Removed the static allocation and now allocating bitmap mbm_cntr_free_map
    dynamically.
    Passed the struct rdt_resource mbm_cntr_alloc and mbm_cntr_free.
    Removed the reference to ABMC and changed it to mbm_cntr_assign.
    Few other text changes.

v6: Removed the variable mbm_cntrs_free_map_len. This is not required.
    Removed the call mbm_cntrs_init() in arch code. This needs to be
    done at higher level.
    Used DECLARE_BITMAP to initialize mbm_cntrs_free_map.
    Moved all the counter interfaces mbm_cntr_alloc() and mbm_cntr_free()
    in here as part of separating arch and fs bits.

v5:
   Updated the comments and commit log.
   Few renames
    num_cntrs_free_map -> mbm_cntrs_free_map
    num_cntrs_init -> mbm_cntrs_init
    Added initialization in rdt_get_tree because the default ABMC
    enablement happens during the init.

v4: Changed the name to num_cntrs where applicable.
     Used bitmap apis.
     Added more comments for the globals.

v3: Changed the bitmap name to assign_cntrs_free_map. Removed abmc
     from the name.

v2: Changed the bitmap name to assignable_counter_free_map from
     abmc_counter_free_map.
---
 arch/x86/kernel/cpu/resctrl/core.c     |  2 +-
 arch/x86/kernel/cpu/resctrl/internal.h |  4 ++-
 arch/x86/kernel/cpu/resctrl/monitor.c  | 49 ++++++++++++++++++++++----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 ++++++++++
 include/linux/resctrl.h                |  2 ++
 5 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9603f5cb483c..02ccb4d2955d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -1140,7 +1140,7 @@ static void __exit resctrl_exit(void)
 	rdtgroup_exit();
 
 	if (r->mon_capable)
-		rdt_put_mon_l3_config();
+		rdt_put_mon_l3_config(r);
 }
 
 __exitcall(resctrl_exit);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c07a93da31cc..8ab59d59c15a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -633,7 +633,7 @@ void closid_free(int closid);
 int alloc_rmid(u32 closid);
 void free_rmid(u32 closid, u32 rmid);
 int __init rdt_get_mon_l3_config(struct rdt_resource *r);
-void __exit rdt_put_mon_l3_config(void);
+void __exit rdt_put_mon_l3_config(struct rdt_resource *r);
 bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
@@ -654,6 +654,8 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init resctrl_file_fflags_init(const char *config,
 				     unsigned long fflags);
+int mbm_cntr_alloc(struct rdt_resource *r);
+void mbm_cntr_free(struct rdt_resource *r, u32 cntr_id);
 void rdt_staged_configs_clear(void);
 bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index e0d8080dcdcf..185ac210d46e 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -983,6 +983,27 @@ void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_
 		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
 }
 
+/*
+ * Bitmap to track the available hardware counters when operating in
+ * "mbm_cntr_assign" mode. A hardware counter can be assigned to a RMID,
+ * event pair.
+ */
+static __init unsigned long *mbm_cntrs_init(struct rdt_resource *r)
+{
+	r->mon.mbm_cntr_free_map = bitmap_zalloc(r->mon.num_mbm_cntrs,
+						 GFP_KERNEL);
+	if (r->mon.mbm_cntr_free_map)
+		bitmap_fill(r->mon.mbm_cntr_free_map, r->mon.num_mbm_cntrs);
+
+	return r->mon.mbm_cntr_free_map;
+}
+
+static  __exit void mbm_cntrs_exit(struct rdt_resource *r)
+{
+	bitmap_free(r->mon.mbm_cntr_free_map);
+	r->mon.mbm_cntr_free_map = NULL;
+}
+
 static __init int dom_data_init(struct rdt_resource *r)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -1020,6 +1041,17 @@ static __init int dom_data_init(struct rdt_resource *r)
 		goto out_unlock;
 	}
 
+	if (r->mon.mbm_cntr_assignable && !mbm_cntrs_init(r)) {
+		if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
+			kfree(closid_num_dirty_rmid);
+			closid_num_dirty_rmid = NULL;
+		}
+		kfree(rmid_ptrs);
+		rmid_ptrs = NULL;
+		err = -ENOMEM;
+		goto out_unlock;
+	}
+
 	for (i = 0; i < idx_limit; i++) {
 		entry = &rmid_ptrs[i];
 		INIT_LIST_HEAD(&entry->list);
@@ -1044,7 +1076,7 @@ static __init int dom_data_init(struct rdt_resource *r)
 	return err;
 }
 
-static void __exit dom_data_exit(void)
+static void __exit dom_data_exit(struct rdt_resource *r)
 {
 	mutex_lock(&rdtgroup_mutex);
 
@@ -1056,6 +1088,9 @@ static void __exit dom_data_exit(void)
 	kfree(rmid_ptrs);
 	rmid_ptrs = NULL;
 
+	if (r->mon.mbm_cntr_assignable)
+		mbm_cntrs_exit(r);
+
 	mutex_unlock(&rdtgroup_mutex);
 }
 
@@ -1210,10 +1245,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	 */
 	resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold);
 
-	ret = dom_data_init(r);
-	if (ret)
-		return ret;
-
 	if (rdt_cpu_has(X86_FEATURE_BMEC)) {
 		u32 eax, ebx, ecx, edx;
 
@@ -1240,6 +1271,10 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 		}
 	}
 
+	ret = dom_data_init(r);
+	if (ret)
+		return ret;
+
 	l3_mon_evt_init(r);
 
 	r->mon_capable = true;
@@ -1247,9 +1282,9 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	return 0;
 }
 
-void __exit rdt_put_mon_l3_config(void)
+void __exit rdt_put_mon_l3_config(struct rdt_resource *r)
 {
-	dom_data_exit();
+	dom_data_exit(r);
 }
 
 void __init intel_rdt_mbm_apply_quirk(void)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 339bb0b09a82..a4b92476f501 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -185,6 +185,25 @@ bool closid_allocated(unsigned int closid)
 	return !test_bit(closid, &closid_free_map);
 }
 
+int mbm_cntr_alloc(struct rdt_resource *r)
+{
+	int cntr_id;
+
+	cntr_id = find_first_bit(r->mon.mbm_cntr_free_map,
+				 r->mon.num_mbm_cntrs);
+	if (cntr_id >= r->mon.num_mbm_cntrs)
+		return -ENOSPC;
+
+	__clear_bit(cntr_id, r->mon.mbm_cntr_free_map);
+
+	return cntr_id;
+}
+
+void mbm_cntr_free(struct rdt_resource *r, u32 cntr_id)
+{
+	__set_bit(cntr_id, r->mon.mbm_cntr_free_map);
+}
+
 /**
  * rdtgroup_mode_by_closid - Return mode of resource group with closid
  * @closid: closid if the resource group
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f11d6fdfd977..afe3b22d3e60 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -187,12 +187,14 @@ enum resctrl_scope {
  * @num_rmid:		Number of RMIDs available
  * @num_mbm_cntrs:	Number of assignable monitoring counters
  * @mbm_cntr_assignable:Is system capable of supporting monitor assignment?
+ * @mbm_cntr_free_map:	Bitmap of free MBM counters
  * @evt_list:		List of monitoring events
  */
 struct resctrl_mon {
 	int			num_rmid;
 	int			num_mbm_cntrs;
 	bool			mbm_cntr_assignable;
+	unsigned long		*mbm_cntr_free_map;
 	struct list_head	evt_list;
 };
 
-- 
2.34.1



* [PATCH v9 11/26] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct rdt_hw_mon_domain
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (9 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 10/26] x86/resctrl: Introduce bitmap mbm_cntr_free_map to track assignable counters Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 12/26] x86/resctrl: Remove MSR reading of event configuration value Babu Moger
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured to track specific
events. The event configuration is domain specific. ABMC (Assignable
Bandwidth Monitoring Counters) feature needs event configuration
information to assign a hardware counter to an RMID. Event configurations
are not stored in resctrl but instead always read from or written to
hardware directly when prompted by user space.

Read the event configuration from the hardware during the domain
initialization. Save the configuration value in struct rdt_hw_mon_domain,
so it can be used for counter assignment.
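
The value saved for each event can be summarized by the sketch below. It
is a simplified user-space rendering of the logic for illustration only:
evt_cfg_to_cache() is a hypothetical name, the raw register value is
passed in instead of being read with rdmsrl(), and the 7-bit mask mirrors
MAX_EVT_CONFIG_BITS.

```c
#include <stdbool.h>
#include <stdint.h>

#define EVT_CONFIG_MASK		0x7fu		/* same bits as GENMASK(6, 0) */
#define INVALID_CONFIG_VALUE	UINT32_MAX

/*
 * Value to save for one event at domain initialization: the valid
 * configuration bits of the raw register value, or the invalid marker
 * when the event is not configurable.
 */
uint32_t evt_cfg_to_cache(bool configurable, uint64_t msrval)
{
	if (!configurable)
		return INVALID_CONFIG_VALUE;

	return (uint32_t)(msrval & EVT_CONFIG_MASK);
}
```

Keeping a distinct invalid marker lets later users of the saved value tell
"not configurable" apart from a configuration of zero.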

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v9: Added Reviewed-by tag. No other changes.

v8: Renamed resctrl_mbm_evt_config_init() to arch_mbm_evt_config_init()
    Minor commit message update.

v7: Fixed initializing INVALID_CONFIG_VALUE to mbm_local_cfg in case of error.

v6: Renamed resctrl_arch_mbm_evt_config -> resctrl_mbm_evt_config_init
    Initialized value to INVALID_CONFIG_VALUE if it is not configurable.
    Minor commit message update.

v5: Exported mon_event_config_index_get.
    Renamed arch_domain_mbm_evt_config to resctrl_arch_mbm_evt_config.

v4: Read the configuration information from the hardware to initialize.
    Added few commit messages.
    Fixed the tab spaces.

v3: Minor changes related to rebase in mbm_config_write_domain.

v2: No changes.
---
 arch/x86/kernel/cpu/resctrl/core.c     |  2 ++
 arch/x86/kernel/cpu/resctrl/internal.h |  9 +++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 26 ++++++++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +---
 4 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 02ccb4d2955d..11cba9f35945 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -632,6 +632,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 
 	arch_mon_domain_online(r, d);
 
+	arch_mbm_evt_config_init(hw_dom);
+
 	if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
 		mon_domain_free(hw_dom);
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8ab59d59c15a..add8e84b483e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -56,6 +56,9 @@
 /* Max event bits supported */
 #define MAX_EVT_CONFIG_BITS		GENMASK(6, 0)
 
+#define INVALID_CONFIG_VALUE		U32_MAX
+#define INVALID_CONFIG_INDEX		UINT_MAX
+
 /* Setting bit 0 in L3_QOS_EXT_CFG enables the ABMC feature. */
 #define ABMC_ENABLE_BIT			0
 
@@ -401,6 +404,8 @@ struct rdt_hw_ctrl_domain {
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
+ * @mbm_total_cfg:	MBM total bandwidth configuration
+ * @mbm_local_cfg:	MBM local bandwidth configuration
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
@@ -408,6 +413,8 @@ struct rdt_hw_mon_domain {
 	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
+	u32				mbm_total_cfg;
+	u32				mbm_local_cfg;
 };
 
 static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
@@ -656,6 +663,8 @@ void __init resctrl_file_fflags_init(const char *config,
 				     unsigned long fflags);
 int mbm_cntr_alloc(struct rdt_resource *r);
 void mbm_cntr_free(struct rdt_resource *r, u32 cntr_id);
+void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom);
+unsigned int mon_event_config_index_get(u32 evtid);
 void rdt_staged_configs_clear(void);
 bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 185ac210d46e..3996f7528b66 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1282,6 +1282,32 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	return 0;
 }
 
+void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom)
+{
+	unsigned int index;
+	u64 msrval;
+
+	/*
+	 * Read the configuration registers QOS_EVT_CFG_n, where <n> is
+	 * the BMEC event number (EvtID).
+	 */
+	if (mbm_total_event.configurable) {
+		index = mon_event_config_index_get(QOS_L3_MBM_TOTAL_EVENT_ID);
+		rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
+		hw_dom->mbm_total_cfg = msrval & MAX_EVT_CONFIG_BITS;
+	} else {
+		hw_dom->mbm_total_cfg = INVALID_CONFIG_VALUE;
+	}
+
+	if (mbm_local_event.configurable) {
+		index = mon_event_config_index_get(QOS_L3_MBM_LOCAL_EVENT_ID);
+		rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
+		hw_dom->mbm_local_cfg = msrval & MAX_EVT_CONFIG_BITS;
+	} else {
+		hw_dom->mbm_local_cfg = INVALID_CONFIG_VALUE;
+	}
+}
+
 void __exit rdt_put_mon_l3_config(struct rdt_resource *r)
 {
 	dom_data_exit(r);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a4b92476f501..811b477f7710 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1601,8 +1601,6 @@ struct mon_config_info {
 	u32 mon_config;
 };
 
-#define INVALID_CONFIG_INDEX   UINT_MAX
-
 /**
  * mon_event_config_index_get - get the hardware index for the
  *                              configurable event
@@ -1612,7 +1610,7 @@ struct mon_config_info {
  *         1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID
  *         INVALID_CONFIG_INDEX for invalid evtid
  */
-static inline unsigned int mon_event_config_index_get(u32 evtid)
+unsigned int mon_event_config_index_get(u32 evtid)
 {
 	switch (evtid) {
 	case QOS_L3_MBM_TOTAL_EVENT_ID:
-- 
2.34.1



* [PATCH v9 12/26] x86/resctrl: Remove MSR reading of event configuration value
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (10 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 11/26] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct rdt_hw_mon_domain Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-16  0:24   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 13/26] x86/resctrl: Introduce mbm_cntr_map to track assignable counters at domain Babu Moger
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The event configuration is domain specific and initialized during domain
initialization. The values are stored in struct rdt_hw_mon_domain.

It is not required to read the configuration register every time the user
asks for it. Use the value stored in struct rdt_hw_mon_domain instead.

Introduce resctrl_arch_mon_event_config_get() and
resctrl_arch_mon_event_config_set() to get/set architecture domain specific
mbm_total_cfg/mbm_local_cfg values.
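
The resulting get/set pattern is a write-through cache: a set updates both
the register and the stored copy, and a get returns the stored copy without
touching hardware. The sketch below models this in user space under stated
assumptions: the fake_msr[] array stands in for the event configuration
MSRs, and struct dom_cfg and the dom_cfg_*() helpers are hypothetical
names, not the functions introduced by this patch.

```c
#include <stdint.h>

/* Simulated event configuration registers: index 0 total, 1 local. */
uint32_t fake_msr[2];

enum evt { EVT_TOTAL = 0, EVT_LOCAL = 1 };

/* Hypothetical per-domain cache, standing in for mbm_{total,local}_cfg. */
struct dom_cfg {
	uint32_t cfg[2];
};

/* Write-through: update the register, then keep the cached copy in sync. */
void dom_cfg_set(struct dom_cfg *d, enum evt e, uint32_t val)
{
	fake_msr[e] = val;	/* models the wrmsr() in the patch */
	d->cfg[e] = val;
}

/* Reads are served from the cache; no register access is needed. */
uint32_t dom_cfg_get(const struct dom_cfg *d, enum evt e)
{
	return d->cfg[e];
}
```

This only stays coherent if every write to the register goes through the
set path, which is why the patch updates the stored value in the same
function that writes the MSR.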

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Removed QOS_L3_OCCUP_EVENT_ID switch case in resctrl_arch_mon_event_config_set.
    Fixed an unnecessary space.

v8: Renamed
    resctrl_arch_event_config_get() to resctrl_arch_mon_event_config_get().
    resctrl_arch_event_config_set() to resctrl_arch_mon_event_config_set().

v7: Removed check if (val == INVALID_CONFIG_VALUE) as resctrl_arch_event_config_get
    already prints warning.
    Kept the Event config value definitions as is.

v6: Fixed inconsistency with types. Made all the types u32 for the config
    value.
    Removed few rdt_last_cmd_puts as it is not necessary.
    Removed unused config value definitions.
    Few more updates to commit message.

v5: Introduced resctrl_arch_event_config_get() and
    resctrl_arch_event_config_set() based on our discussion.
    https://lore.kernel.org/lkml/68e861f9-245d-4496-a72e-46fc57d19c62@amd.com/

v4: New patch.
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 101 ++++++++++++++-----------
 include/linux/resctrl.h                |   4 +
 2 files changed, 60 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 811b477f7710..3a3d98e8ca28 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1597,10 +1597,55 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 }
 
 struct mon_config_info {
+	struct rdt_mon_domain *d;
 	u32 evtid;
 	u32 mon_config;
 };
 
+u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
+				      enum resctrl_event_id eventid)
+{
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+
+	switch (eventid) {
+	case QOS_L3_OCCUP_EVENT_ID:
+		break;
+	case QOS_L3_MBM_TOTAL_EVENT_ID:
+		return hw_dom->mbm_total_cfg;
+	case QOS_L3_MBM_LOCAL_EVENT_ID:
+		return hw_dom->mbm_local_cfg;
+	}
+
+	/* Never expect to get here */
+	WARN_ON_ONCE(1);
+
+	return INVALID_CONFIG_VALUE;
+}
+
+void resctrl_arch_mon_event_config_set(void *info)
+{
+	struct mon_config_info *mon_info = info;
+	struct rdt_hw_mon_domain *hw_dom;
+	unsigned int index;
+
+	index = mon_event_config_index_get(mon_info->evtid);
+	if (index == INVALID_CONFIG_INDEX)
+		return;
+
+	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
+
+	hw_dom = resctrl_to_arch_mon_dom(mon_info->d);
+
+	switch (mon_info->evtid) {
+	case QOS_L3_MBM_TOTAL_EVENT_ID:
+		hw_dom->mbm_total_cfg = mon_info->mon_config;
+		break;
+	case QOS_L3_MBM_LOCAL_EVENT_ID:
+		hw_dom->mbm_local_cfg = mon_info->mon_config;
+		break;
+	}
+}
+
 /**
  * mon_event_config_index_get - get the hardware index for the
  *                              configurable event
@@ -1623,33 +1668,11 @@ unsigned int mon_event_config_index_get(u32 evtid)
 	}
 }
 
-static void mon_event_config_read(void *info)
-{
-	struct mon_config_info *mon_info = info;
-	unsigned int index;
-	u64 msrval;
-
-	index = mon_event_config_index_get(mon_info->evtid);
-	if (index == INVALID_CONFIG_INDEX) {
-		pr_warn_once("Invalid event id %d\n", mon_info->evtid);
-		return;
-	}
-	rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
-
-	/* Report only the valid event configuration bits */
-	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
-}
-
-static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
-{
-	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
-}
-
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
-	struct mon_config_info mon_info;
 	struct rdt_mon_domain *dom;
 	bool sep = false;
+	u32 val;
 
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
@@ -1658,11 +1681,8 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		if (sep)
 			seq_puts(s, ";");
 
-		memset(&mon_info, 0, sizeof(struct mon_config_info));
-		mon_info.evtid = evtid;
-		mondata_config_read(dom, &mon_info);
-
-		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
+		val = resctrl_arch_mon_event_config_get(dom, evtid);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, val);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1693,33 +1713,23 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
 	return 0;
 }
 
-static void mon_event_config_write(void *info)
-{
-	struct mon_config_info *mon_info = info;
-	unsigned int index;
-
-	index = mon_event_config_index_get(mon_info->evtid);
-	if (index == INVALID_CONFIG_INDEX) {
-		pr_warn_once("Invalid event id %d\n", mon_info->evtid);
-		return;
-	}
-	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
-}
 
 static void mbm_config_write_domain(struct rdt_resource *r,
 				    struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
+	u32 config_val;
 
 	/*
-	 * Read the current config value first. If both are the same then
+	 * Check the current config value first. If both are the same then
 	 * no need to write it again.
 	 */
-	mon_info.evtid = evtid;
-	mondata_config_read(d, &mon_info);
-	if (mon_info.mon_config == val)
+	config_val = resctrl_arch_mon_event_config_get(d, evtid);
+	if (config_val == INVALID_CONFIG_VALUE || config_val == val)
 		return;
 
+	mon_info.d = d;
+	mon_info.evtid = evtid;
 	mon_info.mon_config = val;
 
 	/*
@@ -1728,7 +1738,8 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask,
+			      resctrl_arch_mon_event_config_set,
 			      &mon_info, 1);
 
 	/*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index afe3b22d3e60..70885a835acb 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -354,6 +354,10 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
  */
 void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
+void resctrl_arch_mon_event_config_set(void *info);
+u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
+				      enum resctrl_event_id eventid);
+
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
 
-- 
2.34.1



* [PATCH v9 13/26] x86/resctrl: Introduce mbm_cntr_map to track assignable counters at domain
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (11 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 12/26] x86/resctrl: Remove MSR reading of event configuration value Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters Babu Moger
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The MBM counters are allocated globally and assigned to an RMID, event pair
in a resctrl group. The global allocation is tracked by mbm_cntr_free_map.
Counters are assigned to domains based on user input, so the assignment
needs to be tracked at the domain level as well.

Add the mbm_cntr_map bitmap in struct rdt_mon_domain to keep track of
assignment at the domain level. The global counter in mbm_cntr_free_map can
be released when the assignments in all the domains are cleared.
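
The release rule can be sketched as follows. This is a minimal user-space
illustration, not code from the series: struct cntr_state, NUM_DOMAINS and
the helper names are hypothetical, and the sketch assumes a set bit means
"in use" in both maps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_DOMAINS 2	/* hypothetical domain count for the sketch */

/* One bit per counter; a 32-bit word is enough for this sketch. */
struct cntr_state {
	uint32_t global_map;			/* counters allocated system-wide */
	uint32_t domain_map[NUM_DOMAINS];	/* counters assigned per domain */
};

/* Check whether any domain still has the counter assigned. */
static bool cntr_assigned_anywhere(const struct cntr_state *s,
				   unsigned int cntr)
{
	for (int i = 0; i < NUM_DOMAINS; i++) {
		if (s->domain_map[i] & (1u << cntr))
			return true;
	}

	return false;
}

/* Clear one domain's assignment; free the global counter once unused. */
static void cntr_unassign(struct cntr_state *s, unsigned int dom,
			  unsigned int cntr)
{
	assert(dom < NUM_DOMAINS && cntr < 32);

	s->domain_map[dom] &= ~(1u << cntr);
	if (!cntr_assigned_anywhere(s, cntr))
		s->global_map &= ~(1u << cntr);
}
```

With counter 3 assigned in both domains, clearing it in domain 0 leaves
the global bit set; only the second unassign releases it.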

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
v9: Added Reviewed-by tag. No other changes.

v8: Minor commit message changes.

v7: Added check mbm_cntr_assignable for allocating bitmap mbm_cntr_map

v6: New patch to add domain level assignment.
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 ++++++++++
 include/linux/resctrl.h                |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3a3d98e8ca28..654cdfee1b00 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -4091,6 +4091,7 @@ static void __init rdtgroup_setup_default(void)
 
 static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
+	bitmap_free(d->mbm_cntr_map);
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
@@ -4164,6 +4165,15 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain
 			return -ENOMEM;
 		}
 	}
+	if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
+		d->mbm_cntr_map = bitmap_zalloc(r->mon.num_mbm_cntrs, GFP_KERNEL);
+		if (!d->mbm_cntr_map) {
+			bitmap_free(d->rmid_busy_llc);
+			kfree(d->mbm_total);
+			kfree(d->mbm_local);
+			return -ENOMEM;
+		}
+	}
 
 	return 0;
 }
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 70885a835acb..0b8eeb8afc68 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -105,6 +105,7 @@ struct rdt_ctrl_domain {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
+ * @mbm_cntr_map:	bitmap to track domain counter assignment
  */
 struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
@@ -116,6 +117,7 @@ struct rdt_mon_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
+	unsigned long			*mbm_cntr_map;
 };
 
 /**
-- 
2.34.1



* [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (12 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 13/26] x86/resctrl: Introduce mbm_cntr_map to track assignable counters at domain Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:57   ` Luck, Tony
                     ` (2 more replies)
  2024-10-29 23:21 ` [PATCH v9 15/26] x86/resctrl: Add data structures and definitions for ABMC assignment Babu Moger
                   ` (11 subsequent siblings)
  25 siblings, 3 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Provide the interface to display the number of free monitoring counters
available for assignment in each domain when mbm_cntr_assign is supported.
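
The per-domain value is the total number of counters minus the number of
bits set in that domain's map, printed as "<domain id>=<free>" entries
separated by ';'. A minimal user-space sketch of that computation, where
struct dom_info and both helpers are hypothetical names and the map is
limited to a single word:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct dom_info {
	int id;
	unsigned long cntr_map;	/* set bit = counter assigned in this domain */
};

/* Portable population count restricted to the low num_cntrs bits. */
static unsigned int assigned_count(unsigned long map, unsigned int num_cntrs)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < num_cntrs; i++)
		n += (unsigned int)((map >> i) & 1ul);

	return n;
}

/* Format "<id>=<free>" entries separated by ';' and ended by a newline. */
static int format_available(char *buf, size_t len, const struct dom_info *doms,
			    size_t ndoms, unsigned int num_cntrs)
{
	size_t off = 0;

	assert(num_cntrs <= 8 * sizeof(unsigned long));

	for (size_t i = 0; i < ndoms; i++) {
		unsigned int nfree = num_cntrs -
				     assigned_count(doms[i].cntr_map, num_cntrs);
		int n = snprintf(buf + off, len - off, "%s%d=%u",
				 i ? ";" : "", doms[i].id, nfree);

		if (n < 0 || (size_t)n >= len - off)
			return -1;	/* output did not fit */
		off += (size_t)n;
	}

	if (len - off < 2)
		return -1;
	buf[off++] = '\n';
	buf[off] = '\0';

	return (int)off;
}
```

For two domains with 32 counters, where domain 0 has two counters
assigned and domain 1 has none, the sketch produces "0=30;1=32".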

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: New patch.
---
 Documentation/arch/x86/resctrl.rst     |  4 ++++
 arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
 3 files changed, 38 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 2f3a86278e84..2bc58d974934 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -302,6 +302,10 @@ with the following files:
 	memory bandwidth tracking to a single memory bandwidth event per
 	monitoring group.
 
+"available_mbm_cntrs":
+	The number of free monitoring counters available for assignment in each
+	domain when the architecture supports mbm_cntr_assign mode.
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3996f7528b66..e8d38a963f39 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1268,6 +1268,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
 			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
 			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
+			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
 		}
 	}
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 654cdfee1b00..ef0c1246fa2a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -898,6 +898,33 @@ static int rdtgroup_num_mbm_cntrs_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int rdtgroup_available_mbm_cntrs_show(struct kernfs_open_file *of,
+					     struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	struct rdt_mon_domain *dom;
+	bool sep = false;
+	u32 val;
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+		if (sep)
+			seq_puts(s, ";");
+
+		val = r->mon.num_mbm_cntrs - bitmap_weight(dom->mbm_cntr_map, r->mon.num_mbm_cntrs);
+		seq_printf(s, "%d=%d", dom->hdr.id, val);
+		sep = true;
+	}
+	seq_puts(s, "\n");
+
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1984,6 +2011,12 @@ static struct rftype res_common_files[] = {
 		.kf_ops		= &rdtgroup_kf_single_ops,
 		.seq_show	= rdtgroup_num_mbm_cntrs_show,
 	},
+	{
+		.name		= "available_mbm_cntrs",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdtgroup_available_mbm_cntrs_show,
+	},
 	{
 		.name		= "cpus_list",
 		.mode		= 0644,
-- 
2.34.1



* [PATCH v9 15/26] x86/resctrl: Add data structures and definitions for ABMC assignment
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (13 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-16  0:35   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 16/26] x86/resctrl: Introduce cntr_id in mongroup for assignments Babu Moger
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The ABMC feature provides an option to the user to assign a hardware
counter to an RMID, event pair and monitor the bandwidth as long as the
counter is assigned. The bandwidth events will be tracked by the hardware
until the user changes the configuration. Each resctrl group can configure
a maximum of two counters, one for the total event and one for the local event.

The ABMC feature implements an MSR L3_QOS_ABMC_CFG (C000_03FDh).
Configuration is done by setting the counter id, bandwidth source (RMID)
and bandwidth configuration supported by BMEC (Bandwidth Monitoring Event
Configuration).

Attempts to read or write the MSR when ABMC is not enabled will result
in a #GP(0) exception.

Introduce the data structures and definitions for MSR L3_QOS_ABMC_CFG
(C000_03FDh):
=========================================================================
Bits 	Mnemonic	Description			Access Reset
							Type   Value
=========================================================================
63 	CfgEn 		Configuration Enable 		R/W 	0

62 	CtrEn 		Enable/disable counting		R/W 	0

61:53 	– 		Reserved 			MBZ 	0

52:48 	CtrID 		Counter Identifier		R/W	0

47 	IsCOS		BwSrc field is a CLOSID		R/W	0
			(not an RMID)

46:44 	–		Reserved			MBZ	0

43:32	BwSrc		Bandwidth Source		R/W	0
			(RMID or CLOSID)

31:0	BwType		Bandwidth configuration		R/W	0
			to track for this counter
==========================================================================
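
The layout above can be expressed with explicit shifts, as in the sketch
below. This is an illustration only: the patch itself uses a bit-field
union, and the ABMC_*_SHIFT macros and helper functions here are
hypothetical names derived from the bit positions in the table.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions taken from the L3_QOS_ABMC_CFG table. */
#define ABMC_CFG_EN_SHIFT	63
#define ABMC_CTR_EN_SHIFT	62
#define ABMC_CTR_ID_SHIFT	48
#define ABMC_IS_COS_SHIFT	47
#define ABMC_BW_SRC_SHIFT	32

/* Pack the fields into a 64-bit MSR value; reserved bits stay zero. */
static uint64_t abmc_cfg_pack(uint32_t bw_type, uint32_t bw_src, int is_cos,
			      uint32_t ctr_id, int ctr_en, int cfg_en)
{
	assert(bw_src < (1u << 12));	/* BwSrc is 12 bits wide */
	assert(ctr_id < (1u << 5));	/* CtrID is 5 bits wide */

	return ((uint64_t)(cfg_en != 0) << ABMC_CFG_EN_SHIFT) |
	       ((uint64_t)(ctr_en != 0) << ABMC_CTR_EN_SHIFT) |
	       ((uint64_t)ctr_id << ABMC_CTR_ID_SHIFT) |
	       ((uint64_t)(is_cos != 0) << ABMC_IS_COS_SHIFT) |
	       ((uint64_t)bw_src << ABMC_BW_SRC_SHIFT) |
	       (uint64_t)bw_type;
}

/* Extract the counter identifier (bits 52:48). */
static uint32_t abmc_cfg_ctr_id(uint64_t val)
{
	return (uint32_t)((val >> ABMC_CTR_ID_SHIFT) & 0x1f);
}

/* Extract the bandwidth source (bits 43:32). */
static uint32_t abmc_cfg_bw_src(uint64_t val)
{
	return (uint32_t)((val >> ABMC_BW_SRC_SHIFT) & 0xfff);
}
```

For example, counter 5 tracking RMID 0x10 with BwType 0x7f and both
CfgEn and CtrEn set packs to 0xC00500100000007F.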

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Removed the references of L3_QOS_ABMC_DSC.
    Text changes about configuration in kernel doc.

v8: Update the configuration notes in kernel_doc.
    Few commit message update.

v7: Removed the reference of L3_QOS_ABMC_DSC as it is not used anymore.
    Moved the configuration notes to kernel_doc.
    Adjusted the tabs for l3_qos_abmc_cfg and checkpatch seems happy.

v6: Removed all the fs related changes.
    Added note on CfgEn,CtrEn.
    Removed the definitions which are not used.
    Removed cntr_id initialization.

v5: Moved assignment flags here (patch 10/19 of v4).
    Added MON_CNTR_UNSET definition to initialize cntr_id's.
    More details in commit log.
    Renamed few fields in l3_qos_abmc_cfg for readability.

v4: Added more descriptions.
    Changed the name abmc_ctr_id to ctr_id.
    Added L3_QOS_ABMC_DSC. Used for reading the configuration.

v3: No changes.

v2: No changes.
---
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kernel/cpu/resctrl/internal.h | 35 ++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index bdc95b7cd1b0..d7dec2326cd8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1194,6 +1194,7 @@
 /* - AMD: */
 #define MSR_IA32_MBA_BW_BASE		0xc0000200
 #define MSR_IA32_SMBA_BW_BASE		0xc0000280
+#define MSR_IA32_L3_QOS_ABMC_CFG	0xc00003fd
 #define MSR_IA32_L3_QOS_EXT_CFG		0xc00003ff
 #define MSR_IA32_EVT_CFG_BASE		0xc0000400
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index add8e84b483e..5895ea72fc26 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -602,6 +602,41 @@ union cpuid_0x10_x_edx {
 	unsigned int full;
 };
 
+/*
+ * ABMC counters are configured by writing to L3_QOS_ABMC_CFG.
+ * @bw_type		: Bandwidth configuration (supported by BMEC)
+ *			  tracked by the @cntr_id.
+ * @bw_src		: Bandwidth source (RMID or CLOSID).
+ * @reserved1		: Reserved.
+ * @is_clos		: @bw_src field is a CLOSID (not an RMID).
+ * @cntr_id		: Counter identifier.
+ * @reserved		: Reserved.
+ * @cntr_en		: Counting enable bit.
+ * @cfg_en		: Configuration enable bit.
+ *
+ * Configuration and counting:
+ * Counter can be configured across multiple writes to MSR. Configuration
+ * is applied only when @cfg_en = 1. Counter @cntr_id is reset when the
+ * configuration is applied.
+ * @cfg_en = 1, @cntr_en = 0 : Apply @cntr_id configuration but do not
+ *                             count events.
+ * @cfg_en = 1, @cntr_en = 1 : Apply @cntr_id configuration and start
+ *                             counting events.
+ */
+union l3_qos_abmc_cfg {
+	struct {
+		unsigned long bw_type  :32,
+			      bw_src   :12,
+			      reserved1: 3,
+			      is_clos  : 1,
+			      cntr_id  : 5,
+			      reserved : 9,
+			      cntr_en  : 1,
+			      cfg_en   : 1;
+	} split;
+	unsigned long full;
+};
+
 void rdt_last_cmd_clear(void);
 void rdt_last_cmd_puts(const char *s);
 __printf(1, 2)
-- 
2.34.1



* [PATCH v9 16/26] x86/resctrl: Introduce cntr_id in mongroup for assignments
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (14 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 15/26] x86/resctrl: Add data structures and definitions for ABMC assignment Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-16  0:38   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The mbm_cntr_assign feature provides an option to the user to assign a
counter to an RMID, event pair and monitor the bandwidth as long as the
counter is assigned. There can be two counters per monitor group, one for
the MBM total event and another for the MBM local event.

Introduce cntr_id to manage the assignments.
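
The bookkeeping can be sketched as below: each group holds one slot per
MBM event, indexed by event ID minus 2, and a slot holding the "unset"
marker means no counter is assigned. The names in this sketch
(struct mon_group, EVT_*, CNTR_UNSET, group_cntr_*) are hypothetical
simplifications, not the identifiers used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Event IDs as used by the index macro: total is 2, local is 3. */
enum event_id { EVT_MBM_TOTAL = 2, EVT_MBM_LOCAL = 3 };

#define MAX_CNTRS		2
#define CNTR_UNSET		UINT32_MAX
#define EVENT_INDEX(evt)	((evt) - 2)

struct mon_group {
	uint32_t rmid;
	uint32_t cntr_id[MAX_CNTRS];	/* [0] total event, [1] local event */
};

/* Mark both event slots as having no counter assigned. */
static void group_cntr_init(struct mon_group *g)
{
	for (int i = 0; i < MAX_CNTRS; i++)
		g->cntr_id[i] = CNTR_UNSET;
}

/* Record the counter assigned to the given event. */
static void group_cntr_set(struct mon_group *g, enum event_id evt,
			   uint32_t cntr)
{
	assert(evt == EVT_MBM_TOTAL || evt == EVT_MBM_LOCAL);

	g->cntr_id[EVENT_INDEX(evt)] = cntr;
}

/* Report whether the event currently has a counter. */
static bool group_cntr_assigned(const struct mon_group *g, enum event_id evt)
{
	return g->cntr_id[EVENT_INDEX(evt)] != CNTR_UNSET;
}
```

Using an out-of-range marker rather than 0 matters here because 0 is a
valid counter ID.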

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Used MBM_EVENT_ARRAY_INDEX macro to get the event index.
    Introduced rdtgroup_cntr_id_init() to initialize the cntr_id.

v8: Minor commit message update.

v7: Minor comment update for cntr_id.

v6: New patch.
    Separated FS and arch bits.
---
 arch/x86/kernel/cpu/resctrl/internal.h | 14 ++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 +++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5895ea72fc26..d1f3f3ca4df9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -62,6 +62,18 @@
 /* Setting bit 0 in L3_QOS_EXT_CFG enables the ABMC feature. */
 #define ABMC_ENABLE_BIT			0
 
+/* Maximum assignable counters per resctrl group */
+#define MAX_CNTRS			2
+
+#define MON_CNTR_UNSET			U32_MAX
+
+/*
+ * Get the counter index for the assignable counter
+ * 0 for evtid == QOS_L3_MBM_TOTAL_EVENT_ID
+ * 1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID
+ */
+#define MBM_EVENT_ARRAY_INDEX(_event) ((_event) - 2)
+
 /**
  * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
  *			        aren't marked nohz_full
@@ -231,12 +243,14 @@ enum rdtgrp_mode {
  * @parent:			parent rdtgrp
  * @crdtgrp_list:		child rdtgroup node list
  * @rmid:			rmid for this rdtgroup
+ * @cntr_id:			IDs of hardware counters assigned to monitor group
  */
 struct mongroup {
 	struct kernfs_node	*mon_data_kn;
 	struct rdtgroup		*parent;
 	struct list_head	crdtgrp_list;
 	u32			rmid;
+	u32			cntr_id[MAX_CNTRS];
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ef0c1246fa2a..36845e8e400d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -925,6 +925,14 @@ static int rdtgroup_available_mbm_cntrs_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static void rdtgroup_cntr_id_init(struct rdtgroup *rdtgrp,
+				  enum resctrl_event_id evtid)
+{
+	int index = MBM_EVENT_ARRAY_INDEX(evtid);
+
+	rdtgrp->mon.cntr_id[index] = MON_CNTR_UNSET;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -3561,6 +3569,9 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
 	}
 	rdtgrp->mon.rmid = ret;
 
+	rdtgroup_cntr_id_init(rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
+	rdtgroup_cntr_id_init(rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
+
 	ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
 	if (ret) {
 		rdt_last_cmd_puts("kernfs subdir error\n");
@@ -4115,6 +4126,10 @@ static void __init rdtgroup_setup_default(void)
 	rdtgroup_default.closid = RESCTRL_RESERVED_CLOSID;
 	rdtgroup_default.mon.rmid = RESCTRL_RESERVED_RMID;
 	rdtgroup_default.type = RDTCTRL_GROUP;
+
+	rdtgroup_cntr_id_init(&rdtgroup_default, QOS_L3_MBM_TOTAL_EVENT_ID);
+	rdtgroup_cntr_id_init(&rdtgroup_default, QOS_L3_MBM_LOCAL_EVENT_ID);
+
 	INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list);
 
 	list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
-- 
2.34.1



* [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (15 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 16/26] x86/resctrl: Introduce cntr_id in mongroup for assignments Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:54   ` Luck, Tony
  2024-11-16  0:44   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment Babu Moger
                   ` (8 subsequent siblings)
  25 siblings, 2 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The ABMC feature provides an option to the user to assign a hardware
counter to an RMID, event pair and monitor the bandwidth as long as it is
assigned. The assigned RMID will be tracked by the hardware until the user
unassigns it manually.

Counters are configured by writing to L3_QOS_ABMC_CFG MSR and
specifying the counter id, bandwidth source, and bandwidth types.

Provide the interface to assign the counter ids to RMID.

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
    Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
    Monitoring (ABMC).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Removed the code to reset the architectural state. It will be done
    in another patch.

v8: Rename resctrl_arch_assign_cntr to resctrl_arch_config_cntr.

v7: Separated arch and fs functions. This patch only has arch implementation.
    Added struct rdt_resource to the interface resctrl_arch_assign_cntr.
    Rename rdtgroup_abmc_cfg() to resctrl_abmc_config_one_amd().

v6: Removed mbm_cntr_alloc() from this patch to keep fs and arch code
    separate.
    Added code to update the counter assignment at domain level.

v5: Few name changes to match cntr_id.
    Changed the function names to
      rdtgroup_assign_cntr
      resctr_arch_assign_cntr
      More comments on commit log.
      Added function summary.

v4: Commit message update.
      Used bitmap APIs where applicable.
      Changed the interfaces considering MPAM(arm).
      Added domain specific assignment.

v3: Removed the static from the prototype of rdtgroup_assign_abmc.
      The function is not called directly from user anymore. These
      changes are related to global assignment interface.

v2: Minor text changes in commit message.
---
 arch/x86/kernel/cpu/resctrl/internal.h |  3 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 38 ++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index d1f3f3ca4df9..00f7bf60e16a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -714,6 +714,9 @@ int mbm_cntr_alloc(struct rdt_resource *r);
 void mbm_cntr_free(struct rdt_resource *r, u32 cntr_id);
 void arch_mbm_evt_config_init(struct rdt_hw_mon_domain *hw_dom);
 unsigned int mon_event_config_index_get(u32 evtid);
+int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
+			     u32 cntr_id, bool assign);
 void rdt_staged_configs_clear(void);
 bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 36845e8e400d..1b5529c212f5 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1886,6 +1886,44 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static void resctrl_abmc_config_one_amd(void *info)
+{
+	u64 *msrval = info;
+
+	wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, *msrval);
+}
+
+/*
+ * Send an IPI to the domain to assign the counter to RMID, event pair.
+ */
+int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
+			     u32 cntr_id, bool assign)
+{
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+	union l3_qos_abmc_cfg abmc_cfg = { 0 };
+	struct arch_mbm_state *arch_mbm;
+
+	abmc_cfg.split.cfg_en = 1;
+	abmc_cfg.split.cntr_en = assign ? 1 : 0;
+	abmc_cfg.split.cntr_id = cntr_id;
+	abmc_cfg.split.bw_src = rmid;
+
+	/* Update the event configuration from the domain */
+	if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID) {
+		abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
+		arch_mbm = &hw_dom->arch_mbm_total[rmid];
+	} else {
+		abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
+		arch_mbm = &hw_dom->arch_mbm_local[rmid];
+	}
+
+	smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd,
+			      &abmc_cfg, 1);
+
+	return 0;
+}
+
 /* rdtgroup information files for one cache resource. */
 static struct rftype res_common_files[] = {
 	{
-- 
2.34.1



* [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (16 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-16  0:57   ` Reinette Chatre
  2024-12-04  4:16   ` Fenghua Yu
  2024-10-29 23:21 ` [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter Babu Moger
                   ` (7 subsequent siblings)
  25 siblings, 2 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The mbm_cntr_assign mode offers several hardware counters that can be
assigned to an RMID, event pair. The bandwidth is monitored for as long
as the counter is assigned.

Counters are managed at two levels. The global assignment is tracked
using the mbm_cntr_free_map field in the struct resctrl_mon, while
domain-specific assignments are tracked using the mbm_cntr_map field
in the struct rdt_mon_domain. Allocation begins at the global level
and is then applied individually to each domain.

Introduce an interface to allocate these counters and update the
corresponding domains accordingly.
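
The two-level flow can be sketched as below: a counter is first taken
from the global pool and then recorded in the target domain(s). This is a
simplified user-space model, not the interface added by this patch;
struct cntr_state, NUM_CNTRS, NUM_DOMAINS and the convention that a
negative domain index means "every domain" are assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CNTRS	4	/* hypothetical number of hardware counters */
#define NUM_DOMAINS	2	/* hypothetical number of monitoring domains */

struct cntr_state {
	uint32_t global_used;			/* set bit = allocated globally */
	uint32_t domain_used[NUM_DOMAINS];	/* set bit = assigned in domain */
};

/* Take the lowest counter not in use globally; -1 when none is left. */
static int cntr_alloc(struct cntr_state *s)
{
	for (int i = 0; i < NUM_CNTRS; i++) {
		if (!(s->global_used & (1u << i))) {
			s->global_used |= 1u << i;
			return i;
		}
	}

	return -1;
}

/*
 * Allocate a counter globally, then record it in one domain, or in
 * every domain when dom is negative. Returns the counter ID or -1.
 */
static int cntr_assign(struct cntr_state *s, int dom)
{
	int cntr;

	assert(dom < NUM_DOMAINS);

	cntr = cntr_alloc(s);
	if (cntr < 0)
		return -1;

	for (int i = 0; i < NUM_DOMAINS; i++) {
		if (dom < 0 || dom == i)
			s->domain_used[i] |= 1u << cntr;
	}

	return cntr;
}
```

Because allocation happens at the global level first, a counter recorded
in only one domain is still unavailable to the other domains in this
model.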

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Introduced new function resctrl_config_cntr to assign the counter, update
    the bitmap and reset the architectural state.
    Taken care of error handling(freeing the counter) when assignment fails.
    Moved mbm_cntr_assigned_to_domain here as it is used in this patch.
    Minor text changes.

v8: Renamed rdtgroup_assign_cntr() to rdtgroup_assign_cntr_event().
    Added the code to return the error if rdtgroup_assign_cntr_event fails.
    Moved definition of MBM_EVENT_ARRAY_INDEX to resctrl/internal.h.
    Updated typo in the comments.

v7: New patch. Moved all the FS code here.
    Merged rdtgroup_assign_cntr and rdtgroup_alloc_cntr.
    Added new #define MBM_EVENT_ARRAY_INDEX.
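
For illustration only (not part of this patch), the two-level bookkeeping
described in the commit message can be modelled in user space roughly as
below. The fixed-width bitmasks, NUM_CNTRS, NUM_DOMAINS and the
cntr_alloc()/cntr_assign() helpers are assumptions made for this sketch.
The kernel code uses the mbm_cntr_free_map and mbm_cntr_map bitmaps,
reuses a counter the group already holds, and programs the hardware for
each domain, none of which is modelled here.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CNTRS   4	/* hypothetical; the real count is read from CPUID */
#define NUM_DOMAINS 2	/* hypothetical number of monitoring domains */

static uint32_t free_map = (1u << NUM_CNTRS) - 1;	/* global: bit set = free */
static uint32_t dom_map[NUM_DOMAINS];			/* per domain: bit set = in use */

/* Take the lowest free counter from the global pool, or -1 if none is left. */
static int cntr_alloc(void)
{
	for (int id = 0; id < NUM_CNTRS; id++) {
		if (free_map & (1u << id)) {
			free_map &= ~(1u << id);
			return id;
		}
	}
	return -1;
}

/*
 * Allocate at the global level first, then mark the counter as in use in
 * one domain (dom >= 0) or in every domain (dom < 0).
 */
static int cntr_assign(int dom)
{
	int id;

	assert(dom < NUM_DOMAINS);

	id = cntr_alloc();
	if (id < 0)
		return -1;	/* pool exhausted */

	for (int d = 0; d < NUM_DOMAINS; d++)
		if (dom < 0 || d == dom)
			dom_map[d] |= 1u << id;

	return id;
}
```

Calling cntr_assign(-1) mirrors passing a NULL domain to
rdtgroup_assign_cntr_event(); a non-negative argument mirrors a
domain-specific assignment.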
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 87 ++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 00f7bf60e16a..cb496bd97007 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -717,6 +717,8 @@ unsigned int mon_event_config_index_get(u32 evtid);
 int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
 			     u32 cntr_id, bool assign);
+int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
+			       struct rdt_mon_domain *d, enum resctrl_event_id evtid);
 void rdt_staged_configs_clear(void);
 bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 1b5529c212f5..bc3752967c44 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1924,6 +1924,93 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 	return 0;
 }
 
+/*
+ * Configure the counter for the event, RMID pair for the domain.
+ * Update the bitmap and reset the architectural state.
+ */
+static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
+			       u32 cntr_id, bool assign)
+{
+	int ret;
+
+	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
+	if (ret)
+		return ret;
+
+	if (assign)
+		__set_bit(cntr_id, d->mbm_cntr_map);
+	else
+		__clear_bit(cntr_id, d->mbm_cntr_map);
+
+	/*
+	 * Reset the architectural state so that reading of hardware
+	 * counter is not considered as an overflow in next update.
+	 */
+	resctrl_arch_reset_rmid(r, d, closid, rmid, evtid);
+
+	return ret;
+}
+
+static bool mbm_cntr_assigned_to_domain(struct rdt_resource *r, u32 cntr_id)
+{
+	struct rdt_mon_domain *d;
+
+	list_for_each_entry(d, &r->mon_domains, hdr.list)
+		if (test_bit(cntr_id, d->mbm_cntr_map))
+			return true;
+
+	return false;
+}
+
+/*
+ * Assign a hardware counter to event @evtid of group @rdtgrp.
+ * Counter will be assigned to all the domains if rdt_mon_domain is NULL
+ * else the counter will be assigned to specific domain.
+ */
+int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
+			       struct rdt_mon_domain *d, enum resctrl_event_id evtid)
+{
+	int index = MBM_EVENT_ARRAY_INDEX(evtid);
+	int cntr_id = rdtgrp->mon.cntr_id[index];
+	int ret = 0;
+
+	/*
+	 * Allocate a new counter id to the event if the counter is not
+	 * assigned already.
+	 */
+	if (cntr_id == MON_CNTR_UNSET) {
+		cntr_id = mbm_cntr_alloc(r);
+		if (cntr_id < 0) {
+			rdt_last_cmd_puts("Out of MBM assignable counters\n");
+			return -ENOSPC;
+		}
+		rdtgrp->mon.cntr_id[index] = cntr_id;
+	}
+
+	if (!d) {
+		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+			ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
+						  rdtgrp->closid, cntr_id, true);
+			if (ret)
+				goto out_done_assign;
+		}
+	} else {
+		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
+					  rdtgrp->closid, cntr_id, true);
+		if (ret)
+			goto out_done_assign;
+	}
+
+out_done_assign:
+	if (ret && !mbm_cntr_assigned_to_domain(r, cntr_id)) {
+		mbm_cntr_free(r, cntr_id);
+		rdtgroup_cntr_id_init(rdtgrp, evtid);
+	}
+
+	return ret;
+}
+
 /* rdtgroup information files for one cache resource. */
 static struct rftype res_common_files[] = {
 	{
-- 
2.34.1



* [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (17 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-04 14:16   ` Peter Newman
  2024-10-29 23:21 ` [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

The mbm_cntr_assign mode provides a limited number of hardware counters
that can be assigned to an RMID, event pair to monitor bandwidth while
assigned. If all counters are in use, the kernel will show an error
message: "Out of MBM assignable counters" when a new assignment is
requested. To make space for a new assignment, users must unassign an
already assigned counter.

Introduce an interface that allows for the unassignment of counter IDs
from both the group and the domain. Additionally, ensure that the global
counter is released if it is no longer assigned to any domains.

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Changes related to addition of new function resctrl_config_cntr().
    Removed rdtgroup_mbm_cntr_is_assigned() as it was already
    introduced.
    Text changes to address review comments.

v8: Renamed rdtgroup_mbm_cntr_is_assigned to mbm_cntr_assigned_to_domain
    Added return error handling in resctrl_arch_config_cntr().

v7: Merged rdtgroup_unassign_cntr and rdtgroup_free_cntr functions.
    Renamed rdtgroup_mbm_cntr_test() to rdtgroup_mbm_cntr_is_assigned().
    Reworded the commit log a little.

v6: Removed mbm_cntr_free from this patch.
    Added counter test in all the domains and free if it is not assigned to
    any domains.

v5: A few name changes to match cntr_id.
    Changed the function names to rdtgroup_unassign_cntr.
    Added more comments to the commit log.

v4: Added domain specific unassign feature.
    Few name changes.

v3: Removed the static from the prototype of rdtgroup_unassign_abmc.
    The function is not called directly from user anymore. These
    changes are related to global assignment interface.

v2: No changes.
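
For illustration only (not part of this patch), the release rule from the
commit message, "free the global counter once no domain uses it", can be
sketched in user space as below. The bitmask layout, the preset toy state
and the cntr_unassign() helper are assumptions made for this sketch; the
kernel path also reprograms the hardware and resets the group's cntr_id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_DOMAINS 2	/* hypothetical number of monitoring domains */

/* Toy state: counter 0 is in use in both domains, so it is not free. */
static uint32_t free_map = 0xEu;			/* global: bit set = free */
static uint32_t dom_map[NUM_DOMAINS] = { 1u, 1u };	/* per domain: bit set = in use */

/* True if any domain still has the counter marked as in use. */
static bool cntr_assigned_to_any_domain(int id)
{
	for (int d = 0; d < NUM_DOMAINS; d++)
		if (dom_map[d] & (1u << id))
			return true;
	return false;
}

/*
 * Clear the counter in one domain (dom >= 0) or in all domains (dom < 0).
 * Returns true if the counter went back to the global pool.
 */
static bool cntr_unassign(int id, int dom)
{
	assert(dom < NUM_DOMAINS);

	for (int d = 0; d < NUM_DOMAINS; d++)
		if (dom < 0 || d == dom)
			dom_map[d] &= ~(1u << id);

	/* Another domain still uses the counter: keep the global reservation. */
	if (cntr_assigned_to_any_domain(id))
		return false;

	free_map |= 1u << id;
	return true;
}
```

Unassigning counter 0 from domain 0 alone leaves it reserved; only the
second call, for domain 1, returns it to the pool.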
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 41 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index cb496bd97007..66de0ce12aba 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -719,6 +719,8 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 cntr_id, bool assign);
 int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
 			       struct rdt_mon_domain *d, enum resctrl_event_id evtid);
+int rdtgroup_unassign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
+				 struct rdt_mon_domain *d, enum resctrl_event_id evtid);
 void rdt_staged_configs_clear(void);
 bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index bc3752967c44..b0cce3dfd062 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2011,6 +2011,47 @@ int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
 	return ret;
 }
 
+/*
+ * Unassign a hardware counter associated with @evtid from the domain and
+ * the group. Unassign the counters from all the domains if rdt_mon_domain
+ * is NULL else unassign from the specific domain.
+ * Free the global counter once unassigned from all the domains.
+ */
+int rdtgroup_unassign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
+				 struct rdt_mon_domain *d, enum resctrl_event_id evtid)
+{
+	int index = MBM_EVENT_ARRAY_INDEX(evtid);
+	int cntr_id = rdtgrp->mon.cntr_id[index];
+	int ret = 0;
+
+	/* Return early if the counter is unassigned already */
+	if (cntr_id == MON_CNTR_UNSET)
+		return 0;
+
+	if (!d) {
+		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+			ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
+						  rdtgrp->closid, cntr_id, false);
+			if (ret)
+				goto out_done_unassign;
+		}
+	} else {
+		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
+					  rdtgrp->closid, cntr_id, false);
+		if (ret)
+			goto out_done_unassign;
+	}
+
+	/* Free the counter id once it is unassigned from all the domains */
+	if (!mbm_cntr_assigned_to_domain(r, cntr_id)) {
+		mbm_cntr_free(r, cntr_id);
+		rdtgroup_cntr_id_init(rdtgrp, evtid);
+	}
+
+out_done_unassign:
+	return ret;
+}
+
 /* rdtgroup information files for one cache resource. */
 static struct rftype res_common_files[] = {
 	{
-- 
2.34.1



* [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (18 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-18 17:18   ` Reinette Chatre
  2024-12-04  4:16   ` Fenghua Yu
  2024-10-29 23:21 ` [PATCH v9 21/26] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode Babu Moger
                   ` (5 subsequent siblings)
  25 siblings, 2 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Assign/unassign counters on resctrl group creation/deletion. Two counters
are required per group, one for MBM total event and one for MBM local
event.

There are a limited number of counters available for assignment. If these
counters are exhausted, the kernel will display the error message: "Out of
MBM assignable counters". However, it is not necessary to fail the
creation of a group due to assignment failures. Users have the flexibility
to modify the assignments at a later time.

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
    Updated couple of rdtgroup_unassign_cntrs() calls properly.
    Updated function comments.

v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
    Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
    Fixed the problem with unassigning the child MON groups of CTRL_MON group.

v7: Reworded the commit message.
    Removed the reference of ABMC with mbm_cntr_assign.
    Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.

v6: Removed the redundant comments on all the calls of
    rdtgroup_assign_cntrs. Updated the commit message.
    Dropped printing error message on every call of rdtgroup_assign_cntrs.

v5: Removed the code to enable/disable ABMC during the mount.
    That will be another patch.
    Added arch callers to get the arch specific data.
    Renamed functions to match the other abmc functions.
    Added code comments for assignment failures.

v4: Few name changes based on the upstream discussion.
    Commit message update.

v3: This is a new patch. Patch addresses the upstream comment to enable
    ABMC feature by default if the feature is available.
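
For illustration only (not part of this patch), the "best effort" policy
described in the commit message can be sketched in user space as below:
group creation asks for one counter per MBM event and carries on when the
pool runs dry. The struct group layout, the pool size of three and the
helper names are assumptions made for this sketch.

```c
#include <assert.h>

#define CNTR_UNSET (-1)	/* stands in for MON_CNTR_UNSET */

enum mbm_event { MBM_TOTAL, MBM_LOCAL, MBM_NUM_EVENTS };

struct group {
	int cntr_id[MBM_NUM_EVENTS];
};

static int cntrs_left = 3;	/* hypothetical pool size */
static int next_id;

/* Hand out the next counter id, or CNTR_UNSET when the pool is exhausted. */
static int cntr_alloc(void)
{
	if (cntrs_left == 0)
		return CNTR_UNSET;
	cntrs_left--;
	return next_id++;
}

/*
 * Best effort: an exhausted pool leaves the event unassigned, but group
 * creation itself never fails, hence the void return type.
 */
static void group_create(struct group *g)
{
	assert(g);

	for (int ev = 0; ev < MBM_NUM_EVENTS; ev++)
		g->cntr_id[ev] = cntr_alloc();
}
```

With three counters, the first group gets both events and the second group
gets only the total event; its local event stays unset until the user
frees a counter elsewhere and assigns it.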
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b0cce3dfd062..a8d21b0b2054 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2932,6 +2932,46 @@ static void schemata_list_destroy(void)
 	}
 }
 
+/*
+ * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
+ * counters are automatically assigned. Each group can accommodate two counters:
+ * one for the total event and one for the local event. Assignments may fail
+ * due to the limited number of counters. However, it is not necessary to fail
+ * the group creation and thus no failure is returned. Users have the option
+ * to modify the counter assignments after the group has been created.
+ */
+static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
+		return;
+
+	if (is_mbm_total_enabled())
+		rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);
+
+	if (is_mbm_local_enabled())
+		rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_LOCAL_EVENT_ID);
+}
+
+/*
+ * Called when a group is deleted. Counters are unassigned if they were in
+ * the assigned state.
+ */
+static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
+		return;
+
+	if (is_mbm_total_enabled())
+		rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);
+
+	if (is_mbm_local_enabled())
+		rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_LOCAL_EVENT_ID);
+}
+
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -2991,6 +3031,8 @@ static int rdt_get_tree(struct fs_context *fc)
 		if (ret < 0)
 			goto out_mongrp;
 		rdtgroup_default.mon.mon_data_kn = kn_mondata;
+
+		rdtgroup_assign_cntrs(&rdtgroup_default);
 	}
 
 	ret = rdt_pseudo_lock_init();
@@ -3021,8 +3063,10 @@ static int rdt_get_tree(struct fs_context *fc)
 out_psl:
 	rdt_pseudo_lock_release();
 out_mondata:
-	if (resctrl_arch_mon_capable())
+	if (resctrl_arch_mon_capable()) {
+		rdtgroup_unassign_cntrs(&rdtgroup_default);
 		kernfs_remove(kn_mondata);
+	}
 out_mongrp:
 	if (resctrl_arch_mon_capable())
 		kernfs_remove(kn_mongrp);
@@ -3201,6 +3245,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
 
 	head = &rdtgrp->mon.crdtgrp_list;
 	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
+		rdtgroup_unassign_cntrs(sentry);
 		free_rmid(sentry->closid, sentry->mon.rmid);
 		list_del(&sentry->mon.crdtgrp_list);
 
@@ -3241,6 +3286,8 @@ static void rmdir_all_sub(void)
 		cpumask_or(&rdtgroup_default.cpu_mask,
 			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
 
+		rdtgroup_unassign_cntrs(rdtgrp);
+
 		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 
 		kernfs_remove(rdtgrp->kn);
@@ -3272,6 +3319,7 @@ static void rdt_kill_sb(struct super_block *sb)
 	for_each_alloc_capable_rdt_resource(r)
 		reset_all_ctrls(r);
 	rmdir_all_sub();
+	rdtgroup_unassign_cntrs(&rdtgroup_default);
 	rdt_pseudo_lock_release();
 	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
 	schemata_list_destroy();
@@ -3280,6 +3328,7 @@ static void rdt_kill_sb(struct super_block *sb)
 		resctrl_arch_disable_alloc();
 	if (resctrl_arch_mon_capable())
 		resctrl_arch_disable_mon();
+
 	resctrl_mounted = false;
 	kernfs_kill_sb(sb);
 	mutex_unlock(&rdtgroup_mutex);
@@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
 		goto out_unlock;
 	}
 
+	rdtgroup_assign_cntrs(rdtgrp);
+
 	kernfs_activate(rdtgrp->kn);
 
 	/*
@@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
 	if (ret)
 		goto out_closid_free;
 
+	rdtgroup_assign_cntrs(rdtgrp);
+
 	kernfs_activate(rdtgrp->kn);
 
 	ret = rdtgroup_init_alloc(rdtgrp);
@@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
 out_del_list:
 	list_del(&rdtgrp->rdtgroup_list);
 out_rmid_free:
+	rdtgroup_unassign_cntrs(rdtgrp);
 	mkdir_rdt_prepare_rmid_free(rdtgrp);
 out_closid_free:
 	closid_free(closid);
@@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 	update_closid_rmid(tmpmask, NULL);
 
 	rdtgrp->flags = RDT_DELETED;
+
+	rdtgroup_unassign_cntrs(rdtgrp);
+
 	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 
 	/*
@@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
 	update_closid_rmid(tmpmask, NULL);
 
+	rdtgroup_unassign_cntrs(rdtgrp);
+
 	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
 	closid_free(rdtgrp->closid);
 
-- 
2.34.1



* [PATCH v9 21/26] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (19 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-18 17:39   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 22/26] x86/resctrl: Introduce the interface to switch between monitor modes Babu Moger
                   ` (4 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

In mbm_cntr_assign mode, a hardware counter must be assigned to an event
before that MBM event can be read.

Report "Unassigned" if the user attempts to read an event that has no
counter assigned.

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Used is_mbm_event() to check the event type.
    Minor user documentation update.

v8: Used MBM_EVENT_ARRAY_INDEX to get the index for the MBM event.
    Documentation update to make the text generic.

v7: Moved the documentation under "mon_data".
    Updated the text a little.

v6: Added more explanation in resctrl.rst.
    Added checks to detect "Unassigned" before reading RMID.

v5: New patch.
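
For illustration only (not part of this patch), the order of checks behind
the strings shown in the mon_data files can be sketched in user space as
below. The enum, the text table and mon_classify() are names invented for
this sketch; the -EIO mapping to "Error" is assumed from the existing
resctrl code and is not visible in the hunk below.

```c
#include <assert.h>
#include <errno.h>

#define CNTR_UNSET (-1)	/* stands in for MON_CNTR_UNSET */

enum mon_result { MON_VALUE, MON_ERROR, MON_UNAVAILABLE, MON_UNASSIGNED };

static const char *const mon_result_text[] = {
	[MON_ERROR]	  = "Error",
	[MON_UNAVAILABLE] = "Unavailable",
	[MON_UNASSIGNED]  = "Unassigned",
};

/* Text shown for a non-numeric result; MON_VALUE prints the count instead. */
static const char *mon_text(enum mon_result res)
{
	assert(res != MON_VALUE);
	return mon_result_text[res];
}

/*
 * Decide what a read of an MBM event file reports. The assignment check
 * comes first, so no hardware read is attempted without a counter.
 */
static enum mon_result mon_classify(int assign_mode, int cntr_id, int read_err)
{
	if (assign_mode && cntr_id == CNTR_UNSET)
		return MON_UNASSIGNED;
	if (read_err == -EIO)
		return MON_ERROR;
	if (read_err == -EINVAL)
		return MON_UNAVAILABLE;
	return MON_VALUE;
}
```

In default mode the cntr_id is ignored, so an event without a counter can
only report a value, "Unavailable" or "Error", as before.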
---
 Documentation/arch/x86/resctrl.rst        | 10 ++++++++++
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 2bc58d974934..864fc004d646 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -430,6 +430,16 @@ When monitoring is enabled all MON groups will also contain:
 	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
 	where "YY" is the node number.
 
+	When supported the 'mbm_cntr_assign' mode allows users to assign a
+	counter to mon_hw_id, event pair enabling bandwidth monitoring for
+	as long as the counter remains assigned. The hardware will continue
+	tracking the assigned mon_hw_id until the user manually unassigns
+	it, ensuring that counters are not reset during this period. With
+	a limited number of counters, the system may run out of assignable
+	counters. In that case, MBM event counters will return "Unassigned"
+	when the event is read. Users must manually assign a counter to read
+	the events.
+
 "mon_hw_id":
 	Available only with debug option. The identifier used by hardware
 	for the monitor group. On x86 this is the RMID.
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 200d89a64027..43a48943578f 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -567,7 +567,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	int ret = 0;
+	int ret = 0, index;
 
 	rdtgrp = rdtgroup_kn_lock_live(of->kn);
 	if (!rdtgrp) {
@@ -581,6 +581,14 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 	r = &rdt_resources_all[resid].r_resctrl;
 
+	if (resctrl_arch_mbm_cntr_assign_enabled(r) && is_mbm_event(evtid)) {
+		index = MBM_EVENT_ARRAY_INDEX(evtid);
+		if (rdtgrp->mon.cntr_id[index] == MON_CNTR_UNSET) {
+			rr.err = -ENOENT;
+			goto checkresult;
+		}
+	}
+
 	if (md.u.sum) {
 		/*
 		 * This file requires summing across all domains that share
@@ -618,6 +626,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		seq_puts(m, "Error\n");
 	else if (rr.err == -EINVAL)
 		seq_puts(m, "Unavailable\n");
+	else if (rr.err == -ENOENT)
+		seq_puts(m, "Unassigned\n");
 	else
 		seq_printf(m, "%llu\n", rr.val);
 
-- 
2.34.1



* [PATCH v9 22/26] x86/resctrl: Introduce the interface to switch between monitor modes
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (20 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 21/26] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 23/26] x86/resctrl: Configure mbm_cntr_assign mode if supported Babu Moger
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Introduce an interface to switch between the mbm_cntr_assign and default modes.

$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
[mbm_cntr_assign]
default

To enable the "mbm_cntr_assign" mode:
$ echo "mbm_cntr_assign" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode

To enable the default monitoring mode:
$ echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode

MBM event counters will reset when mbm_assign_mode is changed.

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Fixed extra spaces in user documentation.
    Fixed problem changing the mode to mbm_cntr_assign mode when it is
    not supported. Added extra checks to detect if systems supports it.
    Used the rdtgroup_cntr_id_init to initialize cntr_id.

v8: Reset the internal counters after mbm_cntr_assign mode is changed.
    Renamed rdtgroup_mbm_cntr_reset() to mbm_cntr_reset()
    Updated the documentation to make text generic.

v7: Changed the interface name to mbm_assign_mode.
    Removed the references of ABMC.
    Added the changes to reset global and domain bitmaps.
    Added the changes to reset rmid.

v6: Changed the mode name to mbm_cntr_assign.
    Moved all the FS related code here.
    Added changes to reset mbm_cntr_map and resctrl group counters.

v5: Change log and mode description text correction.

v4: Minor commit text changes. Keep the default to ABMC when supported.
    Fixed comments to reflect changed interface "mbm_mode".

v3: New patch to address the review comments from upstream.
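
For illustration only (not part of this patch), the input validation done
by the write handler can be sketched in user space as below. The function
name parse_assign_mode() and its "supported" parameter are assumptions
made for this sketch; locking, rdt_last_cmd messages and the actual mode
switch are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Parse a write to the mode file. The buffer must end in '\n', which is
 * what `echo` produces. On success, store the requested state in *enable
 * and return 0; otherwise return -EINVAL and leave *enable untouched.
 */
static int parse_assign_mode(char *buf, size_t nbytes, bool supported,
			     bool *enable)
{
	assert(buf && enable);

	if (nbytes == 0 || buf[nbytes - 1] != '\n')
		return -EINVAL;

	/* Replace the trailing newline so strcmp() sees a bare keyword. */
	buf[nbytes - 1] = '\0';

	if (!strcmp(buf, "default")) {
		*enable = false;
		return 0;
	}

	/* The assignable mode is accepted only when the hardware has it. */
	if (!strcmp(buf, "mbm_cntr_assign") && supported) {
		*enable = true;
		return 0;
	}

	return -EINVAL;
}
```

A write without a trailing newline, for example from `echo -n`, is
rejected, as is "mbm_cntr_assign" on a system without the feature.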
---
 Documentation/arch/x86/resctrl.rst     | 15 +++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 81 +++++++++++++++++++++++++-
 2 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 864fc004d646..1b39866e8b04 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -290,6 +290,21 @@ with the following files:
 	mbm_local_bytes may report 'Unavailable' if there is no counter associated
 	with that event.
 
+	* To enable "mbm_cntr_assign" mode:
+	  ::
+
+	    # echo "mbm_cntr_assign" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+	* To enable default monitoring mode:
+	  ::
+
+	    # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+	The MBM events (mbm_total_bytes and/or mbm_local_bytes) associated with
+	counters may reset when mbm_assign_mode is changed. Moving to
+	mbm_cntr_assign mode requires users to assign the counters to the events.
+	Otherwise, the MBM event counters will return "Unassigned" when read.
+
 "num_mbm_cntrs":
 	The number of monitoring counters available for assignment when the
 	architecture supports mbm_cntr_assign mode.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a8d21b0b2054..7fa6a86c6ca8 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -933,6 +933,84 @@ static void rdtgroup_cntr_id_init(struct rdtgroup *rdtgrp,
 	rdtgrp->mon.cntr_id[index] = MON_CNTR_UNSET;
 }
 
+static void mbm_cntr_reset(struct rdt_resource *r)
+{
+	struct rdtgroup *prgrp, *crgrp;
+	struct rdt_mon_domain *dom;
+
+	/*
+	 * Hardware counters will reset after switching the monitor mode.
+	 * Reset the architectural state so that reading of hardware
+	 * counter is not considered as an overflow in the next update.
+	 * Also reset the domain counter bitmap.
+	 */
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+		bitmap_zero(dom->mbm_cntr_map, r->mon.num_mbm_cntrs);
+		resctrl_arch_reset_rmid_all(r, dom);
+	}
+
+	/* Reset global MBM counter map */
+	bitmap_fill(r->mon.mbm_cntr_free_map, r->mon.num_mbm_cntrs);
+
+	/* Reset the cntr_id's for all the monitor groups */
+	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+		rdtgroup_cntr_id_init(prgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
+		rdtgroup_cntr_id_init(prgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
+		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list,
+				    mon.crdtgrp_list) {
+			rdtgroup_cntr_id_init(crgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
+			rdtgroup_cntr_id_init(crgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
+		}
+	}
+}
+
+static ssize_t rdtgroup_mbm_assign_mode_write(struct kernfs_open_file *of,
+					      char *buf, size_t nbytes, loff_t off)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	int ret = 0;
+	bool enable;
+
+	/* Valid input requires a trailing newline */
+	if (nbytes == 0 || buf[nbytes - 1] != '\n')
+		return -EINVAL;
+
+	buf[nbytes - 1] = '\0';
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+
+	rdt_last_cmd_clear();
+
+	if (!strcmp(buf, "default")) {
+		enable = false;
+	} else if (!strcmp(buf, "mbm_cntr_assign")) {
+		if (r->mon.mbm_cntr_assignable) {
+			enable = true;
+		} else {
+			ret = -EINVAL;
+			rdt_last_cmd_puts("mbm_cntr_assign mode is not supported\n");
+			goto write_exit;
+		}
+	} else {
+		ret = -EINVAL;
+		rdt_last_cmd_puts("Unsupported assign mode\n");
+		goto write_exit;
+	}
+
+	if (enable != resctrl_arch_mbm_cntr_assign_enabled(r)) {
+		ret = resctrl_arch_mbm_cntr_assign_set(r, enable);
+		if (!ret)
+			mbm_cntr_reset(r);
+	}
+
+write_exit:
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+
+	return ret ?: nbytes;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -2166,9 +2244,10 @@ static struct rftype res_common_files[] = {
 	},
 	{
 		.name		= "mbm_assign_mode",
-		.mode		= 0444,
+		.mode		= 0644,
 		.kf_ops		= &rdtgroup_kf_single_ops,
 		.seq_show	= rdtgroup_mbm_assign_mode_show,
+		.write		= rdtgroup_mbm_assign_mode_write,
 		.fflags		= RFTYPE_MON_INFO,
 	},
 	{
-- 
2.34.1



* [PATCH v9 23/26] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (21 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 22/26] x86/resctrl: Introduce the interface to switch between monitor modes Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-18 19:23   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes Babu Moger
                   ` (2 subsequent siblings)
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Configure mbm_cntr_assign mode on AMD. On AMD, 'mbm_cntr_assign' mode is
ABMC (Assignable Bandwidth Monitoring Counters). It is enabled by default
when supported on the system.

When the ABMC setting is updated, it must be updated on all the logical
processors in the resctrl domain.

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Minor code change due to merge. Actual code did not change.

v8: Renamed resctrl_arch_mbm_cntr_assign_configure to
    resctrl_arch_mbm_cntr_assign_set_one.
    Added r->mon_capable check.
    Commit message update.

v7: Introduced resctrl_arch_mbm_cntr_assign_configure() to configure.
    Moved the default settings to rdt_get_mon_l3_config(). It should be
    done before the hotplug handler is called. It cannot be done at
    rdtgroup_init().

v6: Keeping the default enablement in arch init code for now.
    This may need some discussion.
    Renamed resctrl_arch_configure_abmc to resctrl_arch_mbm_cntr_assign_configure.

v5: New patch to enable ABMC by default.
---
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
 3 files changed, 13 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 66de0ce12aba..b90d8c90b4b6 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -721,6 +721,7 @@ int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
 			       struct rdt_mon_domain *d, enum resctrl_event_id evtid);
 int rdtgroup_unassign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
 				 struct rdt_mon_domain *d, enum resctrl_event_id evtid);
+void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
 void rdt_staged_configs_clear(void);
 bool closid_allocated(unsigned int closid);
 int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index e8d38a963f39..4ba5007fd1aa 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1267,6 +1267,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			r->mon.mbm_cntr_assignable = true;
 			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
 			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
+			hw_res->mbm_cntr_assign_enabled = true;
 			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
 			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7fa6a86c6ca8..5b8bb8bd913c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2799,6 +2799,13 @@ int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
 	return 0;
 }
 
+void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r)
+{
+	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+	resctrl_abmc_set_one_amd(&hw_res->mbm_cntr_assign_enabled);
+}
+
 /*
  * We don't allow rdtgroup directories to be created anywhere
  * except the root directory. Thus when looking for the rdtgroup
@@ -4582,9 +4589,13 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 
 void resctrl_online_cpu(unsigned int cpu)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
 	mutex_lock(&rdtgroup_mutex);
 	/* The CPU is set in default rdtgroup after online. */
 	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+	if (r->mon_capable && r->mon.mbm_cntr_assignable)
+		resctrl_arch_mbm_cntr_assign_set_one(r);
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (22 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 23/26] x86/resctrl: Configure mbm_cntr_assign mode if supported Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-18 19:43   ` Reinette Chatre
  2024-10-29 23:21 ` [PATCH v9 25/26] x86/resctrl: Introduce interface to list assignment states of all the groups Babu Moger
  2024-10-29 23:21 ` [PATCH v9 26/26] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Users can modify the configuration of assignable events. Whenever the
event configuration is updated, MBM assignments must be revised across
all monitor groups within the impacted domains.
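
As a rough illustration of the revision step described above, the following
userspace sketch walks a domain's counter bitmap and collects the counters
that would need reprogramming after an event configuration change. The struct
layout, the CNTR_UNSET marker, and the function name are assumptions made for
this example; they are not the kernel's definitions.

```c
#include <limits.h>
#include <stddef.h>

#define NUM_EVENTS	2	/* total and local; illustrative only */
#define CNTR_UNSET	(-1)	/* hypothetical "no counter assigned" marker */

/* Hypothetical, simplified view of a monitor group. */
struct mon_group {
	unsigned int rmid;
	int cntr_id[NUM_EVENTS];	/* counter per event, or CNTR_UNSET */
};

/*
 * Collect the counters in this domain that back event 'evt_index' for
 * some group. Up to 'out_len' counter ids are stored in 'out'; the
 * return value is the number of matching counters found.
 */
size_t cntrs_to_reprogram(const struct mon_group *grps, size_t ngrps,
			  unsigned long dom_cntr_map, int num_cntrs,
			  int evt_index, int *out, size_t out_len)
{
	int max_bits = (int)(sizeof(dom_cntr_map) * CHAR_BIT);
	size_t n = 0;

	if (evt_index < 0 || evt_index >= NUM_EVENTS)
		return 0;
	if (num_cntrs > max_bits)
		num_cntrs = max_bits;

	for (int cntr = 0; cntr < num_cntrs; cntr++) {
		/* Skip counters that are not in use in this domain. */
		if (!(dom_cntr_map & (1UL << cntr)))
			continue;

		/* Find the group that owns this counter for the event. */
		for (size_t g = 0; g < ngrps; g++) {
			if (grps[g].cntr_id[evt_index] != cntr)
				continue;
			if (n < out_len)
				out[n] = cntr;
			n++;
			break;
		}
	}

	return n;
}
```

In the patch itself, the equivalent loop runs on a CPU in the affected domain
and rewrites the counter configuration for each match, rather than returning
a list.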

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Patch changed completely again based on the comment.
    https://lore.kernel.org/lkml/03b278b5-6c15-4d09-9ab7-3317e84a409e@intel.com/
    Introduced resctrl_mon_event_config_set() to handle the IPI.
    Sending another IPI from inside the IPI handler causes a problem: the
    kernel reports an SMP warning. So, introduced resctrl_arch_update_cntr()
    to send the command directly.

v8: Patch changed completely.
    Updated the assignment in the same IPI that updates the event.
    Could not do it the way we discussed in the thread.
    https://lore.kernel.org/lkml/f77737ac-d3f6-3e4b-3565-564f79c86ca8@amd.com/
    Needed to figure out the event type to update the configuration.

v7: New patch to update the assignments. Missed it earlier.
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 86 +++++++++++++++++++++++---
 include/linux/resctrl.h                |  3 +-
 2 files changed, 79 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5b8bb8bd913c..7646d67ea10e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1710,6 +1710,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 }
 
 struct mon_config_info {
+	struct rdt_resource *r;
 	struct rdt_mon_domain *d;
 	u32 evtid;
 	u32 mon_config;
@@ -1735,26 +1736,28 @@ u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
 	return INVALID_CONFIG_VALUE;
 }
 
-void resctrl_arch_mon_event_config_set(void *info)
+void resctrl_arch_mon_event_config_set(struct rdt_mon_domain *d,
+				       enum resctrl_event_id eventid, u32 val)
 {
-	struct mon_config_info *mon_info = info;
 	struct rdt_hw_mon_domain *hw_dom;
 	unsigned int index;
 
-	index = mon_event_config_index_get(mon_info->evtid);
+	index = mon_event_config_index_get(eventid);
 	if (index == INVALID_CONFIG_INDEX)
 		return;
 
-	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
+	wrmsr(MSR_IA32_EVT_CFG_BASE + index, val, 0);
 
-	hw_dom = resctrl_to_arch_mon_dom(mon_info->d);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
-	switch (mon_info->evtid) {
+	switch (eventid) {
 	case QOS_L3_MBM_TOTAL_EVENT_ID:
-		hw_dom->mbm_total_cfg = mon_info->mon_config;
+		hw_dom->mbm_total_cfg = val;
 		break;
 	case QOS_L3_MBM_LOCAL_EVENT_ID:
-		hw_dom->mbm_local_cfg = mon_info->mon_config;
+		hw_dom->mbm_local_cfg = val;
+		break;
+	default:
 		break;
 	}
 }
@@ -1826,6 +1829,70 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static struct rdtgroup *rdtgroup_find_grp_by_cntr_id_index(int cntr_id, unsigned int index)
+{
+	struct rdtgroup *prgrp, *crgrp;
+
+	/* Check if the cntr_id is associated to the event type updated */
+	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+		if (prgrp->mon.cntr_id[index] == cntr_id)
+			return prgrp;
+
+		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
+			if (crgrp->mon.cntr_id[index] == cntr_id)
+				return crgrp;
+		}
+	}
+
+	return NULL;
+}
+
+static void resctrl_arch_update_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+				     enum resctrl_event_id evtid, u32 rmid,
+				     u32 closid, u32 cntr_id, u32 val)
+{
+	union l3_qos_abmc_cfg abmc_cfg = { 0 };
+
+	abmc_cfg.split.cfg_en = 1;
+	abmc_cfg.split.cntr_en = 1;
+	abmc_cfg.split.cntr_id = cntr_id;
+	abmc_cfg.split.bw_src = rmid;
+	abmc_cfg.split.bw_type = val;
+
+	wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg.full);
+}
+
+static void resctrl_mon_event_config_set(void *info)
+{
+	struct mon_config_info *mon_info = info;
+	struct rdt_mon_domain *d = mon_info->d;
+	struct rdt_resource *r = mon_info->r;
+	struct rdtgroup *rdtgrp;
+	unsigned int index;
+	u32 cntr_id;
+
+	resctrl_arch_mon_event_config_set(d, mon_info->evtid, mon_info->mon_config);
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
+		return;
+
+	index = mon_event_config_index_get(mon_info->evtid);
+	if (index == INVALID_CONFIG_INDEX)
+		return;
+
+	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
+		if (test_bit(cntr_id, d->mbm_cntr_map)) {
+			rdtgrp = rdtgroup_find_grp_by_cntr_id_index(cntr_id, index);
+			if (rdtgrp)
+				resctrl_arch_update_cntr(mon_info->r, d,
+							 mon_info->evtid,
+							 rdtgrp->mon.rmid,
+							 rdtgrp->closid,
+							 cntr_id,
+							 mon_info->mon_config);
+		}
+	}
+}
 
 static void mbm_config_write_domain(struct rdt_resource *r,
 				    struct rdt_mon_domain *d, u32 evtid, u32 val)
@@ -1841,6 +1908,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	if (config_val == INVALID_CONFIG_VALUE || config_val == val)
 		return;
 
+	mon_info.r = r;
 	mon_info.d = d;
 	mon_info.evtid = evtid;
 	mon_info.mon_config = val;
@@ -1852,7 +1920,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
 	smp_call_function_any(&d->hdr.cpu_mask,
-			      resctrl_arch_mon_event_config_set,
+			      resctrl_mon_event_config_set,
 			      &mon_info, 1);
 
 	/*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0b8eeb8afc68..4dc858d7aa10 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -356,7 +356,8 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
  */
 void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
-void resctrl_arch_mon_event_config_set(void *info);
+void resctrl_arch_mon_event_config_set(struct rdt_mon_domain *d,
+				       enum resctrl_event_id eventid, u32 val);
 u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
 				      enum resctrl_event_id eventid);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v9 25/26] x86/resctrl: Introduce interface to list assignment states of all the groups
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (23 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-10-29 23:21 ` [PATCH v9 26/26] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
  25 siblings, 0 replies; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Provide the interface to list the assignment states of all the resctrl
groups in mbm_cntr_assign mode.

Example:
$ mount -t resctrl resctrl /sys/fs/resctrl/
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=tl;1=tl;

The list uses the following format:

"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

Format for each type of group:

- Default CTRL_MON group:
  "//<domain_id>=<flags>"

- Non-default CTRL_MON group:
  "<CTRL_MON group>//<domain_id>=<flags>"

- Child MON group of default CTRL_MON group:
  "/<MON group>/<domain_id>=<flags>"

- Child MON group of non-default CTRL_MON group:
  "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

Flags can be one of the following:
t  MBM total event is assigned
l  MBM local event is assigned
tl Both total and local MBM events are assigned
_  None of the MBM events are assigned
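
The flag encoding above can be sketched as a small userspace helper. This is
only an illustration of the mapping from assignment state to flag string; the
function name and the two boolean parameters are hypothetical and are not part
of the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: build the flag string for one group/domain pair.
 * 'buf' must hold at least 3 bytes ("tl" plus the terminating NUL).
 * Returns 'buf' on success or NULL if the buffer is too small.
 */
char *mbm_flags_to_str(bool total_assigned, bool local_assigned,
		       char *buf, size_t len)
{
	char *p = buf;

	if (!buf || len < 3)
		return NULL;

	if (total_assigned)
		*p++ = 't';
	if (local_assigned)
		*p++ = 'l';

	/* Neither event is assigned: report '_'. */
	if (p == buf)
		*p++ = '_';

	*p = '\0';
	return buf;
}
```

With both events assigned the helper produces "tl", which matches the
"0=tl;1=tl;" entries in the example output above.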

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Minor parameter update in resctrl_mbm_event_assigned().

v8: Moved resctrl_mbm_event_assigned() in here as it is first used here.
    Moved rdt_last_cmd_clear() before making any call.
    Updated the commit log.
    Corrected the doc format.

v7: Renamed the interface name from 'mbm_control' to 'mbm_assign_control'
    to match 'mbm_assign_mode'.
    Removed Arch references from FS code.
    Added rdt_last_cmd_clear() before the command processing.
    Added rdtgroup_mutex before all the calls.
    Removed references of ABMC from FS code.

v6: The domain specific assignment can be determined by looking at mbm_cntr_map.
    Removed rdtgroup_abmc_dom_cfg() and rdtgroup_abmc_dom_state().
    Removed the switch statement for the domain_state detection.
    Determined the flags incrementally.
    Removed special handling of default group while printing.

v5: Replaced "assignment flags" with "flags".
    Changes related to mon structure.
    Changes related to renaming the interface from mbm_assign_control to
    mbm_control.

v4: Added functionality to query domain specific assignment in
    rdtgroup_abmc_dom_state().

v3: New patch.
    Addresses the feedback to provide the global assignment interface.
    https://lore.kernel.org/lkml/c73f444b-83a1-4e9a-95d3-54c5165ee782@intel.com/
---
 Documentation/arch/x86/resctrl.rst     | 44 +++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 77 ++++++++++++++++++++++++++
 3 files changed, 122 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 1b39866e8b04..590727bec44b 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -321,6 +321,50 @@ with the following files:
 	The number of free monitoring counters available assignment in each domain
 	when the architecture supports mbm_cntr_assign mode.
 
+"mbm_assign_control":
+	Reports the resctrl group and monitor status of each group.
+
+	List follows the following format:
+		"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
+
+	Format for specific type of groups:
+
+	* Default CTRL_MON group:
+		"//<domain_id>=<flags>"
+
+	* Non-default CTRL_MON group:
+		"<CTRL_MON group>//<domain_id>=<flags>"
+
+	* Child MON group of default CTRL_MON group:
+		"/<MON group>/<domain_id>=<flags>"
+
+	* Child MON group of non-default CTRL_MON group:
+		"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
+
+	Flags can be one of the following:
+	::
+
+	 t  MBM total event is assigned.
+	 l  MBM local event is assigned.
+	 tl Both MBM total and local events are assigned.
+	 _  None of the MBM events are assigned.
+
+	Examples:
+	::
+
+	 # mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
+	 # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
+	 # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
+
+	 # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	 non_default_ctrl_mon_grp//0=tl;1=tl;
+	 non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
+	 //0=tl;1=tl;
+	 /child_default_mon_grp/0=tl;1=tl;
+
+	There are four resctrl groups. All the groups have total and local MBM events
+	assigned on domain 0 and 1.
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4ba5007fd1aa..fc0e4ea480cd 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1270,6 +1270,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			hw_res->mbm_cntr_assign_enabled = true;
 			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
 			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
+			resctrl_file_fflags_init("mbm_assign_control", RFTYPE_MON_INFO);
 		}
 	}
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7646d67ea10e..5cc40eacbe85 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1011,6 +1011,77 @@ static ssize_t rdtgroup_mbm_assign_mode_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static bool resctrl_mbm_event_assigned(struct rdtgroup *rdtg,
+				       struct rdt_mon_domain *d,
+				       enum resctrl_event_id evtid)
+{
+	int index = MBM_EVENT_ARRAY_INDEX(evtid);
+	int cntr_id = rdtg->mon.cntr_id[index];
+
+	return cntr_id != MON_CNTR_UNSET && test_bit(cntr_id, d->mbm_cntr_map);
+}
+
+static char *rdtgroup_mon_state_to_str(struct rdtgroup *rdtgrp,
+				       struct rdt_mon_domain *d, char *str)
+{
+	char *tmp = str;
+
+	/* Query the total and local event flags for the domain */
+	if (resctrl_mbm_event_assigned(rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID))
+		*tmp++ = 't';
+
+	if (resctrl_mbm_event_assigned(rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID))
+		*tmp++ = 'l';
+
+	if (tmp == str)
+		*tmp++ = '_';
+
+	*tmp = '\0';
+	return str;
+}
+
+static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
+					    struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	struct rdt_mon_domain *dom;
+	struct rdtgroup *rdtg;
+	char str[10];
+
+	mutex_lock(&rdtgroup_mutex);
+	rdt_last_cmd_clear();
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
+		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
+		mutex_unlock(&rdtgroup_mutex);
+		return -EINVAL;
+	}
+
+	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+		struct rdtgroup *crg;
+
+		seq_printf(s, "%s//", rdtg->kn->name);
+
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
+			seq_printf(s, "%d=%s;", dom->hdr.id,
+				   rdtgroup_mon_state_to_str(rdtg, dom, str));
+		seq_putc(s, '\n');
+
+		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
+				    mon.crdtgrp_list) {
+			seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);
+
+			list_for_each_entry(dom, &r->mon_domains, hdr.list)
+				seq_printf(s, "%d=%s;", dom->hdr.id,
+					   rdtgroup_mon_state_to_str(crg, dom, str));
+			seq_putc(s, '\n');
+		}
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -2310,6 +2381,12 @@ static struct rftype res_common_files[] = {
 		.seq_show	= mbm_local_bytes_config_show,
 		.write		= mbm_local_bytes_config_write,
 	},
+	{
+		.name		= "mbm_assign_control",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdtgroup_mbm_assign_control_show,
+	},
 	{
 		.name		= "mbm_assign_mode",
 		.mode		= 0644,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v9 26/26] x86/resctrl: Introduce interface to modify assignment states of the groups
  2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
                   ` (24 preceding siblings ...)
  2024-10-29 23:21 ` [PATCH v9 25/26] x86/resctrl: Introduce interface to list assignment states of all the groups Babu Moger
@ 2024-10-29 23:21 ` Babu Moger
  2024-11-18 21:51   ` Reinette Chatre
  25 siblings, 1 reply; 115+ messages in thread
From: Babu Moger @ 2024-10-29 23:21 UTC (permalink / raw)
  To: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, babu.moger,
	jithu.joseph, brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Introduce the interface to assign MBM events in mbm_cntr_assign mode.

Events can be assigned or unassigned by writing to the file
/sys/fs/resctrl/info/L3_MON/mbm_assign_control

The format is similar to the list format, with the addition of an opcode
for the assignment operation.
 "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

Format for specific type of groups:

 * Default CTRL_MON group:
         "//<domain_id><opcode><flags>"

 * Non-default CTRL_MON group:
         "<CTRL_MON group>//<domain_id><opcode><flags>"

 * Child MON group of default CTRL_MON group:
         "/<MON group>/<domain_id><opcode><flags>"

 * Child MON group of non-default CTRL_MON group:
         "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

Domain_id '*' will apply the flags on all the domains.

Opcode can be one of the following:

 = Update the assignment to match the flags.
 + Assign a new MBM event without impacting existing assignments.
 - Unassign an MBM event from currently assigned events.

Assignment flags can be one of the following:
 t  MBM total event
 l  MBM local event
 tl Both total and local MBM events
 _  None of the MBM events. Valid only with '=' opcode. This flag cannot
    be combined with other flags.
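
To make the opcode and flag rules concrete, here is a userspace sketch that
parses a single per-domain token (domain id, opcode, flags) into the sets of
events to assign and unassign. It follows the rules as described in this
commit message and rejects '_' combined with other flags; it is not the
kernel parser. The struct, the function name, and the bit values are
assumptions made for the example.

```c
#include <stdlib.h>
#include <string.h>

#define ASSIGN_TOTAL	0x1	/* illustrative bit values, not the kernel's */
#define ASSIGN_LOCAL	0x2

/* Hypothetical result of parsing one per-domain token. */
struct assign_req {
	long dom_id;	/* -1 stands for '*' (all domains) */
	char op;	/* '=', '+' or '-' */
	int assign;	/* events to assign */
	int unassign;	/* events to unassign */
};

/* Returns 0 on success, -1 on malformed input. */
int parse_domain_token(const char *tok, struct assign_req *req)
{
	const char *op = strpbrk(tok, "=+-");
	const char *f;
	char *end = NULL;
	int state = 0, none = 0;

	/* Need a domain id before the opcode and flags after it. */
	if (!op || op == tok || op[1] == '\0')
		return -1;

	if (tok[0] == '*' && op == tok + 1) {
		req->dom_id = -1;
	} else {
		req->dom_id = strtol(tok, &end, 10);
		if (end != op || req->dom_id < 0)
			return -1;
	}

	for (f = op + 1; *f; f++) {
		switch (*f) {
		case 't':
			state |= ASSIGN_TOTAL;
			break;
		case 'l':
			state |= ASSIGN_LOCAL;
			break;
		case '_':
			none = 1;
			break;
		default:
			return -1;
		}
	}

	/* '_' is valid only with '=' and cannot be mixed with t or l. */
	if (none && (state || *op != '='))
		return -1;

	req->op = *op;
	req->assign = 0;
	req->unassign = 0;

	switch (*op) {
	case '+':
		req->assign = state;
		break;
	case '-':
		req->unassign = state;
		break;
	case '=':
		/* Assign exactly the listed events, unassign the rest. */
		req->assign = state;
		req->unassign = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~state;
		break;
	}

	return 0;
}
```

For example, "0=t" yields an assignment of the total event and an
unassignment of the local event on domain 0, while "*=_" unassigns both
events on all domains.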

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v9: Fixed handling of the special cases '//0=' and '//'.
    Removed extra strstr() call.
    Added generic failure text when assignment operation fails.
    Corrected user documentation format texts.

v8: Moved unassign as the first action during the assign modification.
    Assign none "_" takes priority. Cannot be mixed with other flags.
    Updated the documentation and .rst file format. htmldoc looks ok.

v7: Simplified the parsing (strsep(&token, "//")) in rdtgroup_mbm_assign_control_write().
    Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.
    Renamed rdtgroup_find_grp to rdtgroup_find_grp_by_name.
    Fixed rdtgroup_str_to_mon_state to return error for invalid flags.
    Simplified the calls to rdtgroup_assign_cntr by merging a few functions earlier.
    Removed ABMC reference in FS code.
    Reinette commented about handling the combination of flags like 'lt_' and '_lt'.
    Not sure if we need to change the behaviour here. Processed them sequentially right now.
    Users have the liberty to pass the flags. Restricting it might be a problem later.

v6: Added support to assign all if domain id is '*'.
    Fixed the allocation of counter id if it is not assigned already.

v5: Interface name changed from mbm_assign_control to mbm_control.
    Fixed opcode and flags combination.
    "=_" is valid.
    "-_" and "+_" are not valid.
    Minor message update.
    Renamed the function with prefix - rdtgroup_.
    Corrected a few documentation mistakes.
    Rebase related changes after SNC support.

v4: Added domain specific assignments. Fixed the opcode parsing.

v3: New patch.
    Addresses the feedback to provide the global assignment interface.
    https://lore.kernel.org/lkml/c73f444b-83a1-4e9a-95d3-54c5165ee782@intel.com/
---
 Documentation/arch/x86/resctrl.rst     | 116 +++++++++++-
 arch/x86/kernel/cpu/resctrl/internal.h |  10 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 236 ++++++++++++++++++++++++-
 3 files changed, 360 insertions(+), 2 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 590727bec44b..d0a107d251ec 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -347,7 +347,8 @@ with the following files:
 	 t  MBM total event is assigned.
 	 l  MBM local event is assigned.
 	 tl Both MBM total and local events are assigned.
-	 _  None of the MBM events are assigned.
+	 _  None of the MBM events are assigned. Only works with opcode '=' for write
+	    and cannot be combined with other flags.
 
 	Examples:
 	::
@@ -365,6 +366,119 @@ with the following files:
 	There are four resctrl groups. All the groups have total and local MBM events
 	assigned on domain 0 and 1.
 
+	Assignment state can be updated by writing to the interface.
+
+	Format is similar to the list format with addition of opcode for the
+	assignment operation.
+
+		"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
+
+	Format for each type of groups:
+
+        * Default CTRL_MON group:
+                "//<domain_id><opcode><flags>"
+
+        * Non-default CTRL_MON group:
+                "<CTRL_MON group>//<domain_id><opcode><flags>"
+
+        * Child MON group of default CTRL_MON group:
+                "/<MON group>/<domain_id><opcode><flags>"
+
+        * Child MON group of non-default CTRL_MON group:
+                "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
+
+	Domain_id '*' will apply the flags on all the domains.
+
+	Opcode can be one of the following:
+	::
+
+	 = Update the assignment to match the MBM event.
+	 + Assign a new MBM event without impacting existing assignments.
+	 - Unassign a MBM event from currently assigned events.
+
+	Examples:
+	Initial group status:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl;
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
+	  //0=tl;1=tl;
+	  /child_default_mon_grp/0=tl;1=tl;
+
+	To update the default group to assign only total MBM event on domain 0:
+	::
+
+	  # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl;
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
+	  //0=t;1=tl;
+	  /child_default_mon_grp/0=tl;1=tl;
+
+	To update the MON group child_default_mon_grp to remove total MBM event on domain 1:
+	::
+
+	  # echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl;
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
+	  //0=t;1=tl;
+	  /child_default_mon_grp/0=tl;1=l;
+
+	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to unassign
+	both local and total MBM events on domain 1:
+	::
+
+	  # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
+			/sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl;
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
+	  //0=t;1=tl;
+	  /child_default_mon_grp/0=tl;1=l;
+
+	To update the default group to add a local MBM event domain 0.
+	::
+
+	  # echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=tl;1=tl;
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
+	  //0=tl;1=tl;
+	  /child_default_mon_grp/0=tl;1=l;
+
+	To update the non default CTRL_MON group non_default_ctrl_mon_grp to unassign all the
+	MBM events on all the domains.
+	::
+
+	  # echo "non_default_ctrl_mon_grp//*=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+	Assignment status after the update:
+	::
+
+	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	  non_default_ctrl_mon_grp//0=_;1=_;
+	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
+	  //0=tl;1=tl;
+	  /child_default_mon_grp/0=tl;1=l;
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index b90d8c90b4b6..3ccaea6a2803 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -74,6 +74,16 @@
  */
 #define MBM_EVENT_ARRAY_INDEX(_event) ((_event) - 2)
 
+/*
+ * Assignment flags for mbm_cntr_assign feature
+ */
+enum {
+	ASSIGN_NONE	= 0,
+	ASSIGN_TOTAL	= BIT(QOS_L3_MBM_TOTAL_EVENT_ID),
+	ASSIGN_LOCAL	= BIT(QOS_L3_MBM_LOCAL_EVENT_ID),
+	ASSIGN_INVALID,
+};
+
 /**
  * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
  *			        aren't marked nohz_full
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5cc40eacbe85..9fe419d0c536 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1082,6 +1082,239 @@ static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int rdtgroup_str_to_mon_state(char *flag)
+{
+	int i, mon_state = ASSIGN_NONE;
+
+	if (!strlen(flag))
+		return ASSIGN_INVALID;
+
+	for (i = 0; i < strlen(flag); i++) {
+		switch (*(flag + i)) {
+		case 't':
+			mon_state |= ASSIGN_TOTAL;
+			break;
+		case 'l':
+			mon_state |= ASSIGN_LOCAL;
+			break;
+		case '_':
+			return ASSIGN_NONE;
+		default:
+			return ASSIGN_INVALID;
+		}
+	}
+
+	return mon_state;
+}
+
+static struct rdtgroup *rdtgroup_find_grp_by_name(enum rdt_group_type rtype,
+						  char *p_grp, char *c_grp)
+{
+	struct rdtgroup *rdtg, *crg;
+
+	if (rtype == RDTCTRL_GROUP && *p_grp == '\0') {
+		return &rdtgroup_default;
+	} else if (rtype == RDTCTRL_GROUP) {
+		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list)
+			if (!strcmp(p_grp, rdtg->kn->name))
+				return rdtg;
+	} else if (rtype == RDTMON_GROUP) {
+		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+			if (!strcmp(p_grp, rdtg->kn->name)) {
+				list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
+						    mon.crdtgrp_list) {
+					if (!strcmp(c_grp, crg->kn->name))
+						return crg;
+				}
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static int rdtgroup_process_flags(struct rdt_resource *r,
+				  enum rdt_group_type rtype,
+				  char *p_grp, char *c_grp, char *tok)
+{
+	int op, mon_state, assign_state, unassign_state;
+	char *dom_str, *id_str, *op_str;
+	struct rdt_mon_domain *d;
+	struct rdtgroup *rdtgrp;
+	unsigned long dom_id;
+	int ret, found = 0;
+
+	rdtgrp = rdtgroup_find_grp_by_name(rtype, p_grp, c_grp);
+
+	if (!rdtgrp) {
+		rdt_last_cmd_puts("Not a valid resctrl group\n");
+		return -EINVAL;
+	}
+
+next:
+	if (!tok || tok[0] == '\0')
+		return 0;
+
+	/* Start processing the strings for each domain */
+	dom_str = strim(strsep(&tok, ";"));
+
+	op_str = strpbrk(dom_str, "=+-");
+
+	if (op_str) {
+		op = *op_str;
+	} else {
+		rdt_last_cmd_puts("Missing operation =, +, - character\n");
+		return -EINVAL;
+	}
+
+	id_str = strsep(&dom_str, "=+-");
+
+	/* Check for domain id '*' which means all domains */
+	if (id_str && *id_str == '*') {
+		d = NULL;
+		goto check_state;
+	} else if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
+		rdt_last_cmd_puts("Missing domain id\n");
+		return -EINVAL;
+	}
+
+	/* Verify if the dom_id is valid */
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found) {
+		rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
+		return -EINVAL;
+	}
+
+check_state:
+	mon_state = rdtgroup_str_to_mon_state(dom_str);
+
+	if (mon_state == ASSIGN_INVALID) {
+		rdt_last_cmd_puts("Invalid assign flag\n");
+		goto out_fail;
+	}
+
+	assign_state = 0;
+	unassign_state = 0;
+
+	switch (op) {
+	case '+':
+		if (mon_state == ASSIGN_NONE) {
+			rdt_last_cmd_puts("Invalid assign opcode\n");
+			goto out_fail;
+		}
+		assign_state = mon_state;
+		break;
+	case '-':
+		if (mon_state == ASSIGN_NONE) {
+			rdt_last_cmd_puts("Invalid assign opcode\n");
+			goto out_fail;
+		}
+		unassign_state = mon_state;
+		break;
+	case '=':
+		assign_state = mon_state;
+		unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
+		break;
+	default:
+		break;
+	}
+
+	if (unassign_state & ASSIGN_TOTAL) {
+		ret = rdtgroup_unassign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID);
+		if (ret)
+			goto out_fail;
+	}
+
+	if (unassign_state & ASSIGN_LOCAL) {
+		ret = rdtgroup_unassign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID);
+		if (ret)
+			goto out_fail;
+	}
+
+	if (assign_state & ASSIGN_TOTAL) {
+		ret = rdtgroup_assign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID);
+		if (ret)
+			goto out_fail;
+	}
+
+	if (assign_state & ASSIGN_LOCAL) {
+		ret = rdtgroup_assign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID);
+		if (ret)
+			goto out_fail;
+	}
+
+	goto next;
+
+out_fail:
+	rdt_last_cmd_printf("Assign operation '%c%s' failed on the group %s/%s/\n",
+			    op, dom_str, p_grp, c_grp);
+
+	return -EINVAL;
+}
+
+static ssize_t rdtgroup_mbm_assign_control_write(struct kernfs_open_file *of,
+						 char *buf, size_t nbytes, loff_t off)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	char *token, *cmon_grp, *mon_grp;
+	enum rdt_group_type rtype;
+	int ret;
+
+	/* Valid input requires a trailing newline */
+	if (nbytes == 0 || buf[nbytes - 1] != '\n')
+		return -EINVAL;
+
+	buf[nbytes - 1] = '\0';
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+
+	rdt_last_cmd_clear();
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
+		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
+		mutex_unlock(&rdtgroup_mutex);
+		cpus_read_unlock();
+		return -EINVAL;
+	}
+
+	while ((token = strsep(&buf, "\n")) != NULL) {
+		/*
+		 * The write command follows the following format:
+		 * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
+		 * Extract the CTRL_MON group.
+		 */
+		cmon_grp = strsep(&token, "/");
+
+		/*
+		 * Extract the MON_GROUP.
+		 * strsep returns empty string for contiguous delimiters.
+		 * Empty mon_grp here means it is a RDTCTRL_GROUP.
+		 */
+		mon_grp = strsep(&token, "/");
+
+		if (*mon_grp == '\0')
+			rtype = RDTCTRL_GROUP;
+		else
+			rtype = RDTMON_GROUP;
+
+		ret = rdtgroup_process_flags(r, rtype, cmon_grp, mon_grp, token);
+		if (ret)
+			break;
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+
+	return ret ?: nbytes;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -2383,9 +2616,10 @@ static struct rftype res_common_files[] = {
 	},
 	{
 		.name		= "mbm_assign_control",
-		.mode		= 0444,
+		.mode		= 0644,
 		.kf_ops		= &rdtgroup_kf_single_ops,
 		.seq_show	= rdtgroup_mbm_assign_control_show,
+		.write		= rdtgroup_mbm_assign_control_write,
 	},
 	{
 		.name		= "mbm_assign_mode",
-- 
2.34.1



* RE: [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-10-29 23:21 ` [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
@ 2024-10-29 23:54   ` Luck, Tony
  2024-10-30 14:14     ` Moger, Babu
  2024-11-16  0:44   ` Reinette Chatre
  1 sibling, 1 reply; 115+ messages in thread
From: Luck, Tony @ 2024-10-29 23:54 UTC (permalink / raw)
  To: Babu Moger, corbet@lwn.net, Chatre, Reinette, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: Yu, Fenghua, x86@kernel.org, hpa@zytor.com, thuth@redhat.com,
	paulmck@kernel.org, rostedt@goodmis.org,
	akpm@linux-foundation.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Joseph, Jithu, brijesh.singh@amd.com, Li, Xin3,
	ebiggers@google.com, andrew.cooper3@citrix.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, vikas.shivappa@linux.intel.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	peternewman@google.com, Wieczor-Retman, Maciej, Eranian, Stephane,
	jpoimboe@kernel.org, thomas.lendacky@amd.com

> +int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +                          enum resctrl_event_id evtid, u32 rmid, u32 closid,
> +                          u32 cntr_id, bool assign)
> +{
> +     struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> +     union l3_qos_abmc_cfg abmc_cfg = { 0 };
> +     struct arch_mbm_state *arch_mbm;
> +
> +     abmc_cfg.split.cfg_en = 1;
> +     abmc_cfg.split.cntr_en = assign ? 1 : 0;
> +     abmc_cfg.split.cntr_id = cntr_id;
> +     abmc_cfg.split.bw_src = rmid;
> +
> +     /* Update the event configuration from the domain */
> +     if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID) {
> +             abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
> +             arch_mbm = &hw_dom->arch_mbm_total[rmid];
> +     } else {
> +             abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
> +             arch_mbm = &hw_dom->arch_mbm_local[rmid];
> +     }
> +
> +     smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd,
> +                           &abmc_cfg, 1);
> +
> +     return 0;
> +}

Compiling with W=1:

warning: variable 'arch_mbm' set but not used [-Wunused-but-set-variable]

[still not used by patch 26]
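
One way to silence the warning is to drop the local that is never read.
A minimal user-space sketch of that shape, assuming simplified stand-in
types: struct mon_dom_sketch, struct abmc_cfg_sketch, the enum and
build_abmc_cfg() are hypothetical names for illustration and are not part
of the patch, and the real MSR bit layout is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types in the quoted function. */
enum event_id_sketch {
	MBM_TOTAL_EVENT_SKETCH,
	MBM_LOCAL_EVENT_SKETCH,
};

struct mon_dom_sketch {
	uint32_t mbm_total_cfg;
	uint32_t mbm_local_cfg;
};

/* Plain fields; the real register packs these into bit-fields. */
struct abmc_cfg_sketch {
	uint32_t cfg_en;
	uint32_t cntr_en;
	uint32_t cntr_id;
	uint32_t bw_src;
	uint32_t bw_type;
};

/*
 * Same flow as the quoted function, minus the arch_mbm pointer that was
 * assigned in both branches but never read.
 */
static struct abmc_cfg_sketch build_abmc_cfg(const struct mon_dom_sketch *dom,
					     enum event_id_sketch evtid,
					     uint32_t rmid, uint32_t cntr_id,
					     bool assign)
{
	struct abmc_cfg_sketch cfg = { 0 };

	cfg.cfg_en = 1;
	cfg.cntr_en = assign ? 1 : 0;
	cfg.cntr_id = cntr_id;
	cfg.bw_src = rmid;

	/* Pick the event configuration stored in the domain. */
	if (evtid == MBM_TOTAL_EVENT_SKETCH)
		cfg.bw_type = dom->mbm_total_cfg;
	else
		cfg.bw_type = dom->mbm_local_cfg;

	return cfg;
}
```

If a later patch does need the per-RMID state, the pointer could be
introduced in that patch instead.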

-Tony


* RE: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-10-29 23:21 ` [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters Babu Moger
@ 2024-10-29 23:57   ` Luck, Tony
  2024-10-30 14:15     ` Moger, Babu
  2024-11-04 14:14   ` Peter Newman
  2024-11-16  0:31   ` Reinette Chatre
  2 siblings, 1 reply; 115+ messages in thread
From: Luck, Tony @ 2024-10-29 23:57 UTC (permalink / raw)
  To: Babu Moger, corbet@lwn.net, Chatre, Reinette, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: Yu, Fenghua, x86@kernel.org, hpa@zytor.com, thuth@redhat.com,
	paulmck@kernel.org, rostedt@goodmis.org,
	akpm@linux-foundation.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Joseph, Jithu, brijesh.singh@amd.com, Li, Xin3,
	ebiggers@google.com, andrew.cooper3@citrix.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, vikas.shivappa@linux.intel.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	peternewman@google.com, Wieczor-Retman, Maciej, Eranian, Stephane,
	jpoimboe@kernel.org, thomas.lendacky@amd.com

> Provide the interface to display the number of free monitoring counters
> available for assignment in each doamin when mbm_cntr_assign is supported.

s/doamin/domain/

-Tony


* Re: RE: [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-10-29 23:54   ` Luck, Tony
@ 2024-10-30 14:14     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-10-30 14:14 UTC (permalink / raw)
  To: Luck, Tony, corbet@lwn.net, Chatre, Reinette, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: Yu, Fenghua, x86@kernel.org, hpa@zytor.com, thuth@redhat.com,
	paulmck@kernel.org, rostedt@goodmis.org,
	akpm@linux-foundation.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Joseph, Jithu, brijesh.singh@amd.com, Li, Xin3,
	ebiggers@google.com, andrew.cooper3@citrix.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, peternewman@google.com,
	Wieczor-Retman, Maciej, Eranian, Stephane, jpoimboe@kernel.org,
	thomas.lendacky@amd.com

Hi Tony,


On 10/29/24 18:54, Luck, Tony wrote:
>> +int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +                          enum resctrl_event_id evtid, u32 rmid, u32 closid,
>> +                          u32 cntr_id, bool assign)
>> +{
>> +     struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>> +     union l3_qos_abmc_cfg abmc_cfg = { 0 };
>> +     struct arch_mbm_state *arch_mbm;
>> +
>> +     abmc_cfg.split.cfg_en = 1;
>> +     abmc_cfg.split.cntr_en = assign ? 1 : 0;
>> +     abmc_cfg.split.cntr_id = cntr_id;
>> +     abmc_cfg.split.bw_src = rmid;
>> +
>> +     /* Update the event configuration from the domain */
>> +     if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID) {
>> +             abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
>> +             arch_mbm = &hw_dom->arch_mbm_total[rmid];
>> +     } else {
>> +             abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
>> +             arch_mbm = &hw_dom->arch_mbm_local[rmid];
>> +     }
>> +
>> +     smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd,
>> +                           &abmc_cfg, 1);
>> +
>> +     return 0;
>> +}
> 
> Compiling with W=1:
> 
> warning: variable 'arch_mbm' set but not used [-Wunused-but-set-variable]
> 
> [still not used by patch 26]

I knew I was going to miss something like this. Thanks. Will fix it.

-- 
Thanks
Babu Moger


* Re: RE: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-10-29 23:57   ` Luck, Tony
@ 2024-10-30 14:15     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-10-30 14:15 UTC (permalink / raw)
  To: Luck, Tony, corbet@lwn.net, Chatre, Reinette, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: Yu, Fenghua, x86@kernel.org, hpa@zytor.com, thuth@redhat.com,
	paulmck@kernel.org, rostedt@goodmis.org,
	akpm@linux-foundation.org, xiongwei.song@windriver.com,
	pawan.kumar.gupta@linux.intel.com, daniel.sneddon@linux.intel.com,
	perry.yuan@amd.com, sandipan.das@amd.com, Huang, Kai, Li, Xiaoyao,
	seanjc@google.com, Joseph, Jithu, brijesh.singh@amd.com, Li, Xin3,
	ebiggers@google.com, andrew.cooper3@citrix.com,
	mario.limonciello@amd.com, james.morse@arm.com,
	tan.shaopeng@fujitsu.com, vikas.shivappa@linux.intel.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	peternewman@google.com, Wieczor-Retman, Maciej, Eranian, Stephane,
	jpoimboe@kernel.org, thomas.lendacky@amd.com



On 10/29/24 18:57, Luck, Tony wrote:
>> Provide the interface to display the number of free monitoring counters
>> available for assignment in each doamin when mbm_cntr_assign is supported.
> 
> s/doamin/domain/
> 

Sure. Thanks

-- 
Thanks
Babu Moger


* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-10-29 23:21 ` [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters Babu Moger
  2024-10-29 23:57   ` Luck, Tony
@ 2024-11-04 14:14   ` Peter Newman
  2024-11-04 17:31     ` Moger, Babu
  2024-11-16  0:31   ` Reinette Chatre
  2 siblings, 1 reply; 115+ messages in thread
From: Peter Newman @ 2024-11-04 14:14 UTC (permalink / raw)
  To: Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, vikas.shivappa, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On Wed, Oct 30, 2024 at 12:24 AM Babu Moger <babu.moger@amd.com> wrote:
>
> Provide the interface to display the number of free monitoring counters
> available for assignment in each doamin when mbm_cntr_assign is supported.
>
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: New patch.
> ---
>  Documentation/arch/x86/resctrl.rst     |  4 ++++
>  arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>  3 files changed, 38 insertions(+)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 2f3a86278e84..2bc58d974934 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -302,6 +302,10 @@ with the following files:
>         memory bandwidth tracking to a single memory bandwidth event per
>         monitoring group.
>
> +"available_mbm_cntrs":
> +       The number of free monitoring counters available assignment in each domain
> +       when the architecture supports mbm_cntr_assign mode.

It seems you need to clarify that counters are only available to a
domain when they're available in all domains:

resctrl# for i in `seq 100`; do
> mkdir mon_groups/m${i}
> done
resctrl# cat info/L3_MON/available_mbm_cntrs
0=0;1=0;2=0;3=0;4=0;5=0;6=0;7=0;8=0;9=0;10=0;11=0;12=0;16=0;17=0;18=0;19=0;20=0;21=0;22=0;23=0;24=0;25=0;26=0;27=0;28=0

resctrl# cd info/L3_MON/
L3_MON# echo '/m1/0=_' > mbm_assign_control
L3_MON# cat available_mbm_cntrs
0=2;1=0;2=0;3=0;4=0;5=0;6=0;7=0;8=0;9=0;10=0;11=0;12=0;16=0;17=0;18=0;19=0;20=0;21=0;22=0;23=0;24=0;25=0;26=0;27=0;28=0
L3_MON# echo '/m16/0+t' > mbm_assign_control
-bash: echo: write error: Invalid argument
L3_MON# cat ../last_cmd_status
Out of MBM assignable counters
Assign operation '+t' failed on the group /m16/

L3_MON# rmdir ../../mon_groups/m1
L3_MON# cat available_mbm_cntrs
0=2;1=2;2=2;3=2;4=2;5=2;6=2;7=2;8=2;9=2;10=2;11=2;12=2;16=2;17=2;18=2;19=2;20=2;21=2;22=2;23=2;24=2;25=2;26=2;27=2;28=2
L3_MON# echo '/m16/0+t' > mbm_assign_control
L3_MON#
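
For readers following the session above, the strings written to
mbm_assign_control have the shape
"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>". A rough
user-space sketch of splitting one such command, assuming the opcodes are
'+', '-' and '=' and a single domain per command; parse_assign_cmd() and
struct assign_cmd_sketch are hypothetical names, not the kernel's parser:

```c
#include <stdbool.h>
#include <string.h>

struct assign_cmd_sketch {
	const char *ctrl_grp;	/* "" means the default CTRL_MON group */
	const char *mon_grp;	/* "" means the CTRL_MON group itself */
	const char *domain;	/* domain id as text, e.g. "0" or "*" */
	char op;		/* '+', '-' or '=' (assumed opcode set) */
	const char *flags;	/* e.g. "t", "l" or "_" */
};

/*
 * Split one command in place. Returns false on malformed input,
 * including input that lacks one of the two '/' separators.
 */
static bool parse_assign_cmd(char *line, struct assign_cmd_sketch *cmd)
{
	char *slash1, *slash2, *rest;
	size_t dom_len;

	slash1 = strchr(line, '/');
	if (!slash1)
		return false;
	slash2 = strchr(slash1 + 1, '/');
	if (!slash2)
		return false;

	/* Terminate the two group names where the separators were. */
	*slash1 = '\0';
	*slash2 = '\0';
	cmd->ctrl_grp = line;
	cmd->mon_grp = slash1 + 1;

	/* The domain id runs up to the first opcode character. */
	rest = slash2 + 1;
	dom_len = strcspn(rest, "+-=");
	if (dom_len == 0 || rest[dom_len] == '\0')
		return false;

	cmd->op = rest[dom_len];
	rest[dom_len] = '\0';
	cmd->domain = rest;
	cmd->flags = rest + dom_len + 1;

	/* At least one flag character is required. */
	return cmd->flags[0] != '\0';
}
```

With this sketch, '/m16/0+t' splits into an empty CTRL_MON group (the
default group), MON group "m16", domain "0", opcode '+' and flags "t".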


-Peter


* Re: [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter
  2024-10-29 23:21 ` [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter Babu Moger
@ 2024-11-04 14:16   ` Peter Newman
  2024-11-04 18:21     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Peter Newman @ 2024-11-04 14:16 UTC (permalink / raw)
  To: Babu Moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, vikas.shivappa, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On Wed, Oct 30, 2024 at 12:25 AM Babu Moger <babu.moger@amd.com> wrote:
>
> The mbm_cntr_assign mode provides a limited number of hardware counters
> that can be assigned to an RMID, event pair to monitor bandwidth while
> assigned. If all counters are in use, the kernel will show an error
> message: "Out of MBM assignable counters" when a new assignment is
> requested. To make space for a new assignment, users must unassign an
> already assigned counter.
>
> Introduce an interface that allows for the unassignment of counter IDs
> from both the group and the domain. Additionally, ensure that the global
> counter is released if it is no longer assigned to any domains.

This seems unnecessarily restrictive. What's wrong with monitoring
different groups in different domains?


-Peter


* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-04 14:14   ` Peter Newman
@ 2024-11-04 17:31     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-04 17:31 UTC (permalink / raw)
  To: Peter Newman
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Peter,

On 11/4/24 08:14, Peter Newman wrote:
> Hi Babu,
> 
> On Wed, Oct 30, 2024 at 12:24 AM Babu Moger <babu.moger@amd.com> wrote:
>>
>> Provide the interface to display the number of free monitoring counters
>> available for assignment in each doamin when mbm_cntr_assign is supported.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: New patch.
>> ---
>>  Documentation/arch/x86/resctrl.rst     |  4 ++++
>>  arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>  3 files changed, 38 insertions(+)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 2f3a86278e84..2bc58d974934 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -302,6 +302,10 @@ with the following files:
>>         memory bandwidth tracking to a single memory bandwidth event per
>>         monitoring group.
>>
>> +"available_mbm_cntrs":
>> +       The number of free monitoring counters available assignment in each domain
>> +       when the architecture supports mbm_cntr_assign mode.
> 
> It seems you need to clarify that counters are only available to a
> domain when they're available in all domains:

Yes. Makes sense.

> 
> resctrl# for i in `seq 100`; do
>> mkdir mon_groups/m${i}
>> done
> resctrl# cat info/L3_MON/available_mbm_cntrs
> 0=0;1=0;2=0;3=0;4=0;5=0;6=0;7=0;8=0;9=0;10=0;11=0;12=0;16=0;17=0;18=0;19=0;20=0;21=0;22=0;23=0;24=0;25=0;26=0;27=0;28=0
> 
> resctrl# cd info/L3_MON/
> L3_MON# echo '/m1/0=_' > mbm_assign_control
> L3_MON# cat available_mbm_cntrs
> 0=2;1=0;2=0;3=0;4=0;5=0;6=0;7=0;8=0;9=0;10=0;11=0;12=0;16=0;17=0;18=0;19=0;20=0;21=0;22=0;23=0;24=0;25=0;26=0;27=0;28=0
> L3_MON# echo '/m16/0+t' > mbm_assign_control
> -bash: echo: write error: Invalid argument
> L3_MON# cat ../last_cmd_status
> Out of MBM assignable counters
> Assign operation '+t' failed on the group /m16/
> 
> L3_MON# rmdir ../../mon_groups/m1
> L3_MON# cat available_mbm_cntrs
> 0=2;1=2;2=2;3=2;4=2;5=2;6=2;7=2;8=2;9=2;10=2;11=2;12=2;16=2;17=2;18=2;19=2;20=2;21=2;22=2;23=2;24=2;25=2;26=2;27=2;28=2
> L3_MON# echo '/m16/0+t' > mbm_assign_control
> L3_MON#
> 

Test case looks good to me. Thanks for trying it out.

-- 
Thanks
Babu Moger


* Re: [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter
  2024-11-04 14:16   ` Peter Newman
@ 2024-11-04 18:21     ` Moger, Babu
  2024-11-05 10:35       ` Peter Newman
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-04 18:21 UTC (permalink / raw)
  To: Peter Newman
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, vikas.shivappa, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Peter,

On 11/4/24 08:16, Peter Newman wrote:
> Hi Babu,
> 
> On Wed, Oct 30, 2024 at 12:25 AM Babu Moger <babu.moger@amd.com> wrote:
>>
>> The mbm_cntr_assign mode provides a limited number of hardware counters
>> that can be assigned to an RMID, event pair to monitor bandwidth while
>> assigned. If all counters are in use, the kernel will show an error
>> message: "Out of MBM assignable counters" when a new assignment is
>> requested. To make space for a new assignment, users must unassign an
>> already assigned counter.
>>
>> Introduce an interface that allows for the unassignment of counter IDs
>> from both the group and the domain. Additionally, ensure that the global
>> counter is released if it is no longer assigned to any domains.
> 
> This seems unnecessarily restrictive. What's wrong with monitoring
> different groups in different domains?

Yes. Users can monitor different groups in different domains, but they
will have to use a different global counter for each group.

Here is an example.

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
groupA//0=t;1=_;
groupB//0=_;1=l;

Group A - counter 0 (Assigned to total event in Domain 0)
Group B - counter 1 (Assigned to local event in Domain 1)

Two different counters are allocated here, leaving 30 of the 32 counters
free.

This is similar to the CLOSID management we follow in resctrl. It is not a
new restriction.
-- 
Thanks
Babu Moger


* Re: [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter
  2024-11-04 18:21     ` Moger, Babu
@ 2024-11-05 10:35       ` Peter Newman
  2024-11-05 19:58         ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Peter Newman @ 2024-11-05 10:35 UTC (permalink / raw)
  To: babu.moger
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, vikas.shivappa, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On Mon, Nov 4, 2024 at 7:21 PM Moger, Babu <babu.moger@amd.com> wrote:
>
> Hi Peter,
>
> On 11/4/24 08:16, Peter Newman wrote:
> > Hi Babu,
> >
> > On Wed, Oct 30, 2024 at 12:25 AM Babu Moger <babu.moger@amd.com> wrote:
> >>
> >> The mbm_cntr_assign mode provides a limited number of hardware counters
> >> that can be assigned to an RMID, event pair to monitor bandwidth while
> >> assigned. If all counters are in use, the kernel will show an error
> >> message: "Out of MBM assignable counters" when a new assignment is
> >> requested. To make space for a new assignment, users must unassign an
> >> already assigned counter.
> >>
> >> Introduce an interface that allows for the unassignment of counter IDs
> >> from both the group and the domain. Additionally, ensure that the global
> >> counter is released if it is no longer assigned to any domains.
> >
> > This seems unnecessarily restrictive. What's wrong with monitoring
> > different groups in different domains?
>
> Yes. User can monitor different groups in different domains. But, they
> will have to use different global counter for each group.

What is a global counter anyways? It sounds like an artifact of an
earlier revision. This concept does not sound intuitive to the user.

>
> Here is an example.
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> groupA//0=t;1=_;
> groupB//0=_;1=l;
>
> Group A - counter 0 (Assigned to total event in Domain 0)
> Group B - counter 1 (Assigned to local event in Domain 1)
>
> We allocate two different counters here.  Now we are left with 30 counters
> (max 32).
>
>
> This is similar to CLOSID management we follow in resctrl. This is not a
> new restriction,

It is a restriction in a new feature that resembles a restriction in
an existing feature.

I don't see what function the global allocator serves now that there
is already a per-domain allocator. My best guess is that it avoids the
case of an mbm_assign_control write that succeeds in some domains but
fails in others.

I admit I said earlier that I was only planning to allocate globally,
but now that I'm evaluating how to make resctrl's monitoring
functionality scale on large systems, I'm being forced to reconsider.

As long as this is only a limitation I can fix later, I don't see it
as an obstacle. There would just need to be better documentation of
what sort of internal data structures the user needs to visualize in
order to use this feature successfully.

-Peter


* Re: [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter
  2024-11-05 10:35       ` Peter Newman
@ 2024-11-05 19:58         ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-05 19:58 UTC (permalink / raw)
  To: Peter Newman
  Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Peter,

On 11/5/24 04:35, Peter Newman wrote:
> Hi Babu,
> 
> On Mon, Nov 4, 2024 at 7:21 PM Moger, Babu <babu.moger@amd.com> wrote:
>>
>> Hi Peter,
>>
>> On 11/4/24 08:16, Peter Newman wrote:
>>> Hi Babu,
>>>
>>> On Wed, Oct 30, 2024 at 12:25 AM Babu Moger <babu.moger@amd.com> wrote:
>>>>
>>>> The mbm_cntr_assign mode provides a limited number of hardware counters
>>>> that can be assigned to an RMID, event pair to monitor bandwidth while
>>>> assigned. If all counters are in use, the kernel will show an error
>>>> message: "Out of MBM assignable counters" when a new assignment is
>>>> requested. To make space for a new assignment, users must unassign an
>>>> already assigned counter.
>>>>
>>>> Introduce an interface that allows for the unassignment of counter IDs
>>>> from both the group and the domain. Additionally, ensure that the global
>>>> counter is released if it is no longer assigned to any domains.
>>>
>>> This seems unnecessarily restrictive. What's wrong with monitoring
>>> different groups in different domains?
>>
>> Yes. User can monitor different groups in different domains. But, they
>> will have to use different global counter for each group.
> 
> What is a global counter anyways? It sounds like an artifact of an
> earlier revision. This concept does not sound intuitive to the user.


# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
32

These are the global counters. There are 32 hardware counters in total.

Their allocation is tracked by the bitmap mbm_cntr_free_map.


> 
>>
>> Here is an example.
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> groupA//0=t;1=_;
>> groupB//0=_;1=l;
>>
>> Group A - counter 0 (Assigned to total event in Domain 0)
>> Group B - counter 1 (Assigned to local event in Domain 1)
>>
>> We allocate two different counters here.  Now we are left with 30 counters
>> (max 32).
>>
>>
>> This is similar to CLOSID management we follow in resctrl. This is not a
>> new restriction,
> 
> It is a restriction in a new feature that resembles a restriction in
> an existing feature.
> 
> I don't see what function the global allocator serves now that there
> is already a per-domain allocator. My best guess is that it avoids the
> case of an mbm_assign_control write that succeeds in some domains but
> fails in others.
> 
> I admit I said earlier that I was only planning to allocate globally,
> but now that I'm evaluating how to make resctrl's monitoring
> functionality scale on large systems, I'm being forced to reconsider.
> 
> As long as this is only a limitation I can fix later, I don't see it
> as an obstacle. There would just need to be better documentation of
> what sort of internal data structures the user needs to visualize in
> order to use this feature successfully.


We have 32 global counters in total, which means we can assign up to 32
events.

Assigning events requires sending an IPI to write the MSR
(MSR_IA32_L3_QOS_ABMC_CFG) on every domain affected.

So we wanted another bitmap to track the status of the assignment on each
domain. This is tracked by mbm_cntr_map; the bit is updated when we send
the IPI on that domain.

I don't consider this a limitation. It avoids sending unnecessary IPIs to
all the domains when a user wants to assign an event. I would say it is an
improvement.

We still have the option of applying the assignment to all the domains by
specifying "*" for the domain.
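
The two levels of tracking described above can be pictured with a small
user-space model. This is only a sketch under stated assumptions: it uses
"used" bitmaps rather than the free map, a fixed NUM_DOMAINS, and the
hypothetical names cntr_pool_sketch, cntr_alloc(), cntr_assign_domain(),
cntr_unassign_domain() and cntr_free_in_domain(); it is not the code in
the series.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CNTRS	32	/* matches the num_mbm_cntrs value shown above */
#define NUM_DOMAINS	4	/* arbitrary choice for the sketch */

struct cntr_pool_sketch {
	uint32_t global_used;		/* bit set: counter ID is allocated */
	uint32_t dom_used[NUM_DOMAINS];	/* bit set: configured in that domain */
};

/* Allocate a free global counter ID, or return -1 if all are in use. */
static int cntr_alloc(struct cntr_pool_sketch *p)
{
	for (int id = 0; id < NUM_CNTRS; id++) {
		if (!(p->global_used & (UINT32_C(1) << id))) {
			p->global_used |= UINT32_C(1) << id;
			return id;
		}
	}
	return -1;
}

/* Mark a counter as configured in one domain (where the IPI would go). */
static void cntr_assign_domain(struct cntr_pool_sketch *p, int id, int dom)
{
	p->dom_used[dom] |= UINT32_C(1) << id;
}

/*
 * Unassign a counter in one domain. The global ID is released only once
 * no domain uses it any more. Returns true if the global ID was released.
 */
static bool cntr_unassign_domain(struct cntr_pool_sketch *p, int id, int dom)
{
	p->dom_used[dom] &= ~(UINT32_C(1) << id);

	for (int d = 0; d < NUM_DOMAINS; d++) {
		if (p->dom_used[d] & (UINT32_C(1) << id))
			return false;
	}
	p->global_used &= ~(UINT32_C(1) << id);
	return true;
}

/* Per-domain free count, in the spirit of available_mbm_cntrs. */
static int cntr_free_in_domain(const struct cntr_pool_sketch *p, int dom)
{
	int used = 0;

	for (int id = 0; id < NUM_CNTRS; id++)
		used += (p->dom_used[dom] >> id) & 1;
	return NUM_CNTRS - used;
}
```

In this model a domain can report free slots while cntr_alloc() still
fails, because a counter ID stays allocated until every domain has
released it.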

-- 
Thanks
Babu Moger


* Re: [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init
  2024-10-29 23:21 ` [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init Babu Moger
@ 2024-11-15 23:21   ` Reinette Chatre
  2024-11-18 17:44     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-15 23:21 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

In the subject, please use () to indicate a function by writing resctrl_late_init().

On 10/29/24 4:21 PM, Babu Moger wrote:
> The function resctrl_late_init() has the __init attribute, but some

No need to say "The function" when using ().

> functions it calls do not. Add the __init attribute to all the functions

None of the functions changed are actually called by resctrl_late_init(). If this
is indeed the goal then I think cache_alloc_hsw_probe() was missed.

> to maintain consistency throughout the call sequence.
> 
> Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
> Fixes: def10853930a ("x86/intel_rdt: Add two new resources for L2 Code and Data Prioritization (CDP)")
> Fixes: bd334c86b5d7 ("x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()")
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: Moved the patch to the begining of the series.
>     Fixed all the call sequences. Added additional Fixed tags.
> 
> v8: New patch.
> ---
>  arch/x86/kernel/cpu/resctrl/core.c     | 8 ++++----
>  arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 4 ++--
>  3 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index b681c2e07dbf..f845d0590429 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -275,7 +275,7 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
>  	return true;
>  }
>  
> -static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
> +static __init void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
>  {
>  	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>  	union cpuid_0x10_1_eax eax;
> @@ -294,7 +294,7 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
>  	r->alloc_capable = true;
>  }
>  
> -static void rdt_get_cdp_config(int level)
> +static __init void rdt_get_cdp_config(int level)
>  {
>  	/*
>  	 * By default, CDP is disabled. CDP can be enabled by mount parameter
> @@ -304,12 +304,12 @@ static void rdt_get_cdp_config(int level)
>  	rdt_resources_all[level].r_resctrl.cdp_capable = true;
>  }
>  
> -static void rdt_get_cdp_l3_config(void)
> +static __init void rdt_get_cdp_l3_config(void)
>  {
>  	rdt_get_cdp_config(RDT_RESOURCE_L3);
>  }
>  
> -static void rdt_get_cdp_l2_config(void)
> +static __init void rdt_get_cdp_l2_config(void)
>  {
>  	rdt_get_cdp_config(RDT_RESOURCE_L2);
>  }
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 955999aecfca..16181b90159a 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -627,7 +627,7 @@ int closids_supported(void);
>  void closid_free(int closid);
>  int alloc_rmid(u32 closid);
>  void free_rmid(u32 closid, u32 rmid);
> -int rdt_get_mon_l3_config(struct rdt_resource *r);
> +int __init rdt_get_mon_l3_config(struct rdt_resource *r);
>  void __exit rdt_put_mon_l3_config(void);
>  bool __init rdt_cpu_has(int flag);
>  void mon_event_count(void *info);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 851b561850e0..17790f92ef51 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -983,7 +983,7 @@ void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_
>  		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
>  }
>  
> -static int dom_data_init(struct rdt_resource *r)
> +static __init int dom_data_init(struct rdt_resource *r)
>  {
>  	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>  	u32 num_closid = resctrl_arch_get_num_closid(r);
> @@ -1081,7 +1081,7 @@ static struct mon_evt mbm_local_event = {
>   * because as per the SDM the total and local memory bandwidth
>   * are enumerated as part of L3 monitoring.
>   */
> -static void l3_mon_evt_init(struct rdt_resource *r)
> +static void __init l3_mon_evt_init(struct rdt_resource *r)

This change follows a different order from the other changes in this patch. "Function prototypes"
in Documentation/process/coding-style.rst indicates the preferred order is storage class
before return type. I acknowledge that resctrl is not consistent in this regard but we can
work towards the preferred order while keeping this patch consistent?
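
To illustrate the two orders being compared, here is a small sketch that
builds outside the kernel. The empty __init definition is only a stand-in
(the real attribute places the function in an init section), and the
function names are hypothetical:

```c
/* Stand-in so the sketch compiles in user space. */
#ifndef __init
#define __init
#endif

/* Preferred: storage class, then the __init attribute, then return type. */
static __init int dom_data_init_sketch(void)
{
	return 0;
}

/* Also compiles, but places the attribute after the return type. */
static int __init l3_mon_evt_init_sketch(void)
{
	return 0;
}
```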

Reinette




* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-10-29 23:21 ` [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode Babu Moger
@ 2024-11-16  0:00   ` Reinette Chatre
  2024-11-18 19:04     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:00 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> Introduce the interface file "mbm_assign_mode" to list monitor modes
> supported.
> 
> The "mbm_cntr_assign" mode provides the option to assign a counter to
> an RMID, event pair and monitor the bandwidth as long as it is assigned.
> 
> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
> 
> The "default" mode is the existing monitoring mode that works without the
> explicit counter assignment, instead relying on dynamic counter assignment
> by hardware that may result in hardware not dedicating a counter resulting
> in monitoring data reads returning "Unavailable".
> 
> Provide an interface to display the monitor mode on the system.
> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> [mbm_cntr_assign]
> default
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: Updated user documentation based on comments.
> 
> v8: Commit message update.
> 
> v7: Updated the descriptions/commit log in resctrl.rst to generic text.
>     Thanks to James and Reinette.
>     Rename mbm_mode to mbm_assign_mode.
>     Introduced mutex lock in rdtgroup_mbm_mode_show().
> 
> v6: Added documentation for mbm_cntr_assign and legacy mode.
>     Moved mbm_mode fflags initialization to static initialization.
> 
> v5: Changed interface name to mbm_mode.
>     It will be always available even if ABMC feature is not supported.
>     Added description in resctrl.rst about ABMC mode.
>     Fixed display abmc and legacy consistently.
> 
> v4: Fixed the checks for legacy and abmc mode. Default is ABMC.
> 
> v3: New patch to display ABMC capability.
> ---
>  Documentation/arch/x86/resctrl.rst     | 33 ++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 31 ++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 30586728a4cd..a93d7980e25f 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -257,6 +257,39 @@ with the following files:
>  	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>  	    0=0x30;1=0x30;3=0x15;4=0x15
>  
> +"mbm_assign_mode":
> +	Reports the list of monitoring modes supported. The enclosed brackets
> +	indicate which mode is enabled.
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> +	  [mbm_cntr_assign]
> +	  default
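The bracket convention described above can be consumed programmatically. The following is a minimal userspace sketch, assuming the file contents have already been read into a string; the helper name sketch_enabled_mode() is hypothetical and not part of this series:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the name of the enabled mode (the entry
 * enclosed in brackets) from the contents of "mbm_assign_mode" into @out.
 * Returns 0 on success, -1 if no bracketed entry exists or @out is too small.
 */
static int sketch_enabled_mode(const char *contents, char *out, size_t outlen)
{
	const char *open, *close;
	size_t len;

	assert(contents && out);

	open = strchr(contents, '[');
	close = open ? strchr(open, ']') : NULL;
	if (!open || !close)
		return -1;

	/* Length of the text strictly between '[' and ']'. */
	len = (size_t)(close - open - 1);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}
```

With the sample output shown in the hunk, the helper would yield "mbm_cntr_assign"; input without a bracketed entry is reported as an error rather than guessed at.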
> +
> +	"mbm_cntr_assign":
> +
> +	In mbm_cntr_assign mode user-space is able to specify which of the
> +	events in CTRL_MON or MON groups should have a counter assigned using the
> +	"mbm_assign_control" file. The number of counters available is described
> +	in the "num_mbm_cntrs" file. Changing the mode may cause all counters on
> +	a resource to reset.
> +
> +	The mode is useful on platforms which support more CTRL_MON and MON
> +	groups than the hardware counters, meaning 'unassigned' events on CTRL_MON or

" than the hardware counters" -> " than hardware counters"?

> +	MON groups will report 'Unavailable' or count the traffic in an unpredictable
> +	way.

I think the above can be confusing to users. It mentions "*will* report Unavailable"
and then "*or* count the traffic in an unpredictable way". It is not possible for
a counter to report "Unavailable" while also reporting unpredictable data.

My concern is that there is no way for a user to know if the platform supports more
CTRL_MON and MON groups than hardware counters and the above seems to imply that counters
may be unreliable ... so how does a user know if counters are unreliable or not?

Can this be made specific to help users know if their platforms are impacted? From
what I know all AMD platforms are impacted so perhaps a straight-forward:

	"The mode is useful on AMD platforms which support more CTRL_MON and MON ..."

I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
to make the event data "more predictable" and then be concerned when the mode does
not exist.

As an alternative, is it possible to know the number of hardware counters on AMD systems
without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
users to know if their platform may be impacted by this type of "unpredictability" (by comparing
num_mbm_cntrs to num_rmids).

> +
> +	AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
> +	enable this mode by default so that counters remain assigned even when the
> +	corresponding RMID is not in use by any processor.
> +
> +	"default":
> +
> +	In default mode resctrl assumes there is a hardware counter for each
> +	event within every CTRL_MON and MON group. Reading mbm_total_bytes or
> +	mbm_local_bytes may report 'Unavailable' if there is no counter associated
> +	with that event.

If I understand correctly, on AMD platforms without ABMC the events only report
"Unavailable" if there is no counter assigned at the time of the query. If a counter
is unassigned and then reassigned then the event count will reset and the user
will get some data back but it may thus be unpredictable (to match earlier language).
Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
"unpredictable" event counts (not just "Unavailable") ... this gets complicated
because users should be steered to avoid "default" mode if mbm_assign_mode is
available, while not being made wary of using "default" mode on Intel where
mbm_assign_mode is not available.

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 09/26] x86/resctrl: Introduce interface to display number of monitoring counters
  2024-10-29 23:21 ` [PATCH v9 09/26] x86/resctrl: Introduce interface to display number of monitoring counters Babu Moger
@ 2024-11-16  0:06   ` Reinette Chatre
  2024-11-18 21:31     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:06 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> The mbm_cntr_assign mode provides an option to the user to assign a
> counter to an RMID, event pair and monitor the bandwidth as long as
> the counter is assigned. Number of assignments depend on number of
> monitoring counters available.
> 
> Provide the interface to display the number of monitoring counters
> supported. The interface file 'num_mbm_cntrs' is available when an
> architecture supports mbm_cntr_assign mode.
> 

As mentioned in the previous patch, do you think it may be possible to
have a value for num_mbm_cntrs for non-ABMC AMD systems? If that is
available and always exposed to user space (irrespective of
mbm_cntr_assign mode) then the benefits/risks of running in "default"
mode would be clear to user space.

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 10/26] x86/resctrl: Introduce bitmap mbm_cntr_free_map to track assignable counters
  2024-10-29 23:21 ` [PATCH v9 10/26] x86/resctrl: Introduce bitmap mbm_cntr_free_map to track assignable counters Babu Moger
@ 2024-11-16  0:11   ` Reinette Chatre
  0 siblings, 0 replies; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:11 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> Hardware provides a set of counters when mbm_assign_mode is supported.
> These counters are assigned to the MBM monitoring events of a CTRL_MON or
> MON group that needs to be tracked. The kernel must manage and track the
> available counters.
> 
> Introduce mbm_cntr_free_map bitmap to track available counters and set
> of routines to allocate and free the counters.
> 
> dom_data_init() requires mbm_cntr_assign state to initialize
> mbm_cntr_free_map bitmap. Move dom_data_init() after mbm_cntr_assign
> detection.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---


Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 12/26] x86/resctrl: Remove MSR reading of event configuration value
  2024-10-29 23:21 ` [PATCH v9 12/26] x86/resctrl: Remove MSR reading of event configuration value Babu Moger
@ 2024-11-16  0:24   ` Reinette Chatre
  2024-11-19 16:50     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:24 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -354,6 +354,10 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
>   */
>  void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
>  
> +void resctrl_arch_mon_event_config_set(void *info);

An architecture that may want to use this would need to know how to interpret
the info passed. For an API I thus expect the struct it points to to also
be available in this header file. 


> +u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
> +				      enum resctrl_event_id eventid);
> +
>  extern unsigned int resctrl_rmid_realloc_threshold;
>  extern unsigned int resctrl_rmid_realloc_limit;
>  

Reinette

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-10-29 23:21 ` [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters Babu Moger
  2024-10-29 23:57   ` Luck, Tony
  2024-11-04 14:14   ` Peter Newman
@ 2024-11-16  0:31   ` Reinette Chatre
  2024-11-19 19:20     ` Moger, Babu
  2 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:31 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> Provide the interface to display the number of free monitoring counters
> available for assignment in each domain when mbm_cntr_assign is supported.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: New patch.
> ---
>  Documentation/arch/x86/resctrl.rst     |  4 ++++
>  arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>  3 files changed, 38 insertions(+)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 2f3a86278e84..2bc58d974934 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -302,6 +302,10 @@ with the following files:
>  	memory bandwidth tracking to a single memory bandwidth event per
>  	monitoring group.
>  
> +"available_mbm_cntrs":
> +	The number of free monitoring counters available assignment in each domain

"The number of free monitoring counters available assignment" -> "The number of monitoring
counters available for assignment"?

(not taking into account how text may change after addressing Peter's feedback)

> +	when the architecture supports mbm_cntr_assign mode.
> +
>  "max_threshold_occupancy":
>  		Read/write file provides the largest value (in
>  		bytes) at which a previously used LLC_occupancy
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3996f7528b66..e8d38a963f39 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1268,6 +1268,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>  			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>  			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
>  			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
> +			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
>  		}
>  	}
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 654cdfee1b00..ef0c1246fa2a 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -898,6 +898,33 @@ static int rdtgroup_num_mbm_cntrs_show(struct kernfs_open_file *of,
>  	return 0;
>  }
>  
> +static int rdtgroup_available_mbm_cntrs_show(struct kernfs_open_file *of,
> +					     struct seq_file *s, void *v)
> +{
> +	struct rdt_resource *r = of->kn->parent->priv;
> +	struct rdt_mon_domain *dom;
> +	bool sep = false;
> +	u32 val;
> +
> +	cpus_read_lock();
> +	mutex_lock(&rdtgroup_mutex);
> +
> +	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> +		if (sep)
> +			seq_puts(s, ";");
> +
> +		val = r->mon.num_mbm_cntrs - hweight64(*dom->mbm_cntr_map);

This should probably be bitmap_weight() to address warnings like below that are
encountered by build testing with various configs (32bit in this case). 0day does
not seem to automatically pick up patches just based on submission but it sure will
when these are merged to tip so this needs a clean slate.

>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:916:32: warning: shift count >= width of type [-Wshift-count-overflow]
     916 |                 val = r->mon.num_mbm_cntrs - hweight64(*dom->mbm_cntr_map);
         |                                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/asm-generic/bitops/const_hweight.h:29:49: note: expanded from macro 'hweight64'
      29 | #define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
         |                                                 ^~~~~~~~~~~~~~~~~~~~
   include/asm-generic/bitops/const_hweight.h:21:76: note: expanded from macro '__const_hweight64'
      21 | #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
         |                                                                            ^  ~~
   include/asm-generic/bitops/const_hweight.h:20:49: note: expanded from macro '__const_hweight32'
      20 | #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
         |                                                 ^
   include/asm-generic/bitops/const_hweight.h:19:48: note: expanded from macro '__const_hweight16'
      19 | #define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8 ))
         |                                                ^
   include/asm-generic/bitops/const_hweight.h:10:9: note: expanded from macro '__const_hweight8'
      10 |          ((!!((w) & (1ULL << 0))) +     \
         |                ^
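To illustrate why a bitmap-aware helper avoids the warning: hweight64() shifts its argument by 32, which is not valid when the argument is a 32-bit unsigned long, whereas a bitmap helper walks the bitmap bit by bit within each word. The following is a minimal userspace sketch of that idea, assuming the per-domain map is an array of unsigned long; the sketch_* names are hypothetical stand-ins, not the kernel's bitmap_weight() implementation:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Count the set bits among the first @nbits bits of @map. No shift ever
 * reaches the width of unsigned long, so this is safe on 32-bit and
 * 64-bit builds alike.
 */
static unsigned int sketch_bitmap_weight(const unsigned long *map,
					 unsigned int nbits)
{
	unsigned int i, count = 0;

	assert(map);

	for (i = 0; i < nbits; i++) {
		unsigned long word = map[i / SKETCH_BITS_PER_WORD];

		if (word & (1UL << (i % SKETCH_BITS_PER_WORD)))
			count++;
	}

	return count;
}

/* Free counters in one domain: total counters minus the assigned ones. */
static unsigned int sketch_free_cntrs(const unsigned long *cntr_map,
				      unsigned int num_cntrs)
{
	return num_cntrs - sketch_bitmap_weight(cntr_map, num_cntrs);
}
```

In the kernel the equivalent computation would presumably pass the number of counters as the bit count to bitmap_weight() rather than dereferencing the first word of the map.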


Reinette



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 15/26] x86/resctrl: Add data structures and definitions for ABMC assignment
  2024-10-29 23:21 ` [PATCH v9 15/26] x86/resctrl: Add data structures and definitions for ABMC assignment Babu Moger
@ 2024-11-16  0:35   ` Reinette Chatre
  0 siblings, 0 replies; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:35 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> The ABMC feature provides an option to the user to assign a hardware
> counter to an RMID, event pair and monitor the bandwidth as long as the
> counter is assigned. The bandwidth events will be tracked by the hardware
> until the user changes the configuration. Each resctrl group can configure
> maximum two counters, one for total event and one for local event.
> 
> The ABMC feature implements an MSR L3_QOS_ABMC_CFG (C000_03FDh).
> Configuration is done by setting the counter id, bandwidth source (RMID)
> and bandwidth configuration supported by BMEC (Bandwidth Monitoring Event
> Configuration).
> 
> Attempts to read or write the MSR when ABMC is not enabled will result
> in a #GP(0) exception.
> 
> Introduce the data structures and definitions for MSR L3_QOS_ABMC_CFG
> (C000_03FDh):
> =========================================================================
> Bits 	Mnemonic	Description			Access Reset
> 							Type   Value
> =========================================================================
> 63 	CfgEn 		Configuration Enable 		R/W 	0
> 
> 62 	CtrEn 		Enable/disable counting		R/W 	0
> 
> 61:53 	– 		Reserved 			MBZ 	0
> 
> 52:48 	CtrID 		Counter Identifier		R/W	0
> 
> 47 	IsCOS		BwSrc field is a CLOSID		R/W	0
> 			(not an RMID)
> 
> 46:44 	–		Reserved			MBZ	0
> 
> 43:32	BwSrc		Bandwidth Source		R/W	0
> 			(RMID or CLOSID)
> 
> 31:0	BwType		Bandwidth configuration		R/W	0
> 			to track for this counter
> ==========================================================================
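The field layout in the table above can be sketched with plain shifts and masks. This is an illustrative userspace example only; the macro and function names are hypothetical and the patch itself may represent the register differently (for example with a union of bitfields):

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions and widths taken from the table above. */
#define SKETCH_ABMC_CFG_EN_SHIFT	63
#define SKETCH_ABMC_CTR_EN_SHIFT	62
#define SKETCH_ABMC_CTR_ID_SHIFT	48
#define SKETCH_ABMC_CTR_ID_MASK		0x1fULL		/* bits 52:48 */
#define SKETCH_ABMC_IS_COS_SHIFT	47
#define SKETCH_ABMC_BW_SRC_SHIFT	32
#define SKETCH_ABMC_BW_SRC_MASK		0xfffULL	/* bits 43:32 */
#define SKETCH_ABMC_BW_TYPE_MASK	0xffffffffULL	/* bits 31:0 */

/* Build a 64-bit register value; reserved bits are left as zero (MBZ). */
static uint64_t sketch_abmc_cfg_pack(int cfg_en, int ctr_en, unsigned int ctr_id,
				     int is_cos, unsigned int bw_src,
				     uint32_t bw_type)
{
	/* Reject values that would spill into neighbouring fields. */
	assert(ctr_id <= SKETCH_ABMC_CTR_ID_MASK);
	assert(bw_src <= SKETCH_ABMC_BW_SRC_MASK);

	return ((uint64_t)!!cfg_en << SKETCH_ABMC_CFG_EN_SHIFT) |
	       ((uint64_t)!!ctr_en << SKETCH_ABMC_CTR_EN_SHIFT) |
	       ((uint64_t)ctr_id << SKETCH_ABMC_CTR_ID_SHIFT) |
	       ((uint64_t)!!is_cos << SKETCH_ABMC_IS_COS_SHIFT) |
	       ((uint64_t)bw_src << SKETCH_ABMC_BW_SRC_SHIFT) |
	       (uint64_t)bw_type;
}

/* Extract the counter identifier (bits 52:48). */
static unsigned int sketch_abmc_cfg_ctr_id(uint64_t val)
{
	return (unsigned int)((val >> SKETCH_ABMC_CTR_ID_SHIFT) &
			      SKETCH_ABMC_CTR_ID_MASK);
}

/* Extract the bandwidth source (bits 43:32). */
static unsigned int sketch_abmc_cfg_bw_src(uint64_t val)
{
	return (unsigned int)((val >> SKETCH_ABMC_BW_SRC_SHIFT) &
			      SKETCH_ABMC_BW_SRC_MASK);
}
```

For example, enabling configuration and counting for counter 5 with RMID 0x123 and an event configuration of 0x30 would produce 0xC005012300000030 under this layout.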
> 
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 16/26] x86/resctrl: Introduce cntr_id in mongroup for assignments
  2024-10-29 23:21 ` [PATCH v9 16/26] x86/resctrl: Introduce cntr_id in mongroup for assignments Babu Moger
@ 2024-11-16  0:38   ` Reinette Chatre
  2024-11-19 20:02     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:38 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> mbm_cntr_assign feature provides an option to the user to assign a counter

nit for consistency: "mbm_cntr_assign feature" -> "mbm_cntr_assign mode"

> to an RMID, event pair and monitor the bandwidth as long as the counter is
> assigned. There can be two counters per monitor group, one for MBM total
> event and another for MBM local event.
> 
> Introduce cntr_id to manage the assignments.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-10-29 23:21 ` [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
  2024-10-29 23:54   ` Luck, Tony
@ 2024-11-16  0:44   ` Reinette Chatre
  2024-11-19 20:12     ` Moger, Babu
  1 sibling, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:44 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> The ABMC feature provides an option to the user to assign a hardware
> counter to an RMID, event pair and monitor the bandwidth as long as it is
> assigned. The assigned RMID will be tracked by the hardware until the user
> unassigns it manually.
> 
> Counters are configured by writing to L3_QOS_ABMC_CFG MSR and
> specifying the counter id, bandwidth source, and bandwidth types.

needs imperative tone

> 
> Provide the interface to assign the counter ids to RMID.
> 
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>     Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>     Monitoring (ABMC).
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <babu.moger@amd.com>

Reinette

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment
  2024-10-29 23:21 ` [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment Babu Moger
@ 2024-11-16  0:57   ` Reinette Chatre
  2024-11-20 18:05     ` Moger, Babu
  2024-12-04  4:16   ` Fenghua Yu
  1 sibling, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-16  0:57 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> The mbm_cntr_assign mode offers several hardware counters that can be
> assigned to an RMID, event pair and monitor the bandwidth as long as it
> is assigned.
> 
> Counters are managed at two levels. The global assignment is tracked
> using the mbm_cntr_free_map field in the struct resctrl_mon, while
> domain-specific assignments are tracked using the mbm_cntr_map field
> in the struct rdt_mon_domain. Allocation begins at the global level
> and is then applied individually to each domain.
> 
> Introduce an interface to allocate these counters and update the
> corresponding domains accordingly.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

...

> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 00f7bf60e16a..cb496bd97007 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -717,6 +717,8 @@ unsigned int mon_event_config_index_get(u32 evtid);
>  int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>  			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>  			     u32 cntr_id, bool assign);
> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
> +			       struct rdt_mon_domain *d, enum resctrl_event_id evtid);
>  void rdt_staged_configs_clear(void);
>  bool closid_allocated(unsigned int closid);
>  int resctrl_find_cleanest_closid(void);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 1b5529c212f5..bc3752967c44 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1924,6 +1924,93 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>  	return 0;
>  }
>  
> +/*
> + * Configure the counter for the event, RMID pair for the domain.
> + * Update the bitmap and reset the architectural state.
> + */
> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
> +			       u32 cntr_id, bool assign)
> +{
> +	int ret;
> +
> +	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
> +	if (ret)
> +		return ret;
> +
> +	if (assign)
> +		__set_bit(cntr_id, d->mbm_cntr_map);
> +	else
> +		__clear_bit(cntr_id, d->mbm_cntr_map);
> +
> +	/*
> +	 * Reset the architectural state so that reading of hardware
> +	 * counter is not considered as an overflow in next update.
> +	 */
> +	resctrl_arch_reset_rmid(r, d, closid, rmid, evtid);

resctrl_arch_reset_rmid() expects to be run on a CPU that is in the domain
@d ... note that after the architectural state is reset it initializes the
state by reading the event on the current CPU. By running it here it is
run on a random CPU that may not be in the right domain.

> +
> +	return ret;
> +}
> +
> +static bool mbm_cntr_assigned_to_domain(struct rdt_resource *r, u32 cntr_id)
> +{
> +	struct rdt_mon_domain *d;
> +
> +	list_for_each_entry(d, &r->mon_domains, hdr.list)
> +		if (test_bit(cntr_id, d->mbm_cntr_map))
> +			return 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * Assign a hardware counter to event @evtid of group @rdtgrp.
> + * Counter will be assigned to all the domains if rdt_mon_domain is NULL
> + * else the counter will be assigned to specific domain.
> + */
> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
> +			       struct rdt_mon_domain *d, enum resctrl_event_id evtid)
> +{
> +	int index = MBM_EVENT_ARRAY_INDEX(evtid);
> +	int cntr_id = rdtgrp->mon.cntr_id[index];
> +	int ret;
> +
> +	/*
> +	 * Allocate a new counter id to the event if the counter is not
> +	 * assigned already.
> +	 */
> +	if (cntr_id == MON_CNTR_UNSET) {
> +		cntr_id = mbm_cntr_alloc(r);
> +		if (cntr_id < 0) {
> +			rdt_last_cmd_puts("Out of MBM assignable counters\n");
> +			return -ENOSPC;
> +		}
> +		rdtgrp->mon.cntr_id[index] = cntr_id;
> +	}
> +
> +	if (!d) {
> +		list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +			ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
> +						  rdtgrp->closid, cntr_id, true);
> +			if (ret)
> +				goto out_done_assign;

This may not be what users expect. What if, for example, domain #1 has a counter
assigned to the "total" event and the user then wants to assign a counter to the
"total" event of all domains? Would this not reconfigure the counter associated
with domain #1 and unnecessarily reset it? Could this be made a bit smarter to
only configure a counter on a domain if it is not already configured? This could
perhaps form part of resctrl_config_cntr() so that the duplicate check is not
scattered everywhere. What do you think?
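A minimal userspace sketch of the "only configure if not already configured" idea follows. It is a simplified model rather than kernel code: struct sketch_domain, its hw_writes field and sketch_config_cntr() are hypothetical, and the model only tracks whether a counter bit is set in the domain, not which RMID or event the counter serves:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-domain counter state. */
struct sketch_domain {
	unsigned long cntr_map;	/* one bit per assigned counter */
	unsigned int hw_writes;	/* times the hardware would be reprogrammed */
};

/*
 * Assign or unassign @cntr_id in @d. If the domain is already in the
 * requested state, return early so the counter is neither reprogrammed
 * nor reset.
 */
static int sketch_config_cntr(struct sketch_domain *d, unsigned int cntr_id,
			      bool assign)
{
	unsigned long bit;
	bool assigned;

	assert(d && cntr_id < 32);

	bit = 1UL << cntr_id;
	assigned = (d->cntr_map & bit) != 0;

	/* Duplicate request: nothing to do, existing counts are preserved. */
	if (assigned == assign)
		return 0;

	/* This is where the real code would program the counter and reset state. */
	d->hw_writes++;

	if (assign)
		d->cntr_map |= bit;
	else
		d->cntr_map &= ~bit;

	return 0;
}
```

Placing the check inside the common helper means both the all-domains loop and the single-domain path would skip redundant work without each caller repeating the test.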

Also, it looks like this can do partial assignment. For example, if one of the
domains encounters a failure then the domains already configured are not undone. This
matches other similar flows but is not documented and is left to the reader to decipher.


> +		}
> +	} else {
> +		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
> +					  rdtgrp->closid, cntr_id, true);

Looking at the flows calling rdtgroup_assign_cntr_event() I do not see a check
for whether a counter is already assigned. So, if a user loops assigning a counter
to the same event over and over, it will result in an IPI every time. This seems
unnecessary, what do you think?

> +		if (ret)
> +			goto out_done_assign;
> +	}
> +
> +out_done_assign:
> +	if (ret && !mbm_cntr_assigned_to_domain(r, cntr_id)) {
> +		mbm_cntr_free(r, cntr_id);
> +		rdtgroup_cntr_id_init(rdtgrp, evtid);
> +	}
> +
> +	return ret;
> +}
> +
>  /* rdtgroup information files for one cache resource. */
>  static struct rftype res_common_files[] = {
>  	{

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-10-29 23:21 ` [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
@ 2024-11-18 17:18   ` Reinette Chatre
  2024-11-22  0:22     ` Moger, Babu
  2024-12-04  4:16   ` Fenghua Yu
  1 sibling, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-18 17:18 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> Assign/unassign counters on resctrl group creation/deletion. Two counters
> are required per group, one for MBM total event and one for MBM local
> event.
> 
> There are a limited number of counters available for assignment. If these
> counters are exhausted, the kernel will display the error message: "Out of
> MBM assignable counters". However, it is not necessary to fail the
> creation of a group due to assignment failures. Users have the flexibility
> to modify the assignments at a later time.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
>     Updated couple of rdtgroup_unassign_cntrs() calls properly.
>     Updated function comments.
> 
> v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
>     Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
>     Fixed the problem with unassigning the child MON groups of CTRL_MON group.
> 
> v7: Reworded the commit message.
>     Removed the reference of ABMC with mbm_cntr_assign.
>     Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.
> 
> v6: Removed the redundant comments on all the calls of
>     rdtgroup_assign_cntrs. Updated the commit message.
>     Dropped printing error message on every call of rdtgroup_assign_cntrs.
> 
> v5: Removed the code to enable/disable ABMC during the mount.
>     That will be another patch.
>     Added arch callers to get the arch specific data.
>     Renamed functions to match the other abmc function.
>     Added code comments for assignment failures.
> 
> v4: Few name changes based on the upstream discussion.
>     Commit message update.
> 
> v3: This is a new patch. Patch addresses the upstream comment to enable
>     ABMC feature by default if the feature is available.
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++++++-
>  1 file changed, 60 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b0cce3dfd062..a8d21b0b2054 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2932,6 +2932,46 @@ static void schemata_list_destroy(void)
>  	}
>  }
>  
> +/*
> + * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
> + * counters are automatically assigned. Each group can accommodate two counters:
> + * one for the total event and one for the local event. Assignments may fail
> + * due to the limited number of counters. However, it is not necessary to fail
> + * the group creation and thus no failure is returned. Users have the option
> + * to modify the counter assignments after the group has been created.
> + */
> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
> +		return;
> +
> +	if (is_mbm_total_enabled())
> +		rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);
> +
> +	if (is_mbm_local_enabled())
> +		rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
> +/*
> + * Called when a group is deleted. Counters are unassigned if it was in
> + * assigned state.
> + */
> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
> +		return;
> +
> +	if (is_mbm_total_enabled())
> +		rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);
> +
> +	if (is_mbm_local_enabled())
> +		rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
>  static int rdt_get_tree(struct fs_context *fc)
>  {
>  	struct rdt_fs_context *ctx = rdt_fc2context(fc);
> @@ -2991,6 +3031,8 @@ static int rdt_get_tree(struct fs_context *fc)
>  		if (ret < 0)
>  			goto out_mongrp;
>  		rdtgroup_default.mon.mon_data_kn = kn_mondata;
> +
> +		rdtgroup_assign_cntrs(&rdtgroup_default);

I think counters should be assigned *before* the files exposing them
are added to resctrl.

>  	}
>  
>  	ret = rdt_pseudo_lock_init();
> @@ -3021,8 +3063,10 @@ static int rdt_get_tree(struct fs_context *fc)
>  out_psl:
>  	rdt_pseudo_lock_release();
>  out_mondata:
> -	if (resctrl_arch_mon_capable())
> +	if (resctrl_arch_mon_capable()) {
> +		rdtgroup_unassign_cntrs(&rdtgroup_default);
>  		kernfs_remove(kn_mondata);

... and here remove the files before taking away the data exposed by them.

> +	}
>  out_mongrp:
>  	if (resctrl_arch_mon_capable())
>  		kernfs_remove(kn_mongrp);
> @@ -3201,6 +3245,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
>  
>  	head = &rdtgrp->mon.crdtgrp_list;
>  	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
> +		rdtgroup_unassign_cntrs(sentry);
>  		free_rmid(sentry->closid, sentry->mon.rmid);
>  		list_del(&sentry->mon.crdtgrp_list);
>  
> @@ -3241,6 +3286,8 @@ static void rmdir_all_sub(void)
>  		cpumask_or(&rdtgroup_default.cpu_mask,
>  			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>  
> +		rdtgroup_unassign_cntrs(rdtgrp);
> +
>  		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>  
>  		kernfs_remove(rdtgrp->kn);
> @@ -3272,6 +3319,7 @@ static void rdt_kill_sb(struct super_block *sb)
>  	for_each_alloc_capable_rdt_resource(r)
>  		reset_all_ctrls(r);
>  	rmdir_all_sub();
> +	rdtgroup_unassign_cntrs(&rdtgroup_default);
>  	rdt_pseudo_lock_release();
>  	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>  	schemata_list_destroy();
> @@ -3280,6 +3328,7 @@ static void rdt_kill_sb(struct super_block *sb)
>  		resctrl_arch_disable_alloc();
>  	if (resctrl_arch_mon_capable())
>  		resctrl_arch_disable_mon();
> +
>  	resctrl_mounted = false;
>  	kernfs_kill_sb(sb);
>  	mutex_unlock(&rdtgroup_mutex);

Unnecessary hunk.

> @@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
>  		goto out_unlock;
>  	}
>  
> +	rdtgroup_assign_cntrs(rdtgrp);
> +
>  	kernfs_activate(rdtgrp->kn);
>  
>  	/*
> @@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>  	if (ret)
>  		goto out_closid_free;
>  
> +	rdtgroup_assign_cntrs(rdtgrp);
> +
>  	kernfs_activate(rdtgrp->kn);
>  
>  	ret = rdtgroup_init_alloc(rdtgrp);

Please compare the above two hunks with the earlier patch "x86/resctrl: Introduce cntr_id in mongroup for assignments".
The earlier patch initializes the counters within mkdir_rdt_prepare_rmid_alloc(), while the above
hunks assign the counters after mkdir_rdt_prepare_rmid_alloc() is called. Could this fragmentation be avoided
with init done once within mkdir_rdt_prepare_rmid_alloc()?

> @@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>  out_del_list:
>  	list_del(&rdtgrp->rdtgroup_list);
>  out_rmid_free:
> +	rdtgroup_unassign_cntrs(rdtgrp);
>  	mkdir_rdt_prepare_rmid_free(rdtgrp);
>  out_closid_free:
>  	closid_free(closid);
> @@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>  	update_closid_rmid(tmpmask, NULL);
>  
>  	rdtgrp->flags = RDT_DELETED;
> +
> +	rdtgroup_unassign_cntrs(rdtgrp);
> +
>  	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>  
>  	/*
> @@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>  	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>  	update_closid_rmid(tmpmask, NULL);
>  
> +	rdtgroup_unassign_cntrs(rdtgrp);
> +
>  	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>  	closid_free(rdtgrp->closid);
>  

There is a potential problem here. rdtgroup_unassign_cntrs() attempts to remove the counter from
all domains associated with the resource group. This may fail in any of the domains, which results
in the counter not being marked as free in the global map and the counter not being reset in the
resource group ... but the resource group is removed right after calling rdtgroup_unassign_cntrs().
In this scenario there is thus a counter that is considered to be in use but not assigned to any
resource group.

From what I can tell there is a difference here between the default resource group and the others:
on remount of resctrl the default resource group will maintain knowledge of the counter that could
not be unassigned. This means that unmount/remount of resctrl does not provide a real "clean slate"
when it comes to counter assignment. Is this intended?
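
The leak described above can be illustrated with a small user-space model.
This is only a sketch under stated assumptions: the names (model_group,
model_assign(), model_unassign(), global_cntr_map) and the sizes are
hypothetical stand-ins, not the functions or data structures from the patch.
It only shows how a failed per-domain removal followed by group deletion can
leave a counter marked in use with no owner.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CNTRS   4    /* hypothetical number of assignable counters */
#define NUM_DOMAINS 2    /* hypothetical number of monitoring domains */
#define CNTR_UNSET  (-1)

/* Stand-in for the global "counter in use" map (one bit per counter). */
static uint32_t global_cntr_map;

/* Stand-in for a resource group that owns at most one counter. */
struct model_group {
	int cntr_id;
};

/* Allocate the lowest free counter, or return CNTR_UNSET if none is left. */
static int model_assign(struct model_group *grp)
{
	for (int id = 0; id < NUM_CNTRS; id++) {
		if (!(global_cntr_map & (1u << id))) {
			global_cntr_map |= 1u << id;
			grp->cntr_id = id;
			return id;
		}
	}
	return CNTR_UNSET;
}

/*
 * Remove the counter from every domain. fail_mask marks the domains in
 * which removal fails. The counter is released globally, and reset in the
 * group, only when removal succeeded in all domains.
 */
static void model_unassign(struct model_group *grp, unsigned int fail_mask)
{
	bool failed = false;

	if (grp->cntr_id == CNTR_UNSET)
		return;

	for (int dom = 0; dom < NUM_DOMAINS; dom++) {
		if (fail_mask & (1u << dom))
			failed = true;
	}

	if (failed)
		return;		/* counter stays marked as in use */

	global_cntr_map &= ~(1u << grp->cntr_id);
	grp->cntr_id = CNTR_UNSET;
}

/* Number of counters currently marked as in use in the global map. */
static int model_cntrs_in_use(void)
{
	int n = 0;

	for (int id = 0; id < NUM_CNTRS; id++)
		n += (global_cntr_map >> id) & 1;
	return n;
}
```

If a group is assigned a counter, model_unassign() is called with a non-zero
fail_mask, and the group is then discarded, model_cntrs_in_use() still
reports one counter even though no group references it.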

Reinette



* Re: [PATCH v9 21/26] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode
  2024-10-29 23:21 ` [PATCH v9 21/26] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode Babu Moger
@ 2024-11-18 17:39   ` Reinette Chatre
  2024-11-20 19:14     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-18 17:39 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> In mbm_cntr_assign mode, the hardware counter should be assigned to read
> the MBM events.
> 
> Report "Unassigned" in case the user attempts to read the events without
> assigning the counter.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: Used is_mbm_event() to check the event type.
>     Minor user documentation update.
> 
> v8: Used MBM_EVENT_ARRAY_INDEX to get the index for the MBM event.
>     Documentation update to make the text generic.
> 
> v7: Moved the documentation under "mon_data".
>     Updated the text little bit.
> 
> v6: Added more explanation in the resctrl.rst
>     Added checks to detect "Unassigned" before reading RMID.
> 
> v5: New patch.
> ---
>  Documentation/arch/x86/resctrl.rst        | 10 ++++++++++
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +++++++++++-
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 2bc58d974934..864fc004d646 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -430,6 +430,16 @@ When monitoring is enabled all MON groups will also contain:
>  	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
>  	where "YY" is the node number.
>  
> +	When supported the 'mbm_cntr_assign' mode allows users to assign a

Could you please go through the documentation changes and make all the quote
usage consistent with existing styles? For example, in this series "mbm_cntr_assign"
is used in doc in various ways ... within double quotes, within single quotes, as
well as without any quotes.

> +	counter to mon_hw_id, event pair enabling bandwidth monitoring for
> +	as long as the counter remains assigned. The hardware will continue
> +	tracking the assigned mon_hw_id until the user manually unassigns
> +	it, ensuring that counters are not reset during this period. With
> +	a limited number of counters, the system may run out of assignable
> +	counters. In that case, MBM event counters will return "Unassigned"

Please review style for all quote usage, for example, "Unassigned" above is
also not consistent.

> +	when the event is read. Users must manually assign a counter to read
> +	the events.
> +
Reinette


* Re: [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init
  2024-11-15 23:21   ` Reinette Chatre
@ 2024-11-18 17:44     ` Moger, Babu
  2024-11-18 22:07       ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-18 17:44 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/15/24 17:21, Reinette Chatre wrote:
> Hi Babu,
> 
> In subject please use () to indicate a function, writing resctrl_late_init()

Will change it to

x86/resctrl: Add __init attribute for all the call sequences in
resctrl_late_init()

> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> The function resctrl_late_init() has the __init attribute, but some
> 
> No need to say "The function" when using ().
> 
>> functions it calls do not. Add the __init attribute to all the functions
> 
> None of the functions changed are actually called by resctrl_late_init(). If this
> is indeed the goal then I think cache_alloc_hsw_probe() was missed.

Will change the function to:

static inline __init void cache_alloc_hsw_probe(void)

How about this description?

"resctrl_late_init() has the __init attribute, but some of the functions
in its call sequences do not.

Add the __init attribute to all the functions in the call sequences to
maintain consistency throughout."

> 
>> to maintain consistency throughout the call sequence.
>>
>> Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
>> Fixes: def10853930a ("x86/intel_rdt: Add two new resources for L2 Code and Data Prioritization (CDP)")
>> Fixes: bd334c86b5d7 ("x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()")
>> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: Moved the patch to the beginning of the series.
>>     Fixed all the call sequences. Added additional Fixes tags.
>>
>> v8: New patch.
>> ---
>>  arch/x86/kernel/cpu/resctrl/core.c     | 8 ++++----
>>  arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
>>  arch/x86/kernel/cpu/resctrl/monitor.c  | 4 ++--
>>  3 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index b681c2e07dbf..f845d0590429 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -275,7 +275,7 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
>>  	return true;
>>  }
>>  
>> -static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
>> +static __init void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
>>  {
>>  	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>>  	union cpuid_0x10_1_eax eax;
>> @@ -294,7 +294,7 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
>>  	r->alloc_capable = true;
>>  }
>>  
>> -static void rdt_get_cdp_config(int level)
>> +static __init void rdt_get_cdp_config(int level)
>>  {
>>  	/*
>>  	 * By default, CDP is disabled. CDP can be enabled by mount parameter
>> @@ -304,12 +304,12 @@ static void rdt_get_cdp_config(int level)
>>  	rdt_resources_all[level].r_resctrl.cdp_capable = true;
>>  }
>>  
>> -static void rdt_get_cdp_l3_config(void)
>> +static __init void rdt_get_cdp_l3_config(void)
>>  {
>>  	rdt_get_cdp_config(RDT_RESOURCE_L3);
>>  }
>>  
>> -static void rdt_get_cdp_l2_config(void)
>> +static __init void rdt_get_cdp_l2_config(void)
>>  {
>>  	rdt_get_cdp_config(RDT_RESOURCE_L2);
>>  }
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index 955999aecfca..16181b90159a 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -627,7 +627,7 @@ int closids_supported(void);
>>  void closid_free(int closid);
>>  int alloc_rmid(u32 closid);
>>  void free_rmid(u32 closid, u32 rmid);
>> -int rdt_get_mon_l3_config(struct rdt_resource *r);
>> +int __init rdt_get_mon_l3_config(struct rdt_resource *r);
>>  void __exit rdt_put_mon_l3_config(void);
>>  bool __init rdt_cpu_has(int flag);
>>  void mon_event_count(void *info);
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 851b561850e0..17790f92ef51 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -983,7 +983,7 @@ void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_
>>  		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
>>  }
>>  
>> -static int dom_data_init(struct rdt_resource *r)
>> +static __init int dom_data_init(struct rdt_resource *r)
>>  {
>>  	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>>  	u32 num_closid = resctrl_arch_get_num_closid(r);
>> @@ -1081,7 +1081,7 @@ static struct mon_evt mbm_local_event = {
>>   * because as per the SDM the total and local memory bandwidth
>>   * are enumerated as part of L3 monitoring.
>>   */
>> -static void l3_mon_evt_init(struct rdt_resource *r)
>> +static void __init l3_mon_evt_init(struct rdt_resource *r)
> 
> This change follows a different order from the other changes in this patch. "Function prototypes"
> in Documentation/process/coding-style.rst indicates the preferred order is storage class
> before return type. I acknowledge that resctrl is not consistent in this regard but we can
> work towards the preferred order while keeping this patch consistent?

Sure. Will change it to:

static __init void l3_mon_evt_init(struct rdt_resource *r)


-- 
Thanks
Babu Moger


* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-16  0:00   ` Reinette Chatre
@ 2024-11-18 19:04     ` Moger, Babu
  2024-11-18 22:07       ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-18 19:04 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/15/24 18:00, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>> supported.
>>
>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>
>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>
>> The "default" mode is the existing monitoring mode that works without the
>> explicit counter assignment, instead relying on dynamic counter assignment
>> by hardware that may result in hardware not dedicating a counter resulting
>> in monitoring data reads returning "Unavailable".
>>
>> Provide an interface to display the monitor mode on the system.
>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> [mbm_cntr_assign]
>> default
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: Updated user documentation based on comments.
>>
>> v8: Commit message update.
>>
>> v7: Updated the descriptions/commit log in resctrl.rst to generic text.
>>     Thanks to James and Reinette.
>>     Rename mbm_mode to mbm_assign_mode.
>>     Introduced mutex lock in rdtgroup_mbm_mode_show().
>>
>> v6: Added documentation for mbm_cntr_assign and legacy mode.
>>     Moved mbm_mode fflags initialization to static initialization.
>>
>> v5: Changed interface name to mbm_mode.
>>     It will be always available even if ABMC feature is not supported.
>>     Added description in resctrl.rst about ABMC mode.
>>     Fixed display abmc and legacy consistently.
>>
>> v4: Fixed the checks for legacy and abmc mode. Default is ABMC.
>>
>> v3: New patch to display ABMC capability.
>> ---
>>  Documentation/arch/x86/resctrl.rst     | 33 ++++++++++++++++++++++++++
>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 31 ++++++++++++++++++++++++
>>  2 files changed, 64 insertions(+)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 30586728a4cd..a93d7980e25f 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -257,6 +257,39 @@ with the following files:
>>  	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>>  	    0=0x30;1=0x30;3=0x15;4=0x15
>>  
>> +"mbm_assign_mode":
>> +	Reports the list of monitoring modes supported. The enclosed brackets
>> +	indicate which mode is enabled.
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> +	  [mbm_cntr_assign]
>> +	  default
>> +
>> +	"mbm_cntr_assign":
>> +
>> +	In mbm_cntr_assign mode user-space is able to specify which of the
>> +	events in CTRL_MON or MON groups should have a counter assigned using the
>> +	"mbm_assign_control" file. The number of counters available is described
>> +	in the "num_mbm_cntrs" file. Changing the mode may cause all counters on
>> +	a resource to reset.
>> +
>> +	The mode is useful on platforms which support more CTRL_MON and MON
>> +	groups than the hardware counters, meaning 'unassigned' events on CTRL_MON or
> 
> " than the hardware counters" -> " than hardware counters"?

Sure.

> 
>> +	MON groups will report 'Unavailable' or count the traffic in an unpredictable
>> +	way.
> 
> I think the above can be confusing to users. It mentioned "*will* report Unavailable"
> and then "*or* count the traffic in an unpredictable way". It is not possible for
> counter to report "Unavailable" while also reporting unpredictable data.
> 
> My concern is that there is no way for a user to know if the platform supports more
> CTRL_MON and MON groups than hardware counters and the above seems to imply that counters
> may be unreliable ... so how does a user know if counters are unreliable or not?

That is correct. There is no definite way to find out if the counters are
unreliable.

> 
> Can this be made specific to help users know if their platforms are impacted? From
> what I know all AMD platforms are impacted so perhaps a straight-forward:
> 
> 	"The mode is useful on AMD platforms which support more CTRL_MON and MON ..."

Sure.

> 
> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
> to make the event data "more predictable" and then be concerned when the mode does
> not exist.
> 
> As an alternative, is it possible to know the number of hardware counters on AMD systems
> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
> users to know if their platform may be impacted by this type of "unpredictability" (by comparing 
> num_mbm_cntrs to num_rmids).

There is a roundabout (or hacky) way to find out the number of RMIDs
that can be active.
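
The comparison suggested above (num_mbm_cntrs against num_rmids) could be
sketched in user space roughly as follows. This is an illustration only: the
helper names are hypothetical, the inputs are assumed to be the text read
from the two info files, and the plain "fewer counters than RMIDs" test is a
simplification that ignores how many events each group needs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse a non-negative decimal count as it would be read from a resctrl
 * info file (optionally terminated by a newline). Returns -1 if the text
 * is malformed.
 */
static long model_parse_count(const char *text)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(text, &end, 10);
	if (errno || end == text || val < 0)
		return -1;
	if (*end != '\0' && *end != '\n')
		return -1;
	return val;
}

/*
 * With fewer hardware counters than RMIDs, some monitor groups may not be
 * tracked at a given time. A negative num_mbm_cntrs means "unknown".
 */
static bool model_counters_oversubscribed(long num_rmids, long num_mbm_cntrs)
{
	return num_mbm_cntrs >= 0 && num_rmids > num_mbm_cntrs;
}
```

A tool could feed the contents of the two files into model_parse_count() and
warn the user when model_counters_oversubscribed() returns true.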

> 
>> +
>> +	AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>> +	enable this mode by default so that counters remain assigned even when the
>> +	corresponding RMID is not in use by any processor.
>> +
>> +	"default":
>> +
>> +	In default mode resctrl assumes there is a hardware counter for each
>> +	event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>> +	mbm_local_bytes may report 'Unavailable' if there is no counter associated
>> +	with that event.
> 
> If I understand correctly, on AMD platforms without ABMC the events only report
> "Unavailable" if there is no counter assigned at the time of the query. If a counter
> is unassigned and then reassigned then the event count will reset and the user
> will get some data back but it may thus be unpredictable (to match earlier language).
> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
> "unpredictable" event counts (not just "Unavailable") ... this gets complicated

Yes. All the AMD systems without ABMC are affected by this problem.

> because users should be steered to avoid "default" mode if mbm_assign_mode is
> available, while not be made concerned to use "default" mode on Intel where
> mbm_assign_mode is not available.

Can we add text to clarify this?

> 
> Reinette
> 
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v9 23/26] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2024-10-29 23:21 ` [PATCH v9 23/26] x86/resctrl: Configure mbm_cntr_assign mode if supported Babu Moger
@ 2024-11-18 19:23   ` Reinette Chatre
  2024-11-20 18:59     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-18 19:23 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> Configure mbm_cntr_assign on AMD. 'mbm_cntr_assign' mode in AMD is ABMC
> (Assignable Bandwidth Monitoring Counters). It is enabled by default when
> supported on the system.
> 
> When the ABMC is updated, it must be updated on all the logical processors
> in the resctrl domain.

This needs imperative tone.

Reinette


* Re: [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes
  2024-10-29 23:21 ` [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes Babu Moger
@ 2024-11-18 19:43   ` Reinette Chatre
  2024-11-21  2:14     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-18 19:43 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> Users can modify the configuration of assignable events. Whenever the
> event configuration is updated, MBM assignments must be revised across
> all monitor groups within the impacted domains.

Please revisit the "Changelog" section in 
Documentation/process/maintainer-tip.rst

> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: Again patch changed completely based on the comment.
>     https://lore.kernel.org/lkml/03b278b5-6c15-4d09-9ab7-3317e84a409e@intel.com/
>     Introduced resctrl_mon_event_config_set to handle IPI.
>     But sending another IPI inside IPI causes problem. Kernel reports SMP
>     warning. So, introduced resctrl_arch_update_cntr() to send the command directly.

I see ... the WARN is because there is a check whether IRQs are disabled before
the check whether the function can be run locally.

> 
> v8: Patch changed completely.
>     Updated the assignment on same IPI as the event is updated.
>     Could not do the way we discussed in the thread.
>     https://lore.kernel.org/lkml/f77737ac-d3f6-3e4b-3565-564f79c86ca8@amd.com/
>     Needed to figure out event type to update the configuration.
> 
> v7: New patch to update the assignments. Missed it earlier.
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 86 +++++++++++++++++++++++---
>  include/linux/resctrl.h                |  3 +-
>  2 files changed, 79 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5b8bb8bd913c..7646d67ea10e 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1710,6 +1710,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>  }
>  
>  struct mon_config_info {
> +	struct rdt_resource *r;
>  	struct rdt_mon_domain *d;
>  	u32 evtid;
>  	u32 mon_config;
> @@ -1735,26 +1736,28 @@ u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>  	return INVALID_CONFIG_VALUE;
>  }
>  
> -void resctrl_arch_mon_event_config_set(void *info)
> +void resctrl_arch_mon_event_config_set(struct rdt_mon_domain *d,
> +				       enum resctrl_event_id eventid, u32 val)
>  {
> -	struct mon_config_info *mon_info = info;
>  	struct rdt_hw_mon_domain *hw_dom;
>  	unsigned int index;
>  
> -	index = mon_event_config_index_get(mon_info->evtid);
> +	index = mon_event_config_index_get(eventid);
>  	if (index == INVALID_CONFIG_INDEX)
>  		return;
>  
> -	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
> +	wrmsr(MSR_IA32_EVT_CFG_BASE + index, val, 0);
>  
> -	hw_dom = resctrl_to_arch_mon_dom(mon_info->d);
> +	hw_dom = resctrl_to_arch_mon_dom(d);
>  
> -	switch (mon_info->evtid) {
> +	switch (eventid) {
>  	case QOS_L3_MBM_TOTAL_EVENT_ID:
> -		hw_dom->mbm_total_cfg = mon_info->mon_config;
> +		hw_dom->mbm_total_cfg = val;
>  		break;
>  	case QOS_L3_MBM_LOCAL_EVENT_ID:
> -		hw_dom->mbm_local_cfg = mon_info->mon_config;
> +		hw_dom->mbm_local_cfg = val;
> +		break;
> +	default:
>  		break;
>  	}
>  }
> @@ -1826,6 +1829,70 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
>  	return 0;
>  }
>  
> +static struct rdtgroup *rdtgroup_find_grp_by_cntr_id_index(int cntr_id, unsigned int index)
> +{
> +	struct rdtgroup *prgrp, *crgrp;
> +
> +	/* Check if the cntr_id is associated to the event type updated */
> +	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> +		if (prgrp->mon.cntr_id[index] == cntr_id)
> +			return prgrp;
> +
> +		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
> +			if (crgrp->mon.cntr_id[index] == cntr_id)
> +				return crgrp;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static void resctrl_arch_update_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +				     enum resctrl_event_id evtid, u32 rmid,
> +				     u32 closid, u32 cntr_id, u32 val)
> +{
> +	union l3_qos_abmc_cfg abmc_cfg = { 0 };
> +
> +	abmc_cfg.split.cfg_en = 1;
> +	abmc_cfg.split.cntr_en = 1;
> +	abmc_cfg.split.cntr_id = cntr_id;
> +	abmc_cfg.split.bw_src = rmid;
> +	abmc_cfg.split.bw_type = val;
> +
> +	wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg.full);

Is it necessary to create an almost duplicate function? What if instead
only resctrl_arch_config_cntr() exists and it uses a parameter to decide
whether to call resctrl_abmc_config_one_amd() directly or via
smp_call_function_any()? I think that should help to make clear how
the code flows.
Also note that this is an almost identical arch callback with no
error return. I expect that building on the existing resctrl_arch_config_cntr()
will make things easier to understand.
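
A minimal user-space sketch of the single-entry-point idea follows. All
names here (model_cntr_cfg, model_config_one(), model_call_on_domain_cpu(),
model_config_cntr()) are hypothetical stand-ins and not the kernel functions
mentioned above; the cross-CPU call is simulated by a plain function call.
The only point is that one function with a flag can cover both the direct
path and the cross-CPU path while still returning an error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-counter configuration payload. */
struct model_cntr_cfg {
	uint32_t cntr_id;
	uint32_t rmid;
	uint32_t evt_cfg;
	bool assign;
};

/* Records what the simulated hardware last saw and which path was taken. */
static struct model_cntr_cfg last_written;
static int direct_calls, cross_calls;

/* Stand-in for the routine that programs one counter on the local CPU. */
static void model_config_one(void *info)
{
	last_written = *(struct model_cntr_cfg *)info;
}

/* Stand-in for a cross-CPU call; in this model it just runs fn(). */
static void model_call_on_domain_cpu(void (*fn)(void *), void *info)
{
	cross_calls++;
	fn(info);
}

/*
 * Single entry point. 'local' tells whether the caller already runs on a
 * CPU of the target domain (for example inside an IPI handler) and may
 * program the counter directly, or whether a cross-CPU call is needed.
 */
static int model_config_cntr(struct model_cntr_cfg *cfg, bool local)
{
	if (!cfg)
		return -1;

	if (local) {
		direct_calls++;
		model_config_one(cfg);
	} else {
		model_call_on_domain_cpu(model_config_one, cfg);
	}
	return 0;
}
```

With this shape the IPI handler would pass local = true and every other
caller would pass local = false, so there is one place to read to follow the
flow.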

> +}
> +
> +static void resctrl_mon_event_config_set(void *info)
> +{
> +	struct mon_config_info *mon_info = info;
> +	struct rdt_mon_domain *d = mon_info->d;
> +	struct rdt_resource *r = mon_info->r;

Note that the local variable r is created here, yet the function is inconsistent,
switching between r and mon_info->r.

> +	struct rdtgroup *rdtgrp;
> +	unsigned int index;
> +	u32 cntr_id;
> +
> +	resctrl_arch_mon_event_config_set(d, mon_info->evtid, mon_info->mon_config);
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
> +		return;
> +
> +	index = mon_event_config_index_get(mon_info->evtid);

This is an AMD arch specific helper to know which offset of an MSR to use. It should
not be used directly in resctrl fs code, this is what MBM_EVENT_ARRAY_INDEX was created for.

Since MBM_EVENT_ARRAY_INDEX is a macro it can be called closer to where it is used,
within rdtgroup_find_grp_by_cntr_id_index(), which prompts reconsidering that function name.
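
As a rough user-space sketch of that suggestion, the lookup could take the
event ID and apply the index macro at the point of use. Everything below is
hypothetical: the MODEL_ names, the event ID values, and the macro
definition are assumptions made for illustration and are not copied from
the series.

```c
#include <assert.h>
#include <stddef.h>

/* Event IDs; the numeric values are assumed for this illustration. */
enum model_event_id {
	MODEL_L3_OCCUP_EVENT_ID     = 0x01,
	MODEL_L3_MBM_TOTAL_EVENT_ID = 0x02,
	MODEL_L3_MBM_LOCAL_EVENT_ID = 0x03,
};

/* Map an MBM event ID to its slot in a per-group array (assumed form). */
#define MODEL_MBM_EVENT_ARRAY_INDEX(evt) ((evt) - MODEL_L3_MBM_TOTAL_EVENT_ID)
#define MODEL_NUM_MBM_EVENTS 2
#define MODEL_CNTR_UNSET (-1)

/* Stand-in for a monitor group: one counter ID slot per MBM event. */
struct model_mon_group {
	int cntr_id[MODEL_NUM_MBM_EVENTS];
};

/*
 * Find the group that owns cntr_id for the given event. The caller passes
 * the event ID; the array index is derived here, next to its only use.
 * Returns NULL when no group owns the counter for that event.
 */
static struct model_mon_group *
model_find_grp_by_cntr(struct model_mon_group *grps, size_t n,
		       int cntr_id, enum model_event_id evt)
{
	for (size_t i = 0; i < n; i++) {
		if (grps[i].cntr_id[MODEL_MBM_EVENT_ARRAY_INDEX(evt)] == cntr_id)
			return &grps[i];
	}
	return NULL;
}
```

Because the helper now takes an event ID instead of an index, a name without
"index" in it fits better, which is the renaming question raised above.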

> +	if (index == INVALID_CONFIG_INDEX)
> +		return;
> +
> +	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
> +		if (test_bit(cntr_id, d->mbm_cntr_map)) {
> +			rdtgrp = rdtgroup_find_grp_by_cntr_id_index(cntr_id, index);
> +			if (rdtgrp)
> +				resctrl_arch_update_cntr(mon_info->r, d,
> +							 mon_info->evtid,
> +							 rdtgrp->mon.rmid,
> +							 rdtgrp->closid,
> +							 cntr_id,
> +							 mon_info->mon_config);
> +		}
> +	}
> +}

Could you please add some function comments to explain the flow here? For example,
what should the reader consider if there is no rdtgroup found?

Reinette


* Re: [PATCH v9 09/26] x86/resctrl: Introduce interface to display number of monitoring counters
  2024-11-16  0:06   ` Reinette Chatre
@ 2024-11-18 21:31     ` Moger, Babu
  2025-02-03 13:26       ` Peter Newman
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-18 21:31 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen,
	peternewman
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/15/24 18:06, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> The mbm_cntr_assign mode provides an option to the user to assign a
>> counter to an RMID, event pair and monitor the bandwidth as long as
>> the counter is assigned. Number of assignments depend on number of
>> monitoring counters available.
>>
>> Provide the interface to display the number of monitoring counters
>> supported. The interface file 'num_mbm_cntrs' is available when an
>> architecture supports mbm_cntr_assign mode.
>>
> 
> As mentioned in previous patch, do you think it may be possible to
> have a value for num_mbm_cntrs for non-ABMC AMD systems? If that is
> available and always exposed to user space (irrespective of
> mbm_cntr_assign mode) then it would be clear to user space on
> benefits/risks of running a "default" mode.

I am trying a work-around to get the maximum number of active RMIDs in
default mode. The method is to loop through all of the recently assigned
RMIDs to see if any of their QM_CTR.U bits transition from 0->1.

I have not been successful in getting it to work so far. I remember Peter was
trying this before in soft-ABMC. Peter, any success with that?

If it does not work then our best option is to document it.

-- 
Thanks
Babu Moger

* Re: [PATCH v9 26/26] x86/resctrl: Introduce interface to modify assignment states of the groups
  2024-10-29 23:21 ` [PATCH v9 26/26] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
@ 2024-11-18 21:51   ` Reinette Chatre
  2024-11-21 20:29     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-18 21:51 UTC (permalink / raw)
  To: Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 10/29/24 4:21 PM, Babu Moger wrote:
> Introduce the interface to assign MBM events in mbm_cntr_assign mode.
> 
> Events can be enabled or disabled by writing to file
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
> Format is similar to the list format with addition of opcode for the
> assignment operation.
>  "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> 
> Format for specific type of groups:
> 
>  * Default CTRL_MON group:
>          "//<domain_id><opcode><flags>"
> 
>  * Non-default CTRL_MON group:
>          "<CTRL_MON group>//<domain_id><opcode><flags>"
> 
>  * Child MON group of default CTRL_MON group:
>          "/<MON group>/<domain_id><opcode><flags>"
> 
>  * Child MON group of non-default CTRL_MON group:
>          "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> 
> Domain_id '*' will apply the flags on all the domains.
> 
> Opcode can be one of the following:
> 
>  = Update the assignment to match the flags
>  + Assign a new MBM event without impacting existing assignments.
>  - Unassign a MBM event from currently assigned events.
> 
> Assignment flags can be one of the following:
>  t  MBM total event
>  l  MBM local event
>  tl Both total and local MBM events
>  _  None of the MBM events. Valid only with '=' opcode. This flag cannot
>     be combined with other flags.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: Fixed handling special case '//0=' and '//".
>     Removed extra strstr() call.
>     Added generic failure text when assignment operation fails.
>     Corrected user documentation format texts.
> 
> v8: Moved unassign as the first action during the assign modification.
>     Assign none "_" takes priority. Cannot be mixed with other flags.
>     Updated the documentation and .rst file format. htmldoc looks ok.
> 
> v7: Simplified the parsing (strsep(&token, "//") in rdtgroup_mbm_assign_control_write().
>     Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.
>     Renamed rdtgroup_find_grp to rdtgroup_find_grp_by_name.
>     Fixed rdtgroup_str_to_mon_state to return error for invalid flags.
>     Simplified the calls rdtgroup_assign_cntr by merging few functions earlier.
>     Removed ABMC reference in FS code.
>     Reinette commented about handling the combination of flags like 'lt_' and '_lt'.
>     Not sure if we need to change the behaviour here. Processed them sequencially right now.
>     Users have the liberty to pass the flags. Restricting it might be a problem later.
> 
> v6: Added support assign all if domain id is '*'
>     Fixed the allocation of counter id if it not assigned already.
> 
> v5: Interface name changed from mbm_assign_control to mbm_control.
>     Fixed opcode and flags combination.
>     '=_" is valid.
>     "-_" amd "+_" is not valid.
>     Minor message update.
>     Renamed the function with prefix - rdtgroup_.
>     Corrected few documentation mistakes.
>     Rebase related changes after SNC support.
> 
> v4: Added domain specific assignments. Fixed the opcode parsing.
> 
> v3: New patch.
>     Addresses the feedback to provide the global assignment interface.
>     https://lore.kernel.org/lkml/c73f444b-83a1-4e9a-95d3-54c5165ee782@intel.com/
> ---
>  Documentation/arch/x86/resctrl.rst     | 116 +++++++++++-
>  arch/x86/kernel/cpu/resctrl/internal.h |  10 ++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 236 ++++++++++++++++++++++++-
>  3 files changed, 360 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 590727bec44b..d0a107d251ec 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -347,7 +347,8 @@ with the following files:
>  	 t  MBM total event is assigned.
>  	 l  MBM local event is assigned.
>  	 tl Both MBM total and local events are assigned.
> -	 _  None of the MBM events are assigned.
> +	 _  None of the MBM events are assigned. Only works with opcode '=' for write
> +	    and cannot be combined with other flags.
>  
>  	Examples:
>  	::
> @@ -365,6 +366,119 @@ with the following files:
>  	There are four resctrl groups. All the groups have total and local MBM events
>  	assigned on domain 0 and 1.
>  
> +	Assignment state can be updated by writing to the interface.

This is already a bit far from the original definition so it may help to be specific about what is
meant by "the interface". For example,

	Assignment state can be updated by writing to "mbm_assign_control".

> +
> +	Format is similar to the list format with addition of opcode for the
> +	assignment operation.
> +
> +		"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> +
> +	Format for each type of groups:

"Format for each type of group"  or "Format of each type of group"?

> +
> +        * Default CTRL_MON group:
> +                "//<domain_id><opcode><flags>"
> +
> +        * Non-default CTRL_MON group:
> +                "<CTRL_MON group>//<domain_id><opcode><flags>"
> +
> +        * Child MON group of default CTRL_MON group:
> +                "/<MON group>/<domain_id><opcode><flags>"
> +
> +        * Child MON group of non-default CTRL_MON group:
> +                "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> +
> +	Domain_id '*' will apply the flags on all the domains.

"apply the flags on all the domains" -> "apply the flags to all the domains"?

> +
> +	Opcode can be one of the following:
> +	::
> +
> +	 = Update the assignment to match the MBM event.
> +	 + Assign a new MBM event without impacting existing assignments.
> +	 - Unassign a MBM event from currently assigned events.
> +
> +	Examples:
> +	Initial group status:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> +	  //0=tl;1=tl;
> +	  /child_default_mon_grp/0=tl;1=tl;
> +
> +	To update the default group to assign only total MBM event on domain 0:
> +	::
> +
> +	  # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> +	Assignment status after the update:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> +	  //0=t;1=tl;
> +	  /child_default_mon_grp/0=tl;1=tl;
> +
> +	To update the MON group child_default_mon_grp to remove total MBM event on domain 1:
> +	::
> +
> +	  # echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> +	Assignment status after the update:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> +	  //0=t;1=tl;
> +	  /child_default_mon_grp/0=tl;1=l;
> +
> +	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to unassign
> +	both local and total MBM events on domain 1:
> +	::
> +
> +	  # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
> +			/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> +	Assignment status after the update:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
> +	  //0=t;1=tl;
> +	  /child_default_mon_grp/0=tl;1=l;
> +
> +	To update the default group to add a local MBM event domain 0.

"." -> ":"

> +	::
> +
> +	  # echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> +	Assignment status after the update:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
> +	  //0=tl;1=tl;
> +	  /child_default_mon_grp/0=tl;1=l;
> +
> +	To update the non default CTRL_MON group non_default_ctrl_mon_grp to unassign all the
> +	MBM events on all the domains.

"." -> ":"

> +	::
> +
> +	  # echo "non_default_ctrl_mon_grp//*=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> +	Assignment status after the update:
> +	::
> +
> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +	  non_default_ctrl_mon_grp//0=_;1=_;
> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
> +	  //0=tl;1=tl;
> +	  /child_default_mon_grp/0=tl;1=l;
> +
>  "max_threshold_occupancy":
>  		Read/write file provides the largest value (in
>  		bytes) at which a previously used LLC_occupancy
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index b90d8c90b4b6..3ccaea6a2803 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -74,6 +74,16 @@
>   */
>  #define MBM_EVENT_ARRAY_INDEX(_event) ((_event) - 2)
>  
> +/*
> + * Assignment flags for mbm_cntr_assign feature
> + */

"mbm_cntr_assign feature" -> "mbm_cntr_assign mode"?

> +enum {
> +	ASSIGN_NONE	= 0,
> +	ASSIGN_TOTAL	= BIT(QOS_L3_MBM_TOTAL_EVENT_ID),
> +	ASSIGN_LOCAL	= BIT(QOS_L3_MBM_LOCAL_EVENT_ID),
> +	ASSIGN_INVALID,
> +};
> +
>  /**
>   * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
>   *			        aren't marked nohz_full
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5cc40eacbe85..9fe419d0c536 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1082,6 +1082,239 @@ static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
>  	return 0;
>  }
>  
> +static int rdtgroup_str_to_mon_state(char *flag)

It seems strange to me that a variable used to contain flag bits
is of type int. Why is it not unsigned?
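
To make the suggestion concrete, below is a minimal user-space sketch of the
same flag parsing returning an unsigned value. It is not the resctrl code: the
SK_* names and their bit values are made up for illustration, and the invalid
marker is assumed to be UINT_MAX so that it cannot collide with any valid
combination of flag bits.

```c
#include <limits.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define SK_ASSIGN_NONE    0u
#define SK_ASSIGN_TOTAL   (1u << 2)
#define SK_ASSIGN_LOCAL   (1u << 3)
/* Assumed out-of-band marker; never equal to a valid flag combination. */
#define SK_ASSIGN_INVALID UINT_MAX

/* Convert a flag string such as "t", "l", "tl" or "_" into flag bits. */
static unsigned int sk_str_to_mon_state(const char *flag)
{
	unsigned int state = SK_ASSIGN_NONE;

	/* An empty flag string is rejected, as in the patch. */
	if (*flag == '\0')
		return SK_ASSIGN_INVALID;

	for (; *flag; flag++) {
		switch (*flag) {
		case 't':
			state |= SK_ASSIGN_TOTAL;
			break;
		case 'l':
			state |= SK_ASSIGN_LOCAL;
			break;
		case '_':
			/* As in the patch, '_' wins over any other flag. */
			return SK_ASSIGN_NONE;
		default:
			return SK_ASSIGN_INVALID;
		}
	}

	return state;
}
```

With these assumed values "tl" yields both bits set, while "_" and "t_" both
yield SK_ASSIGN_NONE.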

> +{
> +	int i, mon_state = ASSIGN_NONE;
> +
> +	if (!strlen(flag))
> +		return ASSIGN_INVALID;
> +
> +	for (i = 0; i < strlen(flag); i++) {
> +		switch (*(flag + i)) {
> +		case 't':
> +			mon_state |= ASSIGN_TOTAL;
> +			break;
> +		case 'l':
> +			mon_state |= ASSIGN_LOCAL;
> +			break;
> +		case '_':
> +			return ASSIGN_NONE;
> +		default:
> +			return ASSIGN_INVALID;
> +		}
> +	}
> +
> +	return mon_state;
> +}
> +
> +static struct rdtgroup *rdtgroup_find_grp_by_name(enum rdt_group_type rtype,
> +						  char *p_grp, char *c_grp)
> +{
> +	struct rdtgroup *rdtg, *crg;
> +
> +	if (rtype == RDTCTRL_GROUP && *p_grp == '\0') {
> +		return &rdtgroup_default;
> +	} else if (rtype == RDTCTRL_GROUP) {
> +		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list)
> +			if (!strcmp(p_grp, rdtg->kn->name))
> +				return rdtg;
> +	} else if (rtype == RDTMON_GROUP) {
> +		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
> +			if (!strcmp(p_grp, rdtg->kn->name)) {
> +				list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
> +						    mon.crdtgrp_list) {
> +					if (!strcmp(c_grp, crg->kn->name))
> +						return crg;
> +				}
> +			}
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static int rdtgroup_process_flags(struct rdt_resource *r,
> +				  enum rdt_group_type rtype,
> +				  char *p_grp, char *c_grp, char *tok)
> +{
> +	int op, mon_state, assign_state, unassign_state;

Same comment about type ... these *_state variables are used to contain
bits representing the flags of the various states. An unsigned variable
seems more appropriate.
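
As a rough illustration only, the opcode handling could be expressed with
unsigned masks along the lines below. This is a user-space sketch, not a
proposed implementation: struct sk_masks, sk_op_to_masks() and the SK_* bit
values are hypothetical names, and rejecting an unknown opcode with -1 is a
choice made for the sketch (the patch falls through in its default case).

```c
/* Hypothetical flag bits, chosen only for this sketch. */
#define SK_NONE  0u
#define SK_TOTAL (1u << 2)
#define SK_LOCAL (1u << 3)

/* Events to assign and to unassign for one domain, as bit masks. */
struct sk_masks {
	unsigned int assign;
	unsigned int unassign;
};

/*
 * Translate an opcode and the parsed flags into assign/unassign masks.
 * Returns 0 on success and -1 if the combination is not allowed.
 */
static int sk_op_to_masks(char op, unsigned int state, struct sk_masks *m)
{
	m->assign = 0;
	m->unassign = 0;

	switch (op) {
	case '+':
		/* "+_" is not a valid request. */
		if (state == SK_NONE)
			return -1;
		m->assign = state;
		return 0;
	case '-':
		/* "-_" is not a valid request. */
		if (state == SK_NONE)
			return -1;
		m->unassign = state;
		return 0;
	case '=':
		/* Assign what was requested, unassign everything else. */
		m->assign = state;
		m->unassign = (SK_TOTAL | SK_LOCAL) & ~state;
		return 0;
	default:
		return -1;
	}
}
```

Keeping all three values unsigned avoids mixing a signed type with the
bitwise complement used in the '=' case.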

> +	char *dom_str, *id_str, *op_str;
> +	struct rdt_mon_domain *d;
> +	struct rdtgroup *rdtgrp;
> +	unsigned long dom_id;
> +	int ret, found = 0;

Could found be boolean?

> +
> +	rdtgrp = rdtgroup_find_grp_by_name(rtype, p_grp, c_grp);
> +
> +	if (!rdtgrp) {
> +		rdt_last_cmd_puts("Not a valid resctrl group\n");
> +		return -EINVAL;
> +	}
> +
> +next:
> +	if (!tok || tok[0] == '\0')
> +		return 0;
> +
> +	/* Start processing the strings for each domain */
> +	dom_str = strim(strsep(&tok, ";"));
> +
> +	op_str = strpbrk(dom_str, "=+-");
> +
> +	if (op_str) {
> +		op = *op_str;
> +	} else {
> +		rdt_last_cmd_puts("Missing operation =, +, - character\n");
> +		return -EINVAL;
> +	}
> +
> +	id_str = strsep(&dom_str, "=+-");
> +
> +	/* Check for domain id '*' which means all domains */
> +	if (id_str && *id_str == '*') {
> +		d = NULL;
> +		goto check_state;
> +	} else if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
> +		rdt_last_cmd_puts("Missing domain id\n");
> +		return -EINVAL;
> +	}
> +
> +	/* Verify if the dom_id is valid */
> +	list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +		if (d->hdr.id == dom_id) {
> +			found = 1;
> +			break;
> +		}
> +	}
> +
> +	if (!found) {
> +		rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
> +		return -EINVAL;
> +	}

I am missing how "found" is handled on second iteration. If an invalid domain
follows a valid domain it seems like "found" remains set from previous iteration?
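
To show what I mean, here is a small user-space sketch (not the resctrl code;
valid_ids and the sk_* helpers are made-up names) comparing a "found" that is
initialised once with a "found" that is reset for every requested domain:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the list of valid monitoring domain ids. */
static const unsigned long valid_ids[] = { 0, 1 };

static bool sk_domain_exists(unsigned long id)
{
	for (size_t i = 0; i < sizeof(valid_ids) / sizeof(valid_ids[0]); i++)
		if (valid_ids[i] == id)
			return true;
	return false;
}

/* Mirrors the patch: "found" is initialised once, before the loop. */
static bool sk_check_stale(const unsigned long *ids, size_t n)
{
	bool found = false;

	for (size_t i = 0; i < n; i++) {
		if (sk_domain_exists(ids[i]))
			found = true;
		/* Still true from an earlier valid id, so nothing is caught. */
		if (!found)
			return false;
	}
	return true;
}

/* "found" is re-evaluated for every requested domain id. */
static bool sk_check_reset(const unsigned long *ids, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		bool found = sk_domain_exists(ids[i]);

		if (!found)
			return false;
	}
	return true;
}
```

For the request { 0, 7 } the first variant accepts the invalid id 7 because
"found" is still set from id 0, while the second variant rejects it.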

> +
> +check_state:
> +	mon_state = rdtgroup_str_to_mon_state(dom_str);
> +
> +	if (mon_state == ASSIGN_INVALID) {
> +		rdt_last_cmd_puts("Invalid assign flag\n");
> +		goto out_fail;
> +	}
> +
> +	assign_state = 0;
> +	unassign_state = 0;
> +
> +	switch (op) {
> +	case '+':
> +		if (mon_state == ASSIGN_NONE) {
> +			rdt_last_cmd_puts("Invalid assign opcode\n");
> +			goto out_fail;
> +		}
> +		assign_state = mon_state;
> +		break;
> +	case '-':
> +		if (mon_state == ASSIGN_NONE) {
> +			rdt_last_cmd_puts("Invalid assign opcode\n");
> +			goto out_fail;
> +		}
> +		unassign_state = mon_state;
> +		break;
> +	case '=':
> +		assign_state = mon_state;
> +		unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	if (unassign_state & ASSIGN_TOTAL) {
> +		ret = rdtgroup_unassign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID);
> +		if (ret)
> +			goto out_fail;
> +	}
> +
> +	if (unassign_state & ASSIGN_LOCAL) {
> +		ret = rdtgroup_unassign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID);
> +		if (ret)
> +			goto out_fail;
> +	}
> +
> +	if (assign_state & ASSIGN_TOTAL) {
> +		ret = rdtgroup_assign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID);
> +		if (ret)
> +			goto out_fail;
> +	}
> +
> +	if (assign_state & ASSIGN_LOCAL) {
> +		ret = rdtgroup_assign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID);
> +		if (ret)
> +			goto out_fail;
> +	}
> +
> +	goto next;
> +
> +out_fail:
> +	rdt_last_cmd_printf("Assign operation '%c%s' failed on the group %s/%s/\n",
> +			    op, dom_str, p_grp, c_grp);
> +

Can the domain id be printed also? This seems to be the only piece missing to understand what failed.

> +	return -EINVAL;
> +}
> +
> +static ssize_t rdtgroup_mbm_assign_control_write(struct kernfs_open_file *of,
> +						 char *buf, size_t nbytes, loff_t off)
> +{
> +	struct rdt_resource *r = of->kn->parent->priv;
> +	char *token, *cmon_grp, *mon_grp;
> +	enum rdt_group_type rtype;
> +	int ret;
> +
> +	/* Valid input requires a trailing newline */
> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
> +		return -EINVAL;
> +
> +	buf[nbytes - 1] = '\0';
> +
> +	cpus_read_lock();
> +	mutex_lock(&rdtgroup_mutex);
> +
> +	rdt_last_cmd_clear();
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
> +		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
> +		mutex_unlock(&rdtgroup_mutex);
> +		cpus_read_unlock();
> +		return -EINVAL;
> +	}
> +
> +	while ((token = strsep(&buf, "\n")) != NULL) {
> +		/*
> +		 * The write command follows the following format:
> +		 * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
> +		 * Extract the CTRL_MON group.
> +		 */
> +		cmon_grp = strsep(&token, "/");
> +
> +		/*
> +		 * Extract the MON_GROUP.
> +		 * strsep returns empty string for contiguous delimiters.
> +		 * Empty mon_grp here means it is a RDTCTRL_GROUP.
> +		 */
> +		mon_grp = strsep(&token, "/");
> +
> +		if (*mon_grp == '\0')
> +			rtype = RDTCTRL_GROUP;
> +		else
> +			rtype = RDTMON_GROUP;
> +
> +		ret = rdtgroup_process_flags(r, rtype, cmon_grp, mon_grp, token);
> +		if (ret)
> +			break;
> +	}
> +
> +	mutex_unlock(&rdtgroup_mutex);
> +	cpus_read_unlock();
> +
> +	return ret ?: nbytes;
> +}
> +
>  #ifdef CONFIG_PROC_CPU_RESCTRL
>  
>  /*
> @@ -2383,9 +2616,10 @@ static struct rftype res_common_files[] = {
>  	},
>  	{
>  		.name		= "mbm_assign_control",
> -		.mode		= 0444,
> +		.mode		= 0644,
>  		.kf_ops		= &rdtgroup_kf_single_ops,
>  		.seq_show	= rdtgroup_mbm_assign_control_show,
> +		.write		= rdtgroup_mbm_assign_control_write,
>  	},
>  	{
>  		.name		= "mbm_assign_mode",

Reinette


* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-18 19:04     ` Moger, Babu
@ 2024-11-18 22:07       ` Reinette Chatre
  2024-11-22 18:25         ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-18 22:07 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/18/24 11:04 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/15/24 18:00, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>>> supported.
>>>
>>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>>
>>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>>
>>> The "default" mode is the existing monitoring mode that works without the
>>> explicit counter assignment, instead relying on dynamic counter assignment
>>> by hardware that may result in hardware not dedicating a counter resulting
>>> in monitoring data reads returning "Unavailable".
>>>
>>> Provide an interface to display the monitor mode on the system.
>>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>> [mbm_cntr_assign]
>>> default
>>>
>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>> ---

...

>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>> to make the event data "more predictable" and then be concerned when the mode does
>> not exist.
>>
>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing 
>> num_mbm_cntrs to num_rmids).
> 
> There is some round about(or hacky) way to find that out number of RMIDs
> that can be active.

Does this give consistent and accurate data? Is this something that can be added to resctrl?
(Reading your other message [1] it does not sound as though it can produce an accurate
number on boot.)
If not then it will be up to the documentation to be accurate.


>>> +
>>> +	AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>> +	enable this mode by default so that counters remain assigned even when the
>>> +	corresponding RMID is not in use by any processor.
>>> +
>>> +	"default":
>>> +
>>> +	In default mode resctrl assumes there is a hardware counter for each
>>> +	event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>> +	mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>> +	with that event.
>>
>> If I understand correctly, on AMD platforms without ABMC the events only report
>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>> is unassigned and then reassigned then the event count will reset and the user
>> will get some data back but it may thus be unpredictable (to match earlier language).
>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
> 
> Yes. All the AMD systems without ABMC are affected by this problem.
> 
>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>> available, while not be made concerned to use "default" mode on Intel where
>> mbm_assign_mode is not available.
> 
> Can we add text to clarify this?

Please do.

Reinette

[1] https://lore.kernel.org/all/35fc70fd-0281-4ac8-b32b-efa2f4516901@amd.com/

* Re: [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init
  2024-11-18 17:44     ` Moger, Babu
@ 2024-11-18 22:07       ` Reinette Chatre
  2024-11-20 20:02         ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-18 22:07 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/18/24 9:44 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/15/24 17:21, Reinette Chatre wrote:
>> Hi Babu,
>>
>> In subject please use () to indicate a function, writing resctrl_late_init()
> 
> Will change it to
> 
> x86/resctrl: Add __init attribute for all the call sequences in
> resctrl_late_init()

I am not sure how to interpret "call sequences". The original is ok now that
cache_alloc_hsw_probe() is also included. Specifically, this can be:

	x86/resctrl: Add __init attribute to functions called from resctrl_late_init()

> 
>>
>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>> The function resctrl_late_init() has the __init attribute, but some
>>
>> No need to say "The function" when using ().
>>
>>> functions it calls do not. Add the __init attribute to all the functions
>>
>> None of the functions changed are actually called by resctrl_late_init(). If this
>> is indeed the goal then I think cache_alloc_hsw_probe() was missed.
> 
> Will change the function to.
> 
> static inline __init void cache_alloc_hsw_probe(void)
> 
> How about this description?
> 
> "resctrl_late_init() has the __init attribute, but some of the call
> sequences of it do not have the __init attribute.

"some of the call sequences of it" sounds strange. It can be simplified with
"some of the functions called from it"?

> 
> Add the __init attribute to all the functions in the call sequences to
> maintain consistency throughout."

Sounds good, thank you.

Reinette

* Re: [PATCH v9 12/26] x86/resctrl: Remove MSR reading of event configuration value
  2024-11-16  0:24   ` Reinette Chatre
@ 2024-11-19 16:50     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-19 16:50 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/15/24 18:24, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -354,6 +354,10 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
>>   */
>>  void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
>>  
>> +void resctrl_arch_mon_event_config_set(void *info);
> 
> An architecture that may want to use this would need to know how to interpret
> the info passed. For an API I thus expect the struct it points to to also
> be available in this header file.

Sure, I will move structure mon_config_info to resctrl.h.

> 
> 
>> +u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>> +				      enum resctrl_event_id eventid);
>> +
>>  extern unsigned int resctrl_rmid_realloc_threshold;
>>  extern unsigned int resctrl_rmid_realloc_limit;
>>  
> 
> Reinette
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-16  0:31   ` Reinette Chatre
@ 2024-11-19 19:20     ` Moger, Babu
  2024-11-21 21:12       ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-19 19:20 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/15/24 18:31, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> Provide the interface to display the number of free monitoring counters
>> available for assignment in each doamin when mbm_cntr_assign is supported.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: New patch.
>> ---
>>  Documentation/arch/x86/resctrl.rst     |  4 ++++
>>  arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>  3 files changed, 38 insertions(+)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 2f3a86278e84..2bc58d974934 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -302,6 +302,10 @@ with the following files:
>>  	memory bandwidth tracking to a single memory bandwidth event per
>>  	monitoring group.
>>  
>> +"available_mbm_cntrs":
>> +	The number of free monitoring counters available assignment in each domain
> 
> "The number of free monitoring counters available assignment" -> "The number of monitoring
> counters available for assignment"?
> 
> (not taking into account how text may change after addressing Peter's feedback)

How about this?

"The number of monitoring counters available for assignment in each domain
when the architecture supports mbm_cntr_assign mode. A total of
"num_mbm_cntrs" counters are available for assignment. Counters can be
assigned or unassigned individually in each domain. A counter is available
for new assignment if it is unassigned in all domains."

> 
>> +	when the architecture supports mbm_cntr_assign mode.
>> +
>>  "max_threshold_occupancy":
>>  		Read/write file provides the largest value (in
>>  		bytes) at which a previously used LLC_occupancy
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 3996f7528b66..e8d38a963f39 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1268,6 +1268,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>  			cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>>  			r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
>>  			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
>> +			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
>>  		}
>>  	}
>>  
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 654cdfee1b00..ef0c1246fa2a 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -898,6 +898,33 @@ static int rdtgroup_num_mbm_cntrs_show(struct kernfs_open_file *of,
>>  	return 0;
>>  }
>>  
>> +static int rdtgroup_available_mbm_cntrs_show(struct kernfs_open_file *of,
>> +					     struct seq_file *s, void *v)
>> +{
>> +	struct rdt_resource *r = of->kn->parent->priv;
>> +	struct rdt_mon_domain *dom;
>> +	bool sep = false;
>> +	u32 val;
>> +
>> +	cpus_read_lock();
>> +	mutex_lock(&rdtgroup_mutex);
>> +
>> +	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>> +		if (sep)
>> +			seq_puts(s, ";");
>> +
>> +		val = r->mon.num_mbm_cntrs - hweight64(*dom->mbm_cntr_map);
> 
> This should probably be bitmap_weight() to address warnings like below that are
> encountered by build testing with various configs (32bit in this case). 0day does
> not seem to automatically pick up patches just based on submission but it sure will
> when these are merged to tip so this needs a clean slate.

Sure.
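
For reference, the idea behind counting over the bitmap rather than over a
single 64-bit word can be sketched in user space as below. This is only an
illustration under my own assumptions: sk_bitmap_weight() and sk_free_cntrs()
are made-up stand-ins, not the kernel's bitmap_weight(), and the bitmap is
assumed to be an array of unsigned long with bit 0 in the first word.

```c
#include <limits.h>
#include <stddef.h>

#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Count the set bits among the first nbits bits of map. */
static unsigned int sk_bitmap_weight(const unsigned long *map,
				     unsigned int nbits)
{
	unsigned int weight = 0;

	for (unsigned int bit = 0; bit < nbits; bit++) {
		unsigned long word = map[bit / SK_BITS_PER_LONG];

		if (word & (1UL << (bit % SK_BITS_PER_LONG)))
			weight++;
	}

	return weight;
}

/* Free counters in a domain: total counters minus assigned counters. */
static unsigned int sk_free_cntrs(unsigned int num_cntrs,
				  const unsigned long *map)
{
	return num_cntrs - sk_bitmap_weight(map, num_cntrs);
}
```

Since the count is bounded by the number of counters and walks the words of
the map, it does not depend on unsigned long being 64 bits wide.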

> 
>>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:916:32: warning: shift count >= width of type [-Wshift-count-overflow]
>      916 |                 val = r->mon.num_mbm_cntrs - hweight64(*dom->mbm_cntr_map);
>          |                                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/asm-generic/bitops/const_hweight.h:29:49: note: expanded from macro 'hweight64'
>       29 | #define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
>          |                                                 ^~~~~~~~~~~~~~~~~~~~
>    include/asm-generic/bitops/const_hweight.h:21:76: note: expanded from macro '__const_hweight64'
>       21 | #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
>          |                                                                            ^  ~~
>    include/asm-generic/bitops/const_hweight.h:20:49: note: expanded from macro '__const_hweight32'
>       20 | #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
>          |                                                 ^
>    include/asm-generic/bitops/const_hweight.h:19:48: note: expanded from macro '__const_hweight16'
>       19 | #define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8 ))
>          |                                                ^
>    include/asm-generic/bitops/const_hweight.h:10:9: note: expanded from macro '__const_hweight8'
>       10 |          ((!!((w) & (1ULL << 0))) +     \
>          |                ^
> 
> 
> Reinette
> 
> 
> 
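
For reference, a minimal userspace sketch of why bitmap_weight() avoids the
warning above. hweight64() shifts its argument by 32, which overflows when the
bitmap word is a 32-bit unsigned long; counting bit by bit over an
unsigned long array does not depend on the word width. The sketch_* names are
hypothetical and only imitate the kernel helper, they are not the kernel
implementation.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Count the set bits in the first nbits bits of a bitmap stored as an
 * array of unsigned long. Works for 32-bit and 64-bit unsigned long.
 */
static unsigned int sketch_bitmap_weight(const unsigned long *map, size_t nbits)
{
	unsigned int weight = 0;
	size_t i;

	for (i = 0; i < nbits; i++) {
		unsigned long word = map[i / SKETCH_BITS_PER_LONG];

		if (word & (1UL << (i % SKETCH_BITS_PER_LONG)))
			weight++;
	}

	return weight;
}

/* Free counters = total counters minus the counters marked in the map. */
static unsigned int sketch_available_cntrs(const unsigned long *map,
					   unsigned int num_cntrs)
{
	return num_cntrs - sketch_bitmap_weight(map, num_cntrs);
}
```

With 32 counters and bits 0 and 2 set in the domain map, the sketch reports
30 available counters.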

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 16/26] x86/resctrl: Introduce cntr_id in mongroup for assignments
  2024-11-16  0:38   ` Reinette Chatre
@ 2024-11-19 20:02     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-19 20:02 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/15/24 18:38, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> mbm_cntr_assign feature provides an option to the user to assign a counter
> 
> nit for consistency: "mbm_cntr_assign feature" -> "mbm_cntr_assign mode"

Sure.
> 
>> to an RMID, event pair and monitor the bandwidth as long as the counter is
>> assigned. There can be two counters per monitor group, one for MBM total
>> event and another for MBM local event.
>>
>> Introduce cntr_id to manage the assignments.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> | Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-11-16  0:44   ` Reinette Chatre
@ 2024-11-19 20:12     ` Moger, Babu
  2024-11-21 20:18       ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-19 20:12 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/15/24 18:44, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> The ABMC feature provides an option to the user to assign a hardware
>> counter to an RMID, event pair and monitor the bandwidth as long as it is
>> assigned. The assigned RMID will be tracked by the hardware until the user
>> unassigns it manually.
>>
>> Counters are configured by writing to L3_QOS_ABMC_CFG MSR and
>> specifying the counter id, bandwidth source, and bandwidth types.
> 
> needs imperative tone

How about this?

Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and
specifying the counter ID, bandwidth source, and bandwidth types.

> 
>>
>> Provide the interface to assign the counter ids to RMID.
>>
>> The feature details are documented in the APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>     Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>     Monitoring (ABMC).
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
> 
> Reinette
> 

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment
  2024-11-16  0:57   ` Reinette Chatre
@ 2024-11-20 18:05     ` Moger, Babu
  2024-11-21 20:50       ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-20 18:05 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/15/24 18:57, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> The mbm_cntr_assign mode offers several hardware counters that can be
>> assigned to an RMID, event pair and monitor the bandwidth as long as it
>> is assigned.
>>
>> Counters are managed at two levels. The global assignment is tracked
>> using the mbm_cntr_free_map field in the struct resctrl_mon, while
>> domain-specific assignments are tracked using the mbm_cntr_map field
>> in the struct rdt_mon_domain. Allocation begins at the global level
>> and is then applied individually to each domain.
>>
>> Introduce an interface to allocate these counters and update the
>> corresponding domains accordingly.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
> 
> ...
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index 00f7bf60e16a..cb496bd97007 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -717,6 +717,8 @@ unsigned int mon_event_config_index_get(u32 evtid);
>>  int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>  			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>  			     u32 cntr_id, bool assign);
>> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
>> +			       struct rdt_mon_domain *d, enum resctrl_event_id evtid);
>>  void rdt_staged_configs_clear(void);
>>  bool closid_allocated(unsigned int closid);
>>  int resctrl_find_cleanest_closid(void);
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 1b5529c212f5..bc3752967c44 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1924,6 +1924,93 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>  	return 0;
>>  }
>>  
>> +/*
>> + * Configure the counter for the event, RMID pair for the domain.
>> + * Update the bitmap and reset the architectural state.
>> + */
>> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
>> +			       u32 cntr_id, bool assign)
>> +{
>> +	int ret;
>> +
>> +	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (assign)
>> +		__set_bit(cntr_id, d->mbm_cntr_map);
>> +	else
>> +		__clear_bit(cntr_id, d->mbm_cntr_map);
>> +
>> +	/*
>> +	 * Reset the architectural state so that reading of hardware
>> +	 * counter is not considered as an overflow in next update.
>> +	 */
>> +	resctrl_arch_reset_rmid(r, d, closid, rmid, evtid);
> 
> resctrl_arch_reset_rmid() expects to be run on a CPU that is in the domain
> @d ... note that after the architectural state is reset it initializes the
> state by reading the event on the current CPU. By running it here it is
> run on a random CPU that may not be in the right domain.

Yes, that is correct. We can move this part back to our earlier
implementation. We don't need to read the RMID; we just have to reset the
counter.

https://lore.kernel.org/lkml/16d88cc4091cef1999b7ec329364e12dd0dc748d.1728495588.git.babu.moger@amd.com/

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9fe419d0c536..bc3654ec3a08 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2371,6 +2371,13 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
        smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd,
                              &abmc_cfg, 1);

+       /*
+        * Reset the architectural state so that reading of hardware
+        * counter is not considered as an overflow in next update.
+        */
+       if (arch_mbm)
+               memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
+
        return 0;
 }
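
To illustrate why the saved state has to be cleared, here is a minimal
userspace sketch. It assumes the saved state holds an accumulated chunk count
and the previous raw reading, and that the delta is computed modulo the
counter width. The sketch_* names and the 24-bit width used in the note below
are assumptions for illustration only.

```c
#include <stdint.h>
#include <string.h>

struct sketch_mbm_state {
	uint64_t chunks;	/* accumulated chunks */
	uint64_t prev_msr;	/* previous raw counter reading */
};

/*
 * Delta between two raw readings of a counter that is width bits wide
 * (1 <= width <= 64). A current value below the previous one is
 * interpreted as a wraparound.
 */
static uint64_t sketch_overflow_count(uint64_t prev, uint64_t cur,
				      unsigned int width)
{
	unsigned int shift = 64 - width;
	uint64_t chunks = (cur << shift) - (prev << shift);

	return chunks >> shift;
}

/* Fold a new raw reading into the accumulated state. */
static uint64_t sketch_update(struct sketch_mbm_state *st, uint64_t cur,
			      unsigned int width)
{
	st->chunks += sketch_overflow_count(st->prev_msr, cur, width);
	st->prev_msr = cur;

	return st->chunks;
}

/* Equivalent of the memset() in the diff above. */
static void sketch_reset(struct sketch_mbm_state *st)
{
	memset(st, 0, sizeof(*st));
}
```

If the stale state has prev_msr = 1000 and the freshly assigned counter reads
10, a 24-bit counter yields a bogus delta of 16776226. After sketch_reset()
the same reading yields 10.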


> 
>> +
>> +	return ret;
>> +}
>> +
>> +static bool mbm_cntr_assigned_to_domain(struct rdt_resource *r, u32 cntr_id)
>> +{
>> +	struct rdt_mon_domain *d;
>> +
>> +	list_for_each_entry(d, &r->mon_domains, hdr.list)
>> +		if (test_bit(cntr_id, d->mbm_cntr_map))
>> +			return 1;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Assign a hardware counter to event @evtid of group @rdtgrp.
>> + * Counter will be assigned to all the domains if rdt_mon_domain is NULL
>> + * else the counter will be assigned to specific domain.
>> + */
>> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
>> +			       struct rdt_mon_domain *d, enum resctrl_event_id evtid)
>> +{
>> +	int index = MBM_EVENT_ARRAY_INDEX(evtid);
>> +	int cntr_id = rdtgrp->mon.cntr_id[index];
>> +	int ret;
>> +
>> +	/*
>> +	 * Allocate a new counter id to the event if the counter is not
>> +	 * assigned already.
>> +	 */
>> +	if (cntr_id == MON_CNTR_UNSET) {
>> +		cntr_id = mbm_cntr_alloc(r);
>> +		if (cntr_id < 0) {
>> +			rdt_last_cmd_puts("Out of MBM assignable counters\n");
>> +			return -ENOSPC;
>> +		}
>> +		rdtgrp->mon.cntr_id[index] = cntr_id;
>> +	}
>> +
>> +	if (!d) {
>> +		list_for_each_entry(d, &r->mon_domains, hdr.list) {
>> +			ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
>> +						  rdtgrp->closid, cntr_id, true);
>> +			if (ret)
>> +				goto out_done_assign;
> 
> This may not be what users expect. What if, for example, domain #1 has a counter
> assigned to "total" event and then user wants to change that to 
> assign a counter to "total" event of all domains. Would this not reconfigure the
> counter associated with domain #1 and unnecessarily reset it? Could this be
> made a bit smarter to only configure a counter on a domain if it is not already
> configured? This could perhaps form part of resctrl_config_cntr() to not scatter
> the duplicate check everywhere. What do you think?

Yes, that is correct. We can add a check in resctrl_config_cntr().

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9fe419d0c536..bc3654ec3a08 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2384,6 +2391,10 @@ static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
 {
        int ret;

+       /* Return success if the domain is in expected assign state already */
+       if (assign == test_bit(cntr_id, d->mbm_cntr_map))
+               return 0;
+
         ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
        if (ret)
                return ret;
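
A minimal userspace sketch of the effect of this check, assuming a
single-word bitmap per domain and a stand-in for the arch call that only
counts how often the "hardware" is programmed. All sketch_* names are
hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct sketch_domain {
	unsigned long cntr_map;		/* one bit per assigned counter */
	unsigned int arch_calls;	/* times the hardware was programmed */
};

static bool sketch_test_bit(unsigned int nr, const unsigned long *map)
{
	return (*map >> nr) & 1UL;
}

/* Stand-in for the arch call; in the kernel this would send an IPI. */
static int sketch_arch_config(struct sketch_domain *d)
{
	d->arch_calls++;
	return 0;
}

static int sketch_config_cntr(struct sketch_domain *d, unsigned int cntr_id,
			      bool assign)
{
	int ret;

	if (cntr_id >= SKETCH_BITS_PER_LONG)
		return -1;

	/* Nothing to do if the domain is already in the requested state. */
	if (assign == sketch_test_bit(cntr_id, &d->cntr_map))
		return 0;

	ret = sketch_arch_config(d);
	if (ret)
		return ret;

	if (assign)
		d->cntr_map |= 1UL << cntr_id;
	else
		d->cntr_map &= ~(1UL << cntr_id);

	return 0;
}
```

Assigning the same counter twice programs the hardware once, so a user
looping over the same assignment no longer triggers repeated IPIs or
repeated counter resets.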


> 
> Also, looks like this can do partial assignment. For example, if one of the
> domains encounter a failure then domains already configured are not undone. This
> matches other similar flows but is not documented and left to reader to decipher. 

I will add the text in patch 26
(x86/resctrl: Introduce interface to modify assignment states of the groups).

> 
> 
>> +		}
>> +	} else {
>> +		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
>> +					  rdtgrp->closid, cntr_id, true);
> 
> Looking at flows calling rdtgroup_assign_cntr_event() I do not see a check
> if counter is already assigned. So, if a user makes a loop of assigning a counter
> to the same event over and over it will result in an IPI every time. This seems
> unnecessary, what do you think?

This will be taken care of by the above check in resctrl_config_cntr().

> 
>> +		if (ret)
>> +			goto out_done_assign;
>> +	}
>> +
>> +out_done_assign:
>> +	if (ret && !mbm_cntr_assigned_to_domain(r, cntr_id)) {
>> +		mbm_cntr_free(r, cntr_id);
>> +		rdtgroup_cntr_id_init(rdtgrp, evtid);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  /* rdtgroup information files for one cache resource. */
>>  static struct rftype res_common_files[] = {
>>  	{
> 
> Reinette
> 
> 

-- 
Thanks
Babu Moger

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 23/26] x86/resctrl: Configure mbm_cntr_assign mode if supported
  2024-11-18 19:23   ` Reinette Chatre
@ 2024-11-20 18:59     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-20 18:59 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/18/24 13:23, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> Configure mbm_cntr_assign on AMD. 'mbm_cntr_assign' mode in AMD is ABMC
>> (Assignable Bandwidth Monitoring Counters). It is enabled by default when
>> supported on the system.
>>
>> When the ABMC is updated, it must be updated on all the logical processors
>> in the resctrl domain.
> 
> This needs imperative tone.
> 

Ensure that the ABMC is updated on all logical processors in the resctrl
domain.

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 21/26] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode
  2024-11-18 17:39   ` Reinette Chatre
@ 2024-11-20 19:14     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-20 19:14 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/18/24 11:39, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> In mbm_cntr_assign mode, the hardware counter should be assigned to read
>> the MBM events.
>>
>> Report "Unassigned" in case the user attempts to read the events without
>> assigning the counter.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: Used is_mbm_event() to check the event type.
>>     Minor user documentation update.
>>
>> v8: Used MBM_EVENT_ARRAY_INDEX to get the index for the MBM event.
>>     Documentation update to make the text generic.
>>
>> v7: Moved the documentation under "mon_data".
>>     Updated the text little bit.
>>
>> v6: Added more explanation in the resctrl.rst
>>     Added checks to detect "Unassigned" before reading RMID.
>>
>> v5: New patch.
>> ---
>>  Documentation/arch/x86/resctrl.rst        | 10 ++++++++++
>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +++++++++++-
>>  2 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 2bc58d974934..864fc004d646 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -430,6 +430,16 @@ When monitoring is enabled all MON groups will also contain:
>>  	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
>>  	where "YY" is the node number.
>>  
>> +	When supported the 'mbm_cntr_assign' mode allows users to assign a
> 
> Could you please go through the documentation changes and make all the quote
> usage consistent with existing styles? For example, in this series "mbm_cntr_assign"
> is used in doc in various ways ... within double quotes, within single quotes, as
> well as without any quotes.

Yes, will do. It should be "mbm_cntr_assign" in all the references.

> 
>> +	counter to mon_hw_id, event pair enabling bandwidth monitoring for
>> +	as long as the counter remains assigned. The hardware will continue
>> +	tracking the assigned mon_hw_id until the user manually unassigns
>> +	it, ensuring that counters are not reset during this period. With
>> +	a limited number of counters, the system may run out of assignable
>> +	counters. In that case, MBM event counters will return "Unassigned"
> 
> Please review style for all quote usage, for example, "Unassigned" above is
> also not consistent.

Yes. Will change it to 'Unassigned'.

> 
>> +	when the event is read. Users must manually assign a counter to read
>> +	the events.
>> +
> Reinette
> 

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init
  2024-11-18 22:07       ` Reinette Chatre
@ 2024-11-20 20:02         ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-20 20:02 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/18/24 16:07, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/18/24 9:44 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/15/24 17:21, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> In subject please use () to indicate a function, writing resctrl_late_init()
>>
>> Will change it to
>>
>> x86/resctrl: Add __init attribute for all the call sequences in
>> resctrl_late_init()
> 
> I am not sure how to interpret "call sequences". The original is ok now that
> cache_alloc_hsw_probe() is also included. Specifically, this can be:
> 
> 	x86/resctrl: Add __init attribute to functions called from resctrl_late_init()

Sure.
> 
>>
>>>
>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>> The function resctrl_late_init() has the __init attribute, but some
>>>
>>> No need to say "The function" when using ().
>>>
>>>> functions it calls do not. Add the __init attribute to all the functions
>>>
>>> None of the functions changed are actually called by resctrl_late_init(). If this
>>> is indeed the goal then I think cache_alloc_hsw_probe() was missed.
>>
>> Will change the function to.
>>
>> static inline __init void cache_alloc_hsw_probe(void)
>>
>> How about this description?
>>
>> "resctrl_late_init() has the __init attribute, but some of the call
>> sequences of it do not have the __init attribute.
> 
> "some of the call sequences of it" sounds strange. It can be simplified with
> "some of the functions called from it"?

Sure.

> 
>>
>> Add the __init attribute to all the functions in the call sequences to
>> maintain consistency throughout."
> 
> Sounds good, thank you.

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes
  2024-11-18 19:43   ` Reinette Chatre
@ 2024-11-21  2:14     ` Moger, Babu
  2024-11-21 20:58       ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-21  2:14 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/18/2024 1:43 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> Users can modify the configuration of assignable events. Whenever the
>> event configuration is updated, MBM assignments must be revised across
>> all monitor groups within the impacted domains.
> 
> Please revisit the "Changelog" section in
> Documentation/process/maintainer-tip.rst
> 

ok.

Imperative mood, context, problem and solution.

>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: Again patch changed completely based on the comment.
>>      https://lore.kernel.org/lkml/03b278b5-6c15-4d09-9ab7-3317e84a409e@intel.com/
>>      Introduced resctrl_mon_event_config_set to handle IPI.
>>      But sending another IPI inside IPI causes problem. Kernel reports SMP
>>      warning. So, introduced resctrl_arch_update_cntr() to send the command directly.
> 
> I see ... the WARN is because there is a check whether IRQs are disabled before
> the check whether the function can be run locally.

ok

> 
>>
>> v8: Patch changed completely.
>>      Updated the assignment on same IPI as the event is updated.
>>      Could not do the way we discussed in the thread.
>>      https://lore.kernel.org/lkml/f77737ac-d3f6-3e4b-3565-564f79c86ca8@amd.com/
>>      Needed to figure out event type to update the configuration.
>>
>> v7: New patch to update the assignments. Missed it earlier.
>> ---
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 86 +++++++++++++++++++++++---
>>   include/linux/resctrl.h                |  3 +-
>>   2 files changed, 79 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 5b8bb8bd913c..7646d67ea10e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1710,6 +1710,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
>>   }
>>   
>>   struct mon_config_info {
>> +	struct rdt_resource *r;
>>   	struct rdt_mon_domain *d;
>>   	u32 evtid;
>>   	u32 mon_config;
>> @@ -1735,26 +1736,28 @@ u32 resctrl_arch_mon_event_config_get(struct rdt_mon_domain *d,
>>   	return INVALID_CONFIG_VALUE;
>>   }
>>   
>> -void resctrl_arch_mon_event_config_set(void *info)
>> +void resctrl_arch_mon_event_config_set(struct rdt_mon_domain *d,
>> +				       enum resctrl_event_id eventid, u32 val)
>>   {
>> -	struct mon_config_info *mon_info = info;
>>   	struct rdt_hw_mon_domain *hw_dom;
>>   	unsigned int index;
>>   
>> -	index = mon_event_config_index_get(mon_info->evtid);
>> +	index = mon_event_config_index_get(eventid);
>>   	if (index == INVALID_CONFIG_INDEX)
>>   		return;
>>   
>> -	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
>> +	wrmsr(MSR_IA32_EVT_CFG_BASE + index, val, 0);
>>   
>> -	hw_dom = resctrl_to_arch_mon_dom(mon_info->d);
>> +	hw_dom = resctrl_to_arch_mon_dom(d);
>>   
>> -	switch (mon_info->evtid) {
>> +	switch (eventid) {
>>   	case QOS_L3_MBM_TOTAL_EVENT_ID:
>> -		hw_dom->mbm_total_cfg = mon_info->mon_config;
>> +		hw_dom->mbm_total_cfg = val;
>>   		break;
>>   	case QOS_L3_MBM_LOCAL_EVENT_ID:
>> -		hw_dom->mbm_local_cfg = mon_info->mon_config;
>> +		hw_dom->mbm_local_cfg = val;
>> +		break;
>> +	default:
>>   		break;
>>   	}
>>   }
>> @@ -1826,6 +1829,70 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
>>   	return 0;
>>   }
>>   
>> +static struct rdtgroup *rdtgroup_find_grp_by_cntr_id_index(int cntr_id, unsigned int index)
>> +{
>> +	struct rdtgroup *prgrp, *crgrp;
>> +
>> +	/* Check if the cntr_id is associated to the event type updated */
>> +	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
>> +		if (prgrp->mon.cntr_id[index] == cntr_id)
>> +			return prgrp;
>> +
>> +		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
>> +			if (crgrp->mon.cntr_id[index] == cntr_id)
>> +				return crgrp;
>> +		}
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static void resctrl_arch_update_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>> +				     enum resctrl_event_id evtid, u32 rmid,
>> +				     u32 closid, u32 cntr_id, u32 val)
>> +{
>> +	union l3_qos_abmc_cfg abmc_cfg = { 0 };
>> +
>> +	abmc_cfg.split.cfg_en = 1;
>> +	abmc_cfg.split.cntr_en = 1;
>> +	abmc_cfg.split.cntr_id = cntr_id;
>> +	abmc_cfg.split.bw_src = rmid;
>> +	abmc_cfg.split.bw_type = val;
>> +
>> +	wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg.full);
> 
> Is it needed to create an almost duplicate function? What if instead
> only resctrl_arch_config_cntr() exists and it uses parameter to decide
> whether to call resctrl_abmc_config_one_amd() directly or via
> smp_call_function_any()? I think that should help to make clear how
> the code flows.
> Also note that this is an almost identical arch callback with no
> error return. I expect that building on existing resctrl_arch_config_cntr()
> will make things easier to understand.

It can be done, but it adds another parameter to the function.
It has 7 parameters already; this will be the 8th.
Will change it if that is ok.

> 
>> +}
>> +
>> +static void resctrl_mon_event_config_set(void *info)
>> +{
>> +	struct mon_config_info *mon_info = info;
>> +	struct rdt_mon_domain *d = mon_info->d;
>> +	struct rdt_resource *r = mon_info->r;
> 
> Note that local variable r is created here while the function is inconsistent by
> switching between using r and mon_info->r.

Sure. Got it.

> 
>> +	struct rdtgroup *rdtgrp;
>> +	unsigned int index;
>> +	u32 cntr_id;
>> +
>> +	resctrl_arch_mon_event_config_set(d, mon_info->evtid, mon_info->mon_config);
>> +
>> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> +		return;
>> +
>> +	index = mon_event_config_index_get(mon_info->evtid);
> 
> This is an AMD arch specific helper to know which offset of an MSR to use. It should
> not be used directly in resctrl fs code, this is what MBM_EVENT_ARRAY_INDEX was created for.

Sure.

> 
> Since MBM_EVENT_ARRAY_INDEX is a macro it can be called closer to where it is used,
> within  rdtgroup_find_grp_by_cntr_id_index(), which prompts a reconsider of that function name.


How about this?

static struct rdtgroup *rdtgroup_find_grp_by_cntr_id_event(int cntr_id,
						enum resctrl_event_id evtid)

Will move the macro MBM_EVENT_ARRAY_INDEX inside the function.
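
A rough userspace sketch of that shape, assuming the index macro maps an
event ID to an array slot and using a flat array in place of the kernel's
group lists. The sketch_* names, the event values and the macro definition
are assumptions for illustration, not the code in the series.

```c
#include <stddef.h>

/* Hypothetical event IDs and index macro, for illustration only. */
enum sketch_event_id {
	SKETCH_MBM_TOTAL = 2,
	SKETCH_MBM_LOCAL = 3,
};

#define SKETCH_EVENT_ARRAY_INDEX(evt)	((evt) - SKETCH_MBM_TOTAL)
#define SKETCH_NUM_MBM_EVENTS		2
#define SKETCH_CNTR_UNSET		(-1)

struct sketch_group {
	const char *name;
	int cntr_id[SKETCH_NUM_MBM_EVENTS];	/* one counter per MBM event */
};

/*
 * Find the group that owns cntr_id for event evtid. The caller passes the
 * event ID; the array index is derived here, next to where it is used.
 */
static struct sketch_group *
sketch_find_grp_by_cntr_id_event(struct sketch_group *grps, size_t ngrps,
				 int cntr_id, enum sketch_event_id evtid)
{
	int index = SKETCH_EVENT_ARRAY_INDEX(evtid);
	size_t i;

	for (i = 0; i < ngrps; i++) {
		if (grps[i].cntr_id[index] == cntr_id)
			return &grps[i];
	}

	return NULL;
}
```

A NULL return means the counter is not assigned to that event in any group,
so the caller has nothing to update for it.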


> 
>> +	if (index == INVALID_CONFIG_INDEX)
>> +		return;
>> +
>> +	for (cntr_id = 0; cntr_id < r->mon.num_mbm_cntrs; cntr_id++) {
>> +		if (test_bit(cntr_id, d->mbm_cntr_map)) {
>> +			rdtgrp = rdtgroup_find_grp_by_cntr_id_index(cntr_id, index);
>> +			if (rdtgrp)
>> +				resctrl_arch_update_cntr(mon_info->r, d,
>> +							 mon_info->evtid,
>> +							 rdtgrp->mon.rmid,
>> +							 rdtgrp->closid,
>> +							 cntr_id,
>> +							 mon_info->mon_config);
>> +		}
>> +	}
>> +}
> 
> Could you please add some function comments to explain the flow here? For example,
> what should reader consider if there is no rdtgroup found?

Sure.

-- 
- Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-11-19 20:12     ` Moger, Babu
@ 2024-11-21 20:18       ` Reinette Chatre
  2024-11-22 18:54         ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-21 20:18 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/19/24 12:12 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/15/24 18:44, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>> The ABMC feature provides an option to the user to assign a hardware
>>> counter to an RMID, event pair and monitor the bandwidth as long as it is
>>> assigned. The assigned RMID will be tracked by the hardware until the user
>>> unassigns it manually.
>>>
>>> Counters are configured by writing to L3_QOS_ABMC_CFG MSR and
>>> specifying the counter id, bandwidth source, and bandwidth types.
>>
>> needs imperative tone
> 
> How about this?
> 
> Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and
> specifying the counter ID, bandwidth source, and bandwidth types.
> 

ok with me. Exactly what ChatGPT suggests.

Please do note that that first paragraph informs reader that
a counter is assigned by user to "an RMID, event pair" while the hardware is configured with
"the counter ID, bandwidth source, and bandwidth types". There thus does not seem
to be a clear connection between what user assigns and what is programmed to hardware.
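
One way to picture the missing connection, as a sketch only: going by how
the fields are filled in elsewhere in this series, the RMID becomes the
bandwidth source and the event's configuration value becomes the bandwidth
types. The struct below is a plain stand-in and says nothing about the real
MSR bit layout; the sketch_* names and the use of the assign flag for
cntr_en are assumptions.

```c
#include <stdint.h>

/* Plain stand-in for the register fields; not the real MSR layout. */
struct sketch_abmc_cfg {
	uint32_t cntr_id;	/* which hardware counter to program */
	uint32_t bw_src;	/* RMID whose traffic is counted */
	uint32_t bw_type;	/* bitmask of transaction types to count */
	uint8_t cntr_en;	/* 1 to assign, 0 to unassign (assumed) */
	uint8_t cfg_en;		/* 1 to apply the configuration */
};

/*
 * Translate what the user assigns (a counter to an RMID, event pair)
 * into what is programmed: the RMID is the bandwidth source and the
 * event's configuration value selects the bandwidth types.
 */
static struct sketch_abmc_cfg sketch_build_cfg(uint32_t cntr_id, uint32_t rmid,
					       uint32_t event_cfg, int assign)
{
	struct sketch_abmc_cfg cfg = { 0 };

	cfg.cfg_en = 1;
	cfg.cntr_en = assign ? 1 : 0;
	cfg.cntr_id = cntr_id;
	cfg.bw_src = rmid;
	cfg.bw_type = event_cfg;

	return cfg;
}
```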

Reinette

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 26/26] x86/resctrl: Introduce interface to modify assignment states of the groups
  2024-11-18 21:51   ` Reinette Chatre
@ 2024-11-21 20:29     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-21 20:29 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/18/24 15:51, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> Introduce the interface to assign MBM events in mbm_cntr_assign mode.
>>
>> Events can be enabled or disabled by writing to file
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> Format is similar to the list format with addition of opcode for the
>> assignment operation.
>>  "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>>
>> Format for specific type of groups:
>>
>>  * Default CTRL_MON group:
>>          "//<domain_id><opcode><flags>"
>>
>>  * Non-default CTRL_MON group:
>>          "<CTRL_MON group>//<domain_id><opcode><flags>"
>>
>>  * Child MON group of default CTRL_MON group:
>>          "/<MON group>/<domain_id><opcode><flags>"
>>
>>  * Child MON group of non-default CTRL_MON group:
>>          "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>>
>> Domain_id '*' will apply the flags on all the domains.
>>
>> Opcode can be one of the following:
>>
>>  = Update the assignment to match the flags
>>  + Assign a new MBM event without impacting existing assignments.
>>  - Unassign a MBM event from currently assigned events.
>>
>> Assignment flags can be one of the following:
>>  t  MBM total event
>>  l  MBM local event
>>  tl Both total and local MBM events
>>  _  None of the MBM events. Valid only with '=' opcode. This flag cannot
>>     be combined with other flags.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: Fixed handling special case '//0=' and '//".
>>     Removed extra strstr() call.
>>     Added generic failure text when assignment operation fails.
>>     Corrected user documentation format texts.
>>
>> v8: Moved unassign as the first action during the assign modification.
>>     Assign none "_" takes priority. Cannot be mixed with other flags.
>>     Updated the documentation and .rst file format. htmldoc looks ok.
>>
>> v7: Simplified the parsing (strsep(&token, "//") in rdtgroup_mbm_assign_control_write().
>>     Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.
>>     Renamed rdtgroup_find_grp to rdtgroup_find_grp_by_name.
>>     Fixed rdtgroup_str_to_mon_state to return error for invalid flags.
>>     Simplified the calls rdtgroup_assign_cntr by merging few functions earlier.
>>     Removed ABMC reference in FS code.
>>     Reinette commented about handling the combination of flags like 'lt_' and '_lt'.
>>     Not sure if we need to change the behaviour here. Processed them sequentially right now.
>>     Users have the liberty to pass the flags. Restricting it might be a problem later.
>>
>> v6: Added support assign all if domain id is '*'
>>     Fixed the allocation of counter id if it is not assigned already.
>>
>> v5: Interface name changed from mbm_assign_control to mbm_control.
>>     Fixed opcode and flags combination.
>>     '=_" is valid.
>>     "-_" amd "+_" is not valid.
>>     Minor message update.
>>     Renamed the function with prefix - rdtgroup_.
>>     Corrected few documentation mistakes.
>>     Rebase related changes after SNC support.
>>
>> v4: Added domain specific assignments. Fixed the opcode parsing.
>>
>> v3: New patch.
>>     Addresses the feedback to provide the global assignment interface.
>>     https://lore.kernel.org/lkml/c73f444b-83a1-4e9a-95d3-54c5165ee782@intel.com/
>> ---
>>  Documentation/arch/x86/resctrl.rst     | 116 +++++++++++-
>>  arch/x86/kernel/cpu/resctrl/internal.h |  10 ++
>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 236 ++++++++++++++++++++++++-
>>  3 files changed, 360 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 590727bec44b..d0a107d251ec 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -347,7 +347,8 @@ with the following files:
>>  	 t  MBM total event is assigned.
>>  	 l  MBM local event is assigned.
>>  	 tl Both MBM total and local events are assigned.
>> -	 _  None of the MBM events are assigned.
>> +	 _  None of the MBM events are assigned. Only works with opcode '=' for write
>> +	    and cannot be combined with other flags.
>>  
>>  	Examples:
>>  	::
>> @@ -365,6 +366,119 @@ with the following files:
>>  	There are four resctrl groups. All the groups have total and local MBM events
>>  	assigned on domain 0 and 1.
>>  
>> +	Assignment state can be updated by writing to the interface.
> 
> This is already a bit far from original definition so it may help to be specific what is
> meant with "the interface". For example,
> 
> 	Assignment state can be updated by writing to "mbm_assign_control".

Sure.
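
A minimal userspace sketch of such a write, for illustration only. The helper name mbm_assign_write() and the buffer size are assumptions, not part of the patch. It appends the trailing newline that the write handler quoted further down requires (the shell examples get it from echo).

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one request such as "//0=t" to the control
 * file at @path. A trailing newline is appended because the quoted write
 * handler returns -EINVAL without one. Returns 0 on success, -1 on failure.
 */
static int mbm_assign_write(const char *path, const char *request)
{
	char buf[256];
	size_t len = strlen(request);
	ssize_t written;
	int fd;

	if (len + 2 > sizeof(buf))
		return -1;

	memcpy(buf, request, len);
	buf[len] = '\n';
	buf[len + 1] = '\0';

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	written = write(fd, buf, len + 1);
	close(fd);

	return written == (ssize_t)(len + 1) ? 0 : -1;
}
```

With resctrl mounted, the path would be /sys/fs/resctrl/info/L3_MON/mbm_assign_control.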

> 
>> +
>> +	Format is similar to the list format with addition of opcode for the
>> +	assignment operation.
>> +
>> +		"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>> +
>> +	Format for each type of groups:
> 
> "Format for each type of group"  or "Format of each type of group"?

Sure.
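
The four formats can be told apart purely by which of the two name fields is empty. A small userspace sketch of that classification follows. The enum and the function split_group() are hypothetical names, and the quoted patch uses strsep() rather than strchr().

```c
#include <string.h>

/* Hypothetical labels for the four documented formats. */
enum grp_kind {
	GRP_DEFAULT_CTRL,	/* "//<domain_id><opcode><flags>" */
	GRP_CTRL,		/* "<CTRL_MON group>//..." */
	GRP_DEFAULT_CHILD_MON,	/* "/<MON group>/..." */
	GRP_CHILD_MON,		/* "<CTRL_MON group>/<MON group>/..." */
	GRP_MALFORMED,		/* fewer than two '/' separators */
};

/*
 * Split @line in place at its first two '/' characters. On success @ctrl,
 * @mon and @rest point into @line; empty names are returned as "".
 */
static enum grp_kind split_group(char *line, char **ctrl, char **mon,
				 char **rest)
{
	char *s1 = strchr(line, '/');
	char *s2 = s1 ? strchr(s1 + 1, '/') : NULL;

	if (!s2)
		return GRP_MALFORMED;

	*s1 = '\0';
	*s2 = '\0';
	*ctrl = line;
	*mon = s1 + 1;
	*rest = s2 + 1;

	if (**mon == '\0')
		return **ctrl == '\0' ? GRP_DEFAULT_CTRL : GRP_CTRL;

	return **ctrl == '\0' ? GRP_DEFAULT_CHILD_MON : GRP_CHILD_MON;
}
```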

> 
>> +
>> +        * Default CTRL_MON group:
>> +                "//<domain_id><opcode><flags>"
>> +
>> +        * Non-default CTRL_MON group:
>> +                "<CTRL_MON group>//<domain_id><opcode><flags>"
>> +
>> +        * Child MON group of default CTRL_MON group:
>> +                "/<MON group>/<domain_id><opcode><flags>"
>> +
>> +        * Child MON group of non-default CTRL_MON group:
>> +                "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>> +
>> +	Domain_id '*' will apply the flags on all the domains.
> 
> "apply the flags on all the domains" -> "apply the flags to all the domains"?

Sure.

> 
>> +
>> +	Opcode can be one of the following:
>> +	::
>> +
>> +	 = Update the assignment to match the MBM event.
>> +	 + Assign a new MBM event without impacting existing assignments.
>> +	 - Unassign a MBM event from currently assigned events.
>> +
>> +	Examples:
>> +	Initial group status:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
>> +	  //0=tl;1=tl;
>> +	  /child_default_mon_grp/0=tl;1=tl;
>> +
>> +	To update the default group to assign only total MBM event on domain 0:
>> +	::
>> +
>> +	  # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +
>> +	Assignment status after the update:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
>> +	  //0=t;1=tl;
>> +	  /child_default_mon_grp/0=tl;1=tl;
>> +
>> +	To update the MON group child_default_mon_grp to remove total MBM event on domain 1:
>> +	::
>> +
>> +	  # echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +
>> +	Assignment status after the update:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
>> +	  //0=t;1=tl;
>> +	  /child_default_mon_grp/0=tl;1=l;
>> +
>> +	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to unassign
>> +	both local and total MBM events on domain 1:
>> +	::
>> +
>> +	  # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
>> +			/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +
>> +	Assignment status after the update:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
>> +	  //0=t;1=tl;
>> +	  /child_default_mon_grp/0=tl;1=l;
>> +
>> +	To update the default group to add a local MBM event domain 0.
> 
> "." -> ":"

Sure.

> 
>> +	::
>> +
>> +	  # echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +
>> +	Assignment status after the update:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=tl;1=tl;
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
>> +	  //0=tl;1=tl;
>> +	  /child_default_mon_grp/0=tl;1=l;
>> +
>> +	To update the non default CTRL_MON group non_default_ctrl_mon_grp to unassign all the
>> +	MBM events on all the domains.
> 
> "." -> ":"

Sure.

> 
>> +	::
>> +
>> +	  # echo "non_default_ctrl_mon_grp//*=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +
>> +	Assignment status after the update:
>> +	::
>> +
>> +	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> +	  non_default_ctrl_mon_grp//0=_;1=_;
>> +	  non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
>> +	  //0=tl;1=tl;
>> +	  /child_default_mon_grp/0=tl;1=l;
>> +
>>  "max_threshold_occupancy":
>>  		Read/write file provides the largest value (in
>>  		bytes) at which a previously used LLC_occupancy
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index b90d8c90b4b6..3ccaea6a2803 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -74,6 +74,16 @@
>>   */
>>  #define MBM_EVENT_ARRAY_INDEX(_event) ((_event) - 2)
>>  
>> +/*
>> + * Assignment flags for mbm_cntr_assign feature
>> + */
> 
> "mbm_cntr_assign feature" -> "mbm_cntr_assign mode"?

Sure.

> 
>> +enum {
>> +	ASSIGN_NONE	= 0,
>> +	ASSIGN_TOTAL	= BIT(QOS_L3_MBM_TOTAL_EVENT_ID),
>> +	ASSIGN_LOCAL	= BIT(QOS_L3_MBM_LOCAL_EVENT_ID),
>> +	ASSIGN_INVALID,
>> +};
>> +
>>  /**
>>   * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
>>   *			        aren't marked nohz_full
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 5cc40eacbe85..9fe419d0c536 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1082,6 +1082,239 @@ static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
>>  	return 0;
>>  }
>>  
>> +static int rdtgroup_str_to_mon_state(char *flag)
> 
> It seems strange to me that a variable used to contain flag bits
> is of type int. Why is it not unsigned?

Will change it to "unsigned int"
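
One possible shape for an unsigned variant, as a userspace sketch only. The FLAG_* values and parse_flags() are hypothetical. Returning the error separately means no invalid sentinel has to share the bitmask. Unlike the quoted code, which returns ASSIGN_NONE at the first '_', this sketch rejects '_' combined with other flags, following the documented rule.

```c
#include <string.h>

/* Hypothetical bit values, not the kernel's event IDs. */
#define FLAG_NONE	0u
#define FLAG_TOTAL	(1u << 0)
#define FLAG_LOCAL	(1u << 1)

/*
 * Parse a flag string ("t", "l", "tl", "lt" or "_") into @state.
 * Returns 0 on success and -1 for an empty or invalid string.
 */
static int parse_flags(const char *flags, unsigned int *state)
{
	unsigned int bits = FLAG_NONE;
	size_t i;

	if (flags[0] == '\0')
		return -1;

	/* '_' is only valid on its own. */
	if (strcmp(flags, "_") == 0) {
		*state = FLAG_NONE;
		return 0;
	}

	for (i = 0; flags[i] != '\0'; i++) {
		switch (flags[i]) {
		case 't':
			bits |= FLAG_TOTAL;
			break;
		case 'l':
			bits |= FLAG_LOCAL;
			break;
		default:
			/* Unknown flag, or '_' mixed with other flags. */
			return -1;
		}
	}

	*state = bits;
	return 0;
}
```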

> 
>> +{
>> +	int i, mon_state = ASSIGN_NONE;
>> +
>> +	if (!strlen(flag))
>> +		return ASSIGN_INVALID;
>> +
>> +	for (i = 0; i < strlen(flag); i++) {
>> +		switch (*(flag + i)) {
>> +		case 't':
>> +			mon_state |= ASSIGN_TOTAL;
>> +			break;
>> +		case 'l':
>> +			mon_state |= ASSIGN_LOCAL;
>> +			break;
>> +		case '_':
>> +			return ASSIGN_NONE;
>> +		default:
>> +			return ASSIGN_INVALID;
>> +		}
>> +	}
>> +
>> +	return mon_state;
>> +}
>> +
>> +static struct rdtgroup *rdtgroup_find_grp_by_name(enum rdt_group_type rtype,
>> +						  char *p_grp, char *c_grp)
>> +{
>> +	struct rdtgroup *rdtg, *crg;
>> +
>> +	if (rtype == RDTCTRL_GROUP && *p_grp == '\0') {
>> +		return &rdtgroup_default;
>> +	} else if (rtype == RDTCTRL_GROUP) {
>> +		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list)
>> +			if (!strcmp(p_grp, rdtg->kn->name))
>> +				return rdtg;
>> +	} else if (rtype == RDTMON_GROUP) {
>> +		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
>> +			if (!strcmp(p_grp, rdtg->kn->name)) {
>> +				list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
>> +						    mon.crdtgrp_list) {
>> +					if (!strcmp(c_grp, crg->kn->name))
>> +						return crg;
>> +				}
>> +			}
>> +		}
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static int rdtgroup_process_flags(struct rdt_resource *r,
>> +				  enum rdt_group_type rtype,
>> +				  char *p_grp, char *c_grp, char *tok)
>> +{
>> +	int op, mon_state, assign_state, unassign_state;
> 
> Same comment about type ... these *_state variables are used to contain
> bits representing the flags of the various states. An unsigned variable
> seems more appropriate.

Will change into unsigned int.
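
For reference, the opcode handling in the switch quoted below reduces to a small mapping from (opcode, flags) to an assign mask and an unassign mask. The following userspace sketch restates it with unsigned types. The EV_* values and op_to_masks() are hypothetical names, and unlike the quoted code it returns an error for an unknown opcode instead of falling through.

```c
/* Hypothetical bit values, not the kernel's event IDs. */
#define EV_TOTAL	(1u << 0)
#define EV_LOCAL	(1u << 1)
#define EV_ALL		(EV_TOTAL | EV_LOCAL)

/*
 * Translate @op and the parsed flag bits in @state into the set of events
 * to assign and the set to unassign. Returns 0 on success, -1 when the
 * combination is not allowed ("+_", "-_" or an unknown opcode).
 */
static int op_to_masks(char op, unsigned int state,
		       unsigned int *assign, unsigned int *unassign)
{
	*assign = 0;
	*unassign = 0;

	switch (op) {
	case '+':
		if (!state)
			return -1;
		*assign = state;
		return 0;
	case '-':
		if (!state)
			return -1;
		*unassign = state;
		return 0;
	case '=':
		/* Assign what is listed, unassign everything else. */
		*assign = state;
		*unassign = EV_ALL & ~state;
		return 0;
	default:
		return -1;
	}
}
```

For example, "=t" yields assign = total and unassign = local, while "=_" yields an empty assign mask and unassigns both events.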

> 
>> +	char *dom_str, *id_str, *op_str;
>> +	struct rdt_mon_domain *d;
>> +	struct rdtgroup *rdtgrp;
>> +	unsigned long dom_id;
>> +	int ret, found = 0;
> 
> Could found be boolean?

Sure.

> 
>> +
>> +	rdtgrp = rdtgroup_find_grp_by_name(rtype, p_grp, c_grp);
>> +
>> +	if (!rdtgrp) {
>> +		rdt_last_cmd_puts("Not a valid resctrl group\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +next:
>> +	if (!tok || tok[0] == '\0')
>> +		return 0;
>> +
>> +	/* Start processing the strings for each domain */
>> +	dom_str = strim(strsep(&tok, ";"));
>> +
>> +	op_str = strpbrk(dom_str, "=+-");
>> +
>> +	if (op_str) {
>> +		op = *op_str;
>> +	} else {
>> +		rdt_last_cmd_puts("Missing operation =, +, - character\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	id_str = strsep(&dom_str, "=+-");
>> +
>> +	/* Check for domain id '*' which means all domains */
>> +	if (id_str && *id_str == '*') {
>> +		d = NULL;
>> +		goto check_state;
>> +	} else if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
>> +		rdt_last_cmd_puts("Missing domain id\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	/* Verify if the dom_id is valid */
>> +	list_for_each_entry(d, &r->mon_domains, hdr.list) {
>> +		if (d->hdr.id == dom_id) {
>> +			found = 1;
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (!found) {
>> +		rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
>> +		return -EINVAL;
>> +	}
> 
> I am missing how "found" is handled on second iteration. If an invalid domain
> follows a valid domain it seems like "found" remains set from previous iteration?

Yes. It should be reset. Good catch!
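
A minimal userspace sketch of the intended behaviour, assuming the fix is simply to scope the flag to one iteration. The function first_invalid_domain() and its array-based interface are hypothetical; the kernel code walks r->mon_domains instead.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Check every requested domain id against the list of valid ids.
 * Returns the index of the first invalid request, or -1 if all are valid.
 */
static int first_invalid_domain(const unsigned long *req, size_t nreq,
				const unsigned long *valid, size_t nvalid)
{
	size_t i, j;

	for (i = 0; i < nreq; i++) {
		/* Declared inside the loop, so it starts false every time. */
		bool found = false;

		for (j = 0; j < nvalid; j++) {
			if (valid[j] == req[i]) {
				found = true;
				break;
			}
		}

		if (!found)
			return (int)i;
	}

	return -1;
}
```

With valid ids {0, 1}, a request for domains {0, 7} is rejected at index 1. With a flag that is set once and never cleared, the same request would pass.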

> 
>> +
>> +check_state:
>> +	mon_state = rdtgroup_str_to_mon_state(dom_str);
>> +
>> +	if (mon_state == ASSIGN_INVALID) {
>> +		rdt_last_cmd_puts("Invalid assign flag\n");
>> +		goto out_fail;
>> +	}
>> +
>> +	assign_state = 0;
>> +	unassign_state = 0;
>> +
>> +	switch (op) {
>> +	case '+':
>> +		if (mon_state == ASSIGN_NONE) {
>> +			rdt_last_cmd_puts("Invalid assign opcode\n");
>> +			goto out_fail;
>> +		}
>> +		assign_state = mon_state;
>> +		break;
>> +	case '-':
>> +		if (mon_state == ASSIGN_NONE) {
>> +			rdt_last_cmd_puts("Invalid assign opcode\n");
>> +			goto out_fail;
>> +		}
>> +		unassign_state = mon_state;
>> +		break;
>> +	case '=':
>> +		assign_state = mon_state;
>> +		unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +
>> +	if (unassign_state & ASSIGN_TOTAL) {
>> +		ret = rdtgroup_unassign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +		if (ret)
>> +			goto out_fail;
>> +	}
>> +
>> +	if (unassign_state & ASSIGN_LOCAL) {
>> +		ret = rdtgroup_unassign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +		if (ret)
>> +			goto out_fail;
>> +	}
>> +
>> +	if (assign_state & ASSIGN_TOTAL) {
>> +		ret = rdtgroup_assign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +		if (ret)
>> +			goto out_fail;
>> +	}
>> +
>> +	if (assign_state & ASSIGN_LOCAL) {
>> +		ret = rdtgroup_assign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +		if (ret)
>> +			goto out_fail;
>> +	}
>> +
>> +	goto next;
>> +
>> +out_fail:
>> +	rdt_last_cmd_printf("Assign operation '%c%s' failed on the group %s/%s/\n",
>> +			    op, dom_str, p_grp, c_grp);
>> +
> 
> Can the domain id be printed also? This seems only piece missing to understand what failed.

Sure. I need to add a check for whether it is '*' or an actual domain id.

Will do.
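
A rough userspace sketch of what that could look like. The helper format_assign_error(), its parameters, and the exact message wording are assumptions for illustration. The kernel version would go through rdt_last_cmd_printf().

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Build the failure message including the domain, which is printed as '*'
 * when the request targeted all domains and as a number otherwise.
 * Returns the snprintf() result for the final message.
 */
static int format_assign_error(char *buf, size_t len, bool all_domains,
			       unsigned long dom_id, char op,
			       const char *flags, const char *p_grp,
			       const char *c_grp)
{
	char dom[24];

	if (all_domains)
		snprintf(dom, sizeof(dom), "*");
	else
		snprintf(dom, sizeof(dom), "%lu", dom_id);

	return snprintf(buf, len,
			"Assign operation '%s%c%s' failed on the group %s/%s/\n",
			dom, op, flags, p_grp, c_grp);
}
```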

> 
>> +	return -EINVAL;
>> +}
>> +
>> +static ssize_t rdtgroup_mbm_assign_control_write(struct kernfs_open_file *of,
>> +						 char *buf, size_t nbytes, loff_t off)
>> +{
>> +	struct rdt_resource *r = of->kn->parent->priv;
>> +	char *token, *cmon_grp, *mon_grp;
>> +	enum rdt_group_type rtype;
>> +	int ret;
>> +
>> +	/* Valid input requires a trailing newline */
>> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
>> +		return -EINVAL;
>> +
>> +	buf[nbytes - 1] = '\0';
>> +
>> +	cpus_read_lock();
>> +	mutex_lock(&rdtgroup_mutex);
>> +
>> +	rdt_last_cmd_clear();
>> +
>> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
>> +		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
>> +		mutex_unlock(&rdtgroup_mutex);
>> +		cpus_read_unlock();
>> +		return -EINVAL;
>> +	}
>> +
>> +	while ((token = strsep(&buf, "\n")) != NULL) {
>> +		/*
>> +		 * The write command follows the following format:
>> +		 * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
>> +		 * Extract the CTRL_MON group.
>> +		 */
>> +		cmon_grp = strsep(&token, "/");
>> +
>> +		/*
>> +		 * Extract the MON_GROUP.
>> +		 * strsep returns empty string for contiguous delimiters.
>> +		 * Empty mon_grp here means it is a RDTCTRL_GROUP.
>> +		 */
>> +		mon_grp = strsep(&token, "/");
>> +
>> +		if (*mon_grp == '\0')
>> +			rtype = RDTCTRL_GROUP;
>> +		else
>> +			rtype = RDTMON_GROUP;
>> +
>> +		ret = rdtgroup_process_flags(r, rtype, cmon_grp, mon_grp, token);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	mutex_unlock(&rdtgroup_mutex);
>> +	cpus_read_unlock();
>> +
>> +	return ret ?: nbytes;
>> +}
>> +
>>  #ifdef CONFIG_PROC_CPU_RESCTRL
>>  
>>  /*
>> @@ -2383,9 +2616,10 @@ static struct rftype res_common_files[] = {
>>  	},
>>  	{
>>  		.name		= "mbm_assign_control",
>> -		.mode		= 0444,
>> +		.mode		= 0644,
>>  		.kf_ops		= &rdtgroup_kf_single_ops,
>>  		.seq_show	= rdtgroup_mbm_assign_control_show,
>> +		.write		= rdtgroup_mbm_assign_control_write,
>>  	},
>>  	{
>>  		.name		= "mbm_assign_mode",
> 
> Reinette
> 
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment
  2024-11-20 18:05     ` Moger, Babu
@ 2024-11-21 20:50       ` Reinette Chatre
  2024-11-22 21:04         ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-21 20:50 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/20/24 10:05 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/15/24 18:57, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>> The mbm_cntr_assign mode offers several hardware counters that can be
>>> assigned to an RMID, event pair and monitor the bandwidth as long as it
>>> is assigned.
>>>
>>> Counters are managed at two levels. The global assignment is tracked
>>> using the mbm_cntr_free_map field in the struct resctrl_mon, while
>>> domain-specific assignments are tracked using the mbm_cntr_map field
>>> in the struct rdt_mon_domain. Allocation begins at the global level
>>> and is then applied individually to each domain.
>>>
>>> Introduce an interface to allocate these counters and update the
>>> corresponding domains accordingly.
>>>
>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>> ---
>>
>> ...
>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>> index 00f7bf60e16a..cb496bd97007 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>> @@ -717,6 +717,8 @@ unsigned int mon_event_config_index_get(u32 evtid);
>>>  int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>  			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>>  			     u32 cntr_id, bool assign);
>>> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
>>> +			       struct rdt_mon_domain *d, enum resctrl_event_id evtid);
>>>  void rdt_staged_configs_clear(void);
>>>  bool closid_allocated(unsigned int closid);
>>>  int resctrl_find_cleanest_closid(void);
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index 1b5529c212f5..bc3752967c44 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -1924,6 +1924,93 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>  	return 0;
>>>  }
>>>  
>>> +/*
>>> + * Configure the counter for the event, RMID pair for the domain.
>>> + * Update the bitmap and reset the architectural state.
>>> + */
>>> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>> +			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>> +			       u32 cntr_id, bool assign)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	if (assign)
>>> +		__set_bit(cntr_id, d->mbm_cntr_map);
>>> +	else
>>> +		__clear_bit(cntr_id, d->mbm_cntr_map);
>>> +
>>> +	/*
>>> +	 * Reset the architectural state so that reading of hardware
>>> +	 * counter is not considered as an overflow in next update.
>>> +	 */
>>> +	resctrl_arch_reset_rmid(r, d, closid, rmid, evtid);
>>
>> resctrl_arch_reset_rmid() expects to be run on a CPU that is in the domain
>> @d ... note that after the architectural state is reset it initializes the
>> state by reading the event on the current CPU. By running it here it is
>> run on a random CPU that may not be in the right domain.
> 
> Yes. That is correct.  We can move this part to our earlier
> implementation. We don't need to read the RMID. We just have to reset the
> counter.
> 
> https://lore.kernel.org/lkml/16d88cc4091cef1999b7ec329364e12dd0dc748d.1728495588.git.babu.moger@amd.com/
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 9fe419d0c536..bc3654ec3a08 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2371,6 +2371,13 @@ int resctrl_arch_config_cntr(struct rdt_resource
> *r, struct rdt_mon_domain *d,
>         smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd,
>                               &abmc_cfg, 1);
> 
> +       /*
> +        * Reset the architectural state so that reading of hardware
> +        * counter is not considered as an overflow in next update.
> +        */
> +       if (arch_mbm)
> +               memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
> +
>         return 0;
>  }
> 
>

I am not sure what you envision here. One motivation for the move out of
resctrl_arch_config_cntr() was to avoid architectural state being reset twice. For reference,
mbm_config_write_domain()->resctrl_arch_reset_rmid_all(). Will architectural state
be reset twice again?

One thing that I did not notice before is that the non-architectural MBM state is not
reset. Care should be taken to reset this also when considering that there is a plan
to use that MBM state to build a generic rate event for all platforms:
https://lore.kernel.org/all/CALPaoCgFRFgQqG00Uc0GhMHK47bsbtFw6Bxy5O9A_HeYmGa5sA@mail.gmail.com/

Reinette


* Re: [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes
  2024-11-21  2:14     ` Moger, Babu
@ 2024-11-21 20:58       ` Reinette Chatre
  2024-11-22 20:12         ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-21 20:58 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/20/24 6:14 PM, Moger, Babu wrote:
> On 11/18/2024 1:43 PM, Reinette Chatre wrote:
>> On 10/29/24 4:21 PM, Babu Moger wrote:

>>> +static void resctrl_arch_update_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>> +                     enum resctrl_event_id evtid, u32 rmid,
>>> +                     u32 closid, u32 cntr_id, u32 val)
>>> +{
>>> +    union l3_qos_abmc_cfg abmc_cfg = { 0 };
>>> +
>>> +    abmc_cfg.split.cfg_en = 1;
>>> +    abmc_cfg.split.cntr_en = 1;
>>> +    abmc_cfg.split.cntr_id = cntr_id;
>>> +    abmc_cfg.split.bw_src = rmid;
>>> +    abmc_cfg.split.bw_type = val;
>>> +
>>> +    wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg.full);
>>
>> Is it needed to create an almost duplicate function? What if instead
>> only resctrl_arch_config_cntr() exists and it uses parameter to decide
>> whether to call resctrl_abmc_config_one_amd() directly or via
>> smp_call_function_any()? I think that should help to make clear how
>> the code flows.
>> Also note that this is an almost identical arch callback with no
>> error return. I expect that building on existing resctrl_arch_config_cntr()
>> will make things easier to understand.
> 
> It can be done. But it takes another parameter to the function.
> It has 7 parameters already. This will be 8th.
> Will change it if that is ok.

Please correct me if I am wrong, but I am not aware of a restriction on the number
of parameters. It seems unnecessary to me to create two almost duplicate 7-parameter
functions to avoid one 8-parameter function.
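
To illustrate the shape of a single function with one extra parameter, here is a userspace sketch. Every name in it is hypothetical: program_cntr() stands in for resctrl_abmc_config_one_amd(), call_on_domain_cpu() stands in for smp_call_function_any(), and the counters exist only so the two paths can be observed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the fields programmed for one counter. */
struct cntr_cfg {
	uint32_t cntr_id;
	uint32_t rmid;
	uint32_t bw_type;
};

static unsigned int hw_writes;		/* times the "MSR" was written */
static unsigned int cross_cpu_calls;	/* times a cross-CPU call was used */
static struct cntr_cfg last_programmed;

/* Stand-in for the routine that writes the configuration MSR. */
static void program_cntr(void *info)
{
	last_programmed = *(struct cntr_cfg *)info;
	hw_writes++;
}

/* Stand-in for running @fn on some CPU of the target domain. */
static void call_on_domain_cpu(void (*fn)(void *), void *info)
{
	cross_cpu_calls++;
	fn(info);
}

/*
 * Single entry point: @on_domain_cpu tells whether the caller already runs
 * on a CPU of the target domain, in which case the cross-CPU call is skipped.
 */
static int config_cntr(struct cntr_cfg *cfg, bool on_domain_cpu)
{
	if (on_domain_cpu)
		program_cntr(cfg);
	else
		call_on_domain_cpu(program_cntr, cfg);

	return 0;
}
```

Both callers then share one code path and one error return, and only the boolean differs.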

>> Since MBM_EVENT_ARRAY_INDEX is a macro it can be called closer to where it is used,
>> within rdtgroup_find_grp_by_cntr_id_index(), which prompts reconsidering that function name.
> 
> 
> How about ?
> 
> static struct rdtgroup *rdtgroup_find_grp_by_cntr_id_event(int cntr_id, enum resctrl_event_id evtid)

... or for something shorter just get_rdtgroup_from_cntr_event(), but no hard requirement.

Reinette

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-19 19:20     ` Moger, Babu
@ 2024-11-21 21:12       ` Reinette Chatre
  2024-11-22 23:36         ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-21 21:12 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/19/24 11:20 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/15/24 18:31, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>> Provide the interface to display the number of free monitoring counters
>>> available for assignment in each domain when mbm_cntr_assign is supported.
>>>
>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>> ---
>>> v9: New patch.
>>> ---
>>>  Documentation/arch/x86/resctrl.rst     |  4 ++++
>>>  arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>  3 files changed, 38 insertions(+)
>>>
>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>> index 2f3a86278e84..2bc58d974934 100644
>>> --- a/Documentation/arch/x86/resctrl.rst
>>> +++ b/Documentation/arch/x86/resctrl.rst
>>> @@ -302,6 +302,10 @@ with the following files:
>>>  	memory bandwidth tracking to a single memory bandwidth event per
>>>  	monitoring group.
>>>  
>>> +"available_mbm_cntrs":
>>> +	The number of free monitoring counters available assignment in each domain
>>
>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>> counters available for assignment"?
>>
>> (not taking into account how text may change after addressing Peter's feedback)
> 
> How about this?
> 
> "The number of monitoring counters available for assignment in each domain
> when the architecture supports mbm_cntr_assign mode. A total of
> "num_mbm_cntrs" counters are available for assignment. Counters can be
> assigned or unassigned individually in each domain. A counter is available
> for new assignment if it is unassigned in all domains."

Please consider the context of this paragraph. It follows right after the description
of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
I think it is confusing to follow that with a paragraph that states "Counters can be
assigned or unassigned individually in each domain." I wonder if it may be helpful to
use a different term ... for example, a counter is *assigned* to an event of a monitoring
group, and this assignment may apply to specified domains (not yet supported) or to all
domains (this work), while the counter is only *programmed*/*activated* in specified domains.
Of course, all of this documentation needs to remain coherent if future work decides to
indeed support per-domain assignment.

Reinette

* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-11-18 17:18   ` Reinette Chatre
@ 2024-11-22  0:22     ` Moger, Babu
  2024-11-22  0:26       ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-22  0:22 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	vikas.shivappa, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/18/2024 11:18 AM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> Assign/unassign counters on resctrl group creation/deletion. Two counters
>> are required per group, one for MBM total event and one for MBM local
>> event.
>>
>> There are a limited number of counters available for assignment. If these
>> counters are exhausted, the kernel will display the error message: "Out of
>> MBM assignable counters". However, it is not necessary to fail the
>> creation of a group due to assignment failures. Users have the flexibility
>> to modify the assignments at a later time.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
>>      Updated couple of rdtgroup_unassign_cntrs() calls properly.
>>      Updated function comments.
>>
>> v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
>>      Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
>>      Fixed the problem with unassigning the child MON groups of CTRL_MON group.
>>
>> v7: Reworded the commit message.
>>      Removed the reference of ABMC with mbm_cntr_assign.
>>      Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.
>>
>> v6: Removed the redundant comments on all the calls of
>>      rdtgroup_assign_cntrs. Updated the commit message.
>>      Dropped printing error message on every call of rdtgroup_assign_cntrs.
>>
>> v5: Removed the code to enable/disable ABMC during the mount.
>>      That will be another patch.
>>      Added arch callers to get the arch specific data.
>>      Renamed functions to match the other abmc function.
>>      Added code comments for assignment failures.
>>
>> v4: Few name changes based on the upstream discussion.
>>      Commit message update.
>>
>> v3: This is a new patch. Patch addresses the upstream comment to enable
>>      ABMC feature by default if the feature is available.
>> ---
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++++++-
>>   1 file changed, 60 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index b0cce3dfd062..a8d21b0b2054 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -2932,6 +2932,46 @@ static void schemata_list_destroy(void)
>>   	}
>>   }
>>   
>> +/*
>> + * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
>> + * counters are automatically assigned. Each group can accommodate two counters:
>> + * one for the total event and one for the local event. Assignments may fail
>> + * due to the limited number of counters. However, it is not necessary to fail
>> + * the group creation and thus no failure is returned. Users have the option
>> + * to modify the counter assignments after the group has been created.
>> + */
>> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> +		return;
>> +
>> +	if (is_mbm_total_enabled())
>> +		rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +
>> +	if (is_mbm_local_enabled())
>> +		rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +}
>> +
>> +/*
>> + * Called when a group is deleted. Counters are unassigned if it was in
>> + * assigned state.
>> + */
>> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> +		return;
>> +
>> +	if (is_mbm_total_enabled())
>> +		rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +
>> +	if (is_mbm_local_enabled())
>> +		rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +}
>> +
>>   static int rdt_get_tree(struct fs_context *fc)
>>   {
>>   	struct rdt_fs_context *ctx = rdt_fc2context(fc);
>> @@ -2991,6 +3031,8 @@ static int rdt_get_tree(struct fs_context *fc)
>>   		if (ret < 0)
>>   			goto out_mongrp;
>>   		rdtgroup_default.mon.mon_data_kn = kn_mondata;
>> +
>> +		rdtgroup_assign_cntrs(&rdtgroup_default);
> 
> I think counters should be assigned *before* the files exposing them
> are added to resctrl.

Sure.

> 
>>   	}
>>   
>>   	ret = rdt_pseudo_lock_init();
>> @@ -3021,8 +3063,10 @@ static int rdt_get_tree(struct fs_context *fc)
>>   out_psl:
>>   	rdt_pseudo_lock_release();
>>   out_mondata:
>> -	if (resctrl_arch_mon_capable())
>> +	if (resctrl_arch_mon_capable()) {
>> +		rdtgroup_unassign_cntrs(&rdtgroup_default);
>>   		kernfs_remove(kn_mondata);
> 
> ... and here remove the files before taking away the data exposed by them.

Sure.

> 
>> +	}
>>   out_mongrp:
>>   	if (resctrl_arch_mon_capable())
>>   		kernfs_remove(kn_mongrp);
>> @@ -3201,6 +3245,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
>>   
>>   	head = &rdtgrp->mon.crdtgrp_list;
>>   	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
>> +		rdtgroup_unassign_cntrs(sentry);
>>   		free_rmid(sentry->closid, sentry->mon.rmid);
>>   		list_del(&sentry->mon.crdtgrp_list);
>>   
>> @@ -3241,6 +3286,8 @@ static void rmdir_all_sub(void)
>>   		cpumask_or(&rdtgroup_default.cpu_mask,
>>   			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>>   
>> +		rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>   		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>   
>>   		kernfs_remove(rdtgrp->kn);
>> @@ -3272,6 +3319,7 @@ static void rdt_kill_sb(struct super_block *sb)
>>   	for_each_alloc_capable_rdt_resource(r)
>>   		reset_all_ctrls(r);
>>   	rmdir_all_sub();
>> +	rdtgroup_unassign_cntrs(&rdtgroup_default);
>>   	rdt_pseudo_lock_release();
>>   	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>>   	schemata_list_destroy();
>> @@ -3280,6 +3328,7 @@ static void rdt_kill_sb(struct super_block *sb)
>>   		resctrl_arch_disable_alloc();
>>   	if (resctrl_arch_mon_capable())
>>   		resctrl_arch_disable_mon();
>> +
>>   	resctrl_mounted = false;
>>   	kernfs_kill_sb(sb);
>>   	mutex_unlock(&rdtgroup_mutex);
> 
> Unnecessary hunk.

ok

> 
>> @@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
>>   		goto out_unlock;
>>   	}
>>   
>> +	rdtgroup_assign_cntrs(rdtgrp);
>> +
>>   	kernfs_activate(rdtgrp->kn);
>>   
>>   	/*
>> @@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>>   	if (ret)
>>   		goto out_closid_free;
>>   
>> +	rdtgroup_assign_cntrs(rdtgrp);
>> +
>>   	kernfs_activate(rdtgrp->kn);
>>   
>>   	ret = rdtgroup_init_alloc(rdtgrp);
> 
> Please compare the above two hunks with earlier "x86/resctrl: Introduce cntr_id in mongroup for assignments".
> Earlier patch initializes the counters within mkdir_rdt_prepare_rmid_alloc() while the above
> hunk assigns the counters after mkdir_rdt_prepare_rmid_alloc() is called. Could this fragmentation be avoided
> with init done once within mkdir_rdt_prepare_rmid_alloc()?

It seems more appropriate to call rdtgroup_cntr_id_init() inside 
mkdir_rdt_prepare(). This will initialize the cntr_id to MON_CNTR_UNSET.

And then call rdtgroup_assign_cntrs() inside mkdir_rdt_prepare_rmid_alloc().

what do you think?


> 
>> @@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>>   out_del_list:
>>   	list_del(&rdtgrp->rdtgroup_list);
>>   out_rmid_free:
>> +	rdtgroup_unassign_cntrs(rdtgrp);
>>   	mkdir_rdt_prepare_rmid_free(rdtgrp);
>>   out_closid_free:
>>   	closid_free(closid);
>> @@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>>   	update_closid_rmid(tmpmask, NULL);
>>   
>>   	rdtgrp->flags = RDT_DELETED;
>> +
>> +	rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>   	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>   
>>   	/*
>> @@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>>   	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>>   	update_closid_rmid(tmpmask, NULL);
>>   
>> +	rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>   	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>   	closid_free(rdtgrp->closid);
>>   
> 
> There is a potential problem here. rdtgroup_unassign_cntrs() attempts to remove counter from
> all domains associated with the resource group. This may fail in any of the domains that results
> in the counter not being marked as free in the global map and not reset the counter in the
> resource group ... but the resource group is removed right after calling rdtgroup_unassign_cntrs().
> In this scenario there is thus a counter that is considered to be in use but not assigned to any
> resource group.
> 
> From what I can tell there is a difference here between default resource group and the others:
> on remount of resctrl the default resource group will maintain knowledge of the counter that could
> not be unassigned. This means that unmount/remount of resctrl does not provide a real "clean slate"
> when it comes to counter assignment. Is this intended?
> 

Yes. Agree. It is not intended.

How about adding bitmap_zero() inside rdt_get_tree() to address this 
problem? Also need to reset the cntr_id in rdt_kill_sb().
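
A minimal user-space sketch of the scenario discussed above, not kernel code. It assumes a single 32-bit word as the global free map and one counter per group. All toy_* names, NUM_CNTRS, and the value of MON_CNTR_UNSET are hypothetical and chosen only for illustration. It shows how a failed unassign leaves a counter marked in use, and how an unconditional reset (the idea behind clearing the map and resetting the default group's cntr_id) restores a clean slate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CNTRS      32      /* hypothetical number of counters */
#define MON_CNTR_UNSET (-1)    /* name from the thread; value assumed */

/* Toy stand-ins for a resource group and the global counter map. */
struct toy_group { int cntr_id; };
static uint32_t toy_in_use;    /* bit n set: counter n is considered in use */

/* Take the first free counter; returns -1 when none is left. */
static int toy_assign(struct toy_group *g)
{
	for (int i = 0; i < NUM_CNTRS; i++) {
		if (!(toy_in_use & (UINT32_C(1) << i))) {
			toy_in_use |= UINT32_C(1) << i;
			g->cntr_id = i;
			return 0;
		}
	}
	return -1;
}

/* Release a counter; if a domain fails, nothing is released. */
static int toy_unassign(struct toy_group *g, bool domain_fails)
{
	if (g->cntr_id == MON_CNTR_UNSET)
		return 0;
	if (domain_fails)
		return -1;     /* map bit and cntr_id both stay set */
	assert(g->cntr_id >= 0 && g->cntr_id < NUM_CNTRS);
	toy_in_use &= ~(UINT32_C(1) << g->cntr_id);
	g->cntr_id = MON_CNTR_UNSET;
	return 0;
}

/* Unconditional reset, independent of whether unassign succeeded. */
static void toy_reset_all(struct toy_group *default_grp)
{
	toy_in_use = 0;
	default_grp->cntr_id = MON_CNTR_UNSET;
}
```

In this model, calling toy_unassign() with domain_fails set and then dropping the group leaves a bit set in toy_in_use that no group owns; toy_reset_all() clears it regardless.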

-- 
- Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-11-22  0:22     ` Moger, Babu
@ 2024-11-22  0:26       ` Moger, Babu
  2024-11-22 18:12         ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-22  0:26 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky



On 11/21/2024 6:22 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/18/2024 11:18 AM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>> Assign/unassign counters on resctrl group creation/deletion. Two 
>>> counters
>>> are required per group, one for MBM total event and one for MBM local
>>> event.
>>>
>>> There are a limited number of counters available for assignment. If 
>>> these
>>> counters are exhausted, the kernel will display the error message: 
>>> "Out of
>>> MBM assignable counters". However, it is not necessary to fail the
>>> creation of a group due to assignment failures. Users have the 
>>> flexibility
>>> to modify the assignments at a later time.
>>>
>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>> ---
>>> v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to 
>>> return void.
>>>      Updated couple of rdtgroup_unassign_cntrs() calls properly.
>>>      Updated function comments.
>>>
>>> v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
>>>      Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
>>>      Fixed the problem with unassigning the child MON groups of 
>>> CTRL_MON group.
>>>
>>> v7: Reworded the commit message.
>>>      Removed the reference of ABMC with mbm_cntr_assign.
>>>      Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.
>>>
>>> v6: Removed the redundant comments on all the calls of
>>>      rdtgroup_assign_cntrs. Updated the commit message.
>>>      Dropped printing error message on every call of 
>>> rdtgroup_assign_cntrs.
>>>
>>> v5: Removed the code to enable/disable ABMC during the mount.
>>>      That will be another patch.
>>>      Added arch callers to get the arch specific data.
>>>      Renamed functions to match the other abmc function.
>>>      Added code comments for assignment failures.
>>>
>>> v4: Few name changes based on the upstream discussion.
>>>      Commit message update.
>>>
>>> v3: This is a new patch. Patch addresses the upstream comment to enable
>>>      ABMC feature by default if the feature is available.
>>> ---
>>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++++++-
>>>   1 file changed, 60 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c 
>>> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index b0cce3dfd062..a8d21b0b2054 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -2932,6 +2932,46 @@ static void schemata_list_destroy(void)
>>>       }
>>>   }
>>> +/*
>>> + * Called when a new group is created. If "mbm_cntr_assign" mode is 
>>> enabled,
>>> + * counters are automatically assigned. Each group can accommodate 
>>> two counters:
>>> + * one for the total event and one for the local event. Assignments 
>>> may fail
>>> + * due to the limited number of counters. However, it is not 
>>> necessary to fail
>>> + * the group creation and thus no failure is returned. Users have 
>>> the option
>>> + * to modify the counter assignments after the group has been created.
>>> + */
>>> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
>>> +{
>>> +    struct rdt_resource *r = 
>>> &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>> +
>>> +    if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>>> +        return;
>>> +
>>> +    if (is_mbm_total_enabled())
>>> +        rdtgroup_assign_cntr_event(r, rdtgrp, NULL, 
>>> QOS_L3_MBM_TOTAL_EVENT_ID);
>>> +
>>> +    if (is_mbm_local_enabled())
>>> +        rdtgroup_assign_cntr_event(r, rdtgrp, NULL, 
>>> QOS_L3_MBM_LOCAL_EVENT_ID);
>>> +}
>>> +
>>> +/*
>>> + * Called when a group is deleted. Counters are unassigned if it was in
>>> + * assigned state.
>>> + */
>>> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
>>> +{
>>> +    struct rdt_resource *r = 
>>> &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>> +
>>> +    if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>>> +        return;
>>> +
>>> +    if (is_mbm_total_enabled())
>>> +        rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, 
>>> QOS_L3_MBM_TOTAL_EVENT_ID);
>>> +
>>> +    if (is_mbm_local_enabled())
>>> +        rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, 
>>> QOS_L3_MBM_LOCAL_EVENT_ID);
>>> +}
>>> +
>>>   static int rdt_get_tree(struct fs_context *fc)
>>>   {
>>>       struct rdt_fs_context *ctx = rdt_fc2context(fc);
>>> @@ -2991,6 +3031,8 @@ static int rdt_get_tree(struct fs_context *fc)
>>>           if (ret < 0)
>>>               goto out_mongrp;
>>>           rdtgroup_default.mon.mon_data_kn = kn_mondata;
>>> +
>>> +        rdtgroup_assign_cntrs(&rdtgroup_default);
>>
>> I think counters should be assigned *before* the files exposing them
>> are added to resctrl.
> 
> Sure.
> 
>>
>>>       }
>>>       ret = rdt_pseudo_lock_init();
>>> @@ -3021,8 +3063,10 @@ static int rdt_get_tree(struct fs_context *fc)
>>>   out_psl:
>>>       rdt_pseudo_lock_release();
>>>   out_mondata:
>>> -    if (resctrl_arch_mon_capable())
>>> +    if (resctrl_arch_mon_capable()) {
>>> +        rdtgroup_unassign_cntrs(&rdtgroup_default);
>>>           kernfs_remove(kn_mondata);
>>
>> ... and here remove the files before taking away the data exposed by 
>> them.
> 
> Sure.
> 
>>
>>> +    }
>>>   out_mongrp:
>>>       if (resctrl_arch_mon_capable())
>>>           kernfs_remove(kn_mongrp);
>>> @@ -3201,6 +3245,7 @@ static void free_all_child_rdtgrp(struct 
>>> rdtgroup *rdtgrp)
>>>       head = &rdtgrp->mon.crdtgrp_list;
>>>       list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
>>> +        rdtgroup_unassign_cntrs(sentry);
>>>           free_rmid(sentry->closid, sentry->mon.rmid);
>>>           list_del(&sentry->mon.crdtgrp_list);
>>> @@ -3241,6 +3286,8 @@ static void rmdir_all_sub(void)
>>>           cpumask_or(&rdtgroup_default.cpu_mask,
>>>                  &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>>> +        rdtgroup_unassign_cntrs(rdtgrp);
>>> +
>>>           free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>           kernfs_remove(rdtgrp->kn);
>>> @@ -3272,6 +3319,7 @@ static void rdt_kill_sb(struct super_block *sb)
>>>       for_each_alloc_capable_rdt_resource(r)
>>>           reset_all_ctrls(r);
>>>       rmdir_all_sub();
>>> +    rdtgroup_unassign_cntrs(&rdtgroup_default);
>>>       rdt_pseudo_lock_release();
>>>       rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>>>       schemata_list_destroy();
>>> @@ -3280,6 +3328,7 @@ static void rdt_kill_sb(struct super_block *sb)
>>>           resctrl_arch_disable_alloc();
>>>       if (resctrl_arch_mon_capable())
>>>           resctrl_arch_disable_mon();
>>> +
>>>       resctrl_mounted = false;
>>>       kernfs_kill_sb(sb);
>>>       mutex_unlock(&rdtgroup_mutex);
>>
>> Unnecessary hunk.
> 
> ok
> 
>>
>>> @@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct 
>>> kernfs_node *parent_kn,
>>>           goto out_unlock;
>>>       }
>>> +    rdtgroup_assign_cntrs(rdtgrp);
>>> +
>>>       kernfs_activate(rdtgrp->kn);
>>>       /*
>>> @@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct 
>>> kernfs_node *parent_kn,
>>>       if (ret)
>>>           goto out_closid_free;
>>> +    rdtgroup_assign_cntrs(rdtgrp);
>>> +
>>>       kernfs_activate(rdtgrp->kn);
>>>       ret = rdtgroup_init_alloc(rdtgrp);
>>
>> Please compare the above two hunks with earlier "x86/resctrl: 
>> Introduce cntr_id in mongroup for assignments".
>> Earlier patch initializes the counters within 
>> mkdir_rdt_prepare_rmid_alloc() while the above
>> hunk assigns the counters after mkdir_rdt_prepare_rmid_alloc() is 
>> called. Could this fragmentation be avoided
>> with init done once within mkdir_rdt_prepare_rmid_alloc()?
> 
> It seems more appropriate to call rdtgroup_cntr_id_init() inside 
> mkdir_rdt_prepare(). This will initialize the cntr_id to MON_CNTR_UNSET.
> 
> And then call rdtgroup_assign_cntrs() inside 
> mkdir_rdt_prepare_rmid_alloc().
> 
> what do you think?
> 
> 
>>
>>> @@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct 
>>> kernfs_node *parent_kn,
>>>   out_del_list:
>>>       list_del(&rdtgrp->rdtgroup_list);
>>>   out_rmid_free:
>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>>       mkdir_rdt_prepare_rmid_free(rdtgrp);
>>>   out_closid_free:
>>>       closid_free(closid);
>>> @@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup 
>>> *rdtgrp, cpumask_var_t tmpmask)
>>>       update_closid_rmid(tmpmask, NULL);
>>>       rdtgrp->flags = RDT_DELETED;
>>> +
>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>> +
>>>       free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>       /*
>>> @@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup 
>>> *rdtgrp, cpumask_var_t tmpmask)
>>>       cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>>>       update_closid_rmid(tmpmask, NULL);
>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>> +
>>>       free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>       closid_free(rdtgrp->closid);
>>
>> There is a potential problem here. rdtgroup_unassign_cntrs() attempts 
>> to remove counter from
>> all domains associated with the resource group. This may fail in any 
>> of the domains that results
>> in the counter not being marked as free in the global map and not 
>> reset the counter in the
>> resource group ... but the resource group is removed right after 
>> calling rdtgroup_unassign_cntrs().
>> In this scenario there is thus a counter that is considered to be in 
>> use but not assigned to any
>> resource group.
>>
>> From what I can tell there is a difference here between default 
>> resource group and the others:
>> on remount of resctrl the default resource group will maintain 
>> knowledge of the counter that could
>> not be unassigned. This means that unmount/remount of resctrl does not 
>> provide a real "clean slate"
>> when it comes to counter assignment. Is this intended?
>>
> 
> Yes. Agree. It is not intended.
> 
> How about adding bitmap_zero() inside rdt_get_tree() to address this 
> problem? Also need to reset the cntr_id in rdt_kill_sb().

I meant reset the cntr_id for the default group in rdt_kill_sb().
-- 
- Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-11-22  0:26       ` Moger, Babu
@ 2024-11-22 18:12         ` Reinette Chatre
  2024-11-22 21:34           ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-22 18:12 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/21/24 4:26 PM, Moger, Babu wrote:
> On 11/21/2024 6:22 PM, Moger, Babu wrote:
>> On 11/18/2024 11:18 AM, Reinette Chatre wrote:
>>> On 10/29/24 4:21 PM, Babu Moger wrote:

>>>> @@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
>>>>           goto out_unlock;
>>>>       }
>>>> +    rdtgroup_assign_cntrs(rdtgrp);
>>>> +
>>>>       kernfs_activate(rdtgrp->kn);
>>>>       /*
>>>> @@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>>>>       if (ret)
>>>>           goto out_closid_free;
>>>> +    rdtgroup_assign_cntrs(rdtgrp);
>>>> +
>>>>       kernfs_activate(rdtgrp->kn);
>>>>       ret = rdtgroup_init_alloc(rdtgrp);
>>>
>>> Please compare the above two hunks with earlier "x86/resctrl: Introduce cntr_id in mongroup for assignments".
>>> Earlier patch initializes the counters within mkdir_rdt_prepare_rmid_alloc() while the above
>>> hunk assigns the counters after mkdir_rdt_prepare_rmid_alloc() is called. Could this fragmentation be avoided
>>> with init done once within mkdir_rdt_prepare_rmid_alloc()?
>>
>> It seems more appropriate to call rdtgroup_cntr_id_init() inside mkdir_rdt_prepare(). This will initialize the cntr_id to MON_CNTR_UNSET.
>>
>> And then call rdtgroup_assign_cntrs() inside mkdir_rdt_prepare_rmid_alloc().
>>
>> what do you think?

Taking a closer look this seems most appropriate. mkdir_rdt_prepare() is where the resource group
is created and all fields initialized, control and monitoring (irrespective of monitoring enabled). 
Doing the MON_CNTR_UNSET initialization in that central place seems good.
Yes, and then assigning the counters in mkdir_rdt_prepare_rmid_alloc() sounds good.
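
The ordering agreed on above can be sketched as a toy user-space model, not the actual resctrl code. toy_prepare() stands in for mkdir_rdt_prepare() and toy_rmid_alloc() for mkdir_rdt_prepare_rmid_alloc(). The struct layout, the allocator state, and the value of TOY_CNTR_UNSET are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CNTR_UNSET (-1)   /* stands in for MON_CNTR_UNSET; value assumed */
#define TOY_NUM_EVENTS 2      /* one slot for total, one for local */

struct toy_group {
	int rmid;
	int cntr_id[TOY_NUM_EVENTS];
};

static int toy_next_cntr;     /* hypothetical allocator state */
static int toy_max_cntrs = 2; /* hypothetical number of counters */

/* Step 1: initialize every field once, whether or not monitoring is on. */
static void toy_prepare(struct toy_group *g)
{
	assert(g);
	g->rmid = -1;
	for (int i = 0; i < TOY_NUM_EVENTS; i++)
		g->cntr_id[i] = TOY_CNTR_UNSET;
}

/* Step 2: allocate the RMID first, then try to assign counters. */
static int toy_rmid_alloc(struct toy_group *g, bool mon_capable, int rmid)
{
	if (!mon_capable)
		return 0;
	g->rmid = rmid;
	for (int i = 0; i < TOY_NUM_EVENTS; i++) {
		if (toy_next_cntr < toy_max_cntrs)
			g->cntr_id[i] = toy_next_cntr++;
		/* Otherwise leave the slot unset; group creation still succeeds. */
	}
	return 0;
}
```

Because initialization happens in one place, a group that gets no counter still has well-defined cntr_id values when later code inspects it.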

>>>> @@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>>>>   out_del_list:
>>>>       list_del(&rdtgrp->rdtgroup_list);
>>>>   out_rmid_free:
>>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>>>       mkdir_rdt_prepare_rmid_free(rdtgrp);
>>>>   out_closid_free:
>>>>       closid_free(closid);
>>>> @@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>>>>       update_closid_rmid(tmpmask, NULL);
>>>>       rdtgrp->flags = RDT_DELETED;
>>>> +
>>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>>> +
>>>>       free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>>       /*
>>>> @@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>>>>       cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>>>>       update_closid_rmid(tmpmask, NULL);
>>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>>> +
>>>>       free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>>       closid_free(rdtgrp->closid);
>>>
>>> There is a potential problem here. rdtgroup_unassign_cntrs() attempts to remove counter from
>>> all domains associated with the resource group. This may fail in any of the domains that results
>>> in the counter not being marked as free in the global map and not reset the counter in the
>>> resource group ... but the resource group is removed right after calling rdtgroup_unassign_cntrs().
>>> In this scenario there is thus a counter that is considered to be in use but not assigned to any
>>> resource group.
>>>
>>> From what I can tell there is a difference here between default resource group and the others:
>>> on remount of resctrl the default resource group will maintain knowledge of the counter that could
>>> not be unassigned. This means that unmount/remount of resctrl does not provide a real "clean slate"
>>> when it comes to counter assignment. Is this intended?
>>>
>>
>> Yes. Agree. It is not intended.
>>
>> How about adding bitmap_zero() inside rdt_get_tree() to address this problem? Also need to reset the cntr_id in rdt_kill_sb().
> 
> I meant reset the cntr_id for the default group in rdt_kill_sb().

Doing the cntr_id reset like this matches the custom of resetting to defaults in rdt_kill_sb(). I am not sure
what you envision with the bitmap_zero() in rdt_get_tree() ... I wonder if it may not just be simpler to
call mbm_cntr_reset() from rdt_kill_sb()? This does raise the question if mbm_cntr_reset() should reset
architectural state. I do not think it does harm, the state will just be reset again when the mon dirs are
created?

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-18 22:07       ` Reinette Chatre
@ 2024-11-22 18:25         ` Moger, Babu
  2024-11-22 21:37           ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-22 18:25 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/18/2024 4:07 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/18/24 11:04 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/15/24 18:00, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>>>> supported.
>>>>
>>>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>>>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>>>
>>>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>>>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>>>
>>>> The "default" mode is the existing monitoring mode that works without the
>>>> explicit counter assignment, instead relying on dynamic counter assignment
>>>> by hardware that may result in hardware not dedicating a counter resulting
>>>> in monitoring data reads returning "Unavailable".
>>>>
>>>> Provide an interface to display the monitor mode on the system.
>>>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>> [mbm_cntr_assign]
>>>> default
>>>>
>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>> ---
> 
> ...
> 
>>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>>> to make the event data "more predictable" and then be concerned when the mode does
>>> not exist.
>>>
>>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
>>> num_mbm_cntrs to num_rmids).
>>
>> There is a roundabout (or hacky) way to find out the number of RMIDs
>> that can be active.
> 
> Does this give consistent and accurate data? Is this something that can be added to resctrl?
> (Reading your other message [1] it does not sound as though it can produce an accurate
> number on boot.)
> If not then it will be up to the documentation to be accurate.
> 
> 
>>>> +
>>>> +	AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>>> +	enable this mode by default so that counters remain assigned even when the
>>>> +	corresponding RMID is not in use by any processor.
>>>> +
>>>> +	"default":
>>>> +
>>>> +	In default mode resctrl assumes there is a hardware counter for each
>>>> +	event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>>> +	mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>>> +	with that event.
>>>
>>> If I understand correctly, on AMD platforms without ABMC the events only report
>>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>>> is unassigned and then reassigned then the event count will reset and the user
>>> will get some data back but it may thus be unpredictable (to match earlier language).
>>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
>>
>> Yes. All the AMD systems without ABMC are affected by this problem.
>>
>>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>>> available, while not be made concerned to use "default" mode on Intel where
>>> mbm_assign_mode is not available.
>>
>> Can we add text to clarify this?
> 
> Please do.

I think we need to add text about AMD systems. How about this?

"default":
In default mode resctrl assumes there is a hardware counter for each
event within every CTRL_MON and MON group. On AMD systems with 16 or more 
monitoring groups, reading mbm_total_bytes or mbm_local_bytes may report 
'Unavailable' if there is no counter associated with that event. It is 
therefore recommended to use the 'mbm_cntr_assign' mode, if supported."
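
For tooling that wants to follow this recommendation, a small sketch of how the listing quoted earlier in this message could be parsed. It assumes, based only on the sample output, that the bracketed entry marks the mode currently in effect. The helper name active_mode() is hypothetical and the function works on an in-memory string rather than reading the file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed entry of an mbm_assign_mode listing into out.
 * Returns 0 on success, -1 if there is no "[...]" entry or it does not fit.
 */
static int active_mode(const char *listing, char *out, size_t outlen)
{
	const char *open, *close;
	size_t len;

	assert(listing && out);
	open = strchr(listing, '[');
	if (!open)
		return -1;
	close = strchr(open, ']');
	if (!close)
		return -1;
	len = (size_t)(close - open - 1);
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, open + 1, len);   /* text between the brackets */
	out[len] = '\0';
	return 0;
}
```

A caller would read the contents of info/L3_MON/mbm_assign_mode into a buffer and pass it in; a -1 result means no bracketed entry was found in that buffer.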

> 
> Reinette
> 
> [1] https://lore.kernel.org/all/35fc70fd-0281-4ac8-b32b-efa2f4516901@amd.com/
> 

-- 
- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-11-21 20:18       ` Reinette Chatre
@ 2024-11-22 18:54         ` Moger, Babu
  2024-11-22 21:52           ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-22 18:54 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/21/2024 2:18 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/19/24 12:12 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/15/24 18:44, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>> The ABMC feature provides an option to the user to assign a hardware
>>>> counter to an RMID, event pair and monitor the bandwidth as long as it is
>>>> assigned. The assigned RMID will be tracked by the hardware until the user
>>>> unassigns it manually.
>>>>
>>>> Counters are configured by writing to L3_QOS_ABMC_CFG MSR and
>>>> specifying the counter id, bandwidth source, and bandwidth types.
>>>
>>> needs imperative tone
>>
>> How about this?
>>
>> Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and
>> specifying the counter ID, bandwidth source, and bandwidth types.
>>
> 
> ok with me. Exactly what ChatGPT suggests.

Hmm. ):

> 
> Please do note that that first paragraph informs reader that
> a counter is assigned by user to "an RMID, event pair" while the hardware is configured with
> "the counter ID, bandwidth source, and bandwidth types". There thus does not seem
> to be a clear connection between what user assigns and what is programmed to hardware.
> 

Adding RMID in the text might help.

Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and 
specifying the counter ID, RMID, bandwidth source, and bandwidth types.


-- 
- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes
  2024-11-21 20:58       ` Reinette Chatre
@ 2024-11-22 20:12         ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-22 20:12 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/21/2024 2:58 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/20/24 6:14 PM, Moger, Babu wrote:
>> On 11/18/2024 1:43 PM, Reinette Chatre wrote:
>>> On 10/29/24 4:21 PM, Babu Moger wrote:
> 
>>>> +static void resctrl_arch_update_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>> +                     enum resctrl_event_id evtid, u32 rmid,
>>>> +                     u32 closid, u32 cntr_id, u32 val)
>>>> +{
>>>> +    union l3_qos_abmc_cfg abmc_cfg = { 0 };
>>>> +
>>>> +    abmc_cfg.split.cfg_en = 1;
>>>> +    abmc_cfg.split.cntr_en = 1;
>>>> +    abmc_cfg.split.cntr_id = cntr_id;
>>>> +    abmc_cfg.split.bw_src = rmid;
>>>> +    abmc_cfg.split.bw_type = val;
>>>> +
>>>> +    wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg.full);
>>>
>>> Is it needed to create an almost duplicate function? What if instead
>>> only resctrl_arch_config_cntr() exists and it uses parameter to decide
>>> whether to call resctrl_abmc_config_one_amd() directly or via
>>> smp_call_function_any()? I think that should help to make clear how
>>> the code flows.
>>> Also note that this is an almost identical arch callback with no
>>> error return. I expect that building on existing resctrl_arch_config_cntr()
>>> will make things easier to understand.
>>
>> It can be done. But it takes another parameter to the function.
>> It has 7 parameters already. This will be 8th.
>> Will change it if that is ok.
> 
> Please correct me if I am wrong but I am not familiar with a restriction on number
> of parameters. It seems unnecessary to me to create two almost duplicate 7 parameter
> functions to avoid one 8 parameter function.

I don't see any hard requirement. Will add one parameter to select
between the smp call and the direct call.
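
For illustration only, the single-entry-point idea could be modelled in
userspace roughly as follows. This is a sketch under assumptions, not the
kernel code: struct cntr_cfg, config_one(), call_on_domain_cpu() and the
on_domain_cpu parameter are hypothetical stand-ins for the ABMC config
value, resctrl_abmc_config_one_amd(), smp_call_function_any() and the
proposed extra parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the counter configuration written to hardware. */
struct cntr_cfg {
	uint32_t cntr_id;
	uint32_t rmid;
	uint32_t evt_cfg;
	bool enable;
};

/* Bookkeeping for the model only: last value "written" and call counts. */
static struct cntr_cfg last_cfg;
static int direct_calls;
static int cross_calls;

/* Models resctrl_abmc_config_one_amd(): performs the write on this CPU. */
static void config_one(void *info)
{
	last_cfg = *(struct cntr_cfg *)info;
}

/* Models smp_call_function_any(): runs fn on some CPU of the domain. */
static void call_on_domain_cpu(void (*fn)(void *), void *info)
{
	cross_calls++;
	fn(info);
}

/*
 * Single entry point. @on_domain_cpu is the hypothetical extra parameter:
 * true when the caller already runs on a CPU in the target domain.
 */
static int config_cntr(uint32_t rmid, uint32_t cntr_id, uint32_t evt_cfg,
		       bool assign, bool on_domain_cpu)
{
	struct cntr_cfg cfg = {
		.cntr_id = cntr_id,
		.rmid = rmid,
		.evt_cfg = evt_cfg,
		.enable = assign,
	};

	assert(cntr_id < 32);	/* model limit, not a hardware statement */

	if (on_domain_cpu) {
		direct_calls++;
		config_one(&cfg);
	} else {
		call_on_domain_cpu(config_one, &cfg);
	}

	return 0;
}
```

With this shape both paths share one body, so only the dispatch differs
between a caller that is already on a domain CPU and one that is not.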
> 
>>> Since MBM_EVENT_ARRAY_INDEX is a macro it can be called closer to where it is used,
>>> within  rdtgroup_find_grp_by_cntr_id_index(), which prompts a reconsider of that function name.
>>
>>
>> How about ?
>>
>> static struct rdtgroup *rdtgroup_find_grp_by_cntr_id_event(int cntr_id, enum resctrl_event_id evtid)
> 
> ... or for something shorter just get_rdtgroup_from_cntr_event(), but no hard requirement.
> 
Sure
Thanks
-- 
- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment
  2024-11-21 20:50       ` Reinette Chatre
@ 2024-11-22 21:04         ` Moger, Babu
  2024-11-22 22:07           ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-22 21:04 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/21/2024 2:50 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/20/24 10:05 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/15/24 18:57, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>> The mbm_cntr_assign mode offers several hardware counters that can be
>>>> assigned to an RMID, event pair and monitor the bandwidth as long as it
>>>> is assigned.
>>>>
>>>> Counters are managed at two levels. The global assignment is tracked
>>>> using the mbm_cntr_free_map field in the struct resctrl_mon, while
>>>> domain-specific assignments are tracked using the mbm_cntr_map field
>>>> in the struct rdt_mon_domain. Allocation begins at the global level
>>>> and is then applied individually to each domain.
>>>>
>>>> Introduce an interface to allocate these counters and update the
>>>> corresponding domains accordingly.
>>>>
>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>> ---
>>>
>>> ...
>>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>>> index 00f7bf60e16a..cb496bd97007 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>>> @@ -717,6 +717,8 @@ unsigned int mon_event_config_index_get(u32 evtid);
>>>>   int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>   			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>>>   			     u32 cntr_id, bool assign);
>>>> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
>>>> +			       struct rdt_mon_domain *d, enum resctrl_event_id evtid);
>>>>   void rdt_staged_configs_clear(void);
>>>>   bool closid_allocated(unsigned int closid);
>>>>   int resctrl_find_cleanest_closid(void);
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> index 1b5529c212f5..bc3752967c44 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> @@ -1924,6 +1924,93 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>   	return 0;
>>>>   }
>>>>   
>>>> +/*
>>>> + * Configure the counter for the event, RMID pair for the domain.
>>>> + * Update the bitmap and reset the architectural state.
>>>> + */
>>>> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>> +			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>>> +			       u32 cntr_id, bool assign)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	if (assign)
>>>> +		__set_bit(cntr_id, d->mbm_cntr_map);
>>>> +	else
>>>> +		__clear_bit(cntr_id, d->mbm_cntr_map);
>>>> +
>>>> +	/*
>>>> +	 * Reset the architectural state so that reading of hardware
>>>> +	 * counter is not considered as an overflow in next update.
>>>> +	 */
>>>> +	resctrl_arch_reset_rmid(r, d, closid, rmid, evtid);
>>>
>>> resctrl_arch_reset_rmid() expects to be run on a CPU that is in the domain
>>> @d ... note that after the architectural state is reset it initializes the
>>> state by reading the event on the current CPU. By running it here it is
>>> run on a random CPU that may not be in the right domain.
>>
>> Yes. That is correct.  We can move this part to our earlier
>> implementation. We dont need to read the RMID.  We just have to reset the
>> counter.
>>
>> https://lore.kernel.org/lkml/16d88cc4091cef1999b7ec329364e12dd0dc748d.1728495588.git.babu.moger@amd.com/
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 9fe419d0c536..bc3654ec3a08 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -2371,6 +2371,13 @@ int resctrl_arch_config_cntr(struct rdt_resource
>> *r, struct rdt_mon_domain *d,
>>          smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd,
>>                                &abmc_cfg, 1);
>>
>> +       /*
>> +        * Reset the architectural state so that reading of hardware
>> +        * counter is not considered as an overflow in next update.
>> +        */
>> +       if (arch_mbm)
>> +               memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
>> +
>>          return 0;
>>   }
>>
>>
> 
> I am not sure what you envision here. One motivation for the move out of
> resctrl_arch_config_cntr() was to avoid architectural state being reset twice. For reference,
> mbm_config_write_domain()->resctrl_arch_reset_rmid_all(). Will architectural state
> be reset twice again?

That is a good point. We don't have to do it twice.

We can move the whole reset (arch_mbm) into resctrl_arch_config_cntr().

> One thing that I did not notice before is that the non-architectural MBM state is not
> reset. Care should be taken to reset this also when considering that there is a plan
> to use that MBM state to build a generic rate event for all platforms:
> https://lore.kernel.org/all/CALPaoCgFRFgQqG00Uc0GhMHK47bsbtFw6Bxy5O9A_HeYmGa5sA@mail.gmail.com/

Did you mean we should add the following code in resctrl_arch_config_cntr()?

m = get_mbm_state(d, closid, rmid, evtid);
if (m)
      memset(m, 0, sizeof(struct mbm_state));


- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-11-22 18:12         ` Reinette Chatre
@ 2024-11-22 21:34           ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-22 21:34 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/22/2024 12:12 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/21/24 4:26 PM, Moger, Babu wrote:
>> On 11/21/2024 6:22 PM, Moger, Babu wrote:
>>> On 11/18/2024 11:18 AM, Reinette Chatre wrote:
>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
> 
>>>>> @@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
>>>>>            goto out_unlock;
>>>>>        }
>>>>> +    rdtgroup_assign_cntrs(rdtgrp);
>>>>> +
>>>>>        kernfs_activate(rdtgrp->kn);
>>>>>        /*
>>>>> @@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>>>>>        if (ret)
>>>>>            goto out_closid_free;
>>>>> +    rdtgroup_assign_cntrs(rdtgrp);
>>>>> +
>>>>>        kernfs_activate(rdtgrp->kn);
>>>>>        ret = rdtgroup_init_alloc(rdtgrp);
>>>>
>>>> Please compare the above two hunks with earlier "x86/resctrl: Introduce cntr_id in mongroup for assignments".
>>>> Earlier patch initializes the counters within mkdir_rdt_prepare_rmid_alloc() while the above
>>>> hunk assigns the counters after mkdir_rdt_prepare_rmid_alloc() is called. Could this fragmentation be avoided
>>>> with init done once within mkdir_rdt_prepare_rmid_alloc()?
>>>
>>> It seems more appropriate to call rdtgroup_cntr_id_init() inside mkdir_rdt_prepare(). This will initialize the cntr_id to MON_CNTR_UNSET.
>>>
>>> And then call rdtgroup_assign_cntrs() inside mkdir_rdt_prepare_rmid_alloc().
>>>
>>> what do you think?
> 
> Taking a closer look this seems most appropriate. mkdir_rdt_prepare() is where the resource group
> is created and all fields initialized, control and monitoring (irrespective of monitoring enabled).
> Doing the MON_CNTR_UNSET initialization in that central place seems good.
> Yes, and then assigning the counters in mkdir_rdt_prepare_rmid_alloc() sounds good.

ok.

> 
>>>>> @@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>>>>>    out_del_list:
>>>>>        list_del(&rdtgrp->rdtgroup_list);
>>>>>    out_rmid_free:
>>>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>>>>        mkdir_rdt_prepare_rmid_free(rdtgrp);
>>>>>    out_closid_free:
>>>>>        closid_free(closid);
>>>>> @@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>>>>>        update_closid_rmid(tmpmask, NULL);
>>>>>        rdtgrp->flags = RDT_DELETED;
>>>>> +
>>>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>>>> +
>>>>>        free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>>>        /*
>>>>> @@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>>>>>        cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>>>>>        update_closid_rmid(tmpmask, NULL);
>>>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>>>> +
>>>>>        free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>>>        closid_free(rdtgrp->closid);
>>>>
>>>> There is a potential problem here. rdtgroup_unassign_cntrs() attempts to remove counter from
>>>> all domains associated with the resource group. This may fail in any of the domains that results
>>>> in the counter not being marked as free in the global map and not reset the counter in the
>>>> resource group ... but the resource group is removed right after calling rdtgroup_unassign_cntrs().
>>>> In this scenario there is thus a counter that is considered to be in use but not assigned to any
>>>> resource group.
>>>>
>>>> From what I can tell there is a difference here between default resource group and the others:
>>>> on remount of resctrl the default resource group will maintain knowledge of the counter that could
>>>> not be unassigned. This means that unmount/remount of resctrl does not provide a real "clean slate"
>>>> when it comes to counter assignment. Is this intended?
>>>>
>>>
>>> Yes. Agree. It is not intended.
>>>
>>> How about adding bitmap_zero() inside rdt_get_tree() to address this problem? Also need to reset the cntr_id in rdt_kill_sb().
>>
>> I meant reset the cntr_id for the default group in rdt_kill_sb().
> 
> Doing the cntr_id reset like this matches the custom to reset to defaults in rdt_kill_sb(). I am not sure
> what you envision with the bitmap_zero() in rdt_get_tree() ... I wonder if it may not just be simpler to
> call mbm_cntr_reset() from rdt_kill_sb()? This does raise the question if mbm_cntr_reset() should reset
> architectural state. I do not think it does harm, the state will just be reset again when the mon dirs are
> created?

Yes. Calling mbm_cntr_reset() from rdt_kill_sb() seems a cleaner approach.

Architectural state will be reset again when a counter is assigned (when
mon directories are created).
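
To make the "clean slate" on unmount concrete, here is a minimal userspace
sketch of what such a reset could cover. Everything in it is an assumption
for illustration: struct resctrl_model, cntr_reset_model() and the
MODEL_* constants are hypothetical, and the real mbm_cntr_reset() in this
series may look quite different. The model assumes a set bit in the global
map means "free" and a set bit in a domain map means "in use".

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NUM_DOMAINS	2
#define MODEL_NUM_EVENTS	2
#define MODEL_CNTR_UNSET	(-1)	/* plays the role of MON_CNTR_UNSET */

/* Hypothetical model of the counter bookkeeping discussed above. */
struct resctrl_model {
	uint32_t free_map;				/* global: set bit = free */
	uint32_t domain_map[MODEL_NUM_DOMAINS];		/* per domain: set bit = in use */
	int default_grp_cntr_id[MODEL_NUM_EVENTS];	/* default group's counters */
};

/* Return all counter bookkeeping to its initial state. */
static void cntr_reset_model(struct resctrl_model *m, unsigned int num_cntrs)
{
	unsigned int i;

	assert(num_cntrs >= 1 && num_cntrs <= 32);

	/* Mark every counter free again at the global level. */
	m->free_map = num_cntrs == 32 ? UINT32_MAX : (1u << num_cntrs) - 1;

	/* No counter is in use in any domain. */
	for (i = 0; i < MODEL_NUM_DOMAINS; i++)
		m->domain_map[i] = 0;

	/* The default group survives unmount, so forget its counters too. */
	for (i = 0; i < MODEL_NUM_EVENTS; i++)
		m->default_grp_cntr_id[i] = MODEL_CNTR_UNSET;
}
```

The point of the sketch is that a counter left behind by a failed
unassign is recovered at all three levels in one place.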

thanks
- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-22 18:25         ` Moger, Babu
@ 2024-11-22 21:37           ` Reinette Chatre
  2024-11-23  0:02             ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-22 21:37 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/22/24 10:25 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/18/2024 4:07 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 11/18/24 11:04 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 11/15/24 18:00, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>>>>> supported.
>>>>>
>>>>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>>>>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>>>>
>>>>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>>>>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>>>>
>>>>> The "default" mode is the existing monitoring mode that works without the
>>>>> explicit counter assignment, instead relying on dynamic counter assignment
>>>>> by hardware that may result in hardware not dedicating a counter resulting
>>>>> in monitoring data reads returning "Unavailable".
>>>>>
>>>>> Provide an interface to display the monitor mode on the system.
>>>>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>>> [mbm_cntr_assign]
>>>>> default
>>>>>
>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>> ---
>>
>> ...
>>
>>>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>>>> to make the event data "more predictable" and then be concerned when the mode does
>>>> not exist.
>>>>
>>>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>>>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>>>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
>>>> num_mbm_cntrs to num_rmids).
>>>
>>> There is some round about(or hacky) way to find that out number of RMIDs
>>> that can be active.
>>
>> Does this give consistent and accurate data? Is this something that can be added to resctrl?
>> (Reading your other message [1] it does not sound as though it can produce an accurate
>> number on boot.)
>> If not then it will be up to the documentation to be accurate.
>>
>>
>>>>> +
>>>>> +    AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>>>> +    enable this mode by default so that counters remain assigned even when the
>>>>> +    corresponding RMID is not in use by any processor.
>>>>> +
>>>>> +    "default":
>>>>> +
>>>>> +    In default mode resctrl assumes there is a hardware counter for each
>>>>> +    event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>>>> +    mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>>>> +    with that event.
>>>>
>>>> If I understand correctly, on AMD platforms without ABMC the events only report
>>>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>>>> is unassigned and then reassigned then the event count will reset and the user
>>>> will get some data back but it may thus be unpredictable (to match earlier language).
>>>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>>>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
>>>
>>> Yes. All the AMD systems without ABMC are affected by this problem.
>>>
>>>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>>>> available, while not be made concerned to use "default" mode on Intel where
>>>> mbm_assign_mode is not available.
>>>
>>> Can we add text to clarify this?
>>
>> Please do.
> 
> I think we need to add text about AMD systems. How about this?
> 
> "default":
> In default mode resctrl assumes there is a hardware counter for each
> event within every CTRL_MON and MON group. On AMD systems with 16 more monitoring groups, reading mbm_total_bytes or mbm_local_bytes may report 'Unavailable' if there is no counter associated with that event. It is therefore recommended to use the 'mbm_cntr_assign' mode, if supported."


What is meant with "On AMD systems with 16 more monitoring groups"? First, the language is
not clear, second, you mentioned earlier that there is just a "hacky" way to determine number
of RMIDs that can be active but here "16" is made official in the documentation?

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-11-22 18:54         ` Moger, Babu
@ 2024-11-22 21:52           ` Reinette Chatre
  2024-11-23  0:15             ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-22 21:52 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/22/24 10:54 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/21/2024 2:18 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 11/19/24 12:12 PM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 11/15/24 18:44, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>> The ABMC feature provides an option to the user to assign a hardware
>>>>> counter to an RMID, event pair and monitor the bandwidth as long as it is
>>>>> assigned. The assigned RMID will be tracked by the hardware until the user
>>>>> unassigns it manually.
>>>>>
>>>>> Counters are configured by writing to L3_QOS_ABMC_CFG MSR and
>>>>> specifying the counter id, bandwidth source, and bandwidth types.
>>>>
>>>> needs imperative tone
>>>
>>> How about this?
>>>
>>> Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and
>>> specifying the counter ID, bandwidth source, and bandwidth types.
>>>
>>
>> ok with me. Exactly what ChatGPT suggests.
> 
> Hmm. ):
> 
>>
>> Please do note that that first paragraph informs reader that
>> a counter is assigned by user to "an RMID, event pair" while the hardware is configured with
>> "the counter ID, bandwidth source, and bandwidth types". There thus does not seem
>> to be a clear connection between what user assigns and what is programmed to hardware.
>>
> 
> Adding RMID in the text might help.
> 
> Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and specifying the counter ID, RMID, bandwidth source, and bandwidth types.
> 

Isn't the bandwidth source and the RMID the same thing? How about something like:
"Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and specifying
 the counter ID, bandwidth source (RMID), and bandwidth event configuration."

Please feel free to improve.
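
As a rough illustration of how the counter ID, bandwidth source (RMID) and
event configuration end up in one 64-bit value, consider the userspace
sketch below. The shift values are assumptions meant to mirror
union l3_qos_abmc_cfg from this series and should be checked against the
APM; abmc_cfg_pack() and the ABMC_* macros are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field offsets, mirroring union l3_qos_abmc_cfg in this series. */
#define ABMC_BW_TYPE_SHIFT	0	/* event configuration, 32 bits */
#define ABMC_BW_SRC_SHIFT	32	/* bandwidth source (RMID), 12 bits */
#define ABMC_CNTR_ID_SHIFT	48	/* counter ID, 5 bits */
#define ABMC_CNTR_EN		(1ULL << 62)
#define ABMC_CFG_EN		(1ULL << 63)

/* Build the value that would be written to the config MSR for an assign. */
static uint64_t abmc_cfg_pack(uint32_t cntr_id, uint32_t rmid, uint32_t evt_cfg)
{
	assert(cntr_id < (1u << 5));
	assert(rmid < (1u << 12));

	return ABMC_CFG_EN | ABMC_CNTR_EN |
	       ((uint64_t)cntr_id << ABMC_CNTR_ID_SHIFT) |
	       ((uint64_t)rmid << ABMC_BW_SRC_SHIFT) |
	       ((uint64_t)evt_cfg << ABMC_BW_TYPE_SHIFT);
}
```

Seen this way, the "RMID, event pair" that the user assigns maps directly
onto the bandwidth source and event configuration fields, and the counter
ID selects which hardware counter tracks that pair.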

Reinette




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment
  2024-11-22 21:04         ` Moger, Babu
@ 2024-11-22 22:07           ` Reinette Chatre
  2024-11-23  0:09             ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-22 22:07 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/22/24 1:04 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/21/2024 2:50 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 11/20/24 10:05 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 11/15/24 18:57, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>> The mbm_cntr_assign mode offers several hardware counters that can be
>>>>> assigned to an RMID, event pair and monitor the bandwidth as long as it
>>>>> is assigned.
>>>>>
>>>>> Counters are managed at two levels. The global assignment is tracked
>>>>> using the mbm_cntr_free_map field in the struct resctrl_mon, while
>>>>> domain-specific assignments are tracked using the mbm_cntr_map field
>>>>> in the struct rdt_mon_domain. Allocation begins at the global level
>>>>> and is then applied individually to each domain.
>>>>>
>>>>> Introduce an interface to allocate these counters and update the
>>>>> corresponding domains accordingly.
>>>>>
>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>> ---
>>>>
>>>> ...
>>>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>>>> index 00f7bf60e16a..cb496bd97007 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>>>> @@ -717,6 +717,8 @@ unsigned int mon_event_config_index_get(u32 evtid);
>>>>>   int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>>                    enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>>>>                    u32 cntr_id, bool assign);
>>>>> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
>>>>> +                   struct rdt_mon_domain *d, enum resctrl_event_id evtid);
>>>>>   void rdt_staged_configs_clear(void);
>>>>>   bool closid_allocated(unsigned int closid);
>>>>>   int resctrl_find_cleanest_closid(void);
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> index 1b5529c212f5..bc3752967c44 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> @@ -1924,6 +1924,93 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>>       return 0;
>>>>>   }
>>>>>   +/*
>>>>> + * Configure the counter for the event, RMID pair for the domain.
>>>>> + * Update the bitmap and reset the architectural state.
>>>>> + */
>>>>> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>> +                   enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>>>> +                   u32 cntr_id, bool assign)
>>>>> +{
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
>>>>> +    if (ret)
>>>>> +        return ret;
>>>>> +
>>>>> +    if (assign)
>>>>> +        __set_bit(cntr_id, d->mbm_cntr_map);
>>>>> +    else
>>>>> +        __clear_bit(cntr_id, d->mbm_cntr_map);
>>>>> +
>>>>> +    /*
>>>>> +     * Reset the architectural state so that reading of hardware
>>>>> +     * counter is not considered as an overflow in next update.
>>>>> +     */
>>>>> +    resctrl_arch_reset_rmid(r, d, closid, rmid, evtid);
>>>>
>>>> resctrl_arch_reset_rmid() expects to be run on a CPU that is in the domain
>>>> @d ... note that after the architectural state is reset it initializes the
>>>> state by reading the event on the current CPU. By running it here it is
>>>> run on a random CPU that may not be in the right domain.
>>>
>>> Yes. That is correct.  We can move this part to our earlier
>>> implementation. We dont need to read the RMID.  We just have to reset the
>>> counter.
>>>
>>> https://lore.kernel.org/lkml/16d88cc4091cef1999b7ec329364e12dd0dc748d.1728495588.git.babu.moger@amd.com/
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index 9fe419d0c536..bc3654ec3a08 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -2371,6 +2371,13 @@ int resctrl_arch_config_cntr(struct rdt_resource
>>> *r, struct rdt_mon_domain *d,
>>>          smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd,
>>>                                &abmc_cfg, 1);
>>>
>>> +       /*
>>> +        * Reset the architectural state so that reading of hardware
>>> +        * counter is not considered as an overflow in next update.
>>> +        */
>>> +       if (arch_mbm)
>>> +               memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
>>> +
>>>          return 0;
>>>   }
>>>
>>>
>>
>> I am not sure what you envision here. One motivation for the move out of
>> resctrl_arch_config_cntr() was to avoid architectural state being reset twice. For reference,
>> mbm_config_write_domain()->resctrl_arch_reset_rmid_all(). Will architectural state
>> be reset twice again?
> 
> That is good point. We don't have to do it twice.
> 
> We can move the whole reset(arch_mbm) in  resctrl_arch_config_cntr().

This is not clear to me. The architectural state needs to be reset on MBM config write even
when assignable mode is not supported and/or enabled. Moving it to resctrl_arch_config_cntr()
will break this, no?

I wonder if it may not simplify things to call resctrl_arch_reset_rmid() from
resctrl_abmc_config_one_amd()?

>> One thing that I did not notice before is that the non-architectural MBM state is not
>> reset. Care should be taken to reset this also when considering that there is a plan
>> to use that MBM state to build a generic rate event for all platforms:
>> https://lore.kernel.org/all/CALPaoCgFRFgQqG00Uc0GhMHK47bsbtFw6Bxy5O9A_HeYmGa5sA@mail.gmail.com/
> 
> Did you mean we should add the following code in resctrl_arch_config_cntr()?
> 
> m = get_mbm_state(d, closid, rmid, evtid);
> if (m)
>      memset(m, 0, sizeof(struct mbm_state));

This is not arch code but instead resctrl fs, so resctrl_config_cntr() may be more appropriate?
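
A minimal userspace sketch of that split, assuming hypothetical model types
(struct mbm_state_model, struct mon_domain_model) and helper names that do
not exist in the kernel: the arch callback only programs the counter, while
the fs-level wrapper updates the domain bitmap and clears the
non-architectural state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of per-event software state kept by resctrl fs. */
struct mbm_state_model {
	uint64_t prev_bw_bytes;
	uint32_t prev_bw;
};

/* Hypothetical model of a monitoring domain: one bit per counter. */
struct mon_domain_model {
	uint32_t cntr_map;
	struct mbm_state_model state;	/* state for a single RMID/event */
};

/* Stand-in for the arch callback; returns 0 on success. */
static int arch_config_cntr_model(uint32_t cntr_id, bool assign)
{
	(void)assign;
	return cntr_id < 32 ? 0 : -1;
}

/* Models the fs-level wrapper: arch call, bitmap update, state reset. */
static int config_cntr_model(struct mon_domain_model *d, uint32_t cntr_id,
			     bool assign)
{
	int ret;

	assert(d);

	ret = arch_config_cntr_model(cntr_id, assign);
	if (ret)
		return ret;

	if (assign)
		d->cntr_map |= 1u << cntr_id;
	else
		d->cntr_map &= ~(1u << cntr_id);

	/* Drop stale history so the next rate calculation starts fresh. */
	memset(&d->state, 0, sizeof(d->state));

	return 0;
}
```

Keeping the memset() in the wrapper means an architecture that implements
only the arch callback does not need to know about the software state.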

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-21 21:12       ` Reinette Chatre
@ 2024-11-22 23:36         ` Moger, Babu
  2024-11-25 19:00           ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-22 23:36 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/21/2024 3:12 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/19/24 11:20 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/15/24 18:31, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>> Provide the interface to display the number of free monitoring counters
>>>> available for assignment in each doamin when mbm_cntr_assign is supported.
>>>>
>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>> ---
>>>> v9: New patch.
>>>> ---
>>>>   Documentation/arch/x86/resctrl.rst     |  4 ++++
>>>>   arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>>   3 files changed, 38 insertions(+)
>>>>
>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>> index 2f3a86278e84..2bc58d974934 100644
>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>> @@ -302,6 +302,10 @@ with the following files:
>>>>   	memory bandwidth tracking to a single memory bandwidth event per
>>>>   	monitoring group.
>>>>   
>>>> +"available_mbm_cntrs":
>>>> +	The number of free monitoring counters available assignment in each domain
>>>
>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>>> counters available for assignment"?
>>>
>>> (not taking into account how text may change after addressing Peter's feedback)
>>
>> How about this?
>>
>> "The number of monitoring counters available for assignment in each domain
>> when the architecture supports mbm_cntr_assign mode. There are a total of
>> "num_mbm_cntrs" counters are available for assignment. Counters can be
>> assigned or unassigned individually in each domain. A counter is available
>> for new assignment if it is unassigned in all domains."
> 
> Please consider the context of this paragraph. It follows right after the description
> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
> I think it is confusing to follow that with a paragraph that states "Counters can be
> assigned or unassigned individually in each domain." I wonder if it may be helpful to
> use a different term ... for example a counter is *assigned* to an event of a monitoring
> group but this assignment may be to specified (not yet supported) or all (this work) domains while
> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
> needs to remain coherent if future work decides to indeed support per-domain assignment.
> 

Little bit lost here. Please help me.

"available_mbm_cntrs":
"The number of monitoring counters available for assignment in each 
domain when the architecture supports "mbm_cntr_assign" mode. A total of
"num_mbm_cntrs" counters are available for assignment.
A counter is assigned to an event within a monitoring group and is 
available for activation across all domains. Users have the flexibility 
to activate it selectively within specific domains."
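
One possible reading of the per-domain numbers, sketched in userspace under
assumptions: each domain has a bitmap in which a set bit means the counter
is in use there, and a counter can be newly assigned only if no domain is
using it. The helper names available_cntrs() and cntr_is_free_everywhere()
are hypothetical and only illustrate the wording being discussed.

```c
#include <assert.h>
#include <stdint.h>

/* Number of counters not in use in one domain; set bit = in use. */
static unsigned int available_cntrs(uint32_t domain_map, unsigned int num_cntrs)
{
	unsigned int i, used = 0;

	assert(num_cntrs <= 32);

	for (i = 0; i < num_cntrs; i++)
		used += (domain_map >> i) & 1;

	return num_cntrs - used;
}

/* A counter can be newly assigned only if no domain is using it. */
static int cntr_is_free_everywhere(const uint32_t *domain_maps,
				   unsigned int num_domains,
				   unsigned int cntr_id)
{
	unsigned int i;

	assert(cntr_id < 32);

	for (i = 0; i < num_domains; i++) {
		if (domain_maps[i] & (1u << cntr_id))
			return 0;
	}

	return 1;
}
```

In this model the per-domain count and the global "can be assigned" answer
can differ, which is the distinction the documentation text needs to make
clear.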

Thanks
- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-22 21:37           ` Reinette Chatre
@ 2024-11-23  0:02             ` Moger, Babu
  2024-11-25 18:17               ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-23  0:02 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/22/2024 3:37 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/22/24 10:25 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/18/2024 4:07 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 11/18/24 11:04 AM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 11/15/24 18:00, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>>>>>> supported.
>>>>>>
>>>>>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>>>>>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>>>>>
>>>>>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>>>>>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>>>>>
>>>>>> The "default" mode is the existing monitoring mode that works without the
>>>>>> explicit counter assignment, instead relying on dynamic counter assignment
>>>>>> by hardware that may result in hardware not dedicating a counter resulting
>>>>>> in monitoring data reads returning "Unavailable".
>>>>>>
>>>>>> Provide an interface to display the monitor mode on the system.
>>>>>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>>>> [mbm_cntr_assign]
>>>>>> default
>>>>>>
>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>> ---
>>>
>>> ...
>>>
>>>>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>>>>> to make the event data "more predictable" and then be concerned when the mode does
>>>>> not exist.
>>>>>
>>>>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>>>>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>>>>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
>>>>> num_mbm_cntrs to num_rmids).
>>>>
>>>> There is a roundabout (or hacky) way to find out the number of RMIDs
>>>> that can be active.
>>>
>>> Does this give consistent and accurate data? Is this something that can be added to resctrl?
>>> (Reading your other message [1] it does not sound as though it can produce an accurate
>>> number on boot.)
>>> If not then it will be up to the documentation to be accurate.
>>>
>>>
>>>>>> +
>>>>>> +    AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>>>>> +    enable this mode by default so that counters remain assigned even when the
>>>>>> +    corresponding RMID is not in use by any processor.
>>>>>> +
>>>>>> +    "default":
>>>>>> +
>>>>>> +    In default mode resctrl assumes there is a hardware counter for each
>>>>>> +    event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>>>>> +    mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>>>>> +    with that event.
>>>>>
>>>>> If I understand correctly, on AMD platforms without ABMC the events only report
>>>>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>>>>> is unassigned and then reassigned then the event count will reset and the user
>>>>> will get some data back but it may thus be unpredictable (to match earlier language).
>>>>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>>>>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
>>>>
>>>> Yes. All the AMD systems without ABMC are affected by this problem.
>>>>
>>>>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>>>>> available, while not be made concerned to use "default" mode on Intel where
>>>>> mbm_assign_mode is not available.
>>>>
>>>> Can we add text to clarify this?
>>>
>>> Please do.
>>
>> I think we need to add text about AMD systems. How about this?
>>
>> "default":
>> In default mode resctrl assumes there is a hardware counter for each
>> event within every CTRL_MON and MON group. On AMD systems with 16 more monitoring groups, reading mbm_total_bytes or mbm_local_bytes may report 'Unavailable' if there is no counter associated with that event. It is therefore recommended to use the 'mbm_cntr_assign' mode, if supported."
> 
> 
> What is meant with "On AMD systems with 16 more monitoring groups"? First, the language is
> not clear, second, you mentioned earlier that there is just a "hacky" way to determine number
> of RMIDs that can be active but here "16" is made official in the documentation?
> 

The lowest number of RMIDs that can be active at once is 16. I could not
get it using the hacky way. I verified this by testing on all previous
generations of AMD systems, creating monitoring groups until reads
reported "Unavailable". On recent systems it is 32. We can drop the exact
number to keep the text generic.


There is no clear documentation on that.  Here is what the doc says.

A given implementation may have insufficient hardware to simultaneously 
track the bandwidth for all RMID values which the hardware supports. If 
an attempt is made to read a Bandwidth Count for an RMID that has been 
impacted by these hardware limitations, the “U” bit of the
QM_CTR will be set when the counter is read. Subsequent QM_CTR reads for 
that RMID and Event may return a value with the "U" bit clear. Potential 
causes of the “U” bit being set include (but are not limited to)

• RMID is not currently tracked by the hardware.
• RMID was not tracked by the hardware at some time since it was last read.
• RMID has not been read since it started being tracked by the hardware.

All RMIDs which are currently in use by one or more processors in the 
QOS domain will be tracked. The hardware will always begin tracking a 
new RMID value when it gets written to the PQR_ASSOC register of any of 
the processors in the QOS domain and it is not already being tracked. 
When the hardware begins tracking an RMID that it was not previously 
tracking, it will clear the QM_CTR for all events in the new RMID

- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment
  2024-11-22 22:07           ` Reinette Chatre
@ 2024-11-23  0:09             ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-23  0:09 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/22/2024 4:07 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/22/24 1:04 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/21/2024 2:50 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 11/20/24 10:05 AM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 11/15/24 18:57, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>> The mbm_cntr_assign mode offers several hardware counters that can be
>>>>>> assigned to an RMID, event pair and monitor the bandwidth as long as it
>>>>>> is assigned.
>>>>>>
>>>>>> Counters are managed at two levels. The global assignment is tracked
>>>>>> using the mbm_cntr_free_map field in the struct resctrl_mon, while
>>>>>> domain-specific assignments are tracked using the mbm_cntr_map field
>>>>>> in the struct rdt_mon_domain. Allocation begins at the global level
>>>>>> and is then applied individually to each domain.
>>>>>>
>>>>>> Introduce an interface to allocate these counters and update the
>>>>>> corresponding domains accordingly.
>>>>>>
>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>> ---
>>>>>
>>>>> ...
>>>>>
>>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>>>>> index 00f7bf60e16a..cb496bd97007 100644
>>>>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>>>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>>>>> @@ -717,6 +717,8 @@ unsigned int mon_event_config_index_get(u32 evtid);
>>>>>>    int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>>>                     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>>>>>                     u32 cntr_id, bool assign);
>>>>>> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
>>>>>> +                   struct rdt_mon_domain *d, enum resctrl_event_id evtid);
>>>>>>    void rdt_staged_configs_clear(void);
>>>>>>    bool closid_allocated(unsigned int closid);
>>>>>>    int resctrl_find_cleanest_closid(void);
>>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>>> index 1b5529c212f5..bc3752967c44 100644
>>>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>>> @@ -1924,6 +1924,93 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>>>        return 0;
>>>>>>    }
>>>>>>    +/*
>>>>>> + * Configure the counter for the event, RMID pair for the domain.
>>>>>> + * Update the bitmap and reset the architectural state.
>>>>>> + */
>>>>>> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>>> +                   enum resctrl_event_id evtid, u32 rmid, u32 closid,
>>>>>> +                   u32 cntr_id, bool assign)
>>>>>> +{
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
>>>>>> +    if (ret)
>>>>>> +        return ret;
>>>>>> +
>>>>>> +    if (assign)
>>>>>> +        __set_bit(cntr_id, d->mbm_cntr_map);
>>>>>> +    else
>>>>>> +        __clear_bit(cntr_id, d->mbm_cntr_map);
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Reset the architectural state so that reading of hardware
>>>>>> +     * counter is not considered as an overflow in next update.
>>>>>> +     */
>>>>>> +    resctrl_arch_reset_rmid(r, d, closid, rmid, evtid);
>>>>>
>>>>> resctrl_arch_reset_rmid() expects to be run on a CPU that is in the domain
>>>>> @d ... note that after the architectural state is reset it initializes the
>>>>> state by reading the event on the current CPU. By running it here it is
>>>>> run on a random CPU that may not be in the right domain.
>>>>
>>>> Yes. That is correct.  We can move this part to our earlier
>>>> implementation. We don't need to read the RMID.  We just have to reset the
>>>> counter.
>>>>
>>>> https://lore.kernel.org/lkml/16d88cc4091cef1999b7ec329364e12dd0dc748d.1728495588.git.babu.moger@amd.com/
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> index 9fe419d0c536..bc3654ec3a08 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> @@ -2371,6 +2371,13 @@ int resctrl_arch_config_cntr(struct rdt_resource
>>>> *r, struct rdt_mon_domain *d,
>>>>           smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd,
>>>>                                 &abmc_cfg, 1);
>>>>
>>>> +       /*
>>>> +        * Reset the architectural state so that reading of hardware
>>>> +        * counter is not considered as an overflow in next update.
>>>> +        */
>>>> +       if (arch_mbm)
>>>> +               memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
>>>> +
>>>>           return 0;
>>>>    }
>>>>
>>>>
>>>
>>> I am not sure what you envision here. One motivation for the move out of
>>> resctrl_arch_config_cntr() was to avoid architectural state being reset twice. For reference,
>>> mbm_config_write_domain()->resctrl_arch_reset_rmid_all(). Will architectural state
>>> be reset twice again?
>>
>> That is a good point. We don't have to do it twice.
>>
>> We can move the whole reset (arch_mbm) into resctrl_arch_config_cntr().
> 
> This is not clear to me. The architectural state needs to be reset on MBM config write even
> when assignable mode is not supported and/or enabled. Moving it to resctrl_arch_config_cntr()
> will break this, no?

Yes. The architectural state needs to be reset on MBM config write
whether or not ABMC is enabled.

> 
> I wonder if it may not simplify things to call resctrl_arch_reset_rmid() from
> resctrl_abmc_config_one_amd()?

Yes. That is an option. I can try.

> 
>>> One thing that I did not notice before is that the non-architectural MBM state is not
>>> reset. Care should be taken to reset this also when considering that there is a plan
>>> to use that MBM state to build a generic rate event for all platforms:
>>> https://lore.kernel.org/all/CALPaoCgFRFgQqG00Uc0GhMHK47bsbtFw6Bxy5O9A_HeYmGa5sA@mail.gmail.com/
>>
>> Did you mean we should add the following code in resctrl_arch_config_cntr()?
>>
>> m = get_mbm_state(d, closid, rmid, evtid);
>> if (m)
>>       memset(m, 0, sizeof(struct mbm_state));
> 
> This is not arch code but instead resctrl fs, so resctrl_config_cntr() may be more appropriate?

Sure. We can do that.
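
A minimal user-space sketch of the split discussed above, assuming the
architectural state is reset by the arch helper and the non-architectural
state by the resctrl fs helper. Every name below (sketch_domain,
sketch_arch_config_cntr(), sketch_config_cntr(), the struct fields) is a
hypothetical stand-in and not the kernel's code. The real state is kept
per RMID and per event and the hardware is programmed on a CPU in the
domain; the sketch keeps a single state pair per domain to stay short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_CNTRS 32

/* Stand-in for the arch-private MBM state (hypothetical fields). */
struct sketch_arch_mbm_state {
	uint64_t chunks;
	uint64_t prev_msr;
};

/* Stand-in for the resctrl fs MBM state (hypothetical fields). */
struct sketch_mbm_state {
	uint64_t prev_bw_bytes;
	uint32_t prev_bw;
};

struct sketch_domain {
	uint32_t cntr_map;	/* one bit per counter active in this domain */
	struct sketch_arch_mbm_state arch_state;
	struct sketch_mbm_state fs_state;
};

/*
 * Arch layer: would program the counter in hardware (omitted here) and
 * owns the reset of the architectural state.
 */
static int sketch_arch_config_cntr(struct sketch_domain *d, uint32_t cntr_id,
				   bool assign)
{
	(void)cntr_id;
	(void)assign;
	memset(&d->arch_state, 0, sizeof(d->arch_state));
	return 0;
}

/*
 * fs layer: calls the arch helper, updates the per-domain bitmap and
 * resets the non-architectural state so a later rate calculation does
 * not start from stale values.
 */
static int sketch_config_cntr(struct sketch_domain *d, uint32_t cntr_id,
			      bool assign)
{
	int ret;

	assert(cntr_id < SKETCH_NUM_CNTRS);

	ret = sketch_arch_config_cntr(d, cntr_id, assign);
	if (ret)
		return ret;

	if (assign)
		d->cntr_map |= 1u << cntr_id;
	else
		d->cntr_map &= ~(1u << cntr_id);

	memset(&d->fs_state, 0, sizeof(d->fs_state));
	return 0;
}
```

With this split each state is reset exactly once per assignment change,
which is the property the earlier comments about double resets ask for.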

> 
> Reinette
> 
> 

-- 
- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  2024-11-22 21:52           ` Reinette Chatre
@ 2024-11-23  0:15             ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-23  0:15 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/22/2024 3:52 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/22/24 10:54 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/21/2024 2:18 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 11/19/24 12:12 PM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 11/15/24 18:44, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>> The ABMC feature provides an option to the user to assign a hardware
>>>>>> counter to an RMID, event pair and monitor the bandwidth as long as it is
>>>>>> assigned. The assigned RMID will be tracked by the hardware until the user
>>>>>> unassigns it manually.
>>>>>>
>>>>>> Counters are configured by writing to L3_QOS_ABMC_CFG MSR and
>>>>>> specifying the counter id, bandwidth source, and bandwidth types.
>>>>>
>>>>> needs imperative tone
>>>>
>>>> How about this?
>>>>
>>>> Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and
>>>> specifying the counter ID, bandwidth source, and bandwidth types.
>>>>
>>>
>>> ok with me. Exactly what ChatGPT suggests.
>>
>> Hmm. ):
>>
>>>
>>> Please do note that that first paragraph informs reader that
>>> a counter is assigned by user to "an RMID, event pair" while the hardware is configured with
>>> "the counter ID, bandwidth source, and bandwidth types". There thus does not seem
>>> to be a clear connection between what user assigns and what is programmed to hardware.
>>>
>>
>> Adding RMID in the text might help.
>>
>> Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and specifying the counter ID, RMID, bandwidth source, and bandwidth types.
>>
> 
> Isn't the bandwidth source and the RMID the same thing? How about something like:
> "Configure the counters by writing to the L3_QOS_ABMC_CFG MSR and specifying
>   the counter ID, bandwidth source (RMID), and bandwidth event configuration."
> 
> Please feel free to improve.

Looks good. Thanks
-- 
- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-23  0:02             ` Moger, Babu
@ 2024-11-25 18:17               ` Reinette Chatre
  2024-11-26 17:09                 ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-25 18:17 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/22/24 4:02 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/22/2024 3:37 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 11/22/24 10:25 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 11/18/2024 4:07 PM, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 11/18/24 11:04 AM, Moger, Babu wrote:
>>>>> Hi Reinette,
>>>>>
>>>>> On 11/15/24 18:00, Reinette Chatre wrote:
>>>>>> Hi Babu,
>>>>>>
>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>>>>>>> supported.
>>>>>>>
>>>>>>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>>>>>>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>>>>>>
>>>>>>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>>>>>>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>>>>>>
>>>>>>> The "default" mode is the existing monitoring mode that works without the
>>>>>>> explicit counter assignment, instead relying on dynamic counter assignment
>>>>>>> by hardware that may result in hardware not dedicating a counter resulting
>>>>>>> in monitoring data reads returning "Unavailable".
>>>>>>>
>>>>>>> Provide an interface to display the monitor mode on the system.
>>>>>>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>>>>> [mbm_cntr_assign]
>>>>>>> default
>>>>>>>
>>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>>> ---
>>>>
>>>> ...
>>>>
>>>>>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>>>>>> to make the event data "more predictable" and then be concerned when the mode does
>>>>>> not exist.
>>>>>>
>>>>>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>>>>>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>>>>>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
>>>>>> num_mbm_cntrs to num_rmids).
>>>>>
>>>>> There is a roundabout (or hacky) way to find out the number of RMIDs
>>>>> that can be active.
>>>>
>>>> Does this give consistent and accurate data? Is this something that can be added to resctrl?
>>>> (Reading your other message [1] it does not sound as though it can produce an accurate
>>>> number on boot.)
>>>> If not then it will be up to the documentation to be accurate.
>>>>
>>>>
>>>>>>> +
>>>>>>> +    AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>>>>>> +    enable this mode by default so that counters remain assigned even when the
>>>>>>> +    corresponding RMID is not in use by any processor.
>>>>>>> +
>>>>>>> +    "default":
>>>>>>> +
>>>>>>> +    In default mode resctrl assumes there is a hardware counter for each
>>>>>>> +    event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>>>>>> +    mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>>>>>> +    with that event.
>>>>>>
>>>>>> If I understand correctly, on AMD platforms without ABMC the events only report
>>>>>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>>>>>> is unassigned and then reassigned then the event count will reset and the user
>>>>>> will get some data back but it may thus be unpredictable (to match earlier language).
>>>>>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>>>>>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
>>>>>
>>>>> Yes. All the AMD systems without ABMC are affected by this problem.
>>>>>
>>>>>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>>>>>> available, while not be made concerned to use "default" mode on Intel where
>>>>>> mbm_assign_mode is not available.
>>>>>
>>>>> Can we add text to clarify this?
>>>>
>>>> Please do.
>>>
>>> I think we need to add text about AMD systems. How about this?
>>>
>>> "default":
>>> In default mode resctrl assumes there is a hardware counter for each
>>> event within every CTRL_MON and MON group. On AMD systems with 16 more monitoring groups, reading mbm_total_bytes or mbm_local_bytes may report 'Unavailable' if there is no counter associated with that event. It is therefore recommended to use the 'mbm_cntr_assign' mode, if supported."
>>
>>
>> What is meant with "On AMD systems with 16 more monitoring groups"? First, the language is
>> not clear, second, you mentioned earlier that there is just a "hacky" way to determine number
>> of RMIDs that can be active but here "16" is made official in the documentation?
>>
> 
> The lowest number of RMIDs that can be active at once is 16. I could not get it using the hacky way.
> I verified this by testing on all previous generations of AMD systems, creating monitoring groups until reads reported "Unavailable".
> On recent systems it is 32.  We can drop the exact number to keep the text generic.
> 
> 
> There is no clear documentation on that.  Here is what the doc says.
> 
> A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values which the hardware supports. If an attempt is made to read a Bandwidth Count for an RMID that has been impacted by these hardware limitations, the “U” bit of the
> QM_CTR will be set when the counter is read. Subsequent QM_CTR reads for that RMID and Event may return a value with the "U" bit clear. Potential causes of the “U” bit being set include (but are not limited to)
> 
> • RMID is not currently tracked by the hardware.
> • RMID was not tracked by the hardware at some time since it was last read.
> • RMID has not been read since it started being tracked by the hardware.
> 
> All RMIDs which are currently in use by one or more processors in the QOS domain will be tracked. The hardware will always begin tracking a new RMID value when it gets written to the PQR_ASSOC register of any of the processors in the QOS domain and it is not already being tracked. When the hardware begins tracking an RMID that it was not previously tracking, it will clear the QM_CTR for all events in the new RMID
> 
> - Babu Moger
> 

I think I am starting to understand what is meant with the "count the traffic in an
unpredictable way". From what I understand the hardware uses the "U" bit to indicate
that an RMID was not tracked for a while, but it only sets this bit on the
first read. After that the "U" bit may be cleared if a counter can be assigned to an RMID
afterwards.
If it was only user space that reads the data then it should be clear to the user when the
hardware limitation is encountered and thus hardware behavior can be "predictable", but since
the overflow handler runs once per second it may indeed be the overflow handler that
encounters the "U" bit and that bit is not currently handled. This could leave user space
with the impression that events are always returning data but that data may indeed be wrong.

In another thread [1] Tony confirmed that "U" bit is not returned by Intel systems so
this issue only impacts AMD. As I understand the other scenarios in which AMD systems
can return "U" (the first read after assigning an RMID and the first read after changing
the memory config) are all scenarios that can be controlled by resctrl.

I do not see why unpredictable data should be addressed with documentation. Could this not be
fixed instead? Essentially stating "AMD systems without ABMC count the traffic in an unpredictable
way" seems like a poor user experience.
What if instead resctrl handles the "U" bit better? For example, when the overflow
handler encounters the "U" bit the RMID can be permanently marked as "Unavailable"? Would
that not be better than the counter behaving unpredictably with users never knowing if they
can trust the event counters?
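
To make the idea concrete, here is a minimal user-space sketch, not
kernel code. It assumes the QM_CTR layout with the "U" (Unavailable) bit
at bit 62 and the Error bit at bit 63; treating the Error bit the same
way as "U" is a simplification of mine. The names sketch_rmid_state,
sketch_update() and sketch_read() are hypothetical, and counter width
and wraparound are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_QM_CTR_ERROR	(1ULL << 63)
#define SKETCH_QM_CTR_UNAVAIL	(1ULL << 62)	/* the "U" bit */
#define SKETCH_QM_CTR_DATA	(~(SKETCH_QM_CTR_ERROR | SKETCH_QM_CTR_UNAVAIL))

struct sketch_rmid_state {
	uint64_t prev;		/* previous raw count */
	uint64_t total;		/* accumulated count reported to the user */
	bool tainted;		/* sticky: "U" was seen at least once */
};

/*
 * Feed one raw QM_CTR value into the state. This is meant to be called
 * for every read, whether it comes from the periodic overflow handler
 * or from a user read, so a "U" seen by either path is never lost.
 */
static void sketch_update(struct sketch_rmid_state *s, uint64_t qm_ctr)
{
	uint64_t cur;

	if (qm_ctr & (SKETCH_QM_CTR_ERROR | SKETCH_QM_CTR_UNAVAIL)) {
		s->tainted = true;
		return;
	}
	if (s->tainted)
		return;

	cur = qm_ctr & SKETCH_QM_CTR_DATA;
	s->total += cur - s->prev;
	s->prev = cur;
}

/* Returns false when the event should be reported as "Unavailable". */
static bool sketch_read(const struct sketch_rmid_state *s, uint64_t *val)
{
	assert(s && val);

	if (s->tainted)
		return false;
	*val = s->total;
	return true;
}
```

Once the flag is set the event keeps reporting "Unavailable" even if a
later hardware read has the "U" bit clear, so the user never sees a
count that silently restarted from zero.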

Reinette

[1] https://lore.kernel.org/all/ZzUvA2XE01U25A38@agluck-desk3/


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-22 23:36         ` Moger, Babu
@ 2024-11-25 19:00           ` Reinette Chatre
  2024-11-26 23:31             ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-25 19:00 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/22/24 3:36 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/21/2024 3:12 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 11/19/24 11:20 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 11/15/24 18:31, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>> Provide the interface to display the number of free monitoring counters
>>>>> available for assignment in each domain when mbm_cntr_assign is supported.
>>>>>
>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>> ---
>>>>> v9: New patch.
>>>>> ---
>>>>>   Documentation/arch/x86/resctrl.rst     |  4 ++++
>>>>>   arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>>>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>>>   3 files changed, 38 insertions(+)
>>>>>
>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>> index 2f3a86278e84..2bc58d974934 100644
>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>> @@ -302,6 +302,10 @@ with the following files:
>>>>>       memory bandwidth tracking to a single memory bandwidth event per
>>>>>       monitoring group.
>>>>>   +"available_mbm_cntrs":
>>>>> +    The number of free monitoring counters available assignment in each domain
>>>>
>>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>>>> counters available for assignment"?
>>>>
>>>> (not taking into account how text may change after addressing Peter's feedback)
>>>
>>> How about this?
>>>
>>> "The number of monitoring counters available for assignment in each domain
>>> when the architecture supports mbm_cntr_assign mode. A total of
>>> "num_mbm_cntrs" counters are available for assignment. Counters can be
>>> assigned or unassigned individually in each domain. A counter is available
>>> for new assignment if it is unassigned in all domains."
>>
>> Please consider the context of this paragraph. It follows right after the description
>> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
>> I think it is confusing to follow that with a paragraph that states "Counters can be
>> assigned or unassigned individually in each domain." I wonder if it may be helpful to
>> use a different term ... for example a counter is *assigned* to an event of a monitoring
>> group but this assignment may be to specified (not yet supported) or all (this work) domains while
>> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
>> needs to remain coherent if future work decides to indeed support per-domain assignment.
>>
> 
> Little bit lost here. Please help me.

I think this highlights the uncertainty this interface brings. How do you expect users
to use this interface? At this time I think this interface can create a lot of confusion.
For example, consider a hypothetical system with three domains and four counters that
has the following state per mbm_assign_control:

//0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
/m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
/m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
/m3/0=_;1=_;2=_

Since this system has only four counters and all of them have been
assigned, no counters are available for new assignments.

If I understand correctly, available_mbm_cntrs will read:
0=1;1=3;2=1

How is a user to interpret the above numbers? It does not reflect
that no counter can be assigned to m3, instead it reflects which of the
already assigned counters still need to be activated on domains.
If, for example, a user is expected to use this file to know how
many counters can still be assigned, should it not reflect the actual
available counters? In the above scenario it will then be:
0=0;1=0;2=0

Of course, when doing the above the user may get the impression that a counter
that has already been assigned, just not activated, is no longer available
for use.
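
The two readings can be compared with a small user-space sketch, not
kernel code. It encodes the hypothetical three-domain, four-counter
state listed above and computes both numbers: counters not activated in
a given domain, and counters not assigned to any group event at all. All
names (sketch_group, sketch_free_in_domain(), sketch_free_global()) are
made up for illustration, and the per-domain reading is only my
understanding of what the file would report.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUM_DOMAINS	3
#define SKETCH_NUM_CNTRS	4
#define SKETCH_EVT_TOTAL	0x1	/* 't' in mbm_assign_control */
#define SKETCH_EVT_LOCAL	0x2	/* 'l' in mbm_assign_control */

struct sketch_group {
	const char *name;
	unsigned int evts[SKETCH_NUM_DOMAINS];	/* events active per domain */
};

/* The hypothetical state from the example above. */
static const struct sketch_group sketch_groups[] = {
	{ "/",  { SKETCH_EVT_TOTAL | SKETCH_EVT_LOCAL, 0, SKETCH_EVT_LOCAL } },
	{ "m1", { 0, SKETCH_EVT_TOTAL, SKETCH_EVT_TOTAL } },
	{ "m2", { SKETCH_EVT_LOCAL, 0, SKETCH_EVT_LOCAL } },
	{ "m3", { 0, 0, 0 } },
};

/* Each active event consumes one counter. */
static int sketch_evt_count(unsigned int evts)
{
	return !!(evts & SKETCH_EVT_TOTAL) + !!(evts & SKETCH_EVT_LOCAL);
}

/* Counters not activated in domain @dom. */
static int sketch_free_in_domain(const struct sketch_group *g, size_t n,
				 int dom)
{
	int used = 0;

	assert(dom >= 0 && dom < SKETCH_NUM_DOMAINS);
	for (size_t i = 0; i < n; i++)
		used += sketch_evt_count(g[i].evts[dom]);
	return SKETCH_NUM_CNTRS - used;
}

/* Counters not assigned to any event of any group. */
static int sketch_free_global(const struct sketch_group *g, size_t n)
{
	int used = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int any = 0;

		for (int d = 0; d < SKETCH_NUM_DOMAINS; d++)
			any |= g[i].evts[d];
		used += sketch_evt_count(any);
	}
	return SKETCH_NUM_CNTRS - used;
}
```

For the state above the per-domain function gives 1, 3 and 1 for domains
0, 1 and 2, while the global function gives 0, which is why no counter
can be assigned to m3 even though every domain shows a non-zero value.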

 
> "available_mbm_cntrs":
> "The number of monitoring counters available for assignment in each domain when the architecture supports "mbm_cntr_assign" mode. A total of "num_mbm_cntrs" counters are available for assignment.
> A counter is assigned to an event within a monitoring group and is available for activation across all domains. Users have the flexibility to activate it selectively within specific domains."
> 

Once we understand how users are to use this file the documentation should be easier
to create.

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-25 18:17               ` Reinette Chatre
@ 2024-11-26 17:09                 ` Moger, Babu
  2024-11-26 19:01                   ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-26 17:09 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/25/2024 12:17 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/22/24 4:02 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/22/2024 3:37 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 11/22/24 10:25 AM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 11/18/2024 4:07 PM, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 11/18/24 11:04 AM, Moger, Babu wrote:
>>>>>> Hi Reinette,
>>>>>>
>>>>>> On 11/15/24 18:00, Reinette Chatre wrote:
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>>>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>>>>>>>> supported.
>>>>>>>>
>>>>>>>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>>>>>>>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>>>>>>>
>>>>>>>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>>>>>>>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>>>>>>>
>>>>>>>> The "default" mode is the existing monitoring mode that works without the
>>>>>>>> explicit counter assignment, instead relying on dynamic counter assignment
>>>>>>>> by hardware that may result in hardware not dedicating a counter resulting
>>>>>>>> in monitoring data reads returning "Unavailable".
>>>>>>>>
>>>>>>>> Provide an interface to display the monitor mode on the system.
>>>>>>>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>>>>>> [mbm_cntr_assign]
>>>>>>>> default
>>>>>>>>
>>>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>>>> ---
>>>>>
>>>>> ...
>>>>>
>>>>>>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>>>>>>> to make the event data "more predictable" and then be concerned when the mode does
>>>>>>> not exist.
>>>>>>>
>>>>>>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>>>>>>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>>>>>>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
>>>>>>> num_mbm_cntrs to num_rmids).
>>>>>>
>>>>>> There is some round about(or hacky) way to find that out number of RMIDs
>>>>>> that can be active.
>>>>>
>>>>> Does this give consistent and accurate data? Is this something that can be added to resctrl?
>>>>> (Reading your other message [1] it does not sound as though it can produce an accurate
>>>>> number on boot.)
>>>>> If not then it will be up to the documentation to be accurate.
>>>>>
>>>>>
>>>>>>>> +
>>>>>>>> +    AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>>>>>>> +    enable this mode by default so that counters remain assigned even when the
>>>>>>>> +    corresponding RMID is not in use by any processor.
>>>>>>>> +
>>>>>>>> +    "default":
>>>>>>>> +
>>>>>>>> +    In default mode resctrl assumes there is a hardware counter for each
>>>>>>>> +    event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>>>>>>> +    mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>>>>>>> +    with that event.
>>>>>>>
>>>>>>> If I understand correctly, on AMD platforms without ABMC the events only report
>>>>>>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>>>>>>> is unassigned and then reassigned then the event count will reset and the user
>>>>>>> will get some data back but it may thus be unpredictable (to match earlier language).
>>>>>>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>>>>>>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
>>>>>>
>>>>>> Yes. All the AMD systems without ABMC are affected by this problem.
>>>>>>
>>>>>>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>>>>>>> available, while not be made concerned to use "default" mode on Intel where
>>>>>>> mbm_assign_mode is not available.
>>>>>>
>>>>>> Can we add text to clarify this?
>>>>>
>>>>> Please do.
>>>>
>>>> I think we need to add text about AMD systems. How about this?
>>>>
>>>> "default":
>>>> In default mode resctrl assumes there is a hardware counter for each
>>>> event within every CTRL_MON and MON group. On AMD systems with 16 more monitoring groups, reading mbm_total_bytes or mbm_local_bytes may report 'Unavailable' if there is no counter associated with that event. It is therefore recommended to use the 'mbm_cntr_assign' mode, if supported."
>>>
>>>
>>> What is meant with "On AMD systems with 16 more monitoring groups"? First, the language is
>>> not clear, second, you mentioned earlier that there is just a "hacky" way to determine number
>>> of RMIDs that can be active but here "16" is made official in the documentation?
>>>
>>
>> The lowest active RMID is 16. I could not get it using the hacky way.
>> I have verified testing on all the previous generation of AMD systems by creating the monitoring groups until it reports "Unavailable".
>> In recent systems it is 32.  We can drop the exact number to be generic.
>>
>>
>> There is no clear documentation on that.  Here is what the doc says.
>>
>> A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values which the hardware supports. If an attempt is made to read a Bandwidth Count for an RMID that has been impacted by these hardware limitations, the “U” bit of the
>> QM_CTR will be set when the counter is read. Subsequent QM_CTR reads for that RMID and Event may return a value with the "U" bit clear. Potential causes of the “U” bit being set include (but are not limited to)
>>
>> • RMID is not currently tracked by the hardware.
>> • RMID was not tracked by the hardware at some time since it was last read.
>> • RMID has not been read since it started being tracked by the hardware.
>>
>> All RMIDs which are currently in use by one or more processors in the QOS domain will be tracked. The hardware will always begin tracking a new RMID value when it gets written to the PQR_ASSOC register of any of the processors in the QOS domain and it is not already being tracked. When the hardware begins tracking an RMID that it was not previously tracking, it will clear the QM_CTR for all events in the new RMID
>>
>> - Babu Moger
>>
> 
> I think I am starting to understand what is meant with the "count the traffic in an
> unpredictable way". From what I understand the hardware uses the "U" bit to indicate
> that an RMID was not tracked for a while, but it only sets this bit on the
> first read. After that the "U" bit may be cleared if a counter can be assigned to an RMID
> afterwards.
> If it was only user space that reads the data then it should be clear to the user when the
> hardware limitation is encountered and thus hardware behavior can be "predictable", but since
> the overflow handler runs once per second it may indeed be the overflow handler that
> encounters the "U" bit and that bit is not currently handled. This could leave user space
> with impression that events are always returning data but that data may indeed be wrong.
> 
> In another thread [1] Tony confirmed that "U" bit is not returned by Intel systems so
> this issue only impacts AMD. As I understand the other scenarios in which AMD systems
> can return "U" (the first read after assigning an RMID and the first read after changing
> the memory config) are all scenarios that can be controlled by resctrl.
> 
> I do not see why unpredictable data should be addressed with documentation. Could this not be
> fixed instead? Essentially stating "AMD systems without ABMC count the traffic in an unpredictable
> way" seems like a poor user experience.
> What if instead resctrl handles the "U" bit better? For example, when the overflow
> handler encounters the "U" bit the RMID can be permanently marked as "Unavailable"? Would
> that not be better than the counter behaving unpredictably with users never knowing if they
> can trust the event counters?

Actually, I was looking at handling "Unavailable" a little better.
Right now, I see that it reports "Unavailable" first, then goes into
overflow and stays in overflow forever.

Also, permanently marking the RMID as "Unavailable" is not a good option.
We should have a way to reset it. At some later point the RMID can become
active and report the correct numbers.

I was thinking of introducing a new arch state (in arch_mbm_state) to
handle this case. I need to investigate this further. What do you think?

> 
> Reinette
> 
> [1] https://lore.kernel.org/all/ZzUvA2XE01U25A38@agluck-desk3/
> 
> 

-- 
- Babu Moger



* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-26 17:09                 ` Moger, Babu
@ 2024-11-26 19:01                   ` Reinette Chatre
  2024-11-26 21:57                     ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-26 19:01 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/26/24 9:09 AM, Moger, Babu wrote:
> On 11/25/2024 12:17 PM, Reinette Chatre wrote:
>> On 11/22/24 4:02 PM, Moger, Babu wrote:
>>> On 11/22/2024 3:37 PM, Reinette Chatre wrote:
>>>> On 11/22/24 10:25 AM, Moger, Babu wrote:
>>>>> On 11/18/2024 4:07 PM, Reinette Chatre wrote:
>>>>>> On 11/18/24 11:04 AM, Moger, Babu wrote:
>>>>>>> On 11/15/24 18:00, Reinette Chatre wrote:
>>>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:

>>>>>>>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>>>>>>>> to make the event data "more predictable" and then be concerned when the mode does
>>>>>>>> not exist.
>>>>>>>>
>>>>>>>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>>>>>>>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>>>>>>>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
>>>>>>>> num_mbm_cntrs to num_rmids).
>>>>>>>
>>>>>>> There is some round about(or hacky) way to find that out number of RMIDs
>>>>>>> that can be active.
>>>>>>
>>>>>> Does this give consistent and accurate data? Is this something that can be added to resctrl?
>>>>>> (Reading your other message [1] it does not sound as though it can produce an accurate
>>>>>> number on boot.)
>>>>>> If not then it will be up to the documentation to be accurate.
>>>>>>
>>>>>>
>>>>>>>>> +
>>>>>>>>> +    AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>>>>>>>> +    enable this mode by default so that counters remain assigned even when the
>>>>>>>>> +    corresponding RMID is not in use by any processor.
>>>>>>>>> +
>>>>>>>>> +    "default":
>>>>>>>>> +
>>>>>>>>> +    In default mode resctrl assumes there is a hardware counter for each
>>>>>>>>> +    event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>>>>>>>> +    mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>>>>>>>> +    with that event.
>>>>>>>>
>>>>>>>> If I understand correctly, on AMD platforms without ABMC the events only report
>>>>>>>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>>>>>>>> is unassigned and then reassigned then the event count will reset and the user
>>>>>>>> will get some data back but it may thus be unpredictable (to match earlier language).
>>>>>>>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>>>>>>>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
>>>>>>>
>>>>>>> Yes. All the AMD systems without ABMC are affected by this problem.
>>>>>>>
>>>>>>>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>>>>>>>> available, while not be made concerned to use "default" mode on Intel where
>>>>>>>> mbm_assign_mode is not available.
>>>>>>>
>>>>>>> Can we add text to clarify this?
>>>>>>
>>>>>> Please do.
>>>>>
>>>>> I think we need to add text about AMD systems. How about this?
>>>>>
>>>>> "default":
>>>>> In default mode resctrl assumes there is a hardware counter for each
>>>>> event within every CTRL_MON and MON group. On AMD systems with 16 more monitoring groups, reading mbm_total_bytes or mbm_local_bytes may report 'Unavailable' if there is no counter associated with that event. It is therefore recommended to use the 'mbm_cntr_assign' mode, if supported."
>>>>
>>>>
>>>> What is meant with "On AMD systems with 16 more monitoring groups"? First, the language is
>>>> not clear, second, you mentioned earlier that there is just a "hacky" way to determine number
>>>> of RMIDs that can be active but here "16" is made official in the documentation?
>>>>
>>>
>>> The lowest active RMID is 16. I could not get it using the hacky way.
>>> I have verified testing on all the previous generation of AMD systems by creating the monitoring groups until it reports "Unavailable".
>>> In recent systems it is 32.  We can drop the exact number to be generic.
>>>
>>>
>>> There is no clear documentation on that.  Here is what the doc says.
>>>
>>> A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values which the hardware supports. If an attempt is made to read a Bandwidth Count for an RMID that has been impacted by these hardware limitations, the “U” bit of the
>>> QM_CTR will be set when the counter is read. Subsequent QM_CTR reads for that RMID and Event may return a value with the "U" bit clear. Potential causes of the “U” bit being set include (but are not limited to)
>>>
>>> • RMID is not currently tracked by the hardware.
>>> • RMID was not tracked by the hardware at some time since it was last read.
>>> • RMID has not been read since it started being tracked by the hardware.
>>>
>>> All RMIDs which are currently in use by one or more processors in the QOS domain will be tracked. The hardware will always begin tracking a new RMID value when it gets written to the PQR_ASSOC register of any of the processors in the QOS domain and it is not already being tracked. When the hardware begins tracking an RMID that it was not previously tracking, it will clear the QM_CTR for all events in the new RMID
>>>
>>> - Babu Moger
>>>
>>
>> I think I am starting to understand what is meant with the "count the traffic in an
>> unpredictable way". From what I understand the hardware uses the "U" bit to indicate
>> that an RMID was not tracked for a while, but it only sets this bit on the
>> first read. After that the "U" bit may be cleared if a counter can be assigned to an RMID
>> afterwards.
>> If it was only user space that reads the data then it should be clear to the user when the
>> hardware limitation is encountered and thus hardware behavior can be "predictable", but since
>> the overflow handler runs once per second it may indeed be the overflow handler that
>> encounters the "U" bit and that bit is not currently handled. This could leave user space
>> with impression that events are always returning data but that data may indeed be wrong.
>>
>> In another thread [1] Tony confirmed that "U" bit is not returned by Intel systems so
>> this issue only impacts AMD. As I understand the other scenarios in which AMD systems
>> can return "U" (the first read after assigning an RMID and the first read after changing
>> the memory config) are all scenarios that can be controlled by resctrl.
>>
>> I do not see why unpredictable data should be addressed with documentation. Could this not be
>> fixed instead? Essentially stating "AMD systems without ABMC count the traffic in an unpredictable
>> way" seems like a poor user experience.
>> What if instead resctrl handles the "U" bit better? For example, when the overflow
>> handler encounters the "U" bit the RMID can be permanently marked as "Unavailable"? Would
>> that not be better than the counter behaving unpredictably with users never knowing if they
>> can trust the event counters?
> 
> Actually, I was looking at handling "Unavailable" in little bit better way. Right now, I see it reports "Unavailable" first then it goes into overflow and stays in overflow forever.

Could you please elaborate on what you mean by "stays in overflow forever"?

> 
> Also setting the RMID Unavailable permanently is not a good option. We should have a way to reset it. At some later point the RMID can become active and report the correct numbers.

I assume that an RMID becoming active cannot be the trigger to reset it, since user space
would then not be aware that a counter was unavailable for a while.

> I was thinking of introducing a new arch state(in arch_mbm_state) to handle this case. Need to investigate more on this. What do you think?
> 

Some new state is surely needed to reflect that the RMID's data may be wrong. It is not clear to
me how you envision the reset of the state. If it is driven from user space then I expect that
resctrl needs to be taught something about this and it cannot just be buried in arch code.
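
To make the "new state" idea concrete, here is a minimal user-space model in
Python. It is only a sketch under assumptions, not kernel code: the class name,
its fields, the 44-bit counter width and the reset() entry point are all
hypothetical, and the real arch_mbm_state is not modelled. It shows one possible
policy: once any reader (including the periodic overflow handler) sees the "U"
bit, the event stays "Unavailable" until an explicit, user-driven reset.

```python
class MbmEventState:
    """Hypothetical per-RMID, per-event software state (illustration only)."""

    def __init__(self, width=44):
        # Counter width is an assumed example value.
        self.mask = (1 << width) - 1
        self.prev_raw = 0        # last raw hardware value seen
        self.chunks = 0          # accumulated count reported to the user
        self.unreliable = False  # sticky: set once the "U" bit was observed

    def update(self, raw, u_bit):
        """Called for every hardware read, by the overflow handler or a user read."""
        if u_bit:
            # The counter was not tracked for some time; remember that.
            self.unreliable = True
            self.prev_raw = None  # force a new baseline on the next valid read
            return
        if self.prev_raw is None:
            self.prev_raw = raw   # re-baseline, do not accumulate a bogus delta
            return
        self.chunks += (raw - self.prev_raw) & self.mask
        self.prev_raw = raw

    def read(self):
        """What user space would see when reading the event file."""
        return "Unavailable" if self.unreliable else self.chunks

    def reset(self):
        """Explicit reset requested from user space (hypothetical interface)."""
        self.unreliable = False
        self.chunks = 0
```

With this policy, a "U" bit that is seen only by the overflow handler is still
visible to the user on the next read, and the RMID becoming active again does
not silently clear it.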

Reinette


* Re: [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode
  2024-11-26 19:01                   ` Reinette Chatre
@ 2024-11-26 21:57                     ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-26 21:57 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/26/2024 1:01 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/26/24 9:09 AM, Moger, Babu wrote:
>> On 11/25/2024 12:17 PM, Reinette Chatre wrote:
>>> On 11/22/24 4:02 PM, Moger, Babu wrote:
>>>> On 11/22/2024 3:37 PM, Reinette Chatre wrote:
>>>>> On 11/22/24 10:25 AM, Moger, Babu wrote:
>>>>>> On 11/18/2024 4:07 PM, Reinette Chatre wrote:
>>>>>>> On 11/18/24 11:04 AM, Moger, Babu wrote:
>>>>>>>> On 11/15/24 18:00, Reinette Chatre wrote:
>>>>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
> 
>>>>>>>>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>>>>>>>>> to make the event data "more predictable" and then be concerned when the mode does
>>>>>>>>> not exist.
>>>>>>>>>
>>>>>>>>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>>>>>>>>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>>>>>>>>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
>>>>>>>>> num_mbm_cntrs to num_rmids).
>>>>>>>>
>>>>>>>> There is some round about(or hacky) way to find that out number of RMIDs
>>>>>>>> that can be active.
>>>>>>>
>>>>>>> Does this give consistent and accurate data? Is this something that can be added to resctrl?
>>>>>>> (Reading your other message [1] it does not sound as though it can produce an accurate
>>>>>>> number on boot.)
>>>>>>> If not then it will be up to the documentation to be accurate.
>>>>>>>
>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +    AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>>>>>>>>> +    enable this mode by default so that counters remain assigned even when the
>>>>>>>>>> +    corresponding RMID is not in use by any processor.
>>>>>>>>>> +
>>>>>>>>>> +    "default":
>>>>>>>>>> +
>>>>>>>>>> +    In default mode resctrl assumes there is a hardware counter for each
>>>>>>>>>> +    event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>>>>>>>>> +    mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>>>>>>>>> +    with that event.
>>>>>>>>>
>>>>>>>>> If I understand correctly, on AMD platforms without ABMC the events only report
>>>>>>>>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>>>>>>>>> is unassigned and then reassigned then the event count will reset and the user
>>>>>>>>> will get some data back but it may thus be unpredictable (to match earlier language).
>>>>>>>>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>>>>>>>>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
>>>>>>>>
>>>>>>>> Yes. All the AMD systems without ABMC are affected by this problem.
>>>>>>>>
>>>>>>>>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>>>>>>>>> available, while not be made concerned to use "default" mode on Intel where
>>>>>>>>> mbm_assign_mode is not available.
>>>>>>>>
>>>>>>>> Can we add text to clarify this?
>>>>>>>
>>>>>>> Please do.
>>>>>>
>>>>>> I think we need to add text about AMD systems. How about this?
>>>>>>
>>>>>> "default":
>>>>>> In default mode resctrl assumes there is a hardware counter for each
>>>>>> event within every CTRL_MON and MON group. On AMD systems with 16 more monitoring groups, reading mbm_total_bytes or mbm_local_bytes may report 'Unavailable' if there is no counter associated with that event. It is therefore recommended to use the 'mbm_cntr_assign' mode, if supported."
>>>>>
>>>>>
>>>>> What is meant with "On AMD systems with 16 more monitoring groups"? First, the language is
>>>>> not clear, second, you mentioned earlier that there is just a "hacky" way to determine number
>>>>> of RMIDs that can be active but here "16" is made official in the documentation?
>>>>>
>>>>
>>>> The lowest active RMID is 16. I could not get it using the hacky way.
>>>> I have verified testing on all the previous generation of AMD systems by creating the monitoring groups until it reports "Unavailable".
>>>> In recent systems it is 32.  We can drop the exact number to be generic.
>>>>
>>>>
>>>> There is no clear documentation on that.  Here is what the doc says.
>>>>
>>>> A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values which the hardware supports. If an attempt is made to read a Bandwidth Count for an RMID that has been impacted by these hardware limitations, the “U” bit of the
>>>> QM_CTR will be set when the counter is read. Subsequent QM_CTR reads for that RMID and Event may return a value with the "U" bit clear. Potential causes of the “U” bit being set include (but are not limited to)
>>>>
>>>> • RMID is not currently tracked by the hardware.
>>>> • RMID was not tracked by the hardware at some time since it was last read.
>>>> • RMID has not been read since it started being tracked by the hardware.
>>>>
>>>> All RMIDs which are currently in use by one or more processors in the QOS domain will be tracked. The hardware will always begin tracking a new RMID value when it gets written to the PQR_ASSOC register of any of the processors in the QOS domain and it is not already being tracked. When the hardware begins tracking an RMID that it was not previously tracking, it will clear the QM_CTR for all events in the new RMID
>>>>
>>>> - Babu Moger
>>>>
>>>
>>> I think I am starting to understand what is meant with the "count the traffic in an
>>> unpredictable way". From what I understand the hardware uses the "U" bit to indicate
>>> that an RMID was not tracked for a while, but it only sets this bit on the
>>> first read. After that the "U" bit may be cleared if a counter can be assigned to an RMID
>>> afterwards.
>>> If it was only user space that reads the data then it should be clear to the user when the
>>> hardware limitation is encountered and thus hardware behavior can be "predictable", but since
>>> the overflow handler runs once per second it may indeed be the overflow handler that
>>> encounters the "U" bit and that bit is not currently handled. This could leave user space
>>> with impression that events are always returning data but that data may indeed be wrong.
>>>
>>> In another thread [1] Tony confirmed that "U" bit is not returned by Intel systems so
>>> this issue only impacts AMD. As I understand the other scenarios in which AMD systems
>>> can return "U" (the first read after assigning an RMID and the first read after changing
>>> the memory config) are all scenarios that can be controlled by resctrl.
>>>
>>> I do not see why unpredictable data should be addressed with documentation. Could this not be
>>> fixed instead? Essentially stating "AMD systems without ABMC count the traffic in an unpredictable
>>> way" seems like a poor user experience.
>>> What if instead resctrl handles the "U" bit better? For example, when the overflow
>>> handler encounters the "U" bit the RMID can be permanently marked as "Unavailable"? Would
>>> that not be better than the counter behaving unpredictably with users never knowing if they
>>> can trust the event counters?
>>
>> Actually, I was looking at handling "Unavailable" in little bit better way. Right now, I see it reports "Unavailable" first then it goes into overflow and stays in overflow forever.
> 
> Could you please elaborate what you mean with "stays in overflow forever"?

This may not be an issue. Once the overflow (a large number) happens, it
will stay in that state until there is another change. But we are only
concerned about the delta, and the delta is fine.
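
As a rough illustration of why only the delta matters, here is a small Python
sketch of a wraparound-safe delta between two raw counter reads. The function
name and the 44-bit width used in the example are assumptions made for
illustration; this is a model of the arithmetic, not the kernel implementation.

```python
def mbm_delta(prev_raw, cur_raw, width):
    """Chunks counted between two raw reads of a width-bit hardware counter.

    The subtraction is done modulo 2**width, so a single wraparound of the
    counter between the two reads still yields the correct delta.
    """
    mask = (1 << width) - 1
    return (cur_raw - prev_raw) & mask


# No wraparound: plain difference.
normal = mbm_delta(10, 25, 44)

# Counter wrapped once between the two reads (assumed 44-bit counter).
wrapped = mbm_delta((1 << 44) - 5, 3, 44)
```

The absolute value after an overflow can look arbitrarily large, but as long as
consecutive reads are subtracted modulo the counter width the accumulated
bandwidth stays correct.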

> 
>>
>> Also setting the RMID Unavailable permanently is not a good option. We should have a way to reset it. At some later point the RMID can become active and report the correct numbers.
> 
> I assume that when an RMID becomes active cannot be the trigger to reset it since user space cannot
> then be aware that a counter was not available for a while.

Yes. That is correct.

> 
>> I was thinking of introducing a new arch state(in arch_mbm_state) to handle this case. Need to investigate more on this. What do you think?
>>
> 
> Some new state is surely needed to reflect that the RMID's data may be wrong. It is not clear to
> me how you envision the reset of the state. If it is driven from user space then I expect that
> resctrl needs to be taught something about this and it cannot just be buried in arch code.
> 

Yes. We need to take a hard look at this.
-- 
- Babu Moger



* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-25 19:00           ` Reinette Chatre
@ 2024-11-26 23:31             ` Moger, Babu
  2024-11-26 23:56               ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-26 23:31 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/25/2024 1:00 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/22/24 3:36 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 11/21/2024 3:12 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 11/19/24 11:20 AM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 11/15/24 18:31, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>> Provide the interface to display the number of free monitoring counters
>>>>>> available for assignment in each domain when mbm_cntr_assign is supported.
>>>>>>
>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>> ---
>>>>>> v9: New patch.
>>>>>> ---
>>>>>>    Documentation/arch/x86/resctrl.rst     |  4 ++++
>>>>>>    arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>>>>>    arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>>>>    3 files changed, 38 insertions(+)
>>>>>>
>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>>> index 2f3a86278e84..2bc58d974934 100644
>>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>>> @@ -302,6 +302,10 @@ with the following files:
>>>>>>        memory bandwidth tracking to a single memory bandwidth event per
>>>>>>        monitoring group.
>>>>>>    +"available_mbm_cntrs":
>>>>>> +    The number of free monitoring counters available assignment in each domain
>>>>>
>>>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>>>>> counters available for assignment"?
>>>>>
>>>>> (not taking into account how text may change after addressing Peter's feedback)
>>>>
>>>> How about this?
>>>>
>>>> "The number of monitoring counters available for assignment in each domain
>>>> when the architecture supports mbm_cntr_assign mode. There are a total of
>>>> "num_mbm_cntrs" counters are available for assignment. Counters can be
>>>> assigned or unassigned individually in each domain. A counter is available
>>>> for new assignment if it is unassigned in all domains."
>>>
>>> Please consider the context of this paragraph. It follows right after the description
>>> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
>>> I think it is confusing to follow that with a paragraph that states "Counters can be
>>> assigned or unassigned individually in each domain." I wonder if it may be helpful to
>>> use a different term ... for example a counter is *assigned* to an event of a monitoring
>>> group but this assignment may be to specified (not yet supported) or all (this work) domains while
>>> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
>>> needs to remain coherent if future work decides to indeed support per-domain assignment.
>>>
>>
>> Little bit lost here. Please help me.
> 
> I think this highlights the uncertainty this interface brings. How do you expect users
> to use this interface? At this time I think this interface can create a lot of confusion.
> For example, consider a hypothetical system with three domains and four counters that
> has the following state per mbm_assign_control:
> 
> //0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
> /m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
> /m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
> /m3/0=_;1=_;2=_
> 
> Since, in this system there are only four counters available, and
> they have all been assigned, then there are no new counters available for
> assignment.
> 
> If I understand correctly, available_mbm_cntrs will read:
> 0=1;1=3;2=1

Yes, exactly. This is confusing to the user.
> 
> How is a user to interpret the above numbers? It does not reflect
> that no counter can be assigned to m3, instead it reflects which of the
> already assigned counters still need to be activated on domains.
> If, for example, a user is expected to use this file to know how
> many counters can still be assigned, should it not reflect the actual
> available counters. In the above scenario it will then be:
> 0=0;1=0;2=0

We can also just print a single number:
# cat available_mbm_cntrs
0

The domain-specific information is not important here.
That was my original idea. We can go back to that definition, which is
clearer to the user.
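
For reference, the two readings discussed above can be reproduced with a small
Python sketch. It is only a user-space model of the hypothetical system in the
example (four counters, three domains); the helper names are made up for
illustration and the parser handles just the lines quoted in this thread.

```python
# Hypothetical system from the example above.
NUM_MBM_CNTRS = 4

# mbm_assign_control lines as quoted above: 't' = total, 'l' = local,
# '_' = no event active in that domain.
STATE = """\
//0=tl;1=_;2=l
/m1/0=_;1=t;2=t
/m2/0=l;1=_;2=l
/m3/0=_;1=_;2=_
"""


def parse(state):
    """Return {(ctrl_group, mon_group): {domain_id: set of event flags}}."""
    groups = {}
    for line in state.splitlines():
        ctrl, mon, assignments = line.split("/", 2)
        domains = {}
        for item in assignments.split(";"):
            dom, flags = item.split("=")
            domains[int(dom)] = set(flags) - {"_"}
        groups[(ctrl, mon)] = domains
    return groups


def per_domain_free(groups, num_cntrs):
    """Per-domain view: counters not yet activated in each domain."""
    used = {}
    for domains in groups.values():
        for dom, events in domains.items():
            used[dom] = used.get(dom, 0) + len(events)
    return {dom: num_cntrs - n for dom, n in sorted(used.items())}


def global_free(groups, num_cntrs):
    """Global view: a counter is consumed per (group, event) in any domain."""
    assigned = 0
    for domains in groups.values():
        assigned += len(set().union(*domains.values()))
    return num_cntrs - assigned


groups = parse(STATE)
per_domain = per_domain_free(groups, NUM_MBM_CNTRS)  # {0: 1, 1: 3, 2: 1}
free = global_free(groups, NUM_MBM_CNTRS)            # 0
```

The per-domain result matches "0=1;1=3;2=1", while the global result is 0,
which is the number a user needs in order to know that no counter can be
assigned to m3.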

> 
> Of course, when doing the above the user may get impression that a counter
> that has already been assigned, just not activated, is no longer available
> for use.
> 
>   
>> "available_mbm_cntrs":
>> "The number of monitoring counters available for assignment in each domain when the architecture supports "mbm_cntr_assign" mode. There are a total of "num_mbm_cntrs" counters are available for assignment.
>> A counter is assigned to an event within a monitoring group and is available for activation across all domains. Users have the flexibility to activate it selectively within specific domains."
>>
> 
> Once we understand how users are to use this file the documentation should be easier
> to create.
> 
> Reinette
> 
> 

-- 
- Babu Moger



* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-26 23:31             ` Moger, Babu
@ 2024-11-26 23:56               ` Reinette Chatre
  2024-11-27 14:57                 ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-26 23:56 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/26/24 3:31 PM, Moger, Babu wrote:
> On 11/25/2024 1:00 PM, Reinette Chatre wrote:
>> On 11/22/24 3:36 PM, Moger, Babu wrote:
>>> On 11/21/2024 3:12 PM, Reinette Chatre wrote:
>>>> On 11/19/24 11:20 AM, Moger, Babu wrote:
>>>>> On 11/15/24 18:31, Reinette Chatre wrote:
>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>>> Provide the interface to display the number of free monitoring counters
>>>>>>> available for assignment in each doamin when mbm_cntr_assign is supported.
>>>>>>>
>>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>>> ---
>>>>>>> v9: New patch.
>>>>>>> ---
>>>>>>>    Documentation/arch/x86/resctrl.rst     |  4 ++++
>>>>>>>    arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>>>>>>    arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>>>>>    3 files changed, 38 insertions(+)
>>>>>>>
>>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>>>> index 2f3a86278e84..2bc58d974934 100644
>>>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>>>> @@ -302,6 +302,10 @@ with the following files:
>>>>>>>        memory bandwidth tracking to a single memory bandwidth event per
>>>>>>>        monitoring group.
>>>>>>>    +"available_mbm_cntrs":
>>>>>>> +    The number of free monitoring counters available assignment in each domain
>>>>>>
>>>>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>>>>>> counters available for assignment"?
>>>>>>
>>>>>> (not taking into account how text may change after addressing Peter's feedback)
>>>>>
>>>>> How about this?
>>>>>
>>>>> "The number of monitoring counters available for assignment in each domain
>>>>> when the architecture supports mbm_cntr_assign mode. There are a total of
>>>>> "num_mbm_cntrs" counters are available for assignment. Counters can be
>>>>> assigned or unassigned individually in each domain. A counter is available
>>>>> for new assignment if it is unassigned in all domains."
>>>>
>>>> Please consider the context of this paragraph. It follows right after the description
>>>> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
>>>> I think it is confusing to follow that with a paragraph that states "Counters can be
>>>> assigned or unassigned individually in each domain." I wonder if it may be helpful to
>>>> use a different term ... for example a counter is *assigned* to an event of a monitoring
>>>> group but this assignment may be to specified (not yet supported) or all (this work) domains while
>>>> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
>>>> needs to remain coherent if future work decides to indeed support per-domain assignment.
>>>>
>>>
>>> Little bit lost here. Please help me.
>>
>> I think this highlights the uncertainty this interface brings. How do you expect users
>> to use this interface? At this time I think this interface can create a lot of confusion.
>> For example, consider a hypothetical system with three domains and four counters that
>> has the following state per mbm_assign_control:
>>
>> //0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
>> /m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
>> /m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
>> /m3/0=_;1=_;2=_
>>
>> Since, in this system there are only four counters available, and
>> they have all been assigned, then there are no new counters available for
>> assignment.
>>
>> If I understand correctly, available_mbm_cntrs will read:
>> 0=1;1=3;2=1
> 
> Yes. Exactly. This causes confusion to the user.
>>
>> How is a user to interpret the above numbers? It does not reflect
>> that no counter can be assigned to m3, instead it reflects which of the
>> already assigned counters still need to be activated on domains.
>> If, for example, a user is expected to use this file to know how
>> many counters can still be assigned, should it not reflect the actual
>> available counters. In the above scenario it will then be:
>> 0=0;1=0;2=0
> 
> We can also just print
> #cat available_mbm_cntrs
> 0
> 
> The domain specific information is not important here.
> That was my original idea. We can go back to that definition. That is more clear to the user.

Tony's response [1] still applies.

I believe Tony's suggestion [2] considered that the available counters will be the
same for every domain for this implementation. That is why my example noted:
"0=0;1=0;2=0"

The confusion surrounding the global allocator seems to be prevalent ([3], [4]) as folks
familiar with resctrl attempt to digest the work. The struggle to make this documentation clear
makes me more concerned how this feature will be perceived by users who are not as familiar with
resctrl internals. I think that it may be worth it to take a moment and investigate what it will take
to implement a per-domain counter allocator. The hardware supports it and I suspect that the upfront
work to do the enabling will make it easier for users to adopt and understand the feature.
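
The two readings of available_mbm_cntrs debated in this thread can be
reproduced with a small userspace sketch. It encodes the hypothetical
three-domain, four-counter state quoted above and computes both the global
count of unassigned counters and the per-domain count of counters not yet
activated. All identifiers (group_state, global_free, domain_inactive, the
EV_* flags) are illustrative assumptions and are not taken from the patch
series.

```c
#include <assert.h>

#define NUM_DOMAINS 3
#define NUM_CNTRS   4
#define EV_TOTAL    0x1u	/* 't' in mbm_assign_control */
#define EV_LOCAL    0x2u	/* 'l' in mbm_assign_control */

/* Hypothetical model: events activated in each domain for one group. */
struct group_state {
	const char *name;
	unsigned int dom_events[NUM_DOMAINS];
};

/* The example state from the thread. */
static const struct group_state example[] = {
	{ "default", { EV_TOTAL | EV_LOCAL, 0, EV_LOCAL } },
	{ "m1",      { 0, EV_TOTAL, EV_TOTAL } },
	{ "m2",      { EV_LOCAL, 0, EV_LOCAL } },
	{ "m3",      { 0, 0, 0 } },
};

static int nr_events(unsigned int ev)
{
	return !!(ev & EV_TOTAL) + !!(ev & EV_LOCAL);
}

/*
 * Global allocator view: an event consumes one counter as soon as it is
 * activated in at least one domain.
 */
static int global_free(const struct group_state *g, int nr_groups)
{
	int used = 0;

	for (int i = 0; i < nr_groups; i++) {
		unsigned int any = 0;

		for (int d = 0; d < NUM_DOMAINS; d++)
			any |= g[i].dom_events[d];
		used += nr_events(any);
	}
	return NUM_CNTRS - used;
}

/* Per-domain view: counters that are not activated in domain 'dom'. */
static int domain_inactive(const struct group_state *g, int nr_groups, int dom)
{
	int active = 0;

	assert(dom >= 0 && dom < NUM_DOMAINS);
	for (int i = 0; i < nr_groups; i++)
		active += nr_events(g[i].dom_events[dom]);
	return NUM_CNTRS - active;
}
```

For the example state, global_free() yields 0 while domain_inactive() yields
1, 3 and 1 for domains 0, 1 and 2, which is the "0=1;1=3;2=1" reading that
does not tell the user that nothing can be assigned to m3.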

What do you think?

Reinette

[1] https://lore.kernel.org/all/SJ1PR11MB6083DC9EA6D323356E957A87FC4E2@SJ1PR11MB6083.namprd11.prod.outlook.com/
[2] https://lore.kernel.org/all/SJ1PR11MB6083583A24FA3B3B7C2DCD64FC442@SJ1PR11MB6083.namprd11.prod.outlook.com/
[3] https://lore.kernel.org/all/ZwmadFbK--Qb8qWP@agluck-desk3.sc.intel.com/
[4] https://lore.kernel.org/all/CALPaoCh1BWdWww8Kztd13GBaY9mMeZX268fOQgECRytiKm-nPQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-26 23:56               ` Reinette Chatre
@ 2024-11-27 14:57                 ` Moger, Babu
  2024-11-27 19:05                   ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-27 14:57 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 11/26/2024 5:56 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 11/26/24 3:31 PM, Moger, Babu wrote:
>> On 11/25/2024 1:00 PM, Reinette Chatre wrote:
>>> On 11/22/24 3:36 PM, Moger, Babu wrote:
>>>> On 11/21/2024 3:12 PM, Reinette Chatre wrote:
>>>>> On 11/19/24 11:20 AM, Moger, Babu wrote:
>>>>>> On 11/15/24 18:31, Reinette Chatre wrote:
>>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>>>> Provide the interface to display the number of free monitoring counters
>>>>>>>> available for assignment in each doamin when mbm_cntr_assign is supported.
>>>>>>>>
>>>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>>>> ---
>>>>>>>> v9: New patch.
>>>>>>>> ---
>>>>>>>>     Documentation/arch/x86/resctrl.rst     |  4 ++++
>>>>>>>>     arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>>>>>>>     arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>>>>>>     3 files changed, 38 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>>>>> index 2f3a86278e84..2bc58d974934 100644
>>>>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>>>>> @@ -302,6 +302,10 @@ with the following files:
>>>>>>>>         memory bandwidth tracking to a single memory bandwidth event per
>>>>>>>>         monitoring group.
>>>>>>>>     +"available_mbm_cntrs":
>>>>>>>> +    The number of free monitoring counters available assignment in each domain
>>>>>>>
>>>>>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>>>>>>> counters available for assignment"?
>>>>>>>
>>>>>>> (not taking into account how text may change after addressing Peter's feedback)
>>>>>>
>>>>>> How about this?
>>>>>>
>>>>>> "The number of monitoring counters available for assignment in each domain
>>>>>> when the architecture supports mbm_cntr_assign mode. There are a total of
>>>>>> "num_mbm_cntrs" counters are available for assignment. Counters can be
>>>>>> assigned or unassigned individually in each domain. A counter is available
>>>>>> for new assignment if it is unassigned in all domains."
>>>>>
>>>>> Please consider the context of this paragraph. It follows right after the description
>>>>> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
>>>>> I think it is confusing to follow that with a paragraph that states "Counters can be
>>>>> assigned or unassigned individually in each domain." I wonder if it may be helpful to
>>>>> use a different term ... for example a counter is *assigned* to an event of a monitoring
>>>>> group but this assignment may be to specified (not yet supported) or all (this work) domains while
>>>>> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
>>>>> needs to remain coherent if future work decides to indeed support per-domain assignment.
>>>>>
>>>>
>>>> Little bit lost here. Please help me.
>>>
>>> I think this highlights the uncertainty this interface brings. How do you expect users
>>> to use this interface? At this time I think this interface can create a lot of confusion.
>>> For example, consider a hypothetical system with three domains and four counters that
>>> has the following state per mbm_assign_control:
>>>
>>> //0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
>>> /m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
>>> /m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
>>> /m3/0=_;1=_;2=_
>>>
>>> Since, in this system there are only four counters available, and
>>> they have all been assigned, then there are no new counters available for
>>> assignment.
>>>
>>> If I understand correctly, available_mbm_cntrs will read:
>>> 0=1;1=3;2=1
>>
>> Yes. Exactly. This causes confusion to the user.
>>>
>>> How is a user to interpret the above numbers? It does not reflect
>>> that no counter can be assigned to m3, instead it reflects which of the
>>> already assigned counters still need to be activated on domains.
>>> If, for example, a user is expected to use this file to know how
>>> many counters can still be assigned, should it not reflect the actual
>>> available counters. In the above scenario it will then be:
>>> 0=0;1=0;2=0
>>
>> We can also just print
>> #cat available_mbm_cntrs
>> 0
>>
>> The domain specific information is not important here.
>> That was my original idea. We can go back to that definition. That is more clear to the user.
> 
> Tony's response [1] still applies.
> 
> I believe Tony's suggestion [2] considered that the available counters will be the
> same for every domain for this implementation. That is why my example noted:
> "0=0;1=0;2=0"

Yes. We can keep it like this.

> 
> The confusion surrounding the global allocator seems to be prevalent ([3], [4]) as folks
> familiar with resctrl attempt to digest the work. The struggle to make this documentation clear
> makes me more concerned how this feature will be perceived by users who are not as familiar with
> resctrl internals. I think that it may be worth it to take a moment and investigate what it will take
> to implement a per-domain counter allocator. The hardware supports it and I suspect that the upfront
> work to do the enabling will make it easier for users to adopt and understand the feature.
> 
> What do you think?

It adds more complexity for sure.

1. Each group needs to remember counter ids in each domain for each event.
   For example:
   Resctrl group mon1
    Total event
    dom 0 cntr_id 1,
    dom 1 cntr_id 10
    dom 2 cntr_id 11

   Local event
    dom 0 cntr_id 2,
    dom 1 cntr_id 15
    dom 2 cntr_id 10


2. We should have a bitmap of "available counters" in each domain. We have
this already. But allocation method changes.

3. Dynamic allocation/free of the counters

There could be some more things which I can't think of right now. They might
come up when we start working on it.

It is doable. But, is it worth adding this complexity? I am not sure.

Peter mentioned earlier that he was not interested in domain-specific
assignments. He was only interested in the all-domain ("*") implementation.

We can add the support, but I am not sure it is going to help the user
drastically.

Yes, we should keep the options open for supporting domain-level
allocation in the future.

For now, we can go ahead with the current implementation.


> 
> Reinette
> 
> [1] https://lore.kernel.org/all/SJ1PR11MB6083DC9EA6D323356E957A87FC4E2@SJ1PR11MB6083.namprd11.prod.outlook.com/
> [2] https://lore.kernel.org/all/SJ1PR11MB6083583A24FA3B3B7C2DCD64FC442@SJ1PR11MB6083.namprd11.prod.outlook.com/
> [3] https://lore.kernel.org/all/ZwmadFbK--Qb8qWP@agluck-desk3.sc.intel.com/
> [4] https://lore.kernel.org/all/CALPaoCh1BWdWww8Kztd13GBaY9mMeZX268fOQgECRytiKm-nPQ@mail.gmail.com/
> 

-- 
- Babu Moger


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-27 14:57                 ` Moger, Babu
@ 2024-11-27 19:05                   ` Reinette Chatre
  2024-11-28 11:10                     ` Peter Newman
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-11-27 19:05 UTC (permalink / raw)
  To: babu.moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: fenghua.yu, x86, hpa, thuth, paulmck, rostedt, akpm,
	xiongwei.song, pawan.kumar.gupta, daniel.sneddon, perry.yuan,
	sandipan.das, kai.huang, xiaoyao.li, seanjc, jithu.joseph,
	brijesh.singh, xin3.li, ebiggers, andrew.cooper3,
	mario.limonciello, james.morse, tan.shaopeng, tony.luck,
	linux-doc, linux-kernel, peternewman, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 11/27/24 6:57 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/26/2024 5:56 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 11/26/24 3:31 PM, Moger, Babu wrote:
>>> On 11/25/2024 1:00 PM, Reinette Chatre wrote:
>>>> On 11/22/24 3:36 PM, Moger, Babu wrote:
>>>>> On 11/21/2024 3:12 PM, Reinette Chatre wrote:
>>>>>> On 11/19/24 11:20 AM, Moger, Babu wrote:
>>>>>>> On 11/15/24 18:31, Reinette Chatre wrote:
>>>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>>>>> Provide the interface to display the number of free monitoring counters
>>>>>>>>> available for assignment in each doamin when mbm_cntr_assign is supported.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>>>>> ---
>>>>>>>>> v9: New patch.
>>>>>>>>> ---
>>>>>>>>>     Documentation/arch/x86/resctrl.rst     |  4 ++++
>>>>>>>>>     arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>>>>>>>>     arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>>>>>>>     3 files changed, 38 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>>>>>> index 2f3a86278e84..2bc58d974934 100644
>>>>>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>>>>>> @@ -302,6 +302,10 @@ with the following files:
>>>>>>>>>         memory bandwidth tracking to a single memory bandwidth event per
>>>>>>>>>         monitoring group.
>>>>>>>>>     +"available_mbm_cntrs":
>>>>>>>>> +    The number of free monitoring counters available assignment in each domain
>>>>>>>>
>>>>>>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>>>>>>>> counters available for assignment"?
>>>>>>>>
>>>>>>>> (not taking into account how text may change after addressing Peter's feedback)
>>>>>>>
>>>>>>> How about this?
>>>>>>>
>>>>>>> "The number of monitoring counters available for assignment in each domain
>>>>>>> when the architecture supports mbm_cntr_assign mode. There are a total of
>>>>>>> "num_mbm_cntrs" counters are available for assignment. Counters can be
>>>>>>> assigned or unassigned individually in each domain. A counter is available
>>>>>>> for new assignment if it is unassigned in all domains."
>>>>>>
>>>>>> Please consider the context of this paragraph. It follows right after the description
>>>>>> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
>>>>>> I think it is confusing to follow that with a paragraph that states "Counters can be
>>>>>> assigned or unassigned individually in each domain." I wonder if it may be helpful to
>>>>>> use a different term ... for example a counter is *assigned* to an event of a monitoring
>>>>>> group but this assignment may be to specified (not yet supported) or all (this work) domains while
>>>>>> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
>>>>>> needs to remain coherent if future work decides to indeed support per-domain assignment.
>>>>>>
>>>>>
>>>>> Little bit lost here. Please help me.
>>>>
>>>> I think this highlights the uncertainty this interface brings. How do you expect users
>>>> to use this interface? At this time I think this interface can create a lot of confusion.
>>>> For example, consider a hypothetical system with three domains and four counters that
>>>> has the following state per mbm_assign_control:
>>>>
>>>> //0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
>>>> /m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
>>>> /m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
>>>> /m3/0=_;1=_;2=_
>>>>
>>>> Since, in this system there are only four counters available, and
>>>> they have all been assigned, then there are no new counters available for
>>>> assignment.
>>>>
>>>> If I understand correctly, available_mbm_cntrs will read:
>>>> 0=1;1=3;2=1
>>>
>>> Yes. Exactly. This causes confusion to the user.
>>>>
>>>> How is a user to interpret the above numbers? It does not reflect
>>>> that no counter can be assigned to m3, instead it reflects which of the
>>>> already assigned counters still need to be activated on domains.
>>>> If, for example, a user is expected to use this file to know how
>>>> many counters can still be assigned, should it not reflect the actual
>>>> available counters. In the above scenario it will then be:
>>>> 0=0;1=0;2=0
>>>
>>> We can also just print
>>> #cat available_mbm_cntrs
>>> 0
>>>
>>> The domain specific information is not important here.
>>> That was my original idea. We can go back to that definition. That is more clear to the user.
>>
>> Tony's response [1] still applies.
>>
>> I believe Tony's suggestion [2] considered that the available counters will be the
>> same for every domain for this implementation. That is why my example noted:
>> "0=0;1=0;2=0"
> 
> yes. We can keep it like this.
> 
>>
>> The confusion surrounding the global allocator seems to be prevalent ([3], [4]) as folks
>> familiar with resctrl attempt to digest the work. The struggle to make this documentation clear
>> makes me more concerned how this feature will be perceived by users who are not as familiar with
>> resctrl internals. I think that it may be worth it to take a moment and investigate what it will take
>> to implement a per-domain counter allocator. The hardware supports it and I suspect that the upfront
>> work to do the enabling will make it easier for users to adopt and understand the feature.
>>
>> What do you think?
> 
> It adds more complexity for sure.

I do see a difference in data structures used but the additional complexity is not
obvious to me. It seems like there will be one fewer data structure, the
global bitmap, and I think that will actually bring with it some simplification since
there is no longer the need to coordinate between the per-domain and global counters,
for example the logic that only frees a global counter if it is no longer used by a domain.

This may also simplify the update of the monitor event config (BMEC) since it can be
done directly on counters of the domain instead of needing to go back and forth between
global and per-domain counters.

> 
> 1. Each group needs to remember counter ids in each domain for each event.
>    For example:
>    Resctrl group mon1
>     Total event
>     dom 0 cntr_id 1,
>     dom 1 cntr_id 10
>     dom 2 cntr_id 11
> 
>    Local event
>     dom 0 cntr_id 2,
>     dom 1 cntr_id 15
>     dom 2 cntr_id 10

Indeed. The challenge here is that domains may come and go so it cannot be a simple
static array. As an alternative it can be an xarray indexed by the domain ID with
pointers to a struct like below to contain the counters associated with the monitor
group:
	struct cntr_id {
		u32	mbm_total;
		u32	mbm_local;
	};
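
A minimal userspace sketch of that idea follows. It is not kernel code: a
singly linked list keyed by domain ID stands in for the xarray, and the
helper names (dom_cntr_store, dom_cntr_load, dom_cntr_erase) are hypothetical.
In the kernel the same three operations would presumably map onto xa_store(),
xa_load() and xa_erase(), with the erase done when a domain goes offline.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct cntr_id {
	uint32_t mbm_total;
	uint32_t mbm_local;
};

/* Hypothetical stand-in for one xarray slot, keyed by domain ID. */
struct dom_cntr_node {
	int dom_id;
	struct cntr_id ids;
	struct dom_cntr_node *next;
};

/* Hypothetical per-monitor-group map from domain ID to counter IDs. */
struct dom_cntr_map {
	struct dom_cntr_node *head;
};

/* Look up the counters of a domain; NULL if the domain has no entry. */
static struct cntr_id *dom_cntr_load(struct dom_cntr_map *map, int dom_id)
{
	assert(map);
	for (struct dom_cntr_node *n = map->head; n; n = n->next)
		if (n->dom_id == dom_id)
			return &n->ids;
	return NULL;
}

/* Record the counters of a domain, creating the entry on first use. */
static int dom_cntr_store(struct dom_cntr_map *map, int dom_id,
			  uint32_t total, uint32_t local)
{
	struct cntr_id *ids = dom_cntr_load(map, dom_id);

	if (!ids) {
		struct dom_cntr_node *n = malloc(sizeof(*n));

		if (!n)
			return -1;
		n->dom_id = dom_id;
		n->next = map->head;
		map->head = n;
		ids = &n->ids;
	}
	ids->mbm_total = total;
	ids->mbm_local = local;
	return 0;
}

/* Drop the entry of a domain, e.g. when the domain goes offline. */
static void dom_cntr_erase(struct dom_cntr_map *map, int dom_id)
{
	struct dom_cntr_node **pp = &map->head;

	while (*pp) {
		if ((*pp)->dom_id == dom_id) {
			struct dom_cntr_node *dead = *pp;

			*pp = dead->next;
			free(dead);
			return;
		}
		pp = &(*pp)->next;
	}
}
```

Because entries are created and erased per domain, the map never holds stale
counter IDs for a domain that no longer exists.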


Thinking more about how this array needs to be managed made me wonder how the
current implementation deals with domains that come and go. I do not think
this is currently handled. For example, if a new domain comes online and 
monitoring groups had counters dynamically assigned, then these counters are
not configured to the newly online domain. 

> 
> 
> 2. We should have a bitmap of "available counters" in each domain. We have
> this already. But allocation method changes.

Would allocation/free not be simpler with only the per-domain bitmap needing
to be consulted?

One implementation change I can think of is the dynamic assign of counters when
a monitor group is created. Now a free counter needs to be found in each
domain. Here it can be discussed if it should be an "all or nothing"
assignment but the handling does not seem to be complex and would need to be
solved eventually anyway.
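
As a rough illustration of the "all or nothing" variant, the sketch below
keeps one in-use bitmap per domain and rolls back if any domain has no free
counter. The sizes and all names (dom_cntrs, dom_cntr_alloc, dom_cntr_free,
assign_all_domains) are assumptions made for this example; the kernel would
use its own bitmap helpers and domain lists.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_DOMAINS 3
#define NUM_CNTRS   4

/* One bitmap per domain: bit n set means counter n is in use there. */
struct dom_cntrs {
	uint32_t used[NUM_DOMAINS];
};

/* Claim the lowest free counter in 'dom'; return its ID or -1 if none. */
static int dom_cntr_alloc(struct dom_cntrs *dc, int dom)
{
	assert(dom >= 0 && dom < NUM_DOMAINS);
	for (int c = 0; c < NUM_CNTRS; c++) {
		if (!(dc->used[dom] & (1u << c))) {
			dc->used[dom] |= 1u << c;
			return c;
		}
	}
	return -1;
}

/* Release counter 'cntr' in 'dom'. */
static void dom_cntr_free(struct dom_cntrs *dc, int dom, int cntr)
{
	assert(dom >= 0 && dom < NUM_DOMAINS);
	dc->used[dom] &= ~(1u << cntr);
}

/*
 * All-or-nothing assignment for one event of a new monitor group:
 * out[d] receives the counter claimed in domain d. If any domain is
 * exhausted, counters already claimed are released and -1 is returned.
 */
static int assign_all_domains(struct dom_cntrs *dc, int out[NUM_DOMAINS])
{
	for (int d = 0; d < NUM_DOMAINS; d++) {
		out[d] = dom_cntr_alloc(dc, d);
		if (out[d] < 0) {
			while (--d >= 0)
				dom_cntr_free(dc, d, out[d]);
			return -1;
		}
	}
	return 0;
}
```

Only the per-domain bitmaps are consulted, so there is no global state to
keep in sync. A "best effort" policy would simply skip the rollback.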

> 3. Dynamic allocation/free of the counters
> 
> There could be some more things which I can't think right now. It might
> come up when we start working on it.
> 
> It is doable. But, is it worth adding this complexity? I am not sure.

Please elaborate where you see that this is too complex.

> 
> Peter mentioned earlier that he was not interested in domain specific
> assignments. He was only interested in all domain ("*") implementation.

Peter's most recent message indicates otherwise:
https://lore.kernel.org/all/CALPaoCgiHEaY_cDbCo=537JJ7mkYZDFFDs9heYvtQ80fXuuvWQ@mail.gmail.com/

> 
> We can add the support but not sure if it is going to drastically help the
> user.
> 
> Yes, We should keep the options open for supporting domain level
> allocation for future.

The current interface supports domain level allocation with this support in
mind. The complication is that the interface is not behaving in an intuitive
way when backing it with global allocation. So far I have seen a lot of
confusion from knowledgeable users and I'm afraid this will worsen when more
users are exposed to this work.

> 
> For now,  we can go ahead with the current implementation.
> 

As I see it this will require detailed documentation to explain the interface
peculiarities. This documentation is made more complicated knowing that the
peculiarities will be temporary since Peter already indicated that he will
need to fix this to support his work.

Putting this all together, I do think that doing this now may avoid a lot of
confusion and unnecessary extra work.

Reinette

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-27 19:05                   ` Reinette Chatre
@ 2024-11-28 11:10                     ` Peter Newman
  2024-11-28 19:35                       ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Peter Newman @ 2024-11-28 11:10 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: babu.moger, corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86,
	hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu, Reinette,

On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Babu,
>
> On 11/27/24 6:57 AM, Moger, Babu wrote:
> > Hi Reinette,
> >
> > On 11/26/2024 5:56 PM, Reinette Chatre wrote:
> >> Hi Babu,
> >>
> >> On 11/26/24 3:31 PM, Moger, Babu wrote:
> >>> On 11/25/2024 1:00 PM, Reinette Chatre wrote:
> >>>> On 11/22/24 3:36 PM, Moger, Babu wrote:
> >>>>> On 11/21/2024 3:12 PM, Reinette Chatre wrote:
> >>>>>> On 11/19/24 11:20 AM, Moger, Babu wrote:
> >>>>>>> On 11/15/24 18:31, Reinette Chatre wrote:
> >>>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
> >>>>>>>>> Provide the interface to display the number of free monitoring counters
> >>>>>>>>> available for assignment in each doamin when mbm_cntr_assign is supported.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
> >>>>>>>>> ---
> >>>>>>>>> v9: New patch.
> >>>>>>>>> ---
> >>>>>>>>>     Documentation/arch/x86/resctrl.rst     |  4 ++++
> >>>>>>>>>     arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
> >>>>>>>>>     arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
> >>>>>>>>>     3 files changed, 38 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> >>>>>>>>> index 2f3a86278e84..2bc58d974934 100644
> >>>>>>>>> --- a/Documentation/arch/x86/resctrl.rst
> >>>>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
> >>>>>>>>> @@ -302,6 +302,10 @@ with the following files:
> >>>>>>>>>         memory bandwidth tracking to a single memory bandwidth event per
> >>>>>>>>>         monitoring group.
> >>>>>>>>>     +"available_mbm_cntrs":
> >>>>>>>>> +    The number of free monitoring counters available assignment in each domain
> >>>>>>>>
> >>>>>>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
> >>>>>>>> counters available for assignment"?
> >>>>>>>>
> >>>>>>>> (not taking into account how text may change after addressing Peter's feedback)
> >>>>>>>
> >>>>>>> How about this?
> >>>>>>>
> >>>>>>> "The number of monitoring counters available for assignment in each domain
> >>>>>>> when the architecture supports mbm_cntr_assign mode. There are a total of
> >>>>>>> "num_mbm_cntrs" counters are available for assignment. Counters can be
> >>>>>>> assigned or unassigned individually in each domain. A counter is available
> >>>>>>> for new assignment if it is unassigned in all domains."
> >>>>>>
> >>>>>> Please consider the context of this paragraph. It follows right after the description
> >>>>>> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
> >>>>>> I think it is confusing to follow that with a paragraph that states "Counters can be
> >>>>>> assigned or unassigned individually in each domain." I wonder if it may be helpful to
> >>>>>> use a different term ... for example a counter is *assigned* to an event of a monitoring
> >>>>>> group but this assignment may be to specified (not yet supported) or all (this work) domains while
> >>>>>> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
> >>>>>> needs to remain coherent if future work decides to indeed support per-domain assignment.
> >>>>>>
> >>>>>
> >>>>> Little bit lost here. Please help me.
> >>>>
> >>>> I think this highlights the uncertainty this interface brings. How do you expect users
> >>>> to use this interface? At this time I think this interface can create a lot of confusion.
> >>>> For example, consider a hypothetical system with three domains and four counters that
> >>>> has the following state per mbm_assign_control:
> >>>>
> >>>> //0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
> >>>> /m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
> >>>> /m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
> >>>> /m3/0=_;1=_;2=_
> >>>>
> >>>> Since, in this system there are only four counters available, and
> >>>> they have all been assigned, then there are no new counters available for
> >>>> assignment.
> >>>>
> >>>> If I understand correctly, available_mbm_cntrs will read:
> >>>> 0=1;1=3;2=1
> >>>
> >>> Yes. Exactly. This causes confusion to the user.
> >>>>
> >>>> How is a user to interpret the above numbers? It does not reflect
> >>>> that no counter can be assigned to m3, instead it reflects which of the
> >>>> already assigned counters still need to be activated on domains.
> >>>> If, for example, a user is expected to use this file to know how
> >>>> many counters can still be assigned, should it not reflect the actual
> >>>> available counters. In the above scenario it will then be:
> >>>> 0=0;1=0;2=0
> >>>
> >>> We can also just print
> >>> #cat available_mbm_cntrs
> >>> 0
> >>>
> >>> The domain specific information is not important here.
> >>> That was my original idea. We can go back to that definition. That is more clear to the user.
> >>
> >> Tony's response [1] still applies.
> >>
> >> I believe Tony's suggestion [2] considered that the available counters will be the
> >> same for every domain for this implementation. That is why my example noted:
> >> "0=0;1=0;2=0"
> >
> > yes. We can keep it like this.
> >
> >>
> >> The confusion surrounding the global allocator seems to be prevalent ([3], [4]) as folks
> >> familiar with resctrl attempt to digest the work. The struggle to make this documentation clear
> >> makes me more concerned how this feature will be perceived by users who are not as familiar with
> >> resctrl internals. I think that it may be worth it to take a moment and investigate what it will take
> >> to implement a per-domain counter allocator. The hardware supports it and I suspect that the upfront
> >> work to do the enabling will make it easier for users to adopt and understand the feature.
> >>
> >> What do you think?
> >
> > It adds more complexity for sure.
>
> I do see a difference in data structures used but the additional complexity is not
> obvious to me. It seems like there will be one fewer data structure, the
> global bitmap, and I think that will actually bring with it some simplification since
> there is no longer the need to coordinate between the per-domain and global counters,
> for example the logic that only frees a global counter if it is no longer used by a domain.
>
> This may also simplify the update of the monitor event config (BMEC) since it can be
> done directly on counters of the domain instead of needing to go back and forth between
> global and per-domain counters.
>
> >
> > 1. Each group needs to remember counter ids in each domain for each event.
> >    For example:
> >    Resctrl group mon1
> >     Total event
> >     dom 0 cntr_id 1,
> >     dom 1 cntr_id 10
> >     dom 2 cntr_id 11
> >
> >    Local event
> >     dom 0 cntr_id 2,
> >     dom 1 cntr_id 15
> >     dom 2 cntr_id 10
>
> Indeed. The challenge here is that domains may come and go so it cannot be a simple
> static array. As an alternative it can be an xarray indexed by the domain ID with
> pointers to a struct like below to contain the counters associated with the monitor
> group:
>         struct cntr_id {
>                 u32     mbm_total;
>                 u32     mbm_local;
>         }
>
>
> Thinking more about how this array needs to be managed made me wonder how the
> current implementation deals with domains that come and go. I do not think
> this is currently handled. For example, if a new domain comes online and
> monitoring groups had counters dynamically assigned, then these counters are
> not configured to the newly online domain.

In my prototype, I allocated a counter id-indexed array to each
monitoring domain structure for tracking the counter allocations,
because the hardware counters are all domain-scoped. That way the
tracking data goes away when the hardware does.

I was focused on allowing all pending counter updates to a domain
resulting from a single mbm_assign_control write to be batched and
processed in a single IPI, so I structured the counter tracker
something like this:

struct resctrl_monitor_cfg {
    int closid;
    int rmid;
    int evtid;
    bool dirty;
};

This mirrors the info needed in whatever register configures the
counter, plus a dirty flag to skip over the ones that don't need to be
updated.

To make displaying mbm_assign_control easier, I put a pointer back to
any allocated counter array entry in the mbm_state struct, only because
that structure already exists for every rmid-domain combination.

I didn't need to change the rdtgroup structure.
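
For readers following along, here is a rough userspace sketch of the
tracking scheme described above. It is not the prototype code:
NUM_MBM_CNTRS, struct mon_domain_sketch, cntr_stage() and cntr_flush()
are names made up for illustration, and the register write is only a
comment. The idea is that updates are staged in the per-domain array and
a single pass over the dirty entries (which could run from one IPI)
applies them.

```c
#include <stdbool.h>

#define NUM_MBM_CNTRS 32    /* assumption: counters per domain */

struct resctrl_monitor_cfg {
    int closid;
    int rmid;
    int evtid;
    bool dirty;
};

/* Hypothetical stand-in for the per-domain monitoring structure. */
struct mon_domain_sketch {
    struct resctrl_monitor_cfg cfg[NUM_MBM_CNTRS];  /* indexed by counter id */
};

/* Stage an update for one counter; no hardware access happens here. */
static int cntr_stage(struct mon_domain_sketch *d, int cntr_id,
                      int closid, int rmid, int evtid)
{
    if (cntr_id < 0 || cntr_id >= NUM_MBM_CNTRS)
        return -1;
    d->cfg[cntr_id] = (struct resctrl_monitor_cfg){
        .closid = closid, .rmid = rmid, .evtid = evtid, .dirty = true,
    };
    return 0;
}

/*
 * Would run once on a CPU of the domain (for example from a single IPI).
 * Returns the number of counters that were (re)programmed.
 */
static int cntr_flush(struct mon_domain_sketch *d)
{
    int programmed = 0;

    for (int i = 0; i < NUM_MBM_CNTRS; i++) {
        if (!d->cfg[i].dirty)
            continue;
        /* Real code would write the counter configuration register here. */
        d->cfg[i].dirty = false;
        programmed++;
    }
    return programmed;
}
```

Because the array lives inside the domain structure, freeing the domain
frees the tracking data with it.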

>
> >
> >
> > 2. We should have a bitmap of "available counters" in each domain. We have
> > this already. But allocation method changes.
>
> Would allocation/free not be simpler with only the per-domain bitmap needing
> to be consulted?
>
> One implementation change I can think of is the dynamic assign of counters when
> a monitor group is created. Now a free counter needs to be found in each
> domain. Here it can be discussed if it should be an "all or nothing"
> assignment but the handling does not seem to be complex and would need to be
> solved eventually anyway.
>
> > 3. Dynamic allocation/free of the counters
> >
> > There could be some more things which I can't think right now. It might
> > come up when we start working on it.
> >
> > It is doable. But, is it worth adding this complexity? I am not sure.
>
> Please elaborate where you see that this is too complex.
>
> >
> > Peter mentioned earlier that he was not interested in domain specific
> > assignments. He was only interested in all domain ("*") implementation.
>
> Peter's most recent message indicates otherwise:
> https://lore.kernel.org/all/CALPaoCgiHEaY_cDbCo=537JJ7mkYZDFFDs9heYvtQ80fXuuvWQ@mail.gmail.com/

For now, I'm focused on managing the domains locally whenever possible
to avoid all IPIs, as this gives me the least overhead.

I'm also prototyping the 'T' vs 't' approach that Reinette
suggested[1], as this may take a lot of performance pressure off the
mbm_assign_control interface, as most of the routine updates to
counter assignments would be automated.

-Peter

[1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/


* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-28 11:10                     ` Peter Newman
@ 2024-11-28 19:35                       ` Moger, Babu
  2024-11-29  9:59                         ` Peter Newman
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-11-28 19:35 UTC (permalink / raw)
  To: Peter Newman, Reinette Chatre
  Cc: babu.moger, corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86,
	hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Peter,

On 11/28/2024 5:10 AM, Peter Newman wrote:
> Hi Babu, Reinette,
> 
> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Babu,
>>
>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 11/26/2024 5:56 PM, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 11/26/24 3:31 PM, Moger, Babu wrote:
>>>>> On 11/25/2024 1:00 PM, Reinette Chatre wrote:
>>>>>> On 11/22/24 3:36 PM, Moger, Babu wrote:
>>>>>>> On 11/21/2024 3:12 PM, Reinette Chatre wrote:
>>>>>>>> On 11/19/24 11:20 AM, Moger, Babu wrote:
>>>>>>>>> On 11/15/24 18:31, Reinette Chatre wrote:
>>>>>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>>>>>>> Provide the interface to display the number of free monitoring counters
>>>>>>>>>>> available for assignment in each domain when mbm_cntr_assign is supported.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>> v9: New patch.
>>>>>>>>>>> ---
>>>>>>>>>>>      Documentation/arch/x86/resctrl.rst     |  4 ++++
>>>>>>>>>>>      arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
>>>>>>>>>>>      arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>>>>>>>>>      3 files changed, 38 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>>>>>>>> index 2f3a86278e84..2bc58d974934 100644
>>>>>>>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>>>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>>>>>>>> @@ -302,6 +302,10 @@ with the following files:
>>>>>>>>>>>          memory bandwidth tracking to a single memory bandwidth event per
>>>>>>>>>>>          monitoring group.
>>>>>>>>>>>      +"available_mbm_cntrs":
>>>>>>>>>>> +    The number of free monitoring counters available assignment in each domain
>>>>>>>>>>
>>>>>>>>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>>>>>>>>>> counters available for assignment"?
>>>>>>>>>>
>>>>>>>>>> (not taking into account how text may change after addressing Peter's feedback)
>>>>>>>>>
>>>>>>>>> How about this?
>>>>>>>>>
>>>>>>>>> "The number of monitoring counters available for assignment in each domain
>>>>>>>>> when the architecture supports mbm_cntr_assign mode. A total of
>>>>>>>>> "num_mbm_cntrs" counters are available for assignment. Counters can be
>>>>>>>>> assigned or unassigned individually in each domain. A counter is available
>>>>>>>>> for new assignment if it is unassigned in all domains."
>>>>>>>>
>>>>>>>> Please consider the context of this paragraph. It follows right after the description
>>>>>>>> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
>>>>>>>> I think it is confusing to follow that with a paragraph that states "Counters can be
>>>>>>>> assigned or unassigned individually in each domain." I wonder if it may be helpful to
>>>>>>>> use a different term ... for example a counter is *assigned* to an event of a monitoring
>>>>>>>> group but this assignment may be to specified (not yet supported) or all (this work) domains while
>>>>>>>> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
>>>>>>>> needs to remain coherent if future work decides to indeed support per-domain assignment.
>>>>>>>>
>>>>>>>
>>>>>>> Little bit lost here. Please help me.
>>>>>>
>>>>>> I think this highlights the uncertainty this interface brings. How do you expect users
>>>>>> to use this interface? At this time I think this interface can create a lot of confusion.
>>>>>> For example, consider a hypothetical system with three domains and four counters that
>>>>>> has the following state per mbm_assign_control:
>>>>>>
>>>>>> //0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
>>>>>> /m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
>>>>>> /m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
>>>>>> /m3/0=_;1=_;2=_
>>>>>>
>>>>>> Since, in this system there are only four counters available, and
>>>>>> they have all been assigned, then there are no new counters available for
>>>>>> assignment.
>>>>>>
>>>>>> If I understand correctly, available_mbm_cntrs will read:
>>>>>> 0=1;1=3;2=1
>>>>>
>>>>> Yes. Exactly. This causes confusion to the user.
>>>>>>
>>>>>> How is a user to interpret the above numbers? It does not reflect
>>>>>> that no counter can be assigned to m3, instead it reflects which of the
>>>>>> already assigned counters still need to be activated on domains.
>>>>>> If, for example, a user is expected to use this file to know how
>>>>>> many counters can still be assigned, should it not reflect the actual
>>>>>> available counters. In the above scenario it will then be:
>>>>>> 0=0;1=0;2=0
>>>>>
>>>>> We can also just print
>>>>> #cat available_mbm_cntrs
>>>>> 0
>>>>>
>>>>> The domain specific information is not important here.
>>>>> That was my original idea. We can go back to that definition. That is more clear to the user.
>>>>
>>>> Tony's response [1] still applies.
>>>>
>>>> I believe Tony's suggestion [2] considered that the available counters will be the
>>>> same for every domain for this implementation. That is why my example noted:
>>>> "0=0;1=0;2=0"
>>>
>>> yes. We can keep it like this.
>>>
>>>>
>>>> The confusion surrounding the global allocator seems to be prevalent ([3], [4]) as folks
>>>> familiar with resctrl attempt to digest the work. The struggle to make this documentation clear
>>>> makes me more concerned how this feature will be perceived by users who are not as familiar with
>>>> resctrl internals. I think that it may be worth it to take a moment and investigate what it will take
>>>> to implement a per-domain counter allocator. The hardware supports it and I suspect that the upfront
>>>> work to do the enabling will make it easier for users to adopt and understand the feature.
>>>>
>>>> What do you think?
>>>
>>> It adds more complexity for sure.
>>
>> I do see a difference in data structures used but the additional complexity is not
>> obvious to me. It seems like there will be one fewer data structure, the
>> global bitmap, and I think that will actually bring with it some simplification since
>> there is no longer the need to coordinate between the per-domain and global counters,
>> for example the logic that only frees a global counter if it is no longer used by a domain.
>>
>> This may also simplify the update of the monitor event config (BMEC) since it can be
>> done directly on counters of the domain instead of needing to go back and forth between
>> global and per-domain counters.
>>
>>>
>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>     For example:
>>>     Resctrl group mon1
>>>      Total event
>>>      dom 0 cntr_id 1,
>>>      dom 1 cntr_id 10
>>>      dom 2 cntr_id 11
>>>
>>>     Local event
>>>      dom 0 cntr_id 2,
>>>      dom 1 cntr_id 15
>>>      dom 2 cntr_id 10
>>
>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>> static array. As an alternative it can be an xarray indexed by the domain ID with
>> pointers to a struct like below to contain the counters associated with the monitor
>> group:
>>          struct cntr_id {
>>                  u32     mbm_total;
>>                  u32     mbm_local;
>>          };
>>
>>
>> Thinking more about how this array needs to be managed made me wonder how the
>> current implementation deals with domains that come and go. I do not think
>> this is currently handled. For example, if a new domain comes online and
>> monitoring groups had counters dynamically assigned, then these counters are
>> not configured to the newly online domain.

I am trying to understand the details of your approach here.
> 
> In my prototype, I allocated a counter id-indexed array to each
> monitoring domain structure for tracking the counter allocations,
> because the hardware counters are all domain-scoped. That way the
> tracking data goes away when the hardware does.
> 
> I was focused on allowing all pending counter updates to a domain
> resulting from a single mbm_assign_control write to be batched and
> processed in a single IPI, so I structured the counter tracker
> something like this:

Not sure what you meant here. How are you batching two IPIs for two domains?

#echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

This is still a single write. Two IPIs are sent separately, one for each 
domain.

Are you doing something different?

> 
> struct resctrl_monitor_cfg {
>      int closid;
>      int rmid;
>      int evtid;
>      bool dirty;
> };
> 
> This mirrors the info needed in whatever register configures the
> counter, plus a dirty flag to skip over the ones that don't need to be
> updated.

This is my understanding of your implementation.

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d94abba1c716..9cebf065cc97 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
         u32                             *mbps_val;
  };

+struct resctrl_monitor_cfg {
+    int closid;
+    int rmid;
+    int evtid;
+    bool dirty;
+};
+
  /**
   * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
   * @hdr:               common header for different domain types
@@ -116,6 +123,8 @@ struct rdt_mon_domain {
         struct delayed_work             cqm_limbo;
         int                             mbm_work_cpu;
         int                             cqm_work_cpu;
+        /* Allocate num_mbm_cntrs entries in each domain */
+        struct resctrl_monitor_cfg      *mon_cfg;
  };


When a user requests assignment of the total event to the default group
for domain 0, you search rdt_mon_domain (dom 0) for an empty mon_cfg
entry.

If there is an empty entry, then use that entry for the assignment and
set closid, rmid, evtid and dirty = 1. We can get all of this
information from the default group here.

Does this make sense?
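
The allocation step described above could look roughly like the sketch
below. It is illustrative only: mon_cfg_alloc() and EVT_NONE are
hypothetical names, and treating evtid == 0 as "entry unused" is an
assumption made here because the struct as posted has no explicit
in-use flag.

```c
#include <stdbool.h>

struct resctrl_monitor_cfg {
    int closid;
    int rmid;
    int evtid;
    bool dirty;
};

#define EVT_NONE 0  /* assumption: evtid 0 marks an unused entry */

/*
 * Find the first unused entry in a domain's mon_cfg array and claim it.
 * Returns the counter id (the array index), or -1 if the domain has no
 * free counter.
 */
static int mon_cfg_alloc(struct resctrl_monitor_cfg *mon_cfg, int num_cntrs,
                         int closid, int rmid, int evtid)
{
    for (int i = 0; i < num_cntrs; i++) {
        if (mon_cfg[i].evtid != EVT_NONE)
            continue;
        mon_cfg[i].closid = closid;
        mon_cfg[i].rmid = rmid;
        mon_cfg[i].evtid = evtid;
        mon_cfg[i].dirty = true;    /* hardware still needs programming */
        return i;
    }
    return -1;
}
```

Only the domain's own array is consulted, so no global bitmap is
involved in this step.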

> 
> For the benefit of displaying mbm_assign_control, I put a pointer back
> to any counter array entry allocated in the mbm_state struct only
> because it's an existing structure that exists for every rmid-domain
> combination.

A pointer in mbm_state may not be required here.

We are going to loop over the resctrl groups anyway, so we can search
the rdt_mon_domain to see whether a specific closid, rmid, evtid
combination is already assigned in that domain.
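
A minimal sketch of that search, assuming the per-domain mon_cfg array
from the diff above. mon_cfg_find() is a hypothetical helper name, not
an existing resctrl function.

```c
#include <stdbool.h>

struct resctrl_monitor_cfg {
    int closid;
    int rmid;
    int evtid;
    bool dirty;
};

/*
 * Linear search of one domain's counter array for a given
 * (closid, rmid, evtid) tuple. Returns the counter id (array index)
 * or -1 if that event of that group has no counter in this domain.
 */
static int mon_cfg_find(const struct resctrl_monitor_cfg *mon_cfg,
                        int num_cntrs, int closid, int rmid, int evtid)
{
    for (int i = 0; i < num_cntrs; i++) {
        if (mon_cfg[i].closid == closid &&
            mon_cfg[i].rmid == rmid &&
            mon_cfg[i].evtid == evtid)
            return i;
    }
    return -1;
}
```

When displaying mbm_assign_control this would be called once per group,
domain and event, so the cost is bounded by the number of counters per
domain.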

> 
> I didn't need to change the rdtgroup structure.

Ok. That is good.

> 
>>
>>>
>>>
>>> 2. We should have a bitmap of "available counters" in each domain. We have
>>> this already. But allocation method changes.
>>
>> Would allocation/free not be simpler with only the per-domain bitmap needing
>> to be consulted?
>>
>> One implementation change I can think of is the dynamic assign of counters when
>> a monitor group is created. Now a free counter needs to be found in each
>> domain. Here it can be discussed if it should be an "all or nothing"
>> assignment but the handling does not seem to be complex and would need to be
>> solved eventually anyway.
>>
>>> 3. Dynamic allocation/free of the counters
>>>
>>> There could be some more things which I can't think right now. It might
>>> come up when we start working on it.
>>>
>>> It is doable. But, is it worth adding this complexity? I am not sure.
>>
>> Please elaborate where you see that this is too complex.
>>
>>>
>>> Peter mentioned earlier that he was not interested in domain specific
>>> assignments. He was only interested in all domain ("*") implementation.
>>
>> Peter's most recent message indicates otherwise:
>> https://lore.kernel.org/all/CALPaoCgiHEaY_cDbCo=537JJ7mkYZDFFDs9heYvtQ80fXuuvWQ@mail.gmail.com/
> 
> For now, I'm focused on managing the domains locally whenever possible
> to avoid all IPIs, as this gives me the least overhead.
> 
> I'm also prototyping the 'T' vs 't' approach that Reinette
> suggested[1], as this may take a lot of performance pressure off the
> mbm_assign_control interface, as most of the routine updates to
> counter assignments would be automated.
> 
> -Peter
> 
> [1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/
> 

-- 
- Babu Moger



* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-28 19:35                       ` Moger, Babu
@ 2024-11-29  9:59                         ` Peter Newman
  2024-11-29 17:06                           ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Peter Newman @ 2024-11-29  9:59 UTC (permalink / raw)
  To: babu.moger
  Cc: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>
> Hi Peter,
>
> On 11/28/2024 5:10 AM, Peter Newman wrote:
> > Hi Babu, Reinette,
> >
> > On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >>
> >> Hi Babu,
> >>
> >> On 11/27/24 6:57 AM, Moger, Babu wrote:

> >>> 1. Each group needs to remember counter ids in each domain for each event.
> >>>     For example:
> >>>     Resctrl group mon1
> >>>      Total event
> >>>      dom 0 cntr_id 1,
> >>>      dom 1 cntr_id 10
> >>>      dom 2 cntr_id 11
> >>>
> >>>     Local event
> >>>      dom 0 cntr_id 2,
> >>>      dom 1 cntr_id 15
> >>>      dom 2 cntr_id 10
> >>
> >> Indeed. The challenge here is that domains may come and go so it cannot be a simple
> >> static array. As an alternative it can be an xarray indexed by the domain ID with
> >> pointers to a struct like below to contain the counters associated with the monitor
> >> group:
> >>          struct cntr_id {
> >>                  u32     mbm_total;
> >>                  u32     mbm_local;
> >>          };
> >>
> >>
> >> Thinking more about how this array needs to be managed made me wonder how the
> >> current implementation deals with domains that come and go. I do not think
> >> this is currently handled. For example, if a new domain comes online and
> >> monitoring groups had counters dynamically assigned, then these counters are
> >> not configured to the newly online domain.
>
> I am trying to understand the details of your approach here.
> >
> > In my prototype, I allocated a counter id-indexed array to each
> > monitoring domain structure for tracking the counter allocations,
> > because the hardware counters are all domain-scoped. That way the
> > tracking data goes away when the hardware does.
> >
> > I was focused on allowing all pending counter updates to a domain
> > resulting from a single mbm_assign_control write to be batched and
> > processed in a single IPI, so I structured the counter tracker
> > something like this:
>
> Not sure what you meant here. How are you batching two IPIs for two domains?
>
> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> This is still a single write. Two IPIs are sent separately, one for each
> domain.
>
> Are you doing something different?

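I said "all pending counter updates to a domain", by which I meant
updates targeting a single domain.

Depending on the CPU of the caller, your example write requires 1 or 2
IPIs: 1 if the caller is already running in domain 0 or 1 and can update
that domain locally, 2 otherwise.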

What is important is that the following write also requires 1 or 2 IPIs:

(assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
for readability)

echo $'//0=t;1=t\n
/g1/0=t;1=t\n
/g2/0=t;1=t\n
/g3/0=t;1=t\n
/g4/0=t;1=t\n
/g5/0=t;1=t\n
/g6/0=t;1=t\n
/g7/0=t;1=t\n
/g8/0=t;1=t\n
/g9/0=t;1=t\n
/g10/0=t;1=t\n
/g11/0=t;1=t\n
/g12/0=t;1=t\n
/g13/0=t;1=t\n
/g14/0=t;1=t\n
/g15/0=t;1=t\n
/g16/0=t;1=t\n
/g17/0=t;1=t\n
/g18/0=t;1=t\n
/g19/0=t;1=t\n
/g20/0=t;1=t\n
/g21/0=t;1=t\n
/g22/0=t;1=t\n
/g23/0=t;1=t\n
/g24/0=t;1=t\n
/g25/0=t;1=t\n
/g26/0=t;1=t\n
/g27/0=t;1=t\n
/g28/0=t;1=t\n
/g29/0=t;1=t\n
/g30/0=t;1=t\n
/g31/0=t;1=t\n' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

My ultimate goal is for a thread bound to a particular domain to be
able to unassign and reassign the local domain's 32 counters in a
single write() with no IPIs at all. And when IPIs are required, then
no more than one per domain, regardless of the number of groups
updated.
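
To make the IPI accounting concrete, here is a small userspace model of
the goal stated above. It only counts IPIs; it sends none. MAX_DOMAINS,
struct pending_write, mark_update() and ipis_needed() are names invented
for this sketch, and a local_dom of -1 is an assumed convention for a
caller that is not running in any of the touched domains.

```c
#include <stdbool.h>

#define MAX_DOMAINS 8   /* assumption: upper bound for this sketch */

/* Which domains have at least one staged counter update in this write. */
struct pending_write {
    bool domain_dirty[MAX_DOMAINS];
};

/* Called once per "group/domain=flags" token parsed from the write. */
static void mark_update(struct pending_write *w, int dom_id)
{
    if (dom_id >= 0 && dom_id < MAX_DOMAINS)
        w->domain_dirty[dom_id] = true;
}

/*
 * One IPI per dirty remote domain; the caller's own domain is updated
 * directly. The result does not depend on how many groups were updated.
 */
static int ipis_needed(const struct pending_write *w, int local_dom)
{
    int n = 0;

    for (int d = 0; d < MAX_DOMAINS; d++) {
        if (w->domain_dirty[d] && d != local_dom)
            n++;
    }
    return n;
}
```

With the 32-group write above, every line marks domains 0 and 1, so the
model yields 1 IPI for a caller in domain 0 or 1 and 2 IPIs otherwise.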


>
> >
> > struct resctrl_monitor_cfg {
> >      int closid;
> >      int rmid;
> >      int evtid;
> >      bool dirty;
> > };
> >
> > This mirrors the info needed in whatever register configures the
> > counter, plus a dirty flag to skip over the ones that don't need to be
> > updated.
>
> This is what my understanding of your implementation.
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index d94abba1c716..9cebf065cc97 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>          u32                             *mbps_val;
>   };
>
> +struct resctrl_monitor_cfg {
> +    int closid;
> +    int rmid;
> +    int evtid;
> +    bool dirty;
> +};
> +
>   /**
>    * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
> resource
>    * @hdr:               common header for different domain types
> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>          struct delayed_work             cqm_limbo;
>          int                             mbm_work_cpu;
>          int                             cqm_work_cpu;
> +     /* Allocate num_mbm_cntrs entries in each domain */
> +       struct resctrl_monitor_cfg      *mon_cfg;
>   };
>
>
> When a user requests an assignment for total event to the default group
> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
> entry.
>
> If there is an empty entry, then use that entry for assignment and
> update closid, rmid, evtid and dirty = 1. We can get all these
> information from default group here.
>
> Does this make sense?

Yes, sounds correct.

>
> >
> > For the benefit of displaying mbm_assign_control, I put a pointer back
> > to any counter array entry allocated in the mbm_state struct only
> > because it's an existing structure that exists for every rmid-domain
> > combination.
>
> Pointer in mbm_state may not be required here.
>
> We are going to loop over resctrl groups. We can search the
> rdt_mon_domain to see if specific closid, rmid, evtid is already
> assigned or not in that domain.

No, not required I guess. High-performance CPUs can probably search a
32-entry array very quickly.

-Peter


* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-29  9:59                         ` Peter Newman
@ 2024-11-29 17:06                           ` Moger, Babu
  2024-12-02 10:43                             ` Peter Newman
  2024-12-02 18:33                             ` Reinette Chatre
  0 siblings, 2 replies; 115+ messages in thread
From: Moger, Babu @ 2024-11-29 17:06 UTC (permalink / raw)
  To: Peter Newman, babu.moger
  Cc: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Peter, Reinette,

On 11/29/2024 3:59 AM, Peter Newman wrote:
> Hi Babu,
> 
> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>
>> Hi Peter,
>>
>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>> Hi Babu, Reinette,
>>>
>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>>
>>>> Hi Babu,
>>>>
>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
> 
>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>      For example:
>>>>>      Resctrl group mon1
>>>>>       Total event
>>>>>       dom 0 cntr_id 1,
>>>>>       dom 1 cntr_id 10
>>>>>       dom 2 cntr_id 11
>>>>>
>>>>>      Local event
>>>>>       dom 0 cntr_id 2,
>>>>>       dom 1 cntr_id 15
>>>>>       dom 2 cntr_id 10
>>>>
>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>> group:
>>>>           struct cntr_id {
>>>>                   u32     mbm_total;
>>>>                   u32     mbm_local;
>>>>           };
>>>>
>>>>
>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>> current implementation deals with domains that come and go. I do not think
>>>> this is currently handled. For example, if a new domain comes online and
>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>> not configured to the newly online domain.
>>
>> I am trying to understand the details of your approach here.
>>>
>>> In my prototype, I allocated a counter id-indexed array to each
>>> monitoring domain structure for tracking the counter allocations,
>>> because the hardware counters are all domain-scoped. That way the
>>> tracking data goes away when the hardware does.
>>>
>>> I was focused on allowing all pending counter updates to a domain
>>> resulting from a single mbm_assign_control write to be batched and
>>> processed in a single IPI, so I structured the counter tracker
>>> something like this:
>>
>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>
>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> This is still a single write. Two IPIs are sent separately, one for each
>> domain.
>>
>> Are you doing something different?
> 
> I said "all pending counter updates to a domain", whereby I meant
> targeting a single domain.
> 
> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
> 
> What is important is that the following write also requires 1 or 2 IPIs:
> 
> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
> for readability)
> 
> echo $'//0=t;1=t\n
> /g1/0=t;1=t\n
> /g2/0=t;1=t\n
> /g3/0=t;1=t\n
> /g4/0=t;1=t\n
> /g5/0=t;1=t\n
> /g6/0=t;1=t\n
> /g7/0=t;1=t\n
> /g8/0=t;1=t\n
> /g9/0=t;1=t\n
> /g10/0=t;1=t\n
> /g11/0=t;1=t\n
> /g12/0=t;1=t\n
> /g13/0=t;1=t\n
> /g14/0=t;1=t\n
> /g15/0=t;1=t\n
> /g16/0=t;1=t\n
> /g17/0=t;1=t\n
> /g18/0=t;1=t\n
> /g19/0=t;1=t\n
> /g20/0=t;1=t\n
> /g21/0=t;1=t\n
> /g22/0=t;1=t\n
> /g23/0=t;1=t\n
> /g24/0=t;1=t\n
> /g25/0=t;1=t\n
> /g26/0=t;1=t\n
> /g27/0=t;1=t\n
> /g28/0=t;1=t\n
> /g29/0=t;1=t\n
> /g30/0=t;1=t\n
> /g31/0=t;1=t\n'
> 
> My ultimate goal is for a thread bound to a particular domain to be
> able to unassign and reassign the local domain's 32 counters in a
> single write() with no IPIs at all. And when IPIs are required, then
> no more than one per domain, regardless of the number of groups
> updated.
> 

Yes. I think I got the idea. Thanks.

> 
>>
>>>
>>> struct resctrl_monitor_cfg {
>>>       int closid;
>>>       int rmid;
>>>       int evtid;
>>>       bool dirty;
>>> };
>>>
>>> This mirrors the info needed in whatever register configures the
>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>> updated.
>>
>> This is what my understanding of your implementation.
>>
>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index d94abba1c716..9cebf065cc97 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>           u32                             *mbps_val;
>>    };
>>
>> +struct resctrl_monitor_cfg {
>> +    int closid;
>> +    int rmid;
>> +    int evtid;
>> +    bool dirty;
>> +};
>> +
>>    /**
>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>> resource
>>     * @hdr:               common header for different domain types
>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>           struct delayed_work             cqm_limbo;
>>           int                             mbm_work_cpu;
>>           int                             cqm_work_cpu;
>> +     /* Allocate num_mbm_cntrs entries in each domain */
>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>    };
>>
>>
>> When a user requests an assignment for total event to the default group
>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>> entry.
>>
>> If there is an empty entry, then use that entry for assignment and
>> update closid, rmid, evtid and dirty = 1. We can get all these
>> information from default group here.
>>
>> Does this make sense?
> 
> Yes, sounds correct.

I will probably add cntr_id to the resctrl_monitor_cfg structure and
initialize it during allocation, and rename the field 'dirty' to
'active' (or something similar) to hold the assignment state for that
entry. That way we have all the information required for assignment in
one place. We don't need to update the rdtgroup structure.

Reinette, what do you think about this approach?
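
As a rough illustration of the proposal above (a userspace sketch, not a
patch): the field layout and the helper name mon_cfg_init() are
assumptions, and calloc() stands in for whatever allocator the kernel
code would use.

```c
#include <stdbool.h>
#include <stdlib.h>

struct resctrl_monitor_cfg {
    int cntr_id;    /* fixed at allocation time, equals the array index */
    int closid;
    int rmid;
    int evtid;
    bool active;    /* true while the counter is assigned */
};

/*
 * Allocate the per-domain array with num_mbm_cntrs entries and give
 * each entry its counter id. All entries start out inactive.
 * The caller owns the returned memory and releases it with free().
 */
static struct resctrl_monitor_cfg *mon_cfg_init(int num_mbm_cntrs)
{
    struct resctrl_monitor_cfg *cfg;

    if (num_mbm_cntrs <= 0)
        return NULL;

    cfg = calloc((size_t)num_mbm_cntrs, sizeof(*cfg));
    if (!cfg)
        return NULL;

    for (int i = 0; i < num_mbm_cntrs; i++)
        cfg[i].cntr_id = i;

    return cfg;
}
```

With an explicit 'active' flag, a free entry no longer has to be
inferred from the other fields.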

> 
>>
>>>
>>> For the benefit of displaying mbm_assign_control, I put a pointer back
>>> to any counter array entry allocated in the mbm_state struct only
>>> because it's an existing structure that exists for every rmid-domain
>>> combination.
>>
>> Pointer in mbm_state may not be required here.
>>
>> We are going to loop over resctrl groups. We can search the
>> rdt_mon_domain to see if specific closid, rmid, evtid is already
>> assigned or not in that domain.
> 
> No, not required I guess. High-performance CPUs can probably search a
> 32-entry array very quickly.

Ok.

-- 
- Babu Moger



* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-29 17:06                           ` Moger, Babu
@ 2024-12-02 10:43                             ` Peter Newman
  2024-12-02 15:02                               ` Moger, Babu
  2024-12-02 18:33                             ` Reinette Chatre
  1 sibling, 1 reply; 115+ messages in thread
From: Peter Newman @ 2024-12-02 10:43 UTC (permalink / raw)
  To: babu.moger
  Cc: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Babu,

On Fri, Nov 29, 2024 at 6:06 PM Moger, Babu <bmoger@amd.com> wrote:
>
> Hi Peter, Reinette,
>
> On 11/29/2024 3:59 AM, Peter Newman wrote:
> > Hi Babu,
> >
> > On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 11/28/2024 5:10 AM, Peter Newman wrote:

> >>> In my prototype, I allocated a counter id-indexed array to each
> >>> monitoring domain structure for tracking the counter allocations,
> >>> because the hardware counters are all domain-scoped. That way the
> >>> tracking data goes away when the hardware does.
> >>>
> >>> I was focused on allowing all pending counter updates to a domain
> >>> resulting from a single mbm_assign_control write to be batched and
> >>> processed in a single IPI, so I structured the counter tracker
> >>> something like this:
> >>
> >> Not sure what you meant here. How are you batching two IPIs for two domains?
> >>
> >> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>
> >> This is still a single write. Two IPIs are sent separately, one for each
> >> domain.
> >>
> >> Are you doing something different?
> >
> > I said "all pending counter updates to a domain", whereby I meant
> > targeting a single domain.
> >
> > Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
> >
> > What is important is that the following write also requires 1 or 2 IPIs:
> >
> > (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
> > for readability)
> >
> > echo $'//0=t;1=t\n
> > /g1/0=t;1=t\n
> > /g2/0=t;1=t\n
> > /g3/0=t;1=t\n
> > /g4/0=t;1=t\n
> > /g5/0=t;1=t\n
> > /g6/0=t;1=t\n
> > /g7/0=t;1=t\n
> > /g8/0=t;1=t\n
> > /g9/0=t;1=t\n
> > /g10/0=t;1=t\n
> > /g11/0=t;1=t\n
> > /g12/0=t;1=t\n
> > /g13/0=t;1=t\n
> > /g14/0=t;1=t\n
> > /g15/0=t;1=t\n
> > /g16/0=t;1=t\n
> > /g17/0=t;1=t\n
> > /g18/0=t;1=t\n
> > /g19/0=t;1=t\n
> > /g20/0=t;1=t\n
> > /g21/0=t;1=t\n
> > /g22/0=t;1=t\n
> > /g23/0=t;1=t\n
> > /g24/0=t;1=t\n
> > /g25/0=t;1=t\n
> > /g26/0=t;1=t\n
> > /g27/0=t;1=t\n
> > /g28/0=t;1=t\n
> > /g29/0=t;1=t\n
> > /g30/0=t;1=t\n
> > /g31/0=t;1=t\n'
> >
> > My ultimate goal is for a thread bound to a particular domain to be
> > able to unassign and reassign the local domain's 32 counters in a
> > single write() with no IPIs at all. And when IPIs are required, then
> > no more than one per domain, regardless of the number of groups
> > updated.
> >
>
> Yes. I think I got the idea. Thanks.
>
> >
> >>
> >>>
> >>> struct resctrl_monitor_cfg {
> >>>       int closid;
> >>>       int rmid;
> >>>       int evtid;
> >>>       bool dirty;
> >>> };
> >>>
> >>> This mirrors the info needed in whatever register configures the
> >>> counter, plus a dirty flag to skip over the ones that don't need to be
> >>> updated.
> >>
> >> This is my understanding of your implementation.
> >>
> >> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> >> index d94abba1c716..9cebf065cc97 100644
> >> --- a/include/linux/resctrl.h
> >> +++ b/include/linux/resctrl.h
> >> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
> >>           u32                             *mbps_val;
> >>    };
> >>
> >> +struct resctrl_monitor_cfg {
> >> +    int closid;
> >> +    int rmid;
> >> +    int evtid;
> >> +    bool dirty;
> >> +};
> >> +
> >>    /**
> >>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
> >> resource
> >>     * @hdr:               common header for different domain types
> >> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
> >>           struct delayed_work             cqm_limbo;
> >>           int                             mbm_work_cpu;
> >>           int                             cqm_work_cpu;
> >> +     /* Allocate num_mbm_cntrs entries in each domain */
> >> +       struct resctrl_monitor_cfg      *mon_cfg;
> >>    };
> >>
> >>
> >> When a user requests an assignment for total event to the default group
> >> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
> >> entry.
> >>
> >> If there is an empty entry, then use that entry for assignment and
> >> update closid, rmid, evtid and dirty = 1. We can get all these
> >> information from default group here.
> >>
> >> Does this make sense?
> >
> > Yes, sounds correct.
>
> I will probably add cntr_id in resctrl_monitor_cfg structure and
> initialize during the allocation. And rename the field 'dirty' to
> 'active'(or something similar) to hold the assign state for that entry.
> That way we have all the information required for assignment at one
> place. We don't need to update the rdtgroup structure.

It concerns me that you want to say "active" instead of "dirty". What
I'm proposing is a write-back cache of the config values so that a
series of remote updates to many groups can be written back to
hardware all at once.

Therefore we want to track which entries are "dirty", implying that
they differ from what was last written to the registers and therefore
need to be flushed by the remote domain. Whether the counter is
enabled or not is already implicit in the configuration values (evtid
!= 0).
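
To make the write-back idea concrete, here is a minimal userspace sketch.
It is not the prototype code: struct mon_domain_sketch, cfg_update(),
cfg_flush(), the simulated hw[] register array and the fixed count of 32
counters are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_MBM_CNTRS 32	/* assumed number of counters per domain */

/* Cached copy of one counter's configuration, as proposed above. */
struct resctrl_monitor_cfg {
	int closid;
	int rmid;
	int evtid;	/* 0 means the counter is not enabled */
	bool dirty;	/* differs from what was last written to hardware */
};

/* Hypothetical stand-in for a monitoring domain and its registers. */
struct mon_domain_sketch {
	struct resctrl_monitor_cfg mon_cfg[NUM_MBM_CNTRS];
	struct resctrl_monitor_cfg hw[NUM_MBM_CNTRS];	/* simulated registers */
	unsigned int flushes;	/* one per IPI that would be needed */
};

/* Update the cached config only; no hardware access happens here. */
void cfg_update(struct mon_domain_sketch *d, int cntr_id,
		int closid, int rmid, int evtid)
{
	struct resctrl_monitor_cfg *cfg = &d->mon_cfg[cntr_id];

	if (cfg->closid == closid && cfg->rmid == rmid && cfg->evtid == evtid)
		return;		/* nothing changed, entry stays clean */

	cfg->closid = closid;
	cfg->rmid = rmid;
	cfg->evtid = evtid;
	cfg->dirty = true;
}

/*
 * Write back every dirty entry in one pass. In the kernel this would run
 * on a CPU in the domain. Returns the number of counters written.
 */
int cfg_flush(struct mon_domain_sketch *d)
{
	int written = 0;

	for (size_t i = 0; i < NUM_MBM_CNTRS; i++) {
		struct resctrl_monitor_cfg *cfg = &d->mon_cfg[i];

		if (!cfg->dirty)
			continue;
		cfg->dirty = false;
		d->hw[i] = *cfg;	/* stands in for the register write */
		written++;
	}
	if (written)
		d->flushes++;
	return written;
}
```

Any number of cfg_update() calls from a single write() cost at most one
cfg_flush() per domain, and clean entries are skipped entirely.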

-Peter

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-12-02 10:43                             ` Peter Newman
@ 2024-12-02 15:02                               ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-12-02 15:02 UTC (permalink / raw)
  To: Peter Newman
  Cc: Reinette Chatre, corbet, tglx, mingo, bp, dave.hansen, fenghua.yu,
	x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Peter,

On 12/2/24 04:43, Peter Newman wrote:
> Hi Babu,
> 
> On Fri, Nov 29, 2024 at 6:06 PM Moger, Babu <bmoger@amd.com> wrote:
>>
>> Hi Peter, Reinette,
>>
>> On 11/29/2024 3:59 AM, Peter Newman wrote:
>>> Hi Babu,
>>>
>>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
> 
>>>>> In my prototype, I allocated a counter id-indexed array to each
>>>>> monitoring domain structure for tracking the counter allocations,
>>>>> because the hardware counters are all domain-scoped. That way the
>>>>> tracking data goes away when the hardware does.
>>>>>
>>>>> I was focused on allowing all pending counter updates to a domain
>>>>> resulting from a single mbm_assign_control write to be batched and
>>>>> processed in a single IPI, so I structured the counter tracker
>>>>> something like this:
>>>>
>>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>>
>>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>> This is still a single write. Two IPIs are sent separately, one for each
>>>> domain.
>>>>
>>>> Are you doing something different?
>>>
>>> I said "all pending counter updates to a domain", whereby I meant
>>> targeting a single domain.
>>>
>>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>>
>>> What is important is that the following write also requires 1 or 2 IPIs:
>>>
>>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>>> for readability)
>>>
>>> echo $'//0=t;1=t\n
>>> /g1/0=t;1=t\n
>>> /g2/0=t;1=t\n
>>> /g3/0=t;1=t\n
>>> /g4/0=t;1=t\n
>>> /g5/0=t;1=t\n
>>> /g6/0=t;1=t\n
>>> /g7/0=t;1=t\n
>>> /g8/0=t;1=t\n
>>> /g9/0=t;1=t\n
>>> /g10/0=t;1=t\n
>>> /g11/0=t;1=t\n
>>> /g12/0=t;1=t\n
>>> /g13/0=t;1=t\n
>>> /g14/0=t;1=t\n
>>> /g15/0=t;1=t\n
>>> /g16/0=t;1=t\n
>>> /g17/0=t;1=t\n
>>> /g18/0=t;1=t\n
>>> /g19/0=t;1=t\n
>>> /g20/0=t;1=t\n
>>> /g21/0=t;1=t\n
>>> /g22/0=t;1=t\n
>>> /g23/0=t;1=t\n
>>> /g24/0=t;1=t\n
>>> /g25/0=t;1=t\n
>>> /g26/0=t;1=t\n
>>> /g27/0=t;1=t\n
>>> /g28/0=t;1=t\n
>>> /g29/0=t;1=t\n
>>> /g30/0=t;1=t\n
>>> /g31/0=t;1=t\n'
>>>
>>> My ultimate goal is for a thread bound to a particular domain to be
>>> able to unassign and reassign the local domain's 32 counters in a
>>> single write() with no IPIs at all. And when IPIs are required, then
>>> no more than one per domain, regardless of the number of groups
>>> updated.
>>>
>>
>> Yes. I think I got the idea. Thanks.
>>
>>>
>>>>
>>>>>
>>>>> struct resctrl_monitor_cfg {
>>>>>       int closid;
>>>>>       int rmid;
>>>>>       int evtid;
>>>>>       bool dirty;
>>>>> };
>>>>>
>>>>> This mirrors the info needed in whatever register configures the
>>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>>> updated.
>>>>
>>>> This is my understanding of your implementation.
>>>>
>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>> index d94abba1c716..9cebf065cc97 100644
>>>> --- a/include/linux/resctrl.h
>>>> +++ b/include/linux/resctrl.h
>>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>>           u32                             *mbps_val;
>>>>    };
>>>>
>>>> +struct resctrl_monitor_cfg {
>>>> +    int closid;
>>>> +    int rmid;
>>>> +    int evtid;
>>>> +    bool dirty;
>>>> +};
>>>> +
>>>>    /**
>>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>>> resource
>>>>     * @hdr:               common header for different domain types
>>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>>           struct delayed_work             cqm_limbo;
>>>>           int                             mbm_work_cpu;
>>>>           int                             cqm_work_cpu;
>>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>>    };
>>>>
>>>>
>>>> When a user requests an assignment for total event to the default group
>>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>>> entry.
>>>>
>>>> If there is an empty entry, then use that entry for assignment and
>>>> update closid, rmid, evtid and dirty = 1. We can get all these
>>>> information from default group here.
>>>>
>>>> Does this make sense?
>>>
>>> Yes, sounds correct.
>>
>> I will probably add cntr_id in resctrl_monitor_cfg structure and
>> initialize during the allocation. And rename the field 'dirty' to
>> 'active'(or something similar) to hold the assign state for that entry.
>> That way we have all the information required for assignment at one
>> place. We don't need to update the rdtgroup structure.
> 
> It concerns me that you want to say "active" instead of "dirty". What
> I'm proposing is a write-back cache of the config values so that a
> series of remote updates to many groups can be written back to
> hardware all at once.
> 
> Therefore we want to track which entries are "dirty", implying that
> they differ from what was last written to the registers and therefore
> need to be flushed by the remote domain. Whether the counter is
> enabled or not is already implicit in the configuration values (evtid
> != 0).
> 

That is correct. But I wanted to track the assignment state explicitly,
which makes the array easy to search. We can use a single field for both
if you want both.

int state;

#define ASSIGN_STATE_ACTIVE  BIT(0)
#define ASSIGN_STATE_DIRTY   BIT(1)
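
A rough userspace sketch of how the two flags might be used follows. The
helper names (mon_cfg_find_free(), mon_cfg_assign(), mon_cfg_unassign(),
mon_cfg_mark_clean()) are hypothetical, BIT() is redefined locally as a
stand-in for the kernel macro, and state is unsigned here only to keep the
bit operations tidy.

```c
#include <stddef.h>

#define BIT(n)			(1U << (n))	/* local stand-in for the kernel macro */
#define ASSIGN_STATE_ACTIVE	BIT(0)
#define ASSIGN_STATE_DIRTY	BIT(1)

struct resctrl_monitor_cfg {
	int closid;
	int rmid;
	int evtid;
	unsigned int state;
};

/* Return the index of the first entry without ACTIVE set, or -1. */
int mon_cfg_find_free(const struct resctrl_monitor_cfg *cfg, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!(cfg[i].state & ASSIGN_STATE_ACTIVE))
			return (int)i;
	return -1;
}

/* Claim an entry: it is in use and not yet written to hardware. */
void mon_cfg_assign(struct resctrl_monitor_cfg *cfg,
		    int closid, int rmid, int evtid)
{
	cfg->closid = closid;
	cfg->rmid = rmid;
	cfg->evtid = evtid;
	cfg->state = ASSIGN_STATE_ACTIVE | ASSIGN_STATE_DIRTY;
}

/* Release an entry: no longer in use, but hardware must still be told. */
void mon_cfg_unassign(struct resctrl_monitor_cfg *cfg)
{
	cfg->state = ASSIGN_STATE_DIRTY;
}

/* Called once the entry has been written back to hardware. */
void mon_cfg_mark_clean(struct resctrl_monitor_cfg *cfg)
{
	cfg->state &= ~ASSIGN_STATE_DIRTY;
}
```

With this split, the free-entry search looks only at ACTIVE and the
write-back pass looks only at DIRTY, so an unassigned entry can still be
pending a flush.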
-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-11-29 17:06                           ` Moger, Babu
  2024-12-02 10:43                             ` Peter Newman
@ 2024-12-02 18:33                             ` Reinette Chatre
  2024-12-02 19:48                               ` Moger, Babu
  1 sibling, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-12-02 18:33 UTC (permalink / raw)
  To: babu.moger, Peter Newman
  Cc: corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86, hpa, thuth,
	paulmck, rostedt, akpm, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, perry.yuan, sandipan.das, kai.huang, xiaoyao.li,
	seanjc, jithu.joseph, brijesh.singh, xin3.li, ebiggers,
	andrew.cooper3, mario.limonciello, james.morse, tan.shaopeng,
	tony.luck, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu and Peter,

On 11/29/24 9:06 AM, Moger, Babu wrote:
> Hi Peter, Reinette,
> 
> On 11/29/2024 3:59 AM, Peter Newman wrote:
>> Hi Babu,
>>
>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>
>>> Hi Peter,
>>>
>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>>> Hi Babu, Reinette,
>>>>
>>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>>> <reinette.chatre@intel.com> wrote:
>>>>>
>>>>> Hi Babu,
>>>>>
>>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>
>>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>>      For example:
>>>>>>      Resctrl group mon1
>>>>>>       Total event
>>>>>>       dom 0 cntr_id 1,
>>>>>>       dom 1 cntr_id 10
>>>>>>       dom 2 cntr_id 11
>>>>>>
>>>>>>      Local event
>>>>>>       dom 0 cntr_id 2,
>>>>>>       dom 1 cntr_id 15
>>>>>>       dom 2 cntr_id 10
>>>>>
>>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>>> group:
>>>>>           struct cntr_id {
>>>>>                   u32     mbm_total;
>>>>>                   u32     mbm_local;
>>>>>           }
>>>>>
>>>>>
>>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>>> current implementation deals with domains that come and go. I do not think
>>>>> this is currently handled. For example, if a new domain comes online and
>>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>>> not configured to the newly online domain.
>>>
>>> I am trying to understand the details of your approach here.
>>>>
>>>> In my prototype, I allocated a counter id-indexed array to each
>>>> monitoring domain structure for tracking the counter allocations,
>>>> because the hardware counters are all domain-scoped. That way the
>>>> tracking data goes away when the hardware does.
>>>>
>>>> I was focused on allowing all pending counter updates to a domain
>>>> resulting from a single mbm_assign_control write to be batched and
>>>> processed in a single IPI, so I structured the counter tracker
>>>> something like this:
>>>
>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>
>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>> This is still a single write. Two IPIs are sent separately, one for each
>>> domain.
>>>
>>> Are you doing something different?
>>
>> I said "all pending counter updates to a domain", whereby I meant
>> targeting a single domain.
>>
>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>
>> What is important is that the following write also requires 1 or 2 IPIs:
>>
>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>> for readability)
>>
>> echo $'//0=t;1=t\n
>> /g1/0=t;1=t\n
>> /g2/0=t;1=t\n
>> /g3/0=t;1=t\n
>> /g4/0=t;1=t\n
>> /g5/0=t;1=t\n
>> /g6/0=t;1=t\n
>> /g7/0=t;1=t\n
>> /g8/0=t;1=t\n
>> /g9/0=t;1=t\n
>> /g10/0=t;1=t\n
>> /g11/0=t;1=t\n
>> /g12/0=t;1=t\n
>> /g13/0=t;1=t\n
>> /g14/0=t;1=t\n
>> /g15/0=t;1=t\n
>> /g16/0=t;1=t\n
>> /g17/0=t;1=t\n
>> /g18/0=t;1=t\n
>> /g19/0=t;1=t\n
>> /g20/0=t;1=t\n
>> /g21/0=t;1=t\n
>> /g22/0=t;1=t\n
>> /g23/0=t;1=t\n
>> /g24/0=t;1=t\n
>> /g25/0=t;1=t\n
>> /g26/0=t;1=t\n
>> /g27/0=t;1=t\n
>> /g28/0=t;1=t\n
>> /g29/0=t;1=t\n
>> /g30/0=t;1=t\n
>> /g31/0=t;1=t\n'
>>
>> My ultimate goal is for a thread bound to a particular domain to be
>> able to unassign and reassign the local domain's 32 counters in a
>> single write() with no IPIs at all. And when IPIs are required, then
>> no more than one per domain, regardless of the number of groups
>> updated.
>>
> 
> Yes. I think I got the idea. Thanks.
> 
>>
>>>
>>>>
>>>> struct resctrl_monitor_cfg {
>>>>       int closid;
>>>>       int rmid;
>>>>       int evtid;
>>>>       bool dirty;
>>>> };
>>>>
>>>> This mirrors the info needed in whatever register configures the
>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>> updated.
>>>
>>> This is my understanding of your implementation.
>>>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index d94abba1c716..9cebf065cc97 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>           u32                             *mbps_val;
>>>    };
>>>
>>> +struct resctrl_monitor_cfg {
>>> +    int closid;
>>> +    int rmid;
>>> +    int evtid;
>>> +    bool dirty;
>>> +};
>>> +
>>>    /**
>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>> resource
>>>     * @hdr:               common header for different domain types
>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>           struct delayed_work             cqm_limbo;
>>>           int                             mbm_work_cpu;
>>>           int                             cqm_work_cpu;
>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>    };
>>>
>>>
>>> When a user requests an assignment for total event to the default group
>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>> entry.
>>>
>>> If there is an empty entry, then use that entry for assignment and
>>> update closid, rmid, evtid and dirty = 1. We can get all these
>>> information from default group here.
>>>
>>> Does this make sense?
>>
>> Yes, sounds correct.
> 
> I will probably add cntr_id in resctrl_monitor_cfg structure and
> initialize during the allocation. And rename the field 'dirty' to
> 'active'(or something similar) to hold the assign state for that
> entry. That way we have all the information required for assignment
> at one place. We don't need to update the rdtgroup structure.
> 
> Reinette, What do you think about this approach?

I think this approach is in the right direction. Thanks to Peter for
the guidance here.
I do not think that it is necessary to add cntr_id to resctrl_monitor_cfg
though, I think the cntr_id would be the index to the array instead?

It may also be worthwhile to consider using a pointer to the resource
group instead of storing closid and rmid directly. If used to indicate
initialization then an initialized pointer is easier to distinguish than
the closid/rmid that may have zero as valid values.

I expect evtid will be enum resctrl_event_id and that raises the question
of whether "0" can indeed be used as an "uninitialized" value since doing
so would change the meaning of the enum. It may indeed be cleaner to keep
things separate by maintaining evtid as an enum resctrl_event_id and noting
the initialization differently ... either via a pointer to a resource group
or entirely separately as Babu indicates later.
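
As a rough illustration of the three suggestions above (array index as
cntr_id, a resource group pointer instead of closid/rmid, and evtid kept
as the enum), a userspace sketch could look like the following.
struct rdtgroup_sketch, mon_cfg_alloc() and mon_cfg_find() are hypothetical
names, and the enum is a local copy whose values are assumed to match the
kernel's enum resctrl_event_id.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local copy of the event IDs; values assumed to match the kernel enum. */
enum resctrl_event_id {
	QOS_L3_OCCUP_EVENT_ID		= 0x01,
	QOS_L3_MBM_TOTAL_EVENT_ID	= 0x02,
	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
};

/* Minimal stand-in for struct rdtgroup; only the IDs matter here. */
struct rdtgroup_sketch {
	int closid;
	int rmid;
};

struct resctrl_monitor_cfg {
	struct rdtgroup_sketch *rdtgrp;	/* NULL means the counter is free */
	enum resctrl_event_id evtid;	/* only meaningful when rdtgrp is set */
	bool dirty;
};

/*
 * Claim the first free counter in a domain's array.
 * The returned array index doubles as the cntr_id; -1 means none is free.
 */
int mon_cfg_alloc(struct resctrl_monitor_cfg *cfg, size_t num_cntrs,
		  struct rdtgroup_sketch *grp, enum resctrl_event_id evtid)
{
	for (size_t i = 0; i < num_cntrs; i++) {
		if (cfg[i].rdtgrp)
			continue;
		cfg[i].rdtgrp = grp;
		cfg[i].evtid = evtid;
		cfg[i].dirty = true;
		return (int)i;
	}
	return -1;
}

/* Return the cntr_id already assigned to (grp, evtid), or -1 if none. */
int mon_cfg_find(const struct resctrl_monitor_cfg *cfg, size_t num_cntrs,
		 const struct rdtgroup_sketch *grp, enum resctrl_event_id evtid)
{
	for (size_t i = 0; i < num_cntrs; i++)
		if (cfg[i].rdtgrp == grp && cfg[i].evtid == evtid)
			return (int)i;
	return -1;
}
```

Because a free entry is marked by a NULL pointer, the default group with
closid 0 and rmid 0 is never mistaken for an unused counter.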

>>>> For the benefit of displaying mbm_assign_control, I put a pointer back
>>>> to any counter array entry allocated in the mbm_state struct only
>>>> because it's an existing structure that exists for every rmid-domain
>>>> combination.
>>>
>>> Pointer in mbm_state may not be required here.
>>>
>>> We are going to loop over resctrl groups. We can search the
>>> rdt_mon_domain to see if specific closid, rmid, evtid is already
>>> assigned or not in that domain.
>>
>> No, not required I guess. High-performance CPUs can probably search a
>> 32-entry array very quickly.
> 
> Ok.
> 

This is not so clear to me. I am wondering about the scenario when a resource
group is removed and its counters need to be freed. Searching which counters
need to be freed would then require a search of the array within every domain,
of which I understand there can be many? Having a pointer from the mbm_state
may help here.
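
A rough sketch of what such a back-pointer could buy on group removal
follows. It is a userspace model only: struct mbm_state_sketch,
struct mon_domain_sketch, domain_assign() and domain_free_rmid() are
hypothetical, and the array sizes are arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_CNTRS 32	/* assumed counters per domain */
#define NUM_RMIDS 8	/* arbitrary, kept small for the sketch */

struct resctrl_monitor_cfg {
	int rmid;
	bool in_use;
	bool dirty;
};

/* Stand-in for struct mbm_state: one per RMID per event per domain. */
struct mbm_state_sketch {
	struct resctrl_monitor_cfg *cntr;	/* NULL if no counter assigned */
};

struct mon_domain_sketch {
	struct resctrl_monitor_cfg mon_cfg[NUM_CNTRS];
	struct mbm_state_sketch mbm_total[NUM_RMIDS];
	struct mbm_state_sketch mbm_local[NUM_RMIDS];
};

/* Assign counter @cntr_id to @rmid and record the back-pointer. */
void domain_assign(struct mon_domain_sketch *d, int cntr_id, int rmid,
		   bool local)
{
	struct resctrl_monitor_cfg *cfg = &d->mon_cfg[cntr_id];

	cfg->rmid = rmid;
	cfg->in_use = true;
	cfg->dirty = true;
	if (local)
		d->mbm_local[rmid].cntr = cfg;
	else
		d->mbm_total[rmid].cntr = cfg;
}

/* Free one event's counter through the back-pointer, if one is set. */
int free_via_state(struct mbm_state_sketch *st)
{
	if (!st->cntr)
		return 0;
	st->cntr->in_use = false;
	st->cntr->dirty = true;		/* hardware must still be told */
	st->cntr = NULL;
	return 1;
}

/* Release all counters of @rmid in one domain without scanning mon_cfg. */
int domain_free_rmid(struct mon_domain_sketch *d, int rmid)
{
	return free_via_state(&d->mbm_total[rmid]) +
	       free_via_state(&d->mbm_local[rmid]);
}
```

Removal still has to visit each domain, but within a domain it touches at
most two entries instead of scanning the whole counter array.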

Reinette


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-12-02 18:33                             ` Reinette Chatre
@ 2024-12-02 19:48                               ` Moger, Babu
  2024-12-02 20:15                                 ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-12-02 19:48 UTC (permalink / raw)
  To: Reinette Chatre, Peter Newman
  Cc: corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86, hpa, thuth,
	paulmck, rostedt, akpm, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, perry.yuan, sandipan.das, kai.huang, xiaoyao.li,
	seanjc, jithu.joseph, brijesh.singh, xin3.li, ebiggers,
	andrew.cooper3, mario.limonciello, james.morse, tan.shaopeng,
	tony.luck, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 12/2/24 12:33, Reinette Chatre wrote:
> Hi Babu and Peter,
> 
> On 11/29/24 9:06 AM, Moger, Babu wrote:
>> Hi Peter, Reinette,
>>
>> On 11/29/2024 3:59 AM, Peter Newman wrote:
>>> Hi Babu,
>>>
>>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>>>> Hi Babu, Reinette,
>>>>>
>>>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>
>>>>>> Hi Babu,
>>>>>>
>>>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>>
>>>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>>>      For example:
>>>>>>>      Resctrl group mon1
>>>>>>>       Total event
>>>>>>>       dom 0 cntr_id 1,
>>>>>>>       dom 1 cntr_id 10
>>>>>>>       dom 2 cntr_id 11
>>>>>>>
>>>>>>>      Local event
>>>>>>>       dom 0 cntr_id 2,
>>>>>>>       dom 1 cntr_id 15
>>>>>>>       dom 2 cntr_id 10
>>>>>>
>>>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>>>> group:
>>>>>>           struct cntr_id {
>>>>>>                   u32     mbm_total;
>>>>>>                   u32     mbm_local;
>>>>>>           }
>>>>>>
>>>>>>
>>>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>>>> current implementation deals with domains that come and go. I do not think
>>>>>> this is currently handled. For example, if a new domain comes online and
>>>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>>>> not configured to the newly online domain.
>>>>
>>>> I am trying to understand the details of your approach here.
>>>>>
>>>>> In my prototype, I allocated a counter id-indexed array to each
>>>>> monitoring domain structure for tracking the counter allocations,
>>>>> because the hardware counters are all domain-scoped. That way the
>>>>> tracking data goes away when the hardware does.
>>>>>
>>>>> I was focused on allowing all pending counter updates to a domain
>>>>> resulting from a single mbm_assign_control write to be batched and
>>>>> processed in a single IPI, so I structured the counter tracker
>>>>> something like this:
>>>>
>>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>>
>>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>> This is still a single write. Two IPIs are sent separately, one for each
>>>> domain.
>>>>
>>>> Are you doing something different?
>>>
>>> I said "all pending counter updates to a domain", whereby I meant
>>> targeting a single domain.
>>>
>>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>>
>>> What is important is that the following write also requires 1 or 2 IPIs:
>>>
>>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>>> for readability)
>>>
>>> echo $'//0=t;1=t\n
>>> /g1/0=t;1=t\n
>>> /g2/0=t;1=t\n
>>> /g3/0=t;1=t\n
>>> /g4/0=t;1=t\n
>>> /g5/0=t;1=t\n
>>> /g6/0=t;1=t\n
>>> /g7/0=t;1=t\n
>>> /g8/0=t;1=t\n
>>> /g9/0=t;1=t\n
>>> /g10/0=t;1=t\n
>>> /g11/0=t;1=t\n
>>> /g12/0=t;1=t\n
>>> /g13/0=t;1=t\n
>>> /g14/0=t;1=t\n
>>> /g15/0=t;1=t\n
>>> /g16/0=t;1=t\n
>>> /g17/0=t;1=t\n
>>> /g18/0=t;1=t\n
>>> /g19/0=t;1=t\n
>>> /g20/0=t;1=t\n
>>> /g21/0=t;1=t\n
>>> /g22/0=t;1=t\n
>>> /g23/0=t;1=t\n
>>> /g24/0=t;1=t\n
>>> /g25/0=t;1=t\n
>>> /g26/0=t;1=t\n
>>> /g27/0=t;1=t\n
>>> /g28/0=t;1=t\n
>>> /g29/0=t;1=t\n
>>> /g30/0=t;1=t\n
>>> /g31/0=t;1=t\n'
>>>
>>> My ultimate goal is for a thread bound to a particular domain to be
>>> able to unassign and reassign the local domain's 32 counters in a
>>> single write() with no IPIs at all. And when IPIs are required, then
>>> no more than one per domain, regardless of the number of groups
>>> updated.
>>>
>>
>> Yes. I think I got the idea. Thanks.
>>
>>>
>>>>
>>>>>
>>>>> struct resctrl_monitor_cfg {
>>>>>       int closid;
>>>>>       int rmid;
>>>>>       int evtid;
>>>>>       bool dirty;
>>>>> };
>>>>>
>>>>> This mirrors the info needed in whatever register configures the
>>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>>> updated.
>>>>
>>>> This is my understanding of your implementation.
>>>>
>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>> index d94abba1c716..9cebf065cc97 100644
>>>> --- a/include/linux/resctrl.h
>>>> +++ b/include/linux/resctrl.h
>>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>>           u32                             *mbps_val;
>>>>    };
>>>>
>>>> +struct resctrl_monitor_cfg {
>>>> +    int closid;
>>>> +    int rmid;
>>>> +    int evtid;
>>>> +    bool dirty;
>>>> +};
>>>> +
>>>>    /**
>>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>>> resource
>>>>     * @hdr:               common header for different domain types
>>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>>           struct delayed_work             cqm_limbo;
>>>>           int                             mbm_work_cpu;
>>>>           int                             cqm_work_cpu;
>>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>>    };
>>>>
>>>>
>>>> When a user requests an assignment for total event to the default group
>>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>>> entry.
>>>>
>>>> If there is an empty entry, then use that entry for assignment and
>>>> update closid, rmid, evtid and dirty = 1. We can get all these
>>>> information from default group here.
>>>>
>>>> Does this make sense?
>>>
>>> Yes, sounds correct.
>>
>> I will probably add cntr_id in resctrl_monitor_cfg structure and
>> initialize during the allocation. And rename the field 'dirty' to
>> 'active'(or something similar) to hold the assign state for that
>> entry. That way we have all the information required for assignment
>> at one place. We don't need to update the rdtgroup structure.
>>
>> Reinette, What do you think about this approach?
> 
> I think this approach is in the right direction. Thanks to Peter for
> the guidance here.
> I do not think that it is necessary to add cntr_id to resctrl_monitor_cfg
> though, I think the cntr_id would be the index to the array instead?

Yes. I think we can use the index as cntr_id. Will let you know otherwise.


> 
> It may also be worthwhile to consider using a pointer to the resource
> group instead of storing closid and rmid directly. If used to indicate
> initialization then an initialized pointer is easier to distinguish than
> the closid/rmid that may have zero as valid values.

Sure. Sounds good.

> 
> I expect evtid will be enum resctrl_event_id and that raises the question
> of whether "0" can indeed be used as an "uninitialized" value since doing
> so would change the meaning of the enum. It may indeed keep things
> separated by maintaining evtid as an enum resctrl_event_id and note the
> initialization differently ... either via a pointer to a resource group
> or entirely separately as Babu indicates later.

Sure. Will add evtid as enum resctrl_event_id and use the "state" to
indicate assign/unassign/dirty status.

> 
>>>>> For the benefit of displaying mbm_assign_control, I put a pointer back
>>>>> to any counter array entry allocated in the mbm_state struct only
>>>>> because it's an existing structure that exists for every rmid-domain
>>>>> combination.
>>>>
>>>> Pointer in mbm_state may not be required here.
>>>>
>>>> We are going to loop over resctrl groups. We can search the
>>>> rdt_mon_domain to see if specific closid, rmid, evtid is already
>>>> assigned or not in that domain.
>>>
>>> No, not required I guess. High-performance CPUs can probably search a
>>> 32-entry array very quickly.
>>
>> Ok.
>>
> 
> This is not so clear to me. I am wondering about the scenario when a resource
> group is removed and its counters need to be freed. Searching which counters
> need to be freed would then require a search of the array within every domain,
> of which I understand there can be many? Having a pointer from the mbm_state
> may help here.

Sure. Will add the allocated entry pointer in mbm_state.

> 
> Reinette
> 
> 

-- 
Thanks
Babu Moger

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-12-02 19:48                               ` Moger, Babu
@ 2024-12-02 20:15                                 ` Reinette Chatre
  2024-12-02 20:42                                   ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-12-02 20:15 UTC (permalink / raw)
  To: babu.moger, Peter Newman
  Cc: corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86, hpa, thuth,
	paulmck, rostedt, akpm, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, perry.yuan, sandipan.das, kai.huang, xiaoyao.li,
	seanjc, jithu.joseph, brijesh.singh, xin3.li, ebiggers,
	andrew.cooper3, mario.limonciello, james.morse, tan.shaopeng,
	tony.luck, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 12/2/24 11:48 AM, Moger, Babu wrote:
> On 12/2/24 12:33, Reinette Chatre wrote:
>> On 11/29/24 9:06 AM, Moger, Babu wrote:
>>> On 11/29/2024 3:59 AM, Peter Newman wrote:
>>>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>>>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>>>
>>>>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>>>>      For example:
>>>>>>>>      Resctrl group mon1
>>>>>>>>       Total event
>>>>>>>>       dom 0 cntr_id 1,
>>>>>>>>       dom 1 cntr_id 10
>>>>>>>>       dom 2 cntr_id 11
>>>>>>>>
>>>>>>>>      Local event
>>>>>>>>       dom 0 cntr_id 2,
>>>>>>>>       dom 1 cntr_id 15
>>>>>>>>       dom 2 cntr_id 10
>>>>>>>
>>>>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>>>>> group:
>>>>>>>           struct cntr_id {
>>>>>>>                   u32     mbm_total;
>>>>>>>                   u32     mbm_local;
>>>>>>>           }
>>>>>>>
>>>>>>>
>>>>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>>>>> current implementation deals with domains that come and go. I do not think
>>>>>>> this is currently handled. For example, if a new domain comes online and
>>>>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>>>>> not configured to the newly online domain.
>>>>>
>>>>> I am trying to understand the details of your approach here.
>>>>>>
>>>>>> In my prototype, I allocated a counter id-indexed array to each
>>>>>> monitoring domain structure for tracking the counter allocations,
>>>>>> because the hardware counters are all domain-scoped. That way the
>>>>>> tracking data goes away when the hardware does.
>>>>>>
>>>>>> I was focused on allowing all pending counter updates to a domain
>>>>>> resulting from a single mbm_assign_control write to be batched and
>>>>>> processed in a single IPI, so I structured the counter tracker
>>>>>> something like this:
>>>>>
>>>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>>>
>>>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>
>>>>> This is still a single write. Two IPIs are sent separately, one for each
>>>>> domain.
>>>>>
>>>>> Are you doing something different?
>>>>
>>>> I said "all pending counter updates to a domain", whereby I meant
>>>> targeting a single domain.
>>>>
>>>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>>>
>>>> What is important is that the following write also requires 1 or 2 IPIs:
>>>>
>>>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>>>> for readability)
>>>>
>>>> echo $'//0=t;1=t\n
>>>> /g1/0=t;1=t\n
>>>> /g2/0=t;1=t\n
>>>> /g3/0=t;1=t\n
>>>> /g4/0=t;1=t\n
>>>> /g5/0=t;1=t\n
>>>> /g6/0=t;1=t\n
>>>> /g7/0=t;1=t\n
>>>> /g8/0=t;1=t\n
>>>> /g9/0=t;1=t\n
>>>> /g10/0=t;1=t\n
>>>> /g11/0=t;1=t\n
>>>> /g12/0=t;1=t\n
>>>> /g13/0=t;1=t\n
>>>> /g14/0=t;1=t\n
>>>> /g15/0=t;1=t\n
>>>> /g16/0=t;1=t\n
>>>> /g17/0=t;1=t\n
>>>> /g18/0=t;1=t\n
>>>> /g19/0=t;1=t\n
>>>> /g20/0=t;1=t\n
>>>> /g21/0=t;1=t\n
>>>> /g22/0=t;1=t\n
>>>> /g23/0=t;1=t\n
>>>> /g24/0=t;1=t\n
>>>> /g25/0=t;1=t\n
>>>> /g26/0=t;1=t\n
>>>> /g27/0=t;1=t\n
>>>> /g28/0=t;1=t\n
>>>> /g29/0=t;1=t\n
>>>> /g30/0=t;1=t\n
>>>> /g31/0=t;1=t\n'
>>>>
>>>> My ultimate goal is for a thread bound to a particular domain to be
>>>> able to unassign and reassign the local domain's 32 counters in a
>>>> single write() with no IPIs at all. And when IPIs are required, then
>>>> no more than one per domain, regardless of the number of groups
>>>> updated.
>>>>
>>>
>>> Yes. I think I got the idea. Thanks.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> struct resctrl_monitor_cfg {
>>>>>>       int closid;
>>>>>>       int rmid;
>>>>>>       int evtid;
>>>>>>       bool dirty;
>>>>>> };
>>>>>>
>>>>>> This mirrors the info needed in whatever register configures the
>>>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>>>> updated.
>>>>>
>>>>> This is my understanding of your implementation.
>>>>>
>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>> index d94abba1c716..9cebf065cc97 100644
>>>>> --- a/include/linux/resctrl.h
>>>>> +++ b/include/linux/resctrl.h
>>>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>>>           u32                             *mbps_val;
>>>>>    };
>>>>>
>>>>> +struct resctrl_monitor_cfg {
>>>>> +    int closid;
>>>>> +    int rmid;
>>>>> +    int evtid;
>>>>> +    bool dirty;
>>>>> +};
>>>>> +
>>>>>    /**
>>>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>>>> resource
>>>>>     * @hdr:               common header for different domain types
>>>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>>>           struct delayed_work             cqm_limbo;
>>>>>           int                             mbm_work_cpu;
>>>>>           int                             cqm_work_cpu;
>>>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>>>    };
>>>>>
>>>>>
>>>>> When a user requests an assignment for total event to the default group
>>>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>>>> entry.
>>>>>
>>>>> If there is an empty entry, then use that entry for assignment and
>>>>> update closid, rmid, evtid and dirty = 1. We can get all this
>>>>> information from the default group here.
>>>>>
>>>>> Does this make sense?
>>>>
>>>> Yes, sounds correct.
>>>
>>> I will probably add cntr_id in resctrl_monitor_cfg structure and
>>> initialize during the allocation. And rename the field 'dirty' to
>>> 'active'(or something similar) to hold the assign state for that
>>> entry. That way we have all the information required for assignment
>>> at one place. We don't need to update the rdtgroup structure.
>>>
>>> Reinette, What do you think about this approach?
>>
>> I think this approach is in the right direction. Thanks to Peter for
>> the guidance here.
>> I do not think that it is necessary to add cntr_id to resctrl_monitor_cfg
>> though, I think the cntr_id would be the index to the array instead?
> 
> Yes. I think we can use the index as cntr_id. Will let you know otherwise.
> 
> 
>>
>> It may also be worthwhile to consider using a pointer to the resource
>> group instead of storing closid and rmid directly. If used to indicate
>> initialization then an initialized pointer is easier to distinguish than
>> the closid/rmid that may have zero as valid values.
> 
> Sure. Sounds good.
> 
>>
>> I expect evtid will be enum resctrl_event_id and that raises the question
>> of whether "0" can indeed be used as an "uninitialized" value since doing
>> so would change the meaning of the enum. It may indeed keep things
>> separated by maintaining evtid as an enum resctrl_event_id and note the
>> initialization differently ... either via a pointer to a resource group
>> or entirely separately as Babu indicates later.
> 
> Sure. Will add evtid as enum resctrl_event_id and use the "state" to
> indicate assign/unassign/dirty status.

Is "assign/unassign" state needed? If resctrl_monitor_cfg contains a pointer
to the resource group to which the counter has been assigned then I expect NULL
means unassigned and a value means assigned?
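
A minimal sketch of that idea, assuming a trimmed-down stand-in for
struct rdtgroup and a hypothetical helper name (cntr_is_assigned); it is
not the posted patch. It shows why the pointer works as the assignment
marker even when closid and rmid are both zero:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct rdtgroup. */
struct rdtgroup {
	unsigned int closid;	/* 0 is a valid CLOSID */
	unsigned int rmid;	/* 0 is a valid RMID */
};

/* Illustrative per-counter entry; the field names are assumptions. */
struct resctrl_monitor_cfg {
	int evtid;
	struct rdtgroup *rdtgrp;	/* NULL: counter is unassigned */
};

/* A counter is assigned exactly when it points at a resource group. */
static bool cntr_is_assigned(const struct resctrl_monitor_cfg *cfg)
{
	assert(cfg != NULL);
	return cfg->rdtgrp != NULL;
}
```

With this layout no separate assign/unassign state has to be kept in
sync with the rest of the entry.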

Reinette


* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-12-02 20:15                                 ` Reinette Chatre
@ 2024-12-02 20:42                                   ` Moger, Babu
  2024-12-02 21:09                                     ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-12-02 20:42 UTC (permalink / raw)
  To: Reinette Chatre, Peter Newman
  Cc: corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86, hpa, thuth,
	paulmck, rostedt, akpm, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, perry.yuan, sandipan.das, kai.huang, xiaoyao.li,
	seanjc, jithu.joseph, brijesh.singh, xin3.li, ebiggers,
	andrew.cooper3, mario.limonciello, james.morse, tan.shaopeng,
	tony.luck, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 12/2/24 14:15, Reinette Chatre wrote:
> Hi Babu,
> 
> On 12/2/24 11:48 AM, Moger, Babu wrote:
>> On 12/2/24 12:33, Reinette Chatre wrote:
>>> On 11/29/24 9:06 AM, Moger, Babu wrote:
>>>> On 11/29/2024 3:59 AM, Peter Newman wrote:
>>>>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>>>>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>>>>
>>>>>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>>>>>      For example:
>>>>>>>>>      Resctrl group mon1
>>>>>>>>>       Total event
>>>>>>>>>       dom 0 cntr_id 1,
>>>>>>>>>       dom 1 cntr_id 10
>>>>>>>>>       dom 2 cntr_id 11
>>>>>>>>>
>>>>>>>>>      Local event
>>>>>>>>>       dom 0 cntr_id 2,
>>>>>>>>>       dom 1 cntr_id 15
>>>>>>>>>       dom 2 cntr_id 10
>>>>>>>>
>>>>>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>>>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>>>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>>>>>> group:
>>>>>>>>           struct cntr_id {
>>>>>>>>                   u32     mbm_total;
>>>>>>>>                   u32     mbm_local;
>>>>>>>>           }
>>>>>>>>
>>>>>>>>
>>>>>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>>>>>> current implementation deals with domains that come and go. I do not think
>>>>>>>> this is currently handled. For example, if a new domain comes online and
>>>>>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>>>>>> not configured to the newly online domain.
>>>>>>
>>>>>> I am trying to understand the details of your approach here.
>>>>>>>
>>>>>>> In my prototype, I allocated a counter id-indexed array to each
>>>>>>> monitoring domain structure for tracking the counter allocations,
>>>>>>> because the hardware counters are all domain-scoped. That way the
>>>>>>> tracking data goes away when the hardware does.
>>>>>>>
>>>>>>> I was focused on allowing all pending counter updates to a domain
>>>>>>> resulting from a single mbm_assign_control write to be batched and
>>>>>>> processed in a single IPI, so I structured the counter tracker
>>>>>>> something like this:
>>>>>>
>>>>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>>>>
>>>>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>
>>>>>> This is still a single write. Two IPIs are sent separately, one for each
>>>>>> domain.
>>>>>>
>>>>>> Are you doing something different?
>>>>>
>>>>> I said "all pending counter updates to a domain", whereby I meant
>>>>> targeting a single domain.
>>>>>
>>>>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>>>>
>>>>> What is important is that the following write also requires 1 or 2 IPIs:
>>>>>
>>>>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>>>>> for readability)
>>>>>
>>>>> echo $'//0=t;1=t\n
>>>>> /g1/0=t;1=t\n
>>>>> /g2/0=t;1=t\n
>>>>> /g3/0=t;1=t\n
>>>>> /g4/0=t;1=t\n
>>>>> /g5/0=t;1=t\n
>>>>> /g6/0=t;1=t\n
>>>>> /g7/0=t;1=t\n
>>>>> /g8/0=t;1=t\n
>>>>> /g9/0=t;1=t\n
>>>>> /g10/0=t;1=t\n
>>>>> /g11/0=t;1=t\n
>>>>> /g12/0=t;1=t\n
>>>>> /g13/0=t;1=t\n
>>>>> /g14/0=t;1=t\n
>>>>> /g15/0=t;1=t\n
>>>>> /g16/0=t;1=t\n
>>>>> /g17/0=t;1=t\n
>>>>> /g18/0=t;1=t\n
>>>>> /g19/0=t;1=t\n
>>>>> /g20/0=t;1=t\n
>>>>> /g21/0=t;1=t\n
>>>>> /g22/0=t;1=t\n
>>>>> /g23/0=t;1=t\n
>>>>> /g24/0=t;1=t\n
>>>>> /g25/0=t;1=t\n
>>>>> /g26/0=t;1=t\n
>>>>> /g27/0=t;1=t\n
>>>>> /g28/0=t;1=t\n
>>>>> /g29/0=t;1=t\n
>>>>> /g30/0=t;1=t\n
>>>>> /g31/0=t;1=t\n'
>>>>>
>>>>> My ultimate goal is for a thread bound to a particular domain to be
>>>>> able to unassign and reassign the local domain's 32 counters in a
>>>>> single write() with no IPIs at all. And when IPIs are required, then
>>>>> no more than one per domain, regardless of the number of groups
>>>>> updated.
>>>>>
>>>>
>>>> Yes. I think I got the idea. Thanks.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> struct resctrl_monitor_cfg {
>>>>>>>       int closid;
>>>>>>>       int rmid;
>>>>>>>       int evtid;
>>>>>>>       bool dirty;
>>>>>>> };
>>>>>>>
>>>>>>> This mirrors the info needed in whatever register configures the
>>>>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>>>>> updated.
>>>>>>
>>>>>> This is my understanding of your implementation.
>>>>>>
>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>>> index d94abba1c716..9cebf065cc97 100644
>>>>>> --- a/include/linux/resctrl.h
>>>>>> +++ b/include/linux/resctrl.h
>>>>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>>>>           u32                             *mbps_val;
>>>>>>    };
>>>>>>
>>>>>> +struct resctrl_monitor_cfg {
>>>>>> +    int closid;
>>>>>> +    int rmid;
>>>>>> +    int evtid;
>>>>>> +    bool dirty;
>>>>>> +};
>>>>>> +
>>>>>>    /**
>>>>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>>>>> resource
>>>>>>     * @hdr:               common header for different domain types
>>>>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>>>>           struct delayed_work             cqm_limbo;
>>>>>>           int                             mbm_work_cpu;
>>>>>>           int                             cqm_work_cpu;
>>>>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>>>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>>>>    };
>>>>>>
>>>>>>
>>>>>> When a user requests an assignment for total event to the default group
>>>>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>>>>> entry.
>>>>>>
>>>>>> If there is an empty entry, then use that entry for assignment and
>>>>>> update closid, rmid, evtid and dirty = 1. We can get all this
>>>>>> information from the default group here.
>>>>>>
>>>>>> Does this make sense?
>>>>>
>>>>> Yes, sounds correct.
>>>>
>>>> I will probably add cntr_id in resctrl_monitor_cfg structure and
>>>> initialize during the allocation. And rename the field 'dirty' to
>>>> 'active'(or something similar) to hold the assign state for that
>>>> entry. That way we have all the information required for assignment
>>>> at one place. We don't need to update the rdtgroup structure.
>>>>
>>>> Reinette, What do you think about this approach?
>>>
>>> I think this approach is in the right direction. Thanks to Peter for
>>> the guidance here.
>>> I do not think that it is necessary to add cntr_id to resctrl_monitor_cfg
>>> though, I think the cntr_id would be the index to the array instead?
>>
>> Yes. I think we can use the index as cntr_id. Will let you know otherwise.
>>
>>
>>>
>>> It may also be worthwhile to consider using a pointer to the resource
>>> group instead of storing closid and rmid directly. If used to indicate
>>> initialization then an initialized pointer is easier to distinguish than
>>> the closid/rmid that may have zero as valid values.
>>
>> Sure. Sounds good.
>>
>>>
>>> I expect evtid will be enum resctrl_event_id and that raises the question
>>> of whether "0" can indeed be used as an "uninitialized" value since doing
>>> so would change the meaning of the enum. It may indeed keep things
>>> separated by maintaining evtid as an enum resctrl_event_id and note the
>>> initialization differently ... either via a pointer to a resource group
>>> or entirely separately as Babu indicates later.
>>
>> Sure. Will add evtid as enum resctrl_event_id and use the "state" to
>> indicate assign/unassign/dirty status.
> 
> Is "assign/unassign" state needed? If resctrl_monitor_cfg contains a pointer
> to the resource group to which the counter has been assigned then I expect NULL
> means unassigned and a value means assigned?

Yes. We use the rdtgroup pointer to check the assign/unassign state.

I will drop the 'state' field. Peter can add state when he wants to use it
for optimization later.

I think we need to have the 'cntr_id' field here in resctrl_monitor_cfg.
When we access the pointer from mbm_state, we won't know which cntr_id
index it came from.

-- 
Thanks
Babu Moger


* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-12-02 20:42                                   ` Moger, Babu
@ 2024-12-02 21:09                                     ` Reinette Chatre
  2024-12-02 21:28                                       ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-12-02 21:09 UTC (permalink / raw)
  To: babu.moger, Peter Newman
  Cc: corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86, hpa, thuth,
	paulmck, rostedt, akpm, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, perry.yuan, sandipan.das, kai.huang, xiaoyao.li,
	seanjc, jithu.joseph, brijesh.singh, xin3.li, ebiggers,
	andrew.cooper3, mario.limonciello, james.morse, tan.shaopeng,
	tony.luck, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 12/2/24 12:42 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 12/2/24 14:15, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 12/2/24 11:48 AM, Moger, Babu wrote:
>>> On 12/2/24 12:33, Reinette Chatre wrote:
>>>> On 11/29/24 9:06 AM, Moger, Babu wrote:
>>>>> On 11/29/2024 3:59 AM, Peter Newman wrote:
>>>>>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>>>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>>>>>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Babu,
>>>>>>>>>
>>>>>>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>>>>>
>>>>>>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>>>>>>      For example:
>>>>>>>>>>      Resctrl group mon1
>>>>>>>>>>       Total event
>>>>>>>>>>       dom 0 cntr_id 1,
>>>>>>>>>>       dom 1 cntr_id 10
>>>>>>>>>>       dom 2 cntr_id 11
>>>>>>>>>>
>>>>>>>>>>      Local event
>>>>>>>>>>       dom 0 cntr_id 2,
>>>>>>>>>>       dom 1 cntr_id 15
>>>>>>>>>>       dom 2 cntr_id 10
>>>>>>>>>
>>>>>>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>>>>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>>>>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>>>>>>> group:
>>>>>>>>>           struct cntr_id {
>>>>>>>>>                   u32     mbm_total;
>>>>>>>>>                   u32     mbm_local;
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>>>>>>> current implementation deals with domains that come and go. I do not think
>>>>>>>>> this is currently handled. For example, if a new domain comes online and
>>>>>>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>>>>>>> not configured to the newly online domain.
>>>>>>>
>>>>>>> I am trying to understand the details of your approach here.
>>>>>>>>
>>>>>>>> In my prototype, I allocated a counter id-indexed array to each
>>>>>>>> monitoring domain structure for tracking the counter allocations,
>>>>>>>> because the hardware counters are all domain-scoped. That way the
>>>>>>>> tracking data goes away when the hardware does.
>>>>>>>>
>>>>>>>> I was focused on allowing all pending counter updates to a domain
>>>>>>>> resulting from a single mbm_assign_control write to be batched and
>>>>>>>> processed in a single IPI, so I structured the counter tracker
>>>>>>>> something like this:
>>>>>>>
>>>>>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>>>>>
>>>>>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>> This is still a single write. Two IPIs are sent separately, one for each
>>>>>>> domain.
>>>>>>>
>>>>>>> Are you doing something different?
>>>>>>
>>>>>> I said "all pending counter updates to a domain", whereby I meant
>>>>>> targeting a single domain.
>>>>>>
>>>>>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>>>>>
>>>>>> What is important is that the following write also requires 1 or 2 IPIs:
>>>>>>
>>>>>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>>>>>> for readability)
>>>>>>
>>>>>> echo $'//0=t;1=t\n
>>>>>> /g1/0=t;1=t\n
>>>>>> /g2/0=t;1=t\n
>>>>>> /g3/0=t;1=t\n
>>>>>> /g4/0=t;1=t\n
>>>>>> /g5/0=t;1=t\n
>>>>>> /g6/0=t;1=t\n
>>>>>> /g7/0=t;1=t\n
>>>>>> /g8/0=t;1=t\n
>>>>>> /g9/0=t;1=t\n
>>>>>> /g10/0=t;1=t\n
>>>>>> /g11/0=t;1=t\n
>>>>>> /g12/0=t;1=t\n
>>>>>> /g13/0=t;1=t\n
>>>>>> /g14/0=t;1=t\n
>>>>>> /g15/0=t;1=t\n
>>>>>> /g16/0=t;1=t\n
>>>>>> /g17/0=t;1=t\n
>>>>>> /g18/0=t;1=t\n
>>>>>> /g19/0=t;1=t\n
>>>>>> /g20/0=t;1=t\n
>>>>>> /g21/0=t;1=t\n
>>>>>> /g22/0=t;1=t\n
>>>>>> /g23/0=t;1=t\n
>>>>>> /g24/0=t;1=t\n
>>>>>> /g25/0=t;1=t\n
>>>>>> /g26/0=t;1=t\n
>>>>>> /g27/0=t;1=t\n
>>>>>> /g28/0=t;1=t\n
>>>>>> /g29/0=t;1=t\n
>>>>>> /g30/0=t;1=t\n
>>>>>> /g31/0=t;1=t\n'
>>>>>>
>>>>>> My ultimate goal is for a thread bound to a particular domain to be
>>>>>> able to unassign and reassign the local domain's 32 counters in a
>>>>>> single write() with no IPIs at all. And when IPIs are required, then
>>>>>> no more than one per domain, regardless of the number of groups
>>>>>> updated.
>>>>>>
>>>>>
>>>>> Yes. I think I got the idea. Thanks.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> struct resctrl_monitor_cfg {
>>>>>>>>       int closid;
>>>>>>>>       int rmid;
>>>>>>>>       int evtid;
>>>>>>>>       bool dirty;
>>>>>>>> };
>>>>>>>>
>>>>>>>> This mirrors the info needed in whatever register configures the
>>>>>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>>>>>> updated.
>>>>>>>
>>>>>>> This is my understanding of your implementation.
>>>>>>>
>>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>>>> index d94abba1c716..9cebf065cc97 100644
>>>>>>> --- a/include/linux/resctrl.h
>>>>>>> +++ b/include/linux/resctrl.h
>>>>>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>>>>>           u32                             *mbps_val;
>>>>>>>    };
>>>>>>>
>>>>>>> +struct resctrl_monitor_cfg {
>>>>>>> +    int closid;
>>>>>>> +    int rmid;
>>>>>>> +    int evtid;
>>>>>>> +    bool dirty;
>>>>>>> +};
>>>>>>> +
>>>>>>>    /**
>>>>>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>>>>>> resource
>>>>>>>     * @hdr:               common header for different domain types
>>>>>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>>>>>           struct delayed_work             cqm_limbo;
>>>>>>>           int                             mbm_work_cpu;
>>>>>>>           int                             cqm_work_cpu;
>>>>>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>>>>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>>>>>    };
>>>>>>>
>>>>>>>
>>>>>>> When a user requests an assignment for total event to the default group
>>>>>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>>>>>> entry.
>>>>>>>
>>>>>>> If there is an empty entry, then use that entry for assignment and
>>>>>>> update closid, rmid, evtid and dirty = 1. We can get all this
>>>>>>> information from the default group here.
>>>>>>>
>>>>>>> Does this make sense?
>>>>>>
>>>>>> Yes, sounds correct.
>>>>>
>>>>> I will probably add cntr_id in resctrl_monitor_cfg structure and
>>>>> initialize during the allocation. And rename the field 'dirty' to
>>>>> 'active'(or something similar) to hold the assign state for that
>>>>> entry. That way we have all the information required for assignment
>>>>> at one place. We don't need to update the rdtgroup structure.
>>>>>
>>>>> Reinette, What do you think about this approach?
>>>>
>>>> I think this approach is in the right direction. Thanks to Peter for
>>>> the guidance here.
>>>> I do not think that it is necessary to add cntr_id to resctrl_monitor_cfg
>>>> though, I think the cntr_id would be the index to the array instead?
>>>
>>> Yes. I think we can use the index as cntr_id. Will let you know otherwise.
>>>
>>>
>>>>
>>>> It may also be worthwhile to consider using a pointer to the resource
>>>> group instead of storing closid and rmid directly. If used to indicate
>>>> initialization then an initialized pointer is easier to distinguish than
>>>> the closid/rmid that may have zero as valid values.
>>>
>>> Sure. Sounds good.
>>>
>>>>
>>>> I expect evtid will be enum resctrl_event_id and that raises the question
>>>> of whether "0" can indeed be used as an "uninitialized" value since doing
>>>> so would change the meaning of the enum. It may indeed keep things
>>>> separated by maintaining evtid as an enum resctrl_event_id and note the
>>>> initialization differently ... either via a pointer to a resource group
>>>> or entirely separately as Babu indicates later.
>>>
>>> Sure. Will add evtid as enum resctrl_event_id and use the "state" to
>>> indicate assign/unassign/dirty status.
>>
>> Is "assign/unassign" state needed? If resctrl_monitor_cfg contains a pointer
>> to the resource group to which the counter has been assigned then I expect NULL
>> means unassigned and a value means assigned?
> 
> Yes. We use the rdtgroup pointer to check the assign/unassign state.
> 
> I will drop the 'state' field. Peter can add state when he wants to use it
> for optimization later.
> 
> I think we need to have the 'cntr_id' field here in resctrl_monitor_cfg.
> When we access the pointer from mbm_state, we won't know which cntr_id
> index it came from.
> 

oh, good point. I wonder how Peter addressed this in his PoC. As an alternative,
could the cntr_id be used in mbm_state instead of a pointer? 
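
A rough sketch of that alternative, under stated assumptions: the
mbm_state below is a hypothetical, trimmed-down struct (not the kernel's),
the per-domain array is fixed-size for brevity, and CNTR_ID_NONE and
mbm_state_to_cfg() are made-up names. The point is only that an index in
mbm_state leads back to the entry, so the entry itself need not record
its own cntr_id:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_MBM_CNTRS	32	/* assumed; the thread mentions 32 counters per domain */
#define CNTR_ID_NONE	(-1)	/* hypothetical "no counter assigned" marker */

struct rdtgroup;		/* opaque in this sketch */

struct resctrl_monitor_cfg {
	int evtid;
	struct rdtgroup *rdtgrp;	/* NULL: unassigned */
};

/* Hypothetical mbm_state: stores the counter index, not a pointer. */
struct mbm_state {
	int cntr_id;
};

/* Trimmed-down domain: only the counter-id-indexed array is shown. */
struct rdt_mon_domain {
	struct resctrl_monitor_cfg mon_cfg[NUM_MBM_CNTRS];
};

/* Map an mbm_state back to its counter entry in the given domain. */
static struct resctrl_monitor_cfg *
mbm_state_to_cfg(struct rdt_mon_domain *d, const struct mbm_state *m)
{
	if (m->cntr_id == CNTR_ID_NONE)
		return NULL;
	assert(m->cntr_id >= 0 && m->cntr_id < NUM_MBM_CNTRS);
	return &d->mon_cfg[m->cntr_id];
}
```

The cost is that every lookup needs the domain in hand, which the
callers that already walk per-domain state would have.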

Reinette



* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-12-02 21:09                                     ` Reinette Chatre
@ 2024-12-02 21:28                                       ` Moger, Babu
  2024-12-02 21:47                                         ` Reinette Chatre
  0 siblings, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-12-02 21:28 UTC (permalink / raw)
  To: Reinette Chatre, Peter Newman
  Cc: corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86, hpa, thuth,
	paulmck, rostedt, akpm, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, perry.yuan, sandipan.das, kai.huang, xiaoyao.li,
	seanjc, jithu.joseph, brijesh.singh, xin3.li, ebiggers,
	andrew.cooper3, mario.limonciello, james.morse, tan.shaopeng,
	tony.luck, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 12/2/24 15:09, Reinette Chatre wrote:
> Hi Babu,
> 
> On 12/2/24 12:42 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 12/2/24 14:15, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 12/2/24 11:48 AM, Moger, Babu wrote:
>>>> On 12/2/24 12:33, Reinette Chatre wrote:
>>>>> On 11/29/24 9:06 AM, Moger, Babu wrote:
>>>>>> On 11/29/2024 3:59 AM, Peter Newman wrote:
>>>>>>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>>>>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>>>>>>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Babu,
>>>>>>>>>>
>>>>>>>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>>>>>>
>>>>>>>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>>>>>>>      For example:
>>>>>>>>>>>      Resctrl group mon1
>>>>>>>>>>>       Total event
>>>>>>>>>>>       dom 0 cntr_id 1,
>>>>>>>>>>>       dom 1 cntr_id 10
>>>>>>>>>>>       dom 2 cntr_id 11
>>>>>>>>>>>
>>>>>>>>>>>      Local event
>>>>>>>>>>>       dom 0 cntr_id 2,
>>>>>>>>>>>       dom 1 cntr_id 15
>>>>>>>>>>>       dom 2 cntr_id 10
>>>>>>>>>>
>>>>>>>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>>>>>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>>>>>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>>>>>>>> group:
>>>>>>>>>>           struct cntr_id {
>>>>>>>>>>                   u32     mbm_total;
>>>>>>>>>>                   u32     mbm_local;
>>>>>>>>>>           }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>>>>>>>> current implementation deals with domains that come and go. I do not think
>>>>>>>>>> this is currently handled. For example, if a new domain comes online and
>>>>>>>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>>>>>>>> not configured to the newly online domain.
>>>>>>>>
>>>>>>>> I am trying to understand the details of your approach here.
>>>>>>>>>
>>>>>>>>> In my prototype, I allocated a counter id-indexed array to each
>>>>>>>>> monitoring domain structure for tracking the counter allocations,
>>>>>>>>> because the hardware counters are all domain-scoped. That way the
>>>>>>>>> tracking data goes away when the hardware does.
>>>>>>>>>
>>>>>>>>> I was focused on allowing all pending counter updates to a domain
>>>>>>>>> resulting from a single mbm_assign_control write to be batched and
>>>>>>>>> processed in a single IPI, so I structured the counter tracker
>>>>>>>>> something like this:
>>>>>>>>
>>>>>>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>>>>>>
>>>>>>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>
>>>>>>>> This is still a single write. Two IPIs are sent separately, one for each
>>>>>>>> domain.
>>>>>>>>
>>>>>>>> Are you doing something different?
>>>>>>>
>>>>>>> I said "all pending counter updates to a domain", whereby I meant
>>>>>>> targeting a single domain.
>>>>>>>
>>>>>>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>>>>>>
>>>>>>> What is important is that the following write also requires 1 or 2 IPIs:
>>>>>>>
>>>>>>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>>>>>>> for readability)
>>>>>>>
>>>>>>> echo $'//0=t;1=t\n
>>>>>>> /g1/0=t;1=t\n
>>>>>>> /g2/0=t;1=t\n
>>>>>>> /g3/0=t;1=t\n
>>>>>>> /g4/0=t;1=t\n
>>>>>>> /g5/0=t;1=t\n
>>>>>>> /g6/0=t;1=t\n
>>>>>>> /g7/0=t;1=t\n
>>>>>>> /g8/0=t;1=t\n
>>>>>>> /g9/0=t;1=t\n
>>>>>>> /g10/0=t;1=t\n
>>>>>>> /g11/0=t;1=t\n
>>>>>>> /g12/0=t;1=t\n
>>>>>>> /g13/0=t;1=t\n
>>>>>>> /g14/0=t;1=t\n
>>>>>>> /g15/0=t;1=t\n
>>>>>>> /g16/0=t;1=t\n
>>>>>>> /g17/0=t;1=t\n
>>>>>>> /g18/0=t;1=t\n
>>>>>>> /g19/0=t;1=t\n
>>>>>>> /g20/0=t;1=t\n
>>>>>>> /g21/0=t;1=t\n
>>>>>>> /g22/0=t;1=t\n
>>>>>>> /g23/0=t;1=t\n
>>>>>>> /g24/0=t;1=t\n
>>>>>>> /g25/0=t;1=t\n
>>>>>>> /g26/0=t;1=t\n
>>>>>>> /g27/0=t;1=t\n
>>>>>>> /g28/0=t;1=t\n
>>>>>>> /g29/0=t;1=t\n
>>>>>>> /g30/0=t;1=t\n
>>>>>>> /g31/0=t;1=t\n'
>>>>>>>
>>>>>>> My ultimate goal is for a thread bound to a particular domain to be
>>>>>>> able to unassign and reassign the local domain's 32 counters in a
>>>>>>> single write() with no IPIs at all. And when IPIs are required, then
>>>>>>> no more than one per domain, regardless of the number of groups
>>>>>>> updated.
>>>>>>>
>>>>>>
>>>>>> Yes. I think I got the idea. Thanks.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> struct resctrl_monitor_cfg {
>>>>>>>>>       int closid;
>>>>>>>>>       int rmid;
>>>>>>>>>       int evtid;
>>>>>>>>>       bool dirty;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> This mirrors the info needed in whatever register configures the
>>>>>>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>>>>>>> updated.
>>>>>>>>
>>>>>>>> This is my understanding of your implementation.
>>>>>>>>
>>>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>>>>> index d94abba1c716..9cebf065cc97 100644
>>>>>>>> --- a/include/linux/resctrl.h
>>>>>>>> +++ b/include/linux/resctrl.h
>>>>>>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>>>>>>           u32                             *mbps_val;
>>>>>>>>    };
>>>>>>>>
>>>>>>>> +struct resctrl_monitor_cfg {
>>>>>>>> +    int closid;
>>>>>>>> +    int rmid;
>>>>>>>> +    int evtid;
>>>>>>>> +    bool dirty;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>>    /**
>>>>>>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>>>>>>> resource
>>>>>>>>     * @hdr:               common header for different domain types
>>>>>>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>>>>>>           struct delayed_work             cqm_limbo;
>>>>>>>>           int                             mbm_work_cpu;
>>>>>>>>           int                             cqm_work_cpu;
>>>>>>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>>>>>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>>>>>>    };
>>>>>>>>
>>>>>>>>
>>>>>>>> When a user requests an assignment for total event to the default group
>>>>>>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>>>>>>> entry.
>>>>>>>>
>>>>>>>> If there is an empty entry, then use that entry for assignment and
>>>>>>>> update closid, rmid, evtid and dirty = 1. We can get all this
>>>>>>>> information from the default group here.
>>>>>>>>
>>>>>>>> Does this make sense?
>>>>>>>
>>>>>>> Yes, sounds correct.
>>>>>>
>>>>>> I will probably add cntr_id in resctrl_monitor_cfg structure and
>>>>>> initialize during the allocation. And rename the field 'dirty' to
>>>>>> 'active'(or something similar) to hold the assign state for that
>>>>>> entry. That way we have all the information required for assignment
>>>>>> at one place. We don't need to update the rdtgroup structure.
>>>>>>
>>>>>> Reinette, What do you think about this approach?
>>>>>
>>>>> I think this approach is in the right direction. Thanks to Peter for
>>>>> the guidance here.
>>>>> I do not think that it is necessary to add cntr_id to resctrl_monitor_cfg
>>>>> though, I think the cntr_id would be the index to the array instead?
>>>>
>>>> Yes. I think we can use the index as cntr_id. Will let you know otherwise.
>>>>
>>>>
>>>>>
>>>>> It may also be worthwhile to consider using a pointer to the resource
>>>>> group instead of storing closid and rmid directly. If used to indicate
>>>>> initialization then an initialized pointer is easier to distinguish than
>>>>> the closid/rmid that may have zero as valid values.
>>>>
>>>> Sure. Sounds good.
>>>>
>>>>>
>>>>> I expect evtid will be enum resctrl_event_id and that raises the question
>>>>> of whether "0" can indeed be used as an "uninitialized" value since doing
>>>>> so would change the meaning of the enum. It may indeed keep things
>>>>> separated by maintaining evtid as an enum resctrl_event_id and note the
>>>>> initialization differently ... either via a pointer to a resource group
>>>>> or entirely separately as Babu indicates later.
>>>>
>>>> Sure. Will add evtid as enum resctrl_event_id and use the "state" to
>>>> indicate assign/unassign/dirty status.
>>>
>>> Is "assign/unassign" state needed? If resctrl_monitor_cfg contains a pointer
>>> to the resource group to which the counter has been assigned then I expect NULL
>>> means unassigned and a value means assigned?
>>
>> Yes. We use the rdtgroup pointer to check the assign/unassign state.
>>
>> I will drop the 'state' field. Peter can add state when he wants to use it
>> for optimization later.
>>
>> I think we need to have the 'cntr_id' field here in resctrl_monitor_cfg.
>> When we access the pointer from mbm_state, we won't know which cntr_id
>> index it came from.
>>
> 
> oh, good point. I wonder how Peter addressed this in his PoC. As an alternative,
> could the cntr_id be used in mbm_state instead of a pointer? 
> 

Yes. It can be done.

I thought it would be better to have everything in one place.

struct resctrl_monitor_cfg {
  unsigned int            cntr_id;
  enum resctrl_event_id   evtid;
  struct rdtgroup         *rdtgrp;
};

This will have everything required to assign/unassign the event.

Thanks
Babu Moger


* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-12-02 21:28                                       ` Moger, Babu
@ 2024-12-02 21:47                                         ` Reinette Chatre
  2024-12-02 22:06                                           ` Moger, Babu
  0 siblings, 1 reply; 115+ messages in thread
From: Reinette Chatre @ 2024-12-02 21:47 UTC (permalink / raw)
  To: babu.moger, Peter Newman
  Cc: corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86, hpa, thuth,
	paulmck, rostedt, akpm, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, perry.yuan, sandipan.das, kai.huang, xiaoyao.li,
	seanjc, jithu.joseph, brijesh.singh, xin3.li, ebiggers,
	andrew.cooper3, mario.limonciello, james.morse, tan.shaopeng,
	tony.luck, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Babu,

On 12/2/24 1:28 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 12/2/24 15:09, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 12/2/24 12:42 PM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 12/2/24 14:15, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 12/2/24 11:48 AM, Moger, Babu wrote:
>>>>> On 12/2/24 12:33, Reinette Chatre wrote:
>>>>>> On 11/29/24 9:06 AM, Moger, Babu wrote:
>>>>>>> On 11/29/2024 3:59 AM, Peter Newman wrote:
>>>>>>>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>>>>>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>>>>>>>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>
>>>>>>>>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>>>>>>>
>>>>>>>>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>>>>>>>>      For example:
>>>>>>>>>>>>      Resctrl group mon1
>>>>>>>>>>>>       Total event
>>>>>>>>>>>>       dom 0 cntr_id 1,
>>>>>>>>>>>>       dom 1 cntr_id 10
>>>>>>>>>>>>       dom 2 cntr_id 11
>>>>>>>>>>>>
>>>>>>>>>>>>      Local event
>>>>>>>>>>>>       dom 0 cntr_id 2,
>>>>>>>>>>>>       dom 1 cntr_id 15
>>>>>>>>>>>>       dom 2 cntr_id 10
>>>>>>>>>>>
>>>>>>>>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>>>>>>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>>>>>>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>>>>>>>>> group:
>>>>>>>>>>>           struct cntr_id {
>>>>>>>>>>>                   u32     mbm_total;
>>>>>>>>>>>                   u32     mbm_local;
>>>>>>>>>>>           }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>>>>>>>>> current implementation deals with domains that come and go. I do not think
>>>>>>>>>>> this is currently handled. For example, if a new domain comes online and
>>>>>>>>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>>>>>>>>> not configured to the newly online domain.
>>>>>>>>>
>>>>>>>>> I am trying to understand the details of your approach here.
>>>>>>>>>>
>>>>>>>>>> In my prototype, I allocated a counter id-indexed array to each
>>>>>>>>>> monitoring domain structure for tracking the counter allocations,
>>>>>>>>>> because the hardware counters are all domain-scoped. That way the
>>>>>>>>>> tracking data goes away when the hardware does.
>>>>>>>>>>
>>>>>>>>>> I was focused on allowing all pending counter updates to a domain
>>>>>>>>>> resulting from a single mbm_assign_control write to be batched and
>>>>>>>>>> processed in a single IPI, so I structured the counter tracker
>>>>>>>>>> something like this:
>>>>>>>>>
>>>>>>>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>>>>>>>
>>>>>>>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>
>>>>>>>>> This is still a single write. Two IPIs are sent separately, one for each
>>>>>>>>> domain.
>>>>>>>>>
>>>>>>>>> Are you doing something different?
>>>>>>>>
>>>>>>>> I said "all pending counter updates to a domain", whereby I meant
>>>>>>>> targeting a single domain.
>>>>>>>>
>>>>>>>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>>>>>>>
>>>>>>>> What is important is that the following write also requires 1 or 2 IPIs:
>>>>>>>>
>>>>>>>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>>>>>>>> for readability)
>>>>>>>>
>>>>>>>> echo $'//0=t;1=t\n
>>>>>>>> /g1/0=t;1=t\n
>>>>>>>> /g2/0=t;1=t\n
>>>>>>>> /g3/0=t;1=t\n
>>>>>>>> /g4/0=t;1=t\n
>>>>>>>> /g5/0=t;1=t\n
>>>>>>>> /g6/0=t;1=t\n
>>>>>>>> /g7/0=t;1=t\n
>>>>>>>> /g8/0=t;1=t\n
>>>>>>>> /g9/0=t;1=t\n
>>>>>>>> /g10/0=t;1=t\n
>>>>>>>> /g11/0=t;1=t\n
>>>>>>>> /g12/0=t;1=t\n
>>>>>>>> /g13/0=t;1=t\n
>>>>>>>> /g14/0=t;1=t\n
>>>>>>>> /g15/0=t;1=t\n
>>>>>>>> /g16/0=t;1=t\n
>>>>>>>> /g17/0=t;1=t\n
>>>>>>>> /g18/0=t;1=t\n
>>>>>>>> /g19/0=t;1=t\n
>>>>>>>> /g20/0=t;1=t\n
>>>>>>>> /g21/0=t;1=t\n
>>>>>>>> /g22/0=t;1=t\n
>>>>>>>> /g23/0=t;1=t\n
>>>>>>>> /g24/0=t;1=t\n
>>>>>>>> /g25/0=t;1=t\n
>>>>>>>> /g26/0=t;1=t\n
>>>>>>>> /g27/0=t;1=t\n
>>>>>>>> /g28/0=t;1=t\n
>>>>>>>> /g29/0=t;1=t\n
>>>>>>>> /g30/0=t;1=t\n
>>>>>>>> /g31/0=t;1=t\n'
>>>>>>>>
>>>>>>>> My ultimate goal is for a thread bound to a particular domain to be
>>>>>>>> able to unassign and reassign the local domain's 32 counters in a
>>>>>>>> single write() with no IPIs at all. And when IPIs are required, then
>>>>>>>> no more than one per domain, regardless of the number of groups
>>>>>>>> updated.
>>>>>>>>
>>>>>>>
>>>>>>> Yes. I think I got the idea. Thanks.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> struct resctrl_monitor_cfg {
>>>>>>>>>>       int closid;
>>>>>>>>>>       int rmid;
>>>>>>>>>>       int evtid;
>>>>>>>>>>       bool dirty;
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> This mirrors the info needed in whatever register configures the
>>>>>>>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>>>>>>>> updated.
>>>>>>>>>
>>>>>>>>> This is what my understanding of your implementation.
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>>>>>> index d94abba1c716..9cebf065cc97 100644
>>>>>>>>> --- a/include/linux/resctrl.h
>>>>>>>>> +++ b/include/linux/resctrl.h
>>>>>>>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>>>>>>>           u32                             *mbps_val;
>>>>>>>>>    };
>>>>>>>>>
>>>>>>>>> +struct resctrl_monitor_cfg {
>>>>>>>>> +    int closid;
>>>>>>>>> +    int rmid;
>>>>>>>>> +    int evtid;
>>>>>>>>> +    bool dirty;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>>    /**
>>>>>>>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>>>>>>>> resource
>>>>>>>>>     * @hdr:               common header for different domain types
>>>>>>>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>>>>>>>           struct delayed_work             cqm_limbo;
>>>>>>>>>           int                             mbm_work_cpu;
>>>>>>>>>           int                             cqm_work_cpu;
>>>>>>>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>>>>>>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>>>>>>>    };
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> When a user requests an assignment for total event to the default group
>>>>>>>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>>>>>>>> entry.
>>>>>>>>>
>>>>>>>>> If there is an empty entry, then use that entry for assignment and
>>>>>>>>> update closid, rmid, evtid and dirty = 1. We can get all these
>>>>>>>>> information from default group here.
>>>>>>>>>
>>>>>>>>> Does this make sense?
>>>>>>>>
>>>>>>>> Yes, sounds correct.
>>>>>>>
>>>>>>> I will probably add cntr_id in resctrl_monitor_cfg structure and
>>>>>>> initialize during the allocation. And rename the field 'dirty' to
>>>>>>> 'active'(or something similar) to hold the assign state for that
>>>>>>> entry. That way we have all the information required for assignment
>>>>>>> at one place. We don't need to update the rdtgroup structure.
>>>>>>>
>>>>>>> Reinette, What do you think about this approach?
>>>>>>
>>>>>> I think this approach is in the right direction. Thanks to Peter for
>>>>>> the guidance here.
>>>>>> I do not think that it is necessary to add cntr_id to resctrl_monitor_cfg
>>>>>> though, I think the cntr_id would be the index to the array instead?
>>>>>
>>>>> Yes. I think we can use the index as cntr_id. Will let you know otherwise.
>>>>>
>>>>>
>>>>>>
>>>>>> It may also be worthwhile to consider using a pointer to the resource
>>>>>> group instead of storing closid and rmid directly. If used to indicate
>>>>>> initialization then an initialized pointer is easier to distinguish than
>>>>>> the closid/rmid that may have zero as valid values.
>>>>>
>>>>> Sure. Sounds good.
>>>>>
>>>>>>
>>>>>> I expect evtid will be enum resctrl_event_id and that raises the question
>>>>>> of whether "0" can indeed be used as an "uninitialized" value since doing
>>>>>> so would change the meaning of the enum. It may indeed keep things
>>>>>> separated by maintaining evtid as an enum resctrl_event_id and note the
>>>>>> initialization differently ... either via a pointer to a resource group
>>>>>> or entirely separately as Babu indicates later.
>>>>>
>>>>> Sure. Will add evtid as enum resctrl_event_id and use the "state" to
>>>>> indicate assign/unassign/dirty status.
>>>>
>>>> Is "assign/unassign" state needed? If resctrl_monitor_cfg contains a pointer
>>>> to the resource group to which the counter has been assigned then I expect NULL
>>>> means unassigned and a value means assigned?
>>>
>>> Yes. We use the rdtgroup pointer to check the assign/unassign state.
>>>
>>> I will drop the 'state' field. Peter can add state when he wants to use it
>>> for optimization later.
>>>
>>> I think we need to have the 'cntr_id' field here in resctrl_monitor_cfg.
>>> When we access the pointer from mbm_state, we won't know which cntr_id
>>> index it came from.
>>>
>>
>> oh, good point. I wonder how Peter addressed this in his PoC. As an alternative,
>> could the cntr_id be used in mbm_state instead of a pointer? 
>>
> 
> Yes. It can be done.
> 
> I thought it would be better to have everything in one place.
> 
> struct resctrl_monitor_cfg {
>   unsigned int            cntr_id;
>   enum resctrl_event_id   evtid;
>   struct rdtgroup         *rdtgrp;
> };
> 
> This will have everything required to assign/unassign the event.
> 

The "everything in one place" argument is not clear to me since the cntr_id
is already present as the index into the array that stores this structure.
Including the cntr_id seems redundant to me. This is similar to several
other data structures in resctrl that are indexed by either closid or rmid
without needing to store the closid/rmid in the data structures themselves.
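
For illustration, a minimal userspace sketch of the index-as-cntr_id idea
(not kernel code). The resctrl_monitor_cfg layout follows the proposal quoted
above without the cntr_id field; the simplified struct rdtgroup, struct
mon_domain, and the helpers mon_cfg_find_free() and mon_cfg_to_cntr_id() are
hypothetical names used only for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Event IDs as used by resctrl for L3 monitoring events. */
enum resctrl_event_id {
	QOS_L3_OCCUP_EVENT_ID     = 0x01,
	QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
	QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
};

/* Simplified stand-in for the kernel's struct rdtgroup (assumption). */
struct rdtgroup {
	int closid;
	int rmid;
};

/* One entry per hardware counter; the array index is the counter ID. */
struct resctrl_monitor_cfg {
	enum resctrl_event_id evtid;
	struct rdtgroup *rdtgrp;	/* NULL means the counter is unassigned */
};

/* Simplified stand-in for struct rdt_mon_domain (assumption). */
struct mon_domain {
	unsigned int num_mbm_cntrs;
	struct resctrl_monitor_cfg *mon_cfg;	/* num_mbm_cntrs entries */
};

/* Return the first free counter ID in the domain, or -1 if none is free. */
static int mon_cfg_find_free(const struct mon_domain *d)
{
	for (unsigned int i = 0; i < d->num_mbm_cntrs; i++)
		if (!d->mon_cfg[i].rdtgrp)
			return (int)i;
	return -1;
}

/* Recover the counter ID of an entry from its position in the array. */
static unsigned int mon_cfg_to_cntr_id(const struct mon_domain *d,
				       const struct resctrl_monitor_cfg *cfg)
{
	ptrdiff_t idx = cfg - d->mon_cfg;

	assert(idx >= 0 && (size_t)idx < d->num_mbm_cntrs);
	return (unsigned int)idx;
}
```

With this layout a NULL rdtgrp marks a free counter, and the counter ID of
any entry can be recovered from its address, so storing cntr_id in the entry
adds no information.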

Reinette



* Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters
  2024-12-02 21:47                                         ` Reinette Chatre
@ 2024-12-02 22:06                                           ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-12-02 22:06 UTC (permalink / raw)
  To: Reinette Chatre, Peter Newman
  Cc: corbet, tglx, mingo, bp, dave.hansen, fenghua.yu, x86, hpa, thuth,
	paulmck, rostedt, akpm, xiongwei.song, pawan.kumar.gupta,
	daniel.sneddon, perry.yuan, sandipan.das, kai.huang, xiaoyao.li,
	seanjc, jithu.joseph, brijesh.singh, xin3.li, ebiggers,
	andrew.cooper3, mario.limonciello, james.morse, tan.shaopeng,
	tony.luck, linux-doc, linux-kernel, maciej.wieczor-retman,
	eranian, jpoimboe, thomas.lendacky

Hi Reinette,

On 12/2/24 15:47, Reinette Chatre wrote:
> Hi Babu,
> 
> On 12/2/24 1:28 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 12/2/24 15:09, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 12/2/24 12:42 PM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 12/2/24 14:15, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 12/2/24 11:48 AM, Moger, Babu wrote:
>>>>>> On 12/2/24 12:33, Reinette Chatre wrote:
>>>>>>> On 11/29/24 9:06 AM, Moger, Babu wrote:
>>>>>>>> On 11/29/2024 3:59 AM, Peter Newman wrote:
>>>>>>>>> On Thu, Nov 28, 2024 at 8:35 PM Moger, Babu <bmoger@amd.com> wrote:
>>>>>>>>>> On 11/28/2024 5:10 AM, Peter Newman wrote:
>>>>>>>>>>> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>>>>>>>>
>>>>>>>>>>>>> 1. Each group needs to remember counter ids in each domain for each event.
>>>>>>>>>>>>>      For example:
>>>>>>>>>>>>>      Resctrl group mon1
>>>>>>>>>>>>>       Total event
>>>>>>>>>>>>>       dom 0 cntr_id 1,
>>>>>>>>>>>>>       dom 1 cntr_id 10
>>>>>>>>>>>>>       dom 2 cntr_id 11
>>>>>>>>>>>>>
>>>>>>>>>>>>>      Local event
>>>>>>>>>>>>>       dom 0 cntr_id 2,
>>>>>>>>>>>>>       dom 1 cntr_id 15
>>>>>>>>>>>>>       dom 2 cntr_id 10
>>>>>>>>>>>>
>>>>>>>>>>>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>>>>>>>>>>>> static array. As an alternative it can be an xarray indexed by the domain ID with
>>>>>>>>>>>> pointers to a struct like below to contain the counters associated with the monitor
>>>>>>>>>>>> group:
>>>>>>>>>>>>           struct cntr_id {
>>>>>>>>>>>>                   u32     mbm_total;
>>>>>>>>>>>>                   u32     mbm_local;
>>>>>>>>>>>>           }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking more about how this array needs to be managed made me wonder how the
>>>>>>>>>>>> current implementation deals with domains that come and go. I do not think
>>>>>>>>>>>> this is currently handled. For example, if a new domain comes online and
>>>>>>>>>>>> monitoring groups had counters dynamically assigned, then these counters are
>>>>>>>>>>>> not configured to the newly online domain.
>>>>>>>>>>
>>>>>>>>>> I am trying to understand the details of your approach here.
>>>>>>>>>>>
>>>>>>>>>>> In my prototype, I allocated a counter id-indexed array to each
>>>>>>>>>>> monitoring domain structure for tracking the counter allocations,
>>>>>>>>>>> because the hardware counters are all domain-scoped. That way the
>>>>>>>>>>> tracking data goes away when the hardware does.
>>>>>>>>>>>
>>>>>>>>>>> I was focused on allowing all pending counter updates to a domain
>>>>>>>>>>> resulting from a single mbm_assign_control write to be batched and
>>>>>>>>>>> processed in a single IPI, so I structured the counter tracker
>>>>>>>>>>> something like this:
>>>>>>>>>>
>>>>>>>>>> Not sure what you meant here. How are you batching two IPIs for two domains?
>>>>>>>>>>
>>>>>>>>>> #echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>
>>>>>>>>>> This is still a single write. Two IPIs are sent separately, one for each
>>>>>>>>>> domain.
>>>>>>>>>>
>>>>>>>>>> Are you doing something different?
>>>>>>>>>
>>>>>>>>> I said "all pending counter updates to a domain", whereby I meant
>>>>>>>>> targeting a single domain.
>>>>>>>>>
>>>>>>>>> Depending on the CPU of the caller, your example write requires 1 or 2 IPIs.
>>>>>>>>>
>>>>>>>>> What is important is that the following write also requires 1 or 2 IPIs:
>>>>>>>>>
>>>>>>>>> (assuming /sys/fs/resctrl/mon_groups/[g1-g31] exist, line breaks added
>>>>>>>>> for readability)
>>>>>>>>>
>>>>>>>>> echo $'//0=t;1=t\n
>>>>>>>>> /g1/0=t;1=t\n
>>>>>>>>> /g2/0=t;1=t\n
>>>>>>>>> /g3/0=t;1=t\n
>>>>>>>>> /g4/0=t;1=t\n
>>>>>>>>> /g5/0=t;1=t\n
>>>>>>>>> /g6/0=t;1=t\n
>>>>>>>>> /g7/0=t;1=t\n
>>>>>>>>> /g8/0=t;1=t\n
>>>>>>>>> /g9/0=t;1=t\n
>>>>>>>>> /g10/0=t;1=t\n
>>>>>>>>> /g11/0=t;1=t\n
>>>>>>>>> /g12/0=t;1=t\n
>>>>>>>>> /g13/0=t;1=t\n
>>>>>>>>> /g14/0=t;1=t\n
>>>>>>>>> /g15/0=t;1=t\n
>>>>>>>>> /g16/0=t;1=t\n
>>>>>>>>> /g17/0=t;1=t\n
>>>>>>>>> /g18/0=t;1=t\n
>>>>>>>>> /g19/0=t;1=t\n
>>>>>>>>> /g20/0=t;1=t\n
>>>>>>>>> /g21/0=t;1=t\n
>>>>>>>>> /g22/0=t;1=t\n
>>>>>>>>> /g23/0=t;1=t\n
>>>>>>>>> /g24/0=t;1=t\n
>>>>>>>>> /g25/0=t;1=t\n
>>>>>>>>> /g26/0=t;1=t\n
>>>>>>>>> /g27/0=t;1=t\n
>>>>>>>>> /g28/0=t;1=t\n
>>>>>>>>> /g29/0=t;1=t\n
>>>>>>>>> /g30/0=t;1=t\n
>>>>>>>>> /g31/0=t;1=t\n'
>>>>>>>>>
>>>>>>>>> My ultimate goal is for a thread bound to a particular domain to be
>>>>>>>>> able to unassign and reassign the local domain's 32 counters in a
>>>>>>>>> single write() with no IPIs at all. And when IPIs are required, then
>>>>>>>>> no more than one per domain, regardless of the number of groups
>>>>>>>>> updated.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes. I think I got the idea. Thanks.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> struct resctrl_monitor_cfg {
>>>>>>>>>>>       int closid;
>>>>>>>>>>>       int rmid;
>>>>>>>>>>>       int evtid;
>>>>>>>>>>>       bool dirty;
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> This mirrors the info needed in whatever register configures the
>>>>>>>>>>> counter, plus a dirty flag to skip over the ones that don't need to be
>>>>>>>>>>> updated.
>>>>>>>>>>
>>>>>>>>>> This is what my understanding of your implementation.
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>>>>>>> index d94abba1c716..9cebf065cc97 100644
>>>>>>>>>> --- a/include/linux/resctrl.h
>>>>>>>>>> +++ b/include/linux/resctrl.h
>>>>>>>>>> @@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
>>>>>>>>>>           u32                             *mbps_val;
>>>>>>>>>>    };
>>>>>>>>>>
>>>>>>>>>> +struct resctrl_monitor_cfg {
>>>>>>>>>> +    int closid;
>>>>>>>>>> +    int rmid;
>>>>>>>>>> +    int evtid;
>>>>>>>>>> +    bool dirty;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>>    /**
>>>>>>>>>>     * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
>>>>>>>>>> resource
>>>>>>>>>>     * @hdr:               common header for different domain types
>>>>>>>>>> @@ -116,6 +123,7 @@ struct rdt_mon_domain {
>>>>>>>>>>           struct delayed_work             cqm_limbo;
>>>>>>>>>>           int                             mbm_work_cpu;
>>>>>>>>>>           int                             cqm_work_cpu;
>>>>>>>>>> +     /* Allocate num_mbm_cntrs entries in each domain */
>>>>>>>>>> +       struct resctrl_monitor_cfg      *mon_cfg;
>>>>>>>>>>    };
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> When a user requests an assignment for total event to the default group
>>>>>>>>>> for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
>>>>>>>>>> entry.
>>>>>>>>>>
>>>>>>>>>> If there is an empty entry, then use that entry for assignment and
>>>>>>>>>> update closid, rmid, evtid and dirty = 1. We can get all these
>>>>>>>>>> information from default group here.
>>>>>>>>>>
>>>>>>>>>> Does this make sense?
>>>>>>>>>
>>>>>>>>> Yes, sounds correct.
>>>>>>>>
>>>>>>>> I will probably add cntr_id in resctrl_monitor_cfg structure and
>>>>>>>> initialize during the allocation. And rename the field 'dirty' to
>>>>>>>> 'active'(or something similar) to hold the assign state for that
>>>>>>>> entry. That way we have all the information required for assignment
>>>>>>>> at one place. We don't need to update the rdtgroup structure.
>>>>>>>>
>>>>>>>> Reinette, What do you think about this approach?
>>>>>>>
>>>>>>> I think this approach is in the right direction. Thanks to Peter for
>>>>>>> the guidance here.
>>>>>>> I do not think that it is necessary to add cntr_id to resctrl_monitor_cfg
>>>>>>> though, I think the cntr_id would be the index to the array instead?
>>>>>>
>>>>>> Yes. I think we can use the index as cntr_id. Will let you know otherwise.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> It may also be worthwhile to consider using a pointer to the resource
>>>>>>> group instead of storing closid and rmid directly. If used to indicate
>>>>>>> initialization then an initialized pointer is easier to distinguish than
>>>>>>> the closid/rmid that may have zero as valid values.
>>>>>>
>>>>>> Sure. Sounds good.
>>>>>>
>>>>>>>
>>>>>>> I expect evtid will be enum resctrl_event_id and that raises the question
>>>>>>> of whether "0" can indeed be used as an "uninitialized" value since doing
>>>>>>> so would change the meaning of the enum. It may indeed keep things
>>>>>>> separated by maintaining evtid as an enum resctrl_event_id and note the
>>>>>>> initialization differently ... either via a pointer to a resource group
>>>>>>> or entirely separately as Babu indicates later.
>>>>>>
>>>>>> Sure. Will add evtid as enum resctrl_event_id and use the "state" to
>>>>>> indicate assign/unassign/dirty status.
>>>>>
>>>>> Is "assign/unassign" state needed? If resctrl_monitor_cfg contains a pointer
>>>>> to the resource group to which the counter has been assigned then I expect NULL
>>>>> means unassigned and a value means assigned?
>>>>
>>>> Yes. We use the rdtgroup pointer to check the assign/unassign state.
>>>>
>>>> I will drop the 'state' field. Peter can add state when he wants to use it
>>>> for optimization later.
>>>>
>>>> I think we need to have the 'cntr_id' field here in resctrl_monitor_cfg.
>>>> When we access the pointer from mbm_state, we won't know which cntr_id
>>>> index it came from.
>>>>
>>>
>>> oh, good point. I wonder how Peter addressed this in his PoC. As an alternative,
>>> could the cntr_id be used in mbm_state instead of a pointer? 
>>>
>>
>> Yes. It can be done.
>>
>> I thought it would be better to have everything in one place.
>>
>> struct resctrl_monitor_cfg {
>>   unsigned int            cntr_id;
>>   enum resctrl_event_id   evtid;
>>   struct rdtgroup         *rdtgrp;
>> };
>>
>> This will have everything required to assign/unassign the event.
>>
> 
> The "everything in one place" argument is not clear to me since the cntr_id
> is already present as the index into the array that stores this structure.
> Including the cntr_id seems redundant to me. This is similar to several
> other data structures in resctrl that are indexed by either closid or rmid
> without needing to store the closid/rmid in the data structures themselves.
> 

OK, that is fine. I will remove the cntr_id field from resctrl_monitor_cfg
and add it to mbm_state instead. That should be good.
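
A rough sketch of that arrangement, assuming mbm_state keeps only the counter
index and the per-domain array stays indexed by cntr_id. This is userspace
illustration code; struct mbm_state_sketch, struct mon_domain_sketch,
CNTR_ID_NONE and mbm_state_to_cfg() are made-up names, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

#define CNTR_ID_NONE	(-1)	/* hypothetical "no counter assigned" marker */

/* Simplified stand-in for the kernel's struct rdtgroup (assumption). */
struct rdtgroup {
	int closid;
	int rmid;
};

/* Per-counter entry; its array index is the counter ID. */
struct resctrl_monitor_cfg {
	int evtid;			/* enum resctrl_event_id in the kernel */
	struct rdtgroup *rdtgrp;	/* NULL means unassigned */
};

/* Only the field relevant here; the real mbm_state holds more. */
struct mbm_state_sketch {
	int cntr_id;			/* index into mon_cfg, or CNTR_ID_NONE */
};

/* Simplified stand-in for struct rdt_mon_domain (assumption). */
struct mon_domain_sketch {
	unsigned int num_mbm_cntrs;
	struct resctrl_monitor_cfg *mon_cfg;
};

/* Look up the counter entry for an MBM state, or NULL if none is assigned. */
static struct resctrl_monitor_cfg *
mbm_state_to_cfg(const struct mon_domain_sketch *d,
		 const struct mbm_state_sketch *m)
{
	assert(d && m && d->mon_cfg);

	if (m->cntr_id < 0 || (unsigned int)m->cntr_id >= d->num_mbm_cntrs)
		return NULL;
	return &d->mon_cfg[m->cntr_id];
}
```

Going from mbm_state to the counter entry is then a bounds-checked array
lookup, and the entry itself does not need to carry its own index.
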
-- 
Thanks
Babu Moger


* Re: [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment
  2024-10-29 23:21 ` [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment Babu Moger
  2024-11-16  0:57   ` Reinette Chatre
@ 2024-12-04  4:16   ` Fenghua Yu
  1 sibling, 0 replies; 115+ messages in thread
From: Fenghua Yu @ 2024-12-04  4:16 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, vikas.shivappa, linux-doc, linux-kernel,
	peternewman, maciej.wieczor-retman, eranian, jpoimboe,
	thomas.lendacky

Hi, Babu,

On 10/29/24 16:21, Babu Moger wrote:
> The mbm_cntr_assign mode offers several hardware counters that can be
> assigned to an RMID, event pair and monitor the bandwidth as long as it
> is assigned.
> 
> Counters are managed at two levels. The global assignment is tracked
> using the mbm_cntr_free_map field in the struct resctrl_mon, while
> domain-specific assignments are tracked using the mbm_cntr_map field
> in the struct rdt_mon_domain. Allocation begins at the global level
> and is then applied individually to each domain.
> 
> Introduce an interface to allocate these counters and update the
> corresponding domains accordingly.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: Introduced new function resctrl_config_cntr to assign the counter, update
>      the bitmap and reset the architectural state.
>      Taken care of error handling(freeing the counter) when assignment fails.
>      Moved mbm_cntr_assigned_to_domain here as it is used in this patch.
>      Minor text changes.
> 
> v8: Renamed rdtgroup_assign_cntr() to rdtgroup_assign_cntr_event().
>      Added the code to return the error if rdtgroup_assign_cntr_event fails.
>      Moved definition of MBM_EVENT_ARRAY_INDEX to resctrl/internal.h.
>      Updated typo in the comments.
> 
> v7: New patch. Moved all the FS code here.
>      Merged rdtgroup_assign_cntr and rdtgroup_alloc_cntr.
>      Added new #define MBM_EVENT_ARRAY_INDEX.
> ---
>   arch/x86/kernel/cpu/resctrl/internal.h |  2 +
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 87 ++++++++++++++++++++++++++
>   2 files changed, 89 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 00f7bf60e16a..cb496bd97007 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -717,6 +717,8 @@ unsigned int mon_event_config_index_get(u32 evtid);
>   int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>   			     enum resctrl_event_id evtid, u32 rmid, u32 closid,
>   			     u32 cntr_id, bool assign);
> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
> +			       struct rdt_mon_domain *d, enum resctrl_event_id evtid);
>   void rdt_staged_configs_clear(void);
>   bool closid_allocated(unsigned int closid);
>   int resctrl_find_cleanest_closid(void);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 1b5529c212f5..bc3752967c44 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1924,6 +1924,93 @@ int resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
>   	return 0;
>   }
>   
> +/*
> + * Configure the counter for the event, RMID pair for the domain.
> + * Update the bitmap and reset the architectural state.
> + */
> +static int resctrl_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
> +			       enum resctrl_event_id evtid, u32 rmid, u32 closid,
> +			       u32 cntr_id, bool assign)
> +{
> +	int ret;
> +
> +	ret = resctrl_arch_config_cntr(r, d, evtid, rmid, closid, cntr_id, assign);
> +	if (ret)
> +		return ret;
> +
> +	if (assign)
> +		__set_bit(cntr_id, d->mbm_cntr_map);
> +	else
> +		__clear_bit(cntr_id, d->mbm_cntr_map);
> +
> +	/*
> +	 * Reset the architectural state so that reading of hardware
> +	 * counter is not considered as an overflow in next update.
> +	 */
> +	resctrl_arch_reset_rmid(r, d, closid, rmid, evtid);
> +
> +	return ret;
> +}
> +
> +static bool mbm_cntr_assigned_to_domain(struct rdt_resource *r, u32 cntr_id)
> +{
> +	struct rdt_mon_domain *d;
> +
> +	list_for_each_entry(d, &r->mon_domains, hdr.list)
> +		if (test_bit(cntr_id, d->mbm_cntr_map))
> +			return 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * Assign a hardware counter to event @evtid of group @rdtgrp.
> + * Counter will be assigned to all the domains if rdt_mon_domain is NULL
> + * else the counter will be assigned to specific domain.
> + */
> +int rdtgroup_assign_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
> +			       struct rdt_mon_domain *d, enum resctrl_event_id evtid)
> +{
> +	int index = MBM_EVENT_ARRAY_INDEX(evtid);
> +	int cntr_id = rdtgrp->mon.cntr_id[index];
> +	int ret;
> +
> +	/*
> +	 * Allocate a new counter id to the event if the counter is not
> +	 * assigned already.
> +	 */
> +	if (cntr_id == MON_CNTR_UNSET) {
> +		cntr_id = mbm_cntr_alloc(r);
> +		if (cntr_id < 0) {
> +			rdt_last_cmd_puts("Out of MBM assignable counters\n");
> +			return -ENOSPC;
> +		}
> +		rdtgrp->mon.cntr_id[index] = cntr_id;
> +	}
> +
> +	if (!d) {

Should this assert that the CPUs are locked here?

         /* Walking r->mon_domains, ensure it can't race with cpuhp */
         lockdep_assert_cpus_held();

Please see more comments on patch #20.

> +		list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +			ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
> +						  rdtgrp->closid, cntr_id, true);
> +			if (ret)
> +				goto out_done_assign;
> +		}
> +	} else {
> +		ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
> +					  rdtgrp->closid, cntr_id, true);
> +		if (ret)
> +			goto out_done_assign;
> +	}
> +
> +out_done_assign:
> +	if (ret && !mbm_cntr_assigned_to_domain(r, cntr_id)) {
> +		mbm_cntr_free(r, cntr_id);
> +		rdtgroup_cntr_id_init(rdtgrp, evtid);
> +	}
> +
> +	return ret;
> +}
> +
>   /* rdtgroup information files for one cache resource. */
>   static struct rftype res_common_files[] = {
>   	{

Thanks.

-Fenghua


* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-10-29 23:21 ` [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
  2024-11-18 17:18   ` Reinette Chatre
@ 2024-12-04  4:16   ` Fenghua Yu
  2024-12-04 17:03     ` Reinette Chatre
  2024-12-04 17:14     ` Moger, Babu
  1 sibling, 2 replies; 115+ messages in thread
From: Fenghua Yu @ 2024-12-04  4:16 UTC (permalink / raw)
  To: Babu Moger, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, vikas.shivappa, linux-doc, linux-kernel,
	peternewman, maciej.wieczor-retman, eranian, jpoimboe,
	thomas.lendacky

Hi, Babu,

On 10/29/24 16:21, Babu Moger wrote:
> Assign/unassign counters on resctrl group creation/deletion. Two counters
> are required per group, one for MBM total event and one for MBM local
> event.
> 
> There are a limited number of counters available for assignment. If these
> counters are exhausted, the kernel will display the error message: "Out of
> MBM assignable counters". However, it is not necessary to fail the
> creation of a group due to assignment failures. Users have the flexibility
> to modify the assignments at a later time.
> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
>      Updated a couple of rdtgroup_unassign_cntrs() calls properly.
>      Updated function comments.
> 
> v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
>      Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
>      Fixed the problem with unassigning the child MON groups of CTRL_MON group.
> 
> v7: Reworded the commit message.
>      Removed the reference of ABMC with mbm_cntr_assign.
>      Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.
> 
> v6: Removed the redundant comments on all the calls of
>      rdtgroup_assign_cntrs. Updated the commit message.
>      Dropped printing error message on every call of rdtgroup_assign_cntrs.
> 
> v5: Removed the code to enable/disable ABMC during the mount.
>      That will be another patch.
>      Added arch callers to get the arch specific data.
>      Renamed fuctions to match the other abmc function.
>      Added code comments for assignment failures.
> 
> v4: Few name changes based on the upstream discussion.
>      Commit message update.
> 
> v3: This is a new patch. Patch addresses the upstream comment to enable
>      ABMC feature by default if the feature is available.
> ---
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++++++-
>   1 file changed, 60 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b0cce3dfd062..a8d21b0b2054 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2932,6 +2932,46 @@ static void schemata_list_destroy(void)
>   	}
>   }
>   
> +/*
> + * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
> + * counters are automatically assigned. Each group can accommodate two counters:
> + * one for the total event and one for the local event. Assignments may fail
> + * due to the limited number of counters. However, it is not necessary to fail
> + * the group creation and thus no failure is returned. Users have the option
> + * to modify the counter assignments after the group has been created.
> + */
> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
> +		return;
> +
> +	if (is_mbm_total_enabled())
> +		rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);

In this code path,
rdtgroup_mkdir()->rdtgroup_mkdir_ctrl_mon()->rdtgroup_assign_cntrs()->rdtgroup_assign_cntr_event()

CPUs are not protected by the hotplug read lock while
rdtgroup_assign_cntr_event() walks r->mon_domains and runs the
counter-assignment code on CPUs in those domains. Without that
protection, the walk of r->mon_domains may race with CPU hotplug.

In another path (e.g. rdt_get_tree()), rdtgroup_assign_cntrs() is
protected by cpus_read_lock()/cpus_read_unlock().

So maybe define two helpers:

/* Called when the caller already holds cpus_read_lock(). */
static void rdtgroup_assign_cntrs_locked(struct rdtgroup *rdtgrp)
{
	lockdep_assert_cpus_held();

	/* ... the current rdtgroup_assign_cntrs() code ... */
}

/* Called when the caller does not hold cpus_read_lock(). */
static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
{
	cpus_read_lock();
	rdtgroup_assign_cntrs_locked(rdtgrp);
	cpus_read_unlock();
}
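
The split between a "_locked" variant and a lock-taking wrapper can be
tried out in userspace. What follows is only a single-threaded model, not
kernel code: a depth counter stands in for the CPU hotplug lock, assert()
stands in for lockdep_assert_cpus_held(), and every model_* name is made up
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for the kernel primitives; NOT the kernel API. */
static int model_cpus_lock_depth;	/* > 0 means the "hotplug lock" is held */
static int model_assign_calls;		/* how many times the body ran */

static void model_cpus_read_lock(void)
{
	model_cpus_lock_depth++;
}

static void model_cpus_read_unlock(void)
{
	assert(model_cpus_lock_depth > 0);
	model_cpus_lock_depth--;
}

static bool model_cpus_held(void)
{
	return model_cpus_lock_depth > 0;
}

/* Caller must already hold the lock (plays the role of the lockdep assert). */
static void model_assign_cntrs_locked(void)
{
	assert(model_cpus_held());
	model_assign_calls++;	/* stands in for walking r->mon_domains */
}

/* Caller does not hold the lock; take it around the locked variant. */
static void model_assign_cntrs(void)
{
	model_cpus_read_lock();
	model_assign_cntrs_locked();
	model_cpus_read_unlock();
}
```

A caller that already holds the lock uses model_assign_cntrs_locked()
directly; calling it without the lock trips the assert, which is the point
of having the assertion in the locked variant.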

> +
> +	if (is_mbm_local_enabled())
> +		rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
> +/*
> + * Called when a group is deleted. Counters are unassigned if it was in
> + * assigned state.
> + */
> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (!resctrl_arch_mbm_cntr_assign_enabled(r))
> +		return;
> +
> +	if (is_mbm_total_enabled())
> +		rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);
> +
> +	if (is_mbm_local_enabled())
> +		rdtgroup_unassign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +

It seems rdtgroup_unassign_cntrs() is always called with
cpus_read_lock() held, so it is fine.

>   static int rdt_get_tree(struct fs_context *fc)
>   {
>   	struct rdt_fs_context *ctx = rdt_fc2context(fc);
> @@ -2991,6 +3031,8 @@ static int rdt_get_tree(struct fs_context *fc)
>   		if (ret < 0)
>   			goto out_mongrp;
>   		rdtgroup_default.mon.mon_data_kn = kn_mondata;
> +
> +		rdtgroup_assign_cntrs(&rdtgroup_default);

In this case, cpus_read_lock() was called earlier. Change to 
rdtgroup_assign_cntrs_locked().

>   	}
>   
>   	ret = rdt_pseudo_lock_init();
> @@ -3021,8 +3063,10 @@ static int rdt_get_tree(struct fs_context *fc)
>   out_psl:
>   	rdt_pseudo_lock_release();
>   out_mondata:
> -	if (resctrl_arch_mon_capable())
> +	if (resctrl_arch_mon_capable()) {
> +		rdtgroup_unassign_cntrs(&rdtgroup_default);
>   		kernfs_remove(kn_mondata);
> +	}
>   out_mongrp:
>   	if (resctrl_arch_mon_capable())
>   		kernfs_remove(kn_mongrp);
> @@ -3201,6 +3245,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
>   
>   	head = &rdtgrp->mon.crdtgrp_list;
>   	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
> +		rdtgroup_unassign_cntrs(sentry);
>   		free_rmid(sentry->closid, sentry->mon.rmid);
>   		list_del(&sentry->mon.crdtgrp_list);
>   
> @@ -3241,6 +3286,8 @@ static void rmdir_all_sub(void)
>   		cpumask_or(&rdtgroup_default.cpu_mask,
>   			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>   
> +		rdtgroup_unassign_cntrs(rdtgrp);
> +
>   		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>   
>   		kernfs_remove(rdtgrp->kn);
> @@ -3272,6 +3319,7 @@ static void rdt_kill_sb(struct super_block *sb)
>   	for_each_alloc_capable_rdt_resource(r)
>   		reset_all_ctrls(r);
>   	rmdir_all_sub();
> +	rdtgroup_unassign_cntrs(&rdtgroup_default);
>   	rdt_pseudo_lock_release();
>   	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>   	schemata_list_destroy();
> @@ -3280,6 +3328,7 @@ static void rdt_kill_sb(struct super_block *sb)
>   		resctrl_arch_disable_alloc();
>   	if (resctrl_arch_mon_capable())
>   		resctrl_arch_disable_mon();
> +

Unnecessary change.

>   	resctrl_mounted = false;
>   	kernfs_kill_sb(sb);
>   	mutex_unlock(&rdtgroup_mutex);
> @@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
>   		goto out_unlock;
>   	}
>   
> +	rdtgroup_assign_cntrs(rdtgrp);
> +
>   	kernfs_activate(rdtgrp->kn);
>   
>   	/*
> @@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>   	if (ret)
>   		goto out_closid_free;
>   
> +	rdtgroup_assign_cntrs(rdtgrp);
> +
>   	kernfs_activate(rdtgrp->kn);
>   
>   	ret = rdtgroup_init_alloc(rdtgrp);
> @@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>   out_del_list:
>   	list_del(&rdtgrp->rdtgroup_list);
>   out_rmid_free:
> +	rdtgroup_unassign_cntrs(rdtgrp);
>   	mkdir_rdt_prepare_rmid_free(rdtgrp);
>   out_closid_free:
>   	closid_free(closid);
> @@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>   	update_closid_rmid(tmpmask, NULL);
>   
>   	rdtgrp->flags = RDT_DELETED;
> +
> +	rdtgroup_unassign_cntrs(rdtgrp);
> +
>   	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>   
>   	/*
> @@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>   	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>   	update_closid_rmid(tmpmask, NULL);
>   
> +	rdtgroup_unassign_cntrs(rdtgrp);
> +
>   	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>   	closid_free(rdtgrp->closid);
>   

Thanks.

-Fenghua


* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-12-04  4:16   ` Fenghua Yu
@ 2024-12-04 17:03     ` Reinette Chatre
  2024-12-04 17:14     ` Moger, Babu
  1 sibling, 0 replies; 115+ messages in thread
From: Reinette Chatre @ 2024-12-04 17:03 UTC (permalink / raw)
  To: Fenghua Yu, Babu Moger, corbet, tglx, mingo, bp, dave.hansen
  Cc: x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, vikas.shivappa, linux-doc, linux-kernel,
	peternewman, maciej.wieczor-retman, eranian, jpoimboe,
	thomas.lendacky

Hi Fenghua,

On 12/3/24 8:16 PM, Fenghua Yu wrote:
> Hi, Babu,
> 
> On 10/29/24 16:21, Babu Moger wrote:
>> Assign/unassign counters on resctrl group creation/deletion. Two counters
>> are required per group, one for MBM total event and one for MBM local
>> event.
>>
>> There are a limited number of counters available for assignment. If these
>> counters are exhausted, the kernel will display the error message: "Out of
>> MBM assignable counters". However, it is not necessary to fail the
>> creation of a group due to assignment failures. Users have the flexibility
>> to modify the assignments at a later time.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
>>      Updated couple of rdtgroup_unassign_cntrs() calls properly.
>>      Updated function comments.
>>
>> v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
>>      Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
>>      Fixed the problem with unassigning the child MON groups of CTRL_MON group.
>>
>> v7: Reworded the commit message.
>>      Removed the reference of ABMC with mbm_cntr_assign.
>>      Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.
>>
>> v6: Removed the redundant comments on all the calls of
>>      rdtgroup_assign_cntrs. Updated the commit message.
>>      Dropped printing error message on every call of rdtgroup_assign_cntrs.
>>
>> v5: Removed the code to enable/disable ABMC during the mount.
>>      That will be another patch.
>>      Added arch callers to get the arch specific data.
>>      Renamed fuctions to match the other abmc function.
>>      Added code comments for assignment failures.
>>
>> v4: Few name changes based on the upstream discussion.
>>      Commit message update.
>>
>> v3: This is a new patch. Patch addresses the upstream comment to enable
>>      ABMC feature by default if the feature is available.
>> ---
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++++++-
>>   1 file changed, 60 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index b0cce3dfd062..a8d21b0b2054 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -2932,6 +2932,46 @@ static void schemata_list_destroy(void)
>>       }
>>   }
>>   +/*
>> + * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
>> + * counters are automatically assigned. Each group can accommodate two counters:
>> + * one for the total event and one for the local event. Assignments may fail
>> + * due to the limited number of counters. However, it is not necessary to fail
>> + * the group creation and thus no failure is returned. Users have the option
>> + * to modify the counter assignments after the group has been created.
>> + */
>> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> +    struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> +    if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> +        return;
>> +
>> +    if (is_mbm_total_enabled())
>> +        rdtgroup_assign_cntr_event(r, rdtgrp, NULL, QOS_L3_MBM_TOTAL_EVENT_ID);
> 
> In this code path,
> resctrl_mkdir()->resctrl_mkdir_ctrl_mon()->rdtgroup_assign_cntrs()->rdtgroup_assign_cntr_event()
> 
> CPUs are not protected by read lock while rdtgroup_assign_cntr_event() walks r->mon_domains and run assing counters code on CPUs in the domains. Without CPU protection, r->mon_domains may race with CPU hotplug.

From what I can tell, rdtgroup_assign_cntrs() is called with the CPU hotplug lock held:

rdtgroup_mkdir_ctrl_mon()
{

	ret = mkdir_rdt_prepare(...);
	/* mkdir_rdt_prepare()->rdtgroup_kn_lock_live()->cpus_read_lock() */
	...
	rdtgroup_assign_cntrs(rdtgrp);
	...
	rdtgroup_kn_unlock(parent_kn);
	/* rdtgroup_kn_unlock()->cpus_read_unlock() */
	return ret;
}
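
The same call chain can be modelled in userspace to show why the callee
sees the lock as held. This is only a sketch under assumptions: the depth
counter stands in for the CPU hotplug lock, error paths are left out, and
all model_* and hp_* names are hypothetical rather than kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

static int hp_lock_depth;		/* > 0 means the modelled lock is held */
static bool assigned_under_lock;	/* recorded by the modelled assign step */

static void hp_read_lock(void)
{
	hp_lock_depth++;
}

static void hp_read_unlock(void)
{
	assert(hp_lock_depth > 0);
	hp_lock_depth--;
}

/* Models mkdir_rdt_prepare(): takes the lock as a side effect. */
static int model_mkdir_prepare(void)
{
	hp_read_lock();
	return 0;	/* this model never fails */
}

/* Models rdtgroup_kn_unlock(): drops the lock taken by prepare. */
static void model_kn_unlock(void)
{
	hp_read_unlock();
}

/* Models rdtgroup_assign_cntrs(): records whether the lock was held. */
static void model_assign_cntrs(void)
{
	assigned_under_lock = hp_lock_depth > 0;
}

/* Models rdtgroup_mkdir_ctrl_mon(): assign runs between prepare and unlock. */
static int model_mkdir_ctrl_mon(void)
{
	int ret = model_mkdir_prepare();

	if (ret)
		return ret;

	model_assign_cntrs();
	model_kn_unlock();
	return ret;
}
```

Because the lock is taken inside the prepare helper rather than in the
caller's own body, it is easy to miss when reading only the diff.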

Reinette


* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-12-04  4:16   ` Fenghua Yu
  2024-12-04 17:03     ` Reinette Chatre
@ 2024-12-04 17:14     ` Moger, Babu
  2024-12-04 17:19       ` Moger, Babu
  1 sibling, 1 reply; 115+ messages in thread
From: Moger, Babu @ 2024-12-04 17:14 UTC (permalink / raw)
  To: Fenghua Yu, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Fenghua,

Thanks for the review.

On 12/3/24 22:16, Fenghua Yu wrote:
> Hi, Babu,
> 
> On 10/29/24 16:21, Babu Moger wrote:
>> Assign/unassign counters on resctrl group creation/deletion. Two counters
>> are required per group, one for MBM total event and one for MBM local
>> event.
>>
>> There are a limited number of counters available for assignment. If these
>> counters are exhausted, the kernel will display the error message: "Out of
>> MBM assignable counters". However, it is not necessary to fail the
>> creation of a group due to assignment failures. Users have the flexibility
>> to modify the assignments at a later time.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to
>> return void.
>>      Updated couple of rdtgroup_unassign_cntrs() calls properly.
>>      Updated function comments.
>>
>> v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
>>      Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
>>      Fixed the problem with unassigning the child MON groups of CTRL_MON
>> group.
>>
>> v7: Reworded the commit message.
>>      Removed the reference of ABMC with mbm_cntr_assign.
>>      Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.
>>
>> v6: Removed the redundant comments on all the calls of
>>      rdtgroup_assign_cntrs. Updated the commit message.
>>      Dropped printing error message on every call of rdtgroup_assign_cntrs.
>>
>> v5: Removed the code to enable/disable ABMC during the mount.
>>      That will be another patch.
>>      Added arch callers to get the arch specific data.
>>      Renamed fuctions to match the other abmc function.
>>      Added code comments for assignment failures.
>>
>> v4: Few name changes based on the upstream discussion.
>>      Commit message update.
>>
>> v3: This is a new patch. Patch addresses the upstream comment to enable
>>      ABMC feature by default if the feature is available.
>> ---
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++++++-
>>   1 file changed, 60 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index b0cce3dfd062..a8d21b0b2054 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -2932,6 +2932,46 @@ static void schemata_list_destroy(void)
>>       }
>>   }
>>   +/*
>> + * Called when a new group is created. If "mbm_cntr_assign" mode is
>> enabled,
>> + * counters are automatically assigned. Each group can accommodate two
>> counters:
>> + * one for the total event and one for the local event. Assignments may
>> fail
>> + * due to the limited number of counters. However, it is not necessary
>> to fail
>> + * the group creation and thus no failure is returned. Users have the
>> option
>> + * to modify the counter assignments after the group has been created.
>> + */
>> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> +    struct rdt_resource *r =
>> &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> +    if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> +        return;
>> +
>> +    if (is_mbm_total_enabled())
>> +        rdtgroup_assign_cntr_event(r, rdtgrp, NULL,
>> QOS_L3_MBM_TOTAL_EVENT_ID);
> 
> In this code path,
> resctrl_mkdir()->resctrl_mkdir_ctrl_mon()->rdtgroup_assign_cntrs()->rdtgroup_assign_cntr_event()
> 
> CPUs are not protected by read lock while rdtgroup_assign_cntr_event()
> walks r->mon_domains and run assing counters code on CPUs in the domains.
> Without CPU protection, r->mon_domains may race with CPU hotplug.
> 
> In another patch (i.e. rdt_get_tree()), rdtgroup_assign_cntrs() is
> protected by cpus_read_lock()/unlock().
> 
> So maybe define two helpers:
> 
> // Called when caller takes cpus_read_lock()
> rdtgroup_assign_cntrs_locked()
> {
>     lockdep_assert_cpus_held();
> 
>     then the current rdtgroup_assign_cntrs() code
> }
> 
> // Called when caller doesn't take cpus_read_lock()
> rdtgroup_assign_cntrs()
> {
>     cpus_read_lock();
>     rdtgroup_assign_cntrs_locked();
>     cpus_read_unlock();
> }
> 

Good observation, I agree there is a problem.
Some of this code will change with the earlier comments.

We know a couple of paths are affected here. Why not just take the lock
around the call in the affected paths instead of adding new helpers?

/*
 * rdtgroup_assign_cntrs() walks r->mon_domains; make sure it cannot
 * race with CPU hotplug.
 */
cpus_read_lock();
rdtgroup_assign_cntrs(rdtgrp);
cpus_read_unlock();



>> +
>> +    if (is_mbm_local_enabled())
>> +        rdtgroup_assign_cntr_event(r, rdtgrp, NULL,
>> QOS_L3_MBM_LOCAL_EVENT_ID);
>> +}
>> +
>> +/*
>> + * Called when a group is deleted. Counters are unassigned if it was in
>> + * assigned state.
>> + */
>> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> +    struct rdt_resource *r =
>> &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> +    if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> +        return;
>> +
>> +    if (is_mbm_total_enabled())
>> +        rdtgroup_unassign_cntr_event(r, rdtgrp, NULL,
>> QOS_L3_MBM_TOTAL_EVENT_ID);
>> +
>> +    if (is_mbm_local_enabled())
>> +        rdtgroup_unassign_cntr_event(r, rdtgrp, NULL,
>> QOS_L3_MBM_LOCAL_EVENT_ID);
>> +}
>> +
> 
> Seems rdtgroup_unassign_cntrs() is always protected by
> cpus_read_lock()/unlock(). So it's good.

ok

> 
>>   static int rdt_get_tree(struct fs_context *fc)
>>   {
>>       struct rdt_fs_context *ctx = rdt_fc2context(fc);
>> @@ -2991,6 +3031,8 @@ static int rdt_get_tree(struct fs_context *fc)
>>           if (ret < 0)
>>               goto out_mongrp;
>>           rdtgroup_default.mon.mon_data_kn = kn_mondata;
>> +
>> +        rdtgroup_assign_cntrs(&rdtgroup_default);
> 
> In this case, cpus_read_lock() was called earlier. Change to
> rdtgroup_assign_cntrs_locked().
> 
>>       }
>>         ret = rdt_pseudo_lock_init();
>> @@ -3021,8 +3063,10 @@ static int rdt_get_tree(struct fs_context *fc)
>>   out_psl:
>>       rdt_pseudo_lock_release();
>>   out_mondata:
>> -    if (resctrl_arch_mon_capable())
>> +    if (resctrl_arch_mon_capable()) {
>> +        rdtgroup_unassign_cntrs(&rdtgroup_default);
>>           kernfs_remove(kn_mondata);
>> +    }
>>   out_mongrp:
>>       if (resctrl_arch_mon_capable())
>>           kernfs_remove(kn_mongrp);
>> @@ -3201,6 +3245,7 @@ static void free_all_child_rdtgrp(struct rdtgroup
>> *rdtgrp)
>>         head = &rdtgrp->mon.crdtgrp_list;
>>       list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
>> +        rdtgroup_unassign_cntrs(sentry);
>>           free_rmid(sentry->closid, sentry->mon.rmid);
>>           list_del(&sentry->mon.crdtgrp_list);
>>   @@ -3241,6 +3286,8 @@ static void rmdir_all_sub(void)
>>           cpumask_or(&rdtgroup_default.cpu_mask,
>>                  &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>>   +        rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>           free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>             kernfs_remove(rdtgrp->kn);
>> @@ -3272,6 +3319,7 @@ static void rdt_kill_sb(struct super_block *sb)
>>       for_each_alloc_capable_rdt_resource(r)
>>           reset_all_ctrls(r);
>>       rmdir_all_sub();
>> +    rdtgroup_unassign_cntrs(&rdtgroup_default);
>>       rdt_pseudo_lock_release();
>>       rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>>       schemata_list_destroy();
>> @@ -3280,6 +3328,7 @@ static void rdt_kill_sb(struct super_block *sb)
>>           resctrl_arch_disable_alloc();
>>       if (resctrl_arch_mon_capable())
>>           resctrl_arch_disable_mon();
>> +
> 
> Unnecessary change.

ok.

> 
>>       resctrl_mounted = false;
>>       kernfs_kill_sb(sb);
>>       mutex_unlock(&rdtgroup_mutex);
>> @@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct kernfs_node
>> *parent_kn,
>>           goto out_unlock;
>>       }
>>   +    rdtgroup_assign_cntrs(rdtgrp);
>> +
>>       kernfs_activate(rdtgrp->kn);
>>         /*
>> @@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct
>> kernfs_node *parent_kn,
>>       if (ret)
>>           goto out_closid_free;
>>   +    rdtgroup_assign_cntrs(rdtgrp);
>> +
>>       kernfs_activate(rdtgrp->kn);
>>         ret = rdtgroup_init_alloc(rdtgrp);
>> @@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct
>> kernfs_node *parent_kn,
>>   out_del_list:
>>       list_del(&rdtgrp->rdtgroup_list);
>>   out_rmid_free:
>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>       mkdir_rdt_prepare_rmid_free(rdtgrp);
>>   out_closid_free:
>>       closid_free(closid);
>> @@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup
>> *rdtgrp, cpumask_var_t tmpmask)
>>       update_closid_rmid(tmpmask, NULL);
>>         rdtgrp->flags = RDT_DELETED;
>> +
>> +    rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>       free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>         /*
>> @@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup
>> *rdtgrp, cpumask_var_t tmpmask)
>>       cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>>       update_closid_rmid(tmpmask, NULL);
>>   +    rdtgroup_unassign_cntrs(rdtgrp);
>> +
>>       free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>       closid_free(rdtgrp->closid);
>>   
> 
> Thanks.
> 
> -Fenghua
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled
  2024-12-04 17:14     ` Moger, Babu
@ 2024-12-04 17:19       ` Moger, Babu
  0 siblings, 0 replies; 115+ messages in thread
From: Moger, Babu @ 2024-12-04 17:19 UTC (permalink / raw)
  To: Fenghua Yu, corbet, reinette.chatre, tglx, mingo, bp, dave.hansen
  Cc: x86, hpa, thuth, paulmck, rostedt, akpm, xiongwei.song,
	pawan.kumar.gupta, daniel.sneddon, perry.yuan, sandipan.das,
	kai.huang, xiaoyao.li, seanjc, jithu.joseph, brijesh.singh,
	xin3.li, ebiggers, andrew.cooper3, mario.limonciello, james.morse,
	tan.shaopeng, tony.luck, linux-doc, linux-kernel, peternewman,
	maciej.wieczor-retman, eranian, jpoimboe, thomas.lendacky

Hi Fenghua,

On 12/4/24 11:14, Moger, Babu wrote:
> Hi Fenghua,
> 
> Thanks for the review.
> 
> On 12/3/24 22:16, Fenghua Yu wrote:
>> Hi, Babu,
>>
>> On 10/29/24 16:21, Babu Moger wrote:
>>> Assign/unassign counters on resctrl group creation/deletion. Two counters
>>> are required per group, one for MBM total event and one for MBM local
>>> event.
>>>
>>> There are a limited number of counters available for assignment. If these
>>> counters are exhausted, the kernel will display the error message: "Out of
>>> MBM assignable counters". However, it is not necessary to fail the
>>> creation of a group due to assignment failures. Users have the flexibility
>>> to modify the assignments at a later time.
>>>
>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>> ---
>>> v9: Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to
>>> return void.
>>>      Updated couple of rdtgroup_unassign_cntrs() calls properly.
>>>      Updated function comments.
>>>
>>> v8: Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
>>>      Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
>>>      Fixed the problem with unassigning the child MON groups of CTRL_MON
>>> group.
>>>
>>> v7: Reworded the commit message.
>>>      Removed the reference of ABMC with mbm_cntr_assign.
>>>      Renamed the function rdtgroup_assign_cntrs to rdtgroup_assign_grp.
>>>
>>> v6: Removed the redundant comments on all the calls of
>>>      rdtgroup_assign_cntrs. Updated the commit message.
>>>      Dropped printing error message on every call of rdtgroup_assign_cntrs.
>>>
>>> v5: Removed the code to enable/disable ABMC during the mount.
>>>      That will be another patch.
>>>      Added arch callers to get the arch specific data.
>>>      Renamed fuctions to match the other abmc function.
>>>      Added code comments for assignment failures.
>>>
>>> v4: Few name changes based on the upstream discussion.
>>>      Commit message update.
>>>
>>> v3: This is a new patch. Patch addresses the upstream comment to enable
>>>      ABMC feature by default if the feature is available.
>>> ---
>>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++++++-
>>>   1 file changed, 60 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index b0cce3dfd062..a8d21b0b2054 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -2932,6 +2932,46 @@ static void schemata_list_destroy(void)
>>>       }
>>>   }
>>>   +/*
>>> + * Called when a new group is created. If "mbm_cntr_assign" mode is
>>> enabled,
>>> + * counters are automatically assigned. Each group can accommodate two
>>> counters:
>>> + * one for the total event and one for the local event. Assignments may
>>> fail
>>> + * due to the limited number of counters. However, it is not necessary
>>> to fail
>>> + * the group creation and thus no failure is returned. Users have the
>>> option
>>> + * to modify the counter assignments after the group has been created.
>>> + */
>>> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
>>> +{
>>> +    struct rdt_resource *r =
>>> &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>> +
>>> +    if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>>> +        return;
>>> +
>>> +    if (is_mbm_total_enabled())
>>> +        rdtgroup_assign_cntr_event(r, rdtgrp, NULL,
>>> QOS_L3_MBM_TOTAL_EVENT_ID);
>>
>> In this code path,
>> resctrl_mkdir()->resctrl_mkdir_ctrl_mon()->rdtgroup_assign_cntrs()->rdtgroup_assign_cntr_event()
>>
>> CPUs are not protected by read lock while rdtgroup_assign_cntr_event()
>> walks r->mon_domains and run assing counters code on CPUs in the domains.
>> Without CPU protection, r->mon_domains may race with CPU hotplug.
>>
>> In another patch (i.e. rdt_get_tree()), rdtgroup_assign_cntrs() is
>> protected by cpus_read_lock()/unlock().
>>
>> So maybe define two helpers:
>>
>> // Called when caller takes cpus_read_lock()
>> rdtgroup_assign_cntrs_locked()
>> {
>>     lockdep_assert_cpus_held();
>>
>>     then the current rdtgroup_assign_cntrs() code
>> }
>>
>> // Called when caller doesn't take cpus_read_lock()
>> rdtgroup_assign_cntrs()
>> {
>>     cpus_read_lock();
>>     rdtgroup_assign_cntrs_locked();
>>     cpus_read_unlock();
>> }
>>
> 
> Good observation. Agree. There is a problem.
> Some of this code will change with earlier comments.
> 
> We know couple of paths are affected here. Why not just add the lock
> before calling in affected paths instead of adding new helpers?
> 
> /*
>  * Walking r->domains in rdtgroup_assign_cntrs, ensure it can't race
>  * with cpuhp
>  */
> cpus_read_lock();
> rdtgroup_assign_cntrs()
> cpus_read_unlock();
> 

Oh no, it looks like we are good here after all.

Look at Reinette's response. Thanks, Reinette.

https://lore.kernel.org/lkml/4032a5a5-dd0a-49ae-94b6-dc4fac4c190d@intel.com/


> 
> 
>>> +
>>> +    if (is_mbm_local_enabled())
>>> +        rdtgroup_assign_cntr_event(r, rdtgrp, NULL,
>>> QOS_L3_MBM_LOCAL_EVENT_ID);
>>> +}
>>> +
>>> +/*
>>> + * Called when a group is deleted. Counters are unassigned if it was in
>>> + * assigned state.
>>> + */
>>> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
>>> +{
>>> +    struct rdt_resource *r =
>>> &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>> +
>>> +    if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>>> +        return;
>>> +
>>> +    if (is_mbm_total_enabled())
>>> +        rdtgroup_unassign_cntr_event(r, rdtgrp, NULL,
>>> QOS_L3_MBM_TOTAL_EVENT_ID);
>>> +
>>> +    if (is_mbm_local_enabled())
>>> +        rdtgroup_unassign_cntr_event(r, rdtgrp, NULL,
>>> QOS_L3_MBM_LOCAL_EVENT_ID);
>>> +}
>>> +
>>
>> Seems rdtgroup_unassign_cntrs() is always protected by
>> cpus_read_lock()/unlock(). So it's good.
> 
> ok
> 
>>
>>>   static int rdt_get_tree(struct fs_context *fc)
>>>   {
>>>       struct rdt_fs_context *ctx = rdt_fc2context(fc);
>>> @@ -2991,6 +3031,8 @@ static int rdt_get_tree(struct fs_context *fc)
>>>           if (ret < 0)
>>>               goto out_mongrp;
>>>           rdtgroup_default.mon.mon_data_kn = kn_mondata;
>>> +
>>> +        rdtgroup_assign_cntrs(&rdtgroup_default);
>>
>> In this case, cpus_read_lock() was called earlier. Change to
>> rdtgroup_assign_cntrs_locked().
>>
>>>       }
>>>         ret = rdt_pseudo_lock_init();
>>> @@ -3021,8 +3063,10 @@ static int rdt_get_tree(struct fs_context *fc)
>>>   out_psl:
>>>       rdt_pseudo_lock_release();
>>>   out_mondata:
>>> -    if (resctrl_arch_mon_capable())
>>> +    if (resctrl_arch_mon_capable()) {
>>> +        rdtgroup_unassign_cntrs(&rdtgroup_default);
>>>           kernfs_remove(kn_mondata);
>>> +    }
>>>   out_mongrp:
>>>       if (resctrl_arch_mon_capable())
>>>           kernfs_remove(kn_mongrp);
>>> @@ -3201,6 +3245,7 @@ static void free_all_child_rdtgrp(struct rdtgroup
>>> *rdtgrp)
>>>         head = &rdtgrp->mon.crdtgrp_list;
>>>       list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
>>> +        rdtgroup_unassign_cntrs(sentry);
>>>           free_rmid(sentry->closid, sentry->mon.rmid);
>>>           list_del(&sentry->mon.crdtgrp_list);
>>>   @@ -3241,6 +3286,8 @@ static void rmdir_all_sub(void)
>>>           cpumask_or(&rdtgroup_default.cpu_mask,
>>>                  &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>>>   +        rdtgroup_unassign_cntrs(rdtgrp);
>>> +
>>>           free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>             kernfs_remove(rdtgrp->kn);
>>> @@ -3272,6 +3319,7 @@ static void rdt_kill_sb(struct super_block *sb)
>>>       for_each_alloc_capable_rdt_resource(r)
>>>           reset_all_ctrls(r);
>>>       rmdir_all_sub();
>>> +    rdtgroup_unassign_cntrs(&rdtgroup_default);
>>>       rdt_pseudo_lock_release();
>>>       rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>>>       schemata_list_destroy();
>>> @@ -3280,6 +3328,7 @@ static void rdt_kill_sb(struct super_block *sb)
>>>           resctrl_arch_disable_alloc();
>>>       if (resctrl_arch_mon_capable())
>>>           resctrl_arch_disable_mon();
>>> +
>>
>> Unnecessary change.
> 
> ok.
> 
>>
>>>       resctrl_mounted = false;
>>>       kernfs_kill_sb(sb);
>>>       mutex_unlock(&rdtgroup_mutex);
>>> @@ -3871,6 +3920,8 @@ static int rdtgroup_mkdir_mon(struct kernfs_node
>>> *parent_kn,
>>>           goto out_unlock;
>>>       }
>>>   +    rdtgroup_assign_cntrs(rdtgrp);
>>> +
>>>       kernfs_activate(rdtgrp->kn);
>>>         /*
>>> @@ -3915,6 +3966,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct
>>> kernfs_node *parent_kn,
>>>       if (ret)
>>>           goto out_closid_free;
>>>   +    rdtgroup_assign_cntrs(rdtgrp);
>>> +
>>>       kernfs_activate(rdtgrp->kn);
>>>         ret = rdtgroup_init_alloc(rdtgrp);
>>> @@ -3940,6 +3993,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct
>>> kernfs_node *parent_kn,
>>>   out_del_list:
>>>       list_del(&rdtgrp->rdtgroup_list);
>>>   out_rmid_free:
>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>>       mkdir_rdt_prepare_rmid_free(rdtgrp);
>>>   out_closid_free:
>>>       closid_free(closid);
>>> @@ -4010,6 +4064,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup
>>> *rdtgrp, cpumask_var_t tmpmask)
>>>       update_closid_rmid(tmpmask, NULL);
>>>         rdtgrp->flags = RDT_DELETED;
>>> +
>>> +    rdtgroup_unassign_cntrs(rdtgrp);
>>> +
>>>       free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>         /*
>>> @@ -4056,6 +4113,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup
>>> *rdtgrp, cpumask_var_t tmpmask)
>>>       cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>>>       update_closid_rmid(tmpmask, NULL);
>>>   +    rdtgroup_unassign_cntrs(rdtgrp);
>>> +
>>>       free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>>       closid_free(rdtgrp->closid);
>>>   
>>
>> Thanks.
>>
>> -Fenghua
>>
> 

-- 
Thanks
Babu Moger


* Re: [PATCH v9 09/26] x86/resctrl: Introduce interface to display number of monitoring counters
  2024-11-18 21:31     ` Moger, Babu
@ 2025-02-03 13:26       ` Peter Newman
  0 siblings, 0 replies; 115+ messages in thread
From: Peter Newman @ 2025-02-03 13:26 UTC (permalink / raw)
  To: babu.moger
  Cc: akpm, andrew.cooper3, bp, brijesh.singh, corbet, daniel.sneddon,
	dave.hansen, ebiggers, eranian, fenghua.yu, hpa, james.morse,
	jithu.joseph, jpoimboe, kai.huang, linux-doc, linux-kernel,
	maciej.wieczor-retman, mario.limonciello, mingo, paulmck,
	pawan.kumar.gupta, perry.yuan, peternewman, reinette.chatre,
	rostedt, sandipan.das, seanjc, tan.shaopeng, tglx,
	thomas.lendacky, thuth, tony.luck, x86, xiaoyao.li, xin3.li,
	xiongwei.song

Hi Babu,

On Mon, Nov 18, 2024 at 03:31:28PM -0600, Moger, Babu wrote:
> Hi Reinette,
> 
> On 11/15/24 18:06, Reinette Chatre wrote:
> > Hi Babu,
> > 
> > On 10/29/24 4:21 PM, Babu Moger wrote:
> >> The mbm_cntr_assign mode provides an option to the user to assign a
> >> counter to an RMID, event pair and monitor the bandwidth as long as
> > >> the counter is assigned. The number of assignments depends on the
> > >> number of monitoring counters available.
> >>
> >> Provide the interface to display the number of monitoring counters
> >> supported. The interface file 'num_mbm_cntrs' is available when an
> >> architecture supports mbm_cntr_assign mode.
> >>
> > 
> > As mentioned in previous patch, do you think it may be possible to
> > have a value for num_mbm_cntrs for non-ABMC AMD systems? If that is
> > available and always exposed to user space (irrespective of
> > mbm_cntr_assign mode) then it would be clear to user space on
> > benefits/risks of running a "default" mode.
> 
> I am trying a workaround to get the maximum number of active RMIDs in
> default mode. The method is to loop through all of the recently assigned
> RMIDs to see if any of their QM_CTR.U bits transition from 0->1.
> 
> I have not been able to get it to work so far. I remember Peter was
> trying this before in soft-ABMC. Peter, any success with that?

Sorry I missed this question before. By now maybe you've already debugged your
own implementation...

Here's what I've been using:

-- >8 --
Subject: [PATCH] x86/resctrl: Detect MBM counters on pre-ABMC AMD
 implementations

This procedure is based on information provided directly by AMD
which cannot currently be corroborated by any public documentation.

In particular, it assumes that deallocation of MBM counters for RMIDs
is driven directly by writing a new RMID value into PQR_ASSOC.

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/monitor.c  | 83 ++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 377b5db667930..4a13eab110510 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -705,6 +705,7 @@ int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
 void free_rmid(u32 closid, u32 rmid);
+int amd_detect_mbm_counters(void);
 int __init rdt_get_mon_l3_config(struct rdt_resource *r);
 void __exit rdt_put_mon_l3_config(void);
 bool __init rdt_cpu_has(int flag);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 2dd6c47c9276a..f4a59251134d3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1172,6 +1172,89 @@ static __init int snc_get_config(void)
 	return ret;
 }
 
+static void zero_rmid(void *unused)
+{
+	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
+}
+
+int amd_detect_mbm_counters(void)
+{
+	struct cpu_cacheinfo *ci;
+	u32 rmid, test_rmid;
+	int llc_index;
+	u64 ctr;
+
+	/*
+	 * The detection mechanism below was provided by AMD. It applies only
+	 * to pre-ABMC models (Rome, Milan, Genoa).
+	 */
+	if (WARN_ON(boot_cpu_data.x86_vendor != X86_VENDOR_AMD))
+		return -1;
+	if (WARN_ON(rdt_cpu_has(X86_FEATURE_ABMC)))
+		return -1;
+
+	/*
+	 * Must not migrate to another CCX during this test. Assume no IRQ
+	 * handler would access the MSRs used below before resctrl is
+	 * initialized.
+	 */
+	ci = get_cpu_cacheinfo(get_cpu());
+	llc_index = ci->num_leaves - 1;
+
+	/* Ensure PQR_ASSOC.RMID = 0 in this CCX. */
+	if (ci->cpu_map_populated)
+		on_each_cpu_mask(&ci->info_list[llc_index].shared_cpu_map,
+				 zero_rmid, NULL, true);
+
+	rmid = 0;
+	while (true) {
+		wrmsr(MSR_IA32_PQR_ASSOC, rmid, 0);
+		wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_MBM_TOTAL_EVENT_ID, rmid);
+
+		/*
+		 * Ensure that a counter has been allocated on this RMID. The
+		 * loop below is expected to complete in two iterations.
+		 */
+		do {
+			rdmsrl(MSR_IA32_QM_CTR, ctr);
+
+			if (WARN_ON(ctr & RMID_VAL_ERROR)) {
+				pr_err("failed to read total event on rmid %u\n",
+				       rmid);
+				put_cpu();
+				return 0;
+			}
+		} while (ctr & RMID_VAL_UNAVAIL);
+
+		/*
+		 * The order in which counters are reused is not predictable, so
+		 * check all previously-assigned counters. If any loses its
+		 * value, then too many are in use as a result of the last
+		 * PQR_ASSOC write.
+		 */
+		for (test_rmid = 0; test_rmid < rmid; test_rmid++) {
+			wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_MBM_TOTAL_EVENT_ID,
+			      test_rmid);
+			rdmsrl(MSR_IA32_QM_CTR, ctr);
+
+			/*
+			 * As soon as a previous counter loses a value, we have
+			 * determined the number of RMIDs which can hold a value
+			 * simultaneously in this CCX.
+			 */
+			if (ctr & RMID_VAL_UNAVAIL) {
+				put_cpu();
+				return rmid;
+			}
+		}
+		rmid++;
+	}
+
+	put_cpu();
+
+	return 0;
+}
+
 int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 {
 	unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
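
The detection loop in amd_detect_mbm_counters() can be tried without
hardware using a small userspace model. The sketch below is only an
illustration under stated assumptions: struct sim_ccx, sim_ccx_init(),
sim_write_pqr_assoc(), sim_counter_unavailable() and
sim_detect_mbm_counters() are hypothetical names, not kernel interfaces,
and the round-robin eviction policy is an assumption made for the model,
since the real reuse order is not predictable. It only shows why the RMID
at which an earlier RMID first reads "Unavailable" equals the number of
counters in the pool.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_MAX_RMID 256

/* Hypothetical model of one CCX with a fixed pool of MBM counters. */
struct sim_ccx {
	unsigned int num_counters;		/* size of the counter pool */
	unsigned int in_use;			/* counters handed out so far */
	unsigned int next_victim;		/* assumed round-robin eviction cursor */
	unsigned int owner[SIM_MAX_RMID];	/* counter slot -> owning RMID */
	bool tracked[SIM_MAX_RMID];		/* RMID currently holds a counter */
};

void sim_ccx_init(struct sim_ccx *ccx, unsigned int num_counters)
{
	unsigned int i;

	/* The model needs at least one counter and room to overflow the pool. */
	assert(num_counters > 0 && num_counters < SIM_MAX_RMID);

	ccx->num_counters = num_counters;
	ccx->in_use = 0;
	ccx->next_victim = 0;
	for (i = 0; i < SIM_MAX_RMID; i++) {
		ccx->owner[i] = 0;
		ccx->tracked[i] = false;
	}
}

/* Models a PQR_ASSOC write: the new RMID gets a counter, evicting if full. */
void sim_write_pqr_assoc(struct sim_ccx *ccx, unsigned int rmid)
{
	unsigned int slot;

	assert(rmid < SIM_MAX_RMID);
	if (ccx->tracked[rmid])
		return;

	if (ccx->in_use < ccx->num_counters) {
		slot = ccx->in_use++;
	} else {
		/* Pool exhausted: the previous owner of the slot loses its count. */
		slot = ccx->next_victim;
		ccx->next_victim = (ccx->next_victim + 1) % ccx->num_counters;
		ccx->tracked[ccx->owner[slot]] = false;
	}
	ccx->owner[slot] = rmid;
	ccx->tracked[rmid] = true;
}

/* Models reading QM_CTR and testing the "Unavailable" bit for an RMID. */
bool sim_counter_unavailable(const struct sim_ccx *ccx, unsigned int rmid)
{
	return !ccx->tracked[rmid];
}

/* Same scan as the patch: returns the pool size, or -1 if none was found. */
int sim_detect_mbm_counters(struct sim_ccx *ccx)
{
	unsigned int rmid, test_rmid;

	for (rmid = 0; rmid < SIM_MAX_RMID; rmid++) {
		sim_write_pqr_assoc(ccx, rmid);

		/* Check every previously used RMID for a lost counter. */
		for (test_rmid = 0; test_rmid < rmid; test_rmid++) {
			if (sim_counter_unavailable(ccx, test_rmid))
				return (int)rmid;
		}
	}
	return -1;
}
```

In this model, a pool of 32 counters lets RMIDs 0..31 each hold a counter;
associating RMID 32 evicts one of them, so the scan stops there and
reports 32.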


end of thread, other threads:[~2025-02-03 13:27 UTC | newest]

Thread overview: 115+ messages
2024-10-29 23:21 [PATCH v9 00/26] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
2024-10-29 23:21 ` [PATCH v9 01/26] x86/resctrl: Add __init attribute for the functions called in resctrl_late_init Babu Moger
2024-11-15 23:21   ` Reinette Chatre
2024-11-18 17:44     ` Moger, Babu
2024-11-18 22:07       ` Reinette Chatre
2024-11-20 20:02         ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 02/26] x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC) Babu Moger
2024-10-29 23:21 ` [PATCH v9 03/26] x86/resctrl: Add ABMC feature in the command line options Babu Moger
2024-10-29 23:21 ` [PATCH v9 04/26] x86/resctrl: Consolidate monitoring related data from rdt_resource Babu Moger
2024-10-29 23:21 ` [PATCH v9 05/26] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details Babu Moger
2024-10-29 23:21 ` [PATCH v9 06/26] x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags Babu Moger
2024-10-29 23:21 ` [PATCH v9 07/26] x86/resctrl: Add support to enable/disable AMD ABMC feature Babu Moger
2024-10-29 23:21 ` [PATCH v9 08/26] x86/resctrl: Introduce the interface to display monitor mode Babu Moger
2024-11-16  0:00   ` Reinette Chatre
2024-11-18 19:04     ` Moger, Babu
2024-11-18 22:07       ` Reinette Chatre
2024-11-22 18:25         ` Moger, Babu
2024-11-22 21:37           ` Reinette Chatre
2024-11-23  0:02             ` Moger, Babu
2024-11-25 18:17               ` Reinette Chatre
2024-11-26 17:09                 ` Moger, Babu
2024-11-26 19:01                   ` Reinette Chatre
2024-11-26 21:57                     ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 09/26] x86/resctrl: Introduce interface to display number of monitoring counters Babu Moger
2024-11-16  0:06   ` Reinette Chatre
2024-11-18 21:31     ` Moger, Babu
2025-02-03 13:26       ` Peter Newman
2024-10-29 23:21 ` [PATCH v9 10/26] x86/resctrl: Introduce bitmap mbm_cntr_free_map to track assignable counters Babu Moger
2024-11-16  0:11   ` Reinette Chatre
2024-10-29 23:21 ` [PATCH v9 11/26] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct rdt_hw_mon_domain Babu Moger
2024-10-29 23:21 ` [PATCH v9 12/26] x86/resctrl: Remove MSR reading of event configuration value Babu Moger
2024-11-16  0:24   ` Reinette Chatre
2024-11-19 16:50     ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 13/26] x86/resctrl: Introduce mbm_cntr_map to track assignable counters at domain Babu Moger
2024-10-29 23:21 ` [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters Babu Moger
2024-10-29 23:57   ` Luck, Tony
2024-10-30 14:15     ` Moger, Babu
2024-11-04 14:14   ` Peter Newman
2024-11-04 17:31     ` Moger, Babu
2024-11-16  0:31   ` Reinette Chatre
2024-11-19 19:20     ` Moger, Babu
2024-11-21 21:12       ` Reinette Chatre
2024-11-22 23:36         ` Moger, Babu
2024-11-25 19:00           ` Reinette Chatre
2024-11-26 23:31             ` Moger, Babu
2024-11-26 23:56               ` Reinette Chatre
2024-11-27 14:57                 ` Moger, Babu
2024-11-27 19:05                   ` Reinette Chatre
2024-11-28 11:10                     ` Peter Newman
2024-11-28 19:35                       ` Moger, Babu
2024-11-29  9:59                         ` Peter Newman
2024-11-29 17:06                           ` Moger, Babu
2024-12-02 10:43                             ` Peter Newman
2024-12-02 15:02                               ` Moger, Babu
2024-12-02 18:33                             ` Reinette Chatre
2024-12-02 19:48                               ` Moger, Babu
2024-12-02 20:15                                 ` Reinette Chatre
2024-12-02 20:42                                   ` Moger, Babu
2024-12-02 21:09                                     ` Reinette Chatre
2024-12-02 21:28                                       ` Moger, Babu
2024-12-02 21:47                                         ` Reinette Chatre
2024-12-02 22:06                                           ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 15/26] x86/resctrl: Add data structures and definitions for ABMC assignment Babu Moger
2024-11-16  0:35   ` Reinette Chatre
2024-10-29 23:21 ` [PATCH v9 16/26] x86/resctrl: Introduce cntr_id in mongroup for assignments Babu Moger
2024-11-16  0:38   ` Reinette Chatre
2024-11-19 20:02     ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 17/26] x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC Babu Moger
2024-10-29 23:54   ` Luck, Tony
2024-10-30 14:14     ` Moger, Babu
2024-11-16  0:44   ` Reinette Chatre
2024-11-19 20:12     ` Moger, Babu
2024-11-21 20:18       ` Reinette Chatre
2024-11-22 18:54         ` Moger, Babu
2024-11-22 21:52           ` Reinette Chatre
2024-11-23  0:15             ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 18/26] x86/resctrl: Add the interface to assign/update counter assignment Babu Moger
2024-11-16  0:57   ` Reinette Chatre
2024-11-20 18:05     ` Moger, Babu
2024-11-21 20:50       ` Reinette Chatre
2024-11-22 21:04         ` Moger, Babu
2024-11-22 22:07           ` Reinette Chatre
2024-11-23  0:09             ` Moger, Babu
2024-12-04  4:16   ` Fenghua Yu
2024-10-29 23:21 ` [PATCH v9 19/26] x86/resctrl: Add the interface to unassign a MBM counter Babu Moger
2024-11-04 14:16   ` Peter Newman
2024-11-04 18:21     ` Moger, Babu
2024-11-05 10:35       ` Peter Newman
2024-11-05 19:58         ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 20/26] x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is enabled Babu Moger
2024-11-18 17:18   ` Reinette Chatre
2024-11-22  0:22     ` Moger, Babu
2024-11-22  0:26       ` Moger, Babu
2024-11-22 18:12         ` Reinette Chatre
2024-11-22 21:34           ` Moger, Babu
2024-12-04  4:16   ` Fenghua Yu
2024-12-04 17:03     ` Reinette Chatre
2024-12-04 17:14     ` Moger, Babu
2024-12-04 17:19       ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 21/26] x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign mode Babu Moger
2024-11-18 17:39   ` Reinette Chatre
2024-11-20 19:14     ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 22/26] x86/resctrl: Introduce the interface to switch between monitor modes Babu Moger
2024-10-29 23:21 ` [PATCH v9 23/26] x86/resctrl: Configure mbm_cntr_assign mode if supported Babu Moger
2024-11-18 19:23   ` Reinette Chatre
2024-11-20 18:59     ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 24/26] x86/resctrl: Update assignments on event configuration changes Babu Moger
2024-11-18 19:43   ` Reinette Chatre
2024-11-21  2:14     ` Moger, Babu
2024-11-21 20:58       ` Reinette Chatre
2024-11-22 20:12         ` Moger, Babu
2024-10-29 23:21 ` [PATCH v9 25/26] x86/resctrl: Introduce interface to list assignment states of all the groups Babu Moger
2024-10-29 23:21 ` [PATCH v9 26/26] x86/resctrl: Introduce interface to modify assignment states of " Babu Moger
2024-11-18 21:51   ` Reinette Chatre
2024-11-21 20:29     ` Moger, Babu
