* [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl
@ 2015-08-06 21:55 Vikas Shivappa
  2015-08-06 21:55 ` [PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling Vikas Shivappa
  ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming,
      will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

This series has some preparatory patches and Intel cache allocation support.

Prep patches: changes to the hot cpu handling code in the existing cache
monitoring and RAPL kernel code. This improves hot cpu notification handling
by not looping through all online cpus, which could be expensive on large
systems.

Intel Cache allocation support: the cache allocation patches add a cgroup
subsystem to support the new Cache Allocation feature found in future Intel
Xeon processors. Cache Allocation is a sub-feature within the Resource
Director Technology (RDT) feature. The current patches support only L3 cache
allocation.

Cache Allocation provides a way for the software (OS/VMM) to restrict cache
allocation to a defined 'subset' of cache which may be overlapping with other
'subsets'. This feature is used when a thread is allocating a cache line,
i.e. when pulling new data into the cache. Threads are associated with a CLOS
(Class of Service). The OS specifies the CLOS of a thread by writing the
IA32_PQR_ASSOC MSR during context switch. The cache capacity associated with
CLOS 'n' is specified by writing to the IA32_L3_MASK_n MSR.

More information about cache allocation can be found in the Intel SDM,
Volume 3, section 17.16. The SDM does not use the 'RDT' term yet; that is
planned to change at a later time.

Why Cache Allocation?

In today's processors the number of cores is continuously increasing, which
in turn increases the number of threads or workloads that can run
simultaneously. When multi-threaded applications run concurrently, they
compete for shared resources including the L3 cache. At times, this L3 cache
resource contention may result in inefficient space utilization. For example,
a higher priority thread may end up with less L3 cache resource, or a cache
sensitive app may not get optimal cache occupancy, thereby degrading
performance. The Cache Allocation kernel patches provide a framework for
sharing the L3 cache so that users can allocate the resource according to
their requirements.

*All the patches will apply on 4.2-rc5*.

Changes in v13: Based on Peter's and tglx's feedback
- changed changelogs to be better formatted and worded.
- moved sched code to __switch_to
- Fixed a lot of whitespace/indent issues in the documentation for cache
  allocation and formatted it better to make it more readable. (Thanks to
  Peter again for the many issues pointed out)
- changed "Intel cache allocation enabled" to "Intel cache allocation
  detected" in patch 1/9 intel_rdt_late_init
- changed find_next_bit to find_first_bit in 6/9 - cbm_is_contiguous
- changed the rdt_files mode to the default from 0666
- changed the name clos_cbm_map to clos_cbm_table
- changed usage of size_t sizeb to int size in intel_rdt_late_init
- changed rdt_common.h to pqr_common.h and pulled DECLARE_PER_CPU(struct
  intel_pqr_state, pqr_state) to pqr_common.h
- changed usage of the 'probe test' term to probe and mentioned it is done
  specifically for the hsw server and not just hsw.
Changes in v12:
- From Matt's feedback, replaced the function-scope static cpumask_t tmp with
  a file-scope static cpumask_t tmp_cpumask. This is a temporary mask used
  during handling of hot cpu notifications in the cqm/rapl and rdt code.
  Although all the usage was serialized by hot cpu locking, this makes it
  more readable.

Changes in V11: As per feedback from Thomas and discussions:
- removed cpumask_any_online_but. Its usage could easily be replaced by
  'and'ing the cpu_online mask during hot cpu notifications. Thomas pointed
  out the API had an issue where the tmp mask wasn't thread safe. I realized
  the support it intends to give does not seem to match the others in
  cpumask.h
- the cqm patch which added a mutex to hot cpu notification was merged with
  the cqm hot plug patch to improve notification handling; it lacked commit
  logs and wasn't correct. Separated them, am sending just the cqm hot plug
  patch, and will send the mutex cqm patch separately
- fixed issues in the hot cpu rdt handling. Since cpu_starting was replaced
  with cpu_online, the wrmsr now needs to actually be scheduled on the target
  cpu - which the previous patch wasn't doing. Replaced cpu_dead with
  cpu_down_prepare. cpu_down_failed is handled the same way as cpu_online. By
  waiting till cpu_dead to update the rdt_cpumask, we may miss some of the
  msr updates.

Changes in V10:
- changed the hot cpu notifications we handle in cqm and cache allocation to
  cpu_online and cpu_dead and removed the others, as cpu_*_prepare also had
  corresponding cancel notifications which we did not handle.
- changed the file in the rdt cgroup to l3_cache_mask to represent that it is
  for the l3 cache.
Changes as per Thomas and PeterZ feedback:
- made the cpumask declarations in cpumask.h and the rdt, cmt and rapl code
  static so that they don't burden the stack when large.
- removed the mutex in cpu_starting notifications, replaced the locking with
  cpu_online.
- changed the name from hsw_probetest to cache_alloc_hsw_probe.
- changed x86_rdt_max_closid to x86_cache_max_closid and x86_rdt_max_cbm_len
  to x86_cache_max_cbm_len as they are only related to cache allocation and
  not to all of rdt.

Changes in V9: Changes made as per Thomas' feedback:
- added a comment where we call the schedule-in code only when RDT is
  enabled.
- Reordered the local declarations to follow convention in
  intel_cqm_xchg_rmid

Changes in V8: Thanks to feedback from Thomas; the following changes are made
based on his feedback:
Generic changes/Preparatory patches:
- added a new cpumask_any_online_but which returns the next core sibling that
  is online.
- Made changes in the Intel Cache monitoring and Intel RAPL (Running Average
  Power Limit) code to use the new function above to find the next cpu that
  can be a designated reader for the package. Also changed the way the
  package masks are computed, which can be simplified using
  topology_core_cpumask.
Cache allocation specific changes:
- Moved the documentation to the beginning of the patch series.
- Added more documentation for the rdt cgroup files in the documentation.
- Changed the dmesg output when cache alloc is enabled to be more helpful and
  updated a few other comments to be more readable.
- removed the __ prefix from functions like clos_get which were not following
  convention.
- added code to take action on a WARN_ON in clos_put. Made a few other
  changes to reduce code text.
- updated more readable/kernel-doc format comments for the call to
  rdt_css_alloc and data structures.
- removed cgroup_init
- changed the names of functions to only have the intel_ prefix for external
  APIs.
- replaced (void *)&closid with (void *)closid when calling on_each_cpu_mask
- fixed the reference release of closid during the cache bitmask write.
- changed the code to not ignore a cache mask which has bits set outside of
  the max bits allowed. It returns an error instead.
- replaced bitmap_set(&max_mask, 0, max_cbm_len) with
  max_mask = (1ULL << max_cbm) - 1.
- update the rdt_cpu_mask, which has one cpu for each package, using
  topology_core_cpumask instead of looping through the existing rdt_cpu_mask.
  Realized the topology_core_cpumask name is misleading and it actually
  returns the cores in a cpu package!
- arranged the code better to keep the code relating to similar tasks
  together.
- Improved searching for the next online cpu sibling and maintaining the
  rdt_cpu_mask which has one cpu per package.
- removed the unnecessary wrapper rdt_enabled.
- removed the unnecessary spin lock and rcu lock in the scheduling code.
- merged all scheduling code into one patch, not separating the RDT common
  software cache code.

Changes in V7: Based on feedback from PeterZ and Matt and following
discussions:
- changed a lot of naming to reflect the data structures which are common to
  RDT and those specific to Cache allocation.
- removed all usage of 'cat'; replaced with the friendlier 'cache allocation'
- fixed a lot of convention issues (whitespace, return paradigm etc)
- changed the scheduling hook for RDT to not use an inline.
- removed adding a new scheduling hook and just reused the existing one,
  similar to the perf hook.

Changes in V6:
- rebased to 4.1-rc1 which has the CMT (cache monitoring) support included.
- (Thanks to Marcelo's feedback.) Fixed support for hot cpu handling for the
  IA32_L3_QOS MSRs. Although during deep C states the MSR need not be
  restored, this is needed when a new package is physically added.
- some other coding convention changes, including renaming to cache_mask and
  using a refcnt to track the number of cgroups using a closid in the
  clos_cbm map.
- 1b cbm support for non-hsw SKUs. HSW is an exception which needs the cache
  bit masks to be at least 2 bits.

Changes in v5:
- Added support to propagate the cache bit mask update for each package.
- Removed the cache bit mask reference in the intel_rdt structure as there
  was no need for that and we already maintain a separate closid<->cbm
  mapping.
- Made a few coding convention changes which include adding the assertion
  while freeing the CLOSID.

Changes in V4:
- Integrated with the latest V5 CMT patches.
- Changed naming of the cgroup to rdt (resource director technology) from cat
  (cache allocation technology). This was done as RDT is the umbrella term
  for platform shared resource allocation. Hence in future it would be easier
  to add resource allocation to the same cgroup
- Naming changes also applied to a lot of other data structures/APIs.
- Added documentation on cgroup usage for cache allocation to address a lot
  of questions from academia and industry regarding cache allocation usage.

Changes in V3:
- Implements a common software cache for IA32_PQR_MSR
- Implements support for hsw Cache Allocation enumeration. This does not use
  the brand strings like the earlier version but does a probe test. The probe
  test is done only on the hsw family of processors
- Made a few coding convention and name changes
- Check for the lock being held when ClosID manipulation happens

Changes in V2:
- Removed HSW specific enumeration changes. Plan to include it later as a
  separate patch.
- Fixed the code in prep_arch_switch to be specific for x86 and removed x86
  defines.
- Fixed cbm_write to not write all 1s when a cgroup is freed.
- Fixed one possible memory leak in init.
- Changed some of the manual bitmap manipulation to use the predefined bitmap
  APIs to make the code more readable
- Changed the name in sources from cqe to cat
- Global cat enable flag changed to a static_key and disabled cgroup
  early_init

[PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling
[PATCH 2/9] x86/intel_rapl: Modify hot cpu notification handling
[PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup
[PATCH 4/9] x86/intel_rdt: Add support for Cache Allocation detection
[PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service
[PATCH 6/9] x86/intel_rdt: Add support for cache bit mask management
[PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT
[PATCH 8/9] x86/intel_rdt: Hot cpu support for Cache Allocation
[PATCH 9/9] x86/intel_rdt: Intel haswell Cache Allocation enumeration

^ permalink raw reply	[flat|nested] 30+ messages in thread
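As a rough illustration of the MSR interface described in the cover letter
(this is not code from the series), the sketch below shows the two writes
involved: programming the capacity bit mask for a CLOS via IA32_L3_MASK_n,
and tagging the CPU with the running task's CLOS via IA32_PQR_ASSOC at
context switch. The MSR numbers match the definitions used later in this
series; the helper names are made up for the example.

#include <linux/types.h>
#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC	0x0c8f
#define IA32_L3_CBM_BASE	0xc90

/* Program the cache capacity bit mask for class of service 'closid'. */
static void set_l3_cbm(u32 closid, u64 cbm)
{
	wrmsrl(IA32_L3_CBM_BASE + closid, cbm);
}

/*
 * On context switch, associate the CPU with the incoming task's CLOS.
 * The CLOSid lives in the upper 32 bits of IA32_PQR_ASSOC; the lower
 * bits carry the RMID used by cache monitoring, so both are written.
 */
static void set_cpu_closid(u32 rmid, u32 closid)
{
	wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
}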
* [PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 2015-08-06 21:55 ` [PATCH 2/9] x86/intel_rapl: " Vikas Shivappa ` (7 subsequent siblings) 8 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa - In cqm_pick_event_reader, use the existing package<->core map instead of looping through all cpus in cqm_cpumask. - In intel_cqm_cpu_exit, use the same map instead of looping through all online cpus. In large systems with large number of cpus the time taken to loop may be expensive and also the time increases linearly. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 +++++++++++++++--------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 63eb68b..916bef9 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -62,6 +62,12 @@ static LIST_HEAD(cache_groups); */ static cpumask_t cqm_cpumask; +/* + * Temporary cpumask used during hot cpu notificaiton handling. The usage + * is serialized by hot cpu locks. + */ +static cpumask_t tmp_cpumask; + #define RMID_VAL_ERROR (1ULL << 63) #define RMID_VAL_UNAVAIL (1ULL << 62) @@ -1244,15 +1250,13 @@ static struct pmu intel_cqm_pmu = { static inline void cqm_pick_event_reader(int cpu) { - int phys_id = topology_physical_package_id(cpu); - int i; + cpumask_and(&tmp_cpumask, &cqm_cpumask, topology_core_cpumask(cpu)); - for_each_cpu(i, &cqm_cpumask) { - if (phys_id == topology_physical_package_id(i)) - return; /* already got reader for this socket */ - } - - cpumask_set_cpu(cpu, &cqm_cpumask); + /* + * Pick a reader if there isn't one already. + */ + if (cpumask_empty(&tmp_cpumask)) + cpumask_set_cpu(cpu, &cqm_cpumask); } static void intel_cqm_cpu_prepare(unsigned int cpu) @@ -1270,7 +1274,6 @@ static void intel_cqm_cpu_prepare(unsigned int cpu) static void intel_cqm_cpu_exit(unsigned int cpu) { - int phys_id = topology_physical_package_id(cpu); int i; /* @@ -1279,15 +1282,12 @@ static void intel_cqm_cpu_exit(unsigned int cpu) if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask)) return; - for_each_online_cpu(i) { - if (i == cpu) - continue; + cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask); + cpumask_clear_cpu(cpu, &tmp_cpumask); + i = cpumask_any(&tmp_cpumask); - if (phys_id == topology_physical_package_id(i)) { - cpumask_set_cpu(i, &cqm_cpumask); - break; - } - } + if (i < nr_cpu_ids) + cpumask_set_cpu(i, &cqm_cpumask); } static int intel_cqm_cpu_notifier(struct notifier_block *nb, -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 2/9] x86/intel_rapl: Modify hot cpu notification handling 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa 2015-08-06 21:55 ` [PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 2015-08-06 21:55 ` [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide Vikas Shivappa ` (6 subsequent siblings) 8 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa - In rapl_cpu_init, use the existing package<->core map instead of looping through all cpus in rapl_cpumask. - In rapl_cpu_exit, use the same mapping instead of looping all online cpus. In large systems with large number of cpus the time taken to loop may be expensive and also the time increase linearly. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- arch/x86/kernel/cpu/perf_event_intel_rapl.c | 35 ++++++++++++++--------------- 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c index 5cbd4e6..3f3fb4c 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c +++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c @@ -132,6 +132,12 @@ static struct pmu rapl_pmu_class; static cpumask_t rapl_cpu_mask; static int rapl_cntr_mask; +/* + * Temporary cpumask used during hot cpu notificaiton handling. The usage + * is serialized by hot cpu locks. + */ +static cpumask_t tmp_cpumask; + static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu); static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_to_free); @@ -523,18 +529,16 @@ static struct pmu rapl_pmu_class = { static void rapl_cpu_exit(int cpu) { struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu); - int i, phys_id = topology_physical_package_id(cpu); int target = -1; + int i; /* find a new cpu on same package */ - for_each_online_cpu(i) { - if (i == cpu) - continue; - if (phys_id == topology_physical_package_id(i)) { - target = i; - break; - } - } + cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask); + cpumask_clear_cpu(cpu, &tmp_cpumask); + i = cpumask_any(&tmp_cpumask); + if (i < nr_cpu_ids) + target = i; + /* * clear cpu from cpumask * if was set in cpumask and still some cpu on package, @@ -556,15 +560,10 @@ static void rapl_cpu_exit(int cpu) static void rapl_cpu_init(int cpu) { - int i, phys_id = topology_physical_package_id(cpu); - - /* check if phys_is is already covered */ - for_each_cpu(i, &rapl_cpu_mask) { - if (phys_id == topology_physical_package_id(i)) - return; - } - /* was not found, so add it */ - cpumask_set_cpu(cpu, &rapl_cpu_mask); + /* check if cpu's package is already covered.If not, add it.*/ + cpumask_and(&tmp_cpumask, &rapl_cpu_mask, topology_core_cpumask(cpu)); + if (cpumask_empty(&tmp_cpumask)) + cpumask_set_cpu(cpu, &rapl_cpu_mask); } static __init void rapl_hsw_server_quirk(void) -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa 2015-08-06 21:55 ` [PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling Vikas Shivappa 2015-08-06 21:55 ` [PATCH 2/9] x86/intel_rapl: " Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 2015-08-06 21:55 ` [PATCH 4/9] x86/intel_rdt: Add support for Cache Allocation detection Vikas Shivappa ` (5 subsequent siblings) 8 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa Adds a description of Cache allocation technology, overview of kernel implementation and usage of Cache Allocation cgroup interface. Cache allocation is a sub-feature of Resource Director Technology (RDT) or Platform Shared resource control which provides support to control Platform shared resources like L3 cache. Cache Allocation Technology provides a way for the Software (OS/VMM) to restrict cache allocation to a defined 'subset' of cache which may be overlapping with other 'subsets'. This feature is used when allocating a line in cache ie when pulling new data into the cache. The tasks are grouped into CLOS (class of service). OS uses MSR writes to indicate the CLOSid of the thread when scheduling in and to indicate the cache capacity associated with the CLOSid. Currently cache allocation is supported for L3 cache. More information can be found in the Intel SDM June 2015, Volume 3, section 17.16. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- Documentation/cgroups/rdt.txt | 219 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 219 insertions(+) create mode 100644 Documentation/cgroups/rdt.txt diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt new file mode 100644 index 0000000..1abc930 --- /dev/null +++ b/Documentation/cgroups/rdt.txt @@ -0,0 +1,219 @@ + RDT + --- + +Copyright (C) 2014 Intel Corporation +Written by vikas.shivappa@linux.intel.com +(based on contents and format from cpusets.txt) + +CONTENTS: +========= + +1. Cache Allocation Technology + 1.1 What is RDT and Cache allocation ? + 1.2 Why is Cache allocation needed ? + 1.3 Cache allocation implementation overview + 1.4 Assignment of CBM and CLOS + 1.5 Scheduling and Context Switch +2. Usage Examples and Syntax + +1. Cache Allocation Technology +=================================== + +1.1 What is RDT and Cache allocation +------------------------------------ + +Cache allocation is a sub-feature of Resource Director Technology (RDT) +Allocation or Platform Shared resource control which provides support to +control Platform shared resources like L3 cache. Currently L3 Cache is +the only resource that is supported in RDT. More information can be +found in the Intel SDM June 2015, Volume 3, section 17.16. + +Cache Allocation Technology provides a way for the Software (OS/VMM) to +restrict cache allocation to a defined 'subset' of cache which may be +overlapping with other 'subsets'. This feature is used when allocating a +line in cache ie when pulling new data into the cache. The programming +of the h/w is done via programming MSRs. + +The different cache subsets are identified by CLOS identifier (class of +service) and each CLOS has a CBM (cache bit mask). 
The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is Cache allocation needed
+----------------------------------
+
+In today's processors the number of cores is continuously increasing,
+especially in large scale usage models such as webservers and
+datacenters where VMs are used. The number of cores increases the
+number of threads or workloads that can be run simultaneously. When
+multi-threaded applications, VMs and workloads run concurrently they
+compete for shared resources including the L3 cache.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput benefit.
+This technique may be useful in managing large computer systems with a
+large L3 cache. Examples may be large servers running instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placing of the available cache resources.
+
+A specific use case may be when an app which is constantly copying data,
+such as a streaming app, is using a large amount of cache which could
+otherwise have been used by a high priority computing application. Using
+the cache allocation feature, the streaming application can be confined
+to use a smaller cache and the high priority application be awarded a
+larger amount of cache space.
+
+1.3 Cache allocation implementation Overview
+--------------------------------------------
+
+The kernel implements a cgroup subsystem to support cache allocation.
+This is intended for use by root users or system administrators to have
+control over the amount of L3 cache that applications can use to fill.
+
+Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping. A CLOS (Class
+of service) is represented by a CLOSid. The CLOSid is internal to the
+kernel and not exposed to the user. Each cgroup would have one CBM and
+would just represent one cache 'subset'.
+
+When a child cgroup is created it inherits the CLOSid and the CBM from
+its parent. When a user changes the default CBM for a cgroup, a new
+CLOSid may be allocated if the CBM was not used before. The changing of
+'l3_cbm' may fail with -ENOSPC once the kernel runs out of the maximum
+number of CLOSids it can support. The user can create as many cgroups as
+desired, but having different CBMs at the same time is restricted by the
+maximum number of CLOSids (multiple cgroups can have the same CBM). The
+kernel maintains a CLOSid <-> cbm mapping which keeps a reference
+counter for each cgroup using a CLOSid.
+
+The tasks in the cgroup would get to fill the L3 cache represented by
+the cgroup's 'l3_cbm' file.
+
+The root directory would have all available bits set in the 'l3_cbm'
+file by default.
+
+Each RDT cgroup directory has the following files. Some of them may be a
+part of the common RDT framework or be specific to RDT sub-features like
+cache allocation.
+
+ - intel_rdt.l3_cbm: The cache bitmask (CBM) is represented by this
+ file. The bitmask must be contiguous and would have a 1 or 2 bit
+ minimum length.
+
+1.4 Assignment of CBM, CLOS
+---------------------------
+
+The 'l3_cbm' needs to be a subset of the parent node's 'l3_cbm'. Any
+contiguous subset of these bits (with a minimum of 2 bits on hsw server
+SKUs) may be set to indicate the cache capacity desired. The 'l3_cbm'
+between 2 directories can overlap.
The 'l3_cbm' would represent the +cache 'subset' of the Cache allocation cgroup. + +For ex: on a system with 16 bits of max cbm bits and 4MB of L3 cache, if +the directory has the least significant 4 bits set in its 'l3_cbm' file +(meaning the 'l3_cbm' is just 0xf), it would be allocated 1MB of the L3 +cache which means the tasks belonging to this Cache allocation cgroup +can use that 1MB cache to fill. If it has a l3_cbm=0xff, it would be +allocated 2MB or half of the cache. The administrator who is using the +cgroup can easily determine the total size of the cache from +/proc/cpuinfo and decide the amount or specific percentage of cache +allocations to be made to applications. Note that the granularity may +differ on different SKUs and the administrator can obtain the +granularity by calculating total size of cache / max cbm bits. + +The cache portion defined in the CBM file is available to all tasks +within the cgroup to fill. + +1.5 Scheduling and Context Switch +--------------------------------- + +During context switch kernel implements this by writing the CLOSid +(internally maintained by kernel) of the cgroup to which the task +belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written when +there is a change in the CLOSid for the CPU in order to minimize the +latency incurred during context switch. + +The following considerations are done for the PQR MSR write so that it +has minimal impact on scheduling hot path: + - This path doesn't exist on any non-intel platforms. + - On Intel platforms, this would not exist by default unless CGROUP_RDT + is enabled. + - remains a no-op when CGROUP_RDT is enabled and intel hardware does + not support the feature. + - When feature is available, does not do MSR write till the user + manually creates a cgroup *and* assigns a new cache mask. Since the + child node inherits the parents cache mask, by cgroup creation there is + no scheduling hot path impact from the new cgroup. + - per cpu PQR values are cached and the MSR write is only done when + there is a task with different PQR is scheduled on the CPU. Typically + if the task groups are bound to be scheduled on a set of CPUs, the + number of MSR writes is greatly reduced. + +2. Usage examples and syntax +============================ + +Following is an example on how a system administrator/root user can +configure L3 cache allocation to threads. + +To check if Cache allocation was enabled on your system + $ dmesg | grep -i intel_rdt + intel_rdt: Intel Cache Allocation enabled + + $ cat /proc/cpuinfo +output would have 'rdt' (if rdt is enabled) and 'cat_l3' (if L3 +cache allocation is enabled). + +example1: Following would mount the cache allocation cgroup subsystem +and create 2 directories. + + $ cd /sys/fs/cgroup + $ mkdir rdt + $ mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt + $ cd rdt + $ mkdir group1 + $ mkdir group2 + +Following are some of the Files in the directory + + $ ls + intel_rdt.l3_cbm + tasks + +Say if the cache is 4MB (looked up from /proc/cpuinfo) and max cbm is 16 +bits (indicated by the root nodes cbm). This assigns 1MB of cache to +group1 and group2 which is exclusive between them. + + $ cd group1 + $ /bin/echo 0xf > intel_rdt.l3_cbm + + $ cd group2 + $ /bin/echo 0xf0 > intel_rdt.l3_cbm + +Assign tasks to the group2 + + $ /bin/echo PID1 > tasks + $ /bin/echo PID2 > tasks + +Now threads PID1 and PID2 get to fill the 1MB of cache that was +allocated to group2. Similarly assign tasks to group1. 
+ +example2: Below commands allocate '1MB L3 cache on socket1 to group1' +and '2MB of L3 cache on socket2 to group2'. +This mounts both cpuset and intel_rdt and hence the ls would list the +files in both the subsystems. + $ mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt rdt/ + +Assign the cache + $ /bin/echo 0xf > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm + $ /bin/echo 0xff > /sys/fs/cgroup/rdt/group2/intel_rdt.l3_cbm + +Assign tasks for group1 and group2 + $ /bin/echo PID1 > /sys/fs/cgroup/rdt/group1/tasks + $ /bin/echo PID2 > /sys/fs/cgroup/rdt/group1/tasks + $ /bin/echo PID3 > /sys/fs/cgroup/rdt/group2/tasks + $ /bin/echo PID4 > /sys/fs/cgroup/rdt/group2/tasks + +Tie the group1 to socket1 and group2 to socket2 + $ /bin/echo <cpumask for socket1> > /sys/fs/cgroup/rdt/group1/cpuset.cpus + $ /bin/echo <cpumask for socket2> > /sys/fs/cgroup/rdt/group2/cpuset.cpus -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
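As a quick check of the sizing arithmetic in section 1.4 of the document
above, the small user-space program below (not part of the patch) reproduces
the example numbers: a 4MB L3, a 16 bit maximum cbm and an l3_cbm of 0xf. The
cache size and cbm width are hard-coded assumptions here; on a real system
they come from /proc/cpuinfo and the root cgroup's l3_cbm. It assumes
gcc/clang for __builtin_popcountl.

#include <stdio.h>

int main(void)
{
	unsigned long cache_bytes = 4UL * 1024 * 1024;	/* assumed: from /proc/cpuinfo */
	unsigned int max_cbm_bits = 16;			/* assumed: width of root l3_cbm */
	unsigned long per_bit = cache_bytes / max_cbm_bits;

	unsigned long cbm = 0xf;			/* value written to group1's l3_cbm */
	unsigned int nbits = __builtin_popcountl(cbm);

	printf("granularity: %lu KB per cbm bit\n", per_bit / 1024);
	printf("l3_cbm 0x%lx -> %lu KB\n", cbm, nbits * per_bit / 1024);

	return 0;
}

This prints 256 KB per cbm bit and 1024 KB for 0xf, matching the 1MB example
in the text.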
* [PATCH 4/9] x86/intel_rdt: Add support for Cache Allocation detection 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa ` (2 preceding siblings ...) 2015-08-06 21:55 ` [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 2015-08-06 21:55 ` [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management Vikas Shivappa ` (4 subsequent siblings) 8 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa This patch includes CPUID enumeration routines for Cache allocation and new values to track resources to the cpuinfo_x86 structure. Cache allocation provides a way for the Software (OS/VMM) to restrict cache allocation to a defined 'subset' of cache which may be overlapping with other 'subsets'. This feature is used when allocating a line in cache ie when pulling new data into the cache. The programming of the hardware is done via programming MSRs (model specific registers). Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- arch/x86/include/asm/cpufeature.h | 6 +++++- arch/x86/include/asm/processor.h | 3 +++ arch/x86/kernel/cpu/Makefile | 1 + arch/x86/kernel/cpu/common.c | 15 +++++++++++++++ arch/x86/kernel/cpu/intel_rdt.c | 40 +++++++++++++++++++++++++++++++++++++++ init/Kconfig | 11 +++++++++++ 6 files changed, 75 insertions(+), 1 deletion(-) create mode 100644 arch/x86/kernel/cpu/intel_rdt.c diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 3d6606f..ae5ae9d 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -12,7 +12,7 @@ #include <asm/disabled-features.h> #endif -#define NCAPINTS 13 /* N 32-bit words worth of info */ +#define NCAPINTS 14 /* N 32-bit words worth of info */ #define NBUGINTS 1 /* N 32-bit bug flags */ /* @@ -229,6 +229,7 @@ #define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */ #define X86_FEATURE_CQM ( 9*32+12) /* Cache QoS Monitoring */ #define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */ +#define X86_FEATURE_RDT ( 9*32+15) /* Resource Allocation */ #define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */ #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */ #define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */ @@ -252,6 +253,9 @@ /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */ #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */ +/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 13 */ +#define X86_FEATURE_CAT_L3 (13*32 + 1) /* Cache Allocation L3 */ + /* * BUG word(s) */ diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 944f178..0a1a1bc 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -120,6 +120,9 @@ struct cpuinfo_x86 { int x86_cache_occ_scale; /* scale to bytes */ int x86_power; unsigned long loops_per_jiffy; + /* Cache Allocation values: */ + u16 x86_cache_max_cbm_len; + u16 x86_cache_max_closid; /* cpuid returned max cores value: */ u16 x86_max_cores; u16 apicid; diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index 9bff687..4ff7a1f 100644 --- a/arch/x86/kernel/cpu/Makefile +++ 
b/arch/x86/kernel/cpu/Makefile @@ -48,6 +48,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \ perf_event_intel_uncore_nhmex.o endif +obj-$(CONFIG_CGROUP_RDT) += intel_rdt.o obj-$(CONFIG_X86_MCE) += mcheck/ obj-$(CONFIG_MTRR) += mtrr/ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index cb9e5df..5bb46d9 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -653,6 +653,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c) } } + /* Additional Intel-defined flags: level 0x00000010 */ + if (c->cpuid_level >= 0x00000010) { + u32 eax, ebx, ecx, edx; + + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx); + c->x86_capability[13] = ebx; + + if (cpu_has(c, X86_FEATURE_CAT_L3)) { + + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx); + c->x86_cache_max_closid = edx + 1; + c->x86_cache_max_cbm_len = eax + 1; + } + } + /* AMD-defined flags: level 0x80000001 */ xlvl = cpuid_eax(0x80000000); c->extended_cpuid_level = xlvl; diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c new file mode 100644 index 0000000..f49e970 --- /dev/null +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -0,0 +1,40 @@ +/* + * Resource Director Technology(RDT) + * - Cache Allocation code. + * + * Copyright (C) 2014 Intel Corporation + * + * 2015-05-25 Written by + * Vikas Shivappa <vikas.shivappa@intel.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * More information about RDT be found in the Intel (R) x86 Architecture + * Software Developer Manual June 2015, volume 3, section 17.15. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/slab.h> +#include <linux/err.h> + +static int __init intel_rdt_late_init(void) +{ + struct cpuinfo_x86 *c = &boot_cpu_data; + + if (!cpu_has(c, X86_FEATURE_CAT_L3)) + return -ENODEV; + + pr_info("Intel cache allocation detected\n"); + + return 0; +} + +late_initcall(intel_rdt_late_init); diff --git a/init/Kconfig b/init/Kconfig index af09b4f..7f10d40 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -971,6 +971,17 @@ config CPUSETS Say N if unsure. +config CGROUP_RDT + bool "Resource Director Technology cgroup subsystem" + depends on X86_64 && CPU_SUP_INTEL + help + This option provides a cgroup to allocate Platform shared + resources. Among the shared resources, current implementation + focuses on L3 Cache. Using the interface user can specify the + amount of L3 cache space into which an application can fill. + + Say N if unsure. + config PROC_PID_CPUSET bool "Include legacy /proc/<pid>/cpuset file" depends on CPUSETS -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
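For reference, the enumeration this patch adds to the kernel can be sketched
from user space as below (purely illustrative, not part of the series): CPUID
leaf 0x10, subleaf 0 reports L3 cache allocation in EBX bit 1, and subleaf 1
reports the maximum CBM length and the number of CLOSids, each encoded as the
value minus one, in EAX and EDX. The masks used here are assumptions about
the field widths; the kernel patch simply adds one to EAX and EDX. Requires
gcc/clang on x86.

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_max(0, NULL) < 0x10) {
		printf("CPUID leaf 0x10 not available\n");
		return 0;
	}

	/* CPUID.0x10.0: EBX bit 1 indicates L3 Cache Allocation (CAT_L3). */
	__cpuid_count(0x10, 0, eax, ebx, ecx, edx);
	if (!(ebx & (1 << 1))) {
		printf("L3 cache allocation not enumerated\n");
		return 0;
	}

	/* CPUID.0x10.1: EAX holds cbm length - 1, EDX holds max closid - 1. */
	__cpuid_count(0x10, 1, eax, ebx, ecx, edx);
	printf("max cbm length: %u\n", (eax & 0x1f) + 1);
	printf("max closids:    %u\n", (edx & 0xffff) + 1);

	return 0;
}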
* [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa ` (3 preceding siblings ...) 2015-08-06 21:55 ` [PATCH 4/9] x86/intel_rdt: Add support for Cache Allocation detection Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 2015-08-06 21:55 ` [PATCH 6/9] x86/intel_rdt: Add support for cache bit mask management Vikas Shivappa ` (3 subsequent siblings) 8 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa This patch adds a cgroup subsystem for Intel Resource Director Technology (RDT) feature. This cgroup may eventually be used by many sub-features of RDT. Therefore the cgroup may be associated with the common RDT framework as well as sub-feature specific framework. Patch also adds Class of service id (CLOSid) management code for Cache allocation. When a cgroup directory is created it has a CLOSid associated with it which is inherited from its parent. The Closid is mapped to a l3_cbm (capacity bit mask) which represents the L3 cache allocation to the cgroup. Tasks belonging to the cgroup get to fill the cache represented by the l3_cbm. CLOSid is internal to the kernel and not exposed to user. Kernel uses several ways to optimize the allocation of Closid and thereby exposing the available Closids may actually provide wrong information to users as it may be dynamically changing depending on its usage. CLOSid allocation is tracked using a separate bitmap. The maximum number of CLOSids is specified by the h/w during CPUID enumeration and the kernel simply throws an -ENOSPC when it runs out of CLOSids. Each l3_cbm has an associated CLOSid. However if multiple cgroups have the same cache mask they would also have the same CLOSid. The reference count parameter in CLOSid-CBM map keeps track of how many cgroups are using each CLOSid<->CBM mapping. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- arch/x86/include/asm/intel_rdt.h | 36 +++++++++++ arch/x86/kernel/cpu/intel_rdt.c | 133 ++++++++++++++++++++++++++++++++++++++- include/linux/cgroup_subsys.h | 4 ++ 3 files changed, 170 insertions(+), 3 deletions(-) create mode 100644 arch/x86/include/asm/intel_rdt.h diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h new file mode 100644 index 0000000..a887004 --- /dev/null +++ b/arch/x86/include/asm/intel_rdt.h @@ -0,0 +1,36 @@ +#ifndef _RDT_H_ +#define _RDT_H_ + +#ifdef CONFIG_CGROUP_RDT + +#include <linux/cgroup.h> + +struct rdt_subsys_info { + unsigned long *closmap; +}; + +struct intel_rdt { + struct cgroup_subsys_state css; + u32 closid; +}; + +struct clos_cbm_table { + unsigned long l3_cbm; + unsigned int clos_refcnt; +}; + +/* + * Return rdt group corresponding to this container. + */ +static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css) +{ + return css ? 
container_of(css, struct intel_rdt, css) : NULL; +} + +static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir) +{ + return css_rdt(ir->css.parent); +} + +#endif +#endif diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index f49e970..52e1fd6 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -24,17 +24,144 @@ #include <linux/slab.h> #include <linux/err.h> +#include <linux/spinlock.h> +#include <asm/intel_rdt.h> + +/* + * cctable maintains 1:1 mapping between CLOSid and cache bitmask. + */ +static struct clos_cbm_table *cctable; +static struct rdt_subsys_info rdtss_info; +static DEFINE_MUTEX(rdt_group_mutex); +struct intel_rdt rdt_root_group; + +static inline void closid_get(u32 closid) +{ + struct clos_cbm_table *cct = &cctable[closid]; + + lockdep_assert_held(&rdt_group_mutex); + + cct->clos_refcnt++; +} + +static int closid_alloc(struct intel_rdt *ir) +{ + u32 maxid; + u32 id; + + lockdep_assert_held(&rdt_group_mutex); + + maxid = boot_cpu_data.x86_cache_max_closid; + id = find_first_zero_bit(rdtss_info.closmap, maxid); + if (id == maxid) + return -ENOSPC; + + set_bit(id, rdtss_info.closmap); + closid_get(id); + ir->closid = id; + + return 0; +} + +static inline void closid_free(u32 closid) +{ + clear_bit(closid, rdtss_info.closmap); + cctable[closid].l3_cbm = 0; +} + +static inline void closid_put(u32 closid) +{ + struct clos_cbm_table *cct = &cctable[closid]; + + lockdep_assert_held(&rdt_group_mutex); + if (WARN_ON(!cct->clos_refcnt)) + return; + + if (!--cct->clos_refcnt) + closid_free(closid); +} + +static struct cgroup_subsys_state * +intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css) +{ + struct intel_rdt *parent = css_rdt(parent_css); + struct intel_rdt *ir; + + /* + * cgroup_init cannot handle failures gracefully. + * Return rdt_root_group.css instead of failure + * always even when Cache allocation is not supported. 
+ */ + if (!parent) + return &rdt_root_group.css; + + ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL); + if (!ir) + return ERR_PTR(-ENOMEM); + + mutex_lock(&rdt_group_mutex); + ir->closid = parent->closid; + closid_get(ir->closid); + mutex_unlock(&rdt_group_mutex); + + return &ir->css; +} + +static void intel_rdt_css_free(struct cgroup_subsys_state *css) +{ + struct intel_rdt *ir = css_rdt(css); + + mutex_lock(&rdt_group_mutex); + closid_put(ir->closid); + kfree(ir); + mutex_unlock(&rdt_group_mutex); +} static int __init intel_rdt_late_init(void) { struct cpuinfo_x86 *c = &boot_cpu_data; + static struct clos_cbm_table *cct; + u32 maxid, max_cbm_len; + int err = 0, size; - if (!cpu_has(c, X86_FEATURE_CAT_L3)) + if (!cpu_has(c, X86_FEATURE_CAT_L3)) { + rdt_root_group.css.ss->disabled = 1; return -ENODEV; + } + maxid = c->x86_cache_max_closid; + max_cbm_len = c->x86_cache_max_cbm_len; - pr_info("Intel cache allocation detected\n"); + size = BITS_TO_LONGS(maxid) * sizeof(long); + rdtss_info.closmap = kzalloc(size, GFP_KERNEL); + if (!rdtss_info.closmap) { + err = -ENOMEM; + goto out_err; + } - return 0; + size = maxid * sizeof(struct clos_cbm_table); + cctable = kzalloc(size, GFP_KERNEL); + if (!cctable) { + kfree(rdtss_info.closmap); + err = -ENOMEM; + goto out_err; + } + + set_bit(0, rdtss_info.closmap); + rdt_root_group.closid = 0; + cct = &cctable[0]; + cct->l3_cbm = (1ULL << max_cbm_len) - 1; + cct->clos_refcnt = 1; + + pr_info("Intel cache allocation enabled\n"); +out_err: + + return err; } late_initcall(intel_rdt_late_init); + +struct cgroup_subsys intel_rdt_cgrp_subsys = { + .css_alloc = intel_rdt_css_alloc, + .css_free = intel_rdt_css_free, + .early_init = 0, +}; diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index e4a96fb..0339312 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h @@ -47,6 +47,10 @@ SUBSYS(net_prio) SUBSYS(hugetlb) #endif +#if IS_ENABLED(CONFIG_CGROUP_RDT) +SUBSYS(intel_rdt) +#endif + /* * The following subsystems are not supported on the default hierarchy. */ -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 6/9] x86/intel_rdt: Add support for cache bit mask management 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa ` (4 preceding siblings ...) 2015-08-06 21:55 ` [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 2015-08-06 21:55 ` [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa ` (2 subsequent siblings) 8 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa Adds a file l3_cbm to the intel_rdt cgroup which represents the cache capacity bit mask for the cgroup. The tasks in the cgroup would get to fill the L3 cache represented by the cgroup's l3_cbm file. The bit mask may map to ways in the cache but could be hardware implementation specific. The l3_cbm would represent one of IA32_L3_MASK_n MSRs, there by any updates to the l3_cbm end up in an MSR write to the appropriate IA32_L3_MASK_n. The IA32_L3_MASK_n MSRs are per package but the l3_cbm represents the global value of the MSR on all packages. When a child cgroup is created it inherits the CLOSid and the l3_cbm from its parent. When a user changes the default l3_cbm for a cgroup, a new CLOSid may be allocated if the l3_cbm was not used before. If the new l3_cbm is the one that is already used, the count for that CLOSid <-> l3_cbm is incremented. The changing of 'l3_cbm' may fail with -ENOSPC once the kernel runs out of maximum CLOSids it can support. User can create as many cgroups as he wants, but having different l3_cbm at the same time is restricted by the maximum number of CLOSids. Kernel maintains a CLOSid <-> l3_cbm mapping which keeps count of cgroups using a CLOSid. Reuse of CLOSids for cgroups with same bitmask also has following advantages: - This helps to use the scant CLOSids optimally. - This also implies that during context switch, write to PQR-MSR is done only when a task with a different bitmask is scheduled in. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- arch/x86/include/asm/intel_rdt.h | 3 + arch/x86/kernel/cpu/intel_rdt.c | 202 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 204 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h index a887004..58bac91 100644 --- a/arch/x86/include/asm/intel_rdt.h +++ b/arch/x86/include/asm/intel_rdt.h @@ -4,6 +4,9 @@ #ifdef CONFIG_CGROUP_RDT #include <linux/cgroup.h> +#define MAX_CBM_LENGTH 32 +#define IA32_L3_CBM_BASE 0xc90 +#define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x) struct rdt_subsys_info { unsigned long *closmap; diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 52e1fd6..115f136 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -34,6 +34,13 @@ static struct clos_cbm_table *cctable; static struct rdt_subsys_info rdtss_info; static DEFINE_MUTEX(rdt_group_mutex); struct intel_rdt rdt_root_group; +/* + * Mask of CPUs for writing CBM values. We only need one CPU per-socket. 
+ */ +static cpumask_t rdt_cpumask; + +#define rdt_for_each_child(pos_css, parent_ir) \ + css_for_each_child((pos_css), &(parent_ir)->css) static inline void closid_get(u32 closid) { @@ -117,12 +124,192 @@ static void intel_rdt_css_free(struct cgroup_subsys_state *css) mutex_unlock(&rdt_group_mutex); } +static int intel_cache_alloc_cbm_read(struct seq_file *m, void *v) +{ + struct intel_rdt *ir = css_rdt(seq_css(m)); + + seq_printf(m, "%08lx\n", cctable[ir->closid].l3_cbm); + + return 0; +} + +static inline bool cbm_is_contiguous(unsigned long var) +{ + unsigned long maxcbm = MAX_CBM_LENGTH; + unsigned long first_bit, zero_bit; + + if (!var) + return false; + + first_bit = find_first_bit(&var, maxcbm); + zero_bit = find_next_zero_bit(&var, maxcbm, first_bit); + + if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm) + return false; + + return true; +} + +static int cbm_validate(struct intel_rdt *ir, unsigned long cbmvalue) +{ + struct cgroup_subsys_state *css; + struct intel_rdt *par, *c; + unsigned long *cbm_tmp; + int err = 0; + + if (!cbm_is_contiguous(cbmvalue)) { + err = -EINVAL; + goto out_err; + } + + par = parent_rdt(ir); + cbm_tmp = &cctable[par->closid].l3_cbm; + if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH)) { + err = -EINVAL; + goto out_err; + } + + rcu_read_lock(); + rdt_for_each_child(css, ir) { + c = css_rdt(css); + cbm_tmp = &cctable[c->closid].l3_cbm; + if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) { + rcu_read_unlock(); + err = -EINVAL; + goto out_err; + } + } + rcu_read_unlock(); +out_err: + + return err; +} + +static bool cbm_search(unsigned long cbm, u32 *closid) +{ + u32 maxid = boot_cpu_data.x86_cache_max_closid; + u32 i; + + for (i = 0; i < maxid; i++) { + if (bitmap_equal(&cbm, &cctable[i].l3_cbm, MAX_CBM_LENGTH)) { + *closid = i; + return true; + } + } + + return false; +} + +static void closcbm_map_dump(void) +{ + u32 i; + + pr_debug("CBMMAP\n"); + for (i = 0; i < boot_cpu_data.x86_cache_max_closid; i++) { + pr_debug("l3_cbm: 0x%x,clos_refcnt: %u\n", + (unsigned int)cctable[i].l3_cbm, cctable[i].clos_refcnt); + } +} + +static void cbm_cpu_update(void *info) +{ + u32 closid = (u32) info; + + wrmsrl(CBM_FROM_INDEX(closid), cctable[closid].l3_cbm); +} + +/* + * cbm_update_all() - Update the cache bit mask for all packages. + */ +static inline void cbm_update_all(u32 closid) +{ + on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1); +} + +/* + * intel_cache_alloc_cbm_write() - Validates and writes the + * cache bit mask(cbm) to the IA32_L3_MASK_n + * and also store the same in the cctable. + * + * CLOSids are reused for cgroups which have same bitmask. + * This helps to use the scant CLOSids optimally. This also + * implies that at context switch write to PQR-MSR is done + * only when a task with a different bitmask is scheduled in. + */ +static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css, + struct cftype *cft, u64 cbmvalue) +{ + u32 max_cbm = boot_cpu_data.x86_cache_max_cbm_len; + struct intel_rdt *ir = css_rdt(css); + u64 max_mask; + int err = 0; + u32 closid; + + if (ir == &rdt_root_group) + return -EPERM; + + /* + * Need global mutex as cbm write may allocate a closid. 
+ */ + mutex_lock(&rdt_group_mutex); + + max_mask = (1ULL << max_cbm) - 1; + if (cbmvalue & ~max_mask) { + err = -EINVAL; + goto out; + } + + if (cbmvalue == cctable[ir->closid].l3_cbm) + goto out; + + err = cbm_validate(ir, cbmvalue); + if (err) + goto out; + + /* + * Try to get a reference for a different CLOSid and release the + * reference to the current CLOSid. + * Need to put down the reference here and get it back in case we + * run out of closids. Otherwise we run into a problem when + * we could be using the last closid that could have been available. + */ + closid_put(ir->closid); + if (cbm_search(cbmvalue, &closid)) { + ir->closid = closid; + closid_get(closid); + } else { + closid = ir->closid; + err = closid_alloc(ir); + if (err) { + closid_get(ir->closid); + goto out; + } + + cctable[ir->closid].l3_cbm = cbmvalue; + cbm_update_all(ir->closid); + } + closcbm_map_dump(); +out: + mutex_unlock(&rdt_group_mutex); + + return err; +} + +static inline void rdt_cpumask_update(int cpu) +{ + static cpumask_t tmp; + + cpumask_and(&tmp, &rdt_cpumask, topology_core_cpumask(cpu)); + if (cpumask_empty(&tmp)) + cpumask_set_cpu(cpu, &rdt_cpumask); +} + static int __init intel_rdt_late_init(void) { struct cpuinfo_x86 *c = &boot_cpu_data; static struct clos_cbm_table *cct; u32 maxid, max_cbm_len; - int err = 0, size; + int err = 0, size, i; if (!cpu_has(c, X86_FEATURE_CAT_L3)) { rdt_root_group.css.ss->disabled = 1; @@ -152,6 +339,9 @@ static int __init intel_rdt_late_init(void) cct->l3_cbm = (1ULL << max_cbm_len) - 1; cct->clos_refcnt = 1; + for_each_online_cpu(i) + rdt_cpumask_update(i); + pr_info("Intel cache allocation enabled\n"); out_err: @@ -160,8 +350,18 @@ out_err: late_initcall(intel_rdt_late_init); +static struct cftype rdt_files[] = { + { + .name = "l3_cbm", + .seq_show = intel_cache_alloc_cbm_read, + .write_u64 = intel_cache_alloc_cbm_write, + }, + { } /* terminate */ +}; + struct cgroup_subsys intel_rdt_cgrp_subsys = { .css_alloc = intel_rdt_css_alloc, .css_free = intel_rdt_css_free, + .legacy_cftypes = rdt_files, .early_init = 0, }; -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa ` (5 preceding siblings ...) 2015-08-06 21:55 ` [PATCH 6/9] x86/intel_rdt: Add support for cache bit mask management Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 2015-08-06 23:03 ` Andy Lutomirski 2015-08-06 21:55 ` [PATCH 8/9] x86/intel_rdt: Hot cpu support for Cache Allocation Vikas Shivappa 2015-08-06 21:55 ` [PATCH 9/9] x86/intel_rdt: Intel haswell Cache Allocation enumeration Vikas Shivappa 8 siblings, 1 reply; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa Adds support for IA32_PQR_ASSOC MSR writes during task scheduling. For Cache Allocation, MSR write would let the task fill in the cache 'subset' represented by the task's intel_rdt cgroup cache_mask. The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the CLOSid. During context switch kernel implements this by writing the CLOSid of the cgroup to which the task belongs to the CPU's IA32_PQR_ASSOC MSR. This patch also implements a common software cache for IA32_PQR_MSR (RMID 0:9, CLOSId 32:63) to be used by both Cache monitoring (CMT) and Cache allocation. CMT updates the RMID where as cache_alloc updates the CLOSid in the software cache. During scheduling when the new RMID/CLOSid value is different from the cached values, IA32_PQR_MSR is updated. Since the measured rdmsr latency for IA32_PQR_MSR is very high (~250 cycles) this software cache is necessary to avoid reading the MSR to compare the current CLOSid value. The following considerations are done for the PQR MSR write so that it minimally impacts scheduler hot path: - This path does not exist on any non-intel platforms. - On Intel platforms, this would not exist by default unless CGROUP_RDT is enabled. - remains a no-op when CGROUP_RDT is enabled and intel SKU does not support the feature. - When feature is available and enabled, never does MSR write till the user manually creates a cgroup directory *and* assigns a cache_mask different from root cgroup directory. Since the child node inherits the parents cache mask, by cgroup creation there is no scheduling hot path impact from the new cgroup. - MSR write is only done when there is a task with different Closid is scheduled on the CPU. Typically if the task groups are bound to be scheduled on a set of CPUs, the number of MSR writes is greatly reduced. - A per CPU cache of CLOSids is maintained to do the check so that we dont have to do a rdmsr which actually costs a lot of cycles. - For cgroup directories having same cache_mask the CLOSids are reused. This minimizes the number of CLOSids used and hence reduces the MSR write frequency. 
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- arch/x86/include/asm/intel_rdt.h | 44 ++++++++++++++++++++++++++++++ arch/x86/include/asm/pqr_common.h | 27 ++++++++++++++++++ arch/x86/kernel/cpu/intel_rdt.c | 17 ++++++++++++ arch/x86/kernel/cpu/perf_event_intel_cqm.c | 26 ++---------------- arch/x86/kernel/process_64.c | 6 ++++ 5 files changed, 97 insertions(+), 23 deletions(-) create mode 100644 arch/x86/include/asm/pqr_common.h diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h index 58bac91..3757c5c 100644 --- a/arch/x86/include/asm/intel_rdt.h +++ b/arch/x86/include/asm/intel_rdt.h @@ -4,10 +4,15 @@ #ifdef CONFIG_CGROUP_RDT #include <linux/cgroup.h> +#include <asm/pqr_common.h> + #define MAX_CBM_LENGTH 32 #define IA32_L3_CBM_BASE 0xc90 #define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x) +extern struct static_key rdt_enable_key; +extern void __intel_rdt_sched_in(void); + struct rdt_subsys_info { unsigned long *closmap; }; @@ -35,5 +40,44 @@ static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir) return css_rdt(ir->css.parent); } +/* + * Return rdt group to which this task belongs. + */ +static inline struct intel_rdt *task_rdt(struct task_struct *task) +{ + return css_rdt(task_css(task, intel_rdt_cgrp_id)); +} + +/* + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR + * + * Following considerations are made so that this has minimal impact + * on scheduler hot path: + * - This will stay as no-op unless we are running on an Intel SKU + * which supports L3 cache allocation. + * - When support is present and enabled, does not do any + * IA32_PQR_MSR writes until the user starts really using the feature + * ie creates a rdt cgroup directory and assigns a cache_mask thats + * different from the root cgroup's cache_mask. + * - Caches the per cpu CLOSid values and does the MSR write only + * when a task with a different CLOSid is scheduled in. That + * means the task belongs to a different cgroup. + * - Closids are allocated so that different cgroup directories + * with same cache_mask gets the same CLOSid. This minimizes CLOSids + * used and reduces MSR write frequency. + */ +static inline void intel_rdt_sched_in(void) +{ + /* + * Call the schedule in code only when RDT is enabled. + */ + if (static_key_false(&rdt_enable_key)) + __intel_rdt_sched_in(); +} + +#else + +static inline void intel_rdt_sched_in(void) {} + #endif #endif diff --git a/arch/x86/include/asm/pqr_common.h b/arch/x86/include/asm/pqr_common.h new file mode 100644 index 0000000..11e985c --- /dev/null +++ b/arch/x86/include/asm/pqr_common.h @@ -0,0 +1,27 @@ +#ifndef _X86_RDT_H_ +#define _X86_RDT_H_ + +#define MSR_IA32_PQR_ASSOC 0x0c8f + +/** + * struct intel_pqr_state - State cache for the PQR MSR + * @rmid: The cached Resource Monitoring ID + * @closid: The cached Class Of Service ID + * @rmid_usecnt: The usage counter for rmid + * + * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the + * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always + * contains both parts, so we need to cache them. + * + * The cache also helps to avoid pointless updates if the value does + * not change. 
+ */ +struct intel_pqr_state { + u32 rmid; + u32 closid; + int rmid_usecnt; +}; + +DECLARE_PER_CPU(struct intel_pqr_state, pqr_state); + +#endif diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 115f136..06cba8da 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -34,6 +34,8 @@ static struct clos_cbm_table *cctable; static struct rdt_subsys_info rdtss_info; static DEFINE_MUTEX(rdt_group_mutex); struct intel_rdt rdt_root_group; +struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE; + /* * Mask of CPUs for writing CBM values. We only need one CPU per-socket. */ @@ -88,6 +90,20 @@ static inline void closid_put(u32 closid) closid_free(closid); } +void __intel_rdt_sched_in(void) +{ + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state); + struct task_struct *task = current; + struct intel_rdt *ir; + + ir = task_rdt(task); + if (ir->closid == state->closid) + return; + + wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->closid); + state->closid = ir->closid; +} + static struct cgroup_subsys_state * intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css) { @@ -342,6 +358,7 @@ static int __init intel_rdt_late_init(void) for_each_online_cpu(i) rdt_cpumask_update(i); + static_key_slow_inc(&rdt_enable_key); pr_info("Intel cache allocation enabled\n"); out_err: diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 916bef9..d579e23 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -7,41 +7,22 @@ #include <linux/perf_event.h> #include <linux/slab.h> #include <asm/cpu_device_id.h> +#include <asm/pqr_common.h> #include "perf_event.h" -#define MSR_IA32_PQR_ASSOC 0x0c8f #define MSR_IA32_QM_CTR 0x0c8e #define MSR_IA32_QM_EVTSEL 0x0c8d static u32 cqm_max_rmid = -1; static unsigned int cqm_l3_scale; /* supposedly cacheline size */ -/** - * struct intel_pqr_state - State cache for the PQR MSR - * @rmid: The cached Resource Monitoring ID - * @closid: The cached Class Of Service ID - * @rmid_usecnt: The usage counter for rmid - * - * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the - * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always - * contains both parts, so we need to cache them. - * - * The cache also helps to avoid pointless updates if the value does - * not change. - */ -struct intel_pqr_state { - u32 rmid; - u32 closid; - int rmid_usecnt; -}; - /* * The cached intel_pqr_state is strictly per CPU and can never be * updated from a remote CPU. Both functions which modify the state * (intel_cqm_event_start and intel_cqm_event_stop) are called with * interrupts disabled, which is sufficient for the protection. */ -static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state); +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state); /* * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru. 
@@ -408,9 +389,9 @@ static void __intel_cqm_event_count(void *info); */ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid) { - struct perf_event *event; struct list_head *head = &group->hw.cqm_group_entry; u32 old_rmid = group->hw.cqm_rmid; + struct perf_event *event; lockdep_assert_held(&cache_mutex); @@ -1265,7 +1246,6 @@ static void intel_cqm_cpu_prepare(unsigned int cpu) struct cpuinfo_x86 *c = &cpu_data(cpu); state->rmid = 0; - state->closid = 0; state->rmid_usecnt = 0; WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index f6b9163..8c42b64 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -48,6 +48,7 @@ #include <asm/syscalls.h> #include <asm/debugreg.h> #include <asm/switch_to.h> +#include <asm/intel_rdt.h> asmlinkage extern void ret_from_fork(void); @@ -445,6 +446,11 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) loadsegment(ss, __KERNEL_DS); } + /* + * Load the Intel cache allocation PQR MSR. + */ + intel_rdt_sched_in(); + return prev_p; } -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
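As a rough, stand-alone illustration of the software cache described in the changelog above (RMID in bits 0:9, CLOSid in bits 32:63 of IA32_PQR_ASSOC, with the MSR written only when the cached CLOSid differs), here is a minimal user-space sketch. The struct and helper names are invented for the example and only mirror the patch's intel_pqr_state/__intel_rdt_sched_in logic in simplified form; the real MSR write is left as a comment.

    #include <stdint.h>
    #include <stdio.h>

    #define PQR_RMID_MASK    0x3ffULL   /* RMID occupies bits 0:9 of the MSR */
    #define PQR_CLOSID_SHIFT 32         /* CLOSid occupies bits 32:63 */

    /* Hypothetical mirror of the per-cpu struct intel_pqr_state cache. */
    struct pqr_cache {
        uint32_t rmid;
        uint32_t closid;
    };

    /* The 64-bit value a wrmsr(MSR_IA32_PQR_ASSOC, lo, hi) ends up writing. */
    static inline uint64_t pqr_assoc_val(uint32_t rmid, uint32_t closid)
    {
        return ((uint64_t)closid << PQR_CLOSID_SHIFT) | (rmid & PQR_RMID_MASK);
    }

    /* Sketch of the sched-in check: skip the expensive MSR access when the
     * incoming task's CLOSid matches the value cached for this CPU. */
    static void sched_in_sketch(struct pqr_cache *state, uint32_t next_closid)
    {
        if (next_closid == state->closid)
            return;                      /* common case: no MSR write */
        /* wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, next_closid); */
        state->closid = next_closid;
    }

    int main(void)
    {
        struct pqr_cache state = { .rmid = 5, .closid = 0 };

        printf("PQR value for rmid=5 closid=2: 0x%llx\n",
               (unsigned long long)pqr_assoc_val(5, 2));  /* 0x200000005 */
        sched_in_sketch(&state, 2);   /* first switch: would write the MSR */
        sched_in_sketch(&state, 2);   /* cached hit: no write */
        return 0;
    }

With the cache in place, back-to-back switches between tasks of the same cgroup never touch the MSR, which is what keeps the context-switch hot path cheap.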
* Re: [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT 2015-08-06 21:55 ` [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa @ 2015-08-06 23:03 ` Andy Lutomirski 2015-08-07 18:52 ` Vikas Shivappa 0 siblings, 1 reply; 30+ messages in thread From: Andy Lutomirski @ 2015-08-06 23:03 UTC (permalink / raw) To: Vikas Shivappa, vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva On 08/06/2015 02:55 PM, Vikas Shivappa wrote: > Adds support for IA32_PQR_ASSOC MSR writes during task scheduling. For > Cache Allocation, MSR write would let the task fill in the cache > 'subset' represented by the task's intel_rdt cgroup cache_mask. > > The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the > CLOSid. During context switch kernel implements this by writing the > CLOSid of the cgroup to which the task belongs to the CPU's > IA32_PQR_ASSOC MSR. > > This patch also implements a common software cache for IA32_PQR_MSR > (RMID 0:9, CLOSId 32:63) to be used by both Cache monitoring (CMT) and > Cache allocation. CMT updates the RMID where as cache_alloc updates the > CLOSid in the software cache. During scheduling when the new RMID/CLOSid > value is different from the cached values, IA32_PQR_MSR is updated. > Since the measured rdmsr latency for IA32_PQR_MSR is very high (~250 > cycles) this software cache is necessary to avoid reading the MSR to > compare the current CLOSid value. > > The following considerations are done for the PQR MSR write so that it > minimally impacts scheduler hot path: > - This path does not exist on any non-intel platforms. > - On Intel platforms, this would not exist by default unless CGROUP_RDT > is enabled. > - remains a no-op when CGROUP_RDT is enabled and intel SKU does not > support the feature. > - When feature is available and enabled, never does MSR write till the > user manually creates a cgroup directory *and* assigns a cache_mask > different from root cgroup directory. Since the child node inherits > the parents cache mask, by cgroup creation there is no scheduling hot > path impact from the new cgroup. > - MSR write is only done when there is a task with different Closid is > scheduled on the CPU. Typically if the task groups are bound to be > scheduled on a set of CPUs, the number of MSR writes is greatly > reduced. > - A per CPU cache of CLOSids is maintained to do the check so that we > dont have to do a rdmsr which actually costs a lot of cycles. > - For cgroup directories having same cache_mask the CLOSids are reused. > This minimizes the number of CLOSids used and hence reduces the MSR > write frequency. What happens if a user process sets a painfully restrictive CLOS and then spends most of its time in the kernel doing work on behalf of unrelated tasks? Does performance suck? --Andy ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT 2015-08-06 23:03 ` Andy Lutomirski @ 2015-08-07 18:52 ` Vikas Shivappa 0 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-07 18:52 UTC (permalink / raw) To: Andy Lutomirski Cc: Vikas Shivappa, vikas.shivappa, linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva On Thu, 6 Aug 2015, Andy Lutomirski wrote: > On 08/06/2015 02:55 PM, Vikas Shivappa wrote: >> Adds support for IA32_PQR_ASSOC MSR writes during task scheduling. For >> Cache Allocation, MSR write would let the task fill in the cache >> 'subset' represented by the task's intel_rdt cgroup cache_mask. >> >> The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the >> CLOSid. During context switch kernel implements this by writing the >> CLOSid of the cgroup to which the task belongs to the CPU's >> IA32_PQR_ASSOC MSR. >> >> This patch also implements a common software cache for IA32_PQR_MSR >> (RMID 0:9, CLOSId 32:63) to be used by both Cache monitoring (CMT) and >> Cache allocation. CMT updates the RMID where as cache_alloc updates the >> CLOSid in the software cache. During scheduling when the new RMID/CLOSid >> value is different from the cached values, IA32_PQR_MSR is updated. >> Since the measured rdmsr latency for IA32_PQR_MSR is very high (~250 >> cycles) this software cache is necessary to avoid reading the MSR to >> compare the current CLOSid value. >> >> The following considerations are done for the PQR MSR write so that it >> minimally impacts scheduler hot path: >> - This path does not exist on any non-intel platforms. >> - On Intel platforms, this would not exist by default unless CGROUP_RDT >> is enabled. >> - remains a no-op when CGROUP_RDT is enabled and intel SKU does not >> support the feature. >> - When feature is available and enabled, never does MSR write till the >> user manually creates a cgroup directory *and* assigns a cache_mask >> different from root cgroup directory. Since the child node inherits >> the parents cache mask, by cgroup creation there is no scheduling hot >> path impact from the new cgroup. >> - MSR write is only done when there is a task with different Closid is >> scheduled on the CPU. Typically if the task groups are bound to be >> scheduled on a set of CPUs, the number of MSR writes is greatly >> reduced. >> - A per CPU cache of CLOSids is maintained to do the check so that we >> dont have to do a rdmsr which actually costs a lot of cycles. >> - For cgroup directories having same cache_mask the CLOSids are reused. >> This minimizes the number of CLOSids used and hence reduces the MSR >> write frequency. > > What happens if a user process sets a painfully restrictive CLOS The patches currently lets the system admin/root user configure the cache allocation for threads using cgroup. user process cant decide for itself. Thanks, Vikas and then > spends most of its time in the kernel doing work on behalf of unrelated > tasks? Does performance suck? > > --Andy > ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 8/9] x86/intel_rdt: Hot cpu support for Cache Allocation 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa ` (6 preceding siblings ...) 2015-08-06 21:55 ` [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 2015-08-06 21:55 ` [PATCH 9/9] x86/intel_rdt: Intel haswell Cache Allocation enumeration Vikas Shivappa 8 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa This patch adds hot cpu support for Intel Cache allocation. Support includes updating the cache bitmask MSRs IA32_L3_QOS_n when a new CPU package comes online. The IA32_L3_QOS_n MSRs are one per Class of service on each CPU package. The new package's MSRs are synchronized with the values of existing MSRs. Also the software cache for IA32_PQR_ASSOC MSRs are reset during hot cpu notifications. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- arch/x86/kernel/cpu/intel_rdt.c | 95 ++++++++++++++++++++++++++++++++++++++--- 1 file changed, 90 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 06cba8da..f151200 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -25,6 +25,7 @@ #include <linux/slab.h> #include <linux/err.h> #include <linux/spinlock.h> +#include <linux/cpu.h> #include <asm/intel_rdt.h> /* @@ -40,6 +41,11 @@ struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE; * Mask of CPUs for writing CBM values. We only need one CPU per-socket. */ static cpumask_t rdt_cpumask; +/* + * Temporary cpumask used during hot cpu notificaiton handling. The usage + * is serialized by hot cpu locks. + */ +static cpumask_t tmp_cpumask; #define rdt_for_each_child(pos_css, parent_ir) \ css_for_each_child((pos_css), &(parent_ir)->css) @@ -311,13 +317,86 @@ out: return err; } -static inline void rdt_cpumask_update(int cpu) +static inline bool rdt_cpumask_update(int cpu) { - static cpumask_t tmp; - - cpumask_and(&tmp, &rdt_cpumask, topology_core_cpumask(cpu)); - if (cpumask_empty(&tmp)) + cpumask_and(&tmp_cpumask, &rdt_cpumask, topology_core_cpumask(cpu)); + if (cpumask_empty(&tmp_cpumask)) { cpumask_set_cpu(cpu, &rdt_cpumask); + return true; + } + + return false; +} + +/* + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs + * which are one per CLOSid except IA32_L3_MASK_0 on the current package. + */ +static void cbm_update_msrs(void *info) +{ + int maxid = boot_cpu_data.x86_cache_max_closid; + unsigned int i; + + /* + * At cpureset, all bits of IA32_L3_MASK_n are set. + * The index starts from one as there is no need + * to update IA32_L3_MASK_0 as it belongs to root cgroup + * whose cache mask is all 1s always. 
+ */ + for (i = 1; i < maxid; i++) { + if (cctable[i].clos_refcnt) + cbm_cpu_update((void *)i); + } +} + +static inline void intel_rdt_cpu_start(int cpu) +{ + struct intel_pqr_state *state = &per_cpu(pqr_state, cpu); + + state->closid = 0; + mutex_lock(&rdt_group_mutex); + if (rdt_cpumask_update(cpu)) + smp_call_function_single(cpu, cbm_update_msrs, NULL, 1); + mutex_unlock(&rdt_group_mutex); +} + +static void intel_rdt_cpu_exit(unsigned int cpu) +{ + int i; + + mutex_lock(&rdt_group_mutex); + if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) { + mutex_unlock(&rdt_group_mutex); + return; + } + + cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask); + cpumask_clear_cpu(cpu, &tmp_cpumask); + i = cpumask_any(&tmp_cpumask); + + if (i < nr_cpu_ids) + cpumask_set_cpu(i, &rdt_cpumask); + mutex_unlock(&rdt_group_mutex); +} + +static int intel_rdt_cpu_notifier(struct notifier_block *nb, + unsigned long action, void *hcpu) +{ + unsigned int cpu = (unsigned long)hcpu; + + switch (action) { + case CPU_DOWN_FAILED: + case CPU_ONLINE: + intel_rdt_cpu_start(cpu); + break; + case CPU_DOWN_PREPARE: + intel_rdt_cpu_exit(cpu); + break; + default: + break; + } + + return NOTIFY_OK; } static int __init intel_rdt_late_init(void) @@ -355,9 +434,15 @@ static int __init intel_rdt_late_init(void) cct->l3_cbm = (1ULL << max_cbm_len) - 1; cct->clos_refcnt = 1; + cpu_notifier_register_begin(); + for_each_online_cpu(i) rdt_cpumask_update(i); + __hotcpu_notifier(intel_rdt_cpu_notifier, 0); + + cpu_notifier_register_done(); + static_key_slow_inc(&rdt_enable_key); pr_info("Intel cache allocation enabled\n"); out_err: -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
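Restating the bookkeeping this patch performs, outside the notifier machinery: a CPU that comes online is added to rdt_cpumask only if no other CPU of its package is already there, and only in that case are the package's IA32_L3_MASK_n MSRs rewritten from the CLOSid table. The self-contained simulation below is illustrative only; the cpu-to-package map and the function name are made up, and the real code uses topology_core_cpumask() and smp_call_function_single() as shown in the diff.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 8

    /* Hypothetical cpu -> package map and "designated CPU" bookkeeping. */
    static int cpu_package[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
    static bool designated[NR_CPUS];     /* plays the role of rdt_cpumask */

    /* Returns true when @cpu becomes its package's designated CPU, i.e. the
     * first online CPU of a package: only then do that package's
     * IA32_L3_MASK_n MSRs need to be resynchronized from the CLOSid table. */
    static bool rdt_cpumask_update_sim(int cpu)
    {
        for (int i = 0; i < NR_CPUS; i++)
            if (designated[i] && cpu_package[i] == cpu_package[cpu])
                return false;
        designated[cpu] = true;
        return true;
    }

    int main(void)
    {
        printf("cpu4 first of its package? %d\n", rdt_cpumask_update_sim(4)); /* 1 */
        printf("cpu5 first of its package? %d\n", rdt_cpumask_update_sim(5)); /* 0 */
        return 0;
    }

On CPU_DOWN_PREPARE the inverse happens: if the departing CPU was the package's designated one, another online sibling is promoted so that each package keeps exactly one entry in rdt_cpumask.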
* [PATCH 9/9] x86/intel_rdt: Intel haswell Cache Allocation enumeration 2015-08-06 21:55 [PATCH V13 0/9] Intel cache allocation and Hot cpu handling changes to cqm, rapl Vikas Shivappa ` (7 preceding siblings ...) 2015-08-06 21:55 ` [PATCH 8/9] x86/intel_rdt: Hot cpu support for Cache Allocation Vikas Shivappa @ 2015-08-06 21:55 ` Vikas Shivappa 8 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-06 21:55 UTC (permalink / raw) To: vikas.shivappa Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa This patch is specific to Intel haswell (hsw) server SKUs. Cache Allocation on hsw server needs to be enumerated separately as HSW does not have support for CPUID enumeration for Cache Allocation. This patch does a probe by writing a CLOSid (Class of service id) into high 32 bits of IA32_PQR_MSR and see if the bits stick. The probe is only done after confirming that the CPU is HSW server. Other hardcoded values are: - L3 cache bit mask must be at least two bits. - Maximum CLOSids supported is always 4. - Maximum bits support in cache bit mask is always 20. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- arch/x86/kernel/cpu/intel_rdt.c | 59 +++++++++++++++++++++++++++++++++++++++-- 1 file changed, 57 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index f151200..fda31c2 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -38,6 +38,11 @@ struct intel_rdt rdt_root_group; struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE; /* + * Minimum bits required in Cache bitmask. + */ +static unsigned int min_bitmask_len = 1; + +/* * Mask of CPUs for writing CBM values. We only need one CPU per-socket. */ static cpumask_t rdt_cpumask; @@ -50,6 +55,56 @@ static cpumask_t tmp_cpumask; #define rdt_for_each_child(pos_css, parent_ir) \ css_for_each_child((pos_css), &(parent_ir)->css) +/* + * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs + * as it does not have CPUID enumeration support for Cache allocation. + * + * Probes by writing to the high 32 bits(CLOSid) of the IA32_PQR_MSR and + * testing if the bits stick. Max CLOSids is always 4 and max cbm length + * is always 20 on hsw server parts. The minimum cache bitmask length + * allowed for HSW server is always 2 bits. Hardcode all of them. + */ +static inline bool cache_alloc_hsw_probe(void) +{ + u32 l, h_old, h_new, h_tmp; + + if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old)) + return false; + + /* + * Default value is always 0 if feature is present. + */ + h_tmp = h_old ^ 0x1U; + if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) || + rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new)) + return false; + + if (h_tmp != h_new) + return false; + + wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old); + + boot_cpu_data.x86_cache_max_closid = 4; + boot_cpu_data.x86_cache_max_cbm_len = 20; + min_bitmask_len = 2; + + return true; +} + +static inline bool cache_alloc_supported(struct cpuinfo_x86 *c) +{ + if (cpu_has(c, X86_FEATURE_CAT_L3)) + return true; + + /* + * Probe for Haswell server CPUs. 
+ */ + if (c->x86 == 0x6 && c->x86_model == 0x3f) + return cache_alloc_hsw_probe(); + + return false; +} + static inline void closid_get(u32 closid) { struct clos_cbm_table *cct = &cctable[closid]; @@ -160,7 +215,7 @@ static inline bool cbm_is_contiguous(unsigned long var) unsigned long maxcbm = MAX_CBM_LENGTH; unsigned long first_bit, zero_bit; - if (!var) + if (bitmap_weight(&var, maxcbm) < min_bitmask_len) return false; first_bit = find_first_bit(&var, maxcbm); @@ -406,7 +461,7 @@ static int __init intel_rdt_late_init(void) u32 maxid, max_cbm_len; int err = 0, size, i; - if (!cpu_has(c, X86_FEATURE_CAT_L3)) { + if (!cache_alloc_supported(c)) { rdt_root_group.css.ss->disabled = 1; return -ENODEV; } -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
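Since Haswell server parts lack CPUID enumeration for Cache Allocation, the probe above writes a value into the CLOSid field of IA32_PQR_ASSOC and checks whether it reads back. The user-space simulation below restates that "does the bit stick" idea against a fake MSR; it is illustrative only, and the fake_* helpers are invented for the example.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Fake MSR model: when the feature is absent, writes to the high 32 bits
     * (the CLOSid field) of PQR_ASSOC simply do not stick. */
    static uint64_t fake_pqr;
    static bool fake_has_cat;

    static void fake_wrmsr(uint32_t lo, uint32_t hi)
    {
        fake_pqr = fake_has_cat ? ((uint64_t)hi << 32) | lo : lo;
    }

    static void fake_rdmsr(uint32_t *lo, uint32_t *hi)
    {
        *lo = (uint32_t)fake_pqr;
        *hi = (uint32_t)(fake_pqr >> 32);
    }

    /* Same idea as cache_alloc_hsw_probe(): flip a bit in the CLOSid field,
     * read it back, restore the old value, and report whether it stuck. */
    static bool probe_sim(void)
    {
        uint32_t lo, hi_old, hi_new;

        fake_rdmsr(&lo, &hi_old);
        fake_wrmsr(lo, hi_old ^ 0x1U);
        fake_rdmsr(&lo, &hi_new);
        fake_wrmsr(lo, hi_old);          /* restore the original CLOSid */
        return hi_new == (hi_old ^ 0x1U);
    }

    int main(void)
    {
        fake_has_cat = true;
        printf("probe with CAT:    %d\n", probe_sim());  /* 1 */
        fake_has_cat = false;
        printf("probe without CAT: %d\n", probe_sim());  /* 0 */
        return 0;
    }

The restore step matters in the real probe as well: it must not leave a stray CLOSid programmed in the MSR if the test value happened to stick.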
* [PATCH V12 0/9] Hot cpu handling changes to cqm, rapl and Intel Cache Allocation support @ 2015-07-01 22:21 Vikas Shivappa 2015-07-01 22:21 ` [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide Vikas Shivappa 0 siblings, 1 reply; 30+ messages in thread From: Vikas Shivappa @ 2015-07-01 22:21 UTC (permalink / raw) To: linux-kernel Cc: vikas.shivappa, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa This patch has some changes to hot cpu handling code in existing cache monitoring and RAPL kernel code. This improves hot cpu notification handling by not looping through all online cpus which could be expensive in large systems. Cache allocation patches(dependent on prep patches) adds a cgroup subsystem to support the new Cache Allocation feature found in future Intel Xeon Intel processors. Cache Allocation is a sub-feature with in Resource Director Technology(RDT) feature. RDT which provides support to control sharing of platform resources like L3 cache. Cache Allocation Technology provides a way for the Software (OS/VMM) to restrict cache allocation to a defined 'subset' of cache which may be overlapping with other 'subsets'. This feature is used when allocating a line in cache ie when pulling new data into the cache. The programming of the h/w is done via programming MSRs. The patch series support to perform L3 cache allocation. In todays new processors the number of cores is continuously increasing which in turn increase the number of threads or workloads that can simultaneously be run. When multi-threaded applications run concurrently, they compete for shared resources including L3 cache. At times, this L3 cache resource contention may result in inefficient space utilization. For example a higher priority thread may end up with lesser L3 cache resource or a cache sensitive app may not get optimal cache occupancy thereby degrading the performance. Cache Allocation kernel patch helps provides a framework for sharing L3 cache so that users can allocate the resource according to set requirements. More information about the feature can be found in the Intel SDM, Volume 3 section 17.15. SDM does not yet use the 'RDT' term yet and it is planned to be changed at a later time. *All the patches will apply on tip/perf/core*. Changes in v12: - From Matt's feedback replaced static cpumask_t tmp with function scope at multiple locations to static cpumask_t tmp_cpumask for the whole file. This is a temporary mask used during handling of hot cpu notifications in cqm/rapl and rdt code(1/9,2/9 and 8/9). Although all the usage was serialized by hot cpu locking this makes it more readable. Changes in V11: As per feedback from Thomas and discussions: - removed the cpumask_any_online_but.its usage could be easily replaced with 'and'ing the cpu_online mask during hot cpu notifications. Thomas pointed the API had issue where there tmp mask wasnt thread safe. I realized the support it indends to give does not seem to match with others in cpumask.h - the cqm patch which added mutex to hot cpu notification was merged with the cqm hot plug patch to improve notificaiton handling without commit logs and wasnt correct. seperated and just sending the cqm hot plug patch and will send the mutex cqm patch seperately - fixed issues in the hot cpu rdt handling. Since the cpu_starting was replaced with cpu_online , now the wrmsr needs to be actually scheduled on the target cpu - which the previous patch wasnt doing. 
Replaced the cpu_dead with cpu_down_prepare. the cpu_down_failed is handled the same way as cpu_online. By waiting till cpu_dead to update the rdt_cpumask , we may miss some of the msr updates. Changes in V10: - changed the hot cpu notification we handle in cqm and cache allocation to cpu_online and cpu_dead and removed others as the cpu_*_prepare also had corresponding cancel notification which we did not handle. - changed the file in rdt cgroup to l3_cache_mask to represent that its for l3 cache. Changes as per Thomas and PeterZ feedback: - fixed the cpumask declarations in cpumask.h and rdt,cmt and rapl to have static so that they burden stack space when large. - removed mutex in cpu_starting notifications, replaced the locking with cpu_online. - changed name from hsw_probetest to cache_alloc_hsw_probe. - changed x86_rdt_max_closid to x86_cache_max_closid and x86_rdt_max_cbm_len to x86_cache_max_cbm_len as they are only related to cache allocation and not to all rdt. Changes in V9: Changes made as per Thomas feedback: - added a comment where we call schedule in code only when RDT is enabled. - Reordered the local declarations to follow convention in intel_cqm_xchg_rmid Changes in V8: Thanks to feedback from Thomas and following changes are made based on his feedback: Generic changes/Preparatory patches: -added a new cpumask_any_online_but which returns the next core sibling that is online. -Made changes in Intel Cache monitoring and Intel RAPL(Running average power limit) code to use the new function above to find the next cpu that can be a designated reader for the package. Also changed the way the package masks are computed which can be simplified using topology_core_cpumask. Cache allocation specific changes: -Moved the documentation to the begining of the patch series. -Added more documentation for the rdt cgroup files in the documentation. -Changed the dmesg output when cache alloc is enabled to be more helpful and updated few other comments to be better readable. -removed __ prefix to functions like clos_get which were not following convention. -added code to take action on a WARN_ON in clos_put. Made a few other changes to reduce code text. -updated better readable/Kernel doc format comments for the call to rdt_css_alloc, datastructures . -removed cgroup_init -changed the names of functions to only have intel_ prefix for external APIs. -replaced (void *)&closid with (void *)closid when calling on_each_cpu_mask -fixed the reference release of closid during cache bitmask write. -changed the code to not ignore a cache mask which has bits set outside of the max bits allowed. It returns an error instead. -replaced bitmap_set(&max_mask, 0, max_cbm_len) with max_mask = (1ULL << max_cbm) - 1. - update the rdt_cpu_mask which has one cpu for each package, using topology_core_cpumask instead of looping through existing rdt_cpu_mask. Realized topology_core_cpumask name is misleading and it actually returns the cores in a cpu package! -arranged the code better to have the code relating to similar task together. -Improved searching for the next online cpu sibling and maintaining the rdt_cpu_mask which has one cpu per package. -removed the unnecessary wrapper rdt_enabled. -removed unnecessary spin lock and rculock in the scheduling code. -merged all scheduling code into one patch not seperating the RDT common software cache code. 
Changes in V7: Based on feedback from PeterZ and Matt and following discussions : - changed lot of naming to reflect the data structures which are common to RDT and specific to Cache allocation. - removed all usage of 'cat'. replace with more friendly cache allocation - fixed lot of convention issues (whitespace, return paradigm etc) - changed the scheduling hook for RDT to not use a inline. - removed adding new scheduling hook and just reused the existing one similar to perf hook. Changes in V6: - rebased to 4.1-rc1 which has the CMT(cache monitoring) support included. - (Thanks to Marcelo's feedback).Fixed support for hot cpu handling for IA32_L3_QOS MSRs. Although during deep C states the MSR need not be restored this is needed when physically a new package is added. -some other coding convention changes including renaming to cache_mask using a refcnt to track the number of cgroups using a closid in clos_cbm map. -1b cbm support for non-hsw SKUs. HSW is an exception which needs the cache bit masks to be at least 2 bits. Changes in v5: - Added support to propagate the cache bit mask update for each package. - Removed the cache bit mask reference in the intel_rdt structure as there was no need for that and we already maintain a separate closid<->cbm mapping. - Made a few coding convention changes which include adding the assertion while freeing the CLOSID. Changes in V4: - Integrated with the latest V5 CMT patches. - Changed naming of cgroup to rdt(resource director technology) from cat(cache allocation technology). This was done as the RDT is the umbrella term for platform shared resources allocation. Hence in future it would be easier to add resource allocation to the same cgroup - Naming changes also applied to a lot of other data structures/APIs. - Added documentation on cgroup usage for cache allocation to address a lot of questions from various academic and industry regarding cache allocation usage. Changes in V3: - Implements a common software cache for IA32_PQR_MSR - Implements support for hsw Cache Allocation enumeration. This does not use the brand strings like earlier version but does a probe test. The probe test is done only on hsw family of processors - Made a few coding convention, name changes - Check for lock being held when ClosID manipulation happens Changes in V2: - Removed HSW specific enumeration changes. Plan to include it later as a separate patch. - Fixed the code in prep_arch_switch to be specific for x86 and removed x86 defines. - Fixed cbm_write to not write all 1s when a cgroup is freed. - Fixed one possible memory leak in init. - Changed some of manual bitmap manipulation to use the predefined bitmap APIs to make code more readable - Changed name in sources from cqe to cat - Global cat enable flag changed to static_key and disabled cgroup early_init [PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling [PATCH 2/9] x86/intel_rapl: Modify hot cpu notification handling for [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup [PATCH 4/9] x86/intel_rdt: Add support for Cache Allocation detection [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service [PATCH 6/9] x86/intel_rdt: Add support for cache bit mask management [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT [PATCH 8/9] x86/intel_rdt: Hot cpu support for Cache Allocation [PATCH 9/9] x86/intel_rdt: Intel haswell Cache Allocation enumeration ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-01 22:21 [PATCH V12 0/9] Hot cpu handling changes to cqm, rapl and Intel Cache Allocation support Vikas Shivappa @ 2015-07-01 22:21 ` Vikas Shivappa 2015-07-28 14:54 ` Peter Zijlstra 2015-07-28 23:15 ` Marcelo Tosatti 0 siblings, 2 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-07-01 22:21 UTC (permalink / raw) To: linux-kernel Cc: vikas.shivappa, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa Adds a description of Cache allocation technology, overview of kernel implementation and usage of Cache Allocation cgroup interface. Cache allocation is a sub-feature of Resource Director Technology(RDT) Allocation or Platform Shared resource control which provides support to control Platform shared resources like L3 cache. Currently L3 Cache is the only resource that is supported in RDT. More information can be found in the Intel SDM, Volume 3, section 17.15. Cache Allocation Technology provides a way for the Software (OS/VMM) to restrict cache allocation to a defined 'subset' of cache which may be overlapping with other 'subsets'. This feature is used when allocating a line in cache ie when pulling new data into the cache. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 215 insertions(+) create mode 100644 Documentation/cgroups/rdt.txt diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt new file mode 100644 index 0000000..dfff477 --- /dev/null +++ b/Documentation/cgroups/rdt.txt @@ -0,0 +1,215 @@ + RDT + --- + +Copyright (C) 2014 Intel Corporation +Written by vikas.shivappa@linux.intel.com +(based on contents and format from cpusets.txt) + +CONTENTS: +========= + +1. Cache Allocation Technology + 1.1 What is RDT and Cache allocation ? + 1.2 Why is Cache allocation needed ? + 1.3 Cache allocation implementation overview + 1.4 Assignment of CBM and CLOS + 1.5 Scheduling and Context Switch +2. Usage Examples and Syntax + +1. Cache Allocation Technology(Cache allocation) +=================================== + +1.1 What is RDT and Cache allocation +------------------------------------ + +Cache allocation is a sub-feature of Resource Director Technology(RDT) +Allocation or Platform Shared resource control which provides support to +control Platform shared resources like L3 cache. Currently L3 Cache is +the only resource that is supported in RDT. More information can be +found in the Intel SDM, Volume 3, section 17.15. + +Cache Allocation Technology provides a way for the Software (OS/VMM) +to restrict cache allocation to a defined 'subset' of cache which may +be overlapping with other 'subsets'. This feature is used when +allocating a line in cache ie when pulling new data into the cache. +The programming of the h/w is done via programming MSRs. + +The different cache subsets are identified by CLOS identifier (class +of service) and each CLOS has a CBM (cache bit mask). The CBM is a +contiguous set of bits which defines the amount of cache resource that +is available for each 'subset'. + +1.2 Why is Cache allocation needed +---------------------------------- + +In todays new processors the number of cores is continuously increasing, +especially in large scale usage models where VMs are used like +webservers and datacenters. 
The number of cores increase the number +of threads or workloads that can simultaneously be run. When +multi-threaded-applications, VMs, workloads run concurrently they +compete for shared resources including L3 cache. + +The Cache allocation enables more cache resources to be made available +for higher priority applications based on guidance from the execution +environment. + +The architecture also allows dynamically changing these subsets during +runtime to further optimize the performance of the higher priority +application with minimal degradation to the low priority app. +Additionally, resources can be rebalanced for system throughput benefit. + +This technique may be useful in managing large computer systems which +large L3 cache. Examples may be large servers running instances of +webservers or database servers. In such complex systems, these subsets +can be used for more careful placing of the available cache +resources. + +1.3 Cache allocation implementation Overview +-------------------------------------------- + +Kernel implements a cgroup subsystem to support cache allocation. + +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal +to the kernel and not exposed to user. Each cgroup would have one CBM +and would just represent one cache 'subset'. + +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the +cgroup never fails. When a child cgroup is created it inherits the +CLOSid and the CBM from its parent. When a user changes the default +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once +the kernel runs out of maximum CLOSids it can support. +User can create as many cgroups as he wants but having different CBMs +at the same time is restricted by the maximum number of CLOSids +(multiple cgroups can have the same CBM). +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter +for each cgroup using a CLOSid. + +The tasks in the cgroup would get to fill the L3 cache represented by +the cgroup's 'l3_cache_mask' file. + +Root directory would have all available bits set in 'l3_cache_mask' file +by default. + +Each RDT cgroup directory has the following files. Some of them may be a +part of common RDT framework or be specific to RDT sub-features like +cache allocation. + + - intel_rdt.l3_cache_mask: The cache bitmask(CBM) is represented by this + file. The bitmask must be contiguous and would have a 1 or 2 bit + minimum length. + +1.4 Assignment of CBM,CLOS +-------------------------- + +The 'l3_cache_mask' needs to be a subset of the parent node's +'l3_cache_mask'. Any contiguous subset of these bits(with a minimum of 2 +bits on hsw SKUs) maybe set to indicate the cache mapping desired. The +'l3_cache_mask' between 2 directories can overlap. The 'l3_cache_mask' would +represent the cache 'subset' of the Cache allocation cgroup. For ex: on +a system with 16 bits of max cbm bits, if the directory has the least +significant 4 bits set in its 'l3_cache_mask' file(meaning the 'l3_cache_mask' +is just 0xf), it would be allocated the right quarter of the Last level +cache which means the tasks belonging to this Cache allocation cgroup +can use the right quarter of the cache to fill. If it +has the most significant 8 bits set ,it would be allocated the left +half of the cache(8 bits out of 16 represents 50%). 
+ +The cache portion defined in the CBM file is available to all tasks +within the cgroup to fill and these task are not allowed to allocate +space in other parts of the cache. + +1.5 Scheduling and Context Switch +--------------------------------- + +During context switch kernel implements this by writing the +CLOSid (internally maintained by kernel) of the cgroup to which the +task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written +when there is a change in the CLOSid for the CPU in order to minimize +the latency incurred during context switch. + +The following considerations are done for the PQR MSR write so that it +has minimal impact on scheduling hot path: +- This path doesnt exist on any non-intel platforms. +- On Intel platforms, this would not exist by default unless CGROUP_RDT +is enabled. +- remains a no-op when CGROUP_RDT is enabled and intel hardware does not +support the feature. +- When feature is available, still remains a no-op till the user +manually creates a cgroup *and* assigns a new cache mask. Since the +child node inherits the parents cache mask , by cgroup creation there is +no scheduling hot path impact from the new cgroup. +- per cpu PQR values are cached and the MSR write is only done when +there is a task with different PQR is scheduled on the CPU. Typically if +the task groups are bound to be scheduled on a set of CPUs , the number +of MSR writes is greatly reduced. + +2. Usage examples and syntax +============================ + +To check if Cache allocation was enabled on your system + +dmesg | grep -i intel_rdt +should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx +the length of l3_cache_mask and CLOS should depend on the system you use. + +Also /proc/cpuinfo would have rdt(if rdt is enabled) and cat_l3( if L3 + cache allocation is enabled). + +Following would mount the cache allocation cgroup subsystem and create +2 directories. Please refer to Documentation/cgroups/cgroups.txt on +details about how to use cgroups. + + cd /sys/fs/cgroup + mkdir rdt + mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt + cd rdt + +Create 2 rdt cgroups + + mkdir group1 + mkdir group2 + +Following are some of the Files in the directory + + ls + rdt.l3_cache_mask + tasks + +Say if the cache is 2MB and cbm supports 16 bits, then setting the +below allocates the 'right 1/4th(512KB)' of the cache to group2 + +Edit the CBM for group2 to set the least significant 4 bits. This +allocates 'right quarter' of the cache. + + cd group2 + /bin/echo 0xf > rdt.l3_cache_mask + + +Edit the CBM for group2 to set the least significant 8 bits.This +allocates the right half of the cache to 'group2'. + + cd group2 + /bin/echo 0xff > rdt.l3_cache_mask + +Assign tasks to the group2 + + /bin/echo PID1 > tasks + /bin/echo PID2 > tasks + + Meaning now threads + PID1 and PID2 get to fill the 'right half' of + the cache as the belong to cgroup group2. + +Create a group under group2 + + cd group2 + mkdir group21 + cat rdt.l3_cache_mask + 0xff - inherits parents mask. + + /bin/echo 0xfff > rdt.l3_cache_mask - throws error as mask has to parent's mask's subset + +In order to restrict RDT cgroups to specific set of CPUs rdt can be +comounted with cpusets. -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
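The arithmetic in the usage example above (a 4-bit mask out of 16 CBM bits selecting roughly a quarter of a 2MB L3, i.e. about 512KB) can be checked with a small stand-alone program. The cache size and mask length below are assumptions taken from the document's own example, and the contiguity helper only restates, in user-space form, what cbm_is_contiguous() in this series enforces.

    #include <stdint.h>
    #include <stdio.h>

    /* A CBM must be a single contiguous run of set bits, as the document
     * requires. */
    static int cbm_is_contiguous(uint32_t cbm)
    {
        if (!cbm)
            return 0;
        while (!(cbm & 1))               /* drop trailing zeros */
            cbm >>= 1;
        return ((cbm + 1) & cbm) == 0;   /* contiguous mask + 1 is a power of two */
    }

    int main(void)
    {
        uint32_t cbm = 0xf;              /* least significant 4 of 16 bits */
        unsigned int max_cbm_len = 16;   /* assumed, from the doc's example */
        unsigned int cache_kb = 2048;    /* assumed 2MB L3, from the doc's example */
        unsigned int bits = __builtin_popcount(cbm);

        printf("contiguous: %d, share: %u/%u bits, ~%u KB\n",
               cbm_is_contiguous(cbm), bits, max_cbm_len,
               cache_kb * bits / max_cbm_len);   /* prints ~512 KB */
        return 0;
    }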
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-01 22:21 ` [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide Vikas Shivappa @ 2015-07-28 14:54 ` Peter Zijlstra 2015-08-04 20:41 ` Vikas Shivappa 2015-07-28 23:15 ` Marcelo Tosatti 1 sibling, 1 reply; 30+ messages in thread From: Peter Zijlstra @ 2015-07-28 14:54 UTC (permalink / raw) To: Vikas Shivappa Cc: linux-kernel, vikas.shivappa, x86, hpa, tglx, mingo, tj, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: Please edit this document to have consistent spacing. Its really hard to read this. Every time I spot a misplaced space my brain stumbles and I need to restart. > diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt > new file mode 100644 > index 0000000..dfff477 > --- /dev/null > +++ b/Documentation/cgroups/rdt.txt > @@ -0,0 +1,215 @@ > + RDT > + --- > + > +Copyright (C) 2014 Intel Corporation > +Written by vikas.shivappa@linux.intel.com > +(based on contents and format from cpusets.txt) > + > +CONTENTS: > +========= > + > +1. Cache Allocation Technology > + 1.1 What is RDT and Cache allocation ? > + 1.2 Why is Cache allocation needed ? > + 1.3 Cache allocation implementation overview > + 1.4 Assignment of CBM and CLOS > + 1.5 Scheduling and Context Switch > +2. Usage Examples and Syntax > + > +1. Cache Allocation Technology(Cache allocation) > +=================================== > + > +1.1 What is RDT and Cache allocation > +------------------------------------ > + > +Cache allocation is a sub-feature of Resource Director Technology(RDT) missing ' ' before the '('. > +Allocation or Platform Shared resource control which provides support to > +control Platform shared resources like L3 cache. Currently L3 Cache is Double ' ' after '.' -- which _can_ be correct, but is inconsistent throughout the document. > +the only resource that is supported in RDT. More information can be > +found in the Intel SDM, Volume 3, section 17.15. Please also include the SDM revision, like June 2015. In fact, in the June 2015 V3 17.15 is CQM, not CAT. > +Cache Allocation Technology provides a way for the Software (OS/VMM) > +to restrict cache allocation to a defined 'subset' of cache which may > +be overlapping with other 'subsets'. This feature is used when > +allocating a line in cache ie when pulling new data into the cache. > +The programming of the h/w is done via programming MSRs. Double ' ' before 'MSRs'. > +The different cache subsets are identified by CLOS identifier (class > +of service) and each CLOS has a CBM (cache bit mask). The CBM is a > +contiguous set of bits which defines the amount of cache resource that > +is available for each 'subset'. > + > +1.2 Why is Cache allocation needed > +---------------------------------- > + > +In todays new processors the number of cores is continuously increasing, > +especially in large scale usage models where VMs are used like > +webservers and datacenters. The number of cores increase the number Single ' ' after . > +of threads or workloads that can simultaneously be run. When > +multi-threaded-applications, VMs, workloads run concurrently they > +compete for shared resources including L3 cache. > + > +The Cache allocation enables more cache resources to be made available Double ' ' for no apparent reason. > +for higher priority applications based on guidance from the execution > +environment. 
> + > +The architecture also allows dynamically changing these subsets during > +runtime to further optimize the performance of the higher priority > +application with minimal degradation to the low priority app. > +Additionally, resources can be rebalanced for system throughput benefit. > + > +This technique may be useful in managing large computer systems which > +large L3 cache. Examples may be large servers running instances of Double ' ' > +webservers or database servers. In such complex systems, these subsets > +can be used for more careful placing of the available cache > +resources. > + > +1.3 Cache allocation implementation Overview > +-------------------------------------------- > + > +Kernel implements a cgroup subsystem to support cache allocation. > + > +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. No ' ' before '(' > +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal Idem, also, _no_ space after '.' > +to the kernel and not exposed to user. Each cgroup would have one CBM Double space after '.' > +and would just represent one cache 'subset'. > + > +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the I'm thinking the convention is ' ' _after_ ',', not before. > +cgroup never fails. When a child cgroup is created it inherits the > +CLOSid and the CBM from its parent. When a user changes the default > +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not > +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once > +the kernel runs out of maximum CLOSids it can support. > +User can create as many cgroups as he wants but having different CBMs > +at the same time is restricted by the maximum number of CLOSids > +(multiple cgroups can have the same CBM). > +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter Above you had ' ' around the arrows. > +for each cgroup using a CLOSid. > + > +The tasks in the cgroup would get to fill the L3 cache represented by > +the cgroup's 'l3_cache_mask' file. > + > +Root directory would have all available bits set in 'l3_cache_mask' file Random double ' ' > +by default. > + > +Each RDT cgroup directory has the following files. Some of them may be a > +part of common RDT framework or be specific to RDT sub-features like > +cache allocation. > + > + - intel_rdt.l3_cache_mask: The cache bitmask(CBM) is represented by this > + file. The bitmask must be contiguous and would have a 1 or 2 bit > + minimum length. > + > +1.4 Assignment of CBM,CLOS > +-------------------------- > + > +The 'l3_cache_mask' needs to be a subset of the parent node's > +'l3_cache_mask'. Any contiguous subset of these bits(with a minimum of 2 > +bits on hsw SKUs) maybe set to indicate the cache mapping desired. The > +'l3_cache_mask' between 2 directories can overlap. The 'l3_cache_mask' would > +represent the cache 'subset' of the Cache allocation cgroup. For ex: on > +a system with 16 bits of max cbm bits, if the directory has the least > +significant 4 bits set in its 'l3_cache_mask' file(meaning the 'l3_cache_mask' > +is just 0xf), it would be allocated the right quarter of the Last level > +cache which means the tasks belonging to this Cache allocation cgroup > +can use the right quarter of the cache to fill. If it > +has the most significant 8 bits set ,it would be allocated the left > +half of the cache(8 bits out of 16 represents 50%). Random whitespace again. Also try and limit paragraphs to 5-6 lines max. 
> + > + > +The cache portion defined in the CBM file is available to all tasks > +within the cgroup to fill and these task are not allowed to allocate > +space in other parts of the cache. > + > +1.5 Scheduling and Context Switch > +--------------------------------- > + > +During context switch kernel implements this by writing the > +CLOSid (internally maintained by kernel) of the cgroup to which the > +task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written > +when there is a change in the CLOSid for the CPU in order to minimize > +the latency incurred during context switch. > + > +The following considerations are done for the PQR MSR write so that it > +has minimal impact on scheduling hot path: > +- This path doesnt exist on any non-intel platforms. !x86 I think you mean, its entirely possible to have the code present on AMD systems for instance. > +- On Intel platforms, this would not exist by default unless CGROUP_RDT > +is enabled. You can enable this just fine on AMD machines. > +- remains a no-op when CGROUP_RDT is enabled and intel hardware does not > +support the feature. > +- When feature is available, still remains a no-op till the user > +manually creates a cgroup *and* assigns a new cache mask. Since the > +child node inherits the parents cache mask , by cgroup creation there is > +no scheduling hot path impact from the new cgroup. > +- per cpu PQR values are cached and the MSR write is only done when > +there is a task with different PQR is scheduled on the CPU. Typically if > +the task groups are bound to be scheduled on a set of CPUs , the number > +of MSR writes is greatly reduced. Aside from many instances of random whitespace, maybe also format like: - point; - multi line point; - another multi line thing. > + > +2. Usage examples and syntax > +============================ > + > +To check if Cache allocation was enabled on your system > + > +dmesg | grep -i intel_rdt $ dmesg | grep -i intel_rdt That is, whitespace before _and_ after _and_ indent, plus a prompt, to clarify its a command and not part of the text and weirdly formatted. > +should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx intel_rdt: Max bitmask length: xx Again, wrap in whitespace and indent to set apart. > +the length of l3_cache_mask and CLOS should depend on the system you use. > + > +Also /proc/cpuinfo would have rdt(if rdt is enabled) and cat_l3( if L3 Many more instances of random whitespace. > + cache allocation is enabled). > + > +Following would mount the cache allocation cgroup subsystem and create > +2 directories. Please refer to Documentation/cgroups/cgroups.txt on > +details about how to use cgroups. > + > + cd /sys/fs/cgroup > + mkdir rdt > + mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt > + cd rdt > + > +Create 2 rdt cgroups > + > + mkdir group1 > + mkdir group2 > + > +Following are some of the Files in the directory > + > + ls > + rdt.l3_cache_mask > + tasks > + See, here you do the whitespace and indent thing, but above you didn't. That kind of inconsistency just bugs the hell out of me. > +Say if the cache is 2MB and cbm supports 16 bits, then setting the > +below allocates the 'right 1/4th(512KB)' of the cache to group2 Another few random whitespace fails. > + > +Edit the CBM for group2 to set the least significant 4 bits. This > +allocates 'right quarter' of the cache. 
> + > + cd group2 > + /bin/echo 0xf > rdt.l3_cache_mask > + > + > +Edit the CBM for group2 to set the least significant 8 bits.This > +allocates the right half of the cache to 'group2'. > + > + cd group2 > + /bin/echo 0xff > rdt.l3_cache_mask > + > +Assign tasks to the group2 > + > + /bin/echo PID1 > tasks > + /bin/echo PID2 > tasks > + > + Meaning now threads > + PID1 and PID2 get to fill the 'right half' of > + the cache as the belong to cgroup group2. This doesn't want to be indented, right? > + > +Create a group under group2 > + > + cd group2 > + mkdir group21 > + cat rdt.l3_cache_mask > + 0xff - inherits parents mask. And this would show the use of the prompt ($), allows one to distinguish between commands and output. > + > + /bin/echo 0xfff > rdt.l3_cache_mask - throws error as mask has to parent's mask's subset I'm betting you don't actually want us to type the "- ..." bit? Either use a regular bash comment (#) to make it harmless, or format it differently. Because some poor sod is going to literally type that into his console and wonder WTF just happened. > + > +In order to restrict RDT cgroups to specific set of CPUs rdt can be > +comounted with cpusets. Either RDT is in capitals or it is not, but this is silly. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-28 14:54 ` Peter Zijlstra @ 2015-08-04 20:41 ` Vikas Shivappa 0 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-04 20:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, linux-kernel, vikas.shivappa, x86, hpa, tglx, mingo, tj, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva On Tue, 28 Jul 2015, Peter Zijlstra wrote: > On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: > > Please edit this document to have consistent spacing. Its really hard to > read this. Every time I spot a misplaced space my brain stumbles and I > need to restart. Will fix all the spacing and other indentation issues mentioned. Thanks for pointing them all out. Although the other documents I looked at don't follow a completely consistent format, which is what confused me, this format would be better. >> + >> +The following considerations are done for the PQR MSR write so that it >> +has minimal impact on scheduling hot path: >> +- This path doesnt exist on any non-intel platforms. > > !x86 I think you mean, its entirely possible to have the code present > on AMD systems for instance. > >> +- On Intel platforms, this would not exist by default unless CGROUP_RDT >> +is enabled. > > You can enable this just fine on AMD machines. The cache alloc code is under CPU_SUP_INTEL. Thanks, Vikas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-01 22:21 ` [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide Vikas Shivappa 2015-07-28 14:54 ` Peter Zijlstra @ 2015-07-28 23:15 ` Marcelo Tosatti 2015-07-29 0:06 ` Vikas Shivappa 1 sibling, 1 reply; 30+ messages in thread From: Marcelo Tosatti @ 2015-07-28 23:15 UTC (permalink / raw) To: Vikas Shivappa Cc: linux-kernel, vikas.shivappa, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: > Adds a description of Cache allocation technology, overview > of kernel implementation and usage of Cache Allocation cgroup interface. > > Cache allocation is a sub-feature of Resource Director Technology(RDT) > Allocation or Platform Shared resource control which provides support to > control Platform shared resources like L3 cache. Currently L3 Cache is > the only resource that is supported in RDT. More information can be > found in the Intel SDM, Volume 3, section 17.15. > > Cache Allocation Technology provides a way for the Software (OS/VMM) > to restrict cache allocation to a defined 'subset' of cache which may > be overlapping with other 'subsets'. This feature is used when > allocating a line in cache ie when pulling new data into the cache. > > Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> > --- > Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 215 insertions(+) > create mode 100644 Documentation/cgroups/rdt.txt > > diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt > new file mode 100644 > index 0000000..dfff477 > --- /dev/null > +++ b/Documentation/cgroups/rdt.txt > @@ -0,0 +1,215 @@ > + RDT > + --- > + > +Copyright (C) 2014 Intel Corporation > +Written by vikas.shivappa@linux.intel.com > +(based on contents and format from cpusets.txt) > + > +CONTENTS: > +========= > + > +1. Cache Allocation Technology > + 1.1 What is RDT and Cache allocation ? > + 1.2 Why is Cache allocation needed ? > + 1.3 Cache allocation implementation overview > + 1.4 Assignment of CBM and CLOS > + 1.5 Scheduling and Context Switch > +2. Usage Examples and Syntax > + > +1. Cache Allocation Technology(Cache allocation) > +=================================== > + > +1.1 What is RDT and Cache allocation > +------------------------------------ > + > +Cache allocation is a sub-feature of Resource Director Technology(RDT) > +Allocation or Platform Shared resource control which provides support to > +control Platform shared resources like L3 cache. Currently L3 Cache is > +the only resource that is supported in RDT. More information can be > +found in the Intel SDM, Volume 3, section 17.15. > + > +Cache Allocation Technology provides a way for the Software (OS/VMM) > +to restrict cache allocation to a defined 'subset' of cache which may > +be overlapping with other 'subsets'. This feature is used when > +allocating a line in cache ie when pulling new data into the cache. > +The programming of the h/w is done via programming MSRs. > + > +The different cache subsets are identified by CLOS identifier (class > +of service) and each CLOS has a CBM (cache bit mask). The CBM is a > +contiguous set of bits which defines the amount of cache resource that > +is available for each 'subset'. 
> + > +1.2 Why is Cache allocation needed > +---------------------------------- > + > +In todays new processors the number of cores is continuously increasing, > +especially in large scale usage models where VMs are used like > +webservers and datacenters. The number of cores increase the number > +of threads or workloads that can simultaneously be run. When > +multi-threaded-applications, VMs, workloads run concurrently they > +compete for shared resources including L3 cache. > + > +The Cache allocation enables more cache resources to be made available > +for higher priority applications based on guidance from the execution > +environment. > + > +The architecture also allows dynamically changing these subsets during > +runtime to further optimize the performance of the higher priority > +application with minimal degradation to the low priority app. > +Additionally, resources can be rebalanced for system throughput benefit. > + > +This technique may be useful in managing large computer systems which > +large L3 cache. Examples may be large servers running instances of > +webservers or database servers. In such complex systems, these subsets > +can be used for more careful placing of the available cache > +resources. > + > +1.3 Cache allocation implementation Overview > +-------------------------------------------- > + > +Kernel implements a cgroup subsystem to support cache allocation. > + > +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. > +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal > +to the kernel and not exposed to user. Each cgroup would have one CBM > +and would just represent one cache 'subset'. > + > +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the > +cgroup never fails. When a child cgroup is created it inherits the > +CLOSid and the CBM from its parent. When a user changes the default > +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not > +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once > +the kernel runs out of maximum CLOSids it can support. > +User can create as many cgroups as he wants but having different CBMs > +at the same time is restricted by the maximum number of CLOSids > +(multiple cgroups can have the same CBM). > +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter > +for each cgroup using a CLOSid. > + > +The tasks in the cgroup would get to fill the L3 cache represented by > +the cgroup's 'l3_cache_mask' file. > + > +Root directory would have all available bits set in 'l3_cache_mask' file > +by default. > + > +Each RDT cgroup directory has the following files. Some of them may be a > +part of common RDT framework or be specific to RDT sub-features like > +cache allocation. > + > + - intel_rdt.l3_cache_mask: The cache bitmask(CBM) is represented by this > + file. The bitmask must be contiguous and would have a 1 or 2 bit > + minimum length. > + > +1.4 Assignment of CBM,CLOS > +-------------------------- > + > +The 'l3_cache_mask' needs to be a subset of the parent node's > +'l3_cache_mask'. Any contiguous subset of these bits(with a minimum of 2 > +bits on hsw SKUs) maybe set to indicate the cache mapping desired. The > +'l3_cache_mask' between 2 directories can overlap. The 'l3_cache_mask' would > +represent the cache 'subset' of the Cache allocation cgroup. 
For ex: on > +a system with 16 bits of max cbm bits, if the directory has the least > +significant 4 bits set in its 'l3_cache_mask' file(meaning the 'l3_cache_mask' > +is just 0xf), it would be allocated the right quarter of the Last level > +cache which means the tasks belonging to this Cache allocation cgroup > +can use the right quarter of the cache to fill. If it > +has the most significant 8 bits set ,it would be allocated the left > +half of the cache(8 bits out of 16 represents 50%). > + > +The cache portion defined in the CBM file is available to all tasks > +within the cgroup to fill and these task are not allowed to allocate > +space in other parts of the cache. > + > +1.5 Scheduling and Context Switch > +--------------------------------- > + > +During context switch kernel implements this by writing the > +CLOSid (internally maintained by kernel) of the cgroup to which the > +task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written > +when there is a change in the CLOSid for the CPU in order to minimize > +the latency incurred during context switch. > + > +The following considerations are done for the PQR MSR write so that it > +has minimal impact on scheduling hot path: > +- This path doesnt exist on any non-intel platforms. > +- On Intel platforms, this would not exist by default unless CGROUP_RDT > +is enabled. > +- remains a no-op when CGROUP_RDT is enabled and intel hardware does not > +support the feature. > +- When feature is available, still remains a no-op till the user > +manually creates a cgroup *and* assigns a new cache mask. Since the > +child node inherits the parents cache mask , by cgroup creation there is > +no scheduling hot path impact from the new cgroup. > +- per cpu PQR values are cached and the MSR write is only done when > +there is a task with different PQR is scheduled on the CPU. Typically if > +the task groups are bound to be scheduled on a set of CPUs , the number > +of MSR writes is greatly reduced. > + > +2. Usage examples and syntax > +============================ > + > +To check if Cache allocation was enabled on your system > + > +dmesg | grep -i intel_rdt > +should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx > +the length of l3_cache_mask and CLOS should depend on the system you use. > + > +Also /proc/cpuinfo would have rdt(if rdt is enabled) and cat_l3( if L3 > + cache allocation is enabled). > + > +Following would mount the cache allocation cgroup subsystem and create > +2 directories. Please refer to Documentation/cgroups/cgroups.txt on > +details about how to use cgroups. > + > + cd /sys/fs/cgroup > + mkdir rdt > + mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt > + cd rdt > + > +Create 2 rdt cgroups > + > + mkdir group1 > + mkdir group2 > + > +Following are some of the Files in the directory > + > + ls > + rdt.l3_cache_mask > + tasks > + > +Say if the cache is 2MB and cbm supports 16 bits, then setting the > +below allocates the 'right 1/4th(512KB)' of the cache to group2 > + > +Edit the CBM for group2 to set the least significant 4 bits. This > +allocates 'right quarter' of the cache. > + > + cd group2 > + /bin/echo 0xf > rdt.l3_cache_mask > + > + > +Edit the CBM for group2 to set the least significant 8 bits.This > +allocates the right half of the cache to 'group2'. 
> + > + cd group2 > + /bin/echo 0xff > rdt.l3_cache_mask > + > +Assign tasks to the group2 > + > + /bin/echo PID1 > tasks > + /bin/echo PID2 > tasks > + > + Meaning now threads > + PID1 and PID2 get to fill the 'right half' of > + the cache as the belong to cgroup group2. > + > +Create a group under group2 > + > + cd group2 > + mkdir group21 > + cat rdt.l3_cache_mask > + 0xff - inherits parents mask. > + > + /bin/echo 0xfff > rdt.l3_cache_mask - throws error as mask has to parent's mask's subset > + > +In order to restrict RDT cgroups to specific set of CPUs rdt can be > +comounted with cpusets. > -- > 1.9.1 Vikas, Can you give an example of comounting with cpusets? What do you mean by restrict RDT cgroups to specific set of CPUs? Another limitation of this interface is that it assumes the task <-> control group assignment is pertinent, that is: | taskgroup, L3 policy|: | taskgroupA, 50% L3 exclusive |, | taskgroupB, 50% L3 |, | taskgroupC, 50% L3 |. Whenever taskgroup A is empty (that is no runnable task in it), you waste 50% of L3 cache. I think this problem and the similar problem of L3 reservation with CPU isolation can be solved in this way: whenever a task from cgroupE with exclusive way access is migrated to a new die, impose the exclusivity (by removing access to that way by other cgroups). Whenever cgroupE has zero tasks, remove exclusivity (by allowing other cgroups to use the exclusive ways of it). I'll cook a patch. ^ permalink raw reply [flat|nested] 30+ messages in thread
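A minimal sketch of what comounting intel_rdt with cpusets could look like under cgroup v1, to make the question above concrete. The mount options and the cpuset.cpus/cpuset.mems files are standard cgroup v1; the mount point name, the group name and the PID are only placeholders, and the exact spelling of the mask file (intel_rdt.l3_cache_mask as in section 1.3, or rdt.l3_cache_mask as in the usage example) follows whichever name the interface above actually exposes:

   mkdir /sys/fs/cgroup/rdt_cpuset
   mount -t cgroup -o cpuset,intel_rdt none /sys/fs/cgroup/rdt_cpuset
   cd /sys/fs/cgroup/rdt_cpuset
   mkdir group1
   # cpuset requires cpus and mems to be populated before tasks can attach
   /bin/echo 0-3 > group1/cpuset.cpus
   /bin/echo 0   > group1/cpuset.mems
   /bin/echo 0xf > group1/intel_rdt.l3_cache_mask
   /bin/echo PID1 > group1/tasks

This only pins the tasks to CPUs 0-3 and gives them a cache mask; as discussed in the reply below, it does not make that part of the cache exclusive to them.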
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-28 23:15 ` Marcelo Tosatti @ 2015-07-29 0:06 ` Vikas Shivappa 2015-07-29 1:28 ` Auld, Will 0 siblings, 1 reply; 30+ messages in thread From: Vikas Shivappa @ 2015-07-29 0:06 UTC (permalink / raw) To: Marcelo Tosatti Cc: Vikas Shivappa, linux-kernel, vikas.shivappa, x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva On Tue, 28 Jul 2015, Marcelo Tosatti wrote: > On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: >> Adds a description of Cache allocation technology, overview >> of kernel implementation and usage of Cache Allocation cgroup interface. >> >> Cache allocation is a sub-feature of Resource Director Technology(RDT) >> Allocation or Platform Shared resource control which provides support to >> control Platform shared resources like L3 cache. Currently L3 Cache is >> the only resource that is supported in RDT. More information can be >> found in the Intel SDM, Volume 3, section 17.15. >> >> Cache Allocation Technology provides a way for the Software (OS/VMM) >> to restrict cache allocation to a defined 'subset' of cache which may >> be overlapping with other 'subsets'. This feature is used when >> allocating a line in cache ie when pulling new data into the cache. >> >> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> >> --- >> Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 215 insertions(+) >> create mode 100644 Documentation/cgroups/rdt.txt >> >> diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt >> new file mode 100644 >> index 0000000..dfff477 >> --- /dev/null >> +++ b/Documentation/cgroups/rdt.txt >> @@ -0,0 +1,215 @@ >> + RDT >> + --- >> + >> +Copyright (C) 2014 Intel Corporation >> +Written by vikas.shivappa@linux.intel.com >> +(based on contents and format from cpusets.txt) >> + >> +CONTENTS: >> +========= >> + >> +1. Cache Allocation Technology >> + 1.1 What is RDT and Cache allocation ? >> + 1.2 Why is Cache allocation needed ? >> + 1.3 Cache allocation implementation overview >> + 1.4 Assignment of CBM and CLOS >> + 1.5 Scheduling and Context Switch >> +2. Usage Examples and Syntax >> + >> +1. Cache Allocation Technology(Cache allocation) >> +=================================== >> + >> +1.1 What is RDT and Cache allocation >> +------------------------------------ >> + >> +Cache allocation is a sub-feature of Resource Director Technology(RDT) >> +Allocation or Platform Shared resource control which provides support to >> +control Platform shared resources like L3 cache. Currently L3 Cache is >> +the only resource that is supported in RDT. More information can be >> +found in the Intel SDM, Volume 3, section 17.15. >> + >> +Cache Allocation Technology provides a way for the Software (OS/VMM) >> +to restrict cache allocation to a defined 'subset' of cache which may >> +be overlapping with other 'subsets'. This feature is used when >> +allocating a line in cache ie when pulling new data into the cache. >> +The programming of the h/w is done via programming MSRs. >> + >> +The different cache subsets are identified by CLOS identifier (class >> +of service) and each CLOS has a CBM (cache bit mask). The CBM is a >> +contiguous set of bits which defines the amount of cache resource that >> +is available for each 'subset'. 
>> + >> +1.2 Why is Cache allocation needed >> +---------------------------------- >> + >> +In todays new processors the number of cores is continuously increasing, >> +especially in large scale usage models where VMs are used like >> +webservers and datacenters. The number of cores increase the number >> +of threads or workloads that can simultaneously be run. When >> +multi-threaded-applications, VMs, workloads run concurrently they >> +compete for shared resources including L3 cache. >> + >> +The Cache allocation enables more cache resources to be made available >> +for higher priority applications based on guidance from the execution >> +environment. >> + >> +The architecture also allows dynamically changing these subsets during >> +runtime to further optimize the performance of the higher priority >> +application with minimal degradation to the low priority app. >> +Additionally, resources can be rebalanced for system throughput benefit. >> + >> +This technique may be useful in managing large computer systems which >> +large L3 cache. Examples may be large servers running instances of >> +webservers or database servers. In such complex systems, these subsets >> +can be used for more careful placing of the available cache >> +resources. >> + >> +1.3 Cache allocation implementation Overview >> +-------------------------------------------- >> + >> +Kernel implements a cgroup subsystem to support cache allocation. >> + >> +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. >> +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal >> +to the kernel and not exposed to user. Each cgroup would have one CBM >> +and would just represent one cache 'subset'. >> + >> +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the >> +cgroup never fails. When a child cgroup is created it inherits the >> +CLOSid and the CBM from its parent. When a user changes the default >> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not >> +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once >> +the kernel runs out of maximum CLOSids it can support. >> +User can create as many cgroups as he wants but having different CBMs >> +at the same time is restricted by the maximum number of CLOSids >> +(multiple cgroups can have the same CBM). >> +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter >> +for each cgroup using a CLOSid. >> + >> +The tasks in the cgroup would get to fill the L3 cache represented by >> +the cgroup's 'l3_cache_mask' file. >> + >> +Root directory would have all available bits set in 'l3_cache_mask' file >> +by default. >> + >> +Each RDT cgroup directory has the following files. Some of them may be a >> +part of common RDT framework or be specific to RDT sub-features like >> +cache allocation. >> + >> + - intel_rdt.l3_cache_mask: The cache bitmask(CBM) is represented by this >> + file. The bitmask must be contiguous and would have a 1 or 2 bit >> + minimum length. >> + >> +1.4 Assignment of CBM,CLOS >> +-------------------------- >> + >> +The 'l3_cache_mask' needs to be a subset of the parent node's >> +'l3_cache_mask'. Any contiguous subset of these bits(with a minimum of 2 >> +bits on hsw SKUs) maybe set to indicate the cache mapping desired. The >> +'l3_cache_mask' between 2 directories can overlap. The 'l3_cache_mask' would >> +represent the cache 'subset' of the Cache allocation cgroup. 
For ex: on >> +a system with 16 bits of max cbm bits, if the directory has the least >> +significant 4 bits set in its 'l3_cache_mask' file(meaning the 'l3_cache_mask' >> +is just 0xf), it would be allocated the right quarter of the Last level >> +cache which means the tasks belonging to this Cache allocation cgroup >> +can use the right quarter of the cache to fill. If it >> +has the most significant 8 bits set ,it would be allocated the left >> +half of the cache(8 bits out of 16 represents 50%). >> + >> +The cache portion defined in the CBM file is available to all tasks >> +within the cgroup to fill and these task are not allowed to allocate >> +space in other parts of the cache. >> + >> +1.5 Scheduling and Context Switch >> +--------------------------------- >> + >> +During context switch kernel implements this by writing the >> +CLOSid (internally maintained by kernel) of the cgroup to which the >> +task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written >> +when there is a change in the CLOSid for the CPU in order to minimize >> +the latency incurred during context switch. >> + >> +The following considerations are done for the PQR MSR write so that it >> +has minimal impact on scheduling hot path: >> +- This path doesnt exist on any non-intel platforms. >> +- On Intel platforms, this would not exist by default unless CGROUP_RDT >> +is enabled. >> +- remains a no-op when CGROUP_RDT is enabled and intel hardware does not >> +support the feature. >> +- When feature is available, still remains a no-op till the user >> +manually creates a cgroup *and* assigns a new cache mask. Since the >> +child node inherits the parents cache mask , by cgroup creation there is >> +no scheduling hot path impact from the new cgroup. >> +- per cpu PQR values are cached and the MSR write is only done when >> +there is a task with different PQR is scheduled on the CPU. Typically if >> +the task groups are bound to be scheduled on a set of CPUs , the number >> +of MSR writes is greatly reduced. >> + >> +2. Usage examples and syntax >> +============================ >> + >> +To check if Cache allocation was enabled on your system >> + >> +dmesg | grep -i intel_rdt >> +should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx >> +the length of l3_cache_mask and CLOS should depend on the system you use. >> + >> +Also /proc/cpuinfo would have rdt(if rdt is enabled) and cat_l3( if L3 >> + cache allocation is enabled). >> + >> +Following would mount the cache allocation cgroup subsystem and create >> +2 directories. Please refer to Documentation/cgroups/cgroups.txt on >> +details about how to use cgroups. >> + >> + cd /sys/fs/cgroup >> + mkdir rdt >> + mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt >> + cd rdt >> + >> +Create 2 rdt cgroups >> + >> + mkdir group1 >> + mkdir group2 >> + >> +Following are some of the Files in the directory >> + >> + ls >> + rdt.l3_cache_mask >> + tasks >> + >> +Say if the cache is 2MB and cbm supports 16 bits, then setting the >> +below allocates the 'right 1/4th(512KB)' of the cache to group2 >> + >> +Edit the CBM for group2 to set the least significant 4 bits. This >> +allocates 'right quarter' of the cache. >> + >> + cd group2 >> + /bin/echo 0xf > rdt.l3_cache_mask >> + >> + >> +Edit the CBM for group2 to set the least significant 8 bits.This >> +allocates the right half of the cache to 'group2'. 
>> + >> + cd group2 >> + /bin/echo 0xff > rdt.l3_cache_mask >> + >> +Assign tasks to the group2 >> + >> + /bin/echo PID1 > tasks >> + /bin/echo PID2 > tasks >> + >> + Meaning now threads >> + PID1 and PID2 get to fill the 'right half' of >> + the cache as the belong to cgroup group2. >> + >> +Create a group under group2 >> + >> + cd group2 >> + mkdir group21 >> + cat rdt.l3_cache_mask >> + 0xff - inherits parents mask. >> + >> + /bin/echo 0xfff > rdt.l3_cache_mask - throws error as mask has to parent's mask's subset >> + >> +In order to restrict RDT cgroups to specific set of CPUs rdt can be >> +comounted with cpusets. >> -- >> 1.9.1 > > Vikas, > > Can you give an example of comounting with cpusets? What do you mean by > restrict RDT cgroups to specific set of CPUs? I was going to edit the documentation soon as i see a lot of feedback on the same. It may have caused confusion. I mean just pinning down tasks to a set of cpus. This does not mean we make the cache exclusive to the tasks.. > > Another limitation of this interface is that it assumes the > task <-> control group assignment is pertinent, that is: > > | taskgroup, L3 policy|: > > | taskgroupA, 50% L3 exclusive |, > | taskgroupB, 50% L3 |, > | taskgroupC, 50% L3 |. > > Whenever taskgroup A is empty (that is no runnable task in it), you waste 50% of > L3 cache. Cgroup masks can always overlap , and hence wont have exclusive cache allocation. > > I think this problem and the similar problem of L3 reservation with CPU > isolation can be solved in this way: whenever a task from cgroupE with exclusive way > access is migrated to a new die, impose the exclusivity (by removing > access to that way by other cgroups). > > Whenever cgroupE has zero tasks, remove exclusivity (by allowing > other cgroups to use the exclusive ways of it). Same comment as above - Cgroup masks can always overlap and other cgroups can allocate the same cache , and hence wont have exclusive cache allocation. So natuarally the cgroup with tasks would get to use the cache if it has the same mask (say representing 50% of cache in your example) as others . (assume there are 8 bits max cbm) cgroupa - mask - 0xf cgroupb - mask - 0xf . Now if cgroupa has no tasks , cgroupb naturally gets all the cache. Thanks, Vikas > > I'll cook a patch. > > > > > ^ permalink raw reply [flat|nested] 30+ messages in thread
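The overlapping-mask example above, written out against the files from the usage section (8-bit max CBM assumed, as in the example; the PID is a placeholder):

   cd /sys/fs/cgroup/rdt
   mkdir cgroupa cgroupb
   /bin/echo 0xf > cgroupa/rdt.l3_cache_mask
   /bin/echo 0xf > cgroupb/rdt.l3_cache_mask
   # Only cgroupb has runnable tasks, so in practice it fills the whole
   # 0xf region; nothing is held back for the empty cgroupa.
   /bin/echo PID1 > cgroupb/tasks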
* RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-29 0:06 ` Vikas Shivappa @ 2015-07-29 1:28 ` Auld, Will 2015-07-29 19:32 ` Marcelo Tosatti 2015-07-29 20:07 ` Vikas Shivappa 0 siblings, 2 replies; 30+ messages in thread From: Auld, Will @ 2015-07-29 1:28 UTC (permalink / raw) To: Shivappa, Vikas, Marcelo Tosatti Cc: Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D, Auld, Will > -----Original Message----- > From: Shivappa, Vikas > Sent: Tuesday, July 28, 2015 5:07 PM > To: Marcelo Tosatti > Cc: Vikas Shivappa; linux-kernel@vger.kernel.org; Shivappa, Vikas; > x86@kernel.org; hpa@zytor.com; tglx@linutronix.de; mingo@kernel.org; > tj@kernel.org; peterz@infradead.org; Fleming, Matt; Auld, Will; Williamson, > Glenn P; Juvva, Kanaka D > Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and > cgroup usage guide > > > > On Tue, 28 Jul 2015, Marcelo Tosatti wrote: > > > On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: > >> Adds a description of Cache allocation technology, overview of kernel > >> implementation and usage of Cache Allocation cgroup interface. > >> > >> Cache allocation is a sub-feature of Resource Director > >> Technology(RDT) Allocation or Platform Shared resource control which > >> provides support to control Platform shared resources like L3 cache. > >> Currently L3 Cache is the only resource that is supported in RDT. > >> More information can be found in the Intel SDM, Volume 3, section 17.15. > >> > >> Cache Allocation Technology provides a way for the Software (OS/VMM) > >> to restrict cache allocation to a defined 'subset' of cache which may > >> be overlapping with other 'subsets'. This feature is used when > >> allocating a line in cache ie when pulling new data into the cache. > >> > >> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> > >> --- > >> Documentation/cgroups/rdt.txt | 215 > >> ++++++++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 215 insertions(+) > >> create mode 100644 Documentation/cgroups/rdt.txt > >> > >> diff --git a/Documentation/cgroups/rdt.txt > >> b/Documentation/cgroups/rdt.txt new file mode 100644 index > >> 0000000..dfff477 > >> --- /dev/null > >> +++ b/Documentation/cgroups/rdt.txt > >> @@ -0,0 +1,215 @@ > >> + RDT > >> + --- > >> + > >> +Copyright (C) 2014 Intel Corporation Written by > >> +vikas.shivappa@linux.intel.com (based on contents and format from > >> +cpusets.txt) > >> + > >> +CONTENTS: > >> +========= > >> + > >> +1. Cache Allocation Technology > >> + 1.1 What is RDT and Cache allocation ? > >> + 1.2 Why is Cache allocation needed ? > >> + 1.3 Cache allocation implementation overview > >> + 1.4 Assignment of CBM and CLOS > >> + 1.5 Scheduling and Context Switch > >> +2. Usage Examples and Syntax > >> + > >> +1. Cache Allocation Technology(Cache allocation) > >> +=================================== > >> + > >> +1.1 What is RDT and Cache allocation > >> +------------------------------------ > >> + > >> +Cache allocation is a sub-feature of Resource Director > >> +Technology(RDT) Allocation or Platform Shared resource control which > >> +provides support to control Platform shared resources like L3 cache. > >> +Currently L3 Cache is the only resource that is supported in RDT. > >> +More information can be found in the Intel SDM, Volume 3, section 17.15. 
> >> + > >> +Cache Allocation Technology provides a way for the Software (OS/VMM) > >> +to restrict cache allocation to a defined 'subset' of cache which > >> +may be overlapping with other 'subsets'. This feature is used when > >> +allocating a line in cache ie when pulling new data into the cache. > >> +The programming of the h/w is done via programming MSRs. > >> + > >> +The different cache subsets are identified by CLOS identifier (class > >> +of service) and each CLOS has a CBM (cache bit mask). The CBM is a > >> +contiguous set of bits which defines the amount of cache resource > >> +that is available for each 'subset'. > >> + > >> +1.2 Why is Cache allocation needed > >> +---------------------------------- > >> + > >> +In todays new processors the number of cores is continuously > >> +increasing, especially in large scale usage models where VMs are > >> +used like webservers and datacenters. The number of cores increase > >> +the number of threads or workloads that can simultaneously be run. > >> +When multi-threaded-applications, VMs, workloads run concurrently > >> +they compete for shared resources including L3 cache. > >> + > >> +The Cache allocation enables more cache resources to be made > >> +available for higher priority applications based on guidance from > >> +the execution environment. > >> + > >> +The architecture also allows dynamically changing these subsets > >> +during runtime to further optimize the performance of the higher > >> +priority application with minimal degradation to the low priority app. > >> +Additionally, resources can be rebalanced for system throughput benefit. > >> + > >> +This technique may be useful in managing large computer systems > >> +which large L3 cache. Examples may be large servers running > >> +instances of webservers or database servers. In such complex > >> +systems, these subsets can be used for more careful placing of the > >> +available cache resources. > >> + > >> +1.3 Cache allocation implementation Overview > >> +-------------------------------------------- > >> + > >> +Kernel implements a cgroup subsystem to support cache allocation. > >> + > >> +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. > >> +A CLOS(Class of service) is represented by a CLOSid.CLOSid is > >> +internal to the kernel and not exposed to user. Each cgroup would > >> +have one CBM and would just represent one cache 'subset'. > >> + > >> +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the > >> +cgroup never fails. When a child cgroup is created it inherits the > >> +CLOSid and the CBM from its parent. When a user changes the default > >> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not > >> +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC > >> +once the kernel runs out of maximum CLOSids it can support. > >> +User can create as many cgroups as he wants but having different > >> +CBMs at the same time is restricted by the maximum number of CLOSids > >> +(multiple cgroups can have the same CBM). > >> +Kernel maintains a CLOSid<->cbm mapping which keeps reference > >> +counter for each cgroup using a CLOSid. > >> + > >> +The tasks in the cgroup would get to fill the L3 cache represented > >> +by the cgroup's 'l3_cache_mask' file. > >> + > >> +Root directory would have all available bits set in 'l3_cache_mask' > >> +file by default. > >> + > >> +Each RDT cgroup directory has the following files. 
Some of them may > >> +be a part of common RDT framework or be specific to RDT sub-features > >> +like cache allocation. > >> + > >> + - intel_rdt.l3_cache_mask: The cache bitmask(CBM) is represented by > >> + this file. The bitmask must be contiguous and would have a 1 or 2 > >> + bit minimum length. > >> + > >> +1.4 Assignment of CBM,CLOS > >> +-------------------------- > >> + > >> +The 'l3_cache_mask' needs to be a subset of the parent node's > >> +'l3_cache_mask'. Any contiguous subset of these bits(with a minimum > >> +of 2 bits on hsw SKUs) maybe set to indicate the cache mapping > >> +desired. The 'l3_cache_mask' between 2 directories can overlap. The > >> +'l3_cache_mask' would represent the cache 'subset' of the Cache > >> +allocation cgroup. For ex: on a system with 16 bits of max cbm bits, > >> +if the directory has the least significant 4 bits set in its 'l3_cache_mask' > file(meaning the 'l3_cache_mask' > >> +is just 0xf), it would be allocated the right quarter of the Last > >> +level cache which means the tasks belonging to this Cache allocation > >> +cgroup can use the right quarter of the cache to fill. If it has the > >> +most significant 8 bits set ,it would be allocated the left half of > >> +the cache(8 bits out of 16 represents 50%). > >> + > >> +The cache portion defined in the CBM file is available to all tasks > >> +within the cgroup to fill and these task are not allowed to allocate > >> +space in other parts of the cache. > >> + > >> +1.5 Scheduling and Context Switch > >> +--------------------------------- > >> + > >> +During context switch kernel implements this by writing the CLOSid > >> +(internally maintained by kernel) of the cgroup to which the task > >> +belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written > >> +when there is a change in the CLOSid for the CPU in order to > >> +minimize the latency incurred during context switch. > >> + > >> +The following considerations are done for the PQR MSR write so that > >> +it has minimal impact on scheduling hot path: > >> +- This path doesnt exist on any non-intel platforms. > >> +- On Intel platforms, this would not exist by default unless > >> +CGROUP_RDT is enabled. > >> +- remains a no-op when CGROUP_RDT is enabled and intel hardware does > >> +not support the feature. > >> +- When feature is available, still remains a no-op till the user > >> +manually creates a cgroup *and* assigns a new cache mask. Since the > >> +child node inherits the parents cache mask , by cgroup creation > >> +there is no scheduling hot path impact from the new cgroup. > >> +- per cpu PQR values are cached and the MSR write is only done when > >> +there is a task with different PQR is scheduled on the CPU. > >> +Typically if the task groups are bound to be scheduled on a set of > >> +CPUs , the number of MSR writes is greatly reduced. > >> + > >> +2. Usage examples and syntax > >> +============================ > >> + > >> +To check if Cache allocation was enabled on your system > >> + > >> +dmesg | grep -i intel_rdt > >> +should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx > >> +the length of l3_cache_mask and CLOS should depend on the system you > use. > >> + > >> +Also /proc/cpuinfo would have rdt(if rdt is enabled) and cat_l3( if L3 > >> + cache allocation is enabled). > >> + > >> +Following would mount the cache allocation cgroup subsystem and > >> +create > >> +2 directories. Please refer to Documentation/cgroups/cgroups.txt on > >> +details about how to use cgroups. 
> >> + > >> + cd /sys/fs/cgroup > >> + mkdir rdt > >> + mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt cd rdt > >> + > >> +Create 2 rdt cgroups > >> + > >> + mkdir group1 > >> + mkdir group2 > >> + > >> +Following are some of the Files in the directory > >> + > >> + ls > >> + rdt.l3_cache_mask > >> + tasks > >> + > >> +Say if the cache is 2MB and cbm supports 16 bits, then setting the > >> +below allocates the 'right 1/4th(512KB)' of the cache to group2 > >> + > >> +Edit the CBM for group2 to set the least significant 4 bits. This > >> +allocates 'right quarter' of the cache. > >> + > >> + cd group2 > >> + /bin/echo 0xf > rdt.l3_cache_mask > >> + > >> + > >> +Edit the CBM for group2 to set the least significant 8 bits.This > >> +allocates the right half of the cache to 'group2'. > >> + > >> + cd group2 > >> + /bin/echo 0xff > rdt.l3_cache_mask > >> + > >> +Assign tasks to the group2 > >> + > >> + /bin/echo PID1 > tasks > >> + /bin/echo PID2 > tasks > >> + > >> + Meaning now threads > >> + PID1 and PID2 get to fill the 'right half' of the cache as the > >> + belong to cgroup group2. > >> + > >> +Create a group under group2 > >> + > >> + cd group2 > >> + mkdir group21 > >> + cat rdt.l3_cache_mask > >> + 0xff - inherits parents mask. > >> + > >> + /bin/echo 0xfff > rdt.l3_cache_mask - throws error as mask has to > >> + parent's mask's subset > >> + > >> +In order to restrict RDT cgroups to specific set of CPUs rdt can be > >> +comounted with cpusets. > >> -- > >> 1.9.1 > > > > Vikas, > > > > Can you give an example of comounting with cpusets? What do you mean > > by restrict RDT cgroups to specific set of CPUs? > > I was going to edit the documentation soon as i see a lot of feedback on the > same. It may have caused confusion. > > I mean just pinning down tasks to a set of cpus. This does not mean we make the > cache exclusive to the tasks.. > > > > > Another limitation of this interface is that it assumes the task <-> > > control group assignment is pertinent, that is: > > > > | taskgroup, L3 policy|: > > > > | taskgroupA, 50% L3 exclusive |, > > | taskgroupB, 50% L3 |, > > | taskgroupC, 50% L3 |. > > > > Whenever taskgroup A is empty (that is no runnable task in it), you > > waste 50% of > > L3 cache. > > Cgroup masks can always overlap , and hence wont have exclusive cache > allocation. > > > > > I think this problem and the similar problem of L3 reservation with > > CPU isolation can be solved in this way: whenever a task from cgroupE > > with exclusive way access is migrated to a new die, impose the > > exclusivity (by removing access to that way by other cgroups). > > > > Whenever cgroupE has zero tasks, remove exclusivity (by allowing other > > cgroups to use the exclusive ways of it). > > Same comment as above - Cgroup masks can always overlap and other cgroups > can allocate the same cache , and hence wont have exclusive cache allocation. [Auld, Will] You can define all the cbm to provide one clos with an exclusive area > > So natuarally the cgroup with tasks would get to use the cache if it has the same > mask (say representing 50% of cache in your example) as others . [Auld, Will] automatic adjustment of the cbm make me nervous. There are times when we want to limit the cache for a process independent of whether there is lots of unused cache. > (assume there are 8 bits max cbm) > cgroupa - mask - 0xf > cgroupb - mask - 0xf . Now if cgroupa has no tasks , cgroupb naturally gets all > the cache. > > Thanks, > Vikas > > > > > I'll cook a patch. 
> > > > > > > > > > ^ permalink raw reply [flat|nested] 30+ messages in thread
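One way to make Will's point concrete, as a sketch with a 16-bit CBM (the group names are placeholders): carve the top four ways out of every other cgroup's mask so that only one cgroup can allocate there. Nothing in the patches enforces this layout; it only falls out of how the masks are chosen, and tasks left in the root cgroup can still fill those ways since the root mask keeps all bits set:

   cd /sys/fs/cgroup/rdt
   mkdir exclusive other1 other2
   /bin/echo 0xf000 > exclusive/rdt.l3_cache_mask   # top 4 ways, used by no other group
   /bin/echo 0x0fff > other1/rdt.l3_cache_mask
   /bin/echo 0x0fff > other2/rdt.l3_cache_mask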
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-29 1:28 ` Auld, Will @ 2015-07-29 19:32 ` Marcelo Tosatti 2015-07-30 17:47 ` Vikas Shivappa 2015-07-29 20:07 ` Vikas Shivappa 1 sibling, 1 reply; 30+ messages in thread From: Marcelo Tosatti @ 2015-07-29 19:32 UTC (permalink / raw) To: Auld, Will Cc: Shivappa, Vikas, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D On Wed, Jul 29, 2015 at 01:28:38AM +0000, Auld, Will wrote: > > > Whenever cgroupE has zero tasks, remove exclusivity (by allowing other > > > cgroups to use the exclusive ways of it). > > > > Same comment as above - Cgroup masks can always overlap and other cgroups > > can allocate the same cache , and hence wont have exclusive cache allocation. > > [Auld, Will] You can define all the cbm to provide one clos with an exclusive area > > > > > So natuarally the cgroup with tasks would get to use the cache if it has the same > > mask (say representing 50% of cache in your example) as others . > > [Auld, Will] automatic adjustment of the cbm make me nervous. There are times > when we want to limit the cache for a process independent of whether there is > lots of unused cache. How about this: desiredclos (closid p1 p2 p3 p4) 1 1 0 0 0 2 0 0 0 1 3 0 1 1 0 p means part. closid 1 is a exclusive cgroup. closid 2 is a "cache hog" class. closid 3 is "default closid". Desiredclos is what user has specified. Transition 1: desiredclos --> effectiveclos Clean all bits of unused closid's (that must be updated whenever a closid1 cgroup goes from empty->nonempty and vice-versa). effectiveclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 0 1 1 0 Transition 2: effectiveclos --> expandedclos expandedclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 1 1 1 0 Then you have different inplacecos for each CPU (see pseudo-code below): On the following events. - task migration to new pCPU: - task creation: id = smp_processor_id(); for (part = desiredclos.p1; ...; part++) /* if my cosid is set and any other cosid is clear, for the part, synchronize desiredclos --> inplacecos */ if (part[mycosid] == 1 && part[any_othercosid] == 0) wrmsr(part, desiredclos); ^ permalink raw reply [flat|nested] 30+ messages in thread
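Read as cache bit masks (assuming, only for illustration, a 4-bit CBM with p1 as the most significant bit, and hypothetical cgroup names), the desiredclos table above corresponds to masks like these under the current cgroup interface:

   # closid 1: p1 only   -> 0x8  (the exclusive cgroup)
   # closid 2: p4 only   -> 0x1  (the "cache hog" class)
   # closid 3: p2 and p3 -> 0x6  (the default class)
   /bin/echo 0x8 > group_exclusive/rdt.l3_cache_mask
   /bin/echo 0x1 > group_hog/rdt.l3_cache_mask
   /bin/echo 0x6 > group_default/rdt.l3_cache_mask

The effectiveclos/expandedclos transitions (clearing the bits of an empty closid and lending them to the default class) are only proposed here; the posted patches keep the masks exactly as the user wrote them.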
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-29 19:32 ` Marcelo Tosatti @ 2015-07-30 17:47 ` Vikas Shivappa 2015-07-30 20:08 ` Marcelo Tosatti 2015-07-30 20:22 ` Marcelo Tosatti 0 siblings, 2 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-07-30 17:47 UTC (permalink / raw) To: Marcelo Tosatti Cc: Auld, Will, Shivappa, Vikas, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D Marcello, On Wed, 29 Jul 2015, Marcelo Tosatti wrote: > > How about this: > > desiredclos (closid p1 p2 p3 p4) > 1 1 0 0 0 > 2 0 0 0 1 > 3 0 1 1 0 #1 Currently in the rdt cgroup , the root cgroup always has all the bits set and cant be changed (because the cgroup hierarchy would by default make this to have all bits as all the children need to have a subset of the root's bitmask). So if the user creates a cgroup and not put any task in it , the tasks in the root cgroup could be still using that part of the cache. Thats the reason i say we can have really 'exclusive' masks. Or in other words - there is always a desired clos (0) which has all parts set which acts like a default pool. Also the parts can overlap. Please apply this for all the below comments which will change the way they work. > > p means part. I am assuming p = (a contiguous cache capacity bit mask) > closid 1 is a exclusive cgroup. > closid 2 is a "cache hog" class. > closid 3 is "default closid". > > Desiredclos is what user has specified. > > Transition 1: desiredclos --> effectiveclos > Clean all bits of unused closid's > (that must be updated whenever a > closid1 cgroup goes from empty->nonempty > and vice-versa). > > effectiveclos (closid p1 p2 p3 p4) > 1 0 0 0 0 > 2 0 0 0 1 > 3 0 1 1 0 > > Transition 2: effectiveclos --> expandedclos > expandedclos (closid p1 p2 p3 p4) > 1 0 0 0 0 > 2 0 0 0 1 > 3 1 1 1 0 > Then you have different inplacecos for each > CPU (see pseudo-code below): > > On the following events. > > - task migration to new pCPU: > - task creation: > > id = smp_processor_id(); > for (part = desiredclos.p1; ...; part++) > /* if my cosid is set and any other > cosid is clear, for the part, > synchronize desiredclos --> inplacecos */ > if (part[mycosid] == 1 && > part[any_othercosid] == 0) > wrmsr(part, desiredclos); > Currently the root cgroup would have all the bits set which will act like a default cgroup where all the otherwise unused parts (assuming they are a set of contiguous cache capacity bits) will be used. Otherwise the question is in the expandedclos - who decides to expand the closx parts to include some of the unused parts.. - that could just be a default root always ? Thanks, Vikas > ^ permalink raw reply [flat|nested] 30+ messages in thread
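The root-as-default-pool behaviour described above can be seen directly from the files; a small sketch on a system with a 16-bit CBM:

   cd /sys/fs/cgroup/rdt
   cat rdt.l3_cache_mask              # root: all bits set, e.g. 0xffff
   mkdir group1
   cat group1/rdt.l3_cache_mask       # child starts with the parent's mask
   /bin/echo 0x00ff > group1/rdt.l3_cache_mask   # must remain a subset
   # The root mask itself cannot be shrunk, since every child mask has to
   # stay a subset of it; any ways not claimed exclusively elsewhere are
   # therefore still usable by tasks left in the root cgroup.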
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-30 17:47 ` Vikas Shivappa @ 2015-07-30 20:08 ` Marcelo Tosatti 2015-07-31 15:34 ` Marcelo Tosatti 2015-08-02 15:48 ` Martin Kletzander 2015-07-30 20:22 ` Marcelo Tosatti 1 sibling, 2 replies; 30+ messages in thread From: Marcelo Tosatti @ 2015-07-30 20:08 UTC (permalink / raw) To: Vikas Shivappa Cc: Auld, Will, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: > > > Marcello, > > > On Wed, 29 Jul 2015, Marcelo Tosatti wrote: > > > >How about this: > > > >desiredclos (closid p1 p2 p3 p4) > > 1 1 0 0 0 > > 2 0 0 0 1 > > 3 0 1 1 0 > > #1 Currently in the rdt cgroup , the root cgroup always has all the > bits set and cant be changed (because the cgroup hierarchy would by > default make this to have all bits as all the children need to have > a subset of the root's bitmask). So if the user creates a cgroup and > not put any task in it , the tasks in the root cgroup could be still > using that part of the cache. Thats the reason i say we can have > really 'exclusive' masks. > > Or in other words - there is always a desired clos (0) which has all > parts set which acts like a default pool. > > Also the parts can overlap. Please apply this for all the below > comments which will change the way they work. > > > > >p means part. > > I am assuming p = (a contiguous cache capacity bit mask) Yes. > >closid 1 is a exclusive cgroup. > >closid 2 is a "cache hog" class. > >closid 3 is "default closid". > > > >Desiredclos is what user has specified. > > > >Transition 1: desiredclos --> effectiveclos > >Clean all bits of unused closid's > >(that must be updated whenever a > >closid1 cgroup goes from empty->nonempty > >and vice-versa). > > > >effectiveclos (closid p1 p2 p3 p4) > > 1 0 0 0 0 > > 2 0 0 0 1 > > 3 0 1 1 0 > > > > >Transition 2: effectiveclos --> expandedclos > >expandedclos (closid p1 p2 p3 p4) > > 1 0 0 0 0 > > 2 0 0 0 1 > > 3 1 1 1 0 > >Then you have different inplacecos for each > >CPU (see pseudo-code below): > > > >On the following events. > > > >- task migration to new pCPU: > >- task creation: > > > > id = smp_processor_id(); > > for (part = desiredclos.p1; ...; part++) > > /* if my cosid is set and any other > > cosid is clear, for the part, > > synchronize desiredclos --> inplacecos */ > > if (part[mycosid] == 1 && > > part[any_othercosid] == 0) > > wrmsr(part, desiredclos); > > > > Currently the root cgroup would have all the bits set which will act > like a default cgroup where all the otherwise unused parts (assuming > they are a set of contiguous cache capacity bits) will be used. > > Otherwise the question is in the expandedclos - who decides to > expand the closx parts to include some of the unused parts.. - that > could just be a default root always ? Right, so the problem is for certain closid's you might never want to expand (because doing so would cause data to be cached in a cache way which might have high eviction rate in the future). See the example from Will. But for the default cache (that is "unclassified applications" i suppose it is beneficial to expand in most cases, that is, use maximum amount of cache irrespective of eviction rate, which is the behaviour that exists now without CAT). So perhaps a new flag "expand=y/n" can be added to the cgroup directories... 
What do you say? Userspace representation of CAT ------------------------------- Usage model: 1) measure application performance without L3 cache reservation. 2) measure application perf with L3 cache reservation and X number of cache ways until desired performance is attained. Requirements: 1) Persistency of CLOS configuration across hardware. On migration of operating system or application between different hardware systems we'd like the following to be maintained: - exclusive number of bytes (*) reserved to a certain CLOSid. - shared number of bytes (*) reserved between a certain group of CLOSid's. For both code and data, rounded down or up in cache way size. 2) Reasoning: Different CBM masks in different hardware platforms might be necessary to specify the same CLOS configuration, in terms of exclusive number of bytes and shared number of bytes. (cache-way rounded number of bytes). For example, due to L3 allocation by other hardware entities in certain parts of the cache it might be necessary to relocate CBM mask to achieve the same CLOS configuration. 3) Proposed format: sharedregionK.exclusive - Number of exclusive cache bytes reserved for shared region. sharedregionK.excl_data - Number of exclusive cache data bytes reserved for shared region. sharedregionK.excl_bytes - Number of exclusive cache code bytes reserved for shared region. sharedregionK.round_down - Round down to cache way bytes from respective number specification (default is round up). sharedregionK.expand - y/n - Expand shared region to more cache ways when available (default N). cgroupN.exclusive - Number of exclusive L3 cache bytes reserved for cgroup. cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved for cgroup. cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved for cgroup. cgroupN.round_down - Round down to cache way bytes from respective number specification (default is round up). cgroupN.expand - y/n - Expand shared region to more cache ways when available (default N). cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared regions) Example 1: One application with 2M exclusive cache, two applications with 1M exclusive each, sharing an expansive shared region of 1M. cgroup1.exclusive = 2M sharedregion1.exclusive = 1M sharedregion1.expand = Y cgroup2.exclusive = 1M cgroup2.shared = sharedregion1 cgroup3.exclusive = 1M cgroup3.shared = sharedregion1 Example 2: 3 high performance applications running, one of which is a cache hog with no cache locality. cgroup1.exclusive = 8M cgroup2.exclusive = 8M cgroup3.exclusive = 512K cgroup3.round_down = Y In all cases the default cgroup (which requires no explicit specification) is expansive and uses the remaining cache ways, including the ways shared by other hardware entities. ^ permalink raw reply [flat|nested] 30+ messages in thread
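If the files proposed above existed as per-cgroup attributes, Example 1 might be driven from userspace roughly as follows. None of these files exist in the posted patches; the names and values are taken directly from the proposal:

   /bin/echo 1M > sharedregion1.exclusive
   /bin/echo Y  > sharedregion1.expand
   /bin/echo 2M > cgroup1.exclusive
   /bin/echo 1M > cgroup2.exclusive
   /bin/echo sharedregion1 > cgroup2.shared
   /bin/echo 1M > cgroup3.exclusive
   /bin/echo sharedregion1 > cgroup3.shared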
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-30 20:08 ` Marcelo Tosatti @ 2015-07-31 15:34 ` Marcelo Tosatti 2015-08-02 15:48 ` Martin Kletzander 1 sibling, 0 replies; 30+ messages in thread From: Marcelo Tosatti @ 2015-07-31 15:34 UTC (permalink / raw) To: Vikas Shivappa, Tejun Heo Cc: Auld, Will, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote: > On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: > > > > > > Marcello, > > > > > > On Wed, 29 Jul 2015, Marcelo Tosatti wrote: > > > > > >How about this: > > > > > >desiredclos (closid p1 p2 p3 p4) > > > 1 1 0 0 0 > > > 2 0 0 0 1 > > > 3 0 1 1 0 > > > > #1 Currently in the rdt cgroup , the root cgroup always has all the > > bits set and cant be changed (because the cgroup hierarchy would by > > default make this to have all bits as all the children need to have > > a subset of the root's bitmask). So if the user creates a cgroup and > > not put any task in it , the tasks in the root cgroup could be still > > using that part of the cache. Thats the reason i say we can have > > really 'exclusive' masks. > > > > Or in other words - there is always a desired clos (0) which has all > > parts set which acts like a default pool. > > > > Also the parts can overlap. Please apply this for all the below > > comments which will change the way they work. > > > > > > > >p means part. > > > > I am assuming p = (a contiguous cache capacity bit mask) > > Yes. > > > >closid 1 is a exclusive cgroup. > > >closid 2 is a "cache hog" class. > > >closid 3 is "default closid". > > > > > >Desiredclos is what user has specified. > > > > > >Transition 1: desiredclos --> effectiveclos > > >Clean all bits of unused closid's > > >(that must be updated whenever a > > >closid1 cgroup goes from empty->nonempty > > >and vice-versa). > > > > > >effectiveclos (closid p1 p2 p3 p4) > > > 1 0 0 0 0 > > > 2 0 0 0 1 > > > 3 0 1 1 0 > > > > > > > >Transition 2: effectiveclos --> expandedclos > > >expandedclos (closid p1 p2 p3 p4) > > > 1 0 0 0 0 > > > 2 0 0 0 1 > > > 3 1 1 1 0 > > >Then you have different inplacecos for each > > >CPU (see pseudo-code below): > > > > > >On the following events. > > > > > >- task migration to new pCPU: > > >- task creation: > > > > > > id = smp_processor_id(); > > > for (part = desiredclos.p1; ...; part++) > > > /* if my cosid is set and any other > > > cosid is clear, for the part, > > > synchronize desiredclos --> inplacecos */ > > > if (part[mycosid] == 1 && > > > part[any_othercosid] == 0) > > > wrmsr(part, desiredclos); > > > > > > > Currently the root cgroup would have all the bits set which will act > > like a default cgroup where all the otherwise unused parts (assuming > > they are a set of contiguous cache capacity bits) will be used. > > > > Otherwise the question is in the expandedclos - who decides to > > expand the closx parts to include some of the unused parts.. - that > > could just be a default root always ? > > Right, so the problem is for certain closid's you might never want > to expand (because doing so would cause data to be cached in a > cache way which might have high eviction rate in the future). > See the example from Will. 
> > But for the default cache (that is "unclassified applications" > i suppose it is beneficial to expand in most cases, that is, > use maximum amount of cache irrespective of eviction rate, which > is the behaviour that exists now without CAT). > > So perhaps a new flag "expand=y/n" can be added to the cgroup > directories... What do you say? > > Userspace representation of CAT > ------------------------------- > > Usage model: > 1) measure application performance without L3 cache reservation. > 2) measure application perf with L3 cache reservation and > X number of cache ways until desired performance is attained. > > Requirements: > 1) Persistency of CLOS configuration across hardware. On migration > of operating system or application between different hardware > systems we'd like the following to be maintained: > - exclusive number of bytes (*) reserved to a certain CLOSid. > - shared number of bytes (*) reserved between a certain group > of CLOSid's. > > For both code and data, rounded down or up in cache way size. > > 2) Reasoning: > Different CBM masks in different hardware platforms might be necessary > to specify the same CLOS configuration, in terms of exclusive number of > bytes and shared number of bytes. (cache-way rounded number of bytes). > For example, due to L3 allocation by other hardware entities in certain parts > of the cache it might be necessary to relocate CBM mask to achieve > the same CLOS configuration. > > 3) Proposed format: > > sharedregionK.exclusive - Number of exclusive cache bytes reserved for > shared region. > sharedregionK.excl_data - Number of exclusive cache data bytes reserved for > shared region. > sharedregionK.excl_bytes - Number of exclusive cache code bytes reserved for > shared region. > sharedregionK.round_down - Round down to cache way bytes from respective number > specification (default is round up). > sharedregionK.expand - y/n - Expand shared region to more cache ways > when available (default N). > > cgroupN.exclusive - Number of exclusive L3 cache bytes reserved > for cgroup. > cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved > for cgroup. > cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved > for cgroup. > cgroupN.round_down - Round down to cache way bytes from respective number > specification (default is round up). > cgroupN.expand - y/n - Expand shared region to more cache ways when > available (default N). > cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared > regions) > > Example 1: > One application with 2M exclusive cache, two applications > with 1M exclusive each, sharing an expansive shared region of 1M. > > cgroup1.exclusive = 2M > > sharedregion1.exclusive = 1M > sharedregion1.expand = Y > > cgroup2.exclusive = 1M > cgroup2.shared = sharedregion1 > > cgroup3.exclusive = 1M > cgroup3.shared = sharedregion1 > > Example 2: > 3 high performance applications running, one of which is a cache hog > with no cache locality. > > cgroup1.exclusive = 8M > cgroup2.exclusive = 8M > > cgroup3.exclusive = 512K > cgroup3.round_down = Y > > In all cases the default cgroup (which requires no explicit > specification) is expansive and uses the remaining cache > ways, including the ways shared by other hardware entities. > Moving this discussion from another to this thread, sorry. > >Second question: > >Do you envision any use case which the placement of cache > >and not the quantity of cache is a criteria for decision? 
> >That is, two cases with the same amount of cache for each CLOSid, > >but with different locations inside the cache? > >(except sharing of ways by two CLOSid's, of course). > > > > cbm max - 16 bits. 000f - allocate right quarter. f000 - allocate > left quarter.. ? extend the case to any number of valid contiguous > bits. Yes, the hardware allows you to specify the same number of cache ways to a given COSid in different cache locations. The question was whether do you envision any use case where different locations make a difference? I can't see any (except for hardware users of cache ways, which the OS could control automatically, all it needs to know from the user configuration is whether a given cgroup is a "exclusive user" of a number of cache ways, in which case it should not use a This information is crucial because if there are no forseeable use cases then that can simplify the interface enormously (could have the kernel handle the issues that my "userspace interface" proposal is handling). ^ permalink raw reply [flat|nested] 30+ messages in thread
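The placement question above, made concrete with the 16-bit example already used in the thread (group names are placeholders): both masks reserve four of sixteen ways, so the quantity is identical and only the location differs:

   /bin/echo 0x000f > groupA/rdt.l3_cache_mask   # the 'right' quarter
   /bin/echo 0xf000 > groupB/rdt.l3_cache_mask   # the 'left' quarter
   # Per the discussion above, unless some other agent (e.g. a hardware
   # user of specific ways) is tied to one end of the cache, the two
   # layouts give their tasks the same share of L3.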
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-30 20:08 ` Marcelo Tosatti 2015-07-31 15:34 ` Marcelo Tosatti @ 2015-08-02 15:48 ` Martin Kletzander 2015-08-03 15:13 ` Marcelo Tosatti 1 sibling, 1 reply; 30+ messages in thread From: Martin Kletzander @ 2015-08-02 15:48 UTC (permalink / raw) To: Marcelo Tosatti Cc: Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D [-- Attachment #1: Type: text/plain, Size: 7853 bytes --] On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote: >On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: >> >> >> Marcello, >> >> >> On Wed, 29 Jul 2015, Marcelo Tosatti wrote: >> > >> >How about this: >> > >> >desiredclos (closid p1 p2 p3 p4) >> > 1 1 0 0 0 >> > 2 0 0 0 1 >> > 3 0 1 1 0 >> >> #1 Currently in the rdt cgroup , the root cgroup always has all the >> bits set and cant be changed (because the cgroup hierarchy would by >> default make this to have all bits as all the children need to have >> a subset of the root's bitmask). So if the user creates a cgroup and >> not put any task in it , the tasks in the root cgroup could be still >> using that part of the cache. Thats the reason i say we can have >> really 'exclusive' masks. >> >> Or in other words - there is always a desired clos (0) which has all >> parts set which acts like a default pool. >> >> Also the parts can overlap. Please apply this for all the below >> comments which will change the way they work. >> >> > >> >p means part. >> >> I am assuming p = (a contiguous cache capacity bit mask) > >Yes. > >> >closid 1 is a exclusive cgroup. >> >closid 2 is a "cache hog" class. >> >closid 3 is "default closid". >> > >> >Desiredclos is what user has specified. >> > >> >Transition 1: desiredclos --> effectiveclos >> >Clean all bits of unused closid's >> >(that must be updated whenever a >> >closid1 cgroup goes from empty->nonempty >> >and vice-versa). >> > >> >effectiveclos (closid p1 p2 p3 p4) >> > 1 0 0 0 0 >> > 2 0 0 0 1 >> > 3 0 1 1 0 >> >> > >> >Transition 2: effectiveclos --> expandedclos >> >expandedclos (closid p1 p2 p3 p4) >> > 1 0 0 0 0 >> > 2 0 0 0 1 >> > 3 1 1 1 0 >> >Then you have different inplacecos for each >> >CPU (see pseudo-code below): >> > >> >On the following events. >> > >> >- task migration to new pCPU: >> >- task creation: >> > >> > id = smp_processor_id(); >> > for (part = desiredclos.p1; ...; part++) >> > /* if my cosid is set and any other >> > cosid is clear, for the part, >> > synchronize desiredclos --> inplacecos */ >> > if (part[mycosid] == 1 && >> > part[any_othercosid] == 0) >> > wrmsr(part, desiredclos); >> > >> >> Currently the root cgroup would have all the bits set which will act >> like a default cgroup where all the otherwise unused parts (assuming >> they are a set of contiguous cache capacity bits) will be used. >> >> Otherwise the question is in the expandedclos - who decides to >> expand the closx parts to include some of the unused parts.. - that >> could just be a default root always ? > >Right, so the problem is for certain closid's you might never want >to expand (because doing so would cause data to be cached in a >cache way which might have high eviction rate in the future). >See the example from Will. 
> >But for the default cache (that is "unclassified applications" >i suppose it is beneficial to expand in most cases, that is, >use maximum amount of cache irrespective of eviction rate, which >is the behaviour that exists now without CAT). > >So perhaps a new flag "expand=y/n" can be added to the cgroup >directories... What do you say? > >Userspace representation of CAT >------------------------------- > >Usage model: >1) measure application performance without L3 cache reservation. >2) measure application perf with L3 cache reservation and >X number of cache ways until desired performance is attained. > >Requirements: >1) Persistency of CLOS configuration across hardware. On migration >of operating system or application between different hardware >systems we'd like the following to be maintained: > - exclusive number of bytes (*) reserved to a certain CLOSid. > - shared number of bytes (*) reserved between a certain group > of CLOSid's. > >For both code and data, rounded down or up in cache way size. > >2) Reasoning: >Different CBM masks in different hardware platforms might be necessary >to specify the same CLOS configuration, in terms of exclusive number of >bytes and shared number of bytes. (cache-way rounded number of bytes). >For example, due to L3 allocation by other hardware entities in certain parts >of the cache it might be necessary to relocate CBM mask to achieve >the same CLOS configuration. > >3) Proposed format: > Few questions from a random listener, I apologise if some of them are in a wrong place due to me missing some information from past threads. I'm not sure whether the following proposal to the format is the internal structure or what's going to be in cgroups. If this is user-visible interface, I think it could be a little less detailed. >sharedregionK.exclusive - Number of exclusive cache bytes reserved for > shared region. >sharedregionK.excl_data - Number of exclusive cache data bytes reserved for > shared region. >sharedregionK.excl_bytes - Number of exclusive cache code bytes reserved for > shared region. >sharedregionK.round_down - Round down to cache way bytes from respective number > specification (default is round up). >sharedregionK.expand - y/n - Expand shared region to more cache ways > when available (default N). > >cgroupN.exclusive - Number of exclusive L3 cache bytes reserved > for cgroup. >cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved > for cgroup. >cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved > for cgroup. By exclusive, you mean that it's exclusive to the tasks in this cgroup? The thing is that we must differentiate between limiting some process's from hogging the memory (like example 2 below) and making some part of the cache exclusive for particular application (example 1 below). I just hope we won't need to add something similar to 'isolcpus=' just so we can make sure none of the tasks in the root cgroup can spoil the part of the cache we need to have exclusive. I'm not sure creating a new subgroup and moving all the tasks there would work, It certainly is not possible with other cgroups, like the cpuset cgroup mentioned beforehand. I also don't quite fully understand how the co-mounting with the cpuset cgroup should work, but that's not design-related. One more question, how does this work on systems with multiple L3 caches (e.g. large NUMA node systems)? I'm guessing if the process is running only on some CPUs, the wrmsr() will be called on that particular CPU(s), right? 
>cgroupN.round_down - Round down to cache way bytes from respective number > specification (default is round up). >cgroupN.expand - y/n - Expand shared region to more cache ways when > available (default N). >cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared >regions) > >Example 1: >One application with 2M exclusive cache, two applications >with 1M exclusive each, sharing an expansive shared region of 1M. > >cgroup1.exclusive = 2M > >sharedregion1.exclusive = 1M >sharedregion1.expand = Y > >cgroup2.exclusive = 1M >cgroup2.shared = sharedregion1 > >cgroup3.exclusive = 1M >cgroup3.shared = sharedregion1 > >Example 2: >3 high performance applications running, one of which is a cache hog >with no cache locality. > >cgroup1.exclusive = 8M >cgroup2.exclusive = 8M > >cgroup3.exclusive = 512K >cgroup3.round_down = Y > >In all cases the default cgroup (which requires no explicit >specification) is expansive and uses the remaining cache >ways, including the ways shared by other hardware entities. > >-- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
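A loose C rendering of the "desiredclos --> inplacecos" synchronization step from the quoted pseudo-code above, run on task creation or migration. The table layout, the 4-part/4-closid sizes and the wrmsr_l3_mask() helper are illustrative assumptions only, not interfaces from the patch series.

	#include <stdio.h>

	#define NR_PARTS	4
	#define NR_CLOSIDS	4

	/* desiredclos[closid]: bitmask of cache parts the user asked for */
	static unsigned int desiredclos[NR_CLOSIDS] = { 0x1, 0x8, 0x6, 0x0 };

	/* Stand-in for programming IA32_L3_MASK_<closid>; a stub for illustration */
	static void wrmsr_l3_mask(unsigned int closid, unsigned int mask)
	{
		printf("wrmsr IA32_L3_MASK_%u = 0x%x\n", closid, mask);
	}

	/* Push the desired mask only when a part is set for us and clear elsewhere */
	static void sync_inplacecos(unsigned int mycosid)
	{
		unsigned int part, other, exclusive = 0;

		for (part = 0; part < NR_PARTS && !exclusive; part++) {
			if (!(desiredclos[mycosid] & (1u << part)))
				continue;
			exclusive = 1;
			for (other = 0; other < NR_CLOSIDS; other++)
				if (other != mycosid &&
				    (desiredclos[other] & (1u << part)))
					exclusive = 0;
		}
		if (exclusive)
			wrmsr_l3_mask(mycosid, desiredclos[mycosid]);
	}

	int main(void)
	{
		sync_inplacecos(0);	/* e.g. on task migration to this CPU */
		return 0;
	}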
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-08-02 15:48 ` Martin Kletzander @ 2015-08-03 15:13 ` Marcelo Tosatti 2015-08-03 18:22 ` Vikas Shivappa 0 siblings, 1 reply; 30+ messages in thread From: Marcelo Tosatti @ 2015-08-03 15:13 UTC (permalink / raw) To: Martin Kletzander Cc: Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote: > On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote: > >On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: > >> > >> > >>Marcello, > >> > >> > >>On Wed, 29 Jul 2015, Marcelo Tosatti wrote: > >>> > >>>How about this: > >>> > >>>desiredclos (closid p1 p2 p3 p4) > >>> 1 1 0 0 0 > >>> 2 0 0 0 1 > >>> 3 0 1 1 0 > >> > >>#1 Currently in the rdt cgroup , the root cgroup always has all the > >>bits set and cant be changed (because the cgroup hierarchy would by > >>default make this to have all bits as all the children need to have > >>a subset of the root's bitmask). So if the user creates a cgroup and > >>not put any task in it , the tasks in the root cgroup could be still > >>using that part of the cache. Thats the reason i say we can have > >>really 'exclusive' masks. > >> > >>Or in other words - there is always a desired clos (0) which has all > >>parts set which acts like a default pool. > >> > >>Also the parts can overlap. Please apply this for all the below > >>comments which will change the way they work. > >> > >>> > >>>p means part. > >> > >>I am assuming p = (a contiguous cache capacity bit mask) > > > >Yes. > > > >>>closid 1 is a exclusive cgroup. > >>>closid 2 is a "cache hog" class. > >>>closid 3 is "default closid". > >>> > >>>Desiredclos is what user has specified. > >>> > >>>Transition 1: desiredclos --> effectiveclos > >>>Clean all bits of unused closid's > >>>(that must be updated whenever a > >>>closid1 cgroup goes from empty->nonempty > >>>and vice-versa). > >>> > >>>effectiveclos (closid p1 p2 p3 p4) > >>> 1 0 0 0 0 > >>> 2 0 0 0 1 > >>> 3 0 1 1 0 > >> > >>> > >>>Transition 2: effectiveclos --> expandedclos > >>>expandedclos (closid p1 p2 p3 p4) > >>> 1 0 0 0 0 > >>> 2 0 0 0 1 > >>> 3 1 1 1 0 > >>>Then you have different inplacecos for each > >>>CPU (see pseudo-code below): > >>> > >>>On the following events. > >>> > >>>- task migration to new pCPU: > >>>- task creation: > >>> > >>> id = smp_processor_id(); > >>> for (part = desiredclos.p1; ...; part++) > >>> /* if my cosid is set and any other > >>> cosid is clear, for the part, > >>> synchronize desiredclos --> inplacecos */ > >>> if (part[mycosid] == 1 && > >>> part[any_othercosid] == 0) > >>> wrmsr(part, desiredclos); > >>> > >> > >>Currently the root cgroup would have all the bits set which will act > >>like a default cgroup where all the otherwise unused parts (assuming > >>they are a set of contiguous cache capacity bits) will be used. > >> > >>Otherwise the question is in the expandedclos - who decides to > >>expand the closx parts to include some of the unused parts.. - that > >>could just be a default root always ? > > > >Right, so the problem is for certain closid's you might never want > >to expand (because doing so would cause data to be cached in a > >cache way which might have high eviction rate in the future). > >See the example from Will. 
> > > >But for the default cache (that is "unclassified applications" > >i suppose it is beneficial to expand in most cases, that is, > >use maximum amount of cache irrespective of eviction rate, which > >is the behaviour that exists now without CAT). > > > >So perhaps a new flag "expand=y/n" can be added to the cgroup > >directories... What do you say? > > > >Userspace representation of CAT > >------------------------------- > > > >Usage model: > >1) measure application performance without L3 cache reservation. > >2) measure application perf with L3 cache reservation and > >X number of cache ways until desired performance is attained. > > > >Requirements: > >1) Persistency of CLOS configuration across hardware. On migration > >of operating system or application between different hardware > >systems we'd like the following to be maintained: > > - exclusive number of bytes (*) reserved to a certain CLOSid. > > - shared number of bytes (*) reserved between a certain group > > of CLOSid's. > > > >For both code and data, rounded down or up in cache way size. > > > >2) Reasoning: > >Different CBM masks in different hardware platforms might be necessary > >to specify the same CLOS configuration, in terms of exclusive number of > >bytes and shared number of bytes. (cache-way rounded number of bytes). > >For example, due to L3 allocation by other hardware entities in certain parts > >of the cache it might be necessary to relocate CBM mask to achieve > >the same CLOS configuration. > > > >3) Proposed format: > > > > Few questions from a random listener, I apologise if some of them are > in a wrong place due to me missing some information from past threads. > > I'm not sure whether the following proposal to the format is the > internal structure or what's going to be in cgroups. If this is > user-visible interface, I think it could be a little less detailed. User visible interface. The idea is to have userspace code that performs [ user visible specification ] ----> [ cbm bitmasks on present hardware platform ] In systemd, probably (or whatever is between the user and the cgroup interface). > >sharedregionK.exclusive - Number of exclusive cache bytes reserved for > > shared region. > >sharedregionK.excl_data - Number of exclusive cache data bytes reserved for > > shared region. > >sharedregionK.excl_bytes - Number of exclusive cache code bytes reserved for > > shared region. > >sharedregionK.round_down - Round down to cache way bytes from respective number > > specification (default is round up). > >sharedregionK.expand - y/n - Expand shared region to more cache ways > > when available (default N). > > > >cgroupN.exclusive - Number of exclusive L3 cache bytes reserved > > for cgroup. > >cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved > > for cgroup. > >cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved > > for cgroup. > > By exclusive, you mean that it's exclusive to the tasks in this > cgroup? Correct. > The thing is that we must differentiate between limiting some > process's from hogging the memory (like example 2 below) and making > some part of the cache exclusive for particular application (example 1 > below). AFAICS there is no difference because: both require exclusive cache access: the hog wants exclusive access between any other user of its cachelines will be penalized. the high performance application wants exclusive cache access because any other user of its cachelines will penalize it. Where do you see the need to differentiate? 
> I just hope we won't need to add something similar to 'isolcpus=' just > so we can make sure none of the tasks in the root cgroup can spoil the > part of the cache we need to have exclusive. > > I'm not sure creating a new subgroup and moving all the tasks there > would work, It certainly is not possible with other cgroups, like the > cpuset cgroup mentioned beforehand. Why not? Should be able to place all tasks in a given cgroup? (trying to setup systemd to do that now...). > I also don't quite fully understand how the co-mounting with the > cpuset cgroup should work, but that's not design-related. Neither do I. > One more question, how does this work on systems with multiple L3 > caches (e.g. large NUMA node systems)? I'm guessing if the process is > running only on some CPUs, the wrmsr() will be called on that > particular CPU(s), right? Not in the current patchset, that has to be fixed... > >cgroupN.round_down - Round down to cache way bytes from respective number > > specification (default is round up). > >cgroupN.expand - y/n - Expand shared region to more cache ways when > > available (default N). > >cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared > >regions) > > > >Example 1: > >One application with 2M exclusive cache, two applications > >with 1M exclusive each, sharing an expansive shared region of 1M. > > > >cgroup1.exclusive = 2M > > > >sharedregion1.exclusive = 1M > >sharedregion1.expand = Y > > > >cgroup2.exclusive = 1M > >cgroup2.shared = sharedregion1 > > > >cgroup3.exclusive = 1M > >cgroup3.shared = sharedregion1 > > > >Example 2: > >3 high performance applications running, one of which is a cache hog > >with no cache locality. > > > >cgroup1.exclusive = 8M > >cgroup2.exclusive = 8M > > > >cgroup3.exclusive = 512K > >cgroup3.round_down = Y > > > >In all cases the default cgroup (which requires no explicit > >specification) is expansive and uses the remaining cache > >ways, including the ways shared by other hardware entities. > > > >-- > >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >the body of a message to majordomo@vger.kernel.org > >More majordomo info at http://vger.kernel.org/majordomo-info.html > >Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 30+ messages in thread
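The translation Marcelo mentions, from a user-visible byte count to a CBM bitmask on the present platform, could look roughly like the sketch below. The bytes_to_cbm() helper, its parameters and the placement at first_free_bit are assumptions made for illustration; only the cache-way rounding rule comes from the proposal above.

	#include <stdio.h>

	/* Hypothetical helper: turn a byte-count reservation into a contiguous CBM */
	static unsigned long bytes_to_cbm(unsigned long bytes, unsigned long l3_size,
					  unsigned int cbm_len, int round_down,
					  unsigned int first_free_bit)
	{
		unsigned long way_size = l3_size / cbm_len;
		unsigned long ways = bytes / way_size;

		if (!round_down && bytes % way_size)
			ways++;			/* default is to round up to a full way */
		if (ways == 0 || ways > cbm_len - first_free_bit)
			return 0;		/* cannot be satisfied on this platform */

		/* contiguous run of 'ways' bits starting at the first free bit */
		return ((1UL << ways) - 1) << first_free_bit;
	}

	int main(void)
	{
		/* e.g. cgroup1.exclusive = 2M on a 20M L3 with a 16-bit CBM */
		printf("cbm = 0x%lx\n", bytes_to_cbm(2UL << 20, 20UL << 20, 16, 0, 0));
		return 0;
	}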
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-08-03 15:13 ` Marcelo Tosatti @ 2015-08-03 18:22 ` Vikas Shivappa 0 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-08-03 18:22 UTC (permalink / raw) To: Marcelo Tosatti Cc: Martin Kletzander, Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D Hello Marcelo/Martin, Like I mentioned let me modify the documentation to better help understand the usage. Things like updating each package bitmask is already in the patches. Lets discuss offline and come up a well defined proposal for change if any and then update that in next series. We seem to be just looping over same items. Thanks, Vikas On Mon, 3 Aug 2015, Marcelo Tosatti wrote: > On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote: >> On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote: >>> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: >>>> >>>> >>>> Marcello, >>>> >>>> >>>> On Wed, 29 Jul 2015, Marcelo Tosatti wrote: >>>>> >>>>> How about this: >>>>> >>>>> desiredclos (closid p1 p2 p3 p4) >>>>> 1 1 0 0 0 >>>>> 2 0 0 0 1 >>>>> 3 0 1 1 0 >>>> >>>> #1 Currently in the rdt cgroup , the root cgroup always has all the >>>> bits set and cant be changed (because the cgroup hierarchy would by >>>> default make this to have all bits as all the children need to have >>>> a subset of the root's bitmask). So if the user creates a cgroup and >>>> not put any task in it , the tasks in the root cgroup could be still >>>> using that part of the cache. Thats the reason i say we can have >>>> really 'exclusive' masks. >>>> >>>> Or in other words - there is always a desired clos (0) which has all >>>> parts set which acts like a default pool. >>>> >>>> Also the parts can overlap. Please apply this for all the below >>>> comments which will change the way they work. >>>> >>>>> >>>>> p means part. >>>> >>>> I am assuming p = (a contiguous cache capacity bit mask) >>> >>> Yes. >>> >>>>> closid 1 is a exclusive cgroup. >>>>> closid 2 is a "cache hog" class. >>>>> closid 3 is "default closid". >>>>> >>>>> Desiredclos is what user has specified. >>>>> >>>>> Transition 1: desiredclos --> effectiveclos >>>>> Clean all bits of unused closid's >>>>> (that must be updated whenever a >>>>> closid1 cgroup goes from empty->nonempty >>>>> and vice-versa). >>>>> >>>>> effectiveclos (closid p1 p2 p3 p4) >>>>> 1 0 0 0 0 >>>>> 2 0 0 0 1 >>>>> 3 0 1 1 0 >>>> >>>>> >>>>> Transition 2: effectiveclos --> expandedclos >>>>> expandedclos (closid p1 p2 p3 p4) >>>>> 1 0 0 0 0 >>>>> 2 0 0 0 1 >>>>> 3 1 1 1 0 >>>>> Then you have different inplacecos for each >>>>> CPU (see pseudo-code below): >>>>> >>>>> On the following events. >>>>> >>>>> - task migration to new pCPU: >>>>> - task creation: >>>>> >>>>> id = smp_processor_id(); >>>>> for (part = desiredclos.p1; ...; part++) >>>>> /* if my cosid is set and any other >>>>> cosid is clear, for the part, >>>>> synchronize desiredclos --> inplacecos */ >>>>> if (part[mycosid] == 1 && >>>>> part[any_othercosid] == 0) >>>>> wrmsr(part, desiredclos); >>>>> >>>> >>>> Currently the root cgroup would have all the bits set which will act >>>> like a default cgroup where all the otherwise unused parts (assuming >>>> they are a set of contiguous cache capacity bits) will be used. 
>>>> >>>> Otherwise the question is in the expandedclos - who decides to >>>> expand the closx parts to include some of the unused parts.. - that >>>> could just be a default root always ? >>> >>> Right, so the problem is for certain closid's you might never want >>> to expand (because doing so would cause data to be cached in a >>> cache way which might have high eviction rate in the future). >>> See the example from Will. >>> >>> But for the default cache (that is "unclassified applications" >>> i suppose it is beneficial to expand in most cases, that is, >>> use maximum amount of cache irrespective of eviction rate, which >>> is the behaviour that exists now without CAT). >>> >>> So perhaps a new flag "expand=y/n" can be added to the cgroup >>> directories... What do you say? >>> >>> Userspace representation of CAT >>> ------------------------------- >>> >>> Usage model: >>> 1) measure application performance without L3 cache reservation. >>> 2) measure application perf with L3 cache reservation and >>> X number of cache ways until desired performance is attained. >>> >>> Requirements: >>> 1) Persistency of CLOS configuration across hardware. On migration >>> of operating system or application between different hardware >>> systems we'd like the following to be maintained: >>> - exclusive number of bytes (*) reserved to a certain CLOSid. >>> - shared number of bytes (*) reserved between a certain group >>> of CLOSid's. >>> >>> For both code and data, rounded down or up in cache way size. >>> >>> 2) Reasoning: >>> Different CBM masks in different hardware platforms might be necessary >>> to specify the same CLOS configuration, in terms of exclusive number of >>> bytes and shared number of bytes. (cache-way rounded number of bytes). >>> For example, due to L3 allocation by other hardware entities in certain parts >>> of the cache it might be necessary to relocate CBM mask to achieve >>> the same CLOS configuration. >>> >>> 3) Proposed format: >>> >> >> Few questions from a random listener, I apologise if some of them are >> in a wrong place due to me missing some information from past threads. >> >> I'm not sure whether the following proposal to the format is the >> internal structure or what's going to be in cgroups. If this is >> user-visible interface, I think it could be a little less detailed. > > User visible interface. The idea is to have userspace code that performs > > [ user visible specification ] ----> [ cbm bitmasks on present hardware > platform ] > > In systemd, probably (or whatever is between the user and the cgroup > interface). > >>> sharedregionK.exclusive - Number of exclusive cache bytes reserved for >>> shared region. >>> sharedregionK.excl_data - Number of exclusive cache data bytes reserved for >>> shared region. >>> sharedregionK.excl_bytes - Number of exclusive cache code bytes reserved for >>> shared region. >>> sharedregionK.round_down - Round down to cache way bytes from respective number >>> specification (default is round up). >>> sharedregionK.expand - y/n - Expand shared region to more cache ways >>> when available (default N). >>> >>> cgroupN.exclusive - Number of exclusive L3 cache bytes reserved >>> for cgroup. >>> cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved >>> for cgroup. >>> cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved >>> for cgroup. >> >> By exclusive, you mean that it's exclusive to the tasks in this >> cgroup? > > Correct. 
> >> The thing is that we must differentiate between limiting some >> process's from hogging the memory (like example 2 below) and making >> some part of the cache exclusive for particular application (example 1 >> below). > > AFAICS there is no difference because: both require exclusive cache > access: the hog wants exclusive access between any other user of its > cachelines will be penalized. the high performance application wants > exclusive cache access because any other user of its cachelines will > penalize it. > > Where do you see the need to differentiate? > >> I just hope we won't need to add something similar to 'isolcpus=' just >> so we can make sure none of the tasks in the root cgroup can spoil the >> part of the cache we need to have exclusive. >> >> I'm not sure creating a new subgroup and moving all the tasks there >> would work, It certainly is not possible with other cgroups, like the >> cpuset cgroup mentioned beforehand. > > Why not? Should be able to place all tasks in a given cgroup? (trying > to setup systemd to do that now...). > >> I also don't quite fully understand how the co-mounting with the >> cpuset cgroup should work, but that's not design-related. > > Neither do I. > >> One more question, how does this work on systems with multiple L3 >> caches (e.g. large NUMA node systems)? I'm guessing if the process is >> running only on some CPUs, the wrmsr() will be called on that >> particular CPU(s), right? > > Not in the current patchset, that has to be fixed... > >>> cgroupN.round_down - Round down to cache way bytes from respective number >>> specification (default is round up). >>> cgroupN.expand - y/n - Expand shared region to more cache ways when >>> available (default N). >>> cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared >>> regions) >>> >>> Example 1: >>> One application with 2M exclusive cache, two applications >>> with 1M exclusive each, sharing an expansive shared region of 1M. >>> >>> cgroup1.exclusive = 2M >>> >>> sharedregion1.exclusive = 1M >>> sharedregion1.expand = Y >>> >>> cgroup2.exclusive = 1M >>> cgroup2.shared = sharedregion1 >>> >>> cgroup3.exclusive = 1M >>> cgroup3.shared = sharedregion1 >>> >>> Example 2: >>> 3 high performance applications running, one of which is a cache hog >>> with no cache locality. >>> >>> cgroup1.exclusive = 8M >>> cgroup2.exclusive = 8M >>> >>> cgroup3.exclusive = 512K >>> cgroup3.round_down = Y >>> >>> In all cases the default cgroup (which requires no explicit >>> specification) is expansive and uses the remaining cache >>> ways, including the ways shared by other hardware entities. >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ > > > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-30 17:47 ` Vikas Shivappa 2015-07-30 20:08 ` Marcelo Tosatti @ 2015-07-30 20:22 ` Marcelo Tosatti 2015-07-30 23:03 ` Vikas Shivappa 1 sibling, 1 reply; 30+ messages in thread From: Marcelo Tosatti @ 2015-07-30 20:22 UTC (permalink / raw) To: Vikas Shivappa Cc: Auld, Will, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: > > > Marcello, > > > On Wed, 29 Jul 2015, Marcelo Tosatti wrote: > > > >How about this: > > > >desiredclos (closid p1 p2 p3 p4) > > 1 1 0 0 0 > > 2 0 0 0 1 > > 3 0 1 1 0 > > #1 Currently in the rdt cgroup , the root cgroup always has all the > bits set and cant be changed (because the cgroup hierarchy would by > default make this to have all bits as all the children need to have > a subset of the root's bitmask). So if the user creates a cgroup and > not put any task in it , the tasks in the root cgroup could be still > using that part of the cache. Thats the reason i say we can have > really 'exclusive' masks. > > Or in other words - there is always a desired clos (0) which has all > parts set which acts like a default pool. > > Also the parts can overlap. Please apply this for all the below > comments which will change the way they work. > > > > >p means part. > > I am assuming p = (a contiguous cache capacity bit mask) > > >closid 1 is a exclusive cgroup. > >closid 2 is a "cache hog" class. > >closid 3 is "default closid". > > > >Desiredclos is what user has specified. > > > >Transition 1: desiredclos --> effectiveclos > >Clean all bits of unused closid's > >(that must be updated whenever a > >closid1 cgroup goes from empty->nonempty > >and vice-versa). > > > >effectiveclos (closid p1 p2 p3 p4) > > 1 0 0 0 0 > > 2 0 0 0 1 > > 3 0 1 1 0 > > > > >Transition 2: effectiveclos --> expandedclos > >expandedclos (closid p1 p2 p3 p4) > > 1 0 0 0 0 > > 2 0 0 0 1 > > 3 1 1 1 0 > >Then you have different inplacecos for each > >CPU (see pseudo-code below): > > > >On the following events. > > > >- task migration to new pCPU: > >- task creation: > > > > id = smp_processor_id(); > > for (part = desiredclos.p1; ...; part++) > > /* if my cosid is set and any other > > cosid is clear, for the part, > > synchronize desiredclos --> inplacecos */ > > if (part[mycosid] == 1 && > > part[any_othercosid] == 0) > > wrmsr(part, desiredclos); > > > > Currently the root cgroup would have all the bits set which will act > like a default cgroup where all the otherwise unused parts (assuming > they are a set of contiguous cache capacity bits) will be used. Right, but we don't want to place tasks in there in case one cgroup wants exclusive cache access. So whenever you want an exclusive cgroup you'd do: create cgroup-exclusive; reserve desired part of the cache for it. create cgroup-default; reserved all cache minus that of cgroup-exclusive for it. place tasks that belong to cgroup-exclusive into it. place all other tasks (including init) into cgroup-default. Is that right? ^ permalink raw reply [flat|nested] 30+ messages in thread
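The mask arithmetic behind the setup described above is just a complement: the default group gets every bit the exclusive group does not. A small illustration, with the 16-bit maximum CBM width assumed; both resulting masks stay contiguous, which the current l3_cache_mask interface requires.

	#include <stdio.h>

	int main(void)
	{
		unsigned int max_cbm_len  = 16;				/* assumed platform width */
		unsigned int full_mask    = (1u << max_cbm_len) - 1;	/* root keeps 0xffff */
		unsigned int excl_mask    = 0x000f;			/* cgroup-exclusive */
		unsigned int default_mask = full_mask & ~excl_mask;	/* cgroup-default, 0xfff0 */

		printf("exclusive=0x%04x default=0x%04x\n", excl_mask, default_mask);
		return 0;
	}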
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-30 20:22 ` Marcelo Tosatti @ 2015-07-30 23:03 ` Vikas Shivappa 2015-07-31 14:45 ` Marcelo Tosatti 0 siblings, 1 reply; 30+ messages in thread From: Vikas Shivappa @ 2015-07-30 23:03 UTC (permalink / raw) To: Marcelo Tosatti Cc: Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D On Thu, 30 Jul 2015, Marcelo Tosatti wrote: > On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: >> >> >> Marcello, >> >> >> On Wed, 29 Jul 2015, Marcelo Tosatti wrote: >>> >>> How about this: >>> >>> desiredclos (closid p1 p2 p3 p4) >>> 1 1 0 0 0 >>> 2 0 0 0 1 >>> 3 0 1 1 0 >> >> #1 Currently in the rdt cgroup , the root cgroup always has all the >> bits set and cant be changed (because the cgroup hierarchy would by >> default make this to have all bits as all the children need to have >> a subset of the root's bitmask). So if the user creates a cgroup and >> not put any task in it , the tasks in the root cgroup could be still >> using that part of the cache. Thats the reason i say we can have >> really 'exclusive' masks. >> >> Or in other words - there is always a desired clos (0) which has all >> parts set which acts like a default pool. >> >> Also the parts can overlap. Please apply this for all the below >> comments which will change the way they work. > > >> >>> >>> p means part. >> >> I am assuming p = (a contiguous cache capacity bit mask) >> >>> closid 1 is a exclusive cgroup. >>> closid 2 is a "cache hog" class. >>> closid 3 is "default closid". >>> >>> Desiredclos is what user has specified. >>> >>> Transition 1: desiredclos --> effectiveclos >>> Clean all bits of unused closid's >>> (that must be updated whenever a >>> closid1 cgroup goes from empty->nonempty >>> and vice-versa). >>> >>> effectiveclos (closid p1 p2 p3 p4) >>> 1 0 0 0 0 >>> 2 0 0 0 1 >>> 3 0 1 1 0 >> >>> >>> Transition 2: effectiveclos --> expandedclos >>> expandedclos (closid p1 p2 p3 p4) >>> 1 0 0 0 0 >>> 2 0 0 0 1 >>> 3 1 1 1 0 >>> Then you have different inplacecos for each >>> CPU (see pseudo-code below): >>> >>> On the following events. >>> >>> - task migration to new pCPU: >>> - task creation: >>> >>> id = smp_processor_id(); >>> for (part = desiredclos.p1; ...; part++) >>> /* if my cosid is set and any other >>> cosid is clear, for the part, >>> synchronize desiredclos --> inplacecos */ >>> if (part[mycosid] == 1 && >>> part[any_othercosid] == 0) >>> wrmsr(part, desiredclos); >>> >> >> Currently the root cgroup would have all the bits set which will act >> like a default cgroup where all the otherwise unused parts (assuming >> they are a set of contiguous cache capacity bits) will be used. > > Right, but we don't want to place tasks in there in case one cgroup > wants exclusive cache access. > > So whenever you want an exclusive cgroup you'd do: > > create cgroup-exclusive; reserve desired part of the cache > for it. > create cgroup-default; reserved all cache minus that of cgroup-exclusive > for it. > > place tasks that belong to cgroup-exclusive into it. > place all other tasks (including init) into cgroup-default. > > Is that right? Yes you could do that. You can create cgroups to have masks which are exclusive in todays implementation, just that you could also created more cgroups to overlap the masks again.. 
IOW we don't have an exclusive flag for the cgroup mask. Is that a common use case in the server environment, that you need to prevent other cgroups from using a certain mask? (Since the root user should control these allocations, he should know?) > > ^ permalink raw reply [flat|nested] 30+ messages in thread
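If such a use case did justify an exclusive flag, the mask-write path would only need an overlap check against masks already marked exclusive. A sketch of that check; the flag and the structures are hypothetical, nothing like this exists in the posted patches.

	#include <stdio.h>

	struct rdt_group {
		unsigned int cbm;	/* l3_cache_mask of an existing cgroup */
		int exclusive;		/* hypothetical flag, not in the patches */
	};

	/* Reject a new mask that overlaps any mask marked exclusive */
	static int cbm_conflicts(const struct rdt_group *groups, int n,
				 unsigned int new_cbm)
	{
		int i;

		for (i = 0; i < n; i++)
			if (groups[i].exclusive && (groups[i].cbm & new_cbm))
				return 1;
		return 0;
	}

	int main(void)
	{
		struct rdt_group groups[] = {
			{ 0x000f, 1 },	/* exclusive low quarter */
			{ 0xfff0, 0 },	/* shared remainder */
		};

		printf("0x00ff -> %d\n", cbm_conflicts(groups, 2, 0x00ff));	/* 1 */
		printf("0xff00 -> %d\n", cbm_conflicts(groups, 2, 0xff00));	/* 0 */
		return 0;
	}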
* Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-30 23:03 ` Vikas Shivappa @ 2015-07-31 14:45 ` Marcelo Tosatti 0 siblings, 0 replies; 30+ messages in thread From: Marcelo Tosatti @ 2015-07-31 14:45 UTC (permalink / raw) To: Vikas Shivappa Cc: Auld, Will, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote: > > > On Thu, 30 Jul 2015, Marcelo Tosatti wrote: > > >On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: > >> > >> > >>Marcello, > >> > >> > >>On Wed, 29 Jul 2015, Marcelo Tosatti wrote: > >>> > >>>How about this: > >>> > >>>desiredclos (closid p1 p2 p3 p4) > >>> 1 1 0 0 0 > >>> 2 0 0 0 1 > >>> 3 0 1 1 0 > >> > >>#1 Currently in the rdt cgroup , the root cgroup always has all the > >>bits set and cant be changed (because the cgroup hierarchy would by > >>default make this to have all bits as all the children need to have > >>a subset of the root's bitmask). So if the user creates a cgroup and > >>not put any task in it , the tasks in the root cgroup could be still > >>using that part of the cache. Thats the reason i say we can have > >>really 'exclusive' masks. > >> > >>Or in other words - there is always a desired clos (0) which has all > >>parts set which acts like a default pool. > >> > >>Also the parts can overlap. Please apply this for all the below > >>comments which will change the way they work. > > > > > >> > >>> > >>>p means part. > >> > >>I am assuming p = (a contiguous cache capacity bit mask) > >> > >>>closid 1 is a exclusive cgroup. > >>>closid 2 is a "cache hog" class. > >>>closid 3 is "default closid". > >>> > >>>Desiredclos is what user has specified. > >>> > >>>Transition 1: desiredclos --> effectiveclos > >>>Clean all bits of unused closid's > >>>(that must be updated whenever a > >>>closid1 cgroup goes from empty->nonempty > >>>and vice-versa). > >>> > >>>effectiveclos (closid p1 p2 p3 p4) > >>> 1 0 0 0 0 > >>> 2 0 0 0 1 > >>> 3 0 1 1 0 > >> > >>> > >>>Transition 2: effectiveclos --> expandedclos > >>>expandedclos (closid p1 p2 p3 p4) > >>> 1 0 0 0 0 > >>> 2 0 0 0 1 > >>> 3 1 1 1 0 > >>>Then you have different inplacecos for each > >>>CPU (see pseudo-code below): > >>> > >>>On the following events. > >>> > >>>- task migration to new pCPU: > >>>- task creation: > >>> > >>> id = smp_processor_id(); > >>> for (part = desiredclos.p1; ...; part++) > >>> /* if my cosid is set and any other > >>> cosid is clear, for the part, > >>> synchronize desiredclos --> inplacecos */ > >>> if (part[mycosid] == 1 && > >>> part[any_othercosid] == 0) > >>> wrmsr(part, desiredclos); > >>> > >> > >>Currently the root cgroup would have all the bits set which will act > >>like a default cgroup where all the otherwise unused parts (assuming > >>they are a set of contiguous cache capacity bits) will be used. > > > >Right, but we don't want to place tasks in there in case one cgroup > >wants exclusive cache access. > > > >So whenever you want an exclusive cgroup you'd do: > > > >create cgroup-exclusive; reserve desired part of the cache > >for it. > >create cgroup-default; reserved all cache minus that of cgroup-exclusive > >for it. > > > >place tasks that belong to cgroup-exclusive into it. > >place all other tasks (including init) into cgroup-default. > > > >Is that right? > > Yes you could do that. 
> > You can create cgroups to have masks which are exclusive in todays > implementation, just that you could also created more cgroups to > overlap the masks again.. iow we dont have an exclusive flag for the > cgroup mask. > Is that a common use case in the server environment that you need to > prevent other cgroups from using a certain mask ? (since the root > user should control these allocations .. he should know?) Yes, there are two known use-cases that have this characteristic: 1) High performance numeric application which has been optimized to a certain fraction of the cache. 2) Low latency application in multi-application OS. For both cases exclusive cache access is wanted. ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-07-29 1:28 ` Auld, Will 2015-07-29 19:32 ` Marcelo Tosatti @ 2015-07-29 20:07 ` Vikas Shivappa 1 sibling, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-07-29 20:07 UTC (permalink / raw) To: Auld, Will Cc: Shivappa, Vikas, Marcelo Tosatti, Vikas Shivappa, linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org, tj@kernel.org, peterz@infradead.org, Fleming, Matt, Williamson, Glenn P, Juvva, Kanaka D On Tue, 28 Jul 2015, Auld, Will wrote: > > >> -----Original Message----- >> >> Same comment as above - Cgroup masks can always overlap and other cgroups >> can allocate the same cache , and hence wont have exclusive cache allocation. > > [Auld, Will] You can define all the cbm to provide one clos with an exclusive area Do you mean a CLOS that has all the bits set. We donot support exclusive area today. The bits in the mask can overlap .. hence can always share the same cache allocation . > >> >> So natuarally the cgroup with tasks would get to use the cache if it has the same >> mask (say representing 50% of cache in your example) as others . > > [Auld, Will] automatic adjustment of the cbm make me nervous. There are times > when we want to limit the cache for a process independent of whether there is > lots of unused cache. > Please see example below - In general , I just mean the cache mask can have bits that can overlap - does not matter whether there is tasks in it or not. > >> (assume there are 8 bits max cbm) >> cgroupa - mask - 0xf >> cgroupb - mask - 0xf . Now if cgroupa has no tasks , cgroupb naturally gets all >> the cache. >> >> Thanks, >> Vikas ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH V11 0/9] Hot cpu handling changes to cqm,rapl and Intel Cache Allocation support @ 2015-06-25 19:25 Vikas Shivappa 2015-06-25 19:25 ` [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide Vikas Shivappa 0 siblings, 1 reply; 30+ messages in thread From: Vikas Shivappa @ 2015-06-25 19:25 UTC (permalink / raw) To: linux-kernel Cc: x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, kanaka.d.juvva, glenn.p.williamson, vikas.shivappa, vikas.shivappa This patch has some changes to hot cpu handling code in existing cache monitoring and RAPL kernel code. This improves hot cpu notification handling by not looping through all online cpus which could be expensive in large systems. Cache allocation patches(dependent on prep patches) adds a cgroup subsystem to support the new Cache Allocation feature found in future Intel Xeon Intel processors. Cache Allocation is a sub-feature with in Resource Director Technology(RDT) feature. RDT which provides support to control sharing of platform resources like L3 cache. Cache Allocation Technology provides a way for the Software (OS/VMM) to restrict cache allocation to a defined 'subset' of cache which may be overlapping with other 'subsets'. This feature is used when allocating a line in cache ie when pulling new data into the cache. The programming of the h/w is done via programming MSRs. The patch series support to perform L3 cache allocation. In todays new processors the number of cores is continuously increasing which in turn increase the number of threads or workloads that can simultaneously be run. When multi-threaded applications run concurrently, they compete for shared resources including L3 cache. At times, this L3 cache resource contention may result in inefficient space utilization. For example a higher priority thread may end up with lesser L3 cache resource or a cache sensitive app may not get optimal cache occupancy thereby degrading the performance. Cache Allocation kernel patch helps provides a framework for sharing L3 cache so that users can allocate the resource according to set requirements. More information about the feature can be found in the Intel SDM, Volume 3 section 17.15. SDM does not yet use the 'RDT' term yet and it is planned to be changed at a later time. *All the patches will apply on tip/perf/core*. Changes in V11: As per feedback from Thomas and discussions: - removed the cpumask_any_online_but.its usage could be easily replaced with 'and'ing the cpu_online mask during hot cpu notifications. Thomas pointed the API had issue where there tmp mask wasnt thread safe. I realized the support it indends to give does not seem to match with others in cpumask.h - the cqm patch which added mutex to hot cpu notification was merged with the cqm hot plug patch to improve notificaiton handling without commit logs and wasnt correct. seperated and just sending the cqm hot plug patch and will send the mutex cqm patch seperately - fixed issues in the hot cpu rdt handling. Since the cpu_starting was replaced with cpu_online , now the wrmsr needs to be actually scheduled on the target cpu - which the previous patch wasnt doing. Replaced the cpu_dead with cpu_down_prepare. the cpu_down_failed is handled the same way as cpu_online. By waiting till cpu_dead to update the rdt_cpumask , we may miss some of the msr updates. 
Changes in V10: - changed the hot cpu notification we handle in cqm and cache allocation to cpu_online and cpu_dead and removed others as the cpu_*_prepare also had corresponding cancel notification which we did not handle. - changed the file in rdt cgroup to l3_cache_mask to represent that its for l3 cache. Changes as per Thomas and PeterZ feedback: - fixed the cpumask declarations in cpumask.h and rdt,cmt and rapl to have static so that they burden stack space when large. - removed mutex in cpu_starting notifications, replaced the locking with cpu_online. - changed name from hsw_probetest to cache_alloc_hsw_probe. - changed x86_rdt_max_closid to x86_cache_max_closid and x86_rdt_max_cbm_len to x86_cache_max_cbm_len as they are only related to cache allocation and not to all rdt. Changes in V9: Changes made as per Thomas feedback: - added a comment where we call schedule in code only when RDT is enabled. - Reordered the local declarations to follow convention in intel_cqm_xchg_rmid Changes in V8: Thanks to feedback from Thomas and following changes are made based on his feedback: Generic changes/Preparatory patches: -added a new cpumask_any_online_but which returns the next core sibling that is online. -Made changes in Intel Cache monitoring and Intel RAPL(Running average power limit) code to use the new function above to find the next cpu that can be a designated reader for the package. Also changed the way the package masks are computed which can be simplified using topology_core_cpumask. Cache allocation specific changes: -Moved the documentation to the begining of the patch series. -Added more documentation for the rdt cgroup files in the documentation. -Changed the dmesg output when cache alloc is enabled to be more helpful and updated few other comments to be better readable. -removed __ prefix to functions like clos_get which were not following convention. -added code to take action on a WARN_ON in clos_put. Made a few other changes to reduce code text. -updated better readable/Kernel doc format comments for the call to rdt_css_alloc, datastructures . -removed cgroup_init -changed the names of functions to only have intel_ prefix for external APIs. -replaced (void *)&closid with (void *)closid when calling on_each_cpu_mask -fixed the reference release of closid during cache bitmask write. -changed the code to not ignore a cache mask which has bits set outside of the max bits allowed. It returns an error instead. -replaced bitmap_set(&max_mask, 0, max_cbm_len) with max_mask = (1ULL << max_cbm) - 1. - update the rdt_cpu_mask which has one cpu for each package, using topology_core_cpumask instead of looping through existing rdt_cpu_mask. Realized topology_core_cpumask name is misleading and it actually returns the cores in a cpu package! -arranged the code better to have the code relating to similar task together. -Improved searching for the next online cpu sibling and maintaining the rdt_cpu_mask which has one cpu per package. -removed the unnecessary wrapper rdt_enabled. -removed unnecessary spin lock and rculock in the scheduling code. -merged all scheduling code into one patch not seperating the RDT common software cache code. Changes in V7: Based on feedback from PeterZ and Matt and following discussions : - changed lot of naming to reflect the data structures which are common to RDT and specific to Cache allocation. - removed all usage of 'cat'. 
replace with more friendly cache allocation - fixed lot of convention issues (whitespace, return paradigm etc) - changed the scheduling hook for RDT to not use a inline. - removed adding new scheduling hook and just reused the existing one similar to perf hook. Changes in V6: - rebased to 4.1-rc1 which has the CMT(cache monitoring) support included. - (Thanks to Marcelo's feedback).Fixed support for hot cpu handling for IA32_L3_QOS MSRs. Although during deep C states the MSR need not be restored this is needed when physically a new package is added. -some other coding convention changes including renaming to cache_mask using a refcnt to track the number of cgroups using a closid in clos_cbm map. -1b cbm support for non-hsw SKUs. HSW is an exception which needs the cache bit masks to be at least 2 bits. Changes in v5: - Added support to propagate the cache bit mask update for each package. - Removed the cache bit mask reference in the intel_rdt structure as there was no need for that and we already maintain a separate closid<->cbm mapping. - Made a few coding convention changes which include adding the assertion while freeing the CLOSID. Changes in V4: - Integrated with the latest V5 CMT patches. - Changed naming of cgroup to rdt(resource director technology) from cat(cache allocation technology). This was done as the RDT is the umbrella term for platform shared resources allocation. Hence in future it would be easier to add resource allocation to the same cgroup - Naming changes also applied to a lot of other data structures/APIs. - Added documentation on cgroup usage for cache allocation to address a lot of questions from various academic and industry regarding cache allocation usage. Changes in V3: - Implements a common software cache for IA32_PQR_MSR - Implements support for hsw Cache Allocation enumeration. This does not use the brand strings like earlier version but does a probe test. The probe test is done only on hsw family of processors - Made a few coding convention, name changes - Check for lock being held when ClosID manipulation happens Changes in V2: - Removed HSW specific enumeration changes. Plan to include it later as a separate patch. - Fixed the code in prep_arch_switch to be specific for x86 and removed x86 defines. - Fixed cbm_write to not write all 1s when a cgroup is freed. - Fixed one possible memory leak in init. - Changed some of manual bitmap manipulation to use the predefined bitmap APIs to make code more readable - Changed name in sources from cqe to cat - Global cat enable flag changed to static_key and disabled cgroup early_init [PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling [PATCH 2/9] x86/intel_rapl: Modify hot cpu notification handling for [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup [PATCH 4/9] x86/intel_rdt: Add support for Cache Allocation detection [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service [PATCH 6/9] x86/intel_rdt: Add support for cache bit mask management [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT [PATCH 8/9] x86/intel_rdt: Hot cpu support for Cache Allocation [PATCH 9/9] x86/intel_rdt: Intel haswell Cache Allocation enumeration ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide 2015-06-25 19:25 [PATCH V11 0/9] Hot cpu handling changes to cqm,rapl and Intel Cache Allocation support Vikas Shivappa @ 2015-06-25 19:25 ` Vikas Shivappa 0 siblings, 0 replies; 30+ messages in thread From: Vikas Shivappa @ 2015-06-25 19:25 UTC (permalink / raw) To: linux-kernel Cc: x86, hpa, tglx, mingo, tj, peterz, matt.fleming, will.auld, kanaka.d.juvva, glenn.p.williamson, vikas.shivappa, vikas.shivappa Adds a description of Cache allocation technology, overview of kernel implementation and usage of Cache Allocation cgroup interface. Cache allocation is a sub-feature of Resource Director Technology(RDT) Allocation or Platform Shared resource control which provides support to control Platform shared resources like L3 cache. Currently L3 Cache is the only resource that is supported in RDT. More information can be found in the Intel SDM, Volume 3, section 17.15. Cache Allocation Technology provides a way for the Software (OS/VMM) to restrict cache allocation to a defined 'subset' of cache which may be overlapping with other 'subsets'. This feature is used when allocating a line in cache ie when pulling new data into the cache. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> --- Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 215 insertions(+) create mode 100644 Documentation/cgroups/rdt.txt diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt new file mode 100644 index 0000000..dfff477 --- /dev/null +++ b/Documentation/cgroups/rdt.txt @@ -0,0 +1,215 @@ + RDT + --- + +Copyright (C) 2014 Intel Corporation +Written by vikas.shivappa@linux.intel.com +(based on contents and format from cpusets.txt) + +CONTENTS: +========= + +1. Cache Allocation Technology + 1.1 What is RDT and Cache allocation ? + 1.2 Why is Cache allocation needed ? + 1.3 Cache allocation implementation overview + 1.4 Assignment of CBM and CLOS + 1.5 Scheduling and Context Switch +2. Usage Examples and Syntax + +1. Cache Allocation Technology(Cache allocation) +=================================== + +1.1 What is RDT and Cache allocation +------------------------------------ + +Cache allocation is a sub-feature of Resource Director Technology(RDT) +Allocation or Platform Shared resource control which provides support to +control Platform shared resources like L3 cache. Currently L3 Cache is +the only resource that is supported in RDT. More information can be +found in the Intel SDM, Volume 3, section 17.15. + +Cache Allocation Technology provides a way for the Software (OS/VMM) +to restrict cache allocation to a defined 'subset' of cache which may +be overlapping with other 'subsets'. This feature is used when +allocating a line in cache ie when pulling new data into the cache. +The programming of the h/w is done via programming MSRs. + +The different cache subsets are identified by CLOS identifier (class +of service) and each CLOS has a CBM (cache bit mask). The CBM is a +contiguous set of bits which defines the amount of cache resource that +is available for each 'subset'. + +1.2 Why is Cache allocation needed +---------------------------------- + +In todays new processors the number of cores is continuously increasing, +especially in large scale usage models where VMs are used like +webservers and datacenters. The number of cores increase the number +of threads or workloads that can simultaneously be run. 
When +multi-threaded-applications, VMs, workloads run concurrently they +compete for shared resources including L3 cache. + +The Cache allocation enables more cache resources to be made available +for higher priority applications based on guidance from the execution +environment. + +The architecture also allows dynamically changing these subsets during +runtime to further optimize the performance of the higher priority +application with minimal degradation to the low priority app. +Additionally, resources can be rebalanced for system throughput benefit. + +This technique may be useful in managing large computer systems which +large L3 cache. Examples may be large servers running instances of +webservers or database servers. In such complex systems, these subsets +can be used for more careful placing of the available cache +resources. + +1.3 Cache allocation implementation Overview +-------------------------------------------- + +Kernel implements a cgroup subsystem to support cache allocation. + +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal +to the kernel and not exposed to user. Each cgroup would have one CBM +and would just represent one cache 'subset'. + +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the +cgroup never fails. When a child cgroup is created it inherits the +CLOSid and the CBM from its parent. When a user changes the default +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once +the kernel runs out of maximum CLOSids it can support. +User can create as many cgroups as he wants but having different CBMs +at the same time is restricted by the maximum number of CLOSids +(multiple cgroups can have the same CBM). +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter +for each cgroup using a CLOSid. + +The tasks in the cgroup would get to fill the L3 cache represented by +the cgroup's 'l3_cache_mask' file. + +Root directory would have all available bits set in 'l3_cache_mask' file +by default. + +Each RDT cgroup directory has the following files. Some of them may be a +part of common RDT framework or be specific to RDT sub-features like +cache allocation. + + - intel_rdt.l3_cache_mask: The cache bitmask(CBM) is represented by this + file. The bitmask must be contiguous and would have a 1 or 2 bit + minimum length. + +1.4 Assignment of CBM,CLOS +-------------------------- + +The 'l3_cache_mask' needs to be a subset of the parent node's +'l3_cache_mask'. Any contiguous subset of these bits(with a minimum of 2 +bits on hsw SKUs) maybe set to indicate the cache mapping desired. The +'l3_cache_mask' between 2 directories can overlap. The 'l3_cache_mask' would +represent the cache 'subset' of the Cache allocation cgroup. For ex: on +a system with 16 bits of max cbm bits, if the directory has the least +significant 4 bits set in its 'l3_cache_mask' file(meaning the 'l3_cache_mask' +is just 0xf), it would be allocated the right quarter of the Last level +cache which means the tasks belonging to this Cache allocation cgroup +can use the right quarter of the cache to fill. If it +has the most significant 8 bits set ,it would be allocated the left +half of the cache(8 bits out of 16 represents 50%). + +The cache portion defined in the CBM file is available to all tasks +within the cgroup to fill and these task are not allowed to allocate +space in other parts of the cache. 
+ +1.5 Scheduling and Context Switch +--------------------------------- + +During context switch kernel implements this by writing the +CLOSid (internally maintained by kernel) of the cgroup to which the +task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written +when there is a change in the CLOSid for the CPU in order to minimize +the latency incurred during context switch. + +The following considerations are done for the PQR MSR write so that it +has minimal impact on scheduling hot path: +- This path doesnt exist on any non-intel platforms. +- On Intel platforms, this would not exist by default unless CGROUP_RDT +is enabled. +- remains a no-op when CGROUP_RDT is enabled and intel hardware does not +support the feature. +- When feature is available, still remains a no-op till the user +manually creates a cgroup *and* assigns a new cache mask. Since the +child node inherits the parents cache mask , by cgroup creation there is +no scheduling hot path impact from the new cgroup. +- per cpu PQR values are cached and the MSR write is only done when +there is a task with different PQR is scheduled on the CPU. Typically if +the task groups are bound to be scheduled on a set of CPUs , the number +of MSR writes is greatly reduced. + +2. Usage examples and syntax +============================ + +To check if Cache allocation was enabled on your system + +dmesg | grep -i intel_rdt +should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx +the length of l3_cache_mask and CLOS should depend on the system you use. + +Also /proc/cpuinfo would have rdt(if rdt is enabled) and cat_l3( if L3 + cache allocation is enabled). + +Following would mount the cache allocation cgroup subsystem and create +2 directories. Please refer to Documentation/cgroups/cgroups.txt on +details about how to use cgroups. + + cd /sys/fs/cgroup + mkdir rdt + mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt + cd rdt + +Create 2 rdt cgroups + + mkdir group1 + mkdir group2 + +Following are some of the Files in the directory + + ls + rdt.l3_cache_mask + tasks + +Say if the cache is 2MB and cbm supports 16 bits, then setting the +below allocates the 'right 1/4th(512KB)' of the cache to group2 + +Edit the CBM for group2 to set the least significant 4 bits. This +allocates 'right quarter' of the cache. + + cd group2 + /bin/echo 0xf > rdt.l3_cache_mask + + +Edit the CBM for group2 to set the least significant 8 bits.This +allocates the right half of the cache to 'group2'. + + cd group2 + /bin/echo 0xff > rdt.l3_cache_mask + +Assign tasks to the group2 + + /bin/echo PID1 > tasks + /bin/echo PID2 > tasks + + Meaning now threads + PID1 and PID2 get to fill the 'right half' of + the cache as the belong to cgroup group2. + +Create a group under group2 + + cd group2 + mkdir group21 + cat rdt.l3_cache_mask + 0xff - inherits parents mask. + + /bin/echo 0xfff > rdt.l3_cache_mask - throws error as mask has to parent's mask's subset + +In order to restrict RDT cgroups to specific set of CPUs rdt can be +comounted with cpusets. -- 1.9.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
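A standalone sketch of the context-switch behaviour section 1.5 of the document above describes: the incoming task's CLOSid is pushed to IA32_PQR_ASSOC only when it differs from the value this CPU last wrote. The struct, the stub and the zero RMID are illustrative assumptions; the real hook sits in __switch_to and uses the kernel's per-cpu pqr_state.

	#include <stdio.h>

	#define MSR_IA32_PQR_ASSOC	0x0c8f

	struct task {
		unsigned int closid;
	};

	static unsigned int pqr_cached_closid;		/* per-cpu in the real code */

	/* Stub standing in for the real wrmsr; low word = RMID, high word = CLOSid */
	static void wrmsr_stub(unsigned int msr, unsigned int lo, unsigned int hi)
	{
		printf("wrmsr 0x%x rmid=%u closid=%u\n", msr, lo, hi);
	}

	static void rdt_sched_in(const struct task *next)
	{
		if (next->closid == pqr_cached_closid)
			return;				/* skip the write on the hot path */

		pqr_cached_closid = next->closid;
		wrmsr_stub(MSR_IA32_PQR_ASSOC, 0, next->closid);
	}

	int main(void)
	{
		struct task a = { 0 }, b = { 2 };

		rdt_sched_in(&a);	/* no write: closid unchanged since boot */
		rdt_sched_in(&b);	/* write: closid 0 -> 2 */
		rdt_sched_in(&b);	/* no write: same closid cached */
		return 0;
	}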