public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 1/2] selftests/resctrl: fastcat for benchmarking counter reads
@ 2024-11-06 15:43 Peter Newman
  2024-11-06 15:43 ` [PATCH v2 2/2] x86/resctrl: Don't workqueue local event " Peter Newman
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Newman @ 2024-11-06 15:43 UTC (permalink / raw)
  To: reinette.chatre, fenghua.yu
  Cc: babu.moger, bp, dave.hansen, eranian, hpa, james.morse,
	linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx, tony.luck,
	x86, Peter Newman

This patch is provided for reference and not intended for submission.

This test program is used to evaluate the cost of reading resctrl
counters. It is essentially a drop-in replacement for cat that allows
small files, such as resctrl counter files, to be read repeatedly using
pread(). In contrast, repeated invocations of cat would open and close
the counter files on every sample, which is an order of magnitude more
costly than the actual counter read operations.

For example, the steps below were used to determine the cost of reading
a large number of local or remote counters on a dual-socket AMD Zen2:

Bind to L3 domain 6:

 # taskset -c 24-27,152-155 bash

Measure local read cost:

 # cd /sys/fs/resctrl
 # FASTCAT_READ_COUNT=100 FASTCAT_DELAY_USEC=1000000 perf stat \
       /tmp/fastcat mon_groups/*/mon_data/mon_L3_06/mbm_*

Measure remote read cost:

 # FASTCAT_READ_COUNT=100 FASTCAT_DELAY_USEC=1000000 perf stat \
       /tmp/fastcat mon_groups/*/mon_data/mon_L3_07/mbm_*

Signed-off-by: Peter Newman <peternewman@google.com>
---
v1: https://lore.kernel.org/lkml/20241031142553.3963058-1-peternewman@google.com/

---
 tools/testing/selftests/resctrl/.gitignore |  1 +
 tools/testing/selftests/resctrl/Makefile   |  6 +-
 tools/testing/selftests/resctrl/fastcat.c  | 73 ++++++++++++++++++++++
 3 files changed, 78 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/resctrl/fastcat.c

diff --git a/tools/testing/selftests/resctrl/.gitignore b/tools/testing/selftests/resctrl/.gitignore
index ab68442b6bc8d..11a40e331f4ad 100644
--- a/tools/testing/selftests/resctrl/.gitignore
+++ b/tools/testing/selftests/resctrl/.gitignore
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 resctrl_tests
+fastcat
diff --git a/tools/testing/selftests/resctrl/Makefile b/tools/testing/selftests/resctrl/Makefile
index f408bd6bfc3d4..c5ec1e8289390 100644
--- a/tools/testing/selftests/resctrl/Makefile
+++ b/tools/testing/selftests/resctrl/Makefile
@@ -3,10 +3,12 @@
 CFLAGS = -g -Wall -O2 -D_FORTIFY_SOURCE=2
 CFLAGS += $(KHDR_INCLUDES)
 
-TEST_GEN_PROGS := resctrl_tests
+TEST_GEN_PROGS := resctrl_tests fastcat
 
 LOCAL_HDRS += $(wildcard *.h)
 
 include ../lib.mk
 
-$(OUTPUT)/resctrl_tests: $(wildcard *.c)
+$(OUTPUT)/resctrl_tests: cache.c cat_test.c cmt_test.c fill_buf.c mba_test.c mbm_test.c resctrlfs.c resctrl_tests.c resctrl_val.c
+
+$(OUTPUT)/fastcat: fastcat.c
diff --git a/tools/testing/selftests/resctrl/fastcat.c b/tools/testing/selftests/resctrl/fastcat.c
new file mode 100644
index 0000000000000..0b362421fd34d
--- /dev/null
+++ b/tools/testing/selftests/resctrl/fastcat.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+int main(int argc, char **argv)
+{
+	int nfiles = argc - 1;
+	char *nread_str;
+	int nread;
+	int delay;
+	int *fds;
+	int i, j;
+
+	nread_str = getenv("FASTCAT_READ_COUNT");
+	if (!nread_str)
+		nread = 1;
+	else
+		nread = atoi(nread_str);
+
+	nread_str = getenv("FASTCAT_DELAY_USEC");
+	if (!nread_str)
+		delay = 0;
+	else
+		delay = atoi(nread_str);
+
+	if (nfiles < 1)
+		exit(1);
+
+	fds = malloc(sizeof(*fds) * nfiles);
+	if (!fds) {
+		perror("malloc");
+		exit(1);
+	}
+
+	printf("opening %d files\n", nfiles);
+
+	for (i = 1; i < argc; i++) {
+		fds[i - 1] = open(argv[i], O_RDONLY);
+		if (fds[i - 1] < 0) {
+			perror(argv[i]);
+			exit(1);
+		}
+	}
+
+	printf("reading %d files %d times\n", nfiles, nread);
+
+	for (j = 0; j < nread; j++) {
+		for (i = 0; i < nfiles; i++) {
+			// Assumed to be large enough for any output of
+			// mbm_*_bytes
+			char buf[40];
+			ssize_t r;
+
+			r = pread(fds[i], buf, sizeof(buf), 0);
+			if (r < 0) {
+				perror(argv[i + 1]);
+				exit(1);
+			}
+		}
+		if (delay)
+			// Try to read cold...
+			usleep(delay);
+	}
+
+	printf("closing %d files\n", nfiles);
+	for (i = 0; i < nfiles; i++)
+		close(fds[i]);
+
+	return 0;
+}

base-commit: 81983758430957d9a5cb3333fe324fd70cf63e7e
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-06 15:43 [PATCH v2 1/2] selftests/resctrl: fastcat for benchmarking counter reads Peter Newman
@ 2024-11-06 15:43 ` Peter Newman
  2024-11-07  1:10   ` Reinette Chatre
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Newman @ 2024-11-06 15:43 UTC (permalink / raw)
  To: reinette.chatre, fenghua.yu
  Cc: babu.moger, bp, dave.hansen, eranian, hpa, james.morse,
	linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx, tony.luck,
	x86, Peter Newman

Performance-conscious users may use threads bound to CPUs within a
specific monitoring domain to ensure that all bandwidth counters can be
read efficiently. The hardware counters are only accessible to CPUs
within the domain, so requests from CPUs outside the domain are
forwarded to a kernel worker or IPI handler, incurring a substantial
performance penalty on each read.  Recently, this penalty was observed
to be paid by local reads as well.

To support blocking implementations of resctrl_arch_rmid_read(),
mon_event_read() switched to smp_call_on_cpu() in most cases to read
event counters using a kernel worker thread. Unlike
smp_call_function_any(), which optimizes to a local function call when
the calling CPU is in the target cpumask, smp_call_on_cpu() queues the
work unconditionally.

Introduce resctrl_arch_event_read_blocks() to allow the implementation
to indicate whether reading a particular event counter blocks. Use this
to limit the usage of smp_call_on_cpu() to only the counters where it is
actually needed. This reverts to the previous behavior of always using
smp_call_function_any() for all x86 implementations.

This is significant when supporting configurations such as a dual-socket
AMD Zen2, with 32 L3 monitoring domains and 256 RMIDs. To read both MBM
counters for all groups on all domains requires 16384 (32*256*2) counter
reads. The resolution of global, per-group MBM data which can be
provided is therefore sensitive to the cost of each counter read.
Furthermore, redirecting this much work to IPI handlers or worker
threads at a regular interval is disruptive to the present workload.

The test program fastcat, which was introduced in an earlier patch, was
used to simulate the impact of this change on an optimized event
counter-reading procedure. The goal is to maximize the frequency at
which MBM counters can be dumped, so the benchmark determines the cost
of an additional global MBM counter sample.

The total number of cycles needed to read all local and total MBM
counters for a large number of monitoring groups was collected using the
perf tool. The average over 100 iterations is given, with a 1-second
sleep between iterations to better represent the intended use case. The
test was run bound to the CPUs of a single MBM domain, once targeting
counters in the local domain and again for counters in a remote domain.

AMD EPYC 7B12 64-Core Processor (250 mon groups)

Local Domain:   5.72M -> 1.22M (-78.7%)
Remote Domain:  5.89M -> 5.20M (-11.8%)

Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz (220 mon groups)

Local Domain:   3.37M -> 2.52M (-25.4%)
Remote Domain:  5.16M -> 5.79M (+12.0%)

The slowdown for remote domain reads on Intel is worrying, but since
this change is effectively a revert to old behavior on x86, this
shouldn't be anything new.

Also note that the Remote Domain results and the baseline Local Domain
results only measure cycles in the test program. Because all counter
reading work was carried out in kernel worker threads or IPI handlers,
the total system cost of the operation is greater.

Fixes: 09909e098113 ("x86/resctrl: Queue mon_event_read() instead of sending an IPI")
Signed-off-by: Peter Newman <peternewman@google.com>
---
v1: https://lore.kernel.org/lkml/20241031142553.3963058-2-peternewman@google.com/

---
 arch/x86/include/asm/resctrl.h            | 7 +++++++
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 8 +++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 8b1b6ce1e51b2..8696c0c0e1df4 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -178,6 +178,13 @@ static inline void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid
 static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid,
 					     void *ctx) { };
 
+static inline bool resctrl_arch_event_read_blocks(struct rdt_resource *r,
+						  int evtid)
+{
+	/* all events can be read without blocking */
+	return false;
+}
+
 void resctrl_cpu_detect(struct cpuinfo_x86 *c);
 
 #else
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 200d89a640270..395bcc5362f4e 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -548,8 +548,14 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	 * are all the CPUs nohz_full? If yes, pick a CPU to IPI.
 	 * MPAM's resctrl_arch_rmid_read() is unable to read the
 	 * counters on some platforms if its called in IRQ context.
+	 *
+	 * smp_call_on_cpu() dispatches to a kernel worker
+	 * unconditionally, even when the event can be read much more
+	 * efficiently on the current CPU, so only use it when
+	 * blocking is required.
 	 */
-	if (tick_nohz_full_cpu(cpu))
+	if (tick_nohz_full_cpu(cpu) ||
+	    !resctrl_arch_event_read_blocks(r, evtid))
 		smp_call_function_any(cpumask, mon_event_count, rr, 1);
 	else
 		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-06 15:43 ` [PATCH v2 2/2] x86/resctrl: Don't workqueue local event " Peter Newman
@ 2024-11-07  1:10   ` Reinette Chatre
  2024-11-07 11:01     ` Peter Newman
  0 siblings, 1 reply; 17+ messages in thread
From: Reinette Chatre @ 2024-11-07  1:10 UTC (permalink / raw)
  To: Peter Newman, fenghua.yu
  Cc: babu.moger, bp, dave.hansen, eranian, hpa, james.morse,
	linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx, tony.luck,
	x86

Hi Peter,

On 11/6/24 7:43 AM, Peter Newman wrote:
> Performance-conscious users may use threads bound to CPUs within a
> specific monitoring domain to ensure that all bandwidth counters can be
> read efficiently. The hardware counters are only accessible to CPUs
> within the domain, so requests from CPUs outside the domain are
> forwarded to a kernel worker or IPI handler, incurring a substantial
> performance penalty on each read.  Recently, this penalty was observed
> to be paid by local reads as well.
> 
> To support blocking implementations of resctrl_arch_rmid_read(),
> mon_event_read() switched to smp_call_on_cpu() in most cases to read
> event counters using a kernel worker thread. Unlike
> smp_call_function_any(), which optimizes to a local function call when
> the calling CPU is in the target cpumask, smp_call_on_cpu() queues the
> work unconditionally.
> 
> Introduce resctrl_arch_event_read_blocks() to allow the implementation
> to indicate whether reading a particular event counter blocks. Use this
> to limit the usage of smp_call_on_cpu() to only the counters where it is
> actually needed. This reverts to the previous behavior of always using
> smp_call_function_any() for all x86 implementations.
> 
> This is significant when supporting configurations such as a dual-socket
> AMD Zen2, with 32 L3 monitoring domains and 256 RMIDs. To read both MBM
> counters for all groups on all domains requires 16384 (32*256*2) counter
> reads. The resolution of global, per-group MBM data which can be
> provided is therefore sensitive to the cost of each counter read.
> Furthermore, redirecting this much work to IPI handlers or worker
> threads at a regular interval is disruptive to the present workload.
> 
> The test program fastcat, which was introduced in an earlier patch, was
> used to simulate the impact of this change on an optimized event
> counter-reading procedure. The goal is to maximize the frequency at
> which MBM counters can be dumped, so the benchmark determines the cost
> of an additional global MBM counter sample.
> 
> The total number of cycles needed to read all local and total MBM

This optimization proposal aims to reduce the number of cycles used by
the code that is eventually called after user space reads from a file.
As you already noted, this causes a regression in this exact optimization area
in one out of the four scenarios tested.

As I understand it, this is a slow path, and there are many things that
can impact the number of cycles such a query takes.

I wonder if we could perhaps instead change focus from shaving cycles (after
obtaining a mutex ... ) from a slow path to understand what the use case
is so that resctrl could perhaps obtain a better interface that supports the
use case better overall?

> counters for a large number of monitoring groups was collected using the
> perf tool. The average over 100 iterations is given, with a 1-second
> sleep between iterations to better represent the intended use case. The
> test was run bound to the CPUs of a single MBM domain, once targeting
> counters in the local domain and again for counters in a remote domain.

This sounds as though user space is essentially duplicating what the
MBM overflow handler currently does, which is to run a worker in each domain
to collect MBM data every second from every RMID for both MBM events.

* What are the requirements of this use case?
* Is there some benefit to user space in reading individual counters?
* Would it perhaps be acceptable to provide user space with a cached snapshot
  of all the MBM counters in a single query?

User space can use a single read to obtain values of an event for all RMIDs
on a domain without a single IPI if the architectural state from the overflow handler
is exposed. It could also be possible to present data from all domains in that single
read.

One complication I can think of is that data from the different domains may have
been collected up to a second apart. Is this something this use case can tolerate?


For example,
	# cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
	  <rdtgroup nameA> <MBM total count>
	  <rdtgroup nameB> <MBM total count>
	  ...

	# cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
	  <rdtgroup nameA> <MBM total count>
	  <rdtgroup nameB> <MBM total count>
	  ...

If the use case cannot tolerate data up to a second old then resctrl could add new files
in info/L3_MON that will take the mutex once and trigger a *single* IPI to a CPU
from each domain and read all events sequentially (which is essentially mbm_overflow()).

For example,
	# cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_00
	  <rdtgroup nameA> <MBM total count>
	  <rdtgroup nameB> <MBM total count>
	  ...

	# cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
	  <rdtgroup nameA> <MBM total count>
	  <rdtgroup nameB> <MBM total count>
	  ...

As I understand an interface like above would be an improvement over
what can be achieved by user space by optimizing queries to existing interface.

> 
> AMD EPYC 7B12 64-Core Processor (250 mon groups)
> 
> Local Domain:   5.72M -> 1.22M (-78.7%)
> Remote Domain:  5.89M -> 5.20M (-11.8%)
> 
> Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz (220 mon groups)
> 
> Local Domain:   3.37M -> 2.52M (-25.4%)
> Remote Domain:  5.16M -> 5.79M (+12.0%)

In my testing I see worse regression when I look at just kernel cycles.
remote domain before this patch : 17,793,035      cycles:k
remote domain after this patch:   28,131,703      cycles:k

> 
> The slowdown for remote domain reads on Intel is worrying, but since
> this change is effectively a revert to old behavior on x86, this
> shouldn't be anything new.

This does not sound right. It sounds as though you are saying
performance regressions from current behavior are acceptable as long
as the poor performance was encountered at some point before current
behavior?

It could perhaps be argued that performance regressions on this slow
path are acceptable ... but doing so would contradict this work.

> Also note that the Remote Domain results and the baseline Local Domain
> results only measure cycles in the test program. Because all counter
> reading work was carried out in kernel worker threads or IPI handlers,
> the total system cost of the operation is greater.

This sounds like speculation. This can be measured with a small
change to your command line, for example, by using:
	perf stat -e cycles:u -e cycles:k ...

Reinette

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07  1:10   ` Reinette Chatre
@ 2024-11-07 11:01     ` Peter Newman
  2024-11-07 14:26       ` Peter Newman
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Newman @ 2024-11-07 11:01 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, babu.moger, bp, dave.hansen, eranian, hpa,
	james.morse, linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx,
	tony.luck, x86

Hi Reinette,

On Thu, Nov 7, 2024 at 2:10 AM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 11/6/24 7:43 AM, Peter Newman wrote:
> > Performance-conscious users may use threads bound to CPUs within a
> > specific monitoring domain to ensure that all bandwidth counters can be
> > read efficiently. The hardware counters are only accessible to CPUs
> > within the domain, so requests from CPUs outside the domain are
> > forwarded to a kernel worker or IPI handler, incurring a substantial
> > performance penalty on each read.  Recently, this penalty was observed
> > to be paid by local reads as well.
> >
> > To support blocking implementations of resctrl_arch_rmid_read(),
> > mon_event_read() switched to smp_call_on_cpu() in most cases to read
> > event counters using a kernel worker thread. Unlike
> > smp_call_function_any(), which optimizes to a local function call when
> > the calling CPU is in the target cpumask, smp_call_on_cpu() queues the
> > work unconditionally.
> >
> > Introduce resctrl_arch_event_read_blocks() to allow the implementation
> > to indicate whether reading a particular event counter blocks. Use this
> > to limit the usage of smp_call_on_cpu() to only the counters where it is
> > actually needed. This reverts to the previous behavior of always using
> > smp_call_function_any() for all x86 implementations.
> >
> > This is significant when supporting configurations such as a dual-socket
> > AMD Zen2, with 32 L3 monitoring domains and 256 RMIDs. To read both MBM
> > counters for all groups on all domains requires 16384 (32*256*2) counter
> > reads. The resolution of global, per-group MBM data which can be
> > provided is therefore sensitive to the cost of each counter read.
> > Furthermore, redirecting this much work to IPI handlers or worker
> > threads at a regular interval is disruptive to the present workload.
> >
> > The test program fastcat, which was introduced in an earlier patch, was
> > used to simulate the impact of this change on an optimized event
> > counter-reading procedure. The goal is to maximize the frequency at
> > which MBM counters can be dumped, so the benchmark determines the cost
> > of an additional global MBM counter sample.
> >
> > The total number of cycles needed to read all local and total MBM
>
> This optimization proposal aims to reduce the number of cycles used by
> the code that is eventually called after user space reads from a file.
> As you already noted, this causes a regression in this exact optimization area
> in one out of the four scenarios tested.
>
> As I understand this is understood to be a slow path and there are many things
> that can impact the number of cycles such a query takes.
>
> I wonder if we could perhaps instead change focus from shaving cycles (after
> obtaining a mutex ... ) from a slow path to understand what the use case
> is so that resctrl could perhaps obtain a better interface that supports the
> use case better overall?

Sounds great.

>
> > counters for a large number of monitoring groups was collected using the
> > perf tool. The average over 100 iterations is given, with a 1-second
> > sleep between iterations to better represent the intended use case. The
> > test was run bound to the CPUs of a single MBM domain, once targeting
> > counters in the local domain and again for counters in a remote domain.
>
> This sounds as though user space is essentially duplicating what the
> MBM overflow handler currently does, which is to run a worker in each domain
> to collect MBM data every second from every RMID for both MBM events.
>
> * What are the requirements of this use case?

Accurate, per-RMID MBps data, ideally at 1-second resolution if the
overhead is tolerable.

> * Is there some benefit to user space in reading individual counters?

The interface is available today and can retrieve the data with
somewhat acceptable performance. After applying this change (or
selecting a kernel version before the MPAM changes), call-graph
profiling showed that the remaining overhead of reading one counter at
a time from userspace on AMD Zen2 was around 20%, spacing each series
of 250 local reads by 1 second.

With the 1.22M figure I quoted below for a single domain from
userspace, this comes out to 488 usec at 2.5 GHz, which is manageable.
Even if userspace serializes all of its per-domain reads manually to
avoid contention on the global rdtgroup_mutex, the results should be
consistent as long as the domains are always serialized in the same
order.

(apologies that my figures focus on AMD, but its high MBM domain
counts make it the most sensitive to read performance)

Also, if all of the counter-reading work is performed directly by the
thread, the time spent collecting the information is easier to
attribute to the management software rather than appearing as an
increase in kernel overhead.

Clearly not optimal, but an acceptable baseline.

> * Would it perhaps be acceptable to provide user space with a cached snapshot
>   of all the MBM counters in a single query?
>
> User space can use a single read to obtain values of an event for all RMIDs
> on a domain without a single IPI if the architectural state from the overflow handler
> is exposed. It could also be possible to present data from all domains in that single
> read.
>
> One complication I can think of is that data from the different domains may have
> been collected up to a second apart. Is this something this use case can tolerate?

This +/- 1-second drift would apply to the denominator of the MBps
calculation, so it could cause some very large errors. To produce
accurate MBps numbers from cached MBM count data, each sample would
need to be accompanied by a timestamp.

>
>
> For example,
>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
>           <rdtgroup nameA> <MBM total count>
>           <rdtgroup nameB> <MBM total count>
>           ...
>
>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
>           <rdtgroup nameA> <MBM total count>
>           <rdtgroup nameB> <MBM total count>
>           ...
>
> If the use case cannot tolerate data up to a second old then resctrl could add new files
> in info/L3_MON that will take the mutex once and trigger a *single* IPI to a CPU
> from each domain and read all events sequentially (which is essentially mbm_overflow()).
>
> For example,
>         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_00
>           <rdtgroup nameA> <MBM total count>
>           <rdtgroup nameB> <MBM total count>
>           ...
>
>         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
>           <rdtgroup nameA> <MBM total count>
>           <rdtgroup nameB> <MBM total count>
>           ...
>
> As I understand an interface like above would be an improvement over
> what can be achieved by user space by optimizing queries to existing interface.
>

Yes, this is an option now that the ABMC work is establishing a naming
scheme for groups. It would also significantly cut down on the number
of open file descriptors needed.

Tony had prototyped an MBM rate event[1], which cached an MBps value
per group/domain produced by the overflow workers. If the workers
produce the mbps samples immediately after directly reading the
counters, then this should be the best case in terms of precision and
introduce very little additional system overhead. Also, userspace
reading a rate sample that's 1 second old would be a lot less harmful
than performing an MBps calculation from a count sample that's 1
second old.

Perhaps combining the per-second bandwidth rate cache with a
per-domain file for each MBM event to aggregate the data for all
groups would be the optimally-performing solution?

Thanks!
-Peter

[1] https://lore.kernel.org/lkml/20230620033702.33344-3-tony.luck@intel.com/#Z31fs:resctrl2:arch:x86:rdt_mbm_total_rate.c

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 11:01     ` Peter Newman
@ 2024-11-07 14:26       ` Peter Newman
  2024-11-07 16:57         ` Tony Luck
  2024-11-07 19:14         ` Reinette Chatre
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Newman @ 2024-11-07 14:26 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, babu.moger, bp, dave.hansen, eranian, hpa,
	james.morse, linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx,
	tony.luck, x86

On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@google.com> wrote:
>
> Hi Reinette,
>
> On Thu, Nov 7, 2024 at 2:10 AM Reinette Chatre <reinette.chatre@intel.com> wrote:

> > This sounds as though user space is essentially duplicating what the
> > MBM overflow handler currently does, which is to run a worker in each domain
> > to collect MBM data every second from every RMID for both MBM events.
> >
> > * What are the requirements of this use case?
>
> Accurate, per-RMID MBps data, ideally at 1-second resolution if the
> overhead is tolerable.

Sorry, forgot about the assignable counters issue...

On AMD we'll have to cycle the available event counters through the
groups in order to get valid bandwidth counts.

>
> > * Is there some benefit to user space in reading individual counters?
>
> The interface is available today and can retrieve the data with
> somewhat acceptable performance. After applying this change (or
> selecting a kernel version before the MPAM changes), call-graph
> profiling showed that the remaining overhead of reading one counter at
> a time from userspace on AMD Zen2 was around 20%, spacing each series
> of 250 local reads by 1 second.
>
> With the 1.22M figure I quoted below for a single domain from
> userspace, this comes out to 488 usec at 2.5 GHz, which is manageable.
> Even if userspace serializes all of its per-domain reads manually to
> avoid contention on the global rdtgroup_mutex, the results should be
> consistent as long as the domains are always serialized in the same
> order.
>
> (apologies that my figures focus on AMD, but its high MBM domain
> counts make it the most sensitive to read performance)
>
> Also, if all of the counter-reading work is performed directly by the
> thread, the time spent collecting the information is easier to
> attribute to the management software rather than appearing as an
> increase in kernel overhead.
>
> Clearly not optimal, but an acceptable baseline.
>
> > * Would it perhaps be acceptable to provide user space with a cached snapshot
> >   of all the MBM counters in a single query?
> >
> > User space can use a single read to obtain values of an event for all RMIDs
> > on a domain without a single IPI if the architectural state from the overflow handler
> > is exposed. It could also be possible to present data from all domains in that single
> > read.
> >
> > One complication I can think of is that data from the different domains may have
> > been collected up to a second apart. Is this something this use case can tolerate?
>
> This +/- 1-second drift would apply to the denominator of the mbps
> calculation, so it could cause some very large errors. To produce
> accurate mbps numbers from cached MBM count data, each sample would
> need to be accompanied by a timestamp.
>
> >
> >
> > For example,
> >         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
> >           <rdtgroup nameA> <MBM total count>
> >           <rdtgroup nameB> <MBM total count>
> >           ...
> >
> >         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
> >           <rdtgroup nameA> <MBM total count>
> >           <rdtgroup nameB> <MBM total count>
> >           ...
> >
> > If the use case cannot tolerate data up to a second old then resctrl could add new files
> > in info/L3_MON that will take the mutex once and trigger a *single* IPI to a CPU
> > from each domain and read all events sequentially (which is essentially mbm_overflow()).
> >
> > For example,
> >         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_00
> >           <rdtgroup nameA> <MBM total count>
> >           <rdtgroup nameB> <MBM total count>
> >           ...
> >
> >         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
> >           <rdtgroup nameA> <MBM total count>
> >           <rdtgroup nameB> <MBM total count>
> >           ...
> >
> > As I understand an interface like above would be an improvement over
> > what can be achieved by user space by optimizing queries to existing interface.
> >
>
> Yes, this is an option now that the ABMC work is establishing a naming
> scheme for groups. It would also significantly cut down on the number
> of open file descriptors needed.
>
> Tony had prototyped an MBM rate event[1], which cached a MBps value
> per group/domain produced by the overflow workers. If the workers
> produce the mbps samples immediately after directly reading the
> counters, then this should be the best case in terms of precision and
> introduce very little additional system overhead. Also, userspace
> reading a rate sample that's 1 second old would be a lot less harmful
> than performing an MBps calculation from a count sample that's 1
> second old.
>
> Perhaps combining the per-second bandwidth rate cache with a
> per-domain file for each MBM event to aggregate the data for all
> groups will be the optimally-performing solution?

Factoring ABMC into this, we'd have to decide the interval at which
we're comfortable with cycling the available event counters through
the group list in order to get valid MBps samples for every group
often enough.

A counter will have to stay assigned long enough to get two valid
counts before an MBps value can be reported. If regular MBps samples
are what we want, maybe we want a mode where the overflow workers also
drive the counter assignments in order to produce regular samples from
all groups.

-Peter

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 14:26       ` Peter Newman
@ 2024-11-07 16:57         ` Tony Luck
  2024-11-07 19:15           ` Reinette Chatre
  2024-11-07 19:14         ` Reinette Chatre
  1 sibling, 1 reply; 17+ messages in thread
From: Tony Luck @ 2024-11-07 16:57 UTC (permalink / raw)
  To: Peter Newman
  Cc: Reinette Chatre, fenghua.yu, babu.moger, bp, dave.hansen, eranian,
	hpa, james.morse, linux-kernel, mingo, nert.pinx, tan.shaopeng,
	tglx, x86

On Thu, Nov 07, 2024 at 03:26:11PM +0100, Peter Newman wrote:
> On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@google.com> wrote:
> >
> > Hi Reinette,
> >
> > On Thu, Nov 7, 2024 at 2:10 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
> 
> > > This sounds as though user space is essentially duplicating what the
> > > MBM overflow handler currently does, which is to run a worker in each domain
> > > to collect MBM data every second from every RMID for both MBM events.
> > >
> > > * What are the requirements of this use case?
> >
> > Accurate, per-RMID MBps data, ideally at 1-second resolution if the
> > overhead can be tolerable.
> 
> Sorry, forgot about the assignable counters issue...
> 
> On AMD we'll have to cycle the available event counters through the
> groups in order to get valid bandwidth counts.

See below.

> > > For example,
> > >         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
> > >           <rdtgroup nameA> <MBM total count>
> > >           <rdtgroup nameB> <MBM total count>
> > >           ...
> > >
> > >         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
> > >           <rdtgroup nameA> <MBM total count>
> > >           <rdtgroup nameB> <MBM total count>
> > >           ...

How about:

# cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
<rdtgroup nameA> <MBM total count> <timestamp> <generation>
<rdtgroup nameB> <MBM total count> <timestamp> <generation>
...
> > >
# cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
<rdtgroup nameA> <MBM total count> <timestamp> <generation>
<rdtgroup nameB> <MBM total count> <timestamp> <generation>
...

Where <timestamp> tracks when this sample was captured. And
<generation> is an integer that is incremented when data
for this event is lost (e.g. due to ABMC counter re-assignment).

Then a monitor application can compute bandwidth for each
group by periodic sampling and for each group:

	if (thisgeneration == lastgeneration) {
		bw = (thiscount - lastcount) / (thistimestamp - lasttimestamp);
	}

-Tony


* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 14:26       ` Peter Newman
  2024-11-07 16:57         ` Tony Luck
@ 2024-11-07 19:14         ` Reinette Chatre
  2024-11-13 13:28           ` Peter Newman
  1 sibling, 1 reply; 17+ messages in thread
From: Reinette Chatre @ 2024-11-07 19:14 UTC (permalink / raw)
  To: Peter Newman
  Cc: fenghua.yu, babu.moger, bp, dave.hansen, eranian, hpa,
	james.morse, linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx,
	tony.luck, x86

Hi Peter,

On 11/7/24 6:26 AM, Peter Newman wrote:
> On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@google.com> wrote:
>>
>> Hi Reinette,
>>
>> On Thu, Nov 7, 2024 at 2:10 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
> 
>>> This sounds as though user space is essentially duplicating what the
>>> MBM overflow handler currently does, which is to run a worker in each domain
>>> to collect MBM data every second from every RMID for both MBM events.
>>>
>>> * What are the requirements of this use case?
>>
>> Accurate, per-RMID MBps data, ideally at 1-second resolution if the
>> overhead can be tolerable.
> 
> Sorry, forgot about the assignable counters issue...
> 
> On AMD we'll have to cycle the available event counters through the
> groups in order to get valid bandwidth counts.

ack

> 
>>
>>> * Is there some benefit to user space in reading individual counters?

My question was poorly worded. What I would like to understand is whether
there is a requirement for user space to continue to be able to quickly query
individual counters, like "query <event> of <monitor group>", or whether it
would be acceptable for user space to get a dump of event values for all
monitor groups in a single query and then post-process to dig out individual
counters?

>> The interface is available today and can retrieve the data with
>> somewhat acceptable performance. After applying this change (or
>> selecting a kernel version before the MPAM changes), call-graph
>> profiling showed that the remaining overhead of reading one counter at
>> a time from userspace on AMD Zen2 was around 20%, spacing each series
>> of 250 local reads by 1 second.
>>
>> With the 1.22M figure I quoted below for a single domain from
>> userspace, this comes out to 488 usec at 2.5 Ghz, which is manageable.
>> Even if userspace serializes all of its per-domain reads manually to
>> avoid contention on the global rdtgroup_mutex, the results should be
>> consistent as long as the domains are always serialized in the same
>> order.
>>
>> (apologies that my figures focus on AMD, but its high MBM domain
>> counts make it the most sensitive to read performance)
>>
>> Also, if all of the counter-reading work is performed directly by the
>> thread, the time spent collecting the information is easier to
>> attribute to the management software rather than appearing as an
>> increase in kernel overhead.
>>
>> Clearly not optimal, but an acceptable baseline.

Thank you for giving this insight.

>>
>>> * Would it perhaps be acceptable to provide user space with a cached snapshot
>>>   of all the MBM counters in a single query?
>>>
>>> User space can use a single read to obtain values of an event for all RMIDs
>>> on a domain without a single IPI if the architectural state from the overflow handler
>>> is exposed. It could also be possible to present data from all domains in that single
>>> read.
>>>
>>> One complication I can think of is that data from the different domains may have
>>> been collected up to a second apart. Is this something this use case can tolerate?
>>
>> This +/- 1-second drift would apply to the denominator of the mbps
>> calculation, so it could cause some very large errors. To produce
>> accurate mbps numbers from cached MBM count data, each sample would
>> need to be accompanied by a timestamp.

Reading more it does seem that making the raw cached state available to user space
would introduce too many problematic corner cases.

>>> For example,
>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
>>>           <rdtgroup nameA> <MBM total count>
>>>           <rdtgroup nameB> <MBM total count>
>>>           ...
>>>
>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
>>>           <rdtgroup nameA> <MBM total count>
>>>           <rdtgroup nameB> <MBM total count>
>>>           ...
>>>
>>> If the use case cannot tolerate data up to a second old then resctrl could add new files
>>> in info/L3_MON that will take the mutex once and trigger a *single* IPI to a CPU
>>> from each domain and read all events sequentially (which is essentially mbm_overflow()).
>>>
>>> For example,
>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_00
>>>           <rdtgroup nameA> <MBM total count>
>>>           <rdtgroup nameB> <MBM total count>
>>>           ...
>>>
>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
>>>           <rdtgroup nameA> <MBM total count>
>>>           <rdtgroup nameB> <MBM total count>
>>>           ...
>>>
>>> As I understand an interface like above would be an improvement over
>>> what can be achieved by user space by optimizing queries to existing interface.
>>>
>>
>> Yes, this is an option now that the ABMC work is establishing a naming
>> scheme for groups. It would also significantly cut down on the number
>> of open file descriptors needed.

ack.

>>
>> Tony had prototyped an MBM rate event[1], which cached a MBps value
>> per group/domain produced by the overflow workers. If the workers
>> produce the mbps samples immediately after directly reading the
>> counters, then this should be the best case in terms of precision and
>> introduce very little additional system overhead. Also, userspace
>> reading a rate sample that's 1 second old would be a lot less harmful
>> than performing an MBps calculation from a count sample that's 1
>> second old.

I looked at that implementation. Please do note that the implementation
appears to be a PoC that does not handle the corner cases where issues may
arise. For example, when it reads the event values in the overflow handler
the rate is updated even if there was an error during the counter read.
The moment a counter read encounters an error it impacts the measured
rate so this will need more thought.

>> Perhaps combining the per-second bandwidth rate cache with a
>> per-domain file for each MBM event to aggregate the data for all
>> groups will be the optimally-performing solution?

I do not see a problem with exposing a per-domain file for each MBM event
that aggregates the data of all groups. For best accuracy I expect that
this file will be created on demand, querying hardware counters at the time
user space makes the request. This may indeed result in output like below
(naming used is just a sample for this discussion):

         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
           <rdtgroup nameA> <MBM total count>
           <rdtgroup nameB> <MBM total count>
           <rdtgroup nameC> Error
           <rdtgroup nameD> Unavailable
           <rdtgroup nameE> Unassigned
	   ...

As I understand from your earlier description this essentially moves the
counter reading work from a user space thread to the kernel. There are
options to do this work, most disruptive can be done with a
smp_call_function_any() to read all the domain's counters from a CPU
in the domain without any preemption, less disruptive can be done
with smp_call_on_cpu(). Some cond_resched() could be included if
the number of events needing to be read starts impacting other parts of
the kernel. You already indicated that reading the counters from user space
takes 488 usec, which is very far from what would trigger the softlockup
detector.
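For what it's worth, a user space consumer of such a per-domain file would have to treat the non-numeric states explicitly; a rough Python sketch of parsing the proposed (hypothetical) format, separating valid counts from error states so history can be dropped for the latter:

```python
def parse_mbm_aggregate(text):
    """Parse the proposed per-domain aggregate file (format hypothetical,
    from this discussion): one "<group> <count|Error|Unavailable|Unassigned>"
    pair per line. Returns (counts, states): counts maps group -> int for
    valid reads, states maps group -> error string for everything else."""
    counts, states = {}, {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        group, value = parts
        if value.isdigit():
            counts[group] = int(value)
        else:
            states[group] = value  # "Error", "Unavailable" or "Unassigned"
    return counts, states

sample = """nameA 123456789
nameB 42
nameC Error
nameD Unavailable
nameE Unassigned
"""
counts, states = parse_mbm_aggregate(sample)
```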
 
> Factoring ABMC into this, we'd have to decide the interval at which
> we're comfortable with cycling the available event counters through
> the group list in order to get valid MBps samples for every group
> often enough.
> 
> A counter will have to stay assigned long enough to get two valid
> counts before an MBps value can be reported. If regular MBps samples
> are what we want, maybe we just want a mode where the overflow workers
> would also drive the counter assignments in order to produce regular
> samples from all groups.

The assignable counter implementation currently depends on user space
assigning counters. In this design the events that do not have counters
assigned return "Unassigned". The "rate" value can also return
"Unassigned" in this existing design. "Unassigned" is one scenario that
needs to be handled; there are also hardware errors and the MSR returning
"Unavailable".

We can surely consider a new mode that does not allow user space to assign
counters but instead lets resctrl manage counter assignments to support
rate events when assignable counters are supported.

I see a couple of features being discussed in parallel:
- A new per-domain file for each MBM event that aggregates the data of all groups.
- A new "rate" event built on top of user assigned counters.
- A new "rate" event built on top of resctrl assigned counters.

Which best support the use case you have in mind?

Reinette





* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 16:57         ` Tony Luck
@ 2024-11-07 19:15           ` Reinette Chatre
  2024-11-07 20:58             ` Luck, Tony
  0 siblings, 1 reply; 17+ messages in thread
From: Reinette Chatre @ 2024-11-07 19:15 UTC (permalink / raw)
  To: Tony Luck, Peter Newman
  Cc: fenghua.yu, babu.moger, bp, dave.hansen, eranian, hpa,
	james.morse, linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx,
	x86

Hi Tony,

On 11/7/24 8:57 AM, Tony Luck wrote:
> On Thu, Nov 07, 2024 at 03:26:11PM +0100, Peter Newman wrote:
>> On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@google.com> wrote:
>>>
>>> Hi Reinette,
>>>
>>> On Thu, Nov 7, 2024 at 2:10 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
>>
>>>> This sounds as though user space is essentially duplicating what the
>>>> MBM overflow handler currently does, which is to run a worker in each domain
>>>> to collect MBM data every second from every RMID for both MBM events.
>>>>
>>>> * What are the requirements of this use case?
>>>
>>> Accurate, per-RMID MBps data, ideally at 1-second resolution if the
>>> overhead can be tolerable.
>>
>> Sorry, forgot about the assignable counters issue...
>>
>> On AMD we'll have to cycle the available event counters through the
>> groups in order to get valid bandwidth counts.
> 
> See below.
> 
>>>> For example,
>>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
>>>>           <rdtgroup nameA> <MBM total count>
>>>>           <rdtgroup nameB> <MBM total count>
>>>>           ...
>>>>
>>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
>>>>           <rdtgroup nameA> <MBM total count>
>>>>           <rdtgroup nameB> <MBM total count>
>>>>           ...
> 
> How about:
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
> <rdtgroup nameA> <MBM total count> <timestamp> <generation>
> <rdtgroup nameB> <MBM total count> <timestamp> <generation>
> ...
>>>>
> # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
> <rdtgroup nameA> <MBM total count> <timestamp> <generation>
> <rdtgroup nameB> <MBM total count> <timestamp> <generation>
> ...
> 
> Where <timestamp> tracks when this sample was captured. And
> <generation> is an integer that is incremented when data
> for this event is lost (e.g. due to ABMC counter re-assignment).

It is not obvious to me how resctrl can provide a reliable
"generation" value.

 
> Then a monitor application can compute bandwidth for each
> group by periodic sampling and for each group:
> 
> 	if (thisgeneration == lastgeneration) {
> 		bw = (thiscount - lastcount) / (thistimestamp - lasttimestamp);

If user space needs visibility into these internals then we could also
consider adding a trace event that logs the timestamped data right when it
is queried by the overflow handler.

Reinette


* RE: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 19:15           ` Reinette Chatre
@ 2024-11-07 20:58             ` Luck, Tony
  2024-11-07 22:03               ` Reinette Chatre
  0 siblings, 1 reply; 17+ messages in thread
From: Luck, Tony @ 2024-11-07 20:58 UTC (permalink / raw)
  To: Chatre, Reinette, Peter Newman
  Cc: Yu, Fenghua, babu.moger@amd.com, bp@alien8.de,
	dave.hansen@linux.intel.com, Eranian, Stephane, hpa@zytor.com,
	james.morse@arm.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, nert.pinx@gmail.com, tan.shaopeng@fujitsu.com,
	tglx@linutronix.de, x86@kernel.org

> > # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
> > <rdtgroup nameA> <MBM total count> <timestamp> <generation>
> > <rdtgroup nameB> <MBM total count> <timestamp> <generation>
> > ...
> >
> > Where <timestamp> tracks when this sample was captured. And
> > <generation> is an integer that is incremented when data
> > for this event is lost (e.g. due to ABMC counter re-assignment).

Maintaining separate timestamps for each group may be overkill.
The overflow function walks through them all quite rapidly. On
Intel Icelake with 100 groups there is only a 670 usec delta
between the first and last.

> It is not obvious to me how resctrl can provide a reliable
> "generation" value.

Keep a generation count for each event in each group. Increment
the count when taking the h/w counter away.

> > Then a monitor application can compute bandwidth for each
> > group by periodic sampling and for each group:
> >
> >     if (thisgeneration == lastgeneration) {
> >             bw = (thiscount - lastcount) / (thistimestamp - lasttimestamp);
>
> If user space needs visibility into these internals then we could also
> consider adding a trace event that logs the timestamped data right when it
> is queried by the overflow handler.

That would provide accurate data at low overhead, assuming that
the user wants bandwidth data every second. If they only need
data over longer time intervals all the extra trace events aren't
needed.

-Tony


* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 20:58             ` Luck, Tony
@ 2024-11-07 22:03               ` Reinette Chatre
  2024-11-07 22:14                 ` Luck, Tony
  0 siblings, 1 reply; 17+ messages in thread
From: Reinette Chatre @ 2024-11-07 22:03 UTC (permalink / raw)
  To: Luck, Tony, Peter Newman
  Cc: Yu, Fenghua, babu.moger@amd.com, bp@alien8.de,
	dave.hansen@linux.intel.com, Eranian, Stephane, hpa@zytor.com,
	james.morse@arm.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, nert.pinx@gmail.com, tan.shaopeng@fujitsu.com,
	tglx@linutronix.de, x86@kernel.org

Hi Tony,

On 11/7/24 12:58 PM, Luck, Tony wrote:
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
>>> <rdtgroup nameA> <MBM total count> <timestamp> <generation>
>>> <rdtgroup nameB> <MBM total count> <timestamp> <generation>
>>> ...
>>>
>>> Where <timestamp> tracks when this sample was captured. And
>>> <generation> is an integer that is incremented when data
>>> for this event is lost (e.g. due to ABMC counter re-assignment).
> 
> Maintaining separate timestamps for each group may be overkill.
> The overflow function walks through them all quite rapidly. On
> Intel Icelake with 100 groups there is only a 670 usec delta
> between the first and last.

If cached data is presented to the user I think the timestamp is
required to let user space know when the data was collected. This timestamp
would be unique per domain as it reflects the per-domain overflow workers.
As you state, it may be overkill if done for each group but I think it
is valuable to have it for the particular domain.
It sounds as though the use case is for user space to query counters
every second. With the overflow handler and the user space thread running
queries at the same interval, the timestamp may help to ensure that user
space and the kernel do not get out of sync, for example, a scenario where
user space believes it queries once per second but receives the same cached
data in two consecutive queries.

> 
>> It is not obvious to me how resctrl can provide a reliable
>> "generation" value.
> 
> Keep a generation count for each event in each group. Increment
> the count when taking the h/w counter away.

Since this is a snapshot of the counter, why not pass the exact value or
issue encountered when the counter was read? For example, "Error", "Unavailable", or
"Unassigned" instead of a "MBM <total|local> count"? We need to be careful
when presenting cached data to user space since the data becomes stale
if any issue is encountered during its query from hardware because that would
make any cached "MBM <total|local> count" invalid.

A generation value would be of most use if it can be understood by user space.
I think that would require something separate for user space to learn which
"generation" a counter is on after it is assigned, so that a query of cached
data can be matched to it.

I think maybe the issue you are trying to address is a user assigning a counter
and then reading the cached data and getting cached data from a previous
configuration? Please note that in the current implementation the cached
data is reset directly on counter assignment [1]. If a user assigns a new
counter and then immediately reads the cached data, the cached data will
reflect the assignment even if the overflow worker thread did not get a chance
to run since the assignment.

> 
>>> Then a monitor application can compute bandwidth for each
>>> group by periodic sampling and for each group:
>>>
>>>     if (thisgeneration == lastgeneration) {
>>>             bw = (thiscount - lastcount) / (thistimestanp - lasttimestamp);
>>
>> If user space needs visibility into these internals then we could also
>> consider adding a trace event that logs the timestamped data right when it
>> is queried by the overflow handler.
> 
> That would provide accurate data at low overhead, assuming that
> the user wants bandwidth data every second. If they only need
> data over longer time intervals all the extra trace events aren't
> needed.

Using tracepoints comes with benefits of features supported by its user space
infrastructure. This is one more tool available to explore what would work best
to address the use cases. The use case presented in this thread is to collect
monitoring data once per second.

Reinette
 
[1] https://lore.kernel.org/all/3851fbd6ccd1cdc504229e4c7f7d2575c13f5bd6.1730244116.git.babu.moger@amd.com/


* RE: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 22:03               ` Reinette Chatre
@ 2024-11-07 22:14                 ` Luck, Tony
  2024-11-07 22:46                   ` Reinette Chatre
  0 siblings, 1 reply; 17+ messages in thread
From: Luck, Tony @ 2024-11-07 22:14 UTC (permalink / raw)
  To: Chatre, Reinette, Peter Newman
  Cc: Yu, Fenghua, babu.moger@amd.com, bp@alien8.de,
	dave.hansen@linux.intel.com, Eranian, Stephane, hpa@zytor.com,
	james.morse@arm.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, nert.pinx@gmail.com, tan.shaopeng@fujitsu.com,
	tglx@linutronix.de, x86@kernel.org

> I think maybe the issue you are trying to address is a user assigning a counter
> and then reading the cached data and getting cached data from a previous
> configuration? Please note that in the current implementation the cached
> data is reset directly on counter assignment [1]. If a user assigns a new
> counter and then immediately reads the cached data, the cached data will
> reflect the assignment even if the overflow worker thread did not get a chance
> to run since the assignment.

The issue is that AMD's ABMC implementation resets counts when reassigning
h/w counters to events in resctrl groups. If the processes reading counters are
not fully aware of h/w counter reassignment, insanity will occur.

E.g. read a counter:

$ cat mbm_local_bytes
123456789

H/w counter for this event/group assigned elsewhere.

H/w counter assigned back to this event/group

$ cat mbm_local_bytes
23456

Bandwidth calculation sees traffic amount:
	 (23456 - 123456789) = -123433333
Oops. Negative!

-Tony


* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 22:14                 ` Luck, Tony
@ 2024-11-07 22:46                   ` Reinette Chatre
  2024-11-07 23:30                     ` Luck, Tony
  0 siblings, 1 reply; 17+ messages in thread
From: Reinette Chatre @ 2024-11-07 22:46 UTC (permalink / raw)
  To: Luck, Tony, Peter Newman
  Cc: Yu, Fenghua, babu.moger@amd.com, bp@alien8.de,
	dave.hansen@linux.intel.com, Eranian, Stephane, hpa@zytor.com,
	james.morse@arm.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, nert.pinx@gmail.com, tan.shaopeng@fujitsu.com,
	tglx@linutronix.de, x86@kernel.org

Hi Tony,

On 11/7/24 2:14 PM, Luck, Tony wrote:
>> I think maybe the issue you are trying to address is a user assigning a counter
>> and then reading the cached data and getting cached data from a previous
>> configuration? Please note that in the current implementation the cached
>> data is reset directly on counter assignment [1]. If a user assigns a new
>> counter and then immediately reads the cached data, the cached data will
>> reflect the assignment even if the overflow worker thread did not get a chance
>> to run since the assignment.
> 
> The issue is that AMD's ABMC implementation resets counts when reassigning
> h/w counters to events in resctrl groups. If the processes reading counters are
> not fully aware of h/w counter reassignment, insanity will occur.
> 
> E.g. read a counter:
> 
> $ cat mbm_local_bytes
> 123456789
> 
> H/w counter for this event/group assigned elsewhere.
> 
> H/w counter assigned back to this event/group
> 
> $ cat mbm_local_bytes
> 23456
> 
> Bandwidth calculation sees traffic amount:
> 	 (23456 - 123456789) = -123433333
> Oops. Negative!

As I understand this is already an issue today on AMD systems without assignable counters
that may run out of counters. On these systems, any RMID that is no longer being tracked will
be reset to zero. [1] 

The support for assignable counters give user space control over this unexpected reset of
counters.

The scenario you present seems to demonstrate how two independent user space systems
can trample on each other when interacting with the same resources. Is this something you expect
resctrl should protect against? I would expect that there would be a single user space system
doing something like above and it would reset history after unassigning a counter.

This does indeed highlight that if resctrl does start to dynamically assign counters (which
has only been speculated in this thread and is not part of current [1] design) then it may cause
problems on user space side.

Reinette

[1] https://lore.kernel.org/all/cover.1730244116.git.babu.moger@amd.com/


* RE: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 22:46                   ` Reinette Chatre
@ 2024-11-07 23:30                     ` Luck, Tony
  2024-11-08  0:21                       ` Reinette Chatre
  0 siblings, 1 reply; 17+ messages in thread
From: Luck, Tony @ 2024-11-07 23:30 UTC (permalink / raw)
  To: Chatre, Reinette, Peter Newman
  Cc: Yu, Fenghua, babu.moger@amd.com, bp@alien8.de,
	dave.hansen@linux.intel.com, Eranian, Stephane, hpa@zytor.com,
	james.morse@arm.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, nert.pinx@gmail.com, tan.shaopeng@fujitsu.com,
	tglx@linutronix.de, x86@kernel.org

> > E.g. read a counter:
> >
> > $ cat mbm_local_bytes
> > 123456789
> >
> > H/w counter for this event/group assigned elsewhere.
> >
> > H/w counter assigned back to this event/group
> >
> > $ cat mbm_local_bytes
> > 23456
> >
> > Bandwidth calculation sees traffic amount:
> >      (23456 - 123456789) = -123433333
> > Oops. Negative!
>
> As I understand this is already an issue today on AMD systems without assignable counters
> that may run out of counters. On these systems, any RMID that is no longer being tracked will
> be reset to zero. [1]

My understanding too.

> The support for assignable counters give user space control over this unexpected reset of
> counters.
>
> The scenario you present seems to demonstrate how two independent user space systems
> can trample on each other when interacting with the same resources. Is this something you expect
> resctrl should protect against? I would expect that there would be a single user space system
> doing something like above and it would reset history after unassigning a counter.

As we are discussing adding a new interface, I thought it worth considering adding
a way for user space to be aware of the re-assignment of counters. IMHO it would be
a nice to have feature. Not required if all users of resctrl are aware of each other's
actions.

> This does indeed highlight that if resctrl does start to dynamically assign counters (which
> has only been speculated in this thread and is not part of current [1] design) then it may cause
> problems on user space side.

Agreed. Dynamic assignment would break "the user knows what is happening" assumption.
Seems like a bad idea.

> Reinette
>
> [1] https://lore.kernel.org/all/cover.1730244116.git.babu.moger@amd.com/


* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 23:30                     ` Luck, Tony
@ 2024-11-08  0:21                       ` Reinette Chatre
  0 siblings, 0 replies; 17+ messages in thread
From: Reinette Chatre @ 2024-11-08  0:21 UTC (permalink / raw)
  To: Luck, Tony, Peter Newman
  Cc: Yu, Fenghua, babu.moger@amd.com, bp@alien8.de,
	dave.hansen@linux.intel.com, Eranian, Stephane, hpa@zytor.com,
	james.morse@arm.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, nert.pinx@gmail.com, tan.shaopeng@fujitsu.com,
	tglx@linutronix.de, x86@kernel.org

Hi Tony,

On 11/7/24 3:30 PM, Luck, Tony wrote:
>>> E.g. read a counter:
>>>
>>> $ cat mbm_local_bytes
>>> 123456789
>>>
>>> H/w counter for this event/group assigned elsewhere.
>>>
>>> H/w counter assigned back to this event/group
>>>
>>> $ cat mbm_local_bytes
>>> 23456
>>>
>>> Bandwidth calculation sees traffic amount:
>>>      (23456 - 123456789) = -123433333
>>> Oops. Negative!
>>
>> As I understand this is already an issue today on AMD systems without assignable counters
>> that may run out of counters. On these systems, any RMID that is no longer being tracked will
>> be reset to zero. [1]
> 
> My understanding too.
> 
>> The support for assignable counters give user space control over this unexpected reset of
>> counters.
>>
>> The scenario you present seems to demonstrate how two independent user space systems
>> can trample on each other when interacting with the same resources. Is this something you expect
>> resctrl should protect against? I would expect that there would be a single user space system
>> doing something like above and it would reset history after unassigning a counter.
> 
> As we are discussing adding a new interface, I thought it worth considering adding
> a way for user space to be aware of the re-assignment of counters. IMHO it would be
a nice-to-have feature. Not required if all users of resctrl are aware of each other's
> actions.

If this is indeed a requirement it may be best to consider it as part of the current
work to enable assignable counters. For example, by adding the "generation" value to
"mbm_assign_control" file that independent user space apps can query to get current counter
state before parsing event data.
I am not familiar with a use case relying on independent user space applications interacting
with resctrl so I would like to understand this requirement better before making the interface
more complicated.

> 
>> This does indeed highlight that if resctrl does start to dynamically assign counters (which
>> has only been speculated in this thread and is not part of current [1] design) then it may cause
>> problems on user space side.
> 
> Agreed. Dynamic assignment would break "the user knows what is happening" assumption.
> Seems like a bad idea.

I believe that is why Peter described it as a new "mode" that user space can select and thus
be aware of. This does not address how user space is expected to deal with event data reads that
may not increment when this "mode" is active though.

Reinette



* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-07 19:14         ` Reinette Chatre
@ 2024-11-13 13:28           ` Peter Newman
  2024-11-14  5:40             ` Reinette Chatre
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Newman @ 2024-11-13 13:28 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, babu.moger, bp, dave.hansen, eranian, hpa,
	james.morse, linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx,
	tony.luck, x86

Hi Reinette,

On Thu, Nov 7, 2024 at 8:15 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 11/7/24 6:26 AM, Peter Newman wrote:
> > On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@google.com> wrote:
> >>

> >> Tony had prototyped an MBM rate event[1], which cached a MBps value
> >> per group/domain produced by the overflow workers. If the workers
> >> produce the mbps samples immediately after directly reading the
> >> counters, then this should be the best case in terms of precision and
> >> introduce very little additional system overhead. Also, userspace
> >> reading a rate sample that's 1 second old would be a lot less harmful
> >> than performing an MBps calculation from a count sample that's 1
> >> second old.
>
> I looked at that implementation. Please do note that the implementation
> appears to be a PoC that does not handle the corner cases where issues may
> arise. For example, when it reads the event values in the overflow handler
> the rate is updated even if there was an error during the counter read.
> The moment a counter read encounters an error it impacts the measured
> rate so this will need more thought.
>
> >> Perhaps combining the per-second bandwidth rate cache with a
> >> per-domain file for each MBM event to aggregate the data for all
> >> groups will be the optimally-performing solution?
>
> I do not see a problem with exposing a per-domain file for each MBM event
> that aggregates the data of all groups. For best accuracy I expect that
> this file will be created on demand, querying hardware counters at the time
> user space makes the request. This may indeed result in output like below
> (naming used is just a sample for this discussion):
>
>          # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
>            <rdtgroup nameA> <MBM total count>
>            <rdtgroup nameB> <MBM total count>
>            <rdtgroup nameC> Error
>            <rdtgroup nameD> Unavailable
>            <rdtgroup nameE> Unassigned
>            ...
>
> As I understand from your earlier description this essentially moves the
> counter reading work from a user space thread to the kernel. There are
> options to do this work, most disruptive can be done with a
> smp_call_function_any() to read all the domain's counters from a CPU
> in the domain without any preemption, less disruptive can be done
> with smp_call_on_cpu(). Some cond_resched() could be included if
> the number of events needing to be read starts impacting other parts of
> the kernel. You already indicated that reading the counters from user space
> takes 488usec, which is very far from what will trigger the softlockup
> detector though.
>
> > Factoring ABMC into this, we'd have to decide the interval at which
> > we're comfortable with cycling the available event counters through
> > the group list in order to get valid MBps samples for every group
> > often enough.
> >
> > A counter will have to stay assigned long enough to get two valid
> > counts before an MBps value can be reported. If the regular MBps
> > samples is what we want, maybe we just want a mode where the overflow
> > workers would also drive the counter assignments too in order to
> > produce regular samples from all groups.
>
> The assignable counter implementation currently depends on user space
> assigning counters. In this design the events that do not have counters
> assigned return "Unassigned". The "rate" value can also return
> "Unassigned" in this existing design. "Unassigned" is one scenario that
> needs to be handled, there is also hardware errors and MSR returning
> "Unavailable".
>
> We can surely consider a new mode that does not allow user space to assign
> counters but instead lets resctrl manage counter assignments to support
> rate events when assignable counters are supported.
>
> I see a couple of features being discussed in parallel:
> - A new per-domain file for each MBM event that aggregates the data of all groups.
> - A new "rate" event built on top of user assigned counters.
> - A new "rate" event built on top of resctrl assigned counters.
>
> Which best support the use case you have in mind?

After discussing with my users, it sounds like "all of the above".

They like the idea of resctrl automatically dealing with counter
assignments for them, but they would also like to retain enough
control to provide higher resolution data for groups that concern them
more. The auto-assignment is nice in that they would get reliable
bandwidth numbers on AMD with very little development effort and the
assigning work will be done very efficiently, but eventually they will
want more control.

They also asked whether they would be able to switch between
resctrl-assigned and user-assigned at runtime. I think the importance
of this would depend on how efficient the mbm_assign_control interface
ends up being. If it will necessarily result in floods of IPIs when
cycling counters on a large system, they will be eager to not use it
whenever they can avoid it.

The rate assignment events are easier to deal with because they can
simply be retrieved from memory without the caller needing to worry
about which domain they're reading from, so I don't believe we will
ever want to deal with cached count values paired with timestamps. On
systems with assignable counters, I don't know how much of a problem
the varying update rate will be. The aggregation files reading from
memory are cheap to poll every second, so I don't know how big of an
issue it is for most of the groups to report some sort of N/A value
for bandwidth most of the time. They won't have any difficulty
remembering how long ago they last saw a valid MBps value because
they're already histogramming all the metrics they watch.

Thanks,
-Peter


* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-13 13:28           ` Peter Newman
@ 2024-11-14  5:40             ` Reinette Chatre
  2024-11-14 10:18               ` Peter Newman
  0 siblings, 1 reply; 17+ messages in thread
From: Reinette Chatre @ 2024-11-14  5:40 UTC (permalink / raw)
  To: Peter Newman
  Cc: fenghua.yu, babu.moger, bp, dave.hansen, eranian, hpa,
	james.morse, linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx,
	tony.luck, x86

Hi Peter,

On 11/13/24 5:28 AM, Peter Newman wrote:
> On Thu, Nov 7, 2024 at 8:15 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 11/7/24 6:26 AM, Peter Newman wrote:
>>> On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@google.com> wrote:
>>>>
> 
>>>> Tony had prototyped an MBM rate event[1], which cached a MBps value
>>>> per group/domain produced by the overflow workers. If the workers
>>>> produce the mbps samples immediately after directly reading the
>>>> counters, then this should be the best case in terms of precision and
>>>> introduce very little additional system overhead. Also, userspace
>>>> reading a rate sample that's 1 second old would be a lot less harmful
>>>> than performing an MBps calculation from a count sample that's 1
>>>> second old.
>>
>> I looked at that implementation. Please do note that the implementation
>> appears to be a PoC that does not handle the corner cases where issues may
>> arise. For example, when it reads the event values in the overflow handler
>> the rate is updated even if there was an error during the counter read.
>> The moment a counter read encounters an error it impacts the measured
>> rate so this will need more thought.
>>
>>>> Perhaps combining the per-second bandwidth rate cache with a
>>>> per-domain file for each MBM event to aggregate the data for all
>>>> groups will be the optimally-performing solution?
>>
>> I do not see a problem with exposing a per-domain file for each MBM event
>> that aggregates the data of all groups. For best accuracy I expect that
>> this file will be created on demand, querying hardware counters at the time
>> user space makes the request. This may indeed result in output like below
>> (naming used is just a sample for this discussion):
>>
>>          # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
>>            <rdtgroup nameA> <MBM total count>
>>            <rdtgroup nameB> <MBM total count>
>>            <rdtgroup nameC> Error
>>            <rdtgroup nameD> Unavailable
>>            <rdtgroup nameE> Unassigned
>>            ...
>>
>> As I understand from your earlier description this essentially moves the
>> counter reading work from a user space thread to the kernel. There are
>> options to do this work, most disruptive can be done with a
>> smp_call_function_any() to read all the domain's counters from a CPU
>> in the domain without any preemption, less disruptive can be done
>> with smp_call_on_cpu(). Some cond_resched() could be included if
>> the number of events needing to be read starts impacting other parts of
>> the kernel. You already indicated that reading the counters from user space
>> takes 488usec, which is very far from what will trigger the softlockup
>> detector though.
>>
>>> Factoring ABMC into this, we'd have to decide the interval at which
>>> we're comfortable with cycling the available event counters through
>>> the group list in order to get valid MBps samples for every group
>>> often enough.
>>>
>>> A counter will have to stay assigned long enough to get two valid
>>> counts before an MBps value can be reported. If the regular MBps
>>> samples is what we want, maybe we just want a mode where the overflow
>>> workers would also drive the counter assignments too in order to
>>> produce regular samples from all groups.
>>
>> The assignable counter implementation currently depends on user space
>> assigning counters. In this design the events that do not have counters
>> assigned return "Unassigned". The "rate" value can also return
>> "Unassigned" in this existing design. "Unassigned" is one scenario that
>> needs to be handled, there is also hardware errors and MSR returning
>> "Unavailable".
>>
>> We can surely consider a new mode that does not allow user space to assign
>> counters but instead lets resctrl manage counter assignments to support
>> rate events when assignable counters are supported.
>>
>> I see a couple of features being discussed in parallel:
>> - A new per-domain file for each MBM event that aggregates the data of all groups.
>> - A new "rate" event built on top of user assigned counters.
>> - A new "rate" event built on top of resctrl assigned counters.
>>
>> Which best support the use case you have in mind?
> 
> After discussing with my users, it sounds like "all of the above".
> 
> They like the idea of resctrl automatically dealing with counter
> assignments for them, but they would also like to retain enough
> control to provide higher resolution data for groups that concern them
> more. The auto-assignment is nice in that they would get reliable
> bandwidth numbers on AMD with very little development effort and the
> assigning work will be done very efficiently, but eventually they will
> want more control.

It is not obvious to me what the expectations/assumptions are regarding
efficiency. One efficiency benefit I can think of is related to what you
suggested earlier, where it is the overflow handler that does the assignment.
By doing so, resctrl can ensure counters are assigned at the edges of the time
range being measured by the overflow handler.

"eventually they will want more control" sounds like there are more requirements
that we need to be made aware of, and makes me hesitant to add complicated
automation for nothing.

> They also asked whether they would be able to switch between
> resctrl-assigned and user-assigned at runtime. I think the importance
> of this would depend on how efficient the mbm_assign_control interface
> ends up being. If it will necessarily result in floods of IPIs when
> cycling counters on a large system, they will be eager to not use it
> whenever they can avoid it.

It could be possible to use both (resctrl-assigned and user-assigned)
at the same time. For example, mbm_assign_control could use a flag to indicate
which event needs to be counted and whether the counter can be shared or not.
We already have "t" for a dedicated "MBM total" counter, and there
could theoretically be "T" for a shared "MBM total" counter. A very basic
solution could be for "T" counters to be assigned for two runs
of the overflow handler and then round-robin to the next group of
"T" counters.

There are some sharp corners when implementing something like this.
Tony already pointed out one scenario where reading a "shared counter"
monitor group's mbm_local_bytes/mbm_total_bytes becomes unreliable
due to the counter being reset at each re-assignment.

On the more complicated side, resctrl could perhaps learn from how
perf multiplexes a limited set of PMCs.

I do think this will be an enhancement of the existing counter assignment
work that Babu is doing, and I do not know how much your users know
about it or have experimented with it. You already mentioned that the
users would want more control, so it may be most flexible
to leave counter assignment to user space with the planned
mbm_assign_control interface?
 
> The rate assignment events are easier to deal with because they can
> simply be retrieved from memory without the caller needing to worry
> about what domain they're reading from, so I don't believe we will
> ever want to deal with cached count values paired with timestamps. On
> systems with assignable counters, I don't know how much of a problem
> the varying update rate will be. The aggregation files reading from
> memory are cheap to poll every second, so I don't know how big of an
> issue it is for most of the groups to report some sort of N/A-value
> for bandwidth most of the time. They won't have any difficulty
> remembering how long ago they last saw a valid mbps value because
> they're already histogramming all the metrics they watch.

This part seems most related to the issue at hand. I understand that
an aggregate file with rate data is most ideal. You snipped my question
about whether reading individual counters is required, but thinking about
this more, it may indeed also be useful to have a per monitor group rate
file containing the rate of that monitor group.

Would existing struct mbm_state.prev_bw be sufficient as a rate exposed to user
space? At this time it is only computed for consumption by the software controller
but we can explore calling mbm_bw_count() at every iteration of the overflow
handler irrespective of software controller.
Please note two things:
- Tony is already adding an enhancement for bandwidth to be computed for total as well as
  local MBM bandwidth ([2])
- Computing running bandwidth irrespective of the software controller will need mbm_bw_count()
  to be more robust, since it (as Tony highlighted in [1]) assumes that
  the preceding counter read always succeeds.

Reinette

[1] https://lore.kernel.org/all/ZzUvA2XE01U25A38@agluck-desk3/
[2] https://lore.kernel.org/all/20241029172832.93963-3-tony.luck@intel.com/



* Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
  2024-11-14  5:40             ` Reinette Chatre
@ 2024-11-14 10:18               ` Peter Newman
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Newman @ 2024-11-14 10:18 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, babu.moger, bp, dave.hansen, eranian, hpa,
	james.morse, linux-kernel, mingo, nert.pinx, tan.shaopeng, tglx,
	tony.luck, x86

Hi Reinette,

On Thu, Nov 14, 2024 at 6:40 AM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 11/13/24 5:28 AM, Peter Newman wrote:
> > On Thu, Nov 7, 2024 at 8:15 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >> On 11/7/24 6:26 AM, Peter Newman wrote:
> >>> On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@google.com> wrote:
> >>>>
> >
> >>>> Tony had prototyped an MBM rate event[1], which cached a MBps value
> >>>> per group/domain produced by the overflow workers. If the workers
> >>>> produce the mbps samples immediately after directly reading the
> >>>> counters, then this should be the best case in terms of precision and
> >>>> introduce very little additional system overhead. Also, userspace
> >>>> reading a rate sample that's 1 second old would be a lot less harmful
> >>>> than performing an MBps calculation from a count sample that's 1
> >>>> second old.
> >>
> >> I looked at that implementation. Please do note that the implementation
> >> appears to be a PoC that does not handle the corner cases where issues may
> >> arise. For example, when it reads the event values in the overflow handler
> >> the rate is updated even if there was an error during the counter read.
> >> The moment a counter read encounters an error it impacts the measured
> >> rate so this will need more thought.
> >>
> >>>> Perhaps combining the per-second bandwidth rate cache with a
> >>>> per-domain file for each MBM event to aggregate the data for all
> >>>> groups will be the optimally-performing solution?
> >>
> >> I do not see a problem with exposing a per-domain file for each MBM event
> >> that aggregates the data of all groups. For best accuracy I expect that
> >> this file will be created on demand, querying hardware counters at the time
> >> user space makes the request. This may indeed result in output like below
> >> (naming used is just a sample for this discussion):
> >>
> >>          # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
> >>            <rdtgroup nameA> <MBM total count>
> >>            <rdtgroup nameB> <MBM total count>
> >>            <rdtgroup nameC> Error
> >>            <rdtgroup nameD> Unavailable
> >>            <rdtgroup nameE> Unassigned
> >>            ...
> >>
> >> As I understand from your earlier description this essentially moves the
> >> counter reading work from a user space thread to the kernel. There are
> >> options to do this work, most disruptive can be done with a
> >> smp_call_function_any() to read all the domain's counters from a CPU
> >> in the domain without any preemption, less disruptive can be done
> >> with smp_call_on_cpu(). Some cond_resched() could be included if
> >> the number of events needing to be read starts impacting other parts of
> >> the kernel. You already indicated that reading the counters from user space
> >> takes 488usec, which is very far from what will trigger the softlockup
> >> detector though.
> >>
> >>> Factoring ABMC into this, we'd have to decide the interval at which
> >>> we're comfortable with cycling the available event counters through
> >>> the group list in order to get valid MBps samples for every group
> >>> often enough.
> >>>
> >>> A counter will have to stay assigned long enough to get two valid
> >>> counts before an MBps value can be reported. If the regular MBps
> >>> samples is what we want, maybe we just want a mode where the overflow
> >>> workers would also drive the counter assignments too in order to
> >>> produce regular samples from all groups.
> >>
> >> The assignable counter implementation currently depends on user space
> >> assigning counters. In this design the events that do not have counters
> >> assigned return "Unassigned". The "rate" value can also return
> >> "Unassigned" in this existing design. "Unassigned" is one scenario that
> >> needs to be handled, there is also hardware errors and MSR returning
> >> "Unavailable".
> >>
> >> We can surely consider a new mode that does not allow user space to assign
> >> counters but instead lets resctrl manage counter assignments to support
> >> rate events when assignable counters are supported.
> >>
> >> I see a couple of features being discussed in parallel:
> >> - A new per-domain file for each MBM event that aggregates the data of all groups.
> >> - A new "rate" event built on top of user assigned counters.
> >> - A new "rate" event built on top of resctrl assigned counters.
> >>
> >> Which best support the use case you have in mind?
> >
> > After discussing with my users, it sounds like "all of the above".
> >
> > They like the idea of resctrl automatically dealing with counter
> > assignments for them, but they would also like to retain enough
> > control to provide higher resolution data for groups that concern them
> > more. The auto-assignment is nice in that they would get reliable
> > bandwidth numbers on AMD with very little development effort and the
> > assigning work will be done very efficiently, but eventually they will
> > want more control.
>
> It is not obvious to me what the expectations/assumptions are regarding
> efficiency. One efficiency benefit I can think of is related to what you
> suggested earlier, where it is the overflow handler that does the assignment.
> By doing so, resctrl can ensure counters are assigned at the edges of the time
> range being measured by the overflow handler.

I'm mainly concerned about the cost of sending IPIs numbering in the
thousands around the system to update the counters in every domain.
The overflow handlers are already local to their domains, so they can
drive all the necessary register updates directly.

>
> "eventually they will want more control" sounds like there are more requirements
> that we need to be made aware of and makes me hesitant to add complicated
> automation for nothing.
>
> > They also asked whether they would be able to switch between
> > resctrl-assigned and user-assigned at runtime. I think the importance
> > of this would depend on how efficient the mbm_assign_control interface
> > ends up being. If it will necessarily result in floods of IPIs when
> > cycling counters on a large system, they will be eager to not use it
> > whenever they can avoid it.
>
> It could be possible to use both (resctrl-assigned and user-assigned)
> at the same time. For example, mbm_assign_control could use a flag to indicate
> which event needs to be counted and whether the counter can be shared or not.
> We already have "t" for a dedicated "MBM total" counter, and there
> could theoretically be "T" for a shared "MBM total" counter. A very basic
> solution could be for "T" counters to be assigned for two runs
> of the overflow handler and then round-robin to the next group of
> "T" counters.

This sounds promising, since it would cut down on the number of
routine counter re-assignments userspace would need to drive. The
counter-updating traffic could then be focused on the problematic
groups, which we hope will be small in number.

>
> There are some sharp corners when implementing something like this.
> Tony already pointed out one scenario where reading a "shared counter"
> monitor group's mbm_local_bytes/mbm_total_bytes
> becomes unreliable due to counter being reset at each re-assignment.
>
> On the more complicated side resctrl could perhaps learn more about
> how perf multiplexes the limited PMC.
>
> I do think this will be an enhancement of existing counter assignment
> that Babu is working on and I do not know how much your users know
> about it or experimented with it. You already mentioned that the
> users would want more control so it may be most flexible
> to leave counter assignment to user space with the planned
> mbm_assign_control interface?

Yes, it's an acceptable interface.

>
> > The rate assignment events are easier to deal with because they can
> > simply be retrieved from memory without the caller needing to worry
> > about what domain they're reading from, so I don't believe we will
> > ever want to deal with cached count values paired with timestamps. On
> > systems with assignable counters, I don't know how much of a problem
> > the varying update rate will be. The aggregation files reading from
> > memory are cheap to poll every second, so I don't know how big of an
> > issue it is for most of the groups to report some sort of N/A-value
> > for bandwidth most of the time. They won't have any difficulty
> > remembering how long ago they last saw a valid mbps value because
> > they're already histogramming all the metrics they watch.
>
> This part seems most related to issue at hand. I understand that
> an aggregate file with rate data is most ideal. You snipped my question
> whether reading individual counters are required but thinking about this
> more it may indeed also be useful to have a per monitor group rate file
> containing the rate of that monitor group.

Yes, it should be no trouble to implement the per-group rate files.

>
> Would existing struct mbm_state.prev_bw be sufficient as a rate exposed to user
> space? At this time it is only computed for consumption by the software controller
> but we can explore calling mbm_bw_count() at every iteration of the overflow
> handler irrespective of software controller.
> Please note two things:
> - Tony is already adding an enhancement for bandwidth to be computed for total as well as
>   local MBM bandwidth ([2])
> - Computing running bandwidth irrespective of the software controller will need mbm_bw_count()
>   to be more robust, since it (as Tony highlighted in [1]) assumes that
>   the preceding counter read always succeeds.

Yes, I had applied Tony's last series and prototyped interfaces for
reporting the mbm_state.prev_bw values to user space. It seemed to
work well.

Thanks!
-Peter


> [1] https://lore.kernel.org/all/ZzUvA2XE01U25A38@agluck-desk3/
> [2] https://lore.kernel.org/all/20241029172832.93963-3-tony.luck@intel.com/
>


end of thread, other threads:[~2024-11-14 10:18 UTC | newest]

Thread overview: 17+ messages
2024-11-06 15:43 [PATCH v2 1/2] selftests/resctrl: fastcat for benchmarking counter reads Peter Newman
2024-11-06 15:43 ` [PATCH v2 2/2] x86/resctrl: Don't workqueue local event " Peter Newman
2024-11-07  1:10   ` Reinette Chatre
2024-11-07 11:01     ` Peter Newman
2024-11-07 14:26       ` Peter Newman
2024-11-07 16:57         ` Tony Luck
2024-11-07 19:15           ` Reinette Chatre
2024-11-07 20:58             ` Luck, Tony
2024-11-07 22:03               ` Reinette Chatre
2024-11-07 22:14                 ` Luck, Tony
2024-11-07 22:46                   ` Reinette Chatre
2024-11-07 23:30                     ` Luck, Tony
2024-11-08  0:21                       ` Reinette Chatre
2024-11-07 19:14         ` Reinette Chatre
2024-11-13 13:28           ` Peter Newman
2024-11-14  5:40             ` Reinette Chatre
2024-11-14 10:18               ` Peter Newman
