[PATCH V2] perf: qcom: Add L3 cache PMU driver

All of lore.kernel.org
 help / color / mirror / Atom feed

From: mark.rutland@arm.com (Mark Rutland)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH V2] perf: qcom: Add L3 cache PMU driver
Date: Fri, 17 Feb 2017 18:52:44 +0000	[thread overview]
Message-ID: <20170217185243.GC25876@leverpostej> (raw)
In-Reply-To: <8ace145be288028fb5faeb90e48a0d64@codeaurora.org>

On Thu, Feb 16, 2017 at 05:01:19PM -0500, Agustin Vega-Frias wrote:
> On 2017-02-16 11:41, Mark Rutland wrote:
> >On Thu, Feb 16, 2017 at 10:03:05AM -0500, Agustin Vega-Frias wrote:
> >>This adds a new dynamic PMU to the Perf Events framework to program
> >>and control the L3 cache PMUs in some Qualcomm Technologies SOCs.

> >At a quick glance, this looks *awfully* similar to the L2 PMU, modulo
> >using MMIO rather than sysreg accesses. The basic shape of the hardware
> >seems the same, and the register offsets are practically identical.
> >
> >I guess that the L2 and L3 are largely the same block?
> >
> >How exactly does this relate to the L2 PMU hardware? Could you please
> >elaborate on any major differences?
> 
> The L2 and L3 blocks are separate. In the current SoC each internal
> cluster shares an L2, but all clusters in the SoC share all the L3
> slices. Logically it is seen as a single, larger L3 cache, shared by
> all CPUs in the system. In spite of this, each physical slice has its
> own PMU. Which slice serves a given memory access is determined by
> a hashing algorithm which means all CPUs might have cache lines on
> every slice.
> 
> >This has similar issues to earlier versions of the L2 PMU driver, such
> >as permitting cross-{slice,pmu} groups, and aggregating per-slice
> >events, which have been addressed in the upstreamed L2 PMU driver.
> 
> We represent this as a single perf PMU that can only be associated
> with a sigle CPU context. We do aggregation here, since logically it
> is a single L3 cache.

Collating the PMUs behind a single logical PMU is fine.

I'm specifically referring to collating events (i.e. hiding separately
controlled counters behind a single perf_event).

Since the L2 slices were cluster-affine, we could manage those on a
per-cpu basis. Here you'd either have to have a config field to select
the slice, or expose each slice as its own PMU.

> >>+static void qcom_l3_cache__64bit_counter_start(struct
> >>perf_event *event)
> >>+{
> >>+	struct l3cache_pmu *socket = to_l3cache_pmu(event->pmu);
> >>+	struct hml3_pmu *slice;
> >>+	int idx = event->hw.idx;
> >>+	u64 value = local64_read(&event->count);
> >>+
> >>+	list_for_each_entry(slice, &socket->pmus, entry) {
> >>+		hml3_pmu__counter_enable_gang(slice, idx+1);
> >>+
> >>+		if (value) {
> >>+			hml3_pmu__counter_set_value(slice, idx+1, value >> 32);
> >>+			hml3_pmu__counter_set_value(slice, idx, (u32)value);
> >>+			value = 0;
> >>+		} else {
> >>+			hml3_pmu__counter_set_value(slice, idx+1, 0);
> >>+			hml3_pmu__counter_set_value(slice, idx, 0);
> >>+		}
> >>+
> >>+		hml3_pmu__counter_set_event(slice, idx+1, 0);
> >>+		hml3_pmu__counter_set_event(slice, idx, get_event_type(event));
> >>+
> >>+		hml3_pmu__counter_enable(slice, idx+1);
> >>+		hml3_pmu__counter_enable(slice, idx);
> >>+	}
> >>+}
> >
> >As with other PMU drivers, NAK to aggregating (separately-controlled)
> >HW events behind a single event. We should not be aggregating across
> >slices as we cannot start/stop those atomically.
> 
> In this case we start the hardware events in all slices. IOW there is a
> one-to-many relationship between perf_events in this perf PMU and events
> in the hardware PMUs.

I understand what is happening here; I'm simply disagreeing with it. We
do not permit event aggregation in this manner. Please see [1].

[...]

> >>+PMU_FORMAT_ATTR(event, "config:0-7");
> >>+PMU_FORMAT_ATTR(lc, "config:32");
> >
> >What is this 'lc' field?
> 
> It means "long counter". We have a feature that allows chaining two
> 32 bit
> hardware counters. This is how a user requests that.

Ok. This will need documentation.

[...]

> >>+static int qcom_l3_cache_pmu_offline_cpu(unsigned int cpu,
> >>struct hlist_node *n)
> >>+{
> >>+	struct l3cache_pmu *socket = hlist_entry_safe(n, struct
> >>l3cache_pmu, node);
> >>+	unsigned int target;
> >>+
> >>+	if (!cpumask_test_and_clear_cpu(cpu, &socket->cpu))
> >>+		return 0;
> >>+	target = cpumask_any_but(cpu_online_mask, cpu);
> >>+	if (target >= nr_cpu_ids)
> >>+		return 0;
> >>+	/*
> >>+	 * TODO: migrate context once core races on event->ctx have
> >>+	 * been fixed.
> >>+	 */
> >>+	cpumask_set_cpu(target, &socket->cpu);
> >>+	return 0;
> >>+}
> >
> >The event->ctx race has been fixed for a while now.
> 
> Great, I was searching for that yesterday, but could not find
> anything and
> assumed it had not been fixed, given that the CCI driver still has this
> comment in it. So it is safe to add the call to
> perf_pmu_migrate_context now?
> >
> >Please follow the example of the L2 PMU's hotplug callback, and also
> >fold the reset into the hotplug callback.
> 
> I agree we will need to do that once we have multi-socket support, but
> would you agree to keep this as-is for the time being?

If you treat the PMU as being affine to cpu_possible_mask you can follow
the same strategy as the L2 PMU driver. Even if it's not strictly
necessary, it avoids a needless difference between the two drivers.

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/478424.html

WARNING: multiple messages have this Message-ID (diff)

From: Mark Rutland <mark.rutland@arm.com>
To: Agustin Vega-Frias <agustinv@codeaurora.org>
Cc: linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	Will Deacon <will.deacon@arm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	timur@codeaurora.org, nleeder@codeaurora.org,
	agross@codeaurora.org, jcm@redhat.com, msalter@redhat.com,
	mlangsdo@redhat.com, ahs3@redhat.com
Subject: Re: [PATCH V2] perf: qcom: Add L3 cache PMU driver
Date: Fri, 17 Feb 2017 18:52:44 +0000	[thread overview]
Message-ID: <20170217185243.GC25876@leverpostej> (raw)
In-Reply-To: <8ace145be288028fb5faeb90e48a0d64@codeaurora.org>

On Thu, Feb 16, 2017 at 05:01:19PM -0500, Agustin Vega-Frias wrote:
> On 2017-02-16 11:41, Mark Rutland wrote:
> >On Thu, Feb 16, 2017 at 10:03:05AM -0500, Agustin Vega-Frias wrote:
> >>This adds a new dynamic PMU to the Perf Events framework to program
> >>and control the L3 cache PMUs in some Qualcomm Technologies SOCs.

> >At a quick glance, this looks *awfully* similar to the L2 PMU, modulo
> >using MMIO rather than sysreg accesses. The basic shape of the hardware
> >seems the same, and the register offsets are practically identical.
> >
> >I guess that the L2 and L3 are largely the same block?
> >
> >How exactly does this relate to the L2 PMU hardware? Could you please
> >elaborate on any major differences?
> 
> The L2 and L3 blocks are separate. In the current SoC each internal
> cluster shares an L2, but all clusters in the SoC share all the L3
> slices. Logically it is seen as a single, larger L3 cache, shared by
> all CPUs in the system. In spite of this, each physical slice has its
> own PMU. Which slice serves a given memory access is determined by
> a hashing algorithm which means all CPUs might have cache lines on
> every slice.
> 
> >This has similar issues to earlier versions of the L2 PMU driver, such
> >as permitting cross-{slice,pmu} groups, and aggregating per-slice
> >events, which have been addressed in the upstreamed L2 PMU driver.
> 
> We represent this as a single perf PMU that can only be associated
> with a sigle CPU context. We do aggregation here, since logically it
> is a single L3 cache.

Collating the PMUs behind a single logical PMU is fine.

I'm specifically referring to collating events (i.e. hiding separately
controlled counters behind a single perf_event).

Since the L2 slices were cluster-affine, we could manage those on a
per-cpu basis. Here you'd either have to have a config field to select
the slice, or expose each slice as its own PMU.

> >>+static void qcom_l3_cache__64bit_counter_start(struct
> >>perf_event *event)
> >>+{
> >>+	struct l3cache_pmu *socket = to_l3cache_pmu(event->pmu);
> >>+	struct hml3_pmu *slice;
> >>+	int idx = event->hw.idx;
> >>+	u64 value = local64_read(&event->count);
> >>+
> >>+	list_for_each_entry(slice, &socket->pmus, entry) {
> >>+		hml3_pmu__counter_enable_gang(slice, idx+1);
> >>+
> >>+		if (value) {
> >>+			hml3_pmu__counter_set_value(slice, idx+1, value >> 32);
> >>+			hml3_pmu__counter_set_value(slice, idx, (u32)value);
> >>+			value = 0;
> >>+		} else {
> >>+			hml3_pmu__counter_set_value(slice, idx+1, 0);
> >>+			hml3_pmu__counter_set_value(slice, idx, 0);
> >>+		}
> >>+
> >>+		hml3_pmu__counter_set_event(slice, idx+1, 0);
> >>+		hml3_pmu__counter_set_event(slice, idx, get_event_type(event));
> >>+
> >>+		hml3_pmu__counter_enable(slice, idx+1);
> >>+		hml3_pmu__counter_enable(slice, idx);
> >>+	}
> >>+}
> >
> >As with other PMU drivers, NAK to aggregating (separately-controlled)
> >HW events behind a single event. We should not be aggregating across
> >slices as we cannot start/stop those atomically.
> 
> In this case we start the hardware events in all slices. IOW there is a
> one-to-many relationship between perf_events in this perf PMU and events
> in the hardware PMUs.

I understand what is happening here; I'm simply disagreeing with it. We
do not permit event aggregation in this manner. Please see [1].

[...]

> >>+PMU_FORMAT_ATTR(event, "config:0-7");
> >>+PMU_FORMAT_ATTR(lc, "config:32");
> >
> >What is this 'lc' field?
> 
> It means "long counter". We have a feature that allows chaining two
> 32 bit
> hardware counters. This is how a user requests that.

Ok. This will need documentation.

[...]

> >>+static int qcom_l3_cache_pmu_offline_cpu(unsigned int cpu,
> >>struct hlist_node *n)
> >>+{
> >>+	struct l3cache_pmu *socket = hlist_entry_safe(n, struct
> >>l3cache_pmu, node);
> >>+	unsigned int target;
> >>+
> >>+	if (!cpumask_test_and_clear_cpu(cpu, &socket->cpu))
> >>+		return 0;
> >>+	target = cpumask_any_but(cpu_online_mask, cpu);
> >>+	if (target >= nr_cpu_ids)
> >>+		return 0;
> >>+	/*
> >>+	 * TODO: migrate context once core races on event->ctx have
> >>+	 * been fixed.
> >>+	 */
> >>+	cpumask_set_cpu(target, &socket->cpu);
> >>+	return 0;
> >>+}
> >
> >The event->ctx race has been fixed for a while now.
> 
> Great, I was searching for that yesterday, but could not find
> anything and
> assumed it had not been fixed, given that the CCI driver still has this
> comment in it. So it is safe to add the call to
> perf_pmu_migrate_context now?
> >
> >Please follow the example of the L2 PMU's hotplug callback, and also
> >fold the reset into the hotplug callback.
> 
> I agree we will need to do that once we have multi-socket support, but
> would you agree to keep this as-is for the time being?

If you treat the PMU as being affine to cpu_possible_mask you can follow
the same strategy as the L2 PMU driver. Even if it's not strictly
necessary, it avoids a needless difference between the two drivers.

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/478424.html

next prev parent reply	other threads:[~2017-02-17 18:52 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-16 15:03 [PATCH V2] perf: qcom: Add L3 cache PMU driver Agustin Vega-Frias
2017-02-16 15:03 ` Agustin Vega-Frias
2017-02-16 16:41 ` Mark Rutland
2017-02-16 16:41   ` Mark Rutland
2017-02-16 22:01   ` Agustin Vega-Frias
2017-02-16 22:01     ` Agustin Vega-Frias
2017-02-17 18:52     ` Mark Rutland [this message]
2017-02-17 18:52       ` Mark Rutland
2017-02-24 21:58 ` Timur Tabi
2017-02-24 21:58   ` Timur Tabi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170217185243.GC25876@leverpostej \
    --to=mark.rutland@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.