LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
From: Nicholas Piggin @ 2021-06-14  4:47 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: linux-arch, Linus Torvalds, Randy Dunlap, linux-kernel, linux-mm,
	linuxppc-dev
In-Reply-To: <1623643443.b9twp3txmw.astroid@bobo.none>

Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>> when it is context switched. This can be disabled by architectures that
>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>
>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>
>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>> with the resulting code if task->active_mm were at least better
>>>> documented and possibly even guarded by ifdefs.
>>> 
>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>> I don't think anything has changed in 20 years, I don't know what more
>>> is needed, but if you can add to documentation that would be nice. Maybe
>>> moving a bit of that into .c and .h files?
>>> 
>> 
>> Quoting from that file:
>> 
>>   - however, we obviously need to keep track of which address space we
>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>     which shows what the currently active address space is.
>> 
>> This isn't even true right now on x86.
> 
> From the perspective of core code, it is. x86 might do something crazy 
> with it, but it has to make it appear this way to non-arch code that
> uses active_mm.
> 
> Is x86's scheme documented?
> 
>> With your patch applied:
>> 
>>  To support all that, the "struct mm_struct" now has two counters: a
>>  "mm_users" counter that is how many "real address space users" there are,
>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>  users) plus one if there are any real users.
>> 
>> isn't even true any more.
> 
> Well yeah but the active_mm concept hasn't changed. The refcounting 
> change is hopefully reasonably documented?
> 
>> 
>> 
>>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>>> dangling pointer, and this makes me extremely uncomfortable.
>>>>
>>>> So I tend to think that, depending on config, the core code should
>>>> either keep ->active_mm [1] alive or get rid of it entirely.
>>> 
>>> I don't actually know what you mean.
>>> 
>>> core code needs the concept of an "active_mm". This is the mm that your 
>>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>>> active_mm still points to init_mm for kernel threads.
>> 
>> Core code does *not* need this concept.  First, it's wrong on x86 since
>> at least 4.15.  Any core code that actually assumes that ->active_mm is
>> "active" for any sensible definition of the word active is wrong.
>> Fortunately there is no such code.
>> 
>> I looked through all active_mm references in core code.  We have:
>> 
>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>> with membarrier.
>> 
>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>> 
>> kernel/exit.c: exit_mm() a BUG_ON().
>> 
>> kernel/fork.c: initialization code and a warning.
>> 
>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>> 
>> fs/exec.c: nothing of interest
> 
> I might not have been clear. Core code doesn't need active_mm if 
> active_mm somehow goes away. I'm saying active_mm can't go away because
> it's needed to support (most) archs that do lazy tlb mm switching.
> 
> The part I don't understand is when you say it can just go away. How? 
> 
>> I didn't go through drivers, but I maintain my point.  active_mm is
>> there for refcounting.  So please don't just make it even more confusing
>> -- do your performance improvement, but improve the code at the same
>> time: get rid of active_mm, at least on architectures that opt out of
>> the refcounting.
> 
> powerpc opts out of the refcounting and can not "get rid of active_mm".
> Not even in theory.

That is to say, it does do a type of reference management that requires 
active_mm so you can argue it has not entirely opted out of refcounting.
But we're not just doing refcounting for the sake of refcounting! That
would make no sense.

active_mm is required because that's the mm that we have switched to 
(from core code's perspective), and it is integral to know when to 
switch to a different mm. See how active_mm is a fundamental concept
in core code? It's part of the contract between core code and the
arch mm context management calls. reference counting follows from there
but it's not the _reason_ for this code.

Pretend the reference problem does not exit (whether by refcounting or 
shootdown or garbage collection or whatever). We still can't remove 
active_mm! We need it to know how to call into arch functions like 
switch_mm.

I don't know if you just forgot that critical requirement in your above 
list, or you actually are entirely using x86's mental model for this 
code which is doing something entirely different that does not need it 
at all. If that is the case I really don't mind some cleanup or wrapper 
functions for x86 do entirely do its own thing, but if that's the case
you can't criticize core code's use of active_mm due to the current
state of x86. It's x86 that needs documentation and cleaning up.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
From: Nicholas Piggin @ 2021-06-14  5:21 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: linux-arch, Linus Torvalds, Randy Dunlap, linux-kernel, linux-mm,
	linuxppc-dev
In-Reply-To: <1623645385.u2cqbcn3co.astroid@bobo.none>

Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
> Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
>> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>>> when it is context switched. This can be disabled by architectures that
>>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>>
>>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>>
>>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>>> with the resulting code if task->active_mm were at least better
>>>>> documented and possibly even guarded by ifdefs.
>>>> 
>>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>>> I don't think anything has changed in 20 years, I don't know what more
>>>> is needed, but if you can add to documentation that would be nice. Maybe
>>>> moving a bit of that into .c and .h files?
>>>> 
>>> 
>>> Quoting from that file:
>>> 
>>>   - however, we obviously need to keep track of which address space we
>>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>>     which shows what the currently active address space is.
>>> 
>>> This isn't even true right now on x86.
>> 
>> From the perspective of core code, it is. x86 might do something crazy 
>> with it, but it has to make it appear this way to non-arch code that
>> uses active_mm.
>> 
>> Is x86's scheme documented?
>> 
>>> With your patch applied:
>>> 
>>>  To support all that, the "struct mm_struct" now has two counters: a
>>>  "mm_users" counter that is how many "real address space users" there are,
>>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>  users) plus one if there are any real users.
>>> 
>>> isn't even true any more.
>> 
>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>> change is hopefully reasonably documented?
>> 
>>> 
>>> 
>>>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>>>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>>>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>>>> dangling pointer, and this makes me extremely uncomfortable.
>>>>>
>>>>> So I tend to think that, depending on config, the core code should
>>>>> either keep ->active_mm [1] alive or get rid of it entirely.
>>>> 
>>>> I don't actually know what you mean.
>>>> 
>>>> core code needs the concept of an "active_mm". This is the mm that your 
>>>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>>>> active_mm still points to init_mm for kernel threads.
>>> 
>>> Core code does *not* need this concept.  First, it's wrong on x86 since
>>> at least 4.15.  Any core code that actually assumes that ->active_mm is
>>> "active" for any sensible definition of the word active is wrong.
>>> Fortunately there is no such code.
>>> 
>>> I looked through all active_mm references in core code.  We have:
>>> 
>>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>>> with membarrier.
>>> 
>>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>>> 
>>> kernel/exit.c: exit_mm() a BUG_ON().
>>> 
>>> kernel/fork.c: initialization code and a warning.
>>> 
>>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>>> 
>>> fs/exec.c: nothing of interest
>> 
>> I might not have been clear. Core code doesn't need active_mm if 
>> active_mm somehow goes away. I'm saying active_mm can't go away because
>> it's needed to support (most) archs that do lazy tlb mm switching.
>> 
>> The part I don't understand is when you say it can just go away. How? 
>> 
>>> I didn't go through drivers, but I maintain my point.  active_mm is
>>> there for refcounting.  So please don't just make it even more confusing
>>> -- do your performance improvement, but improve the code at the same
>>> time: get rid of active_mm, at least on architectures that opt out of
>>> the refcounting.
>> 
>> powerpc opts out of the refcounting and can not "get rid of active_mm".
>> Not even in theory.
> 
> That is to say, it does do a type of reference management that requires 
> active_mm so you can argue it has not entirely opted out of refcounting.
> But we're not just doing refcounting for the sake of refcounting! That
> would make no sense.
> 
> active_mm is required because that's the mm that we have switched to 
> (from core code's perspective), and it is integral to know when to 
> switch to a different mm. See how active_mm is a fundamental concept
> in core code? It's part of the contract between core code and the
> arch mm context management calls. reference counting follows from there
> but it's not the _reason_ for this code.
> 
> Pretend the reference problem does not exit (whether by refcounting or 
> shootdown or garbage collection or whatever). We still can't remove 
> active_mm! We need it to know how to call into arch functions like 
> switch_mm.
> 
> I don't know if you just forgot that critical requirement in your above 
> list, or you actually are entirely using x86's mental model for this 
> code which is doing something entirely different that does not need it 
> at all. If that is the case I really don't mind some cleanup or wrapper 
> functions for x86 do entirely do its own thing, but if that's the case
> you can't criticize core code's use of active_mm due to the current
> state of x86. It's x86 that needs documentation and cleaning up.

Ah, that must be where your confusion is coming from: x86's switch_mm 
doesn't use prev anywhere, and the reference scheme it is using appears 
to be under-documented, although vague references in changelogs suggest 
it has not actually "opted out" of active_mm refcounting.

That's understandable, but please redirect your objections to the proper 
place. git blame suggests 3d28ebceaffab.

Thanks,
Nick

^ permalink raw reply

* [PATCH v2 1/4] drivers/nvdimm: Add nvdimm pmu structure
From: Kajol Jain @ 2021-06-14  5:23 UTC (permalink / raw)
  To: mpe, linuxppc-dev, nvdimm, linux-kernel, peterz
  Cc: santosh, maddy, rnsastry, aneesh.kumar, atrajeev, kjain, vaibhav,
	dan.j.williams, ira.weiny, tglx
In-Reply-To: <20210614052326.285710-1-kjain@linux.ibm.com>

A structure is added, called nvdimm_pmu, for performance
stats reporting support of nvdimm devices. It can be used to add
nvdimm pmu data such as supported events and pmu event functions
like event_init/add/read/del with cpu hotplug support.

Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
---
 include/linux/nd.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/include/linux/nd.h b/include/linux/nd.h
index ee9ad76afbba..712499cf7335 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -8,6 +8,8 @@
 #include <linux/ndctl.h>
 #include <linux/device.h>
 #include <linux/badblocks.h>
+#include <linux/platform_device.h>
+#include <linux/perf_event.h>
 
 enum nvdimm_event {
 	NVDIMM_REVALIDATE_POISON,
@@ -23,6 +25,47 @@ enum nvdimm_claim_class {
 	NVDIMM_CCLASS_UNKNOWN,
 };
 
+/* Event attribute array index */
+#define NVDIMM_PMU_FORMAT_ATTR		0
+#define NVDIMM_PMU_EVENT_ATTR		1
+#define NVDIMM_PMU_CPUMASK_ATTR		2
+#define NVDIMM_PMU_NULL_ATTR		3
+
+/**
+ * struct nvdimm_pmu - data structure for nvdimm perf driver
+ *
+ * @name: name of the nvdimm pmu device.
+ * @pmu: pmu data structure for nvdimm performance stats.
+ * @dev: nvdimm device pointer.
+ * @functions(event_init/add/del/read): platform specific pmu functions.
+ * @attr_groups: data structure for events, formats and cpumask
+ * @cpu: designated cpu for counter access.
+ * @node: node for cpu hotplug notifier link.
+ * @cpuhp_state: state for cpu hotplug notification.
+ * @arch_cpumask: cpumask to get designated cpu for counter access.
+ */
+struct nvdimm_pmu {
+	const char *name;
+	struct pmu pmu;
+	struct device *dev;
+	int (*event_init)(struct perf_event *event);
+	int  (*add)(struct perf_event *event, int flags);
+	void (*del)(struct perf_event *event, int flags);
+	void (*read)(struct perf_event *event);
+	/*
+	 * Attribute groups for the nvdimm pmu. Index 0 used for
+	 * format attribute, index 1 used for event attribute,
+	 * index 2 used for cpusmask attribute and index 3 kept as NULL.
+	 */
+	const struct attribute_group *attr_groups[4];
+	int cpu;
+	struct hlist_node node;
+	enum cpuhp_state cpuhp_state;
+
+	/* cpumask provided by arch/platform specific code */
+	struct cpumask arch_cpumask;
+};
+
 struct nd_device_driver {
 	struct device_driver drv;
 	unsigned long type;
-- 
2.27.0


^ permalink raw reply related

* [PATCH v2 2/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats
From: Kajol Jain @ 2021-06-14  5:23 UTC (permalink / raw)
  To: mpe, linuxppc-dev, nvdimm, linux-kernel, peterz
  Cc: santosh, maddy, rnsastry, aneesh.kumar, atrajeev, kjain, vaibhav,
	dan.j.williams, ira.weiny, tglx
In-Reply-To: <20210614052326.285710-1-kjain@linux.ibm.com>

A common interface is added to get performance stats reporting
support for nvdimm devices. Added interface includes support for
pmu register/unregister functions, cpu hotplug and pmu event
functions like event_init/add/read/del.
User could use the standard perf tool to access perf
events exposed via pmu.

Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
---
 drivers/nvdimm/Makefile  |   1 +
 drivers/nvdimm/nd_perf.c | 230 +++++++++++++++++++++++++++++++++++++++
 include/linux/nd.h       |   3 +
 3 files changed, 234 insertions(+)
 create mode 100644 drivers/nvdimm/nd_perf.c

diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 29203f3d3069..25dba6095612 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -18,6 +18,7 @@ nd_e820-y := e820.o
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
+libnvdimm-y += nd_perf.o
 libnvdimm-y += dimm.o
 libnvdimm-y += region_devs.o
 libnvdimm-y += region.o
diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c
new file mode 100644
index 000000000000..d4e9b9306043
--- /dev/null
+++ b/drivers/nvdimm/nd_perf.c
@@ -0,0 +1,230 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * nd_perf.c: NVDIMM Device Performance Monitoring Unit support
+ *
+ * Perf interface to expose nvdimm performance stats.
+ *
+ * Copyright (C) 2021 IBM Corporation
+ */
+
+#define pr_fmt(fmt) "nvdimm_pmu: " fmt
+
+#include <linux/nd.h>
+
+static ssize_t nvdimm_pmu_cpumask_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct pmu *pmu = dev_get_drvdata(dev);
+	struct nvdimm_pmu *nd_pmu;
+
+	nd_pmu = container_of(pmu, struct nvdimm_pmu, pmu);
+
+	return cpumap_print_to_pagebuf(true, buf, cpumask_of(nd_pmu->cpu));
+}
+
+static int nvdimm_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
+{
+	struct nvdimm_pmu *nd_pmu;
+	u32 target;
+	int nodeid;
+	const struct cpumask *cpumask;
+
+	nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
+
+	/* Clear it, incase given cpu is set in nd_pmu->arch_cpumask */
+	cpumask_test_and_clear_cpu(cpu, &nd_pmu->arch_cpumask);
+
+	/*
+	 * If given cpu is not same as current designated cpu for
+	 * counter access, just return.
+	 */
+	if (cpu != nd_pmu->cpu)
+		return 0;
+
+	/* Check for any active cpu in nd_pmu->arch_cpumask */
+	target = cpumask_any(&nd_pmu->arch_cpumask);
+
+	/*
+	 * Incase we don't have any active cpu in nd_pmu->arch_cpumask,
+	 * check in given cpu's numa node list.
+	 */
+	if (target >= nr_cpu_ids) {
+		nodeid = cpu_to_node(cpu);
+		cpumask = cpumask_of_node(nodeid);
+		target = cpumask_any_but(cpumask, cpu);
+	}
+	nd_pmu->cpu = target;
+
+	/* Migrate nvdimm pmu events to the new target cpu if valid */
+	if (target >= 0 && target < nr_cpu_ids)
+		perf_pmu_migrate_context(&nd_pmu->pmu, cpu, target);
+
+	return 0;
+}
+
+static int nvdimm_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct nvdimm_pmu *nd_pmu;
+
+	nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
+
+	if (nd_pmu->cpu >= nr_cpu_ids)
+		nd_pmu->cpu = cpu;
+
+	return 0;
+}
+
+static int create_cpumask_attr_group(struct nvdimm_pmu *nd_pmu)
+{
+	struct perf_pmu_events_attr *attr;
+	struct attribute **attrs;
+	struct attribute_group *nvdimm_pmu_cpumask_group;
+
+	attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+	if (!attr)
+		return -ENOMEM;
+
+	attrs = kzalloc(2 * sizeof(struct attribute *), GFP_KERNEL);
+	if (!attrs) {
+		kfree(attr);
+		return -ENOMEM;
+	}
+
+	/* Allocate memory for cpumask attribute group */
+	nvdimm_pmu_cpumask_group = kzalloc(sizeof(*nvdimm_pmu_cpumask_group), GFP_KERNEL);
+	if (!nvdimm_pmu_cpumask_group) {
+		kfree(attr);
+		kfree(attrs);
+		return -ENOMEM;
+	}
+
+	sysfs_attr_init(&attr->attr.attr);
+	attr->attr.attr.name = "cpumask";
+	attr->attr.attr.mode = 0444;
+	attr->attr.show = nvdimm_pmu_cpumask_show;
+	attrs[0] = &attr->attr.attr;
+	attrs[1] = NULL;
+
+	nvdimm_pmu_cpumask_group->attrs = attrs;
+	nd_pmu->attr_groups[NVDIMM_PMU_CPUMASK_ATTR] = nvdimm_pmu_cpumask_group;
+	return 0;
+}
+
+static int nvdimm_pmu_cpu_hotplug_init(struct nvdimm_pmu *nd_pmu)
+{
+	int nodeid, rc;
+	const struct cpumask *cpumask;
+
+	/*
+	 * Incase cpu hotplug is not handled by arch specific code
+	 * they can still provide required cpumask which can be used
+	 * to get designatd cpu for counter access.
+	 * Check for any active cpu in nd_pmu->arch_cpumask.
+	 */
+	if (!cpumask_empty(&nd_pmu->arch_cpumask)) {
+		nd_pmu->cpu = cpumask_any(&nd_pmu->arch_cpumask);
+	} else {
+		/* pick active cpu from the cpumask of device numa node. */
+		nodeid = dev_to_node(nd_pmu->dev);
+		cpumask = cpumask_of_node(nodeid);
+		nd_pmu->cpu = cpumask_any(cpumask);
+	}
+
+	rc = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "perf/nvdimm:online",
+				     nvdimm_pmu_cpu_online, nvdimm_pmu_cpu_offline);
+
+	if (rc < 0)
+		return rc;
+
+	nd_pmu->cpuhp_state = rc;
+
+	/* Register the pmu instance for cpu hotplug */
+	rc = cpuhp_state_add_instance_nocalls(nd_pmu->cpuhp_state, &nd_pmu->node);
+	if (rc) {
+		cpuhp_remove_multi_state(nd_pmu->cpuhp_state);
+		return rc;
+	}
+
+	/* Create cpumask attribute group */
+	rc = create_cpumask_attr_group(nd_pmu);
+	if (rc) {
+		cpuhp_state_remove_instance_nocalls(nd_pmu->cpuhp_state, &nd_pmu->node);
+		cpuhp_remove_multi_state(nd_pmu->cpuhp_state);
+		return rc;
+	}
+
+	return 0;
+}
+
+void nvdimm_pmu_free_hotplug_memory(struct nvdimm_pmu *nd_pmu)
+{
+	cpuhp_state_remove_instance_nocalls(nd_pmu->cpuhp_state, &nd_pmu->node);
+	cpuhp_remove_multi_state(nd_pmu->cpuhp_state);
+
+	if (nd_pmu->attr_groups[NVDIMM_PMU_CPUMASK_ATTR])
+		kfree(nd_pmu->attr_groups[NVDIMM_PMU_CPUMASK_ATTR]->attrs);
+	kfree(nd_pmu->attr_groups[NVDIMM_PMU_CPUMASK_ATTR]);
+}
+
+int register_nvdimm_pmu(struct nvdimm_pmu *nd_pmu, struct platform_device *pdev)
+{
+	int rc;
+
+	if (!nd_pmu || !pdev)
+		return -EINVAL;
+
+	/* event functions like add/del/read/event_init should not be NULL */
+	if (WARN_ON_ONCE(!(nd_pmu->event_init && nd_pmu->add && nd_pmu->del && nd_pmu->read)))
+		return -EINVAL;
+
+	nd_pmu->pmu.task_ctx_nr = perf_invalid_context;
+	nd_pmu->pmu.name = nd_pmu->name;
+	nd_pmu->pmu.event_init = nd_pmu->event_init;
+	nd_pmu->pmu.add = nd_pmu->add;
+	nd_pmu->pmu.del = nd_pmu->del;
+	nd_pmu->pmu.read = nd_pmu->read;
+
+	nd_pmu->pmu.attr_groups = nd_pmu->attr_groups;
+	nd_pmu->pmu.capabilities = PERF_PMU_CAP_NO_INTERRUPT |
+				PERF_PMU_CAP_NO_EXCLUDE;
+
+	/*
+	 * Add platform_device->dev pointer to nvdimm_pmu to access
+	 * device data in events functions.
+	 */
+	nd_pmu->dev = &pdev->dev;
+
+	/*
+	 * Incase cpumask attribute is set it means cpu
+	 * hotplug is handled by the arch specific code and
+	 * we can skip calling hotplug_init.
+	 */
+	if (!nd_pmu->attr_groups[NVDIMM_PMU_CPUMASK_ATTR]) {
+		/* init cpuhotplug */
+		rc = nvdimm_pmu_cpu_hotplug_init(nd_pmu);
+		if (rc) {
+			pr_info("cpu hotplug feature failed for device: %s\n", nd_pmu->name);
+			return rc;
+		}
+	}
+
+	rc = perf_pmu_register(&nd_pmu->pmu, nd_pmu->name, -1);
+	if (rc) {
+		nvdimm_pmu_free_hotplug_memory(nd_pmu);
+		return rc;
+	}
+
+	pr_info("%s NVDIMM performance monitor support registered\n",
+		nd_pmu->name);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(register_nvdimm_pmu);
+
+void unregister_nvdimm_pmu(struct nvdimm_pmu *nd_pmu)
+{
+	/* handle freeing of memory nd_pmu in arch specific code */
+	perf_pmu_unregister(&nd_pmu->pmu);
+	nvdimm_pmu_free_hotplug_memory(nd_pmu);
+}
+EXPORT_SYMBOL_GPL(unregister_nvdimm_pmu);
diff --git a/include/linux/nd.h b/include/linux/nd.h
index 712499cf7335..7d8b4f7d277d 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -66,6 +66,9 @@ struct nvdimm_pmu {
 	struct cpumask arch_cpumask;
 };
 
+int register_nvdimm_pmu(struct nvdimm_pmu *nvdimm, struct platform_device *pdev);
+void unregister_nvdimm_pmu(struct nvdimm_pmu *nd_pmu);
+
 struct nd_device_driver {
 	struct device_driver drv;
 	unsigned long type;
-- 
2.27.0


^ permalink raw reply related

* [PATCH v2 4/4] powerpc/papr_scm: Document papr_scm sysfs event format entries
From: Kajol Jain @ 2021-06-14  5:23 UTC (permalink / raw)
  To: mpe, linuxppc-dev, nvdimm, linux-kernel, peterz
  Cc: santosh, maddy, rnsastry, aneesh.kumar, atrajeev, kjain, vaibhav,
	dan.j.williams, ira.weiny, tglx
In-Reply-To: <20210614052326.285710-1-kjain@linux.ibm.com>

Details is added for the event, cpumask and format attributes
in the ABI documentation.

Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem | 31 +++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 92e2db0e2d3d..be91de341454 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -59,3 +59,34 @@ Description:
 		* "CchRHCnt" : Cache Read Hit Count
 		* "CchWHCnt" : Cache Write Hit Count
 		* "FastWCnt" : Fast Write Count
+
+What:		/sys/devices/nmemX/format
+Date:		June 2021
+Contact:	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, linux-nvdimm@lists.01.org,
+Description:	(RO) Attribute group to describe the magic bits
+                that go into perf_event_attr.config for a particular pmu.
+                (See ABI/testing/sysfs-bus-event_source-devices-format).
+
+                Each attribute under this group defines a bit range of the
+                perf_event_attr.config. Supported attribute is listed
+                below::
+
+		    event  = "config:0-4"  - event ID
+
+		For example::
+		    noopstat = "event=0x1"
+
+What:		/sys/devices/nmemX/events
+Date:		June 2021
+Contact:	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, linux-nvdimm@lists.01.org,
+Description:    (RO) Attribute group to describe performance monitoring
+                events specific to papr-scm. Each attribute in this group describes
+                a single performance monitoring event supported by this nvdimm pmu.
+                The name of the file is the name of the event.
+                (See ABI/testing/sysfs-bus-event_source-devices-events).
+
+What:		/sys/devices/nmemX/cpumask
+Date:		June 2021
+Contact:	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, linux-nvdimm@lists.01.org,
+Description:	(RO) This sysfs file exposes the cpumask which is designated to make
+                HCALLs to retrieve nvdimm pmu event counter data.
-- 
2.27.0


^ permalink raw reply related

* [PATCH v2 0/4] Add perf interface to expose nvdimm
From: Kajol Jain @ 2021-06-14  5:23 UTC (permalink / raw)
  To: mpe, linuxppc-dev, nvdimm, linux-kernel, peterz
  Cc: santosh, maddy, rnsastry, aneesh.kumar, atrajeev, kjain, vaibhav,
	dan.j.williams, ira.weiny, tglx

Patchset adds performance stats reporting support for nvdimm.
Added interface includes support for pmu register/unregister
functions. A structure is added called nvdimm_pmu to be used for
adding arch/platform specific data such as supported events, cpumask
pmu event functions like event_init/add/read/del.
User could use the standard perf tool to access perf
events exposed via pmu.

Added implementation to expose IBM pseries platform nmem*
device performance stats using this interface.

Result from power9 pseries lpar with 2 nvdimm device:
command:# perf list nmem
  nmem0/cchrhcnt/                                    [Kernel PMU event]
  nmem0/cchwhcnt/                                    [Kernel PMU event]
  nmem0/critrscu/                                    [Kernel PMU event]
  nmem0/ctlresct/                                    [Kernel PMU event]
  nmem0/ctlrestm/                                    [Kernel PMU event]
  nmem0/fastwcnt/                                    [Kernel PMU event]
  nmem0/hostlcnt/                                    [Kernel PMU event]
  nmem0/hostldur/                                    [Kernel PMU event]
  nmem0/hostscnt/                                    [Kernel PMU event]
  nmem0/hostsdur/                                    [Kernel PMU event]
  nmem0/medrcnt/                                     [Kernel PMU event]
  nmem0/medrdur/                                     [Kernel PMU event]
  nmem0/medwcnt/                                     [Kernel PMU event]
  nmem0/medwdur/                                     [Kernel PMU event]
  nmem0/memlife/                                     [Kernel PMU event]
  nmem0/noopstat/                                    [Kernel PMU event]
  nmem0/ponsecs/                                     [Kernel PMU event]
  nmem1/cchrhcnt/                                    [Kernel PMU event]
  nmem1/cchwhcnt/                                    [Kernel PMU event]
  nmem1/critrscu/                                    [Kernel PMU event]
  ...
  nmem1/noopstat/                                    [Kernel PMU event]
  nmem1/ponsecs/                                     [Kernel PMU event]

Patch1:
        Introduces the nvdimm_pmu structure
Patch2:
	Adds common interface to add arch/platform specific data
	includes supported events, pmu event functions. It also
	adds code for cpu hotplug support.
Patch3:
        Add code in arch/powerpc/platform/pseries/papr_scm.c to expose
        nmem* pmu. It fills in the nvdimm_pmu structure with event attrs
        cpumask andevent functions and then registers the pmu by adding
        callbacks to register_nvdimm_pmu.
Patch4:
        Sysfs documentation patch

Changelog
---
PATCH v1 -> PATCH v2
- Fix hotplug code by adding pmu migration call
  incase current designated cpu got offline. As
  pointed by Peter Zijlstra.

- Removed the retun -1 part from cpu hotplug offline
  function.

- Link to the previous patchset : https://lkml.org/lkml/2021/6/8/500
---
Kajol Jain (4):
  drivers/nvdimm: Add nvdimm pmu structure
  drivers/nvdimm: Add perf interface to expose nvdimm performance stats
  powerpc/papr_scm: Add perf interface support
  powerpc/papr_scm: Document papr_scm sysfs event format entries

 Documentation/ABI/testing/sysfs-bus-papr-pmem |  31 ++
 arch/powerpc/include/asm/device.h             |   5 +
 arch/powerpc/platforms/pseries/papr_scm.c     | 365 ++++++++++++++++++
 drivers/nvdimm/Makefile                       |   1 +
 drivers/nvdimm/nd_perf.c                      | 230 +++++++++++
 include/linux/nd.h                            |  46 +++
 6 files changed, 678 insertions(+)
 create mode 100644 drivers/nvdimm/nd_perf.c

-- 
2.27.0


^ permalink raw reply

* [PATCH v2 3/4] powerpc/papr_scm: Add perf interface support
From: Kajol Jain @ 2021-06-14  5:23 UTC (permalink / raw)
  To: mpe, linuxppc-dev, nvdimm, linux-kernel, peterz
  Cc: santosh, maddy, rnsastry, aneesh.kumar, atrajeev, kjain, vaibhav,
	dan.j.williams, ira.weiny, tglx
In-Reply-To: <20210614052326.285710-1-kjain@linux.ibm.com>

Performance monitoring support for papr-scm nvdimm devices
via perf interface is added which includes addition of pmu
functions like add/del/read/event_init for nvdimm_pmu struture.

A new parameter 'priv' in added to the pdev_archdata structure to save
nvdimm_pmu device pointer, to handle the unregistering of pmu device.

papr_scm_pmu_register function populates the nvdimm_pmu structure
with events, cpumask, attribute groups along with event handling
functions. Finally the populated nvdimm_pmu structure is passed to
register the pmu device.
Event handling functions internally uses hcall to get events and
counter data.

Result in power9 machine with 2 nvdimm device:

Ex: List all event by perf list

command:# perf list nmem

  nmem0/cchrhcnt/                                    [Kernel PMU event]
  nmem0/cchwhcnt/                                    [Kernel PMU event]
  nmem0/critrscu/                                    [Kernel PMU event]
  nmem0/ctlresct/                                    [Kernel PMU event]
  nmem0/ctlrestm/                                    [Kernel PMU event]
  nmem0/fastwcnt/                                    [Kernel PMU event]
  nmem0/hostlcnt/                                    [Kernel PMU event]
  nmem0/hostldur/                                    [Kernel PMU event]
  nmem0/hostscnt/                                    [Kernel PMU event]
  nmem0/hostsdur/                                    [Kernel PMU event]
  nmem0/medrcnt/                                     [Kernel PMU event]
  nmem0/medrdur/                                     [Kernel PMU event]
  nmem0/medwcnt/                                     [Kernel PMU event]
  nmem0/medwdur/                                     [Kernel PMU event]
  nmem0/memlife/                                     [Kernel PMU event]
  nmem0/noopstat/                                    [Kernel PMU event]
  nmem0/ponsecs/                                     [Kernel PMU event]
  nmem1/cchrhcnt/                                    [Kernel PMU event]
  nmem1/cchwhcnt/                                    [Kernel PMU event]
  nmem1/critrscu/                                    [Kernel PMU event]
  ...
  nmem1/noopstat/                                    [Kernel PMU event]
  nmem1/ponsecs/                                     [Kernel PMU event]

Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
---
 arch/powerpc/include/asm/device.h         |   5 +
 arch/powerpc/platforms/pseries/papr_scm.c | 365 ++++++++++++++++++++++
 2 files changed, 370 insertions(+)

diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h
index 219559d65864..47ed639f3b8f 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -48,6 +48,11 @@ struct dev_archdata {
 
 struct pdev_archdata {
 	u64 dma_mask;
+	/*
+	 * Pointer to nvdimm_pmu structure, to handle the unregistering
+	 * of pmu device
+	 */
+	void *priv;
 };
 
 #endif /* _ASM_POWERPC_DEVICE_H */
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index ef26fe40efb0..92632b4a4a60 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -18,6 +18,8 @@
 #include <asm/plpar_wrappers.h>
 #include <asm/papr_pdsm.h>
 #include <asm/mce.h>
+#include <linux/perf_event.h>
+#include <linux/ctype.h>
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -67,6 +69,8 @@
 #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
 #define PAPR_SCM_PERF_STATS_VERSION 0x1
 
+#define to_nvdimm_pmu(_pmu)	container_of(_pmu, struct nvdimm_pmu, pmu)
+
 /* Struct holding a single performance metric */
 struct papr_scm_perf_stat {
 	u8 stat_id[8];
@@ -116,6 +120,12 @@ struct papr_scm_priv {
 
 	/* length of the stat buffer as expected by phyp */
 	size_t stat_buffer_len;
+
+	 /* array to have event_code and stat_id mappings */
+	char **nvdimm_events_map;
+
+	/* count of supported events */
+	u32 total_events;
 };
 
 static int papr_scm_pmem_flush(struct nd_region *nd_region,
@@ -329,6 +339,354 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
 	return 0;
 }
 
+PMU_FORMAT_ATTR(event, "config:0-4");
+
+static struct attribute *nvdimm_pmu_format_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group nvdimm_pmu_format_group = {
+	.name = "format",
+	.attrs = nvdimm_pmu_format_attr,
+};
+
+static int papr_scm_pmu_get_value(struct perf_event *event, struct device *dev, u64 *count)
+{
+	struct papr_scm_perf_stat *stat;
+	struct papr_scm_perf_stats *stats;
+	struct papr_scm_priv *p = (struct papr_scm_priv *)dev->driver_data;
+	int rc, size;
+
+	/* Allocate request buffer enough to hold single performance stat */
+	size = sizeof(struct papr_scm_perf_stats) +
+		sizeof(struct papr_scm_perf_stat);
+
+	if (!p || !p->nvdimm_events_map)
+		return -EINVAL;
+
+	stats = kzalloc(size, GFP_KERNEL);
+	if (!stats)
+		return -ENOMEM;
+
+	stat = &stats->scm_statistic[0];
+	memcpy(&stat->stat_id,
+	       p->nvdimm_events_map[event->attr.config - 1],
+		sizeof(stat->stat_id));
+	stat->stat_val = 0;
+
+	rc = drc_pmem_query_stats(p, stats, 1);
+	if (rc < 0) {
+		kfree(stats);
+		return rc;
+	}
+
+	*count = be64_to_cpu(stat->stat_val);
+	kfree(stats);
+	return 0;
+}
+
+static int papr_scm_pmu_event_init(struct perf_event *event)
+{
+	struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
+	struct papr_scm_priv *p;
+
+	if (!nd_pmu)
+		return -EINVAL;
+
+	/* test the event attr type for PMU enumeration */
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/* it does not support event sampling mode */
+	if (is_sampling_event(event))
+		return -EOPNOTSUPP;
+
+	/* no branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
+	p = (struct papr_scm_priv *)nd_pmu->dev->driver_data;
+	if (!p)
+		return -EINVAL;
+
+	/* Invalid eventcode */
+	if (event->attr.config == 0 || event->attr.config > p->total_events)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int papr_scm_pmu_add(struct perf_event *event, int flags)
+{
+	u64 count;
+	int rc;
+	struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
+
+	if (!nd_pmu)
+		return -EINVAL;
+
+	if (flags & PERF_EF_START) {
+		rc = papr_scm_pmu_get_value(event, nd_pmu->dev, &count);
+		if (rc)
+			return rc;
+
+		local64_set(&event->hw.prev_count, count);
+	}
+
+	return 0;
+}
+
+static void papr_scm_pmu_read(struct perf_event *event)
+{
+	u64 prev, now;
+	int rc;
+	struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
+
+	if (!nd_pmu)
+		return;
+
+	rc = papr_scm_pmu_get_value(event, nd_pmu->dev, &now);
+	if (rc)
+		return;
+
+	prev = local64_xchg(&event->hw.prev_count, now);
+	local64_add(now - prev, &event->count);
+}
+
+static void papr_scm_pmu_del(struct perf_event *event, int flags)
+{
+	papr_scm_pmu_read(event);
+}
+
+static ssize_t device_show_string(struct device *dev, struct device_attribute *attr,
+				  char *buf)
+{
+	struct perf_pmu_events_attr *d;
+
+	d = container_of(attr, struct perf_pmu_events_attr, attr);
+
+	return sysfs_emit(buf, "%s\n", (char *)d->event_str);
+}
+
+static char *strtolower(char *updated_name)
+{
+	int i = 0;
+
+	while (updated_name[i]) {
+		if (isupper(updated_name[i]))
+			updated_name[i] = tolower(updated_name[i]);
+		i++;
+	}
+	updated_name[i] = '\0';
+	return strim(updated_name);
+}
+
+/* device_str_attr_create : Populate event "name" and string "str" in attribute */
+static struct attribute *device_str_attr_create_(char *name, char *str)
+{
+	struct perf_pmu_events_attr *attr;
+
+	attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+
+	if (!attr)
+		return NULL;
+
+	sysfs_attr_init(&attr->attr.attr);
+	attr->event_str = str;
+	attr->attr.attr.name = strtolower(name);
+	attr->attr.attr.mode = 0444;
+	attr->attr.show = device_show_string;
+
+	return &attr->attr.attr;
+}
+
+static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct nvdimm_pmu *nd_pmu)
+{
+	struct papr_scm_perf_stat *stat;
+	struct papr_scm_perf_stats *stats, *single_stats;
+	int index, size, rc, count;
+	u32 available_events;
+	struct attribute **events;
+	char *eventcode, *eventname, *statid;
+	struct attribute_group *nvdimm_pmu_events_group;
+
+	if (!p->stat_buffer_len)
+		return -ENOENT;
+
+	available_events = (p->stat_buffer_len  - sizeof(struct papr_scm_perf_stats))
+			/ sizeof(struct papr_scm_perf_stat);
+
+	/* Allocate memory for events attribute group */
+	nvdimm_pmu_events_group = kzalloc(sizeof(*nvdimm_pmu_events_group), GFP_KERNEL);
+	if (!nvdimm_pmu_events_group)
+		return -ENOMEM;
+
+	/* Allocate the buffer for phyp where stats are written */
+	stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
+	if (!stats) {
+		rc = -ENOMEM;
+		goto out_nvdimm_pmu_events_group;
+	}
+
+	/* Allocate memory to nvdimm_event_map */
+	p->nvdimm_events_map = kcalloc(available_events, sizeof(char *), GFP_KERNEL);
+	if (!p->nvdimm_events_map) {
+		rc = -ENOMEM;
+		goto out_stats;
+	}
+
+	/* Called to get list of events supported */
+	rc = drc_pmem_query_stats(p, stats, 0);
+	if (rc)
+		goto out_nvdimm_events_map;
+
+	/* Allocate buffer to hold single performance stat */
+	size = sizeof(struct papr_scm_perf_stats) + sizeof(struct papr_scm_perf_stat);
+
+	single_stats = kzalloc(size, GFP_KERNEL);
+	if (!single_stats) {
+		rc = -ENOMEM;
+		goto out_nvdimm_events_map;
+	}
+
+	events = kzalloc(available_events * sizeof(struct attribute *), GFP_KERNEL);
+	if (!events) {
+		rc = -ENOMEM;
+		goto out_single_stats;
+	}
+
+	for (index = 0, stat = stats->scm_statistic, count = 0;
+		     index < available_events; index++, ++stat) {
+
+		single_stats->scm_statistic[0] = *stat;
+		rc = drc_pmem_query_stats(p, single_stats, 1);
+
+		if (rc < 0) {
+			pr_info("Event not supported %s for device %s\n",
+				stat->stat_id, nvdimm_name(p->nvdimm));
+		} else {
+			eventcode = kasprintf(GFP_KERNEL, "event=0x%x", count + 1);
+			eventname = kzalloc(strlen(stat->stat_id) + 1, GFP_KERNEL);
+			statid = kzalloc(strlen(stat->stat_id) + 1, GFP_KERNEL);
+
+			if (!eventname || !statid || !eventcode)
+				goto out;
+
+			strcpy(eventname, stat->stat_id);
+			events[count] = device_str_attr_create_(eventname,
+								eventcode);
+			if (!events[count])
+				goto out;
+
+			strcpy(statid, stat->stat_id);
+			p->nvdimm_events_map[count] = statid;
+			count++;
+			continue;
+out:
+			kfree(eventcode);
+			kfree(eventname);
+			kfree(statid);
+		}
+	}
+
+	if (!count)
+		goto out_events;
+
+	events[count] = NULL;
+	p->nvdimm_events_map[count] = NULL;
+	p->total_events = count;
+
+	nvdimm_pmu_events_group->name = "events";
+	nvdimm_pmu_events_group->attrs = events;
+
+	/* Fill attribute groups for the nvdimm pmu device */
+	nd_pmu->attr_groups[NVDIMM_PMU_FORMAT_ATTR] = &nvdimm_pmu_format_group;
+	nd_pmu->attr_groups[NVDIMM_PMU_EVENT_ATTR] = nvdimm_pmu_events_group;
+	nd_pmu->attr_groups[NVDIMM_PMU_NULL_ATTR] = NULL;
+
+	kfree(single_stats);
+	kfree(stats);
+	return 0;
+
+out_events:
+	kfree(events);
+out_single_stats:
+	kfree(single_stats);
+out_nvdimm_events_map:
+	kfree(p->nvdimm_events_map);
+out_stats:
+	kfree(stats);
+out_nvdimm_pmu_events_group:
+	kfree(nvdimm_pmu_events_group);
+	return rc;
+}
+
+/* Function to free the attr_groups which are dynamically allocated */
+static void papr_scm_pmu_mem_free(struct nvdimm_pmu *nd_pmu)
+{
+	if (nd_pmu) {
+		if (nd_pmu->attr_groups[NVDIMM_PMU_EVENT_ATTR])
+			kfree(nd_pmu->attr_groups[NVDIMM_PMU_EVENT_ATTR]->attrs);
+		kfree(nd_pmu->attr_groups[NVDIMM_PMU_EVENT_ATTR]);
+	}
+}
+
+static void papr_scm_pmu_register(struct papr_scm_priv *p)
+{
+	struct nvdimm_pmu *nd_pmu;
+	int rc, nodeid;
+
+	nd_pmu = kzalloc(sizeof(*nd_pmu), GFP_KERNEL);
+	if (!nd_pmu) {
+		rc = -ENOMEM;
+		goto pmu_err_print;
+	}
+
+	rc = papr_scm_pmu_check_events(p, nd_pmu);
+	if (rc)
+		goto pmu_check_events_err;
+
+	nd_pmu->name = nvdimm_name(p->nvdimm);
+	nd_pmu->event_init = papr_scm_pmu_event_init;
+	nd_pmu->read = papr_scm_pmu_read;
+	nd_pmu->add = papr_scm_pmu_add;
+	nd_pmu->del = papr_scm_pmu_del;
+
+	/*updating the cpumask variable */
+	nodeid = dev_to_node(&p->pdev->dev);
+	nd_pmu->arch_cpumask = *cpumask_of_node(nodeid);
+
+	/* cpumask should not be NULL */
+	WARN_ON_ONCE(cpumask_empty(&nd_pmu->arch_cpumask));
+
+	rc = register_nvdimm_pmu(nd_pmu, p->pdev);
+	if (rc)
+		goto pmu_register_err;
+
+	/*
+	 * Set archdata.priv value to nvdimm_pmu structure, to handle the
+	 * unregistering of pmu device.
+	 */
+	p->pdev->archdata.priv = nd_pmu;
+	return;
+
+pmu_register_err:
+	papr_scm_pmu_mem_free(nd_pmu);
+	kfree(p->nvdimm_events_map);
+pmu_check_events_err:
+	kfree(nd_pmu);
+pmu_err_print:
+	dev_info(&p->pdev->dev, "nvdimm pmu didn't register rc=%d\n", rc);
+}
+
+static void papr_scm_pmu_uninit(struct nvdimm_pmu *nd_pmu)
+{
+	unregister_nvdimm_pmu(nd_pmu);
+	papr_scm_pmu_mem_free(nd_pmu);
+	kfree(nd_pmu);
+}
+
 /*
  * Issue hcall to retrieve dimm health info and populate papr_scm_priv with the
  * health information.
@@ -1177,6 +1535,7 @@ static int papr_scm_probe(struct platform_device *pdev)
 		goto err2;
 
 	platform_set_drvdata(pdev, p);
+	papr_scm_pmu_register(p);
 
 	return 0;
 
@@ -1195,6 +1554,12 @@ static int papr_scm_remove(struct platform_device *pdev)
 
 	nvdimm_bus_unregister(p->bus);
 	drc_pmem_unbind(p);
+
+	if (pdev->archdata.priv)
+		papr_scm_pmu_uninit(pdev->archdata.priv);
+
+	pdev->archdata.priv = NULL;
+	kfree(p->nvdimm_events_map);
 	kfree(p->bus_desc.provider_name);
 	kfree(p);
 
-- 
2.27.0


^ permalink raw reply related

* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries
From: Christophe Leroy @ 2021-06-14  5:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Matthew Wilcox
  Cc: Linus Torvalds, Nick Piggin, Linux-MM, Kalesh Singh,
	Joel Fernandes, Kirill A . Shutemov, Andrew Morton, linuxppc-dev
In-Reply-To: <dd2f0d9d-da78-46a4-78cb-c5cfb7df7f16@linux.ibm.com>



Le 13/06/2021 à 13:13, Aneesh Kumar K.V a écrit :
> On 6/13/21 4:20 PM, Matthew Wilcox wrote:
>> On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote:
>>> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
>>> is arch dependent. On some architecture it is pte_t * and on the other
>>> struct page *. The reason being highmem and level 4 page table can
>>> be located in highmem.
>>
>> That is ahistorical.  See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 --
>> we have pgtable_t for the benefit of s390's crazy sub-page page table
>> sizes.
> 
> That is also true with ppc64. We do sub-page page table size. I was trying to explain why it can't 
> be pte_t * everywhere and why we have
> it as struct page *.

ppc32 as well. On the 8xx, with 16k size pages, the HW still use 4k page tables, so we do use sub-pages.

In order too keep the code simple, we have converted all powerpc to sub-pages for that, allthough 
some powerpc platforms have only one sub-page per page.

Christophe

^ permalink raw reply

* Re: [PATCH] powerpc/signal64: Don't read sigaction arguments back from user memory
From: Christophe Leroy @ 2021-06-14  5:30 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev, Michael Ellerman; +Cc: cmr
In-Reply-To: <1623633444.p3rmbd7eti.astroid@bobo.none>



Le 14/06/2021 à 03:32, Nicholas Piggin a écrit :
> Excerpts from Michael Ellerman's message of June 10, 2021 5:29 pm:
>> When delivering a signal to a sigaction style handler (SA_SIGINFO), we
>> pass pointers to the siginfo and ucontext via r4 and r5.
>>
>> Currently we populate the values in those registers by reading the
>> pointers out of the sigframe in user memory, even though the values in
>> user memory were written by the kernel just prior:
>>
>>    unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
>>    unsafe_put_user(&frame->uc, &frame->puc, badframe_block);
>>    ...
>>    if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
>>    	err |= get_user(regs->gpr[4], (unsigned long __user *)&frame->pinfo);
>>    	err |= get_user(regs->gpr[5], (unsigned long __user *)&frame->puc);
>>
>> ie. we write &frame->info into frame->pinfo, and then read frame->pinfo
>> back into r4, and similarly for &frame->uc.
>>
>> The code has always been like this, since linux-fullhistory commit
>> d4f2d95eca2c ("Forward port of 2.4 ppc64 signal changes.").
>>
>> There's no reason for us to read the values back from user memory,
>> rather than just setting the value in the gpr[4/5] directly. In fact
>> reading the value back from user memory opens up the possibility of
>> another user thread changing the values before we read them back.
>> Although any process doing that would be racing against the kernel
>> delivering the signal, and would risk corrupting the stack, so that
>> would be a userspace bug.
>>
>> Note that this is 64-bit only code, so there's no subtlety with the size
>> of pointers differing between kernel and user. Also the frame variable
>> is not modified to point elsewhere during the function.
>>
>> In the past reading the values back from user memory was not costly, but
>> now that we have KUAP on some CPUs it is, so we'd rather avoid it for
>> that reason too.
>>
>> So change the code to just set the values directly, using the same
>> values we have written to the sigframe previously in the function.
>>
>> Note also that this matches what our 32-bit signal code does.
>>
>> Using a version of will-it-scale's signal1_threads that sets SA_SIGINFO,
>> this results in a ~4% increase in signals per second on a Power9, from
>> 229,777 to 239,766.
> 
> Good find, nice improvement. Will make it possible to make the error
> handling much nicer too I think.
> 
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> 
> You've moved copy_siginfo_to_user right up to the user access unlock,
> could save 2 more KUAP lock/unlocks if we had an unsafe_clear_user. If
> we can move the other user access stuff up as well, the stack frame
> put_user could use unsafe_put_user as well, saving 1 more. Another few
> percent?

I'm looking at making an 'unsafe' version of copy_siginfo_to_user().
That's straight forward for 'native' signals, but for compat signals that's more tricky.

> 
> 
>>
>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>> ---
>>   arch/powerpc/kernel/signal_64.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>> index dca66481d0c2..f58e7a98d0df 100644
>> --- a/arch/powerpc/kernel/signal_64.c
>> +++ b/arch/powerpc/kernel/signal_64.c
>> @@ -948,8 +948,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>   	regs->gpr[3] = ksig->sig;
>>   	regs->result = 0;
>>   	if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
>> -		err |= get_user(regs->gpr[4], (unsigned long __user *)&frame->pinfo);
>> -		err |= get_user(regs->gpr[5], (unsigned long __user *)&frame->puc);
>> +		regs->gpr[4] = (unsigned long)&frame->info;
>> +		regs->gpr[5] = (unsigned long)&frame->uc;
>>   		regs->gpr[6] = (unsigned long) frame;
>>   	} else {
>>   		regs->gpr[4] = (unsigned long)&frame->uc.uc_mcontext;
>> -- 
>> 2.25.1
>>
>>

^ permalink raw reply

* Re: [PATCH] powerpc/signal64: Copy siginfo before changing regs->nip
From: Christophe Leroy @ 2021-06-14  5:31 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev, Michael Ellerman; +Cc: cmr, dja
In-Reply-To: <1623633868.lnyugcilh9.astroid@bobo.none>



Le 14/06/2021 à 03:29, Nicholas Piggin a écrit :
> Excerpts from Nicholas Piggin's message of June 14, 2021 10:47 am:
>> Excerpts from Michael Ellerman's message of June 8, 2021 11:46 pm:
>>> In commit 96d7a4e06fab ("powerpc/signal64: Rewrite handle_rt_signal64()
>>> to minimise uaccess switches") the 64-bit signal code was rearranged to
>>> use user_write_access_begin/end().
>>>
>>> As part of that change the call to copy_siginfo_to_user() was moved
>>> later in the function, so that it could be done after the
>>> user_write_access_end().
>>>
>>> In particular it was moved after we modify regs->nip to point to the
>>> signal trampoline. That means if copy_siginfo_to_user() fails we exit
>>> handle_rt_signal64() with an error but with regs->nip modified, whereas
>>> previously we would not modify regs->nip until the copy succeeded.
>>>
>>> Returning an error from signal delivery but with regs->nip updated
>>> leaves the process in a sort of half-delivered state. We do immediately
>>> force a SEGV in signal_setup_done(), called from do_signal(), so the
>>> process should never run in the half-delivered state.
>>>
>>> However that SEGV is not delivered until we've gone around to
>>> do_notify_resume() again, so it's possible some tracing could observe
>>> the half-delivered state.
>>>
>>> There are other cases where we fail signal delivery with regs partly
>>> updated, eg. the write to newsp and SA_SIGINFO, but the latter at least
>>> is very unlikely to fail as it reads back from the frame we just wrote
>>> to.
>>>
>>> Looking at other arches they seem to be more careful about leaving regs
>>> unchanged until the copy operations have succeeded, and in general that
>>> seems like good hygenie.
>>>
>>> So although the current behaviour is not cleary buggy, it's also not
>>> clearly correct. So move the call to copy_siginfo_to_user() up prior to
>>> the modification of regs->nip, which is closer to the old behaviour, and
>>> easier to reason about.
>>
>> Good catch, should it still have a Fixes: tag though? Even if it's not
>> clearly buggy we want it to be patched.
> 
> Also...
> 
>>>
>>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>>> ---
>>>   arch/powerpc/kernel/signal_64.c | 9 ++++-----
>>>   1 file changed, 4 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>>> index dca66481d0c2..f9e1f5428b9e 100644
>>> --- a/arch/powerpc/kernel/signal_64.c
>>> +++ b/arch/powerpc/kernel/signal_64.c
>>> @@ -902,6 +902,10 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>>   	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
>>>   	user_write_access_end();
>>>   
>>> +	/* Save the siginfo outside of the unsafe block. */
>>> +	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>>> +		goto badframe;
>>> +
>>>   	/* Make sure signal handler doesn't get spurious FP exceptions */
>>>   	tsk->thread.fp_state.fpscr = 0;
>>>   
>>> @@ -915,11 +919,6 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>>   		regs->nip = (unsigned long) &frame->tramp[0];
>>>   	}
>>>   
>>> -
>>> -	/* Save the siginfo outside of the unsafe block. */
>>> -	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>>> -		goto badframe;
>>> -
>>>   	/* Allocate a dummy caller frame for the signal handler. */
>>>   	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
>>>   	err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
> 
> Does the same reasoning apply to this one and the ELF V1 function
> descriptor thing? It seems like you could move all of that block
> up instead. With your other SA_SIGINFO get_user patch, there would
> then be no possibility of error after you start modifying regs.
> 

To move the above in the user access block, we need to open a larger window. At the time being the 
window opened only contains the 'frame'. 'newsp' points before the 'frame'.

^ permalink raw reply

* Re: [PATCH] powerpc/signal64: Don't read sigaction arguments back from user memory
From: Nicholas Piggin @ 2021-06-14  5:49 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev, Michael Ellerman; +Cc: cmr
In-Reply-To: <c677eab1-0ecd-8630-89c0-6fcc35788356@csgroup.eu>

Excerpts from Christophe Leroy's message of June 14, 2021 3:30 pm:
> 
> 
> Le 14/06/2021 à 03:32, Nicholas Piggin a écrit :
>> Excerpts from Michael Ellerman's message of June 10, 2021 5:29 pm:
>>> When delivering a signal to a sigaction style handler (SA_SIGINFO), we
>>> pass pointers to the siginfo and ucontext via r4 and r5.
>>>
>>> Currently we populate the values in those registers by reading the
>>> pointers out of the sigframe in user memory, even though the values in
>>> user memory were written by the kernel just prior:
>>>
>>>    unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
>>>    unsafe_put_user(&frame->uc, &frame->puc, badframe_block);
>>>    ...
>>>    if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
>>>    	err |= get_user(regs->gpr[4], (unsigned long __user *)&frame->pinfo);
>>>    	err |= get_user(regs->gpr[5], (unsigned long __user *)&frame->puc);
>>>
>>> ie. we write &frame->info into frame->pinfo, and then read frame->pinfo
>>> back into r4, and similarly for &frame->uc.
>>>
>>> The code has always been like this, since linux-fullhistory commit
>>> d4f2d95eca2c ("Forward port of 2.4 ppc64 signal changes.").
>>>
>>> There's no reason for us to read the values back from user memory,
>>> rather than just setting the value in the gpr[4/5] directly. In fact
>>> reading the value back from user memory opens up the possibility of
>>> another user thread changing the values before we read them back.
>>> Although any process doing that would be racing against the kernel
>>> delivering the signal, and would risk corrupting the stack, so that
>>> would be a userspace bug.
>>>
>>> Note that this is 64-bit only code, so there's no subtlety with the size
>>> of pointers differing between kernel and user. Also the frame variable
>>> is not modified to point elsewhere during the function.
>>>
>>> In the past reading the values back from user memory was not costly, but
>>> now that we have KUAP on some CPUs it is, so we'd rather avoid it for
>>> that reason too.
>>>
>>> So change the code to just set the values directly, using the same
>>> values we have written to the sigframe previously in the function.
>>>
>>> Note also that this matches what our 32-bit signal code does.
>>>
>>> Using a version of will-it-scale's signal1_threads that sets SA_SIGINFO,
>>> this results in a ~4% increase in signals per second on a Power9, from
>>> 229,777 to 239,766.
>> 
>> Good find, nice improvement. Will make it possible to make the error
>> handling much nicer too I think.
>> 
>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>> 
>> You've moved copy_siginfo_to_user right up to the user access unlock,
>> could save 2 more KUAP lock/unlocks if we had an unsafe_clear_user. If
>> we can move the other user access stuff up as well, the stack frame
>> put_user could use unsafe_put_user as well, saving 1 more. Another few
>> percent?
> 
> I'm looking at making an 'unsafe' version of copy_siginfo_to_user().
> That's straight forward for 'native' signals, but for compat signals that's more tricky.

Ah nice. Native is most important at the moment.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] powerpc/signal64: Copy siginfo before changing regs->nip
From: Nicholas Piggin @ 2021-06-14  5:55 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev, Michael Ellerman; +Cc: dja, cmr
In-Reply-To: <759f715b-df1f-847d-d836-2d202c8a7dc4@csgroup.eu>

Excerpts from Christophe Leroy's message of June 14, 2021 3:31 pm:
> 
> 
> Le 14/06/2021 à 03:29, Nicholas Piggin a écrit :
>> Excerpts from Nicholas Piggin's message of June 14, 2021 10:47 am:
>>> Excerpts from Michael Ellerman's message of June 8, 2021 11:46 pm:
>>>> In commit 96d7a4e06fab ("powerpc/signal64: Rewrite handle_rt_signal64()
>>>> to minimise uaccess switches") the 64-bit signal code was rearranged to
>>>> use user_write_access_begin/end().
>>>>
>>>> As part of that change the call to copy_siginfo_to_user() was moved
>>>> later in the function, so that it could be done after the
>>>> user_write_access_end().
>>>>
>>>> In particular it was moved after we modify regs->nip to point to the
>>>> signal trampoline. That means if copy_siginfo_to_user() fails we exit
>>>> handle_rt_signal64() with an error but with regs->nip modified, whereas
>>>> previously we would not modify regs->nip until the copy succeeded.
>>>>
>>>> Returning an error from signal delivery but with regs->nip updated
>>>> leaves the process in a sort of half-delivered state. We do immediately
>>>> force a SEGV in signal_setup_done(), called from do_signal(), so the
>>>> process should never run in the half-delivered state.
>>>>
>>>> However that SEGV is not delivered until we've gone around to
>>>> do_notify_resume() again, so it's possible some tracing could observe
>>>> the half-delivered state.
>>>>
>>>> There are other cases where we fail signal delivery with regs partly
>>>> updated, eg. the write to newsp and SA_SIGINFO, but the latter at least
>>>> is very unlikely to fail as it reads back from the frame we just wrote
>>>> to.
>>>>
>>>> Looking at other arches they seem to be more careful about leaving regs
>>>> unchanged until the copy operations have succeeded, and in general that
>>>> seems like good hygenie.
>>>>
>>>> So although the current behaviour is not cleary buggy, it's also not
>>>> clearly correct. So move the call to copy_siginfo_to_user() up prior to
>>>> the modification of regs->nip, which is closer to the old behaviour, and
>>>> easier to reason about.
>>>
>>> Good catch, should it still have a Fixes: tag though? Even if it's not
>>> clearly buggy we want it to be patched.
>> 
>> Also...
>> 
>>>>
>>>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>>>> ---
>>>>   arch/powerpc/kernel/signal_64.c | 9 ++++-----
>>>>   1 file changed, 4 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>>>> index dca66481d0c2..f9e1f5428b9e 100644
>>>> --- a/arch/powerpc/kernel/signal_64.c
>>>> +++ b/arch/powerpc/kernel/signal_64.c
>>>> @@ -902,6 +902,10 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>>>   	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
>>>>   	user_write_access_end();
>>>>   
>>>> +	/* Save the siginfo outside of the unsafe block. */
>>>> +	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>>>> +		goto badframe;
>>>> +
>>>>   	/* Make sure signal handler doesn't get spurious FP exceptions */
>>>>   	tsk->thread.fp_state.fpscr = 0;
>>>>   
>>>> @@ -915,11 +919,6 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>>>   		regs->nip = (unsigned long) &frame->tramp[0];
>>>>   	}
>>>>   
>>>> -
>>>> -	/* Save the siginfo outside of the unsafe block. */
>>>> -	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>>>> -		goto badframe;
>>>> -
>>>>   	/* Allocate a dummy caller frame for the signal handler. */
>>>>   	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
>>>>   	err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
>> 
>> Does the same reasoning apply to this one and the ELF V1 function
>> descriptor thing? It seems like you could move all of that block
>> up instead. With your other SA_SIGINFO get_user patch, there would
>> then be no possibility of error after you start modifying regs.
>> 
> 
> To move the above in the user access block, we need to open a larger window. At the time being the 
> window opened only contains the 'frame'. 'newsp' points before the 'frame'.
> 

Only by 64/128 bytes though. Is that a problem? Not for 64s. Could it 
cause more overhead than it saves on other platforms?

For protection, it looks like all the important control data is in the 
signal frame anyway, this frame is just for stack unwinding?

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v9 01/14] swiotlb: Refactor swiotlb init functions
From: Christoph Hellwig @ 2021-06-14  6:16 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-2-tientzu@chromium.org>

On Fri, Jun 11, 2021 at 11:26:46PM +0800, Claire Chang wrote:
> +	spin_lock_init(&mem->lock);
> +	for (i = 0; i < mem->nslabs; i++) {
> +		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
> +		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
> +		mem->slots[i].alloc_size = 0;
> +	}
> +
> +	if (memory_decrypted)
> +		set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
> +	memset(vaddr, 0, bytes);

We don't really need to do this call before the memset.  Which means we
can just move it to the callers that care instead of having a bool
argument.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v9 02/14] swiotlb: Refactor swiotlb_create_debugfs
From: Christoph Hellwig @ 2021-06-14  6:17 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-3-tientzu@chromium.org>

On Fri, Jun 11, 2021 at 11:26:47PM +0800, Claire Chang wrote:
> Split the debugfs creation to make the code reusable for supporting
> different bounce buffer pools, e.g. restricted DMA pool.
> 
> Signed-off-by: Claire Chang <tientzu@chromium.org>
> ---
>  kernel/dma/swiotlb.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 1a1208c81e85..8a3e2b3b246d 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -64,6 +64,9 @@
>  enum swiotlb_force swiotlb_force;
>  
>  struct io_tlb_mem *io_tlb_default_mem;
> +#ifdef CONFIG_DEBUG_FS
> +static struct dentry *debugfs_dir;
> +#endif

What about moving this declaration into the main CONFIG_DEBUG_FS block
near the functions using it?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v9 03/14] swiotlb: Set dev->dma_io_tlb_mem to the swiotlb pool used
From: Christoph Hellwig @ 2021-06-14  6:20 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, Jianxiong Gao, Daniel Vetter,
	Will Deacon, Konrad Rzeszutek Wilk, maarten.lankhorst, airlied,
	Dan Williams, linuxppc-dev, jani.nikula, Rob Herring,
	rodrigo.vivi, Bjorn Helgaas, boris.ostrovsky, Andy Shevchenko,
	jgross, Nicolas Boichat, Greg KH, Randy Dunlap, lkml, Tomasz Figa,
	list@263.net:IOMMU DRIVERS, Jim Quinlan, xypron.glpk,
	Robin Murphy, bauerman
In-Reply-To: <CALiNf2_nzP=qLg5Fqvn3kiaMiaR9r+QJhE3pqypW4FPrgo23DQ@mail.gmail.com>

On Fri, Jun 11, 2021 at 11:33:15PM +0800, Claire Chang wrote:
> I'm not sure if this would break arch/x86/pci/sta2x11-fixup.c
> swiotlb_late_init_with_default_size is called here
> https://elixir.bootlin.com/linux/v5.13-rc5/source/arch/x86/pci/sta2x11-fixup.c#L60

It will.  It will also break all non-OF devices.  I think you need to
initialize the initial pool in device_initialize, which covers all devices.

^ permalink raw reply

* Re: [PATCH v9 04/14] swiotlb: Add restricted DMA pool initialization
From: Christoph Hellwig @ 2021-06-14  6:21 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-5-tientzu@chromium.org>

On Fri, Jun 11, 2021 at 11:26:49PM +0800, Claire Chang wrote:
> Add the initialization function to create restricted DMA pools from
> matching reserved-memory nodes.

Bisection hazard:  we should only add the new config option when the
code is actually read to be used.  So this patch should move to the end
of the series.

^ permalink raw reply

* Re: [PATCH v9 05/14] swiotlb: Update is_swiotlb_buffer to add a struct device argument
From: Christoph Hellwig @ 2021-06-14  6:21 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-6-tientzu@chromium.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v9 06/14] swiotlb: Update is_swiotlb_active to add a struct device argument
From: Christoph Hellwig @ 2021-06-14  6:23 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-7-tientzu@chromium.org>

>  kernel/dma/direct.c                          | 2 +-
>  kernel/dma/swiotlb.c                         | 4 ++--
>  6 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
> index ce6b664b10aa..89a894354263 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
> @@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj)
>  
>  	max_order = MAX_ORDER;
>  #ifdef CONFIG_SWIOTLB
> -	if (is_swiotlb_active()) {
> +	if (is_swiotlb_active(obj->base.dev->dev)) {

This is the same device used for DMA mapping in
i915_gem_gtt_prepare_pages, so this looks good.

> index f4c2e46b6fe1..2ca9d9a9e5d5 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
> @@ -276,7 +276,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
>  	}
>  
>  #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
> -	need_swiotlb = is_swiotlb_active();
> +	need_swiotlb = is_swiotlb_active(dev->dev);
>  #endif

This looks good, too.

> diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
> index b7a8f3a1921f..0d56985bfe81 100644
> --- a/drivers/pci/xen-pcifront.c
> +++ b/drivers/pci/xen-pcifront.c
> @@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct pcifront_device *pdev)
>  
>  	spin_unlock(&pcifront_dev_lock);
>  
> -	if (!err && !is_swiotlb_active()) {
> +	if (!err && !is_swiotlb_active(&pdev->xdev->dev)) {

This looks good as well.

So I think the devices are all good.

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v9 07/14] swiotlb: Bounce data from/to restricted DMA pool if available
From: Christoph Hellwig @ 2021-06-14  6:25 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-8-tientzu@chromium.org>

On Fri, Jun 11, 2021 at 11:26:52PM +0800, Claire Chang wrote:
> Regardless of swiotlb setting, the restricted DMA pool is preferred if
> available.
> 
> The restricted DMA pools provide a basic level of protection against the
> DMA overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system
> needs to provide a way to lock down the memory access, e.g., MPU.
> 
> Note that is_dev_swiotlb_force doesn't check if
> swiotlb_force == SWIOTLB_FORCE. Otherwise the memory allocation behavior
> with default swiotlb will be changed by the following patche
> ("dma-direct: Allocate memory from restricted DMA pool if available").
> 
> Signed-off-by: Claire Chang <tientzu@chromium.org>
> ---
>  include/linux/swiotlb.h | 10 +++++++++-
>  kernel/dma/direct.c     |  3 ++-
>  kernel/dma/direct.h     |  3 ++-
>  kernel/dma/swiotlb.c    |  1 +
>  4 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 06cf17a80f5c..8200c100fe10 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -85,6 +85,7 @@ extern enum swiotlb_force swiotlb_force;
>   *		unmap calls.
>   * @debugfs:	The dentry to debugfs.
>   * @late_alloc:	%true if allocated using the page allocator
> + * @force_swiotlb: %true if swiotlb is forced
>   */
>  struct io_tlb_mem {
>  	phys_addr_t start;
> @@ -95,6 +96,7 @@ struct io_tlb_mem {
>  	spinlock_t lock;
>  	struct dentry *debugfs;
>  	bool late_alloc;
> +	bool force_swiotlb;
>  	struct io_tlb_slot {
>  		phys_addr_t orig_addr;
>  		size_t alloc_size;
> @@ -115,6 +117,11 @@ static inline void swiotlb_set_io_tlb_default_mem(struct device *dev)
>  	dev->dma_io_tlb_mem = io_tlb_default_mem;
>  }
>  
> +static inline bool is_dev_swiotlb_force(struct device *dev)
> +{
> +	return dev->dma_io_tlb_mem->force_swiotlb;
> +}
> +
>  void __init swiotlb_exit(void);
>  unsigned int swiotlb_max_segment(void);
>  size_t swiotlb_max_mapping_size(struct device *dev);
> @@ -126,8 +133,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
>  {
>  	return false;
>  }
> -static inline void swiotlb_set_io_tlb_default_mem(struct device *dev)
> +static inline bool is_dev_swiotlb_force(struct device *dev)
>  {
> +	return false;
>  }
>  static inline void swiotlb_exit(void)
>  {
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 7a88c34d0867..078f7087e466 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -496,7 +496,8 @@ size_t dma_direct_max_mapping_size(struct device *dev)
>  {
>  	/* If SWIOTLB is active, use its maximum mapping size */
>  	if (is_swiotlb_active(dev) &&
> -	    (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
> +	    (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE ||
> +	     is_dev_swiotlb_force(dev)))

I think we can remove the extra swiotlb_force check here if the
swiotlb_force setting is propagated into io_tlb_default_mem->force
when that is initialized. This avoids an extra check in the fast path.

> -	if (unlikely(swiotlb_force == SWIOTLB_FORCE))
> +	if (unlikely(swiotlb_force == SWIOTLB_FORCE) ||
> +	    is_dev_swiotlb_force(dev))

Same here.

^ permalink raw reply

* Re: [PATCH v9 08/14] swiotlb: Move alloc_size to find_slots
From: Christoph Hellwig @ 2021-06-14  6:25 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-9-tientzu@chromium.org>

On Fri, Jun 11, 2021 at 11:26:53PM +0800, Claire Chang wrote:
> Move the maintenance of alloc_size to find_slots for better code
> reusability later.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v9 09/14] swiotlb: Refactor swiotlb_tbl_unmap_single
From: Christoph Hellwig @ 2021-06-14  6:26 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-10-tientzu@chromium.org>

On Fri, Jun 11, 2021 at 11:26:54PM +0800, Claire Chang wrote:
> Add a new function, release_slots, to make the code reusable for supporting
> different bounce buffer pools, e.g. restricted DMA pool.
> 
> Signed-off-by: Claire Chang <tientzu@chromium.org>
> ---
>  kernel/dma/swiotlb.c | 35 ++++++++++++++++++++---------------
>  1 file changed, 20 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 364c6c822063..a6562573f090 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -554,27 +554,15 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>  	return tlb_addr;
>  }
>  
> -/*
> - * tlb_addr is the physical address of the bounce buffer to unmap.
> - */
> -void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
> -			      size_t mapping_size, enum dma_data_direction dir,
> -			      unsigned long attrs)
> +static void release_slots(struct device *dev, phys_addr_t tlb_addr)
>  {
> -	struct io_tlb_mem *mem = hwdev->dma_io_tlb_mem;
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>  	unsigned long flags;
> -	unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr);
> +	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
>  	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
>  	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
>  	int count, i;
>  
> -	/*
> -	 * First, sync the memory before unmapping the entry
> -	 */
> -	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
> -	    (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
> -		swiotlb_bounce(hwdev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
> -
>  	/*
>  	 * Return the buffer to the free list by setting the corresponding
>  	 * entries to indicate the number of contiguous entries available.
> @@ -609,6 +597,23 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
>  	spin_unlock_irqrestore(&mem->lock, flags);
>  }
>  
> +/*
> + * tlb_addr is the physical address of the bounce buffer to unmap.
> + */
> +void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
> +			      size_t mapping_size, enum dma_data_direction dir,
> +			      unsigned long attrs)
> +{
> +	/*
> +	 * First, sync the memory before unmapping the entry
> +	 */
> +	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
> +	    (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
> +		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
> +
> +	release_slots(dev, tlb_addr);

Can you give this a swiotlb_ prefix?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v9 11/14] swiotlb: Add restricted DMA alloc/free support.
From: Christoph Hellwig @ 2021-06-14  6:28 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210611152659.2142983-12-tientzu@chromium.org>

I think merging this with the next two patches would be a little more
clear.

^ permalink raw reply

* Re: [PATCH v9 11/14] swiotlb: Add restricted DMA alloc/free support.
From: Christoph Hellwig @ 2021-06-14  6:28 UTC (permalink / raw)
  To: Claire Chang
  Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
	matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
	Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
	linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
	Greg KH, Randy Dunlap, lkml, tfiga, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <20210614062801.GJ28343@lst.de>

On Mon, Jun 14, 2021 at 08:28:01AM +0200, Christoph Hellwig wrote:
> I think merging this with the next two patches would be a little more
> clear.

Sorry, I mean the next patch and the previous one.

^ permalink raw reply

* Re: [PATCH] powerpc/signal64: Copy siginfo before changing regs->nip
From: Christophe Leroy @ 2021-06-14  7:22 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev, Michael Ellerman; +Cc: dja, cmr
In-Reply-To: <1623649799.9s42wcnyya.astroid@bobo.none>



Le 14/06/2021 à 07:55, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of June 14, 2021 3:31 pm:
>>
>>
>> Le 14/06/2021 à 03:29, Nicholas Piggin a écrit :
>>> Excerpts from Nicholas Piggin's message of June 14, 2021 10:47 am:
>>>> Excerpts from Michael Ellerman's message of June 8, 2021 11:46 pm:
>>>>> In commit 96d7a4e06fab ("powerpc/signal64: Rewrite handle_rt_signal64()
>>>>> to minimise uaccess switches") the 64-bit signal code was rearranged to
>>>>> use user_write_access_begin/end().
>>>>>
>>>>> As part of that change the call to copy_siginfo_to_user() was moved
>>>>> later in the function, so that it could be done after the
>>>>> user_write_access_end().
>>>>>
>>>>> In particular it was moved after we modify regs->nip to point to the
>>>>> signal trampoline. That means if copy_siginfo_to_user() fails we exit
>>>>> handle_rt_signal64() with an error but with regs->nip modified, whereas
>>>>> previously we would not modify regs->nip until the copy succeeded.
>>>>>
>>>>> Returning an error from signal delivery but with regs->nip updated
>>>>> leaves the process in a sort of half-delivered state. We do immediately
>>>>> force a SEGV in signal_setup_done(), called from do_signal(), so the
>>>>> process should never run in the half-delivered state.
>>>>>
>>>>> However that SEGV is not delivered until we've gone around to
>>>>> do_notify_resume() again, so it's possible some tracing could observe
>>>>> the half-delivered state.
>>>>>
>>>>> There are other cases where we fail signal delivery with regs partly
>>>>> updated, eg. the write to newsp and SA_SIGINFO, but the latter at least
>>>>> is very unlikely to fail as it reads back from the frame we just wrote
>>>>> to.
>>>>>
>>>>> Looking at other arches they seem to be more careful about leaving regs
>>>>> unchanged until the copy operations have succeeded, and in general that
>>>>> seems like good hygenie.
>>>>>
>>>>> So although the current behaviour is not cleary buggy, it's also not
>>>>> clearly correct. So move the call to copy_siginfo_to_user() up prior to
>>>>> the modification of regs->nip, which is closer to the old behaviour, and
>>>>> easier to reason about.
>>>>
>>>> Good catch, should it still have a Fixes: tag though? Even if it's not
>>>> clearly buggy we want it to be patched.
>>>
>>> Also...
>>>
>>>>>
>>>>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>>>>> ---
>>>>>    arch/powerpc/kernel/signal_64.c | 9 ++++-----
>>>>>    1 file changed, 4 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>>>>> index dca66481d0c2..f9e1f5428b9e 100644
>>>>> --- a/arch/powerpc/kernel/signal_64.c
>>>>> +++ b/arch/powerpc/kernel/signal_64.c
>>>>> @@ -902,6 +902,10 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>>>>    	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
>>>>>    	user_write_access_end();
>>>>>    
>>>>> +	/* Save the siginfo outside of the unsafe block. */
>>>>> +	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>>>>> +		goto badframe;
>>>>> +
>>>>>    	/* Make sure signal handler doesn't get spurious FP exceptions */
>>>>>    	tsk->thread.fp_state.fpscr = 0;
>>>>>    
>>>>> @@ -915,11 +919,6 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>>>>    		regs->nip = (unsigned long) &frame->tramp[0];
>>>>>    	}
>>>>>    
>>>>> -
>>>>> -	/* Save the siginfo outside of the unsafe block. */
>>>>> -	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>>>>> -		goto badframe;
>>>>> -
>>>>>    	/* Allocate a dummy caller frame for the signal handler. */
>>>>>    	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
>>>>>    	err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
>>>
>>> Does the same reasoning apply to this one and the ELF V1 function
>>> descriptor thing? It seems like you could move all of that block
>>> up instead. With your other SA_SIGINFO get_user patch, there would
>>> then be no possibility of error after you start modifying regs.
>>>
>>
>> To move the above in the user access block, we need to open a larger window. At the time being the
>> window opened only contains the 'frame'. 'newsp' points before the 'frame'.
>>
> 
> Only by 64/128 bytes though. Is that a problem? Not for 64s. Could it
> cause more overhead than it saves on other platforms?

No it is not a problem at all, just need to not be forgotten, on ppc64 it may go unnoticed, on 32s 
it will blew up if we forget to enlarge the access window and the access involves a different 256M 
segment (Very unlikely for sure but ...)

> 
> For protection, it looks like all the important control data is in the
> signal frame anyway, this frame is just for stack unwinding?

That's my understanding as well.

Christophe

^ permalink raw reply

* Re: [PATCH 00/21] Rid W=1 warnings from IDE
From: Lee Jones @ 2021-06-14  9:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-ide, Mike Waychison, Paul Mackerras, Christopher J. Reimer,
	Tim Hockin, Erik Andersen, Alan Cox, CJ, Sergei Shtylyov,
	Duncan Laurie, Scott Snyder, Gadi Oxman, Andre Hedrick,
	Christian Brunner, Jens Axboe, or, Benoit Poulot-Cazajous,
	Robert Bringman, linux-kernel, Clear Zhang, Mark Lord, Adrian Sun,
	Frank Tiernan, linuxppc-dev, David S. Miller
In-Reply-To: <YL3YMGl9kmtv55B/@infradead.org>

On Mon, 07 Jun 2021, Christoph Hellwig wrote:

> Please don't touch this code as it is about to be removed entirely.

Do you have an ETA for this work?

-- 
Lee Jones [李琼斯]
Senior Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox