LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 5/5] powerpc/pseries: Add support for FORM2 associativity
From: Aneesh Kumar K.V @ 2021-08-12 13:22 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza,
	David Gibson
In-Reply-To: <20210812132223.225214-1-aneesh.kumar@linux.ibm.com>

PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/powerpc/associativity.rst   | 104 ++++++++++++
 arch/powerpc/include/asm/firmware.h       |   3 +-
 arch/powerpc/include/asm/prom.h           |   1 +
 arch/powerpc/kernel/prom_init.c           |   3 +-
 arch/powerpc/mm/numa.c                    | 187 ++++++++++++++++++----
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 262 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index 000000000000..07e7dd3d6c87
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,104 @@
+============================
+NUMA resource associativity
+=============================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the oldest format and is now considered deprecated.
+
+Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-----
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-----
+With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
+device tree properties are used to determine the NUMA distance between resource groups/domains.
+
+The “ibm,associativity” property contains a list of one or more numbers (domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains a list of one or more numbers
+(domainID index) that represents the 1 based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+Linux kernel computes NUMA distance between two domains by recursively comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+-------
+Form 2 associativity format adds separate device tree properties representing NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With numa distance computation now detached from the index value in
+"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
+ids at the same domainID index representing resource groups of different performance/latency
+characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is
+used as an index while computing numa distance information via "ibm,numa-distance-table".
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainID encoded as with encode-int
+
+For ex:
+"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
+computing the distance of domain 8 from other domains present in the system. For the rest of
+this document, this offset will be referred to as domain distance offset.
+
+"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
+The number N must be equal to the square of m where m is the number of domainIDs in the
+numa-lookup-index-table.
+
+For ex:
+ibm,numa-lookup-index-table = <3 0 8 40>;
+ibm,numa-distace-table = <9>, /bits/ 8 < 10  20  80
+					 20  10 160
+					 80 160  10>;
+  | 0    8   40
+--|------------
+  |
+0 | 10   20  80
+  |
+8 | 20   10  160
+  |
+40| 80   160  10
+
+A possible "ibm,associativity" property for resources in node 0, 8 and 40
+
+{ 3, 6, 7, 0 }
+{ 3, 6, 9, 8 }
+{ 3, 6, 7, 40}
+
+With "ibm,associativity-reference-points"  { 0x3 }
+
+"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
+Since domainID can be sparse, the matrix of distances can also be effectively sparse.
+With "ibm,lookup-index-table" we can achieve a compact representation of
+distance information.
diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 60b631161360..97a3bd9ffeb9 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -53,6 +53,7 @@
 #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
 #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
 #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
+#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -73,7 +74,7 @@ enum {
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
 		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
-		FW_FEATURE_RPT_INVALIDATE,
+		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
 	FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index df9fec9d232c..5c80152e8f18 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_XCMO		0x0440	/* Page Coalescing */
 #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
+#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
 #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 57db605ad33a..95a42d49e291 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1096,7 +1096,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
 		0,
 #endif
-		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
+		OV5_FEAT(OV5_FORM2_AFFINITY),
 		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 		.micro_checkpoint = 0,
 		.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index fdb2472befa4..32acef4da845 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
 
 #define FORM0_AFFINITY 0
 #define FORM1_AFFINITY 1
+#define FORM2_AFFINITY 2
 static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int distance_ref_points_depth;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
+static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
+	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
+};
+static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
 
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
@@ -166,6 +171,54 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
+static int __associativity_to_nid(const __be32 *associativity,
+				  int max_array_sz)
+{
+	int nid;
+	/*
+	 * primary_domain_index is 1 based array index.
+	 */
+	int index = primary_domain_index  - 1;
+
+	if (!numa_enabled || index >= max_array_sz)
+		return NUMA_NO_NODE;
+
+	nid = of_read_number(&associativity[index], 1);
+
+	/* POWER4 LPAR uses 0xffff as invalid node */
+	if (nid == 0xffff || nid >= nr_node_ids)
+		nid = NUMA_NO_NODE;
+	return nid;
+}
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static int associativity_to_nid(const __be32 *associativity)
+{
+	int array_sz = of_read_number(associativity, 1);
+
+	/* Skip the first element in the associativity array */
+	return __associativity_to_nid((associativity + 1), array_sz);
+}
+
+static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+	int dist;
+	int node1, node2;
+
+	node1 = associativity_to_nid(cpu1_assoc);
+	node2 = associativity_to_nid(cpu2_assoc);
+
+	dist = numa_distance_table[node1][node2];
+	if (dist <= LOCAL_DISTANCE)
+		return 0;
+	else if (dist <= REMOTE_DISTANCE)
+		return 1;
+	else
+		return 2;
+}
+
 static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist = 0;
@@ -186,8 +239,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	/* We should not get called with FORM0 */
 	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
-
-	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+	if (affinity_form == FORM1_AFFINITY)
+		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
 }
 
 /* must hold reference to node during call */
@@ -201,7 +255,9 @@ int __node_distance(int a, int b)
 	int i;
 	int distance = LOCAL_DISTANCE;
 
-	if (affinity_form == FORM0_AFFINITY)
+	if (affinity_form == FORM2_AFFINITY)
+		return numa_distance_table[a][b];
+	else if (affinity_form == FORM0_AFFINITY)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < distance_ref_points_depth; i++) {
@@ -216,37 +272,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static int __associativity_to_nid(const __be32 *associativity,
-				  int max_array_sz)
-{
-	int nid;
-	/*
-	 * primary_domain_index is 1 based array index.
-	 */
-	int index = primary_domain_index  - 1;
-
-	if (!numa_enabled || index >= max_array_sz)
-		return NUMA_NO_NODE;
-
-	nid = of_read_number(&associativity[index], 1);
-
-	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= nr_node_ids)
-		nid = NUMA_NO_NODE;
-	return nid;
-}
-/*
- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
-{
-	int array_sz = of_read_number(associativity, 1);
-
-	/* Skip the first element in the associativity array */
-	return __associativity_to_nid((associativity + 1), array_sz);
-}
-
 /* Returns the nid associated with the given device tree node,
  * or -1 if not found.
  */
@@ -320,6 +345,8 @@ static void initialize_form1_numa_distance(const __be32 *associativity)
  */
 void update_numa_distance(struct device_node *node)
 {
+	int nid;
+
 	if (affinity_form == FORM0_AFFINITY)
 		return;
 	else if (affinity_form == FORM1_AFFINITY) {
@@ -332,6 +359,84 @@ void update_numa_distance(struct device_node *node)
 		initialize_form1_numa_distance(associativity);
 		return;
 	}
+
+	/* FORM2 affinity  */
+	nid = of_node_to_nid_single(node);
+	if (nid == NUMA_NO_NODE)
+		return;
+
+	/*
+	 * With FORM2 we expect NUMA distance of all possible NUMA
+	 * nodes to be provided during boot.
+	 */
+	WARN(numa_distance_table[nid][nid] == -1,
+	     "NUMA distance details for node %d not provided\n", nid);
+}
+
+/*
+ * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
+ * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
+ */
+static void initialize_form2_numa_distance_lookup_table(void)
+{
+	int i, j;
+	struct device_node *root;
+	const __u8 *numa_dist_table;
+	const __be32 *numa_lookup_index;
+	int numa_dist_table_length;
+	int max_numa_index, distance_index;
+
+	if (firmware_has_feature(FW_FEATURE_OPAL))
+		root = of_find_node_by_path("/ibm,opal");
+	else
+		root = of_find_node_by_path("/rtas");
+	if (!root)
+		root = of_find_node_by_path("/");
+
+	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
+	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
+
+	/* first element of the array is the size and is encode-int */
+	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
+	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
+	/* Skip the size which is encoded int */
+	numa_dist_table += sizeof(__be32);
+
+	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
+		 numa_dist_table_length, max_numa_index);
+
+	for (i = 0; i < max_numa_index; i++)
+		/* +1 skip the max_numa_index in the property */
+		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
+
+
+	if (numa_dist_table_length != max_numa_index * max_numa_index) {
+		WARN(1, "Wrong NUMA distance information\n");
+		/* consider everybody else just remote. */
+		for (i = 0;  i < max_numa_index; i++) {
+			for (j = 0; j < max_numa_index; j++) {
+				int nodeA = numa_id_index_table[i];
+				int nodeB = numa_id_index_table[j];
+
+				if (nodeA == nodeB)
+					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
+				else
+					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
+			}
+		}
+	}
+
+	distance_index = 0;
+	for (i = 0;  i < max_numa_index; i++) {
+		for (j = 0; j < max_numa_index; j++) {
+			int nodeA = numa_id_index_table[i];
+			int nodeB = numa_id_index_table[j];
+
+			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
+			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
+		}
+	}
+	of_node_put(root);
 }
 
 static int __init find_primary_domain_index(void)
@@ -344,6 +449,9 @@ static int __init find_primary_domain_index(void)
 	 */
 	if (firmware_has_feature(FW_FEATURE_OPAL)) {
 		affinity_form = FORM1_AFFINITY;
+	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
+		dbg("Using form 2 affinity\n");
+		affinity_form = FORM2_AFFINITY;
 	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
 		dbg("Using form 1 affinity\n");
 		affinity_form = FORM1_AFFINITY;
@@ -388,9 +496,12 @@ static int __init find_primary_domain_index(void)
 
 		index = of_read_number(&distance_ref_points[1], 1);
 	} else {
+		/*
+		 * Both FORM1 and FORM2 affinity find the primary domain details
+		 * at the same offset.
+		 */
 		index = of_read_number(distance_ref_points, 1);
 	}
-
 	/*
 	 * Warn and cap if the hardware supports more than
 	 * MAX_DISTANCE_REF_POINTS domains.
@@ -819,6 +930,12 @@ static int __init parse_numa_properties(void)
 
 	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
+	/*
+	 * If it is FORM2 initialize the distance table here.
+	 */
+	if (affinity_form == FORM2_AFFINITY)
+		initialize_form2_numa_distance_lookup_table();
+
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
 	 * init, we need to know the node ids now. This is because
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 5d4c2bc20bba..f162156b7b68 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
 	{FW_FEATURE_PRRN,		OV5_PRRN},
 	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
 	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
+	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
 };
 
 static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
-- 
2.31.1


^ permalink raw reply related

* [PATCH v8 4/5] powerpc/pseries: Add a helper for form1 cpu distance
From: Aneesh Kumar K.V @ 2021-08-12 13:22 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza,
	David Gibson
In-Reply-To: <20210812132223.225214-1-aneesh.kumar@linux.ibm.com>

This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining we don't expect
the code to be called with FORM0

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h   |  4 ++--
 arch/powerpc/mm/numa.c                | 10 +++++++++-
 arch/powerpc/platforms/pseries/lpar.c |  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index a6425a70c37b..a4712ecad3e9 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 				 cpu_all_mask :				\
 				 cpumask_of_node(pcibus_to_node(bus)))
 
-extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
@@ -84,7 +84,7 @@ static inline void sysfs_remove_device_from_node(struct device *dev,
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
-static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	return 0;
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c0a89839310c..fdb2472befa4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 	return dist;
 }
 
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+	/* We should not get called with FORM0 */
+	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index dab356e3ff87..afefbdfe768d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu)
 	if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
 		return -EIO;
 
-	return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
+	return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
 }
 
 static int cpu_home_node_dispatch_distance(int disp_cpu)
@@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
 	if (!disp_cpu_assoc || !vcpu_assoc)
 		return -EIO;
 
-	return cpu_distance(disp_cpu_assoc, vcpu_assoc);
+	return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
 }
 
 static void update_vcpu_disp_stat(int disp_cpu)
-- 
2.31.1


^ permalink raw reply related

* [PATCH v8 3/5] powerpc/pseries: Consolidate different NUMA distance update code paths
From: Aneesh Kumar K.V @ 2021-08-12 13:22 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza,
	David Gibson
In-Reply-To: <20210812132223.225214-1-aneesh.kumar@linux.ibm.com>

The associativity details of the newly added resourced are collected from
the hypervisor via "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call.

Instead of updating NUMA distance every time we lookup a node id
from the associativity property, add helpers that can be used
during boot which does this only once. Also remove the distance
update from node id lookup helpers.

Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity array provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and numa distance details.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h           |   2 +
 arch/powerpc/mm/numa.c                        | 212 +++++++++++++-----
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c        |   2 +
 4 files changed, 161 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..a6425a70c37b 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -64,6 +64,7 @@ static inline int early_cpu_to_node(int cpu)
 }
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
+void update_numa_distance(struct device_node *node);
 
 #else
 
@@ -93,6 +94,7 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 	return first_online_node;
 }
 
+static inline void update_numa_distance(struct device_node *node) {}
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 368719b14dcc..c0a89839310c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,50 +208,35 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-		const __be32 *associativity)
+static int __associativity_to_nid(const __be32 *associativity,
+				  int max_array_sz)
 {
-	int i;
+	int nid;
+	/*
+	 * primary_domain_index is 1 based array index.
+	 */
+	int index = primary_domain_index  - 1;
 
-	if (affinity_form != FORM1_AFFINITY)
-		return;
+	if (!numa_enabled || index >= max_array_sz)
+		return NUMA_NO_NODE;
 
-	for (i = 0; i < distance_ref_points_depth; i++) {
-		const __be32 *entry;
+	nid = of_read_number(&associativity[index], 1);
 
-		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-		distance_lookup_table[nid][i] = of_read_number(entry, 1);
-	}
+	/* POWER4 LPAR uses 0xffff as invalid node */
+	if (nid == 0xffff || nid >= nr_node_ids)
+		nid = NUMA_NO_NODE;
+	return nid;
 }
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
  */
 static int associativity_to_nid(const __be32 *associativity)
 {
-	int nid = NUMA_NO_NODE;
-
-	if (!numa_enabled)
-		goto out;
-
-	if (of_read_number(associativity, 1) >= primary_domain_index)
-		nid = of_read_number(&associativity[primary_domain_index], 1);
-
-	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= nr_node_ids)
-		nid = NUMA_NO_NODE;
-
-	if (nid > 0 &&
-		of_read_number(associativity, 1) >= distance_ref_points_depth) {
-		/*
-		 * Skip the length field and send start of associativity array
-		 */
-		initialize_distance_lookup_table(nid, associativity + 1);
-	}
+	int array_sz = of_read_number(associativity, 1);
 
-out:
-	return nid;
+	/* Skip the first element in the associativity array */
+	return __associativity_to_nid((associativity + 1), array_sz);
 }
 
 /* Returns the nid associated with the given device tree node,
@@ -287,6 +272,60 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity,
+					     int max_array_sz)
+{
+	int i, nid;
+
+	if (affinity_form != FORM1_AFFINITY)
+		return;
+
+	nid = __associativity_to_nid(associativity, max_array_sz);
+	if (nid != NUMA_NO_NODE) {
+		for (i = 0; i < distance_ref_points_depth; i++) {
+			const __be32 *entry;
+			int index = be32_to_cpu(distance_ref_points[i]) - 1;
+
+			/*
+			 * broken hierarchy, return with broken distance table
+			 */
+			if (WARN(index >= max_array_sz, "Broken ibm,associativity property"))
+				return;
+
+			entry = &associativity[index];
+			distance_lookup_table[nid][i] = of_read_number(entry, 1);
+		}
+	}
+}
+
+static void initialize_form1_numa_distance(const __be32 *associativity)
+{
+	int array_sz;
+
+	array_sz = of_read_number(associativity, 1);
+	/* Skip the first element in the associativity array */
+	__initialize_form1_numa_distance(associativity + 1, array_sz);
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+	if (affinity_form == FORM0_AFFINITY)
+		return;
+	else if (affinity_form == FORM1_AFFINITY) {
+		const __be32 *associativity;
+
+		associativity = of_get_associativity(node);
+		if (!associativity)
+			return;
+
+		initialize_form1_numa_distance(associativity);
+		return;
+	}
+}
+
 static int __init find_primary_domain_index(void)
 {
 	int index;
@@ -433,6 +472,38 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
 	return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+	struct assoc_arrays aa = { .arrays = NULL };
+	int default_nid = NUMA_NO_NODE;
+	int nid = default_nid;
+	int rc, index;
+
+	if ((primary_domain_index < 0) || !numa_enabled)
+		return default_nid;
+
+	rc = of_get_assoc_arrays(&aa);
+	if (rc)
+		return default_nid;
+
+	if (primary_domain_index <= aa.array_sz &&
+	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
+		const __be32 *associativity;
+
+		index = lmb->aa_index * aa.array_sz;
+		associativity = &aa.arrays[index];
+		nid = __associativity_to_nid(associativity, aa.array_sz);
+		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
+			/*
+			 * lookup array associativity entries have
+			 * no length of the array as the first element.
+			 */
+			__initialize_form1_numa_distance(associativity, aa.array_sz);
+		}
+	}
+	return nid;
+}
+
 /*
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
@@ -453,26 +524,19 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 
 	if (primary_domain_index <= aa.array_sz &&
 	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
-		nid = of_read_number(&aa.arrays[index], 1);
-
-		if (nid == 0xffff || nid >= nr_node_ids)
-			nid = default_nid;
+		const __be32 *associativity;
 
-		if (nid > 0) {
-			index = lmb->aa_index * aa.array_sz;
-			initialize_distance_lookup_table(nid,
-							&aa.arrays[index]);
-		}
+		index = lmb->aa_index * aa.array_sz;
+		associativity = &aa.arrays[index];
+		nid = __associativity_to_nid(associativity, aa.array_sz);
 	}
-
 	return nid;
 }
 
 #ifdef CONFIG_PPC_SPLPAR
-static int vphn_get_nid(long lcpu)
+
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
 {
-	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
 	long rc, hwid;
 
 	/*
@@ -492,12 +556,30 @@ static int vphn_get_nid(long lcpu)
 
 		rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
 		if (rc == H_SUCCESS)
-			return associativity_to_nid(associativity);
+			return 0;
 	}
 
+	return -1;
+}
+
+static int vphn_get_nid(long lcpu)
+{
+	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+
+
+	if (!__vphn_get_associativity(lcpu, associativity))
+		return associativity_to_nid(associativity);
+
 	return NUMA_NO_NODE;
+
 }
 #else
+
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
+{
+	return -1;
+}
+
 static int vphn_get_nid(long unused)
 {
 	return NUMA_NO_NODE;
@@ -692,7 +774,7 @@ static int __init numa_setup_drmem_lmb(struct drmem_lmb *lmb,
 			size = read_n_cells(n_mem_size_cells, usm);
 		}
 
-		nid = of_drconf_to_nid_single(lmb);
+		nid = get_nid_and_numa_distance(lmb);
 		fake_numa_create_new_node(((base + size) >> PAGE_SHIFT),
 					  &nid);
 		node_set_online(nid);
@@ -709,6 +791,7 @@ static int __init parse_numa_properties(void)
 	struct device_node *memory;
 	int default_nid = 0;
 	unsigned long i;
+	const __be32 *associativity;
 
 	if (numa_enabled == 0) {
 		printk(KERN_WARNING "NUMA disabled by user\n");
@@ -734,18 +817,30 @@ static int __init parse_numa_properties(void)
 	 * each node to be onlined must have NODE_DATA etc backing it.
 	 */
 	for_each_present_cpu(i) {
+		__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
 		struct device_node *cpu;
-		int nid = vphn_get_nid(i);
+		int nid = NUMA_NO_NODE;
 
-		/*
-		 * Don't fall back to default_nid yet -- we will plug
-		 * cpus into nodes once the memory scan has discovered
-		 * the topology.
-		 */
-		if (nid == NUMA_NO_NODE) {
+		memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));
+
+		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
+			nid = associativity_to_nid(vphn_assoc);
+			initialize_form1_numa_distance(vphn_assoc);
+		} else {
+
+			/*
+			 * Don't fall back to default_nid yet -- we will plug
+			 * cpus into nodes once the memory scan has discovered
+			 * the topology.
+			 */
 			cpu = of_get_cpu_node(i, NULL);
 			BUG_ON(!cpu);
-			nid = of_node_to_nid_single(cpu);
+
+			associativity = of_get_associativity(cpu);
+			if (associativity) {
+				nid = associativity_to_nid(associativity);
+				initialize_form1_numa_distance(associativity);
+			}
 			of_node_put(cpu);
 		}
 
@@ -781,8 +876,11 @@ static int __init parse_numa_properties(void)
 		 * have associativity properties.  If none, then
 		 * everything goes to default_nid.
 		 */
-		nid = of_node_to_nid_single(memory);
-		if (nid < 0)
+		associativity = of_get_associativity(memory);
+		if (associativity) {
+			nid = associativity_to_nid(associativity);
+			initialize_form1_numa_distance(associativity);
+		} else
 			nid = default_nid;
 
 		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..778b6ab35f0d 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
 		return saved_rc;
 	}
 
+	update_numa_distance(dn);
+
 	rc = dlpar_online_cpu(dn);
 	if (rc) {
 		saved_rc = rc;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 377d852f5a9a..ee1d81d7e54a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb *lmb)
 		return -ENODEV;
 	}
 
+	update_numa_distance(lmb_node);
+
 	dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
 	if (!dr_node) {
 		dlpar_free_cc_nodes(lmb_node);
-- 
2.31.1


^ permalink raw reply related

* [PATCH v8 2/5] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
From: Aneesh Kumar K.V @ 2021-08-12 13:22 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza,
	David Gibson
In-Reply-To: <20210812132223.225214-1-aneesh.kumar@linux.ibm.com>

Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/firmware.h       |  4 +--
 arch/powerpc/include/asm/prom.h           |  2 +-
 arch/powerpc/kernel/prom_init.c           |  2 +-
 arch/powerpc/mm/numa.c                    | 35 ++++++++++++++---------
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPAL		ASM_CONST(0x0000000010000000)
 #define FW_FEATURE_SET_MODE	ASM_CONST(0x0000000040000000)
 #define FW_FEATURE_BEST_ENERGY	ASM_CONST(0x0000000080000000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0000000100000000)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0000000100000000)
 #define FW_FEATURE_PRRN		ASM_CONST(0x0000000200000000)
 #define FW_FEATURE_DRMEM_V2	ASM_CONST(0x0000000400000000)
 #define FW_FEATURE_DRC_INFO	ASM_CONST(0x0000000800000000)
@@ -69,7 +69,7 @@ enum {
 		FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
 		FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
 		FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+		FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
 		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI			0x0201	/* PCIe/MSI support */
 #define OV5_CMO			0x0480	/* Cooperative Memory Overcommitment */
 #define OV5_XCMO		0x0440	/* Page Coalescing */
-#define OV5_TYPE1_AFFINITY	0x0580	/* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
 #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index a5bf355ce1d6..57db605ad33a 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1096,7 +1096,7 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
 		0,
 #endif
-		.associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
 		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 		.micro_checkpoint = 0,
 		.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..368719b14dcc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int distance_ref_points_depth;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
 	int i;
 	int distance = LOCAL_DISTANCE;
 
-	if (!form1_affinity)
+	if (affinity_form == FORM0_AFFINITY)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < distance_ref_points_depth; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
 	int i;
 
-	if (!form1_affinity)
+	if (affinity_form != FORM1_AFFINITY)
 		return;
 
 	for (i = 0; i < distance_ref_points_depth; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
 	int index;
 	struct device_node *root;
 
+	/*
+	 * Check for which form of affinity.
+	 */
+	if (firmware_has_feature(FW_FEATURE_OPAL)) {
+		affinity_form = FORM1_AFFINITY;
+	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+		dbg("Using form 1 affinity\n");
+		affinity_form = FORM1_AFFINITY;
+	} else
+		affinity_form = FORM0_AFFINITY;
+
 	if (firmware_has_feature(FW_FEATURE_OPAL))
 		root = of_find_node_by_path("/ibm,opal");
 	else
@@ -318,23 +332,16 @@ static int __init find_primary_domain_index(void)
 	}
 
 	distance_ref_points_depth /= sizeof(int);
-
-	if (firmware_has_feature(FW_FEATURE_OPAL) ||
-	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
-		dbg("Using form 1 affinity\n");
-		form1_affinity = 1;
-	}
-
-	if (form1_affinity) {
-		index = of_read_number(distance_ref_points, 1);
-	} else {
+	if (affinity_form == FORM0_AFFINITY) {
 		if (distance_ref_points_depth < 2) {
 			printk(KERN_WARNING "NUMA: "
-				"short ibm,associativity-reference-points\n");
+			       "short ibm,associativity-reference-points\n");
 			goto err;
 		}
 
 		index = of_read_number(&distance_ref_points[1], 1);
+	} else {
+		index = of_read_number(distance_ref_points, 1);
 	}
 
 	/*
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 4c7b7f5a2ebc..5d4c2bc20bba 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -119,7 +119,7 @@ struct vec5_fw_feature {
 
 static __initdata struct vec5_fw_feature
 vec5_fw_features_table[] = {
-	{FW_FEATURE_TYPE1_AFFINITY,	OV5_TYPE1_AFFINITY},
+	{FW_FEATURE_FORM1_AFFINITY,	OV5_FORM1_AFFINITY},
 	{FW_FEATURE_PRRN,		OV5_PRRN},
 	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
 	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
-- 
2.31.1


^ permalink raw reply related

* [PATCH v8 1/5] powerpc/pseries: rename min_common_depth to primary_domain_index
From: Aneesh Kumar K.V @ 2021-08-12 13:22 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza,
	David Gibson
In-Reply-To: <20210812132223.225214-1-aneesh.kumar@linux.ibm.com>

No functional change in this patch.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
 	if (!numa_enabled)
 		goto out;
 
-	if (of_read_number(associativity, 1) >= min_common_depth)
-		nid = of_read_number(&associativity[min_common_depth], 1);
+	if (of_read_number(associativity, 1) >= primary_domain_index)
+		nid = of_read_number(&associativity[primary_domain_index], 1);
 
 	/* POWER4 LPAR uses 0xffff as invalid node */
 	if (nid == 0xffff || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-	int depth;
+	int index;
 	struct device_node *root;
 
 	if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
 	}
 
 	if (form1_affinity) {
-		depth = of_read_number(distance_ref_points, 1);
+		index = of_read_number(distance_ref_points, 1);
 	} else {
 		if (distance_ref_points_depth < 2) {
 			printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
 			goto err;
 		}
 
-		depth = of_read_number(&distance_ref_points[1], 1);
+		index = of_read_number(&distance_ref_points[1], 1);
 	}
 
 	/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
 	}
 
 	of_node_put(root);
-	return depth;
+	return index;
 
 err:
 	of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 	int nid = default_nid;
 	int rc, index;
 
-	if ((min_common_depth < 0) || !numa_enabled)
+	if ((primary_domain_index < 0) || !numa_enabled)
 		return default_nid;
 
 	rc = of_get_assoc_arrays(&aa);
 	if (rc)
 		return default_nid;
 
-	if (min_common_depth <= aa.array_sz &&
+	if (primary_domain_index <= aa.array_sz &&
 	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
 		nid = of_read_number(&aa.arrays[index], 1);
 
 		if (nid == 0xffff || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
 		return -1;
 	}
 
-	min_common_depth = find_min_common_depth();
+	primary_domain_index = find_primary_domain_index();
 
-	if (min_common_depth < 0) {
+	if (primary_domain_index < 0) {
 		/*
-		 * if we fail to parse min_common_depth from device tree
+		 * if we fail to parse primary_domain_index from device tree
 		 * mark the numa disabled, boot with numa disabled.
 		 */
 		numa_enabled = false;
-		return min_common_depth;
+		return primary_domain_index;
 	}
 
-	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
 			goto out;
 	}
 
-	max_nodes = of_read_number(&domains[min_common_depth], 1);
+	max_nodes = of_read_number(&domains[primary_domain_index], 1);
 	for (i = 0; i < max_nodes; i++) {
 		if (!node_possible(i))
 			node_set(i, node_possible_map);
 	}
 
 	prop_length /= sizeof(int);
-	if (prop_length > min_common_depth + 2)
+	if (prop_length > primary_domain_index + 2)
 		coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
 		goto out;
 
 	index = of_read_number(associativity, 1);
-	if (index > min_common_depth + 1)
+	if (index > primary_domain_index + 1)
 		return of_read_number(&associativity[index - 1], 1);
 
 out:
-- 
2.31.1


^ permalink raw reply related

* [PATCH v8 0/5] Add support for FORM2 associativity
From: Aneesh Kumar K.V @ 2021-08-12 13:22 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza,
	David Gibson

Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 7.

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 


PAPR SCM now use the numa distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":1071644672,
    "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
    "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
    "daxregion":{
      "id":0,
      "size":1071644672,
      "devices":[
        {
          "chardev":"dax0.0",
          "size":1071644672,
          "target_node":4,
          "mode":"devdax"
        }
      ]
    },
    "align":2097152,
    "numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 


The above output is with a Qemu command line

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}  \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb413@gmail.com/

Changes from v7:
* Address review feedback 
* fold patch 6 to patch 3

Changes from v6:
* Address review feedback 

Changes from v5:
* Fix build error reported by kernel test robot
* Address review feedback 

Changes from v4:
* Drop DLPAR related device tree property for now because both Qemu nor PowerVM
  will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (5):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity

 Documentation/powerpc/associativity.rst       | 104 +++++
 arch/powerpc/include/asm/firmware.h           |   7 +-
 arch/powerpc/include/asm/prom.h               |   3 +-
 arch/powerpc/include/asm/topology.h           |   6 +-
 arch/powerpc/kernel/prom_init.c               |   3 +-
 arch/powerpc/mm/numa.c                        | 432 ++++++++++++++----
 arch/powerpc/platforms/pseries/firmware.c     |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c        |   2 +
 arch/powerpc/platforms/pseries/lpar.c         |   4 +-
 10 files changed, 455 insertions(+), 111 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1


^ permalink raw reply

* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
From: Aneesh Kumar K.V @ 2021-08-12 13:20 UTC (permalink / raw)
  To: Michael Ellerman, Puvichakravarthy Ramachandran; +Cc: linuxppc-dev, npiggin
In-Reply-To: <87fsven7yv.fsf@mpe.ellerman.id.au>

On 8/12/21 6:19 PM, Michael Ellerman wrote:
> "Puvichakravarthy Ramachandran" <puvichakravarthy@in.ibm.com> writes:
>>> With shared mapping, even though we are unmapping a large range, the kernel
>>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>>> This results in the kernel issuing a high number of TLB flushes even for a large
>>> range. This can be improved by making sure the kernel switch to pid based flush if the
>>> kernel is unmapping a 2M range.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>   arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > b/arch/powerpc/mm/book3s64/radix_tlb.c
>>> index aefc100d79a7..21d0f098e43b 100644
>>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>>    * invalidating a full PID, so it has a far lower threshold to change > from
>>>    * individual page flushes to full-pid flushes.
>>>    */
>>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>>   static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = POWER9_TLB_SETS_RADIX * 2;
>>>
>>>   static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > mm_struct *mm,
>>>        if (fullmm)
>>>                flush_pid = true;
>>>        else if (type == FLUSH_TYPE_GLOBAL)
>>> -             flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>>> +             flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>>        else
>>>                flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>>
>> Additional details on the test environment. This was tested on a 2 Node/8
>> socket Power10 system.
>> The LPAR had 105 cores and the LPAR spanned across all the sockets.
>>
>> # perf stat -I 1000 -a -e cycles,instructions -e
>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>>   Rate of work: = 176
>> #           time             counts unit events
>>       1.029206442         4198594519      cycles
>>       1.029206442         2458254252      instructions              # 0.59 insn per cycle
>>       1.029206442         3004031488      PM_EXEC_STALL
>>       1.029206442         1798186036      PM_EXEC_STALL_TLBIE
>>   Rate of work: = 181
>>       2.054288539         4183883450      cycles
>>       2.054288539         2472178171      instructions              # 0.59 insn per cycle
>>       2.054288539         3014609313      PM_EXEC_STALL
>>       2.054288539         1797851642      PM_EXEC_STALL_TLBIE
>>   Rate of work: = 180
>>       3.078306883         4171250717      cycles
>>       3.078306883         2468341094      instructions              # 0.59 insn per cycle
>>       3.078306883         2993036205      PM_EXEC_STALL
>>       3.078306883         1798181890      PM_EXEC_STALL_TLBIE
>> .
>> .
>>
>> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>> 34
>>
>> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>
>> # perf stat -I 1000 -a -e cycles,instructions -e
>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>>   Rate of work: = 313
>> #           time             counts unit events
>>       1.030310506         4206071143      cycles
>>       1.030310506         4314716958      instructions              # 1.03 insn per cycle
>>       1.030310506         2157762167      PM_EXEC_STALL
>>       1.030310506          110825573      PM_EXEC_STALL_TLBIE
>>   Rate of work: = 322
>>       2.056034068         4331745630      cycles
>>       2.056034068         4531658304      instructions              # 1.05 insn per cycle
>>       2.056034068         2288971361      PM_EXEC_STALL
>>       2.056034068          111267927      PM_EXEC_STALL_TLBIE
>>   Rate of work: = 321
>>       3.081216434         4327050349      cycles
>>       3.081216434         4379679508      instructions              # 1.01 insn per cycle
>>       3.081216434         2252602550      PM_EXEC_STALL
>>       3.081216434          110974887      PM_EXEC_STALL_TLBIE
> 
> 
> What is the tlbie test actually doing?
> 
> Does it do anything to measure the cost of refilling after the full mm flush?
> 


That is essentially

for ()
{
   shmat()
   fillshm()
   shmdt()

}

for a 256MB range. So it is not really a fair benchmark because it 
doesn't take into account the impact of throwing away the full pid 
translation. But even then the TLBIE stalls is an important data point?

-aneesh



^ permalink raw reply

* Re: [PATCH v2 5/9] powerpc/microwatt: Use standard 16550 UART for console
From: Christophe Leroy @ 2021-08-12 13:14 UTC (permalink / raw)
  To: Paul Mackerras, linuxppc-dev
In-Reply-To: <YMwXGCTzedpQje7r@thinks.paulus.ozlabs.org>



Le 18/06/2021 à 05:46, Paul Mackerras a écrit :
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> 
> This adds support to the Microwatt platform to use the standard
> 16550-style UART which available in the standalone Microwatt FPGA.
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> ---
>   arch/powerpc/boot/dts/microwatt.dts      | 27 ++++++++++++----
>   arch/powerpc/kernel/udbg_16550.c         | 39 ++++++++++++++++++++++++
>   arch/powerpc/platforms/microwatt/Kconfig |  1 +
>   arch/powerpc/platforms/microwatt/setup.c |  2 ++
>   4 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/boot/dts/microwatt.dts b/arch/powerpc/boot/dts/microwatt.dts
> index 04e5dd92270e..974abbdda249 100644
> --- a/arch/powerpc/boot/dts/microwatt.dts
> +++ b/arch/powerpc/boot/dts/microwatt.dts
> @@ -6,6 +6,10 @@ / {
>   	model-name = "microwatt";
>   	compatible = "microwatt-soc";
>   
> +	aliases {
> +		serial0 = &UART0;
> +	};
> +
>   	reserved-memory {
>   		#size-cells = <0x02>;
>   		#address-cells = <0x02>;
> @@ -89,12 +93,6 @@ PowerPC,Microwatt@0 {
>   		};
>   	};
>   
> -	chosen {
> -		bootargs = "";
> -		ibm,architecture-vec-5 = [19 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00
> -					  00 00 00 00 00 00 00 00 40 00 40];
> -	};
> -
>   	soc@c0000000 {
>   		compatible = "simple-bus";
>   		#address-cells = <1>;
> @@ -119,5 +117,22 @@ ICS: interrupt-controller@5000 {
>   			#interrupt-cells = <2>;
>   		};
>   
> +		UART0: serial@2000 {
> +			device_type = "serial";
> +			compatible = "ns16550";
> +			reg = <0x2000 0x8>;
> +			clock-frequency = <100000000>;
> +			current-speed = <115200>;
> +			reg-shift = <2>;
> +			fifo-size = <16>;
> +			interrupts = <0x10 0x1>;
> +		};
> +	};
> +
> +	chosen {
> +		bootargs = "";
> +		ibm,architecture-vec-5 = [19 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00
> +					  00 00 00 00 00 00 00 00 40 00 40];
> +		stdout-path = &UART0;
>   	};
>   };
> diff --git a/arch/powerpc/kernel/udbg_16550.c b/arch/powerpc/kernel/udbg_16550.c
> index 9356b60d6030..8513aa49614e 100644
> --- a/arch/powerpc/kernel/udbg_16550.c
> +++ b/arch/powerpc/kernel/udbg_16550.c
> @@ -296,3 +296,42 @@ void __init udbg_init_40x_realmode(void)
>   }
>   
>   #endif /* CONFIG_PPC_EARLY_DEBUG_40x */
> +
> +#ifdef CONFIG_PPC_EARLY_DEBUG_MICROWATT
> +
> +#define UDBG_UART_MW_ADDR	((void __iomem *)0xc0002000)
> +
> +static u8 udbg_uart_in_isa300_rm(unsigned int reg)
> +{
> +	uint64_t msr = mfmsr();
> +	uint8_t  c;
> +
> +	mtmsr(msr & ~(MSR_EE|MSR_DR));
> +	isync();
> +	eieio();
> +	c = __raw_rm_readb(UDBG_UART_MW_ADDR + (reg << 2));
> +	mtmsr(msr);
> +	isync();
> +	return c;
> +}

How do you make sure that GCC won't emit any access to the stack between the two mtmsr() ?

What about using real_205_readb() and real_205_writeb() instead ?

> +
> +static void udbg_uart_out_isa300_rm(unsigned int reg, u8 val)
> +{
> +	uint64_t msr = mfmsr();
> +
> +	mtmsr(msr & ~(MSR_EE|MSR_DR));
> +	isync();
> +	eieio();
> +	__raw_rm_writeb(val, UDBG_UART_MW_ADDR + (reg << 2));
> +	mtmsr(msr);
> +	isync();
> +}
> +
> +void __init udbg_init_debug_microwatt(void)
> +{
> +	udbg_uart_in = udbg_uart_in_isa300_rm;
> +	udbg_uart_out = udbg_uart_out_isa300_rm;
> +	udbg_use_uart();
> +}
> +
> +#endif /* CONFIG_PPC_EARLY_DEBUG_MICROWATT */
> diff --git a/arch/powerpc/platforms/microwatt/Kconfig b/arch/powerpc/platforms/microwatt/Kconfig
> index b52c869c0eb8..50ed0cedb5f1 100644
> --- a/arch/powerpc/platforms/microwatt/Kconfig
> +++ b/arch/powerpc/platforms/microwatt/Kconfig
> @@ -6,6 +6,7 @@ config PPC_MICROWATT
>   	select PPC_ICS_NATIVE
>   	select PPC_ICP_NATIVE
>   	select PPC_NATIVE
> +	select PPC_UDBG_16550
>   	help
>             This option enables support for FPGA-based Microwatt implementations.
>   
> diff --git a/arch/powerpc/platforms/microwatt/setup.c b/arch/powerpc/platforms/microwatt/setup.c
> index 1c1b7791fa57..0b02603bdb74 100644
> --- a/arch/powerpc/platforms/microwatt/setup.c
> +++ b/arch/powerpc/platforms/microwatt/setup.c
> @@ -14,6 +14,7 @@
>   #include <asm/machdep.h>
>   #include <asm/time.h>
>   #include <asm/xics.h>
> +#include <asm/udbg.h>
>   
>   static void __init microwatt_init_IRQ(void)
>   {
> @@ -35,5 +36,6 @@ define_machine(microwatt) {
>   	.name			= "microwatt",
>   	.probe			= microwatt_probe,
>   	.init_IRQ		= microwatt_init_IRQ,
> +	.progress		= udbg_progress,
>   	.calibrate_decr		= generic_calibrate_decr,
>   };
> 

^ permalink raw reply

* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
From: Michael Ellerman @ 2021-08-12 12:49 UTC (permalink / raw)
  To: Puvichakravarthy Ramachandran, aneesh.kumar; +Cc: linuxppc-dev, npiggin
In-Reply-To: <OFAE67F802.E3873360-ON00258729.0020407B-65258729.002BAB12@ibm.com>

"Puvichakravarthy Ramachandran" <puvichakravarthy@in.ibm.com> writes:
>> With shared mapping, even though we are unmapping a large range, the kernel
>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>> This results in the kernel issuing a high number of TLB flushes even for a large
>> range. This can be improved by making sure the kernel switch to pid based flush if the
>> kernel is unmapping a 2M range.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..21d0f098e43b 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>   * invalidating a full PID, so it has a far lower threshold to change > from
>>   * individual page flushes to full-pid flushes.
>>   */
>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = POWER9_TLB_SETS_RADIX * 2;
>> 
>>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > mm_struct *mm,
>>       if (fullmm)
>>               flush_pid = true;
>>       else if (type == FLUSH_TYPE_GLOBAL)
>> -             flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>> +             flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>       else
>>               flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> Additional details on the test environment. This was tested on a 2 Node/8 
> socket Power10 system.
> The LPAR had 105 cores and the LPAR spanned across all the sockets. 
>
> # perf stat -I 1000 -a -e cycles,instructions -e 
> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e 
> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>  Rate of work: = 176 
> #           time             counts unit events
>      1.029206442         4198594519      cycles  
>      1.029206442         2458254252      instructions              # 0.59 insn per cycle 
>      1.029206442         3004031488      PM_EXEC_STALL   
>      1.029206442         1798186036      PM_EXEC_STALL_TLBIE   
>  Rate of work: = 181 
>      2.054288539         4183883450      cycles  
>      2.054288539         2472178171      instructions              # 0.59 insn per cycle 
>      2.054288539         3014609313      PM_EXEC_STALL   
>      2.054288539         1797851642      PM_EXEC_STALL_TLBIE   
>  Rate of work: = 180 
>      3.078306883         4171250717      cycles  
>      3.078306883         2468341094      instructions              # 0.59 insn per cycle 
>      3.078306883         2993036205      PM_EXEC_STALL   
>      3.078306883         1798181890      PM_EXEC_STALL_TLBIE   
> .
> . 
>
> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
> 34
>
> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>
> # perf stat -I 1000 -a -e cycles,instructions -e 
> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e 
> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>  Rate of work: = 313 
> #           time             counts unit events
>      1.030310506         4206071143      cycles  
>      1.030310506         4314716958      instructions              # 1.03 insn per cycle 
>      1.030310506         2157762167      PM_EXEC_STALL   
>      1.030310506          110825573      PM_EXEC_STALL_TLBIE   
>  Rate of work: = 322 
>      2.056034068         4331745630      cycles  
>      2.056034068         4531658304      instructions              # 1.05 insn per cycle 
>      2.056034068         2288971361      PM_EXEC_STALL   
>      2.056034068          111267927      PM_EXEC_STALL_TLBIE   
>  Rate of work: = 321 
>      3.081216434         4327050349      cycles  
>      3.081216434         4379679508      instructions              # 1.01 insn per cycle 
>      3.081216434         2252602550      PM_EXEC_STALL   
>      3.081216434          110974887      PM_EXEC_STALL_TLBIE   


What is the tlbie test actually doing?

Does it do anything to measure the cost of refilling after the full mm flush?

cheers

^ permalink raw reply

* Re: [PATCH -next] trap: Cleanup trap_init()
From: Russell King (Oracle) @ 2021-08-12 12:41 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: Paul Walmsley, James E . J . Bottomley, linux-mm, Paul Mackerras,
	linux-riscv, Anton Ivanov, Jonas Bonn, Yoshinori Sato,
	linux-hexagon, Helge Deller, Ley Foon Tan, Vineet Gupta,
	linux-snps-arc, uclinux-h8-devel, Jeff Dike, linux-um,
	Stefan Kristiansson, Richard Weinberger, openrisc, Stafford Horne,
	linux-arm-kernel, linux-parisc, linux-kernel, Andrew Morton,
	linuxppc-dev
In-Reply-To: <20210812123602.76356-1-wangkefeng.wang@huawei.com>

On Thu, Aug 12, 2021 at 08:36:02PM +0800, Kefeng Wang wrote:
> There are some empty trap_init() in different ARCHs, introduce
> a new weak trap_init() function to cleanup them.
> 
> Cc: Vineet Gupta <vgupta@kernel.org>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
> Cc: Ley Foon Tan <ley.foon.tan@intel.com>
> Cc: Jonas Bonn <jonas@southpole.se>
> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
> Cc: Stafford Horne <shorne@gmail.com>
> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Paul Walmsley <palmerdabbelt@google.com>
> Cc: Jeff Dike <jdike@addtoit.com>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>

For 32-bit arm:

Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Thanks!

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* [PATCH -next] trap: Cleanup trap_init()
From: Kefeng Wang @ 2021-08-12 12:36 UTC (permalink / raw)
  To: linux-snps-arc, linux-kernel, linux-arm-kernel, uclinux-h8-devel,
	linux-hexagon, openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-um, linux-mm
  Cc: Jonas Bonn, Kefeng Wang, Andrew Morton, Yoshinori Sato,
	Helge Deller, Paul Walmsley, Russell King, Ley Foon Tan,
	Stefan Kristiansson, James E . J . Bottomley, Richard Weinberger,
	Paul Mackerras, Vineet Gupta, Stafford Horne, Jeff Dike,
	Anton Ivanov

There are some empty trap_init() in different ARCHs, introduce
a new weak trap_init() function to cleanup them.

Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <palmerdabbelt@google.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 arch/arc/kernel/traps.c      | 5 -----
 arch/arm/kernel/traps.c      | 5 -----
 arch/h8300/kernel/traps.c    | 4 ----
 arch/hexagon/kernel/traps.c  | 4 ----
 arch/nds32/kernel/traps.c    | 5 -----
 arch/nios2/kernel/traps.c    | 5 -----
 arch/openrisc/kernel/traps.c | 5 -----
 arch/parisc/kernel/traps.c   | 4 ----
 arch/powerpc/kernel/traps.c  | 5 -----
 arch/riscv/kernel/traps.c    | 5 -----
 arch/um/kernel/trap.c        | 4 ----
 init/main.c                  | 2 ++
 12 files changed, 2 insertions(+), 51 deletions(-)

diff --git a/arch/arc/kernel/traps.c b/arch/arc/kernel/traps.c
index 57235e5c0cea..6b83e3f2b41c 100644
--- a/arch/arc/kernel/traps.c
+++ b/arch/arc/kernel/traps.c
@@ -20,11 +20,6 @@
 #include <asm/unaligned.h>
 #include <asm/kprobes.h>
 
-void __init trap_init(void)
-{
-	return;
-}
-
 void die(const char *str, struct pt_regs *regs, unsigned long address)
 {
 	show_kernel_fault_diag(str, regs, address);
diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index 64308e3a5d0c..e9b4f2b49bd8 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -781,11 +781,6 @@ void abort(void)
 	panic("Oops failed to kill thread");
 }
 
-void __init trap_init(void)
-{
-	return;
-}
-
 #ifdef CONFIG_KUSER_HELPERS
 static void __init kuser_init(void *vectors)
 {
diff --git a/arch/h8300/kernel/traps.c b/arch/h8300/kernel/traps.c
index 5d8b969cd8f3..bdbe988d8dbc 100644
--- a/arch/h8300/kernel/traps.c
+++ b/arch/h8300/kernel/traps.c
@@ -39,10 +39,6 @@ void __init base_trap_init(void)
 {
 }
 
-void __init trap_init(void)
-{
-}
-
 asmlinkage void set_esp0(unsigned long ssp)
 {
 	current->thread.esp0 = ssp;
diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c
index 904134b37232..edfc35dafeb1 100644
--- a/arch/hexagon/kernel/traps.c
+++ b/arch/hexagon/kernel/traps.c
@@ -28,10 +28,6 @@
 #define TRAP_SYSCALL	1
 #define TRAP_DEBUG	0xdb
 
-void __init trap_init(void)
-{
-}
-
 #ifdef CONFIG_GENERIC_BUG
 /* Maybe should resemble arch/sh/kernel/traps.c ?? */
 int is_valid_bugaddr(unsigned long addr)
diff --git a/arch/nds32/kernel/traps.c b/arch/nds32/kernel/traps.c
index ee0d9ae192a5..f06421c645af 100644
--- a/arch/nds32/kernel/traps.c
+++ b/arch/nds32/kernel/traps.c
@@ -183,11 +183,6 @@ void __pgd_error(const char *file, int line, unsigned long val)
 }
 
 extern char *exception_vector, *exception_vector_end;
-void __init trap_init(void)
-{
-	return;
-}
-
 void __init early_trap_init(void)
 {
 	unsigned long ivb = 0;
diff --git a/arch/nios2/kernel/traps.c b/arch/nios2/kernel/traps.c
index b172da4eb1a9..596986a74a26 100644
--- a/arch/nios2/kernel/traps.c
+++ b/arch/nios2/kernel/traps.c
@@ -105,11 +105,6 @@ void show_stack(struct task_struct *task, unsigned long *stack,
 	printk("%s\n", loglvl);
 }
 
-void __init trap_init(void)
-{
-	/* Nothing to do here */
-}
-
 /* Breakpoint handler */
 asmlinkage void breakpoint_c(struct pt_regs *fp)
 {
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 4d61333c2623..aa1e709405ac 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -231,11 +231,6 @@ void unhandled_exception(struct pt_regs *regs, int ea, int vector)
 	die("Oops", regs, 9);
 }
 
-void __init trap_init(void)
-{
-	/* Nothing needs to be done */
-}
-
 asmlinkage void do_trap(struct pt_regs *regs, unsigned long address)
 {
 	force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index 8d8441d4562a..747c328fb886 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -859,7 +859,3 @@ void  __init early_trap_init(void)
 
 	initialize_ivt(&fault_vector_20);
 }
-
-void __init trap_init(void)
-{
-}
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index e103b89234cd..91efb5c6f2f3 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -2209,11 +2209,6 @@ DEFINE_INTERRUPT_HANDLER(kernel_bad_stack)
 	die("Bad kernel stack pointer", regs, SIGABRT);
 }
 
-void __init trap_init(void)
-{
-}
-
-
 #ifdef CONFIG_PPC_EMULATED_STATS
 
 #define WARN_EMULATED_SETUP(type)	.type = { .name = #type }
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 0a98fd0ddfe9..0daaa3e4630d 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -199,11 +199,6 @@ int is_valid_bugaddr(unsigned long pc)
 }
 #endif /* CONFIG_GENERIC_BUG */
 
-/* stvec & scratch is already set from head.S */
-void __init trap_init(void)
-{
-}
-
 #ifdef CONFIG_VMAP_STACK
 static DEFINE_PER_CPU(unsigned long [OVERFLOW_STACK_SIZE/sizeof(long)],
 		overflow_stack)__aligned(16);
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index ad12f78bda7e..3198c4767387 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -311,7 +311,3 @@ void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs)
 {
 	do_IRQ(WINCH_IRQ, regs);
 }
-
-void trap_init(void)
-{
-}
diff --git a/init/main.c b/init/main.c
index 4b4897b791fd..863e5087263d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -775,6 +775,8 @@ void __init __weak poking_init(void) { }
 
 void __init __weak pgtable_cache_init(void) { }
 
+void __init __weak trap_init(void) { }
+
 bool initcall_debug;
 core_param(initcall_debug, initcall_debug, bool, 0644);
 
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()
From: kernel test robot @ 2021-08-12 12:26 UTC (permalink / raw)
  To: Xianting Tian, gregkh, jirislaby, amit, arnd, osandov
  Cc: kbuild-all, Xianting Tian, linux-kernel, virtualization,
	clang-built-linux, linuxppc-dev
In-Reply-To: <20210812094532.145497-2-xianting.tian@linux.alibaba.com>

[-- Attachment #1: Type: text/plain, Size: 3818 bytes --]

Hi Xianting,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tty/tty-testing]
[also build test WARNING on char-misc/char-misc-testing soc/for-next v5.14-rc5 next-20210812]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Xianting-Tian/make-hvc-pass-dma-capable-memory-to-its-backend/20210812-174847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git tty-testing
config: hexagon-randconfig-r041-20210812 (attached as .config)
compiler: clang version 12.0.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/9f2925b5429149ceb0ea6eeaa8c81d422c3124fc
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Xianting-Tian/make-hvc-pass-dma-capable-memory-to-its-backend/20210812-174847
        git checkout 9f2925b5429149ceb0ea6eeaa8c81d422c3124fc
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=hexagon 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/tty/hvc/hvc_console.c:190:26: warning: variable 'hp' is uninitialized when used here [-Wuninitialized]
           spin_unlock_irqrestore(&hp->c_lock, flags);
                                   ^~
   drivers/tty/hvc/hvc_console.c:149:23: note: initialize the variable 'hp' to silence this warning
           struct hvc_struct *hp;
                                ^
                                 = NULL
   1 warning generated.


vim +/hp +190 drivers/tty/hvc/hvc_console.c

   136	
   137	/*
   138	 * Console APIs, NOT TTY.  These APIs are available immediately when
   139	 * hvc_console_setup() finds adapters.
   140	 */
   141	
   142	static void hvc_console_print(struct console *co, const char *b,
   143				      unsigned count)
   144	{
   145		char *c;
   146		unsigned i = 0, n = 0;
   147		int r, donecr = 0, index = co->index;
   148		unsigned long flags;
   149		struct hvc_struct *hp;
   150	
   151		/* Console access attempt outside of acceptable console range. */
   152		if (index >= MAX_NR_HVC_CONSOLES)
   153			return;
   154	
   155		/* This console adapter was removed so it is not usable. */
   156		if (vtermnos[index] == -1 || !cons_outbuf[index])
   157			return;
   158	
   159		c = cons_outbuf[index];
   160	
   161		spin_lock_irqsave(&hp->c_lock, flags);
   162		while (count > 0 || i > 0) {
   163			if (count > 0 && i < sizeof(c)) {
   164				if (b[n] == '\n' && !donecr) {
   165					c[i++] = '\r';
   166					donecr = 1;
   167				} else {
   168					c[i++] = b[n++];
   169					donecr = 0;
   170					--count;
   171				}
   172			} else {
   173				r = cons_ops[index]->put_chars(vtermnos[index], c, i);
   174				if (r <= 0) {
   175					/* throw away characters on error
   176					 * but spin in case of -EAGAIN */
   177					if (r != -EAGAIN) {
   178						i = 0;
   179					} else {
   180						hvc_console_flush(cons_ops[index],
   181							      vtermnos[index]);
   182					}
   183				} else if (r > 0) {
   184					i -= r;
   185					if (i > 0)
   186						memmove(c, c+r, i);
   187				}
   188			}
   189		}
 > 190		spin_unlock_irqrestore(&hp->c_lock, flags);
   191		hvc_console_flush(cons_ops[index], vtermnos[index]);
   192	}
   193	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28662 bytes --]

^ permalink raw reply

* Re: [PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry
From: Michael Ellerman @ 2021-08-12 12:14 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Christophe Leroy, linuxppc-dev
In-Reply-To: <bb3855e1-f15d-1df6-5699-9addc69ce1db@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> On 8/12/21 12:58 PM, Michael Ellerman wrote:
>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>> Le 10/08/2021 à 06:53, Aneesh Kumar K.V a écrit :
>>>> Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>>    arch/powerpc/mm/book3s64/radix_tlb.c | 48 ++++++++++++++++++++++++++++
>>>>    1 file changed, 48 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> index aefc100d79a7..5cca0fe130e7 100644
>>>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> @@ -17,6 +17,7 @@
>> ...
>>>> +
>>>> +static int __init create_tlb_single_page_flush_ceiling(void)
>>>> +{
>>>> +	debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
>>>> +			    powerpc_debugfs_root, NULL, &fops_tlbflush);
>>>
>>> Could you just use debugfs_create_u32() instead of re-implementing simple read and write ?
>> 
>> Yeah AFAICS that should work fine.
>> 
>> It could probably even be a u16?
>
> I was looking at switching all that to u64. Should i fallback to u16, 
> considering a tlb_signle_page_flush_ceiling value larger that 2**16 
> doesn't make sense?

Hmm, if we make it u16 and someone writes a value >= 2^16 it just
truncates the value to 0, which is a bit unfortunate.

So maybe just make it u32, that way if someone writes a stupidly large
value it stays large.

cheers

^ permalink raw reply

* Re: [PATCH 07/11] treewide: Replace the use of mem_encrypt_active() with prot_guest_has()
From: Kirill A. Shutemov @ 2021-08-12 10:07 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kuppuswamy, Sathyanarayanan, linux-efi, Brijesh Singh, kvm,
	Peter Zijlstra, Dave Hansen, dri-devel, platform-driver-x86,
	Will Deacon, linux-s390, Andi Kleen, Baoquan He, Joerg Roedel,
	x86, amd-gfx, David Airlie, Ingo Molnar,
	linux-graphics-maintainer, Dave Young, Tianyu Lan,
	Thomas Zimmermann, Maarten Lankhorst, Maxime Ripard,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, kexec,
	linux-kernel, iommu, Daniel Vetter, linux-fsdevel, linuxppc-dev
In-Reply-To: <0a819549-e481-c004-7da8-82ba427b13ce@amd.com>

On Wed, Aug 11, 2021 at 10:52:55AM -0500, Tom Lendacky wrote:
> On 8/11/21 7:19 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 02:48:54PM -0500, Tom Lendacky wrote:
> >> On 8/10/21 1:45 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>
> >>>
> >>> On 7/27/21 3:26 PM, Tom Lendacky wrote:
> >>>> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> >>>> index de01903c3735..cafed6456d45 100644
> >>>> --- a/arch/x86/kernel/head64.c
> >>>> +++ b/arch/x86/kernel/head64.c
> >>>> @@ -19,7 +19,7 @@
> >>>>   #include <linux/start_kernel.h>
> >>>>   #include <linux/io.h>
> >>>>   #include <linux/memblock.h>
> >>>> -#include <linux/mem_encrypt.h>
> >>>> +#include <linux/protected_guest.h>
> >>>>   #include <linux/pgtable.h>
> >>>>     #include <asm/processor.h>
> >>>> @@ -285,7 +285,7 @@ unsigned long __head __startup_64(unsigned long
> >>>> physaddr,
> >>>>        * there is no need to zero it after changing the memory encryption
> >>>>        * attribute.
> >>>>        */
> >>>> -    if (mem_encrypt_active()) {
> >>>> +    if (prot_guest_has(PATTR_MEM_ENCRYPT)) {
> >>>>           vaddr = (unsigned long)__start_bss_decrypted;
> >>>>           vaddr_end = (unsigned long)__end_bss_decrypted;
> >>>
> >>>
> >>> Since this change is specific to AMD, can you replace PATTR_MEM_ENCRYPT with
> >>> prot_guest_has(PATTR_SME) || prot_guest_has(PATTR_SEV). It is not used in
> >>> TDX.
> >>
> >> This is a direct replacement for now.
> > 
> > With current implementation of prot_guest_has() for TDX it breaks boot for
> > me.
> > 
> > Looking at code agains, now I *think* the reason is accessing a global
> > variable from __startup_64() inside TDX version of prot_guest_has().
> > 
> > __startup_64() is special. If you access any global variable you need to
> > use fixup_pointer(). See comment before __startup_64().
> > 
> > I'm not sure how you get away with accessing sme_me_mask directly from
> > there. Any clues? Maybe just a luck and complier generates code just right
> > for your case, I donno.
> 
> Hmm... yeah, could be that the compiler is using rip-relative addressing
> for it because it lives in the .data section?

I guess. It has to be fixed. It may break with complier upgrade or any
random change around the code.

BTW, does it work with clang for you?

> For the static variables in mem_encrypt_identity.c I did an assembler rip
> relative LEA, but probably could have passed physaddr to sme_enable() and
> used a fixup_pointer() style function, instead.

Sounds like a plan.

> > A separate point is that TDX version of prot_guest_has() relies on
> > cpu_feature_enabled() which is not ready at this point.
> 
> Does TDX have to do anything special to make memory able to be shared with
> the hypervisor?

Yes. But there's nothing that required any changes in early boot. It
handled in ioremap/set_memory.

> You might have to use something that is available earlier
> than cpu_feature_enabled() in that case (should you eventually support
> kvmclock).

Maybe.

> > I think __bss_decrypted fixup has to be done if sme_me_mask is non-zero.
> > Or just do it uncoditionally because it's NOP for sme_me_mask == 0.
> 
> For SNP, we'll have to additionally call the HV to update the RMP to make
> the memory shared. But that could also be done unconditionally since the
> early_snp_set_memory_shared() routine will check for SNP before doing
> anything.

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH] powerpc/interrupt: Do not call single_step_exception() from other exceptions
From: Nicholas Piggin @ 2021-08-12 10:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christophe Leroy, fthain,
	Michael Ellerman, Paul Mackerras, userm57
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <aed174f5cbc06f2cf95233c071d8aac948e46043.1628611921.git.christophe.leroy@csgroup.eu>

Excerpts from Christophe Leroy's message of August 11, 2021 2:13 am:
> single_step_exception() is called by emulate_single_step() which
> is called from (at least) alignment exception() handler and
> program_check_exception() handler.
> 
> Redefine it as a regular __single_step_exception() which is called
> by both single_step_exception() handler and emulate_single_step()
> function.
> 

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
> Cc: stable@vger.kernel.org
> Cc: Stan Johnson <userm57@yahoo.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Finn Thain <fthain@linux-m68k.org>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/kernel/traps.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index dfbce527c98e..d56254f05e17 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -1104,7 +1104,7 @@ DEFINE_INTERRUPT_HANDLER(RunModeException)
>  	_exception(SIGTRAP, regs, TRAP_UNK, 0);
>  }
>  
> -DEFINE_INTERRUPT_HANDLER(single_step_exception)
> +static void __single_step_exception(struct pt_regs *regs)
>  {
>  	clear_single_step(regs);
>  	clear_br_trace(regs);
> @@ -1121,6 +1121,11 @@ DEFINE_INTERRUPT_HANDLER(single_step_exception)
>  	_exception(SIGTRAP, regs, TRAP_TRACE, regs->nip);
>  }
>  
> +DEFINE_INTERRUPT_HANDLER(single_step_exception)
> +{
> +	__single_step_exception(regs);
> +}
> +
>  /*
>   * After we have successfully emulated an instruction, we have to
>   * check if the instruction was being single-stepped, and if so,
> @@ -1130,7 +1135,7 @@ DEFINE_INTERRUPT_HANDLER(single_step_exception)
>  static void emulate_single_step(struct pt_regs *regs)
>  {
>  	if (single_stepping(regs))
> -		single_step_exception(regs);
> +		__single_step_exception(regs);
>  }
>  
>  static inline int __parse_fpscr(unsigned long fpscr)
> -- 
> 2.25.0
> 
> 

^ permalink raw reply

* Re: [PATCH] powerpc/interrupt: Fix OOPS by not calling do_IRQ() from timer_interrupt()
From: Nicholas Piggin @ 2021-08-12 10:00 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christophe Leroy, fthain,
	Michael Ellerman, Paul Mackerras, userm57
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <c17d234f4927d39a1d7100864a8e1145323d33a0.1628611927.git.christophe.leroy@csgroup.eu>

Excerpts from Christophe Leroy's message of August 11, 2021 2:13 am:
> An interrupt handler shall not be called from another interrupt
> handler otherwise this leads to problems like the following:
> 
> 	Kernel attempted to write user page (afd4fa84) - exploit attempt? (uid: 1000)
> 	------------[ cut here ]------------
> 	Bug: Write fault blocked by KUAP!
> 	WARNING: CPU: 0 PID: 1617 at arch/powerpc/mm/fault.c:230 do_page_fault+0x484/0x720
> 	Modules linked in:
> 	CPU: 0 PID: 1617 Comm: sshd Tainted: G        W         5.13.0-pmac-00010-g8393422eb77 #7
> 	NIP:  c001b77c LR: c001b77c CTR: 00000000
> 	REGS: cb9e5bc0 TRAP: 0700   Tainted: G        W          (5.13.0-pmac-00010-g8393422eb77)
> 	MSR:  00021032 <ME,IR,DR,RI>  CR: 24942424  XER: 00000000
> 
> 	GPR00: c001b77c cb9e5c80 c1582c00 00000021 3ffffbff 085b0000 00000027 c8eb644c
> 	GPR08: 00000023 00000000 00000000 00000000 24942424 0063f8c8 00000000 000186a0
> 	GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 cb9e5e90
> 	GPR24: 00000040 afd4fa96 00000040 02000000 c1fda6c0 afd4fa84 00000300 cb9e5cc0
> 	NIP [c001b77c] do_page_fault+0x484/0x720
> 	LR [c001b77c] do_page_fault+0x484/0x720
> 	Call Trace:
> 	[cb9e5c80] [c001b77c] do_page_fault+0x484/0x720 (unreliable)
> 	[cb9e5cb0] [c000424c] DataAccess_virt+0xd4/0xe4
> 	--- interrupt: 300 at __copy_tofrom_user+0x110/0x20c
> 	NIP:  c001f9b4 LR: c03250a0 CTR: 00000004
> 	REGS: cb9e5cc0 TRAP: 0300   Tainted: G        W          (5.13.0-pmac-00010-g8393422eb77)
> 	MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 48028468  XER: 20000000
> 	DAR: afd4fa84 DSISR: 0a000000
> 	GPR00: 20726f6f cb9e5d80 c1582c00 00000004 cb9e5e3a 00000016 afd4fa80 00000000
> 	GPR08: 3835202d 72777872 2d78722d 00000004 28028464 0063f8c8 00000000 000186a0
> 	GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 cb9e5e90
> 	GPR24: 00000040 afd4fa96 00000040 cb9e5e0c 00000daa a0000000 cb9e5e98 afd4fa56
> 	NIP [c001f9b4] __copy_tofrom_user+0x110/0x20c
> 	LR [c03250a0] _copy_to_iter+0x144/0x990
> 	--- interrupt: 300
> 	[cb9e5d80] [c03e89c0] n_tty_read+0xa4/0x598 (unreliable)
> 	[cb9e5df0] [c03e2a0c] tty_read+0xdc/0x2b4
> 	[cb9e5e80] [c0156bf8] vfs_read+0x274/0x340
> 	[cb9e5f00] [c01571ac] ksys_read+0x70/0x118
> 	[cb9e5f30] [c0016048] ret_from_syscall+0x0/0x28
> 	--- interrupt: c00 at 0xa7855c88
> 	NIP:  a7855c88 LR: a7855c5c CTR: 00000000
> 	REGS: cb9e5f40 TRAP: 0c00   Tainted: G        W          (5.13.0-pmac-00010-g8393422eb77)
> 	MSR:  0000d032 <EE,PR,ME,IR,DR,RI>  CR: 2402446c  XER: 00000000
> 
> 	GPR00: 00000003 afd4ec70 a72137d0 0000000b afd4ecac 00004000 0065a990 00000800
> 	GPR08: 00000000 a7947930 00000000 00000004 c15831b0 0063f8c8 00000000 000186a0
> 	GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 0065a9e0 00000001 0065fac0
> 	GPR24: 00000000 00000089 00664050 00000000 00668e30 a720c8dc a7943ff4 0065f9b0
> 	NIP [a7855c88] 0xa7855c88
> 	LR [a7855c5c] 0xa7855c5c
> 	--- interrupt: c00
> 	Instruction dump:
> 	3884aa88 38630178 48076861 807f0080 48042e45 2f830000 419e0148 3c80c079
> 	3c60c076 38841be4 386301c0 4801f705 <0fe00000> 3860000b 4bfffe30 3c80c06b
> 	---[ end trace fd69b91a8046c2e5 ]---
> 
> Here the problem is that by re-enterring an exception handler,
> kuap_save_and_lock() is called a second time with this time KUAP
> access locked, leading to regs->kuap being overwritten hence
> KUAP not being unlocked at exception exit as expected.
> 
> Do not call do_IRQ() from timer_interrupt() directly. Instead,
> redefine do_IRQ() as a standard function named __do_IRQ(), and
> call it from both do_IRQ() and time_interrupt() handlers.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> 
> Reported-by: Stan Johnson <userm57@yahoo.com>
> Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
> Cc: stable@vger.kernel.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Finn Thain <fthain@linux-m68k.org>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/include/asm/interrupt.h | 3 +++
>  arch/powerpc/include/asm/irq.h       | 2 +-
>  arch/powerpc/kernel/irq.c            | 7 ++++++-
>  arch/powerpc/kernel/time.c           | 2 +-
>  4 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
> index fc4702bdd119..1e984a35a39f 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -590,6 +590,9 @@ DECLARE_INTERRUPT_HANDLER_NMI(hmi_exception_realmode);
>  
>  DECLARE_INTERRUPT_HANDLER_ASYNC(TAUException);
>  
> +/* irq.c */
> +DECLARE_INTERRUPT_HANDLER_ASYNC(do_IRQ);
> +
>  void __noreturn unrecoverable_exception(struct pt_regs *regs);
>  
>  void replay_system_reset(void);
> diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
> index 4982f3711fc3..2b3278534bc1 100644
> --- a/arch/powerpc/include/asm/irq.h
> +++ b/arch/powerpc/include/asm/irq.h
> @@ -52,7 +52,7 @@ extern void *mcheckirq_ctx[NR_CPUS];
>  extern void *hardirq_ctx[NR_CPUS];
>  extern void *softirq_ctx[NR_CPUS];
>  
> -extern void do_IRQ(struct pt_regs *regs);
> +void __do_IRQ(struct pt_regs *regs);
>  extern void __init init_IRQ(void);
>  extern void __do_irq(struct pt_regs *regs);
>  
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index 91e63eac4e8f..551b653228c4 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -750,7 +750,7 @@ void __do_irq(struct pt_regs *regs)
>  	trace_irq_exit(regs);
>  }
>  
> -DEFINE_INTERRUPT_HANDLER_ASYNC(do_IRQ)
> +void __do_IRQ(struct pt_regs *regs)
>  {
>  	struct pt_regs *old_regs = set_irq_regs(regs);
>  	void *cursp, *irqsp, *sirqsp;
> @@ -774,6 +774,11 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(do_IRQ)
>  	set_irq_regs(old_regs);
>  }
>  
> +DEFINE_INTERRUPT_HANDLER_ASYNC(do_IRQ)
> +{
> +	__do_IRQ(regs);
> +}
> +
>  static void *__init alloc_vm_stack(void)
>  {
>  	return __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP,
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index e45ce427bffb..c487ba5a6e11 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -586,7 +586,7 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)
>  
>  #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>  	if (atomic_read(&ppc_n_lost_interrupts) != 0)
> -		do_IRQ(regs);
> +		__do_IRQ(regs);
>  #endif
>  
>  	old_regs = set_irq_regs(regs);
> -- 
> 2.25.0
> 
> 

^ permalink raw reply

* [PATCH v6 2/2] virtio-console: remove unnecessary kmemdup()
From: Xianting Tian @ 2021-08-12  9:45 UTC (permalink / raw)
  To: gregkh, jirislaby, amit, arnd, osandov
  Cc: Xianting Tian, linuxppc-dev, linux-kernel, virtualization
In-Reply-To: <20210812094532.145497-1-xianting.tian@linux.alibaba.com>

hvc framework will never pass stack memory to the put_chars() function,
So the calling of kmemdup() is unnecessary, we can remove it.

This revert commit c4baad5029 ("virtio-console: avoid DMA from stack")

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
---
 drivers/char/virtio_console.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 7eaf303a7..4ed3ffb1d 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1117,8 +1117,6 @@ static int put_chars(u32 vtermno, const char *buf, int count)
 {
 	struct port *port;
 	struct scatterlist sg[1];
-	void *data;
-	int ret;
 
 	if (unlikely(early_put_chars))
 		return early_put_chars(vtermno, buf, count);
@@ -1127,14 +1125,8 @@ static int put_chars(u32 vtermno, const char *buf, int count)
 	if (!port)
 		return -EPIPE;
 
-	data = kmemdup(buf, count, GFP_ATOMIC);
-	if (!data)
-		return -ENOMEM;
-
-	sg_init_one(sg, data, count);
-	ret = __send_to_port(port, sg, 1, count, data, false);
-	kfree(data);
-	return ret;
+	sg_init_one(sg, buf, count);
+	return __send_to_port(port, sg, 1, count, (void *)buf, false);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related

* [PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()
From: Xianting Tian @ 2021-08-12  9:45 UTC (permalink / raw)
  To: gregkh, jirislaby, amit, arnd, osandov
  Cc: Xianting Tian, linuxppc-dev, linux-kernel, virtualization
In-Reply-To: <20210812094532.145497-1-xianting.tian@linux.alibaba.com>

As well known, hvc backend can register its opertions to hvc backend.
the opertions contain put_chars(), get_chars() and so on.

Some hvc backend may do dma in its opertions. eg, put_chars() of
virtio-console. But in the code of hvc framework, it may pass DMA
incapable memory to put_chars() under a specific configuration, which
is explained in commit c4baad5029(virtio-console: avoid DMA from stack):
1, c[] is on stack,
   hvc_console_print():
	char c[N_OUTBUF] __ALIGNED__;
	cons_ops[index]->put_chars(vtermnos[index], c, i);
2, ch is on stack,
   static void hvc_poll_put_char(,,char ch)
   {
	struct tty_struct *tty = driver->ttys[0];
	struct hvc_struct *hp = tty->driver_data;
	int n;

	do {
		n = hp->ops->put_chars(hp->vtermno, &ch, 1);
	} while (n <= 0);
   }

Commit c4baad5029 is just the fix to avoid DMA from stack memory, which
is passed to virtio-console by hvc framework in above code. But I think
the fix is aggressive, it directly uses kmemdup() to alloc new buffer
from kmalloc area and do memcpy no matter the memory is in kmalloc area
or not. But most importantly, it should better be fixed in the hvc
framework, by changing it to never pass stack memory to the put_chars()
function in the first place. Otherwise, we still face the same issue if
a new hvc backend using dma added in the furture.

We make 'char c[N_OUTBUF]' part of 'struct hvc_struct', so hp->c is no
longer the stack memory. we can use it in above two cases.

Other fix is use L1_CACHE_BYTES as the alignment, use 'sizeof(long)' as
dma alignment is wrong. And use struct_size() to calculate size of
hvc_struct.

Introduce another array(cons_outbuf[]) for the hp->c pointers next to
the cons_ops[] and vtermnos[] arrays.

With the patch, we can remove the fix c4baad5029.

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Tested-by: Xianting Tian <xianting.tian@linux.alibaba.com>
---
 drivers/tty/hvc/hvc_console.c | 40 +++++++++++++++++++++--------------
 drivers/tty/hvc/hvc_console.h | 16 ++++++++++++--
 2 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 5bb8c4e44..c56564eb7 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -41,16 +41,6 @@
  */
 #define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */
 
-/*
- * These sizes are most efficient for vio, because they are the
- * native transfer size. We could make them selectable in the
- * future to better deal with backends that want other buffer sizes.
- */
-#define N_OUTBUF	16
-#define N_INBUF		16
-
-#define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
-
 static struct tty_driver *hvc_driver;
 static struct task_struct *hvc_task;
 
@@ -142,6 +132,7 @@ static int hvc_flush(struct hvc_struct *hp)
 static const struct hv_ops *cons_ops[MAX_NR_HVC_CONSOLES];
 static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
 	{[0 ... MAX_NR_HVC_CONSOLES - 1] = -1};
+static char *cons_outbuf[MAX_NR_HVC_CONSOLES];
 
 /*
  * Console APIs, NOT TTY.  These APIs are available immediately when
@@ -151,18 +142,23 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
 static void hvc_console_print(struct console *co, const char *b,
 			      unsigned count)
 {
-	char c[N_OUTBUF] __ALIGNED__;
+	char *c;
 	unsigned i = 0, n = 0;
 	int r, donecr = 0, index = co->index;
+	unsigned long flags;
+	struct hvc_struct *hp;
 
 	/* Console access attempt outside of acceptable console range. */
 	if (index >= MAX_NR_HVC_CONSOLES)
 		return;
 
 	/* This console adapter was removed so it is not usable. */
-	if (vtermnos[index] == -1)
+	if (vtermnos[index] == -1 || !cons_outbuf[index])
 		return;
 
+	c = cons_outbuf[index];
+
+	spin_lock_irqsave(&hp->c_lock, flags);
 	while (count > 0 || i > 0) {
 		if (count > 0 && i < sizeof(c)) {
 			if (b[n] == '\n' && !donecr) {
@@ -191,6 +187,7 @@ static void hvc_console_print(struct console *co, const char *b,
 			}
 		}
 	}
+	spin_unlock_irqrestore(&hp->c_lock, flags);
 	hvc_console_flush(cons_ops[index], vtermnos[index]);
 }
 
@@ -878,9 +875,19 @@ static void hvc_poll_put_char(struct tty_driver *driver, int line, char ch)
 	struct tty_struct *tty = driver->ttys[0];
 	struct hvc_struct *hp = tty->driver_data;
 	int n;
+	unsigned long flags;
+	char *c;
+
+	if (!hp || !cons_outbuf[hp->index])
+		return;
+
+	c = cons_outbuf[hp->index];
 
 	do {
-		n = hp->ops->put_chars(hp->vtermno, &ch, 1);
+		spin_lock_irqsave(&hp->c_lock, flags);
+		c[0] = ch;
+		n = hp->ops->put_chars(hp->vtermno, c, 1);
+		spin_unlock_irqrestore(&hp->c_lock, flags);
 	} while (n <= 0);
 }
 #endif
@@ -922,8 +929,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
 			return ERR_PTR(err);
 	}
 
-	hp = kzalloc(ALIGN(sizeof(*hp), sizeof(long)) + outbuf_size,
-			GFP_KERNEL);
+	hp = kzalloc(struct_size(hp, outbuf, outbuf_size), GFP_KERNEL);
 	if (!hp)
 		return ERR_PTR(-ENOMEM);
 
@@ -931,13 +937,13 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
 	hp->data = data;
 	hp->ops = ops;
 	hp->outbuf_size = outbuf_size;
-	hp->outbuf = &((char *)hp)[ALIGN(sizeof(*hp), sizeof(long))];
 
 	tty_port_init(&hp->port);
 	hp->port.ops = &hvc_port_ops;
 
 	INIT_WORK(&hp->tty_resize, hvc_set_winsz);
 	spin_lock_init(&hp->lock);
+	spin_lock_init(&hp->c_lock);
 	mutex_lock(&hvc_structs_mutex);
 
 	/*
@@ -964,6 +970,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
 	if (i < MAX_NR_HVC_CONSOLES) {
 		cons_ops[i] = ops;
 		vtermnos[i] = vtermno;
+		cons_outbuf[i] = hp->c;
 	}
 
 	list_add_tail(&(hp->next), &hvc_structs);
@@ -988,6 +995,7 @@ int hvc_remove(struct hvc_struct *hp)
 	if (hp->index < MAX_NR_HVC_CONSOLES) {
 		vtermnos[hp->index] = -1;
 		cons_ops[hp->index] = NULL;
+		cons_outbuf[hp->index] = NULL;
 	}
 
 	/* Don't whack hp->irq because tty_hangup() will need to free the irq. */
diff --git a/drivers/tty/hvc/hvc_console.h b/drivers/tty/hvc/hvc_console.h
index 18d005814..52374e2da 100644
--- a/drivers/tty/hvc/hvc_console.h
+++ b/drivers/tty/hvc/hvc_console.h
@@ -32,13 +32,21 @@
  */
 #define HVC_ALLOC_TTY_ADAPTERS	8
 
+/*
+ * These sizes are most efficient for vio, because they are the
+ * native transfer size. We could make them selectable in the
+ * future to better deal with backends that want other buffer sizes.
+ */
+#define N_OUTBUF	16
+#define N_INBUF		16
+
+#define __ALIGNED__ __attribute__((__aligned__(L1_CACHE_BYTES)))
+
 struct hvc_struct {
 	struct tty_port port;
 	spinlock_t lock;
 	int index;
 	int do_wakeup;
-	char *outbuf;
-	int outbuf_size;
 	int n_outbuf;
 	uint32_t vtermno;
 	const struct hv_ops *ops;
@@ -48,6 +56,10 @@ struct hvc_struct {
 	struct work_struct tty_resize;
 	struct list_head next;
 	unsigned long flags;
+	spinlock_t c_lock;
+	char c[N_OUTBUF] __ALIGNED__;
+	int outbuf_size;
+	char outbuf[0] __ALIGNED__;
 };
 
 /* implemented by a low level driver */
-- 
2.17.1


^ permalink raw reply related

* [PATCH v6 0/2] make hvc pass dma capable memory to its backend
From: Xianting Tian @ 2021-08-12  9:45 UTC (permalink / raw)
  To: gregkh, jirislaby, amit, arnd, osandov
  Cc: Xianting Tian, linuxppc-dev, linux-kernel, virtualization

Dear all,

This patch series make hvc framework pass DMA capable memory to
put_chars() of hvc backend(eg, virtio-console), and revert commit
c4baad5029 ("virtio-console: avoid DMA from stack”)

V1
virtio-console: avoid DMA from vmalloc area
https://lkml.org/lkml/2021/7/27/494

For v1 patch, Arnd Bergmann suggests to fix the issue in the first
place:
Make hvc pass DMA capable memory to put_chars()
The fix suggestion is included in v2.

V2
[PATCH 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/1/8
[PATCH 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/1/9

For v2 patch, Arnd Bergmann suggests to make new buf part of the
hvc_struct structure, and fix the compile issue.
The fix suggestion is included in v3.

V3
[PATCH v3 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/3/1347
[PATCH v3 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/3/1348

For v3 patch, Jiri Slaby suggests to make 'char c[N_OUTBUF]' part of
hvc_struct, and make 'hp->outbuf' aligned and use struct_size() to
calculate the size of hvc_struct. The fix suggestion is included in
v4.

V4
[PATCH v4 0/2] make hvc pass dma capable memory to its backend
https://lkml.org/lkml/2021/8/5/1350
[PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/5/1351
[PATCH v4 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/5/1352

For v4 patch, Arnd Bergmann suggests to introduce another
array(cons_outbuf[]) for the buffer pointers next to the cons_ops[]
and vtermnos[] arrays. This fix included in this v5 patch.

V5
Arnd Bergmann suggests to use "L1_CACHE_BYTES" as dma alignment,
use 'sizeof(long)' as dma alignment is wrong. fix it in v6.


drivers/char/virtio_console.c | 12 ++----------
drivers/tty/hvc/hvc_console.c | 40 +++++++++++++++++++++--------------
drivers/tty/hvc/hvc_console.h | 16 ++++++++++++--
3 file changed

^ permalink raw reply

* Re: [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
From: Xianting TIan @ 2021-08-12  9:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jiri Slaby, Amit Shah, gregkh, Linux Kernel Mailing List,
	open list:DRM DRIVER FOR QEMU'S CIRRUS DEVICE, Guo Ren,
	linuxppc-dev, Omar Sandoval
In-Reply-To: <CAK8P3a2ykLvJkhX+wDAOHdyLHjPFAfhOxi5BNM9kTKv_8F7VQg@mail.gmail.com>


在 2021/8/12 下午4:54, Arnd Bergmann 写道:
> On Thu, Aug 12, 2021 at 10:08 AM Xianting TIan
> <xianting.tian@linux.alibaba.com> wrote:
>> 在 2021/8/6 下午10:51, Arnd Bergmann 写道:
>>> On Fri, Aug 6, 2021 at 5:01 AM Xianting Tian
>>>> +#define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
>>> I think you need a higher alignment for DMA buffers, instead of sizeof(long),
>>> I would suggest ARCH_DMA_MINALIGN.
>> As some ARCH(eg, x86, riscv) doesn't define ARCH_DMA_MINALIG, so i think
>> it 's better remain the code unchanged,
>>
>> I will send v5 patch soon.
> I think you could just use "L1_CACHE_BYTES" as the alignment in this case.
> This will make the structure slightly larger for architectures that do not have
> alignment constraints on DMA buffers, but using a smaller alignment is
> clearly wrong. Another option would be to use ARCH_KMALLOC_MINALIGN.
yes, I unstand you, the align size must  L1_CACHE_BYTES at least.
>
> Note that there is a patch to add ARCH_DMA_MINALIGN to riscv already,
yes, I summited this patch, it is discussing, seems they don't want to 
apply it.
> as some implementations do not have coherent DMA. I had failed to
> realized though that on x86 you do not get an ARCH_DMA_MINALIGN
> definition.
I didn't find the definition in arch/x86/include/asm/cache.h and other 
place, x86 is dma coherent, it may doesn't need it.
>
>         Arnd

^ permalink raw reply

* Re: [PATCH v2 1/2] KVM: Refactor kvm_arch_vcpu_fault() to return a struct page pointer
From: David Hildenbrand @ 2021-08-12  9:04 UTC (permalink / raw)
  To: Hou Wenlong, kvm
  Cc: x86, Wanpeng Li, linux-mips, H. Peter Anvin, Claudio Imbrenda,
	Will Deacon, kvmarm, linux-s390, Janosch Frank, Marc Zyngier,
	Joerg Roedel, Huacai Chen, Christian Borntraeger,
	Aleksandar Markovic, Ingo Molnar, Catalin Marinas, Vasily Gorbik,
	Suzuki K Poulose, Heiko Carstens, kvm-ppc, Borislav Petkov,
	Thomas Gleixner, Alexandru Elisei, linux-arm-kernel, Jim Mattson,
	Thomas Bogendoerfer, Sean Christopherson, Cornelia Huck,
	linux-kernel, James Morse, Paolo Bonzini, Vitaly Kuznetsov,
	linuxppc-dev
In-Reply-To: <1c510b24fc1d7cbae8aa4b69c0799ebd32e65b82.1628739116.git.houwenlong93@linux.alibaba.com>

On 12.08.21 06:02, Hou Wenlong wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Refactor kvm_arch_vcpu_fault() to return 'struct page *' instead of
> 'vm_fault_t' to simplify architecture specific implementations that do
> more than return SIGBUS.  Currently this only applies to s390, but a
> future patch will move x86's pio_data handling into x86 where it belongs.
> 
> No functional changed intended.
> 
> Cc: Hou Wenlong <houwenlong93@linux.alibaba.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
> ---
>   arch/arm64/kvm/arm.c       |  4 ++--
>   arch/mips/kvm/mips.c       |  4 ++--
>   arch/powerpc/kvm/powerpc.c |  4 ++--
>   arch/s390/kvm/kvm-s390.c   | 12 ++++--------
>   arch/x86/kvm/x86.c         |  4 ++--
>   include/linux/kvm_host.h   |  2 +-
>   virt/kvm/kvm_main.c        |  5 ++++-
>   7 files changed, 17 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e9a2b8f27792..83f4ffe3e4f2 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -161,9 +161,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   	return ret;
>   }
>   
> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>   {
> -	return VM_FAULT_SIGBUS;
> +	return NULL;
>   }
>   
>   
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index af9dd029a4e1..ae79874e6fd2 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -1053,9 +1053,9 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>   	return -ENOIOCTLCMD;
>   }
>   
> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>   {
> -	return VM_FAULT_SIGBUS;
> +	return NULL;
>   }
>   
>   int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index be33b5321a76..b9c21f9ab784 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -2090,9 +2090,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>   	return r;
>   }
>   
> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>   {
> -	return VM_FAULT_SIGBUS;
> +	return NULL;
>   }
>   
>   static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 02574d7b3612..e1b69833e228 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -4979,17 +4979,13 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>   	return r;
>   }
>   
> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>   {
>   #ifdef CONFIG_KVM_S390_UCONTROL
> -	if ((vmf->pgoff == KVM_S390_SIE_PAGE_OFFSET)
> -		 && (kvm_is_ucontrol(vcpu->kvm))) {
> -		vmf->page = virt_to_page(vcpu->arch.sie_block);
> -		get_page(vmf->page);
> -		return 0;
> -	}
> +	if (vmf->pgoff == KVM_S390_SIE_PAGE_OFFSET && kvm_is_ucontrol(vcpu->kvm))
> +		return virt_to_page(vcpu->arch.sie_block);
>   #endif
> -	return VM_FAULT_SIGBUS;
> +	return NULL;
>   }
>   
>   /* Section: memory related */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3cedc7cc132a..1e3bbe5cd33a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5347,9 +5347,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>   	return r;
>   }
>   
> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>   {
> -	return VM_FAULT_SIGBUS;
> +	return NULL;
>   }
>   
>   static int kvm_vm_ioctl_set_tss_addr(struct kvm *kvm, unsigned long addr)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 492d183dd7d0..a949de534722 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -995,7 +995,7 @@ long kvm_arch_dev_ioctl(struct file *filp,
>   			unsigned int ioctl, unsigned long arg);
>   long kvm_arch_vcpu_ioctl(struct file *filp,
>   			 unsigned int ioctl, unsigned long arg);
> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
>   
>   int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext);
>   
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 30d322519253..f7d21418971b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3448,7 +3448,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>   		    &vcpu->dirty_ring,
>   		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>   	else
> -		return kvm_arch_vcpu_fault(vcpu, vmf);
> +		page = kvm_arch_vcpu_fault(vcpu, vmf);
> +	if (!page)
> +		return VM_FAULT_SIGBUS;
> +
>   	get_page(page);
>   	vmf->page = page;
>   	return 0;
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

But at the same time I wonder if we should just get rid of 
CONFIG_KVM_S390_UCONTROL and consequently kvm_arch_vcpu_fault().


In practice CONFIG_KVM_S390_UCONTROL, is never enabled in any reasonable 
kernel build and consequently it's never tested; further, exposing the 
sie_block to user space allows user space to generate random SIE 
validity intercepts.

CONFIG_KVM_S390_UCONTROL feels like something that should just be 
maintained out of tree by someone who really needs to hack deep into hw 
virtualization for testing purposes etc.

-- 
Thanks,

David / dhildenb


^ permalink raw reply

* Re: [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
From: Arnd Bergmann @ 2021-08-12  8:54 UTC (permalink / raw)
  To: Xianting TIan
  Cc: Arnd Bergmann, Jiri Slaby, Amit Shah, gregkh,
	Linux Kernel Mailing List,
	open list:DRM DRIVER FOR QEMU'S CIRRUS DEVICE, Guo Ren,
	linuxppc-dev, Omar Sandoval
In-Reply-To: <f18d017b-d6f7-cf87-8859-8d6b50c7c289@linux.alibaba.com>

On Thu, Aug 12, 2021 at 10:08 AM Xianting TIan
<xianting.tian@linux.alibaba.com> wrote:
> 在 2021/8/6 下午10:51, Arnd Bergmann 写道:
> > On Fri, Aug 6, 2021 at 5:01 AM Xianting Tian
> >> +#define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
> > I think you need a higher alignment for DMA buffers, instead of sizeof(long),
> > I would suggest ARCH_DMA_MINALIGN.
>
> As some ARCH(eg, x86, riscv) doesn't define ARCH_DMA_MINALIG, so i think
> it 's better remain the code unchanged,
>
> I will send v5 patch soon.

I think you could just use "L1_CACHE_BYTES" as the alignment in this case.
This will make the structure slightly larger for architectures that do not have
alignment constraints on DMA buffers, but using a smaller alignment is
clearly wrong. Another option would be to use ARCH_KMALLOC_MINALIGN.

Note that there is a patch to add ARCH_DMA_MINALIGN to riscv already,
as some implementations do not have coherent DMA. I had failed to
realized though that on x86 you do not get an ARCH_DMA_MINALIGN
definition.

       Arnd

^ permalink raw reply

* [PATCH v5 2/2] virtio-console: remove unnecessary kmemdup()
From: Xianting Tian @ 2021-08-12  8:51 UTC (permalink / raw)
  To: gregkh, jirislaby, amit, arnd, osandov
  Cc: Xianting Tian, guoren, linuxppc-dev, linux-kernel, virtualization
In-Reply-To: <20210812085112.145265-1-xianting.tian@linux.alibaba.com>

hvc framework will never pass stack memory to the put_chars() function,
So the calling of kmemdup() is unnecessary, we can remove it.

This revert commit c4baad5029 ("virtio-console: avoid DMA from stack")

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
---
 drivers/char/virtio_console.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 7eaf303a7..4ed3ffb1d 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1117,8 +1117,6 @@ static int put_chars(u32 vtermno, const char *buf, int count)
 {
 	struct port *port;
 	struct scatterlist sg[1];
-	void *data;
-	int ret;
 
 	if (unlikely(early_put_chars))
 		return early_put_chars(vtermno, buf, count);
@@ -1127,14 +1125,8 @@ static int put_chars(u32 vtermno, const char *buf, int count)
 	if (!port)
 		return -EPIPE;
 
-	data = kmemdup(buf, count, GFP_ATOMIC);
-	if (!data)
-		return -ENOMEM;
-
-	sg_init_one(sg, data, count);
-	ret = __send_to_port(port, sg, 1, count, data, false);
-	kfree(data);
-	return ret;
+	sg_init_one(sg, buf, count);
+	return __send_to_port(port, sg, 1, count, (void *)buf, false);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related

* [PATCH v5 0/2] make hvc pass dma capable memory to its backend
From: Xianting Tian @ 2021-08-12  8:51 UTC (permalink / raw)
  To: gregkh, jirislaby, amit, arnd, osandov
  Cc: Xianting Tian, guoren, linuxppc-dev, linux-kernel, virtualization

Dear all,

This patch series make hvc framework pass DMA capable memory to
put_chars() of hvc backend(eg, virtio-console), and revert commit
c4baad5029 ("virtio-console: avoid DMA from stack”)

V1
virtio-console: avoid DMA from vmalloc area
https://lkml.org/lkml/2021/7/27/494

For v1 patch, Arnd Bergmann suggests to fix the issue in the first
place:
Make hvc pass DMA capable memory to put_chars()
The fix suggestion is included in v2.

V2
[PATCH 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/1/8
[PATCH 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/1/9

For v2 patch, Arnd Bergmann suggests to make new buf part of the
hvc_struct structure, and fix the compile issue.
The fix suggestion is included in v3.

V3
[PATCH v3 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/3/1347
[PATCH v3 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/3/1348

For v3 patch, Jiri Slaby suggests to make 'char c[N_OUTBUF]' part of
hvc_struct, and make 'hp->outbuf' aligned and use struct_size() to
calculate the size of hvc_struct. The fix suggestion is included in
v4.

V4
[PATCH v4 0/2] make hvc pass dma capable memory to its backend
https://lkml.org/lkml/2021/8/5/1350
[PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/5/1351
[PATCH v4 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/5/1352

For v4 patch, Arnd Bergmann suggests to introduce another
array(cons_outbuf[]) for the buffer pointers next to the cons_ops[]
and vtermnos[] arrays. This fix included in this v5 patch.

drivers/char/virtio_console.c | 12 ++----------
drivers/tty/hvc/hvc_console.c | 40 +++++++++++++++++++++--------------
drivers/tty/hvc/hvc_console.h | 16 ++++++++++++--
3 file changed

^ permalink raw reply

* [PATCH v5 1/2] tty: hvc: pass DMA capable memory to put_chars()
From: Xianting Tian @ 2021-08-12  8:51 UTC (permalink / raw)
  To: gregkh, jirislaby, amit, arnd, osandov
  Cc: Xianting Tian, guoren, linuxppc-dev, linux-kernel, virtualization
In-Reply-To: <20210812085112.145265-1-xianting.tian@linux.alibaba.com>

As well known, hvc backend can register its opertions to hvc backend.
the opertions contain put_chars(), get_chars() and so on.

Some hvc backend may do dma in its opertions. eg, put_chars() of
virtio-console. But in the code of hvc framework, it may pass DMA
incapable memory to put_chars() under a specific configuration, which
is explained in commit c4baad5029(virtio-console: avoid DMA from stack):
1, c[] is on stack,
   hvc_console_print():
	char c[N_OUTBUF] __ALIGNED__;
	cons_ops[index]->put_chars(vtermnos[index], c, i);
2, ch is on stack,
   static void hvc_poll_put_char(,,char ch)
   {
	struct tty_struct *tty = driver->ttys[0];
	struct hvc_struct *hp = tty->driver_data;
	int n;

	do {
		n = hp->ops->put_chars(hp->vtermno, &ch, 1);
	} while (n <= 0);
   }

Commit c4baad5029 is just the fix to avoid DMA from stack memory, which
is passed to virtio-console by hvc framework in above code. But I think
the fix is aggressive, it directly uses kmemdup() to alloc new buffer
from kmalloc area and do memcpy no matter the memory is in kmalloc area
or not. But most importantly, it should better be fixed in the hvc
framework, by changing it to never pass stack memory to the put_chars()
function in the first place. Otherwise, we still face the same issue if
a new hvc backend using dma added in the furture.

We make 'char c[N_OUTBUF]' part of 'struct hvc_struct', so hp->c is no
longer the stack memory. we can use it in above two cases.

Other cleanup is use ARCH_DMA_MINALIGN for align, and make 'hp->outbuf'
aligned and use struct_size() for the size parameter of kzalloc().

With the patch, we can remove the fix c4baad5029.

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Tested-by: Xianting Tian <xianting.tian@linux.alibaba.com>
---
 drivers/tty/hvc/hvc_console.c | 40 +++++++++++++++++++++--------------
 drivers/tty/hvc/hvc_console.h | 16 ++++++++++++--
 2 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 5bb8c4e44..c56564eb7 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -41,16 +41,6 @@
  */
 #define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */
 
-/*
- * These sizes are most efficient for vio, because they are the
- * native transfer size. We could make them selectable in the
- * future to better deal with backends that want other buffer sizes.
- */
-#define N_OUTBUF	16
-#define N_INBUF		16
-
-#define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
-
 static struct tty_driver *hvc_driver;
 static struct task_struct *hvc_task;
 
@@ -142,6 +132,7 @@ static int hvc_flush(struct hvc_struct *hp)
 static const struct hv_ops *cons_ops[MAX_NR_HVC_CONSOLES];
 static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
 	{[0 ... MAX_NR_HVC_CONSOLES - 1] = -1};
+static char *cons_outbuf[MAX_NR_HVC_CONSOLES];
 
 /*
  * Console APIs, NOT TTY.  These APIs are available immediately when
@@ -151,18 +142,23 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
 static void hvc_console_print(struct console *co, const char *b,
 			      unsigned count)
 {
-	char c[N_OUTBUF] __ALIGNED__;
+	char *c;
 	unsigned i = 0, n = 0;
 	int r, donecr = 0, index = co->index;
+	unsigned long flags;
+	struct hvc_struct *hp;
 
 	/* Console access attempt outside of acceptable console range. */
 	if (index >= MAX_NR_HVC_CONSOLES)
 		return;
 
 	/* This console adapter was removed so it is not usable. */
-	if (vtermnos[index] == -1)
+	if (vtermnos[index] == -1 || !cons_outbuf[index])
 		return;
 
+	c = cons_outbuf[index];
+
+	spin_lock_irqsave(&hp->c_lock, flags);
 	while (count > 0 || i > 0) {
 		if (count > 0 && i < sizeof(c)) {
 			if (b[n] == '\n' && !donecr) {
@@ -191,6 +187,7 @@ static void hvc_console_print(struct console *co, const char *b,
 			}
 		}
 	}
+	spin_unlock_irqrestore(&hp->c_lock, flags);
 	hvc_console_flush(cons_ops[index], vtermnos[index]);
 }
 
@@ -878,9 +875,19 @@ static void hvc_poll_put_char(struct tty_driver *driver, int line, char ch)
 	struct tty_struct *tty = driver->ttys[0];
 	struct hvc_struct *hp = tty->driver_data;
 	int n;
+	unsigned long flags;
+	char *c;
+
+	if (!hp || !cons_outbuf[hp->index])
+		return;
+
+	c = cons_outbuf[hp->index];
 
 	do {
-		n = hp->ops->put_chars(hp->vtermno, &ch, 1);
+		spin_lock_irqsave(&hp->c_lock, flags);
+		c[0] = ch;
+		n = hp->ops->put_chars(hp->vtermno, c, 1);
+		spin_unlock_irqrestore(&hp->c_lock, flags);
 	} while (n <= 0);
 }
 #endif
@@ -922,8 +929,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
 			return ERR_PTR(err);
 	}
 
-	hp = kzalloc(ALIGN(sizeof(*hp), sizeof(long)) + outbuf_size,
-			GFP_KERNEL);
+	hp = kzalloc(struct_size(hp, outbuf, outbuf_size), GFP_KERNEL);
 	if (!hp)
 		return ERR_PTR(-ENOMEM);
 
@@ -931,13 +937,13 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
 	hp->data = data;
 	hp->ops = ops;
 	hp->outbuf_size = outbuf_size;
-	hp->outbuf = &((char *)hp)[ALIGN(sizeof(*hp), sizeof(long))];
 
 	tty_port_init(&hp->port);
 	hp->port.ops = &hvc_port_ops;
 
 	INIT_WORK(&hp->tty_resize, hvc_set_winsz);
 	spin_lock_init(&hp->lock);
+	spin_lock_init(&hp->c_lock);
 	mutex_lock(&hvc_structs_mutex);
 
 	/*
@@ -964,6 +970,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
 	if (i < MAX_NR_HVC_CONSOLES) {
 		cons_ops[i] = ops;
 		vtermnos[i] = vtermno;
+		cons_outbuf[i] = hp->c;
 	}
 
 	list_add_tail(&(hp->next), &hvc_structs);
@@ -988,6 +995,7 @@ int hvc_remove(struct hvc_struct *hp)
 	if (hp->index < MAX_NR_HVC_CONSOLES) {
 		vtermnos[hp->index] = -1;
 		cons_ops[hp->index] = NULL;
+		cons_outbuf[hp->index] = NULL;
 	}
 
 	/* Don't whack hp->irq because tty_hangup() will need to free the irq. */
diff --git a/drivers/tty/hvc/hvc_console.h b/drivers/tty/hvc/hvc_console.h
index 18d005814..52374e2da 100644
--- a/drivers/tty/hvc/hvc_console.h
+++ b/drivers/tty/hvc/hvc_console.h
@@ -32,13 +32,21 @@
  */
 #define HVC_ALLOC_TTY_ADAPTERS	8
 
+/*
+ * These sizes are most efficient for vio, because they are the
+ * native transfer size. We could make them selectable in the
+ * future to better deal with backends that want other buffer sizes.
+ */
+#define N_OUTBUF	16
+#define N_INBUF		16
+
+#define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
+
 struct hvc_struct {
 	struct tty_port port;
 	spinlock_t lock;
 	int index;
 	int do_wakeup;
-	char *outbuf;
-	int outbuf_size;
 	int n_outbuf;
 	uint32_t vtermno;
 	const struct hv_ops *ops;
@@ -48,6 +56,10 @@ struct hvc_struct {
 	struct work_struct tty_resize;
 	struct list_head next;
 	unsigned long flags;
+	spinlock_t c_lock;
+	char c[N_OUTBUF] __ALIGNED__;
+	int outbuf_size;
+	char outbuf[0] __ALIGNED__;
 };
 
 /* implemented by a low level driver */
-- 
2.17.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox