LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Paolo Bonzini @ 2021-06-24 10:21 UTC (permalink / raw)
  To: Nicholas Piggin, Aleksandar Markovic, Huacai Chen, Marc Zyngier,
	Paul Mackerras, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <1624529635.75a1ann91v.astroid@bobo.none>

On 24/06/21 12:17, Nicholas Piggin wrote:
>> If all callers were updated that is one thing, but from the changelog
>> it sounds like that would not happen and there would be some gfn_to_pfn
>> users left over.
>>
>> But yes in the end you would either need to make gfn_to_pfn never return
>> a page found via follow_pte, or change all callers to the new way. If
>> the plan is for the latter then I guess that's fine.
>
> Actually in that case anyway I don't see the need -- the existence of
> gfn_to_pfn is enough to know it might be buggy. It can just as easily
> be grepped for as kvm_pfn_page_unwrap.

Sure, but that would leave us with longer function names 
(gfn_to_pfn_page* instead of gfn_to_pfn*).  So the "safe" use is the one 
that looks worse and the unsafe use is the one that looks safe.

> And are gfn_to_page cases also
> vulernable to the same issue?

No, they're just broken for the VM_IO|VM_PFNMAP case.

Paolo

> So I think it could be marked deprecated or something if not everything
> will be converted in the one series, and don't need to touch all that
> arch code with this patch.


^ permalink raw reply

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
From: Laurent Dufour @ 2021-06-24 10:33 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe
  Cc: Nathan Lynch, nvdimm, dan.j.williams, Daniel Henrique Barboza,
	David Gibson
In-Reply-To: <20210617165105.574178-8-aneesh.kumar@linux.ibm.com>

Hi Aneesh,

A little bit of wordsmithing below...

Le 17/06/2021 à 18:51, Aneesh Kumar K.V a écrit :
> PAPR interface currently supports two different ways of communicating resource
> grouping details to the OS. These are referred to as Form 0 and Form 1
> associativity grouping. Form 0 is the older format and is now considered
> deprecated. This patch adds another resource grouping named FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>   arch/powerpc/include/asm/firmware.h       |   3 +-
>   arch/powerpc/include/asm/prom.h           |   1 +
>   arch/powerpc/kernel/prom_init.c           |   3 +-
>   arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>   arch/powerpc/platforms/pseries/firmware.c |   1 +
>   6 files changed, 286 insertions(+), 6 deletions(-)
>   create mode 100644 Documentation/powerpc/associativity.rst
> 
> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> new file mode 100644
> index 000000000000..93be604ac54d
> --- /dev/null
> +++ b/Documentation/powerpc/associativity.rst
> @@ -0,0 +1,135 @@
> +============================
> +NUMA resource associativity
> +=============================
> +
> +Associativity represents the groupings of the various platform resources into
> +domains of substantially similar mean performance relative to resources outside
> +of that domain. Resources subsets of a given domain that exhibit better
> +performance relative to each other than relative to other resources subsets
> +are represented as being members of a sub-grouping domain. This performance
> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> +From the platform view, these groups are also referred to as domains.
> +
> +PAPR interface currently supports different ways of communicating these resource
> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> +associativity grouping. Form 0 is the older format and is now considered deprecated.
> +
> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
                                                            architecture ^

> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> +
> +Form 0
> +-----
> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
> +
> +Form 1
> +-----
> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> +
> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
> +representing the resource’s platform grouping domains.
> +
> +The “ibm,associativity-reference-points” property contains one or more list of numbers
> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> +The list of domainID index represnets increasing hierachy of resource grouping.
                         represents ^

> +
> +ex:
> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> +
> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> +Linux kernel computes NUMA distance between two domains by recursively comparing
> +if they belong to the same higher-level domains. For mismatch at every higher
> +level of the resource group, the kernel doubles the NUMA distance between the
> +comparing domains.
> +
> +Form 2
> +-------
> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> +domain numbering. With numa distance computation now detached from the index value of
> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
> +same domainID index representing resource groups of different performance/latency characteristics.
> +
> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> +"ibm,architecture-vec-5" property.
> +
> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
> +the domainIDs present in the system. The offset of the domainID in this property is considered
> +the domainID index.
> +
> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> +N domainID encoded as with encode-int
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
> +
> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
> +distance between resource groups/domains present in the system.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> +
> +  | 0    8   40
> +--|------------
> +  |
> +0 | 10   20  80
> +  |
> +8 | 20   10  160
> +  |
> +40| 80   160  10
> +
> +
> +"ibm,associativity" property for resources in node 0, 8 and 40
> +
> +{ 3, 6, 7, 0 }
> +{ 3, 6, 9, 8 }
> +{ 3, 6, 7, 40}
> +
> +With "ibm,associativity-reference-points"  { 0x3 }
> +
> +Each resource (drcIndex) now also supports additional optional device tree properties.
> +These properties are marked optional because the platform can choose not to export
> +them and provide the system topology details using the earlier defined device tree
> +properties alone. The optional device tree properties are used when adding new resources
> +(DLPAR) and when the platform didn't provide the topology details of the domain which
> +contains the newly added resource during boot.
> +
> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
> +when building the NUMA distance of the numa node to which this resource belongs. This can
> +be looked at as the index at which this new domainID would have appeared in
> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
> +of the new resource can be obtained from the existing "ibm,associativity" property. This
> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
> +The value is 1 based array index value.
> +
> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
> +
> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
> +from this resource domain to other resources.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +
> +For ex:
> +ibm,associativity     = { 4, 5, 10, 50}

Is missing the first byte of the property (length) or an associativity number?

> +ibm,numa-lookup-index = { 4 }
> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
> +
> +resulting in a new toplogy as below.
> +  | 0    8   40   50
> +--|------------------
> +  |
> +0 | 10   20  80   160
> +  |
> +8 | 20   10  160  255
> +  |
> +40| 80   160  10  80
> +  |
> +50| 160  255  80  10
> +
> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> index 60b631161360..97a3bd9ffeb9 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -53,6 +53,7 @@
>   #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>   #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>   #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>   
>   #ifndef __ASSEMBLY__
>   
> @@ -73,7 +74,7 @@ enum {
>   		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>   		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>   		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> -		FW_FEATURE_RPT_INVALIDATE,
> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>   	FW_FEATURE_PSERIES_ALWAYS = 0,
>   	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>   	FW_FEATURE_POWERNV_ALWAYS = 0,
> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> index df9fec9d232c..5c80152e8f18 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>   #define OV5_XCMO		0x0440	/* Page Coalescing */
>   #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>   #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>   #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>   #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>   #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 64b9593038a7..496fdac54c29 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>   #else
>   		0,
>   #endif
> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>   		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>   		.micro_checkpoint = 0,
>   		.reserved0 = 0,
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index d32729f235b8..5a7d94960fb7 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>   
>   #define FORM0_AFFINITY 0
>   #define FORM1_AFFINITY 1
> +#define FORM2_AFFINITY 2
>   static int affinity_form;
>   
>   #define MAX_DISTANCE_REF_POINTS 4
>   static int max_associativity_domain_index;
>   static const __be32 *distance_ref_points;
>   static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> +};
> +static int numa_id_index_table[MAX_NUMNODES];
>   
>   /*
>    * Allocate node_to_cpumask_map based on number of available nodes
> @@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
>   }
>   #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>   
> +/*
> + * With FORM2 if we are not using logical domain ids, we will find
> + * both primary and seconday domains with same value. Hence always
> + * start comparison from secondary domains
> + */
> +static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +{
> +	int dist = 0;
> +

extra blank line.

> +	int i, index;
> +
> +	for (i = 1; i < max_associativity_domain_index; i++) {
> +		index = be32_to_cpu(distance_ref_points[i]);
> +		if (cpu1_assoc[index] == cpu2_assoc[index])
> +			break;
> +		dist++;
> +	}
> +
> +	return dist;
> +}
> +
>   static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   {
>   	int dist = 0;
> @@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   			break;
>   		dist++;
>   	}
> -
>   	return dist;
>   }
>   
> @@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   {
>   	/* We should not get called with FORM0 */
>   	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> -
> -	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> +	if (affinity_form == FORM1_AFFINITY)
> +		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> +	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
>   }
>   
>   /* must hold reference to node during call */
> @@ -201,7 +227,9 @@ int __node_distance(int a, int b)
>   	int i;
>   	int distance = LOCAL_DISTANCE;
>   
> -	if (affinity_form == FORM0_AFFINITY)
> +	if (affinity_form == FORM2_AFFINITY)
> +		return numa_distance_table[a][b];
> +	else if (affinity_form == FORM0_AFFINITY)
>   		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>   
>   	for (i = 0; i < max_associativity_domain_index; i++) {
> @@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
>   
>   /*
>    * Used to update distance information w.r.t newly added node.
> + * ibm,numa-lookup-index -> 4
> + * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
>    */
>   void update_numa_distance(struct device_node *node)
>   {
> +	int i, nid, other_nid, other_nid_index = 0;
> +	const __be32 *numa_indexp;
> +	const __u8  *numa_distancep;
> +	int numa_index, max_numa_index, numa_distance;
> +
>   	if (affinity_form == FORM0_AFFINITY)
>   		return;
>   	else if (affinity_form == FORM1_AFFINITY) {
>   		initialize_form1_numa_distance(node);
>   		return;
>   	}
> +	/* FORM2 affinity  */
> +
> +	nid = of_node_to_nid_single(node);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	/* Already initialized */
> +	if (numa_distance_table[nid][nid] != -1)
> +		return;
> +	/*
> +	 * update node distance if not already populated.
> +	 */
> +	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
> +	if (!numa_distancep)
> +		return;
> +
> +	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
> +	if (!numa_indexp)
> +		return;
> +
> +	numa_index = of_read_number(numa_indexp, 1);
> +	/*
> +	 * update the numa_id_index_table. Device tree look at index table as
> +	 * 1 based array indexing.
> +	 */
> +	numa_id_index_table[numa_index - 1] = nid;
> +
> +	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
> +	VM_WARN_ON(max_numa_index != 2 * numa_index);

Could you explain shortly in a comment the meaning of this VM_WARN_ON check?

> +	/* Skip the size which is encoded int */
> +	numa_distancep += sizeof(__be32);
> +
> +	/*
> +	 * First fill the distance information from other node to this node.
> +	 */
> +	other_nid_index = 0;
> +	for (i = 0; i < numa_index; i++) {
> +		numa_distance = numa_distancep[i];
> +		other_nid = numa_id_index_table[other_nid_index++];
> +		numa_distance_table[other_nid][nid] = numa_distance;
> +	}
> +
> +	other_nid_index = 0;
> +	for (; i < max_numa_index; i++) {
> +		numa_distance = numa_distancep[i];
> +		other_nid = numa_id_index_table[other_nid_index++];
> +		numa_distance_table[nid][other_nid] = numa_distance;
> +	}
> +}
> +
> +/*
> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> + */
> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +{
> +	const __u8 *numa_dist_table;
> +	const __be32 *numa_lookup_index;
> +	int numa_dist_table_length;
> +	int max_numa_index, distance_index;
> +	int i, curr_row = 0, curr_column = 0;
> +
> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> +
> +	/* first element of the array is the size and is encode-int */
> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> +	/* Skip the size which is encoded int */
> +	numa_dist_table += sizeof(__be32);
> +
> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
> +		 numa_dist_table_length, max_numa_index);
> +
> +	for (i = 0; i < max_numa_index; i++)
> +		/* +1 skip the max_numa_index in the property */
> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> +
> +
> +	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);
> +
> +	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
> +		int nodeA = numa_id_index_table[curr_row];
> +		int nodeB = numa_id_index_table[curr_column++];
> +
> +		numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index];
> +
> +		pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> +		if (curr_column >= max_numa_index) {
> +			curr_row++;
> +			/* reset the column */
> +			curr_column = 0;
> +		}
> +	}
>   }
>   
>   static int __init find_primary_domain_index(void)
> @@ -324,6 +453,9 @@ static int __init find_primary_domain_index(void)
>   	 */
>   	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>   		affinity_form = FORM1_AFFINITY;
> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> +		dbg("Using form 2 affinity\n");
> +		affinity_form = FORM2_AFFINITY;
>   	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>   		dbg("Using form 1 affinity\n");
>   		affinity_form = FORM1_AFFINITY;
> @@ -368,8 +500,17 @@ static int __init find_primary_domain_index(void)
>   
>   		index = of_read_number(&distance_ref_points[1], 1);
>   	} else {
> +		/*
> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> +		 * at the same offset.
> +		 */
>   		index = of_read_number(distance_ref_points, 1);
>   	}
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table(root);
>   
>   	/*
>   	 * Warn and cap if the hardware supports more than
> diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
> index 5d4c2bc20bba..f162156b7b68 100644
> --- a/arch/powerpc/platforms/pseries/firmware.c
> +++ b/arch/powerpc/platforms/pseries/firmware.c
> @@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
>   	{FW_FEATURE_PRRN,		OV5_PRRN},
>   	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
>   	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
> +	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
>   };
>   
>   static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
> 


^ permalink raw reply

* Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
From: Nicholas Piggin @ 2021-06-24 10:34 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, David Stevens, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	Suzuki K Poulose, James Morse, kvm-ppc, Sean Christopherson,
	Vitaly Kuznetsov, linux-mips, intel-gvt-dev, Joerg Roedel,
	linux-arm-kernel, Jim Mattson
In-Reply-To: <20210624035749.4054934-1-stevensd@google.com>

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
> assoicated struct pages, so they should not be passed to pfn_to_page.
> This series removes such calls from the x86 and arm64 secondary MMU. To
> do this, this series modifies gfn_to_pfn to return a struct page in
> addition to a pfn, if the hva was resolved by gup. This allows the
> caller to call put_page only when necessated by gup.
> 
> This series provides a helper function that unwraps the new return type
> of gfn_to_pfn to provide behavior identical to the old behavior. As I
> have no hardware to test powerpc/mips changes, the function is used
> there for minimally invasive changes. Additionally, as gfn_to_page and
> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
> easily changed over to only use pfns.
> 
> This addresses CVE-2021-22543 on x86 and arm64.

Does this fix the problem? (untested I don't have a POC setup at hand,
but at least in concept)

I have no problem with improving the API and probably in the direction 
of your series is good. But there seems to be a lot of unfixed arch code 
and broken APIs remaining left to do after your series too. This might 
be most suitable to backport and as a base for your series that can take
more time to convert to new APIs.

Thanks,
Nick

---

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..e208c279d903 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2104,13 +2104,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
 	 */ 
-	kvm_get_pfn(pfn);
+	if (!kvm_try_get_pfn(pfn))
+		r = -EFAULT;
 
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
-	return 0;
+
+	return r;
 }
 
 /*
@@ -2487,6 +2495,13 @@ void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+	if (kvm_is_reserved_pfn(pfn))
+		return 1;
+	return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 void kvm_get_pfn(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn))

^ permalink raw reply related

* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Nicholas Piggin @ 2021-06-24 10:42 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <fc2a88ed-6a98-857d-bb1f-73260b01ac30@redhat.com>

Excerpts from Paolo Bonzini's message of June 24, 2021 8:21 pm:
> On 24/06/21 12:17, Nicholas Piggin wrote:
>>> If all callers were updated that is one thing, but from the changelog
>>> it sounds like that would not happen and there would be some gfn_to_pfn
>>> users left over.
>>>
>>> But yes in the end you would either need to make gfn_to_pfn never return
>>> a page found via follow_pte, or change all callers to the new way. If
>>> the plan is for the latter then I guess that's fine.
>>
>> Actually in that case anyway I don't see the need -- the existence of
>> gfn_to_pfn is enough to know it might be buggy. It can just as easily
>> be grepped for as kvm_pfn_page_unwrap.
> 
> Sure, but that would leave us with longer function names 
> (gfn_to_pfn_page* instead of gfn_to_pfn*).  So the "safe" use is the one 
> that looks worse and the unsafe use is the one that looks safe.

The churn isn't justified because of function name length. Choose g2pp() 
if you want a non-descriptive but short name.

The existing name isn't good anyway because it not only looks up a pfn 
but also a page, and more importantly it gets a ref on the page. The
name should be changed if you introduce a new API.

>> And are gfn_to_page cases also
>> vulernable to the same issue?
> 
> No, they're just broken for the VM_IO|VM_PFNMAP case.

No they aren't vulnerable, or they are vunlerable but also broken in 
other cases?

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 4/6] KVM: arm64/mmu: avoid struct page in MMU
From: Marc Zyngier @ 2021-06-24 10:43 UTC (permalink / raw)
  To: David Stevens
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	intel-gvt-dev, Joerg Roedel, Huacai Chen, Aleksandar Markovic,
	Zhi Wang, Suzuki K Poulose, intel-gfx, kvm-ppc, Zhenyu Wang,
	Alexandru Elisei, linux-arm-kernel, Jim Mattson,
	Sean Christopherson, linux-kernel, James Morse, Paolo Bonzini,
	Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <20210624035749.4054934-5-stevensd@google.com>

On Thu, 24 Jun 2021 04:57:47 +0100,
David Stevens <stevensd@chromium.org> wrote:
> 
> From: David Stevens <stevensd@chromium.org>
> 
> Avoid converting pfns returned by follow_fault_pfn to struct pages to
> transiently take a reference. The reference was originally taken to
> match the reference taken by gup. However, pfns returned by
> follow_fault_pfn may not have a struct page set up for reference
> counting.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/arm64/kvm/mmu.c | 43 +++++++++++++++++++++++--------------------
>  1 file changed, 23 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 896b3644b36f..a741972cb75f 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c

[...]

> @@ -968,16 +968,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 */
>  	if (vma_pagesize == PAGE_SIZE && !force_pte)
>  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> -							   &pfn, &fault_ipa);
> +							   &pfnpg, &fault_ipa);
>  	if (writable)
>  		prot |= KVM_PGTABLE_PROT_W;
>  
>  	if (fault_status != FSC_PERM && !device)
> -		clean_dcache_guest_page(pfn, vma_pagesize);
> +		clean_dcache_guest_page(pfnpg.pfn, vma_pagesize);
>  
>  	if (exec_fault) {
>  		prot |= KVM_PGTABLE_PROT_X;
> -		invalidate_icache_guest_page(pfn, vma_pagesize);
> +		invalidate_icache_guest_page(pfnpg.pfn, vma_pagesize);

This is going to clash with what is currently in -next, specially with
MTE.

Paolo, if you really want to take this in 5.13, can you please push a
stable branch based on -rc4 or older so that I can merge it in and
test it?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU
From: Nicholas Piggin @ 2021-06-24 10:43 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	intel-gvt-dev, Joerg Roedel, Huacai Chen, Aleksandar Markovic,
	Zhi Wang, Suzuki K Poulose, intel-gfx, kvm-ppc, Zhenyu Wang,
	Alexandru Elisei, linux-arm-kernel, Jim Mattson,
	Sean Christopherson, linux-kernel, James Morse, David Stevens,
	Paolo Bonzini, Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <87mtrfinks.wl-maz@kernel.org>

Excerpts from Marc Zyngier's message of June 24, 2021 8:06 pm:
> On Thu, 24 Jun 2021 09:58:00 +0100,
> Nicholas Piggin <npiggin@gmail.com> wrote:
>> 
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> > From: David Stevens <stevensd@chromium.org>
>> >  out_unlock:
>> >  	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>> >  		read_unlock(&vcpu->kvm->mmu_lock);
>> >  	else
>> >  		write_unlock(&vcpu->kvm->mmu_lock);
>> > -	kvm_release_pfn_clean(pfn);
>> > +	if (pfnpg.page)
>> > +		put_page(pfnpg.page);
>> >  	return r;
>> >  }
>> 
>> How about
>> 
>>   kvm_release_pfn_page_clean(pfnpg);
> 
> I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
> doesn't mark the page 'clean'. I find put_page() more correct.
> 
> Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
> bad at naming things that I could just as well call it 'bob()'.

That seems like a fine name to me. A little better than bob.

Thanks,
Nick

^ permalink raw reply

* [PATCH] ASoC: fsl_xcvr: Omit superfluous error message in fsl_xcvr_probe()
From: Tang Bin @ 2021-06-24 10:45 UTC (permalink / raw)
  To: timur, nicoleotsuka, Xiubo.Lee, lgirdwood, broonie, perex, tiwai
  Cc: alsa-devel, linuxppc-dev, linux-kernel, Tang Bin

In the function fsl_xcvr__probe(), when get irq failed,
the function platform_get_irq() logs an error message, so remove
redundant message here.

Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com>
---
 sound/soc/fsl/fsl_xcvr.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/sound/soc/fsl/fsl_xcvr.c b/sound/soc/fsl/fsl_xcvr.c
index 5e8284db857b..711d738f8de1 100644
--- a/sound/soc/fsl/fsl_xcvr.c
+++ b/sound/soc/fsl/fsl_xcvr.c
@@ -1190,10 +1190,8 @@ static int fsl_xcvr_probe(struct platform_device *pdev)
 
 	/* get IRQs */
 	irq = platform_get_irq(pdev, 0);
-	if (irq < 0) {
-		dev_err(dev, "no irq[0]: %d\n", irq);
+	if (irq < 0)
 		return irq;
-	}
 
 	ret = devm_request_irq(dev, irq, irq0_isr, 0, pdev->name, xcvr);
 	if (ret) {
-- 
2.18.2




^ permalink raw reply related

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
From: Aneesh Kumar K.V @ 2021-06-24 10:55 UTC (permalink / raw)
  To: Laurent Dufour, linuxppc-dev, mpe
  Cc: Nathan Lynch, nvdimm, dan.j.williams, Daniel Henrique Barboza,
	David Gibson
In-Reply-To: <6287f135-54a1-5c5e-6a7f-a8e4c0dc5113@linux.ibm.com>

On 6/24/21 4:03 PM, Laurent Dufour wrote:
> Hi Aneesh,
> 
> A little bit of wordsmithing below...
> 
> Le 17/06/2021 à 18:51, Aneesh Kumar K.V a écrit :
>> PAPR interface currently supports two different ways of communicating 
>> resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>>   arch/powerpc/include/asm/firmware.h       |   3 +-
>>   arch/powerpc/include/asm/prom.h           |   1 +
>>   arch/powerpc/kernel/prom_init.c           |   3 +-
>>   arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>>   arch/powerpc/platforms/pseries/firmware.c |   1 +
>>   6 files changed, 286 insertions(+), 6 deletions(-)
>>   create mode 100644 Documentation/powerpc/associativity.rst
>>
>> diff --git a/Documentation/powerpc/associativity.rst 
>> b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +============================
>> +NUMA resource associativity
>> +=============================
>> +
>> +Associativity represents the groupings of the various platform 
>> resources into
>> +domains of substantially similar mean performance relative to 
>> resources outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources 
>> subsets
>> +are represented as being members of a sub-grouping domain. This 
>> performance
>> +characteristic is presented in terms of NUMA node distance within the 
>> Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating 
>> these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 
>> and Form2
>> +associativity grouping. Form 0 is the older format and is now 
>> considered deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via 
>> "ibm,arcitecture-vec-5 property".
>                                                             architecture ^
> 

fixed

>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates 
>> usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
>> associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-----
>> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-----
>> +With Form 1 a combination of ibm,associativity-reference-points and 
>> ibm,associativity
>> +device tree properties are used to determine the NUMA distance 
>> between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of 
>> numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or 
>> more list of numbers
>> +(domainID index) that represents the 1 based ordinal in the 
>> associativity lists.
>> +The list of domainID index represnets increasing hierachy of resource 
>> grouping.
>                          represents ^
> 

fixed

>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID 
>> index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the 
>> NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by 
>> recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at 
>> every higher
>> +level of the resource group, the kernel doubles the NUMA distance 
>> between the
>> +comparing domains.
>> +
>> +Form 2
>> +-------
>> +Form 2 associativity format adds separate device tree properties 
>> representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also 
>> allows flexible primary
>> +domain numbering. With numa distance computation now detached from 
>> the index value of
>> +"ibm,associativity" property, Form 2 allows a large number of primary 
>> domain ids at the
>> +same domainID index representing resource groups of different 
>> performance/latency characteristics.
>> +
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of 
>> byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains one or more list 
>> numbers representing
>> +the domainIDs present in the system. The offset of the domainID in 
>> this property is considered
>> +the domainID index.
>> +
>> +prop-encoded-array: The number N of the domainIDs encoded as with 
>> encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index 
>> for domainID 8 is 1.
>> +
>> +"ibm,numa-distance-table" property contains one or more list of 
>> numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as 
>> with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance 
>> value we could encode is 255.
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>> +
>> +  | 0    8   40
>> +--|------------
>> +  |
>> +0 | 10   20  80
>> +  |
>> +8 | 20   10  160
>> +  |
>> +40| 80   160  10
>> +
>> +
>> +"ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points"  { 0x3 }
>> +
>> +Each resource (drcIndex) now also supports additional optional device 
>> tree properties.
>> +These properties are marked optional because the platform can choose 
>> not to export
>> +them and provide the system topology details using the earlier 
>> defined device tree
>> +properties alone. The optional device tree properties are used when 
>> adding new resources
>> +(DLPAR) and when the platform didn't provide the topology details of 
>> the domain which
>> +contains the newly added resource during boot.
>> +
>> +"ibm,numa-lookup-index" property contains a number representing the 
>> domainID index to be used
>> +when building the NUMA distance of the numa node to which this 
>> resource belongs. This can
>> +be looked at as the index at which this new domainID would have 
>> appeared in
>> +"ibm,numa-lookup-index-table" if the domain was present during boot. 
>> The domainID
>> +of the new resource can be obtained from the existing 
>> "ibm,associativity" property. This
>> +can be used to build distance information of a newly onlined NUMA 
>> node via DLPAR operation.
>> +The value is 1 based array index value.
>> +
>> +prop-encoded-array: An integer encoded as with encode-int specifying 
>> the domainID index
>> +
>> +"ibm,numa-distance" property contains one or more list of numbers 
>> presenting the NUMA distance
>> +from this resource domain to other resources.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as 
>> with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance 
>> value we could encode is 255.
>> +
>> +For ex:
>> +ibm,associativity     = { 4, 5, 10, 50}
> 
> Is missing the first byte of the property (length) or an associativity 
> number?
> 

that should be {3, 5,10,50}  fixed.

>> +ibm,numa-lookup-index = { 4 }
>> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
>> +
>> +resulting in a new toplogy as below.
>> +  | 0    8   40   50
>> +--|------------------
>> +  |
>> +0 | 10   20  80   160
>> +  |
>> +8 | 20   10  160  255
>> +  |
>> +40| 80   160  10  80
>> +  |
>> +50| 160  255  80  10
>> +
>> diff --git a/arch/powerpc/include/asm/firmware.h 
>> b/arch/powerpc/include/asm/firmware.h
>> index 60b631161360..97a3bd9ffeb9 100644
>> --- a/arch/powerpc/include/asm/firmware.h
>> +++ b/arch/powerpc/include/asm/firmware.h
>

...

>> +    numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
>> +    if (!numa_distancep)
>> +        return;
>> +
>> +    numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
>> +    if (!numa_indexp)
>> +        return;
>> +
>> +    numa_index = of_read_number(numa_indexp, 1);
>> +    /*
>> +     * update the numa_id_index_table. Device tree look at index 
>> table as
>> +     * 1 based array indexing.
>> +     */
>> +    numa_id_index_table[numa_index - 1] = nid;
>> +
>> +    max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
>> +    VM_WARN_ON(max_numa_index != 2 * numa_index);
> 
> Could you explain shortly in a comment the meaning of this VM_WARN_ON 
> check?
> 

Based on the other review feedback this is dropped. We now derive domain 
distance offset based on the number of elements in "ibm,numa-distance"

>> +    /* Skip the size which is encoded int */
>> +    numa_distancep += sizeof(__be32);
>> +
>> +    /*
>> +     * First fill the distance information from other node to this node.
>> +     */
>> +    other_nid_index = 0;
>> +    for (i = 0; i < numa_index; i++) {
>> +        numa_distance = numa_distancep[i];
>> +        other_nid = numa_id_index_table[other_nid_index++];
>> +        numa_distance_table[other_nid][nid] = numa_distance;
>> +    }
>> +
>> +    other_nid_index = 0;
>> +    for (; i < max_numa_index; i++) {
>> +        numa_distance = numa_distancep[i];
>> +        other_nid = numa_id_index_table[other_nid_index++];
>> +        numa_distance_table[nid][other_nid] = numa_distance;
>> +    }
>> +}
>> +

Thanks for reviewing the patch.

-aneesh

^ permalink raw reply

* Re: [PATCH v2] powerpc/kprobes: Fix Oops by passing ppc_inst as a pointer to emulate_step() on ppc32
From: Naveen N. Rao @ 2021-06-24 10:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <5bdc8cbc9a95d0779e27c9ddbf42b40f51f883c0.1624425798.git.christophe.leroy@csgroup.eu>

Christophe Leroy wrote:
> From: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> 
> Trying to use a kprobe on ppc32 results in the below splat:
>     BUG: Unable to handle kernel data access on read at 0x7c0802a6
>     Faulting instruction address: 0xc002e9f0
>     Oops: Kernel access of bad area, sig: 11 [#1]
>     BE PAGE_SIZE=4K PowerPC 44x Platform
>     Modules linked in:
>     CPU: 0 PID: 89 Comm: sh Not tainted 5.13.0-rc1-01824-g3a81c0495fdb #7
>     NIP:  c002e9f0 LR: c0011858 CTR: 00008a47
>     REGS: c292fd50 TRAP: 0300   Not tainted  (5.13.0-rc1-01824-g3a81c0495fdb)
>     MSR:  00009000 <EE,ME>  CR: 24002002  XER: 20000000
>     DEAR: 7c0802a6 ESR: 00000000
>     <snip>
>     NIP [c002e9f0] emulate_step+0x28/0x324
>     LR [c0011858] optinsn_slot+0x128/0x10000
>     Call Trace:
>      opt_pre_handler+0x7c/0xb4 (unreliable)
>      optinsn_slot+0x128/0x10000
>      ret_from_syscall+0x0/0x28
> 
> The offending instruction is:
>     81 24 00 00     lwz     r9,0(r4)
> 
> Here, we are trying to load the second argument to emulate_step():
> struct ppc_inst, which is the instruction to be emulated. On ppc64,
> structures are passed in registers when passed by value. However, per
> the ppc32 ABI, structures are always passed to functions as pointers.
> This isn't being adhered to when setting up the call to emulate_step()
> in the optprobe trampoline. Fix the same.
> 
> Fixes: eacf4c0202654a ("powerpc: Enable OPTPROBES on PPC32")
> Cc: stable@vger.kernel.org
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> ---
> v2: Rebased on powerpc/merge 7f030e9d57b8
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

Thanks for rebasing this!

I think git am drops everything after three dashes, so applying this 
patch will drop your SOB. The recommended style (*) for adding a 
changelog is to include it within [] before the second SOB.


- Naveen

(*) https://www.kernel.org/doc/html/latest/maintainer/modifying-patches.html


^ permalink raw reply

* Re: [PATCH v2] powerpc/kprobes: Fix Oops by passing ppc_inst as a pointer to emulate_step() on ppc32
From: Christophe Leroy @ 2021-06-24 11:09 UTC (permalink / raw)
  To: Naveen N. Rao, Benjamin Herrenschmidt, Michael Ellerman,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1624531892.3vdz8ibfty.naveen@linux.ibm.com>



Le 24/06/2021 à 12:59, Naveen N. Rao a écrit :
> Christophe Leroy wrote:
>> From: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>>
>> Trying to use a kprobe on ppc32 results in the below splat:
>>     BUG: Unable to handle kernel data access on read at 0x7c0802a6
>>     Faulting instruction address: 0xc002e9f0
>>     Oops: Kernel access of bad area, sig: 11 [#1]
>>     BE PAGE_SIZE=4K PowerPC 44x Platform
>>     Modules linked in:
>>     CPU: 0 PID: 89 Comm: sh Not tainted 5.13.0-rc1-01824-g3a81c0495fdb #7
>>     NIP:  c002e9f0 LR: c0011858 CTR: 00008a47
>>     REGS: c292fd50 TRAP: 0300   Not tainted  (5.13.0-rc1-01824-g3a81c0495fdb)
>>     MSR:  00009000 <EE,ME>  CR: 24002002  XER: 20000000
>>     DEAR: 7c0802a6 ESR: 00000000
>>     <snip>
>>     NIP [c002e9f0] emulate_step+0x28/0x324
>>     LR [c0011858] optinsn_slot+0x128/0x10000
>>     Call Trace:
>>      opt_pre_handler+0x7c/0xb4 (unreliable)
>>      optinsn_slot+0x128/0x10000
>>      ret_from_syscall+0x0/0x28
>>
>> The offending instruction is:
>>     81 24 00 00     lwz     r9,0(r4)
>>
>> Here, we are trying to load the second argument to emulate_step():
>> struct ppc_inst, which is the instruction to be emulated. On ppc64,
>> structures are passed in registers when passed by value. However, per
>> the ppc32 ABI, structures are always passed to functions as pointers.
>> This isn't being adhered to when setting up the call to emulate_step()
>> in the optprobe trampoline. Fix the same.
>>
>> Fixes: eacf4c0202654a ("powerpc: Enable OPTPROBES on PPC32")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>> ---
>> v2: Rebased on powerpc/merge 7f030e9d57b8
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> 
> Thanks for rebasing this!
> 
> I think git am drops everything after three dashes, so applying this patch will drop your SOB. The 
> recommended style (*) for adding a changelog is to include it within [] before the second SOB.
> 

Yes, I saw that after sending the mail. Usually I add a signed-off-by with 'git --amend -s' when I 
add the history, so it goes before the '---' I'm adding.

This time I must have forgotten the '-s' so it was added by the 'git format-patch -s' which is in my 
submit script, and so it was added at the end.

It's not really a big deal, up to Michael to either move it at the right place or discard it, I 
don't really mind. Do the easiest for you.

Thanks
Christophe

^ permalink raw reply

* Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
From: Robin Murphy @ 2021-06-24 11:14 UTC (permalink / raw)
  To: Claire Chang, Christoph Hellwig
  Cc: heikki.krogerus, linux-devicetree, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Jianxiong Gao, Stefano Stabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Bartosz Golaszewski, bskeggs, linux-pci,
	xen-devel, Marek Szyprowski, Dan Williams, matthew.auld,
	Nicolas Boichat, thomas.hellstrom, Jim Quinlan, Will Deacon,
	Konrad Rzeszutek Wilk, intel-gfx, maarten.lankhorst, jani.nikula,
	Rob Herring, rodrigo.vivi, Bjorn Helgaas, boris.ostrovsky,
	Andy Shevchenko, jgross, airlied, Thierry Reding, Greg KH,
	Randy Dunlap, Qian Cai, lkml, list@263.net:IOMMU DRIVERS,
	Daniel Vetter, xypron.glpk, Tom Lendacky, linuxppc-dev, bauerman
In-Reply-To: <CALiNf288ZLMhY3E8E3N+z9rkwi1viWNLm1wwMEwT4rNwh3FfwQ@mail.gmail.com>

On 2021-06-24 07:05, Claire Chang wrote:
> On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
>>> is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
>>>
>>> is_swiotlb_force_bounce() was the new function introduced in this patch here.
>>>
>>> +static inline bool is_swiotlb_force_bounce(struct device *dev)
>>> +{
>>> +     return dev->dma_io_tlb_mem->force_bounce;
>>> +}
>>
>> To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
>> turn this into :
>>
>>          return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;
>>
>> for a quick debug check?
> 
> I just realized that dma_io_tlb_mem might be NULL like Christoph
> pointed out since swiotlb might not get initialized.
> However,  `Unable to handle kernel paging request at virtual address
> dfff80000000000e` looks more like the address is garbage rather than
> NULL?
> I wonder if that's because dev->dma_io_tlb_mem is not assigned
> properly (which means device_initialize is not called?).

What also looks odd is that the base "address" 0xdfff800000000000 is 
held in a couple of registers, but the offset 0xe looks too small to 
match up to any relevant structure member in that dereference chain :/

Robin.

^ permalink raw reply

* Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
From: Will Deacon @ 2021-06-24 11:18 UTC (permalink / raw)
  To: Robin Murphy
  Cc: heikki.krogerus, linux-devicetree, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Jianxiong Gao, Stefano Stabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Marek Szyprowski, Dan Williams,
	matthew.auld, Nicolas Boichat, thomas.hellstrom, Jim Quinlan,
	Konrad Rzeszutek Wilk, intel-gfx, maarten.lankhorst, jani.nikula,
	Rob Herring, rodrigo.vivi, Bjorn Helgaas, Claire Chang,
	boris.ostrovsky, Andy Shevchenko, jgross, airlied, Thierry Reding,
	Greg KH, Randy Dunlap, Qian Cai, lkml, list@263.net:IOMMU DRIVERS,
	Daniel Vetter, xypron.glpk, Tom Lendacky, linuxppc-dev, bauerman
In-Reply-To: <364e6715-eafd-fc4a-e0af-ce2a042756b4@arm.com>

On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote:
> On 2021-06-24 07:05, Claire Chang wrote:
> > On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig <hch@lst.de> wrote:
> > > 
> > > On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
> > > > is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
> > > > 
> > > > is_swiotlb_force_bounce() was the new function introduced in this patch here.
> > > > 
> > > > +static inline bool is_swiotlb_force_bounce(struct device *dev)
> > > > +{
> > > > +     return dev->dma_io_tlb_mem->force_bounce;
> > > > +}
> > > 
> > > To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
> > > turn this into :
> > > 
> > >          return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;
> > > 
> > > for a quick debug check?
> > 
> > I just realized that dma_io_tlb_mem might be NULL like Christoph
> > pointed out since swiotlb might not get initialized.
> > However,  `Unable to handle kernel paging request at virtual address
> > dfff80000000000e` looks more like the address is garbage rather than
> > NULL?
> > I wonder if that's because dev->dma_io_tlb_mem is not assigned
> > properly (which means device_initialize is not called?).
> 
> What also looks odd is that the base "address" 0xdfff800000000000 is held in
> a couple of registers, but the offset 0xe looks too small to match up to any
> relevant structure member in that dereference chain :/

FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't
been initialised but we dereference 'dev->dma_io_tlb_mem', so I think
Christoph's suggestion is needed regardless. But I agree that it won't help
with the issue reported by Qian Cai.

Qian Cai: please can you share your .config and your command line?

Thanks,

Will

^ permalink raw reply

* [PATCH printk v3 0/6] printk: remove safe buffers
From: John Ogness @ 2021-06-24 11:11 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Alexey Kardashevskiy, Alexei Starovoitov, Chris Wilson,
	Wolfram Sang (Renesas), Paul Mackerras, Tiezhu Yang, Marc Zyngier,
	Masahiro Yamada, Russell King, Peter Zijlstra, Yue Hu,
	Ingo Molnar, linux-arm-kernel, Sami Tolvanen, Valentin Schneider,
	Kees Cook, Paul E. McKenney, Anshuman Khandual,
	Frederic Weisbecker, Nicholas Piggin, Nathan Chancellor,
	Nick Terrell, Steven Rostedt, Thomas Gleixner, Vlastimil Babka,
	kexec, linux-kernel, Pekka Enberg, Sergey Senozhatsky,
	Eric Biederman, Andrew Morton, linuxppc-dev, Mike Rapoport,
	Cédric Le Goater

Hi,

Here is v3 of a series to remove the safe buffers. v2 can be
found here [0]. The safe buffers are no longer needed because
messages can be stored directly into the log buffer from any
context.

However, the safe buffers also provided a form of recursion
protection. For that reason, explicit recursion protection is
implemented for this series.

The safe buffers also implicitly provided serialization
between multiple CPUs executing in NMI context. This was
particularly necessary for the nmi_backtrace() output. This
serializiation is now preserved by using the printk_cpu_lock.

And finally, with the removal of the safe buffers, there is no
need for extra NMI enter/exit tracking. So this is also removed
(which includes removing config option CONFIG_PRINTK_NMI).

Changes since v2:

- Move irq disabling/enabling out of the
  console_lock_spinning_*() functions to simplify the patches
  keep the function prototypes simple.

- Change printk_enter_irqsave()/printk_exit_irqrestore() to
  macros to allow a more common calling convention for irq
  flags.

- Use the counter pointer from printk_enter_irqsave() in
  printk_exit_irqrestore() rather than fetching it again. This
  avoids any possible race conditions when printk's percpu
  flag is set.

- Use the printk_cpu_lock to serialize banner and regs with
  the stack dump in nmi_cpu_backtrace().

John Ogness

[0] https://lore.kernel.org/lkml/20210330153512.1182-1-john.ogness@linutronix.de

John Ogness (6):
  lib/nmi_backtrace: explicitly serialize banner and regs
  printk: track/limit recursion
  printk: remove safe buffers
  printk: remove NMI tracking
  printk: convert @syslog_lock to mutex
  printk: syslog: close window between wait and read

 arch/arm/kernel/smp.c          |   2 -
 arch/powerpc/kernel/traps.c    |   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 arch/powerpc/kexec/crash.c     |   3 -
 include/linux/hardirq.h        |   2 -
 include/linux/printk.h         |  22 --
 init/Kconfig                   |   5 -
 kernel/kexec_core.c            |   1 -
 kernel/panic.c                 |   3 -
 kernel/printk/internal.h       |  23 ---
 kernel/printk/printk.c         | 273 +++++++++++++++----------
 kernel/printk/printk_safe.c    | 361 +--------------------------------
 kernel/trace/trace.c           |   2 -
 lib/nmi_backtrace.c            |  13 +-
 14 files changed, 176 insertions(+), 540 deletions(-)


base-commit: 48e72544d6f06daedbf1d9b14610be89dba67526
-- 
2.20.1


^ permalink raw reply

* [PATCH printk v3 3/6] printk: remove safe buffers
From: John Ogness @ 2021-06-24 11:11 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Kees Cook, Paul E. McKenney, Alexey Kardashevskiy,
	Nicholas Piggin, linux-kernel, Steven Rostedt, kexec,
	Sergey Senozhatsky, Yue Hu, Paul Mackerras, Eric Biederman,
	Thomas Gleixner, linuxppc-dev, Andrew Morton, Tiezhu Yang,
	Cédric Le Goater
In-Reply-To: <20210624111148.5190-1-john.ogness@linutronix.de>

With @logbuf_lock removed, the high level printk functions for
storing messages are lockless. Messages can be stored from any
context, so there is no need for the NMI and safe buffers anymore.
Remove the NMI and safe buffers.

Although the safe buffers are removed, the NMI and safe context
tracking is still in place. In these contexts, store the message
immediately but still use irq_work to defer the console printing.

Since printk recursion tracking is in place, safe context tracking
for most of printk is not needed. Remove it. Only safe context
tracking relating to the console lock is left in place. This is
because the console lock is needed for the actual printing.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 arch/powerpc/kernel/traps.c    |   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 include/linux/printk.h         |  10 -
 kernel/kexec_core.c            |   1 -
 kernel/panic.c                 |   3 -
 kernel/printk/internal.h       |  17 --
 kernel/printk/printk.c         | 126 +++++--------
 kernel/printk/printk_safe.c    | 332 +--------------------------------
 lib/nmi_backtrace.c            |   6 -
 9 files changed, 51 insertions(+), 450 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index a44a30b0688c..5828c83eaca6 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -171,7 +171,6 @@ extern void panic_flush_kmsg_start(void)
 
 extern void panic_flush_kmsg_end(void)
 {
-	printk_safe_flush_on_panic();
 	kmsg_dump(KMSG_DUMP_PANIC);
 	bust_spinlocks(0);
 	debug_locks_off();
diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index c9a8f4781a10..dc17d8903d4f 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -183,11 +183,6 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 
 	wd_smp_unlock(&flags);
 
-	printk_safe_flush();
-	/*
-	 * printk_safe_flush() seems to require another print
-	 * before anything actually goes out to console.
-	 */
 	if (sysctl_hardlockup_all_cpu_backtrace)
 		trigger_allbutself_cpu_backtrace();
 
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 1790a5521fd9..664612f75dac 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -207,8 +207,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
 void dump_stack_print_info(const char *log_lvl);
 void show_regs_print_info(const char *log_lvl);
 extern asmlinkage void dump_stack(void) __cold;
-extern void printk_safe_flush(void);
-extern void printk_safe_flush_on_panic(void);
 #else
 static inline __printf(1, 0)
 int vprintk(const char *s, va_list args)
@@ -272,14 +270,6 @@ static inline void show_regs_print_info(const char *log_lvl)
 static inline void dump_stack(void)
 {
 }
-
-static inline void printk_safe_flush(void)
-{
-}
-
-static inline void printk_safe_flush_on_panic(void)
-{
-}
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index a0b6780740c8..480d5f77ef4f 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -977,7 +977,6 @@ void crash_kexec(struct pt_regs *regs)
 	old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
 	if (old_cpu == PANIC_CPU_INVALID) {
 		/* This is the 1st CPU which comes here, so go ahead. */
-		printk_safe_flush_on_panic();
 		__crash_kexec(regs);
 
 		/*
diff --git a/kernel/panic.c b/kernel/panic.c
index 332736a72a58..1f0df42f8d0c 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -247,7 +247,6 @@ void panic(const char *fmt, ...)
 	 * Bypass the panic_cpu check and call __crash_kexec directly.
 	 */
 	if (!_crash_kexec_post_notifiers) {
-		printk_safe_flush_on_panic();
 		__crash_kexec(NULL);
 
 		/*
@@ -271,8 +270,6 @@ void panic(const char *fmt, ...)
 	 */
 	atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
 
-	/* Call flush even twice. It tries harder with a single online CPU */
-	printk_safe_flush_on_panic();
 	kmsg_dump(KMSG_DUMP_PANIC);
 
 	/*
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 51615c909b2f..6cc35c5de890 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -22,7 +22,6 @@ __printf(1, 0) int vprintk_deferred(const char *fmt, va_list args);
 void __printk_safe_enter(void);
 void __printk_safe_exit(void);
 
-void printk_safe_init(void);
 bool printk_percpu_data_ready(void);
 
 #define printk_safe_enter_irqsave(flags)	\
@@ -37,18 +36,6 @@ bool printk_percpu_data_ready(void);
 		local_irq_restore(flags);	\
 	} while (0)
 
-#define printk_safe_enter_irq()		\
-	do {					\
-		local_irq_disable();		\
-		__printk_safe_enter();		\
-	} while (0)
-
-#define printk_safe_exit_irq()			\
-	do {					\
-		__printk_safe_exit();		\
-		local_irq_enable();		\
-	} while (0)
-
 void defer_console_output(void);
 
 #else
@@ -61,9 +48,5 @@ void defer_console_output(void);
 #define printk_safe_enter_irqsave(flags) local_irq_save(flags)
 #define printk_safe_exit_irqrestore(flags) local_irq_restore(flags)
 
-#define printk_safe_enter_irq() local_irq_disable()
-#define printk_safe_exit_irq() local_irq_enable()
-
-static inline void printk_safe_init(void) { }
 static inline bool printk_percpu_data_ready(void) { return false; }
 #endif /* CONFIG_PRINTK */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 7fa0b4d91975..495520b7369c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -732,27 +732,22 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 	if (ret)
 		return ret;
 
-	printk_safe_enter_irq();
 	if (!prb_read_valid(prb, atomic64_read(&user->seq), r)) {
 		if (file->f_flags & O_NONBLOCK) {
 			ret = -EAGAIN;
-			printk_safe_exit_irq();
 			goto out;
 		}
 
-		printk_safe_exit_irq();
 		ret = wait_event_interruptible(log_wait,
 				prb_read_valid(prb, atomic64_read(&user->seq), r));
 		if (ret)
 			goto out;
-		printk_safe_enter_irq();
 	}
 
 	if (r->info->seq != atomic64_read(&user->seq)) {
 		/* our last seen message is gone, return error and reset */
 		atomic64_set(&user->seq, r->info->seq);
 		ret = -EPIPE;
-		printk_safe_exit_irq();
 		goto out;
 	}
 
@@ -762,7 +757,6 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 				  &r->info->dev_info);
 
 	atomic64_set(&user->seq, r->info->seq + 1);
-	printk_safe_exit_irq();
 
 	if (len > count) {
 		ret = -EINVAL;
@@ -797,7 +791,6 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
 	if (offset)
 		return -ESPIPE;
 
-	printk_safe_enter_irq();
 	switch (whence) {
 	case SEEK_SET:
 		/* the first record */
@@ -818,7 +811,6 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
 	default:
 		ret = -EINVAL;
 	}
-	printk_safe_exit_irq();
 	return ret;
 }
 
@@ -833,7 +825,6 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
 
 	poll_wait(file, &log_wait, wait);
 
-	printk_safe_enter_irq();
 	if (prb_read_valid_info(prb, atomic64_read(&user->seq), &info, NULL)) {
 		/* return error when data has vanished underneath us */
 		if (info.seq != atomic64_read(&user->seq))
@@ -841,7 +832,6 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
 		else
 			ret = EPOLLIN|EPOLLRDNORM;
 	}
-	printk_safe_exit_irq();
 
 	return ret;
 }
@@ -874,9 +864,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
 	prb_rec_init_rd(&user->record, &user->info,
 			&user->text_buf[0], sizeof(user->text_buf));
 
-	printk_safe_enter_irq();
 	atomic64_set(&user->seq, prb_first_valid_seq(prb));
-	printk_safe_exit_irq();
 
 	file->private_data = user;
 	return 0;
@@ -1042,9 +1030,6 @@ static inline void log_buf_add_cpu(void) {}
 
 static void __init set_percpu_data_ready(void)
 {
-	printk_safe_init();
-	/* Make sure we set this flag only after printk_safe() init is done */
-	barrier();
 	__printk_percpu_data_ready = true;
 }
 
@@ -1082,6 +1067,7 @@ void __init setup_log_buf(int early)
 	struct prb_desc *new_descs;
 	struct printk_info info;
 	struct printk_record r;
+	unsigned int text_size;
 	size_t new_descs_size;
 	size_t new_infos_size;
 	unsigned long flags;
@@ -1142,24 +1128,37 @@ void __init setup_log_buf(int early)
 		 new_descs, ilog2(new_descs_count),
 		 new_infos);
 
-	printk_safe_enter_irqsave(flags);
+	local_irq_save(flags);
 
 	log_buf_len = new_log_buf_len;
 	log_buf = new_log_buf;
 	new_log_buf_len = 0;
 
 	free = __LOG_BUF_LEN;
-	prb_for_each_record(0, &printk_rb_static, seq, &r)
-		free -= add_to_rb(&printk_rb_dynamic, &r);
+	prb_for_each_record(0, &printk_rb_static, seq, &r) {
+		text_size = add_to_rb(&printk_rb_dynamic, &r);
+		if (text_size > free)
+			free = 0;
+		else
+			free -= text_size;
+	}
 
-	/*
-	 * This is early enough that everything is still running on the
-	 * boot CPU and interrupts are disabled. So no new messages will
-	 * appear during the transition to the dynamic buffer.
-	 */
 	prb = &printk_rb_dynamic;
 
-	printk_safe_exit_irqrestore(flags);
+	local_irq_restore(flags);
+
+	/*
+	 * Copy any remaining messages that might have appeared from
+	 * NMI context after copying but before switching to the
+	 * dynamic buffer.
+	 */
+	prb_for_each_record(seq, &printk_rb_static, seq, &r) {
+		text_size = add_to_rb(&printk_rb_dynamic, &r);
+		if (text_size > free)
+			free = 0;
+		else
+			free -= text_size;
+	}
 
 	if (seq != prb_next_seq(&printk_rb_static)) {
 		pr_err("dropped %llu messages\n",
@@ -1498,11 +1497,9 @@ static int syslog_print(char __user *buf, int size)
 		size_t n;
 		size_t skip;
 
-		printk_safe_enter_irq();
-		raw_spin_lock(&syslog_lock);
+		raw_spin_lock_irq(&syslog_lock);
 		if (!prb_read_valid(prb, syslog_seq, &r)) {
-			raw_spin_unlock(&syslog_lock);
-			printk_safe_exit_irq();
+			raw_spin_unlock_irq(&syslog_lock);
 			break;
 		}
 		if (r.info->seq != syslog_seq) {
@@ -1531,8 +1528,7 @@ static int syslog_print(char __user *buf, int size)
 			syslog_partial += n;
 		} else
 			n = 0;
-		raw_spin_unlock(&syslog_lock);
-		printk_safe_exit_irq();
+		raw_spin_unlock_irq(&syslog_lock);
 
 		if (!n)
 			break;
@@ -1566,7 +1562,6 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 		return -ENOMEM;
 
 	time = printk_time;
-	printk_safe_enter_irq();
 	/*
 	 * Find first record that fits, including all following records,
 	 * into the user-provided buffer for this dump.
@@ -1587,23 +1582,20 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 			break;
 		}
 
-		printk_safe_exit_irq();
 		if (copy_to_user(buf + len, text, textlen))
 			len = -EFAULT;
 		else
 			len += textlen;
-		printk_safe_enter_irq();
 
 		if (len < 0)
 			break;
 	}
 
 	if (clear) {
-		raw_spin_lock(&syslog_lock);
+		raw_spin_lock_irq(&syslog_lock);
 		latched_seq_write(&clear_seq, seq);
-		raw_spin_unlock(&syslog_lock);
+		raw_spin_unlock_irq(&syslog_lock);
 	}
-	printk_safe_exit_irq();
 
 	kfree(text);
 	return len;
@@ -1611,11 +1603,9 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 
 static void syslog_clear(void)
 {
-	printk_safe_enter_irq();
-	raw_spin_lock(&syslog_lock);
+	raw_spin_lock_irq(&syslog_lock);
 	latched_seq_write(&clear_seq, prb_next_seq(prb));
-	raw_spin_unlock(&syslog_lock);
-	printk_safe_exit_irq();
+	raw_spin_unlock_irq(&syslog_lock);
 }
 
 /* Return a consistent copy of @syslog_seq. */
@@ -1703,12 +1693,10 @@ int do_syslog(int type, char __user *buf, int len, int source)
 		break;
 	/* Number of chars in the log buffer */
 	case SYSLOG_ACTION_SIZE_UNREAD:
-		printk_safe_enter_irq();
-		raw_spin_lock(&syslog_lock);
+		raw_spin_lock_irq(&syslog_lock);
 		if (!prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
 			/* No unread messages. */
-			raw_spin_unlock(&syslog_lock);
-			printk_safe_exit_irq();
+			raw_spin_unlock_irq(&syslog_lock);
 			return 0;
 		}
 		if (info.seq != syslog_seq) {
@@ -1736,8 +1724,7 @@ int do_syslog(int type, char __user *buf, int len, int source)
 			}
 			error -= syslog_partial;
 		}
-		raw_spin_unlock(&syslog_lock);
-		printk_safe_exit_irq();
+		raw_spin_unlock_irq(&syslog_lock);
 		break;
 	/* Size of the log buffer */
 	case SYSLOG_ACTION_SIZE_BUFFER:
@@ -1852,7 +1839,7 @@ static int console_trylock_spinning(void)
 	if (console_trylock())
 		return 1;
 
-	printk_safe_enter_irqsave(flags);
+	local_irq_save(flags);
 
 	raw_spin_lock(&console_owner_lock);
 	owner = READ_ONCE(console_owner);
@@ -1873,7 +1860,7 @@ static int console_trylock_spinning(void)
 	 * that active printer isn't us (recursive printk?).
 	 */
 	if (!spin) {
-		printk_safe_exit_irqrestore(flags);
+		local_irq_restore(flags);
 		return 0;
 	}
 
@@ -1884,7 +1871,7 @@ static int console_trylock_spinning(void)
 		cpu_relax();
 	spin_release(&console_owner_dep_map, _THIS_IP_);
 
-	printk_safe_exit_irqrestore(flags);
+	local_irq_restore(flags);
 	/*
 	 * The owner passed the console lock to us.
 	 * Since we did not spin on console lock, annotate
@@ -2219,7 +2206,6 @@ asmlinkage int vprintk_emit(int facility, int level,
 {
 	int printed_len;
 	bool in_sched = false;
-	unsigned long flags;
 
 	/* Suppress unimportant messages after panic happens */
 	if (unlikely(suppress_printk))
@@ -2233,9 +2219,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	boot_delay_msec(level);
 	printk_delay();
 
-	printk_safe_enter_irqsave(flags);
 	printed_len = vprintk_store(facility, level, dev_info, fmt, args);
-	printk_safe_exit_irqrestore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
 	if (!in_sched) {
@@ -2664,9 +2648,9 @@ void console_unlock(void)
 
 	for (;;) {
 		size_t ext_len = 0;
+		int handover;
 		size_t len;
 
-		printk_safe_enter_irqsave(flags);
 skip:
 		if (!prb_read_valid(prb, console_seq, &r))
 			break;
@@ -2716,19 +2700,22 @@ void console_unlock(void)
 		 * were to occur on another CPU, it may wait for this one to
 		 * finish. This task can not be preempted if there is a
 		 * waiter waiting to take over.
+		 *
+		 * Interrupts are disabled because the hand over to a waiter
+		 * must not be interrupted until the hand over is completed
+		 * (@console_waiter is cleared).
 		 */
+		local_irq_save(flags);
 		console_lock_spinning_enable();
 
 		stop_critical_timings();	/* don't trace print latency */
 		call_console_drivers(ext_text, ext_len, text, len);
 		start_critical_timings();
 
-		if (console_lock_spinning_disable_and_check()) {
-			printk_safe_exit_irqrestore(flags);
+		handover = console_lock_spinning_disable_and_check();
+		local_irq_restore(flags);
+		if (handover)
 			return;
-		}
-
-		printk_safe_exit_irqrestore(flags);
 
 		if (do_cond_resched)
 			cond_resched();
@@ -2745,8 +2732,6 @@ void console_unlock(void)
 	 * flush, no worries.
 	 */
 	retry = prb_read_valid(prb, console_seq, NULL);
-	printk_safe_exit_irqrestore(flags);
-
 	if (retry && console_trylock())
 		goto again;
 }
@@ -2808,13 +2793,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
 	console_trylock();
 	console_may_schedule = 0;
 
-	if (mode == CONSOLE_REPLAY_ALL) {
-		unsigned long flags;
-
-		printk_safe_enter_irqsave(flags);
+	if (mode == CONSOLE_REPLAY_ALL)
 		console_seq = prb_first_valid_seq(prb);
-		printk_safe_exit_irqrestore(flags);
-	}
 	console_unlock();
 }
 
@@ -3466,14 +3446,12 @@ bool kmsg_dump_get_line(struct kmsg_dump_iter *iter, bool syslog,
 	struct printk_info info;
 	unsigned int line_count;
 	struct printk_record r;
-	unsigned long flags;
 	size_t l = 0;
 	bool ret = false;
 
 	if (iter->cur_seq < min_seq)
 		iter->cur_seq = min_seq;
 
-	printk_safe_enter_irqsave(flags);
 	prb_rec_init_rd(&r, &info, line, size);
 
 	/* Read text or count text lines? */
@@ -3494,7 +3472,6 @@ bool kmsg_dump_get_line(struct kmsg_dump_iter *iter, bool syslog,
 	iter->cur_seq = r.info->seq + 1;
 	ret = true;
 out:
-	printk_safe_exit_irqrestore(flags);
 	if (len)
 		*len = l;
 	return ret;
@@ -3526,7 +3503,6 @@ bool kmsg_dump_get_buffer(struct kmsg_dump_iter *iter, bool syslog,
 	u64 min_seq = latched_seq_read_nolock(&clear_seq);
 	struct printk_info info;
 	struct printk_record r;
-	unsigned long flags;
 	u64 seq;
 	u64 next_seq;
 	size_t len = 0;
@@ -3539,7 +3515,6 @@ bool kmsg_dump_get_buffer(struct kmsg_dump_iter *iter, bool syslog,
 	if (iter->cur_seq < min_seq)
 		iter->cur_seq = min_seq;
 
-	printk_safe_enter_irqsave(flags);
 	if (prb_read_valid_info(prb, iter->cur_seq, &info, NULL)) {
 		if (info.seq != iter->cur_seq) {
 			/* messages are gone, move to first available one */
@@ -3548,10 +3523,8 @@ bool kmsg_dump_get_buffer(struct kmsg_dump_iter *iter, bool syslog,
 	}
 
 	/* last entry */
-	if (iter->cur_seq >= iter->next_seq) {
-		printk_safe_exit_irqrestore(flags);
+	if (iter->cur_seq >= iter->next_seq)
 		goto out;
-	}
 
 	/*
 	 * Find first record that fits, including all following records,
@@ -3583,7 +3556,6 @@ bool kmsg_dump_get_buffer(struct kmsg_dump_iter *iter, bool syslog,
 
 	iter->next_seq = next_seq;
 	ret = true;
-	printk_safe_exit_irqrestore(flags);
 out:
 	if (len_out)
 		*len_out = len;
@@ -3601,12 +3573,8 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
  */
 void kmsg_dump_rewind(struct kmsg_dump_iter *iter)
 {
-	unsigned long flags;
-
-	printk_safe_enter_irqsave(flags);
 	iter->cur_seq = latched_seq_read_nolock(&clear_seq);
 	iter->next_seq = prb_next_seq(prb);
-	printk_safe_exit_irqrestore(flags);
 }
 EXPORT_SYMBOL_GPL(kmsg_dump_rewind);
 
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 94232186fccb..0456cd48d01c 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -15,286 +15,9 @@
 
 #include "internal.h"
 
-/*
- * In NMI and safe mode, printk() avoids taking locks. Instead,
- * it uses an alternative implementation that temporary stores
- * the strings into a per-CPU buffer. The content of the buffer
- * is later flushed into the main ring buffer via IRQ work.
- *
- * The alternative implementation is chosen transparently
- * by examining current printk() context mask stored in @printk_context
- * per-CPU variable.
- *
- * The implementation allows to flush the strings also from another CPU.
- * There are situations when we want to make sure that all buffers
- * were handled or when IRQs are blocked.
- */
-
-#define SAFE_LOG_BUF_LEN ((1 << CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT) -	\
-				sizeof(atomic_t) -			\
-				sizeof(atomic_t) -			\
-				sizeof(struct irq_work))
-
-struct printk_safe_seq_buf {
-	atomic_t		len;	/* length of written data */
-	atomic_t		message_lost;
-	struct irq_work		work;	/* IRQ work that flushes the buffer */
-	unsigned char		buffer[SAFE_LOG_BUF_LEN];
-};
-
-static DEFINE_PER_CPU(struct printk_safe_seq_buf, safe_print_seq);
 static DEFINE_PER_CPU(int, printk_context);
 
-static DEFINE_RAW_SPINLOCK(safe_read_lock);
-
-#ifdef CONFIG_PRINTK_NMI
-static DEFINE_PER_CPU(struct printk_safe_seq_buf, nmi_print_seq);
-#endif
-
-/* Get flushed in a more safe context. */
-static void queue_flush_work(struct printk_safe_seq_buf *s)
-{
-	if (printk_percpu_data_ready())
-		irq_work_queue(&s->work);
-}
-
-/*
- * Add a message to per-CPU context-dependent buffer. NMI and printk-safe
- * have dedicated buffers, because otherwise printk-safe preempted by
- * NMI-printk would have overwritten the NMI messages.
- *
- * The messages are flushed from irq work (or from panic()), possibly,
- * from other CPU, concurrently with printk_safe_log_store(). Should this
- * happen, printk_safe_log_store() will notice the buffer->len mismatch
- * and repeat the write.
- */
-static __printf(2, 0) int printk_safe_log_store(struct printk_safe_seq_buf *s,
-						const char *fmt, va_list args)
-{
-	int add;
-	size_t len;
-	va_list ap;
-
-again:
-	len = atomic_read(&s->len);
-
-	/* The trailing '\0' is not counted into len. */
-	if (len >= sizeof(s->buffer) - 1) {
-		atomic_inc(&s->message_lost);
-		queue_flush_work(s);
-		return 0;
-	}
-
-	/*
-	 * Make sure that all old data have been read before the buffer
-	 * was reset. This is not needed when we just append data.
-	 */
-	if (!len)
-		smp_rmb();
-
-	va_copy(ap, args);
-	add = vscnprintf(s->buffer + len, sizeof(s->buffer) - len, fmt, ap);
-	va_end(ap);
-	if (!add)
-		return 0;
-
-	/*
-	 * Do it once again if the buffer has been flushed in the meantime.
-	 * Note that atomic_cmpxchg() is an implicit memory barrier that
-	 * makes sure that the data were written before updating s->len.
-	 */
-	if (atomic_cmpxchg(&s->len, len, len + add) != len)
-		goto again;
-
-	queue_flush_work(s);
-	return add;
-}
-
-static inline void printk_safe_flush_line(const char *text, int len)
-{
-	/*
-	 * Avoid any console drivers calls from here, because we may be
-	 * in NMI or printk_safe context (when in panic). The messages
-	 * must go only into the ring buffer at this stage.  Consoles will
-	 * get explicitly called later when a crashdump is not generated.
-	 */
-	printk_deferred("%.*s", len, text);
-}
-
-/* printk part of the temporary buffer line by line */
-static int printk_safe_flush_buffer(const char *start, size_t len)
-{
-	const char *c, *end;
-	bool header;
-
-	c = start;
-	end = start + len;
-	header = true;
-
-	/* Print line by line. */
-	while (c < end) {
-		if (*c == '\n') {
-			printk_safe_flush_line(start, c - start + 1);
-			start = ++c;
-			header = true;
-			continue;
-		}
-
-		/* Handle continuous lines or missing new line. */
-		if ((c + 1 < end) && printk_get_level(c)) {
-			if (header) {
-				c = printk_skip_level(c);
-				continue;
-			}
-
-			printk_safe_flush_line(start, c - start);
-			start = c++;
-			header = true;
-			continue;
-		}
-
-		header = false;
-		c++;
-	}
-
-	/* Check if there was a partial line. Ignore pure header. */
-	if (start < end && !header) {
-		static const char newline[] = KERN_CONT "\n";
-
-		printk_safe_flush_line(start, end - start);
-		printk_safe_flush_line(newline, strlen(newline));
-	}
-
-	return len;
-}
-
-static void report_message_lost(struct printk_safe_seq_buf *s)
-{
-	int lost = atomic_xchg(&s->message_lost, 0);
-
-	if (lost)
-		printk_deferred("Lost %d message(s)!\n", lost);
-}
-
-/*
- * Flush data from the associated per-CPU buffer. The function
- * can be called either via IRQ work or independently.
- */
-static void __printk_safe_flush(struct irq_work *work)
-{
-	struct printk_safe_seq_buf *s =
-		container_of(work, struct printk_safe_seq_buf, work);
-	unsigned long flags;
-	size_t len;
-	int i;
-
-	/*
-	 * The lock has two functions. First, one reader has to flush all
-	 * available message to make the lockless synchronization with
-	 * writers easier. Second, we do not want to mix messages from
-	 * different CPUs. This is especially important when printing
-	 * a backtrace.
-	 */
-	raw_spin_lock_irqsave(&safe_read_lock, flags);
-
-	i = 0;
-more:
-	len = atomic_read(&s->len);
-
-	/*
-	 * This is just a paranoid check that nobody has manipulated
-	 * the buffer an unexpected way. If we printed something then
-	 * @len must only increase. Also it should never overflow the
-	 * buffer size.
-	 */
-	if ((i && i >= len) || len > sizeof(s->buffer)) {
-		const char *msg = "printk_safe_flush: internal error\n";
-
-		printk_safe_flush_line(msg, strlen(msg));
-		len = 0;
-	}
-
-	if (!len)
-		goto out; /* Someone else has already flushed the buffer. */
-
-	/* Make sure that data has been written up to the @len */
-	smp_rmb();
-	i += printk_safe_flush_buffer(s->buffer + i, len - i);
-
-	/*
-	 * Check that nothing has got added in the meantime and truncate
-	 * the buffer. Note that atomic_cmpxchg() is an implicit memory
-	 * barrier that makes sure that the data were copied before
-	 * updating s->len.
-	 */
-	if (atomic_cmpxchg(&s->len, len, 0) != len)
-		goto more;
-
-out:
-	report_message_lost(s);
-	raw_spin_unlock_irqrestore(&safe_read_lock, flags);
-}
-
-/**
- * printk_safe_flush - flush all per-cpu nmi buffers.
- *
- * The buffers are flushed automatically via IRQ work. This function
- * is useful only when someone wants to be sure that all buffers have
- * been flushed at some point.
- */
-void printk_safe_flush(void)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-#ifdef CONFIG_PRINTK_NMI
-		__printk_safe_flush(&per_cpu(nmi_print_seq, cpu).work);
-#endif
-		__printk_safe_flush(&per_cpu(safe_print_seq, cpu).work);
-	}
-}
-
-/**
- * printk_safe_flush_on_panic - flush all per-cpu nmi buffers when the system
- *	goes down.
- *
- * Similar to printk_safe_flush() but it can be called even in NMI context when
- * the system goes down. It does the best effort to get NMI messages into
- * the main ring buffer.
- *
- * Note that it could try harder when there is only one CPU online.
- */
-void printk_safe_flush_on_panic(void)
-{
-	/*
-	 * Make sure that we could access the safe buffers.
-	 * Do not risk a double release when more CPUs are up.
-	 */
-	if (raw_spin_is_locked(&safe_read_lock)) {
-		if (num_online_cpus() > 1)
-			return;
-
-		debug_locks_off();
-		raw_spin_lock_init(&safe_read_lock);
-	}
-
-	printk_safe_flush();
-}
-
 #ifdef CONFIG_PRINTK_NMI
-/*
- * Safe printk() for NMI context. It uses a per-CPU buffer to
- * store the message. NMIs are not nested, so there is always only
- * one writer running. But the buffer might get flushed from another
- * CPU, so we need to be careful.
- */
-static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args)
-{
-	struct printk_safe_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
-
-	return printk_safe_log_store(s, fmt, args);
-}
-
 void noinstr printk_nmi_enter(void)
 {
 	this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
@@ -309,9 +32,6 @@ void noinstr printk_nmi_exit(void)
  * Marks a code that might produce many messages in NMI context
  * and the risk of losing them is more critical than eventual
  * reordering.
- *
- * It has effect only when called in NMI context. Then printk()
- * will store the messages into the main logbuf directly.
  */
 void printk_nmi_direct_enter(void)
 {
@@ -324,27 +44,8 @@ void printk_nmi_direct_exit(void)
 	this_cpu_and(printk_context, ~PRINTK_NMI_DIRECT_CONTEXT_MASK);
 }
 
-#else
-
-static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args)
-{
-	return 0;
-}
-
 #endif /* CONFIG_PRINTK_NMI */
 
-/*
- * Lock-less printk(), to avoid deadlocks should the printk() recurse
- * into itself. It uses a per-CPU buffer to store the message, just like
- * NMI.
- */
-static __printf(1, 0) int vprintk_safe(const char *fmt, va_list args)
-{
-	struct printk_safe_seq_buf *s = this_cpu_ptr(&safe_print_seq);
-
-	return printk_safe_log_store(s, fmt, args);
-}
-
 /* Can be preempted by NMI. */
 void __printk_safe_enter(void)
 {
@@ -369,7 +70,10 @@ asmlinkage int vprintk(const char *fmt, va_list args)
 	 * Use the main logbuf even in NMI. But avoid calling console
 	 * drivers that might have their own locks.
 	 */
-	if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK)) {
+	if (this_cpu_read(printk_context) &
+	    (PRINTK_NMI_DIRECT_CONTEXT_MASK |
+	     PRINTK_NMI_CONTEXT_MASK |
+	     PRINTK_SAFE_CONTEXT_MASK)) {
 		unsigned long flags;
 		int len;
 
@@ -380,35 +84,7 @@ asmlinkage int vprintk(const char *fmt, va_list args)
 		return len;
 	}
 
-	/* Use extra buffer in NMI. */
-	if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
-		return vprintk_nmi(fmt, args);
-
-	/* Use extra buffer to prevent a recursion deadlock in safe mode. */
-	if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
-		return vprintk_safe(fmt, args);
-
 	/* No obstacles. */
 	return vprintk_default(fmt, args);
 }
 EXPORT_SYMBOL(vprintk);
-
-void __init printk_safe_init(void)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		struct printk_safe_seq_buf *s;
-
-		s = &per_cpu(safe_print_seq, cpu);
-		init_irq_work(&s->work, __printk_safe_flush);
-
-#ifdef CONFIG_PRINTK_NMI
-		s = &per_cpu(nmi_print_seq, cpu);
-		init_irq_work(&s->work, __printk_safe_flush);
-#endif
-	}
-
-	/* Flush pending messages that did not have scheduled IRQ works. */
-	printk_safe_flush();
-}
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index dae233c5f597..9813a983d024 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -75,12 +75,6 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
 		touch_softlockup_watchdog();
 	}
 
-	/*
-	 * Force flush any remote buffers that might be stuck in IRQ context
-	 * and therefore could not run their irq_work.
-	 */
-	printk_safe_flush();
-
 	clear_bit_unlock(0, &backtrace_flag);
 	put_cpu();
 }
-- 
2.20.1


^ permalink raw reply related

* [PATCH printk v3 4/6] printk: remove NMI tracking
From: John Ogness @ 2021-06-24 11:11 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Alexei Starovoitov, Chris Wilson,
	Wolfram Sang (Renesas), Paul Mackerras, Marc Zyngier,
	Masahiro Yamada, Russell King, Ingo Molnar, linux-arm-kernel,
	Sami Tolvanen, Valentin Schneider, Kees Cook, Anshuman Khandual,
	Frederic Weisbecker, Steven Rostedt, Nathan Chancellor,
	Nick Terrell, Thomas Gleixner, Vlastimil Babka, linux-kernel,
	Pekka Enberg, Sergey Senozhatsky, Andrew Morton, linuxppc-dev,
	Mike Rapoport
In-Reply-To: <20210624111148.5190-1-john.ogness@linutronix.de>

All NMI contexts are handled the same as the safe context: store the
message and defer printing. There is no need to have special NMI
context tracking for this. Using in_nmi() is enough.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
---
 arch/arm/kernel/smp.c       |  2 --
 arch/powerpc/kexec/crash.c  |  3 ---
 include/linux/hardirq.h     |  2 --
 include/linux/printk.h      | 12 ------------
 init/Kconfig                |  5 -----
 kernel/printk/internal.h    |  6 ------
 kernel/printk/printk_safe.c | 37 +------------------------------------
 kernel/trace/trace.c        |  2 --
 8 files changed, 1 insertion(+), 68 deletions(-)

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 74679240a9d8..0dd2d733ad62 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -668,9 +668,7 @@ static void do_handle_IPI(int ipinr)
 		break;
 
 	case IPI_CPU_BACKTRACE:
-		printk_nmi_enter();
 		nmi_cpu_backtrace(get_irq_regs());
-		printk_nmi_exit();
 		break;
 
 	default:
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index c9a889880214..d488311efab1 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -311,9 +311,6 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
 	unsigned int i;
 	int (*old_handler)(struct pt_regs *regs);
 
-	/* Avoid hardlocking with irresponsive CPU holding logbuf_lock */
-	printk_nmi_enter();
-
 	/*
 	 * This function is only called after the system
 	 * has panicked or is otherwise in a critical state.
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 69bc86ea382c..76878b357ffa 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -116,7 +116,6 @@ extern void rcu_nmi_exit(void);
 	do {							\
 		lockdep_off();					\
 		arch_nmi_enter();				\
-		printk_nmi_enter();				\
 		BUG_ON(in_nmi() == NMI_MASK);			\
 		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
 	} while (0)
@@ -135,7 +134,6 @@ extern void rcu_nmi_exit(void);
 	do {							\
 		BUG_ON(!in_nmi());				\
 		__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
-		printk_nmi_exit();				\
 		arch_nmi_exit();				\
 		lockdep_on();					\
 	} while (0)
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 664612f75dac..14a18fa15c92 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -149,18 +149,6 @@ static inline __printf(1, 2) __cold
 void early_printk(const char *s, ...) { }
 #endif
 
-#ifdef CONFIG_PRINTK_NMI
-extern void printk_nmi_enter(void);
-extern void printk_nmi_exit(void);
-extern void printk_nmi_direct_enter(void);
-extern void printk_nmi_direct_exit(void);
-#else
-static inline void printk_nmi_enter(void) { }
-static inline void printk_nmi_exit(void) { }
-static inline void printk_nmi_direct_enter(void) { }
-static inline void printk_nmi_direct_exit(void) { }
-#endif /* PRINTK_NMI */
-
 struct dev_printk_info;
 
 #ifdef CONFIG_PRINTK
diff --git a/init/Kconfig b/init/Kconfig
index 5babea38e346..4053588db7a0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1488,11 +1488,6 @@ config PRINTK
 	  very difficult to diagnose system problems, saying N here is
 	  strongly discouraged.
 
-config PRINTK_NMI
-	def_bool y
-	depends on PRINTK
-	depends on HAVE_NMI
-
 config BUG
 	bool "BUG() support" if EXPERT
 	default y
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 6cc35c5de890..ff890ae3ee6a 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -6,12 +6,6 @@
 
 #ifdef CONFIG_PRINTK
 
-#define PRINTK_SAFE_CONTEXT_MASK	0x007ffffff
-#define PRINTK_NMI_DIRECT_CONTEXT_MASK	0x008000000
-#define PRINTK_NMI_CONTEXT_MASK		0xff0000000
-
-#define PRINTK_NMI_CONTEXT_OFFSET	0x010000000
-
 __printf(4, 0)
 int vprintk_store(int facility, int level,
 		  const struct dev_printk_info *dev_info,
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 0456cd48d01c..47d961149a06 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -4,12 +4,9 @@
  */
 
 #include <linux/preempt.h>
-#include <linux/spinlock.h>
-#include <linux/debug_locks.h>
 #include <linux/kdb.h>
 #include <linux/smp.h>
 #include <linux/cpumask.h>
-#include <linux/irq_work.h>
 #include <linux/printk.h>
 #include <linux/kprobes.h>
 
@@ -17,35 +14,6 @@
 
 static DEFINE_PER_CPU(int, printk_context);
 
-#ifdef CONFIG_PRINTK_NMI
-void noinstr printk_nmi_enter(void)
-{
-	this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
-}
-
-void noinstr printk_nmi_exit(void)
-{
-	this_cpu_sub(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
-}
-
-/*
- * Marks a code that might produce many messages in NMI context
- * and the risk of losing them is more critical than eventual
- * reordering.
- */
-void printk_nmi_direct_enter(void)
-{
-	if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
-		this_cpu_or(printk_context, PRINTK_NMI_DIRECT_CONTEXT_MASK);
-}
-
-void printk_nmi_direct_exit(void)
-{
-	this_cpu_and(printk_context, ~PRINTK_NMI_DIRECT_CONTEXT_MASK);
-}
-
-#endif /* CONFIG_PRINTK_NMI */
-
 /* Can be preempted by NMI. */
 void __printk_safe_enter(void)
 {
@@ -70,10 +38,7 @@ asmlinkage int vprintk(const char *fmt, va_list args)
 	 * Use the main logbuf even in NMI. But avoid calling console
 	 * drivers that might have their own locks.
 	 */
-	if (this_cpu_read(printk_context) &
-	    (PRINTK_NMI_DIRECT_CONTEXT_MASK |
-	     PRINTK_NMI_CONTEXT_MASK |
-	     PRINTK_SAFE_CONTEXT_MASK)) {
+	if (this_cpu_read(printk_context) || in_nmi()) {
 		unsigned long flags;
 		int len;
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 915fe8790f04..6ac254f7a04c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9409,7 +9409,6 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
 	tracing_off();
 
 	local_irq_save(flags);
-	printk_nmi_direct_enter();
 
 	/* Simulate the iterator */
 	trace_init_global_iter(&iter);
@@ -9491,7 +9490,6 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
 		atomic_dec(&per_cpu_ptr(iter.array_buffer->data, cpu)->disabled);
 	}
 	atomic_dec(&dump_running);
-	printk_nmi_direct_exit();
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(ftrace_dump);
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
From: Robin Murphy @ 2021-06-24 11:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: heikki.krogerus, linux-devicetree, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Jianxiong Gao, Stefano Stabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Marek Szyprowski, Dan Williams,
	matthew.auld, Nicolas Boichat, thomas.hellstrom, Jim Quinlan,
	Konrad Rzeszutek Wilk, intel-gfx, maarten.lankhorst, jani.nikula,
	Rob Herring, rodrigo.vivi, Bjorn Helgaas, Claire Chang,
	boris.ostrovsky, Andy Shevchenko, jgross, airlied, Thierry Reding,
	Greg KH, Randy Dunlap, Qian Cai, lkml, list@263.net:IOMMU DRIVERS,
	Daniel Vetter, xypron.glpk, Tom Lendacky, linuxppc-dev, bauerman
In-Reply-To: <20210624111855.GA1382@willie-the-truck>

On 2021-06-24 12:18, Will Deacon wrote:
> On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote:
>> On 2021-06-24 07:05, Claire Chang wrote:
>>> On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
>>>>> is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
>>>>>
>>>>> is_swiotlb_force_bounce() was the new function introduced in this patch here.
>>>>>
>>>>> +static inline bool is_swiotlb_force_bounce(struct device *dev)
>>>>> +{
>>>>> +     return dev->dma_io_tlb_mem->force_bounce;
>>>>> +}
>>>>
>>>> To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
>>>> turn this into :
>>>>
>>>>           return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;
>>>>
>>>> for a quick debug check?
>>>
>>> I just realized that dma_io_tlb_mem might be NULL like Christoph
>>> pointed out since swiotlb might not get initialized.
>>> However,  `Unable to handle kernel paging request at virtual address
>>> dfff80000000000e` looks more like the address is garbage rather than
>>> NULL?
>>> I wonder if that's because dev->dma_io_tlb_mem is not assigned
>>> properly (which means device_initialize is not called?).
>>
>> What also looks odd is that the base "address" 0xdfff800000000000 is held in
>> a couple of registers, but the offset 0xe looks too small to match up to any
>> relevant structure member in that dereference chain :/
> 
> FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't
> been initialised but we dereference 'dev->dma_io_tlb_mem', so I think
> Christoph's suggestion is needed regardless.

Ack to that - for SWIOTLB_NO_FORCE, io_tlb_default_mem will remain NULL. 
The massive jump in KernelCI baseline failures as of yesterday looks 
like every arm64 machine with less than 4GB of RAM blowing up...

Robin.

> But I agree that it won't help
> with the issue reported by Qian Cai.
> 
> Qian Cai: please can you share your .config and your command line?
> 
> Thanks,
> 
> Will
> 

^ permalink raw reply

* Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
From: Nicholas Piggin @ 2021-06-24 11:42 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, David Stevens, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	Suzuki K Poulose, James Morse, kvm-ppc, Sean Christopherson,
	Vitaly Kuznetsov, linux-mips, intel-gvt-dev, Joerg Roedel,
	linux-arm-kernel, Jim Mattson
In-Reply-To: <1624530624.8jff1f4u11.astroid@bobo.none>

Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>> assoicated struct pages, so they should not be passed to pfn_to_page.
>> This series removes such calls from the x86 and arm64 secondary MMU. To
>> do this, this series modifies gfn_to_pfn to return a struct page in
>> addition to a pfn, if the hva was resolved by gup. This allows the
>> caller to call put_page only when necessated by gup.
>> 
>> This series provides a helper function that unwraps the new return type
>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>> have no hardware to test powerpc/mips changes, the function is used
>> there for minimally invasive changes. Additionally, as gfn_to_page and
>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>> easily changed over to only use pfns.
>> 
>> This addresses CVE-2021-22543 on x86 and arm64.
> 
> Does this fix the problem? (untested I don't have a POC setup at hand,
> but at least in concept)

This one actually compiles at least. Unfortunately I don't have much 
time in the near future to test, and I only just found out about this
CVE a few hours ago.

---


It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

---
 virt/kvm/kvm_main.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+	if (kvm_is_reserved_pfn(pfn))
+		return 1;
+	return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool *async,
 			       bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
 	 */ 
-	kvm_get_pfn(pfn);
+	if (!kvm_try_get_pfn(pfn))
+		r = -EFAULT;
 
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
-	return 0;
+
+	return r;
 }
 
 /*
-- 
2.23.0


^ permalink raw reply related

* Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
From: Will Deacon @ 2021-06-24 11:48 UTC (permalink / raw)
  To: Robin Murphy
  Cc: heikki.krogerus, linux-devicetree, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Jianxiong Gao, Stefano Stabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
	bskeggs, linux-pci, xen-devel, Marek Szyprowski, Dan Williams,
	matthew.auld, Nicolas Boichat, thomas.hellstrom, Jim Quinlan,
	Konrad Rzeszutek Wilk, intel-gfx, maarten.lankhorst, jani.nikula,
	Rob Herring, rodrigo.vivi, Bjorn Helgaas, Claire Chang,
	boris.ostrovsky, Andy Shevchenko, jgross, airlied, Thierry Reding,
	Greg KH, Randy Dunlap, Qian Cai, lkml, list@263.net:IOMMU DRIVERS,
	Daniel Vetter, xypron.glpk, Tom Lendacky, linuxppc-dev, bauerman
In-Reply-To: <452155d2-c98e-23f6-86d6-3a2ff2e74783@arm.com>

On Thu, Jun 24, 2021 at 12:34:09PM +0100, Robin Murphy wrote:
> On 2021-06-24 12:18, Will Deacon wrote:
> > On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote:
> > > On 2021-06-24 07:05, Claire Chang wrote:
> > > > On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig <hch@lst.de> wrote:
> > > > > 
> > > > > On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
> > > > > > is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
> > > > > > 
> > > > > > is_swiotlb_force_bounce() was the new function introduced in this patch here.
> > > > > > 
> > > > > > +static inline bool is_swiotlb_force_bounce(struct device *dev)
> > > > > > +{
> > > > > > +     return dev->dma_io_tlb_mem->force_bounce;
> > > > > > +}
> > > > > 
> > > > > To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
> > > > > turn this into :
> > > > > 
> > > > >           return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;
> > > > > 
> > > > > for a quick debug check?
> > > > 
> > > > I just realized that dma_io_tlb_mem might be NULL like Christoph
> > > > pointed out since swiotlb might not get initialized.
> > > > However,  `Unable to handle kernel paging request at virtual address
> > > > dfff80000000000e` looks more like the address is garbage rather than
> > > > NULL?
> > > > I wonder if that's because dev->dma_io_tlb_mem is not assigned
> > > > properly (which means device_initialize is not called?).
> > > 
> > > What also looks odd is that the base "address" 0xdfff800000000000 is held in
> > > a couple of registers, but the offset 0xe looks too small to match up to any
> > > relevant structure member in that dereference chain :/
> > 
> > FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't
> > been initialised but we dereference 'dev->dma_io_tlb_mem', so I think
> > Christoph's suggestion is needed regardless.
> 
> Ack to that - for SWIOTLB_NO_FORCE, io_tlb_default_mem will remain NULL. The
> massive jump in KernelCI baseline failures as of yesterday looks like every
> arm64 machine with less than 4GB of RAM blowing up...

Ok, diff below which attempts to tackle the offset issue I mentioned as
well. Qian Cai -- please can you try with these changes?

Will

--->8

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 175b6c113ed8..39284ff2a6cd 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -116,7 +116,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 
 static inline bool is_swiotlb_force_bounce(struct device *dev)
 {
-       return dev->dma_io_tlb_mem->force_bounce;
+       struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+
+       return mem && mem->force_bounce;
 }
 
 void __init swiotlb_exit(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 44be8258e27b..0ffbaae9fba2 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -449,6 +449,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
                dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
        unsigned int nslots = nr_slots(alloc_size), stride;
        unsigned int index, wrap, count = 0, i;
+       unsigned int offset = swiotlb_align_offset(dev, orig_addr);
        unsigned long flags;
 
        BUG_ON(!nslots);
@@ -497,7 +498,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
        for (i = index; i < index + nslots; i++) {
                mem->slots[i].list = 0;
                mem->slots[i].alloc_size =
-                       alloc_size - ((i - index) << IO_TLB_SHIFT);
+                       alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
        }
        for (i = index - 1;
             io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&

^ permalink raw reply related

* Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
From: Paolo Bonzini @ 2021-06-24 12:00 UTC (permalink / raw)
  To: Nicholas Piggin, Aleksandar Markovic, Huacai Chen, Marc Zyngier,
	Paul Mackerras, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, David Stevens, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	Suzuki K Poulose, James Morse, kvm-ppc, Sean Christopherson,
	Vitaly Kuznetsov, linux-mips, intel-gvt-dev, Joerg Roedel,
	linux-arm-kernel, Jim Mattson
In-Reply-To: <1624534759.nj0ylor2eh.astroid@bobo.none>

On 24/06/21 13:42, Nicholas Piggin wrote:
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> +	if (kvm_is_reserved_pfn(pfn))
> +		return 1;

So !pfn_valid would always return true.  Yeah, this should work and is 
certainly appealing!

Paolo


> +	return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
>   static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   			       unsigned long addr, bool *async,
>   			       bool write_fault, bool *writable,
> @@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   	 * Whoever called remap_pfn_range is also going to call e.g.
>   	 * unmap_mapping_range before the underlying pages are freed,
>   	 * causing a call to our MMU notifier.
> +	 *
> +	 * Certain IO or PFNMAP mappings can be backed with valid
> +	 * struct pages, but be allocated without refcounting e.g.,
> +	 * tail pages of non-compound higher order allocations, which
> +	 * would then underflow the refcount when the caller does the
> +	 * required put_page. Don't allow those pages here.
>   	 */
> -	kvm_get_pfn(pfn);
> +	if (!kvm_try_get_pfn(pfn))
> +		r = -EFAULT;
>   
>   out:
>   	pte_unmap_unlock(ptep, ptl);
>   	*p_pfn = pfn;
> -	return 0;
> +
> +	return r;
>   }
>   
>   /*
> 


^ permalink raw reply

* [PATCH] powerpc/64s: Fix boot failure with 4K Radix
From: Michael Ellerman @ 2021-06-24 12:34 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: dja

When using the Radix MMU our PGD is always 64K, and must be naturally
aligned.

For a 4K page size kernel that means page alignment of swapper_pg_dir is
not sufficient, leading to failure to boot.

Use the existing MAX_PTRS_PER_PGD which has the correct value, and
avoids us hard-coding 64K here.

Fixes: e72421a085a8 ("powerpc: Define swapper_pg_dir[] in C")
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/mm/pgtable.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 1707ab580ee2..cd16b407f47e 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -28,7 +28,13 @@
 #include <asm/hugetlb.h>
 #include <asm/pte-walk.h>
 
-pgd_t swapper_pg_dir[MAX_PTRS_PER_PGD] __page_aligned_bss;
+#ifdef CONFIG_PPC64
+#define PGD_ALIGN (sizeof(pgd_t) * MAX_PTRS_PER_PGD)
+#else
+#define PGD_ALIGN PAGE_SIZE
+#endif
+
+pgd_t swapper_pg_dir[MAX_PTRS_PER_PGD] __section(".bss..page_aligned") __aligned(PGD_ALIGN);
 
 static inline int is_exec_fault(void)
 {
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
From: Paolo Bonzini @ 2021-06-24 12:41 UTC (permalink / raw)
  To: Nicholas Piggin, Aleksandar Markovic, Huacai Chen, Marc Zyngier,
	Paul Mackerras, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, David Stevens, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	Suzuki K Poulose, James Morse, kvm-ppc, Sean Christopherson,
	Vitaly Kuznetsov, linux-mips, intel-gvt-dev, Joerg Roedel,
	linux-arm-kernel, Jim Mattson
In-Reply-To: <1624534759.nj0ylor2eh.astroid@bobo.none>

On 24/06/21 13:42, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>>> assoicated struct pages, so they should not be passed to pfn_to_page.
>>> This series removes such calls from the x86 and arm64 secondary MMU. To
>>> do this, this series modifies gfn_to_pfn to return a struct page in
>>> addition to a pfn, if the hva was resolved by gup. This allows the
>>> caller to call put_page only when necessated by gup.
>>>
>>> This series provides a helper function that unwraps the new return type
>>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>>> have no hardware to test powerpc/mips changes, the function is used
>>> there for minimally invasive changes. Additionally, as gfn_to_page and
>>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>>> easily changed over to only use pfns.
>>>
>>> This addresses CVE-2021-22543 on x86 and arm64.
>>
>> Does this fix the problem? (untested I don't have a POC setup at hand,
>> but at least in concept)
> 
> This one actually compiles at least. Unfortunately I don't have much
> time in the near future to test, and I only just found out about this
> CVE a few hours ago.

And it also works (the reproducer gets an infinite stream of userspace 
exits and especially does not crash).  We can still go for David's 
solution later since MMU notifiers are able to deal with this pages, but 
it's a very nice patch for stable kernels.

If you provide a Signed-off-by, I can integrate it.

Paolo

> ---
> 
> 
> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations). These
> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
> of APIs, which take a reference to the page, which takes it from 0 to 1.
> When the reference is dropped, this will free the page incorrectly.
> 
> Fix this by only taking a reference on the page if it was non-zero,
> which indicates it is participating in normal refcounting (and can be
> released with put_page).
> 
> ---
>   virt/kvm/kvm_main.c | 19 +++++++++++++++++--
>   1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6a6bc7af0e28..46fb042837d2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>   	return true;
>   }
>   
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> +	if (kvm_is_reserved_pfn(pfn))
> +		return 1;
> +	return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
>   static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   			       unsigned long addr, bool *async,
>   			       bool write_fault, bool *writable,
> @@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   	 * Whoever called remap_pfn_range is also going to call e.g.
>   	 * unmap_mapping_range before the underlying pages are freed,
>   	 * causing a call to our MMU notifier.
> +	 *
> +	 * Certain IO or PFNMAP mappings can be backed with valid
> +	 * struct pages, but be allocated without refcounting e.g.,
> +	 * tail pages of non-compound higher order allocations, which
> +	 * would then underflow the refcount when the caller does the
> +	 * required put_page. Don't allow those pages here.
>   	 */
> -	kvm_get_pfn(pfn);
> +	if (!kvm_try_get_pfn(pfn))
> +		r = -EFAULT;
>   
>   out:
>   	pte_unmap_unlock(ptep, ptl);
>   	*p_pfn = pfn;
> -	return 0;
> +
> +	return r;
>   }
>   
>   /*
> 


^ permalink raw reply

* Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
From: Nicholas Piggin @ 2021-06-24 12:57 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, David Stevens, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	Suzuki K Poulose, James Morse, kvm-ppc, Sean Christopherson,
	Vitaly Kuznetsov, linux-mips, intel-gvt-dev, Joerg Roedel,
	linux-arm-kernel, Jim Mattson
In-Reply-To: <0d3a699a-15eb-9f1b-0735-79d14736f38c@redhat.com>

Excerpts from Paolo Bonzini's message of June 24, 2021 10:41 pm:
> On 24/06/21 13:42, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>>>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>>>> assoicated struct pages, so they should not be passed to pfn_to_page.
>>>> This series removes such calls from the x86 and arm64 secondary MMU. To
>>>> do this, this series modifies gfn_to_pfn to return a struct page in
>>>> addition to a pfn, if the hva was resolved by gup. This allows the
>>>> caller to call put_page only when necessated by gup.
>>>>
>>>> This series provides a helper function that unwraps the new return type
>>>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>>>> have no hardware to test powerpc/mips changes, the function is used
>>>> there for minimally invasive changes. Additionally, as gfn_to_page and
>>>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>>>> easily changed over to only use pfns.
>>>>
>>>> This addresses CVE-2021-22543 on x86 and arm64.
>>>
>>> Does this fix the problem? (untested I don't have a POC setup at hand,
>>> but at least in concept)
>> 
>> This one actually compiles at least. Unfortunately I don't have much
>> time in the near future to test, and I only just found out about this
>> CVE a few hours ago.
> 
> And it also works (the reproducer gets an infinite stream of userspace 
> exits and especially does not crash).  We can still go for David's 
> solution later since MMU notifiers are able to deal with this pages, but 
> it's a very nice patch for stable kernels.

Oh nice, thanks for testing. How's this?

Thanks,
Nick

---

KVM: Fix page ref underflow for regions with valid but non-refcounted pages

It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 virt/kvm/kvm_main.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+	if (kvm_is_reserved_pfn(pfn))
+		return 1;
+	return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool *async,
 			       bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
 	 */ 
-	kvm_get_pfn(pfn);
+	if (!kvm_try_get_pfn(pfn))
+		r = -EFAULT;
 
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
-	return 0;
+
+	return r;
 }
 
 /*
-- 
2.23.0

^ permalink raw reply related

* Re: [PATCH v8 0/6] Support for H_RPT_INVALIDATE in PowerPC KVM
From: Michael Ellerman @ 2021-06-24 13:59 UTC (permalink / raw)
  To: Bharata B Rao, linuxppc-dev, kvm-ppc
  Cc: aneesh.kumar, farosas, npiggin, david
In-Reply-To: <20210621085003.904767-1-bharata@linux.ibm.com>

On Mon, 21 Jun 2021 14:19:57 +0530, Bharata B Rao wrote:
> This patchset adds support for the new hcall H_RPT_INVALIDATE
> and replaces the nested tlb flush calls with this new hcall
> if support for the same exists.
> 
> Changes in v8:
> -------------
> - Used tlb_single_page_flush_ceiling in the process-scoped range
>   flush routine to switch to full PID invalation if
>   the number of pages is above the threshold
> - Moved iterating over page sizes into the actual routine that
>   handles the eventual flushing thereby limiting the page size
>   iteration only to range based flushing
> - Converted #if 0 section into a comment section to avoid
>   checkpatch from complaining.
> - Used a threshold in the partition-scoped range flushing
>   to switch to full LPID invalidation
> 
> [...]

Applied to powerpc/topic/ppc-kvm.

[1/6] KVM: PPC: Book3S HV: Fix comments of H_RPT_INVALIDATE arguments
      https://git.kernel.org/powerpc/c/f09216a190a4c2f62e1725f9d92e7c122b4ee423
[2/6] powerpc/book3s64/radix: Add H_RPT_INVALIDATE pgsize encodings to mmu_psize_def
      https://git.kernel.org/powerpc/c/d6265cb33b710789cbc390316eba50a883d6dcc8
[3/6] KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE
      https://git.kernel.org/powerpc/c/f0c6fbbb90504fb7e9dbf0865463d3c2b4de49e5
[4/6] KVM: PPC: Book3S HV: Nested support in H_RPT_INVALIDATE
      https://git.kernel.org/powerpc/c/53324b51c5eee22d420a2df68b1820d929fa90f3
[5/6] KVM: PPC: Book3S HV: Add KVM_CAP_PPC_RPT_INVALIDATE capability
      https://git.kernel.org/powerpc/c/b87cc116c7e1bc62a84d8c46acd401db179edb11
[6/6] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM
      https://git.kernel.org/powerpc/c/81468083f3c76a08183813e3af63a7c3cea3f537

cheers

^ permalink raw reply

* Re: [PATCH] KVM: PPC: Book3S HV: Workaround high stack usage with clang
From: Michael Ellerman @ 2021-06-24 13:59 UTC (permalink / raw)
  To: Paul Mackerras, Michael Ellerman, Nathan Chancellor
  Cc: kernel test robot, Arnd Bergmann, Nick Desaulniers, linux-kernel,
	kvm-ppc, clang-built-linux, Nicholas Piggin, linuxppc-dev
In-Reply-To: <20210621182440.990242-1-nathan@kernel.org>

On Mon, 21 Jun 2021 11:24:40 -0700, Nathan Chancellor wrote:
> LLVM does not emit optimal byteswap assembly, which results in high
> stack usage in kvmhv_enter_nested_guest() due to the inlining of
> byteswap_pt_regs(). With LLVM 12.0.0:
> 
> arch/powerpc/kvm/book3s_hv_nested.c:289:6: error: stack frame size of
> 2512 bytes in function 'kvmhv_enter_nested_guest' [-Werror,-Wframe-larger-than=]
> long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>      ^
> 1 error generated.
> 
> [...]

Applied to powerpc/topic/ppc-kvm.

[1/1] KVM: PPC: Book3S HV: Workaround high stack usage with clang
      https://git.kernel.org/powerpc/c/51696f39cbee5bb684e7959c0c98b5f54548aa34

cheers

^ permalink raw reply

* Re: [PATCH v7 00/32] KVM: PPC: Book3S: C-ify the P9 entry/exit code
From: Michael Ellerman @ 2021-06-24 13:59 UTC (permalink / raw)
  To: Nicholas Piggin, kvm-ppc; +Cc: linuxppc-dev
In-Reply-To: <20210528090752.3542186-1-npiggin@gmail.com>

On Fri, 28 May 2021 19:07:20 +1000, Nicholas Piggin wrote:
> Git tree here
> 
> https://github.com/npiggin/linux/tree/kvm-in-c-5.14-1
> 
> This series applies to upstream plus a couple of KVM regression fixes
> not yet in powerpc tree, which are included in the above git tree.
> 
> [...]

Applied to powerpc/topic/ppc-kvm.

[01/32] KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point
        https://git.kernel.org/powerpc/c/f36011569b90b3973f07cea00c5872c4dc0c707f
[02/32] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM
        https://git.kernel.org/powerpc/c/f33e0702d98cc5ff21f44833525b07581862eb57
[03/32] KVM: PPC: Book3S 64: add hcall interrupt handler
        https://git.kernel.org/powerpc/c/31c67cfe2a6a5a7364dc1552b877c6b7820dd556
[04/32] KVM: PPC: Book3S 64: Move hcall early register setup to KVM
        https://git.kernel.org/powerpc/c/04ece7b60b689e1de38b9b0f597f8f94951e4367
[05/32] KVM: PPC: Book3S 64: Move interrupt early register setup to KVM
        https://git.kernel.org/powerpc/c/69fdd67499716efca861f7cecabdfeee5e5d7b51
[06/32] KVM: PPC: Book3S 64: move bad_host_intr check to HV handler
        https://git.kernel.org/powerpc/c/1b5821c630c219e3c6f643ebbefcf08c9fa714d8
[07/32] KVM: PPC: Book3S 64: Minimise hcall handler calling convention differences
        https://git.kernel.org/powerpc/c/e2762743c6328dde14290cd58ddf2175b068ad80
[08/32] KVM: PPC: Book3S HV P9: implement kvmppc_xive_pull_vcpu in C
        https://git.kernel.org/powerpc/c/023c3c96ca4d196c09d554d5a98900406e4d7ecb
[09/32] KVM: PPC: Book3S HV P9: Move setting HDEC after switching to guest LPCR
        https://git.kernel.org/powerpc/c/413679e73bdfc2720dc2fa2172b65b7411185fa7
[10/32] KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races
        https://git.kernel.org/powerpc/c/6ffe2c6e6dcefb971e4046f02086c4adadd0b310
[11/32] KVM: PPC: Book3S HV P9: Move xive vcpu context management into kvmhv_p9_guest_entry
        https://git.kernel.org/powerpc/c/09512c29167bd3792820caf83bcca4d4e5ac2266
[12/32] KVM: PPC: Book3S HV P9: Move radix MMU switching instructions together
        https://git.kernel.org/powerpc/c/48013cbc504e064d2318f24482cfbe3c53e0a812
[13/32] KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path
        https://git.kernel.org/powerpc/c/9dc2babc185e0a24fbb48098daafd552cac157fa
[14/32] KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C
        https://git.kernel.org/powerpc/c/89d35b23910158a9add33a206e973f4227906d3c
[15/32] KVM: PPC: Book3S HV P9: inline kvmhv_load_hv_regs_and_go into __kvmhv_vcpu_entry_p9
        https://git.kernel.org/powerpc/c/c00366e2375408e43370cd7981af3354f7c83ed3
[16/32] KVM: PPC: Book3S HV P9: Read machine check registers while MSR[RI] is 0
        https://git.kernel.org/powerpc/c/6d770e3fe9a120560cda66331ce5faa363400e97
[17/32] KVM: PPC: Book3S HV P9: Improve exit timing accounting coverage
        https://git.kernel.org/powerpc/c/a32ed1bb70723ec7a6c888b6c7071d516cca0e8f
[18/32] KVM: PPC: Book3S HV P9: Move SPR loading after expiry time check
        https://git.kernel.org/powerpc/c/68e3baaca8c56bbb336d2215f201f4047ce736e5
[19/32] KVM: PPC: Book3S HV P9: Add helpers for OS SPR handling
        https://git.kernel.org/powerpc/c/edba6aff4f2c3893e168df6a2e9a20f3c39b0b30
[20/32] KVM: PPC: Book3S HV P9: Switch to guest MMU context as late as possible
        https://git.kernel.org/powerpc/c/41f779917669fcc28a7f5646d1f7a85043c9d152
[21/32] KVM: PPC: Book3S HV: Implement radix prefetch workaround by disabling MMU
        https://git.kernel.org/powerpc/c/2e1ae9cd56f8616a707185f3c6cb7ee2a20809e1
[22/32] KVM: PPC: Book3S HV: Remove support for dependent threads mode on P9
        https://git.kernel.org/powerpc/c/aaae8c79005846eeafc7a0e5d3eda4e34ea8ca2e
[23/32] KVM: PPC: Book3S HV: Remove radix guest support from P7/8 path
        https://git.kernel.org/powerpc/c/9769a7fd79b65a6a6f8362154ab59c36d0defbf3
[24/32] KVM: PPC: Book3S HV: Remove virt mode checks from real mode handlers
        https://git.kernel.org/powerpc/c/dcbac73a5b374873bd6dfd8a0ee5d0b7fc844420
[25/32] KVM: PPC: Book3S HV: Remove unused nested HV tests in XICS emulation
        https://git.kernel.org/powerpc/c/2ce008c8b25467ceacf45bcf0e183d660edb82f2
[26/32] KVM: PPC: Book3S HV P9: Allow all P9 processors to enable nested HV
        https://git.kernel.org/powerpc/c/cbcff8b1c53e458ed4e23877048d7268fd13ab8a
[27/32] KVM: PPC: Book3S HV: small pseries_do_hcall cleanup
        https://git.kernel.org/powerpc/c/a9aa86e08b3a0b2c273cdb772283c872e55f14bf
[28/32] KVM: PPC: Book3S HV: add virtual mode handlers for HPT hcalls and page faults
        https://git.kernel.org/powerpc/c/6165d5dd99dbaec7a309491c3951bd81fc89978d
[29/32] KVM: PPC: Book3S HV P9: Reflect userspace hcalls to hash guests to support PR KVM
        https://git.kernel.org/powerpc/c/ac3c8b41c27ea112daed031f852a4b361c11a03e
[30/32] KVM: PPC: Book3S HV P9: implement hash guest support
        https://git.kernel.org/powerpc/c/079a09a500c399f804effcf9bb49214cdfa698e5
[31/32] KVM: PPC: Book3S HV P9: implement hash host / hash guest support
        https://git.kernel.org/powerpc/c/0bf7e1b2e9a496e1ebca9e3e1f53c7e98add4417
[32/32] KVM: PPC: Book3S HV: remove ISA v3.0 and v3.1 support from P7/8 path
        https://git.kernel.org/powerpc/c/fae5c9f3664ba278137e54a2083b39b90c64093a

cheers

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox