LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v16 09/12] of: Define functions to allocate and free FDT
From: Lakshmi Ramasubramanian @ 2021-02-04 16:41 UTC (permalink / raw)
  To: zohar, bauerman, robh, takahiro.akashi, gregkh, will, joe,
	catalin.marinas, mpe
  Cc: mark.rutland, bhsharma, tao.li, paulus, vincenzo.frascino,
	frowand.list, sashal, masahiroy, jmorris, linux-arm-kernel, serge,
	devicetree, pasha.tatashin, prsriva, hsinyi, allison,
	christophe.leroy, mbrugger, balajib, dmitry.kasatkin,
	linux-kernel, james.morse, linux-integrity, linuxppc-dev
In-Reply-To: <20210204164135.29856-1-nramas@linux.microsoft.com>

Kernel components that use Flattened Device Tree (FDT) allocate kernel
memory and call fdt_open_into() to reorganize the tree into a form
suitable for the read-write operations.  These operations can be
combined into a single function to allocate and initialize the FDT
so the different architecures do not have to duplicate the code.

Define of_alloc_and_init_fdt() and of_free_fdt() in drivers/of/kexec.c
to allocate and initialize FDT, and to free the FDT buffer respectively.

Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Rob Herring <robh@kernel.org>
Suggested-by: Joe Perches <joe@perches.com>
---
 drivers/of/kexec.c | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/of.h |  2 ++
 2 files changed, 39 insertions(+)

diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 5ae0e5d90f55..197e71104f47 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -11,6 +11,7 @@
 
 #include <linux/kernel.h>
 #include <linux/kexec.h>
+#include <linux/mm.h>
 #include <linux/memblock.h>
 #include <linux/libfdt.h>
 #include <linux/of.h>
@@ -28,6 +29,42 @@
 #define FDT_PROP_RNG_SEED	"rng-seed"
 #define RNG_SEED_SIZE		128
 
+/**
+ * of_alloc_and_init_fdt - Allocate and initialize a Flattened device tree
+ *
+ * @fdt_size:	Flattened device tree size
+ *
+ * Return the allocated FDT buffer on success, or NULL on error.
+ */
+void *of_alloc_and_init_fdt(unsigned int fdt_size)
+{
+	void *fdt;
+	int ret;
+
+	fdt = kvmalloc(fdt_size, GFP_KERNEL);
+	if (!fdt)
+		return NULL;
+
+	ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
+	if (ret < 0) {
+		pr_err("Error setting up the new device tree.\n");
+		kvfree(fdt);
+		fdt = NULL;
+	}
+
+	return fdt;
+}
+
+/**
+ * of_free_fdt - Free the buffer for Flattened device tree
+ *
+ * @fdt:	Flattened device tree buffer to free
+ */
+void of_free_fdt(void *fdt)
+{
+	kvfree(fdt);
+}
+
 /**
  * fdt_find_and_del_mem_rsv - delete memory reservation with given address and size
  *
diff --git a/include/linux/of.h b/include/linux/of.h
index 19f77dd12507..9f0261761e28 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -563,6 +563,8 @@ struct kimage;
 int of_kexec_setup_new_fdt(const struct kimage *image, void *fdt,
 			   unsigned long initrd_load_addr, unsigned long initrd_len,
 			   const char *cmdline);
+void *of_alloc_and_init_fdt(unsigned int fdt_size);
+void of_free_fdt(void *fdt);
 
 #ifdef CONFIG_IMA_KEXEC
 int of_ima_add_kexec_buffer(struct kimage *image,
-- 
2.30.0


^ permalink raw reply related

* [PATCH v16 08/12] powerpc: Delete unused function delete_fdt_mem_rsv()
From: Lakshmi Ramasubramanian @ 2021-02-04 16:41 UTC (permalink / raw)
  To: zohar, bauerman, robh, takahiro.akashi, gregkh, will, joe,
	catalin.marinas, mpe
  Cc: mark.rutland, bhsharma, tao.li, paulus, vincenzo.frascino,
	frowand.list, sashal, masahiroy, jmorris, linux-arm-kernel, serge,
	devicetree, pasha.tatashin, prsriva, hsinyi, allison,
	christophe.leroy, mbrugger, balajib, dmitry.kasatkin,
	linux-kernel, james.morse, linux-integrity, linuxppc-dev
In-Reply-To: <20210204164135.29856-1-nramas@linux.microsoft.com>

delete_fdt_mem_rsv() defined in "arch/powerpc/kexec/file_load.c"
has been renamed to fdt_find_and_del_mem_rsv(), and moved to
"drivers/of/kexec.c".

Remove delete_fdt_mem_rsv() in "arch/powerpc/kexec/file_load.c".

Co-developed-by: Prakhar Srivastava <prsriva@linux.microsoft.com>
Signed-off-by: Prakhar Srivastava <prsriva@linux.microsoft.com>
Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
---
 arch/powerpc/include/asm/kexec.h |  1 -
 arch/powerpc/kexec/file_load.c   | 32 --------------------------------
 2 files changed, 33 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 939bc40dfa62..2c0be93d239a 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -118,7 +118,6 @@ char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
 int setup_purgatory(struct kimage *image, const void *slave_code,
 		    const void *fdt, unsigned long kernel_load_addr,
 		    unsigned long fdt_load_addr);
-int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 
 #ifdef CONFIG_PPC64
 struct kexec_buf;
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 5dd3a9c45a2d..036c8fb48fc3 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -108,35 +108,3 @@ int setup_purgatory(struct kimage *image, const void *slave_code,
 
 	return 0;
 }
-
-/**
- * delete_fdt_mem_rsv - delete memory reservation with given address and size
- *
- * Return: 0 on success, or negative errno on error.
- */
-int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size)
-{
-	int i, ret, num_rsvs = fdt_num_mem_rsv(fdt);
-
-	for (i = 0; i < num_rsvs; i++) {
-		uint64_t rsv_start, rsv_size;
-
-		ret = fdt_get_mem_rsv(fdt, i, &rsv_start, &rsv_size);
-		if (ret) {
-			pr_err("Malformed device tree.\n");
-			return -EINVAL;
-		}
-
-		if (rsv_start == start && rsv_size == size) {
-			ret = fdt_del_mem_rsv(fdt, i);
-			if (ret) {
-				pr_err("Error deleting device tree reservation.\n");
-				return -EINVAL;
-			}
-
-			return 0;
-		}
-	}
-
-	return -ENOENT;
-}
-- 
2.30.0


^ permalink raw reply related

* [PATCH v16 10/12] arm64: Use OF alloc and free functions for FDT
From: Lakshmi Ramasubramanian @ 2021-02-04 16:41 UTC (permalink / raw)
  To: zohar, bauerman, robh, takahiro.akashi, gregkh, will, joe,
	catalin.marinas, mpe
  Cc: mark.rutland, bhsharma, tao.li, paulus, vincenzo.frascino,
	frowand.list, sashal, masahiroy, jmorris, linux-arm-kernel, serge,
	devicetree, pasha.tatashin, prsriva, hsinyi, allison,
	christophe.leroy, mbrugger, balajib, dmitry.kasatkin,
	linux-kernel, james.morse, linux-integrity, linuxppc-dev
In-Reply-To: <20210204164135.29856-1-nramas@linux.microsoft.com>

of_alloc_and_init_fdt() and of_free_fdt() have been defined in
drivers/of/kexec.c to allocate and free memory for FDT.

Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
initialize the FDT, and to free the FDT respectively.

Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Rob Herring <robh@kernel.org>
---
 arch/arm64/kernel/machine_kexec_file.c | 37 +++++++-------------------
 1 file changed, 10 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
index 7da22bb7b9d5..7d6cc478f73c 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -29,7 +29,7 @@ const struct kexec_file_ops * const kexec_file_loaders[] = {
 
 int arch_kimage_file_post_load_cleanup(struct kimage *image)
 {
-	vfree(image->arch.dtb);
+	of_free_fdt(image->arch.dtb);
 	image->arch.dtb = NULL;
 
 	vfree(image->arch.elf_headers);
@@ -57,36 +57,19 @@ static int create_dtb(struct kimage *image,
 	cmdline_len = cmdline ? strlen(cmdline) : 0;
 	buf_size = fdt_totalsize(initial_boot_params)
 			+ cmdline_len + DTB_EXTRA_SPACE;
-
-	for (;;) {
-		buf = vmalloc(buf_size);
-		if (!buf)
-			return -ENOMEM;
-
-		/* duplicate a device tree blob */
-		ret = fdt_open_into(initial_boot_params, buf, buf_size);
-		if (ret)
-			return -EINVAL;
-
-		ret = of_kexec_setup_new_fdt(image, buf, initrd_load_addr,
+	buf = of_alloc_and_init_fdt(buf_size);
+	if (!buf)
+		return -ENOMEM;
+	ret = of_kexec_setup_new_fdt(image, buf, initrd_load_addr,
 					     initrd_len, cmdline);
-		if (ret) {
-			vfree(buf);
-			if (ret == -ENOMEM) {
-				/* unlikely, but just in case */
-				buf_size += DTB_EXTRA_SPACE;
-				continue;
-			} else {
-				return ret;
-			}
-		}
-
+	if (!ret) {
 		/* trim it */
 		fdt_pack(buf);
 		*dtb = buf;
+	} else
+		of_free_fdt(buf);
 
-		return 0;
-	}
+	return ret;
 }
 
 static int prepare_elf_headers(void **addr, unsigned long *sz)
@@ -224,6 +207,6 @@ int load_other_segments(struct kimage *image,
 
 out_err:
 	image->nr_segments = orig_segments;
-	vfree(dtb);
+	of_free_fdt(dtb);
 	return ret;
 }
-- 
2.30.0


^ permalink raw reply related

* [PATCH v16 12/12] arm64: Enable passing IMA log to next kernel on kexec
From: Lakshmi Ramasubramanian @ 2021-02-04 16:41 UTC (permalink / raw)
  To: zohar, bauerman, robh, takahiro.akashi, gregkh, will, joe,
	catalin.marinas, mpe
  Cc: mark.rutland, bhsharma, tao.li, paulus, vincenzo.frascino,
	frowand.list, sashal, masahiroy, jmorris, linux-arm-kernel, serge,
	devicetree, pasha.tatashin, prsriva, hsinyi, allison,
	christophe.leroy, mbrugger, balajib, dmitry.kasatkin,
	linux-kernel, james.morse, linux-integrity, linuxppc-dev
In-Reply-To: <20210204164135.29856-1-nramas@linux.microsoft.com>

Update CONFIG_KEXEC_FILE to select CONFIG_HAVE_IMA_KEXEC, if CONFIG_IMA
is enabled, to indicate that the IMA measurement log information is
present in the device tree for ARM64.

Co-developed-by: Prakhar Srivastava <prsriva@linux.microsoft.com>
Signed-off-by: Prakhar Srivastava <prsriva@linux.microsoft.com>
Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 05e17351e4f3..8a93573cebb6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1093,6 +1093,7 @@ config KEXEC
 config KEXEC_FILE
 	bool "kexec file based system call"
 	select KEXEC_CORE
+	select HAVE_IMA_KEXEC if IMA
 	help
 	  This is new version of kexec system call. This system call is
 	  file based and takes file descriptors as system call argument
-- 
2.30.0


^ permalink raw reply related

* [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
From: Lakshmi Ramasubramanian @ 2021-02-04 16:41 UTC (permalink / raw)
  To: zohar, bauerman, robh, takahiro.akashi, gregkh, will, joe,
	catalin.marinas, mpe
  Cc: mark.rutland, bhsharma, tao.li, paulus, vincenzo.frascino,
	frowand.list, sashal, masahiroy, jmorris, linux-arm-kernel, serge,
	devicetree, pasha.tatashin, prsriva, hsinyi, allison,
	christophe.leroy, mbrugger, balajib, dmitry.kasatkin,
	linux-kernel, james.morse, linux-integrity, linuxppc-dev
In-Reply-To: <20210204164135.29856-1-nramas@linux.microsoft.com>

of_alloc_and_init_fdt() and of_free_fdt() have been defined in
drivers/of/kexec.c to allocate and free memory for FDT.

Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
initialize the FDT, and to free the FDT respectively.

powerpc sets the FDT address in image_loader_data field in
"struct kimage" and the memory is freed in
kimage_file_post_load_cleanup().  This cleanup function uses kfree()
to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
for allocation, the buffer needs to be freed using kvfree().

Define "fdt" field in "struct kimage_arch" for powerpc to store
the address of FDT, and free the memory in powerpc specific
arch_kimage_file_post_load_cleanup().

Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Rob Herring <robh@kernel.org>
Suggested-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
---
 arch/powerpc/include/asm/kexec.h  |  2 ++
 arch/powerpc/kexec/elf_64.c       | 26 ++++++++++++++++----------
 arch/powerpc/kexec/file_load_64.c |  3 +++
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 2c0be93d239a..d7d13cac4d31 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -111,6 +111,8 @@ struct kimage_arch {
 	unsigned long elf_headers_mem;
 	unsigned long elf_headers_sz;
 	void *elf_headers;
+
+	void *fdt;
 };
 
 char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index d0e459bb2f05..51d2d8eb6c1b 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -19,6 +19,7 @@
 #include <linux/kexec.h>
 #include <linux/libfdt.h>
 #include <linux/module.h>
+#include <linux/of.h>
 #include <linux/of_fdt.h>
 #include <linux/slab.h>
 #include <linux/types.h>
@@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 	unsigned int fdt_size;
 	unsigned long kernel_load_addr;
 	unsigned long initrd_load_addr = 0, fdt_load_addr;
-	void *fdt;
+	void *fdt = NULL;
 	const void *slave_code;
 	struct elfhdr ehdr;
 	char *modified_cmdline = NULL;
@@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 	}
 
 	fdt_size = fdt_totalsize(initial_boot_params) * 2;
-	fdt = kmalloc(fdt_size, GFP_KERNEL);
+	fdt = of_alloc_and_init_fdt(fdt_size);
 	if (!fdt) {
 		pr_err("Not enough memory for the device tree.\n");
 		ret = -ENOMEM;
 		goto out;
 	}
-	ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
-	if (ret < 0) {
-		pr_err("Error setting up the new device tree.\n");
-		ret = -EINVAL;
-		goto out;
-	}
 
 	ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
 				  initrd_len, cmdline);
@@ -131,6 +126,10 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 	ret = kexec_add_buffer(&kbuf);
 	if (ret)
 		goto out;
+
+	/* FDT will be freed in arch_kimage_file_post_load_cleanup */
+	image->arch.fdt = fdt;
+
 	fdt_load_addr = kbuf.mem;
 
 	pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
@@ -145,8 +144,15 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 	kfree(modified_cmdline);
 	kexec_free_elf_info(&elf_info);
 
-	/* Make kimage_file_post_load_cleanup free the fdt buffer for us. */
-	return ret ? ERR_PTR(ret) : fdt;
+	/*
+	 * Once FDT buffer has been successfully passed to kexec_add_buffer(),
+	 * the FDT buffer address is saved in image->arch.fdt. In that case,
+	 * the memory cannot be freed here in case of any other error.
+	 */
+	if (ret && !image->arch.fdt)
+		of_free_fdt(fdt);
+
+	return ret ? ERR_PTR(ret) : NULL;
 }
 
 const struct kexec_file_ops kexec_elf64_ops = {
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 3cab318aa3b9..d9d5b5569a6d 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -1113,5 +1113,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 	image->arch.elf_headers = NULL;
 	image->arch.elf_headers_sz = 0;
 
+	of_free_fdt(image->arch.fdt);
+	image->arch.fdt = NULL;
+
 	return kexec_image_post_load_cleanup_default(image);
 }
-- 
2.30.0


^ permalink raw reply related

* Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
From: Jens Axboe @ 2021-02-04 16:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Christophe Leroy, Nicholas Piggin,
	Michael Ellerman, Zorro Lang
  Cc: linuxppc-dev
In-Reply-To: <e2dfb7ec-a4d4-579d-a79b-6709df6e9e50@linux.ibm.com>

On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
> On 2/2/21 11:50 AM, Christophe Leroy wrote:
>>
>>
>> Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
>>> On 2/2/21 11:32 AM, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
>>>>> Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> Nicholas Piggin <npiggin@gmail.com> writes:
>>>>>>
>>>>>>> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
>>>>>>>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>>>>>>>> +Aneesh
>>>>>>>>>
>>>>>>>>> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
>>>>>>>> ..
>>>>>>>>>> [   96.200296] ------------[ cut here ]------------
>>>>>>>>>> [   96.200304] Bug: Read fault blocked by KUAP!
>>>>>>>>>> [   96.200309] WARNING: CPU: 3 PID: 1876 at 
>>>>>>>>>> arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>>>>>>>>>
>>>>>>>>>> [   96.200734] NIP [c000000000849424] 
>>>>>>>>>> fault_in_pages_readable+0x104/0x350
>>>>>>>>>> [   96.200741] LR [c00000000084952c] 
>>>>>>>>>> fault_in_pages_readable+0x20c/0x350
>>>>>>>>>> [   96.200747] --- interrupt: 300
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Problem happens in a section where userspace access is supposed 
>>>>>>>>> to be granted, so the patch you
>>>>>>>>> proposed is definitely not the right fix.
>>>>>>>>>
>>>>>>>>> c000000000849408:    2c 01 00 4c     isync
>>>>>>>>> c00000000084940c:    a6 03 3d 7d     mtspr   29,r9  <== granting 
>>>>>>>>> userspace access permission
>>>>>>>>> c000000000849410:    2c 01 00 4c     isync
>>>>>>>>> c000000000849414:    00 00 36 e9     ld      r9,0(r22)
>>>>>>>>> c000000000849418:    20 00 29 81     lwz     r9,32(r9)
>>>>>>>>> c00000000084941c:    00 02 29 71     andi.   r9,r9,512
>>>>>>>>> c000000000849420:    78 d3 5e 7f     mr      r30,r26
>>>>>>>>> ==> c000000000849424:    00 00 bf 8b     lbz     r29,0(r31)  <== 
>>>>>>>>> accessing userspace
>>>>>>>>> c000000000849428:    10 00 82 41     beq     c000000000849438 
>>>>>>>>> <fault_in_pages_readable+0x118>
>>>>>>>>> c00000000084942c:    2c 01 00 4c     isync
>>>>>>>>> c000000000849430:    a6 03 bd 7e     mtspr   29,r21  <== 
>>>>>>>>> clearing userspace access permission
>>>>>>>>> c000000000849434:    2c 01 00 4c     isync
>>>>>>>>>
>>>>>>>>> My first guess is that the problem is linked to the following 
>>>>>>>>> function, see the comment
>>>>>>>>>
>>>>>>>>> /*
>>>>>>>>>    * For kernel thread that doesn't have thread.regs return
>>>>>>>>>    * default AMR/IAMR values.
>>>>>>>>>    */
>>>>>>>>> static inline u64 current_thread_amr(void)
>>>>>>>>> {
>>>>>>>>>     if (current->thread.regs)
>>>>>>>>>         return current->thread.regs->amr;
>>>>>>>>>     return AMR_KUAP_BLOCKED;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Above function was introduced by commit 48a8ab4eeb82 
>>>>>>>>> ("powerpc/book3s64/pkeys: Don't update SPRN_AMR
>>>>>>>>> when in kernel mode")
>>>>>>>>
>>>>>>>> Yeah that's a bit of a curly one.
>>>>>>>>
>>>>>>>> At some point io_uring did kthread_use_mm(), which is supposed to 
>>>>>>>> mean
>>>>>>>> the kthread can operate on behalf of the original process that 
>>>>>>>> submitted
>>>>>>>> the IO.
>>>>>>>>
>>>>>>>> But because KUAP is implemented using memory protection keys, it 
>>>>>>>> depends
>>>>>>>> on the value of the AMR register, which is not part of the mm, 
>>>>>>>> it's in
>>>>>>>> thread.regs->amr.
>>>>>>>>
>>>>>>>> And what's worse by the time we're in kthread_use_mm() we no 
>>>>>>>> longer have
>>>>>>>> access to the thread.regs->amr of the original process that 
>>>>>>>> submitted
>>>>>>>> the IO.
>>>>>>>>
>>>>>>>> We also can't simply move the AMR into the mm, precisely because 
>>>>>>>> it's
>>>>>>>> per thread, not per mm.
>>>>>>>>
>>>>>>>> So TBH I don't know how we're going to fix this.
>>>>>>>>
>>>>>>>> I guess we could return AMR=unblocked for kernel threads, but that's
>>>>>>>> arguably a bug because it allows a process to circumvent memory 
>>>>>>>> keys by
>>>>>>>> asking the kernel to do the access.
>>>>>>>
>>>>>>> We shouldn't need to inherit AMR should we? We only need it to be 
>>>>>>> locked
>>>>>>> for kernel threads until it's explicitly unlocked -- nothing mm 
>>>>>>> specific
>>>>>>> there. I think current_thread_amr could return 0 for kernel 
>>>>>>> threads? Or
>>>>>>> I would even avoid using that function for allow_user_access and open
>>>>>>> code the kthread case and remove it from current_thread_amr().
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
>>>>>>
>>>>>
>>>>> updated one
>>>>>
>>>>>  From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>> Date: Tue, 2 Feb 2021 09:23:38 +0530
>>>>> Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access 
>>>>> userspace
>>>>>   after kthread_use_mm
>>>>>
>>>>> This fix the bad fault reported by KUAP when io_wqe_worker access 
>>>>> userspace.
>>>>>
>>>>>   Bug: Read fault blocked by KUAP!
>>>>>   WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 
>>>>> __do_page_fault+0x6b4/0xcd0
>>>>>   NIP [c00000000009e7e4] __do_page_fault+0x6b4/0xcd0
>>>>>   LR [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
>>>>> ..........
>>>>>   Call Trace:
>>>>>   [c000000016367330] [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0 
>>>>> (unreliable)
>>>>>   [c0000000163673e0] [c00000000009ee3c] do_page_fault+0x3c/0x120
>>>>>   [c000000016367430] [c00000000000c848] handle_page_fault+0x10/0x2c
>>>>>   --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
>>>>> ..........
>>>>>   NIP [c0000000008e8228] iov_iter_fault_in_readable+0x148/0x6f0
>>>>>   LR [c0000000008e834c] iov_iter_fault_in_readable+0x26c/0x6f0
>>>>>   interrupt: 300
>>>>>   [c0000000163677e0] [c0000000007154a0] iomap_write_actor+0xc0/0x280
>>>>>   [c000000016367880] [c00000000070fc94] iomap_apply+0x1c4/0x780
>>>>>   [c000000016367990] [c000000000710330] 
>>>>> iomap_file_buffered_write+0xa0/0x120
>>>>>   [c0000000163679e0] [c00800000040791c] 
>>>>> xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
>>>>>   [c000000016367a90] [c0000000006d74bc] io_write+0x10c/0x460
>>>>>   [c000000016367bb0] [c0000000006d80e4] io_issue_sqe+0x8d4/0x1200
>>>>>   [c000000016367c70] [c0000000006d8ad0] io_wq_submit_work+0xc0/0x250
>>>>>   [c000000016367cb0] [c0000000006e2578] 
>>>>> io_worker_handle_work+0x498/0x800
>>>>>   [c000000016367d40] [c0000000006e2cdc] io_wqe_worker+0x3fc/0x4f0
>>>>>   [c000000016367da0] [c0000000001cb0a4] kthread+0x1c4/0x1d0
>>>>>   [c000000016367e10] [c00000000000dbf0] 
>>>>> ret_from_kernel_thread+0x5c/0x6c
>>>>>
>>>>> The kernel consider thread AMR value for kernel thread to be
>>>>> AMR_KUAP_BLOCKED. Hence access to userspace is denied. This
>>>>> of course not correct and we should allow userspace access after
>>>>> kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
>>>>> AMR value of the operating address space. But, the AMR value is
>>>>> thread-specific and we inherit the address space and not thread
>>>>> access restrictions. Because of this ignore AMR value when accessing
>>>>> userspace via kernel thread.
>>>>>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> ---
>>>>> Changes from v1:
>>>>> * Address review feedback from Nick
>>>>>
>>>>>   arch/powerpc/include/asm/book3s/64/kup.h | 8 +++++++-
>>>>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
>>>>> b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>> index f50f72e535aa..95f4df99249e 100644
>>>>> --- a/arch/powerpc/include/asm/book3s/64/kup.h
>>>>> +++ b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>> @@ -384,7 +384,13 @@ static __always_inline void 
>>>>> allow_user_access(void __user *to, const void __user
>>>>>       // This is written so we can resolve to a single case at build 
>>>>> time
>>>>>       BUILD_BUG_ON(!__builtin_constant_p(dir));
>>>>> -    if (mmu_has_feature(MMU_FTR_PKEY))
>>>>> +    /*
>>>>> +     * if it is a kthread that did kthread_use_mm() don't
>>>>> +     * use current_thread_amr().
>>>>
>>>> According to include/linux/sched.h, PF_KTHREAD means /* I am a kernel 
>>>> thread */
>>>> It doesn't seem to be related to kthread_use_mm()
>>>
>>> That should be a sufficient check here. if we did reach here without 
>>> calling kthread_user_mm, we will crash on access because we don't have 
>>> a mm attached to the current process.  a kernel thread with 
>>> kthread_use_mm has
>>
>> Ok but then the comment doesn't match the check.
> 
> 
> I was trying to be explict in the comment that we expect the thread to 
> have done kthread_use_mm().
> 
>>
>> And also the comment in current_thread_amr() is then misleading.
>>
>> Why not do the current->flags & PF_KTHREAD check in current_thread_amr() 
>> and return 0 in that case instead of BLOCKED ?
> 
> In my view currrent_thread_amr() is more generic and we want to be 
> explicit there that a kernel thread AMR is KUAP_BLOCKED. Only when we 
> call allow user access, we relax the AMR value.

Just following up on this, as I'd hate to have 5.11 released with this
bug in it for powerpc. It'll obviously also affect other cases of a
kernel thread faulting after having done kthread_use_mm(), though I'm
not sure how widespread that is. In any case, it'll leave io_uring
mostly broken on powerpc if this isn't patched for release.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
From: Aneesh Kumar K.V @ 2021-02-04 17:01 UTC (permalink / raw)
  To: Jens Axboe, Christophe Leroy, Nicholas Piggin, Michael Ellerman,
	Zorro Lang
  Cc: linuxppc-dev
In-Reply-To: <086f6525-b851-c4c0-d611-42f76a54a2d9@kernel.dk>

On 2/4/21 10:23 PM, Jens Axboe wrote:
> On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
>> On 2/2/21 11:50 AM, Christophe Leroy wrote:
>>>
>>>
>>> Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
>>>> On 2/2/21 11:32 AM, Christophe Leroy wrote:
>>>>>
>>>>>
>>>>> Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
>>>>>> Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
>>>>>>
>>>>>>> Nicholas Piggin <npiggin@gmail.com> writes:
>>>>>>>
>>>>>>>> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
>>>>>>>>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>>>>>>>>> +Aneesh
>>>>>>>>>>
>>>>>>>>>> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
>>>>>>>>> ..
>>>>>>>>>>> [   96.200296] ------------[ cut here ]------------
>>>>>>>>>>> [   96.200304] Bug: Read fault blocked by KUAP!
>>>>>>>>>>> [   96.200309] WARNING: CPU: 3 PID: 1876 at
>>>>>>>>>>> arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>>>>>>>>>>
>>>>>>>>>>> [   96.200734] NIP [c000000000849424]
>>>>>>>>>>> fault_in_pages_readable+0x104/0x350
>>>>>>>>>>> [   96.200741] LR [c00000000084952c]
>>>>>>>>>>> fault_in_pages_readable+0x20c/0x350
>>>>>>>>>>> [   96.200747] --- interrupt: 300
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Problem happens in a section where userspace access is supposed
>>>>>>>>>> to be granted, so the patch you
>>>>>>>>>> proposed is definitely not the right fix.
>>>>>>>>>>
>>>>>>>>>> c000000000849408:    2c 01 00 4c     isync
>>>>>>>>>> c00000000084940c:    a6 03 3d 7d     mtspr   29,r9  <== granting
>>>>>>>>>> userspace access permission
>>>>>>>>>> c000000000849410:    2c 01 00 4c     isync
>>>>>>>>>> c000000000849414:    00 00 36 e9     ld      r9,0(r22)
>>>>>>>>>> c000000000849418:    20 00 29 81     lwz     r9,32(r9)
>>>>>>>>>> c00000000084941c:    00 02 29 71     andi.   r9,r9,512
>>>>>>>>>> c000000000849420:    78 d3 5e 7f     mr      r30,r26
>>>>>>>>>> ==> c000000000849424:    00 00 bf 8b     lbz     r29,0(r31)  <==
>>>>>>>>>> accessing userspace
>>>>>>>>>> c000000000849428:    10 00 82 41     beq     c000000000849438
>>>>>>>>>> <fault_in_pages_readable+0x118>
>>>>>>>>>> c00000000084942c:    2c 01 00 4c     isync
>>>>>>>>>> c000000000849430:    a6 03 bd 7e     mtspr   29,r21  <==
>>>>>>>>>> clearing userspace access permission
>>>>>>>>>> c000000000849434:    2c 01 00 4c     isync
>>>>>>>>>>
>>>>>>>>>> My first guess is that the problem is linked to the following
>>>>>>>>>> function, see the comment
>>>>>>>>>>
>>>>>>>>>> /*
>>>>>>>>>>     * For kernel thread that doesn't have thread.regs return
>>>>>>>>>>     * default AMR/IAMR values.
>>>>>>>>>>     */
>>>>>>>>>> static inline u64 current_thread_amr(void)
>>>>>>>>>> {
>>>>>>>>>>      if (current->thread.regs)
>>>>>>>>>>          return current->thread.regs->amr;
>>>>>>>>>>      return AMR_KUAP_BLOCKED;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Above function was introduced by commit 48a8ab4eeb82
>>>>>>>>>> ("powerpc/book3s64/pkeys: Don't update SPRN_AMR
>>>>>>>>>> when in kernel mode")
>>>>>>>>>
>>>>>>>>> Yeah that's a bit of a curly one.
>>>>>>>>>
>>>>>>>>> At some point io_uring did kthread_use_mm(), which is supposed to
>>>>>>>>> mean
>>>>>>>>> the kthread can operate on behalf of the original process that
>>>>>>>>> submitted
>>>>>>>>> the IO.
>>>>>>>>>
>>>>>>>>> But because KUAP is implemented using memory protection keys, it
>>>>>>>>> depends
>>>>>>>>> on the value of the AMR register, which is not part of the mm,
>>>>>>>>> it's in
>>>>>>>>> thread.regs->amr.
>>>>>>>>>
>>>>>>>>> And what's worse by the time we're in kthread_use_mm() we no
>>>>>>>>> longer have
>>>>>>>>> access to the thread.regs->amr of the original process that
>>>>>>>>> submitted
>>>>>>>>> the IO.
>>>>>>>>>
>>>>>>>>> We also can't simply move the AMR into the mm, precisely because
>>>>>>>>> it's
>>>>>>>>> per thread, not per mm.
>>>>>>>>>
>>>>>>>>> So TBH I don't know how we're going to fix this.
>>>>>>>>>
>>>>>>>>> I guess we could return AMR=unblocked for kernel threads, but that's
>>>>>>>>> arguably a bug because it allows a process to circumvent memory
>>>>>>>>> keys by
>>>>>>>>> asking the kernel to do the access.
>>>>>>>>
>>>>>>>> We shouldn't need to inherit AMR should we? We only need it to be
>>>>>>>> locked
>>>>>>>> for kernel threads until it's explicitly unlocked -- nothing mm
>>>>>>>> specific
>>>>>>>> there. I think current_thread_amr could return 0 for kernel
>>>>>>>> threads? Or
>>>>>>>> I would even avoid using that function for allow_user_access and open
>>>>>>>> code the kthread case and remove it from current_thread_amr().
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>
>>>>>>
>>>>>> updated one
>>>>>>
>>>>>>   From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>> Date: Tue, 2 Feb 2021 09:23:38 +0530
>>>>>> Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access
>>>>>> userspace
>>>>>>    after kthread_use_mm
>>>>>>
>>>>>> This fix the bad fault reported by KUAP when io_wqe_worker access
>>>>>> userspace.
>>>>>>
>>>>>>    Bug: Read fault blocked by KUAP!
>>>>>>    WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229
>>>>>> __do_page_fault+0x6b4/0xcd0
>>>>>>    NIP [c00000000009e7e4] __do_page_fault+0x6b4/0xcd0
>>>>>>    LR [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
>>>>>> ..........
>>>>>>    Call Trace:
>>>>>>    [c000000016367330] [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
>>>>>> (unreliable)
>>>>>>    [c0000000163673e0] [c00000000009ee3c] do_page_fault+0x3c/0x120
>>>>>>    [c000000016367430] [c00000000000c848] handle_page_fault+0x10/0x2c
>>>>>>    --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
>>>>>> ..........
>>>>>>    NIP [c0000000008e8228] iov_iter_fault_in_readable+0x148/0x6f0
>>>>>>    LR [c0000000008e834c] iov_iter_fault_in_readable+0x26c/0x6f0
>>>>>>    interrupt: 300
>>>>>>    [c0000000163677e0] [c0000000007154a0] iomap_write_actor+0xc0/0x280
>>>>>>    [c000000016367880] [c00000000070fc94] iomap_apply+0x1c4/0x780
>>>>>>    [c000000016367990] [c000000000710330]
>>>>>> iomap_file_buffered_write+0xa0/0x120
>>>>>>    [c0000000163679e0] [c00800000040791c]
>>>>>> xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
>>>>>>    [c000000016367a90] [c0000000006d74bc] io_write+0x10c/0x460
>>>>>>    [c000000016367bb0] [c0000000006d80e4] io_issue_sqe+0x8d4/0x1200
>>>>>>    [c000000016367c70] [c0000000006d8ad0] io_wq_submit_work+0xc0/0x250
>>>>>>    [c000000016367cb0] [c0000000006e2578]
>>>>>> io_worker_handle_work+0x498/0x800
>>>>>>    [c000000016367d40] [c0000000006e2cdc] io_wqe_worker+0x3fc/0x4f0
>>>>>>    [c000000016367da0] [c0000000001cb0a4] kthread+0x1c4/0x1d0
>>>>>>    [c000000016367e10] [c00000000000dbf0]
>>>>>> ret_from_kernel_thread+0x5c/0x6c
>>>>>>
>>>>>> The kernel consider thread AMR value for kernel thread to be
>>>>>> AMR_KUAP_BLOCKED. Hence access to userspace is denied. This
>>>>>> of course not correct and we should allow userspace access after
>>>>>> kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
>>>>>> AMR value of the operating address space. But, the AMR value is
>>>>>> thread-specific and we inherit the address space and not thread
>>>>>> access restrictions. Because of this ignore AMR value when accessing
>>>>>> userspace via kernel thread.
>>>>>>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>> ---
>>>>>> Changes from v1:
>>>>>> * Address review feedback from Nick
>>>>>>
>>>>>>    arch/powerpc/include/asm/book3s/64/kup.h | 8 +++++++-
>>>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>> b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>> index f50f72e535aa..95f4df99249e 100644
>>>>>> --- a/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>> +++ b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>> @@ -384,7 +384,13 @@ static __always_inline void
>>>>>> allow_user_access(void __user *to, const void __user
>>>>>>        // This is written so we can resolve to a single case at build
>>>>>> time
>>>>>>        BUILD_BUG_ON(!__builtin_constant_p(dir));
>>>>>> -    if (mmu_has_feature(MMU_FTR_PKEY))
>>>>>> +    /*
>>>>>> +     * if it is a kthread that did kthread_use_mm() don't
>>>>>> +     * use current_thread_amr().
>>>>>
>>>>> According to include/linux/sched.h, PF_KTHREAD means /* I am a kernel
>>>>> thread */
>>>>> It doesn't seem to be related to kthread_use_mm()
>>>>
>>>> That should be a sufficient check here. if we did reach here without
>>>> calling kthread_user_mm, we will crash on access because we don't have
>>>> a mm attached to the current process.  a kernel thread with
>>>> kthread_use_mm has
>>>
>>> Ok but then the comment doesn't match the check.
>>
>>
>> I was trying to be explict in the comment that we expect the thread to
>> have done kthread_use_mm().
>>
>>>
>>> And also the comment in current_thread_amr() is then misleading.
>>>
>>> Why not do the current->flags & PF_KTHREAD check in current_thread_amr()
>>> and return 0 in that case instead of BLOCKED ?
>>
>> In my view currrent_thread_amr() is more generic and we want to be
>> explicit there that a kernel thread AMR is KUAP_BLOCKED. Only when we
>> call allow user access, we relax the AMR value.
> 
> Just following up on this, as I'd hate to have 5.11 released with this
> bug in it for powerpc. It'll obviously also affect other cases of a
> kernel thread faulting after having done kthread_use_mm(), though I'm
> not sure how widespread that is. In any case, it'll leave io_uring
> mostly broken on powerpc if this isn't patched for release.
> 

I am waiting for test feedback on the change I posted earlier. I am also 
running a regression run myself. Once that is complete i will post the 
patch as a separate email.

-aneesh

^ permalink raw reply

* Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
From: Jens Axboe @ 2021-02-04 17:03 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Christophe Leroy, Nicholas Piggin,
	Michael Ellerman, Zorro Lang
  Cc: linuxppc-dev
In-Reply-To: <490cea1a-8161-2f76-03e6-7e2b16efa648@linux.ibm.com>

On 2/4/21 10:01 AM, Aneesh Kumar K.V wrote:
> On 2/4/21 10:23 PM, Jens Axboe wrote:
>> On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
>>> On 2/2/21 11:50 AM, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
>>>>> On 2/2/21 11:32 AM, Christophe Leroy wrote:
>>>>>>
>>>>>>
>>>>>> Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
>>>>>>> Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>
>>>>>>>> Nicholas Piggin <npiggin@gmail.com> writes:
>>>>>>>>
>>>>>>>>> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
>>>>>>>>>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>>>>>>>>>> +Aneesh
>>>>>>>>>>>
>>>>>>>>>>> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
>>>>>>>>>> ..
>>>>>>>>>>>> [   96.200296] ------------[ cut here ]------------
>>>>>>>>>>>> [   96.200304] Bug: Read fault blocked by KUAP!
>>>>>>>>>>>> [   96.200309] WARNING: CPU: 3 PID: 1876 at
>>>>>>>>>>>> arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>>>>>>>>>>>
>>>>>>>>>>>> [   96.200734] NIP [c000000000849424]
>>>>>>>>>>>> fault_in_pages_readable+0x104/0x350
>>>>>>>>>>>> [   96.200741] LR [c00000000084952c]
>>>>>>>>>>>> fault_in_pages_readable+0x20c/0x350
>>>>>>>>>>>> [   96.200747] --- interrupt: 300
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Problem happens in a section where userspace access is supposed
>>>>>>>>>>> to be granted, so the patch you
>>>>>>>>>>> proposed is definitely not the right fix.
>>>>>>>>>>>
>>>>>>>>>>> c000000000849408:    2c 01 00 4c     isync
>>>>>>>>>>> c00000000084940c:    a6 03 3d 7d     mtspr   29,r9  <== granting
>>>>>>>>>>> userspace access permission
>>>>>>>>>>> c000000000849410:    2c 01 00 4c     isync
>>>>>>>>>>> c000000000849414:    00 00 36 e9     ld      r9,0(r22)
>>>>>>>>>>> c000000000849418:    20 00 29 81     lwz     r9,32(r9)
>>>>>>>>>>> c00000000084941c:    00 02 29 71     andi.   r9,r9,512
>>>>>>>>>>> c000000000849420:    78 d3 5e 7f     mr      r30,r26
>>>>>>>>>>> ==> c000000000849424:    00 00 bf 8b     lbz     r29,0(r31)  <==
>>>>>>>>>>> accessing userspace
>>>>>>>>>>> c000000000849428:    10 00 82 41     beq     c000000000849438
>>>>>>>>>>> <fault_in_pages_readable+0x118>
>>>>>>>>>>> c00000000084942c:    2c 01 00 4c     isync
>>>>>>>>>>> c000000000849430:    a6 03 bd 7e     mtspr   29,r21  <==
>>>>>>>>>>> clearing userspace access permission
>>>>>>>>>>> c000000000849434:    2c 01 00 4c     isync
>>>>>>>>>>>
>>>>>>>>>>> My first guess is that the problem is linked to the following
>>>>>>>>>>> function, see the comment
>>>>>>>>>>>
>>>>>>>>>>> /*
>>>>>>>>>>>     * For kernel thread that doesn't have thread.regs return
>>>>>>>>>>>     * default AMR/IAMR values.
>>>>>>>>>>>     */
>>>>>>>>>>> static inline u64 current_thread_amr(void)
>>>>>>>>>>> {
>>>>>>>>>>>      if (current->thread.regs)
>>>>>>>>>>>          return current->thread.regs->amr;
>>>>>>>>>>>      return AMR_KUAP_BLOCKED;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Above function was introduced by commit 48a8ab4eeb82
>>>>>>>>>>> ("powerpc/book3s64/pkeys: Don't update SPRN_AMR
>>>>>>>>>>> when in kernel mode")
>>>>>>>>>>
>>>>>>>>>> Yeah that's a bit of a curly one.
>>>>>>>>>>
>>>>>>>>>> At some point io_uring did kthread_use_mm(), which is supposed to
>>>>>>>>>> mean
>>>>>>>>>> the kthread can operate on behalf of the original process that
>>>>>>>>>> submitted
>>>>>>>>>> the IO.
>>>>>>>>>>
>>>>>>>>>> But because KUAP is implemented using memory protection keys, it
>>>>>>>>>> depends
>>>>>>>>>> on the value of the AMR register, which is not part of the mm,
>>>>>>>>>> it's in
>>>>>>>>>> thread.regs->amr.
>>>>>>>>>>
>>>>>>>>>> And what's worse by the time we're in kthread_use_mm() we no
>>>>>>>>>> longer have
>>>>>>>>>> access to the thread.regs->amr of the original process that
>>>>>>>>>> submitted
>>>>>>>>>> the IO.
>>>>>>>>>>
>>>>>>>>>> We also can't simply move the AMR into the mm, precisely because
>>>>>>>>>> it's
>>>>>>>>>> per thread, not per mm.
>>>>>>>>>>
>>>>>>>>>> So TBH I don't know how we're going to fix this.
>>>>>>>>>>
>>>>>>>>>> I guess we could return AMR=unblocked for kernel threads, but that's
>>>>>>>>>> arguably a bug because it allows a process to circumvent memory
>>>>>>>>>> keys by
>>>>>>>>>> asking the kernel to do the access.
>>>>>>>>>
>>>>>>>>> We shouldn't need to inherit AMR should we? We only need it to be
>>>>>>>>> locked
>>>>>>>>> for kernel threads until it's explicitly unlocked -- nothing mm
>>>>>>>>> specific
>>>>>>>>> there. I think current_thread_amr could return 0 for kernel
>>>>>>>>> threads? Or
>>>>>>>>> I would even avoid using that function for allow_user_access and open
>>>>>>>>> code the kthread case and remove it from current_thread_amr().
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nick
>>>>>>>>
>>>>>>>
>>>>>>> updated one
>>>>>>>
>>>>>>>   From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
>>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>>> Date: Tue, 2 Feb 2021 09:23:38 +0530
>>>>>>> Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access
>>>>>>> userspace
>>>>>>>    after kthread_use_mm
>>>>>>>
>>>>>>> This fix the bad fault reported by KUAP when io_wqe_worker access
>>>>>>> userspace.
>>>>>>>
>>>>>>>    Bug: Read fault blocked by KUAP!
>>>>>>>    WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229
>>>>>>> __do_page_fault+0x6b4/0xcd0
>>>>>>>    NIP [c00000000009e7e4] __do_page_fault+0x6b4/0xcd0
>>>>>>>    LR [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
>>>>>>> ..........
>>>>>>>    Call Trace:
>>>>>>>    [c000000016367330] [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
>>>>>>> (unreliable)
>>>>>>>    [c0000000163673e0] [c00000000009ee3c] do_page_fault+0x3c/0x120
>>>>>>>    [c000000016367430] [c00000000000c848] handle_page_fault+0x10/0x2c
>>>>>>>    --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
>>>>>>> ..........
>>>>>>>    NIP [c0000000008e8228] iov_iter_fault_in_readable+0x148/0x6f0
>>>>>>>    LR [c0000000008e834c] iov_iter_fault_in_readable+0x26c/0x6f0
>>>>>>>    interrupt: 300
>>>>>>>    [c0000000163677e0] [c0000000007154a0] iomap_write_actor+0xc0/0x280
>>>>>>>    [c000000016367880] [c00000000070fc94] iomap_apply+0x1c4/0x780
>>>>>>>    [c000000016367990] [c000000000710330]
>>>>>>> iomap_file_buffered_write+0xa0/0x120
>>>>>>>    [c0000000163679e0] [c00800000040791c]
>>>>>>> xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
>>>>>>>    [c000000016367a90] [c0000000006d74bc] io_write+0x10c/0x460
>>>>>>>    [c000000016367bb0] [c0000000006d80e4] io_issue_sqe+0x8d4/0x1200
>>>>>>>    [c000000016367c70] [c0000000006d8ad0] io_wq_submit_work+0xc0/0x250
>>>>>>>    [c000000016367cb0] [c0000000006e2578]
>>>>>>> io_worker_handle_work+0x498/0x800
>>>>>>>    [c000000016367d40] [c0000000006e2cdc] io_wqe_worker+0x3fc/0x4f0
>>>>>>>    [c000000016367da0] [c0000000001cb0a4] kthread+0x1c4/0x1d0
>>>>>>>    [c000000016367e10] [c00000000000dbf0]
>>>>>>> ret_from_kernel_thread+0x5c/0x6c
>>>>>>>
>>>>>>> The kernel consider thread AMR value for kernel thread to be
>>>>>>> AMR_KUAP_BLOCKED. Hence access to userspace is denied. This
>>>>>>> of course not correct and we should allow userspace access after
>>>>>>> kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
>>>>>>> AMR value of the operating address space. But, the AMR value is
>>>>>>> thread-specific and we inherit the address space and not thread
>>>>>>> access restrictions. Because of this ignore AMR value when accessing
>>>>>>> userspace via kernel thread.
>>>>>>>
>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>> ---
>>>>>>> Changes from v1:
>>>>>>> * Address review feedback from Nick
>>>>>>>
>>>>>>>    arch/powerpc/include/asm/book3s/64/kup.h | 8 +++++++-
>>>>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>>> b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>>> index f50f72e535aa..95f4df99249e 100644
>>>>>>> --- a/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>>> +++ b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>>> @@ -384,7 +384,13 @@ static __always_inline void
>>>>>>> allow_user_access(void __user *to, const void __user
>>>>>>>        // This is written so we can resolve to a single case at build
>>>>>>> time
>>>>>>>        BUILD_BUG_ON(!__builtin_constant_p(dir));
>>>>>>> -    if (mmu_has_feature(MMU_FTR_PKEY))
>>>>>>> +    /*
>>>>>>> +     * if it is a kthread that did kthread_use_mm() don't
>>>>>>> +     * use current_thread_amr().
>>>>>>
>>>>>> According to include/linux/sched.h, PF_KTHREAD means /* I am a kernel
>>>>>> thread */
>>>>>> It doesn't seem to be related to kthread_use_mm()
>>>>>
>>>>> That should be a sufficient check here. if we did reach here without
>>>>> calling kthread_user_mm, we will crash on access because we don't have
>>>>> a mm attached to the current process.  a kernel thread with
>>>>> kthread_use_mm has
>>>>
>>>> Ok but then the comment doesn't match the check.
>>>
>>>
>>> I was trying to be explict in the comment that we expect the thread to
>>> have done kthread_use_mm().
>>>
>>>>
>>>> And also the comment in current_thread_amr() is then misleading.
>>>>
>>>> Why not do the current->flags & PF_KTHREAD check in current_thread_amr()
>>>> and return 0 in that case instead of BLOCKED ?
>>>
>>> In my view currrent_thread_amr() is more generic and we want to be
>>> explicit there that a kernel thread AMR is KUAP_BLOCKED. Only when we
>>> call allow user access, we relax the AMR value.
>>
>> Just following up on this, as I'd hate to have 5.11 released with this
>> bug in it for powerpc. It'll obviously also affect other cases of a
>> kernel thread faulting after having done kthread_use_mm(), though I'm
>> not sure how widespread that is. In any case, it'll leave io_uring
>> mostly broken on powerpc if this isn't patched for release.
>>
> 
> I am waiting for test feedback on the change I posted earlier. I am also 
> running a regression run myself. Once that is complete i will post the 
> patch as a separate email.

Perfect, sounds good!

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
From: Zorro Lang @ 2021-02-04 17:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Jens Axboe, linuxppc-dev, Nicholas Piggin
In-Reply-To: <490cea1a-8161-2f76-03e6-7e2b16efa648@linux.ibm.com>

On Thu, Feb 04, 2021 at 10:31:59PM +0530, Aneesh Kumar K.V wrote:
> On 2/4/21 10:23 PM, Jens Axboe wrote:
> > On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
> > > On 2/2/21 11:50 AM, Christophe Leroy wrote:
> > > > 
> > > > 
> > > > Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
> > > > > On 2/2/21 11:32 AM, Christophe Leroy wrote:
> > > > > > 
> > > > > > 
> > > > > > Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
> > > > > > > Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
> > > > > > > 
> > > > > > > > Nicholas Piggin <npiggin@gmail.com> writes:
> > > > > > > > 
> > > > > > > > > Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
> > > > > > > > > > Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> > > > > > > > > > > +Aneesh
> > > > > > > > > > > 
> > > > > > > > > > > Le 29/01/2021 à 07:52, Zorro Lang a écrit :
> > > > > > > > > > ..
> > > > > > > > > > > > [   96.200296] ------------[ cut here ]------------
> > > > > > > > > > > > [   96.200304] Bug: Read fault blocked by KUAP!
> > > > > > > > > > > > [   96.200309] WARNING: CPU: 3 PID: 1876 at
> > > > > > > > > > > > arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
> > > > > > > > > > > 
> > > > > > > > > > > > [   96.200734] NIP [c000000000849424]
> > > > > > > > > > > > fault_in_pages_readable+0x104/0x350
> > > > > > > > > > > > [   96.200741] LR [c00000000084952c]
> > > > > > > > > > > > fault_in_pages_readable+0x20c/0x350
> > > > > > > > > > > > [   96.200747] --- interrupt: 300
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Problem happens in a section where userspace access is supposed
> > > > > > > > > > > to be granted, so the patch you
> > > > > > > > > > > proposed is definitely not the right fix.
> > > > > > > > > > > 
> > > > > > > > > > > c000000000849408:    2c 01 00 4c     isync
> > > > > > > > > > > c00000000084940c:    a6 03 3d 7d     mtspr   29,r9  <== granting
> > > > > > > > > > > userspace access permission
> > > > > > > > > > > c000000000849410:    2c 01 00 4c     isync
> > > > > > > > > > > c000000000849414:    00 00 36 e9     ld      r9,0(r22)
> > > > > > > > > > > c000000000849418:    20 00 29 81     lwz     r9,32(r9)
> > > > > > > > > > > c00000000084941c:    00 02 29 71     andi.   r9,r9,512
> > > > > > > > > > > c000000000849420:    78 d3 5e 7f     mr      r30,r26
> > > > > > > > > > > ==> c000000000849424:    00 00 bf 8b     lbz     r29,0(r31)  <==
> > > > > > > > > > > accessing userspace
> > > > > > > > > > > c000000000849428:    10 00 82 41     beq     c000000000849438
> > > > > > > > > > > <fault_in_pages_readable+0x118>
> > > > > > > > > > > c00000000084942c:    2c 01 00 4c     isync
> > > > > > > > > > > c000000000849430:    a6 03 bd 7e     mtspr   29,r21  <==
> > > > > > > > > > > clearing userspace access permission
> > > > > > > > > > > c000000000849434:    2c 01 00 4c     isync
> > > > > > > > > > > 
> > > > > > > > > > > My first guess is that the problem is linked to the following
> > > > > > > > > > > function, see the comment
> > > > > > > > > > > 
> > > > > > > > > > > /*
> > > > > > > > > > >     * For kernel thread that doesn't have thread.regs return
> > > > > > > > > > >     * default AMR/IAMR values.
> > > > > > > > > > >     */
> > > > > > > > > > > static inline u64 current_thread_amr(void)
> > > > > > > > > > > {
> > > > > > > > > > >      if (current->thread.regs)
> > > > > > > > > > >          return current->thread.regs->amr;
> > > > > > > > > > >      return AMR_KUAP_BLOCKED;
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > Above function was introduced by commit 48a8ab4eeb82
> > > > > > > > > > > ("powerpc/book3s64/pkeys: Don't update SPRN_AMR
> > > > > > > > > > > when in kernel mode")
> > > > > > > > > > 
> > > > > > > > > > Yeah that's a bit of a curly one.
> > > > > > > > > > 
> > > > > > > > > > At some point io_uring did kthread_use_mm(), which is supposed to
> > > > > > > > > > mean
> > > > > > > > > > the kthread can operate on behalf of the original process that
> > > > > > > > > > submitted
> > > > > > > > > > the IO.
> > > > > > > > > > 
> > > > > > > > > > But because KUAP is implemented using memory protection keys, it
> > > > > > > > > > depends
> > > > > > > > > > on the value of the AMR register, which is not part of the mm,
> > > > > > > > > > it's in
> > > > > > > > > > thread.regs->amr.
> > > > > > > > > > 
> > > > > > > > > > And what's worse by the time we're in kthread_use_mm() we no
> > > > > > > > > > longer have
> > > > > > > > > > access to the thread.regs->amr of the original process that
> > > > > > > > > > submitted
> > > > > > > > > > the IO.
> > > > > > > > > > 
> > > > > > > > > > We also can't simply move the AMR into the mm, precisely because
> > > > > > > > > > it's
> > > > > > > > > > per thread, not per mm.
> > > > > > > > > > 
> > > > > > > > > > So TBH I don't know how we're going to fix this.
> > > > > > > > > > 
> > > > > > > > > > I guess we could return AMR=unblocked for kernel threads, but that's
> > > > > > > > > > arguably a bug because it allows a process to circumvent memory
> > > > > > > > > > keys by
> > > > > > > > > > asking the kernel to do the access.
> > > > > > > > > 
> > > > > > > > > We shouldn't need to inherit AMR should we? We only need it to be
> > > > > > > > > locked
> > > > > > > > > for kernel threads until it's explicitly unlocked -- nothing mm
> > > > > > > > > specific
> > > > > > > > > there. I think current_thread_amr could return 0 for kernel
> > > > > > > > > threads? Or
> > > > > > > > > I would even avoid using that function for allow_user_access and open
> > > > > > > > > code the kthread case and remove it from current_thread_amr().
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Nick
> > > > > > > > 
> > > > > > > 
> > > > > > > updated one
> > > > > > > 
> > > > > > >   From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
> > > > > > > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> > > > > > > Date: Tue, 2 Feb 2021 09:23:38 +0530
> > > > > > > Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access
> > > > > > > userspace
> > > > > > >    after kthread_use_mm
> > > > > > > 
> > > > > > > This fix the bad fault reported by KUAP when io_wqe_worker access
> > > > > > > userspace.
> > > > > > > 
> > > > > > >    Bug: Read fault blocked by KUAP!
> > > > > > >    WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229
> > > > > > > __do_page_fault+0x6b4/0xcd0
> > > > > > >    NIP [c00000000009e7e4] __do_page_fault+0x6b4/0xcd0
> > > > > > >    LR [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
> > > > > > > ..........
> > > > > > >    Call Trace:
> > > > > > >    [c000000016367330] [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
> > > > > > > (unreliable)
> > > > > > >    [c0000000163673e0] [c00000000009ee3c] do_page_fault+0x3c/0x120
> > > > > > >    [c000000016367430] [c00000000000c848] handle_page_fault+0x10/0x2c
> > > > > > >    --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
> > > > > > > ..........
> > > > > > >    NIP [c0000000008e8228] iov_iter_fault_in_readable+0x148/0x6f0
> > > > > > >    LR [c0000000008e834c] iov_iter_fault_in_readable+0x26c/0x6f0
> > > > > > >    interrupt: 300
> > > > > > >    [c0000000163677e0] [c0000000007154a0] iomap_write_actor+0xc0/0x280
> > > > > > >    [c000000016367880] [c00000000070fc94] iomap_apply+0x1c4/0x780
> > > > > > >    [c000000016367990] [c000000000710330]
> > > > > > > iomap_file_buffered_write+0xa0/0x120
> > > > > > >    [c0000000163679e0] [c00800000040791c]
> > > > > > > xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
> > > > > > >    [c000000016367a90] [c0000000006d74bc] io_write+0x10c/0x460
> > > > > > >    [c000000016367bb0] [c0000000006d80e4] io_issue_sqe+0x8d4/0x1200
> > > > > > >    [c000000016367c70] [c0000000006d8ad0] io_wq_submit_work+0xc0/0x250
> > > > > > >    [c000000016367cb0] [c0000000006e2578]
> > > > > > > io_worker_handle_work+0x498/0x800
> > > > > > >    [c000000016367d40] [c0000000006e2cdc] io_wqe_worker+0x3fc/0x4f0
> > > > > > >    [c000000016367da0] [c0000000001cb0a4] kthread+0x1c4/0x1d0
> > > > > > >    [c000000016367e10] [c00000000000dbf0]
> > > > > > > ret_from_kernel_thread+0x5c/0x6c
> > > > > > > 
> > > > > > > The kernel consider thread AMR value for kernel thread to be
> > > > > > > AMR_KUAP_BLOCKED. Hence access to userspace is denied. This
> > > > > > > of course not correct and we should allow userspace access after
> > > > > > > kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
> > > > > > > AMR value of the operating address space. But, the AMR value is
> > > > > > > thread-specific and we inherit the address space and not thread
> > > > > > > access restrictions. Because of this ignore AMR value when accessing
> > > > > > > userspace via kernel thread.
> > > > > > > 
> > > > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > > > > ---
> > > > > > > Changes from v1:
> > > > > > > * Address review feedback from Nick
> > > > > > > 
> > > > > > >    arch/powerpc/include/asm/book3s/64/kup.h | 8 +++++++-
> > > > > > >    1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/arch/powerpc/include/asm/book3s/64/kup.h
> > > > > > > b/arch/powerpc/include/asm/book3s/64/kup.h
> > > > > > > index f50f72e535aa..95f4df99249e 100644
> > > > > > > --- a/arch/powerpc/include/asm/book3s/64/kup.h
> > > > > > > +++ b/arch/powerpc/include/asm/book3s/64/kup.h
> > > > > > > @@ -384,7 +384,13 @@ static __always_inline void
> > > > > > > allow_user_access(void __user *to, const void __user
> > > > > > >        // This is written so we can resolve to a single case at build
> > > > > > > time
> > > > > > >        BUILD_BUG_ON(!__builtin_constant_p(dir));
> > > > > > > -    if (mmu_has_feature(MMU_FTR_PKEY))
> > > > > > > +    /*
> > > > > > > +     * if it is a kthread that did kthread_use_mm() don't
> > > > > > > +     * use current_thread_amr().
> > > > > > 
> > > > > > According to include/linux/sched.h, PF_KTHREAD means /* I am a kernel
> > > > > > thread */
> > > > > > It doesn't seem to be related to kthread_use_mm()
> > > > > 
> > > > > That should be a sufficient check here. if we did reach here without
> > > > > calling kthread_user_mm, we will crash on access because we don't have
> > > > > a mm attached to the current process.  a kernel thread with
> > > > > kthread_use_mm has
> > > > 
> > > > Ok but then the comment doesn't match the check.
> > > 
> > > 
> > > I was trying to be explict in the comment that we expect the thread to
> > > have done kthread_use_mm().
> > > 
> > > > 
> > > > And also the comment in current_thread_amr() is then misleading.
> > > > 
> > > > Why not do the current->flags & PF_KTHREAD check in current_thread_amr()
> > > > and return 0 in that case instead of BLOCKED ?
> > > 
> > > In my view currrent_thread_amr() is more generic and we want to be
> > > explicit there that a kernel thread AMR is KUAP_BLOCKED. Only when we
> > > call allow user access, we relax the AMR value.
> > 
> > Just following up on this, as I'd hate to have 5.11 released with this
> > bug in it for powerpc. It'll obviously also affect other cases of a
> > kernel thread faulting after having done kthread_use_mm(), though I'm
> > not sure how widespread that is. In any case, it'll leave io_uring
> > mostly broken on powerpc if this isn't patched for release.
> > 
> 
> I am waiting for test feedback on the change I posted earlier. I am also
> running a regression run myself. Once that is complete i will post the patch
> as a separate email.

Are you waiting a test from me? Or someone else who test PPC? Although I'm
the "Reported-by" of this bug, I just can help to verify this bug itself,
I don't have enough test cases to do regression test from PPC side. Do you need
me to verify this bug itself.

Thanks,
Zorro

> 
> -aneesh
> 


^ permalink raw reply

* Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
From: Aneesh Kumar K.V @ 2021-02-04 17:42 UTC (permalink / raw)
  To: Zorro Lang; +Cc: Jens Axboe, linuxppc-dev, Nicholas Piggin
In-Reply-To: <20210204175703.GH14354@localhost.localdomain>

On 2/4/21 11:27 PM, Zorro Lang wrote:
> On Thu, Feb 04, 2021 at 10:31:59PM +0530, Aneesh Kumar K.V wrote:
>> On 2/4/21 10:23 PM, Jens Axboe wrote:
>>> On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
>>>> On 2/2/21 11:50 AM, Christophe Leroy wrote:
>>>>>
>>>>>
>>>>> Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
>>>>>> On 2/2/21 11:32 AM, Christophe Leroy wrote:
>>>>>>>
>>>>>>>
>>>>>>> Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
>>>>>>>> Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>>
>>>>>>>>> Nicholas Piggin <npiggin@gmail.com> writes:
>>>>>>>>>
>>>>>>>>>> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
>>>>>>>>>>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>>>>>>>>>>> +Aneesh
>>>>>>>>>>>>
>>>>>>>>>>>> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
>>>>>>>>>>> ..
>>>>>>>>>>>>> [   96.200296] ------------[ cut here ]------------
>>>>>>>>>>>>> [   96.200304] Bug: Read fault blocked by KUAP!
>>>>>>>>>>>>> [   96.200309] WARNING: CPU: 3 PID: 1876 at
>>>>>>>>>>>>> arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>>>>>>>>>>>>
>>>>>>>>>>>>> [   96.200734] NIP [c000000000849424]
>>>>>>>>>>>>> fault_in_pages_readable+0x104/0x350
>>>>>>>>>>>>> [   96.200741] LR [c00000000084952c]
>>>>>>>>>>>>> fault_in_pages_readable+0x20c/0x350
>>>>>>>>>>>>> [   96.200747] --- interrupt: 300
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Problem happens in a section where userspace access is supposed
>>>>>>>>>>>> to be granted, so the patch you
>>>>>>>>>>>> proposed is definitely not the right fix.
>>>>>>>>>>>>
>>>>>>>>>>>> c000000000849408:    2c 01 00 4c     isync
>>>>>>>>>>>> c00000000084940c:    a6 03 3d 7d     mtspr   29,r9  <== granting
>>>>>>>>>>>> userspace access permission
>>>>>>>>>>>> c000000000849410:    2c 01 00 4c     isync
>>>>>>>>>>>> c000000000849414:    00 00 36 e9     ld      r9,0(r22)
>>>>>>>>>>>> c000000000849418:    20 00 29 81     lwz     r9,32(r9)
>>>>>>>>>>>> c00000000084941c:    00 02 29 71     andi.   r9,r9,512
>>>>>>>>>>>> c000000000849420:    78 d3 5e 7f     mr      r30,r26
>>>>>>>>>>>> ==> c000000000849424:    00 00 bf 8b     lbz     r29,0(r31)  <==
>>>>>>>>>>>> accessing userspace
>>>>>>>>>>>> c000000000849428:    10 00 82 41     beq     c000000000849438
>>>>>>>>>>>> <fault_in_pages_readable+0x118>
>>>>>>>>>>>> c00000000084942c:    2c 01 00 4c     isync
>>>>>>>>>>>> c000000000849430:    a6 03 bd 7e     mtspr   29,r21  <==
>>>>>>>>>>>> clearing userspace access permission
>>>>>>>>>>>> c000000000849434:    2c 01 00 4c     isync
>>>>>>>>>>>>
>>>>>>>>>>>> My first guess is that the problem is linked to the following
>>>>>>>>>>>> function, see the comment
>>>>>>>>>>>>
>>>>>>>>>>>> /*
>>>>>>>>>>>>      * For kernel thread that doesn't have thread.regs return
>>>>>>>>>>>>      * default AMR/IAMR values.
>>>>>>>>>>>>      */
>>>>>>>>>>>> static inline u64 current_thread_amr(void)
>>>>>>>>>>>> {
>>>>>>>>>>>>       if (current->thread.regs)
>>>>>>>>>>>>           return current->thread.regs->amr;
>>>>>>>>>>>>       return AMR_KUAP_BLOCKED;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Above function was introduced by commit 48a8ab4eeb82
>>>>>>>>>>>> ("powerpc/book3s64/pkeys: Don't update SPRN_AMR
>>>>>>>>>>>> when in kernel mode")
>>>>>>>>>>>
>>>>>>>>>>> Yeah that's a bit of a curly one.
>>>>>>>>>>>
>>>>>>>>>>> At some point io_uring did kthread_use_mm(), which is supposed to
>>>>>>>>>>> mean
>>>>>>>>>>> the kthread can operate on behalf of the original process that
>>>>>>>>>>> submitted
>>>>>>>>>>> the IO.
>>>>>>>>>>>
>>>>>>>>>>> But because KUAP is implemented using memory protection keys, it
>>>>>>>>>>> depends
>>>>>>>>>>> on the value of the AMR register, which is not part of the mm,
>>>>>>>>>>> it's in
>>>>>>>>>>> thread.regs->amr.
>>>>>>>>>>>
>>>>>>>>>>> And what's worse by the time we're in kthread_use_mm() we no
>>>>>>>>>>> longer have
>>>>>>>>>>> access to the thread.regs->amr of the original process that
>>>>>>>>>>> submitted
>>>>>>>>>>> the IO.
>>>>>>>>>>>
>>>>>>>>>>> We also can't simply move the AMR into the mm, precisely because
>>>>>>>>>>> it's
>>>>>>>>>>> per thread, not per mm.
>>>>>>>>>>>
>>>>>>>>>>> So TBH I don't know how we're going to fix this.
>>>>>>>>>>>
>>>>>>>>>>> I guess we could return AMR=unblocked for kernel threads, but that's
>>>>>>>>>>> arguably a bug because it allows a process to circumvent memory
>>>>>>>>>>> keys by
>>>>>>>>>>> asking the kernel to do the access.
>>>>>>>>>>
>>>>>>>>>> We shouldn't need to inherit AMR should we? We only need it to be
>>>>>>>>>> locked
>>>>>>>>>> for kernel threads until it's explicitly unlocked -- nothing mm
>>>>>>>>>> specific
>>>>>>>>>> there. I think current_thread_amr could return 0 for kernel
>>>>>>>>>> threads? Or
>>>>>>>>>> I would even avoid using that function for allow_user_access and open
>>>>>>>>>> code the kthread case and remove it from current_thread_amr().
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>
>>>>>>>> updated one
>>>>>>>>
>>>>>>>>    From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
>>>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>>>> Date: Tue, 2 Feb 2021 09:23:38 +0530
>>>>>>>> Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access
>>>>>>>> userspace
>>>>>>>>     after kthread_use_mm
>>>>>>>>
>>>>>>>> This fix the bad fault reported by KUAP when io_wqe_worker access
>>>>>>>> userspace.
>>>>>>>>
>>>>>>>>     Bug: Read fault blocked by KUAP!
>>>>>>>>     WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229
>>>>>>>> __do_page_fault+0x6b4/0xcd0
>>>>>>>>     NIP [c00000000009e7e4] __do_page_fault+0x6b4/0xcd0
>>>>>>>>     LR [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
>>>>>>>> ..........
>>>>>>>>     Call Trace:
>>>>>>>>     [c000000016367330] [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
>>>>>>>> (unreliable)
>>>>>>>>     [c0000000163673e0] [c00000000009ee3c] do_page_fault+0x3c/0x120
>>>>>>>>     [c000000016367430] [c00000000000c848] handle_page_fault+0x10/0x2c
>>>>>>>>     --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
>>>>>>>> ..........
>>>>>>>>     NIP [c0000000008e8228] iov_iter_fault_in_readable+0x148/0x6f0
>>>>>>>>     LR [c0000000008e834c] iov_iter_fault_in_readable+0x26c/0x6f0
>>>>>>>>     interrupt: 300
>>>>>>>>     [c0000000163677e0] [c0000000007154a0] iomap_write_actor+0xc0/0x280
>>>>>>>>     [c000000016367880] [c00000000070fc94] iomap_apply+0x1c4/0x780
>>>>>>>>     [c000000016367990] [c000000000710330]
>>>>>>>> iomap_file_buffered_write+0xa0/0x120
>>>>>>>>     [c0000000163679e0] [c00800000040791c]
>>>>>>>> xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
>>>>>>>>     [c000000016367a90] [c0000000006d74bc] io_write+0x10c/0x460
>>>>>>>>     [c000000016367bb0] [c0000000006d80e4] io_issue_sqe+0x8d4/0x1200
>>>>>>>>     [c000000016367c70] [c0000000006d8ad0] io_wq_submit_work+0xc0/0x250
>>>>>>>>     [c000000016367cb0] [c0000000006e2578]
>>>>>>>> io_worker_handle_work+0x498/0x800
>>>>>>>>     [c000000016367d40] [c0000000006e2cdc] io_wqe_worker+0x3fc/0x4f0
>>>>>>>>     [c000000016367da0] [c0000000001cb0a4] kthread+0x1c4/0x1d0
>>>>>>>>     [c000000016367e10] [c00000000000dbf0]
>>>>>>>> ret_from_kernel_thread+0x5c/0x6c
>>>>>>>>
>>>>>>>> The kernel consider thread AMR value for kernel thread to be
>>>>>>>> AMR_KUAP_BLOCKED. Hence access to userspace is denied. This
>>>>>>>> of course not correct and we should allow userspace access after
>>>>>>>> kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
>>>>>>>> AMR value of the operating address space. But, the AMR value is
>>>>>>>> thread-specific and we inherit the address space and not thread
>>>>>>>> access restrictions. Because of this ignore AMR value when accessing
>>>>>>>> userspace via kernel thread.
>>>>>>>>
>>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>>> ---
>>>>>>>> Changes from v1:
>>>>>>>> * Address review feedback from Nick
>>>>>>>>
>>>>>>>>     arch/powerpc/include/asm/book3s/64/kup.h | 8 +++++++-
>>>>>>>>     1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>>>> b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>>>> index f50f72e535aa..95f4df99249e 100644
>>>>>>>> --- a/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>>>> +++ b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>>>>> @@ -384,7 +384,13 @@ static __always_inline void
>>>>>>>> allow_user_access(void __user *to, const void __user
>>>>>>>>         // This is written so we can resolve to a single case at build
>>>>>>>> time
>>>>>>>>         BUILD_BUG_ON(!__builtin_constant_p(dir));
>>>>>>>> -    if (mmu_has_feature(MMU_FTR_PKEY))
>>>>>>>> +    /*
>>>>>>>> +     * if it is a kthread that did kthread_use_mm() don't
>>>>>>>> +     * use current_thread_amr().
>>>>>>>
>>>>>>> According to include/linux/sched.h, PF_KTHREAD means /* I am a kernel
>>>>>>> thread */
>>>>>>> It doesn't seem to be related to kthread_use_mm()
>>>>>>
>>>>>> That should be a sufficient check here. if we did reach here without
>>>>>> calling kthread_user_mm, we will crash on access because we don't have
>>>>>> a mm attached to the current process.  a kernel thread with
>>>>>> kthread_use_mm has
>>>>>
>>>>> Ok but then the comment doesn't match the check.
>>>>
>>>>
>>>> I was trying to be explict in the comment that we expect the thread to
>>>> have done kthread_use_mm().
>>>>
>>>>>
>>>>> And also the comment in current_thread_amr() is then misleading.
>>>>>
>>>>> Why not do the current->flags & PF_KTHREAD check in current_thread_amr()
>>>>> and return 0 in that case instead of BLOCKED ?
>>>>
>>>> In my view currrent_thread_amr() is more generic and we want to be
>>>> explicit there that a kernel thread AMR is KUAP_BLOCKED. Only when we
>>>> call allow user access, we relax the AMR value.
>>>
>>> Just following up on this, as I'd hate to have 5.11 released with this
>>> bug in it for powerpc. It'll obviously also affect other cases of a
>>> kernel thread faulting after having done kthread_use_mm(), though I'm
>>> not sure how widespread that is. In any case, it'll leave io_uring
>>> mostly broken on powerpc if this isn't patched for release.
>>>
>>
>> I am waiting for test feedback on the change I posted earlier. I am also
>> running a regression run myself. Once that is complete i will post the patch
>> as a separate email.
> 
> Are you waiting a test from me? Or someone else who test PPC? Although I'm
> the "Reported-by" of this bug, I just can help to verify this bug itself,
> I don't have enough test cases to do regression test from PPC side. Do you need
> me to verify this bug itself.
> 
>

If you can verify this bug, it would be really helpful.

-aneesh

^ permalink raw reply

* [PATCH v2 2/2] ima: Free IMA measurement buffer after kexec syscall
From: Lakshmi Ramasubramanian @ 2021-02-04 17:49 UTC (permalink / raw)
  To: zohar, bauerman, dmitry.kasatkin, ebiederm, gregkh, sashal,
	tyhicks
  Cc: linux-integrity, linuxppc-dev, linux-kernel
In-Reply-To: <20210204174951.25771-1-nramas@linux.microsoft.com>

IMA allocates kernel virtual memory to carry forward the measurement
list, from the current kernel to the next kernel on kexec system call,
in ima_add_kexec_buffer() function.  This buffer is not freed before
completing the kexec system call resulting in memory leak.

Add ima_buffer field in "struct kimage" to store the virtual address
of the buffer allocated for the IMA measurement list.
Free the memory allocated for the IMA measurement list in
kimage_file_post_load_cleanup() function.

Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Fixes: 7b8589cc29e7 ("ima: on soft reboot, save the measurement list")
---
 include/linux/kexec.h              | 5 +++++
 kernel/kexec_file.c                | 5 +++++
 security/integrity/ima/ima_kexec.c | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 9e93bef52968..5f61389f5f36 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -300,6 +300,11 @@ struct kimage {
 	/* Information for loading purgatory */
 	struct purgatory_info purgatory_info;
 #endif
+
+#ifdef CONFIG_IMA_KEXEC
+	/* Virtual address of IMA measurement buffer for kexec syscall */
+	void *ima_buffer;
+#endif
 };
 
 /* kexec interface functions */
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index b02086d70492..5c3447cf7ad5 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -166,6 +166,11 @@ void kimage_file_post_load_cleanup(struct kimage *image)
 	vfree(pi->sechdrs);
 	pi->sechdrs = NULL;
 
+#ifdef CONFIG_IMA_KEXEC
+	vfree(image->ima_buffer);
+	image->ima_buffer = NULL;
+#endif /* CONFIG_IMA_KEXEC */
+
 	/* See if architecture has anything to cleanup post load */
 	arch_kimage_file_post_load_cleanup(image);
 
diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
index 206ddcaa5c67..e29bea3dd4cc 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -129,6 +129,8 @@ void ima_add_kexec_buffer(struct kimage *image)
 		return;
 	}
 
+	image->ima_buffer = kexec_buffer;
+
 	pr_debug("kexec measurement buffer for the loaded kernel at 0x%lx.\n",
 		 kbuf.mem);
 }
-- 
2.30.0


^ permalink raw reply related

* [PATCH v2 1/2] ima: Free IMA measurement buffer on error
From: Lakshmi Ramasubramanian @ 2021-02-04 17:49 UTC (permalink / raw)
  To: zohar, bauerman, dmitry.kasatkin, ebiederm, gregkh, sashal,
	tyhicks
  Cc: linux-integrity, linuxppc-dev, linux-kernel

IMA allocates kernel virtual memory to carry forward the measurement
list, from the current kernel to the next kernel on kexec system call,
in ima_add_kexec_buffer() function.  In error code paths this memory
is not freed resulting in memory leak.

Free the memory allocated for the IMA measurement list in
the error code paths in ima_add_kexec_buffer() function.

Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Fixes: 7b8589cc29e7 ("ima: on soft reboot, save the measurement list")
---
 security/integrity/ima/ima_kexec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
index 121de3e04af2..206ddcaa5c67 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -119,6 +119,7 @@ void ima_add_kexec_buffer(struct kimage *image)
 	ret = kexec_add_buffer(&kbuf);
 	if (ret) {
 		pr_err("Error passing over kexec measurement buffer.\n");
+		vfree(kexec_buffer);
 		return;
 	}
 
-- 
2.30.0


^ permalink raw reply related

* Re: [PATCH v16 10/12] arm64: Use OF alloc and free functions for FDT
From: Will Deacon @ 2021-02-04 18:00 UTC (permalink / raw)
  To: Lakshmi Ramasubramanian
  Cc: mark.rutland, bhsharma, tao.li, zohar, paulus, vincenzo.frascino,
	frowand.list, sashal, robh, masahiroy, jmorris, takahiro.akashi,
	linux-arm-kernel, catalin.marinas, serge, devicetree,
	pasha.tatashin, prsriva, hsinyi, allison, christophe.leroy,
	mbrugger, balajib, dmitry.kasatkin, linux-kernel, james.morse,
	gregkh, joe, linux-integrity, linuxppc-dev, bauerman
In-Reply-To: <20210204164135.29856-11-nramas@linux.microsoft.com>

On Thu, Feb 04, 2021 at 08:41:33AM -0800, Lakshmi Ramasubramanian wrote:
> of_alloc_and_init_fdt() and of_free_fdt() have been defined in
> drivers/of/kexec.c to allocate and free memory for FDT.
> 
> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
> initialize the FDT, and to free the FDT respectively.
> 
> Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
> Suggested-by: Rob Herring <robh@kernel.org>
> ---
>  arch/arm64/kernel/machine_kexec_file.c | 37 +++++++-------------------
>  1 file changed, 10 insertions(+), 27 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will

^ permalink raw reply

* Re: [PATCH] cpufreq: Remove unused flag CPUFREQ_PM_NO_WARN
From: Rafael J. Wysocki @ 2021-02-04 18:27 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Vincent Guittot, Linux PM, Rafael Wysocki,
	Linux Kernel Mailing List, Paul Mackerras, linuxppc-dev
In-Reply-To: <bed6bc7e15c3ed398dd61b8f3968049f1f16b1b6.1612244449.git.viresh.kumar@linaro.org>

On Tue, Feb 2, 2021 at 6:42 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> This flag is set by one of the drivers but it isn't used in the code
> otherwise. Remove the unused flag and update the driver.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

Applied as 5.12 material, thanks!

> ---
> Rebased over:
>
> https://lore.kernel.org/lkml/a59bb322b22c247d570b70a8e94067804287623b.1612241683.git.viresh.kumar@linaro.org/
>
>  drivers/cpufreq/pmac32-cpufreq.c |  3 +--
>  include/linux/cpufreq.h          | 13 +++++--------
>  2 files changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/cpufreq/pmac32-cpufreq.c b/drivers/cpufreq/pmac32-cpufreq.c
> index 73621bc11976..4f20c6a9108d 100644
> --- a/drivers/cpufreq/pmac32-cpufreq.c
> +++ b/drivers/cpufreq/pmac32-cpufreq.c
> @@ -439,8 +439,7 @@ static struct cpufreq_driver pmac_cpufreq_driver = {
>         .init           = pmac_cpufreq_cpu_init,
>         .suspend        = pmac_cpufreq_suspend,
>         .resume         = pmac_cpufreq_resume,
> -       .flags          = CPUFREQ_PM_NO_WARN |
> -                         CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING,
> +       .flags          = CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING,
>         .attr           = cpufreq_generic_attr,
>         .name           = "powermac",
>  };
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index c8e40e91fe9b..353969c7acd3 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -398,8 +398,11 @@ struct cpufreq_driver {
>  /* loops_per_jiffy or other kernel "constants" aren't affected by frequency transitions */
>  #define CPUFREQ_CONST_LOOPS                    BIT(1)
>
> -/* don't warn on suspend/resume speed mismatches */
> -#define CPUFREQ_PM_NO_WARN                     BIT(2)
> +/*
> + * Set by drivers that want the core to automatically register the cpufreq
> + * driver as a thermal cooling device.
> + */
> +#define CPUFREQ_IS_COOLING_DEV                 BIT(2)
>
>  /*
>   * This should be set by platforms having multiple clock-domains, i.e.
> @@ -431,12 +434,6 @@ struct cpufreq_driver {
>   */
>  #define CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING      BIT(6)
>
> -/*
> - * Set by drivers that want the core to automatically register the cpufreq
> - * driver as a thermal cooling device.
> - */
> -#define CPUFREQ_IS_COOLING_DEV                 BIT(7)
> -
>  int cpufreq_register_driver(struct cpufreq_driver *driver_data);
>  int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);
>
> --
> 2.25.0.rc1.19.g042ed3e048af
>

^ permalink raw reply

* Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
From: Rob Herring @ 2021-02-04 19:26 UTC (permalink / raw)
  To: Lakshmi Ramasubramanian
  Cc: Mark Rutland, Bhupesh Sharma, tao.li, Mimi Zohar, Paul Mackerras,
	vincenzo.frascino, Frank Rowand, Sasha Levin, Masahiro Yamada,
	James Morris, AKASHI, Takahiro, linux-arm-kernel, Catalin Marinas,
	Serge E. Hallyn, devicetree, Pavel Tatashin, Will Deacon,
	Prakhar Srivastava, Hsin-Yi Wang, Allison Randal,
	Christophe Leroy, Matthias Brugger, balajib, dmitry.kasatkin,
	linux-kernel@vger.kernel.org, James Morse, Greg Kroah-Hartman,
	Joe Perches, linux-integrity, linuxppc-dev, Thiago Jung Bauermann
In-Reply-To: <20210204164135.29856-12-nramas@linux.microsoft.com>

On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian
<nramas@linux.microsoft.com> wrote:
>
> of_alloc_and_init_fdt() and of_free_fdt() have been defined in
> drivers/of/kexec.c to allocate and free memory for FDT.
>
> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
> initialize the FDT, and to free the FDT respectively.
>
> powerpc sets the FDT address in image_loader_data field in
> "struct kimage" and the memory is freed in
> kimage_file_post_load_cleanup().  This cleanup function uses kfree()
> to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
> for allocation, the buffer needs to be freed using kvfree().

You could just change the kexec core to call kvfree() instead.

> Define "fdt" field in "struct kimage_arch" for powerpc to store
> the address of FDT, and free the memory in powerpc specific
> arch_kimage_file_post_load_cleanup().

However, given all the other buffers have an explicit field in kimage
or kimage_arch, changing powerpc is to match arm64 is better IMO.

> Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
> Suggested-by: Rob Herring <robh@kernel.org>
> Suggested-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/kexec.h  |  2 ++
>  arch/powerpc/kexec/elf_64.c       | 26 ++++++++++++++++----------
>  arch/powerpc/kexec/file_load_64.c |  3 +++
>  3 files changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
> index 2c0be93d239a..d7d13cac4d31 100644
> --- a/arch/powerpc/include/asm/kexec.h
> +++ b/arch/powerpc/include/asm/kexec.h
> @@ -111,6 +111,8 @@ struct kimage_arch {
>         unsigned long elf_headers_mem;
>         unsigned long elf_headers_sz;
>         void *elf_headers;
> +
> +       void *fdt;
>  };
>
>  char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> index d0e459bb2f05..51d2d8eb6c1b 100644
> --- a/arch/powerpc/kexec/elf_64.c
> +++ b/arch/powerpc/kexec/elf_64.c
> @@ -19,6 +19,7 @@
>  #include <linux/kexec.h>
>  #include <linux/libfdt.h>
>  #include <linux/module.h>
> +#include <linux/of.h>
>  #include <linux/of_fdt.h>
>  #include <linux/slab.h>
>  #include <linux/types.h>
> @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>         unsigned int fdt_size;
>         unsigned long kernel_load_addr;
>         unsigned long initrd_load_addr = 0, fdt_load_addr;
> -       void *fdt;
> +       void *fdt = NULL;
>         const void *slave_code;
>         struct elfhdr ehdr;
>         char *modified_cmdline = NULL;
> @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>         }
>
>         fdt_size = fdt_totalsize(initial_boot_params) * 2;
> -       fdt = kmalloc(fdt_size, GFP_KERNEL);
> +       fdt = of_alloc_and_init_fdt(fdt_size);
>         if (!fdt) {
>                 pr_err("Not enough memory for the device tree.\n");
>                 ret = -ENOMEM;
>                 goto out;
>         }
> -       ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
> -       if (ret < 0) {
> -               pr_err("Error setting up the new device tree.\n");
> -               ret = -EINVAL;
> -               goto out;
> -       }
>
>         ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,

The first thing this function does is call setup_new_fdt() which first
calls of_kexec_setup_new_fdt(). (Note, I really don't understand the
PPC code split. It looks like there's a 32-bit and 64-bit split, but
32-bit looks broken to me. Nothing ever calls setup_new_fdt() except
setup_new_fdt_ppc64()). The arm64 version is calling
of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly.

So we can just make of_alloc_and_init_fdt() also call
of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt do
the alloc and copy). I don't think the architecture needs to pick the
size either. It's doubtful that either one is that sensitive to the
amount of extra space.

>                                   initrd_len, cmdline);
> @@ -131,6 +126,10 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>         ret = kexec_add_buffer(&kbuf);
>         if (ret)
>                 goto out;
> +
> +       /* FDT will be freed in arch_kimage_file_post_load_cleanup */
> +       image->arch.fdt = fdt;
> +
>         fdt_load_addr = kbuf.mem;
>
>         pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
> @@ -145,8 +144,15 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>         kfree(modified_cmdline);
>         kexec_free_elf_info(&elf_info);
>
> -       /* Make kimage_file_post_load_cleanup free the fdt buffer for us. */
> -       return ret ? ERR_PTR(ret) : fdt;
> +       /*
> +        * Once FDT buffer has been successfully passed to kexec_add_buffer(),
> +        * the FDT buffer address is saved in image->arch.fdt. In that case,
> +        * the memory cannot be freed here in case of any other error.
> +        */
> +       if (ret && !image->arch.fdt)
> +               of_free_fdt(fdt);

Just call kvfree() directly.

> +
> +       return ret ? ERR_PTR(ret) : NULL;
>  }
>
>  const struct kexec_file_ops kexec_elf64_ops = {
> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
> index 3cab318aa3b9..d9d5b5569a6d 100644
> --- a/arch/powerpc/kexec/file_load_64.c
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -1113,5 +1113,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
>         image->arch.elf_headers = NULL;
>         image->arch.elf_headers_sz = 0;
>
> +       of_free_fdt(image->arch.fdt);
> +       image->arch.fdt = NULL;
> +
>         return kexec_image_post_load_cleanup_default(image);
>  }
> --
> 2.30.0
>

^ permalink raw reply

* Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays
From: Konrad Rzeszutek Wilk @ 2021-02-04 19:31 UTC (permalink / raw)
  To: Robin Murphy
  Cc: ulf.hansson, airlied, joonas.lahtinen, dri-devel, linux-kernel,
	paulus, hpa, Christoph Hellwig, m.szyprowski, sstabellini,
	adrian.hunter, Dongli Zhang, x86, joe.jin, mingo, peterz, mingo,
	bskeggs, linux-pci, xen-devel, matthew.auld, thomas.lendacky,
	intel-gfx, jani.nikula, bp, rodrigo.vivi, bhelgaas, Claire Chang,
	boris.ostrovsky, chris, jgross, tsbogend, nouveau, linux-mmc,
	linux-mips, iommu, tglx, bauerman, daniel, akpm, linuxppc-dev,
	rppt
In-Reply-To: <b46ddefe-d91a-fa6a-0e0d-cf1edc343c2e@arm.com>

On Thu, Feb 04, 2021 at 11:49:23AM +0000, Robin Murphy wrote:
> On 2021-02-04 07:29, Christoph Hellwig wrote:
> > On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:
> > > This patch converts several swiotlb related variables to arrays, in
> > > order to maintain stat/status for different swiotlb buffers. Here are
> > > variables involved:
> > > 
> > > - io_tlb_start and io_tlb_end
> > > - io_tlb_nslabs and io_tlb_used
> > > - io_tlb_list
> > > - io_tlb_index
> > > - max_segment
> > > - io_tlb_orig_addr
> > > - no_iotlb_memory
> > > 
> > > There is no functional change and this is to prepare to enable 64-bit
> > > swiotlb.
> > 
> > Claire Chang (on Cc) already posted a patch like this a month ago,
> > which looks much better because it actually uses a struct instead
> > of all the random variables.
> 
> Indeed, I skimmed the cover letter and immediately thought that this whole
> thing is just the restricted DMA pool concept[1] again, only from a slightly
> different angle.


Kind of. Let me lay out how some of these pieces are right now:

+-----------------------+      +----------------------+
|                       |      |                      |
|                       |      |                      |
|   a)Xen-SWIOTLB       |      | b)SWIOTLB (for !Xen) |
|                       |      |                      |
+-----------XX----------+      +-------X--------------+
              XXXX             XXXXXXXXX
                 XXXX     XX XXX
                    X   XX
                    XXXX
         +----------XX-----------+
         |                       |
         |                       |
         |   c) SWIOTLB generic  |
         |                       |
         +-----------------------+

Dongli's patches modify the SWIOTLB generic c), and Xen-SWIOTLB a)
parts.

Also see the IOMMU_INIT logic which lays this a bit more deepth
(for example how to enable SWIOTLB on AMD boxes, or IBM with Calgary
IOMMU, etc - see iommu_table.h).

Furtheremore it lays the groundwork to allocate AMD SEV SWIOTLB buffers
later after boot (so that you can stich different pools together).
All the bits are kind of inside of the SWIOTLB code. And also it changes
the Xen-SWIOTLB to do something similar.

The mempool did it similarly by taking the internal parts (aka the
various io_tlb) of SWIOTLB and exposing them out and having
other code:

+-----------------------+      +----------------------+
|                       |      |                      |
|                       |      |                      |
| a)Xen-SWIOTLB         |      | b)SWIOTLB (for !Xen) |
|                       |      |                      |
+-----------XX----------+      +-------X--------------+
              XXXX             XXXXXXXXX
                 XXXX     XX XXX
                    X   XX
                    XXXX
         +----------XX-----------+         +------------------+
         |                       |         | Device tree      |
         |                       +<--------+ enabling SWIOTLB |
         |c) SWIOTLB generic     |         |                  |
         |                       |         | mempool          |
         +-----------------------+         +------------------+

What I was suggesting to Clarie to follow Xen model, that is
do something like this:

+-----------------------+      +----------------------+   +--------------------+
|                       |      |                      |   |                    |
|                       |      |                      |   |                    |
| a)Xen-SWIOTLB         |      | b)SWIOTLB (for !Xen) |   | e) DT-SWIOTLB      |
|                       |      |                      |   |                    |
+-----------XX----------+      +-------X--------------+   +----XX-X------------+
              XXXX             XXXXXXXXX        XXX X X XX X XX
                 XXXX     XX XXX        XXXXXXXX
                    X   XX XXXXXXXXXXXXX
                    XXXXXXXX
         +----------XXX----------+
         |                       |
         |                       |
         |c) SWIOTLB generic     |
         |                       |
         +-----------------------+


so using the SWIOTLB generic parts, and then bolt on top
of the device-tree logic, along with the mempool logic.



But Christopher has an interesting suggestion which is
to squash the all the existing code (a, b, c) all together
and pepper it with various jump-tables.


So:


-----------------------------+
| SWIOTLB:                   |
|                            |
|  a) SWIOTLB (for non-Xen)  |
|  b) Xen-SWIOTLB            |
|  c) DT-SWIOTLB             |
|                            |
|                            |
-----------------------------+


with all the various bits (M2P/P2M for Xen, mempool for ARM,
and normal allocation for BM) in one big file.


^ permalink raw reply

* [PATCH] arch:powerpc simple_write_to_buffer return check
From: Mayank Suman @ 2021-02-04 18:16 UTC (permalink / raw)
  To: ruscur, oohall, mpe, benh, paulus, linuxppc-dev, linux-kernel
  Cc: Mayank Suman

Signed-off-by: Mayank Suman <mayanksuman@live.com>
---
 arch/powerpc/kernel/eeh.c                    | 8 ++++----
 arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 813713c9120c..2dbe1558a71f 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file *filp,
 	char buf[20];
 	int ret;
 
-	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
-	if (!ret)
+	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+	if (ret <= 0)
 		return -EFAULT;
 
 	/*
@@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp,
 
 	memset(buf, 0, sizeof(buf));
 	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-	if (!ret)
+	if (ret <= 0)
 		return -EFAULT;
 
 	ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
@@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp,
 
 	memset(buf, 0, sizeof(buf));
 	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-	if (!ret)
+	if (ret <= 0)
 		return -EFAULT;
 
 	ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 89e22c460ebf..36ed2b8f7375 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
 		return -ENXIO;
 
 	/* Copy over argument buffer */
-	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
-	if (!ret)
+	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+	if (ret <= 0)
 		return -EFAULT;
 
 	/* Retrieve parameters */
-- 
2.30.0


^ permalink raw reply related

* [PATCH] KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
From: Fabiano Rosas @ 2021-02-04 20:05 UTC (permalink / raw)
  To: kvm-ppc; +Cc: linuxppc-dev

The Facility Status and Control Register is a privileged SPR that
defines the availability of some features in problem state. Since it
can be written by the guest, we must restore it to the previous host
value after guest exit.

This restoration is currently done by taking the value from
current->thread.fscr, which in the P9 path is not enough anymore
because the guest could context switch the QEMU thread, causing the
guest-current value to be saved into the thread struct.

The above situation manifested when running a QEMU linked against a
libc with System Call Vectored support, which causes scv
instructions to be run by QEMU early during the guest boot (during
SLOF), at which point the FSCR is 0 due to guest entry. After a few
scv calls (1 to a couple hundred), the context switching happens and
the QEMU thread runs with the guest value, resulting in a Facility
Unavailable interrupt.

This patch saves and restores the host value of FSCR in the inner
guest entry loop in a way independent of current->thread.fscr. The old
way of doing it is still kept in place because it works for the old
entry path.

Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6f612d240392..f2ddf7139a2a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3595,6 +3595,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit,
 	unsigned long host_tidr = mfspr(SPRN_TIDR);
 	unsigned long host_iamr = mfspr(SPRN_IAMR);
 	unsigned long host_amr = mfspr(SPRN_AMR);
+	unsigned long host_fscr = mfspr(SPRN_FSCR);
 	s64 dec;
 	u64 tb;
 	int trap, save_pmu;
@@ -3735,6 +3736,9 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit,
 	if (host_amr != vcpu->arch.amr)
 		mtspr(SPRN_AMR, host_amr);
 
+	if (host_fscr != vcpu->arch.fscr)
+		mtspr(SPRN_FSCR, host_fscr);
+
 	msr_check_and_set(MSR_FP | MSR_VEC | MSR_VSX);
 	store_fp_state(&vcpu->arch.fp);
 #ifdef CONFIG_ALTIVEC
-- 
2.29.2


^ permalink raw reply related

* Re: [PATCH] arch:powerpc simple_write_to_buffer return check
From: Oliver O'Halloran @ 2021-02-04 22:35 UTC (permalink / raw)
  To: Mayank Suman; +Cc: Linux Kernel Mailing List, Paul Mackerras, linuxppc-dev
In-Reply-To: <PS1PR04MB29345AB59076B370A4F99F75D6B39@PS1PR04MB2934.apcprd04.prod.outlook.com>

On Fri, Feb 5, 2021 at 5:17 AM Mayank Suman <mayanksuman@live.com> wrote:
>
> Signed-off-by: Mayank Suman <mayanksuman@live.com>

commit messages aren't optional

> ---
>  arch/powerpc/kernel/eeh.c                    | 8 ++++----
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 813713c9120c..2dbe1558a71f 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file *filp,
>         char buf[20];
>         int ret;
>
> -       ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> -       if (!ret)
> +       ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);

We should probably be zeroing the buffer. Reading to sizeof(buf) - 1
is done in a few places to guarantee that the string is nul
terminated, but without the preceeding memset() that isn't actually
guaranteed.

> +       if (ret <= 0)
>                 return -EFAULT;

EFAULT is supposed to be returned when the user supplies a buffer to
write(2) which is outside their address space. I figured letting the
sscanf() in the next step fail if the user passes writes a zero-length
buffer and returning EINVAL made more sense. That said, the exact
semantics around zero length writes are pretty handwavy so I guess
this isn't wrong, but I don't think it's better either.

>         /*
> @@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp,
>
>         memset(buf, 0, sizeof(buf));
>         ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
> -       if (!ret)
> +       if (ret <= 0)
>                 return -EFAULT;
>
>         ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
> @@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp,
>
>         memset(buf, 0, sizeof(buf));
>         ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
> -       if (!ret)
> +       if (ret <= 0)
>                 return -EFAULT;
>
>         ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 89e22c460ebf..36ed2b8f7375 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
>                 return -ENXIO;
>
>         /* Copy over argument buffer */
> -       ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> -       if (!ret)
> +       ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
> +       if (ret <= 0)
>                 return -EFAULT;
>
>         /* Retrieve parameters */
> --
> 2.30.0
>

^ permalink raw reply

* Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
From: Lakshmi Ramasubramanian @ 2021-02-04 23:23 UTC (permalink / raw)
  To: Rob Herring
  Cc: Mark Rutland, Bhupesh Sharma, tao.li, Mimi Zohar, Paul Mackerras,
	vincenzo.frascino, Frank Rowand, Sasha Levin, Masahiro Yamada,
	James Morris, AKASHI, Takahiro, linux-arm-kernel, Catalin Marinas,
	Serge E. Hallyn, devicetree, Pavel Tatashin, Will Deacon,
	Prakhar Srivastava, Hsin-Yi Wang, Allison Randal,
	Christophe Leroy, Matthias Brugger, balajib, dmitry.kasatkin,
	linux-kernel@vger.kernel.org, James Morse, Greg Kroah-Hartman,
	Joe Perches, linux-integrity, linuxppc-dev, Thiago Jung Bauermann
In-Reply-To: <CAL_JsqK1Pb9nAeL84EP2U3MQgpBsm+E_0QXmzbigWXnS245WPQ@mail.gmail.com>

On 2/4/21 11:26 AM, Rob Herring wrote:
> On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian
> <nramas@linux.microsoft.com> wrote:
>>
>> of_alloc_and_init_fdt() and of_free_fdt() have been defined in
>> drivers/of/kexec.c to allocate and free memory for FDT.
>>
>> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
>> initialize the FDT, and to free the FDT respectively.
>>
>> powerpc sets the FDT address in image_loader_data field in
>> "struct kimage" and the memory is freed in
>> kimage_file_post_load_cleanup().  This cleanup function uses kfree()
>> to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
>> for allocation, the buffer needs to be freed using kvfree().
> 
> You could just change the kexec core to call kvfree() instead.

> 
>> Define "fdt" field in "struct kimage_arch" for powerpc to store
>> the address of FDT, and free the memory in powerpc specific
>> arch_kimage_file_post_load_cleanup().
> 
> However, given all the other buffers have an explicit field in kimage
> or kimage_arch, changing powerpc is to match arm64 is better IMO.

Just to be clear:
I'll leave this as is - free FDT buffer in powerpc's 
arch_kimage_file_post_load_cleanup() to match arm64 behavior.

Will not change "kexec core" to call kvfree() - doing that change would 
require changing all architectures to use kvmalloc() for 
image_loader_data allocation.

> 
>> Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
>> Suggested-by: Rob Herring <robh@kernel.org>
>> Suggested-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
>> ---
>>   arch/powerpc/include/asm/kexec.h  |  2 ++
>>   arch/powerpc/kexec/elf_64.c       | 26 ++++++++++++++++----------
>>   arch/powerpc/kexec/file_load_64.c |  3 +++
>>   3 files changed, 21 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
>> index 2c0be93d239a..d7d13cac4d31 100644
>> --- a/arch/powerpc/include/asm/kexec.h
>> +++ b/arch/powerpc/include/asm/kexec.h
>> @@ -111,6 +111,8 @@ struct kimage_arch {
>>          unsigned long elf_headers_mem;
>>          unsigned long elf_headers_sz;
>>          void *elf_headers;
>> +
>> +       void *fdt;
>>   };
>>
>>   char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
>> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
>> index d0e459bb2f05..51d2d8eb6c1b 100644
>> --- a/arch/powerpc/kexec/elf_64.c
>> +++ b/arch/powerpc/kexec/elf_64.c
>> @@ -19,6 +19,7 @@
>>   #include <linux/kexec.h>
>>   #include <linux/libfdt.h>
>>   #include <linux/module.h>
>> +#include <linux/of.h>
>>   #include <linux/of_fdt.h>
>>   #include <linux/slab.h>
>>   #include <linux/types.h>
>> @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>>          unsigned int fdt_size;
>>          unsigned long kernel_load_addr;
>>          unsigned long initrd_load_addr = 0, fdt_load_addr;
>> -       void *fdt;
>> +       void *fdt = NULL;
>>          const void *slave_code;
>>          struct elfhdr ehdr;
>>          char *modified_cmdline = NULL;
>> @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>>          }
>>
>>          fdt_size = fdt_totalsize(initial_boot_params) * 2;
>> -       fdt = kmalloc(fdt_size, GFP_KERNEL);
>> +       fdt = of_alloc_and_init_fdt(fdt_size);
>>          if (!fdt) {
>>                  pr_err("Not enough memory for the device tree.\n");
>>                  ret = -ENOMEM;
>>                  goto out;
>>          }
>> -       ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
>> -       if (ret < 0) {
>> -               pr_err("Error setting up the new device tree.\n");
>> -               ret = -EINVAL;
>> -               goto out;
>> -       }
>>
>>          ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
> 
> The first thing this function does is call setup_new_fdt() which first
> calls of_kexec_setup_new_fdt(). (Note, I really don't understand the
> PPC code split. It looks like there's a 32-bit and 64-bit split, but
> 32-bit looks broken to me. Nothing ever calls setup_new_fdt() except
> setup_new_fdt_ppc64()). The arm64 version is calling
> of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly.
> 
> So we can just make of_alloc_and_init_fdt() also call
> of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt do
> the alloc and copy). 
ok - will move fdt allocation into of_kexec_setup_new_fdt().

I don't think the architecture needs to pick the
> size either. It's doubtful that either one is that sensitive to the
> amount of extra space.
I am not clear about the above comment -
are you saying the architectures don't need to pass FDT size to the 
alloc function?

arm64 is adding command line string length and some extra space to the 
size computed from initial_boot_params for FDT Size:

	buf_size = fdt_totalsize(initial_boot_params)
			+ cmdline_len + DTB_EXTRA_SPACE;

powerpc is just using twice the size computed from initial_boot_params

	fdt_size = fdt_totalsize(initial_boot_params) * 2;

I think it would be safe to let arm64 and powerpc pass the required FDT 
size, along with the other params to of_kexec_setup_new_fdt() - and in 
this function we allocate FDT and set it up.

And, for powerpc leave the remaining code in setup_new_fdt_ppc64().

Would that be ok?

> 
>>                                    initrd_len, cmdline);
>> @@ -131,6 +126,10 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>>          ret = kexec_add_buffer(&kbuf);
>>          if (ret)
>>                  goto out;
>> +
>> +       /* FDT will be freed in arch_kimage_file_post_load_cleanup */
>> +       image->arch.fdt = fdt;
>> +
>>          fdt_load_addr = kbuf.mem;
>>
>>          pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
>> @@ -145,8 +144,15 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>>          kfree(modified_cmdline);
>>          kexec_free_elf_info(&elf_info);
>>
>> -       /* Make kimage_file_post_load_cleanup free the fdt buffer for us. */
>> -       return ret ? ERR_PTR(ret) : fdt;
>> +       /*
>> +        * Once FDT buffer has been successfully passed to kexec_add_buffer(),
>> +        * the FDT buffer address is saved in image->arch.fdt. In that case,
>> +        * the memory cannot be freed here in case of any other error.
>> +        */
>> +       if (ret && !image->arch.fdt)
>> +               of_free_fdt(fdt);
> 
> Just call kvfree() directly.
Sure - will do.

  -lakshmi

> 
>> +
>> +       return ret ? ERR_PTR(ret) : NULL;
>>   }
>>
>>   const struct kexec_file_ops kexec_elf64_ops = {
>> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
>> index 3cab318aa3b9..d9d5b5569a6d 100644
>> --- a/arch/powerpc/kexec/file_load_64.c
>> +++ b/arch/powerpc/kexec/file_load_64.c
>> @@ -1113,5 +1113,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
>>          image->arch.elf_headers = NULL;
>>          image->arch.elf_headers_sz = 0;
>>
>> +       of_free_fdt(image->arch.fdt);
>> +       image->arch.fdt = NULL;
>> +
>>          return kexec_image_post_load_cleanup_default(image);
>>   }
>> --
>> 2.30.0
>>


^ permalink raw reply

* Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend
From: Bjorn Helgaas @ 2021-02-04 23:27 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Joerg Roedel,
	open list:PCI ENHANCED ERROR HANDLING (EEH) FOR POWERPC,
	open list:PCI SUBSYSTEM, open list, Lalithambika Krishnakumar,
	Alex Williamson, Oliver O'Halloran, Bjorn Helgaas,
	Mika Westerberg, Lu Baolu
In-Reply-To: <CAAd53p7FfRCgfC5dGL3HyP+rbVtR2VCfMPYBBvJ=-DFCWFeVPA@mail.gmail.com>

[+cc Alex]

On Thu, Jan 28, 2021 at 12:09:37PM +0800, Kai-Heng Feng wrote:
> On Thu, Jan 28, 2021 at 4:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> > > Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> > > hint") enables ACS, and some platforms lose its NVMe after resume from
> > > firmware:
> > > [   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
> > > [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> > > [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> > > [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
> > > [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> > > [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> > > [   50.947843] nvme nvme0: frozen state error detected, reset controller
> > >
> > > It happens right after ACS gets enabled during resume.
> > >
> > > To prevent that from happening, disable AER interrupt and enable it on
> > > system suspend and resume, respectively.
> >
> > Lots of questions here.  Maybe this is what we'll end up doing, but I
> > am curious about why the error is reported in the first place.
> >
> > Is this a consequence of the link going down and back up?
> 
> Could be. From the observations, it only happens when firmware suspend
> (S3) is used.
> Maybe it happens when it's gets powered up, but I don't have equipment
> to debug at hardware level.
> 
> If we use non-firmware suspend method, enabling ACS after resume won't
> trip AER and DPC.
> 
> > Is it consequence of the device doing a DMA when it shouldn't?
> 
> If it's doing DMA while suspending, the same error should also happen
> after NVMe is suspended and before PCIe port suspending.
> Furthermore, if non-firmware suspend method is used, there's so such
> issue, so less likely to be any DMA operation.
> 
> > Are we doing something in the wrong order during suspend?  Or maybe
> > resume, since I assume the error is reported during resume?
> 
> Yes the error is reported during resume. The suspend/resume order
> seems fine as non-firmware suspend doesn't have this issue.

I really feel like we need a better understanding of what's going on
here.  Disabling the AER interrupt is like closing our eyes and
pretending that because we don't see it, it didn't happen.

An ACS error is triggered by a DMA, right?  I'm assuming an MMIO
access from the CPU wouldn't trigger this error.  And it sounds like
the error is triggered before we even start running the driver after
resume.

If we're powering up an NVMe device from D3cold and it DMAs before the
driver touches it, something would be seriously broken.  I doubt
that's what's happening.  Maybe a device could resume some previously
programmed DMA after powering up from D3hot.

Or maybe the error occurred on suspend, like if the device wasn't
quiesced or something, but we didn't notice it until resume?  The 
AER error status bits are RW1CS, which means they can be preserved
across hot/warm/cold resets.

Can you instrument the code to see whether the AER error status bit is
set before enabling ACS?  I'm not sure that merely enabling ACS (I
assume you mean pci_std_enable_acs(), where we write PCI_ACS_CTRL)
should cause an interrupt for a previously-logged error.  I suspect
that could happen when enabling *AER*, but I wouldn't think it would
happen when enabling *ACS*.

Does this error happen on multiple machines from different vendors?
Wondering if it could be a BIOS issue, e.g., BIOS not cleaning up
after it did something to cause an error.

> > If we *do* take the error, why doesn't DPC recovery work?
> 
> It works for the root port, but not for the NVMe drive:
> [   50.947816] pcieport 0000:00:1b.0: DPC: containment event,
> status:0x1f01 source:0x0000
> [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> ID)
> [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error
> status/mask=00200000/00010000
> [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> [   50.947843] nvme nvme0: frozen state error detected, reset controller
> [   50.948400] ACPI: EC: event unblocked
> [   50.948432] xhci_hcd 0000:00:14.0: PME# disabled
> [   50.948444] xhci_hcd 0000:00:14.0: enabling bus mastering
> [   50.949056] pcieport 0000:00:1b.0: PME# disabled
> [   50.949068] pcieport 0000:00:1c.0: PME# disabled
> [   50.949416] e1000e 0000:00:1f.6: PME# disabled
> [   50.949463] e1000e 0000:00:1f.6: enabling bus mastering
> [   50.951606] sd 0:0:0:0: [sda] Starting disk
> [   50.951610] nvme 0000:01:00.0: can't change power state from D3hot
> to D0 (config space inaccessible)
> [   50.951730] nvme nvme0: Removing after probe failure status: -19
> [   50.952360] nvme nvme0: failed to set APST feature (-19)
> [   50.971136] snd_hda_intel 0000:00:1f.3: PME# disabled
> [   51.089330] pcieport 0000:00:1b.0: AER: broadcast resume message
> [   51.089345] pcieport 0000:00:1b.0: AER: device recovery successful
> 
> But I think why recovery doesn't work for NVMe is for another discussion...
> 
> Kai-Heng
> 
> >
> > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> > > Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > ---
> > >  drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
> > >  1 file changed, 18 insertions(+)
> > >
> > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > index 77b0f2c45bc0..0e9a85530ae6 100644
> > > --- a/drivers/pci/pcie/aer.c
> > > +++ b/drivers/pci/pcie/aer.c
> > > @@ -1365,6 +1365,22 @@ static int aer_probe(struct pcie_device *dev)
> > >       return 0;
> > >  }
> > >
> > > +static int aer_suspend(struct pcie_device *dev)
> > > +{
> > > +     struct aer_rpc *rpc = get_service_data(dev);
> > > +
> > > +     aer_disable_rootport(rpc);
> > > +     return 0;
> > > +}
> > > +
> > > +static int aer_resume(struct pcie_device *dev)
> > > +{
> > > +     struct aer_rpc *rpc = get_service_data(dev);
> > > +
> > > +     aer_enable_rootport(rpc);
> > > +     return 0;
> > > +}
> > > +
> > >  /**
> > >   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
> > >   * @dev: pointer to Root Port, RCEC, or RCiEP
> > > @@ -1437,6 +1453,8 @@ static struct pcie_port_service_driver aerdriver = {
> > >       .service        = PCIE_PORT_SERVICE_AER,
> > >
> > >       .probe          = aer_probe,
> > > +     .suspend        = aer_suspend,
> > > +     .resume         = aer_resume,
> > >       .remove         = aer_remove,
> > >  };
> > >
> > > --
> > > 2.29.2
> > >

^ permalink raw reply

* Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
From: Rob Herring @ 2021-02-04 23:36 UTC (permalink / raw)
  To: Lakshmi Ramasubramanian
  Cc: Mark Rutland, Bhupesh Sharma, tao.li, Mimi Zohar, Paul Mackerras,
	vincenzo.frascino, Frank Rowand, Sasha Levin, Masahiro Yamada,
	James Morris, AKASHI, Takahiro, linux-arm-kernel, Catalin Marinas,
	Serge E. Hallyn, devicetree, Pavel Tatashin, Will Deacon,
	Prakhar Srivastava, Hsin-Yi Wang, Allison Randal,
	Christophe Leroy, Matthias Brugger, balajib, dmitry.kasatkin,
	linux-kernel@vger.kernel.org, James Morse, Greg Kroah-Hartman,
	Joe Perches, linux-integrity, linuxppc-dev, Thiago Jung Bauermann
In-Reply-To: <503d42ba-89bf-4ad9-9d4c-acb625580f77@linux.microsoft.com>

On Thu, Feb 4, 2021 at 5:23 PM Lakshmi Ramasubramanian
<nramas@linux.microsoft.com> wrote:
>
> On 2/4/21 11:26 AM, Rob Herring wrote:
> > On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian
> > <nramas@linux.microsoft.com> wrote:
> >>
> >> of_alloc_and_init_fdt() and of_free_fdt() have been defined in
> >> drivers/of/kexec.c to allocate and free memory for FDT.
> >>
> >> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
> >> initialize the FDT, and to free the FDT respectively.
> >>
> >> powerpc sets the FDT address in image_loader_data field in
> >> "struct kimage" and the memory is freed in
> >> kimage_file_post_load_cleanup().  This cleanup function uses kfree()
> >> to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
> >> for allocation, the buffer needs to be freed using kvfree().
> >
> > You could just change the kexec core to call kvfree() instead.
>
> >
> >> Define "fdt" field in "struct kimage_arch" for powerpc to store
> >> the address of FDT, and free the memory in powerpc specific
> >> arch_kimage_file_post_load_cleanup().
> >
> > However, given all the other buffers have an explicit field in kimage
> > or kimage_arch, changing powerpc is to match arm64 is better IMO.
>
> Just to be clear:
> I'll leave this as is - free FDT buffer in powerpc's
> arch_kimage_file_post_load_cleanup() to match arm64 behavior.

Yes.

> Will not change "kexec core" to call kvfree() - doing that change would
> require changing all architectures to use kvmalloc() for
> image_loader_data allocation.

Actually, no. AIUI, kvfree() can be used whether you used kvmalloc,
vmalloc, or kmalloc for the alloc.

> >> Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
> >> Suggested-by: Rob Herring <robh@kernel.org>
> >> Suggested-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
> >> ---
> >>   arch/powerpc/include/asm/kexec.h  |  2 ++
> >>   arch/powerpc/kexec/elf_64.c       | 26 ++++++++++++++++----------
> >>   arch/powerpc/kexec/file_load_64.c |  3 +++
> >>   3 files changed, 21 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
> >> index 2c0be93d239a..d7d13cac4d31 100644
> >> --- a/arch/powerpc/include/asm/kexec.h
> >> +++ b/arch/powerpc/include/asm/kexec.h
> >> @@ -111,6 +111,8 @@ struct kimage_arch {
> >>          unsigned long elf_headers_mem;
> >>          unsigned long elf_headers_sz;
> >>          void *elf_headers;
> >> +
> >> +       void *fdt;
> >>   };
> >>
> >>   char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
> >> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> >> index d0e459bb2f05..51d2d8eb6c1b 100644
> >> --- a/arch/powerpc/kexec/elf_64.c
> >> +++ b/arch/powerpc/kexec/elf_64.c
> >> @@ -19,6 +19,7 @@
> >>   #include <linux/kexec.h>
> >>   #include <linux/libfdt.h>
> >>   #include <linux/module.h>
> >> +#include <linux/of.h>
> >>   #include <linux/of_fdt.h>
> >>   #include <linux/slab.h>
> >>   #include <linux/types.h>
> >> @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
> >>          unsigned int fdt_size;
> >>          unsigned long kernel_load_addr;
> >>          unsigned long initrd_load_addr = 0, fdt_load_addr;
> >> -       void *fdt;
> >> +       void *fdt = NULL;
> >>          const void *slave_code;
> >>          struct elfhdr ehdr;
> >>          char *modified_cmdline = NULL;
> >> @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
> >>          }
> >>
> >>          fdt_size = fdt_totalsize(initial_boot_params) * 2;
> >> -       fdt = kmalloc(fdt_size, GFP_KERNEL);
> >> +       fdt = of_alloc_and_init_fdt(fdt_size);
> >>          if (!fdt) {
> >>                  pr_err("Not enough memory for the device tree.\n");
> >>                  ret = -ENOMEM;
> >>                  goto out;
> >>          }
> >> -       ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
> >> -       if (ret < 0) {
> >> -               pr_err("Error setting up the new device tree.\n");
> >> -               ret = -EINVAL;
> >> -               goto out;
> >> -       }
> >>
> >>          ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
> >
> > The first thing this function does is call setup_new_fdt() which first
> > calls of_kexec_setup_new_fdt(). (Note, I really don't understand the
> > PPC code split. It looks like there's a 32-bit and 64-bit split, but
> > 32-bit looks broken to me. Nothing ever calls setup_new_fdt() except
> > setup_new_fdt_ppc64()). The arm64 version is calling
> > of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly.
> >
> > So we can just make of_alloc_and_init_fdt() also call
> > of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt do
> > the alloc and copy).
> ok - will move fdt allocation into of_kexec_setup_new_fdt().
>
> I don't think the architecture needs to pick the
> > size either. It's doubtful that either one is that sensitive to the
> > amount of extra space.
> I am not clear about the above comment -
> are you saying the architectures don't need to pass FDT size to the
> alloc function?
>
> arm64 is adding command line string length and some extra space to the
> size computed from initial_boot_params for FDT Size:
>
>         buf_size = fdt_totalsize(initial_boot_params)
>                         + cmdline_len + DTB_EXTRA_SPACE;
>
> powerpc is just using twice the size computed from initial_boot_params
>
>         fdt_size = fdt_totalsize(initial_boot_params) * 2;
>
> I think it would be safe to let arm64 and powerpc pass the required FDT
> size, along with the other params to of_kexec_setup_new_fdt() - and in
> this function we allocate FDT and set it up.

It's pretty clear that someone just picked something that 'should be
enough'. The only thing I can guess for the difference is that arm
DT's tend to be a bit larger. So doubling the size would be even more
excessive. Either way, we're talking 10s kB to few 100kB. I'd go with
DTB_EXTRA_SPACE and we can bump it up if someone has problems.

Also, I would like for 'initial_boot_params' to be private ultimately,
so removing any references is helpful.

> And, for powerpc leave the remaining code in setup_new_fdt_ppc64().

Right.

Rob

^ permalink raw reply

* Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
From: Lakshmi Ramasubramanian @ 2021-02-04 23:42 UTC (permalink / raw)
  To: Rob Herring
  Cc: Mark Rutland, tao.li, Mimi Zohar, Paul Mackerras,
	vincenzo.frascino, Frank Rowand, Sasha Levin, Masahiro Yamada,
	James Morris, AKASHI, Takahiro, linux-arm-kernel, Catalin Marinas,
	Serge E. Hallyn, devicetree, Pavel Tatashin, Will Deacon,
	Prakhar Srivastava, Hsin-Yi Wang, Allison Randal,
	Christophe Leroy, Matthias Brugger, balajib, dmitry.kasatkin,
	linux-kernel@vger.kernel.org, James Morse, Greg Kroah-Hartman,
	Joe Perches, linux-integrity, linuxppc-dev, Thiago Jung Bauermann
In-Reply-To: <CAL_JsqKY9fxOowW=sBVm9s8j=3RWA7Jn9Ft9Edyx5qy5Yvykmw@mail.gmail.com>

On 2/4/21 3:36 PM, Rob Herring wrote:
> On Thu, Feb 4, 2021 at 5:23 PM Lakshmi Ramasubramanian
> <nramas@linux.microsoft.com> wrote:
>>
>> On 2/4/21 11:26 AM, Rob Herring wrote:
>>> On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian
>>> <nramas@linux.microsoft.com> wrote:
>>>>
>>>> of_alloc_and_init_fdt() and of_free_fdt() have been defined in
>>>> drivers/of/kexec.c to allocate and free memory for FDT.
>>>>
>>>> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
>>>> initialize the FDT, and to free the FDT respectively.
>>>>
>>>> powerpc sets the FDT address in image_loader_data field in
>>>> "struct kimage" and the memory is freed in
>>>> kimage_file_post_load_cleanup().  This cleanup function uses kfree()
>>>> to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
>>>> for allocation, the buffer needs to be freed using kvfree().
>>>
>>> You could just change the kexec core to call kvfree() instead.
>>
>>>
>>>> Define "fdt" field in "struct kimage_arch" for powerpc to store
>>>> the address of FDT, and free the memory in powerpc specific
>>>> arch_kimage_file_post_load_cleanup().
>>>
>>> However, given all the other buffers have an explicit field in kimage
>>> or kimage_arch, changing powerpc is to match arm64 is better IMO.
>>
>> Just to be clear:
>> I'll leave this as is - free FDT buffer in powerpc's
>> arch_kimage_file_post_load_cleanup() to match arm64 behavior.
> 
> Yes.

ok

> 
>> Will not change "kexec core" to call kvfree() - doing that change would
>> require changing all architectures to use kvmalloc() for
>> image_loader_data allocation.
> 
> Actually, no. AIUI, kvfree() can be used whether you used kvmalloc,
> vmalloc, or kmalloc for the alloc.
That is good information. Thanks.

> 
>>>> Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
>>>> Suggested-by: Rob Herring <robh@kernel.org>
>>>> Suggested-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
>>>> ---
>>>>    arch/powerpc/include/asm/kexec.h  |  2 ++
>>>>    arch/powerpc/kexec/elf_64.c       | 26 ++++++++++++++++----------
>>>>    arch/powerpc/kexec/file_load_64.c |  3 +++
>>>>    3 files changed, 21 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
>>>> index 2c0be93d239a..d7d13cac4d31 100644
>>>> --- a/arch/powerpc/include/asm/kexec.h
>>>> +++ b/arch/powerpc/include/asm/kexec.h
>>>> @@ -111,6 +111,8 @@ struct kimage_arch {
>>>>           unsigned long elf_headers_mem;
>>>>           unsigned long elf_headers_sz;
>>>>           void *elf_headers;
>>>> +
>>>> +       void *fdt;
>>>>    };
>>>>
>>>>    char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
>>>> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
>>>> index d0e459bb2f05..51d2d8eb6c1b 100644
>>>> --- a/arch/powerpc/kexec/elf_64.c
>>>> +++ b/arch/powerpc/kexec/elf_64.c
>>>> @@ -19,6 +19,7 @@
>>>>    #include <linux/kexec.h>
>>>>    #include <linux/libfdt.h>
>>>>    #include <linux/module.h>
>>>> +#include <linux/of.h>
>>>>    #include <linux/of_fdt.h>
>>>>    #include <linux/slab.h>
>>>>    #include <linux/types.h>
>>>> @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>>>>           unsigned int fdt_size;
>>>>           unsigned long kernel_load_addr;
>>>>           unsigned long initrd_load_addr = 0, fdt_load_addr;
>>>> -       void *fdt;
>>>> +       void *fdt = NULL;
>>>>           const void *slave_code;
>>>>           struct elfhdr ehdr;
>>>>           char *modified_cmdline = NULL;
>>>> @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>>>>           }
>>>>
>>>>           fdt_size = fdt_totalsize(initial_boot_params) * 2;
>>>> -       fdt = kmalloc(fdt_size, GFP_KERNEL);
>>>> +       fdt = of_alloc_and_init_fdt(fdt_size);
>>>>           if (!fdt) {
>>>>                   pr_err("Not enough memory for the device tree.\n");
>>>>                   ret = -ENOMEM;
>>>>                   goto out;
>>>>           }
>>>> -       ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
>>>> -       if (ret < 0) {
>>>> -               pr_err("Error setting up the new device tree.\n");
>>>> -               ret = -EINVAL;
>>>> -               goto out;
>>>> -       }
>>>>
>>>>           ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
>>>
>>> The first thing this function does is call setup_new_fdt() which first
>>> calls of_kexec_setup_new_fdt(). (Note, I really don't understand the
>>> PPC code split. It looks like there's a 32-bit and 64-bit split, but
>>> 32-bit looks broken to me. Nothing ever calls setup_new_fdt() except
>>> setup_new_fdt_ppc64()). The arm64 version is calling
>>> of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly.
>>>
>>> So we can just make of_alloc_and_init_fdt() also call
>>> of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt do
>>> the alloc and copy).
>> ok - will move fdt allocation into of_kexec_setup_new_fdt().
>>
>> I don't think the architecture needs to pick the
>>> size either. It's doubtful that either one is that sensitive to the
>>> amount of extra space.
>> I am not clear about the above comment -
>> are you saying the architectures don't need to pass FDT size to the
>> alloc function?
>>
>> arm64 is adding command line string length and some extra space to the
>> size computed from initial_boot_params for FDT Size:
>>
>>          buf_size = fdt_totalsize(initial_boot_params)
>>                          + cmdline_len + DTB_EXTRA_SPACE;
>>
>> powerpc is just using twice the size computed from initial_boot_params
>>
>>          fdt_size = fdt_totalsize(initial_boot_params) * 2;
>>
>> I think it would be safe to let arm64 and powerpc pass the required FDT
>> size, along with the other params to of_kexec_setup_new_fdt() - and in
>> this function we allocate FDT and set it up.
> 
> It's pretty clear that someone just picked something that 'should be
> enough'. The only thing I can guess for the difference is that arm
> DT's tend to be a bit larger. So doubling the size would be even more
> excessive. Either way, we're talking 10s kB to few 100kB. I'd go with
> DTB_EXTRA_SPACE and we can bump it up if someone has problems.
ok - I'll use DTB_EXTRA_SPACE for the fdt allocation.

> 
> Also, I would like for 'initial_boot_params' to be private ultimately,
> so removing any references is helpful.
ok

> 
>> And, for powerpc leave the remaining code in setup_new_fdt_ppc64().
> 
> Right.
ok

thanks,
  -lakshmi

^ permalink raw reply

* Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
From: Michael Ellerman @ 2021-02-05  0:22 UTC (permalink / raw)
  To: Nicholas Piggin, Christophe Leroy, linuxppc-dev; +Cc: Michal Suchanek
In-Reply-To: <1612428699.u023r42mj3.astroid@bobo.none>

Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
>> Le 04/02/2021 à 04:27, Nicholas Piggin a écrit :
>>> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
>>>> Le 25/02/2020 à 18:35, Nicholas Piggin a écrit :
...
>>>>> +
>>>>> +	/*
>>>>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>>>>> +	 * The value of AMR only matters while we're in the kernel.
>>>>> +	 */
>>>>> +	kuap_restore_amr(regs);
>>>>
>>>> Is that correct to restore KUAP state here ? Shouldn't we have it at lower level in assembly ?
>>>>
>>>> Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare() or the end of it in
>>>> a way or another, and get the previous KUAP state restored by this way ?
>>> 
>>> I'm not sure if there much more risk if it's here rather than the
>>> instruction being in another place in the code.
>>> 
>>> There's a lot of user access around the kernel too if you want to find a
>>> gadget to unlock KUAP then I suppose there is a pretty large attack
>>> surface.
>> 
>> My understanding is that user access scope is strictly limited, for instance we enforce the 
>> begin/end of user access to be in the same function, and we refrain from calling any other function 
>> inside the user access window. x86 even have 'objtool' to enforce it at build time. So in theory 
>> there is no way to get out of the function while user access is open.
>> 
>> Here with the interrupt exit function it is free beer. You have a place where you re-open user 
>> access and return with a simple blr. So that's open bar. If someone manages to just call the 
>> interrupt exit function, then user access remains open
>
> Hmm okay maybe that's a good point.

I don't think it's a very attractive gadget, it's not just a plain blr,
it does a full stack frame tear down before the return. And there's no
LR reloads anywhere very close.

Obviously it depends on what the compiler decides to do, it's possible
it could be a usable gadget. But there are other places that are more
attractive I think, eg:

c00000000061d768:	a6 03 3d 7d 	mtspr   29,r9
c00000000061d76c:	2c 01 00 4c 	isync
c00000000061d770:	00 00 00 60 	nop
c00000000061d774:	30 00 21 38 	addi    r1,r1,48
c00000000061d778:	20 00 80 4e 	blr


So I don't think we should redesign the code *purely* because we're
worried about interrupt_exit_kernel_prepare() being a useful gadget. If
we can come up with a way to restructure it that reads well and is
maintainable, and also reduces the chance of it being a good gadget then
sure.

cheers

^ permalink raw reply

* Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
From: Nicholas Piggin @ 2021-02-05  2:16 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev, Michael Ellerman; +Cc: Michal Suchanek
In-Reply-To: <87blczpdm3.fsf@mpe.ellerman.id.au>

Excerpts from Michael Ellerman's message of February 5, 2021 10:22 am:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
>>> Le 04/02/2021 à 04:27, Nicholas Piggin a écrit :
>>>> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
>>>>> Le 25/02/2020 à 18:35, Nicholas Piggin a écrit :
> ...
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>>>>>> +	 * The value of AMR only matters while we're in the kernel.
>>>>>> +	 */
>>>>>> +	kuap_restore_amr(regs);
>>>>>
>>>>> Is that correct to restore KUAP state here ? Shouldn't we have it at lower level in assembly ?
>>>>>
>>>>> Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare() or the end of it in
>>>>> a way or another, and get the previous KUAP state restored by this way ?
>>>> 
>>>> I'm not sure if there much more risk if it's here rather than the
>>>> instruction being in another place in the code.
>>>> 
>>>> There's a lot of user access around the kernel too if you want to find a
>>>> gadget to unlock KUAP then I suppose there is a pretty large attack
>>>> surface.
>>> 
>>> My understanding is that user access scope is strictly limited, for instance we enforce the 
>>> begin/end of user access to be in the same function, and we refrain from calling any other function 
>>> inside the user access window. x86 even have 'objtool' to enforce it at build time. So in theory 
>>> there is no way to get out of the function while user access is open.
>>> 
>>> Here with the interrupt exit function it is free beer. You have a place where you re-open user 
>>> access and return with a simple blr. So that's open bar. If someone manages to just call the 
>>> interrupt exit function, then user access remains open
>>
>> Hmm okay maybe that's a good point.
> 
> I don't think it's a very attractive gadget, it's not just a plain blr,
> it does a full stack frame tear down before the return. And there's no
> LR reloads anywhere very close.
> 
> Obviously it depends on what the compiler decides to do, it's possible
> it could be a usable gadget. But there are other places that are more
> attractive I think, eg:
> 
> c00000000061d768:	a6 03 3d 7d 	mtspr   29,r9
> c00000000061d76c:	2c 01 00 4c 	isync
> c00000000061d770:	00 00 00 60 	nop
> c00000000061d774:	30 00 21 38 	addi    r1,r1,48
> c00000000061d778:	20 00 80 4e 	blr
> 
> 
> So I don't think we should redesign the code *purely* because we're
> worried about interrupt_exit_kernel_prepare() being a useful gadget. If
> we can come up with a way to restructure it that reads well and is
> maintainable, and also reduces the chance of it being a good gadget then
> sure.

Okay. That would be good if we can keep it in C, the pkeys + kuap combo
is fairly complicated and we might want to something cleverer with it, 
so that would make it even more difficult in asm.

Thanks,
Nick

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox