LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 2/5] powerpc/lib: Initialize a temporary mm for code patching
From: Christopher M. Riedl @ 2020-06-03  5:19 UTC (permalink / raw)
  To: linuxppc-dev, kernel-hardening
In-Reply-To: <20200603051912.23296-1-cmr@informatik.wtf>

When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped with permissive memory
protections. Currently, a per-cpu vmalloc patch area is used for this
purpose. While the patch area is per-cpu, the temporary page mapping is
inserted into the kernel page tables for the duration of the patching.
The mapping is exposed to CPUs other than the patching CPU - this is
undesirable from a hardening perspective.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
portion. The next patch uses the temporary mm and patching address for
code patching.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
---
 arch/powerpc/lib/code-patching.c | 33 ++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 5ecf0d635a8d..599114f63b44 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,8 @@
 #include <linux/cpuhotplug.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/sched/task.h>
+#include <linux/random.h>

 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -45,6 +47,37 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
 }

 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+static struct mm_struct *patching_mm __ro_after_init;
+static unsigned long patching_addr __ro_after_init;
+
+void __init poking_init(void)
+{
+	spinlock_t *ptl; /* for protecting pte table */
+	pte_t *ptep;
+
+	/*
+	 * Some parts of the kernel (static keys for example) depend on
+	 * successful code patching. Code patching under STRICT_KERNEL_RWX
+	 * requires this setup - otherwise we cannot patch at all. We use
+	 * BUG_ON() here and later since an early failure is preferred to
+	 * buggy behavior and/or strange crashes later.
+	 */
+	patching_mm = copy_init_mm();
+	BUG_ON(!patching_mm);
+
+	/*
+	 * In hash we cannot go above DEFAULT_MAP_WINDOW easily.
+	 * XXX: Do we want additional bits of entropy for radix?
+	 */
+	patching_addr = (get_random_long() & PAGE_MASK) %
+		(DEFAULT_MAP_WINDOW - PAGE_SIZE);
+
+	ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
+	BUG_ON(!ptep);
+	pte_unmap_unlock(ptep, ptl);
+}
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);

 static int text_area_cpu_up(unsigned int cpu)
-- 
2.26.2

^ permalink raw reply related

* [PATCH 5/5] powerpc: Add LKDTM test to hijack a patch mapping
From: Christopher M. Riedl @ 2020-06-03  5:19 UTC (permalink / raw)
  To: linuxppc-dev, kernel-hardening
In-Reply-To: <20200603051912.23296-1-cmr@informatik.wtf>

When live patching with STRICT_KERNEL_RWX, the CPU doing the patching
must use a temporary mapping which allows for writing to kernel text.
During the entire window of time when this temporary mapping is in use,
another CPU could write to the same mapping and maliciously alter kernel
text. Implement a LKDTM test to attempt to exploit such a openings when
a CPU is patching under STRICT_KERNEL_RWX. The test is only implemented
on powerpc for now.

The LKDTM "hijack" test works as follows:

	1. A CPU executes an infinite loop to patch an instruction.
	   This is the "patching" CPU.
	2. Another CPU attempts to write to the address of the temporary
	   mapping used by the "patching" CPU. This other CPU is the
	   "hijacker" CPU. The hijack either fails with a segfault or
	   succeeds, in which case some kernel text is now overwritten.

How to run the test:

	mount -t debugfs none /sys/kernel/debug
	(echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)

Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
---
 drivers/misc/lkdtm/core.c  |   1 +
 drivers/misc/lkdtm/lkdtm.h |   1 +
 drivers/misc/lkdtm/perms.c | 101 +++++++++++++++++++++++++++++++++++++
 3 files changed, 103 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index a5e344df9166..482e72f6a1e1 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -145,6 +145,7 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(WRITE_RO),
 	CRASHTYPE(WRITE_RO_AFTER_INIT),
 	CRASHTYPE(WRITE_KERN),
+	CRASHTYPE(HIJACK_PATCH),
 	CRASHTYPE(REFCOUNT_INC_OVERFLOW),
 	CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
 	CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 601a2156a0d4..bfcf3542370d 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -62,6 +62,7 @@ void lkdtm_EXEC_USERSPACE(void);
 void lkdtm_EXEC_NULL(void);
 void lkdtm_ACCESS_USERSPACE(void);
 void lkdtm_ACCESS_NULL(void);
+void lkdtm_HIJACK_PATCH(void);
 
 /* lkdtm_refcount.c */
 void lkdtm_REFCOUNT_INC_OVERFLOW(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 62f76d506f04..8bda3b56bc78 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mman.h>
 #include <linux/uaccess.h>
+#include <linux/kthread.h>
 #include <asm/cacheflush.h>
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -213,6 +214,106 @@ void lkdtm_ACCESS_NULL(void)
 	*ptr = tmp;
 }
 
+#if defined(CONFIG_PPC) && defined(CONFIG_STRICT_KERNEL_RWX)
+#include <include/asm/code-patching.h>
+
+extern unsigned long read_cpu_patching_addr(unsigned int cpu);
+
+static struct ppc_inst * const patch_site = (struct ppc_inst *)&do_nothing;
+
+static int lkdtm_patching_cpu(void *data)
+{
+	int err = 0;
+	struct ppc_inst insn = ppc_inst(0xdeadbeef);
+
+	pr_info("starting patching_cpu=%d\n", smp_processor_id());
+	do {
+		err = patch_instruction(patch_site, insn);
+	} while (ppc_inst_equal(ppc_inst_read(READ_ONCE(patch_site)), insn) &&
+			!err && !kthread_should_stop());
+
+	if (err)
+		pr_warn("patch_instruction returned error: %d\n", err);
+
+	set_current_state(TASK_INTERRUPTIBLE);
+	while (!kthread_should_stop()) {
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+
+	return err;
+}
+
+void lkdtm_HIJACK_PATCH(void)
+{
+	struct task_struct *patching_kthrd;
+	struct ppc_inst original_insn;
+	int patching_cpu, hijacker_cpu, attempts;
+	unsigned long addr;
+	bool hijacked;
+
+	if (num_online_cpus() < 2) {
+		pr_warn("need at least two cpus\n");
+		return;
+	}
+
+	original_insn = ppc_inst_read(READ_ONCE(patch_site));
+
+	hijacker_cpu = smp_processor_id();
+	patching_cpu = cpumask_any_but(cpu_online_mask, hijacker_cpu);
+
+	patching_kthrd = kthread_create_on_node(&lkdtm_patching_cpu, NULL,
+						cpu_to_node(patching_cpu),
+						"lkdtm_patching_cpu");
+	kthread_bind(patching_kthrd, patching_cpu);
+	wake_up_process(patching_kthrd);
+
+	addr = offset_in_page(patch_site) | read_cpu_patching_addr(patching_cpu);
+
+	pr_info("starting hijacker_cpu=%d\n", hijacker_cpu);
+	for (attempts = 0; attempts < 100000; ++attempts) {
+		/* Use __put_user to catch faults without an Oops */
+		hijacked = !__put_user(0xbad00bad, (unsigned int *)addr);
+
+		if (hijacked) {
+			if (kthread_stop(patching_kthrd))
+				goto out;
+			break;
+		}
+	}
+	pr_info("hijack attempts: %d\n", attempts);
+
+	if (hijacked) {
+		if (*(unsigned int *)READ_ONCE(patch_site) == 0xbad00bad)
+			pr_err("overwrote kernel text\n");
+		/*
+		 * There are window conditions where the hijacker cpu manages to
+		 * write to the patch site but the site gets overwritten again by
+		 * the patching cpu. We still consider that a "successful" hijack
+		 * since the hijacker cpu did not fault on the write.
+		 */
+		pr_err("FAIL: wrote to another cpu's patching area\n");
+	} else {
+		kthread_stop(patching_kthrd);
+	}
+
+out:
+	/* Restore the original insn for any future lkdtm tests */
+	patch_instruction(patch_site, original_insn);
+}
+
+#else
+
+void lkdtm_HIJACK_PATCH(void)
+{
+	if (!IS_ENABLED(CONFIG_PPC))
+		pr_err("XFAIL: this test is powerpc-only\n");
+	if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
+		pr_err("XFAIL: this test requires CONFIG_STRICT_KERNEL_RWX\n");
+}
+
+#endif /* CONFIG_PPC && CONFIG_STRICT_KERNEL_RWX */
+
 void __init lkdtm_perms_init(void)
 {
 	/* Make sure we can write to __ro_after_init values during __init */
-- 
2.26.2


^ permalink raw reply related

* [PATCH 0/5] Use per-CPU temporary mappings for patching
From: Christopher M. Riedl @ 2020-06-03  5:19 UTC (permalink / raw)
  To: linuxppc-dev, kernel-hardening

When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
temporary mappings when patching itself. These mappings temporarily
override the strict RWX text protections to permit a write. Currently,
powerpc allocates a per-CPU VM area for patching. Patching occurs as
follows:

	1. Map page of text to be patched to per-CPU VM area w/
	   PAGE_KERNEL protection
	2. Patch text
	3. Remove the temporary mapping

While the VM area is per-CPU, the mapping is actually inserted into the
kernel page tables. Presumably, this could allow another CPU to access
the normally write-protected text - either malicously or accidentally -
via this same mapping if the address of the VM area is known. Ideally,
the mapping should be kept local to the CPU doing the patching (or any
other sensitive operations requiring temporarily overriding memory
protections) [0].

x86 introduced "temporary mm" structs which allow the creation of
mappings local to a particular CPU [1]. This series intends to bring the
notion of a temporary mm to powerpc and harden powerpc by using such a
mapping for patching a kernel with strict RWX permissions.

The first patch introduces the temporary mm struct and API for powerpc
along with a new function to retrieve a current hw breakpoint.

The second patch uses the `poking_init` init hook added by the x86
patches to initialize a temporary mm and patching address. The patching
address is randomized between 0 and DEFAULT_MAP_WINDOW-PAGE_SIZE. The
upper limit is necessary due to how the hash MMU operates - by default
the space above DEFAULT_MAP_WINDOW is not available. For now, both hash
and radix randomize inside this range. The number of possible random
addresses is dependent on PAGE_SIZE and limited by DEFAULT_MAP_WINDOW.

Bits of entropy with 64K page size on BOOK3S_64:

	bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

	PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
	bits of entropy = log2(128TB / 64K)
	bits of entropy = 31

Randomization occurs only once during initialization at boot.

The third patch replaces the VM area with the temporary mm in the
patching code. The page for patching has to be mapped PAGE_SHARED with
the hash MMU since hash prevents the kernel from accessing userspace
pages with PAGE_PRIVILEGED bit set. On the radix MMU the page is mapped with
PAGE_KERNEL which has the added benefit that we can skip KUAP. 

The fourth and fifth patches implement an LKDTM test "proof-of-concept" which
exploits the previous vulnerability (ie. the mapping during patching is exposed
in kernel page tables and accessible by other CPUS). The LKDTM test is somewhat
"rough" in that it uses a brute-force approach - I am open to any suggestions
and/or ideas to improve this. Currently, the LKDTM test passes with this series
on POWER8 (hash) and POWER9 (radix, hash) and fails without this series (ie.
the temporary mapping for patching is exposed to CPUs other than the patching
CPU).

The test can be applied to a tree without this new series by first
adding this in /arch/powerpc/lib/code-patching.c:

 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);

+#ifdef CONFIG_LKDTM
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+       return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
+}
+#endif
+
 static int text_area_cpu_up(unsigned int cpu)
 {
        struct vm_struct *area;

And then applying the last patch of this series which adds the LKDTM test,
(powerpc: Add LKDTM test to hijack a patch mapping).

Tested on QEMU (POWER8, POWER9), POWER8 VM, and a Blackbird (8-core POWER9).

v1:	* Rebase on linuxppc/next (4336b9337824)
	* Save and restore second hw watchpoint
	* Use new ppc_inst_* functions for patching check and in LKDTM
	  test

rfc-v2:	* Many fixes and improvements mostly based on extensive feedback and
          testing by Christophe Leroy (thanks!).
	* Make patching_mm and patching_addr static and mode '__ro_after_init'
	  to after the variable name (more common in other parts of the kernel)
	* Use 'asm/debug.h' header instead of 'asm/hw_breakpoint.h' to fix
	  PPC64e compile
	* Add comment explaining why we use BUG_ON() during the init call to
	  setup for patching later
	* Move ptep into patch_mapping to avoid walking page tables a second
	  time when unmapping the temporary mapping
	* Use KUAP under non-radix, also manually dirty the PTE for patch
	  mapping on non-BOOK3S_64 platforms
	* Properly return any error from __patch_instruction
	* Do not use 'memcmp' where a simple comparison is appropriate
	* Simplify expression for patch address by removing pointer maths
	* Add LKDTM test

[0]: https://github.com/linuxppc/issues/issues/224
[1]: https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.amit@gmail.com/

Christopher M. Riedl (5):
  powerpc/mm: Introduce temporary mm
  powerpc/lib: Initialize a temporary mm for code patching
  powerpc/lib: Use a temporary mm for code patching
  powerpc/lib: Add LKDTM accessor for patching addr
  powerpc: Add LKDTM test to hijack a patch mapping

 arch/powerpc/include/asm/debug.h       |   1 +
 arch/powerpc/include/asm/mmu_context.h |  64 +++++++++
 arch/powerpc/kernel/process.c          |   5 +
 arch/powerpc/lib/code-patching.c       | 172 +++++++++++++------------
 drivers/misc/lkdtm/core.c              |   1 +
 drivers/misc/lkdtm/lkdtm.h             |   1 +
 drivers/misc/lkdtm/perms.c             | 101 +++++++++++++++
 7 files changed, 260 insertions(+), 85 deletions(-)

-- 
2.26.2

^ permalink raw reply

* Re: [PATCH] cxl: Fix kobject memory leak in cxl_sysfs_afu_new_cr()
From: Greg Kroah-Hartman @ 2020-06-03  6:14 UTC (permalink / raw)
  To: wanghai (M)
  Cc: Andrew Donnellan, Arnd Bergmann, kernel-janitors, linux-kernel,
	Markus Elfring, Ian Munsie, Frederic Barrat, linuxppc-dev
In-Reply-To: <25ad528b-beaf-820f-9738-ea304dcbc0d7@huawei.com>

On Wed, Jun 03, 2020 at 09:42:41AM +0800, wanghai (M) wrote:
> 
> 在 2020/6/3 1:20, Markus Elfring 写道:
> > > Fix it by adding a call to kobject_put() in the error path of
> > > kobject_init_and_add().
> > Thanks for another completion of the exception handling.
> > 
> > Would an other patch subject be a bit nicer?
> Thanks for the guidance, I will perfect this description and send a v2

Please note that you are responding to someone that a lot of kernel
developers and maintainers have blacklisted as being very annoying and
not helpful at all.

Please do not feel that you need to respond to, or change any patch in
response to their emails at all.

I strongly recommend you just add them to your filters to not have to
see their messages.  That's what I have done.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] cxl: Fix kobject memory leak in cxl_sysfs_afu_new_cr()
From: wanghai (M) @ 2020-06-03  6:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Andrew Donnellan, Arnd Bergmann, kernel-janitors, linux-kernel,
	Markus Elfring, Ian Munsie, Frederic Barrat, linuxppc-dev
In-Reply-To: <20200603061443.GB531505@kroah.com>


在 2020/6/3 14:14, Greg Kroah-Hartman 写道:
> On Wed, Jun 03, 2020 at 09:42:41AM +0800, wanghai (M) wrote:
>> 在 2020/6/3 1:20, Markus Elfring 写道:
>>>> Fix it by adding a call to kobject_put() in the error path of
>>>> kobject_init_and_add().
>>> Thanks for another completion of the exception handling.
>>>
>>> Would an other patch subject be a bit nicer?
>> Thanks for the guidance, I will perfect this description and send a v2
> Please note that you are responding to someone that a lot of kernel
> developers and maintainers have blacklisted as being very annoying and
> not helpful at all.
>
> Please do not feel that you need to respond to, or change any patch in
> response to their emails at all.
>
> I strongly recommend you just add them to your filters to not have to
> see their messages.  That's what I have done.
>
> thanks,
>
> greg k-h
>
> .

Okay, so I don’t have to send the v2 patch.


--

thanks,

Wang Hai



^ permalink raw reply

* Re: [PATCH] cxl: Fix kobject memory leak in cxl_sysfs_afu_new_cr()
From: Markus Elfring @ 2020-06-03  6:48 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Wang Hai, Andrew Donnellan, Arnd Bergmann, kernel-janitors,
	linux-kernel, Ian Munsie, Frederic Barrat, linuxppc-dev
In-Reply-To: <20200603061443.GB531505@kroah.com>

> Please note that you are responding to someone that a lot of kernel
> developers and maintainers have blacklisted as being very annoying

I can understand that you can occasionally become annoyed.

> and not helpful at all.

I got the impression that some contributors (including you)
found also a selection of my contributions useful.

> Please do not feel that you need to respond to, or change any patch in
> response to their emails at all.

I suggest to reconsider your responses to provided information once more.

> I strongly recommend you just add them to your filters to not have to
> see their messages.  That's what I have done.

I find such an “advice” questionable while it is generally possible
to adjust the communication settings as needed.

Regards,
Markus

^ permalink raw reply

* Re: [PATCH] cxl: Fix kobject memory leak in cxl_sysfs_afu_new_cr()
From: Greg Kroah-Hartman @ 2020-06-03  6:50 UTC (permalink / raw)
  To: wanghai (M)
  Cc: Andrew Donnellan, Arnd Bergmann, kernel-janitors, linux-kernel,
	Markus Elfring, Ian Munsie, Frederic Barrat, linuxppc-dev
In-Reply-To: <20ae5516-7e41-f706-46ba-955e1936f183@huawei.com>

On Wed, Jun 03, 2020 at 02:34:07PM +0800, wanghai (M) wrote:
> 
> 在 2020/6/3 14:14, Greg Kroah-Hartman 写道:
> > On Wed, Jun 03, 2020 at 09:42:41AM +0800, wanghai (M) wrote:
> > > 在 2020/6/3 1:20, Markus Elfring 写道:
> > > > > Fix it by adding a call to kobject_put() in the error path of
> > > > > kobject_init_and_add().
> > > > Thanks for another completion of the exception handling.
> > > > 
> > > > Would an other patch subject be a bit nicer?
> > > Thanks for the guidance, I will perfect this description and send a v2
> > Please note that you are responding to someone that a lot of kernel
> > developers and maintainers have blacklisted as being very annoying and
> > not helpful at all.
> > 
> > Please do not feel that you need to respond to, or change any patch in
> > response to their emails at all.
> > 
> > I strongly recommend you just add them to your filters to not have to
> > see their messages.  That's what I have done.
> > 
> > thanks,
> > 
> > greg k-h
> > 
> > .
> 
> Okay, so I don’t have to send the v2 patch.

No, all should be fine, I'll review the patch when after 5.8-rc1 is out,
and if I find any problems with it, will let you know then.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] cxl: Fix kobject memory leak in cxl_sysfs_afu_new_cr()
From: wanghai (M) @ 2020-06-03  6:54 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Andrew Donnellan, Arnd Bergmann, kernel-janitors, linux-kernel,
	Markus Elfring, Ian Munsie, Frederic Barrat, linuxppc-dev
In-Reply-To: <20200603065024.GA587198@kroah.com>


在 2020/6/3 14:50, Greg Kroah-Hartman 写道:
> On Wed, Jun 03, 2020 at 02:34:07PM +0800, wanghai (M) wrote:
>> 在 2020/6/3 14:14, Greg Kroah-Hartman 写道:
>>> On Wed, Jun 03, 2020 at 09:42:41AM +0800, wanghai (M) wrote:
>>>> 在 2020/6/3 1:20, Markus Elfring 写道:
>>>>>> Fix it by adding a call to kobject_put() in the error path of
>>>>>> kobject_init_and_add().
>>>>> Thanks for another completion of the exception handling.
>>>>>
>>>>> Would an other patch subject be a bit nicer?
>>>> Thanks for the guidance, I will perfect this description and send a v2
>>> Please note that you are responding to someone that a lot of kernel
>>> developers and maintainers have blacklisted as being very annoying and
>>> not helpful at all.
>>>
>>> Please do not feel that you need to respond to, or change any patch in
>>> response to their emails at all.
>>>
>>> I strongly recommend you just add them to your filters to not have to
>>> see their messages.  That's what I have done.
>>>
>>> thanks,
>>>
>>> greg k-h
>>>
>>> .
>> Okay, so I don’t have to send the v2 patch.
> No, all should be fine, I'll review the patch when after 5.8-rc1 is out,
> and if I find any problems with it, will let you know then.

Got it. Thanks.


thanks,

Wang Hai




^ permalink raw reply

* Re: cxl: Fix kobject memory leak in cxl_sysfs_afu_new_cr()
From: Markus Elfring @ 2020-06-03  6:56 UTC (permalink / raw)
  To: Wang Hai, linuxppc-dev
  Cc: Andrew Donnellan, Arnd Bergmann, Greg Kroah-Hartman,
	kernel-janitors, linux-kernel, Ian Munsie, Frederic Barrat
In-Reply-To: <20ae5516-7e41-f706-46ba-955e1936f183@huawei.com>

> Okay, so I don’t have to send the v2 patch.

It will become more interesting under which circumstances the presented
software development concerns will be taken better into account.

Regards,
Markus

^ permalink raw reply

* Re: [PATCH 1/5] powerpc/mm: Introduce temporary mm
From: Christophe Leroy @ 2020-06-03  6:58 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev, kernel-hardening
In-Reply-To: <20200603051912.23296-2-cmr@informatik.wtf>



Le 03/06/2020 à 07:19, Christopher M. Riedl a écrit :
> x86 supports the notion of a temporary mm which restricts access to
> temporary PTEs to a single CPU. A temporary mm is useful for situations
> where a CPU needs to perform sensitive operations (such as patching a
> STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> said mappings to other CPUs. A side benefit is that other CPU TLBs do
> not need to be flushed when the temporary mm is torn down.
> 
> Mappings in the temporary mm can be set in the userspace portion of the
> address-space.
> 
> Interrupts must be disabled while the temporary mm is in use. HW
> breakpoints, which may have been set by userspace as watchpoints on
> addresses now within the temporary mm, are saved and disabled when
> loading the temporary mm. The HW breakpoints are restored when unloading
> the temporary mm. All HW breakpoints are indiscriminately disabled while
> the temporary mm is in use.
> 
> Based on x86 implementation:
> 
> commit cefa929c034e
> ("x86/mm: Introduce temporary mm structs")
> 
> Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
> ---
>   arch/powerpc/include/asm/debug.h       |  1 +
>   arch/powerpc/include/asm/mmu_context.h | 64 ++++++++++++++++++++++++++
>   arch/powerpc/kernel/process.c          |  5 ++
>   3 files changed, 70 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
> index ec57daf87f40..827350c9bcf3 100644
> --- a/arch/powerpc/include/asm/debug.h
> +++ b/arch/powerpc/include/asm/debug.h
> @@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
>   #endif
>   
>   void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
>   bool ppc_breakpoint_available(void);
>   #ifdef CONFIG_PPC_ADV_DEBUG_REGS
>   extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 1a474f6b1992..9269c7c7b04e 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -10,6 +10,7 @@
>   #include <asm/mmu.h>	
>   #include <asm/cputable.h>
>   #include <asm/cputhreads.h>
> +#include <asm/debug.h>
>   
>   /*
>    * Most if the context management is out of line
> @@ -300,5 +301,68 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
>   	return 0;
>   }
>   
> +struct temp_mm {
> +	struct mm_struct *temp;
> +	struct mm_struct *prev;
> +	bool is_kernel_thread;
> +	struct arch_hw_breakpoint brk[HBP_NUM_MAX];
> +};
> +
> +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
> +{
> +	temp_mm->temp = mm;
> +	temp_mm->prev = NULL;
> +	temp_mm->is_kernel_thread = false;
> +	memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> +}
> +
> +static inline void use_temporary_mm(struct temp_mm *temp_mm)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	temp_mm->is_kernel_thread = current->mm == NULL;
> +	if (temp_mm->is_kernel_thread)
> +		temp_mm->prev = current->active_mm;
> +	else
> +		temp_mm->prev = current->mm;

Is that necessary to make different for kernel threads ? When I look at 
x86 implementation, they don't do such a thing.

> +
> +	/*
> +	 * Hash requires a non-NULL current->mm to allocate a userspace address
> +	 * when handling a page fault. Does not appear to hurt in Radix either.
> +	 */
> +	current->mm = temp_mm->temp;
> +	switch_mm_irqs_off(NULL, temp_mm->temp, current);
> +
> +	if (ppc_breakpoint_available()) {
> +		struct arch_hw_breakpoint null_brk = {0};
> +		int i = 0;
> +
> +		for (; i < nr_wp_slots(); ++i) {
> +			__get_breakpoint(i, &temp_mm->brk[i]);
> +			if (temp_mm->brk[i].type != 0)
> +				__set_breakpoint(i, &null_brk);
> +		}
> +	}
> +}
> +
> +static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	if (temp_mm->is_kernel_thread)
> +		current->mm = NULL;
> +	else
> +		current->mm = temp_mm->prev;
> +	switch_mm_irqs_off(NULL, temp_mm->prev, current);
> +
> +	if (ppc_breakpoint_available()) {
> +		int i = 0;
> +
> +		for (; i < nr_wp_slots(); ++i)
> +			if (temp_mm->brk[i].type != 0)
> +				__set_breakpoint(i, &temp_mm->brk[i]);
> +	}
> +}
> +
>   #endif /* __KERNEL__ */
>   #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 048d64c4e115..3973144f6980 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -825,6 +825,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
>   	return 0;
>   }
>   
> +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> +{
> +	memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
> +}
> +
>   void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
>   {
>   	memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
> 

Christophe

^ permalink raw reply

* Re: [PATCH 2/5] powerpc/lib: Initialize a temporary mm for code patching
From: Christophe Leroy @ 2020-06-03  7:01 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev, kernel-hardening
In-Reply-To: <20200603051912.23296-3-cmr@informatik.wtf>



Le 03/06/2020 à 07:19, Christopher M. Riedl a écrit :
> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> address to be patched is temporarily mapped with permissive memory
> protections. Currently, a per-cpu vmalloc patch area is used for this
> purpose. While the patch area is per-cpu, the temporary page mapping is
> inserted into the kernel page tables for the duration of the patching.
> The mapping is exposed to CPUs other than the patching CPU - this is
> undesirable from a hardening perspective.
> 
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> portion. The next patch uses the temporary mm and patching address for
> code patching.
> 
> Based on x86 implementation:
> 
> commit 4fc19708b165
> ("x86/alternatives: Initialize temporary mm for patching")
> 
> Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
> ---
>   arch/powerpc/lib/code-patching.c | 33 ++++++++++++++++++++++++++++++++
>   1 file changed, 33 insertions(+)
> 
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index 5ecf0d635a8d..599114f63b44 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -11,6 +11,8 @@
>   #include <linux/cpuhotplug.h>
>   #include <linux/slab.h>
>   #include <linux/uaccess.h>
> +#include <linux/sched/task.h>
> +#include <linux/random.h>
>   
>   #include <asm/pgtable.h>
>   #include <asm/tlbflush.h>
> @@ -45,6 +47,37 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
>   }
>   
>   #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +static struct mm_struct *patching_mm __ro_after_init;
> +static unsigned long patching_addr __ro_after_init;
> +
> +void __init poking_init(void)
> +{
> +	spinlock_t *ptl; /* for protecting pte table */
> +	pte_t *ptep;
> +
> +	/*
> +	 * Some parts of the kernel (static keys for example) depend on
> +	 * successful code patching. Code patching under STRICT_KERNEL_RWX
> +	 * requires this setup - otherwise we cannot patch at all. We use
> +	 * BUG_ON() here and later since an early failure is preferred to
> +	 * buggy behavior and/or strange crashes later.
> +	 */
> +	patching_mm = copy_init_mm();
> +	BUG_ON(!patching_mm);
> +
> +	/*
> +	 * In hash we cannot go above DEFAULT_MAP_WINDOW easily.
> +	 * XXX: Do we want additional bits of entropy for radix?
> +	 */
> +	patching_addr = (get_random_long() & PAGE_MASK) %
> +		(DEFAULT_MAP_WINDOW - PAGE_SIZE);
> +
> +	ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> +	BUG_ON(!ptep);
> +	pte_unmap_unlock(ptep, ptl);

Is this needed ? What's the point in getting the pte to unmap it 
immediatly without doing anything with it ?

Christophe

> +}
> +
>   static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
>   
>   static int text_area_cpu_up(unsigned int cpu)
> 

^ permalink raw reply

* Re: [PATCH 3/5] powerpc/lib: Use a temporary mm for code patching
From: Christophe Leroy @ 2020-06-03  7:12 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev, kernel-hardening
In-Reply-To: <20200603051912.23296-4-cmr@informatik.wtf>



Le 03/06/2020 à 07:19, Christopher M. Riedl a écrit :
> Currently, code patching a STRICT_KERNEL_RWX exposes the temporary
> mappings to other CPUs. These mappings should be kept local to the CPU
> doing the patching. Use the pre-initialized temporary mm and patching
> address for this purpose. Also add a check after patching to ensure the
> patch succeeded.
> 
> Use the KUAP functions on non-BOOKS3_64 platforms since the temporary
> mapping for patching uses a userspace address (to keep the mapping
> local). On BOOKS3_64 platforms hash does not implement KUAP and on radix
> the use of PAGE_KERNEL sets EAA[0] for the PTE which means the AMR
> (KUAP) protection is ignored (see PowerISA v3.0b, Fig, 35).
> 
> Based on x86 implementation:
> 
> commit b3fd8e83ada0
> ("x86/alternatives: Use temporary mm for text poking")
> 
> Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
> ---
>   arch/powerpc/lib/code-patching.c | 148 ++++++++++++-------------------
>   1 file changed, 55 insertions(+), 93 deletions(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index 599114f63b44..df0765845204 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -20,6 +20,7 @@
>   #include <asm/code-patching.h>
>   #include <asm/setup.h>
>   #include <asm/inst.h>
> +#include <asm/mmu_context.h>
>   
>   static int __patch_instruction(struct ppc_inst *exec_addr, struct ppc_inst instr,
>   			       struct ppc_inst *patch_addr)
> @@ -78,101 +79,58 @@ void __init poking_init(void)
>   	pte_unmap_unlock(ptep, ptl);
>   }
>   
> -static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> -
> -static int text_area_cpu_up(unsigned int cpu)
> -{
> -	struct vm_struct *area;
> -
> -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> -	if (!area) {
> -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> -			cpu);
> -		return -1;
> -	}
> -	this_cpu_write(text_poke_area, area);
> -
> -	return 0;
> -}
> -
> -static int text_area_cpu_down(unsigned int cpu)
> -{
> -	free_vm_area(this_cpu_read(text_poke_area));
> -	return 0;
> -}
> -
> -/*
> - * Run as a late init call. This allows all the boot time patching to be done
> - * simply by patching the code, and then we're called here prior to
> - * mark_rodata_ro(), which happens after all init calls are run. Although
> - * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we judge
> - * it as being preferable to a kernel that will crash later when someone tries
> - * to use patch_instruction().
> - */
> -static int __init setup_text_poke_area(void)
> -{
> -	BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> -		"powerpc/text_poke:online", text_area_cpu_up,
> -		text_area_cpu_down));
> -
> -	return 0;
> -}
> -late_initcall(setup_text_poke_area);
> +struct patch_mapping {
> +	spinlock_t *ptl; /* for protecting pte table */
> +	pte_t *ptep;
> +	struct temp_mm temp_mm;
> +};
>   
>   /*
>    * This can be called for kernel text or a module.
>    */
> -static int map_patch_area(void *addr, unsigned long text_poke_addr)
> +static int map_patch(const void *addr, struct patch_mapping *patch_mapping)
>   {
> -	unsigned long pfn;
> -	int err;
> +	struct page *page;
> +	pte_t pte;
> +	pgprot_t pgprot;
>   
>   	if (is_vmalloc_addr(addr))
> -		pfn = vmalloc_to_pfn(addr);
> +		page = vmalloc_to_page(addr);
>   	else
> -		pfn = __pa_symbol(addr) >> PAGE_SHIFT;
> +		page = virt_to_page(addr);
>   
> -	err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
> +	if (radix_enabled())
> +		pgprot = PAGE_KERNEL;
> +	else
> +		pgprot = PAGE_SHARED;
>   
> -	pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
> -	if (err)
> +	patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> +					     &patch_mapping->ptl);
> +	if (unlikely(!patch_mapping->ptep)) {
> +		pr_warn("map patch: failed to allocate pte for patching\n");
>   		return -1;
> +	}
> +
> +	pte = mk_pte(page, pgprot);
> +	if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64))
> +		pte = pte_mkdirty(pte);

Are you should you don't need the DIRTY bit for BOOK3S/64 non radix ?

I think the DIRTY bit is needed always, and adding it when it is already 
there is harmless, so it should be done inconditionnnaly.

> +	set_pte_at(patching_mm, patching_addr, patch_mapping->ptep, pte);
> +
> +	init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> +	use_temporary_mm(&patch_mapping->temp_mm);
>   
>   	return 0;
>   }
>   
> -static inline int unmap_patch_area(unsigned long addr)
> +static void unmap_patch(struct patch_mapping *patch_mapping)
>   {
> -	pte_t *ptep;
> -	pmd_t *pmdp;
> -	pud_t *pudp;
> -	pgd_t *pgdp;
> -
> -	pgdp = pgd_offset_k(addr);
> -	if (unlikely(!pgdp))
> -		return -EINVAL;
> -
> -	pudp = pud_offset(pgdp, addr);
> -	if (unlikely(!pudp))
> -		return -EINVAL;
> -
> -	pmdp = pmd_offset(pudp, addr);
> -	if (unlikely(!pmdp))
> -		return -EINVAL;
> -
> -	ptep = pte_offset_kernel(pmdp, addr);
> -	if (unlikely(!ptep))
> -		return -EINVAL;
> +	/* In hash, pte_clear flushes the tlb */
> +	pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
> +	unuse_temporary_mm(&patch_mapping->temp_mm);
>   
> -	pr_devel("clearing mm %p, pte %p, addr %lx\n", &init_mm, ptep, addr);
> -
> -	/*
> -	 * In hash, pte_clear flushes the tlb, in radix, we have to
> -	 */
> -	pte_clear(&init_mm, addr, ptep);
> -	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> -
> -	return 0;
> +	/* In radix, we have to explicitly flush the tlb (no-op in hash) */
> +	local_flush_tlb_mm(patching_mm);
> +	pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
>   }
>   
>   static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
> @@ -180,32 +138,36 @@ static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
>   	int err;
>   	struct ppc_inst *patch_addr = NULL;
>   	unsigned long flags;
> -	unsigned long text_poke_addr;
> -	unsigned long kaddr = (unsigned long)addr;
> +	struct patch_mapping patch_mapping;
>   
>   	/*
> -	 * During early early boot patch_instruction is called
> -	 * when text_poke_area is not ready, but we still need
> -	 * to allow patching. We just do the plain old patching
> +	 * The patching_mm is initialized before calling mark_rodata_ro. Prior
> +	 * to this, patch_instruction is called when we don't have (and don't
> +	 * need) the patching_mm so just do plain old patching.
>   	 */
> -	if (!this_cpu_read(text_poke_area))
> +	if (!patching_mm)
>   		return raw_patch_instruction(addr, instr);
>   
>   	local_irq_save(flags);
>   
> -	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
> -	if (map_patch_area(addr, text_poke_addr)) {
> -		err = -1;
> +	err = map_patch(addr, &patch_mapping);
> +	if (err)
>   		goto out;
> -	}
>   
> -	patch_addr = (struct ppc_inst *)(text_poke_addr + (kaddr & ~PAGE_MASK));
> +	patch_addr = (struct ppc_inst *)(patching_addr | offset_in_page(addr));
>   
> -	__patch_instruction(addr, instr, patch_addr);
> +	if (!radix_enabled())
> +		allow_write_to_user(patch_addr, sizeof(instr));

Can't use sizeof(instr), you have to use ppc_inst_size()

> +	err = __patch_instruction(addr, instr, patch_addr);
> +	if (!radix_enabled())
> +		prevent_write_to_user(patch_addr, sizeof(instr));

Same

>   
> -	err = unmap_patch_area(text_poke_addr);
> -	if (err)
> -		pr_warn("failed to unmap %lx\n", text_poke_addr);
> +	unmap_patch(&patch_mapping);
> +	/*
> +	 * Something is wrong if what we just wrote doesn't match what we
> +	 * think we just wrote.
> +	 */
> +	WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr));
>   
>   out:
>   	local_irq_restore(flags);
> 

Christophe

^ permalink raw reply

* Re: [PATCH 4/5] powerpc/lib: Add LKDTM accessor for patching addr
From: Christophe Leroy @ 2020-06-03  7:14 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev, kernel-hardening
In-Reply-To: <20200603051912.23296-5-cmr@informatik.wtf>



Le 03/06/2020 à 07:19, Christopher M. Riedl a écrit :
> When live patching a STRICT_RWX kernel, a mapping is installed at a
> "patching address" with temporary write permissions. Provide a
> LKDTM-only accessor function for this address in preparation for a LKDTM
> test which attempts to "hijack" this mapping by writing to it from
> another CPU.
> 
> Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
> ---
>   arch/powerpc/lib/code-patching.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index df0765845204..c23453049116 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -52,6 +52,13 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
>   static struct mm_struct *patching_mm __ro_after_init;
>   static unsigned long patching_addr __ro_after_init;
>   
> +#ifdef CONFIG_LKDTM
> +unsigned long read_cpu_patching_addr(unsigned int cpu)

If this fonction is not static, it means it is intended to be used from 
some other C file, so it should be declared in a .h too.

Christophe

> +{
> +	return patching_addr;
> +}
> +#endif
> +
>   void __init poking_init(void)
>   {
>   	spinlock_t *ptl; /* for protecting pte table */
> 

^ permalink raw reply

* Re: [PATCH 5/5] powerpc: Add LKDTM test to hijack a patch mapping
From: Christophe Leroy @ 2020-06-03  7:20 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev, kernel-hardening
In-Reply-To: <20200603051912.23296-6-cmr@informatik.wtf>



Le 03/06/2020 à 07:19, Christopher M. Riedl a écrit :
> When live patching with STRICT_KERNEL_RWX, the CPU doing the patching
> must use a temporary mapping which allows for writing to kernel text.
> During the entire window of time when this temporary mapping is in use,
> another CPU could write to the same mapping and maliciously alter kernel
> text. Implement a LKDTM test to attempt to exploit such a openings when
> a CPU is patching under STRICT_KERNEL_RWX. The test is only implemented
> on powerpc for now.
> 
> The LKDTM "hijack" test works as follows:
> 
> 	1. A CPU executes an infinite loop to patch an instruction.
> 	   This is the "patching" CPU.
> 	2. Another CPU attempts to write to the address of the temporary
> 	   mapping used by the "patching" CPU. This other CPU is the
> 	   "hijacker" CPU. The hijack either fails with a segfault or
> 	   succeeds, in which case some kernel text is now overwritten.
> 
> How to run the test:
> 
> 	mount -t debugfs none /sys/kernel/debug
> 	(echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)
> 
> Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
> ---
>   drivers/misc/lkdtm/core.c  |   1 +
>   drivers/misc/lkdtm/lkdtm.h |   1 +
>   drivers/misc/lkdtm/perms.c | 101 +++++++++++++++++++++++++++++++++++++
>   3 files changed, 103 insertions(+)
> 
> diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
> index a5e344df9166..482e72f6a1e1 100644
> --- a/drivers/misc/lkdtm/core.c
> +++ b/drivers/misc/lkdtm/core.c
> @@ -145,6 +145,7 @@ static const struct crashtype crashtypes[] = {
>   	CRASHTYPE(WRITE_RO),
>   	CRASHTYPE(WRITE_RO_AFTER_INIT),
>   	CRASHTYPE(WRITE_KERN),
> +	CRASHTYPE(HIJACK_PATCH),
>   	CRASHTYPE(REFCOUNT_INC_OVERFLOW),
>   	CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
>   	CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
> diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
> index 601a2156a0d4..bfcf3542370d 100644
> --- a/drivers/misc/lkdtm/lkdtm.h
> +++ b/drivers/misc/lkdtm/lkdtm.h
> @@ -62,6 +62,7 @@ void lkdtm_EXEC_USERSPACE(void);
>   void lkdtm_EXEC_NULL(void);
>   void lkdtm_ACCESS_USERSPACE(void);
>   void lkdtm_ACCESS_NULL(void);
> +void lkdtm_HIJACK_PATCH(void);
>   
>   /* lkdtm_refcount.c */
>   void lkdtm_REFCOUNT_INC_OVERFLOW(void);
> diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
> index 62f76d506f04..8bda3b56bc78 100644
> --- a/drivers/misc/lkdtm/perms.c
> +++ b/drivers/misc/lkdtm/perms.c
> @@ -9,6 +9,7 @@
>   #include <linux/vmalloc.h>
>   #include <linux/mman.h>
>   #include <linux/uaccess.h>
> +#include <linux/kthread.h>
>   #include <asm/cacheflush.h>
>   
>   /* Whether or not to fill the target memory area with do_nothing(). */
> @@ -213,6 +214,106 @@ void lkdtm_ACCESS_NULL(void)
>   	*ptr = tmp;
>   }
>   
> +#if defined(CONFIG_PPC) && defined(CONFIG_STRICT_KERNEL_RWX)

Why only PPC ? I understood that this applies also to x86. And 
regarless, the test should be able to run on other architectures, 
allthought for sure it will fail. That's the case for other tests.

> +#include <include/asm/code-patching.h>
> +
> +extern unsigned long read_cpu_patching_addr(unsigned int cpu);

'extern' keyword is useless for functions and shall be banned.

Shouldn't this declaration be in asm/code-patching.h ?

> +
> +static struct ppc_inst * const patch_site = (struct ppc_inst *)&do_nothing;
> +
> +static int lkdtm_patching_cpu(void *data)
> +{
> +	int err = 0;
> +	struct ppc_inst insn = ppc_inst(0xdeadbeef);
> +
> +	pr_info("starting patching_cpu=%d\n", smp_processor_id());
> +	do {
> +		err = patch_instruction(patch_site, insn);
> +	} while (ppc_inst_equal(ppc_inst_read(READ_ONCE(patch_site)), insn) &&
> +			!err && !kthread_should_stop());
> +
> +	if (err)
> +		pr_warn("patch_instruction returned error: %d\n", err);
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	while (!kthread_should_stop()) {
> +		schedule();
> +		set_current_state(TASK_INTERRUPTIBLE);
> +	}
> +
> +	return err;
> +}
> +
> +void lkdtm_HIJACK_PATCH(void)
> +{
> +	struct task_struct *patching_kthrd;
> +	struct ppc_inst original_insn;
> +	int patching_cpu, hijacker_cpu, attempts;
> +	unsigned long addr;
> +	bool hijacked;
> +
> +	if (num_online_cpus() < 2) {
> +		pr_warn("need at least two cpus\n");
> +		return;
> +	}
> +
> +	original_insn = ppc_inst_read(READ_ONCE(patch_site));
> +
> +	hijacker_cpu = smp_processor_id();
> +	patching_cpu = cpumask_any_but(cpu_online_mask, hijacker_cpu);
> +
> +	patching_kthrd = kthread_create_on_node(&lkdtm_patching_cpu, NULL,
> +						cpu_to_node(patching_cpu),
> +						"lkdtm_patching_cpu");
> +	kthread_bind(patching_kthrd, patching_cpu);
> +	wake_up_process(patching_kthrd);
> +
> +	addr = offset_in_page(patch_site) | read_cpu_patching_addr(patching_cpu);
> +
> +	pr_info("starting hijacker_cpu=%d\n", hijacker_cpu);
> +	for (attempts = 0; attempts < 100000; ++attempts) {
> +		/* Use __put_user to catch faults without an Oops */
> +		hijacked = !__put_user(0xbad00bad, (unsigned int *)addr);
> +
> +		if (hijacked) {
> +			if (kthread_stop(patching_kthrd))
> +				goto out;
> +			break;
> +		}
> +	}
> +	pr_info("hijack attempts: %d\n", attempts);
> +
> +	if (hijacked) {
> +		if (*(unsigned int *)READ_ONCE(patch_site) == 0xbad00bad)
> +			pr_err("overwrote kernel text\n");
> +		/*
> +		 * There are window conditions where the hijacker cpu manages to
> +		 * write to the patch site but the site gets overwritten again by
> +		 * the patching cpu. We still consider that a "successful" hijack
> +		 * since the hijacker cpu did not fault on the write.
> +		 */
> +		pr_err("FAIL: wrote to another cpu's patching area\n");
> +	} else {
> +		kthread_stop(patching_kthrd);
> +	}
> +
> +out:
> +	/* Restore the original insn for any future lkdtm tests */
> +	patch_instruction(patch_site, original_insn);
> +}
> +
> +#else
> +
> +void lkdtm_HIJACK_PATCH(void)
> +{
> +	if (!IS_ENABLED(CONFIG_PPC))
> +		pr_err("XFAIL: this test is powerpc-only\n");
> +	if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
> +		pr_err("XFAIL: this test requires CONFIG_STRICT_KERNEL_RWX\n");
> +}
> +
> +#endif /* CONFIG_PPC && CONFIG_STRICT_KERNEL_RWX */
> +
>   void __init lkdtm_perms_init(void)
>   {
>   	/* Make sure we can write to __ro_after_init values during __init */
> 

Christophe

^ permalink raw reply

* [PATCH v4 1/4] riscv: Move kernel mapping to vmalloc zone
From: Alexandre Ghiti @ 2020-06-03  8:00 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Atish Patra,
	Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Alexandre Ghiti
In-Reply-To: <20200603080010.13366-1-alex@ghiti.fr>

This is a preparatory patch for relocatable kernel.

The kernel used to be linked at PAGE_OFFSET address and used to be loaded
physically at the beginning of the main memory. Therefore, we could use
the linear mapping for the kernel mapping.

But the relocated kernel base address will be different from PAGE_OFFSET
and since in the linear mapping, two different virtual addresses cannot
point to the same physical address, the kernel mapping needs to lie outside
the linear mapping.

In addition, because modules and BPF must be close to the kernel (inside
+-2GB window), the kernel is placed at the end of the vmalloc zone minus
2GB, which leaves room for modules and BPF. The kernel could not be
placed at the beginning of the vmalloc zone since other vmalloc
allocations from the kernel could get all the +-2GB window around the
kernel which would prevent new modules and BPF programs to be loaded.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 arch/riscv/boot/loader.lds.S     |  3 +-
 arch/riscv/include/asm/page.h    | 10 +++++-
 arch/riscv/include/asm/pgtable.h | 38 ++++++++++++++-------
 arch/riscv/kernel/head.S         |  3 +-
 arch/riscv/kernel/module.c       |  4 +--
 arch/riscv/kernel/vmlinux.lds.S  |  3 +-
 arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
 arch/riscv/mm/physaddr.c         |  2 +-
 8 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include <asm/page.h>
+#include <asm/pgtable.h>
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-	. = PAGE_OFFSET;
+	. = KERNEL_LINK_ADDR;
 
 	.payload : {
 		*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 2d50f76efe48..48bb09b6a9b7 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET		(pfn_base)
 #else
 #define va_pa_offset		0
+#define va_kernel_pa_offset	0
 #define ARCH_PFN_OFFSET		(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
+extern unsigned long kernel_virt_addr;
 
 #define __pa_to_va_nodebug(x)	((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)	((unsigned long)(x) - va_pa_offset)
+#define linear_mapping_va_to_pa(x)	((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x)	\
+	((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)		\
+	(((x) >= PAGE_OFFSET) ?		\
+		linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 35b60035b6b0..94ef3b49dfb6 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,29 @@
 
 #include <asm/pgtable-bits.h>
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include <asm-generic/pgtable-nopud.h>
-#include <asm/page.h>
-#include <asm/tlbflush.h>
-#include <linux/mm_types.h>
-
-#ifdef CONFIG_MMU
+#ifndef CONFIG_MMU
+#define KERNEL_VIRT_ADDR	PAGE_OFFSET
+#define KERNEL_LINK_ADDR	PAGE_OFFSET
+#else
+/*
+ * Leave 2GB for modules and BPF that must lie within a 2GB range around
+ * the kernel.
+ */
+#define KERNEL_VIRT_ADDR	(VMALLOC_END - SZ_2G + 1)
+#define KERNEL_LINK_ADDR	KERNEL_VIRT_ADDR
 
 #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END      (PAGE_OFFSET - 1)
 #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
 
 #define BPF_JIT_REGION_SIZE	(SZ_128M)
-#define BPF_JIT_REGION_START	(PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END	(VMALLOC_END)
+#define BPF_JIT_REGION_START	PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END	(BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+#ifdef CONFIG_64BIT
+#define VMALLOC_MODULE_START	BPF_JIT_REGION_END
+#define VMALLOC_MODULE_END	(((unsigned long)&_start & PAGE_MASK) + SZ_2G)
+#endif
 
 /*
  * Roughly size the vmemmap space to be large enough to fit enough
@@ -57,9 +63,16 @@
 #define FIXADDR_SIZE     PGDIR_SIZE
 #endif
 #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
-
 #endif
 
+#ifndef __ASSEMBLY__
+
+/* Page Upper Directory not used in RISC-V */
+#include <asm-generic/pgtable-nopud.h>
+#include <asm/page.h>
+#include <asm/tlbflush.h>
+#include <linux/mm_types.h>
+
 #ifdef CONFIG_64BIT
 #include <asm/pgtable-64.h>
 #else
@@ -483,6 +496,7 @@ static inline void __kernel_map_pages(struct page *page, int numpages, int enabl
 
 #define kern_addr_valid(addr)   (1) /* FIXME */
 
+extern char _start[];
 extern void *dtb_early_va;
 void setup_bootmem(void);
 void paging_init(void);
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 98a406474e7d..8f5bb7731327 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -49,7 +49,8 @@ ENTRY(_start)
 #ifdef CONFIG_MMU
 relocate:
 	/* Relocate return address */
-	li a1, PAGE_OFFSET
+	la a1, kernel_virt_addr
+	REG_L a1, 0(a1)
 	la a2, _start
 	sub a1, a1, a2
 	add ra, ra, a1
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 8bbe5dbe1341..1a8fbe05accf 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 }
 
 #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
-#define VMALLOC_MODULE_START \
-	 max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
 void *module_alloc(unsigned long size)
 {
 	return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
-				    VMALLOC_END, GFP_KERNEL,
+				    VMALLOC_MODULE_END, GFP_KERNEL,
 				    PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
 				    __builtin_return_address(0));
 }
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 0339b6bbe11a..a9abde62909f 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -4,7 +4,8 @@
  * Copyright (C) 2017 SiFive
  */
 
-#define LOAD_OFFSET PAGE_OFFSET
+#include <asm/pgtable.h>
+#define LOAD_OFFSET KERNEL_LINK_ADDR
 #include <asm/vmlinux.lds.h>
 #include <asm/page.h>
 #include <asm/cache.h>
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 736de6c8739f..37be2eb45e58 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -22,6 +22,9 @@
 
 #include "../kernel/head.h"
 
+unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
+EXPORT_SYMBOL(kernel_virt_addr);
+
 unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
 							__page_aligned_bss;
 EXPORT_SYMBOL(empty_zero_page);
@@ -178,8 +181,12 @@ void __init setup_bootmem(void)
 }
 
 #ifdef CONFIG_MMU
+/* Offset between linear mapping virtual address and kernel load address */
 unsigned long va_pa_offset;
 EXPORT_SYMBOL(va_pa_offset);
+/* Offset between kernel mapping virtual address and kernel load address */
+unsigned long va_kernel_pa_offset;
+EXPORT_SYMBOL(va_kernel_pa_offset);
 unsigned long pfn_base;
 EXPORT_SYMBOL(pfn_base);
 
@@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
 	if (mmu_enabled)
 		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 
-	pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
+	pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
 	BUG_ON(pmd_num >= NUM_EARLY_PMDS);
 	return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
 }
@@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
 #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
 #endif
 
+static uintptr_t load_pa, load_sz;
+
+void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
+{
+	uintptr_t va, end_va;
+
+	end_va = kernel_virt_addr + load_sz;
+	for (va = kernel_virt_addr; va < end_va; va += map_size)
+		create_pgd_mapping(pgdir, va,
+				   load_pa + (va - kernel_virt_addr),
+				   map_size, PAGE_KERNEL_EXEC);
+}
+
 asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 {
 	uintptr_t va, end_va;
-	uintptr_t load_pa = (uintptr_t)(&_start);
-	uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
 	uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
 
+	load_pa = (uintptr_t)(&_start);
+	load_sz = (uintptr_t)(&_end) - load_pa;
+
 	va_pa_offset = PAGE_OFFSET - load_pa;
+	va_kernel_pa_offset = kernel_virt_addr - load_pa;
+
 	pfn_base = PFN_DOWN(load_pa);
 
 	/*
@@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 	create_pmd_mapping(fixmap_pmd, FIXADDR_START,
 			   (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
 	/* Setup trampoline PGD and PMD */
-	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
+	create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
 			   (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
-	create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
+	create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
 			   load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
 #else
 	/* Setup trampoline PGD */
-	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
+	create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
 			   load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
 #endif
 
 	/*
-	 * Setup early PGD covering entire kernel which will allows
+	 * Setup early PGD covering entire kernel which will allow
 	 * us to reach paging_init(). We map all memory banks later
 	 * in setup_vm_final() below.
 	 */
-	end_va = PAGE_OFFSET + load_sz;
-	for (va = PAGE_OFFSET; va < end_va; va += map_size)
-		create_pgd_mapping(early_pg_dir, va,
-				   load_pa + (va - PAGE_OFFSET),
-				   map_size, PAGE_KERNEL_EXEC);
+	create_kernel_page_table(early_pg_dir, map_size);
 
 	/* Create fixed mapping for early FDT parsing */
 	end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
@@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
 	uintptr_t va, map_size;
 	phys_addr_t pa, start, end;
 	struct memblock_region *reg;
+	static struct vm_struct vm_kernel = { 0 };
 
 	/* Set mmu_enabled flag */
 	mmu_enabled = true;
@@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
 		for (pa = start; pa < end; pa += map_size) {
 			va = (uintptr_t)__va(pa);
 			create_pgd_mapping(swapper_pg_dir, va, pa,
-					   map_size, PAGE_KERNEL_EXEC);
+					   map_size, PAGE_KERNEL);
 		}
 	}
 
+	/* Map the kernel */
+	create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
+
+	/* Reserve the vmalloc area occupied by the kernel */
+	vm_kernel.addr = (void *)kernel_virt_addr;
+	vm_kernel.phys_addr = load_pa;
+	vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
+	vm_kernel.flags = VM_MAP | VM_NO_GUARD;
+	vm_kernel.caller = __builtin_return_address(0);
+
+	vm_area_add_early(&vm_kernel);
+
 	/* Clear fixmap PTE and PMD mappings */
 	clear_fixmap(FIX_PTE);
 	clear_fixmap(FIX_PMD);
diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
index e8e4dcd39fed..35703d5ef5fd 100644
--- a/arch/riscv/mm/physaddr.c
+++ b/arch/riscv/mm/physaddr.c
@@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
 
 phys_addr_t __phys_addr_symbol(unsigned long x)
 {
-	unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
+	unsigned long kernel_start = (unsigned long)kernel_virt_addr;
 	unsigned long kernel_end = (unsigned long)_end;
 
 	/*
-- 
2.20.1


^ permalink raw reply related

* [PATCH v4 2/4] riscv: Introduce CONFIG_RELOCATABLE
From: Alexandre Ghiti @ 2020-06-03  8:00 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Atish Patra,
	Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Anup Patel, Alexandre Ghiti
In-Reply-To: <20200603080010.13366-1-alex@ghiti.fr>

This config allows to compile the kernel as PIE and to relocate it at
any virtual address at runtime: this paves the way to KASLR and to 4-level
page table folding at runtime. Runtime relocation is possible since
relocation metadata are embedded into the kernel.

Note that relocating at runtime introduces an overhead even if the
kernel is loaded at the same address it was linked at and that the compiler
options are those used in arm64 which uses the same RELA relocation
format.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
---
 arch/riscv/Kconfig              | 12 +++++++
 arch/riscv/Makefile             |  5 ++-
 arch/riscv/kernel/vmlinux.lds.S |  6 ++--
 arch/riscv/mm/Makefile          |  4 +++
 arch/riscv/mm/init.c            | 63 +++++++++++++++++++++++++++++++++
 5 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index a31e1a41913a..93127d5913fe 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -170,6 +170,18 @@ config PGTABLE_LEVELS
 	default 3 if 64BIT
 	default 2
 
+config RELOCATABLE
+	bool
+	depends on MMU
+	help
+          This builds a kernel as a Position Independent Executable (PIE),
+          which retains all relocation metadata required to relocate the
+          kernel binary at runtime to a different virtual address than the
+          address it was linked at.
+          Since RISCV uses the RELA relocation format, this requires a
+          relocation pass at runtime even if the kernel is loaded at the
+          same address it was linked at.
+
 source "arch/riscv/Kconfig.socs"
 
 menu "Platform type"
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index fb6e37db836d..1406416ea743 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -9,7 +9,10 @@
 #
 
 OBJCOPYFLAGS    := -O binary
-LDFLAGS_vmlinux :=
+ifeq ($(CONFIG_RELOCATABLE),y)
+LDFLAGS_vmlinux := -shared -Bsymbolic -z notext -z norelro
+KBUILD_CFLAGS += -fPIE
+endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
 	LDFLAGS_vmlinux := --no-relax
 endif
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index a9abde62909f..e8ffba8c2044 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -85,8 +85,10 @@ SECTIONS
 
 	BSS_SECTION(PAGE_SIZE, PAGE_SIZE, 0)
 
-	.rel.dyn : {
-		*(.rel.dyn*)
+	.rela.dyn : ALIGN(8) {
+		__rela_dyn_start = .;
+		*(.rela .rela*)
+		__rela_dyn_end = .;
 	}
 
 	_end = .;
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 363ef01c30b1..dc5cdaa80bc1 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -1,6 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 CFLAGS_init.o := -mcmodel=medany
+ifdef CONFIG_RELOCATABLE
+CFLAGS_init.o += -fno-pie
+endif
+
 ifdef CONFIG_FTRACE
 CFLAGS_REMOVE_init.o = -pg
 endif
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 37be2eb45e58..e63ea5b6b6cf 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -13,6 +13,9 @@
 #include <linux/of_fdt.h>
 #include <linux/libfdt.h>
 #include <linux/set_memory.h>
+#ifdef CONFIG_RELOCATABLE
+#include <linux/elf.h>
+#endif
 
 #include <asm/fixmap.h>
 #include <asm/tlbflush.h>
@@ -379,6 +382,53 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
 #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
 #endif
 
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long __rela_dyn_start, __rela_dyn_end;
+
+#ifdef CONFIG_64BIT
+#define Elf_Rela Elf64_Rela
+#define Elf_Addr Elf64_Addr
+#else
+#define Elf_Rela Elf32_Rela
+#define Elf_Addr Elf32_Addr
+#endif
+
+void __init relocate_kernel(uintptr_t load_pa)
+{
+	Elf_Rela *rela = (Elf_Rela *)&__rela_dyn_start;
+	/*
+	 * This holds the offset between the linked virtual address and the
+	 * relocated virtual address.
+	 */
+	uintptr_t reloc_offset = kernel_virt_addr - KERNEL_LINK_ADDR;
+	/*
+	 * This holds the offset between kernel linked virtual address and
+	 * physical address.
+	 */
+	uintptr_t va_kernel_link_pa_offset = KERNEL_LINK_ADDR - load_pa;
+
+	for ( ; rela < (Elf_Rela *)&__rela_dyn_end; rela++) {
+		Elf_Addr addr = (rela->r_offset - va_kernel_link_pa_offset);
+		Elf_Addr relocated_addr = rela->r_addend;
+
+		if (rela->r_info != R_RISCV_RELATIVE)
+			continue;
+
+		/*
+		 * Make sure to not relocate vdso symbols like rt_sigreturn
+		 * which are linked from the address 0 in vmlinux since
+		 * vdso symbol addresses are actually used as an offset from
+		 * mm->context.vdso in VDSO_OFFSET macro.
+		 */
+		if (relocated_addr >= KERNEL_LINK_ADDR)
+			relocated_addr += reloc_offset;
+
+		*(Elf_Addr *)addr = relocated_addr;
+	}
+}
+
+#endif
+
 static uintptr_t load_pa, load_sz;
 
 void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
@@ -405,6 +455,19 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 
 	pfn_base = PFN_DOWN(load_pa);
 
+#ifdef CONFIG_RELOCATABLE
+#ifdef CONFIG_64BIT
+	/*
+	 * Early page table uses only one PGDIR, which makes it possible
+	 * to map PGDIR_SIZE aligned on PGDIR_SIZE: if the relocation offset
+	 * makes the kernel cross over a PGDIR_SIZE boundary, raise a bug
+	 * since a part of the kernel would not get mapped.
+	 * This cannot happen on rv32 as we use the entire page directory level.
+	 */
+	BUG_ON(PGDIR_SIZE - (kernel_virt_addr & (PGDIR_SIZE - 1)) < load_sz);
+#endif
+	relocate_kernel(load_pa);
+#endif
 	/*
 	 * Enforce boot alignment requirements of RV32 and
 	 * RV64 by only allowing PMD or PGD mappings.
-- 
2.20.1


^ permalink raw reply related

* [PATCH v4 3/4] powerpc: Move script to check relocations at compile time in scripts/
From: Alexandre Ghiti @ 2020-06-03  8:00 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Atish Patra,
	Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Anup Patel, Alexandre Ghiti
In-Reply-To: <20200603080010.13366-1-alex@ghiti.fr>

Relocating kernel at runtime is done very early in the boot process, so
it is not convenient to check for relocations there and react in case a
relocation was not expected.

Powerpc architecture has a script that allows to check at compile time
for such unexpected relocations: extract the common logic to scripts/
so that other architectures can take advantage of it.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Anup Patel <anup@brainfault.org>
---
 arch/powerpc/tools/relocs_check.sh | 18 ++----------------
 scripts/relocs_check.sh            | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 16 deletions(-)
 create mode 100755 scripts/relocs_check.sh

diff --git a/arch/powerpc/tools/relocs_check.sh b/arch/powerpc/tools/relocs_check.sh
index 014e00e74d2b..e367895941ae 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then
 	exit 1
 fi
 
-# Have Kbuild supply the path to objdump and nm so we handle cross compilation.
-objdump="$1"
-nm="$2"
-vmlinux="$3"
-
-# Remove from the bad relocations those that match an undefined weak symbol
-# which will result in an absolute relocation to 0.
-# Weak unresolved symbols are of that form in nm output:
-# "                  w _binary__btf_vmlinux_bin_end"
-undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
-
 bad_relocs=$(
-$objdump -R "$vmlinux" |
-	# Only look at relocation lines.
-	grep -E '\<R_' |
+${srctree}/scripts/relocs_check.sh "$@" |
 	# These relocations are okay
 	# On PPC64:
 	#	R_PPC64_RELATIVE, R_PPC64_NONE
@@ -43,8 +30,7 @@ R_PPC_ADDR16_LO
 R_PPC_ADDR16_HI
 R_PPC_ADDR16_HA
 R_PPC_RELATIVE
-R_PPC_NONE' |
-	([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || cat)
+R_PPC_NONE'
 )
 
 if [ -z "$bad_relocs" ]; then
diff --git a/scripts/relocs_check.sh b/scripts/relocs_check.sh
new file mode 100755
index 000000000000..137c660499f3
--- /dev/null
+++ b/scripts/relocs_check.sh
@@ -0,0 +1,20 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+# Get a list of all the relocations, remove from it the relocations
+# that are known to be legitimate and return this list to arch specific
+# script that will look for suspicious relocations.
+
+objdump="$1"
+nm="$2"
+vmlinux="$3"
+
+# Remove from the possible bad relocations those that match an undefined
+# weak symbol which will result in an absolute relocation to 0.
+# Weak unresolved symbols are of that form in nm output:
+# "                  w _binary__btf_vmlinux_bin_end"
+undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
+
+$objdump -R "$vmlinux" |
+	grep -E '\<R_' |
+	([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || cat)
-- 
2.20.1


^ permalink raw reply related

* [PATCH v4 4/4] riscv: Check relocations at compile time
From: Alexandre Ghiti @ 2020-06-03  8:00 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Atish Patra,
	Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Anup Patel, Alexandre Ghiti
In-Reply-To: <20200603080010.13366-1-alex@ghiti.fr>

Relocating kernel at runtime is done very early in the boot process, so
it is not convenient to check for relocations there and react in case a
relocation was not expected.

There exists a script in scripts/ that extracts the relocations from
vmlinux that is then used at postlink to check the relocations.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Anup Patel <anup@brainfault.org>
---
 arch/riscv/Makefile.postlink     | 36 ++++++++++++++++++++++++++++++++
 arch/riscv/tools/relocs_check.sh | 26 +++++++++++++++++++++++
 2 files changed, 62 insertions(+)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh

diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
new file mode 100644
index 000000000000..bf2b2bca1845
--- /dev/null
+++ b/arch/riscv/Makefile.postlink
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: GPL-2.0
+# ===========================================================================
+# Post-link riscv pass
+# ===========================================================================
+#
+# Check that vmlinux relocations look sane
+
+PHONY := __archpost
+__archpost:
+
+-include include/config/auto.conf
+include scripts/Kbuild.include
+
+quiet_cmd_relocs_check = CHKREL  $@
+cmd_relocs_check = 							\
+	$(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@"
+
+# `@true` prevents complaint when there is nothing to be done
+
+vmlinux: FORCE
+	@true
+ifdef CONFIG_RELOCATABLE
+	$(call if_changed,relocs_check)
+endif
+
+%.ko: FORCE
+	@true
+
+clean:
+	@true
+
+PHONY += FORCE clean
+
+FORCE:
+
+.PHONY: $(PHONY)
diff --git a/arch/riscv/tools/relocs_check.sh b/arch/riscv/tools/relocs_check.sh
new file mode 100755
index 000000000000..baeb2e7b2290
--- /dev/null
+++ b/arch/riscv/tools/relocs_check.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Based on powerpc relocs_check.sh
+
+# This script checks the relocations of a vmlinux for "suspicious"
+# relocations.
+
+if [ $# -lt 3 ]; then
+        echo "$0 [path to objdump] [path to nm] [path to vmlinux]" 1>&2
+        exit 1
+fi
+
+bad_relocs=$(
+${srctree}/scripts/relocs_check.sh "$@" |
+	# These relocations are okay
+	#	R_RISCV_RELATIVE
+	grep -F -w -v 'R_RISCV_RELATIVE'
+)
+
+if [ -z "$bad_relocs" ]; then
+	exit 0
+fi
+
+num_bad=$(echo "$bad_relocs" | wc -l)
+echo "WARNING: $num_bad bad relocations"
+echo "$bad_relocs"
-- 
2.20.1


^ permalink raw reply related

* [PATCH v4 0/4] vmalloc kernel mapping and relocatable kernel
From: Alexandre Ghiti @ 2020-06-03  8:00 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Atish Patra,
	Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Alexandre Ghiti

This patchset originally implemented relocatable kernel support but now          
also moves the kernel mapping into the vmalloc zone.                             
                                                                                 
The first patch explains why we need to move the kernel into vmalloc             
zone (instead of memcpying it around). That patch should ease KASLR              
implementation a lot.                                                            
                                                                                 
The second patch allows to build relocatable kernels but is not selected         
by default.                                                                      
                                                                                 
The third and fourth patches take advantage of an already existing powerpc       
script that checks relocations at compile-time, and uses it for riscv.           
                                                                                 
Changes in v4:                                                                   
  * Fix BPF region that overlapped with kernel's as suggested by Zong            
  * Fix end of module region that could be larger than 2GB as suggested by Zong  
  * Fix the size of the vm area reserved for the kernel as we could lose         
    PMD_SIZE if the size was already aligned on PMD_SIZE                         
  * Split compile time relocations check patch into 2 patches as suggested by Anup
  * Applied Reviewed-by from Zong and Anup                                       
                                                                                 
Changes in v3:                                                                   
  * Move kernel mapping to vmalloc                                               
                                                                                 
Changes in v2:                                                                   
  * Make RELOCATABLE depend on MMU as suggested by Anup                          
  * Rename kernel_load_addr into kernel_virt_addr as suggested by Anup           
  * Use __pa_symbol instead of __pa, as suggested by Zong                        
  * Rebased on top of v5.6-rc3                                                   
  * Tested with sv48 patchset                                                    
  * Add Reviewed/Tested-by from Zong and Anup

Alexandre Ghiti (4):
  riscv: Move kernel mapping to vmalloc zone
  riscv: Introduce CONFIG_RELOCATABLE
  powerpc: Move script to check relocations at compile time in scripts/
  riscv: Check relocations at compile time

 arch/powerpc/tools/relocs_check.sh |  18 +----
 arch/riscv/Kconfig                 |  12 +++
 arch/riscv/Makefile                |   5 +-
 arch/riscv/Makefile.postlink       |  36 +++++++++
 arch/riscv/boot/loader.lds.S       |   3 +-
 arch/riscv/include/asm/page.h      |  10 ++-
 arch/riscv/include/asm/pgtable.h   |  38 ++++++---
 arch/riscv/kernel/head.S           |   3 +-
 arch/riscv/kernel/module.c         |   4 +-
 arch/riscv/kernel/vmlinux.lds.S    |   9 ++-
 arch/riscv/mm/Makefile             |   4 +
 arch/riscv/mm/init.c               | 121 +++++++++++++++++++++++++----
 arch/riscv/mm/physaddr.c           |   2 +-
 arch/riscv/tools/relocs_check.sh   |  26 +++++++
 scripts/relocs_check.sh            |  20 +++++
 15 files changed, 259 insertions(+), 52 deletions(-)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh
 create mode 100755 scripts/relocs_check.sh

-- 
2.20.1


^ permalink raw reply

* Re: [PATCH v2 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms
From: Michal Simek @ 2020-06-03  8:10 UTC (permalink / raw)
  To: Michael Ellerman, Michal Simek, Takashi Iwai
  Cc: Kate Stewart, Mark Rutland, Desnes A. Nunes do Rosario,
	Geert Uytterhoeven, linux-doc, alsa-devel, dri-devel,
	Jaroslav Kysela, Richard Fontana, Paul Mackerras, Miquel Raynal,
	Mauro Carvalho Chehab, Fabio Estevam, Sasha Levin, sfr,
	Jonathan Corbet, maz, Masahiro Yamada, Takashi Iwai, YueHaibing,
	Krzysztof Kozlowski, Allison Randal, linux-arm-kernel, devicetree,
	Andrew Donnellan, Arnd Bergmann, Bartlomiej Zolnierkiewicz,
	Alistair Popple, linuxppc-dev, Nicholas Piggin, Alexios Zavras,
	Mark Brown, git, linux-fbdev, Jonathan Cameron, Thomas Gleixner,
	Andy Shevchenko, Dmitry Vyukov, Christophe Leroy, Wei Hu,
	Greg Kroah-Hartman, Nick Desaulniers, linux-kernel, Rob Herring,
	Enrico Weigelt, David S. Miller, Thiago Jung Bauermann
In-Reply-To: <87wo4yerom.fsf@mpe.ellerman.id.au>

Hi Michael,

On 26. 05. 20 15:44, Michael Ellerman wrote:
> Michal Simek <monstr@monstr.eu> writes:
>> Hi Michael,
>>
>> On 01. 04. 20 13:30, Michal Simek wrote:
>>> On 01. 04. 20 12:38, Takashi Iwai wrote:
>>>> On Wed, 01 Apr 2020 12:35:16 +0200,
>>>> Michael Ellerman wrote:
>>>>>
>>>>> Michal Simek <michal.simek@xilinx.com> writes:
>>>>>> On 01. 04. 20 4:07, Michael Ellerman wrote:
>>>>>>> Michal Simek <michal.simek@xilinx.com> writes:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> recently we wanted to update xilinx intc driver and we found that function
>>>>>>>> which we wanted to remove is still wired by ancient Xilinx PowerPC
>>>>>>>> platforms. Here is the thread about it.
>>>>>>>> https://lore.kernel.org/linux-next/48d3232d-0f1d-42ea-3109-f44bbabfa2e8@xilinx.com/
>>>>>>>>
>>>>>>>> I have been talking about it internally and there is no interest in these
>>>>>>>> platforms and it is also orphan for quite a long time. None is really
>>>>>>>> running/testing these platforms regularly that's why I think it makes sense
>>>>>>>> to remove them also with drivers which are specific to this platform.
>>>>>>>>
>>>>>>>> U-Boot support was removed in 2017 without anybody complain about it
>>>>>>>> https://github.com/Xilinx/u-boot-xlnx/commit/98f705c9cefdfdba62c069821bbba10273a0a8ed
>>>>>>>>
>>>>>>>> Based on current ppc/next.
>>>>>>>>
>>>>>>>> If anyone has any objection about it, please let me know.
>>>>>>>
>>>>>>> Thanks for taking the time to find all this code and remove it.
>>>>>>>
>>>>>>> I'm not going to take this series for v5.7, it was posted too close to
>>>>>>> the merge window, and doing so wouldn't give people much time to object,
>>>>>>> especially given people are distracted at the moment.
>>>>>>>
>>>>>>> I'm happy to take it for v5.8, assuming there's no major objections.
>>>>>>
>>>>>> Sure. Just to let you know Christophe Leroy included this patch in his
>>>>>> series about ppc405 removal. It should be the same.
>>>>>>
>>>>>> If you don't want to take that alsa patch I can send it separately and
>>>>>> this patch can be taken from his series. I don't really mind but please
>>>>>> let me know what way you prefer.
>>>>>
>>>>> It's better to keep it all together, so I'm happy take the alsa patch as
>>>>> well, it's already been acked.
>>
>> Can you please take this series? I know that there is v5 from Christophe
>> which has this 1/2 as 1/13. But I need this alsa patch too and I would
>> like to close this because it is around for almost 2 months and none
>> raised a concern about removing just these Xilinx platforms.
> 
> Sorry I meant to reply to your last mail.
> 
> I have Christophe's series in my testing branch, planning for it to be
> in v5.8.
> 
> Even if the rest of his series doesn't make it for some reason, as you
> say the Xilinx removal is uncontroversial so I'll keep that in.
> 
> I forgot about the sound patch, I'll pick that up as well.

I took a look at your
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git repo
and I can't see any branch with my patches.
Also was checking linux-next and my patches are also not there.
That's why I am curious if this will be go v5.8 in MW.

Thanks,
Michal



^ permalink raw reply

* Re: [PATCH v2 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms
From: Christophe Leroy @ 2020-06-03  8:13 UTC (permalink / raw)
  To: Michal Simek, Michael Ellerman, Takashi Iwai
  Cc: Kate Stewart, Mark Rutland, Desnes A. Nunes do Rosario,
	Geert Uytterhoeven, linux-doc, alsa-devel, dri-devel,
	Jaroslav Kysela, Richard Fontana, Paul Mackerras, Miquel Raynal,
	Mauro Carvalho Chehab, Fabio Estevam, Sasha Levin, sfr,
	Jonathan Corbet, maz, Masahiro Yamada, Takashi Iwai, YueHaibing,
	Krzysztof Kozlowski, Allison Randal, linux-arm-kernel, devicetree,
	Andrew Donnellan, Arnd Bergmann, Bartlomiej Zolnierkiewicz,
	Alistair Popple, linuxppc-dev, Nicholas Piggin, Alexios Zavras,
	Mark Brown, git, linux-fbdev, Jonathan Cameron, Thomas Gleixner,
	Andy Shevchenko, Dmitry Vyukov, Christophe Leroy, Wei Hu,
	Greg Kroah-Hartman, Nick Desaulniers, linux-kernel, Rob Herring,
	Enrico Weigelt, David S. Miller, Thiago Jung Bauermann
In-Reply-To: <4b807ebc-8d8f-ad76-f5e2-9ce8410dc70c@xilinx.com>

Hi,

Le 03/06/2020 à 10:10, Michal Simek a écrit :
> Hi Michael,
> 
> On 26. 05. 20 15:44, Michael Ellerman wrote:
>> Michal Simek <monstr@monstr.eu> writes:
>>> Hi Michael,
>>>
>>> On 01. 04. 20 13:30, Michal Simek wrote:
>>>> On 01. 04. 20 12:38, Takashi Iwai wrote:
>>>>> On Wed, 01 Apr 2020 12:35:16 +0200,
>>>>> Michael Ellerman wrote:
>>>>>>
>>>>>> Michal Simek <michal.simek@xilinx.com> writes:
>>>>>>> On 01. 04. 20 4:07, Michael Ellerman wrote:
>>>>>>>> Michal Simek <michal.simek@xilinx.com> writes:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> recently we wanted to update xilinx intc driver and we found that function
>>>>>>>>> which we wanted to remove is still wired by ancient Xilinx PowerPC
>>>>>>>>> platforms. Here is the thread about it.
>>>>>>>>> https://lore.kernel.org/linux-next/48d3232d-0f1d-42ea-3109-f44bbabfa2e8@xilinx.com/
>>>>>>>>>
>>>>>>>>> I have been talking about it internally and there is no interest in these
>>>>>>>>> platforms and it is also orphan for quite a long time. None is really
>>>>>>>>> running/testing these platforms regularly that's why I think it makes sense
>>>>>>>>> to remove them also with drivers which are specific to this platform.
>>>>>>>>>
>>>>>>>>> U-Boot support was removed in 2017 without anybody complain about it
>>>>>>>>> https://github.com/Xilinx/u-boot-xlnx/commit/98f705c9cefdfdba62c069821bbba10273a0a8ed
>>>>>>>>>
>>>>>>>>> Based on current ppc/next.
>>>>>>>>>
>>>>>>>>> If anyone has any objection about it, please let me know.
>>>>>>>>
>>>>>>>> Thanks for taking the time to find all this code and remove it.
>>>>>>>>
>>>>>>>> I'm not going to take this series for v5.7, it was posted too close to
>>>>>>>> the merge window, and doing so wouldn't give people much time to object,
>>>>>>>> especially given people are distracted at the moment.
>>>>>>>>
>>>>>>>> I'm happy to take it for v5.8, assuming there's no major objections.
>>>>>>>
>>>>>>> Sure. Just to let you know Christophe Leroy included this patch in his
>>>>>>> series about ppc405 removal. It should be the same.
>>>>>>>
>>>>>>> If you don't want to take that alsa patch I can send it separately and
>>>>>>> this patch can be taken from his series. I don't really mind but please
>>>>>>> let me know what way you prefer.
>>>>>>
>>>>>> It's better to keep it all together, so I'm happy take the alsa patch as
>>>>>> well, it's already been acked.
>>>
>>> Can you please take this series? I know that there is v5 from Christophe
>>> which has this 1/2 as 1/13. But I need this alsa patch too and I would
>>> like to close this because it is around for almost 2 months and none
>>> raised a concern about removing just these Xilinx platforms.
>>
>> Sorry I meant to reply to your last mail.
>>
>> I have Christophe's series in my testing branch, planning for it to be
>> in v5.8.
>>
>> Even if the rest of his series doesn't make it for some reason, as you
>> say the Xilinx removal is uncontroversial so I'll keep that in.
>>
>> I forgot about the sound patch, I'll pick that up as well.
> 
> I took a look at your
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git repo
> and I can't see any branch with my patches.
> Also was checking linux-next and my patches are also not there.
> That's why I am curious if this will be go v5.8 in MW.

I see them in 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git in 
next-test branch.

Christophe

^ permalink raw reply

* Re: [PATCH v2 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms
From: Michal Simek @ 2020-06-03  8:20 UTC (permalink / raw)
  To: Christophe Leroy, Michal Simek, Michael Ellerman, Takashi Iwai
  Cc: Kate Stewart, Mark Rutland, Desnes A. Nunes do Rosario,
	Geert Uytterhoeven, linux-doc, alsa-devel, dri-devel,
	Jaroslav Kysela, Richard Fontana, Paul Mackerras, Miquel Raynal,
	Mauro Carvalho Chehab, Fabio Estevam, Sasha Levin, sfr,
	Jonathan Corbet, maz, Masahiro Yamada, Takashi Iwai, YueHaibing,
	Krzysztof Kozlowski, Allison Randal, linux-arm-kernel, devicetree,
	Andrew Donnellan, Arnd Bergmann, Bartlomiej Zolnierkiewicz,
	Alistair Popple, linuxppc-dev, Nicholas Piggin, Alexios Zavras,
	Mark Brown, git, linux-fbdev, Jonathan Cameron, Thomas Gleixner,
	Andy Shevchenko, Dmitry Vyukov, Christophe Leroy, Wei Hu,
	Greg Kroah-Hartman, Nick Desaulniers, linux-kernel, Rob Herring,
	Enrico Weigelt, David S. Miller, Thiago Jung Bauermann
In-Reply-To: <aad5c6c5-b84a-7a6d-3f07-f45dd1dd85d1@csgroup.eu>

On 03. 06. 20 10:13, Christophe Leroy wrote:
> Hi,
> 
> Le 03/06/2020 à 10:10, Michal Simek a écrit :
>> Hi Michael,
>>
>> On 26. 05. 20 15:44, Michael Ellerman wrote:
>>> Michal Simek <monstr@monstr.eu> writes:
>>>> Hi Michael,
>>>>
>>>> On 01. 04. 20 13:30, Michal Simek wrote:
>>>>> On 01. 04. 20 12:38, Takashi Iwai wrote:
>>>>>> On Wed, 01 Apr 2020 12:35:16 +0200,
>>>>>> Michael Ellerman wrote:
>>>>>>>
>>>>>>> Michal Simek <michal.simek@xilinx.com> writes:
>>>>>>>> On 01. 04. 20 4:07, Michael Ellerman wrote:
>>>>>>>>> Michal Simek <michal.simek@xilinx.com> writes:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> recently we wanted to update xilinx intc driver and we found
>>>>>>>>>> that function
>>>>>>>>>> which we wanted to remove is still wired by ancient Xilinx
>>>>>>>>>> PowerPC
>>>>>>>>>> platforms. Here is the thread about it.
>>>>>>>>>> https://lore.kernel.org/linux-next/48d3232d-0f1d-42ea-3109-f44bbabfa2e8@xilinx.com/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have been talking about it internally and there is no
>>>>>>>>>> interest in these
>>>>>>>>>> platforms and it is also orphan for quite a long time. None is
>>>>>>>>>> really
>>>>>>>>>> running/testing these platforms regularly that's why I think
>>>>>>>>>> it makes sense
>>>>>>>>>> to remove them also with drivers which are specific to this
>>>>>>>>>> platform.
>>>>>>>>>>
>>>>>>>>>> U-Boot support was removed in 2017 without anybody complain
>>>>>>>>>> about it
>>>>>>>>>> https://github.com/Xilinx/u-boot-xlnx/commit/98f705c9cefdfdba62c069821bbba10273a0a8ed
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Based on current ppc/next.
>>>>>>>>>>
>>>>>>>>>> If anyone has any objection about it, please let me know.
>>>>>>>>>
>>>>>>>>> Thanks for taking the time to find all this code and remove it.
>>>>>>>>>
>>>>>>>>> I'm not going to take this series for v5.7, it was posted too
>>>>>>>>> close to
>>>>>>>>> the merge window, and doing so wouldn't give people much time
>>>>>>>>> to object,
>>>>>>>>> especially given people are distracted at the moment.
>>>>>>>>>
>>>>>>>>> I'm happy to take it for v5.8, assuming there's no major
>>>>>>>>> objections.
>>>>>>>>
>>>>>>>> Sure. Just to let you know Christophe Leroy included this patch
>>>>>>>> in his
>>>>>>>> series about ppc405 removal. It should be the same.
>>>>>>>>
>>>>>>>> If you don't want to take that alsa patch I can send it
>>>>>>>> separately and
>>>>>>>> this patch can be taken from his series. I don't really mind but
>>>>>>>> please
>>>>>>>> let me know what way you prefer.
>>>>>>>
>>>>>>> It's better to keep it all together, so I'm happy take the alsa
>>>>>>> patch as
>>>>>>> well, it's already been acked.
>>>>
>>>> Can you please take this series? I know that there is v5 from
>>>> Christophe
>>>> which has this 1/2 as 1/13. But I need this alsa patch too and I would
>>>> like to close this because it is around for almost 2 months and none
>>>> raised a concern about removing just these Xilinx platforms.
>>>
>>> Sorry I meant to reply to your last mail.
>>>
>>> I have Christophe's series in my testing branch, planning for it to be
>>> in v5.8.
>>>
>>> Even if the rest of his series doesn't make it for some reason, as you
>>> say the Xilinx removal is uncontroversial so I'll keep that in.
>>>
>>> I forgot about the sound patch, I'll pick that up as well.
>>
>> I took a look at your
>> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git repo
>> and I can't see any branch with my patches.
>> Also was checking linux-next and my patches are also not there.
>> That's why I am curious if this will be go v5.8 in MW.
> 
> I see them in
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git in
> next-test branch.

ah. My bad.

Thanks,
Michal


^ permalink raw reply

* Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
From: Jan Kara @ 2020-06-03  8:26 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: Jan Kara, linux-nvdimm, Aneesh Kumar K.V, jack@suse.de,
	Jeff Moyer, Oliver O'Halloran, Michal Suchánek,
	linuxppc-dev
In-Reply-To: <BN6PR11MB41328EB6F894DB391DC09DAEC68B0@BN6PR11MB4132.namprd11.prod.outlook.com>

On Tue 02-06-20 17:59:08, Williams, Dan J wrote:
> [ forgive formatting, a series of unfortunate events has me using Outlook for the moment ]
> 
> > From: Jan Kara <jack@suse.cz>
> > > > > These flags are device properties that affect the kernel and
> > > > > userspace's handling of persistence.
> > > > >
> > > >
> > > > That will not handle the scenario with multiple applications using
> > > > the same fsdax mount point where one is updated to use the new
> > > > instruction and the other is not.
> > >
> > > Right, it needs to be a global setting / flag day to switch from one
> > > regime to another. Per-process control is a recipe for disaster.
> > 
> > First I'd like to mention that hopefully the concern is mostly theoretical since
> > as Aneesh wrote above, real persistent memory never shipped for PPC and
> > so there are very few apps (if any) using the old way to ensure cache
> > flushing.
> > 
> > But I'd like to understand why do you think per-process control is a recipe for
> > disaster? Because from my POV the sysfs interface you propose is actually
> > difficult to use in practice. As a distributor, you have hard time picking the
> > default because you have a choice between picking safe option which is
> > going to confuse users because of failing MAP_SYNC and unsafe option
> > where everyone will be happy until someone looses data because of some
> > ancient application using wrong instructions to persist data. Poor experience
> > for users in either way. And when distro defaults to "safe option", then the
> > burden is on the sysadmin to toggle the switch but how is he supposed to
> > decide when that is safe? First he has to understand what the problem
> > actually is, then he has to audit all the applications using pmem whether they
> > use the new instruction - which is IMO a lot of effort if you have a couple of
> > applications and practically infeasible if you have more of them.
> > So IMO the burden should be *on the application* to declare that it is aware
> > of the new instructions to flush pmem on the platform and only to such
> > application the kernel should give the trust to use MAP_SYNC mappings.
> 
> The "disaster" in my mind is this need to globally change the ABI for
> persistence semantics for all of Linux because one CPU wants a do over.
> What does a generic "MAP_SYNC_ENABLE" knob even mean to the existing
> deployed base of persistent memory applications? Yes, sysfs is awkward,
> but it's trying to provide some relief without imposing unexplainable
> semantics on everyone else. I think a comprehensive (overengineered)
> solution would involve not introducing another "I know what I'm doing"
> flag to the interface, but maybe requiring applications to call a pmem
> sync API in something like a vsyscall. Or, also overengineered, some
> binary translation / interpretation to actively detect and kill
> applications that deploy the old instructions. Something horrid like on
> first write fault to a MAP_SYNC try to look ahead in the binary for the
> correct sync sequence and kill the application otherwise. That would at
> least provide some enforcement and safety without requiring other
> architectures to consider what MAP_SYNC_ENABLE means to them.

Thanks for explanation. So I absolutely agree that other architectures (and
even older versions of POWER architecture) must not be influenced by the new
tunable. That's why I wrote in my reply to Aneesh that I'd be for checking
during mmap(2) with MAP_SYNC, whether we are in a situation where new PPC
flush instructions are required and *only in that case* decide based on the
prctl value whether MAP_SYNC should be allowed or not.

Whether this solution is overengineering or not depends on how you think
it's likely there will be applications trying to use old flush instructions
with MAP_SYNC on POWER10 platforms...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* linux-next: manual merge of the akpm-current tree with the powerpc tree
From: Stephen Rothwell @ 2020-06-03  8:50 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman, PowerPC
  Cc: Linux Next Mailing List, Linux Kernel Mailing List, Mike Rapoport

[-- Attachment #1: Type: text/plain, Size: 1860 bytes --]

Hi all,

Today's linux-next merge of the akpm-current tree got a conflict in:

  arch/powerpc/mm/ptdump/ptdump.c

between commit:

  6b789a26d7da ("powerpc/ptdump: Handle hugepd at PGD level")

from the powerpc tree and patch:

  "powerpc: add support for folded p4d page tables"

from the akpm-current tree.

Thanks to Michael Ellerman for an example resolution.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/mm/ptdump/ptdump.c
index 5fc880e30175,507cb9793b26..000000000000
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@@ -333,13 -304,13 +333,15 @@@ static void walk_pagetables(struct pg_s
  	 * the hash pagetable.
  	 */
  	for (i = pgd_index(addr); i < PTRS_PER_PGD; i++, pgd++, addr += PGDIR_SIZE) {
- 		if (pgd_none(*pgd) || pgd_is_leaf(*pgd))
- 			note_page(st, addr, 1, pgd_val(*pgd), PGDIR_SIZE);
- 		else if (is_hugepd(__hugepd(pgd_val(*pgd))))
- 			walk_hugepd(st, (hugepd_t *)pgd, addr, PGDIR_SHIFT, 1);
+ 		p4d_t *p4d = p4d_offset(pgd, 0);
+ 
 -		if (!p4d_none(*p4d) && !p4d_is_leaf(*p4d))
++		if (p4d_none(*p4d) || p4d_is_leaf(*p4d))
++			note_page(st, addr, 1, p4d_val(*p4d), PGDIR_SIZE);
++		else if (is_hugepd(__hugepd(p4d_val(*p4d))))
++			walk_hugepd(st, (hugepd_t *)p4d, addr, PGDIR_SHIFT, 1);
 +		else
- 			/* pgd exists */
- 			walk_pud(st, pgd, addr);
+ 			/* p4d exists */
+ 			walk_pud(st, p4d, addr);
 -		else
 -			note_page(st, addr, 1, p4d_val(*p4d));
  	}
  }
  

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
From: Aneesh Kumar K.V @ 2020-06-03  9:09 UTC (permalink / raw)
  To: Jan Kara, Williams, Dan J
  Cc: linux-nvdimm, jack@suse.de, Jeff Moyer, Oliver O'Halloran,
	Michal Suchánek, linuxppc-dev
In-Reply-To: <20200603082628.GE19165@quack2.suse.cz>

On 6/3/20 1:56 PM, Jan Kara wrote:
> On Tue 02-06-20 17:59:08, Williams, Dan J wrote:
>> [ forgive formatting, a series of unfortunate events has me using Outlook for the moment ]
>>
>>> From: Jan Kara <jack@suse.cz>
>>>>>> These flags are device properties that affect the kernel and
>>>>>> userspace's handling of persistence.
>>>>>>
>>>>>
>>>>> That will not handle the scenario with multiple applications using
>>>>> the same fsdax mount point where one is updated to use the new
>>>>> instruction and the other is not.
>>>>
>>>> Right, it needs to be a global setting / flag day to switch from one
>>>> regime to another. Per-process control is a recipe for disaster.
>>>
>>> First I'd like to mention that hopefully the concern is mostly theoretical since
>>> as Aneesh wrote above, real persistent memory never shipped for PPC and
>>> so there are very few apps (if any) using the old way to ensure cache
>>> flushing.
>>>
>>> But I'd like to understand why do you think per-process control is a recipe for
>>> disaster? Because from my POV the sysfs interface you propose is actually
>>> difficult to use in practice. As a distributor, you have hard time picking the
>>> default because you have a choice between picking safe option which is
>>> going to confuse users because of failing MAP_SYNC and unsafe option
>>> where everyone will be happy until someone looses data because of some
>>> ancient application using wrong instructions to persist data. Poor experience
>>> for users in either way. And when distro defaults to "safe option", then the
>>> burden is on the sysadmin to toggle the switch but how is he supposed to
>>> decide when that is safe? First he has to understand what the problem
>>> actually is, then he has to audit all the applications using pmem whether they
>>> use the new instruction - which is IMO a lot of effort if you have a couple of
>>> applications and practically infeasible if you have more of them.
>>> So IMO the burden should be *on the application* to declare that it is aware
>>> of the new instructions to flush pmem on the platform and only to such
>>> application the kernel should give the trust to use MAP_SYNC mappings.
>>
>> The "disaster" in my mind is this need to globally change the ABI for
>> persistence semantics for all of Linux because one CPU wants a do over.
>> What does a generic "MAP_SYNC_ENABLE" knob even mean to the existing
>> deployed base of persistent memory applications? Yes, sysfs is awkward,
>> but it's trying to provide some relief without imposing unexplainable
>> semantics on everyone else. I think a comprehensive (overengineered)
>> solution would involve not introducing another "I know what I'm doing"
>> flag to the interface, but maybe requiring applications to call a pmem
>> sync API in something like a vsyscall. Or, also overengineered, some
>> binary translation / interpretation to actively detect and kill
>> applications that deploy the old instructions. Something horrid like on
>> first write fault to a MAP_SYNC try to look ahead in the binary for the
>> correct sync sequence and kill the application otherwise. That would at
>> least provide some enforcement and safety without requiring other
>> architectures to consider what MAP_SYNC_ENABLE means to them.
> 
> Thanks for explanation. So I absolutely agree that other architectures (and
> even older versions of POWER architecture) must not be influenced by the new
> tunable. That's why I wrote in my reply to Aneesh that I'd be for checking
> during mmap(2) with MAP_SYNC, whether we are in a situation where new PPC
> flush instructions are required and *only in that case* decide based on the
> prctl value whether MAP_SYNC should be allowed or not.
> 

v2 version of the patch series does that

https://lore.kernel.org/linuxppc-dev/20200602074909.36738-1-aneesh.kumar@linux.ibm.com/

> Whether this solution is overengineering or not depends on how you think
> it's likely there will be applications trying to use old flush instructions
> with MAP_SYNC on POWER10 platforms...
> 

Now considering that with ppc64 we never had a real persistent memory 
device available for the end-user to try and the new instructions are 
only needed on newer hardware, can we assume we have enough time to get 
the userspace to use new instructions?

As a safety net, we can keep the dax device-specific sysfs control. But 
in reality, by the time newer hardware gets released, we can get the 
distributions updated to flip the CONFIG_ARCH_MAP_SYNC_DISABLE=n?

With this:
1) vPMEM continues to work and since it is a volatile region. That 
doesn't need any flush instructions.

2) We get pmdk and other user applications updated to use new 
instructions and make sure updated packages are made available to all 
distributions

3) On newer hardware, the device will appear with a new compat string. 
Hence older distributions won't initialize pmem on newer hardware.

4) If we have a newer kernel with an older distro, we use the per 
namespace sysfs knob that prevents the usage of MAP_SYNC.

5) After a year or so we mark the CONFIG_ARCH_MAP_SYNC_DISABLE=n
on ppc64 when we are confident that everybody is using the new flush 
instruction.

-aneesh


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox