LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH -next] PCI: rpadlpar: Make some functions static
From: Wei Yongjun @ 2020-07-21 15:17 UTC (permalink / raw)
  To: Hulk Robot, Michael Ellerman, Tyrel Datwyler, Bjorn Helgaas
  Cc: linux-pci, Wei Yongjun, linuxppc-dev

The sparse tool report build warnings as follows:

drivers/pci/hotplug/rpadlpar_core.c:355:5: warning:
 symbol 'dlpar_remove_pci_slot' was not declared. Should it be static?
drivers/pci/hotplug/rpadlpar_core.c:461:12: warning:
 symbol 'rpadlpar_io_init' was not declared. Should it be static?
drivers/pci/hotplug/rpadlpar_core.c:473:6: warning:
 symbol 'rpadlpar_io_exit' was not declared. Should it be static?

Those functions are not used outside of this file, so marks them
static.
Also mark rpadlpar_io_exit() as __exit.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
---
 drivers/pci/hotplug/rpadlpar_core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c
index c5eb509c72f0..f979b7098acf 100644
--- a/drivers/pci/hotplug/rpadlpar_core.c
+++ b/drivers/pci/hotplug/rpadlpar_core.c
@@ -352,7 +352,7 @@ static int dlpar_remove_vio_slot(char *drc_name, struct device_node *dn)
  * -ENODEV		Not a valid drc_name
  * -EIO			Internal PCI Error
  */
-int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
+static int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
 {
 	struct pci_bus *bus;
 	struct slot *slot;
@@ -458,7 +458,7 @@ static inline int is_dlpar_capable(void)
 	return (int) (rc != RTAS_UNKNOWN_SERVICE);
 }
 
-int __init rpadlpar_io_init(void)
+static int __init rpadlpar_io_init(void)
 {
 
 	if (!is_dlpar_capable()) {
@@ -470,7 +470,7 @@ int __init rpadlpar_io_init(void)
 	return dlpar_sysfs_init();
 }
 
-void rpadlpar_io_exit(void)
+static void __exit rpadlpar_io_exit(void)
 {
 	dlpar_sysfs_exit();
 }


^ permalink raw reply related

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-21 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, x86, linux-kernel,
	Nicholas Piggin, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <20200721150656.GN119549@hirez.programming.kicks-ass.net>

----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> 
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
> 
> So I've been trying to figure out where that PF_KTHREAD comes from,
> commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> 
> So the first version:
> 
>  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
> 
> appears to unconditionally send the IPI and checks p->mm in the IPI
> context, but then v2:
> 
>  https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
> 
> has the current code. But I've been unable to find the reason the
> 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

Looking back at my inbox, it seems like you are the one who proposed to
skip all kthreads: 

https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net

> 
> The comment doesn't really help either; sure we have the whole lazy mm
> thing, but that's ->active_mm, not ->mm.
> 
> Possibly it is because {,un}use_mm() do not have sufficient barriers to
> make the remote p->mm test work? Or were we over-eager with the !p->mm
> doesn't imply kthread 'cleanups' at the time?

The nice thing about adding back kthreads to the threads considered for membarrier
IPI is that it has no observable effect on the user-space ABI. No pre-existing kthread
rely on this, and we just provide an additional guarantee for future kthread
implementations.

> Also, I just realized, I still have a fix for use_mm() now
> kthread_use_mm() that seems to have been lost.

I suspect we need to at least document the memory barriers in kthread_use_mm and
kthread_unuse_mm to state that they are required by membarrier if we want to
ipi kthreads as well.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Peter Zijlstra @ 2020-07-21 15:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, x86, linux-kernel,
	Nicholas Piggin, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <616209816.22376.1595344513051.JavaMail.zimbra@efficios.com>

On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:
> 
> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> > 
> >> That being said, the x86 sync core gap that I imagined could be fixed
> >> by changing to rq->curr == rq->idle test does not actually exist because
> >> the global membarrier does not have a sync core option. So fixing the
> >> exit_lazy_tlb points that this series does *should* fix that. So
> >> PF_KTHREAD may be less problematic than I thought from implementation
> >> point of view, only semantics.
> > 
> > So I've been trying to figure out where that PF_KTHREAD comes from,
> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> > 
> > So the first version:
> > 
> >  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
> > 
> > appears to unconditionally send the IPI and checks p->mm in the IPI
> > context, but then v2:
> > 
> >  https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
> > 
> > has the current code. But I've been unable to find the reason the
> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
> 
> Looking back at my inbox, it seems like you are the one who proposed to
> skip all kthreads: 
> 
> https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net

I had a feeling it might've been me ;-) I just couldn't find the email.

> > The comment doesn't really help either; sure we have the whole lazy mm
> > thing, but that's ->active_mm, not ->mm.
> > 
> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
> > make the remote p->mm test work? Or were we over-eager with the !p->mm
> > doesn't imply kthread 'cleanups' at the time?
> 
> The nice thing about adding back kthreads to the threads considered for membarrier
> IPI is that it has no observable effect on the user-space ABI. No pre-existing kthread
> rely on this, and we just provide an additional guarantee for future kthread
> implementations.
> 
> > Also, I just realized, I still have a fix for use_mm() now
> > kthread_use_mm() that seems to have been lost.
> 
> I suspect we need to at least document the memory barriers in kthread_use_mm and
> kthread_unuse_mm to state that they are required by membarrier if we want to
> ipi kthreads as well.

Right, so going by that email you found it was mostly a case of being
lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
any other bits that might be needed, covering kthreads should be
possible.

No objections from me for making it so.

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-21 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, x86, linux-kernel,
	Nicholas Piggin, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <20200721151947.GD10769@hirez.programming.kicks-ass.net>

----- On Jul 21, 2020, at 11:19 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:
>> 
>> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
>> > 
>> >> That being said, the x86 sync core gap that I imagined could be fixed
>> >> by changing to rq->curr == rq->idle test does not actually exist because
>> >> the global membarrier does not have a sync core option. So fixing the
>> >> exit_lazy_tlb points that this series does *should* fix that. So
>> >> PF_KTHREAD may be less problematic than I thought from implementation
>> >> point of view, only semantics.
>> > 
>> > So I've been trying to figure out where that PF_KTHREAD comes from,
>> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
>> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
>> > 
>> > So the first version:
>> > 
>> >  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
>> > 
>> > appears to unconditionally send the IPI and checks p->mm in the IPI
>> > context, but then v2:
>> > 
>> >  https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
>> > 
>> > has the current code. But I've been unable to find the reason the
>> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
>> 
>> Looking back at my inbox, it seems like you are the one who proposed to
>> skip all kthreads:
>> 
>> https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net
> 
> I had a feeling it might've been me ;-) I just couldn't find the email.
> 
>> > The comment doesn't really help either; sure we have the whole lazy mm
>> > thing, but that's ->active_mm, not ->mm.
>> > 
>> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
>> > make the remote p->mm test work? Or were we over-eager with the !p->mm
>> > doesn't imply kthread 'cleanups' at the time?
>> 
>> The nice thing about adding back kthreads to the threads considered for
>> membarrier
>> IPI is that it has no observable effect on the user-space ABI. No pre-existing
>> kthread
>> rely on this, and we just provide an additional guarantee for future kthread
>> implementations.
>> 
>> > Also, I just realized, I still have a fix for use_mm() now
>> > kthread_use_mm() that seems to have been lost.
>> 
>> I suspect we need to at least document the memory barriers in kthread_use_mm and
>> kthread_unuse_mm to state that they are required by membarrier if we want to
>> ipi kthreads as well.
> 
> Right, so going by that email you found it was mostly a case of being
> lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
> any other bits that might be needed, covering kthreads should be
> possible.
> 
> No objections from me for making it so.

I'm OK on making membarrier cover kthreads using mm as well, provided we
audit kthread_{,un}use_mm() to make sure the proper barriers are in place
after setting task->mm and before clearing it.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* [PATCH v4 0/3] powernv/idle: Power9 idle cleanup
From: Pratik Rajesh Sampat @ 2020-07-21 15:37 UTC (permalink / raw)
  To: mpe, npiggin, benh, paulus, mikey, ego, svaidy, psampat,
	pratik.r.sampat, linuxppc-dev, linux-kernel

v3: https://lkml.org/lkml/2020/7/17/1093
Changelog v3-->v4:
Based on comments from Nicholas Piggin and Gautham Shenoy,
1. Changed the naming of pnv_first_spr_loss_level from
pnv_first_fullstate_loss_level to deep_spr_loss_state
2. Make the P9 PVR check only on the top level function
pnv_probe_idle_states and let the rest of the checks be DT based because
it is faster to do so

Pratik Rajesh Sampat (3):
  powerpc/powernv/idle: Replace CPU features check with PVR check
  powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
  powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above

 arch/powerpc/platforms/powernv/idle.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

-- 
2.25.4


^ permalink raw reply

* [PATCH v4 1/3] powerpc/powernv/idle: Replace CPU features check with PVR check
From: Pratik Rajesh Sampat @ 2020-07-21 15:37 UTC (permalink / raw)
  To: mpe, npiggin, benh, paulus, mikey, ego, svaidy, psampat,
	pratik.r.sampat, linuxppc-dev, linux-kernel
In-Reply-To: <20200721153708.89057-1-psampat@linux.ibm.com>

As the idle framework's architecture is incomplete, hence instead of
checking for just the processor type advertised in the device tree CPU
features; check for the Processor Version Register (PVR) so that finer
granularity can be leveraged while making processor checks.

Hence, making the PVR check on the outer level function, subsequently in
the hierarchy keeping the CPU_FTR_ARCH_300 check intact as it is a
faster check to do because of static branches

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/idle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 2dd467383a88..642abf0b8329 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -1205,7 +1205,7 @@ static void __init pnv_probe_idle_states(void)
 		return;
 	}
 
-	if (cpu_has_feature(CPU_FTR_ARCH_300))
+	if (pvr_version_is(PVR_POWER9))
 		pnv_power9_idle_init();
 
 	for (i = 0; i < nr_pnv_idle_states; i++)
-- 
2.25.4


^ permalink raw reply related

* [PATCH v4 3/3] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above
From: Pratik Rajesh Sampat @ 2020-07-21 15:37 UTC (permalink / raw)
  To: mpe, npiggin, benh, paulus, mikey, ego, svaidy, psampat,
	pratik.r.sampat, linuxppc-dev, linux-kernel
In-Reply-To: <20200721153708.89057-1-psampat@linux.ibm.com>

POWER9 onwards the support for the registers HID1, HID4, HID5 has been
receded.
Although mfspr on the above registers worked in Power9, In Power10
simulator is unrecognized. Moving their assignment under the
check for machines lower than Power9

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/platforms/powernv/idle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 28462d59a8e1..92098d6106be 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
 	 */
 	uint64_t lpcr_val	= mfspr(SPRN_LPCR);
 	uint64_t hid0_val	= mfspr(SPRN_HID0);
-	uint64_t hid1_val	= mfspr(SPRN_HID1);
-	uint64_t hid4_val	= mfspr(SPRN_HID4);
-	uint64_t hid5_val	= mfspr(SPRN_HID5);
 	uint64_t hmeer_val	= mfspr(SPRN_HMEER);
 	uint64_t msr_val = MSR_IDLE;
 	uint64_t psscr_val = pnv_deepest_stop_psscr_val;
@@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
 
 			/* Only p8 needs to set extra HID regiters */
 			if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+				uint64_t hid1_val = mfspr(SPRN_HID1);
+				uint64_t hid4_val = mfspr(SPRN_HID4);
+				uint64_t hid5_val = mfspr(SPRN_HID5);
 
 				rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
 				if (rc != 0)
-- 
2.25.4


^ permalink raw reply related

* [PATCH v4 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
From: Pratik Rajesh Sampat @ 2020-07-21 15:37 UTC (permalink / raw)
  To: mpe, npiggin, benh, paulus, mikey, ego, svaidy, psampat,
	pratik.r.sampat, linuxppc-dev, linux-kernel
In-Reply-To: <20200721153708.89057-1-psampat@linux.ibm.com>

Replace the variable name from using "pnv_first_spr_loss_level" to
"deep_spr_loss_state".

pnv_first_spr_loss_level is supposed to be the earliest state that
has OPAL_PM_LOSE_FULL_CONTEXT set, in other places the kernel uses the
"deep" states as terminology. Hence renaming the variable to be coherent
to its semantics.

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/idle.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 642abf0b8329..28462d59a8e1 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -48,7 +48,7 @@ static bool default_stop_found;
  * First stop state levels when SPR and TB loss can occur.
  */
 static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+static u64 deep_spr_loss_state = MAX_STOP_STATE + 1;
 
 /*
  * psscr value and mask of the deepest stop idle state.
@@ -657,7 +657,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
 		  */
 		mmcr0		= mfspr(SPRN_MMCR0);
 	}
-	if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) {
+	if ((psscr & PSSCR_RL_MASK) >= deep_spr_loss_state) {
 		sprs.lpcr	= mfspr(SPRN_LPCR);
 		sprs.hfscr	= mfspr(SPRN_HFSCR);
 		sprs.fscr	= mfspr(SPRN_FSCR);
@@ -741,7 +741,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
 	 * just always test PSSCR for SPR/TB state loss.
 	 */
 	pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT;
-	if (likely(pls < pnv_first_spr_loss_level)) {
+	if (likely(pls < deep_spr_loss_state)) {
 		if (sprs_saved)
 			atomic_stop_thread_idle();
 		goto out;
@@ -1088,7 +1088,7 @@ static void __init pnv_power9_idle_init(void)
 	 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
 	 */
 	pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-	pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+	deep_spr_loss_state = MAX_STOP_STATE + 1;
 	for (i = 0; i < nr_pnv_idle_states; i++) {
 		int err;
 		struct pnv_idle_states_t *state = &pnv_idle_states[i];
@@ -1099,8 +1099,8 @@ static void __init pnv_power9_idle_init(void)
 			pnv_first_tb_loss_level = psscr_rl;
 
 		if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
-		     (pnv_first_spr_loss_level > psscr_rl))
-			pnv_first_spr_loss_level = psscr_rl;
+		     (deep_spr_loss_state > psscr_rl))
+			deep_spr_loss_state = psscr_rl;
 
 		/*
 		 * The idle code does not deal with TB loss occurring
@@ -1111,8 +1111,8 @@ static void __init pnv_power9_idle_init(void)
 		 * compatibility.
 		 */
 		if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
-		     (pnv_first_spr_loss_level > psscr_rl))
-			pnv_first_spr_loss_level = psscr_rl;
+		     (deep_spr_loss_state > psscr_rl))
+			deep_spr_loss_state = psscr_rl;
 
 		err = validate_psscr_val_mask(&state->psscr_val,
 					      &state->psscr_mask,
@@ -1158,7 +1158,7 @@ static void __init pnv_power9_idle_init(void)
 	}
 
 	pr_info("cpuidle-powernv: First stop level that may lose SPRs = 0x%llx\n",
-		pnv_first_spr_loss_level);
+		deep_spr_loss_state);
 
 	pr_info("cpuidle-powernv: First stop level that may lose timebase = 0x%llx\n",
 		pnv_first_tb_loss_level);
-- 
2.25.4


^ permalink raw reply related

* Re: [PATCH v6] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Mimi Zohar @ 2020-07-21 17:26 UTC (permalink / raw)
  To: Bruno Meneguele
  Cc: linux-s390, Nayna, erichte, nayna, x86, linux-kernel, stable,
	linux-integrity, linuxppc-dev
In-Reply-To: <20200720153841.GG10323@glitch>

On Mon, 2020-07-20 at 12:38 -0300, Bruno Meneguele wrote:
> On Mon, Jul 20, 2020 at 10:56:55AM -0400, Mimi Zohar wrote:
> > On Mon, 2020-07-20 at 10:40 -0400, Nayna wrote:
> > > On 7/13/20 12:48 PM, Bruno Meneguele wrote:
> > > > The IMA_APPRAISE_BOOTPARAM config allows enabling different "ima_appraise="
> > > > modes - log, fix, enforce - at run time, but not when IMA architecture
> > > > specific policies are enabled.  This prevents properly labeling the
> > > > filesystem on systems where secure boot is supported, but not enabled on the
> > > > platform.  Only when secure boot is actually enabled should these IMA
> > > > appraise modes be disabled.
> > > >
> > > > This patch removes the compile time dependency and makes it a runtime
> > > > decision, based on the secure boot state of that platform.
> > > >
> > > > Test results as follows:
> > > >
> > > > -> x86-64 with secure boot enabled
> > > >
> > > > [    0.015637] Kernel command line: <...> ima_policy=appraise_tcb ima_appraise=fix
> > > > [    0.015668] ima: Secure boot enabled: ignoring ima_appraise=fix boot parameter option
> > > >
> > 
> > Is it common to have two colons in the same line?  Is the colon being
> > used as a delimiter when parsing the kernel logs?  Should the second
> > colon be replaced with a hyphen?  (No need to repost.  I'll fix it
> > up.)
> >  
> 
> AFAICS it has been used without any limitations, e.g:
> 
> PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
> clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
> microcode: CPU0: patch_level=0x08701013
> Lockdown: modprobe: unsigned module loading is restricted; see man kernel_lockdown.7
> ...
> 
> I'd say we're fine using it.

Ok.  FYI, it's now in next-integrity.

Mimi

^ permalink raw reply

* [powerpc:next] BUILD SUCCESS 9a77c4a0a12597c661be374b8d566516c0341570
From: kernel test robot @ 2020-07-21 18:37 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  next
branch HEAD: 9a77c4a0a12597c661be374b8d566516c0341570  powerpc/prom: Enable Radix GTSE in cpu pa-features

elapsed time: 1700m

configs tested: 117
configs skipped: 4

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm                              allyesconfig
arm                              allmodconfig
arm                               allnoconfig
arm64                            allyesconfig
arm64                               defconfig
arm64                            allmodconfig
arm64                             allnoconfig
c6x                                 defconfig
c6x                        evmc6474_defconfig
sh                               alldefconfig
mips                         bigsur_defconfig
arm                           sama5_defconfig
arm                           omap1_defconfig
mips                            e55_defconfig
arm                         shannon_defconfig
sparc64                           allnoconfig
arm                      footbridge_defconfig
s390                             alldefconfig
arm                            zeus_defconfig
sh                ecovec24-romimage_defconfig
arm                       netwinder_defconfig
mips                        nlm_xlr_defconfig
powerpc                        cell_defconfig
s390                          debug_defconfig
arm                          pxa3xx_defconfig
m68k                        m5407c3_defconfig
sh                          sdk7780_defconfig
c6x                         dsk6455_defconfig
arm                           h5000_defconfig
arm                         s3c6400_defconfig
mips                       rbtx49xx_defconfig
arm                            u300_defconfig
ia64                             alldefconfig
i386                              allnoconfig
i386                             allyesconfig
i386                                defconfig
i386                              debian-10.3
ia64                             allmodconfig
ia64                                defconfig
ia64                              allnoconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                              allnoconfig
m68k                           sun3_defconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
nios2                            allyesconfig
openrisc                            defconfig
c6x                              allyesconfig
c6x                               allnoconfig
openrisc                         allyesconfig
nds32                               defconfig
nds32                             allnoconfig
csky                             allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
h8300                            allmodconfig
xtensa                              defconfig
arc                                 defconfig
arc                              allyesconfig
sh                               allmodconfig
sh                                allnoconfig
microblaze                        allnoconfig
mips                             allyesconfig
mips                              allnoconfig
mips                             allmodconfig
parisc                            allnoconfig
parisc                              defconfig
parisc                           allyesconfig
parisc                           allmodconfig
powerpc                          allyesconfig
powerpc                          rhel-kconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
powerpc                             defconfig
i386                 randconfig-a001-20200719
i386                 randconfig-a006-20200719
i386                 randconfig-a002-20200719
i386                 randconfig-a005-20200719
i386                 randconfig-a003-20200719
i386                 randconfig-a004-20200719
i386                 randconfig-a015-20200719
i386                 randconfig-a011-20200719
i386                 randconfig-a016-20200719
i386                 randconfig-a012-20200719
i386                 randconfig-a013-20200719
i386                 randconfig-a014-20200719
x86_64               randconfig-a005-20200719
x86_64               randconfig-a002-20200719
x86_64               randconfig-a006-20200719
x86_64               randconfig-a001-20200719
x86_64               randconfig-a003-20200719
x86_64               randconfig-a004-20200719
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
s390                             allyesconfig
s390                              allnoconfig
s390                             allmodconfig
s390                                defconfig
sparc                            allyesconfig
sparc                               defconfig
sparc64                             defconfig
sparc64                          allyesconfig
sparc64                          allmodconfig
x86_64                    rhel-7.6-kselftests
x86_64                               rhel-8.3
x86_64                                  kexec
x86_64                                   rhel
x86_64                                    lkp
x86_64                              fedora-25

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Alex Ghiti @ 2020-07-21 18:36 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <d7e3cbb7-c12a-bce2-f1db-c336d15f74bd@ghiti.fr>

Let's try to make progress here: I add linux-mm in CC to get feedback on 
this patch as it blocks sv48 support too.

Alex

Le 7/9/20 à 7:11 AM, Alex Ghiti a écrit :
> Hi Palmer,
> 
> Le 7/9/20 à 1:05 AM, Palmer Dabbelt a écrit :
>> On Sun, 07 Jun 2020 00:59:46 PDT (-0700), alex@ghiti.fr wrote:
>>> This is a preparatory patch for relocatable kernel.
>>>
>>> The kernel used to be linked at PAGE_OFFSET address and used to be 
>>> loaded
>>> physically at the beginning of the main memory. Therefore, we could use
>>> the linear mapping for the kernel mapping.
>>>
>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>> and since in the linear mapping, two different virtual addresses cannot
>>> point to the same physical address, the kernel mapping needs to lie 
>>> outside
>>> the linear mapping.
>>
>> I know it's been a while, but I keep opening this up to review it and 
>> just
>> can't get over how ugly it is to put the kernel's linear map in the 
>> vmalloc
>> region.
>>
>> I guess I don't understand why this is necessary at all.  
>> Specifically: why
>> can't we just relocate the kernel within the linear map?  That would 
>> let the
>> bootloader put the kernel wherever it wants, modulo the physical 
>> memory size we
>> support.  We'd need to handle the regions that are coupled to the 
>> kernel's
>> execution address, but we could just put them in an explicit memory 
>> region
>> which is what we should probably be doing anyway.
> 
> Virtual relocation in the linear mapping requires to move the kernel 
> physically too. Zong implemented this physical move in its KASLR RFC 
> patchset, which is cumbersome since finding an available physical spot 
> is harder than just selecting a virtual range in the vmalloc range.
> 
> In addition, having the kernel mapping in the linear mapping prevents 
> the use of hugepage for the linear mapping resulting in performance loss 
> (at least for the GB that encompasses the kernel).
> 
> Why do you find this "ugly" ? The vmalloc region is just a bunch of 
> available virtual addresses to whatever purpose we want, and as noted by 
> Zong, arm64 uses the same scheme.
> 
>>
>>> In addition, because modules and BPF must be close to the kernel (inside
>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>> placed at the beginning of the vmalloc zone since other vmalloc
>>> allocations from the kernel could get all the +-2GB window around the
>>> kernel which would prevent new modules and BPF programs to be loaded.
>>
>> Well, that's not enough to make sure this doesn't happen -- it's just 
>> enough to
>> make sure it doesn't happen very quickily.  That's the same boat we're 
>> already
>> in, though, so it's not like it's worse.
> 
> Indeed, that's not worse, I haven't found a way to reserve vmalloc area 
> without actually allocating it.
> 
>>
>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>> Reviewed-by: Zong Li <zong.li@sifive.com>
>>> ---
>>>  arch/riscv/boot/loader.lds.S     |  3 +-
>>>  arch/riscv/include/asm/page.h    | 10 +++++-
>>>  arch/riscv/include/asm/pgtable.h | 38 ++++++++++++++-------
>>>  arch/riscv/kernel/head.S         |  3 +-
>>>  arch/riscv/kernel/module.c       |  4 +--
>>>  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>  arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>>  arch/riscv/mm/physaddr.c         |  2 +-
>>>  8 files changed, 88 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>> index 47a5003c2e28..62d94696a19c 100644
>>> --- a/arch/riscv/boot/loader.lds.S
>>> +++ b/arch/riscv/boot/loader.lds.S
>>> @@ -1,13 +1,14 @@
>>>  /* SPDX-License-Identifier: GPL-2.0 */
>>>
>>>  #include <asm/page.h>
>>> +#include <asm/pgtable.h>
>>>
>>>  OUTPUT_ARCH(riscv)
>>>  ENTRY(_start)
>>>
>>>  SECTIONS
>>>  {
>>> -    . = PAGE_OFFSET;
>>> +    . = KERNEL_LINK_ADDR;
>>>
>>>      .payload : {
>>>          *(.payload)
>>> diff --git a/arch/riscv/include/asm/page.h 
>>> b/arch/riscv/include/asm/page.h
>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>> --- a/arch/riscv/include/asm/page.h
>>> +++ b/arch/riscv/include/asm/page.h
>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>
>>>  #ifdef CONFIG_MMU
>>>  extern unsigned long va_pa_offset;
>>> +extern unsigned long va_kernel_pa_offset;
>>>  extern unsigned long pfn_base;
>>>  #define ARCH_PFN_OFFSET        (pfn_base)
>>>  #else
>>>  #define va_pa_offset        0
>>> +#define va_kernel_pa_offset    0
>>>  #define ARCH_PFN_OFFSET        (PAGE_OFFSET >> PAGE_SHIFT)
>>>  #endif /* CONFIG_MMU */
>>>
>>>  extern unsigned long max_low_pfn;
>>>  extern unsigned long min_low_pfn;
>>> +extern unsigned long kernel_virt_addr;
>>>
>>>  #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) + 
>>> va_pa_offset))
>>> -#define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
>>> +#define linear_mapping_va_to_pa(x)    ((unsigned long)(x) - 
>>> va_pa_offset)
>>> +#define kernel_mapping_va_to_pa(x)    \
>>> +    ((unsigned long)(x) - va_kernel_pa_offset)
>>> +#define __va_to_pa_nodebug(x)        \
>>> +    (((x) >= PAGE_OFFSET) ?        \
>>> +        linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>
>>>  #ifdef CONFIG_DEBUG_VIRTUAL
>>>  extern phys_addr_t __virt_to_phys(unsigned long x);
>>> diff --git a/arch/riscv/include/asm/pgtable.h 
>>> b/arch/riscv/include/asm/pgtable.h
>>> index 35b60035b6b0..94ef3b49dfb6 100644
>>> --- a/arch/riscv/include/asm/pgtable.h
>>> +++ b/arch/riscv/include/asm/pgtable.h
>>> @@ -11,23 +11,29 @@
>>>
>>>  #include <asm/pgtable-bits.h>
>>>
>>> -#ifndef __ASSEMBLY__
>>> -
>>> -/* Page Upper Directory not used in RISC-V */
>>> -#include <asm-generic/pgtable-nopud.h>
>>> -#include <asm/page.h>
>>> -#include <asm/tlbflush.h>
>>> -#include <linux/mm_types.h>
>>> -
>>> -#ifdef CONFIG_MMU
>>> +#ifndef CONFIG_MMU
>>> +#define KERNEL_VIRT_ADDR    PAGE_OFFSET
>>> +#define KERNEL_LINK_ADDR    PAGE_OFFSET
>>> +#else
>>> +/*
>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range 
>>> around
>>> + * the kernel.
>>> + */
>>> +#define KERNEL_VIRT_ADDR    (VMALLOC_END - SZ_2G + 1)
>>> +#define KERNEL_LINK_ADDR    KERNEL_VIRT_ADDR
>>
>> At a bare minimum this is going to make a mess of the 32-bit port, as
>> non-relocatable kernels are now going to get linked at 1GiB which is 
>> where user
>> code is supposed to live.  That's an easy fix, though, as the 32-bit 
>> stuff
>> doesn't need any module address restrictions.
> 
> Indeed, I will take a look at that.
> 
>>
>>>  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>  #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>
>>>  #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>> -#define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>> -#define BPF_JIT_REGION_END    (VMALLOC_END)
>>> +#define BPF_JIT_REGION_START    PFN_ALIGN((unsigned long)&_end)
>>> +#define BPF_JIT_REGION_END    (BPF_JIT_REGION_START + 
>>> BPF_JIT_REGION_SIZE)
>>> +
>>> +#ifdef CONFIG_64BIT
>>> +#define VMALLOC_MODULE_START    BPF_JIT_REGION_END
>>> +#define VMALLOC_MODULE_END    (((unsigned long)&_start & PAGE_MASK) 
>>> + SZ_2G)
>>> +#endif
>>>
>>>  /*
>>>   * Roughly size the vmemmap space to be large enough to fit enough
>>> @@ -57,9 +63,16 @@
>>>  #define FIXADDR_SIZE     PGDIR_SIZE
>>>  #endif
>>>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>> -
>>>  #endif
>>>
>>> +#ifndef __ASSEMBLY__
>>> +
>>> +/* Page Upper Directory not used in RISC-V */
>>> +#include <asm-generic/pgtable-nopud.h>
>>> +#include <asm/page.h>
>>> +#include <asm/tlbflush.h>
>>> +#include <linux/mm_types.h>
>>> +
>>>  #ifdef CONFIG_64BIT
>>>  #include <asm/pgtable-64.h>
>>>  #else
>>> @@ -483,6 +496,7 @@ static inline void __kernel_map_pages(struct page 
>>> *page, int numpages, int enabl
>>>
>>>  #define kern_addr_valid(addr)   (1) /* FIXME */
>>>
>>> +extern char _start[];
>>>  extern void *dtb_early_va;
>>>  void setup_bootmem(void);
>>>  void paging_init(void);
>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>> index 98a406474e7d..8f5bb7731327 100644
>>> --- a/arch/riscv/kernel/head.S
>>> +++ b/arch/riscv/kernel/head.S
>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>  #ifdef CONFIG_MMU
>>>  relocate:
>>>      /* Relocate return address */
>>> -    li a1, PAGE_OFFSET
>>> +    la a1, kernel_virt_addr
>>> +    REG_L a1, 0(a1)
>>>      la a2, _start
>>>      sub a1, a1, a2
>>>      add ra, ra, a1
>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>> --- a/arch/riscv/kernel/module.c
>>> +++ b/arch/riscv/kernel/module.c
>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const 
>>> char *strtab,
>>>  }
>>>
>>>  #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>> -#define VMALLOC_MODULE_START \
>>> -     max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>  void *module_alloc(unsigned long size)
>>>  {
>>>      return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>> -                    VMALLOC_END, GFP_KERNEL,
>>> +                    VMALLOC_MODULE_END, GFP_KERNEL,
>>>                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>>                      __builtin_return_address(0));
>>>  }
>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S 
>>> b/arch/riscv/kernel/vmlinux.lds.S
>>> index 0339b6bbe11a..a9abde62909f 100644
>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>> @@ -4,7 +4,8 @@
>>>   * Copyright (C) 2017 SiFive
>>>   */
>>>
>>> -#define LOAD_OFFSET PAGE_OFFSET
>>> +#include <asm/pgtable.h>
>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>  #include <asm/vmlinux.lds.h>
>>>  #include <asm/page.h>
>>>  #include <asm/cache.h>
>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>> index 736de6c8739f..71da78914645 100644
>>> --- a/arch/riscv/mm/init.c
>>> +++ b/arch/riscv/mm/init.c
>>> @@ -22,6 +22,9 @@
>>>
>>>  #include "../kernel/head.h"
>>>
>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>> +
>>>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>                              __page_aligned_bss;
>>>  EXPORT_SYMBOL(empty_zero_page);
>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>  }
>>>
>>>  #ifdef CONFIG_MMU
>>> +/* Offset between linear mapping virtual address and kernel load 
>>> address */
>>>  unsigned long va_pa_offset;
>>>  EXPORT_SYMBOL(va_pa_offset);
>>> +/* Offset between kernel mapping virtual address and kernel load 
>>> address */
>>> +unsigned long va_kernel_pa_offset;
>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>  unsigned long pfn_base;
>>>  EXPORT_SYMBOL(pfn_base);
>>>
>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>      if (mmu_enabled)
>>>          return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>
>>> -    pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>> +    pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>      BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>      return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>  }
>>> @@ -372,14 +379,30 @@ static uintptr_t __init 
>>> best_map_size(phys_addr_t base, phys_addr_t size)
>>>  #error "setup_vm() is called from head.S before relocate so it 
>>> should not use absolute addressing."
>>>  #endif
>>>
>>> +static uintptr_t load_pa, load_sz;
>>> +
>>> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t 
>>> map_size)
>>> +{
>>> +    uintptr_t va, end_va;
>>> +
>>> +    end_va = kernel_virt_addr + load_sz;
>>> +    for (va = kernel_virt_addr; va < end_va; va += map_size)
>>> +        create_pgd_mapping(pgdir, va,
>>> +                   load_pa + (va - kernel_virt_addr),
>>> +                   map_size, PAGE_KERNEL_EXEC);
>>> +}
>>> +
>>>  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>  {
>>>      uintptr_t va, end_va;
>>> -    uintptr_t load_pa = (uintptr_t)(&_start);
>>> -    uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>      uintptr_t map_size = best_map_size(load_pa, 
>>> MAX_EARLY_MAPPING_SIZE);
>>>
>>> +    load_pa = (uintptr_t)(&_start);
>>> +    load_sz = (uintptr_t)(&_end) - load_pa;
>>> +
>>>      va_pa_offset = PAGE_OFFSET - load_pa;
>>> +    va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>> +
>>>      pfn_base = PFN_DOWN(load_pa);
>>>
>>>      /*
>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>      create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>                 (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>      /* Setup trampoline PGD and PMD */
>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>                 (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>> -    create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>> +    create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>                 load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>  #else
>>>      /* Setup trampoline PGD */
>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>                 load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>  #endif
>>>
>>>      /*
>>> -     * Setup early PGD covering entire kernel which will allows
>>> +     * Setup early PGD covering entire kernel which will allow
>>>       * us to reach paging_init(). We map all memory banks later
>>>       * in setup_vm_final() below.
>>>       */
>>> -    end_va = PAGE_OFFSET + load_sz;
>>> -    for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>> -        create_pgd_mapping(early_pg_dir, va,
>>> -                   load_pa + (va - PAGE_OFFSET),
>>> -                   map_size, PAGE_KERNEL_EXEC);
>>> +    create_kernel_page_table(early_pg_dir, map_size);
>>>
>>>      /* Create fixed mapping for early FDT parsing */
>>>      end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>      uintptr_t va, map_size;
>>>      phys_addr_t pa, start, end;
>>>      struct memblock_region *reg;
>>> +    static struct vm_struct vm_kernel = { 0 };
>>>
>>>      /* Set mmu_enabled flag */
>>>      mmu_enabled = true;
>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>          for (pa = start; pa < end; pa += map_size) {
>>>              va = (uintptr_t)__va(pa);
>>>              create_pgd_mapping(swapper_pg_dir, va, pa,
>>> -                       map_size, PAGE_KERNEL_EXEC);
>>> +                       map_size, PAGE_KERNEL);
>>>          }
>>>      }
>>>
>>> +    /* Map the kernel */
>>> +    create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>> +
>>> +    /* Reserve the vmalloc area occupied by the kernel */
>>> +    vm_kernel.addr = (void *)kernel_virt_addr;
>>> +    vm_kernel.phys_addr = load_pa;
>>> +    vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
>>> +    vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>> +    vm_kernel.caller = __builtin_return_address(0);
>>> +
>>> +    vm_area_add_early(&vm_kernel);
>>> +
>>>      /* Clear fixmap PTE and PMD mappings */
>>>      clear_fixmap(FIX_PTE);
>>>      clear_fixmap(FIX_PMD);
>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>> --- a/arch/riscv/mm/physaddr.c
>>> +++ b/arch/riscv/mm/physaddr.c
>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>
>>>  phys_addr_t __phys_addr_symbol(unsigned long x)
>>>  {
>>> -    unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>> +    unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>      unsigned long kernel_end = (unsigned long)_end;
>>>
>>>      /*
> 
> Alex

^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Palmer Dabbelt @ 2020-07-21 19:05 UTC (permalink / raw)
  To: alex
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <7cb2285e-68ba-6827-5e61-e33a4b65ac03@ghiti.fr>

On Tue, 21 Jul 2020 11:36:10 PDT (-0700), alex@ghiti.fr wrote:
> Let's try to make progress here: I add linux-mm in CC to get feedback on
> this patch as it blocks sv48 support too.

Sorry for being slow here.  I haven't replied because I hadn't really fleshed
out the design yet, but just so everyone's on the same page my problems with
this are:

* We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
* On 64-bit systems the VA space around the kernel is precious because it's the
  only place we can place text (modules, BPF, whatever).  If we start putting
  the kernel in the vmalloc space then we either have to pre-allocate a bunch
  of space around it (essentially making it a fixed mapping anyway) or it
  becomes likely that we won't be able to find space for modules as they're
  loaded into running systems.
* Relying on a relocatable kernel for sv48 support introduces a fairly large
  performance hit.

Roughly, my proposal would be to:

* Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
  anywhere and we only have one VA width, so we're not really solving any
  problems with these changes.
* Staticly allocate a 2GiB portion of the VA space for all our text, as its own
  region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
  which would decouple the kernel from the physical memory layout of the system.
  This would have the side effect of sorting out a bunch of bootloader headaches
  that we currently have.
* Sort out how to maintain a linear map as the canonical hole moves around
  between the VA widths without adding a bunch of overhead to the virt2phys and
  friends.  This is probably going to be the trickiest part, but I think if we
  just change the page table code to essentially lie about VAs when an sv39
  system runs an sv48+sv39 kernel we could make it work -- there'd be some
  logical complexity involved, but it would remain fast.

This doesn't solve the problem of virtually relocatable kernels, but it does
let us decouple that from the sv48 stuff.  It also lets us stop relying on a
fixed physical address the kernel is loaded into, which is another thing I
don't like.

I know this may be a more complicated approach, but there aren't any sv48
systems around right now so I just don't see the rush to support them,
particularly when there's a cost to what already exists (for those who haven't
been watching, so far all the sv48 patch sets have imposed a significant
performance penalty on all systems).

>
> Alex
>
> Le 7/9/20 à 7:11 AM, Alex Ghiti a écrit :
>> Hi Palmer,
>>
>> Le 7/9/20 à 1:05 AM, Palmer Dabbelt a écrit :
>>> On Sun, 07 Jun 2020 00:59:46 PDT (-0700), alex@ghiti.fr wrote:
>>>> This is a preparatory patch for relocatable kernel.
>>>>
>>>> The kernel used to be linked at PAGE_OFFSET address and used to be
>>>> loaded
>>>> physically at the beginning of the main memory. Therefore, we could use
>>>> the linear mapping for the kernel mapping.
>>>>
>>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>>> and since in the linear mapping, two different virtual addresses cannot
>>>> point to the same physical address, the kernel mapping needs to lie
>>>> outside
>>>> the linear mapping.
>>>
>>> I know it's been a while, but I keep opening this up to review it and
>>> just
>>> can't get over how ugly it is to put the kernel's linear map in the
>>> vmalloc
>>> region.
>>>
>>> I guess I don't understand why this is necessary at all.
>>> Specifically: why
>>> can't we just relocate the kernel within the linear map?  That would
>>> let the
>>> bootloader put the kernel wherever it wants, modulo the physical
>>> memory size we
>>> support.  We'd need to handle the regions that are coupled to the
>>> kernel's
>>> execution address, but we could just put them in an explicit memory
>>> region
>>> which is what we should probably be doing anyway.
>>
>> Virtual relocation in the linear mapping requires to move the kernel
>> physically too. Zong implemented this physical move in its KASLR RFC
>> patchset, which is cumbersome since finding an available physical spot
>> is harder than just selecting a virtual range in the vmalloc range.
>>
>> In addition, having the kernel mapping in the linear mapping prevents
>> the use of hugepage for the linear mapping resulting in performance loss
>> (at least for the GB that encompasses the kernel).
>>
>> Why do you find this "ugly" ? The vmalloc region is just a bunch of
>> available virtual addresses to whatever purpose we want, and as noted by
>> Zong, arm64 uses the same scheme.
>>
>>>
>>>> In addition, because modules and BPF must be close to the kernel (inside
>>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>>> placed at the beginning of the vmalloc zone since other vmalloc
>>>> allocations from the kernel could get all the +-2GB window around the
>>>> kernel which would prevent new modules and BPF programs to be loaded.
>>>
>>> Well, that's not enough to make sure this doesn't happen -- it's just
>>> enough to
>>> make sure it doesn't happen very quickily.  That's the same boat we're
>>> already
>>> in, though, so it's not like it's worse.
>>
>> Indeed, that's not worse, I haven't found a way to reserve vmalloc area
>> without actually allocating it.
>>
>>>
>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>> Reviewed-by: Zong Li <zong.li@sifive.com>
>>>> ---
>>>>  arch/riscv/boot/loader.lds.S     |  3 +-
>>>>  arch/riscv/include/asm/page.h    | 10 +++++-
>>>>  arch/riscv/include/asm/pgtable.h | 38 ++++++++++++++-------
>>>>  arch/riscv/kernel/head.S         |  3 +-
>>>>  arch/riscv/kernel/module.c       |  4 +--
>>>>  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>  arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>>>  arch/riscv/mm/physaddr.c         |  2 +-
>>>>  8 files changed, 88 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>>> index 47a5003c2e28..62d94696a19c 100644
>>>> --- a/arch/riscv/boot/loader.lds.S
>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>> @@ -1,13 +1,14 @@
>>>>  /* SPDX-License-Identifier: GPL-2.0 */
>>>>
>>>>  #include <asm/page.h>
>>>> +#include <asm/pgtable.h>
>>>>
>>>>  OUTPUT_ARCH(riscv)
>>>>  ENTRY(_start)
>>>>
>>>>  SECTIONS
>>>>  {
>>>> -    . = PAGE_OFFSET;
>>>> +    . = KERNEL_LINK_ADDR;
>>>>
>>>>      .payload : {
>>>>          *(.payload)
>>>> diff --git a/arch/riscv/include/asm/page.h
>>>> b/arch/riscv/include/asm/page.h
>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>> --- a/arch/riscv/include/asm/page.h
>>>> +++ b/arch/riscv/include/asm/page.h
>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>
>>>>  #ifdef CONFIG_MMU
>>>>  extern unsigned long va_pa_offset;
>>>> +extern unsigned long va_kernel_pa_offset;
>>>>  extern unsigned long pfn_base;
>>>>  #define ARCH_PFN_OFFSET        (pfn_base)
>>>>  #else
>>>>  #define va_pa_offset        0
>>>> +#define va_kernel_pa_offset    0
>>>>  #define ARCH_PFN_OFFSET        (PAGE_OFFSET >> PAGE_SHIFT)
>>>>  #endif /* CONFIG_MMU */
>>>>
>>>>  extern unsigned long max_low_pfn;
>>>>  extern unsigned long min_low_pfn;
>>>> +extern unsigned long kernel_virt_addr;
>>>>
>>>>  #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) +
>>>> va_pa_offset))
>>>> -#define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
>>>> +#define linear_mapping_va_to_pa(x)    ((unsigned long)(x) -
>>>> va_pa_offset)
>>>> +#define kernel_mapping_va_to_pa(x)    \
>>>> +    ((unsigned long)(x) - va_kernel_pa_offset)
>>>> +#define __va_to_pa_nodebug(x)        \
>>>> +    (((x) >= PAGE_OFFSET) ?        \
>>>> +        linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>>
>>>>  #ifdef CONFIG_DEBUG_VIRTUAL
>>>>  extern phys_addr_t __virt_to_phys(unsigned long x);
>>>> diff --git a/arch/riscv/include/asm/pgtable.h
>>>> b/arch/riscv/include/asm/pgtable.h
>>>> index 35b60035b6b0..94ef3b49dfb6 100644
>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>> @@ -11,23 +11,29 @@
>>>>
>>>>  #include <asm/pgtable-bits.h>
>>>>
>>>> -#ifndef __ASSEMBLY__
>>>> -
>>>> -/* Page Upper Directory not used in RISC-V */
>>>> -#include <asm-generic/pgtable-nopud.h>
>>>> -#include <asm/page.h>
>>>> -#include <asm/tlbflush.h>
>>>> -#include <linux/mm_types.h>
>>>> -
>>>> -#ifdef CONFIG_MMU
>>>> +#ifndef CONFIG_MMU
>>>> +#define KERNEL_VIRT_ADDR    PAGE_OFFSET
>>>> +#define KERNEL_LINK_ADDR    PAGE_OFFSET
>>>> +#else
>>>> +/*
>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range
>>>> around
>>>> + * the kernel.
>>>> + */
>>>> +#define KERNEL_VIRT_ADDR    (VMALLOC_END - SZ_2G + 1)
>>>> +#define KERNEL_LINK_ADDR    KERNEL_VIRT_ADDR
>>>
>>> At a bare minimum this is going to make a mess of the 32-bit port, as
>>> non-relocatable kernels are now going to get linked at 1GiB which is
>>> where user
>>> code is supposed to live.  That's an easy fix, though, as the 32-bit
>>> stuff
>>> doesn't need any module address restrictions.
>>
>> Indeed, I will take a look at that.
>>
>>>
>>>>  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>  #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>
>>>>  #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>> -#define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>> -#define BPF_JIT_REGION_END    (VMALLOC_END)
>>>> +#define BPF_JIT_REGION_START    PFN_ALIGN((unsigned long)&_end)
>>>> +#define BPF_JIT_REGION_END    (BPF_JIT_REGION_START +
>>>> BPF_JIT_REGION_SIZE)
>>>> +
>>>> +#ifdef CONFIG_64BIT
>>>> +#define VMALLOC_MODULE_START    BPF_JIT_REGION_END
>>>> +#define VMALLOC_MODULE_END    (((unsigned long)&_start & PAGE_MASK)
>>>> + SZ_2G)
>>>> +#endif
>>>>
>>>>  /*
>>>>   * Roughly size the vmemmap space to be large enough to fit enough
>>>> @@ -57,9 +63,16 @@
>>>>  #define FIXADDR_SIZE     PGDIR_SIZE
>>>>  #endif
>>>>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>> -
>>>>  #endif
>>>>
>>>> +#ifndef __ASSEMBLY__
>>>> +
>>>> +/* Page Upper Directory not used in RISC-V */
>>>> +#include <asm-generic/pgtable-nopud.h>
>>>> +#include <asm/page.h>
>>>> +#include <asm/tlbflush.h>
>>>> +#include <linux/mm_types.h>
>>>> +
>>>>  #ifdef CONFIG_64BIT
>>>>  #include <asm/pgtable-64.h>
>>>>  #else
>>>> @@ -483,6 +496,7 @@ static inline void __kernel_map_pages(struct page
>>>> *page, int numpages, int enabl
>>>>
>>>>  #define kern_addr_valid(addr)   (1) /* FIXME */
>>>>
>>>> +extern char _start[];
>>>>  extern void *dtb_early_va;
>>>>  void setup_bootmem(void);
>>>>  void paging_init(void);
>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>> index 98a406474e7d..8f5bb7731327 100644
>>>> --- a/arch/riscv/kernel/head.S
>>>> +++ b/arch/riscv/kernel/head.S
>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>  #ifdef CONFIG_MMU
>>>>  relocate:
>>>>      /* Relocate return address */
>>>> -    li a1, PAGE_OFFSET
>>>> +    la a1, kernel_virt_addr
>>>> +    REG_L a1, 0(a1)
>>>>      la a2, _start
>>>>      sub a1, a1, a2
>>>>      add ra, ra, a1
>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>> --- a/arch/riscv/kernel/module.c
>>>> +++ b/arch/riscv/kernel/module.c
>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const
>>>> char *strtab,
>>>>  }
>>>>
>>>>  #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>> -#define VMALLOC_MODULE_START \
>>>> -     max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>  void *module_alloc(unsigned long size)
>>>>  {
>>>>      return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>> -                    VMALLOC_END, GFP_KERNEL,
>>>> +                    VMALLOC_MODULE_END, GFP_KERNEL,
>>>>                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>>>                      __builtin_return_address(0));
>>>>  }
>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S
>>>> b/arch/riscv/kernel/vmlinux.lds.S
>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>> @@ -4,7 +4,8 @@
>>>>   * Copyright (C) 2017 SiFive
>>>>   */
>>>>
>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>> +#include <asm/pgtable.h>
>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>  #include <asm/vmlinux.lds.h>
>>>>  #include <asm/page.h>
>>>>  #include <asm/cache.h>
>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>> index 736de6c8739f..71da78914645 100644
>>>> --- a/arch/riscv/mm/init.c
>>>> +++ b/arch/riscv/mm/init.c
>>>> @@ -22,6 +22,9 @@
>>>>
>>>>  #include "../kernel/head.h"
>>>>
>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>> +
>>>>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>                              __page_aligned_bss;
>>>>  EXPORT_SYMBOL(empty_zero_page);
>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>  }
>>>>
>>>>  #ifdef CONFIG_MMU
>>>> +/* Offset between linear mapping virtual address and kernel load
>>>> address */
>>>>  unsigned long va_pa_offset;
>>>>  EXPORT_SYMBOL(va_pa_offset);
>>>> +/* Offset between kernel mapping virtual address and kernel load
>>>> address */
>>>> +unsigned long va_kernel_pa_offset;
>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>  unsigned long pfn_base;
>>>>  EXPORT_SYMBOL(pfn_base);
>>>>
>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>      if (mmu_enabled)
>>>>          return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>
>>>> -    pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>> +    pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>      BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>      return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>  }
>>>> @@ -372,14 +379,30 @@ static uintptr_t __init
>>>> best_map_size(phys_addr_t base, phys_addr_t size)
>>>>  #error "setup_vm() is called from head.S before relocate so it
>>>> should not use absolute addressing."
>>>>  #endif
>>>>
>>>> +static uintptr_t load_pa, load_sz;
>>>> +
>>>> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t
>>>> map_size)
>>>> +{
>>>> +    uintptr_t va, end_va;
>>>> +
>>>> +    end_va = kernel_virt_addr + load_sz;
>>>> +    for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>> +        create_pgd_mapping(pgdir, va,
>>>> +                   load_pa + (va - kernel_virt_addr),
>>>> +                   map_size, PAGE_KERNEL_EXEC);
>>>> +}
>>>> +
>>>>  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>  {
>>>>      uintptr_t va, end_va;
>>>> -    uintptr_t load_pa = (uintptr_t)(&_start);
>>>> -    uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>      uintptr_t map_size = best_map_size(load_pa,
>>>> MAX_EARLY_MAPPING_SIZE);
>>>>
>>>> +    load_pa = (uintptr_t)(&_start);
>>>> +    load_sz = (uintptr_t)(&_end) - load_pa;
>>>> +
>>>>      va_pa_offset = PAGE_OFFSET - load_pa;
>>>> +    va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>> +
>>>>      pfn_base = PFN_DOWN(load_pa);
>>>>
>>>>      /*
>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>      create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>                 (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>>      /* Setup trampoline PGD and PMD */
>>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                 (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>>> -    create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>> +    create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>                 load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>  #else
>>>>      /* Setup trampoline PGD */
>>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                 load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>  #endif
>>>>
>>>>      /*
>>>> -     * Setup early PGD covering entire kernel which will allows
>>>> +     * Setup early PGD covering entire kernel which will allow
>>>>       * us to reach paging_init(). We map all memory banks later
>>>>       * in setup_vm_final() below.
>>>>       */
>>>> -    end_va = PAGE_OFFSET + load_sz;
>>>> -    for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>> -        create_pgd_mapping(early_pg_dir, va,
>>>> -                   load_pa + (va - PAGE_OFFSET),
>>>> -                   map_size, PAGE_KERNEL_EXEC);
>>>> +    create_kernel_page_table(early_pg_dir, map_size);
>>>>
>>>>      /* Create fixed mapping for early FDT parsing */
>>>>      end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>      uintptr_t va, map_size;
>>>>      phys_addr_t pa, start, end;
>>>>      struct memblock_region *reg;
>>>> +    static struct vm_struct vm_kernel = { 0 };
>>>>
>>>>      /* Set mmu_enabled flag */
>>>>      mmu_enabled = true;
>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>          for (pa = start; pa < end; pa += map_size) {
>>>>              va = (uintptr_t)__va(pa);
>>>>              create_pgd_mapping(swapper_pg_dir, va, pa,
>>>> -                       map_size, PAGE_KERNEL_EXEC);
>>>> +                       map_size, PAGE_KERNEL);
>>>>          }
>>>>      }
>>>>
>>>> +    /* Map the kernel */
>>>> +    create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>> +
>>>> +    /* Reserve the vmalloc area occupied by the kernel */
>>>> +    vm_kernel.addr = (void *)kernel_virt_addr;
>>>> +    vm_kernel.phys_addr = load_pa;
>>>> +    vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
>>>> +    vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>> +    vm_kernel.caller = __builtin_return_address(0);
>>>> +
>>>> +    vm_area_add_early(&vm_kernel);
>>>> +
>>>>      /* Clear fixmap PTE and PMD mappings */
>>>>      clear_fixmap(FIX_PTE);
>>>>      clear_fixmap(FIX_PMD);
>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>> --- a/arch/riscv/mm/physaddr.c
>>>> +++ b/arch/riscv/mm/physaddr.c
>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>
>>>>  phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>  {
>>>> -    unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>> +    unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>      unsigned long kernel_end = (unsigned long)_end;
>>>>
>>>>      /*
>>
>> Alex

^ permalink raw reply

* Re: [PATCH v6] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Bruno Meneguele @ 2020-07-21 19:38 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: linux-s390, Nayna, erichte, nayna, x86, linux-kernel, stable,
	linux-integrity, linuxppc-dev
In-Reply-To: <1595352376.5311.8.camel@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2046 bytes --]

On Tue, Jul 21, 2020 at 01:26:16PM -0400, Mimi Zohar wrote:
> On Mon, 2020-07-20 at 12:38 -0300, Bruno Meneguele wrote:
> > On Mon, Jul 20, 2020 at 10:56:55AM -0400, Mimi Zohar wrote:
> > > On Mon, 2020-07-20 at 10:40 -0400, Nayna wrote:
> > > > On 7/13/20 12:48 PM, Bruno Meneguele wrote:
> > > > > The IMA_APPRAISE_BOOTPARAM config allows enabling different "ima_appraise="
> > > > > modes - log, fix, enforce - at run time, but not when IMA architecture
> > > > > specific policies are enabled.  This prevents properly labeling the
> > > > > filesystem on systems where secure boot is supported, but not enabled on the
> > > > > platform.  Only when secure boot is actually enabled should these IMA
> > > > > appraise modes be disabled.
> > > > >
> > > > > This patch removes the compile time dependency and makes it a runtime
> > > > > decision, based on the secure boot state of that platform.
> > > > >
> > > > > Test results as follows:
> > > > >
> > > > > -> x86-64 with secure boot enabled
> > > > >
> > > > > [    0.015637] Kernel command line: <...> ima_policy=appraise_tcb ima_appraise=fix
> > > > > [    0.015668] ima: Secure boot enabled: ignoring ima_appraise=fix boot parameter option
> > > > >
> > > 
> > > Is it common to have two colons in the same line?  Is the colon being
> > > used as a delimiter when parsing the kernel logs?  Should the second
> > > colon be replaced with a hyphen?  (No need to repost.  I'll fix it
> > > up.)
> > >  
> > 
> > AFAICS it has been used without any limitations, e.g:
> > 
> > PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
> > clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
> > microcode: CPU0: patch_level=0x08701013
> > Lockdown: modprobe: unsigned module loading is restricted; see man kernel_lockdown.7
> > ...
> > 
> > I'd say we're fine using it.
> 
> Ok.  FYI, it's now in next-integrity.
> 
> Mimi
> 

Thanks Mimi.

-- 
bmeneg 
PGP Key: http://bmeneg.com/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [powerpc:next-test] BUILD REGRESSION 08b8bb849948ff5e2305d1115ce8bbdd55364a70
From: kernel test robot @ 2020-07-21 20:01 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  next-test
branch HEAD: 08b8bb849948ff5e2305d1115ce8bbdd55364a70  powerpc test_emulate_step: move extern declaration to sstep.h

Error/Warning in current branch:

arch/powerpc/include/asm/ppc-opcode.h:277:22: error: called object is not a function or function pointer
arch/powerpc/include/asm/ppc-opcode.h:281:22: error: called object is not a function or function pointer
arch/powerpc/lib/test_emulate_step.c:28:4: error: 'PPC_INST_LWZ' undeclared (first use in this function); did you mean 'PPC_INST_LSWI'?
arch/powerpc/lib/test_emulate_step.c:29:4: error: 'PPC_INST_LWZ' undeclared (first use in this function); did you mean 'PPC_INST_LFD'?

Error/Warning ids grouped by kconfigs:

recent_errors
|-- powerpc-allmodconfig
|   |-- arch-powerpc-include-asm-ppc-opcode.h:error:called-object-is-not-a-function-or-function-pointer
|   `-- arch-powerpc-lib-test_emulate_step.c:error:PPC_INST_LWZ-undeclared-(first-use-in-this-function)
`-- powerpc-allyesconfig
    |-- arch-powerpc-include-asm-ppc-opcode.h:error:called-object-is-not-a-function-or-function-pointer
    `-- arch-powerpc-lib-test_emulate_step.c:error:PPC_INST_LWZ-undeclared-(first-use-in-this-function)

elapsed time: 1784m

configs tested: 97
configs skipped: 2

arm64                            allyesconfig
arm64                               defconfig
arm64                            allmodconfig
arm64                             allnoconfig
arm                                 defconfig
arm                              allyesconfig
arm                              allmodconfig
arm                               allnoconfig
arm                            zeus_defconfig
sh                ecovec24-romimage_defconfig
arm                       netwinder_defconfig
mips                        nlm_xlr_defconfig
powerpc                        cell_defconfig
s390                          debug_defconfig
arm                          pxa3xx_defconfig
m68k                        m5407c3_defconfig
sh                          sdk7780_defconfig
c6x                         dsk6455_defconfig
arm                           h5000_defconfig
i386                             allyesconfig
i386                                defconfig
i386                              debian-10.3
i386                              allnoconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                              allnoconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                              allnoconfig
m68k                           sun3_defconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
nios2                            allyesconfig
openrisc                            defconfig
c6x                              allyesconfig
c6x                               allnoconfig
openrisc                         allyesconfig
nds32                               defconfig
nds32                             allnoconfig
csky                             allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
h8300                            allmodconfig
xtensa                              defconfig
arc                                 defconfig
arc                              allyesconfig
sh                               allmodconfig
sh                                allnoconfig
microblaze                        allnoconfig
mips                             allyesconfig
mips                              allnoconfig
mips                             allmodconfig
parisc                            allnoconfig
parisc                              defconfig
parisc                           allyesconfig
parisc                           allmodconfig
powerpc                          allyesconfig
powerpc                          rhel-kconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
powerpc                             defconfig
i386                 randconfig-a001-20200719
i386                 randconfig-a006-20200719
i386                 randconfig-a002-20200719
i386                 randconfig-a005-20200719
i386                 randconfig-a003-20200719
i386                 randconfig-a004-20200719
x86_64               randconfig-a005-20200719
x86_64               randconfig-a002-20200719
x86_64               randconfig-a006-20200719
x86_64               randconfig-a001-20200719
x86_64               randconfig-a003-20200719
x86_64               randconfig-a004-20200719
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
s390                             allyesconfig
s390                              allnoconfig
s390                             allmodconfig
s390                                defconfig
sparc                            allyesconfig
sparc                               defconfig
sparc64                             defconfig
sparc64                           allnoconfig
sparc64                          allyesconfig
sparc64                          allmodconfig
x86_64                    rhel-7.6-kselftests
x86_64                               rhel-8.3
x86_64                                  kexec
x86_64                                   rhel
x86_64                                    lkp
x86_64                              fedora-25

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH v2 1/2] KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
From: Ram Pai @ 2020-07-21 21:28 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: linux-kernel, kvm-ppc, bharata, paulus, sukadev, linuxppc-dev,
	bauerman
In-Reply-To: <20200721104202.15727-2-ldufour@linux.ibm.com>

On Tue, Jul 21, 2020 at 12:42:01PM +0200, Laurent Dufour wrote:
> kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages()
> so move it upper in this file.
> 
> Furthermore it will be interesting to call this function when already
> holding the kvm->arch.uvmem_lock, so prefix the original function with __
> and remove the locking in it, and introduce a wrapper which call that
> function with the lock held.
> 
> There is no functional change.

Reviewed-by: Ram Pai <linuxram@us.ibm.com>

> 
> Cc: Ram Pai <linuxram@us.ibm.com>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---

RP

^ permalink raw reply

* Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping
From: Ram Pai @ 2020-07-21 21:37 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: linux-kernel, kvm-ppc, bharata, paulus, sukadev, linuxppc-dev,
	bauerman
In-Reply-To: <20200721104202.15727-3-ldufour@linux.ibm.com>

On Tue, Jul 21, 2020 at 12:42:02PM +0200, Laurent Dufour wrote:
> When a secure memslot is dropped, all the pages backed in the secure device
> (aka really backed by secure memory by the Ultravisor) should be paged out
> to a normal page. Previously, this was achieved by triggering the page
> fault mechanism which is calling kvmppc_svm_page_out() on each pages.
> 
> This can't work when hot unplugging a memory slot because the memory slot
> is flagged as invalid and gfn_to_pfn() is then not trying to access the
> page, so the page fault mechanism is not triggered.
> 
> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
> simpler to directly calling it instead of triggering such a mechanism. This
            ^^ call directly instead of triggering..

> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
> memslot.
> 
> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
> the call to __kvmppc_svm_page_out() is made.
> As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
> VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
> addition, the mmap_sem is help in read mode during that time, not in write
		          ^^ held

> mode since the virual memory layout is not impacted, and
> kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
> 
> Cc: Ram Pai <linuxram@us.ibm.com>

Reviewed-by: Ram Pai <linuxram@us.ibm.com>

RP

^ permalink raw reply

* Re: [PATCH v4 6/7] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
From: Leonardo Bras @ 2020-07-21 21:39 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Alexey Kardashevskiy, Joel Stanley, Christophe Leroy,
	Thiago Jung Bauermann, Ram Pai, Brian King
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200716071658.467820-7-leobras.c@gmail.com>

On Thu, 2020-07-16 at 04:16 -0300, Leonardo Bras wrote:
> +static void iommu_pseries_table_update(struct pci_dev *dev,
> +                                      struct device_node *pdn)
> +{
> +       const struct dynamic_dma_window_prop *ddw;
> +       struct pci_dn *pci;
> +       int len;
> +
> +       ddw = of_get_property(pdn, DMA64_PROPNAME, &len);
> +       if (!ddw  || len < sizeof(struct dynamic_dma_window_prop))
> +               return;
> +
> +       iommu_table_update(pci->table_group->tables[0], pci->phb->node,
> +                          ddw->liobn, ddw->dma_base, ddw->tce_shift,
> +                          ddw->window_shift);
> +}
> +
>  static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>  {
>         struct device_node *pdn, *dn;
> @@ -1382,6 +1403,7 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
>                 pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
>                 if (pdev->dev.archdata.dma_offset)
>                         return true;
> +               iommu_pseries_table_update(pdev, pdn);
>         }
> 

Noticed a bug in this one: pci is not getting assigned. 
My bad, there must have been a merge error.

Also, I will refactor the function to make use of pdn only, as I can do
pci = PCI_DN(pdn) (I think it's better this way).

Sorry for the buggy patch.

Best regards,


^ permalink raw reply

* Re: [PATCH v4 5/7] powerpc/iommu: Move iommu_table cleaning routine to iommu_table_clean
From: Leonardo Bras @ 2020-07-21 22:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Joel Stanley, Christophe Leroy,
	Thiago Jung Bauermann, Ram Pai, Brian King
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <51235292-a571-8792-c693-d0dc6faeb21c@ozlabs.ru>

On Tue, 2020-07-21 at 14:59 +1000, Alexey Kardashevskiy wrote:
> 
> On 16/07/2020 17:16, Leonardo Bras wrote:
> > Move the part of iommu_table_free() that does struct iommu_table cleaning
> > into iommu_table_clean, so we can invoke it separately.
> > 
> > This new function is useful for cleaning struct iommu_table before
> > initializing it again with a new DMA window, without having it freed and
> > allocated again.
> > 
> > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > ---
> >  arch/powerpc/kernel/iommu.c | 30 ++++++++++++++++++------------
> >  1 file changed, 18 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > index 9704f3f76e63..c3242253a4e7 100644
> > --- a/arch/powerpc/kernel/iommu.c
> > +++ b/arch/powerpc/kernel/iommu.c
> > @@ -735,21 +735,10 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> >  	return tbl;
> >  }
> >  
> > -static void iommu_table_free(struct kref *kref)
> > +static void iommu_table_clean(struct iommu_table *tbl)
> 
> iommu_table_free() + iommu_init_table() + set_iommu_table_base() should
> work too, why new helper?

iommu_table_free() also frees the tbl, which would need allocate it
again (new address) and to fill it up again, unnecessarily. 
I think it's a better approach to only change what is needed.

> There is also iommu_table_clear() which does a different thing so you
> need a better name.

I agree.
I had not noticed this other function before sending the patchset. What
would be a better name though? __iommu_table_free()? 

> Second, iommu_table_free
> use and it would be ok as we would only see this when hot-unplugging a
> PE because we always kept the default window.
> Btw you must be seeing these warnings now every time you create DDW with
> these patches as at least the first page is reserved, do not you?

It does not print a warning.
I noticed other warnings, but not this one from iommu_table_free():
/* verify that table contains no entries */
if (!bitmap_empty(tbl->it_ma
p, tbl->it_size))
	pr_warn("%s: Unexpected TCEs\n", __func__);

Before that, iommu_table_release_pages(tbl) is supposed to clear the 
bitmap, so this only tests for a tce that is created in this short period.

> Since we are replacing a table for a device which is still in the
> system, we should not try messing with its DMA if it already has
> mappings so the warning should become an error preventing DDW. It is
> rather hard to trigger in practice but I could hack a driver to ask for
> 32bit DMA mask first, map few pages and then ask for 64bit DMA mask, it
> is not illegal, I think. So this needs a new helper - "bool
> iommu_table_in_use(tbl)" - to use in enable_ddw(). Or I am overthinking
> this?... Thanks,

As of today, there seems to be nothing like that happening in the
driver I am testing. 
I spoke to Brian King on slack, and he mentioned that at the point DDW
is created there should be no allocations in place.

But I suppose some driver could try to do this.

Maybe a better approach would be removing the mapping only if the
default window is removed (at the end of enable_ddw, as an else to
resetting the default DMA window), and having a way to add more
mappings to those pools. But this last part doesn't look so simple, and
it would be better to understand if it's necessary investing work in
this.

What do you think?

Best regards,



^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Benjamin Herrenschmidt @ 2020-07-21 23:11 UTC (permalink / raw)
  To: Alex Ghiti, Palmer Dabbelt
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <7cb2285e-68ba-6827-5e61-e33a4b65ac03@ghiti.fr>

On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:
> > > I guess I don't understand why this is necessary at all.  
> > > Specifically: why
> > > can't we just relocate the kernel within the linear map?  That would 
> > > let the
> > > bootloader put the kernel wherever it wants, modulo the physical 
> > > memory size we
> > > support.  We'd need to handle the regions that are coupled to the 
> > > kernel's
> > > execution address, but we could just put them in an explicit memory 
> > > region
> > > which is what we should probably be doing anyway.
> > 
> > Virtual relocation in the linear mapping requires to move the kernel 
> > physically too. Zong implemented this physical move in its KASLR RFC 
> > patchset, which is cumbersome since finding an available physical spot 
> > is harder than just selecting a virtual range in the vmalloc range.
> > 
> > In addition, having the kernel mapping in the linear mapping prevents 
> > the use of hugepage for the linear mapping resulting in performance loss 
> > (at least for the GB that encompasses the kernel).
> > 
> > Why do you find this "ugly" ? The vmalloc region is just a bunch of 
> > available virtual addresses to whatever purpose we want, and as noted by 
> > Zong, arm64 uses the same scheme.

I don't get it :-)

At least on powerpc we move the kernel in the linear mapping and it
works fine with huge pages, what is your problem there ? You rely on
punching small-page size holes in there ?

At least in the old days, there were a number of assumptions that
the kernel text/data/bss resides in the linear mapping.

If you change that you need to ensure that it's still physically
contiguous and you'll have to tweak __va and __pa, which might induce
extra overhead.

Cheers,
Ben.
 


^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Benjamin Herrenschmidt @ 2020-07-21 23:12 UTC (permalink / raw)
  To: Palmer Dabbelt, alex
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <mhng-08bff01a-ca15-4bbc-8454-2ca3e823fef8@palmerdabbelt-glaptop1>

On Tue, 2020-07-21 at 12:05 -0700, Palmer Dabbelt wrote:
> 
> * We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
> * On 64-bit systems the VA space around the kernel is precious because it's the
>   only place we can place text (modules, BPF, whatever). 

Why ? Branch distance limits ? You can't use trampolines ?

>  If we start putting
>   the kernel in the vmalloc space then we either have to pre-allocate a bunch
>   of space around it (essentially making it a fixed mapping anyway) or it
>   becomes likely that we won't be able to find space for modules as they're
>   loaded into running systems.

I dislike the kernel being in the vmalloc space (see my other email)
but I don't understand the specific issue with modules.

> * Relying on a relocatable kernel for sv48 support introduces a fairly large
>   performance hit.

Out of curiosity why would relocatable kernels introduce a significant
hit ? Where about do you see the overhead coming from ?

> Roughly, my proposal would be to:
> 
> * Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
>   anywhere and we only have one VA width, so we're not really solving any
>   problems with these changes.
> * Staticly allocate a 2GiB portion of the VA space for all our text, as its own
>   region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
>   which would decouple the kernel from the physical memory layout of the system.
>   This would have the side effect of sorting out a bunch of bootloader headaches
>   that we currently have.
> * Sort out how to maintain a linear map as the canonical hole moves around
>   between the VA widths without adding a bunch of overhead to the virt2phys and
>   friends.  This is probably going to be the trickiest part, but I think if we
>   just change the page table code to essentially lie about VAs when an sv39
>   system runs an sv48+sv39 kernel we could make it work -- there'd be some
>   logical complexity involved, but it would remain fast.
> 
> This doesn't solve the problem of virtually relocatable kernels, but it does
> let us decouple that from the sv48 stuff.  It also lets us stop relying on a
> fixed physical address the kernel is loaded into, which is another thing I
> don't like.
> 
> I know this may be a more complicated approach, but there aren't any sv48
> systems around right now so I just don't see the rush to support them,
> particularly when there's a cost to what already exists (for those who haven't
> been watching, so far all the sv48 patch sets have imposed a significant
> performance penalty on all systems).


^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Palmer Dabbelt @ 2020-07-21 23:36 UTC (permalink / raw)
  To: benh
  Cc: aou, alex, linux-mm, Anup Patel, linux-kernel, Atish Patra,
	paulus, zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <54af168083aee9dbda1b531227521a26b77ba2c8.camel@kernel.crashing.org>

On Tue, 21 Jul 2020 16:11:02 PDT (-0700), benh@kernel.crashing.org wrote:
> On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:
>> > > I guess I don't understand why this is necessary at all.
>> > > Specifically: why
>> > > can't we just relocate the kernel within the linear map?  That would
>> > > let the
>> > > bootloader put the kernel wherever it wants, modulo the physical
>> > > memory size we
>> > > support.  We'd need to handle the regions that are coupled to the
>> > > kernel's
>> > > execution address, but we could just put them in an explicit memory
>> > > region
>> > > which is what we should probably be doing anyway.
>> >
>> > Virtual relocation in the linear mapping requires to move the kernel
>> > physically too. Zong implemented this physical move in its KASLR RFC
>> > patchset, which is cumbersome since finding an available physical spot
>> > is harder than just selecting a virtual range in the vmalloc range.
>> >
>> > In addition, having the kernel mapping in the linear mapping prevents
>> > the use of hugepage for the linear mapping resulting in performance loss
>> > (at least for the GB that encompasses the kernel).
>> >
>> > Why do you find this "ugly" ? The vmalloc region is just a bunch of
>> > available virtual addresses to whatever purpose we want, and as noted by
>> > Zong, arm64 uses the same scheme.
>
> I don't get it :-)
>
> At least on powerpc we move the kernel in the linear mapping and it
> works fine with huge pages, what is your problem there ? You rely on
> punching small-page size holes in there ?

That was my original suggestion, and I'm not actually sure it's invalid.  It
would mean that both the kernel's physical and virtual addresses are set by the
bootloader, which may or may not be workable if we want to have an sv48+sv39
kernel.  My initial approach to sv48+sv39 kernels would be to just throw away
the sv39 memory on sv48 kernels, which would preserve the linear map but mean
that there is no single physical address that's accessible for both.  That
would require some coordination between the bootloader and the kernel as to
where it should be loaded, but maybe there's a better way to design the linear
map.  Right now we have a bunch of unwritten rules about where things need to
be loaded, which is a recipe for disaster.

We could copy the kernel around, but I'm not sure I really like that idea.  We
do zero the BSS right now, so it's not like we entirely rely on the bootloader
to set up the kernel image, but with the hart race boot scheme we have right
now we'd at least need to leave a stub sitting around.  Maybe we just throw
away SBI v0.1, though, that's why we called it all legacy in the first place.

My bigger worry is that anything that involves running the kernel at arbitrary
virtual addresses means we need a PIC kernel, which means every global symbol
needs an indirection.  That's probably not so bad for shared libraries, but the
kernel has a lot of global symbols.  PLT references probably aren't so scary,
as we have an incoherent instruction cache so the virtual function predictor
isn't that hard to build, but making all global data accesses GOT-relative
seems like a disaster for performance.  This fixed-VA thing really just exists
so we don't have to be full-on PIC.

In theory I think we could just get away with pretending that medany is PIC,
which I believe works as long as the data and text offset stays constant, you
you don't have any symbols between 2GiB and -2GiB (as those may stay fixed,
even in medany), and you deal with GP accordingly (which should work itself out
in the current startup code).  We rely on this for some of the early boot code
(and will soon for kexec), but that's a very controlled code base and we've
already had some issues.  I'd be much more comfortable adding an explicit
semi-PIC code model, as I tend to miss something when doing these sorts of
things and then we could at least add it to the GCC test runs and guarantee it
actually works.  Not really sure I want to deal with that, though.  It would,
however, be the only way to get random virtual addresses during kernel
execution.

> At least in the old days, there were a number of assumptions that
> the kernel text/data/bss resides in the linear mapping.

Ya, it terrified me as well.  Alex says arm64 puts the kernel in the vmalloc
region, so assuming that's the case it must be possible.  I didn't get that
from reading the arm64 port (I guess it's no secret that pretty much all I do
is copy their code)

> If you change that you need to ensure that it's still physically
> contiguous and you'll have to tweak __va and __pa, which might induce
> extra overhead.

I'm operating under the assumption that we don't want to add an additional load
to virt2phys conversions.  arm64 bends over backwards to avoid the load, and
I'm assuming they have a reason for doing so.  Of course, if we're PIC then
maybe performance just doesn't matter, but I'm not sure I want to just give up.
Distros will probably build the sv48+sv39 kernels as soon as they show up, even
if there's no sv48 hardware for a while.

^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Palmer Dabbelt @ 2020-07-21 23:48 UTC (permalink / raw)
  To: benh
  Cc: aou, alex, linux-mm, Anup Patel, linux-kernel, Atish Patra,
	paulus, zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <6fbea8347bdb8434d91cf3ec2b95b134bd66cfe3.camel@kernel.crashing.org>

On Tue, 21 Jul 2020 16:12:58 PDT (-0700), benh@kernel.crashing.org wrote:
> On Tue, 2020-07-21 at 12:05 -0700, Palmer Dabbelt wrote:
>>
>> * We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
>> * On 64-bit systems the VA space around the kernel is precious because it's the
>>   only place we can place text (modules, BPF, whatever).
>
> Why ? Branch distance limits ? You can't use trampolines ?

Nothing fundamental, it's just that we don't have a large code model in the C
compiler.  As a result all the global symbols are resolved as 32-bit
PC-relative accesses.  We could fix this with a fast large code model, but then
the kernel would need to relax global symbol references in modules and we don't
even do that for the simple code models we have now.  FWIW, some of the
proposed large code models are essentially just split-PLT/GOT and therefor
don't require relaxation, but at that point we're essentially PIC until we
have more that 2GiB of kernel text -- and even then, we keep all the
performance issues.

>>  If we start putting
>>   the kernel in the vmalloc space then we either have to pre-allocate a bunch
>>   of space around it (essentially making it a fixed mapping anyway) or it
>>   becomes likely that we won't be able to find space for modules as they're
>>   loaded into running systems.
>
> I dislike the kernel being in the vmalloc space (see my other email)
> but I don't understand the specific issue with modules.

Essentially what's above, the modules smell the same as the rest of the
kernel's code and therefor have a similar set of restrictions.  If we build PIC
modules and have the PLT entries do GOT loads (as do our shared libraries) then
we could break this restriction, but that comes with some performance
implications.  Like I said in the other email, I'm less worried about the
instruction side of things so maybe that's the right way to go.

>> * Relying on a relocatable kernel for sv48 support introduces a fairly large
>>   performance hit.
>
> Out of curiosity why would relocatable kernels introduce a significant
> hit ? Where about do you see the overhead coming from ?

Our PIC codegen, probably better addressed by my other email and above.

>
>> Roughly, my proposal would be to:
>>
>> * Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
>>   anywhere and we only have one VA width, so we're not really solving any
>>   problems with these changes.
>> * Staticly allocate a 2GiB portion of the VA space for all our text, as its own
>>   region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
>>   which would decouple the kernel from the physical memory layout of the system.
>>   This would have the side effect of sorting out a bunch of bootloader headaches
>>   that we currently have.
>> * Sort out how to maintain a linear map as the canonical hole moves around
>>   between the VA widths without adding a bunch of overhead to the virt2phys and
>>   friends.  This is probably going to be the trickiest part, but I think if we
>>   just change the page table code to essentially lie about VAs when an sv39
>>   system runs an sv48+sv39 kernel we could make it work -- there'd be some
>>   logical complexity involved, but it would remain fast.
>>
>> This doesn't solve the problem of virtually relocatable kernels, but it does
>> let us decouple that from the sv48 stuff.  It also lets us stop relying on a
>> fixed physical address the kernel is loaded into, which is another thing I
>> don't like.
>>
>> I know this may be a more complicated approach, but there aren't any sv48
>> systems around right now so I just don't see the rush to support them,
>> particularly when there's a cost to what already exists (for those who haven't
>> been watching, so far all the sv48 patch sets have imposed a significant
>> performance penalty on all systems).

^ permalink raw reply

* Re: [PATCH v2 1/3] powerpc: inline doorbell sending functions
From: Michael Ellerman @ 2020-07-22  0:52 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
  Cc: Nicholas Piggin, kvm-ppc, Paul Mackerras, Cédric Le Goater,
	Anton Blanchard, David Gibson
In-Reply-To: <20200630115034.137050-2-npiggin@gmail.com>

Nicholas Piggin <npiggin@gmail.com> writes:
> These are only called in one place for a given platform, so inline them
> for performance.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/include/asm/dbell.h | 63 ++++++++++++++++++++++++++++++--
>  arch/powerpc/kernel/dbell.c      | 55 ----------------------------
>  2 files changed, 60 insertions(+), 58 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
> index 4ce6808deed3..f19d2282e3f8 100644
> --- a/arch/powerpc/include/asm/dbell.h
> +++ b/arch/powerpc/include/asm/dbell.h
> @@ -13,6 +13,7 @@
>  
>  #include <asm/ppc-opcode.h>
>  #include <asm/feature-fixups.h>
> +#include <asm/kvm_ppc.h>
>  
>  #define PPC_DBELL_MSG_BRDCAST	(0x04000000)
>  #define PPC_DBELL_TYPE(x)	(((x) & 0xf) << (63-36))


This somehow breaks ppc40x_defconfig and others:

In file included from /home/michael/linux/arch/powerpc/include/asm/kvm_ppc.h:24,
                 from /home/michael/linux/arch/powerpc/include/asm/dbell.h:16,
                 from /home/michael/linux/arch/powerpc/kernel/asm-offsets.c:38:
/home/michael/linux/arch/powerpc/include/asm/kvm_booke.h: In function ‘kvmppc_get_fault_dar’:
/home/michael/linux/arch/powerpc/include/asm/kvm_booke.h:94:19: error: ‘struct kvm_vcpu_arch’ has no member named ‘fault_dear’
   94 |  return vcpu->arch.fault_dear;
      |                   ^
make[2]: *** [/home/michael/linux/scripts/Makefile.build:114: arch/powerpc/kernel/asm-offsets.s] Error 1
make[1]: *** [/home/michael/linux/Makefile:1175: prepare0] Error 2
make: *** [Makefile:185: __sub-make] Error 2

cheers

^ permalink raw reply

* Re: [PATCH v4 5/7] powerpc/iommu: Move iommu_table cleaning routine to iommu_table_clean
From: Brian King @ 2020-07-22  0:52 UTC (permalink / raw)
  To: Leonardo Bras, Alexey Kardashevskiy, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Joel Stanley,
	Christophe Leroy, Thiago Jung Bauermann, Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <0f4c2d84d0958e98e7ada53c25750fe548cadf0b.camel@gmail.com>

On 7/21/20 5:13 PM, Leonardo Bras wrote:
> On Tue, 2020-07-21 at 14:59 +1000, Alexey Kardashevskiy wrote:
>>
>> On 16/07/2020 17:16, Leonardo Bras wrote:
>>> Move the part of iommu_table_free() that does struct iommu_table cleaning
>>> into iommu_table_clean, so we can invoke it separately.
>>>
>>> This new function is useful for cleaning struct iommu_table before
>>> initializing it again with a new DMA window, without having it freed and
>>> allocated again.
>>>
>>> Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
>>> ---
>>>  arch/powerpc/kernel/iommu.c | 30 ++++++++++++++++++------------
>>>  1 file changed, 18 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>> index 9704f3f76e63..c3242253a4e7 100644
>>> --- a/arch/powerpc/kernel/iommu.c
>>> +++ b/arch/powerpc/kernel/iommu.c
>>> @@ -735,21 +735,10 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
>>>  	return tbl;
>>>  }
>>>  
>>> -static void iommu_table_free(struct kref *kref)
>>> +static void iommu_table_clean(struct iommu_table *tbl)
>>
>> iommu_table_free() + iommu_init_table() + set_iommu_table_base() should
>> work too, why new helper?
> 
> iommu_table_free() also frees the tbl, which would need allocate it
> again (new address) and to fill it up again, unnecessarily. 
> I think it's a better approach to only change what is needed.
> 
>> There is also iommu_table_clear() which does a different thing so you
>> need a better name.
> 
> I agree.
> I had not noticed this other function before sending the patchset. What
> would be a better name though? __iommu_table_free()? 
> 
>> Second, iommu_table_free
>> use and it would be ok as we would only see this when hot-unplugging a
>> PE because we always kept the default window.
>> Btw you must be seeing these warnings now every time you create DDW with
>> these patches as at least the first page is reserved, do not you?
> 
> It does not print a warning.
> I noticed other warnings, but not this one from iommu_table_free():
> /* verify that table contains no entries */
> if (!bitmap_empty(tbl->it_ma
> p, tbl->it_size))
> 	pr_warn("%s: Unexpected TCEs\n", __func__);
> 
> Before that, iommu_table_release_pages(tbl) is supposed to clear the 
> bitmap, so this only tests for a tce that is created in this short period.
> 
>> Since we are replacing a table for a device which is still in the
>> system, we should not try messing with its DMA if it already has
>> mappings so the warning should become an error preventing DDW. It is
>> rather hard to trigger in practice but I could hack a driver to ask for
>> 32bit DMA mask first, map few pages and then ask for 64bit DMA mask, it
>> is not illegal, I think. So this needs a new helper - "bool
>> iommu_table_in_use(tbl)" - to use in enable_ddw(). Or I am overthinking
>> this?... Thanks,
> 
> As of today, there seems to be nothing like that happening in the
> driver I am testing. 
> I spoke to Brian King on slack, and he mentioned that at the point DDW
> is created there should be no allocations in place.

I think there are a couple of scenarios here. One is where there is a DMA
allocation prior to a call to set the DMA mask. Second scenario is if the
driver makes multiple calls to set the DMA mask. I would argue that a properly
written driver should tell the IOMMU subsystem what DMA mask it supports prior
to allocating DMA memroy. Documentation/core-api/dma-api-howto.rst should
describe what is legal and what is not.

It might be reasonable to declare its not allowed to allocate DMA memory
and then later change the DMA mask and clearly call this out in the documentation
if its not already.

-Brian


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center


^ permalink raw reply

* Re: [PATCH 5/5] powerpc sstep: Add tests for Prefixed Add Immediate
From: Jordan Niethe @ 2020-07-22  1:26 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple, Balamuruhan S
In-Reply-To: <20200525025923.19843-5-jniethe5@gmail.com>

On Mon, May 25, 2020 at 1:00 PM Jordan Niethe <jniethe5@gmail.com> wrote:
>
> Use the existing support for testing compute type instructions to test
> Prefixed Add Immediate (paddi).  The R bit of the paddi instruction
> controls whether current instruction address is used. Add test cases for
> when R=1 and for R=0. paddi has a 34 bit immediate field formed by
> concatenating si0 and si1. Add tests for the extreme values of this
> field.
>
> Skip the paddi tests if ISA v3.1 is unsupported.
>
> Some of these test cases were added by Balamuruhan S.
>
> Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
> ---
>  arch/powerpc/lib/test_emulate_step.c          | 127 ++++++++++++++++++
>  .../lib/test_emulate_step_exec_instr.S        |   1 +
>  2 files changed, 128 insertions(+)
>
> diff --git a/arch/powerpc/lib/test_emulate_step.c b/arch/powerpc/lib/test_emulate_step.c
> index 579b5db80674..33a72b7d2764 100644
> --- a/arch/powerpc/lib/test_emulate_step.c
> +++ b/arch/powerpc/lib/test_emulate_step.c
> @@ -105,6 +105,13 @@
>                                         ___PPC_RA(a) | ___PPC_RB(b))
>  #define TEST_ADDC_DOT(t, a, b) ppc_inst(PPC_INST_ADDC | ___PPC_RT(t) |         \
>                                         ___PPC_RA(a) | ___PPC_RB(b) | 0x1)
> +#define TEST_PADDI(t, a, i, pr)        ppc_inst_prefix(PPC_PREFIX_MLS | __PPC_PRFX_R(pr) |     \
> +                                       IMM_H(i),                       \
> +                                       PPC_INST_ADDI |                 \
> +                                       ___PPC_RT(t) | ___PPC_RA(a) |   \
> +                                       IMM_L(i))
> +
> +
>
>  #define MAX_SUBTESTS   16
>
> @@ -699,6 +706,11 @@ struct compute_test {
>         } subtests[MAX_SUBTESTS + 1];
>  };
>
> +/* Extreme values for si0||si1 (the MLS:D-form 34 bit immediate field) */
> +#define SI_MIN BIT(33)
> +#define SI_MAX (BIT(33) - 1)
> +#define SI_UMAX (BIT(34) - 1)
> +
>  static struct compute_test compute_tests[] = {
>         {
>                 .mnemonic = "nop",
> @@ -1071,6 +1083,121 @@ static struct compute_test compute_tests[] = {
>                                 }
>                         }
>                 }
> +       },
> +       {
> +               .mnemonic = "paddi",
> +               .cpu_feature = CPU_FTR_ARCH_31,
> +               .subtests = {
> +                       {
> +                               .descr = "RA = LONG_MIN, SI = SI_MIN, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MIN, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = LONG_MIN,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = LONG_MIN, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = LONG_MIN,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = LONG_MAX, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = LONG_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = ULONG_MAX, SI = SI_UMAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_UMAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = ULONG_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = ULONG_MAX, SI = 0x1, R = 0",
> +                               .instr = TEST_PADDI(21, 22, 0x1, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = ULONG_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = INT_MIN, SI = SI_MIN, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MIN, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = INT_MIN,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = INT_MIN, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = INT_MIN,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = INT_MAX, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = INT_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = UINT_MAX, SI = 0x1, R = 0",
> +                               .instr = TEST_PADDI(21, 22, 0x1, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = UINT_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = UINT_MAX, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = UINT_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA is r0, SI = SI_MIN, R = 0",
> +                               .instr = TEST_PADDI(21, 0, SI_MIN, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0x0,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = 0, SI = SI_MIN, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MIN, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0x0,
> +                                       .gpr[22] = 0x0,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA is r0, SI = 0, R = 1",
> +                               .instr = TEST_PADDI(21, 0, 0, 1),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA is r0, SI = SI_MIN, R = 1",
> +                               .instr = TEST_PADDI(21, 0, SI_MIN, 1),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                               }
> +                       }
> +               }
>         }
>  };
>
> diff --git a/arch/powerpc/lib/test_emulate_step_exec_instr.S b/arch/powerpc/lib/test_emulate_step_exec_instr.S
> index 1580f34f4f4f..aef53ee77a43 100644
> --- a/arch/powerpc/lib/test_emulate_step_exec_instr.S
> +++ b/arch/powerpc/lib/test_emulate_step_exec_instr.S
> @@ -81,6 +81,7 @@ _GLOBAL(exec_instr)
>
>         /* Placeholder for the test instruction */
>  1:     nop
> +       nop
>         patch_site 1b patch__exec_instr
>
>         /*
> --
> 2.17.1
>

Because of the alignment requirements of prefixed instructions, the
noops to be patched need to be aligned.
mpe, want me to send a new version?
--- a/arch/powerpc/lib/test_emulate_step_exec_instr.S
+++ b/arch/powerpc/lib/test_emulate_step_exec_instr.S
@@ -80,6 +80,7 @@ _GLOBAL(exec_instr)
        REST_NVGPRS(r31)

        /* Placeholder for the test instruction */
+.align 6
 1:     nop
        nop
        patch_site 1b patch__exec_instr

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox