LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
From: Linus Torvalds @ 2020-07-28 19:02 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Hillf Danton, Yang Shi, Yu Xu, Catalin Marinas,
	Hugh Dickins, Josef Bacik, Will Deacon, Linux-MM, Matthew Wilcox,
	Johannes Weiner, mm-commits, Andrew Morton, linuxppc-dev,
	Kirill A . Shutemov
In-Reply-To: <1595932767.wga6c4yy6a.astroid@bobo.none>

On Tue, Jul 28, 2020 at 3:53 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> The quirk is a problem with coprocessor where it's supposed to
> invalidate the translation after a fault but it doesn't, so we can get a
> read-only TLB stuck after something else does a RO->RW upgrade on the
> TLB. Something like that IIRC.  Coprocessors have their own MMU which
> lives in the nest not the core, so you need a global TLB flush to
> invalidate that thing.

So I assumed, but it does seem confused.

Why? Because if there are stale translations on the co-processor,
there's no guarantee that one of the CPU's will have them and take a
fault.

So I'm not seeing why a core CPU doing spurious TLB invalidation would
follow from "stale TLB in the Nest".

If anything, I think "we have a coprocessor that needs to never have
stale TLB entries" would impact the _regular_ TLB invalidates (by
update_mmu_cache()) and perhaps make those more aggressive, exactly
because the coprocessor may not handle the fault as gracefully.

I dunno. I don't know the coprocessor side well enough to judge, I'm
just looking at it from a conceptual standpoint.

          Linus

^ permalink raw reply

* Re: [PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct
From: David Hildenbrand @ 2020-07-28 18:22 UTC (permalink / raw)
  To: Scott Cheloha, linuxppc-dev
  Cc: Nathan Lynch, Michal Suchanek, Laurent Dufour, Rick Lindsley
In-Reply-To: <20200728165339.3031126-1-cheloha@linux.ibm.com>

On 28.07.20 18:53, Scott Cheloha wrote:
> At memory hot-remove time we can retrieve an LMB's nid from its
> corresponding memory_block.  There is no need to store the nid
> in multiple locations.
> 
> Note that lmb_to_memblock() uses find_memory_block() to get the
> corresponding memory_block.  As find_memory_block() runs in sub-linear
> time this approach is negligibly slower than what we do at present.
> 
> In exchange for this lookup at hot-remove time we no longer need to
> call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
> On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
> spares us an O(n^2) initialization during boot.
> 
> On systems with many LMBs that initialization overhead is palpable and
> disruptive.  For example, on a box with 249854 LMBs we're seeing
> drmem_init() take upwards of 30 seconds to complete:
> 
> [   53.721639] drmem: initializing drmem v2
> [   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
> [   80.604377] Modules linked in:
> [   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
> [   80.604397] NIP:  c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
> [   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
> [   80.604412] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000248  XER: 0000000d
> [   80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
> [   80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
> [   80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
> [   80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
> [   80.604431] GPR12: 0000000000000000 c00000001e8ec200
> [   80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
> [   80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
> [   80.604492] Call Trace:
> [   80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
> [   80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
> [   80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
> [   80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
> [   80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
> [   80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
> [   80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
> [   80.604567] Instruction dump:
> [   80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
> [   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
> [   89.047390] drmem: 249854 LMB(s)
> 
> With a patched kernel on the same machine we're no longer seeing the
> soft lockup.  drmem_init() now completes in negligible time, even when
> the LMB count is large.
> 

Yeah, as long as you remove_memory() in memory block granularity, this
is guaranteed to work.

Reviewed-by: David Hildenbrand <david@redhat.com>


-- 
Thanks,

David / dhildenb


^ permalink raw reply

* Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Laurent Dufour @ 2020-07-28 17:46 UTC (permalink / raw)
  To: Nathan Lynch, linuxppc-dev; +Cc: tyreld, cheloha
In-Reply-To: <20200728173741.717372-1-nathanl@linux.ibm.com>

Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
> The drmem lmb list can have hundreds of thousands of entries, and
> unfortunately lookups take the form of linear searches. As long as
> this is the case, traversals have the potential to monopolize the CPU
> and provoke lockup reports, workqueue stalls, and the like unless
> they explicitly yield.
> 
> Rather than placing cond_resched() calls within various
> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
> expression of the loop macro itself so users can't omit it.

Hi Nathan,

Is that not too much to call cond_resched() on every LMB?

Could that be less frequent, every 10, or 100, I don't really know ?

Cheers,
Laurent.
> 
> Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format")
> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
> ---
>   arch/powerpc/include/asm/drmem.h | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
> index 414d209f45bb..36d0ed04bda8 100644
> --- a/arch/powerpc/include/asm/drmem.h
> +++ b/arch/powerpc/include/asm/drmem.h
> @@ -8,6 +8,8 @@
>   #ifndef _ASM_POWERPC_LMB_H
>   #define _ASM_POWERPC_LMB_H
>   
> +#include <linux/sched.h>
> +
>   struct drmem_lmb {
>   	u64     base_addr;
>   	u32     drc_index;
> @@ -26,8 +28,14 @@ struct drmem_lmb_info {
>   
>   extern struct drmem_lmb_info *drmem_info;
>   
> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
> +{
> +	cond_resched();
> +	return ++lmb;
> +}
> +
>   #define for_each_drmem_lmb_in_range(lmb, start, end)		\
> -	for ((lmb) = (start); (lmb) < (end); (lmb)++)
> +	for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
>   
>   #define for_each_drmem_lmb(lmb)					\
>   	for_each_drmem_lmb_in_range((lmb),			\
> 


^ permalink raw reply

* [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Nathan Lynch @ 2020-07-28 17:37 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tyreld, cheloha, ldufour

The drmem lmb list can have hundreds of thousands of entries, and
unfortunately lookups take the form of linear searches. As long as
this is the case, traversals have the potential to monopolize the CPU
and provoke lockup reports, workqueue stalls, and the like unless
they explicitly yield.

Rather than placing cond_resched() calls within various
for_each_drmem_lmb() loop blocks in the code, put it in the iteration
expression of the loop macro itself so users can't omit it.

Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
---
 arch/powerpc/include/asm/drmem.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..36d0ed04bda8 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -8,6 +8,8 @@
 #ifndef _ASM_POWERPC_LMB_H
 #define _ASM_POWERPC_LMB_H
 
+#include <linux/sched.h>
+
 struct drmem_lmb {
 	u64     base_addr;
 	u32     drc_index;
@@ -26,8 +28,14 @@ struct drmem_lmb_info {
 
 extern struct drmem_lmb_info *drmem_info;
 
+static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
+{
+	cond_resched();
+	return ++lmb;
+}
+
 #define for_each_drmem_lmb_in_range(lmb, start, end)		\
-	for ((lmb) = (start); (lmb) < (end); (lmb)++)
+	for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
 
 #define for_each_drmem_lmb(lmb)					\
 	for_each_drmem_lmb_in_range((lmb),			\
-- 
2.25.4


^ permalink raw reply related

* [PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct
From: Scott Cheloha @ 2020-07-28 16:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nathan Lynch, Laurent Dufour, David Hildenbrand, Michal Suchanek,
	Rick Lindsley

At memory hot-remove time we can retrieve an LMB's nid from its
corresponding memory_block.  There is no need to store the nid
in multiple locations.

Note that lmb_to_memblock() uses find_memory_block() to get the
corresponding memory_block.  As find_memory_block() runs in sub-linear
time this approach is negligibly slower than what we do at present.

In exchange for this lookup at hot-remove time we no longer need to
call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
spares us an O(n^2) initialization during boot.

On systems with many LMBs that initialization overhead is palpable and
disruptive.  For example, on a box with 249854 LMBs we're seeing
drmem_init() take upwards of 30 seconds to complete:

[   53.721639] drmem: initializing drmem v2
[   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   80.604377] Modules linked in:
[   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
[   80.604397] NIP:  c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
[   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
[   80.604412] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000248  XER: 0000000d
[   80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
[   80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
[   80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
[   80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
[   80.604431] GPR12: 0000000000000000 c00000001e8ec200
[   80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
[   80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
[   80.604492] Call Trace:
[   80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
[   80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
[   80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
[   80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
[   80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
[   80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
[   80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
[   80.604567] Instruction dump:
[   80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
[   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
[   89.047390] drmem: 249854 LMB(s)

With a patched kernel on the same machine we're no longer seeing the
soft lockup.  drmem_init() now completes in negligible time, even when
the LMB count is large.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
---
 arch/powerpc/include/asm/drmem.h              | 21 -------------------
 arch/powerpc/mm/drmem.c                       |  6 +-----
 .../platforms/pseries/hotplug-memory.c        | 19 ++++++++++-------
 3 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..34e4e9b257f5 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -13,9 +13,6 @@ struct drmem_lmb {
 	u32     drc_index;
 	u32     aa_index;
 	u32     flags;
-#ifdef CONFIG_MEMORY_HOTPLUG
-	int	nid;
-#endif
 };
 
 struct drmem_lmb_info {
@@ -104,22 +101,4 @@ static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
 	lmb->aa_index = 0xffffffff;
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-	lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr);
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-	lmb->nid = -1;
-}
-#else
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-}
-#endif
-
 #endif /* _ASM_POWERPC_LMB_H */
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327cefbc6a..873fcfc7b875 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
 	if (!drmem_info->lmbs)
 		return;
 
-	for_each_drmem_lmb(lmb) {
+	for_each_drmem_lmb(lmb)
 		read_drconf_v1_cell(lmb, &prop);
-		lmb_set_nid(lmb);
-	}
 }
 
 static void __init init_drmem_v2_lmbs(const __be32 *prop)
@@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
 
 			lmb->aa_index = dr_cell.aa_index;
 			lmb->flags = dr_cell.flags;
-
-			lmb_set_nid(lmb);
 		}
 	}
 }
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5ace2f9a277e..7bf66fdcf916 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -356,25 +356,29 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
+	struct memory_block *mem_block;
 	unsigned long block_sz;
 	int rc;
 
 	if (!lmb_is_removable(lmb))
 		return -EINVAL;
 
+	mem_block = lmb_to_memblock(lmb);
+	if (mem_block == NULL)
+		return -EINVAL;
+
 	rc = dlpar_offline_lmb(lmb);
 	if (rc)
 		return rc;
 
 	block_sz = pseries_memory_block_size();
 
-	__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+	__remove_memory(mem_block->nid, lmb->base_addr, block_sz);
 
 	/* Update memory regions for memory remove */
 	memblock_remove(lmb->base_addr, block_sz);
 
 	invalidate_lmb_associativity_index(lmb);
-	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
 	return 0;
@@ -631,7 +635,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 static int dlpar_add_lmb(struct drmem_lmb *lmb)
 {
 	unsigned long block_sz;
-	int rc;
+	int nid, rc;
 
 	if (lmb->flags & DRCONF_MEM_ASSIGNED)
 		return -EINVAL;
@@ -642,11 +646,13 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 		return rc;
 	}
 
-	lmb_set_nid(lmb);
 	block_sz = memory_block_size_bytes();
 
+	/* Find the node id for this address. */
+	nid = memory_add_physaddr_to_nid(lmb->base_addr);
+
 	/* Add the memory */
-	rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
+	rc = __add_memory(nid, lmb->base_addr, block_sz);
 	if (rc) {
 		invalidate_lmb_associativity_index(lmb);
 		return rc;
@@ -654,9 +660,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	rc = dlpar_online_lmb(lmb);
 	if (rc) {
-		__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+		__remove_memory(nid, lmb->base_addr, block_sz);
 		invalidate_lmb_associativity_index(lmb);
-		lmb_clear_nid(lmb);
 	} else {
 		lmb->flags |= DRCONF_MEM_ASSIGNED;
 	}
-- 
2.24.1


^ permalink raw reply related

* [PATCH] selftests/powerpc: return skip code for spectre_v2
From: Thadeu Lima de Souza Cascardo @ 2020-07-28 15:50 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: cascardo, Shuah Khan, linuxppc-dev, linux-kselftest, linux-kernel

When running under older versions of qemu of under newer versions with old
machine types, some security features will not be reported to the guest.
This will lead the guest OS to consider itself Vulnerable to spectre_v2.

So, spectre_v2 test fails in such cases when the host is mitigated and miss
predictions cannot be detected as expected by the test.

Make it return the skip code instead, for this particular case. We don't
want to miss the case when the test fails and the system reports as
mitigated or not affected. But it is not a problem to miss failures when
the system reports as Vulnerable.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
---
 tools/testing/selftests/powerpc/security/spectre_v2.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/powerpc/security/spectre_v2.c b/tools/testing/selftests/powerpc/security/spectre_v2.c
index 8c6b982af2a8..d5445bfd63ed 100644
--- a/tools/testing/selftests/powerpc/security/spectre_v2.c
+++ b/tools/testing/selftests/powerpc/security/spectre_v2.c
@@ -183,6 +183,14 @@ int spectre_v2_test(void)
 		if (miss_percent > 15) {
 			printf("Branch misses > 15%% unexpected in this configuration!\n");
 			printf("Possible mis-match between reported & actual mitigation\n");
+			/* Such a mismatch may be caused by a guest system
+			 * reporting as vulnerable when the host is mitigated.
+			 * Return skip code to avoid detecting this as an
+			 * error. We are not vulnerable and reporting otherwise,
+			 * so missing such a mismatch is safe.
+			 */
+			if (state == VULNERABLE)
+				return 4;
 			return 1;
 		}
 		break;
-- 
2.25.1

^ permalink raw reply related

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
From: Valentin Schneider @ 2020-07-28 15:03 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Oliver O'Halloran, Jordan Niethe,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200727053230.19753-10-srikar@linux.vnet.ibm.com>


Hi,

On 27/07/20 06:32, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
>

So there's at least one arm64 platform out there with the same "pairs of
cores share L2" thing (Ampere eMAG), and that lives quite happily with the
default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
domain, and the whole system is covered by DIE.

Now arguably it's not a perfect representation; DIE doesn't have
SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
will impact all callsites using cpus_share_cache(): in the eMAG case, only
pairs of cores will be seen as sharing cache, even though *all* cores share
the same L3.

I'm trying to paint a picture of what the P9 topology looks like (the one
you showcase in your cover letter) to see if there are any similarities;
from what I gather in [1], wikichips and your cover letter, with P9 you can
have something like this in a single DIE (somewhat unsure about L3 setup;
it looks to be distributed?)

     +---------------------------------------------------------------------+
     |                                  L3                                 |
     +---------------+-+---------------+-+---------------+-+---------------+
     |       L2      | |       L2      | |       L2      | |       L2      |
     +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
     |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
     +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
     |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
     +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+

Which would lead to (ignoring the whole SMT CPU numbering shenanigans)

NUMA     [                                                   ...
DIE      [                                             ]
MC       [         ] [         ] [         ] [         ]
BIGCORE  [         ] [         ] [         ] [         ]
SMT      [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
         00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31  <other node here>

This however has MC == BIGCORE; what makes it you can have different spans
for these two domains? If it's not too much to ask, I'd love to have a P9
topology diagram.

[1]: 20200722081822.GG9290@linux.vnet.ibm.com

^ permalink raw reply

* Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
From: Baoquan He @ 2020-07-28 14:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-sh, Peter Zijlstra, Dave Hansen, linux-mips, Max Filippov,
	Paul Mackerras, sparclinux, linux-riscv, Will Deacon,
	Stafford Horne, Marek Szyprowski, linux-s390, linux-c6x-dev,
	Yoshinori Sato, x86, Russell King, Mike Rapoport,
	clang-built-linux, Ingo Molnar, Catalin Marinas, uclinux-h8-devel,
	linux-xtensa, openrisc, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Thomas Gleixner, linux-arm-kernel, Michal Simek,
	linux-mm, linuxppc-dev, linux-kernel, iommu, Palmer Dabbelt,
	Andrew Morton, Christoph Hellwig
In-Reply-To: <20200728141504.GC3655207@kernel.org>

On 07/28/20 at 05:15pm, Mike Rapoport wrote:
> On Tue, Jul 28, 2020 at 07:02:54PM +0800, Baoquan He wrote:
> > On 07/28/20 at 08:11am, Mike Rapoport wrote:
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > 
> > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > > regions to set node ID in memblock.reserved and than traverses
> > > memblock.reserved to update reserved_nodemask to include node IDs that were
> > > set in the first loop.
> > > 
> > > Remove redundant traversal over memblock.reserved and update
> > > reserved_nodemask while iterating over numa_meminfo.
> > > 
> > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > ---
> > >  arch/x86/mm/numa.c | 26 ++++++++++----------------
> > >  1 file changed, 10 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 8ee952038c80..4078abd33938 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -498,31 +498,25 @@ static void __init numa_clear_kernel_node_hotplug(void)
> > >  	 * and use those ranges to set the nid in memblock.reserved.
> > >  	 * This will split up the memblock regions along node
> > >  	 * boundaries and will set the node IDs as well.
> > > +	 *
> > > +	 * The nid will also be set in reserved_nodemask which is later
> > > +	 * used to clear MEMBLOCK_HOTPLUG flag.
> > > +	 *
> > > +	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> > > +	 *   numa_meminfo might not include all memblock.reserved
> > > +	 *   memory ranges, because quirks such as trim_snb_memory()
> > > +	 *   reserve specific pages for Sandy Bridge graphics.
> > > +	 *   These ranges will remain with nid == MAX_NUMNODES. ]
> > >  	 */
> > >  	for (i = 0; i < numa_meminfo.nr_blks; i++) {
> > >  		struct numa_memblk *mb = numa_meminfo.blk + i;
> > >  		int ret;
> > >  
> > >  		ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
> > > +		node_set(mb->nid, reserved_nodemask);
> > 
> > Really? This will set all node id into reserved_nodemask. But in the
> > current code, it's setting nid into memblock reserved region which
> > interleaves with numa_memoinfo, then get those nid and set it in
> > reserved_nodemask. This is so different, with my understanding. Please
> > correct me if I am wrong.
> 
> You are right, I've missed the intersections of numa_meminfo with
> memblock.reserved.
> 
> x86 interaction with membock is so, hmm, interesting...

Yeah, numa_clear_kernel_node_hotplug() intends to find out any memory node
which has reserved memory, then make it as unmovable. Setting all node
id into reserved_nodemask will break the use case of hot removing hotpluggable
boot memory after system bootup.


^ permalink raw reply

* Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
From: Mike Rapoport @ 2020-07-28 14:15 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-sh, Peter Zijlstra, Dave Hansen, linux-mips, Max Filippov,
	Paul Mackerras, sparclinux, linux-riscv, Will Deacon,
	Stafford Horne, Marek Szyprowski, linux-s390, linux-c6x-dev,
	Yoshinori Sato, x86, Russell King, Mike Rapoport,
	clang-built-linux, Ingo Molnar, Catalin Marinas, uclinux-h8-devel,
	linux-xtensa, openrisc, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Thomas Gleixner, linux-arm-kernel, Michal Simek,
	linux-mm, linuxppc-dev, linux-kernel, iommu, Palmer Dabbelt,
	Andrew Morton, Christoph Hellwig
In-Reply-To: <20200728110254.GA14854@MiWiFi-R3L-srv>

On Tue, Jul 28, 2020 at 07:02:54PM +0800, Baoquan He wrote:
> On 07/28/20 at 08:11am, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > regions to set node ID in memblock.reserved and than traverses
> > memblock.reserved to update reserved_nodemask to include node IDs that were
> > set in the first loop.
> > 
> > Remove redundant traversal over memblock.reserved and update
> > reserved_nodemask while iterating over numa_meminfo.
> > 
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  arch/x86/mm/numa.c | 26 ++++++++++----------------
> >  1 file changed, 10 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 8ee952038c80..4078abd33938 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -498,31 +498,25 @@ static void __init numa_clear_kernel_node_hotplug(void)
> >  	 * and use those ranges to set the nid in memblock.reserved.
> >  	 * This will split up the memblock regions along node
> >  	 * boundaries and will set the node IDs as well.
> > +	 *
> > +	 * The nid will also be set in reserved_nodemask which is later
> > +	 * used to clear MEMBLOCK_HOTPLUG flag.
> > +	 *
> > +	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> > +	 *   numa_meminfo might not include all memblock.reserved
> > +	 *   memory ranges, because quirks such as trim_snb_memory()
> > +	 *   reserve specific pages for Sandy Bridge graphics.
> > +	 *   These ranges will remain with nid == MAX_NUMNODES. ]
> >  	 */
> >  	for (i = 0; i < numa_meminfo.nr_blks; i++) {
> >  		struct numa_memblk *mb = numa_meminfo.blk + i;
> >  		int ret;
> >  
> >  		ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
> > +		node_set(mb->nid, reserved_nodemask);
> 
> Really? This will set all node id into reserved_nodemask. But in the
> current code, it's setting nid into memblock reserved region which
> interleaves with numa_memoinfo, then get those nid and set it in
> reserved_nodemask. This is so different, with my understanding. Please
> correct me if I am wrong.

You are right, I've missed the intersections of numa_meminfo with
memblock.reserved.

x86 interaction with membock is so, hmm, interesting...
 
> Thanks
> Baoquan
> 
> >  		WARN_ON_ONCE(ret);
> >  	}
> >  
> > -	/*
> > -	 * Now go over all reserved memblock regions, to construct a
> > -	 * node mask of all kernel reserved memory areas.
> > -	 *
> > -	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> > -	 *   numa_meminfo might not include all memblock.reserved
> > -	 *   memory ranges, because quirks such as trim_snb_memory()
> > -	 *   reserve specific pages for Sandy Bridge graphics. ]
> > -	 */
> > -	for_each_memblock(reserved, mb_region) {
> > -		int nid = memblock_get_region_node(mb_region);
> > -
> > -		if (nid != MAX_NUMNODES)
> > -			node_set(nid, reserved_nodemask);
> > -	}
> > -
> >  	/*
> >  	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
> >  	 * belonging to the reserved node mask.
> > -- 
> > 2.26.2
> > 
> > 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [RESEND PATCH v5 08/11] ppc64/kexec_file: setup backup region for kdump kernel
From: Michael Ellerman @ 2020-07-28 14:11 UTC (permalink / raw)
  To: Hari Bathini, Andrew Morton
  Cc: Pingfan Liu, Kexec-ml, Mimi Zohar, Nayna Jain, Petr Tesarik,
	Mahesh J Salgaonkar, Sourabh Jain, lkml, linuxppc-dev,
	Eric Biederman, Thiago Jung Bauermann, Dave Young, Vivek Goyal
In-Reply-To: <159579235754.5790.5203600072984600891.stgit@hbathini>

Hari Bathini <hbathini@linux.ibm.com> writes:
> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
> index a5c1442590b2..88408b17a7f6 100644
> --- a/arch/powerpc/kexec/file_load_64.c
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -697,6 +699,69 @@ static int update_usable_mem_fdt(void *fdt, struct crash_mem *usable_mem)
>  	return ret;
>  }
>  
> +/**
> + * load_backup_segment - Locate a memory hole to place the backup region.
> + * @image:               Kexec image.
> + * @kbuf:                Buffer contents and memory parameters.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int load_backup_segment(struct kimage *image, struct kexec_buf *kbuf)
> +{
> +	void *buf;
> +	int ret;
> +
> +	/* Setup a segment for backup region */
> +	buf = vzalloc(BACKUP_SRC_SIZE);

This worried me initially, because we can't copy from physically
discontiguous pages in real mode.

But as you explained this buffer is not used for copying.

I think if you move the large comment below up here, it would be
clearer.


> diff --git a/arch/powerpc/purgatory/trampoline_64.S b/arch/powerpc/purgatory/trampoline_64.S
> index 464af8e8a4cb..d4b52961f592 100644
> --- a/arch/powerpc/purgatory/trampoline_64.S
> +++ b/arch/powerpc/purgatory/trampoline_64.S
> @@ -43,14 +44,38 @@ master:
>  	mr	%r17,%r3	/* save cpu id to r17 */
>  	mr	%r15,%r4	/* save physical address in reg15 */
>  
> +	bl	0f		/* Work out where we're running */
> +0:	mflr	%r18

I know you just moved it, but this should use:

	bcl	20, 31, $+4
	mflr	%r18

Which is a special form of branch and link that doesn't unbalance the
link stack in the chip.

> +	/*
> +	 * Copy BACKUP_SRC_SIZE bytes from BACKUP_SRC_START to
> +	 * backup_start 8 bytes at a time.
> +	 *
> +	 * Use r3 = dest, r4 = src, r5 = size, r6 = count
> +	 */
> +	ld	%r3,(backup_start - 0b)(%r18)
> +	cmpdi	%cr0,%r3,0

I prefer spaces or tabs between arguments, eg:

	cmpdi	%cr0, %r3, 0

> +	beq	80f		/* skip if there is no backup region */

Local labels will make this clearer I think. eg:

	beq	.Lskip_copy

> +	lis	%r5,BACKUP_SRC_SIZE@h
> +	ori	%r5,%r5,BACKUP_SRC_SIZE@l
> +	cmpdi	%cr0,%r5,0
> +	beq	80f		/* skip if copy size is zero */
> +	lis	%r4,BACKUP_SRC_START@h
> +	ori	%r4,%r4,BACKUP_SRC_START@l
> +	li	%r6,0
> +70:

.Lcopy_loop:

> +	ldx	%r0,%r6,%r4
> +	stdx	%r0,%r6,%r3
> +	addi	%r6,%r6,8
> +	cmpld	%cr0,%r6,%r5
> +	blt	70b

	blt	.Lcopy_loop

> +

.Lskip_copy:

> +80:
>  	or	%r3,%r3,%r3	/* ok now to high priority, lets boot */
>  	lis	%r6,0x1
>  	mtctr	%r6		/* delay a bit for slaves to catch up */
>  	bdnz	.		/* before we overwrite 0-100 again */


cheers

^ permalink raw reply

* Re: [RESEND PATCH v5 07/11] ppc64/kexec_file: enable early kernel's OPAL calls
From: Michael Ellerman @ 2020-07-28 13:46 UTC (permalink / raw)
  To: Hari Bathini, Andrew Morton
  Cc: Pingfan Liu, Kexec-ml, Mimi Zohar, Nayna Jain, Petr Tesarik,
	Mahesh J Salgaonkar, Sourabh Jain, lkml, linuxppc-dev,
	Eric Biederman, Thiago Jung Bauermann, Dave Young, Vivek Goyal
In-Reply-To: <159579233676.5790.10701756666641782647.stgit@hbathini>

Hari Bathini <hbathini@linux.ibm.com> writes:
> Kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9
> to be filled with OPAL base & entry addresses respectively. Setting
> these registers allows the kernel to perform OPAL calls before the
> device tree is parsed.

I'm not convinced we want to do this.

If we do it becomes part of the kexec ABI and we have to honour it into
the future.

And in practice there are no non-development kernels built with OPAL early
debugging enabled, so it's not clear it actually helps anyone other than
developers.

cheers

> v4 -> v5:
> * New patch. Updated opal_base & opal_entry values in r8 & r9 respectively.
>   This change was part of the below dropped patch in v4:
>     - https://lore.kernel.org/patchwork/patch/1275667/
>
>
>  arch/powerpc/kexec/file_load_64.c      |   16 ++++++++++++++++
>  arch/powerpc/purgatory/trampoline_64.S |   15 +++++++++++++++
>  2 files changed, 31 insertions(+)
>
> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
> index 8df085a22fd7..a5c1442590b2 100644
> --- a/arch/powerpc/kexec/file_load_64.c
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -713,6 +713,8 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
>  			  const void *fdt, unsigned long kernel_load_addr,
>  			  unsigned long fdt_load_addr)
>  {
> +	struct device_node *dn = NULL;
> +	uint64_t val;
>  	int ret;
>  
>  	ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
> @@ -735,9 +737,23 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
>  			goto out;
>  	}
>  
> +	/* Setup OPAL base & entry values */
> +	dn = of_find_node_by_path("/ibm,opal");
> +	if (dn) {
> +		of_property_read_u64(dn, "opal-base-address", &val);
> +		ret = kexec_purgatory_get_set_symbol(image, "opal_base", &val,
> +						     sizeof(val), false);
> +		if (ret)
> +			goto out;
> +
> +		of_property_read_u64(dn, "opal-entry-address", &val);
> +		ret = kexec_purgatory_get_set_symbol(image, "opal_entry", &val,
> +						     sizeof(val), false);
> +	}
>  out:
>  	if (ret)
>  		pr_err("Failed to setup purgatory symbols");
> +	of_node_put(dn);
>  	return ret;
>  }
>  
> diff --git a/arch/powerpc/purgatory/trampoline_64.S b/arch/powerpc/purgatory/trampoline_64.S
> index a5a83c3f53e6..464af8e8a4cb 100644
> --- a/arch/powerpc/purgatory/trampoline_64.S
> +++ b/arch/powerpc/purgatory/trampoline_64.S
> @@ -61,6 +61,10 @@ master:
>  	li	%r4,28
>  	STWX_BE	%r17,%r3,%r4	/* Store my cpu as __be32 at byte 28 */
>  1:
> +	/* Load opal base and entry values in r8 & r9 respectively */
> +	ld	%r8,(opal_base - 0b)(%r18)
> +	ld	%r9,(opal_entry - 0b)(%r18)
> +
>  	/* load the kernel address */
>  	ld	%r4,(kernel - 0b)(%r18)
>  
> @@ -102,6 +106,17 @@ dt_offset:
>  	.8byte  0x0
>  	.size dt_offset, . - dt_offset
>  
> +	.balign 8
> +	.globl opal_base
> +opal_base:
> +	.8byte  0x0
> +	.size opal_base, . - opal_base
> +
> +	.balign 8
> +	.globl opal_entry
> +opal_entry:
> +	.8byte  0x0
> +	.size opal_entry, . - opal_entry
>  
>  	.data
>  	.balign 8

^ permalink raw reply

* Re: [RESEND PATCH v5 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel
From: Michael Ellerman @ 2020-07-28 13:44 UTC (permalink / raw)
  To: Hari Bathini, Andrew Morton
  Cc: Pingfan Liu, Kexec-ml, Mimi Zohar, Nayna Jain, Petr Tesarik,
	Mahesh J Salgaonkar, Sourabh Jain, lkml, linuxppc-dev,
	Eric Biederman, Thiago Jung Bauermann, Dave Young, Vivek Goyal
In-Reply-To: <159579231812.5790.16096865978767385505.stgit@hbathini>

Hari Bathini <hbathini@linux.ibm.com> writes:
> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
> index 2df6f4273ddd..8df085a22fd7 100644
> --- a/arch/powerpc/kexec/file_load_64.c
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -17,9 +17,21 @@
>  #include <linux/kexec.h>
>  #include <linux/of_fdt.h>
>  #include <linux/libfdt.h>
> +#include <linux/of_device.h>
>  #include <linux/memblock.h>
> +#include <linux/slab.h>
> +#include <asm/drmem.h>
>  #include <asm/kexec_ranges.h>
>  
> +struct umem_info {
> +	uint64_t *buf; /* data buffer for usable-memory property */
> +	uint32_t idx;  /* current index */
> +	uint32_t size; /* size allocated for the data buffer */

Use kernel types please, u64, u32.

> +	/* usable memory ranges to look up */
> +	const struct crash_mem *umrngs;

"umrngs".

Given it's part of the umem_info struct could it just be "ranges"?

> +};
> +
>  const struct kexec_file_ops * const kexec_file_loaders[] = {
>  	&kexec_elf64_ops,
>  	NULL
> @@ -74,6 +86,42 @@ static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
>  	return ret;
>  }
>  
> +/**
> + * get_usable_memory_ranges - Get usable memory ranges. This list includes
> + *                            regions like crashkernel, opal/rtas & tce-table,
> + *                            that kdump kernel could use.
> + * @mem_ranges:               Range list to add the memory ranges to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int get_usable_memory_ranges(struct crash_mem **mem_ranges)
> +{
> +	int ret;
> +
> +	/*
> +	 * prom code doesn't take kindly to missing low memory. So, add

I don't know what that's referring to, "prom code" is too vague.

> +	 * [0, crashk_res.end] instead of [crashk_res.start, crashk_res.end]
> +	 * to keep it happy.
> +	 */
> +	ret = add_mem_range(mem_ranges, 0, crashk_res.end + 1);
> +	if (ret)
> +		goto out;
> +
> +	ret = add_rtas_mem_range(mem_ranges);
> +	if (ret)
> +		goto out;
> +
> +	ret = add_opal_mem_range(mem_ranges);
> +	if (ret)
> +		goto out;
> +
> +	ret = add_tce_mem_ranges(mem_ranges);
> +out:
> +	if (ret)
> +		pr_err("Failed to setup usable memory ranges\n");
> +	return ret;
> +}
> +
>  /**
>   * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
>   *                              in the memory regions between buf_min & buf_max
> @@ -273,6 +321,382 @@ static int locate_mem_hole_bottom_up_ppc64(struct kexec_buf *kbuf,
>  	return ret;
>  }
>  
> +/**
> + * check_realloc_usable_mem - Reallocate buffer if it can't accommodate entries
> + * @um_info:                  Usable memory buffer and ranges info.
> + * @cnt:                      No. of entries to accommodate.
> + *
> + * Frees up the old buffer if memory reallocation fails.
> + *
> + * Returns buffer on success, NULL on error.
> + */
> +static uint64_t *check_realloc_usable_mem(struct umem_info *um_info, int cnt)
> +{
> +	void *tbuf;
> +
> +	if (um_info->size >=
> +	    ((um_info->idx + cnt) * sizeof(*(um_info->buf))))
> +		return um_info->buf;

This is awkward.

AFAICS you only use um_info->size here, so instead why not store the
number of u64s you have space for, as num for example.

Then the above comparison becomes:

	if (um_info->num >= (um_info->idx + count))

Then you only have to calculate the size internally here for the
realloc.

> +
> +	um_info->size += MEM_RANGE_CHUNK_SZ;

	new_size = um_info->size + MEM_RANGE_CHUNK_SZ;
	tbuf = krealloc(um_info->buf, new_size, GFP_KERNEL);

> +	tbuf = krealloc(um_info->buf, um_info->size, GFP_KERNEL);
> +	if (!tbuf) {
> +		um_info->size -= MEM_RANGE_CHUNK_SZ;

Then you can drop this.

> +		return NULL;
> +	}

	um_info->size = new_size;

> +
> +	memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ);

Just pass __GFP_ZERO to krealloc?

> +	return tbuf;
> +}
> +
> +/**
> + * add_usable_mem - Add the usable memory ranges within the given memory range
> + *                  to the buffer
> + * @um_info:        Usable memory buffer and ranges info.
> + * @base:           Base address of memory range to look for.
> + * @end:            End address of memory range to look for.
> + * @cnt:            No. of usable memory ranges added to buffer.

One caller never uses this AFAICS.

Couldn't the other caller just compare the um_info->idx before and after
the call, and avoid another pass by reference parameter.

> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int add_usable_mem(struct umem_info *um_info, uint64_t base,
> +			  uint64_t end, int *cnt)
> +{
> +	uint64_t loc_base, loc_end, *buf;
> +	const struct crash_mem *umrngs;
> +	int i, add;

add should be bool.

> +	*cnt = 0;
> +	umrngs = um_info->umrngs;
> +	for (i = 0; i < umrngs->nr_ranges; i++) {
> +		add = 0;
> +		loc_base = umrngs->ranges[i].start;
> +		loc_end = umrngs->ranges[i].end;
> +		if (loc_base >= base && loc_end <= end)
> +			add = 1;
> +		else if (base < loc_end && end > loc_base) {
> +			if (loc_base < base)
> +				loc_base = base;
> +			if (loc_end > end)
> +				loc_end = end;
> +			add = 1;
> +		}
> +
> +		if (add) {
> +			buf = check_realloc_usable_mem(um_info, 2);
> +			if (!buf)
> +				return -ENOMEM;
> +
> +			um_info->buf = buf;
> +			buf[um_info->idx++] = cpu_to_be64(loc_base);
> +			buf[um_info->idx++] =
> +					cpu_to_be64(loc_end - loc_base + 1);
> +			(*cnt)++;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * kdump_setup_usable_lmb - This is a callback function that gets called by
> + *                          walk_drmem_lmbs for every LMB to set its
> + *                          usable memory ranges.
> + * @lmb:                    LMB info.
> + * @usm:                    linux,drconf-usable-memory property value.
> + * @data:                   Pointer to usable memory buffer and ranges info.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int kdump_setup_usable_lmb(struct drmem_lmb *lmb, const __be32 **usm,
> +				  void *data)
> +{
> +	struct umem_info *um_info;
> +	uint64_t base, end, *buf;
> +	int cnt, tmp_idx, ret;
> +
> +	/*
> +	 * kdump load isn't supported on kernels already booted with
> +	 * linux,drconf-usable-memory property.
> +	 */
> +	if (*usm) {
> +		pr_err("linux,drconf-usable-memory property already exists!");
> +		return -EINVAL;
> +	}
> +
> +	um_info = data;
> +	tmp_idx = um_info->idx;
> +	buf = check_realloc_usable_mem(um_info, 1);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	um_info->idx++;
> +	um_info->buf = buf;
> +	base = lmb->base_addr;
> +	end = base + drmem_lmb_size() - 1;
> +	ret = add_usable_mem(um_info, base, end, &cnt);
> +	if (!ret)
> +		um_info->buf[tmp_idx] = cpu_to_be64(cnt);
> +
> +	return ret;
> +}
> +
> +/**
> + * get_node_path_size - Get the full path length of the given node.
> + * @dn:                 Device Node.
> + *
> + * Also, counts '\0' at the end of the path.
> + * For example, /memory@0 will be "/memory@0\0" => 10 bytes.
> + *
> + * Returns the string size of the node's full path.
> + */
> +static int get_node_path_size(struct device_node *dn)
> +{
> +	int len = 0;
> +
> +	if (!dn)
> +		return 0;
> +
> +	/* Root node */
> +	if (!(dn->parent))
> +		return 2;
> +
> +	while (dn) {
> +		len += strlen(dn->full_name) + 1;
> +		dn = dn->parent;
> +	}
> +
> +	return len;
> +}
> +
> +/**
> + * get_node_path - Get the full path of the given node.
> + * @node:          Device node.
> + *
> + * Allocates buffer for node path. The caller must free the buffer
> + * after use.
> + *
> + * Returns buffer with path on success, NULL otherwise.
> + */
> +static char *get_node_path(struct device_node *node)
> +{


As discussed this can probably be replaced with snprintf(buf, "%pOF") ?


cheers

^ permalink raw reply

* Re: [PATCH v3 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states
From: Pratik Sampat @ 2020-07-28 13:30 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Gautham R. Shenoy, pratik.r.sampat, Linux PM, Daniel Lezcano,
	Rafael J. Wysocki, linuxppc-dev, Nicholas Piggin, Paul Mackerras,
	linux-kselftest, Shuah Khan, srivatsa, Linux Kernel Mailing List
In-Reply-To: <CAJZ5v0j3ip77opkaW3Rtn0cqT7VTL_8goctFBDVehWoZowDY0Q@mail.gmail.com>

Hello Rafael,


On 27/07/20 7:12 pm, Rafael J. Wysocki wrote:
> On Tue, Jul 21, 2020 at 2:43 PM Pratik Rajesh Sampat
> <psampat@linux.ibm.com> wrote:
>> Fire directed smp_call_function_single IPIs from a specified source
>> CPU to the specified target CPU to reduce the noise we have to wade
>> through in the trace log.
> And what's the purpose of it?

The idea for this comes from that fact that estimating wake-up
latencies and residencies for stop states is not an easy task.

The purpose is essentially to determine wakeup latencies, that are
caused by either, an IPI or a timer and compare with the advertised
wakeup latencies for each stop state.

This might help in determining the accuracy of our advertised values
and/or if they need any re-calibration.

>> The module is based on the idea written by Srivatsa Bhat and maintained
>> by Vaidyanathan Srinivasan internally.
>>
>> Queue HR timer and measure jitter. Wakeup latency measurement for idle
>> states using hrtimer.  Echo a value in ns to timer_test_function and
>> watch trace. A HRtimer will be queued and when it fires the expected
>> wakeup vs actual wakeup is computes and delay printed in ns.
>>
>> Implemented as a module which utilizes debugfs so that it can be
>> integrated with selftests.
>>
>> To include the module, check option and include as module
>> kernel hacking -> Cpuidle latency selftests
>>
>> [srivatsa.bhat@linux.vnet.ibm.com: Initial implementation in
>>   cpidle/sysfs]
>>
>> [svaidy@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
>>   and fix some of the time calculation]
>>
>> [ego@linux.vnet.ibm.com: Fix some whitespace and tab errors and
>>   increase the resolution of IPI wakeup]
>>
>> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
>> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
>> ---
>>   drivers/cpuidle/Makefile               |   1 +
>>   drivers/cpuidle/test-cpuidle_latency.c | 150 +++++++++++++++++++++++++
>>   lib/Kconfig.debug                      |  10 ++
>>   3 files changed, 161 insertions(+)
>>   create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
>>
>> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
>> index f07800cbb43f..2ae05968078c 100644
>> --- a/drivers/cpuidle/Makefile
>> +++ b/drivers/cpuidle/Makefile
>> @@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
>>   obj-$(CONFIG_DT_IDLE_STATES)             += dt_idle_states.o
>>   obj-$(CONFIG_ARCH_HAS_CPU_RELAX)         += poll_state.o
>>   obj-$(CONFIG_HALTPOLL_CPUIDLE)           += cpuidle-haltpoll.o
>> +obj-$(CONFIG_IDLE_LATENCY_SELFTEST)      += test-cpuidle_latency.o
>>
>>   ##################################################################################
>>   # ARM SoC drivers
>> diff --git a/drivers/cpuidle/test-cpuidle_latency.c b/drivers/cpuidle/test-cpuidle_latency.c
>> new file mode 100644
>> index 000000000000..61574665e972
>> --- /dev/null
>> +++ b/drivers/cpuidle/test-cpuidle_latency.c
>> @@ -0,0 +1,150 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Module-based API test facility for cpuidle latency using IPIs and timers
> I'd like to see a more detailed description of what it does and how it
> works here.

Right, I'll add that.
Based on comments from Daniel I have also been working on a
user-space only variant of this test as that does seem like
a better way to go.

The only downside is that the latency will be higher, but as we are
taking baseline measurements the diff of that from our observed reading
should still remain the same. Just that the test will take longer to run.
I'm yet to accurately confirm this.

I would appreciate your thoughts on that.

>> + */
>> +
>> +#include <linux/debugfs.h>
>> +#include <linux/kernel.h>
>> +#include <linux/module.h>
>> +
>> +/* IPI based wakeup latencies */
>> +struct latency {
>> +       unsigned int src_cpu;
>> +       unsigned int dest_cpu;
>> +       ktime_t time_start;
>> +       ktime_t time_end;
>> +       u64 latency_ns;
>> +} ipi_wakeup;
>> +
>> +static void measure_latency(void *info)
>> +{
>> +       struct latency *v;
>> +       ktime_t time_diff;
>> +
>> +       v = (struct latency *)info;
>> +       v->time_end = ktime_get();
>> +       time_diff = ktime_sub(v->time_end, v->time_start);
>> +       v->latency_ns = ktime_to_ns(time_diff);
>> +}
>> +
>> +void run_smp_call_function_test(unsigned int cpu)
>> +{
>> +       ipi_wakeup.src_cpu = smp_processor_id();
>> +       ipi_wakeup.dest_cpu = cpu;
>> +       ipi_wakeup.time_start = ktime_get();
>> +       smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
>> +}
>> +
>> +/* Timer based wakeup latencies */
>> +struct timer_data {
>> +       unsigned int src_cpu;
>> +       u64 timeout;
>> +       ktime_t time_start;
>> +       ktime_t time_end;
>> +       struct hrtimer timer;
>> +       u64 timeout_diff_ns;
>> +} timer_wakeup;
>> +
>> +static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
>> +{
>> +       struct timer_data *w;
>> +       ktime_t time_diff;
>> +
>> +       w = container_of(hrtimer, struct timer_data, timer);
>> +       w->time_end = ktime_get();
>> +
>> +       time_diff = ktime_sub(w->time_end, w->time_start);
>> +       time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
>> +       w->timeout_diff_ns = ktime_to_ns(time_diff);
>> +       return HRTIMER_NORESTART;
>> +}
>> +
>> +static void run_timer_test(unsigned int ns)
>> +{
>> +       hrtimer_init(&timer_wakeup.timer, CLOCK_MONOTONIC,
>> +                    HRTIMER_MODE_REL);
>> +       timer_wakeup.timer.function = timer_called;
>> +       timer_wakeup.time_start = ktime_get();
>> +       timer_wakeup.src_cpu = smp_processor_id();
>> +       timer_wakeup.timeout = ns;
>> +
>> +       hrtimer_start(&timer_wakeup.timer, ns_to_ktime(ns),
>> +                     HRTIMER_MODE_REL_PINNED);
>> +}
>> +
>> +static struct dentry *dir;
>> +
>> +static int cpu_read_op(void *data, u64 *value)
>> +{
>> +       *value = ipi_wakeup.dest_cpu;
>> +       return 0;
>> +}
>> +
>> +static int cpu_write_op(void *data, u64 value)
>> +{
>> +       run_smp_call_function_test(value);
>> +       return 0;
>> +}
>> +DEFINE_SIMPLE_ATTRIBUTE(ipi_ops, cpu_read_op, cpu_write_op, "%llu\n");
>> +
>> +static int timeout_read_op(void *data, u64 *value)
>> +{
>> +       *value = timer_wakeup.timeout;
>> +       return 0;
>> +}
>> +
>> +static int timeout_write_op(void *data, u64 value)
>> +{
>> +       run_timer_test(value);
>> +       return 0;
>> +}
>> +DEFINE_SIMPLE_ATTRIBUTE(timeout_ops, timeout_read_op, timeout_write_op, "%llu\n");
>> +
>> +static int __init latency_init(void)
>> +{
>> +       struct dentry *temp;
>> +
>> +       dir = debugfs_create_dir("latency_test", 0);
>> +       if (!dir) {
>> +               pr_alert("latency_test: failed to create /sys/kernel/debug/latency_test\n");
>> +               return -1;
>> +       }
>> +       temp = debugfs_create_file("ipi_cpu_dest",
>> +                                  0666,
>> +                                  dir,
>> +                                  NULL,
>> +                                  &ipi_ops);
>> +       if (!temp) {
>> +               pr_alert("latency_test: failed to create /sys/kernel/debug/ipi_cpu_dest\n");
>> +               return -1;
>> +       }
>> +       debugfs_create_u64("ipi_latency_ns", 0444, dir, &ipi_wakeup.latency_ns);
>> +       debugfs_create_u32("ipi_cpu_src", 0444, dir, &ipi_wakeup.src_cpu);
>> +
>> +       temp = debugfs_create_file("timeout_expected_ns",
>> +                                  0666,
>> +                                  dir,
>> +                                  NULL,
>> +                                  &timeout_ops);
>> +       if (!temp) {
>> +               pr_alert("latency_test: failed to create /sys/kernel/debug/timeout_expected_ns\n");
>> +               return -1;
>> +       }
>> +       debugfs_create_u64("timeout_diff_ns", 0444, dir, &timer_wakeup.timeout_diff_ns);
>> +       debugfs_create_u32("timeout_cpu_src", 0444, dir, &timer_wakeup.src_cpu);
>> +       pr_info("Latency Test module loaded\n");
>> +       return 0;
>> +}
>> +
>> +static void __exit latency_cleanup(void)
>> +{
>> +       pr_info("Cleaning up Latency Test module.\n");
>> +       debugfs_remove_recursive(dir);
>> +}
>> +
>> +module_init(latency_init);
>> +module_exit(latency_cleanup);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("IBM Corporation");
>> +MODULE_DESCRIPTION("Measuring idle latency for IPIs and Timers");
>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>> index d74ac0fd6b2d..e2283790245a 100644
>> --- a/lib/Kconfig.debug
>> +++ b/lib/Kconfig.debug
>> @@ -1375,6 +1375,16 @@ config DEBUG_KOBJECT
>>            If you say Y here, some extra kobject debugging messages will be sent
>>            to the syslog.
>>
>> +config IDLE_LATENCY_SELFTEST
>> +       tristate "Cpuidle latency selftests"
>> +       depends on CPU_IDLE
>> +       help
>> +         This option provides a kernel module that runs tests using the IPI and
>> +         timers to measure latency.
> What latency does it measure?

It measures latencies incurred on wakeup after an IPI and a timer interrupt.

>> +
>> +         Say M if you want these self tests to build as a module.
>> +         Say N if you are unsure.
>> +
>>   config DEBUG_KOBJECT_RELEASE
>>          bool "kobject release debugging"
>>          depends on DEBUG_OBJECTS_TIMERS
>> --
>> 2.25.4
>>
Thanks,
Pratik


^ permalink raw reply

* [PATCH v4 3/3] powerpc test_emulate_step: add testcases for divde[.] and divdeu[.] instructions
From: Balamuruhan S @ 2020-07-28 13:03 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, jniethe5, Balamuruhan S, paulus, sandipan,
	naveen.n.rao, linuxppc-dev
In-Reply-To: <20200728130308.1790982-1-bala24@linux.ibm.com>

add testcases for divde, divde., divdeu, divdeu. emulated
instructions to cover few scenarios,
        * with same dividend and divisor to have undefine RT
          for divdeu[.]
        * with divide by zero to have undefine RT for both
          divde[.] and divdeu[.]
        * with negative dividend to cover -|divisor| < r <= 0 if
          the dividend is negative for divde[.]
        * normal case with proper dividend and divisor for both
          divde[.] and divdeu[.]

Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 arch/powerpc/lib/test_emulate_step.c | 156 +++++++++++++++++++++++++++
 1 file changed, 156 insertions(+)

diff --git a/arch/powerpc/lib/test_emulate_step.c b/arch/powerpc/lib/test_emulate_step.c
index d242e9f72e0c..0a201b771477 100644
--- a/arch/powerpc/lib/test_emulate_step.c
+++ b/arch/powerpc/lib/test_emulate_step.c
@@ -1019,6 +1019,162 @@ static struct compute_test compute_tests[] = {
 			}
 		}
 	},
+	{
+		.mnemonic = "divde",
+		.subtests = {
+			{
+				.descr = "RA = LONG_MIN, RB = LONG_MIN",
+				.instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)),
+				.regs = {
+					.gpr[21] = LONG_MIN,
+					.gpr[22] = LONG_MIN,
+				}
+			},
+			{
+				.descr = "RA = 1L, RB = 0",
+				.instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)),
+				.flags = IGNORE_GPR(20),
+				.regs = {
+					.gpr[21] = 1L,
+					.gpr[22] = 0,
+				}
+			},
+			{
+				.descr = "RA = LONG_MIN, RB = LONG_MAX",
+				.instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)),
+				.regs = {
+					.gpr[21] = LONG_MIN,
+					.gpr[22] = LONG_MAX,
+				}
+			}
+		}
+	},
+	{
+		.mnemonic = "divde.",
+		.subtests = {
+			{
+				.descr = "RA = LONG_MIN, RB = LONG_MIN",
+				.instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 22)),
+				.regs = {
+					.gpr[21] = LONG_MIN,
+					.gpr[22] = LONG_MIN,
+				}
+			},
+			{
+				.descr = "RA = 1L, RB = 0",
+				.instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 22)),
+				.flags = IGNORE_GPR(20),
+				.regs = {
+					.gpr[21] = 1L,
+					.gpr[22] = 0,
+				}
+			},
+			{
+				.descr = "RA = LONG_MIN, RB = LONG_MAX",
+				.instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 22)),
+				.regs = {
+					.gpr[21] = LONG_MIN,
+					.gpr[22] = LONG_MAX,
+				}
+			}
+		}
+	},
+	{
+		.mnemonic = "divdeu",
+		.subtests = {
+			{
+				.descr = "RA = LONG_MIN, RB = LONG_MIN",
+				.instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)),
+				.flags = IGNORE_GPR(20),
+				.regs = {
+					.gpr[21] = LONG_MIN,
+					.gpr[22] = LONG_MIN,
+				}
+			},
+			{
+				.descr = "RA = 1L, RB = 0",
+				.instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)),
+				.flags = IGNORE_GPR(20),
+				.regs = {
+					.gpr[21] = 1L,
+					.gpr[22] = 0,
+				}
+			},
+			{
+				.descr = "RA = LONG_MIN, RB = LONG_MAX",
+				.instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)),
+				.regs = {
+					.gpr[21] = LONG_MIN,
+					.gpr[22] = LONG_MAX,
+				}
+			},
+			{
+				.descr = "RA = LONG_MAX - 1, RB = LONG_MAX",
+				.instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)),
+				.regs = {
+					.gpr[21] = LONG_MAX - 1,
+					.gpr[22] = LONG_MAX,
+				}
+			},
+			{
+				.descr = "RA = LONG_MIN + 1, RB = LONG_MIN",
+				.instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)),
+				.flags = IGNORE_GPR(20),
+				.regs = {
+					.gpr[21] = LONG_MIN + 1,
+					.gpr[22] = LONG_MIN,
+				}
+			}
+		}
+	},
+	{
+		.mnemonic = "divdeu.",
+		.subtests = {
+			{
+				.descr = "RA = LONG_MIN, RB = LONG_MIN",
+				.instr = ppc_inst(PPC_RAW_DIVDEU_DOT(20, 21, 22)),
+				.flags = IGNORE_GPR(20),
+				.regs = {
+					.gpr[21] = LONG_MIN,
+					.gpr[22] = LONG_MIN,
+				}
+			},
+			{
+				.descr = "RA = 1L, RB = 0",
+				.instr = ppc_inst(PPC_RAW_DIVDEU_DOT(20, 21, 22)),
+				.flags = IGNORE_GPR(20),
+				.regs = {
+					.gpr[21] = 1L,
+					.gpr[22] = 0,
+				}
+			},
+			{
+				.descr = "RA = LONG_MIN, RB = LONG_MAX",
+				.instr = ppc_inst(PPC_RAW_DIVDEU_DOT(20, 21, 22)),
+				.regs = {
+					.gpr[21] = LONG_MIN,
+					.gpr[22] = LONG_MAX,
+				}
+			},
+			{
+				.descr = "RA = LONG_MAX - 1, RB = LONG_MAX",
+				.instr = ppc_inst(PPC_RAW_DIVDEU_DOT(20, 21, 22)),
+				.regs = {
+					.gpr[21] = LONG_MAX - 1,
+					.gpr[22] = LONG_MAX,
+				}
+			},
+			{
+				.descr = "RA = LONG_MIN + 1, RB = LONG_MIN",
+				.instr = ppc_inst(PPC_RAW_DIVDEU_DOT(20, 21, 22)),
+				.flags = IGNORE_GPR(20),
+				.regs = {
+					.gpr[21] = LONG_MIN + 1,
+					.gpr[22] = LONG_MIN,
+				}
+			}
+		}
+	},
 	{
 		.mnemonic = "paddi",
 		.cpu_feature = CPU_FTR_ARCH_31,
-- 
2.24.1


^ permalink raw reply related

* [PATCH v4 2/3] powerpc sstep: add support for divde[.] and divdeu[.] instructions
From: Balamuruhan S @ 2020-07-28 13:03 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, jniethe5, Balamuruhan S, paulus, sandipan,
	naveen.n.rao, linuxppc-dev
In-Reply-To: <20200728130308.1790982-1-bala24@linux.ibm.com>

This patch adds emulation support for divde, divdeu instructions,
	* Divide Doubleword Extended (divde[.])
	* Divide Doubleword Extended Unsigned (divdeu[.])

Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 arch/powerpc/lib/sstep.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index c58ea9e787cb..caee8cc77e19 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1806,7 +1806,18 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 			op->val = (int) regs->gpr[ra] /
 				(int) regs->gpr[rb];
 			goto arith_done;
-
+#ifdef __powerpc64__
+		case 425:	/* divde[.] */
+			asm volatile(PPC_DIVDE(%0, %1, %2) :
+				"=r" (op->val) : "r" (regs->gpr[ra]),
+				"r" (regs->gpr[rb]));
+			goto arith_done;
+		case 393:	/* divdeu[.] */
+			asm volatile(PPC_DIVDEU(%0, %1, %2) :
+				"=r" (op->val) : "r" (regs->gpr[ra]),
+				"r" (regs->gpr[rb]));
+			goto arith_done;
+#endif
 		case 755:	/* darn */
 			if (!cpu_has_feature(CPU_FTR_ARCH_300))
 				return -1;
-- 
2.24.1


^ permalink raw reply related

* [PATCH v4 1/3] powerpc ppc-opcode: add divde and divdeu opcodes
From: Balamuruhan S @ 2020-07-28 13:03 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, jniethe5, Balamuruhan S, paulus, sandipan,
	naveen.n.rao, linuxppc-dev
In-Reply-To: <20200728130308.1790982-1-bala24@linux.ibm.com>

include instruction opcodes for divde and divdeu as macros.

Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/ppc-opcode.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 4c0bdafb6a7b..a6e3700c4566 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -466,6 +466,10 @@
 #define PPC_RAW_MULI(d, a, i)		(0x1c000000 | ___PPC_RT(d) | ___PPC_RA(a) | IMM_L(i))
 #define PPC_RAW_DIVWU(d, a, b)		(0x7c000396 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_RAW_DIVDU(d, a, b)		(0x7c000392 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_RAW_DIVDE(t, a, b)		(0x7c000352 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_RAW_DIVDE_DOT(t, a, b)	(0x7c000352 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b) | 0x1)
+#define PPC_RAW_DIVDEU(t, a, b)		(0x7c000312 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_RAW_DIVDEU_DOT(t, a, b)	(0x7c000312 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b) | 0x1)
 #define PPC_RAW_AND(d, a, b)		(0x7c000038 | ___PPC_RA(d) | ___PPC_RS(a) | ___PPC_RB(b))
 #define PPC_RAW_ANDI(d, a, i)		(0x70000000 | ___PPC_RA(d) | ___PPC_RS(a) | IMM_L(i))
 #define PPC_RAW_AND_DOT(d, a, b)	(0x7c000039 | ___PPC_RA(d) | ___PPC_RS(a) | ___PPC_RB(b))
@@ -510,6 +514,8 @@
 #define PPC_DARN(t, l)		stringify_in_c(.long PPC_RAW_DARN(t, l))
 #define	PPC_DCBAL(a, b)		stringify_in_c(.long PPC_RAW_DCBAL(a, b))
 #define	PPC_DCBZL(a, b)		stringify_in_c(.long PPC_RAW_DCBZL(a, b))
+#define	PPC_DIVDE(t, a, b)	stringify_in_c(.long PPC_RAW_DIVDE(t, a, b))
+#define	PPC_DIVDEU(t, a, b)	stringify_in_c(.long PPC_RAW_DIVDEU(t, a, b))
 #define PPC_LQARX(t, a, b, eh)	stringify_in_c(.long PPC_RAW_LQARX(t, a, b, eh))
 #define PPC_LDARX(t, a, b, eh)	stringify_in_c(.long PPC_RAW_LDARX(t, a, b, eh))
 #define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_RAW_LWARX(t, a, b, eh))
-- 
2.24.1


^ permalink raw reply related

* [PATCH v4 0/3] Add support for divde[.] and divdeu[.] instruction emulation
From: Balamuruhan S @ 2020-07-28 13:03 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, jniethe5, Balamuruhan S, paulus, sandipan,
	naveen.n.rao, linuxppc-dev

Hi All,

This patchset adds support to emulate divde, divde., divdeu and divdeu.
instructions and testcases for it.

Resend v4: rebased on latest powerpc next branch

Changes in v4:
-------------
Fix review comments from Naveen,
* replace TEST_DIVDEU() instead of wrongly used TEST_DIVDEU_DOT() in
  divdeu testcase.
* Include `acked-by` tag from Naveen for the series.
* Rebase it on latest mpe's merge tree.

Changes in v3:
-------------
* Fix suggestion from Sandipan to remove `PPC_INST_DIVDE_DOT` and
  `PPC_INST_DIVDEU_DOT` opcode macros defined in ppc-opcode.h, reuse
  `PPC_INST_DIVDE` and `PPC_INST_DIVDEU` in test_emulate_step.c to
  derive them respectively.

Changes in v2:
-------------
* Fix review comments from Paul to make divde_dot and divdeu_dot simple
  by using divde and divdeu, then goto `arith_done` instead of
  `compute_done`.
* Include `Reviewed-by` tag from Sandipan Das.
* Rebase with recent mpe's merge tree.

I would request for your review and suggestions for making it better.

Boot Log:
--------
:: ::
:: ::
291494043: (291493996): [    0.352649][    T1] emulate_step_test: divde          : RA = LONG_MIN, RB = LONG_MIN                       PASS
291517665: (291517580): [    0.352695][    T1] emulate_step_test: divde          : RA = 1L, RB = 0                                    PASS
291541357: (291541234): [    0.352742][    T1] emulate_step_test: divde          : RA = LONG_MIN, RB = LONG_MAX                       PASS
291565107: (291564946): [    0.352788][    T1] emulate_step_test: divde.         : RA = LONG_MIN, RB = LONG_MIN                       PASS
291588757: (291588558): [    0.352834][    T1] emulate_step_test: divde.         : RA = 1L, RB = 0                                    PASS
291612477: (291612240): [    0.352881][    T1] emulate_step_test: divde.         : RA = LONG_MIN, RB = LONG_MAX                       PASS
291636201: (291635926): [    0.352927][    T1] emulate_step_test: divdeu         : RA = LONG_MIN, RB = LONG_MIN                       PASS
291659830: (291659517): [    0.352973][    T1] emulate_step_test: divdeu         : RA = 1L, RB = 0                                    PASS
291683529: (291683178): [    0.353019][    T1] emulate_step_test: divdeu         : RA = LONG_MIN, RB = LONG_MAX                       PASS
291707248: (291706859): [    0.353066][    T1] emulate_step_test: divdeu         : RA = LONG_MAX - 1, RB = LONG_MAX                   PASS
291730962: (291730535): [    0.353112][    T1] emulate_step_test: divdeu         : RA = LONG_MIN + 1, RB = LONG_MIN                   PASS
291754714: (291754249): [    0.353158][    T1] emulate_step_test: divdeu.        : RA = LONG_MIN, RB = LONG_MIN                       PASS
291778371: (291777868): [    0.353205][    T1] emulate_step_test: divdeu.        : RA = 1L, RB = 0                                    PASS
291802098: (291801557): [    0.353251][    T1] emulate_step_test: divdeu.        : RA = LONG_MIN, RB = LONG_MAX                       PASS
291825844: (291825265): [    0.353297][    T1] emulate_step_test: divdeu.        : RA = LONG_MAX - 1, RB = LONG_MAX                   PASS
291849586: (291848969): [    0.353344][    T1] emulate_step_test: divdeu.        : RA = LONG_MIN + 1, RB = LONG_MIN                   PASS
:: ::
:: ::
292520225: (292519608): [    0.354654][    T1] registered taskstats version 1
292584751: (292584134): [    0.354780][    T1] pstore: Using crash dump compression: deflate
296454422: (296453805): [    0.362338][    T1] Freeing unused kernel memory: 1408K
296467838: (296467221): [    0.362364][    T1] This architecture does not have kernel memory protection.
296485387: (296484770): [    0.362398][    T1] Run /init as init process
297987339: (297986761): [    0.365332][   T46] mount (46) used greatest stack depth: 12512 bytes left
298889548: (298888992): [    0.367094][   T47] mount (47) used greatest stack depth: 11824 bytes left

355356256: (355355821): Welcome to Buildroot
355376898: (355376463): buildroot login:

Balamuruhan S (3):
  powerpc ppc-opcode: add divde and divdeu opcodes
  powerpc sstep: add support for divde[.] and divdeu[.] instructions
  powerpc test_emulate_step: add testcases for divde[.] and divdeu[.]
    instructions

 arch/powerpc/include/asm/ppc-opcode.h |   6 +
 arch/powerpc/lib/sstep.c              |  13 ++-
 arch/powerpc/lib/test_emulate_step.c  | 156 ++++++++++++++++++++++++++
 3 files changed, 174 insertions(+), 1 deletion(-)


base-commit: 7a9912e4cf048b607c8fafcfbdca7566660f1d78
-- 
2.24.1


^ permalink raw reply

* Re: [RESEND PATCH v5 03/11] powerpc/kexec_file: add helper functions for getting memory ranges
From: Michael Ellerman @ 2020-07-28 12:58 UTC (permalink / raw)
  To: Hari Bathini, Andrew Morton
  Cc: Pingfan Liu, Kexec-ml, Mimi Zohar, Nayna Jain, Petr Tesarik,
	Mahesh J Salgaonkar, Sourabh Jain, lkml, linuxppc-dev,
	Thiago Jung Bauermann, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <159579222211.5790.10294144969496171475.stgit@hbathini>

Hi Hari,

Some comments inline ...

Hari Bathini <hbathini@linux.ibm.com> writes:
> diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c
> new file mode 100644
> index 000000000000..21bea1b78443
> --- /dev/null
> +++ b/arch/powerpc/kexec/ranges.c
> @@ -0,0 +1,417 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * powerpc code to implement the kexec_file_load syscall
> + *
> + * Copyright (C) 2004  Adam Litke (agl@us.ibm.com)
> + * Copyright (C) 2004  IBM Corp.
> + * Copyright (C) 2004,2005  Milton D Miller II, IBM Corporation
> + * Copyright (C) 2005  R Sharada (sharada@in.ibm.com)
> + * Copyright (C) 2006  Mohan Kumar M (mohan@in.ibm.com)
> + * Copyright (C) 2020  IBM Corporation
> + *
> + * Based on kexec-tools' kexec-ppc64.c, fs2dt.c.
> + * Heavily modified for the kernel by
> + * Hari Bathini <hbathini@linux.ibm.com>.

Please just use your name, email addresses bit rot. It's in the commit
log anyway.

> + */
> +
> +#undef DEBUG
    ^
Dont do that in new code please.

> +#define pr_fmt(fmt) "kexec ranges: " fmt
> +
> +#include <linux/sort.h>
> +#include <linux/kexec.h>
> +#include <linux/of_device.h>
> +#include <linux/slab.h>
> +#include <asm/sections.h>
> +#include <asm/kexec_ranges.h>
> +
> +/**
> + * get_max_nr_ranges - Get the max no. of ranges crash_mem structure
> + *                     could hold, given the size allocated for it.
> + * @size:              Allocation size of crash_mem structure.
> + *
> + * Returns the maximum no. of ranges.
> + */
> +static inline unsigned int get_max_nr_ranges(size_t size)
> +{
> +	return ((size - sizeof(struct crash_mem)) /
> +		sizeof(struct crash_mem_range));
> +}
> +
> +/**
> + * get_mem_rngs_size - Get the allocated size of mrngs based on
> + *                     max_nr_ranges and chunk size.
> + * @mrngs:             Memory ranges.

mrngs is not a great name, what about memory_ranges or ranges?

Ditto everywhere else you use mrngs.

> + *
> + * Returns the maximum size of @mrngs.
> + */
> +static inline size_t get_mem_rngs_size(struct crash_mem *mrngs)
> +{
> +	size_t size;
> +
> +	if (!mrngs)
> +		return 0;
> +
> +	size = (sizeof(struct crash_mem) +
> +		(mrngs->max_nr_ranges * sizeof(struct crash_mem_range)));
> +
> +	/*
> +	 * Memory is allocated in size multiple of MEM_RANGE_CHUNK_SZ.
> +	 * So, align to get the actual length.
> +	 */
> +	return ALIGN(size, MEM_RANGE_CHUNK_SZ);
> +}
> +
> +/**
> + * __add_mem_range - add a memory range to memory ranges list.
> + * @mem_ranges:      Range list to add the memory range to.
> + * @base:            Base address of the range to add.
> + * @size:            Size of the memory range to add.
> + *
> + * (Re)allocates memory, if needed.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int __add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size)
> +{
> +	struct crash_mem *mrngs = *mem_ranges;
> +
> +	if ((mrngs == NULL) || (mrngs->nr_ranges == mrngs->max_nr_ranges)) {

(mrngs == NULL) should just be !mrngs.

> +		mrngs = realloc_mem_ranges(mem_ranges);
> +		if (!mrngs)
> +			return -ENOMEM;
> +	}
> +
> +	mrngs->ranges[mrngs->nr_ranges].start = base;
> +	mrngs->ranges[mrngs->nr_ranges].end = base + size - 1;
> +	pr_debug("Added memory range [%#016llx - %#016llx] at index %d\n",
> +		 base, base + size - 1, mrngs->nr_ranges);
> +	mrngs->nr_ranges++;
> +	return 0;
> +}
> +
> +/**
> + * __merge_memory_ranges - Merges the given memory ranges list.
> + * @mem_ranges:            Range list to merge.
> + *
> + * Assumes a sorted range list.
> + *
> + * Returns nothing.
> + */

A lot of this code is annoyingly similar to the memblock code, though
the internals of that are all static these days.

I guess for now we'll just have to add all this. Maybe in future it can
be consolidated.

> +static void __merge_memory_ranges(struct crash_mem *mrngs)
> +{
> +	struct crash_mem_range *rngs;
> +	int i, idx;
> +
> +	if (!mrngs)
> +		return;
> +
> +	idx = 0;
> +	rngs = &mrngs->ranges[0];
> +	for (i = 1; i < mrngs->nr_ranges; i++) {
> +		if (rngs[i].start <= (rngs[i-1].end + 1))
> +			rngs[idx].end = rngs[i].end;
> +		else {
> +			idx++;
> +			if (i == idx)
> +				continue;
> +
> +			rngs[idx] = rngs[i];
> +		}
> +	}
> +	mrngs->nr_ranges = idx + 1;
> +}
> +
> +/**
> + * realloc_mem_ranges - reallocate mem_ranges with size incremented
> + *                      by MEM_RANGE_CHUNK_SZ. Frees up the old memory,
> + *                      if memory allocation fails.
> + * @mem_ranges:         Memory ranges to reallocate.
> + *
> + * Returns pointer to reallocated memory on success, NULL otherwise.
> + */
> +struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges)
> +{
> +	struct crash_mem *mrngs = *mem_ranges;
> +	unsigned int nr_ranges;
> +	size_t size;
> +
> +	size = get_mem_rngs_size(mrngs);
> +	nr_ranges = mrngs ? mrngs->nr_ranges : 0;
> +
> +	size += MEM_RANGE_CHUNK_SZ;
> +	mrngs = krealloc(*mem_ranges, size, GFP_KERNEL);
> +	if (!mrngs) {
> +		kfree(*mem_ranges);
> +		*mem_ranges = NULL;
> +		return NULL;
> +	}
> +
> +	mrngs->nr_ranges = nr_ranges;
> +	mrngs->max_nr_ranges = get_max_nr_ranges(size);
> +	*mem_ranges = mrngs;
> +
> +	return mrngs;
> +}
> +
> +/**
> + * add_mem_range - Updates existing memory range, if there is an overlap.
> + *                 Else, adds a new memory range.
> + * @mem_ranges:    Range list to add the memory range to.
> + * @base:          Base address of the range to add.
> + * @size:          Size of the memory range to add.
> + *
> + * (Re)allocates memory, if needed.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size)
> +{
> +	struct crash_mem *mrngs = *mem_ranges;
> +	u64 mstart, mend, end;
> +	unsigned int i;
> +
> +	if (!size)
> +		return 0;
> +
> +	end = base + size - 1;
> +
> +	if ((mrngs == NULL) || (mrngs->nr_ranges == 0))
> +		return __add_mem_range(mem_ranges, base, size);
> +
> +	for (i = 0; i < mrngs->nr_ranges; i++) {
> +		mstart = mrngs->ranges[i].start;
> +		mend = mrngs->ranges[i].end;
> +		if (base < mend && end > mstart) {
> +			if (base < mstart)
> +				mrngs->ranges[i].start = base;
> +			if (end > mend)
> +				mrngs->ranges[i].end = end;
> +			return 0;
> +		}
> +	}
> +
> +	return __add_mem_range(mem_ranges, base, size);
> +}
> +
> +/**
> + * add_tce_mem_ranges - Adds tce-table range to the given memory ranges list.
> + * @mem_ranges:         Range list to add the memory range(s) to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_tce_mem_ranges(struct crash_mem **mem_ranges)

Not sure this and the other add_foo_mem_ranges() really belong in this patch.

> +{
> +	struct device_node *dn;
> +	int ret = 0;
> +
> +	for_each_node_by_type(dn, "pci") {
> +		u64 base;
> +		u32 size;
> +		int rc;

Do you really need ret and rc?

> +		/*
> +		 * It is ok to have pci nodes without tce. So, ignore
> +		 * any read errors here.
> +		 */
> +		rc = of_property_read_u64(dn, "linux,tce-base", &base);
> +		rc |= of_property_read_u32(dn, "linux,tce-size", &size);
> +		if (rc)
> +			continue;
> +
> +		ret = add_mem_range(mem_ranges, base, size);
> +		if (ret)
> +			break;
                        ^
                        dn leaked.
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * add_initrd_mem_range - Adds initrd range to the given memory ranges list,
> + *                        if the initrd was retained.
> + * @mem_ranges:           Range list to add the memory range to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_initrd_mem_range(struct crash_mem **mem_ranges)
> +{
> +	u64 base, end;
> +	char *str;
> +	int ret;
> +
> +	/* This range means something only if initrd was retained */
> +	str = strstr(saved_command_line, "retain_initrd");
> +	if (!str)
> +		return 0;

Unfortunate that we have to go and scan the command line again. But I
don't see a better way ATM.

Could be more concise:

	if (!strstr(saved_command_line, "retain_initrd"))
		return 0;

> +
> +	ret = of_property_read_u64(of_chosen, "linux,initrd-start", &base);
> +	ret |= of_property_read_u64(of_chosen, "linux,initrd-end", &end);
> +	if (!ret)
> +		ret = add_mem_range(mem_ranges, base, end - base + 1);
> +	return ret;
> +}
> +
> +#ifdef CONFIG_PPC_BOOK3S_64
> +/**
> + * add_htab_mem_range - Adds htab range to the given memory ranges list,
> + *                      if it exists
> + * @mem_ranges:         Range list to add the memory range to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_htab_mem_range(struct crash_mem **mem_ranges)
> +{
> +	if (!htab_address)
> +		return 0;
> +
> +	return add_mem_range(mem_ranges, __pa(htab_address), htab_size_bytes);
> +}
> +#endif
> +
> +/**
> + * add_kernel_mem_range - Adds kernel text region to the given
> + *                        memory ranges list.
> + * @mem_ranges:           Range list to add the memory range to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_kernel_mem_range(struct crash_mem **mem_ranges)
> +{
> +	return add_mem_range(mem_ranges, 0, __pa(_end));
> +}
> +
> +/**
> + * add_rtas_mem_range - Adds RTAS region to the given memory ranges list.
> + * @mem_ranges:         Range list to add the memory range to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_rtas_mem_range(struct crash_mem **mem_ranges)
> +{
> +	struct device_node *dn;
> +	int ret = 0;
> +
> +	dn = of_find_node_by_path("/rtas");
> +	if (dn) {
> +		u32 base, size;
> +
> +		ret = of_property_read_u32(dn, "linux,rtas-base", &base);
> +		ret |= of_property_read_u32(dn, "rtas-size", &size);
> +		if (ret)
> +			goto out;
> +
> +		ret = add_mem_range(mem_ranges, base, size);
> +	}
> +
> +out:
> +	of_node_put(dn);
> +	return ret;
> +}

Or:
	struct device_node *dn;
        u32 base, size;
	int rc;

	dn = of_find_node_by_path("/rtas");
	if (!dn)
        	return 0;

	rc  = of_property_read_u32(dn, "linux,rtas-base", &base);
	rc |= of_property_read_u32(dn, "rtas-size", &size);
	if (rc == 0)
		rc = add_mem_range(mem_ranges, base, size);

	of_node_put(dn);
	return rc;
}


> +
> +/**
> + * add_opal_mem_range - Adds OPAL region to the given memory ranges list.
> + * @mem_ranges:         Range list to add the memory range to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_opal_mem_range(struct crash_mem **mem_ranges)
> +{
> +	struct device_node *dn;
> +	int ret = 0;
> +
> +	dn = of_find_node_by_path("/ibm,opal");
> +	if (dn) {
> +		u64 base, size;
> +
> +		ret = of_property_read_u64(dn, "opal-base-address", &base);
> +		ret |= of_property_read_u64(dn, "opal-runtime-size", &size);
> +		if (ret)
> +			goto out;
> +
> +		ret = add_mem_range(mem_ranges, base, size);
> +	}
> +
> +out:
> +	of_node_put(dn);
> +	return ret;
> +}
> +
> +/**
> + * add_reserved_ranges - Adds "/reserved-ranges" regions exported by f/w
> + *                       to the given memory ranges list.
> + * @mem_ranges:          Range list to add the memory ranges to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_reserved_ranges(struct crash_mem **mem_ranges)
> +{
> +	int n_mem_addr_cells, n_mem_size_cells, i, len, cells, ret = 0;
> +	const __be32 *prop;
> +
> +	prop = of_get_property(of_root, "reserved-ranges", &len);
> +	if (!prop)
> +		return 0;
> +
> +	of_node_get(of_root);

You shouldn't need to get the root node, you already used it above anyway.

> +	n_mem_addr_cells = of_n_addr_cells(of_root);
> +	n_mem_size_cells = of_n_size_cells(of_root);
> +	cells = n_mem_addr_cells + n_mem_size_cells;
> +
> +	/* Each reserved range is an (address,size) pair */
> +	for (i = 0; i < (len / (sizeof(*prop) * cells)); i++) {
                                       ^
                                       just u32 would be clearer I think.



cheers

^ permalink raw reply

* Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
From: Ingo Molnar @ 2020-07-28 11:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-sh, Peter Zijlstra, Dave Hansen, linux-mips, Max Filippov,
	Paul Mackerras, sparclinux, linux-riscv, Will Deacon,
	Stafford Horne, Marek Szyprowski, linux-s390, linux-c6x-dev,
	Yoshinori Sato, x86, Russell King, Mike Rapoport,
	clang-built-linux, Ingo Molnar, Catalin Marinas, uclinux-h8-devel,
	linux-xtensa, openrisc, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Thomas Gleixner, linux-arm-kernel, Michal Simek,
	linux-mm, linuxppc-dev, linux-kernel, iommu, Palmer Dabbelt,
	Andrew Morton, Christoph Hellwig
In-Reply-To: <20200728105602.GB3655207@kernel.org>


* Mike Rapoport <rppt@kernel.org> wrote:

> On Tue, Jul 28, 2020 at 12:44:40PM +0200, Ingo Molnar wrote:
> > 
> > * Mike Rapoport <rppt@kernel.org> wrote:
> > 
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > 
> > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > > regions to set node ID in memblock.reserved and than traverses
> > > memblock.reserved to update reserved_nodemask to include node IDs that were
> > > set in the first loop.
> > > 
> > > Remove redundant traversal over memblock.reserved and update
> > > reserved_nodemask while iterating over numa_meminfo.
> > > 
> > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > ---
> > >  arch/x86/mm/numa.c | 26 ++++++++++----------------
> > >  1 file changed, 10 insertions(+), 16 deletions(-)
> > 
> > I suspect you'd like to carry this in the -mm tree?
> 
> Yes.
>  
> > Acked-by: Ingo Molnar <mingo@kernel.org>
> 
> Thanks!

Assuming it is correct and works. :-)

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state
From: Nicholas Piggin @ 2020-07-28 11:22 UTC (permalink / raw)
  To: peterz
  Cc: linux-arch, Alexey Kardashevskiy, Will Deacon, linux-kernel,
	Ingo Molnar, linuxppc-dev
In-Reply-To: <20200726121138.GC119549@hirez.programming.kicks-ass.net>

Excerpts from peterz@infradead.org's message of July 26, 2020 10:11 pm:
> On Sun, Jul 26, 2020 at 02:14:34PM +1000, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 26, 2020 6:26 am:
> 
>> > Which is 'funny' when it interleaves like:
>> > 
>> > 	local_irq_disable();
>> > 	...
>> > 	local_irq_enable()
>> > 	  trace_hardirqs_on();
>> > 	  <NMI/>
>> > 	  raw_local_irq_enable();
>> > 
>> > Because then it will undo the trace_hardirqs_on() we just did. With the
>> > result that both tracing and lockdep will see a hardirqs-disable without
>> > a matching enable, while the hardware state is enabled.
>> 
>> Seems like an arch problem -- why not disable if it was enabled only?
>> I guess the local_irq tracing calls are a mess so maybe they copied 
>> those.
> 
> Because, as I wrote earlier, then we can miss updating software state.
> So your proposal has:
> 
> 	raw_local_irq_disable()
> 	<NMI>
> 	  if (!arch_irqs_disabled(regs->flags) // false
> 	    trace_hardirqs_off();
> 
> 	  // tracing/lockdep still think IRQs are enabled
> 	  // hardware IRQ state is disabled.

... and then lockdep_nmi_enter can disable IRQs if they were enabled?

The only reason it's done this way as opposed to a much simple counter 
increment/decrement AFAIKS is to avoid some overhead of calling 
trace_hardirqs_on/off (which seems a bit dubious but let's go with it).

In that case the lockdep_nmi_enter code is the right spot to clean up 
that gap vs NMIs. I guess there's an argument that arch_nmi_enter could
do it. I don't see the problem with fixing it up here though, this is a 
slow path so it doesn't matter if we have some more logic for it.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
From: Baoquan He @ 2020-07-28 11:02 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-sh, Peter Zijlstra, Dave Hansen, linux-mips, Max Filippov,
	Paul Mackerras, sparclinux, linux-riscv, Will Deacon,
	Stafford Horne, Marek Szyprowski, linux-s390, linux-c6x-dev,
	Yoshinori Sato, x86, Russell King, Mike Rapoport,
	clang-built-linux, Ingo Molnar, Catalin Marinas, uclinux-h8-devel,
	linux-xtensa, openrisc, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Thomas Gleixner, linux-arm-kernel, Michal Simek,
	linux-mm, linuxppc-dev, linux-kernel, iommu, Palmer Dabbelt,
	Andrew Morton, Christoph Hellwig
In-Reply-To: <20200728051153.1590-15-rppt@kernel.org>

On 07/28/20 at 08:11am, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> regions to set node ID in memblock.reserved and than traverses
> memblock.reserved to update reserved_nodemask to include node IDs that were
> set in the first loop.
> 
> Remove redundant traversal over memblock.reserved and update
> reserved_nodemask while iterating over numa_meminfo.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/x86/mm/numa.c | 26 ++++++++++----------------
>  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 8ee952038c80..4078abd33938 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -498,31 +498,25 @@ static void __init numa_clear_kernel_node_hotplug(void)
>  	 * and use those ranges to set the nid in memblock.reserved.
>  	 * This will split up the memblock regions along node
>  	 * boundaries and will set the node IDs as well.
> +	 *
> +	 * The nid will also be set in reserved_nodemask which is later
> +	 * used to clear MEMBLOCK_HOTPLUG flag.
> +	 *
> +	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> +	 *   numa_meminfo might not include all memblock.reserved
> +	 *   memory ranges, because quirks such as trim_snb_memory()
> +	 *   reserve specific pages for Sandy Bridge graphics.
> +	 *   These ranges will remain with nid == MAX_NUMNODES. ]
>  	 */
>  	for (i = 0; i < numa_meminfo.nr_blks; i++) {
>  		struct numa_memblk *mb = numa_meminfo.blk + i;
>  		int ret;
>  
>  		ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
> +		node_set(mb->nid, reserved_nodemask);

Really? This will set all node id into reserved_nodemask. But in the
current code, it's setting nid into memblock reserved region which
interleaves with numa_memoinfo, then get those nid and set it in
reserved_nodemask. This is so different, with my understanding. Please
correct me if I am wrong.

Thanks
Baoquan

>  		WARN_ON_ONCE(ret);
>  	}
>  
> -	/*
> -	 * Now go over all reserved memblock regions, to construct a
> -	 * node mask of all kernel reserved memory areas.
> -	 *
> -	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> -	 *   numa_meminfo might not include all memblock.reserved
> -	 *   memory ranges, because quirks such as trim_snb_memory()
> -	 *   reserve specific pages for Sandy Bridge graphics. ]
> -	 */
> -	for_each_memblock(reserved, mb_region) {
> -		int nid = memblock_get_region_node(mb_region);
> -
> -		if (nid != MAX_NUMNODES)
> -			node_set(nid, reserved_nodemask);
> -	}
> -
>  	/*
>  	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
>  	 * belonging to the reserved node mask.
> -- 
> 2.26.2
> 
> 


^ permalink raw reply

* Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
From: Mike Rapoport @ 2020-07-28 10:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-sh, Peter Zijlstra, Dave Hansen, linux-mips, Max Filippov,
	Paul Mackerras, sparclinux, linux-riscv, Will Deacon,
	Stafford Horne, Marek Szyprowski, linux-s390, linux-c6x-dev,
	Yoshinori Sato, x86, Russell King, Mike Rapoport,
	clang-built-linux, Ingo Molnar, Catalin Marinas, uclinux-h8-devel,
	linux-xtensa, openrisc, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Thomas Gleixner, linux-arm-kernel, Michal Simek,
	linux-mm, linuxppc-dev, linux-kernel, iommu, Palmer Dabbelt,
	Andrew Morton, Christoph Hellwig
In-Reply-To: <20200728104440.GA222284@gmail.com>

On Tue, Jul 28, 2020 at 12:44:40PM +0200, Ingo Molnar wrote:
> 
> * Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > regions to set node ID in memblock.reserved and than traverses
> > memblock.reserved to update reserved_nodemask to include node IDs that were
> > set in the first loop.
> > 
> > Remove redundant traversal over memblock.reserved and update
> > reserved_nodemask while iterating over numa_meminfo.
> > 
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  arch/x86/mm/numa.c | 26 ++++++++++----------------
> >  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> I suspect you'd like to carry this in the -mm tree?

Yes.
 
> Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks!

> Thanks,
> 
> 	Ingo

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
From: Nicholas Piggin @ 2020-07-28 10:53 UTC (permalink / raw)
  To: linux-arch, Linus Torvalds, Yang Shi
  Cc: Hillf Danton, mm-commits, Catalin Marinas, Hugh Dickins,
	Josef Bacik, Will Deacon, Linux-MM, Matthew Wilcox,
	Johannes Weiner, Yu Xu, Andrew Morton, linuxppc-dev,
	Kirill A . Shutemov
In-Reply-To: <CAHk-=wha6f0gF1SJg96R77h0oTuc_oO7-37wD=mYGy6TyJOwbQ@mail.gmail.com>

Excerpts from Linus Torvalds's message of July 28, 2020 4:37 am:
> [ Adding linux-arch, just to make other architectures aware of this issue too.
> 
>   We have a "flush_tlb_fix_spurious_fault()" thing to take care of the
> "TLB may contain stale entries, we can't take the same fault over and
> over again" situation.
> 
>   On x86, it's a no-op, because x86 doesn't do that. x86 will re-walk
> the page tables - or possibly just always invalidate the faulting TLB
> entry - before taking a fault, so there can be no long-term stale
> TLB's.

[snip]

>   It looks like powerpc people at least thought about this, and only
> do it if there is a coprocessor. Which sounds a bit confused, but I
> don't know the rules.

I'm not sure about ppc32 and 64e, I'm almost certain they should do a 
local flush if anyting, and someone with a good understanding of the 
ISAs and CPUs might be able to nop it entirely. I agree global can't 
ever really make sense (except as a default because we have no generic 
local flush).

powerpc/64s reloads translations after taking a fault, so it's fine with 
a nop here.

The quirk is a problem with coprocessor where it's supposed to 
invalidate the translation after a fault but it doesn't, so we can get a 
read-only TLB stuck after something else does a RO->RW upgrade on the 
TLB. Something like that IIRC.  Coprocessors have their own MMU which 
lives in the nest not the core, so you need a global TLB flush to
invalidate that thing.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
From: Ingo Molnar @ 2020-07-28 10:44 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-sh, Peter Zijlstra, Dave Hansen, linux-mips, Max Filippov,
	Paul Mackerras, sparclinux, linux-riscv, Will Deacon,
	Stafford Horne, Marek Szyprowski, linux-s390, linux-c6x-dev,
	Yoshinori Sato, x86, Russell King, Mike Rapoport,
	clang-built-linux, Ingo Molnar, Catalin Marinas, uclinux-h8-devel,
	linux-xtensa, openrisc, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Thomas Gleixner, linux-arm-kernel, Michal Simek,
	linux-mm, linuxppc-dev, linux-kernel, iommu, Palmer Dabbelt,
	Andrew Morton, Christoph Hellwig
In-Reply-To: <20200728051153.1590-15-rppt@kernel.org>


* Mike Rapoport <rppt@kernel.org> wrote:

> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> regions to set node ID in memblock.reserved and than traverses
> memblock.reserved to update reserved_nodemask to include node IDs that were
> set in the first loop.
> 
> Remove redundant traversal over memblock.reserved and update
> reserved_nodemask while iterating over numa_meminfo.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/x86/mm/numa.c | 26 ++++++++++----------------
>  1 file changed, 10 insertions(+), 16 deletions(-)

I suspect you'd like to carry this in the -mm tree?

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 03/15] arm, xtensa: simplify initialization of high memory pages
From: Max Filippov @ 2020-07-28  8:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: open list:SUPERH, Peter Zijlstra, Dave Hansen, linux-mips,
	Linux Memory Management List, Paul Mackerras,
	open list:SPARC + UltraSPAR..., linux-riscv, Will Deacon,
	Stafford Horne, Marek Szyprowski, linux-s390, linux-c6x-dev,
	Yoshinori Sato, maintainer:X86 ARCHITECTURE..., Russell King,
	Mike Rapoport, clang-built-linux, Ingo Molnar, Catalin Marinas,
	moderated list:H8/300 ARCHITECTURE,
	open list:TENSILICA XTENSA PORT (xtensa), openrisc,
	Borislav Petkov, Andy Lutomirski, Paul Walmsley, Thomas Gleixner,
	linux-arm-kernel, Michal Simek, linuxppc-dev, LKML, iommu,
	Palmer Dabbelt, Andrew Morton, Christoph Hellwig
In-Reply-To: <20200728051153.1590-4-rppt@kernel.org>

On Mon, Jul 27, 2020 at 10:12 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> The function free_highpages() in both arm and xtensa essentially open-code
> for_each_free_mem_range() loop to detect high memory pages that were not
> reserved and that should be initialized and passed to the buddy allocator.
>
> Replace open-coded implementation of for_each_free_mem_range() with usage
> of memblock API to simplify the code.
>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/arm/mm/init.c    | 48 +++++++------------------------------
>  arch/xtensa/mm/init.c | 55 ++++++++-----------------------------------
>  2 files changed, 18 insertions(+), 85 deletions(-)

For the xtensa part:
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
Tested-by: Max Filippov <jcmvbkbc@gmail.com>

-- 
Thanks.
-- Max

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox