LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 7/7] powerpc: unrel_branch_check.sh: exit silently for early errors
From: Stephen Rothwell @ 2020-08-11 14:04 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Linux PowerPC List, Nicholas Piggin
In-Reply-To: <20200811140435.20957-1-sfr@canb.auug.org.au>

If we can't find the address of __end_interrupts, then we still exit
successfully as that is the current behaviour.

Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 arch/powerpc/tools/unrel_branch_check.sh | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/tools/unrel_branch_check.sh b/arch/powerpc/tools/unrel_branch_check.sh
index 4489f16a443c..70da90270c78 100755
--- a/arch/powerpc/tools/unrel_branch_check.sh
+++ b/arch/powerpc/tools/unrel_branch_check.sh
@@ -14,9 +14,12 @@ kstart=0xc000000000000000
 printf -v kend '0x%x' $(( kstart + 0x10000 ))
 
 end_intr=0x$(
-$objdump -R -d --start-address="$kstart" --stop-address="$kend" "$vmlinux" |
+$objdump -R -d --start-address="$kstart" --stop-address="$kend" "$vmlinux" 2>/dev/null |
 awk '$2 == "<__end_interrupts>:" { print $1 }'
 )
+if [ "$end_intr" = "0x" ]; then
+	exit 0
+fi
 
 $objdump -R -D --no-show-raw-insn --start-address="$kstart" --stop-address="$end_intr" "$vmlinux" |
 sed -E -n '
-- 
2.28.0


^ permalink raw reply related

* Re: [RFC PATCH v1] power: don't manage floating point regs when no FPU
From: Christophe Leroy @ 2020-08-11 14:06 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <87o8nh9yjd.fsf@mpe.ellerman.id.au>



Le 11/08/2020 à 14:07, Michael Ellerman a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>   10 files changed, 44 insertions(+), 1 deletion(-)
> 
> In general this looks fine.
> 
> It's a bit #ifdef heavy. Maybe some of those can be cleaned up a bit
> with some wrapper inlines?

Looking at it once more, looks like more or less the same level of 
#ifdefs as things like CONFIG_ALTIVEC for instance. I can't really see 
much opportunities to clean it up.

> 
>> diff --git a/arch/powerpc/kernel/ptrace/ptrace-novsx.c b/arch/powerpc/kernel/ptrace/ptrace-novsx.c
>> index b2dc4e92d11a..8f87a11f3f8c 100644
>> --- a/arch/powerpc/kernel/ptrace/ptrace-novsx.c
>> +++ b/arch/powerpc/kernel/ptrace/ptrace-novsx.c
>> @@ -28,6 +29,9 @@ int fpr_get(struct task_struct *target, const struct user_regset *regset,
>>   
>>   	return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
>>   				   &target->thread.fp_state, 0, -1);
>> +#else
>> +	return 0;
>> +#endif
> 
> Should we return -ENODEV/EIO here? Wonder if another arch can give us a clue.
> 

Looks like we have to do another way  ... another #ifdef ... in the 
definition of native_regsets[] in ptrace-view.c . And then we should be 
able to not build ptrace-novsx.c at all. Will try that.

Christophe

^ permalink raw reply

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
From: Nathan Lynch @ 2020-08-11 14:46 UTC (permalink / raw)
  To: Michael Ellerman, Michael Roth
  Cc: linuxppc-dev, Greg Kurz, Thiago Jung Bauermann, Cedric Le Goater
In-Reply-To: <87r1sd9z0d.fsf@mpe.ellerman.id.au>

Michael Ellerman <mpe@ellerman.id.au> writes:

> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
>> Quoting Nathan Lynch (2020-08-07 02:05:09)
> ...
>>> wait_for_cpu_stopped() should be able to accommodate a time-based
>>> warning if necessary, but speaking as a likely recipient of any bug
>>> reports that would arise here, I'm not convinced of the need and I
>>> don't know what a good value would be. It's relatively easy to sample
>>> the stack of a task that's apparently failing to make progress, plus I
>>> probably would use 'perf probe' or similar to report the inputs and
>>> outputs for the RTAS call.
>>
>> I think if we make the timeout sufficiently high like 2 minutes or so
>> it wouldn't hurt and if we did seem them it would probably point to an
>> actual bug. But I don't have a strong feeling either way.
>
> I think we should print a warning after 2 minutes.
>
> It's true that there are fairly easy mechanisms to work out where the
> thread is stuck, but customers are unlikely to use them. They're just
> going to report that it's stuck with no further info, and probably
> reboot the machine before we get a chance to get any further info.
>
> Whereas if the kernel prints a warning with a stack trace we at least
> have that to go on in an initial bug report.
>
>>> I'm happy to make this a proper submission after I can clean it up and
>>> retest it, or Michael R. is welcome to appropriate it, assuming it's
>>> acceptable.
>>> 
>>
>> I've given it a shot with this patch and it seems to be holding up in
>> testing. If we don't think the ~2 minutes warning message is needed I
>> can clean it up to post:
>>
>> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e
>>
>> I'd likely break the refactoring patches out to a separate patch under
>> Nathan's name since it fixes a separate bug potentially.
>
> While I like Nathan's refactoring, we probably want to do the minimal
> fix first to ease backporting.
>
> Then do the refactoring on top of that.

Fair enough, thanks.

^ permalink raw reply

* Re: [PATCH 19/22] crypto: inside-secure - add check for xts input length equal to zero
From: Horia Geantă @ 2020-08-11 15:28 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrei Botila (OSS), Andrei Botila, Herbert Xu,
	Van Leeuwen, Pascal, Antoine Tenart, linux-s390@vger.kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@axis.com, linux-crypto@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, David S. Miller,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <20200810170305.GA3352718@gmail.com>

On 8/10/2020 8:03 PM, Eric Biggers wrote:
> On Mon, Aug 10, 2020 at 05:33:39PM +0300, Horia Geantă wrote:
>> On 8/10/2020 4:45 PM, Herbert Xu wrote:
>>> On Mon, Aug 10, 2020 at 10:20:20AM +0000, Van Leeuwen, Pascal wrote:
>>>>
>>>> With all due respect, but this makes no sense.
>>>
>>> I agree.  This is a lot of churn for no gain.
>>>
>> I would say the gain is that all skcipher algorithms would behave the same
>> when input length equals zero - i.e. treat the request as a no-op.
>>
>> We can't say "no input" has any meaning to the other skcipher algorithms,
>> but the convention is to accept this case and just return 0.
>> I don't see why XTS has to be handled differently.
>>
> 
> CTS also rejects empty inputs.
> 
> The rule it follows is just that all input lengths >= blocksize are allowed.
> Input lengths < blocksize aren't allowed.
> 
Indeed, thanks.

What about, for example, CBC?
AFAICT cbc(aes) with input length = 0 is valid.

Same for CTR (with the note that blocksize = 1) and several other algorithms
mentioned in the cover letter.

What's the rule in these cases?

Thanks,
Horia

^ permalink raw reply

* [PATCH v2] powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death
From: Michael Roth @ 2020-08-11 16:15 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nathan Lynch, Cedric Le Goater, Thiago Jung Bauermann, Greg Kurz

For a power9 KVM guest with XIVE enabled, running a test loop
where we hotplug 384 vcpus and then unplug them, the following traces
can be seen (generally within a few loops) either from the unplugged
vcpu:

  [ 1767.353447] cpu 65 (hwid 65) Ready to die...
  [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
  [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
  [ 1767.952322] ------------[ cut here ]------------
  [ 1767.952323] kernel BUG at lib/list_debug.c:56!
  [ 1767.952325] Oops: Exception in kernel mode, sig: 5 [#1]
  [ 1767.952326] LE SMP NR_CPUS=2048 NUMA pSeries
  [ 1767.952328] Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_chain_route_ipv4 ip6_tables nft_compat ip_set nf_tables nfnetlink uio_pdrv_genirq ip_tables xfs libcrc32c sd_mod sg xts vmx_crypto virtio_net net_failover failover virtio_scsi dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
  [ 1767.952352] CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 4.18.0-221.el8.ppc64le #1
  [ 1767.952354] NIP:  c0000000007ab50c LR: c0000000007ab508 CTR: 00000000000003ac
  [ 1767.952355] REGS: c0000009e5a17840 TRAP: 0700   Not tainted  (4.18.0-221.el8.ppc64le)
  [ 1767.952355] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000842  XER: 20040000
  [ 1767.952360] CFAR: c0000000001ffe64 IRQMASK: 1
  [ 1767.952360] GPR00: c0000000007ab508 c0000009e5a17ac0 c000000001ac0700 0000000000000054
  [ 1767.952360] GPR04: c0000009f056cf90 c0000009f05f5628 c0000009ed40d654 c000000001c90700
  [ 1767.952360] GPR08: 0000000000000007 c0000009f0573980 00000009ef2b0000 7562202c38303230
  [ 1767.952360] GPR12: 0000000000000000 c0000007fff6ce80 c00a000002470208 0000000000000000
  [ 1767.952360] GPR16: 0000000000000001 c0000009f05fbb00 0000000000000800 c0000009ff3d4980
  [ 1767.952360] GPR20: c0000009f05fbb10 5deadbeef0000100 5deadbeef0000200 0000000000187961
  [ 1767.952360] GPR24: c0000009e5a17b78 0000000000000000 0000000000000001 ffffffffffffffff
  [ 1767.952360] GPR28: c00a000002470200 c0000009f05fbb10 c0000009f05fbb10 0000000000000000
  [ 1767.952375] NIP [c0000000007ab50c] __list_del_entry_valid+0xac/0x100
  [ 1767.952376] LR [c0000000007ab508] __list_del_entry_valid+0xa8/0x100
  [ 1767.952377] Call Trace:
  [ 1767.952378] [c0000009e5a17ac0] [c0000000007ab508] __list_del_entry_valid+0xa8/0x100 (unreliable)
  [ 1767.952381] [c0000009e5a17b20] [c000000000476e18] free_pcppages_bulk+0x1f8/0x940
  [ 1767.952383] [c0000009e5a17c20] [c00000000047a9a0] free_unref_page+0xd0/0x100
  [ 1767.952386] [c0000009e5a17c50] [c0000000000bc2a8] xive_spapr_cleanup_queue+0x148/0x1b0
  [ 1767.952388] [c0000009e5a17cf0] [c0000000000b96dc] xive_teardown_cpu+0x1bc/0x240
  [ 1767.952391] [c0000009e5a17d30] [c00000000010bf28] pseries_mach_cpu_die+0x78/0x2f0
  [ 1767.952393] [c0000009e5a17de0] [c00000000005c8d8] cpu_die+0x48/0x70
  [ 1767.952394] [c0000009e5a17e00] [c000000000021cf0] arch_cpu_idle_dead+0x20/0x40
  [ 1767.952397] [c0000009e5a17e20] [c0000000001b4294] do_idle+0x2f4/0x4c0
  [ 1767.952399] [c0000009e5a17ea0] [c0000000001b46a8] cpu_startup_entry+0x38/0x40
  [ 1767.952400] [c0000009e5a17ed0] [c00000000005c43c] start_secondary+0x7bc/0x8f0
  [ 1767.952403] [c0000009e5a17f90] [c00000000000ac70] start_secondary_prolog+0x10/0x14

or on the worker thread handling the unplug:

  [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
  [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2
  [ 1538.360736] BUG: Bad page state in process kworker/u768:3  pfn:95de1
  [ 1538.360746] cpu 314 (hwid 314) Ready to die...
  [ 1538.360784] page:c00a000002577840 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0
  [ 1538.361881] flags: 0x5ffffc00000000()
  [ 1538.361908] raw: 005ffffc00000000 5deadbeef0000100 5deadbeef0000200 0000000000000000
  [ 1538.361955] raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000
  [ 1538.362002] page dumped because: nonzero mapcount
  [ 1538.362033] Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables nft_compat ip_set nf_tables nfnetlink sunrpc xts vmx_crypto ip_tables xfs libcrc32c sd_mod sg virtio_net net_failover virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
  [ 1538.362613] CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted 4.18.0-224.el8.bz1856588.ppc64le #1
  [ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn
  [ 1538.362725] Call Trace:
  [ 1538.362743] [c0000009d4adf590] [c000000000e0e0fc] dump_stack+0xb0/0xf4 (unreliable)
  [ 1538.362789] [c0000009d4adf5d0] [c000000000475dfc] bad_page+0x12c/0x1b0
  [ 1538.362827] [c0000009d4adf660] [c0000000004784bc] free_pcppages_bulk+0x5bc/0x940
  [ 1538.362871] [c0000009d4adf760] [c000000000478c38] page_alloc_cpu_dead+0x118/0x120
  [ 1538.362918] [c0000009d4adf7b0] [c00000000015b898] cpuhp_invoke_callback.constprop.5+0xb8/0x760
  [ 1538.362969] [c0000009d4adf820] [c00000000015eee8] _cpu_down+0x188/0x340
  [ 1538.363007] [c0000009d4adf890] [c00000000015d75c] cpu_down+0x5c/0xa0
  [ 1538.363045] [c0000009d4adf8c0] [c00000000092c544] cpu_subsys_offline+0x24/0x40
  [ 1538.363091] [c0000009d4adf8e0] [c0000000009212f0] device_offline+0xf0/0x130
  [ 1538.363129] [c0000009d4adf920] [c00000000010aee4] dlpar_offline_cpu+0x1c4/0x2a0
  [ 1538.363174] [c0000009d4adf9e0] [c00000000010b2f8] dlpar_cpu_remove+0xb8/0x190
  [ 1538.363219] [c0000009d4adfa60] [c00000000010b4fc] dlpar_cpu_remove_by_index+0x12c/0x150
  [ 1538.363264] [c0000009d4adfaf0] [c00000000010ca24] dlpar_cpu+0x94/0x800
  [ 1538.363302] [c0000009d4adfc00] [c000000000102cc8] pseries_hp_work_fn+0x128/0x1e0
  [ 1538.363347] [c0000009d4adfc70] [c00000000018aa84] process_one_work+0x304/0x5d0
  [ 1538.363394] [c0000009d4adfd10] [c00000000018b5cc] worker_thread+0xcc/0x7a0
  [ 1538.363433] [c0000009d4adfdc0] [c00000000019567c] kthread+0x1ac/0x1c0
  [ 1538.363469] [c0000009d4adfe30] [c00000000000b7dc] ret_from_kernel_thread+0x5c/0x80

The latter trace is due to the following sequence:

  page_alloc_cpu_dead
    drain_pages
      drain_pages_zone
        free_pcppages_bulk

where drain_pages() in this case is called under the assumption that
the unplugged cpu is no longer executing. To ensure that is the case,
and early call is made to __cpu_die()->pseries_cpu_die(), which runs
a loop that waits for the cpu to reach a halted state by polling its
status via query-cpu-stopped-state RTAS calls. It only polls for
25 iterations before giving up, however, and in the trace above this
results in the following being printed only .1 seconds after the
hotplug worker thread begins processing the unplug request:

  [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
  [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2

At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and procedes with the cleanup, causing the race with
the XIVE cleanup code executed by the unplugged CPU.

Fix this by waiting indefinitely, but also making an effort to avoid
spurious lockup messages by allowing for rescheduling after polling
the CPU status and printing a warning if we wait for longer than 120s.

Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Cedric Le Goater <clg@kaod.org>
Cc: Greg Kurz <groug@kaod.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
changes from v1:
 - renamed from "powerpc/pseries/hotplug-cpu: increase wait time for vCPU death"
 - wait indefinitely, but issue cond_resched() when RTAS reports that CPU
   hasn't stopped (Michael)
 - print a warning after 120s of waiting (Michael)
 - use pr_warn() instead of default printk() level
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index c6e0d8abf75e..7a974ed6b240 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -107,22 +107,28 @@ static int pseries_cpu_disable(void)
  */
 static void pseries_cpu_die(unsigned int cpu)
 {
-	int tries;
 	int cpu_status = 1;
 	unsigned int pcpu = get_hard_smp_processor_id(cpu);
+	unsigned long timeout = jiffies + msecs_to_jiffies(120000);
 
-	for (tries = 0; tries < 25; tries++) {
+	while (true) {
 		cpu_status = smp_query_cpu_stopped(pcpu);
 		if (cpu_status == QCSS_STOPPED ||
 		    cpu_status == QCSS_HARDWARE_ERROR)
 			break;
-		cpu_relax();
 
+		if (time_after(jiffies, timeout)) {
+			pr_warn("CPU %i (hwid %i) didn't die after 120 seconds\n",
+				cpu, pcpu);
+			timeout = jiffies + msecs_to_jiffies(120000);
+		}
+
+		cond_resched();
 	}
 
-	if (cpu_status != 0) {
-		printk("Querying DEAD? cpu %i (%i) shows %i\n",
-		       cpu, pcpu, cpu_status);
+	if (cpu_status == QCSS_HARDWARE_ERROR) {
+		pr_warn("CPU %i (hwid %i) reported error while dying\n",
+			cpu, pcpu);
 	}
 
 	/* Isolation and deallocation are definitely done by
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH v3 0/8] huge vmalloc mappings
From: Jonathan Cameron @ 2020-08-11 16:32 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Zefan Li, H. Peter Anvin, Will Deacon, x86,
	linux-kernel, linux-mm, Ingo Molnar, Borislav Petkov,
	Catalin Marinas, Thomas Gleixner, linuxppc-dev, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

On Mon, 10 Aug 2020 12:27:24 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> Not tested on x86 or arm64, would appreciate a quick test there so I can
> ask Andrew to put it in -mm. Other option is I can disable huge vmallocs
> for them for the time being.

Hi Nicholas,

For arm64 testing with a Kunpeng920.

I ran a quick sanity test with this series on top of mainline (yes mid merge window
so who knows what state is...).  Could I be missing some dependency?

Without them it boots, with them it doesn't.  Any immediate guesses?

[    0.069507] Dentry cache hash table entries: 33554432 (order: 16, 268435456 bytes, vmalloc)                                                               
[    0.087134] Inode-cache hash table entries: 16777216 (order: 15, 134217728 bytes, vmalloc)                                                                
[    0.097044] Mount-cache hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)                                                                    
[    0.106534] Mountpoint-cache hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
[    0.116349] ------------[ cut here ]------------   
[    0.121465] kernel BUG at kernel/fork.c:402!
[    0.126194] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[    0.132273] Modules linked in:
[    0.135653] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0-12307-g2b197e00c338 #637
[    0.144240] pstate: 20000009 (nzCv daif -PAN -UAO BTYPE=--)
[    0.150420] pc : copy_process+0x10c0/0x1690
[    0.155049] lr : copy_process+0x2e0/0x1690
[    0.159584] sp : ffffd96c55773d60
[    0.163250] x29: ffffd96c55773d70 x28: ffff20bf87060000
[    0.169134] x27: 0000000000800300 x26: 00000000ffffffff
[    0.175018] x25: ffff8000108a8000 x24: ffffd96c55a32708
[    0.180901] x23: ffff20bf87043800 x22: 0000000000000000
[    0.186787] x21: 0000000000000000 x20: ffffd96c55773ef0
[    0.192672] x19: ffffd96c55783bc0 x18: 0000000000000010
[    0.198557] x17: 00000000855c858e x16: 00000000a8256fca
[    0.204441] x15: ffffffffffffffff x14: ffff000000000000
[    0.210327] x13: ffff800010901000 x12: ffff8000108b1000
[    0.216212] x11: 0000000000000001 x10: ffffd96c55a6d000
[    0.222096] x9 : ffffd96c53bf7594 x8 : 0000000000000041
[    0.227980] x7 : ffff004fffffa6b0 x6 : ffff800010aa8000
[    0.233864] x5 : 000000000000fffd x4 : 0000000000000000
[    0.239748] x3 : ffffd96c55a63598 x2 : 0000000000000001
[    0.245632] x1 : ffffd96c55783bc0 x0 : 0000000000000008
[    0.251519] Call trace:
[    0.254221]  copy_process+0x10c0/0x1690
[    0.258466]  _do_fork+0x98/0x488
[    0.262036]  kernel_thread+0x6c/0x90
[    0.265997]  rest_init+0x38/0xf0
[    0.269568]  arch_call_rest_init+0x18/0x24
[    0.274105]  start_kernel+0x60c/0x644
[    0.278159] Code: f000a441 f943f421 cb010000 17ffffe1 (d4210000)
[    0.284961] ---[ end trace 985361e2cb97a0d9 ]---
[    0.290073] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.297532] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Thanks,

Jonathan


> 
> Since v2:
> - Rebased on vmalloc cleanups, split series into simpler pieces.
> - Fixed several compile errors and warnings
> - Keep the page array and accounting in small page units because
>   struct vm_struct is an interface (this should fix x86 vmap stack debug
>   assert). [Thanks Zefan]
> 
> Nicholas Piggin (8):
>   mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
>   mm: apply_to_pte_range warn and fail if a large pte is encountered
>   mm/vmalloc: rename vmap_*_range vmap_pages_*_range
>   lib/ioremap: rename ioremap_*_range to vmap_*_range
>   mm: HUGE_VMAP arch support cleanup
>   mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
>   mm/vmalloc: add vmap_range_noflush variant
>   mm/vmalloc: Hugepage vmalloc mappings
> 
>  .../admin-guide/kernel-parameters.txt         |   2 +
>  arch/arm64/mm/mmu.c                           |  10 +-
>  arch/powerpc/mm/book3s64/radix_pgtable.c      |   8 +-
>  arch/x86/mm/ioremap.c                         |  10 +-
>  include/linux/io.h                            |   9 -
>  include/linux/vmalloc.h                       |  13 +
>  init/main.c                                   |   1 -
>  mm/ioremap.c                                  | 231 +--------
>  mm/memory.c                                   |  60 ++-
>  mm/vmalloc.c                                  | 442 +++++++++++++++---
>  10 files changed, 453 insertions(+), 333 deletions(-)
> 


^ permalink raw reply

* Re: [PATCH 2/2 v2] powerpc/powernv: Enable and setup PCI P2P
From: Frederic Barrat @ 2020-08-11 17:25 UTC (permalink / raw)
  To: Oliver O'Halloran, Max Gurtovoy
  Cc: vladimirk, Carol L Soto, linux-pci, shlomin, israelr, idanw,
	linuxppc-dev, Christoph Hellwig, aneela
In-Reply-To: <CAOSf1CGv=0bwShzzK5zP3dtKg=RxeTFvq52j-Vi4GDfZ4UpBJA@mail.gmail.com>



Le 03/08/2020 à 09:35, Oliver O'Halloran a écrit :
> On Thu, Apr 30, 2020 at 11:15 PM Max Gurtovoy <maxg@mellanox.com> wrote:
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 57d3a6a..9ecc576 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -3706,18 +3706,208 @@ static void pnv_pci_ioda_dma_bus_setup(struct pci_bus *bus)
>>          }
>>   }
>>
>> +#ifdef CONFIG_PCI_P2PDMA
>> +static DEFINE_MUTEX(p2p_mutex);
>> +
>> +static bool pnv_pci_controller_owns_addr(struct pci_controller *hose,
>> +                                        phys_addr_t addr, size_t size)
>> +{
>> +       int i;
>> +
>> +       /*
>> +        * It seems safe to assume the full range is under the same PHB, so we
>> +        * can ignore the size.
>> +        */
>> +       for (i = 0; i < ARRAY_SIZE(hose->mem_resources); i++) {
>> +               struct resource *res = &hose->mem_resources[i];
>> +
>> +               if (res->flags && addr >= res->start && addr < res->end)
>> +                       return true;
>> +       }
>> +       return false;
>> +}
>> +
>> +/*
>> + * find the phb owning a mmio address if not owned locally
>> + */
>> +static struct pnv_phb *pnv_pci_find_owning_phb(struct pci_dev *pdev,
>> +                                              phys_addr_t addr, size_t size)
>> +{
>> +       struct pci_controller *hose;
>> +
>> +       /* fast path */
>> +       if (pnv_pci_controller_owns_addr(pdev->bus->sysdata, addr, size))
>> +               return NULL;
> 
> Do we actually need this fast path? It's going to be slow either way.
> Also if a device is doing p2p to another device under the same PHB
> then it should not be happening via the root complex. Is this a case
> you've tested?


The "fast path" comment is misleading and we should rephrase. The point 
is to catch if we're mapping a resource under the same PHB, in which 
case we don't modify the PHB configuration. So we need to catch it 
early, but it's not a fast path.
If the 2 devices are under the same PHB, the code above shouldn't do 
anything. So I guess behavior depends on the underlying bridge? We'll 
need another platform than witherspoon to test it. Probably worth checking.


>> +       list_for_each_entry(hose, &hose_list, list_node) {
>> +               struct pnv_phb *phb = hose->private_data;
>> +
>> +               if (phb->type != PNV_PHB_NPU_NVLINK &&
>> +                   phb->type != PNV_PHB_NPU_OCAPI) {
>> +                       if (pnv_pci_controller_owns_addr(hose, addr, size))
>> +                               return phb;
>> +               }
>> +       }
>> +       return NULL;
>> +}
>> +
>> +static u64 pnv_pci_dma_dir_to_opal_p2p(enum dma_data_direction dir)
>> +{
>> +       if (dir == DMA_TO_DEVICE)
>> +               return OPAL_PCI_P2P_STORE;
>> +       else if (dir == DMA_FROM_DEVICE)
>> +               return OPAL_PCI_P2P_LOAD;
>> +       else if (dir == DMA_BIDIRECTIONAL)
>> +               return OPAL_PCI_P2P_LOAD | OPAL_PCI_P2P_STORE;
>> +       else
>> +               return 0;
>> +}
>> +
>> +static int pnv_pci_ioda_enable_p2p(struct pci_dev *initiator,
>> +                                  struct pnv_phb *phb_target,
>> +                                  enum dma_data_direction dir)
>> +{
>> +       struct pci_controller *hose;
>> +       struct pnv_phb *phb_init;
>> +       struct pnv_ioda_pe *pe_init;
>> +       u64 desc;
>> +       int rc;
>> +
>> +       if (!opal_check_token(OPAL_PCI_SET_P2P))
>> +               return -ENXIO;
>> +
> 
>> +       hose = pci_bus_to_host(initiator->bus);
>> +       phb_init = hose->private_data;
> 
> You can use the pci_bus_to_pnvhb() helper
> 
>> +
>> +       pe_init = pnv_ioda_get_pe(initiator);
>> +       if (!pe_init)
>> +               return -ENODEV;
>> +
>> +       if (!pe_init->tce_bypass_enabled)
>> +               return -EINVAL;
>> +
>> +       /*
>> +        * Configuring the initiator's PHB requires to adjust its TVE#1
>> +        * setting. Since the same device can be an initiator several times for
>> +        * different target devices, we need to keep a reference count to know
>> +        * when we can restore the default bypass setting on its TVE#1 when
>> +        * disabling. Opal is not tracking PE states, so we add a reference
>> +        * count on the PE in linux.
>> +        *
>> +        * For the target, the configuration is per PHB, so we keep a
>> +        * target reference count on the PHB.
>> +        */
> 
> This irks me a bit because configuring the DMA address limits for the
> TVE is the kernel's job. What we really should be doing is using
> opal_pci_map_pe_dma_window_real() to set the bypass-mode address limit
> for the TVE to something large enough to hit the MMIO ranges rather
> than having set_p2p do it as a side effect. Unfortunately, for some
> reason skiboot doesn't implement support for enabling 56bit addressing
> using opal_pci_map_pe_dma_window_real() and we do need to support
> older kernel's which used this stuff so I guess we're stuck with it
> for now. It'd be nice if we could fix this in the longer term
> though...


OK. We'd need more than a 56-bit opal_pci_map_pe_dma_window_real() 
though, there's also a queue setting change on the target PHB.

   Fred


>> +       mutex_lock(&p2p_mutex);
>> +
>> +       desc = OPAL_PCI_P2P_ENABLE | pnv_pci_dma_dir_to_opal_p2p(dir);
>> +       /* always go to opal to validate the configuration */
>> +       rc = opal_pci_set_p2p(phb_init->opal_id, phb_target->opal_id, desc,
>> +                             pe_init->pe_number);
>> +       if (rc != OPAL_SUCCESS) {
>> +               rc = -EIO;
>> +               goto out;
>> +       }
>> +
>> +       pe_init->p2p_initiator_count++;
>> +       phb_target->p2p_target_count++;
>> +
>> +       rc = 0;
>> +out:
>> +       mutex_unlock(&p2p_mutex);
>> +       return rc;
>> +}
>> +
>> +static int pnv_pci_dma_map_resource(struct pci_dev *pdev,
>> +                                   phys_addr_t phys_addr, size_t size,
>> +                                   enum dma_data_direction dir)
>> +{
>> +       struct pnv_phb *target_phb;
>> +
>> +       target_phb = pnv_pci_find_owning_phb(pdev, phys_addr, size);
>> +       if (!target_phb)
>> +               return 0;
>> +
>> +       return pnv_pci_ioda_enable_p2p(pdev, target_phb, dir);
>> +}
>> +
>> +static int pnv_pci_ioda_disable_p2p(struct pci_dev *initiator,
>> +               struct pnv_phb *phb_target)
>> +{
>> +       struct pci_controller *hose;
>> +       struct pnv_phb *phb_init;
>> +       struct pnv_ioda_pe *pe_init;
>> +       int rc;
>> +
>> +       if (!opal_check_token(OPAL_PCI_SET_P2P))
>> +               return -ENXIO;
> 
> This should probably have a WARN_ON() since we can't hit this path
> unless the initial map succeeds.
> 
>> +       hose = pci_bus_to_host(initiator->bus);
>> +       phb_init = hose->private_data;
> 
> pci_bus_to_pnvhb()
> 
>> +       pe_init = pnv_ioda_get_pe(initiator);
>> +       if (!pe_init)
>> +               return -ENODEV;
>> +
>> +       mutex_lock(&p2p_mutex);
>> +
>> +       if (!pe_init->p2p_initiator_count || !phb_target->p2p_target_count) {
>> +               rc = -EINVAL;
>> +               goto out;
>> +       }
>> +
>> +       if (--pe_init->p2p_initiator_count == 0)
>> +               pnv_pci_ioda2_set_bypass(pe_init, true);
>> +
>> +       if (--phb_target->p2p_target_count == 0) {
>> +               rc = opal_pci_set_p2p(phb_init->opal_id, phb_target->opal_id,
>> +                                     0, pe_init->pe_number);
>> +               if (rc != OPAL_SUCCESS) {
>> +                       rc = -EIO;
>> +                       goto out;
>> +               }
>> +       }
>> +
>> +       rc = 0;
>> +out:
>> +       mutex_unlock(&p2p_mutex);
>> +       return rc;
>> +}
>> +
>> +static void pnv_pci_dma_unmap_resource(struct pci_dev *pdev,
>> +                                      dma_addr_t addr, size_t size,
>> +                                      enum dma_data_direction dir)
>> +{
>> +       struct pnv_phb *target_phb;
>> +       int rc;
>> +
>> +       target_phb = pnv_pci_find_owning_phb(pdev, addr, size);
>> +       if (!target_phb)
>> +               return;
>> +
>> +       rc = pnv_pci_ioda_disable_p2p(pdev, target_phb);
>> +       if (rc)
>> +               dev_err(&pdev->dev, "Failed to undo PCI peer-to-peer setup for address %llx: %d\n",
>> +                       addr, rc);
> 
> Use pci_err() or pe_err().
> 

^ permalink raw reply

* [powerpc:fixes-test] BUILD SUCCESS 6553fb799f601497ca0703682e2aff131197dc5c
From: kernel test robot @ 2020-08-11 18:51 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  fixes-test
branch HEAD: 6553fb799f601497ca0703682e2aff131197dc5c  powerpc/pkeys: Fix boot failures with Nemo board (A-EON AmigaOne X1000)

elapsed time: 868m

configs tested: 73
configs skipped: 93

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm64                            allyesconfig
arm64                               defconfig
arm                              allyesconfig
arm                              allmodconfig
mips                        qi_lb60_defconfig
m68k                          sun3x_defconfig
m68k                           sun3_defconfig
powerpc                     skiroot_defconfig
sh                     sh7710voipgw_defconfig
powerpc                      ppc64e_defconfig
powerpc                      ppc6xx_defconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
arc                              allyesconfig
nds32                             allnoconfig
c6x                              allyesconfig
nds32                               defconfig
nios2                            allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
parisc                           allyesconfig
s390                                defconfig
i386                             allyesconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
mips                             allyesconfig
mips                             allmodconfig
powerpc                             defconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
x86_64               randconfig-a006-20200811
x86_64               randconfig-a001-20200811
x86_64               randconfig-a003-20200811
x86_64               randconfig-a005-20200811
x86_64               randconfig-a004-20200811
x86_64               randconfig-a002-20200811
i386                 randconfig-a005-20200811
i386                 randconfig-a001-20200811
i386                 randconfig-a002-20200811
i386                 randconfig-a003-20200811
i386                 randconfig-a006-20200811
i386                 randconfig-a004-20200811
i386                 randconfig-a016-20200811
i386                 randconfig-a011-20200811
i386                 randconfig-a015-20200811
i386                 randconfig-a013-20200811
i386                 randconfig-a012-20200811
i386                 randconfig-a014-20200811
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
x86_64                                   rhel
x86_64                           allyesconfig
x86_64                    rhel-7.6-kselftests
x86_64                              defconfig
x86_64                               rhel-8.3
x86_64                                  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* [powerpc:merge] BUILD SUCCESS f04b169e953c4db1a3a3c1d23eea09c726f01ee5
From: kernel test robot @ 2020-08-11 18:57 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  merge
branch HEAD: f04b169e953c4db1a3a3c1d23eea09c726f01ee5  Automatic merge of 'master', 'next' and 'fixes' (2020-08-11 14:12)

elapsed time: 872m

configs tested: 66
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm64                            allyesconfig
arm                              allyesconfig
arm                              allmodconfig
arm64                               defconfig
ia64                             allyesconfig
ia64                             allmodconfig
ia64                                defconfig
m68k                             allmodconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
arc                              allyesconfig
nds32                             allnoconfig
c6x                              allyesconfig
nds32                               defconfig
nios2                            allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
parisc                           allyesconfig
s390                                defconfig
i386                             allyesconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
mips                             allyesconfig
mips                             allmodconfig
powerpc                             defconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
i386                 randconfig-a001-20200811
i386                 randconfig-a002-20200811
i386                 randconfig-a003-20200811
i386                 randconfig-a004-20200811
i386                 randconfig-a005-20200811
i386                 randconfig-a006-20200811
i386                 randconfig-a016-20200811
i386                 randconfig-a011-20200811
i386                 randconfig-a015-20200811
i386                 randconfig-a013-20200811
i386                 randconfig-a012-20200811
i386                 randconfig-a014-20200811
x86_64               randconfig-a006-20200811
x86_64               randconfig-a001-20200811
x86_64               randconfig-a003-20200811
x86_64               randconfig-a005-20200811
x86_64               randconfig-a004-20200811
x86_64               randconfig-a002-20200811
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
x86_64                                   rhel
x86_64                           allyesconfig
x86_64                    rhel-7.6-kselftests
x86_64                              defconfig
x86_64                               rhel-8.3
x86_64                                  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* [powerpc:next-test] BUILD SUCCESS 0a2900256840208c4a4248ff5900ae57990d55dc
From: kernel test robot @ 2020-08-11 18:57 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  next-test
branch HEAD: 0a2900256840208c4a4248ff5900ae57990d55dc  powerpc: Use simple i2c probe function

elapsed time: 871m

configs tested: 69
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm64                            allyesconfig
arm64                               defconfig
arm                              allyesconfig
arm                              allmodconfig
sh                          rsk7264_defconfig
mips                           xway_defconfig
h8300                    h8300h-sim_defconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
arc                              allyesconfig
nds32                             allnoconfig
c6x                              allyesconfig
nds32                               defconfig
nios2                            allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
parisc                           allyesconfig
s390                                defconfig
i386                             allyesconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
mips                             allyesconfig
mips                             allmodconfig
powerpc                             defconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
x86_64               randconfig-a006-20200811
x86_64               randconfig-a001-20200811
x86_64               randconfig-a003-20200811
x86_64               randconfig-a005-20200811
x86_64               randconfig-a004-20200811
x86_64               randconfig-a002-20200811
i386                 randconfig-a005-20200811
i386                 randconfig-a001-20200811
i386                 randconfig-a002-20200811
i386                 randconfig-a003-20200811
i386                 randconfig-a006-20200811
i386                 randconfig-a004-20200811
i386                 randconfig-a016-20200811
i386                 randconfig-a011-20200811
i386                 randconfig-a015-20200811
i386                 randconfig-a013-20200811
i386                 randconfig-a012-20200811
i386                 randconfig-a014-20200811
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
x86_64                                   rhel
x86_64                           allyesconfig
x86_64                    rhel-7.6-kselftests
x86_64                              defconfig
x86_64                               rhel-8.3
x86_64                                  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH 19/22] crypto: inside-secure - add check for xts input length equal to zero
From: Herbert Xu @ 2020-08-12  0:36 UTC (permalink / raw)
  To: Horia Geantă
  Cc: Andrei Botila (OSS), Andrei Botila, Van Leeuwen, Pascal,
	Antoine Tenart, linux-s390@vger.kernel.org, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-arm-kernel@axis.com,
	Eric Biggers, linux-crypto@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, David S. Miller,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <d4a471e6-34c9-c702-63d6-1f6a3cba0ebe@nxp.com>

On Tue, Aug 11, 2020 at 06:28:39PM +0300, Horia Geantă wrote:
>
> What about, for example, CBC?
> AFAICT cbc(aes) with input length = 0 is valid.

That's just because CBC accepts any input which is a multiple
of blocksize.

> Same for CTR (with the note that blocksize = 1) and several other algorithms
> mentioned in the cover letter.

CTR accepts any input size.

> What's the rule in these cases?

What input size is accepted depends on the algorithm.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v3 0/8] huge vmalloc mappings
From: Zefan Li @ 2020-08-12  1:07 UTC (permalink / raw)
  To: Jonathan Cameron, Nicholas Piggin
  Cc: linux-arch, H. Peter Anvin, Will Deacon, x86, linux-kernel,
	linux-mm, Ingo Molnar, Borislav Petkov, Catalin Marinas,
	Thomas Gleixner, linuxppc-dev, linux-arm-kernel
In-Reply-To: <20200811173217.0000161e@huawei.com>

On 2020/8/12 0:32, Jonathan Cameron wrote:
> On Mon, 10 Aug 2020 12:27:24 +1000
> Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> Not tested on x86 or arm64, would appreciate a quick test there so I can
>> ask Andrew to put it in -mm. Other option is I can disable huge vmallocs
>> for them for the time being.
> 
> Hi Nicholas,
> 
> For arm64 testing with a Kunpeng920.
> 
> I ran a quick sanity test with this series on top of mainline (yes mid merge window
> so who knows what state is...).  Could I be missing some dependency?
> 
> Without them it boots, with them it doesn't.  Any immediate guesses?
> 

I've already reported this bug in v2, and yeah I also tested it on arm64
(not Kunpeng though), so looks like it still hasn't been fixed.

...
>>
>> Since v2:
>> - Rebased on vmalloc cleanups, split series into simpler pieces.
>> - Fixed several compile errors and warnings
>> - Keep the page array and accounting in small page units because
>>   struct vm_struct is an interface (this should fix x86 vmap stack debug
>>   assert). [Thanks Zefan]

though the changelog says it's fixed for x86.


^ permalink raw reply

* [PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Nathan Lynch @ 2020-08-12  1:20 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tyreld, cheloha, ldufour

The drmem lmb list can have hundreds of thousands of entries, and
unfortunately lookups take the form of linear searches. As long as
this is the case, traversals have the potential to monopolize the CPU
and provoke lockup reports, workqueue stalls, and the like unless
they explicitly yield.

Rather than placing cond_resched() calls within various
for_each_drmem_lmb() loop blocks in the code, put it in the iteration
expression of the loop macro itself so users can't omit it.

Call cond_resched() on every 20th element. Each iteration of the loop
in DLPAR code paths can involve around ten RTAS calls which can each
take up to 250us, so this ensures the check is performed at worst
every few milliseconds.

Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
---
 arch/powerpc/include/asm/drmem.h | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Changes since v1:
* Add bounds assertions in drmem_lmb_next().
* Call cond_resched() in the iterator on only every 20th element
  instead of on every iteration, to reduce overhead in tight loops.

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 17ccc6474ab6..583277e30dd2 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -8,6 +8,9 @@
 #ifndef _ASM_POWERPC_LMB_H
 #define _ASM_POWERPC_LMB_H
 
+#include <linux/bug.h>
+#include <linux/sched.h>
+
 struct drmem_lmb {
 	u64     base_addr;
 	u32     drc_index;
@@ -26,8 +29,21 @@ struct drmem_lmb_info {
 
 extern struct drmem_lmb_info *drmem_info;
 
+static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
+{
+	const unsigned int resched_interval = 20;
+
+	BUG_ON(lmb < drmem_info->lmbs);
+	BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs);
+
+	if ((lmb - drmem_info->lmbs) % resched_interval == 0)
+		cond_resched();
+
+	return ++lmb;
+}
+
 #define for_each_drmem_lmb_in_range(lmb, start, end)		\
-	for ((lmb) = (start); (lmb) < (end); (lmb)++)
+	for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
 
 #define for_each_drmem_lmb(lmb)					\
 	for_each_drmem_lmb_in_range((lmb),			\
-- 
2.25.4


^ permalink raw reply related

* Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Nathan Lynch @ 2020-08-12  1:32 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: tyreld, cheloha, Laurent Dufour, linuxppc-dev
In-Reply-To: <87imdqz2sg.fsf@linux.ibm.com>

Nathan Lynch <nathanl@linux.ibm.com> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> One thought, which I possibly should not put in writing, is that we
>> could use the alignment of the pointer as a poor man's substitute for a
>> counter, eg:
>>
>> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
>> +{
>> +	if (lmb % PAGE_SIZE == 0)
>> +		cond_resched();
>> +
>> +	return ++lmb;
>> +}
>>
>> I think the lmbs are allocated in a block, so I think that will work.
>> Maybe PAGE_SIZE is not the right size to use, but you get the idea.
>>
>> Gross I know, but might be OK as short term solution?
>
> OK, looking into this.

To follow up:

I wasn't able to measure more than ~1% difference in DLPAR memory
performance with my original version of this, but that was on a
relatively small configuration - hundreds of elements in the array as
opposed to thousands. I took an educated guess at an appropriate
interval and posted v2:

https://lore.kernel.org/linuxppc-dev/20200812012005.1919255-1-nathanl@linux.ibm.com/

^ permalink raw reply

* Re: [PATCH v2] powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death
From: Thiago Jung Bauermann @ 2020-08-12  2:46 UTC (permalink / raw)
  To: Michael Roth; +Cc: Nathan Lynch, linuxppc-dev, Cedric Le Goater, Greg Kurz
In-Reply-To: <20200811161544.10513-1-mdroth@linux.vnet.ibm.com>


Michael Roth <mdroth@linux.vnet.ibm.com> writes:

> For a power9 KVM guest with XIVE enabled, running a test loop
> where we hotplug 384 vcpus and then unplug them, the following traces
> can be seen (generally within a few loops) either from the unplugged
> vcpu:
>
>   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>   [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
>   [ 1767.952322] ------------[ cut here ]------------
>   [ 1767.952323] kernel BUG at lib/list_debug.c:56!
>   [ 1767.952325] Oops: Exception in kernel mode, sig: 5 [#1]
>   [ 1767.952326] LE SMP NR_CPUS=2048 NUMA pSeries
>   [ 1767.952328] Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_chain_route_ipv4 ip6_tables nft_compat ip_set nf_tables nfnetlink uio_pdrv_genirq ip_tables xfs libcrc32c sd_mod sg xts vmx_crypto virtio_net net_failover failover virtio_scsi dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
>   [ 1767.952352] CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 4.18.0-221.el8.ppc64le #1
>   [ 1767.952354] NIP:  c0000000007ab50c LR: c0000000007ab508 CTR: 00000000000003ac
>   [ 1767.952355] REGS: c0000009e5a17840 TRAP: 0700   Not tainted  (4.18.0-221.el8.ppc64le)
>   [ 1767.952355] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000842  XER: 20040000
>   [ 1767.952360] CFAR: c0000000001ffe64 IRQMASK: 1
>   [ 1767.952360] GPR00: c0000000007ab508 c0000009e5a17ac0 c000000001ac0700 0000000000000054
>   [ 1767.952360] GPR04: c0000009f056cf90 c0000009f05f5628 c0000009ed40d654 c000000001c90700
>   [ 1767.952360] GPR08: 0000000000000007 c0000009f0573980 00000009ef2b0000 7562202c38303230
>   [ 1767.952360] GPR12: 0000000000000000 c0000007fff6ce80 c00a000002470208 0000000000000000
>   [ 1767.952360] GPR16: 0000000000000001 c0000009f05fbb00 0000000000000800 c0000009ff3d4980
>   [ 1767.952360] GPR20: c0000009f05fbb10 5deadbeef0000100 5deadbeef0000200 0000000000187961
>   [ 1767.952360] GPR24: c0000009e5a17b78 0000000000000000 0000000000000001 ffffffffffffffff
>   [ 1767.952360] GPR28: c00a000002470200 c0000009f05fbb10 c0000009f05fbb10 0000000000000000
>   [ 1767.952375] NIP [c0000000007ab50c] __list_del_entry_valid+0xac/0x100
>   [ 1767.952376] LR [c0000000007ab508] __list_del_entry_valid+0xa8/0x100
>   [ 1767.952377] Call Trace:
>   [ 1767.952378] [c0000009e5a17ac0] [c0000000007ab508] __list_del_entry_valid+0xa8/0x100 (unreliable)
>   [ 1767.952381] [c0000009e5a17b20] [c000000000476e18] free_pcppages_bulk+0x1f8/0x940
>   [ 1767.952383] [c0000009e5a17c20] [c00000000047a9a0] free_unref_page+0xd0/0x100
>   [ 1767.952386] [c0000009e5a17c50] [c0000000000bc2a8] xive_spapr_cleanup_queue+0x148/0x1b0
>   [ 1767.952388] [c0000009e5a17cf0] [c0000000000b96dc] xive_teardown_cpu+0x1bc/0x240
>   [ 1767.952391] [c0000009e5a17d30] [c00000000010bf28] pseries_mach_cpu_die+0x78/0x2f0
>   [ 1767.952393] [c0000009e5a17de0] [c00000000005c8d8] cpu_die+0x48/0x70
>   [ 1767.952394] [c0000009e5a17e00] [c000000000021cf0] arch_cpu_idle_dead+0x20/0x40
>   [ 1767.952397] [c0000009e5a17e20] [c0000000001b4294] do_idle+0x2f4/0x4c0
>   [ 1767.952399] [c0000009e5a17ea0] [c0000000001b46a8] cpu_startup_entry+0x38/0x40
>   [ 1767.952400] [c0000009e5a17ed0] [c00000000005c43c] start_secondary+0x7bc/0x8f0
>   [ 1767.952403] [c0000009e5a17f90] [c00000000000ac70] start_secondary_prolog+0x10/0x14
>
> or on the worker thread handling the unplug:
>
>   [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
>   [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2
>   [ 1538.360736] BUG: Bad page state in process kworker/u768:3  pfn:95de1
>   [ 1538.360746] cpu 314 (hwid 314) Ready to die...
>   [ 1538.360784] page:c00a000002577840 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0
>   [ 1538.361881] flags: 0x5ffffc00000000()
>   [ 1538.361908] raw: 005ffffc00000000 5deadbeef0000100 5deadbeef0000200 0000000000000000
>   [ 1538.361955] raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000
>   [ 1538.362002] page dumped because: nonzero mapcount
>   [ 1538.362033] Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables nft_compat ip_set nf_tables nfnetlink sunrpc xts vmx_crypto ip_tables xfs libcrc32c sd_mod sg virtio_net net_failover virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
>   [ 1538.362613] CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted 4.18.0-224.el8.bz1856588.ppc64le #1
>   [ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn
>   [ 1538.362725] Call Trace:
>   [ 1538.362743] [c0000009d4adf590] [c000000000e0e0fc] dump_stack+0xb0/0xf4 (unreliable)
>   [ 1538.362789] [c0000009d4adf5d0] [c000000000475dfc] bad_page+0x12c/0x1b0
>   [ 1538.362827] [c0000009d4adf660] [c0000000004784bc] free_pcppages_bulk+0x5bc/0x940
>   [ 1538.362871] [c0000009d4adf760] [c000000000478c38] page_alloc_cpu_dead+0x118/0x120
>   [ 1538.362918] [c0000009d4adf7b0] [c00000000015b898] cpuhp_invoke_callback.constprop.5+0xb8/0x760
>   [ 1538.362969] [c0000009d4adf820] [c00000000015eee8] _cpu_down+0x188/0x340
>   [ 1538.363007] [c0000009d4adf890] [c00000000015d75c] cpu_down+0x5c/0xa0
>   [ 1538.363045] [c0000009d4adf8c0] [c00000000092c544] cpu_subsys_offline+0x24/0x40
>   [ 1538.363091] [c0000009d4adf8e0] [c0000000009212f0] device_offline+0xf0/0x130
>   [ 1538.363129] [c0000009d4adf920] [c00000000010aee4] dlpar_offline_cpu+0x1c4/0x2a0
>   [ 1538.363174] [c0000009d4adf9e0] [c00000000010b2f8] dlpar_cpu_remove+0xb8/0x190
>   [ 1538.363219] [c0000009d4adfa60] [c00000000010b4fc] dlpar_cpu_remove_by_index+0x12c/0x150
>   [ 1538.363264] [c0000009d4adfaf0] [c00000000010ca24] dlpar_cpu+0x94/0x800
>   [ 1538.363302] [c0000009d4adfc00] [c000000000102cc8] pseries_hp_work_fn+0x128/0x1e0
>   [ 1538.363347] [c0000009d4adfc70] [c00000000018aa84] process_one_work+0x304/0x5d0
>   [ 1538.363394] [c0000009d4adfd10] [c00000000018b5cc] worker_thread+0xcc/0x7a0
>   [ 1538.363433] [c0000009d4adfdc0] [c00000000019567c] kthread+0x1ac/0x1c0
>   [ 1538.363469] [c0000009d4adfe30] [c00000000000b7dc] ret_from_kernel_thread+0x5c/0x80
>
> The latter trace is due to the following sequence:
>
>   page_alloc_cpu_dead
>     drain_pages
>       drain_pages_zone
>         free_pcppages_bulk
>
> where drain_pages() in this case is called under the assumption that
> the unplugged cpu is no longer executing. To ensure that is the case,
> and early call is made to __cpu_die()->pseries_cpu_die(), which runs
> a loop that waits for the cpu to reach a halted state by polling its
> status via query-cpu-stopped-state RTAS calls. It only polls for
> 25 iterations before giving up, however, and in the trace above this
> results in the following being printed only .1 seconds after the
> hotplug worker thread begins processing the unplug request:
>
>   [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
>   [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2
>
> At that point the worker thread assumes the unplugged CPU is in some
> unknown/dead state and procedes with the cleanup, causing the race with
> the XIVE cleanup code executed by the unplugged CPU.
>
> Fix this by waiting indefinitely, but also making an effort to avoid
> spurious lockup messages by allowing for rescheduling after polling
> the CPU status and printing a warning if we wait for longer than 120s.
>
> Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Cedric Le Goater <clg@kaod.org>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>

Thanks for fixing this!

Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

^ permalink raw reply

* Re: [PATCH] powerpc/rtas: Restrict RTAS requests from userspace
From: Andrew Donnellan @ 2020-08-12  3:13 UTC (permalink / raw)
  To: Daniel Axtens, Michael Ellerman, linuxppc-dev; +Cc: nathanl, leobras.c
In-Reply-To: <875z9pnvuv.fsf@dja-thinkpad.axtens.net>

On 11/8/20 11:41 pm, Daniel Axtens wrote:
>>>> +static bool block_rtas_call(int token, int nargs,
>>>> +			    struct rtas_args *args)
>>>> +{
>>>> +	int i;
>>>> +	const char *reason;
>>>> +	char *token_name = rtas_token_name(token);
>>>
>>> This code isn't particularly performance critical, but I think it would
>>> be cleaner to do the token lookup once at init time, and store the token
>>> in the filter array?
>>>
>>> Then this code would only be doing token comparisons.
>>
>> Yeah that would be cleaner, can get rid of rtas_token_name().
> 
> I'm not sure I quite understand what you're suggesting.
> 
> You still need to do a string->token lookup at least once as the tokens
> differ between PowerVM and qemu. Are you saying that you can fold the
> token name lookup into the init function?

Yeah, mpe is suggesting adding a member to the struct to cache the token 
value, and then just looping through all of them to populate that field 
at init time.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd@linux.ibm.com             IBM Australia Limited

^ permalink raw reply

* Re: [PATCH v5 0/4] Allow bigger 64bit window by removing default DMA window
From: Leonardo Bras @ 2020-08-12  4:42 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Alexey Kardashevskiy, Thiago Jung Bauermann, Ram Pai, Brian King,
	Murilo Fossa Vicentini, David Dai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200805030455.123024-1-leobras.c@gmail.com>

Hello Michael,

Do you suggest any change for this patchset?
Any chance it can get in this merge window?

Best regards,
Leonardo Bras

On Wed, 2020-08-05 at 00:04 -0300, Leonardo Bras wrote:
> There are some devices in which a hypervisor may only allow 1 DMA window
> to exist at a time, and in those cases, a DDW is never created to them,
> since the default DMA window keeps using this resource.
> 
> LoPAR recommends this procedure:
> 1. Remove the default DMA window,
> 2. Query for which configs the DDW can be created,
> 3. Create a DDW.
> 
> Patch #1:
> Create defines for outputs of ibm,ddw-applicable, so it's easier to
> identify them.
> 
> Patch #2:
> - After LoPAR level 2.8, there is an extension that can make
>   ibm,query-pe-dma-windows to have 6 outputs instead of 5. This changes the
>   order of the outputs, and that can cause some trouble. 
> - query_ddw() was updated to check how many outputs the 
>   ibm,query-pe-dma-windows is supposed to have, update the rtas_call() and
>   deal correctly with the outputs in both cases.
> - This patch looks somehow unrelated to the series, but it can avoid future
>   problems on DDW creation.
> 
> Patch #3 moves the window-removing code from remove_ddw() to
> remove_dma_window(), creating a way to delete any DMA window, so it can be
> used to delete the default DMA window.
> 
> Patch #4 makes use of the remove_dma_window() from patch #3 to remove the
> default DMA window before query_ddw(). It also implements a new rtas call
> to recover the default DMA window, in case anything fails after it was
> removed, and a DDW couldn't be created.
> 
> ---
> Changes since v4:
> - Removed patches 5+ in order to deal with a feature at a time
> - Remove unnecessary parentesis in patch #4
> - Changed patch #4 title from 
>   "Remove default DMA window before creating DDW"
> - Included David Dai tested-by
> - v4 link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=190051&state=%2A&archive=both
> 
> Changes since v3:
> - Introduces new patch #5, to prepare for an important change in #6
> - struct iommu_table was not being updated, so include a way to do this
>   in patch #6.
> - Improved patch #4 based in a suggestion from Alexey, to make code
>   more easily understandable
> - v3 link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187348&state=%2A&archive=both
> 
> Changes since v2:
> - Change the way ibm,ddw-extensions is accessed, using a proper function
>   instead of doing this inline everytime it's used.
> - Remove previous patch #6, as it doesn't look like it would be useful.
> - Add new patch, for changing names from direct* to dma*, as indirect 
>   mapping can be used from now on.
> - Fix some typos, corrects some define usage.
> - v2 link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=185433&state=%2A&archive=both
> 
> Changes since v1:
> - Add defines for ibm,ddw-applicable and ibm,ddw-extensions outputs
> - Merge aux function query_ddw_out_sz() into query_ddw()
> - Merge reset_dma_window() patch (prev. #2) into remove default DMA
>   window patch (#4).
> - Keep device_node *np name instead of using pdn in remove_*()
> - Rename 'device_node *pdn' into 'parent' in new functions
> - Rename dfl_win to default_win
> - Only remove the default DMA window if there is no window available
>   in first query.
> - Check if default DMA window can be restored before removing it.
> - Fix 'unitialized use' (found by travis mpe:ci-test)
> - New patches #5 and #6
> - v1 link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=184420&state=%2A&archive=both
> 
> Special thanks for Alexey Kardashevskiy, Brian King and
> Oliver O'Halloran for the feedback provided!
> 
> 
> Leonardo Bras (4):
>   powerpc/pseries/iommu: Create defines for operations in
>     ibm,ddw-applicable
>   powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
>   powerpc/pseries/iommu: Move window-removing part of remove_ddw into
>     remove_dma_window
>   powerpc/pseries/iommu: Allow bigger 64bit window by removing default
>     DMA window
> 
>  arch/powerpc/platforms/pseries/iommu.c | 242 ++++++++++++++++++++-----
>  1 file changed, 195 insertions(+), 47 deletions(-)
> 


^ permalink raw reply

* Re: [PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Christophe Leroy @ 2020-08-12  5:19 UTC (permalink / raw)
  To: Nathan Lynch, linuxppc-dev; +Cc: tyreld, cheloha, ldufour
In-Reply-To: <20200812012005.1919255-1-nathanl@linux.ibm.com>



Le 12/08/2020 à 03:20, Nathan Lynch a écrit :
> The drmem lmb list can have hundreds of thousands of entries, and
> unfortunately lookups take the form of linear searches. As long as
> this is the case, traversals have the potential to monopolize the CPU
> and provoke lockup reports, workqueue stalls, and the like unless
> they explicitly yield.
> 
> Rather than placing cond_resched() calls within various
> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
> expression of the loop macro itself so users can't omit it.
> 
> Call cond_resched() on every 20th element. Each iteration of the loop
> in DLPAR code paths can involve around ten RTAS calls which can each
> take up to 250us, so this ensures the check is performed at worst
> every few milliseconds.
> 
> Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format")
> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
> ---
>   arch/powerpc/include/asm/drmem.h | 18 +++++++++++++++++-
>   1 file changed, 17 insertions(+), 1 deletion(-)
> 
> Changes since v1:
> * Add bounds assertions in drmem_lmb_next().
> * Call cond_resched() in the iterator on only every 20th element
>    instead of on every iteration, to reduce overhead in tight loops.
> 
> diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
> index 17ccc6474ab6..583277e30dd2 100644
> --- a/arch/powerpc/include/asm/drmem.h
> +++ b/arch/powerpc/include/asm/drmem.h
> @@ -8,6 +8,9 @@
>   #ifndef _ASM_POWERPC_LMB_H
>   #define _ASM_POWERPC_LMB_H
>   
> +#include <linux/bug.h>
> +#include <linux/sched.h>
> +
>   struct drmem_lmb {
>   	u64     base_addr;
>   	u32     drc_index;
> @@ -26,8 +29,21 @@ struct drmem_lmb_info {
>   
>   extern struct drmem_lmb_info *drmem_info;
>   
> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
> +{
> +	const unsigned int resched_interval = 20;
> +
> +	BUG_ON(lmb < drmem_info->lmbs);
> +	BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs);

BUG_ON() shall be avoided unless absolutely necessary.
Wouldn't WARN_ON() together with an early return be enough ?

> +
> +	if ((lmb - drmem_info->lmbs) % resched_interval == 0)
> +		cond_resched();

Do you need something that precise ? Can't you use 16 or 32 and use a 
logical AND instead of a MODULO ?

And what garanties that lmb is always an element of a table based at 
drmem_info->lmbs ?

What about:

static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb, 
struct drmem_lmb *start)
{
	const unsigned int resched_interval = 16;

	if ((++lmb - start) & resched_interval == 0)
		cond_resched();

	return lmb;
}

#define for_each_drmem_lmb_in_range(lmb, start, end)		\
	for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb, start))


> +
> +	return ++lmb;
> +}
> +
>   #define for_each_drmem_lmb_in_range(lmb, start, end)		\
> -	for ((lmb) = (start); (lmb) < (end); (lmb)++)
> +	for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
>   
>   #define for_each_drmem_lmb(lmb)					\
>   	for_each_drmem_lmb_in_range((lmb),			\
> 

Christophe

^ permalink raw reply

* Re: [PATCH v5 06/10] powerpc/smp: Optimize start_secondary
From: Gautham R Shenoy @ 2020-08-12  5:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200810071834.92514-7-srikar@linux.vnet.ibm.com>

Hi Srikar,

On Mon, Aug 10, 2020 at 12:48:30PM +0530, Srikar Dronamraju wrote:
> In start_secondary, even if shared_cache was already set, system does a
> redundant match for cpumask. This redundant check can be removed by
> checking if shared_cache is already set.
> 
> While here, localize the sibling_mask variable to within the if
> condition.
> 
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

The change looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

> ---
> Changelog v4 ->v5:
> 	Retain cache domain, no need for generalization
> 		 (Michael Ellerman, Peter Zijlstra,
> 		 Valentin Schneider, Gautham R. Shenoy)
> 
> Changelog v1 -> v2:
> 	Moved shared_cache topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 0c960ce3be42..91cf5d05e7ec 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -851,7 +851,7 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> -	return cpu_l2_cache_mask(cpu);
> +	return per_cpu(cpu_l2_cache_map, cpu);
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> @@ -1305,7 +1305,6 @@ static void add_cpu_to_masks(int cpu)
>  void start_secondary(void *unused)
>  {
>  	unsigned int cpu = smp_processor_id();
> -	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> 
>  	mmgrab(&init_mm);
>  	current->active_mm = &init_mm;
> @@ -1331,14 +1330,20 @@ void start_secondary(void *unused)
>  	/* Update topology CPU masks */
>  	add_cpu_to_masks(cpu);
> 
> -	if (has_big_cores)
> -		sibling_mask = cpu_smallcore_mask;
>  	/*
>  	 * Check for any shared caches. Note that this must be done on a
>  	 * per-core basis because one core in the pair might be disabled.
>  	 */
> -	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> -		shared_caches = true;
> +	if (!shared_caches) {
> +		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> +		struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> +		if (has_big_cores)
> +			sibling_mask = cpu_smallcore_mask;
> +
> +		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> +			shared_caches = true;
> +	}
> 
>  	set_numa_node(numa_cpu_lookup_table[cpu]);
>  	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> -- 
> 2.18.2
> 

^ permalink raw reply

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
From: Srikar Dronamraju @ 2020-08-12  6:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gautham R Shenoy, Andi Kleen, David Hildenbrand, linuxppc-dev,
	linux-kernel, Michal Hocko, linux-mm, Satheesh Rajendran,
	Mel Gorman, Kirill A. Shutemov, Christopher Lameter,
	Michal Such?nek, Linus Torvalds, Vlastimil Babka
In-Reply-To: <20200806213211.6a6a56037fe771836e5abbe9@linux-foundation.org>

Hi Andrew, Michal, David

* Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]:

> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> 
> > > The memory hotplug changes that somehow because you can hotremove numa
> > > nodes and therefore make the nodemask sparse but that is not a common
> > > case. I am not sure what would happen if a completely new node was added
> > > and its corresponding node was already used by the renumbered one
> > > though. It would likely conflate the two I am afraid. But I am not sure
> > > this is really possible with x86 and a lack of a bug report would
> > > suggest that nobody is doing that at least.
> > > 
> > 
> > JFYI,
> > Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
> > hotplug on memoryless node. 
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=202187
> 
> So...  do we merge this patch or not?  Seems that the overall view is
> "risky but nobody is likely to do anything better any time soon"?

Can we decide on this one way or the other?

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* [PATCH 01/16] powerpc/mm: Add DEBUG_VM WARN for pmd_clear
From: Aneesh Kumar K.V @ 2020-08-12  6:33 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: linuxppc-dev, Aneesh Kumar K.V, Anshuman Khandual

With the hash page table, the kernel should not use pmd_clear for clearing
huge pte entries. Add a DEBUG_VM WARN to catch the wrong usage.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 6de56c3b33c4..079211968987 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -868,6 +868,13 @@ static inline bool pte_ci(pte_t pte)
 
 static inline void pmd_clear(pmd_t *pmdp)
 {
+	if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) {
+		/*
+		 * Don't use this if we can possibly have a hash page table
+		 * entry mapping this.
+		 */
+		WARN_ON((pmd_val(*pmdp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == (H_PAGE_HASHPTE | _PAGE_PTE));
+	}
 	*pmdp = __pmd(0);
 }
 
@@ -916,6 +923,13 @@ static inline int pmd_bad(pmd_t pmd)
 
 static inline void pud_clear(pud_t *pudp)
 {
+	if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) {
+		/*
+		 * Don't use this if we can possibly have a hash page table
+		 * entry mapping this.
+		 */
+		WARN_ON((pud_val(*pudp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == (H_PAGE_HASHPTE | _PAGE_PTE));
+	}
 	*pudp = __pud(0);
 }
 
-- 
2.26.2


^ permalink raw reply related

* [PATCH 02/16] debug_vm_pgtable/ppc64: Avoid setting top bits in radom value
From: Aneesh Kumar K.V @ 2020-08-12  6:33 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: linuxppc-dev, Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <20200812063358.369514-1-aneesh.kumar@linux.ibm.com>

ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in
random value.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/debug_vm_pgtable.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 086309fb9b6f..4c32063a8acf 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -45,9 +45,12 @@
  * pxx_clear() because of how dynamic page table folding works on s390. So
  * while loading up the entries do not change the lower 4 bits. It does not
  * have affect any other platform.
+ *
+ * Also avoid the 62nd bit on ppc64 that is used to mark a pte entry.
  */
 #define S390_MASK_BITS	4
-#define RANDOM_ORVALUE	GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS)
+#define PPC_MASK_BITS	2
+#define RANDOM_ORVALUE	GENMASK(BITS_PER_LONG - 1 - PPC_MASK_BITS, S390_MASK_BITS)
 #define RANDOM_NZVALUE	GENMASK(7, 0)
 
 static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
-- 
2.26.2


^ permalink raw reply related

* [PATCH 03/16] debug_vm_pgtable/set_pte: Don't use set_pte_at to update an existing pte entry
From: Aneesh Kumar K.V @ 2020-08-12  6:33 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: linuxppc-dev, Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <20200812063358.369514-1-aneesh.kumar@linux.ibm.com>

set_pte_at() should not be used to set a pte entry at locations that
already holds a valid pte entry. Architectures like ppc64 don't do TLB
invalidate in set_pte_at() and hence expect it to be used to set locations
that are not a valid PTE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/debug_vm_pgtable.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 4c32063a8acf..02a7c20aa4a2 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -81,8 +81,6 @@ static void __init pte_advanced_tests(struct mm_struct *mm,
 	pte = ptep_get(ptep);
 	WARN_ON(pte_write(pte));
 
-	pte = pfn_pte(pfn, prot);
-	set_pte_at(mm, vaddr, ptep, pte);
 	ptep_get_and_clear(mm, vaddr, ptep);
 	pte = ptep_get(ptep);
 	WARN_ON(!pte_none(pte));
@@ -97,12 +95,14 @@ static void __init pte_advanced_tests(struct mm_struct *mm,
 	pte = ptep_get(ptep);
 	WARN_ON(!(pte_write(pte) && pte_dirty(pte)));
 
-	pte = pfn_pte(pfn, prot);
-	set_pte_at(mm, vaddr, ptep, pte);
 	ptep_get_and_clear_full(mm, vaddr, ptep, 1);
 	pte = ptep_get(ptep);
 	WARN_ON(!pte_none(pte));
 
+	/*
+	 * We should clear pte before we do set_pte_at
+	 */
+	pte = ptep_get_and_clear(mm, vaddr, ptep);
 	pte = pte_mkyoung(pte);
 	set_pte_at(mm, vaddr, ptep, pte);
 	ptep_test_and_clear_young(vma, vaddr, ptep);
-- 
2.26.2


^ permalink raw reply related

* [PATCH 04/16] debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.
From: Aneesh Kumar K.V @ 2020-08-12  6:33 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: linuxppc-dev, Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <20200812063358.369514-1-aneesh.kumar@linux.ibm.com>

ppc64 supports huge vmap only with radix translation. Hence use arch helper
to determine the huge vmap support.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/debug_vm_pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 02a7c20aa4a2..679bb3d289a3 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -206,7 +206,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
 {
 	pmd_t pmd;
 
-	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
+	if (!arch_ioremap_pmd_supported())
 		return;
 
 	pr_debug("Validating PMD huge\n");
-- 
2.26.2


^ permalink raw reply related

* [PATCH 05/16] debug_vm_pgtable/savedwrite: Enable savedwrite test with CONFIG_NUMA_BALANCING
From: Aneesh Kumar K.V @ 2020-08-12  6:33 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: linuxppc-dev, Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <20200812063358.369514-1-aneesh.kumar@linux.ibm.com>

Saved write support was added to track the write bit of a pte after marking the
pte protnone. This was done so that AUTONUMA can convert a write pte to protnone
and still track the old write bit. When converting it back we set the pte write
bit correctly thereby avoiding a write fault again. Hence enable the test only
when CONFIG_NUMA_BALANCING is enabled.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/debug_vm_pgtable.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 679bb3d289a3..de8a62d0a931 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -110,6 +110,7 @@ static void __init pte_advanced_tests(struct mm_struct *mm,
 	WARN_ON(pte_young(pte));
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 static void __init pte_savedwrite_tests(unsigned long pfn, pgprot_t prot)
 {
 	pte_t pte = pfn_pte(pfn, prot);
@@ -118,6 +119,8 @@ static void __init pte_savedwrite_tests(unsigned long pfn, pgprot_t prot)
 	WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte))));
 	WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte))));
 }
+#endif
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot)
 {
@@ -221,6 +224,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
 	WARN_ON(!pmd_none(pmd));
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
 {
 	pmd_t pmd = pfn_pmd(pfn, prot);
@@ -229,6 +233,7 @@ static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
 	WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd))));
 	WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd))));
 }
+#endif
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot)
@@ -1005,8 +1010,10 @@ static int __init debug_vm_pgtable(void)
 	pmd_huge_tests(pmdp, pmd_aligned, prot);
 	pud_huge_tests(pudp, pud_aligned, prot);
 
+#ifdef CONFIG_NUMA_BALANCING
 	pte_savedwrite_tests(pte_aligned, prot);
 	pmd_savedwrite_tests(pmd_aligned, prot);
+#endif
 
 	pte_unmap_unlock(ptep, ptl);
 
-- 
2.26.2


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox