* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-16 15:46 UTC (permalink / raw)
To: Nicholas Piggin
Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, x86, linux-kernel,
linux-mm, Andy Lutomirski, linuxppc-dev
In-Reply-To: <1594873644.viept6os6j.astroid@bobo.none>
----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote:
> I should be more complete here, especially since I was complaining
> about unclear barrier comment :)
>
>
> CPU0 CPU1
> a. user stuff 1. user stuff
> b. membarrier() 2. enter kernel
> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule
> d. read rq->curr 4. rq->curr switched to kthread
> e. is kthread, skip IPI 5. switch_to kthread
> f. return to user 6. rq->curr switched to user thread
> g. user stuff 7. switch_to user thread
> 8. exit kernel
> 9. more user stuff
>
> What you're really ordering is a, g vs 1, 9 right?
>
> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
> etc.
>
> Userspace does not care where the barriers are exactly or what kernel
> memory accesses might be being ordered by them, so long as there is a
> mb somewhere between a and g, and 1 and 9. Right?
This is correct. Note that the accesses to user-space memory can be
done either by user-space code or kernel code, it doesn't matter.
However, in order to be considered as happening before/after
either membarrier or the matching compiler barrier, kernel code
needs to have causality relationship with user-space execution,
e.g. user-space does a system call, or returns from a system call.
In the case of io_uring, submitting a request or returning from waiting
on request completion appear to provide this causality relationship.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* Re: [PATCH net-next] ibmvnic: Increase driver logging
From: Thomas Falcon @ 2020-07-16 15:59 UTC (permalink / raw)
To: David Miller, kuba; +Cc: drt, netdev, linuxppc-dev
In-Reply-To: <20200715.182956.490791427431304861.davem@davemloft.net>
On 7/15/20 8:29 PM, David Miller wrote:
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Wed, 15 Jul 2020 17:06:32 -0700
>
>> On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:
>>> free_netdev(netdev);
>>> dev_set_drvdata(&dev->dev, NULL);
>>> + netdev_info(netdev, "VNIC client device has been successfully removed.\n");
>> A step too far, perhaps.
>>
>> In general this patch looks a little questionable IMHO, this amount of
>> logging output is not commonly seen in drivers. All the the info
>> messages are just static text, not even carrying any extra information.
>> In an era of ftrace, and bpftrace, do we really need this?
> Agreed, this is too much. This is debugging, and thus suitable for tracing
> facilities, at best.
Thanks for your feedback. I see now that I was overly aggressive with
this patch to be sure, but it would help with narrowing down problems at
a first glance, should they arise. The driver in its current state logs
very little of what is it doing without the use of additional debugging
or tracing facilities. Would it be worth it to pursue a less aggressive
version or would that be dead on arrival? What are acceptable driver
operations to log at this level?
Thanks,
Tom
^ permalink raw reply
* Re: [PATCH net-next] ibmvnic: Increase driver logging
From: Michal Suchánek @ 2020-07-16 16:07 UTC (permalink / raw)
To: Thomas Falcon; +Cc: drt, kuba, linuxppc-dev, David Miller, netdev
In-Reply-To: <9c9d6e46-240b-8513-08e4-e1c7556cb3c8@linux.ibm.com>
On Thu, Jul 16, 2020 at 10:59:58AM -0500, Thomas Falcon wrote:
>
> On 7/15/20 8:29 PM, David Miller wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > Date: Wed, 15 Jul 2020 17:06:32 -0700
> >
> > > On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:
> > > > free_netdev(netdev);
> > > > dev_set_drvdata(&dev->dev, NULL);
> > > > + netdev_info(netdev, "VNIC client device has been successfully removed.\n");
> > > A step too far, perhaps.
> > >
> > > In general this patch looks a little questionable IMHO, this amount of
> > > logging output is not commonly seen in drivers. All the the info
> > > messages are just static text, not even carrying any extra information.
> > > In an era of ftrace, and bpftrace, do we really need this?
> > Agreed, this is too much. This is debugging, and thus suitable for tracing
> > facilities, at best.
>
> Thanks for your feedback. I see now that I was overly aggressive with this
> patch to be sure, but it would help with narrowing down problems at a first
> glance, should they arise. The driver in its current state logs very little
> of what is it doing without the use of additional debugging or tracing
> facilities. Would it be worth it to pursue a less aggressive version or
> would that be dead on arrival? What are acceptable driver operations to log
> at this level?
Also would it be advisable to add the messages as pr_dbg to be enabled on demand?
Thanks
Michal
^ permalink raw reply
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-16 16:03 UTC (permalink / raw)
To: Nicholas Piggin, paulmck
Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, x86, linux-kernel,
linux-mm, Andy Lutomirski, linuxppc-dev
In-Reply-To: <1494299304.15894.1594914382695.JavaMail.zimbra@efficios.com>
----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote:
>> I should be more complete here, especially since I was complaining
>> about unclear barrier comment :)
>>
>>
>> CPU0 CPU1
>> a. user stuff 1. user stuff
>> b. membarrier() 2. enter kernel
>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule
>> d. read rq->curr 4. rq->curr switched to kthread
>> e. is kthread, skip IPI 5. switch_to kthread
>> f. return to user 6. rq->curr switched to user thread
>> g. user stuff 7. switch_to user thread
>> 8. exit kernel
>> 9. more user stuff
>>
>> What you're really ordering is a, g vs 1, 9 right?
>>
>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>> etc.
>>
>> Userspace does not care where the barriers are exactly or what kernel
>> memory accesses might be being ordered by them, so long as there is a
>> mb somewhere between a and g, and 1 and 9. Right?
>
> This is correct.
Actually, sorry, the above is not quite right. It's been a while
since I looked into the details of membarrier.
The smp_mb() at the beginning of membarrier() needs to be paired with a
smp_mb() _after_ rq->curr is switched back to the user thread, so the
memory barrier is between store to rq->curr and following user-space
accesses.
The smp_mb() at the end of membarrier() needs to be paired with the
smp_mb__after_spinlock() at the beginning of schedule, which is
between accesses to userspace memory and switching rq->curr to kthread.
As to *why* this ordering is needed, I'd have to dig through additional
scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?
Thanks,
Mathieu
> Note that the accesses to user-space memory can be
> done either by user-space code or kernel code, it doesn't matter.
> However, in order to be considered as happening before/after
> either membarrier or the matching compiler barrier, kernel code
> needs to have causality relationship with user-space execution,
> e.g. user-space does a system call, or returns from a system call.
>
> In the case of io_uring, submitting a request or returning from waiting
> on request completion appear to provide this causality relationship.
>
> Thanks,
>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* Re: [PATCH v3 0/3] Off-load TLB invalidations to host for !GTSE
From: Qian Cai @ 2020-07-16 17:27 UTC (permalink / raw)
To: Bharata B Rao
Cc: sfr, aneesh.kumar, linux-kernel, npiggin, linux-next,
linuxppc-dev
In-Reply-To: <20200703053608.12884-1-bharata@linux.ibm.com>
On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
> Hypervisor may choose not to enable Guest Translation Shootdown Enable
> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
> permitted to use instructions like tblie and tlbsync directly, but is
> expected to make hypervisor calls to get the TLB flushed.
>
> This series enables the TLB flush routines in the radix code to
> off-load TLB flushing to hypervisor via the newly proposed hcall
> H_RPT_INVALIDATE.
>
> To easily check the availability of GTSE, it is made an MMU feature.
> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
> handle GTSE as an optionally available feature and to not assume GTSE
> when radix support is available.
>
> The actual hcall implementation for KVM isn't included in this
> patchset and will be posted separately.
>
> Changes in v3
> =============
> - Fixed a bug in the hcall wrapper code where we were missing setting
> H_RPTI_TYPE_NESTED while retrying the failed flush request with
> a full flush for the nested case.
> - s/psize_to_h_rpti/psize_to_rpti_pgsize
>
> v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t
>
> Bharata B Rao (2):
> powerpc/mm: Enable radix GTSE only if supported.
> powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
> enabled
>
> Nicholas Piggin (1):
> powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
> !GTSE
Reverting the whole series fixed random memory corruptions during boot on
POWER9 PowerNV systems below.
IBM 8335-GTH (ibm,witherspoon)
POWER9, altivec supported
262144 MB memory, 2000 GB disk space
.config:
https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
[ 9.338996][ T925] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[ 9.339026][ T925] Faulting instruction address: 0x00000000
[ 9.339051][ T925] Oops: Kernel access of bad area, sig: 11 [#1]
[ 9.339064][ T925] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 NUMA PowerNV
[ 9.339098][ T925] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod
[ 9.339150][ T925] CPU: 92 PID: 925 Comm: (md-udevd) Not tainted 5.8.0-rc5-next-20200716 #3
[ 9.339186][ T925] NIP: 0000000000000000 LR: c00000000021f2cc CTR: 0000000000000000
[ 9.339210][ T925] REGS: c000201cb52d79b0 TRAP: 0400 Not tainted (5.8.0-rc5-next-20200716)
[ 9.339244][ T925] MSR: 9000000040009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24222292 XER: 00000000
[ 9.339278][ T925] CFAR: c00000000021f2c8 IRQMASK: 0
[ 9.339278][ T925] GPR00: c00000000021f2cc c000201cb52d7c40 c000000005901000 c000201cb52d7ca8
[ 9.339278][ T925] GPR04: c00800000ea60038 0000000000000000 000000007fff0000 000000007fff0000
[ 9.339278][ T925] GPR08: 0000000000000000 0000000000000000 c000201cb50bd500 0000000000000003
[ 9.339278][ T925] GPR12: 0000000000000000 c000201fff694500 00007fffa4a8a940 00007fffa4a8a6c8
[ 9.339278][ T925] GPR16: 00007fffa4a8a8f8 00007fffa4a8a650 00007fffa4a8a488 0000000000000000
[ 9.339278][ T925] GPR20: 0000000000050001 00007fffa4a8a984 000000007fff0000 c00000000a4545cc
[ 9.339278][ T925] GPR24: c000000000affe28 0000000000000000 0000000000000000 0000000000000166
[ 9.339278][ T925] GPR28: c000201cb52d7ca8 c00800000ea60000 c000201cc3b72600 000000007fff0000
[ 9.339493][ T925] NIP [0000000000000000] 0x0
[ 9.339516][ T925] LR [c00000000021f2cc] __seccomp_filter+0xec/0x530
bpf_dispatcher_nop_func at include/linux/bpf.h:567
(inlined by) bpf_prog_run_pin_on_cpu at include/linux/filter.h:597
(inlined by) seccomp_run_filters at kernel/seccomp.c:324
(inlined by) __seccomp_filter at kernel/seccomp.c:937
[ 9.339538][ T925] Call Trace:
[ 9.339548][ T925] [c000201cb52d7c40] [c00000000021f2cc] __seccomp_filter+0xec/0x530 (unreliable)
[ 9.339566][ T925] [c000201cb52d7d50] [c000000000025af8] do_syscall_trace_enter+0xb8/0x470
do_seccomp at arch/powerpc/kernel/ptrace/ptrace.c:252
(inlined by) do_syscall_trace_enter at arch/powerpc/kernel/ptrace/ptrace.c:327
[ 9.339600][ T925] [c000201cb52d7dc0] [c00000000002c8f8] system_call_exception+0x138/0x180
[ 9.339625][ T925] [c000201cb52d7e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[ 9.339648][ T925] Instruction dump:
[ 9.339667][ T925] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 9.339706][ T925] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 9.339748][ T925] ---[ end trace d89eb80f9a6bc141 ]---
[ OK ] Started Journal Service.
[ 10.452364][ T925] Kernel panic - not syncing: Fatal exception
[ 11.876655][ T925] ---[ end Kernel panic - not syncing: Fatal exception ]---
There could also be lots of random userspace segfault like,
[ 16.463545][ T771] rngd[771]: segfault (11) at 0 nip 0 lr 0 code 1 in rngd[106d60000+20000]
[ 16.463620][ T771] rngd[771]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 16.463656][ T771] rngd[771]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
Occasionally, there are many soft-lockups,
[ 396.920702][ C99] watchdog: BUG: soft lockup - CPU#99 stuck for 22s! [(spawn):2692]
[ 396.920754][ C99] Modules linked in: kvm_hv kvm ip_tables x_tables sd_mod bnx2x tg3 ahci mdio libahci libphy firmware_class libata dm_mirror dm_region_hash dm_log dm_mod
[ 396.920843][ C99] irq event stamp: 1731717220
[ 396.920860][ C99] hardirqs last enabled at (1731717219): [<c00000000004d6f4>] do_page_fault+0x324/0xd90
[ 396.920889][ C99] hardirqs last disabled at (1731717220): [<c000000000015638>] arch_local_irq_restore+0x48/0xd0
[ 396.920919][ C99] softirqs last enabled at (41260): [<c0000000009abbe8>] __do_softirq+0x648/0x8c8
[ 396.920948][ C99] softirqs last disabled at (41125): [<c0000000000d717c>] irq_exit+0x15c/0x1c0
[ 396.920976][ C99] CPU: 99 PID: 2692 Comm: (spawn) Tainted: G L 5.8.0-rc5-next-20200716 #3
[ 396.921001][ C99] NIP: c0000000000152b4 LR: c000000000015640 CTR: 0000000000000000
[ 396.921037][ C99] REGS: c000201cbc3d7178 TRAP: 0900 Tainted: G L (5.8.0-rc5-next-20200716)
[ 396.921074][ C99] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE> CR: 28022482 XER: 20040000
[ 396.921122][ C99] CFAR: 0000000000000000 IRQMASK: 3
[ 396.921122][ C99] GPR00: c000000000015640 c000201cbc3d7340 c000000005901000 c000201cbc3d7178
[ 396.921122][ C99] GPR04: c0000000057d7280 0000000000000000 000000000002000a 0000000000000003
[ 396.921122][ C99] GPR08: 0000201cc61c0000 0000000000000000 0000000000000001 c00000000593f868
[ 396.921122][ C99] GPR12: 0000000000002000 c000201fff67e700 00007fffdcda3918 0000000139eeba60
[ 396.921122][ C99] GPR16: 0000000139f30130 00007fffdcda39c8 0000000139eea708 0000000000000000
[ 396.921122][ C99] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000008
[ 396.921122][ C99] GPR24: 0000000000000e60 0000000000000900 0000000000000500 0000000000000a00
[ 396.921122][ C99] GPR28: 0000000000000f00 0000000000000002 0000000000000003 c000201ca4212400
[ 396.921468][ C99] NIP [c0000000000152b4] replay_soft_interrupts+0x74/0x3b0
replay_soft_interrupts at arch/powerpc/kernel/irq.c:216
[ 396.921504][ C99] LR [c000000000015640] arch_local_irq_restore+0x50/0xd0
arch_local_irq_restore at arch/powerpc/kernel/irq.c:375
[ 396.921539][ C99] Call Trace:
[ 396.921560][ C99] [c000201cbc3d7340] [c000000000015640] arch_local_irq_restore+0x50/0xd0 (unreliable)
[ 396.921602][ C99] [c000201cbc3d7360] [c0000000009a0c68] lock_is_held_type+0xf8/0x180
[ 396.921641][ C99] [c000201cbc3d73c0] [c0000000003e8cf0] mem_cgroup_from_task+0xa0/0x130
[ 396.921666][ C99] [c000201cbc3d7400] [c000000000337950] handle_mm_fault+0x140/0x1d20
[ 396.921703][ C99] [c000201cbc3d7500] [c00000000004d5ac] do_page_fault+0x1dc/0xd90
[ 396.921763][ C99] [c000201cbc3d7600] [c00000000000c028] handle_page_fault+0x10/0x2c
[ 396.921804][ C99] --- interrupt: 300 at futex_cleanup+0x3c0/0x740
[ 396.921804][ C99] LR = futex_cleanup+0x35c/0x740
[ 396.921879][ C99] [c000201cbc3d79c0] [c0000000001df2e8] futex_exec_release+0x28/0x50
[ 396.921929][ C99] [c000201cbc3d79f0] [c0000000000c5e54] exec_mm_release+0x24/0x50
[ 396.921968][ C99] [c000201cbc3d7a30] [c000000000421e84] begin_new_exec+0x324/0xea0
[ 396.922005][ C99] [c000201cbc3d7af0] [c0000000004d8f1c] load_elf_binary+0x7fc/0x1110
[ 396.922042][ C99] [c000201cbc3d7bf0] [c000000000420824] exec_binprm+0x1c4/0x7d0
[ 396.922079][ C99] [c000201cbc3d7cb0] [c000000000421540] do_execveat_common+0x710/0x960
[ 396.922117][ C99] [c000201cbc3d7d90] [c000000000422a44] sys_execve+0x44/0x60
[ 396.922156][ C99] [c000201cbc3d7dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180
[ 396.922205][ C99] [c000201cbc3d7e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[ 396.922253][ C99] Instruction dump:
[ 396.922286][ C99] 3b000e60 3b400500 3b600a00 3b800f00 f8010010 f821fe11 38610028 e92d0c70
[ 396.922316][ C99] f9210198 39200000 8aed0989 48037df9 <60000000> 39200003 f9210160 56e90738
[ 248.821138][ T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.821170][ T676] khugepaged D28416 682 2 0x00000800
[ 248.821212][ T676] Call Trace:
[ 248.821241][ T676] [c000001ff310f4e0] [c000000000caeb80] lru_add_drain_work+0x0/0x48 (unreliable)
[ 248.821275][ T676] [c000001ff310f6c0] [c00000000001a2d0] __switch_to+0x260/0x380
[ 248.821308][ T676] [c000001ff310f720] [c0000000009a18b8] __schedule+0x398/0x9f0
[ 248.821352][ T676] [c000001ff310f7f0] [c0000000009a1fa8] schedule+0x98/0x160
[ 248.821387][ T676] [c000001ff310f820] [c0000000009a9814] schedule_timeout+0x304/0x520
[ 248.821432][ T676] [c000001ff310f960] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0
[ 248.821460][ T676] [c000001ff310f9d0] [c0000000000fd0c8] __flush_work+0x3b8/0x770
[ 248.821491][ T676] [c000001ff310faf0] [c0000000002e0ac4] lru_add_drain_all+0x3e4/0x760
[ 248.821521][ T676] [c000001ff310fbf0] [c0000000003e0f18] khugepaged+0xd8/0x1770
[ 248.821560][ T676] [c000001ff310fdb0] [c0000000001095fc] kthread+0x1bc/0x1d0
[ 248.821611][ T676] [c000001ff310fe20] [c00000000000cbc4] ret_from_kernel_thread+0x5c/0x78
[ 248.821655][ T676] INFO: task kworker/56:1:719 blocked for more than 122 seconds.
[ 248.821689][ T676] Tainted: G L 5.8.0-rc5-next-20200716 #3
[ 248.821729][ T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.821779][ T676] kworker/56:1 D27584 719 2 0x00000800
[ 248.821839][ T676] Workqueue: rcu_gp wait_rcu_exp_gp
[ 248.821862][ T676] Call Trace:
[ 248.821888][ T676] [c000001ff2b57660] [c0000000057c9fb0] rcu_state+0x4fb0/0x5100 (unreliable)
[ 248.821934][ T676] [c000001ff2b57840] [c00000000001a2d0] __switch_to+0x260/0x380
[ 248.821977][ T676] [c000001ff2b578a0] [c0000000009a18b8] __schedule+0x398/0x9f0
[ 248.822021][ T676] [c000001ff2b57970] [c0000000009a1fa8] schedule+0x98/0x160
[ 248.822066][ T676] [c000001ff2b579a0] [c0000000009a970c] schedule_timeout+0x1fc/0x520
[ 248.822110][ T676] [c000001ff2b57ae0] [c0000000001a86d0] rcu_exp_wait_wake+0x1b0/0x950
[ 248.822153][ T676] [c000001ff2b57c30] [c0000000000fb754] process_one_work+0x304/0x900
[ 248.822197][ T676] [c000001ff2b57d20] [c0000000000fbdc8] worker_thread+0x78/0x520
[ 248.822242][ T676] [c000001ff2b57db0] [c0000000001095fc] kthread+0x1bc/0x1d0
[ 248.822279][ T676] [c000001ff2b57e20] [c00000000000cbc4] ret_from_kernel_thread+0x5c/0x78
[ 248.822385][ T676] INFO: task lvm:3123 blocked for more than 122 seconds.
[ 248.822413][ T676] Tainted: G L 5.8.0-rc5-next-20200716 #3
[ 248.822462][ T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.822503][ T676] lvm D26608 3123 1 0x00040000
[ 248.822552][ T676] Call Trace:
[ 248.822576][ T676] [c000001feee27a00] [c00000000001a2d0] __switch_to+0x260/0x380
[ 248.822620][ T676] [c000001feee27a60] [c0000000009a18b8] __schedule+0x398/0x9f0
[ 248.822648][ T676] [c000001feee27b30] [c0000000009a1fa8] schedule+0x98/0x160
[ 248.822680][ T676] [c000001feee27b60] [c0000000009a9814] schedule_timeout+0x304/0x520
[ 248.822724][ T676] [c000001feee27ca0] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0
[ 248.822768][ T676] [c000001feee27d10] [c0000000004b0e88] sys_io_destroy+0x238/0x2f0
[ 248.822808][ T676] [c000001feee27dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180
[ 248.822840][ T676] [c000001feee27e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[ 248.822873][ T676] INFO: task lvm:3126 blocked for more than 122 seconds.
[ 248.822901][ T676] Tainted: G L 5.8.0-rc5-next-20200716 #3
[ 248.822938][ T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.822987][ T676] lvm D26608 3126 1 0x00040000
[ 248.823017][ T676] Call Trace:
[ 248.823031][ T676] [c000001fc0b27a00] [c00000000001a2d0] __switch_to+0x260/0x380
[ 248.823075][ T676] [c000001fc0b27a60] [c0000000009a18b8] __schedule+0x398/0x9f0
[ 248.823113][ T676] [c000001fc0b27b30] [c0000000009a1fa8] schedule+0x98/0x160
[ 248.823158][ T676] [c000001fc0b27b60] [c0000000009a9814] schedule_timeout+0x304/0x520
[ 248.823199][ T676] [c000001fc0b27ca0] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0
[ 248.823250][ T676] [c000001fc0b27d10] [c0000000004b0e88] sys_io_destroy+0x238/0x2f0
[ 248.823294][ T676] [c000001fc0b27dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180
[ 248.823332][ T676] [c000001fc0b27e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[ 248.823374][ T676] INFO: task auditd:3163 blocked for more than 122 seconds.
[ 248.823424][ T676] Tainted: G L 5.8.0-rc5-next-20200716 #3
[ 248.823471][ T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.823512][ T676] auditd D27088 3163 1 0x00042408
[ 248.823551][ T676] Call Trace:
[ 248.823583][ T676] [c000001fc080f760] [0000000200000001] 0x200000001 (unreliable)
[ 248.823648][ T676] [c000001fc080f940] [c00000000001a2d0] __switch_to+0x260/0x380
[ 248.823689][ T676] [c000001fc080f9a0] [c0000000009a18b8] __schedule+0x398/0x9f0
[ 248.823742][ T676] [c000001fc080fa70] [c0000000009a1fa8] schedule+0x98/0x160
[ 248.823784][ T676] [c000001fc080faa0] [c0000000001a9244] synchronize_rcu_expedited+0x394/0x600
[ 248.823837][ T676] [c000001fc080fba0] [c0000000004504c4] namespace_unlock+0xf4/0x230
[ 248.823881][ T676] [c000001fc080fc00] [c000000000456dec] put_mnt_ns+0x5c/0x80
[ 248.823926][ T676] [c000001fc080fc30] [c00000000010ba6c] free_nsproxy+0x2c/0x1e0
[ 248.823966][ T676] [c000001fc080fc60] [c0000000000d5130] do_exit+0x4e0/0xee0
[ 248.823997][ T676] [c000001fc080fd60] [c0000000000d5bec] do_group_exit+0x5c/0xd0
[ 248.824019][ T676] [c000001fc080fda0] [c0000000000d5c7c] sys_exit_group+0x1c/0x20
[ 248.824060][ T676] [c000001fc080fdc0] [c00000000002c8b8] system_call_exception+0xf8/0x180
[ 248.824103][ T676] [c000001fc080fe20] [c00000000000c9e8] system_call_common+0xe8/0x214
[ 248.824192][ T676]
[ 248.824192][ T676] Showing all locks held in the system:
[ 248.824419][ T676] 1 lock held by khungtaskd/676:
[ 248.824455][ T676] #0: c0000000057c44c0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.29+0x8/0x30
[ 248.824531][ T676] 1 lock held by khugepaged/682:
[ 248.824565][ T676] #0: c0000000057f42c8 (lock#4){+.+.}-{3:3}, at: lru_add_drain_all+0x68/0x760
[ 248.824679][ T676] 2 locks held by kworker/56:1/719:
[ 248.824742][ T676] #0: c00000000bcc8938 ((wq_completion)rcu_gp){+.+.}-{0:0}, at: process_one_work+0x21c/0x900
[ 248.824857][ T676] #1: c000001ff2b57c90 ((work_completion)(&rew.rew_work)){+.+.}-{0:0}, at: process_one_work+0x21c/0x900
[ 248.825026][ T676] 3 locks held by (spawn)/2692:
[ 248.825077][ T676] 1 lock held by auditd/3163:
[ 248.825135][ T676] #0: c0000000057c9ee8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x254/0x600
[ 248.825296][ T676] =============================================
[ 248.825296][ T676]
>
> .../include/asm/book3s/64/tlbflush-radix.h | 15 ++++
> arch/powerpc/include/asm/hvcall.h | 34 +++++++-
> arch/powerpc/include/asm/mmu.h | 4 +
> arch/powerpc/include/asm/plpar_wrappers.h | 52 ++++++++++++
> arch/powerpc/kernel/dt_cpu_ftrs.c | 1 +
> arch/powerpc/kernel/prom_init.c | 13 +--
> arch/powerpc/mm/book3s64/radix_tlb.c | 82 +++++++++++++++++--
> arch/powerpc/mm/init_64.c | 5 +-
> arch/powerpc/platforms/pseries/lpar.c | 8 +-
> 9 files changed, 197 insertions(+), 17 deletions(-)
>
> --
> 2.21.3
>
^ permalink raw reply
* Re: [V2 PATCH 1/3] Refactoring powerpc code for carrying over IMA measurement logs, to move non architecture specific code to security/ima.
From: Thiago Jung Bauermann @ 2020-07-16 17:51 UTC (permalink / raw)
To: Prakhar Srivastava
Cc: kstewart, mark.rutland, gregkh, bhsharma, tao.li, zohar, paulus,
vincenzo.frascino, will, nramas, frowand.list, masahiroy, jmorris,
takahiro.akashi, linux-arm-kernel, catalin.marinas, serge,
devicetree, pasha.tatashin, robh+dt, hsinyi, tusharsu, tglx,
allison, christophe.leroy, mbrugger, balajib, dmitry.kasatkin,
linux-kernel, linux-security-module, james.morse, linux-integrity,
linuxppc-dev
In-Reply-To: <1385c8bb-cd25-8dc4-7224-8e27135f3356@linux.microsoft.com>
Hello Prakhar,
Prakhar Srivastava <prsriva@linux.microsoft.com> writes:
> On 6/19/20 5:19 PM, Thiago Jung Bauermann wrote:
>>
>> Prakhar Srivastava <prsriva@linux.microsoft.com> writes:
>>
>>> Powerpc has support to carry over the IMA measurement logs. Refatoring the
>>> non-architecture specific code out of arch/powerpc and into security/ima.
>>>
>>> The code adds support for reserving and freeing up of memory for IMA measurement
>>> logs.
>>
>> Last week, Mimi provided this feedback:
>>
>> "From your patch description, this patch should be broken up. Moving
>> the non-architecture specific code out of powerpc should be one patch.
>> Additional support should be in another patch. After each patch, the
>> code should work properly."
>>
>> That's not what you do here. You move the code, but you also make other
>> changes at the same time. This has two problems:
>>
>> 1. It makes the patch harder to review, because it's very easy to miss a
>> change.
>>
>> 2. If in the future a git bisect later points to this patch, it's not
>> clear whether the problem is because of the code movement, or because
>> of the other changes.
>>
>> When you move code, ideally the patch should only make the changes
>> necessary to make the code work at its new location. The patch which
>> does code movement should not cause any change in behavior.
>>
>> Other changes should go in separate patches, either before or after the
>> one moving the code.
>>
>> More comments below.
>>
> Hi Thiago,
>
> Apologies for the delayed response i was away for a few days.
> I am working on breaking up the changes so that its easier to review and update
> as well.
No problem.
>
> Thanks,
> Prakhar Srivastava
>
>>>
>>> ---
>>> arch/powerpc/include/asm/ima.h | 10 ---
>>> arch/powerpc/kexec/ima.c | 126 ++---------------------------
>>> security/integrity/ima/ima_kexec.c | 116 ++++++++++++++++++++++++++
>>> 3 files changed, 124 insertions(+), 128 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
>>> index ead488cf3981..c29ec86498f8 100644
>>> --- a/arch/powerpc/include/asm/ima.h
>>> +++ b/arch/powerpc/include/asm/ima.h
>>> @@ -4,15 +4,6 @@
>>>
>>> struct kimage;
>>>
>>> -int ima_get_kexec_buffer(void **addr, size_t *size);
>>> -int ima_free_kexec_buffer(void);
>>> -
>>> -#ifdef CONFIG_IMA
>>> -void remove_ima_buffer(void *fdt, int chosen_node);
>>> -#else
>>> -static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
>>> -#endif
>>> -
>>> #ifdef CONFIG_IMA_KEXEC
>>> int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
>>> size_t size);
>>> @@ -22,7 +13,6 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node);
>>> static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
>>> int chosen_node)
>>> {
>>> - remove_ima_buffer(fdt, chosen_node);
>>> return 0;
>>> }
>>
>> This is wrong. Even if the currently running kernel doesn't have
>> CONFIG_IMA_KEXEC, it should remove the IMA buffer property and memory
>> reservation from the FDT that is being prepared for the next kernel.
>>
>> This is because the IMA kexec buffer is useless for the next kernel,
>> regardless of whether the current kernel supports CONFIG_IMA_KEXEC or
>> not. Keeping it around would be a waste of memory.
>>
> I will keep it in my next revision.
> My understanding was the reserved memory is freed and property removed when IMA
> loads the logs on init.
If CONFIG_IMA_KEXEC is set, then yes. If it isn't then that needs to
happen in the function above.
> During setup_fdt in kexec, a duplicate copy of the dt is
> used, but memory still needs to be allocated, thus the property itself indicats
> presence of reserved memory.
>
>>> @@ -179,13 +64,18 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node)
>>> int ret, addr_cells, size_cells, entry_size;
>>> u8 value[16];
>>>
>>> - remove_ima_buffer(fdt, chosen_node);
>>
>> This is wrong, for the same reason stated above.
>>
>>> if (!image->arch.ima_buffer_size)
>>> return 0;
>>>
>>> - ret = get_addr_size_cells(&addr_cells, &size_cells);
>>> - if (ret)
>>> + ret = fdt_address_cells(fdt, chosen_node);
>>> + if (ret < 0)
>>> + return ret;
>>> + addr_cells = ret;
>>> +
>>> + ret = fdt_size_cells(fdt, chosen_node);
>>> + if (ret < 0)
>>> return ret;
>>> + size_cells = ret;
>>>
>>> entry_size = 4 * (addr_cells + size_cells);
>>>
>>
>> I liked this change. Thanks! I agree it's better to use
>> fdt_address_cells() and fdt_size_cells() here.
>>
>> But it should be in a separate patch. Either before or after the one
>> moving the code.
>>
>>> diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
>>> index 121de3e04af2..e1e6d6154015 100644
>>> --- a/security/integrity/ima/ima_kexec.c
>>> +++ b/security/integrity/ima/ima_kexec.c
>>> @@ -10,8 +10,124 @@
>>> #include <linux/seq_file.h>
>>> #include <linux/vmalloc.h>
>>> #include <linux/kexec.h>
>>> +#include <linux/of.h>
>>> +#include <linux/memblock.h>
>>> +#include <linux/libfdt.h>
>>> #include "ima.h"
>>>
>>> +static int get_addr_size_cells(int *addr_cells, int *size_cells)
>>> +{
>>> + struct device_node *root;
>>> +
>>> + root = of_find_node_by_path("/");
>>> + if (!root)
>>> + return -EINVAL;
>>> +
>>> + *addr_cells = of_n_addr_cells(root);
>>> + *size_cells = of_n_size_cells(root);
>>> +
>>> + of_node_put(root);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int do_get_kexec_buffer(const void *prop, int len, unsigned long *addr,
>>> + size_t *size)
>>> +{
>>> + int ret, addr_cells, size_cells;
>>> +
>>> + ret = get_addr_size_cells(&addr_cells, &size_cells);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + if (len < 4 * (addr_cells + size_cells))
>>> + return -ENOENT;
>>> +
>>> + *addr = of_read_number(prop, addr_cells);
>>> + *size = of_read_number(prop + 4 * addr_cells, size_cells);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +/**
>>> + * ima_get_kexec_buffer - get IMA buffer from the previous kernel
>>> + * @addr: On successful return, set to point to the buffer contents.
>>> + * @size: On successful return, set to the buffer size.
>>> + *
>>> + * Return: 0 on success, negative errno on error.
>>> + */
>>> +int ima_get_kexec_buffer(void **addr, size_t *size)
>>> +{
>>> + int ret, len;
>>> + unsigned long tmp_addr;
>>> + size_t tmp_size;
>>> + const void *prop;
>>> +
>>> + prop = of_get_property(of_chosen, "linux,ima-kexec-buffer", &len);
>>> + if (!prop)
>>> + return -ENOENT;
>>> +
>>> + ret = do_get_kexec_buffer(prop, len, &tmp_addr, &tmp_size);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + *addr = __va(tmp_addr);
>>> + *size = tmp_size;
>>> +
>>> + return 0;
>>> +}
>>
>> The functions above were moved without being changed. Good.
>>
>>> +/**
>>> + * ima_free_kexec_buffer - free memory used by the IMA buffer
>>> + */
>>> +int ima_free_kexec_buffer(void)
>>> +{
>>> + int ret;
>>> + unsigned long addr;
>>> + size_t size;
>>> + struct property *prop;
>>> +
>>> + prop = of_find_property(of_chosen, "linux,ima-kexec-buffer", NULL);
>>> + if (!prop)
>>> + return -ENOENT;
>>> +
>>> + ret = do_get_kexec_buffer(prop->value, prop->length, &addr, &size);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + ret = of_remove_property(of_chosen, prop);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + return memblock_free(__pa(addr), size);
>>
>> Here you added a __pa() call. Do you store a virtual address in
>> linux,ima-kexec-buffer property? Doesn't it make more sense to store a
>> physical address?
>>
> trying to minimize the changes here as do_get_kexec_buffer return the va.
> I will refactor this to remove the double translation.
In the powerpc version, do_get_kexec_buffer() returns the pa, and one of
its callers does the va translation. I think that worked well.
>> Even if making this change is the correct thing to do, it should be a
>> separate patch, unless it can't be avoided. And if that is the case,
>> then it should be explained in the commit message.
>>
>>> +
>>> +}
>>> +
>>> +/**
>>> + * remove_ima_buffer - remove the IMA buffer property and reservation from @fdt
>>> + *
>>> + * The IMA measurement buffer is of no use to a subsequent kernel, so we always
>>> + * remove it from the device tree.
>>> + */
>>> +void remove_ima_buffer(void *fdt, int chosen_node)
>>> +{
>>> + int ret, len;
>>> + unsigned long addr;
>>> + size_t size;
>>> + const void *prop;
>>> +
>>> + prop = fdt_getprop(fdt, chosen_node, "linux,ima-kexec-buffer", &len);
>>> + if (!prop)
>>> + return;
>>> +
>>> + do_get_kexec_buffer(prop, len, &addr, &size);
>>> + ret = fdt_delprop(fdt, chosen_node, "linux,ima-kexec-buffer");
>>> + if (ret < 0)
>>> + return;
>>> +
>>> + memblock_free(addr, size);
>>> +}
>>
>> Here is another function that changed when moved. This one I know to be
>> wrong. You're confusing the purposes of remove_ima_buffer() and
>> ima_free_kexec_buffer().
>>
>> You did send me a question about them nearly three weeks ago which I
>> only answered today, so I apologize. Also, their names could more
>> clearly reflect their differences, so it's bad naming on my part.
>>
>> With IMA kexec buffers, there are two kernels (and thus their two
>> respective, separate device trees) to be concerned about:
>>
>> 1. the currently running kernel, which uses the live device tree
>> (accessed with the of_* functions) and the memblock subsystem;
>>
>> 2. the kernel which is being loaded by kexec, which will use the FDT
>> blob being passed around as argument to these functions, and the memory
>> reservations in the memory reservation table of the FDT blob.
>>
>> ima_free_kexec_buffer() is used by IMA in the currently running kernel.
>> Therefore the device tree it is concerned about is the live one, and
>> thus uses the of_* functions to access it. And uses memblock to change
>> the memory reservation.
>>
>> remove_ima_buffer() on the other hand is used by the kexec code to
>> prepare an FDT blob for the kernel that is being loaded. It should not
>> make any changes to live device tree, nor to memblock allocations. It
>> should only make changes to the FDT blob.
>>
> Thank you for this, greatly appreciate clearing my misunderstandings.
You're welcome. Sorry again for not answering your question before you
sent this patch series.
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-16 18:58 UTC (permalink / raw)
To: Nicholas Piggin, paulmck, Alan Stern
Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, x86, linux-kernel,
linux-mm, Andy Lutomirski, linuxppc-dev
In-Reply-To: <1370747990.15974.1594915396143.JavaMail.zimbra@efficios.com>
----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
> mathieu.desnoyers@efficios.com wrote:
>
>> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote:
>>> I should be more complete here, especially since I was complaining
>>> about unclear barrier comment :)
>>>
>>>
>>> CPU0 CPU1
>>> a. user stuff 1. user stuff
>>> b. membarrier() 2. enter kernel
>>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule
>>> d. read rq->curr 4. rq->curr switched to kthread
>>> e. is kthread, skip IPI 5. switch_to kthread
>>> f. return to user 6. rq->curr switched to user thread
>>> g. user stuff 7. switch_to user thread
>>> 8. exit kernel
>>> 9. more user stuff
>>>
>>> What you're really ordering is a, g vs 1, 9 right?
>>>
>>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>>> etc.
>>>
>>> Userspace does not care where the barriers are exactly or what kernel
>>> memory accesses might be being ordered by them, so long as there is a
>>> mb somewhere between a and g, and 1 and 9. Right?
>>
>> This is correct.
>
> Actually, sorry, the above is not quite right. It's been a while
> since I looked into the details of membarrier.
>
> The smp_mb() at the beginning of membarrier() needs to be paired with a
> smp_mb() _after_ rq->curr is switched back to the user thread, so the
> memory barrier is between store to rq->curr and following user-space
> accesses.
>
> The smp_mb() at the end of membarrier() needs to be paired with the
> smp_mb__after_spinlock() at the beginning of schedule, which is
> between accesses to userspace memory and switching rq->curr to kthread.
>
> As to *why* this ordering is needed, I'd have to dig through additional
> scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?
Thinking further about this, I'm beginning to consider that maybe we have been
overly cautious by requiring memory barriers before and after store to rq->curr.
If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current)
while running the membarrier system call, it necessarily means that CPU1 had
to issue smp_mb__after_spinlock when entering the scheduler, between any user-space
loads/stores and update of rq->curr.
Requiring a memory barrier between update of rq->curr (back to current process's
thread) and following user-space memory accesses does not seem to guarantee
anything more than what the initial barrier at the beginning of __schedule already
provides, because the guarantees are only about accesses to user-space memory.
Therefore, with the memory barrier at the beginning of __schedule, just observing that
CPU1's rq->curr differs from current should guarantee that a memory barrier was issued
between any sequentially consistent instructions belonging to the current process on
CPU1.
Or am I missing/misremembering an important point here ?
Thanks,
Mathieu
>
> Thanks,
>
> Mathieu
>
>
>> Note that the accesses to user-space memory can be
>> done either by user-space code or kernel code, it doesn't matter.
>> However, in order to be considered as happening before/after
>> either membarrier or the matching compiler barrier, kernel code
>> needs to have causality relationship with user-space execution,
>> e.g. user-space does a system call, or returns from a system call.
>>
>> In the case of io_uring, submitting a request or returning from waiting
>> on request completion appear to provide this causality relationship.
>>
>> Thanks,
>>
>> Mathieu
>>
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* [PATCH] powerpc/64: Fix an out of date comment about MMIO ordering
From: Palmer Dabbelt @ 2020-07-16 19:38 UTC (permalink / raw)
To: Will Deacon
Cc: kernel-team, bigeasy, Palmer Dabbelt, linuxppc-dev, npiggin,
linux-kernel, paulus, tglx, msuchanek, jniethe5
From: Palmer Dabbelt <palmerdabbelt@google.com>
This primitive has been renamed, but because it was spelled incorrectly in the
first place it must have escaped the fixup patch. As far as I can tell this
logic is still correct: smp_mb__after_spinlock() uses the default smp_mb()
implementation, which is "sync" rather than "hwsync" but those are the same
(though I'm not that familiar with PowerPC).
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
---
arch/powerpc/kernel/entry_64.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index b3c9f15089b6..7b38b4daca93 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -357,7 +357,7 @@ _GLOBAL(_switch)
* kernel/sched/core.c).
*
* Uncacheable stores in the case of involuntary preemption must
- * be taken care of. The smp_mb__before_spin_lock() in __schedule()
+ * be taken care of. The smp_mb__after_spinlock() in __schedule()
* is implemented as hwsync on powerpc, which orders MMIO too. So
* long as there is an hwsync in the context switch path, it will
* be executed on the source CPU after the task has performed
--
2.28.0.rc0.105.gf9edc3c819-goog
^ permalink raw reply related
* Re: [PATCH v2 2/3] ASoC: bindings: fsl-asoc-card: Support hp-det-gpio and mic-det-gpio
From: Rob Herring @ 2020-07-16 19:53 UTC (permalink / raw)
To: Shengjiu Wang
Cc: devicetree, alsa-devel, timur, lgirdwood, linux-kernel, Xiubo.Lee,
tiwai, robh+dt, perex, nicoleotsuka, broonie, festevam,
linuxppc-dev
In-Reply-To: <1594822179-1849-3-git-send-email-shengjiu.wang@nxp.com>
On Wed, 15 Jul 2020 22:09:38 +0800, Shengjiu Wang wrote:
> Add headphone and microphone detection GPIO support.
> These properties are optional.
>
> Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
> Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
> ---
> Documentation/devicetree/bindings/sound/fsl-asoc-card.txt | 3 +++
> 1 file changed, 3 insertions(+)
>
Acked-by: Rob Herring <robh@kernel.org>
^ permalink raw reply
* Re: [PATCH net-next] ibmvnic: Increase driver logging
From: Jakub Kicinski @ 2020-07-16 20:22 UTC (permalink / raw)
To: Michal Suchánek
Cc: drt, netdev, Thomas Falcon, linuxppc-dev, David Miller
In-Reply-To: <20200716160736.GI32107@kitsune.suse.cz>
On Thu, 16 Jul 2020 18:07:37 +0200 Michal Suchánek wrote:
> On Thu, Jul 16, 2020 at 10:59:58AM -0500, Thomas Falcon wrote:
> > On 7/15/20 8:29 PM, David Miller wrote:
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Date: Wed, 15 Jul 2020 17:06:32 -0700
> > >
> > > > On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:
> > > > > free_netdev(netdev);
> > > > > dev_set_drvdata(&dev->dev, NULL);
> > > > > + netdev_info(netdev, "VNIC client device has been successfully removed.\n");
> > > > A step too far, perhaps.
> > > >
> > > > In general this patch looks a little questionable IMHO, this amount of
> > > > logging output is not commonly seen in drivers. All the the info
> > > > messages are just static text, not even carrying any extra information.
> > > > In an era of ftrace, and bpftrace, do we really need this?
> > > Agreed, this is too much. This is debugging, and thus suitable for tracing
> > > facilities, at best.
> >
> > Thanks for your feedback. I see now that I was overly aggressive with this
> > patch to be sure, but it would help with narrowing down problems at a first
> > glance, should they arise. The driver in its current state logs very little
> > of what is it doing without the use of additional debugging or tracing
> > facilities. Would it be worth it to pursue a less aggressive version or
> > would that be dead on arrival? What are acceptable driver operations to log
> > at this level?
Sadly it's much more of an art than hard science. Most networking
drivers will print identifying information when they probe the device
and then only about major config changes or when link comes up or goes
down. And obviously when anything unexpected, like an error happens,
that's key.
You seem to be adding start / end information for each driver init /
deinit stage. I'd say try to focus on the actual errors you're trying
to catch.
> Also would it be advisable to add the messages as pr_dbg to be enabled on demand?
I personally have had a pretty poor experience with pr_debug() because
CONFIG_DYNAMIC_DEBUG is not always enabled. Since you're just printing
static text there shouldn't be much difference between pr_debug and
ftrace and/or bpftrace, honestly.
Again, slightly hard to advise not knowing what you're trying to catch.
^ permalink raw reply
* Re: [PATCH v3 10/12] ppc64/kexec_file: prepare elfcore header for crashing kernel
From: Hari Bathini @ 2020-07-16 21:07 UTC (permalink / raw)
To: Thiago Jung Bauermann
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <87tuy88ai7.fsf@morokweng.localdomain>
On 16/07/20 7:52 am, Thiago Jung Bauermann wrote:
>
> Hari Bathini <hbathini@linux.ibm.com> writes:
>
>> /**
>> + * get_crash_memory_ranges - Get crash memory ranges. This list includes
>> + * first/crashing kernel's memory regions that
>> + * would be exported via an elfcore.
>> + * @mem_ranges: Range list to add the memory ranges to.
>> + *
>> + * Returns 0 on success, negative errno on error.
>> + */
>> +static int get_crash_memory_ranges(struct crash_mem **mem_ranges)
>> +{
>> + struct memblock_region *reg;
>> + struct crash_mem *tmem;
>> + int ret;
>> +
>> + for_each_memblock(memory, reg) {
>> + u64 base, size;
>> +
>> + base = (u64)reg->base;
>> + size = (u64)reg->size;
>> +
>> + /* Skip backup memory region, which needs a separate entry */
>> + if (base == BACKUP_SRC_START) {
>> + if (size > BACKUP_SRC_SIZE) {
>> + base = BACKUP_SRC_END + 1;
>> + size -= BACKUP_SRC_SIZE;
>> + } else
>> + continue;
>> + }
>> +
>> + ret = add_mem_range(mem_ranges, base, size);
>> + if (ret)
>> + goto out;
>> +
>> + /* Try merging adjacent ranges before reallocation attempt */
>> + if ((*mem_ranges)->nr_ranges == (*mem_ranges)->max_nr_ranges)
>> + sort_memory_ranges(*mem_ranges, true);
>> + }
>> +
>> + /* Reallocate memory ranges if there is no space to split ranges */
>> + tmem = *mem_ranges;
>> + if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) {
>> + tmem = realloc_mem_ranges(mem_ranges);
>> + if (!tmem)
>> + goto out;
>> + }
>> +
>> + /* Exclude crashkernel region */
>> + ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end);
>> + if (ret)
>> + goto out;
>> +
>> + ret = add_rtas_mem_range(mem_ranges);
>> + if (ret)
>> + goto out;
>> +
>> + ret = add_opal_mem_range(mem_ranges);
>> + if (ret)
>> + goto out;
>
> Maybe I'm confused, but don't you add the RTAS and OPAL regions as
> usable memory for the crashkernel? In that case they shouldn't show up
> in the core file.
kexec-tools does the same thing. I am not endorsing it but I was trying to stay
in parity to avoid breaking any userspace tools/commands. But as you rightly
pointed, this is NOT right. The right thing to do, to get the rtas/opal data at
the time of crash, is to have a backup region for them just like we have for
the first 64K memory. I was hoping to do that later.
Will check how userspace tools respond to dropping these regions. If that makes
the tools unhappy, will retain the regions with a FIXME. Sorry about the confusion.
Thanks
Hari
^ permalink raw reply
* Re: [PATCH v3 03/12] powerpc/kexec_file: add helper functions for getting memory ranges
From: Hari Bathini @ 2020-07-16 21:08 UTC (permalink / raw)
To: Thiago Jung Bauermann
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <874kq98xo4.fsf@morokweng.localdomain>
On 15/07/20 5:19 am, Thiago Jung Bauermann wrote:
>
<snip>
> <snip>
>
>> +/**
>> + * get_mem_rngs_size - Get the allocated size of mrngs based on
>> + * max_nr_ranges and chunk size.
>> + * @mrngs: Memory ranges.
>> + *
>> + * Returns the maximum no. of ranges.
>
> This isn't correct. It returns the maximum size of @mrngs.
True. Will update..
> <snip>
>
>> +/**
>> + * add_tce_mem_ranges - Adds tce-table range to the given memory ranges list.
>> + * @mem_ranges: Range list to add the memory range(s) to.
>> + *
>> + * Returns 0 on success, negative errno on error.
>> + */
>> +int add_tce_mem_ranges(struct crash_mem **mem_ranges)
>> +{
>> + struct device_node *dn;
>> + int ret;
>> +
>> + for_each_node_by_type(dn, "pci") {
>> + u64 base;
>> + u32 size;
>> +
>> + ret = of_property_read_u64(dn, "linux,tce-base", &base);
>> + ret |= of_property_read_u32(dn, "linux,tce-size", &size);
>> + if (!ret)
>
> Shouldn't the condition be `ret` instead of `!ret`?
Oops! Will fix it.
>> +/**
>> + * sort_memory_ranges - Sorts the given memory ranges list.
>> + * @mem_ranges: Range list to sort.
>> + * @merge: If true, merge the list after sorting.
>> + *
>> + * Returns nothing.
>> + */
>> +void sort_memory_ranges(struct crash_mem *mrngs, bool merge)
>> +{
>> + struct crash_mem_range *rngs;
>> + struct crash_mem_range rng;
>> + int i, j, idx;
>> +
>> + if (!mrngs)
>> + return;
>> +
>> + /* Sort the ranges in-place */
>> + rngs = &mrngs->ranges[0];
>> + for (i = 0; i < mrngs->nr_ranges; i++) {
>> + idx = i;
>> + for (j = (i + 1); j < mrngs->nr_ranges; j++) {
>> + if (rngs[idx].start > rngs[j].start)
>> + idx = j;
>> + }
>> + if (idx != i) {
>> + rng = rngs[idx];
>> + rngs[idx] = rngs[i];
>> + rngs[i] = rng;
>> + }
>> + }
>
> Would it work using sort() from lib/sort.c here?
Yeah. I think we could reuse it with a simple compare callback. Will do that.
Thanks
Hari
^ permalink raw reply
* Re: [PATCH v3 04/12] ppc64/kexec_file: avoid stomping memory used by special regions
From: Hari Bathini @ 2020-07-16 21:09 UTC (permalink / raw)
To: Thiago Jung Bauermann
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <87365t8pse.fsf@morokweng.localdomain>
On 15/07/20 8:09 am, Thiago Jung Bauermann wrote:
>
> Hari Bathini <hbathini@linux.ibm.com> writes:
>
<snip>
>> +/**
>> + * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
>> + * in the memory regions between buf_min & buf_max
>> + * for the buffer. If found, sets kbuf->mem.
>> + * @kbuf: Buffer contents and memory parameters.
>> + * @buf_min: Minimum address for the buffer.
>> + * @buf_max: Maximum address for the buffer.
>> + *
>> + * Returns 0 on success, negative errno on error.
>> + */
>> +static int __locate_mem_hole_top_down(struct kexec_buf *kbuf,
>> + u64 buf_min, u64 buf_max)
>> +{
>> + int ret = -EADDRNOTAVAIL;
>> + phys_addr_t start, end;
>> + u64 i;
>> +
>> + for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE,
>> + MEMBLOCK_NONE, &start, &end, NULL) {
>> + if (start > buf_max)
>> + continue;
>> +
>> + /* Memory hole not found */
>> + if (end < buf_min)
>> + break;
>> +
>> + /* Adjust memory region based on the given range */
>> + if (start < buf_min)
>> + start = buf_min;
>> + if (end > buf_max)
>> + end = buf_max;
>> +
>> + start = ALIGN(start, kbuf->buf_align);
>> + if (start < end && (end - start + 1) >= kbuf->memsz) {
>
> This is why I dislike using start and end to express address ranges:
>
> While struct resource seems to use the [address, end] convention, my
struct crash_mem also uses [address, end] convention.
This off-by-one error did not cause any issues as the hole start and size we try to find
are at least page aligned.
Nonetheless, I think fixing 'end' early in the loop with "end -= 1" would ensure
correctness while continuing to use the same convention for structs crash_mem & resource.
Thanks
Hari
^ permalink raw reply
* Re: [PATCH v3 05/12] powerpc/drmem: make lmb walk a bit more flexible
From: Hari Bathini @ 2020-07-16 21:09 UTC (permalink / raw)
To: Thiago Jung Bauermann
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <871rld8mic.fsf@morokweng.localdomain>
On 15/07/20 9:20 am, Thiago Jung Bauermann wrote:
>
> Hari Bathini <hbathini@linux.ibm.com> writes:
>
>> @@ -534,7 +537,7 @@ static int __init early_init_dt_scan_memory_ppc(unsigned long node,
>> #ifdef CONFIG_PPC_PSERIES
>> if (depth == 1 &&
>> strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
>> - walk_drmem_lmbs_early(node, early_init_drmem_lmb);
>> + walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);
>
> walk_drmem_lmbs_early() can now fail. Should this failure be propagated
> as a return value of early_init_dt_scan_memory_ppc()?
>
>> return 0;
>> }
>> #endif
> <snip>
>
>> @@ -787,7 +790,7 @@ static int __init parse_numa_properties(void)
>> */
>> memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>> if (memory) {
>> - walk_drmem_lmbs(memory, numa_setup_drmem_lmb);
>> + walk_drmem_lmbs(memory, NULL, numa_setup_drmem_lmb);
>
> Similarly here. Now that this call can fail, should
> parse_numa_properties() handle or propagate the failure?
They would still not fail unless the callbacks early_init_drmem_lmb() & numa_setup_drmem_lmb()
are updated to have failure scenarios. Also, these call sites always ignored failure scenarios
even before walk_drmem_lmbs() was introduced. So, I prefer to keep them the way they are?
Thanks
Hari
^ permalink raw reply
* Re: [PATCH v3 06/12] ppc64/kexec_file: restrict memory usage of kdump kernel
From: Hari Bathini @ 2020-07-16 21:10 UTC (permalink / raw)
To: Thiago Jung Bauermann
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <87365s9ysj.fsf@morokweng.localdomain>
On 16/07/20 4:22 am, Thiago Jung Bauermann wrote:
>
> Hari Bathini <hbathini@linux.ibm.com> writes:
>
<snip>
>> +/**
>> + * get_node_path - Get the full path of the given node.
>> + * @dn: Node.
>> + * @path: Updated with the full path of the node.
>> + *
>> + * Returns nothing.
>> + */
>> +static void get_node_path(struct device_node *dn, char *path)
>> +{
>> + if (!dn)
>> + return;
>> +
>> + get_node_path(dn->parent, path);
>
> Is it ok to do recursion in the kernel? In this case I believe it's not
> problematic since the maximum call depth will be the maximum depth of a
> device tree node which shouldn't be too much. Also, there are no local
> variables in this function. But I thought it was worth mentioning.
You are right. We are better off avoiding the recursion here. Will
change it to an iterative version instead.
>> + * each representing a memory range.
>> + */
>> + ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
>> +
>> + for (i = 0; i < ranges; i++) {
>> + base = of_read_number(prop, n_mem_addr_cells);
>> + prop += n_mem_addr_cells;
>> + end = base + of_read_number(prop, n_mem_size_cells) - 1;
prop is not used after the above.
> You need to `prop += n_mem_size_cells` here.
But yeah, adding it would make it look complete in some sense..
Thanks
Hari
^ permalink raw reply
* Re: [PATCH v3 09/12] ppc64/kexec_file: setup backup region for kdump kernel
From: Hari Bathini @ 2020-07-16 21:10 UTC (permalink / raw)
To: Thiago Jung Bauermann
Cc: kernel test robot, Pingfan Liu, Petr Tesarik, Nayna Jain,
Kexec-ml, Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev,
Sourabh Jain, Andrew Morton, Dave Young, Vivek Goyal,
Eric Biederman
In-Reply-To: <87y2nk8cjq.fsf@morokweng.localdomain>
On 16/07/20 7:08 am, Thiago Jung Bauermann wrote:
>
> Hari Bathini <hbathini@linux.ibm.com> writes:
>
>> @@ -968,7 +1040,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
>>
>> /*
>> * Restrict memory usage for kdump kernel by setting up
>> - * usable memory ranges.
>> + * usable memory ranges and memory reserve map.
>> */
>> if (image->type == KEXEC_TYPE_CRASH) {
>> ret = get_usable_memory_ranges(&umem);
>> @@ -980,6 +1052,24 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
>> pr_err("Error setting up usable-memory property for kdump kernel\n");
>> goto out;
>> }
>> +
>> + ret = fdt_add_mem_rsv(fdt, BACKUP_SRC_START + BACKUP_SRC_SIZE,
>> + crashk_res.start - BACKUP_SRC_SIZE);
>
> I believe this answers my question from the other email about how the
> crashkernel is prevented from stomping in the crashed kernel's memory,
> right? I needed to think for a bit to understand what the above
> reservation was protecting. I think it's worth adding a comment.
Right. The reason to add it in the first place is, prom presses the panic button if
it can't find low memory. Marking it reserved seems to keep it quiet though. so..
Will add comment mentioning that..
>> +void purgatory(void)
>> +{
>> + void *dest, *src;
>> +
>> + src = (void *)BACKUP_SRC_START;
>> + if (backup_start) {
>> + dest = (void *)backup_start;
>> + __memcpy(dest, src, BACKUP_SRC_SIZE);
>> + }
>> +}
>
> In general I'm in favor of using C code over assembly, but having to
> bring in that relocation support just for the above makes me wonder if
> it's worth it in this case.
I am planning to build on purgatory later with "I'm in purgatory" print support
for pseries at least and also, sha256 digest check.
Thanks
Hari
^ permalink raw reply
* Re: [PATCH v3 07/12] ppc64/kexec_file: add support to relocate purgatory
From: Hari Bathini @ 2020-07-16 21:11 UTC (permalink / raw)
To: Thiago Jung Bauermann
Cc: kernel test robot, Pingfan Liu, Petr Tesarik, Nayna Jain,
Kexec-ml, Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev,
Sourabh Jain, Andrew Morton, Dave Young, Vivek Goyal,
Eric Biederman
In-Reply-To: <871rlc9upc.fsf@morokweng.localdomain>
On 16/07/20 5:50 am, Thiago Jung Bauermann wrote:
>
> Hari Bathini <hbathini@linux.ibm.com> writes:
>
>> Right now purgatory implementation is only minimal. But if purgatory
>> code is to be enhanced to copy memory to the backup region and verify
>
> Can't the memcpy be done in asm? We have arch/powerpc/lib/memcpy_64.S
> for example, perhaps it could be linked in with the purgatory?
I wanted to avoid touching common code to make it work for purgatory
for now.
>
>> sha256 digest, relocations may have to be applied to the purgatory.
>
> Do we want to do the sha256 verification? My original patch series for
> kexec_file_load() had a purgatory in C from kexec-tools which did the
> sha256 verification but Michael Ellerman thought it was unnecessary and
> decided to use the simpler purgatory in asm from kexec-lite.
kexec_file_load could as well be used without IMA or secureboot. With sha256 digest
calculated anyway, verifying it would make sense to accommodate that case as well.
>
>> So, add support to relocate purgatory in kexec_file_load system call
>> by setting up TOC pointer and applying RELA relocations as needed.
>
> If we do want to use a C purgatory, Michael Ellerman had suggested
> building it as a Position Independent Executable, which greatly reduces
> the number and types of relocations that are needed. See patches 4 and 9
> here:
>
> https://lore.kernel.org/linuxppc-dev/1478748449-3894-1-git-send-email-bauerman@linux.vnet.ibm.com/
>
> In the series above I hadn't converted x86 to PIE. If I had done that,
> possibly Dave Young's opinion would have been different. :-)
>
> If that's still not desirable, he suggested in that discussion lifting
> some code from x86 to generic code, which I implemented and would
> simplify this patch as well:
>
> https://lore.kernel.org/linuxppc-dev/5009580.5GxAkTrMYA@morokweng/
>
Agreed. But I prefer to work on PIE and/or moving common relocation_add code
for x86 & s390 to generic code later when I try to build on these purgatory
changes. So, a separate series later to rework purgatory with the things you
mentioned above sounds ok?
Thanks
Hari
^ permalink raw reply
* Re: [PATCH v3 10/12] ppc64/kexec_file: prepare elfcore header for crashing kernel
From: Thiago Jung Bauermann @ 2020-07-16 21:57 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <929db6fe-b221-a514-8ea1-93227f8d47b0@linux.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> On 16/07/20 7:52 am, Thiago Jung Bauermann wrote:
>>
>> Hari Bathini <hbathini@linux.ibm.com> writes:
>>
>>> /**
>>> + * get_crash_memory_ranges - Get crash memory ranges. This list includes
>>> + * first/crashing kernel's memory regions that
>>> + * would be exported via an elfcore.
>>> + * @mem_ranges: Range list to add the memory ranges to.
>>> + *
>>> + * Returns 0 on success, negative errno on error.
>>> + */
>>> +static int get_crash_memory_ranges(struct crash_mem **mem_ranges)
>>> +{
>>> + struct memblock_region *reg;
>>> + struct crash_mem *tmem;
>>> + int ret;
>>> +
>>> + for_each_memblock(memory, reg) {
>>> + u64 base, size;
>>> +
>>> + base = (u64)reg->base;
>>> + size = (u64)reg->size;
>>> +
>>> + /* Skip backup memory region, which needs a separate entry */
>>> + if (base == BACKUP_SRC_START) {
>>> + if (size > BACKUP_SRC_SIZE) {
>>> + base = BACKUP_SRC_END + 1;
>>> + size -= BACKUP_SRC_SIZE;
>>> + } else
>>> + continue;
>>> + }
>>> +
>>> + ret = add_mem_range(mem_ranges, base, size);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + /* Try merging adjacent ranges before reallocation attempt */
>>> + if ((*mem_ranges)->nr_ranges == (*mem_ranges)->max_nr_ranges)
>>> + sort_memory_ranges(*mem_ranges, true);
>>> + }
>>> +
>>> + /* Reallocate memory ranges if there is no space to split ranges */
>>> + tmem = *mem_ranges;
>>> + if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) {
>>> + tmem = realloc_mem_ranges(mem_ranges);
>>> + if (!tmem)
>>> + goto out;
>>> + }
>>> +
>>> + /* Exclude crashkernel region */
>>> + ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + ret = add_rtas_mem_range(mem_ranges);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + ret = add_opal_mem_range(mem_ranges);
>>> + if (ret)
>>> + goto out;
>>
>> Maybe I'm confused, but don't you add the RTAS and OPAL regions as
>> usable memory for the crashkernel? In that case they shouldn't show up
>> in the core file.
>
> kexec-tools does the same thing. I am not endorsing it but I was trying to stay
> in parity to avoid breaking any userspace tools/commands. But as you rightly
> pointed, this is NOT right. The right thing to do, to get the rtas/opal data at
> the time of crash, is to have a backup region for them just like we have for
> the first 64K memory. I was hoping to do that later.
>
> Will check how userspace tools respond to dropping these regions. If that makes
> the tools unhappy, will retain the regions with a FIXME. Sorry about the confusion.
No problem, thanks for the clarification.
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3 04/12] ppc64/kexec_file: avoid stomping memory used by special regions
From: Thiago Jung Bauermann @ 2020-07-16 21:59 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <0582476e-415e-3f60-2bb2-6199d0340156@linux.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> On 15/07/20 8:09 am, Thiago Jung Bauermann wrote:
>>
>> Hari Bathini <hbathini@linux.ibm.com> writes:
>>
>
> <snip>
>
>>> +/**
>>> + * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
>>> + * in the memory regions between buf_min & buf_max
>>> + * for the buffer. If found, sets kbuf->mem.
>>> + * @kbuf: Buffer contents and memory parameters.
>>> + * @buf_min: Minimum address for the buffer.
>>> + * @buf_max: Maximum address for the buffer.
>>> + *
>>> + * Returns 0 on success, negative errno on error.
>>> + */
>>> +static int __locate_mem_hole_top_down(struct kexec_buf *kbuf,
>>> + u64 buf_min, u64 buf_max)
>>> +{
>>> + int ret = -EADDRNOTAVAIL;
>>> + phys_addr_t start, end;
>>> + u64 i;
>>> +
>>> + for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE,
>>> + MEMBLOCK_NONE, &start, &end, NULL) {
>>> + if (start > buf_max)
>>> + continue;
>>> +
>>> + /* Memory hole not found */
>>> + if (end < buf_min)
>>> + break;
>>> +
>>> + /* Adjust memory region based on the given range */
>>> + if (start < buf_min)
>>> + start = buf_min;
>>> + if (end > buf_max)
>>> + end = buf_max;
>>> +
>>> + start = ALIGN(start, kbuf->buf_align);
>>> + if (start < end && (end - start + 1) >= kbuf->memsz) {
>>
>> This is why I dislike using start and end to express address ranges:
>>
>> While struct resource seems to use the [address, end] convention, my
>
> struct crash_mem also uses [address, end] convention.
> This off-by-one error did not cause any issues as the hole start and size we try to find
> are at least page aligned.
>
> Nonetheless, I think fixing 'end' early in the loop with "end -= 1" would ensure
> correctness while continuing to use the same convention for structs crash_mem & resource.
Sounds good.
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3 05/12] powerpc/drmem: make lmb walk a bit more flexible
From: Thiago Jung Bauermann @ 2020-07-16 22:01 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <30e8f02a-f009-70a5-01e9-dec9eff213b1@linux.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> On 15/07/20 9:20 am, Thiago Jung Bauermann wrote:
>>
>> Hari Bathini <hbathini@linux.ibm.com> writes:
>>
>>> @@ -534,7 +537,7 @@ static int __init early_init_dt_scan_memory_ppc(unsigned long node,
>>> #ifdef CONFIG_PPC_PSERIES
>>> if (depth == 1 &&
>>> strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
>>> - walk_drmem_lmbs_early(node, early_init_drmem_lmb);
>>> + walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);
>>
>> walk_drmem_lmbs_early() can now fail. Should this failure be propagated
>> as a return value of early_init_dt_scan_memory_ppc()?
>
>>
>>> return 0;
>>> }
>>> #endif
>> <snip>
>>
>>> @@ -787,7 +790,7 @@ static int __init parse_numa_properties(void)
>>> */
>>> memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>>> if (memory) {
>>> - walk_drmem_lmbs(memory, numa_setup_drmem_lmb);
>>> + walk_drmem_lmbs(memory, NULL, numa_setup_drmem_lmb);
>>
>> Similarly here. Now that this call can fail, should
>> parse_numa_properties() handle or propagate the failure?
>
> They would still not fail unless the callbacks early_init_drmem_lmb() & numa_setup_drmem_lmb()
> are updated to have failure scenarios. Also, these call sites always ignored failure scenarios
> even before walk_drmem_lmbs() was introduced. So, I prefer to keep them the way they are?
Ok, makes sense. In this case:
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3 06/12] ppc64/kexec_file: restrict memory usage of kdump kernel
From: Thiago Jung Bauermann @ 2020-07-16 22:03 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <baa29ea9-7698-a7e8-e5a4-c9f842e1fcc8@linux.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> On 16/07/20 4:22 am, Thiago Jung Bauermann wrote:
>>
>> Hari Bathini <hbathini@linux.ibm.com> writes:
>>
>
> <snip>
>
>>> +/**
>>> + * get_node_path - Get the full path of the given node.
>>> + * @dn: Node.
>>> + * @path: Updated with the full path of the node.
>>> + *
>>> + * Returns nothing.
>>> + */
>>> +static void get_node_path(struct device_node *dn, char *path)
>>> +{
>>> + if (!dn)
>>> + return;
>>> +
>>> + get_node_path(dn->parent, path);
>>
>> Is it ok to do recursion in the kernel? In this case I believe it's not
>> problematic since the maximum call depth will be the maximum depth of a
>> device tree node which shouldn't be too much. Also, there are no local
>> variables in this function. But I thought it was worth mentioning.
>
> You are right. We are better off avoiding the recursion here. Will
> change it to an iterative version instead.
Ok.
>>> + * each representing a memory range.
>>> + */
>>> + ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
>>> +
>>> + for (i = 0; i < ranges; i++) {
>>> + base = of_read_number(prop, n_mem_addr_cells);
>>> + prop += n_mem_addr_cells;
>>> + end = base + of_read_number(prop, n_mem_size_cells) - 1;
>
> prop is not used after the above.
>
>> You need to `prop += n_mem_size_cells` here.
>
> But yeah, adding it would make it look complete in some sense..
Isn't it used in the next iteration of the loop?
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3 09/12] ppc64/kexec_file: setup backup region for kdump kernel
From: Thiago Jung Bauermann @ 2020-07-16 22:06 UTC (permalink / raw)
To: Hari Bathini
Cc: kernel test robot, Pingfan Liu, Petr Tesarik, Nayna Jain,
Kexec-ml, Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev,
Sourabh Jain, Andrew Morton, Dave Young, Vivek Goyal,
Eric Biederman
In-Reply-To: <bea19627-c6b7-5d59-e194-03038bb4d9f6@linux.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> On 16/07/20 7:08 am, Thiago Jung Bauermann wrote:
>>
>> Hari Bathini <hbathini@linux.ibm.com> writes:
>>
>>> @@ -968,7 +1040,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
>>>
>>> /*
>>> * Restrict memory usage for kdump kernel by setting up
>>> - * usable memory ranges.
>>> + * usable memory ranges and memory reserve map.
>>> */
>>> if (image->type == KEXEC_TYPE_CRASH) {
>>> ret = get_usable_memory_ranges(&umem);
>>> @@ -980,6 +1052,24 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
>>> pr_err("Error setting up usable-memory property for kdump kernel\n");
>>> goto out;
>>> }
>>> +
>>> + ret = fdt_add_mem_rsv(fdt, BACKUP_SRC_START + BACKUP_SRC_SIZE,
>>> + crashk_res.start - BACKUP_SRC_SIZE);
>>
>> I believe this answers my question from the other email about how the
>> crashkernel is prevented from stomping in the crashed kernel's memory,
>> right? I needed to think for a bit to understand what the above
>> reservation was protecting. I think it's worth adding a comment.
>
> Right. The reason to add it in the first place is, prom presses the panic button if
> it can't find low memory. Marking it reserved seems to keep it quiet though. so..
>
> Will add comment mentioning that..
Ah, makes sense. Thanks for the explanation.
>>> +void purgatory(void)
>>> +{
>>> + void *dest, *src;
>>> +
>>> + src = (void *)BACKUP_SRC_START;
>>> + if (backup_start) {
>>> + dest = (void *)backup_start;
>>> + __memcpy(dest, src, BACKUP_SRC_SIZE);
>>> + }
>>> +}
>>
>> In general I'm in favor of using C code over assembly, but having to
>> bring in that relocation support just for the above makes me wonder if
>> it's worth it in this case.
>
> I am planning to build on purgatory later with "I'm in purgatory" print support
> for pseries at least and also, sha256 digest check.
Ok. In that case, my preference would be to convert both the powerpc and
x86 purgatories to PIE since this greatly reduces the types of
relocations that are emitted, but better ask Dave Young what he thinks
before going down that route.
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3 07/12] ppc64/kexec_file: add support to relocate purgatory
From: Thiago Jung Bauermann @ 2020-07-16 22:12 UTC (permalink / raw)
To: Hari Bathini
Cc: kernel test robot, Pingfan Liu, Petr Tesarik, Nayna Jain,
Kexec-ml, Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev,
Sourabh Jain, Andrew Morton, Dave Young, Vivek Goyal,
Eric Biederman
In-Reply-To: <6f64a18a-352e-fdec-c902-45aefc31cc0a@linux.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> On 16/07/20 5:50 am, Thiago Jung Bauermann wrote:
>>
>> Hari Bathini <hbathini@linux.ibm.com> writes:
>>
>>> So, add support to relocate purgatory in kexec_file_load system call
>>> by setting up TOC pointer and applying RELA relocations as needed.
>>
>> If we do want to use a C purgatory, Michael Ellerman had suggested
>> building it as a Position Independent Executable, which greatly reduces
>> the number and types of relocations that are needed. See patches 4 and 9
>> here:
>>
>> https://lore.kernel.org/linuxppc-dev/1478748449-3894-1-git-send-email-bauerman@linux.vnet.ibm.com/
>>
>> In the series above I hadn't converted x86 to PIE. If I had done that,
>> possibly Dave Young's opinion would have been different. :-)
>>
>> If that's still not desirable, he suggested in that discussion lifting
>> some code from x86 to generic code, which I implemented and would
>> simplify this patch as well:
>>
>> https://lore.kernel.org/linuxppc-dev/5009580.5GxAkTrMYA@morokweng/
>>
>
> Agreed. But I prefer to work on PIE and/or moving common relocation_add code
> for x86 & s390 to generic code later when I try to build on these purgatory
> changes. So, a separate series later to rework purgatory with the things you
> mentioned above sounds ok?
Sounds ok to me. Let's see what the maintainers think, then.
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper
From: Christophe Leroy @ 2020-07-16 16:49 UTC (permalink / raw)
To: Jarkko Sakkinen
Cc: Catalin Marinas, Kefeng Wang, Paul Mackerras, Zong Li, Andi Kleen,
Paul Burton, Vincent Whitchurch, Petr Mladek, Brian Gerst,
Andy Lutomirski, Thomas Gleixner, Jiri Kosina, Anup Patel,
linux-kernel, Philipp Rudo, Torsten Duwe, Masami Hiramatsu,
Andrew Morton, Mark Rutland, James E.J. Bottomley, Vincent Chen,
Omar Sandoval, open list:S390, Joe Lawrence, Helge Deller,
John Fastabend, Anil S Keshavamurthy, Yonghong Song, Iurii Zaikin,
Andrii Nakryiko, Thomas Huth, Vasily Gorbik,
moderated list:ARM PORT, Daniel Axtens, Damien Le Moal,
Martin KaFai Lau, Song Liu, Josh Poimboeuf, Heiko Carstens,
Alexei Starovoitov, Atish Patra, Will Deacon, Daniel Borkmann,
Masahiro Yamada, Nayna Jain, Ley Foon Tan, Christian Borntraeger,
Sami Tolvanen, Naveen N. Rao, Mao Han, Marco Elver,
Steven Rostedt, Babu Moger, Borislav Petkov, Greentime Hu,
Ben Dooks, Guan Xuetao, Thomas Bogendoerfer,
open list:PARISC ARCHITECTURE, Jessica Yu,
open list:BPF JIT for MIPS 32-BIT AND 64-BIT, David S. Miller,
Thiago Jung Bauermann, Peter Zijlstra, David Howells,
Amit Daniel Kachhap, Sandipan Das, H. Peter Anvin,
open list:SPARC + UltraSPARC sparc/sparc64, Tiezhu Yang,
Miroslav Benes, Sven Schnelle, Ard Biesheuvel, Vincenzo Frascino,
Anders Roxell, Jiri Olsa,
maintainer:X86 ARCHITECTURE 32-BIT AND 64-BIT, Russell King,
open list:RISC-V ARCHITECTURE, Mike Rapoport, Ingo Molnar,
Albert Ou, Paul E. McKenney, Paul Walmsley, KP Singh,
Dmitry Vyukov, Nick Hu,
open list:BPF JIT for MIPS 32-BIT AND 64-BIT, open list:MIPS,
Palmer Dabbelt, open list:LINUX FOR POWERPC 32-BIT AND 64-BIT
In-Reply-To: <20200714094625.1443261-2-jarkko.sakkinen@linux.intel.com>
Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> a écrit :
> Rename module_alloc() to text_alloc() and module_memfree() to
> text_memfree(), and move them to kernel/text.c, which is unconditionally
> compiled to the kernel proper. This allows kprobes, ftrace and bpf to
> allocate space for executable code without requiring to compile the modules
> support (CONFIG_MODULES=y) in.
You are not changing enough in powerpc to have this work.
On powerpc 32 bits (6xx), when STRICT_KERNEL_RWX is selected, the
vmalloc space is set to NX (no exec) at segment level (ie by 256Mbytes
zone) unless CONFIG_MODULES is selected.
Christophe
>
> Cc: Andi Kleen <ak@linux.intel.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> ---
> arch/arm/kernel/Makefile | 3 +-
> arch/arm/kernel/module.c | 21 -----------
> arch/arm/kernel/text.c | 33 ++++++++++++++++++
> arch/arm64/kernel/Makefile | 2 +-
> arch/arm64/kernel/module.c | 42 ----------------------
> arch/arm64/kernel/text.c | 54 ++++++++++++++++++++++++++++
> arch/mips/kernel/Makefile | 2 +-
> arch/mips/kernel/module.c | 9 -----
> arch/mips/kernel/text.c | 19 ++++++++++
> arch/mips/net/bpf_jit.c | 4 +--
> arch/nds32/kernel/Makefile | 2 +-
> arch/nds32/kernel/module.c | 7 ----
> arch/nds32/kernel/text.c | 12 +++++++
> arch/nios2/kernel/Makefile | 1 +
> arch/nios2/kernel/module.c | 19 ----------
> arch/nios2/kernel/text.c | 34 ++++++++++++++++++
> arch/parisc/kernel/Makefile | 2 +-
> arch/parisc/kernel/module.c | 11 ------
> arch/parisc/kernel/text.c | 22 ++++++++++++
> arch/powerpc/net/bpf_jit_comp.c | 4 +--
> arch/riscv/kernel/Makefile | 1 +
> arch/riscv/kernel/module.c | 12 -------
> arch/riscv/kernel/text.c | 20 +++++++++++
> arch/s390/kernel/Makefile | 2 +-
> arch/s390/kernel/ftrace.c | 2 +-
> arch/s390/kernel/module.c | 16 ---------
> arch/s390/kernel/text.c | 23 ++++++++++++
> arch/sparc/kernel/Makefile | 1 +
> arch/sparc/kernel/module.c | 30 ----------------
> arch/sparc/kernel/text.c | 39 +++++++++++++++++++++
> arch/sparc/net/bpf_jit_comp_32.c | 6 ++--
> arch/unicore32/kernel/Makefile | 1 +
> arch/unicore32/kernel/module.c | 7 ----
> arch/unicore32/kernel/text.c | 18 ++++++++++
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/ftrace.c | 4 +--
> arch/x86/kernel/kprobes/core.c | 4 +--
> arch/x86/kernel/module.c | 49 --------------------------
> arch/x86/kernel/text.c | 60 ++++++++++++++++++++++++++++++++
> include/linux/moduleloader.h | 4 +--
> kernel/Makefile | 2 +-
> kernel/bpf/core.c | 4 +--
> kernel/kprobes.c | 4 +--
> kernel/module.c | 37 ++++++--------------
> kernel/text.c | 25 +++++++++++++
> 45 files changed, 400 insertions(+), 275 deletions(-)
> create mode 100644 arch/arm/kernel/text.c
> create mode 100644 arch/arm64/kernel/text.c
> create mode 100644 arch/mips/kernel/text.c
> create mode 100644 arch/nds32/kernel/text.c
> create mode 100644 arch/nios2/kernel/text.c
> create mode 100644 arch/parisc/kernel/text.c
> create mode 100644 arch/riscv/kernel/text.c
> create mode 100644 arch/s390/kernel/text.c
> create mode 100644 arch/sparc/kernel/text.c
> create mode 100644 arch/unicore32/kernel/text.c
> create mode 100644 arch/x86/kernel/text.c
> create mode 100644 kernel/text.c
>
> diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
> index 89e5d864e923..69bfacfd60ef 100644
> --- a/arch/arm/kernel/Makefile
> +++ b/arch/arm/kernel/Makefile
> @@ -19,7 +19,8 @@ CFLAGS_REMOVE_return_address.o = -pg
> obj-y := elf.o entry-common.o irq.o opcodes.o \
> process.o ptrace.o reboot.o \
> setup.o signal.o sigreturn_codes.o \
> - stacktrace.o sys_arm.o time.o traps.o
> + stacktrace.o sys_arm.o time.o traps.o \
> + text.o
>
> ifneq ($(CONFIG_ARM_UNWIND),y)
> obj-$(CONFIG_FRAME_POINTER) += return_address.o
> diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
> index e15444b25ca0..13e3442a6b9f 100644
> --- a/arch/arm/kernel/module.c
> +++ b/arch/arm/kernel/module.c
> @@ -33,27 +33,6 @@
> #define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
> #endif
>
> -#ifdef CONFIG_MMU
> -void *module_alloc(unsigned long size)
> -{
> - gfp_t gfp_mask = GFP_KERNEL;
> - void *p;
> -
> - /* Silence the initial allocation */
> - if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
> - gfp_mask |= __GFP_NOWARN;
> -
> - p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> - if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
> - return p;
> - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> - GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> -#endif
> -
> bool module_init_section(const char *name)
> {
> return strstarts(name, ".init") ||
> diff --git a/arch/arm/kernel/text.c b/arch/arm/kernel/text.c
> new file mode 100644
> index 000000000000..600143fb909d
> --- /dev/null
> +++ b/arch/arm/kernel/text.c
> @@ -0,0 +1,33 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * linux/arch/arm/kernel/module.c
> + *
> + * Copyright (C) 2002 Russell King.
> + * Modified for nommu by Hyok S. Choi
> + *
> + * Module allocation method suggested by Andi Kleen.
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +#ifdef CONFIG_MMU
> +void *text_alloc(unsigned long size)
> +{
> + gfp_t gfp_mask = GFP_KERNEL;
> + void *p;
> +
> + /* Silence the initial allocation */
> + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
> + gfp_mask |= __GFP_NOWARN;
> +
> + p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> + gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> + if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
> + return p;
> + return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> + GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +}
> +#endif
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index a561cbb91d4d..7765a45d71b5 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -19,7 +19,7 @@ obj-y := debug-monitors.o entry.o irq.o fpsimd.o \
> return_address.o cpuinfo.o cpu_errata.o \
> cpufeature.o alternative.o cacheinfo.o \
> smp.o smp_spin_table.o topology.o smccc-call.o \
> - syscall.o
> + syscall.o text.o
>
> targets += efi-entry.o
>
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index 1cd1a4d0ed30..adde022f703c 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -20,48 +20,6 @@
> #include <asm/insn.h>
> #include <asm/sections.h>
>
> -void *module_alloc(unsigned long size)
> -{
> - u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> - gfp_t gfp_mask = GFP_KERNEL;
> - void *p;
> -
> - /* Silence the initial allocation */
> - if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> - gfp_mask |= __GFP_NOWARN;
> -
> - if (IS_ENABLED(CONFIG_KASAN))
> - /* don't exceed the static module region - see below */
> - module_alloc_end = MODULES_END;
> -
> - p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> - module_alloc_end, gfp_mask, PAGE_KERNEL, 0,
> - NUMA_NO_NODE, __builtin_return_address(0));
> -
> - if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> - !IS_ENABLED(CONFIG_KASAN))
> - /*
> - * KASAN can only deal with module allocations being served
> - * from the reserved module region, since the remainder of
> - * the vmalloc region is already backed by zero shadow pages,
> - * and punching holes into it is non-trivial. Since the module
> - * region is not randomized when KASAN is enabled, it is even
> - * less likely that the module region gets exhausted, so we
> - * can simply omit this fallback in that case.
> - */
> - p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> - module_alloc_base + SZ_2G, GFP_KERNEL,
> - PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -
> - if (p && (kasan_module_alloc(p, size) < 0)) {
> - vfree(p);
> - return NULL;
> - }
> -
> - return p;
> -}
> -
> enum aarch64_reloc_op {
> RELOC_OP_NONE,
> RELOC_OP_ABS,
> diff --git a/arch/arm64/kernel/text.c b/arch/arm64/kernel/text.c
> new file mode 100644
> index 000000000000..64fc7e2d85df
> --- /dev/null
> +++ b/arch/arm64/kernel/text.c
> @@ -0,0 +1,54 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * AArch64 loadable module support.
> + *
> + * Copyright (C) 2012 ARM Limited
> + *
> + * Author: Will Deacon <will.deacon@arm.com>
> + */
> +#include <linux/kasan.h>
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +void *text_alloc(unsigned long size)
> +{
> + u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> + gfp_t gfp_mask = GFP_KERNEL;
> + void *p;
> +
> + /* Silence the initial allocation */
> + if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> + gfp_mask |= __GFP_NOWARN;
> +
> + if (IS_ENABLED(CONFIG_KASAN))
> + /* don't exceed the static module region - see below */
> + module_alloc_end = MODULES_END;
> +
> + p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> + module_alloc_end, gfp_mask, PAGE_KERNEL, 0,
> + NUMA_NO_NODE, __builtin_return_address(0));
> +
> + if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> + !IS_ENABLED(CONFIG_KASAN))
> + /*
> + * KASAN can only deal with module allocations being served
> + * from the reserved module region, since the remainder of
> + * the vmalloc region is already backed by zero shadow pages,
> + * and punching holes into it is non-trivial. Since the module
> + * region is not randomized when KASAN is enabled, it is even
> + * less likely that the module region gets exhausted, so we
> + * can simply omit this fallback in that case.
> + */
> + p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> + module_alloc_base + SZ_2G, GFP_KERNEL,
> + PAGE_KERNEL, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +
> + if (p && (kasan_module_alloc(p, size) < 0)) {
> + vfree(p);
> + return NULL;
> + }
> +
> + return p;
> +}
> diff --git a/arch/mips/kernel/Makefile b/arch/mips/kernel/Makefile
> index 8c7a043295ed..37ebf9a7f259 100644
> --- a/arch/mips/kernel/Makefile
> +++ b/arch/mips/kernel/Makefile
> @@ -8,7 +8,7 @@ extra-y := head.o vmlinux.lds
> obj-y += cmpxchg.o cpu-probe.o branch.o elf.o entry.o genex.o
> idle.o irq.o \
> process.o prom.o ptrace.o reset.o setup.o signal.o \
> syscall.o time.o topology.o traps.o unaligned.o watch.o \
> - vdso.o cacheinfo.o
> + vdso.o cacheinfo.o text.o
>
> ifdef CONFIG_FUNCTION_TRACER
> CFLAGS_REMOVE_ftrace.o = -pg
> diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
> index 3c0c3d1260c1..f5ac4ebc4bad 100644
> --- a/arch/mips/kernel/module.c
> +++ b/arch/mips/kernel/module.c
> @@ -31,15 +31,6 @@ struct mips_hi16 {
> static LIST_HEAD(dbe_list);
> static DEFINE_SPINLOCK(dbe_lock);
>
> -#ifdef MODULE_START
> -void *module_alloc(unsigned long size)
> -{
> - return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> -#endif
> -
> static int apply_r_mips_none(struct module *me, u32 *location,
> u32 base, Elf_Addr v, bool rela)
> {
> diff --git a/arch/mips/kernel/text.c b/arch/mips/kernel/text.c
> new file mode 100644
> index 000000000000..55ca87a421c3
> --- /dev/null
> +++ b/arch/mips/kernel/text.c
> @@ -0,0 +1,19 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + *
> + * Copyright (C) 2001 Rusty Russell.
> + * Copyright (C) 2003, 2004 Ralf Baechle (ralf@linux-mips.org)
> + * Copyright (C) 2005 Thiemo Seufer
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +#ifdef MODULE_START
> +void *text_alloc(unsigned long size)
> +{
> + return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
> + GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +}
> +#endif
> diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
> index 0af88622c619..2b03f7128809 100644
> --- a/arch/mips/net/bpf_jit.c
> +++ b/arch/mips/net/bpf_jit.c
> @@ -1233,7 +1233,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> build_epilogue(&ctx);
>
> alloc_size = 4 * ctx.idx;
> - ctx.target = module_alloc(alloc_size);
> + ctx.target = text_alloc(alloc_size);
> if (ctx.target == NULL)
> goto out;
>
> @@ -1264,7 +1264,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> void bpf_jit_free(struct bpf_prog *fp)
> {
> if (fp->jited)
> - module_memfree(fp->bpf_func);
> + text_memfree(fp->bpf_func);
>
> bpf_prog_unlock_free(fp);
> }
> diff --git a/arch/nds32/kernel/Makefile b/arch/nds32/kernel/Makefile
> index 394df3f6442c..fc15778c59d1 100644
> --- a/arch/nds32/kernel/Makefile
> +++ b/arch/nds32/kernel/Makefile
> @@ -10,7 +10,7 @@ AFLAGS_head.o := -DTEXTADDR=$(TEXTADDR)
> obj-y := ex-entry.o ex-exit.o ex-scall.o irq.o \
> process.o ptrace.o setup.o signal.o \
> sys_nds32.o time.o traps.o cacheinfo.o \
> - dma.o syscall_table.o vdso.o
> + dma.o syscall_table.o vdso.o text.o
>
> obj-$(CONFIG_MODULES) += nds32_ksyms.o module.o
> obj-$(CONFIG_STACKTRACE) += stacktrace.o
> diff --git a/arch/nds32/kernel/module.c b/arch/nds32/kernel/module.c
> index 3897fd14a21d..3d23a12ed535 100644
> --- a/arch/nds32/kernel/module.c
> +++ b/arch/nds32/kernel/module.c
> @@ -7,13 +7,6 @@
> #include <linux/moduleloader.h>
> #include <linux/pgtable.h>
>
> -void *module_alloc(unsigned long size)
> -{
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> -
> void module_free(struct module *module, void *region)
> {
> vfree(region);
> diff --git a/arch/nds32/kernel/text.c b/arch/nds32/kernel/text.c
> new file mode 100644
> index 000000000000..6e86eff9aaf0
> --- /dev/null
> +++ b/arch/nds32/kernel/text.c
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2005-2017 Andes Technology Corporation
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +void *text_alloc(unsigned long size)
> +{
> + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> + GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +}
> diff --git a/arch/nios2/kernel/Makefile b/arch/nios2/kernel/Makefile
> index 0b645e1e3158..5476fc749f37 100644
> --- a/arch/nios2/kernel/Makefile
> +++ b/arch/nios2/kernel/Makefile
> @@ -18,6 +18,7 @@ obj-y += setup.o
> obj-y += signal.o
> obj-y += sys_nios2.o
> obj-y += syscall_table.o
> +obj-y += text.o
> obj-y += time.o
> obj-y += traps.o
>
> diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
> index 76e0a42d6e36..20a0faf64e38 100644
> --- a/arch/nios2/kernel/module.c
> +++ b/arch/nios2/kernel/module.c
> @@ -21,25 +21,6 @@
>
> #include <asm/cacheflush.h>
>
> -/*
> - * Modules should NOT be allocated with kmalloc for (obvious) reasons.
> - * But we do it for now to avoid relocation issues. CALL26/PCREL26
> cannot reach
> - * from 0x80000000 (vmalloc area) to 0xc00000000 (kernel) (kmalloc returns
> - * addresses in 0xc0000000)
> - */
> -void *module_alloc(unsigned long size)
> -{
> - if (size == 0)
> - return NULL;
> - return kmalloc(size, GFP_KERNEL);
> -}
> -
> -/* Free memory returned from module_alloc */
> -void module_memfree(void *module_region)
> -{
> - kfree(module_region);
> -}
> -
> int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
> unsigned int symindex, unsigned int relsec,
> struct module *mod)
> diff --git a/arch/nios2/kernel/text.c b/arch/nios2/kernel/text.c
> new file mode 100644
> index 000000000000..af424174442f
> --- /dev/null
> +++ b/arch/nios2/kernel/text.c
> @@ -0,0 +1,34 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Kernel module support for Nios II.
> + *
> + * Copyright (C) 2004 Microtronix Datacom Ltd.
> + * Written by Wentao Xu <xuwentao@microtronix.com>
> + * Copyright (C) 2001, 2003 Rusty Russell
> + *
> + * This file is subject to the terms and conditions of the GNU General
> + * Public License. See the file COPYING in the main directory of this
> + * archive for more details.
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +/*
> + * Modules should NOT be allocated with kmalloc for (obvious) reasons.
> + * But we do it for now to avoid relocation issues. CALL26/PCREL26
> cannot reach
> + * from 0x80000000 (vmalloc area) to 0xc00000000 (kernel) (kmalloc returns
> + * addresses in 0xc0000000)
> + */
> +void *text_alloc(unsigned long size)
> +{
> + if (size == 0)
> + return NULL;
> + return kmalloc(size, GFP_KERNEL);
> +}
> +
> +/* Free memory returned from module_alloc */
> +void text_memfree(void *module_region)
> +{
> + kfree(module_region);
> +}
> diff --git a/arch/parisc/kernel/Makefile b/arch/parisc/kernel/Makefile
> index 068d90950d93..f71f7ffdae2e 100644
> --- a/arch/parisc/kernel/Makefile
> +++ b/arch/parisc/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y := cache.o pacache.o setup.o pdt.o
> traps.o time.o irq.o \
> ptrace.o hardware.o inventory.o drivers.o alternative.o \
> signal.o hpmc.o real2.o parisc_ksyms.o unaligned.o \
> process.o processor.o pdc_cons.o pdc_chassis.o unwind.o \
> - patch.o
> + patch.o text.o
>
> ifdef CONFIG_FUNCTION_TRACER
> # Do not profile debug and lowlevel utilities
> diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
> index 7df140545b22..c81e63e2549b 100644
> --- a/arch/parisc/kernel/module.c
> +++ b/arch/parisc/kernel/module.c
> @@ -192,17 +192,6 @@ static inline int reassemble_22(int as22)
> ((as22 & 0x0003ff) << 3));
> }
>
> -void *module_alloc(unsigned long size)
> -{
> - /* using RWX means less protection for modules, but it's
> - * easier than trying to map the text, data, init_text and
> - * init_data correctly */
> - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> - GFP_KERNEL,
> - PAGE_KERNEL_RWX, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> -
> #ifndef CONFIG_64BIT
> static inline unsigned long count_gots(const Elf_Rela *rela,
> unsigned long n)
> {
> diff --git a/arch/parisc/kernel/text.c b/arch/parisc/kernel/text.c
> new file mode 100644
> index 000000000000..9ff503084191
> --- /dev/null
> +++ b/arch/parisc/kernel/text.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Linux/PA-RISC Project
> + * Copyright (C) 2003 Randolph Chung <tausq at debian . org>
> + * Copyright (C) 2008 Helge Deller <deller@gmx.de>
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +void *text_alloc(unsigned long size)
> +{
> + /*
> + * Using RWX means less protection for modules, but it's
> + * easier than trying to map the text, data, init_text and
> + * init_data correctly.
> + */
> + return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> + GFP_KERNEL,
> + PAGE_KERNEL_RWX, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +}
> diff --git a/arch/powerpc/net/bpf_jit_comp.c
> b/arch/powerpc/net/bpf_jit_comp.c
> index 0acc9d5fb19e..ba1cef7a812d 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c
> @@ -634,7 +634,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
>
> proglen = cgctx.idx * 4;
> alloclen = proglen + FUNCTION_DESCR_SIZE;
> - image = module_alloc(alloclen);
> + image = text_alloc(alloclen);
> if (!image)
> goto out;
>
> @@ -678,7 +678,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> void bpf_jit_free(struct bpf_prog *fp)
> {
> if (fp->jited)
> - module_memfree(fp->bpf_func);
> + text_memfree(fp->bpf_func);
>
> bpf_prog_unlock_free(fp);
> }
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index b355cf485671..d0b30f286ce6 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -29,6 +29,7 @@ obj-y += riscv_ksyms.o
> obj-y += stacktrace.o
> obj-y += cacheinfo.o
> obj-y += patch.o
> +obj-y += text.o
> obj-$(CONFIG_MMU) += vdso.o vdso/
>
> obj-$(CONFIG_RISCV_M_MODE) += clint.o traps_misaligned.o
> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> index 7191342c54da..f6aa66431c9e 100644
> --- a/arch/riscv/kernel/module.c
> +++ b/arch/riscv/kernel/module.c
> @@ -390,15 +390,3 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const
> char *strtab,
>
> return 0;
> }
> -
> -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> -#define VMALLOC_MODULE_START \
> - max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
> -void *module_alloc(unsigned long size)
> -{
> - return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> - VMALLOC_END, GFP_KERNEL,
> - PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> -#endif
> diff --git a/arch/riscv/kernel/text.c b/arch/riscv/kernel/text.c
> new file mode 100644
> index 000000000000..201608a25641
> --- /dev/null
> +++ b/arch/riscv/kernel/text.c
> @@ -0,0 +1,20 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + *
> + * Copyright (C) 2017 Zihao Yu
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> +#define VMALLOC_MODULE_START \
> + max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
> +void *text_alloc(unsigned long size)
> +{
> + return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> + VMALLOC_END, GFP_KERNEL,
> + PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +}
> +#endif
> diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
> index a8f136943deb..9f00c320b938 100644
> --- a/arch/s390/kernel/Makefile
> +++ b/arch/s390/kernel/Makefile
> @@ -40,7 +40,7 @@ obj-y += sysinfo.o lgr.o os_info.o machine_kexec.o
> pgm_check.o
> obj-y += runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o sthyi.o
> obj-y += entry.o reipl.o relocate_kernel.o kdebugfs.o alternative.o
> obj-y += nospec-branch.o ipl_vmparm.o machine_kexec_reloc.o unwind_bc.o
> -obj-y += smp.o
> +obj-y += smp.o text.o
>
> extra-y += head64.o vmlinux.lds
>
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index b388e87a08bf..a752b1442846 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -134,7 +134,7 @@ static int __init ftrace_plt_init(void)
> {
> unsigned int *ip;
>
> - ftrace_plt = (unsigned long) module_alloc(PAGE_SIZE);
> + ftrace_plt = (unsigned long) text_alloc(PAGE_SIZE);
> if (!ftrace_plt)
> panic("cannot allocate ftrace plt\n");
> ip = (unsigned int *) ftrace_plt;
> diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> index 4055f1c49814..087cb5951de6 100644
> --- a/arch/s390/kernel/module.c
> +++ b/arch/s390/kernel/module.c
> @@ -32,22 +32,6 @@
>
> #define PLT_ENTRY_SIZE 20
>
> -void *module_alloc(unsigned long size)
> -{
> - void *p;
> -
> - if (PAGE_ALIGN(size) > MODULES_LEN)
> - return NULL;
> - p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> - if (p && (kasan_module_alloc(p, size) < 0)) {
> - vfree(p);
> - return NULL;
> - }
> - return p;
> -}
> -
> void module_arch_freeing_init(struct module *mod)
> {
> if (is_livepatch_module(mod) &&
> diff --git a/arch/s390/kernel/text.c b/arch/s390/kernel/text.c
> new file mode 100644
> index 000000000000..63aaa1ab727b
> --- /dev/null
> +++ b/arch/s390/kernel/text.c
> @@ -0,0 +1,23 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Kernel module help for s390.
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +void *text_alloc(unsigned long size)
> +{
> + void *p;
> +
> + if (PAGE_ALIGN(size) > MODULES_LEN)
> + return NULL;
> + p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR, MODULES_END,
> + GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> + if (p && (kasan_module_alloc(p, size) < 0)) {
> + vfree(p);
> + return NULL;
> + }
> + return p;
> +}
> diff --git a/arch/sparc/kernel/Makefile b/arch/sparc/kernel/Makefile
> index 97c0e19263d1..e025f9e1db4a 100644
> --- a/arch/sparc/kernel/Makefile
> +++ b/arch/sparc/kernel/Makefile
> @@ -52,6 +52,7 @@ obj-y += prom_common.o
> obj-y += prom_$(BITS).o
> obj-y += of_device_common.o
> obj-y += of_device_$(BITS).o
> +obj-y += text.o
> obj-$(CONFIG_SPARC64) += prom_irqtrans.o
>
> obj-$(CONFIG_SPARC32) += leon_kernel.o
> diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
> index df39580f398d..f2babc69f189 100644
> --- a/arch/sparc/kernel/module.c
> +++ b/arch/sparc/kernel/module.c
> @@ -21,36 +21,6 @@
>
> #include "entry.h"
>
> -#ifdef CONFIG_SPARC64
> -
> -#include <linux/jump_label.h>
> -
> -static void *module_map(unsigned long size)
> -{
> - if (PAGE_ALIGN(size) > MODULES_LEN)
> - return NULL;
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> -#else
> -static void *module_map(unsigned long size)
> -{
> - return vmalloc(size);
> -}
> -#endif /* CONFIG_SPARC64 */
> -
> -void *module_alloc(unsigned long size)
> -{
> - void *ret;
> -
> - ret = module_map(size);
> - if (ret)
> - memset(ret, 0, size);
> -
> - return ret;
> -}
> -
> /* Make generic code ignore STT_REGISTER dummy undefined symbols. */
> int module_frob_arch_sections(Elf_Ehdr *hdr,
> Elf_Shdr *sechdrs,
> diff --git a/arch/sparc/kernel/text.c b/arch/sparc/kernel/text.c
> new file mode 100644
> index 000000000000..d16663f2c6ba
> --- /dev/null
> +++ b/arch/sparc/kernel/text.c
> @@ -0,0 +1,39 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Kernel module help for sparc64.
> + *
> + * Copyright (C) 2001 Rusty Russell.
> + * Copyright (C) 2002 David S. Miller.
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +#ifdef CONFIG_SPARC64
> +
> +#include <linux/jump_label.h>
> +
> +static void *module_map(unsigned long size)
> +{
> + if (PAGE_ALIGN(size) > MODULES_LEN)
> + return NULL;
> + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> + GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +}
> +#else
> +static void *module_map(unsigned long size)
> +{
> + return vmalloc(size);
> +}
> +#endif /* CONFIG_SPARC64 */
> +
> +void *text_alloc(unsigned long size)
> +{
> + void *ret;
> +
> + ret = module_map(size);
> + if (ret)
> + memset(ret, 0, size);
> +
> + return ret;
> +}
> diff --git a/arch/sparc/net/bpf_jit_comp_32.c
> b/arch/sparc/net/bpf_jit_comp_32.c
> index c8eabb973b86..d9dd513b27b2 100644
> --- a/arch/sparc/net/bpf_jit_comp_32.c
> +++ b/arch/sparc/net/bpf_jit_comp_32.c
> @@ -713,7 +713,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
> if (unlikely(proglen + ilen > oldproglen)) {
> pr_err("bpb_jit_compile fatal error\n");
> kfree(addrs);
> - module_memfree(image);
> + text_memfree(image);
> return;
> }
> memcpy(image + proglen, temp, ilen);
> @@ -736,7 +736,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
> break;
> }
> if (proglen == oldproglen) {
> - image = module_alloc(proglen);
> + image = text_alloc(proglen);
> if (!image)
> goto out;
> }
> @@ -758,7 +758,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
> void bpf_jit_free(struct bpf_prog *fp)
> {
> if (fp->jited)
> - module_memfree(fp->bpf_func);
> + text_memfree(fp->bpf_func);
>
> bpf_prog_unlock_free(fp);
> }
> diff --git a/arch/unicore32/kernel/Makefile b/arch/unicore32/kernel/Makefile
> index 2f79aa56735b..96eb8cfc8b1e 100644
> --- a/arch/unicore32/kernel/Makefile
> +++ b/arch/unicore32/kernel/Makefile
> @@ -6,6 +6,7 @@
> # Object file lists.
> obj-y := dma.o elf.o entry.o process.o ptrace.o
> obj-y += setup.o signal.o sys.o stacktrace.o traps.o
> +obj-y += text.o
>
> obj-$(CONFIG_MODULES) += ksyms.o module.o
> obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
> diff --git a/arch/unicore32/kernel/module.c b/arch/unicore32/kernel/module.c
> index 67c89ef2d6ee..e1e703c02379 100644
> --- a/arch/unicore32/kernel/module.c
> +++ b/arch/unicore32/kernel/module.c
> @@ -18,13 +18,6 @@
>
> #include <asm/sections.h>
>
> -void *module_alloc(unsigned long size)
> -{
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> -
> int
> apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned
> int symindex,
> unsigned int relindex, struct module *module)
> diff --git a/arch/unicore32/kernel/text.c b/arch/unicore32/kernel/text.c
> new file mode 100644
> index 000000000000..b94aac824bb8
> --- /dev/null
> +++ b/arch/unicore32/kernel/text.c
> @@ -0,0 +1,18 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * linux/arch/unicore32/kernel/module.c
> + *
> + * Code specific to PKUnity SoC and UniCore ISA
> + *
> + * Copyright (C) 2001-2010 GUAN Xue-tao
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +void *text_alloc(unsigned long size)
> +{
> + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> + GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +}
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index e77261db2391..2878e4b753a0 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -68,6 +68,7 @@ obj-y += tsc.o tsc_msr.o io_delay.o rtc.o
> obj-y += pci-iommu_table.o
> obj-y += resource.o
> obj-y += irqflags.o
> +obj-y += text.o
>
> obj-y += process.o
> obj-y += fpu/
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index 51504566b3a6..f76703ee96f2 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -265,11 +265,11 @@ int __init ftrace_dyn_arch_init(void)
> /* Module allocation simplifies allocating memory for code */
> static inline void *alloc_tramp(unsigned long size)
> {
> - return module_alloc(size);
> + return text_alloc(size);
> }
> static inline void tramp_free(void *tramp)
> {
> - module_memfree(tramp);
> + text_memfree(tramp);
> }
> #else
> /* Trampolines can only be created if modules are supported */
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index ada39ddbc922..e9ac7d3c658e 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -423,7 +423,7 @@ void *alloc_insn_page(void)
> {
> void *page;
>
> - page = module_alloc(PAGE_SIZE);
> + page = text_alloc(PAGE_SIZE);
> if (!page)
> return NULL;
>
> @@ -446,7 +446,7 @@ void *alloc_insn_page(void)
> /* Recover page to RW mode before releasing it */
> void free_insn_page(void *page)
> {
> - module_memfree(page);
> + text_memfree(page);
> }
>
> static int arch_copy_kprobe(struct kprobe *p)
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index 34b153cbd4ac..261df078f127 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -36,55 +36,6 @@ do { \
> } while (0)
> #endif
>
> -#ifdef CONFIG_RANDOMIZE_BASE
> -static unsigned long module_load_offset;
> -
> -/* Mutex protects the module_load_offset. */
> -static DEFINE_MUTEX(module_kaslr_mutex);
> -
> -static unsigned long int get_module_load_offset(void)
> -{
> - if (kaslr_enabled()) {
> - mutex_lock(&module_kaslr_mutex);
> - /*
> - * Calculate the module_load_offset the first time this
> - * code is called. Once calculated it stays the same until
> - * reboot.
> - */
> - if (module_load_offset == 0)
> - module_load_offset =
> - (get_random_int() % 1024 + 1) * PAGE_SIZE;
> - mutex_unlock(&module_kaslr_mutex);
> - }
> - return module_load_offset;
> -}
> -#else
> -static unsigned long int get_module_load_offset(void)
> -{
> - return 0;
> -}
> -#endif
> -
> -void *module_alloc(unsigned long size)
> -{
> - void *p;
> -
> - if (PAGE_ALIGN(size) > MODULES_LEN)
> - return NULL;
> -
> - p = __vmalloc_node_range(size, MODULE_ALIGN,
> - MODULES_VADDR + get_module_load_offset(),
> - MODULES_END, GFP_KERNEL,
> - PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> - if (p && (kasan_module_alloc(p, size) < 0)) {
> - vfree(p);
> - return NULL;
> - }
> -
> - return p;
> -}
> -
> #ifdef CONFIG_X86_32
> int apply_relocate(Elf32_Shdr *sechdrs,
> const char *strtab,
> diff --git a/arch/x86/kernel/text.c b/arch/x86/kernel/text.c
> new file mode 100644
> index 000000000000..724ab2d93ac5
> --- /dev/null
> +++ b/arch/x86/kernel/text.c
> @@ -0,0 +1,60 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Kernel module help for x86.
> + * Copyright (C) 2001 Rusty Russell.
> + */
> +#include <linux/kasan.h>
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/random.h>
> +#include <linux/vmalloc.h>
> +#include <asm/setup.h>
> +
> +#ifdef CONFIG_RANDOMIZE_BASE
> +static unsigned long module_load_offset;
> +
> +/* Mutex protects the module_load_offset. */
> +static DEFINE_MUTEX(module_kaslr_mutex);
> +
> +static unsigned long get_module_load_offset(void)
> +{
> + if (kaslr_enabled()) {
> + mutex_lock(&module_kaslr_mutex);
> + /*
> + * Calculate the module_load_offset the first time this
> + * code is called. Once calculated it stays the same until
> + * reboot.
> + */
> + if (module_load_offset == 0)
> + module_load_offset =
> + (get_random_int() % 1024 + 1) * PAGE_SIZE;
> + mutex_unlock(&module_kaslr_mutex);
> + }
> + return module_load_offset;
> +}
> +#else
> +static unsigned long get_module_load_offset(void)
> +{
> + return 0;
> +}
> +#endif
> +
> +void *text_alloc(unsigned long size)
> +{
> + void *p;
> +
> + if (PAGE_ALIGN(size) > MODULES_LEN)
> + return NULL;
> +
> + p = __vmalloc_node_range(size, MODULE_ALIGN,
> + MODULES_VADDR + get_module_load_offset(),
> + MODULES_END, GFP_KERNEL,
> + PAGE_KERNEL, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> + if (p && (kasan_module_alloc(p, size) < 0)) {
> + vfree(p);
> + return NULL;
> + }
> +
> + return p;
> +}
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 4fa67a8b2265..4e8b9ba431ee 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -24,10 +24,10 @@ unsigned int arch_mod_section_prepend(struct
> module *mod, unsigned int section);
>
> /* Allocator used for allocating struct module, core sections and init
> sections. Returns NULL on failure. */
> -void *module_alloc(unsigned long size);
> +void *text_alloc(unsigned long size);
>
> /* Free memory returned from module_alloc. */
> -void module_memfree(void *module_region);
> +void text_memfree(void *module_region);
>
> /* Determines if the section name is an init section (that is only
> used during
> * module loading).
> diff --git a/kernel/Makefile b/kernel/Makefile
> index f3218bc5ec69..9e88e81f68ef 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y = fork.o exec_domain.o panic.o \
> extable.o params.o \
> kthread.o sys_ni.o nsproxy.o \
> notifier.o ksysfs.o cred.o reboot.o \
> - async.o range.o smpboot.o ucount.o
> + async.o range.o smpboot.o ucount.o text.o
>
> obj-$(CONFIG_MODULES) += kmod.o
> obj-$(CONFIG_MULTIUSER) += groups.o
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 9df4cc9a2907..febd55019a8a 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -840,12 +840,12 @@ static void bpf_jit_uncharge_modmem(u32 pages)
>
> void *__weak bpf_jit_alloc_exec(unsigned long size)
> {
> - return module_alloc(size);
> + return text_alloc(size);
> }
>
> void __weak bpf_jit_free_exec(void *addr)
> {
> - module_memfree(addr);
> + text_memfree(addr);
> }
>
> struct bpf_binary_header *
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 4a904cc56d68..d1c354ec89de 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -111,12 +111,12 @@ enum kprobe_slot_state {
>
> void __weak *alloc_insn_page(void)
> {
> - return module_alloc(PAGE_SIZE);
> + return text_alloc(PAGE_SIZE);
> }
>
> void __weak free_insn_page(void *page)
> {
> - module_memfree(page);
> + text_memfree(page);
> }
>
> struct kprobe_insn_cache kprobe_insn_slots = {
> diff --git a/kernel/module.c b/kernel/module.c
> index bee1c25ca5c5..bdb3773f3668 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -2141,16 +2141,6 @@ static void free_module_elf(struct module *mod)
> }
> #endif /* CONFIG_LIVEPATCH */
>
> -void __weak module_memfree(void *module_region)
> -{
> - /*
> - * This memory may be RO, and freeing RO memory in an interrupt is not
> - * supported by vmalloc.
> - */
> - WARN_ON(in_interrupt());
> - vfree(module_region);
> -}
> -
> void __weak module_arch_cleanup(struct module *mod)
> {
> }
> @@ -2200,7 +2190,7 @@ static void free_module(struct module *mod)
>
> /* This may be empty, but that's OK */
> module_arch_freeing_init(mod);
> - module_memfree(mod->init_layout.base);
> + text_memfree(mod->init_layout.base);
> kfree(mod->args);
> percpu_modfree(mod);
>
> @@ -2208,7 +2198,7 @@ static void free_module(struct module *mod)
> lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size);
>
> /* Finally, free the core (containing the module structure) */
> - module_memfree(mod->core_layout.base);
> + text_memfree(mod->core_layout.base);
> }
>
> void *__symbol_get(const char *symbol)
> @@ -2781,13 +2771,6 @@ static void dynamic_debug_remove(struct
> module *mod, struct _ddebug *debug)
> ddebug_remove_module(mod->name);
> }
>
> -void * __weak module_alloc(unsigned long size)
> -{
> - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> - GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> - NUMA_NO_NODE, __builtin_return_address(0));
> -}
> -
> bool __weak module_init_section(const char *name)
> {
> return strstarts(name, ".init");
> @@ -3246,7 +3229,7 @@ static int move_module(struct module *mod,
> struct load_info *info)
> void *ptr;
>
> /* Do the allocs. */
> - ptr = module_alloc(mod->core_layout.size);
> + ptr = text_alloc(mod->core_layout.size);
> /*
> * The pointer to this block is stored in the module structure
> * which is inside the block. Just mark it as not being a
> @@ -3260,7 +3243,7 @@ static int move_module(struct module *mod,
> struct load_info *info)
> mod->core_layout.base = ptr;
>
> if (mod->init_layout.size) {
> - ptr = module_alloc(mod->init_layout.size);
> + ptr = text_alloc(mod->init_layout.size);
> /*
> * The pointer to this block is stored in the module structure
> * which is inside the block. This block doesn't need to be
> @@ -3269,7 +3252,7 @@ static int move_module(struct module *mod,
> struct load_info *info)
> */
> kmemleak_ignore(ptr);
> if (!ptr) {
> - module_memfree(mod->core_layout.base);
> + text_memfree(mod->core_layout.base);
> return -ENOMEM;
> }
> memset(ptr, 0, mod->init_layout.size);
> @@ -3452,8 +3435,8 @@ static void module_deallocate(struct module
> *mod, struct load_info *info)
> {
> percpu_modfree(mod);
> module_arch_freeing_init(mod);
> - module_memfree(mod->init_layout.base);
> - module_memfree(mod->core_layout.base);
> + text_memfree(mod->init_layout.base);
> + text_memfree(mod->core_layout.base);
> }
>
> int __weak module_finalize(const Elf_Ehdr *hdr,
> @@ -3527,7 +3510,7 @@ static void do_free_init(struct work_struct *w)
>
> llist_for_each_safe(pos, n, list) {
> initfree = container_of(pos, struct mod_initfree, node);
> - module_memfree(initfree->module_init);
> + text_memfree(initfree->module_init);
> kfree(initfree);
> }
> }
> @@ -3626,10 +3609,10 @@ static noinline int do_init_module(struct
> module *mod)
> * We want to free module_init, but be aware that kallsyms may be
> * walking this with preempt disabled. In all the failure paths, we
> * call synchronize_rcu(), but we don't want to slow down the success
> - * path. module_memfree() cannot be called in an interrupt, so do the
> + * path. text_memfree() cannot be called in an interrupt, so do the
> * work and call synchronize_rcu() in a work queue.
> *
> - * Note that module_alloc() on most architectures creates W+X page
> + * Note that text_alloc() on most architectures creates W+X page
> * mappings which won't be cleaned up until do_free_init() runs. Any
> * code such as mark_rodata_ro() which depends on those mappings to
> * be cleaned up needs to sync with the queued work - ie
> diff --git a/kernel/text.c b/kernel/text.c
> new file mode 100644
> index 000000000000..9a12c508ded5
> --- /dev/null
> +++ b/kernel/text.c
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2002 Richard Henderson
> + * Copyright (C) 2001 Rusty Russell, 2002, 2010 Rusty Russell IBM.
> + */
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/vmalloc.h>
> +
> +void __weak text_memfree(void *module_region)
> +{
> + /*
> + * This memory may be RO, and freeing RO memory in an interrupt is not
> + * supported by vmalloc.
> + */
> + WARN_ON(in_interrupt());
> + vfree(module_region);
> +}
> +
> +void * __weak text_alloc(unsigned long size)
> +{
> + return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> + GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> + NUMA_NO_NODE, __builtin_return_address(0));
> +}
> --
> 2.25.1
^ permalink raw reply
* Re: [PATCH] powerpc/64: Fix an out of date comment about MMIO ordering
From: Benjamin Herrenschmidt @ 2020-07-16 22:38 UTC (permalink / raw)
To: Palmer Dabbelt, Will Deacon
Cc: kernel-team, bigeasy, Palmer Dabbelt, linux-kernel, npiggin,
paulus, jniethe5, tglx, msuchanek, linuxppc-dev
In-Reply-To: <20200716193820.1141936-1-palmer@dabbelt.com>
On Thu, 2020-07-16 at 12:38 -0700, Palmer Dabbelt wrote:
> From: Palmer Dabbelt <palmerdabbelt@google.com>
>
> This primitive has been renamed, but because it was spelled incorrectly in the
> first place it must have escaped the fixup patch. As far as I can tell this
> logic is still correct: smp_mb__after_spinlock() uses the default smp_mb()
> implementation, which is "sync" rather than "hwsync" but those are the same
> (though I'm not that familiar with PowerPC).
Typo ? That must be me ... :)
Looks fine. Yes, sync and hwsync are the same (by opposition to lwsync
which is lighter weight and doesn't order cache inhibited).
Cheers,
Ben.
> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
> ---
> arch/powerpc/kernel/entry_64.S | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index b3c9f15089b6..7b38b4daca93 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -357,7 +357,7 @@ _GLOBAL(_switch)
> * kernel/sched/core.c).
> *
> * Uncacheable stores in the case of involuntary preemption must
> - * be taken care of. The smp_mb__before_spin_lock() in __schedule()
> + * be taken care of. The smp_mb__after_spinlock() in __schedule()
> * is implemented as hwsync on powerpc, which orders MMIO too. So
> * long as there is an hwsync in the context switch path, it will
> * be executed on the source CPU after the task has performed
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox