* Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
From: Nathan Lynch @ 2020-08-14 17:27 UTC (permalink / raw)
To: Scott Cheloha, linuxppc-dev
Cc: Laurent Dufour, Michal Suchanek, David Hildenbrand, Rick Lindsley
In-Reply-To: <20200811015115.63677-1-cheloha@linux.ibm.com>
Scott Cheloha <cheloha@linux.ibm.com> writes:
> At memory hot-remove time we can retrieve an LMB's nid from its
> corresponding memory_block. There is no need to store the nid
> in multiple locations.
>
> Note that lmb_to_memblock() uses find_memory_block() to get the
> corresponding memory_block. As find_memory_block() runs in sub-linear
> time this approach is negligibly slower than what we do at present.
>
> In exchange for this lookup at hot-remove time we no longer need to
> call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
> On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
> spares us an O(n^2) initialization during boot.
>
> On systems with many LMBs that initialization overhead is palpable and
> disruptive. For example, on a box with 249854 LMBs we're seeing
> drmem_init() take upwards of 30 seconds to complete:
>
> [ 53.721639] drmem: initializing drmem v2
> [ 80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
> [ 80.604377] Modules linked in:
> [ 80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
> [ 80.604397] NIP: c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
> [ 80.604407] REGS: c0002dbff8493830 TRAP: 0901 Not tainted (5.6.0-rc2+)
> [ 80.604412] MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 44000248 XER: 0000000d
> [ 80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
> [ 80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
> [ 80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
> [ 80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
> [ 80.604431] GPR12: 0000000000000000 c00000001e8ec200
> [ 80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
> [ 80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
> [ 80.604492] Call Trace:
> [ 80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
> [ 80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
> [ 80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
> [ 80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
> [ 80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
> [ 80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
> [ 80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
> [ 80.604567] Instruction dump:
> [ 80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
> [ 80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
> [ 89.047390] drmem: 249854 LMB(s)
>
> With a patched kernel on the same machine we're no longer seeing the
> soft lockup. drmem_init() now completes in negligible time, even when
> the LMB count is large.
This has been a real annoyance, and it's great to see it fixed in a way
that both simplifies the code and reduces data structure size.
>
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
I think this also should have:
Fixes: b2d3b5ee66f2 ("powerpc/pseries: Track LMB nid instead of using device tree")
since you're essentially reverting it.
The problem that commit was trying to address arose from the fact that
the removal path was determining the Linux node for a block from the
firmware description of the block:
memory_add_physaddr_to_nid
hot_add_scn_to_nid
hot_add_drconf_scn_to_nid
of_drconf_to_nid_single
[parses the DT for the nid]
However, this can become out of sync with Linux's NUMA assignment for
the memory due to VM migration or other events.
I'm satisfied that this change does not reintroduce that problem. The
removal path still derives the node without going to the device tree,
but now performs a more efficient lookup.
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
^ permalink raw reply
* Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
From: Nathan Lynch @ 2020-08-14 19:31 UTC (permalink / raw)
To: Scott Cheloha, linuxppc-dev
Cc: Laurent Dufour, Michal Suchanek, David Hildenbrand, Rick Lindsley
In-Reply-To: <87y2mhxhls.fsf@linux.ibm.com>
Nathan Lynch <nathanl@linux.ibm.com> writes:
> I'm satisfied that this change does not reintroduce that problem. The
> removal path still derives the node without going to the device tree,
> but now performs a more efficient lookup.
Er, actually it's slightly less efficient in the remove path, as you
note in your commit message. This doesn't alter my opinion of the
change. The higher-level algorithmic inefficiencies (e.g. all the
for_each_drmem_lmb loops) in this code are more of a concern.
^ permalink raw reply
* [PATCH 0/1] powerpc/numa: do not skip node 0 in lookup table
From: Daniel Henrique Barboza @ 2020-08-14 20:34 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Daniel Henrique Barboza
Hi,
This is a simple fix that I made while testing NUMA changes
I'm making in QEMU [1]. Setting any non-zero value to the
associativity of NUMA node 0 has no impact in the output
of 'numactl' because the distance_lookup_table is never
initialized for node 0.
Seeing through the LOPAPR spec and git history I found no
technical reason to skip node 0, which makes me believe this is
a bug that got under the radar up until now because no one
attempted to set node 0 associativity like I'm doing now.
For anyone wishing to give it a spin, using the QEMU build
in [1] and experimenting with NUMA distances, such as:
sudo ./qemu-system-ppc64 -machine pseries-5.2,accel=kvm,usb=off,dump-guest-core=off -m 65536 -overcommit mem-lock=off -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc -display none -vga none -nographic -boot menu=on -device spapr-pci-host-bridge,index=1,id=pci.1 -device spapr-pci-host-bridge,index=2,id=pci.2 -device spapr-pci-host-bridge,index=3,id=pci.3 -device spapr-pci-host-bridge,index=4,id=pci.4 -device qemu-xhci,id=usb,bus=pci.0,addr=0x2 -drive file=/home/danielhb/f32.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -device usb-kbd,id=input0,bus=usb.0,port=1 -device usb-mouse,id=input1,bus=usb.0,port=2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on \
-numa node,nodeid=0,cpus=0 -numa node,nodeid=1,cpus=1 \
-numa node,nodeid=2,cpus=2 -numa node,nodeid=3,cpus=3 \
-numa dist,src=0,dst=1,val=80 -numa dist,src=0,dst=2,val=80 \
-numa dist,src=0,dst=3,val=80 -numa dist,src=1,dst=2,val=80 \
-numa dist,src=1,dst=3,val=80 -numa dist,src=2,dst=3,val=80
The current kernel code will ignore the associativity of
node 0, and numactl will output this:
node distances:
node 0 1 2 3
0: 10 160 160 160
1: 160 10 80 80
2: 160 80 10 80
3: 160 80 80 10
With this patch:
node distances:
node 0 1 2 3
0: 10 160 160 160
1: 160 10 80 40
2: 160 80 10 20
3: 160 40 20 10
If anyone wonders, this patch has no conflict with the proposed
NUMA changes in [2] because Aneesh isn't changing this line.
[1] https://github.com/danielhb/qemu/tree/spapr_numa_v1
[2] https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200731111916.243569-1-aneesh.kumar@linux.ibm.com/
Daniel Henrique Barboza (1):
powerpc/numa: do not skip node 0 when init lookup table
arch/powerpc/mm/numa.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--
2.26.2
^ permalink raw reply
* [PATCH 1/1] powerpc/numa: do not skip node 0 when init lookup table
From: Daniel Henrique Barboza @ 2020-08-14 20:34 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Daniel Henrique Barboza
In-Reply-To: <20200814203413.542050-1-danielhb413@gmail.com>
associativity_to_nid() is skipping node 0 when initializing
the distance lookup table. This has no practical effect when
the associativity of node 0 is always zero, which seems to
be case for a long time. As such, this line got introduced in
commit 41eab6f88f24 from 2010 and never revisited.
However, QEMU is making an effort to allow user input to configure
NUMA topologies, and this behavior got exposed when testing
that work. With the existing code, this is what happens with a
4 node NUMA guest with distance = 80 to each other:
$ numactl -H
(...)
node distances:
node 0 1 2 3
0: 10 160 160 160
1: 160 10 80 80
2: 160 80 10 80
3: 160 80 80 10
With this patch, this is the result:
$ numactl -H
(...)
node distances:
node 0 1 2 3
0: 10 80 80 80
1: 80 10 80 80
2: 80 80 10 80
3: 80 80 80 10
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
arch/powerpc/mm/numa.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 1f61fa2148b5..c11aabad1090 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -239,7 +239,7 @@ static int associativity_to_nid(const __be32 *associativity)
if (nid == 0xffff || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
- if (nid > 0 &&
+ if (nid >= 0 &&
of_read_number(associativity, 1) >= distance_ref_points_depth) {
/*
* Skip the length field and send start of associativity array
--
2.26.2
^ permalink raw reply related
* Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.9-2 tag
From: pr-tracker-bot @ 2020-08-14 20:42 UTC (permalink / raw)
To: Michael Ellerman
Cc: aneesh.kumar, linuxppc-dev, Linus Torvalds, linux-kernel,
Christian Zigotzky
In-Reply-To: <87ft8p9yjf.fsf@mpe.ellerman.id.au>
The pull request you sent on Fri, 14 Aug 2020 22:56:04 +1000:
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.9-2
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/7fca4dee610dffbe119714231cac0d59496bc193
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: [PATCH net 0/5] refactoring of ibmvnic code
From: David Miller @ 2020-08-14 20:53 UTC (permalink / raw)
To: ljp; +Cc: netdev, linuxppc-dev
In-Reply-To: <20200814075921.88745-1-ljp@linux.ibm.com>
From: Lijun Pan <ljp@linux.ibm.com>
Date: Fri, 14 Aug 2020 02:59:16 -0500
> This patch series refactor reset_init and init functions,
> improve the debugging messages, and make some other cosmetic changes
> to make the code easier to read and debug.
Cosmetic changes and cleanups are not appropriate at this time as
the net-next tree is closed.
Please repost these changes when the net-next tree opens back up.
^ permalink raw reply
* Re: [PATCH] libnvdimm: Add a NULL entry to 'nvdimm_firmware_attributes'
From: Verma, Vishal L @ 2020-08-14 22:55 UTC (permalink / raw)
To: Weiny, Ira, vaibhav@linux.ibm.com
Cc: sandipan@linux.ibm.com, linuxppc-dev@lists.ozlabs.org,
aneesh.kumar@linux.ibm.com, linux-nvdimm@lists.01.org
In-Reply-To: <20200814171005.GB3142014@iweiny-DESK2.sc.intel.com>
On Fri, 2020-08-14 at 10:10 -0700, Ira Weiny wrote:
> On Fri, Aug 14, 2020 at 08:35:09PM +0530, Vaibhav Jain wrote:
> > We recently discovered a kernel oops with 'papr_scm' module while
> > booting ppc64 phyp guest with following back-trace:
> >
> > BUG: Kernel NULL pointer dereference on write at 0x00000188
> > Faulting instruction address: 0xc0000000005d7084
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > <snip>
> > Call Trace:
> > internal_create_group+0x128/0x4c0 (unreliable)
> > internal_create_groups.part.4+0x70/0x130
> > device_add+0x458/0x9c0
> > nd_async_device_register+0x28/0xa0 [libnvdimm]
> > async_run_entry_fn+0x78/0x1f0
> > process_one_work+0x2c0/0x5b0
> > worker_thread+0x88/0x650
> > kthread+0x1a8/0x1b0
> > ret_from_kernel_thread+0x5c/0x6c
> >
> > A bisect lead to the 'commit 48001ea50d17f ("PM, libnvdimm: Add runtime
> > firmware activation support")' and on investigation discovered that
> > the newly introduced 'struct attribute *nvdimm_firmware_attributes[]'
> > is missing a terminating NULL entry in the array. This causes a loop
> > in sysfs's 'create_files()' to read garbage beyond bounds of
> > 'nvdimm_firmware_attributes' and trigger the oops.
> >
> > Fixes: 48001ea50d17f ("PM, libnvdimm: Add runtime firmware activation support")
> > Reported-by: Sandipan Das <sandipan@linux.ibm.com>
> > Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Thanks Vaibhav and Ira. I see this was also reported and fixed by Zqiang
a couple days ago. I'll pick that, merge these trailers and add it to
the fixes queue.
>
> > ---
> > drivers/nvdimm/dimm_devs.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> > index 61374def51555..b59032e0859b7 100644
> > --- a/drivers/nvdimm/dimm_devs.c
> > +++ b/drivers/nvdimm/dimm_devs.c
> > @@ -529,6 +529,7 @@ static DEVICE_ATTR_ADMIN_RW(activate);
> > static struct attribute *nvdimm_firmware_attributes[] = {
> > &dev_attr_activate.attr,
> > &dev_attr_result.attr,
> > + NULL,
> > };
> >
> > static umode_t nvdimm_firmware_visible(struct kobject *kobj, struct attribute *a, int n)
> > --
> > 2.26.2
> >
> _______________________________________________
> Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
> To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
^ permalink raw reply
* [PATCH] swiotlb: Allow allocating buffer anywhere in memory
From: Thiago Jung Bauermann @ 2020-08-15 20:45 UTC (permalink / raw)
To: iommu
Cc: Konrad Rzeszutek Wilk, linuxppc-dev, Ram Pai, linux-kernel,
Satheesh Rajendran, Robin Murphy, Christoph Hellwig,
Thiago Jung Bauermann, Marek Szyprowski
POWER secure guests (i.e., guests which use the Protection Execution
Facility) need to use SWIOTLB to be able to do I/O with the hypervisor, but
they don't need the SWIOTLB memory to be in low addresses since the
hypervisor doesn't have any addressing limitation.
This solves a SWIOTLB initialization problem we are seeing in secure guests
with 128 GB of RAM: they are configured with 4 GB of crashkernel reserved
memory, which leaves no space for SWIOTLB in low addresses.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
---
arch/powerpc/mm/mem.c | 7 ++++++-
include/linux/swiotlb.h | 8 +++++++-
kernel/dma/swiotlb.c | 10 +++++++---
3 files changed, 20 insertions(+), 5 deletions(-)
Normally I would split changes like this into one patch touching generic
code and another for the arch-specific part, but in this case I thought it
would be unneeded complexity. I can split though if people prefer it that
way.
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c2c11eb8dcfc..13f2e3aff8b5 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -50,6 +50,7 @@
#include <asm/swiotlb.h>
#include <asm/rtas.h>
#include <asm/kasan.h>
+#include <asm/svm.h>
#include <mm/mmu_decl.h>
@@ -290,7 +291,11 @@ void __init mem_init(void)
* back to to-down.
*/
memblock_set_bottom_up(true);
- swiotlb_init(0);
+ /*
+ * SVM guests can use the SWIOTLB wherever it is in memory,
+ * even if not DMA-able.
+ */
+ swiotlb_init_anywhere(0, is_secure_guest());
#endif
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 046bb94bd4d6..433f3dbb35b5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -29,7 +29,13 @@ enum swiotlb_force {
*/
#define IO_TLB_SHIFT 11
-extern void swiotlb_init(int verbose);
+void __init swiotlb_init_anywhere(int verbose, bool allocate_anywhere);
+
+static inline void swiotlb_init(int verbose)
+{
+ swiotlb_init_anywhere(verbose, false);
+}
+
int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
extern unsigned long swiotlb_nr_tbl(void);
unsigned long swiotlb_size_or_default(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c19379fabd20..27070aa59e34 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -244,7 +244,7 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
* structures for the software IO TLB used to implement the DMA API.
*/
void __init
-swiotlb_init(int verbose)
+swiotlb_init_anywhere(int verbose, bool allocate_anywhere)
{
size_t default_size = IO_TLB_DEFAULT_SIZE;
unsigned char *vstart;
@@ -257,8 +257,12 @@ swiotlb_init(int verbose)
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
- /* Get IO TLB memory from the low pages */
- vstart = memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);
+ if (allocate_anywhere)
+ vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
+ else
+ /* Get IO TLB memory from the low pages */
+ vstart = memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);
+
if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose))
return;
^ permalink raw reply related
* [PATCH v4 0/8] huge vmalloc mappings
From: Nicholas Piggin @ 2020-08-16 9:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
Let's try again.
Thanks,
Nick
Since v3:
- Fixed an off-by-one bug in a loop
- Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail
- Hopefully this time fix the arm64 vmap stack bug, thanks Jonathan
Cameron for debugging the cause of this (hopefully).
Since v2:
- Rebased on vmalloc cleanups, split series into simpler pieces.
- Fixed several compile errors and warnings
- Keep the page array and accounting in small page units because
struct vm_struct is an interface (this should fix x86 vmap stack debug
assert). [Thanks Zefan]
Nicholas Piggin (8):
mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
mm: apply_to_pte_range warn and fail if a large pte is encountered
mm/vmalloc: rename vmap_*_range vmap_pages_*_range
lib/ioremap: rename ioremap_*_range to vmap_*_range
mm: HUGE_VMAP arch support cleanup
mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
mm/vmalloc: add vmap_range_noflush variant
mm/vmalloc: Hugepage vmalloc mappings
.../admin-guide/kernel-parameters.txt | 2 +
arch/arm64/mm/mmu.c | 12 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 10 +-
arch/x86/mm/ioremap.c | 12 +-
include/linux/io.h | 9 -
include/linux/vmalloc.h | 13 +
init/main.c | 1 -
mm/ioremap.c | 231 +--------
mm/memory.c | 60 ++-
mm/vmalloc.c | 445 +++++++++++++++---
10 files changed, 461 insertions(+), 334 deletions(-)
--
2.23.0
^ permalink raw reply
* [PATCH v4 1/8] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
From: Nicholas Piggin @ 2020-08-16 9:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
In-Reply-To: <20200816090904.83947-1-npiggin@gmail.com>
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
This change teaches vmalloc_to_page about larger pages, and returns
the struct page that corresponds to the offset within the large page.
This makes the API agnostic to mapping implementation details.
[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/vmalloc.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b482d240f9a2..49f225b0f855 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -38,6 +38,7 @@
#include <linux/overflow.h>
#include <linux/uaccess.h>
+#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <asm/shmparam.h>
@@ -343,7 +344,9 @@ int is_vmalloc_or_module_addr(const void *x)
}
/*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
*/
struct page *vmalloc_to_page(const void *vmalloc_addr)
{
@@ -363,25 +366,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pgd_none(*pgd))
return NULL;
+ if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+ return NULL; /* XXX: no allowance for huge pgd */
+ if (WARN_ON_ONCE(pgd_bad(*pgd)))
+ return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
- pud = pud_offset(p4d, addr);
+ if (p4d_leaf(*p4d))
+ return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(p4d_bad(*p4d)))
+ return NULL;
- /*
- * Don't dereference bad PUD or PMD (below) entries. This will also
- * identify huge mappings, which we may encounter on architectures
- * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
- * identified as vmalloc addresses by is_vmalloc_addr(), but are
- * not [unambiguously] associated with a struct page, so there is
- * no correct value to return for them.
- */
- WARN_ON_ONCE(pud_bad(*pud));
- if (pud_none(*pud) || pud_bad(*pud))
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return NULL;
+ if (pud_leaf(*pud))
+ return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
- WARN_ON_ONCE(pmd_bad(*pmd));
- if (pmd_none(*pmd) || pmd_bad(*pmd))
+ if (pmd_none(*pmd))
+ return NULL;
+ if (pmd_leaf(*pmd))
+ return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
ptep = pte_offset_map(pmd, addr);
@@ -389,6 +400,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
}
EXPORT_SYMBOL(vmalloc_to_page);
--
2.23.0
^ permalink raw reply related
* [PATCH v4 2/8] mm: apply_to_pte_range warn and fail if a large pte is encountered
From: Nicholas Piggin @ 2020-08-16 9:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
In-Reply-To: <20200816090904.83947-1-npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/memory.c | 60 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 44 insertions(+), 16 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..1d5f3093c249 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2260,13 +2260,20 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
}
do {
next = pmd_addr_end(addr, end);
- if (create || !pmd_none_or_clear_bad(pmd)) {
- err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
- create);
- if (err)
- break;
+ if (pmd_none(*pmd) && !create)
+ continue;
+ if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+ return -EINVAL;
+ if (WARN_ON_ONCE(pmd_bad(*pmd))) {
+ if (!create)
+ continue;
+ pmd_clear_bad(pmd);
}
+ err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create);
+ if (err)
+ break;
} while (pmd++, addr = next, addr != end);
+
return err;
}
@@ -2287,13 +2294,20 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
}
do {
next = pud_addr_end(addr, end);
- if (create || !pud_none_or_clear_bad(pud)) {
- err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
- create);
- if (err)
- break;
+ if (pud_none(*pud) && !create)
+ continue;
+ if (WARN_ON_ONCE(pud_leaf(*pud)))
+ return -EINVAL;
+ if (WARN_ON_ONCE(pud_bad(*pud))) {
+ if (!create)
+ continue;
+ pud_clear_bad(pud);
}
+ err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create);
+ if (err)
+ break;
} while (pud++, addr = next, addr != end);
+
return err;
}
@@ -2314,13 +2328,20 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
}
do {
next = p4d_addr_end(addr, end);
- if (create || !p4d_none_or_clear_bad(p4d)) {
- err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
- create);
- if (err)
- break;
+ if (p4d_none(*p4d) && !create)
+ continue;
+ if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+ return -EINVAL;
+ if (WARN_ON_ONCE(p4d_bad(*p4d))) {
+ if (!create)
+ continue;
+ p4d_clear_bad(p4d);
}
+ err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create);
+ if (err)
+ break;
} while (p4d++, addr = next, addr != end);
+
return err;
}
@@ -2339,8 +2360,15 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
- if (!create && pgd_none_or_clear_bad(pgd))
+ if (pgd_none(*pgd) && !create)
continue;
+ if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+ return -EINVAL;
+ if (WARN_ON_ONCE(pgd_bad(*pgd))) {
+ if (!create)
+ continue;
+ pgd_clear_bad(pgd);
+ }
err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create);
if (err)
break;
--
2.23.0
^ permalink raw reply related
* [PATCH v4 3/8] mm/vmalloc: rename vmap_*_range vmap_pages_*_range
From: Nicholas Piggin @ 2020-08-16 9:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
In-Reply-To: <20200816090904.83947-1-npiggin@gmail.com>
The vmalloc mapper operates on a struct page * array rather than a
linear physical address, re-name it to make this distinction clear.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/vmalloc.c | 28 ++++++++++++----------------
1 file changed, 12 insertions(+), 16 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 49f225b0f855..3a1e45fd1626 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -190,9 +190,8 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
arch_sync_kernel_mappings(start, end);
}
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
{
pte_t *pte;
@@ -218,9 +217,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
-static int vmap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
@@ -230,15 +228,14 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
}
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
@@ -248,15 +245,14 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
}
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
@@ -266,7 +262,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
- if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -307,7 +303,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
- err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+ err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
--
2.23.0
^ permalink raw reply related
* [PATCH v4 4/8] lib/ioremap: rename ioremap_*_range to vmap_*_range
From: Nicholas Piggin @ 2020-08-16 9:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
In-Reply-To: <20200816090904.83947-1-npiggin@gmail.com>
This will be moved to mm/ and used as a generic kernel virtual mapping
function, so re-name it in preparation.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/ioremap.c | 55 ++++++++++++++++++++++------------------------------
1 file changed, 23 insertions(+), 32 deletions(-)
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..6016ae3227ad 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -61,9 +61,8 @@ static inline int ioremap_pud_enabled(void) { return 0; }
static inline int ioremap_pmd_enabled(void) { return 0; }
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
-static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- pgtbl_mod_mask *mask)
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
{
pte_t *pte;
u64 pfn;
@@ -81,9 +80,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
-static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr,
- pgprot_t prot)
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot)
{
if (!ioremap_pmd_enabled())
return 0;
@@ -103,9 +101,8 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
return pmd_set_huge(pmd, phys_addr, prot);
}
-static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- pgtbl_mod_mask *mask)
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
@@ -116,20 +113,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
do {
next = pmd_addr_end(addr, end);
- if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
- if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+ if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
-static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr,
- pgprot_t prot)
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot)
{
if (!ioremap_pud_enabled())
return 0;
@@ -149,9 +145,8 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
return pud_set_huge(pud, phys_addr, prot);
}
-static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- pgtbl_mod_mask *mask)
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
@@ -162,20 +157,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
do {
next = pud_addr_end(addr, end);
- if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
- if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+ if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
-static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr,
- pgprot_t prot)
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot)
{
if (!ioremap_p4d_enabled())
return 0;
@@ -195,9 +189,8 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
return p4d_set_huge(p4d, phys_addr, prot);
}
-static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- pgtbl_mod_mask *mask)
+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
@@ -208,19 +201,18 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
do {
next = p4d_addr_end(addr, end);
- if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
*mask |= PGTBL_P4D_MODIFIED;
continue;
}
- if (ioremap_pud_range(p4d, addr, next, phys_addr, prot, mask))
+ if (vmap_pud_range(p4d, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
-int ioremap_page_range(unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pgd_t *pgd;
unsigned long start;
@@ -235,8 +227,7 @@ int ioremap_page_range(unsigned long addr,
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
- &mask);
+ err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, &mask);
if (err)
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
@@ -272,7 +263,7 @@ void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot)
return NULL;
vaddr = (unsigned long)area->addr;
- if (ioremap_page_range(vaddr, vaddr + size, addr, __pgprot(prot))) {
+ if (vmap_page_range(vaddr, vaddr + size, addr, __pgprot(prot))) {
free_vm_area(area);
return NULL;
}
--
2.23.0
^ permalink raw reply related
* [PATCH v4 5/8] mm: HUGE_VMAP arch support cleanup
From: Nicholas Piggin @ 2020-08-16 9:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
In-Reply-To: <20200816090904.83947-1-npiggin@gmail.com>
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.
This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.
This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/arm64/mm/mmu.c | 12 +--
arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++-
arch/x86/mm/ioremap.c | 12 +--
include/linux/io.h | 9 ---
include/linux/vmalloc.h | 10 +++
init/main.c | 1 -
mm/ioremap.c | 96 +++++++++++-------------
7 files changed, 73 insertions(+), 77 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..bbb3ccf6a7ce 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1304,12 +1304,13 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
return dt_virt;
}
-int __init arch_ioremap_p4d_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot)
{
- return 0;
+ return false;
}
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
{
/*
* Only 4k granule supports level 1 block mappings.
@@ -1319,11 +1320,12 @@ int __init arch_ioremap_pud_supported(void)
!IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
{
- /* See arch_ioremap_pud_supported() */
+ /* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}
+#endif
int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
{
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 28c784976bed..eeb0e8451176 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1134,13 +1134,14 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
}
-int __init arch_ioremap_pud_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_pud_supported(pgprot_t prot)
{
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
}
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
{
return radix_enabled();
}
@@ -1149,6 +1150,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
{
return 0;
}
+#endif
int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
{
@@ -1234,7 +1236,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
return 1;
}
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
{
- return 0;
+ return false;
}
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 84d85dbd1dad..5b8b495ab4ed 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,24 +481,26 @@ void iounmap(volatile void __iomem *addr)
}
EXPORT_SYMBOL(iounmap);
-int __init arch_ioremap_p4d_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot)
{
- return 0;
+ return false;
}
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
{
#ifdef CONFIG_X86_64
return boot_cpu_has(X86_FEATURE_GBPAGES);
#else
- return 0;
+ return false;
#endif
}
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
{
return boot_cpu_has(X86_FEATURE_PSE);
}
+#endif
/*
* Convert a physical pointer to a virtual kernel pointer for /dev/mem
diff --git a/include/linux/io.h b/include/linux/io.h
index 8394c56babc2..f1effd4d7a3c 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -31,15 +31,6 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end,
}
#endif
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-void __init ioremap_huge_init(void);
-int arch_ioremap_p4d_supported(void);
-int arch_ioremap_pud_supported(void);
-int arch_ioremap_pmd_supported(void);
-#else
-static inline void ioremap_huge_init(void) { }
-#endif
-
/*
* Managed iomap interface
*/
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 0221f852a7e1..787d77ad7536 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -84,6 +84,16 @@ struct vmap_area {
};
};
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#else
+static inline bool arch_vmap_p4d_supported(pgprot_t prot) { return false; }
+static inline bool arch_vmap_pud_supported(pgprot_t prot) { return false; }
+static inline bool arch_vmap_pmd_supported(pgprot_t prot) { return false; }
+#endif
+
/*
* Highlevel APIs for driver use
*/
diff --git a/init/main.c b/init/main.c
index ae78fb68d231..1c89aa127b8f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -820,7 +820,6 @@ static void __init mm_init(void)
pgtable_init();
debug_objects_mem_init();
vmalloc_init();
- ioremap_huge_init();
/* Should be run before the first non-init thread is created */
init_espfix_bsp();
/* Should be run after espfix64 is set up. */
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 6016ae3227ad..b0032dbadaf7 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -16,49 +16,16 @@
#include "pgalloc-track.h"
#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static int __read_mostly ioremap_p4d_capable;
-static int __read_mostly ioremap_pud_capable;
-static int __read_mostly ioremap_pmd_capable;
-static int __read_mostly ioremap_huge_disabled;
+static bool __ro_after_init iomap_allow_huge = true;
static int __init set_nohugeiomap(char *str)
{
- ioremap_huge_disabled = 1;
+ iomap_allow_huge = false;
return 0;
}
early_param("nohugeiomap", set_nohugeiomap);
-
-void __init ioremap_huge_init(void)
-{
- if (!ioremap_huge_disabled) {
- if (arch_ioremap_p4d_supported())
- ioremap_p4d_capable = 1;
- if (arch_ioremap_pud_supported())
- ioremap_pud_capable = 1;
- if (arch_ioremap_pmd_supported())
- ioremap_pmd_capable = 1;
- }
-}
-
-static inline int ioremap_p4d_enabled(void)
-{
- return ioremap_p4d_capable;
-}
-
-static inline int ioremap_pud_enabled(void)
-{
- return ioremap_pud_capable;
-}
-
-static inline int ioremap_pmd_enabled(void)
-{
- return ioremap_pmd_capable;
-}
-
-#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
-static inline int ioremap_p4d_enabled(void) { return 0; }
-static inline int ioremap_pud_enabled(void) { return 0; }
-static inline int ioremap_pmd_enabled(void) { return 0; }
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static const bool iomap_allow_huge = false;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
@@ -81,9 +48,12 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}
static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
{
- if (!ioremap_pmd_enabled())
+ if (max_page_shift < PMD_SHIFT)
+ return 0;
+
+ if (!arch_vmap_pmd_supported(prot))
return 0;
if ((end - addr) != PMD_SIZE)
@@ -102,7 +72,8 @@ static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
}
static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
@@ -113,7 +84,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
do {
next = pmd_addr_end(addr, end);
- if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
@@ -125,9 +96,12 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
}
static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
{
- if (!ioremap_pud_enabled())
+ if (max_page_shift < PUD_SHIFT)
+ return 0;
+
+ if (!arch_vmap_pud_supported(prot))
return 0;
if ((end - addr) != PUD_SIZE)
@@ -146,7 +120,8 @@ static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
}
static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
@@ -157,21 +132,24 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
do {
next = pud_addr_end(addr, end);
- if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
- if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+ if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
{
- if (!ioremap_p4d_enabled())
+ if (max_page_shift < P4D_SHIFT)
+ return 0;
+
+ if (!arch_vmap_p4d_supported(prot))
return 0;
if ((end - addr) != P4D_SIZE)
@@ -190,7 +168,8 @@ static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
}
static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
@@ -201,18 +180,19 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
do {
next = p4d_addr_end(addr, end);
- if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
*mask |= PGTBL_P4D_MODIFIED;
continue;
}
- if (vmap_pud_range(p4d, addr, next, phys_addr, prot, mask))
+ if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
return -ENOMEM;
} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
-int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+static int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift)
{
pgd_t *pgd;
unsigned long start;
@@ -227,7 +207,7 @@ int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_a
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, &mask);
+ err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
if (err)
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
@@ -240,6 +220,16 @@ int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_a
return err;
}
+int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+ unsigned int max_page_shift = PAGE_SHIFT;
+
+ if (iomap_allow_huge)
+ max_page_shift = P4D_SHIFT;
+
+ return vmap_range(addr, end, phys_addr, prot, max_page_shift);
+}
+
#ifdef CONFIG_GENERIC_IOREMAP
void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot)
{
--
2.23.0
^ permalink raw reply related
* [PATCH v4 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
From: Nicholas Piggin @ 2020-08-16 9:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
In-Reply-To: <20200816090904.83947-1-npiggin@gmail.com>
This is a generic kernel virtual memory mapper, not specific to ioremap.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
include/linux/vmalloc.h | 2 +
mm/ioremap.c | 192 ----------------------------------------
mm/vmalloc.c | 191 +++++++++++++++++++++++++++++++++++++++
3 files changed, 193 insertions(+), 192 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 787d77ad7536..e3590e93bfff 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
extern struct vm_struct *find_vm_area(const void *addr);
#ifdef CONFIG_MMU
+extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift);
extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index b0032dbadaf7..cdda0e022740 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,198 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
static const bool iomap_allow_huge = false;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
-{
- pte_t *pte;
- u64 pfn;
-
- pfn = phys_addr >> PAGE_SHIFT;
- pte = pte_alloc_kernel_track(pmd, addr, mask);
- if (!pte)
- return -ENOMEM;
- do {
- BUG_ON(!pte_none(*pte));
- set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
- pfn++;
- } while (pte++, addr += PAGE_SIZE, addr != end);
- *mask |= PGTBL_PTE_MODIFIED;
- return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
- if (max_page_shift < PMD_SHIFT)
- return 0;
-
- if (!arch_vmap_pmd_supported(prot))
- return 0;
-
- if ((end - addr) != PMD_SIZE)
- return 0;
-
- if (!IS_ALIGNED(addr, PMD_SIZE))
- return 0;
-
- if (!IS_ALIGNED(phys_addr, PMD_SIZE))
- return 0;
-
- if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
- return 0;
-
- return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
- pgtbl_mod_mask *mask)
-{
- pmd_t *pmd;
- unsigned long next;
-
- pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
- if (!pmd)
- return -ENOMEM;
- do {
- next = pmd_addr_end(addr, end);
-
- if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
- *mask |= PGTBL_PMD_MODIFIED;
- continue;
- }
-
- if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
- return -ENOMEM;
- } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
- return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
- if (max_page_shift < PUD_SHIFT)
- return 0;
-
- if (!arch_vmap_pud_supported(prot))
- return 0;
-
- if ((end - addr) != PUD_SIZE)
- return 0;
-
- if (!IS_ALIGNED(addr, PUD_SIZE))
- return 0;
-
- if (!IS_ALIGNED(phys_addr, PUD_SIZE))
- return 0;
-
- if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
- return 0;
-
- return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
- pgtbl_mod_mask *mask)
-{
- pud_t *pud;
- unsigned long next;
-
- pud = pud_alloc_track(&init_mm, p4d, addr, mask);
- if (!pud)
- return -ENOMEM;
- do {
- next = pud_addr_end(addr, end);
-
- if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
- *mask |= PGTBL_PUD_MODIFIED;
- continue;
- }
-
- if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
- return -ENOMEM;
- } while (pud++, phys_addr += (next - addr), addr = next, addr != end);
- return 0;
-}
-
-static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
- if (max_page_shift < P4D_SHIFT)
- return 0;
-
- if (!arch_vmap_p4d_supported(prot))
- return 0;
-
- if ((end - addr) != P4D_SIZE)
- return 0;
-
- if (!IS_ALIGNED(addr, P4D_SIZE))
- return 0;
-
- if (!IS_ALIGNED(phys_addr, P4D_SIZE))
- return 0;
-
- if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
- return 0;
-
- return p4d_set_huge(p4d, phys_addr, prot);
-}
-
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
- pgtbl_mod_mask *mask)
-{
- p4d_t *p4d;
- unsigned long next;
-
- p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
- if (!p4d)
- return -ENOMEM;
- do {
- next = p4d_addr_end(addr, end);
-
- if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
- *mask |= PGTBL_P4D_MODIFIED;
- continue;
- }
-
- if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
- return -ENOMEM;
- } while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
- return 0;
-}
-
-static int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- unsigned int max_page_shift)
-{
- pgd_t *pgd;
- unsigned long start;
- unsigned long next;
- int err;
- pgtbl_mod_mask mask = 0;
-
- might_sleep();
- BUG_ON(addr >= end);
-
- start = addr;
- pgd = pgd_offset_k(addr);
- do {
- next = pgd_addr_end(addr, end);
- err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
- if (err)
- break;
- } while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
-
- flush_cache_vmap(start, end);
-
- if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
- arch_sync_kernel_mappings(start, end);
-
- return err;
-}
-
int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
unsigned int max_page_shift = PAGE_SHIFT;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3a1e45fd1626..129f10545bb1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -71,6 +71,197 @@ static void free_work(struct work_struct *w)
}
/*** Page table manipulation functions ***/
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+{
+ pte_t *pte;
+ u64 pfn;
+
+ pfn = phys_addr >> PAGE_SHIFT;
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
+ if (!pte)
+ return -ENOMEM;
+ do {
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
+ pfn++;
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
+ return 0;
+}
+
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+ if (max_page_shift < PMD_SHIFT)
+ return 0;
+
+ if (!arch_vmap_pmd_supported(prot))
+ return 0;
+
+ if ((end - addr) != PMD_SIZE)
+ return 0;
+
+ if (!IS_ALIGNED(addr, PMD_SIZE))
+ return 0;
+
+ if (!IS_ALIGNED(phys_addr, PMD_SIZE))
+ return 0;
+
+ if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
+ return 0;
+
+ return pmd_set_huge(pmd, phys_addr, prot);
+}
+
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
+{
+ pmd_t *pmd;
+ unsigned long next;
+
+ pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
+ if (!pmd)
+ return -ENOMEM;
+ do {
+ next = pmd_addr_end(addr, end);
+
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
+ *mask |= PGTBL_PMD_MODIFIED;
+ continue;
+ }
+
+ if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+ return -ENOMEM;
+ } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
+ return 0;
+}
+
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+ if (max_page_shift < PUD_SHIFT)
+ return 0;
+
+ if (!arch_vmap_pud_supported(prot))
+ return 0;
+
+ if ((end - addr) != PUD_SIZE)
+ return 0;
+
+ if (!IS_ALIGNED(addr, PUD_SIZE))
+ return 0;
+
+ if (!IS_ALIGNED(phys_addr, PUD_SIZE))
+ return 0;
+
+ if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
+ return 0;
+
+ return pud_set_huge(pud, phys_addr, prot);
+}
+
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
+{
+ pud_t *pud;
+ unsigned long next;
+
+ pud = pud_alloc_track(&init_mm, p4d, addr, mask);
+ if (!pud)
+ return -ENOMEM;
+ do {
+ next = pud_addr_end(addr, end);
+
+ if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
+ *mask |= PGTBL_PUD_MODIFIED;
+ continue;
+ }
+
+ if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
+ return -ENOMEM;
+ } while (pud++, phys_addr += (next - addr), addr = next, addr != end);
+ return 0;
+}
+
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+ if (max_page_shift < P4D_SHIFT)
+ return 0;
+
+ if (!arch_vmap_p4d_supported(prot))
+ return 0;
+
+ if ((end - addr) != P4D_SIZE)
+ return 0;
+
+ if (!IS_ALIGNED(addr, P4D_SIZE))
+ return 0;
+
+ if (!IS_ALIGNED(phys_addr, P4D_SIZE))
+ return 0;
+
+ if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
+ return 0;
+
+ return p4d_set_huge(p4d, phys_addr, prot);
+}
+
+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+
+ if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
+ *mask |= PGTBL_P4D_MODIFIED;
+ continue;
+ }
+
+ if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
+ return -ENOMEM;
+ } while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
+ return 0;
+}
+
+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift)
+{
+ pgd_t *pgd;
+ unsigned long start;
+ unsigned long next;
+ int err;
+ pgtbl_mod_mask mask = 0;
+
+ might_sleep();
+ BUG_ON(addr >= end);
+
+ start = addr;
+ pgd = pgd_offset_k(addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
+ if (err)
+ break;
+ } while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
+
+ flush_cache_vmap(start, end);
+
+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
+
+ return err;
+}
static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pgtbl_mod_mask *mask)
--
2.23.0
^ permalink raw reply related
* [PATCH v4 7/8] mm/vmalloc: add vmap_range_noflush variant
From: Nicholas Piggin @ 2020-08-16 9:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
In-Reply-To: <20200816090904.83947-1-npiggin@gmail.com>
As a side-effect, the order of flush_cache_vmap() and
arch_sync_kernel_mappings() calls are switched, but that now matches
the other callers in this file.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/vmalloc.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 129f10545bb1..4e5cb7c7f780 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -234,8 +234,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
return 0;
}
-int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- unsigned int max_page_shift)
+static int vmap_range_noflush(unsigned long addr, unsigned long end, phys_addr_t phys_addr,
+ pgprot_t prot, unsigned int max_page_shift)
{
pgd_t *pgd;
unsigned long start;
@@ -255,14 +255,23 @@ int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgp
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
- flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
return err;
}
+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift)
+{
+ int err;
+
+ err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+ flush_cache_vmap(addr, end);
+
+ return err;
+}
+
static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pgtbl_mod_mask *mask)
{
--
2.23.0
^ permalink raw reply related
* [PATCH v4 8/8] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2020-08-16 9:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H. Peter Anvin, Thomas Gleixner, Will Deacon,
linux-arm-kernel
In-Reply-To: <20200816090904.83947-1-npiggin@gmail.com>
On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps,
vmalloc will attempt to allocate PMD-sized pages first, before falling
back to small pages.
Allocations which use something other than PAGE_KERNEL protections are
not permitted to use huge pages yet, not all callers expect this (e.g.,
module allocations vs strict module rwx).
This reduces TLB misses by nearly 30x on a `git diff` workload on a
2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
This can result in more internal fragmentation and memory overhead for a
given allocation, an option nohugevmap is added to disable at boot.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
.../admin-guide/kernel-parameters.txt | 2 +
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 177 +++++++++++++-----
3 files changed, 137 insertions(+), 43 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 98ea67f27809..eaef176c597f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,8 @@
nohugeiomap [KNL,x86,PPC] Disable kernel huge I/O mappings.
+ nohugevmap [KNL,x86,PPC] Disable kernel huge vmalloc mappings.
+
nosmt [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e3590e93bfff..8f25dbaca0a1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,7 @@ struct vm_struct {
unsigned long size;
unsigned long flags;
struct page **pages;
+ unsigned int page_order;
unsigned int nr_pages;
phys_addr_t phys_addr;
const void *caller;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4e5cb7c7f780..c3595d87261c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -45,6 +45,19 @@
#include "internal.h"
#include "pgalloc-track.h"
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmap(char *str)
+{
+ vmap_allow_huge = false;
+ return 0;
+}
+early_param("nohugevmap", set_nohugevmap);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static const bool vmap_allow_huge = false;
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
bool is_vmalloc_addr(const void *x)
{
unsigned long addr = (unsigned long)x;
@@ -468,31 +481,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long en
return 0;
}
-/**
- * map_kernel_range_noflush - map kernel VM area with the specified pages
- * @addr: start of the VM area to map
- * @size: size of the VM area to map
- * @prot: page protection flags to use
- * @pages: pages to map
- *
- * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should
- * have been allocated using get_vm_area() and its friends.
- *
- * NOTE:
- * This function does NOT do any cache flushing. The caller is responsible for
- * calling flush_cache_vmap() on to-be-mapped areas before calling this
- * function.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int map_kernel_range_noflush(unsigned long addr, unsigned long size,
- pgprot_t prot, struct page **pages)
+static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages)
{
unsigned long start = addr;
- unsigned long end = addr + size;
- unsigned long next;
pgd_t *pgd;
+ unsigned long next;
int err = 0;
int nr = 0;
pgtbl_mod_mask mask = 0;
@@ -514,6 +508,65 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
return 0;
}
+static int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+ WARN_ON(page_shift < PAGE_SHIFT);
+
+ if (page_shift == PAGE_SHIFT) {
+ return vmap_small_pages_range_noflush(addr, end, prot, pages);
+ } else {
+ unsigned int i, nr = (end - addr) >> page_shift;
+
+ for (i = 0; i < nr; i++) {
+ int err;
+
+ err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+ __pa(page_address(pages[i])), prot, page_shift);
+ if (err)
+ return err;
+
+ addr += 1UL << page_shift;
+ }
+
+ return 0;
+ }
+}
+
+static int vmap_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+ int err;
+
+ err = vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
+ flush_cache_vmap(addr, end);
+ return err;
+}
+
+/**
+ * map_kernel_range_noflush - map kernel VM area with the specified pages
+ * @addr: start of the VM area to map
+ * @size: size of the VM area to map
+ * @prot: page protection flags to use
+ * @pages: pages to map
+ *
+ * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should
+ * have been allocated using get_vm_area() and its friends.
+ *
+ * NOTE:
+ * This function does NOT do any cache flushing. The caller is responsible for
+ * calling flush_cache_vmap() on to-be-mapped areas before calling this
+ * function.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int map_kernel_range_noflush(unsigned long addr, unsigned long size,
+ pgprot_t prot, struct page **pages)
+{
+ return vmap_pages_range_noflush(addr, addr + size, prot, pages, PAGE_SHIFT);
+}
+
int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
struct page **pages)
{
@@ -2274,9 +2327,11 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
if (unlikely(!size))
return NULL;
- if (flags & VM_IOREMAP)
- align = 1ul << clamp_t(int, get_count_order_long(size),
- PAGE_SHIFT, IOREMAP_MAX_ORDER);
+ if (flags & VM_IOREMAP) {
+ align = max(align,
+ 1ul << clamp_t(int, get_count_order_long(size),
+ PAGE_SHIFT, IOREMAP_MAX_ORDER));
+ }
area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
if (unlikely(!area))
@@ -2471,11 +2526,11 @@ static void __vunmap(const void *addr, int deallocate_pages)
if (deallocate_pages) {
int i;
- for (i = 0; i < area->nr_pages; i++) {
+ for (i = 0; i < area->nr_pages; i += 1 << area->page_order) {
struct page *page = area->pages[i];
BUG_ON(!page);
- __free_pages(page, 0);
+ __free_pages(page, area->page_order);
}
atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
@@ -2614,9 +2669,12 @@ void *vmap(struct page **pages, unsigned int count,
EXPORT_SYMBOL(vmap);
static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
- pgprot_t prot, int node)
+ pgprot_t prot, unsigned int page_shift, int node)
{
struct page **pages;
+ unsigned long addr = (unsigned long)area->addr;
+ unsigned long size = get_vm_area_size(area);
+ unsigned int page_order = page_shift - PAGE_SHIFT;
unsigned int nr_pages, array_size, i;
const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
@@ -2624,7 +2682,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
0 :
__GFP_HIGHMEM;
- nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
+ nr_pages = size >> PAGE_SHIFT;
array_size = (nr_pages * sizeof(struct page *));
/* Please note that the recursion is strictly bounded. */
@@ -2643,29 +2701,29 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
area->pages = pages;
area->nr_pages = nr_pages;
+ area->page_order = page_order;
- for (i = 0; i < area->nr_pages; i++) {
+ for (i = 0; i < area->nr_pages; i += 1 << page_order) {
struct page *page;
+ int p;
- if (node == NUMA_NO_NODE)
- page = alloc_page(alloc_mask|highmem_mask);
- else
- page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
-
+ page = alloc_pages_node(node, alloc_mask|highmem_mask, page_order);
if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vunmap() */
area->nr_pages = i;
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
goto fail;
}
- area->pages[i] = page;
+
+ for (p = 0; p < (1 << page_order); p++)
+ area->pages[i + p] = page + p;
+
if (gfpflags_allow_blocking(gfp_mask))
cond_resched();
}
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
- if (map_kernel_range((unsigned long)area->addr, get_vm_area_size(area),
- prot, pages) < 0)
+ if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0)
goto fail;
return area->addr;
@@ -2701,22 +2759,45 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller)
{
- struct vm_struct *area;
+ struct vm_struct *area = NULL;
void *addr;
unsigned long real_size = size;
+ unsigned long real_align = align;
+ unsigned int shift = PAGE_SHIFT;
size = PAGE_ALIGN(size);
if (!size || (size >> PAGE_SHIFT) > totalram_pages())
goto fail;
- area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED |
+ if (vmap_allow_huge && (pgprot_val(prot) == pgprot_val(PAGE_KERNEL))) {
+ unsigned long size_per_node;
+
+ /*
+ * Try huge pages. Only try for PAGE_KERNEL allocations,
+ * others like modules don't yet expect huge pages in
+ * their allocations due to apply_to_page_range not
+ * supporting them.
+ */
+
+ size_per_node = size;
+ if (node == NUMA_NO_NODE)
+ size_per_node /= num_online_nodes();
+ if (size_per_node >= PMD_SIZE) {
+ shift = PMD_SHIFT;
+ align = max(real_align, 1UL << shift);
+ size = ALIGN(real_size, 1UL << shift);
+ }
+ }
+
+again:
+ area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
vm_flags, start, end, node, gfp_mask, caller);
if (!area)
goto fail;
- addr = __vmalloc_area_node(area, gfp_mask, prot, node);
+ addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
if (!addr)
- return NULL;
+ goto fail;
/*
* In this function, newly allocated vm_struct has VM_UNINITIALIZED
@@ -2730,8 +2811,18 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
return addr;
fail:
- warn_alloc(gfp_mask, NULL,
+ if (shift > PAGE_SHIFT) {
+ shift = PAGE_SHIFT;
+ align = real_align;
+ size = real_size;
+ goto again;
+ }
+
+ if (!area) {
+ /* Warn for area allocation, page allocations already warn */
+ warn_alloc(gfp_mask, NULL,
"vmalloc: allocation failure: %lu bytes", real_size);
+ }
return NULL;
}
--
2.23.0
^ permalink raw reply related
* Re: fsl_espi errors on v5.7.15
From: Chris Packham @ 2020-08-16 21:09 UTC (permalink / raw)
To: Heiner Kallweit, broonie@kernel.org, mpe@ellerman.id.au,
benh@kernel.crashing.org, paulus@samba.org,
tiago.brusamarello@datacom.ind.br
Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
linux-spi@vger.kernel.org
In-Reply-To: <c43a23bd-33ec-4ef2-2ca5-730342248db3@gmail.com>
On 14/08/20 6:19 pm, Heiner Kallweit wrote:
> On 14.08.2020 04:48, Chris Packham wrote:
>> Hi,
>>
>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>> based system to linux v5.7.15
>>
>> For this board u-boot and the u-boot environment live on spi-nor.
>>
>> When I use fw_setenv from userspace I get the following kernel logs
>>
>> # fw_setenv foo=1
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> ...
>>
> This error reporting doesn't exist yet in 4.4. So you may have an issue
> under 4.4 too, it's just not reported.
> Did you verify that under 4.4 fw_setenv actually has an effect?
Just double checked and yes under 4.4 the setting does get saved.
>> If I run fw_printenv (before getting it into a bad state) it is able to
>> display the content of the boards u-boot environment.
>>
> This might indicate an issue with spi being locked. I've seen related
> questions, just use the search engine of your choice and check for
> fw_setenv and locked.
I'm running a version of fw_setenv which includes
https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't
be locking things unnecessarily.
>> If been unsuccessful in producing a setup for bisecting the issue. I do
>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>> probably not much help.
>>
>> Any pointers on what the issue (and/or solution) might be.
>>
>> Thanks,
>> Chris
>>
> Heiner
^ permalink raw reply
* [PATCH 2/2] powerpc: Add POWER10 raw mode cputable entry
From: Madhavan Srinivasan @ 2020-08-17 0:56 UTC (permalink / raw)
To: mpe; +Cc: Madhavan Srinivasan, linuxppc-dev
In-Reply-To: <20200817005618.3305028-1-maddy@linux.ibm.com>
Add a raw mode cputable entry for POWER10. Copies most of the fields
from commit a3ea40d5c736 ("powerpc: Add POWER10 architected mode")
except for oprofile_cpu_type, machine_check_early, pvr_mask and pvr_mask
fields. On bare metal systems we use DT CPU features, which doesn't need a
cputable entry. But in VMs we still rely on the raw cputable entry to
set the correct values for the PMU related fields.
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
---
arch/powerpc/kernel/cputable.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 4ca9cf556be92..2aa89c6b28967 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -539,6 +539,25 @@ static struct cpu_spec __initdata cpu_specs[] = {
.machine_check_early = __machine_check_early_realmode_p9,
.platform = "power9",
},
+ { /* Power10 */
+ .pvr_mask = 0xffff0000,
+ .pvr_value = 0x00800000,
+ .cpu_name = "POWER10 (raw)",
+ .cpu_features = CPU_FTRS_POWER10,
+ .cpu_user_features = COMMON_USER_POWER10,
+ .cpu_user_features2 = COMMON_USER2_POWER10,
+ .mmu_features = MMU_FTRS_POWER10,
+ .icache_bsize = 128,
+ .dcache_bsize = 128,
+ .num_pmcs = 6,
+ .pmc_type = PPC_PMC_IBM,
+ .oprofile_cpu_type = "ppc64/power10",
+ .oprofile_type = PPC_OPROFILE_INVALID,
+ .cpu_setup = __setup_cpu_power10,
+ .cpu_restore = __restore_cpu_power10,
+ .machine_check_early = __machine_check_early_realmode_p10,
+ .platform = "power10",
+ },
{ /* Cell Broadband Engine */
.pvr_mask = 0xffff0000,
.pvr_value = 0x00700000,
--
2.26.2
^ permalink raw reply related
* [PATCH 1/2] powerpc/kernel/cputable: cleanup the function declarations
From: Madhavan Srinivasan @ 2020-08-17 0:56 UTC (permalink / raw)
To: mpe; +Cc: Madhavan Srinivasan, linuxppc-dev
__machine_check_early_realmode_p*() are currently declared
as extern in cputable.c and because of this when compiled
with "C=1" (which enables semantic checker) produces these
warnings.
CHECK arch/powerpc/kernel/mce_power.c
arch/powerpc/kernel/mce_power.c:709:6: warning: symbol '__machine_check_early_realmode_p7' was not declared. Should it be static?
arch/powerpc/kernel/mce_power.c:717:6: warning: symbol '__machine_check_early_realmode_p8' was not declared. Should it be static?
arch/powerpc/kernel/mce_power.c:722:6: warning: symbol '__machine_check_early_realmode_p9' was not declared. Should it be static?
arch/powerpc/kernel/mce_power.c:740:6: warning: symbol '__machine_check_early_realmode_p10' was not declared. Should it be static?
Patch here moves the declaration to asm/mce.h
and includes the same in cputable.c
Fixes: ae744f3432d3 ("powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on power8")
Fixes: 7b9f71f974a1 ("powerpc/64s: POWER9 machine check handler")
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
---
arch/powerpc/include/asm/cputable.h | 5 +++++
arch/powerpc/include/asm/mce.h | 7 +++++++
arch/powerpc/kernel/cputable.c | 3 ---
arch/powerpc/kernel/dt_cpu_ftrs.c | 4 ----
4 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index fdddb822d564b..e005b45810232 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -9,6 +9,11 @@
#ifndef __ASSEMBLY__
+/*
+ * Added to include __machine_check_early_realmode_* functions
+ */
+#include <asm/mce.h>
+
/* This structure can grow, it's real size is used by head.S code
* via the mkdefs mechanism.
*/
diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index adf2cda67f9a4..89aa8248a57dd 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -210,6 +210,9 @@ struct mce_error_info {
#define MCE_EVENT_RELEASE true
#define MCE_EVENT_DONTRELEASE false
+struct pt_regs;
+struct notifier_block;
+
extern void save_mce_event(struct pt_regs *regs, long handled,
struct mce_error_info *mce_err, uint64_t nip,
uint64_t addr, uint64_t phys_addr);
@@ -225,5 +228,9 @@ int mce_register_notifier(struct notifier_block *nb);
int mce_unregister_notifier(struct notifier_block *nb);
#ifdef CONFIG_PPC_BOOK3S_64
void flush_and_reload_slb(void);
+long __machine_check_early_realmode_p7(struct pt_regs *regs);
+long __machine_check_early_realmode_p8(struct pt_regs *regs);
+long __machine_check_early_realmode_p9(struct pt_regs *regs);
+long __machine_check_early_realmode_p10(struct pt_regs *regs);
#endif /* CONFIG_PPC_BOOK3S_64 */
#endif /* __ASM_PPC64_MCE_H__ */
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 3d406a9626e86..4ca9cf556be92 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -72,9 +72,6 @@ extern void __setup_cpu_power9(unsigned long offset, struct cpu_spec* spec);
extern void __restore_cpu_power9(void);
extern void __setup_cpu_power10(unsigned long offset, struct cpu_spec* spec);
extern void __restore_cpu_power10(void);
-extern long __machine_check_early_realmode_p7(struct pt_regs *regs);
-extern long __machine_check_early_realmode_p8(struct pt_regs *regs);
-extern long __machine_check_early_realmode_p9(struct pt_regs *regs);
#endif /* CONFIG_PPC64 */
#if defined(CONFIG_E500)
extern void __setup_cpu_e5500(unsigned long offset, struct cpu_spec* spec);
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 6f8c0c6b937a1..8dc46f38680b2 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -64,10 +64,6 @@ struct dt_cpu_feature {
* Set up the base CPU
*/
-extern long __machine_check_early_realmode_p8(struct pt_regs *regs);
-extern long __machine_check_early_realmode_p9(struct pt_regs *regs);
-extern long __machine_check_early_realmode_p10(struct pt_regs *regs);
-
static int hv_mode;
static struct {
--
2.26.2
^ permalink raw reply related
* Re: [PATCH v2 2/5] powerpc/lib: Initialize a temporary mm for code patching
From: Christopher M. Riedl @ 2020-08-17 2:21 UTC (permalink / raw)
To: Daniel Axtens, linuxppc-dev; +Cc: kernel-hardening
In-Reply-To: <87lfiso3rk.fsf@dja-thinkpad.axtens.net>
On Thu Aug 6, 2020 at 8:24 AM CDT, Daniel Axtens wrote:
> "Christopher M. Riedl" <cmr@informatik.wtf> writes:
>
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped with permissive memory
> > protections. Currently, a per-cpu vmalloc patch area is used for this
> > purpose. While the patch area is per-cpu, the temporary page mapping is
> > inserted into the kernel page tables for the duration of the patching.
> > The mapping is exposed to CPUs other than the patching CPU - this is
> > undesirable from a hardening perspective.
> >
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > portion. The next patch uses the temporary mm and patching address for
> > code patching.
> >
> > Based on x86 implementation:
> >
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> >
> > Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
> > ---
> > arch/powerpc/lib/code-patching.c | 33 ++++++++++++++++++++++++++++++++
> > 1 file changed, 33 insertions(+)
> >
> > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> > index 0a051dfeb177..8ae1a9e5fe6e 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -11,6 +11,8 @@
> > #include <linux/cpuhotplug.h>
> > #include <linux/slab.h>
> > #include <linux/uaccess.h>
> > +#include <linux/sched/task.h>
> > +#include <linux/random.h>
> >
> > #include <asm/tlbflush.h>
> > #include <asm/page.h>
> > @@ -44,6 +46,37 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
> > }
> >
> > #ifdef CONFIG_STRICT_KERNEL_RWX
> > +
> > +static struct mm_struct *patching_mm __ro_after_init;
> > +static unsigned long patching_addr __ro_after_init;
> > +
> > +void __init poking_init(void)
> > +{
> > + spinlock_t *ptl; /* for protecting pte table */
> > + pte_t *ptep;
> > +
> > + /*
> > + * Some parts of the kernel (static keys for example) depend on
> > + * successful code patching. Code patching under STRICT_KERNEL_RWX
> > + * requires this setup - otherwise we cannot patch at all. We use
> > + * BUG_ON() here and later since an early failure is preferred to
> > + * buggy behavior and/or strange crashes later.
> > + */
> > + patching_mm = copy_init_mm();
> > + BUG_ON(!patching_mm);
> > +
> > + /*
> > + * In hash we cannot go above DEFAULT_MAP_WINDOW easily.
> > + * XXX: Do we want additional bits of entropy for radix?
> > + */
> > + patching_addr = (get_random_long() & PAGE_MASK) %
> > + (DEFAULT_MAP_WINDOW - PAGE_SIZE);
>
> It took me a while to understand this calculation. I see that it's
> calculating a base address for a page in which to do patching. It does
> the following:
I will add a comment explaining the calulcation in the next spin.
>
> - get a random long
>
> - mask with PAGE_MASK so as to get a page aligned value
>
> - make sure that the base address is at least one PAGE_SIZE below
> DEFAULT_MAP_WINDOW so we have a clear page between the base and
> DEFAULT_MAP_WINDOW.
>
> On 64-bit Book3S with 64K pages, that works out to be
>
> PAGE_SIZE = 0x0000 0000 0001 0000
> PAGE_MASK = 0xFFFF FFFF FFFF 0000
>
> DEFAULT_MAP_WINDOW = DEFAULT_MAP_WINDOW_USER64 = TASK_SIZE_128TB
> = 0x0000_8000_0000_0000
>
> DEFAULT_MAP_WINDOW - PAGE_SIZE = 0x0000 7FFF FFFF 0000
>
> It took a while (and a conversation with my wife who studied pure
> maths!) but I am convinced that the modulo preserves the page-alignement
> of the patching address.
I am glad a proper mathematician agrees because my maths are decidedly
unpure :)
>
> One thing I did realise is that patching_addr can be zero at the end of
> this process. That seems dubious and slightly error-prone to me - is
> the patching process robust to that or should we exclude it?
Good catch! I will fix this in the next spin.
>
> Anyway, if I have the maths right, that there are 0x7fffffff or ~2
> billion possible locations for the patching page, which is just shy of
> 31 bits of entropy.
>
> I think this compares pretty favourably to most (K)ASLR implementations?
I will stress that I am not an expert here, but it looks like this does
compares favorably against other 64b ASLR [0].
[0]: https://www.cs.ucdavis.edu/~peisert/research/2017-SecDev-AnalysisASLR.pdf
>
> What's the range if built with 4k pages?
Using the formula from my series coverletter, we should expect 34 bits
of entropy since DEFAULT_MAP_WINDOW_USER64 is 64TB for 4K pages:
bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
PAGE_SIZE=4K, DEFAULT_MAP_WINDOW_USER64=64TB
bits of entropy = log2(64TB / 4K)
bits of entropy = 34
>
> Kind regards,
> Daniel
>
> > +
> > + ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > + BUG_ON(!ptep);
> > + pte_unmap_unlock(ptep, ptl);
> > +}
> > +
> > static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >
> > static int text_area_cpu_up(unsigned int cpu)
> > --
> > 2.27.0
^ permalink raw reply
* Re: [PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Michael Ellerman @ 2020-08-17 3:45 UTC (permalink / raw)
To: Nathan Lynch, Tyrel Datwyler, linuxppc-dev; +Cc: cheloha, ldufour
In-Reply-To: <874kp7z73m.fsf@linux.ibm.com>
Nathan Lynch <nathanl@linux.ibm.com> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>>> On 8/11/20 6:20 PM, Nathan Lynch wrote:
>>>>
>>>> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
>>>> +{
>>>> + const unsigned int resched_interval = 20;
>>>> +
>>>> + BUG_ON(lmb < drmem_info->lmbs);
>>>> + BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs);
>>>
>>> I think BUG_ON is a pretty big no-no these days unless there is no other option.
>>
>> It's complicated, but yes we would like to avoid adding them if we can.
>>
>> In a case like this there is no other option, *if* the check has to be
>> in drmem_lmb_next().
>>
>> But I don't think we really need to check that there.
>>
>> If for some reason this was called with an *lmb pointing outside of the
>> lmbs array it would confuse the cond_resched() logic, but it's not worth
>> crashing the box for that IMHO.
>
> The BUG_ONs are pretty much orthogonal to the cond_resched().
Yes that was kind of my point. We don't need them to implement the
cond_resched() logic.
> It's not apparent from the context of the change, but some users of the
> for_each_drmem_lmb* macros modify elements of the drmem_info->lmbs
> array. If the lmb passed to drmem_lmb_next() violates the bounds check
> (say, if the callsite inappropriately modifies it within the loop), such
> users are guaranteed to corrupt other objects in memory. This was my
> thinking in adding the BUG_ONs, and I don't see another place to do
> it.
If it's really something we're worried about, I think we'd be better off
putting checks for that in the code that's doing those modifications.
That way you have enough context to do something more nuanced than a
BUG(), ie. you can print a useful message and fail whatever operation is
in progress.
If we did that then we could also add those BUG_ONs() as a safety net.
What you really want is a way for the for_each_xxx() construct to
express that there was an error traversing the list, but there isn't
really a nice way to do that in C.
cheers
^ permalink raw reply
* Re: [PATCH v2 1/5] powerpc/mm: Introduce temporary mm
From: Christopher M. Riedl @ 2020-08-17 5:16 UTC (permalink / raw)
To: Daniel Axtens, linuxppc-dev; +Cc: kernel-hardening
In-Reply-To: <87o8noo96l.fsf@dja-thinkpad.axtens.net>
On Thu Aug 6, 2020 at 6:27 AM CDT, Daniel Axtens wrote:
> Hi Chris,
>
> > void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> > bool ppc_breakpoint_available(void);
> > #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> > extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > index 1a474f6b1992..9269c7c7b04e 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -10,6 +10,7 @@
> > #include <asm/mmu.h>
> > #include <asm/cputable.h>
> > #include <asm/cputhreads.h>
> > +#include <asm/debug.h>
> >
> > /*
> > * Most if the context management is out of line
> > @@ -300,5 +301,68 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
> > return 0;
> > }
> >
> > +struct temp_mm {
> > + struct mm_struct *temp;
> > + struct mm_struct *prev;
> > + bool is_kernel_thread;
> > + struct arch_hw_breakpoint brk[HBP_NUM_MAX];
> > +};
>
> This is on the nitpicky end, but I wonder if this should be named
> temp_mm, or should be labelled something else to capture its broader
> purpose as a context for code patching? I'm thinking that a store of
> breakpoints is perhaps unusual in a memory-managment structure?
>
> I don't have a better suggestion off the top of my head and I'm happy
> for you to leave it, I just wanted to flag it as a possible way we could
> be clearer.
First of all thank you for the review!
I had actually planned to move all this code into lib/code-patching.c
directly (and it turns out that's what x86 ended up doing as well).
>
> > +
> > +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
> > +{
> > + temp_mm->temp = mm;
> > + temp_mm->prev = NULL;
> > + temp_mm->is_kernel_thread = false;
> > + memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> > +}
> > +
> > +static inline void use_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > + lockdep_assert_irqs_disabled();
> > +
> > + temp_mm->is_kernel_thread = current->mm == NULL;
> > + if (temp_mm->is_kernel_thread)
> > + temp_mm->prev = current->active_mm;
>
> You don't seem to restore active_mm below. I don't know what active_mm
> does, so I don't know if this is a problem.
For kernel threads 'current->mm' is NULL since a kthread does not need
a userspace mm; however they still need a mm so they "borrow" one which
is indicated by 'current->active_mm'.
'current->mm' needs to be restored because Hash requires a non-NULL
value when handling a page fault and so 'current->mm' gets set to the
temp_mm. This is a special case for kernel threads and Hash translation.
>
> > + else
> > + temp_mm->prev = current->mm;
> > +
> > + /*
> > + * Hash requires a non-NULL current->mm to allocate a userspace address
> > + * when handling a page fault. Does not appear to hurt in Radix either.
> > + */
> > + current->mm = temp_mm->temp;
> > + switch_mm_irqs_off(NULL, temp_mm->temp, current);
> > +
> > + if (ppc_breakpoint_available()) {
>
> I wondered if this could be changed during a text-patching operation.
> AIUI, it potentially can on a P9 via "dawr_enable_dangerous" in debugfs.
>
> I don't know if that's a problem. My concern is that you could turn off
> breakpoints, call 'use_temporary_mm', then turn them back on again
> before 'unuse_temporary_mm' and get a breakpoint while that can access
> the temporary mm. Is there something else that makes that safe?
> disabling IRQs maybe?
Hmm, I will have to investigate this more. I'm not sure if there is a
better way to just completely disable breakpoints while the temporary mm
is in use.
>
> > + struct arch_hw_breakpoint null_brk = {0};
> > + int i = 0;
> > +
> > + for (; i < nr_wp_slots(); ++i) {
>
> super nitpicky, and I'm not sure if this is actually documented, but I'd
> usually see this written as:
>
> for (i = 0; i < nr_wp_slots(); i++) {
>
> Not sure if there's any reason that it _shouldn't_ be written the way
> you've written it (and I do like initialising the variable when it's
> defined!), I'm just not used to it. (Likewise with the unuse function.)
>
I've found other places (even in arch/powerpc!) where this is done so I
think it's fine. I prefer using this style when the variable
declaration and initialization are "close" to the loop statement.
> > + __get_breakpoint(i, &temp_mm->brk[i]);
> > + if (temp_mm->brk[i].type != 0)
> > + __set_breakpoint(i, &null_brk);
> > + }
> > + }
> > +}
> > +
>
> Kind regards,
> Daniel
>
> > +static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > + lockdep_assert_irqs_disabled();
> > +
> > + if (temp_mm->is_kernel_thread)
> > + current->mm = NULL;
> > + else
> > + current->mm = temp_mm->prev;
> > + switch_mm_irqs_off(NULL, temp_mm->prev, current);
> > +
> > + if (ppc_breakpoint_available()) {
> > + int i = 0;
> > +
> > + for (; i < nr_wp_slots(); ++i)
> > + if (temp_mm->brk[i].type != 0)
> > + __set_breakpoint(i, &temp_mm->brk[i]);
> > + }
> > +}
> > +
> > #endif /* __KERNEL__ */
> > #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 4650b9bb217f..b6c123bf5edd 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -824,6 +824,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
> > return 0;
> > }
> >
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> > +{
> > + memcpy(brk, this_cpu_ptr(¤t_brk[nr]), sizeof(*brk));
> > +}
> > +
> > void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> > {
> > memcpy(this_cpu_ptr(¤t_brk[nr]), brk, sizeof(*brk));
> > --
> > 2.27.0
^ permalink raw reply
* [PATCH v1] powerpc/process: Replace an #ifdef CONFIG_PPC_47x by IS_ENABLED()
From: Christophe Leroy @ 2020-08-17 5:46 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
Cc: linuxppc-dev, linux-kernel
isync() is always defined, no need for an #ifdef.
Replace it by IS_ENABLED(CONFIG_PPC_47x).
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/kernel/process.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index e86e28c28259..fd3e11663613 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -777,9 +777,8 @@ static void switch_hw_breakpoint(struct task_struct *new)
static inline int __set_dabr(unsigned long dabr, unsigned long dabrx)
{
mtspr(SPRN_DAC1, dabr);
-#ifdef CONFIG_PPC_47x
- isync();
-#endif
+ if (IS_ENABLED(CONFIG_PPC_47x))
+ isync();
return 0;
}
#elif defined(CONFIG_PPC_BOOK3S)
--
2.25.0
^ permalink raw reply related
* [PATCH v1] powerpc/process: Replace #ifdef CONFIG_KALLSYMS by IS_ENABLED()
From: Christophe Leroy @ 2020-08-17 5:46 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
Cc: linuxppc-dev, linux-kernel
The #ifdef CONFIG_KALLSYMS encloses some printk which can
compile in all cases.
Replace by IS_ENABLED().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/kernel/process.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 087711aa0126..987bc8e73d5e 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1469,14 +1469,14 @@ void show_regs(struct pt_regs * regs)
break;
}
pr_cont("\n");
-#ifdef CONFIG_KALLSYMS
/*
* Lookup NIP late so we have the best change of getting the
* above info out without failing
*/
- printk("NIP ["REG"] %pS\n", regs->nip, (void *)regs->nip);
- printk("LR ["REG"] %pS\n", regs->link, (void *)regs->link);
-#endif
+ if (IS_ENABLED(CONFIG_KALLSYMS)) {
+ printk("NIP ["REG"] %pS\n", regs->nip, (void *)regs->nip);
+ printk("LR ["REG"] %pS\n", regs->link, (void *)regs->link);
+ }
show_stack(current, (unsigned long *) regs->gpr[1], KERN_DEFAULT);
if (!user_mode(regs))
show_instructions(regs);
--
2.25.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox