* [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
From: Guohua Zhong @ 2020-08-20 13:10 UTC (permalink / raw)
To: paulus, mpe, benh, gregkh
Cc: nixiaoming, wangle6, linuxppc-dev, linux-kernel, stable
When cat /proc/pid/stat, do_task_stat will call into cputime_adjust,
which call stack is like this:
[17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0
[17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: P W O 4.4.176 #1
[17179954.674339]dtask: dc9d7040 task.stack: d3cb4000
[17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: 00000000
[17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: P W O (4.4.176)
[17179954.674355]MSR: 00021002 <CE,ME> CR: 28002224 XER: 00000000
[17179954.674364]
GPR00: 00000016 d3cb5cb0 dc9d7040 d3cb5cc0 00000000 0000025d ffe15b24 ffffffff
GPR08: de86aead 00000000 000003ff ffffffff 28002222 0084d1c0 00000000 ffffffff
GPR16: b5929ca0 b4bb7a48 c0863c08 0000048d 00000062 00000062 00000000 0000000f
GPR24: 00000000 d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 fffff30e d3e9c20c
[17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0
[17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138
[17179954.674434]Call Trace:
[17179961.832693]Call Trace:
[17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable)
[17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80
[17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870
[17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4
[17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c
[17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114
[17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c
[17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4
[17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c
do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64
->div_u64_rem->do_div->__div64_32
In some corner case, stime + utime = 0 if overflow. Even in v5.8.2 kernel
the cputime has changed from unsigned long to u64 data type. About 200
days, the lowwer 32 bit will be 0x00000000. Because divisor for __div64_32
is unsigned long data type,which is 32 bit for powepc 32, the bug still
exists.
So it is also a bug in the cputime_adjust which does not check if
stime + utime = 0
time = scale_stime((__force u64)stime, (__force u64)rtime,
(__force u64)(stime + utime));
The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in
mainline kernel may has fixed this case. But it is also better to check
if divisor is 0 in __div64_32 for other situation.
Signed-off-by: Guohua Zhong <zhongguohua1@huawei.com>
Fixes:14cf11af6cf6 "( powerpc: Merge enough to start building in arch/powerpc.)"
Fixes:94b212c29f68 "( powerpc: Move ppc64 boot wrapper code over to arch/powerpc)"
Cc: stable@vger.kernel.org # v2.6.15+
---
arch/powerpc/boot/div64.S | 4 ++++
arch/powerpc/lib/div64.S | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S
index 4354928ed62e..39a25b9712d1 100644
--- a/arch/powerpc/boot/div64.S
+++ b/arch/powerpc/boot/div64.S
@@ -13,6 +13,9 @@
.globl __div64_32
__div64_32:
+ li r9,0
+ cmplw r4,r9 # check if divisor r4 is zero
+ beq 5f # jump to label 5 if r4(divisor) is zero
lwz r5,0(r3) # get the dividend into r5/r6
lwz r6,4(r3)
cmplw r5,r4
@@ -52,6 +55,7 @@ __div64_32:
4: stw r7,0(r3) # return the quotient in *r3
stw r8,4(r3)
mr r3,r6 # return the remainder in r3
+5: # return if divisor r4 is zero
blr
/*
diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S
index 3d5426e7dcc4..1cc9bcabf678 100644
--- a/arch/powerpc/lib/div64.S
+++ b/arch/powerpc/lib/div64.S
@@ -13,6 +13,9 @@
#include <asm/processor.h>
_GLOBAL(__div64_32)
+ li r9,0
+ cmplw r4,r9 # check if divisor r4 is zero
+ beq 5f # jump to label 5 if r4(divisor) is zero
lwz r5,0(r3) # get the dividend into r5/r6
lwz r6,4(r3)
cmplw r5,r4
@@ -52,4 +55,5 @@ _GLOBAL(__div64_32)
4: stw r7,0(r3) # return the quotient in *r3
stw r8,4(r3)
mr r3,r6 # return the remainder in r3
+5: # return if divisor r4 is zero
blr
--
2.12.3
^ permalink raw reply related
* Re: [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
From: Christophe Leroy @ 2020-08-20 17:02 UTC (permalink / raw)
To: Guohua Zhong, paulus, mpe, benh, gregkh
Cc: linuxppc-dev, wangle6, nixiaoming, linux-kernel, stable
In-Reply-To: <20200820131049.42940-1-zhongguohua1@huawei.com>
Le 20/08/2020 à 15:10, Guohua Zhong a écrit :
> When cat /proc/pid/stat, do_task_stat will call into cputime_adjust,
> which call stack is like this:
>
> [17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0
> [17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: P W O 4.4.176 #1
> [17179954.674339]dtask: dc9d7040 task.stack: d3cb4000
> [17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: 00000000
> [17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: P W O (4.4.176)
> [17179954.674355]MSR: 00021002 <CE,ME> CR: 28002224 XER: 00000000
> [17179954.674364]
> GPR00: 00000016 d3cb5cb0 dc9d7040 d3cb5cc0 00000000 0000025d ffe15b24 ffffffff
> GPR08: de86aead 00000000 000003ff ffffffff 28002222 0084d1c0 00000000 ffffffff
> GPR16: b5929ca0 b4bb7a48 c0863c08 0000048d 00000062 00000062 00000000 0000000f
> GPR24: 00000000 d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 fffff30e d3e9c20c
> [17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0
> [17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138
> [17179954.674434]Call Trace:
> [17179961.832693]Call Trace:
> [17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable)
> [17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80
> [17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870
> [17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4
> [17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c
> [17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114
> [17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c
> [17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4
> [17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c
>
> do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64
> ->div_u64_rem->do_div->__div64_32
>
> In some corner case, stime + utime = 0 if overflow. Even in v5.8.2 kernel
> the cputime has changed from unsigned long to u64 data type. About 200
> days, the lowwer 32 bit will be 0x00000000. Because divisor for __div64_32
> is unsigned long data type,which is 32 bit for powepc 32, the bug still
> exists.
>
> So it is also a bug in the cputime_adjust which does not check if
> stime + utime = 0
>
> time = scale_stime((__force u64)stime, (__force u64)rtime,
> (__force u64)(stime + utime));
>
> The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in
> mainline kernel may has fixed this case. But it is also better to check
> if divisor is 0 in __div64_32 for other situation.
>
> Signed-off-by: Guohua Zhong <zhongguohua1@huawei.com>
> Fixes:14cf11af6cf6 "( powerpc: Merge enough to start building in arch/powerpc.)"
> Fixes:94b212c29f68 "( powerpc: Move ppc64 boot wrapper code over to arch/powerpc)"
> Cc: stable@vger.kernel.org # v2.6.15+
> ---
> arch/powerpc/boot/div64.S | 4 ++++
> arch/powerpc/lib/div64.S | 4 ++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S
> index 4354928ed62e..39a25b9712d1 100644
> --- a/arch/powerpc/boot/div64.S
> +++ b/arch/powerpc/boot/div64.S
> @@ -13,6 +13,9 @@
>
> .globl __div64_32
> __div64_32:
> + li r9,0
> + cmplw r4,r9 # check if divisor r4 is zero
> + beq 5f # jump to label 5 if r4(divisor) is zero
> lwz r5,0(r3) # get the dividend into r5/r6
> lwz r6,4(r3)
> cmplw r5,r4
> @@ -52,6 +55,7 @@ __div64_32:
> 4: stw r7,0(r3) # return the quotient in *r3
> stw r8,4(r3)
> mr r3,r6 # return the remainder in r3
> +5: # return if divisor r4 is zero
> blr
>
> /*
> diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S
> index 3d5426e7dcc4..1cc9bcabf678 100644
> --- a/arch/powerpc/lib/div64.S
> +++ b/arch/powerpc/lib/div64.S
> @@ -13,6 +13,9 @@
> #include <asm/processor.h>
>
> _GLOBAL(__div64_32)
> + li r9,0
You don't need to load r9 with 0, use cmplwi instead.
> + cmplw r4,r9 # check if divisor r4 is zero
> + beq 5f # jump to label 5 if r4(divisor) is zero
You should leave space between the compare and the branch (i.e. have
other instructions inbetween when possible), so that the processor can
prepare the branching and do a good prediction. Same as the compare
below, you see that there are two other instructions between the cmplw
are the blt. You can eventually use another cr field than cr0 in order
to nest several test/branches.
Also because on recent powerpc32, instructions are fetched and executed
two by two.
> lwz r5,0(r3) # get the dividend into r5/r6
> lwz r6,4(r3)
> cmplw r5,r4
> @@ -52,4 +55,5 @@ _GLOBAL(__div64_32)
> 4: stw r7,0(r3) # return the quotient in *r3
> stw r8,4(r3)
> mr r3,r6 # return the remainder in r3
> +5: # return if divisor r4 is zero
> blr
>
Christophe
^ permalink raw reply
* Re: [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
From: Christophe Leroy @ 2020-08-20 18:23 UTC (permalink / raw)
To: Guohua Zhong, paulus, mpe, benh, gregkh
Cc: linuxppc-dev, wangle6, nixiaoming, linux-kernel, stable
In-Reply-To: <20200820131049.42940-1-zhongguohua1@huawei.com>
Le 20/08/2020 à 15:10, Guohua Zhong a écrit :
> When cat /proc/pid/stat, do_task_stat will call into cputime_adjust,
> which call stack is like this:
>
> [17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0
> [17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: P W O 4.4.176 #1
> [17179954.674339]dtask: dc9d7040 task.stack: d3cb4000
> [17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: 00000000
> [17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: P W O (4.4.176)
> [17179954.674355]MSR: 00021002 <CE,ME> CR: 28002224 XER: 00000000
> [17179954.674364]
> GPR00: 00000016 d3cb5cb0 dc9d7040 d3cb5cc0 00000000 0000025d ffe15b24 ffffffff
> GPR08: de86aead 00000000 000003ff ffffffff 28002222 0084d1c0 00000000 ffffffff
> GPR16: b5929ca0 b4bb7a48 c0863c08 0000048d 00000062 00000062 00000000 0000000f
> GPR24: 00000000 d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 fffff30e d3e9c20c
> [17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0
> [17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138
> [17179954.674434]Call Trace:
> [17179961.832693]Call Trace:
> [17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable)
> [17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80
> [17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870
> [17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4
> [17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c
> [17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114
> [17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c
> [17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4
> [17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c
>
> do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64
> ->div_u64_rem->do_div->__div64_32
>
> In some corner case, stime + utime = 0 if overflow. Even in v5.8.2 kernel
> the cputime has changed from unsigned long to u64 data type. About 200
> days, the lowwer 32 bit will be 0x00000000. Because divisor for __div64_32
> is unsigned long data type,which is 32 bit for powepc 32, the bug still
> exists.
>
> So it is also a bug in the cputime_adjust which does not check if
> stime + utime = 0
>
> time = scale_stime((__force u64)stime, (__force u64)rtime,
> (__force u64)(stime + utime));
>
> The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in
> mainline kernel may has fixed this case. But it is also better to check
> if divisor is 0 in __div64_32 for other situation.
>
> Signed-off-by: Guohua Zhong <zhongguohua1@huawei.com>
> Fixes:14cf11af6cf6 "( powerpc: Merge enough to start building in arch/powerpc.)"
> Fixes:94b212c29f68 "( powerpc: Move ppc64 boot wrapper code over to arch/powerpc)"
> Cc: stable@vger.kernel.org # v2.6.15+
> ---
> arch/powerpc/boot/div64.S | 4 ++++
> arch/powerpc/lib/div64.S | 4 ++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S
> index 4354928ed62e..39a25b9712d1 100644
> --- a/arch/powerpc/boot/div64.S
> +++ b/arch/powerpc/boot/div64.S
> @@ -13,6 +13,9 @@
>
> .globl __div64_32
> __div64_32:
> + li r9,0
> + cmplw r4,r9 # check if divisor r4 is zero
> + beq 5f # jump to label 5 if r4(divisor) is zero
In generic version in lib/math/div64.c, there is no checking of 'base'
either.
Do we really want to add this check in the powerpc version only ?
The only user of __div64_32() is do_div() in
include/asm-generic/div64.h. Wouldn't it be better to do the check there ?
Christophe
> lwz r5,0(r3) # get the dividend into r5/r6
> lwz r6,4(r3)
> cmplw r5,r4
> @@ -52,6 +55,7 @@ __div64_32:
> 4: stw r7,0(r3) # return the quotient in *r3
> stw r8,4(r3)
> mr r3,r6 # return the remainder in r3
> +5: # return if divisor r4 is zero
> blr
>
> /*
> diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S
> index 3d5426e7dcc4..1cc9bcabf678 100644
> --- a/arch/powerpc/lib/div64.S
> +++ b/arch/powerpc/lib/div64.S
> @@ -13,6 +13,9 @@
> #include <asm/processor.h>
>
> _GLOBAL(__div64_32)
> + li r9,0
> + cmplw r4,r9 # check if divisor r4 is zero
> + beq 5f # jump to label 5 if r4(divisor) is zero
> lwz r5,0(r3) # get the dividend into r5/r6
> lwz r6,4(r3)
> cmplw r5,r4
> @@ -52,4 +55,5 @@ _GLOBAL(__div64_32)
> 4: stw r7,0(r3) # return the quotient in *r3
> stw r8,4(r3)
> mr r3,r6 # return the remainder in r3
> +5: # return if divisor r4 is zero
> blr
>
^ permalink raw reply
* Re: [PATCH v2 1/4] powerpc/drmem: Make lmb_size 64 bit
From: Nathan Lynch @ 2020-08-20 20:10 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linuxppc-dev, stable
In-Reply-To: <20200806162329.276534-1-aneesh.kumar@linux.ibm.com>
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block panic")
> make sure different variables tracking lmb_size are updated to be 64 bit.
>
> This was found by code audit.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
> arch/powerpc/include/asm/drmem.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
> index 17ccc6474ab6..d719cbac34b2 100644
> --- a/arch/powerpc/include/asm/drmem.h
> +++ b/arch/powerpc/include/asm/drmem.h
> @@ -21,7 +21,7 @@ struct drmem_lmb {
> struct drmem_lmb_info {
> struct drmem_lmb *lmbs;
> int n_lmbs;
> - u32 lmb_size;
> + u64 lmb_size;
> };
>
> extern struct drmem_lmb_info *drmem_info;
> @@ -67,7 +67,7 @@ struct of_drconf_cell_v2 {
> #define DRCONF_MEM_RESERVED 0x00000080
> #define DRCONF_MEM_HOTREMOVABLE 0x00000100
>
> -static inline u32 drmem_lmb_size(void)
> +static inline u64 drmem_lmb_size(void)
> {
> return drmem_info->lmb_size;
> }
Looks fine.
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
^ permalink raw reply
* Re: [PATCH v2 3/4] powerpc/memhotplug: Make lmb size 64bit
From: Nathan Lynch @ 2020-08-20 20:20 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linuxppc-dev, stable
In-Reply-To: <20200806162329.276534-3-aneesh.kumar@linux.ibm.com>
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> @@ -322,12 +322,16 @@ static int pseries_remove_mem_node(struct device_node *np)
> /*
> * Find the base address and size of the memblock
> */
> - regs = of_get_property(np, "reg", NULL);
> - if (!regs)
> + prop = of_get_property(np, "reg", NULL);
> + if (!prop)
> return ret;
>
> - base = be64_to_cpu(*(unsigned long *)regs);
> - lmb_size = be32_to_cpu(regs[3]);
> + /*
> + * "reg" property represents (addr,size) tuple.
> + */
> + base = of_read_number(prop, mem_addr_cells);
> + prop += mem_addr_cells;
> + lmb_size = of_read_number(prop, mem_size_cells);
Would of_n_size_cells() and of_n_addr_cells() work here?
^ permalink raw reply
* Re: [PATCH v2 3/6] powerpc/32s: Only leave NX unset on segments used for modules
From: Andreas Schwab @ 2020-08-20 22:00 UTC (permalink / raw)
To: Christophe Leroy; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <7172c0f5253419315e434a1816ee3d6ed6505bc0.1593428200.git.christophe.leroy@csgroup.eu>
On Jun 29 2020, Christophe Leroy wrote:
> Instead of leaving NX unset on all segments above the start
> of vmalloc space, only leave NX unset on segments used for
> modules.
I'm getting this crash:
kernel tried to execute exec-protected page (f294b000) - exploit attempt (uid: 0)
BUG: Unable to handle kernel instruction fetch
Faulting instruction address: 0xf294b000
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=4K MMU=Hash PowerMac
Modules linked in: pata_macio(+)
CPU: 0 PID: 87 Comm: udevd Not tainted 5.8.0-rc2-test #49
NIP: f294b000 LR: 0005c60 CTR: f294b000
REGS: f18d9cc0 TRAP: 0400 Not tainted (5.8.0-rc2-test)
MSR: 10009032 <E,ME,IR,DR,RI> CR: 84222422 XER: 20000000
GPR00: c0005c14 f18d9d78 ef30ca20 00000000 ef0000e0 c00993d0 ef6da038 0000005e
GPR08: c09050b8 c08b0000 00000000 f18d9d78 44222422 10072070 00000000 0fefaca4
GPR16: 1006a00c f294d50b 00000120 00000124 c0096ea8 0000000e ef2776c0 ef2776e4
GPR24: f18fd6e8 00000001 c086fe64 c086fe04 00000000 c08b0000 f294b000 ffffffff
NIP [f294b000] pata_macio_init+0x0/0xc0 [pata_macio]
LR [c0005c60] do_one_initcall+0x6c/0x160
Call Trace:
[f18d9d78] [c0005c14] do_one_initcall+0x20/0x160 (unreliable)
[f18d9dd8] [c009a22c] do_init_module+0x60/0x1c0
[f18d9df8] [c00993d8] load_module+0x16a8/0x1c14
[f18d9ea8] [c0099aa4] sys_finit_module+0x8c/0x94
[f18d9f38] [c0012174] ret_from_syscall+0x0/0x34
--- interrupt: c01 at 0xfdb4318
LR = 0xfeee9c0
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX <3d20c08b> 3d40c086 9421ffe0 8129106c
---[ end trace 85a98cc836109871 ]---
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
^ permalink raw reply
* [RFT][PATCH 0/7] Avoid overflow at boundary_size
From: Nicolin Chen @ 2020-08-20 23:19 UTC (permalink / raw)
To: mpe, benh, paulus, rth, ink, mattst88, tony.luck, fenghua.yu,
schnelle, gerald.schaefer, hca, gor, borntraeger, davem, tglx,
mingo, bp, x86, hpa, James.Bottomley, deller
Cc: sfr, linux-ia64, linux-parisc, linux-s390, linux-kernel,
linux-alpha, sparclinux, linuxppc-dev, hch
We are expending the default DMA segmentation boundary to its
possible maximum value (ULONG_MAX) to indicate that a device
doesn't specify a boundary limit. So all dma_get_seg_boundary
callers should take a precaution with the return values since
it would easily get overflowed.
I scanned the entire kernel tree for all the existing callers
and found that most of callers may get overflowed in two ways:
either "+ 1" or passing it to ALIGN() that does "+ mask".
According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)
We can simplify the logic here:
ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1
So this series of patches fix the potential overflow with this
overflow-free shortcut.
As I don't think that I have these platforms, marking RFT.
Thanks
Nic
Nicolin Chen (7):
powerpc/iommu: Avoid overflow at boundary_size
alpha: Avoid overflow at boundary_size
ia64/sba_iommu: Avoid overflow at boundary_size
s390/pci_dma: Avoid overflow at boundary_size
sparc: Avoid overflow at boundary_size
x86/amd_gart: Avoid overflow at boundary_size
parisc: Avoid overflow at boundary_size
arch/alpha/kernel/pci_iommu.c | 10 ++++------
arch/ia64/hp/common/sba_iommu.c | 4 ++--
arch/powerpc/kernel/iommu.c | 11 +++++------
arch/s390/pci/pci_dma.c | 4 ++--
arch/sparc/kernel/iommu-common.c | 9 +++------
arch/sparc/kernel/iommu.c | 4 ++--
arch/sparc/kernel/pci_sun4v.c | 4 ++--
arch/x86/kernel/amd_gart_64.c | 4 ++--
drivers/parisc/ccio-dma.c | 4 ++--
drivers/parisc/sba_iommu.c | 4 ++--
10 files changed, 26 insertions(+), 32 deletions(-)
--
2.17.1
^ permalink raw reply
* [RFT][PATCH 1/7] powerpc/iommu: Avoid overflow at boundary_size
From: Nicolin Chen @ 2020-08-20 23:19 UTC (permalink / raw)
To: mpe, benh, paulus; +Cc: sfr, linuxppc-dev, hch, linux-kernel
The boundary_size might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1"
or passing it to ALIGN() would potentially overflow.
According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)
We can simplify the logic here:
ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1
So fixing a potential overflow with the safer shortcut.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
---
arch/powerpc/kernel/iommu.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9704f3f76e63..c01ccbf8afdd 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -236,15 +236,14 @@ static unsigned long iommu_range_alloc(struct device *dev,
}
}
- if (dev)
- boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- 1 << tbl->it_page_shift);
- else
- boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
+ boundary_size = dev ? dma_get_seg_boundary(dev) : U32_MAX;
+
+ /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+ boundary_size = (boundary_size >> tbl->it_page_shift) + 1;
n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
- boundary_size >> tbl->it_page_shift, align_mask);
+ boundary_size, align_mask);
if (n == -1) {
if (likely(pass == 0)) {
/* First try the pool from the start */
--
2.17.1
^ permalink raw reply related
* Re: [PATCH net-next v2 0/4] refactoring of ibmvnic code
From: David Miller @ 2020-08-20 23:36 UTC (permalink / raw)
To: ljp; +Cc: netdev, linuxppc-dev
In-Reply-To: <20200819225226.10152-1-ljp@linux.ibm.com>
From: Lijun Pan <ljp@linux.ibm.com>
Date: Wed, 19 Aug 2020 17:52:22 -0500
> This patch series refactor reset_init and init functions,
> and make some other cosmetic changes to make the code
> easier to read and debug. v2 removes __func__ and v1's 1/5.
Series applied, thank you.
^ permalink raw reply
* [PATCH] tty: hvcs: Don't NULL tty->driver_data until hvcs_cleanup()
From: Tyrel Datwyler @ 2020-08-20 23:46 UTC (permalink / raw)
To: gregkh
Cc: Tyrel Datwyler, Jason Yan,
open list:HYPERVISOR VIRTUAL CONSOLE DRIVER, open list,
Joe Perches, Jiri Slaby
The code currently NULLs tty->driver_data in hvcs_close() with the
intent of informing the next call to hvcs_open() that device needs to be
reconfigured. However, when hvcs_cleanup() is called we copy hvcsd from
tty->driver_data which was previoulsy NULLed by hvcs_close() and our
call to tty_port_put(&hvcsd->port) doesn't actually do anything since
&hvcsd->port ends up translating to NULL by chance. This has the side
effect that when hvcs_remove() is called we have one too many port
references preventing hvcs_destuct_port() from ever being called. This
also prevents us from reusing the /dev/hvcsX node in a future
hvcs_probe() and we can eventually run out of /dev/hvcsX devices.
Fix this by waiting to NULL tty->driver_data in hvcs_cleanup().
Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com>
---
drivers/tty/hvc/hvcs.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/tty/hvc/hvcs.c b/drivers/tty/hvc/hvcs.c
index 55105ac38f89..509d1042825a 100644
--- a/drivers/tty/hvc/hvcs.c
+++ b/drivers/tty/hvc/hvcs.c
@@ -1216,13 +1216,6 @@ static void hvcs_close(struct tty_struct *tty, struct file *filp)
tty_wait_until_sent(tty, HVCS_CLOSE_WAIT);
- /*
- * This line is important because it tells hvcs_open that this
- * device needs to be re-configured the next time hvcs_open is
- * called.
- */
- tty->driver_data = NULL;
-
free_irq(irq, hvcsd);
return;
} else if (hvcsd->port.count < 0) {
@@ -1237,6 +1230,13 @@ static void hvcs_cleanup(struct tty_struct * tty)
{
struct hvcs_struct *hvcsd = tty->driver_data;
+ /*
+ * This line is important because it tells hvcs_open that this
+ * device needs to be re-configured the next time hvcs_open is
+ * called.
+ */
+ tty->driver_data = NULL;
+
tty_port_put(&hvcsd->port);
}
--
2.27.0
^ permalink raw reply related
* [PATCH 0/2] Reintroduce PROT_SAO
From: Shawn Anastasio @ 2020-08-21 1:08 UTC (permalink / raw)
To: linuxppc-dev; +Cc: npiggin
This set re-introduces the PROT_SAO prot flag removed in
Commit 5c9fa16e8abd ("powerpc/64s: Remove PROT_SAO support").
To address concerns regarding live migration of guests using SAO
to P10 hosts without SAO support, the flag is disabled by default
in LPARs. A new config option, PPC_PROT_SAO_LPAR was added to
allow users to explicitly enable it if they will not be running
in an environment where this is a conern.
Shawn Anastasio (2):
Revert "powerpc/64s: Remove PROT_SAO support"
powerpc/64s: Disallow PROT_SAO in LPARs by default
arch/powerpc/Kconfig | 12 ++++++
arch/powerpc/include/asm/book3s/64/pgtable.h | 8 ++--
arch/powerpc/include/asm/cputable.h | 10 ++---
arch/powerpc/include/asm/mman.h | 31 ++++++++++++--
arch/powerpc/include/asm/nohash/64/pgtable.h | 2 +
arch/powerpc/include/uapi/asm/mman.h | 2 +-
arch/powerpc/kernel/dt_cpu_ftrs.c | 2 +-
arch/powerpc/mm/book3s64/hash_utils.c | 2 +
include/linux/mm.h | 2 +
include/trace/events/mmflags.h | 2 +
mm/ksm.c | 4 ++
tools/testing/selftests/powerpc/mm/.gitignore | 1 +
tools/testing/selftests/powerpc/mm/Makefile | 4 +-
tools/testing/selftests/powerpc/mm/prot_sao.c | 42 +++++++++++++++++++
14 files changed, 107 insertions(+), 17 deletions(-)
create mode 100644 tools/testing/selftests/powerpc/mm/prot_sao.c
--
2.28.0
^ permalink raw reply
* [PATCH 1/2] Revert "powerpc/64s: Remove PROT_SAO support"
From: Shawn Anastasio @ 2020-08-21 1:08 UTC (permalink / raw)
To: linuxppc-dev; +Cc: npiggin
In-Reply-To: <20200821010837.4079-1-shawn@anastas.io>
This reverts commit 5c9fa16e8abd342ce04dc830c1ebb2a03abf6c05.
Since PROT_SAO can still be useful for certain classes of software,
reintroduce it. Concerns about guest migration for LPARs using SAO
will be addressed next.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 8 ++--
arch/powerpc/include/asm/cputable.h | 10 ++---
arch/powerpc/include/asm/mman.h | 26 ++++++++++--
arch/powerpc/include/asm/nohash/64/pgtable.h | 2 +
arch/powerpc/include/uapi/asm/mman.h | 2 +-
arch/powerpc/kernel/dt_cpu_ftrs.c | 2 +-
arch/powerpc/mm/book3s64/hash_utils.c | 2 +
include/linux/mm.h | 2 +
include/trace/events/mmflags.h | 2 +
mm/ksm.c | 4 ++
tools/testing/selftests/powerpc/mm/.gitignore | 1 +
tools/testing/selftests/powerpc/mm/Makefile | 4 +-
tools/testing/selftests/powerpc/mm/prot_sao.c | 42 +++++++++++++++++++
13 files changed, 90 insertions(+), 17 deletions(-)
create mode 100644 tools/testing/selftests/powerpc/mm/prot_sao.c
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 6de56c3b33c4..495fc0ccb453 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -20,13 +20,9 @@
#define _PAGE_RW (_PAGE_READ | _PAGE_WRITE)
#define _PAGE_RWX (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)
#define _PAGE_PRIVILEGED 0x00008 /* kernel access only */
-
-#define _PAGE_CACHE_CTL 0x00030 /* Bits for the folowing cache modes */
- /* No bits set is normal cacheable memory */
- /* 0x00010 unused, is SAO bit on radix POWER9 */
+#define _PAGE_SAO 0x00010 /* Strong access order */
#define _PAGE_NON_IDEMPOTENT 0x00020 /* non idempotent memory */
#define _PAGE_TOLERANT 0x00030 /* tolerant memory, cache inhibited */
-
#define _PAGE_DIRTY 0x00080 /* C: page changed */
#define _PAGE_ACCESSED 0x00100 /* R: page referenced */
/*
@@ -828,6 +824,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
return hash__set_pte_at(mm, addr, ptep, pte, percpu);
}
+#define _PAGE_CACHE_CTL (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT)
+
#define pgprot_noncached pgprot_noncached
static inline pgprot_t pgprot_noncached(pgprot_t prot)
{
diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index fdddb822d564..f89205eff691 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -191,7 +191,7 @@ static inline void cpu_feature_keys_init(void) { }
#define CPU_FTR_SPURR LONG_ASM_CONST(0x0000000001000000)
#define CPU_FTR_DSCR LONG_ASM_CONST(0x0000000002000000)
#define CPU_FTR_VSX LONG_ASM_CONST(0x0000000004000000)
-// Free LONG_ASM_CONST(0x0000000008000000)
+#define CPU_FTR_SAO LONG_ASM_CONST(0x0000000008000000)
#define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0000000010000000)
#define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0000000020000000)
#define CPU_FTR_ASYM_SMT LONG_ASM_CONST(0x0000000040000000)
@@ -436,7 +436,7 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_MMCRA | CPU_FTR_SMT | \
CPU_FTR_COHERENT_ICACHE | \
CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
- CPU_FTR_DSCR | CPU_FTR_ASYM_SMT | \
+ CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | \
CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX )
@@ -445,7 +445,7 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_MMCRA | CPU_FTR_SMT | \
CPU_FTR_COHERENT_ICACHE | \
CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
- CPU_FTR_DSCR | \
+ CPU_FTR_DSCR | CPU_FTR_SAO | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
@@ -456,7 +456,7 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_MMCRA | CPU_FTR_SMT | \
CPU_FTR_COHERENT_ICACHE | \
CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
- CPU_FTR_DSCR | \
+ CPU_FTR_DSCR | CPU_FTR_SAO | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
@@ -474,7 +474,7 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_MMCRA | CPU_FTR_SMT | \
CPU_FTR_COHERENT_ICACHE | \
CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
- CPU_FTR_DSCR | \
+ CPU_FTR_DSCR | CPU_FTR_SAO | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 7c07728af300..4ba303ea27f5 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,20 +13,38 @@
#include <linux/pkeys.h>
#include <asm/cpu_has_feature.h>
-#ifdef CONFIG_PPC_MEM_KEYS
static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
unsigned long pkey)
{
- return pkey_to_vmflag_bits(pkey);
+#ifdef CONFIG_PPC_MEM_KEYS
+ return (((prot & PROT_SAO) ? VM_SAO : 0) | pkey_to_vmflag_bits(pkey));
+#else
+ return ((prot & PROT_SAO) ? VM_SAO : 0);
+#endif
}
#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
{
- return __pgprot(vmflag_to_pte_pkey_bits(vm_flags));
+#ifdef CONFIG_PPC_MEM_KEYS
+ return (vm_flags & VM_SAO) ?
+ __pgprot(_PAGE_SAO | vmflag_to_pte_pkey_bits(vm_flags)) :
+ __pgprot(0 | vmflag_to_pte_pkey_bits(vm_flags));
+#else
+ return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
+#endif
}
#define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
-#endif
+
+static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
+{
+ if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
+ return false;
+ if ((prot & PROT_SAO) && !cpu_has_feature(CPU_FTR_SAO))
+ return false;
+ return true;
+}
+#define arch_validate_prot arch_validate_prot
#endif /* CONFIG_PPC64 */
#endif /* _ASM_POWERPC_MMAN_H */
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 59ee9fa4ae09..6cb8aa357191 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -82,6 +82,8 @@
*/
#include <asm/nohash/pte-book3e.h>
+#define _PAGE_SAO 0
+
#define PTE_RPN_MASK (~((1UL << PTE_RPN_SHIFT) - 1))
/*
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 3a700351feca..c0c737215b00 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -11,7 +11,7 @@
#include <asm-generic/mman-common.h>
-#define PROT_SAO 0x10 /* Unsupported since v5.9 */
+#define PROT_SAO 0x10 /* Strong Access Ordering */
#define MAP_RENAME MAP_ANONYMOUS /* In SunOS terminology */
#define MAP_NORESERVE 0x40 /* don't reserve swap pages */
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 6f8c0c6b937a..231f92248140 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -657,7 +657,7 @@ static struct dt_cpu_feature_match __initdata
{"processor-control-facility-v3", feat_enable_dbell, CPU_FTR_DBELL},
{"processor-utilization-of-resources-register", feat_enable_purr, 0},
{"no-execute", feat_enable, 0},
- /* strong-access-ordering is unused */
+ {"strong-access-ordering", feat_enable, CPU_FTR_SAO},
{"cache-inhibited-large-page", feat_enable_large_ci, 0},
{"coprocessor-icswx", feat_enable, 0},
{"hypervisor-virtualization-interrupt", feat_enable_hvi, 0},
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 1da9dbba9217..c1ea93b368b6 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -232,6 +232,8 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags)
rflags |= HPTE_R_I;
else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_NON_IDEMPOTENT)
rflags |= (HPTE_R_I | HPTE_R_G);
+ else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_SAO)
+ rflags |= (HPTE_R_W | HPTE_R_I | HPTE_R_M);
else
/*
* Add memory coherence if cache inhibited is not set
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1983e08f5906..5abe6df4247e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -321,6 +321,8 @@ extern unsigned int kobjsize(const void *objp);
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
+#elif defined(CONFIG_PPC)
+# define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
#elif defined(CONFIG_PARISC)
# define VM_GROWSUP VM_ARCH_1
#elif defined(CONFIG_IA64)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 939092dbcb8b..5fb752034386 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -114,6 +114,8 @@ IF_HAVE_PG_IDLE(PG_idle, "idle" )
#if defined(CONFIG_X86)
#define __VM_ARCH_SPECIFIC_1 {VM_PAT, "pat" }
+#elif defined(CONFIG_PPC)
+#define __VM_ARCH_SPECIFIC_1 {VM_SAO, "sao" }
#elif defined(CONFIG_PARISC) || defined(CONFIG_IA64)
#define __VM_ARCH_SPECIFIC_1 {VM_GROWSUP, "growsup" }
#elif !defined(CONFIG_MMU)
diff --git a/mm/ksm.c b/mm/ksm.c
index 0aa2247bddd7..90a625b02a1d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2453,6 +2453,10 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
if (vma_is_dax(vma))
return 0;
+#ifdef VM_SAO
+ if (*vm_flags & VM_SAO)
+ return 0;
+#endif
#ifdef VM_SPARC_ADI
if (*vm_flags & VM_SPARC_ADI)
return 0;
diff --git a/tools/testing/selftests/powerpc/mm/.gitignore b/tools/testing/selftests/powerpc/mm/.gitignore
index 91c775c23c66..aac4a59f9e28 100644
--- a/tools/testing/selftests/powerpc/mm/.gitignore
+++ b/tools/testing/selftests/powerpc/mm/.gitignore
@@ -2,6 +2,7 @@
hugetlb_vs_thp_test
subpage_prot
tempfile
+prot_sao
segv_errors
wild_bctr
large_vm_fork_separation
diff --git a/tools/testing/selftests/powerpc/mm/Makefile b/tools/testing/selftests/powerpc/mm/Makefile
index 250ce172e0da..defe488d6bf1 100644
--- a/tools/testing/selftests/powerpc/mm/Makefile
+++ b/tools/testing/selftests/powerpc/mm/Makefile
@@ -2,7 +2,7 @@
noarg:
$(MAKE) -C ../
-TEST_GEN_PROGS := hugetlb_vs_thp_test subpage_prot segv_errors wild_bctr \
+TEST_GEN_PROGS := hugetlb_vs_thp_test subpage_prot prot_sao segv_errors wild_bctr \
large_vm_fork_separation bad_accesses pkey_exec_prot \
pkey_siginfo stack_expansion_signal stack_expansion_ldst
@@ -14,6 +14,8 @@ include ../../lib.mk
$(TEST_GEN_PROGS): ../harness.c ../utils.c
+$(OUTPUT)/prot_sao: ../utils.c
+
$(OUTPUT)/wild_bctr: CFLAGS += -m64
$(OUTPUT)/large_vm_fork_separation: CFLAGS += -m64
$(OUTPUT)/bad_accesses: CFLAGS += -m64
diff --git a/tools/testing/selftests/powerpc/mm/prot_sao.c b/tools/testing/selftests/powerpc/mm/prot_sao.c
new file mode 100644
index 000000000000..e2eed65b7735
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mm/prot_sao.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2016, Michael Ellerman, IBM Corp.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+
+#include <asm/cputable.h>
+
+#include "utils.h"
+
+#define SIZE (64 * 1024)
+
+int test_prot_sao(void)
+{
+ char *p;
+
+ /* 2.06 or later should support SAO */
+ SKIP_IF(!have_hwcap(PPC_FEATURE_ARCH_2_06));
+
+ /*
+ * Ensure we can ask for PROT_SAO.
+ * We can't really verify that it does the right thing, but at least we
+ * confirm the kernel will accept it.
+ */
+ p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE | PROT_SAO,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ FAIL_IF(p == MAP_FAILED);
+
+ /* Write to the mapping, to at least cause a fault */
+ memset(p, 0xaa, SIZE);
+
+ return 0;
+}
+
+int main(void)
+{
+ return test_harness(test_prot_sao, "prot-sao");
+}
--
2.28.0
^ permalink raw reply related
* [PATCH 2/2] powerpc/64s: Disallow PROT_SAO in LPARs by default
From: Shawn Anastasio @ 2020-08-21 1:08 UTC (permalink / raw)
To: linuxppc-dev; +Cc: npiggin
In-Reply-To: <20200821010837.4079-1-shawn@anastas.io>
Since migration of guests using SAO to ISA 3.1 hosts may cause issues,
disable PROT_SAO in LPARs by default and introduce a new Kconfig option
PPC_PROT_SAO_LPAR to allow users to enable it if desired.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
---
arch/powerpc/Kconfig | 12 ++++++++++++
arch/powerpc/include/asm/mman.h | 9 +++++++--
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..65bed1fdeaad 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -860,6 +860,18 @@ config PPC_SUBPAGE_PROT
If unsure, say N here.
+config PPC_PROT_SAO_LPAR
+ bool "Support PROT_SAO mappings in LPARs"
+ depends on PPC_BOOK3S_64
+ help
+ This option adds support for PROT_SAO mappings from userspace
+ inside LPARs on supported CPUs.
+
+ This may cause issues when performing guest migration from
+ a CPU that supports SAO to one that does not.
+
+ If unsure, say N here.
+
config PPC_COPRO_BASE
bool
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 4ba303ea27f5..7cb6d18f5cd6 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -40,8 +40,13 @@ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
{
if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
return false;
- if ((prot & PROT_SAO) && !cpu_has_feature(CPU_FTR_SAO))
- return false;
+ if (prot & PROT_SAO) {
+ if (!cpu_has_feature(CPU_FTR_SAO))
+ return false;
+ if (firmware_has_feature(FW_FEATURE_LPAR) &&
+ !IS_ENABLED(CONFIG_PPC_PROT_SAO_LPAR))
+ return false;
+ }
return true;
}
#define arch_validate_prot arch_validate_prot
--
2.28.0
^ permalink raw reply related
* Re: [PATCH v2 00/13] mm/debug_vm_pgtable fixes
From: Anshuman Khandual @ 2020-08-21 3:33 UTC (permalink / raw)
To: Aneesh Kumar K.V, linux-mm, akpm; +Cc: linuxppc-dev
In-Reply-To: <87tuwyvjei.fsf@linux.ibm.com>
On 08/19/2020 07:15 PM, Aneesh Kumar K.V wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> This patch series includes fixes for debug_vm_pgtable test code so that
>> they follow page table updates rules correctly. The first two patches introduce
>> changes w.r.t ppc64. The patches are included in this series for completeness. We can
>> merge them via ppc64 tree if required.
>>
>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy
>> page table update rules.
>>
>> Changes from V1:
>> * Address review feedback
>> * drop test specific pfn_pte and pfn_pmd.
>> * Update ppc64 page table helper to add _PAGE_PTE
>>
>> Aneesh Kumar K.V (13):
>> powerpc/mm: Add DEBUG_VM WARN for pmd_clear
>> powerpc/mm: Move setting pte specific flags to pfn_pte
>> mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value
>> mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge
>> vmap support.
>> mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with
>> CONFIG_NUMA_BALANCING
>> mm/debug_vm_pgtable/THP: Mark the pte entry huge before using
>> set_pmd/pud_at
>> mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an
>> existing pte entry
>> mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP
>> mm/debug_vm_pgtable/locks: Move non page table modifying test together
>> mm/debug_vm_pgtable/locks: Take correct page table lock
>> mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries
>> mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
>> mm/debug_vm_pgtable: populate a pte entry before fetching it
>>
>> arch/powerpc/include/asm/book3s/64/pgtable.h | 29 +++-
>> arch/powerpc/include/asm/nohash/pgtable.h | 5 -
>> arch/powerpc/mm/book3s64/pgtable.c | 2 +-
>> arch/powerpc/mm/pgtable.c | 5 -
>> include/linux/io.h | 12 ++
>> mm/debug_vm_pgtable.c | 151 +++++++++++--------
>> 6 files changed, 127 insertions(+), 77 deletions(-)
>>
>
> BTW I picked a wrong branch when sending this. Attaching the diff
> against what I want to send. pfn_pmd() no more updates _PAGE_PTE
> because that is handled by pmd_mkhuge().
>
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 3b4da7c63e28..e18ae50a275c 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -141,7 +141,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
> unsigned long pmdv;
>
> pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK;
> - return __pmd(pmdv | pgprot_val(pgprot) | _PAGE_PTE);
> + return pmd_set_protbits(__pmd(pmdv), pgprot);
> }
>
> pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 7d9f8e1d790f..cad61d22f33a 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -229,7 +229,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
>
> static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
> {
> - pmd_t pmd = pfn_pmd(pfn, prot);
> + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>
> if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
> return;
>
Cover letter does not mention which branch or tag this series applies on.
Just assumed it to be 5.9-rc1. Should the above changes be captured as a
pre-requisite patch ?
Anyways, the series fails to be build on arm64.
A) Without CONFIG_TRANSPARENT_HUGEPAGE
mm/debug_vm_pgtable.c: In function ‘debug_vm_pgtable’:
mm/debug_vm_pgtable.c:1045:2: error: too many arguments to function ‘pmd_advanced_tests’
pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
^~~~~~~~~~~~~~~~~~
mm/debug_vm_pgtable.c:366:20: note: declared here
static void __init pmd_advanced_tests(struct mm_struct *mm,
^~~~~~~~~~~~~~~~~~
B) As mentioned previously, this should be solved by including <linux/io.h>
mm/debug_vm_pgtable.c: In function ‘pmd_huge_tests’:
mm/debug_vm_pgtable.c:215:7: error: implicit declaration of function ‘arch_ioremap_pmd_supported’; did you mean ‘arch_disable_smp_support’? [-Werror=implicit-function-declaration]
if (!arch_ioremap_pmd_supported())
^~~~~~~~~~~~~~~~~~~~~~~~~~
Please make sure that the series builds on all enabled platforms i.e x86,
arm64, ppc32, ppc64, arc, s390 along with selectively enabling/disabling
all the features that make various #ifdefs in the test.
- Anshuman
^ permalink raw reply
* Re: [PATCH v2 00/13] mm/debug_vm_pgtable fixes
From: Anshuman Khandual @ 2020-08-21 4:20 UTC (permalink / raw)
To: Aneesh Kumar K.V, linux-mm, akpm; +Cc: linuxppc-dev
In-Reply-To: <856eb6d7-9c09-728e-b374-d787145ac052@arm.com>
On 08/21/2020 09:03 AM, Anshuman Khandual wrote:
>
>
> On 08/19/2020 07:15 PM, Aneesh Kumar K.V wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>>> This patch series includes fixes for debug_vm_pgtable test code so that
>>> they follow page table updates rules correctly. The first two patches introduce
>>> changes w.r.t ppc64. The patches are included in this series for completeness. We can
>>> merge them via ppc64 tree if required.
>>>
>>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy
>>> page table update rules.
>>>
>>> Changes from V1:
>>> * Address review feedback
>>> * drop test specific pfn_pte and pfn_pmd.
>>> * Update ppc64 page table helper to add _PAGE_PTE
>>>
>>> Aneesh Kumar K.V (13):
>>> powerpc/mm: Add DEBUG_VM WARN for pmd_clear
>>> powerpc/mm: Move setting pte specific flags to pfn_pte
>>> mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value
>>> mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge
>>> vmap support.
>>> mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with
>>> CONFIG_NUMA_BALANCING
>>> mm/debug_vm_pgtable/THP: Mark the pte entry huge before using
>>> set_pmd/pud_at
>>> mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an
>>> existing pte entry
>>> mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP
>>> mm/debug_vm_pgtable/locks: Move non page table modifying test together
>>> mm/debug_vm_pgtable/locks: Take correct page table lock
>>> mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries
>>> mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
>>> mm/debug_vm_pgtable: populate a pte entry before fetching it
>>>
>>> arch/powerpc/include/asm/book3s/64/pgtable.h | 29 +++-
>>> arch/powerpc/include/asm/nohash/pgtable.h | 5 -
>>> arch/powerpc/mm/book3s64/pgtable.c | 2 +-
>>> arch/powerpc/mm/pgtable.c | 5 -
>>> include/linux/io.h | 12 ++
>>> mm/debug_vm_pgtable.c | 151 +++++++++++--------
>>> 6 files changed, 127 insertions(+), 77 deletions(-)
>>>
>>
>> BTW I picked a wrong branch when sending this. Attaching the diff
>> against what I want to send. pfn_pmd() no more updates _PAGE_PTE
>> because that is handled by pmd_mkhuge().
>>
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
>> index 3b4da7c63e28..e18ae50a275c 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -141,7 +141,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
>> unsigned long pmdv;
>>
>> pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK;
>> - return __pmd(pmdv | pgprot_val(pgprot) | _PAGE_PTE);
>> + return pmd_set_protbits(__pmd(pmdv), pgprot);
>> }
>>
>> pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 7d9f8e1d790f..cad61d22f33a 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -229,7 +229,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
>>
>> static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
>> {
>> - pmd_t pmd = pfn_pmd(pfn, prot);
>> + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>>
>> if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
>> return;
>>
>
> Cover letter does not mention which branch or tag this series applies on.
> Just assumed it to be 5.9-rc1. Should the above changes be captured as a
> pre-requisite patch ?
>
> Anyways, the series fails to be build on arm64.
>
> A) Without CONFIG_TRANSPARENT_HUGEPAGE
>
> mm/debug_vm_pgtable.c: In function ‘debug_vm_pgtable’:
> mm/debug_vm_pgtable.c:1045:2: error: too many arguments to function ‘pmd_advanced_tests’
> pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
> ^~~~~~~~~~~~~~~~~~
> mm/debug_vm_pgtable.c:366:20: note: declared here
> static void __init pmd_advanced_tests(struct mm_struct *mm,
> ^~~~~~~~~~~~~~~~~~
>
> B) As mentioned previously, this should be solved by including <linux/io.h>
>
> mm/debug_vm_pgtable.c: In function ‘pmd_huge_tests’:
> mm/debug_vm_pgtable.c:215:7: error: implicit declaration of function ‘arch_ioremap_pmd_supported’; did you mean ‘arch_disable_smp_support’? [-Werror=implicit-function-declaration]
> if (!arch_ioremap_pmd_supported())
> ^~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Please make sure that the series builds on all enabled platforms i.e x86,
> arm64, ppc32, ppc64, arc, s390 along with selectively enabling/disabling
> all the features that make various #ifdefs in the test.
>
> - Anshuman
Here is another build failure on x86.
mm/debug_vm_pgtable.c: In function ‘pud_advanced_tests’:
mm/debug_vm_pgtable.c:306:31: error: passing argument 1 of ‘pudp_huge_get_and_clear_full’ from incompatible pointer type [-Werror=incompatible-pointer-types]
pudp_huge_get_and_clear_full(vma, vaddr, pudp, 1);
^~~
In file included from ./include/linux/mm.h:33:0,
from ./include/linux/highmem.h:8,
from mm/debug_vm_pgtable.c:14:
./include/linux/pgtable.h:294:21: note: expected ‘struct mm_struct *’ but argument is of type ‘struct vm_area_struct *’
static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm,
^ permalink raw reply
* [PATCH v5 0/8] huge vmalloc mappings
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
I made this powerpc-only for the time being. It shouldn't be too hard to
add support for other archs that define HUGE_VMAP. I have booted x86
with it enabled, just may not have audited everything.
Hi Andrew, would you care to put this in your tree?
Thanks,
Nick
Since v4:
- Fixed an off-by-page-order bug in v4
- Several minor cleanups.
- Added page order to /proc/vmallocinfo
- Added hugepage to alloc_large_system_hage output.
- Made an architecture config option, powerpc only for now.
Since v3:
- Fixed an off-by-one bug in a loop
- Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail
- Hopefully this time fix the arm64 vmap stack bug, thanks Jonathan
Cameron for debugging the cause of this (hopefully).
Since v2:
- Rebased on vmalloc cleanups, split series into simpler pieces.
- Fixed several compile errors and warnings
- Keep the page array and accounting in small page units because
struct vm_struct is an interface (this should fix x86 vmap stack debug
assert). [Thanks Zefan]
Nicholas Piggin (8):
mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
mm: apply_to_pte_range warn and fail if a large pte is encountered
mm/vmalloc: rename vmap_*_range vmap_pages_*_range
lib/ioremap: rename ioremap_*_range to vmap_*_range
mm: HUGE_VMAP arch support cleanup
mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
mm/vmalloc: add vmap_range_noflush variant
mm/vmalloc: Hugepage vmalloc mappings
.../admin-guide/kernel-parameters.txt | 2 +
arch/Kconfig | 4 +
arch/arm64/mm/mmu.c | 12 +-
arch/powerpc/Kconfig | 1 +
arch/powerpc/mm/book3s64/radix_pgtable.c | 10 +-
arch/x86/mm/ioremap.c | 12 +-
include/linux/io.h | 9 -
include/linux/vmalloc.h | 13 +
init/main.c | 1 -
mm/ioremap.c | 231 +--------
mm/memory.c | 60 ++-
mm/page_alloc.c | 4 +-
mm/vmalloc.c | 456 +++++++++++++++---
13 files changed, 476 insertions(+), 339 deletions(-)
--
2.23.0
^ permalink raw reply
* [PATCH v5 1/8] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
In-Reply-To: <20200821044427.736424-1-npiggin@gmail.com>
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
This change teaches vmalloc_to_page about larger pages, and returns
the struct page that corresponds to the offset within the large page.
This makes the API agnostic to mapping implementation details.
[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/vmalloc.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b482d240f9a2..49f225b0f855 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -38,6 +38,7 @@
#include <linux/overflow.h>
#include <linux/uaccess.h>
+#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <asm/shmparam.h>
@@ -343,7 +344,9 @@ int is_vmalloc_or_module_addr(const void *x)
}
/*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
*/
struct page *vmalloc_to_page(const void *vmalloc_addr)
{
@@ -363,25 +366,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pgd_none(*pgd))
return NULL;
+ if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+ return NULL; /* XXX: no allowance for huge pgd */
+ if (WARN_ON_ONCE(pgd_bad(*pgd)))
+ return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
- pud = pud_offset(p4d, addr);
+ if (p4d_leaf(*p4d))
+ return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(p4d_bad(*p4d)))
+ return NULL;
- /*
- * Don't dereference bad PUD or PMD (below) entries. This will also
- * identify huge mappings, which we may encounter on architectures
- * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
- * identified as vmalloc addresses by is_vmalloc_addr(), but are
- * not [unambiguously] associated with a struct page, so there is
- * no correct value to return for them.
- */
- WARN_ON_ONCE(pud_bad(*pud));
- if (pud_none(*pud) || pud_bad(*pud))
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return NULL;
+ if (pud_leaf(*pud))
+ return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
- WARN_ON_ONCE(pmd_bad(*pmd));
- if (pmd_none(*pmd) || pmd_bad(*pmd))
+ if (pmd_none(*pmd))
+ return NULL;
+ if (pmd_leaf(*pmd))
+ return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
ptep = pte_offset_map(pmd, addr);
@@ -389,6 +400,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
}
EXPORT_SYMBOL(vmalloc_to_page);
--
2.23.0
^ permalink raw reply related
* [PATCH v5 2/8] mm: apply_to_pte_range warn and fail if a large pte is encountered
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
In-Reply-To: <20200821044427.736424-1-npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/memory.c | 60 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 44 insertions(+), 16 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index f95edbb77326..19986af291e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2261,13 +2261,20 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
}
do {
next = pmd_addr_end(addr, end);
- if (create || !pmd_none_or_clear_bad(pmd)) {
- err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
- create);
- if (err)
- break;
+ if (pmd_none(*pmd) && !create)
+ continue;
+ if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+ return -EINVAL;
+ if (WARN_ON_ONCE(pmd_bad(*pmd))) {
+ if (!create)
+ continue;
+ pmd_clear_bad(pmd);
}
+ err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create);
+ if (err)
+ break;
} while (pmd++, addr = next, addr != end);
+
return err;
}
@@ -2288,13 +2295,20 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
}
do {
next = pud_addr_end(addr, end);
- if (create || !pud_none_or_clear_bad(pud)) {
- err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
- create);
- if (err)
- break;
+ if (pud_none(*pud) && !create)
+ continue;
+ if (WARN_ON_ONCE(pud_leaf(*pud)))
+ return -EINVAL;
+ if (WARN_ON_ONCE(pud_bad(*pud))) {
+ if (!create)
+ continue;
+ pud_clear_bad(pud);
}
+ err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create);
+ if (err)
+ break;
} while (pud++, addr = next, addr != end);
+
return err;
}
@@ -2315,13 +2329,20 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
}
do {
next = p4d_addr_end(addr, end);
- if (create || !p4d_none_or_clear_bad(p4d)) {
- err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
- create);
- if (err)
- break;
+ if (p4d_none(*p4d) && !create)
+ continue;
+ if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+ return -EINVAL;
+ if (WARN_ON_ONCE(p4d_bad(*p4d))) {
+ if (!create)
+ continue;
+ p4d_clear_bad(p4d);
}
+ err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create);
+ if (err)
+ break;
} while (p4d++, addr = next, addr != end);
+
return err;
}
@@ -2340,8 +2361,15 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
- if (!create && pgd_none_or_clear_bad(pgd))
+ if (pgd_none(*pgd) && !create)
continue;
+ if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+ return -EINVAL;
+ if (WARN_ON_ONCE(pgd_bad(*pgd))) {
+ if (!create)
+ continue;
+ pgd_clear_bad(pgd);
+ }
err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create);
if (err)
break;
--
2.23.0
^ permalink raw reply related
* [PATCH v5 3/8] mm/vmalloc: rename vmap_*_range vmap_pages_*_range
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
In-Reply-To: <20200821044427.736424-1-npiggin@gmail.com>
The vmalloc mapper operates on a struct page * array rather than a
linear physical address, re-name it to make this distinction clear.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/vmalloc.c | 28 ++++++++++++----------------
1 file changed, 12 insertions(+), 16 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 49f225b0f855..3a1e45fd1626 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -190,9 +190,8 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
arch_sync_kernel_mappings(start, end);
}
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
{
pte_t *pte;
@@ -218,9 +217,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
-static int vmap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
@@ -230,15 +228,14 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
}
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
@@ -248,15 +245,14 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
}
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
@@ -266,7 +262,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
- if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -307,7 +303,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
- err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+ err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
--
2.23.0
^ permalink raw reply related
* [PATCH v5 4/8] lib/ioremap: rename ioremap_*_range to vmap_*_range
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
In-Reply-To: <20200821044427.736424-1-npiggin@gmail.com>
This will be moved to mm/ and used as a generic kernel virtual mapping
function, so re-name it in preparation.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/ioremap.c | 55 ++++++++++++++++++++++------------------------------
1 file changed, 23 insertions(+), 32 deletions(-)
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..6016ae3227ad 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -61,9 +61,8 @@ static inline int ioremap_pud_enabled(void) { return 0; }
static inline int ioremap_pmd_enabled(void) { return 0; }
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
-static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- pgtbl_mod_mask *mask)
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
{
pte_t *pte;
u64 pfn;
@@ -81,9 +80,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
-static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr,
- pgprot_t prot)
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot)
{
if (!ioremap_pmd_enabled())
return 0;
@@ -103,9 +101,8 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
return pmd_set_huge(pmd, phys_addr, prot);
}
-static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- pgtbl_mod_mask *mask)
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
@@ -116,20 +113,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
do {
next = pmd_addr_end(addr, end);
- if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
- if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+ if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
-static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr,
- pgprot_t prot)
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot)
{
if (!ioremap_pud_enabled())
return 0;
@@ -149,9 +145,8 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
return pud_set_huge(pud, phys_addr, prot);
}
-static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- pgtbl_mod_mask *mask)
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
@@ -162,20 +157,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
do {
next = pud_addr_end(addr, end);
- if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
- if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+ if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
-static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr,
- pgprot_t prot)
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot)
{
if (!ioremap_p4d_enabled())
return 0;
@@ -195,9 +189,8 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
return p4d_set_huge(p4d, phys_addr, prot);
}
-static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- pgtbl_mod_mask *mask)
+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
@@ -208,19 +201,18 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
do {
next = p4d_addr_end(addr, end);
- if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
*mask |= PGTBL_P4D_MODIFIED;
continue;
}
- if (ioremap_pud_range(p4d, addr, next, phys_addr, prot, mask))
+ if (vmap_pud_range(p4d, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
-int ioremap_page_range(unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pgd_t *pgd;
unsigned long start;
@@ -235,8 +227,7 @@ int ioremap_page_range(unsigned long addr,
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
- &mask);
+ err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, &mask);
if (err)
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
@@ -272,7 +263,7 @@ void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot)
return NULL;
vaddr = (unsigned long)area->addr;
- if (ioremap_page_range(vaddr, vaddr + size, addr, __pgprot(prot))) {
+ if (vmap_page_range(vaddr, vaddr + size, addr, __pgprot(prot))) {
free_vm_area(area);
return NULL;
}
--
2.23.0
^ permalink raw reply related
* [PATCH v5 5/8] mm: HUGE_VMAP arch support cleanup
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
In-Reply-To: <20200821044427.736424-1-npiggin@gmail.com>
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.
This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.
This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/arm64/mm/mmu.c | 12 +--
arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++-
arch/x86/mm/ioremap.c | 12 +--
include/linux/io.h | 9 ---
include/linux/vmalloc.h | 10 +++
init/main.c | 1 -
mm/ioremap.c | 96 +++++++++++-------------
7 files changed, 73 insertions(+), 77 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..bbb3ccf6a7ce 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1304,12 +1304,13 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
return dt_virt;
}
-int __init arch_ioremap_p4d_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot)
{
- return 0;
+ return false;
}
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
{
/*
* Only 4k granule supports level 1 block mappings.
@@ -1319,11 +1320,12 @@ int __init arch_ioremap_pud_supported(void)
!IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
{
- /* See arch_ioremap_pud_supported() */
+ /* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}
+#endif
int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
{
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index ae823bba29f2..7d3a620c5adf 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1182,13 +1182,14 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
}
-int __init arch_ioremap_pud_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_pud_supported(pgprot_t prot)
{
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
}
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
{
return radix_enabled();
}
@@ -1197,6 +1198,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
{
return 0;
}
+#endif
int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
{
@@ -1282,7 +1284,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
return 1;
}
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
{
- return 0;
+ return false;
}
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 84d85dbd1dad..5b8b495ab4ed 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,24 +481,26 @@ void iounmap(volatile void __iomem *addr)
}
EXPORT_SYMBOL(iounmap);
-int __init arch_ioremap_p4d_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot)
{
- return 0;
+ return false;
}
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
{
#ifdef CONFIG_X86_64
return boot_cpu_has(X86_FEATURE_GBPAGES);
#else
- return 0;
+ return false;
#endif
}
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
{
return boot_cpu_has(X86_FEATURE_PSE);
}
+#endif
/*
* Convert a physical pointer to a virtual kernel pointer for /dev/mem
diff --git a/include/linux/io.h b/include/linux/io.h
index 8394c56babc2..f1effd4d7a3c 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -31,15 +31,6 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end,
}
#endif
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-void __init ioremap_huge_init(void);
-int arch_ioremap_p4d_supported(void);
-int arch_ioremap_pud_supported(void);
-int arch_ioremap_pmd_supported(void);
-#else
-static inline void ioremap_huge_init(void) { }
-#endif
-
/*
* Managed iomap interface
*/
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 0221f852a7e1..787d77ad7536 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -84,6 +84,16 @@ struct vmap_area {
};
};
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#else
+static inline bool arch_vmap_p4d_supported(pgprot_t prot) { return false; }
+static inline bool arch_vmap_pud_supported(pgprot_t prot) { return false; }
+static inline bool arch_vmap_pmd_supported(pgprot_t prot) { return false; }
+#endif
+
/*
* Highlevel APIs for driver use
*/
diff --git a/init/main.c b/init/main.c
index ae78fb68d231..1c89aa127b8f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -820,7 +820,6 @@ static void __init mm_init(void)
pgtable_init();
debug_objects_mem_init();
vmalloc_init();
- ioremap_huge_init();
/* Should be run before the first non-init thread is created */
init_espfix_bsp();
/* Should be run after espfix64 is set up. */
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 6016ae3227ad..b0032dbadaf7 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -16,49 +16,16 @@
#include "pgalloc-track.h"
#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static int __read_mostly ioremap_p4d_capable;
-static int __read_mostly ioremap_pud_capable;
-static int __read_mostly ioremap_pmd_capable;
-static int __read_mostly ioremap_huge_disabled;
+static bool __ro_after_init iomap_allow_huge = true;
static int __init set_nohugeiomap(char *str)
{
- ioremap_huge_disabled = 1;
+ iomap_allow_huge = false;
return 0;
}
early_param("nohugeiomap", set_nohugeiomap);
-
-void __init ioremap_huge_init(void)
-{
- if (!ioremap_huge_disabled) {
- if (arch_ioremap_p4d_supported())
- ioremap_p4d_capable = 1;
- if (arch_ioremap_pud_supported())
- ioremap_pud_capable = 1;
- if (arch_ioremap_pmd_supported())
- ioremap_pmd_capable = 1;
- }
-}
-
-static inline int ioremap_p4d_enabled(void)
-{
- return ioremap_p4d_capable;
-}
-
-static inline int ioremap_pud_enabled(void)
-{
- return ioremap_pud_capable;
-}
-
-static inline int ioremap_pmd_enabled(void)
-{
- return ioremap_pmd_capable;
-}
-
-#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
-static inline int ioremap_p4d_enabled(void) { return 0; }
-static inline int ioremap_pud_enabled(void) { return 0; }
-static inline int ioremap_pmd_enabled(void) { return 0; }
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static const bool iomap_allow_huge = false;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
@@ -81,9 +48,12 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}
static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
{
- if (!ioremap_pmd_enabled())
+ if (max_page_shift < PMD_SHIFT)
+ return 0;
+
+ if (!arch_vmap_pmd_supported(prot))
return 0;
if ((end - addr) != PMD_SIZE)
@@ -102,7 +72,8 @@ static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
}
static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
@@ -113,7 +84,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
do {
next = pmd_addr_end(addr, end);
- if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
@@ -125,9 +96,12 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
}
static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
{
- if (!ioremap_pud_enabled())
+ if (max_page_shift < PUD_SHIFT)
+ return 0;
+
+ if (!arch_vmap_pud_supported(prot))
return 0;
if ((end - addr) != PUD_SIZE)
@@ -146,7 +120,8 @@ static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
}
static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
@@ -157,21 +132,24 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
do {
next = pud_addr_end(addr, end);
- if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
- if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+ if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
{
- if (!ioremap_p4d_enabled())
+ if (max_page_shift < P4D_SHIFT)
+ return 0;
+
+ if (!arch_vmap_p4d_supported(prot))
return 0;
if ((end - addr) != P4D_SIZE)
@@ -190,7 +168,8 @@ static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
}
static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
@@ -201,18 +180,19 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
do {
next = p4d_addr_end(addr, end);
- if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+ if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
*mask |= PGTBL_P4D_MODIFIED;
continue;
}
- if (vmap_pud_range(p4d, addr, next, phys_addr, prot, mask))
+ if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
return -ENOMEM;
} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
}
-int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+static int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift)
{
pgd_t *pgd;
unsigned long start;
@@ -227,7 +207,7 @@ int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_a
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, &mask);
+ err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
if (err)
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
@@ -240,6 +220,16 @@ int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_a
return err;
}
+int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+ unsigned int max_page_shift = PAGE_SHIFT;
+
+ if (iomap_allow_huge)
+ max_page_shift = P4D_SHIFT;
+
+ return vmap_range(addr, end, phys_addr, prot, max_page_shift);
+}
+
#ifdef CONFIG_GENERIC_IOREMAP
void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot)
{
--
2.23.0
^ permalink raw reply related
* [PATCH v5 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
In-Reply-To: <20200821044427.736424-1-npiggin@gmail.com>
This is a generic kernel virtual memory mapper, not specific to ioremap.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
include/linux/vmalloc.h | 2 +
mm/ioremap.c | 192 ----------------------------------------
mm/vmalloc.c | 191 +++++++++++++++++++++++++++++++++++++++
3 files changed, 193 insertions(+), 192 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 787d77ad7536..e3590e93bfff 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
extern struct vm_struct *find_vm_area(const void *addr);
#ifdef CONFIG_MMU
+extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift);
extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index b0032dbadaf7..cdda0e022740 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,198 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
static const bool iomap_allow_huge = false;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
-{
- pte_t *pte;
- u64 pfn;
-
- pfn = phys_addr >> PAGE_SHIFT;
- pte = pte_alloc_kernel_track(pmd, addr, mask);
- if (!pte)
- return -ENOMEM;
- do {
- BUG_ON(!pte_none(*pte));
- set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
- pfn++;
- } while (pte++, addr += PAGE_SIZE, addr != end);
- *mask |= PGTBL_PTE_MODIFIED;
- return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
- if (max_page_shift < PMD_SHIFT)
- return 0;
-
- if (!arch_vmap_pmd_supported(prot))
- return 0;
-
- if ((end - addr) != PMD_SIZE)
- return 0;
-
- if (!IS_ALIGNED(addr, PMD_SIZE))
- return 0;
-
- if (!IS_ALIGNED(phys_addr, PMD_SIZE))
- return 0;
-
- if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
- return 0;
-
- return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
- pgtbl_mod_mask *mask)
-{
- pmd_t *pmd;
- unsigned long next;
-
- pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
- if (!pmd)
- return -ENOMEM;
- do {
- next = pmd_addr_end(addr, end);
-
- if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
- *mask |= PGTBL_PMD_MODIFIED;
- continue;
- }
-
- if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
- return -ENOMEM;
- } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
- return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
- if (max_page_shift < PUD_SHIFT)
- return 0;
-
- if (!arch_vmap_pud_supported(prot))
- return 0;
-
- if ((end - addr) != PUD_SIZE)
- return 0;
-
- if (!IS_ALIGNED(addr, PUD_SIZE))
- return 0;
-
- if (!IS_ALIGNED(phys_addr, PUD_SIZE))
- return 0;
-
- if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
- return 0;
-
- return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
- pgtbl_mod_mask *mask)
-{
- pud_t *pud;
- unsigned long next;
-
- pud = pud_alloc_track(&init_mm, p4d, addr, mask);
- if (!pud)
- return -ENOMEM;
- do {
- next = pud_addr_end(addr, end);
-
- if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
- *mask |= PGTBL_PUD_MODIFIED;
- continue;
- }
-
- if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
- return -ENOMEM;
- } while (pud++, phys_addr += (next - addr), addr = next, addr != end);
- return 0;
-}
-
-static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
- if (max_page_shift < P4D_SHIFT)
- return 0;
-
- if (!arch_vmap_p4d_supported(prot))
- return 0;
-
- if ((end - addr) != P4D_SIZE)
- return 0;
-
- if (!IS_ALIGNED(addr, P4D_SIZE))
- return 0;
-
- if (!IS_ALIGNED(phys_addr, P4D_SIZE))
- return 0;
-
- if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
- return 0;
-
- return p4d_set_huge(p4d, phys_addr, prot);
-}
-
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
- phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
- pgtbl_mod_mask *mask)
-{
- p4d_t *p4d;
- unsigned long next;
-
- p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
- if (!p4d)
- return -ENOMEM;
- do {
- next = p4d_addr_end(addr, end);
-
- if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
- *mask |= PGTBL_P4D_MODIFIED;
- continue;
- }
-
- if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
- return -ENOMEM;
- } while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
- return 0;
-}
-
-static int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- unsigned int max_page_shift)
-{
- pgd_t *pgd;
- unsigned long start;
- unsigned long next;
- int err;
- pgtbl_mod_mask mask = 0;
-
- might_sleep();
- BUG_ON(addr >= end);
-
- start = addr;
- pgd = pgd_offset_k(addr);
- do {
- next = pgd_addr_end(addr, end);
- err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
- if (err)
- break;
- } while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
-
- flush_cache_vmap(start, end);
-
- if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
- arch_sync_kernel_mappings(start, end);
-
- return err;
-}
-
int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
unsigned int max_page_shift = PAGE_SHIFT;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3a1e45fd1626..129f10545bb1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -71,6 +71,197 @@ static void free_work(struct work_struct *w)
}
/*** Page table manipulation functions ***/
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+{
+ pte_t *pte;
+ u64 pfn;
+
+ pfn = phys_addr >> PAGE_SHIFT;
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
+ if (!pte)
+ return -ENOMEM;
+ do {
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
+ pfn++;
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
+ return 0;
+}
+
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+ if (max_page_shift < PMD_SHIFT)
+ return 0;
+
+ if (!arch_vmap_pmd_supported(prot))
+ return 0;
+
+ if ((end - addr) != PMD_SIZE)
+ return 0;
+
+ if (!IS_ALIGNED(addr, PMD_SIZE))
+ return 0;
+
+ if (!IS_ALIGNED(phys_addr, PMD_SIZE))
+ return 0;
+
+ if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
+ return 0;
+
+ return pmd_set_huge(pmd, phys_addr, prot);
+}
+
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
+{
+ pmd_t *pmd;
+ unsigned long next;
+
+ pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
+ if (!pmd)
+ return -ENOMEM;
+ do {
+ next = pmd_addr_end(addr, end);
+
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
+ *mask |= PGTBL_PMD_MODIFIED;
+ continue;
+ }
+
+ if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+ return -ENOMEM;
+ } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
+ return 0;
+}
+
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+ if (max_page_shift < PUD_SHIFT)
+ return 0;
+
+ if (!arch_vmap_pud_supported(prot))
+ return 0;
+
+ if ((end - addr) != PUD_SIZE)
+ return 0;
+
+ if (!IS_ALIGNED(addr, PUD_SIZE))
+ return 0;
+
+ if (!IS_ALIGNED(phys_addr, PUD_SIZE))
+ return 0;
+
+ if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
+ return 0;
+
+ return pud_set_huge(pud, phys_addr, prot);
+}
+
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
+{
+ pud_t *pud;
+ unsigned long next;
+
+ pud = pud_alloc_track(&init_mm, p4d, addr, mask);
+ if (!pud)
+ return -ENOMEM;
+ do {
+ next = pud_addr_end(addr, end);
+
+ if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
+ *mask |= PGTBL_PUD_MODIFIED;
+ continue;
+ }
+
+ if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
+ return -ENOMEM;
+ } while (pud++, phys_addr += (next - addr), addr = next, addr != end);
+ return 0;
+}
+
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+ if (max_page_shift < P4D_SHIFT)
+ return 0;
+
+ if (!arch_vmap_p4d_supported(prot))
+ return 0;
+
+ if ((end - addr) != P4D_SIZE)
+ return 0;
+
+ if (!IS_ALIGNED(addr, P4D_SIZE))
+ return 0;
+
+ if (!IS_ALIGNED(phys_addr, P4D_SIZE))
+ return 0;
+
+ if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
+ return 0;
+
+ return p4d_set_huge(p4d, phys_addr, prot);
+}
+
+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+ pgtbl_mod_mask *mask)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+
+ if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
+ *mask |= PGTBL_P4D_MODIFIED;
+ continue;
+ }
+
+ if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
+ return -ENOMEM;
+ } while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
+ return 0;
+}
+
+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift)
+{
+ pgd_t *pgd;
+ unsigned long start;
+ unsigned long next;
+ int err;
+ pgtbl_mod_mask mask = 0;
+
+ might_sleep();
+ BUG_ON(addr >= end);
+
+ start = addr;
+ pgd = pgd_offset_k(addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
+ if (err)
+ break;
+ } while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
+
+ flush_cache_vmap(start, end);
+
+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
+
+ return err;
+}
static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pgtbl_mod_mask *mask)
--
2.23.0
^ permalink raw reply related
* [PATCH v5 7/8] mm/vmalloc: add vmap_range_noflush variant
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
In-Reply-To: <20200821044427.736424-1-npiggin@gmail.com>
As a side-effect, the order of flush_cache_vmap() and
arch_sync_kernel_mappings() calls are switched, but that now matches
the other callers in this file.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/vmalloc.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 129f10545bb1..4e5cb7c7f780 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -234,8 +234,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
return 0;
}
-int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
- unsigned int max_page_shift)
+static int vmap_range_noflush(unsigned long addr, unsigned long end, phys_addr_t phys_addr,
+ pgprot_t prot, unsigned int max_page_shift)
{
pgd_t *pgd;
unsigned long start;
@@ -255,14 +255,23 @@ int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgp
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
- flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
return err;
}
+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift)
+{
+ int err;
+
+ err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+ flush_cache_vmap(addr, end);
+
+ return err;
+}
+
static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pgtbl_mod_mask *mask)
{
--
2.23.0
^ permalink raw reply related
* [PATCH v5 8/8] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2020-08-21 4:44 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: linux-arch, linux-kernel, Nicholas Piggin, Zefan Li,
Jonathan Cameron, linuxppc-dev
In-Reply-To: <20200821044427.736424-1-npiggin@gmail.com>
On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps,
vmalloc will attempt to allocate PMD-sized pages first, before falling
back to small pages.
Allocations which use something other than PAGE_KERNEL protections are
not permitted to use huge pages yet, not all callers expect this (e.g.,
module allocations vs strict module rwx).
This reduces TLB misses by nearly 30x on a `git diff` workload on a
2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
This can result in more internal fragmentation and memory overhead for a
given allocation, an option nohugevmalloc is added to disable at boot.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
.../admin-guide/kernel-parameters.txt | 2 +
arch/Kconfig | 4 +
arch/powerpc/Kconfig | 1 +
include/linux/vmalloc.h | 1 +
mm/page_alloc.c | 4 +-
mm/vmalloc.c | 188 +++++++++++++-----
6 files changed, 152 insertions(+), 48 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..6f0b41289a90 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,8 @@
nohugeiomap [KNL,X86,PPC] Disable kernel huge I/O mappings.
+ nohugevmalloc [PPC] Disable kernel huge vmalloc mappings.
+
nosmt [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..b2b89d629317 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -616,6 +616,10 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
config HAVE_ARCH_HUGE_VMAP
bool
+config HAVE_ARCH_HUGE_VMALLOC
+ depends on HAVE_ARCH_HUGE_VMAP
+ bool
+
config ARCH_WANT_HUGE_PMD_SHARE
bool
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 95dfd8ef3d4b..044e5a94967a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -175,6 +175,7 @@ config PPC
select GENERIC_TIME_VSYSCALL
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU
+ select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_KASAN if PPC32 && PPC_PAGE_SHIFT <= 14
select HAVE_ARCH_KASAN_VMALLOC if PPC32 && PPC_PAGE_SHIFT <= 14
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e3590e93bfff..8f25dbaca0a1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,7 @@ struct vm_struct {
unsigned long size;
unsigned long flags;
struct page **pages;
+ unsigned int page_order;
unsigned int nr_pages;
phys_addr_t phys_addr;
const void *caller;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bab486fea..d785e5335529 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8102,6 +8102,7 @@ void *__init alloc_large_system_hash(const char *tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+ bool huge;
/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8169,6 +8170,7 @@ void *__init alloc_large_system_hash(const char *tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+ huge = (find_vm_area(table)->page_order > 0);
} else {
/*
* If bucketsize is not a power-of-two, we may free
@@ -8185,7 +8187,7 @@ void *__init alloc_large_system_hash(const char *tablename,
pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
- virt ? "vmalloc" : "linear");
+ virt ? (huge ? "vmalloc hugepage" : "vmalloc") : "linear");
if (_hash_shift)
*_hash_shift = log2qty;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4e5cb7c7f780..564d7497e551 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -45,6 +45,19 @@
#include "internal.h"
#include "pgalloc-track.h"
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmalloc(char *str)
+{
+ vmap_allow_huge = false;
+ return 0;
+}
+early_param("nohugevmalloc", set_nohugevmalloc);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+static const bool vmap_allow_huge = false;
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+
bool is_vmalloc_addr(const void *x)
{
unsigned long addr = (unsigned long)x;
@@ -468,31 +481,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long en
return 0;
}
-/**
- * map_kernel_range_noflush - map kernel VM area with the specified pages
- * @addr: start of the VM area to map
- * @size: size of the VM area to map
- * @prot: page protection flags to use
- * @pages: pages to map
- *
- * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should
- * have been allocated using get_vm_area() and its friends.
- *
- * NOTE:
- * This function does NOT do any cache flushing. The caller is responsible for
- * calling flush_cache_vmap() on to-be-mapped areas before calling this
- * function.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int map_kernel_range_noflush(unsigned long addr, unsigned long size,
- pgprot_t prot, struct page **pages)
+static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages)
{
unsigned long start = addr;
- unsigned long end = addr + size;
- unsigned long next;
pgd_t *pgd;
+ unsigned long next;
int err = 0;
int nr = 0;
pgtbl_mod_mask mask = 0;
@@ -514,6 +508,65 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
return 0;
}
+static int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+ WARN_ON(page_shift < PAGE_SHIFT);
+
+ if (page_shift == PAGE_SHIFT) {
+ return vmap_small_pages_range_noflush(addr, end, prot, pages);
+ } else {
+ unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
+
+ for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
+ int err;
+
+ err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+ __pa(page_address(pages[i])), prot, page_shift);
+ if (err)
+ return err;
+
+ addr += 1UL << page_shift;
+ }
+
+ return 0;
+ }
+}
+
+static int vmap_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+ int err;
+
+ err = vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
+ flush_cache_vmap(addr, end);
+ return err;
+}
+
+/**
+ * map_kernel_range_noflush - map kernel VM area with the specified pages
+ * @addr: start of the VM area to map
+ * @size: size of the VM area to map
+ * @prot: page protection flags to use
+ * @pages: pages to map
+ *
+ * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should
+ * have been allocated using get_vm_area() and its friends.
+ *
+ * NOTE:
+ * This function does NOT do any cache flushing. The caller is responsible for
+ * calling flush_cache_vmap() on to-be-mapped areas before calling this
+ * function.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int map_kernel_range_noflush(unsigned long addr, unsigned long size,
+ pgprot_t prot, struct page **pages)
+{
+ return vmap_pages_range_noflush(addr, addr + size, prot, pages, PAGE_SHIFT);
+}
+
int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
struct page **pages)
{
@@ -2274,9 +2327,11 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
if (unlikely(!size))
return NULL;
- if (flags & VM_IOREMAP)
- align = 1ul << clamp_t(int, get_count_order_long(size),
- PAGE_SHIFT, IOREMAP_MAX_ORDER);
+ if (flags & VM_IOREMAP) {
+ align = max(align,
+ 1ul << clamp_t(int, get_count_order_long(size),
+ PAGE_SHIFT, IOREMAP_MAX_ORDER));
+ }
area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
if (unlikely(!area))
@@ -2391,6 +2446,7 @@ static inline void set_area_direct_map(const struct vm_struct *area,
{
int i;
+ /* HUGE_VMALLOC passes small pages to set_direct_map */
for (i = 0; i < area->nr_pages; i++)
if (page_address(area->pages[i]))
set_direct_map(area->pages[i]);
@@ -2424,11 +2480,12 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
* map. Find the start and end range of the direct mappings to make sure
* the vm_unmap_aliases() flush includes the direct map.
*/
- for (i = 0; i < area->nr_pages; i++) {
+ for (i = 0; i < area->nr_pages; i += 1U << area->page_order) {
unsigned long addr = (unsigned long)page_address(area->pages[i]);
if (addr) {
+ unsigned long page_size = PAGE_SIZE << area->page_order;
start = min(addr, start);
- end = max(addr + PAGE_SIZE, end);
+ end = max(addr + page_size, end);
flush_dmap = 1;
}
}
@@ -2471,11 +2528,11 @@ static void __vunmap(const void *addr, int deallocate_pages)
if (deallocate_pages) {
int i;
- for (i = 0; i < area->nr_pages; i++) {
+ for (i = 0; i < area->nr_pages; i += 1U << area->page_order) {
struct page *page = area->pages[i];
BUG_ON(!page);
- __free_pages(page, 0);
+ __free_pages(page, area->page_order);
}
atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
@@ -2614,9 +2671,12 @@ void *vmap(struct page **pages, unsigned int count,
EXPORT_SYMBOL(vmap);
static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
- pgprot_t prot, int node)
+ pgprot_t prot, unsigned int page_shift, int node)
{
struct page **pages;
+ unsigned long addr = (unsigned long)area->addr;
+ unsigned long size = get_vm_area_size(area);
+ unsigned int page_order = page_shift - PAGE_SHIFT;
unsigned int nr_pages, array_size, i;
const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
@@ -2624,7 +2684,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
0 :
__GFP_HIGHMEM;
- nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
+ nr_pages = size >> PAGE_SHIFT;
array_size = (nr_pages * sizeof(struct page *));
/* Please note that the recursion is strictly bounded. */
@@ -2643,29 +2703,29 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
area->pages = pages;
area->nr_pages = nr_pages;
+ area->page_order = page_order;
- for (i = 0; i < area->nr_pages; i++) {
+ for (i = 0; i < area->nr_pages; i += 1U << page_order) {
struct page *page;
+ int p;
- if (node == NUMA_NO_NODE)
- page = alloc_page(alloc_mask|highmem_mask);
- else
- page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
-
+ page = alloc_pages_node(node, alloc_mask|highmem_mask, page_order);
if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vunmap() */
area->nr_pages = i;
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
goto fail;
}
- area->pages[i] = page;
+
+ for (p = 0; p < (1U << page_order); p++)
+ area->pages[i + p] = page + p;
+
if (gfpflags_allow_blocking(gfp_mask))
cond_resched();
}
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
- if (map_kernel_range((unsigned long)area->addr, get_vm_area_size(area),
- prot, pages) < 0)
+ if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0)
goto fail;
return area->addr;
@@ -2673,7 +2733,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
fail:
warn_alloc(gfp_mask, NULL,
"vmalloc: allocation failure, allocated %ld of %ld bytes",
- (area->nr_pages*PAGE_SIZE), area->size);
+ (area->nr_pages*PAGE_SIZE), size);
__vfree(area->addr);
return NULL;
}
@@ -2704,19 +2764,42 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
struct vm_struct *area;
void *addr;
unsigned long real_size = size;
+ unsigned long real_align = align;
+ unsigned int shift = PAGE_SHIFT;
- size = PAGE_ALIGN(size);
if (!size || (size >> PAGE_SHIFT) > totalram_pages())
goto fail;
- area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED |
+ if (vmap_allow_huge && (pgprot_val(prot) == pgprot_val(PAGE_KERNEL))) {
+ unsigned long size_per_node;
+
+ /*
+ * Try huge pages. Only try for PAGE_KERNEL allocations,
+ * others like modules don't yet expect huge pages in
+ * their allocations due to apply_to_page_range not
+ * supporting them.
+ */
+
+ size_per_node = size;
+ if (node == NUMA_NO_NODE)
+ size_per_node /= num_online_nodes();
+ if (size_per_node >= PMD_SIZE) {
+ shift = PMD_SHIFT;
+ align = max(real_align, 1UL << shift);
+ size = ALIGN(real_size, 1UL << shift);
+ }
+ }
+
+again:
+ size = PAGE_ALIGN(size);
+ area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
vm_flags, start, end, node, gfp_mask, caller);
if (!area)
goto fail;
- addr = __vmalloc_area_node(area, gfp_mask, prot, node);
+ addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
if (!addr)
- return NULL;
+ goto fail;
/*
* In this function, newly allocated vm_struct has VM_UNINITIALIZED
@@ -2730,8 +2813,19 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
return addr;
fail:
- warn_alloc(gfp_mask, NULL,
+ if (shift > PAGE_SHIFT) {
+ free_vm_area(area);
+ shift = PAGE_SHIFT;
+ align = real_align;
+ size = real_size;
+ goto again;
+ }
+
+ if (!area) {
+ /* Warn for area allocation, page allocations already warn */
+ warn_alloc(gfp_mask, NULL,
"vmalloc: allocation failure: %lu bytes", real_size);
+ }
return NULL;
}
@@ -3730,7 +3824,7 @@ static int s_show(struct seq_file *m, void *p)
seq_printf(m, " %pS", v->caller);
if (v->nr_pages)
- seq_printf(m, " pages=%d", v->nr_pages);
+ seq_printf(m, " pages=%d order=%d", v->nr_pages, v->page_order);
if (v->phys_addr)
seq_printf(m, " phys=%pa", &v->phys_addr);
--
2.23.0
^ permalink raw reply related
* [powerpc:merge] BUILD SUCCESS 7c25bda14d66718f9fa428808dae289dd84f1da3
From: kernel test robot @ 2020-08-21 5:00 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 7c25bda14d66718f9fa428808dae289dd84f1da3 Automatic merge of 'master', 'next' and 'fixes' (2020-08-20 23:20)
elapsed time: 926m
configs tested: 69
configs skipped: 2
The following configs have been built successfully.
More configs may be tested in the coming days.
arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
m68k m5275evb_defconfig
arm keystone_defconfig
s390 alldefconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k allmodconfig
m68k defconfig
m68k allyesconfig
nios2 defconfig
arc allyesconfig
nds32 allnoconfig
c6x allyesconfig
nds32 defconfig
nios2 allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
s390 defconfig
i386 allyesconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
mips allyesconfig
mips allmodconfig
powerpc allyesconfig
powerpc allmodconfig
powerpc allnoconfig
powerpc defconfig
i386 randconfig-a002-20200820
i386 randconfig-a004-20200820
i386 randconfig-a005-20200820
i386 randconfig-a003-20200820
i386 randconfig-a006-20200820
i386 randconfig-a001-20200820
x86_64 randconfig-a015-20200820
x86_64 randconfig-a012-20200820
x86_64 randconfig-a016-20200820
x86_64 randconfig-a014-20200820
x86_64 randconfig-a011-20200820
x86_64 randconfig-a013-20200820
i386 randconfig-a013-20200820
i386 randconfig-a012-20200820
i386 randconfig-a011-20200820
i386 randconfig-a016-20200820
i386 randconfig-a014-20200820
i386 randconfig-a015-20200820
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 allyesconfig
x86_64 rhel-7.6-kselftests
x86_64 defconfig
x86_64 rhel-8.3
x86_64 kexec
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox