* Re: [PATCH 2/3] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Michael Ellerman @ 2021-10-28 11:33 UTC (permalink / raw)
To: Christophe Leroy, Nicholas Piggin, Benjamin Herrenschmidt,
Paul Mackerras
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <063e72e1-fc05-7783-9f42-f681dd08a4b2@csgroup.eu>
Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Le 27/10/2021 à 06:44, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of October 26, 2021 3:39 pm:
>>> set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
>>> set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.
>>>
>>> Book3e has 2 bits, UX and SX, which defines the exec rights
>>> resp. for user (PR=1) and for kernel (PR=0).
>>>
>>> _PAGE_EXEC is defined as UX only.
>>>
>>> An executable kernel page is set with either _PAGE_KERNEL_RWX
>>> or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
>>>
>>> So set_memory_nx() call for an executable kernel page does
>>> nothing because UX is already cleared.
>>>
>>> And set_memory_x() on a non-executable kernel page makes it
>>> executable for the user and keeps it non-executable for kernel.
>>>
>>> Also, pte_exec() always returns 'false' on kernel pages, because
>>> it checks _PAGE_EXEC which doesn't include SX, so for instance
>>> the W+X check doesn't work.
>>>
>>> To fix this:
>>> - change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
>>> - sets both UX and SX in _PAGE_EXEC so that pte_user() returns
>>> true whenever one of the two bits is set
>>
>> I don't understand this change. Which pte_user() returns true after
>> this change? Or do you mean pte_exec()?
>
> Oops, yes, I mean pte_exec()
>
> Unless I have to re-spin, can Michael eventually fix that typo while
> applying ?
I did.
cheers
^ permalink raw reply
* Re: [PATCH v2 24/45] regulator: pfuze100: Use devm_register_power_handler()
From: Mark Brown @ 2021-10-28 11:59 UTC (permalink / raw)
To: Dmitry Osipenko
Cc: Ulf Hansson, Rich Felker, linux-ia64, Tomer Maimon,
Santosh Shilimkar, Rafael J . Wysocki, Boris Ostrovsky,
Catalin Marinas, Linus Walleij, Dave Hansen, x86, Tali Perry,
James E.J. Bottomley, Thierry Reding, Guo Ren, Pavel Machek,
H. Peter Anvin, linux-riscv, Vincent Chen, Will Deacon,
Greg Ungerer, Stefano Stabellini, Benjamin Fair, Yoshinori Sato,
Krzysztof Kozlowski, linux-sh, Lee Jones, Helge Deller,
Daniel Lezcano, Russell King, linux-csky, Jonathan Hunter,
Tony Lindgren, Chen-Yu Tsai, Ingo Molnar, Geert Uytterhoeven,
xen-devel, linux-mips, Guenter Roeck, Len Brown, Albert Ou,
linux-omap, Jonathan Neuschäfer, Vladimir Zapolskiy,
linux-acpi, linux-m68k, Borislav Petkov, Greentime Hu,
Paul Walmsley, linux-tegra, Thomas Gleixner, Andy Shevchenko,
Nancy Yuen, linux-arm-kernel, Juergen Gross, Thomas Bogendoerfer,
linux-parisc, Nick Hu, Avi Fishman, Patrick Venture, linux-pm,
Liam Girdwood, linux-kernel, Palmer Dabbelt, Philipp Zabel,
Paul Mackerras, Andrew Morton, linuxppc-dev, openbmc,
Joshua Thompson
In-Reply-To: <20211027211715.12671-25-digetx@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 250 bytes --]
On Thu, Oct 28, 2021 at 12:16:54AM +0300, Dmitry Osipenko wrote:
> Use devm_register_power_handler() that replaces global pm_power_off_prepare
> variable and allows to register multiple power-off handlers.
Acked-by: Mark Brown <broonie@kernel.org>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* [PATCH v1 3/5] powerpc/ftrace: Add module_trampoline_target() for PPC32
From: Christophe Leroy @ 2021-10-28 12:24 UTC (permalink / raw)
To: Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Steven Rostedt, Ingo Molnar, Naveen N . Rao
Cc: live-patching, linuxppc-dev, linux-kernel
In-Reply-To: <cover.1635423081.git.christophe.leroy@csgroup.eu>
module_trampoline_target() is used by __ftrace_modify_call().
Implement it for PPC32 so that CONFIG_DYNAMIC_FTRACE_WITH_REGS
can be activated on PPC32 as well.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/kernel/module_32.c | 25 ++++++++++++++++++++
arch/powerpc/kernel/trace/ftrace.c | 37 ++++--------------------------
2 files changed, 29 insertions(+), 33 deletions(-)
diff --git a/arch/powerpc/kernel/module_32.c b/arch/powerpc/kernel/module_32.c
index f417afc08d33..5dedd76346b2 100644
--- a/arch/powerpc/kernel/module_32.c
+++ b/arch/powerpc/kernel/module_32.c
@@ -273,6 +273,31 @@ int apply_relocate_add(Elf32_Shdr *sechdrs,
}
#ifdef CONFIG_DYNAMIC_FTRACE
+int module_trampoline_target(struct module *mod, unsigned long addr,
+ unsigned long *target)
+{
+ unsigned int jmp[4];
+
+ /* Find where the trampoline jumps to */
+ if (copy_from_kernel_nofault(jmp, (void *)addr, sizeof(jmp)))
+ return -EFAULT;
+
+ /* verify that this is what we expect it to be */
+ if ((jmp[0] & 0xffff0000) != PPC_RAW_LIS(_R12, 0) ||
+ (jmp[1] & 0xffff0000) != PPC_RAW_ADDI(_R12, _R12, 0) ||
+ jmp[2] != PPC_RAW_MTCTR(_R12) ||
+ jmp[3] != PPC_RAW_BCTR())
+ return -EINVAL;
+
+ addr = (jmp[1] & 0xffff) | ((jmp[0] & 0xffff) << 16);
+ if (addr & 0x8000)
+ addr -= 0x10000;
+
+ *target = addr;
+
+ return 0;
+}
+
int module_finalize_ftrace(struct module *module, const Elf_Shdr *sechdrs)
{
module->arch.tramp = do_plt_call(module->core_layout.base,
diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index d89c5df4f206..c1d54c18e912 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -222,9 +222,8 @@ __ftrace_make_nop(struct module *mod,
struct dyn_ftrace *rec, unsigned long addr)
{
struct ppc_inst op;
- unsigned int jmp[4];
unsigned long ip = rec->ip;
- unsigned long tramp;
+ unsigned long tramp, ptr;
if (copy_from_kernel_nofault(&op, (void *)ip, MCOUNT_INSN_SIZE))
return -EFAULT;
@@ -238,41 +237,13 @@ __ftrace_make_nop(struct module *mod,
/* lets find where the pointer goes */
tramp = find_bl_target(ip, op);
- /*
- * On PPC32 the trampoline looks like:
- * 0x3d, 0x80, 0x00, 0x00 lis r12,sym@ha
- * 0x39, 0x8c, 0x00, 0x00 addi r12,r12,sym@l
- * 0x7d, 0x89, 0x03, 0xa6 mtctr r12
- * 0x4e, 0x80, 0x04, 0x20 bctr
- */
-
- pr_devel("ip:%lx jumps to %lx", ip, tramp);
-
/* Find where the trampoline jumps to */
- if (copy_from_kernel_nofault(jmp, (void *)tramp, sizeof(jmp))) {
- pr_err("Failed to read %lx\n", tramp);
+ if (module_trampoline_target(mod, tramp, &ptr)) {
+ pr_err("Failed to get trampoline target\n");
return -EFAULT;
}
- pr_devel(" %08x %08x ", jmp[0], jmp[1]);
-
- /* verify that this is what we expect it to be */
- if (((jmp[0] & 0xffff0000) != 0x3d800000) ||
- ((jmp[1] & 0xffff0000) != 0x398c0000) ||
- (jmp[2] != 0x7d8903a6) ||
- (jmp[3] != 0x4e800420)) {
- pr_err("Not a trampoline\n");
- return -EINVAL;
- }
-
- tramp = (jmp[1] & 0xffff) |
- ((jmp[0] & 0xffff) << 16);
- if (tramp & 0x8000)
- tramp -= 0x10000;
-
- pr_devel(" %lx ", tramp);
-
- if (tramp != addr) {
+ if (ptr != addr) {
pr_err("Trampoline location %08lx does not match addr\n",
tramp);
return -EINVAL;
--
2.31.1
^ permalink raw reply related
* [PATCH v1 0/5] Implement livepatch on PPC32
From: Christophe Leroy @ 2021-10-28 12:24 UTC (permalink / raw)
To: Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Steven Rostedt, Ingo Molnar, Naveen N . Rao
Cc: live-patching, linuxppc-dev, linux-kernel
This series implements livepatch on PPC32.
This is largely copied from what's done on PPC64.
Christophe Leroy (5):
livepatch: Fix build failure on 32 bits processors
powerpc/ftrace: No need to read LR from stack in _mcount()
powerpc/ftrace: Add module_trampoline_target() for PPC32
powerpc/ftrace: Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on PPC32
powerpc/ftrace: Add support for livepatch to PPC32
arch/powerpc/Kconfig | 2 +-
arch/powerpc/include/asm/livepatch.h | 4 +-
arch/powerpc/kernel/module_32.c | 33 +++++
arch/powerpc/kernel/trace/ftrace.c | 53 +++-----
arch/powerpc/kernel/trace/ftrace_32.S | 187 ++++++++++++++++++++++++--
kernel/livepatch/core.c | 4 +-
6 files changed, 230 insertions(+), 53 deletions(-)
--
2.31.1
^ permalink raw reply
* [PATCH v1 4/5] powerpc/ftrace: Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on PPC32
From: Christophe Leroy @ 2021-10-28 12:24 UTC (permalink / raw)
To: Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Steven Rostedt, Ingo Molnar, Naveen N . Rao
Cc: live-patching, linuxppc-dev, linux-kernel
In-Reply-To: <cover.1635423081.git.christophe.leroy@csgroup.eu>
Unlike PPC64, PPC32 doesn't require any special compiler option
to get _mcount() call not clobbering registers.
Provide ftrace_regs_caller() and ftrace_regs_call() and activate
HAVE_DYNAMIC_FTRACE_WITH_REGS.
That's heavily copied from ftrace_64_mprofile.S
For the time being leave livepatching aside, it will come with
following patch.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/Kconfig | 4 +-
arch/powerpc/kernel/module_32.c | 8 ++
arch/powerpc/kernel/trace/ftrace.c | 16 +++-
arch/powerpc/kernel/trace/ftrace_32.S | 109 ++++++++++++++++++++++++--
4 files changed, 125 insertions(+), 12 deletions(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e43e17987b92..f66eb1984b00 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -206,7 +206,7 @@ config PPC
select HAVE_DEBUG_KMEMLEAK
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_DYNAMIC_FTRACE
- select HAVE_DYNAMIC_FTRACE_WITH_REGS if MPROFILE_KERNEL
+ select HAVE_DYNAMIC_FTRACE_WITH_REGS if MPROFILE_KERNEL || PPC32
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS if !(CPU_LITTLE_ENDIAN && POWER7_CPU)
select HAVE_FAST_GUP
@@ -230,7 +230,7 @@ config PPC
select HAVE_KPROBES_ON_FTRACE
select HAVE_KRETPROBES
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION
- select HAVE_LIVEPATCH if HAVE_DYNAMIC_FTRACE_WITH_REGS
+ select HAVE_LIVEPATCH if HAVE_DYNAMIC_FTRACE_WITH_REGS && PPC64
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI if PERF_EVENTS || (PPC64 && PPC_BOOK3S)
select HAVE_OPTPROBES
diff --git a/arch/powerpc/kernel/module_32.c b/arch/powerpc/kernel/module_32.c
index 5dedd76346b2..a491ad481d85 100644
--- a/arch/powerpc/kernel/module_32.c
+++ b/arch/powerpc/kernel/module_32.c
@@ -306,6 +306,14 @@ int module_finalize_ftrace(struct module *module, const Elf_Shdr *sechdrs)
if (!module->arch.tramp)
return -ENOENT;
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+ module->arch.tramp_regs = do_plt_call(module->core_layout.base,
+ (unsigned long)ftrace_regs_caller,
+ sechdrs, module);
+ if (!module->arch.tramp_regs)
+ return -ENOENT;
+#endif
+
return 0;
}
#endif
diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index c1d54c18e912..faa0fa29ac20 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -561,6 +561,8 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
int err;
struct ppc_inst op;
u32 *ip = (u32 *)rec->ip;
+ struct module *mod = rec->arch.mod;
+ unsigned long tramp;
/* read where this goes */
if (copy_inst_from_kernel_nofault(&op, ip))
@@ -573,13 +575,23 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
}
/* If we never set up a trampoline to ftrace_caller, then bail */
- if (!rec->arch.mod->arch.tramp) {
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+ if (!mod->arch.tramp || !mod->arch.tramp_regs) {
+#else
+ if (!mod->arch.tramp) {
+#endif
pr_err("No ftrace trampoline\n");
return -EINVAL;
}
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+ if (rec->flags & FTRACE_FL_REGS)
+ tramp = mod->arch.tramp_regs;
+ else
+#endif
+ tramp = mod->arch.tramp;
/* create the branch to the trampoline */
- err = create_branch(&op, ip, rec->arch.mod->arch.tramp, BRANCH_SET_LINK);
+ err = create_branch(&op, ip, tramp, BRANCH_SET_LINK);
if (err) {
pr_err("REL24 out of range!\n");
return -EINVAL;
diff --git a/arch/powerpc/kernel/trace/ftrace_32.S b/arch/powerpc/kernel/trace/ftrace_32.S
index c7d57124cc59..0a02c0cb12d9 100644
--- a/arch/powerpc/kernel/trace/ftrace_32.S
+++ b/arch/powerpc/kernel/trace/ftrace_32.S
@@ -9,6 +9,7 @@
#include <asm/asm-offsets.h>
#include <asm/ftrace.h>
#include <asm/export.h>
+#include <asm/ptrace.h>
_GLOBAL(mcount)
_GLOBAL(_mcount)
@@ -29,17 +30,21 @@ _GLOBAL(ftrace_caller)
MCOUNT_SAVE_FRAME
/* r3 ends up with link register */
subi r3, r3, MCOUNT_INSN_SIZE
+ lis r5,function_trace_op@ha
+ lwz r5,function_trace_op@l(r5)
+ li r6, 0
.globl ftrace_call
ftrace_call:
bl ftrace_stub
nop
+ MCOUNT_RESTORE_FRAME
+ftrace_caller_common:
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
.globl ftrace_graph_call
ftrace_graph_call:
b ftrace_graph_stub
_GLOBAL(ftrace_graph_stub)
#endif
- MCOUNT_RESTORE_FRAME
/* old link register ends up in ctr reg */
bctr
@@ -47,16 +52,92 @@ _GLOBAL(ftrace_graph_stub)
_GLOBAL(ftrace_stub)
blr
+_GLOBAL(ftrace_regs_caller)
+ /* Save the original return address in A's stack frame */
+ stw r0,LRSAVE(r1)
+
+ /* Create our stack frame + pt_regs */
+ stwu r1,-INT_FRAME_SIZE(r1)
+
+ /* Save all gprs to pt_regs */
+ stw r0, GPR0(r1)
+ stmw r2, GPR2(r1)
+
+ /* Save previous stack pointer (r1) */
+ addi r8, r1, INT_FRAME_SIZE
+ stw r8, GPR1(r1)
+
+ /* Load special regs for save below */
+ mfmsr r8
+ mfctr r9
+ mfxer r10
+ mfcr r11
+
+ /* Get the _mcount() call site out of LR */
+ mflr r7
+ /* Save it as pt_regs->nip */
+ stw r7, _NIP(r1)
+ /* Save the read LR in pt_regs->link */
+ stw r0, _LINK(r1)
+
+ lis r3,function_trace_op@ha
+ lwz r5,function_trace_op@l(r3)
+
+ /* Calculate ip from nip-4 into r3 for call below */
+ subi r3, r7, MCOUNT_INSN_SIZE
+
+ /* Put the original return address in r4 as parent_ip */
+ mr r4, r0
+
+ /* Save special regs */
+ stw r8, _MSR(r1)
+ stw r9, _CTR(r1)
+ stw r10, _XER(r1)
+ stw r11, _CCR(r1)
+
+ /* Load &pt_regs in r6 for call below */
+ addi r6, r1, STACK_FRAME_OVERHEAD
+
+ /* ftrace_call(r3, r4, r5, r6) */
+.globl ftrace_regs_call
+ftrace_regs_call:
+ bl ftrace_stub
+ nop
+
+ /* Load ctr with the possibly modified NIP */
+ lwz r3, _NIP(r1)
+ mtctr r3
+
+ /* Restore gprs */
+ lmw r2, GPR2(r1)
+
+ /* Restore possibly modified LR */
+ lwz r0, _LINK(r1)
+ mtlr r0
+
+ /* Pop our stack frame */
+ addi r1, r1, INT_FRAME_SIZE
+
+ b ftrace_caller_common
+
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
_GLOBAL(ftrace_graph_caller)
+ stwu r1,-48(r1)
+ stw r3, 12(r1)
+ stw r4, 16(r1)
+ stw r5, 20(r1)
+ stw r6, 24(r1)
+ stw r7, 28(r1)
+ stw r8, 32(r1)
+ stw r9, 36(r1)
+ stw r10,40(r1)
+
addi r5, r1, 48
- /* load r4 with local address */
- lwz r4, 44(r1)
+ mfctr r4 /* ftrace_caller has moved local addr here */
+ stw r4, 44(r1)
+ mflr r3 /* ftrace_caller has restored LR from stack */
subi r4, r4, MCOUNT_INSN_SIZE
- /* Grab the LR out of the caller stack frame */
- lwz r3,52(r1)
-
bl prepare_ftrace_return
nop
@@ -65,9 +146,21 @@ _GLOBAL(ftrace_graph_caller)
* Change the LR in the callers stack frame to this.
*/
stw r3,52(r1)
+ mtlr r3
+ lwz r0,44(r1)
+ mtctr r0
+
+ lwz r3, 12(r1)
+ lwz r4, 16(r1)
+ lwz r5, 20(r1)
+ lwz r6, 24(r1)
+ lwz r7, 28(r1)
+ lwz r8, 32(r1)
+ lwz r9, 36(r1)
+ lwz r10,40(r1)
+
+ addi r1, r1, 48
- MCOUNT_RESTORE_FRAME
- /* old link register ends up in ctr reg */
bctr
_GLOBAL(return_to_handler)
--
2.31.1
^ permalink raw reply related
* [PATCH v1 1/5] livepatch: Fix build failure on 32 bits processors
From: Christophe Leroy @ 2021-10-28 12:24 UTC (permalink / raw)
To: Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Steven Rostedt, Ingo Molnar, Naveen N . Rao
Cc: live-patching, linuxppc-dev, linux-kernel
In-Reply-To: <cover.1635423081.git.christophe.leroy@csgroup.eu>
Trying to build livepatch on powerpc/32 results in:
kernel/livepatch/core.c: In function 'klp_resolve_symbols':
kernel/livepatch/core.c:221:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
221 | sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
| ^
kernel/livepatch/core.c:221:21: error: assignment to 'Elf32_Sym *' {aka 'struct elf32_sym *'} from incompatible pointer type 'Elf64_Sym *' {aka 'struct elf64_sym *'} [-Werror=incompatible-pointer-types]
221 | sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
| ^
kernel/livepatch/core.c: In function 'klp_apply_section_relocs':
kernel/livepatch/core.c:312:35: error: passing argument 1 of 'klp_resolve_symbols' from incompatible pointer type [-Werror=incompatible-pointer-types]
312 | ret = klp_resolve_symbols(sechdrs, strtab, symndx, sec, sec_objname);
| ^~~~~~~
| |
| Elf32_Shdr * {aka struct elf32_shdr *}
kernel/livepatch/core.c:193:44: note: expected 'Elf64_Shdr *' {aka 'struct elf64_shdr *'} but argument is of type 'Elf32_Shdr *' {aka 'struct elf32_shdr *'}
193 | static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab,
| ~~~~~~~~~~~~^~~~~~~
Fix it by using the right types instead of forcing 64 bits types.
Fixes: 7c8e2bdd5f0d ("livepatch: Apply vmlinux-specific KLP relocations early")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
kernel/livepatch/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 335d988bd811..c0789383807b 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -190,7 +190,7 @@ static int klp_find_object_symbol(const char *objname, const char *name,
return -EINVAL;
}
-static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab,
+static int klp_resolve_symbols(Elf_Shdr *sechdrs, const char *strtab,
unsigned int symndx, Elf_Shdr *relasec,
const char *sec_objname)
{
@@ -218,7 +218,7 @@ static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab,
relas = (Elf_Rela *) relasec->sh_addr;
/* For each rela in this klp relocation section */
for (i = 0; i < relasec->sh_size / sizeof(Elf_Rela); i++) {
- sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
+ sym = (Elf_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
if (sym->st_shndx != SHN_LIVEPATCH) {
pr_err("symbol %s is not marked as a livepatch symbol\n",
strtab + sym->st_name);
--
2.31.1
^ permalink raw reply related
* [PATCH v1 5/5] powerpc/ftrace: Add support for livepatch to PPC32
From: Christophe Leroy @ 2021-10-28 12:24 UTC (permalink / raw)
To: Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Steven Rostedt, Ingo Molnar, Naveen N . Rao
Cc: live-patching, linuxppc-dev, linux-kernel
In-Reply-To: <cover.1635423081.git.christophe.leroy@csgroup.eu>
This is heavily copied from PPC64. Not much to say about it.
Livepatch sample modules all work.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/Kconfig | 2 +-
arch/powerpc/include/asm/livepatch.h | 4 +-
arch/powerpc/kernel/trace/ftrace_32.S | 69 +++++++++++++++++++++++++++
3 files changed, 72 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f66eb1984b00..eceee3b814b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -230,7 +230,7 @@ config PPC
select HAVE_KPROBES_ON_FTRACE
select HAVE_KRETPROBES
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION
- select HAVE_LIVEPATCH if HAVE_DYNAMIC_FTRACE_WITH_REGS && PPC64
+ select HAVE_LIVEPATCH if HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI if PERF_EVENTS || (PPC64 && PPC_BOOK3S)
select HAVE_OPTPROBES
diff --git a/arch/powerpc/include/asm/livepatch.h b/arch/powerpc/include/asm/livepatch.h
index 4fe018cc207b..daf24d837241 100644
--- a/arch/powerpc/include/asm/livepatch.h
+++ b/arch/powerpc/include/asm/livepatch.h
@@ -23,8 +23,8 @@ static inline void klp_arch_set_pc(struct ftrace_regs *fregs, unsigned long ip)
static inline unsigned long klp_get_ftrace_location(unsigned long faddr)
{
/*
- * Live patch works only with -mprofile-kernel on PPC. In this case,
- * the ftrace location is always within the first 16 bytes.
+ * Live patch works on PPC32 and only with -mprofile-kernel on PPC64. In
+ * both cases, the ftrace location is always within the first 16 bytes.
*/
return ftrace_location_range(faddr, faddr + 16);
}
diff --git a/arch/powerpc/kernel/trace/ftrace_32.S b/arch/powerpc/kernel/trace/ftrace_32.S
index 0a02c0cb12d9..2545d6bb9f02 100644
--- a/arch/powerpc/kernel/trace/ftrace_32.S
+++ b/arch/powerpc/kernel/trace/ftrace_32.S
@@ -10,6 +10,7 @@
#include <asm/ftrace.h>
#include <asm/export.h>
#include <asm/ptrace.h>
+#include <asm/bug.h>
_GLOBAL(mcount)
_GLOBAL(_mcount)
@@ -83,6 +84,9 @@ _GLOBAL(ftrace_regs_caller)
lis r3,function_trace_op@ha
lwz r5,function_trace_op@l(r3)
+#ifdef CONFIG_LIVEPATCH
+ mr r14,r7 /* remember old NIP */
+#endif
/* Calculate ip from nip-4 into r3 for call below */
subi r3, r7, MCOUNT_INSN_SIZE
@@ -107,6 +111,9 @@ ftrace_regs_call:
/* Load ctr with the possibly modified NIP */
lwz r3, _NIP(r1)
mtctr r3
+#ifdef CONFIG_LIVEPATCH
+ cmpw r14, r3 /* has NIP been altered? */
+#endif
/* Restore gprs */
lmw r2, GPR2(r1)
@@ -118,8 +125,70 @@ ftrace_regs_call:
/* Pop our stack frame */
addi r1, r1, INT_FRAME_SIZE
+#ifdef CONFIG_LIVEPATCH
+ /* Based on the cmpw above, if the NIP was altered handle livepatch */
+ bne- livepatch_handler
+#endif
b ftrace_caller_common
+#ifdef CONFIG_LIVEPATCH
+ /*
+ * This function runs in the mcount context, between two functions. As
+ * such it can only clobber registers which are volatile and used in
+ * function linkage.
+ *
+ * We get here when a function A, calls another function B, but B has
+ * been live patched with a new function C.
+ *
+ * On entry:
+ * - we have no stack frame and can not allocate one
+ * - LR points back to the original caller (in A)
+ * - CTR holds the new NIP in C
+ * - r0, r11 & r12 are free
+ */
+livepatch_handler:
+ /* Allocate 2 x 8 bytes */
+ lwz r11, TI_livepatch_sp+THREAD(r2)
+ addi r11, r11, 16
+ stw r11, TI_livepatch_sp+THREAD(r2)
+
+ /* Save real LR on livepatch stack */
+ mflr r12
+ stw r12, -16(r11)
+
+ /* Store stack end marker */
+ lis r12, STACK_END_MAGIC@h
+ ori r12, r12, STACK_END_MAGIC@l
+ stw r12, -4(r11)
+
+ /* Branch to ctr */
+ bctrl
+
+ /*
+ * Now we are returning from the patched function to the original
+ * caller A. We are free to use r11 and r12.
+ */
+
+ lwz r11, TI_livepatch_sp+THREAD(r2)
+
+ /* Check stack marker hasn't been trashed */
+ lwz r12, -4(r11)
+ subis r12, r12, STACK_END_MAGIC@h
+1: twnei r12, STACK_END_MAGIC@l
+ EMIT_BUG_ENTRY 1b, __FILE__, __LINE__ - 1, 0
+
+ /* Restore LR from livepatch stack */
+ lwz r12, -16(r11)
+ mtlr r12
+
+ /* Pop livepatch stack frame */
+ subi r11, r11, 16
+ stw r11, TI_livepatch_sp+THREAD(r2)
+
+ /* Return to original caller of live patched function */
+ blr
+#endif /* CONFIG_LIVEPATCH */
+
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
_GLOBAL(ftrace_graph_caller)
stwu r1,-48(r1)
--
2.31.1
^ permalink raw reply related
* [PATCH v1 2/5] powerpc/ftrace: No need to read LR from stack in _mcount()
From: Christophe Leroy @ 2021-10-28 12:24 UTC (permalink / raw)
To: Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Steven Rostedt, Ingo Molnar, Naveen N . Rao
Cc: live-patching, linuxppc-dev, linux-kernel
In-Reply-To: <cover.1635423081.git.christophe.leroy@csgroup.eu>
All functions calling _mcount do it exactly the same way, with the
following sequence of instructions:
c07de788: 7c 08 02 a6 mflr r0
c07de78c: 90 01 00 04 stw r0,4(r1)
c07de790: 4b 84 13 65 bl c001faf4 <_mcount>
Allthough LR is pushed on stack, it is still in r0 while entering
_mcount().
Function arguments are in r3-r10, so r11 and r12 are still available
at that point.
Do like PPC64 and use r12 to move LR into CTR, so that r0 is preserved
and doesn't need to be restored from the stack.
While at it, bring back the EXPORT_SYMBOL at the end of _mcount.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/kernel/trace/ftrace_32.S | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kernel/trace/ftrace_32.S b/arch/powerpc/kernel/trace/ftrace_32.S
index e023ae59c429..c7d57124cc59 100644
--- a/arch/powerpc/kernel/trace/ftrace_32.S
+++ b/arch/powerpc/kernel/trace/ftrace_32.S
@@ -14,16 +14,16 @@ _GLOBAL(mcount)
_GLOBAL(_mcount)
/*
* It is required that _mcount on PPC32 must preserve the
- * link register. But we have r0 to play with. We use r0
+ * link register. But we have r12 to play with. We use r12
* to push the return address back to the caller of mcount
* into the ctr register, restore the link register and
* then jump back using the ctr register.
*/
- mflr r0
- mtctr r0
- lwz r0, 4(r1)
+ mflr r12
+ mtctr r12
mtlr r0
bctr
+EXPORT_SYMBOL(_mcount)
_GLOBAL(ftrace_caller)
MCOUNT_SAVE_FRAME
@@ -43,7 +43,6 @@ _GLOBAL(ftrace_graph_stub)
/* old link register ends up in ctr reg */
bctr
-EXPORT_SYMBOL(_mcount)
_GLOBAL(ftrace_stub)
blr
--
2.31.1
^ permalink raw reply related
* Re: [PATCH v3] KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling
From: Laurent Vivier @ 2021-10-28 12:39 UTC (permalink / raw)
To: Nicholas Piggin, linuxppc-dev; +Cc: stable
In-Reply-To: <20211027142150.3711582-1-npiggin@gmail.com>
On 27/10/2021 16:21, Nicholas Piggin wrote:
> From: Laurent Vivier <lvivier@redhat.com>
>
> Commit 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest
> context before enabling irqs") moved guest_exit() into the interrupt
> protected area to avoid wrong context warning (or worse). The problem is
> that tick-based time accounting has not yet been updated at this point
> (because it depends on the timer interrupt firing), so the guest time
> gets incorrectly accounted to system time.
>
> To fix the problem, follow the x86 fix in commit 160457140187 ("Defer
> vtime accounting 'til after IRQ handling"), and allow host IRQs to run
> before accounting the guest exit time.
>
> In the case vtime accounting is enabled, this is not required because TB
> is used directly for accounting.
>
> Before this patch, with CONFIG_TICK_CPU_ACCOUNTING=y in the host and a
> guest running a kernel compile, the 'guest' fields of /proc/stat are
> stuck at zero. With the patch they can be observed increasing roughly as
> expected.
>
> Fixes: e233d54d4d97 ("KVM: booke: use __kvm_guest_exit")
> Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
> Cc: <stable@vger.kernel.org> # 5.12
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> [np: only required for tick accounting, add Book3E fix, tweak changelog]
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> Since v2:
> - I took over the patch with Laurent's blessing.
> - Changed to avoid processing IRQs if we do have vtime accounting
> enabled.
> - Changed so in either case the accounting is called with irqs disabled.
> - Added similar Book3E fix.
> - Rebased on upstream, tested, observed bug and confirmed fix.
>
> arch/powerpc/kvm/book3s_hv.c | 30 ++++++++++++++++++++++++++++--
> arch/powerpc/kvm/booke.c | 16 +++++++++++++++-
> 2 files changed, 43 insertions(+), 3 deletions(-)
>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Checked with mpstat that time is accounted to %guest while a stress-ng test is running in
the guest. Checked there is no warning in the host kernellogs.
Thanks,
Laurent
^ permalink raw reply
* Re: instruction storage exception handling
From: Nicholas Piggin @ 2021-10-28 12:42 UTC (permalink / raw)
To: Jacques de Laval; +Cc: Scott Wood, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1635413197.83rhc4s3fc.astroid@bobo.none>
Excerpts from Nicholas Piggin's message of October 28, 2021 7:35 pm:
> Excerpts from Jacques de Laval's message of October 28, 2021 7:08 pm:
>> On Thursday, October 28th, 2021 at 5:01 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>> Excerpts from Jacques de Laval's message of October 27, 2021 10:03 pm:
>>>
>>> > On Wednesday, October 27th, 2021 at 9:52 AM, Christophe Leroy christophe.leroy@csgroup.eu wrote:
>>> >
>>> > > Le 27/10/2021 à 09:47, Nicholas Piggin a écrit :
>>> > >
>>> > > > You're right. In that case it shouldn't change anything unless there
>>> > > >
>>> > > > was a BO fault. I'm not sure what the problem is then. Guessing based
>>> > > >
>>> > > > on the NIP and instructions, it looks like it's probably got the correct
>>> > > >
>>> > > > user address that it's storing into vmf on the stack, so it has got past
>>> > > >
>>> > > > the access checks so my theory would be wrong anyway.
>>> > > >
>>> > > > Must be something simple but I can't see it yet.
>>> > >
>>> > > Anyway, I think it is still worth doing the check with setting 0 in
>>> > >
>>> > > _ESR(r11), maybe the Reference Manual is wrong.
>>> > >
>>> > > So Jacques, please do the test anyway if you can.
>>> > >
>>> > > Thanks
>>> > >
>>> > > Christophe
>>> >
>>> > I tested with the last patch from Nicholas, and with that I can not
>>> >
>>> > reproduce the issue, so it seems like that solves it for my case and setup
>>> >
>>> > at least.
>>> >
>>> > Big thanks Christophe and Nicholas for looking in to this!
>>>
>>> Thanks for reporting and testing. We can certainly send this patch
>>>
>>> upstream to fix the regression, but I'm still not exactly sure what is
>>>
>>> going on. If it is an errata or part of specification we missed that
>>>
>>> could explain it but it would be good to understand and comment it.
>>>
>>> If you have time to test again with only the following patch applied,
>>>
>>> it might give a better clue. This patch should keep running but it
>>>
>>> would print some dmesg output.
>>>
>>> Thanks,
>>>
>>> Nick
>>>
>>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>>>
>>> index a8d0ce85d39a..cf56f23ff90a 100644
>>>
>>> --- a/arch/powerpc/mm/fault.c
>>>
>>> +++ b/arch/powerpc/mm/fault.c
>>>
>>> @@ -548,6 +548,12 @@ static __always_inline void __do_page_fault(struct pt_regs *regs)
>>>
>>> DEFINE_INTERRUPT_HANDLER(do_page_fault)
>>>
>>> {
>>>
>>> - if (TRAP(regs) == INTERRUPT_INST_STORAGE) {
>>>
>>> - if (regs->dsisr != 0) {
>>>
>>>
>>> - printk("ISI pc:%lx msr:%lx dsisr:%lx ESR:%lx\\n", regs->nip, regs->msr, regs->dsisr, mfspr(SPRN_ESR));
>>>
>>>
>>> - regs->dsisr = 0; // fix?
>>>
>>>
>>> - }
>>>
>>>
>>> - }
>>>
>>> __do_page_fault(regs);
>>>
>>> }
>>>
>>
>> As expected it keeps running. The output, and number of prints is naturally
>> a bit different from time to time, but dsisr/ESR is always 0x800000.
>>
>> Here's a representative output from one run:
>>
>> ISI pc:b789e6c0 msr:2d002 dsisr:800000 ESR:800000
>> ISI pc:b7884220 msr:2d002 dsisr:800000 ESR:800000
>> ISI pc:b78c18a4 msr:2d002 dsisr:800000 ESR:800000
>> ISI pc:55a238 msr:2f902 dsisr:800000 ESR:800000
>> ISI pc:412380 msr:2f902 dsisr:800000 ESR:800000
>> ISI pc:3aabe0 msr:2f902 dsisr:800000 ESR:800000
>> ISI pc:47a0e0 msr:2f902 dsisr:800000 ESR:800000
>> ISI pc:443290 msr:2f902 dsisr:800000 ESR:800000
>> ISI pc:43b350 msr:2d002 dsisr:800000 ESR:800000
>
> Great, thanks for testing that is interesting.
>
> Looking a bit more,
>
> https://www.nxp.com/files-static/32bit/doc/ref_manual/EREF_RM.pdf
>
> This is the manual for e500 family including e5500.
>
> Table 8-4. Interrupt Summary by IVOR shows ISI interrupt as affecting
> ESR register with [PT], this means the PT bit may be set. There is no
> BO bit specified here.
>
> The architecture (and this manual) says that if an interrupt type
> affects one of the ESR bits then all others are cleared. However if
> we look at the 4.8.7 Exception Syndrome Register (ESR) definition,
> the PT bit is specified with <E.PT>. This means it is implemented if
> the processor supports the E.PT extension.
>
> According to this table, the e5500 does not support E.PT.
> https://www.linux-kvm.org/page/E500_virtual_CPU_specification
>
> So it seems possible that a processor which does not support E.PT and
> therefore ISI will never set any bits in ESR, will not zero the ESR
> when it takes an ISI intrrupt, without violating the specification.
>
> It looks like that is what is happening here, ESR is being left from
> a previous store DSI.
Actually it could be another explanation, it takes a instruction
TLB error interrupt first which always sets no ESR bits, and then
it jumps to the ISI handler without clearing ESR.
I have not actually worked out why it is causing an infinite loop,
it doesn't seem to be the main fault handler, maybe it is some
low level TLB or PTE code, I can't get a kernel to boot under
quemu due to some bug, but a 64-bit kernel does not cause the problem if
you manually add ESR_DST into dsisr on instruction faults. Don't really
need to know that exactly though, it's clearly a bug.
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH v2 10/45] ARM: Use do_kernel_power_off()
From: Russell King (Oracle) @ 2021-10-28 12:04 UTC (permalink / raw)
To: Dmitry Osipenko
Cc: Ulf Hansson, Rich Felker, linux-ia64, Tomer Maimon,
Santosh Shilimkar, Rafael J . Wysocki, Boris Ostrovsky,
Catalin Marinas, Linus Walleij, Dave Hansen, x86, Tali Perry,
James E.J. Bottomley, Thierry Reding, Guo Ren, Pavel Machek,
H. Peter Anvin, linux-riscv, Vincent Chen, Will Deacon,
Greg Ungerer, Stefano Stabellini, Benjamin Fair, Yoshinori Sato,
Krzysztof Kozlowski, linux-sh, Lee Jones, Helge Deller,
Daniel Lezcano, linux-csky, Jonathan Hunter, Tony Lindgren,
Chen-Yu Tsai, Ingo Molnar, Geert Uytterhoeven, xen-devel,
linux-mips, Guenter Roeck, Len Brown, Albert Ou, linux-omap,
Jonathan Neuschäfer, Vladimir Zapolskiy, linux-acpi,
linux-m68k, Mark Brown, Borislav Petkov, Greentime Hu,
Paul Walmsley, linux-tegra, Thomas Gleixner, Andy Shevchenko,
Nancy Yuen, linux-arm-kernel, Juergen Gross, Thomas Bogendoerfer,
linux-parisc, Nick Hu, Avi Fishman, Patrick Venture, linux-pm,
Liam Girdwood, linux-kernel, Palmer Dabbelt, Philipp Zabel,
Paul Mackerras, Andrew Morton, linuxppc-dev, openbmc,
Joshua Thompson
In-Reply-To: <20211027211715.12671-11-digetx@gmail.com>
On Thu, Oct 28, 2021 at 12:16:40AM +0300, Dmitry Osipenko wrote:
> Kernel now supports chained power-off handlers. Use do_kernel_power_off()
> that invokes chained power-off handlers. It also invokes legacy
> pm_power_off() for now, which will be removed once all drivers will
> be converted to the new power-off API.
>
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Thanks!
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
^ permalink raw reply
* Re: [PATCH v2 15/45] nds32: Use do_kernel_power_off()
From: Greentime Hu @ 2021-10-28 12:20 UTC (permalink / raw)
To: Dmitry Osipenko
Cc: Ulf Hansson, Rich Felker, linux-ia64, Tomer Maimon,
Santosh Shilimkar, Rafael J . Wysocki, Boris Ostrovsky,
Catalin Marinas, Linus Walleij, Dave Hansen, x86, Tali Perry,
James E.J. Bottomley, Thierry Reding, Guo Ren, Pavel Machek,
H. Peter Anvin, linux-riscv, Vincent Chen, Will Deacon,
Greg Ungerer, Stefano Stabellini, Benjamin Fair, Yoshinori Sato,
Krzysztof Kozlowski, linux-sh, Lee Jones, Helge Deller,
Daniel Lezcano, Russell King, linux-csky, Jonathan Hunter,
Tony Lindgren, Chen-Yu Tsai, Ingo Molnar, Geert Uytterhoeven,
xen-devel, linux-mips, Guenter Roeck, Len Brown, Albert Ou,
linux-omap, Alan Kao, Jonathan Neuschäfer,
Vladimir Zapolskiy, linux-acpi, linux-m68k, Mark Brown,
Borislav Petkov, Paul Walmsley, linux-tegra, Thomas Gleixner,
Andy Shevchenko, Nancy Yuen, linux-arm-kernel, Juergen Gross,
Thomas Bogendoerfer, linux-parisc, Nick Hu, Avi Fishman,
Patrick Venture, linux-pm, Liam Girdwood,
Linux Kernel Mailing List,
K.C. Kuen-Chern Lin(林坤成), Palmer Dabbelt,
Philipp Zabel, Paul Mackerras, Andrew Morton, linuxppc-dev,
openbmc, Joshua Thompson
In-Reply-To: <20211027211715.12671-16-digetx@gmail.com>
Dmitry Osipenko <digetx@gmail.com> 於 2021年10月28日 週四 上午5:18寫道:
>
> Kernel now supports chained power-off handlers. Use do_kernel_power_off()
> that invokes chained power-off handlers. It also invokes legacy
> pm_power_off() for now, which will be removed once all drivers will
> be converted to the new power-off API.
>
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> ---
> arch/nds32/kernel/process.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
> index 49fab9e39cbf..0936dcd7db1b 100644
> --- a/arch/nds32/kernel/process.c
> +++ b/arch/nds32/kernel/process.c
> @@ -54,8 +54,7 @@ EXPORT_SYMBOL(machine_halt);
>
> void machine_power_off(void)
> {
> - if (pm_power_off)
> - pm_power_off();
> + do_kernel_power_off();
> }
>
> EXPORT_SYMBOL(machine_power_off);
> --
> 2.33.1
>
Loop in Alan and KC
^ permalink raw reply
* Re: [PATCH v3] KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling
From: Laurent Vivier @ 2021-10-28 12:48 UTC (permalink / raw)
To: Nicholas Piggin, linuxppc-dev; +Cc: stable
In-Reply-To: <20211027142150.3711582-1-npiggin@gmail.com>
On 27/10/2021 16:21, Nicholas Piggin wrote:
> From: Laurent Vivier <lvivier@redhat.com>
>
> Commit 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest
> context before enabling irqs") moved guest_exit() into the interrupt
> protected area to avoid wrong context warning (or worse). The problem is
> that tick-based time accounting has not yet been updated at this point
> (because it depends on the timer interrupt firing), so the guest time
> gets incorrectly accounted to system time.
>
> To fix the problem, follow the x86 fix in commit 160457140187 ("Defer
> vtime accounting 'til after IRQ handling"), and allow host IRQs to run
> before accounting the guest exit time.
>
> In the case vtime accounting is enabled, this is not required because TB
> is used directly for accounting.
>
> Before this patch, with CONFIG_TICK_CPU_ACCOUNTING=y in the host and a
> guest running a kernel compile, the 'guest' fields of /proc/stat are
> stuck at zero. With the patch they can be observed increasing roughly as
> expected.
>
> Fixes: e233d54d4d97 ("KVM: booke: use __kvm_guest_exit")
> Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
> Cc: <stable@vger.kernel.org> # 5.12
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> [np: only required for tick accounting, add Book3E fix, tweak changelog]
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> Since v2:
> - I took over the patch with Laurent's blessing.
> - Changed to avoid processing IRQs if we do have vtime accounting
> enabled.
> - Changed so in either case the accounting is called with irqs disabled.
> - Added similar Book3E fix.
> - Rebased on upstream, tested, observed bug and confirmed fix.
>
> arch/powerpc/kvm/book3s_hv.c | 30 ++++++++++++++++++++++++++++--
> arch/powerpc/kvm/booke.c | 16 +++++++++++++++-
> 2 files changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2acb1c96cfaf..7b74fc0a986b 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3726,7 +3726,20 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
>
> kvmppc_set_host_core(pcpu);
>
> - guest_exit_irqoff();
> + context_tracking_guest_exit();
> + if (!vtime_accounting_enabled_this_cpu()) {
> + local_irq_enable();
> + /*
> + * Service IRQs here before vtime_account_guest_exit() so any
> + * ticks that occurred while running the guest are accounted to
> + * the guest. If vtime accounting is enabled, accounting uses
> + * TB rather than ticks, so it can be done without enabling
> + * interrupts here, which has the problem that it accounts
> + * interrupt processing overhead to the host.
> + */
> + local_irq_disable();
> + }
> + vtime_account_guest_exit();
>
> local_irq_enable();
>
> @@ -4510,7 +4523,20 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>
> kvmppc_set_host_core(pcpu);
>
> - guest_exit_irqoff();
> + context_tracking_guest_exit();
> + if (!vtime_accounting_enabled_this_cpu()) {
> + local_irq_enable();
> + /*
> + * Service IRQs here before vtime_account_guest_exit() so any
> + * ticks that occurred while running the guest are accounted to
> + * the guest. If vtime accounting is enabled, accounting uses
> + * TB rather than ticks, so it can be done without enabling
> + * interrupts here, which has the problem that it accounts
> + * interrupt processing overhead to the host.
> + */
> + local_irq_disable();
> + }
> + vtime_account_guest_exit();
>
> local_irq_enable();
>
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 977801c83aff..8c15c90dd3a9 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -1042,7 +1042,21 @@ int kvmppc_handle_exit(struct kvm_vcpu *vcpu, unsigned int exit_nr)
> }
>
> trace_kvm_exit(exit_nr, vcpu);
> - guest_exit_irqoff();
> +
> + context_tracking_guest_exit();
> + if (!vtime_accounting_enabled_this_cpu()) {
> + local_irq_enable();
> + /*
> + * Service IRQs here before vtime_account_guest_exit() so any
> + * ticks that occurred while running the guest are accounted to
> + * the guest. If vtime accounting is enabled, accounting uses
> + * TB rather than ticks, so it can be done without enabling
> + * interrupts here, which has the problem that it accounts
> + * interrupt processing overhead to the host.
> + */
> + local_irq_disable();
> + }
> + vtime_account_guest_exit();
>
> local_irq_enable();
>
>
I'm wondering if we should put the context_tracking_guest_exit() just after the
"srcu_read_unlock(&vc->kvm->srcu, srcu_idx);" as it was before 61bd0f66ff92 ("KVM: PPC:
Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GEN")?
Thanks,
Laurent
^ permalink raw reply
* [PATCH] powerpc: Don't provide __kernel_map_pages() without ARCH_SUPPORTS_DEBUG_PAGEALLOC
From: Christophe Leroy @ 2021-10-28 12:59 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
Cc: linuxppc-dev, linux-kernel, kernel test robot
When ARCH_SUPPORTS_DEBUG_PAGEALLOC is not selected, the user can
still select CONFIG_DEBUG_PAGEALLOC in which case __kernel_map_pages()
is provided by mm/page_poison.c
So only define __kernel_map_pages() when both
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC
are defined.
Fixes: 68b44f94d637 ("powerpc/booke: Disable STRICT_KERNEL_RWX, DEBUG_PAGEALLOC and KFENCE")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/mm/pgtable_32.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index dcf5ecca19d9..fde1ed445ca4 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -173,7 +173,7 @@ void mark_rodata_ro(void)
}
#endif
-#ifdef CONFIG_DEBUG_PAGEALLOC
+#if defined(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && defined(CONFIG_DEBUG_PAGEALLOC)
void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long addr = (unsigned long)page_address(page);
--
2.31.1
^ permalink raw reply related
* [PATCH] powerpc/32e: Ignore ESR in instruction storage interrupt handler
From: Nicholas Piggin @ 2021-10-28 13:30 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nicholas Piggin, Jacques de Laval
A e5500 machine running a 32-bit kernel sometimes hangs at boot,
seemingly going into an infinite loop of instruction storage interrupts.
The ESR SPR has a value of 0x800000 (store) when this happens, which is
likely set by a previous store. An instruction TLB miss interrupt would
then leave ESR unchanged, and if no PTE exists it calls directly to the
instruction storage interrupt handler without changing ESR.
access_error() does not cause a segfault due to a store to a read-only
vma because is_exec is true. Most subsequent fault handling does not
check for a write fault on a read-only vma, and might do strange things
like create a writeable PTE or call page_mkwrite on a read only vma or
file. It's not clear what happens here to cause the infinite faulting in
this case, a fault handler failure or low level PTE or TLB handling.
In any case this can be fixed by having the instruction storage
interrupt zero regs->dsisr rather than storing the ESR value to it.
Link: https://lore.kernel.org/linuxppc-dev/1635306738.0z8wt7619v.astroid@bobo.none/
Fixes: a01a3f2ddbcd ("powerpc: remove arguments from fault handler functions")
Reported-by: Jacques de Laval <jacques.delaval@protonmail.com>
Tested-by: Jacques de Laval <jacques.delaval@protonmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/kernel/head_booke.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index e5503420b6c6..ef8d1b1c234e 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -465,12 +465,21 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
bl do_page_fault; \
b interrupt_return
+/*
+ * Instruction TLB Error interrupt handlers may call InstructionStorage
+ * directly without clearing ESR, so the ESR at this point may be left over
+ * from a prior interrupt.
+ *
+ * In any case, do_page_fault for BOOK3E does not use ESR and always expects
+ * dsisr to be 0. ESR_DST from a prior store in particular would confuse fault
+ * handling.
+ */
#define INSTRUCTION_STORAGE_EXCEPTION \
START_EXCEPTION(InstructionStorage) \
- NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
- mfspr r5,SPRN_ESR; /* Grab the ESR and save it */ \
+ NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
+ li r5,0; /* Store 0 in regs->esr (dsisr) */ \
stw r5,_ESR(r11); \
- stw r12, _DEAR(r11); /* Pass SRR0 as arg2 */ \
+ stw r12, _DEAR(r11); /* Set regs->dear (dar) to SRR0 */ \
prepare_transfer_to_handler; \
bl do_page_fault; \
b interrupt_return
--
2.23.0
^ permalink raw reply related
* Re: [PATCH v1 0/5] Implement livepatch on PPC32
From: Steven Rostedt @ 2021-10-28 13:35 UTC (permalink / raw)
To: Christophe Leroy
Cc: Petr Mladek, Joe Lawrence, Jiri Kosina, linux-kernel, Ingo Molnar,
Josh Poimboeuf, live-patching, Naveen N . Rao, Miroslav Benes,
linuxppc-dev
In-Reply-To: <cover.1635423081.git.christophe.leroy@csgroup.eu>
On Thu, 28 Oct 2021 14:24:00 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> This series implements livepatch on PPC32.
>
> This is largely copied from what's done on PPC64.
>
> Christophe Leroy (5):
> livepatch: Fix build failure on 32 bits processors
> powerpc/ftrace: No need to read LR from stack in _mcount()
> powerpc/ftrace: Add module_trampoline_target() for PPC32
> powerpc/ftrace: Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on PPC32
> powerpc/ftrace: Add support for livepatch to PPC32
>
> arch/powerpc/Kconfig | 2 +-
> arch/powerpc/include/asm/livepatch.h | 4 +-
> arch/powerpc/kernel/module_32.c | 33 +++++
> arch/powerpc/kernel/trace/ftrace.c | 53 +++-----
> arch/powerpc/kernel/trace/ftrace_32.S | 187 ++++++++++++++++++++++++--
> kernel/livepatch/core.c | 4 +-
> 6 files changed, 230 insertions(+), 53 deletions(-)
>
This is great that you are doing this, but I wonder if it would even be
easier, and more efficient, if you could implement
HAVE_DYNAMIC_FTRACE_WITH_ARGS?
Then you don't need to save all regs for live kernel patching. And I am
also working on function tracing with arguments with this too.
That is, to call a generic ftrace callback, you need to save all the args
that are stored in registers to prevent the callback from clobbering them.
As live kernel patching only needs to have the arguments of the functions,
you save time from having to save the other regs as well.
The callbacks now have "struct ftrace_regs" instead of pt_regs, because it
will allow non ftrace_regs_caller functions to access the arguments if it
is supported.
Look at how x86_64 implements this. It should be possible to do this for
all other archs as well.
Also note, by doing this, we can then get rid of the ftrace_graph_caller,
and have function graph tracer be a function tracing callback, as it will
allow ftrace_graph_caller to have access to the stack and the return as
well.
If you need any more help or information to do this, I'd be happy to assist
you.
Note, you can implement this first, (I looked over the patches and they
seem fine) and then update both ppc64 and ppc32 to implement
DYNAMIC_FTRACE_WITH_ARGS.
Cheers,
-- Steve
^ permalink raw reply
* Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8
From: John Paul Adrian Glaubitz @ 2021-10-28 13:52 UTC (permalink / raw)
To: mpe; +Cc: oss-security, debian-powerpc@lists.debian.org, linuxppc-dev
In-Reply-To: <05b88724-90b6-a38a-bb3b-7392f85c1934@physik.fu-berlin.de>
Hello!
An update to this post with oss-security CC'ed.
On 10/26/21 10:48, John Paul Adrian Glaubitz wrote:
> I have tested these patches against 5.14 but it seems the problem [1] still remains for me
> for big-endian guests. I built a patched kernel yesterday, rebooted the KVM server and let
> the build daemons do their work over night.
I have done thorough testing and I'm no longer seeing the problem with the patched kernel.
I am not sure what triggered my previous crash but I don't think it's related to this
particular bug. I will keep monitoring the server in any case and open a new bug report
in case I'm running into similar issues.
Thanks,
Adrian
--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
^ permalink raw reply
* Re: [PATCH] powerpc/32e: Ignore ESR in instruction storage interrupt handler
From: Christophe Leroy @ 2021-10-28 13:52 UTC (permalink / raw)
To: Nicholas Piggin, linuxppc-dev; +Cc: Jacques de Laval
In-Reply-To: <20211028133043.4159501-1-npiggin@gmail.com>
Le 28/10/2021 à 15:30, Nicholas Piggin a écrit :
> A e5500 machine running a 32-bit kernel sometimes hangs at boot,
> seemingly going into an infinite loop of instruction storage interrupts.
> The ESR SPR has a value of 0x800000 (store) when this happens, which is
> likely set by a previous store. An instruction TLB miss interrupt would
> then leave ESR unchanged, and if no PTE exists it calls directly to the
> instruction storage interrupt handler without changing ESR.
>
> access_error() does not cause a segfault due to a store to a read-only
> vma because is_exec is true. Most subsequent fault handling does not
> check for a write fault on a read-only vma, and might do strange things
> like create a writeable PTE or call page_mkwrite on a read only vma or
> file. It's not clear what happens here to cause the infinite faulting in
> this case, a fault handler failure or low level PTE or TLB handling.
>
> In any case this can be fixed by having the instruction storage
> interrupt zero regs->dsisr rather than storing the ESR value to it.
>
> Link: https://lore.kernel.org/linuxppc-dev/1635306738.0z8wt7619v.astroid@bobo.none/
> Fixes: a01a3f2ddbcd ("powerpc: remove arguments from fault handler functions")
Should it go to stable as well ?
> Reported-by: Jacques de Laval <jacques.delaval@protonmail.com>
> Tested-by: Jacques de Laval <jacques.delaval@protonmail.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> arch/powerpc/kernel/head_booke.h | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
> index e5503420b6c6..ef8d1b1c234e 100644
> --- a/arch/powerpc/kernel/head_booke.h
> +++ b/arch/powerpc/kernel/head_booke.h
> @@ -465,12 +465,21 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
> bl do_page_fault; \
> b interrupt_return
>
> +/*
> + * Instruction TLB Error interrupt handlers may call InstructionStorage
> + * directly without clearing ESR, so the ESR at this point may be left over
> + * from a prior interrupt.
> + *
> + * In any case, do_page_fault for BOOK3E does not use ESR and always expects
> + * dsisr to be 0. ESR_DST from a prior store in particular would confuse fault
> + * handling.
> + */
> #define INSTRUCTION_STORAGE_EXCEPTION \
> START_EXCEPTION(InstructionStorage) \
> - NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
> - mfspr r5,SPRN_ESR; /* Grab the ESR and save it */ \
> + NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
> + li r5,0; /* Store 0 in regs->esr (dsisr) */ \
> stw r5,_ESR(r11); \
> - stw r12, _DEAR(r11); /* Pass SRR0 as arg2 */ \
> + stw r12, _DEAR(r11); /* Set regs->dear (dar) to SRR0 */ \
> prepare_transfer_to_handler; \
> bl do_page_fault; \
> b interrupt_return
>
^ permalink raw reply
* Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8
From: John Paul Adrian Glaubitz @ 2021-10-28 14:00 UTC (permalink / raw)
To: mpe; +Cc: oss-security, debian-powerpc@lists.debian.org, linuxppc-dev
In-Reply-To: <159047aa-6cbd-420f-0589-9dc6a43e2b23@physik.fu-berlin.de>
Hello!
On 10/28/21 15:52, John Paul Adrian Glaubitz wrote:
> I am not sure what triggered my previous crash but I don't think it's related to this
> particular bug. I will keep monitoring the server in any case and open a new bug report
> in case I'm running into similar issues.
This is very unfortunate, but just after I sent this mail, the machine crashed again.
Sorry for the premature success report. I will have to check now what happened
and get in touch with Michael.
Adrian
--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
^ permalink raw reply
* Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8
From: John Paul Adrian Glaubitz @ 2021-10-28 14:05 UTC (permalink / raw)
To: Michael Ellerman; +Cc: debian-powerpc@lists.debian.org, linuxppc-dev
In-Reply-To: <87b1404f-7805-da29-4899-6ab9459e5364@physik.fu-berlin.de>
Hi Michael!
On 10/28/21 13:20, John Paul Adrian Glaubitz wrote:
> It seems I also can no longer reproduce the issue, even when building the most problematic
> packages and I think we should consider it fixed for now. I will keep monitoring the server,
> of course, and will let you know in case the problem shows again.
The host machine is stuck again but I'm not 100% sure what triggered the problem:
[194817.984249] watchdog: BUG: soft lockup - CPU#80 stuck for 246s! [CPU 2/KVM:1836]
[194818.012248] watchdog: BUG: soft lockup - CPU#152 stuck for 246s! [CPU 3/KVM:1837]
[194825.960164] watchdog: BUG: soft lockup - CPU#24 stuck for 246s! [khugepaged:318]
[194841.983991] watchdog: BUG: soft lockup - CPU#80 stuck for 268s! [CPU 2/KVM:1836]
[194842.011991] watchdog: BUG: soft lockup - CPU#152 stuck for 268s! [CPU 3/KVM:1837]
[194849.959906] watchdog: BUG: soft lockup - CPU#24 stuck for 269s! [khugepaged:318]
[194865.983733] watchdog: BUG: soft lockup - CPU#80 stuck for 291s! [CPU 2/KVM:1836]
[194866.011733] watchdog: BUG: soft lockup - CPU#152 stuck for 291s! [CPU 3/KVM:1837]
[194873.959648] watchdog: BUG: soft lockup - CPU#24 stuck for 291s! [khugepaged:318]
[194889.983475] watchdog: BUG: soft lockup - CPU#80 stuck for 313s! [CPU 2/KVM:1836]
[194890.011475] watchdog: BUG: soft lockup - CPU#152 stuck for 313s! [CPU 3/KVM:1837]
[194897.959390] watchdog: BUG: soft lockup - CPU#24 stuck for 313s! [khugepaged:318]
[194913.983218] watchdog: BUG: soft lockup - CPU#80 stuck for 335s! [CPU 2/KVM:1836]
[194914.011217] watchdog: BUG: soft lockup - CPU#152 stuck for 335s! [CPU 3/KVM:1837]
[194921.959133] watchdog: BUG: soft lockup - CPU#24 stuck for 336s! [khugepaged:318]
The following packages were being built at the same time:
- guest 1: virtuoso-opensource and openturns
- guest 2: llvm-toolchain-13
I really did a lot of testing today with no issues and just after I sent my report
to oss-security that the machine seems to be stable again, the issue showed up :(.
Sorry,
Adrian
--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
^ permalink raw reply
* Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8
From: John Paul Adrian Glaubitz @ 2021-10-28 14:15 UTC (permalink / raw)
To: Michael Ellerman; +Cc: debian-powerpc@lists.debian.org, linuxppc-dev
In-Reply-To: <9ed788c0-b99b-f327-0814-a2d92db6bd8b@physik.fu-berlin.de>
Hi!
On 10/28/21 16:05, John Paul Adrian Glaubitz wrote:
> The following packages were being built at the same time:
>
> - guest 1: virtuoso-opensource and openturns
> - guest 2: llvm-toolchain-13
>
> I really did a lot of testing today with no issues and just after I sent my report
> to oss-security that the machine seems to be stable again, the issue showed up :(.
Do you know whether IPMI features any sort of monitoring for capturing the output
of the serial console non-interactively? This way I would be able to capture the
crash besides what I have seen above.
Adrian
--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
^ permalink raw reply
* [PATCH v12 1/2] tty: hvc: pass DMA capable memory to put_chars()
From: Xianting Tian @ 2021-10-28 15:09 UTC (permalink / raw)
To: gregkh, jirislaby, amit, arnd, osandov
Cc: sfr, Xianting Tian, shile.zhang, linux-kernel, virtualization,
linuxppc-dev
In-Reply-To: <20211028150954.1356334-1-xianting.tian@linux.alibaba.com>
As well known, hvc backend can register its opertions to hvc backend.
the operations contain put_chars(), get_chars() and so on.
Some hvc backend may do dma in its operations. eg, put_chars() of
virtio-console. But in the code of hvc framework, it may pass DMA
incapable memory to put_chars() under a specific configuration, which
is explained in commit c4baad5029(virtio-console: avoid DMA from stack):
1, c[] is on stack,
hvc_console_print():
char c[N_OUTBUF] __ALIGNED__;
cons_ops[index]->put_chars(vtermnos[index], c, i);
2, ch is on stack,
static void hvc_poll_put_char(,,char ch)
{
struct tty_struct *tty = driver->ttys[0];
struct hvc_struct *hp = tty->driver_data;
int n;
do {
n = hp->ops->put_chars(hp->vtermno, &ch, 1);
} while (n <= 0);
}
Commit c4baad5029 is just the fix to avoid DMA from stack memory, which
is passed to virtio-console by hvc framework in above code. But I think
the fix is aggressive, it directly uses kmemdup() to alloc new buffer
from kmalloc area and do memcpy no matter the memory is in kmalloc area
or not. But most importantly, it should better be fixed in the hvc
framework, by changing it to never pass stack memory to the put_chars()
function in the first place. Otherwise, we still face the same issue if
a new hvc backend using dma added in the furture.
In this patch, add 'char cons_outbuf[]' as part of 'struct hvc_struct',
so hp->cons_outbuf is no longer the stack memory, we can use it in above
cases safely. We also add lock to protect cons_outbuf instead of using
the global lock of hvc.
Introduce array cons_hvcs[] for hvc pointers next to the cons_ops[] and
vtermnos[] arrays. With the array, we can easily find hvc's cons_outbuf
and its lock.
Introduce array cons_early_outbuf[] to ensure the mechanism of early
console still work normally.
With the patch, we can revert the fix c4baad5029.
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
---
drivers/tty/hvc/hvc_console.c | 55 +++++++++++++++++++++++++----------
drivers/tty/hvc/hvc_console.h | 30 ++++++++++++++++++-
2 files changed, 69 insertions(+), 16 deletions(-)
diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 5957ab728..b9521f0f0 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -41,16 +41,6 @@
*/
#define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */
-/*
- * These sizes are most efficient for vio, because they are the
- * native transfer size. We could make them selectable in the
- * future to better deal with backends that want other buffer sizes.
- */
-#define N_OUTBUF 16
-#define N_INBUF 16
-
-#define __ALIGNED__ __attribute__((__aligned__(L1_CACHE_BYTES)))
-
static struct tty_driver *hvc_driver;
static struct task_struct *hvc_task;
@@ -142,6 +132,8 @@ static int hvc_flush(struct hvc_struct *hp)
static const struct hv_ops *cons_ops[MAX_NR_HVC_CONSOLES];
static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
{[0 ... MAX_NR_HVC_CONSOLES - 1] = -1};
+static struct hvc_struct *cons_hvcs[MAX_NR_HVC_CONSOLES];
+static struct hvc_early_outbuf *cons_early_outbufs[MAX_NR_HVC_CONSOLES];
/*
* Console APIs, NOT TTY. These APIs are available immediately when
@@ -151,9 +143,12 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
static void hvc_console_print(struct console *co, const char *b,
unsigned count)
{
- char c[N_OUTBUF] __ALIGNED__;
+ char *c;
unsigned i = 0, n = 0;
int r, donecr = 0, index = co->index;
+ unsigned long flags;
+ struct hvc_struct *hp;
+ spinlock_t *lock;
/* Console access attempt outside of acceptable console range. */
if (index >= MAX_NR_HVC_CONSOLES)
@@ -163,6 +158,27 @@ static void hvc_console_print(struct console *co, const char *b,
if (vtermnos[index] == -1)
return;
+ hp = cons_hvcs[index];
+ if (hp) {
+ c = hp->cons_outbuf;
+ lock = &hp->cons_outbuf_lock;
+
+ /*
+ * To make free early console outbuf locklessly,
+ * free it here when we start to use hp's outbuf.
+ * The actual free action is executed only once.
+ */
+ if (cons_early_outbufs[index]) {
+ kfree(cons_early_outbufs[index]);
+ cons_early_outbufs[index] = NULL;
+ }
+ } else {
+ /* For early console print */
+ c = cons_early_outbufs[index]->outbuf;
+ lock = &cons_early_outbufs[index]->lock;
+ }
+
+ spin_lock_irqsave(lock, flags);
while (count > 0 || i > 0) {
if (count > 0 && i < sizeof(c)) {
if (b[n] == '\n' && !donecr) {
@@ -191,6 +207,7 @@ static void hvc_console_print(struct console *co, const char *b,
}
}
}
+ spin_unlock_irqrestore(lock, flags);
hvc_console_flush(cons_ops[index], vtermnos[index]);
}
@@ -299,6 +316,9 @@ int hvc_instantiate(uint32_t vtermno, int index, const struct hv_ops *ops)
return -1;
}
+ cons_early_outbufs[index] = kzalloc(sizeof(struct hvc_early_outbuf),
+ GFP_KERNEL);
+ spin_lock_init(&cons_early_outbufs[index]->lock);
vtermnos[index] = vtermno;
cons_ops[index] = ops;
@@ -878,9 +898,13 @@ static void hvc_poll_put_char(struct tty_driver *driver, int line, char ch)
struct tty_struct *tty = driver->ttys[0];
struct hvc_struct *hp = tty->driver_data;
int n;
+ unsigned long flags;
do {
- n = hp->ops->put_chars(hp->vtermno, &ch, 1);
+ spin_lock_irqsave(&hp->cons_outbuf_lock, flags);
+ hp->cons_outbuf[0] = ch;
+ n = hp->ops->put_chars(hp->vtermno, &hp->cons_outbuf[0], 1);
+ spin_unlock_irqrestore(&hp->cons_outbuf_lock, flags);
} while (n <= 0);
}
#endif
@@ -922,8 +946,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
return ERR_PTR(err);
}
- hp = kzalloc(ALIGN(sizeof(*hp), sizeof(long)) + outbuf_size,
- GFP_KERNEL);
+ hp = kzalloc(struct_size(hp, outbuf, outbuf_size), GFP_KERNEL);
if (!hp)
return ERR_PTR(-ENOMEM);
@@ -931,13 +954,13 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
hp->data = data;
hp->ops = ops;
hp->outbuf_size = outbuf_size;
- hp->outbuf = &((char *)hp)[ALIGN(sizeof(*hp), sizeof(long))];
tty_port_init(&hp->port);
hp->port.ops = &hvc_port_ops;
INIT_WORK(&hp->tty_resize, hvc_set_winsz);
spin_lock_init(&hp->lock);
+ spin_lock_init(&hp->cons_outbuf_lock);
mutex_lock(&hvc_structs_mutex);
/*
@@ -964,6 +987,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
if (i < MAX_NR_HVC_CONSOLES) {
cons_ops[i] = ops;
vtermnos[i] = vtermno;
+ cons_hvcs[i] = hp;
}
list_add_tail(&(hp->next), &hvc_structs);
@@ -988,6 +1012,7 @@ int hvc_remove(struct hvc_struct *hp)
if (hp->index < MAX_NR_HVC_CONSOLES) {
vtermnos[hp->index] = -1;
cons_ops[hp->index] = NULL;
+ cons_hvcs[hp->index] = NULL;
}
/* Don't whack hp->irq because tty_hangup() will need to free the irq. */
diff --git a/drivers/tty/hvc/hvc_console.h b/drivers/tty/hvc/hvc_console.h
index 18d005814..3a9fe9275 100644
--- a/drivers/tty/hvc/hvc_console.h
+++ b/drivers/tty/hvc/hvc_console.h
@@ -32,12 +32,30 @@
*/
#define HVC_ALLOC_TTY_ADAPTERS 8
+/*
+ * These sizes are most efficient for vio, because they are the
+ * native transfer size. We could make them selectable in the
+ * future to better deal with backends that want other buffer sizes.
+ */
+#define N_OUTBUF 16
+#define N_INBUF 16
+
+#define __ALIGNED__ __attribute__((__aligned__(L1_CACHE_BYTES)))
+
+/*
+ * This is for early console print when hvc backend calls
+ * hvc_instantiate() to create an early console.
+ */
+struct hvc_early_outbuf {
+ spinlock_t lock;
+ char outbuf[N_OUTBUF] __ALIGNED__;
+};
+
struct hvc_struct {
struct tty_port port;
spinlock_t lock;
int index;
int do_wakeup;
- char *outbuf;
int outbuf_size;
int n_outbuf;
uint32_t vtermno;
@@ -48,6 +66,16 @@ struct hvc_struct {
struct work_struct tty_resize;
struct list_head next;
unsigned long flags;
+
+ /*
+ * the buf and its lock are used in hvc console api for putting chars,
+ * and also used in hvc_poll_put_char() for putting single char.
+ */
+ spinlock_t cons_outbuf_lock;
+ char cons_outbuf[N_OUTBUF] __ALIGNED__;
+
+ /* the buf is used for putting chars to tty */
+ char outbuf[] __ALIGNED__;
};
/* implemented by a low level driver */
--
2.17.1
^ permalink raw reply related
* [PATCH v12 2/2] virtio-console: remove unnecessary kmemdup()
From: Xianting Tian @ 2021-10-28 15:09 UTC (permalink / raw)
To: gregkh, jirislaby, amit, arnd, osandov
Cc: sfr, Xianting Tian, shile.zhang, linux-kernel, virtualization,
linuxppc-dev
In-Reply-To: <20211028150954.1356334-1-xianting.tian@linux.alibaba.com>
This revert commit c4baad5029 ("virtio-console: avoid DMA from stack")
hvc framework will never pass stack memory to the put_chars() function,
So the calling of kmemdup() is unnecessary, we can remove it.
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
---
drivers/char/virtio_console.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 7eaf303a7..4ed3ffb1d 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1117,8 +1117,6 @@ static int put_chars(u32 vtermno, const char *buf, int count)
{
struct port *port;
struct scatterlist sg[1];
- void *data;
- int ret;
if (unlikely(early_put_chars))
return early_put_chars(vtermno, buf, count);
@@ -1127,14 +1125,8 @@ static int put_chars(u32 vtermno, const char *buf, int count)
if (!port)
return -EPIPE;
- data = kmemdup(buf, count, GFP_ATOMIC);
- if (!data)
- return -ENOMEM;
-
- sg_init_one(sg, data, count);
- ret = __send_to_port(port, sg, 1, count, data, false);
- kfree(data);
- return ret;
+ sg_init_one(sg, buf, count);
+ return __send_to_port(port, sg, 1, count, (void *)buf, false);
}
/*
--
2.17.1
^ permalink raw reply related
* [PATCH v12 0/2] make hvc pass dma capable memory to its backend
From: Xianting Tian @ 2021-10-28 15:09 UTC (permalink / raw)
To: gregkh, jirislaby, amit, arnd, osandov
Cc: sfr, Xianting Tian, shile.zhang, linux-kernel, virtualization,
linuxppc-dev
Dear all,
This patch series make hvc framework pass DMA capable memory to
put_chars() of hvc backend(eg, virtio-console), and revert commit
c4baad5029 ("virtio-console: avoid DMA from stack”)
V1
virtio-console: avoid DMA from vmalloc area
https://lkml.org/lkml/2021/7/27/494
For v1 patch, Arnd Bergmann suggests to fix the issue in the first
place:
Make hvc pass DMA capable memory to put_chars()
The fix suggestion is included in v2.
V2
[PATCH 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/1/8
[PATCH 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/1/9
For v2 patch, Arnd Bergmann suggests to make new buf part of the
hvc_struct structure, and fix the compile issue.
The fix suggestion is included in v3.
V3
[PATCH v3 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/3/1347
[PATCH v3 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/3/1348
For v3 patch, Jiri Slaby suggests to make 'char c[N_OUTBUF]' part of
hvc_struct, and make 'hp->outbuf' aligned and use struct_size() to
calculate the size of hvc_struct. The fix suggestion is included in
v4.
V4
[PATCH v4 0/2] make hvc pass dma capable memory to its backend
https://lkml.org/lkml/2021/8/5/1350
[PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/5/1351
[PATCH v4 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/5/1352
For v4 patch, Arnd Bergmann suggests to introduce another
array(cons_outbuf[]) for the buffer pointers next to the cons_ops[]
and vtermnos[] arrays. This fix included in this v5 patch.
V5
Arnd Bergmann suggests to use "L1_CACHE_BYTES" as dma alignment,
use 'sizeof(long)' as dma alignment is wrong. fix it in v6.
V6
It contains coding error, fix it in v7 and it worked normally
according to test result.
V7
Greg KH suggests to add test and code review developer,
Jiri Slaby suggests to use lockless buffer and fix dma alignment
in separate patch.
fix above things in v8.
V8
This contains coding error when switch to use new buffer. fix it in v9.
V9
It didn't make things much clearer, it needs add more comments for new added buf.
Add use lock to protect new added buffer. fix in v10.
V10
Remove 'char outchar' and its lock from hvc_struct, adjust hvc_struct and use
pahole to display the struct layout.
fix it in v11.
V11
Fix early console print issue, which is broken by v11, in v12.
********TEST STEPS*********
1, config guest console=hvc0
2, start guest
3, login guest
Welcome to Buildroot
buildroot login: root
#
# cat /proc/cmdline
console=hvc0 root=/dev/vda rw init=/sbin/init
#
drivers/tty/hvc/hvc_console.c | 48 ++++++++++++++++++++++++-----------
drivers/tty/hvc/hvc_console.h | 30 +++++++++++++++++++++-
drivers/char/virtio_console.c | 12 ++----------
3 file changed
^ permalink raw reply
* Re: [PATCH] mm/migrate.c: Remove MIGRATE_PFN_LOCKED
From: Felix Kuehling @ 2021-10-28 15:33 UTC (permalink / raw)
To: Alistair Popple, akpm
Cc: rcampbell, amd-gfx, nouveau, linux-kernel, dri-devel, linux-mm,
jglisse, kvm-ppc, ziy, jhubbard, alexander.deucher, linuxppc-dev,
hch, bskeggs
In-Reply-To: <2096706.TdNOD7Y7u4@nvdebian>
Am 2021-10-27 um 9:42 p.m. schrieb Alistair Popple:
> On Wednesday, 27 October 2021 3:09:57 AM AEDT Felix Kuehling wrote:
>> Am 2021-10-25 um 12:16 a.m. schrieb Alistair Popple:
>>> MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
>>> source page was already locked during migrate_vma_collect(). If it
>>> wasn't then the a second attempt is made to lock the page. However if
>>> the first attempt failed it's unlikely a second attempt will succeed,
>>> and the retry adds complexity. So clean this up by removing the retry
>>> and MIGRATE_PFN_LOCKED flag.
>>>
>>> Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag
>>> set, but nothing actually checks that.
>>>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> It makes sense to me. Do you have any empirical data on how much more
>> likely migrations are going to fail with this change due to contested
>> page locks?
> Thanks Felix. I do not have any empirical data on this but I've mostly seen
> migrations fail due to the reference count check failing rather than failure to
> lock the page. Even then it's mostly been due to thrashing on the same page, so
> I would be surprised if this change made any noticeable difference.
We have seen more page locking contention on NUMA systems that disappear
when we disable NUMA balancing. Probably NUMA balancing migrations
result in the page lock being more contended, which can cause HMM
migration of some pages to fail.
Also, for migrations to system memory, multiple threads page faulting
concurrently can cause contention. I was just helping debug such an
issue. Having migrations to system memory be only partially successful
is problematic. We'll probably have to implement some retry logic in the
driver to handle this.
Regards,
Felix
>
>> Either way, the patch is
>>
>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>
>>
>>> ---
>>> Documentation/vm/hmm.rst | 2 +-
>>> arch/powerpc/kvm/book3s_hv_uvmem.c | 4 +-
>>> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 -
>>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 4 +-
>>> include/linux/migrate.h | 1 -
>>> lib/test_hmm.c | 5 +-
>>> mm/migrate.c | 145 +++++------------------
>>> 7 files changed, 35 insertions(+), 128 deletions(-)
>>>
>>> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
>>> index a14c2938e7af..f2a59ed82ed3 100644
>>> --- a/Documentation/vm/hmm.rst
>>> +++ b/Documentation/vm/hmm.rst
>>> @@ -360,7 +360,7 @@ between device driver specific code and shared common code:
>>> system memory page, locks the page with ``lock_page()``, and fills in the
>>> ``dst`` array entry with::
>>>
>>> - dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>>> + dst[i] = migrate_pfn(page_to_pfn(dpage));
>>>
>>> Now that the driver knows that this page is being migrated, it can
>>> invalidate device private MMU mappings and copy device private memory
>>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> index a7061ee3b157..28c436df9935 100644
>>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> @@ -560,7 +560,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>> gpa, 0, page_shift);
>>>
>>> if (ret == U_SUCCESS)
>>> - *mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
>>> + *mig.dst = migrate_pfn(pfn);
>>> else {
>>> unlock_page(dpage);
>>> __free_page(dpage);
>>> @@ -774,7 +774,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>>> }
>>> }
>>>
>>> - *mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>>> + *mig.dst = migrate_pfn(page_to_pfn(dpage));
>>> migrate_vma_pages(&mig);
>>> out_finalize:
>>> migrate_vma_finalize(&mig);
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>> index 4a16e3c257b9..41d9417f182b 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>> @@ -300,7 +300,6 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>>> migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
>>> svm_migrate_get_vram_page(prange, migrate->dst[i]);
>>> migrate->dst[i] = migrate_pfn(migrate->dst[i]);
>>> - migrate->dst[i] |= MIGRATE_PFN_LOCKED;
>>> src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
>>> DMA_TO_DEVICE);
>>> r = dma_mapping_error(dev, src[i]);
>>> @@ -580,7 +579,6 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>> dst[i] >> PAGE_SHIFT, page_to_pfn(dpage));
>>>
>>> migrate->dst[i] = migrate_pfn(page_to_pfn(dpage));
>>> - migrate->dst[i] |= MIGRATE_PFN_LOCKED;
>>> j++;
>>> }
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>>> index 92987daa5e17..3828aafd3ac4 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>>> @@ -166,7 +166,7 @@ static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
>>> goto error_dma_unmap;
>>> mutex_unlock(&svmm->mutex);
>>>
>>> - args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>>> + args->dst[0] = migrate_pfn(page_to_pfn(dpage));
>>> return 0;
>>>
>>> error_dma_unmap:
>>> @@ -602,7 +602,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>>> ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
>>> if (src & MIGRATE_PFN_WRITE)
>>> *pfn |= NVIF_VMM_PFNMAP_V0_W;
>>> - return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>>> + return migrate_pfn(page_to_pfn(dpage));
>>>
>>> out_dma_unmap:
>>> dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>> index c8077e936691..479b861ae490 100644
>>> --- a/include/linux/migrate.h
>>> +++ b/include/linux/migrate.h
>>> @@ -119,7 +119,6 @@ static inline int migrate_misplaced_page(struct page *page,
>>> */
>>> #define MIGRATE_PFN_VALID (1UL << 0)
>>> #define MIGRATE_PFN_MIGRATE (1UL << 1)
>>> -#define MIGRATE_PFN_LOCKED (1UL << 2)
>>> #define MIGRATE_PFN_WRITE (1UL << 3)
>>> #define MIGRATE_PFN_SHIFT 6
>>>
>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>> index c259842f6d44..e2ce8f9b7605 100644
>>> --- a/lib/test_hmm.c
>>> +++ b/lib/test_hmm.c
>>> @@ -613,8 +613,7 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
>>> */
>>> rpage->zone_device_data = dmirror;
>>>
>>> - *dst = migrate_pfn(page_to_pfn(dpage)) |
>>> - MIGRATE_PFN_LOCKED;
>>> + *dst = migrate_pfn(page_to_pfn(dpage));
>>> if ((*src & MIGRATE_PFN_WRITE) ||
>>> (!spage && args->vma->vm_flags & VM_WRITE))
>>> *dst |= MIGRATE_PFN_WRITE;
>>> @@ -1137,7 +1136,7 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
>>> lock_page(dpage);
>>> xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
>>> copy_highpage(dpage, spage);
>>> - *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>>> + *dst = migrate_pfn(page_to_pfn(dpage));
>>> if (*src & MIGRATE_PFN_WRITE)
>>> *dst |= MIGRATE_PFN_WRITE;
>>> }
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index a6a7743ee98f..915e969811d0 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -2369,7 +2369,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> * can't be dropped from it).
>>> */
>>> get_page(page);
>>> - migrate->cpages++;
>>>
>>> /*
>>> * Optimize for the common case where page is only mapped once
>>> @@ -2379,7 +2378,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> if (trylock_page(page)) {
>>> pte_t swp_pte;
>>>
>>> - mpfn |= MIGRATE_PFN_LOCKED;
>>> + migrate->cpages++;
>>> ptep_get_and_clear(mm, addr, ptep);
>>>
>>> /* Setup special migration page table entry */
>>> @@ -2413,6 +2412,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>
>>> if (pte_present(pte))
>>> unmapped++;
>>> + } else {
>>> + put_page(page);
>>> + mpfn = 0;
>>> }
>>>
>>> next:
>>> @@ -2517,15 +2519,17 @@ static bool migrate_vma_check_page(struct page *page)
>>> }
>>>
>>> /*
>>> - * migrate_vma_prepare() - lock pages and isolate them from the lru
>>> + * migrate_vma_unmap() - replace page mapping with special migration pte entry
>>> * @migrate: migrate struct containing all migration information
>>> *
>>> - * This locks pages that have been collected by migrate_vma_collect(). Once each
>>> - * page is locked it is isolated from the lru (for non-device pages). Finally,
>>> - * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
>>> - * migrated by concurrent kernel threads.
>>> + * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
>>> + * special migration pte entry and check if it has been pinned. Pinned pages are
>>> + * restored because we cannot migrate them.
>>> + *
>>> + * This is the last step before we call the device driver callback to allocate
>>> + * destination memory and copy contents of original page over to new page.
>>> */
>>> -static void migrate_vma_prepare(struct migrate_vma *migrate)
>>> +static void migrate_vma_unmap(struct migrate_vma *migrate)
>>> {
>>> const unsigned long npages = migrate->npages;
>>> const unsigned long start = migrate->start;
>>> @@ -2534,32 +2538,12 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
>>>
>>> lru_add_drain();
>>>
>>> - for (i = 0; (i < npages) && migrate->cpages; i++) {
>>> + for (i = 0; i < npages; i++) {
>>> struct page *page = migrate_pfn_to_page(migrate->src[i]);
>>> - bool remap = true;
>>>
>>> if (!page)
>>> continue;
>>>
>>> - if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
>>> - /*
>>> - * Because we are migrating several pages there can be
>>> - * a deadlock between 2 concurrent migration where each
>>> - * are waiting on each other page lock.
>>> - *
>>> - * Make migrate_vma() a best effort thing and backoff
>>> - * for any page we can not lock right away.
>>> - */
>>> - if (!trylock_page(page)) {
>>> - migrate->src[i] = 0;
>>> - migrate->cpages--;
>>> - put_page(page);
>>> - continue;
>>> - }
>>> - remap = false;
>>> - migrate->src[i] |= MIGRATE_PFN_LOCKED;
>>> - }
>>> -
>>> /* ZONE_DEVICE pages are not on LRU */
>>> if (!is_zone_device_page(page)) {
>>> if (!PageLRU(page) && allow_drain) {
>>> @@ -2569,16 +2553,9 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
>>> }
>>>
>>> if (isolate_lru_page(page)) {
>>> - if (remap) {
>>> - migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>> - migrate->cpages--;
>>> - restore++;
>>> - } else {
>>> - migrate->src[i] = 0;
>>> - unlock_page(page);
>>> - migrate->cpages--;
>>> - put_page(page);
>>> - }
>>> + migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>> + migrate->cpages--;
>>> + restore++;
>>> continue;
>>> }
>>>
>>> @@ -2586,80 +2563,20 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
>>> put_page(page);
>>> }
>>>
>>> - if (!migrate_vma_check_page(page)) {
>>> - if (remap) {
>>> - migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>> - migrate->cpages--;
>>> - restore++;
>>> -
>>> - if (!is_zone_device_page(page)) {
>>> - get_page(page);
>>> - putback_lru_page(page);
>>> - }
>>> - } else {
>>> - migrate->src[i] = 0;
>>> - unlock_page(page);
>>> - migrate->cpages--;
>>> + if (page_mapped(page))
>>> + try_to_migrate(page, 0);
>>>
>>> - if (!is_zone_device_page(page))
>>> - putback_lru_page(page);
>>> - else
>>> - put_page(page);
>>> + if (page_mapped(page) || !migrate_vma_check_page(page)) {
>>> + if (!is_zone_device_page(page)) {
>>> + get_page(page);
>>> + putback_lru_page(page);
>>> }
>>> - }
>>> - }
>>> -
>>> - for (i = 0, addr = start; i < npages && restore; i++, addr += PAGE_SIZE) {
>>> - struct page *page = migrate_pfn_to_page(migrate->src[i]);
>>> -
>>> - if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
>>> - continue;
>>>
>>> - remove_migration_pte(page, migrate->vma, addr, page);
>>> -
>>> - migrate->src[i] = 0;
>>> - unlock_page(page);
>>> - put_page(page);
>>> - restore--;
>>> - }
>>> -}
>>> -
>>> -/*
>>> - * migrate_vma_unmap() - replace page mapping with special migration pte entry
>>> - * @migrate: migrate struct containing all migration information
>>> - *
>>> - * Replace page mapping (CPU page table pte) with a special migration pte entry
>>> - * and check again if it has been pinned. Pinned pages are restored because we
>>> - * cannot migrate them.
>>> - *
>>> - * This is the last step before we call the device driver callback to allocate
>>> - * destination memory and copy contents of original page over to new page.
>>> - */
>>> -static void migrate_vma_unmap(struct migrate_vma *migrate)
>>> -{
>>> - const unsigned long npages = migrate->npages;
>>> - const unsigned long start = migrate->start;
>>> - unsigned long addr, i, restore = 0;
>>> -
>>> - for (i = 0; i < npages; i++) {
>>> - struct page *page = migrate_pfn_to_page(migrate->src[i]);
>>> -
>>> - if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
>>> + migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>> + migrate->cpages--;
>>> + restore++;
>>> continue;
>>> -
>>> - if (page_mapped(page)) {
>>> - try_to_migrate(page, 0);
>>> - if (page_mapped(page))
>>> - goto restore;
>>> }
>>> -
>>> - if (migrate_vma_check_page(page))
>>> - continue;
>>> -
>>> -restore:
>>> - migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>> - migrate->cpages--;
>>> - restore++;
>>> }
>>>
>>> for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
>>> @@ -2672,12 +2589,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>>>
>>> migrate->src[i] = 0;
>>> unlock_page(page);
>>> + put_page(page);
>>> restore--;
>>> -
>>> - if (is_zone_device_page(page))
>>> - put_page(page);
>>> - else
>>> - putback_lru_page(page);
>>> }
>>> }
>>>
>>> @@ -2700,8 +2613,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>>> * it for all those entries (ie with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE
>>> * flag set). Once these are allocated and copied, the caller must update each
>>> * corresponding entry in the dst array with the pfn value of the destination
>>> - * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_LOCKED flags set
>>> - * (destination pages must have their struct pages locked, via lock_page()).
>>> + * page and with MIGRATE_PFN_VALID. Destination pages must be locked via
>>> + * lock_page().
>>> *
>>> * Note that the caller does not have to migrate all the pages that are marked
>>> * with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration from
>>> @@ -2770,8 +2683,6 @@ int migrate_vma_setup(struct migrate_vma *args)
>>>
>>> migrate_vma_collect(args);
>>>
>>> - if (args->cpages)
>>> - migrate_vma_prepare(args);
>>> if (args->cpages)
>>> migrate_vma_unmap(args);
>>>
>
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox