* Re: [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation
From: Pavel Tatashin @ 2019-07-31 16:01 UTC (permalink / raw)
To: Mark Rutland
Cc: James Morris, Sasha Levin, Eric W. Biederman, kexec mailing list,
LKML, Jonathan Corbet, Catalin Marinas, will,
Linux Doc Mailing List, Linux ARM, Marc Zyngier, James Morse,
Vladimir Murzin, Matthias Brugger, Bhupesh Sharma
In-Reply-To: <20190731155042.GF39768@lakrids.cambridge.arm.com>
> For various reasons, one cannot safely use Set/Way operations in
> portable code. They only make sense for low-level platform-specific
> firmware performing power management operations.
>
> If you need to perform D-cache maintenance, you must use the VA
> operations to do so.
Hi Mark,
I see, thank you for letting me know. I will do d-cache flushing by VA
in the next iteration. First I need to root cause/fix the bug
described in the cover letter.
Thank you,
Pasha
^ permalink raw reply
* Re: [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation
From: Mark Rutland @ 2019-07-31 15:50 UTC (permalink / raw)
To: Pavel Tatashin
Cc: jmorris, sashal, ebiederm, kexec, linux-kernel, corbet,
catalin.marinas, will, linux-doc, linux-arm-kernel, marc.zyngier,
james.morse, vladimir.murzin, matthias.bgg, bhsharma
In-Reply-To: <20190731153857.4045-9-pasha.tatashin@soleen.com>
On Wed, Jul 31, 2019 at 11:38:57AM -0400, Pavel Tatashin wrote:
> +/*
> + * The following code is adopted from "Bare-metal Boot Code for ARMv8-A
> + * Processors Version 1.0, 5.3.1 Cleaning and invalidating the caches".
> + * http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a
> + */
> +.macro dcache_invalidate tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8
> + mov \tmp0, #0x0 /* tmp0 = Cache level */
> + msr CSSELR_EL1, \tmp0 /* 0x0 for L1, 0x2 for L2 */
> + mrs \tmp4, CCSIDR_EL1 /* Read Cache Size ID */
> + and \tmp1, \tmp4, #0x7
> + add \tmp1, \tmp1, #0x4 /* tmp1 Cache Line Size */
> + ldr \tmp3, =0x7fff
> + and \tmp2, \tmp3, \tmp4, lsr #13 /* tmp2 Cache Set num - 1 */
> + ldr \tmp3, =0x3ff
> + and \tmp3, \tmp3, \tmp4, lsr #3 /* tmp3 Cache Assoc. num - 1 */
> + clz \tmp4, \tmp3 /* tmp4 way pos. in the CISW */
> + mov \tmp5, #0 /* tmp5 way counter way_loop */
> +1: /* way_loop */
> + mov \tmp6, #0 /* tmp6 set counter set_loop */
> +2: /* set_loop */
> + lsl \tmp7, \tmp5, \tmp4
> + orr \tmp7, \tmp0, \tmp7 /* Set way */
> + lsl \tmp8, \tmp6, \tmp1
> + orr \tmp7, \tmp7, \tmp8 /* Set set */
> + dc cisw, \tmp7 /* Clean & Inval. cache line */
> + add \tmp6, \tmp6, #1 /* Increment set counter */
> + cmp \tmp6, \tmp2 /* Last set reached yet? */
> + ble 2b /* If not, iterate set_loop, */
> + add \tmp5, \tmp5, #1 /* else, next way. */
> + cmp \tmp5, \tmp3 /* Last way reached yet? */
> + ble 1b /* If not, iterate way_loop. */
> +.endm
> +
For various reasons, one cannot safely use Set/Way operations in
portable code. They only make sense for low-level platform-specific
firmware performing power management operations.
If you need to perform D-cache maintenance, you must use the VA
operations to do so.
Thanks,
Mark.
^ permalink raw reply
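For reference, maintenance by VA boils down to walking a range one cache line at a time. The sketch below is a user-space model in C, not kernel code: for_each_cache_line(), line_op and record_last() are hypothetical names, and the callback stands in for an instruction such as "dc civac", with a "dsb" expected once the loop is done. The loop that patch 8/8 removes from relocate_kernel.S (dc ivac over the destination page) has the same shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-line operation; kernel code would issue "dc civac" here. */
typedef void (*line_op)(uintptr_t line_va, void *ctx);

/*
 * Apply op to every cache line that overlaps [start, start + size).
 * Returns the number of lines visited. line_size must be a power of two.
 */
static size_t for_each_cache_line(uintptr_t start, size_t size,
				  size_t line_size, line_op op, void *ctx)
{
	assert(line_size && (line_size & (line_size - 1)) == 0);

	if (size == 0)
		return 0;

	uintptr_t end = start + size;
	/* Round the first address down to a line boundary. */
	uintptr_t va = start & ~(uintptr_t)(line_size - 1);
	size_t n = 0;

	for (; va < end; va += line_size) {
		op(va, ctx);
		n++;
	}
	/* Real code needs a barrier (dsb) here before relying on the result. */
	return n;
}

/* Example callback: remember the last line address that was touched. */
static void record_last(uintptr_t line_va, void *ctx)
{
	*(uintptr_t *)ctx = line_va;
}
```

Unlike Set/Way loops, the amount of work here depends only on the size of the range and the line size, not on the geometry of any particular cache level.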
* [RFC v2 2/8] arm64, mm: transitional tables
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
In-Reply-To: <20190731153857.4045-1-pasha.tatashin@soleen.com>
There are cases where normal kernel page tables, i.e. idmap_pg_dir
and swapper_pg_dir, are not sufficient because they may be overwritten.
This happens when we transition from one world to another: for example
during the kexec kernel relocation transition, and also during the
hibernate kernel restore transition.
In these cases, if the MMU is needed, the page table memory must be allocated
from a safe place. Transitional tables are intended to allow just that.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
arch/arm64/Kconfig | 4 +
arch/arm64/include/asm/pgtable-hwdef.h | 1 +
arch/arm64/include/asm/trans_table.h | 66 ++++++
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/trans_table.c | 272 +++++++++++++++++++++++++
5 files changed, 344 insertions(+)
create mode 100644 arch/arm64/include/asm/trans_table.h
create mode 100644 arch/arm64/mm/trans_table.c
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3adcec05b1f6..91a7416ffe4e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -999,6 +999,10 @@ config CRASH_DUMP
For more details see Documentation/admin-guide/kdump/kdump.rst
+config TRANS_TABLE
+ def_bool y
+ depends on HIBERNATION || KEXEC_CORE
+
config XEN_DOM0
def_bool y
depends on XEN
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index db92950bb1a0..dcb4f13c7888 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -110,6 +110,7 @@
#define PUD_TABLE_BIT (_AT(pudval_t, 1) << 1)
#define PUD_TYPE_MASK (_AT(pudval_t, 3) << 0)
#define PUD_TYPE_SECT (_AT(pudval_t, 1) << 0)
+#define PUD_SECT_RDONLY (_AT(pudval_t, 1) << 7) /* AP[2] */
/*
* Level 2 descriptor (PMD).
diff --git a/arch/arm64/include/asm/trans_table.h b/arch/arm64/include/asm/trans_table.h
new file mode 100644
index 000000000000..4d7bd0bf36c0
--- /dev/null
+++ b/arch/arm64/include/asm/trans_table.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2019, Microsoft Corporation.
+ * Pavel Tatashin <patatash@linux.microsoft.com>
+ */
+
+#ifndef _ASM_TRANS_TABLE_H
+#define _ASM_TRANS_TABLE_H
+
+#include <asm/pgtable-types.h>
+
+/*
+ * trans_alloc_page
+ * - Allocator that should return exactly one uninitialized page; if this
+ * allocator fails, trans_table returns -ENOMEM error.
+ *
+ * trans_alloc_arg
+ * - Passed to trans_alloc_page as an argument
+ *
+ * trans_flags
+ * - bitmap with flags that control how page table is filled.
+ * TRANS_MKWRITE: during page table copy make PTE, PMD, and PUD page
+ * writeable by removing RDONLY flag from PTE.
+ * TRANS_MKVALID: during page table copy, if PTE present, but not valid,
+ * make it valid.
+ * TRANS_CHECKPFN: During page table copy, for every PTE entry check that
+ * PFN that this PTE points to is valid. Otherwise return
+ * -ENXIO
+ * TRANS_FORCEMAP: During page map, if translation exists, force
+ * overwrite it. Otherwise -ENXIO may be returned by
+ * trans_table_map_* functions if conflict is detected.
+ */
+
+#define TRANS_MKWRITE (1 << 0)
+#define TRANS_MKVALID (1 << 1)
+#define TRANS_CHECKPFN (1 << 2)
+#define TRANS_FORCEMAP (1 << 3)
+
+struct trans_table_info {
+ void * (*trans_alloc_page)(void *);
+ void *trans_alloc_arg;
+ unsigned long trans_flags;
+};
+
+/* Create an empty trans table. */
+int trans_table_create_empty(struct trans_table_info *info,
+ pgd_t **trans_table);
+
+/*
+ * Create trans table and copy entries from from_table to trans_table in range
+ * [start, end)
+ */
+int trans_table_create_copy(struct trans_table_info *info, pgd_t **trans_table,
+ pgd_t *from_table, unsigned long start,
+ unsigned long end);
+
+/*
+ * Add map entry to trans_table for a base-size page at PTE level.
+ * page: page to be mapped.
+ * dst_addr: new VA address for the pages
+ * pgprot: protection for the page.
+ */
+int trans_table_map_page(struct trans_table_info *info, pgd_t *trans_table,
+ void *page, unsigned long dst_addr, pgprot_t pgprot);
+
+#endif /* _ASM_TRANS_TABLE_H */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..3794fff18659 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -6,6 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_ARM64_PTDUMP_CORE) += dump.o
obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS) += ptdump_debugfs.o
+obj-$(CONFIG_TRANS_TABLE) += trans_table.o
obj-$(CONFIG_NUMA) += numa.o
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
KASAN_SANITIZE_physaddr.o += n
diff --git a/arch/arm64/mm/trans_table.c b/arch/arm64/mm/trans_table.c
new file mode 100644
index 000000000000..d5729eb318b7
--- /dev/null
+++ b/arch/arm64/mm/trans_table.c
@@ -0,0 +1,272 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2019, Microsoft Corporation.
+ * Pavel Tatashin <patatash@linux.microsoft.com>
+ */
+
+/*
+ * Transitional tables are used while the system transitions from one world to
+ * another, such as during hibernate restore and kexec reboots. During these
+ * phases one cannot rely on the page tables not being overwritten.
+ *
+ */
+
+#include <asm/trans_table.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+
+static void *trans_alloc(struct trans_table_info *info)
+{
+ void *page = info->trans_alloc_page(info->trans_alloc_arg);
+
+ if (page)
+ clear_page(page);
+
+ return page;
+}
+
+static int trans_table_copy_pte(struct trans_table_info *info, pte_t *dst_ptep,
+ pte_t *src_ptep, unsigned long start,
+ unsigned long end)
+{
+ unsigned long addr = start;
+ int i = pte_index(addr);
+
+ do {
+ pte_t src_pte = READ_ONCE(src_ptep[i]);
+
+ if (pte_none(src_pte))
+ continue;
+ if (info->trans_flags & TRANS_MKWRITE)
+ src_pte = pte_mkwrite(src_pte);
+ if (info->trans_flags & TRANS_MKVALID)
+ src_pte = pte_mkpresent(src_pte);
+ if (info->trans_flags & TRANS_CHECKPFN) {
+ if (!pfn_valid(pte_pfn(src_pte)))
+ return -ENXIO;
+ }
+ set_pte(&dst_ptep[i], src_pte);
+ } while (addr += PAGE_SIZE, i++, addr != end && i < PTRS_PER_PTE);
+
+ return 0;
+}
+
+static int trans_table_copy_pmd(struct trans_table_info *info, pmd_t *dst_pmdp,
+ pmd_t *src_pmdp, unsigned long start,
+ unsigned long end)
+{
+ unsigned long next;
+ unsigned long addr = start;
+ int i = pmd_index(addr);
+ int rc;
+
+ do {
+ pmd_t src_pmd = READ_ONCE(src_pmdp[i]);
+ pmd_t dst_pmd = READ_ONCE(dst_pmdp[i]);
+ pte_t *dst_ptep, *src_ptep;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_none(src_pmd))
+ continue;
+
+ if (!pmd_table(src_pmd)) {
+ if (info->trans_flags & TRANS_MKWRITE)
+ pmd_val(src_pmd) &= ~PMD_SECT_RDONLY;
+ set_pmd(&dst_pmdp[i], src_pmd);
+ continue;
+ }
+
+ if (pmd_none(dst_pmd)) {
+ pte_t *t = trans_alloc(info);
+
+ if (!t)
+ return -ENOMEM;
+
+ __pmd_populate(&dst_pmdp[i], __pa(t), PTE_TYPE_PAGE);
+ dst_pmd = READ_ONCE(dst_pmdp[i]);
+ }
+
+ src_ptep = __va(pmd_page_paddr(src_pmd));
+ dst_ptep = __va(pmd_page_paddr(dst_pmd));
+
+ rc = trans_table_copy_pte(info, dst_ptep, src_ptep, addr, next);
+ if (rc)
+ return rc;
+ } while (addr = next, i++, addr != end && i < PTRS_PER_PMD);
+
+ return 0;
+}
+
+static int trans_table_copy_pud(struct trans_table_info *info, pud_t *dst_pudp,
+ pud_t *src_pudp, unsigned long start,
+ unsigned long end)
+{
+ unsigned long next;
+ unsigned long addr = start;
+ int i = pud_index(addr);
+ int rc;
+
+ do {
+ pud_t src_pud = READ_ONCE(src_pudp[i]);
+ pud_t dst_pud = READ_ONCE(dst_pudp[i]);
+ pmd_t *dst_pmdp, *src_pmdp;
+
+ next = pud_addr_end(addr, end);
+ if (pud_none(src_pud))
+ continue;
+
+ if (!pud_table(src_pud)) {
+ if (info->trans_flags & TRANS_MKWRITE)
+ pud_val(src_pud) &= ~PUD_SECT_RDONLY;
+ set_pud(&dst_pudp[i], src_pud);
+ continue;
+ }
+
+ if (pud_none(dst_pud)) {
+ pmd_t *t = trans_alloc(info);
+
+ if (!t)
+ return -ENOMEM;
+
+ __pud_populate(&dst_pudp[i], __pa(t), PMD_TYPE_TABLE);
+ dst_pud = READ_ONCE(dst_pudp[i]);
+ }
+
+ src_pmdp = __va(pud_page_paddr(src_pud));
+ dst_pmdp = __va(pud_page_paddr(dst_pud));
+
+ rc = trans_table_copy_pmd(info, dst_pmdp, src_pmdp, addr, next);
+ if (rc)
+ return rc;
+ } while (addr = next, i++, addr != end && i < PTRS_PER_PUD);
+
+ return 0;
+}
+
+static int trans_table_copy_pgd(struct trans_table_info *info, pgd_t *dst_pgdp,
+ pgd_t *src_pgdp, unsigned long start,
+ unsigned long end)
+{
+ unsigned long next;
+ unsigned long addr = start;
+ int i = pgd_index(addr);
+ int rc;
+
+ do {
+ pgd_t src_pgd;
+ pgd_t dst_pgd;
+ pud_t *dst_pudp, *src_pudp;
+
+ src_pgd = READ_ONCE(src_pgdp[i]);
+ dst_pgd = READ_ONCE(dst_pgdp[i]);
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(src_pgd))
+ continue;
+
+ if (pgd_none(dst_pgd)) {
+ pud_t *t = trans_alloc(info);
+
+ if (!t)
+ return -ENOMEM;
+
+ __pgd_populate(&dst_pgdp[i], __pa(t), PUD_TYPE_TABLE);
+ dst_pgd = READ_ONCE(dst_pgdp[i]);
+ }
+
+ src_pudp = __va(pgd_page_paddr(src_pgd));
+ dst_pudp = __va(pgd_page_paddr(dst_pgd));
+
+ rc = trans_table_copy_pud(info, dst_pudp, src_pudp, addr, next);
+ if (rc)
+ return rc;
+ } while (addr = next, i++, addr != end && i < PTRS_PER_PGD);
+
+ return 0;
+}
+
+int trans_table_create_empty(struct trans_table_info *info, pgd_t **trans_table)
+{
+ pgd_t *dst_pgdp = trans_alloc(info);
+
+ if (!dst_pgdp)
+ return -ENOMEM;
+
+ *trans_table = dst_pgdp;
+
+ return 0;
+}
+
+int trans_table_create_copy(struct trans_table_info *info, pgd_t **trans_table,
+ pgd_t *from_table, unsigned long start,
+ unsigned long end)
+{
+ int rc;
+
+ rc = trans_table_create_empty(info, trans_table);
+ if (rc)
+ return rc;
+
+ return trans_table_copy_pgd(info, *trans_table, from_table, start, end);
+}
+
+int trans_table_map_page(struct trans_table_info *info, pgd_t *trans_table,
+ void *page, unsigned long dst_addr, pgprot_t pgprot)
+{
+ int pgd_idx = pgd_index(dst_addr);
+ int pud_idx = pud_index(dst_addr);
+ int pmd_idx = pmd_index(dst_addr);
+ int pte_idx = pte_index(dst_addr);
+ pgd_t *pgdp = trans_table;
+ pgd_t pgd = READ_ONCE(pgdp[pgd_idx]);
+ pud_t *pudp, pud;
+ pmd_t *pmdp, pmd;
+ pte_t *ptep, pte;
+
+ if (pgd_none(pgd)) {
+ pud_t *t = trans_alloc(info);
+
+ if (!t)
+ return -ENOMEM;
+
+ __pgd_populate(&pgdp[pgd_idx], __pa(t), PUD_TYPE_TABLE);
+ pgd = READ_ONCE(pgdp[pgd_idx]);
+ }
+
+ pudp = __va(pgd_page_paddr(pgd));
+ pud = READ_ONCE(pudp[pud_idx]);
+ if (pud_sect(pud) && !(info->trans_flags & TRANS_FORCEMAP)) {
+ return -ENXIO;
+ } else if (pud_none(pud) || pud_sect(pud)) {
+ pmd_t *t = trans_alloc(info);
+
+ if (!t)
+ return -ENOMEM;
+
+ __pud_populate(&pudp[pud_idx], __pa(t), PMD_TYPE_TABLE);
+ pud = READ_ONCE(pudp[pud_idx]);
+ }
+
+ pmdp = __va(pud_page_paddr(pud));
+ pmd = READ_ONCE(pmdp[pmd_idx]);
+ if (pmd_sect(pmd) && !(info->trans_flags & TRANS_FORCEMAP)) {
+ return -ENXIO;
+ } else if (pmd_none(pmd) || pmd_sect(pmd)) {
+ pte_t *t = trans_alloc(info);
+
+ if (!t)
+ return -ENOMEM;
+
+ __pmd_populate(&pmdp[pmd_idx], __pa(t), PTE_TYPE_PAGE);
+ pmd = READ_ONCE(pmdp[pmd_idx]);
+ }
+
+ ptep = __va(pmd_page_paddr(pmd));
+ pte = READ_ONCE(ptep[pte_idx]);
+
+ if (!pte_none(pte) && !(info->trans_flags & TRANS_FORCEMAP))
+ return -ENXIO;
+
+ set_pte(&ptep[pte_idx], pfn_pte(virt_to_pfn(page), pgprot));
+
+ return 0;
+}
--
2.22.0
^ permalink raw reply related
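A rough user-space model of what trans_table_map_page() does may help when reviewing the walk: descend one level at a time, allocate missing tables through the caller's callback, and refuse to overwrite an existing leaf unless the force flag is set. Everything named model_* or pool_* below is hypothetical; the 4-level, 9-bits-per-level layout assumes 4K pages with a 48-bit VA, and block (section) entries are not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MODEL_LEVELS		4
#define MODEL_BITS_PER_LEVEL	9
#define MODEL_ENTRIES		(1u << MODEL_BITS_PER_LEVEL)
#define MODEL_PAGE_SHIFT	12
#define MODEL_FORCEMAP		(1u << 3)	/* mirrors TRANS_FORCEMAP */
#define POOL_PAGES		8

/* Interior slots hold a child-table pointer, leaf slots a nonzero value. */
typedef uintptr_t entry_t;

struct model_info {
	void *(*alloc_page)(void *arg);
	void *alloc_arg;
	unsigned long flags;
};

/* Stand-in for "safe" memory: a fixed pool handed out page by page. */
struct pool {
	entry_t pages[POOL_PAGES][MODEL_ENTRIES];
	unsigned int used;
};

static void *pool_alloc(void *arg)
{
	struct pool *p = arg;

	return p->used < POOL_PAGES ? p->pages[p->used++] : NULL;
}

/* Level 0 is the top (PGD) level, level 3 the PTE level. */
static unsigned int level_index(uint64_t va, int level)
{
	int shift = MODEL_PAGE_SHIFT +
		    MODEL_BITS_PER_LEVEL * (MODEL_LEVELS - 1 - level);

	return (va >> shift) & (MODEL_ENTRIES - 1);
}

/* Like trans_alloc(): take one page from the callback and zero it. */
static entry_t *model_alloc(struct model_info *info)
{
	void *page = info->alloc_page(info->alloc_arg);

	if (page)
		memset(page, 0, MODEL_ENTRIES * sizeof(entry_t));
	return page;
}

static int model_map_page(struct model_info *info, entry_t *top,
			  uint64_t va, entry_t value)
{
	entry_t *table = top;
	entry_t *leaf;

	assert(value != 0);	/* zero means "no mapping" in this model */

	for (int level = 0; level < MODEL_LEVELS - 1; level++) {
		entry_t *slot = &table[level_index(va, level)];

		if (!*slot) {
			entry_t *child = model_alloc(info);

			if (!child)
				return -ENOMEM;
			*slot = (entry_t)child;
		}
		table = (entry_t *)*slot;
	}

	leaf = &table[level_index(va, MODEL_LEVELS - 1)];
	if (*leaf && !(info->flags & MODEL_FORCEMAP))
		return -ENXIO;	/* conflict, and the caller did not force it */
	*leaf = value;
	return 0;
}
```

Each level picks its slot with its own shift, which is the role pgd_index(), pud_index(), pmd_index() and pte_index() play in the kernel.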
* [RFC v2 3/8] arm64: hibernate: switch to transitional page tables.
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
In-Reply-To: <20190731153857.4045-1-pasha.tatashin@soleen.com>
Transitional page tables provide the functionality needed to set up the
temporary page tables required for hibernate resume.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
arch/arm64/kernel/hibernate.c | 261 ++++++++--------------------------
1 file changed, 60 insertions(+), 201 deletions(-)
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 9341fcc6e809..4120b03a02fd 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -16,7 +16,6 @@
#define pr_fmt(x) "hibernate: " x
#include <linux/cpu.h>
#include <linux/kvm_host.h>
-#include <linux/mm.h>
#include <linux/pm.h>
#include <linux/sched.h>
#include <linux/suspend.h>
@@ -31,14 +30,12 @@
#include <asm/kexec.h>
#include <asm/memory.h>
#include <asm/mmu_context.h>
-#include <asm/pgalloc.h>
-#include <asm/pgtable.h>
-#include <asm/pgtable-hwdef.h>
#include <asm/sections.h>
#include <asm/smp.h>
#include <asm/smp_plat.h>
#include <asm/suspend.h>
#include <asm/sysreg.h>
+#include <asm/trans_table.h>
#include <asm/virt.h>
/*
@@ -182,6 +179,12 @@ int arch_hibernation_header_restore(void *addr)
}
EXPORT_SYMBOL(arch_hibernation_header_restore);
+static void *
+hibernate_page_alloc(void *arg)
+{
+ return (void *)get_safe_page((gfp_t)(unsigned long)arg);
+}
+
/*
* Copies length bytes, starting at src_start into an new page,
* perform cache maintentance, then maps it at the specified address low
@@ -196,57 +199,31 @@ EXPORT_SYMBOL(arch_hibernation_header_restore);
*/
static int create_safe_exec_page(void *src_start, size_t length,
unsigned long dst_addr,
- phys_addr_t *phys_dst_addr,
- void *(*allocator)(gfp_t mask),
- gfp_t mask)
+ phys_addr_t *phys_dst_addr)
{
- int rc = 0;
- pgd_t *pgdp;
- pud_t *pudp;
- pmd_t *pmdp;
- pte_t *ptep;
- unsigned long dst = (unsigned long)allocator(mask);
-
- if (!dst) {
- rc = -ENOMEM;
- goto out;
- }
-
- memcpy((void *)dst, src_start, length);
- __flush_icache_range(dst, dst + length);
+ struct trans_table_info trans_info = {
+ .trans_alloc_page = hibernate_page_alloc,
+ .trans_alloc_arg = (void *)GFP_ATOMIC,
+ .trans_flags = 0,
+ };
+ void *page = (void *)get_safe_page(GFP_ATOMIC);
+ pgd_t *trans_table;
+ int rc;
+
+ if (!page)
+ return -ENOMEM;
- pgdp = pgd_offset_raw(allocator(mask), dst_addr);
- if (pgd_none(READ_ONCE(*pgdp))) {
- pudp = allocator(mask);
- if (!pudp) {
- rc = -ENOMEM;
- goto out;
- }
- pgd_populate(&init_mm, pgdp, pudp);
- }
+ memcpy(page, src_start, length);
+ __flush_icache_range((unsigned long)page, (unsigned long)page + length);
- pudp = pud_offset(pgdp, dst_addr);
- if (pud_none(READ_ONCE(*pudp))) {
- pmdp = allocator(mask);
- if (!pmdp) {
- rc = -ENOMEM;
- goto out;
- }
- pud_populate(&init_mm, pudp, pmdp);
- }
-
- pmdp = pmd_offset(pudp, dst_addr);
- if (pmd_none(READ_ONCE(*pmdp))) {
- ptep = allocator(mask);
- if (!ptep) {
- rc = -ENOMEM;
- goto out;
- }
- pmd_populate_kernel(&init_mm, pmdp, ptep);
- }
+ rc = trans_table_create_empty(&trans_info, &trans_table);
+ if (rc)
+ return rc;
- ptep = pte_offset_kernel(pmdp, dst_addr);
- set_pte(ptep, pfn_pte(virt_to_pfn(dst), PAGE_KERNEL_EXEC));
+ rc = trans_table_map_page(&trans_info, trans_table, page, dst_addr,
+ PAGE_KERNEL_EXEC);
+ if (rc)
+ return rc;
/*
* Load our new page tables. A strict BBM approach requires that we
@@ -262,13 +239,12 @@ static int create_safe_exec_page(void *src_start, size_t length,
*/
cpu_set_reserved_ttbr0();
local_flush_tlb_all();
- write_sysreg(phys_to_ttbr(virt_to_phys(pgdp)), ttbr0_el1);
+ write_sysreg(phys_to_ttbr(virt_to_phys(trans_table)), ttbr0_el1);
isb();
- *phys_dst_addr = virt_to_phys((void *)dst);
+ *phys_dst_addr = virt_to_phys(page);
-out:
- return rc;
+ return 0;
}
#define dcache_clean_range(start, end) __flush_dcache_area(start, (end - start))
@@ -332,143 +308,6 @@ int swsusp_arch_suspend(void)
return ret;
}
-static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
-{
- pte_t pte = READ_ONCE(*src_ptep);
-
- if (pte_valid(pte)) {
- /*
- * Resume will overwrite areas that may be marked
- * read only (code, rodata). Clear the RDONLY bit from
- * the temporary mappings we use during restore.
- */
- set_pte(dst_ptep, pte_mkwrite(pte));
- } else if (debug_pagealloc_enabled() && !pte_none(pte)) {
- /*
- * debug_pagealloc will removed the PTE_VALID bit if
- * the page isn't in use by the resume kernel. It may have
- * been in use by the original kernel, in which case we need
- * to put it back in our copy to do the restore.
- *
- * Before marking this entry valid, check the pfn should
- * be mapped.
- */
- BUG_ON(!pfn_valid(pte_pfn(pte)));
-
- set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
- }
-}
-
-static int copy_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp, unsigned long start,
- unsigned long end)
-{
- pte_t *src_ptep;
- pte_t *dst_ptep;
- unsigned long addr = start;
-
- dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
- if (!dst_ptep)
- return -ENOMEM;
- pmd_populate_kernel(&init_mm, dst_pmdp, dst_ptep);
- dst_ptep = pte_offset_kernel(dst_pmdp, start);
-
- src_ptep = pte_offset_kernel(src_pmdp, start);
- do {
- _copy_pte(dst_ptep, src_ptep, addr);
- } while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
-
- return 0;
-}
-
-static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
- unsigned long end)
-{
- pmd_t *src_pmdp;
- pmd_t *dst_pmdp;
- unsigned long next;
- unsigned long addr = start;
-
- if (pud_none(READ_ONCE(*dst_pudp))) {
- dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
- if (!dst_pmdp)
- return -ENOMEM;
- pud_populate(&init_mm, dst_pudp, dst_pmdp);
- }
- dst_pmdp = pmd_offset(dst_pudp, start);
-
- src_pmdp = pmd_offset(src_pudp, start);
- do {
- pmd_t pmd = READ_ONCE(*src_pmdp);
-
- next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
- continue;
- if (pmd_table(pmd)) {
- if (copy_pte(dst_pmdp, src_pmdp, addr, next))
- return -ENOMEM;
- } else {
- set_pmd(dst_pmdp,
- __pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY));
- }
- } while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
-
- return 0;
-}
-
-static int copy_pud(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
- unsigned long end)
-{
- pud_t *dst_pudp;
- pud_t *src_pudp;
- unsigned long next;
- unsigned long addr = start;
-
- if (pgd_none(READ_ONCE(*dst_pgdp))) {
- dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
- if (!dst_pudp)
- return -ENOMEM;
- pgd_populate(&init_mm, dst_pgdp, dst_pudp);
- }
- dst_pudp = pud_offset(dst_pgdp, start);
-
- src_pudp = pud_offset(src_pgdp, start);
- do {
- pud_t pud = READ_ONCE(*src_pudp);
-
- next = pud_addr_end(addr, end);
- if (pud_none(pud))
- continue;
- if (pud_table(pud)) {
- if (copy_pmd(dst_pudp, src_pudp, addr, next))
- return -ENOMEM;
- } else {
- set_pud(dst_pudp,
- __pud(pud_val(pud) & ~PMD_SECT_RDONLY));
- }
- } while (dst_pudp++, src_pudp++, addr = next, addr != end);
-
- return 0;
-}
-
-static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
- unsigned long end)
-{
- unsigned long next;
- unsigned long addr = start;
- pgd_t *src_pgdp = pgd_offset_k(start);
-
- dst_pgdp = pgd_offset_raw(dst_pgdp, start);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_none(READ_ONCE(*src_pgdp)))
- continue;
- if (copy_pud(dst_pgdp, src_pgdp, addr, next))
- return -ENOMEM;
- } while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
-
- return 0;
-}
-
/*
* Setup then Resume from the hibernate image using swsusp_arch_suspend_exit().
*
@@ -484,21 +323,42 @@ int swsusp_arch_resume(void)
phys_addr_t phys_hibernate_exit;
void __noreturn (*hibernate_exit)(phys_addr_t, phys_addr_t, void *,
void *, phys_addr_t, phys_addr_t);
+ struct trans_table_info trans_info = {
+ .trans_alloc_page = hibernate_page_alloc,
+ .trans_alloc_arg = (void *)GFP_ATOMIC,
+ /*
+ * Resume will overwrite areas that may be marked read only
+ * (code, rodata). Clear the RDONLY bit from the temporary
+ * mappings we use during restore.
+ */
+ .trans_flags = TRANS_MKWRITE,
+ };
+
+ /*
+ * debug_pagealloc will remove the PTE_VALID bit if the page isn't in
+ * use by the resume kernel. It may have been in use by the original
+ * kernel, in which case we need to put it back in our copy to do the
+ * restore.
+ *
+ * Before marking this entry valid, check the pfn should be mapped.
+ */
+ if (debug_pagealloc_enabled())
+ trans_info.trans_flags |= (TRANS_MKVALID | TRANS_CHECKPFN);
/*
* Restoring the memory image will overwrite the ttbr1 page tables.
* Create a second copy of just the linear map, and use this when
* restoring.
*/
- tmp_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
- if (!tmp_pg_dir) {
- pr_err("Failed to allocate memory for temporary page tables.\n");
- rc = -ENOMEM;
+ rc = trans_table_create_copy(&trans_info, &tmp_pg_dir,
+ pgd_offset_k(PAGE_OFFSET), PAGE_OFFSET, 0);
+ if (rc) {
+ if (rc == -ENOMEM)
+ pr_err("Failed to allocate memory for temporary page tables.\n");
+ else if (rc == -ENXIO)
+ pr_err("Tried to set PTE for PFN that does not exist\n");
goto out;
}
- rc = copy_page_tables(tmp_pg_dir, PAGE_OFFSET, 0);
- if (rc)
- goto out;
/*
* We need a zero page that is zero before & after resume in order to
@@ -523,8 +383,7 @@ int swsusp_arch_resume(void)
*/
rc = create_safe_exec_page(__hibernate_exit_text_start, exit_size,
(unsigned long)hibernate_exit,
- &phys_hibernate_exit,
- (void *)get_safe_page, GFP_ATOMIC);
+ &phys_hibernate_exit);
if (rc) {
pr_err("Failed to create safe executable page for hibernate_exit code.\n");
goto out;
--
2.22.0
^ permalink raw reply related
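To make the flag handling in swsusp_arch_resume() easier to follow, here is a small user-space model of how the copy path treats one PTE. It is only a sketch: the MODEL_* names and both functions are hypothetical, and the bit positions (valid at bit 0, read-only at bit 7, write at bit 51) are assumptions about the arm64 descriptor layout rather than something taken from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_MKWRITE	(1u << 0)	/* mirrors TRANS_MKWRITE */
#define MODEL_MKVALID	(1u << 1)	/* mirrors TRANS_MKVALID */
#define MODEL_CHECKPFN	(1u << 2)	/* mirrors TRANS_CHECKPFN */

/* Assumed descriptor bits, for illustration only. */
#define MODEL_PTE_VALID		(UINT64_C(1) << 0)
#define MODEL_PTE_RDONLY	(UINT64_C(1) << 7)
#define MODEL_PTE_WRITE		(UINT64_C(1) << 51)

/* Flags used on resume: always writable, extra fixups with debug_pagealloc. */
static unsigned long resume_flags(int debug_pagealloc)
{
	unsigned long flags = MODEL_MKWRITE;

	if (debug_pagealloc)
		flags |= MODEL_MKVALID | MODEL_CHECKPFN;
	return flags;
}

/*
 * Copy one entry according to flags. pfn_ok tells whether the frame the
 * entry points at exists. Empty entries are skipped and *dst is untouched.
 */
static int model_copy_pte(uint64_t src, unsigned long flags, int pfn_ok,
			  uint64_t *dst)
{
	assert(dst != 0);

	if (src == 0)
		return 0;
	if (flags & MODEL_MKWRITE)
		src = (src | MODEL_PTE_WRITE) & ~MODEL_PTE_RDONLY;
	if (flags & MODEL_MKVALID)
		src |= MODEL_PTE_VALID;
	if ((flags & MODEL_CHECKPFN) && !pfn_ok)
		return -ENXIO;
	*dst = src;
	return 0;
}
```

One visible difference from the removed _copy_pte(): a bad PFN is reported to the caller as -ENXIO instead of triggering BUG_ON().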
* [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
In-Reply-To: <20190731153857.4045-1-pasha.tatashin@soleen.com>
Now that we have transitional page tables configured, temporarily enable
the MMU to allow faster relocation of segments to their final destination.
Performance data: for a moderately sized kernel + initramfs (25M), the
relocation was taking 0.382s; with the MMU enabled it now takes only
0.019s, a 20x improvement.
The time is proportional to the size of the relocation, so with a larger
initramfs, e.g. 100M, it could take over a second without the MMU.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
arch/arm64/kernel/relocate_kernel.S | 192 ++++++++++++++++++++++------
1 file changed, 154 insertions(+), 38 deletions(-)
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index d352faf7cbe6..88fc69adb90d 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -4,6 +4,8 @@
*
* Copyright (C) Linaro.
* Copyright (C) Huawei Futurewei Technologies.
+ * Copyright (c) 2019, Microsoft Corporation.
+ * Pavel Tatashin <patatash@linux.microsoft.com>
*/
#include <linux/kexec.h>
@@ -13,6 +15,130 @@
#include <asm/kexec.h>
#include <asm/page.h>
#include <asm/sysreg.h>
+#include <asm/kvm_arm.h>
+
+/*
+ * The following code is adopted from "Bare-metal Boot Code for ARMv8-A
+ * Processors Version 1.0, 5.3.1 Cleaning and invalidating the caches".
+ * http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a
+ */
+.macro dcache_invalidate tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8
+ mov \tmp0, #0x0 /* tmp0 = Cache level */
+ msr CSSELR_EL1, \tmp0 /* 0x0 for L1, 0x2 for L2 */
+ mrs \tmp4, CCSIDR_EL1 /* Read Cache Size ID */
+ and \tmp1, \tmp4, #0x7
+ add \tmp1, \tmp1, #0x4 /* tmp1 Cache Line Size */
+ ldr \tmp3, =0x7fff
+ and \tmp2, \tmp3, \tmp4, lsr #13 /* tmp2 Cache Set num - 1 */
+ ldr \tmp3, =0x3ff
+ and \tmp3, \tmp3, \tmp4, lsr #3 /* tmp3 Cache Assoc. num - 1 */
+ clz \tmp4, \tmp3 /* tmp4 way pos. in the CISW */
+ mov \tmp5, #0 /* tmp5 way counter way_loop */
+1: /* way_loop */
+ mov \tmp6, #0 /* tmp6 set counter set_loop */
+2: /* set_loop */
+ lsl \tmp7, \tmp5, \tmp4
+ orr \tmp7, \tmp0, \tmp7 /* Set way */
+ lsl \tmp8, \tmp6, \tmp1
+ orr \tmp7, \tmp7, \tmp8 /* Set set */
+ dc cisw, \tmp7 /* Clean & Inval. cache line */
+ add \tmp6, \tmp6, #1 /* Increment set counter */
+ cmp \tmp6, \tmp2 /* Last set reached yet? */
+ ble 2b /* If not, iterate set_loop, */
+ add \tmp5, \tmp5, #1 /* else, next way. */
+ cmp \tmp5, \tmp3 /* Last way reached yet? */
+ ble 1b /* If not, iterate way_loop. */
+.endm
+
+/*
+ * Invalidate all TLB: if we are running at EL2, invalidate all TLB at EL1 & EL2,
+ * if we are running at EL1 invalidate all current VMID TLB at EL1.
+ */
+.macro tlb_invalidate tmp
+ mrs \tmp, CurrentEL
+ cmp \tmp, #CurrentEL_EL2
+ isb
+ b.ne 1f
+ dsb sy
+ tlbi alle2
+ tlbi alle1
+ dsb ish
+ isb
+ b 2f
+1:
+ dsb sy
+ tlbi vmalle1
+ dsb ish
+ isb
+2:
+.endm
+
+.macro turn_off_mmu_el sctlr, tmp1, tmp2
+ mrs \tmp1, \sctlr
+ ldr \tmp2, =SCTLR_ELx_FLAGS
+ bic \tmp1, \tmp1, \tmp2
+ pre_disable_mmu_workaround
+ msr \sctlr, \tmp1
+ isb
+.endm
+
+.macro turn_off_mmu tmp1, tmp2
+ turn_off_mmu_el sctlr_el1, \tmp1, \tmp2 /* Turn off MMU at EL1 */
+ mrs \tmp1, CurrentEL
+ cmp \tmp1, #CurrentEL_EL2
+ b.ne 1f
+ turn_off_mmu_el sctlr_el2, \tmp1, \tmp2 /* Turn off MMU at EL2 */
+1:
+.endm
+
+/* Configure TCR_EL2 and MAIR_EL2 */
+.macro tcr_mair_mmu_el2 tmp1, tmp2, tmp3
+ mrs \tmp1, tcr_el1
+ ldr \tmp2, =TCR_EL2_MASK
+ and \tmp1, \tmp1, \tmp2
+ mov \tmp2, #TCR_EL2_RES1
+ orr \tmp1, \tmp1, \tmp2
+ ldr \tmp2, =TCR_T0SZ(VA_BITS)
+ orr \tmp1, \tmp1, \tmp2
+ tcr_compute_pa_size \tmp1, #TCR_EL2_PS_SHIFT, \tmp2, \tmp3
+ msr tcr_el2, \tmp1
+ mrs \tmp1, mair_el1
+ msr mair_el2, \tmp1
+.endm
+
+.macro turn_on_mmu tmp1, tmp2, tmp3
+ mrs \tmp1, CurrentEL
+ cmp \tmp1, #CurrentEL_EL2
+ b.ne 1f
+ tcr_mair_mmu_el2 \tmp1, \tmp2, \tmp3
+ ldr \tmp1, =(SCTLR_EL2_RES1 | SCTLR_ELx_FLAGS | ENDIAN_SET_EL2)
+ msr sctlr_el2, \tmp1
+ b 2f
+1: mrs \tmp1, sctlr_el1
+ ldr \tmp2, =SCTLR_ELx_FLAGS
+ orr \tmp1, \tmp1, \tmp2
+ msr sctlr_el1, \tmp1
+2: ic iallu
+ dsb nsh
+ isb
+.endm
+
+.macro set_ttbr_el ttbr_reg, trans_table
+ phys_to_ttbr \trans_table, \trans_table
+ msr \ttbr_reg, \trans_table
+ isb
+.endm
+
+.macro set_ttbr trans_table, tmp
+ mrs \tmp, CurrentEL
+ cmp \tmp, #CurrentEL_EL2
+ b.ne 1f
+ set_ttbr_el ttbr0_el2 \trans_table
+ b 2f
+1:
+ set_ttbr_el ttbr0_el1 \trans_table
+2:
+.endm
/*
* arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
@@ -24,59 +150,49 @@
* symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end. The
* machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
* safe memory that has been set up to be preserved during the copy operation.
+ *
+ * This function temporarily enables the MMU if kernel relocation is needed. This
+ * is done for performance reasons: with the MMU enabled, arm64 is much quicker
+ * at copying pages because caching is enabled as well.
*/
ENTRY(arm64_relocate_new_kernel)
- /* Clear the sctlr_el2 flags. */
- mrs x2, CurrentEL
- cmp x2, #CurrentEL_EL2
- b.ne 1f
- mrs x2, sctlr_el2
- ldr x1, =SCTLR_ELx_FLAGS
- bic x2, x2, x1
- pre_disable_mmu_workaround
- msr sctlr_el2, x2
- isb
-1: /* Check if the new image needs relocation. */
- ldr x16, [x0, #KRELOC_HEAD] /* x16 = kimage_head */
- tbnz x16, IND_DONE_BIT, .Ldone
- raw_dcache_line_size x15, x1 /* x15 = dcache line size */
+ /* MMU on EL2 might still be on, turn it off for now */
+ turn_off_mmu x1, x2
+ dcache_invalidate x1, x2, x3, x4, x5, x6, x7, x8, x9
+ tlb_invalidate x1
+
+ /* Check if the new image needs relocation. */
+ ldr x12, [x0, #KRELOC_HEAD] /* x12 = kimage_head */
+ tbnz x12, IND_DONE_BIT, .Ldone
+ ldr x1, [x0, #KRELOC_TRANS_TABLE]
+ set_ttbr x1, x2
+ turn_on_mmu x1, x2, x3
.Lloop:
- and x12, x16, PAGE_MASK /* x12 = addr */
+ and x2, x12, PAGE_MASK /* x2 = addr */
/* Test the entry flags. */
.Ltest_source:
- tbz x16, IND_SOURCE_BIT, .Ltest_indirection
-
- /* Invalidate dest page to PoC. */
- mov x2, x13
- add x20, x2, #PAGE_SIZE
- sub x1, x15, #1
- bic x2, x2, x1
-2: dc ivac, x2
- add x2, x2, x15
- cmp x2, x20
- b.lo 2b
- dsb sy
-
- copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
+ tbz x12, IND_SOURCE_BIT, .Ltest_indirection
+ copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
b .Lnext
.Ltest_indirection:
- tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
- mov x14, x12 /* ptr = addr */
+ tbz x12, IND_INDIRECTION_BIT, .Ltest_destination
+ mov x11, x2 /* x11 = ptr */
b .Lnext
.Ltest_destination:
- tbz x16, IND_DESTINATION_BIT, .Lnext
- mov x13, x12 /* dest = addr */
+ tbz x12, IND_DESTINATION_BIT, .Lnext
+ mov x1, x2 /* x1 = dest */
.Lnext:
- ldr x16, [x14], #8 /* entry = *ptr++ */
- tbz x16, IND_DONE_BIT, .Lloop /* while (!(entry & DONE)) */
-.Ldone:
+ ldr x12, [x11], #8 /* x12 = entry = *ptr++ */
+ tbz x12, IND_DONE_BIT, .Lloop /* while (!(entry & DONE)) */
/* wait for writes from copy_page to finish */
dsb nsh
ic iallu
dsb nsh
isb
-
- /* Start new image. */
+ turn_off_mmu x1, x2
+ dcache_invalidate x1, x2, x3, x4, x5, x6, x7, x8, x9
+ tlb_invalidate x1
+.Ldone: /* Start new image. */
ldr x4, [x0, #KRELOC_ENTRY_ADDR] /* x4 = kimage_start */
ldr x3, [x0, #KRELOC_KERN_ARG3]
ldr x2, [x0, #KRELOC_KERN_ARG2]
--
2.22.0
^ permalink raw reply related
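The copy loop in arm64_relocate_new_kernel walks the kexec entry list, and a C rendering of that walk can make the register shuffling easier to check. This is a user-space sketch under assumptions: relocate() and MODEL_PAGE_SIZE are hypothetical, "addresses" are byte offsets into a simulated memory buffer, and the IND_* values mirror the flag bits from linux/kexec.h. The head entry is expected to be an indirection (or done) entry, as it is for a loaded kimage.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE	4096UL
#define MODEL_PAGE_MASK	(~(MODEL_PAGE_SIZE - 1))

#define IND_DESTINATION	0x1UL
#define IND_INDIRECTION	0x2UL
#define IND_DONE	0x4UL
#define IND_SOURCE	0x8UL

/*
 * Walk the entry list starting at head and copy source pages to their
 * destinations inside mem. Returns the number of pages copied.
 */
static unsigned long relocate(uint8_t *mem, unsigned long head)
{
	unsigned long entry = head;		/* x12 in the patch */
	unsigned long dest = 0;			/* x1 */
	const unsigned long *ptr = NULL;	/* x11 */
	unsigned long copied = 0;

	while (!(entry & IND_DONE)) {
		unsigned long addr = entry & MODEL_PAGE_MASK;	/* x2 */

		if (entry & IND_SOURCE) {
			/* copy_page also advances the destination. */
			memcpy(mem + dest, mem + addr, MODEL_PAGE_SIZE);
			dest += MODEL_PAGE_SIZE;
			copied++;
		} else if (entry & IND_INDIRECTION) {
			ptr = (const unsigned long *)(mem + addr);
		} else if (entry & IND_DESTINATION) {
			dest = addr;
		}

		assert(ptr != NULL);	/* list must start with an indirection */
		entry = *ptr++;		/* entry = *ptr++ */
	}
	return copied;
}
```

With the MMU and caches on, the memcpy() step is where the speedup comes from; the list walk itself is unchanged apart from register allocation.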
* [RFC v2 7/8] arm64, kexec: configure transitional page table for kexec
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
In-Reply-To: <20190731153857.4045-1-pasha.tatashin@soleen.com>
Configure a page table located in kexec-safe memory that has
the following mappings:
1. mapping for text of relocation function with executable permission.
2. mapping for argument for relocation function.
3. mappings for all source ranges
4. mappings for all destination ranges.
5. mappings for array that contains information about source/destinations.
We could make this page table contain linear addresses, but instead use
identity maps (va == pa) for every mapping. This is because the relocation
code can be executed at EL2, where ttbr1 might not be available. There is
no way to execute the relocation code at EL1, because the old world is
overwritten and thus there is no place to trap to in order to escalate to EL2.
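As a rough illustration of how many pages the list of mappings above works out to, here is a userspace sketch that walks a flat entry list and counts the pages that would need an identity mapping. It is not the code from this patch: count_idmap_pages() is a hypothetical helper, the IND_* values are assumed to match include/linux/kexec.h, and indirection entries are only counted, not followed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits kept in the low bits of each page-aligned entry.
 * Assumption: these values match include/linux/kexec.h. */
#define IND_DESTINATION	0x1u
#define IND_INDIRECTION	0x2u
#define IND_DONE	0x4u
#define IND_SOURCE	0x8u
#define IND_FLAGS	(IND_DESTINATION | IND_INDIRECTION | IND_DONE | IND_SOURCE)

/*
 * Hypothetical helper: count the pages that need a va == pa mapping.
 * Simplification: the list is flat, so an indirection entry is counted
 * as one page but its target is not followed.
 */
static size_t count_idmap_pages(const uint64_t *entries, size_t n)
{
	size_t pages = 0;
	size_t i;

	assert(entries != NULL);
	for (i = 0; i < n && !(entries[i] & IND_DONE); i++) {
		switch (entries[i] & IND_FLAGS) {
		case IND_DESTINATION:
			/* Only moves the destination cursor; nothing to map yet. */
			break;
		case IND_INDIRECTION:
			/* The page that holds the entry array itself. */
			pages += 1;
			break;
		case IND_SOURCE:
			/* One source page plus the destination page it is copied to. */
			pages += 2;
			break;
		}
	}
	return pages;
}
```

For a list with one indirection entry, one destination entry and two source entries, the helper reports five pages: the array page plus two source/destination pairs.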
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
arch/arm64/include/asm/kexec.h | 3 +
arch/arm64/kernel/asm-offsets.c | 1 +
arch/arm64/kernel/machine_kexec.c | 96 ++++++++++++++++++++++++++++++-
3 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index d5b79d4c7fae..1f226cc76e24 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -97,6 +97,8 @@ static inline void crash_post_resume(void) {}
* kernel, or purgatory entry address).
* kern_arg0 first argument to kernel is its dtb address. The other
* arguments are currently unused, and must be set to 0
+ * trans_table: idmap for source and destination pages, as well as for
+ * relocation text.
*/
struct kern_reloc_arg {
unsigned long head;
@@ -105,6 +107,7 @@ struct kern_reloc_arg {
unsigned long kern_arg1;
unsigned long kern_arg2;
unsigned long kern_arg3;
+ unsigned long trans_table;
};
#define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 900394907fd8..002db58b28f3 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -135,6 +135,7 @@ int main(void)
DEFINE(KRELOC_KERN_ARG1, offsetof(struct kern_reloc_arg, kern_arg1));
DEFINE(KRELOC_KERN_ARG2, offsetof(struct kern_reloc_arg, kern_arg2));
DEFINE(KRELOC_KERN_ARG3, offsetof(struct kern_reloc_arg, kern_arg3));
+ DEFINE(KRELOC_TRANS_TABLE, offsetof(struct kern_reloc_arg, trans_table));
#endif
return 0;
}
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index d7291a663379..402c8fb48f7e 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -20,6 +20,7 @@
#include <asm/mmu.h>
#include <asm/mmu_context.h>
#include <asm/page.h>
+#include <asm/trans_table.h>
#include "cpu-reset.h"
@@ -72,11 +73,96 @@ static void *kexec_page_alloc(void *arg)
return page_address(page);
}
+/*
+ * idmap every segment that needs to be relocated. We map pages for
+ * destination, source, and also array that holds source, and destination
+ * addresses.
+ * Ideally, we could linearly map src and dst addresses, so, in relocation
+ * routine we would need to only do memcpy(dst, src, len), but this is not
+ * possible, because on armv8.0 EL2 does not have ttbr1, and thus we might
+ * not have enough linear VA range. So, simply idmap it here, that works
+ * for both EL1, and EL2. Note: we cannot really do relocation in EL1 and
+ * later upgrade to EL2 because old world is erased, so there is no where
+ * to trap.
+ */
+static int map_segments(struct kimage *kimage, pgd_t *pgdp,
+ struct trans_table_info *info)
+{
+ unsigned long *ptr = 0;
+ unsigned long dest = 0;
+ unsigned long entry, addr;
+ int rc;
+
+ for (entry = kimage->head; !(entry & IND_DONE); entry = *ptr++) {
+ addr = entry & PAGE_MASK;
+
+ switch (entry & IND_FLAGS) {
+ case IND_DESTINATION:
+ dest = addr;
+ break;
+ case IND_INDIRECTION:
+ ptr = __va(addr);
+ rc = trans_table_map_page(info, pgdp, ptr,
+ addr, PAGE_KERNEL);
+ if (rc)
+ return rc;
+ break;
+ case IND_SOURCE:
+ rc = trans_table_map_page(info, pgdp, __va(addr),
+ addr, PAGE_KERNEL);
+ if (rc)
+ return rc;
+ rc = trans_table_map_page(info, pgdp, __va(dest),
+ dest, PAGE_KERNEL);
+ if (rc)
+ return rc;
+ dest += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static int mmu_relocate_setup(struct kimage *kimage, unsigned long kern_reloc,
+ struct kern_reloc_arg *kern_reloc_arg)
+{
+ struct trans_table_info info = {
+ .trans_alloc_page = kexec_page_alloc,
+ .trans_alloc_arg = kimage,
+ .trans_flags = 0,
+ };
+ pgd_t *trans_table;
+ int rc;
+
+ rc = trans_table_create_empty(&info, &trans_table);
+ if (rc)
+ return rc;
+
+ rc = map_segments(kimage, trans_table, &info);
+ if (rc)
+ return rc;
+
+ /* Map relocation function va == pa */
+ rc = trans_table_map_page(&info, trans_table, __va(kern_reloc),
+ kern_reloc, PAGE_KERNEL_EXEC);
+ if (rc)
+ return rc;
+
+ /* Map relocation function argument va == pa */
+ rc = trans_table_map_page(&info, trans_table, kern_reloc_arg,
+ __pa(kern_reloc_arg), PAGE_KERNEL);
+ if (rc)
+ return rc;
+
+ kern_reloc_arg->trans_table = __pa(trans_table);
+
+ return 0;
+}
int machine_kexec_post_load(struct kimage *kimage)
{
unsigned long kern_reloc;
struct kern_reloc_arg *kern_reloc_arg;
+ int rc = 0;
kern_reloc = page_to_phys(kimage->control_code_page);
memcpy(__va(kern_reloc), arm64_relocate_new_kernel,
@@ -94,8 +180,16 @@ int machine_kexec_post_load(struct kimage *kimage)
kern_reloc_arg->entry_addr = kimage->start;
kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;
+ /*
+ * If relocation is not needed, we do not need to enable MMU in
+ * relocation routine, therefore do not create page tables for
+ * scenarios such as crash kernel
+ */
+ if (!(kimage->head & IND_DONE))
+ rc = mmu_relocate_setup(kimage, kern_reloc, kern_reloc_arg);
+
kexec_image_info(kimage);
- return 0;
+ return rc;
}
/**
--
2.22.0
* [RFC v2 6/8] arm64, kexec: add expandable argument to relocation function
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
In-Reply-To: <20190731153857.4045-1-pasha.tatashin@soleen.com>
Currently, kexec relocation function (arm64_relocate_new_kernel) accepts
the following arguments:
head: start of array that contains relocation information.
entry: entry point for new kernel or purgatory.
dtb_mem: first and only argument to entry.
The number of arguments cannot be easily extended, because this
function is also called from HVC_SOFT_RESTART, which preserves only
three arguments. Also, arm64_relocate_new_kernel is written in
assembly and called without a stack, so there is no place to move extra
arguments in order to free registers.
Soon, we will need to pass more arguments: once we enable MMU we
will need to pass information about page tables.
Another benefit of allowing this function to accept more arguments is that
the kernel can actually accept up to 4 arguments (x0-x3), although currently
only one is used. If in the future we need more (for example, to
pass information about when the previous kernel exited, to have a precise
measurement of time spent in purgatory), we won't be able to do that easily
if arm64_relocate_new_kernel can't accept more arguments.
So, add a new struct, kern_reloc_arg, and place it in a kexec-safe page (i.e.
memory that is not overwritten during relocation).
Thus, make arm64_relocate_new_kernel take only one argument, which
contains all the needed information.
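To make the single-argument idea concrete, here is a small userspace sketch of the layout. The names reloc_arg_sketch, SKETCH_KRELOC_* and fill_reloc_arg() are hypothetical stand-ins, and uint64_t is used in place of unsigned long on the assumption of an LP64 arm64 build. The enum plays the role of the constants that asm-offsets.c would generate for the assembly side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the argument block (hypothetical name). */
struct reloc_arg_sketch {
	uint64_t head;		/* start of the relocation entry list */
	uint64_t entry_addr;	/* where to jump once relocation is done */
	uint64_t kern_arg0;	/* dtb address */
	uint64_t kern_arg1;	/* unused, must be 0 */
	uint64_t kern_arg2;	/* unused, must be 0 */
	uint64_t kern_arg3;	/* unused, must be 0 */
};

/* Stand-ins for the offsets the assembly would use with "ldr xN, [x0, #off]". */
enum {
	SKETCH_KRELOC_HEAD	 = offsetof(struct reloc_arg_sketch, head),
	SKETCH_KRELOC_ENTRY_ADDR = offsetof(struct reloc_arg_sketch, entry_addr),
	SKETCH_KRELOC_KERN_ARG0	 = offsetof(struct reloc_arg_sketch, kern_arg0),
	SKETCH_KRELOC_KERN_ARG3	 = offsetof(struct reloc_arg_sketch, kern_arg3),
};

/* Zero the block first so that the unused kernel arguments stay 0. */
static void fill_reloc_arg(struct reloc_arg_sketch *arg, uint64_t head,
			   uint64_t entry, uint64_t dtb)
{
	assert(arg != NULL);
	memset(arg, 0, sizeof(*arg));
	arg->head = head;
	arg->entry_addr = entry;
	arg->kern_arg0 = dtb;
}
```

Adding a field later only appends one member and one offset constant; the calling convention (one pointer in x0) stays the same.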
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
arch/arm64/include/asm/kexec.h | 18 ++++++
arch/arm64/kernel/asm-offsets.c | 9 +++
arch/arm64/kernel/cpu-reset.S | 4 +-
arch/arm64/kernel/cpu-reset.h | 8 +--
arch/arm64/kernel/machine_kexec.c | 29 +++++++++-
arch/arm64/kernel/relocate_kernel.S | 88 ++++++++++-------------------
6 files changed, 87 insertions(+), 69 deletions(-)
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index d15ca1ca1e83..d5b79d4c7fae 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,12 +90,30 @@ static inline void crash_prepare_suspend(void) {}
static inline void crash_post_resume(void) {}
#endif
+/*
+ * kern_reloc_arg is passed to kernel relocation function as an argument.
+ * head kimage->head, allows to traverse through relocation segments.
+ * entry_addr kimage->start, where to jump from relocation function (new
+ * kernel, or purgatory entry address).
+ * kern_arg0 first argument to kernel is its dtb address. The other
+ * arguments are currently unused, and must be set to 0
+ */
+struct kern_reloc_arg {
+ unsigned long head;
+ unsigned long entry_addr;
+ unsigned long kern_arg0;
+ unsigned long kern_arg1;
+ unsigned long kern_arg2;
+ unsigned long kern_arg3;
+};
+
#define ARCH_HAS_KIMAGE_ARCH
struct kimage_arch {
void *dtb;
unsigned long dtb_mem;
unsigned long kern_reloc;
+ unsigned long kern_reloc_arg;
};
#ifdef CONFIG_KEXEC_FILE
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 214685760e1c..900394907fd8 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -23,6 +23,7 @@
#include <asm/suspend.h>
#include <linux/kbuild.h>
#include <linux/arm-smccc.h>
+#include <linux/kexec.h>
int main(void)
{
@@ -126,6 +127,14 @@ int main(void)
#ifdef CONFIG_ARM_SDE_INTERFACE
DEFINE(SDEI_EVENT_INTREGS, offsetof(struct sdei_registered_event, interrupted_regs));
DEFINE(SDEI_EVENT_PRIORITY, offsetof(struct sdei_registered_event, priority));
+#endif
+#ifdef CONFIG_KEXEC_CORE
+ DEFINE(KRELOC_HEAD, offsetof(struct kern_reloc_arg, head));
+ DEFINE(KRELOC_ENTRY_ADDR, offsetof(struct kern_reloc_arg, entry_addr));
+ DEFINE(KRELOC_KERN_ARG0, offsetof(struct kern_reloc_arg, kern_arg0));
+ DEFINE(KRELOC_KERN_ARG1, offsetof(struct kern_reloc_arg, kern_arg1));
+ DEFINE(KRELOC_KERN_ARG2, offsetof(struct kern_reloc_arg, kern_arg2));
+ DEFINE(KRELOC_KERN_ARG3, offsetof(struct kern_reloc_arg, kern_arg3));
#endif
return 0;
}
diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
index 6ea337d464c4..64c78a42919f 100644
--- a/arch/arm64/kernel/cpu-reset.S
+++ b/arch/arm64/kernel/cpu-reset.S
@@ -43,9 +43,7 @@ ENTRY(__cpu_soft_restart)
hvc #0 // no return
1: mov x18, x1 // entry
- mov x0, x2 // arg0
- mov x1, x3 // arg1
- mov x2, x4 // arg2
+ mov x0, x2 // arg
br x18
ENDPROC(__cpu_soft_restart)
diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
index ed50e9587ad8..7a8720ff186f 100644
--- a/arch/arm64/kernel/cpu-reset.h
+++ b/arch/arm64/kernel/cpu-reset.h
@@ -11,12 +11,10 @@
#include <asm/virt.h>
void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
- unsigned long arg0, unsigned long arg1, unsigned long arg2);
+ unsigned long arg);
static inline void __noreturn cpu_soft_restart(unsigned long entry,
- unsigned long arg0,
- unsigned long arg1,
- unsigned long arg2)
+ unsigned long arg)
{
typeof(__cpu_soft_restart) *restart;
@@ -25,7 +23,7 @@ static inline void __noreturn cpu_soft_restart(unsigned long entry,
restart = (void *)__pa_symbol(__cpu_soft_restart);
cpu_install_idmap();
- restart(el2_switch, entry, arg0, arg1, arg2);
+ restart(el2_switch, entry, arg);
unreachable();
}
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 596c9b9657be..d7291a663379 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -43,6 +43,7 @@ static void _kexec_image_info(const char *func, int line,
pr_debug(" head: %lx\n", kimage->head);
pr_debug(" nr_segments: %lu\n", kimage->nr_segments);
pr_debug(" kern_reloc: %pa\n", &kimage->arch.kern_reloc);
+ pr_debug(" kern_reloc_arg: %pa\n", &kimage->arch.kern_reloc_arg);
for (i = 0; i < kimage->nr_segments; i++) {
pr_debug(" segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu pages\n",
@@ -59,14 +60,39 @@ void machine_kexec_cleanup(struct kimage *kimage)
/* Empty routine needed to avoid build errors. */
}
+/* Allocates pages for kexec page table */
+static void *kexec_page_alloc(void *arg)
+{
+ struct kimage *kimage = (struct kimage *)arg;
+ struct page *page = kimage_alloc_control_pages(kimage, 0);
+
+ if (!page)
+ return NULL;
+
+ return page_address(page);
+}
+
+
int machine_kexec_post_load(struct kimage *kimage)
{
unsigned long kern_reloc;
+ struct kern_reloc_arg *kern_reloc_arg;
kern_reloc = page_to_phys(kimage->control_code_page);
memcpy(__va(kern_reloc), arm64_relocate_new_kernel,
arm64_relocate_new_kernel_size);
+
+ kern_reloc_arg = kexec_page_alloc(kimage);
+ if (!kern_reloc_arg)
+ return -ENOMEM;
+ memset(kern_reloc_arg, 0, sizeof (struct kern_reloc_arg));
+
kimage->arch.kern_reloc = kern_reloc;
+ kimage->arch.kern_reloc_arg = __pa(kern_reloc_arg);
+
+ kern_reloc_arg->head = kimage->head;
+ kern_reloc_arg->entry_addr = kimage->start;
+ kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;
kexec_image_info(kimage);
return 0;
@@ -203,8 +229,7 @@ void machine_kexec(struct kimage *kimage)
* userspace (kexec-tools).
* In kexec_file case, the kernel starts directly without purgatory.
*/
- cpu_soft_restart(kimage->arch.kern_reloc, kimage->head, kimage->start,
- kimage->arch.dtb_mem);
+ cpu_soft_restart(kimage->arch.kern_reloc, kimage->arch.kern_reloc_arg);
BUG(); /* Should never get here. */
}
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index c1d7db71a726..d352faf7cbe6 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -8,7 +8,7 @@
#include <linux/kexec.h>
#include <linux/linkage.h>
-
+#include <asm/asm-offsets.h>
#include <asm/assembler.h>
#include <asm/kexec.h>
#include <asm/page.h>
@@ -17,86 +17,58 @@
/*
* arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
*
- * The memory that the old kernel occupies may be overwritten when coping the
+ * The memory that the old kernel occupies may be overwritten when copying the
* new image to its final location. To assure that the
* arm64_relocate_new_kernel routine which does that copy is not overwritten,
* all code and data needed by arm64_relocate_new_kernel must be between the
* symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end. The
* machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
- * control_code_page, a special page which has been set up to be preserved
- * during the copy operation.
+ * safe memory that has been set up to be preserved during the copy operation.
*/
ENTRY(arm64_relocate_new_kernel)
-
- /* Setup the list loop variables. */
- mov x18, x2 /* x18 = dtb address */
- mov x17, x1 /* x17 = kimage_start */
- mov x16, x0 /* x16 = kimage_head */
- raw_dcache_line_size x15, x0 /* x15 = dcache line size */
- mov x14, xzr /* x14 = entry ptr */
- mov x13, xzr /* x13 = copy dest */
-
/* Clear the sctlr_el2 flags. */
- mrs x0, CurrentEL
- cmp x0, #CurrentEL_EL2
+ mrs x2, CurrentEL
+ cmp x2, #CurrentEL_EL2
b.ne 1f
- mrs x0, sctlr_el2
+ mrs x2, sctlr_el2
ldr x1, =SCTLR_ELx_FLAGS
- bic x0, x0, x1
+ bic x2, x2, x1
pre_disable_mmu_workaround
- msr sctlr_el2, x0
+ msr sctlr_el2, x2
isb
-1:
-
- /* Check if the new image needs relocation. */
+1: /* Check if the new image needs relocation. */
+ ldr x16, [x0, #KRELOC_HEAD] /* x16 = kimage_head */
tbnz x16, IND_DONE_BIT, .Ldone
-
+ raw_dcache_line_size x15, x1 /* x15 = dcache line size */
.Lloop:
and x12, x16, PAGE_MASK /* x12 = addr */
-
/* Test the entry flags. */
.Ltest_source:
tbz x16, IND_SOURCE_BIT, .Ltest_indirection
/* Invalidate dest page to PoC. */
- mov x0, x13
- add x20, x0, #PAGE_SIZE
+ mov x2, x13
+ add x20, x2, #PAGE_SIZE
sub x1, x15, #1
- bic x0, x0, x1
-2: dc ivac, x0
- add x0, x0, x15
- cmp x0, x20
+ bic x2, x2, x1
+2: dc ivac, x2
+ add x2, x2, x15
+ cmp x2, x20
b.lo 2b
dsb sy
- mov x20, x13
- mov x21, x12
- copy_page x20, x21, x0, x1, x2, x3, x4, x5, x6, x7
-
- /* dest += PAGE_SIZE */
- add x13, x13, PAGE_SIZE
+ copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
b .Lnext
-
.Ltest_indirection:
tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
-
- /* ptr = addr */
- mov x14, x12
+ mov x14, x12 /* ptr = addr */
b .Lnext
-
.Ltest_destination:
tbz x16, IND_DESTINATION_BIT, .Lnext
-
- /* dest = addr */
- mov x13, x12
-
+ mov x13, x12 /* dest = addr */
.Lnext:
- /* entry = *ptr++ */
- ldr x16, [x14], #8
-
- /* while (!(entry & DONE)) */
- tbz x16, IND_DONE_BIT, .Lloop
-
+ ldr x16, [x14], #8 /* entry = *ptr++ */
+ tbz x16, IND_DONE_BIT, .Lloop /* while (!(entry & DONE)) */
.Ldone:
/* wait for writes from copy_page to finish */
dsb nsh
@@ -105,18 +77,16 @@ ENTRY(arm64_relocate_new_kernel)
isb
/* Start new image. */
- mov x0, x18
- mov x1, xzr
- mov x2, xzr
- mov x3, xzr
- br x17
-
-ENDPROC(arm64_relocate_new_kernel)
+ ldr x4, [x0, #KRELOC_ENTRY_ADDR] /* x4 = kimage_start */
+ ldr x3, [x0, #KRELOC_KERN_ARG3]
+ ldr x2, [x0, #KRELOC_KERN_ARG2]
+ ldr x1, [x0, #KRELOC_KERN_ARG1]
+ ldr x0, [x0, #KRELOC_KERN_ARG0] /* x0 = dtb address */
+ br x4
+END(arm64_relocate_new_kernel)
.ltorg
-
.align 3 /* To keep the 64-bit values below naturally aligned. */
-
.Lcopy_end:
.org KEXEC_CONTROL_PAGE_SIZE
--
2.22.0
* [RFC v2 5/8] arm64, kexec: move relocation function setup and clean up
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
In-Reply-To: <20190731153857.4045-1-pasha.tatashin@soleen.com>
Currently, the kernel relocation function is configured in machine_kexec()
at the time of kexec reboot by using control_code_page.
This operation, however, is more logically done during kexec_load,
and can thus be removed from reboot time. Move the setup of this function to
the newly added machine_kexec_post_load().
In addition, do some cleanup: add info about the relocation function to
kexec_image_info(), and remove extra messages from machine_kexec().
Make dtb_mem always available; if CONFIG_KEXEC_FILE is not configured,
dtb_mem is set to zero anyway.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
arch/arm64/include/asm/kexec.h | 3 +-
arch/arm64/kernel/machine_kexec.c | 47 +++++++++++--------------------
2 files changed, 18 insertions(+), 32 deletions(-)
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 12a561a54128..d15ca1ca1e83 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,14 +90,15 @@ static inline void crash_prepare_suspend(void) {}
static inline void crash_post_resume(void) {}
#endif
-#ifdef CONFIG_KEXEC_FILE
#define ARCH_HAS_KIMAGE_ARCH
struct kimage_arch {
void *dtb;
unsigned long dtb_mem;
+ unsigned long kern_reloc;
};
+#ifdef CONFIG_KEXEC_FILE
extern const struct kexec_file_ops kexec_image_ops;
struct kimage;
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 0df8493624e0..596c9b9657be 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -42,6 +42,7 @@ static void _kexec_image_info(const char *func, int line,
pr_debug(" start: %lx\n", kimage->start);
pr_debug(" head: %lx\n", kimage->head);
pr_debug(" nr_segments: %lu\n", kimage->nr_segments);
+ pr_debug(" kern_reloc: %pa\n", &kimage->arch.kern_reloc);
for (i = 0; i < kimage->nr_segments; i++) {
pr_debug(" segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu pages\n",
@@ -58,6 +59,19 @@ void machine_kexec_cleanup(struct kimage *kimage)
/* Empty routine needed to avoid build errors. */
}
+int machine_kexec_post_load(struct kimage *kimage)
+{
+ unsigned long kern_reloc;
+
+ kern_reloc = page_to_phys(kimage->control_code_page);
+ memcpy(__va(kern_reloc), arm64_relocate_new_kernel,
+ arm64_relocate_new_kernel_size);
+ kimage->arch.kern_reloc = kern_reloc;
+
+ kexec_image_info(kimage);
+ return 0;
+}
+
/**
* machine_kexec_prepare - Prepare for a kexec reboot.
*
@@ -67,8 +81,6 @@ void machine_kexec_cleanup(struct kimage *kimage)
*/
int machine_kexec_prepare(struct kimage *kimage)
{
- kexec_image_info(kimage);
-
if (kimage->type != KEXEC_TYPE_CRASH && cpus_are_stuck_in_kernel()) {
pr_err("Can't kexec: CPUs are stuck in the kernel.\n");
return -EBUSY;
@@ -143,8 +155,7 @@ static void kexec_segment_flush(const struct kimage *kimage)
*/
void machine_kexec(struct kimage *kimage)
{
- phys_addr_t reboot_code_buffer_phys;
- void *reboot_code_buffer;
+ void *reboot_code_buffer = phys_to_virt(kimage->arch.kern_reloc);
bool in_kexec_crash = (kimage == kexec_crash_image);
bool stuck_cpus = cpus_are_stuck_in_kernel();
@@ -155,30 +166,8 @@ void machine_kexec(struct kimage *kimage)
WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
"Some CPUs may be stale, kdump will be unreliable.\n");
- reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
- reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
-
kexec_image_info(kimage);
- pr_debug("%s:%d: control_code_page: %p\n", __func__, __LINE__,
- kimage->control_code_page);
- pr_debug("%s:%d: reboot_code_buffer_phys: %pa\n", __func__, __LINE__,
- &reboot_code_buffer_phys);
- pr_debug("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
- reboot_code_buffer);
- pr_debug("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
- arm64_relocate_new_kernel);
- pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
- __func__, __LINE__, arm64_relocate_new_kernel_size,
- arm64_relocate_new_kernel_size);
-
- /*
- * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
- * after the kernel is shut down.
- */
- memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
- arm64_relocate_new_kernel_size);
-
/* Flush the reboot_code_buffer in preparation for its execution. */
__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
@@ -214,12 +203,8 @@ void machine_kexec(struct kimage *kimage)
* userspace (kexec-tools).
* In kexec_file case, the kernel starts directly without purgatory.
*/
- cpu_soft_restart(reboot_code_buffer_phys, kimage->head, kimage->start,
-#ifdef CONFIG_KEXEC_FILE
+ cpu_soft_restart(kimage->arch.kern_reloc, kimage->head, kimage->start,
kimage->arch.dtb_mem);
-#else
- 0);
-#endif
BUG(); /* Should never get here. */
}
--
2.22.0
* [RFC v2 4/8] kexec: add machine_kexec_post_load()
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
In-Reply-To: <20190731153857.4045-1-pasha.tatashin@soleen.com>
It is the same as machine_kexec_prepare(), but is called after segments are
loaded. This way, an architecture can do processing work with already loaded
relocation segments. One such example is arm64: it has to have segments loaded
in order to create a page table, but it cannot do so at kexec time, because
at that time allocations won't be possible anymore.
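The hook pattern described above can be sketched in userspace as follows. This is only an illustration: post_load_hook_sketch() and load_image_sketch() are hypothetical names, struct kimage_sketch is an opaque stand-in, and the weak attribute is spelled out in GCC/Clang syntax on the assumption that this is what the kernel's __weak expands to.

```c
#include <assert.h>
#include <stddef.h>

struct kimage_sketch;	/* opaque stand-in for struct kimage */

/*
 * Default hook: nothing to do, report success. An architecture that needs
 * post-load work would provide a non-weak definition with the same name,
 * and the linker would pick that one instead.
 */
__attribute__((weak)) int post_load_hook_sketch(struct kimage_sketch *image)
{
	(void)image;
	return 0;
}

/* Load path: call the hook once the segments are in place. */
static int load_image_sketch(struct kimage_sketch *image)
{
	int ret;

	assert(image != NULL);
	ret = post_load_hook_sketch(image);
	if (ret)
		return ret;	/* the image is not installed on failure */

	/* ... install the new image here ... */
	return 0;
}
```

Because the default returns 0, architectures that do not override the hook see no change in behavior.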
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
kernel/kexec.c | 4 ++++
kernel/kexec_core.c | 6 ++++++
kernel/kexec_file.c | 4 ++++
kernel/kexec_internal.h | 2 ++
4 files changed, 16 insertions(+)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 1b018f1a6e0d..27b71dc7b35a 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -159,6 +159,10 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
kimage_terminate(image);
+ ret = machine_kexec_post_load(image);
+ if (ret)
+ goto out;
+
/* Install the new kernel and uninstall the old */
image = xchg(dest_image, image);
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 2c5b72863b7b..8360645d1bbe 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -587,6 +587,12 @@ static void kimage_free_extra_pages(struct kimage *image)
kimage_free_page_list(&image->unusable_pages);
}
+
+int __weak machine_kexec_post_load(struct kimage *image)
+{
+ return 0;
+}
+
void kimage_terminate(struct kimage *image)
{
if (*image->entry != 0)
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index b8cc032d5620..cb531d768114 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -391,6 +391,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
kimage_terminate(image);
+ ret = machine_kexec_post_load(image);
+ if (ret)
+ goto out;
+
/*
* Free up any temporary buffers allocated which are not needed
* after image has been loaded
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 48aaf2ac0d0d..39d30ccf8d87 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -13,6 +13,8 @@ void kimage_terminate(struct kimage *image);
int kimage_is_destination_range(struct kimage *image,
unsigned long start, unsigned long end);
+int machine_kexec_post_load(struct kimage *image);
+
extern struct mutex kexec_mutex;
#ifdef CONFIG_KEXEC_FILE
--
2.22.0
* [RFC v2 1/8] kexec: quiet down kexec reboot
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
In-Reply-To: <20190731153857.4045-1-pasha.tatashin@soleen.com>
Here is a regular kexec command sequence and output:
=====
$ kexec --reuse-cmdline -i --load Image
$ kexec -e
[ 161.342002] kexec_core: Starting new kernel
Welcome to Buildroot
buildroot login:
=====
Even when "quiet" kernel parameter is specified, "kexec_core: Starting
new kernel" is printed.
This message has KERN_EMERG level, but there is no emergency: it is a
normal kexec operation, so quiet it down to the more appropriate KERN_NOTICE.
Machines that have a slow console baud rate benefit from less output.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
---
kernel/kexec_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index d5870723b8ad..2c5b72863b7b 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1169,7 +1169,7 @@ int kernel_kexec(void)
* CPU hotplug again; so re-enable it here.
*/
cpu_hotplug_enable();
- pr_emerg("Starting new kernel\n");
+ pr_notice("Starting new kernel\n");
machine_shutdown();
}
--
2.22.0
* [RFC v2 0/8] arm64: MMU enabled kexec relocation
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
bhsharma
Changelog from previous RFC:
- Added trans_table support for both hibernate and kexec.
- Fixed performance issue, where enabling MMU did not yield the
actual performance improvement.
Bug:
In its current state, this patch series works on kernels booted in EL1
mode, but for some reason, when elevated to EL2 mode, reboot freezes both
in QEMU and on real hardware.
The freeze happens in:
arch/arm64/kernel/relocate_kernel.S
turn_on_mmu()
Right after sctlr_el2 is written (MMU on EL2 is enabled)
msr sctlr_el2, \tmp1
I've been studying all the relevant control registers for EL2, but do not
see what might be causing this hang:
MAIR_EL2 is set to be exactly the same as MAIR_EL1 0xbbff440c0400
TCR_EL2 0x80843510
Enabled bits:
PS Physical Address Size. (0b100 44 bits, 16TB.)
SH0 Shareability 11 Inner Shareable
ORGN0 Normal memory, Outer Write-Back Read-Allocate Write-Allocate Cach.
IRGN0 Normal memory, Inner Write-Back Read-Allocate Write-Allocate Cach.
T0SZ 01 0000
SCTLR_EL2 0x30e5183f
RES1 : Reserved ones
M : MMU enabled
A : Align check
C : Cacheability control
SA : SP Alignment check enable
IESB : Implicit Error Synchronization event
I : Instruction access Cacheability
TTBR0_EL2 0x1b3069000 (address of trans_table)
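For anyone who wants to double-check the decoding above, here is a small userspace sketch that pulls the fields out of the raw values. The bit positions are an assumption based on my reading of the TCR_EL2 layout with HCR_EL2.E2H == 0 (T0SZ [5:0], IRGN0 [9:8], ORGN0 [11:10], SH0 [13:12], TG0 [15:14], PS [18:16]); reg_field() and decode_tcr_el2() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits of `val`, starting at bit `lo`. */
static uint64_t reg_field(uint64_t val, unsigned int lo, unsigned int width)
{
	assert(width > 0 && width < 64 && lo + width <= 64);
	return (val >> lo) & ((UINT64_C(1) << width) - 1);
}

struct tcr_el2_fields {
	unsigned int t0sz;	/* size offset of the TTBR0_EL2 region */
	unsigned int irgn0;	/* inner cacheability of table walks */
	unsigned int orgn0;	/* outer cacheability of table walks */
	unsigned int sh0;	/* shareability of table walks */
	unsigned int tg0;	/* translation granule */
	unsigned int ps;	/* physical address size */
};

/* Assumed layout: TCR_EL2 with HCR_EL2.E2H == 0. */
static struct tcr_el2_fields decode_tcr_el2(uint64_t tcr)
{
	struct tcr_el2_fields f;

	f.t0sz	= (unsigned int)reg_field(tcr, 0, 6);
	f.irgn0	= (unsigned int)reg_field(tcr, 8, 2);
	f.orgn0	= (unsigned int)reg_field(tcr, 10, 2);
	f.sh0	= (unsigned int)reg_field(tcr, 12, 2);
	f.tg0	= (unsigned int)reg_field(tcr, 14, 2);
	f.ps	= (unsigned int)reg_field(tcr, 16, 3);
	return f;
}
```

Feeding 0x80843510 through decode_tcr_el2() gives T0SZ = 16, IRGN0 = ORGN0 = 1, SH0 = 3 and PS = 4, which agrees with the list above; TG0 comes out as 0. The same reg_field() helper can be used to confirm that bits 0 (M), 12 (I) and 21 (IESB) are set in the SCTLR_EL2 value.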
Any suggestion of what else might be missing that causes this freeze when
MMU is enabled in EL2?
=====
Here is the current data from the real hardware:
(because of the bug, I forced EL1 mode by always setting el2_switch to zero in
cpu_soft_restart()):
For this experiment, the size of kernel plus initramfs is 25M. If the initramfs
were larger, then the improvement would be even greater, as the time spent in
relocation is proportional to the size of the relocated data.
Previously:
kernel shutdown 0.022131328s
relocation 0.440510736s
kernel startup 0.294706768s
Relocation was taking: 58.2% of reboot time
Now:
kernel shutdown 0.032066576s
relocation 0.022158152s
kernel startup 0.296055880s
Relocation now takes 6.3% of reboot time
Total reboot is 2.16 times faster.
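The percentages and the speedup can be re-derived from the six measurements above. The sketch below is just that arithmetic; struct reboot_times and the helper names are made up for illustration.

```c
#include <assert.h>

/* The three phases measured above, in seconds. */
struct reboot_times {
	double shutdown;
	double relocation;
	double startup;
};

/* Total reboot time is the sum of the three phases. */
static double total_time(const struct reboot_times *t)
{
	return t->shutdown + t->relocation + t->startup;
}

/* Fraction of the total reboot time spent in relocation. */
static double relocation_share(const struct reboot_times *t)
{
	double total = total_time(t);

	assert(total > 0.0);
	return t->relocation / total;
}

/* How many times faster the whole reboot became. */
static double reboot_speedup(const struct reboot_times *before,
			     const struct reboot_times *after)
{
	double total_after = total_time(after);

	assert(total_after > 0.0);
	return total_time(before) / total_after;
}
```

With the numbers from this mail the totals are about 0.757s before and 0.350s after, so relocation_share() gives roughly 0.582 and 0.063, and reboot_speedup() gives roughly 2.16.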
Previous approaches and discussions
-----------------------------------
https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com
Reserve space for kexec to avoid relocation. This involves changes to generic
code to optimize a problem that exists on arm64 only.
https://lore.kernel.org/lkml/20190716165641.6990-1-pasha.tatashin@soleen.com
The first attempt to enable the MMU. Some bugs prevented any performance
improvement: the page tables unnecessarily configured an idmap for the whole
physical space.
Pavel Tatashin (8):
kexec: quiet down kexec reboot
arm64, mm: transitional tables
arm64: hibernate: switch to transtional page tables.
kexec: add machine_kexec_post_load()
arm64, kexec: move relocation function setup and clean up
arm64, kexec: add expandable argument to relocation function
arm64, kexec: configure transitional page table for kexec
arm64, kexec: enable MMU during kexec relocation
arch/arm64/Kconfig | 4 +
arch/arm64/include/asm/kexec.h | 24 ++-
arch/arm64/include/asm/pgtable-hwdef.h | 1 +
arch/arm64/include/asm/trans_table.h | 66 ++++++
arch/arm64/kernel/asm-offsets.c | 10 +
arch/arm64/kernel/cpu-reset.S | 4 +-
arch/arm64/kernel/cpu-reset.h | 8 +-
arch/arm64/kernel/hibernate.c | 261 ++++++------------------
arch/arm64/kernel/machine_kexec.c | 168 ++++++++++++---
arch/arm64/kernel/relocate_kernel.S | 238 +++++++++++++++-------
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/trans_table.c | 272 +++++++++++++++++++++++++
kernel/kexec.c | 4 +
kernel/kexec_core.c | 8 +-
kernel/kexec_file.c | 4 +
kernel/kexec_internal.h | 2 +
16 files changed, 756 insertions(+), 319 deletions(-)
create mode 100644 arch/arm64/include/asm/trans_table.h
create mode 100644 arch/arm64/mm/trans_table.c
--
2.22.0
* Re: [PATCH] tools: memory-model: add it to the Documentation body
From: Akira Yokosawa @ 2019-07-31 15:19 UTC (permalink / raw)
To: Alan Stern, Mauro Carvalho Chehab
Cc: Joel Fernandes, Linux Doc Mailing List, Mauro Carvalho Chehab,
linux-kernel, Jonathan Corbet, Andrea Parri, Will Deacon,
Peter Zijlstra, Boqun Feng, Nicholas Piggin, David Howells,
Jade Alglave, Luc Maranget, Paul E. McKenney, Daniel Lustig,
Ingo Molnar, Jason Gunthorpe, SeongJae Park, linux-arch
In-Reply-To: <Pine.LNX.4.44L0.1907310947340.1497-100000@iolanthe.rowland.org>
On Wed, 31 Jul 2019 09:52:05 -0400, Alan Stern wrote:
> On Tue, 30 Jul 2019, Mauro Carvalho Chehab wrote:
>
>> Em Tue, 30 Jul 2019 18:17:01 -0400
>> Joel Fernandes <joel@joelfernandes.org> escreveu:
>
>>>>> (4) I would argue that every occurrence of
>>>>> A ->(some dependency) B should be replaced with fixed size font in the HTML
>>>>> results.
>>>>
>>>> Just place those with ``A -> (some dependency)``. This will make them use
>>>> a fixed size font.
>>>
>>> Ok, understood all these. I guess my point was all of these will need to be
>>> done to make this document useful from a ReST conversion standpoint. Until
>>> then it is probably just better off being plain text - since there are so
>>> many of those ``A -> (dep) B`` things.
>
>> On a very quick look, it seems that, if we replace:
>>
>> (\S+\s->\S*\s\w+)
>>
>> by:
>> ``\1``
>>
>>
>> With an editor that allows interactive regex replacement (like kate),
>> most of those can be caught.
>>
>> See patch enclosed.
>
> Some time ago I considered the problem of converting this file to ReST
> format. But I gave up on the idea, because the necessary changes were
> so widespread and the resulting text file would not be easily readable.
>
> Replacing things of the form "A ->dep B" just scratches the surface.
> That document teems with variable names, formulas, code extracts, and
> other things which would all need to be rendered in a different font
> style. The density of the markup required to do this would be
> phenomenally high.
>
> In my opinion it simply was not worthwhile.
+1 on keeping this and the other .txt files of LKMM intact.
Thanks, Akira
>
> Alan Stern
>
* Re: [RFC v2 0/6] Introduce TEE based Trusted Keys support
From: Sumit Garg @ 2019-07-31 14:23 UTC (permalink / raw)
To: Janne Karhunen
Cc: keyrings, linux-integrity, linux-security-module, Jens Wiklander,
Jonathan Corbet, dhowells, jejb, Jarkko Sakkinen, Mimi Zohar,
James Morris, Serge E. Hallyn, Casey Schaufler, Ard Biesheuvel,
Daniel Thompson, Linux Doc Mailing List,
Linux Kernel Mailing List, linux-arm-kernel,
tee-dev @ lists . linaro . org
In-Reply-To: <CAE=Ncrb23q++z8R8UMbjDE2epEq4YVcNGzrRD31eH3JAooYejg@mail.gmail.com>
On Wed, 31 Jul 2019 at 16:33, Janne Karhunen <janne.karhunen@gmail.com> wrote:
>
> On Wed, Jul 31, 2019 at 1:26 PM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> > > Interesting, I wrote something similar and posted it to the lists a while back:
> > > https://github.com/jkrh/linux/commit/d77ea03afedcb5fd42234cd834da8f8a0809f6a6
> > >
> > > Since there are no generic 'TEEs' available,
> >
> > There is already a generic TEE interface driver available in kernel.
> > Have a look here: "Documentation/tee.txt".
>
> I guess my wording was wrong, tried to say that physical TEEs in the
> wild vary massively hardware wise. Generalizing these things is rough.
>
There are already well defined GlobalPlatform Standards to generalize
the TEE interface. One of them is GlobalPlatform TEE Client API [1]
which provides the basis for this TEE interface.
>
> > > I implemented the same
> > > thing as a generic protocol translator. The shared memory binding for
> > > instance already assumes fair amount about the TEE and how that is
> > > physically present in the system. Besides, the help from usage of shm
> > > is pretty limited due to the size of the keydata.
> > >
> >
> > If you look at patch #1 and #2, they add support to register kernel
> > memory buffer (keydata buffer in this case) with TEE to operate on. So
> > there isn't any limitation due to the size of the keydata.
>
> Ah, didn't mean that. Meant that the keydata is typically pretty small
> in size, so there is limited benefit from passing that in via shm if
> that complicates anything.
>
Ah, ok. Can you think of any better approach than using SHM?
[1] https://globalplatform.org/specs-library/tee-client-api-specification/
-Sumit
>
> --
> Janne
* Re: [RFC v2 0/6] Introduce TEE based Trusted Keys support
From: Sumit Garg @ 2019-07-31 13:58 UTC (permalink / raw)
To: Janne Karhunen
Cc: keyrings, linux-integrity, linux-security-module, Jens Wiklander,
Jonathan Corbet, dhowells, jejb, Jarkko Sakkinen, Mimi Zohar,
James Morris, Serge E. Hallyn, Casey Schaufler, Ard Biesheuvel,
Daniel Thompson, Linux Doc Mailing List,
Linux Kernel Mailing List, linux-arm-kernel,
tee-dev @ lists . linaro . org
In-Reply-To: <CAE=NcrY7b8eTTovOszBhGhVbjfJAXoAYehiUJyPENGfwWoVcPw@mail.gmail.com>
On Wed, 31 Jul 2019 at 15:51, Janne Karhunen <janne.karhunen@gmail.com> wrote:
>
> Hi,
>
> To clarify a bit further - my thought was to support any type of trust
> source.
That could very well be accomplished via the Trusted Keys abstraction
framework [1]. A trust source just needs to implement the following APIs:
struct trusted_key_ops ts_trusted_key_ops = {
.migratable = 0, /* non-migratable */
.init = init_ts_trusted,
.seal = ts_key_seal,
.unseal = ts_key_unseal,
.get_random = ts_get_random,
.cleanup = cleanup_ts_trusted,
};
> Remote, local or both. Just having one particular type of
> locally bound 'TEE' sounded very limited,
TEE is just one trust source, like TPM; we can have other trust
sources as mentioned above.
> especially when nothing from
> the TEE execution side is really needed for supporting the kernel
> crypto. What you really need is the seal/unseal transaction going
> somewhere and where that somewhere is does not matter much.
It's only the seal/unseal operations that are provided by the TEE driver,
which hooks up under the trusted keys abstraction layer.
> With the
> user mode helper in between anyone can easily add their own thing in
> there.
>
Isn't the actual purpose of trusted keys to protect kernel keys from
being accessed by user-space in plain format? Doesn't a user mode helper
defeat that purpose in one way or another?
>
[1] https://lkml.org/lkml/2019/7/18/284
-Sumit
> --
> Janne
>
> On Wed, Jul 31, 2019 at 10:11 AM Janne Karhunen
> <janne.karhunen@gmail.com> wrote:
> >
> > Hi,
> >
> > Interesting, I wrote something similar and posted it to the lists a while back:
> > https://github.com/jkrh/linux/commit/d77ea03afedcb5fd42234cd834da8f8a0809f6a6
> >
> > Since there are no generic 'TEEs' available, I implemented the same
> > thing as a generic protocol translator. The shared memory binding for
> > instance already assumes fair amount about the TEE and how that is
> > physically present in the system. Besides, the help from usage of shm
> > is pretty limited due to the size of the keydata.
> >
> >
> > --
> > Janne
> >
> >
> >
> >
> > On Tue, Jul 30, 2019 at 3:26 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >
> > > Add support for TEE based trusted keys where TEE provides the functionality
> > > to seal and unseal trusted keys using hardware unique key. Also, this is
> > > an alternative in case platform doesn't possess a TPM device.
> > >
> > > This series also adds some TEE features like:
> > >
> > > Patch #1, #2 enables support for registered kernel shared memory with TEE.
> > >
> > > Patch #3 enables support for private kernel login method required for
> > > > cases like trusted keys where we don't want user-space to directly access
> > > TEE service to retrieve trusted key contents.
> > >
> > > Rest of the patches from #4 to #6 adds support for TEE based trusted keys.
> > >
> > > This patch-set has been tested with OP-TEE based pseudo TA which can be
> > > found here [1].
> > >
> > > Also, this patch-set is dependent on generic Trusted Keys framework
> > > patch-set [2].
> > >
> > > [1] https://github.com/OP-TEE/optee_os/pull/3082
> > > [2] https://lkml.org/lkml/2019/7/18/284
> > >
> > > Changes in v2:
> > > 1. Add reviewed-by tags for patch #1 and #2.
> > > 2. Incorporate comments from Jens for patch #3.
> > > 3. Switch to use generic trusted keys framework.
> > >
> > > Sumit Garg (6):
> > > tee: optee: allow kernel pages to register as shm
> > > tee: enable support to register kernel memory
> > > tee: add private login method for kernel clients
> > > KEYS: trusted: Introduce TEE based Trusted Keys
> > > doc: keys: Document usage of TEE based Trusted Keys
> > > MAINTAINERS: Add entry for TEE based Trusted Keys
> > >
> > > Documentation/security/keys/index.rst | 1 +
> > > Documentation/security/keys/tee-trusted.rst | 93 +++++++++
> > > MAINTAINERS | 9 +
> > > drivers/tee/optee/call.c | 7 +
> > > drivers/tee/tee_core.c | 6 +
> > > drivers/tee/tee_shm.c | 16 +-
> > > include/keys/trusted-type.h | 3 +
> > > include/keys/trusted_tee.h | 66 +++++++
> > > include/linux/tee_drv.h | 1 +
> > > include/uapi/linux/tee.h | 8 +
> > > security/keys/Kconfig | 3 +
> > > security/keys/trusted-keys/Makefile | 3 +-
> > > security/keys/trusted-keys/trusted-tee.c | 282 ++++++++++++++++++++++++++++
> > > security/keys/trusted-keys/trusted.c | 3 +
> > > 14 files changed, 498 insertions(+), 3 deletions(-)
> > > create mode 100644 Documentation/security/keys/tee-trusted.rst
> > > create mode 100644 include/keys/trusted_tee.h
> > > create mode 100644 security/keys/trusted-keys/trusted-tee.c
> > >
> > > --
> > > 2.7.4
> > >
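The reply above lists the members a trust source fills in (migratable, init, seal, unseal, get_random, cleanup) but not their signatures. The following is a minimal userspace sketch of the idea only: a generic caller that reaches the backend purely through an ops table. The type demo_key_ops, every function signature, and the XOR "backend" are hypothetical and invented for illustration; they are not the API of the Trusted Keys framework patch-set, and get_random is left out to keep the sketch short.

```c
#include <stddef.h>

/*
 * Hypothetical ops table. Member names follow the reply above; the
 * signatures are assumptions made for this sketch.
 */
struct demo_key_ops {
	int migratable;
	int (*init)(void);
	int (*seal)(unsigned char *buf, size_t len);
	int (*unseal)(unsigned char *buf, size_t len);
	void (*cleanup)(void);
};

/* Placeholder backend state: set by init, cleared by cleanup. */
static int demo_ready;

static int demo_init(void)
{
	demo_ready = 1;
	return 0;
}

static void demo_cleanup(void)
{
	demo_ready = 0;
}

/* XOR with a fixed byte. This is a stand-in, NOT real sealing. */
static int demo_xor(unsigned char *buf, size_t len)
{
	size_t i;

	if (!demo_ready)
		return -1;
	for (i = 0; i < len; i++)
		buf[i] ^= 0x5a;
	return 0;
}

static const struct demo_key_ops demo_ops = {
	.migratable = 0, /* non-migratable */
	.init = demo_init,
	.seal = demo_xor,
	.unseal = demo_xor,
	.cleanup = demo_cleanup,
};

/* Generic caller: it knows only the ops table, not the backend behind it. */
static int roundtrip(const struct demo_key_ops *ops, unsigned char *buf,
		     size_t len)
{
	int ret = ops->init();

	if (ret)
		return ret;
	ret = ops->seal(buf, len);
	if (!ret)
		ret = ops->unseal(buf, len);
	ops->cleanup();
	return ret;
}
```

Swapping in a different trust source would, under these assumptions, mean providing another table with the same shape; roundtrip() itself would not change.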
* Re: [PATCH] tools: memory-model: add it to the Documentation body
From: Alan Stern @ 2019-07-31 13:52 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Joel Fernandes, Linux Doc Mailing List, Mauro Carvalho Chehab,
linux-kernel, Jonathan Corbet, Andrea Parri, Will Deacon,
Peter Zijlstra, Boqun Feng, Nicholas Piggin, David Howells,
Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa,
Daniel Lustig, Ingo Molnar, Jason Gunthorpe, SeongJae Park,
linux-arch
In-Reply-To: <20190730195744.3aef478e@coco.lan>
On Tue, 30 Jul 2019, Mauro Carvalho Chehab wrote:
> Em Tue, 30 Jul 2019 18:17:01 -0400
> Joel Fernandes <joel@joelfernandes.org> escreveu:
> > > > > (4) I would argue that every occurrence of
> > > > A ->(some dependency) B should be replaced with fixed size font in the HTML
> > > > results.
> > >
> > > Just place those with ``A -> (some dependency)``. This will make them use
> > > a fixed size font.
> >
> > Ok, understood all these. I guess my point was all of these will need to be
> > done to make this document useful from a ReST conversion standpoint. Until
> > then it is probably just better off being plain text - since there are so
> > many of those ``A -> (dep) B`` things.
> On a very quick look, it seems that, if we replace:
>
> (\S+\s->\S*\s\w+)
>
> by:
> ``\1``
>
>
> With an editor that allows interactive regex replacement (like kate),
> most of those can be caught.
>
> See patch enclosed.
Some time ago I considered the problem of converting this file to ReST
format. But I gave up on the idea, because the necessary changes were
so widespread and the resulting text file would not be easily readable.
Replacing things of the form "A ->dep B" just scratches the surface.
That document teems with variable names, formulas, code extracts, and
other things which would all need to be rendered in a different font
style. The density of the markup required to do this would be
phenomenally high.
In my opinion it simply was not worthwhile.
Alan Stern
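The substitution quoted above, (\S+\s->\S*\s\w+) replaced by ``\1``, can also be tried outside an editor. The following is a minimal sketch using POSIX regcomp()/regexec(). POSIX extended regular expressions have no \S, \s or \w, so the pattern is respelled with bracket expressions; the function name wrap_deps and the buffer handling are assumptions of this sketch, not part of any patch in the thread.

```c
#include <regex.h>
#include <string.h>

/* POSIX ERE spelling of (\S+\s->\S*\s\w+). */
#define DEP_PATTERN \
	"[^[:space:]]+[[:space:]]->[^[:space:]]*[[:space:]][[:alnum:]_]+"

/*
 * Copy src into dst (capacity cap), wrapping every match in double
 * backquotes. Returns 0 on success, -1 on a regex error or if dst is
 * too small.
 */
static int wrap_deps(const char *src, char *dst, size_t cap)
{
	regex_t re;
	regmatch_t m;
	size_t used = 0;
	int rc = 0;

	if (regcomp(&re, DEP_PATTERN, REG_EXTENDED))
		return -1;

	/* Each match is at least six characters, so src always advances. */
	while (*src && regexec(&re, src, 1, &m, 0) == 0) {
		size_t pre = (size_t)m.rm_so;
		size_t len = (size_t)(m.rm_eo - m.rm_so);

		if (used + pre + len + 4 >= cap) {
			rc = -1;
			goto out;
		}
		memcpy(dst + used, src, pre);		/* text before the match */
		used += pre;
		memcpy(dst + used, "``", 2);
		used += 2;
		memcpy(dst + used, src + m.rm_so, len);	/* the match itself */
		used += len;
		memcpy(dst + used, "``", 2);
		used += 2;
		src += m.rm_eo;
	}

	if (used + strlen(src) >= cap) {
		rc = -1;
		goto out;
	}
	strcpy(dst + used, src);			/* unmatched tail */
out:
	regfree(&re);
	return rc;
}
```

With this pattern, "A ->dep B holds" becomes "``A ->dep B`` holds". A spaced form such as "A -> (dep) B" is left untouched, because the character after the second whitespace is "(" rather than a word character. That fits the remark that the regex catches most, not all, occurrences.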
* [PATCH] mailmap: add entry to connect my email addresses
From: Chao Yu @ 2019-07-31 11:45 UTC (permalink / raw)
To: corbet; +Cc: linux-doc, chao, linux-kernel, Chao Yu
I've used several email accounts to contribute code; the Samsung one
is obsolete, so add entries to map them, so that people can
find me easily when they blame my code.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
---
.mailmap | 2 ++
1 file changed, 2 insertions(+)
diff --git a/.mailmap b/.mailmap
index ebdca3fba91f..45d358534ac5 100644
--- a/.mailmap
+++ b/.mailmap
@@ -47,6 +47,8 @@ Boris Brezillon <bbrezillon@kernel.org> <b.brezillon.dev@gmail.com>
Boris Brezillon <bbrezillon@kernel.org> <b.brezillon@overkiz.com>
Brian Avery <b.avery@hp.com>
Brian King <brking@us.ibm.com>
+Chao Yu <chao@kernel.org> <chao2.yu@samsung.com>
+Chao Yu <chao@kernel.org> <yuchao0@huawei.com>
Christoph Hellwig <hch@lst.de>
Christophe Ricard <christophe.ricard@gmail.com>
Corey Minyard <minyard@acm.org>
--
2.18.0.rc1
* Re: [RFC v2 0/6] Introduce TEE based Trusted Keys support
From: Janne Karhunen @ 2019-07-31 11:02 UTC (permalink / raw)
To: Sumit Garg
Cc: keyrings, linux-integrity, linux-security-module, Jens Wiklander,
Jonathan Corbet, dhowells, jejb, Jarkko Sakkinen, Mimi Zohar,
James Morris, Serge E. Hallyn, Casey Schaufler, Ard Biesheuvel,
Daniel Thompson, Linux Doc Mailing List,
Linux Kernel Mailing List, linux-arm-kernel,
tee-dev @ lists . linaro . org
In-Reply-To: <CAFA6WYPJAzbPdcpBqioxjY=T8RLw-73B_hpzX4cGnwVvm5zpJw@mail.gmail.com>
On Wed, Jul 31, 2019 at 1:26 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > Interesting, I wrote something similar and posted it to the lists a while back:
> > https://github.com/jkrh/linux/commit/d77ea03afedcb5fd42234cd834da8f8a0809f6a6
> >
> > Since there are no generic 'TEEs' available,
>
> There is already a generic TEE interface driver available in kernel.
> Have a look here: "Documentation/tee.txt".
I guess my wording was wrong, tried to say that physical TEEs in the
wild vary massively hardware wise. Generalizing these things is rough.
> > I implemented the same
> > thing as a generic protocol translator. The shared memory binding for
> > instance already assumes fair amount about the TEE and how that is
> > physically present in the system. Besides, the help from usage of shm
> > is pretty limited due to the size of the keydata.
> >
>
> If you look at patch #1 and #2, they add support to register kernel
> memory buffer (keydata buffer in this case) with TEE to operate on. So
> there isn't any limitation due to the size of the keydata.
Ah, didn't mean that. Meant that the keydata is typically pretty small
in size, so there is limited benefit from passing that in via shm if
that complicates anything.
--
Janne
* Re: [RFC v2 0/6] Introduce TEE based Trusted Keys support
From: Sumit Garg @ 2019-07-31 10:26 UTC (permalink / raw)
To: Janne Karhunen
Cc: keyrings, linux-integrity, linux-security-module, Jens Wiklander,
Jonathan Corbet, dhowells, jejb, Jarkko Sakkinen, Mimi Zohar,
James Morris, Serge E. Hallyn, Casey Schaufler, Ard Biesheuvel,
Daniel Thompson, Linux Doc Mailing List,
Linux Kernel Mailing List, linux-arm-kernel,
tee-dev @ lists . linaro . org
In-Reply-To: <CAE=Ncrb63dQLe-nDQyO9OPv7XjwM_9mzL9SrcLiUi2Dr10cD4A@mail.gmail.com>
Hi Janne,
On Wed, 31 Jul 2019 at 12:41, Janne Karhunen <janne.karhunen@gmail.com> wrote:
>
> Hi,
>
> Interesting, I wrote something similar and posted it to the lists a while back:
> https://github.com/jkrh/linux/commit/d77ea03afedcb5fd42234cd834da8f8a0809f6a6
>
> Since there are no generic 'TEEs' available,
There is already a generic TEE interface driver available in kernel.
Have a look here: "Documentation/tee.txt".
> I implemented the same
> thing as a generic protocol translator. The shared memory binding for
> instance already assumes fair amount about the TEE and how that is
> physically present in the system. Besides, the help from usage of shm
> is pretty limited due to the size of the keydata.
>
If you look at patch #1 and #2, they add support to register kernel
memory buffer (keydata buffer in this case) with TEE to operate on. So
there isn't any limitation due to the size of the keydata.
-Sumit
>
> --
> Janne
>
>
>
>
> On Tue, Jul 30, 2019 at 3:26 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > Add support for TEE based trusted keys where TEE provides the functionality
> > to seal and unseal trusted keys using hardware unique key. Also, this is
> > an alternative in case platform doesn't possess a TPM device.
> >
> > This series also adds some TEE features like:
> >
> > Patch #1, #2 enables support for registered kernel shared memory with TEE.
> >
> > Patch #3 enables support for private kernel login method required for
> > cases like trusted keys where we don't want user-space to directly access
> > TEE service to retrieve trusted key contents.
> >
> > Rest of the patches from #4 to #6 adds support for TEE based trusted keys.
> >
> > This patch-set has been tested with OP-TEE based pseudo TA which can be
> > found here [1].
> >
> > Also, this patch-set is dependent on generic Trusted Keys framework
> > patch-set [2].
> >
> > [1] https://github.com/OP-TEE/optee_os/pull/3082
> > [2] https://lkml.org/lkml/2019/7/18/284
> >
> > Changes in v2:
> > 1. Add reviewed-by tags for patch #1 and #2.
> > 2. Incorporate comments from Jens for patch #3.
> > 3. Switch to use generic trusted keys framework.
> >
> > Sumit Garg (6):
> > tee: optee: allow kernel pages to register as shm
> > tee: enable support to register kernel memory
> > tee: add private login method for kernel clients
> > KEYS: trusted: Introduce TEE based Trusted Keys
> > doc: keys: Document usage of TEE based Trusted Keys
> > MAINTAINERS: Add entry for TEE based Trusted Keys
> >
> > Documentation/security/keys/index.rst | 1 +
> > Documentation/security/keys/tee-trusted.rst | 93 +++++++++
> > MAINTAINERS | 9 +
> > drivers/tee/optee/call.c | 7 +
> > drivers/tee/tee_core.c | 6 +
> > drivers/tee/tee_shm.c | 16 +-
> > include/keys/trusted-type.h | 3 +
> > include/keys/trusted_tee.h | 66 +++++++
> > include/linux/tee_drv.h | 1 +
> > include/uapi/linux/tee.h | 8 +
> > security/keys/Kconfig | 3 +
> > security/keys/trusted-keys/Makefile | 3 +-
> > security/keys/trusted-keys/trusted-tee.c | 282 ++++++++++++++++++++++++++++
> > security/keys/trusted-keys/trusted.c | 3 +
> > 14 files changed, 498 insertions(+), 3 deletions(-)
> > create mode 100644 Documentation/security/keys/tee-trusted.rst
> > create mode 100644 include/keys/trusted_tee.h
> > create mode 100644 security/keys/trusted-keys/trusted-tee.c
> >
> > --
> > 2.7.4
> >
* Re: [RFC v2 0/6] Introduce TEE based Trusted Keys support
From: Janne Karhunen @ 2019-07-31 10:21 UTC (permalink / raw)
To: Sumit Garg
Cc: keyrings, linux-integrity, linux-security-module, jens.wiklander,
corbet, dhowells, jejb, jarkko.sakkinen, Mimi Zohar, James Morris,
Serge E. Hallyn, Casey Schaufler, ard.biesheuvel, daniel.thompson,
linux-doc, Linux Kernel Mailing List, linux-arm-kernel, tee-dev
In-Reply-To: <CAE=Ncrb63dQLe-nDQyO9OPv7XjwM_9mzL9SrcLiUi2Dr10cD4A@mail.gmail.com>
Hi,
To clarify a bit further - my thought was to support any type of trust
source. Remote, local or both. Just having one particular type of
locally bound 'TEE' sounded very limited, especially when nothing from
the TEE execution side is really needed for supporting the kernel
crypto. What you really need is the seal/unseal transaction going
somewhere and where that somewhere is does not matter much. With the
user mode helper in between anyone can easily add their own thing in
there.
--
Janne
On Wed, Jul 31, 2019 at 10:11 AM Janne Karhunen
<janne.karhunen@gmail.com> wrote:
>
> Hi,
>
> Interesting, I wrote something similar and posted it to the lists a while back:
> https://github.com/jkrh/linux/commit/d77ea03afedcb5fd42234cd834da8f8a0809f6a6
>
> Since there are no generic 'TEEs' available, I implemented the same
> thing as a generic protocol translator. The shared memory binding for
> instance already assumes fair amount about the TEE and how that is
> physically present in the system. Besides, the help from usage of shm
> is pretty limited due to the size of the keydata.
>
>
> --
> Janne
>
>
>
>
> On Tue, Jul 30, 2019 at 3:26 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > Add support for TEE based trusted keys where TEE provides the functionality
> > to seal and unseal trusted keys using hardware unique key. Also, this is
> > an alternative in case platform doesn't possess a TPM device.
> >
> > This series also adds some TEE features like:
> >
> > Patch #1, #2 enables support for registered kernel shared memory with TEE.
> >
> > Patch #3 enables support for private kernel login method required for
> > cases like trusted keys where we don't want user-space to directly access
> > TEE service to retrieve trusted key contents.
> >
> > Rest of the patches from #4 to #6 adds support for TEE based trusted keys.
> >
> > This patch-set has been tested with OP-TEE based pseudo TA which can be
> > found here [1].
> >
> > Also, this patch-set is dependent on generic Trusted Keys framework
> > patch-set [2].
> >
> > [1] https://github.com/OP-TEE/optee_os/pull/3082
> > [2] https://lkml.org/lkml/2019/7/18/284
> >
> > Changes in v2:
> > 1. Add reviewed-by tags for patch #1 and #2.
> > 2. Incorporate comments from Jens for patch #3.
> > 3. Switch to use generic trusted keys framework.
> >
> > Sumit Garg (6):
> > tee: optee: allow kernel pages to register as shm
> > tee: enable support to register kernel memory
> > tee: add private login method for kernel clients
> > KEYS: trusted: Introduce TEE based Trusted Keys
> > doc: keys: Document usage of TEE based Trusted Keys
> > MAINTAINERS: Add entry for TEE based Trusted Keys
> >
> > Documentation/security/keys/index.rst | 1 +
> > Documentation/security/keys/tee-trusted.rst | 93 +++++++++
> > MAINTAINERS | 9 +
> > drivers/tee/optee/call.c | 7 +
> > drivers/tee/tee_core.c | 6 +
> > drivers/tee/tee_shm.c | 16 +-
> > include/keys/trusted-type.h | 3 +
> > include/keys/trusted_tee.h | 66 +++++++
> > include/linux/tee_drv.h | 1 +
> > include/uapi/linux/tee.h | 8 +
> > security/keys/Kconfig | 3 +
> > security/keys/trusted-keys/Makefile | 3 +-
> > security/keys/trusted-keys/trusted-tee.c | 282 ++++++++++++++++++++++++++++
> > security/keys/trusted-keys/trusted.c | 3 +
> > 14 files changed, 498 insertions(+), 3 deletions(-)
> > create mode 100644 Documentation/security/keys/tee-trusted.rst
> > create mode 100644 include/keys/trusted_tee.h
> > create mode 100644 security/keys/trusted-keys/trusted-tee.c
> >
> > --
> > 2.7.4
> >
* RE: [PATCH 2/6] hwspinlock: allow sharing of hwspinlocks
From: Loic PALLARDY @ 2019-07-31 9:22 UTC (permalink / raw)
To: Fabien DESSENNE, Ohad Ben-Cohen, Bjorn Andersson, Rob Herring,
Mark Rutland, Maxime Coquelin, Alexandre TORGUE, Jonathan Corbet,
linux-remoteproc@vger.kernel.org, devicetree@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-stm32@st-md-mailman.stormreply.com,
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org
Cc: Fabien DESSENNE, Benjamin GAIGNARD
In-Reply-To: <1552492237-28810-3-git-send-email-fabien.dessenne@st.com>
> -----Original Message-----
> From: linux-remoteproc-owner@vger.kernel.org <linux-remoteproc-
> owner@vger.kernel.org> On Behalf Of Fabien Dessenne
> Sent: mercredi 13 mars 2019 16:51
> To: Ohad Ben-Cohen <ohad@wizery.com>; Bjorn Andersson
> <bjorn.andersson@linaro.org>; Rob Herring <robh+dt@kernel.org>; Mark
> Rutland <mark.rutland@arm.com>; Maxime Coquelin
> <mcoquelin.stm32@gmail.com>; Alexandre TORGUE
> <alexandre.torgue@st.com>; Jonathan Corbet <corbet@lwn.net>; linux-
> remoteproc@vger.kernel.org; devicetree@vger.kernel.org; linux-
> kernel@vger.kernel.org; linux-stm32@st-md-mailman.stormreply.com;
> linux-arm-kernel@lists.infradead.org; linux-doc@vger.kernel.org
> Cc: Fabien DESSENNE <fabien.dessenne@st.com>; Benjamin GAIGNARD
> <benjamin.gaignard@st.com>
> Subject: [PATCH 2/6] hwspinlock: allow sharing of hwspinlocks
>
> The current implementation does not allow different devices to use a
> common hwspinlock. Offer the possibility to use the same hwspinlock by
> several users.
> If a device registers to the framework with #hwlock-cells = 2, then
> the second parameter of the 'hwlocks' DeviceTree property defines
> whether an hwlock is requested for an exclusive or a shared usage.
> If a device registers with #hwlock-cells = 1, then all the hwlocks are
> for an exclusive usage.
>
> Signed-off-by: Fabien Dessenne <fabien.dessenne@st.com>
Looks good to me.
Acked-by: Loic Pallardy <loic.pallardy@st.com>
Regards,
Loic
> ---
> Documentation/hwspinlock.txt | 10 ++--
> drivers/hwspinlock/hwspinlock_core.c | 82 +++++++++++++++++++++++++-------
> drivers/hwspinlock/hwspinlock_internal.h | 2 +
> 3 files changed, 73 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/hwspinlock.txt b/Documentation/hwspinlock.txt
> index ed640a2..e6ce2dd 100644
> --- a/Documentation/hwspinlock.txt
> +++ b/Documentation/hwspinlock.txt
> @@ -54,9 +54,11 @@ Should be called from a process context (might sleep).
> struct hwspinlock *hwspin_lock_request_specific(unsigned int id);
>
> Assign a specific hwspinlock id and return its address, or NULL
> -if that hwspinlock is already in use. Usually board code will
> -be calling this function in order to reserve specific hwspinlock
> -ids for predefined purposes.
> +if that hwspinlock is already in use and not shared. If that specific
> +hwspinlock is declared as shared, it can be requested and used by
> +several users.
> +Usually board code will be calling this function in order to reserve
> +specific hwspinlock ids for predefined purposes.
>
> Should be called from a process context (might sleep).
>
> @@ -368,11 +370,13 @@ of which represents a single hardware lock::
> * struct hwspinlock - this struct represents a single hwspinlock instance
> * @bank: the hwspinlock_device structure which owns this lock
> * @lock: initialized and used by hwspinlock core
> + * @refcount: number of users (when shared)
> * @priv: private data, owned by the underlying platform-specific hwspinlock drv
> */
> struct hwspinlock {
> struct hwspinlock_device *bank;
> spinlock_t lock;
> + unsigned int refcount;
> void *priv;
> };
>
> diff --git a/drivers/hwspinlock/hwspinlock_core.c b/drivers/hwspinlock/hwspinlock_core.c
> index 2bad40d..53afdeb 100644
> --- a/drivers/hwspinlock/hwspinlock_core.c
> +++ b/drivers/hwspinlock/hwspinlock_core.c
> @@ -25,6 +25,8 @@
>
> /* radix tree tags */
> #define HWSPINLOCK_UNUSED (0) /* tags an hwspinlock as unused */
> +#define HWSPINLOCK_EXCLUSIVE (1) /* tags an hwspinlock as exclusive */
> +#define HWSPINLOCK_SHARED (2) /* tags an hwspinlock as shared */
>
> /*
> * A radix tree is used to maintain the available hwspinlock instances.
> @@ -291,7 +293,7 @@ EXPORT_SYMBOL_GPL(__hwspin_unlock);
> * @hwlock_spec: hwlock specifier as found in the device tree
> *
> * This is a simple translation function, suitable for hwspinlock platform
> - * drivers that only has a lock specifier length of 1.
> + * drivers that only has a lock specifier length of 1 or 2.
> *
> * Returns a relative index of the lock within a specified bank on success,
> * or -EINVAL on invalid specifier cell count.
> @@ -299,7 +301,8 @@ EXPORT_SYMBOL_GPL(__hwspin_unlock);
> static inline int
> of_hwspin_lock_simple_xlate(const struct of_phandle_args *hwlock_spec)
> {
> - if (WARN_ON(hwlock_spec->args_count != 1))
> + if (WARN_ON(hwlock_spec->args_count != 1 &&
> + hwlock_spec->args_count != 2))
> return -EINVAL;
>
> return hwlock_spec->args[0];
> @@ -322,11 +325,12 @@ of_hwspin_lock_simple_xlate(const struct of_phandle_args *hwlock_spec)
> int of_hwspin_lock_get_id(struct device_node *np, int index)
> {
> struct of_phandle_args args;
> - struct hwspinlock *hwlock;
> + struct hwspinlock *hwlock, *tmp;
> struct radix_tree_iter iter;
> void **slot;
> int id;
> int ret;
> + unsigned int tag;
>
> ret = of_parse_phandle_with_args(np, "hwlocks", "#hwlock-cells", index,
> &args);
> @@ -361,6 +365,37 @@ int of_hwspin_lock_get_id(struct device_node *np, int index)
> }
> id += hwlock->bank->base_id;
>
> + /* Set the EXCLUSIVE / SHARED tag */
> + if (args.args_count == 2 && args.args[1]) {
> + /* Tag SHARED unless already tagged EXCLUSIVE */
> + if (radix_tree_tag_get(&hwspinlock_tree, id,
> + HWSPINLOCK_EXCLUSIVE)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + tag = HWSPINLOCK_SHARED;
> + } else {
> + /* Tag EXCLUSIVE unless already tagged SHARED */
> + if (radix_tree_tag_get(&hwspinlock_tree, id,
> + HWSPINLOCK_SHARED)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + tag = HWSPINLOCK_EXCLUSIVE;
> + }
> +
> + /* mark this hwspinlock */
> + hwlock = radix_tree_lookup(&hwspinlock_tree, id);
> + if (!hwlock) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + tmp = radix_tree_tag_set(&hwspinlock_tree, id, tag);
> +
> + /* self-sanity check which should never fail */
> + WARN_ON(tmp != hwlock);
> +
> out:
> of_node_put(args.np);
> return ret ? ret : id;
> @@ -483,6 +518,7 @@ int hwspin_lock_register(struct hwspinlock_device *bank, struct device *dev,
>
> spin_lock_init(&hwlock->lock);
> hwlock->bank = bank;
> + hwlock->refcount = 0;
>
> ret = hwspin_lock_register_single(hwlock, base_id + i);
> if (ret)
> @@ -625,7 +661,7 @@ static int __hwspin_lock_request(struct hwspinlock *hwlock)
> {
> struct device *dev = hwlock->bank->dev;
> struct hwspinlock *tmp;
> - int ret;
> + int ret, id;
>
> /* prevent underlying implementation from being removed */
> if (!try_module_get(dev->driver->owner)) {
> @@ -642,13 +678,18 @@ static int __hwspin_lock_request(struct hwspinlock *hwlock)
> return ret;
> }
>
> + /* update shareable refcount */
> + id = hwlock_to_id(hwlock);
> + if (radix_tree_tag_get(&hwspinlock_tree, id, HWSPINLOCK_SHARED) &&
> + hwlock->refcount++)
> + goto out;
> +
> /* mark hwspinlock as used, should not fail */
> - tmp = radix_tree_tag_clear(&hwspinlock_tree, hwlock_to_id(hwlock),
> - HWSPINLOCK_UNUSED);
> + tmp = radix_tree_tag_clear(&hwspinlock_tree, id, HWSPINLOCK_UNUSED);
>
> /* self-sanity check that should never fail */
> WARN_ON(tmp != hwlock);
> -
> +out:
> return ret;
> }
>
> @@ -742,9 +783,9 @@ struct hwspinlock *hwspin_lock_request_specific(unsigned int id)
> /* sanity check (this shouldn't happen) */
> WARN_ON(hwlock_to_id(hwlock) != id);
>
> - /* make sure this hwspinlock is unused */
> - ret = radix_tree_tag_get(&hwspinlock_tree, id, HWSPINLOCK_UNUSED);
> - if (ret == 0) {
> + /* make sure this hwspinlock is unused or shareable */
> + if (!radix_tree_tag_get(&hwspinlock_tree, id, HWSPINLOCK_SHARED) &&
> + !radix_tree_tag_get(&hwspinlock_tree, id, HWSPINLOCK_UNUSED)) {
> pr_warn("hwspinlock %u is already in use\n", id);
> hwlock = NULL;
> goto out;
> @@ -777,7 +818,7 @@ int hwspin_lock_free(struct hwspinlock *hwlock)
> {
> struct device *dev;
> struct hwspinlock *tmp;
> - int ret;
> + int ret, id;
>
> if (!hwlock) {
> pr_err("invalid hwlock\n");
> @@ -788,30 +829,35 @@ int hwspin_lock_free(struct hwspinlock *hwlock)
> mutex_lock(&hwspinlock_tree_lock);
>
> /* make sure the hwspinlock is used */
> - ret = radix_tree_tag_get(&hwspinlock_tree, hwlock_to_id(hwlock),
> - HWSPINLOCK_UNUSED);
> + id = hwlock_to_id(hwlock);
> + ret = radix_tree_tag_get(&hwspinlock_tree, id, HWSPINLOCK_UNUSED);
> if (ret == 1) {
> dev_err(dev, "%s: hwlock is already free\n", __func__);
> dump_stack();
> ret = -EINVAL;
> - goto out;
> + goto unlock;
> }
>
> /* notify the underlying device that power is not needed */
> ret = pm_runtime_put(dev);
> if (ret < 0)
> - goto out;
> + goto unlock;
> +
> + /* update shareable refcount */
> + if (radix_tree_tag_get(&hwspinlock_tree, id, HWSPINLOCK_SHARED) &&
> + --hwlock->refcount)
> + goto put;
>
> /* mark this hwspinlock as available */
> - tmp = radix_tree_tag_set(&hwspinlock_tree, hwlock_to_id(hwlock),
> - HWSPINLOCK_UNUSED);
> + tmp = radix_tree_tag_set(&hwspinlock_tree, id, HWSPINLOCK_UNUSED);
>
> /* sanity check (this shouldn't happen) */
> WARN_ON(tmp != hwlock);
>
> +put:
> module_put(dev->driver->owner);
>
> -out:
> +unlock:
> mutex_unlock(&hwspinlock_tree_lock);
> return ret;
> }
> diff --git a/drivers/hwspinlock/hwspinlock_internal.h b/drivers/hwspinlock/hwspinlock_internal.h
> index 9eb6bd0..c808e11 100644
> --- a/drivers/hwspinlock/hwspinlock_internal.h
> +++ b/drivers/hwspinlock/hwspinlock_internal.h
> @@ -35,11 +35,13 @@ struct hwspinlock_ops {
> * struct hwspinlock - this struct represents a single hwspinlock instance
> * @bank: the hwspinlock_device structure which owns this lock
> * @lock: initialized and used by hwspinlock core
> + * @refcount: number of users (when shared)
> * @priv: private data, owned by the underlying platform-specific hwspinlock
> drv
> */
> struct hwspinlock {
> struct hwspinlock_device *bank;
> spinlock_t lock;
> + unsigned int refcount;
> void *priv;
> };
>
> --
> 2.7.4
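The request/free bookkeeping in the patch above (a lock is tagged either exclusive or shared, and a shared lock is only marked used by its first requester and unused again by its last) can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel code: the names model_lock, model_lock_request and model_lock_free are hypothetical, the mode is fixed at initialization instead of being derived from the device tree and stored as radix tree tags, and the runtime-PM and module reference handling is left out.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel API. */
enum lock_mode { LOCK_EXCLUSIVE, LOCK_SHARED };

struct model_lock {
	enum lock_mode mode;	/* fixed when the lock is declared */
	bool unused;		/* mirrors the HWSPINLOCK_UNUSED tag */
	unsigned int refcount;	/* number of users (when shared) */
};

static void model_lock_init(struct model_lock *l, enum lock_mode mode)
{
	l->mode = mode;
	l->unused = true;
	l->refcount = 0;
}

/* Returns 0 on success, -1 if the lock is in use and not shared. */
static int model_lock_request(struct model_lock *l)
{
	if (l->mode != LOCK_SHARED && !l->unused)
		return -1;
	if (l->mode == LOCK_SHARED && l->refcount++)
		return 0;	/* an earlier sharer already marked it used */
	l->unused = false;
	return 0;
}

/* Returns 0 on success, -1 if the lock is already free. */
static int model_lock_free(struct model_lock *l)
{
	if (l->unused)
		return -1;
	if (l->mode == LOCK_SHARED && --l->refcount)
		return 0;	/* other sharers remain */
	l->unused = true;
	return 0;
}
```

In this model a second request on an exclusive lock fails, while a shared lock accepts any number of requests and only returns to the unused state after a matching number of frees.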
* [PATCH] docs: arm: Remove orphan sh-mobile directory
From: Geert Uytterhoeven @ 2019-07-31 9:02 UTC (permalink / raw)
To: Jonathan Corbet, Mauro Carvalho Chehab, Simon Horman, Magnus Damm
Cc: linux-doc, linux-renesas-soc, linux-kernel, Geert Uytterhoeven
This directory is empty, except for a .gitignore file, listing an
executable file that can no longer be built since commit
c6535e1e0361157e ("Documentation: Remove ZBOOT MMC/SDHI utility and
docs").
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
---
Documentation/arm/sh-mobile/.gitignore | 1 -
1 file changed, 1 deletion(-)
delete mode 100644 Documentation/arm/sh-mobile/.gitignore
diff --git a/Documentation/arm/sh-mobile/.gitignore b/Documentation/arm/sh-mobile/.gitignore
deleted file mode 100644
index c928dbf3cc8806e2..0000000000000000
--- a/Documentation/arm/sh-mobile/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-vrl4
--
2.17.1
^ permalink raw reply related
* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Minchan Kim @ 2019-07-31 8:53 UTC (permalink / raw)
To: Joel Fernandes (Google)
Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
Christian Hansen, dancol, fmayer, joaodias, joelaf,
Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, namhyung,
Roman Gushchin, Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
Vlastimil Babka, wvw
In-Reply-To: <20190726152319.134152-1-joel@joelfernandes.org>
Hi Joel,
On Fri, Jul 26, 2019 at 11:23:18AM -0400, Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
>
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file. It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
>
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> ---
> v2->v3:
> Fixed a bug where I was doing a kfree that is not needed due to not
> needing to do GFP_ATOMIC allocations.
>
> v1->v2:
> Mark swap ptes as idle (Minchan)
> Avoid need for GFP_ATOMIC (Andrew)
> Get rid of idle_page_list lock by moving list to stack
>
> Internal review -> v1:
> Fixes from Suren.
> Corrections to change log, docs (Florian, Sandeep)
>
> fs/proc/base.c | 3 +
> fs/proc/internal.h | 1 +
> fs/proc/task_mmu.c | 57 +++++++
> include/linux/page_idle.h | 4 +
> mm/page_idle.c | 340 +++++++++++++++++++++++++++++++++-----
> 5 files changed, 360 insertions(+), 45 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 77eb628ecc7f..a58dd74606e9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
> REG("smaps", S_IRUGO, proc_pid_smaps_operations),
> REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
> REG("pagemap", S_IRUSR, proc_pagemap_operations),
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> + REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> +#endif
> #endif
> #ifdef CONFIG_SECURITY
> DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index cd0c8d5ce9a1..bc9371880c63 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
> extern const struct file_operations proc_pid_smaps_rollup_operations;
> extern const struct file_operations proc_clear_refs_operations;
> extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_page_idle_operations;
>
> extern unsigned long task_vsize(struct mm_struct *);
> extern unsigned long task_statm(struct mm_struct *,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4d2b860dbc3f..11ccc53da38e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
> .open = pagemap_open,
> .release = pagemap_release,
> };
> +
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + int ret;
> + struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> + if (!tsk)
> + return -EINVAL;
> + ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> + put_task_struct(tsk);
Why do you need the task_struct here? You already looked up the task in
open and got the mm there, so you could pass the mm here instead of the task.
> + return ret;
> +}
> +
> +static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + int ret;
> + struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> + if (!tsk)
> + return -EINVAL;
> + ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
> + put_task_struct(tsk);
> + return ret;
> +}
> +
> +static int proc_page_idle_open(struct inode *inode, struct file *file)
> +{
> + struct mm_struct *mm;
> +
> + mm = proc_mem_open(inode, PTRACE_MODE_READ);
> + if (IS_ERR(mm))
> + return PTR_ERR(mm);
> + file->private_data = mm;
> + return 0;
> +}
> +
> +static int proc_page_idle_release(struct inode *inode, struct file *file)
> +{
> + struct mm_struct *mm = file->private_data;
> +
> + if (mm)
> + mmdrop(mm);
> + return 0;
> +}
> +
> +const struct file_operations proc_page_idle_operations = {
> + .llseek = mem_lseek, /* borrow this */
> + .read = proc_page_idle_read,
> + .write = proc_page_idle_write,
> + .open = proc_page_idle_open,
> + .release = proc_page_idle_release,
> +};
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
> #endif /* CONFIG_PROC_PAGE_MONITOR */
>
> #ifdef CONFIG_NUMA
> diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> index 1e894d34bdce..f1bc2640d85e 100644
> --- a/include/linux/page_idle.h
> +++ b/include/linux/page_idle.h
> @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
> }
> #endif /* CONFIG_64BIT */
>
> +ssize_t page_idle_proc_write(struct file *file,
> + char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> +ssize_t page_idle_proc_read(struct file *file,
> + char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> #else /* !CONFIG_IDLE_PAGE_TRACKING */
>
> static inline bool page_is_young(struct page *page)
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 295512465065..86244f7f1faa 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -5,12 +5,15 @@
> #include <linux/sysfs.h>
> #include <linux/kobject.h>
> #include <linux/mm.h>
> -#include <linux/mmzone.h>
> -#include <linux/pagemap.h>
> -#include <linux/rmap.h>
> #include <linux/mmu_notifier.h>
> +#include <linux/mmzone.h>
> #include <linux/page_ext.h>
> #include <linux/page_idle.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/sched/mm.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>
> #define BITMAP_CHUNK_SIZE sizeof(u64)
> #define BITMAP_CHUNK_BITS (BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> @@ -25,18 +28,13 @@
> * page tracking. With such an indicator of user pages we can skip isolated
> * pages, but since there are not usually many of them, it will hardly affect
> * the overall result.
> - *
> - * This function tries to get a user memory page by pfn as described above.
> */
> -static struct page *page_idle_get_page(unsigned long pfn)
> +static struct page *page_idle_get_page(struct page *page_in)
The function name looks odd now that the argument has changed.
Maybe "bool check_valid_page(struct page *page)"?
> {
> struct page *page;
> pg_data_t *pgdat;
>
> - if (!pfn_valid(pfn))
> - return NULL;
> -
> - page = pfn_to_page(pfn);
> + page = page_in;
> if (!page || !PageLRU(page) ||
> !get_page_unless_zero(page))
> return NULL;
> @@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
> return page;
> }
>
> +/*
> + * This function tries to get a user memory page by pfn as described above.
> + */
> +static struct page *page_idle_get_page_pfn(unsigned long pfn)
Then we could use the page_idle_get_page name here.
> +{
> +
> + if (!pfn_valid(pfn))
> + return NULL;
page = pfn_to_page(pfn);
return check_valid_page(page) ? page : NULL;
> +
> + return page_idle_get_page(pfn_to_page(pfn));
> +}
> +
> static bool page_idle_clear_pte_refs_one(struct page *page,
> struct vm_area_struct *vma,
> unsigned long addr, void *arg)
> @@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
> unlock_page(page);
> }
>
> +/* Helper to get the start and end frame given a pos and count */
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> + unsigned long *start, unsigned long *end)
> +{
> + unsigned long max_frame;
> +
> + /* If an mm is not given, assume we want physical frames */
> + max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> + if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> + return -EINVAL;
> +
> + *start = pos * BITS_PER_BYTE;
> + if (*start >= max_frame)
> + return -ENXIO;
> +
> + *end = *start + count * BITS_PER_BYTE;
> + if (*end > max_frame)
> + *end = max_frame;
> + return 0;
> +}
> +
> +static bool page_really_idle(struct page *page)
Just a minor point:
Instead of creating a new API, could we extend page_is_idle by
introducing a further argument, pte_check?
bool page_is_idle(struct page *page, bool pte_check);
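A rough userspace sketch of how such a combined helper might behave, assuming the pte_check path does what page_really_idle() does in the patch (clear PTE references, then recheck). Everything prefixed toy_ is a hypothetical stand-in invented for illustration, not kernel code; the real struct page and page_idle_clear_pte_refs() are considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page (hypothetical, not the kernel layout). */
struct toy_page {
	bool idle_flag;    /* models the Idle page flag */
	bool pte_accessed; /* models an Accessed bit still set in some PTE */
};

/*
 * Models page_idle_clear_pte_refs(): if a PTE recorded an access,
 * consume that reference and clear the idle flag.
 */
static void toy_clear_pte_refs(struct toy_page *page)
{
	if (page->pte_accessed) {
		page->pte_accessed = false;
		page->idle_flag = false;
	}
}

/*
 * Models the suggested page_is_idle(page, pte_check):
 * - pte_check == false: report the flag as-is (the existing behaviour);
 * - pte_check == true:  clear PTE refs first and recheck, as
 *   page_really_idle() does in the patch.
 */
static bool toy_page_is_idle(struct toy_page *page, bool pte_check)
{
	if (!page || !page->idle_flag)
		return false;
	if (!pte_check)
		return true;
	toy_clear_pte_refs(page);
	return page->idle_flag;
}
```

With this shape, existing callers would pass false and the bitmap read paths would pass true, so no separate page_really_idle() would be needed.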
> +{
> + if (!page)
> + return false;
> +
> + if (page_is_idle(page)) {
> + /*
> + * The page might have been referenced via a
> + * pte, in which case it is not idle. Clear
> + * refs and recheck.
> + */
> + page_idle_clear_pte_refs(page);
> + if (page_is_idle(page))
> + return true;
> + }
> +
> + return false;
> +}
> +
> static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> struct bin_attribute *attr, char *buf,
> loff_t pos, size_t count)
> @@ -125,35 +176,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> u64 *out = (u64 *)buf;
> struct page *page;
> unsigned long pfn, end_pfn;
> - int bit;
> + int bit, ret;
>
> - if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> - return -EINVAL;
> -
> - pfn = pos * BITS_PER_BYTE;
> - if (pfn >= max_pfn)
> - return 0;
> -
> - end_pfn = pfn + count * BITS_PER_BYTE;
> - if (end_pfn > max_pfn)
> - end_pfn = max_pfn;
> + ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> + if (ret == -ENXIO)
> + return 0; /* Reads beyond max_pfn do nothing */
> + else if (ret)
> + return ret;
>
> for (; pfn < end_pfn; pfn++) {
> bit = pfn % BITMAP_CHUNK_BITS;
> if (!bit)
> *out = 0ULL;
> - page = page_idle_get_page(pfn);
> - if (page) {
> - if (page_is_idle(page)) {
> - /*
> - * The page might have been referenced via a
> - * pte, in which case it is not idle. Clear
> - * refs and recheck.
> - */
> - page_idle_clear_pte_refs(page);
> - if (page_is_idle(page))
> - *out |= 1ULL << bit;
> - }
> + page = page_idle_get_page_pfn(pfn);
> + if (page && page_really_idle(page)) {
> + *out |= 1ULL << bit;
> put_page(page);
> }
> if (bit == BITMAP_CHUNK_BITS - 1)
> @@ -170,23 +207,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
> const u64 *in = (u64 *)buf;
> struct page *page;
> unsigned long pfn, end_pfn;
> - int bit;
> + int bit, ret;
>
> - if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> - return -EINVAL;
> -
> - pfn = pos * BITS_PER_BYTE;
> - if (pfn >= max_pfn)
> - return -ENXIO;
> -
> - end_pfn = pfn + count * BITS_PER_BYTE;
> - if (end_pfn > max_pfn)
> - end_pfn = max_pfn;
> + ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> + if (ret)
> + return ret;
>
> for (; pfn < end_pfn; pfn++) {
> bit = pfn % BITMAP_CHUNK_BITS;
> if ((*in >> bit) & 1) {
> - page = page_idle_get_page(pfn);
> + page = page_idle_get_page_pfn(pfn);
> if (page) {
> page_idle_clear_pte_refs(page);
> set_page_idle(page);
> @@ -224,6 +254,226 @@ struct page_ext_operations page_idle_ops = {
> };
> #endif
>
> +/* page_idle tracking for /proc/<pid>/page_idle */
> +
> +struct page_node {
> + struct page *page;
> + unsigned long addr;
> + struct list_head list;
> +};
> +
> +struct page_idle_proc_priv {
> + unsigned long start_addr;
> + char *buffer;
> + int write;
> +
> + /* Pre-allocate and provide nodes to add_page_idle_list() */
> + struct page_node *page_nodes;
> + int cur_page_node;
> + struct list_head *idle_page_list;
> +};
> +
> +/*
> + * Add a page to the idle page list. page can be NULL if pte is
> + * from a swapped page.
> + */
> +static void add_page_idle_list(struct page *page,
> + unsigned long addr, struct mm_walk *walk)
> +{
> + struct page *page_get = NULL;
> + struct page_node *pn;
> + int bit;
> + unsigned long frames;
> + struct page_idle_proc_priv *priv = walk->private;
> + u64 *chunk = (u64 *)priv->buffer;
> +
> + if (priv->write) {
> + /* Find whether this page was asked to be marked */
> + frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> + bit = frames % BITMAP_CHUNK_BITS;
> + chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> + if (((*chunk >> bit) & 1) == 0)
> + return;
> + }
> +
> + if (page) {
> + page_get = page_idle_get_page(page);
> + if (!page_get)
> + return;
> + }
> +
> + pn = &(priv->page_nodes[priv->cur_page_node++]);
> + pn->page = page_get;
> + pn->addr = addr;
> + list_add(&pn->list, priv->idle_page_list);
> +}
> +
> +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end,
> + struct mm_walk *walk)
> +{
> + struct vm_area_struct *vma = walk->vma;
> + pte_t *pte;
> + spinlock_t *ptl;
> + struct page *page;
> +
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (ptl) {
> + if (pmd_present(*pmd)) {
> + page = follow_trans_huge_pmd(vma, addr, pmd,
> + FOLL_DUMP|FOLL_WRITE);
> + if (!IS_ERR_OR_NULL(page))
> + add_page_idle_list(page, addr, walk);
> + }
> + spin_unlock(ptl);
> + return 0;
> + }
> +
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> + for (; addr != end; pte++, addr += PAGE_SIZE) {
> + /*
> + * We add swapped pages to the idle_page_list so that we can
> + * reported to userspace that they are idle.
> + */
> + if (is_swap_pte(*pte)) {
I suggested "let's consider every swapped-out page as IDLE", but
let's think about this case:
1. mark the heap of the process as IDLE
2. the process touches its working set
3. the process's heap pages are swapped out by a memory spike or madvise
4. the heap profiler inspects the process's IDLE pages and is surprised
that all of the heap is idle.
It's also a good scenario for another purpose, because non-idle pages (IOW,
the working set) could be read ahead when the app restarts.
Maybe squeeze the idle bit into the swap pte so it can be checked.
> + add_page_idle_list(NULL, addr, walk);
> + continue;
> + }
> +
> + if (!pte_present(*pte))
> + continue;
> +
> + page = vm_normal_page(vma, addr, *pte);
> + if (page)
> + add_page_idle_list(page, addr, walk);
> + }
> +
> + pte_unmap_unlock(pte - 1, ptl);
> + return 0;
> +}
> +
> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> + size_t count, loff_t *pos,
> + struct task_struct *tsk, int write)
> +{
> + int ret;
> + char *buffer;
> + u64 *out;
> + unsigned long start_addr, end_addr, start_frame, end_frame;
> + struct mm_struct *mm = file->private_data;
> + struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> + struct page_node *cur;
> + struct page_idle_proc_priv priv;
> + bool walk_error = false;
> + LIST_HEAD(idle_page_list);
> +
> + if (!mm || !mmget_not_zero(mm))
> + return -EINVAL;
> +
> + if (count > PAGE_SIZE)
> + count = PAGE_SIZE;
> +
> + buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!buffer) {
> + ret = -ENOMEM;
> + goto out_mmput;
> + }
> + out = (u64 *)buffer;
> +
> + if (write && copy_from_user(buffer, ubuff, count)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> + if (ret)
> + goto out;
> +
> + start_addr = (start_frame << PAGE_SHIFT);
> + end_addr = (end_frame << PAGE_SHIFT);
> + priv.buffer = buffer;
> + priv.start_addr = start_addr;
> + priv.write = write;
> +
> + priv.idle_page_list = &idle_page_list;
> + priv.cur_page_node = 0;
> + priv.page_nodes = kzalloc(sizeof(struct page_node) *
> + (end_frame - start_frame), GFP_KERNEL);
> + if (!priv.page_nodes) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + walk.private = &priv;
> + walk.mm = mm;
> +
> + down_read(&mm->mmap_sem);
> +
> + /*
> + * idle_page_list is needed because walk_page_vma() holds ptlock which
> + * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
> + * pages first, and then call page_idle_clear_pte_refs().
> + */
Thanks for the comment, I was curious why you want to have
idle_page_list, and the reason is here.
How about making this /proc/<pid>/page_idle per-process granularity,
unlike the system-level /sys/xxx/page_idle? What I mean is not to check
rmap for references from any random process, but to check only
accesses from the target process. That would be more appropriate for a
/proc/<pid>/ interface and good for per-process tracking, as well as
fast.
> + ret = walk_page_range(start_addr, end_addr, &walk);
> + if (ret)
> + walk_error = true;
> +
> + list_for_each_entry(cur, &idle_page_list, list) {
> + int bit, index;
> + unsigned long off;
> + struct page *page = cur->page;
> +
> + if (unlikely(walk_error))
> + goto remove_page;
> +
> + if (write) {
> + if (page) {
> + page_idle_clear_pte_refs(page);
> + set_page_idle(page);
> + }
> + } else {
> + if (!page || page_really_idle(page)) {
> + off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> + bit = off % BITMAP_CHUNK_BITS;
> + index = off / BITMAP_CHUNK_BITS;
> + out[index] |= 1ULL << bit;
> + }
> + }
> +remove_page:
> + if (page)
> + put_page(page);
> + }
> +
> + if (!write && !walk_error)
> + ret = copy_to_user(ubuff, buffer, count);
> +
> + up_read(&mm->mmap_sem);
> + kfree(priv.page_nodes);
> +out:
> + kfree(buffer);
> +out_mmput:
> + mmput(mm);
> + if (!ret)
> + ret = count;
> + return ret;
> +
> +}
> +
> +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> + size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> + return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> +}
> +
> +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> + size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> + return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> +}
> +
> static int __init page_idle_init(void)
> {
> int err;
> --
> 2.22.0.709.g102302147b-goog
^ permalink raw reply
* Re: [RFC v2 0/6] Introduce TEE based Trusted Keys support
From: Janne Karhunen @ 2019-07-31 7:11 UTC (permalink / raw)
To: Sumit Garg
Cc: keyrings, linux-integrity, linux-security-module, jens.wiklander,
corbet, dhowells, jejb, jarkko.sakkinen, Mimi Zohar, James Morris,
Serge E. Hallyn, Casey Schaufler, ard.biesheuvel, daniel.thompson,
linux-doc, Linux Kernel Mailing List, linux-arm-kernel, tee-dev
In-Reply-To: <1564489420-677-1-git-send-email-sumit.garg@linaro.org>
Hi,
Interesting, I wrote something similar and posted it to the lists a while back:
https://github.com/jkrh/linux/commit/d77ea03afedcb5fd42234cd834da8f8a0809f6a6
Since there are no generic 'TEEs' available, I implemented the same
thing as a generic protocol translator. The shared memory binding, for
instance, already assumes a fair amount about the TEE and how it is
physically present in the system. Besides, the benefit of using shm
is pretty limited due to the size of the keydata.
--
Janne
On Tue, Jul 30, 2019 at 3:26 PM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> Add support for TEE based trusted keys where TEE provides the functionality
> to seal and unseal trusted keys using hardware unique key. Also, this is
> an alternative in case platform doesn't possess a TPM device.
>
> This series also adds some TEE features like:
>
> Patch #1, #2 enables support for registered kernel shared memory with TEE.
>
> Patch #3 enables support for private kernel login method required for
> cases like trusted keys where we don't want user-space to directly access
> TEE service to retrieve trusted key contents.
>
> Rest of the patches from #4 to #6 adds support for TEE based trusted keys.
>
> This patch-set has been tested with OP-TEE based pseudo TA which can be
> found here [1].
>
> Also, this patch-set is dependent on generic Trusted Keys framework
> patch-set [2].
>
> [1] https://github.com/OP-TEE/optee_os/pull/3082
> [2] https://lkml.org/lkml/2019/7/18/284
>
> Changes in v2:
> 1. Add reviewed-by tags for patch #1 and #2.
> 2. Incorporate comments from Jens for patch #3.
> 3. Switch to use generic trusted keys framework.
>
> Sumit Garg (6):
> tee: optee: allow kernel pages to register as shm
> tee: enable support to register kernel memory
> tee: add private login method for kernel clients
> KEYS: trusted: Introduce TEE based Trusted Keys
> doc: keys: Document usage of TEE based Trusted Keys
> MAINTAINERS: Add entry for TEE based Trusted Keys
>
> Documentation/security/keys/index.rst | 1 +
> Documentation/security/keys/tee-trusted.rst | 93 +++++++++
> MAINTAINERS | 9 +
> drivers/tee/optee/call.c | 7 +
> drivers/tee/tee_core.c | 6 +
> drivers/tee/tee_shm.c | 16 +-
> include/keys/trusted-type.h | 3 +
> include/keys/trusted_tee.h | 66 +++++++
> include/linux/tee_drv.h | 1 +
> include/uapi/linux/tee.h | 8 +
> security/keys/Kconfig | 3 +
> security/keys/trusted-keys/Makefile | 3 +-
> security/keys/trusted-keys/trusted-tee.c | 282 ++++++++++++++++++++++++++++
> security/keys/trusted-keys/trusted.c | 3 +
> 14 files changed, 498 insertions(+), 3 deletions(-)
> create mode 100644 Documentation/security/keys/tee-trusted.rst
> create mode 100644 include/keys/trusted_tee.h
> create mode 100644 security/keys/trusted-keys/trusted-tee.c
>
> --
> 2.7.4
>
^ permalink raw reply
* Re: [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing
From: Mike Rapoport @ 2019-07-31 6:44 UTC (permalink / raw)
To: Joel Fernandes (Google)
Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
Christian Hansen, dancol, fmayer, joaodias, joelaf,
Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
linux-fsdevel, linux-mm, Michal Hocko, minchan, namhyung,
Roman Gushchin, Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
Vlastimil Babka, wvw
In-Reply-To: <20190726152319.134152-2-joel@joelfernandes.org>
On Fri, Jul 26, 2019 at 11:23:19AM -0400, Joel Fernandes (Google) wrote:
> This patch updates the documentation with the new page_idle tracking
> feature which uses virtual address indexing.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
One nit below, otherwise
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
> .../admin-guide/mm/idle_page_tracking.rst | 43 ++++++++++++++++---
> 1 file changed, 36 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
> index df9394fb39c2..1eeac78c94a7 100644
> --- a/Documentation/admin-guide/mm/idle_page_tracking.rst
> +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
> @@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
>
> User API
> ========
> +There are 2 ways to access the idle page tracking API. One uses physical
> +address indexing, another uses a simpler virtual address indexing scheme.
>
> -The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
> -Currently, it consists of the only read-write file,
> -``/sys/kernel/mm/page_idle/bitmap``.
> +Physical address indexing
> +-------------------------
> +The idle page tracking API for physical address indexing using page frame
> +numbers (PFN) is located at ``/sys/kernel/mm/page_idle``. Currently, it
> +consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
>
> The file implements a bitmap where each bit corresponds to a memory page. The
> bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
> @@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
> information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
> ``/proc/kpagecgroup``.
>
> +Virtual address indexing
> +------------------------
> +The idle page tracking API for virtual address indexing using virtual page
> +frame numbers (VFN) is located at ``/proc/<pid>/page_idle``. It is a bitmap
> +that follows the same semantics as ``/sys/kernel/mm/page_idle/bitmap``
> +except that it uses virtual instead of physical frame numbers.
Can you please make it more explicit that VFNs are in the <pid>'s address
space?
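To illustrate the point, here is a minimal userspace sketch of how a virtual address in the target process would map to a position in the bitmap, assuming the layout described in the patch (8-byte chunks, one bit per virtual frame, bit i of chunk n describing frame n * 64 + i). The struct and function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define IDLE_CHUNK_BYTES 8u  /* each bitmap chunk is one 8-byte integer */
#define IDLE_CHUNK_BITS  64u /* so each chunk covers 64 virtual frames */

/* Hypothetical helper type: where one page's bit lives in the file. */
struct idle_bit_pos {
	uint64_t file_offset; /* byte offset of the 8-byte chunk */
	unsigned int bit;     /* bit index inside that chunk, 0..63 */
};

/* Map a virtual address of the target <pid> to its bitmap position. */
static struct idle_bit_pos idle_pos_for_vaddr(uint64_t vaddr,
					      unsigned int page_shift)
{
	struct idle_bit_pos pos;
	uint64_t vfn;

	assert(page_shift < 64);
	vfn = vaddr >> page_shift; /* virtual frame number */
	pos.file_offset = (vfn / IDLE_CHUNK_BITS) * IDLE_CHUNK_BYTES;
	pos.bit = (unsigned int)(vfn % IDLE_CHUNK_BITS);
	return pos;
}

/* Inverse: the page-aligned virtual address that a given bit describes. */
static uint64_t idle_vaddr_for_pos(struct idle_bit_pos pos,
				   unsigned int page_shift)
{
	uint64_t vfn;

	assert(page_shift < 64);
	vfn = (pos.file_offset / IDLE_CHUNK_BYTES) * IDLE_CHUNK_BITS + pos.bit;
	return vfn << page_shift;
}
```

A caller would read or write 8 bytes at file_offset in /proc/<pid>/page_idle and test or set the given bit; the address fed in is one from the <pid>'s own address space (for example, taken from /proc/<pid>/maps), not a physical address.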
> +
> +This idle page tracking API does not need deal with PFN so it does not require
> +prior lookups of ``pagemap`` in order to find if page is idle or not. This is
> +an advantage on some systems where looking up PFN is considered a security
> +issue. Also in some cases, this interface could be slightly more reliable to
> +use than physical address indexing, since in physical address indexing, address
> +space changes can occur between reading the ``pagemap`` and reading the
> +``bitmap``, while in virtual address indexing, the process's ``mmap_sem`` is
> +held for the duration of the access.
> +
> +To estimate the amount of pages that are not used by a workload one should:
> +
> + 1. Mark all the workload's pages as idle by setting corresponding bits in
> + ``/proc/<pid>/page_idle``.
> +
> + 2. Wait until the workload accesses its working set.
> +
> + 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
> +
> .. _impl_details:
>
> Implementation Details
> @@ -99,10 +128,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
> exceeding the dirty memory limit, it is not marked referenced.
>
> The idle memory tracking feature adds a new page flag, the Idle flag. This flag
> -is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
> -:ref:`User API <user_api>`
> -section), and cleared automatically whenever a page is referenced as defined
> -above.
> +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
> +addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
> +addressing (see the :ref:`User API <user_api>` section), and cleared
> +automatically whenever a page is referenced as defined above.
>
> When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
> mapped to, otherwise we will not be able to detect accesses to the page coming
> --
> 2.22.0.709.g102302147b-goog
>
--
Sincerely yours,
Mike.
^ permalink raw reply
* Re: [PATCH v2 25/26] docs: rcu: convert some articles from html to ReST
From: Mauro Carvalho Chehab @ 2019-07-31 1:33 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
Joel Fernandes, Jonathan Corbet, rcu, linux-doc
In-Reply-To: <20190731010623.GN14271@linux.ibm.com>
On Tue, 30 Jul 2019 18:06:24 -0700,
"Paul E. McKenney" <paulmck@linux.ibm.com> wrote:
> On Tue, Jul 30, 2019 at 09:47:22PM -0300, Mauro Carvalho Chehab wrote:
> > On Tue, 30 Jul 2019 17:04:55 -0700,
> > "Paul E. McKenney" <paulmck@linux.ibm.com> wrote:
> > > This appears to come from Documentation/output/latex/RCU.tex.
> > > There is nevertheless an RCU.pdf in this directory. It is not
> > > bad, but has a figure full of XML on PDF page 21. And a few later
> > > on as well.
> >
> > PDF output is indeed an issue. The way it works is that it first
> > generates a LaTeX and then it uses texlive to produce the PDF.
>
> Would it be fair to say that html output is what is currently supported,
> and that PDF output is a future thing?
Sure.
Anyway, if you want to fix PDF later, I suspect that simply adding:
.. cssclass:: longtable
before each quiz table should be enough to fix it, as the tables there seem
to be simple enough.
Once fixed, the PDF and LaTeX output are usually decent.
> > > On the HTML side, the quick quizzes have immediately visible answers,
> > > which defeats the purpose. The original HTML used a white font,
> > > so that you selected the answer with your mouse to make it visible.
> > >
> > > Can something similar be done with Sphinx? Another approach is to
> > > gather the answers into a separate file and link to them.
> >
> > Yeah, I guess you used a css style that would make the answer visible
> > when the mouse is inside it on your original lwn.net set of articles.
> >
> > Sphinx has a directive to use css, so, the short answer is: yes, you
> > can.
> >
> > For html, you would need to add a css specific for the RCU quiz,
> > placing it under Documentation/sphinx directory. Then, use the
> > ".. css" directive to handle that.
> >
> > You should notice, however, that this will be ignored for
> > LaTeX/pdf output.
> >
> > I guess you can place this on another file, or perhaps place at the
> > end of the document, having a link for the quiz answers.
> >
> > Another alternative would be to make the answer as a footnote.
>
> Making it CSS for HTML and a footnote for PDF seems eminently
> reasonable to me!
You should either do CSS or PDF, as otherwise you will end up with dirty
hacks like:
.. only:: html
<some quiz table with answers using css>
.. only:: latex
<some quiz table with answers using footnotes>
I.e. you'll need to place the quiz twice, making it harder to maintain
and messier.
Btw, LaTeX may also parse a css tag, processing it via some custom
macro (which should be added at Documentation/conf.py).
> > > I believe that Joel already noted that internal links are not working.
> > > The external links that I tried work just fine, though. As do the
> > > links from the table of contents.
> >
> > Yeah. Funny enough, when I tested here, they worked fine. Maybe
> > this is due to the Sphinx version I used here at the time I wrote
> > it.
> >
> > Anyway, Joel already submitted a patch addressing this one.
>
> And it works for me, anyway! ;-)
Great!
Thanks,
Mauro
^ permalink raw reply