From: David Woodhouse <dwmw2@infradead.org>
To: "Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>
Cc: linux-arch@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Mike Rapoport <rppt@kernel.org>,
xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH v6 06/38] mm: Add default definition of set_ptes()
Date: Thu, 12 Oct 2023 14:53:05 +0100
Message-ID: <4c63ee3634ccfed7d687fcbdd9db60663bce481f.camel@infradead.org>
In-Reply-To: <20230802151406.3735276-7-willy@infradead.org>
On Wed, 2023-08-02 at 16:13 +0100, Matthew Wilcox (Oracle) wrote:
> Most architectures can just define set_pte() and PFN_PTE_SHIFT to
> use this definition. It's also a handy spot to document the guarantees
> provided by the MM.
>
> Suggested-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
> ---
> include/linux/pgtable.h | 81 ++++++++++++++++++++++++++++++-----------
> 1 file changed, 60 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f34e0f2cb4d8..3fde0d5d1c29 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -182,6 +182,66 @@ static inline int pmd_young(pmd_t pmd)
> }
> #endif
>
> +/*
> + * A facility to provide lazy MMU batching. This allows PTE updates and
> + * page invalidations to be delayed until a call to leave lazy MMU mode
> + * is issued. Some architectures may benefit from doing this, and it is
> + * beneficial for both shadow and direct mode hypervisors, which may batch
> + * the PTE updates which happen during this window. Note that using this
> + * interface requires that read hazards be removed from the code. A read
> + * hazard could result in the direct mode hypervisor case, since the actual
> + * write to the page tables may not yet have taken place, so reads though
> + * a raw PTE pointer after it has been modified are not guaranteed to be
> + * up to date. This mode can only be entered and left under the protection of
> + * the page table locks for all page tables which may be modified. In the UP
> + * case, this is required so that preemption is disabled, and in the SMP case,
> + * it must synchronize the delayed page table writes properly on other CPUs.
> + */
> +#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +#define arch_enter_lazy_mmu_mode() do {} while (0)
> +#define arch_leave_lazy_mmu_mode() do {} while (0)
> +#define arch_flush_lazy_mmu_mode() do {} while (0)
> +#endif
> +
> +#ifndef set_ptes
> +#ifdef PFN_PTE_SHIFT
> +/**
> + * set_ptes - Map consecutive pages to a contiguous range of addresses.
> + * @mm: Address space to map the pages into.
> + * @addr: Address to map the first page at.
> + * @ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @nr: Number of pages to map.
> + *
> + * May be overridden by the architecture, or the architecture can define
> + * set_pte() and PFN_PTE_SHIFT.
> + *
> + * Context: The caller holds the page table lock. The pages all belong
> + * to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +		pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	page_table_check_ptes_set(mm, ptep, pte, nr);
> +
> +	arch_enter_lazy_mmu_mode();
> +	for (;;) {
> +		set_pte(ptep, pte);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> +	}
> +	arch_leave_lazy_mmu_mode();
> +}

This breaks the Xen PV guest.

In move_ptes() in mm/mremap.c we call arch_enter_lazy_mmu_mode() and then
loop calling set_pte_at(). That now (or at least in a few commits' time,
when you wire it up for x86 in commit a3e1c9372c9b959) ends up in your
implementation of set_ptes(), which calls arch_enter_lazy_mmu_mode() again,
and:

[ 0.628700] ------------[ cut here ]------------
[ 0.628718] kernel BUG at arch/x86/kernel/paravirt.c:144!
[ 0.628743] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 0.628769] CPU: 0 PID: 1 Comm: init Not tainted 6.5.0-rc4+ #1295
[ 0.628818] RIP: e030:paravirt_enter_lazy_mmu+0x24/0x30
[ 0.628839] Code: 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 8b 05 90 28 f9 7e 85 c0 75 10 65 c7 05 81 28 f9 7e 01 00 00 00 c3 cc cc cc cc <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90
[ 0.628875] RSP: e02b:ffffc9004000ba48 EFLAGS: 00010202
[ 0.628891] RAX: 0000000000000001 RBX: ffff8880051b7100 RCX: 000ffffffffff000
[ 0.628908] RDX: 80000000763ff967 RSI: 80000000763ff967 RDI: ffff8880051b7100
[ 0.628925] RBP: 80000000763ff967 R08: ffff8880051b6868 R09: 00007ffce1a20000
[ 0.628943] R10: deadbeefdeadf00d R11: 0000000000000000 R12: 00007ffffffff000
[ 0.628964] R13: ffff8880050b7000 R14: 0000000000000001 R15: 00007fffffffe000
[ 0.628988] FS: 0000000000000000(0000) GS:ffff88807b800000(0000) knlGS:0000000000000000
[ 0.629007] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.629024] CR2: ffffc900003f5000 CR3: 0000000003904000 CR4: 0000000000050660
[ 0.629046] Call Trace:
[ 0.629055] <TASK>
[ 0.629066] ? die+0x36/0x90
[ 0.629081] ? do_trap+0xda/0x100
[ 0.629093] ? paravirt_enter_lazy_mmu+0x24/0x30
[ 0.629112] ? do_error_trap+0x6a/0x90
[ 0.629123] ? paravirt_enter_lazy_mmu+0x24/0x30
[ 0.629138] ? exc_invalid_op+0x50/0x70
[ 0.629155] ? paravirt_enter_lazy_mmu+0x24/0x30
[ 0.629169] ? asm_exc_invalid_op+0x1a/0x20
[ 0.629185] ? paravirt_enter_lazy_mmu+0x24/0x30
[ 0.629212] ? pte_offset_map_nolock+0x48/0xc0
[ 0.629226] set_ptes.constprop.0+0xd/0x30
[ 0.629240] move_ptes.isra.0+0xdd/0x290
[ 0.629253] ? pmd_install+0xab/0xd0
[ 0.629267] move_page_tables+0x3a0/0x850
[ 0.629294] shift_arg_pages+0xf4/0x1d0
[ 0.629317] setup_arg_pages+0x205/0x380
[ 0.629330] load_elf_binary+0x398/0xe00
I'm working on making PV kernels testable in qemu. With...
• some qemu fixes and a nasty hackish Xen console implementation:
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv-console
• a CONFIG_PV_SHIM_EXCLUSIVE build of Xen itself to run in the guest,
• some suitable disk image lying around, in ${GUEST_IMAGE}, and
• CONFIG_KVM_XEN enabled in your host kernel,
...you should be able to do something like:
$ ./qemu-system-x86_64 --accel kvm,xen-version=0x40011,kernel-irqchip=split \
    -drive file=${GUEST_IMAGE},if=none,id=disk \
    -device xen-disk,drive=disk,vdev=xvda \
    -m 1G \
    -kernel ~/git/xen/xen/xen \
    -initrd ~/git/linux/arch/x86/boot/bzImage \
    -append "loglvl=all -- console=hvc0 root=/dev/xvda1" \
    -display none