From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Andrew Morton <akpm@linux-foundation.org>,
Yang Shi <shy828301@gmail.com>, Hugh Dickins <hughd@google.com>,
Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH v1] mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit
Date: Mon, 5 Dec 2022 17:21:47 -0500 [thread overview]
Message-ID: <Y45u+0c4Hu2snEO2@x1n> (raw)
In-Reply-To: <20221205150857.167583-1-david@redhat.com>
On Mon, Dec 05, 2022 at 04:08:57PM +0100, David Hildenbrand wrote:
> We use "unsigned long" to store a PFN in the kernel and phys_addr_t to
> store a physical address.
>
> On a 64bit system, both are 64bit wide. However, on a 32bit system, the
> latter might be 64bit wide. This is, for example, the case on x86 with
> PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only
> spans 32bit.
>
> The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses
> that case, and assumes that the maximum PFN is limited by a 32bit
> phys_addr_t. This implies, that SWP_PFN_BITS will currently only be able to
> cover 4 GiB - 1 on any 32bit system, which is wrong. We can end up
> masking off valid PFN bits from the swap offset.
>
> Ideally, we'd use something like "sizeof(phys_addr_t) * 8 - PAGE_SHIFT",
> however, we might exceed SWP_TYPE_SHIFT and make the BUILD_BUG_ON() in
> is_pfn_swap_entry() unhappy. Note that swp_entry_t is effectively an
> unsigned long.
>
> Consequently, simply rely on SWP_TYPE_SHIFT in case we're on 32bit and
> CONFIG_PHYS_ADDR_T_64BIT is defined.
>
> For example, on an 8 GiB x86 PAE system, we currently fail removing
> migration entries (remove_migration_ptes()) that target PFNs above
> 4 GiB, because mm/page_vma_mapped.c:check_pte() will fail to identify a
> PFN match as swp_offset_pfn() wrongly masks off valid PFN bits.
>
> With THP, split_huge_page_to_list()->...->remap_page() will leave migration
> entries in place and continue to unlock the page, which is wrong. Later,
> when we stumble over these migration entries (e.g., via
> /proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these
> migration entries shouldn't exist anymore and the page was unlocked.
>
> [ 33.067591] kernel BUG at include/linux/swapops.h:497!
> [ 33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G E 6.1.0-rc8+ #16
> [ 33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
> [ 33.067606] EIP: pagemap_pmd_range+0x644/0x650
> [ 33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ...
> [ 33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000
> [ 33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68
> [ 33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
> [ 33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0
> [ 33.067625] Call Trace:
> [ 33.067628] ? madvise_free_pte_range+0x720/0x720
> [ 33.067632] ? smaps_pte_range+0x4b0/0x4b0
> [ 33.067634] walk_pgd_range+0x325/0x720
> [ 33.067637] ? mt_find+0x1d6/0x3a0
> [ 33.067641] ? mt_find+0x1d6/0x3a0
> [ 33.067643] __walk_page_range+0x164/0x170
> [ 33.067646] walk_page_range+0xf9/0x170
> [ 33.067648] ? __kmem_cache_alloc_node+0x2a8/0x340
> [ 33.067653] pagemap_read+0x124/0x280
> [ 33.067658] ? default_llseek+0x101/0x160
> [ 33.067662] ? smaps_account+0x1d0/0x1d0
> [ 33.067664] vfs_read+0x90/0x290
> [ 33.067667] ? do_madvise.part.0+0x24b/0x390
> [ 33.067669] ? debug_smp_processor_id+0x12/0x20
> [ 33.067673] ksys_pread64+0x58/0x90
> [ 33.067675] __ia32_sys_ia32_pread64+0x1b/0x20
> [ 33.067680] __do_fast_syscall_32+0x4c/0xc0
> [ 33.067683] do_fast_syscall_32+0x29/0x60
> [ 33.067686] do_SYSENTER_32+0x15/0x20
> [ 33.067689] entry_SYSENTER_32+0x98/0xf1
>
> Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Thanks for debugging this one.
> ---
>
> This makes my x86 PAE case work as expected again. Only cross compiled
> on other architectures.
IIUC it's not about PAE but !SPARSEMEM, as PAE actually has it defined when
with sparsemem:
#ifdef CONFIG_X86_32
# ifdef CONFIG_X86_PAE
# define SECTION_SIZE_BITS 29
# define MAX_PHYSMEM_BITS 36
# else
# define SECTION_SIZE_BITS 26
# define MAX_PHYSMEM_BITS 32
# endif
#else /* CONFIG_X86_32 */
# define SECTION_SIZE_BITS 27 /* matt - 128 is convenient right now */
# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
#endif
One trivial comment below.
>
> ---
> include/linux/swapops.h | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 86b95ccb81bb..4bb7a20f3fa5 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -32,11 +32,13 @@
> * store PFN, we only need SWP_PFN_BITS bits. Each of the pfn swap entries
> * can use the extra bits to store other information besides PFN.
> */
> -#ifdef MAX_PHYSMEM_BITS
> +#if defined(MAX_PHYSMEM_BITS)
> #define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
> -#else /* MAX_PHYSMEM_BITS */
> +#elif !defined(CONFIG_64BIT) && defined(CONFIG_PHYS_ADDR_T_64BIT)
> +#define SWP_PFN_BITS SWP_TYPE_SHIFT
Can we add a comment showing where SWP_TYPE_SHIFT comes from? It should be
a min value comes from either the limitation of phys address width, or from
definition of swp_entry_t (which is unsigned long).
Or I'd rather make this then the code explains better on itself, and the
change should be smaller too:
#ifdef MAX_PHYSMEM_BITS
#define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
#else /* MAX_PHYSMEM_BITS */
-#define SWP_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
+#define SWP_PFN_BITS MIN((sizeof(phys_addr_t) * 8) - \
+ PAGE_SHIFT, SWP_TYPE_SHIFT)
#endif /* MAX_PHYSMEM_BITS */
#define SWP_PFN_MASK (BIT(SWP_PFN_BITS) - 1)
What do you think?
> +#else
> #define SWP_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
> -#endif /* MAX_PHYSMEM_BITS */
> +#endif /* defined(MAX_PHYSMEM_BITS) */
> #define SWP_PFN_MASK (BIT(SWP_PFN_BITS) - 1)
>
> /**
>
> base-commit: 76dcd734eca23168cb008912c0f69ff408905235
> --
> 2.38.1
>
--
Peter Xu
next prev parent reply other threads:[~2022-12-05 22:22 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-12-05 15:08 [PATCH v1] mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit David Hildenbrand
2022-12-05 22:21 ` Peter Xu [this message]
2022-12-06 9:16 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y45u+0c4Hu2snEO2@x1n \
--to=peterx@redhat.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=shy828301@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.