From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Andrew Morton <akpm@linux-foundation.org>,
Yang Shi <shy828301@gmail.com>, Hugh Dickins <hughd@google.com>,
Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH v2] mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit
Date: Tue, 6 Dec 2022 17:12:16 -0500 [thread overview]
Message-ID: <Y4++QNzYx1CE1ayY@x1n> (raw)
In-Reply-To: <20221206105737.69478-1-david@redhat.com>
On Tue, Dec 06, 2022 at 11:57:37AM +0100, David Hildenbrand wrote:
> We use "unsigned long" to store a PFN in the kernel and phys_addr_t to
> store a physical address.
>
> On a 64bit system, both are 64bit wide. However, on a 32bit system, the
> latter might be 64bit wide. This is, for example, the case on x86 with
> PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only
> spans 32bit.
>
> The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses
> that case, and assumes that the maximum PFN is limited by an 32bit
> phys_addr_t. This implies, that SWP_PFN_BITS will currently only be able to
> cover 4 GiB - 1 on any 32bit system with 4k page size, which is wrong.
>
> Let's rely on the number of bits in phys_addr_t instead, but make sure to
> not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in
> is_pfn_swap_entry() unhappy. Note that swp_entry_t is effectively an
> unsigned long and the maximum swap offset shares that value with the
> swap type.
>
> For example, on an 8 GiB x86 PAE system with a kernel config based on
> Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently fail
> removing migration entries (remove_migration_ptes()), because
> mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as
> swp_offset_pfn() wrongly masks off PFN bits. For example,
> split_huge_page_to_list()->...->remap_page() will leave migration
> entries in place and continue to unlock the page.
>
> Later, when we stumble over these migration entries (e.g., via
> /proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because
> these migration entries shouldn't exist anymore and the page was
> unlocked.
>
> [ 33.067591] kernel BUG at include/linux/swapops.h:497!
> [ 33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G E 6.1.0-rc8+ #16
> [ 33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
> [ 33.067606] EIP: pagemap_pmd_range+0x644/0x650
> [ 33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
> [ 33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000
> [ 33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68
> [ 33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
> [ 33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0
> [ 33.067625] Call Trace:
> [ 33.067628] ? madvise_free_pte_range+0x720/0x720
> [ 33.067632] ? smaps_pte_range+0x4b0/0x4b0
> [ 33.067634] walk_pgd_range+0x325/0x720
> [ 33.067637] ? mt_find+0x1d6/0x3a0
> [ 33.067641] ? mt_find+0x1d6/0x3a0
> [ 33.067643] __walk_page_range+0x164/0x170
> [ 33.067646] walk_page_range+0xf9/0x170
> [ 33.067648] ? __kmem_cache_alloc_node+0x2a8/0x340
> [ 33.067653] pagemap_read+0x124/0x280
> [ 33.067658] ? default_llseek+0x101/0x160
> [ 33.067662] ? smaps_account+0x1d0/0x1d0
> [ 33.067664] vfs_read+0x90/0x290
> [ 33.067667] ? do_madvise.part.0+0x24b/0x390
> [ 33.067669] ? debug_smp_processor_id+0x12/0x20
> [ 33.067673] ksys_pread64+0x58/0x90
> [ 33.067675] __ia32_sys_ia32_pread64+0x1b/0x20
> [ 33.067680] __do_fast_syscall_32+0x4c/0xc0
> [ 33.067683] do_fast_syscall_32+0x29/0x60
> [ 33.067686] do_SYSENTER_32+0x15/0x20
> [ 33.067689] entry_SYSENTER_32+0x98/0xf1
>
> Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep
> it readable and consistent.
>
> Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>
> v1 -> v2:
> * Rely on sizeof(phys_addr_t) and min_t() instead.
> * Survives my various cross compilations and testing on x86 PAE.
Good to know it works, thanks a lot.
Acked-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
next prev parent reply other threads:[~2022-12-06 22:12 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-12-06 10:57 [PATCH v2] mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit David Hildenbrand
2022-12-06 22:12 ` Peter Xu [this message]
2022-12-07 21:18 ` David Hildenbrand
2022-12-07 22:40 ` Yang Shi
2022-12-08 8:41 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y4++QNzYx1CE1ayY@x1n \
--to=peterx@redhat.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=shy828301@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.