Re: [PATCH] mm: migrate: Fix THP's mapcount on isolation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Alistair Popple <apopple@nvidia.com>
To: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Hugh Dickins <hughd@google.com>, Gavin Shan <gshan@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, william.kucharski@oracle.com,
	ziy@nvidia.com, kirill.shutemov@linux.intel.com,
	zhenyzha@redhat.com, shan.gavin@gmail.com, riel@surriel.com
Subject: Re: [PATCH] mm: migrate: Fix THP's mapcount on isolation
Date: Fri, 25 Nov 2022 11:58:46 +1100	[thread overview]
Message-ID: <87o7svreq0.fsf@nvidia.com> (raw)
In-Reply-To: <51ffd399-7fa3-b2f2-b6e5-61a8b609e350@redhat.com>


David Hildenbrand <david@redhat.com> writes:

> On 24.11.22 04:33, Matthew Wilcox wrote:
>> On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote:
>>>
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> On 23.11.22 06:14, Hugh Dickins wrote:
>>>>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>>>>
>>>>>> The issue is reported when removing memory through virtio_mem device.
>>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>>>> regarded as pinned. The transparent huge page is escaped from being
>>>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>>>> can't be migrated and the corresponding memory block can't be put
>>>>>> into offline state.
>>>>>>
>>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>>>> the transparent huge page can be isolated and migrated, and the memory
>>>>>> block can be put into offline state.
>>>>>>
>>>>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>>>>>> Cc: stable@vger.kernel.org   # v5.8+
>>>>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>> Interesting, good catch, looked right to me: except for the Fixes
>>>>> line
>>>>> and mention of v5.8.  That CoW change may have added a case which easily
>>>>> demonstrates the problem, but it would have been the wrong test on a THP
>>>>> for long before then - but only in v5.7 were compound pages allowed
>>>>> through at all to reach that test, so I think it should be
>>>>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for
>>>>> CMA allocations")
>>>>> Cc: stable@vger.kernel.org   # v5.7+
>>>>> Oh, no, stop: this is not so easy, even in the latest tree.
>>>>> Because at the time of that "admittedly racy check", we have no hold
>>>>> at all on the page in question: and if it's PageLRU or PageCompound
>>>>> at one instant, it may be different the next instant.  Which leaves it
>>>>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
>>>>> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
>>>>> total_mapcount() path than in the existing page_mapcount() path.
>>>>> I suspect that for this to be safe (before your patch and more so
>>>>> after),
>>>>> it will be necessary to shift the "admittedly racy check" down after the
>>>>> get_page_unless_zero() (and check the sequence of operations when a
>>>>> compound page is initialized).
>>>>
>>>> Grabbing a reference first sounds like the right approach to me.
>>>
>>> I think you're right. Without a page reference I don't think it is even
>>> safe to look at struct page, at least not without synchronisation
>>> against memory hot unplug which could remove the struct page. From a
>>> quick glance I didn't see anything here that obviously did that though.
>> Memory hotplug is the offending party here.  It has to make sure
>> that
>> everything else is definitely quiescent before removing the struct pages.
>> Otherwise you can't even try_get a refcount.

Sure, I might be missing something but how can memory hotplug possibly
synchronise against some process (eg. try_to_compact_pages) doing
try_get(pfn_to_page(random_pfn)) without that process either informing
memory hotplug that it needs pages to remain valid long enough to get a
reference or detecting that memory hotplug is in the process of
offlining pages?

> At least alloc_contig_range() and memory offlining are mutually
> exclusive due to MIGRATE_ISOLTAE. I recall that ordinary memory
> compaction similarly deals with isolated pageblocks (or some other
> mechanism I forgot) to not race with memory offlining. Wouldn't worry
> about that for now.

Sorry, agree - to be clear this discussion isn't relevant for this patch
but reviewing it got me thinking about how this works. I still don't see
how alloc_contig_range() is totally safe against memory offlining
though. From what I can see we have the following call path to set
MIGRATE_ISOLATE:

alloc_contig_range(pfn) -> start_isolate_page_range(pfn) ->
isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))

There's nothing in that call stack that prevent offlining of the page,
hence the struct page may be freed by this point. Am I missing something
else here?

try_to_compact_pages() has a similar issue which is what I noticed
reviewing this patch:

try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
isolate_migratepages() -> isolate_migratepages_block() ->
PageHuge(pfn_to_page(pfn))

Again I couldn't see anything in that path that would hold the page
stable long enough to safely perform the pfn_to_page() and get a
reference. And after a bit of fiddling I was able to trigger the below
oops running the compaction_test selftest whilst offlining memory so I
don't think it is safe:

Entering kdb (current=0xffff8882452fb6c0, pid 5646) on processor 1 Oops: (null)
due to oops @ 0xffffffff81af6486
CPU: 1 PID: 5646 Comm: compaction_test Not tainted 6.0.0-01424-gf3ec7e734795 #110
Hardware name: Gigabyte Technology Co., Ltd. B150M-D3H/B150M-D3H-CF, BIOS F24 04/11/2018
RIP: 0010:PageHuge+0xa6/0x180
Code: 00 0f 85 d0 00 00 00 48 8b 43 08 a8 01 0f 85 97 00 00 00 66 90 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 02 7e 7f 31 c0 80 7b 50 02 0f 94 c0 5b 41 5c
RSP: 0000:ffffc9001252efa8 EFLAGS: 00010207
RAX: dffffc0000000000 RBX: fffffffffffffffe RCX: ffffffff81af63f9
RDX: 0000000000000009 RSI: 0000000000000008 RDI: 000000000000004e
RBP: ffffc9001252efb8 R08: 0000000000000000 R09: ffffea000f690007
R10: fffff94001ed2000 R11: 0000000000000001 R12: ffffea000f690008
R13: 0000000000000ab3 R14: ffffea000f690000 R15: 00000000003da400
FS:  00007f83e08b7740(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb6e1cbb3e0 CR3: 0000000344044003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 isolate_migratepages_block+0x43c/0x3dc0
 ? reclaim_throttle+0x7a0/0x7a0
 ? __reset_isolation_suitable+0x350/0x350
 compact_zone+0xb73/0x31f0
 ? compaction_suitable+0x1e0/0x1e0
 compact_zone_order+0x127/0x240
 ? compact_zone+0x31f0/0x31f0
 ? __this_cpu_preempt_check+0x13/0x20
 ? cpuusage_write+0x380/0x380
 ? __kasan_check_read+0x11/0x20
 try_to_compact_pages+0x23b/0x770
 ? psi_task_change+0x201/0x300
 __alloc_pages_direct_compact+0x15d/0x650
 ? get_page_from_freelist+0x3fa0/0x3fa0
 ? psi_task_change+0x201/0x300
 ? _raw_spin_unlock+0x19/0x40
 __alloc_pages_slowpath.constprop.0+0x9e1/0x2260
 ? warn_alloc+0x1a0/0x1a0
 ? __zone_watermark_ok+0x430/0x430
 ? prepare_alloc_pages+0x40b/0x620
 __alloc_pages+0x42f/0x540
 ? __alloc_pages_slowpath.constprop.0+0x2260/0x2260
 alloc_buddy_huge_page.isra.0+0x7c/0x130
 alloc_fresh_huge_page+0x1f1/0x4e0
 alloc_pool_huge_page+0xab/0x2d0
 __nr_hugepages_store_common+0x37a/0xed0
 ? return_unused_surplus_pages+0x330/0x330
 ? __kasan_check_write+0x14/0x20
 ? _raw_spin_lock_irqsave+0x99/0x100
 ? proc_doulongvec_minmax+0x54/0x80
 hugetlb_sysctl_handler_common+0x247/0x320
 ? nr_hugepages_store+0xf0/0xf0
 ? alloc_huge_page+0xbf0/0xbf0
 hugetlb_sysctl_handler+0x20/0x30
 proc_sys_call_handler+0x451/0x650
 ? unregister_sysctl_table+0x1c0/0x1c0
 ? apparmor_file_permission+0x124/0x280
 ? security_file_permission+0x72/0x90
 proc_sys_write+0x13/0x20
 vfs_write+0x7ca/0xd80
 ? kernel_write+0x4d0/0x4d0
 ? do_sys_openat2+0x114/0x450
 ? __kasan_check_write+0x14/0x20
 ? down_write+0xb4/0x130
 ksys_write+0x116/0x220
 ? __kasan_check_write+0x14/0x20
 ? __ia32_sys_read+0xb0/0xb0
 ? debug_smp_processor_id+0x17/0x20
 ? fpregs_assert_state_consistent+0x52/0xc0
 __x64_sys_write+0x73/0xb0
 ? syscall_exit_to_user_mode+0x26/0x50
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f83e09b2190
Code: 40 00 48 8b 15 71 9c 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d 51 24 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
RSP: 002b:00007ffe4c08e478 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ffe4c08e648 RCX: 00007f83e09b2190
RDX: 0000000000000006 RSI: 000055d7575611f8 RDI: 0000000000000003
RBP: 00007ffe4c08e4c0 R08: 00007f83e0a8cc60 R09: 0000000000000000
R10: 00007f83e08d40b8 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffe4c08e658 R14: 000055d757562df0 R15: 00007f83e0adc020
 </TASK>

next prev parent reply	other threads:[~2022-11-25  1:43 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-23  0:57 [PATCH] mm: migrate: Fix THP's mapcount on isolation Gavin Shan
2022-11-23  4:26 ` Alistair Popple
2022-11-23  5:06   ` Gavin Shan
2022-11-23  5:14 ` Hugh Dickins
2022-11-23  8:56   ` David Hildenbrand
2022-11-23 16:07     ` Matthew Wilcox
2022-11-24  8:50       ` David Hildenbrand
2022-11-24  0:14     ` Gavin Shan
2022-11-24  8:46       ` David Hildenbrand
2022-11-24  9:44         ` Gavin Shan
2022-11-24  1:06     ` Alistair Popple
2022-11-24  3:33       ` Matthew Wilcox
2022-11-24  8:49         ` David Hildenbrand
2022-11-25  0:58           ` Alistair Popple [this message]
2022-11-25  8:54             ` David Hildenbrand
2022-12-01 22:35               ` Alistair Popple

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87o7svreq0.fsf@nvidia.com \
    --to=apopple@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=gshan@redhat.com \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@surriel.com \
    --cc=shan.gavin@gmail.com \
    --cc=william.kucharski@oracle.com \
    --cc=willy@infradead.org \
    --cc=zhenyzha@redhat.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.