Message-ID: <2541344f-61d8-96d1-10e9-ba7e1743a299@redhat.com>
Date: Fri, 25 Nov 2022 09:54:12 +0100
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Alistair Popple
Cc: Matthew Wilcox, Hugh Dickins, Gavin Shan, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
 william.kucharski@oracle.com, ziy@nvidia.com,
 kirill.shutemov@linux.intel.com, zhenyzha@redhat.com,
 shan.gavin@gmail.com, riel@surriel.com
Subject: Re: [PATCH] mm: migrate: Fix THP's mapcount on isolation
In-Reply-To: <87o7svreq0.fsf@nvidia.com>
References: <20221123005752.161003-1-gshan@redhat.com>
 <871qptrvsw.fsf@nvidia.com>
 <51ffd399-7fa3-b2f2-b6e5-61a8b609e350@redhat.com>
 <87o7svreq0.fsf@nvidia.com>
>>>> I think you're right. Without a page reference I don't think it is even
>>>> safe to look at struct page, at least not without synchronisation
>>>> against memory hot unplug which could remove the struct page. From a
>>>> quick glance I didn't see anything here that obviously did that though.
>>>
>>> Memory hotplug is the offending party here. It has to make sure that
>>> everything else is definitely quiescent before removing the struct
>>> pages. Otherwise you can't even try_get a refcount.
>
> Sure, I might be missing something but how can memory hotplug possibly
> synchronise against some process (eg. try_to_compact_pages) doing
> try_get(pfn_to_page(random_pfn)) without that process either informing
> memory hotplug that it needs pages to remain valid long enough to get a
> reference or detecting that memory hotplug is in the process of
> offlining pages?
It currently doesn't, and so far this has mostly been a known theoretical
problem. We've been ignoring these kinds of problems because they are not
easy to sort out and have never popped up in practice.

First, the correct approach is to use pfn_to_online_page() instead of
pfn_to_page() when in doubt whether the page might already be offline.
While we're already using pfn_to_online_page() in quite a few PFN walkers,
chances are good that we're still missing some.

Second, essentially anybody (any PFN walker) doing a pfn_to_online_page()
is prone to races with memory offlining. I've discussed this in the past
with Michal and Dan, and one idea was to use RCU to synchronize PFN
walkers and pfn_to_online_page(), essentially synchronizing the clearing
of the SECTION_IS_ONLINE flag. Nobody has worked on that so far because
we've never had actual BUG reports. These kinds of races are rare, but
they are theoretically possible.

>> At least alloc_contig_range() and memory offlining are mutually
>> exclusive due to MIGRATE_ISOLATE. I recall that ordinary memory
>> compaction similarly deals with isolated pageblocks (or some other
>> mechanism I forgot) to not race with memory offlining. Wouldn't worry
>> about that for now.
>
> Sorry, agree - to be clear this discussion isn't relevant for this patch
> but reviewing it got me thinking about how this works. I still don't see
> how alloc_contig_range() is totally safe against memory offlining
> though. From what I can see we have the following call path to set
> MIGRATE_ISOLATE:
>
> alloc_contig_range(pfn) -> start_isolate_page_range(pfn) ->
> isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))
>
> There's nothing in that call stack that prevents offlining of the page,
> hence the struct page may be freed by this point. Am I missing something
> else here?

Good point. I even had a patch at some point that converted some
pfn_to_online_page() calls in there, but discarded it.
IIRC, two alloc_contig_range() users are safe because:

1) virtio-mem synchronizes against memory offlining using the memory
   notifier. While memory offlining is in progress, it won't call
   alloc_contig_range().

2) CMA regions will always fail to offline because they have MIGRATE_CMA
   set.

What remains is alloc_contig_pages(), used, for example, for memtrace on
ppc, for kfence, and for allocation of gigantic pages. That might indeed
be racy. At least from kfence_init_late() it's unlikely (impossible?)
that we'll see concurrent memory offlining.

>
> try_to_compact_pages() has a similar issue which is what I noticed
> reviewing this patch:

Yes, I can spot pfn_to_online_page() in __reset_isolation_pfn(), but we
are of course likely missing proper synchronization and checks later. I
wonder if we could use the memory notifier to temporarily pause any
running compaction and restart it once memory offlining succeeded.

>
> try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
> isolate_migratepages() -> isolate_migratepages_block() ->
> PageHuge(pfn_to_page(pfn))
>
> Again I couldn't see anything in that path that would hold the page
> stable long enough to safely perform the pfn_to_page() and get a
> reference. And after a bit of fiddling I was able to trigger the below
> oops running the compaction_test selftest whilst offlining memory so I
> don't think it is safe:

Thanks for finding a reproducer. This is exactly the kind of BUG that we
speculated about in the past but that we haven't seen in practice yet.

That said, I'd be very happy if someone had the time to work on proper
synchronization, and I'd be happy to help brainstorming/reviewing :)

-- 
Thanks,

David / dhildenb