Message-ID: <6fa6b7aa-731e-891c-3efb-a03d6a700efa@redhat.com>
Date: Tue, 19 Jul 2022 17:19:34 +0200
From: David Hildenbrand
Organization: Red Hat
To: Michal Hocko, Charan Teja Kalla
Cc: akpm@linux-foundation.org, pasha.tatashin@soleen.com, sjpark@amazon.de,
    sieberf@amazon.com, shakeelb@google.com, dhowells@redhat.com,
    willy@infradead.org, vbabka@suse.cz, minchan@kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, "iamjoonsoo.kim@lge.com"
References: <1657810063-28938-1-git-send-email-quic_charante@quicinc.com>
Subject: Re: [PATCH] mm: fix use-after free of page_ext after race with memory-offline

On 18.07.22 16:54, Michal Hocko wrote:
> On Mon 18-07-22 19:28:13, Charan Teja Kalla wrote:
>> Thanks Michal for the comments!!
>>
>> On 7/18/2022 5:20 PM, Michal Hocko wrote:
>>>> The above mentioned race is just one example __but the problem persists
>>>> in the other paths too involving page_ext->flags access (e.g.
>>>> page_is_idle())__. Since offline waits till the last reference on the
>>>> page goes down, i.e. any path that took the refcount on the page can
>>>> make the memory offline operation wait. E.g.: in the migrate_pages()
>>>> operation, we do take the extra refcount on the pages that are under
>>>> migration and then we do copy page_owner by accessing page_ext. For
>>>>
>>>> Fix those paths where offline races with page_ext access by maintaining
>>>> synchronization with rcu lock.
>>> Please be much more specific about the synchronization. How does RCU
>>> actually synchronize the offlining and the access? A higher-level
>>> description of all the actors would be very helpful, not only for the
>>> review but also for future readers.
>>
>> I will improve the commit message about this synchronization change
>> using RCU's.
>
> Thanks! The most important part is how the exclusion is actually achieved,
> because that is not really clear at first sight:
>
> CPU1                              CPU2
> lookup_page_ext(PageA)            offlining
>                                     offline_page_ext
>                                       __free_page_ext(addrA)
>                                         get_entry(addrA)
>                                         ms->page_ext = NULL
>                                         synchronize_rcu()
>                                         free_page_ext
>                                           free_pages_exact (now addrA is unusable)
>
>   rcu_read_lock()
>   entryA = get_entry(addrA)
>     base + page_ext_size * index  # an address not invalidated by the freeing path
>   do_something(entryA)
>   rcu_read_unlock()
>
> CPU1 never checks ms->page_ext, so it cannot bail out early when the
> thing is torn down. Or maybe I am missing something. I am not familiar
> with page_ext much.
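
For illustration of the exclusion that would have to hold on the lookup
side, here is a rough, untested sketch. It is not the code from the patch:
the page_ext_lookup_sketch() name, the READ_ONCE() standing in for an
rcu_dereference() on an __rcu-annotated pointer, and the exact indexing are
assumptions for illustration only. The point is that the ms->page_ext load
and the NULL check both happen inside the RCU read-side critical section,
which is exactly the detail questioned above:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/page_ext.h>
#include <linux/rcupdate.h>

/*
 * Hypothetical reader-side pattern, not the patch under review: a reader
 * either sees the old array (kept alive until the offlining side's
 * synchronize_rcu() returns) or observes NULL and bails out.
 */
static struct page_ext *page_ext_lookup_sketch(const struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	struct mem_section *ms = __pfn_to_section(pfn);
	void *base;

	rcu_read_lock();
	base = READ_ONCE(ms->page_ext);	/* cleared by __free_page_ext() */
	if (!base) {
		rcu_read_unlock();
		return NULL;	/* section offlined or page_ext never allocated */
	}
	/* same offsetting as get_entry(ms->page_ext, pfn) in mm/page_ext.c */
	return (struct page_ext *)(base + page_ext_size * pfn);
	/* success: the caller keeps the RCU read lock until done with the entry */
}

A caller would have to pair the successful return with rcu_read_unlock()
only once it is done touching the entry; otherwise the grace period on the
offlining side does not help.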
>>> Also, more specifically
>>> [...]
>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>> index 3dc715d..5ccd3ee 100644
>>>> --- a/mm/page_ext.c
>>>> +++ b/mm/page_ext.c
>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>>>>  	if (!ms || !ms->page_ext)
>>>>  		return;
>>>>  	base = get_entry(ms->page_ext, pfn);
>>>> -	free_page_ext(base);
>>>>  	ms->page_ext = NULL;
>>>> +	synchronize_rcu();
>>>> +	free_page_ext(base);
>>>>  }
>>> So you are imposing the RCU grace period for each page_ext! This can get
>>> really expensive. Have you tried to measure the effect?
>
> I was wrong here! This is for each memory section, which is not as
> terrible as every single page_ext. It can still be quite a lot of memory
> sections in a single memory block (e.g. on ppc memory sections are
> ridiculously small).
>
>> I didn't really measure the effect. Let me measure it and post these in V2.
>
> I think it would be much more optimal to split the operation into 2
> phases: invalidate all the page_ext metadata, then synchronize_rcu(), and
> only then free them all. I am not very familiar with page_ext so I am
> not sure this is easy to be done. Maybe page_ext = NULL can be done in
> the first stage.
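
For illustration, the two-phase idea could look roughly like the untested
sketch below. offline_page_ext_two_phase() and the stale[] scratch buffer
are hypothetical, the range is assumed to be section-aligned, and
get_entry()/free_page_ext() refer to the existing static helpers in
mm/page_ext.c. The point is just that every section is unpublished first,
so a single synchronize_rcu() covers the whole range instead of one grace
period per section:

#include <linux/kernel.h>
#include <linux/mmzone.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/*
 * Hypothetical two-phase teardown: phase 1 unpublishes every section's
 * page_ext so new RCU readers see NULL, a single grace period lets
 * existing readers finish, and only then does phase 2 free the arrays.
 */
static int offline_page_ext_two_phase(unsigned long start_pfn,
				      unsigned long nr_pages)
{
	unsigned long nr_secs = DIV_ROUND_UP(nr_pages, PAGES_PER_SECTION);
	unsigned long i, pfn;
	void **stale;

	stale = kcalloc(nr_secs, sizeof(*stale), GFP_KERNEL);
	if (!stale)
		return -ENOMEM;

	/* Phase 1: hide the metadata from new readers, remember the bases. */
	for (i = 0, pfn = start_pfn; i < nr_secs; i++, pfn += PAGES_PER_SECTION) {
		struct mem_section *ms = __pfn_to_section(pfn);

		if (!ms || !ms->page_ext)
			continue;
		stale[i] = get_entry(ms->page_ext, pfn); /* base of this section's array */
		WRITE_ONCE(ms->page_ext, NULL);
	}

	/* One grace period for the whole range, not one per section. */
	synchronize_rcu();

	/* Phase 2: no reader can still hold a pointer into these arrays. */
	for (i = 0; i < nr_secs; i++) {
		if (stale[i])
			free_page_ext(stale[i]);
	}

	kfree(stale);
	return 0;
}

Whether remembering the stale bases in a scratch buffer (or, alternatively,
marking ms->page_ext invalid rather than NULLing it in phase 1) is
acceptable is exactly the kind of detail that would need sorting out in V2.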
>>> Is there any reason why page_ext is freed during offlining rather than
>>> when it is hotremoved?
>>
>> This is something I am struggling to get an answer to. IMO, this is even
>> a wrong design, where I don't have the page_ext but do have the page.
>> Moving the freeing of page_ext to the hotremove path actually solves the
>> problem, but somehow this idea wasn't liked [1]. Copying the excerpt here:
>
> yes, it certainly adds subtlety to the page_ext thingy. I do agree that
> even the situation around struct page is not all that great wrt
> synchronization. We have pfn_to_online_page(), which even when racy doesn't
> give you garbage, because hotremove happens very rarely, or so long after
> offlining that the race window is essentially impractically long for any
> potential damage. We would have to change a lot to make it work
> "properly". I am not optimistic this is actually feasible.
>
>>> 3) Change the design where the page_ext is valid as long as the struct
>>> page is alive.
>>
>> :/ Doesn't spark joy."
>
> I would be wondering why. It should only take moving the callback to
> happen at hotremove, so it shouldn't be a very involved change. I can
> imagine somebody relying on releasing resources when offlining memory,
> but is that really the case?

Various reasons:

1) There was a discussion in the past to eventually also use RCU
protection for handling pfn_to_online_page(). So doing it cleanly here is
certainly an improvement.

2) I really dislike having to scatter section-online checks all over the
page_ext code. Once there is a difference between active vs. stale
page_ext data, things get a bit messy and error prone. This is already
ugly enough in our generic memmap handling code IMHO.

3) Having on-demand allocations, such as KASAN or page_ext, done from the
memory online notifier is at least currently cleaner, because we don't
have to handle each and every subsystem that hooks into that during the
core memory hotadd/remove phase, which primarily only sets up the
vmemmap, the direct map and the memory block devices.

Personally, I think what we have in this patch is quite nice and clean.
But I won't object if it can be done in a similarly clean way from the
hot(un)plug code.

That is, I ack this patch but don't object to similarly clean approaches.

-- 
Thanks,

David / dhildenb