Date: Fri, 21 Oct 2022 16:53:15 +0000
From: Sean Christopherson
To: Chao Peng
Cc: Vishal Annapurve, "Kirill A . Shutemov", "Gupta, Pankaj",
 Vlastimil Babka, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
 qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
 Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, x86@kernel.org, "H . Peter Anvin", Hugh Dickins,
 Jeff Layton, "J . Bruce Fields", Andrew Morton, Shuah Khan, Mike Rapoport,
 Steven Price, "Maciej S . Szmigiero", Yu Zhang, luto@kernel.org,
 jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
 david@redhat.com, aarcange@redhat.com, ddutile@redhat.com,
 dhildenb@redhat.com, Quentin Perret, Michael Roth, mhocko@suse.com,
 Muchun Song, wei.w.wang@intel.com
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
 <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
 <20221017161955.t4gditaztbwijgcn@box.shutemov.name>
 <20221017215640.hobzcz47es7dq2bi@box.shutemov.name>
 <20221019153225.njvg45glehlnjgc7@box.shutemov.name>
 <20221021135434.GB3607894@chaop.bj.intel.com>
In-Reply-To: <20221021135434.GB3607894@chaop.bj.intel.com>

On Fri, Oct 21, 2022, Chao Peng wrote:
> On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> > On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov wrote:
> > >
> > > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > > I think moving this notifier_invalidate before fallocate may not solve
> > > > the problem completely.
> > > > Is it possible that between invalidate and
> > > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > > hole punching here also update mem_attr first to say that KVM should
> > > > consider the corresponding gpa ranges to be no more backed by
> > > > inaccessible memfd?
> > >
> > > We rely on external synchronization to prevent this. See code around
> > > mmu_invalidate_retry_hva().
> > >
> > > --
> > > Kiryl Shutsemau / Kirill A. Shutemov
> >
> > IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> > ranges that are being invalidated are retried till invalidation is
> > complete. In this case, is it possible that KVM tries to serve the
> > page fault after inaccessible_notifier_invalidate is complete but
> > before fallocate could punch hole into the files?

It's not just the page fault edge case.  In the more straightforward scenario
where the memory is already mapped into the guest, freeing pages back to the
kernel before they are removed from the guest will lead to use-after-free.

> > e.g.
> > inaccessible_notifier_invalidate(...)
> > ... (system event preempting this control flow, giving a window for
> > the guest to retry accessing the gfn range which was invalidated)
> > fallocate(.., PUNCH_HOLE..)
>
> Looks this is something can happen.
> And sounds to me the solution needs
> just follow the mmu_notifier's way of using a invalidate_start/end pair.
>
>   invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
>                           zap KVM page table entries;
>   fallocate()
>   invalidate_end()  --> kvm->mmu_invalidate_in_progress--;
>
> Then during invalidate_start/end time window mmu_invalidate_retry_gfn
> checks 'mmu_invalidate_in_progress' and prevent repopulating the same
> page in KVM page table.

Yes, if it's not safe to invalidate after making the change (fallocate()), then
the change needs to be bookended by a start+end pair.
The mmu_notifier's unpaired invalidate() hook works because the primary MMU's
PTEs are zapped before invalidate(), while the underlying physical page is
freed only _after_ invalidate().

The only reason the unpaired invalidate() exists is that there are secondary
MMUs that reuse the primary MMU's page tables, e.g. shared virtual addressing.
In that case bookending doesn't work because the secondary MMU can't remove
PTEs; it can only flush its TLBs.

For this case, the whole point is to not create PTEs in the primary MMU, so
there should never be a use case that _needs_ an unpaired invalidate().

TL;DR: a start+end pair is likely the simplest solution.
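
To make the ordering concrete, here is a rough single-threaded userspace model
of the bookended sequence. It is only a sketch and not KVM code: struct
vm_model, handle_fault(), punch_hole() and run_scenario() are hypothetical
names made up for illustration, and the two counters merely mimic the
in-progress count and sequence number discussed above. In the real thing the
retry check would run under the MMU lock, which this model does not attempt
to represent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not KVM code: all names here are illustrative. */
enum fault_result { FAULT_MAPPED, FAULT_RETRY, FAULT_NO_BACKING };

struct vm_model {
	unsigned long invalidate_seq;	/* bumped when an invalidation ends */
	long invalidate_in_progress;	/* > 0 between start and end */
	bool backing_present;		/* is the file page still allocated? */
	bool guest_mapped;		/* does the guest page table map it? */
};

static void invalidate_start(struct vm_model *vm)
{
	vm->invalidate_in_progress++;
	vm->guest_mapped = false;	/* zap the guest mapping first */
}

/* Stand-in for fallocate(PUNCH_HOLE): frees the backing page. */
static void punch_hole(struct vm_model *vm)
{
	assert(!vm->guest_mapped);	/* freeing a mapped page would be a UAF */
	vm->backing_present = false;
}

static void invalidate_end(struct vm_model *vm)
{
	assert(vm->invalidate_in_progress > 0);
	vm->invalidate_seq++;
	vm->invalidate_in_progress--;
}

/* Fault path: refuse to install a mapping while an invalidation is open. */
static enum fault_result handle_fault(struct vm_model *vm)
{
	unsigned long seq = vm->invalidate_seq;	/* snapshot before lookup */
	bool page = vm->backing_present;	/* look up the backing page */

	/*
	 * The seq comparison cannot fail in this single-threaded model; it is
	 * kept only to mirror the shape of the retry check.
	 */
	if (vm->invalidate_in_progress || seq != vm->invalidate_seq)
		return FAULT_RETRY;
	if (!page)
		return FAULT_NO_BACKING;
	vm->guest_mapped = true;
	return FAULT_MAPPED;
}

struct scenario_result {
	enum fault_result in_window;	/* fault between start and end */
	enum fault_result after_end;	/* fault after the hole is punched */
};

/* A fault from "another vCPU" lands between the zap and the hole punch. */
static struct scenario_result run_scenario(void)
{
	struct vm_model vm = { .backing_present = true, .guest_mapped = true };
	struct scenario_result r;

	invalidate_start(&vm);
	r.in_window = handle_fault(&vm);
	punch_hole(&vm);
	invalidate_end(&vm);
	r.after_end = handle_fault(&vm);
	return r;
}
```

In this model the fault that lands inside the window is told to retry instead
of re-mapping the page that is about to be freed, and a fault after
invalidate_end() finds no backing page. With an unpaired invalidate before
the hole punch there would be nothing to reject the in-window fault.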