Date: Tue, 9 Feb 2021 17:00:56 -0500
From: Peter Xu
To: Mike Kravetz
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Mike Rapoport,
	Jerome Glisse, "Kirill A . Shutemov", Hugh Dickins, Axel Rasmussen,
	Matthew Wilcox, Andrew Morton, Andrea Arcangeli, Nadav Amit
Subject: Re: [PATCH RFC 00/30] userfaultfd-wp: Support shmem and hugetlbfs
Message-ID: <20210209220056.GD103365@xz-x1>
References: <20210115170907.24498-1-peterx@redhat.com>
	<20210129224938.GC260413@xz-x1>
	<339d27e8-9f34-3e80-2910-46f46d58e9a6@oracle.com>
	<20210206023627.GD3195@xz-x1>
	<201f2636-1193-2cc1-ccee-a91243f14666@oracle.com>
In-Reply-To: <201f2636-1193-2cc1-ccee-a91243f14666@oracle.com>

On Tue, Feb 09, 2021 at 11:29:56AM -0800, Mike Kravetz wrote:
> On 2/5/21 6:36 PM, Peter Xu wrote:
> > On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> >> On 1/29/21 2:49 PM, Peter Xu wrote:
> >>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> >> ...
> >>> Hugh & Mike,
> >>>
> >>> Would any of you have comments/concerns on the high-level design of this
> >>> series?
> >>>
> >>> It would be great to know them, especially any major objection, before
> >>> moving on to a non-RFC version.
> >>
> >> My apologies for not looking at this sooner.  Even now, I have only taken
> >> a very brief look at the hugetlbfs patches.
> >>
> >> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
> >> hugetlbfs.  As you can imagine, there is some overlap in handling of wp ptes
> >> set for soft dirty.  In addition, pmd sharing must be disabled for soft dirty
> >> as here and in Axel's uffd minor fault code.
> >
> > Interesting to know that we'll reach and need something common from different
> > directions, especially when they all mostly happen at the same time. :)
> >
> > Is there a real "BUG" that you mentioned?  I'd be glad to read about it if
> > there is a link or something.
> >
>
> Sorry, I was referring to a bugzilla bug not a BUG().  Bottom line is that
> hugetlb was mostly overlooked when soft dirty support was added.  A thread
> mostly from me is at:
> lore.kernel.org/r/999775bf-4204-2bec-7c3d-72d81b4fce30@oracle.com
> I am close to sending out a RFC, but keep getting distracted.

Thanks.  Indeed, I see no reason not to support hugetlb for soft dirty.
Tracking 1G huge pages could be too coarse and heavy, but 2M at least still
seems reasonable.

>
> >> No objections to the overall approach based on my quick look.
> >
> > Thanks for having a look.
> >
> > So for hugetlb one major thing is indeed the pmd sharing part, on which we
> > seem to have very good consensus.
>
> Yes
>
> > The other thing that I'd love to get some comments on is a topic shared with
> > shmem: for a file-backed memory type, uffd-wp needs a consolidated way
> > to record wr-protect information even if the pgtable entries were flushed.
> > That comes from a fundamental difference between anonymous and file-backed
> > memory: anonymous pages keep all info in the pgtable entry, but file-backed
> > memory does not, e.g., pgtable entries can be dropped at any time as long as
> > the page cache is there.
>
> Sorry, but I can not recall this difference for hugetlb pages.  What operations
> lead to flushing of pagetable entries?  It would need to be something other
> than unmap as it seems we want to lose the information in unmap IIUC.

For hugetlbfs I know of two cases.  One is exactly the huge pmd sharing
mentioned above, where we'll drop the pgtable entries for a specific process
but the page cache will still exist.

The other one is hugetlbfs_punch_hole(), where hugetlb_vmdelete_list() is
called before remove_inode_hugepages().  For uffd-wp, there will be a very
small window, after the pgtable entry is flushed but before the page is finally
dropped in remove_inode_hugepages(), in which a wr-protected huge page can be
written.  In some apps that could cause data loss.

>
> > I then went to look at how soft-dirty handles this issue, and there's
> > actually a paragraph about it:
> >
> >         While in most cases tracking memory changes by #PF-s is more than enough
> >         there is still a scenario when we can lose soft dirty bits -- a task
> >         unmaps a previously mapped memory region and then maps a new one at
> >         exactly the same place. When unmap is called, the kernel internally
> >         clears PTE values including soft dirty bits. To notify user space
> >         application about such memory region renewal the kernel always marks
> >         new memory regions (and expanded regions) as soft dirty.
> >
> > I feel like it just means soft-dirty currently allows false positives: we could
> > have set the soft dirty bit even if the page is clean.  And that's what this
> > series wanted to avoid: it used the new concept called "swap special pte" to
> > persist that information even for file-backed memory.  That is all for
> > avoiding those false positives.
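
For readers who want to observe the soft-dirty state discussed above from user
space, here is a minimal sketch that decodes one 64-bit /proc/<pid>/pagemap
entry.  The bit positions follow Documentation/admin-guide/mm/pagemap.rst; the
uffd-wp bit (57) is only reported by kernels newer than this thread.  The
helper names decode_pagemap_entry() and read_pagemap_entry() are made up for
illustration and are not part of this series.

```python
import os
import struct

# Bit layout of a pagemap entry (Documentation/admin-guide/mm/pagemap.rst).
PM_PFN_MASK = (1 << 55) - 1   # bits 0-54: PFN, meaningful only if present
PM_SOFT_DIRTY = 1 << 55       # pte is soft-dirty
PM_UFFD_WP = 1 << 57          # pte is uffd-wp write-protected (later kernels)
PM_FILE = 1 << 61             # file-backed page or shared anonymous page
PM_SWAP = 1 << 62             # page is swapped out
PM_PRESENT = 1 << 63          # page is present in RAM


def decode_pagemap_entry(raw: bytes) -> dict:
    """Decode one 8-byte pagemap entry (native byte order) into its flags."""
    (value,) = struct.unpack("=Q", raw)
    present = bool(value & PM_PRESENT)
    return {
        "present": present,
        "swapped": bool(value & PM_SWAP),
        "file_or_shared": bool(value & PM_FILE),
        "soft_dirty": bool(value & PM_SOFT_DIRTY),
        "uffd_wp": bool(value & PM_UFFD_WP),
        # The PFN field reads as 0 without CAP_SYS_ADMIN.
        "pfn": (value & PM_PFN_MASK) if present else None,
    }


def read_pagemap_entry(vaddr: int, pid: str = "self") -> bytes:
    """Hypothetical helper: fetch the raw entry covering virtual address vaddr."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)  # one 8-byte entry per base page
        return f.read(8)
```

Reading the entry for a freshly created mapping and finding soft_dirty set
before any write is the false positive described above.
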
>
> Yes, I have seen this with soft dirty.  It really does not seem right.  When
> you first create a mapping, even before faulting in anything the vma is marked
> VM_SOFTDIRTY and from the user's perspective all addresses/pages appear dirty.

Right, that does not seem optimal.  It is understandable, since dirty info is
indeed tolerant of false positives, so soft-dirty could sidestep the issue that
uffd-wp tries to solve in this series.  It would be great to know whether the
current approach in this series would also work for removing those false
positives.

>
> To be honest, I am not sure you want to try and carry per-process/per-mapping
> wp information in the file.

What this series tries to do is persist that information in pgtable entries,
rather than in the file (or page cache).  Frankly I can't say whether that's
optimal either, so I'm always open to any comment.  So far I think it's a valid
solution, but it is always possible that I missed something important.

> In the comment about soft dirty above, it seems
> reasonable that unmapping would clear all soft dirty information.  Also,
> unmapping would clear any uffd state/information.

Right, unmap should always mean "dropping all information in the ptes".  It is
in the patch below that we tried to treat it differently:

https://github.com/xzpeter/linux/commit/e958e9ee8d33e9a6602f93cdbe24a0c3614ab5e2

A quick summary of the above patch: only if we unmap or truncate the hugetlbfs
file would we call hugetlb_vmdelete_list() with ZAP_FLAG_DROP_FILE_UFFD_WP
(which means we'll drop all the information, including the uffd-wp bit).

Thanks,

-- 
Peter Xu