From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 9 Feb 2023 14:10:53 -0500
From: Peter Xu
To: James Houghton
Cc: Mike Kravetz, David Hildenbrand, Muchun Song, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, "Dr. David Alan Gilbert",
	"Matthew Wilcox (Oracle)", Vlastimil Babka, Baolin Wang,
	Miaohe Lin, Yang Shi, Andrew Morton, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote:
> On Wed, Feb 8, 2023 at 8:16 AM Peter Xu wrote:
> >
> > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu wrote:
> > > >
> > > > James,
> > > >
> > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > > Here is the result: [1] (sorry it took a little while heh). The
> > > >
> > > > Thanks. From what I can tell, that number shows that it'll be great we
> > > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > > Matthew for generic folio.
> > >
> > > Do you think the RFC v1 way is better than doing the THP-like way
> > > *with the additional MMU notifier*?
> >
> > What's the additional MMU notifier you're referring?
>
> An MMU notifier that informs KVM that a collapse has happened without
> having to invalidate_range_start() and invalidate_range_end(), the one
> you're replying to lower down in the email. :) [ see below... ]

Isn't that something that is needed no matter what mapcount approach we'll
go for? Did I miss something?

>
> >
> > >
> > > >
> > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > > >
> > > > Any more information on why it's horrible? :)
> > >
> > > I figured the code would speak for itself, heh. It's quite complicated.
> > >
> > > I really didn't like:
> > > 1. The 'inc' business in copy_hugetlb_page_range.
> > > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > > mapcount synced up.
> > > 3. Having to check the page cache in UFFDIO_CONTINUE.
> >
> > I think the complexity is one thing which I'm fine with so far. However
> > when I think again about the things behind that complexity, I noticed there
> > may be at least one flaw that may not be trivial to work around.
> >
> > It's about truncation. The problem is now we use the pgtable entry to
> > represent the mapcount, but the pgtable entry cannot be zapped easily,
> > unless vma unmapped or collapsed.
> >
> > It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> > course, with page lock held). The mappings will be removed for real, but
> > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> > the pgtable leaves, not the ones that we used to account for mapcounts.
> >
> > So the kernel may see weird things, like mapcount>0 after
> > truncate_inode_folio() being finished completely.
> >
> > For HGM to do the right thing, we may want to also remove the non-leaf
> > entries when truncating or doing similar things like a rmap walk to drop
> > any mappings for a page/folio. Though that's not doable for now because
> > the locks that truncate_inode_folio() is weaker than what we need to free
> > the pgtable non-leaf entries - we'll need mmap write lock for that, the
> > same as when we unmap or collapse.
> >
> > Matthew's design doesn't have such issue if the ptes need to be populated,
> > because mapcount is still with the leaves; not the case for us here.
> >
> > If that's the case, _maybe_ we still need to start with the stupid but
> > working approach of subpage mapcounts.
>
> Good point. I can't immediately think of a solution.
> I would prefer to go with the subpage mapcount approach to simplify
> HGM for now; optimizing mapcount for HugeTLB can then be handled
> separately. If you're ok with this, I'll go ahead and send v2.

I'm okay with it, but I suggest waiting at least another day or two to see
whether Mike or others have any comments.

>
> One way that might be possible: using the PAGE_SPECIAL bit on the
> hstate-level PTE to indicate if mapcount has been incremented or not
> (if the PTE is pointing to page tables). As far as I can tell,
> PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
> need to be careful with existing PTE examination code as to not
> misinterpret these PTEs.

This is an interesting idea. :) Yes, I don't see it being used at all in any
pgtable non-leaves. Then it's about how to let the zap code know when to
remove the special bit, and hence the mapcount, because not every zap should
do so. Maybe it can be passed over as a new zap_flags_t bit?

Thanks,

--
Peter Xu
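
To make the last idea concrete, here is a minimal userspace sketch in C. It
is not code from the series or from the kernel. It assumes a software marker
bit on the non-leaf (hstate-level) entry, standing in for the PAGE_SPECIAL
suggestion, and a zap flag that tells the zap path whether to clear the
marker and drop the mapcount. Every name in it (toy_pte, toy_folio,
TOY_PTE_COUNTED, TOY_ZAP_DROP_MAPCOUNT, toy_map_leaf, toy_zap_leaves) is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits on a toy non-leaf entry; not kernel definitions. */
#define TOY_PTE_PRESENT (1u << 0)
#define TOY_PTE_TABLE   (1u << 1) /* entry points to a lower-level page table */
#define TOY_PTE_COUNTED (1u << 2) /* stand-in for the "special" marker bit */

/* Hypothetical zap flag, modelled loosely on the zap_flags_t suggestion. */
typedef unsigned int toy_zap_flags_t;
#define TOY_ZAP_DROP_MAPCOUNT (1u << 0)

struct toy_folio {
	int mapcount;
};

struct toy_pte {
	uint32_t flags;
	int nr_leaves; /* leaf mappings currently installed beneath this entry */
};

/* Install one leaf; take the mapcount only once per non-leaf entry. */
static void toy_map_leaf(struct toy_folio *folio, struct toy_pte *pte)
{
	pte->flags |= TOY_PTE_PRESENT | TOY_PTE_TABLE;
	pte->nr_leaves++;
	if (!(pte->flags & TOY_PTE_COUNTED)) {
		pte->flags |= TOY_PTE_COUNTED;
		folio->mapcount++;
	}
}

/*
 * Zap the leaves beneath a non-leaf entry. The page table itself stays in
 * place, mirroring the point that freeing it needs a stronger lock. Only a
 * caller that passes TOY_ZAP_DROP_MAPCOUNT clears the marker and drops the
 * mapcount.
 */
static void toy_zap_leaves(struct toy_folio *folio, struct toy_pte *pte,
			   toy_zap_flags_t zap_flags)
{
	pte->nr_leaves = 0;
	if ((zap_flags & TOY_ZAP_DROP_MAPCOUNT) &&
	    (pte->flags & TOY_PTE_COUNTED)) {
		pte->flags &= ~TOY_PTE_COUNTED;
		folio->mapcount--;
	}
}
```

In this toy model, a zap without the flag leaves mapcount at 1 even though no
leaves remain, which is the stale-mapcount situation described for
truncation; a zap with the flag brings it back to 0 without freeing the
non-leaf entry.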