Date: Thu, 11 Jul 2024 12:39:52 +0100
From: Catalin Marinas
To: Yu Zhao
Cc: Nanyong Sun, will@kernel.org, mike.kravetz@oracle.com,
	muchun.song@linux.dev, akpm@linux-foundation.org,
	anshuman.khandual@arm.com, willy@infradead.org,
	wangkefeng.wang@huawei.com, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

On Thu, Jul 11, 2024 at 02:31:25AM -0600, Yu Zhao wrote:
> On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao wrote:
> > On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas wrote:
> > > The Arm ARM states that we need a BBM if we change the output
> > > address and: the old or new mappings are RW *or* the content of
> > > the page changes. Ignoring the latter (page content), we can turn
> > > the PTEs RO first without changing the pfn followed by changing
> > > the pfn while they are RO. Once that's done, we make entry 0 RW
> > > and, of course, with additional TLBIs between all these steps.
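
To make the sequence above concrete, here is a rough sketch of the
per-hugetlb-page remap (illustrative pseudocode only, not the actual
hugetlb_vmemmap code: the function name and the 'head_pfn' parameter
are made up, PTE 0 is assumed to keep its original pfn, and the
page-content caveat discussed further down is ignored):

	/*
	 * Remap the 8 vmemmap PTEs of one 2MB hugetlb page so that
	 * entries 1-7 alias the page behind entry 0. Assumes the PMD
	 * has already been split into a table of RW PTEs.
	 */
	static void vmemmap_remap_sketch(pte_t *ptep, unsigned long addr,
					 unsigned long head_pfn)
	{
		unsigned long i;

		/* Turn entries 0-7 RO; pfn unchanged, so no translation change. */
		for (i = 0; i < 8; i++)
			set_pte_at(&init_mm, addr + i * PAGE_SIZE, ptep + i,
				   pte_wrprotect(ptep_get(ptep + i)));
		flush_tlb_kernel_range(addr, addr + 8 * PAGE_SIZE);

		/* Change the pfn of entries 1-7 while old and new are both RO. */
		for (i = 1; i < 8; i++)
			set_pte_at(&init_mm, addr + i * PAGE_SIZE, ptep + i,
				   pfn_pte(head_pfn, PAGE_KERNEL_RO));
		flush_tlb_kernel_range(addr + PAGE_SIZE, addr + 8 * PAGE_SIZE);

		/* Finally make entry 0 RW again; its pfn never changed. */
		set_pte_at(&init_mm, addr, ptep, pfn_pte(head_pfn, PAGE_KERNEL));
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	}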
> >
> > Aha! This is easy to do -- I just made the RO guaranteed, as I
> > mentioned earlier.
> >
> > Just to make sure I fully understand the workflow:
> >
> > 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB
> >    `struct page` area.

I don't think we can turn all of them RO here since some of those 512
PTEs are not related to the hugetlb page. So you'd need to keep them RW
while preserving the pfn so that there's no actual translation change.
I think that's covered by FEAT_BBM level 2. Basically this step should
only be about breaking up a PMD block entry into a table entry.

> > 2. TLBI once, after pmd_populate_kernel()
> > 3. Remap PTEs 1-7 to the 4KB `struct page` area of PTE 0, for every
> >    8 PTEs, while they remain RO.

You may need some intermediate step to turn these PTEs read-only since
step 1 should leave them RW. Also, if we want to free an order-3 page
here, it might be better to allocate an order-0 page even for PTE entry
0 (I had the impression that's what the core code does, I haven't
checked).

> > 4. TLBI once, after set_pte_at() on PTEs 1-7.
> > 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct
> >    page` area.
> > 6. TLBI once, after set_pte_at() on PTE 0.
> >
> > No BBM required, regardless of FEAT_BBM level 2.
>
> I just studied D8.16.1 from the reference manual, and it seems to me:
> 1. We still need either FEAT_BBM or BBM to split the PMD.

Yes.

> 2. We still need BBM when we change PTEs 1-7, because even if they
> remain RO, the content of the `struct page` page at the new location
> does not match that at the old location.

Yes, in theory, the data at the new pfn should be the same. We could
try to get clarification from the architects on what could go wrong,
but I suspect some atomicity is not guaranteed if you read the data
(the CPU getting confused about whether to read from the old or the
new page).

Otherwise, since after all these steps PTEs 1-7 point to the same data
as PTE 0, before step 3 we could copy the data in page 0 over to the
other 7 pages while entries 1-7 are still RO. The remapping afterwards
would be fully compliant.

> > > Can we leave entry 0 RO? This would save an additional TLBI.
> >
> > Unfortunately we can't. Otherwise we wouldn't be able to, e.g.,
> > grab a refcnt on any hugeTLB pages.

OK, fair enough.

> > > Now, I wonder if all this is worth it. What are the scenarios
> > > where the 8 PTEs will be accessed? The vmemmap range corresponding
> > > to a 2MB hugetlb page for example is pretty well defined - 8 x 4K
> > > pages, aligned.
>
> One of the fundamental assumptions in core MM is that anyone can read
> or try to grab (write) a refcnt from any `struct page`. Those
> speculative PFN walkers include memory compaction, etc.

But how does this work if PTEs 1-7 are RO? Do those walkers detect that
it's a tail page and skip it? Actually, if they all point to the same
vmemmap page, how can one distinguish a tail page via PTE 1 from the
head page via PTE 0?

BTW, I'll be on holiday from tomorrow for two weeks and won't be able
to follow up on this thread (and am likely to forget all the discussion
by the time I get back ;)).

-- 
Catalin