Date: Fri, 24 Jan 2025 13:08:11 +0530
Subject: Re: [RFC 00/11] khugepaged: mTHP support
From: Dev Jain
To: Nico Pache
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
 peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com,
 ziy@nvidia.com, jglisse@google.com, surenb@google.com,
 vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com,
 jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org,
 kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com,
 raquini@redhat.com, sunnanyong@huawei.com, usamaarif642@gmail.com,
 audra@redhat.com, akpm@linux-foundation.org
References: <20250108233128.14484-1-npache@redhat.com>
 <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
 <209ba507-fab8-4011-bc36-1cc38a303800@arm.com>
 <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>
In-Reply-To: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>

On 24/01/25 12:43 pm, Dev Jain wrote:
>
> On 24/01/25 1:54 am, Nico Pache wrote:
>> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain wrote:
>>>
>>> --- snip ---
>>>>>
>>>>> Although to be honest, it's not super clear to me what the benefit
>>>>> of the bitmap is vs just iterating through the PTEs like Dev does;
>>>>> is there a significant cost saving in practice? On the face of it,
>>>>> it seems like it might be unneeded complexity.
>>>> The bitmap was to encode the state of the PMD without needing
>>>> rescanning (or refactoring a lot of code). We keep the scan runtime
>>>> constant at 512 (for x86).
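[For concreteness, the bitmap scheme as I understand it can be modelled
in user space roughly as below. This is only an illustrative sketch with
hypothetical names, not the actual patch code: one read-locked pass over
the 512 ptes records which entries are filled, so later passes consult
the bitmap instead of re-walking the page table.]

```c
#include <assert.h>
#include <string.h>

#define PTRS_PER_PMD 512
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* One bit per pte in the PMD range; set means "filled". */
struct pte_bitmap {
    unsigned long bits[PTRS_PER_PMD / BITS_PER_LONG];
};

/* The single scan: constant 512 iterations, regardless of what the
 * later per-order passes want to know. */
static void bitmap_record_scan(struct pte_bitmap *bm, const int *pte_filled)
{
    int i;

    memset(bm->bits, 0, sizeof(bm->bits));
    for (i = 0; i < PTRS_PER_PMD; i++)
        if (pte_filled[i])
            bm->bits[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
}

/* Count filled ptes in a window using only the remembered bitmap,
 * i.e. without touching the page table again. */
static int bitmap_filled_in_range(const struct pte_bitmap *bm,
                                  int start, int nr)
{
    int i, filled = 0;

    for (i = start; i < start + nr; i++)
        if (bm->bits[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            filled++;
    return filled;
}
```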
>>>> Dev did some good analysis for this here:
>>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>>
>>> I think I swayed away and over-analyzed, and probably did not make my
>>> main objection clear enough, so let us cut to the chase.
>>> *Why* is it correct to remember the state of the PMD?
>>>
>>> In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
>>> tunables again, since we dropped the lock. The bitmap approach you are
>>> taking, and in general any algorithm which tries to remember the state
>>> of the PMD, violates the entire point of max_ptes_*. Take for example:
>>> suppose the PTE table had a lot of shared ptes. After you drop the
>>> PTL, you do this: scan_bitmap() -> read_unlock() ->
>>> alloc_charge_folio() -> read_lock() -> read_unlock()... which is a lot
>> per your recommendation I dropped the read_lock() -> read_unlock() and
>> made it a conditional unlock
>
> That's not the one I was talking about here...
>
>>> of stuff. Now, you do write_lock(), which means that you need to wait
>>> for all faulting/forking/mremap/mmap etc. to stop. Suppose this
>>> process forks and then a lot of PTEs become shared. The point of
>>> max_ptes_shared is to stop the collapse here, since we do not want
>>> memory bloat (collapse will grab more memory from the buddy and the
>>> old memory won't be freed because it has a reference from the
>>> parent/child).
>>
>> That's a fair point, but given the other feedback, my current
>> implementation now requires mTHPs to have no shared/swap, and I've
>> improved the sysctl interactions for the set_bitmap and the
>> max_ptes_none check in the _isolate function.
>
> I am guessing you are following the policy of letting the creep happen
> for none ptes, and assuming shared and swap to be zero.

Ah sorry, I read the thread again and it seems we decided on skipping
mTHP if max_ptes_none != 0 and != 511.
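[A user-space sketch of that policy as I understand it (hypothetical
names, not the actual patch code): mTHP collapse is only attempted when
max_ptes_none is 0 or 511, and a candidate range must have at least one
filled pte and no shared or swapped-out ptes.]

```c
#include <assert.h>
#include <stddef.h>

/* Reduced model of a pte's state for the purposes of the policy. */
enum pte_state { PTE_NONE, PTE_FILLED, PTE_SHARED, PTE_SWAP };

/* mTHP orders are skipped unless max_ptes_none is one of the two
 * unambiguous extremes (0 or 511 on x86, where HPAGE_PMD_NR is 512). */
static int mthp_policy_allows(int max_ptes_none)
{
    return max_ptes_none == 0 || max_ptes_none == 511;
}

/* A 2^order range qualifies only if it contains at least one filled
 * pte and no shared or swap ptes. */
static int mthp_range_eligible(const enum pte_state *ptes, size_t nr)
{
    size_t i;
    int filled = 0;

    for (i = 0; i < nr; i++) {
        if (ptes[i] == PTE_SHARED || ptes[i] == PTE_SWAP)
            return 0;           /* no shared or swap ptes allowed */
        if (ptes[i] == PTE_FILLED)
            filled = 1;
    }
    return filled;              /* need at least one filled pte */
}
```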
In any case, we need to scan the range to check that we have at least
one filled (or all filled) ptes, and that none of them are shared or
swap.

>
>>
>> As for *why* remembering the state is correct: it just prevents
>> needing to rescan.
>
> That is what I am saying... if collapse_huge_page() fails, then you
> have dropped the mmap write lock, so now the state of the PTEs may
> have changed, so you must rescan...
>
>>
>>> Another example would be: a sysadmin does not want too much memory
>>> wastage from khugepaged, so we decide to set max_ptes_none low. When
>>> you scan the PTE table you justify the collapse. After you drop the
>>> PTL and the mmap_lock, a munmap() happens in the region, no longer
>>> justifying the collapse. If you have a lot of VMAs of size <= 2MB,
>>> then any munmap() on a VMA will happen on the single PTE table
>>> present.
>>>
>>> So, IMHO, before even jumping to analyzing the bitmap algorithm, we
>>> need to ask whether any algorithm remembering the state of the PMD
>>> is even conceptually right.
>>
>> Both the issues you raised don't really have to do with the bitmap...
>
> Correct, my issue is with any general algorithm remembering PTE state.
>
>> they are fair points, but they are more of a criticism of my sysctl
>> handling. I've cleaned up the max_ptes_none interactions, and now that
>> we don't plan to initially support swap/shared, both these problems
>> are 'gone'.
>>>
>>> Then, you have the harder task of proving that your optimization is
>>> actually an optimization, that it is not rendered futile by its own
>>> overhead. From a high-level mathematical PoV, you are saving
>>> iterations. But any such analysis has the underlying assumption that
>>> every iteration is equal, whereas the list [pte, pte + 1, ...,
>>> pte + (1 << order)] is virtually and physically contiguous in memory,
>>> so prefetching helps us.
>>> You are trying to save on pte memory references, but then look at
>>> the number of bitmap memory references you have created, not to
>>> mention that you are doing a (costly?) division operation in there;
>>> you have a while loop, a stack, new structs, and if conditions. I do
>>> not see how this is any faster than a naive linear scan.
>>
>> Yeah, it's hard to say without real performance testing. I hope to
>> include some performance results with my next post.
>>
>>>
>>>> This prevents needing to hold the read lock for longer, and
>>>> prevents needing to reacquire it too.
>>>
>>> My implementation does not hold the read lock for longer. What you
>>> mean to say is, I need to reacquire the lock, and this is by design,
>>> to
>> yes, sorry.
>>> ensure correctness, which boils down to what I wrote above.
>>
>> The write lock is what ensures correctness, not the read lock. The
>> read lock is to gain insight into potential collapse candidates while
>> avoiding the cost of the write lock.
>>
>> Cheers!
>> -- Nico
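To make the correctness point concrete, here is a toy user-space model
(hypothetical names, not kernel code) of why the tunables must be
re-checked against the live ptes, as __collapse_huge_page_isolate()
does, rather than trusting a result remembered from the read-locked
scan: a fork between the scan and the write-locked collapse can turn a
passing table into a failing one.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced model of a pte's state for the purposes of the check. */
enum pte_state { PTE_NONE, PTE_FILLED, PTE_SHARED, PTE_SWAP };

/* Re-count none/shared ptes against the tunables.  Because the locks
 * are dropped between the scan and the collapse, only running this on
 * the *current* table gives the right answer; a remembered "yes" from
 * before a fork or munmap may be stale. */
static int within_tunables(const enum pte_state *ptes, size_t nr,
                           int max_ptes_none, int max_ptes_shared)
{
    size_t i;
    int none = 0, shared = 0;

    for (i = 0; i < nr; i++) {
        if (ptes[i] == PTE_NONE)
            none++;
        else if (ptes[i] == PTE_SHARED)
            shared++;
    }
    return none <= max_ptes_none && shared <= max_ptes_shared;
}
```

In this model, a table that passed the read-locked scan fails the same
check after a fork has made its ptes shared, which is exactly the
max_ptes_shared scenario described above.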