Date: Fri, 24 Jan 2025 13:08:11 +0530
Subject: Re: [RFC 00/11] khugepaged: mTHP support
From: Dev Jain
To: Nico Pache
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
 peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com,
 ziy@nvidia.com, jglisse@google.com, surenb@google.com,
 vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com,
 jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org,
 kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com,
 raquini@redhat.com, sunnanyong@huawei.com, usamaarif642@gmail.com,
 audra@redhat.com, akpm@linux-foundation.org
References: <20250108233128.14484-1-npache@redhat.com>
 <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
 <209ba507-fab8-4011-bc36-1cc38a303800@arm.com>
 <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>
In-Reply-To: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>

On 24/01/25 12:43 pm, Dev Jain wrote:
>
> On 24/01/25 1:54 am, Nico Pache wrote:
>> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain wrote:
>>>
>>> --- snip ---
>>>>>
>>>>> Although to be honest, it's not super clear to me what the benefit
>>>>> of the bitmap is vs just iterating through the PTEs like Dev does;
>>>>> is there a significant cost saving in practice? On the face of it,
>>>>> it seems like it might be unneeded complexity.
>>>> The bitmap was to encode the state of the PMD without needing
>>>> rescanning (or refactoring a lot of code). We keep the scan runtime
>>>> constant at 512 (for x86).
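[For concreteness, the bitmap scheme as I understand it can be modelled
in user space roughly as below. This is only an illustrative sketch with
hypothetical names, not the actual patch code: one read-locked pass over
the 512 ptes records which entries are filled, so later passes consult
the bitmap instead of re-walking the page table.]

```c
#include <assert.h>
#include <string.h>

#define PTRS_PER_PMD 512
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* One bit per pte in the PMD range; set means "filled". */
struct pte_bitmap {
    unsigned long bits[PTRS_PER_PMD / BITS_PER_LONG];
};

/* The single scan: constant 512 iterations, regardless of what the
 * later per-order passes want to know. */
static void bitmap_record_scan(struct pte_bitmap *bm, const int *pte_filled)
{
    int i;

    memset(bm->bits, 0, sizeof(bm->bits));
    for (i = 0; i < PTRS_PER_PMD; i++)
        if (pte_filled[i])
            bm->bits[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
}

/* Count filled ptes in a window using only the remembered bitmap,
 * i.e. without touching the page table again. */
static int bitmap_filled_in_range(const struct pte_bitmap *bm,
                                  int start, int nr)
{
    int i, filled = 0;

    for (i = start; i < start + nr; i++)
        if (bm->bits[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            filled++;
    return filled;
}
```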
>>>> Dev did some good analysis for this here:
>>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>>
>>> I think I swayed away and over-analyzed, and probably did not make my
>>> main objection clear enough, so let us cut to the chase.
>>> *Why* is it correct to remember the state of the PMD?
>>>
>>> In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
>>> tunables again, since we dropped the lock. The bitmap approach you are
>>> taking, and in general any algorithm which tries to remember the state
>>> of the PMD, violates the entire point of max_ptes_*. Take for example:
>>> suppose the PTE table had a lot of shared ptes. After you drop the
>>> PTL, you do this: scan_bitmap() -> read_unlock() ->
>>> alloc_charge_folio() -> read_lock() -> read_unlock()... which is a lot
>> per your recommendation I dropped the read_lock() -> read_unlock() and
>> made it a conditional unlock
>
> That's not the one I was talking about here...
>
>>> of stuff. Now, you do write_lock(), which means that you need to wait
>>> for all faulting/forking/mremap/mmap etc. to stop. Suppose this
>>> process forks and then a lot of PTEs become shared. The point of
>>> max_ptes_shared is to stop the collapse here, since we do not want
>>> memory bloat (collapse will grab more memory from the buddy and the
>>> old memory won't be freed because it has a reference from the
>>> parent/child).
>>
>> That's a fair point, but given the other feedback, my current
>> implementation now requires mTHPs to have no shared/swap, and I've
>> improved the sysctl interactions for the set_bitmap and the
>> max_ptes_none check in the _isolate function.
>
> I am guessing you are following the policy of letting the creep happen
> for none ptes, and assuming shared and swap to be zero.

Ah sorry, I read the thread again and it seems we decided on skipping
mTHP if max_ptes_none != 0 and != 511.
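[A user-space sketch of that policy as I understand it (hypothetical
names, not the actual patch code): mTHP collapse is only attempted when
max_ptes_none is 0 or 511, and a candidate range must have at least one
filled pte and no shared or swapped-out ptes.]

```c
#include <assert.h>
#include <stddef.h>

/* Reduced model of a pte's state for the purposes of the policy. */
enum pte_state { PTE_NONE, PTE_FILLED, PTE_SHARED, PTE_SWAP };

/* mTHP orders are skipped unless max_ptes_none is one of the two
 * unambiguous extremes (0 or 511 on x86, where HPAGE_PMD_NR is 512). */
static int mthp_policy_allows(int max_ptes_none)
{
    return max_ptes_none == 0 || max_ptes_none == 511;
}

/* A 2^order range qualifies only if it contains at least one filled
 * pte and no shared or swap ptes. */
static int mthp_range_eligible(const enum pte_state *ptes, size_t nr)
{
    size_t i;
    int filled = 0;

    for (i = 0; i < nr; i++) {
        if (ptes[i] == PTE_SHARED || ptes[i] == PTE_SWAP)
            return 0;           /* no shared or swap ptes allowed */
        if (ptes[i] == PTE_FILLED)
            filled = 1;
    }
    return filled;              /* need at least one filled pte */
}
```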
In any case, we need to scan the range to check that we have at least
one filled (or all filled) ptes, and that none of them are shared or
swap.

>
>>
>> As for *why* remembering the state is correct: it just prevents
>> needing to rescan.
>
> That is what I am saying... if collapse_huge_page() fails, then you
> have dropped the mmap write lock, so now the state of the PTEs may
> have changed, so you must rescan...
>
>>
>>> Another example would be: a sysadmin does not want too much memory
>>> wastage from khugepaged, so we decide to set max_ptes_none low. When
>>> you scan the PTE table you justify the collapse. After you drop the
>>> PTL and the mmap_lock, a munmap() happens in the region, no longer
>>> justifying the collapse. If you have a lot of VMAs of size <= 2MB,
>>> then any munmap() on a VMA will happen on the single PTE table
>>> present.
>>>
>>> So, IMHO, before even jumping to analyzing the bitmap algorithm, we
>>> need to ask whether any algorithm remembering the state of the PMD
>>> is even conceptually right.
>>
>> Both the issues you raised don't really have to do with the bitmap...
>
> Correct, my issue is with any general algorithm remembering PTE state.
>
>> they are fair points, but they are more of a criticism of my sysctl
>> handling. I've cleaned up the max_ptes_none interactions, and now that
>> we don't plan to initially support swap/shared, both these problems
>> are 'gone'.
>>>
>>> Then, you have the harder task of proving that your optimization is
>>> actually an optimization, that it is not rendered futile by its own
>>> overhead. From a high-level mathematical PoV, you are saving
>>> iterations. But any such analysis has the underlying assumption that
>>> every iteration is equal, whereas the list [pte, pte + 1, ...,
>>> pte + (1 << order)] is virtually and physically contiguous in memory,
>>> so prefetching helps us.
>>> You are trying to save on pte memory references, but then look at
>>> the number of bitmap memory references you have created, not to
>>> mention that you are doing a (costly?) division operation in there;
>>> you have a while loop, a stack, new structs, and if conditions. I do
>>> not see how this is any faster than a naive linear scan.
>>
>> Yeah, it's hard to say without real performance testing. I hope to
>> include some performance results with my next post.
>>
>>>
>>>> This prevents needing to hold the read lock for longer, and
>>>> prevents needing to reacquire it too.
>>>
>>> My implementation does not hold the read lock for longer. What you
>>> mean to say is, I need to reacquire the lock, and this is by design,
>>> to
>> yes, sorry.
>>> ensure correctness, which boils down to what I wrote above.
>>
>> The write lock is what ensures correctness, not the read lock. The
>> read lock is to gain insight into potential collapse candidates while
>> avoiding the cost of the write lock.
>>
>> Cheers!
>> -- Nico
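To make the correctness point concrete, here is a toy user-space model
(hypothetical names, not kernel code) of why the tunables must be
re-checked against the live ptes, as __collapse_huge_page_isolate()
does, rather than trusting a result remembered from the read-locked
scan: a fork between the scan and the write-locked collapse can turn a
passing table into a failing one.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced model of a pte's state for the purposes of the check. */
enum pte_state { PTE_NONE, PTE_FILLED, PTE_SHARED, PTE_SWAP };

/* Re-count none/shared ptes against the tunables.  Because the locks
 * are dropped between the scan and the collapse, only running this on
 * the *current* table gives the right answer; a remembered "yes" from
 * before a fork or munmap may be stale. */
static int within_tunables(const enum pte_state *ptes, size_t nr,
                           int max_ptes_none, int max_ptes_shared)
{
    size_t i;
    int none = 0, shared = 0;

    for (i = 0; i < nr; i++) {
        if (ptes[i] == PTE_NONE)
            none++;
        else if (ptes[i] == PTE_SHARED)
            shared++;
    }
    return none <= max_ptes_none && shared <= max_ptes_shared;
}
```

In this model, a table that passed the read-locked scan fails the same
check after a fork has made its ptes shared, which is exactly the
max_ptes_shared scenario described above.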