Message-ID: <41d85d62-3234-478e-8cd7-571a49cfc031@arm.com>
Date: Mon, 27 Jan 2025 15:01:44 +0530
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [RFC 00/11] khugepaged: mTHP
support
To: David Hildenbrand, Ryan Roberts, Nico Pache
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
    vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
    dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
    jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
    hughd@google.com, aneesh.kumar@kernel.org,
    yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
    wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
    surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
    zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
    willy@infradead.org, kirill.shutemov@linux.intel.com,
    aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
    usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org
References: <20250108233128.14484-1-npache@redhat.com>
 <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
 <37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com>
 <0a318ea8-7836-405a-a033-f073efdc958f@arm.com>
 <8305ddf7-1ada-4a75-a2c3-385b530b25d4@redhat.com>
 <9bf875ad-3e31-464d-bccd-7c737a2c53bc@arm.com>
 <95472249-44f6-4764-a5fa-fac834eb5a49@redhat.com>
From: Dev Jain

On 21/01/25 3:49 pm, David Hildenbrand wrote:

>> Hmm, that's an interesting idea; if I've understood, we would
>> effectively test the PMD for collapse as if we were collapsing to
>> PMD-size, but then do the actual collapse to the "highest allowed
>> order" (dictated by what's enabled + MADV_HUGEPAGE config).
>>
>> I'm not so sure this is a good way to go; there would be no way to
>> support VMAs (or parts of VMAs) that don't span a full PMD.
>
> In Nico's approach to locking, we temporarily have to remove the PTE
> table either way.
> While holding the mmap lock in write mode, the VMAs cannot go away, so
> we could scan the whole PTE table to figure it out.
>
> To just figure out "none" vs. "non-none" vs. "swap PTE", we probably
> wouldn't need the other VMA information. Figuring out "shared" is
> trickier, because we have to obtain the folio and would have to walk
> the other VMAs.
>
> It's a good question whether we would have to VMA-write-lock the other
> affected VMAs as well in order to temporarily remove the PTE table that
> crosses multiple VMAs, or if we'd need something different (a collapse
> PMD marker) so the page table walkers could handle that case properly
> -- keep retrying, or fall back to the mmap lock.

I missed this reply; it could have saved me some trouble :) When
collapsing for VMAs < PMD, we *will* have to write-lock the VMAs,
write-lock the anon_vmas, and write-lock vma->vm_file->f_mapping for
file VMAs, otherwise someone may fault on another VMA mapping the same
PTE table.

I was trying to implement this, but cannot find a clean way: we would
have to implement it like mm_take_all_locks(), with a bit similar to
AS_MM_ALL_LOCKS. Suppose we need to lock all anon_vmas: two VMAs may
share the same anon_vma, so we cannot get away with the following check:

    lock only if !rwsem_is_locked(&vma->anon_vma->root->rwsem)

since I need to skip the lock only when it is khugepaged itself that has
taken it. I guess the way to go about this then is the PMD-marker
thingy, which I am not very familiar with.

>
>> And I can imagine we might see memory bloat; imagine you have
>> 2M=madvise, 64K=always, max_ptes_none=511, and let's say we have a 2M
>> (aligned portion of a) VMA that does NOT have MADV_HUGEPAGE set and
>> has a single page populated. It passes the PMD-size test, but we opt
>> to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
>> folios, 31 of which are all zeros. We have spent the same amount of
>> memory as if 2M=always.
>> Perhaps that's a detail that could be solved by ignoring fully-none
>> 64K blocks when collapsing to 64K...
>
> Yes, that's what I had in mind. No need to collapse where there is
> nothing at all ...
>
>> Personally, I think your "enforce simplification of the tunables for
>> mTHP collapse" idea is the best we have so far.
>
> Right.
>
>> But I'll just push against your pushback of the per-VMA cursor idea
>> briefly. It strikes me that this could be useful for khugepaged
>> regardless of mTHP support.
>
> Not a clear pushback; as you say, to me this is a different
> optimization, and I am missing how it could really solve the problem at
> hand here.
>
> Note that we're already fighting to avoid growing the VMA struct (see
> the VMA locking changes under review), but maybe we could still squeeze
> it in there without requiring a bigger slab.
>
>> Today, it starts scanning a VMA, collapses the first PMD it finds that
>> meets the requirements, then switches to scanning another VMA. When it
>> eventually gets back to scanning the first VMA, it starts from the
>> beginning again. Wouldn't a cursor help reduce the amount of scanning
>> it has to do?
>
> Yes, that whole scanning approach sounds weird. I would have assumed
> that it might nowadays be smarter to just scan the MM sequentially, and
> not jump between VMAs.
>
> Assume you only have a handful of large VMAs (like in a VMM): you'd end
> up scanning the same handful of VMAs over and over again.
>
> I think a lot of the khugepaged codebase is just full of historical
> baggage that must be cleaned up and re-validated if it is still
> required ...