Message-ID: <41d85d62-3234-478e-8cd7-571a49cfc031@arm.com>
Date: Mon, 27 Jan 2025 15:01:44 +0530
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [RFC 00/11] khugepaged: mTHP
support
To: David Hildenbrand, Ryan Roberts, Nico Pache
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
    vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
    dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
    jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
    hughd@google.com, aneesh.kumar@kernel.org,
    yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
    wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
    surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
    zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
    willy@infradead.org, kirill.shutemov@linux.intel.com,
    aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
    usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org
References: <20250108233128.14484-1-npache@redhat.com>
 <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
 <37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com>
 <0a318ea8-7836-405a-a033-f073efdc958f@arm.com>
 <8305ddf7-1ada-4a75-a2c3-385b530b25d4@redhat.com>
 <9bf875ad-3e31-464d-bccd-7c737a2c53bc@arm.com>
 <95472249-44f6-4764-a5fa-fac834eb5a49@redhat.com>
From: Dev Jain

On 21/01/25 3:49 pm, David Hildenbrand wrote:

>> Hmm, that's an interesting idea; if I've understood, we would
>> effectively test the PMD for collapse as if we were collapsing to
>> PMD-size, but then do the actual collapse to the "highest allowed
>> order" (dictated by what's enabled + MADV_HUGEPAGE config).
>>
>> I'm not so sure this is a good way to go; there would be no way to
>> support VMAs (or parts of VMAs) that don't span a full PMD.
>
> In Nico's approach to locking, we temporarily have to remove the PTE
> table either way.
> While holding the mmap lock in write mode, the VMAs cannot go away, so
> we could scan the whole PTE table to figure it out.
>
> To just figure out "none" vs. "non-none" vs. "swap PTE", we probably
> wouldn't need the other VMA information. Figuring out "shared" is
> trickier, because we have to obtain the folio and would have to walk
> the other VMAs.
>
> It's a good question whether we would have to VMA-write-lock the other
> affected VMAs as well in order to temporarily remove the PTE table that
> crosses multiple VMAs, or if we'd need something different (a collapse
> PMD marker) so the page table walkers could handle that case properly
> -- keep retrying, or fall back to the mmap lock.

I missed this reply; it could have saved me some trouble :) When
collapsing for VMAs < PMD, we *will* have to write-lock the VMAs,
write-lock the anon_vmas, and write-lock vma->vm_file->f_mapping for
file VMAs, otherwise someone may fault on another VMA mapping the same
PTE table.

I was trying to implement this, but cannot find a clean way: we would
have to implement it like mm_take_all_locks(), with a bit similar to
AS_MM_ALL_LOCKS. Suppose we need to lock all anon_vmas: two VMAs may
share the same anon_vma, so we cannot get away with the following check:

    lock only if !rwsem_is_locked(&vma->anon_vma->root->rwsem)

since I need to skip the lock only when it is khugepaged itself that has
taken it. I guess the way to go about this then is the PMD-marker
thingy, which I am not very familiar with.

>
>> And I can imagine we might see memory bloat; imagine you have
>> 2M=madvise, 64K=always, max_ptes_none=511, and let's say we have a 2M
>> (aligned portion of a) VMA that does NOT have MADV_HUGEPAGE set and
>> has a single page populated. It passes the PMD-size test, but we opt
>> to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
>> folios, 31 of which are all zeros. We have spent the same amount of
>> memory as if 2M=always.
>> Perhaps that's a detail that could be solved by ignoring fully-none
>> 64K blocks when collapsing to 64K...
>
> Yes, that's what I had in mind. No need to collapse where there is
> nothing at all ...
>
>> Personally, I think your "enforce simplification of the tunables for
>> mTHP collapse" idea is the best we have so far.
>
> Right.
>
>> But I'll just push against your pushback of the per-VMA cursor idea
>> briefly. It strikes me that this could be useful for khugepaged
>> regardless of mTHP support.
>
> Not a clear pushback; as you say, to me this is a different
> optimization, and I am missing how it could really solve the problem at
> hand here.
>
> Note that we're already fighting to avoid growing the VMA struct (see
> the VMA locking changes under review), but maybe we could still squeeze
> it in there without requiring a bigger slab.
>
>> Today, it starts scanning a VMA, collapses the first PMD it finds that
>> meets the requirements, then switches to scanning another VMA. When it
>> eventually gets back to scanning the first VMA, it starts from the
>> beginning again. Wouldn't a cursor help reduce the amount of scanning
>> it has to do?
>
> Yes, that whole scanning approach sounds weird. I would have assumed
> that it might nowadays be smarter to just scan the MM sequentially, and
> not jump between VMAs.
>
> Assume you only have a handful of large VMAs (like in a VMM): you'd end
> up scanning the same handful of VMAs over and over again.
>
> I think a lot of the khugepaged codebase is just full of historical
> baggage that must be cleaned up and re-validated if it is still
> required ...