Message-ID: <5af0e0ae-0472-45b8-a249-44b4e5239d33@kernel.org>
Date: Sun, 21 Dec 2025 10:24:11 +0100
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
To: Vernon Yang, Wei Yang
Cc: akpm@linux-foundation.org, lorenzo.stoakes@oracle.com, ziy@nvidia.com,
 baohua@kernel.org, lance.yang@linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Vernon Yang
References: <20251215090419.174418-1-yanglincheng@kylinos.cn>
 <20251215090419.174418-4-yanglincheng@kylinos.cn>
 <3c75d915-5d7f-4e80-975f-4479393e7139@kernel.org>
 <6e8684a5-1f71-4be6-8805-9b047a2bcb78@kernel.org>
 <20251221021044.2r5fhepiyyhvuo7h@master>
From: "David Hildenbrand (Red Hat)"

On 12/21/25 05:25, Vernon Yang wrote:
> On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>>> On 12/19/25 06:29, Vernon Yang wrote:
>>>> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/15/25 10:04, Vernon Yang wrote:
>>>>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>>>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>>>>> continuously access their 128 MB of memory, while the cold task only
>>>>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>>>>> khugepaged still prioritizes scanning the cold task and only scans the
>>>>>> hot2 task after completing the scan of the cold task.
>>>>>>
>>>>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>>>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>>>>> scan it only at the latest possible moment, thereby avoiding unnecessary
>>>>>> scan and collapse operations and reducing CPU waste.
>>>>>>
>>>>>> Here are the performance test results:
>>>>>> (Throughput: higher is better; others: lower is better)
>>>>>>
>>>>>> Testing on x86_64 machine:
>>>>>>
>>>>>> | task hot2           | without patch | with patch    | delta   |
>>>>>> |---------------------|---------------|---------------|---------|
>>>>>> | total accesses time | 3.14 sec      | 2.92 sec      | -7.01%  |
>>>>>> | cycles per access   | 4.91          | 2.07          | -57.84% |
>>>>>> | Throughput          | 104.38 M/sec  | 112.12 M/sec  | +7.42%  |
>>>>>> | dTLB-load-misses    | 288966432     | 1292908       | -99.55% |
>>>>>>
>>>>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>>>>
>>>>>> | task hot2           | without patch | with patch    | delta   |
>>>>>> |---------------------|---------------|---------------|---------|
>>>>>> | total accesses time | 3.35 sec      | 2.96 sec      | -11.64% |
>>>>>> | cycles per access   | 7.23          | 2.12          | -70.68% |
>>>>>> | Throughput          | 97.88 M/sec   | 110.76 M/sec  | +13.16% |
>>>>>> | dTLB-load-misses    | 237406497     | 3189194       | -98.66% |
>>>>>
>>>>> Again, I also don't like that because you make assumptions on a full process
>>>>> based on some part of its address space.
>>>>>
>>>>> E.g., if a library issues a MADV_COLD on some part of the memory the library
>>>>> manages, why should the remaining part of the process suffer as well?
>>>>
>>>> Yes, you make a good point, thanks!
>>>>
>>>>> This seems to be a heuristic focused on some specific workloads, no?
>>>>
>>>> Right.
>>>>
>>>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>>>> not be collapsed, so that khugepaged can simply skip this VMA during
>>>> scanning? This way, it won't affect the remaining part of the task's
>>>> memory regions.
>>>
>>> I thought we would skip these regions already properly in khugepaged, or
>>> maybe I misunderstood your question.
>>>
>>
>> I think we should, but it seems we don't do this for anonymous memory during
>> khugepaged.
>>
>> We check the vma with thp_vma_allowable_order() during scan.
>>
>> * For anonymous memory during khugepaged, if we always enable 2M collapse,
>>   we will scan this vma, even if VM_NOHUGEPAGE is set.
>>
>> * For other cases, it looks good since __thp_vma_allowable_order() will skip
>>   this vma with vma_thp_disabled().
>
> Hi David, Wei,
>
> khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
> memory during scan, as below:
>
> khugepaged_scan_mm_slot()
>     thp_vma_allowable_order()
>         thp_vma_allowable_orders()
>             __thp_vma_allowable_orders()
>                 vma_thp_disabled() {
>                     if (vm_flags & VM_NOHUGEPAGE)
>                         return true;
>                 }
>
> REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> vma, so khugepaged will continue to scan this vma.
>
> I set the VM_NOHUGEPAGE flag on the vma on madvise(MADV_COLD), and the test
> succeeded. I will send it in the next version.

No, we must not do that. That's a user-space visible change. :/

-- 
Cheers

David
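
To illustrate what "user-space visible" means here, the following is a minimal sketch, not part of the patch. It assumes a Linux system with /proc mounted and relies on the "nh" entry in the VmFlags line of /proc/self/smaps, which reflects VM_NOHUGEPAGE. The helper names (vmflags_has, mapping_has_vmflag, report_nohugepage_visibility) are hypothetical, and the MADV_COLD fallback define is only an assumption for older libc headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20 /* asm-generic UAPI value; assumed for old libc headers */
#endif

/* Return 1 if a "VmFlags:" line contains `flag` as a whole token, else 0. */
int vmflags_has(const char *line, const char *flag)
{
    assert(line && flag);
    const char *p = strchr(line, ':');
    size_t n = strlen(flag);

    if (!p)
        return 0;
    p++;
    while (*p) {
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        const char *tok = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
        if ((size_t)(p - tok) == n && strncmp(tok, flag, n) == 0)
            return 1;
    }
    return 0;
}

/* Find the mapping covering `addr` in /proc/self/smaps and report whether
 * its VmFlags line carries `flag`. Returns 1 or 0, or -1 if not found. */
int mapping_has_vmflag(const void *addr, const char *flag)
{
    unsigned long a = (unsigned long)(uintptr_t)addr;
    FILE *f = fopen("/proc/self/smaps", "r");
    char line[1024];
    int in_target = 0, result = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;

        if (sscanf(line, "%lx-%lx ", &start, &end) == 2) {
            in_target = (start <= a && a < end);
        } else if (in_target && strncmp(line, "VmFlags:", 8) == 0) {
            result = vmflags_has(line, flag);
            break;
        }
    }
    fclose(f);
    return result;
}

/* Print the "nh" state of an anonymous mapping after MADV_COLD and after
 * MADV_NOHUGEPAGE. Per the thread, MADV_COLD does not set VM_NOHUGEPAGE. */
int report_nohugepage_visibility(void)
{
    size_t len = 4UL << 20;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    memset(p, 1, len); /* touch the memory briefly, like the cold task */

    if (madvise(p, len, MADV_COLD) != 0)
        perror("madvise(MADV_COLD)");
    printf("after MADV_COLD:       nh=%d\n", mapping_has_vmflag(p, "nh"));

    if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
        perror("madvise(MADV_NOHUGEPAGE)");
    printf("after MADV_NOHUGEPAGE: nh=%d\n", mapping_has_vmflag(p, "nh"));

    munmap(p, len);
    return 0;
}
```

Calling report_nohugepage_visibility() from main() prints the "nh" state after each madvise() call. Any tool that reads smaps would observe the same flag, so if MADV_COLD started setting VM_NOHUGEPAGE, existing applications could see a different VmFlags line. On kernels built without THP support, the MADV_NOHUGEPAGE call may fail and the flag would stay unset.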