From: SeongJae Park <sj@kernel.org>
To: Bijan Tabatabai
Cc: SeongJae Park <sj@kernel.org>, damon@lists.linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, akpm@linux-foundation.org, david@redhat.com,
 ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
 rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
 ying.huang@linux.alibaba.com, apopple@nvidia.com, bijantabatab@micron.com,
 venkataravis@micron.com, emirakhur@micron.com, ajayjoshi@micron.com,
 vtavarespetr@micron.com
Subject: Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Date: Tue, 24 Jun 2025 15:33:08 -0700
Message-ID: <20250624223310.55786-1-sj@kernel.org>

On Tue, 24 Jun 2025 11:01:46 -0500 Bijan Tabatabai wrote:

> On Mon, Jun 23, 2025 at 7:34 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Mon, 23 Jun 2025 18:15:00 -0500 Bijan Tabatabai wrote:
> >
> > [...]
> > > Hi SeongJae,
> > >
> > > I really appreciate your detailed response.
> > > The quota auto-tuning helps, but I feel like it's still not exactly
> > > what I want.  For example, I think a quota goal that stops migration
> > > based on the memory usage balance gets quite a bit more complicated
> > > when instead of interleaving all data, we are just interleaving *hot*
> > > data.  I haven't looked at it extensively, but I imagine it wouldn't
> > > be easy to identify how much data is hot in the paddr setting,
> >
> > I don't think so, and I don't see why you think so.  Could you please
> > elaborate?
>
> Elaborated below.
>
> > > especially because the regions can contain a significant amount of
> > > unallocated data.
> >
> > In that case, the unallocated data shouldn't be accessed at all, so
> > the region will just look cold to DAMON.
>
> "Significant" was too strong of a word, but if physical memory is
> fragmented, couldn't there be a non-negligible amount of unallocated
> memory in a hot region?  If so, I think it would mean that you cannot
> simply take a sum of the sizes of the hot regions in each node to
> compute how the hot data is interleaved, because those regions may
> contain unallocated memory that shouldn't count for that calculation.
> Does that make sense?
>
> It's very possible I might be overthinking this and it won't be an
> issue in practice.  It might be best to not worry about it until it
> becomes an issue in practice.

I understand this is not a problem for only this migration-based
interleaving approach, but a question of general DAMON accuracy on the
physical address space.  I agree.  DAMON's region-based mechanism
provides only best-effort results, and depending on the workload,
virtual address space based monitoring can work better than physical
address space based monitoring.  Whether the accuracy level is
acceptable or not would depend on the use case.  What I can say at the
moment is that there are use cases of physical address based DAMON, and
we are working on making it better.  The recent monitoring intervals
auto-tuning was one such effort.
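To make the above concrete, below is how I understand the estimation
Bijan describes: summing the sizes of the hot regions on each node.
This is a minimal userspace sketch, not DAMON kernel code; the regions,
node assignments, and hotness threshold are all made up.  The concern
above shows up in the comment: a region's size counts any unallocated
pages inside it, so on fragmented memory the per-node sums can
overestimate the amount of hot data.

/*
 * Userspace sketch of estimating how hot data is spread over nodes by
 * summing hot region sizes per node.  All values below are made up.
 * Caveat: a region's size counts any unallocated pages inside it, so
 * on fragmented memory these sums can overestimate the hot data.
 */
#include <stdio.h>

struct region {
	unsigned long start;		/* physical range [start, end) */
	unsigned long end;
	unsigned int nr_accesses;	/* DAMON-style access count */
	int nid;			/* node holding this range */
};

#define NR_NODES	2
#define HOT_THRESHOLD	4		/* hypothetical hotness cutoff */

int main(void)
{
	struct region regions[] = {	/* hypothetical snapshot */
		{ 0x10000000ul, 0x14000000ul, 9, 0 },
		{ 0x14000000ul, 0x15000000ul, 1, 0 },
		{ 0x20000000ul, 0x22000000ul, 7, 1 },
	};
	unsigned long hot_bytes[NR_NODES] = { 0, };
	size_t i;

	for (i = 0; i < sizeof(regions) / sizeof(regions[0]); i++)
		if (regions[i].nr_accesses >= HOT_THRESHOLD)
			hot_bytes[regions[i].nid] +=
				regions[i].end - regions[i].start;

	for (i = 0; i < NR_NODES; i++)
		printf("node %zu: %lu hot bytes\n", i, hot_bytes[i]);
	return 0;
}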
> > > Also, if the interleave weights changed, for example, from 11:9
> > > to 10:10, it would be preferable if only 5% of data is migrated;
> > > however, with the round robin approach, 50% would be.
>
> Elaborating more on this:
> Imagine a process begins with weights of 3 and 2 for node 0 and 1
> respectively in both DAMON and the weighted interleave policy.  If you
> looked at which node a page resides in for a group of contiguous
> pages, it would be something like this (using letters to represent the
> virtual addresses):
>
> A -> node 0
> B -> node 0
> C -> node 0
> D -> node 1
> E -> node 1
> F -> node 0
> G -> node 0
> H -> node 0
> I -> node 1
> J -> node 1
>
> If we use a user defined quota auto-tuning mechanism like you
> described in [1] to stop DAMON interleaving when we detect that the
> data is interleaved correctly, no interleaving would happen, which is
> good.  However, let's say we change the DAMON weights to be 4:1.  My
> understanding is that DAMON applies the scheme to regions in ascending
> order of physical address (for paddr schemes), so if using the
> round-robin algorithm you provided in [2], the interleaving would
> apply to the pages in node 0 first, then node 1.  For the sake of
> simplicity, let's say in this scenario the pages in the same node are
> sorted by their virtual address, so the interleaving would be applied
> in the order ABCFGHDEIJ.  This would result in the following page
> placement:
>
> A -> node 0
> B -> node 0
> C -> node 0
> D -> node 0
> E -> node 0
> F -> node 0
> G -> node 1
> H -> node 0
> I -> node 0
> J -> node 1
>
> So four pages, D, E, G, and I, have been migrated.  However, if they
> were interleaved using their virtual addresses*, only pages D and I
> would have been migrated.

Thank you for this kind and detailed explanation.  Now I can understand
your concern.

To my understanding, an important assumption in the example is that the
pages of the DAMON region are sorted in both the virtual and the
physical address space.  I'm not sure that is a reasonable assumption.
In DAMON regions whose pages have randomly mixed virtual addresses, the
physical address based approach could work better than in the virtual
address-sorted case, and could sometimes even cause fewer migrations
than the virtual address based one.

For example, let's say we found a DAMON region of 6 pages in node 0,
and the virtual address based interleaving targets of the 6 pages are
all node 1, while the physical address based interleaving targets are
node 0 for 3 of the pages and node 1 for the other 3 (weights 1:1).
Then virtual address based interleaving does 6 migrations while the
physical address based one does only 3.

So both approaches have worst-case scenarios, and my gut feeling is
that the difference might not be big as long as the scheme's
aggressiveness is carefully controlled (whether via quota auto-tuning
or manual online DAMON parameters tuning).  Of course, it would be nice
to have good test results.
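To make the comparison concrete, here is a self-contained userspace
sketch of the two selection policies on the A-J example above (weights
4:1).  It is an illustration of the logic only, not the code from the
patch or from [2]; rr_target() mimics a round-robin selection that
depends only on a running counter, while va_target() mimics a
mempolicy-style selection from the page's virtual offset.

/*
 * Userspace sketch contrasting the two target-selection policies on
 * the 10-page (A-J) example with weights 4:1.  Illustration only.
 */
#include <stdio.h>

static const int weights[2] = { 4, 1 };	/* node 0 : node 1 */
#define WSUM (weights[0] + weights[1])

/* round-robin: depends only on how many pages were placed so far */
static int rr_target(unsigned long *counter)
{
	int nid = (*counter % WSUM) < (unsigned long)weights[0] ? 0 : 1;

	(*counter)++;
	return nid;
}

/* virtual address based: depends only on the page's VMA offset */
static int va_target(unsigned long page_idx)
{
	return (page_idx % WSUM) < (unsigned long)weights[0] ? 0 : 1;
}

int main(void)
{
	/* current placement of pages A-J, as in the example above */
	int cur[10] =   { 0, 0, 0, 1, 1, 0, 0, 0, 1, 1 };
	/* paddr application order: node 0 pages (ABCFGH) first, then DEIJ */
	int order[10] = { 0, 1, 2, 5, 6, 7, 3, 4, 8, 9 };
	unsigned long counter = 0;
	int rr_migrations = 0, va_migrations = 0, i;

	for (i = 0; i < 10; i++)
		if (rr_target(&counter) != cur[order[i]])
			rr_migrations++;
	for (i = 0; i < 10; i++)
		if (va_target(i) != cur[i])
			va_migrations++;

	printf("round-robin: %d migrations, vaddr-based: %d migrations\n",
	       rr_migrations, va_migrations);
	return 0;
}

Running it prints 4 migrations for round-robin and 2 for the virtual
address based policy, matching the D/E/G/I versus D/I counts in the
example.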
Meanwhile, given that DAMON regions on the physical address space may
contain pages of random virtual addresses, I have another concern about
the behavior from the users' perspective.  In my opinion, the commonly
expected behavior of the interface we are working on together is:
"DAMON will migrate pages of the regions of the given access pattern to
the given NUMA nodes, with the given weights."  If a DAMON region
contains pages of random virtual addresses, and the virtual address
based target node selection results in a target node ratio that differs
from the user-specified weights (as in the 6 pages example above), I
think users could be confused by the results.  To go in this direction,
I think the interface should clearly explain the behavior.  And I
should confess I'd still have a concern in that case, if the behavior
is unnecessarily complex for users and developers.

Alternatively, if we go with a fully virtual address based approach
(monitoring virtual addresses and operating on virtual addresses), I
think the behavior would be optimal and easier for users to predict.

> * Technically, the mempolicy code interleaves based on the offset from
> the start of the VMA, but that difference doesn't change this example.
>
> > > Finally, and I forgot to mention this in my last message, the
> > > round-robin approach does away with any notion of spatial
> > > locality, which does help the effectiveness of interleaving [1].
>
> Elaborating more on this.
> As implied by the comment in [3], interleaving works better the finer
> grained it is done in virtual memory.  As an extreme example, if you
> had weights of 3:2, putting the first 60% of a process's data in node
> 0 and the remaining 40% in node 1 would satisfy the ratio globally,
> but you would likely not see the benefits of interleaving.  We can see
> in the example above that your round-robin approach does not maintain
> the desired interleave ratio locally, even though it does globally.

I'm not fully convinced that keeping the ratio process-locally is
always beneficial.  At least when people use DAMON on the physical
address space, I'd assume their concern is global memory bandwidth
control, and hence I imagine they would care about the global ratio
more than the local one.  I agree there could be cases where users care
about the process-local ratio, though.  In such cases, I feel the
virtual address based mode sounds more natural, unless there is another
reason to use the physical address space mode.

> > We could use the probabilistic interleaving, if this is the problem?
>
> I don't think so.  In the above example, with probabilistic
> interleaving, you would still migrate, on average, 20% of the pages in
> node 0 and 80% of the pages in node 1.  Similarly, the probabilistic
> interleaving also does not consider the virtual address, so it
> wouldn't maintain the interleave ratio locally in the virtual address
> space either.

Right.  But if the DAMON region contains pages of random virtual
addresses, it wouldn't always result in the worst case.
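For completeness, the probabilistic selection being discussed can be
sketched the same way.  This is again made-up userspace code, not the
suggested implementation; each page draws its target node independently
with probability proportional to the weights, so on the A-J example it
migrates about 0.2 * 6 + 0.8 * 4 = 4.4 pages per pass on average, and
like round-robin it never looks at virtual addresses.

/*
 * Sketch of probabilistic target selection with weights 4:1 on the
 * A-J example.  rand() is used only for illustration.
 */
#include <stdio.h>
#include <stdlib.h>

static const int weights[2] = { 4, 1 };

static int prob_target(void)
{
	return (rand() % (weights[0] + weights[1])) < weights[0] ? 0 : 1;
}

int main(void)
{
	int cur[10] = { 0, 0, 0, 1, 1, 0, 0, 0, 1, 1 };	/* pages A-J */
	long migrations = 0, trials = 100000, t;
	int i;

	srand(42);
	for (t = 0; t < trials; t++)
		for (i = 0; i < 10; i++)
			if (prob_target() != cur[i])
				migrations++;

	/* expect roughly 6 * 0.2 + 4 * 0.8 = 4.4 migrations per pass */
	printf("average migrations per pass: %.2f\n",
	       (double)migrations / trials);
	return 0;
}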
> > > I don't think anything done with quotas can get around that.
> >
> > I think I'm not getting your point well, sorry.  More elaboration of
> > your concern would be helpful.
>
> I elaborated more above.  Hopefully that clears up any confusion.  If
> you still have questions, maybe it would be easier to e-meet and have
> a live discussion about it?  I see you have a DAMON chat slot open
> tomorrow at 9:30 PT [4].  If you have nothing else scheduled, maybe
> that would be a good time to chat?

Nice suggestion, thank you.  For readers of this mail other than Bijan
and me: we made an appointment for this off-list.

> [...]
> > > I see where you're coming from.  I think the crux of this
> > > difference is that in my use case, the set of nodes we are
> > > monitoring is the same as the set of nodes we are migrating to,
> > > while in the use case you describe, the set of nodes being
> > > monitored is disjoint from the set of migration target nodes.
> >
> > I understand and agree with this difference.
> >
> > > I think this in particular makes ping-ponging more of a problem
> > > for my use case, compared to promotion/demotion schemes.
> >
> > But again I'm failing at understanding this, sorry.  Could I ask for
> > more elaboration?
>
> Sure, and sorry for needing to elaborate so much.

No worries, thank you for walking through this with me.

> What I was trying to say is that in the case where a scheme is
> monitoring the same nodes it is migrating to, when it detects a hot
> region, it will interleave the pages in the region between the nodes.
> If there are two nodes, and assuming the access pattern was uniform
> across the region, we have now turned one hot region into two.  Using
> the algorithms you provided earlier, the next time the scheme is
> applied, it will interleave both of those regions again, because the
> only information it has about where to place pages is how many pages
> it has previously interleaved.

You're correct, if we don't use a node imbalance state based scheme
tuning approach.  With that approach, though, this will not happen?

> Using virtual addresses to interleave solves this problem by providing
> one and only one location a page should be in, given a set of
> interleave weights.

Even in that case, DAMON will do rmap walks for the pages.  In an
entirely virtual address based mode, rmap walking is not needed, but
DAMON should still iterate over each page.  For large systems, that
might be a non-negligible amount of system resources.  So I think even
if we use virtual address based target node selection, or an entirely
virtual address based mode, that kind of imbalance status based scheme
tuning could be beneficial.
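To sketch what such an imbalance status based tuning could look like
(userspace illustration only; the real thing would presumably be a
DAMOS quota goal metric, and the names and numbers below are made up):
compute a score from how far the per-node placement is from the
weighted target, and scale the migration quota by it, so the scheme
backs off to zero work, and the ping-ponging stops, once the placement
matches the weights.

/*
 * Userspace sketch of imbalance status based aggressiveness control.
 * Not the DAMOS quota goal implementation; names and values made up.
 */
#include <stdio.h>

#define NR_NODES 2

/* returns 0.0 (balanced) .. 1.0 (fully imbalanced) */
static double imbalance_score(const unsigned long bytes[NR_NODES],
			      const unsigned int weights[NR_NODES])
{
	unsigned long total = 0;
	unsigned int wsum = 0;
	double score = 0.0;
	int i;

	for (i = 0; i < NR_NODES; i++) {
		total += bytes[i];
		wsum += weights[i];
	}
	if (!total)
		return 0.0;
	for (i = 0; i < NR_NODES; i++) {
		double want = (double)weights[i] / wsum;
		double have = (double)bytes[i] / total;
		double diff = have - want;

		score += (diff > 0 ? diff : -diff) / 2;
	}
	return score;
}

int main(void)
{
	unsigned int weights[NR_NODES] = { 4, 1 };
	/* hypothetical placement: 6 and 4 pages, as in the A-J example */
	unsigned long placed[NR_NODES] = { 6 << 12, 4 << 12 };
	unsigned long max_quota = 1 << 20;	/* hypothetical cap */
	double score = imbalance_score(placed, weights);

	/* quota shrinks to zero as the placement approaches 4:1 */
	printf("score %.2f -> quota %lu bytes\n",
	       score, (unsigned long)(score * max_quota));
	return 0;
}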
> When a scheme is monitoring one set of nodes and migrating to another
> disjoint set of nodes, you don't have this problem, because once the
> pages are migrated, they won't be considered by the scheme until some
> other scheme moves those pages back into the monitored nodes.

Not necessarily, in my opinion.  In the tiering scenario, we have two
schemes migrating hot/cold pages to each other's nodes.  Without quotas
or very good hotness/coldness thresholds, unnecessary ping-ponging can
happen.

> Does that make sense?
>
> > > > If you really need this virtual address space based
> > > > deterministic behavior, it would make more sense to use virtual
> > > > address space monitoring (damon-vaddr).
> > >
> > > Maybe it does make sense for me to implement vaddr versions of the
> > > migrate actions for my use case.
> >
> > Yes, that could also be an option.
>
> Given how much my explanations here stressed that having access to the
> virtual addresses solves the problems I mentioned, I think the path
> forward for the next revision should be:
>
> 1) Have the paddr migration scheme use the round-robin interleaving
> that you provided - This would be good for the use case you described
> where you promote pages from a node into multiple nodes of the same
> tier.

If you will not use that, I don't think you need to do it.  Let's work
only for our own selfish purposes ;)  I believe that is a better
approach for long-term maintenance, too.

> 2) Implement a vaddr migration scheme that uses the virtual address
> based interleaving - This is useful for my target use case of
> balancing bandwidth utilization between nodes.
>
> If the vaddr scheme proves insufficient down the line for my use case,
> we can have another discussion at that time.
> How does this sound to you?

I find no concerns with the fully virtual address based approach off
the top of my head.  And if you think it will work for you and you will
use it, I think this is the right way to go.  Based on our previous
discussion, I also think this could work for your and others' multiple
use cases.

We still have a few conflicting opinions about the physical address
monitoring based approach, and I love having those discussions with
you.  But I don't think they can be blockers for 2).  So to me, 2)
seems the right way to go.

> > > One thing that gives me pause about this is that, from what I
> > > understand, it would be harder to have vaddr schemes apply to
> > > processes that start after DAMON begins.  I think to do that, one
> > > would have to detect when a process starts, and then do a DAMON
> > > tune to update the targets list?  It would be nice if, say, you
> > > could specify a cgroup as a vaddr target and track all processes
> > > in that cgroup, but that would be a different patchset for another
> > > day.
> >
> > I agree that could be a future thing to do.  Note that the DAMON
> > user-space tool implements[1] a similar feature.
>
> Thanks, I'll take a look at that.
>
[...]
>
> Thanks again for the time you are spending on these discussions.  I do
> appreciate it, and I hope I'm not taking up too much of your time.

I also appreciate your time and patience.  Please bear in mind that I
see we are near to alignment!


Thanks,
SJ

[...]