From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B3585CD3420 for ; Tue, 3 Sep 2024 10:36:29 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CD93A8D0158; Tue, 3 Sep 2024 06:36:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C62048D013C; Tue, 3 Sep 2024 06:36:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id ADBA98D0158; Tue, 3 Sep 2024 06:36:28 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 8C1E18D013C for ; Tue, 3 Sep 2024 06:36:28 -0400 (EDT) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 10CE581330 for ; Tue, 3 Sep 2024 10:36:28 +0000 (UTC) X-FDA: 82523072856.24.1C0826B Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by imf15.hostedemail.com (Postfix) with ESMTP id 2D685A000E for ; Tue, 3 Sep 2024 10:36:24 +0000 (UTC) Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=linutronix.de header.s=2020 header.b="iz/I/FCb"; dkim=pass header.d=linutronix.de header.s=2020e header.b=bD+GUwmk; dmarc=pass (policy=none) header.from=linutronix.de; spf=pass (imf15.hostedemail.com: domain of namcao@linutronix.de designates 193.142.43.55 as permitted sender) smtp.mailfrom=namcao@linutronix.de ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1725359681; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=ojtxnyS4O7KDOKCfeZk7mENAq6gaSLvpMXtpfjPtxzk=; b=8Mqai1GMZNHhp0hFmfy3JRDL0Kuu9QZmazu4D2yz05IBU5GqJdGW622mOz6QhOKtbGRS2Y kFYuSDXfBVD7sYf7DQ4Zfn1ndble0bkxh5JfFNooPDiFXU7ApiD7w1w9Yc+o52or484q6m ph344490scrb4qtbjxXuJJvkdPlSS4w= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1725359681; a=rsa-sha256; cv=none; b=kNrfO6SHpMOf5mgffThTJGHE3Yt57RCnqoMwS9AjKAMQgYuIvNK2+amqypJJbpKaw4WZP4 dYUcJ0TP6U03bx7NSSANwpBETh97hqmp7NHJs9OyDMpXJWE9PK6DI+UTtRF/8wfgEO7lWl iOGr3hx8GYiZ57WAD6tLPe07tJB5gII= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=linutronix.de header.s=2020 header.b="iz/I/FCb"; dkim=pass header.d=linutronix.de header.s=2020e header.b=bD+GUwmk; dmarc=pass (policy=none) header.from=linutronix.de; spf=pass (imf15.hostedemail.com: domain of namcao@linutronix.de designates 193.142.43.55 as permitted sender) smtp.mailfrom=namcao@linutronix.de Date: Tue, 3 Sep 2024 12:36:16 +0200 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1725359782; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=ojtxnyS4O7KDOKCfeZk7mENAq6gaSLvpMXtpfjPtxzk=; b=iz/I/FCbft6jtUwrTh0tVl+H+mfN9TRfMiDUAi8wWTz3zQXPCxaoANmFoVOftABESq1to4 5dvjoctxtV4a+siRmHISB97+4AB5CvhJ3Ud93ZcLCJh2rDP7tz7nKK0TODeHML9kLrukFz s4u2aBslOFxNsTyDTBVxlD/8rsU0M5ywT5CX4v7DiPbXWHl96XLjlaZ1fPEOZcK5m+DYvE r2d209vPF7DP0kzLprng9BKdEO5yjX1223ywNe5e8UYOYjjWnnxwMzG5uAMd5kHuBXFKrx 8Hz+24j8eyHMvBfFhJuc3FUlQNU3ZFwDqno1epeFakTcc6qQIekEOw2ivRi63A== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1725359782; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=ojtxnyS4O7KDOKCfeZk7mENAq6gaSLvpMXtpfjPtxzk=; b=bD+GUwmkP/6q6dVnEA5lDqXqdC9HzIz8QWjlSjDOCdLPwIPndG/7WkPkHgfOM2lwLKGIq5 SK4397Z2++oJCqBw== From: Nam Cao To: "Liam R. Howlett" , Dave Hansen , Andy Lutomirski , Peter Zijlstra , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H. Peter Anvin" , Andrew Morton , Vlastimil Babka , Lorenzo Stoakes , linux-kernel@vger.kernel.org, linux-mm@kvack.org, bigeasy@linutronix.de Subject: Re: [PATCH] x86/mm/pat: Support splitting of virtual memory areas Message-ID: <20240903103616.i0GrRAfD@linutronix.de> References: <20240825152403.3171682-1-namcao@linutronix.de> <5jrd43vusvcchpk2x6mouighkfhamjpaya5fu2cvikzaieg5pq@wqccwmjs4ian> <20240827075841.BO2qAOzq@linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 2D685A000E X-Stat-Signature: phpskn6ysws3e94gmdy7gxt6u94or6ps X-Rspam-User: X-HE-Tag: 1725359784-825717 X-HE-Meta: U2FsdGVkX1+KvpjznVzowRu5gkH3dblUzNT2ppxV2zL0ZMqBNVfWbcLW4/KSQiR66+BxkjxtPKktynl3nn9gwUzFgs+Znw+eWuEC7ffPjoGkKyG4nOGhuvUn5gtxlPcmr6SNP7bsCyvY8XfORsVBlNjN6PVSiCBaQ0aCYgKTK4+w8lcckVJsrleEzLViSQI6tMAmwVlssmXK4VobuwYsWHyOTEUyzBUADttgPjZFUyw9ODJ/15OKNwfSSLIM9Y5djHTJc9s9QjokZfWf0PO5bOqpkGUWIejjfm3ir3bGpY/8EyIAhm7VMQ6+6VbONPcTtZruwWiEXrHBnhAhKVAoii/4tfTPY0JNHEm160r188VKLY+wAzgd0CT9eOWqw0MYtYT8xJhkzNCX/wEvPrZHmxDWmRiLz2wHDdDwVQWJejDVPK3ajmWPDs4t75n/24SHAbxariqnzZyPWO/6eXXO+1YYVj9QYo7J0ZXaMD2WQSKluw1Y0pfYpU96Dzol5Z2GvlDXI1+eIe9Az+jtV2JkMDa8S8NFwtAPxyq8AgSKUenVQ/WdaGrOeFRk9rUdJadNVAcu/VX9qOcHytqNc+HuEYsFQvjHHVCPe8P8+2rtLjrvJrKWGlA0I1M9Y6glveh0w1y/ZO6WQ9a3Jkt+OIUe9TtNBTgvnccA9UwO9CmAMsg+wlj/Ec9g99X/fO4AkRAt7+FKTXU+vvWeJ2Q6wZlEXTx0XUhz1d+s+/LX4pBmCU1QpmzuZvRt5R19tlcGwAdcP0awa9wmtMMS07d+drVxHuqodlVPSi9NCuhbgDpAoBx2AsQ2+FMDAnBAcCexfhY7crjaX8trUj4IP1HaLH3bSROF6rfqitYnEp9o61Rfql5qKogRCEBrXH30qFIoBUpu+fdQi8hsDnwleeK8APGvpJE9VWeUCirSO8SVNVekdMgEhihWraAzmftZvA6xuVsFQTkyObkK92nHRFxEGuG qZRqkZME gU7R76kF4dWVHrKyjjIR2A3tEaogxBoi6oZvEsgUiX/VjAHcn5e02jqbd5vgGJ5ZoMKfnRDtYRFJ8RtE81DYXUJuNr/y9DY5S/RP11uJ0l5U9lBebrsnbpc+qSFfPqA6ks7gj X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Sorry for the late reply, I was a bit busy, and needed some time to digest your email. On Tue, Aug 27, 2024 at 12:01:28PM -0400, Liam R. Howlett wrote: > * Nam Cao [240827 03:59]: > > On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote: > > > * Nam Cao [240825 11:29]: > > > > When a virtual memory area (VMA) gets splitted, memtype_rbroot's entries > > > > are not updated. This causes confusion later on when the VMAs get > > > > un-mapped, because the address ranges of the splitted VMAs do not match the > > > > address range of the initial VMA. > > > > > > > > For example, if user does: > > > > > > > > fd = open("/some/pci/bar", O_RDWR); > > > > addr = mmap(0, 8192, PROT_READ, MAP_SHARED, fd, 0); > > > > mprotect(addr, 4096, PROT_READ | PROT_WRITE); > > > > munmap(p, 8192); > > What is p? By the comments below, you mean addr here? Yes, it should be addr. Sorry about that. > > > > > > > > > with the physical address starting from 0xfd000000, the range > > > > (0xfd000000-0xfd002000) would be tracked with the mmap() call. > > > > > > > > After mprotect(), the initial range gets splitted into > > > > (0xfd000000-0xfd001000) and (0xfd001000-0xfd002000). > > > > > > > > Then, at munmap(), the first range does not match any entry in > > > > memtype_rbroot, and a message is seen in dmesg: > > > > > > > > x86/PAT: test:177 freeing invalid memtype [mem 0xfd000000-0xfd000fff] > > > > > > > > The second range still matches by accident, because matching only the end > > > > address is acceptable (to handle shrinking VMA, added by 2039e6acaf94 > > > > (x86/mm/pat: Change free_memtype() to support shrinking case)). > > > > > > Does this need a fixes tag? > > > > Yes, it should have > > Fixes: 2e5d9c857d4e ("x86: PAT infrastructure patch") > > thanks for the reminder. > > That commit is from 2008, is there a bug report on this issue? Not that I am aware of. I'm not entirely sure why, but I would guess due to the combination of: - This is not an issue for pages in RAM - This only happens if VMAs are splitted - The only user-visible effect is merely a pr_info(), and people may miss it. I only encountered this issue while "trying to be smart" with mprotect() on a portion of mmap()-ed device memory, I guess probably not many people do that. > > > > > > > > > > > > > > Make sure VMA splitting is handled properly, by splitting the entries in > > > > memtype_rbroot. > > > > > > > > Signed-off-by: Nam Cao > > > > --- > > > > arch/x86/mm/pat/memtype.c | 59 ++++++++++++++++++++++++++++++ > > > > arch/x86/mm/pat/memtype.h | 3 ++ > > > > arch/x86/mm/pat/memtype_interval.c | 22 +++++++++++ > > > > include/linux/pgtable.h | 6 +++ > > > > mm/mmap.c | 8 ++++ > > > > 5 files changed, 98 insertions(+) > > > > > ... > > > > > > > It is also a bit odd that you check VM_PFNMAP() here, then call a > > > function to check another flag? > > > > Right, this check is redundant, thanks for pointing it out. > > > > I stole this "style" from unmap_single_vma(), but I think the check is > > redundant there as well. > > If you have identified a redundant check, can you please remove it with > a separate patch? Sure. > > > > > > > > > > + err = track_pfn_split(vma, addr); > > > > + if (err) > > > > + goto out_vma_unlink; > > > > + } > > > > + > > > > > > I don't think the __split_vma() location is the best place to put this. > > > Can this be done through the vm_ops->may_split() that is called above? > > > > I don't think ->may_split() is a suitable place. Its name gives me the > > impression that it only checks whether it is okay to split the VMA, but not > > really does any splitting work. Also that function pointer can be > > overwritten by any driver. > > It's a callback that takes the arguments you need and is called as long > as it exists. Your function would deny splitting if it failed, so it > may not split in that case. > > Also, any driver that overwrites it should do what is necessary for PAT > then. I don't love the idea of using the vm_ops either, I just like it > better than dropping in flag checks and arch-specific code. I can see > issue with using the callback and drivers that may have their own vma > mapping that also use PAT, I guess. Yeah I don't love this. You mentioned another approach below, which I think would be the best (if it's possible). I will attempt that other approach. > > > > > > > This is arch independent code that now has an x86 specific check, and > > > I'd like to keep __split_vma() out of the flag checking. > > > > I think these track_pfn_*() functions are meant to be arch-independent, > > it's just that only x86 implements it at the moment. For instance, > > untrack_pfn() and track_pfn_remap() are called in mm/ code. > > > > Arch-independent wrappers that are only used by one arch are not > arch-independent. PAT has been around for ages and only exists for x86 > and x86_64. > > We just went through removing arch_unmap(), which was used just for ppc. > They cause problems for general mm changes and just get in the way. If > we can avoid them, we should. > > memtype_interval.c doesn't have any knowledge of the vmas, so you have > this extraction layer in memtype.c that is being bypassed here for the > memtype_erase(); ensuring the start-end match or at least the end > matches. > > So your comment about the second range still matching by accident is > misleading - it's not matched at all because you are searching for the > exact match or the end address being the same (which it isn't in your > interval tree). But the second range *does* match, because the end address match? The second range is (0xfd001000-0xfd002000), which matches with (0xfd000000-0xfd002000) in the interval tree. Perhaps I should be clearer in the description.. > > Taking a step back here, you are splitting a range in an interval tree > to match a vma split, but you aren't splitting the range based on PAT > changing; you are splitting it based on the vma becoming two vmas. > > Since VM_PFNMAP is in VM_SPECIAL, the splitting is never undone and will > continue to fragment the interval tree, so even if flags change back to > match each other there will always be two vams - and what changed may > not even be the PAT. Right, I did not consider this scenario. > > So the interval split should occur when the PAT changes and needs to be > tracked differently. This does not happen when the vma is split - it > happens when a vma is removed or when the PAT is changed. > > And, indeed, for the mremap() shrinking case, you already support > finding a range by just the end and have an abstraction layer. The > problem here is that you don't check by the start - but you could. You > could make the change to memtype_erase() to search for the exact, end, > or start and do what is necessary to shrink off the front of a region as > well. I thought about this solution initially, but since the interval tree allow overlapping ranges, it can be tricky to determine the "best match" out of the overlapping ranges. But I agree that this approach (if possible) would be better than the current patch. Let me think about this some more, and I will come back later. > > What I find very strange is that 2039e6acaf94 ("x86/mm/pat: Change > free_memtype() to support shrinking case") enables shrinking of > VM_PFNMAP, but doesn't allow shrinking the end address. Why is one > allowed and the other not allowed? Not really sure what you mean. I think you are ultimately asking why that commit only matches end address, and not start address? That's because mremap() may shrink a VMA from [start, end] to [start, new_end] (with new_end < end). In that case, the range [new_end, end] would be removed from the interval tree, and that commit wants to match [new_end, end] to [start, end]. And I don't think mremap() can shrink [start, end] to [new_start, end]? Thanks for sharing your thoughts. Best regards, Nam