Date: Thu, 29 Aug 2024 15:16:53 -0700
From: Charlie Jenkins
To: Lorenzo Stoakes
Cc: Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
 Vineet Gupta, Russell King, Guo Ren, Huacai Chen, WANG Xuerui,
 Thomas Bogendoerfer, "James E.J. Bottomley", Helge Deller,
 Michael Ellerman, Nicholas Piggin, Christophe Leroy, Naveen N Rao,
 Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
 Christian Borntraeger, Sven Schnelle, Yoshinori Sato, Rich Felker,
 John Paul Adrian Glaubitz, "David S. Miller", Andreas Larsson,
 Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
 x86@kernel.org, "H. Peter Anvin", Andy Lutomirski, Peter Zijlstra,
 Muchun Song, Andrew Morton, "Liam R. Howlett", Vlastimil Babka,
 Shuah Khan, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-alpha@vger.kernel.org, linux-snps-arc@lists.infradead.org,
 linux-arm-kernel@lists.infradead.org, linux-csky@vger.kernel.org,
 loongarch@lists.linux.dev, linux-mips@vger.kernel.org,
 linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 linux-s390@vger.kernel.org, linux-sh@vger.kernel.org,
 sparclinux@vger.kernel.org, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org
Subject: Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
References: <20240829-patches-below_hint_mmap-v2-0-638a28d9eae0@rivosinc.com>
 <4e1e9f49-8da4-4832-972b-2024d623a7bb@lucifer.local>
In-Reply-To: <4e1e9f49-8da4-4832-972b-2024d623a7bb@lucifer.local>

On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits of addresses allocated
> > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> >
> > I'm still confused as to why, if an mmap flag is desired, and thus programs
> > are having to be heavily modified and controlled to be able to do this, why
> > you can't just do an mmap() with PROT_NONE early, around a hinted address
> > that sits below the required limit, and then mprotect() or mmap() over it?
> >
> > Your feature is a major adjustment to mmap(), it needs to be pretty
> > significantly justified, especially if taking up a new flag.
> >
> > >
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown if
> > > attempted to run on the 57-bit address space, called sv57 [1]. golang
> > > has a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].
> > >
> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() addresses that contain a hint address that is less
> > > than 48 bits.
> >
> > You mean x86 _has_ to limit to physically available bits in a canonical
> > format :) this will not be the case for 5-page table levels though...

I might be misunderstanding but I am not talking about pointer masking
or canonical addresses here. I am referring to the pattern of:

1. Getting an address from mmap()
2. Writing data into bits assumed to be unused in the address
3. Using the data stored in the address
4. Clearing the data from the address and sign extending
5. Dereferencing the now sign-extended address to conform to canonical
   addresses

I am just talking about steps 1 and 2 here -- getting an address from
mmap() that only uses bits that will allow your application to not
break. How canonicalization happens is a separate conversation, that
can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv.
While LAM for x86 is only capable of masking addresses to 48 or 57 bits,
Ssnpm for riscv allows an arbitrary number of bits to be masked out.
A design goal here is to be able to support all of the pointer masking
flavors, and not just x86.

> >
> > >
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.
> >
> > I raised this last time and you didn't seem to address it so to be more
> > blunt:
> >
> > I don't understand why this needs to be an mmap() flag. From this it seems
> > the whole process needs allocations to be below a certain limit.

Yeah making it per-process does seem logical, as it would help with
pointer masking.
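
For reference, the five-step pattern listed earlier can be sketched
roughly as below. This is only an illustration, not code from OpenJDK or
Go: the helper names, the choice of 16 tag bits (TAG_SHIFT = 48) and the
plain mmap() call with no hint are all assumptions, and it assumes a
64-bit Linux target with an arithmetic right shift on signed values.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: the application wants bits 48..63 of each pointer for data. */
#define TAG_SHIFT 48
#define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)

/*
 * Step 1: get an address from mmap() and reject it if it already uses
 * the bits the application wants to repurpose.
 */
static inline void *alloc_taggable(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if ((uint64_t)(uintptr_t)p & ~ADDR_MASK) {
		munmap(p, len);
		return NULL;
	}
	return p;
}

/* Step 2: write data into the bits assumed to be unused. */
static inline uint64_t tag_pointer(void *p, uint16_t tag)
{
	assert(((uint64_t)(uintptr_t)p & ~ADDR_MASK) == 0);
	return (uint64_t)(uintptr_t)p | ((uint64_t)tag << TAG_SHIFT);
}

/* Step 3: use the data stored in the address. */
static inline uint16_t get_tag(uint64_t tagged)
{
	return (uint16_t)(tagged >> TAG_SHIFT);
}

/*
 * Step 4: clear the data and sign-extend from bit 47. The result is
 * what step 5 dereferences. Relies on the compiler implementing
 * signed right shift as an arithmetic shift (true for gcc and clang).
 */
static inline void *untag_pointer(uint64_t tagged)
{
	int64_t v = (int64_t)(tagged << (64 - TAG_SHIFT));

	return (void *)(uintptr_t)(v >> (64 - TAG_SHIFT));
}
```

The check in alloc_taggable() is the part that breaks once the kernel is
free to return addresses above bit 47; everything after it is unaffected
by where the limit comes from.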

> >
> > That _could_ be achieved through a 'personality' or similar (though a
> > personality is on/off, rather than allowing configuration so maybe
> > something else would be needed).
> >
> > From what you're saying 57-bit is all you really need right? So maybe
> > ADDR_LIMIT_57BIT?

Addresses will always be limited to 57 bits on riscv and x86 (but not
necessarily on other architectures). A flag like that would have no
impact, so I do not understand what you are suggesting. This patch is to
have a configurable number of bits be restricted.

If anything, a personality that was ADDR_LIMIT_48BIT would be the
closest to what I am trying to achieve, since the issue is that
applications fail to work when the address space is greater than 48
bits.

> >
> > I don't see how you're going to actually enforce this in a process either
> > via an mmap flag, as a library might decide not to use it, so you'd need to
> > control the allocator, the thread library implementation, and everything
> > that might allocate.

It is reasonable to change the implementation to be per-process but that
is not the current proposal.

This flag was designed for applications which already directly manage
all of their addresses like OpenJDK and Go.

This flag implementation was an attempt to make this feature as
minimally invasive as possible to reduce maintenance burden and
implementation complexity.

> >
> > Liam also raised various points about VMA particulars that I'm not sure are
> > addressed either.
> >
> > I just find it hard to believe that everything will fit together.
> >
> > I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm
> > just not.
> >
> > >
> > > This flag will also allow seamless compatibility between all
> > > architectures, so applications like Go and OpenJDK that use bits in a
> > > virtual address can request the exact number of bits they need in a
> > > generic way.
> > > The flag can be checked inside of vm_unmapped_area() so
> > > that this flag does not have to be handled individually by each
> > > architecture.
> >
> > I'm still very unconvinced and feel the bar needs to be high for making
> > changes like this that carry maintainership burden.
> >

I may be naive but what is the burden here? It's two lines of code to
check MAP_BELOW_HINT and restrict the address. There are the additional
flags for hint and mmap_addr but those are also trivial to implement.

> > So for me, it's a no really as an overall concept.
> >
> > Happy to be convinced otherwise, however... (I may be missing details or
> > context that provide more justification).
> >
>
> Some more thoughts:
>
> * If you absolutely must keep allocations below a certain limit, you'd
>   probably need to actually associate this information with the VMA so the
>   memory can't be mremap()'d somewhere invalid (you might not control all
>   code so you can't guarantee this won't happen).
> * Keeping a map limit associated with a VMA would be horrid and keeping
>   VMAs as small as possible is a key aim, so that'd be a no go. VMA flags
>   are in limited supply also.

Yes that does seem like it would be challenging.

> * If we did implement a per-process thing, but it were arbitrary, we'd then
>   have to handle all kinds of corner cases forever (this is UAPI, can't
>   break it etc.) with crazy-low values, or determine a minimum that might
>   vary by arch...

Throwing an error if the value is determined to be "too low" seems
reasonable.

> * If we did this we'd absolutely have to implement a check in the brk()
>   implementation, which is a very very sensitive bit of code. And of
>   course, in mmap() and mremap()... and any arch-specific code that might
>   interface with this stuff (these functions are hooked).
> * A fixed address limit would make more sense, but it seems difficult to
>   know what would work for everybody, and again we'd have to deal with edge
>   cases and having a permanent maintenance burden.

A fixed value is not ideal, since a single size probably would not be
sufficient for every application. However if necessary we could fix it
to 48 bits since arm64 and x86 already do that, and that would still
allow a generic way of defining this behavior.

> * If you did have a map flag what about merging between VMAs above the
>   limit and below it? To avoid that you'd need to implement some kind of a
>   'VMA flag that has an arbitrary characteristic' or a 'limit' field,
>   adjust all the 'can VMA merge' functions and write extensive testing and
>   none of that is frankly acceptable.
> * We have some 'weird' arches that might have problem with certain virtual
>   address ranges or require arbitrary mappings at a certain address range
>   that a limit might not be able to account for.
>
> I'm absolutely opposed to a new MAP_ flag for this, but even if you
> implemented that, it implies a lot of complexity.
>
> It implies even more complexity if you implement something per-process
> except if it were a fixed limit.
>
> And if you implement a fixed limit, it's hard to see that it'll be
> acceptable to everybody, and I suspect we'd still run into some possible
> weirdness.
>
> So again, I'm struggling to see how this concept can be justified in any
> form.

The piece I am missing here is that this idea is already being used by
x86 and arm64. They implicitly force all allocations to be below the
47-bit boundary if the hint address is below 47 bits. This flag is much
less invasive because it is opt-in and will not impact any existing
code. I am not familiar enough with all of the interactions spread
throughout mm to know how these architectures have managed to ensure
that this 48-bit limit is enforced across things like mremap() as well.
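
To make the comparison concrete, here is a rough sketch of what an
application can do from userspace today without any new flag: pass a
hint just under the limit it cares about and verify the result
afterwards. The helper name mmap_below_bits() is hypothetical, the
4 KiB alignment of the hint is an assumption, and it deliberately does
not use the proposed MAP_BELOW_HINT. Since the hint is only advisory,
the check after mmap() is the only thing that enforces the limit, and
the caller has no way to ask the kernel to retry lower.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map len bytes and accept the mapping only if it
 * lies entirely below 2^bits. Returns NULL if mmap() fails or if the
 * kernel placed the mapping at or above the limit.
 */
static inline void *mmap_below_bits(size_t len, unsigned int bits)
{
	assert(bits > 0 && bits < 64);

	const uint64_t limit = UINT64_C(1) << bits;

	assert(len > 0 && len < limit);

	/* Advisory hint just under the limit, rounded down to 4 KiB. */
	void *hint = (void *)(uintptr_t)((limit - len) & ~UINT64_C(0xfff));
	void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* The kernel may ignore the hint, so verify the placement. */
	if ((uint64_t)(uintptr_t)p + len > limit) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

On x86 and arm64 a call such as mmap_below_bits(4096, 47) is expected to
succeed because of the implicit restriction described above. On an
architecture without that behavior the same call can simply fail.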

Are you against the idea that there should be a standard way for
applications to consistently obtain addresses that have free bits, or
are you just against this implementation? From your statement I assume
you mean that every architecture should continue to have varying
behavior and separate implementations for supporting larger address
spaces.

- Charlie