From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86531C3ABB2 for ; Wed, 28 May 2025 09:36:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2660B6B008A; Wed, 28 May 2025 05:36:56 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 23CE16B0092; Wed, 28 May 2025 05:36:56 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 17A376B0093; Wed, 28 May 2025 05:36:56 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id ECF2F6B008A for ; Wed, 28 May 2025 05:36:55 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 89D1980717 for ; Wed, 28 May 2025 09:36:55 +0000 (UTC) X-FDA: 83491812390.14.446D324 Received: from mail-ua1-f49.google.com (mail-ua1-f49.google.com [209.85.222.49]) by imf16.hostedemail.com (Postfix) with ESMTP id A10B3180007 for ; Wed, 28 May 2025 09:36:53 +0000 (UTC) Authentication-Results: imf16.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=gJrsDkGD; spf=pass (imf16.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.49 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748425013; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=F4H4mltHWA4Hz50a14glpW/QRK0obPDlc6uSYk/4ZI4=; b=LSvObSUBfOYb9y8eis3mOPbbC7Q0MUDNFEOrgTmVMju0Hbcfubdt20It7qXlc7QpRQjTGY nSxeQHqG0mG4agKN8DjPB8EptUkrxn/zQn7FwFuaxa4qx39NOZi5rnIRKB2RkRq4AMyCrP xTaLyW6kk6VqxNqCreWa3XTSarYW5R4= ARC-Authentication-Results: i=1; imf16.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=gJrsDkGD; spf=pass (imf16.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.49 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748425013; a=rsa-sha256; cv=none; b=lyN5Cpe+xj8nbJX3rIVNAOqnUj68/ShgkgOBHoNvpSf1HL7TvgT97e1OnMvSJvGTbzLaJx qZDGWk7o3luRzCQ1QLErLHI5yCRYRUViE2VfyaWVKQr5dwMEt7GkcYb7vnkqW1IZuAkOdy V3xQ/IhI0lWSxLMuEurZ/j2S3TVMLA4= Received: by mail-ua1-f49.google.com with SMTP id a1e0cc1a2514c-87de47585acso2805879241.3 for ; Wed, 28 May 2025 02:36:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1748425013; x=1749029813; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=F4H4mltHWA4Hz50a14glpW/QRK0obPDlc6uSYk/4ZI4=; b=gJrsDkGDR5cQLNeIoF9a5XHrIdNjEAhzBmoneqpQ6xo5nYZGmyU54FhOs9iePMgep0 uiswPu22tXIhMm24cp7Po8sPx9+wmIXQ9gYkpIRCtNJ2EgZONa/4liiXKftnrUsuC5RM xL+3EeAuOE/lxGEWeHzxvuMHmrkGMWez2tQnN1UeP6RsZ/XxJQYDinw5cnkKIB/6GtkE J+JgCQRHaLr37uBs+AbI448+9egPNQAZ0p4rOtpQGLm3uXVcsj2+jjg5jIMB7FlHNpIS grCMX80RrbtgkdvSXjHjrJ5ww3NCmQZ7Af1p4njIaExV6Bms/V09mQY7kf+7PuFP230s rUQA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748425013; x=1749029813; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=F4H4mltHWA4Hz50a14glpW/QRK0obPDlc6uSYk/4ZI4=; b=N9ddUXWqaj2hp7JuWTGk0I9/KAr26bEr6hv/PnWbOvEROQ6XxTjgCEGgshj+lbShMa cdOlg93qc/KJxKLkmDVzcj4Ikad8uMLyUXcv8v2T4snflGOxY/2PK+JB7M8nvcmy5tH1 6Lh3AzJy/Dq1HPxkTw+HjP8SpZ2EagOgYJkMMJZQTGtB1FGWVa7LZTP/lv+YGjAsqTJP 3jIluKt7e6a/GE4IfVfttntjAgGymo2skfp/wMiSjEp3a29pjyxfW6K3R1fYyiod21t4 Zv+37dmbrvBWOYHXon1xoK8MTvd1aP2Y8MdfNsls6Pw59MKFT704VB4WrWhzK+5ouhQ4 b/Uw== X-Forwarded-Encrypted: i=1; AJvYcCX6rVte+T0jpshaFDJ23jaJ5SS9q5uVONTFksygI3CZoyz6wmalhHB2rhOJMzGMbAnSM3J3YuyVIg==@kvack.org X-Gm-Message-State: AOJu0YwWvyDCxAbnogpQ1PrSSL/SnvkKN4JhrK1L5fNJB9SvXzR4LVDY vITX2Jv5G8N9rcilfiwuOFsJC5LhnnFwDPk8qWKilvNZMDDRMD12SXmD3VKbl8O+LaweeOBjytE eNZrhh1WqVTQffj4xprOfcgzjTHi9dpc= X-Gm-Gg: ASbGncu0VNnafHmJnREUEcqkG9xVHSvl/4aCcB6SlRppdo3Ij0hJlCwkhaoeCriXbkc UVUWrNy+leKvBnRKHiPxFzu2OqsEb+LFbpVw4k1g6nDKzrzIlrIDJaz+S2J/e252h5/xi9SnpyD TkyVryMwp4r2vr6T1uiH08y4OIbe0FFfAZaQ== X-Google-Smtp-Source: AGHT+IFPS3+nOyu5GsyZnyt+mdeoMYzphKgXldIOhAR0xUYhOH++PX1IaDSOYuVo4ytjgwf46RsusX5qk/oLVlEUV7s= X-Received: by 2002:a05:6102:161f:b0:4e5:9c06:39d8 with SMTP id ada2fe7eead31-4e59c063b13mr1566237137.5.1748425012652; Wed, 28 May 2025 02:36:52 -0700 (PDT) MIME-Version: 1.0 References: <20250527044145.13153-1-21cnbao@gmail.com> <93385672-927f-4de5-a158-fc3fc0424be0@lucifer.local> In-Reply-To: <93385672-927f-4de5-a158-fc3fc0424be0@lucifer.local> From: Barry Song <21cnbao@gmail.com> Date: Wed, 28 May 2025 17:36:40 +0800 X-Gm-Features: AX0GCFuEZj13XFVajxQytEEJ3CxbWz8WhpholMp_an_EqsoYNNGJof5-x-h5N-o Message-ID: Subject: Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED To: Lorenzo Stoakes Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song , "Liam R. Howlett" , David Hildenbrand , Vlastimil Babka , Jann Horn , Suren Baghdasaryan , Lokesh Gidra , Tangquan Zheng Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: ph8don1tmef9q8m71r1hfa3j8c6kdcuq X-Rspamd-Queue-Id: A10B3180007 X-Rspam-User: X-Rspamd-Server: rspam02 X-HE-Tag: 1748425013-568145 X-HE-Meta: U2FsdGVkX18TFifAV9rqEUqemKQe8j0pWhm3pomdzD7paU4wl/46RHoBE3/t5OY8NY1DTdusPY9hBKpXJ1LKwGiZwPYF1JyXxy0xouB8Q29pZof0E4uFhzDgohBZTX9Iau0FvMjNrQ2U+57urh5ss0QJWbN3utzOfjcX0mkyniopm/0aUgopUyD8EQxZSYgwziZKz2PvxkZLKijxecDsHpci8TjGbqB2XRPuED1L6LrC8pK6GiFGU60ozkLlrlU9qH1LokQA6Q7Nv5nk0Lb2xC8KwFprVi4AugCawktJoYXe9TrzkD6XV4Snq8IdyrVsyJfaqOW0buEHxlOENPZg76+nFlrLoIJ//HwY/pJYT4FM9MsmE1zPCj9PnSnSS35oa4VCuEjFJ9vyEMCK+lJjXUQO+4GPc1MnB60oM3DjrPlqE61eyB2KwZfsZ9+xbQpEM1h/Jvl3hMM8V56Eatn8I2jhDJBHASYSMr7vYBSEl2j38vFvUhdi7A2eo01Hg60U+Q+wEB/Buccky2+GOSNg0603aVcgpvZLTEfk0oYrcSC5UjHD/ofB8icj8zJlRKH9WT9HzdBk10/p+2qjb7ULPmt+unXhEEy/BPsvcBU8z+RGBf5iS1IZz55UwSlejgXx2X+Gv+NPfMVBma2nmWlfuAUutGljfYQ8XkEbNkMJJVYAbqcLzHJ3MW+w8Z5Jrd5y+ScpWhNCfXmNgQekPs2jvSLLWXi27pllxlRrvTLz8xN40zpIDwxGjpn/S7HDcPUVFolB+gxy1IX2/fqBi8fCgriiGhau2/87btowJhfFIsY6Cq1GW/XGFf5omu2ebJEg0sBYRij8itrl4I3LF+5Y+pQuOKHSCW2FWeIaor7to7WVn+dKSE7sJBd54KwvfA3v+9EK4V7wbIEjacSxVUs42wLxUn7qLRvH5t/LXmPyR/7a5jdqeQUuDHZNmkntzupHMRmaYD+2AwbBmviLKqX m5R/5+F9 iTietXHgUc6oC4I/hvhGz8d/GtD6Tbrscjq80ILo86PmoY5jbu4YiYWCwsRbLMI9CPOm86Au4hSwg5sGQ+X7ML4J6dQBSCunWN/jX3F9qV2KQFVOESzcFALj+hPMnrWpeMhheNFaUOEfAqHyBGoOrnHfCUGhG4yZoymEVlljI2cflWIB2S1k3l78EFutNeaJdphIFT5WF828LFAEPiHNN1uh2hA4kqkmDwgB55BTMR1RoU40/0u80DoL1Mqeu4BUV144VWQaixkSpA0iZZDrFGJAtlfusqUmISNCeMy5QN5iKVD2jESE1LSsjwNg3aJz8/NyJLC7DKKs/ag40vRMc7/e4SUNaPHX0RqvAxV/suxoPghJzmBJrJIEYsAelyixtRoMnNhPRy7xn01uyvOg5QBzj123fVWdG7a9NXnWR4YDIxYjP7hR3xoYoWK5ALTR6EZ884dDFYo71SWUPw08RNwbhdLLytlAHDcTIZ8JUAbb+HHNFKtSrwjMvrQ8WNVTe0QBV X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, May 27, 2025 at 5:20=E2=80=AFPM Lorenzo Stoakes wrote: > > Overall - thanks for this, and I'm not sure why we didn't think of doing > this sooner :P this seems like a super valid thing to try to use the vma > lock with. > > I see you've cc'd Suren who has the most expertise in this and can > hopefully audit this and ensure all is good, but from the process address > doc (see below), I think we're good to just have the VMA stabilised for a > zap. > > On Tue, May 27, 2025 at 04:41:45PM +1200, Barry Song wrote: > > From: Barry Song > > > > Certain madvise operations, especially MADV_DONTNEED, occur far more > > frequently than other madvise options, particularly in native and Java > > heaps for dynamic memory management. > > Ack yeah, I have gathered that this is the case previously. > > > > > Currently, the mmap_lock is always held during these operations, even w= hen > > unnecessary. This causes lock contention and can lead to severe priorit= y > > inversion, where low-priority threads=E2=80=94such as Android's HeapTas= kDaemon=E2=80=94 > > hold the lock and block higher-priority threads. > > That's very nasty... we definitely want to eliminate as much mmap_lock > contention as possible. > > > > > This patch enables the use of per-VMA locks when the advised range lies > > entirely within a single VMA, avoiding the need for full VMA traversal.= In > > practice, userspace heaps rarely issue MADV_DONTNEED across multiple VM= As. > > Yeah this single VMA requirement is obviously absolutely key. > > As per my docs [0] actually, for zapping a single VMA, 'The VMA need only= be > kept stable for this operation.' (I had to look this up to remind myself = :P) > > [0]: https://kernel.org/doc/html/latest/mm/process_addrs.html > > So we actually... should be good here, locking-wise. > > > > > Tangquan=E2=80=99s testing shows that over 99.5% of memory reclaimed by= Android > > benefits from this per-VMA lock optimization. After extended runtime, > > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while > > only 1,231 fell back to mmap_lock. > > Thanks, this sounds really promising! > > I take it then you have as a result, heavily tested this change? This was extensively tested on an older Android kernel with real devices. As you know, running the latest mm-unstable on phones is challenging due to hardware driver constraints. However, I believe the reported percentage is accurate, since it seems pointless for userspace heaps to free memory across two or more VMAs. How could it possibly manage a slab-like system spanning multiple VMAs? > > > > > To simplify handling, the implementation falls back to the standard > > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity= of > > userfaultfd_remove(). > > Oh GOD do I hate how we implement uffd. Have I ever mentioned that? Well, > let me mention it again... > > > > > Cc: "Liam R. Howlett" > > Cc: Lorenzo Stoakes > > Cc: David Hildenbrand > > Cc: Vlastimil Babka > > Cc: Jann Horn > > Cc: Suren Baghdasaryan > > Cc: Lokesh Gidra > > Cc: Tangquan Zheng > > Signed-off-by: Barry Song > > --- > > mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 34 insertions(+) > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 8433ac9b27e0..da016a1d0434 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned lo= ng start, size_t len_in, int beh > > > > if (madvise_should_skip(start, len_in, behavior, &error)) > > return error; > > + > > + /* > > + * MADV_DONTNEED is commonly used with userspace heaps and most o= ften > > + * affects a single VMA. In these cases, we can use per-VMA locks= to > > + * reduce contention on the mmap_lock. > > + */ > > + if (behavior =3D=3D MADV_DONTNEED || behavior =3D=3D MADV_DONTNEE= D_LOCKED) { > > So firstly doing this here means process_madvise() doesn't get this benef= it, and > we're inconsistent between the two which we really want to avoid. > > But secondly - we definitely need to find a better way to do this :) this > basically follows the 'ignore the existing approach and throw in an if > (special case) { ... }' pattern that I feel we really need to do all we c= an > to avoid in the kernel. > > This lies the way of uffd, hugetlb, and thus horrors beyond imagining. > > I can see why you did this as this is kind of special-cased a bit, and we > already do this kind of thing all over the place but let's try to avoid > this here. > > So I suggest: > > - Remove any code for this from do_madvise() and thus make it available t= o > process_madvise() also. > > - Try to avoid the special casing here as much as humanly possible :) > > - Update madvise_lock()/unlock() to get passed a pointer to struct > madvise_behavior to which we can add a boolean or even better I think - > an enum indicating which lock type was taken (this can simplify > madvise_unlock() also). > > - Update madvise_lock() to do all of the checks below, we already > effectively do a switch (behavior) so it's not so crazy to do this. And > you can also do the fallthrough logic there. > > - Obviously madvise_unlock() can be updated to do vma_end_read(). I=E2=80=99ve definitely considered refactoring madvise_lock, madvise_do_beh= avior, and madvise_unlock to encapsulate the details of the per-VMA locking and mmap_lock handling within those functions: madvise_lock(mm, behavior); madvise_do_behavior(mm, start, len_in, behavior); madvise_unlock(mm, behavior); However, I=E2=80=99m a bit concerned that this approach might make the code= messier by introducing extra arguments to handle different code paths. For instance= , madvise_do_behavior might need an additional parameter to determine whether VMA traversal via madvise_walk_vmas is necessary. That said, I=E2=80=99ll give it a try and see if it actually turns out to b= e as ugly as I fear. > > > + struct vm_area_struct *prev, *vma; > > + unsigned long untagged_start, end; > > + > > + untagged_start =3D untagged_addr(start); > > + end =3D untagged_start + len_in; > > + vma =3D lock_vma_under_rcu(mm, untagged_start); > > + if (!vma) > > + goto lock; > > + if (end > vma->vm_end || userfaultfd_armed(vma)) { > > + vma_end_read(vma); > > + goto lock; > > + } > > + if (unlikely(!can_modify_vma_madv(vma, behavior))) { > > + error =3D -EPERM; > > + vma_end_read(vma); > > + goto out; > > + } > > + madvise_init_tlb(&madv_behavior, mm); > > + error =3D madvise_dontneed_free(vma, &prev, untagged_star= t, > > + end, &madv_behavior); > > + madvise_finish_tlb(&madv_behavior); > > + vma_end_read(vma); > > + goto out; > > + } > > + > > +lock: > > error =3D madvise_lock(mm, behavior); > > if (error) > > return error; > > @@ -1825,6 +1858,7 @@ int do_madvise(struct mm_struct *mm, unsigned lon= g start, size_t len_in, int beh > > madvise_finish_tlb(&madv_behavior); > > madvise_unlock(mm, behavior); > > > > +out: > > return error; > > } > > > > -- > > 2.39.3 (Apple Git-146) > > > > Cheers, Lorenzo Thanks Barry