Date: Fri, 30 May 2025 22:40:47 +0200
From: Jann Horn <jannh@google.com>
To: Lorenzo Stoakes
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song,
 "Liam R. Howlett", David Hildenbrand, Vlastimil Babka,
 Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
In-Reply-To: <002aa917-d952-491d-800c-88a0476ac02f@lucifer.local>
References: <20250530104439.64841-1-21cnbao@gmail.com>
 <002aa917-d952-491d-800c-88a0476ac02f@lucifer.local>

On Fri, May 30, 2025 at 4:34 PM Lorenzo Stoakes wrote:
> Barry - I was going to come back to this later, but Jann's sort of bumped
> this in my inbox.
>
> This implementation isn't quite what I was after, would you give me a
> little bit before a respin so I can have a think about this and make
> sensible suggestions?
>
> Thanks!
>
> On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote:
> > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
> > One important quirk of this is that it can, from what I can see, cause
> > freeing of page tables (through pt_reclaim) without holding the mmap
> > lock at all:
> >
> > do_madvise [behavior=MADV_DONTNEED]
> >   madvise_lock
> >     lock_vma_under_rcu
> >   madvise_do_behavior
> >     madvise_single_locked_vma
> >       madvise_vma_behavior
> >         madvise_dontneed_free
> >           madvise_dontneed_single_vma
> >             zap_page_range_single_batched [.reclaim_pt = true]
> >               unmap_single_vma
> >                 unmap_page_range
> >                   zap_p4d_range
> >                     zap_pud_range
> >                       zap_pmd_range
> >                         zap_pte_range
> >                           try_get_and_clear_pmd
> >                           free_pte
> >
> > This clashes with the assumption in walk_page_range_novma() that
> > holding the mmap lock in write mode is sufficient to prevent
> > concurrent page table freeing, so it can probably lead to page table
> > UAF through the ptdump interface (see ptdump_walk_pgd()).
>
> Hmmmmmm is this because of the series that allows page table freeing on
> zap... I think Zi's?

Yeah, that was Qi Zheng's
https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/ .

> We need to update the documentation on this then... which currently states
> the VMA need only be stable.
>
> I guess this is still the case except for the novma walker you mention.
>
> Relatedly, it's worth looking at Dev's series which introduces a concerning
> new 'no lock at all' mode to the page table walker explicitly for novma. I
> cc'd you :) See [0].
>
> [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/

Yeah, I saw that you CC'ed me; at a first glance that seems relatively
innocuous to me as long as it's only done for kernel mappings where all
the rules are different.

> >
> > I think before this patch can land, you'll have to introduce some new
> > helper like:
> >
> > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > {
> >         mmap_write_lock(mm);
> >         for_each_vma(vmi, vma)
> >                 vma_start_write(vma);
> > }
> >
> > and use that in walk_page_range_novma() for user virtual address space
> > walks, and update the comment in there.
>
> What dude? No, what? Marking literally all VMAs write locked? :/
>
> I think this could have unexpected impact no? We're basically disabling VMA
> locking when we're in novma, that seems... really silly?

I mean, walk_page_range_novma() being used on user virtual address space
is pretty much a debug-only thing, I don't think it matters if it has to
spend time poking flags in a few thousand VMAs.

I guess the alternative would be to say "ptdump just doesn't show entries
between VMAs, which shouldn't exist in the first place", and change
ptdump to do a normal walk that skips over userspace areas not covered
by a VMA. Maybe that's cleaner.

But FWIW, we already do worse than what I proposed here when installing
MMU notifiers, with mm_take_all_locks().

> > > +       else
> > > +               __madvise_unlock(mm, madv_behavior->behavior);
> > > +}
> > > +
> > >  static bool madvise_batch_tlb_flush(int behavior)
> > >  {
> > >         switch (behavior) {
> > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > >                 unsigned long start, size_t len_in,
> > >                 struct madvise_behavior *madv_behavior)
> > >  {
> > > +       struct vm_area_struct *vma = madv_behavior->vma;
> > >         int behavior = madv_behavior->behavior;
> > > +
> > >         struct blk_plug plug;
> > >         unsigned long end;
> > >         int error;
> > >
> > >         if (is_memory_failure(behavior))
> > >                 return madvise_inject_error(behavior, start, start + len_in);
> > > -       start = untagged_addr_remote(mm, start);
> > > +       start = untagged_addr(start);
> >
> > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > the mmap lock is held, which is no longer the case here with your
> > patch, but untagged_addr() seems wrong here, since we can be operating
> > on another process. I think especially on X86 with 5-level paging and
> > LAM, there can probably be cases where address bits are used for part
> > of the virtual address in one task while they need to be masked off in
> > another task?
> >
> > I wonder if you'll have to refactor X86 and Risc-V first to make this
> > work... ideally by making sure that their address tagging state
> > updates are atomic and untagged_addr_remote() works locklessly.
>
> Yeah I don't know why we're doing this at all? This seems new unless I
> missed it?

Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
reads data that is updated under the mmap lock, I think? So without this
change you should get a lockdep splat on x86.
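As a rough illustration of that last point, here is a paraphrased sketch
(not the verbatim x86 code; the helper name below is made up, and the
untag-mask field access is an assumption based on the LAM implementation):

    /*
     * Paraphrased sketch of the x86 behaviour described above: the
     * remote variant asserts the mmap lock because the per-mm untag
     * mask it reads is updated under that lock.
     */
    static unsigned long untagged_addr_remote_sketch(struct mm_struct *mm,
                                                     unsigned long addr)
    {
            mmap_assert_locked(mm);                 /* lockdep splat otherwise */
            return addr & mm->context.untag_mask;   /* per-mm LAM untag mask */
    }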
> > (Or you could try to use something like the
> > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > against untagged_addr(), first write-lock the MM and then write-lock
> > all VMAs in it...)
>
> This would completely eliminate the point of this patch no? The whole point
> is not taking these locks... And I'm very much not in favour of
> write-locking literally every single VMA, under any circumstances.

I'm talking about doing this heavyweight locking in places like
arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
correctly, essentially reconfigure the size of the virtual address space
of a running process from 56-bit to 47-bit at the hardware level and
cause address bits that were previously part of the virtual address to
be ignored.

READ_ONCE()/WRITE_ONCE() might do the job too, but then we'll have to
keep in mind that two subsequent invocations of untagged_addr() can
translate a userspace-specified virtual address into two different
virtual addresses at the page table level.
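For completeness, the helper sketched earlier in the thread is missing
its iterator declarations; a buildable version might look something like
this (untested sketch, assuming the existing VMA_ITERATOR() /
for_each_vma() / vma_start_write() primitives):

    /* Untested sketch: write-lock the mm, then mark every VMA write-locked. */
    static void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
    {
            struct vm_area_struct *vma;
            VMA_ITERATOR(vmi, mm, 0);

            mmap_write_lock(mm);
            for_each_vma(vmi, vma)
                    vma_start_write(vma);
    }

A slow path such as arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) could
plausibly afford this, paired with the usual mmap_write_unlock(mm) once
the tagging state has been updated.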