From: Jann Horn <jannh@google.com>
Date: Mon, 7 Oct 2024 23:31:09 +0200
Subject: Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
To: "Liam R. Howlett", Jann Horn, Andrew Morton, Lorenzo Stoakes, Linux-MM, kernel list, Suren Baghdasaryan, Matthew Wilcox, Vlastimil Babka, Sidhartha Kumar, Bert Karwatzki, Jiri Olsa, Kees Cook, "Paul E. McKenney", Jeff Xu, Seth Jenkins
References: <20240830040101.822209-1-Liam.Howlett@oracle.com> <20240830040101.822209-15-Liam.Howlett@oracle.com>
McKenney" , Jeff Xu , Seth Jenkins Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: c5yzja97jquodg4nfzgx5shos7zdcuch X-Rspamd-Queue-Id: 701B94000C X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1728336713-669493 X-HE-Meta: U2FsdGVkX1/OnnWN/A3W5M/yi5UPCI0Iub6N1m7dyKB9m8uMn25YlUA3lqt6rzusADNkKfSLHOIKDDMT72jQuBMdpTLqpobgQTCyFyQpFTKQhgoyvi+MuP+6qwGE6oWOi6jqbwKsfXVNbfcTIN13EmGilTLdzdlx9eUU+nlNghA9dSx9vhgTxKzLHMmwTLKWLujMYNy+8Jx/URDCIwk4Ldtm8yrRYqbBzdI/FgBuXunOuUmwFTgbzDw501kTLg7WGz+NTt3iI7rxPIed/zCAHYfWF7383VAjoXyZn+bOKGnNEy6QyFljUcDmVWkphQ7cjSU60ywTkPs3iQE+ZP7yGcU+u/0Pdwa0xvdApPdRvu5MPHwzldu3rHIHYm5eCnGbTeqARHRJ55mZE40hhisB3IuLNZ5HN8Vad76EFyaZm7Fbze7z1OhlN67j4ZJHHW+6jrbx2reNbEBMhY4OgHrYHD1DvkCp+6ulVhAegu1eAbRRxo2eefy1jS8KBhS5fzBcNcKp+RPKPMHOOaY2OBdageLQuNv340vsciu9CF+LDO+2LO/7e5dcDjApg7BikpO4p59MNaeyyGAuDv1ernHkXmk+ydWubgFePaACJvrZ30ialTtTXfLFPHFUUjN16/dFP5ce+MB0G+hT8M3pKYfTR5taF1vzkwffeu91WYsX8/2fvO6RRl8sbrJd1ctAk87YURw0N+wN62K4YheQKbxT+Mbz9YrOaBzq8bYrCQoyG03YwM0ZQxuKB8tMycz/Zl2rh2z0dkuJbzQmVr2KoMrBLtPgg09sbGVBrW2E0mEctiwBGbb1A077F2zoj4q2zX56LVHczAPq95uSqktu30ap/JGUcYaXNUe8IbPNVqINTJlMlmOeMdugqjIusdcHOa4NotzRdPmROx1q269+tGEjcVXIRm18uiBs+1uMgvMrlmg1gG5/8KIq/4FfMemUWnr1bd7U7l/Ae9HuMumLBSJ qx9WHWvv v4q6oZkIZaAZF3CogMWWztcA10z6H9Zb/PbaQPn967dlmmFHV4ipUpRKOfn+7h20GGtfJJP9Xr7tbWH2ugtXwnHG7uHGnrA1ZtZ55lL0PzA/1S6LyxRJRytS5nLOSxsFVQewyz9ZBRrCwaWaRUEwNKwtS9CgkakTTPAbWhZ3AVcTwuLBgcKS/MzzuBWDzIgX6M0TAkMI80stLLumS6sCrN38bSk/hus/SPwDf+pzsM86TL0ncQTaxeysAPfVLAFJEYSva1eXEj+9Jt1Guk6sNCM08dy5XeowYHibQ X-Bogosity: Ham, tests=bogofilter, spamicity=0.000034, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Oct 7, 2024 at 10:31=E2=80=AFPM Liam R. Howlett wrote: > * Jann Horn [241007 15:06]: > > On Fri, Aug 30, 2024 at 6:00=E2=80=AFAM Liam R. Howlett wrote: > > > Instead of zeroing the vma tree and then overwriting the area, let th= e > > > area be overwritten and then clean up the gathered vmas using > > > vms_complete_munmap_vmas(). > > > > > > To ensure locking is downgraded correctly, the mm is set regardless o= f > > > MAP_FIXED or not (NULL vma). > > > > > > If a driver is mapping over an existing vma, then clear the ptes befo= re > > > the call_mmap() invocation. This is done using the vms_clean_up_area= () > > > helper. If there is a close vm_ops, that must also be called to ensu= re > > > any cleanup is done before mapping over the area. This also means th= at > > > calling open has been added to the abort of an unmap operation, for n= ow. > > > > As currently implemented, this is not a valid optimization because it > > violates the (unwritten?) rule that you must not call free_pgd_range() > > on a region in the page tables which can concurrently be walked. A > > region in the page tables can be concurrently walked if it overlaps a > > VMA which is linked into rmaps which are not write-locked. > > Just for clarity, this is the rmap write lock. Ah, yes. 
> > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > the new mapping is created by expanding an adjacent VMA, the following
> > race with an ftruncate() is possible (because page tables for the old
> > mapping are removed while the new VMA in the same location is already
> > fully set up and linked into the rmap):
> >
> >
> > task 1 (mmap, MAP_FIXED)                task 2 (ftruncate)
> > ========================                ==================
> > mmap_region
> >   vma_merge_new_range
> >     vma_expand
> >       commit_merge
> >         vma_prepare
> >           [take rmap locks]
> >         vma_set_range
> >           [expand adjacent mapping]
> >         vma_complete
> >           [drop rmap locks]
> >   vms_complete_munmap_vmas
> >     vms_clear_ptes
> >       unmap_vmas
> >         [removes ptes]
> >       free_pgtables
> >         [unlinks old vma from rmap]
> >                                         unmap_mapping_range
> >                                           unmap_mapping_pages
> >                                             i_mmap_lock_read
> >                                             unmap_mapping_range_tree
> >                                               [loop]
> >                                                 unmap_mapping_range_vma
> >                                                   zap_page_range_single
> >                                                     unmap_single_vma
> >                                                       unmap_page_range
> >                                                         zap_p4d_range
> >                                                           zap_pud_range
> >                                                             zap_pmd_range
> >                                                               [looks up pmd entry]
> >         free_pgd_range
> >           [frees pmd]
> >                                                               [UAF pmd entry access]
> >
> > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > to widen the race windows, then build and run the attached reproducer
> > mmap-fixed-race.c.
> >
> > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
>
> Thanks for all the work you did finding the root cause here, I
> appreciate it.

Ah, this is not a bug I ran into while testing, it's a bug I found
while reading the patch. It's much easier to explain the issue and
come up with a nice reproducer this way than when you start out from
a crash. :P

> I think the correct fix is to take the rmap lock on free_pgtables, when
> necessary. There are a few code paths (error recovery) that are not
> regularly run that will also need to change.

Hmm, yes, I guess that might work. Though I think there might be more races:

One related aspect of this optimization that is unintuitive to me is
that, directly after vma_merge_new_range(), a concurrent rmap walk
could probably be walking the newly-extended VMA but still observe
PTEs belonging to the previous VMA.

I don't know how robust the various rmap walks are to things like
encountering pfnmap PTEs in non-pfnmap VMAs, or hugetlb PUD entries in
non-hugetlb VMAs. For example, it looks like page_vma_mapped_walk(),
if called on a page table range containing huge PUD entries but with a
VMA that lacks VM_HUGETLB, might go wrong at the
"pmd_offset(pud, pvmw->address)" call, and a 1G hugepage might get
misinterpreted as a page table? But I haven't experimentally verified
that.
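(Purely as an illustration, since the attached reproducer is not included in this excerpt: the general shape of the race described above is one thread ftruncate()ing the file in a loop while another repeatedly places MAP_FIXED file mappings that can merge with an adjacent VMA. The sketch below is not the actual mmap-fixed-race.c; the file path, sizes, and loop structure are invented, and without something like the race-widening debug diff mentioned above it is unlikely to hit the narrow window.)

/*
 * Sketch of the race shape only; NOT the attached mmap-fixed-race.c.
 * One thread repeatedly shrinks and re-grows the file with ftruncate(),
 * which makes the kernel walk the file's rmap and zap mappings beyond
 * the new size. The main thread repeatedly places MAP_FIXED mappings of
 * the same file into a reserved region so that the second mapping can
 * merge with (expand) the adjacent first mapping while the old range's
 * page tables are torn down.
 */
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE   4096UL
#define NPAGES 16UL

static int fd;

static void *truncate_thread(void *arg)
{
        (void)arg;
        for (;;) {
                if (ftruncate(fd, PAGE))
                        err(1, "ftruncate shrink");
                if (ftruncate(fd, NPAGES * PAGE))
                        err(1, "ftruncate grow");
        }
        return NULL;
}

int main(void)
{
        fd = open("/tmp/mmap-fixed-race-file", O_RDWR | O_CREAT, 0600); /* arbitrary path */
        if (fd < 0)
                err(1, "open");
        if (ftruncate(fd, NPAGES * PAGE))
                err(1, "ftruncate");

        pthread_t t;
        if (pthread_create(&t, NULL, truncate_thread, NULL))
                errx(1, "pthread_create");

        /* Reserve an address range so the two mappings below end up adjacent. */
        char *base = mmap(NULL, NPAGES * PAGE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
                err(1, "mmap reserve");

        for (;;) {
                /* First half: shared file mapping, prefaulted with MAP_POPULATE
                 * so page tables exist for the range. */
                if (mmap(base, NPAGES / 2 * PAGE, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_FIXED | MAP_POPULATE, fd, 0) == MAP_FAILED)
                        err(1, "mmap 1");
                /* Second half: mapped over whatever is currently there; with
                 * matching flags and a contiguous file offset it can merge
                 * with the first mapping, i.e. the adjacent VMA is expanded. */
                if (mmap(base + NPAGES / 2 * PAGE, NPAGES / 2 * PAGE,
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_FIXED | MAP_POPULATE, fd,
                         NPAGES / 2 * PAGE) == MAP_FAILED)
                        err(1, "mmap 2");
        }
}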