From: Jann Horn <jannh@google.com>
Date: Fri, 11 Oct 2024 20:11:36 +0200
Subject: Re: [RFC PATCH 3/4] mm: madvise: implement lightweight guard page mechanism
To: Lorenzo Stoakes
Cc: Andrew Morton, Suren Baghdasaryan, Liam R. Howlett, Matthew Wilcox, Vlastimil Babka, Paul E. McKenney, David Hildenbrand, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Muchun Song, Richard Henderson, Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer, James E. J. Bottomley, Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org, Shuah Khan, Christian Brauner, linux-kselftest@vger.kernel.org, Sidhartha Kumar, Vlastimil Babka

On Fri, Sep 27, 2024 at 2:51 PM Lorenzo
Stoakes wrote:
> Implement a new lightweight guard page feature, that is regions of userland
> virtual memory that, when accessed, cause a fatal signal to arise.
[...]
> ---
>  arch/alpha/include/uapi/asm/mman.h     | 3 +
>  arch/mips/include/uapi/asm/mman.h      | 3 +
>  arch/parisc/include/uapi/asm/mman.h    | 3 +
>  arch/xtensa/include/uapi/asm/mman.h    | 3 +
>  include/uapi/asm-generic/mman-common.h | 3 +

I kinda wonder if we could start moving the parts of those headers
that are the same for all architectures to include/uapi/linux/mman.h
instead... but that's maybe out of scope for this series.

[...]

> diff --git a/mm/madvise.c b/mm/madvise.c
> index e871a72a6c32..7216e10723ae 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
>         case MADV_COLLAPSE:
> +       case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */

What does poisoning need a write lock for? anon_vma_prepare() doesn't
need it (it only needs mmap_lock held for reading),
zap_page_range_single() doesn't need it, and pagewalk also doesn't
need it as long as the range being walked is covered by a VMA, which
it is...

I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment
saying "We might need to install an anon_vma" - is that referring to
an older version of the patch where the anon_vma_prepare() call was
inside the pagewalk callback or something like that? Either way,
anon_vma_prepare() doesn't need write locks (it can't, it has to work
from the page fault handling path).

>                 return 0;
>         default:
>                 /* be safe, default to 1. list exceptions explicitly */

[...]
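To make the locking point above concrete, this is roughly what the read-lock variant would look like. A non-compilable sketch against kernel-internal APIs, not a drop-in patch; `guard_poison_pte_entry` is assumed to be the patch's pte_entry callback name:

```c
/*
 * Sketch: if anon_vma_prepare() only needs mmap_lock held for reading
 * (it has to, since it runs from the fault path, which never holds the
 * lock for writing), the poison walk could use PGWALK_RDLOCK, and
 * madvise_need_mmap_write() could return 0 for MADV_GUARD_POISON too.
 */
static const struct mm_walk_ops guard_poison_walk_ops = {
	.pte_entry = guard_poison_pte_entry,	/* callback from the patch */
	.walk_lock = PGWALK_RDLOCK,		/* instead of PGWALK_WRLOCK */
};
```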
> +static long madvise_guard_poison(struct vm_area_struct *vma,
> +                                struct vm_area_struct **prev,
> +                                unsigned long start, unsigned long end)
> +{
> +       long err;
> +       bool retried = false;
> +
> +       *prev = vma;
> +       if (!is_valid_guard_vma(vma, /* allow_locked = */false))
> +               return -EINVAL;
> +
> +       /*
> +        * Optimistically try to install the guard poison pages first. If any
> +        * non-guard pages are encountered, give up and zap the range before
> +        * trying again.
> +        */
> +       while (true) {
> +               unsigned long num_installed = 0;
> +
> +               /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> +               err = walk_page_range_mm(vma->vm_mm, start, end,
> +                                        &guard_poison_walk_ops,
> +                                        &num_installed);
> +               /*
> +                * If we install poison markers, then the range is no longer
> +                * empty from a page table perspective and therefore it's
> +                * appropriate to have an anon_vma.
> +                *
> +                * This ensures that on fork, we copy page tables correctly.
> +                */
> +               if (err >= 0 && num_installed > 0) {
> +                       int err_anon = anon_vma_prepare(vma);

I'd move this up, to before we create poison PTEs. There's no harm in
attaching an anon_vma to the VMA even if the rest of the operation
fails; and I think it would be weird to have error paths that don't
attach an anon_vma even though they may have already installed some
guard PTEs.

> +                       if (err_anon)
> +                               err = err_anon;
> +               }
> +
> +               if (err <= 0)
> +                       return err;
> +
> +               if (!retried)
> +                       /*
> +                        * OK some of the range have non-guard pages mapped, zap
> +                        * them. This leaves existing guard pages in place.
> +                        */
> +                       zap_page_range_single(vma, start, end - start, NULL);
> +               else
> +                       /*
> +                        * If we reach here, then there is a racing fault that
> +                        * has populated the PTE after we zapped. Give up and
> +                        * let the user know to try again.
> +                        */
> +                       return -EAGAIN;

Hmm, yeah, it would be nice if we could avoid telling userspace to
loop on -EAGAIN but I guess we don't have any particularly good
options here? Well, we could bail out with -EINTR if a (fatal?)
signal is pending and otherwise keep looping... if we'd tell userspace
"try again on -EAGAIN", we might as well do that in the kernel...

(Personally I would put curly braces around these branches because
they occupy multiple lines, though the coding style doesn't explicitly
say that, so I guess maybe it's a matter of personal preference...
adding curly braces here would match what is done, for example, in
relocate_vma_down().)

> +               retried = true;
> +       }
> +}
> +
> +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
> +                                   unsigned long next, struct mm_walk *walk)
> +{
> +       pte_t ptent = ptep_get(pte);
> +
> +       if (is_guard_pte_marker(ptent)) {
> +               /* Simply clear the PTE marker. */
> +               pte_clear_not_present_full(walk->mm, addr, pte, true);

I think that last parameter probably should be "false"? The sparc code
calls it "fullmm", which is a term the MM code uses when talking about
operations that remove all mappings in the entire mm_struct because
the process has died, which allows using some faster special-case
version of TLB shootdown or something along those lines.

> +               update_mmu_cache(walk->vma, addr, pte);
> +       }
> +
> +       return 0;
> +}
> +
> +static const struct mm_walk_ops guard_unpoison_walk_ops = {
> +       .pte_entry = guard_unpoison_pte_entry,
> +       .walk_lock = PGWALK_RDLOCK,
> +};

It is a _little_ weird that unpoisoning creates page tables when they
don't already exist, which will also prevent creating THP entries on
fault in such areas afterwards... but I guess it doesn't really matter
given that poisoning has that effect, too, and you probably usually
won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned
before... so I guess this is not an actionable comment.