From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dev Jain <dev.jain@arm.com>
To: akpm@linux-foundation.org, david@redhat.com, catalin.marinas@arm.com,
	will@kernel.org
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, suzuki.poulose@arm.com,
	steven.price@arm.com, gshan@redhat.com,
	linux-arm-kernel@lists.infradead.org, yang@os.amperecomputing.com,
	ryan.roberts@arm.com, anshuman.khandual@arm.com,
	Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
Date: Tue, 10 Jun 2025 17:14:00 +0530
Message-Id: <20250610114401.7097-2-dev.jain@arm.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
In-Reply-To: <20250610114401.7097-1-dev.jain@arm.com>
References: <20250610114401.7097-1-dev.jain@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

arm64 currently changes permissions on vmalloc objects locklessly, via
apply_to_page_range(). A limitation of that helper is that it refuses
to change permissions on block mappings, so patch 2 moves arm64 over to
the pagewalk API instead. However, the pagewalk API currently requires
init_mm.mmap_lock to be held. To avoid turning the mmap_lock into an
unnecessary bottleneck for this usecase, extend the generic API so that
it can also be used locklessly, retaining the existing lockless
behaviour for permission changes.

Apart from this, it is noted at [1] that KFENCE can manipulate kernel
pgtable entries during softirqs, by calling
set_memory_valid() -> __change_memory_common(). Since that is a
non-sleepable context, we cannot take the init_mm mmap lock there
either.

Since such an extension can potentially be dangerous for other callers
consuming the pagewalk API, explicitly disallow lockless traversal of
userspace pagetables by returning -EINVAL. Add comments highlighting
the conditions under which the API may be used locklessly: there must
be no underlying VMA, and the caller must have exclusive control over
the range, guaranteeing no concurrent access.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
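Notes (not for the commit log): a caller that owns a kernel VA range
could use the new mode roughly as in the sketch below. This is purely
illustrative and not part of this series: ro_pte_entry, ro_walk_ops and
make_range_ro are made-up names, and it assumes the existing
walk_kernel_page_table_range(start, end, ops, pgd, private) signature
with pgd and private left NULL.

#include <linux/pagewalk.h>
#include <linux/pgtable.h>

/* Leaf callback: clear the write permission on one PTE. */
static int ro_pte_entry(pte_t *pte, unsigned long addr,
			unsigned long next, struct mm_walk *walk)
{
	pte_t val = ptep_get(pte);

	set_pte_at(walk->mm, addr, pte, pte_wrprotect(val));
	return 0;
}

static const struct mm_walk_ops ro_walk_ops = {
	.pte_entry	= ro_pte_entry,
	.walk_lock	= PGWALK_NOLOCK,	/* new lockless mode */
};

/*
 * Safe only because the caller has exclusive control over
 * [start, end) and there is no VMA underneath it.
 */
static int make_range_ro(unsigned long start, unsigned long end)
{
	return walk_kernel_page_table_range(start, end, &ro_walk_ops,
					    NULL, NULL);
}

In this series the real user is arm64's permission-changing path
(patch 2, via __change_memory_common()); the sketch only illustrates
the exclusivity contract that PGWALK_NOLOCK demands.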
 include/linux/pagewalk.h |  7 +++++++
 mm/pagewalk.c            | 23 ++++++++++++++++++-----
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 8ac2f6d6d2a3..5efd6541239b 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -14,6 +14,13 @@ enum page_walk_lock {
 	PGWALK_WRLOCK = 1,
 	/* vma is expected to be already write-locked during the walk */
 	PGWALK_WRLOCK_VERIFY = 2,
+	/*
+	 * Walk without any lock. Use of this is only meant for the
+	 * case where there is no underlying VMA, and the user has
+	 * exclusive control over the range, guaranteeing no concurrent
+	 * access. For example, changing permissions of vmalloc objects.
+	 */
+	PGWALK_NOLOCK = 3,
 };
 
 /**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ff5299eca687..d55d933f84ec 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 	return err;
 }
 
-static inline void process_mm_walk_lock(struct mm_struct *mm,
+static inline bool process_mm_walk_lock(struct mm_struct *mm,
 					enum page_walk_lock walk_lock)
 {
+	if (walk_lock == PGWALK_NOLOCK)
+		return true;
+
 	if (walk_lock == PGWALK_RDLOCK)
 		mmap_assert_locked(mm);
 	else
 		mmap_assert_write_locked(mm);
+	return false;
 }
 
 static inline void process_vma_walk_lock(struct vm_area_struct *vma,
@@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
 	case PGWALK_RDLOCK:
 		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
 		break;
+	case PGWALK_NOLOCK:
+		break;
 	}
 #endif
 }
@@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 	if (!walk.mm)
 		return -EINVAL;
 
-	process_mm_walk_lock(walk.mm, ops->walk_lock);
+	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+		return -EINVAL;
 
 	vma = find_vma(walk.mm, start);
 	do {
@@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
 	 * to prevent the intermediate kernel pages tables belonging to the
 	 * specified address range from being freed. The caller should take
 	 * other actions to prevent this race.
+	 *
+	 * If the caller can guarantee that it has exclusive access to the
+	 * specified address range, only then can it use PGWALK_NOLOCK.
 	 */
-	mmap_assert_locked(mm);
+	if (ops->walk_lock != PGWALK_NOLOCK)
+		mmap_assert_locked(mm);
 
 	return walk_pgd_range(start, end, &walk);
 }
@@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 	if (!check_ops_valid(ops))
 		return -EINVAL;
 
-	process_mm_walk_lock(walk.mm, ops->walk_lock);
+	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+		return -EINVAL;
 	process_vma_walk_lock(vma, ops->walk_lock);
 	return __walk_page_range(start, end, &walk);
 }
@@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 	if (!check_ops_valid(ops))
 		return -EINVAL;
 
-	process_mm_walk_lock(walk.mm, ops->walk_lock);
+	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+		return -EINVAL;
 	process_vma_walk_lock(vma, ops->walk_lock);
 	return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
 }
-- 
2.30.2