From: Muchun Song <muchun.song@linux.dev>
Date: Wed, 28 Aug 2024 17:48:59 +0800
Message-ID: <2c3f09dc-aa80-412d-ba85-3cb9aa8cc478@linux.dev>
Subject: Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
To: Qi Zheng
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-arm-kernel@lists.infradead.org, linuxppc-dev@lists.ozlabs.org,
    david@redhat.com, hughd@google.com, willy@infradead.org,
    vbabka@kernel.org, akpm@linux-foundation.org, rppt@kernel.org,
    vishal.moola@gmail.com, peterx@redhat.com, ryan.roberts@arm.com,
    christophe.leroy2@cs-soprasteria.com

On 2024/8/22 15:13, Qi Zheng wrote:
> Currently, the usage of pte_offset_map_nolock() can be divided into the
> following two cases:
>
> 1) After acquiring PTL, only read-only operations are performed on the PTE
>    page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
>    that the PTE page will not be freed, and there is no need to worry
>    about whether the pmd entry is modified.
>
> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>    time, we need to ensure that the pmd entry has not been modified
>    concurrently.
>
> To more clearly distinguish between these two cases, this commit
> introduces two new helper functions to replace pte_offset_map_nolock().
> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
> pmdval when successful. This can help the caller recheck *pmd once the PTL
> is taken. In some cases, that is, either the mmap_lock for write, or
> pte_same() check on contents, is also enough to ensure that the pmd entry
> is stable. But in order to prevent the interface from being abused, we
> choose to pass in a dummy local variable instead of NULL.
>
> Subsequent commits will convert pte_offset_map_nolock() into the above
> two functions one by one, and finally completely delete it.
>
> Signed-off-by: Qi Zheng
> ---
>  Documentation/mm/split_page_table_lock.rst |  7 ++++
>  include/linux/mm.h                         |  5 +++
>  mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
>  3 files changed, 55 insertions(+)
>
> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
> index e4f6972eb6c04..08d0e706a32db 100644
> --- a/Documentation/mm/split_page_table_lock.rst
> +++ b/Documentation/mm/split_page_table_lock.rst
> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
>   - pte_offset_map_nolock()
>  	maps PTE, returns pointer to PTE with pointer to its PTE table
>  	lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_ro_nolock()
> +	maps PTE, returns pointer to PTE with pointer to its PTE table
> +	lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_rw_nolock()
> +	maps PTE, returns pointer to PTE with pointer to its PTE table
> +	lock (not taken) and the value of its pmd entry, or returns NULL
> +	if no PTE table;
>   - pte_offset_map()
>  	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
>   - pte_unmap()
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index da29b066495d6..a00cb35ce065f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2954,6 +2954,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>
>  pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>  			unsigned long addr, spinlock_t **ptlp);
> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
> +			unsigned long addr, spinlock_t **ptlp);
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> +			unsigned long addr, pmd_t *pmdvalp,
> +			spinlock_t **ptlp);
>
>  #define pte_unmap_unlock(pte, ptl)	do {		\
>  	spin_unlock(ptl);				\
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index a78a4adf711ac..9a1666574c959 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>  	return pte;
>  }
>
> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
> +			unsigned long addr, spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	return pte;
> +}
> +
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> +			unsigned long addr, pmd_t *pmdvalp,
> +			spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	BUG_ON(!pmdvalp);
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	*pmdvalp = pmdval;
> +	return pte;
> +}
> +
>  /*
>   * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>   * __pte_offset_map_lock() below, is usually called with the pmd pointer for
> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>   * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>   * either the mmap_lock for write, or pte_same() check on contents, is enough.
>   *
> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
> + * pte_offset_map(); but when successful, it also outputs a pointer to the
> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
> + * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
> + * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
> + * provides the correct spinlock pointer for the page table that it returns.
> + * For readonly case, the caller does not need to recheck *pmd after the lock is
> + * taken, because the RCU lock will ensure that the PTE page will not be freed.

I'd like to add something like:

"
It is only applicable for read-only cases where any modification operations
to the PTE page table are not allowed, even if the corresponding PTL is
held afterwards.
"

to explicitly specify its use cases.

> + *
> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
> + * pdmval. For cases where pte or pmd entries may be modified, that is, maywrite
> + * case, this can help the caller recheck *pmd once the lock is taken. In some
> + * cases, that is, either the mmap_lock for write, or pte_same() check on
> + * contents, is also enough to ensure that the pmd entry is stable.
> + *

I'd like to rewrite some sentences to:

"
It is applicable for may-write cases where any modification operations to the
PTE page table may happen after the corresponding PTL is held afterwards. But
the users should make sure the PTE page table is stable, e.g. by holding
mmap_lock for write, or by checking pte_same() or pmd_same(), before
performing the write operations.
"

A rough caller-side sketch of this may-write pattern is appended at the
bottom of this mail.

Muchun,
Thanks.

>   * Note that free_pgtables(), used after unmapping detached vmas, or when
>   * exiting the whole mm, does not take page table lock before freeing a page
>   * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
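By the way, to make the intended calling convention concrete, a may-write
caller would look roughly like the sketch below. This is purely illustrative
and not taken from this series: the function name example_clear_pte() and the
error handling are made up; the point is the pmd_same() recheck against the
returned pmdval once the PTL is taken.

#include <linux/mm.h>

/*
 * Hypothetical may-write caller: clears a single PTE. The pmdval output of
 * pte_offset_map_rw_nolock() lets us recheck, under the PTL, that the PTE
 * table we mapped is still the one installed in *pmd.
 */
static int example_clear_pte(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr)
{
	spinlock_t *ptl;
	pmd_t pmdval;
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		return -EAGAIN;		/* no PTE table here */

	spin_lock(ptl);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* The PTE table was withdrawn or replaced under us. */
		pte_unmap_unlock(pte, ptl);
		return -EAGAIN;
	}

	/* *pmd is known stable here, so modifying the PTE is safe. */
	pte_clear(mm, addr, pte);
	pte_unmap_unlock(pte, ptl);
	return 0;
}

The read-only variant would follow the same shape with
pte_offset_map_ro_nolock(), minus the pmd_same() recheck, since RCU already
guarantees the PTE page is not freed while it is mapped.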