From: Jann Horn <jannh@google.com>
To: Hugh Dickins <hughd@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>,
David Hildenbrand <david@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Yang Shi <shy828301@gmail.com>, Peter Xu <peterx@redhat.com>,
Song Liu <song@kernel.org>,
sparclinux@vger.kernel.org,
Alexander Gordeev <agordeev@linux.ibm.com>,
Claudio Imbrenda <imbrenda@linux.ibm.com>,
Will Deacon <will@kernel.org>,
linux-s390@vger.kernel.org, Yu Zhao <yuzhao@google.com>,
Ira Weiny <ira.weiny@intel.com>,
Alistair Popple <apopple@nvidia.com>,
Russell King <linux@armlinux.org.uk>,
Matthew Wilcox <willy@infradead.org>,
Steven Price <steven.price@arm.com>,
Christoph Hellwig <hch@infradead.org>,
Jason Gunthorpe <jgg@ziepe.ca>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
Ralph Campbell <rcampbell@nvidia.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Heiko Ca rstens <hca@linux.ibm.com>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Suren Baghdasaryan <surenb@google.com>,
linux-arm-kernel@lists.infradead.org,
SeongJae Park <sj@kernel.org>,
linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
linux-kernel@vger.kernel.org, Minchan Kim <minchan@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mgorman@techsingularity.net>,
"David S. Miller" <davem@davemloft.net>,
Zack Rusin <zackr@vmware.com>,
Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
Date: Fri, 2 Jun 2023 16:21:35 +0200 [thread overview]
Message-ID: <CAG48ez3seuiWfiVq0z8weTQV3ZXwo-TXSQWLJ0uxn4eOrq0J0w@mail.gmail.com> (raw)
In-Reply-To: <de1e37c-354c-fb98-1598-7ce6d415f257@google.com>
On Fri, Jun 2, 2023 at 4:50 AM Hugh Dickins <hughd@google.com> wrote:
> On Wed, 31 May 2023, Jann Horn wrote:
> > On Mon, May 29, 2023 at 8:15 AM Hugh Dickins <hughd@google.com> wrote:
> > > Before putting them to use (several commits later), add rcu_read_lock()
> > > to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a
> > > separate commit, since it risks exposing imbalances: prior commits have
> > > fixed all the known imbalances, but we may find some have been missed.
> > [...]
> > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> > > index c7ab18a5fb77..674671835631 100644
> > > --- a/mm/pgtable-generic.c
> > > +++ b/mm/pgtable-generic.c
> > > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> > > {
> > > pmd_t pmdval;
> > >
> > > - /* rcu_read_lock() to be added later */
> > > + rcu_read_lock();
> > > pmdval = pmdp_get_lockless(pmd);
> > > if (pmdvalp)
> > > *pmdvalp = pmdval;
> >
> > It might be a good idea to document that this series assumes that the
> > first argument to __pte_offset_map() is a pointer into a second-level
> > page table (and not a local copy of the entry) unless the containing
> > VMA is known to not be THP-eligible or the page table is detached from
> > the page table hierarchy or something like that. Currently a bunch of
> > places pass references to local copies of the entry, and while I think
> > all of these are fine, it would probably be good to at least document
> > why these are allowed to do it while other places aren't.
>
> Thanks Jann: but I have to guess that here you are showing awareness of
> an important issue that I'm simply ignorant of.
>
> I have been haunted by a dim recollection that there is one architecture
> (arm-32?) which is fussy about the placement of the pmdval being examined
> (deduces info missing from the arch-independent interface, by following
> up the address?), but I couldn't track it down when I tried.
>
> Please tell me more; or better, don't spend your time explaining to me,
> but please just send a link to a good reference on the issue. I'll be
> unable to document what you ask there, without educating myself first.
Sorry, I think I was somewhat confused about what was going on when I
wrote that message.
After this series, __pte_offset_map() looks as follows, with added
comments describing my understanding of the semantics:
// `pmd` points to one of:
// case 1: a pmd_t stored outside a page table,
// referencing a page table detached by the caller
// case 2: a pmd_t stored outside a page table, which the caller copied
// from a page table in an RCU-critical section that extends
// until at least the end of this function
// case 3: a pmd_t stored inside a page table
pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
{
unsigned long __maybe_unused flags;
pmd_t pmdval;
// begin an RCU section; this is needed for case 3
rcu_read_lock();
config_might_irq_save(flags);
// read the pmd_t.
// if the pmd_t references a page table, this page table can not
// go away because:
// - in case 1, the caller is the main owner of the page table
// - in case 2, because the caller
// started an RCU read-side critical section before the caller
// read the original pmd_t. (This pmdp_get_lockless() is just
// reading a copied pmd_t off the stack.)
// - in case 3, because we started an RCU section above before
// reading the pmd_t out of the page table here
pmdval = pmdp_get_lockless(pmd);
config_might_irq_restore(flags);
if (pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
goto nomap;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
goto nomap;
}
return __pte_map(&pmdval, addr);
nomap:
rcu_read_unlock();
return NULL;
}
case 1 is what happens in __page_table_check_pte_clear_range(),
__split_huge_zero_page_pmd() and __split_huge_pmd_locked().
case 2 happens in lockless page table traversal (gup_pte_range() and
perf_get_pgtable_size()).
case 3 is normal page table traversal under mmap lock or mapping lock.
I think having a function like this that can run in three different
contexts in which it is protected in three different ways is somewhat
hard to understand without comments. Though maybe I'm thinking about
it the wrong way?
Basically my point is: __pte_offset_map() normally requires that the
pmd argument points into a page table so that the rcu_read_lock() can
provide protection starting from the time the pmd_t is read from a
page table. The exception are cases where the caller has taken its own
precautions to ensure that the referenced page table can not have been
freed.
next prev parent reply other threads:[~2023-06-02 14:23 UTC|newest]
Thread overview: 50+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-29 6:11 [PATCH 00/12] mm: free retracted page table by RCU Hugh Dickins
2023-05-29 6:14 ` [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
2023-05-31 17:06 ` Jann Horn
2023-06-02 2:50 ` Hugh Dickins
2023-06-02 14:21 ` Jann Horn [this message]
2023-05-29 6:16 ` [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
2023-05-29 13:56 ` Matthew Wilcox
[not found] ` <ZHeg3oRljRn6wlLX@ziepe.ca>
2023-06-02 5:35 ` Hugh Dickins
2023-05-29 6:17 ` [PATCH 03/12] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
2023-05-29 6:18 ` [PATCH 04/12] powerpc: assert_pte_locked() " Hugh Dickins
2023-05-29 6:20 ` [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-05-29 14:02 ` Matthew Wilcox
2023-05-29 14:36 ` Hugh Dickins
2023-06-01 13:57 ` Gerald Schaefer
2023-06-02 6:38 ` Hugh Dickins
2023-06-02 14:20 ` Jason Gunthorpe
2023-06-06 3:40 ` Hugh Dickins
2023-06-06 18:23 ` Jason Gunthorpe
2023-06-06 19:03 ` Peter Xu
2023-06-06 19:08 ` Jason Gunthorpe
2023-06-07 3:49 ` Hugh Dickins
2023-05-29 6:21 ` [PATCH 06/12] sparc: " Hugh Dickins
2023-06-06 3:46 ` Hugh Dickins
2023-05-29 6:22 ` [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async() Hugh Dickins
2023-06-06 5:11 ` Hugh Dickins
2023-06-06 18:39 ` Jason Gunthorpe
2023-06-08 2:46 ` Hugh Dickins
2023-06-06 19:40 ` Gerald Schaefer
2023-06-08 3:35 ` Hugh Dickins
2023-06-08 13:58 ` Jason Gunthorpe
2023-06-08 15:47 ` Gerald Schaefer
2023-05-29 6:23 ` [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
2023-06-01 13:31 ` Jann Horn
[not found] ` <ZHekpAKJ05cr/GLl@ziepe.ca>
2023-06-02 6:03 ` Hugh Dickins
2023-06-02 12:15 ` Jason Gunthorpe
2023-05-29 6:25 ` [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
2023-05-29 23:26 ` Peter Xu
2023-05-31 0:38 ` Hugh Dickins
2023-05-31 15:34 ` Jann Horn
[not found] ` <ZHe0A079X9B8jWlH@x1n>
2023-05-31 22:18 ` Jann Horn
2023-06-01 14:06 ` Jason Gunthorpe
2023-06-06 6:18 ` Hugh Dickins
2023-05-29 6:26 ` [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
2023-05-31 17:25 ` Jann Horn
2023-06-02 5:11 ` Hugh Dickins
2023-05-29 6:28 ` [PATCH 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
2023-05-29 6:30 ` [PATCH 12/12] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins
[not found] ` <CAG48ez0pCqfRdVSnJz7EKtNvMR65=zJgVB-72nTdrNuhtJNX2Q@mail.gmail.com>
2023-06-02 4:37 ` [PATCH 00/12] mm: free retracted page table by RCU Hugh Dickins
2023-06-02 15:26 ` Jann Horn
2023-06-06 6:28 ` Hugh Dickins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAG48ez3seuiWfiVq0z8weTQV3ZXwo-TXSQWLJ0uxn4eOrq0J0w@mail.gmail.com \
--to=jannh@google.com \
--cc=agordeev@linux.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=aneesh.kumar@linux.ibm.com \
--cc=anshuman.khandual@arm.com \
--cc=apopple@nvidia.com \
--cc=axelrasmussen@google.com \
--cc=borntraeger@linux.ibm.com \
--cc=davem@davemloft.net \
--cc=david@redhat.com \
--cc=hca@linux.ibm.com \
--cc=hch@infradead.org \
--cc=hughd@google.com \
--cc=imbrenda@linux.ibm.com \
--cc=ira.weiny@intel.com \
--cc=jgg@ziepe.ca \
--cc=kirill.shutemov@linux.intel.com \
--cc=linmiaohe@huawei.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-s390@vger.kernel.org \
--cc=linux@armlinux.org.uk \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mgorman@techsingularity.net \
--cc=mike.kravetz@oracle.com \
--cc=minchan@kernel.org \
--cc=naoya.horiguchi@nec.com \
--cc=pasha.tatashin@soleen.com \
--cc=peterx@redhat.com \
--cc=peterz@infradead.org \
--cc=rcampbell@nvidia.com \
--cc=rppt@kernel.org \
--cc=shy828301@gmail.com \
--cc=sj@kernel.org \
--cc=song@kernel.org \
--cc=sparclinux@vger.kernel.org \
--cc=steven.price@arm.com \
--cc=surenb@google.com \
--cc=thomas.hellstrom@linux.intel.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=yuzhao@google.com \
--cc=zackr@vmware.com \
--cc=zhengqi.arch@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).