Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Yang Shi <shy828301@gmail.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Peter Xu <peterx@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>, Yu Zhao <yuzhao@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Steven Price <steven.price@arm.com>,
	SeongJae Park <sj@kernel.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Huang Ying <ying.huang@intel.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	Zack Rusin <zackr@vmware.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Minchan Kim <minchan@kernel.org>,
	Christoph Hellwig <hch@infradead.org>, Song Liu <song@kernel.org>,
	Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
	Russell King <linux@armlinux.org.uk>,
	"David S. Miller" <davem@davemloft.net>,
	Michael Ellerman <mpe@ellerman.id.au>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Jann Horn <jannh@google.com>,
	Vishal Moola <vishal.moola@gmail.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
Date: Thu, 29 Jun 2023 17:43:55 +0200	[thread overview]
Message-ID: <20230629174355.222ebed0@thinkpad-T15> (raw)
In-Reply-To: <ZJ2OK29zPB76i0Ga@li-008a6a4c-3549-11b2-a85c-c5cc2836eea2.ibm.com>

On Thu, 29 Jun 2023 15:59:07 +0200
Alexander Gordeev <agordeev@linux.ibm.com> wrote:

> On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote:
> > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> 
> Hi Gerald, Hugh!
> 
> ...
> > @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
> >  	__free_page(page);
> >  }
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static void pte_free_now0(struct rcu_head *head);
> > +static void pte_free_now1(struct rcu_head *head);  
> 
> What about pte_free_lower() / pte_free_upper()?

I actually like the 0/1 better, I always get confused what exactly we
mean with "lower / upper" in our code and comments. Is it the first
or second half? With 0/1 it is immediately clear to me.

> 
> ...
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	unsigned int bit, mask;
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	if (mm_alloc_pgste(mm)) {
> > +		/*
> > +		 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> > +		 * page_table_free_rcu()?
> > +		 * If yes -> need addr parameter here, like in pte_free_tlb().
> > +		 */
> > +		call_rcu(&page->rcu_head, pte_free_pgste);
> > +		return;
> > +}
> > +	bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
> > +
> > +	spin_lock_bh(&mm->context.lock);
> > +	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));  
> 
> This  makes the bit logic increasingly complicated to me.

I think it is well in line with existing code in page_table_free[_rcu].
Only instead of doing xor with 0x11U, it does xor with 0x15U to also
switch on the H bit while at it.

> 
> What if instead we set the rule "one bit at a time only"?
> That means an atomic group bit flip is only allowed between
> pairs of bits, namely:
> 
> bit flip	initiated from
> -----------	----------------------------------------
> P      <- A	page_table_free(), page_table_free_rcu()
>      H <- A	pte_free_defer()
> P <- H		pte_free_half()
> 
> In the current model P bit could be on together with H
> bit simultaneously. That actually brings in equation
> nothing.

P bit has to be set at the latest when __tlb_remove_table() gets
called, because there it is checked / cleared. It might be possible
to not set it in pte_free_defer() already, but only later in
pte_free_half() RCU callback, before calling __tlb_remove_table().
But that would not be in line any more with existing code, where it
is already set before scheduling the RCU callback.

Therefore, I would rather stick to the current approach, unless
you see some bug in it.

> 
> Besides, this check in page_table_alloc() (while still
> correct) makes one (well, me) wonder "what about HH bits?":
> 
> 	mask = (mask | (mask >> 4)) & 0x03U;
> 	if (mask != 0x03U) {
> 		...
> 	}

Without adding fragments back to the list, it is not necessary
to check any H bits page_table_alloc(), or so I hope. Actually,
I like that aspect most, i.e. we have as little impact on current
code as possible.

And H bits are only relevant for preventing double use of rcu_head,
which is what they were designed for, and only the new code has
to care about them.

WARNING: multiple messages have this Message-ID (diff)

From: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>,
	David Hildenbrand <david@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Yang Shi <shy828301@gmail.com>, Peter Xu <peterx@redhat.com>,
	linux-kernel@vger.kernel.org, Song Liu <song@kernel.org>,
	sparclinux@vger.kernel.org,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Will Deacon <will@kernel.org>,
	linux-s390@vger.kernel.org, Yu Zhao <yuzhao@google.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Alistair Popple <apopple@nvidia.com>,
	Hugh Dickins <hughd@google.com>,
	Russell King <linux@armlinux.org.uk>,
	Matthew Wilcox <willy@infradead.org>,
	Steven Price <steven.price@arm.com>,
	Christoph Hellwig <hch@infradead.org>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Huang Ying <ying.huang@intel.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	V asily Gorbik <gor@linux.ibm.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-arm-kernel@lists.infradead.org,
	SeongJae Park <sj@kernel.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Jann Horn <jannh@google.com>,
	linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Zack Rusin <zackr@vmware.com>,
	Vishal Moola <vishal.moola@gmail.com>,
	Minchan Kim <minchan@kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	"David S. Miller" <davem@davemloft.net>,
	Mike Rapoport <rppt@kernel.org>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
Date: Thu, 29 Jun 2023 17:43:55 +0200	[thread overview]
Message-ID: <20230629174355.222ebed0@thinkpad-T15> (raw)
In-Reply-To: <ZJ2OK29zPB76i0Ga@li-008a6a4c-3549-11b2-a85c-c5cc2836eea2.ibm.com>

On Thu, 29 Jun 2023 15:59:07 +0200
Alexander Gordeev <agordeev@linux.ibm.com> wrote:

> On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote:
> > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> 
> Hi Gerald, Hugh!
> 
> ...
> > @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
> >  	__free_page(page);
> >  }
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static void pte_free_now0(struct rcu_head *head);
> > +static void pte_free_now1(struct rcu_head *head);  
> 
> What about pte_free_lower() / pte_free_upper()?

I actually like the 0/1 better, I always get confused what exactly we
mean with "lower / upper" in our code and comments. Is it the first
or second half? With 0/1 it is immediately clear to me.

> 
> ...
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	unsigned int bit, mask;
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	if (mm_alloc_pgste(mm)) {
> > +		/*
> > +		 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> > +		 * page_table_free_rcu()?
> > +		 * If yes -> need addr parameter here, like in pte_free_tlb().
> > +		 */
> > +		call_rcu(&page->rcu_head, pte_free_pgste);
> > +		return;
> > +}
> > +	bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
> > +
> > +	spin_lock_bh(&mm->context.lock);
> > +	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));  
> 
> This  makes the bit logic increasingly complicated to me.

I think it is well in line with existing code in page_table_free[_rcu].
Only instead of doing xor with 0x11U, it does xor with 0x15U to also
switch on the H bit while at it.

> 
> What if instead we set the rule "one bit at a time only"?
> That means an atomic group bit flip is only allowed between
> pairs of bits, namely:
> 
> bit flip	initiated from
> -----------	----------------------------------------
> P      <- A	page_table_free(), page_table_free_rcu()
>      H <- A	pte_free_defer()
> P <- H		pte_free_half()
> 
> In the current model P bit could be on together with H
> bit simultaneously. That actually brings in equation
> nothing.

P bit has to be set at the latest when __tlb_remove_table() gets
called, because there it is checked / cleared. It might be possible
to not set it in pte_free_defer() already, but only later in
pte_free_half() RCU callback, before calling __tlb_remove_table().
But that would not be in line any more with existing code, where it
is already set before scheduling the RCU callback.

Therefore, I would rather stick to the current approach, unless
you see some bug in it.

> 
> Besides, this check in page_table_alloc() (while still
> correct) makes one (well, me) wonder "what about HH bits?":
> 
> 	mask = (mask | (mask >> 4)) & 0x03U;
> 	if (mask != 0x03U) {
> 		...
> 	}

Without adding fragments back to the list, it is not necessary
to check any H bits page_table_alloc(), or so I hope. Actually,
I like that aspect most, i.e. we have as little impact on current
code as possible.

And H bits are only relevant for preventing double use of rcu_head,
which is what they were designed for, and only the new code has
to care about them.

next prev parent reply	other threads:[~2023-06-29 15:46 UTC|newest]

Thread overview: 94+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-20  7:35 [PATCH v2 00/12] mm: free retracted page table by RCU Hugh Dickins
2023-06-20  7:35 ` Hugh Dickins
2023-06-20  7:40 ` [PATCH v2 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
2023-06-20  7:40   ` Hugh Dickins
2023-06-20  7:42 ` [PATCH v2 02/12] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
2023-06-20  7:42   ` Hugh Dickins
2023-06-20  7:43 ` [PATCH v2 03/12] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
2023-06-20  7:43   ` Hugh Dickins
2023-06-20  7:45 ` [PATCH v2 04/12] powerpc: assert_pte_locked() " Hugh Dickins
2023-06-20  7:45   ` Hugh Dickins
2023-06-20  7:47 ` [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-06-20  7:47   ` Hugh Dickins
2023-06-20 11:45   ` Jason Gunthorpe
2023-06-20 11:45     ` Jason Gunthorpe
2023-06-20 19:54     ` Hugh Dickins
2023-06-20 19:54       ` Hugh Dickins
2023-06-20 23:52       ` Jason Gunthorpe
2023-06-20 23:52         ` Jason Gunthorpe
2023-06-22  2:36         ` Hugh Dickins
2023-06-22  2:36           ` Hugh Dickins
2023-06-27 17:01           ` Jason Gunthorpe
2023-06-27 17:01             ` Jason Gunthorpe
2023-06-27 20:53             ` Hugh Dickins
2023-06-27 20:53               ` Hugh Dickins
2023-06-20  7:49 ` [PATCH v2 06/12] sparc: add pte_free_defer() for pte_t *pgtable_t Hugh Dickins
2023-06-20  7:49   ` Hugh Dickins
2023-06-20  7:51 ` [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-06-20  7:51   ` Hugh Dickins
2023-06-28 19:16   ` Gerald Schaefer
2023-06-28 19:16     ` Gerald Schaefer
2023-06-29  5:08     ` Hugh Dickins
2023-06-29  5:08       ` Hugh Dickins
2023-06-29 15:22       ` Jason Gunthorpe
2023-06-29 15:22         ` Jason Gunthorpe
2023-06-29 15:56         ` Gerald Schaefer
2023-06-29 15:56           ` Gerald Schaefer
2023-06-30  6:00           ` Hugh Dickins
2023-06-30  6:00             ` Hugh Dickins
2023-07-02  4:32             ` Hugh Dickins
2023-07-02  4:32               ` Hugh Dickins
2023-07-04 13:40               ` Alexander Gordeev
2023-07-04 13:40                 ` Alexander Gordeev
2023-07-04 16:03                 ` Hugh Dickins
2023-07-04 16:03                   ` Hugh Dickins
2023-07-04 15:19               ` Gerald Schaefer
2023-07-04 15:19                 ` Gerald Schaefer
2023-07-04 17:03                 ` Hugh Dickins
2023-07-04 17:03                   ` Hugh Dickins
2023-07-05 12:55                   ` Gerald Schaefer
2023-07-05 12:55                     ` Gerald Schaefer
2023-07-06  1:20                     ` Hugh Dickins
2023-07-06  1:20                       ` Hugh Dickins
2023-07-06 15:02                       ` Gerald Schaefer
2023-07-06 15:02                         ` Gerald Schaefer
2023-07-06 19:45                         ` Hugh Dickins
2023-07-06 19:45                           ` Hugh Dickins
2023-07-10 17:21                     ` Jason Gunthorpe
2023-07-10 17:21                       ` Jason Gunthorpe
2023-07-05  6:46               ` Alexander Gordeev
2023-07-05  6:46                 ` Alexander Gordeev
2023-07-06  0:52                 ` Hugh Dickins
2023-07-06  0:52                   ` Hugh Dickins
2023-07-07 14:37                   ` Gerald Schaefer
2023-07-07 14:37                     ` Gerald Schaefer
2023-07-03 16:10             ` Gerald Schaefer
2023-07-03 16:10               ` Gerald Schaefer
2023-06-29 13:59     ` Alexander Gordeev
2023-06-29 13:59       ` Alexander Gordeev
2023-06-29 15:43       ` Gerald Schaefer [this message]
2023-06-29 15:43         ` Gerald Schaefer
2023-06-30 13:38   ` Claudio Imbrenda
2023-06-30 13:38     ` Claudio Imbrenda
2023-06-30 15:28     ` Hugh Dickins
2023-06-30 15:28       ` Hugh Dickins
2023-06-30 16:25       ` Claudio Imbrenda
2023-06-30 16:25         ` Claudio Imbrenda
2023-06-30 19:22         ` Hugh Dickins
2023-06-30 19:22           ` Hugh Dickins
2023-07-03 11:00           ` Claudio Imbrenda
2023-07-03 11:00             ` Claudio Imbrenda
2023-07-03 21:29             ` Jason Gunthorpe
2023-07-03 21:29               ` Jason Gunthorpe
2023-06-20  7:53 ` [PATCH v2 08/12] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
2023-06-20  7:53   ` Hugh Dickins
2023-06-20  7:54 ` [PATCH v2 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
2023-06-20  7:54   ` Hugh Dickins
2023-06-20  7:56 ` [PATCH v2 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
2023-06-20  7:56   ` Hugh Dickins
2023-06-20  8:04   ` [PATCH mm " Hugh Dickins
2023-06-20  8:04     ` Hugh Dickins
2023-06-20  7:58 ` [PATCH v2 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
2023-06-20  7:58   ` Hugh Dickins
2023-06-20  7:59 ` [PATCH v2 12/12] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins
2023-06-20  7:59   ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230629174355.222ebed0@thinkpad-T15 \
    --to=gerald.schaefer@linux.ibm.com \
    --cc=agordeev@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=davem@davemloft.net \
    --cc=david@redhat.com \
    --cc=gor@linux.ibm.com \
    --cc=hca@linux.ibm.com \
    --cc=hch@infradead.org \
    --cc=hughd@google.com \
    --cc=imbrenda@linux.ibm.com \
    --cc=ira.weiny@intel.com \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lstoakes@gmail.com \
    --cc=mgorman@techsingularity.net \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=naoya.horiguchi@nec.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rcampbell@nvidia.com \
    --cc=rppt@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=song@kernel.org \
    --cc=sparclinux@vger.kernel.org \
    --cc=steven.price@arm.com \
    --cc=surenb@google.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=vbabka@suse.cz \
    --cc=vishal.moola@gmail.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    --cc=zackr@vmware.com \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.