linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	 Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>,
	 Suren Baghdasaryan <surenb@google.com>,
	 Qi Zheng <zhengqi.arch@bytedance.com>,
	Yang Shi <shy828301@gmail.com>,
	 Mel Gorman <mgorman@techsingularity.net>,
	Peter Xu <peterx@redhat.com>,
	 Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>,  Yu Zhao <yuzhao@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	 Ralph Campbell <rcampbell@nvidia.com>,
	Ira Weiny <ira.weiny@intel.com>,
	 Steven Price <steven.price@arm.com>,
	SeongJae Park <sj@kernel.org>,
	 Naoya Horiguchi <naoya.horiguchi@nec.com>,
	 Christophe Leroy <christophe.leroy@csgroup.eu>,
	 Zack Rusin <zackr@vmware.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	 Anshuman Khandual <anshuman.khandual@arm.com>,
	 Pasha Tatashin <pasha.tatashin@soleen.com>,
	 Miaohe Lin <linmiaohe@huawei.com>,
	Minchan Kim <minchan@kernel.org>,
	 Christoph Hellwig <hch@infradead.org>,
	Song Liu <song@kernel.org>,
	 Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
	 Russell King <linux@armlinux.org.uk>,
	 "David S. Miller" <davem@davemloft.net>,
	 Michael Ellerman <mpe@ellerman.id.au>,
	 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	 Heiko Carstens <hca@linux.ibm.com>,
	 Christian Borntraeger <borntraeger@linux.ibm.com>,
	 Claudio Imbrenda <imbrenda@linux.ibm.com>,
	 Alexander Gordeev <agordeev@linux.ibm.com>,
	Jann Horn <jannh@google.com>,
	 linux-arm-kernel@lists.infradead.org,
	sparclinux@vger.kernel.org,  linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
Date: Mon, 5 Jun 2023 22:11:52 -0700 (PDT)	[thread overview]
Message-ID: <175ebec8-761-c3f-2d98-6c3bd87161c8@google.com> (raw)
In-Reply-To: <6dd63b39-e71f-2e8b-7e0-83e02f3bcb39@google.com>

On Sun, 28 May 2023, Hugh Dickins wrote:

> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because page_table_free()
> needs to know which fragment is being freed, and which mm to link it to.
> 
> page_table_free()'s fragment handling is clever, but I could too easily
> break it: what's done here in pte_free_defer() and pte_free_now() might
> be better integrated with page_table_free()'s cleverness, but not by me!
> 
> By the time that page_table_free() gets called via RCU, it's conceivable
> that mm would already have been freed: so mmgrab() in pte_free_defer()
> and mmdrop() in pte_free_now().  No, that is not a good context to call
> mmdrop() from, so make mmdrop_async() public and use that.

But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
between multiple page tables is tricky: something I knew but had lost
sight of.  So the powerpc and s390 patches were broken: powerpc fairly
easily fixed, but s390 more painful.

In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
On Thu, 1 Jun 2023 15:57:51 +0200
Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> 
> Yes, we have 2 pagetables in one 4K page, which could result in same
> rcu_head reuse. It might be possible to use the cleverness from our
> page_table_free() function, e.g. to only do the call_rcu() once, for
> the case where both 2K pagetable fragments become unused, similar to
> how we decide when to actually call __free_page().
> 
> However, it might be much worse, and page->rcu_head from a pagetable
> page cannot be used at all for s390, because we also use page->lru
> to keep our list of free 2K pagetable fragments. I always get confused
> by struct page unions, so not completely sure, but it seems to me that
> page->rcu_head would overlay with page->lru, right?

Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
I'm wrong) I think that s390 could use exactly the same technique for
its list of free 2K pagetable fragments as it uses for its list of THP
"deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
the first two longs of the page table itself for threading the list.

And while it could use third and fourth longs instead, I don't see any
need for that: a deposited pagetable has been allocated, so would not
be on the list of free fragments.

Below is one of the grossest patches I've ever posted: gross because
it's a rushed attempt to see whether that is viable, while it would take
me longer to understand all the s390 cleverness there (even though the
PP AA commentary above page_table_alloc() is excellent).

I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

Gerald, s390 folk: would it be possible for you to give this
a try, suggest corrections and improvements, and then I can make it
a separate patch of the series; and work on avoiding concurrent use
of the rcu_head by pagetable fragment buddies (ideally fit in with
the scheme already there, maybe DD bits to go along with the PP AA).

Why am I even asking you to move away from page->lru: why don't I
thread s390's pte_free_defer() pagetables like THP's deposit does?
I cannot, because the deferred pagetables have to remain accessible
as valid pagetables, until the RCU grace period has elapsed - unless
all the list pointers would appear as pte_none(), which I doubt.

(That may limit our possibilities with the deposited pagetables in
future: I can imagine them too wanting to remain accessible as valid
pagetables.  But that's not needed by this series, and s390 only uses
deposit/withdraw for anon THP; and some are hoping that we might be
able to move away from deposit/withdraw altogther - though powerpc's
special use will make that more difficult.)

Thanks!
Hugh

--- 6.4-rc5/arch/s390/mm/pgalloc.c
+++ linux/arch/s390/mm/pgalloc.c
@@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
+	struct list_head *listed;
 	unsigned long *table;
 	struct page *page;
 	unsigned int mask, bit;
@@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
 		table = NULL;
 		spin_lock_bh(&mm->context.lock);
 		if (!list_empty(&mm->context.pgtable_list)) {
-			page = list_first_entry(&mm->context.pgtable_list,
-						struct page, lru);
+			listed = mm->context.pgtable_list.next;
+			page = virt_to_page(listed);
 			mask = atomic_read(&page->_refcount) >> 24;
 			/*
 			 * The pending removal bits must also be checked.
@@ -259,9 +260,12 @@ unsigned long *page_table_alloc(struct m
 				bit = mask & 1;		/* =1 -> second 2K */
 				if (bit)
 					table += PTRS_PER_PTE;
+				BUG_ON(table != (unsigned long *)listed);
 				atomic_xor_bits(&page->_refcount,
 							0x01U << (bit + 24));
-				list_del(&page->lru);
+				list_del(listed);
+				set_pte((pte_t *)&table[0], __pte(_PAGE_INVALID));
+				set_pte((pte_t *)&table[1], __pte(_PAGE_INVALID));
 			}
 		}
 		spin_unlock_bh(&mm->context.lock);
@@ -288,8 +292,9 @@ unsigned long *page_table_alloc(struct m
 		/* Return the first 2K fragment of the page */
 		atomic_xor_bits(&page->_refcount, 0x01U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
+		listed = (struct list head *)(table + PTRS_PER_PTE);
 		spin_lock_bh(&mm->context.lock);
-		list_add(&page->lru, &mm->context.pgtable_list);
+		list_add(listed, &mm->context.pgtable_list);
 		spin_unlock_bh(&mm->context.lock);
 	}
 	return table;
@@ -310,6 +315,7 @@ static void page_table_release_check(str
 
 void page_table_free(struct mm_struct *mm, unsigned long *table)
 {
+	struct list_head *listed;
 	unsigned int mask, bit, half;
 	struct page *page;
 
@@ -325,10 +331,24 @@ void page_table_free(struct mm_struct *m
 		 */
 		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
-			list_add(&page->lru, &mm->context.pgtable_list);
-		else
-			list_del(&page->lru);
+		if (mask & 0x03U) {
+			listed = (struct list_head *)table;
+			list_add(listed, &mm->context.pgtable_list);
+		} else {
+			/*
+			 * Get address of the other page table sharing the page.
+			 * There are sure to be MUCH better ways to do all this!
+			 * But I'm rushing, while trying to keep to the obvious.
+			 */
+			listed = (struct list_head *)(table + PTRS_PER_PTE);
+			if (virt_to_page(listed) != page) {
+				/* sizeof(*listed) is twice sizeof(*table) */
+				listed -= PTRS_PER_PTE;
+			}
+			list_del(listed);
+			set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
+			set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
+		}
 		spin_unlock_bh(&mm->context.lock);
 		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
 		mask >>= 24;
@@ -349,6 +369,7 @@ void page_table_free(struct mm_struct *m
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 			 unsigned long vmaddr)
 {
+	struct list_head *listed;
 	struct mm_struct *mm;
 	struct page *page;
 	unsigned int bit, mask;
@@ -370,10 +391,24 @@ void page_table_free_rcu(struct mmu_gath
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
-		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
-		list_del(&page->lru);
+	if (mask & 0x03U) {
+		listed = (struct list_head *)table;
+		list_add_tail(listed, &mm->context.pgtable_list);
+	} else {
+		/*
+		 * Get address of the other page table sharing the page.
+		 * There are sure to be MUCH better ways to do all this!
+		 * But I'm rushing, and trying to keep to the obvious.
+		 */
+		listed = (struct list_head *)(table + PTRS_PER_PTE);
+		if (virt_to_page(listed) != page) {
+			/* sizeof(*listed) is twice sizeof(*table) */
+			listed -= PTRS_PER_PTE;
+		}
+		list_del(listed);
+		set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
+		set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
+	}
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2023-06-06  5:12 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-29  6:11 [PATCH 00/12] mm: free retracted page table by RCU Hugh Dickins
2023-05-29  6:14 ` [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
2023-05-31 17:06   ` Jann Horn
2023-06-02  2:50     ` Hugh Dickins
2023-06-02 14:21       ` Jann Horn
2023-05-29  6:16 ` [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
2023-05-29 13:56   ` Matthew Wilcox
     [not found]   ` <ZHeg3oRljRn6wlLX@ziepe.ca>
2023-06-02  5:35     ` Hugh Dickins
2023-05-29  6:17 ` [PATCH 03/12] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
2023-05-29  6:18 ` [PATCH 04/12] powerpc: assert_pte_locked() " Hugh Dickins
2023-05-29  6:20 ` [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-05-29 14:02   ` Matthew Wilcox
2023-05-29 14:36     ` Hugh Dickins
2023-06-01 13:57       ` Gerald Schaefer
2023-06-02 14:20     ` Jason Gunthorpe
2023-06-06  3:40       ` Hugh Dickins
2023-06-06 18:23         ` Jason Gunthorpe
2023-06-06 19:03           ` Peter Xu
2023-06-06 19:08             ` Jason Gunthorpe
2023-06-07  3:49               ` Hugh Dickins
2023-05-29  6:21 ` [PATCH 06/12] sparc: " Hugh Dickins
2023-06-06  3:46   ` Hugh Dickins
2023-05-29  6:22 ` [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async() Hugh Dickins
2023-06-06  5:11   ` Hugh Dickins [this message]
2023-06-06 18:39     ` Jason Gunthorpe
2023-06-08  2:46       ` Hugh Dickins
2023-06-06 19:40     ` Gerald Schaefer
2023-06-08  3:35       ` Hugh Dickins
2023-06-08 13:58         ` Jason Gunthorpe
2023-06-08 15:47         ` Gerald Schaefer
2023-05-29  6:23 ` [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
2023-06-01 13:31   ` Jann Horn
     [not found]   ` <ZHekpAKJ05cr/GLl@ziepe.ca>
2023-06-02  6:03     ` Hugh Dickins
2023-06-02 12:15       ` Jason Gunthorpe
2023-05-29  6:25 ` [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
2023-05-29 23:26   ` Peter Xu
2023-05-31  0:38     ` Hugh Dickins
2023-05-31 15:34   ` Jann Horn
     [not found]     ` <ZHe0A079X9B8jWlH@x1n>
2023-05-31 22:18       ` Jann Horn
2023-06-01 14:06         ` Jason Gunthorpe
2023-06-06  6:18     ` Hugh Dickins
2023-05-29  6:26 ` [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
     [not found]   ` <CAG48ez2X5oZyxaFniZ-HeGHDGjNuPBewGTjZLEHPWkBbBCaigg@mail.gmail.com>
2023-06-02  5:11     ` Hugh Dickins
2023-05-29  6:28 ` [PATCH 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
2023-05-29  6:30 ` [PATCH 12/12] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins
     [not found] ` <CAG48ez0pCqfRdVSnJz7EKtNvMR65=zJgVB-72nTdrNuhtJNX2Q@mail.gmail.com>
2023-06-02  4:37   ` [PATCH 00/12] mm: free retracted page table by RCU Hugh Dickins
2023-06-02 15:26     ` Jann Horn
2023-06-06  6:28       ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=175ebec8-761-c3f-2d98-6c3bd87161c8@google.com \
    --to=hughd@google.com \
    --cc=agordeev@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=davem@davemloft.net \
    --cc=david@redhat.com \
    --cc=gerald.schaefer@linux.ibm.com \
    --cc=gor@linux.ibm.com \
    --cc=hca@linux.ibm.com \
    --cc=hch@infradead.org \
    --cc=imbrenda@linux.ibm.com \
    --cc=ira.weiny@intel.com \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mgorman@techsingularity.net \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=naoya.horiguchi@nec.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rcampbell@nvidia.com \
    --cc=rppt@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=song@kernel.org \
    --cc=sparclinux@vger.kernel.org \
    --cc=steven.price@arm.com \
    --cc=surenb@google.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    --cc=zackr@vmware.com \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).