public inbox for linux-s390@vger.kernel.org
* prospective 13/12 s390 pgtable_list patch
@ 2023-06-23  5:49 Hugh Dickins
  2023-06-23  9:29 ` Alexander Gordeev
  2023-06-28 19:15 ` Gerald Schaefer
  0 siblings, 2 replies; 6+ messages in thread
From: Hugh Dickins @ 2023-06-23  5:49 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Jason Gunthorpe, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

Hi Gerald,

It's that moment you've been dreading: I'm hoping that you can, please,
take a look at the patch below, and try building and running with it,
on top of the v2 series of 12 I sent out on Tuesday.

If this seems okay to you, I'll send it properly as 13/12 of that series,
to the full Cc list; but of course you may find I've missed typos and worse
- please don't waste your time on it if it's rubbish, but so far as I can
tell, it is complete and ready now.

Thanks!
Hugh

s390: SLAB_TYPESAFE_BY_RCU mm->context.pgtable_list

Move the s390 pgtable_list (with dedicated spinlock) out of mm->context
into a separate kmem_cache object specified with SLAB_TYPESAFE_BY_RCU:
which then allows __tlb_remove_table() to add a page back to the list,
safely, when the other half is still in use.
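
The SLAB_TYPESAFE_BY_RCU guarantee relied on here is the usual one: an
object from such a cache may be freed and immediately reallocated, but
its memory remains a valid object of that type for as long as an RCU
reader can still see it. In outline (generic pattern only, not literal
code from this patch):

	rcu_read_lock();
	pgtable_list = page->pt_list;
	/* stays a valid pgtable_list, with a spinlock_t in place */
	spin_lock_bh(&pgtable_list->lock);
	/* recheck page->_refcount: the mm may have exited meanwhile */
	spin_unlock_bh(&pgtable_list->lock);
	rcu_read_unlock();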

Update the commentary on pgtable_list above page_table_alloc().

Stop using pt_mm in struct page: this needs a pt_list in that union.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/s390/include/asm/mmu.h         |  10 +-
 arch/s390/include/asm/mmu_context.h |   8 +-
 arch/s390/mm/pgalloc.c              | 151 ++++++++++++++++++----------
 include/linux/mm_types.h            |   4 +-
 4 files changed, 117 insertions(+), 56 deletions(-)

diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index 829d68e2c685..069cdd612d86 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -6,12 +6,18 @@
 #include <linux/errno.h>
 #include <asm/asm-extable.h>
 
+struct pgtable_list {
+	struct list_head list;
+	spinlock_t lock;
+};
+extern struct pgtable_list init_pgtable_list;
+
 typedef struct {
 	spinlock_t lock;
 	cpumask_t cpu_attach_mask;
 	atomic_t flush_count;
 	unsigned int flush_mm;
-	struct list_head pgtable_list;
+	struct pgtable_list *pgtable_list;
 	struct list_head gmap_list;
 	unsigned long gmap_asce;
 	unsigned long asce;
@@ -39,7 +45,7 @@ typedef struct {
 
 #define INIT_MM_CONTEXT(name)						   \
 	.context.lock =	__SPIN_LOCK_UNLOCKED(name.context.lock),	   \
-	.context.pgtable_list = LIST_HEAD_INIT(name.context.pgtable_list), \
+	.context.pgtable_list = &init_pgtable_list,			   \
 	.context.gmap_list = LIST_HEAD_INIT(name.context.gmap_list),
 
 #endif
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index 2a38af5a00c2..725f82c55036 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -15,14 +15,20 @@
 #include <asm/ctl_reg.h>
 #include <asm-generic/mm_hooks.h>
 
+bool alloc_pgtable_list(struct mm_struct *mm);
+void destroy_context(struct mm_struct *mm);
+#define destroy_context destroy_context
+
 #define init_new_context init_new_context
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
 	unsigned long asce_type, init_entry;
 
+	if (!alloc_pgtable_list(mm))
+		return -ENOMEM;
+
 	spin_lock_init(&mm->context.lock);
-	INIT_LIST_HEAD(&mm->context.pgtable_list);
 	INIT_LIST_HEAD(&mm->context.gmap_list);
 	cpumask_clear(&mm->context.cpu_attach_mask);
 	atomic_set(&mm->context.flush_count, 0);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 11983a3ff95a..7f851b2dad49 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -159,15 +159,41 @@ void page_table_free_pgste(struct page *page)
 
 #endif /* CONFIG_PGSTE */
 
-/*
- * Temporarily use a global spinlock instead of mm->context.lock.
- * This will be replaced by a per-mm spinlock in a followup commit.
- */
-static DEFINE_SPINLOCK(mm_pgtable_list_lock);
+struct pgtable_list init_pgtable_list = {
+	.list	= LIST_HEAD_INIT(init_pgtable_list.list),
+	.lock	= __SPIN_LOCK_UNLOCKED(init_pgtable_list.lock),
+};
+
+static struct kmem_cache *pgtable_list_cachep;
+
+static void pgtable_list_ctor(void *object)
+{
+	struct pgtable_list *pgtable_list = object;
+
+	INIT_LIST_HEAD(&pgtable_list->list);
+	spin_lock_init(&pgtable_list->lock);
+}
+
+bool alloc_pgtable_list(struct mm_struct *mm)
+{
+	if (unlikely(!pgtable_list_cachep)) {	/* first time */
+		pgtable_list_cachep = kmem_cache_create("pgtable_list",
+			sizeof(struct pgtable_list), 0,
+			SLAB_TYPESAFE_BY_RCU | SLAB_PANIC, pgtable_list_ctor);
+	}
+	mm->context.pgtable_list =
+		kmem_cache_alloc(pgtable_list_cachep, GFP_KERNEL);
+	return !!mm->context.pgtable_list;
+}
+
+void destroy_context(struct mm_struct *mm)
+{
+	kmem_cache_free(pgtable_list_cachep, mm->context.pgtable_list);
+}
+
 /*
  * A 2KB-pgtable is either upper or lower half of a normal page.
- * The second half of the page may be unused or used as another
- * 2KB-pgtable.
+ * The other half of the page may be unused or used as another 2KB-pgtable.
  *
  * Whenever possible the parent page for a new 2KB-pgtable is picked
  * from the list of partially allocated pages mm_context_t::pgtable_list.
@@ -187,11 +213,15 @@ static DEFINE_SPINLOCK(mm_pgtable_list_lock);
  * As follows from the above, no unallocated or fully allocated parent
  * pages are contained in mm_context_t::pgtable_list.
  *
- * NOTE NOTE NOTE: The commentary above and below has not yet been updated:
- * the new rule is that a page is not linked to mm_context_t::pgtable_list
- * while either half is pending free by any method; but afterwards is
- * either relinked to it, or freed, by __tlb_remove_table().  This allows
- * pte_free_defer() to use the page->rcu_head (which overlays page->lru).
+ * However, there is a further twist. If a 2KB-pgtable is allocated in
+ * advance, in case it might be needed, but is then freed without being
+ * used, then it is freed immediately as described above.
+ *
+ * But when a 2KB-pgtable is freed after use, it usually passes through
+ * the quarantine of an RCU grace period, pending the actual free. While
+ * in this pending state, it cannot be reallocated, and its parent page
+ * is not permitted on mm_context_t::pgtable_list - no matter whether
+ * the other half is allocated, or pending free, or free.
  *
  * The upper byte (bits 24-31) of the parent page _refcount is used
  * for tracking contained 2KB-pgtables and has the following format:
@@ -211,32 +241,37 @@ static DEFINE_SPINLOCK(mm_pgtable_list_lock);
  *
  * When 2KB-pgtable is allocated the corresponding AA bit is set to 1.
  * The parent page is either:
- *   - added to mm_context_t::pgtable_list in case the second half of the
+ *   - added to mm_context_t::pgtable_list in case the other half of the
  *     parent page is still unallocated;
  *   - removed from mm_context_t::pgtable_list in case both halves of the
  *     parent page are allocated;
- * These operations are protected with mm_context_t::lock.
+ * These operations are protected with mm_context_t::pgtable_list::lock.
  *
  * When 2KB-pgtable is deallocated the corresponding AA bit is set to 0
  * and the corresponding PP bit is set to 1 in a single atomic operation.
  * Thus, PP and AA bits corresponding to the same 2KB-pgtable are mutually
  * exclusive and may never be both set to 1!
  * The parent page is either:
- *   - added to mm_context_t::pgtable_list in case the second half of the
- *     parent page is still allocated;
- *   - removed from mm_context_t::pgtable_list in case the second half of
- *     the parent page is unallocated;
- * These operations are protected with mm_context_t::lock.
+ *   - added to mm_context_t::pgtable_list in case the other half of the
+ *     parent page is still allocated - but never when setting the PP bit;
+ *   - removed from mm_context_t::pgtable_list in case the other half of
+ *     the parent page is unallocated (and not pending free, when the
+ *     parent page would already have been removed from the list);
+ * These operations are protected with mm_context_t::pgtable_list::lock.
  *
- * It is important to understand that mm_context_t::lock only protects
- * mm_context_t::pgtable_list and AA bits, but not the parent page itself
- * and PP bits.
+ * It is important to understand that mm_context_t::pgtable_list::lock
+ * only protects mm_context_t::pgtable_list::list and AA bits, but not
+ * the parent page itself and PP bits.
  *
- * Releasing the parent page happens whenever the PP bit turns from 1 to 0,
- * while both AA bits and the second PP bit are already unset. Then the
- * parent page does not contain any 2KB-pgtable fragment anymore, and it has
- * also been removed from mm_context_t::pgtable_list. It is safe to release
- * the page therefore.
+ * Releasing the parent page happens whenever both PP bits and both AA bits
+ * are unset. Then the parent page does not contain a 2KB-pgtable fragment
+ * anymore, and it has also been removed from mm_context_t::pgtable_list.
+ * It is safe to release the page therefore.
+ *
+ * The HH bits play no part in this logic, but indicate whether either
+ * 2KB-pgtable is using, or is waiting to use, the parent page's rcu_head.
+ * When required, the HH bit is set at the same time as setting the PP bit,
+ * then unset just before unsetting the PP bit.
  *
  * PGSTE memory spaces use full 4KB-pgtables and do not need most of the
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
@@ -245,6 +280,7 @@ static DEFINE_SPINLOCK(mm_pgtable_list_lock);
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
+	struct pgtable_list *pgtable_list = mm->context.pgtable_list;
 	unsigned long *table;
 	struct page *page;
 	unsigned int mask, bit;
@@ -252,9 +288,9 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	/* Try to get a fragment of a 4K page as a 2K page table */
 	if (!mm_alloc_pgste(mm)) {
 		table = NULL;
-		spin_lock_bh(&mm_pgtable_list_lock);
-		if (!list_empty(&mm->context.pgtable_list)) {
-			page = list_first_entry(&mm->context.pgtable_list,
+		spin_lock_bh(&pgtable_list->lock);
+		if (!list_empty(&pgtable_list->list)) {
+			page = list_first_entry(&pgtable_list->list,
 						struct page, lru);
 			mask = atomic_read(&page->_refcount) >> 24;
 			/* Cannot be on this list if either half pending free */
@@ -267,7 +303,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 			atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
 			list_del(&page->lru);
 		}
-		spin_unlock_bh(&mm_pgtable_list_lock);
+		spin_unlock_bh(&pgtable_list->lock);
 		if (table)
 			return table;
 	}
@@ -281,7 +317,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	}
 	arch_set_page_dat(page, 0);
 	/* Initialize page table */
-	page->pt_mm = mm;
+	page->pt_list = pgtable_list;
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
@@ -292,9 +328,9 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 		/* Return the first 2K fragment of the page */
 		atomic_xor_bits(&page->_refcount, 0x01U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
-		spin_lock_bh(&mm_pgtable_list_lock);
-		list_add(&page->lru, &mm->context.pgtable_list);
-		spin_unlock_bh(&mm_pgtable_list_lock);
+		spin_lock_bh(&pgtable_list->lock);
+		list_add(&page->lru, &pgtable_list->list);
+		spin_unlock_bh(&pgtable_list->lock);
 	}
 	return table;
 }
@@ -314,6 +350,7 @@ static void page_table_release_check(struct page *page, unsigned long *table,
 
 void page_table_free(struct mm_struct *mm, unsigned long *table)
 {
+	struct pgtable_list *pgtable_list;
 	unsigned int mask, bit, half;
 	struct page *page;
 
@@ -321,7 +358,8 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	if (!mm_alloc_pgste(mm)) {
 		/* Free 2K page table fragment of a 4K page */
 		bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t));
-		spin_lock_bh(&mm_pgtable_list_lock);
+		pgtable_list = mm->context.pgtable_list;
+		spin_lock_bh(&pgtable_list->lock);
 		/*
 		 * Mark the page for release. The actual release will happen
 		 * below from this function, or later from __tlb_remove_table().
@@ -329,10 +367,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 		mask = atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
 		mask >>= 24;
 		if (mask & 0x03U)		/* other half is allocated */
-			list_add(&page->lru, &mm->context.pgtable_list);
+			list_add(&page->lru, &pgtable_list->list);
 		else if (!(mask & 0x30U))	/* other half not pending */
 			list_del(&page->lru);
-		spin_unlock_bh(&mm_pgtable_list_lock);
+		spin_unlock_bh(&pgtable_list->lock);
 		if (mask != 0x00U)
 			return;
 		half = 0x01U << bit;
@@ -350,6 +388,7 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 			 unsigned long vmaddr)
 {
+	struct pgtable_list *pgtable_list;
 	struct mm_struct *mm;
 	struct page *page;
 	unsigned int bit, mask;
@@ -363,7 +402,8 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 		return;
 	}
 	bit = ((unsigned long) table & ~PAGE_MASK) / (PTRS_PER_PTE*sizeof(pte_t));
-	spin_lock_bh(&mm_pgtable_list_lock);
+	pgtable_list = mm->context.pgtable_list;
+	spin_lock_bh(&pgtable_list->lock);
 	/*
 	 * Mark the page for delayed release.
 	 * The actual release will happen later, from __tlb_remove_table().
@@ -373,7 +413,7 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 	/* Other half not allocated? Other half not already pending free? */
 	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
 		list_del(&page->lru);
-	spin_unlock_bh(&mm_pgtable_list_lock);
+	spin_unlock_bh(&pgtable_list->lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);
 }
@@ -391,30 +431,35 @@ void __tlb_remove_table(void *_table)
 	case 0x01U:	/* lower 2K of a 4K page table */
 	case 0x02U:	/* upper 2K of a 4K page table */
 		/*
-		 * If the other half is marked as allocated, page->pt_mm must
+		 * If other half is marked allocated, mm and page->pt_list must
 		 * still be valid, page->rcu_head no longer in use so page->lru
 		 * good for use, so now make the freed half available for reuse.
 		 * But be wary of races with that other half being freed.
 		 */
+		rcu_read_lock();
 		if (atomic_read(&page->_refcount) & (0x03U << 24)) {
-			struct mm_struct *mm = page->pt_mm;
+			struct pgtable_list *pgtable_list = page->pt_list;
 			/*
-			 * It is safe to use page->pt_mm when the other half
-			 * is seen allocated while holding pgtable_list lock;
-			 * but how will it be safe to acquire that spinlock?
-			 * Global mm_pgtable_list_lock is safe and easy for
-			 * now, then a followup commit will split it per-mm.
+			 * Here we are either directly freeing from a valid mm,
+			 * or in RCU callback context.  In the latter case, the
+			 * mm might already have been freed since the _refcount
+			 * check above; but SLAB_TYPESAFE_BY_RCU ensures that
+			 * pgtable_list still points to a valid pgtable_list,
+			 * with a spinlock_t in the right place, even if it is
+			 * no longer "ours".  Take that lock and recheck the
+			 * _refcount before adding to pgtable_list->list.
 			 */
-			spin_lock_bh(&mm_pgtable_list_lock);
+			spin_lock_bh(&pgtable_list->lock);
 			mask = atomic_xor_bits(&page->_refcount, mask << 28);
 			mask >>= 24;
 			if (mask & 0x03U)
-				list_add(&page->lru, &mm->context.pgtable_list);
-			spin_unlock_bh(&mm_pgtable_list_lock);
+				list_add(&page->lru, &pgtable_list->list);
+			spin_unlock_bh(&pgtable_list->lock);
 		} else {
 			mask = atomic_xor_bits(&page->_refcount, mask << 28);
 			mask >>= 24;
 		}
+		rcu_read_unlock();
 		if (mask != 0x00U)
 			return;
 		break;
@@ -475,6 +520,7 @@ static void pte_free_now1(struct rcu_head *head)
 
 void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 {
+	struct pgtable_list *pgtable_list;
 	unsigned int bit, mask;
 	struct page *page;
 
@@ -486,13 +532,14 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	bit = ((unsigned long)pgtable & ~PAGE_MASK) /
 			(PTRS_PER_PTE * sizeof(pte_t));
 
-	spin_lock_bh(&mm_pgtable_list_lock);
+	pgtable_list = mm->context.pgtable_list;
+	spin_lock_bh(&pgtable_list->lock);
 	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
 	mask >>= 24;
 	/* Other half not allocated? Other half not already pending free? */
 	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
 		list_del(&page->lru);
-	spin_unlock_bh(&mm_pgtable_list_lock);
+	spin_unlock_bh(&pgtable_list->lock);
 
 	/* Do not relink on rcu_head if other half already linked on rcu_head */
 	if ((mask & 0x0CU) != 0x0CU)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 09335fa28c41..a5b131abd82f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -31,6 +31,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct pgtable_list;
 
 /*
  * Each physical page in the system has a struct page associated with
@@ -150,7 +151,8 @@ struct page {
 			 */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
-				struct mm_struct *pt_mm; /* x86 pgd, s390 */
+				struct mm_struct *pt_mm; /* x86 pgds only */
+				struct pgtable_list *pt_list; /* s390 */
 				atomic_t pt_frag_refcount; /* powerpc */
 			};
 #if ALLOC_SPLIT_PTLOCKS
-- 
2.35.3



* Re: prospective 13/12 s390 pgtable_list patch
  2023-06-23  5:49 prospective 13/12 s390 pgtable_list patch Hugh Dickins
@ 2023-06-23  9:29 ` Alexander Gordeev
  2023-06-23 20:53   ` Hugh Dickins
  2023-06-28 19:15 ` Gerald Schaefer
  1 sibling, 1 reply; 6+ messages in thread
From: Alexander Gordeev @ 2023-06-23  9:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Gerald Schaefer, Jason Gunthorpe, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, linux-s390

On Thu, Jun 22, 2023 at 10:49:43PM -0700, Hugh Dickins wrote:
> Hi Gerald,
> 
> It's that moment you've been dreading: I'm hoping that you can, please,
> take a look at the patch below, and try building and running with it,
> on top of the v2 series of 12 I sent out on Tuesday.
> 
> If this seems okay to you, I'll send it properly as 13/12 of that series,
> to the full Cc list; but of course you may find I've missed typos and worse
> - please don't waste your time on it if it's rubbish, but so far as I can
> tell, it is complete and ready now.
> 
> Thanks!
> Hugh

Hi Hugh,

Gerald is off until Monday and I think he is not able to answer right now.

We had discussions about how best to approach your series, and
unfortunately did not come to a conclusion.

Gerald had several concerns - one of them was the global mm_pgtable_list_lock,
which is luckily avoided with this follow-up patch. But there were others,
which I am not able to articulate in detail.

While you are doing an outstanding job in trying to adjust our fragmented
page table reuse scheme, one of the ideas that emerged was to actually give
it up: partially or maybe even fully. That is - not to reuse page tables
returned via pte_free_defer(), or not to reuse them at all. To assess these
possible new approaches, some statistics are needed, and I am working on a
prototype that would allow collecting them.
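
Something as simple as two counters would already tell a lot. Just a
sketch of the idea, not the actual prototype (these names exist nowhere
in the tree):

	/* hypothetical: how does page_table_alloc() satisfy requests? */
	static atomic64_t pgtable_frag_reused = ATOMIC64_INIT(0);
	static atomic64_t pgtable_page_fresh = ATOMIC64_INIT(0);

bumped with atomic64_inc() on the two success paths of page_table_alloc()
(2K fragment taken off the list, or fresh 4K page allocated), and read
back from a debugfs file or an occasional printk. If the reused count
stays tiny next to the fresh count under real workloads, the reuse
scheme buys little.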

Note, our existing code is extremely complicated already and we had a hard
time fixing (at least one) ugly race related to it. With the changes you
suggest, that complexity (to me personally) multiplies. But it could well
be the other way around and I am just not smart enough to grasp it.
At least the claim that page_table_free() no longer needs the two-step
release suggests as much.

I am sorry that it is probably not the status you would like to hear,
but I still wonder what your opinion is on that do-not-reuse-fragments
approach? Would it simplify pte_free_defer(), or have no effect?

Anyway, that is just another option and I will try your patch.

Thanks!


* Re: prospective 13/12 s390 pgtable_list patch
  2023-06-23  9:29 ` Alexander Gordeev
@ 2023-06-23 20:53   ` Hugh Dickins
  2023-06-26 15:22     ` Alexander Gordeev
  0 siblings, 1 reply; 6+ messages in thread
From: Hugh Dickins @ 2023-06-23 20:53 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Gerald Schaefer, Jason Gunthorpe, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	linux-s390

On Fri, 23 Jun 2023, Alexander Gordeev wrote:
> On Thu, Jun 22, 2023 at 10:49:43PM -0700, Hugh Dickins wrote:
> > Hi Gerald,
> > 
> > It's that moment you've been dreading: I'm hoping that you can, please,
> > take a look at the patch below, and try building and running with it,
> > on top of the v2 series of 12 I sent out on Tuesday.
> > 
> > If this seems okay to you, I'll send it properly as 13/12 of that series,
> > to the full Cc list; but of course you may find I've missed typos and worse
> > - please don't waste your time on it if it's rubbish, but so far as I can
> > tell, it is complete and ready now.
> > 
> > Thanks!
> > Hugh
> 
> Hi Hugh,
> 
> Gerald is off until Monday and I think he is not able to answer right now.

Thanks a lot for stepping in, Alexander.

> 
> We had discussions about how best to approach your series, and
> unfortunately did not come to a conclusion.
> 
> Gerald had several concerns - one of them was the global mm_pgtable_list_lock,
> which is luckily avoided with this follow-up patch. But there were others,
> which I am not able to articulate in detail.
> 
> While you are doing an outstanding job in trying to adjust our fragmented
> page table reuse scheme, one of the ideas that emerged was to actually give
> it up: partially or maybe even fully. That is - not to reuse page tables
> returned via pte_free_defer(), or not to reuse them at all. To assess these
> possible new approaches, some statistics are needed, and I am working on a
> prototype that would allow collecting them.
> 
> Note, our existing code is extremely complicated already and we had a hard
> time fixing (at least one) ugly race related to it. With the changes you
> suggest, that complexity (to me personally) multiplies. But it could well
> be the other way around and I am just not smart enough to grasp it.
> At least the claim that page_table_free() no longer needs the two-step
> release suggests as much.

Yes, I had to cool myself down to a low temperature, and think about it
for several hours before arriving at that conclusion.  It would be nice
if I could point to one fact which explains it convincingly (something
along the lines of "look, we already took it off the list in that case");
but I didn't manage to put it into words, and ended up deciding that we
each have to do our own thinking to convince ourselves on that issue.

> 
> I am sorry that it is probably not the status you would like to hear,

Not a problem for me at all: you're absolutely right to question whether
the existing, and the added, complexity is worth it - all of us who have
looked into that code have wondered the same.

Initially I refused to even try to get into it; but in the end I was quite
proud that I had, given time, managed to work with it.  But no problem
if that work is quickly discarded in favour of simplicity.

> but I still wonder what your opinion is on that do-not-reuse-fragments

I don't want to sway you one way or the other on that: I just want to
work with whatever you decide.  I know Matthew Wilcox would prefer if
powerpc and s390 went about things in the same way (but they do have
different constraints, so I don't necessarily agree); but I did not
feel able to persuade powerpc to adopt s390's more complex approach,
nor to persuade s390 to downgrade to powerpc's simpler approach.

> approach? Would it simplify pte_free_defer(), or have no effect?

It would certainly simplify it a lot.  Whether it would just be a
matter of deleting my attempt to list_add() from __tlb_remove_table(),
so restoring the per-mm lock, and forgetting the SLAB_TYPESAFE_BY_RCU;
or whether it would go further, and most of the PP-AA bit tracking fall
away, I cannot predict - depends rather on who does the job, and how
far they choose to take it.

Two notes I want to throw into the mix:

One, FWIW my guess is that for most mms, the s390 PP-AA tracking adds very
little value; but without it, perhaps there is some mm, some workload, which
degrades to the point of using only half of the pgtable memory it allocates.

Two, maybe you'll see this as a further complication, to avoid getting into,
rather than a simplification: but it occurred to me while working here that
s390 has no need to keep the pgtable_trans_huge_deposit/withdraw() code in
mm/pgtable.c separate from the mm/pgalloc.c code.  They can use just the
one list, and deposit/withdraw simply keep a count of how many immediately
usable free pgtables must always be kept in reserve on that list.  (But
this would not be true at all for powerpc.)
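
To make that concrete, the deposit side might look something like this -
sketch only, with a hypothetical "reserved" count added to the
pgtable_list of this 13/12, and 2K fragment accounting elided:

	static void thp_deposit_sketch(struct mm_struct *mm, struct page *page)
	{
		struct pgtable_list *pgtable_list = mm->context.pgtable_list;

		spin_lock_bh(&pgtable_list->lock);
		/* park it on the ordinary free list like any other page */
		list_add(&page->lru, &pgtable_list->list);
		/* but page_table_alloc() must now leave this many in reserve */
		pgtable_list->reserved++;
		spin_unlock_bh(&pgtable_list->lock);
	}

with withdraw doing the reverse, and page_table_alloc() falling back to
alloc_page() whenever taking from the list would dip into the reserve.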

> 
> Anyway, that is just another option and I will try your patch.

Thank you, please do, if you have the time: my series does need some
kind of s390 solution, depending on whatever s390 has in its tree at
the time: for now I should at least be showing the 13/12 (preferably
known to build and appearing to work!), even if all it does is help
people to say "no, that's too much".

Hugh


* Re: prospective 13/12 s390 pgtable_list patch
  2023-06-23 20:53   ` Hugh Dickins
@ 2023-06-26 15:22     ` Alexander Gordeev
  2023-06-27  3:45       ` Hugh Dickins
  0 siblings, 1 reply; 6+ messages in thread
From: Alexander Gordeev @ 2023-06-26 15:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Gerald Schaefer, Jason Gunthorpe, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, linux-s390

On Fri, Jun 23, 2023 at 01:53:43PM -0700, Hugh Dickins wrote:
> > Anyway, that is just another option and I will try your patch.
> 
> Thank you, please do, if you have the time: my series does need some
> kind of s390 solution, depending on whatever s390 has in its tree at
> the time: for now I should at least be showing the 13/12 (preferably
> known to build and appearing to work!), even if all it does is help
> people to say "no, that's too much".

Hi Hugh,

I took a bit of a different approach to running it: with and without HH bits
in play.

IOW (if I am not mistaken) without the three patch series these two
patches on top of v6.4-rc5 alone should still manage page allocation,
as if pte_free_defer() did not exist / never got called:

	s390: add pte_free_defer() for pgtables sharing page
	prospective 13/12 s390 pgtable_list patch

These patches appear to work (aka survive LTP MM tests).

The next step would be to try the three series on top of v6.4-rc5 plus
patch 13/12, as you intended:

https://lore.kernel.org/linux-mm/77a5d8c-406b-7068-4f17-23b7ac53bc83@google.com/
https://lore.kernel.org/linux-mm/68a97fbe-5c1e-7ac6-72c-7b9c6290b370@google.com/
https://lore.kernel.org/linux-mm/54cb04f-3762-987f-8294-91dafd8ebfb0@google.com
https://lore.kernel.org/r/a69a26c0-ec93-3ad-a443-6655b5e49df2@google.com

I had to manually apply this one with fuzz in order to circumvent a 'git am' error:
[PATCH v2 28/32] mm/memory: allow pte_offset_map[_lock]() to fail: fix

But in the end it still does not compile for me:

mm/memory.c: In function ‘zap_pmd_range’:
mm/memory.c:1561:21: error: implicit declaration of function ‘pmd_none_or_trans_huge_or_clear_bad’ [-Werror=implicit-function-declaration]
 1561 |                 if (pmd_none_or_trans_huge_or_clear_bad(pmd))
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  CC      fs/btrfs/dev-replace.o
In file included from ./include/linux/build_bug.h:5,
                 from ./include/linux/container_of.h:5,
                 from ./include/linux/list.h:5,
                 from ./include/linux/smp.h:12,
                 from ./include/linux/kernel_stat.h:5,
                 from mm/memory.c:42:
mm/memory.c: In function ‘do_anonymous_page’:
mm/memory.c:4058:22: error: implicit declaration of function ‘pmd_trans_unstable’; did you mean ‘pud_trans_unstable’? [-Werror=implicit-function-declaration]
 4058 |         if (unlikely(pmd_trans_unstable(vmf->pmd)))
      |                      ^~~~~~~~~~~~~~~~~~
./include/linux/compiler.h:77:45: note: in definition of macro ‘unlikely’
   77 | # define unlikely(x)    __builtin_expect(!!(x), 0)
      |                                             ^
  CC      fs/xfs/xfs_iops.o
mm/memory.c: In function ‘finish_fault’:
mm/memory.c:4388:13: error: implicit declaration of function ‘pmd_devmap_trans_unstable’; did you mean ‘pud_trans_unstable’? [-Werror=implicit-function-declaration]
 4388 |         if (pmd_devmap_trans_unstable(vmf->pmd))
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~
      |             pud_trans_unstable


Could you please let me know if I am missing something?

Thanks!

> Hugh


* Re: prospective 13/12 s390 pgtable_list patch
  2023-06-26 15:22     ` Alexander Gordeev
@ 2023-06-27  3:45       ` Hugh Dickins
  0 siblings, 0 replies; 6+ messages in thread
From: Hugh Dickins @ 2023-06-27  3:45 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Gerald Schaefer, Jason Gunthorpe, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	linux-s390


On Mon, 26 Jun 2023, Alexander Gordeev wrote:
> On Fri, Jun 23, 2023 at 01:53:43PM -0700, Hugh Dickins wrote:
> > > Anyway, that is just another option and I will try your patch.
> > 
> > Thank you, please do, if you have the time: my series does need some
> > kind of s390 solution, depending on whatever s390 has in its tree at
> > the time: for now I should at least be showing the 13/12 (preferably
> > known to build and appearing to work!), even if all it does is help
> > people to say "no, that's too much".
> 
> Hi Hugh,
> 
> I took a bit of a different approach to running it: with and without HH bits
> in play.
> 
> IOW (if I am not mistaken) without the three patch series these two
> patches on top of v6.4-rc5 alone should still manage page allocation,
> as if pte_free_defer() did not exist / never got called:
> 
> 	s390: add pte_free_defer() for pgtables sharing page
> 	prospective 13/12 s390 pgtable_list patch

That was a good idea, to try the "bare bones" first.  Yes, I think that
should work okay: with no actual dependence on the preceding patches.

> 
> These patches appear to work (aka survive LTP MM tests).

And that is very good news, thank you (I was expecting I'd at least
have made some silly typo like last time).

> 
> The next step would be to try the three series on top of v6.4-rc5 plus
> patch 13/12, as you intended:
> 
> https://lore.kernel.org/linux-mm/77a5d8c-406b-7068-4f17-23b7ac53bc83@google.com/

There's maybe no difference that matters, but that was the v1 series: v2 was
https://lore.kernel.org/linux-mm/a4963be9-7aa6-350-66d0-2ba843e1af44@google.com/
series of 23 posted on 2023-06-08.

> https://lore.kernel.org/linux-mm/68a97fbe-5c1e-7ac6-72c-7b9c6290b370@google.com/

There's maybe no difference that matters, but that was the v1 series: v2 was
https://lore.kernel.org/linux-mm/c1c9a74a-bc5b-15ea-e5d2-8ec34bc921d@google.com/
series of 32 posted on 2023-06-08.

> https://lore.kernel.org/linux-mm/54cb04f-3762-987f-8294-91dafd8ebfb0@google.com

Right.

> https://lore.kernel.org/r/a69a26c0-ec93-3ad-a443-6655b5e49df2@google.com

Right.

> 
> I had to manually apply this one with fuzz in order to circumvent a 'git am' error:
> [PATCH v2 28/32] mm/memory: allow pte_offset_map[_lock]() to fail: fix
> 
> But in the end it still does not compile for me:

I think perhaps you missed out the actual
[PATCH v2 28/32] mm/memory: allow pte_offset_map[_lock]() to fail
(oh, not exactly, you were working with v1 there) and tried to apply
[PATCH v2 28/32 fix] mm/memory: allow pte_offset_map[_lock]() to fail: fix
instead of it; whereas the "fix" patch is a cumulative patch on top of the
missed one (and the "fix" patch is not at all important for your testing,
since update_mmu_tlb() is a no-op on s390).

(There was other pfaffing around on PATCH v2 28/32, but that was
about applying it to akpm's tree at the time: v6.4-rc5 no problem.)

> 
> mm/memory.c: In function ‘zap_pmd_range’:
> mm/memory.c:1561:21: error: implicit declaration of function ‘pmd_none_or_trans_huge_or_clear_bad’ [-Werror=implicit-function-declaration]
>  1561 |                 if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>       |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   CC      fs/btrfs/dev-replace.o
> In file included from ./include/linux/build_bug.h:5,
>                  from ./include/linux/container_of.h:5,
>                  from ./include/linux/list.h:5,
>                  from ./include/linux/smp.h:12,
>                  from ./include/linux/kernel_stat.h:5,
>                  from mm/memory.c:42:
> mm/memory.c: In function ‘do_anonymous_page’:
> mm/memory.c:4058:22: error: implicit declaration of function ‘pmd_trans_unstable’; did you mean ‘pud_trans_unstable’? [-Werror=implicit-function-declaration]
>  4058 |         if (unlikely(pmd_trans_unstable(vmf->pmd)))
>       |                      ^~~~~~~~~~~~~~~~~~
> ./include/linux/compiler.h:77:45: note: in definition of macro ‘unlikely’
>    77 | # define unlikely(x)    __builtin_expect(!!(x), 0)
>       |                                             ^
>   CC      fs/xfs/xfs_iops.o
> mm/memory.c: In function ‘finish_fault’:
> mm/memory.c:4388:13: error: implicit declaration of function ‘pmd_devmap_trans_unstable’; did you mean ‘pud_trans_unstable’? [-Werror=implicit-function-declaration]
>  4388 |         if (pmd_devmap_trans_unstable(vmf->pmd))
>       |             ^~~~~~~~~~~~~~~~~~~~~~~~~
>       |             pud_trans_unstable
> 
> 
> Could you please let me know if I am missing something?

I think, start again with the v2 of each of the three series on v6.4-rc5,
including the v2 28/32, and I expect that to work out fine.

What you already tested above has achieved a lot: my changes left
pte_free_defer() and page_table_free_rcu() sharing __tlb_remove_table(),
so page_table_free_rcu() would already be exercising that list_add().

But you may have difficulty actually exercising the pte_free_defer().
Maybe hack in a counter (I tend to hack on BALLOON events in vmstat,
since normally those would be 0 for me) to check whether running LTP
is actually exercising those paths - if not, we'll devise something more.
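
For example - an illustrative throwaway hack only, assuming a config where
the balloon events exist but stay at zero (CONFIG_MEMORY_BALLOON=y with no
balloon driver in use), plus <linux/vmstat.h> if pgalloc.c does not already
pull it in:

--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ (in pte_free_defer(); line numbers omitted)
 	bit = ((unsigned long)pgtable & ~PAGE_MASK) /
 			(PTRS_PER_PTE * sizeof(pte_t));
+	count_vm_event(BALLOON_INFLATE);	/* debug: watch /proc/vmstat */

then compare "grep balloon /proc/vmstat" before and after the LTP run.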

Thanks,
Hugh


* Re: prospective 13/12 s390 pgtable_list patch
  2023-06-23  5:49 prospective 13/12 s390 pgtable_list patch Hugh Dickins
  2023-06-23  9:29 ` Alexander Gordeev
@ 2023-06-28 19:15 ` Gerald Schaefer
  1 sibling, 0 replies; 6+ messages in thread
From: Gerald Schaefer @ 2023-06-28 19:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jason Gunthorpe, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Thu, 22 Jun 2023 22:49:43 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> Hi Gerald,
> 
> It's that moment you've been dreading: I'm hoping that you can, please,
> take a look at the patch below, and try building and running with it,
> on top of the v2 series of 12 I sent out on Tuesday.

Wow, not sure if this is now dreadful or awesome - it could be the latter,
since this might actually work. But as Alexander already said, we would
rather not add more complexity, for an already questionable benefit with
regard to saving some page table memory.

I have revisited my last approach of not adding back fragments in the
pte_free_defer() path, and I am rather confident now that there is no
"list_del() without list_add()" flaw any more. I had to use the
page->pt_frag_refcount for this, to track list status. So even though we
do not need page->pt_mm any more, that union is still not free for other
things. But I guess that is acceptable, and I especially like that the
patch will hardly change existing code at all, apart from the list
status tracking.
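
Roughly, the shape of that tracking is (sketch from memory only, details
will differ in the real patch):

	/* under the pgtable list lock: pt_frag_refcount tracks list status */
	if (atomic_read(&page->pt_frag_refcount)) {
		/* page known to be linked, so list_del() is always paired */
		list_del(&page->lru);
		atomic_set(&page->pt_frag_refcount, 0);
	}

with it set to 1 wherever the page is added to the list, so that no path
can do a list_del() for a page that was never list_add()ed.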

See my reply to your patch 07/12 in the other thread, for the patch.
It is tested with LTP and patches from all three series applied, and
it would allow going further with your work by having an s390 solution
that is neither obviously flawed, nor too complex to handle or understand
in the future.

I also left a TODO comment in the patch, with regard to the still open
question of whether we need the gmap_unlink(mm, pgtable, addr) also in
pte_free_defer(), similar to page_table_free_rcu(). If yes, we would need
some way to pass along the addr, which should be possible. However, IIUC,
retract_page_tables() is currently ending up in a normal page_table_free(),
w/o your work. And there we also have no addr available, and do no
gmap_unlink(), so there might be a good chance that this is not needed here.

