public inbox for linux-kernel@vger.kernel.org
* [PATCH 00/15] mm: struct page lock and counts
@ 2005-11-10  1:42 Hugh Dickins
  2005-11-10  1:43 ` [PATCH 01/15] mm: poison struct page for ptlock Hugh Dickins
                   ` (14 more replies)
  0 siblings, 15 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Here comes a batch of fifteen against 2.6.14-mm1.  The first eight tie
up various loose ends of the page fault scalability patches (but I've
not yet got to Documentation/vm/locking).  The last seven are about page
count overflow, an issue made more immediate by the refcounting of the
ZERO_PAGE.  So I'd like to see these as 2.6.15 material: and mostly
the patches do apply to 2.6.14-git12; but there are two reiser4 patches
I separated out, an assumption that Andi's x86_64-page-flags-cleanup.patch
will go in ahead, and an obvious reject in vmscan.c.  You may see them
rather differently!  I think there could be two views on the overflow
patches: you might find them neat, you might find them too oblique; but
they do avoid extra tests in the hotter paths.  In some cases I've put
an additional comment below the --- cutoff.

Hugh

 arch/frv/mm/pgalloc.c                    |    4 
 arch/i386/mm/pgtable.c                   |    8 
 arch/powerpc/mm/4xx_mmu.c                |    4 
 arch/powerpc/mm/hugetlbpage.c            |    4 
 arch/powerpc/mm/mem.c                    |    2 
 arch/powerpc/mm/tlb_32.c                 |    6 
 arch/powerpc/mm/tlb_64.c                 |    4 
 arch/ppc/mm/pgtable.c                    |   13 -
 arch/ppc64/kernel/vdso.c                 |    6 
 arch/sh64/lib/dbg.c                      |    2 
 drivers/char/drm/drm_vm.c                |    2 
 fs/afs/file.c                            |    4 
 fs/buffer.c                              |    2 
 fs/jfs/jfs_metapage.c                    |   12 -
 fs/proc/task_mmu.c                       |    2 
 fs/reiser4/flush_queue.c                 |    2 
 fs/reiser4/jnode.c                       |   10 -
 fs/reiser4/page_cache.c                  |    4 
 fs/reiser4/page_cache.h                  |    2 
 fs/reiser4/plugin/file/tail_conversion.c |    2 
 fs/reiser4/txnmgr.c                      |    6 
 fs/xfs/linux-2.6/xfs_buf.c               |    7 
 include/asm-alpha/atomic.h               |    7 
 include/asm-ppc/pgtable.h                |   10 -
 include/asm-s390/atomic.h                |   10 -
 include/asm-sparc64/atomic.h             |    1 
 include/asm-x86_64/atomic.h              |   49 ++++-
 include/linux/buffer_head.h              |    6 
 include/linux/mm.h                       |  262 +++++++++++++++++++++++--------
 include/linux/page-flags.h               |    1 
 include/linux/rmap.h                     |   12 -
 kernel/futex.c                           |   15 -
 kernel/kexec.c                           |    4 
 mm/Kconfig                               |    6 
 mm/filemap.c                             |    2 
 mm/fremap.c                              |    3 
 mm/memory.c                              |   50 +++--
 mm/page_alloc.c                          |  113 ++++++++++++-
 mm/page_io.c                             |    6 
 mm/rmap.c                                |   29 +--
 mm/shmem.c                               |   22 +-
 mm/swap.c                                |    2 
 mm/swap_state.c                          |    8 
 mm/swapfile.c                            |   14 -
 mm/vmscan.c                              |    2 
 45 files changed, 485 insertions(+), 257 deletions(-)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
@ 2005-11-10  1:43 ` Hugh Dickins
  2005-11-10  2:10   ` Andrew Morton
  2005-11-10  1:44 ` [PATCH 02/15] mm: revert page_private Hugh Dickins
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

The split ptlock patch enlarged the default SMP PREEMPT struct page from
32 to 36 bytes on most 32-bit platforms, from 32 to 44 bytes on PA-RISC
7xxx (without PREEMPT).  That was not my intention, and I don't believe
that split ptlock deserves any such slice of the user's memory.

Could we please try this patch, or something like it?  Again it overlays
the spinlock_t from &page->private onwards, now with a corrected
BUILD_BUG_ON that we don't go too far; with poisoning of the overlaid
fields, and the unsplit SMP config verifying that the split config can
safely use them.

The previous attempt at this patch broke ppc64, which uses slab for its
page tables - and slab saves vital info in page->lru: but no config of
spinlock_t needs to overwrite lru on 64-bit anyway; and the only 32-bit
user of slab for page tables is arm26, which is never SMP: i.e. all the
problems came from the "safety" checks, not from what's actually needed.
So the previous checks are refined with #ifdefs, and a BUG_ON(PageSlab)
is added.

This overlaying is unlikely to be portable forever: but the added checks
should warn developers when it's going to break, long before any users.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 include/linux/mm.h |   78 +++++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 64 insertions(+), 14 deletions(-)

--- 2.6.14-mm1/include/linux/mm.h	2005-11-07 07:39:57.000000000 +0000
+++ mm01/include/linux/mm.h	2005-11-09 14:37:47.000000000 +0000
@@ -224,18 +224,19 @@ struct page {
 					 * to show when page is mapped
 					 * & limit reverse map searches.
 					 */
-	union {
-		unsigned long private;	/* Mapping-private opaque data:
+	unsigned long private;		/* Mapping-private opaque data:
 					 * usually used for buffer_heads
 					 * if PagePrivate set; used for
 					 * swp_entry_t if PageSwapCache
 					 * When page is free, this indicates
 					 * order in the buddy system.
 					 */
-#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
-		spinlock_t ptl;
-#endif
-	} u;
+	/*
+	 * Depending on config, along with private, subsequent fields of
+	 * a page table page's struct page may be overlaid by a spinlock
+	 * for pte locking: see comment on "split ptlock" below.  Please
+	 * do not rearrange these fields without considering that usage.
+	 */
 	struct address_space *mapping;	/* If low bit clear, points to
 					 * inode address_space, or NULL.
 					 * If page mapped as anonymous
@@ -268,8 +269,8 @@ struct page {
 #endif
 };
 
-#define page_private(page)		((page)->u.private)
-#define set_page_private(page, v)	((page)->u.private = (v))
+#define page_private(page)		((page)->private)
+#define set_page_private(page, v)	((page)->private = (v))
 
 /*
  * FIXME: take this include out, include page-flags.h in
@@ -827,25 +828,74 @@ static inline pmd_t *pmd_alloc(struct mm
 }
 #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
 
+/*
+ * In the split ptlock case, we shall be overlaying the struct page
+ * of a page table page with a spinlock starting at &page->private:
+ * ending dependent on architecture and config, but never beyond lru.
+ *
+ * So poison page table struct page in all SMP cases (in part to assert
+ * our territory: that pte locking owns these fields of a page table's
+ * struct page); and verify it when freeing in the unsplit ptlock case,
+ * when none of these fields should have been touched.  So that now the
+ * common config checks also for the larger PREEMPT DEBUG_SPINLOCK lock.
+ *
+ * Poison lru only in the 32-bit configs, since no 64-bit spinlock_t
+ * extends that far - and ppc64 allocates from slab, which saves info in
+ * these lru fields (arm26 also allocates from slab, but is never SMP).
+ * Poison lru back-to-front, to make sure that list_del was not used.
+ */
+static inline void poison_struct_page(struct page *page)
+{
+#ifdef CONFIG_SMP
+	page->private = (unsigned long) page;
+	page->mapping = (struct address_space *) page;
+	page->index   = (pgoff_t) page;
+#if BITS_PER_LONG == 32
+	BUG_ON(PageSlab(page));
+	page->lru.next = LIST_POISON2;
+	page->lru.prev = LIST_POISON1;
+#endif
+#endif
+}
+
+static inline void verify_struct_page(struct page *page)
+{
+#ifdef CONFIG_SMP
+	BUG_ON(page->private != (unsigned long) page);
+	BUG_ON(page->mapping != (struct address_space *) page);
+	BUG_ON(page->index   != (pgoff_t) page);
+	page->mapping = NULL;
+#if BITS_PER_LONG == 32
+	BUG_ON(page->lru.next != LIST_POISON2);
+	BUG_ON(page->lru.prev != LIST_POISON1);
+#endif
+#endif
+}
+
 #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
 /*
  * We tuck a spinlock to guard each pagetable page into its struct page,
  * at page->private, with BUILD_BUG_ON to make sure that this will not
- * overflow into the next struct page (as it might with DEBUG_SPINLOCK).
- * When freeing, reset page->mapping so free_pages_check won't complain.
+ * extend further than expected.
  */
-#define __pte_lockptr(page)	&((page)->u.ptl)
+#define __pte_lockptr(page)	((spinlock_t *)&((page)->private))
 #define pte_lock_init(_page)	do {					\
+	BUILD_BUG_ON(__pte_lockptr((struct page *)0) + 1 > (spinlock_t*)\
+		(&((struct page *)0)->lru + (BITS_PER_LONG == 32)));	\
+	poison_struct_page(_page);					\
 	spin_lock_init(__pte_lockptr(_page));				\
 } while (0)
+/*
+ * When freeing, reset page->mapping so free_pages_check won't complain.
+ */
 #define pte_lock_deinit(page)	((page)->mapping = NULL)
 #define pte_lockptr(mm, pmd)	({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
-#else
+#else  /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
 /*
  * We use mm->page_table_lock to guard all pagetable pages of the mm.
  */
-#define pte_lock_init(page)	do {} while (0)
-#define pte_lock_deinit(page)	do {} while (0)
+#define pte_lock_init(page)	poison_struct_page(page)
+#define pte_lock_deinit(page)	verify_struct_page(page)
 #define pte_lockptr(mm, pmd)	({(void)(pmd); &(mm)->page_table_lock;})
 #endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
 


* [PATCH 02/15] mm: revert page_private
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
  2005-11-10  1:43 ` [PATCH 01/15] mm: poison struct page for ptlock Hugh Dickins
@ 2005-11-10  1:44 ` Hugh Dickins
  2005-11-10  1:46 ` [PATCH 03/15] mm reiser4: " Hugh Dickins
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

The page_private macro serves no purpose while spinlock_t is overlaid
in struct page: a purpose may be found for it later on, but until then
revert the abstraction, restoring various files to their previous state.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
Note: the mm/vmscan.c mod gives an easily resolved reject on 2.6.14-git.

 arch/frv/mm/pgalloc.c       |    4 ++--
 arch/i386/mm/pgtable.c      |    8 ++++----
 fs/afs/file.c               |    4 ++--
 fs/buffer.c                 |    2 +-
 fs/jfs/jfs_metapage.c       |   12 ++++++------
 fs/xfs/linux-2.6/xfs_buf.c  |    7 +++----
 include/linux/buffer_head.h |    6 +++---
 include/linux/mm.h          |    9 +++------
 kernel/kexec.c              |    4 ++--
 mm/filemap.c                |    2 +-
 mm/page_alloc.c             |   16 ++++++++--------
 mm/page_io.c                |    6 ++----
 mm/rmap.c                   |    2 +-
 mm/shmem.c                  |   22 ++++++++++++----------
 mm/swap.c                   |    2 +-
 mm/swap_state.c             |    8 ++++----
 mm/swapfile.c               |   12 ++++++------
 mm/vmscan.c                 |    2 +-
 18 files changed, 62 insertions(+), 66 deletions(-)

--- mm01/arch/frv/mm/pgalloc.c	2005-11-07 07:39:00.000000000 +0000
+++ mm02/arch/frv/mm/pgalloc.c	2005-11-09 14:38:02.000000000 +0000
@@ -87,14 +87,14 @@ static inline void pgd_list_add(pgd_t *p
 	if (pgd_list)
 		pgd_list->private = (unsigned long) &page->index;
 	pgd_list = page;
-	set_page_private(page, (unsigned long)&pgd_list);
+	page->private = (unsigned long) &pgd_list;
 }
 
 static inline void pgd_list_del(pgd_t *pgd)
 {
 	struct page *next, **pprev, *page = virt_to_page(pgd);
 	next = (struct page *) page->index;
-	pprev = (struct page **)page_private(page);
+	pprev = (struct page **) page->private;
 	*pprev = next;
 	if (next)
 		next->private = (unsigned long) pprev;
--- mm01/arch/i386/mm/pgtable.c	2005-11-07 07:39:01.000000000 +0000
+++ mm02/arch/i386/mm/pgtable.c	2005-11-09 14:38:03.000000000 +0000
@@ -191,19 +191,19 @@ static inline void pgd_list_add(pgd_t *p
 	struct page *page = virt_to_page(pgd);
 	page->index = (unsigned long)pgd_list;
 	if (pgd_list)
-		set_page_private(pgd_list, (unsigned long)&page->index);
+		pgd_list->private = (unsigned long)&page->index;
 	pgd_list = page;
-	set_page_private(page, (unsigned long)&pgd_list);
+	page->private = (unsigned long)&pgd_list;
 }
 
 static inline void pgd_list_del(pgd_t *pgd)
 {
 	struct page *next, **pprev, *page = virt_to_page(pgd);
 	next = (struct page *)page->index;
-	pprev = (struct page **)page_private(page);
+	pprev = (struct page **)page->private;
 	*pprev = next;
 	if (next)
-		set_page_private(next, (unsigned long)pprev);
+		next->private = (unsigned long)pprev;
 }
 
 void pgd_ctor(void *pgd, kmem_cache_t *cache, unsigned long unused)
--- mm01/fs/afs/file.c	2005-11-07 07:39:42.000000000 +0000
+++ mm02/fs/afs/file.c	2005-11-09 14:38:03.000000000 +0000
@@ -261,8 +261,8 @@ static int afs_file_releasepage(struct p
 		cachefs_uncache_page(vnode->cache, page);
 #endif
 
-		pageio = (struct cachefs_page *) page_private(page);
-		set_page_private(page, 0);
+		pageio = (struct cachefs_page *) page->private;
+		page->private = 0;
 		ClearPagePrivate(page);
 
 		kfree(pageio);
--- mm01/fs/buffer.c	2005-11-07 07:39:43.000000000 +0000
+++ mm02/fs/buffer.c	2005-11-09 14:38:03.000000000 +0000
@@ -96,7 +96,7 @@ static void
 __clear_page_buffers(struct page *page)
 {
 	ClearPagePrivate(page);
-	set_page_private(page, 0);
+	page->private = 0;
 	page_cache_release(page);
 }
 
--- mm01/fs/jfs/jfs_metapage.c	2005-11-07 07:39:44.000000000 +0000
+++ mm02/fs/jfs/jfs_metapage.c	2005-11-09 14:38:03.000000000 +0000
@@ -86,7 +86,7 @@ struct meta_anchor {
 	atomic_t io_count;
 	struct metapage *mp[MPS_PER_PAGE];
 };
-#define mp_anchor(page) ((struct meta_anchor *)page_private(page))
+#define mp_anchor(page) ((struct meta_anchor *)page->private)
 
 static inline struct metapage *page_to_mp(struct page *page, uint offset)
 {
@@ -108,7 +108,7 @@ static inline int insert_metapage(struct
 		if (!a)
 			return -ENOMEM;
 		memset(a, 0, sizeof(struct meta_anchor));
-		set_page_private(page, (unsigned long)a);
+		page->private = (unsigned long)a;
 		SetPagePrivate(page);
 		kmap(page);
 	}
@@ -136,7 +136,7 @@ static inline void remove_metapage(struc
 	a->mp[index] = NULL;
 	if (--a->mp_count == 0) {
 		kfree(a);
-		set_page_private(page, 0);
+		page->private = 0;
 		ClearPagePrivate(page);
 		kunmap(page);
 	}
@@ -156,13 +156,13 @@ static inline void dec_io(struct page *p
 #else
 static inline struct metapage *page_to_mp(struct page *page, uint offset)
 {
-	return PagePrivate(page) ? (struct metapage *)page_private(page) : NULL;
+	return PagePrivate(page) ? (struct metapage *)page->private : NULL;
 }
 
 static inline int insert_metapage(struct page *page, struct metapage *mp)
 {
 	if (mp) {
-		set_page_private(page, (unsigned long)mp);
+		page->private = (unsigned long)mp;
 		SetPagePrivate(page);
 		kmap(page);
 	}
@@ -171,7 +171,7 @@ static inline int insert_metapage(struct
 
 static inline void remove_metapage(struct page *page, struct metapage *mp)
 {
-	set_page_private(page, 0);
+	page->private = 0;
 	ClearPagePrivate(page);
 	kunmap(page);
 }
--- mm01/fs/xfs/linux-2.6/xfs_buf.c	2005-11-07 07:39:47.000000000 +0000
+++ mm02/fs/xfs/linux-2.6/xfs_buf.c	2005-11-09 14:38:03.000000000 +0000
@@ -141,9 +141,8 @@ set_page_region(
 	size_t		offset,
 	size_t		length)
 {
-	set_page_private(page,
-		page_private(page) | page_region_mask(offset, length));
-	if (page_private(page) == ~0UL)
+	page->private |= page_region_mask(offset, length);
+	if (page->private == ~0UL)
 		SetPageUptodate(page);
 }
 
@@ -155,7 +154,7 @@ test_page_region(
 {
 	unsigned long	mask = page_region_mask(offset, length);
 
-	return (mask && (page_private(page) & mask) == mask);
+	return (mask && (page->private & mask) == mask);
 }
 
 /*
--- mm01/include/linux/buffer_head.h	2005-11-07 07:39:56.000000000 +0000
+++ mm02/include/linux/buffer_head.h	2005-11-09 14:38:03.000000000 +0000
@@ -126,8 +126,8 @@ BUFFER_FNS(Eopnotsupp, eopnotsupp)
 /* If we *know* page->private refers to buffer_heads */
 #define page_buffers(page)					\
 	({							\
-		BUG_ON(!PagePrivate(page));			\
-		((struct buffer_head *)page_private(page));	\
+		BUG_ON(!PagePrivate(page));		\
+		((struct buffer_head *)(page)->private);	\
 	})
 #define page_has_buffers(page)	PagePrivate(page)
 
@@ -220,7 +220,7 @@ static inline void attach_page_buffers(s
 {
 	page_cache_get(page);
 	SetPagePrivate(page);
-	set_page_private(page, (unsigned long)head);
+	page->private = (unsigned long)head;
 }
 
 static inline void get_bh(struct buffer_head *bh)
--- mm01/include/linux/mm.h	2005-11-09 14:37:47.000000000 +0000
+++ mm02/include/linux/mm.h	2005-11-09 14:38:03.000000000 +0000
@@ -269,9 +269,6 @@ struct page {
 #endif
 };
 
-#define page_private(page)		((page)->private)
-#define set_page_private(page, v)	((page)->private = (v))
-
 /*
  * FIXME: take this include out, include page-flags.h in
  * files which need it (119 of them)
@@ -326,14 +323,14 @@ extern void FASTCALL(__page_cache_releas
 static inline int page_count(struct page *page)
 {
 	if (PageCompound(page))
-		page = (struct page *)page_private(page);
+		page = (struct page *)page->private;
 	return atomic_read(&page->_count) + 1;
 }
 
 static inline void get_page(struct page *page)
 {
 	if (unlikely(PageCompound(page)))
-		page = (struct page *)page_private(page);
+		page = (struct page *)page->private;
 	atomic_inc(&page->_count);
 }
 
@@ -599,7 +596,7 @@ static inline int PageAnon(struct page *
 static inline pgoff_t page_index(struct page *page)
 {
 	if (unlikely(PageSwapCache(page)))
-		return page_private(page);
+		return page->private;
 	return page->index;
 }
 
--- mm01/kernel/kexec.c	2005-11-07 07:39:59.000000000 +0000
+++ mm02/kernel/kexec.c	2005-11-09 14:38:03.000000000 +0000
@@ -334,7 +334,7 @@ static struct page *kimage_alloc_pages(g
 	if (pages) {
 		unsigned int count, i;
 		pages->mapping = NULL;
-		set_page_private(pages, order);
+		pages->private = order;
 		count = 1 << order;
 		for (i = 0; i < count; i++)
 			SetPageReserved(pages + i);
@@ -347,7 +347,7 @@ static void kimage_free_pages(struct pag
 {
 	unsigned int order, count, i;
 
-	order = page_private(page);
+	order = page->private;
 	count = 1 << order;
 	for (i = 0; i < count; i++)
 		ClearPageReserved(page + i);
--- mm01/mm/filemap.c	2005-11-07 07:39:59.000000000 +0000
+++ mm02/mm/filemap.c	2005-11-09 14:38:03.000000000 +0000
@@ -154,7 +154,7 @@ static int sync_page(void *word)
 	 * in the ->sync_page() methods make essential use of the
 	 * page_mapping(), merely passing the page down to the backing
 	 * device's unplug functions when it's non-NULL, which in turn
-	 * ignore it for all cases but swap, where only page_private(page) is
+	 * ignore it for all cases but swap, where only page->private is
 	 * of interest. When page_mapping() does go NULL, the entire
 	 * call stack gracefully ignores the page and returns.
 	 * -- wli
--- mm01/mm/page_alloc.c	2005-11-07 07:39:59.000000000 +0000
+++ mm02/mm/page_alloc.c	2005-11-09 14:38:03.000000000 +0000
@@ -180,7 +180,7 @@ static void prep_compound_page(struct pa
 		struct page *p = page + i;
 
 		SetPageCompound(p);
-		set_page_private(p, (unsigned long)page);
+		p->private = (unsigned long)page;
 	}
 }
 
@@ -200,7 +200,7 @@ static void destroy_compound_page(struct
 
 		if (!PageCompound(p))
 			bad_page(__FUNCTION__, page);
-		if (page_private(p) != (unsigned long)page)
+		if (p->private != (unsigned long)page)
 			bad_page(__FUNCTION__, page);
 		ClearPageCompound(p);
 	}
@@ -213,18 +213,18 @@ static void destroy_compound_page(struct
  * So, we don't need atomic page->flags operations here.
  */
 static inline unsigned long page_order(struct page *page) {
-	return page_private(page);
+	return page->private;
 }
 
 static inline void set_page_order(struct page *page, int order) {
-	set_page_private(page, order);
+	page->private = order;
 	__SetPagePrivate(page);
 }
 
 static inline void rmv_page_order(struct page *page)
 {
 	__ClearPagePrivate(page);
-	set_page_private(page, 0);
+	page->private = 0;
 }
 
 /*
@@ -264,7 +264,7 @@ __find_combined_index(unsigned long page
  * (a) the buddy is free &&
  * (b) the buddy is on the buddy system &&
  * (c) a page and its buddy have the same order.
- * for recording page's order, we use page_private(page) and PG_private.
+ * for recording page's order, we use page->private and PG_private.
  *
  */
 static inline int page_is_buddy(struct page *page, int order)
@@ -290,7 +290,7 @@ static inline int page_is_buddy(struct p
  * parts of the VM system.
  * At each level, we keep a list of pages, which are heads of continuous
  * free pages of length of (1 << order) and marked with PG_Private.Page's
- * order is recorded in page_private(page) field.
+ * order is recorded in page->private field.
  * So when we are allocating or freeing one, we can derive the state of the
  * other.  That is, if we allocate a small block, and both were   
  * free, the remainder of the region must be split into blocks.   
@@ -489,7 +489,7 @@ static void prep_new_page(struct page *p
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_checked | 1 << PG_mappedtodisk);
-	set_page_private(page, 0);
+	page->private = 0;
 	set_page_refs(page, order);
 	kernel_map_pages(page, 1 << order, 1);
 }
--- mm01/mm/page_io.c	2005-11-07 07:39:59.000000000 +0000
+++ mm02/mm/page_io.c	2005-11-09 14:38:03.000000000 +0000
@@ -91,8 +91,7 @@ int swap_writepage(struct page *page, st
 		unlock_page(page);
 		goto out;
 	}
-	bio = get_swap_bio(GFP_NOIO, page_private(page), page,
-				end_swap_bio_write);
+	bio = get_swap_bio(GFP_NOIO, page->private, page, end_swap_bio_write);
 	if (bio == NULL) {
 		set_page_dirty(page);
 		unlock_page(page);
@@ -116,8 +115,7 @@ int swap_readpage(struct file *file, str
 
 	BUG_ON(!PageLocked(page));
 	ClearPageUptodate(page);
-	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
-				end_swap_bio_read);
+	bio = get_swap_bio(GFP_KERNEL, page->private, page, end_swap_bio_read);
 	if (bio == NULL) {
 		unlock_page(page);
 		ret = -ENOMEM;
--- mm01/mm/rmap.c	2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/rmap.c	2005-11-09 14:38:03.000000000 +0000
@@ -550,7 +550,7 @@ static int try_to_unmap_one(struct page 
 	update_hiwater_rss(mm);
 
 	if (PageAnon(page)) {
-		swp_entry_t entry = { .val = page_private(page) };
+		swp_entry_t entry = { .val = page->private };
 		/*
 		 * Store the swap location in the pte.
 		 * See handle_pte_fault() ...
--- mm01/mm/shmem.c	2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/shmem.c	2005-11-09 14:38:03.000000000 +0000
@@ -71,6 +71,9 @@
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
+/* Keep swapped page count in private field of indirect struct page */
+#define nr_swapped		private
+
 /* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
 enum sgp_type {
 	SGP_QUICK,	/* don't try more than file page cache lookup */
@@ -321,10 +324,8 @@ static void shmem_swp_set(struct shmem_i
 
 	entry->val = value;
 	info->swapped += incdec;
-	if ((unsigned long)(entry - info->i_direct) >= SHMEM_NR_DIRECT) {
-		struct page *page = kmap_atomic_to_page(entry);
-		set_page_private(page, page_private(page) + incdec);
-	}
+	if ((unsigned long)(entry - info->i_direct) >= SHMEM_NR_DIRECT)
+		kmap_atomic_to_page(entry)->nr_swapped += incdec;
 }
 
 /*
@@ -367,8 +368,9 @@ static swp_entry_t *shmem_swp_alloc(stru
 
 		spin_unlock(&info->lock);
 		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
-		if (page)
-			set_page_private(page, 0);
+		if (page) {
+			page->nr_swapped = 0;
+		}
 		spin_lock(&info->lock);
 
 		if (!page) {
@@ -559,7 +561,7 @@ static void shmem_truncate(struct inode 
 			diroff = 0;
 		}
 		subdir = dir[diroff];
-		if (subdir && page_private(subdir)) {
+		if (subdir && subdir->nr_swapped) {
 			size = limit - idx;
 			if (size > ENTRIES_PER_PAGE)
 				size = ENTRIES_PER_PAGE;
@@ -570,10 +572,10 @@ static void shmem_truncate(struct inode 
 			nr_swaps_freed += freed;
 			if (offset)
 				spin_lock(&info->lock);
-			set_page_private(subdir, page_private(subdir) - freed);
+			subdir->nr_swapped -= freed;
 			if (offset)
 				spin_unlock(&info->lock);
-			BUG_ON(page_private(subdir) > offset);
+			BUG_ON(subdir->nr_swapped > offset);
 		}
 		if (offset)
 			offset = 0;
@@ -741,7 +743,7 @@ static int shmem_unuse_inode(struct shme
 			dir = shmem_dir_map(subdir);
 		}
 		subdir = *dir;
-		if (subdir && page_private(subdir)) {
+		if (subdir && subdir->nr_swapped) {
 			ptr = shmem_swp_map(subdir);
 			size = limit - idx;
 			if (size > ENTRIES_PER_PAGE)
--- mm01/mm/swap.c	2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/swap.c	2005-11-09 14:38:03.000000000 +0000
@@ -39,7 +39,7 @@ int page_cluster;
 void put_page(struct page *page)
 {
 	if (unlikely(PageCompound(page))) {
-		page = (struct page *)page_private(page);
+		page = (struct page *)page->private;
 		if (put_page_testzero(page)) {
 			void (*dtor)(struct page *page);
 
--- mm01/mm/swap_state.c	2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/swap_state.c	2005-11-09 14:38:03.000000000 +0000
@@ -83,7 +83,7 @@ static int __add_to_swap_cache(struct pa
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
-			set_page_private(page, entry.val);
+			page->private = entry.val;
 			total_swapcache_pages++;
 			pagecache_acct(1);
 		}
@@ -127,8 +127,8 @@ void __delete_from_swap_cache(struct pag
 	BUG_ON(PageWriteback(page));
 	BUG_ON(PagePrivate(page));
 
-	radix_tree_delete(&swapper_space.page_tree, page_private(page));
-	set_page_private(page, 0);
+	radix_tree_delete(&swapper_space.page_tree, page->private);
+	page->private = 0;
 	ClearPageSwapCache(page);
 	total_swapcache_pages--;
 	pagecache_acct(-1);
@@ -201,7 +201,7 @@ void delete_from_swap_cache(struct page 
 {
 	swp_entry_t entry;
 
-	entry.val = page_private(page);
+	entry.val = page->private;
 
 	write_lock_irq(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
--- mm01/mm/swapfile.c	2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/swapfile.c	2005-11-09 14:38:03.000000000 +0000
@@ -59,7 +59,7 @@ void swap_unplug_io_fn(struct backing_de
 	swp_entry_t entry;
 
 	down_read(&swap_unplug_sem);
-	entry.val = page_private(page);
+	entry.val = page->private;
 	if (PageSwapCache(page)) {
 		struct block_device *bdev = swap_info[swp_type(entry)].bdev;
 		struct backing_dev_info *bdi;
@@ -67,8 +67,8 @@ void swap_unplug_io_fn(struct backing_de
 		/*
 		 * If the page is removed from swapcache from under us (with a
 		 * racy try_to_unuse/swapoff) we need an additional reference
-		 * count to avoid reading garbage from page_private(page) above.
-		 * If the WARN_ON triggers during a swapoff it maybe the race
+		 * count to avoid reading garbage from page->private above. If
+		 * the WARN_ON triggers during a swapoff it maybe the race
 		 * condition and it's harmless. However if it triggers without
 		 * swapoff it signals a problem.
 		 */
@@ -292,7 +292,7 @@ static inline int page_swapcount(struct 
 	struct swap_info_struct *p;
 	swp_entry_t entry;
 
-	entry.val = page_private(page);
+	entry.val = page->private;
 	p = swap_info_get(entry);
 	if (p) {
 		/* Subtract the 1 for the swap cache itself */
@@ -337,7 +337,7 @@ int remove_exclusive_swap_page(struct pa
 	if (page_count(page) != 2) /* 2: us + cache */
 		return 0;
 
-	entry.val = page_private(page);
+	entry.val = page->private;
 	p = swap_info_get(entry);
 	if (!p)
 		return 0;
@@ -1040,7 +1040,7 @@ int page_queue_congested(struct page *pa
 	BUG_ON(!PageLocked(page));	/* It pins the swap_info_struct */
 
 	if (PageSwapCache(page)) {
-		swp_entry_t entry = { .val = page_private(page) };
+		swp_entry_t entry = { .val = page->private };
 		struct swap_info_struct *sis;
 
 		sis = get_swap_info_struct(swp_type(entry));
--- mm01/mm/vmscan.c	2005-11-07 07:40:01.000000000 +0000
+++ mm02/mm/vmscan.c	2005-11-09 14:38:03.000000000 +0000
@@ -387,7 +387,7 @@ static inline int remove_mapping(struct 
 		goto cannot_free;
 
 	if (PageSwapCache(page)) {
-		swp_entry_t swap = { .val = page_private(page) };
+		swp_entry_t swap = { .val = page->private };
 		add_to_swapped_list(swap.val);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);


* [PATCH 03/15] mm reiser4: revert page_private
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
  2005-11-10  1:43 ` [PATCH 01/15] mm: poison struct page for ptlock Hugh Dickins
  2005-11-10  1:44 ` [PATCH 02/15] mm: revert page_private Hugh Dickins
@ 2005-11-10  1:46 ` Hugh Dickins
  2005-11-10  1:47 ` [PATCH 04/15] mm: update split ptlock Kconfig Hugh Dickins
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: reiserfs-dev, linux-kernel

Revert page_private: remove -mm tree's reiser4-page-private-fixes.patch

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 fs/reiser4/flush_queue.c                 |    2 +-
 fs/reiser4/jnode.c                       |   10 +++++-----
 fs/reiser4/page_cache.c                  |    2 +-
 fs/reiser4/page_cache.h                  |    2 +-
 fs/reiser4/plugin/file/tail_conversion.c |    2 +-
 fs/reiser4/txnmgr.c                      |    6 +++---
 6 files changed, 12 insertions(+), 12 deletions(-)

--- mm02/fs/reiser4/flush_queue.c	2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/flush_queue.c	2005-11-09 14:38:18.000000000 +0000
@@ -427,7 +427,7 @@ end_io_handler(struct bio *bio, unsigned
 
 			assert("zam-736", pg != NULL);
 			assert("zam-736", PagePrivate(pg));
-			node = jprivate(pg);
+			node = (jnode *) (pg->private);
 
 			JF_CLR(node, JNODE_WRITEBACK);
 		}
--- mm02/fs/reiser4/jnode.c	2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/jnode.c	2005-11-09 14:38:18.000000000 +0000
@@ -38,12 +38,12 @@
  *     page_address().
  *
  *     jnode and page are attached to each other by jnode_attach_page(). This
- *     function places pointer to jnode in page_private(), sets PG_private flag
+ *     function places pointer to jnode in page->private, sets PG_private flag
  *     and increments page counter.
  *
  *     Opposite operation is performed by page_clear_jnode().
  *
- *     jnode->pg is protected by jnode spin lock, and page_private() is
+ *     jnode->pg is protected by jnode spin lock, and page->private is
  *     protected by page lock. See comment at the top of page_cache.c for
  *     more.
  *
@@ -664,7 +664,7 @@ void jnode_attach_page(jnode * node, str
 	assert("nikita-2060", node != NULL);
 	assert("nikita-2061", pg != NULL);
 
-	assert("nikita-2050", page_private(pg) == 0ul);
+	assert("nikita-2050", pg->private == 0ul);
 	assert("nikita-2393", !PagePrivate(pg));
 	assert("vs-1741", node->pg == NULL);
 
@@ -672,7 +672,7 @@ void jnode_attach_page(jnode * node, str
 	assert("nikita-2397", spin_jnode_is_locked(node));
 
 	page_cache_get(pg);
-	set_page_private(pg, (unsigned long)node);
+	pg->private = (unsigned long)node;
 	node->pg = pg;
 	SetPagePrivate(pg);
 }
@@ -689,7 +689,7 @@ void page_clear_jnode(struct page *page,
 	assert("nikita-3551", !PageWriteback(page));
 
 	JF_CLR(node, JNODE_PARSED);
-	set_page_private(page, 0ul);
+	page->private = 0ul;
 	ClearPagePrivate(page);
 	node->pg = NULL;
 	page_cache_release(page);
--- mm02/fs/reiser4/page_cache.c	2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/page_cache.c	2005-11-09 14:38:18.000000000 +0000
@@ -753,7 +753,7 @@ void print_page(const char *prefix, stru
 	}
 	printk("%s: page index: %lu mapping: %p count: %i private: %lx\n",
 	       prefix, page->index, page->mapping, page_count(page),
-	       page_private(page));
+	       page->private);
 	printk("\tflags: %s%s%s%s %s%s%s %s%s%s %s%s\n",
 	       page_flag_name(page, PG_locked), page_flag_name(page, PG_error),
 	       page_flag_name(page, PG_referenced), page_flag_name(page,
--- mm02/fs/reiser4/page_cache.h	2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/page_cache.h	2005-11-09 14:38:18.000000000 +0000
@@ -30,7 +30,7 @@ static inline void lock_and_wait_page_wr
 		reiser4_wait_page_writeback(page);
 }
 
-#define jprivate(page) ((jnode *) page_private(page))
+#define jprivate(page) ((jnode *) (page)->private)
 
 extern int page_io(struct page *page, jnode * node, int rw, int gfp);
 extern void drop_page(struct page *page);
--- mm02/fs/reiser4/plugin/file/tail_conversion.c	2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/plugin/file/tail_conversion.c	2005-11-09 14:38:18.000000000 +0000
@@ -670,7 +670,7 @@ int extent2tail(unix_file_info_t * uf_in
 		/* page is already detached from jnode and mapping. */
 		assert("vs-1086", page->mapping == NULL);
 		assert("nikita-2690",
-		       (!PagePrivate(page) && page_private(page) == 0));
+		       (!PagePrivate(page) && page->private == 0));
 		/* waiting for writeback completion with page lock held is
 		 * perfectly valid. */
 		wait_on_page_writeback(page);
--- mm02/fs/reiser4/txnmgr.c	2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/txnmgr.c	2005-11-09 14:38:18.000000000 +0000
@@ -2331,7 +2331,7 @@ void uncapture_page(struct page *pg)
 
 	reiser4_wait_page_writeback(pg);
 
-	node = jprivate(pg);
+	node = (jnode *) (pg->private);
 	BUG_ON(node == NULL);
 
 	LOCK_JNODE(node);
@@ -3603,14 +3603,14 @@ static void swap_jnode_pages(jnode * nod
 	copy->pg = node->pg;
 	copy->data = page_address(copy->pg);
 	jnode_set_block(copy, jnode_get_block(node));
-	set_page_private(copy->pg, (unsigned long)copy);
+	copy->pg->private = (unsigned long)copy;
 
 	/* attach new page to jnode */
 	assert("vs-1412", !PagePrivate(new_page));
 	page_cache_get(new_page);
 	node->pg = new_page;
 	node->data = page_address(new_page);
-	set_page_private(new_page, (unsigned long)node);
+	new_page->private = (unsigned long)node;
 	SetPagePrivate(new_page);
 
 	{

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 04/15] mm: update split ptlock Kconfig
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (2 preceding siblings ...)
  2005-11-10  1:46 ` [PATCH 03/15] mm reiser4: " Hugh Dickins
@ 2005-11-10  1:47 ` Hugh Dickins
  2005-11-10  1:48 ` [PATCH 05/15] mm: unbloat get_futex_key Hugh Dickins
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Closer attention to the arithmetic shows that neither ppc64 nor sparc
really uses one page for multiple page tables: how on earth could they,
while pte_alloc_one returns just a struct page pointer, with no offset?

Gasp, splutter... arm26 manages it by returning a pte_t pointer cast to
a struct page pointer, then compensating in its pmd_populate.  That's
almost as evil as overlaying a struct page with a spinlock_t.  But arm26
is never SMP, and we now only poison when SMP, so it's not a problem.

And the PA-RISC situation has been recently improved: CONFIG_PA20
works without the 16-byte alignment which inflated its spinlock_t.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/Kconfig |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

--- mm03/mm/Kconfig	2005-11-07 07:39:59.000000000 +0000
+++ mm04/mm/Kconfig	2005-11-09 14:38:32.000000000 +0000
@@ -125,14 +125,12 @@ comment "Memory hotplug is currently inc
 # space can be handled with less contention: split it at this NR_CPUS.
 # Default to 4 for wider testing, though 8 might be more appropriate.
 # ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock.
-# PA-RISC's debug spinlock_t is too large for the 32-bit struct page.
-# ARM26 and SPARC32 and PPC64 may use one page for multiple page tables.
+# PA-RISC 7xxx's debug spinlock_t is too large for 32-bit struct page.
 #
 config SPLIT_PTLOCK_CPUS
 	int
 	default "4096" if ARM && !CPU_CACHE_VIPT
-	default "4096" if PARISC && DEBUG_SPINLOCK && !64BIT
-	default "4096" if ARM26 || SPARC32 || PPC64
+	default "4096" if PARISC && DEBUG_SPINLOCK && !PA20
 	default "4"
 
 #


* [PATCH 05/15] mm: unbloat get_futex_key
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (3 preceding siblings ...)
  2005-11-10  1:47 ` [PATCH 04/15] mm: update split ptlock Kconfig Hugh Dickins
@ 2005-11-10  1:48 ` Hugh Dickins
  2005-11-10  1:50 ` [PATCH 06/15] mm: remove ppc highpte Hugh Dickins
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jamie Lokier, linux-kernel

The follow_page changes in get_futex_key have left it with two almost
identical blocks, when handling the rare case of a futex in a nonlinear
vma.  get_user_pages will itself do that follow_page, and its additional
find_extend_vma is hardly any overhead since the vma is already cached.
Let's just delete the follow_page block and let get_user_pages do it.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 kernel/futex.c |   15 ---------------
 1 files changed, 15 deletions(-)

--- mm04/kernel/futex.c	2005-11-07 07:39:59.000000000 +0000
+++ mm05/kernel/futex.c	2005-11-09 14:38:47.000000000 +0000
@@ -201,21 +201,6 @@ static int get_futex_key(unsigned long u
 	 * from swap.  But that's a lot of code to duplicate here
 	 * for a rare case, so we simply fetch the page.
 	 */
-
-	/*
-	 * Do a quick atomic lookup first - this is the fastpath.
-	 */
-	page = follow_page(mm, uaddr, FOLL_TOUCH|FOLL_GET);
-	if (likely(page != NULL)) {
-		key->shared.pgoff =
-			page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-		put_page(page);
-		return 0;
-	}
-
-	/*
-	 * Do it the general way.
-	 */
 	err = get_user_pages(current, mm, uaddr, 1, 0, 0, &page, NULL);
 	if (err >= 0) {
 		key->shared.pgoff =


* [PATCH 06/15] mm: remove ppc highpte
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (4 preceding siblings ...)
  2005-11-10  1:48 ` [PATCH 05/15] mm: unbloat get_futex_key Hugh Dickins
@ 2005-11-10  1:50 ` Hugh Dickins
  2005-11-10  1:52   ` Benjamin Herrenschmidt
  2005-11-10  1:55   ` Paul Mackerras
  2005-11-10  1:51 ` [PATCH 07/15] mm: powerpc ptlock comments Hugh Dickins
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Mackerras, Ben Herrenschmidt, linux-kernel

ppc's HIGHPTE config option was removed in 2.5.28, and nobody seems to
have wanted it enough to restore it: so remove its traces from pgtable.h
and pte_alloc_one.  Or supply an alternative patch to config it back?

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 arch/ppc/mm/pgtable.c     |   13 +------------
 include/asm-ppc/pgtable.h |   10 ++++------
 2 files changed, 5 insertions(+), 18 deletions(-)

--- mm05/arch/ppc/mm/pgtable.c	2005-11-07 07:39:08.000000000 +0000
+++ mm06/arch/ppc/mm/pgtable.c	2005-11-09 14:39:02.000000000 +0000
@@ -111,18 +111,7 @@ pte_t *pte_alloc_one_kernel(struct mm_st
 
 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *ptepage;
-
-#ifdef CONFIG_HIGHPTE
-	gfp_t flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_REPEAT;
-#else
-	gfp_t flags = GFP_KERNEL | __GFP_REPEAT;
-#endif
-
-	ptepage = alloc_pages(flags, 0);
-	if (ptepage)
-		clear_highpage(ptepage);
-	return ptepage;
+	return alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 }
 
 void pte_free_kernel(pte_t *pte)
--- mm05/include/asm-ppc/pgtable.h	2005-11-07 07:39:55.000000000 +0000
+++ mm06/include/asm-ppc/pgtable.h	2005-11-09 14:39:02.000000000 +0000
@@ -750,13 +750,11 @@ static inline pmd_t * pmd_offset(pgd_t *
 	(((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
 #define pte_offset_kernel(dir, addr)	\
 	((pte_t *) pmd_page_kernel(*(dir)) + pte_index(addr))
-#define pte_offset_map(dir, addr)		\
-	((pte_t *) kmap_atomic(pmd_page(*(dir)), KM_PTE0) + pte_index(addr))
-#define pte_offset_map_nested(dir, addr)	\
-	((pte_t *) kmap_atomic(pmd_page(*(dir)), KM_PTE1) + pte_index(addr))
 
-#define pte_unmap(pte)		kunmap_atomic(pte, KM_PTE0)
-#define pte_unmap_nested(pte)	kunmap_atomic(pte, KM_PTE1)
+#define pte_offset_map(dir,addr)	pte_offset_kernel(dir,addr)
+#define pte_offset_map_nested(dir,addr)	pte_offset_kernel(dir,addr)
+#define pte_unmap(pte)			do { } while(0)
+#define pte_unmap_nested(pte)		do { } while(0)
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 


* [PATCH 07/15] mm: powerpc ptlock comments
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (5 preceding siblings ...)
  2005-11-10  1:50 ` [PATCH 06/15] mm: remove ppc highpte Hugh Dickins
@ 2005-11-10  1:51 ` Hugh Dickins
  2005-11-10  1:53 ` [PATCH 08/15] mm: powerpc init_mm without ptlock Hugh Dickins
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Mackerras, Ben Herrenschmidt, linux-kernel

Update comments (only) on page_table_lock and mmap_sem in arch/powerpc.
Removed the comment on page_table_lock from hash_huge_page: since it's
no longer taking page_table_lock itself, it's irrelevant whether others
are; but how it is safe (even against huge file truncation?) I can't say.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 arch/powerpc/mm/hugetlbpage.c |    4 +---
 arch/powerpc/mm/mem.c         |    2 +-
 arch/powerpc/mm/tlb_32.c      |    6 ++++++
 arch/powerpc/mm/tlb_64.c      |    4 ++--
 4 files changed, 10 insertions(+), 6 deletions(-)

--- mm06/arch/powerpc/mm/hugetlbpage.c	2005-11-07 07:39:05.000000000 +0000
+++ mm07/arch/powerpc/mm/hugetlbpage.c	2005-11-09 14:39:16.000000000 +0000
@@ -754,9 +754,7 @@ repeat:
 	}
 
 	/*
-	 * No need to use ldarx/stdcx here because all who
-	 * might be updating the pte will hold the
-	 * page_table_lock
+	 * No need to use ldarx/stdcx here
 	 */
 	*ptep = __pte(new_pte & ~_PAGE_BUSY);
 
--- mm06/arch/powerpc/mm/mem.c	2005-11-07 07:39:05.000000000 +0000
+++ mm07/arch/powerpc/mm/mem.c	2005-11-09 14:39:16.000000000 +0000
@@ -491,7 +491,7 @@ EXPORT_SYMBOL(flush_icache_user_range);
  * We use it to preload an HPTE into the hash table corresponding to
  * the updated linux PTE.
  * 
- * This must always be called with the mm->page_table_lock held
+ * This must always be called with the pte lock held.
  */
 void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
 		      pte_t pte)
--- mm06/arch/powerpc/mm/tlb_32.c	2005-11-07 07:39:05.000000000 +0000
+++ mm07/arch/powerpc/mm/tlb_32.c	2005-11-09 14:39:16.000000000 +0000
@@ -149,6 +149,12 @@ void flush_tlb_mm(struct mm_struct *mm)
 		return;
 	}
 
+	/*
+	 * It is safe to go down the mm's list of vmas when called
+	 * from dup_mmap, holding mmap_sem.  It would also be safe from
+	 * unmap_region or exit_mmap, but not from vmtruncate on SMP -
+	 * but it seems dup_mmap is the only SMP case which gets here.
+	 */
 	for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
 		flush_range(mp->vm_mm, mp->vm_start, mp->vm_end);
 	FINISH_FLUSH;
--- mm06/arch/powerpc/mm/tlb_64.c	2005-11-07 07:39:05.000000000 +0000
+++ mm07/arch/powerpc/mm/tlb_64.c	2005-11-09 14:39:16.000000000 +0000
@@ -95,7 +95,7 @@ static void pte_free_submit(struct pte_f
 
 void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
 {
-	/* This is safe as we are holding page_table_lock */
+	/* This is safe since tlb_gather_mmu has disabled preemption */
         cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id());
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
 
@@ -206,7 +206,7 @@ void __flush_tlb_pending(struct ppc64_tl
 
 void pte_free_finish(void)
 {
-	/* This is safe as we are holding page_table_lock */
+	/* This is safe since tlb_gather_mmu has disabled preemption */
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
 
 	if (*batchp == NULL)


* Re: [PATCH 06/15] mm: remove ppc highpte
  2005-11-10  1:50 ` [PATCH 06/15] mm: remove ppc highpte Hugh Dickins
@ 2005-11-10  1:52   ` Benjamin Herrenschmidt
  2005-11-10  1:55   ` Paul Mackerras
  1 sibling, 0 replies; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2005-11-10  1:52 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Paul Mackerras, linux-kernel

On Thu, 2005-11-10 at 01:50 +0000, Hugh Dickins wrote:
> ppc's HIGHPTE config option was removed in 2.5.28, and nobody seems to
> have wanted it enough to restore it: so remove its traces from pgtable.h
> and pte_alloc_one.  Or supply an alternative patch to config it back?

Hrm... ppc32 does have fully working kmap & kmap atomic so I would
rather put back the CONFIG option...

Ben.




* [PATCH 08/15] mm: powerpc init_mm without ptlock
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (6 preceding siblings ...)
  2005-11-10  1:51 ` [PATCH 07/15] mm: powerpc ptlock comments Hugh Dickins
@ 2005-11-10  1:53 ` Hugh Dickins
  2005-11-10  1:56 ` [PATCH 09/15] mm: fill arch atomic64 gaps Hugh Dickins
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Mackerras, Ben Herrenschmidt, linux-kernel

Restore an earlier mod which went missing in the powerpc reshuffle:
the 4xx mmu_mapin_ram does not need to take init_mm.page_table_lock.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 arch/powerpc/mm/4xx_mmu.c |    4 ----
 1 files changed, 4 deletions(-)

--- mm07/arch/powerpc/mm/4xx_mmu.c	2005-11-07 07:39:05.000000000 +0000
+++ mm08/arch/powerpc/mm/4xx_mmu.c	2005-11-09 14:39:29.000000000 +0000
@@ -110,13 +110,11 @@ unsigned long __init mmu_mapin_ram(void)
 		pmd_t *pmdp;
 		unsigned long val = p | _PMD_SIZE_16M | _PAGE_HWEXEC | _PAGE_HWWRITE;
 
-		spin_lock(&init_mm.page_table_lock);
 		pmdp = pmd_offset(pgd_offset_k(v), v);
 		pmd_val(*pmdp++) = val;
 		pmd_val(*pmdp++) = val;
 		pmd_val(*pmdp++) = val;
 		pmd_val(*pmdp++) = val;
-		spin_unlock(&init_mm.page_table_lock);
 
 		v += LARGE_PAGE_SIZE_16M;
 		p += LARGE_PAGE_SIZE_16M;
@@ -127,10 +125,8 @@ unsigned long __init mmu_mapin_ram(void)
 		pmd_t *pmdp;
 		unsigned long val = p | _PMD_SIZE_4M | _PAGE_HWEXEC | _PAGE_HWWRITE;
 
-		spin_lock(&init_mm.page_table_lock);
 		pmdp = pmd_offset(pgd_offset_k(v), v);
 		pmd_val(*pmdp) = val;
-		spin_unlock(&init_mm.page_table_lock);
 
 		v += LARGE_PAGE_SIZE_4M;
 		p += LARGE_PAGE_SIZE_4M;


* Re: [PATCH 06/15] mm: remove ppc highpte
  2005-11-10  1:50 ` [PATCH 06/15] mm: remove ppc highpte Hugh Dickins
  2005-11-10  1:52   ` Benjamin Herrenschmidt
@ 2005-11-10  1:55   ` Paul Mackerras
  2005-11-10  2:46     ` Hugh Dickins
  1 sibling, 1 reply; 52+ messages in thread
From: Paul Mackerras @ 2005-11-10  1:55 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Ben Herrenschmidt, linux-kernel

Hugh Dickins writes:

> ppc's HIGHPTE config option was removed in 2.5.28, and nobody seems to
> have wanted it enough to restore it: so remove its traces from pgtable.h
> and pte_alloc_one.  Or supply an alternative patch to config it back?

I'm staggered.  We do want to be able to have pte pages in highmem.
I would rather just have it always enabled if CONFIG_HIGHMEM=y, rather
than putting the config option back.  I think that should just involve
adding __GFP_HIGHMEM to the flags for alloc_pages in pte_alloc_one
unconditionally, no?

Paul.



* [PATCH 09/15] mm: fill arch atomic64 gaps
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (7 preceding siblings ...)
  2005-11-10  1:53 ` [PATCH 08/15] mm: powerpc init_mm without ptlock Hugh Dickins
@ 2005-11-10  1:56 ` Hugh Dickins
  2005-11-10 13:38   ` Andi Kleen
  2005-11-10  1:57 ` [PATCH 10/15] mm: atomic64 page counts Hugh Dickins
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Henderson, Martin Schwidefsky, David S. Miller,
	Andi Kleen, linux-kernel

alpha, s390, sparc64, x86_64 are each missing some primitives from their
atomic64 support: fill in the gaps I've noticed by extrapolating asm,
follow the groupings in each file, and say "int" for the booleans rather
than long or long long.  But powerpc and parisc still have no atomic64.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 include/asm-alpha/atomic.h   |    7 ++++--
 include/asm-s390/atomic.h    |   10 ++++++--
 include/asm-sparc64/atomic.h |    1 
 include/asm-x86_64/atomic.h  |   49 ++++++++++++++++++++++++++++++++-----------
 4 files changed, 50 insertions(+), 17 deletions(-)

--- mm08/include/asm-alpha/atomic.h	2005-10-28 01:02:08.000000000 +0100
+++ mm09/include/asm-alpha/atomic.h	2005-11-09 14:39:48.000000000 +0000
@@ -118,8 +118,6 @@ static __inline__ long atomic_add_return
 	return result;
 }
 
-#define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
-
 static __inline__ long atomic64_add_return(long i, atomic64_t * v)
 {
 	long temp, result;
@@ -177,6 +175,9 @@ static __inline__ long atomic64_sub_retu
 	return result;
 }
 
+#define atomic_add_negative(a, v) (atomic_add_return((a), (v)) < 0)
+#define atomic64_add_negative(a, v) (atomic64_add_return((a), (v)) < 0)
+
 #define atomic_dec_return(v) atomic_sub_return(1,(v))
 #define atomic64_dec_return(v) atomic64_sub_return(1,(v))
 
@@ -187,6 +188,8 @@ static __inline__ long atomic64_sub_retu
 #define atomic64_sub_and_test(i,v) (atomic64_sub_return((i), (v)) == 0)
 
 #define atomic_inc_and_test(v) (atomic_add_return(1, (v)) == 0)
+#define atomic64_inc_and_test(v) (atomic64_add_return(1, (v)) == 0)
+
 #define atomic_dec_and_test(v) (atomic_sub_return(1, (v)) == 0)
 #define atomic64_dec_and_test(v) (atomic64_sub_return(1, (v)) == 0)
 
--- mm08/include/asm-s390/atomic.h	2005-08-29 00:41:01.000000000 +0100
+++ mm09/include/asm-s390/atomic.h	2005-11-09 14:39:48.000000000 +0000
@@ -131,7 +131,7 @@ static __inline__ long long atomic64_add
 {
 	return __CSG_LOOP(v, i, "agr");
 }
-static __inline__ long long atomic64_add_negative(long long i, atomic64_t * v)
+static __inline__ int atomic64_add_negative(long long i, atomic64_t * v)
 {
 	return __CSG_LOOP(v, i, "agr") < 0;
 }
@@ -139,6 +139,10 @@ static __inline__ void atomic64_sub(long
 {
 	       __CSG_LOOP(v, i, "sgr");
 }
+static __inline__ long long atomic64_sub_return(long long i, atomic64_t * v)
+{
+	return __CSG_LOOP(v, i, "sgr");
+}
 static __inline__ void atomic64_inc(volatile atomic64_t * v)
 {
 	       __CSG_LOOP(v, 1, "agr");
@@ -147,7 +151,7 @@ static __inline__ long long atomic64_inc
 {
 	return __CSG_LOOP(v, 1, "agr");
 }
-static __inline__ long long atomic64_inc_and_test(volatile atomic64_t * v)
+static __inline__ int atomic64_inc_and_test(volatile atomic64_t * v)
 {
 	return __CSG_LOOP(v, 1, "agr") == 0;
 }
@@ -159,7 +163,7 @@ static __inline__ long long atomic64_dec
 {
 	return __CSG_LOOP(v, 1, "sgr");
 }
-static __inline__ long long atomic64_dec_and_test(volatile atomic64_t * v)
+static __inline__ int atomic64_dec_and_test(volatile atomic64_t * v)
 {
 	return __CSG_LOOP(v, 1, "sgr") == 0;
 }
--- mm08/include/asm-sparc64/atomic.h	2005-10-28 01:02:08.000000000 +0100
+++ mm09/include/asm-sparc64/atomic.h	2005-11-09 14:39:48.000000000 +0000
@@ -54,6 +54,7 @@ extern int atomic64_sub_ret(int, atomic6
  * other cases.
  */
 #define atomic_inc_and_test(v) (atomic_inc_return(v) == 0)
+#define atomic64_inc_and_test(v) (atomic64_inc_return(v) == 0)
 
 #define atomic_sub_and_test(i, v) (atomic_sub_ret(i, v) == 0)
 #define atomic64_sub_and_test(i, v) (atomic64_sub_ret(i, v) == 0)
--- mm08/include/asm-x86_64/atomic.h	2004-12-24 21:37:29.000000000 +0000
+++ mm09/include/asm-x86_64/atomic.h	2005-11-09 14:39:48.000000000 +0000
@@ -160,8 +160,8 @@ static __inline__ int atomic_inc_and_tes
 
 /**
  * atomic_add_negative - add and test if negative
- * @v: pointer of type atomic_t
  * @i: integer value to add
+ * @v: pointer of type atomic_t
  * 
  * Atomically adds @i to @v and returns true
  * if the result is negative, or false when
@@ -178,6 +178,31 @@ static __inline__ int atomic_add_negativ
 	return c;
 }
 
+/**
+ * atomic_add_return - add and return
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns @i + @v
+ */
+static __inline__ int atomic_add_return(int i, atomic_t *v)
+{
+	int __i = i;
+	__asm__ __volatile__(
+		LOCK "xaddl %0, %1;"
+		:"=r"(i)
+		:"m"(v->counter), "0"(i));
+	return i + __i;
+}
+
+static __inline__ int atomic_sub_return(int i, atomic_t *v)
+{
+	return atomic_add_return(-i,v);
+}
+
+#define atomic_inc_return(v)  (atomic_add_return(1,v))
+#define atomic_dec_return(v)  (atomic_sub_return(1,v))
+
 /* An 64bit atomic type */
 
 typedef struct { volatile long counter; } atomic64_t;
@@ -320,14 +345,14 @@ static __inline__ int atomic64_inc_and_t
 
 /**
  * atomic64_add_negative - add and test if negative
- * @v: pointer to atomic64_t
  * @i: integer value to add
+ * @v: pointer to atomic64_t
  *
  * Atomically adds @i to @v and returns true
  * if the result is negative, or false when
  * result is greater than or equal to zero.
  */
-static __inline__ long atomic64_add_negative(long i, atomic64_t *v)
+static __inline__ int atomic64_add_negative(long i, atomic64_t *v)
 {
 	unsigned char c;
 
@@ -339,29 +364,29 @@ static __inline__ long atomic64_add_nega
 }
 
 /**
- * atomic_add_return - add and return
- * @v: pointer of type atomic_t
+ * atomic64_add_return - add and return
  * @i: integer value to add
+ * @v: pointer to type atomic64_t
  *
  * Atomically adds @i to @v and returns @i + @v
  */
-static __inline__ int atomic_add_return(int i, atomic_t *v)
+static __inline__ long atomic64_add_return(long i, atomic64_t *v)
 {
-	int __i = i;
+	long __i = i;
 	__asm__ __volatile__(
-		LOCK "xaddl %0, %1;"
+		LOCK "xaddq %0, %1;"
 		:"=r"(i)
 		:"m"(v->counter), "0"(i));
 	return i + __i;
 }
 
-static __inline__ int atomic_sub_return(int i, atomic_t *v)
+static __inline__ long atomic64_sub_return(long i, atomic64_t *v)
 {
-	return atomic_add_return(-i,v);
+	return atomic64_add_return(-i,v);
 }
 
-#define atomic_inc_return(v)  (atomic_add_return(1,v))
-#define atomic_dec_return(v)  (atomic_sub_return(1,v))
+#define atomic64_inc_return(v)  (atomic64_add_return(1,v))
+#define atomic64_dec_return(v)  (atomic64_sub_return(1,v))
 
 /* These are x86-specific, used by some header files */
 #define atomic_clear_mask(mask, addr) \


* [PATCH 10/15] mm: atomic64 page counts
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (8 preceding siblings ...)
  2005-11-10  1:56 ` [PATCH 09/15] mm: fill arch atomic64 gaps Hugh Dickins
@ 2005-11-10  1:57 ` Hugh Dickins
  2005-11-10  2:16   ` Andrew Morton
  2005-11-10  2:00 ` [PATCH 11/15] mm: long " Hugh Dickins
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  1:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-kernel

Page count and page mapcount might overflow their 31 bits on 64-bit
architectures, especially now we're refcounting the ZERO_PAGE.  We could
quite easily avoid counting it, but shared file pages may also overflow.

Prefer not to enlarge struct page: don't assign separate atomic64_ts to
count and mapcount, instead keep them both in one atomic64_t - the count
in the low 23 bits and the mapcount in the high 41 bits.  But of course
that can only work if we don't duplicate mapcount in count in this case.

The low 23 bits can accommodate 0x7fffff, that's 2 * PID_MAX_LIMIT - 1,
which seems adequate for tasks with a transient hold on pages; and the
high 41 bits would use 16TB of page table space to back max mapcount.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

I think Nick should have a veto on this patch if he wishes, it's not my
intention to make his lockless pagecache impossible.  But I doubt he'll
have to veto it, I expect it just needs more macros: whereas page_count
carefully assembles grabcount and mapcount into the familiar page count,
he'll probably want something giving the raw atomic32 or atomic64 value.

 include/linux/mm.h   |  163 +++++++++++++++++++++++++++++++++++++--------------
 include/linux/rmap.h |   12 ---
 mm/memory.c          |    3 
 mm/rmap.c            |    9 +-
 4 files changed, 124 insertions(+), 63 deletions(-)

--- mm09/include/linux/mm.h	2005-11-09 14:38:03.000000000 +0000
+++ mm10/include/linux/mm.h	2005-11-09 14:40:00.000000000 +0000
@@ -219,11 +219,15 @@ struct inode;
 struct page {
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
+#ifdef ATOMIC64_INIT
+	atomic64_t _pcount;		/* Both _count and _mapcount. */
+#else
 	atomic_t _count;		/* Usage count, see below. */
 	atomic_t _mapcount;		/* Count of ptes mapped in mms,
 					 * to show when page is mapped
 					 * & limit reverse map searches.
 					 */
+#endif
 	unsigned long private;		/* Mapping-private opaque data:
 					 * usually used for buffer_heads
 					 * if PagePrivate set; used for
@@ -297,30 +301,112 @@ struct page {
  * macros which retain the old rules: page_count(page) == 0 is a free page.
  */
 
+#ifdef ATOMIC64_INIT
+/*
+ * We squeeze _count and _mapcount into a single atomic64 _pcount:
+ * keeping the mapcount in the upper 41 bits, and the grabcount in
+ * the lower 23 bits: then page_count is the total of these two.
+ * When adding a mapping, get_page_mapped_testone transfers one from
+ * grabcount to mapcount; which put_page_mapped_testzero reverses.
+ * Each stored count is based from 0 in this case, not from -1.
+ */
+#define PCOUNT_SHIFT	23
+#define PCOUNT_MASK	((1UL << PCOUNT_SHIFT) - 1)
+
 /*
- * Drop a ref, return true if the logical refcount fell to zero (the page has
- * no users)
+ * Drop a ref, return true if the logical refcount fell to zero
+ * (the page has no users)
  */
-#define put_page_testzero(p)				\
-	({						\
-		BUG_ON(page_count(p) == 0);		\
-		atomic_add_negative(-1, &(p)->_count);	\
-	})
+static inline int put_page_testzero(struct page *page)
+{
+	BUILD_BUG_ON(PCOUNT_MASK+1 < 2*PID_MAX_LIMIT);
+	BUG_ON(!(atomic64_read(&page->_pcount) & PCOUNT_MASK));
+	return atomic64_dec_and_test(&page->_pcount);
+}
+
+static inline int put_page_mapped_testzero(struct page *page)
+{
+	return (unsigned long)atomic64_sub_return(PCOUNT_MASK, &page->_pcount)
+							<= PCOUNT_MASK;
+}
 
 /*
  * Grab a ref, return true if the page previously had a logical refcount of
  * zero.  ie: returns true if we just grabbed an already-deemed-to-be-free page
  */
-#define get_page_testone(p)	atomic_inc_and_test(&(p)->_count)
+#define get_page_testone(page) \
+	(atomic64_add_return(1, &(page)->_pcount) == 1)
+#define get_page_mapped_testone(page) \
+	((atomic64_add_return(PCOUNT_MASK,&(page)->_pcount)>>PCOUNT_SHIFT)==1)
+
+#define set_page_count(page, val)	atomic64_set(&(page)->_pcount, val)
+#define reset_page_mapcount(page)	do { } while (0)
+#define __put_page(page)		atomic64_dec(&(page)->_pcount)
 
-#define set_page_count(p,v) 	atomic_set(&(p)->_count, v - 1)
-#define __put_page(p)		atomic_dec(&(p)->_count)
+static inline long page_count(struct page *page)
+{
+	unsigned long pcount;
 
-extern void FASTCALL(__page_cache_release(struct page *));
+	if (PageCompound(page))
+		page = (struct page *)page->private;
+	pcount = (unsigned long)atomic64_read(&page->_pcount);
+	/* Return total of grabcount and mapcount */
+	return (pcount & PCOUNT_MASK) + (pcount >> PCOUNT_SHIFT);
+}
 
-#ifdef CONFIG_HUGETLB_PAGE
+static inline void get_page(struct page *page)
+{
+	if (unlikely(PageCompound(page)))
+		page = (struct page *)page->private;
+	atomic64_inc(&page->_pcount);
+}
+
+static inline void get_page_dup_rmap(struct page *page)
+{
+	/* copy_one_pte increment mapcount */
+	atomic64_add(PCOUNT_MASK + 1, &page->_pcount);
+}
+
+static inline long page_mapcount(struct page *page)
+{
+	return (unsigned long)atomic64_read(&page->_pcount) >> PCOUNT_SHIFT;
+}
+
+static inline int page_mapped(struct page *page)
+{
+	return (unsigned long)atomic64_read(&page->_pcount) > PCOUNT_MASK;
+}
+
+#else /* !ATOMIC64_INIT */
+
+/*
+ * Drop a ref, return true if the logical refcount fell to zero
+ * (the page has no users)
+ */
+static inline int put_page_testzero(struct page *page)
+{
+	BUG_ON(atomic_read(&page->_count) == -1);
+	return atomic_add_negative(-1, &page->_count);
+}
+
+static inline int put_page_mapped_testzero(struct page *page)
+{
+	return atomic_add_negative(-1, &page->_mapcount);
+}
+
+/*
+ * Grab a ref, return true if the page previously had a logical refcount of
+ * zero.  ie: returns true if we just grabbed an already-deemed-to-be-free page
+ */
+#define get_page_testone(page)	atomic_inc_and_test(&(page)->_count)
+#define get_page_mapped_testone(page) \
+				atomic_inc_and_test(&(page)->_mapcount)
+
+#define set_page_count(page, val) 	atomic_set(&(page)->_count, val - 1)
+#define reset_page_mapcount(page)	atomic_set(&(page)->_mapcount, -1)
+#define __put_page(page)		atomic_dec(&(page)->_count)
 
-static inline int page_count(struct page *page)
+static inline long page_count(struct page *page)
 {
 	if (PageCompound(page))
 		page = (struct page *)page->private;
@@ -334,24 +420,36 @@ static inline void get_page(struct page 
 	atomic_inc(&page->_count);
 }
 
-void put_page(struct page *page);
-
-#else		/* CONFIG_HUGETLB_PAGE */
+static inline void get_page_dup_rmap(struct page *page)
+{
+	/* copy_one_pte increment total count and mapcount */
+	atomic_inc(&page->_count);
+	atomic_inc(&page->_mapcount);
+}
 
-#define page_count(p)		(atomic_read(&(p)->_count) + 1)
+static inline long page_mapcount(struct page *page)
+{
+	return atomic_read(&page->_mapcount) + 1;
+}
 
-static inline void get_page(struct page *page)
+static inline int page_mapped(struct page *page)
 {
-	atomic_inc(&page->_count);
+	return atomic_read(&page->_mapcount) >= 0;
 }
 
+#endif /* !ATOMIC64_INIT */
+
+void FASTCALL(__page_cache_release(struct page *));
+
+#ifdef CONFIG_HUGETLB_PAGE
+void put_page(struct page *page);
+#else
 static inline void put_page(struct page *page)
 {
 	if (put_page_testzero(page))
 		__page_cache_release(page);
 }
-
-#endif		/* CONFIG_HUGETLB_PAGE */
+#endif	/* !CONFIG_HUGETLB_PAGE */
 
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
@@ -601,29 +699,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
- * Return true if this page is mapped into pagetables.
- */
-static inline int page_mapped(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) >= 0;
-}
-
-/*
  * Error return values for the *_nopage functions
  */
 #define NOPAGE_SIGBUS	(NULL)
--- mm09/include/linux/rmap.h	2005-11-07 07:39:58.000000000 +0000
+++ mm10/include/linux/rmap.h	2005-11-09 14:40:00.000000000 +0000
@@ -74,18 +74,6 @@ void page_add_anon_rmap(struct page *, s
 void page_add_file_rmap(struct page *);
 void page_remove_rmap(struct page *);
 
-/**
- * page_dup_rmap - duplicate pte mapping to a page
- * @page:	the page to add the mapping to
- *
- * For copy_page_range only: minimal extract from page_add_rmap,
- * avoiding unnecessary tests (already checked) so it's quicker.
- */
-static inline void page_dup_rmap(struct page *page)
-{
-	atomic_inc(&page->_mapcount);
-}
-
 /*
  * Called from mm/vmscan.c to handle paging out
  */
--- mm09/mm/memory.c	2005-11-07 07:39:59.000000000 +0000
+++ mm10/mm/memory.c	2005-11-09 14:40:00.000000000 +0000
@@ -414,8 +414,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	if (vm_flags & VM_SHARED)
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
-	get_page(page);
-	page_dup_rmap(page);
+	get_page_dup_rmap(page);
 	rss[!!PageAnon(page)]++;
 
 out_set_pte:
--- mm09/mm/rmap.c	2005-11-09 14:38:03.000000000 +0000
+++ mm10/mm/rmap.c	2005-11-09 14:40:00.000000000 +0000
@@ -450,7 +450,7 @@ int page_referenced(struct page *page, i
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	if (atomic_inc_and_test(&page->_mapcount)) {
+	if (get_page_mapped_testone(page)) {
 		struct anon_vma *anon_vma = vma->anon_vma;
 
 		BUG_ON(!anon_vma);
@@ -474,8 +474,7 @@ void page_add_file_rmap(struct page *pag
 {
 	BUG_ON(PageAnon(page));
 	BUG_ON(!pfn_valid(page_to_pfn(page)));
-
-	if (atomic_inc_and_test(&page->_mapcount))
+	if (get_page_mapped_testone(page))
 		inc_page_state(nr_mapped);
 }
 
@@ -487,8 +486,8 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
-	if (atomic_add_negative(-1, &page->_mapcount)) {
-		BUG_ON(page_mapcount(page) < 0);
+	BUG_ON(!page_mapcount(page));
+	if (put_page_mapped_testzero(page)) {
 		/*
 		 * It would be tidy to reset the PageAnon mapping here,
 		 * but that might overwrite a racing page_add_anon_rmap

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 11/15] mm: long page counts
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (9 preceding siblings ...)
  2005-11-10  1:57 ` [PATCH 10/15] mm: atomic64 page counts Hugh Dickins
@ 2005-11-10  2:00 ` Hugh Dickins
  2005-11-10  2:01 ` [PATCH 12/15] mm reiser4: " Hugh Dickins
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  2:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Mackerras, Ben Herrenschmidt, Paul Mundt, Dave Airlie,
	linux-kernel

The type of the page_count and page_mapcount functions has changed from
int to long.  Update those places which give warnings (mostly debug
printks), or where the count might significantly overflow.

Don't bother with the arch's show_mem functions for now (some say int
shared, some long): they don't cause warnings, the truncation wouldn't
matter much, and we'll want to visit them all (perhaps bring them into
common code) in a later phase of PageReserved removal.

The thought of page_referenced on a page whose mapcount exceeds an int
is rather disturbing: it should probably skip high mapcounts unless the
memory pressure is high; but that's a different problem, ignore for now.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 arch/ppc64/kernel/vdso.c  |    6 +++---
 arch/sh64/lib/dbg.c       |    2 +-
 drivers/char/drm/drm_vm.c |    2 +-
 fs/proc/task_mmu.c        |    2 +-
 mm/page_alloc.c           |    2 +-
 mm/rmap.c                 |   18 +++++++++---------
 mm/swapfile.c             |    2 +-
 7 files changed, 17 insertions(+), 17 deletions(-)

--- mm10/arch/ppc64/kernel/vdso.c	2005-11-07 07:39:07.000000000 +0000
+++ mm11/arch/ppc64/kernel/vdso.c	2005-11-09 14:40:00.000000000 +0000
@@ -112,11 +112,11 @@ struct lib64_elfinfo
 #ifdef __DEBUG
 static void dump_one_vdso_page(struct page *pg, struct page *upg)
 {
-	printk("kpg: %p (c:%d,f:%08lx)", __va(page_to_pfn(pg) << PAGE_SHIFT),
+	printk("kpg: %p (c:%ld,f:%08lx)", __va(page_to_pfn(pg) << PAGE_SHIFT),
 	       page_count(pg),
 	       pg->flags);
 	if (upg/* && pg != upg*/) {
-		printk(" upg: %p (c:%d,f:%08lx)", __va(page_to_pfn(upg) << PAGE_SHIFT),
+		printk(" upg: %p (c:%ld,f:%08lx)", __va(page_to_pfn(upg) << PAGE_SHIFT),
 		       page_count(upg),
 		       upg->flags);
 	}
@@ -184,7 +184,7 @@ static struct page * vdso_vma_nopage(str
 		pg = virt_to_page(vbase + offset);
 
 	get_page(pg);
-	DBG(" ->page count: %d\n", page_count(pg));
+	DBG(" ->page count: %ld\n", page_count(pg));
 
 	return pg;
 }
--- mm10/arch/sh64/lib/dbg.c	2005-06-17 20:48:29.000000000 +0100
+++ mm11/arch/sh64/lib/dbg.c	2005-11-09 14:40:00.000000000 +0000
@@ -422,7 +422,7 @@ unsigned long lookup_itlb(unsigned long 
 
 void print_page(struct page *page)
 {
-	printk("  page[%p] -> index 0x%lx,  count 0x%x,  flags 0x%lx\n",
+	printk("  page[%p] -> index 0x%lx,  count 0x%lx,  flags 0x%lx\n",
 	       page, page->index, page_count(page), page->flags);
 	printk("       address_space = %p, pages =%ld\n", page->mapping,
 	       page->mapping->nrpages);
--- mm10/drivers/char/drm/drm_vm.c	2005-11-07 07:39:15.000000000 +0000
+++ mm11/drivers/char/drm/drm_vm.c	2005-11-09 14:40:00.000000000 +0000
@@ -112,7 +112,7 @@ static __inline__ struct page *drm_do_vm
 		get_page(page);
 
 		DRM_DEBUG
-		    ("baddr = 0x%lx page = 0x%p, offset = 0x%lx, count=%d\n",
+		    ("baddr = 0x%lx page = 0x%p, offset = 0x%lx, count=%ld\n",
 		     baddr, __va(agpmem->memory->memory[offset]), offset,
 		     page_count(page));
 
--- mm10/fs/proc/task_mmu.c	2005-11-07 07:39:46.000000000 +0000
+++ mm11/fs/proc/task_mmu.c	2005-11-09 14:40:00.000000000 +0000
@@ -422,7 +422,7 @@ static struct numa_maps *get_numa_maps(c
  	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
 		page = follow_page(mm, vaddr, 0);
 		if (page) {
-			int count = page_mapcount(page);
+			long count = page_mapcount(page);
 
 			if (count)
 				md->mapped++;
--- mm10/mm/page_alloc.c	2005-11-09 14:38:03.000000000 +0000
+++ mm11/mm/page_alloc.c	2005-11-09 14:40:00.000000000 +0000
@@ -126,7 +126,7 @@ static void bad_page(const char *functio
 {
 	printk(KERN_EMERG "Bad page state at %s (in process '%s', page %p)\n",
 		function, current->comm, page);
-	printk(KERN_EMERG "flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
+	printk(KERN_EMERG "flags:0x%0*lx mapping:%p mapcount:%ld count:%ld\n",
 		(int)(2*sizeof(unsigned long)), (unsigned long)page->flags,
 		page->mapping, page_mapcount(page), page_count(page));
 	printk(KERN_EMERG "Backtrace:\n");
--- mm10/mm/rmap.c	2005-11-09 14:40:00.000000000 +0000
+++ mm11/mm/rmap.c	2005-11-09 14:40:00.000000000 +0000
@@ -64,12 +64,12 @@ static inline void validate_anon_vma(str
 #ifdef RMAP_DEBUG
 	struct anon_vma *anon_vma = find_vma->anon_vma;
 	struct vm_area_struct *vma;
-	unsigned int mapcount = 0;
+	unsigned int vmacount = 0;
 	int found = 0;
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		mapcount++;
-		BUG_ON(mapcount > 100000);
+		vmacount++;
+		BUG_ON(vmacount > 100000);
 		if (vma == find_vma)
 			found = 1;
 	}
@@ -289,7 +289,7 @@ pte_t *page_check_address(struct page *p
  * repeatedly from either page_referenced_anon or page_referenced_file.
  */
 static int page_referenced_one(struct page *page,
-	struct vm_area_struct *vma, unsigned int *mapcount, int ignore_token)
+	struct vm_area_struct *vma, long *mapcount, int ignore_token)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -322,7 +322,7 @@ out:
 
 static int page_referenced_anon(struct page *page, int ignore_token)
 {
-	unsigned int mapcount;
+	long mapcount;
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
 	int referenced = 0;
@@ -355,7 +355,7 @@ static int page_referenced_anon(struct p
  */
 static int page_referenced_file(struct page *page, int ignore_token)
 {
-	unsigned int mapcount;
+	long mapcount;
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma;
@@ -600,7 +600,7 @@ out:
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
 static void try_to_unmap_cluster(unsigned long cursor,
-	unsigned int *mapcount, struct vm_area_struct *vma)
+	long *mapcount, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -712,7 +712,7 @@ static int try_to_unmap_file(struct page
 	unsigned long cursor;
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
-	unsigned int mapcount;
+	long mapcount;
 
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
@@ -768,7 +768,7 @@ static int try_to_unmap_file(struct page
 				try_to_unmap_cluster(cursor, &mapcount, vma);
 				cursor += CLUSTER_SIZE;
 				vma->vm_private_data = (void *) cursor;
-				if ((int)mapcount <= 0)
+				if (mapcount <= 0)
 					goto out;
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
--- mm10/mm/swapfile.c	2005-11-09 14:38:03.000000000 +0000
+++ mm11/mm/swapfile.c	2005-11-09 14:40:00.000000000 +0000
@@ -308,7 +308,7 @@ static inline int page_swapcount(struct 
  */
 int can_share_swap_page(struct page *page)
 {
-	int count;
+	long count;
 
 	BUG_ON(!PageLocked(page));
 	count = page_mapcount(page);


* [PATCH 12/15] mm reiser4: long page counts
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (10 preceding siblings ...)
  2005-11-10  2:00 ` [PATCH 11/15] mm: long " Hugh Dickins
@ 2005-11-10  2:01 ` Hugh Dickins
  2005-11-10  2:03 ` [PATCH 13/15] mm: get_user_pages check count Hugh Dickins
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  2:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: reiserfs-dev, linux-kernel

Update -mm tree's reiser4 source for the long page_count.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 fs/reiser4/page_cache.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

--- mm11/fs/reiser4/page_cache.c	2005-11-09 14:38:18.000000000 +0000
+++ mm12/fs/reiser4/page_cache.c	2005-11-09 14:40:00.000000000 +0000
@@ -751,7 +751,7 @@ void print_page(const char *prefix, stru
 		printk("null page\n");
 		return;
 	}
-	printk("%s: page index: %lu mapping: %p count: %i private: %lx\n",
+	printk("%s: page index: %lu mapping: %p count: %li private: %lx\n",
 	       prefix, page->index, page->mapping, page_count(page),
 	       page->private);
 	printk("\tflags: %s%s%s%s %s%s%s %s%s%s %s%s\n",


* [PATCH 13/15] mm: get_user_pages check count
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (11 preceding siblings ...)
  2005-11-10  2:01 ` [PATCH 12/15] mm reiser4: " Hugh Dickins
@ 2005-11-10  2:03 ` Hugh Dickins
  2005-11-10  2:08 ` [PATCH 14/15] mm: inc_page_table_pages check max Hugh Dickins
  2005-11-10  2:09 ` [PATCH 15/15] mm: remove install_page limit Hugh Dickins
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  2:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Most calls to get_user_pages soon release the pages and then return to
user space, but some are long term - notably Infiniband's ib_umem_get.
That's blessed with good locked_vm checking, but if the system were
misconfigured, then it might be possible to build up a huge page_count.

Guard against overflow by page_count_too_high checks in get_user_pages:
refuse with -ENOMEM once the page count enters the final PID_MAX_LIMIT
of its range (which is why we allowed for 2*PID_MAX_LIMIT in the 23 bits
of count).

Sorry, can't touch get_user_pages without giving it more cleanup.
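For reference, the two packed-count predicates can be modelled outside the kernel; this standalone sketch substitutes a plain unsigned long for the atomic64_t _pcount, and assumes PID_MAX_LIMIT is 0x400000, so that 2*PID_MAX_LIMIT - 1 fills the 23-bit mask as described:

```c
/* Standalone model of the packed-count checks: the refcount occupies
 * the low PCOUNT_SHIFT bits of _pcount, the mapcount the bits above. */
#define PCOUNT_SHIFT	23
#define PCOUNT_MASK	((1UL << PCOUNT_SHIFT) - 1)
#define PID_MAX_LIMIT	(1UL << 22)	/* assumed value */
#define PCOUNT_HIGH	(PCOUNT_MASK - PID_MAX_LIMIT)

/* True if any mapcount bits are set above the count field. */
static int page_mapped(unsigned long pcount)
{
	return pcount > PCOUNT_MASK;
}

/* True once the count field nears the top of its 23-bit range:
 * get_user_pages then refuses further references with -ENOMEM. */
static int page_count_too_high(unsigned long pcount)
{
	return (pcount & PCOUNT_MASK) >= PCOUNT_HIGH;
}
```

With PCOUNT_MASK at 0x7fffff, PCOUNT_HIGH works out to 0x3fffff: the refusal point leaves PID_MAX_LIMIT spare references, enough for every possible task to hold one transiently.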

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 include/linux/mm.h |   16 ++++++++++++++++
 mm/memory.c        |   40 +++++++++++++++++++++++-----------------
 2 files changed, 39 insertions(+), 17 deletions(-)

--- mm12/include/linux/mm.h	2005-11-09 14:40:00.000000000 +0000
+++ mm13/include/linux/mm.h	2005-11-09 14:40:17.000000000 +0000
@@ -312,6 +312,7 @@ struct page {
  */
 #define PCOUNT_SHIFT	23
 #define PCOUNT_MASK	((1UL << PCOUNT_SHIFT) - 1)
+#define PCOUNT_HIGH	(PCOUNT_MASK - PID_MAX_LIMIT)
 
 /*
  * Drop a ref, return true if the logical refcount fell to zero
@@ -377,8 +378,17 @@ static inline int page_mapped(struct pag
 	return (unsigned long)atomic64_read(&page->_pcount) > PCOUNT_MASK;
 }
 
+static inline int page_count_too_high(struct page *page)
+{
+	/* get_user_pages check when nearing overflow */
+	return ((unsigned long)atomic64_read(&page->_pcount) & PCOUNT_MASK)
+							>= PCOUNT_HIGH;
+}
+
 #else /* !ATOMIC64_INIT */
 
+#define PCOUNT_HIGH	(INT_MAX - PID_MAX_LIMIT)
+
 /*
  * Drop a ref, return true if the logical refcount fell to zero
  * (the page has no users)
@@ -437,6 +447,12 @@ static inline int page_mapped(struct pag
 	return atomic_read(&page->_mapcount) >= 0;
 }
 
+static inline int page_count_too_high(struct page *page)
+{
+	/* get_user_pages check when nearing overflow */
+	return atomic_read(&page->_count) >= PCOUNT_HIGH;
+}
+
 #endif /* !ATOMIC64_INIT */
 
 void FASTCALL(__page_cache_release(struct page *));
--- mm12/mm/memory.c	2005-11-09 14:40:00.000000000 +0000
+++ mm13/mm/memory.c	2005-11-09 14:40:17.000000000 +0000
@@ -928,39 +928,43 @@ int get_user_pages(struct task_struct *t
 	do {
 		struct vm_area_struct *vma;
 		unsigned int foll_flags;
+		struct page *page;
 
 		vma = find_extend_vma(mm, start);
 		if (!vma && in_gate_area(tsk, start)) {
-			unsigned long pg = start & PAGE_MASK;
-			struct vm_area_struct *gate_vma = get_gate_vma(tsk);
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
 			pte_t *pte;
+			pte_t ptent;
+
 			if (write) /* user gate pages are read-only */
 				return i ? : -EFAULT;
-			if (pg > TASK_SIZE)
-				pgd = pgd_offset_k(pg);
+			start &= PAGE_MASK;	/* what needs that? */
+			if (start >= TASK_SIZE)
+				pgd = pgd_offset_k(start);
 			else
-				pgd = pgd_offset_gate(mm, pg);
+				pgd = pgd_offset_gate(mm, start);
 			BUG_ON(pgd_none(*pgd));
-			pud = pud_offset(pgd, pg);
+			pud = pud_offset(pgd, start);
 			BUG_ON(pud_none(*pud));
-			pmd = pmd_offset(pud, pg);
+			pmd = pmd_offset(pud, start);
 			if (pmd_none(*pmd))
 				return i ? : -EFAULT;
-			pte = pte_offset_map(pmd, pg);
-			if (pte_none(*pte)) {
-				pte_unmap(pte);
+			pte = pte_offset_map(pmd, start);
+			ptent = *pte;
+			pte_unmap(pte);
+			if (pte_none(ptent))
 				return i ? : -EFAULT;
-			}
 			if (pages) {
-				pages[i] = pte_page(*pte);
-				get_page(pages[i]);
+				page = pte_page(ptent);
+				if (page_count_too_high(page))
+					return i ? : -ENOMEM;
+				get_page(page);
+				pages[i] = page;
 			}
-			pte_unmap(pte);
 			if (vmas)
-				vmas[i] = gate_vma;
+				vmas[i] = get_gate_vma(tsk);
 			i++;
 			start += PAGE_SIZE;
 			len--;
@@ -985,8 +989,6 @@ int get_user_pages(struct task_struct *t
 			foll_flags |= FOLL_ANON;
 
 		do {
-			struct page *page;
-
 			if (write)
 				foll_flags |= FOLL_WRITE;
 
@@ -1020,6 +1022,10 @@ int get_user_pages(struct task_struct *t
 				}
 			}
 			if (pages) {
+				if (page_count_too_high(page)) {
+					put_page(page);
+					return i ? : -ENOMEM;
+				}
 				pages[i] = page;
 				flush_dcache_page(page);
 			}


* [PATCH 14/15] mm: inc_page_table_pages check max
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (12 preceding siblings ...)
  2005-11-10  2:03 ` [PATCH 13/15] mm: get_user_pages check count Hugh Dickins
@ 2005-11-10  2:08 ` Hugh Dickins
  2005-11-10  2:09 ` [PATCH 15/15] mm: remove install_page limit Hugh Dickins
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  2:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Mackerras, Ben Herrenschmidt, Matthew Wilcox,
	James Bottomley, linux-kernel

A 32-bit machine needs 16GB of page table space to back its max mapcount,
a 64-bit machine needs 16TB of page table space to back its max mapcount.

But, there are certainly 32-bit machines with 64GB physical - and more?
and even if 16TB were a 64-bit limit today, it would not be tomorrow.
Yet it'll be some time (I hope) before such machines need to use that
amount of their memory just on the page tables.

Therefore, guard against mapcount overflow on such extreme machines, by
limiting the number of page table pages with a check before incrementing
nr_page_table_pages.  That's a per_cpu variable, so add a per_cpu max,
and avoid extra locking (except at startup and rare occasions after).

Of course, normally those page tables would be filled with entries for
different pages, and no mapcount remotely approach the limit; but this
check avoids checks on hotter paths, without being too restrictive.
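The shape of that scheme can be sketched as a userspace model (hypothetical names — NCPUS, MAX_TABLES — standing in for the per-cpu machinery and MAX_PAGE_TABLE_PAGES; the locking is elided):

```c
/* Userspace model of the per-cpu quota: plain arrays stand in for
 * per-cpu variables, and a single thread for the lock-protected
 * slow path.  Each cpu has a local allowance; allocation is a cheap
 * local check until that allowance runs out, and only then is the
 * remaining global budget redistributed. */
#define NCPUS		4
#define MAX_TABLES	1000	/* stands in for MAX_PAGE_TABLE_PAGES */

static long count[NCPUS];	/* per-cpu nr_page_table_pages */
static long max[NCPUS];		/* per-cpu allowance */

static int redistribute(void)
{
	long used = 0, delta;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		max[cpu] = count[cpu];	/* freeze racing fast paths */
		used += count[cpu];
	}
	/* Subtract 1 per cpu, as the patch does, so a racing cpu can
	 * overshoot by at most one table before hitting the slow path. */
	delta = ((MAX_TABLES - used) / NCPUS) - 1;
	if (delta <= 0)
		return -1;		/* -ENOMEM in the kernel */
	for (cpu = 0; cpu < NCPUS; cpu++)
		max[cpu] += delta;
	return 0;
}

static int inc_page_table_pages(int cpu)
{
	while (count[cpu] >= max[cpu]) {
		if (redistribute() < 0)
			return -1;
	}
	count[cpu]++;
	return 0;
}
```

However unevenly the allocations fall across cpus, the total can never pass MAX_TABLES, yet a cpu only takes the slow path when its own allowance is spent.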

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

But 64-bit powerpc and parisc are limited to 16GB of page table space by
this, because they don't yet provide the atomic64_t type and operations.
Is there someone who could provide the necessary atomic64_t for them?

 include/linux/mm.h         |    2 
 include/linux/page-flags.h |    1 
 mm/memory.c                |    7 +--
 mm/page_alloc.c            |   95 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 102 insertions(+), 3 deletions(-)

--- mm13/include/linux/mm.h	2005-11-09 14:40:17.000000000 +0000
+++ mm14/include/linux/mm.h	2005-11-09 14:40:32.000000000 +0000
@@ -313,6 +313,7 @@ struct page {
 #define PCOUNT_SHIFT	23
 #define PCOUNT_MASK	((1UL << PCOUNT_SHIFT) - 1)
 #define PCOUNT_HIGH	(PCOUNT_MASK - PID_MAX_LIMIT)
+#define MAPCOUNT_MAX	(-1UL >> PCOUNT_SHIFT)
 
 /*
  * Drop a ref, return true if the logical refcount fell to zero
@@ -388,6 +389,7 @@ static inline int page_count_too_high(st
 #else /* !ATOMIC64_INIT */
 
 #define PCOUNT_HIGH	(INT_MAX - PID_MAX_LIMIT)
+#define MAPCOUNT_MAX	(INT_MAX - 2*PID_MAX_LIMIT)
 
 /*
  * Drop a ref, return true if the logical refcount fell to zero
--- mm13/include/linux/page-flags.h	2005-11-07 07:39:57.000000000 +0000
+++ mm14/include/linux/page-flags.h	2005-11-09 14:40:32.000000000 +0000
@@ -139,6 +139,7 @@ extern void get_page_state_node(struct p
 extern void get_full_page_state(struct page_state *ret);
 extern unsigned long __read_page_state(unsigned long offset);
 extern void __mod_page_state(unsigned long offset, unsigned long delta);
+extern int inc_page_table_pages(void);
 
 #define read_page_state(member) \
 	__read_page_state(offsetof(struct page_state, member))
--- mm13/mm/memory.c	2005-11-09 14:40:17.000000000 +0000
+++ mm14/mm/memory.c	2005-11-09 14:40:32.000000000 +0000
@@ -291,22 +291,23 @@ void free_pgtables(struct mmu_gather **t
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 {
+	int ret = 0;
 	struct page *new = pte_alloc_one(mm, address);
 	if (!new)
 		return -ENOMEM;
 
 	pte_lock_init(new);
 	spin_lock(&mm->page_table_lock);
-	if (pmd_present(*pmd)) {	/* Another has populated it */
+	if (pmd_present(*pmd) ||	/* Another has populated it */
+	    (ret = inc_page_table_pages())) {
 		pte_lock_deinit(new);
 		pte_free(new);
 	} else {
 		mm->nr_ptes++;
-		inc_page_state(nr_page_table_pages);
 		pmd_populate(mm, pmd, new);
 	}
 	spin_unlock(&mm->page_table_lock);
-	return 0;
+	return ret;
 }
 
 int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
--- mm13/mm/page_alloc.c	2005-11-09 14:40:00.000000000 +0000
+++ mm14/mm/page_alloc.c	2005-11-09 14:40:32.000000000 +0000
@@ -1217,6 +1217,14 @@ static void show_node(struct zone *zone)
 #endif
 
 /*
+ * Declarations for inc_page_table_pages(): placed here in the hope
+ * that max_page_table_pages will share cacheline with page_states.
+ */
+#define MAX_PAGE_TABLE_PAGES	(MAPCOUNT_MAX / PTRS_PER_PTE)
+static DEFINE_SPINLOCK(max_page_tables_lock);
+static DEFINE_PER_CPU(unsigned long, max_page_table_pages) = {0};
+
+/*
  * Accumulate the page_state information across all CPUs.
  * The result is unavoidably approximate - it can change
  * during and after execution of this function.
@@ -1309,6 +1317,90 @@ void __mod_page_state(unsigned long offs
 
 EXPORT_SYMBOL(__mod_page_state);
 
+/*
+ * inc_page_table_pages() increments cpu page_state.nr_page_table_pages
+ * after checking against the MAX_PAGE_TABLE_PAGES limit: which ensures
+ * mapcount cannot wrap even if _every_ page table entry points to the
+ * same page.  Absurdly draconian, yet no serious practical limitation -
+ * limits 32-bit to 16GB in page tables, 64-bit to 16TB in page tables.
+ */
+int inc_page_table_pages(void)
+{
+	unsigned long offset;
+	unsigned long *max;
+	unsigned long *ptr;
+	unsigned long nr_ptps;
+	int nr_cpus;
+	long delta;
+	int cpu;
+
+	offset = offsetof(struct page_state, nr_page_table_pages);
+again:
+	ptr = (void *) &__get_cpu_var(page_states) + offset;
+	max = &__get_cpu_var(max_page_table_pages);
+	/*
+	 * Beware, *ptr and *max may go "negative" if more page
+	 * tables happen to be freed on this cpu than allocated.
+	 * We avoid the need for barriers by keeping max 1 low.
+	 */
+	if (likely((long)(*max - *ptr) > 0)) {
+		(*ptr)++;
+		return 0;
+	}
+
+	spin_lock(&max_page_tables_lock);
+	/*
+	 * Below, we drop *max on each cpu to stop racing allocations
+	 * while we're updating.  But perhaps another cpu just did the
+	 * update while we were waiting for the lock: don't do it again.
+	 */
+	if ((long)(*max - *ptr) > 0) {
+		(*ptr)++;
+		spin_unlock(&max_page_tables_lock);
+		return 0;
+	}
+
+	/*
+	 * Find how much is allocated and how many online cpus.
+	 * Stop racing allocations by dropping *max temporarily.
+	 */
+	nr_cpus = 0;
+	nr_ptps = 0;
+	for_each_online_cpu(cpu) {
+		ptr = (void *) &per_cpu(page_states, cpu) + offset;
+		max = &per_cpu(max_page_table_pages, cpu);
+		*max = *ptr;
+		nr_ptps += *max;
+		nr_cpus++;
+	}
+
+	/*
+	 * Allow each cpu the same quota.  Subtract 1 to avoid the need
+	 * for barriers above: each racing cpu might allocate one table
+	 * too many, but will meet a barrier before it can get another.
+	 */
+	delta = ((MAX_PAGE_TABLE_PAGES - nr_ptps) / nr_cpus) - 1;
+	if (delta <= 0) {
+		spin_unlock(&max_page_tables_lock);
+		return -ENOMEM;
+	}
+
+	/*
+	 * Redistribute new maxima amongst the online cpus.
+	 * Don't allow too much if a new cpu has come online; don't
+	 * worry if a cpu went offline, it'll get sorted eventually.
+	 */
+	for_each_online_cpu(cpu) {
+		max = &per_cpu(max_page_table_pages, cpu);
+		*max += delta;
+		--nr_cpus;
+		if (!nr_cpus)
+			break;
+	}
+	spin_unlock(&max_page_tables_lock);
+	goto again;
+}
+
 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free, struct pglist_data *pgdat)
 {
@@ -2431,6 +2523,9 @@ static int page_alloc_cpu_notify(struct 
 			src[i] = 0;
 		}
 
+		src = (unsigned long *)&per_cpu(max_page_table_pages, cpu);
+		*src = 0;
+
 		local_irq_enable();
 	}
 	return NOTIFY_OK;


* [PATCH 15/15] mm: remove install_page limit
  2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
                   ` (13 preceding siblings ...)
  2005-11-10  2:08 ` [PATCH 14/15] mm: inc_page_table_pages check max Hugh Dickins
@ 2005-11-10  2:09 ` Hugh Dickins
  14 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  2:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Remove the INT_MAX/2 limit on remap_file_pages mapcount, which we added
in late 2.6.14: both 32-bit and 64-bit can now handle more, and are now
limited by their page tables.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/fremap.c |    3 ---
 1 files changed, 3 deletions(-)

--- mm14/mm/fremap.c	2005-11-07 07:39:59.000000000 +0000
+++ mm15/mm/fremap.c	2005-11-09 14:40:44.000000000 +0000
@@ -87,9 +87,6 @@ int install_page(struct mm_struct *mm, s
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (!page->mapping || page->index >= size)
 		goto unlock;
-	err = -ENOMEM;
-	if (page_mapcount(page) > INT_MAX/2)
-		goto unlock;
 
 	if (pte_none(*pte) || !zap_pte(mm, vma, addr, pte))
 		inc_mm_counter(mm, file_rss);


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10  1:43 ` [PATCH 01/15] mm: poison struct page for ptlock Hugh Dickins
@ 2005-11-10  2:10   ` Andrew Morton
  2005-11-10  2:22     ` Hugh Dickins
  2005-11-15 18:49     ` Andrew Morton
  0 siblings, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2005-11-10  2:10 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, Linus Torvalds

Hugh Dickins <hugh@veritas.com> wrote:
>
>  The split ptlock patch enlarged the default SMP PREEMPT struct page from
>  32 to 36 bytes on most 32-bit platforms, from 32 to 44 bytes on PA-RISC
>  7xxx (without PREEMPT).  That was not my intention, and I don't believe
>  that split ptlock deserves any such slice of the user's memory.
> 
>  Could we please try this patch, or something like it?  Again to overlay
>  the spinlock_t from &page->private onwards, with corrected BUILD_BUG_ON
>  that we don't go too far; with poisoning of the fields overlaid, and
>  unsplit SMP config verifying that the split config is safe to use them.
> 
>  The previous attempt at this patch broke ppc64, which uses slab for its
>  page tables - and slab saves vital info in page->lru: but no config of
>  spinlock_t needs to overwrite lru on 64-bit anyway; and the only 32-bit
>  user of slab for page tables is arm26, which is never SMP i.e. all the
>  problems came from the "safety" checks, not from what's actually needed.
>  So previous checks refined with #ifdefs, and a BUG_ON(PageSlab) added.
> 
>  This overlaying is unlikely to be portable forever: but the added checks
>  should warn developers when it's going to break, long before any users.

argh.

Really, I'd prefer to abandon gcc-2.95.x rather than this.  Look:

struct a
{
	union {
		struct {
			int b;
			int c;
		};
		int d;
	};
} z;

main()
{
	z.b = 1;
	z.d = 1;
}

It does everything we want.

Of course, it would be nice to retain 2.95.x support.  The reviled
page_private(() would help us do that.  But the now-to-be-reviled
page_mapping() does extraneous stuff, and we'd need a ton of page_lru()'s.

So it'd be a big patch, converting page->lru to page->u.s.lru in lots of
places.

But I think either a big patch or 2.95.x abandonment is preferable to this
approach.


* Re: [PATCH 10/15] mm: atomic64 page counts
  2005-11-10  1:57 ` [PATCH 10/15] mm: atomic64 page counts Hugh Dickins
@ 2005-11-10  2:16   ` Andrew Morton
  2005-11-10  2:33     ` Hugh Dickins
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2005-11-10  2:16 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: nickpiggin, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> Page count and page mapcount might overflow their 31 bits on 64-bit
>  architectures, especially now we're refcounting the ZERO_PAGE.  We could
>  quite easily avoid counting it, but shared file pages may also overflow.
> 
>  Prefer not to enlarge struct page: don't assign separate atomic64_ts to
>  count and mapcount, instead keep them both in one atomic64_t - the count
>  in the low 23 bits and the mapcount in the high 41 bits.  But of course
>  that can only work if we don't duplicate mapcount in count in this case.
> 
>  The low 23 bits can accommodate 0x7fffff, that's 2 * PID_MAX_LIMIT - 1,
>  which seems adequate for tasks with a transient hold on pages; and the
>  high 41 bits would use 16TB of page table space to back max mapcount.

hm.   I thought we were going to address this by

a) checking for an insane number of mappings of the same page in the
   pagefault handler(s) and 

b) special-casing ZERO_PAGEs on the page unallocator path.

That'll generate faster and smaller code than adding an extra shift and
possibly larger constants in lots of places.

It's cleaner, too.


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10  2:10   ` Andrew Morton
@ 2005-11-10  2:22     ` Hugh Dickins
  2005-11-10  2:56       ` Andrew Morton
  2005-11-15 18:49     ` Andrew Morton
  1 sibling, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  2:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Linus Torvalds

On Wed, 9 Nov 2005, Andrew Morton wrote:
> 
> It does everything we want.

I don't think so: the union leaves us just as vulnerable to some
subsystem using fields of the other half of the union, doesn't it?

Which is not really a problem, but is a part of what's worrying you.

> Of course, it would be nice to retain 2.95.x support.  The reviled
> page_private(() would help us do that.  But the now-to-be-reviled
> page_mapping() does extraneous stuff, and we'd need a ton of page_lru()'s.
> 
> So it'd be a big patch, converting page->lru to page->u.s.lru in lots of
> places.
> 
> But I think either a big patch or 2.95.x abandonment is preferable to this
> approach.

Hmm, that's a pity.

Hugh


* Re: [PATCH 10/15] mm: atomic64 page counts
  2005-11-10  2:16   ` Andrew Morton
@ 2005-11-10  2:33     ` Hugh Dickins
  2005-11-10  3:01       ` Andrew Morton
  0 siblings, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  2:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, linux-kernel

On Wed, 9 Nov 2005, Andrew Morton wrote:
> 
> hm.   I thought we were going to address this by
> 
> a) checking for an insane number of mappings of the same page in the
>    pagefault handler(s) and 
> 
> b) special-casing ZERO_PAGEs on the page unallocator path.
> 
> That'll generate faster and smaller code than adding an extra shift and
> possibly larger constants in lots of places.

I think we all had different ideas of how to deal with it.

When I outlined this method to you, you remarked "Hmm" or something
like that, which I agree didn't indicate great enthusiasm!

I'm quite pleased with the way it's worked out, but you were intending
that the 64-bit arches should get along with 32-bit counts?  Maybe.

Hugh


* Re: [PATCH 06/15] mm: remove ppc highpte
  2005-11-10  1:55   ` Paul Mackerras
@ 2005-11-10  2:46     ` Hugh Dickins
  0 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10  2:46 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Andrew Morton, Ben Herrenschmidt, linux-kernel

On Thu, 10 Nov 2005, Paul Mackerras wrote:
> Hugh Dickins writes:
> 
> > ppc's HIGHPTE config option was removed in 2.5.28, and nobody seems to
> > have wanted it enough to restore it: so remove its traces from pgtable.h
> > and pte_alloc_one.  Or supply an alternative patch to config it back?
> 
> I'm staggered.  We do want to be able to have pte pages in highmem.
> I would rather just have it always enabled if CONFIG_HIGHMEM=y, rather
> than putting the config option back.  I think that should just involve
> adding __GFP_HIGHMEM to the flags for alloc_pages in pte_alloc_one
> unconditionally, no?

I'm glad the patch has struck a spark!  Yes, there doesn't appear to
be anything else that you'd need to do.  Except, why was it disabled
in the first place?  I think you'll need to reassure yourselves that
whatever the reason was has gone away, before reenabling.

Andrew, please drop 06/15, I don't think anything after depends on it
(but it came about when I was looking at uninlining pte_offset_map_lock).

Hugh


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10  2:22     ` Hugh Dickins
@ 2005-11-10  2:56       ` Andrew Morton
  2005-11-10  2:58         ` Andrew Morton
  2005-11-10 12:06         ` Ingo Molnar
  0 siblings, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2005-11-10  2:56 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, torvalds, Ingo Molnar

Hugh Dickins <hugh@veritas.com> wrote:
>
> On Wed, 9 Nov 2005, Andrew Morton wrote:
> > 
> > It does everything we want.
> 
> I don't think so: the union leaves us just as vulnerable to some
> subsystem using fields of the other half of the union, doesn't it?

Yes, spose so.

> Which is not really a problem, but is a part of what's worrying you.

yes, with either approach it remains the case that someone could write some
code which works just fine on their setup but explodes horridly for
someone else who has a different config/arch which has a larger spinlock_t.

> > Of course, it would be nice to retain 2.95.x support.  The reviled
> > page_private() would help us do that.  But the now-to-be-reviled
> > page_mapping() does extraneous stuff, and we'd need a ton of page_lru()'s.
> > 
> > So it'd be a big patch, converting page->lru to page->u.s.lru in lots of
> > places.
> > 
> > But I think either a big patch or 2.95.x abandonment is preferable to this
> > approach.
> 
> Hmm, that's a pity.

Well plan B is to kill spinlock_t.break_lock.  That fixes everything and
has obvious beneficial side-effects.

a) x86 spinlock_t is but one byte.  Can we stuff break_lock into the
   same word?

   (Yes, there's also a >128 CPUs spinlock overflow problem to solve,
   but perhaps we can use lock;addw?)

b) Redesign the code somehow.  Currently break_lock means "there's
   someone waiting for this lock".

   But if we were to leave the lock in a decremented state while
   spinning (as we've always done), that info is still present via the
   value of spinlock_t.slock.   Hence: bye-bye break_lock.

c) Make the break_lock stuff a new config option,
   CONFIG_SUPER_LOW_LATENCY_BLOATS_STRUCT_PAGE.

d) Revert it wholesale, have sucky SMP worst-case latency ;)


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10  2:56       ` Andrew Morton
@ 2005-11-10  2:58         ` Andrew Morton
  2005-11-10 11:28           ` Ingo Molnar
  2005-11-10 12:06         ` Ingo Molnar
  1 sibling, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2005-11-10  2:58 UTC (permalink / raw)
  To: hugh, linux-kernel, torvalds, mingo

Andrew Morton <akpm@osdl.org> wrote:
>
>  Well plan B is to kill spinlock_t.break_lock.

And plan C is to use a bit_spinlock() against page->flags.


* Re: [PATCH 10/15] mm: atomic64 page counts
  2005-11-10  2:33     ` Hugh Dickins
@ 2005-11-10  3:01       ` Andrew Morton
  2005-11-10 21:43         ` Christoph Lameter
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2005-11-10  3:01 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: nickpiggin, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> ...
>
> I'm quite pleased with the way it's worked out, but you were intending
> that the 64-bit arches should get along with 32-bit counts?  Maybe.

That seems reasonable for file pages.  For the ZERO_PAGE the count can do
whatever it wants, because we'd never free them up.



* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10  2:58         ` Andrew Morton
@ 2005-11-10 11:28           ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2005-11-10 11:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hugh, linux-kernel, torvalds


* Andrew Morton <akpm@osdl.org> wrote:

> Andrew Morton <akpm@osdl.org> wrote:
> >
> >  Well plan B is to kill spinlock_t.break_lock.
> 
> And plan C is to use a bit_spinlock() against page->flags.

please dont. Bit-spinlock is a PITA because it is non-debuggable and 
non-preempt-friendly. (In fact in -rt i have completely eliminated 
bit_spinlock(), because it's also such a PITA to type-encapsulate it 
sanely.)

	Ingo


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10  2:56       ` Andrew Morton
  2005-11-10  2:58         ` Andrew Morton
@ 2005-11-10 12:06         ` Ingo Molnar
  2005-11-10 12:26           ` Andrew Morton
  2005-11-10 12:35           ` Hugh Dickins
  1 sibling, 2 replies; 52+ messages in thread
From: Ingo Molnar @ 2005-11-10 12:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, torvalds


* Andrew Morton <akpm@osdl.org> wrote:

> > > But I think either a big patch or 2.95.x abandonment is preferable to this
> > > approach.
> > 
> > Hmm, that's a pity.
> 
> Well plan B is to kill spinlock_t.break_lock.  That fixes everything 
> and has obvious beneficial side-effects.

i'd really, really love to have this solved without restricting the type 
flexibility of spinlocks.

do we really need 2.95.x support? gcc now produces smaller code with -Os.

> a) x86 spinlock_t is but one byte.  Can we stuff break_lock into the
>    same word?
> 
>    (Yes, there's also a >128 CPUs spinlock overflow problem to solve, 
>    but perhaps we can use lock;addw?)

the >128 CPUs spinlock overflow problem is solved by going to 32-bit ops 
(patch has been posted to lkml recently). 16-bit ops are out of 
question. While byte ops are frequently optimized for (avoiding partial 
register access related stalls), the same is not true for 16-bit 
prefixed instructions! Mixing 32-bit with 16-bit code is going to suck 
on a good number of x86 CPUs. It also bloats the instruction size of 
spinlocks, due to the 16-bit prefix. (while byte access has its own 
opcode)

also, there are arches where the only atomic op available is a 32-bit 
one. So trying to squeeze the break_lock flag into the spinlock field is 
i think unrobust.

> b) Redesign the code somehow.  Currently break_lock means "there's
>    someone waiting for this lock".
> 
>    But if we were to leave the lock in a decremented state while
>    spinning (as we've always done), that info is still present via the
>    value of spinlock_t.slock.   Hence: bye-bye break_lock.

this wont work on arches that dont have atomic-decrement based 
spinlocks. (some arches dont even have such an instruction) This means 
those arches would have to implement a "I'm spinning" flag in the word, 
which might or might not work (if the arch doesnt have an atomic 
instruction that works on the owner bit only then it becomes impossible) 
- but in any case it would need very fragile per-arch assembly work to 
pull off.

> c) Make the break_lock stuff a new config option,
>    CONFIG_SUPER_LOW_LATENCY_BLOATS_STRUCT_PAGE.
> 
> d) Revert it wholesale, have sucky SMP worst-case latency ;)

yuck. What is the real problem btw? AFAICS there's enough space for a 
2-word spinlock in struct page for pagetables. We really dont want to 
rewrite spinlocks (or remove features) just to keep gcc 2.95 supported 
for some more time. In fact, is there any 2.6 based distro that uses gcc 
2.95?

	Ingo


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 12:06         ` Ingo Molnar
@ 2005-11-10 12:26           ` Andrew Morton
  2005-11-10 21:37             ` Christoph Lameter
  2005-11-12 23:48             ` Adrian Bunk
  2005-11-10 12:35           ` Hugh Dickins
  1 sibling, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2005-11-10 12:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: hugh, linux-kernel, torvalds

Ingo Molnar <mingo@elte.hu> wrote:
>
> yuck. What is the real problem btw?

Well.  One problem is that spinlocks now take two words...

But apart from that, the problem at hand is that we want to embed a
spinlock in struct page, and the size of the spinlock varies a lot according
to config.  The only >wordsize version we really care about is
CONFIG_PREEMPT, NR_CPUS >= 4.  (which distros don't ship...)

> AFAICS there's enough space for a 
>  2-word spinlock in struct page for pagetables.

spinlocks get a lot bigger than that with CONFIG_DEBUG_SPINLOCK.

> We really dont want to 
>  rewrite spinlocks (or remove features) just to keep gcc 2.95 supported 
>  for some more time. In fact, is there any 2.6 based distro that uses gcc 
>  2.95?

I think some of the debian derivates might.  But who knows?


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 12:06         ` Ingo Molnar
  2005-11-10 12:26           ` Andrew Morton
@ 2005-11-10 12:35           ` Hugh Dickins
  2005-11-10 12:51             ` Andrew Morton
  1 sibling, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10 12:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, torvalds

On Thu, 10 Nov 2005, Ingo Molnar wrote:
> 
> yuck. What is the real problem btw? AFAICS there's enough space for a 
> 2-word spinlock in struct page for pagetables.

Yes.  There is no real problem.  But my patch offends good taste.

Hugh


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 12:35           ` Hugh Dickins
@ 2005-11-10 12:51             ` Andrew Morton
  2005-11-10 13:29               ` Hugh Dickins
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2005-11-10 12:51 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: mingo, linux-kernel, torvalds

Hugh Dickins <hugh@veritas.com> wrote:
>
> On Thu, 10 Nov 2005, Ingo Molnar wrote:
> > 
> > yuck. What is the real problem btw? AFAICS there's enough space for a 
> > 2-word spinlock in struct page for pagetables.
> 
> Yes.  There is no real problem.  But my patch offends good taste.
> 

Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 12:51             ` Andrew Morton
@ 2005-11-10 13:29               ` Hugh Dickins
  2005-11-10 15:00                 ` Ingo Molnar
  2005-11-10 19:49                 ` Andrew Morton
  0 siblings, 2 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10 13:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, linux-kernel, torvalds

On Thu, 10 Nov 2005, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> > On Thu, 10 Nov 2005, Ingo Molnar wrote:
> > > 
> > > yuck. What is the real problem btw? AFAICS there's enough space for a 
> > > 2-word spinlock in struct page for pagetables.
> > 
> > Yes.  There is no real problem.  But my patch offends good taste.
> 
> Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?

No.  There is just one case where it would,
so in that case split ptlock is disabled by mm/Kconfig's
# PA-RISC 7xxx's debug spinlock_t is too large for 32-bit struct page.

	default "4096" if PARISC && DEBUG_SPINLOCK && !PA20

Of course, someone may extend spinlock debugging info tomorrow; but
when they do, presumably they'll try it out, and hit the BUILD_BUG_ON.
They'll then probably want to extend the suppression in mm/Kconfig.

Hugh


* Re: [PATCH 09/15] mm: fill arch atomic64 gaps
  2005-11-10  1:56 ` [PATCH 09/15] mm: fill arch atomic64 gaps Hugh Dickins
@ 2005-11-10 13:38   ` Andi Kleen
  2005-11-10 15:19     ` Hugh Dickins
  0 siblings, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2005-11-10 13:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Richard Henderson, Martin Schwidefsky,
	David S. Miller, linux-kernel

On Thursday 10 November 2005 02:56, Hugh Dickins wrote:
> alpha, s390, sparc64, x86_64 are each missing some primitives from their
> atomic64 support: fill in the gaps I've noticed by extrapolating asm,
> follow the groupings in each file, and say "int" for the booleans rather
> than long or long long.  But powerpc and parisc still have no atomic64.

x86-64 part looks ok thanks. I assume you will take care of the merge
or do you want me to take that hunk?

-Andi


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 13:29               ` Hugh Dickins
@ 2005-11-10 15:00                 ` Ingo Molnar
  2005-11-10 15:38                   ` Hugh Dickins
  2005-11-10 19:49                 ` Andrew Morton
  1 sibling, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2005-11-10 15:00 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, torvalds


* Hugh Dickins <hugh@veritas.com> wrote:

> On Thu, 10 Nov 2005, Andrew Morton wrote:
> > Hugh Dickins <hugh@veritas.com> wrote:
> > > On Thu, 10 Nov 2005, Ingo Molnar wrote:
> > > > 
> > > > yuck. What is the real problem btw? AFAICS there's enough space for a 
> > > > 2-word spinlock in struct page for pagetables.
> > > 
> > > Yes.  There is no real problem.  But my patch offends good taste.
> > 
> > Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?
> 
> No.  There is just one case where it would,
> so in that case split ptlock is disabled by mm/Kconfig's
> # PA-RISC 7xxx's debug spinlock_t is too large for 32-bit struct page.
> 
> 	default "4096" if PARISC && DEBUG_SPINLOCK && !PA20
> 
> Of course, someone may extend spinlock debugging info tomorrow; but 
> when they do, presumably they'll try it out, and hit the BUILD_BUG_ON. 
> They'll then probably want to extend the suppression in mm/Kconfig.

why not do the union thing so that struct page grows automatically as 
new fields are added? It is quite bad design to introduce a hard limit 
like that. The only sizing concern is to make sure that the common 
.configs dont increase the size of struct page, but otherwise why not 
allow a larger struct page - it's for debugging only.

	Ingo


* Re: [PATCH 09/15] mm: fill arch atomic64 gaps
  2005-11-10 13:38   ` Andi Kleen
@ 2005-11-10 15:19     ` Hugh Dickins
  0 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10 15:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Richard Henderson, Martin Schwidefsky,
	David S. Miller, linux-kernel

On Thu, 10 Nov 2005, Andi Kleen wrote:
> On Thursday 10 November 2005 02:56, Hugh Dickins wrote:
> > alpha, s390, sparc64, x86_64 are each missing some primitives from their
> > atomic64 support: fill in the gaps I've noticed by extrapolating asm,
> > follow the groupings in each file, and say "int" for the booleans rather
> > than long or long long.  But powerpc and parisc still have no atomic64.
> 
> x86-64 part looks ok thanks. I assume you will take care of the merge
> or do you want me to take that hunk?

Thanks for checking, Andi.  Don't worry about it for now.  That patch
series is currently sitting somewhere well away from Andrew's breakfast.
I'll wait and see where it goes.

Hugh


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 15:00                 ` Ingo Molnar
@ 2005-11-10 15:38                   ` Hugh Dickins
  0 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-10 15:38 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, torvalds

On Thu, 10 Nov 2005, Ingo Molnar wrote:
> * Hugh Dickins <hugh@veritas.com> wrote:
> > Of course, someone may extend spinlock debugging info tomorrow; but 
> > when they do, presumably they'll try it out, and hit the BUILD_BUG_ON. 
> > They'll then probably want to extend the suppression in mm/Kconfig.
> 
> why not do the union thing so that struct page grows automatically as 
> new fields are added? It is quite bad design to introduce a hard limit 
> like that. The only sizing concern is to make sure that the common 
> .configs dont increase the size of struct page, but otherwise why not 
> allow a larger struct page - it's for debugging only.

Yes, we wouldn't be worrying much about DEBUG_SPINLOCK enlarging struct
page (unnecessary as that currently is): it's the PREEMPT case adding
break_lock that's of concern (and only on 32-bit, I think: on all the
64-bits we'd have two unsigned ints in unsigned long private).

Going the union way doesn't give any more guarantee that another level
isn't using the fields, than the overlay way I did.  Going the union
way, without enlarging the common struct page, seems to involve either
lots of abstraction edits all over the tree (break_lock coinciding
with page->mapping on 32-bit), or ruling out gcc 2.95.  Either seems to
me like a lose with no real win.

I'm certainly not arguing against break_lock itself: it's one of the
reasons why I went for a proper spinlock_t there; and agree with you
that trying to squeeze it in with the lock would likely go deep into
architectural complications.

Hugh


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 13:29               ` Hugh Dickins
  2005-11-10 15:00                 ` Ingo Molnar
@ 2005-11-10 19:49                 ` Andrew Morton
  2005-11-10 19:56                   ` Linus Torvalds
  2005-11-11 15:02                   ` Hugh Dickins
  1 sibling, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2005-11-10 19:49 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: mingo, linux-kernel, torvalds

Hugh Dickins <hugh@veritas.com> wrote:
>
>  On Thu, 10 Nov 2005, Andrew Morton wrote:
>  > Hugh Dickins <hugh@veritas.com> wrote:
>  > > On Thu, 10 Nov 2005, Ingo Molnar wrote:
>  > > > 
>  > > > yuck. What is the real problem btw? AFAICS there's enough space for a 
>  > > > 2-word spinlock in struct page for pagetables.
>  > > 
>  > > Yes.  There is no real problem.  But my patch offends good taste.
>  > 
>  > Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?
> 
>  No.

!no, methinks.

On 32-bit architectures with CONFIG_PREEMPT, CONFIG_DEBUG_SPINLOCK,
CONFIG_NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS we have a 32-byte page, a
20-byte spinlock and offsetof(page, private) == 12.

IOW we're assuming that no 32-bit architectures will obtain pagetables from
slab?


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 19:49                 ` Andrew Morton
@ 2005-11-10 19:56                   ` Linus Torvalds
  2005-11-11  0:10                     ` Russell King
  2005-11-12  6:27                     ` Benjamin Herrenschmidt
  2005-11-11 15:02                   ` Hugh Dickins
  1 sibling, 2 replies; 52+ messages in thread
From: Linus Torvalds @ 2005-11-10 19:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, mingo, linux-kernel



On Thu, 10 Nov 2005, Andrew Morton wrote:
> 
> IOW we're assuming that no 32-bit architectures will obtain pagetables from
> slab?

I thought ARM does?

The ARM page tables are something strange (I think they have 1024-byte 
page tables and 4kB pages or something like that?). So they'll not only 
obtain the page tables from slab, they have four of them per page.

		Linus


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 12:26           ` Andrew Morton
@ 2005-11-10 21:37             ` Christoph Lameter
  2005-11-10 21:52               ` Christoph Hellwig
  2005-11-12 23:48             ` Adrian Bunk
  1 sibling, 1 reply; 52+ messages in thread
From: Christoph Lameter @ 2005-11-10 21:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, hugh, linux-kernel, torvalds

On Thu, 10 Nov 2005, Andrew Morton wrote:

> spinlock in struct page, and the size of the spinlock varies a lot according
> to config.  The only >wordsize version we really care about is
> CONFIG_PREEMPT, NR_CPUS >= 4.  (which distros don't ship...)

Suse, Debian and Redhat ship such kernels.



* Re: [PATCH 10/15] mm: atomic64 page counts
  2005-11-10  3:01       ` Andrew Morton
@ 2005-11-10 21:43         ` Christoph Lameter
  2005-11-10 21:53           ` Andrew Morton
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Lameter @ 2005-11-10 21:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, nickpiggin, linux-kernel

On Wed, 9 Nov 2005, Andrew Morton wrote:

> > I'm quite pleased with the way it's worked out, but you were intending
> > that the 64-bit arches should get along with 32-bit counts?  Maybe.
> 
> That seems reasonsable for file pages.  For the ZERO_PAGE the count can do
> whatever it wants, because we'd never free them up.

Frequent increments and decrements on the zero page count can cause a 
bouncing cacheline that may limit performance.



* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 21:37             ` Christoph Lameter
@ 2005-11-10 21:52               ` Christoph Hellwig
  2005-11-11 10:46                 ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2005-11-10 21:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Ingo Molnar, hugh, linux-kernel, torvalds

On Thu, Nov 10, 2005 at 01:37:19PM -0800, Christoph Lameter wrote:
> On Thu, 10 Nov 2005, Andrew Morton wrote:
> 
> > spinlock in struct page, and the size of the spinlock varies a lot according
> > to config.  The only >wordsize version we really care about is
> > CONFIG_PREEMPT, NR_CPUS >= 4.  (which distros don't ship...)
> 
> Suse, Debian and Redhat ship such kernels.

No.  SuSE and Redhat have always been smart enough to avoid CONFIG_PREEMPT
like the plague, and even Debian finally noticed this a few months ago.



* Re: [PATCH 10/15] mm: atomic64 page counts
  2005-11-10 21:43         ` Christoph Lameter
@ 2005-11-10 21:53           ` Andrew Morton
  2005-11-11 15:25             ` Hugh Dickins
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2005-11-10 21:53 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: hugh, nickpiggin, linux-kernel

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> On Wed, 9 Nov 2005, Andrew Morton wrote:
> 
> > > I'm quite pleased with the way it's worked out, but you were intending
> > > that the 64-bit arches should get along with 32-bit counts?  Maybe.
> > 
> > That seems reasonsable for file pages.  For the ZERO_PAGE the count can do
> > whatever it wants, because we'd never free them up.
> 
> Frequent increments and decrements on the zero page count can cause a 
> bouncing cacheline that may limit performance.

I think Hugh did some instrumentation on that and decided that problems
were unlikely?


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 19:56                   ` Linus Torvalds
@ 2005-11-11  0:10                     ` Russell King
  2005-11-12  6:27                     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 52+ messages in thread
From: Russell King @ 2005-11-11  0:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Hugh Dickins, mingo, linux-kernel

On Thu, Nov 10, 2005 at 11:56:52AM -0800, Linus Torvalds wrote:
> On Thu, 10 Nov 2005, Andrew Morton wrote:
> > 
> > IOW we're assuming that no 32-bit architectures will obtain pagetables from
> > slab?
> 
> I thought ARM does?

ARM26 does.  On ARM, we play some games to weld two page tables together.
(We also duplicate them to give us a level of independence from the MMU
architecture and to give us space for things like young and dirty bits.)

As far as Linux is concerned, a 2nd level page table is one struct page
and L1 entries are two words in size.

I wrote up some docs on this and threw them into a comment in
include/asm-arm/pgtable.h...

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 21:52               ` Christoph Hellwig
@ 2005-11-11 10:46                 ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2005-11-11 10:46 UTC (permalink / raw)
  To: Christoph Hellwig, Christoph Lameter, Andrew Morton, hugh,
	linux-kernel, torvalds


* Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Nov 10, 2005 at 01:37:19PM -0800, Christoph Lameter wrote:
> > On Thu, 10 Nov 2005, Andrew Morton wrote:
> > 
> > > spinlock in struct page, and the size of the spinlock varies a lot according
> > > to config.  The only >wordsize version we really care about is
> > > CONFIG_PREEMPT, NR_CPUS >= 4.  (which distros don't ship...)
> > 
> > Suse, Debian and Redhat ship such kernels.
> 
> No.  SuSE and Redhat have always been smart enough to avoid 
> CONFIG_PREEMPT like the plague, and even Debian finally noticed this a 
> few month ago.

while you are right about CONFIG_PREEMPT, there is some movement in this 
area: CONFIG_PREEMPT_VOLUNTARY is now turned on in Fedora kernels, and 
PREEMPT_BKL is turned on for SMP kernels.

	Ingo


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 19:49                 ` Andrew Morton
  2005-11-10 19:56                   ` Linus Torvalds
@ 2005-11-11 15:02                   ` Hugh Dickins
  1 sibling, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2005-11-11 15:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, linux-kernel, torvalds

On Thu, 10 Nov 2005, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> >  On Thu, 10 Nov 2005, Andrew Morton wrote:
> >  > Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?
> > 
> >  No.
> 
> !no, methinks.
> 
> On 32-bit architectures with CONFIG_PREEMPT, CONFIG_DEBUG_SPINLOCK,
> CONFIG_NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS we have a 32-byte page, a
> 20-byte spinlock and offsetof(page, private) == 12.
> 
> IOW we're assuming that no 32-bit architectures will obtain pagetables from
> slab?

Sorry, I misunderstood "overrun".  Yes, you're right, the 32-bit
DEBUG_SPINLOCK spinlock_t runs into page.lru but does not run beyond it.
Here's what I already said about that in the comment on the patch:

The previous attempt at this patch broke ppc64, which uses slab for its
page tables - and slab saves vital info in page->lru: but no config of
spinlock_t needs to overwrite lru on 64-bit anyway; and the only 32-bit
user of slab for page tables is arm26, which is never SMP i.e. all the
problems came from the "safety" checks, not from what's actually needed.
So previous checks refined with #ifdefs, and a BUG_ON(PageSlab) added.

Hugh


* Re: [PATCH 10/15] mm: atomic64 page counts
  2005-11-10 21:53           ` Andrew Morton
@ 2005-11-11 15:25             ` Hugh Dickins
  2005-11-11 18:03               ` Christoph Lameter
  0 siblings, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2005-11-11 15:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, nickpiggin, linux-kernel

On Thu, 10 Nov 2005, Andrew Morton wrote:
> Christoph Lameter <clameter@engr.sgi.com> wrote:
> > 
> > Frequent increments and decrements on the zero page count can cause a 
> > bouncing cacheline that may limit performance.
> 
> I think Hugh did some instrumentation on that and decided that problems
> were unlikely?

"Instrumentation" is rather too dignified a word for the counting I added
at one point, to see what proportion of faults were using the zero page,
before running some poor variety of workloads.

It came out, rather as I expected, as not so many as to cause immediate
concern, not so few that the issue could definitely be dismissed.  So I
persuaded Nick to separate out the zero page refcounting as a separate
patch, and when it came to submission he was happy enough with his patch
without it, that he didn't feel much like adding it at that stage.

I did try the two SGI tests I'd seen (memscale and pft, I think latter
my abbreviation for something with longer name that Christoph pointed
us to, page-fault-tsomething), and they both showed negligible use of
the zero page (i.e. the program startup did a few such faults, nothing
to compare with all the work that the actual testing then did).

I've nothing against zero page refcounting avoidance, just wanted
numbers to show it's more worth doing than not doing.  And I'm not
the only one to have wondered, if it is an issue, wouldn't big NUMA
benefit more from per-node zero pages anyway?  (Though of course
the pages themselves should stay clean, so won't be bouncing.)

Hugh


* Re: [PATCH 10/15] mm: atomic64 page counts
  2005-11-11 15:25             ` Hugh Dickins
@ 2005-11-11 18:03               ` Christoph Lameter
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2005-11-11 18:03 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, nickpiggin, linux-kernel

On Fri, 11 Nov 2005, Hugh Dickins wrote:

> I've nothing against zero page refcounting avoidance, just wanted
> numbers to show it's more worth doing than not doing.  And I'm not
> the only one to have wondered, if it is an issue, wouldn't big NUMA
> benefit more from per-node zero pages anyway?  (Though of course
> the pages themselves should stay clean, so won't be bouncing.)

Per-node zero pages sound good. The page structs are placed on the 
respective nodes and therefore would not bounce.


* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 19:56                   ` Linus Torvalds
  2005-11-11  0:10                     ` Russell King
@ 2005-11-12  6:27                     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2005-11-12  6:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Hugh Dickins, mingo, linux-kernel

On Thu, 2005-11-10 at 11:56 -0800, Linus Torvalds wrote:
> 
> On Thu, 10 Nov 2005, Andrew Morton wrote:
> > 
> > IOW we're assuming that no 32-bit architectures will obtain pagetables from
> > slab?
> 
> I thought ARM does?
> 
> The ARM page tables are something strange (I think they have 1024-byte 
> page tables and 4kB pages or something like that?). So they'll not only 
> obtain the page tables from slab, they have four of them per page.


Just catching up on old mails... ppc64 also gets page tables from kmem
caches (though they are currently page sized & aligned). We really want
to be able to have more freedom and get away from the current idea that
1 PTE page == one page. Our intermediary levels (PMDs, PUDs, PGDs)
already have various random sizes.

Ben.




* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10 12:26           ` Andrew Morton
  2005-11-10 21:37             ` Christoph Lameter
@ 2005-11-12 23:48             ` Adrian Bunk
  1 sibling, 0 replies; 52+ messages in thread
From: Adrian Bunk @ 2005-11-12 23:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, hugh, linux-kernel, torvalds

On Thu, Nov 10, 2005 at 04:26:13AM -0800, Andrew Morton wrote:
> Ingo Molnar <mingo@elte.hu> wrote:
>...
> > We really dont want to 
> >  rewrite spinlocks (or remove features) just to keep gcc 2.95 supported 
> >  for some more time. In fact, is there any 2.6 based distro that uses gcc 
> >  2.95?
> 
> I think some of the debian derivates might.  But who knows?

Debian 3.1 still ships gcc 2.95, but the default compiler is gcc 3.3.

Even Debian 3.0 (released in July 2002, and not supporting kernel 2.6) 
ships gcc 3.0.4 (although it isn't the default compiler on !hppa).

Considering that kernel 2.6.0 was released two and a half years _after_ 
gcc 3.0, I heavily doubt that any distribution both supports kernel 2.6 
and ships no gcc >= 3.0.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-10  2:10   ` Andrew Morton
  2005-11-10  2:22     ` Hugh Dickins
@ 2005-11-15 18:49     ` Andrew Morton
  2005-11-15 19:51       ` Hugh Dickins
  1 sibling, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2005-11-15 18:49 UTC (permalink / raw)
  To: hugh, linux-kernel, torvalds

Andrew Morton <akpm@osdl.org> wrote:
>
> Hugh Dickins <hugh@veritas.com> wrote:
> >
> >  The split ptlock patch enlarged the default SMP PREEMPT struct page from
> >  32 to 36 bytes on most 32-bit platforms, from 32 to 44 bytes on PA-RISC
> >  7xxx (without PREEMPT).  That was not my intention, and I don't believe
> >  that split ptlock deserves any such slice of the user's memory.
> > 
> >  Could we please try this patch, or something like it?  Again to overlay
> >  the spinlock_t from &page->private onwards, with corrected BUILD_BUG_ON
> >  that we don't go too far; with poisoning of the fields overlaid, and
> >  unsplit SMP config verifying that the split config is safe to use them.
> > 
> >  The previous attempt at this patch broke ppc64, which uses slab for its
> >  page tables - and slab saves vital info in page->lru: but no config of
> >  spinlock_t needs to overwrite lru on 64-bit anyway; and the only 32-bit
> >  user of slab for page tables is arm26, which is never SMP i.e. all the
> >  problems came from the "safety" checks, not from what's actually needed.
> >  So previous checks refined with #ifdefs, and a BUG_ON(PageSlab) added.
> > 
> >  This overlaying is unlikely to be portable forever: but the added checks
> >  should warn developers when it's going to break, long before any users.
> 
> argh.
> 
> Really, I'd prefer to abandon gcc-2.95.x rather than this.  Look:
> 
> struct a
> {
> 	union {
> 		struct {
> 			int b;
> 			int c;
> 		};
> 		int d;
> 	};
> } z;
> 
> main()
> {
> 	z.b = 1;
> 	z.d = 1;
> }
> 
> It does everything we want.

It occurs to me that we can do the above if (__GNUC__ > 2), or whatever.

That way, the only people who have a 4-byte-larger pageframe are those who
use CONFIG_PREEMPT, NR_CPUS>=4 and gcc-2.x.y.  An acceptably small
community, I suspect.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-15 18:49     ` Andrew Morton
@ 2005-11-15 19:51       ` Hugh Dickins
  2005-11-15 20:05         ` Andrew Morton
  0 siblings, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2005-11-15 19:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds

On Tue, 15 Nov 2005, Andrew Morton wrote:
> 
> It occurs to me that we can do the above if (__GNUC__ > 2), or whatever.
> 
> That way, the only people who have a 4-byte-larger pageframe are those who
> use CONFIG_PREEMPT, NR_CPUS>=4 and gcc-2.x.y.  An acceptably small
> community, I suspect.

I can't really think about this at the moment (though the PageReserved
fixups are going more smoothly this evening).  Acceptably small community,
yes.  But wouldn't it plunge us into the very mess of wrappers we were
trying to avoid with anonymous structs/unions, to handle the __GNUC__
differences?

Hugh

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 01/15] mm: poison struct page for ptlock
  2005-11-15 19:51       ` Hugh Dickins
@ 2005-11-15 20:05         ` Andrew Morton
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2005-11-15 20:05 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, torvalds

Hugh Dickins <hugh@veritas.com> wrote:
>
> On Tue, 15 Nov 2005, Andrew Morton wrote:
> > 
> > It occurs to me that we can do the above if (__GNUC__ > 2), or whatever.
> > 
> > That way, the only people who have a 4-byte-larger pageframe are those who
> > use CONFIG_PREEMPT, NR_CPUS>=4 and gcc-2.x.y.  An acceptably small
> > community, I suspect.
> 
> I can't really think about this at the moment (though the PageReserved
> fixups are going more smoothly this evening).  Acceptably small community,
> yes.  But wouldn't it plunge us into the very mess of wrappers we were
> trying to avoid with anonymous structs/unions, to handle the __GNUC__
> differences?

Nope, all the changes would be constrained to the definition of struct
page, and struct page is special.

struct page {
	...
#if __GNUC__ > 2
	union {
		spinlock_t ptl;
		struct {
			unsigned long private;
			struct address_space *mapping;
		};
	};
#else
	union {
		unsigned long private;
		spinlock_t ptl;
	} u;
	struct address_space *mapping;
#endif
};


and

#if __GNUC__ > 2
#define page_private(page)		((page)->private)
#define set_page_private(page, v)	((page)->private = (v))
#else
#define page_private(page)		((page)->u.private)
#define set_page_private(page, v)	((page)->u.private = (v))
#endif

Of course, adding "u." and "u.s." all over the place would be a sane
solution, but we can do that later - I'm sure we'll be changing struct page
again.

View the above as "a space optimisation made possible by gcc-3.x".

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2005-11-15 20:05 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-11-10  1:42 [PATCH 00/15] mm: struct page lock and counts Hugh Dickins
2005-11-10  1:43 ` [PATCH 01/15] mm: poison struct page for ptlock Hugh Dickins
2005-11-10  2:10   ` Andrew Morton
2005-11-10  2:22     ` Hugh Dickins
2005-11-10  2:56       ` Andrew Morton
2005-11-10  2:58         ` Andrew Morton
2005-11-10 11:28           ` Ingo Molnar
2005-11-10 12:06         ` Ingo Molnar
2005-11-10 12:26           ` Andrew Morton
2005-11-10 21:37             ` Christoph Lameter
2005-11-10 21:52               ` Christoph Hellwig
2005-11-11 10:46                 ` Ingo Molnar
2005-11-12 23:48             ` Adrian Bunk
2005-11-10 12:35           ` Hugh Dickins
2005-11-10 12:51             ` Andrew Morton
2005-11-10 13:29               ` Hugh Dickins
2005-11-10 15:00                 ` Ingo Molnar
2005-11-10 15:38                   ` Hugh Dickins
2005-11-10 19:49                 ` Andrew Morton
2005-11-10 19:56                   ` Linus Torvalds
2005-11-11  0:10                     ` Russell King
2005-11-12  6:27                     ` Benjamin Herrenschmidt
2005-11-11 15:02                   ` Hugh Dickins
2005-11-15 18:49     ` Andrew Morton
2005-11-15 19:51       ` Hugh Dickins
2005-11-15 20:05         ` Andrew Morton
2005-11-10  1:44 ` [PATCH 02/15] mm: revert page_private Hugh Dickins
2005-11-10  1:46 ` [PATCH 03/15] mm reiser4: " Hugh Dickins
2005-11-10  1:47 ` [PATCH 04/15] mm: update split ptlock Kconfig Hugh Dickins
2005-11-10  1:48 ` [PATCH 05/15] mm: unbloat get_futex_key Hugh Dickins
2005-11-10  1:50 ` [PATCH 06/15] mm: remove ppc highpte Hugh Dickins
2005-11-10  1:52   ` Benjamin Herrenschmidt
2005-11-10  1:55   ` Paul Mackerras
2005-11-10  2:46     ` Hugh Dickins
2005-11-10  1:51 ` [PATCH 07/15] mm: powerpc ptlock comments Hugh Dickins
2005-11-10  1:53 ` [PATCH 08/15] mm: powerpc init_mm without ptlock Hugh Dickins
2005-11-10  1:56 ` [PATCH 09/15] mm: fill arch atomic64 gaps Hugh Dickins
2005-11-10 13:38   ` Andi Kleen
2005-11-10 15:19     ` Hugh Dickins
2005-11-10  1:57 ` [PATCH 10/15] mm: atomic64 page counts Hugh Dickins
2005-11-10  2:16   ` Andrew Morton
2005-11-10  2:33     ` Hugh Dickins
2005-11-10  3:01       ` Andrew Morton
2005-11-10 21:43         ` Christoph Lameter
2005-11-10 21:53           ` Andrew Morton
2005-11-11 15:25             ` Hugh Dickins
2005-11-11 18:03               ` Christoph Lameter
2005-11-10  2:00 ` [PATCH 11/15] mm: long " Hugh Dickins
2005-11-10  2:01 ` [PATCH 12/15] mm reiser4: " Hugh Dickins
2005-11-10  2:03 ` [PATCH 13/15] mm: get_user_pages check count Hugh Dickins
2005-11-10  2:08 ` [PATCH 14/15] mm: inc_page_table_pages check max Hugh Dickins
2005-11-10  2:09 ` [PATCH 15/15] mm: remove install_page limit Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox