[PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock
@ 2004-10-03 18:21 Hugh Dickins
  2004-10-03 18:23 ` [PATCH 1/6] vmtrunc: truncate_count not atomic Hugh Dickins
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
	Lorenzo Allegrucci, linux-kernel

Here's a set of patches against 2.6.9-rc3-mm1, or Ingo's -S7 VP tree:
allow lower latency in vmtruncate (more generally, unmap_mapping_range)
by dropping the i_mmap_lock when required.  Though I've been focussing
on correctness rather than latency myself: do these work for you, Ingo?

1/6 undoes something of Andrea's, 2/6 undoes something of Ingo's:
I thought I might as well provide patches as ask the questions, please
let us know why if you disagree with them.  3/6 is cosmetic, 4/6 is
the real work, 5/6 corrects some loose ends, 6/6 is for testing.

Worth the effort, and ~700 bytes bloat, and additional int in each vma?
If only root could run-bash-shared-mappings, I'd say not worth it; but
it seems a little like prio_tree itself, we're not sure we need it,
but feel more secure to have the solution.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 drivers/mtd/devices/blkmtd.c |    1 
 fs/inode.c                   |    1 
 include/linux/fs.h           |    2 
 include/linux/mm.h           |    8 +
 kernel/fork.c                |    1 
 mm/memory.c                  |  223 ++++++++++++++++++++++++++++++++++---------
 mm/mmap.c                    |   14 ++
 mm/mremap.c                  |   16 +--
 mm/rmap.c                    |    9 -
 mm/truncate.c                |    1 
 10 files changed, 215 insertions(+), 61 deletions(-)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/6] vmtrunc: truncate_count not atomic
  2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
@ 2004-10-03 18:23 ` Hugh Dickins
  2004-10-03 18:24 ` [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes Hugh Dickins
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
	Lorenzo Allegrucci, David Woodhouse, linux-kernel

Why is mapping->truncate_count atomic?  It's incremented inside
i_mmap_lock (and i_sem), and the reads don't need it to be atomic.

And why smp_rmb() before call to ->nopage?  The compiler cannot reorder
the initial assignment of sequence after the call to ->nopage, and no
cpu (yet!) can read from the future, which is all that matters there.

And delete totally bogus reset of truncate_count from blkmtd add_device.
truncate_count is all about detecting i_size changes: i_size does not
change there; and if it did, the count should be incremented not reset.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 drivers/mtd/devices/blkmtd.c |    1 -
 fs/inode.c                   |    1 -
 include/linux/fs.h           |    2 +-
 mm/memory.c                  |   12 +++++-------
 4 files changed, 6 insertions(+), 10 deletions(-)

--- 2.6.9-rc3-mm1/drivers/mtd/devices/blkmtd.c	2004-08-14 06:38:20.000000000 +0100
+++ trunc1/drivers/mtd/devices/blkmtd.c	2004-10-03 01:07:01.388014544 +0100
@@ -661,7 +661,6 @@ static struct blkmtd_dev *add_device(cha
 
 	memset(dev, 0, sizeof(struct blkmtd_dev));
 	dev->blkdev = bdev;
-	atomic_set(&(dev->blkdev->bd_inode->i_mapping->truncate_count), 0);
 	if(!readonly) {
 		init_MUTEX(&dev->wrbuf_mutex);
 	}
--- trunc0/fs/inode.c	2004-10-02 18:22:24.857608136 +0100
+++ trunc1/fs/inode.c	2004-10-03 01:07:01.390014240 +0100
@@ -200,7 +200,6 @@ void inode_init_once(struct inode *inode
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
-	atomic_set(&inode->i_data.truncate_count, 0);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
--- trunc0/include/linux/fs.h	2004-10-02 18:22:28.019127512 +0100
+++ trunc1/include/linux/fs.h	2004-10-03 01:07:01.393013784 +0100
@@ -342,7 +342,7 @@ struct address_space {
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
-	atomic_t		truncate_count;	/* Cover race condition with truncate */
+	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
 	struct address_space_operations *a_ops;	/* methods */
--- trunc0/mm/memory.c	2004-10-02 18:22:29.354924440 +0100
+++ trunc1/mm/memory.c	2004-10-03 01:07:01.396013328 +0100
@@ -1259,7 +1259,7 @@ void unmap_mapping_range(struct address_
 
 	spin_lock(&mapping->i_mmap_lock);
 	/* Protect against page fault */
-	atomic_inc(&mapping->truncate_count);
+	mapping->truncate_count++;
 
 	if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
 		unmap_mapping_range_list(&mapping->i_mmap, &details);
@@ -1557,7 +1557,7 @@ do_no_page(struct mm_struct *mm, struct 
 	struct page * new_page;
 	struct address_space *mapping = NULL;
 	pte_t entry;
-	int sequence = 0;
+	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
 
@@ -1569,9 +1569,8 @@ do_no_page(struct mm_struct *mm, struct 
 
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
-		sequence = atomic_read(&mapping->truncate_count);
+		sequence = mapping->truncate_count;
 	}
-	smp_rmb();  /* Prevent CPU from reordering lock-free ->nopage() */
 retry:
 	cond_resched();
 	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
@@ -1615,9 +1614,8 @@ retry:
 	 * invalidated this page.  If unmap_mapping_range got called,
 	 * retry getting the page.
 	 */
-	if (mapping &&
-	      (unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
-		sequence = atomic_read(&mapping->truncate_count);
+	if (mapping && unlikely(sequence != mapping->truncate_count)) {
+		sequence = mapping->truncate_count;
 		spin_unlock(&mm->page_table_lock);
 		page_cache_release(new_page);
 		goto retry;


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes
  2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
  2004-10-03 18:23 ` [PATCH 1/6] vmtrunc: truncate_count not atomic Hugh Dickins
@ 2004-10-03 18:24 ` Hugh Dickins
  2004-10-03 18:36   ` Ingo Molnar
  2004-10-03 18:25 ` [PATCH 3/6] vmtrunc: unmap_mapping_range_tree Hugh Dickins
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
	Lorenzo Allegrucci, linux-kernel

The low-latency unmap_vmas patch silently moved the zap_bytes test
after the TLB finish and lockbreak and regather: why?  That not only
makes zap_bytes redundant (might as well use ZAP_BLOCK_SIZE), it makes
the unmap_vmas level redundant too - it's all about saving TLB flushes
when unmapping a series of small vmas.

Move zap_bytes test back before the lockbreak, and delete the curious
comment that a small zap block size doesn't matter: it's true need_flush
prevents TLB flush when no page has been unmapped, but unmapping pages
in small blocks involves many more TLB flushes than in large blocks.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/memory.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

--- trunc1/mm/memory.c	2004-10-03 01:07:01.396013328 +0100
+++ trunc2/mm/memory.c	2004-10-03 01:07:12.957255752 +0100
@@ -506,10 +506,6 @@ static void unmap_page_range(struct mmu_
 }
 
 #ifdef CONFIG_PREEMPT
-/*
- * It's not an issue to have a small zap block size - TLB flushes
- * only happen once normally, due to the tlb->need_flush optimization.
- */
 # define ZAP_BLOCK_SIZE	(8 * PAGE_SIZE)
 #else
 /* No preempt: go for improved straight-line efficiency */
@@ -588,6 +584,9 @@ int unmap_vmas(struct mmu_gather **tlbp,
 
 			start += block;
 			zap_bytes -= block;
+			if ((long)zap_bytes > 0)
+				continue;
+
 			if (!atomic) {
 				int fullmm = tlb_is_full_mm(*tlbp);
 				tlb_finish_mmu(*tlbp, tlb_start, start);
@@ -595,8 +594,6 @@ int unmap_vmas(struct mmu_gather **tlbp,
 				*tlbp = tlb_gather_mmu(mm, fullmm);
 				tlb_start_valid = 0;
 			}
-			if ((long)zap_bytes > 0)
-				continue;
 			zap_bytes = ZAP_BLOCK_SIZE;
 		}
 	}


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 3/6] vmtrunc: unmap_mapping_range_tree
  2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
  2004-10-03 18:23 ` [PATCH 1/6] vmtrunc: truncate_count not atomic Hugh Dickins
  2004-10-03 18:24 ` [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes Hugh Dickins
@ 2004-10-03 18:25 ` Hugh Dickins
  2004-10-03 18:26 ` [PATCH 4/6] vmtrunc: unmap_mapping dropping i_mmap_lock Hugh Dickins
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
	Lorenzo Allegrucci, linux-kernel

Move unmap_mapping_range's nonlinear vma handling out to its own inline,
parallel to the prio_tree handling; unmap_mapping_range_list is a better
name for the nonlinear list, rename the other unmap_mapping_range_tree.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/memory.c |   42 +++++++++++++++++++++++-------------------
 1 files changed, 23 insertions(+), 19 deletions(-)

--- trunc2/mm/memory.c	2004-10-03 01:07:12.957255752 +0100
+++ trunc3/mm/memory.c	2004-10-03 01:07:24.510499392 +0100
@@ -1189,9 +1189,9 @@ no_new_page:
 }
 
 /*
- * Helper function for unmap_mapping_range().
+ * Helper functions for unmap_mapping_range().
  */
-static inline void unmap_mapping_range_list(struct prio_tree_root *root,
+static inline void unmap_mapping_range_tree(struct prio_tree_root *root,
 					    struct zap_details *details)
 {
 	struct vm_area_struct *vma;
@@ -1215,6 +1215,24 @@ static inline void unmap_mapping_range_l
 	}
 }
 
+static inline void unmap_mapping_range_list(struct list_head *head,
+					    struct zap_details *details)
+{
+	struct vm_area_struct *vma;
+
+	/*
+	 * In nonlinear VMAs there is no correspondence between virtual address
+	 * offset and file offset.  So we must perform an exhaustive search
+	 * across *all* the pages in each nonlinear VMA, not just the pages
+	 * whose virtual address lies outside the file truncation point.
+	 */
+	list_for_each_entry(vma, head, shared.vm_set.list) {
+		details->nonlinear_vma = vma;
+		zap_page_range(vma, vma->vm_start,
+				vma->vm_end - vma->vm_start, details);
+	}
+}
+
 /**
  * unmap_mapping_range - unmap the portion of all mmaps
  * in the specified address_space corresponding to the specified
@@ -1259,23 +1277,9 @@ void unmap_mapping_range(struct address_
 	mapping->truncate_count++;
 
 	if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
-		unmap_mapping_range_list(&mapping->i_mmap, &details);
-
-	/*
-	 * In nonlinear VMAs there is no correspondence between virtual address
-	 * offset and file offset.  So we must perform an exhaustive search
-	 * across *all* the pages in each nonlinear VMA, not just the pages
-	 * whose virtual address lies outside the file truncation point.
-	 */
-	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) {
-		struct vm_area_struct *vma;
-		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
-						shared.vm_set.list) {
-			details.nonlinear_vma = vma;
-			zap_page_range(vma, vma->vm_start,
-				vma->vm_end - vma->vm_start, &details);
-		}
-	}
+		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
+		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
 	spin_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL(unmap_mapping_range);


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 4/6] vmtrunc: unmap_mapping dropping i_mmap_lock
  2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
                   ` (2 preceding siblings ...)
  2004-10-03 18:25 ` [PATCH 3/6] vmtrunc: unmap_mapping_range_tree Hugh Dickins
@ 2004-10-03 18:26 ` Hugh Dickins
  2004-10-03 18:27 ` [PATCH 5/6] vmtrunc: vm_truncate_count race caution Hugh Dickins
  2004-10-03 18:28 ` [PATCH 6/6] vmtrunc: bug if page_mapped Hugh Dickins
  5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
	Lorenzo Allegrucci, linux-kernel

vmtruncate (or more generally, unmap_mapping_range) has been observed
responsible for very high latencies: the lockbreak work in unmap_vmas is
good for munmap or exit_mmap, but no use while mapping->i_mmap_lock is
held, to keep our place in the prio_tree (or list) of a file's vmas.

Extend the zap_details block with i_mmap_lock pointer, so unmap_vmas can
detect if that needs lockbreak, and break_addr so it can notify where it
left off.

Add unmap_mapping_range_vma, used from both prio_tree and nonlinear list
handlers.  This is what now calls zap_page_range (above unmap_vmas), but
handles the lockbreak and restart issues: letting unmap_mapping_range_
tree or list know when they need to start over because lock was dropped.

When restarting, of course there's a danger of never making progress.
Add vm_truncate_count field to vm_area_struct, update that to mapping->
truncate_count once fully scanned, skip up-to-date vmas without a scan
(and without dropping i_mmap_lock).

Further danger of never making progress if a vma is very large: when
breaking out, save restart_vma and restart_addr (and restart_pgoff to
confirm, in case vma gets reused), to help continue where we left off.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 include/linux/mm.h |    8 ++
 mm/memory.c        |  166 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 160 insertions(+), 14 deletions(-)

--- trunc3/include/linux/mm.h	2004-10-02 18:22:28.114113072 +0100
+++ trunc4/include/linux/mm.h	2004-10-03 01:07:36.185724488 +0100
@@ -103,6 +103,7 @@ struct vm_area_struct {
 					   units, *not* PAGE_CACHE_SIZE */
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
+	unsigned int vm_truncate_count;	/* compare mapping->truncate_count */
 
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
@@ -559,7 +560,12 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	int atomic;				/* May not schedule() */
+	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct vm_area_struct *restart_vma;	/* Where lock was dropped */
+	pgoff_t restart_pgoff;			/* File offset for restart */
+	unsigned long restart_addr;		/* Where we should restart */
+	unsigned long break_addr;		/* Where unmap_vmas stopped */
+	unsigned int truncate_count;		/* Compare vm_truncate_count */
 };
 
 void zap_page_range(struct vm_area_struct *vma, unsigned long address,
--- trunc3/mm/memory.c	2004-10-03 01:07:24.510499392 +0100
+++ trunc4/mm/memory.c	2004-10-03 01:07:36.189723880 +0100
@@ -548,7 +548,8 @@ int unmap_vmas(struct mmu_gather **tlbp,
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
 	int tlb_start_valid = 0;
 	int ret = 0;
-	int atomic = details && details->atomic;
+	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+	int fullmm = tlb_is_full_mm(*tlbp);
 
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
 		unsigned long start;
@@ -587,16 +588,28 @@ int unmap_vmas(struct mmu_gather **tlbp,
 			if ((long)zap_bytes > 0)
 				continue;
 
-			if (!atomic) {
-				int fullmm = tlb_is_full_mm(*tlbp);
-				tlb_finish_mmu(*tlbp, tlb_start, start);
-				cond_resched_lock(&mm->page_table_lock);
-				*tlbp = tlb_gather_mmu(mm, fullmm);
-				tlb_start_valid = 0;
+			tlb_finish_mmu(*tlbp, tlb_start, start);
+
+			if (need_resched() ||
+				need_lockbreak(&mm->page_table_lock) ||
+				(i_mmap_lock && need_lockbreak(i_mmap_lock))) {
+				if (i_mmap_lock) {
+					/* must reset count of rss freed */
+					*tlbp = tlb_gather_mmu(mm, fullmm);
+					details->break_addr = start;
+					goto out;
+				}
+				spin_unlock(&mm->page_table_lock);
+				cond_resched();
+				spin_lock(&mm->page_table_lock);
 			}
+
+			*tlbp = tlb_gather_mmu(mm, fullmm);
+			tlb_start_valid = 0;
 			zap_bytes = ZAP_BLOCK_SIZE;
 		}
 	}
+out:
 	return ret;
 }
 
@@ -1190,7 +1203,114 @@ no_new_page:
 
 /*
  * Helper functions for unmap_mapping_range().
+ *
+ * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __
+ *
+ * We have to restart searching the prio_tree whenever we drop the lock,
+ * since the iterator is only valid while the lock is held, and anyway
+ * a later vma might be split and reinserted earlier while lock dropped.
+ *
+ * The list of nonlinear vmas could be handled more efficiently, using
+ * a placeholder, but handle it in the same way until a need is shown.
+ * It is important to search the prio_tree before nonlinear list: a vma
+ * may become nonlinear and be shifted from prio_tree to nonlinear list
+ * while the lock is dropped; but never shifted from list to prio_tree.
+ *
+ * In order to make forward progress despite restarting the search,
+ * vm_truncate_count is used to mark a vma as now dealt with, so we can
+ * quickly skip it next time around.  Since the prio_tree search only
+ * shows us those vmas affected by unmapping the range in question, we
+ * can't efficiently keep all vmas in step with mapping->truncate_count:
+ * so instead reset them all whenever it wraps back to 0 (then go to 1).
+ * mapping->truncate_count and vma->vm_truncate_count are protected by
+ * i_mmap_lock.
+ *
+ * In order to make forward progress despite repeatedly restarting some
+ * large vma, note the break_addr set by unmap_vmas when it breaks out:
+ * and restart from that address when we reach that vma again (so long
+ * as another such was not inserted earlier while the lock was dropped).
+ *
+ * There is no guarantee that the restart_vma will remain intact when
+ * the lock is regained: but if it has been freed to some other use,
+ * it cannot then appear in the tree or list of vmas, so no harm done;
+ * if it has been reused for a new vma of the same mapping, nopage
+ * checks on i_size and truncate_count ensure it cannot be mapping any
+ * of the truncated pages, so the area below restart_addr is still safe
+ * to skip - but we must check pgoff to prevent spurious unmapping; or
+ * restart_vma may have been split or merged, shrunk or extended - but
+ * never shifted, so restart_addr and restart_pgoff remain in synch
+ * (even in the case of mremap move, which makes a copy of the vma).
  */
+
+static void reset_vma_truncate_counts(struct address_space *mapping)
+{
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX)
+		vma->vm_truncate_count = 0;
+	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
+		vma->vm_truncate_count = 0;
+}
+
+static int unmap_mapping_range_vma(struct vm_area_struct *vma,
+		unsigned long start_addr, unsigned long end_addr,
+		struct zap_details *details)
+{
+	int need_break;
+
+	if (vma == details->restart_vma &&
+	    start_addr < details->restart_addr) {
+		/*
+		 * Be careful: this vm_area_struct may have been reused
+		 * meanwhile.  If pgoff matches up, it's probably the
+		 * same one (perhaps shrunk or extended), but no harm is
+		 * done if actually it's new.  If pgoff does not match,
+		 * it would likely be wrong to unmap from restart_addr,
+		 * but it must be new, so we can just mark it completed.
+		 */
+		start_addr = details->restart_addr;
+		if (linear_page_index(vma, start_addr) !=
+		    details->restart_pgoff || start_addr >= end_addr) {
+			vma->vm_truncate_count = details->truncate_count;
+			return 0;
+		}
+	}
+again:
+	details->break_addr = end_addr;
+	zap_page_range(vma, start_addr, end_addr - start_addr, details);
+
+	/*
+	 * We cannot rely on the break test in unmap_vmas:
+	 * on the one hand, we don't want to restart our loop
+	 * just because that broke out for the page_table_lock;
+	 * on the other hand, it does no test when vma is small.
+	 */
+	need_break = need_resched() ||
+			need_lockbreak(details->i_mmap_lock);
+
+	if (details->break_addr >= end_addr) {
+		/* We have now completed this vma: mark it so */
+		vma->vm_truncate_count = details->truncate_count;
+		if (!need_break)
+			return 0;
+	} else {
+		if (!need_break) {
+			start_addr = details->break_addr;
+			goto again;
+		}
+		details->restart_vma = vma;
+		details->restart_pgoff =
+			linear_page_index(vma, details->break_addr);
+		details->restart_addr = details->break_addr;
+	}
+
+	spin_unlock(details->i_mmap_lock);
+	cond_resched();
+	spin_lock(details->i_mmap_lock);
+	return -EINTR;
+}
+
 static inline void unmap_mapping_range_tree(struct prio_tree_root *root,
 					    struct zap_details *details)
 {
@@ -1198,8 +1318,13 @@ static inline void unmap_mapping_range_t
 	struct prio_tree_iter iter;
 	pgoff_t vba, vea, zba, zea;
 
+restart:
 	vma_prio_tree_foreach(vma, &iter, root,
 			details->first_index, details->last_index) {
+		/* Skip quickly over those we have already dealt with */
+		if (vma->vm_truncate_count == details->truncate_count)
+			continue;
+
 		vba = vma->vm_pgoff;
 		vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
 		/* Assume for now that PAGE_CACHE_SHIFT == PAGE_SHIFT */
@@ -1209,9 +1334,12 @@ static inline void unmap_mapping_range_t
 		zea = details->last_index;
 		if (zea > vea)
 			zea = vea;
-		zap_page_range(vma,
+
+		if (unmap_mapping_range_vma(vma,
 			((zba - vba) << PAGE_SHIFT) + vma->vm_start,
-			(zea - zba + 1) << PAGE_SHIFT, details);
+			((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start,
+				details) < 0)
+			goto restart;
 	}
 }
 
@@ -1226,10 +1354,15 @@ static inline void unmap_mapping_range_l
 	 * across *all* the pages in each nonlinear VMA, not just the pages
 	 * whose virtual address lies outside the file truncation point.
 	 */
+restart:
 	list_for_each_entry(vma, head, shared.vm_set.list) {
+		/* Skip quickly over those we have already dealt with */
+		if (vma->vm_truncate_count == details->truncate_count)
+			continue;
 		details->nonlinear_vma = vma;
-		zap_page_range(vma, vma->vm_start,
-				vma->vm_end - vma->vm_start, details);
+		if (unmap_mapping_range_vma(vma, vma->vm_start,
+					vma->vm_end, details) < 0)
+			goto restart;
 	}
 }
 
@@ -1268,13 +1401,20 @@ void unmap_mapping_range(struct address_
 	details.nonlinear_vma = NULL;
 	details.first_index = hba;
 	details.last_index = hba + hlen - 1;
-	details.atomic = 1;	/* A spinlock is held */
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
+	details.i_mmap_lock = &mapping->i_mmap_lock;
+	details.restart_vma = NULL;
 
 	spin_lock(&mapping->i_mmap_lock);
-	/* Protect against page fault */
+
+	/* Protect against page faults, and endless unmapping loops */
 	mapping->truncate_count++;
+	if (unlikely(mapping->truncate_count == 0)) {
+		mapping->truncate_count++;
+		reset_vma_truncate_counts(mapping);
+	}
+	details.truncate_count = mapping->truncate_count;
 
 	if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 5/6] vmtrunc: vm_truncate_count race caution
  2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
                   ` (3 preceding siblings ...)
  2004-10-03 18:26 ` [PATCH 4/6] vmtrunc: unmap_mapping dropping i_mmap_lock Hugh Dickins
@ 2004-10-03 18:27 ` Hugh Dickins
  2004-10-03 18:28 ` [PATCH 6/6] vmtrunc: bug if page_mapped Hugh Dickins
  5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
	Lorenzo Allegrucci, linux-kernel

Fix some unlikely races in respect of vm_truncate_count.

Firstly, it's supposed to be guarded by i_mmap_lock, but some places
copy a vma structure by *new_vma = *old_vma: if the compiler implements
that with a bytewise copy, new_vma->vm_truncate_count could be munged,
and new_vma later appear up-to-date when it's not; so set it properly
once under lock.

vma_link set vm_truncate_count to mapping->truncate_count when adding an
empty vma: if new vmas are being added profusely while vmtruncate is in
progess, this lets them be skipped without scanning.

vma_adjust has vm_truncate_count problem much like it had with anon_vma
under mprotect merge: when merging be careful not to leave vma marked as
up-to-date when it might not be, lest unmap_mapping_range in progress -
set vm_truncate_count 0 when in doubt.  Similarly when mremap moving
ptes from one vma to another.

Cut a little code from __anon_vma_merge: now vma_adjust sets "importer"
in the remove_next case (to get its vm_truncate_count right), its
anon_vma is already linked by the time __anon_vma_merge is called.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 kernel/fork.c |    1 +
 mm/mmap.c     |   14 +++++++++++++-
 mm/mremap.c   |   16 ++++++++++------
 mm/rmap.c     |    9 +--------
 4 files changed, 25 insertions(+), 15 deletions(-)

--- trunc4/kernel/fork.c	2004-10-02 18:22:28.000000000 +0100
+++ trunc5/kernel/fork.c	2004-10-03 16:37:39.670145872 +0100
@@ -221,6 +221,7 @@ static inline int dup_mmap(struct mm_str
       
 			/* insert tmp into the share list, just after mpnt */
 			spin_lock(&file->f_mapping->i_mmap_lock);
+			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(file->f_mapping);
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(file->f_mapping);
--- trunc4/mm/mmap.c	2004-10-02 18:22:29.000000000 +0100
+++ trunc5/mm/mmap.c	2004-10-03 16:37:39.673145416 +0100
@@ -307,8 +307,10 @@ static void vma_link(struct mm_struct *m
 	if (vma->vm_file)
 		mapping = vma->vm_file->f_mapping;
 
-	if (mapping)
+	if (mapping) {
 		spin_lock(&mapping->i_mmap_lock);
+		vma->vm_truncate_count = mapping->truncate_count;
+	}
 	anon_vma_lock(vma);
 
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
@@ -379,6 +381,7 @@ void vma_adjust(struct vm_area_struct *v
 again:			remove_next = 1 + (end > next->vm_end);
 			end = next->vm_end;
 			anon_vma = next->anon_vma;
+			importer = vma;
 		} else if (end > next->vm_start) {
 			/*
 			 * vma expands, overlapping part of the next:
@@ -404,7 +407,16 @@ again:			remove_next = 1 + (end > next->
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
 		spin_lock(&mapping->i_mmap_lock);
+		if (importer &&
+		    vma->vm_truncate_count != next->vm_truncate_count) {
+			/*
+			 * unmap_mapping_range might be in progress:
+			 * ensure that the expanding vma is rescanned.
+			 */
+			importer->vm_truncate_count = 0;
+		}
 		if (insert) {
+			insert->vm_truncate_count = vma->vm_truncate_count;
 			/*
 			 * Put into prio_tree now, so instantiated pages
 			 * are visible to arm/parisc __flush_dcache_page
--- trunc4/mm/mremap.c	2004-10-02 18:22:29.000000000 +0100
+++ trunc5/mm/mremap.c	2004-10-03 16:37:39.675145112 +0100
@@ -82,7 +82,7 @@ static inline pte_t *alloc_one_pte_map(s
 
 static int
 move_one_page(struct vm_area_struct *vma, unsigned long old_addr,
-		unsigned long new_addr)
+		struct vm_area_struct *new_vma, unsigned long new_addr)
 {
 	struct address_space *mapping = NULL;
 	struct mm_struct *mm = vma->vm_mm;
@@ -98,6 +98,9 @@ move_one_page(struct vm_area_struct *vma
 		 */
 		mapping = vma->vm_file->f_mapping;
 		spin_lock(&mapping->i_mmap_lock);
+		if (new_vma->vm_truncate_count &&
+		    new_vma->vm_truncate_count != vma->vm_truncate_count)
+			new_vma->vm_truncate_count = 0;
 	}
 	spin_lock(&mm->page_table_lock);
 
@@ -144,8 +147,8 @@ move_one_page(struct vm_area_struct *vma
 }
 
 static unsigned long move_page_tables(struct vm_area_struct *vma,
-		unsigned long new_addr, unsigned long old_addr,
-		unsigned long len)
+		unsigned long old_addr, struct vm_area_struct *new_vma,
+		unsigned long new_addr, unsigned long len)
 {
 	unsigned long offset;
 
@@ -157,7 +160,8 @@ static unsigned long move_page_tables(st
 	 * only a few pages.. This also makes error recovery easier.
 	 */
 	for (offset = 0; offset < len; offset += PAGE_SIZE) {
-		if (move_one_page(vma, old_addr+offset, new_addr+offset) < 0)
+		if (move_one_page(vma, old_addr + offset,
+				new_vma, new_addr + offset) < 0)
 			break;
 		cond_resched();
 	}
@@ -188,14 +192,14 @@ static unsigned long move_vma(struct vm_
 	if (!new_vma)
 		return -ENOMEM;
 
-	moved_len = move_page_tables(vma, new_addr, old_addr, old_len);
+	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
 	if (moved_len < old_len) {
 		/*
 		 * On error, move entries back from new area to old,
 		 * which will succeed since page tables still there,
 		 * and then proceed to unmap new area instead of old.
 		 */
-		move_page_tables(new_vma, old_addr, new_addr, moved_len);
+		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
--- trunc4/mm/rmap.c	2004-10-02 18:22:29.000000000 +0100
+++ trunc5/mm/rmap.c	2004-10-03 16:37:39.676144960 +0100
@@ -120,14 +120,7 @@ int anon_vma_prepare(struct vm_area_stru
 
 void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
 {
-	if (!vma->anon_vma) {
-		BUG_ON(!next->anon_vma);
-		vma->anon_vma = next->anon_vma;
-		list_add(&vma->anon_vma_node, &next->anon_vma_node);
-	} else {
-		/* if they're both non-null they must be the same */
-		BUG_ON(vma->anon_vma != next->anon_vma);
-	}
+	BUG_ON(vma->anon_vma != next->anon_vma);
 	list_del(&next->anon_vma_node);
 }
 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 6/6] vmtrunc: bug if page_mapped
  2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
                   ` (4 preceding siblings ...)
  2004-10-03 18:27 ` [PATCH 5/6] vmtrunc: vm_truncate_count race caution Hugh Dickins
@ 2004-10-03 18:28 ` Hugh Dickins
  5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
	Lorenzo Allegrucci, linux-kernel

If unmap_mapping_range (and mapping->truncate_count) are doing their
jobs right, truncate_complete_page should never find the page mapped:
add BUG_ON for our immediate testing, but this patch should probably
not go to mainline - a mapped page here is not a catastrophe.

Unsigned-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/truncate.c |    1 +
 1 files changed, 1 insertion(+)

--- trunc5/mm/truncate.c	2004-10-02 18:22:29.000000000 +0100
+++ trunc6/mm/truncate.c	2004-10-03 16:37:39.677144808 +0100
@@ -45,6 +45,7 @@ static inline void truncate_partial_page
 static void
 truncate_complete_page(struct address_space *mapping, struct page *page)
 {
+	BUG_ON(page_mapped(page));
 	if (page->mapping != mapping)
 		return;
 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes
  2004-10-03 18:24 ` [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes Hugh Dickins
@ 2004-10-03 18:36   ` Ingo Molnar
  0 siblings, 0 replies; 8+ messages in thread
From: Ingo Molnar @ 2004-10-03 18:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
	Lorenzo Allegrucci, linux-kernel


* Hugh Dickins <hugh@veritas.com> wrote:

> The low-latency unmap_vmas patch silently moved the zap_bytes test
> after the TLB finish and lockbreak and regather: why?  That not only
> makes zap_bytes redundant (might as well use ZAP_BLOCK_SIZE), it makes
> the unmap_vmas level redundant too - it's all about saving TLB flushes
> when unmapping a series of small vmas.

the problem was latency generated by pages queueing up in tlb_gather's 
queue and being freed in one loop, causing latencies. So there was no 
latency-break, only a shifting of the queue.

> Move zap_bytes test back before the lockbreak, and delete the curious
> comment that a small zap block size doesn't matter: it's true
> need_flush prevents TLB flush when no page has been unmapped, but
> unmapping pages in small blocks involves many more TLB flushes than in
> large blocks.

could we perhaps free those pages outside of the lock? That would be
just as good for me.

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2004-10-03 18:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
2004-10-03 18:23 ` [PATCH 1/6] vmtrunc: truncate_count not atomic Hugh Dickins
2004-10-03 18:24 ` [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes Hugh Dickins
2004-10-03 18:36   ` Ingo Molnar
2004-10-03 18:25 ` [PATCH 3/6] vmtrunc: unmap_mapping_range_tree Hugh Dickins
2004-10-03 18:26 ` [PATCH 4/6] vmtrunc: unmap_mapping dropping i_mmap_lock Hugh Dickins
2004-10-03 18:27 ` [PATCH 5/6] vmtrunc: vm_truncate_count race caution Hugh Dickins
2004-10-03 18:28 ` [PATCH 6/6] vmtrunc: bug if page_mapped Hugh Dickins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.