* [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock
@ 2004-10-03 18:21 Hugh Dickins
2004-10-03 18:23 ` [PATCH 1/6] vmtrunc: truncate_count not atomic Hugh Dickins
` (5 more replies)
0 siblings, 6 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:21 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
Lorenzo Allegrucci, linux-kernel
Here's a set of patches against 2.6.9-rc3-mm1, or Ingo's -S7 VP tree:
allow lower latency in vmtruncate (more generally, unmap_mapping_range)
by dropping the i_mmap_lock when required. Though I've been focussing
on correctness rather than latency myself: do these work for you, Ingo?
1/6 undoes something of Andrea's, 2/6 undoes something of Ingo's:
I thought I might as well provide patches as ask the questions, please
let us know why if you disagree with them. 3/6 is cosmetic, 4/6 is
the real work, 5/6 corrects some loose ends, 6/6 is for testing.
Worth the effort, and ~700 bytes bloat, and additional int in each vma?
If only root could run-bash-shared-mappings, I'd say not worth it; but
it seems a little like prio_tree itself, we're not sure we need it,
but feel more secure to have the solution.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
drivers/mtd/devices/blkmtd.c | 1
fs/inode.c | 1
include/linux/fs.h | 2
include/linux/mm.h | 8 +
kernel/fork.c | 1
mm/memory.c | 223 ++++++++++++++++++++++++++++++++++---------
mm/mmap.c | 14 ++
mm/mremap.c | 16 +--
mm/rmap.c | 9 -
mm/truncate.c | 1
10 files changed, 215 insertions(+), 61 deletions(-)
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 1/6] vmtrunc: truncate_count not atomic
2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
@ 2004-10-03 18:23 ` Hugh Dickins
2004-10-03 18:24 ` [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes Hugh Dickins
` (4 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
Lorenzo Allegrucci, David Woodhouse, linux-kernel
Why is mapping->truncate_count atomic? It's incremented inside
i_mmap_lock (and i_sem), and the reads don't need it to be atomic.
And why smp_rmb() before call to ->nopage? The compiler cannot reorder
the initial assignment of sequence after the call to ->nopage, and no
cpu (yet!) can read from the future, which is all that matters there.
And delete totally bogus reset of truncate_count from blkmtd add_device.
truncate_count is all about detecting i_size changes: i_size does not
change there; and if it did, the count should be incremented not reset.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
drivers/mtd/devices/blkmtd.c | 1 -
fs/inode.c | 1 -
include/linux/fs.h | 2 +-
mm/memory.c | 12 +++++-------
4 files changed, 6 insertions(+), 10 deletions(-)
--- 2.6.9-rc3-mm1/drivers/mtd/devices/blkmtd.c 2004-08-14 06:38:20.000000000 +0100
+++ trunc1/drivers/mtd/devices/blkmtd.c 2004-10-03 01:07:01.388014544 +0100
@@ -661,7 +661,6 @@ static struct blkmtd_dev *add_device(cha
memset(dev, 0, sizeof(struct blkmtd_dev));
dev->blkdev = bdev;
- atomic_set(&(dev->blkdev->bd_inode->i_mapping->truncate_count), 0);
if(!readonly) {
init_MUTEX(&dev->wrbuf_mutex);
}
--- trunc0/fs/inode.c 2004-10-02 18:22:24.857608136 +0100
+++ trunc1/fs/inode.c 2004-10-03 01:07:01.390014240 +0100
@@ -200,7 +200,6 @@ void inode_init_once(struct inode *inode
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
- atomic_set(&inode->i_data.truncate_count, 0);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
--- trunc0/include/linux/fs.h 2004-10-02 18:22:28.019127512 +0100
+++ trunc1/include/linux/fs.h 2004-10-03 01:07:01.393013784 +0100
@@ -342,7 +342,7 @@ struct address_space {
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
spinlock_t i_mmap_lock; /* protect tree, count, list */
- atomic_t truncate_count; /* Cover race condition with truncate */
+ unsigned int truncate_count; /* Cover race condition with truncate */
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
struct address_space_operations *a_ops; /* methods */
--- trunc0/mm/memory.c 2004-10-02 18:22:29.354924440 +0100
+++ trunc1/mm/memory.c 2004-10-03 01:07:01.396013328 +0100
@@ -1259,7 +1259,7 @@ void unmap_mapping_range(struct address_
spin_lock(&mapping->i_mmap_lock);
/* Protect against page fault */
- atomic_inc(&mapping->truncate_count);
+ mapping->truncate_count++;
if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
unmap_mapping_range_list(&mapping->i_mmap, &details);
@@ -1557,7 +1557,7 @@ do_no_page(struct mm_struct *mm, struct
struct page * new_page;
struct address_space *mapping = NULL;
pte_t entry;
- int sequence = 0;
+ unsigned int sequence = 0;
int ret = VM_FAULT_MINOR;
int anon = 0;
@@ -1569,9 +1569,8 @@ do_no_page(struct mm_struct *mm, struct
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
- sequence = atomic_read(&mapping->truncate_count);
+ sequence = mapping->truncate_count;
}
- smp_rmb(); /* Prevent CPU from reordering lock-free ->nopage() */
retry:
cond_resched();
new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
@@ -1615,9 +1614,8 @@ retry:
* invalidated this page. If unmap_mapping_range got called,
* retry getting the page.
*/
- if (mapping &&
- (unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
- sequence = atomic_read(&mapping->truncate_count);
+ if (mapping && unlikely(sequence != mapping->truncate_count)) {
+ sequence = mapping->truncate_count;
spin_unlock(&mm->page_table_lock);
page_cache_release(new_page);
goto retry;
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes
2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
2004-10-03 18:23 ` [PATCH 1/6] vmtrunc: truncate_count not atomic Hugh Dickins
@ 2004-10-03 18:24 ` Hugh Dickins
2004-10-03 18:36 ` Ingo Molnar
2004-10-03 18:25 ` [PATCH 3/6] vmtrunc: unmap_mapping_range_tree Hugh Dickins
` (3 subsequent siblings)
5 siblings, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:24 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
Lorenzo Allegrucci, linux-kernel
The low-latency unmap_vmas patch silently moved the zap_bytes test
after the TLB finish and lockbreak and regather: why? That not only
makes zap_bytes redundant (might as well use ZAP_BLOCK_SIZE), it makes
the unmap_vmas level redundant too - it's all about saving TLB flushes
when unmapping a series of small vmas.
Move zap_bytes test back before the lockbreak, and delete the curious
comment that a small zap block size doesn't matter: it's true need_flush
prevents TLB flush when no page has been unmapped, but unmapping pages
in small blocks involves many more TLB flushes than in large blocks.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memory.c | 9 +++------
1 files changed, 3 insertions(+), 6 deletions(-)
--- trunc1/mm/memory.c 2004-10-03 01:07:01.396013328 +0100
+++ trunc2/mm/memory.c 2004-10-03 01:07:12.957255752 +0100
@@ -506,10 +506,6 @@ static void unmap_page_range(struct mmu_
}
#ifdef CONFIG_PREEMPT
-/*
- * It's not an issue to have a small zap block size - TLB flushes
- * only happen once normally, due to the tlb->need_flush optimization.
- */
# define ZAP_BLOCK_SIZE (8 * PAGE_SIZE)
#else
/* No preempt: go for improved straight-line efficiency */
@@ -588,6 +584,9 @@ int unmap_vmas(struct mmu_gather **tlbp,
start += block;
zap_bytes -= block;
+ if ((long)zap_bytes > 0)
+ continue;
+
if (!atomic) {
int fullmm = tlb_is_full_mm(*tlbp);
tlb_finish_mmu(*tlbp, tlb_start, start);
@@ -595,8 +594,6 @@ int unmap_vmas(struct mmu_gather **tlbp,
*tlbp = tlb_gather_mmu(mm, fullmm);
tlb_start_valid = 0;
}
- if ((long)zap_bytes > 0)
- continue;
zap_bytes = ZAP_BLOCK_SIZE;
}
}
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 3/6] vmtrunc: unmap_mapping_range_tree
2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
2004-10-03 18:23 ` [PATCH 1/6] vmtrunc: truncate_count not atomic Hugh Dickins
2004-10-03 18:24 ` [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes Hugh Dickins
@ 2004-10-03 18:25 ` Hugh Dickins
2004-10-03 18:26 ` [PATCH 4/6] vmtrunc: unmap_mapping dropping i_mmap_lock Hugh Dickins
` (2 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:25 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
Lorenzo Allegrucci, linux-kernel
Move unmap_mapping_range's nonlinear vma handling out to its own inline,
parallel to the prio_tree handling; unmap_mapping_range_list is a better
name for the nonlinear list, rename the other unmap_mapping_range_tree.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/memory.c | 42 +++++++++++++++++++++++-------------------
1 files changed, 23 insertions(+), 19 deletions(-)
--- trunc2/mm/memory.c 2004-10-03 01:07:12.957255752 +0100
+++ trunc3/mm/memory.c 2004-10-03 01:07:24.510499392 +0100
@@ -1189,9 +1189,9 @@ no_new_page:
}
/*
- * Helper function for unmap_mapping_range().
+ * Helper functions for unmap_mapping_range().
*/
-static inline void unmap_mapping_range_list(struct prio_tree_root *root,
+static inline void unmap_mapping_range_tree(struct prio_tree_root *root,
struct zap_details *details)
{
struct vm_area_struct *vma;
@@ -1215,6 +1215,24 @@ static inline void unmap_mapping_range_l
}
}
+static inline void unmap_mapping_range_list(struct list_head *head,
+ struct zap_details *details)
+{
+ struct vm_area_struct *vma;
+
+ /*
+ * In nonlinear VMAs there is no correspondence between virtual address
+ * offset and file offset. So we must perform an exhaustive search
+ * across *all* the pages in each nonlinear VMA, not just the pages
+ * whose virtual address lies outside the file truncation point.
+ */
+ list_for_each_entry(vma, head, shared.vm_set.list) {
+ details->nonlinear_vma = vma;
+ zap_page_range(vma, vma->vm_start,
+ vma->vm_end - vma->vm_start, details);
+ }
+}
+
/**
* unmap_mapping_range - unmap the portion of all mmaps
* in the specified address_space corresponding to the specified
@@ -1259,23 +1277,9 @@ void unmap_mapping_range(struct address_
mapping->truncate_count++;
if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
- unmap_mapping_range_list(&mapping->i_mmap, &details);
-
- /*
- * In nonlinear VMAs there is no correspondence between virtual address
- * offset and file offset. So we must perform an exhaustive search
- * across *all* the pages in each nonlinear VMA, not just the pages
- * whose virtual address lies outside the file truncation point.
- */
- if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) {
- struct vm_area_struct *vma;
- list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
- shared.vm_set.list) {
- details.nonlinear_vma = vma;
- zap_page_range(vma, vma->vm_start,
- vma->vm_end - vma->vm_start, &details);
- }
- }
+ unmap_mapping_range_tree(&mapping->i_mmap, &details);
+ if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
+ unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
spin_unlock(&mapping->i_mmap_lock);
}
EXPORT_SYMBOL(unmap_mapping_range);
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 4/6] vmtrunc: unmap_mapping dropping i_mmap_lock
2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
` (2 preceding siblings ...)
2004-10-03 18:25 ` [PATCH 3/6] vmtrunc: unmap_mapping_range_tree Hugh Dickins
@ 2004-10-03 18:26 ` Hugh Dickins
2004-10-03 18:27 ` [PATCH 5/6] vmtrunc: vm_truncate_count race caution Hugh Dickins
2004-10-03 18:28 ` [PATCH 6/6] vmtrunc: bug if page_mapped Hugh Dickins
5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:26 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
Lorenzo Allegrucci, linux-kernel
vmtruncate (or more generally, unmap_mapping_range) has been observed
responsible for very high latencies: the lockbreak work in unmap_vmas is
good for munmap or exit_mmap, but no use while mapping->i_mmap_lock is
held, to keep our place in the prio_tree (or list) of a file's vmas.
Extend the zap_details block with i_mmap_lock pointer, so unmap_vmas can
detect if that needs lockbreak, and break_addr so it can notify where it
left off.
Add unmap_mapping_range_vma, used from both prio_tree and nonlinear list
handlers. This is what now calls zap_page_range (above unmap_vmas), but
handles the lockbreak and restart issues: letting unmap_mapping_range_
tree or list know when they need to start over because lock was dropped.
When restarting, of course there's a danger of never making progress.
Add vm_truncate_count field to vm_area_struct, update that to mapping->
truncate_count once fully scanned, skip up-to-date vmas without a scan
(and without dropping i_mmap_lock).
Further danger of never making progress if a vma is very large: when
breaking out, save restart_vma and restart_addr (and restart_pgoff to
confirm, in case vma gets reused), to help continue where we left off.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
include/linux/mm.h | 8 ++
mm/memory.c | 166 ++++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 160 insertions(+), 14 deletions(-)
--- trunc3/include/linux/mm.h 2004-10-02 18:22:28.114113072 +0100
+++ trunc4/include/linux/mm.h 2004-10-03 01:07:36.185724488 +0100
@@ -103,6 +103,7 @@ struct vm_area_struct {
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
+ unsigned int vm_truncate_count; /* compare mapping->truncate_count */
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
@@ -559,7 +560,12 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
- int atomic; /* May not schedule() */
+ spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
+ struct vm_area_struct *restart_vma; /* Where lock was dropped */
+ pgoff_t restart_pgoff; /* File offset for restart */
+ unsigned long restart_addr; /* Where we should restart */
+ unsigned long break_addr; /* Where unmap_vmas stopped */
+ unsigned int truncate_count; /* Compare vm_truncate_count */
};
void zap_page_range(struct vm_area_struct *vma, unsigned long address,
--- trunc3/mm/memory.c 2004-10-03 01:07:24.510499392 +0100
+++ trunc4/mm/memory.c 2004-10-03 01:07:36.189723880 +0100
@@ -548,7 +548,8 @@ int unmap_vmas(struct mmu_gather **tlbp,
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
int tlb_start_valid = 0;
int ret = 0;
- int atomic = details && details->atomic;
+ spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+ int fullmm = tlb_is_full_mm(*tlbp);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long start;
@@ -587,16 +588,28 @@ int unmap_vmas(struct mmu_gather **tlbp,
if ((long)zap_bytes > 0)
continue;
- if (!atomic) {
- int fullmm = tlb_is_full_mm(*tlbp);
- tlb_finish_mmu(*tlbp, tlb_start, start);
- cond_resched_lock(&mm->page_table_lock);
- *tlbp = tlb_gather_mmu(mm, fullmm);
- tlb_start_valid = 0;
+ tlb_finish_mmu(*tlbp, tlb_start, start);
+
+ if (need_resched() ||
+ need_lockbreak(&mm->page_table_lock) ||
+ (i_mmap_lock && need_lockbreak(i_mmap_lock))) {
+ if (i_mmap_lock) {
+ /* must reset count of rss freed */
+ *tlbp = tlb_gather_mmu(mm, fullmm);
+ details->break_addr = start;
+ goto out;
+ }
+ spin_unlock(&mm->page_table_lock);
+ cond_resched();
+ spin_lock(&mm->page_table_lock);
}
+
+ *tlbp = tlb_gather_mmu(mm, fullmm);
+ tlb_start_valid = 0;
zap_bytes = ZAP_BLOCK_SIZE;
}
}
+out:
return ret;
}
@@ -1190,7 +1203,114 @@ no_new_page:
/*
* Helper functions for unmap_mapping_range().
+ *
+ * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __
+ *
+ * We have to restart searching the prio_tree whenever we drop the lock,
+ * since the iterator is only valid while the lock is held, and anyway
+ * a later vma might be split and reinserted earlier while lock dropped.
+ *
+ * The list of nonlinear vmas could be handled more efficiently, using
+ * a placeholder, but handle it in the same way until a need is shown.
+ * It is important to search the prio_tree before nonlinear list: a vma
+ * may become nonlinear and be shifted from prio_tree to nonlinear list
+ * while the lock is dropped; but never shifted from list to prio_tree.
+ *
+ * In order to make forward progress despite restarting the search,
+ * vm_truncate_count is used to mark a vma as now dealt with, so we can
+ * quickly skip it next time around. Since the prio_tree search only
+ * shows us those vmas affected by unmapping the range in question, we
+ * can't efficiently keep all vmas in step with mapping->truncate_count:
+ * so instead reset them all whenever it wraps back to 0 (then go to 1).
+ * mapping->truncate_count and vma->vm_truncate_count are protected by
+ * i_mmap_lock.
+ *
+ * In order to make forward progress despite repeatedly restarting some
+ * large vma, note the break_addr set by unmap_vmas when it breaks out:
+ * and restart from that address when we reach that vma again (so long
+ * as another such was not inserted earlier while the lock was dropped).
+ *
+ * There is no guarantee that the restart_vma will remain intact when
+ * the lock is regained: but if it has been freed to some other use,
+ * it cannot then appear in the tree or list of vmas, so no harm done;
+ * if it has been reused for a new vma of the same mapping, nopage
+ * checks on i_size and truncate_count ensure it cannot be mapping any
+ * of the truncated pages, so the area below restart_addr is still safe
+ * to skip - but we must check pgoff to prevent spurious unmapping; or
+ * restart_vma may have been split or merged, shrunk or extended - but
+ * never shifted, so restart_addr and restart_pgoff remain in synch
+ * (even in the case of mremap move, which makes a copy of the vma).
*/
+
+static void reset_vma_truncate_counts(struct address_space *mapping)
+{
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX)
+ vma->vm_truncate_count = 0;
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
+ vma->vm_truncate_count = 0;
+}
+
+static int unmap_mapping_range_vma(struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr,
+ struct zap_details *details)
+{
+ int need_break;
+
+ if (vma == details->restart_vma &&
+ start_addr < details->restart_addr) {
+ /*
+ * Be careful: this vm_area_struct may have been reused
+ * meanwhile. If pgoff matches up, it's probably the
+ * same one (perhaps shrunk or extended), but no harm is
+ * done if actually it's new. If pgoff does not match,
+ * it would likely be wrong to unmap from restart_addr,
+ * but it must be new, so we can just mark it completed.
+ */
+ start_addr = details->restart_addr;
+ if (linear_page_index(vma, start_addr) !=
+ details->restart_pgoff || start_addr >= end_addr) {
+ vma->vm_truncate_count = details->truncate_count;
+ return 0;
+ }
+ }
+again:
+ details->break_addr = end_addr;
+ zap_page_range(vma, start_addr, end_addr - start_addr, details);
+
+ /*
+ * We cannot rely on the break test in unmap_vmas:
+ * on the one hand, we don't want to restart our loop
+ * just because that broke out for the page_table_lock;
+ * on the other hand, it does no test when vma is small.
+ */
+ need_break = need_resched() ||
+ need_lockbreak(details->i_mmap_lock);
+
+ if (details->break_addr >= end_addr) {
+ /* We have now completed this vma: mark it so */
+ vma->vm_truncate_count = details->truncate_count;
+ if (!need_break)
+ return 0;
+ } else {
+ if (!need_break) {
+ start_addr = details->break_addr;
+ goto again;
+ }
+ details->restart_vma = vma;
+ details->restart_pgoff =
+ linear_page_index(vma, details->break_addr);
+ details->restart_addr = details->break_addr;
+ }
+
+ spin_unlock(details->i_mmap_lock);
+ cond_resched();
+ spin_lock(details->i_mmap_lock);
+ return -EINTR;
+}
+
static inline void unmap_mapping_range_tree(struct prio_tree_root *root,
struct zap_details *details)
{
@@ -1198,8 +1318,13 @@ static inline void unmap_mapping_range_t
struct prio_tree_iter iter;
pgoff_t vba, vea, zba, zea;
+restart:
vma_prio_tree_foreach(vma, &iter, root,
details->first_index, details->last_index) {
+ /* Skip quickly over those we have already dealt with */
+ if (vma->vm_truncate_count == details->truncate_count)
+ continue;
+
vba = vma->vm_pgoff;
vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
/* Assume for now that PAGE_CACHE_SHIFT == PAGE_SHIFT */
@@ -1209,9 +1334,12 @@ static inline void unmap_mapping_range_t
zea = details->last_index;
if (zea > vea)
zea = vea;
- zap_page_range(vma,
+
+ if (unmap_mapping_range_vma(vma,
((zba - vba) << PAGE_SHIFT) + vma->vm_start,
- (zea - zba + 1) << PAGE_SHIFT, details);
+ ((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start,
+ details) < 0)
+ goto restart;
}
}
@@ -1226,10 +1354,15 @@ static inline void unmap_mapping_range_l
* across *all* the pages in each nonlinear VMA, not just the pages
* whose virtual address lies outside the file truncation point.
*/
+restart:
list_for_each_entry(vma, head, shared.vm_set.list) {
+ /* Skip quickly over those we have already dealt with */
+ if (vma->vm_truncate_count == details->truncate_count)
+ continue;
details->nonlinear_vma = vma;
- zap_page_range(vma, vma->vm_start,
- vma->vm_end - vma->vm_start, details);
+ if (unmap_mapping_range_vma(vma, vma->vm_start,
+ vma->vm_end, details) < 0)
+ goto restart;
}
}
@@ -1268,13 +1401,20 @@ void unmap_mapping_range(struct address_
details.nonlinear_vma = NULL;
details.first_index = hba;
details.last_index = hba + hlen - 1;
- details.atomic = 1; /* A spinlock is held */
if (details.last_index < details.first_index)
details.last_index = ULONG_MAX;
+ details.i_mmap_lock = &mapping->i_mmap_lock;
+ details.restart_vma = NULL;
spin_lock(&mapping->i_mmap_lock);
- /* Protect against page fault */
+
+ /* Protect against page faults, and endless unmapping loops */
mapping->truncate_count++;
+ if (unlikely(mapping->truncate_count == 0)) {
+ mapping->truncate_count++;
+ reset_vma_truncate_counts(mapping);
+ }
+ details.truncate_count = mapping->truncate_count;
if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
unmap_mapping_range_tree(&mapping->i_mmap, &details);
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 5/6] vmtrunc: vm_truncate_count race caution
2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
` (3 preceding siblings ...)
2004-10-03 18:26 ` [PATCH 4/6] vmtrunc: unmap_mapping dropping i_mmap_lock Hugh Dickins
@ 2004-10-03 18:27 ` Hugh Dickins
2004-10-03 18:28 ` [PATCH 6/6] vmtrunc: bug if page_mapped Hugh Dickins
5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
Lorenzo Allegrucci, linux-kernel
Fix some unlikely races in respect of vm_truncate_count.
Firstly, it's supposed to be guarded by i_mmap_lock, but some places
copy a vma structure by *new_vma = *old_vma: if the compiler implements
that with a bytewise copy, new_vma->vm_truncate_count could be munged,
and new_vma later appear up-to-date when it's not; so set it properly
once under lock.
vma_link set vm_truncate_count to mapping->truncate_count when adding an
empty vma: if new vmas are being added profusely while vmtruncate is in
progess, this lets them be skipped without scanning.
vma_adjust has vm_truncate_count problem much like it had with anon_vma
under mprotect merge: when merging be careful not to leave vma marked as
up-to-date when it might not be, lest unmap_mapping_range in progress -
set vm_truncate_count 0 when in doubt. Similarly when mremap moving
ptes from one vma to another.
Cut a little code from __anon_vma_merge: now vma_adjust sets "importer"
in the remove_next case (to get its vm_truncate_count right), its
anon_vma is already linked by the time __anon_vma_merge is called.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
kernel/fork.c | 1 +
mm/mmap.c | 14 +++++++++++++-
mm/mremap.c | 16 ++++++++++------
mm/rmap.c | 9 +--------
4 files changed, 25 insertions(+), 15 deletions(-)
--- trunc4/kernel/fork.c 2004-10-02 18:22:28.000000000 +0100
+++ trunc5/kernel/fork.c 2004-10-03 16:37:39.670145872 +0100
@@ -221,6 +221,7 @@ static inline int dup_mmap(struct mm_str
/* insert tmp into the share list, just after mpnt */
spin_lock(&file->f_mapping->i_mmap_lock);
+ tmp->vm_truncate_count = mpnt->vm_truncate_count;
flush_dcache_mmap_lock(file->f_mapping);
vma_prio_tree_add(tmp, mpnt);
flush_dcache_mmap_unlock(file->f_mapping);
--- trunc4/mm/mmap.c 2004-10-02 18:22:29.000000000 +0100
+++ trunc5/mm/mmap.c 2004-10-03 16:37:39.673145416 +0100
@@ -307,8 +307,10 @@ static void vma_link(struct mm_struct *m
if (vma->vm_file)
mapping = vma->vm_file->f_mapping;
- if (mapping)
+ if (mapping) {
spin_lock(&mapping->i_mmap_lock);
+ vma->vm_truncate_count = mapping->truncate_count;
+ }
anon_vma_lock(vma);
__vma_link(mm, vma, prev, rb_link, rb_parent);
@@ -379,6 +381,7 @@ void vma_adjust(struct vm_area_struct *v
again: remove_next = 1 + (end > next->vm_end);
end = next->vm_end;
anon_vma = next->anon_vma;
+ importer = vma;
} else if (end > next->vm_start) {
/*
* vma expands, overlapping part of the next:
@@ -404,7 +407,16 @@ again: remove_next = 1 + (end > next->
if (!(vma->vm_flags & VM_NONLINEAR))
root = &mapping->i_mmap;
spin_lock(&mapping->i_mmap_lock);
+ if (importer &&
+ vma->vm_truncate_count != next->vm_truncate_count) {
+ /*
+ * unmap_mapping_range might be in progress:
+ * ensure that the expanding vma is rescanned.
+ */
+ importer->vm_truncate_count = 0;
+ }
if (insert) {
+ insert->vm_truncate_count = vma->vm_truncate_count;
/*
* Put into prio_tree now, so instantiated pages
* are visible to arm/parisc __flush_dcache_page
--- trunc4/mm/mremap.c 2004-10-02 18:22:29.000000000 +0100
+++ trunc5/mm/mremap.c 2004-10-03 16:37:39.675145112 +0100
@@ -82,7 +82,7 @@ static inline pte_t *alloc_one_pte_map(s
static int
move_one_page(struct vm_area_struct *vma, unsigned long old_addr,
- unsigned long new_addr)
+ struct vm_area_struct *new_vma, unsigned long new_addr)
{
struct address_space *mapping = NULL;
struct mm_struct *mm = vma->vm_mm;
@@ -98,6 +98,9 @@ move_one_page(struct vm_area_struct *vma
*/
mapping = vma->vm_file->f_mapping;
spin_lock(&mapping->i_mmap_lock);
+ if (new_vma->vm_truncate_count &&
+ new_vma->vm_truncate_count != vma->vm_truncate_count)
+ new_vma->vm_truncate_count = 0;
}
spin_lock(&mm->page_table_lock);
@@ -144,8 +147,8 @@ move_one_page(struct vm_area_struct *vma
}
static unsigned long move_page_tables(struct vm_area_struct *vma,
- unsigned long new_addr, unsigned long old_addr,
- unsigned long len)
+ unsigned long old_addr, struct vm_area_struct *new_vma,
+ unsigned long new_addr, unsigned long len)
{
unsigned long offset;
@@ -157,7 +160,8 @@ static unsigned long move_page_tables(st
* only a few pages.. This also makes error recovery easier.
*/
for (offset = 0; offset < len; offset += PAGE_SIZE) {
- if (move_one_page(vma, old_addr+offset, new_addr+offset) < 0)
+ if (move_one_page(vma, old_addr + offset,
+ new_vma, new_addr + offset) < 0)
break;
cond_resched();
}
@@ -188,14 +192,14 @@ static unsigned long move_vma(struct vm_
if (!new_vma)
return -ENOMEM;
- moved_len = move_page_tables(vma, new_addr, old_addr, old_len);
+ moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
if (moved_len < old_len) {
/*
* On error, move entries back from new area to old,
* which will succeed since page tables still there,
* and then proceed to unmap new area instead of old.
*/
- move_page_tables(new_vma, old_addr, new_addr, moved_len);
+ move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
vma = new_vma;
old_len = new_len;
old_addr = new_addr;
--- trunc4/mm/rmap.c 2004-10-02 18:22:29.000000000 +0100
+++ trunc5/mm/rmap.c 2004-10-03 16:37:39.676144960 +0100
@@ -120,14 +120,7 @@ int anon_vma_prepare(struct vm_area_stru
void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
{
- if (!vma->anon_vma) {
- BUG_ON(!next->anon_vma);
- vma->anon_vma = next->anon_vma;
- list_add(&vma->anon_vma_node, &next->anon_vma_node);
- } else {
- /* if they're both non-null they must be the same */
- BUG_ON(vma->anon_vma != next->anon_vma);
- }
+ BUG_ON(vma->anon_vma != next->anon_vma);
list_del(&next->anon_vma_node);
}
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 6/6] vmtrunc: bug if page_mapped
2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
` (4 preceding siblings ...)
2004-10-03 18:27 ` [PATCH 5/6] vmtrunc: vm_truncate_count race caution Hugh Dickins
@ 2004-10-03 18:28 ` Hugh Dickins
5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2004-10-03 18:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
Lorenzo Allegrucci, linux-kernel
If unmap_mapping_range (and mapping->truncate_count) are doing their
jobs right, truncate_complete_page should never find the page mapped:
add BUG_ON for our immediate testing, but this patch should probably
not go to mainline - a mapped page here is not a catastrophe.
Unsigned-off-by: Hugh Dickins <hugh@veritas.com>
---
mm/truncate.c | 1 +
1 files changed, 1 insertion(+)
--- trunc5/mm/truncate.c 2004-10-02 18:22:29.000000000 +0100
+++ trunc6/mm/truncate.c 2004-10-03 16:37:39.677144808 +0100
@@ -45,6 +45,7 @@ static inline void truncate_partial_page
static void
truncate_complete_page(struct address_space *mapping, struct page *page)
{
+ BUG_ON(page_mapped(page));
if (page->mapping != mapping)
return;
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes
2004-10-03 18:24 ` [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes Hugh Dickins
@ 2004-10-03 18:36 ` Ingo Molnar
0 siblings, 0 replies; 8+ messages in thread
From: Ingo Molnar @ 2004-10-03 18:36 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Andrea Arcangeli, Rajesh Venkatasubramanian,
Lorenzo Allegrucci, linux-kernel
* Hugh Dickins <hugh@veritas.com> wrote:
> The low-latency unmap_vmas patch silently moved the zap_bytes test
> after the TLB finish and lockbreak and regather: why? That not only
> makes zap_bytes redundant (might as well use ZAP_BLOCK_SIZE), it makes
> the unmap_vmas level redundant too - it's all about saving TLB flushes
> when unmapping a series of small vmas.
the problem was latency generated by pages queueing up in tlb_gather's
queue and being freed in one loop, causing latencies. So there was no
latency-break, only a shifting of the queue.
> Move zap_bytes test back before the lockbreak, and delete the curious
> comment that a small zap block size doesn't matter: it's true
> need_flush prevents TLB flush when no page has been unmapped, but
> unmapping pages in small blocks involves many more TLB flushes than in
> large blocks.
could we perhaps free those pages outside of the lock? That would be
just as good for me.
Ingo
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2004-10-03 18:43 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-10-03 18:21 [PATCH 0/6] vmtrunc: lowlat vmtruncate drop i_mmap_lock Hugh Dickins
2004-10-03 18:23 ` [PATCH 1/6] vmtrunc: truncate_count not atomic Hugh Dickins
2004-10-03 18:24 ` [PATCH 2/6] vmtrunc: restore unmap_vmas zap_bytes Hugh Dickins
2004-10-03 18:36 ` Ingo Molnar
2004-10-03 18:25 ` [PATCH 3/6] vmtrunc: unmap_mapping_range_tree Hugh Dickins
2004-10-03 18:26 ` [PATCH 4/6] vmtrunc: unmap_mapping dropping i_mmap_lock Hugh Dickins
2004-10-03 18:27 ` [PATCH 5/6] vmtrunc: vm_truncate_count race caution Hugh Dickins
2004-10-03 18:28 ` [PATCH 6/6] vmtrunc: bug if page_mapped Hugh Dickins
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.