* [PATCH] rmap 14 i_shared_lock fixes
@ 2004-04-27 23:59 Hugh Dickins
From: Hugh Dickins @ 2004-04-27 23:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rajesh Venkatasubramanian, linux-kernel
First of batch of six patches against 2.6.6-rc2-mm2, which introduce
Rajesh Venkatasubramanian's implementation of a radix priority search
tree of vmas, to handle object-based reverse mapping corner cases well.
rmap 14 i_shared_lock fixes
Start the sequence with a couple of outstanding i_shared_lock fixes.
Since i_shared_sem became i_shared_lock, we've had to shift and then
temporarily remove mremap move's protection against concurrent
truncation: if mremap moves ptes while unmap_mapping_range_list is
making its way through the vmas, there's a danger we'd move a pte from
an area yet to be cleaned back into an area already cleaned.
Now take the i_shared_lock along with the page_table_lock in
move_one_page. Replace page_table_present by get_one_pte_map, so we
know when it's necessary to allocate a new page table: in which case
we have to drop i_shared_lock, trylock, and perhaps reorder the locks
on the way back. Yet another fix: we must check for NULL dst before
pte_unmap(dst).
And over in rmap.c, try_to_unmap_file's cond_resched amidst its
lengthy nonlinear swapping was now causing might_sleep warnings:
moved it to a rather unsatisfactory and less frequent
cond_resched_lock on i_shared_lock when we reach the end of the list,
and added one before starting on the nonlinears too. The "cursor" may
become out-of-date if we do schedule, but I doubt it's worth
bothering about.
mm/mremap.c | 42 +++++++++++++++++++++++++++++++++---------
mm/rmap.c | 3 ++-
2 files changed, 35 insertions(+), 10 deletions(-)
--- 2.6.6-rc2-mm2/mm/mremap.c 2004-04-26 12:39:46.961068192 +0100
+++ rmap14/mm/mremap.c 2004-04-27 19:18:19.421286456 +0100
@@ -55,16 +55,18 @@ end:
return pte;
}
-static inline int page_table_present(struct mm_struct *mm, unsigned long addr)
+static pte_t *get_one_pte_map(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pmd_t *pmd;
pgd = pgd_offset(mm, addr);
if (pgd_none(*pgd))
- return 0;
+ return NULL;
pmd = pmd_offset(pgd, addr);
- return pmd_present(*pmd);
+ if (!pmd_present(*pmd))
+ return NULL;
+ return pte_offset_map(pmd, addr);
}
static inline pte_t *alloc_one_pte_map(struct mm_struct *mm, unsigned long addr)
@@ -97,11 +99,23 @@ static int
move_one_page(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr)
{
+ struct address_space *mapping = NULL;
struct mm_struct *mm = vma->vm_mm;
int error = 0;
pte_t *src, *dst;
+ if (vma->vm_file) {
+ /*
+ * Subtle point from Rajesh Venkatasubramanian: before
+ * moving file-based ptes, we must lock vmtruncate out,
+ * since it might clean the dst vma before the src vma,
+ * and we propagate stale pages into the dst afterward.
+ */
+ mapping = vma->vm_file->f_mapping;
+ spin_lock(&mapping->i_shared_lock);
+ }
spin_lock(&mm->page_table_lock);
+
src = get_one_pte_map_nested(mm, old_addr);
if (src) {
/*
@@ -109,13 +123,19 @@ move_one_page(struct vm_area_struct *vma
* memory allocation. If it does then we need to drop the
* atomic kmap
*/
- if (!page_table_present(mm, new_addr)) {
+ dst = get_one_pte_map(mm, new_addr);
+ if (unlikely(!dst)) {
pte_unmap_nested(src);
- src = NULL;
- }
- dst = alloc_one_pte_map(mm, new_addr);
- if (src == NULL)
+ if (mapping)
+ spin_unlock(&mapping->i_shared_lock);
+ dst = alloc_one_pte_map(mm, new_addr);
+ if (mapping && !spin_trylock(&mapping->i_shared_lock)) {
+ spin_unlock(&mm->page_table_lock);
+ spin_lock(&mapping->i_shared_lock);
+ spin_lock(&mm->page_table_lock);
+ }
src = get_one_pte_map_nested(mm, old_addr);
+ }
/*
* Since alloc_one_pte_map can drop and re-acquire
* page_table_lock, we should re-check the src entry...
@@ -132,9 +152,13 @@ move_one_page(struct vm_area_struct *vma
}
pte_unmap_nested(src);
}
- pte_unmap(dst);
+ if (dst)
+ pte_unmap(dst);
}
+
spin_unlock(&mm->page_table_lock);
+ if (mapping)
+ spin_unlock(&mapping->i_shared_lock);
return error;
}
--- 2.6.6-rc2-mm2/mm/rmap.c 2004-04-26 12:39:46.000000000 +0100
+++ rmap14/mm/rmap.c 2004-04-27 19:18:19.423286152 +0100
@@ -777,6 +777,7 @@ static inline int try_to_unmap_file(stru
* but even so use it as a guide to how hard we should try?
*/
rmap_unlock(page);
+ cond_resched_lock(&mapping->i_shared_lock);
max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
if (max_nl_cursor == 0)
@@ -799,13 +800,13 @@ static inline int try_to_unmap_file(stru
vma->vm_private_data = (void *) cursor;
if ((int)mapcount <= 0)
goto relock;
- cond_resched();
}
if (ret != SWAP_FAIL)
vma->vm_private_data =
(void *) max_nl_cursor;
ret = SWAP_AGAIN;
}
+ cond_resched_lock(&mapping->i_shared_lock);
max_nl_cursor += CLUSTER_SIZE;
} while (max_nl_cursor <= max_nl_size);
* [PATCH] rmap 15 vma_adjust
From: Hugh Dickins @ 2004-04-28 0:01 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rajesh Venkatasubramanian, linux-kernel
If file-based vmas are to be kept in a tree, according to the file
offsets they map, then adjusting the vma's start pgoff or its end
involves repositioning in the tree, while holding i_shared_lock
(and page_table_lock). We used to avoid that if possible, e.g.
when just moving end; but if we're heading that way, let's now tidy
up vma_merge and split_vma, and do all the locking and adjustment
in a new helper vma_adjust. And please, let's call the next vma
in vma_merge "next" rather than "prev".
Since these patches are diffed over 2.6.6-rc2-mm2, they include the
NUMA mpolicy mods which you'll have to remove to go earlier in the
series; sorry for that nuisance. I have intentionally changed the
one vma_mpol_equal to mpol_equal, to make the merge cases more alike.
include/linux/mm.h | 2
mm/mmap.c | 161 ++++++++++++++++++++++++++---------------------------
mm/mremap.c | 7 +-
3 files changed, 87 insertions(+), 83 deletions(-)
--- rmap14/include/linux/mm.h 2004-04-26 12:39:46.446146472 +0100
+++ rmap15/include/linux/mm.h 2004-04-27 19:18:31.303480088 +0100
@@ -552,6 +552,8 @@ extern void si_meminfo(struct sysinfo *
extern void si_meminfo_node(struct sysinfo *val, int nid);
/* mmap.c */
+extern void vma_adjust(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgoff_t pgoff, struct vm_area_struct *next);
extern void insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
struct rb_node **, struct rb_node *);
--- rmap14/mm/mmap.c 2004-04-26 12:39:46.958068648 +0100
+++ rmap15/mm/mmap.c 2004-04-27 19:18:31.307479480 +0100
@@ -65,13 +65,11 @@ EXPORT_SYMBOL(vm_committed_space);
* Requires inode->i_mapping->i_shared_lock
*/
static inline void
-__remove_shared_vm_struct(struct vm_area_struct *vma, struct inode *inode)
+__remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file)
{
- if (inode) {
- if (vma->vm_flags & VM_DENYWRITE)
- atomic_inc(&inode->i_writecount);
- list_del_init(&vma->shared);
- }
+ if (vma->vm_flags & VM_DENYWRITE)
+ atomic_inc(&file->f_dentry->d_inode->i_writecount);
+ list_del_init(&vma->shared);
}
/*
@@ -84,7 +82,7 @@ static void remove_shared_vm_struct(stru
if (file) {
struct address_space *mapping = file->f_mapping;
spin_lock(&mapping->i_shared_lock);
- __remove_shared_vm_struct(vma, file->f_dentry->d_inode);
+ __remove_shared_vm_struct(vma, file);
spin_unlock(&mapping->i_shared_lock);
}
}
@@ -318,6 +316,54 @@ __insert_vm_struct(struct mm_struct * mm
}
/*
+ * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that is
+ * already present in an i_mmap{_shared} tree without adjusting the tree.
+ * The following helper function should be used when such adjustments
+ * are necessary. The "next" vma (if any) is to be removed or inserted
+ * before we drop the necessary locks.
+ */
+void vma_adjust(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgoff_t pgoff, struct vm_area_struct *next)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct address_space *mapping = NULL;
+ struct file *file = vma->vm_file;
+
+ if (file) {
+ mapping = file->f_mapping;
+ spin_lock(&mapping->i_shared_lock);
+ }
+ spin_lock(&mm->page_table_lock);
+
+ vma->vm_start = start;
+ vma->vm_end = end;
+ vma->vm_pgoff = pgoff;
+
+ if (next) {
+ if (next == vma->vm_next) {
+ /*
+ * vma_merge has merged next into vma, and needs
+ * us to remove next before dropping the locks.
+ */
+ __vma_unlink(mm, next, vma);
+ if (file)
+ __remove_shared_vm_struct(next, file);
+ } else {
+ /*
+ * split_vma has split next from vma, and needs
+ * us to insert next before dropping the locks
+ * (next may either follow vma or precede it).
+ */
+ __insert_vm_struct(mm, next);
+ }
+ }
+
+ spin_unlock(&mm->page_table_lock);
+ if (mapping)
+ spin_unlock(&mapping->i_shared_lock);
+}
+
+/*
* If the vma has a ->close operation then the driver probably needs to release
* per-vma resources, so we don't attempt to merge those.
*/
@@ -389,90 +435,59 @@ static struct vm_area_struct *vma_merge(
struct file *file, unsigned long pgoff,
struct mempolicy *policy)
{
- spinlock_t *lock = &mm->page_table_lock;
- struct inode *inode = file ? file->f_dentry->d_inode : NULL;
- spinlock_t *i_shared_lock;
+ struct vm_area_struct *next;
/*
- * We later require that vma->vm_flags == vm_flags, so this tests
- * vma->vm_flags & VM_SPECIAL, too.
+ * We later require that vma->vm_flags == vm_flags,
+ * so this tests vma->vm_flags & VM_SPECIAL, too.
*/
if (vm_flags & VM_SPECIAL)
return NULL;
- i_shared_lock = file ? &file->f_mapping->i_shared_lock : NULL;
-
if (!prev) {
- prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb);
+ next = rb_entry(rb_parent, struct vm_area_struct, vm_rb);
goto merge_next;
}
+ next = prev->vm_next;
/*
* Can it merge with the predecessor?
*/
if (prev->vm_end == addr &&
- mpol_equal(vma_policy(prev), policy) &&
+ mpol_equal(vma_policy(prev), policy) &&
can_vma_merge_after(prev, vm_flags, file, pgoff)) {
- struct vm_area_struct *next;
- int need_up = 0;
-
- if (unlikely(file && prev->vm_next &&
- prev->vm_next->vm_file == file)) {
- spin_lock(i_shared_lock);
- need_up = 1;
- }
- spin_lock(lock);
- prev->vm_end = end;
-
/*
- * OK, it did. Can we now merge in the successor as well?
+ * OK, it can. Can we now merge in the successor as well?
*/
- next = prev->vm_next;
- if (next && prev->vm_end == next->vm_start &&
- vma_mpol_equal(prev, next) &&
+ if (next && end == next->vm_start &&
+ mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags, file,
pgoff, (end - addr) >> PAGE_SHIFT)) {
- prev->vm_end = next->vm_end;
- __vma_unlink(mm, next, prev);
- __remove_shared_vm_struct(next, inode);
- spin_unlock(lock);
- if (need_up)
- spin_unlock(i_shared_lock);
+ vma_adjust(prev, prev->vm_start,
+ next->vm_end, prev->vm_pgoff, next);
if (file)
fput(file);
-
mm->map_count--;
mpol_free(vma_policy(next));
kmem_cache_free(vm_area_cachep, next);
- return prev;
- }
- spin_unlock(lock);
- if (need_up)
- spin_unlock(i_shared_lock);
+ } else
+ vma_adjust(prev, prev->vm_start,
+ end, prev->vm_pgoff, NULL);
return prev;
}
/*
- * Can this new request be merged in front of prev->vm_next?
+ * Can this new request be merged in front of next?
*/
- prev = prev->vm_next;
- if (prev) {
+ if (next) {
merge_next:
- if (!mpol_equal(policy, vma_policy(prev)))
- return 0;
- if (!can_vma_merge_before(prev, vm_flags, file,
- pgoff, (end - addr) >> PAGE_SHIFT))
- return NULL;
- if (end == prev->vm_start) {
- if (file)
- spin_lock(i_shared_lock);
- spin_lock(lock);
- prev->vm_start = addr;
- prev->vm_pgoff -= (end - addr) >> PAGE_SHIFT;
- spin_unlock(lock);
- if (file)
- spin_unlock(i_shared_lock);
- return prev;
+ if (end == next->vm_start &&
+ mpol_equal(policy, vma_policy(next)) &&
+ can_vma_merge_before(next, vm_flags, file,
+ pgoff, (end - addr) >> PAGE_SHIFT)) {
+ vma_adjust(next, addr,
+ next->vm_end, pgoff, NULL);
+ return next;
}
}
@@ -1212,7 +1227,6 @@ int split_vma(struct mm_struct * mm, str
{
struct mempolicy *pol;
struct vm_area_struct *new;
- struct address_space *mapping = NULL;
if (mm->map_count >= sysctl_max_map_count)
return -ENOMEM;
@@ -1246,24 +1260,11 @@ int split_vma(struct mm_struct * mm, str
if (new->vm_ops && new->vm_ops->open)
new->vm_ops->open(new);
- if (vma->vm_file)
- mapping = vma->vm_file->f_mapping;
-
- if (mapping)
- spin_lock(&mapping->i_shared_lock);
- spin_lock(&mm->page_table_lock);
-
- if (new_below) {
- vma->vm_start = addr;
- vma->vm_pgoff += ((addr - new->vm_start) >> PAGE_SHIFT);
- } else
- vma->vm_end = addr;
-
- __insert_vm_struct(mm, new);
-
- spin_unlock(&mm->page_table_lock);
- if (mapping)
- spin_unlock(&mapping->i_shared_lock);
+ if (new_below)
+ vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
+ ((addr - new->vm_start) >> PAGE_SHIFT), new);
+ else
+ vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
return 0;
}
--- rmap14/mm/mremap.c 2004-04-27 19:18:19.421286456 +0100
+++ rmap15/mm/mremap.c 2004-04-27 19:18:31.308479328 +0100
@@ -391,9 +391,10 @@ unsigned long do_mremap(unsigned long ad
/* can we just expand the current mapping? */
if (max_addr - addr >= new_len) {
int pages = (new_len - old_len) >> PAGE_SHIFT;
- spin_lock(&vma->vm_mm->page_table_lock);
- vma->vm_end = addr + new_len;
- spin_unlock(&vma->vm_mm->page_table_lock);
+
+ vma_adjust(vma, vma->vm_start,
+ addr + new_len, vma->vm_pgoff, NULL);
+
current->mm->total_vm += pages;
if (vma->vm_flags & VM_LOCKED) {
current->mm->locked_vm += pages;
* [PATCH] rmap 16 pretend prio_tree
From: Hugh Dickins @ 2004-04-28 0:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rajesh Venkatasubramanian, linux-kernel
Pave the way for prio_tree by switching over to its interfaces,
but actually still implement them with the same old lists as before.
Most of the vma_prio_tree interfaces are straightforward. The
interesting one is vma_prio_tree_next, used to search the tree for all
vmas which overlap the given range: unlike the list_for_each_entry it
replaces, it does not find every vma, just those that match.
But this does leave handling of nonlinear vmas in a very unsatisfactory
state: for now we have to search again over the maximum range to find
all the nonlinear vmas which might contain a page, which of course
takes away the point of the tree. Fixed in a later patch of this batch.
There is no need to initialize vma linkage all over; just do it before
inserting the vma in list or tree. /proc/pid/statm had an odd test
for its shared count: simplified to an equivalent test on vm_file.
fs/exec.c | 1
fs/hugetlbfs/inode.c | 30 +++++-------------
fs/inode.c | 4 +-
fs/proc/task_mmu.c | 2 -
include/linux/fs.h | 13 ++++---
include/linux/mm.h | 31 +++++++++++++++++--
include/linux/prio_tree.h | 27 ++++++++++++++++
kernel/fork.c | 4 +-
mm/memory.c | 41 +++++++++++++++++--------
mm/mmap.c | 62 +++++++++++++++++++++++++++++---------
mm/rmap.c | 75 +++++++++++++++++++++++-----------------------
11 files changed, 191 insertions(+), 99 deletions(-)
--- rmap15/fs/exec.c 2004-04-26 12:39:43.886535592 +0100
+++ rmap16/fs/exec.c 2004-04-27 19:18:42.765737560 +0100
@@ -427,7 +427,6 @@ int setup_arg_pages(struct linux_binprm
mpnt->vm_pgoff = 0;
mpnt->vm_file = NULL;
mpol_set_vma_default(mpnt);
- INIT_LIST_HEAD(&mpnt->shared);
mpnt->vm_private_data = (void *) 0;
insert_vm_struct(mm, mpnt);
mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
--- rmap15/fs/hugetlbfs/inode.c 2004-04-26 12:39:44.058509448 +0100
+++ rmap16/fs/hugetlbfs/inode.c 2004-04-27 19:18:42.767737256 +0100
@@ -267,39 +267,27 @@ static void hugetlbfs_drop_inode(struct
* vma->vm_pgoff is in PAGE_SIZE units.
*/
static void
-hugetlb_vmtruncate_list(struct list_head *list, unsigned long h_pgoff)
+hugetlb_vmtruncate_list(struct prio_tree_root *root, unsigned long h_pgoff)
{
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = NULL;
+ struct prio_tree_iter iter;
- list_for_each_entry(vma, list, shared) {
+ while ((vma = vma_prio_tree_next(vma, root, &iter,
+ h_pgoff, ULONG_MAX)) != NULL) {
unsigned long h_vm_pgoff;
unsigned long v_length;
- unsigned long h_length;
unsigned long v_offset;
h_vm_pgoff = vma->vm_pgoff << (HPAGE_SHIFT - PAGE_SHIFT);
v_length = vma->vm_end - vma->vm_start;
- h_length = v_length >> HPAGE_SHIFT;
v_offset = (h_pgoff - h_vm_pgoff) << HPAGE_SHIFT;
/*
* Is this VMA fully outside the truncation point?
*/
- if (h_vm_pgoff >= h_pgoff) {
- zap_hugepage_range(vma, vma->vm_start, v_length);
- continue;
- }
-
- /*
- * Is this VMA fully inside the truncaton point?
- */
- if (h_vm_pgoff + (v_length >> HPAGE_SHIFT) <= h_pgoff)
- continue;
+ if (h_vm_pgoff >= h_pgoff)
+ v_offset = 0;
- /*
- * The VMA straddles the truncation point. v_offset is the
- * offset (in bytes) into the VMA where the point lies.
- */
zap_hugepage_range(vma,
vma->vm_start + v_offset,
v_length - v_offset);
@@ -322,9 +310,9 @@ static int hugetlb_vmtruncate(struct ino
inode->i_size = offset;
spin_lock(&mapping->i_shared_lock);
- if (!list_empty(&mapping->i_mmap))
+ if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- if (!list_empty(&mapping->i_mmap_shared))
+ if (!prio_tree_empty(&mapping->i_mmap_shared))
hugetlb_vmtruncate_list(&mapping->i_mmap_shared, pgoff);
spin_unlock(&mapping->i_shared_lock);
truncate_hugepages(mapping, offset);
--- rmap15/fs/inode.c 2004-04-26 12:39:44.062508840 +0100
+++ rmap16/fs/inode.c 2004-04-27 19:18:42.769736952 +0100
@@ -188,8 +188,8 @@ void inode_init_once(struct inode *inode
atomic_set(&inode->i_data.truncate_count, 0);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
- INIT_LIST_HEAD(&inode->i_data.i_mmap);
- INIT_LIST_HEAD(&inode->i_data.i_mmap_shared);
+ INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
+ INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap_shared);
spin_lock_init(&inode->i_lock);
i_size_ordered_init(inode);
}
--- rmap15/fs/proc/task_mmu.c 2004-02-04 02:45:17.000000000 +0000
+++ rmap16/fs/proc/task_mmu.c 2004-04-27 19:18:42.770736800 +0100
@@ -65,7 +65,7 @@ int task_statm(struct mm_struct *mm, int
*shared += pages;
continue;
}
- if (vma->vm_flags & VM_SHARED || !list_empty(&vma->shared))
+ if (vma->vm_file)
*shared += pages;
if (vma->vm_flags & VM_EXECUTABLE)
*text += pages;
--- rmap15/include/linux/fs.h 2004-04-26 12:39:46.364158936 +0100
+++ rmap16/include/linux/fs.h 2004-04-27 19:18:42.773736344 +0100
@@ -18,6 +18,7 @@
#include <linux/stat.h>
#include <linux/cache.h>
#include <linux/radix-tree.h>
+#include <linux/prio_tree.h>
#include <linux/kobject.h>
#include <asm/atomic.h>
#include <linux/audit.h>
@@ -329,9 +330,9 @@ struct address_space {
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
struct address_space_operations *a_ops; /* methods */
- struct list_head i_mmap; /* list of private mappings */
- struct list_head i_mmap_shared; /* list of shared mappings */
- spinlock_t i_shared_lock; /* protect both above lists */
+ struct prio_tree_root i_mmap; /* tree of private mappings */
+ struct prio_tree_root i_mmap_shared; /* tree of shared mappings */
+ spinlock_t i_shared_lock; /* protect trees & list above */
atomic_t truncate_count; /* Cover race condition with truncate */
unsigned long flags; /* error bits/gfp mask */
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
@@ -379,8 +380,8 @@ int mapping_tagged(struct address_space
*/
static inline int mapping_mapped(struct address_space *mapping)
{
- return !list_empty(&mapping->i_mmap) ||
- !list_empty(&mapping->i_mmap_shared);
+ return !prio_tree_empty(&mapping->i_mmap) ||
+ !prio_tree_empty(&mapping->i_mmap_shared);
}
/*
@@ -391,7 +392,7 @@ static inline int mapping_mapped(struct
*/
static inline int mapping_writably_mapped(struct address_space *mapping)
{
- return !list_empty(&mapping->i_mmap_shared);
+ return !prio_tree_empty(&mapping->i_mmap_shared);
}
/*
--- rmap15/include/linux/mm.h 2004-04-27 19:18:31.303480088 +0100
+++ rmap16/include/linux/mm.h 2004-04-27 19:18:42.775736040 +0100
@@ -11,6 +11,7 @@
#include <linux/list.h>
#include <linux/mmzone.h>
#include <linux/rbtree.h>
+#include <linux/prio_tree.h>
#include <linux/fs.h>
#include <linux/mempolicy.h>
@@ -69,8 +70,7 @@ struct vm_area_struct {
/*
* For areas with an address space and backing store,
- * one of the address_space->i_mmap{,shared} lists,
- * for shm areas, the list of attaches, otherwise unused.
+ * one of the address_space->i_mmap{,shared} trees.
*/
struct list_head shared;
@@ -551,6 +551,33 @@ extern void show_mem(void);
extern void si_meminfo(struct sysinfo * val);
extern void si_meminfo_node(struct sysinfo *val, int nid);
+static inline void vma_prio_tree_init(struct vm_area_struct *vma)
+{
+ INIT_LIST_HEAD(&vma->shared);
+}
+
+static inline void vma_prio_tree_add(struct vm_area_struct *vma,
+ struct vm_area_struct *old)
+{
+ list_add(&vma->shared, &old->shared);
+}
+
+static inline void vma_prio_tree_insert(struct vm_area_struct *vma,
+ struct prio_tree_root *root)
+{
+ list_add_tail(&vma->shared, &root->list);
+}
+
+static inline void vma_prio_tree_remove(struct vm_area_struct *vma,
+ struct prio_tree_root *root)
+{
+ list_del_init(&vma->shared);
+}
+
+struct vm_area_struct *vma_prio_tree_next(
+ struct vm_area_struct *, struct prio_tree_root *,
+ struct prio_tree_iter *, pgoff_t begin, pgoff_t end);
+
/* mmap.c */
extern void vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *next);
--- rmap15/include/linux/prio_tree.h 1970-01-01 01:00:00.000000000 +0100
+++ rmap16/include/linux/prio_tree.h 2004-04-27 19:18:42.776735888 +0100
@@ -0,0 +1,27 @@
+#ifndef _LINUX_PRIO_TREE_H
+#define _LINUX_PRIO_TREE_H
+/*
+ * Dummy version of include/linux/prio_tree.h, just for this patch:
+ * no radix priority search tree whatsoever, just implement interfaces
+ * using the old lists.
+ */
+
+struct prio_tree_root {
+ struct list_head list;
+};
+
+struct prio_tree_iter {
+ int not_used_yet;
+};
+
+#define INIT_PRIO_TREE_ROOT(ptr) \
+do { \
+ INIT_LIST_HEAD(&(ptr)->list); \
+} while (0) \
+
+static inline int prio_tree_empty(const struct prio_tree_root *root)
+{
+ return list_empty(&root->list);
+}
+
+#endif /* _LINUX_PRIO_TREE_H */
--- rmap15/kernel/fork.c 2004-04-26 12:39:46.742101480 +0100
+++ rmap16/kernel/fork.c 2004-04-27 19:18:42.777735736 +0100
@@ -321,7 +321,7 @@ static inline int dup_mmap(struct mm_str
tmp->vm_mm = mm;
tmp->vm_next = NULL;
file = tmp->vm_file;
- INIT_LIST_HEAD(&tmp->shared);
+ vma_prio_tree_init(tmp);
if (file) {
struct inode *inode = file->f_dentry->d_inode;
get_file(file);
@@ -330,7 +330,7 @@ static inline int dup_mmap(struct mm_str
/* insert tmp into the share list, just after mpnt */
spin_lock(&file->f_mapping->i_shared_lock);
- list_add(&tmp->shared, &mpnt->shared);
+ vma_prio_tree_add(tmp, mpnt);
spin_unlock(&file->f_mapping->i_shared_lock);
}
--- rmap15/mm/memory.c 2004-04-26 12:39:46.947070320 +0100
+++ rmap16/mm/memory.c 2004-04-27 19:18:42.780735280 +0100
@@ -1118,25 +1118,20 @@ no_new_page:
/*
* Helper function for unmap_mapping_range().
*/
-static void unmap_mapping_range_list(struct list_head *head,
+static void unmap_mapping_range_list(struct prio_tree_root *root,
struct zap_details *details)
{
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = NULL;
+ struct prio_tree_iter iter;
pgoff_t vba, vea, zba, zea;
- list_for_each_entry(vma, head, shared) {
- if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
- details->nonlinear_vma = vma;
- zap_page_range(vma, vma->vm_start,
- vma->vm_end - vma->vm_start, details);
- details->nonlinear_vma = NULL;
+ while ((vma = vma_prio_tree_next(vma, root, &iter,
+ details->first_index, details->last_index)) != NULL) {
+ if (unlikely(vma->vm_flags & VM_NONLINEAR))
continue;
- }
vba = vma->vm_pgoff;
vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
/* Assume for now that PAGE_CACHE_SHIFT == PAGE_SHIFT */
- if (vba > details->last_index || vea < details->first_index)
- continue; /* Mapping disjoint from hole. */
zba = details->first_index;
if (zba < vba)
zba = vba;
@@ -1149,6 +1144,22 @@ static void unmap_mapping_range_list(str
}
}
+static void unmap_nonlinear_range_list(struct prio_tree_root *root,
+ struct zap_details *details)
+{
+ struct vm_area_struct *vma = NULL;
+ struct prio_tree_iter iter;
+
+ while ((vma = vma_prio_tree_next(vma, root, &iter,
+ 0, ULONG_MAX)) != NULL) {
+ if (!(vma->vm_flags & VM_NONLINEAR))
+ continue;
+ details->nonlinear_vma = vma;
+ zap_page_range(vma, vma->vm_start,
+ vma->vm_end - vma->vm_start, details);
+ }
+}
+
/**
* unmap_mapping_range - unmap the portion of all mmaps
* in the specified address_space corresponding to the specified
@@ -1191,14 +1202,18 @@ void unmap_mapping_range(struct address_
spin_lock(&mapping->i_shared_lock);
/* Protect against page fault */
atomic_inc(&mapping->truncate_count);
- if (unlikely(!list_empty(&mapping->i_mmap)))
+
+ if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
unmap_mapping_range_list(&mapping->i_mmap, &details);
/* Don't waste time to check mapping on fully shared vmas */
details.check_mapping = NULL;
- if (unlikely(!list_empty(&mapping->i_mmap_shared)))
+ if (unlikely(!prio_tree_empty(&mapping->i_mmap_shared))) {
unmap_mapping_range_list(&mapping->i_mmap_shared, &details);
+ unmap_nonlinear_range_list(&mapping->i_mmap_shared, &details);
+ }
+
spin_unlock(&mapping->i_shared_lock);
}
EXPORT_SYMBOL(unmap_mapping_range);
--- rmap15/mm/mmap.c 2004-04-27 19:18:31.307479480 +0100
+++ rmap16/mm/mmap.c 2004-04-27 19:18:42.784734672 +0100
@@ -64,12 +64,16 @@ EXPORT_SYMBOL(vm_committed_space);
/*
* Requires inode->i_mapping->i_shared_lock
*/
-static inline void
-__remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file)
+static inline void __remove_shared_vm_struct(struct vm_area_struct *vma,
+ struct file *file, struct address_space *mapping)
{
if (vma->vm_flags & VM_DENYWRITE)
atomic_inc(&file->f_dentry->d_inode->i_writecount);
- list_del_init(&vma->shared);
+
+ if (vma->vm_flags & VM_SHARED)
+ vma_prio_tree_remove(vma, &mapping->i_mmap_shared);
+ else
+ vma_prio_tree_remove(vma, &mapping->i_mmap);
}
/*
@@ -82,7 +86,7 @@ static void remove_shared_vm_struct(stru
if (file) {
struct address_space *mapping = file->f_mapping;
spin_lock(&mapping->i_shared_lock);
- __remove_shared_vm_struct(vma, file);
+ __remove_shared_vm_struct(vma, file, mapping);
spin_unlock(&mapping->i_shared_lock);
}
}
@@ -257,9 +261,9 @@ static inline void __vma_link_file(struc
atomic_dec(&file->f_dentry->d_inode->i_writecount);
if (vma->vm_flags & VM_SHARED)
- list_add_tail(&vma->shared, &mapping->i_mmap_shared);
+ vma_prio_tree_insert(vma, &mapping->i_mmap_shared);
else
- list_add_tail(&vma->shared, &mapping->i_mmap);
+ vma_prio_tree_insert(vma, &mapping->i_mmap);
}
}
@@ -268,6 +272,7 @@ __vma_link(struct mm_struct *mm, struct
struct vm_area_struct *prev, struct rb_node **rb_link,
struct rb_node *rb_parent)
{
+ vma_prio_tree_init(vma);
__vma_link_list(mm, vma, prev, rb_parent);
__vma_link_rb(mm, vma, rb_link, rb_parent);
__vma_link_file(vma);
@@ -316,6 +321,31 @@ __insert_vm_struct(struct mm_struct * mm
}
/*
+ * Dummy version of vma_prio_tree_next, just for this patch:
+ * no radix priority search tree whatsoever, just implement interface
+ * using the old lists: return the next vma overlapping [begin,end].
+ */
+struct vm_area_struct *vma_prio_tree_next(
+ struct vm_area_struct *vma, struct prio_tree_root *root,
+ struct prio_tree_iter *iter, pgoff_t begin, pgoff_t end)
+{
+ struct list_head *next;
+ pgoff_t vba, vea;
+
+ next = vma? vma->shared.next: root->list.next;
+ while (next != &root->list) {
+ vma = list_entry(next, struct vm_area_struct, shared);
+ vba = vma->vm_pgoff;
+ vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
+ /* Return vma if it overlaps [begin,end] */
+ if (vba <= end && vea >= begin)
+ return vma;
+ next = next->next;
+ }
+ return NULL;
+}
+
+/*
* We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that is
* already present in an i_mmap{_shared} tree without adjusting the tree.
* The following helper function should be used when such adjustments
@@ -327,17 +357,28 @@ void vma_adjust(struct vm_area_struct *v
{
struct mm_struct *mm = vma->vm_mm;
struct address_space *mapping = NULL;
+ struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
if (file) {
mapping = file->f_mapping;
+ if (vma->vm_flags & VM_SHARED)
+ root = &mapping->i_mmap_shared;
+ else
+ root = &mapping->i_mmap;
spin_lock(&mapping->i_shared_lock);
}
spin_lock(&mm->page_table_lock);
+ if (root)
+ vma_prio_tree_remove(vma, root);
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
+ if (root) {
+ vma_prio_tree_init(vma);
+ vma_prio_tree_insert(vma, root);
+ }
if (next) {
if (next == vma->vm_next) {
@@ -347,7 +388,7 @@ void vma_adjust(struct vm_area_struct *v
*/
__vma_unlink(mm, next, vma);
if (file)
- __remove_shared_vm_struct(next, file);
+ __remove_shared_vm_struct(next, file, mapping);
} else {
/*
* split_vma has split next from vma, and needs
@@ -675,7 +716,6 @@ munmap_back:
vma->vm_private_data = NULL;
vma->vm_next = NULL;
mpol_set_vma_default(vma);
- INIT_LIST_HEAD(&vma->shared);
if (file) {
error = -EINVAL;
@@ -1238,8 +1278,6 @@ int split_vma(struct mm_struct * mm, str
/* most fields are the same, copy all, and then fixup */
*new = *vma;
- INIT_LIST_HEAD(&new->shared);
-
if (new_below)
new->vm_end = addr;
else {
@@ -1432,10 +1470,7 @@ unsigned long do_brk(unsigned long addr,
vma->vm_file = NULL;
vma->vm_private_data = NULL;
mpol_set_vma_default(vma);
- INIT_LIST_HEAD(&vma->shared);
-
vma_link(mm, vma, prev, rb_link, rb_parent);
-
out:
mm->total_vm += len >> PAGE_SHIFT;
if (flags & VM_LOCKED) {
@@ -1546,7 +1581,6 @@ struct vm_area_struct *copy_vma(struct v
return NULL;
}
vma_set_policy(new_vma, pol);
- INIT_LIST_HEAD(&new_vma->shared);
new_vma->vm_start = addr;
new_vma->vm_end = addr + len;
new_vma->vm_pgoff = pgoff;
--- rmap15/mm/rmap.c 2004-04-27 19:18:19.423286152 +0100
+++ rmap16/mm/rmap.c 2004-04-27 19:18:42.786734368 +0100
@@ -159,8 +159,8 @@ unsigned long vma_address(struct vm_area
unsigned long address;
address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
- return (address >= vma->vm_start && address < vma->vm_end)?
- address: -EFAULT;
+ BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+ return address;
}
/**
@@ -306,7 +306,8 @@ static inline int page_referenced_file(s
unsigned int mapcount = page->mapcount;
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = NULL;
+ struct prio_tree_iter iter;
unsigned long address;
int referenced = 0;
int failed = 0;
@@ -314,16 +315,15 @@ static inline int page_referenced_file(s
if (!spin_trylock(&mapping->i_shared_lock))
return 0;
- list_for_each_entry(vma, &mapping->i_mmap, shared) {
- address = vma_address(vma, pgoff);
- if (address == -EFAULT)
- continue;
+ while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap,
+ &iter, pgoff, pgoff)) != NULL) {
if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
== (VM_LOCKED|VM_MAYSHARE)) {
referenced++;
goto out;
}
if (vma->vm_mm->rss) {
+ address = vma_address(vma, pgoff);
referenced += page_referenced_one(page,
vma->vm_mm, address, &mapcount, &failed);
if (!mapcount)
@@ -331,19 +331,18 @@ static inline int page_referenced_file(s
}
}
- list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
+ &iter, pgoff, pgoff)) != NULL) {
if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
failed++;
continue;
}
- address = vma_address(vma, pgoff);
- if (address == -EFAULT)
- continue;
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
referenced++;
goto out;
}
if (vma->vm_mm->rss) {
+ address = vma_address(vma, pgoff);
referenced += page_referenced_one(page,
vma->vm_mm, address, &mapcount, &failed);
if (!mapcount)
@@ -351,6 +350,7 @@ static inline int page_referenced_file(s
}
}
+ /* Hmm, but what of the nonlinears which pgoff,pgoff skipped? */
WARN_ON(!failed);
out:
spin_unlock(&mapping->i_shared_lock);
@@ -716,7 +716,8 @@ static inline int try_to_unmap_file(stru
unsigned int mapcount = page->mapcount;
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = NULL;
+ struct prio_tree_iter iter;
unsigned long address;
int ret = SWAP_AGAIN;
unsigned long cursor;
@@ -726,11 +727,10 @@ static inline int try_to_unmap_file(stru
if (!spin_trylock(&mapping->i_shared_lock))
return ret;
- list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap,
+ &iter, pgoff, pgoff)) != NULL) {
if (vma->vm_mm->rss) {
address = vma_address(vma, pgoff);
- if (address == -EFAULT)
- continue;
ret = try_to_unmap_one(page,
vma->vm_mm, address, &mapcount, vma);
if (ret == SWAP_FAIL || !mapcount)
@@ -738,27 +738,12 @@ static inline int try_to_unmap_file(stru
}
}
- list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
- if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
- /*
- * Defer unmapping nonlinear to the next loop,
- * but take notes while we're here e.g. don't
- * want to loop again when no nonlinear vmas.
- */
- if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
- continue;
- cursor = (unsigned long) vma->vm_private_data;
- if (cursor > max_nl_cursor)
- max_nl_cursor = cursor;
- cursor = vma->vm_end - vma->vm_start;
- if (cursor > max_nl_size)
- max_nl_size = cursor;
+ while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
+ &iter, pgoff, pgoff)) != NULL) {
+ if (unlikely(vma->vm_flags & VM_NONLINEAR))
continue;
- }
if (vma->vm_mm->rss) {
address = vma_address(vma, pgoff);
- if (address == -EFAULT)
- continue;
ret = try_to_unmap_one(page,
vma->vm_mm, address, &mapcount, vma);
if (ret == SWAP_FAIL || !mapcount)
@@ -766,7 +751,20 @@ static inline int try_to_unmap_file(stru
}
}
- if (max_nl_size == 0) /* no nonlinear vmas of this file */
+ while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
+ &iter, 0, ULONG_MAX)) != NULL) {
+ if (VM_NONLINEAR != (vma->vm_flags &
+ (VM_NONLINEAR|VM_LOCKED|VM_RESERVED)))
+ continue;
+ cursor = (unsigned long) vma->vm_private_data;
+ if (cursor > max_nl_cursor)
+ max_nl_cursor = cursor;
+ cursor = vma->vm_end - vma->vm_start;
+ if (cursor > max_nl_size)
+ max_nl_size = cursor;
+ }
+
+ if (max_nl_size == 0) /* any nonlinears locked or reserved */
goto out;
/*
@@ -784,9 +782,10 @@ static inline int try_to_unmap_file(stru
max_nl_cursor = CLUSTER_SIZE;
do {
- list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
+ &iter, 0, ULONG_MAX)) != NULL) {
if (VM_NONLINEAR != (vma->vm_flags &
- (VM_NONLINEAR|VM_LOCKED|VM_RESERVED)))
+ (VM_NONLINEAR|VM_LOCKED|VM_RESERVED)))
continue;
cursor = (unsigned long) vma->vm_private_data;
while (vma->vm_mm->rss &&
@@ -815,7 +814,9 @@ static inline int try_to_unmap_file(stru
* in locked vmas). Reset cursor on all unreserved nonlinear
* vmas, now forgetting on which ones it had fallen behind.
*/
- list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ vma = NULL; /* it is already, but above loop might change */
+ while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
+ &iter, 0, ULONG_MAX)) != NULL) {
if ((vma->vm_flags & (VM_NONLINEAR|VM_RESERVED)) ==
VM_NONLINEAR)
vma->vm_private_data = 0;
* [PATCH] rmap 17 real prio_tree
2004-04-27 23:59 [PATCH] rmap 14 i_shared_lock fixes Hugh Dickins
2004-04-28 0:01 ` [PATCH] rmap 15 vma_adjust Hugh Dickins
2004-04-28 0:02 ` [PATCH] rmap 16 pretend prio_tree Hugh Dickins
@ 2004-04-28 0:03 ` Hugh Dickins
2004-04-28 0:04 ` [PATCH] rmap 18 i_mmap_nonlinear Hugh Dickins
2004-04-28 0:06 ` [PATCH] rmap 19 arch prio_tree Hugh Dickins
4 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2004-04-28 0:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rajesh Venkatasubramanian, linux-kernel
Rajesh Venkatasubramanian's implementation of a radix priority search
tree of vmas, to handle object-based reverse mapping corner cases well.
Amongst the objections to object-based rmap were test cases by akpm and
by mingo, in which large numbers of vmas mapping disjoint or overlapping
parts of a file showed strikingly poor performance of the i_mmap lists.
Perhaps those tests are irrelevant in the real world? We cannot be too
sure: the prio_tree is well-suited to solving precisely that problem,
so unless it turns out to bring too much overhead, let's include it.
Why is this prio_tree.c placed in mm rather than lib? See GET_INDEX:
this implementation is geared throughout toward use with vmas, though the
first half of the file appears more general than the second half.
Each node of the prio_tree is itself (contained within) a vma: might
save memory by allocating distinct nodes from which to hang vmas, but
wouldn't save much, and would complicate the usage with preallocations.
Off each node of the prio_tree itself hangs a list of like vmas, if any.
The connection from node to list is a little awkward, but probably the
best compromise: it would be more straightforward to list likes directly
from the tree node, but that would use more memory per vma, for the
list_head and to identify that head. Instead, node's shared.vm_set.head
points to next vma (whose shared.vm_set.head points back to node vma),
and that next contains the list_head from which the rest hang - reusing
fields already used in the prio_tree node itself.
Currently lacks prefetch: Rajesh hopes to add some soon.
include/linux/mm.h | 37 +-
include/linux/prio_tree.h | 57 +++-
init/main.c | 2
mm/Makefile | 5
mm/mmap.c | 25 -
mm/prio_tree.c | 654 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 723 insertions(+), 57 deletions(-)
--- rmap16/include/linux/mm.h 2004-04-27 19:18:42.775736040 +0100
+++ rmap17/include/linux/mm.h 2004-04-27 19:18:54.262989712 +0100
@@ -72,7 +72,15 @@ struct vm_area_struct {
* For areas with an address space and backing store,
* one of the address_space->i_mmap{,shared} trees.
*/
- struct list_head shared;
+ union {
+ struct {
+ struct list_head list;
+ void *parent; /* aligns with prio_tree_node parent */
+ struct vm_area_struct *head;
+ } vm_set;
+
+ struct prio_tree_node prio_tree_node;
+ } shared;
/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;
@@ -553,27 +561,16 @@ extern void si_meminfo_node(struct sysin
static inline void vma_prio_tree_init(struct vm_area_struct *vma)
{
- INIT_LIST_HEAD(&vma->shared);
-}
-
-static inline void vma_prio_tree_add(struct vm_area_struct *vma,
- struct vm_area_struct *old)
-{
- list_add(&vma->shared, &old->shared);
-}
-
-static inline void vma_prio_tree_insert(struct vm_area_struct *vma,
- struct prio_tree_root *root)
-{
- list_add_tail(&vma->shared, &root->list);
-}
-
-static inline void vma_prio_tree_remove(struct vm_area_struct *vma,
- struct prio_tree_root *root)
-{
- list_del_init(&vma->shared);
+ vma->shared.vm_set.list.next = NULL;
+ vma->shared.vm_set.list.prev = NULL;
+ vma->shared.vm_set.parent = NULL;
+ vma->shared.vm_set.head = NULL;
}
+/* prio_tree.c */
+void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old);
+void vma_prio_tree_insert(struct vm_area_struct *, struct prio_tree_root *);
+void vma_prio_tree_remove(struct vm_area_struct *, struct prio_tree_root *);
struct vm_area_struct *vma_prio_tree_next(
struct vm_area_struct *, struct prio_tree_root *,
struct prio_tree_iter *, pgoff_t begin, pgoff_t end);
--- rmap16/include/linux/prio_tree.h 2004-04-27 19:18:42.776735888 +0100
+++ rmap17/include/linux/prio_tree.h 2004-04-27 19:18:54.264989408 +0100
@@ -1,27 +1,64 @@
#ifndef _LINUX_PRIO_TREE_H
#define _LINUX_PRIO_TREE_H
-/*
- * Dummy version of include/linux/prio_tree.h, just for this patch:
- * no radix priority search tree whatsoever, just implement interfaces
- * using the old lists.
- */
+
+struct prio_tree_node {
+ struct prio_tree_node *left;
+ struct prio_tree_node *right;
+ struct prio_tree_node *parent;
+};
struct prio_tree_root {
- struct list_head list;
+ struct prio_tree_node *prio_tree_node;
+ unsigned int index_bits;
};
struct prio_tree_iter {
- int not_used_yet;
+ struct prio_tree_node *cur;
+ unsigned long mask;
+ unsigned long value;
+ int size_level;
};
#define INIT_PRIO_TREE_ROOT(ptr) \
do { \
- INIT_LIST_HEAD(&(ptr)->list); \
-} while (0) \
+ (ptr)->prio_tree_node = NULL; \
+ (ptr)->index_bits = 1; \
+} while (0)
+
+#define INIT_PRIO_TREE_NODE(ptr) \
+do { \
+ (ptr)->left = (ptr)->right = (ptr)->parent = (ptr); \
+} while (0)
+
+#define INIT_PRIO_TREE_ITER(ptr) \
+do { \
+ (ptr)->cur = NULL; \
+ (ptr)->mask = 0UL; \
+ (ptr)->value = 0UL; \
+ (ptr)->size_level = 0; \
+} while (0)
+
+#define prio_tree_entry(ptr, type, member) \
+ ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))
static inline int prio_tree_empty(const struct prio_tree_root *root)
{
- return list_empty(&root->list);
+ return root->prio_tree_node == NULL;
+}
+
+static inline int prio_tree_root(const struct prio_tree_node *node)
+{
+ return node->parent == node;
+}
+
+static inline int prio_tree_left_empty(const struct prio_tree_node *node)
+{
+ return node->left == node;
+}
+
+static inline int prio_tree_right_empty(const struct prio_tree_node *node)
+{
+ return node->right == node;
}
#endif /* _LINUX_PRIO_TREE_H */
--- rmap16/init/main.c 2004-04-26 12:39:46.693108928 +0100
+++ rmap17/init/main.c 2004-04-27 19:18:54.265989256 +0100
@@ -88,6 +88,7 @@ extern void signals_init(void);
extern void buffer_init(void);
extern void pidhash_init(void);
extern void pidmap_init(void);
+extern void prio_tree_init(void);
extern void radix_tree_init(void);
extern void free_initmem(void);
extern void populate_rootfs(void);
@@ -523,6 +524,7 @@ asmlinkage void __init start_kernel(void
calibrate_delay();
pidmap_init();
pgtable_cache_init();
+ prio_tree_init();
#ifdef CONFIG_X86
if (efi_enabled)
efi_enter_virtual_mode();
--- rmap16/mm/Makefile 2004-04-26 12:39:46.941071232 +0100
+++ rmap17/mm/Makefile 2004-04-27 19:18:54.266989104 +0100
@@ -8,8 +8,9 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o
shmem.o vmalloc.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
- page_alloc.o page-writeback.o pdflush.o readahead.o \
- slab.o swap.o truncate.o vmscan.o $(mmu-y)
+ page_alloc.o page-writeback.o pdflush.o prio_tree.o \
+ readahead.o slab.o swap.o truncate.o vmscan.o \
+ $(mmu-y)
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
--- rmap16/mm/mmap.c 2004-04-27 19:18:42.784734672 +0100
+++ rmap17/mm/mmap.c 2004-04-27 19:18:54.268988800 +0100
@@ -321,31 +321,6 @@ __insert_vm_struct(struct mm_struct * mm
}
/*
- * Dummy version of vma_prio_tree_next, just for this patch:
- * no radix priority search tree whatsoever, just implement interface
- * using the old lists: return the next vma overlapping [begin,end].
- */
-struct vm_area_struct *vma_prio_tree_next(
- struct vm_area_struct *vma, struct prio_tree_root *root,
- struct prio_tree_iter *iter, pgoff_t begin, pgoff_t end)
-{
- struct list_head *next;
- pgoff_t vba, vea;
-
- next = vma? vma->shared.next: root->list.next;
- while (next != &root->list) {
- vma = list_entry(next, struct vm_area_struct, shared);
- vba = vma->vm_pgoff;
- vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
- /* Return vma if it overlaps [begin,end] */
- if (vba <= end && vea >= begin)
- return vma;
- next = next->next;
- }
- return NULL;
-}
-
-/*
* We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that is
* already present in an i_mmap{_shared} tree without adjusting the tree.
* The following helper function should be used when such adjustments
--- rmap16/mm/prio_tree.c 1970-01-01 01:00:00.000000000 +0100
+++ rmap17/mm/prio_tree.c 2004-04-27 19:18:54.272988192 +0100
@@ -0,0 +1,654 @@
+/*
+ * mm/prio_tree.c - priority search tree for mapping->i_mmap{,_shared}
+ *
+ * Copyright (C) 2004, Rajesh Venkatasubramanian <vrajesh@umich.edu>
+ *
+ * This file is released under the GPL v2.
+ *
+ * Based on the radix priority search tree proposed by Edward M. McCreight
+ * SIAM Journal of Computing, vol. 14, no.2, pages 257-276, May 1985
+ *
+ * 02Feb2004 Initial version
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/prio_tree.h>
+
+/*
+ * A clever mix of heap and radix trees forms a radix priority search tree (PST)
+ * which is useful for storing intervals, e.g., we can consider a vma as a closed
+ * interval of file pages [offset_begin, offset_end], and store all vmas that
+ * map a file in a PST. Then, using the PST, we can answer a stabbing query,
+ * i.e., selecting a set of stored intervals (vmas) that overlap with (map) a
+ * given input interval X (a set of consecutive file pages), in "O(log n + m)"
+ * time where 'log n' is the height of the PST, and 'm' is the number of stored
+ * intervals (vmas) that overlap (map) with the input interval X (the set of
+ * consecutive file pages).
+ *
+ * In our implementation, we store closed intervals of the form [radix_index,
+ * heap_index]. We assume that always radix_index <= heap_index. McCreight's PST
+ * is designed for storing intervals with unique radix indices, i.e., each
+ * interval has a different radix_index. However, this limitation can be easily
+ * overcome by using the size, i.e., heap_index - radix_index, as part of the
+ * index, so we index the tree using [(radix_index,size), heap_index].
+ *
+ * When the above-mentioned indexing scheme is used, theoretically, in a 32 bit
+ * machine, the maximum height of a PST can be 64. We can use a balanced version
+ * of the priority search tree to optimize the tree height, but the balanced
+ * tree proposed by McCreight is too complex and memory-hungry for our purpose.
+ */
+
+/*
+ * The following macros are used for implementing prio_tree for i_mmap{_shared}
+ */
+
+#define RADIX_INDEX(vma) ((vma)->vm_pgoff)
+#define VMA_SIZE(vma) (((vma)->vm_end - (vma)->vm_start) >> PAGE_SHIFT)
+/* avoid overflow */
+#define HEAP_INDEX(vma) ((vma)->vm_pgoff + (VMA_SIZE(vma) - 1))
+
+#define GET_INDEX_VMA(vma, radix, heap) \
+do { \
+ radix = RADIX_INDEX(vma); \
+ heap = HEAP_INDEX(vma); \
+} while (0)
+
+#define GET_INDEX(node, radix, heap) \
+do { \
+ struct vm_area_struct *__tmp = \
+ prio_tree_entry(node, struct vm_area_struct, shared.prio_tree_node);\
+ GET_INDEX_VMA(__tmp, radix, heap); \
+} while (0)
+
+static unsigned long index_bits_to_maxindex[BITS_PER_LONG];
+
+void __init prio_tree_init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(index_bits_to_maxindex) - 1; i++)
+ index_bits_to_maxindex[i] = (1UL << (i + 1)) - 1;
+ index_bits_to_maxindex[ARRAY_SIZE(index_bits_to_maxindex) - 1] = ~0UL;
+}
+
+/*
+ * Maximum heap_index that can be stored in a PST with index_bits bits
+ */
+static inline unsigned long prio_tree_maxindex(unsigned int bits)
+{
+ return index_bits_to_maxindex[bits - 1];
+}
+
+/*
+ * Extend a priority search tree so that it can store a node with heap_index
+ * max_heap_index. In the worst case, this algorithm takes O((log n)^2).
+ * However, this function is used rarely and the common case performance is
+ * not bad.
+ */
+static struct prio_tree_node *prio_tree_expand(struct prio_tree_root *root,
+ struct prio_tree_node *node, unsigned long max_heap_index)
+{
+ static void prio_tree_remove(struct prio_tree_root *,
+ struct prio_tree_node *);
+ struct prio_tree_node *first = NULL, *prev, *last = NULL;
+
+ if (max_heap_index > prio_tree_maxindex(root->index_bits))
+ root->index_bits++;
+
+ while (max_heap_index > prio_tree_maxindex(root->index_bits)) {
+ root->index_bits++;
+
+ if (prio_tree_empty(root))
+ continue;
+
+ if (first == NULL) {
+ first = root->prio_tree_node;
+ prio_tree_remove(root, root->prio_tree_node);
+ INIT_PRIO_TREE_NODE(first);
+ last = first;
+ } else {
+ prev = last;
+ last = root->prio_tree_node;
+ prio_tree_remove(root, root->prio_tree_node);
+ INIT_PRIO_TREE_NODE(last);
+ prev->left = last;
+ last->parent = prev;
+ }
+ }
+
+ INIT_PRIO_TREE_NODE(node);
+
+ if (first) {
+ node->left = first;
+ first->parent = node;
+ } else
+ last = node;
+
+ if (!prio_tree_empty(root)) {
+ last->left = root->prio_tree_node;
+ last->left->parent = last;
+ }
+
+ root->prio_tree_node = node;
+ return node;
+}
+
+/*
+ * Replace a prio_tree_node with a new node and return the old node
+ */
+static struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root,
+ struct prio_tree_node *old, struct prio_tree_node *node)
+{
+ INIT_PRIO_TREE_NODE(node);
+
+ if (prio_tree_root(old)) {
+ BUG_ON(root->prio_tree_node != old);
+ /*
+ * We can reduce root->index_bits here. However, it is complex
+ * and does not help much to improve performance (IMO).
+ */
+ node->parent = node;
+ root->prio_tree_node = node;
+ } else {
+ node->parent = old->parent;
+ if (old->parent->left == old)
+ old->parent->left = node;
+ else
+ old->parent->right = node;
+ }
+
+ if (!prio_tree_left_empty(old)) {
+ node->left = old->left;
+ old->left->parent = node;
+ }
+
+ if (!prio_tree_right_empty(old)) {
+ node->right = old->right;
+ old->right->parent = node;
+ }
+
+ return old;
+}
+
+/*
+ * Insert a prio_tree_node @node into a radix priority search tree @root. The
+ * algorithm typically takes O(log n) time where 'log n' is the number of bits
+ * required to represent the maximum heap_index. In the worst case, the algo
+ * can take O((log n)^2) - check prio_tree_expand.
+ *
+ * If a prior node with same radix_index and heap_index is already found in
+ * the tree, then returns the address of the prior node. Otherwise, inserts
+ * @node into the tree and returns @node.
+ */
+static struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
+ struct prio_tree_node *node)
+{
+ struct prio_tree_node *cur, *res = node;
+ unsigned long radix_index, heap_index;
+ unsigned long r_index, h_index, index, mask;
+ int size_flag = 0;
+
+ GET_INDEX(node, radix_index, heap_index);
+
+ if (prio_tree_empty(root) ||
+ heap_index > prio_tree_maxindex(root->index_bits))
+ return prio_tree_expand(root, node, heap_index);
+
+ cur = root->prio_tree_node;
+ mask = 1UL << (root->index_bits - 1);
+
+ while (mask) {
+ GET_INDEX(cur, r_index, h_index);
+
+ if (r_index == radix_index && h_index == heap_index)
+ return cur;
+
+ if (h_index < heap_index ||
+ (h_index == heap_index && r_index > radix_index)) {
+ struct prio_tree_node *tmp = node;
+ node = prio_tree_replace(root, cur, node);
+ cur = tmp;
+ /* swap indices */
+ index = r_index;
+ r_index = radix_index;
+ radix_index = index;
+ index = h_index;
+ h_index = heap_index;
+ heap_index = index;
+ }
+
+ if (size_flag)
+ index = heap_index - radix_index;
+ else
+ index = radix_index;
+
+ if (index & mask) {
+ if (prio_tree_right_empty(cur)) {
+ INIT_PRIO_TREE_NODE(node);
+ cur->right = node;
+ node->parent = cur;
+ return res;
+ } else
+ cur = cur->right;
+ } else {
+ if (prio_tree_left_empty(cur)) {
+ INIT_PRIO_TREE_NODE(node);
+ cur->left = node;
+ node->parent = cur;
+ return res;
+ } else
+ cur = cur->left;
+ }
+
+ mask >>= 1;
+
+ if (!mask) {
+ mask = 1UL << (root->index_bits - 1);
+ size_flag = 1;
+ }
+ }
+ /* Should not reach here */
+ BUG();
+ return NULL;
+}
+
+/*
+ * Remove a prio_tree_node @node from a radix priority search tree @root. The
+ * algorithm takes O(log n) time where 'log n' is the number of bits required
+ * to represent the maximum heap_index.
+ */
+static void prio_tree_remove(struct prio_tree_root *root,
+ struct prio_tree_node *node)
+{
+ struct prio_tree_node *cur;
+ unsigned long r_index, h_index_right, h_index_left;
+
+ cur = node;
+
+ while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) {
+ if (!prio_tree_left_empty(cur))
+ GET_INDEX(cur->left, r_index, h_index_left);
+ else {
+ cur = cur->right;
+ continue;
+ }
+
+ if (!prio_tree_right_empty(cur))
+ GET_INDEX(cur->right, r_index, h_index_right);
+ else {
+ cur = cur->left;
+ continue;
+ }
+
+ /* both h_index_left and h_index_right cannot be 0 */
+ if (h_index_left >= h_index_right)
+ cur = cur->left;
+ else
+ cur = cur->right;
+ }
+
+ if (prio_tree_root(cur)) {
+ BUG_ON(root->prio_tree_node != cur);
+ INIT_PRIO_TREE_ROOT(root);
+ return;
+ }
+
+ if (cur->parent->right == cur)
+ cur->parent->right = cur->parent;
+ else
+ cur->parent->left = cur->parent;
+
+ while (cur != node)
+ cur = prio_tree_replace(root, cur->parent, cur);
+}
+
+/*
+ * Following functions help to enumerate all prio_tree_nodes in the tree that
+ * overlap with the input interval X [radix_index, heap_index]. The enumeration
+ * takes O(log n + m) time where 'log n' is the height of the tree (which is
+ * proportional to # of bits required to represent the maximum heap_index) and
+ * 'm' is the number of prio_tree_nodes that overlap the interval X.
+ */
+
+static struct prio_tree_node *prio_tree_left(
+ struct prio_tree_root *root, struct prio_tree_iter *iter,
+ unsigned long radix_index, unsigned long heap_index,
+ unsigned long *r_index, unsigned long *h_index)
+{
+ if (prio_tree_left_empty(iter->cur))
+ return NULL;
+
+ GET_INDEX(iter->cur->left, *r_index, *h_index);
+
+ if (radix_index <= *h_index) {
+ iter->cur = iter->cur->left;
+ iter->mask >>= 1;
+ if (iter->mask) {
+ if (iter->size_level)
+ iter->size_level++;
+ } else {
+ if (iter->size_level) {
+ BUG_ON(!prio_tree_left_empty(iter->cur));
+ BUG_ON(!prio_tree_right_empty(iter->cur));
+ iter->size_level++;
+ iter->mask = ULONG_MAX;
+ } else {
+ iter->size_level = 1;
+ iter->mask = 1UL << (root->index_bits - 1);
+ }
+ }
+ return iter->cur;
+ }
+
+ return NULL;
+}
+
+static struct prio_tree_node *prio_tree_right(
+ struct prio_tree_root *root, struct prio_tree_iter *iter,
+ unsigned long radix_index, unsigned long heap_index,
+ unsigned long *r_index, unsigned long *h_index)
+{
+ unsigned long value;
+
+ if (prio_tree_right_empty(iter->cur))
+ return NULL;
+
+ if (iter->size_level)
+ value = iter->value;
+ else
+ value = iter->value | iter->mask;
+
+ if (heap_index < value)
+ return NULL;
+
+ GET_INDEX(iter->cur->right, *r_index, *h_index);
+
+ if (radix_index <= *h_index) {
+ iter->cur = iter->cur->right;
+ iter->mask >>= 1;
+ iter->value = value;
+ if (iter->mask) {
+ if (iter->size_level)
+ iter->size_level++;
+ } else {
+ if (iter->size_level) {
+ BUG_ON(!prio_tree_left_empty(iter->cur));
+ BUG_ON(!prio_tree_right_empty(iter->cur));
+ iter->size_level++;
+ iter->mask = ULONG_MAX;
+ } else {
+ iter->size_level = 1;
+ iter->mask = 1UL << (root->index_bits - 1);
+ }
+ }
+ return iter->cur;
+ }
+
+ return NULL;
+}
+
+static struct prio_tree_node *prio_tree_parent(struct prio_tree_iter *iter)
+{
+ iter->cur = iter->cur->parent;
+ if (iter->mask == ULONG_MAX)
+ iter->mask = 1UL;
+ else if (iter->size_level == 1)
+ iter->mask = 1UL;
+ else
+ iter->mask <<= 1;
+ if (iter->size_level)
+ iter->size_level--;
+ if (!iter->size_level && (iter->value & iter->mask))
+ iter->value ^= iter->mask;
+ return iter->cur;
+}
+
+static inline int overlap(unsigned long radix_index, unsigned long heap_index,
+ unsigned long r_index, unsigned long h_index)
+{
+ return heap_index >= r_index && radix_index <= h_index;
+}
+
+/*
+ * prio_tree_first:
+ *
+ * Get the first prio_tree_node that overlaps with the interval [radix_index,
+ * heap_index]. Note that always radix_index <= heap_index. We do a pre-order
+ * traversal of the tree.
+ */
+static struct prio_tree_node *prio_tree_first(struct prio_tree_root *root,
+ struct prio_tree_iter *iter, unsigned long radix_index,
+ unsigned long heap_index)
+{
+ unsigned long r_index, h_index;
+
+ INIT_PRIO_TREE_ITER(iter);
+
+ if (prio_tree_empty(root))
+ return NULL;
+
+ GET_INDEX(root->prio_tree_node, r_index, h_index);
+
+ if (radix_index > h_index)
+ return NULL;
+
+ iter->mask = 1UL << (root->index_bits - 1);
+ iter->cur = root->prio_tree_node;
+
+ while (1) {
+ if (overlap(radix_index, heap_index, r_index, h_index))
+ return iter->cur;
+
+ if (prio_tree_left(root, iter, radix_index, heap_index,
+ &r_index, &h_index))
+ continue;
+
+ if (prio_tree_right(root, iter, radix_index, heap_index,
+ &r_index, &h_index))
+ continue;
+
+ break;
+ }
+ return NULL;
+}
+
+/*
+ * prio_tree_next:
+ *
+ * Get the next prio_tree_node that overlaps with the input interval in iter
+ */
+static struct prio_tree_node *prio_tree_next(struct prio_tree_root *root,
+ struct prio_tree_iter *iter, unsigned long radix_index,
+ unsigned long heap_index)
+{
+ unsigned long r_index, h_index;
+
+repeat:
+ while (prio_tree_left(root, iter, radix_index,
+ heap_index, &r_index, &h_index)) {
+ if (overlap(radix_index, heap_index, r_index, h_index))
+ return iter->cur;
+ }
+
+ while (!prio_tree_right(root, iter, radix_index,
+ heap_index, &r_index, &h_index)) {
+ while (!prio_tree_root(iter->cur) &&
+ iter->cur->parent->right == iter->cur)
+ prio_tree_parent(iter);
+
+ if (prio_tree_root(iter->cur))
+ return NULL;
+
+ prio_tree_parent(iter);
+ }
+
+ if (overlap(radix_index, heap_index, r_index, h_index))
+ return iter->cur;
+
+ goto repeat;
+}
+
+/*
+ * Radix priority search tree for address_space->i_mmap{,_shared}
+ *
+ * For each vma that maps a unique set of file pages, i.e., a unique [radix_index,
+ * heap_index] value, we have a corresponding priority search tree node. If
+ * multiple vmas have identical [radix_index, heap_index] value, then one of
+ * them is used as a tree node and others are stored in a vm_set list. The tree
+ * node points to the first vma (head) of the list using vm_set.head.
+ *
+ * prio_tree_root
+ * |
+ * A vm_set.head
+ * / \ /
+ * L R -> H-I-J-K-M-N-O-P-Q-S
+ * ^ ^ <-- vm_set.list -->
+ * tree nodes
+ *
+ * We need some way to identify whether a vma is a tree node, head of a vm_set
+ * list, or just a member of a vm_set list. We cannot use vm_flags to store
+ * such information. The reason is, in the above figure, it is possible that
+ * vm_flags' of R and H are covered by the different mmap_sems. When R is
+ * removed under R->mmap_sem, H replaces R as a tree node. Since we do not hold
+ * H->mmap_sem, we cannot use H->vm_flags for marking that H is a tree node now.
+ * That's why some trick involving shared.vm_set.parent is used for identifying
+ * tree nodes and list head nodes.
+ *
+ * vma radix priority search tree node rules:
+ *
+ * vma->shared.vm_set.parent != NULL ==> a tree node
+ * vma->shared.vm_set.head != NULL ==> list of others mapping same range
+ * vma->shared.vm_set.head == NULL ==> no others map the same range
+ *
+ * vma->shared.vm_set.parent == NULL
+ * vma->shared.vm_set.head != NULL ==> list head of vmas mapping same range
+ * vma->shared.vm_set.head == NULL ==> a list node
+ */
+
+/*
+ * Add a new vma known to map the same set of pages as the old vma:
+ * useful for fork's dup_mmap as well as vma_prio_tree_insert below.
+ */
+void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old)
+{
+ /* Leave these BUG_ONs till prio_tree patch stabilizes */
+ BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
+ BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old));
+
+ if (!old->shared.vm_set.parent)
+ list_add(&vma->shared.vm_set.list,
+ &old->shared.vm_set.list);
+ else if (old->shared.vm_set.head)
+ list_add_tail(&vma->shared.vm_set.list,
+ &old->shared.vm_set.head->shared.vm_set.list);
+ else {
+ INIT_LIST_HEAD(&vma->shared.vm_set.list);
+ vma->shared.vm_set.head = old;
+ old->shared.vm_set.head = vma;
+ }
+}
+
+void vma_prio_tree_insert(struct vm_area_struct *vma,
+ struct prio_tree_root *root)
+{
+ struct prio_tree_node *ptr;
+ struct vm_area_struct *old;
+
+ ptr = prio_tree_insert(root, &vma->shared.prio_tree_node);
+ if (ptr != &vma->shared.prio_tree_node) {
+ old = prio_tree_entry(ptr, struct vm_area_struct,
+ shared.prio_tree_node);
+ vma_prio_tree_add(vma, old);
+ }
+}
+
+void vma_prio_tree_remove(struct vm_area_struct *vma,
+ struct prio_tree_root *root)
+{
+ struct vm_area_struct *node, *head, *new_head;
+
+ if (!vma->shared.vm_set.head) {
+ if (!vma->shared.vm_set.parent)
+ list_del_init(&vma->shared.vm_set.list);
+ else
+ prio_tree_remove(root, &vma->shared.prio_tree_node);
+ } else {
+ /* Leave this BUG_ON till prio_tree patch stabilizes */
+ BUG_ON(vma->shared.vm_set.head->shared.vm_set.head != vma);
+ if (vma->shared.vm_set.parent) {
+ head = vma->shared.vm_set.head;
+ if (!list_empty(&head->shared.vm_set.list)) {
+ new_head = list_entry(
+ head->shared.vm_set.list.next,
+ struct vm_area_struct,
+ shared.vm_set.list);
+ list_del_init(&head->shared.vm_set.list);
+ } else
+ new_head = NULL;
+
+ prio_tree_replace(root, &vma->shared.prio_tree_node,
+ &head->shared.prio_tree_node);
+ head->shared.vm_set.head = new_head;
+ if (new_head)
+ new_head->shared.vm_set.head = head;
+
+ } else {
+ node = vma->shared.vm_set.head;
+ if (!list_empty(&vma->shared.vm_set.list)) {
+ new_head = list_entry(
+ vma->shared.vm_set.list.next,
+ struct vm_area_struct,
+ shared.vm_set.list);
+ list_del_init(&vma->shared.vm_set.list);
+ node->shared.vm_set.head = new_head;
+ new_head->shared.vm_set.head = node;
+ } else
+ node->shared.vm_set.head = NULL;
+ }
+ }
+}
+
+/*
+ * Helper function to enumerate vmas that map a given file page or a set of
+ * contiguous file pages. The function returns vmas that at least map a single
+ * page in the given range of contiguous file pages.
+ */
+struct vm_area_struct *vma_prio_tree_next(struct vm_area_struct *vma,
+ struct prio_tree_root *root, struct prio_tree_iter *iter,
+ pgoff_t begin, pgoff_t end)
+{
+ struct prio_tree_node *ptr;
+ struct vm_area_struct *next;
+
+ if (!vma) {
+ /*
+ * First call is with NULL vma
+ */
+ ptr = prio_tree_first(root, iter, begin, end);
+ if (ptr)
+ return prio_tree_entry(ptr, struct vm_area_struct,
+ shared.prio_tree_node);
+ else
+ return NULL;
+ }
+
+ if (vma->shared.vm_set.parent) {
+ if (vma->shared.vm_set.head)
+ return vma->shared.vm_set.head;
+ } else {
+ next = list_entry(vma->shared.vm_set.list.next,
+ struct vm_area_struct, shared.vm_set.list);
+ if (!next->shared.vm_set.head)
+ return next;
+ }
+
+ ptr = prio_tree_next(root, iter, begin, end);
+ if (ptr)
+ return prio_tree_entry(ptr, struct vm_area_struct,
+ shared.prio_tree_node);
+ else
+ return NULL;
+}
+EXPORT_SYMBOL(vma_prio_tree_next);
* [PATCH] rmap 18 i_mmap_nonlinear
2004-04-27 23:59 [PATCH] rmap 14 i_shared_lock fixes Hugh Dickins
` (2 preceding siblings ...)
2004-04-28 0:03 ` [PATCH] rmap 17 real prio_tree Hugh Dickins
@ 2004-04-28 0:04 ` Hugh Dickins
2004-04-28 23:11 ` Rik van Riel
2004-04-28 0:06 ` [PATCH] rmap 19 arch prio_tree Hugh Dickins
4 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2004-04-28 0:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rajesh Venkatasubramanian, linux-kernel
The prio_tree is of no use to nonlinear vmas: currently we're having to
search the tree in the most inefficient way to find all its nonlinears.
At the very least we need an indication of the unlikely case when there
are some nonlinears; but really, we'd do best to take them out of the
prio_tree altogether, into a list of their own - i_mmap_nonlinear.
fs/inode.c | 1 +
include/linux/fs.h | 7 +++++--
mm/fremap.c | 12 +++++++++++-
mm/memory.c | 31 ++++++++++---------------------
mm/mmap.c | 15 ++++++++++-----
mm/prio_tree.c | 1 +
mm/rmap.c | 35 ++++++++++++++---------------------
7 files changed, 52 insertions(+), 50 deletions(-)
--- rmap17/fs/inode.c 2004-04-27 19:18:42.769736952 +0100
+++ rmap18/fs/inode.c 2004-04-27 19:19:05.866225752 +0100
@@ -190,6 +190,7 @@ void inode_init_once(struct inode *inode
spin_lock_init(&inode->i_data.private_lock);
INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap_shared);
+ INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
spin_lock_init(&inode->i_lock);
i_size_ordered_init(inode);
}
--- rmap17/include/linux/fs.h 2004-04-27 19:18:42.773736344 +0100
+++ rmap18/include/linux/fs.h 2004-04-27 19:19:05.869225296 +0100
@@ -332,6 +332,7 @@ struct address_space {
struct address_space_operations *a_ops; /* methods */
struct prio_tree_root i_mmap; /* tree of private mappings */
struct prio_tree_root i_mmap_shared; /* tree of shared mappings */
+ struct list_head i_mmap_nonlinear;/*list of nonlinear mappings */
spinlock_t i_shared_lock; /* protect trees & list above */
atomic_t truncate_count; /* Cover race condition with truncate */
unsigned long flags; /* error bits/gfp mask */
@@ -381,7 +382,8 @@ int mapping_tagged(struct address_space
static inline int mapping_mapped(struct address_space *mapping)
{
return !prio_tree_empty(&mapping->i_mmap) ||
- !prio_tree_empty(&mapping->i_mmap_shared);
+ !prio_tree_empty(&mapping->i_mmap_shared) ||
+ !list_empty(&mapping->i_mmap_nonlinear);
}
/*
@@ -392,7 +394,8 @@ static inline int mapping_mapped(struct
*/
static inline int mapping_writably_mapped(struct address_space *mapping)
{
- return !prio_tree_empty(&mapping->i_mmap_shared);
+ return !prio_tree_empty(&mapping->i_mmap_shared) ||
+ !list_empty(&mapping->i_mmap_nonlinear);
}
/*
--- rmap17/mm/fremap.c 2004-04-26 12:39:46.922074120 +0100
+++ rmap18/mm/fremap.c 2004-04-27 19:19:05.870225144 +0100
@@ -157,6 +157,7 @@ asmlinkage long sys_remap_file_pages(uns
unsigned long __prot, unsigned long pgoff, unsigned long flags)
{
struct mm_struct *mm = current->mm;
+ struct address_space *mapping;
unsigned long end = start + size;
struct vm_area_struct *vma;
int err = -EINVAL;
@@ -197,8 +198,17 @@ asmlinkage long sys_remap_file_pages(uns
end <= vma->vm_end) {
/* Must set VM_NONLINEAR before any pages are populated. */
- if (pgoff != linear_page_index(vma, start))
+ if (pgoff != linear_page_index(vma, start) &&
+ !(vma->vm_flags & VM_NONLINEAR)) {
+ mapping = vma->vm_file->f_mapping;
+ spin_lock(&mapping->i_shared_lock);
vma->vm_flags |= VM_NONLINEAR;
+ vma_prio_tree_remove(vma, &mapping->i_mmap_shared);
+ vma_prio_tree_init(vma);
+ list_add_tail(&vma->shared.vm_set.list,
+ &mapping->i_mmap_nonlinear);
+ spin_unlock(&mapping->i_shared_lock);
+ }
/* ->populate can take a long time, so downgrade the lock. */
downgrade_write(&mm->mmap_sem);
--- rmap17/mm/memory.c 2004-04-27 19:18:42.780735280 +0100
+++ rmap18/mm/memory.c 2004-04-27 19:19:05.873224688 +0100
@@ -1127,8 +1127,6 @@ static void unmap_mapping_range_list(str
while ((vma = vma_prio_tree_next(vma, root, &iter,
details->first_index, details->last_index)) != NULL) {
- if (unlikely(vma->vm_flags & VM_NONLINEAR))
- continue;
vba = vma->vm_pgoff;
vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
/* Assume for now that PAGE_CACHE_SHIFT == PAGE_SHIFT */
@@ -1144,22 +1142,6 @@ static void unmap_mapping_range_list(str
}
}
-static void unmap_nonlinear_range_list(struct prio_tree_root *root,
- struct zap_details *details)
-{
- struct vm_area_struct *vma = NULL;
- struct prio_tree_iter iter;
-
- while ((vma = vma_prio_tree_next(vma, root, &iter,
- 0, ULONG_MAX)) != NULL) {
- if (!(vma->vm_flags & VM_NONLINEAR))
- continue;
- details->nonlinear_vma = vma;
- zap_page_range(vma, vma->vm_start,
- vma->vm_end - vma->vm_start, details);
- }
-}
-
/**
* unmap_mapping_range - unmap the portion of all mmaps
* in the specified address_space corresponding to the specified
@@ -1209,11 +1191,18 @@ void unmap_mapping_range(struct address_
/* Don't waste time to check mapping on fully shared vmas */
details.check_mapping = NULL;
- if (unlikely(!prio_tree_empty(&mapping->i_mmap_shared))) {
+ if (unlikely(!prio_tree_empty(&mapping->i_mmap_shared)))
unmap_mapping_range_list(&mapping->i_mmap_shared, &details);
- unmap_nonlinear_range_list(&mapping->i_mmap_shared, &details);
- }
+ if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) {
+ struct vm_area_struct *vma;
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
+ shared.vm_set.list) {
+ details.nonlinear_vma = vma;
+ zap_page_range(vma, vma->vm_start,
+ vma->vm_end - vma->vm_start, &details);
+ }
+ }
spin_unlock(&mapping->i_shared_lock);
}
EXPORT_SYMBOL(unmap_mapping_range);
--- rmap17/mm/mmap.c 2004-04-27 19:18:54.268988800 +0100
+++ rmap18/mm/mmap.c 2004-04-27 19:19:05.875224384 +0100
@@ -70,7 +70,9 @@ static inline void __remove_shared_vm_st
if (vma->vm_flags & VM_DENYWRITE)
atomic_inc(&file->f_dentry->d_inode->i_writecount);
- if (vma->vm_flags & VM_SHARED)
+ if (unlikely(vma->vm_flags & VM_NONLINEAR))
+ list_del_init(&vma->shared.vm_set.list);
+ else if (vma->vm_flags & VM_SHARED)
vma_prio_tree_remove(vma, &mapping->i_mmap_shared);
else
vma_prio_tree_remove(vma, &mapping->i_mmap);
@@ -260,7 +262,10 @@ static inline void __vma_link_file(struc
if (vma->vm_flags & VM_DENYWRITE)
atomic_dec(&file->f_dentry->d_inode->i_writecount);
- if (vma->vm_flags & VM_SHARED)
+ if (unlikely(vma->vm_flags & VM_NONLINEAR))
+ list_add_tail(&vma->shared.vm_set.list,
+ &mapping->i_mmap_nonlinear);
+ else if (vma->vm_flags & VM_SHARED)
vma_prio_tree_insert(vma, &mapping->i_mmap_shared);
else
vma_prio_tree_insert(vma, &mapping->i_mmap);
@@ -337,10 +342,10 @@ void vma_adjust(struct vm_area_struct *v
if (file) {
mapping = file->f_mapping;
- if (vma->vm_flags & VM_SHARED)
- root = &mapping->i_mmap_shared;
- else
+ if (!(vma->vm_flags & VM_SHARED))
root = &mapping->i_mmap;
+ else if (!(vma->vm_flags & VM_NONLINEAR))
+ root = &mapping->i_mmap_shared;
spin_lock(&mapping->i_shared_lock);
}
spin_lock(&mm->page_table_lock);
--- rmap17/mm/prio_tree.c 2004-04-27 19:18:54.272988192 +0100
+++ rmap18/mm/prio_tree.c 2004-04-27 19:19:05.877224080 +0100
@@ -530,6 +530,7 @@ repeat:
/*
* Add a new vma known to map the same set of pages as the old vma:
* useful for fork's dup_mmap as well as vma_prio_tree_insert below.
+ * Note that it just happens to work correctly on i_mmap_nonlinear too.
*/
void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old)
{
--- rmap17/mm/rmap.c 2004-04-27 19:18:42.786734368 +0100
+++ rmap18/mm/rmap.c 2004-04-27 19:19:05.879223776 +0100
@@ -333,10 +333,6 @@ static inline int page_referenced_file(s
while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
&iter, pgoff, pgoff)) != NULL) {
- if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
- failed++;
- continue;
- }
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
referenced++;
goto out;
@@ -350,8 +346,8 @@ static inline int page_referenced_file(s
}
}
- /* Hmm, but what of the nonlinears which pgoff,pgoff skipped? */
- WARN_ON(!failed);
+ if (list_empty(&mapping->i_mmap_nonlinear))
+ WARN_ON(!failed);
out:
spin_unlock(&mapping->i_shared_lock);
return referenced;
@@ -740,8 +736,6 @@ static inline int try_to_unmap_file(stru
while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
&iter, pgoff, pgoff)) != NULL) {
- if (unlikely(vma->vm_flags & VM_NONLINEAR))
- continue;
if (vma->vm_mm->rss) {
address = vma_address(vma, pgoff);
ret = try_to_unmap_one(page,
@@ -751,10 +745,12 @@ static inline int try_to_unmap_file(stru
}
}
- while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
- &iter, 0, ULONG_MAX)) != NULL) {
- if (VM_NONLINEAR != (vma->vm_flags &
- (VM_NONLINEAR|VM_LOCKED|VM_RESERVED)))
+ if (list_empty(&mapping->i_mmap_nonlinear))
+ goto out;
+
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
+ shared.vm_set.list) {
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -782,10 +778,9 @@ static inline int try_to_unmap_file(stru
max_nl_cursor = CLUSTER_SIZE;
do {
- while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
- &iter, 0, ULONG_MAX)) != NULL) {
- if (VM_NONLINEAR != (vma->vm_flags &
- (VM_NONLINEAR|VM_LOCKED|VM_RESERVED)))
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
+ shared.vm_set.list) {
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
while (vma->vm_mm->rss &&
@@ -814,11 +809,9 @@ static inline int try_to_unmap_file(stru
* in locked vmas). Reset cursor on all unreserved nonlinear
* vmas, now forgetting on which ones it had fallen behind.
*/
- vma = NULL; /* it is already, but above loop might change */
- while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap_shared,
- &iter, 0, ULONG_MAX)) != NULL) {
- if ((vma->vm_flags & (VM_NONLINEAR|VM_RESERVED)) ==
- VM_NONLINEAR)
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
+ shared.vm_set.list) {
+ if (!(vma->vm_flags & VM_RESERVED))
vma->vm_private_data = 0;
}
relock:
* [PATCH] rmap 19 arch prio_tree
2004-04-27 23:59 [PATCH] rmap 14 i_shared_lock fixes Hugh Dickins
` (3 preceding siblings ...)
2004-04-28 0:04 ` [PATCH] rmap 18 i_mmap_nonlinear Hugh Dickins
@ 2004-04-28 0:06 ` Hugh Dickins
4 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2004-04-28 0:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Rajesh Venkatasubramanian, rmk, James.Bottomley, schwidefsky, ak,
linux-kernel
The previous patches of this prio_tree batch have touched generic code only.
Now the arm and parisc __flush_dcache_page are converted to using
vma_prio_tree_next, and benefit from its selection of relevant vmas.
They're still accessing the tree without i_shared_lock or any other lock;
that's not forgotten, but still under investigation. Include pagemap.h
for the definition of PAGE_CACHE_SHIFT. s390 and x86_64 no longer
initialize vma's shared field (whose type has changed); that is done later.
arch/arm/mm/fault-armv.c | 59 ++++++++++++----------------------------
arch/parisc/kernel/cache.c | 49 ++++++++++++---------------------
arch/parisc/kernel/sys_parisc.c | 14 +--------
arch/s390/kernel/compat_exec.c | 1
arch/x86_64/ia32/ia32_binfmt.c | 1
5 files changed, 39 insertions(+), 85 deletions(-)
--- 2.6.6-rc2-mm2/arch/arm/mm/fault-armv.c 2004-04-21 06:28:40.000000000 +0100
+++ rmap19/arch/arm/mm/fault-armv.c 2004-04-27 19:19:17.337481856 +0100
@@ -16,6 +16,7 @@
#include <linux/bitops.h>
#include <linux/vmalloc.h>
#include <linux/init.h>
+#include <linux/pagemap.h>
#include <asm/cacheflush.h>
#include <asm/io.h>
@@ -187,7 +188,10 @@ void __flush_dcache_page(struct page *pa
{
struct address_space *mapping = page_mapping(page);
struct mm_struct *mm = current->active_mm;
- struct list_head *l;
+ struct vm_area_struct *mpnt = NULL;
+ struct prio_tree_iter iter;
+ unsigned long offset;
+ pgoff_t pgoff;
__cpuc_flush_dcache_page(page_address(page));
@@ -198,26 +202,16 @@ void __flush_dcache_page(struct page *pa
* With a VIVT cache, we need to also write back
* and invalidate any user data.
*/
- list_for_each(l, &mapping->i_mmap_shared) {
- struct vm_area_struct *mpnt;
- unsigned long off;
-
- mpnt = list_entry(l, struct vm_area_struct, shared);
-
+ pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ while ((mpnt = vma_prio_tree_next(mpnt, &mapping->i_mmap_shared,
+ &iter, pgoff, pgoff)) != NULL) {
/*
* If this VMA is not in our MM, we can ignore it.
*/
if (mpnt->vm_mm != mm)
continue;
-
- if (page->index < mpnt->vm_pgoff)
- continue;
-
- off = page->index - mpnt->vm_pgoff;
- if (off >= (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT)
- continue;
-
- flush_cache_page(mpnt, mpnt->vm_start + (off << PAGE_SHIFT));
+ offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
+ flush_cache_page(mpnt, mpnt->vm_start + offset);
}
}
@@ -225,9 +219,11 @@ static void
make_coherent(struct vm_area_struct *vma, unsigned long addr, struct page *page, int dirty)
{
struct address_space *mapping = page_mapping(page);
- struct list_head *l;
struct mm_struct *mm = vma->vm_mm;
- unsigned long pgoff;
+ struct vm_area_struct *mpnt = NULL;
+ struct prio_tree_iter iter;
+ unsigned long offset;
+ pgoff_t pgoff;
int aliases = 0;
if (!mapping)
@@ -240,12 +236,8 @@ make_coherent(struct vm_area_struct *vma
* space, then we need to handle them specially to maintain
* cache coherency.
*/
- list_for_each(l, &mapping->i_mmap_shared) {
- struct vm_area_struct *mpnt;
- unsigned long off;
-
- mpnt = list_entry(l, struct vm_area_struct, shared);
-
+ while ((mpnt = vma_prio_tree_next(mpnt, &mapping->i_mmap_shared,
+ &iter, pgoff, pgoff)) != NULL) {
/*
* If this VMA is not in our MM, we can ignore it.
* Note that we intentionally mask out the VMA
@@ -253,23 +245,8 @@ make_coherent(struct vm_area_struct *vma
*/
if (mpnt->vm_mm != mm || mpnt == vma)
continue;
-
- /*
- * If the page isn't in this VMA, we can also ignore it.
- */
- if (pgoff < mpnt->vm_pgoff)
- continue;
-
- off = pgoff - mpnt->vm_pgoff;
- if (off >= (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT)
- continue;
-
- off = mpnt->vm_start + (off << PAGE_SHIFT);
-
- /*
- * Ok, it is within mpnt. Fix it up.
- */
- aliases += adjust_pte(mpnt, off);
+ offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
+ aliases += adjust_pte(mpnt, mpnt->vm_start + offset);
}
if (aliases)
adjust_pte(vma, addr);
--- 2.6.6-rc2-mm2/arch/parisc/kernel/cache.c 2004-04-21 06:28:44.000000000 +0100
+++ rmap19/arch/parisc/kernel/cache.c 2004-04-27 19:19:17.338481704 +0100
@@ -17,6 +17,7 @@
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/seq_file.h>
+#include <linux/pagemap.h>
#include <asm/pdc.h>
#include <asm/cache.h>
@@ -231,34 +232,30 @@ void __flush_dcache_page(struct page *pa
{
struct address_space *mapping = page_mapping(page);
struct mm_struct *mm = current->active_mm;
- struct list_head *l;
+ struct vm_area_struct *mpnt = NULL;
+ struct prio_tree_iter iter;
+ unsigned long offset;
+ pgoff_t pgoff;
flush_kernel_dcache_page(page_address(page));
if (!mapping)
return;
- /* check shared list first if it's not empty...it's usually
- * the shortest */
- list_for_each(l, &mapping->i_mmap_shared) {
- struct vm_area_struct *mpnt;
- unsigned long off;
- mpnt = list_entry(l, struct vm_area_struct, shared);
+ pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ /* check shared list first if it's not empty...it's usually
+ * the shortest */
+ while ((mpnt = vma_prio_tree_next(mpnt, &mapping->i_mmap_shared,
+ &iter, pgoff, pgoff)) != NULL) {
/*
* If this VMA is not in our MM, we can ignore it.
*/
if (mpnt->vm_mm != mm)
continue;
- if (page->index < mpnt->vm_pgoff)
- continue;
-
- off = page->index - mpnt->vm_pgoff;
- if (off >= (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT)
- continue;
-
- flush_cache_page(mpnt, mpnt->vm_start + (off << PAGE_SHIFT));
+ offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
+ flush_cache_page(mpnt, mpnt->vm_start + offset);
/* All user shared mappings should be equivalently mapped,
* so once we've flushed one we should be ok
@@ -268,24 +265,16 @@ void __flush_dcache_page(struct page *pa
/* then check private mapping list for read only shared mappings
* which are flagged by VM_MAYSHARE */
- list_for_each(l, &mapping->i_mmap) {
- struct vm_area_struct *mpnt;
- unsigned long off;
-
- mpnt = list_entry(l, struct vm_area_struct, shared);
-
-
+ while ((mpnt = vma_prio_tree_next(mpnt, &mapping->i_mmap,
+ &iter, pgoff, pgoff)) != NULL) {
+ /*
+ * If this VMA is not in our MM, we can ignore it.
+ */
if (mpnt->vm_mm != mm || !(mpnt->vm_flags & VM_MAYSHARE))
continue;
- if (page->index < mpnt->vm_pgoff)
- continue;
-
- off = page->index - mpnt->vm_pgoff;
- if (off >= (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT)
- continue;
-
- flush_cache_page(mpnt, mpnt->vm_start + (off << PAGE_SHIFT));
+ offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
+ flush_cache_page(mpnt, mpnt->vm_start + offset);
/* All user shared mappings should be equivalently mapped,
* so once we've flushed one we should be ok
--- 2.6.6-rc2-mm2/arch/parisc/kernel/sys_parisc.c 2004-04-04 03:38:55.000000000 +0100
+++ rmap19/arch/parisc/kernel/sys_parisc.c 2004-04-27 19:19:17.339481552 +0100
@@ -68,17 +68,8 @@ static unsigned long get_unshared_area(u
* existing mapping and use the same offset. New scheme is to use the
* address of the kernel data structure as the seed for the offset.
* We'll see how that works...
- */
-#if 0
-static int get_offset(struct address_space *mapping)
-{
- struct vm_area_struct *vma = list_entry(mapping->i_mmap_shared.next,
- struct vm_area_struct, shared);
- return (vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)) &
- (SHMLBA - 1);
-}
-#else
-/* The mapping is cacheline aligned, so there's no information in the bottom
+ *
+ * The mapping is cacheline aligned, so there's no information in the bottom
* few bits of the address. We're looking for 10 bits (4MB / 4k), so let's
* drop the bottom 8 bits and use bits 8-17.
*/
@@ -87,7 +78,6 @@ static int get_offset(struct address_spa
int offset = (unsigned long) mapping << (PAGE_SHIFT - 8);
return offset & 0x3FF000;
}
-#endif
static unsigned long get_shared_area(struct address_space *mapping,
unsigned long addr, unsigned long len, unsigned long pgoff)
--- 2.6.6-rc2-mm2/arch/s390/kernel/compat_exec.c 2004-04-26 12:39:39.560193296 +0100
+++ rmap19/arch/s390/kernel/compat_exec.c 2004-04-27 19:19:17.340481400 +0100
@@ -73,7 +73,6 @@ int setup_arg_pages32(struct linux_binpr
mpnt->vm_pgoff = 0;
mpnt->vm_file = NULL;
mpol_set_vma_default(mpnt);
- INIT_LIST_HEAD(&mpnt->shared);
mpnt->vm_private_data = (void *) 0;
insert_vm_struct(mm, mpnt);
mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
--- 2.6.6-rc2-mm2/arch/x86_64/ia32/ia32_binfmt.c 2004-04-26 12:39:40.006125504 +0100
+++ rmap19/arch/x86_64/ia32/ia32_binfmt.c 2004-04-27 19:19:17.341481248 +0100
@@ -366,7 +366,6 @@ int setup_arg_pages(struct linux_binprm
mpnt->vm_pgoff = 0;
mpnt->vm_file = NULL;
mpol_set_vma_default(mpnt);
- INIT_LIST_HEAD(&mpnt->shared);
mpnt->vm_private_data = (void *) 0;
insert_vm_struct(mm, mpnt);
mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
* Re: [PATCH] rmap 18 i_mmap_nonlinear
2004-04-28 0:04 ` [PATCH] rmap 18 i_mmap_nonlinear Hugh Dickins
@ 2004-04-28 23:11 ` Rik van Riel
2004-04-28 23:44 ` William Lee Irwin III
0 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2004-04-28 23:11 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, Rajesh Venkatasubramanian, linux-kernel
On Wed, 28 Apr 2004, Hugh Dickins wrote:
> are some nonlinears; but really, we'd do best to take them out of the
> prio_tree altogether, into a list of their own - i_mmap_nonlinear.
Agreed, though on a related issue ...
> +++ rmap18/fs/inode.c 2004-04-27 19:19:05.866225752 +0100
> @@ -190,6 +190,7 @@ void inode_init_once(struct inode *inode
> spin_lock_init(&inode->i_data.private_lock);
> INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
> INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap_shared);
> + INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
... do we still need both i_mmap and i_mmap_shared?
Is there a place left where we're using both trees in
a different way, or are we just walking both trees
anyway in all places where they're referenced?
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: [PATCH] rmap 18 i_mmap_nonlinear
2004-04-28 23:11 ` Rik van Riel
@ 2004-04-28 23:44 ` William Lee Irwin III
2004-04-29 6:10 ` Hugh Dickins
0 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2004-04-28 23:44 UTC (permalink / raw)
To: Rik van Riel
Cc: Hugh Dickins, Andrew Morton, Rajesh Venkatasubramanian,
linux-kernel
On Wed, Apr 28, 2004 at 07:11:18PM -0400, Rik van Riel wrote:
> ... do we still need both i_mmap and i_mmap_shared?
> Is there a place left where we're using both trees in
> a different way, or are we just walking both trees
> anyway in all places where they're referenced?
I believe the flush_dcache_page() implementations touching
->i_mmap_shared care about this distinction.
-- wli
* Re: [PATCH] rmap 18 i_mmap_nonlinear
2004-04-28 23:44 ` William Lee Irwin III
@ 2004-04-29 6:10 ` Hugh Dickins
2004-04-29 6:32 ` William Lee Irwin III
2004-04-29 13:43 ` James Bottomley
0 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2004-04-29 6:10 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Rik van Riel, Andrew Morton, Rajesh Venkatasubramanian,
Russell King, James Bottomley, linux-kernel
On Wed, 28 Apr 2004, William Lee Irwin III wrote:
> On Wed, Apr 28, 2004 at 07:11:18PM -0400, Rik van Riel wrote:
> > ... do we still need both i_mmap and i_mmap_shared?
> > Is there a place left where we're using both trees in
> > a different way, or are we just walking both trees
> > anyway in all places where they're referenced?
Good point from Rik. I very nearly combined them in this patchset
(and was trying to avoid adding i_mmap_nonlinear on top, but Rajesh
gently persuaded me that would be a little foolish, the nonlinear
processing too different).
I'm sure i_mmap and i_mmap_shared can and should be combined (with
addition of a count of shared writable mappings, for those places
which need to know if there were any at all); but in the end decided
to leave that for later, consult the architectures affected first.
Another change to contemplate: we should be able to add page_mapped
checks to cut down on the flushing, this stuff was written before
there was any such count. But it's not straightforward: maybe some
of the flush_dcache_page calls come just before the rmap is added,
yet would be relying on it to be counted in?
> I believe the flush_dcache_page() implementations touching
> ->i_mmap_shared care about this distinction.
That's right, arm and parisc do handle them differently: currently
arm ignores i_mmap (and I think rmk was wondering a few months ago
whether that's actually correct, given that MAP_SHARED mappings
which can never become writable go in there - and that surprise is
itself a very good reason for combining them), and parisc... ah,
what it does in Linus' tree at present is about the same for both,
but there are some changes on the way.
The differences are not going to be enough to deserve two separate
prio_tree_roots in every struct address_space, we can check vm_flags
for any differences if necessary.
Something else I should have commented on, in that patch comment or
the next: although we now have the separate i_mmap_nonlinear list,
no attempt to search it for nonlinear pages in flush_dcache_page.
It looks like parisc has no sys_remap_file_pages syscall anyway,
and arm only flushes current active_mm, so should be okay so long
as people don't mix linear and nonlinear maps of same pages (hmm,
and don't map same page twice in a nonlinear: more likely) in same
mm: anyway, I think any problems there have to be a "Don't do that",
searching page tables in flush_dcache_page would be too too painful.
Hugh
* Re: [PATCH] rmap 18 i_mmap_nonlinear
2004-04-29 6:10 ` Hugh Dickins
@ 2004-04-29 6:32 ` William Lee Irwin III
2004-04-29 13:43 ` James Bottomley
1 sibling, 0 replies; 13+ messages in thread
From: William Lee Irwin III @ 2004-04-29 6:32 UTC (permalink / raw)
To: Hugh Dickins
Cc: Rik van Riel, Andrew Morton, Rajesh Venkatasubramanian,
Russell King, James Bottomley, linux-kernel
On Wed, 28 Apr 2004, William Lee Irwin III wrote:
>> I believe the flush_dcache_page() implementations touching
>> ->i_mmap_shared care about this distinction.
On Thu, Apr 29, 2004 at 07:10:59AM +0100, Hugh Dickins wrote:
> That's right, arm and parisc do handle them differently: currently
> arm ignores i_mmap (and I think rmk was wondering a few months ago
> whether that's actually correct, given that MAP_SHARED mappings
> which can never become writable go in there - and that surprise is
> itself a very good reason for combining them), and parisc... ah,
> what it does in Linus' tree at present is about the same for both,
> but there are some changes on the way.
> The differences are not going to be enough to deserve two separate
> prio_tree_roots in every struct address_space, we can check vm_flags
> for any differences if necessary.
It seemed these two actually wanted a precise recovery of virtual
addresses and the like for flush_dcache_page() like they would have
had with pte_chains, but never got around to using it.
On Thu, Apr 29, 2004 at 07:10:59AM +0100, Hugh Dickins wrote:
> Something else I should have commented on, in that patch comment or
> the next: although we now have the separate i_mmap_nonlinear list,
> no attempt to search it for nonlinear pages in flush_dcache_page.
> It looks like parisc has no sys_remap_file_pages syscall anyway,
> and arm only flushes current active_mm, so should be okay so long
> as people don't mix linear and nonlinear maps of same pages (hmm,
> and don't map same page twice in a nonlinear: more likely) in same
> mm: anyway, I think any problems there have to be a "Don't do that",
> searching page tables in flush_dcache_page would be too too painful.
Maybe it's worth #ifdef'ing out core remap_file_pages() support for
those arches if all it can do is harm to them wrt. cache coherency
issues. ARM probably wouldn't mind conserving the code it otherwise
wouldn't use.
-- wli
* Re: [PATCH] rmap 18 i_mmap_nonlinear
2004-04-29 6:10 ` Hugh Dickins
2004-04-29 6:32 ` William Lee Irwin III
@ 2004-04-29 13:43 ` James Bottomley
2004-04-29 14:24 ` Hugh Dickins
1 sibling, 1 reply; 13+ messages in thread
From: James Bottomley @ 2004-04-29 13:43 UTC (permalink / raw)
To: Hugh Dickins
Cc: William Lee Irwin III, Rik van Riel, Andrew Morton,
Rajesh Venkatasubramanian, Russell King, Linux Kernel
On Thu, 2004-04-29 at 01:10, Hugh Dickins wrote:
> That's right, arm and parisc do handle them differently: currently
> arm ignores i_mmap (and I think rmk was wondering a few months ago
> whether that's actually correct, given that MAP_SHARED mappings
> which can never become writable go in there - and that surprise is
> itself a very good reason for combining them), and parisc... ah,
> what it does in Linus' tree at present is about the same for both,
> but there are some changes on the way.
Actually, as I said before, parisc is reworking the cache flushing stuff
in our tree. As things currently stand we've altered our map allocation
so that we now treat i_mmap no differently from i_mmap_shared, so we'd
be fine with merging them.
James
* Re: [PATCH] rmap 18 i_mmap_nonlinear
2004-04-29 13:43 ` James Bottomley
@ 2004-04-29 14:24 ` Hugh Dickins
2004-04-29 15:20 ` Russell King
0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2004-04-29 14:24 UTC (permalink / raw)
To: James Bottomley
Cc: William Lee Irwin III, Rik van Riel, Andrew Morton,
Rajesh Venkatasubramanian, Russell King, Linux Kernel
On 29 Apr 2004, James Bottomley wrote:
> On Thu, 2004-04-29 at 01:10, Hugh Dickins wrote:
> > That's right, arm and parisc do handle them differently: currently
> > arm ignores i_mmap (and I think rmk was wondering a few months ago
> > whether that's actually correct, given that MAP_SHARED mappings
> > which can never become writable go in there - and that surprise is
> > itself a very good reason for combining them), and parisc... ah,
> > what it does in Linus' tree at present is about the same for both,
> > but there are some changes on the way.
>
> Actually, as I said before, parisc is reworking the cache flushing stuff
Yes, not forgotten, that's what I meant by saying some changes on the way.
> in our tree. As things currently stand we've altered our map allocation
> so that we now treat i_mmap no differently from i_mmap_shared, so we'd
Ah, not quite so in what you last showed me, but no matter...
> be fine with merging them.
Great, thanks. No need for you to refresh me: if I do go ahead with
merging them (not my current priority), it'll be obvious from whatever
patch I show against -mm, what change you'd want to make to your tree.
Hugh
* Re: [PATCH] rmap 18 i_mmap_nonlinear
2004-04-29 14:24 ` Hugh Dickins
@ 2004-04-29 15:20 ` Russell King
0 siblings, 0 replies; 13+ messages in thread
From: Russell King @ 2004-04-29 15:20 UTC (permalink / raw)
To: Hugh Dickins
Cc: James Bottomley, William Lee Irwin III, Rik van Riel,
Andrew Morton, Rajesh Venkatasubramanian, Linux Kernel
On Thu, Apr 29, 2004 at 03:24:03PM +0100, Hugh Dickins wrote:
> > be fine with merging them.
>
> Great, thanks. No need for you to refresh me: if I do go ahead with
> merging them (not my current priority), it'll be obvious from whatever
> patch I show against -mm, what change you'd want to make to your tree.
Please go ahead and merge them. I suspect ARM not scanning i_mmap
is a bug.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core