* [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
From: David Addison @ 2005-04-26 15:49 UTC
To: linux-kernel; +Cc: Andrew Morton, Andrea Arcangeli, David Addison
[-- Attachment #1: Type: text/plain, Size: 852 bytes --]
Hi,
here is a patch we use to integrate the Quadrics NICs into the Linux kernel.
The patch adds hooks to the Linux VM subsystem so that registered 'IOPROC'
devices can be informed of page table changes.
This allows the Quadrics NICs to perform user RDMAs safely, without requiring
page pinning. Judging by some of the recent IB and Ammasso discussions,
it may prove useful to those NICs too.
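For anyone who wants a feel for the registration model before reading the
patch, the following is a minimal userspace sketch, not kernel code. It
assumes stand-in definitions of mm_struct and vm_area_struct that hold only
what the list walk needs, trims ioproc_ops down to a single invalidate_page
callback, and uses a hypothetical fake_nic driver state plus a hypothetical
ioproc_demo() helper. Locking is left out; the real calls expect
mm->page_table_lock to be held.

```c
/* Userspace model of the ioproc hook list. Illustration only. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ioproc_ops;

/* Stand-ins (assumptions): only the fields the list walk needs. */
struct mm_struct {
	struct ioproc_ops *ioproc_ops;	/* head of the registered ops list */
};

struct vm_area_struct {
	struct mm_struct *vm_mm;
};

/* Trimmed-down ops structure: one callback instead of eight. */
struct ioproc_ops {
	struct ioproc_ops *next;
	void *arg;
	void (*invalidate_page)(void *arg, struct vm_area_struct *vma,
				unsigned long address);
};

/* Push the ops structure onto the front of the per-mm list. */
int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
{
	ip->next = mm->ioproc_ops;
	mm->ioproc_ops = ip;
	return 0;
}

/* Unlink the ops structure; -EINVAL if it was never registered. */
int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip)
{
	struct ioproc_ops **tmp;

	for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp = &(*tmp)->next)
		;
	if (*tmp) {
		*tmp = ip->next;
		return 0;
	}
	return -EINVAL;
}

/* Hook: call every registered invalidate_page callback in turn. */
void ioproc_invalidate_page(struct vm_area_struct *vma, unsigned long addr)
{
	struct ioproc_ops *cp;

	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
		if (cp->invalidate_page)
			cp->invalidate_page(cp->arg, vma, addr);
}

/* Hypothetical driver state: counts the invalidations it is told about. */
struct fake_nic {
	unsigned long invalidations;
	unsigned long last_addr;
};

void fake_nic_invalidate_page(void *arg, struct vm_area_struct *vma,
			      unsigned long address)
{
	struct fake_nic *nic = arg;

	(void)vma;
	nic->invalidations++;
	nic->last_addr = address;
}

/* Register, fire one hook, unregister, fire another; return the count. */
unsigned long ioproc_demo(void)
{
	struct mm_struct mm = { NULL };
	struct vm_area_struct vma = { &mm };
	struct fake_nic nic = { 0, 0 };
	struct ioproc_ops ops = {
		.arg = &nic,
		.invalidate_page = fake_nic_invalidate_page,
	};

	ioproc_register_ops(&mm, &ops);
	assert(mm.ioproc_ops == &ops);
	ioproc_invalidate_page(&vma, 0x1000);	/* reaches the callback */
	ioproc_unregister_ops(&mm, &ops);
	ioproc_invalidate_page(&vma, 0x2000);	/* list is empty, no call */
	return nic.invalidations;
}
```

In this model ioproc_demo() returns 1: only the invalidation issued while the
ops structure is registered reaches the callback.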
This patch has been deployed in many large (1000+ CPUs) production Linux
clusters at high profile HPC sites such as LLNL and PNL. It has also been
incorporated in Linux kernel releases from HP, SGI and Bull.
I have discussed this patch with Andrew Morton and Andrea Arcangeli and they
believe now is the time to encourage further comments on whether it's
suitable to be incorporated into the mainline kernel.
Cheers,
David Addison
Quadrics Ltd
[-- Attachment #2: ioproc-2.6.12-rc3.patch --]
[-- Type: text/x-patch, Size: 37629 bytes --]
diff -ruN linux-2.6.12-rc3.orig/include/linux/ioproc.h linux-2.6.12-rc3.ioproc/include/linux/ioproc.h
--- linux-2.6.12-rc3.orig/include/linux/ioproc.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/include/linux/ioproc.h 2005-04-26 15:55:14.000000000 +0100
@@ -0,0 +1,271 @@
+/* -*- linux-c -*-
+ *
+ * Copyright (C) 2002-2005 Quadrics Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ *
+ */
+
+/*
+ * Callbacks for IO processor page table updates.
+ */
+
+#ifndef __LINUX_IOPROC_H__
+#define __LINUX_IOPROC_H__
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+
+typedef struct ioproc_ops {
+ struct ioproc_ops *next;
+ void *arg;
+
+ void (*release)(void *arg, struct mm_struct *mm);
+ void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+ void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+ void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+
+ void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot);
+
+ void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+ void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+ void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+
+} ioproc_ops_t;
+
+/* IOPROC Registration
+ *
+ * Called by the IOPROC device driver to register its interest in page table
+ * changes for the process associated with the supplied mm_struct
+ *
+ * The caller should first allocate and fill out an ioproc_ops structure with
+ * the function pointers initialised to the device driver specific code for
+ * each callback. If the device driver doesn't have code for a particular
+ * callback then it should set the function pointer to be NULL.
+ * The ioproc_ops arg parameter will be passed unchanged as the first argument
+ * to each callback function invocation.
+ *
+ * The ioproc registration is not inherited across fork() and should be called
+ * once for each process that the IOPROC device driver is interested in.
+ *
+ * Must be called holding the mm->page_table_lock
+ */
+extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+
+/* IOPROC De-registration
+ *
+ * Called by the IOPROC device driver when it is no longer interested in page
+ * table changes for the process associated with the supplied mm_struct
+ *
+ * This does not normally need to be called, as the ioproc_release() code will
+ * automatically unlink the ioproc_ops struct from the mm_struct as the
+ * process exits
+ *
+ * Must be called holding the mm->page_table_lock
+ */
+extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+#ifdef CONFIG_IOPROC
+
+/* IOPROC Release
+ *
+ * Called during exit_mmap() as all vmas are torn down and unmapped.
+ *
+ * Also unlinks the ioproc_ops structure from the mm list as it goes.
+ *
+ * No need for locks as the mm can no longer be accessed at this point
+ *
+ */
+static inline void
+ioproc_release(struct mm_struct *mm)
+{
+ struct ioproc_ops *cp;
+
+ while ((cp = mm->ioproc_ops) != NULL) {
+ mm->ioproc_ops = cp->next;
+
+ if (cp->release)
+ cp->release(cp->arg, mm);
+ }
+}
+
+/* IOPROC SYNC RANGE
+ *
+ * Called when a memory map is synchronised with its disk image i.e. when the
+ * msync() syscall is invoked. Any future read or write to the associated
+ * pages by the IOPROC should cause the page to be marked as referenced or
+ * modified.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_sync_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ struct ioproc_ops *cp;
+
+ for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+ if (cp->sync_range)
+ cp->sync_range(cp->arg, vma, start, end);
+}
+
+/* IOPROC INVALIDATE RANGE
+ *
+ * Called whenever a valid PTE is unloaded e.g. when a page is unmapped by the
+ * user or paged out by the kernel.
+ *
+ * After this call the IOPROC must not access the physical memory again unless
+ * a new translation is loaded.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_invalidate_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ struct ioproc_ops *cp;
+
+ for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+ if (cp->invalidate_range)
+ cp->invalidate_range(cp->arg, vma, start, end);
+}
+
+/* IOPROC UPDATE RANGE
+ *
+ * Called whenever a valid PTE is loaded e.g. mmapping memory, moving the brk
+ * up, when breaking COW or faulting in an anonymous page of memory.
+ *
+ * These give the IOPROC device driver the opportunity to load translations
+ * speculatively, which can improve performance by avoiding device translation
+ * faults.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_update_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ struct ioproc_ops *cp;
+
+ for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+ if (cp->update_range)
+ cp->update_range(cp->arg, vma, start, end);
+}
+
+
+/* IOPROC CHANGE PROTECTION
+ *
+ * Called when the protection on a region of memory is changed i.e. when the
+ * mprotect() syscall is invoked.
+ *
+ * The IOPROC must not be able to write to a read-only page, so if the
+ * permissions are downgraded then it must honour them. If they are upgraded
+ * it can treat this in the same way as the ioproc_update_[range|page]() calls
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_change_protection(struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot)
+{
+ struct ioproc_ops *cp;
+
+ for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+ if (cp->change_protection)
+ cp->change_protection(cp->arg, vma, start, end, newprot);
+}
+
+/* IOPROC SYNC PAGE
+ *
+ * Called when a memory map is synchronised with its disk image i.e. when the
+ * msync() syscall is invoked. Any future read or write to the associated page
+ * by the IOPROC should cause the page to be marked as referenced or modified.
+ *
+ * Not currently called as msync() calls ioproc_sync_range() instead
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_sync_page(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct ioproc_ops *cp;
+
+ for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+ if (cp->sync_page)
+ cp->sync_page(cp->arg, vma, addr);
+}
+
+/* IOPROC INVALIDATE PAGE
+ *
+ * Called whenever a valid PTE is unloaded e.g. when a page is unmapped by the
+ * user or paged out by the kernel.
+ *
+ * After this call the IOPROC must not access the physical memory again unless
+ * a new translation is loaded.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_invalidate_page(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct ioproc_ops *cp;
+
+ for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+ if (cp->invalidate_page)
+ cp->invalidate_page(cp->arg, vma, addr);
+}
+
+/* IOPROC UPDATE PAGE
+ *
+ * Called whenever a valid PTE is loaded e.g. mmapping memory, moving the brk
+ * up, when breaking COW or faulting in an anonymous page of memory.
+ *
+ * These give the IOPROC device the opportunity to load translations
+ * speculatively, which can improve performance by avoiding device translation
+ * faults.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_update_page(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct ioproc_ops *cp;
+
+ for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+ if (cp->update_page)
+ cp->update_page(cp->arg, vma, addr);
+}
+
+#else
+
+/* ! CONFIG_IOPROC so make all hooks empty */
+
+#define ioproc_release(mm) do { } while (0)
+
+#define ioproc_sync_range(vma, start, end) do { } while (0)
+
+#define ioproc_invalidate_range(vma, start, end) do { } while (0)
+
+#define ioproc_update_range(vma, start, end) do { } while (0)
+
+#define ioproc_change_protection(vma, start, end, prot) do { } while (0)
+
+#define ioproc_sync_page(vma, addr) do { } while (0)
+
+#define ioproc_invalidate_page(vma, addr) do { } while (0)
+
+#define ioproc_update_page(vma, addr) do { } while (0)
+
+#endif /* CONFIG_IOPROC */
+
+#endif /* __LINUX_IOPROC_H__ */
diff -ruN linux-2.6.12-rc3.orig/include/linux/sched.h linux-2.6.12-rc3.ioproc/include/linux/sched.h
--- linux-2.6.12-rc3.orig/include/linux/sched.h 2005-04-26 09:02:29.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/include/linux/sched.h 2005-04-26 15:55:14.000000000 +0100
@@ -186,6 +186,9 @@
asmlinkage void schedule(void);
struct namespace;
+#ifdef CONFIG_IOPROC
+struct ioproc_ops;
+#endif
/* Maximum number of active map areas.. This is a random (large) number */
#define DEFAULT_MAX_MAP_COUNT 65536
@@ -267,6 +270,11 @@
unsigned long hiwater_rss; /* High-water RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
+
+#ifdef CONFIG_IOPROC
+ /* hooks for io devices with advanced RDMA capabilities */
+ struct ioproc_ops *ioproc_ops;
+#endif
};
struct sighand_struct {
diff -ruN linux-2.6.12-rc3.orig/kernel/fork.c linux-2.6.12-rc3.ioproc/kernel/fork.c
--- linux-2.6.12-rc3.orig/kernel/fork.c 2005-04-26 09:02:36.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/kernel/fork.c 2005-04-26 15:55:14.000000000 +0100
@@ -320,6 +320,9 @@
spin_lock_init(&mm->page_table_lock);
rwlock_init(&mm->ioctx_list_lock);
mm->ioctx_list = NULL;
+#ifdef CONFIG_IOPROC
+ mm->ioproc_ops = NULL;
+#endif
mm->default_kioctx = (struct kioctx)INIT_KIOCTX(mm->default_kioctx, *mm);
mm->free_area_cache = TASK_UNMAPPED_BASE;
diff -ruN linux-2.6.12-rc3.orig/mm/fremap.c linux-2.6.12-rc3.ioproc/mm/fremap.c
--- linux-2.6.12-rc3.orig/mm/fremap.c 2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/fremap.c 2005-04-26 15:55:14.000000000 +0100
@@ -12,6 +12,7 @@
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/swapops.h>
+#include <linux/ioproc.h>
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/syscalls.h>
@@ -30,6 +31,7 @@
if (pte_present(pte)) {
unsigned long pfn = pte_pfn(pte);
+ ioproc_invalidate_page(vma, addr);
flush_cache_page(vma, addr, pfn);
pte = ptep_clear_flush(vma, addr, ptep);
if (pfn_valid(pfn)) {
@@ -99,6 +101,7 @@
pte_val = *pte;
pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
+ ioproc_update_page(vma, addr);
err = 0;
err_unlock:
@@ -143,6 +146,7 @@
pte_val = *pte;
pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
+ ioproc_update_page(vma, addr);
spin_unlock(&mm->page_table_lock);
return 0;
diff -ruN linux-2.6.12-rc3.orig/mm/hugetlb.c linux-2.6.12-rc3.ioproc/mm/hugetlb.c
--- linux-2.6.12-rc3.orig/mm/hugetlb.c 2005-03-02 07:38:12.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/mm/hugetlb.c 2005-04-26 15:55:14.000000000 +0100
@@ -11,6 +11,7 @@
#include <linux/sysctl.h>
#include <linux/highmem.h>
#include <linux/nodemask.h>
+#include <linux/ioproc.h>
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
static unsigned long nr_huge_pages, free_huge_pages;
@@ -255,6 +256,7 @@
struct mm_struct *mm = vma->vm_mm;
spin_lock(&mm->page_table_lock);
+ ioproc_invalidate_range(vma, start, start + length);
unmap_hugepage_range(vma, start, start + length);
spin_unlock(&mm->page_table_lock);
}
diff -ruN linux-2.6.12-rc3.orig/mm/ioproc.c linux-2.6.12-rc3.ioproc/mm/ioproc.c
--- linux-2.6.12-rc3.orig/mm/ioproc.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/ioproc.c 2005-04-26 15:55:14.000000000 +0100
@@ -0,0 +1,58 @@
+/* -*- linux-c -*-
+ *
+ * Copyright (C) 2002-2005 Quadrics Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ *
+ */
+
+/*
+ * Registration for IO processor page table updates.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <linux/mm.h>
+#include <linux/ioproc.h>
+
+int
+ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
+{
+ ip->next = mm->ioproc_ops;
+ mm->ioproc_ops = ip;
+
+ return 0;
+}
+
+EXPORT_SYMBOL_GPL(ioproc_register_ops);
+
+int
+ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip)
+{
+ struct ioproc_ops **tmp;
+
+ for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp = &(*tmp)->next)
+ ;
+ if (*tmp) {
+ *tmp = ip->next;
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+EXPORT_SYMBOL_GPL(ioproc_unregister_ops);
diff -ruN linux-2.6.12-rc3.orig/mm/Kconfig linux-2.6.12-rc3.ioproc/mm/Kconfig
--- linux-2.6.12-rc3.orig/mm/Kconfig 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/Kconfig 2005-04-26 15:55:14.000000000 +0100
@@ -0,0 +1,15 @@
+#
+# VM subsystem specific config
+#
+
+# Support for IO processors which have advanced RDMA capabilities
+#
+config IOPROC
+ bool "Enable IOPROC VM hooks"
+ depends on MMU
+ default y
+ help
+ This option enables hooks in the VM subsystem so that IO devices which
+ incorporate advanced RDMA capabilities can be kept in sync with CPU
+ page table changes.
+ See Documentation/vm/ioproc.txt for more details.
diff -ruN linux-2.6.12-rc3.orig/mm/Makefile linux-2.6.12-rc3.ioproc/mm/Makefile
--- linux-2.6.12-rc3.orig/mm/Makefile 2005-03-02 07:38:12.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/mm/Makefile 2005-04-26 15:55:14.000000000 +0100
@@ -17,4 +17,5 @@
obj-$(CONFIG_NUMA) += mempolicy.o
obj-$(CONFIG_SHMEM) += shmem.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
+obj-$(CONFIG_IOPROC) += ioproc.o
diff -ruN linux-2.6.12-rc3.orig/mm/memory.c linux-2.6.12-rc3.ioproc/mm/memory.c
--- linux-2.6.12-rc3.orig/mm/memory.c 2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/memory.c 2005-04-26 15:55:14.000000000 +0100
@@ -45,6 +45,7 @@
#include <linux/swap.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/ioproc.h>
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/init.h>
@@ -765,6 +766,7 @@
lru_add_drain();
spin_lock(&mm->page_table_lock);
+ ioproc_invalidate_range(vma, address, end);
tlb = tlb_gather_mmu(mm, 0);
end = unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details);
tlb_finish_mmu(tlb, address, end);
@@ -1076,6 +1078,7 @@
{
pgd_t *pgd;
unsigned long next;
+ unsigned long beg = addr;
unsigned long end = addr + size;
struct mm_struct *mm = vma->vm_mm;
int err;
@@ -1084,12 +1087,14 @@
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
spin_lock(&mm->page_table_lock);
+ ioproc_invalidate_range(vma, beg, end);
do {
next = pgd_addr_end(addr, end);
err = zeromap_pud_range(mm, pgd, addr, next, prot);
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ ioproc_update_range(vma, beg, end);
spin_unlock(&mm->page_table_lock);
return err;
}
@@ -1164,6 +1169,7 @@
{
pgd_t *pgd;
unsigned long next;
+ unsigned long beg = addr;
unsigned long end = addr + size;
struct mm_struct *mm = vma->vm_mm;
int err;
@@ -1183,6 +1189,7 @@
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
spin_lock(&mm->page_table_lock);
+ ioproc_invalidate_range(vma, beg, end);
do {
next = pgd_addr_end(addr, end);
err = remap_pud_range(mm, pgd, addr, next,
@@ -1190,6 +1197,7 @@
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ ioproc_update_range(vma, beg, end);
spin_unlock(&mm->page_table_lock);
return err;
}
@@ -1218,8 +1226,10 @@
entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
vma);
+ ioproc_invalidate_page(vma, address);
ptep_establish(vma, address, page_table, entry);
update_mmu_cache(vma, address, entry);
+ ioproc_update_page(vma, address);
lazy_mmu_prot_update(entry);
}
@@ -1273,6 +1283,7 @@
vma);
ptep_set_access_flags(vma, address, page_table, entry, 1);
update_mmu_cache(vma, address, entry);
+ ioproc_update_page(vma, address);
lazy_mmu_prot_update(entry);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
@@ -1736,6 +1747,7 @@
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, address, pte);
+ ioproc_update_page(vma, address);
lazy_mmu_prot_update(pte);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
@@ -1794,6 +1806,7 @@
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
+ ioproc_update_page(vma, addr);
lazy_mmu_prot_update(entry);
spin_unlock(&mm->page_table_lock);
out:
@@ -1920,6 +1933,7 @@
/* no need to invalidate: a not-present page shouldn't be cached */
update_mmu_cache(vma, address, entry);
+ ioproc_update_page(vma, address);
lazy_mmu_prot_update(entry);
spin_unlock(&mm->page_table_lock);
out:
diff -ruN linux-2.6.12-rc3.orig/mm/mmap.c linux-2.6.12-rc3.ioproc/mm/mmap.c
--- linux-2.6.12-rc3.orig/mm/mmap.c 2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mmap.c 2005-04-26 15:55:15.000000000 +0100
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/ioproc.h>
#include <linux/personality.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
@@ -1627,6 +1628,7 @@
lru_add_drain();
spin_lock(&mm->page_table_lock);
+ ioproc_invalidate_range(vma, start, end);
tlb = tlb_gather_mmu(mm, 0);
unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
@@ -1905,6 +1907,7 @@
spin_lock(&mm->page_table_lock);
+ ioproc_release(mm);
flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
/* Use -1 here to ensure all VMAs in the mm are unmapped */
diff -ruN linux-2.6.12-rc3.orig/mm/mprotect.c linux-2.6.12-rc3.ioproc/mm/mprotect.c
--- linux-2.6.12-rc3.orig/mm/mprotect.c 2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mprotect.c 2005-04-26 15:55:15.000000000 +0100
@@ -10,6 +10,7 @@
#include <linux/mm.h>
#include <linux/hugetlb.h>
+#include <linux/ioproc.h>
#include <linux/slab.h>
#include <linux/shm.h>
#include <linux/mman.h>
@@ -89,6 +90,7 @@
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
spin_lock(&mm->page_table_lock);
+ ioproc_change_protection(vma, start, end, newprot);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
diff -ruN linux-2.6.12-rc3.orig/mm/mremap.c linux-2.6.12-rc3.ioproc/mm/mremap.c
--- linux-2.6.12-rc3.orig/mm/mremap.c 2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mremap.c 2005-04-26 15:55:15.000000000 +0100
@@ -9,6 +9,7 @@
#include <linux/mm.h>
#include <linux/hugetlb.h>
+#include <linux/ioproc.h>
#include <linux/slab.h>
#include <linux/shm.h>
#include <linux/mman.h>
@@ -161,6 +162,8 @@
{
unsigned long offset;
+ ioproc_invalidate_range(vma, old_addr, old_addr + len);
+ ioproc_invalidate_range(vma, new_addr, new_addr + len);
flush_cache_range(vma, old_addr, old_addr + len);
/*
diff -ruN linux-2.6.12-rc3.orig/mm/msync.c linux-2.6.12-rc3.ioproc/mm/msync.c
--- linux-2.6.12-rc3.orig/mm/msync.c 2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/msync.c 2005-04-26 15:55:15.000000000 +0100
@@ -13,6 +13,7 @@
#include <linux/mman.h>
#include <linux/hugetlb.h>
#include <linux/syscalls.h>
+#include <linux/ioproc.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
@@ -95,6 +96,7 @@
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
spin_lock(&mm->page_table_lock);
+ ioproc_sync_range(vma, addr, end);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
diff -ruN linux-2.6.12-rc3.orig/mm/rmap.c linux-2.6.12-rc3.ioproc/mm/rmap.c
--- linux-2.6.12-rc3.orig/mm/rmap.c 2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/rmap.c 2005-04-26 15:55:15.000000000 +0100
@@ -53,6 +53,7 @@
#include <linux/init.h>
#include <linux/rmap.h>
#include <linux/rcupdate.h>
+#include <linux/ioproc.h>
#include <asm/tlbflush.h>
@@ -573,6 +574,7 @@
}
/* Nuke the page table entry. */
+ ioproc_invalidate_page(vma, address);
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
@@ -690,6 +692,7 @@
continue;
/* Nuke the page table entry. */
+ ioproc_invalidate_page(vma, address);
flush_cache_page(vma, address, pfn);
pteval = ptep_clear_flush(vma, address, pte);
diff -ruN linux-2.6.12-rc3.orig/arch/i386/defconfig linux-2.6.12-rc3.ioproc/arch/i386/defconfig
--- linux-2.6.12-rc3.orig/arch/i386/defconfig 2005-04-26 08:59:33.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/i386/defconfig 2005-04-26 15:55:15.000000000 +0100
@@ -120,6 +120,7 @@
CONFIG_IRQBALANCE=y
CONFIG_HAVE_DEC_LOCK=y
# CONFIG_REGPARM is not set
+CONFIG_IOPROC=y
#
# Power management options (ACPI, APM)
diff -ruN linux-2.6.12-rc3.orig/arch/i386/Kconfig linux-2.6.12-rc3.ioproc/arch/i386/Kconfig
--- linux-2.6.12-rc3.orig/arch/i386/Kconfig 2005-04-26 08:59:33.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/i386/Kconfig 2005-04-26 15:55:15.000000000 +0100
@@ -923,6 +923,8 @@
If unsure, say Y. Only embedded should say N here.
+source "mm/Kconfig"
+
endmenu
diff -ruN linux-2.6.12-rc3.orig/arch/ia64/defconfig linux-2.6.12-rc3.ioproc/arch/ia64/defconfig
--- linux-2.6.12-rc3.orig/arch/ia64/defconfig 2005-03-02 07:37:48.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/arch/ia64/defconfig 2005-04-26 15:55:15.000000000 +0100
@@ -92,6 +92,7 @@
CONFIG_PERFMON=y
CONFIG_IA64_PALINFO=y
CONFIG_ACPI_DEALLOCATE_IRQ=y
+CONFIG_IOPROC=y
#
# Firmware Drivers
diff -ruN linux-2.6.12-rc3.orig/arch/ia64/Kconfig linux-2.6.12-rc3.ioproc/arch/ia64/Kconfig
--- linux-2.6.12-rc3.orig/arch/ia64/Kconfig 2005-04-26 08:59:38.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/ia64/Kconfig 2005-04-26 15:55:15.000000000 +0100
@@ -319,6 +319,8 @@
depends on IOSAPIC && EXPERIMENTAL
default y
+source "mm/Kconfig"
+
source "drivers/firmware/Kconfig"
source "fs/Kconfig.binfmt"
diff -ruN linux-2.6.12-rc3.orig/arch/x86_64/defconfig linux-2.6.12-rc3.ioproc/arch/x86_64/defconfig
--- linux-2.6.12-rc3.orig/arch/x86_64/defconfig 2005-04-26 09:00:10.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/x86_64/defconfig 2005-04-26 15:55:15.000000000 +0100
@@ -100,6 +100,7 @@
CONFIG_SECCOMP=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
+CONFIG_IOPROC=y
#
# Power management options
diff -ruN linux-2.6.12-rc3.orig/arch/x86_64/Kconfig linux-2.6.12-rc3.ioproc/arch/x86_64/Kconfig
--- linux-2.6.12-rc3.orig/arch/x86_64/Kconfig 2005-04-26 09:00:10.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/x86_64/Kconfig 2005-04-26 15:55:15.000000000 +0100
@@ -458,6 +458,8 @@
depends on IA32_EMULATION
default y
+source "mm/Kconfig"
+
endmenu
source drivers/Kconfig
diff -ruN linux-2.6.12-rc3.orig/Documentation/vm/ioproc.txt linux-2.6.12-rc3.ioproc/Documentation/vm/ioproc.txt
--- linux-2.6.12-rc3.orig/Documentation/vm/ioproc.txt 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/Documentation/vm/ioproc.txt 2005-04-26 15:55:15.000000000 +0100
@@ -0,0 +1,500 @@
+Linux IOPROC patch overview
+===========================
+
+The network interface for an HPC network differs significantly from
+network interfaces for traditional IP networks. HPC networks tend to
+be used directly from user processes and perform large RDMA transfers
+between the process address spaces. They also have a requirement
+for low latency communication, and typically achieve this by OS bypass
+techniques. This then requires a different model to traditional
+interconnects, in that a process may need to expose a large amount of
+its address space to the network RDMA.
+
+Locking down of memory has been a common mechanism for performing
+this, together with a pin-down cache implemented in user
+libraries. The disadvantage of this method is that large portions of
+the physical memory can be locked down for a single process, even if
+its working set changes over the different phases of its
+execution. This leads to inefficient memory utilisation - akin to the
+disadvantage of swapping compared to paging.
+
+This model also has problems where memory is being dynamically
+allocated and freed, since the pin down cache is unaware that memory
+may have been released by a call to munmap() and so it will still be
+locking down the now unused pages.
+
+Some modern HPC network interfaces implement their own MMU and are
+able to handle a translation fault during a network access. The
+Quadrics (http://www.quadrics.com) devices (Elan3 and Elan4) have done
+this for some time, and the InfiniBand standard also allows for the
+case where memory has been deregistered when an RDMA occurs.
+These NICs are able to operate in an environment where paging occurs
+and do not require memory to be locked down. The advantage of this is
+that the user process can expose large portions of its address space
+without having to worry about physical memory constraints.
+
+However, should the operating system decide to swap a page to disk,
+then the NIC must be made aware that it should no longer read from or
+write to this memory, but should generate a translation fault instead.
+
+The ioproc patch has been developed to provide a mechanism whereby the
+device driver for a NIC can be made aware of when a user process's
+address translations change, either by paging or by explicitly mapping
+or unmapping of memory.
+
+The patch involves inserting callbacks where translations are being
+invalidated to notify the NIC that the memory behind those
+translations is no longer visible to the application (and so should
+not be visible to the NIC). This callback is then responsible for
+ensuring that the NIC will not access the physical memory that was
+being mapped.
+
+An ioproc invalidate callback in the kswapd code could be utilised to
+prevent memory from being paged out if the NIC is unable to support
+RDMA page faulting. This has not yet been implemented in this patch.
+
+For NICs which support RDMA page faulting, there is no requirement
+for a user level pin down cache, since they are able to page-in their
+translations on the first communication using a buffer. However this
+is likely to be inefficient, resulting in slow first use of the
+buffer. If the communication buffers were continually allocated and
+freed using mmap() based malloc() calls then this would lead to all
+communications being slower than desirable.
+
+To optimise these warm-up cases the ioproc patch adds calls to
+ioproc_update wherever the kernel is creating translations for a user
+process. These then allow the device driver to preload translations
+so that they are already present for the first network communication
+from a buffer.
+
+Linux 2.6 IOPROC implementation details
+=======================================
+
+The Linux IOPROC patch adds hooks to the Linux VM code whenever page
+table entries are being created and/or invalidated. IOPROC device
+drivers can register their interest in being informed of such changes
+by registering an ioproc_ops structure, which is defined as follows:
+
+extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+typedef struct ioproc_ops {
+ struct ioproc_ops *next;
+ void *arg;
+
+ void (*release)(void *arg, struct mm_struct *mm);
+ void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+ void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+ void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+
+ void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot);
+
+ void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+ void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+ void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+
+} ioproc_ops_t;
+
+ioproc_register_ops
+===================
+This function should be called by the IOPROC device driver to register
+its interest in PTE changes for the process associated with the passed
+in mm_struct.
+
+The ioproc registration is not inherited across fork() and should be
+called once for each process that IOPROC is interested in.
+
+This function must be called whilst holding the mm->page_table_lock.
+
+ioproc_unregister_ops
+=====================
+This function should be called by the IOPROC device driver when it no
+longer needs to be informed of PTE changes in the process associated
+with the supplied mm_struct.
+
+This function does not normally need to be called, as the ioproc_ops
+struct is unlinked from the associated mm_struct during the
+ioproc_release() call.
+
+This function must be called whilst holding the mm->page_table_lock.
+
+ioproc_ops struct
+=================
+A linked list of ioproc_ops structures is hung off the user process's
+mm_struct (linux/sched.h). At each hook point in the patched kernel
+the ioproc patch will call the associated ioproc_ops callback function
+pointer in turn for each registered structure.
+
+The intention of the callbacks is to allow the IOPROC device driver to
+inspect the new or modified PTE via the Linux kernel
+(e.g. find_pte_map()). These callbacks should not modify the Linux
+kernel VM state or PTE entries.
+
+The ioproc_ops callback function pointers are defined as follows:
+
+ioproc_release
+==============
+The release hook is called when a program exits and all its VMAs
+are torn down and unmapped, i.e. during exit_mmap(). Before each
+release hook is called the ioproc_ops structure is unlinked from the
+mm_struct.
+
+No locks are required as the process has the only reference to the mm
+at this point.
+
+ioproc_sync_[range|page]
+========================
+The sync hooks are called when a memory map is synchronised with its
+disk image i.e. when the msync() syscall is invoked. Any future read
+or write by the IOPROC device to the associated pages should cause the
+page to be marked as referenced or modified.
+
+Called holding the mm->page_table_lock
+
+ioproc_invalidate_[range|page]
+==============================
+The invalidate hooks are called whenever a valid PTE is unloaded
+e.g. when a page is unmapped by the user or paged out by the
+kernel. After this call the IOPROC must not access the physical memory
+again unless a new translation is loaded.
+
+Called holding the mm->page_table_lock
+
+ioproc_update_[range|page]
+==========================
+The update hooks are called whenever a valid PTE is loaded,
+e.g. when mmapping memory, moving the brk up, breaking COW or faulting
+in an anonymous page of memory. These give the IOPROC device the
+opportunity to load translations speculatively, which can improve
+performance by avoiding device translation faults.
+
+Called holding the mm->page_table_lock
+
+ioproc_change_protection
+========================
+This hook is called when the protection on a region of memory is
+changed i.e. when the mprotect() syscall is invoked.
+
+The IOPROC must not be able to write to a read-only page, so if the
+permissions are downgraded then it must honour them. If they are
+upgraded it can treat this in the same way as the
+ioproc_update_[range|page]() calls.
+
+Called holding the mm->page_table_lock
+
+
+Linux 2.6 IOPROC patch details
+==============================
+
+Here are the specific details of each ioproc hook added to the Linux
+2.6 VM system and the reasons for doing so:
+
+===============================================================================
+++++ FILE
+ mm/fremap.c
+
+==== FUNCTION
+ zap_pte
+
+CALLED FROM
+ install_page
+ install_file_pte
+
+PTE MODIFICATION
+ ptep_clear_flush
+
+ADDED HOOKS
+ ioproc_invalidate_page
+
+==== FUNCTION
+ install_page
+
+CALLED FROM
+ filemap_populate, shmem_populate
+
+PTE MODIFICATION
+ set_pte_at
+
+ADDED HOOKS
+ ioproc_update_page
+
+==== FUNCTION
+ install_file_pte
+
+CALLED FROM
+ filemap_populate, shmem_populate
+
+PTE MODIFICATION
+ set_pte_at
+
+ADDED HOOKS
+ ioproc_update_page
+
+
+===============================================================================
+++++ FILE
+ mm/memory.c
+
+==== FUNCTION
+ copy_page_range
+
+CALLED FROM
+ dup_mmap (fork.c)
+
+PTE MODIFICATION
+ set_pte_at (copy_one_pte)
+
+ADDED HOOKS
+ None necessary as it's creating a new process
+
+==== FUNCTION
+ zap_page_range
+
+CALLED FROM
+ read_zero_pagealigned, madvise_dontneed, unmap_mapping_range,
+ unmap_mapping_range_list, do_mmap_pgoff
+
+PTE MODIFICATION
+ set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+ ioproc_invalidate_range
+
+
+==== FUNCTION
+ zeromap_page_range
+
+CALLED FROM
+ read_zero_pagealigned, mmap_zero
+
+PTE MODIFICATION
+ set_pte_at (zeromap_pte_range via zeromap_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+ ioproc_invalidate_range
+ ioproc_update_range
+
+
+==== FUNCTION
+ remap_pfn_range
+
+CALLED FROM
+ many device drivers
+
+PTE MODIFICATION
+ set_pte_at (remap_pte_range via remap_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+ ioproc_invalidate_range
+ ioproc_update_range
+
+
+==== FUNCTION
+ break_cow
+
+CALLED FROM
+ do_wp_page
+
+PTE MODIFICATION
+ ptep_establish
+
+ADDED HOOKS
+ ioproc_invalidate_page
+ ioproc_update_page
+
+
+==== FUNCTION
+ do_wp_page
+
+CALLED FROM
+ do_swap_page, handle_pte_fault
+
+PTE MODIFICATION
+ ptep_set_access_flags, break_cow
+
+ADDED HOOKS
+ ioproc_update_page
+
+
+==== FUNCTION
+ do_swap_page
+
+CALLED FROM
+ handle_pte_fault
+
+PTE MODIFICATION
+ set_pte_at
+
+ADDED HOOKS
+ ioproc_update_page
+
+
+==== FUNCTION
+ do_anonymous_page
+
+CALLED FROM
+ do_no_page
+
+PTE MODIFICATION
+ set_pte_at
+
+ADDED HOOKS
+ ioproc_update_page
+
+
+==== FUNCTION
+ do_no_page
+
+CALLED FROM
+ do_file_page, handle_pte_fault
+
+PTE MODIFICATION
+ set_pte_at
+
+ADDED HOOKS
+ ioproc_update_page
+
+
+==== FUNCTION
+ handle_pte_fault
+
+CALLED FROM
+ handle_mm_fault
+
+PTE MODIFICATION
+ ptep_set_access_flags, do_no_page, do_file_page, do_swap_page
+
+ADDED HOOKS
+ Handled in called functions and not necessary for minor fault
+
+
+===============================================================================
+++++ FILE
+ mm/mmap.c
+
+==== FUNCTION
+ unmap_region
+
+CALLED FROM
+ do_munmap
+
+PTE MODIFICATION
+ set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+ ioproc_invalidate_range
+
+
+==== FUNCTION
+ exit_mmap
+
+CALLED FROM
+ mmput
+
+PTE MODIFICATION
+ set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+ ioproc_release
+
+
+===============================================================================
+++++ FILE
+ mm/mprotect.c
+
+==== FUNCTION
+ change_protection
+
+CALLED FROM
+ mprotect_fixup
+
+PTE MODIFICATION
+ set_pte_at (change_pte_range via change_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+ ioproc_change_protection
+
+
+===============================================================================
+++++ FILE
+ mm/mremap.c
+
+==== FUNCTION
+ move_page_tables
+
+CALLED FROM
+ move_vma
+
+PTE MODIFICATION
+ ptep_clear_flush (move_one_page)
+
+ADDED HOOKS
+ ioproc_invalidate_range
+ ioproc_invalidate_range
+
+
+===============================================================================
+++++ FILE
+ mm/rmap.c
+
+==== FUNCTION
+ try_to_unmap_one
+
+CALLED FROM
+ try_to_unmap_anon, try_to_unmap_file
+
+PTE MODIFICATION
+ ptep_clear_flush
+
+ADDED HOOKS
+ ioproc_invalidate_page
+
+
+==== FUNCTION
+ try_to_unmap_cluster
+
+CALLED FROM
+ try_to_unmap_file
+
+PTE MODIFICATION
+ ptep_clear_flush
+
+ADDED HOOKS
+ ioproc_invalidate_page
+
+
+===============================================================================
+++++ FILE
+ mm/msync.c
+
+==== FUNCTION
+ filemap_sync
+
+CALLED FROM
+ msync_interval
+
+PTE MODIFICATION
+ ptep_clear_flush_dirty (filemap_sync_pte)
+
+ADDED HOOKS
+ ioproc_sync_range
+
+
+===============================================================================
+++++ FILE
+ mm/hugetlb.c
+
+==== FUNCTION
+ zap_hugepage_range
+
+CALLED FROM
+ hugetlb_vmtruncate_list
+
+PTE MODIFICATION
+ ptep_get_and_clear (unmap_hugepage_range)
+
+ADDED HOOKS
+ ioproc_invalidate_range
+
+
+-- Last update DavidAddison - 26 Apr 2005
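
The callback dispatch described in the documentation above can be modelled in userspace. The following is a minimal sketch, not kernel code: the struct definitions are cut-down stand-ins, the model_* names and the counting callbacks are hypothetical, and the assumption that registration pushes onto the head of the list is ours (the registration code itself is not shown in this part of the patch). Locking of mm->page_table_lock is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; NOT the real definitions. */
struct mm_struct;
struct vm_area_struct {
	struct mm_struct *vm_mm;
};

/* Cut-down ioproc_ops: only two of the callbacks are modelled. */
struct ioproc_ops {
	struct ioproc_ops *next;
	void *arg;
	void (*release)(void *arg, struct mm_struct *mm);
	void (*invalidate_range)(void *arg, struct vm_area_struct *vma,
				 unsigned long start, unsigned long end);
};

struct mm_struct {
	struct ioproc_ops *ioproc_ops;	/* head of the per-process list */
};

/* ASSUMPTION: registration pushes onto the list head. Locking omitted. */
int model_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
{
	assert(mm != NULL && ip != NULL);
	ip->next = mm->ioproc_ops;
	mm->ioproc_ops = ip;
	return 0;
}

/* Call every registered invalidate_range callback, skipping NULL ones. */
void model_invalidate_range(struct vm_area_struct *vma,
			    unsigned long start, unsigned long end)
{
	struct ioproc_ops *cp;

	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
		if (cp->invalidate_range)
			cp->invalidate_range(cp->arg, vma, start, end);
}

/* Unlink each structure from the mm before calling its release hook. */
void model_release(struct mm_struct *mm)
{
	struct ioproc_ops *cp;

	while ((cp = mm->ioproc_ops) != NULL) {
		mm->ioproc_ops = cp->next;
		if (cp->release)
			cp->release(cp->arg, mm);
	}
}

/* Hypothetical driver state and callbacks that only count invocations. */
struct counters {
	int invalidates;
	int releases;
};

static void count_invalidate(void *arg, struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
	(void)vma; (void)start; (void)end;
	((struct counters *)arg)->invalidates++;
}

static void count_release(void *arg, struct mm_struct *mm)
{
	(void)mm;
	((struct counters *)arg)->releases++;
}

/* Register two drivers, invalidate once, then release; 0 means as expected. */
int model_demo(void)
{
	struct counters ca = { 0, 0 }, cb = { 0, 0 };
	struct mm_struct mm = { NULL };
	struct vm_area_struct vma = { &mm };
	struct ioproc_ops a = { NULL, &ca, count_release, count_invalidate };
	struct ioproc_ops b = { NULL, &cb, count_release, NULL };

	model_register_ops(&mm, &a);
	model_register_ops(&mm, &b);

	model_invalidate_range(&vma, 0x1000UL, 0x3000UL);
	if (ca.invalidates != 1 || cb.invalidates != 0)
		return 1;	/* b has no invalidate hook and must be skipped */

	model_release(&mm);
	if (ca.releases != 1 || cb.releases != 1 || mm.ioproc_ops != NULL)
		return 2;	/* both released, list left empty */
	return 0;
}
```

Calling model_demo() from a small main() exercises the three behaviours the text relies on: a NULL function pointer is skipped, every registered structure is visited in turn, and release leaves nothing linked to the mm.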
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
@ 2005-04-26 16:57 ` Jesper Juhl
2005-04-26 17:13 ` Lee Revell
2005-04-26 17:06 ` Brice Goglin
` (2 subsequent siblings)
3 siblings, 1 reply; 20+ messages in thread
From: Jesper Juhl @ 2005-04-26 16:57 UTC (permalink / raw)
To: David Addison
Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, 26 Apr 2005, David Addison wrote:

> Hi,
> here is a patch we use to integrate the Quadrics NICs into the Linux kernel.
<snip>

A few small comments below.

>
> +static inline void
> +ioproc_release(struct mm_struct *mm)
> +{

Return types on same line as function name makes grep'ing a lot
easier/nicer.

Here's the example from Documentation/CodingStyle :

	int function(int x)
	{
		body of function
	}

<snip>
> +/* ! CONFIG_IOPROC so make all hooks empty */
> +
> +#define ioproc_release(mm) do { } while (0)
> +
> +#define ioproc_sync_range(vma, start, end) do { } while (0)
> +
> +#define ioproc_invalidate_range(vma, start,end) do { } while (0)
> +
> +#define ioproc_update_range(vma, start, end) do { } while (0)
> +
> +#define ioproc_change_protection(vma, start, end, prot) do { } while (0)
> +
> +#define ioproc_sync_page(vma, addr) do { } while (0)
> +
> +#define ioproc_invalidate_page(vma, addr) do { } while (0)
> +
> +#define ioproc_update_page(vma, addr) do { } while (0)
> +

Why all these blank lines between each define? Seems like just a waste of
screen space to me.

--
Jesper Juhl

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 16:57 ` Jesper Juhl
@ 2005-04-26 17:13 ` Lee Revell
2005-04-26 17:20 ` Jesper Juhl
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Lee Revell @ 2005-04-26 17:13 UTC (permalink / raw)
To: Jesper Juhl
Cc: David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, 2005-04-26 at 18:57 +0200, Jesper Juhl wrote:
> >
> > +static inline void
> > +ioproc_release(struct mm_struct *mm)
> > +{
>
> Return types on same line as function name makes grep'ing a lot
> easier/nicer.
>
> Here's the example from Documentation/CodingStyle :
>
>         int function(int x)
>         {

How so? I never understood the reasons. This makes it easier to grep
for everything that returns int. But you make the common case (what
file is function() defined in?) harder.

Lee

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:13 ` Lee Revell
@ 2005-04-26 17:20 ` Jesper Juhl
2005-04-26 17:28 ` Lee Revell
2005-04-26 20:09 ` Lars Marowsky-Bree
2005-04-28 11:34 ` Jakob Oestergaard
2005-04-29 8:22 ` Benjamin Herrenschmidt
2 siblings, 2 replies; 20+ messages in thread
From: Jesper Juhl @ 2005-04-26 17:20 UTC (permalink / raw)
To: Lee Revell
Cc: David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, 26 Apr 2005, Lee Revell wrote:

> On Tue, 2005-04-26 at 18:57 +0200, Jesper Juhl wrote:
> > >
> > > +static inline void
> > > +ioproc_release(struct mm_struct *mm)
> > > +{
> >
> > Return types on same line as function name makes grep'ing a lot
> > easier/nicer.
> >
> > Here's the example from Documentation/CodingStyle :
> >
> >         int function(int x)
> >         {
>
> How so? I never understood the reasons. This makes it easier to grep
> for everything that returns int. But you make the common case (what
> file is function() defined in?) harder.
>
I don't know what you do, but when I'm grep'ing the tree for some function
I'm often looking for its return type, having that on the same line as the
function name lets me grep for the function name and the grep output will
contain the return type and function name nicely on the same line.

--
Jesper

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:20 ` Jesper Juhl
@ 2005-04-26 17:28 ` Lee Revell
2005-04-26 17:38 ` Jesper Juhl
2005-04-26 20:14 ` John W. Linville
2005-04-26 20:09 ` Lars Marowsky-Bree
1 sibling, 2 replies; 20+ messages in thread
From: Lee Revell @ 2005-04-26 17:28 UTC (permalink / raw)
To: Jesper Juhl
Cc: David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, 2005-04-26 at 19:20 +0200, Jesper Juhl wrote:
> I don't know what you do, but when I'm grep'ing the tree for some function
> I'm often looking for its return type, having that on the same line as the
> function name lets me grep for the function name and the grep output will
> contain the return type and function name nicely on the same line.
>

I do a lot of looking at large hunks of code I'm not familiar with and
trying to figure out how it works. It's quite handy to grep for
foo_func to see all usages, then ^foo_func to see the function.

I guess my preferred style favors people trying to grok code for the
first time, while the kernel style favors those who know it inside out.

Anyway, the coding style guidelines also state clearly that these points
are not up for debate on LKML so I'll stop now...

Lee

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:28 ` Lee Revell
@ 2005-04-26 17:38 ` Jesper Juhl
2005-04-26 20:14 ` John W. Linville
1 sibling, 0 replies; 20+ messages in thread
From: Jesper Juhl @ 2005-04-26 17:38 UTC (permalink / raw)
To: Lee Revell
Cc: David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, 26 Apr 2005, Lee Revell wrote:

> On Tue, 2005-04-26 at 19:20 +0200, Jesper Juhl wrote:
> > I don't know what you do, but when I'm grep'ing the tree for some function
> > I'm often looking for its return type, having that on the same line as the
> > function name lets me grep for the function name and the grep output will
> > contain the return type and function name nicely on the same line.
> >
>
> I do a lot of looking at large hunks of code I'm not familiar with and
> trying to figure out how it works. It's quite handy to grep for
> foo_func to see all usages, then ^foo_func to see the function.

Have you ever looked at what "make tags" gives you?
Run make tags in the kernel source dir, then open up a source file in vim,
place the cursor over some struct name or function name and press CTRL+]
and you'll be taken to the definition, you can drill down several levels
like that, and if you want to go back up one level to where you were you
simply press CTRL+t very useful when navigating the source.

--
Jesper

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:28 ` Lee Revell
2005-04-26 17:38 ` Jesper Juhl
@ 2005-04-26 20:14 ` John W. Linville
2005-04-26 20:17 ` Lee Revell
1 sibling, 1 reply; 20+ messages in thread
From: John W. Linville @ 2005-04-26 20:14 UTC (permalink / raw)
To: Lee Revell
Cc: Jesper Juhl, David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, Apr 26, 2005 at 01:28:31PM -0400, Lee Revell wrote:

> I do a lot of looking at large hunks of code I'm not familiar with and
> trying to figure out how it works. It's quite handy to grep for

I'd suggest cscope...

John
--
John W. Linville
linville@tuxdriver.com

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 20:14 ` John W. Linville
@ 2005-04-26 20:17 ` Lee Revell
0 siblings, 0 replies; 20+ messages in thread
From: Lee Revell @ 2005-04-26 20:17 UTC (permalink / raw)
To: John W. Linville
Cc: Jesper Juhl, David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, 2005-04-26 at 16:14 -0400, John W. Linville wrote:
> On Tue, Apr 26, 2005 at 01:28:31PM -0400, Lee Revell wrote:
>
> > I do a lot of looking at large hunks of code I'm not familiar with and
> > trying to figure out how it works. It's quite handy to grep for
>
> I'd suggest cscope...

Thanks. But now I feel bad hijacking the OP's thread.

Any comments on the patch? ;-)

Lee

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:20 ` Jesper Juhl
2005-04-26 17:28 ` Lee Revell
@ 2005-04-26 20:09 ` Lars Marowsky-Bree
1 sibling, 0 replies; 20+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-26 20:09 UTC (permalink / raw)
To: Jesper Juhl, Lee Revell; +Cc: linux-kernel

On 2005-04-26T19:20:13, Jesper Juhl <juhl-lkml@dif.dk> wrote:

> I don't know what you do, but when I'm grep'ing the tree for some function
> I'm often looking for its return type, having that on the same line as the
> function name lets me grep for the function name and the grep output will
> contain the return type and function name nicely on the same line.

grep -rB1 '^function' drivers/

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:13 ` Lee Revell
2005-04-26 17:20 ` Jesper Juhl
@ 2005-04-28 11:34 ` Jakob Oestergaard
2005-04-29 8:22 ` Benjamin Herrenschmidt
2 siblings, 0 replies; 20+ messages in thread
From: Jakob Oestergaard @ 2005-04-28 11:34 UTC (permalink / raw)
To: Lee Revell
Cc: Jesper Juhl, David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, Apr 26, 2005 at 01:13:04PM -0400, Lee Revell wrote:
> On Tue, 2005-04-26 at 18:57 +0200, Jesper Juhl wrote:
> > >
> > > +static inline void
> > > +ioproc_release(struct mm_struct *mm)
> > > +{
> >
> > Return types on same line as function name makes grep'ing a lot
> > easier/nicer.
> >
> > Here's the example from Documentation/CodingStyle :
> >
> >         int function(int x)
> >         {
>
> How so? I never understood the reasons. This makes it easier to grep
> for everything that returns int. But you make the common case (what
> file is function() defined in?) harder.

etags/ctags

end of story :)

--
/ jakob

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:13 ` Lee Revell
2005-04-26 17:20 ` Jesper Juhl
2005-04-28 11:34 ` Jakob Oestergaard
@ 2005-04-29 8:22 ` Benjamin Herrenschmidt
2 siblings, 0 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2005-04-29 8:22 UTC (permalink / raw)
To: Lee Revell
Cc: Jesper Juhl, David Addison, Linux Kernel list, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, 2005-04-26 at 13:13 -0400, Lee Revell wrote:
> On Tue, 2005-04-26 at 18:57 +0200, Jesper Juhl wrote:
> > >
> > > +static inline void
> > > +ioproc_release(struct mm_struct *mm)
> > > +{
> >
> > Return types on same line as function name makes grep'ing a lot
> > easier/nicer.
> >
> > Here's the example from Documentation/CodingStyle :
> >
> >         int function(int x)
> >         {
>
> How so? I never understood the reasons. This makes it easier to grep
> for everything that returns int. But you make the common case (what
> file is function() defined in?) harder.

Not exactly. I used the 2-lines style for a while, and changed overtime
and now can't stand anything but the one line style :)

I recommend you read the mailing list archives for linus comments on
this issue btw.

Ben.

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
2005-04-26 16:57 ` Jesper Juhl
@ 2005-04-26 17:06 ` Brice Goglin
2005-04-27 9:41 ` David Addison
2005-04-27 13:43 ` Andi Kleen
2005-04-28 1:42 ` Troy Benjegerdes
2005-04-28 7:21 ` Brice Goglin
3 siblings, 2 replies; 20+ messages in thread
From: Brice Goglin @ 2005-04-26 17:06 UTC (permalink / raw)
To: David Addison
Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

David Addison wrote:
> Hi,
>
> here is a patch we use to integrate the Quadrics NICs into the Linux
> kernel.
> The patch adds hooks to the Linux VM subsystem so that registered 'IOPROC'
> devices can be informed of page table changes.
> This allows the Quadrics NICs to perform user RDMAs safely, without
> requiring
> page pinning. Looking through some of the recent IB and Ammasso
> discussions,
> it may also prove useful to those NICs too.

Hi,

I worked on a similar patch to help updating a registration cache on
Myrinet. I came to the problem of deciding between registering ioproc
to the entire address space (1) or only to some VMA (2).
You're doing (1), I tried (2).

(2) avoids calling ioproc hooks for all pages that are never involved
in any communication. This might be good if the amount of pages that
are involved is not too high and if the coproc_ops cost is a little bit
high.
Do you have any numbers about this in real applications on QsNet ?

I see two drawback in (2).
First, it requires to play with the list of ioproc_ops when VMA are
merged or split. Actually, it's not that bad since the list often
contains only 1 ioproc_ops.
Secondly, you have to add the ioproc to all involved VMA at some point.
It's easy when the API asks the application to register, you just add
the ioproc_ops to the target VMA during registration. But, I guess it's
not easy with Quadrics, right ?

I see in your patch that ioproc are not inherited during fork.
How do you support fork in your driver/lib then ?
What if a COW page is given to the son and the copy to the father
while some IO are being processed ? Do you require the application to
call a specific routine after forking ?
Don't you think it might be good to add a hook in the fork code
so that ioproc are inherited or duplicated pages are invalidated
in the card ?

Regards,
Brice

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:06 ` Brice Goglin
@ 2005-04-27 9:41 ` David Addison
2005-04-28 8:38 ` Andy Isaacson
2005-04-27 13:43 ` Andi Kleen
1 sibling, 1 reply; 20+ messages in thread
From: David Addison @ 2005-04-27 9:41 UTC (permalink / raw)
To: Brice Goglin; +Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison
[-- Attachment #1: Type: text/plain, Size: 3679 bytes --]

Brice Goglin wrote:
> David Addison wrote:
>
>> Hi,
>>
>> here is a patch we use to integrate the Quadrics NICs into the Linux
>> kernel.
>> The patch adds hooks to the Linux VM subsystem so that registered
>> 'IOPROC'
>> devices can be informed of page table changes.
>> This allows the Quadrics NICs to perform user RDMAs safely, without
>> requiring
>> page pinning. Looking through some of the recent IB and Ammasso
>> discussions,
>> it may also prove useful to those NICs too.
>
>
> Hi,
>
> I worked on a similar patch to help updating a registration cache on
> Myrinet. I came to the problem of deciding between registering ioproc
> to the entire address space (1) or only to some VMA (2).
> You're doing (1), I tried (2).
>
> (2) avoids calling ioproc hooks for all pages that are never involved
> in any communication. This might be good if the amount of pages that
> are involved is not too high and if the coproc_ops cost is a little bit
> high.
> Do you have any numbers about this in real applications on QsNet ?
>
We have always taken approach (1) as it seems to be the simplest method and
offers the model where the whole user process space can be made available
for RDMA operations. Admittedly, the update calls for pages which are not
going to be used for RDMA operations is an overhead, but the device driver
can elect not to present update functions and instead rely on
pre-registration of comms buffers. For invalidates the device driver will
have knowledge of the registered regions and can quickly ignore irrelevant
unloads.
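
The point about ignoring irrelevant unloads can be illustrated with a small userspace sketch. Everything here is hypothetical: the rdma_region and nic_ctx structures, the half-open [start, end) range convention, and the choice to drop a whole region on any overlap are assumptions made for the example, not details of the Quadrics driver. The callback has the shape of invalidate_range from the patch, with the vma argument left out so that no kernel types are needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one buffer the library has registered for RDMA. */
struct rdma_region {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive (assumed convention) */
	int loaded;		/* non-zero while the NIC holds translations */
};

/* Hypothetical per-process driver state, i.e. the ioproc_ops arg. */
struct nic_ctx {
	struct rdma_region *regions;
	size_t nregions;
	unsigned long flushes;	/* stand-in for NIC MMU flushes issued */
};

/* Do the half-open ranges [a_start, a_end) and [b_start, b_end) overlap? */
static int ranges_overlap(unsigned long a_start, unsigned long a_end,
			  unsigned long b_start, unsigned long b_end)
{
	return a_start < b_end && b_start < a_end;
}

/* Only regions that are loaded and overlap the unload cost anything. */
void nic_invalidate_range(void *arg, unsigned long start, unsigned long end)
{
	struct nic_ctx *ctx = arg;
	size_t i;

	assert(ctx != NULL && start <= end);
	for (i = 0; i < ctx->nregions; i++) {
		struct rdma_region *r = &ctx->regions[i];

		if (!r->loaded || !ranges_overlap(r->start, r->end, start, end))
			continue;	/* irrelevant unload: nothing to do */
		r->loaded = 0;		/* simplification: drop whole region */
		ctx->flushes++;
	}
}

/* Three unloads against two registered regions; 0 means as expected. */
int nic_demo(void)
{
	struct rdma_region regions[] = {
		{ 0x10000UL, 0x20000UL, 1 },
		{ 0x40000UL, 0x50000UL, 1 },
	};
	struct nic_ctx ctx = { regions, 2, 0 };

	nic_invalidate_range(&ctx, 0x30000UL, 0x38000UL);	/* misses both */
	if (ctx.flushes != 0)
		return 1;

	nic_invalidate_range(&ctx, 0x1f000UL, 0x21000UL);	/* hits region 0 */
	if (ctx.flushes != 1 || regions[0].loaded || !regions[1].loaded)
		return 2;

	nic_invalidate_range(&ctx, 0x1f000UL, 0x21000UL);	/* already dropped */
	return ctx.flushes == 1 ? 0 : 3;
}
```

A linear scan is used only to keep the sketch short; a driver with many registered regions would presumably want an ordered structure so that the common miss case stays cheap.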
With HPC applications in general we find the memory image is pretty static
over the life of the job and hence most of the costs are taken as the pages
are created during job startup and initialisation.

> I see two drawback in (2).
> First, it requires to play with the list of ioproc_ops when VMA are
> merged or split. Actually, it's not that bad since the list often
> contains only 1 ioproc_ops.
> Secondly, you have to add the ioproc to all involved VMA at some point.
> It's easy when the API asks the application to register, you just add
> the ioproc_ops to the target VMA during registration. But, I guess it's
> not easy with Quadrics, right ?
>
In the past we have allowed dynamic page faulting of all application exposed
memory via RDMA operations. However, our newer libraries do now implement a
registration cache so we can pre-load translations to our NIC MMU (or pin,
if kernel invalidate hooks are not available).
However, I still prefer model (1) as it allows both implementations and
appears to be much simpler in terms of the linux kernel changes required.

>
> I see in your patch that ioproc are not inherited during fork.
> How do you support fork in your driver/lib then ?
> What if a COW page is given to the son and the copy to the father
> while some IO are being processed ? Do you require the application to
> call a specific routine after forking ?
> Don't you think it might be good to add a hook in the fork code
> so that ioproc are inherited or duplicated pages are invalidated
> in the card ?
>
Yes, on fork() our programming model is for the child to attach to the
device again. The QsNet model has a NIC MMU context for each process so it
makes sense for each process to attach and have independent IOPROC and NIC
memory management. But you're right, there should be IOPROC hooks to ensure
that the device cannot write to COW pages after the fork.
I've added a new callback for this; ioproc_wrprotect_page() called in
copy_one_pte(), and a new revised patch is attached (Jesper: with whitespace
corrections too ;-)

> Regards,
> Brice

Thanks for your comments,

Cheers David.

[-- Attachment #2: ioproc-2.6.12-rc3.patch --]
[-- Type: text/x-patch, Size: 39428 bytes --]
diff -ruN linux-2.6.12-rc3.orig/include/linux/ioproc.h linux-2.6.12-rc3.ioproc/include/linux/ioproc.h
--- linux-2.6.12-rc3.orig/include/linux/ioproc.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/include/linux/ioproc.h 2005-04-27 09:59:49.000000000 +0100
@@ -0,0 +1,273 @@
+/* -*- linux-c -*-
+ *
+ * Copyright (C) 2002-2005 Quadrics Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ *
+ */
+
+/*
+ * Callbacks for IO processor page table updates.
+ */
+#ifndef __LINUX_IOPROC_H__
+#define __LINUX_IOPROC_H__
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+
+typedef struct ioproc_ops {
+	struct ioproc_ops *next;
+	void *arg;
+
+	void (*release)(void *arg, struct mm_struct *mm);
+	void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+
+	void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot);
+
+	void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*wrprotect_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+
+} ioproc_ops_t;
+
+/* IOPROC Registration
+ *
+ * Called by the IOPROC device driver to register its interest in page table
+ * changes for the process associated with the supplied mm_struct
+ *
+ * The caller should first allocate and fill out an ioproc_ops structure with
+ * the function pointers initialised to the device driver specific code for
+ * each callback. If the device driver doesn't have code for a particular
+ * callback then it should set the function pointer to be NULL.
+ * The ioproc_ops arg parameter will be passed unchanged as the first argument
+ * to each callback function invocation.
+ *
+ * The ioproc registration is not inherited across fork() and should be called
+ * once for each process that the IOPROC device driver is interested in.
+ *
+ * Must be called holding the mm->page_table_lock
+ */
+extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+/* IOPROC De-registration
+ *
+ * Called by the IOPROC device driver when it is no longer interested in page
+ * table changes for the process associated with the supplied mm_struct
+ *
+ * Normally this is not needed to be called as the ioproc_release() code will
+ * automatically unlink the ioproc_ops struct from the mm_struct as the
+ * process exits
+ *
+ * Must be called holding the mm->page_table_lock
+ */
+extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+#ifdef CONFIG_IOPROC
+
+/* IOPROC Release
+ *
+ * Called during exit_mmap() as all vmas are torn down and unmapped.
+ *
+ * Also unlinks the ioproc_ops structure from the mm list as it goes.
+ *
+ * No need for locks as the mm can no longer be accessed at this point
+ *
+ */
+static inline void ioproc_release(struct mm_struct *mm)
+{
+	struct ioproc_ops *cp;
+
+	while ((cp = mm->ioproc_ops) != NULL) {
+		mm->ioproc_ops = cp->next;
+
+		if (cp->release)
+			cp->release(cp->arg, mm);
+	}
+}
+
+/* IOPROC SYNC RANGE
+ *
+ * Called when a memory map is synchronised with its disk image i.e. when the
+ * msync() syscall is invoked. Any future read or write to the associated
+ * pages by the IOPROC should cause the page to be marked as referenced or
+ * modified.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_sync_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->sync_range)
+			cp->sync_range(cp->arg, vma, start, end);
+}
+
+/* IOPROC INVALIDATE RANGE
+ *
+ * Called whenever a valid PTE is unloaded e.g. when a page is unmapped by the
+ * user or paged out by the kernel.
+ *
+ * After this call the IOPROC must not access the physical memory again unless
+ * a new translation is loaded.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_invalidate_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->invalidate_range)
+			cp->invalidate_range(cp->arg, vma, start, end);
+}
+
+/* IOPROC UPDATE RANGE
+ *
+ * Called whenever a valid PTE is loaded e.g. mmaping memory, moving the brk
+ * up, when breaking COW or faulting in an anonymous page of memory.
+ *
+ * These give the IOPROC device driver the opportunity to load translations
+ * speculatively, which can improve performance by avoiding device translation
+ * faults.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_update_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->update_range)
+			cp->update_range(cp->arg, vma, start, end);
+}
+
+
+/* IOPROC CHANGE PROTECTION
+ *
+ * Called when the protection on a region of memory is changed i.e. when the
+ * mprotect() syscall is invoked.
+ *
+ * The IOPROC must not be able to write to a read-only page, so if the
+ * permissions are downgraded then it must honour them. If they are upgraded
+ * it can treat this in the same way as the ioproc_update_[range|page]() calls
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_change_protection(struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->change_protection)
+			cp->change_protection(cp->arg, vma, start, end, newprot);
+}
+
+/* IOPROC SYNC PAGE
+ *
+ * Called when a memory map is synchronised with its disk image i.e. when the
+ * msync() syscall is invoked. Any future read or write to the associated page
+ * by the IOPROC should cause the page to be marked as referenced or modified.
+ *
+ * Not currently called as msync() calls ioproc_sync_range() instead
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_sync_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->sync_page)
+			cp->sync_page(cp->arg, vma, addr);
+}
+
+/* IOPROC INVALIDATE PAGE
+ *
+ * Called whenever a valid PTE is unloaded e.g. when a page is unmapped by the
+ * user or paged out by the kernel.
+ *
+ * After this call the IOPROC must not access the physical memory again unless
+ * a new translation is loaded.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_invalidate_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->invalidate_page)
+			cp->invalidate_page(cp->arg, vma, addr);
+}
+
+/* IOPROC UPDATE PAGE
+ *
+ * Called whenever a valid PTE is loaded e.g. mmaping memory, moving the brk
+ * up, when breaking COW or faulting in an anonymous page of memory.
+ *
+ * These give the IOPROC device the opportunity to load translations
+ * speculatively, which can improve performance by avoiding device translation
+ * faults.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_update_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->update_page)
+			cp->update_page(cp->arg, vma, addr);
+}
+
+/* IOPROC WRPROTECT PAGE
+ *
+ * Called when a page is downgraded for COW (during fork()). This should ensure that
+ * the page can no longer be written by the IOPROC
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_wrprotect_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->wrprotect_page)
+			cp->wrprotect_page(cp->arg, vma, addr);
+}
+
+#else
+
+/* ! CONFIG_IOPROC so make all hooks empty */
+
+#define ioproc_release(mm) do { } while (0)
+#define ioproc_sync_range(vma, start, end) do { } while (0)
+#define ioproc_invalidate_range(vma, start,end) do { } while (0)
+#define ioproc_update_range(vma, start, end) do { } while (0)
+#define ioproc_change_protection(vma, start, end, prot) do { } while (0)
+#define ioproc_sync_page(vma, addr) do { } while (0)
+#define ioproc_invalidate_page(vma, addr) do { } while (0)
+#define ioproc_update_page(vma, addr) do { } while (0)
+#define ioproc_wrprotect_page(vma, addr) do { } while (0)
+
+#endif /* CONFIG_IOPROC */
+
+#endif /* __LINUX_IOPROC_H__ */
diff -ruN linux-2.6.12-rc3.orig/include/linux/sched.h linux-2.6.12-rc3.ioproc/include/linux/sched.h
--- linux-2.6.12-rc3.orig/include/linux/sched.h 2005-04-26 09:02:29.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/include/linux/sched.h 2005-04-26 16:03:07.000000000 +0100
@@ -186,6 +186,9 @@
 asmlinkage void schedule(void);
 
 struct namespace;
+#ifdef CONFIG_IOPROC
+struct ioproc_ops;
+#endif
 
 /* Maximum number of active map areas.. This is a random (large) number */
 #define DEFAULT_MAX_MAP_COUNT 65536
@@ -267,6 +270,11 @@
 
 	unsigned long hiwater_rss; /* High-water RSS usage */
 	unsigned long hiwater_vm; /* High-water virtual memory usage */
+
+#ifdef CONFIG_IOPROC
+	/* hooks for io devices with advanced RDMA capabilities */
+	struct ioproc_ops *ioproc_ops;
+#endif
 };
 
 struct sighand_struct {
diff -ruN linux-2.6.12-rc3.orig/kernel/fork.c linux-2.6.12-rc3.ioproc/kernel/fork.c
--- linux-2.6.12-rc3.orig/kernel/fork.c 2005-04-26 09:02:36.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/kernel/fork.c 2005-04-26 16:03:07.000000000 +0100
@@ -320,6 +320,9 @@
 	spin_lock_init(&mm->page_table_lock);
 	rwlock_init(&mm->ioctx_list_lock);
 	mm->ioctx_list = NULL;
+#ifdef CONFIG_IOPROC
+	mm->ioproc_ops = NULL;
+#endif
 	mm->default_kioctx = (struct kioctx)INIT_KIOCTX(mm->default_kioctx, *mm);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 
diff -ruN linux-2.6.12-rc3.orig/mm/fremap.c linux-2.6.12-rc3.ioproc/mm/fremap.c
--- linux-2.6.12-rc3.orig/mm/fremap.c 2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/fremap.c 2005-04-26 16:03:07.000000000 +0100
@@ -12,6 +12,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
+#include <linux/ioproc.h>
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
@@ -30,6 +31,7 @@
 	if (pte_present(pte)) {
 		unsigned long pfn = pte_pfn(pte);
 
+		ioproc_invalidate_page(vma, addr);
 		flush_cache_page(vma, addr, pfn);
 		pte = ptep_clear_flush(vma, addr, ptep);
 		if (pfn_valid(pfn)) {
@@ -99,6 +101,7 @@
 	pte_val = *pte;
 	pte_unmap(pte);
 	update_mmu_cache(vma, addr, pte_val);
+	ioproc_update_page(vma, addr);
 
 	err = 0;
 err_unlock:
@@ -143,6 +146,7 @@
 	pte_val = *pte;
 	pte_unmap(pte);
 	update_mmu_cache(vma, addr, pte_val);
+	ioproc_update_page(vma, addr);
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 
diff -ruN linux-2.6.12-rc3.orig/mm/hugetlb.c linux-2.6.12-rc3.ioproc/mm/hugetlb.c
--- linux-2.6.12-rc3.orig/mm/hugetlb.c 2005-03-02 07:38:12.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/mm/hugetlb.c 2005-04-26 16:03:07.000000000 +0100 @@ -11,6 +11,7 @@ #include <linux/sysctl.h> #include <linux/highmem.h> #include <linux/nodemask.h> +#include <linux/ioproc.h> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; static unsigned long nr_huge_pages, free_huge_pages; @@ -255,6 +256,7 @@ struct mm_struct *mm = vma->vm_mm; spin_lock(&mm->page_table_lock); + ioproc_invalidate_range(vma, start, start + length); unmap_hugepage_range(vma, start, start + length); spin_unlock(&mm->page_table_lock); } diff -ruN linux-2.6.12-rc3.orig/mm/ioproc.c linux-2.6.12-rc3.ioproc/mm/ioproc.c --- linux-2.6.12-rc3.orig/mm/ioproc.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/mm/ioproc.c 2005-04-26 17:58:43.000000000 +0100 @@ -0,0 +1,56 @@ +/* -*- linux-c -*- + * + * Copyright (C) 2002-2005 Quadrics Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + */ + +/* + * Registration for IO processor page table updates. 
+ */ + +#include <linux/kernel.h> +#include <linux/module.h> + +#include <linux/mm.h> +#include <linux/ioproc.h> + +int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip) +{ + ip->next = mm->ioproc_ops; + mm->ioproc_ops = ip; + + return 0; +} + +EXPORT_SYMBOL_GPL(ioproc_register_ops); + +int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip) +{ + struct ioproc_ops **tmp; + + for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp= &(*tmp)->next) + ; + if (*tmp) { + *tmp = ip->next; + return 0; + } + + return -EINVAL; +} + +EXPORT_SYMBOL_GPL(ioproc_unregister_ops); diff -ruN linux-2.6.12-rc3.orig/mm/Kconfig linux-2.6.12-rc3.ioproc/mm/Kconfig --- linux-2.6.12-rc3.orig/mm/Kconfig 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/mm/Kconfig 2005-04-26 16:03:07.000000000 +0100 @@ -0,0 +1,15 @@ +# +# VM subsystem specific config +# + +# Support for IO processors which have advanced RDMA capabilities +# +config IOPROC + bool "Enable IOPROC VM hooks" + depends on MMU + default y + help + This option enables hooks in the VM subsystem so that IO devices which + incorporate advanced RDMA capabilities can be kept in sync with CPU + page table changes. + See Documentation/vm/ioproc.txt for more details. 
diff -ruN linux-2.6.12-rc3.orig/mm/Makefile linux-2.6.12-rc3.ioproc/mm/Makefile --- linux-2.6.12-rc3.orig/mm/Makefile 2005-03-02 07:38:12.000000000 +0000 +++ linux-2.6.12-rc3.ioproc/mm/Makefile 2005-04-26 16:03:07.000000000 +0100 @@ -17,4 +17,5 @@ obj-$(CONFIG_NUMA) += mempolicy.o obj-$(CONFIG_SHMEM) += shmem.o obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o +obj-$(CONFIG_IOPROC) += ioproc.o diff -ruN linux-2.6.12-rc3.orig/mm/memory.c linux-2.6.12-rc3.ioproc/mm/memory.c --- linux-2.6.12-rc3.orig/mm/memory.c 2005-04-26 09:02:39.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/mm/memory.c 2005-04-27 09:58:34.000000000 +0100 @@ -45,6 +45,7 @@ #include <linux/swap.h> #include <linux/highmem.h> #include <linux/pagemap.h> +#include <linux/ioproc.h> #include <linux/rmap.h> #include <linux/module.h> #include <linux/init.h> @@ -343,9 +344,10 @@ static inline void copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pte_t *dst_pte, pte_t *src_pte, unsigned long vm_flags, + pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, unsigned long addr) { + unsigned long vm_flags = vma->vm_flags; pte_t pte = *src_pte; struct page *page; unsigned long pfn; @@ -385,6 +387,7 @@ * in the parent and the child */ if ((vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE) { + ioproc_wrprotect_page(vma, addr); ptep_set_wrprotect(src_mm, addr, src_pte); pte = *src_pte; } @@ -409,7 +412,6 @@ unsigned long addr, unsigned long end) { pte_t *src_pte, *dst_pte; - unsigned long vm_flags = vma->vm_flags; int progress; again: @@ -433,7 +435,7 @@ progress++; continue; } - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vm_flags, addr); + copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr); progress += 8; } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); spin_unlock(&src_mm->page_table_lock); @@ -765,6 +767,7 @@ lru_add_drain(); spin_lock(&mm->page_table_lock); + ioproc_invalidate_range(vma, address, end); tlb = tlb_gather_mmu(mm, 0); end = unmap_vmas(&tlb, mm, vma, address, 
end, &nr_accounted, details); tlb_finish_mmu(tlb, address, end); @@ -1076,6 +1079,7 @@ { pgd_t *pgd; unsigned long next; + unsigned long beg = addr; unsigned long end = addr + size; struct mm_struct *mm = vma->vm_mm; int err; @@ -1084,12 +1088,14 @@ pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); spin_lock(&mm->page_table_lock); + ioproc_invalidate_range(vma, beg, end); do { next = pgd_addr_end(addr, end); err = zeromap_pud_range(mm, pgd, addr, next, prot); if (err) break; } while (pgd++, addr = next, addr != end); + ioproc_update_range(vma, beg, end); spin_unlock(&mm->page_table_lock); return err; } @@ -1164,6 +1170,7 @@ { pgd_t *pgd; unsigned long next; + unsigned long beg = addr; unsigned long end = addr + size; struct mm_struct *mm = vma->vm_mm; int err; @@ -1183,6 +1190,7 @@ pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); spin_lock(&mm->page_table_lock); + ioproc_invalidate_range(vma, beg, end); do { next = pgd_addr_end(addr, end); err = remap_pud_range(mm, pgd, addr, next, @@ -1190,6 +1198,7 @@ if (err) break; } while (pgd++, addr = next, addr != end); + ioproc_update_range(vma, beg, end); spin_unlock(&mm->page_table_lock); return err; } @@ -1218,8 +1227,10 @@ entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)), vma); + ioproc_invalidate_page(vma, address); ptep_establish(vma, address, page_table, entry); update_mmu_cache(vma, address, entry); + ioproc_update_page(vma, address); lazy_mmu_prot_update(entry); } @@ -1273,6 +1284,7 @@ vma); ptep_set_access_flags(vma, address, page_table, entry, 1); update_mmu_cache(vma, address, entry); + ioproc_update_page(vma, address); lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1736,6 +1748,7 @@ /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, address, pte); + ioproc_update_page(vma, address); lazy_mmu_prot_update(pte); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1794,6 +1807,7 @@ /* 
No need to invalidate - it was non-present before */ update_mmu_cache(vma, addr, entry); + ioproc_update_page(vma, addr); lazy_mmu_prot_update(entry); spin_unlock(&mm->page_table_lock); out: @@ -1920,6 +1934,7 @@ /* no need to invalidate: a not-present page shouldn't be cached */ update_mmu_cache(vma, address, entry); + ioproc_update_page(vma, address); lazy_mmu_prot_update(entry); spin_unlock(&mm->page_table_lock); out: diff -ruN linux-2.6.12-rc3.orig/mm/mmap.c linux-2.6.12-rc3.ioproc/mm/mmap.c --- linux-2.6.12-rc3.orig/mm/mmap.c 2005-04-26 09:02:39.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/mm/mmap.c 2005-04-26 16:03:07.000000000 +0100 @@ -16,6 +16,7 @@ #include <linux/init.h> #include <linux/file.h> #include <linux/fs.h> +#include <linux/ioproc.h> #include <linux/personality.h> #include <linux/security.h> #include <linux/hugetlb.h> @@ -1627,6 +1628,7 @@ lru_add_drain(); spin_lock(&mm->page_table_lock); + ioproc_invalidate_range(vma, start, end); tlb = tlb_gather_mmu(mm, 0); unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); @@ -1905,6 +1907,7 @@ spin_lock(&mm->page_table_lock); + ioproc_release(mm); flush_cache_mm(mm); tlb = tlb_gather_mmu(mm, 1); /* Use -1 here to ensure all VMAs in the mm are unmapped */ diff -ruN linux-2.6.12-rc3.orig/mm/mprotect.c linux-2.6.12-rc3.ioproc/mm/mprotect.c --- linux-2.6.12-rc3.orig/mm/mprotect.c 2005-04-26 09:02:40.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/mm/mprotect.c 2005-04-26 16:03:07.000000000 +0100 @@ -10,6 +10,7 @@ #include <linux/mm.h> #include <linux/hugetlb.h> +#include <linux/ioproc.h> #include <linux/slab.h> #include <linux/shm.h> #include <linux/mman.h> @@ -89,6 +90,7 @@ pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); spin_lock(&mm->page_table_lock); + ioproc_change_protection(vma, start, end, newprot); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) diff -ruN linux-2.6.12-rc3.orig/mm/mremap.c linux-2.6.12-rc3.ioproc/mm/mremap.c --- 
linux-2.6.12-rc3.orig/mm/mremap.c 2005-04-26 09:02:40.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/mm/mremap.c 2005-04-26 16:03:07.000000000 +0100 @@ -9,6 +9,7 @@ #include <linux/mm.h> #include <linux/hugetlb.h> +#include <linux/ioproc.h> #include <linux/slab.h> #include <linux/shm.h> #include <linux/mman.h> @@ -161,6 +162,8 @@ { unsigned long offset; + ioproc_invalidate_range(vma, old_addr, old_addr + len); + ioproc_invalidate_range(vma, new_addr, new_addr + len); flush_cache_range(vma, old_addr, old_addr + len); /* diff -ruN linux-2.6.12-rc3.orig/mm/msync.c linux-2.6.12-rc3.ioproc/mm/msync.c --- linux-2.6.12-rc3.orig/mm/msync.c 2005-04-26 09:02:40.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/mm/msync.c 2005-04-26 16:03:07.000000000 +0100 @@ -13,6 +13,7 @@ #include <linux/mman.h> #include <linux/hugetlb.h> #include <linux/syscalls.h> +#include <linux/ioproc.h> #include <asm/pgtable.h> #include <asm/tlbflush.h> @@ -95,6 +96,7 @@ pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); spin_lock(&mm->page_table_lock); + ioproc_sync_range(vma, addr, end); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) diff -ruN linux-2.6.12-rc3.orig/mm/rmap.c linux-2.6.12-rc3.ioproc/mm/rmap.c --- linux-2.6.12-rc3.orig/mm/rmap.c 2005-04-26 09:02:40.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/mm/rmap.c 2005-04-26 16:03:07.000000000 +0100 @@ -53,6 +53,7 @@ #include <linux/init.h> #include <linux/rmap.h> #include <linux/rcupdate.h> +#include <linux/ioproc.h> #include <asm/tlbflush.h> @@ -573,6 +574,7 @@ } /* Nuke the page table entry. */ + ioproc_invalidate_page(vma, address); flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); @@ -690,6 +692,7 @@ continue; /* Nuke the page table entry. 
*/ + ioproc_invalidate_page(vma, address); flush_cache_page(vma, address, pfn); pteval = ptep_clear_flush(vma, address, pte); diff -ruN linux-2.6.12-rc3.orig/arch/i386/defconfig linux-2.6.12-rc3.ioproc/arch/i386/defconfig --- linux-2.6.12-rc3.orig/arch/i386/defconfig 2005-04-26 08:59:33.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/arch/i386/defconfig 2005-04-26 16:03:07.000000000 +0100 @@ -120,6 +120,7 @@ CONFIG_IRQBALANCE=y CONFIG_HAVE_DEC_LOCK=y # CONFIG_REGPARM is not set +CONFIG_IOPROC=y # # Power management options (ACPI, APM) diff -ruN linux-2.6.12-rc3.orig/arch/i386/Kconfig linux-2.6.12-rc3.ioproc/arch/i386/Kconfig --- linux-2.6.12-rc3.orig/arch/i386/Kconfig 2005-04-26 08:59:33.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/arch/i386/Kconfig 2005-04-26 16:03:08.000000000 +0100 @@ -923,6 +923,8 @@ If unsure, say Y. Only embedded should say N here. +source "mm/Kconfig" + endmenu diff -ruN linux-2.6.12-rc3.orig/arch/ia64/defconfig linux-2.6.12-rc3.ioproc/arch/ia64/defconfig --- linux-2.6.12-rc3.orig/arch/ia64/defconfig 2005-03-02 07:37:48.000000000 +0000 +++ linux-2.6.12-rc3.ioproc/arch/ia64/defconfig 2005-04-26 16:03:08.000000000 +0100 @@ -92,6 +92,7 @@ CONFIG_PERFMON=y CONFIG_IA64_PALINFO=y CONFIG_ACPI_DEALLOCATE_IRQ=y +CONFIG_IOPROC=y # # Firmware Drivers diff -ruN linux-2.6.12-rc3.orig/arch/ia64/Kconfig linux-2.6.12-rc3.ioproc/arch/ia64/Kconfig --- linux-2.6.12-rc3.orig/arch/ia64/Kconfig 2005-04-26 08:59:38.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/arch/ia64/Kconfig 2005-04-26 16:03:08.000000000 +0100 @@ -319,6 +319,8 @@ depends on IOSAPIC && EXPERIMENTAL default y +source "mm/Kconfig" + source "drivers/firmware/Kconfig" source "fs/Kconfig.binfmt" diff -ruN linux-2.6.12-rc3.orig/arch/x86_64/defconfig linux-2.6.12-rc3.ioproc/arch/x86_64/defconfig --- linux-2.6.12-rc3.orig/arch/x86_64/defconfig 2005-04-26 09:00:10.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/arch/x86_64/defconfig 2005-04-26 16:03:08.000000000 +0100 @@ -100,6 +100,7 @@ CONFIG_SECCOMP=y 
CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y +CONFIG_IOPROC=y # # Power management options diff -ruN linux-2.6.12-rc3.orig/arch/x86_64/Kconfig linux-2.6.12-rc3.ioproc/arch/x86_64/Kconfig --- linux-2.6.12-rc3.orig/arch/x86_64/Kconfig 2005-04-26 09:00:10.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/arch/x86_64/Kconfig 2005-04-26 16:03:08.000000000 +0100 @@ -458,6 +458,8 @@ depends on IA32_EMULATION default y +source "mm/Kconfig" + endmenu source drivers/Kconfig diff -ruN linux-2.6.12-rc3.orig/Documentation/vm/ioproc.txt linux-2.6.12-rc3.ioproc/Documentation/vm/ioproc.txt --- linux-2.6.12-rc3.orig/Documentation/vm/ioproc.txt 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.12-rc3.ioproc/Documentation/vm/ioproc.txt 2005-04-27 10:05:31.000000000 +0100 @@ -0,0 +1,512 @@ +Linux IOPROC patch overview +=========================== + +The network interface for an HPC network differs significantly from +network interfaces for traditional IP networks. HPC networks tend to +be used directly from user processes and perform large RDMA transfers +between the process address spaces. They also have a requirement +for low latency communication, and typically achieve this by OS bypass +techniques. This then requires a different model to traditional +interconnects, in that a process may need to expose a large amount of +it's address space to the network RDMA. + +Locking down of memory has been a common mechanism for performing +this, together with a pin-down cache implemented in user +libraries. The disadvantage of this method is that large portions of +the physical memory can be locked down for a single process, even if +it's working set changes over the different phases of it's +execution. This leads to inefficient memory utilisation - akin to the +disadvantage of swapping compared to paging. 
+ +This model also has problems where memory is being dynamically +allocated and freed, since the pin down cache is unaware that memory +may have been released by a call to munmap() and so it will still be +locking down the now unused pages. + +Some modern HPC network interfaces implement their own MMU and are +able to handle a translation fault during a network access. The +Quadrics (http://www.quadrics.com) devices (Elan3 and Elan4) have done +this for some time, and the Infiniband standard also allows for the +case where memory has been deregistered when an RDMA occurs. +These NICs are able to operate in an environment where paging occurs +and do not require memory to be locked down. The advantage of this is +that the user process can expose large portions of its address space +without having to worry about physical memory constraints. + +However should the operating system decide to swap a page to disk, +then the NIC must be made aware that it should no longer read/write +from this memory, but should generate a translation fault instead. + +The ioproc patch has been developed to provide a mechanism whereby the +device driver for a NIC can be made aware of when a user process's +address translations change, either by paging or by explicitly mapping +or unmapping of memory. + +The patch involves inserting callbacks where translations are being +invalidated to notify the NIC that the memory behind those +translations is no longer visible to the application (and so should +not be visible to the NIC). This callback is then responsible for +ensuring that the NIC will not access the physical memory that was +being mapped. + +An ioproc invalidate callback in the kswapd code could be utilised to +prevent memory from being paged out if the NIC is unable to support +RDMA page faulting. This has not yet been implemented in this patch. 
+ +For NICs which support RDMA page faulting, there is no requirement +for a user level pin down cache, since they are able to page-in their +translations on the first communication using a buffer. However this +is likely to be inefficient, resulting in slow first use of the +buffer. If the communication buffers were continually allocated and +freed using mmap() based malloc() calls then this would lead to all +communications being slower than desirable. + +To optimise these warm-up cases the ioproc patch adds calls to +ioproc_update wherever the kernel is creating translations for a user +process. These then allow the device driver to preload translations +so that they are already present for the first network communication +from a buffer. + +Linux 2.6 IOPROC implementation details +======================================= + +The Linux IOPROC patch adds hooks to the Linux VM code whenever page +table entries are being created and/or invalidated. IOPROC device +drivers can register their interest in being informed of such changes +by registering an ioproc_ops structure which is defined as follows; + +extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip); +extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip); + +typedef struct ioproc_ops { + struct ioproc_ops *next; + void *arg; + + void (*release)(void *arg, struct mm_struct *mm); + void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end); + void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end); + void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end); + + void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot); + + void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address); + void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned 
long address); + void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address); + +} ioproc_ops_t; + +ioproc_register_ops +=================== +This function should be called by the IOPROC device driver to register +its interest in PTE changes for the process associated with the passed +in mm_struct. + +The ioproc registration is not inherited across fork() and should be +called once for each process that IOPROC is interested in. + +This function must be called whilst holding the mm->page_table_lock. + +ioproc_unregister_ops +===================== +This function should be called by the IOPROC device driver when it no +longer requires informing of PTE changes in the process associated +with the supplied mm_struct. + +This function is not normally needed to be called as the ioproc_ops +struct is unlinked from the associated mm_struct during the +ioproc_release() call. + +This function must be called whilst holding the mm->page_table_lock. + +ioproc_ops struct +================= +A linked list ioproc_ops structures is hung off the user process +mm_struct (linux/sched.h). At each hook point in the patched kernel +the ioproc patch will call the associated ioproc_ops callback function +pointer in turn for each registered structure. + +The intention of the callbacks is to allow the IOPROC device driver to +inspect the new or modified PTE entry via the Linux kernel +(e.g. find_pte_map()). These callbacks should not modify the Linux +kernel VM state or PTE entries. + +The ioproc_ops callback function pointers are defined as follows; + +ioproc_release +============== +The release hook is called when a program exits and all its vma areas +are torn down and unmapped. i.e. during exit_mmap(). Before each +release hook is called the ioproc_ops structure is unlinked from the +mm_struct. + +No locks are required as the process has the only reference to the mm +at this point. 
+ +ioproc_sync_[range|page] +======================== +The sync hooks are called when a memory map is synchronised with its +disk image i.e. when the msync() syscall is invoked. Any future read +or write by the IOPROC device to the associated pages should cause the +page to be marked as referenced or modified. + +Called holding the mm->page_table_lock + +ioproc_invalidate_[range|page] +============================== +The invalidate hooks are called whenever a valid PTE is unloaded +e.g. when a page is unmapped by the user or paged out by the +kernel. After this call the IOPROC must not access the physical memory +again unless a new translation is loaded. + +Called holding the mm->page_table_lock + +ioproc_update_[range|page] +========================== +The update hooks are called whenever a valid PTE is loaded +e.g. mmaping memory, moving the brk up, when breaking COW or faulting +in an anonymous page of memory. These give the IOPROC device the +opportunity to load translations speculatively, which can improve +performance by avoiding device translation faults. + +Called holding the mm->page_table_lock + +ioproc_change_protection +======================== +This hook is called when the protection on a region of memory is +changed i.e. when the mprotect() syscall is invoked. + +The IOPROC must not be able to write to a read-only page, so if the +permissions are downgraded then it must honour them. 
If they are +upgraded it can treat this in the same way as the +ioproc_update_[range|page]() calls + +Called holding the mm->page_table_lock + + +Linux 2.6 IOPROC patch details +============================== + +Here are the specific details of each ioproc hook added to the Linux +2.6 VM system and the reasons for doing so; + +=============================================================================== +++++ FILE + mm/fremap.c + +==== FUNCTION + zap_pte + +CALLED FROM + install_page + install_file_pte + +PTE MODIFICATION + ptep_clear_flush + +ADDED HOOKS + ioproc_invalidate_page + +==== FUNCTION + install_page + +CALLED FROM + filemap_populate, shmem_populate + +PTE MODIFICATION + set_pte_at + +ADDED HOOKS + ioproc_update_page + +==== FUNCTION + install_file_pte + +CALLED FROM + filemap_populate, shmem_populate + +PTE MODIFICATION + set_pte_at + +ADDED HOOKS + ioproc_update_page + + +=============================================================================== +++++ FILE + mm/memory.c + +==== FUNCTION + copy_one_pte + +CALLED FROM + copy_pte_range + +PTE MODIFICATION + ptep_set_wrprotect + +ADDED HOOKS + ioproc_wrprotect_page + +==== FUNCTION + copy_page_range + +CALLED FROM + dup_mmap (fork.c) + +PTE MODIFICATION + set_pte_at (copy_one_pte) + +ADDED HOOKS + None necessary as its creating a new process + +==== FUNCTION + zap_page_range + +CALLED FROM + read_zero_pagealigned, madvise_dontneed, unmap_mapping_range, + unmap_mapping_range_list, do_mmap_pgoff + +PTE MODIFICATION + set_pte_at (unmap_vmas) + +ADDED HOOKS + ioproc_invalidate_range + + +==== FUNCTION + zeromap_page_range + +CALLED FROM + read_zero_pagealigned, mmap_zero + +PTE MODIFICATION + set_pte_at (zeromap_pte_range via zeromap_[pud|pmd|pte]_range) + +ADDED HOOKS + ioproc_invalidate_range + ioproc_update_range + + +==== FUNCTION + remap_pfn_range + +CALLED FROM + many device drivers + +PTE MODIFICATION + set_pte_at (remap_pte_range via remap_[pud|pmd|pte]_range) + +ADDED HOOKS + 
ioproc_invalidate_range + ioproc_update_range + + +==== FUNCTION + break_cow + +CALLED FROM + do_wp_page + +PTE MODIFICATION + ptep_establish + +ADDED HOOKS + ioproc_invalidate_page + ioproc_update_page + + +==== FUNCTION + do_wp_page + +CALLED FROM + do_swap_page, handle_pte_fault + +PTE MODIFICATION + ptep_set_access_flags, break_cow + +ADDED HOOKS + ioproc_update_page + + +==== FUNCTION + do_swap_page + +CALLED FROM + handle_pte_fault + +PTE MODIFICATION + set_pte_at + +ADDED HOOKS + ioproc_update_page + + +==== FUNCTION + do_anonymous_page + +CALLED FROM + do_no_page + +PTE MODIFICATION + set_pte_at + +ADDED HOOKS + ioproc_update_page + + +==== FUNCTION + do_no_page + +CALLED FROM + do_file_page, handle_pte_fault + +PTE MODIFICATION + set_pte_at + +ADDED HOOKS + ioproc_update_page + + +==== FUNCTION + handle_pte_fault + +CALLED FROM + handle_mm_fault + +PTE MODIFICATION + ptep_set_access_flags, do_no_page, do_file_page, do_swap_page + +ADDED HOOKS + Handled in called functions and not necessary for minor fault + + +=============================================================================== +++++ FILE + mm/mmap.c + +==== FUNCTION + unmap_region + +CALLED FROM + do_munmap + +PTE MODIFICATION + set_pte_at (unmap_vmas) + +ADDED HOOKS + ioproc_invalidate_range + + +==== FUNCTION + exit_mmap + +CALLED FROM + mmput + +PTE MODIFICATION + set_pte_at (unmap_vmas) + +ADDED HOOKS + ioproc_release + + +=============================================================================== +++++ FILE + mm/mprotect.c + +==== FUNCTION + change_protection + +CALLED FROM + mprotect_fixup + +PTE MODIFICATION + set_pte_at (change_pte_range via change_[pud|pmd|pte]_range) + +ADDED HOOKS + ioproc_change_protection + + +=============================================================================== +++++ FILE + mm/mremap.c + +==== FUNCTION + move_page_tables + +CALLED FROM + move_vma + +PTE MODIFICATION + ptep_clear_flush (move_one_page) + +ADDED HOOKS + ioproc_invalidate_range + 
ioproc_invalidate_range + + +=============================================================================== +++++ FILE + mm/rmap.c + +==== FUNCTION + try_to_unmap_one + +CALLED FROM + try_to_unmap_anon, try_to_unmap_file + +PTE MODIFICATION + ptep_clear_flush + +ADDED HOOKS + ioproc_invalidate_page + + +==== FUNCTION + try_to_unmap_cluster + +CALLED FROM + try_to_unmap_file + +PTE MODIFICATION + ptep_clear_flush + +ADDED HOOKS + ioproc_invalidate_page + + +=============================================================================== +++++ FILE + mm/msync.c + +==== FUNCTION + filemap_sync + +CALLED FROM + msync_interval + +PTE MODIFICATION + ptep_clear_flush_dirty (filemap_sync_pte) + +ADDED HOOKS + ioproc_sync_range + + +=============================================================================== +++++ FILE + mm/hugetlb.c + +==== FUNCTION + zap_hugepage_range + +CALLED FROM + hugetlb_vmtruncate_list + +PTE MODIFICATION + ptep_get_and_clear (unmap_hugepage_range) + +ADDED HOOK + ioproc_invalidate_range + + +-- Last update DavidAddison - 26 Apr 2005 ^ permalink raw reply [flat|nested] 20+ messages in thread
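For readers following the thread, the registration list and hook dispatch described in the patch above can be modelled in user space. This is only a rough sketch under stated assumptions: every `model_*` name and the `count_invalidate` callback are hypothetical stand-ins invented for illustration, not kernel interfaces, and the `mm->page_table_lock` locking that the patch requires around registration and the hooks is left out entirely. The list handling mirrors `ioproc_register_ops()`, `ioproc_unregister_ops()` and the `ioproc_invalidate_page()` walk from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct ioproc_ops (two hooks only). */
struct model_ioproc_ops {
	struct model_ioproc_ops *next;
	void *arg;
	void (*invalidate_page)(void *arg, unsigned long addr);
	void (*update_page)(void *arg, unsigned long addr);
};

/* Hypothetical stand-in for the ioproc_ops pointer added to mm_struct. */
struct model_mm {
	struct model_ioproc_ops *ioproc_ops;
};

/* Push onto the head of the per-mm list, as ioproc_register_ops() does. */
int model_register_ops(struct model_mm *mm, struct model_ioproc_ops *ip)
{
	ip->next = mm->ioproc_ops;
	mm->ioproc_ops = ip;
	return 0;
}

/* Unlink via a pointer-to-pointer walk; -EINVAL if ip is not on the list. */
int model_unregister_ops(struct model_mm *mm, struct model_ioproc_ops *ip)
{
	struct model_ioproc_ops **tmp;

	for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp = &(*tmp)->next)
		;
	if (*tmp) {
		*tmp = ip->next;
		return 0;
	}
	return -EINVAL;
}

/* Hook dispatch: call every registered, non-NULL invalidate_page callback. */
void model_invalidate_page(struct model_mm *mm, unsigned long addr)
{
	struct model_ioproc_ops *cp;

	for (cp = mm->ioproc_ops; cp; cp = cp->next)
		if (cp->invalidate_page)
			cp->invalidate_page(cp->arg, addr);
}

/* Example "driver" callback: just counts how often it was invoked. */
static void count_invalidate(void *arg, unsigned long addr)
{
	(void)addr;
	++*(int *)arg;
}

/* Registers one driver, fires three invalidations, unregisters after two. */
int model_demo(void)
{
	int hits = 0;
	int rc;
	struct model_mm mm = { NULL };
	struct model_ioproc_ops ops = { NULL, &hits, count_invalidate, NULL };

	model_register_ops(&mm, &ops);
	model_invalidate_page(&mm, 0x1000UL);
	model_invalidate_page(&mm, 0x2000UL);

	rc = model_unregister_ops(&mm, &ops);
	assert(rc == 0);

	/* No callbacks are registered any more, so this one is not counted. */
	model_invalidate_page(&mm, 0x3000UL);

	/* A second unregister finds nothing to unlink. */
	rc = model_unregister_ops(&mm, &ops);
	assert(rc == -EINVAL);

	return hits;
}
```

Calling `model_demo()` from a small test program exercises the whole path. The NULL check before each callback is what lets a driver fill in only the hooks it cares about.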
* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-27 9:41 ` David Addison
@ 2005-04-28 8:38 ` Andy Isaacson
0 siblings, 0 replies; 20+ messages in thread
From: Andy Isaacson @ 2005-04-28 8:38 UTC (permalink / raw)
To: David Addison
Cc: Brice Goglin, linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Wed, Apr 27, 2005 at 10:41:07AM +0100, David Addison wrote:
> Brice Goglin wrote:
> >I worked on a similar patch to help updating a registration cache on
> >Myrinet. I came to the problem of deciding between registering ioproc
> >to the entire address space (1) or only to some VMA (2).
> >You're doing (1), I tried (2).
>
> We have always taken approach (1) as it seems to be the simplest
> method and offers the model where the whole user process space can be
> made available for RDMA operations.

I agree that this is a nice patch for exploring the design space (and
frankly, for maintaining outside the kernel tree). I'd like to see
something like this merged.

As it stands, the patch is a decent standalone implementation of (1).
I would personally strongly prefer that whatever is merged be
low-impact and so obviously good that it would not need to be a CONFIG_
option. (Or rather, it should be a CONFIG_ option, but one which is
forced to yes if !CONFIG_EMBEDDED.)

And of course, it needs to be general-purpose enough to satisfy all the
significant constituencies:
1. Myrinet/Quadrics (proprietary interconnects for HPC/etc)
2. Infiniband (slightly more general-purpose interconnect standard for
   etc/HPC)
3. RDMA TCP
and I would add
4. people who want to add a commodity card to a general-purpose server
   and be able to take advantage of direct-to-userspace transfers
   without breaking the general-purposeness of their server.

I think that given a reliable framework for DMA-to-userspace, other
users will pop up. OpenGL (DRI) is one obvious example; I think there
are others.

With those (fairly lofty) goals in mind, I think the verdict is not
good for ioproc-2.6.12-rc3.patch. It's got some style-ish issues that
would have to be worked out before it could be merged. (#ifdef in
code, for one.) It's adding a linked-list walk to a bunch of places in
mm/, which is (or at least, seems to me) pretty unacceptable (even if
it's just one cacheline miss) in the fast paths.

Did you understand Andi's suggestion about NUMA policies? (I'm not
smart enough to follow it.) Can we share code between this and the
NUMA stuff?

> static over the life of the job and hence most of the costs are taken
> as the pages are created during job startup and initialisation.

Yeah, I'm pretty skeptical about claims that "It's too much work to
keep track of all that" regarding per-proc versus per-vma, and also
regarding explicit-lock-from-commlib versus dynamic-pinning. For the
people who care (HPC), pin/unpin events are very rare (zero during
normal runtime), so the overhead is unimportant. It's more important
to provide reliable operation with minimal impact to standard mm
semantics.

> However, I still prefer model (1) as it allows both implementations and
> appears to be much simpler in terms of the linux kernel changes required.

I agree that (1) looks easier to implement when you're doing it outside
the kernel (and tracking). However, if you're aiming for integration
we should figure out what the right answer is. It feels like that's
per-vma, but I freely admit I don't have any code to back that up.

> Thanks for your comments,

Thank you for stepping up to be our archery target. :)

> diff -ruN linux-2.6.12-rc3.orig/include/linux/ioproc.h linux-2.6.12-rc3.ioproc/include/linux/ioproc.h

Could you add -p to your diff invocation, please...

This patch is *exactly* what I'd want if I were looking for an obvious,
easy-to-maintain externally-maintained patch to add this capability.
(Would that I could say that for all the HPC kernel patches I've been
subjected to.)

But I think we can do better. At least I would like to see Andi (or
another NUMA mm god) and you (or another RDMA expert) hash over the
possibility of sharing code.

-andy

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 17:06 ` Brice Goglin
2005-04-27 9:41 ` David Addison
@ 2005-04-27 13:43 ` Andi Kleen
1 sibling, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2005-04-27 13:43 UTC (permalink / raw)
To: Brice Goglin; +Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

Brice Goglin <Brice.Goglin@ens-lyon.org> writes:

> I see two drawback in (2).
> First, it requires to play with the list of ioproc_ops when VMA are
> merged or split. Actually, it's not that bad since the list often
> contains only 1 ioproc_ops.

I had a similar problem with the NUMA policies. With some minor hacks
you could probably reuse the policy support by making it a weird kind
of policy.

That would allow keeping the fast path impact very low, which I think
is the most important part of such hardware-specific, narrow-purpose
hacks that are useless to 99.9999% of all users.

(Golden rule number 1 for such code: don't impact anything else)

-Andi

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
2005-04-26 16:57 ` Jesper Juhl
2005-04-26 17:06 ` Brice Goglin
@ 2005-04-28 1:42 ` Troy Benjegerdes
2005-04-28 7:21 ` Brice Goglin
3 siblings, 0 replies; 20+ messages in thread
From: Troy Benjegerdes @ 2005-04-28 1:42 UTC (permalink / raw)
To: David Addison
Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, Apr 26, 2005 at 04:49:01PM +0100, David Addison wrote:
> Hi,
>
> here is a patch we use to integrate the Quadrics NICs into the Linux kernel.
> The patch adds hooks to the Linux VM subsystem so that registered 'IOPROC'
> devices can be informed of page table changes.
> This allows the Quadrics NICs to perform user RDMAs safely, without requiring
> page pinning. Looking through some of the recent IB and Ammasso discussions,
> it may also prove useful to those NICs too.
>

I think the best thing to do is post this patch to openib-general
( http://openib.org/mailman/listinfo/openib-general ) and get a patch
developed that works on Ammasso, IB, and Quadrics hardware, and then
come back to lkml.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
` (2 preceding siblings ...)
2005-04-28 1:42 ` Troy Benjegerdes
@ 2005-04-28 7:21 ` Brice Goglin
2005-04-28 9:21 ` David Addison
2005-04-29 8:19 ` Benjamin Herrenschmidt
3 siblings, 2 replies; 20+ messages in thread
From: Brice Goglin @ 2005-04-28 7:21 UTC (permalink / raw)
To: David Addison; +Cc: Andrew Morton, Andrea Arcangeli, Linux Kernel

> @@ -267,6 +270,11 @@
>
>  	unsigned long hiwater_rss;	/* High-water RSS usage */
>  	unsigned long hiwater_vm;	/* High-water virtual memory usage */
> +
> +#ifdef CONFIG_IOPROC
> +	/* hooks for io devices with advanced RDMA capabilities */
> +	struct ioproc_ops *ioproc_ops;
> +#endif
>  };

> +int
> +ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
> +{
> +	ip->next = mm->ioproc_ops;
> +	mm->ioproc_ops = ip;
> +
> +	return 0;
> +}
> +
> +EXPORT_SYMBOL_GPL(ioproc_register_ops);
> +
> +int
> +ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip)
> +{
> +	struct ioproc_ops **tmp;
> +
> +	for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp= &(*tmp)->next)
> +		;
> +	if (*tmp) {
> +		*tmp = ip->next;
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +EXPORT_SYMBOL_GPL(ioproc_unregister_ops);

You don't seem to use any synchronization mechanism to protect the
ioproc list from concurrent modifications, right ?
I understand that it might be useless as long as QsNet is the only user
of ioprocs and takes care of locking the address space somewhere in the
driver before adding/removing hooks.
But, if this patch is to be merged to the mainline, you probably need
to do something here. It's not clear how other in-kernel users
(IB, Myri, Ammasso, ...) might use ioprocs.
And actually, I think all ioproc list traversal need to be protected
as well.

A spinlock_t ioproc_lock is probably appropriate here.
I don't know whether any of the existing locks in the task_struct
might be used instead.

Regards,
Brice

^ permalink raw reply [flat|nested] 20+ messages in thread
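
Brice's suggestion of a dedicated lock around the ops list can be sketched in userspace as follows. This is only an illustrative model, not code from the patch: `struct mm_model`, `model_mm_init`, `model_lock`, `model_unlock`, `model_register_ops`, `model_unregister_ops` and the `id` field are all hypothetical names, and a C11 `atomic_flag` stands in for the kernel's `spinlock_t`. The list manipulation itself mirrors the quoted `ioproc_register_ops`/`ioproc_unregister_ops`.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel types in the patch. */
struct ioproc_ops {
	struct ioproc_ops *next;
	int id;				/* illustrative payload only */
};

struct mm_model {
	struct ioproc_ops *ioproc_ops;	/* head of the singly linked list */
	atomic_flag ioproc_lock;	/* stands in for a spinlock_t */
};

void model_mm_init(struct mm_model *mm)
{
	mm->ioproc_ops = NULL;
	atomic_flag_clear(&mm->ioproc_lock);
}

static void model_lock(struct mm_model *mm)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&mm->ioproc_lock,
						 memory_order_acquire))
		;
}

static void model_unlock(struct mm_model *mm)
{
	atomic_flag_clear_explicit(&mm->ioproc_lock, memory_order_release);
}

/* Push ip onto the front of the list, under the lock. */
int model_register_ops(struct mm_model *mm, struct ioproc_ops *ip)
{
	assert(mm != NULL && ip != NULL);

	model_lock(mm);
	ip->next = mm->ioproc_ops;
	mm->ioproc_ops = ip;
	model_unlock(mm);

	return 0;
}

/* Unlink ip if present; -EINVAL if it was never registered. */
int model_unregister_ops(struct mm_model *mm, struct ioproc_ops *ip)
{
	struct ioproc_ops **tmp;
	int ret = -EINVAL;

	assert(mm != NULL && ip != NULL);

	model_lock(mm);
	for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp = &(*tmp)->next)
		;
	if (*tmp) {
		*tmp = ip->next;
		ret = 0;
	}
	model_unlock(mm);

	return ret;
}
```

In this model the lock is taken inside the register/unregister functions, so callers cannot forget it; the cost is that every list walk in the hook paths would need to take the same lock.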
* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-28 7:21 ` Brice Goglin
@ 2005-04-28 9:21 ` David Addison
2005-04-29 8:19 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 20+ messages in thread
From: David Addison @ 2005-04-28 9:21 UTC (permalink / raw)
To: Brice Goglin; +Cc: Andrew Morton, Andrea Arcangeli, Linux Kernel

Brice Goglin wrote:
>> @@ -267,6 +270,11 @@
>>
>>  	unsigned long hiwater_rss;	/* High-water RSS usage */
>>  	unsigned long hiwater_vm;	/* High-water virtual memory usage */
>> +
>> +#ifdef CONFIG_IOPROC
>> +	/* hooks for io devices with advanced RDMA capabilities */
>> +	struct ioproc_ops *ioproc_ops;
>> +#endif
>>  };
>
>> +int
>> +ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
>> +{
>> +	ip->next = mm->ioproc_ops;
>> +	mm->ioproc_ops = ip;
>> +
>> +	return 0;
>> +}
>> +
>> +EXPORT_SYMBOL_GPL(ioproc_register_ops);
>> +
>> +int
>> +ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip)
>> +{
>> +	struct ioproc_ops **tmp;
>> +
>> +	for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp= &(*tmp)->next)
>> +		;
>> +	if (*tmp) {
>> +		*tmp = ip->next;
>> +		return 0;
>> +	}
>> +
>> +	return -EINVAL;
>> +}
>> +
>> +EXPORT_SYMBOL_GPL(ioproc_unregister_ops);
>
> You don't seem to use any synchronization mechanism to protect the
> ioproc list from concurrent modifications, right ?
> I understand that it might be useless as long as QsNet is the only user
> of ioprocs and takes care of locking the address space somewhere in the
> driver before adding/removing hooks.
> But, if this patch is to be merged to the mainline, you probably need
> to do something here. It's not clear how other in-kernel users
> (IB, Myri, Ammasso, ...) might use ioprocs.
> And actually, I think all ioproc list traversal need to be protected
> as well.
>

All ioproc list traversal is protected by the mm->page_table_lock which
is held at all points where the callbacks are invoked.
[Actually there is one case where this isn't true, which I'll fix when
we refresh this patch later today]

The register/unregister functions also need to be called while holding
this spinlock. Our device driver does this, but perhaps we need to
document that requirement more clearly.

Cheers
David.

^ permalink raw reply [flat|nested] 20+ messages in thread
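
The convention David describes (the caller already holds `mm->page_table_lock` both when the hooks fire and when ops are registered) could be made explicit with assertions. The sketch below is a userspace model under stated assumptions: `struct mm_model`, its boolean `page_table_locked` flag, `model_register_ops_locked`, `model_invalidate_page`, `model_record_addr` and the callback signature are all hypothetical and are not taken from the patch; the real `struct ioproc_ops` layout is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops entry: one callback plus a private argument. */
struct ioproc_ops {
	struct ioproc_ops *next;
	void (*invalidate_page)(void *arg, unsigned long addr);
	void *arg;
};

/* Hypothetical model of the mm: a flag models page_table_lock. */
struct mm_model {
	bool page_table_locked;
	struct ioproc_ops *ioproc_ops;
};

void model_lock(struct mm_model *mm)
{
	assert(!mm->page_table_locked);	/* single-threaded model only */
	mm->page_table_locked = true;
}

void model_unlock(struct mm_model *mm)
{
	assert(mm->page_table_locked);
	mm->page_table_locked = false;
}

/* Documented requirement: caller must hold the page table lock. */
void model_register_ops_locked(struct mm_model *mm, struct ioproc_ops *ip)
{
	assert(mm->page_table_locked);
	ip->next = mm->ioproc_ops;
	mm->ioproc_ops = ip;
}

/*
 * Hook called from a PTE-modifying path, which already holds the lock.
 * Walks the list and returns how many callbacks were invoked.
 */
int model_invalidate_page(struct mm_model *mm, unsigned long addr)
{
	struct ioproc_ops *cp;
	int calls = 0;

	assert(mm->page_table_locked);
	for (cp = mm->ioproc_ops; cp; cp = cp->next) {
		if (cp->invalidate_page) {
			cp->invalidate_page(cp->arg, addr);
			calls++;
		}
	}
	return calls;
}

/* Sample callback: records the invalidated address in *arg. */
void model_record_addr(void *arg, unsigned long addr)
{
	*(unsigned long *)arg = addr;
}
```

Compared with a dedicated list lock, this relies on every caller following the rule, which is why stating it in the function's contract (and checking it where possible) matters.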
* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-28 7:21 ` Brice Goglin
2005-04-28 9:21 ` David Addison
@ 2005-04-29 8:19 ` Benjamin Herrenschmidt
2005-04-29 9:25 ` David Addison
1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2005-04-29 8:19 UTC (permalink / raw)
To: Brice Goglin
Cc: David Addison, Andrew Morton, Andrea Arcangeli, Linux Kernel list

> > +ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
> > +{
> > +	ip->next = mm->ioproc_ops;
> > +	mm->ioproc_ops = ip;
> > +
> > +	return 0;
> > +}
> > +

Why not use a list_head along with linux standard list primitives ?

Ben.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
2005-04-29 8:19 ` Benjamin Herrenschmidt
@ 2005-04-29 9:25 ` David Addison
0 siblings, 0 replies; 20+ messages in thread
From: David Addison @ 2005-04-29 9:25 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Brice Goglin, Andrew Morton, Andrea Arcangeli, Linux Kernel list

Benjamin Herrenschmidt wrote:
>>>+ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
>>>+{
>>>+	ip->next = mm->ioproc_ops;
>>>+	mm->ioproc_ops = ip;
>>>+
>>>+	return 0;
>>>+}
>>>+
>
> Why not use a list_head along with linux standard list primitives ?
>
> Ben.
>
>

The reason we didn't use the standard list primitives was that we wanted
the normal case where no ioproc ops were registered to have minimal
impact and this just comes down to mm->ioproc_ops being checked against
being zero, which is slightly lighter weight than using the list
primitives.

Also entries are rarely removed from the list using the
ioproc_deregister function as in the normal case they get removed in
the call to ioproc_release. Hence there is little need for the doubly
linked list.

Cheers,
David

^ permalink raw reply [flat|nested] 20+ messages in thread
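
The trade-off David and Ben are discussing can be put side by side in a small userspace model. Everything here is a simplified, hypothetical stand-in: `struct mm_single`, `struct mm_double`, `struct list_node` and the `list_*` helpers are written for this sketch in the style of the kernel's `list_head` and are not the kernel's own definitions. Variant (a) is the head-pointer test from the patch; variant (b) is a circular doubly linked list whose emptiness test compares the head's `next` with the head itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* (a) Singly linked list with a bare head pointer, as in the patch. */
struct ioproc_ops {
	struct ioproc_ops *next;
};

struct mm_single {
	struct ioproc_ops *ioproc_ops;	/* one pointer in the mm */
};

static inline bool single_has_ops(const struct mm_single *mm)
{
	/* Fast path when nothing is registered: compare against NULL. */
	return mm->ioproc_ops != NULL;
}

/* (b) Simplified circular doubly linked list, list_head style. */
struct list_node {
	struct list_node *next, *prev;
};

struct mm_double {
	struct list_node ioproc_list;	/* two pointers in the mm */
};

static inline void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static inline bool list_is_empty(const struct list_node *head)
{
	/* Empty when the head points back at itself. */
	return head->next == head;
}

static inline void list_add_front(struct list_node *node,
				  struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

static inline void list_unlink(struct list_node *node)
{
	/* O(1) removal: no walk needed, unlike the singly linked list. */
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node;
	node->prev = node;
}
```

Both emptiness checks are a single load and compare, so the difference in the empty case is small; what the doubly linked form mainly buys is constant-time removal, which David argues is rarely needed because entries are normally dropped in bulk at release time.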