public inbox for linux-kernel@vger.kernel.org
* large page patch
@ 2002-08-02  0:37 Andrew Morton
  2002-08-02  0:43 ` David S. Miller
                   ` (4 more replies)
  0 siblings, 5 replies; 40+ messages in thread
From: Andrew Morton @ 2002-08-02  0:37 UTC (permalink / raw)
  To: lkml, linux-mm@kvack.org; +Cc: Seth, Rohit, Saxena, Sunil, Mallick, Asit K


This is a large-page support patch from Rohit Seth, forwarded
with his permission (thanks!).



> Attached is the large_page support for IA-32.  For the most part there
> are no changes over the IA-64 patch.  System calls and their semantics
> remain the same, though there are still some small parts of the code that
> are arch specific (e.g. for IA-64 there is a separate region for
> large_pages, whereas on IA-32 it is the same linear address space).  I
> would appreciate it if you all could provide your input and any issues
> that you think we need to resolve.
> 
> Attached is the large_page patch, including the following support: 1-
> Private and Shared Anonymous large pages (this is the earlier patch plus
> anonymous shared large_page support).  Private anonymous large_pages stay
> with the particular process, and the vm segments corresponding to them get
> the VM_DONTCOPY attribute.  Shared anonymous pages are shared by children
> (children share the same physical large_pages with the parent).  Allocation
> and deallocation are done using the following two system calls:
> 
>    sys_get_large_pages (unsigned long addr, unsigned long len, int prot, int flags)
>         where prot can be PROT_READ, PROT_WRITE, PROT_EXEC and flags
>         is MAP_PRIVATE or MAP_SHARED
>    sys_free_large_pages(unsigned long addr)
> 
> 2- Shared Large Pages across different processes.  Allocation and
> deallocation of large_pages that a process can share and unshare across
> different processes is done using the following two system calls:
> 
>    sys_share_large_pages(int key, unsigned long addr, unsigned long len, int prot, int flag)
> 
> where key is the system-wide unique identifier that processes use to share
> pages.  This should be a positive, non-zero number.  prot is identical to
> the above cases.  flag can be set to IPC_CREAT so that if the segment
> corresponding to key does not already exist it is created (otherwise
> -ENOENT is returned if there is no existing segment).
> 
>    sys_unshare_large_pages(unsigned long addr)
> 
> is used to unshare the large_pages from the process's address space.  The
> large_pages are put on lpage_freelist only when the last user has
> requested unsharing (akin to the SHM_DEST attribute).
> 
> Most of the support needed for above two cases (Anonymous and Sharing
> across processes) is quite similar in kernel except for binding of
> large_pages to key and temporary inode structure.
> 
> 3) Currently the large_page memory is dynamically configurable through
> /proc/sys/kernel/numlargepages.  The user can specify the amount (negative
> meaning shrink) by which the number of large_page pages should change.
> E.g. a value of -2 will reduce the number of large_page pages currently
> configured in the system by 2.  Note that this change depends on the
> availability of free large_pages; if none are available then the value
> remains the same.  (Any cleaner suggestions?)
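[Editor's note: the delta semantics described in (3) can be sketched in
userspace as follows.  This is a toy model, not code from the patch; the
struct and function names are illustrative only.  A positive write grows
the pool, a negative write shrinks it, and both are clamped by what is
actually available:]

```c
#include <assert.h>

/* Toy model of the proposed /proc/sys/kernel/numlargepages semantics.
 * All names here are illustrative, not from the patch itself. */
struct lpage_pool {
	long total;	/* large pages configured (the patch's lpzone_pages) */
	long free;	/* large pages on the freelist (the patch's lpagemem) */
	long buddy;	/* 4MB blocks the buddy allocator could still supply */
};

/* Apply the signed delta written to the sysctl; return the new total,
 * which is what a subsequent read of the sysctl would report. */
long apply_numlargepages_delta(struct lpage_pool *p, long delta)
{
	if (delta > 0) {		/* grow: limited by buddy availability */
		long grow = delta < p->buddy ? delta : p->buddy;

		p->buddy -= grow;
		p->total += grow;
		p->free  += grow;
	} else if (delta < 0) {		/* shrink: only free large pages go back */
		long shrink = -delta < p->free ? -delta : p->free;

		p->free  -= shrink;
		p->total -= shrink;
		p->buddy += shrink;
	}
	return p->total;
}
```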

Some observations which have been made thus far:

- Minimal impact on the VM and MM layers

- Delegates most of it to the arch layer

- Generic code is not tied to pagetables so (for example) PPC could
  implement the system calls with BAT registers

- The change to MAX_ORDER is unneeded

- Swapping of large pages and making them pagecache-coherent is
  unpopular

- It may be better to implement the shm API with fds, not keys

- An ia64 implementation is available
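[Editor's note: as a sanity check on the constants the patch adds to
include/asm-i386/page.h, here is a small userspace sketch of the non-PAE
arithmetic (PAE kernels use LPAGE_SHIFT 21, i.e. 2MB large pages).  Only
`lpage_align` is a made-up helper name; the macros mirror the patch:]

```c
#include <assert.h>

/* Mirrors the non-PAE i386 constants from the patch (PAE uses
 * LPAGE_SHIFT 21, i.e. 2MB pages).  Userspace sketch only. */
#define PAGE_SHIFT	12
#define LPAGE_SHIFT	22
#define LPAGE_SIZE	((1UL) << LPAGE_SHIFT)
#define LPAGE_MASK	(~(LPAGE_SIZE - 1))
#define LARGE_PAGE_ORDER	(LPAGE_SHIFT - PAGE_SHIFT)

/* Round up to the next large-page boundary, as the patch's LPAGE_ALIGN
 * macro does. */
unsigned long lpage_align(unsigned long addr)
{
	return (addr + (LPAGE_SIZE - 1)) & LPAGE_MASK;
}
```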


diff -Naru linux.org/arch/i386/config.in linux.lp/arch/i386/config.in
--- linux.org/arch/i386/config.in	Mon Feb 25 11:37:52 2002
+++ linux.lp/arch/i386/config.in	Tue Jul  2 17:49:15 2002
@@ -184,6 +184,8 @@
 
 bool 'Math emulation' CONFIG_MATH_EMULATION
 bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
+bool 'IA-32 Large Page Support (if available on processor)' CONFIG_LARGE_PAGE
+
 bool 'Symmetric multi-processing support' CONFIG_SMP
 if [ "$CONFIG_SMP" != "y" ]; then
    bool 'Local APIC support on uniprocessors' CONFIG_X86_UP_APIC
@@ -205,7 +207,6 @@
 
 mainmenu_option next_comment
 comment 'General setup'
-
 bool 'Networking support' CONFIG_NET
 
 # Visual Workstation support is utterly broken.
diff -Naru linux.org/arch/i386/kernel/entry.S linux.lp/arch/i386/kernel/entry.S
--- linux.org/arch/i386/kernel/entry.S	Mon Feb 25 11:37:53 2002
+++ linux.lp/arch/i386/kernel/entry.S	Tue Jul  2 15:12:23 2002
@@ -634,6 +634,10 @@
 	.long SYMBOL_NAME(sys_ni_syscall)	/* 235 reserved for removexattr */
 	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for lremovexattr */
 	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for fremovexattr */
+	.long SYMBOL_NAME(sys_get_large_pages)	/* Get large_page pages */
+	.long SYMBOL_NAME(sys_free_large_pages)	/* Free large_page pages */
+	.long SYMBOL_NAME(sys_share_large_pages)/* Share large_page pages */
+	.long SYMBOL_NAME(sys_unshare_large_pages)/* UnShare large_page pages */
 
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long SYMBOL_NAME(sys_ni_syscall)
diff -Naru linux.org/arch/i386/kernel/sys_i386.c linux.lp/arch/i386/kernel/sys_i386.c
--- linux.org/arch/i386/kernel/sys_i386.c	Mon Mar 19 12:35:09 2001
+++ linux.lp/arch/i386/kernel/sys_i386.c	Wed Jul  3 14:28:16 2002
@@ -254,3 +254,126 @@
 	return -ERESTARTNOHAND;
 }
 
+#ifdef CONFIG_LARGE_PAGE
+#define LPAGE_ALIGN(x)	(((unsigned long)(x) + (LPAGE_SIZE - 1)) & LPAGE_MASK)
+extern long	sys_munmap(unsigned long, size_t);
+
+/* get_addr finds a currently unused virtual range in the current
+ * process's address space.  It returns a LPAGE_SIZE-aligned address
+ * on success.  The generic kernel routines only guarantee that an
+ * allocated address is PAGE_SIZE aligned.
+ */
+unsigned long
+get_addr(unsigned long addr, unsigned long len)
+{
+	struct vm_area_struct	*vma;
+	if (addr) {
+		addr = LPAGE_ALIGN(addr);
+		vma = find_vma(current->mm, addr);
+		if (((TASK_SIZE - len) >= addr) &&
+		      (!vma || addr + len <= vma->vm_start))
+			goto found_addr;
+	}
+	addr = LPAGE_ALIGN(TASK_UNMAPPED_BASE);
+	for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
+		if (TASK_SIZE - len < addr)
+			return -ENOMEM;
+		if (!vma || ((addr + len) < vma->vm_start))
+			goto found_addr;
+		addr = vma->vm_end;
+	}
+found_addr:
+	addr = LPAGE_ALIGN(addr);
+	return addr;
+}
+
+asmlinkage unsigned long
+sys_get_large_pages(unsigned long addr, unsigned long len, int prot, int flags)
+{
+	extern int make_lpages_present(unsigned long, unsigned long, int);
+	int	temp;
+
+	if (!(cpu_has_pse))
+		return -EINVAL;
+	if (len & (LPAGE_SIZE - 1)) 
+		return -EINVAL;
+	addr = get_addr(addr, len);
+	if (addr  ==  -ENOMEM)
+		return addr;
+	temp = MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED;
+	addr = do_mmap_pgoff(NULL, addr, len, prot, temp, 0);
+	printk("Returned addr %lx\n", addr);
+	if (!(addr & (LPAGE_SIZE -1))) {
+		 if (make_lpages_present(addr, (addr+len), flags) < 0) {
+			 addr = sys_munmap(addr, len);
+			 return -ENOMEM;
+		}
+	}
+	return addr;
+}
+
+asmlinkage unsigned long 
+sys_share_large_pages(int key, unsigned long addr, unsigned long len, int prot, int flag)
+{
+	unsigned long	raddr;
+	int	retval;
+	extern int set_lp_shm_seg(int, unsigned long *, unsigned long, int, int);
+	if (!(cpu_has_pse))
+		return -EINVAL;
+	if (key <= 0) 
+		return -EINVAL;
+	if (len & (LPAGE_SIZE - 1)) 
+		return -EINVAL;
+	raddr = get_addr(addr, len);
+	if (raddr == -ENOMEM)
+		return raddr;
+	retval = set_lp_shm_seg(key, &raddr, len, prot, flag);
+	if (retval < 0) 
+		return (unsigned long) retval;
+	return raddr;
+}
+
+asmlinkage int
+sys_free_large_pages(unsigned long addr)
+{
+	struct vm_area_struct	*vma;
+	extern int unmap_large_pages(struct vm_area_struct *);
+
+	vma = find_vma(current->mm, addr);
+	if ((!vma) || (!(vma->vm_flags & VM_LARGEPAGE)) || 
+		(vma->vm_start!=addr)) 
+		return -EINVAL;
+	return unmap_large_pages(vma);
+}
+
+asmlinkage int
+sys_unshare_large_pages(unsigned long addr)
+{
+	return sys_free_large_pages(addr);
+}
+
+#else
+asmlinkage unsigned long
+sys_get_large_pages(unsigned long addr, unsigned long len, int prot, int flags)
+{
+	return -ENOSYS;
+}
+
+asmlinkage unsigned long 
+sys_share_large_pages(int key, unsigned long addr, unsigned long len, int prot, int flag)
+{
+	return -ENOSYS;
+}
+
+asmlinkage int
+sys_free_large_pages(unsigned long addr)
+{
+	return -ENOSYS;
+}
+
+asmlinkage int
+sys_unshare_large_pages(unsigned long addr)
+{
+	return -ENOSYS;
+}
+#endif
diff -Naru linux.org/arch/i386/mm/Makefile linux.lp/arch/i386/mm/Makefile
--- linux.org/arch/i386/mm/Makefile	Fri Dec 29 14:07:20 2000
+++ linux.lp/arch/i386/mm/Makefile	Tue Jul  2 16:55:53 2002
@@ -10,5 +10,6 @@
 O_TARGET := mm.o
 
 obj-y	 := init.o fault.o ioremap.o extable.o
+obj-$(CONFIG_LARGE_PAGE) += lpage.o
 
 include $(TOPDIR)/Rules.make
diff -Naru linux.org/arch/i386/mm/init.c linux.lp/arch/i386/mm/init.c
--- linux.org/arch/i386/mm/init.c	Fri Dec 21 09:41:53 2001
+++ linux.lp/arch/i386/mm/init.c	Tue Jul  2 18:39:13 2002
@@ -447,6 +447,12 @@
 	return 0;
 }
 	
+#ifdef CONFIG_LARGE_PAGE
+long	lpagemem = 0;
+int	lp_max;
+long	lpzone_pages;
+extern struct	list_head lpage_freelist;
+#endif
 void __init mem_init(void)
 {
 	extern int ppro_with_ram_bug(void);
@@ -532,6 +538,32 @@
 	zap_low_mappings();
 #endif
 
+#ifdef CONFIG_LARGE_PAGE
+	{
+	long	i;
+	long	j;
+	struct page	*page, *map;
+	
+	/* For now reserve a quarter of memory for large_pages. */
+	lpzone_pages = (max_low_pfn >> ((LPAGE_SHIFT - PAGE_SHIFT) + 2));
+		/* Will make this a kernel command-line option. */
+	INIT_LIST_HEAD(&lpage_freelist);
+	for (i=0; i<lpzone_pages; i++) {
+		page = alloc_pages(GFP_ATOMIC, LARGE_PAGE_ORDER);
+		if (page == NULL)
+			break;
+		map = page;
+		for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+			SetPageReserved(map);
+			map++;
+		}
+		list_add(&page->list, &lpage_freelist);
+	}
+	printk("Total Large_page memory pages allocated %ld\n", i);
+	lpzone_pages = lpagemem = i;
+	lp_max = i;
+	}
+#endif
 }
 
 /* Put this after the callers, so that it cannot be inlined */
diff -Naru linux.org/arch/i386/mm/lpage.c linux.lp/arch/i386/mm/lpage.c
--- linux.org/arch/i386/mm/lpage.c	Wed Dec 31 16:00:00 1969
+++ linux.lp/arch/i386/mm/lpage.c	Wed Jul  3 16:09:59 2002
@@ -0,0 +1,475 @@
+/*
+ * IA-32 Large Page Support for Kernel.
+ *
+ * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
+ */
+
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/devfs_fs_kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/string.h>
+#include <linux/locks.h>
+#include <linux/smp_lock.h>
+#include <linux/slab.h>
+
+#include <asm/uaccess.h>
+#include <asm/mman.h>
+
+static struct vm_operations_struct	lp_vm_ops;
+static void zap_lp_resources(struct vm_area_struct *);
+struct list_head 			lpage_freelist;
+spinlock_t				lpage_lock = SPIN_LOCK_UNLOCKED;
+extern	long 				lpagemem;
+
+#define MAX_ID 	32
+struct lpkey {
+	struct inode *in;
+	int			key;
+} lpk[MAX_ID];
+
+static struct inode *
+find_key_inode(int key)
+{
+	int				i;
+
+	for (i=0; i<MAX_ID; i++) {
+		if (lpk[i].key == key) 
+			return (lpk[i].in);
+	}
+	return NULL;
+}
+static struct page *
+alloc_large_page(void)
+{
+	struct list_head	*curr, *head;
+	struct page			*page;
+
+	spin_lock(&lpage_lock);
+
+	head = &lpage_freelist;
+	curr = head->next;
+
+	if (curr == head)  {
+		spin_unlock(&lpage_lock);
+		return NULL;
+	}
+	page = list_entry(curr, struct page, list);
+	list_del(curr);
+	lpagemem--;
+	spin_unlock(&lpage_lock);
+	set_page_count(page, 1);
+	memset(page_address(page), 0, LPAGE_SIZE);
+	return page;
+}
+
+static void
+free_large_page(struct page *page)
+{
+	if ((page->mapping != NULL) && (page_count(page) == 2)) {
+		struct inode *inode = page->mapping->host;
+		int 	i;
+
+		lru_cache_del(page);
+		remove_inode_page(page);
+		set_page_count(page, 1);
+		if ((inode->i_size -= LPAGE_SIZE) == 0) {
+			for (i=0;i<MAX_ID;i++)
+				if (lpk[i].key == inode->i_ino) {
+					lpk[i].key = 0;
+					break;
+			}
+			kfree(inode);
+		}
+	}
+	if (put_page_testzero(page)) {
+		spin_lock(&lpage_lock);
+		list_add(&page->list, &lpage_freelist);
+		lpagemem++;
+		spin_unlock(&lpage_lock);
+	}
+}
+
+static pte_t *
+lp_pte_alloc(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t			*pgd;
+	pmd_t			*pmd = NULL;
+
+	pgd = pgd_offset(mm, addr);
+	pmd = pmd_alloc(mm, pgd, addr);
+	return (pte_t *)pmd;
+}
+
+static pte_t *
+lp_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t			*pgd;
+	pmd_t			*pmd = NULL;
+	
+	pgd = pgd_offset(mm, addr);
+	pmd = pmd_offset(pgd, addr);
+	return (pte_t *)pmd;
+}
+
+#define mk_pte_large(entry) {entry.pte_low |= (_PAGE_PRESENT | _PAGE_PSE);}
+	
+static void
+set_lp_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t *page_table, int write_access)
+{
+	pte_t           entry;
+
+	mm->rss += (LPAGE_SIZE/PAGE_SIZE);
+	if (write_access) {
+		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
+	} else
+		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
+	entry = pte_mkyoung(entry);
+	mk_pte_large(entry);
+	set_pte(page_table, entry);
+	printk("VIRTUAL_ADDRESS_OF_LPAGE IS %p\n", page->virtual);
+	return;
+}
+
+static int
+anon_get_lpage(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, pte_t *page_table)
+{
+	struct	page *page;
+
+	page = alloc_large_page();
+	if (page == NULL) 
+		return -1;
+	set_lp_pte(mm, vma, page, page_table, write_access);
+	return 1;
+}
+
+int
+make_lpages_present(unsigned long addr, unsigned long end, int flags)
+{
+	int write;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct * vma;
+	pte_t	*pte;
+
+	vma = find_vma(mm, addr);
+	if (!vma)
+		goto out_error1;
+
+	write = (vma->vm_flags & VM_WRITE) != 0;
+	if ((vma->vm_end - vma->vm_start) & (LPAGE_SIZE-1))
+		goto out_error1;
+	spin_lock(&mm->page_table_lock);
+	do {    
+		pte = lp_pte_alloc(mm, addr);
+		if ((pte) && (pte_none(*pte))) {
+			if (anon_get_lpage(mm, vma, 
+				write ? VM_WRITE : VM_READ, pte) == -1)
+				goto out_error;
+		} else
+			goto out_error;
+		addr += LPAGE_SIZE;
+	} while (addr < end); 
+	spin_unlock(&mm->page_table_lock);
+	vma->vm_flags |= (VM_LARGEPAGE | VM_RESERVED);
+	if (flags & MAP_PRIVATE )
+		vma->vm_flags |= VM_DONTCOPY;
+	vma->vm_ops = &lp_vm_ops;
+	return 0;
+out_error: /*Error case, remove the partial lp_resources. */
+	if (addr > vma->vm_start) { 
+	   	vma->vm_end = addr ;
+	   	zap_lp_resources(vma);
+	   	vma->vm_end = end;
+	}
+	spin_unlock(&mm->page_table_lock);
+out_error1:
+	return -1;
+}
+
+int
+copy_lpage_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma)
+{
+	pte_t *src_pte, *dst_pte, entry;
+	struct page 	*ptepage;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+
+	while (addr < end) {
+		dst_pte = lp_pte_alloc(dst, addr);
+		if (!dst_pte)
+			goto nomem;
+		src_pte = lp_pte_offset(src, addr);
+		entry = *src_pte;
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+		set_pte(dst_pte, entry);
+		dst->rss += (LPAGE_SIZE/PAGE_SIZE);
+		addr += LPAGE_SIZE;
+	}
+	return 0;
+
+nomem:
+	return -ENOMEM;
+}
+int
+follow_large_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *st, int *length, int i)
+{
+	pte_t			*ptep, pte;
+	unsigned long	start = *st;
+	unsigned long	pstart;
+	int				len = *length;
+	struct page		*page;
+
+	do {
+		pstart = start;
+		ptep = lp_pte_offset(mm, start);
+		pte = *ptep;
+
+back1:
+		page = pte_page(pte);
+		if (pages) {
+			page += ((start & ~LPAGE_MASK) >> PAGE_SHIFT);
+			pages[i] = page;
+			page_cache_get(page);
+		}
+		if (vmas)
+			vmas[i] = vma;
+		i++;
+		len--;
+		start += PAGE_SIZE;
+		if (((start & LPAGE_MASK) == (pstart & LPAGE_MASK)) && len && (start < vma->vm_end))
+			goto back1;
+	} while (len && start < vma->vm_end);
+	*length = len;
+	*st = start;
+	return i;
+}
+
+static void
+zap_lp_resources(struct vm_area_struct *mpnt)
+{
+	struct mm_struct *mm = mpnt->vm_mm;
+	unsigned long 	len, addr, end;
+	pte_t			*ptep;
+	struct page		*page;
+
+	addr = mpnt->vm_start;
+	end = mpnt->vm_end;
+	len = end - addr;
+	do {
+		ptep = lp_pte_offset(mm, addr);
+		page = pte_page(*ptep);
+		pte_clear(ptep);
+		free_large_page(page);
+		addr += LPAGE_SIZE;
+	} while (addr < end);
+	mm->rss -= (len >> PAGE_SHIFT);
+}
+
+static void
+unlink_vma(struct vm_area_struct *mpnt)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct	*vma;
+
+	vma = mm->mmap;
+	if (vma == mpnt) {
+		mm->mmap = vma->vm_next;
+	}
+	else {
+		while (vma->vm_next != mpnt) {
+			vma = vma->vm_next;
+		}
+		vma->vm_next = mpnt->vm_next;
+	}
+	rb_erase(&mpnt->vm_rb, &mm->mm_rb);
+	mm->mmap_cache = NULL;
+	mm->map_count--;
+}
+
+int
+unmap_large_pages(struct vm_area_struct *mpnt)
+{
+	struct mm_struct *mm = current->mm;
+
+	unlink_vma(mpnt);
+	spin_lock(&mm->page_table_lock);
+	zap_lp_resources(mpnt);
+	spin_unlock(&mm->page_table_lock);
+	kmem_cache_free(vm_area_cachep, mpnt);
+	return 1;
+}
+
+static struct inode *
+set_new_inode(unsigned long len, int prot, int flag, int key)
+{
+	struct inode	*inode;
+	int	i;
+
+	for (i=0; i<MAX_ID; i++) {
+		if (lpk[i].key == 0)
+			break;
+	}
+	if (i == MAX_ID)
+		return NULL;
+	inode = kmalloc(sizeof(struct inode), GFP_ATOMIC);
+	if (inode == NULL)
+		return NULL;
+	
+	memset(inode, 0, sizeof(struct inode));
+	INIT_LIST_HEAD(&inode->i_hash);
+	inode->i_mapping = &inode->i_data;
+	inode->i_mapping->host = inode;
+	INIT_LIST_HEAD(&inode->i_data.clean_pages);
+	INIT_LIST_HEAD(&inode->i_data.dirty_pages);
+	INIT_LIST_HEAD(&inode->i_data.locked_pages);
+	spin_lock_init(&inode->i_data.i_shared_lock);
+	inode->i_ino = (unsigned long)key;
+
+	lpk[i].key = key;
+	lpk[i].in = inode;
+	inode->i_uid = current->fsuid;
+	inode->i_gid = current->fsgid;
+	inode->i_mode = prot;
+	inode->i_size = len;
+	return inode;
+}
+
+static int
+check_size_prot(struct inode *inode, unsigned long len, int prot, int flag)
+{
+	if (inode->i_uid != current->fsuid)
+		return -1;
+	if (inode->i_gid != current->fsgid)
+		return -1;
+	if (inode->i_mode != prot)
+		return -1;
+	if (inode->i_size != len)
+		return -1;
+	return 0;
+}
+
+int
+set_lp_shm_seg(int key, unsigned long *raddr, unsigned long len, int prot, int flag)
+{
+	struct	mm_struct		*mm = current->mm;
+	struct	vm_area_struct	*vma;
+	struct	inode			*inode;
+	struct	address_space	*mapping;
+	struct	page			*page;
+	unsigned long 			addr = *raddr;
+	int		idx;
+	int 	retval = -ENOMEM;
+
+	if (len & (LPAGE_SIZE -1))
+		return -EINVAL;
+
+	inode = find_key_inode(key);
+	if (inode == NULL) {
+		if (!(flag & IPC_CREAT))
+			return -ENOENT;
+		inode = set_new_inode(len, prot, flag, key);
+		if (inode == NULL) 
+			return -ENOMEM;
+	}
+	else
+		if (check_size_prot(inode, len, prot, flag) < 0)
+			return -EINVAL;
+	mapping = inode->i_mapping;
+
+	addr = do_mmap_pgoff(NULL, addr, len, (unsigned long)prot, 
+			MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, 0);
+	if (IS_ERR((void *)addr)) 
+		return -ENOMEM; 
+
+	vma = find_vma(mm, addr);
+	if (!vma)
+		return -EINVAL;
+	
+	*raddr = addr;
+	spin_lock(&mm->page_table_lock);
+	do {
+		pte_t * pte = lp_pte_alloc(mm, addr);
+		if ((pte) && (pte_none(*pte))) {
+			idx = (addr - vma->vm_start) >> LPAGE_SHIFT;
+			page = find_get_page(mapping, idx);
+			if (page == NULL) {
+				page = alloc_large_page();	
+				if (page == NULL) 
+					goto out;	
+				add_to_page_cache(page, mapping, idx);
+			}
+			set_lp_pte(mm, vma, page, pte, (vma->vm_flags & VM_WRITE));
+		} else 
+			goto out;
+		addr += LPAGE_SIZE;
+	} while (addr < vma->vm_end); 
+	retval = 0;
+	vma->vm_flags |= (VM_LARGEPAGE | VM_RESERVED);
+	vma->vm_ops = &lp_vm_ops;
+	spin_unlock(&mm->page_table_lock);
+	return retval;
+out:
+	if (addr > vma->vm_start) {
+		len = vma->vm_end;
+		vma->vm_end = addr;
+		zap_lp_resources(vma);
+		vma->vm_end = len;
+	}
+	spin_unlock(&mm->page_table_lock);
+	return retval;
+}
+
+int
+change_large_page_mem_size(int count)
+{
+	int j;
+	struct page     *page, *map;
+	extern long        lpzone_pages;
+	extern struct list_head lpage_freelist;
+
+	if (count == 0)
+		return (int)lpzone_pages;
+	if (count > 0) {/*Increase the mem size. */
+		while (count--) {
+			page = alloc_pages(GFP_ATOMIC, LARGE_PAGE_ORDER);
+			if (page == NULL)
+				break;
+			map = page;
+			for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+				SetPageReserved(map);
+				map++;
+			}
+			spin_lock(&lpage_lock);
+			list_add(&page->list, &lpage_freelist);
+			lpagemem++;
+			lpzone_pages++;
+			spin_unlock(&lpage_lock);
+		}
+		return (int)lpzone_pages;
+	}
+	/*Shrink the memory size. */
+	while (count++) {
+		page = alloc_large_page();
+		if (page == NULL)
+			break;
+		spin_lock(&lpage_lock);
+		lpzone_pages--;
+		spin_unlock(&lpage_lock);
+		map = page;
+		for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+			ClearPageReserved(map);
+			map++;
+		}
+		__free_pages(page, LARGE_PAGE_ORDER);
+	}
+	return (int)lpzone_pages;
+}
+static struct vm_operations_struct	lp_vm_ops = {
+	close: zap_lp_resources,
+};
diff -Naru linux.org/fs/proc/array.c linux.lp/fs/proc/array.c
--- linux.org/fs/proc/array.c	Thu Oct 11 09:00:01 2001
+++ linux.lp/fs/proc/array.c	Wed Jul  3 16:59:09 2002
@@ -486,6 +486,17 @@
 			pgd_t *pgd = pgd_offset(mm, vma->vm_start);
 			int pages = 0, shared = 0, dirty = 0, total = 0;
 
+			if (is_vm_large_page(vma)) {
+				int num_pages = ((vma->vm_end - vma->vm_start)/PAGE_SIZE);
+				resident += num_pages;
+				if ((vma->vm_flags & VM_DONTCOPY))
+					share += num_pages;
+				if (vma->vm_flags & VM_WRITE)
+					dt += num_pages;
+				drs += num_pages;
+				vma = vma->vm_next;
+				continue;
+			}
 			statm_pgd_range(pgd, vma->vm_start, vma->vm_end, &pages, &shared, &dirty, &total);
 			resident += pages;
 			share += shared;
diff -Naru linux.org/fs/proc/proc_misc.c linux.lp/fs/proc/proc_misc.c
--- linux.org/fs/proc/proc_misc.c	Tue Nov 20 21:29:09 2001
+++ linux.lp/fs/proc/proc_misc.c	Wed Jul  3 10:48:21 2002
@@ -151,6 +151,14 @@
 		B(i.sharedram), B(i.bufferram),
 		B(pg_size), B(i.totalswap),
 		B(i.totalswap-i.freeswap), B(i.freeswap));
+#ifdef CONFIG_LARGE_PAGE
+	{
+		extern long lpagemem, lpzone_pages;
+		len += sprintf(page+len,"Total # of LargePages: %8ld\t\tAvailable: %8ld\n"
+		"LargePageSize: %8lu (%luKB)\n",
+		lpzone_pages, lpagemem, LPAGE_SIZE, (LPAGE_SIZE/1024));
+	}
+#endif
 	/*
 	 * Tagged format, for easy grepping and expansion.
 	 * The above will go away eventually, once the tools
diff -Naru linux.org/include/asm-i386/page.h linux.lp/include/asm-i386/page.h
--- linux.org/include/asm-i386/page.h	Mon Feb 25 11:38:12 2002
+++ linux.lp/include/asm-i386/page.h	Wed Jul  3 10:49:54 2002
@@ -41,14 +41,22 @@
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
 #define pte_val(x)	((x).pte_low | ((unsigned long long)(x).pte_high << 32))
+#define	LPAGE_SHIFT	21
 #else
 typedef struct { unsigned long pte_low; } pte_t;
 typedef struct { unsigned long pmd; } pmd_t;
 typedef struct { unsigned long pgd; } pgd_t;
 #define pte_val(x)	((x).pte_low)
+#define	LPAGE_SHIFT	22
 #endif
 #define PTE_MASK	PAGE_MASK
 
+#ifdef CONFIG_LARGE_PAGE
+#define LPAGE_SIZE 	((1UL) << LPAGE_SHIFT)
+#define	LPAGE_MASK	(~(LPAGE_SIZE - 1))
+#define LARGE_PAGE_ORDER	(LPAGE_SHIFT - PAGE_SHIFT)
+#endif
+
 typedef struct { unsigned long pgprot; } pgprot_t;
 
 #define pmd_val(x)	((x).pmd)
diff -Naru linux.org/include/linux/mm.h linux.lp/include/linux/mm.h
--- linux.org/include/linux/mm.h	Fri Dec 21 09:42:03 2001
+++ linux.lp/include/linux/mm.h	Wed Jul  3 10:49:54 2002
@@ -103,6 +103,7 @@
 #define VM_DONTEXPAND	0x00040000	/* Cannot expand with mremap() */
 #define VM_RESERVED	0x00080000	/* Don't unmap it from swap_out */
 
+#define VM_LARGEPAGE	0x00400000	/* Large_Page mapping. */
 #define VM_STACK_FLAGS	0x00000177
 
 #define VM_READHINTMASK			(VM_SEQ_READ | VM_RAND_READ)
@@ -425,6 +426,16 @@
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
 
+#ifdef CONFIG_LARGE_PAGE
+#define is_vm_large_page(vma) ((vma)->vm_flags & VM_LARGEPAGE)
+extern int	copy_lpage_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
+extern int	follow_large_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
+#else
+#define	is_vm_large_page(vma)	(0)
+#define	follow_large_page(mm, vma, pages, vmas, st, len, i) (0)
+#define	copy_lpage_range(dst, src, vma) (0)
+#endif
+
 /*
  * On a two-level page table, this ends up being trivial. Thus the
  * inlining and the symmetry break with pte_alloc() that does all
diff -Naru linux.org/include/linux/mmzone.h linux.lp/include/linux/mmzone.h
--- linux.org/include/linux/mmzone.h	Thu Nov 22 11:46:19 2001
+++ linux.lp/include/linux/mmzone.h	Wed Jul  3 10:49:54 2002
@@ -13,7 +13,7 @@
  */
 
 #ifndef CONFIG_FORCE_MAX_ZONEORDER
-#define MAX_ORDER 10
+#define MAX_ORDER 15
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
diff -Naru linux.org/include/linux/sysctl.h linux.lp/include/linux/sysctl.h
--- linux.org/include/linux/sysctl.h	Mon Nov 26 05:29:17 2001
+++ linux.lp/include/linux/sysctl.h	Wed Jul  3 10:49:54 2002
@@ -124,6 +124,7 @@
 	KERN_CORE_USES_PID=52,		/* int: use core or core.%pid */
 	KERN_TAINTED=53,	/* int: various kernel tainted flags */
 	KERN_CADPID=54,		/* int: PID of the process to notify on CAD */
+	KERN_LARGE_PAGE_MEM=55,	/* Number of large_page pages configured */
 };
 
 
diff -Naru linux.org/kernel/sysctl.c linux.lp/kernel/sysctl.c
--- linux.org/kernel/sysctl.c	Fri Dec 21 09:42:04 2001
+++ linux.lp/kernel/sysctl.c	Tue Jul  2 14:07:28 2002
@@ -96,6 +96,10 @@
 extern int acct_parm[];
 #endif
 
+#ifdef CONFIG_LARGE_PAGE
+extern int	lp_max;
+extern int	change_large_page_mem_size(int );
+#endif
 extern int pgt_cache_water[];
 
 static int parse_table(int *, int, void *, size_t *, void *, size_t,
@@ -256,6 +260,10 @@
 	{KERN_S390_USER_DEBUG_LOGGING,"userprocess_debug",
 	 &sysctl_userprocess_debug,sizeof(int),0644,NULL,&proc_dointvec},
 #endif
+#ifdef CONFIG_LARGE_PAGE
+	{KERN_LARGE_PAGE_MEM, "numlargepages", &lp_max, sizeof(int), 0644, NULL,
+	&proc_dointvec},
+#endif
 	{0}
 };
 
@@ -866,6 +874,10 @@
 				val = -val;
 			buffer += len;
 			left -= len;
+#ifdef CONFIG_LARGE_PAGE
+			if (i == &lp_max)
+				val = change_large_page_mem_size(val);
+#endif
 			switch(op) {
 			case OP_SET:	*i = val; break;
 			case OP_AND:	*i &= val; break;
diff -Naru linux.org/mm/memory.c linux.lp/mm/memory.c
--- linux.org/mm/memory.c	Mon Feb 25 11:38:13 2002
+++ linux.lp/mm/memory.c	Wed Jul  3 16:14:01 2002
@@ -179,6 +179,9 @@
 	unsigned long end = vma->vm_end;
 	unsigned long cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
+	if (is_vm_large_page(vma) )
+		return copy_lpage_range(dst, src, vma);
+
 	src_pgd = pgd_offset(src, address)-1;
 	dst_pgd = pgd_offset(dst, address)-1;
 
@@ -471,6 +474,10 @@
 		if ( !vma || (pages && vma->vm_flags & VM_IO) || !(flags & vma->vm_flags) )
 			return i ? : -EFAULT;
 
+		if (is_vm_large_page(vma)) {
+			i += follow_large_page(mm, vma, pages, vmas, &start, &len, i);
+			continue;
+		}
 		spin_lock(&mm->page_table_lock);
 		do {
 			struct page *map;
@@ -1360,6 +1367,8 @@
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
+	if (is_vm_large_page(vma) )
+		return -1;
 
 	current->state = TASK_RUNNING;
 	pgd = pgd_offset(mm, address);
diff -Naru linux.org/mm/mmap.c linux.lp/mm/mmap.c
--- linux.org/mm/mmap.c	Mon Feb 25 11:38:14 2002
+++ linux.lp/mm/mmap.c	Tue Jul  2 14:15:50 2002
@@ -917,6 +917,9 @@
 	if (mpnt->vm_start >= addr+len)
 		return 0;
 
+	if (is_vm_large_page(mpnt)) /* Large pages cannot be unmapped like this. */
+		return -EINVAL;
+
 	/* If we'll make "hole", check the vm areas limit */
 	if ((mpnt->vm_start < addr && mpnt->vm_end > addr+len)
 	    && mm->map_count >= MAX_MAP_COUNT)
diff -Naru linux.org/mm/mprotect.c linux.lp/mm/mprotect.c
--- linux.org/mm/mprotect.c	Mon Sep 17 15:30:23 2001
+++ linux.lp/mm/mprotect.c	Tue Jul  2 14:18:13 2002
@@ -287,6 +287,8 @@
 	error = -EFAULT;
 	if (!vma || vma->vm_start > start)
 		goto out;
+	if (is_vm_large_page(vma))
+		return -EINVAL; /* Can't change protections on large_page mappings. */
 
 	for (nstart = start ; ; ) {
 		unsigned int newflags;
diff -Naru linux.org/mm/mremap.c linux.lp/mm/mremap.c
--- linux.org/mm/mremap.c	Thu Sep 20 20:31:26 2001
+++ linux.lp/mm/mremap.c	Tue Jul  2 14:20:05 2002
@@ -267,6 +267,10 @@
 	vma = find_vma(current->mm, addr);
 	if (!vma || vma->vm_start > addr)
 		goto out;
+	if (is_vm_large_page(vma)) {
+		ret = -EINVAL; /* Can't remap large_page mappings. */
+		goto out;
+	}
 	/* We can't remap across vm area boundaries */
 	if (old_len > vma->vm_end - addr)
 		goto out;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
@ 2002-08-02  0:43 ` David S. Miller
  2002-08-02  1:26   ` Andrew Morton
  2002-08-02  1:55   ` Rik van Riel
  2002-08-02  1:09 ` Martin J. Bligh
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 40+ messages in thread
From: David S. Miller @ 2002-08-02  0:43 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

   From: Andrew Morton <akpm@zip.com.au>
   Date: Thu, 01 Aug 2002 17:37:46 -0700

   Some observations which have been made thus far:
   
   - Minimal impact on the VM and MM layers
   
Well the downside of this is that it means it isn't transparent
to userspace.  For example, specfp2000 results aren't going to
improve after installing these changes.  Some of the other large
page implementations would.

   - The change to MAX_ORDER is unneeded
   
This is probably done to increase the likelihood that 4MB page orders
are available.  If we collapse 4MB pages deeper, they are less likely
to be broken up because smaller orders would be selected first.

Maybe it doesn't make a difference....
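[Editor's note: the buddy-order arithmetic behind this point can be
sketched as follows.  `order_for_size` is an illustrative helper, not a
kernel function: a 4MB large page is an order-10 block of 4K base pages,
while the PAE 2MB large page is order 9:]

```c
#include <assert.h>

/* Smallest buddy order whose block covers `size` bytes, given the
 * base page size.  Illustrative helper, not a kernel function. */
int order_for_size(unsigned long size, unsigned long page_size)
{
	int order = 0;
	unsigned long block = page_size;

	while (block < size) {
		block <<= 1;
		order++;
	}
	return order;
}
```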

   - swapping of large pages and making them pagecache-coherent is
     unpopular.
   
Swapping them is easy: any time you hit a large PTE you unlarge it.
This is what some of the other large page implementations do.  Basically
the implementation is that set_pte() breaks apart large ptes when
necessary.
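[Editor's note: that "unlarge" idea can be sketched as follows.  The
name `split_large_pte` is hypothetical, and a real implementation would
edit the page tables in place under the appropriate locks; the point is
that because a large page is physically contiguous, the replacement
base-page entries are just consecutive frame numbers:]

```c
#include <assert.h>

/* Replace one large-page translation by its run of base-page frame
 * numbers, so that individual 4K pages can then be unmapped or swapped.
 * pfns[] must have room for (1 << order) entries.  Illustrative only. */
void split_large_pte(unsigned long large_pfn, int order, unsigned long *pfns)
{
	unsigned long i, n = 1UL << order;

	for (i = 0; i < n; i++)
		pfns[i] = large_pfn + i;	/* contiguous base pages */
}
```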

I agree on the pagecache side.

Actually to be honest the other implementations seemed less
intrusive and easier to add support for.  The downside is that
handling of weird cases like x86 using pmd's for 4MB pages
was not complete last time I checked.


* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
  2002-08-02  0:43 ` David S. Miller
@ 2002-08-02  1:09 ` Martin J. Bligh
  2002-08-02  4:07   ` Linus Torvalds
  2002-08-02  1:36 ` Andrew Morton
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 40+ messages in thread
From: Martin J. Bligh @ 2002-08-02  1:09 UTC (permalink / raw)
  To: Andrew Morton, lkml, linux-mm@kvack.org
  Cc: Seth, Rohit, Saxena, Sunil, Mallick, Asit K

> - The change to MAX_ORDER is unneeded

It's not only unneeded, it's detrimental. Not only will we spend more
time merging stuff up and down to no effect, it also makes the 
config_nonlinear stuff harder (or we have to #ifdef it, which just causes
more unnecessary differentiation). Please don't do that little bit ....

M.




* Re: large page patch
  2002-08-02  1:26   ` Andrew Morton
@ 2002-08-02  1:19     ` David S. Miller
  2002-08-02  3:23       ` Linus Torvalds
  0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2002-08-02  1:19 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

   From: Andrew Morton <akpm@zip.com.au>
   Date: Thu, 01 Aug 2002 18:26:40 -0700

   "David S. Miller" wrote:
   > This is probably done to increase the likelihood that 4MB page orders
   > are available.  If we collapse 4MB pages deeper, they are less likely
   > to be broken up because smaller orders would be selected first.
   
   This is leakage from ia64, which supports up to 256k pages.

Ummm, 4MB > 256K and even with a 4K PAGE_SIZE MAX_ORDER coalesces
up to 4MB already :-)
   
   Apparently a page-table based representation could not be used by PPC.
   
The page-table is just an abstraction; there is no reason dummy
"large" ptes could not be used which are just ignored by the HW TLB
reload code.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  0:43 ` David S. Miller
@ 2002-08-02  1:26   ` Andrew Morton
  2002-08-02  1:19     ` David S. Miller
  2002-08-02  1:55   ` Rik van Riel
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2002-08-02  1:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: linux-kernel, linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

"David S. Miller" wrote:
> 
>    From: Andrew Morton <akpm@zip.com.au>
>    Date: Thu, 01 Aug 2002 17:37:46 -0700
> 
>    Some observations which have been made thus far:
> 
>    - Minimal impact on the VM and MM layers
> 
> Well the downside of this is that it means it isn't transparent
> to userspace.  For example, specfp2000 results aren't going to
> improve after installing these changes.  Some of the other large
> page implementations would.
> 
>    - The change to MAX_ORDER is unneeded
> 
> This is probably done to increase the likelihood that 4MB page orders
> are available.  If we collapse 4MB pages deeper, they are less likely
> to be broken up because smaller orders would be selected first.

This is leakage from ia64, which supports up to 256k pages.

> Maybe it doesn't make a difference....
> 
>    - swapping of large pages and making them pagecache-coherent is
>      unpopular.
> 
> Swapping them is easy, any time you hit a large PTE you unlarge it.
> This is what some of the other large page implementations do.  Basically
> the implementation is that set_pte() breaks apart large ptes when
> necessary.

As far as mm/*.c is concerned, there is no pte.  It's just a vma
which is marked "don't touch".  These pages aren't on the LRU; nothing
knows about them.

Apparently a page-table based representation could not be used by PPC.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
  2002-08-02  0:43 ` David S. Miller
  2002-08-02  1:09 ` Martin J. Bligh
@ 2002-08-02  1:36 ` Andrew Morton
  2002-08-02  4:31   ` Daniel Phillips
  2002-08-02  3:47 ` William Lee Irwin III
  2002-08-02 23:40 ` Chris Wedgwood
  4 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2002-08-02  1:36 UTC (permalink / raw)
  To: lkml, linux-mm@kvack.org, Seth, Rohit, Saxena, Sunil,
	Mallick, Asit K

Merged up to 2.5.30.  It compiles.

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.30/lpp.patch

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  1:55   ` Rik van Riel
@ 2002-08-02  1:50     ` David S. Miller
  2002-08-02  2:29     ` Gerrit Huizenga
  1 sibling, 0 replies; 40+ messages in thread
From: David S. Miller @ 2002-08-02  1:50 UTC (permalink / raw)
  To: riel
  Cc: akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick, gh

   From: Rik van Riel <riel@conectiva.com.br>
   Date: Thu, 1 Aug 2002 22:55:05 -0300 (BRT)
   
   IMHO we shouldn't blindly decide for (or against!) this patch
   but also carefully look at the large page patch from RHAS (which
   got added to -aa recently) and the large page patch which IBM
   is working on.

And the one from Naohiko Shimizu which is my personal favorite
because sparc64 support is there :)

http://shimizu-lab.dt.u-tokai.ac.jp/lsp.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  0:43 ` David S. Miller
  2002-08-02  1:26   ` Andrew Morton
@ 2002-08-02  1:55   ` Rik van Riel
  2002-08-02  1:50     ` David S. Miller
  2002-08-02  2:29     ` Gerrit Huizenga
  1 sibling, 2 replies; 40+ messages in thread
From: Rik van Riel @ 2002-08-02  1:55 UTC (permalink / raw)
  To: David S. Miller
  Cc: Andrew Morton, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick, gh

On Thu, 1 Aug 2002, David S. Miller wrote:
>    From: Andrew Morton <akpm@zip.com.au>

>    - Minimal impact on the VM and MM layers
>
> Well the downside of this is that it means it isn't transparent
> to userspace.  For example, specfp2000 results aren't going to
> improve after installing these changes.  Some of the other large
> page implementations would.

It also means we can't automatically switch to large pages for
SHM segments, which is the number one area where we need large
pages...

We should also take into account that the main application that
needs large pages for its SHM segments is Oracle, which we don't
have the source code for so we can't recompile it to use the new
syscalls introduced by this patch ...

IMHO we shouldn't blindly decide for (or against!) this patch
but also carefully look at the large page patch from RHAS (which
got added to -aa recently) and the large page patch which IBM
is working on.

kind regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  2:29     ` Gerrit Huizenga
@ 2002-08-02  2:23       ` David S. Miller
  2002-08-02  2:53         ` Gerrit Huizenga
  2002-08-02  5:24       ` David Mosberger
  1 sibling, 1 reply; 40+ messages in thread
From: David S. Miller @ 2002-08-02  2:23 UTC (permalink / raw)
  To: gh
  Cc: riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

   From: Gerrit Huizenga <gh@us.ibm.com>
   Date: Thu, 01 Aug 2002 19:29:52 -0700
   
   other memory piggish apps (e.g. think scientific) would benefit

There are some "benchmark'ish" examples on Naohiko Shimizu's
large page project page.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  1:55   ` Rik van Riel
  2002-08-02  1:50     ` David S. Miller
@ 2002-08-02  2:29     ` Gerrit Huizenga
  2002-08-02  2:23       ` David S. Miller
  2002-08-02  5:24       ` David Mosberger
  1 sibling, 2 replies; 40+ messages in thread
From: Gerrit Huizenga @ 2002-08-02  2:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David S. Miller, Andrew Morton, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

In message <Pine.LNX.4.44L.0208012246390.23404-100000@imladris.surriel.com>,
Rik van Riel writes:
> On Thu, 1 Aug 2002, David S. Miller wrote:
> >    From: Andrew Morton <akpm@zip.com.au>
> 
> >    - Minimal impact on the VM and MM layers
> >
> > Well the downside of this is that it means it isn't transparent
> > to userspace.  For example, specfp2000 results aren't going to
> > improve after installing these changes.  Some of the other large
> > page implementations would.
> 
> We should also take into account that the main application that
> needs large pages for its SHM segments is Oracle, which we don't
> have the source code for so we can't recompile it to use the new
> syscalls introduced by this patch ...

There are quite a few other applications that can benefit from large
page support.  IBM Watson Research published JVM and some scientific
workload results using large pages which showed substantial benefits.
Also, we believe DB2, Domino, other memory piggish apps (e.g. think
scientific) would benefit equally on many architectures.

It would sure be nice if the interface wasn't some kludgey back door
but more integrated with things like mmap() or shm*(), with semantics
and behaviors that were roughly more predictable.  Other than that,
no comments as yet on the patch internals...

gerrit 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  2:23       ` David S. Miller
@ 2002-08-02  2:53         ` Gerrit Huizenga
  0 siblings, 0 replies; 40+ messages in thread
From: Gerrit Huizenga @ 2002-08-02  2:53 UTC (permalink / raw)
  To: David S. Miller
  Cc: riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

In message <20020801.192304.85746519.davem@redhat.com>,
"David S. Miller" writes:
>    From: Gerrit Huizenga <gh@us.ibm.com>
>    Date: Thu, 01 Aug 2002 19:29:52 -0700
>    
>    other memory piggish apps (e.g. think scientific) would benefit
> 
> There are some "benchmark'ish" examples on Naohiko Shimizu's
> large page project page.

No 2.5 code, though.  :(  But there *is* 2.2.16 code!

gerrit

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  1:19     ` David S. Miller
@ 2002-08-02  3:23       ` Linus Torvalds
  0 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-08-02  3:23 UTC (permalink / raw)
  To: linux-kernel

In article <20020801.181944.09310618.davem@redhat.com>,
David S. Miller <davem@redhat.com> wrote:
>   From: Andrew Morton <akpm@zip.com.au>
>   Date: Thu, 01 Aug 2002 18:26:40 -0700
>
>   "David S. Miller" wrote:
>   > This is probably done to increase the likelihood that 4MB page orders
>   > are available.  If we collapse 4MB pages deeper, they are less likely
>   > to be broken up because smaller orders would be selected first.
>   
>   This is leakage from ia64, which supports up to 256k pages.
>
>Ummm, 4MB > 256K and even with a 4K PAGE_SIZE MAX_ORDER coalesces
>up to 4MB already :-)

That should be 256_M_ pages (13 bits of page size + 15 bits of MAX_ORDER
gives you 256MB max).

		Linus

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
                   ` (2 preceding siblings ...)
  2002-08-02  1:36 ` Andrew Morton
@ 2002-08-02  3:47 ` William Lee Irwin III
  2002-08-02 23:40 ` Chris Wedgwood
  4 siblings, 0 replies; 40+ messages in thread
From: William Lee Irwin III @ 2002-08-02  3:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: lkml, linux-mm@kvack.org, Seth, Rohit, Saxena, Sunil,
	Mallick, Asit K

On Thu, Aug 01, 2002 at 05:37:46PM -0700, Andrew Morton wrote:
> This is a large-page support patch from Rohit Seth, forwarded
> with his permission (thanks!).

Overall, the code looks very clean.

(1) So there are now 4 of these things. How do they compare to each
	other? Where are the comparison benchmarks? How do their
	features compare? Which one(s) do users want?

(2) The allocation policies for pagetables mapping the things may as
	well do some kind of lookup, sharing, and caching; it's likely
	a significant number of the users of the shm segment will be
	mapping them more or less the same way given database usage
	patterns. It's not a significant amount of space, but kernels
	should be frugal about space, and with many tasks as is typical
	of databases, the savings may well add up to a small but
	respectable chunk of ZONE_NORMAL.

(3) As long as the interface is explicit, it might as well drop flags
	into shm and mmap. There isn't even C library support for these
	things as they are... time to int $0x80 again so I can test.

(4) Requiring app awareness of page alignment looks like an irritating
	porting issue, which doesn't sound as trivial as it would
	otherwise be in already extremely cramped 32-bit virtual
	address spaces.

(5) What's in it for the average user? It's doubtful GNOME will be
	registering memory blocks with these syscalls anytime soon.
	Granted, the opportunities for reducing TLB load this way
	are small on desktop systems, but it doesn't feel quite
	right to just throw mappings of magic physical memory into the
	hands of a few enlightened apps on machines with memory to burn
	and leave all others in the cold.
	By several accounts "scalability" is defined as "performing as
	well on large machines as it does on small ones" ... but this
	seems to be a method of circumventing the kernel's own memory
	management as opposed to a method of improving it in all cases.

(6) As far as reconfiguring, I'm slightly concerned about the robustness
	of change_large_page_mem_size() in terms of how likely it is to
	succeed. Some on-demand defragmentation looks like it should be
	implemented to make it more reliable (now possible thanks to
	rmap). In general, the sysctl seems to lack some adaptivity.
	Granting root privileges to the workload vs. perpetual
	monitoring to find the ideal pool size sounds like a headache.

(7) I'm a little worried by the following:

zone(0): 4096 pages.
zone(1): 225280 pages.
BUG: wrong zone alignment, it will crash
zone(2): 3964928 pages.


My machine doesn't seem to care, but others' might.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  1:09 ` Martin J. Bligh
@ 2002-08-02  4:07   ` Linus Torvalds
  2002-08-02  4:13     ` David S. Miller
  2002-08-02  4:38     ` Martin J. Bligh
  0 siblings, 2 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-08-02  4:07 UTC (permalink / raw)
  To: linux-kernel

In article <737220000.1028250590@flay>,
Martin J. Bligh <Martin.Bligh@us.ibm.com> wrote:
>> - The change to MAX_ORDER is unneeded
>
>It's not only unneeded, it's detrimental. Not only will we spend more
>time merging stuff up and down to no effect

I doubt that.  At least the naive math says that it should get
exponentially less likely(*) to merge up/down for each level, so by the
time you've reached order-10, any merging is already in the noise and
totally unmeasurable. 

And the memory footprint of the bitmaps etc should be basically zero
(since they too shrink exponentially for each order).

((*) The "exponentially less likely" simply comes from doing the trivial
experiment of what would happen if you allocated all pages in-order one
at a time, and then freed them one at a time.  Obviously not a
realistic test, but on the other hand a realistic kernel load tends to
keep a fairly fixed fraction of memory free, which makes it sound
extremely unlikely to me that you'd get sudden collapses/buildups either.
The likelihood of being at just the right border for that to happen
_also_ happens to be decreasing as 2**-n)

Of course, if you can actually measure it, that would be interesting. 
Naive math gives you a guess for the order of magnitude effect, but
nothing beats real numbers ;)

> It also makes the config_nonlinear stuff harder (or we have to
> #ifdef it, which just causes more unnecessary differentiation). 

Hmm..  This sounds like a good point, but I thought we already did all
the math relative to the start of the zone, so that the alignment thing
implied by MAX_ORDER shouldn't be an issue. 

Or were you thinking of some other effect?

		Linus

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  4:07   ` Linus Torvalds
@ 2002-08-02  4:13     ` David S. Miller
  2002-08-02  4:30       ` William Lee Irwin III
  2002-08-02  4:32       ` Linus Torvalds
  2002-08-02  4:38     ` Martin J. Bligh
  1 sibling, 2 replies; 40+ messages in thread
From: David S. Miller @ 2002-08-02  4:13 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

   From: torvalds@transmeta.com (Linus Torvalds)
   Date: Fri, 2 Aug 2002 04:07:10 +0000 (UTC)
   
   Of course, if you can actually measure it, that would be
   interesting.  Naive math gives you a guess for the order of
   magnitude effect, but nothing beats real numbers ;)

The SYSV folks actually did have a buddy allocator a long time ago and
they did implement lazy coalescing because it supposedly improved
performance.

See chapter 12 section 7 in "Unix Internals" by Uresh Vahalia.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  4:13     ` David S. Miller
@ 2002-08-02  4:30       ` William Lee Irwin III
  2002-08-02  4:32       ` Linus Torvalds
  1 sibling, 0 replies; 40+ messages in thread
From: William Lee Irwin III @ 2002-08-02  4:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-kernel

From: torvalds@transmeta.com (Linus Torvalds)
Date: Fri, 2 Aug 2002 04:07:10 +0000 (UTC)
>    Of course, if you can actually measure it, that would be
>    interesting.  Naive math gives you a guess for the order of
>    magnitude effect, but nothing beats real numbers ;)

On Thu, Aug 01, 2002 at 09:13:57PM -0700, David S. Miller wrote:
> The SYSV folks actually did have a buddy allocator a long time ago and
> they did implement lazy coalescing because it supposedly improved
> performance.
> See chapter 12 section 7 in "Unix Internals" by Uresh Vahalia.

And I've implemented it for Linux.

ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/lazy_buddy/


Cheers,
Bill

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  1:36 ` Andrew Morton
@ 2002-08-02  4:31   ` Daniel Phillips
  2002-08-02  4:47     ` Andrew Morton
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2002-08-02  4:31 UTC (permalink / raw)
  To: Andrew Morton, lkml, linux-mm@kvack.org, Seth, Rohit,
	Saxena, Sunil, Mallick, Asit K

On Friday 02 August 2002 03:36, Andrew Morton wrote:
> Merged up to 2.5.30.  It compiles.
> 
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.30/lpp.patch

What was the original against?

-- 
Daniel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  4:13     ` David S. Miller
  2002-08-02  4:30       ` William Lee Irwin III
@ 2002-08-02  4:32       ` Linus Torvalds
  2002-08-02  5:11         ` William Lee Irwin III
  2002-08-02  7:30         ` Andrew Morton
  1 sibling, 2 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-08-02  4:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel


On Thu, 1 Aug 2002, David S. Miller wrote:
>    
>    Of course, if you can actually measure it, that would be
>    interesting.  Naive math gives you a guess for the order of
>    magnitude effect, but nothing beats real numbers ;)
> 
> The SYSV folks actually did have a buddy allocator a long time ago and
> they did implement lazy coalescing because it supposedly improved
> performance.

I bet that is mainly because of CPU scalability, and being able to avoid
touching the buddy lists from multiple CPU's - the same reason _we_ have
the per-CPU front-ends on various allocators.

I doubt it is because buddy matters past the 4MB mark. I just can't see 
how you can avoid the naive math which says that it should be 1/512th as 
common to coalesce to 4MB as it is to coalesce to 8kB. 

Walking the buddy bitmaps for a few levels (ie up to order 3 or 4) is
probably quite common, and it's likely to be bad from a SMP cache
standpoint (touching a few bits with what must be fairly random patterns). 
So avoiding the buddy with a simple front-end is likely to win you 
something, without actually being meaningful at the MAX_ORDER point.

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  4:07   ` Linus Torvalds
  2002-08-02  4:13     ` David S. Miller
@ 2002-08-02  4:38     ` Martin J. Bligh
  1 sibling, 0 replies; 40+ messages in thread
From: Martin J. Bligh @ 2002-08-02  4:38 UTC (permalink / raw)
  To: linux-kernel

Direct email seemed to get separated from the cc somewhere
along the line ... repeated for others on l-k (sorry Linus ;-))

> I doubt that.  At least the naive math says that it should get
> exponentially less likely(*) to merge up/down for each level, so by the
> time you've reached order-10, any merging is already in the noise and
> totally unmeasurable. 

Yeah, it's probably unmeasurable, just ugly ;-)
I guess it's more that it seems unnecessary ... if ia64 are the
only people that need it to be that ludicrously large, it'd be
better if they just did it in their arch tree. Just because they
could theoretically have 256Mb pages, do they really *need* them? ;-)
 
>> It also makes the config_nonlinear stuff harder (or we have to
>> #ifdef it, which just causes more unnecessary differentiation). 
> 
> Hmm..  This sounds like a good point, but I thought we already did all
> the math relative to the start of the zone, so that the alignment thing
> implied by MAX_ORDER shouldn't be an issue. 
> 
> Or were you thinking of some other effect?

The config_nonlinear stuff relies on a trick ... we shove physically
non-contig areas into the buddy allocator, but the buddy allocator
is guaranteed to return phys contig areas. That all works just fine
as long as the blocks we put in are of size greater than or equal to
2^MAX_ORDER * PAGE_SIZE, which is currently 4Mb. A 4Mb alignment is
not a problem for any known machine, but I think 256Mb may well be.
It's kind of a dirty trick, but it's a really neat, efficient 
solution that gets rid of lots of zone balancing and pgdat proliferation.
It also lets me spread around ZONE_NORMAL across nodes for ia32 NUMA.

M.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  4:31   ` Daniel Phillips
@ 2002-08-02  4:47     ` Andrew Morton
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Morton @ 2002-08-02  4:47 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: lkml, linux-mm@kvack.org, Seth, Rohit, Saxena, Sunil,
	Mallick, Asit K

Daniel Phillips wrote:
> 
> On Friday 02 August 2002 03:36, Andrew Morton wrote:
> > Merged up to 2.5.30.  It compiles.
> >
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.30/lpp.patch
> 
> What was the original against?

2.4.18

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  4:32       ` Linus Torvalds
@ 2002-08-02  5:11         ` William Lee Irwin III
  2002-08-02  7:30         ` Andrew Morton
  1 sibling, 0 replies; 40+ messages in thread
From: William Lee Irwin III @ 2002-08-02  5:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, linux-kernel

On Thu, Aug 01, 2002 at 09:32:44PM -0700, Linus Torvalds wrote:
> I bet that is mainly because of CPU scalability, and being able to avoid
> touching the buddy lists from multiple CPU's - the same reason _we_ have
> the per-CPU front-ends on various allocators.
> I doubt it is because buddy matters past the 4MB mark. I just can't see 
> how you can avoid the naive math which says that it should be 1/512th as 
> common to coalesce to 4MB as it is to coalesce to 8kB. 
> Walking the buddy bitmaps for a few levels (ie up to order 3 or 4) is
> probably quite common, and it's likely to be bad from a SMP cache
> standpoint (touching a few bits with what must be fairly random patterns). 
> So avoiding the buddy with a simple front-end is likely to win you 
> something, without actually being meaningful at the MAX_ORDER point.

This is actually part of my strategy.

By properly organizing the deferred queues into lists of lists and
maintaining a small per-cpu cache of pages, a "cache fill" involves
doing a single list deletion under the zone->lock and the remainder
of the work to fill a pagevec occurs outside the lock, reducing the
mean hold time down to ridiculous lows. And since the allocations
are batched, the arrival rate is then divided by the batch size.
Conversely, frees are also batched and the same effect achieved with
the dual operations.

i.e. magazines for the page-level allocator

This can't be achieved with a pure buddy system, as it must examine
individual pages one-by-one to keep the bitmaps updated. Vahalia
discusses the general approach in another section, and integration with
buddy systems (and other allocators) in an exercise.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  5:24       ` David Mosberger
@ 2002-08-02  5:20         ` David S. Miller
  2002-08-02  6:26           ` David Mosberger
  0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2002-08-02  5:20 UTC (permalink / raw)
  To: davidm, davidm
  Cc: gh, riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 1 Aug 2002 22:24:05 -0700
   
   In my opinion the proposed large-page patch addresses a relatively
   pressing need for databases (primarily).

Databases want large pages with IPC_SHM, how can this special
syscall hack address that?

It's great for experimentation, but give up syscall slots for
this?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  2:29     ` Gerrit Huizenga
  2002-08-02  2:23       ` David S. Miller
@ 2002-08-02  5:24       ` David Mosberger
  2002-08-02  5:20         ` David S. Miller
  1 sibling, 1 reply; 40+ messages in thread
From: David Mosberger @ 2002-08-02  5:24 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Rik van Riel, David S. Miller, Andrew Morton, linux-kernel,
	linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

>>>>> On Thu, 01 Aug 2002 19:29:52 -0700, Gerrit Huizenga <gh@us.ibm.com> said:

  Gerrit> It would sure be nice if the interface wasn't some kludgey
  Gerrit> back door but more integrated with things like mmap() or
  Gerrit> shm*(), with semantics and behaviors that were roughly more
  Gerrit> predictable.  Other than that, no comments as yet on the
  Gerrit> patch internals...

In my opinion the proposed large-page patch addresses a relatively
pressing need for databases (primarily).  Longer term, I'd hope that
it can be replaced by a transparent superpage scheme.  But the
existing patch can also serve as a nice benchmark for transparent
schemes (and frankly, since it doesn't have to do anything smart
behind the scenes, it's likely that the existing patch, where
applicable, will always do slightly better than a transparent one).

In any case, the big issue of physical memory fragmentation can be
experimented with independent of what the user-level interface looks like.
So the existing patch is useful in that sense as well.

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  5:20         ` David S. Miller
@ 2002-08-02  6:26           ` David Mosberger
  2002-08-02  6:33             ` Martin J. Bligh
  2002-08-02  8:20             ` David S. Miller
  0 siblings, 2 replies; 40+ messages in thread
From: David Mosberger @ 2002-08-02  6:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: davidm, davidm, gh, riel, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

>>>>> On Thu, 01 Aug 2002 22:20:53 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM>    From: David Mosberger <davidm@napali.hpl.hp.com> Date:
  DaveM> Thu, 1 Aug 2002 22:24:05 -0700

  DaveM>    In my opinion the proposed large-page patch addresses a
  DaveM> relatively pressing need for databases (primarily).

  DaveM> Databases want large pages with IPC_SHM, how can this
  DaveM> special syscall hack address that?

I believe the interface is OK in that regard.  AFAIK, Oracle is happy
with it.

  DaveM> It's great for experimentation, but give up syscall slots
  DaveM> for this?

I'm a bit concerned about this, too.  My preference would have been to
use the regular mmap() and shmat() syscalls with some
augmentation/hint as to what the preferred page size is (Simon
Winwood's OLS 2002 paper talks about some options here).  I like this
because hints could be useful even with a transparent superpage
scheme.

The original Intel patch did use more of a hint-like approach (the
hint was a simple binary flag though: give me regular pages or give me
large pages), but Linus preferred a separate syscall interface, so the
Intel folks switched over to doing that.

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  6:26           ` David Mosberger
@ 2002-08-02  6:33             ` Martin J. Bligh
  2002-08-02  6:44               ` David Mosberger
  2002-08-02  7:08               ` Andrew Morton
  2002-08-02  8:20             ` David S. Miller
  1 sibling, 2 replies; 40+ messages in thread
From: Martin J. Bligh @ 2002-08-02  6:33 UTC (permalink / raw)
  To: davidm, David S. Miller
  Cc: gh, riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

>   DaveM>    In my opinion the proposed large-page patch addresses a
>   DaveM> relatively pressing need for databases (primarily).
> 
>   DaveM> Databases want large pages with IPC_SHM, how can this
>   DaveM> special syscall hack address that?
> 
> I believe the interface is OK in that regard.  AFAIK, Oracle is happy
> with it.

Is Oracle now the world's only database? I think not.
 
>   DaveM> It's great for experimentation, but give up syscall slots
>   DaveM> for this?
> 
> I'm a bit concerned about this, too.  My preference would have been to
> use the regular mmap() and shmat() syscalls with some
> augmentation/hint as to what the preferred page size is 

I think that's what most users would prefer, and I don't think it
adds a vast amount of kernel complexity. Linus doesn't seem to
be dead set against the shmem modifications at least ... so that's
half way there ;-)

M.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  6:33             ` Martin J. Bligh
@ 2002-08-02  6:44               ` David Mosberger
  2002-08-02 10:00                 ` Marcin Dalecki
  2002-08-02  7:08               ` Andrew Morton
  1 sibling, 1 reply; 40+ messages in thread
From: David Mosberger @ 2002-08-02  6:44 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: davidm, David S. Miller, gh, riel, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

>>>>> On Thu, 01 Aug 2002 23:33:26 -0700, "Martin J. Bligh" <Martin.Bligh@us.ibm.com> said:

  DaveM> In my opinion the proposed large-page patch addresses a
  DaveM> relatively pressing need for databases (primarily).
  >>
  DaveM> Databases want large pages with IPC_SHM, how can this special
  DaveM> syscall hack address that?

  >>  I believe the interface is OK in that regard.  AFAIK, Oracle is
  >> happy with it.

  Martin> Is Oracle now the world's only database? I think not.

I didn't say such a thing.  I just don't know what other db vendors/authors
think of the proposed interface.  I'm sure their feedback would be welcome.

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  6:33             ` Martin J. Bligh
  2002-08-02  6:44               ` David Mosberger
@ 2002-08-02  7:08               ` Andrew Morton
  2002-08-02  7:15                 ` William Lee Irwin III
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2002-08-02  7:08 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: davidm, David S. Miller, gh, riel, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

"Martin J. Bligh" wrote:
> 
> >   DaveM>    In my opinion the proposed large-page patch addresses a
> >   DaveM> relatively pressing need for databases (primarily).
> >
> >   DaveM> Databases want large pages with IPC_SHM, how can this
> >   DaveM> special syscall hack address that?
> >
> > I believe the interface is OK in that regard.  AFAIK, Oracle is happy
> > with it.
> 
> Is Oracle now the world's only database? I think not.

Is a draft of Simon's patch available against 2.5?

-

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  7:08               ` Andrew Morton
@ 2002-08-02  7:15                 ` William Lee Irwin III
  0 siblings, 0 replies; 40+ messages in thread
From: William Lee Irwin III @ 2002-08-02  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin J. Bligh, davidm, David S. Miller, gh, riel, linux-kernel,
	linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

At some point in the past, Dave Miller wrote:
DaveM>    In my opinion the proposed large-page patch addresses a
DaveM> relatively pressing need for databases (primarily).
DaveM> Databases want large pages with IPC_SHM, how can this
DaveM> special syscall hack address that?

At some point in the past, David Mosberger wrote:
I believe the interface is OK in that regard.  AFAIK, Oracle is happy
with it.

"Martin J. Bligh" wrote:
>> Is Oracle now the world's only database? I think not.

On Fri, Aug 02, 2002 at 12:08:50AM -0700, Andrew Morton wrote:
> Is a draft of Simon's patch available against 2.5?

Unless I can turn blood into wine, walk on water, and produce a working
2.5 version of the thing in < 6 hours (not that I'm not trying), this
will probably have to wait until Hubertus rematerializes tomorrow
morning (EDT) and further porting is done. I'll be up early.

Cheers,
Bill

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  4:32       ` Linus Torvalds
  2002-08-02  5:11         ` William Lee Irwin III
@ 2002-08-02  7:30         ` Andrew Morton
  1 sibling, 0 replies; 40+ messages in thread
From: Andrew Morton @ 2002-08-02  7:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, linux-kernel

Linus Torvalds wrote:
> 
> On Thu, 1 Aug 2002, David S. Miller wrote:
> >
> >    Of course, if you can actually measure it, that would be
> >    interesting.  Naive math gives you a guess for the order of
> >    magnitude effect, but nothing beats real numbers ;)
> >
> > The SYSV folks actually did have a buddy allocator a long time ago and
> they did implement lazy coalescing because it supposedly improved
> > performance.
> 
> I bet that is mainly because of CPU scalability, and being able to avoid
> touching the buddy lists from multiple CPU's - the same reason _we_ have
> the per-CPU front-ends on various allocators.
> 
> I doubt it is because buddy matters past the 4MB mark. I just can't see
> how you can avoid the naive math which says that it should be 1/512th as
> common to coalesce to 4MB as it is to coalesce to 8kB.

Buddy costs tend to be down in the noise compared with the cost
of the zone->lock.

I did a per-cpu pages patch a while back which, when it takes that
lock, grabs 16 pages or frees 16 pages.  Anton tested it on the
12-way:  http://samba.org/~anton/linux/2.5.9/  blue -> purple

The cost of rmqueue() and __free_pages_ok went from 13% of system
time down to 2%.  So that 2% speedup is all that's available by fiddling
with the buddy algorithm (I think).  And I bet most of that is still taking
the lock.

Didn't submit the patch because I think a per-cpu page buffer is a bit of
a dopey cop-out.  I have patches here which make most of the page-intensive
fastpaths in the kernel stop using single pages and start using 16-page batches.

That will make a 16-page allocation request just a natural thing
to do.  But we will need a per-cpu buffer to wring the last drops
out of anonymous pagefaults and generic_file_write(), which do not
lend themselves to gang allocation.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  6:26           ` David Mosberger
  2002-08-02  6:33             ` Martin J. Bligh
@ 2002-08-02  8:20             ` David S. Miller
  2002-08-02  9:05               ` Ryan Cumming
                                 ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: David S. Miller @ 2002-08-02  8:20 UTC (permalink / raw)
  To: davidm, davidm
  Cc: gh, riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 1 Aug 2002 23:26:07 -0700
   
   I'm a bit concerned about this, too.  My preference would have been to
   use the regular mmap() and shmat() syscalls with some
   augmentation/hint as to what the preferred page size is (Simon
   Winwood's OLS 2002 paper talks about some options here).  I like this
   because hints could be useful even with a transparent superpage
   scheme.

A "hint" to use superpages?  That's absurd.

Any time you are able to translate N pages instead of 1 page with 1
TLB entry it's always preferable.

I also don't buy the swapping complexity bit.  The fact is, SHM and
anonymous pages are easy.  Just stay away from the page cache and it
is pretty simple to just make the normal VM do this.

If set_pte sees a large page, you simply undo the large ptes in that
group and the complexity ends right there.  This means the only maker
of large pages is the bit that creates the large mappings and has all
of the ideal conditions up front.  Any time anything happens to that
pte you undo the large'ness of it so that you get normal PAGE_SIZE
ptes back.

Using superpages for anonymous+SHM pages is really the only area I
still think Linux's MM can offer inferior performance compared to what
the hardware is actually capable of.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  8:20             ` David S. Miller
@ 2002-08-02  9:05               ` Ryan Cumming
  2002-08-02  9:06                 ` David S. Miller
  2002-08-02 12:52                 ` Rik van Riel
  2002-08-02 13:29               ` What does this error mean? "local symbols in discarded section .text.exit" jeff millar
  2002-08-02 15:27               ` large page patch David Mosberger
  2 siblings, 2 replies; 40+ messages in thread
From: Ryan Cumming @ 2002-08-02  9:05 UTC (permalink / raw)
  To: David S. Miller, davidm, davidm
  Cc: gh, riel, akpm, linux-kernel, linux-mm, rohit.seth, sunil.saxena,
	asit.k.mallick

On August 2, 2002 01:20, David S. Miller wrote:
>    From: David Mosberger <davidm@napali.hpl.hp.com>
>    Date: Thu, 1 Aug 2002 23:26:07 -0700
>
>    I'm a bit concerned about this, too.  My preference would have been to
>    use the regular mmap() and shmat() syscalls with some
>    augmentation/hint as to what the preferred page size is (Simon
>    Winwood's OLS 2002 paper talks about some options here).  I like this
>    because hints could be useful even with a transparent superpage
>    scheme.
>
> A "hint" to use superpages?  That's absurd.

What about applications that want fine-grained page aging? 4MB is a tad on the 
coarse side for most desktop applications.

-Ryan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  9:05               ` Ryan Cumming
@ 2002-08-02  9:06                 ` David S. Miller
  2002-08-02 12:52                 ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: David S. Miller @ 2002-08-02  9:06 UTC (permalink / raw)
  To: ryan
  Cc: davidm, davidm, gh, riel, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

   From: Ryan Cumming <ryan@completely.kicks-ass.org>
   Date: Fri, 2 Aug 2002 02:05:43 -0700
   
   What about applications that want fine-grained page aging? 4MB is a
   tad on the coarse side for most desktop applications.

Once vmscan sees the page and tries to liberate it, then it
will be unlarge'd and thus you'll get fine-grained page aging.

That's the beauty of my implementation suggestion.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  6:44               ` David Mosberger
@ 2002-08-02 10:00                 ` Marcin Dalecki
  0 siblings, 0 replies; 40+ messages in thread
From: Marcin Dalecki @ 2002-08-02 10:00 UTC (permalink / raw)
  To: davidm
  Cc: Martin J. Bligh, David S. Miller, gh, riel, akpm, linux-kernel,
	linux-mm, rohit.seth, sunil.saxena, asit.k.mallick

David Mosberger wrote:
>>>>>> On Thu, 01 Aug 2002 23:33:26 -0700, "Martin J. Bligh" <Martin.Bligh@us.ibm.com> said:
> 
>   DaveM> In my opinion the proposed large-page patch addresses a
>   DaveM> relatively pressing need for databases (primarily).
>   >>
>   DaveM> Databases want large pages with IPC_SHM, how can this special
>   DaveM> syscal hack address that?
> 
>   >>  I believe the interface is OK in that regard.  AFAIK, Oracle is
>   >> happy with it.
> 
>   Martin> Is Oracle now the world's only database? I think not.
> 
> I didn't say such a thing.  I just don't know what other db vendors/authors
> think of the proposed interface.  I'm sure their feedback would be welcome.

You'd better not ask DB people, and especially the Oracle people, for
opinions on interface design, unless you want something fscking ugly
that internally looks like FORTRAN/COBOL coding.
They will always scrap portability/usability, use undocumented
behaviour, and so on, whenever they can presumably increase their pet
benchmark values.
One of the reasons Solaris *feels* so slow is apparently that they
asked the Oracle people for opinions too often.  In particular they
forgot that there are other uses than DB servers ;-).

PS. I have been in touch with Oracle too much not to hate it...


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  9:05               ` Ryan Cumming
  2002-08-02  9:06                 ` David S. Miller
@ 2002-08-02 12:52                 ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2002-08-02 12:52 UTC (permalink / raw)
  To: Ryan Cumming
  Cc: David S. Miller, davidm, davidm, gh, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

On Fri, 2 Aug 2002, Ryan Cumming wrote:
> On August 2, 2002 01:20, David S. Miller wrote:

> > A "hint" to use superpages?  That's absurd.
>
> What about applications that want fine-grained page aging? 4MB is a tad
> on the coarse side for most desktop applications.

Of course we wouldn't want to use superpages for VMAs smaller
than, say, 4 of these superpages.

That would fix this problem automagically.

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 40+ messages in thread

* What does this error mean? "local symbols in discarded section .text.exit"
  2002-08-02  8:20             ` David S. Miller
  2002-08-02  9:05               ` Ryan Cumming
@ 2002-08-02 13:29               ` jeff millar
  2002-08-02 13:52                 ` Jose Luis Domingo Lopez
  2002-08-02 15:27               ` large page patch David Mosberger
  2 siblings, 1 reply; 40+ messages in thread
From: jeff millar @ 2002-08-02 13:29 UTC (permalink / raw)
  To: linux-kernel

I need some help debugging this kernel build problem.

Here's the tail of my kernel build.

  ld -m elf_i386  -r -o init.o main.o version.o do_mounts.o
make[1]: Leaving directory `/usr/src/v2.5.30/init'
  ld -m elf_i386 -T arch/i386/vmlinux.lds -e stext arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/init.o --start-group arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o ipc/ipc.o security/built-in.o /usr/src/v2.5.30/arch/i386/lib/lib.a lib/lib.a /usr/src/v2.5.30/arch/i386/lib/lib.a drivers/built-in.o sound/sound.o arch/i386/pci/pci.o net/network.o --end-group -o vmlinux
drivers/built-in.o(.data+0x80f4): undefined reference to `local symbols in discarded section .text.exit'
make: *** [vmlinux] Error 1


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: What does this error mean? "local symbols in discarded section .text.exit"
  2002-08-02 13:29               ` What does this error mean? "local symbols in discarded section .text.exit" jeff millar
@ 2002-08-02 13:52                 ` Jose Luis Domingo Lopez
  2002-08-02 22:50                   ` jeff millar
  0 siblings, 1 reply; 40+ messages in thread
From: Jose Luis Domingo Lopez @ 2002-08-02 13:52 UTC (permalink / raw)
  To: linux-kernel

On Friday, 02 August 2002, at 09:29:09 -0400,
jeff millar wrote:

> I need some help debugging this kernel build problem.
> 
> drivers/built-in.o(.data+0x80f4): undefined reference to `local symbols in
> discarded section .te
> xt.exit'
> make: *** [vmlinux] Error 1
> 
A known problem with some combinations of binutils and kernel sources. As
the Debian binutils package says:

x You may experience problems linking older (and some newer) kernels with  x 
x this version of binutils.  This is not because of a bug in the linker,   x 
x but rather a bug in the kernel source.  This is being worked out and     x 
x fixed by the upstream kernel group in newer kernels, but not all of the  x 
x problems may have been fixed at this time.  Older kernel versions will   x 
x almost always exhibit the problem, however, and no attempts are being    x 
x made to fix those that we know of.                                       x 
x                                                                          x 
x There are a few work-arounds, but the most reliable is to edit the       x 
x linker script for your architecture (e.g. arch/i386/vmlinux.lds) and     x 
x remove the '*(.text.exit)' entry from the 'DISCARD' line.  It will       x 
x bloat the kernel somewhat, but it should link properly.                  x 

Regards,

-- 
Jose Luis Domingo Lopez
Linux Registered User #189436     Debian Linux Woody (Linux 2.4.19-pre6aa1)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  8:20             ` David S. Miller
  2002-08-02  9:05               ` Ryan Cumming
  2002-08-02 13:29               ` What does this error mean? "local symbols in discarded section .text.exit" jeff millar
@ 2002-08-02 15:27               ` David Mosberger
  2 siblings, 0 replies; 40+ messages in thread
From: David Mosberger @ 2002-08-02 15:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: davidm, davidm, gh, riel, akpm, linux-kernel, linux-mm,
	rohit.seth, sunil.saxena, asit.k.mallick

>>>>> On Fri, 02 Aug 2002 01:20:40 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  Dave.M> A "hint" to use superpages?  That's absurd.

  Dave.M> Any time you are able to translate N pages instead of 1 page
  Dave.M> with 1 TLB entry it's always preferable.

Yeah, right.  So you think a 256MB page-size is optimal for all apps?

What you're missing is how you *get* to the point where you can map N
pages with a single TLB entry.  For that to happen, you need to
allocate physically contiguous and properly aligned memory (at least
given the hw that's common today).  Doing so has certain costs, no matter
what your approach is.

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: What does this error mean? "local symbols in discarded section .text.exit"
  2002-08-02 13:52                 ` Jose Luis Domingo Lopez
@ 2002-08-02 22:50                   ` jeff millar
  2002-08-02 23:04                     ` Thunder from the hill
  0 siblings, 1 reply; 40+ messages in thread
From: jeff millar @ 2002-08-02 22:50 UTC (permalink / raw)
  To: Jose Luis Domingo Lopez, linux-kernel

Jose...

thanks for the reply.  This link error happens with 2.5.27-2.5.30.  Are you
sure the kernel people are working on this?

jeff

----- Original Message -----
From: "Jose Luis Domingo Lopez" <linux-kernel@24x7linux.org>

> On Friday, 02 August 2002, at 09:29:09 -0400,
> jeff millar wrote:
>
> > I need some help debugging this kernel build problem.
> >
> > drivers/built-in.o(.data+0x80f4): undefined reference to `local symbols in discarded section .text.exit'
> > make: *** [vmlinux] Error 1
> >
> A known problem with some combinations of binutils and kernel sources. As
> the Debian binutils package says:
>
> x You may experience problems linking older (and some newer) kernels with  x
> x this version of binutils.  This is not because of a bug in the linker,   x
> x but rather a bug in the kernel source.  This is being worked out and     x
> x fixed by the upstream kernel group in newer kernels, but not all of the  x
> x problems may have been fixed at this time.  Older kernel versions will   x
> x almost always exhibit the problem, however, and no attempts are being    x
> x made to fix those that we know of.                                       x



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: What does this error mean? "local symbols in discarded section .text.exit"
  2002-08-02 22:50                   ` jeff millar
@ 2002-08-02 23:04                     ` Thunder from the hill
  0 siblings, 0 replies; 40+ messages in thread
From: Thunder from the hill @ 2002-08-02 23:04 UTC (permalink / raw)
  To: jeff millar; +Cc: Jose Luis Domingo Lopez, linux-kernel

Hi,

On Fri, 2 Aug 2002, jeff millar wrote:
> thanks for the reply.  This link error happens with 2.5.27-2.5.30.  Are you
> sure the kernel people are working on this?

Well, it doesn't seem so.

http://marc.theaimsgroup.com/?l=linux-kernel&m=102798967514023&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=102799199615357&w=2

The response seems to me like "Well, I don't care as long as the latest 
(not working on some arches) gcc does..."

			Thunder
-- 
.-../../-./..-/-..- .-./..-/.-.././.../.-.-.-


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: large page patch
  2002-08-02  0:37 large page patch Andrew Morton
                   ` (3 preceding siblings ...)
  2002-08-02  3:47 ` William Lee Irwin III
@ 2002-08-02 23:40 ` Chris Wedgwood
  4 siblings, 0 replies; 40+ messages in thread
From: Chris Wedgwood @ 2002-08-02 23:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: lkml, linux-mm@kvack.org, Seth, Rohit, Saxena, Sunil,
	Mallick, Asit K

On Thu, Aug 01, 2002 at 05:37:46PM -0700, Andrew Morton wrote:

    diff -Naru linux.org/arch/i386/kernel/entry.S linux.lp/arch/i386/kernel/entry.S
    --- linux.org/arch/i386/kernel/entry.S	Mon Feb 25 11:37:53 2002
    +++ linux.lp/arch/i386/kernel/entry.S	Tue Jul  2 15:12:23 2002
    @@ -634,6 +634,10 @@
     	.long SYMBOL_NAME(sys_ni_syscall)	/* 235 reserved for removexattr */
     	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for lremovexattr */
     	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for fremovexattr */
    +	.long SYMBOL_NAME(sys_get_large_pages)	/* Get large_page pages */
    +	.long SYMBOL_NAME(sys_free_large_pages)	/* Free large_page pages */
    +	.long SYMBOL_NAME(sys_share_large_pages)/* Share large_page pages */
    +	.long SYMBOL_NAME(sys_unshare_large_pages)/* UnShare large_page pages */


Must large pages be allocated this way?

At some point I would like to see code that mmap's large amounts of
data (over 1GB) take advantage of this, once the kernel is eventually
extended to deal with mapping large and/or variable-sized pages
backed to disk.

Also, some scientific applications will malloc(3) gobs of ram, again
in excess of 1GB.  Is it unreasonable to expect that the kernel will
notice large allocations and try to provide large pages when sbrk is
invoked with suitably high values?



  --cw

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2002-08-02 23:37 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-02  0:37 large page patch Andrew Morton
2002-08-02  0:43 ` David S. Miller
2002-08-02  1:26   ` Andrew Morton
2002-08-02  1:19     ` David S. Miller
2002-08-02  3:23       ` Linus Torvalds
2002-08-02  1:55   ` Rik van Riel
2002-08-02  1:50     ` David S. Miller
2002-08-02  2:29     ` Gerrit Huizenga
2002-08-02  2:23       ` David S. Miller
2002-08-02  2:53         ` Gerrit Huizenga
2002-08-02  5:24       ` David Mosberger
2002-08-02  5:20         ` David S. Miller
2002-08-02  6:26           ` David Mosberger
2002-08-02  6:33             ` Martin J. Bligh
2002-08-02  6:44               ` David Mosberger
2002-08-02 10:00                 ` Marcin Dalecki
2002-08-02  7:08               ` Andrew Morton
2002-08-02  7:15                 ` William Lee Irwin III
2002-08-02  8:20             ` David S. Miller
2002-08-02  9:05               ` Ryan Cumming
2002-08-02  9:06                 ` David S. Miller
2002-08-02 12:52                 ` Rik van Riel
2002-08-02 13:29               ` What does this error mean? "local symbols in discarded section .text.exit" jeff millar
2002-08-02 13:52                 ` Jose Luis Domingo Lopez
2002-08-02 22:50                   ` jeff millar
2002-08-02 23:04                     ` Thunder from the hill
2002-08-02 15:27               ` large page patch David Mosberger
2002-08-02  1:09 ` Martin J. Bligh
2002-08-02  4:07   ` Linus Torvalds
2002-08-02  4:13     ` David S. Miller
2002-08-02  4:30       ` William Lee Irwin III
2002-08-02  4:32       ` Linus Torvalds
2002-08-02  5:11         ` William Lee Irwin III
2002-08-02  7:30         ` Andrew Morton
2002-08-02  4:38     ` Martin J. Bligh
2002-08-02  1:36 ` Andrew Morton
2002-08-02  4:31   ` Daniel Phillips
2002-08-02  4:47     ` Andrew Morton
2002-08-02  3:47 ` William Lee Irwin III
2002-08-02 23:40 ` Chris Wedgwood

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox