LinuxPPC-Dev Archive on lore.kernel.org
* [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linuxppc-dev, libhugetlbfs-devel, linux-kernel, Eric Munson

Certain workloads benefit if their data or text segments are backed by
huge pages. The stack is no exception to this rule, but there is currently
no mechanism that reliably backs a stack with huge pages.  Doing this from
userspace is excessively messy and has some awkward restrictions,
particularly on POWER, where 256MB of address space gets wasted if the
stack is set up there.

This patch series introduces a personality flag that indicates the kernel
should set up the stack as a hugetlbfs-backed region. A userspace utility
may set this flag and then exec a process whose stack is to be backed by
hugetlb pages.
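
As a rough illustration of the userspace side, a launcher might look like the
sketch below.  The HUGETLB_STACK value is the one proposed in patch 1/5 and is
an assumption here: it only means "huge page stack" on a kernel carrying this
series (mainline later assigned that bit to UNAME26).  The helper names
persona_with_hugetlb_stack() and exec_with_hugetlb_stack() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/personality.h>
#include <unistd.h>

/* Assumption: value proposed in patch 1/5; not a mainline flag. */
#define HUGETLB_STACK 0x0020000UL

/* Return the given persona with the huge-stack request bit added. */
unsigned long persona_with_hugetlb_stack(unsigned long persona)
{
	return persona | HUGETLB_STACK;
}

/*
 * Hypothetical launcher helper: set the flag, then exec argv[0].
 * Returns -1 on failure; does not return on success.
 */
int exec_with_hugetlb_stack(char *const argv[])
{
	int cur;

	assert(argv && argv[0]);

	/* personality(0xffffffff) queries the persona without changing it. */
	cur = personality(0xffffffff);
	if (cur == -1)
		return -1;
	if (personality(persona_with_hugetlb_stack((unsigned int)cur)) == -1)
		return -1;

	execvp(argv[0], argv);
	perror("execvp");
	return -1;
}
```

A wrapper's main() would simply pass its remaining arguments, for example
exec_with_hugetlb_stack(&argv[1]), after checking that a command was given.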

Eric Munson (5):
  Align stack boundaries based on personality
  Add shared and reservation control to hugetlb_file_setup
  Split boundary checking from body of do_munmap
  Build hugetlb backed process stacks
  [PPC] Setup stack memory segment for hugetlb pages

 arch/powerpc/mm/hugetlbpage.c |    6 +
 arch/powerpc/mm/slice.c       |   11 ++
 fs/exec.c                     |  209 ++++++++++++++++++++++++++++++++++++++---
 fs/hugetlbfs/inode.c          |   52 +++++++----
 include/asm-powerpc/hugetlb.h |    3 +
 include/linux/hugetlb.h       |   22 ++++-
 include/linux/mm.h            |    1 +
 include/linux/personality.h   |    3 +
 ipc/shm.c                     |    2 +-
 mm/mmap.c                     |   11 ++-
 10 files changed, 284 insertions(+), 36 deletions(-)


* [PATCH 4/5 V2] Build hugetlb backed process stacks
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linuxppc-dev, libhugetlbfs-devel, linux-kernel, Eric Munson
In-Reply-To: <cover.1216928613.git.ebmunson@us.ibm.com>

This patch allows a process's stack to be backed by huge pages on request.
The personality flag defined in a previous patch should be set before
exec is called for the target process to use a huge page backed stack.

When the hugetlb file is set up to back the stack, it is sized to fit the
ulimit for stack size, or 256 MB if the ulimit is unlimited.  The GROWSUP and
GROWSDOWN VM flags are turned off because a hugetlb backed vma is not
resizable, so it must be appropriately sized when created.  When a process
exceeds its stack size it receives a segfault, as it would if it exceeded the
ulimit.
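
The sizing rule can be modelled in userspace roughly as follows.  This is only
a sketch: huge_stack_pages() is a hypothetical name, RLIM_INFINITY stands in
for the kernel's _STK_LIM_MAX check, and the huge page size is passed in as a
parameter rather than taken from HPAGE_SIZE.

```c
#include <assert.h>
#include <sys/resource.h>

/* Cap used when the stack rlimit is unlimited (value from the patch). */
#define HUGE_STACK_MAX (256UL * 1024 * 1024)

/*
 * Number of huge pages that would back the stack.  The limit is rounded
 * down to whole huge pages; a result of 0 means the rlimit is too small
 * and the kernel falls back to a normal stack.
 */
unsigned long huge_stack_pages(rlim_t stack_limit, unsigned long hpage_size)
{
	assert(hpage_size != 0);

	if (stack_limit == RLIM_INFINITY)
		return HUGE_STACK_MAX / hpage_size;
	return (unsigned long)(stack_limit / hpage_size);
}
```

With 16 MB huge pages, for instance, an 8 MB stack limit yields zero pages, so
the request is dropped and the process runs on a normal stack.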

Also, certain architectures require special setup for a memory region before
huge pages can be used in that region.  This patch declares a function with
__attribute__ ((weak)) set that can be defined by these architectures to
do any necessary setup.  If it exists, it will be called right before the
hugetlb file is mmapped.
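
The weak-symbol pattern described above can be sketched outside the kernel as
below.  The names arch_region_setup() and setup_region() are hypothetical, and
the example assumes a GCC-compatible compiler on an ELF target, where an
undefined weak function resolves to a null address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical optional hook; no definition is linked in this example. */
extern void arch_region_setup(unsigned long addr, unsigned long len)
	__attribute__((weak));

/* Returns 1 if the hook was present and called, 0 if it was absent. */
int setup_region(unsigned long addr, unsigned long len)
{
	assert(len != 0);

	/* An undefined weak symbol has address 0, so this test is safe. */
	if (arch_region_setup) {
		arch_region_setup(addr, len);
		return 1;
	}
	return 0;
}
```

An architecture that needs the hook supplies an ordinary (strong) definition
in its own object file, and the generic caller picks it up at link time.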

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Add comment about not padding huge stacks
Break personality_page_align helper and personality flag into separate patch
Add move_to_huge_pages function that moves the stack onto huge pages
Add hugetlb_mm_setup weak function for archs that require special setup to
 use hugetlb pages
Rebase to 2.6.26-rc8-mm1

 fs/exec.c               |  194 ++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/hugetlb.h |    5 +
 2 files changed, 187 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c99ba24..bf9ead2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -50,6 +50,7 @@
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
 #include <linux/hugetlb.h>
+#include <linux/mman.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -59,6 +60,8 @@
 #include <linux/kmod.h>
 #endif
 
+#define HUGE_STACK_MAX (256*1024*1024)
+
 #ifdef __alpha__
 /* for /sbin/loader handling in search_binary_handler() */
 #include <linux/a.out.h>
@@ -189,7 +192,12 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		return NULL;
 
 	if (write) {
-		unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
+		/*
+		 * Args are always placed at the high end of the stack space
+		 * so this calculation will give the proper size and it is
+		 * compatible with huge page stacks.
+		 */
+		unsigned long size = bprm->vma->vm_end - pos;
 		struct rlimit *rlim;
 
 		/*
@@ -255,7 +263,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	 * configured yet.
 	 */
 	vma->vm_end = STACK_TOP_MAX;
-	vma->vm_start = vma->vm_end - PAGE_SIZE;
+	if (current->personality & HUGETLB_STACK)
+		vma->vm_start = vma->vm_end - HPAGE_SIZE;
+	else
+		vma->vm_start = vma->vm_end - PAGE_SIZE;
 
 	vma->vm_flags = VM_STACK_FLAGS;
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
@@ -574,6 +585,156 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	return 0;
 }
 
+static struct file *hugetlb_stack_file(int stack_hpages)
+{
+	struct file *hugefile = NULL;
+
+	if (!stack_hpages) {
+		set_personality(current->personality & (~HUGETLB_STACK));
+		printk(KERN_DEBUG
+			"Stack rlimit set too low for huge page backed stack.\n");
+		return NULL;
+	}
+
+	hugefile = hugetlb_file_setup(HUGETLB_STACK_FILE,
+					HPAGE_SIZE * stack_hpages,
+					HUGETLB_PRIVATE_INODE);
+	if (unlikely(IS_ERR(hugefile))) {
+		/*
+		 * If huge pages are not available for this stack,
+		 * fall back to normal pages for execution instead of
+		 * failing.
+		 */
+		printk(KERN_DEBUG
+			"Huge page backed stack unavailable for process %lu.\n",
+			(unsigned long)current->pid);
+		set_personality(current->personality & (~HUGETLB_STACK));
+		return NULL;
+	}
+	return hugefile;
+}
+
+static int move_to_huge_pages(struct linux_binprm *bprm,
+				struct vm_area_struct *vma, unsigned long shift)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *new_vma;
+	unsigned long old_end = vma->vm_end;
+	unsigned long old_start = vma->vm_start;
+	unsigned long new_end = old_end - shift;
+	unsigned long new_start, length;
+	unsigned long arg_size = new_end - bprm->p;
+	unsigned long flags = vma->vm_flags;
+	struct file *hugefile = NULL;
+	unsigned int stack_hpages = 0;
+	struct page **from_pages = NULL;
+	struct page **to_pages = NULL;
+	unsigned long num_pages = (arg_size / PAGE_SIZE) + 1;
+	int ret;
+	int i;
+
+#ifdef CONFIG_STACK_GROWSUP
+	/*
+	 * Huge page stacks are not currently supported on GROWSUP
+	 * archs.
+	 */
+	set_personality(current->personality & (~HUGETLB_STACK));
+#else
+	if (current->signal->rlim[RLIMIT_STACK].rlim_cur == _STK_LIM_MAX)
+		stack_hpages = HUGE_STACK_MAX / HPAGE_SIZE;
+	else
+		stack_hpages = current->signal->rlim[RLIMIT_STACK].rlim_cur /
+				HPAGE_SIZE;
+	hugefile = hugetlb_stack_file(stack_hpages);
+	if (!hugefile)
+		goto out_small_stack;
+
+	length = stack_hpages * HPAGE_SIZE;
+	new_start = new_end - length;
+
+	from_pages = kmalloc(num_pages * sizeof(struct page*), GFP_KERNEL);
+	to_pages = kmalloc(num_pages * sizeof(struct page*), GFP_KERNEL);
+	if (!from_pages || !to_pages)
+		goto out_small_stack;
+
+	ret = get_user_pages(current, mm, (old_end - arg_size) & PAGE_MASK,
+				num_pages, 0, 0, from_pages, NULL);
+	if (ret <= 0)
+		goto out_small_stack;
+
+	/*
+	 * __do_munmap is used here because the boundary checking done in
+	 * do_munmap will fail out every time where the kernel is 64 bit and the
+	 * target program is 32 bit as the stack will start at TASK_SIZE for the
+	 * 64 bit address space.
+	 */
+	ret = __do_munmap(mm, old_start, old_end - old_start);
+	if (ret)
+		goto out_small_stack;
+
+	ret = -EINVAL;
+	if (hugetlb_mm_setup)
+		hugetlb_mm_setup(mm, new_start, length);
+	if (IS_ERR_VALUE(do_mmap(hugefile, new_start, length,
+			PROT_READ | PROT_WRITE, MAP_FIXED | MAP_PRIVATE, 0)))
+		goto out_error;
+	/* We don't want to fput this if the mmap succeeded */
+	hugefile = NULL;
+
+	ret = get_user_pages(current, mm, (new_end - arg_size) & PAGE_MASK,
+				num_pages, 0, 0, to_pages, NULL);
+	if (ret <= 0) {
+		ret = -ENOMEM;
+		goto out_error;
+	}
+
+	for (i = 0; i < num_pages; i++) {
+		char *vfrom, *vto;
+		vfrom = kmap(from_pages[i]);
+		vto = kmap(to_pages[i]);
+		memcpy(vto, vfrom, PAGE_SIZE);
+		kunmap(from_pages[i]);
+		kunmap(to_pages[i]);
+		put_page(from_pages[i]);
+		put_page(to_pages[i]);
+	}
+
+	kfree(from_pages);
+	kfree(to_pages);
+	new_vma = find_vma(current->mm, new_start);
+	if (!new_vma)
+		return -ENOSPC;
+	new_vma->vm_flags |= flags;
+	new_vma->vm_flags &= ~(VM_GROWSUP|VM_GROWSDOWN);
+	new_vma->vm_page_prot = vm_get_page_prot(new_vma->vm_flags);
+
+	bprm->vma = new_vma;
+	return 0;
+
+out_error:
+	for (i = 0; i < num_pages; i++)
+		put_page(from_pages[i]);
+	if (hugefile)
+		fput(hugefile);
+	if (from_pages)
+		kfree(from_pages);
+	if (to_pages)
+		kfree(to_pages);
+	return ret;
+
+out_small_stack:
+	if (hugefile)
+		fput(hugefile);
+	if (from_pages)
+		kfree(from_pages);
+	if (to_pages)
+		kfree(to_pages);
+#endif /* !CONFIG_STACK_GROWSUP */
+	if (shift)
+		return shift_arg_pages(vma, shift);
+	return 0;
+}
+
 #define EXTRA_STACK_VM_PAGES	20	/* random */
 
 /*
@@ -640,23 +801,32 @@ int setup_arg_pages(struct linux_binprm *bprm,
 		goto out_unlock;
 	BUG_ON(prev != vma);
 
+	/* Move stack to hugetlb pages if requested */
+	if (current->personality & HUGETLB_STACK)
+		ret = move_to_huge_pages(bprm, vma, stack_shift);
 	/* Move stack pages down in memory. */
-	if (stack_shift) {
+	else if (stack_shift)
 		ret = shift_arg_pages(vma, stack_shift);
-		if (ret) {
-			up_write(&mm->mmap_sem);
-			return ret;
-		}
+
+	if (ret) {
+		up_write(&mm->mmap_sem);
+		return ret;
 	}
 
+	/*
+	 * Stack padding code is skipped for huge stacks because the vma
+	 * is not expandable when backed by a hugetlb file.
+	 */
+	if (!(current->personality & HUGETLB_STACK)) {
 #ifdef CONFIG_STACK_GROWSUP
-	stack_base = vma->vm_end + EXTRA_STACK_VM_PAGES * PAGE_SIZE;
+		stack_base = vma->vm_end + EXTRA_STACK_VM_PAGES * PAGE_SIZE;
 #else
-	stack_base = vma->vm_start - EXTRA_STACK_VM_PAGES * PAGE_SIZE;
+		stack_base = vma->vm_start - EXTRA_STACK_VM_PAGES * PAGE_SIZE;
 #endif
-	ret = expand_stack(vma, stack_base);
-	if (ret)
-		ret = -EFAULT;
+		ret = expand_stack(vma, stack_base);
+		if (ret)
+			ret = -EFAULT;
+	}
 
 out_unlock:
 	up_write(&mm->mmap_sem);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 26ffed9..b4c88bb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -110,6 +110,11 @@ static inline unsigned long hugetlb_total_pages(void)
 #define HUGETLB_RESERVE	0x00000002UL	/* Reserve the huge pages backed by the
 					 * new file */
 
+#define HUGETLB_STACK_FILE "hugetlb-stack"
+
+extern void hugetlb_mm_setup(struct mm_struct *mm, unsigned long addr,
+				unsigned long len) __attribute__ ((weak));
+
 #ifdef CONFIG_HUGETLBFS
 struct hugetlbfs_config {
 	uid_t   uid;
-- 
1.5.6.1


* [PATCH 1/5 V2] Align stack boundaries based on personality
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linuxppc-dev, libhugetlbfs-devel, linux-kernel, Eric Munson
In-Reply-To: <cover.1216928613.git.ebmunson@us.ibm.com>

This patch adds a personality flag that requests hugetlb pages be used for
a process's stack.  It adds a helper function that chooses the proper ALIGN
macro based on the process personality and calls this function from
setup_arg_pages when aligning the stack address.
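
The two roundings the helper picks between for a huge page stack can be
sketched as plain arithmetic.  The function names below are hypothetical, and
the huge page size is a parameter that must be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of size (mirrors addr & HPAGE_MASK). */
uint64_t align_down(uint64_t addr, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return addr & ~(size - 1);
}

/* Round addr up to a multiple of size (mirrors HPAGE_ALIGN(addr)). */
uint64_t align_up(uint64_t addr, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (addr + size - 1) & ~(size - 1);
}

/* Huge-stack case only: round up on GROWSUP archs, otherwise round down. */
uint64_t huge_stack_align(uint64_t addr, uint64_t size, int grows_up)
{
	return grows_up ? align_up(addr, size) : align_down(addr, size);
}
```

Rounding down for a downward-growing stack keeps the aligned top at or below
the address that arch_align_stack() produced.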

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Rebase to 2.6.26-rc8-mm1

 fs/exec.c                   |   15 ++++++++++++++-
 include/linux/hugetlb.h     |    3 +++
 include/linux/personality.h |    3 +++
 3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index af9b29c..c99ba24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -49,6 +49,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/hugetlb.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -155,6 +156,18 @@ exit:
 	goto out;
 }
 
+static unsigned long personality_page_align(unsigned long addr)
+{
+	if (current->personality & HUGETLB_STACK)
+#ifdef CONFIG_STACK_GROWSUP
+		return HPAGE_ALIGN(addr);
+#else
+		return addr & HPAGE_MASK;
+#endif
+
+	return PAGE_ALIGN(addr);
+}
+
 #ifdef CONFIG_MMU
 
 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
@@ -596,7 +609,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	bprm->p = vma->vm_end - stack_shift;
 #else
 	stack_top = arch_align_stack(stack_top);
-	stack_top = PAGE_ALIGN(stack_top);
+	stack_top = personality_page_align(stack_top);
 	stack_shift = vma->vm_end - stack_top;
 
 	bprm->p -= stack_shift;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9a71d4c..eed37d7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -95,6 +95,9 @@ static inline unsigned long hugetlb_total_pages(void)
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	PAGE_MASK		/* Keep the compiler happy */
 #define HPAGE_SIZE	PAGE_SIZE
+
+/* to align the pointer to the (next) huge page boundary */
+#define HPAGE_ALIGN(addr)	ALIGN(addr, HPAGE_SIZE)
 #endif
 
 #endif /* !CONFIG_HUGETLB_PAGE */
diff --git a/include/linux/personality.h b/include/linux/personality.h
index a84e9ff..2bb0f95 100644
--- a/include/linux/personality.h
+++ b/include/linux/personality.h
@@ -22,6 +22,9 @@ extern int		__set_personality(unsigned long);
  * These occupy the top three bytes.
  */
 enum {
+	HUGETLB_STACK =		0x0020000,	/* Attempt to use hugetlb pages
+						 * for the process stack
+						 */
 	ADDR_NO_RANDOMIZE = 	0x0040000,	/* disable randomization of VA space */
 	FDPIC_FUNCPTRS =	0x0080000,	/* userspace function ptrs point to descriptors
 						 * (signal handling)
-- 
1.5.6.1


* [PATCH 3/5] Split boundary checking from body of do_munmap
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linuxppc-dev, libhugetlbfs-devel, linux-kernel, Eric Munson
In-Reply-To: <cover.1216928613.git.ebmunson@us.ibm.com>

Currently do_munmap pre-checks the unmapped address range against the
valid address range for the process size.  However, during initial setup
the stack may actually be outside this range; in particular, it may be
initially placed at the 64 bit stack address and later moved to the
normal 32 bit stack location.  In a later patch we will want to unmap
the stack as part of relocating it into huge pages.

This patch moves the bulk of do_munmap into __do_munmap, which will not
be protected by the boundary checking.  When an area that would normally
fail these checks needs to be unmapped (e.g. unmapping a stack that
was set up at 64 bit TASK_SIZE for a 32 bit process), __do_munmap should
be called directly.  do_munmap will continue to do the boundary checking
and will call __do_munmap as appropriate.
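
The shape of the split can be modelled in userspace as a checked wrapper around
an unchecked core.  This is only a sketch: raw_unmap() and checked_unmap() are
hypothetical stand-ins, the task size is passed in explicitly, and a 4 KB page
size is assumed for the alignment check that stays in the core.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for this model */

/* Unchecked core: keeps only the alignment check, like __do_munmap. */
int raw_unmap(unsigned long start, size_t len)
{
	(void)len;
	if (start & (MODEL_PAGE_SIZE - 1))
		return -EINVAL;
	return 0;	/* a real implementation would tear down mappings */
}

/* Checked wrapper: does the range check, like do_munmap. */
int checked_unmap(unsigned long task_size, unsigned long start, size_t len)
{
	assert(task_size != 0);

	if (start > task_size || len > task_size - start)
		return -EINVAL;
	return raw_unmap(start, len);
}
```

A range above the task size is rejected by the wrapper but accepted by the
core, which is the property the stack relocation relies on.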

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

 include/linux/mm.h |    1 +
 mm/mmap.c          |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a4eeb3c..59c6f89 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1152,6 +1152,7 @@ out:
 	return ret;
 }
 
+extern int __do_munmap(struct mm_struct *, unsigned long, size_t);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
diff --git a/mm/mmap.c b/mm/mmap.c
index 5b62e5d..4e56369 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1881,17 +1881,24 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	return 0;
 }
 
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	if (start > TASK_SIZE || len > TASK_SIZE-start)
+		return -EINVAL;
+	return __do_munmap(mm, start, len);
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge <jeremy@goop.org>
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
 
-	if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
+	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
 	if ((len = PAGE_ALIGN(len)) == 0)
-- 
1.5.6.1


* [PATCH 2/5 V2] Add shared and reservation control to hugetlb_file_setup
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linuxppc-dev, libhugetlbfs-devel, linux-kernel, Eric Munson
In-Reply-To: <cover.1216928613.git.ebmunson@us.ibm.com>

There are two kinds of "Shared" hugetlbfs mappings:
   1. using the internal vfsmount, via ipc/shm.c and shmctl()
   2. mmap() of /hugetlbfs/file with MAP_SHARED

There is one kind of private: mmap() of /hugetlbfs/file with
MAP_PRIVATE

This patch adds a second class of "private" hugetlb-backed mapping.  But
we do it by sharing code with the ipc shm.  This is mostly because we
need to do our stack setup at execve() time and can't go opening files
from hugetlbfs.  The kernel-internal vfsmount for shm lets us get around
this.  We truly want anonymous memory, but MAP_PRIVATE is close enough
for now.

Currently, if the mapping on an internal mount is larger than a single
huge page, one page is allocated, one is reserved, and the rest are
faulted as needed.  For hugetlb backed stacks we do not want any
reserved pages.  This patch gives the caller of hugetlb_file_setup the
ability to control this behavior by specifying flags for private inodes
and page reservations.
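
How the two flags steer hugetlb_file_setup can be summarised with a small
model.  The flag values are taken from this patch; the predicate names are
hypothetical and only restate the conditions visible in the diff below.

```c
#include <assert.h>

#define HUGETLB_PRIVATE_INODE	0x00000001UL	/* private, internal mount */
#define HUGETLB_RESERVE		0x00000002UL	/* reserve the huge pages */

/* Permission check and shm lock accounting are skipped for private inodes. */
int needs_shm_accounting(unsigned long creat_flags)
{
	assert((creat_flags & ~(HUGETLB_PRIVATE_INODE | HUGETLB_RESERVE)) == 0);
	return !(creat_flags & HUGETLB_PRIVATE_INODE);
}

/* Huge pages are reserved up front only when the caller asks for it. */
int needs_reservation(unsigned long creat_flags)
{
	assert((creat_flags & ~(HUGETLB_PRIVATE_INODE | HUGETLB_RESERVE)) == 0);
	return (creat_flags & HUGETLB_RESERVE) != 0;
}
```

The SysV shm caller passes HUGETLB_RESERVE and keeps its existing behaviour,
while the stack caller in patch 4/5 passes HUGETLB_PRIVATE_INODE alone.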

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Add creat_flags to struct hugetlbfs_inode_info
Check if space should be reserved in hugetlbfs_file_mmap
Rebase to 2.6.26-rc8-mm1

 fs/hugetlbfs/inode.c    |   52 ++++++++++++++++++++++++++++++----------------
 include/linux/hugetlb.h |   18 ++++++++++++---
 ipc/shm.c               |    2 +-
 3 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dbd01d2..2e960d6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -92,7 +92,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap_pgoff unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
+	vma->vm_flags |= VM_HUGETLB;
 	vma->vm_ops = &hugetlb_vm_ops;
 
 	if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
@@ -106,10 +106,13 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	ret = -ENOMEM;
 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
-	if (hugetlb_reserve_pages(inode,
+	if (HUGETLBFS_I(inode)->creat_flags & HUGETLB_RESERVE) {
+		vma->vm_flags |= VM_RESERVED;
+		if (hugetlb_reserve_pages(inode,
 				vma->vm_pgoff >> huge_page_order(h),
 				len >> huge_page_shift(h), vma))
-		goto out;
+			goto out;
+	}
 
 	ret = 0;
 	hugetlb_prefault_arch_hook(vma->vm_mm);
@@ -496,7 +499,8 @@ out:
 }
 
 static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid, 
-					gid_t gid, int mode, dev_t dev)
+					gid_t gid, int mode, dev_t dev,
+					unsigned long creat_flags)
 {
 	struct inode *inode;
 
@@ -512,7 +516,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		INIT_LIST_HEAD(&inode->i_mapping->private_list);
 		info = HUGETLBFS_I(inode);
-		mpol_shared_policy_init(&info->policy, NULL);
+		info->creat_flags = creat_flags;
+		if (!(creat_flags & HUGETLB_PRIVATE_INODE))
+			mpol_shared_policy_init(&info->policy, NULL);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
@@ -553,7 +559,8 @@ static int hugetlbfs_mknod(struct inode *dir,
 	} else {
 		gid = current->fsgid;
 	}
-	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev);
+	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev,
+					HUGETLB_RESERVE);
 	if (inode) {
 		dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 		d_instantiate(dentry, inode);
@@ -589,7 +596,8 @@ static int hugetlbfs_symlink(struct inode *dir,
 		gid = current->fsgid;
 
 	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid,
-					gid, S_IFLNK|S_IRWXUGO, 0);
+					gid, S_IFLNK|S_IRWXUGO, 0,
+					HUGETLB_RESERVE);
 	if (inode) {
 		int l = strlen(symname)+1;
 		error = page_symlink(inode, symname, l);
@@ -693,7 +701,8 @@ static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
-	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
+	if (!(HUGETLBFS_I(inode)->creat_flags & HUGETLB_PRIVATE_INODE))
+		mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
 	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
 }
 
@@ -879,7 +888,8 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_op = &hugetlbfs_ops;
 	sb->s_time_gran = 1;
 	inode = hugetlbfs_get_inode(sb, config.uid, config.gid,
-					S_IFDIR | config.mode, 0);
+					S_IFDIR | config.mode, 0,
+					HUGETLB_RESERVE);
 	if (!inode)
 		goto out_free;
 
@@ -944,7 +954,8 @@ static int can_do_hugetlb_shm(void)
 			can_do_mlock());
 }
 
-struct file *hugetlb_file_setup(const char *name, size_t size)
+struct file *hugetlb_file_setup(const char *name, size_t size,
+				unsigned long creat_flags)
 {
 	int error = -ENOMEM;
 	struct file *file;
@@ -955,11 +966,13 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
 	if (!hugetlbfs_vfsmount)
 		return ERR_PTR(-ENOENT);
 
-	if (!can_do_hugetlb_shm())
-		return ERR_PTR(-EPERM);
+	if (!(creat_flags & HUGETLB_PRIVATE_INODE)) {
+		if (!can_do_hugetlb_shm())
+			return ERR_PTR(-EPERM);
 
-	if (!user_shm_lock(size, current->user))
-		return ERR_PTR(-ENOMEM);
+		if (!user_shm_lock(size, current->user))
+			return ERR_PTR(-ENOMEM);
+	}
 
 	root = hugetlbfs_vfsmount->mnt_root;
 	quick_string.name = name;
@@ -971,13 +984,15 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
 
 	error = -ENOSPC;
 	inode = hugetlbfs_get_inode(root->d_sb, current->fsuid,
-				current->fsgid, S_IFREG | S_IRWXUGO, 0);
+				current->fsgid, S_IFREG | S_IRWXUGO, 0,
+				creat_flags);
 	if (!inode)
 		goto out_dentry;
 
 	error = -ENOMEM;
-	if (hugetlb_reserve_pages(inode, 0,
-			size >> huge_page_shift(hstate_inode(inode)), NULL))
+	if ((creat_flags & HUGETLB_RESERVE) &&
+		(hugetlb_reserve_pages(inode, 0,
+			size >> huge_page_shift(hstate_inode(inode)), NULL)))
 		goto out_inode;
 
 	d_instantiate(dentry, inode);
@@ -998,7 +1013,8 @@ out_inode:
 out_dentry:
 	dput(dentry);
 out_shm_unlock:
-	user_shm_unlock(size, current->user);
+	if (!(creat_flags & HUGETLB_PRIVATE_INODE))
+		user_shm_unlock(size, current->user);
 	return ERR_PTR(error);
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index eed37d7..26ffed9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -95,12 +95,20 @@ static inline unsigned long hugetlb_total_pages(void)
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	PAGE_MASK		/* Keep the compiler happy */
 #define HPAGE_SIZE	PAGE_SIZE
+#endif
+
+#endif /* !CONFIG_HUGETLB_PAGE */
 
 /* to align the pointer to the (next) huge page boundary */
 #define HPAGE_ALIGN(addr)	ALIGN(addr, HPAGE_SIZE)
-#endif
 
-#endif /* !CONFIG_HUGETLB_PAGE */
+#define HUGETLB_PRIVATE_INODE	0x00000001UL	/* The file is being created on
+						 * the internal hugetlbfs mount
+						 * and is private to the
+						 * process */
+
+#define HUGETLB_RESERVE	0x00000002UL	/* Reserve the huge pages backed by the
+					 * new file */
 
 #ifdef CONFIG_HUGETLBFS
 struct hugetlbfs_config {
@@ -125,6 +133,7 @@ struct hugetlbfs_sb_info {
 struct hugetlbfs_inode_info {
 	struct shared_policy policy;
 	struct inode vfs_inode;
+	unsigned long creat_flags;
 };
 
 static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
@@ -139,7 +148,8 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 
 extern const struct file_operations hugetlbfs_file_operations;
 extern struct vm_operations_struct hugetlb_vm_ops;
-struct file *hugetlb_file_setup(const char *name, size_t);
+struct file *hugetlb_file_setup(const char *name, size_t,
+				unsigned long creat_flags);
 int hugetlb_get_quota(struct address_space *mapping, long delta);
 void hugetlb_put_quota(struct address_space *mapping, long delta);
 
@@ -161,7 +171,7 @@ static inline void set_file_hugepages(struct file *file)
 
 #define is_file_hugepages(file)		0
 #define set_file_hugepages(file)	BUG()
-#define hugetlb_file_setup(name,size)	ERR_PTR(-ENOSYS)
+#define hugetlb_file_setup(name,size,creat_flags)	ERR_PTR(-ENOSYS)
 
 #endif /* !CONFIG_HUGETLBFS */
 
diff --git a/ipc/shm.c b/ipc/shm.c
index 2774bad..3b5849f 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -365,7 +365,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	sprintf (name, "SYSV%08x", key);
 	if (shmflg & SHM_HUGETLB) {
 		/* hugetlb_file_setup takes care of mlock user accounting */
-		file = hugetlb_file_setup(name, size);
+		file = hugetlb_file_setup(name, size, HUGETLB_RESERVE);
 		shp->mlock_user = current->user;
 	} else {
 		int acctflag = VM_ACCOUNT;
-- 
1.5.6.1


* [PATCH 5/5 V2] [PPC] Setup stack memory segment for hugetlb pages
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linuxppc-dev, libhugetlbfs-devel, linux-kernel, Eric Munson
In-Reply-To: <cover.1216928613.git.ebmunson@us.ibm.com>

Currently the memory slice that holds the process stack is always initialized
to hold small pages.  This patch defines the weak function that is declared
in the previous patch to convert the stack slice to hugetlb pages.

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Instead of setting the mm-wide page size to huge pages, set only the relevant
 slice psize using an arch defined weak function.

 arch/powerpc/mm/hugetlbpage.c |    6 ++++++
 arch/powerpc/mm/slice.c       |   11 +++++++++++
 include/asm-powerpc/hugetlb.h |    3 +++
 3 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index fb42c4d..bd7f777 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -152,6 +152,12 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
 }
 #endif
 
+void hugetlb_mm_setup(struct mm_struct *mm, unsigned long addr,
+			unsigned long len)
+{
+	slice_convert_address(mm, addr, len, shift_to_mmu_psize(HPAGE_SHIFT));
+}
+
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy or bootmem allocator is setup.
  */
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 583be67..d984733 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -30,6 +30,7 @@
 #include <linux/err.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
+#include <linux/hugetlb.h>
 #include <asm/mman.h>
 #include <asm/mmu.h>
 #include <asm/spu.h>
@@ -397,6 +398,16 @@ static unsigned long slice_find_area(struct mm_struct *mm, unsigned long len,
 #define MMU_PAGE_BASE	MMU_PAGE_4K
 #endif
 
+void slice_convert_address(struct mm_struct *mm, unsigned long addr,
+				unsigned long len, unsigned int psize)
+{
+	struct slice_mask mask;
+
+	mask = slice_range_to_mask(addr, len);
+	slice_convert(mm, mask, psize);
+	slice_flush_segments(mm);
+}
+
 unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 				      unsigned long flags, unsigned int psize,
 				      int topdown, int use_cache)
diff --git a/include/asm-powerpc/hugetlb.h b/include/asm-powerpc/hugetlb.h
index 26f0d0a..10ef089 100644
--- a/include/asm-powerpc/hugetlb.h
+++ b/include/asm-powerpc/hugetlb.h
@@ -17,6 +17,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep);
 
+void slice_convert_address(struct mm_struct *mm, unsigned long addr,
+				unsigned long len, unsigned int psize);
+
 /*
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
-- 
1.5.6.1


* RE: mpc8349-mITX DTB load failure
From: Sparks, Sam @ 2008-07-28 19:22 UTC (permalink / raw)
  To: Kim Phillips; +Cc: linuxppc-dev
In-Reply-To: <20080728133043.563a691c.kim.phillips@freescale.com>

> From: Kim Phillips
> Sent: Monday, July 28, 2008 1:31 PM
> > ## Booting kernel from Legacy Image at 01001000 ...
>
> > ## Flattened Device Tree blob at 01000000
>
> so the dtb is being loaded only 0x1000 bytes from the kernel,
> yet it's probably bigger than that.  Can you try different
> load addresses, preferably with the two images at least
> 0x3000 bytes apart?

Ouch. Thanks for pointing that out, it worked. :-)

--Sam


* Re: [PATCH] Documentation: remove old sbc8260 board specific information
From: Kumar Gala @ 2008-07-28 19:24 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: linuxppc-dev, rob
In-Reply-To: <1217271031-8922-1-git-send-email-paul.gortmaker@windriver.com>


On Jul 28, 2008, at 1:50 PM, Paul Gortmaker wrote:

> This file contains 8 yr. old board specific information that was for
> the now gone ppc implementation, and it pre-dates widespread u-boot
> support.  Any of the technical details of the board memory map would be
> more appropriately captured in a dts if I revive it as powerpc anyway.
>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> Acked-by: Jason Wessel <jason.wessel@windriver.com>
> ---
> Documentation/powerpc/00-INDEX                   |    2 -
> Documentation/powerpc/SBC8260_memory_mapping.txt |  197 ----------------------
> 2 files changed, 0 insertions(+), 199 deletions(-)
> delete mode 100644 Documentation/powerpc/SBC8260_memory_mapping.txt

applied.

- k


* [PATCH] powerpc/lpar - defer prefered console setup
From: Bastian Blank @ 2008-07-28 18:56 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: akpm, linux-kernel

Hi

The powerpc lpar code adds a preferred console at a very early stage,
during arch_setup. This runs even before the console setup from the
command line and takes precedence.

This patch moves the preferred console setup into an arch_initcall, which
runs later and allows the specification of the correct console on the
command line. The udbg console remains as boot console.

There is a different problem that the code does not pick up the correct
console because it uses a part (4) of the lpar device number (30000004)
instead of the correct index 1.

Signed-off-by: Bastian Blank <waldi@debian.org>

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 9235c46..626290d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -57,6 +57,7 @@ extern void pSeries_find_serial_port(void);
 
 
 int vtermno;	/* virtual terminal# for udbg  */
+static char *console_name;
 
 #define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
 static void udbg_hvsi_putc(char c)
@@ -232,18 +233,24 @@ void __init find_udbg_vterm(void)
 		udbg_putc = udbg_putcLP;
 		udbg_getc = udbg_getcLP;
 		udbg_getc_poll = udbg_getc_pollLP;
-		add_preferred_console("hvc", termno[0] & 0xff, NULL);
+		console_name = "hvc";
 	} else if (of_device_is_compatible(stdout_node, "hvterm-protocol")) {
-		vtermno = termno[0];
 		udbg_putc = udbg_hvsi_putc;
 		udbg_getc = udbg_hvsi_getc;
 		udbg_getc_poll = udbg_hvsi_getc_poll;
-		add_preferred_console("hvsi", termno[0] & 0xff, NULL);
+		console_name = "hvsi";
 	}
 out:
 	of_node_put(stdout_node);
 }
 
+static void __init enable_vterm(void)
+{
+	if (console_name)
+		add_preferred_console(console_name, vtermno, NULL);
+}
+arch_initcall(enable_vterm);
+
 void vpa_init(int cpu)
 {
 	int hwcpu = get_hard_smp_processor_id(cpu);

-- 
Genius doesn't work on an assembly line basis.  You can't simply say,
"Today I will be brilliant."
		-- Kirk, "The Ultimate Computer", stardate 4731.3

^ permalink raw reply related

* No output from SMC1 console with the 2.6.26 kernel (PQ2FADS based board)
From: Matvejchikov Ilya @ 2008-07-28 19:43 UTC (permalink / raw)
  To: linuxppc-embedded

Hi all,

I'm working with a PQ2FADS-based board. When I started using the new 2.6.26
kernel, my SMC1 console stopped working. I know that the cpm_uart
driver has changed and the DTS file needs updating. Unfortunately,
I have failed to get it running. Could somebody help me solve this problem?

====
The cpm node of the DTS file:
.............................................
		cpm@119c0 {
			#address-cells = <1>;
			#size-cells = <1>;
			#interrupt-cells = <2>;
			compatible = "fsl,mpc8280-cpm", "fsl,cpm2";
			reg = <119c0 30>;
			ranges;

			muram {
				#address-cells = <1>;
				#size-cells = <1>;
				ranges = <0 0 10000>;

				data-only@0 {
					compatible = "fsl,cpm-muram-data";
					reg = <0 2000 9800 800>;
				};
			};

			brg@119f0 {
				compatible = "fsl,cpm-brg", "fsl,cpm2-brg";
				reg = <119f0 10 115f0 10>;
			};

			smc1: serial@11a80 {
				device_type = "serial";
				compatible = "fsl,cpm2-smc-uart";
				reg = <11a80 20 87fc 2>;
				interrupts = <4 8>;
				interrupt-parent = <&PIC>;
				fsl,cpm-brg = <7>;
				fsl,cpm-command = <1d000000>;
			};

		};
.............................................
	chosen {
		linux,stdout-path = &smc1;
		bootargs = "console=ttyCPM0";
	};
.............................................

The early debugging output:
Xid mach(): done
MMU:enter
MMU:hw init
MMU:mapin
MMU:setio
MMU:exit
Using Electronic Devices SPC826 (M82) machine description
Linux version 2.6.26 (ilya@westend) (gcc version 4.1.2) #83 Thu Jul 24 14:45:07 MSD 2008
console [udbg0] enabled
Entering add_active_range(0, 0, 4096) 0 entries of 256 used
setup_arch: bootmem
spc826_setup_arch()
spc826_setup_arch(), finish
arch: exit
Top of RAM: 0x1000000, Total RAM: 0x1000000
Memory hole size: 0MB
Zone PFN ranges:
  DMA             0 ->     4096
  Normal       4096 ->     4096
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
    0:        0 ->     4096
On node 0 totalpages: 4096
  DMA zone: 32 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 4064 pages, LIFO batch:0
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
Built 1 zonelists in Zone order, mobility grouping off.  Total pages: 4064
Kernel command line: console=ttyCPM0
PID hash table entries: 64 (order: 6, 256 bytes)
time_init: decrementer frequency = 16.500000 MHz
time_init: processor frequency   = 264.000000 MHz
clocksource: timebase mult[f26c9b2] shift[22] registered
clockevent: decrementer mult[439] shift[16] cpu[0]
Console: colour dummy device 80x25
====

Thanks!

^ permalink raw reply

* Re: No output from SMC1 console with the 2.6.26 kernel (PQ2FADS based board)
From: Scott Wood @ 2008-07-28 19:59 UTC (permalink / raw)
  To: matvejchikov; +Cc: linuxppc-embedded
In-Reply-To: <8496f91a0807281243y2193dad5hc758444fe0a10258@mail.gmail.com>

Matvejchikov Ilya wrote:
> 		cpm@119c0 {
> 			#address-cells = <1>;
> 			#size-cells = <1>;
> 			#interrupt-cells = <2>;
> 			compatible = "fsl,mpc8280-cpm", "fsl,cpm2";

Add "simple-bus" to this compatible list.

Other than that, and that you should be using dts-v1, the tree looks 
fine.  If adding "simple-bus" doesn't fix it, check your pin setup; 
pq2fads doesn't have an SMC serial port, so its platform file doesn't 
set up those pins.

-Scott

^ permalink raw reply

* Re: [PATCH 1/5 V2] Align stack boundaries based on personality
From: Dave Hansen @ 2008-07-28 20:09 UTC (permalink / raw)
  To: Eric Munson; +Cc: linux-mm, libhugetlbfs-devel, linux-kernel, linuxppc-dev
In-Reply-To: <6061445882ce9574999bf343eeb333be02a1afa6.1216928613.git.ebmunson@us.ibm.com>

On Mon, 2008-07-28 at 12:17 -0700, Eric Munson wrote:
> 
> +static unsigned long personality_page_align(unsigned long addr)
> +{
> +       if (current->personality & HUGETLB_STACK)
> +#ifdef CONFIG_STACK_GROWSUP
> +               return HPAGE_ALIGN(addr);
> +#else
> +               return addr & HPAGE_MASK;
> +#endif
> +
> +       return PAGE_ALIGN(addr);
> +}
...
> -       stack_top = PAGE_ALIGN(stack_top);
> +       stack_top = personality_page_align(stack_top);

Just out of curiosity, why doesn't the existing small-page case seem to
care about the stack growing up/down?  Why do you need to care in the
large page case?
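
For readers following the thread, the two roundings in the quoted hunk can
be sketched in userspace as below. This is only a model: the 16 MB huge page
size and all demo_* names are assumptions for illustration, and the macros
merely imitate what HPAGE_ALIGN(addr) and addr & HPAGE_MASK appear to do in
the quoted code.

```c
#include <assert.h>

/* Assumed huge page size for illustration only (16 MB); the real value
 * depends on the architecture and configuration. */
#define DEMO_HPAGE_SIZE	(16UL << 20)
#define DEMO_HPAGE_MASK	(~(DEMO_HPAGE_SIZE - 1))

/* Round up to the next huge page boundary (the CONFIG_STACK_GROWSUP arm). */
static unsigned long demo_hpage_align_up(unsigned long addr)
{
	return (addr + DEMO_HPAGE_SIZE - 1) & DEMO_HPAGE_MASK;
}

/* Round down to the previous huge page boundary (the #else arm). */
static unsigned long demo_hpage_align_down(unsigned long addr)
{
	return addr & DEMO_HPAGE_MASK;
}

/* Self-check: an unaligned address moves in opposite directions, while an
 * already aligned address is left unchanged by both helpers. */
void demo_hpage_align_self_check(void)
{
	assert(demo_hpage_align_up(0x7f123456UL) == 0x80000000UL);
	assert(demo_hpage_align_down(0x7f123456UL) == 0x7f000000UL);
	assert(demo_hpage_align_up(0x7f000000UL) == 0x7f000000UL);
	assert(demo_hpage_align_down(0x7f000000UL) == 0x7f000000UL);
}
```

With a huge page the gap between the two results is a full huge page, so
the direction of rounding decides which side of stack_top the boundary
lands on.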

-- Dave

^ permalink raw reply

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
From: Dave Hansen @ 2008-07-28 20:33 UTC (permalink / raw)
  To: Eric Munson; +Cc: linux-mm, libhugetlbfs-devel, linux-kernel, linuxppc-dev
In-Reply-To: <cover.1216928613.git.ebmunson@us.ibm.com>

On Mon, 2008-07-28 at 12:17 -0700, Eric Munson wrote:
> 
> This patch stack introduces a personality flag that indicates the kernel
> should setup the stack as a hugetlbfs-backed region. A userspace utility
> may set this flag then exec a process whose stack is to be backed by
> hugetlb pages.

I didn't see it mentioned here, but these stacks are fixed-size, right?
They can't actually grow and are fixed in size at exec() time, right?

-- Dave

^ permalink raw reply

* Re: [PATCH 4/5 V2] Build hugetlb backed process stacks
From: Dave Hansen @ 2008-07-28 20:37 UTC (permalink / raw)
  To: Eric Munson; +Cc: linux-mm, libhugetlbfs-devel, linux-kernel, linuxppc-dev
In-Reply-To: <34bf5c7a2116bc6bd16b4235bc1cf84395ee561e.1216928613.git.ebmunson@us.ibm.com>

On Mon, 2008-07-28 at 12:17 -0700, Eric Munson wrote:
> 
> +static int move_to_huge_pages(struct linux_binprm *bprm,
> +                               struct vm_area_struct *vma, unsigned long shift)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       struct vm_area_struct *new_vma;
> +       unsigned long old_end = vma->vm_end;
> +       unsigned long old_start = vma->vm_start;
> +       unsigned long new_end = old_end - shift;
> +       unsigned long new_start, length;
> +       unsigned long arg_size = new_end - bprm->p;
> +       unsigned long flags = vma->vm_flags;
> +       struct file *hugefile = NULL;
> +       unsigned int stack_hpages = 0;
> +       struct page **from_pages = NULL;
> +       struct page **to_pages = NULL;
> +       unsigned long num_pages = (arg_size / PAGE_SIZE) + 1;
> +       int ret;
> +       int i;
> +
> +#ifdef CONFIG_STACK_GROWSUP

Why do you have the #ifdef for the CONFIG_STACK_GROWSUP=y case in that
first patch if you don't support CONFIG_STACK_GROWSUP=y?

I think it might be worth some time to break this up a wee little bit.
16 local variables is a bit on the beefy side. :)

-- Dave

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2008-07-28 21:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev, akpm, Linux Kernel list
In-Reply-To: <alpine.LFD.1.10.0807280838580.3486@nehalem.linux-foundation.org>

On Mon, 2008-07-28 at 08:40 -0700, Linus Torvalds wrote:
> 
> On Mon, 28 Jul 2008, Benjamin Herrenschmidt wrote:
> > 
> > It's all in:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc.git merge
> 
> It doesn't really seem to be. I get "Already up-to-date.", and the top 
> commit there seems to be from July 3..
> 
> Forgot to push?

No, forgot to s/paulus/benh in the path :-)
> 
> > (Hopefully no silly non-printable character this time, at least
> > nothing I manage to spot with evo but who knows...)
> 
> Yeah, no odd whitespace here either. Not that it helps ;)

Cheers,
Ben.

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2008-07-28 21:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stephen Rothwell, akpm, Linux Kernel list, linuxppc-dev
In-Reply-To: <alpine.LFD.1.10.0807280905500.3486@nehalem.linux-foundation.org>

On Mon, 2008-07-28 at 09:06 -0700, Linus Torvalds wrote:
> 
> On Tue, 29 Jul 2008, Stephen Rothwell wrote:
> > 
> > It should be
> > 
> > 	git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge
> > 
> > Ben seems to have copied from one of Paul's pull requests.
> 
> Ok, that one worked for me.
> 
> Ben, I'm sure some day you'll get it right on the first try. We're all 
> cheering for you!

Should I hang out with a brown paper bag on my head all day today ?

Cheers,
Ben.

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2008-07-28 21:16 UTC (permalink / raw)
  To: Grant Likely
  Cc: Stephen Rothwell, akpm, Linus Torvalds, Linux Kernel list,
	linuxppc-dev
In-Reply-To: <20080728162036.GA21039@secretlab.ca>

On Mon, 2008-07-28 at 10:20 -0600, Grant Likely wrote:
> On Mon, Jul 28, 2008 at 09:06:35AM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Tue, 29 Jul 2008, Stephen Rothwell wrote:
> > > 
> > > It should be
> > > 
> > > 	git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge
> > > 
> > > Ben seems to have copied from one of Paul's pull requests.
> > 
> > Ok, that one worked for me.
> > 
> > Ben, I'm sure some day you'll get it right on the first try. We're all 
> > cheering for you!
> 
> Ben! Ben! He's our man!  If he can't grok it, no-one can!  :-)
> 
> git-request-pull has saved me from many a bogus pull request.

Ah, I didn't know about this critter! I'm learning new things every day,
wow!

Ben.

^ permalink raw reply

* [PATCH 5/5] sh: Define elfcorehdr_addr in arch dependent section
From: Vivek Goyal @ 2008-07-28 21:15 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tony Luck, linux-ia64, kexec, linux-kernel, linuxppc-dev,
	Terry Loftin, Paul Mundt, Paul Mackerras, Eric W. Biederman,
	Andrew Morton, Linus Torvalds, Ingo Molnar
In-Reply-To: <20080728211408.GD9985@redhat.com>




o Move the elfcorehdr_addr definition into the arch-dependent crash dump file.
  This is equivalent to defining elfcorehdr_addr under CONFIG_CRASH_DUMP
  instead of CONFIG_PROC_VMCORE. It is needed by is_kdump_kernel(), which can
  be used regardless of whether CONFIG_PROC_VMCORE is enabled.

o I don't see the sh setup code parsing the command line for elfcorehdr_addr,
  and I wonder how the vmcore interface works on sh. In any case, I am at
  least defining elfcorehdr_addr so that compilation is not broken on sh.
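
For reference, command-line parsing of the kind that appears to be missing
on sh can be modelled in userspace as below. This is a sketch under stated
assumptions, not the code any architecture actually uses: strtoull() stands
in for the kernel's own parser, the strstr() lookup ignores token boundaries,
and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sentinel meaning "no elfcorehdr= option seen", modelled on
 * ELFCORE_ADDR_MAX (-1ULL) from include/linux/crash_dump.h. */
#define DEMO_ELFCORE_ADDR_MAX	(-1ULL)

/* Return the address given as "elfcorehdr=<addr>" in cmdline, or the
 * sentinel if the option is absent or has no numeric value. */
static unsigned long long demo_parse_elfcorehdr(const char *cmdline)
{
	static const char key[] = "elfcorehdr=";
	const char *p = strstr(cmdline, key);
	char *end;
	unsigned long long addr;

	if (!p)
		return DEMO_ELFCORE_ADDR_MAX;

	p += sizeof(key) - 1;		/* skip past "elfcorehdr=" */
	addr = strtoull(p, &end, 0);	/* base 0 accepts 0x-prefixed hex */
	if (end == p)			/* no digits followed the key */
		return DEMO_ELFCORE_ADDR_MAX;

	return addr;
}

/* Self-check covering the present, absent and empty-value cases. */
void demo_parse_elfcorehdr_self_check(void)
{
	assert(demo_parse_elfcorehdr("console=ttyS0 elfcorehdr=0x1000000")
	       == 0x1000000ULL);
	assert(demo_parse_elfcorehdr("console=ttyCPM0")
	       == DEMO_ELFCORE_ADDR_MAX);
	assert(demo_parse_elfcorehdr("elfcorehdr=") == DEMO_ELFCORE_ADDR_MAX);
}
```

Without such a parser the variable keeps its initial sentinel value, which
is why defining it is enough to keep the sh build working.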

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---

 arch/sh/kernel/crash_dump.c |    3 +++
 1 file changed, 3 insertions(+)

diff -puN arch/sh/kernel/crash_dump.c~fix-elfcorehdr_addr-sh arch/sh/kernel/crash_dump.c
--- linux-2.6.27-pre-rc1/arch/sh/kernel/crash_dump.c~fix-elfcorehdr_addr-sh	2008-07-28 12:17:12.000000000 -0400
+++ linux-2.6.27-pre-rc1-root/arch/sh/kernel/crash_dump.c	2008-07-28 12:17:12.000000000 -0400
@@ -10,6 +10,9 @@
 #include <linux/io.h>
 #include <asm/uaccess.h>
 
+/* Stores the physical address of elf header of crash image. */
+unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
+
 /**
  * copy_oldmem_page - copy one page from "oldmem"
  * @pfn: page frame number to be copied
_

^ permalink raw reply

* [PATCH 2/5] x86: Define elfcorehdr_addr in arch dependent section
From: Vivek Goyal @ 2008-07-28 21:11 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tony Luck, linux-ia64, kexec, linux-kernel, linuxppc-dev,
	Terry Loftin, Paul Mundt, Paul Mackerras, Eric W. Biederman,
	Andrew Morton, Linus Torvalds, Ingo Molnar
In-Reply-To: <20080728211025.GA9985@redhat.com>



o Move the elfcorehdr_addr definition into the arch-dependent crash dump file.
  This is equivalent to defining elfcorehdr_addr under CONFIG_CRASH_DUMP
  instead of CONFIG_PROC_VMCORE. It is needed by is_kdump_kernel(), which can
  be used regardless of whether CONFIG_PROC_VMCORE is enabled.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---

 arch/x86/kernel/crash_dump_32.c |    3 +++
 arch/x86/kernel/crash_dump_64.c |    3 +++
 arch/x86/kernel/setup.c         |    8 +++++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff -puN arch/x86/kernel/setup.c~fix-elfcorehdr_addr-parsing-x86 arch/x86/kernel/setup.c
--- linux-2.6.27-pre-rc1/arch/x86/kernel/setup.c~fix-elfcorehdr_addr-parsing-x86	2008-07-28 12:01:35.000000000 -0400
+++ linux-2.6.27-pre-rc1-root/arch/x86/kernel/setup.c	2008-07-28 12:01:35.000000000 -0400
@@ -558,7 +558,13 @@ static void __init reserve_standard_io_r
 
 }
 
-#ifdef CONFIG_PROC_VMCORE
+/*
+ * Note: elfcorehdr_addr is not just limited to vmcore. It is also used by
+ * is_kdump_kernel() to determine if we are booting after a panic. Hence
+ * ifdef it under CONFIG_CRASH_DUMP and not CONFIG_PROC_VMCORE.
+ */
+
+#ifdef CONFIG_CRASH_DUMP
 /* elfcorehdr= specifies the location of elf core header
  * stored by the crashed kernel. This option will be passed
  * by kexec loader to the capture kernel.
diff -puN arch/x86/kernel/crash_dump_32.c~fix-elfcorehdr_addr-parsing-x86 arch/x86/kernel/crash_dump_32.c
--- linux-2.6.27-pre-rc1/arch/x86/kernel/crash_dump_32.c~fix-elfcorehdr_addr-parsing-x86	2008-07-28 12:01:35.000000000 -0400
+++ linux-2.6.27-pre-rc1-root/arch/x86/kernel/crash_dump_32.c	2008-07-28 12:01:35.000000000 -0400
@@ -13,6 +13,9 @@
 
 static void *kdump_buf_page;
 
+/* Stores the physical address of elf header of crash image. */
+unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
+
 /**
  * copy_oldmem_page - copy one page from "oldmem"
  * @pfn: page frame number to be copied
diff -puN arch/x86/kernel/crash_dump_64.c~fix-elfcorehdr_addr-parsing-x86 arch/x86/kernel/crash_dump_64.c
--- linux-2.6.27-pre-rc1/arch/x86/kernel/crash_dump_64.c~fix-elfcorehdr_addr-parsing-x86	2008-07-28 12:01:35.000000000 -0400
+++ linux-2.6.27-pre-rc1-root/arch/x86/kernel/crash_dump_64.c	2008-07-28 12:01:35.000000000 -0400
@@ -11,6 +11,9 @@
 #include <asm/uaccess.h>
 #include <asm/io.h>
 
+/* Stores the physical address of elf header of crash image. */
+unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
+
 /**
  * copy_oldmem_page - copy one page from "oldmem"
  * @pfn: page frame number to be copied
_

^ permalink raw reply

* [PATCH 3/5] ia64: Define elfcorehdr_addr in arch dependent section
From: Vivek Goyal @ 2008-07-28 21:13 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tony Luck, linux-ia64, kexec, linux-kernel, linuxppc-dev,
	Terry Loftin, Paul Mundt, Paul Mackerras, Eric W. Biederman,
	Andrew Morton, Linus Torvalds, Ingo Molnar
In-Reply-To: <20080728211156.GB9985@redhat.com>


o Move the elfcorehdr_addr definition into the arch-dependent crash dump file.
  This is equivalent to defining elfcorehdr_addr under CONFIG_CRASH_DUMP
  instead of CONFIG_PROC_VMCORE. It is needed by is_kdump_kernel(), which can
  be used regardless of whether CONFIG_PROC_VMCORE is enabled.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---

 arch/ia64/kernel/setup.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff -puN arch/ia64/kernel/setup.c~fix-elfcorehdr_addr-parsing-ia64 arch/ia64/kernel/setup.c
--- linux-2.6.27-pre-rc1/arch/ia64/kernel/setup.c~fix-elfcorehdr_addr-parsing-ia64	2008-07-28 12:16:40.000000000 -0400
+++ linux-2.6.27-pre-rc1-root/arch/ia64/kernel/setup.c	2008-07-28 12:16:40.000000000 -0400
@@ -478,7 +478,12 @@ static __init int setup_nomca(char *s)
 }
 early_param("nomca", setup_nomca);
 
-#ifdef CONFIG_PROC_VMCORE
+/*
+ * Note: elfcorehdr_addr is not just limited to vmcore. It is also used by
+ * is_kdump_kernel() to determine if we are booting after a panic. Hence
+ * ifdef it under CONFIG_CRASH_DUMP and not CONFIG_PROC_VMCORE.
+ */
+#ifdef CONFIG_CRASH_DUMP
 /* elfcorehdr= specifies the location of elf core header
  * stored by the crashed kernel.
  */
@@ -491,7 +496,9 @@ static int __init parse_elfcorehdr(char 
 	return 0;
 }
 early_param("elfcorehdr", parse_elfcorehdr);
+#endif
 
+#ifdef CONFIG_PROC_VMCORE
 int __init reserve_elfcorehdr(unsigned long *start, unsigned long *end)
 {
 	unsigned long length;
_

^ permalink raw reply

* [PATCH 4/5] powerpc: Define elfcorehdr_addr in arch dependent section
From: Vivek Goyal @ 2008-07-28 21:14 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tony Luck, linux-ia64, kexec, linux-kernel, linuxppc-dev,
	Terry Loftin, Paul Mundt, Paul Mackerras, Eric W. Biederman,
	Andrew Morton, Linus Torvalds, Ingo Molnar
In-Reply-To: <20080728211314.GC9985@redhat.com>



o Move the elfcorehdr_addr definition into the arch-dependent crash dump file.
  This is equivalent to defining elfcorehdr_addr under CONFIG_CRASH_DUMP
  instead of CONFIG_PROC_VMCORE. It is needed by is_kdump_kernel(), which can
  be used regardless of whether CONFIG_PROC_VMCORE is enabled.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---

 arch/powerpc/kernel/crash_dump.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff -puN arch/powerpc/kernel/crash_dump.c~fix-elfcorehdr_addr-parsing-ppc64 arch/powerpc/kernel/crash_dump.c
--- linux-2.6.27-pre-rc1/arch/powerpc/kernel/crash_dump.c~fix-elfcorehdr_addr-parsing-ppc64	2008-07-28 12:14:22.000000000 -0400
+++ linux-2.6.27-pre-rc1-root/arch/powerpc/kernel/crash_dump.c	2008-07-28 12:14:22.000000000 -0400
@@ -27,6 +27,9 @@
 #define DBG(fmt...)
 #endif
 
+/* Stores the physical address of elf header of crash image. */
+unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
+
 void __init reserve_kdump_trampoline(void)
 {
 	lmb_reserve(0, KDUMP_RESERVE_LIMIT);
@@ -66,7 +69,11 @@ void __init setup_kdump_trampoline(void)
 	DBG(" <- setup_kdump_trampoline()\n");
 }
 
-#ifdef CONFIG_PROC_VMCORE
+/*
+ * Note: elfcorehdr_addr is not just limited to vmcore. It is also used by
+ * is_kdump_kernel() to determine if we are booting after a panic. Hence
+ * ifdef it under CONFIG_CRASH_DUMP and not CONFIG_PROC_VMCORE.
+ */
 static int __init parse_elfcorehdr(char *p)
 {
 	if (p)
@@ -75,7 +82,6 @@ static int __init parse_elfcorehdr(char 
 	return 1;
 }
 __setup("elfcorehdr=", parse_elfcorehdr);
-#endif
 
 static int __init parse_savemaxmem(char *p)
 {
_

^ permalink raw reply

* [PATCH 1/5] Move elfcorehdr_addr out of vmcore.c (Was: Re: [patch] crashdump: fix undefined reference to `elfcorehdr_addr')
From: Vivek Goyal @ 2008-07-28 21:10 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tony Luck, linux-ia64, kexec, linux-kernel, linuxppc-dev,
	Terry Loftin, Paul Mundt, Paul Mackerras, Eric W. Biederman,
	Andrew Morton, Linus Torvalds, Ingo Molnar
In-Reply-To: <20080728034007.GA30450@verge.net.au>

Hi All,

How does the following series of patches look? I have moved
elfcorehdr_addr out of vmcore.c and into the arch-dependent section
of crash dump so that it can be used even when
CONFIG_PROC_VMCORE is disabled and CONFIG_CRASH_DUMP is enabled.

I tested it on x86_64. Compile tested it on i386 and ppc64. ia64 and
sh versions are completely untested.

Thanks
Vivek




o elfcorehdr_addr is used not only by code under CONFIG_PROC_VMCORE but also
  by code outside it. For example, powerpc code uses is_kdump_kernel() to
  determine whether the kernel is booting after a panic and, if so, to reuse
  the previous kernel's TCE table. So even if CONFIG_PROC_VMCORE is not set
  in the second kernel, one should be able to determine correctly that we
  are booting after a panic and set up the calgary iommu accordingly.

o So remove the assumption that elfcorehdr_addr is under CONFIG_PROC_VMCORE.

o Move the definition of elfcorehdr_addr to the arch-dependent crash files.
  (Unfortunately, crash dump does not have an arch-independent file;
  otherwise that would have been the best place.)

o kexec.c is not the right place, as one can have CRASH_DUMP enabled in the
  second kernel without KEXEC being enabled.
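
The sentinel check that this series relies on can be modelled in userspace
as follows. This is a minimal sketch, not the kernel code: the demo_* names
are hypothetical, and the variable here is file-local, whereas in the series
each architecture's crash dump file provides the real definition.

```c
#include <assert.h>

/* Modelled on ELFCORE_ADDR_MAX (-1ULL) in include/linux/crash_dump.h. */
#define DEMO_ELFCORE_ADDR_MAX	(-1ULL)

/* Stays at the sentinel unless a previous kernel passed an ELF core
 * header address on the command line. */
static unsigned long long demo_elfcorehdr_addr = DEMO_ELFCORE_ADDR_MAX;

/* Nonzero only when an address was actually recorded, independent of
 * whether any /proc/vmcore support is present. */
static int demo_is_kdump_kernel(void)
{
	return (demo_elfcorehdr_addr != DEMO_ELFCORE_ADDR_MAX) ? 1 : 0;
}

/* Self-check: flips from "normal boot" to "kdump boot" once an address
 * is stored, then restores the initial state. */
void demo_is_kdump_kernel_self_check(void)
{
	unsigned long long saved = demo_elfcorehdr_addr;

	demo_elfcorehdr_addr = DEMO_ELFCORE_ADDR_MAX;
	assert(demo_is_kdump_kernel() == 0);

	demo_elfcorehdr_addr = 0x2000000ULL;	/* arbitrary example address */
	assert(demo_is_kdump_kernel() == 1);

	demo_elfcorehdr_addr = saved;
}
```

Because the check depends only on the variable's value, it keeps working
when the vmcore code is compiled out, provided the definition lives
somewhere that is always built.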

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---

 fs/proc/vmcore.c           |    3 ---
 include/linux/crash_dump.h |   14 ++++++++++----
 2 files changed, 10 insertions(+), 7 deletions(-)

diff -puN fs/proc/vmcore.c~remove-elfcore-hdr-addr-definition-vmcore fs/proc/vmcore.c
--- linux-2.6.27-pre-rc1/fs/proc/vmcore.c~remove-elfcore-hdr-addr-definition-vmcore	2008-07-28 09:19:50.000000000 -0400
+++ linux-2.6.27-pre-rc1-root/fs/proc/vmcore.c	2008-07-28 09:20:10.000000000 -0400
@@ -32,9 +32,6 @@ static size_t elfcorebuf_sz;
 /* Total size of vmcore file. */
 static u64 vmcore_size;
 
-/* Stores the physical address of elf header of crash image. */
-unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
-
 struct proc_dir_entry *proc_vmcore = NULL;
 
 /* Reads a page from the oldmem device from given offset. */
diff -puN include/linux/crash_dump.h~remove-elfcore-hdr-addr-definition-vmcore include/linux/crash_dump.h
--- linux-2.6.27-pre-rc1/include/linux/crash_dump.h~remove-elfcore-hdr-addr-definition-vmcore	2008-07-28 12:00:44.000000000 -0400
+++ linux-2.6.27-pre-rc1-root/include/linux/crash_dump.h	2008-07-28 12:00:56.000000000 -0400
@@ -9,11 +9,7 @@
 
 #define ELFCORE_ADDR_MAX	(-1ULL)
 
-#ifdef CONFIG_PROC_VMCORE
 extern unsigned long long elfcorehdr_addr;
-#else
-static const unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
-#endif
 
 extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
 						unsigned long, int);
@@ -28,6 +24,16 @@ extern struct proc_dir_entry *proc_vmcor
 
 #define vmcore_elf_check_arch(x) (elf_check_arch(x) || vmcore_elf_check_arch_cross(x))
 
+/*
+ * is_kdump_kernel() checks whether this kernel is booting after a panic of
+ * previous kernel or not. This is determined by checking if previous kernel
+ * has passed the elf core header address on command line.
+ *
+ * This is not just a test if CONFIG_CRASH_DUMP is enabled or not. It will
+ * return 1 if CONFIG_CRASH_DUMP=y and if kernel is booting after a panic of
+ * previous kernel.
+ */
+
 static inline int is_kdump_kernel(void)
 {
 	return (elfcorehdr_addr != ELFCORE_ADDR_MAX) ? 1 : 0;
_

^ permalink raw reply

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
From: Eric B Munson @ 2008-07-28 21:23 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, libhugetlbfs-devel, linux-kernel, linuxppc-dev
In-Reply-To: <1217277204.23502.36.camel@nimitz>

On Mon, 28 Jul 2008, Dave Hansen wrote:

> On Mon, 2008-07-28 at 12:17 -0700, Eric Munson wrote:
> > 
> > This patch stack introduces a personality flag that indicates the kernel
> > should setup the stack as a hugetlbfs-backed region. A userspace utility
> > may set this flag then exec a process whose stack is to be backed by
> > hugetlb pages.
> 
> I didn't see it mentioned here, but these stacks are fixed-size, right?
> They can't actually grow and are fixed in size at exec() time, right?
> 
> -- Dave

The stack VMA is a fixed size but the pages will be faulted in as needed.

-- 
Eric B Munson
IBM Linux Technology Center
ebmunson@us.ibm.com


^ permalink raw reply

* Re: [PATCH 1/5] Move elfcorehdr_addr out of vmcore.c (Was: Re: [patch] crashdump: fix undefined reference to `elfcorehdr_addr')
From: Eric W. Biederman @ 2008-07-28 22:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tony Luck, linux-ia64, Paul Mundt, kexec, linux-kernel,
	linuxppc-dev, Terry Loftin, Simon Horman, Paul Mackerras,
	Andrew Morton, Linus Torvalds, Ingo Molnar
In-Reply-To: <20080728211025.GA9985@redhat.com>

Vivek Goyal <vgoyal@redhat.com> writes:

> Hi All,
>
> How does the following series of patches look? I have moved
> elfcorehdr_addr out of vmcore.c and into the arch-dependent section
> of crash dump so that it can be used even when
> CONFIG_PROC_VMCORE is disabled and CONFIG_CRASH_DUMP is enabled.
>
> I tested it on x86_64. Compile tested it on i386 and ppc64. ia64 and
> sh versions are completely untested.

Given the current state of the code:
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

To process a kernel crash dump we pass the kernel the elfcorehdr option, so
testing to see if it was passed seems reasonable.

In general I think this method of handling the problems with kdump is
too brittle to live, but in the case of iommus we certainly need to do something
different, and unfortunately iommus were not common on x86 when the original code
was merged so we have not handled them well.

Eric

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Grant Likely @ 2008-07-28 22:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Stephen Rothwell, akpm, Linus Torvalds, Linux Kernel list,
	linuxppc-dev
In-Reply-To: <1217279736.11188.215.camel@pasglop>

On Tue, Jul 29, 2008 at 07:15:36AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2008-07-28 at 09:06 -0700, Linus Torvalds wrote:
> > 
> > On Tue, 29 Jul 2008, Stephen Rothwell wrote:
> > > 
> > > It should be
> > > 
> > > 	git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge
> > > 
> > > Ben seems to have copied from one of Paul's pull requests.
> > 
> > Ok, that one worked for me.
> > 
> > Ben, I'm sure some day you'll get it right on the first try. We're all 
> > cheering for you!
> 
> Should I hang out with a brown paper bag on my head all day today ?

No, that's reserved for committing things to the top level Makefile for
no reason.

g.

^ permalink raw reply

