[PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR

linux-api.vger.kernel.org archive mirror

* [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
       [not found] <20170217141328.164563-1-kirill.shutemov@linux.intel.com>
@ 2017-02-17 14:13 ` Kirill A. Shutemov
       [not found]   ` <20170217141328.164563-34-kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2017-02-17 20:02   ` Linus Torvalds
  0 siblings, 2 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2017-02-17 14:13 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Catalin Marinas, linux-api

This patch introduces two new prctl(2) handles to manage the maximum
virtual address that userspace is allowed to map.

On x86, 5-level paging enables a 56-bit userspace virtual address space.
Not all userspace is ready to handle wide addresses. At least some JIT
compilers are known to use the high bits of pointers to encode their own
information. With 5-level paging, those bits collide with valid pointers
and lead to crashes.

The patch aims to address this compatibility issue.

MM would use this address, instead of TASK_SIZE, as the upper limit of
the virtual address range that userspace can map.

The limit will be equal to TASK_SIZE everywhere except on machines
with 5-level paging enabled. In that case, the default limit would be
(1UL << 47) - PAGE_SIZE. That is the current x86-64 TASK_SIZE_MAX with
4-level paging, which is known to be safe.

Changing the limit affects only future virtual address space
allocations. Existing VMAs are left intact.

MPX cannot currently handle addresses above 47 bits, so we refuse to
raise the limit above 47 bits while MPX is enabled. We also refuse to
enable MPX if the limit is already above 47 bits or if there is a VMA
above the 47-bit boundary.
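
As an illustration only, the acceptance rules described above can be
modelled in plain userspace C. This is a sketch, not code from the patch:
struct mm_model, model_set_max_vaddr() and the 4096-byte page size are
assumptions made for the example, and the boolean return value is a
choice of the sketch rather than the behaviour of set_max_vaddr().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on typical x86-64 configurations. */
#define PAGE_SIZE_ASSUMED	4096ULL
/* Default limit from the description: (1UL << 47) - PAGE_SIZE. */
#define LIMIT_47BIT		((1ULL << 47) - PAGE_SIZE_ASSUMED)
/* Upper bound with 5-level paging: 56-bit user space minus one page. */
#define LIMIT_56BIT		((1ULL << 56) - PAGE_SIZE_ASSUMED)

/* Hypothetical stand-in for the per-mm state the patch touches. */
struct mm_model {
	uint64_t max_vaddr;	/* limit applied to future mappings */
	bool mpx_enabled;	/* whether MPX management is active */
};

/* Returns true if the new limit was accepted, false if it was refused. */
static bool model_set_max_vaddr(struct mm_model *mm, uint64_t addr)
{
	/* Never allow a limit beyond what the hardware mode supports. */
	if (addr > LIMIT_56BIT)
		return false;
	/* MPX cannot handle addresses above the 47-bit boundary. */
	if (addr > LIMIT_47BIT && mm->mpx_enabled)
		return false;
	mm->max_vaddr = addr;
	return true;
}
```

In this model a process without MPX can raise its limit up to the
56-bit bound, while a process with MPX enabled stays at or below the
47-bit default.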

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-api@vger.kernel.org
---
 arch/x86/include/asm/elf.h         |  2 +-
 arch/x86/include/asm/mmu.h         |  2 ++
 arch/x86/include/asm/mmu_context.h |  1 +
 arch/x86/include/asm/processor.h   | 25 ++++++++++++++++++++-----
 arch/x86/kernel/process.c          | 18 ++++++++++++++++++
 arch/x86/kernel/sys_x86_64.c       |  6 +++---
 arch/x86/mm/hugetlbpage.c          |  8 ++++----
 arch/x86/mm/mmap.c                 |  4 ++--
 arch/x86/mm/mpx.c                  | 17 ++++++++++++++++-
 fs/binfmt_aout.c                   |  2 --
 fs/binfmt_elf.c                    | 10 +++++-----
 fs/hugetlbfs/inode.c               |  6 +++---
 include/linux/sched.h              |  8 ++++++++
 include/uapi/linux/prctl.h         |  3 +++
 kernel/events/uprobes.c            |  5 +++--
 kernel/sys.c                       | 23 ++++++++++++++++++++---
 mm/mmap.c                          | 20 +++++++++++---------
 mm/mremap.c                        |  3 ++-
 mm/nommu.c                         |  2 +-
 mm/shmem.c                         |  8 ++++----
 20 files changed, 127 insertions(+), 46 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e7f155c3045e..5ce6f2b2b105 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(mmap_max_addr() / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index f9813b6d8b80..174dc3b60165 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -35,6 +35,8 @@ typedef struct {
 	/* address of the bounds directory */
 	void __user *bd_addr;
 #endif
+	/* maximum virtual address the process can create VMA at */
+	unsigned long max_vaddr;
 } mm_context_t;
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 306c7e12af55..50bdfd6ab866 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -117,6 +117,7 @@ static inline int init_new_context(struct task_struct *tsk,
 	}
 	#endif
 	init_new_context_ldt(tsk, mm);
+	mm->context.max_vaddr = MAX_VADDR_DEFAULT;
 
 	return 0;
 }
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index e6cfe7ba2d65..173f9a6b3b6b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -789,8 +789,9 @@ static inline void spin_lock_prefetch(const void *x)
  */
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
-#define STACK_TOP		TASK_SIZE
-#define STACK_TOP_MAX		STACK_TOP
+#define MAX_VADDR_DEFAULT	TASK_SIZE
+#define STACK_TOP		mmap_max_addr()
+#define STACK_TOP_MAX		TASK_SIZE
 
 #define INIT_THREAD  {							  \
 	.sp0			= TOP_OF_INIT_STACK,			  \
@@ -828,7 +829,14 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+/*
+ * Default maximum virtual address. This is required for
+ * compatibility with applications that assume 47-bit VA.
+ * The limit can be changed with prctl(PR_SET_MAX_VADDR).
+ */
+#define MAX_VADDR_DEFAULT	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -841,7 +849,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		mmap_max_addr()
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -863,7 +871,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(mmap_max_addr() / 3))
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
@@ -892,6 +900,13 @@ static inline int mpx_disable_management(void)
 }
 #endif /* CONFIG_X86_INTEL_MPX */
 
+extern unsigned long set_max_vaddr(unsigned long addr);
+
+#define SET_MAX_VADDR(addr)	set_max_vaddr(addr)
+#define GET_MAX_VADDR()		READ_ONCE(current->mm->context.max_vaddr)
+
+#define mmap_max_addr() min(TASK_SIZE, GET_MAX_VADDR())
+
 extern u16 amd_get_nb_id(int cpu);
 extern u32 amd_get_nodes_per_socket(void);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b615a1113f58..ddc5af35f146 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -32,6 +32,7 @@
 #include <asm/mce.h>
 #include <asm/vm86.h>
 #include <asm/switch_to.h>
+#include <asm/mpx.h>
 
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -536,3 +537,20 @@ unsigned long get_wchan(struct task_struct *p)
 	put_task_stack(p);
 	return ret;
 }
+
+unsigned long set_max_vaddr(unsigned long addr)
+{
+	down_write(&current->mm->mmap_sem);
+	if (addr > TASK_SIZE_MAX)
+		goto out;
+	/*
+	 * MPX cannot handle addresses above 47-bits. Refuse to increase
+	 * max_vaddr above the limit if MPX is enabled.
+	 */
+	if (addr > MAX_VADDR_DEFAULT && kernel_managing_mpx_tables(current->mm))
+		goto out;
+	current->mm->context.max_vaddr = addr;
+out:
+	up_write(&current->mm->mmap_sem);
+	return 0;
+}
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index a55ed63b9f91..e31f5b0c5468 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -115,7 +115,7 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 		}
 	} else {
 		*begin = current->mm->mmap_legacy_base;
-		*end = TASK_SIZE;
+		*end = mmap_max_addr();
 	}
 }
 
@@ -168,7 +168,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	struct vm_unmapped_area_info info;
 
 	/* requested length too big for entire address space */
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED)
@@ -182,7 +182,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr &&
+		if (mmap_max_addr() - len >= addr &&
 				(!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 2ae8584b44c7..b55b04b82097 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -82,7 +82,7 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = current->mm->mmap_legacy_base;
-	info.high_limit = TASK_SIZE;
+	info.high_limit = mmap_max_addr();
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
@@ -114,7 +114,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = mmap_max_addr();
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -131,7 +131,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
@@ -143,7 +143,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	if (addr) {
 		addr = ALIGN(addr, huge_page_size(h));
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr &&
+		if (mmap_max_addr() - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index d2dc0438d654..c22f0b802576 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -52,7 +52,7 @@ static unsigned long stack_maxrandom_size(void)
  * Leave an at least ~128 MB hole with possible stack randomization.
  */
 #define MIN_GAP (128*1024*1024UL + stack_maxrandom_size())
-#define MAX_GAP (TASK_SIZE/6*5)
+#define MAX_GAP (mmap_max_addr()/6*5)
 
 static int mmap_is_legacy(void)
 {
@@ -90,7 +90,7 @@ static unsigned long mmap_base(unsigned long rnd)
 	else if (gap > MAX_GAP)
 		gap = MAX_GAP;
 
-	return PAGE_ALIGN(TASK_SIZE - gap - rnd);
+	return PAGE_ALIGN(mmap_max_addr() - gap - rnd);
 }
 
 /*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index af59f808742f..c19707d3e104 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -354,10 +354,25 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/*
+	 * MPX doesn't support addresses above 47-bits yet.
+	 * Make sure it's not allowed to map above the limit and nothing is
+	 * mapped there before enabling.
+	 */
+	if (mmap_max_addr() > MAX_VADDR_DEFAULT ||
+			find_vma(mm, MAX_VADDR_DEFAULT)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
+
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 2a59139f520b..7a7f6dba6b00 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -121,8 +121,6 @@ static struct linux_binfmt aout_format = {
 	.min_coredump	= PAGE_SIZE
 };
 
-#define BAD_ADDR(x)	((unsigned long)(x) >= TASK_SIZE)
-
 static int set_brk(unsigned long start, unsigned long end)
 {
 	start = PAGE_ALIGN(start);
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 422370293cfd..b5dbea735c6d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -89,7 +89,7 @@ static struct linux_binfmt elf_format = {
 	.min_coredump	= ELF_EXEC_PAGESIZE,
 };
 
-#define BAD_ADDR(x) ((unsigned long)(x) >= TASK_SIZE)
+#define BAD_ADDR(x) ((unsigned long)(x) >= mmap_max_addr())
 
 static int set_brk(unsigned long start, unsigned long end)
 {
@@ -587,8 +587,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 			k = load_addr + eppnt->p_vaddr;
 			if (BAD_ADDR(k) ||
 			    eppnt->p_filesz > eppnt->p_memsz ||
-			    eppnt->p_memsz > TASK_SIZE ||
-			    TASK_SIZE - eppnt->p_memsz < k) {
+			    eppnt->p_memsz > mmap_max_addr() ||
+			    mmap_max_addr() - eppnt->p_memsz < k) {
 				error = -ENOMEM;
 				goto out;
 			}
@@ -960,8 +960,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		 * <= p_memsz so it is only necessary to check p_memsz.
 		 */
 		if (BAD_ADDR(k) || elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
-		    elf_ppnt->p_memsz > TASK_SIZE ||
-		    TASK_SIZE - elf_ppnt->p_memsz < k) {
+		    elf_ppnt->p_memsz > mmap_max_addr() ||
+		    mmap_max_addr() - elf_ppnt->p_memsz < k) {
 			/* set_brk can never work. Avoid overflows. */
 			retval = -EINVAL;
 			goto out_free_dentry;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 54de77e78775..e132e93b85fb 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -178,7 +178,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
@@ -190,7 +190,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	if (addr) {
 		addr = ALIGN(addr, huge_page_size(h));
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr &&
+		if (mmap_max_addr() - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
@@ -198,7 +198,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = TASK_UNMAPPED_BASE;
-	info.high_limit = TASK_SIZE;
+	info.high_limit = mmap_max_addr();
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9ec61f7..bf47a62fde5d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3671,4 +3671,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
 void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
+#ifndef mmap_max_addr
+#define mmap_max_addr mmap_max_addr
+static inline unsigned long mmap_max_addr(void)
+{
+	return TASK_SIZE;
+}
+#endif
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..e9478ccd4386 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,7 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+#define PR_SET_MAX_VADDR	48
+#define PR_GET_MAX_VADDR	49
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d416f3baf392..651f571a1a79 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1142,8 +1142,9 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 
 	if (!area->vaddr) {
 		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = get_unmapped_area(NULL,
+				mmap_max_addr() - PAGE_SIZE,
+				PAGE_SIZE, 0, 0);
 		if (area->vaddr & ~PAGE_MASK) {
 			ret = area->vaddr;
 			goto fail;
diff --git a/kernel/sys.c b/kernel/sys.c
index 842914ef7de4..366ba7be92a7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -103,6 +103,12 @@
 #ifndef SET_FP_MODE
 # define SET_FP_MODE(a,b)	(-EINVAL)
 #endif
+#ifndef SET_MAX_VADDR
+# define SET_MAX_VADDR(a)	(-EINVAL)
+#endif
+#ifndef GET_MAX_VADDR
+# define GET_MAX_VADDR()	(-EINVAL)
+#endif
 
 /*
  * this is where the system-wide overflow UID and GID are defined, for
@@ -1718,7 +1724,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
  */
 static int validate_prctl_map(struct prctl_mm_map *prctl_map)
 {
-	unsigned long mmap_max_addr = TASK_SIZE;
+	unsigned long max_addr = mmap_max_addr();
 	struct mm_struct *mm = current->mm;
 	int error = -EINVAL, i;
 
@@ -1743,7 +1749,7 @@ static int validate_prctl_map(struct prctl_mm_map *prctl_map)
 	for (i = 0; i < ARRAY_SIZE(offsets); i++) {
 		u64 val = *(u64 *)((char *)prctl_map + offsets[i]);
 
-		if ((unsigned long)val >= mmap_max_addr ||
+		if ((unsigned long)val >= max_addr ||
 		    (unsigned long)val < mmap_min_addr)
 			goto out;
 	}
@@ -1949,7 +1955,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
 	if (opt == PR_SET_MM_AUXV)
 		return prctl_set_auxv(mm, addr, arg4);
 
-	if (addr >= TASK_SIZE || addr < mmap_min_addr)
+	if (addr >= mmap_max_addr() || addr < mmap_min_addr)
 		return -EINVAL;
 
 	error = -EINVAL;
@@ -2261,6 +2267,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+	case PR_SET_MAX_VADDR:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = SET_MAX_VADDR(arg2);
+		break;
+	case PR_GET_MAX_VADDR:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = put_user(GET_MAX_VADDR(),
+				(unsigned long __user *) arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/mm/mmap.c b/mm/mmap.c
index dc4291dcc99b..a3384f23359e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1966,7 +1966,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_area_struct *vma;
 	struct vm_unmapped_area_info info;
 
-	if (len > TASK_SIZE - mmap_min_addr)
+	if (len > mmap_max_addr() - mmap_min_addr)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED)
@@ -1975,15 +1975,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr && addr >= mmap_min_addr &&
-		    (!vma || addr + len <= vma->vm_start))
+		if (mmap_max_addr() - len >= addr &&
+				addr >= mmap_min_addr &&
+				(!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
 
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = mm->mmap_base;
-	info.high_limit = TASK_SIZE;
+	info.high_limit = mmap_max_addr();
 	info.align_mask = 0;
 	return vm_unmapped_area(&info);
 }
@@ -2005,7 +2006,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	struct vm_unmapped_area_info info;
 
 	/* requested length too big for entire address space */
-	if (len > TASK_SIZE - mmap_min_addr)
+	if (len > mmap_max_addr() - mmap_min_addr)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED)
@@ -2015,7 +2016,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr && addr >= mmap_min_addr &&
+		if (mmap_max_addr() - len >= addr &&
+				addr >= mmap_min_addr &&
 				(!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
@@ -2037,7 +2039,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = mmap_max_addr();
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -2057,7 +2059,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		return error;
 
 	/* Careful about overflows.. */
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	get_area = current->mm->get_unmapped_area;
@@ -2078,7 +2080,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 	if (IS_ERR_VALUE(addr))
 		return addr;
 
-	if (addr > TASK_SIZE - len)
+	if (addr > mmap_max_addr() - len)
 		return -ENOMEM;
 	if (offset_in_page(addr))
 		return -EINVAL;
diff --git a/mm/mremap.c b/mm/mremap.c
index 2b3bfcd51c75..a8b4fba3dce6 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -433,7 +433,8 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (offset_in_page(new_addr))
 		goto out;
 
-	if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
+	if (new_len > mmap_max_addr() ||
+			new_addr > mmap_max_addr() - new_len)
 		goto out;
 
 	/* Ensure the old/new locations do not overlap */
diff --git a/mm/nommu.c b/mm/nommu.c
index 24f9f5f39145..6043b8b82083 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -905,7 +905,7 @@ static int validate_mmap_request(struct file *file,
 
 	/* Careful about overflows.. */
 	rlen = PAGE_ALIGN(len);
-	if (!rlen || rlen > TASK_SIZE)
+	if (!rlen || rlen > mmap_max_addr())
 		return -ENOMEM;
 
 	/* offset overflow? */
diff --git a/mm/shmem.c b/mm/shmem.c
index 3a7587a0314d..54d1ebfb577d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1983,7 +1983,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	unsigned long inflated_addr;
 	unsigned long inflated_offset;
 
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	get_area = current->mm->get_unmapped_area;
@@ -1995,7 +1995,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 		return addr;
 	if (addr & ~PAGE_MASK)
 		return addr;
-	if (addr > TASK_SIZE - len)
+	if (addr > mmap_max_addr() - len)
 		return addr;
 
 	if (shmem_huge == SHMEM_HUGE_DENY)
@@ -2038,7 +2038,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 		return addr;
 
 	inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
-	if (inflated_len > TASK_SIZE)
+	if (inflated_len > mmap_max_addr())
 		return addr;
 	if (inflated_len < len)
 		return addr;
@@ -2054,7 +2054,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	if (inflated_offset > offset)
 		inflated_addr += HPAGE_PMD_SIZE;
 
-	if (inflated_addr > TASK_SIZE - len)
+	if (inflated_addr > mmap_max_addr() - len)
 		return addr;
 	return inflated_addr;
 }
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
       [not found]   ` <20170217141328.164563-34-kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2017-02-17 16:50     ` Andy Lutomirski
  2017-02-21 11:54       ` Dmitry Safonov
  2017-02-17 17:19     ` Dave Hansen
  1 sibling, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2017-02-17 16:50 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dmitry Safonov
  Cc: Linus Torvalds, Andrew Morton, X86 ML, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Catalin Marinas, Linux API

On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
<kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> wrote:
> This patch introduces two new prctl(2) handles to manage maximum virtual
> address available to userspace to map.
>
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> The patch aims to address this compatibility issue.
>
> MM would use the address as upper limit of virtual address available to
> map by userspace, instead of TASK_SIZE.
>
> The limit will be equal to TASK_SIZE everywhere, but the machine
> with 5-level paging enabled. In this case, the default limit would be
> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> paging which known to be safe.


I think this patch needs to be split up.  In particular, the addition
and use of mmap_max_addr() should be its own patch that doesn't change
any semantics.

> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 306c7e12af55..50bdfd6ab866 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -117,6 +117,7 @@ static inline int init_new_context(struct task_struct *tsk,
>         }
>         #endif
>         init_new_context_ldt(tsk, mm);
> +       mm->context.max_vaddr = MAX_VADDR_DEFAULT;

Is this actually correct for 32-bit binaries?  Although, given the
stuff Dmitry is working on, it might pay to separately track the
32-bit and 64-bit limits per mm.  If you haven't been following it,
Dmitry is trying to fix a bug in which an explicit 32-bit syscall
(int80 or similar) in an otherwise 64-bit process can allocate a VMA
above 4GB that gets truncated.
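
To make the truncation concrete, here is a small sketch. It assumes the
32-bit caller only ever sees the low 32 bits of the address; the helper
name seen_by_compat_caller() is made up for the example and is not a
kernel function.

```c
#include <stdint.h>

/*
 * Value a 32-bit caller would observe if only the low 32 bits of a
 * 64-bit mapping address are handed back to it.
 */
static uint32_t seen_by_compat_caller(uint64_t vma_start)
{
	return (uint32_t)vma_start;
}
```

For a mapping placed at 0x100001000 (just above 4GB) the caller would
see 0x1000, which points at unrelated memory, while any address below
4GB passes through unchanged.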

Also, why the macro?  Why not just put the number in here?

> -#define TASK_SIZE_MAX  ((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX  ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)

This should be in the

> -#define STACK_TOP              TASK_SIZE
> +#define STACK_TOP              mmap_max_addr()

Off the top of my head, this looks wrong.  The 32-bit check got lost, I think.

> +unsigned long set_max_vaddr(unsigned long addr)
> +{

Perhaps this function could set a different field depending on
is_compat_syscall().


Anyway, can you and Dmitry try to reconcile your patches?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
       [not found]   ` <20170217141328.164563-34-kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2017-02-17 16:50     ` Andy Lutomirski
@ 2017-02-17 17:19     ` Dave Hansen
  2017-02-17 17:21       ` Andy Lutomirski
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2017-02-17 17:19 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	x86-DgEjT+Ai2ygdnm+yROfE0A, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Andy Lutomirski, linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Catalin Marinas,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On 02/17/2017 06:13 AM, Kirill A. Shutemov wrote:
> +/*
> + * Default maximum virtual address. This is required for
> + * compatibility with applications that assume 47-bit VA.
> + * The limit can be changed with prctl(PR_SET_MAX_VADDR).
> + */
> +#define MAX_VADDR_DEFAULT	((1UL << 47) - PAGE_SIZE)

This is a bit goofy.  It's not the largest virtual address that can be
accessed, but the beginning of the last page.

Isn't this easier to deal with in userspace if we make it a "limit", so
we can do:

	if (addr >= limit)
		// error

Now, we have to do:
	
	prctl(PR_GET_MAX_VADDR, &max_vaddr, 0, 0, 0);
	if (addr > (max_vaddr + PAGE_SIZE))
		// error

I don't care what you track in the kernel, but I think we need to
provide a more usable number out to userspace.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 17:19     ` Dave Hansen
@ 2017-02-17 17:21       ` Andy Lutomirski
  0 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-02-17 17:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, linux-arch, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Catalin Marinas, Linux API

On Fri, Feb 17, 2017 at 9:19 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 02/17/2017 06:13 AM, Kirill A. Shutemov wrote:
>> +/*
>> + * Default maximum virtual address. This is required for
>> + * compatibility with applications that assume 47-bit VA.
>> + * The limit can be changed with prctl(PR_SET_MAX_VADDR).
>> + */
>> +#define MAX_VADDR_DEFAULT    ((1UL << 47) - PAGE_SIZE)
>
> This is a bit goofy.  It's not the largest virtual address that can be
> accessed, but the beginning of the last page.

No, it really is the limit.  We don't allow user code to map the last
page because it would be a root hole due to SYSRET.  Thanks, Intel.
See the comment near TASK_SIZE_MAX IIRC.

--Andy
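
A worked example of the arithmetic, as a sketch: it assumes 4 KiB pages,
and range_fits_below() is an illustrative helper that follows the shape
of the "addr > limit - len" checks in the patch rather than any specific
kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define PAGE_SIZE_ASSUMED	4096ULL
/* (1UL << 47) - PAGE_SIZE evaluates to 0x00007ffffffff000. */
#define LIMIT_47BIT		((1ULL << 47) - PAGE_SIZE_ASSUMED)

/*
 * Treat 'limit' as an exclusive upper bound: the range [addr, addr + len)
 * is acceptable only if it ends at or below 'limit'. Checking 'len' first
 * keeps the subtraction from wrapping around.
 */
static bool range_fits_below(uint64_t addr, uint64_t len, uint64_t limit)
{
	return len <= limit && addr <= limit - len;
}
```

With these numbers, a one-page mapping starting at 0x00007fffffffe000
ends exactly at the limit and fits, while a mapping starting at
0x00007ffffffff000 would cover the last page and is rejected.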

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 14:13 ` [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR Kirill A. Shutemov
       [not found]   ` <20170217141328.164563-34-kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2017-02-17 20:02   ` Linus Torvalds
  2017-02-17 20:12     ` Andy Lutomirski
                       ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread
From: Linus Torvalds @ 2017-02-17 20:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, the arch/x86 maintainers, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch@vger.kernel.org,
	linux-mm, Linux Kernel Mailing List, Catalin Marinas, Linux API

On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> This patch introduces two new prctl(2) handles to manage maximum virtual
> address available to userspace to map.

So this is my least favorite patch of the whole series, for a couple of reasons:

 (a) adding new code, and mixing it with the mindless TASK_SIZE ->
get_max_addr() conversion.

 (b) what's the point of that whole TASK_SIZE vs get_max_addr() thing?
When to use one, when the other?

so I think this patch needs a lot more thought and/or explanation.

Honestly, (a) is a no-brainer, and can be fixed by just splitting the
patch up. But I think (b) is more fundamental.

In particular, I think that get_max_addr() thing is badly defined.
When should you use TASK_SIZE, when should you use TASK_SIZE_MAX, and
when should you use get_max_addr()? I don't find that clear at all,
and I think that needs to be a whole lot more explicit and documented.

I also get the feeling that the whole thing is unnecessary. I'm
wondering if we should just instead say that the whole 47 vs 56-bit
virtual address is _purely_ about "get_unmapped_area()", and nothing
else.

IOW, I'm wondering if we can't just say that

 - if the processor and kernel support 56-bit user address space, then
you can *always* use the whole space

 - but by default, get_unmapped_area() will only return mappings that
fit in the 47 bit address space.

So if you use MAP_FIXED and give an address in the high range, it will
just always work, and the MM will always consider the task size to be
the full address space.

But for the common case where a process does not use MAP_FIXED, the
kernel will never give a high address by default, and you have to do
the process control thing to say "I want those high addresses".

Hmm?

In other words, I'd like to at least start out trying to keep the
differences between the 47-bit and 56-bit models as simple and minimal
as possible. Not make such a big deal out of it.

We already have "arch_get_unmapped_area()" that controls the whole
"what will non-MAP_FIXED mmap allocations return", so I'd hope that
the above kind of semantics could be done without *any* actual
TASK_SIZE changes _anywhere_ in the VM code.

Comments?

      Linus
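
The semantics proposed above could be modelled roughly as follows. This
is only a sketch of the idea, not kernel code: search_limit(), the
opted_in flag (standing in for the "process control thing") and the
4 KiB page size are all assumptions of the example, and it presumes a
machine where the 56-bit user address space is available.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages and 5-level paging hardware. */
#define PAGE_SIZE_ASSUMED	4096ULL
#define LIMIT_47BIT		((1ULL << 47) - PAGE_SIZE_ASSUMED)
#define LIMIT_56BIT		((1ULL << 56) - PAGE_SIZE_ASSUMED)

/*
 * Upper bound applied to a mapping request under the proposed rules:
 *  - MAP_FIXED requests may use the whole address space;
 *  - otherwise the search stays within 47 bits unless the process has
 *    explicitly asked for high addresses.
 */
static uint64_t search_limit(bool map_fixed, bool opted_in)
{
	if (map_fixed)
		return LIMIT_56BIT;
	return opted_in ? LIMIT_56BIT : LIMIT_47BIT;
}
```

The point of the model is that only the address search for non-MAP_FIXED
requests depends on the opt-in; the task size itself never changes.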

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 20:02   ` Linus Torvalds
@ 2017-02-17 20:12     ` Andy Lutomirski
  2017-02-17 21:01       ` Linus Torvalds
  2017-02-17 21:04     ` Dave Hansen
  2017-02-18  9:21     ` Kirill A. Shutemov
  2 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2017-02-17 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Linux Kernel Mailing List, Catalin Marinas, Linux API

On Fri, Feb 17, 2017 at 12:02 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> This patch introduces two new prctl(2) handles to manage maximum virtual
>> address available to userspace to map.
>
> So this is my least favorite patch of the whole series, for a couple of reasons:
>
>  (a) adding new code, and mixing it with the mindless TASK_SIZE ->
> get_max_addr() conversion.
>
>  (b) what's the point of that whole TASK_SIZE vs get_max_addr() thing?
> When use one, when the other?
>
> so I think this patch needs a lot more thought and/or explanation.
>
> Honestly, (a) is a no-brainer, and can be fixed by just splitting the
> patch up. But I think (b) is more fundamental.
>
> In particular, I think that get_max_addr() thing is badly defined.
> When should you use TASK_SIZE, when should you use TASK_SIZE_MAX, and
> when should you use get_max_addr()? I don't find that clear at all,
> and I think that needs to be a whole lot more explicit and documented.
>
> I also get the feeling that the whole thing is unnecessary. I'm
> wondering if we should just instead say that the whole 47 vs 56-bit
> virtual address is _purely_ about "get_unmapped_area()", and nothing
> else.
>
> IOW, I'm wondering if we can't just say that
>
>  - if the processor and kernel support 56-bit user address space, then
> you can *always* use the whole space
>
>  - but by default, get_unmapped_area() will only return mappings that
> fit in the 47 bit address space.
>
> So if you use MAP_FIXED and give an address in the high range, it will
> just always work, and the MM will always consider the task size to be
> the full address space.

At the very least, I'd want to see
MAP_FIXED_BUT_DONT_BLOODY_UNMAP_ANYTHING.  I *hate* the current
interface.

>
> But for the common case where a process does not use MAP_FIXED, the
> kernel will never give a high address by default, and you have to do
> the process control thing to say "I want those high addresses".
>
> Hmm?

How about MAP_LIMIT where the address passed in is interpreted as an
upper bound instead of a fixed address?

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 20:12     ` Andy Lutomirski
@ 2017-02-17 21:01       ` Linus Torvalds
  2017-02-17 23:02         ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2017-02-17 21:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Linux Kernel Mailing List, Catalin Marinas, Linux API

On Fri, Feb 17, 2017 at 12:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> At the very least, I'd want to see
> MAP_FIXED_BUT_DONT_BLOODY_UNMAP_ANYTHING.  I *hate* the current
> interface.

That's unrelated, but I guess we could add a MAP_NOUNMAP flag, and then
you can use MAP_FIXED | MAP_NOUNMAP or something.

But that has nothing to do with the 47-vs-56 bit issue.

> How about MAP_LIMIT where the address passed in is interpreted as an
> upper bound instead of a fixed address?

Again, that's an unrelated semantic issue. Right now - if you don't
pass in MAP_FIXED at all, the "addr" argument is used as a starting
value for deciding where to find an unmapped area. But there is no way
to specify the end. That would basically be what the process control
thing would be (not per-system-call, but per-thread).
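The difference between a plain hint and MAP_FIXED can be checked from
userspace. The following is a minimal sketch, not part of the patch: the
helper name old_mapping_survives() is made up for illustration. It maps one
page, marks it, and then calls mmap() again with the first page's address as
"addr". Without MAP_FIXED the address is only a hint, so the occupied page is
left alone; with MAP_FIXED the old page is silently replaced.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one page, mark it, then mmap() again at the
 * same address with the given extra flags.
 * Returns 1 if the first page's contents survived, 0 if they were
 * replaced, and -1 if either mmap() call failed.
 */
static int old_mapping_survives(int extra_flags)
{
	long page = sysconf(_SC_PAGESIZE);
	char *first, *second;
	int survived;

	assert(page > 0);

	first = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (first == MAP_FAILED)
		return -1;
	first[0] = 'A';

	/* Second request aimed at the page that is already mapped. */
	second = mmap(first, (size_t)page, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
	if (second == MAP_FAILED) {
		munmap(first, (size_t)page);
		return -1;
	}

	/* A replaced anonymous page reads back as zero instead of 'A'. */
	survived = (*(volatile char *)first == 'A');

	if (second != first)
		munmap(second, (size_t)page);
	munmap(first, (size_t)page);
	return survived;
}
```

Calling old_mapping_survives(0) exercises the hint path, and
old_mapping_survives(MAP_FIXED) exercises the replacing path that the
MAP_NOUNMAP idea is meant to guard against.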

                 Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 20:02   ` Linus Torvalds
  2017-02-17 20:12     ` Andy Lutomirski
@ 2017-02-17 21:04     ` Dave Hansen
  2017-02-17 21:10       ` Linus Torvalds
  2017-02-18  9:21     ` Kirill A. Shutemov
  2 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2017-02-17 21:04 UTC (permalink / raw)
  To: Linus Torvalds, Kirill A. Shutemov
  Cc: Andrew Morton, the arch/x86 maintainers, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, Andi Kleen,
	Andy Lutomirski, linux-arch@vger.kernel.org, linux-mm,
	Linux Kernel Mailing List, Catalin Marinas, Linux API

On 02/17/2017 12:02 PM, Linus Torvalds wrote:
> So if you use MAP_FIXED and give an address in the high range, it will
> just always work, and the MM will always consider the task size to be
> the full address space.
> 
> But for the common case where a process does not use MAP_FIXED, the
> kernel will never give a high address by default, and you have to do
> the process control thing to say "I want those high addresses".
> 
> Hmm?

Assuming that folks tend to hard-code MAP_FIXED addresses, they'll be
<48 bits and everything will work splendidly.  But, if folks do
something like take the CPU-enumerated virtual address size and use that
as a starting point, I can see things breaking.

MPX would definitely break if the hardware saw one of those high
addresses and was not ready for it.  It ends up just chopping off the
high bits of the address, so:

	0x10000000000000
and
	0x20000000000000

index into the same spot in the bounds tables.  It does this unless you
put the hardware in the new mode that uses the larger tables, and
consumes more bits of the virtual address.
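The aliasing can be shown with plain arithmetic. This is a simplified sketch
that assumes the legacy mode looks only at the address bits below bit 48; the
constant and function names are invented for illustration and do not
reproduce the actual bounds-table layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: legacy MPX mode ignores bits 48 and up. */
#define SKETCH_LEGACY_VA_BITS 48

/* The part of the address that would still reach the table lookup. */
static uint64_t legacy_lookup_bits(uint64_t addr)
{
	return addr & ((UINT64_C(1) << SKETCH_LEGACY_VA_BITS) - 1);
}

/* Two addresses alias if they are identical after the high bits are cut. */
static int legacy_aliases(uint64_t a, uint64_t b)
{
	return legacy_lookup_bits(a) == legacy_lookup_bits(b);
}
```

Under that assumption, 0x10000000000000 (bit 52) and 0x20000000000000
(bit 53) both reduce to 0, so they land on the same entry, while two distinct
addresses below the 48-bit boundary stay distinct.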

Is this likely to break anything in practice?  Nah.  But it would be nice
to avoid it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 21:04     ` Dave Hansen
@ 2017-02-17 21:10       ` Linus Torvalds
  2017-02-17 21:50         ` hpa
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2017-02-17 21:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Andy Lutomirski, linux-arch@vger.kernel.org, linux-mm,
	Linux Kernel Mailing List, Catalin Marinas, Linux API

On Fri, Feb 17, 2017 at 1:04 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> Is this likely to break anything in practice?  Nah.  But it would be nice
> to avoid it.

So I go the other way: what *I* would like to avoid is odd code that
is hard to follow. I'd much rather make the code be simple and the
rules be straightforward, and not introduce that complicated
"different address limits" thing at all.

Then, _if_ we ever find a case where it makes a difference, we could
go the more complex route. But not first implementation, and not
without a real example of why we shouldn't just keep things simple.

              Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 21:10       ` Linus Torvalds
@ 2017-02-17 21:50         ` hpa
  0 siblings, 0 replies; 27+ messages in thread
From: hpa @ 2017-02-17 21:50 UTC (permalink / raw)
  To: Linus Torvalds, Dave Hansen
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, Andi Kleen,
	Andy Lutomirski, linux-arch@vger.kernel.org, linux-mm,
	Linux Kernel Mailing List, Catalin Marinas, Linux API

On February 17, 2017 1:10:27 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Fri, Feb 17, 2017 at 1:04 PM, Dave Hansen <dave.hansen@intel.com>
>wrote:
>>
>> Is this likely to break anything in practice?  Nah.  But it would be
>> nice to avoid it.
>
>So I go the other way: what *I* would like to avoid is odd code that
>is hard to follow. I'd much rather make the code be simple and the
>rules be straightforward, and not introduce that complicated
>"different address limits" thing at all.
>
>Then, _if_ we ever find a case where it makes a difference, we could
>go the more complex route. But not first implementation, and not
>without a real example of why we shouldn't just keep things simple.
>
>              Linus

However, we already have different address limits for different threads
and/or syscall interfaces - 3 GiB (32-bit with legacy flag), 4 GiB (32-bit
or x32), or 128 TiB... and for a while we had a 512 GiB option, too.  In
that sense an address cap makes sense and generalizes what we already
have.

It would be pretty hideous for the user, long term, to be artificially
restricted to a legacy address cap unless they manage the address space
themselves.
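For reference, the caps listed above map onto address widths as follows. This
is only a worked-arithmetic sketch; the macro and function names are made up,
and the 128 TiB figure is treated as a round power of two even though the
patch itself subtracts one page from it.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_GIB (UINT64_C(1) << 30)
#define SKETCH_TIB (UINT64_C(1) << 40)

/* Number of address bits that exactly cover a power-of-two limit. */
static unsigned int bits_for_limit(uint64_t limit)
{
	unsigned int bits = 0;

	/* Only meaningful for non-zero powers of two. */
	assert(limit != 0 && (limit & (limit - 1)) == 0);

	while ((limit >> bits) != 1)
		bits++;
	return bits;
}
```

With these helpers, 4 GiB corresponds to 32 bits, 512 GiB to 39 bits and
128 TiB to 47 bits. The 3 GiB cap is not a power of two (it is 0xC0000000),
and a 56-bit space would be 65536 TiB.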
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 21:01       ` Linus Torvalds
@ 2017-02-17 23:02         ` Andy Lutomirski
  2017-02-17 23:11           ` hpa
                             ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-02-17 23:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Linux Kernel Mailing List, Catalin Marinas, Linux API

On Fri, Feb 17, 2017 at 1:01 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Feb 17, 2017 at 12:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> At the very least, I'd want to see
>> MAP_FIXED_BUT_DONT_BLOODY_UNMAP_ANYTHING.  I *hate* the current
>> interface.
>
> That's unrelated, but I guess we could add a MAP_NOUNMAP flag, and then
> you can use MAP_FIXED | MAP_NOUNMAP or something.
>
> But that has nothing to do with the 47-vs-56 bit issue.
>
>> How about MAP_LIMIT where the address passed in is interpreted as an
>> upper bound instead of a fixed address?
>
> Again, that's an unrelated semantic issue. Right now - if you don't
> pass in MAP_FIXED at all, the "addr" argument is used as a starting
> value for deciding where to find an unmapped area. But there is no way
> to specify the end. That would basically be what the process control
> thing would be (not per-system-call, but per-thread).
>

What I'm trying to say is: if we're going to go the route of a 48-bit
limit unless a specific mmap call requests otherwise, can we at least
have an interface that doesn't suck?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 23:02         ` Andy Lutomirski
@ 2017-02-17 23:11           ` hpa
  2017-02-17 23:21           ` Linus Torvalds
       [not found]           ` <CA+oaBQ+s5oXqu5TqddKs9LmUbaNNPGM7=gu5On4GYrkSDu0_XA@mail.gmail.com>
  2 siblings, 0 replies; 27+ messages in thread
From: hpa @ 2017-02-17 23:11 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, Andi Kleen,
	Dave Hansen, linux-arch@vger.kernel.org, linux-mm,
	Linux Kernel Mailing List, Catalin Marinas, Linux API

On February 17, 2017 3:02:33 PM PST, Andy Lutomirski <luto@amacapital.net> wrote:
>On Fri, Feb 17, 2017 at 1:01 PM, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>> On Fri, Feb 17, 2017 at 12:12 PM, Andy Lutomirski
><luto@amacapital.net> wrote:
>>>
>>> At the very least, I'd want to see
>>> MAP_FIXED_BUT_DONT_BLOODY_UNMAP_ANYTHING.  I *hate* the current
>>> interface.
>>
>> That's unrelated, but I guess we could add a MAP_NOUNMAP flag, and
>> then you can use MAP_FIXED | MAP_NOUNMAP or something.
>>
>> But that has nothing to do with the 47-vs-56 bit issue.
>>
>>> How about MAP_LIMIT where the address passed in is interpreted as an
>>> upper bound instead of a fixed address?
>>
>> Again, that's an unrelated semantic issue. Right now - if you don't
>> pass in MAP_FIXED at all, the "addr" argument is used as a starting
>> value for deciding where to find an unmapped area. But there is no
>> way to specify the end. That would basically be what the process control
>> thing would be (not per-system-call, but per-thread).
>>
>
>What I'm trying to say is: if we're going to go the route of a 48-bit
>limit unless a specific mmap call requests otherwise, can we at least
>have an interface that doesn't suck?

Let's not, please.

But we really want this interface anyway.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 23:02         ` Andy Lutomirski
  2017-02-17 23:11           ` hpa
@ 2017-02-17 23:21           ` Linus Torvalds
  2017-02-21 10:34             ` Catalin Marinas
       [not found]           ` <CA+oaBQ+s5oXqu5TqddKs9LmUbaNNPGM7=gu5On4GYrkSDu0_XA@mail.gmail.com>
  2 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2017-02-17 23:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, Andrew Morton,
	linux-arch@vger.kernel.org, Linux API, the arch/x86 maintainers,
	Andi Kleen, Kirill A. Shutemov, Arnd Bergmann, Dave Hansen,
	Linux Kernel Mailing List, Catalin Marinas, H. Peter Anvin,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 1821 bytes --]

On Feb 17, 2017 3:02 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:

> What I'm trying to say is: if we're going to go the route of a 48-bit
> limit unless a specific mmap call requests otherwise, can we at least
> have an interface that doesn't suck?

No, I'm not suggesting specific mmap calls at all. I'm suggesting the
complete opposite: not having some magical "max address" at all in the VM
layer. Keep all the existing TASK_SIZE defines as-is, and just make those
be the new 56-bit limit.

But to then not make most processes use it, just make the default x86
arch_get_unmapped_area() return an address limited to the old 47-bit
limit. So effectively all legacy programs work exactly the same way they
always did.

Then there are escape mechanisms: the process control that expands
that x86 arch_get_unmapped_area() to give high addresses. That would be
the normal thing.

But also, exactly *because* we don't make all those TASK_SIZE changes, you
could - if you wanted to - use MAP_FIXED to just allocate directly in high
virtual space. For example, maybe you just make your own private memory
allocator do that, and all the normal stuff would just continue to use the
low virtual addresses, and you wouldn't even bother with the prctl().

Because let's face it, the number of processes that will want the high
virtual addresses are going to be fairly few and specialised. Maybe even
those will want it only for special things (like mapping a huge area of
nonvolatile memory)

So I'm saying:

 - don't do all these magical TASK_SIZE things at all

 - don't mess with generic mm code at all.

 - only change arch_get_unmapped_area() to take one single process control
issue into account.

Keep it simple and stupid, and don't make this address size expansion
something that the core mm code needs to even know about.
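The proposal boils down to two separate limits. The following is a rough
userspace model under stated assumptions, not kernel code: all names are
hypothetical, the page size is assumed to be 4096 bytes, and the full user
space is assumed to be 56 bits wide as discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  UINT64_C(4096)
/* Old 47-bit ceiling: the default for automatic placement. */
#define SKETCH_LEGACY_TOP ((UINT64_C(1) << 47) - SKETCH_PAGE_SIZE)
/* Assumed full user address space with 5-level paging. */
#define SKETCH_FULL_TOP   ((UINT64_C(1) << 56) - SKETCH_PAGE_SIZE)

/*
 * Ceiling for the automatic (non-MAP_FIXED) search. Only a process that
 * opted in through the process control gets the high range.
 */
static uint64_t search_ceiling(bool wants_high_addresses)
{
	return wants_high_addresses ? SKETCH_FULL_TOP : SKETCH_LEGACY_TOP;
}

/*
 * A MAP_FIXED request is validated against the full task size only,
 * independent of the process control.
 */
static bool fixed_request_ok(uint64_t addr, uint64_t len)
{
	return len <= SKETCH_FULL_TOP && addr <= SKETCH_FULL_TOP - len;
}
```

In this model the task size itself never changes; only the ceiling handed to
the unmapped-area search depends on the opt-in.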

    Linus

[-- Attachment #2: Type: text/html, Size: 3955 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 20:02   ` Linus Torvalds
  2017-02-17 20:12     ` Andy Lutomirski
  2017-02-17 21:04     ` Dave Hansen
@ 2017-02-18  9:21     ` Kirill A. Shutemov
       [not found]       ` <20170218092133.GA17471-sVvlyX1904swdBt8bTSxpkEMvNT87kid@public.gmane.org>
  2 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2017-02-18  9:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-mm, Linux Kernel Mailing List,
	Catalin Marinas, Linux API

On Fri, Feb 17, 2017 at 12:02:13PM -0800, Linus Torvalds wrote:
> I also get the feeling that the whole thing is unnecessary. I'm
> wondering if we should just instead say that the whole 47 vs 56-bit
> virtual address is _purely_ about "get_unmapped_area()", and nothing
> else.
> 
> IOW, I'm wondering if we can't just say that
> 
>  - if the processor and kernel support 56-bit user address space, then
> you can *always* use the whole space
> 
>  - but by default, get_unmapped_area() will only return mappings that
> fit in the 47 bit address space.
> 
> So if you use MAP_FIXED and give an address in the high range, it will
> just always work, and the MM will always consider the task size to be
> the full address space.
> 
> But for the common case where a process does not use MAP_FIXED, the
> kernel will never give a high address by default, and you have to do
> the process control thing to say "I want those high addresses".
> 
> Hmm?
> 
> In other words, I'd like to at least start out trying to keep the
> differences between the 47-bit and 56-bit models as simple and minimal
> as possible. Not make such a big deal out of it.
> 
> We already have "arch_get_unmapped_area()" that controls the whole
> "what will non-MAP_FIXED mmap allocations return", so I'd hope that
> the above kind of semantics could be done without *any* actual
> TASK_SIZE changes _anywhere_ in the VM code.
> 
> Comments?

Okay, below is my try on implementing this.

I've chosen to respect the hint address even without MAP_FIXED, but only if
it doesn't collide with other mappings. Otherwise, we fall back to looking
for an unmapped area within the 47-bit window.
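The placement rule described above can be modelled in a few lines. This is a
toy sketch rather than the kernel's algorithm: the names are invented, the
fallback is a simple bottom-up first-fit search starting at an arbitrary
base, and all addresses and lengths are assumed to be page aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096)
#define SKETCH_WINDOW    ((UINT64_C(1) << 47) - SKETCH_PAGE_SIZE) /* default search window */
#define SKETCH_TASK_MAX  ((UINT64_C(1) << 56) - SKETCH_PAGE_SIZE) /* assumed full user space */
#define SKETCH_BASE      UINT64_C(0x10000)                        /* arbitrary search start */

/* An existing mapping covering [start, end). */
struct sketch_vma {
	uint64_t start, end;
};

/* Return the first mapping that overlaps [addr, addr + len), or NULL. */
static const struct sketch_vma *find_overlap(const struct sketch_vma *maps,
					     size_t n, uint64_t addr,
					     uint64_t len)
{
	for (size_t i = 0; i < n; i++)
		if (addr < maps[i].end && maps[i].start < addr + len)
			return &maps[i];
	return NULL;
}

/* Pick an address for a non-MAP_FIXED request; returns 0 on failure. */
static uint64_t pick_address(const struct sketch_vma *maps, size_t n,
			     uint64_t hint, uint64_t len)
{
	const struct sketch_vma *hit;
	uint64_t addr;

	assert(len > 0 && len <= SKETCH_WINDOW);

	/* A free hint is honored anywhere below the full task size. */
	if (hint && hint <= SKETCH_TASK_MAX - len &&
	    !find_overlap(maps, n, hint, len))
		return hint;

	/* Otherwise search bottom-up, but only inside the 47-bit window. */
	addr = SKETCH_BASE;
	while (addr <= SKETCH_WINDOW - len) {
		hit = find_overlap(maps, n, addr, len);
		if (!hit)
			return addr;
		addr = hit->end;	/* skip past the conflicting mapping */
	}
	return 0;
}
```

In this model a first request with a free hint at 1UL << 50 gets that
address, while a second request with the same hint collides and is placed
inside the low window instead.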

Interaction with MPX would require more work. I'm not yet sure what the
right way to address it is.

Also, Dave noticed that some test cases from LTP would break with this
approach. See for instance hugemmap03. I don't think it matters much, as it
tests for a negative outcome and I don't expect real-world applications to do
anything like this.

Test-case that I used to test the patch:

	#include <stdio.h>
	#include <sys/mman.h>

	#define SIZE (2UL << 20)
	#define LOW_ADDR ((void *) (1UL << 30))
	#define HIGH_ADDR ((void *) (1UL << 50))

	int main(int argc, char **argv)
	{
		void *p;

		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		printf("mmap(NULL): %p\n", p);

		p = mmap(LOW_ADDR, SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		printf("mmap(%p): %p\n", LOW_ADDR, p);

		p = mmap(HIGH_ADDR, SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		printf("mmap(%p): %p\n", HIGH_ADDR, p);

		p = mmap(HIGH_ADDR, SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		printf("mmap(%p) again: %p\n", HIGH_ADDR, p);

		p = mmap(HIGH_ADDR, SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
		printf("mmap(%p, MAP_FIXED): %p\n", HIGH_ADDR, p);

		return 0;
	}

------------------------8<---------------------------

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e7f155c3045e..9c6315d9aa34 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index e6cfe7ba2d65..492548c87cb1 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -789,6 +789,7 @@ static inline void spin_lock_prefetch(const void *x)
  */
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -828,7 +829,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -841,7 +844,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -863,7 +866,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index a55ed63b9f91..7f2e26dca1f2 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -147,7 +147,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = begin;
-	info.high_limit = end;
+	info.high_limit = min(end, DEFAULT_MAP_WINDOW);
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 2ae8584b44c7..e1c2ee098be0 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -82,7 +82,7 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = current->mm->mmap_legacy_base;
-	info.high_limit = TASK_SIZE;
+	info.high_limit = DEFAULT_MAP_WINDOW;
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
@@ -114,7 +114,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = DEFAULT_MAP_WINDOW;
 		addr = vm_unmapped_area(&info);
 	}
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index d2dc0438d654..a29a830ad341 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -52,7 +52,7 @@ static unsigned long stack_maxrandom_size(void)
  * Leave an at least ~128 MB hole with possible stack randomization.
  */
 #define MIN_GAP (128*1024*1024UL + stack_maxrandom_size())
-#define MAX_GAP (TASK_SIZE/6*5)
+#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5)
 
 static int mmap_is_legacy(void)
 {
@@ -90,7 +90,7 @@ static unsigned long mmap_base(unsigned long rnd)
 	else if (gap > MAX_GAP)
 		gap = MAX_GAP;
 
-	return PAGE_ALIGN(TASK_SIZE - gap - rnd);
+	return PAGE_ALIGN(DEFAULT_MAP_WINDOW - gap - rnd);
 }
 
 /*
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
       [not found]       ` <20170218092133.GA17471-sVvlyX1904swdBt8bTSxpkEMvNT87kid@public.gmane.org>
@ 2017-02-20 13:15         ` Kirill A. Shutemov
  2017-02-21 20:46           ` Dave Hansen
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2017-02-20 13:15 UTC (permalink / raw)
  To: Linus Torvalds, Dave Hansen
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, linux-mm, Linux Kernel Mailing List, Catalin Marinas,
	Linux API

On Sat, Feb 18, 2017 at 12:21:33PM +0300, Kirill A. Shutemov wrote:
> On Fri, Feb 17, 2017 at 12:02:13PM -0800, Linus Torvalds wrote:
> > I also get the feeling that the whole thing is unnecessary. I'm
> > wondering if we should just instead say that the whole 47 vs 56-bit
> > virtual address is _purely_ about "get_unmapped_area()", and nothing
> > else.
> > 
> > IOW, I'm wondering if we can't just say that
> > 
> >  - if the processor and kernel support 56-bit user address space, then
> > you can *always* use the whole space
> > 
> >  - but by default, get_unmapped_area() will only return mappings that
> > fit in the 47 bit address space.
> > 
> > So if you use MAP_FIXED and give an address in the high range, it will
> > just always work, and the MM will always consider the task size to be
> > the full address space.
> > 
> > But for the common case where a process does not use MAP_FIXED, the
> > kernel will never give a high address by default, and you have to do
> > the process control thing to say "I want those high addresses".
> > 
> > Hmm?
> > 
> > In other words, I'd like to at least start out trying to keep the
> > differences between the 47-bit and 56-bit models as simple and minimal
> > as possible. Not make such a big deal out of it.
> > 
> > We already have "arch_get_unmapped_area()" that controls the whole
> > "what will non-MAP_FIXED mmap allocations return", so I'd hope that
> > the above kind of semantics could be done without *any* actual
> > TASK_SIZE changes _anywhere_ in the VM code.
> > 
> > Comments?
> 
> Okay, below is my try on implementing this.
> 
> I've chosen to respect the hint address even without MAP_FIXED, but only if
> it doesn't collide with other mappings. Otherwise, we fall back to looking
> for an unmapped area within the 47-bit window.
> 
> Interaction with MPX would require more work. I'm not yet sure what the
> right way to address it is.

I *think* this should do the trick for MPX too.

Dave, could you check if it looks reasonable to you?

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e7f155c3045e..9c6315d9aa34 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 0b416d4cf73b..b722499f6aba 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -71,6 +71,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -92,6 +95,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index e6cfe7ba2d65..492548c87cb1 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -789,6 +789,7 @@ static inline void spin_lock_prefetch(const void *x)
  */
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -828,7 +829,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -841,7 +844,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -863,7 +866,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index a55ed63b9f91..df7bfc635941 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -18,6 +18,7 @@
 
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -128,6 +129,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
@@ -147,7 +152,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = begin;
-	info.high_limit = end;
+	info.high_limit = min(end, DEFAULT_MAP_WINDOW);
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
@@ -167,6 +172,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 2ae8584b44c7..329d653f7238 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -15,6 +15,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -82,7 +83,7 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = current->mm->mmap_legacy_base;
-	info.high_limit = TASK_SIZE;
+	info.high_limit = DEFAULT_MAP_WINDOW;
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
@@ -114,7 +115,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = DEFAULT_MAP_WINDOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -131,6 +132,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index d2dc0438d654..a29a830ad341 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -52,7 +52,7 @@ static unsigned long stack_maxrandom_size(void)
  * Leave an at least ~128 MB hole with possible stack randomization.
  */
 #define MIN_GAP (128*1024*1024UL + stack_maxrandom_size())
-#define MAX_GAP (TASK_SIZE/6*5)
+#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5)
 
 static int mmap_is_legacy(void)
 {
@@ -90,7 +90,7 @@ static unsigned long mmap_base(unsigned long rnd)
 	else if (gap > MAX_GAP)
 		gap = MAX_GAP;
 
-	return PAGE_ALIGN(TASK_SIZE - gap - rnd);
+	return PAGE_ALIGN(DEFAULT_MAP_WINDOW - gap - rnd);
 }
 
 /*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index af59f808742f..794b26661711 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -354,10 +354,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1037,3 +1046,20 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
       [not found]           ` <CA+oaBQ+s5oXqu5TqddKs9LmUbaNNPGM7=gu5On4GYrkSDu0_XA@mail.gmail.com>
@ 2017-02-21  6:00             ` Michael Pratt
  2017-02-21  6:10             ` Michael Pratt
  1 sibling, 0 replies; 27+ messages in thread
From: Michael Pratt @ 2017-02-21  6:00 UTC (permalink / raw)
  To: luto
  Cc: torvalds, kirill.shutemov, akpm, x86, tglx, mingo, arnd, hpa, ak,
	dave.hansen, linux-arch, linux-mm, linux-kernel, catalin.marinas,
	linux-api


On Mon, Feb 20, 2017 at 9:21 PM, Michael Pratt <linux@pratt.im> wrote:

> On Fri, Feb 17, 2017 at 3:02 PM, Andy Lutomirski <luto@amacapital.net>
> wrote:
> > On Fri, Feb 17, 2017 at 1:01 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >> On Fri, Feb 17, 2017 at 12:12 PM, Andy Lutomirski <luto@amacapital.net>
> wrote:
> >>>
> >>> At the very least, I'd want to see
> >>> MAP_FIXED_BUT_DONT_BLOODY_UNMAP_ANYTHING.  I *hate* the current
> >>> interface.
> >>
> >> That's unrelated, but I guess w could add a MAP_NOUNMAP flag, and then
> >> you can use MAP_FIXED | MAP_NOUNMAP or something.
> >>
> >> But that has nothing to do with the 47-vs-56 bit issue.
> >>
> >>> How about MAP_LIMIT where the address passed in is interpreted as an
> >>> upper bound instead of a fixed address?
> >>
> >> Again, that's a unrelated semantic issue. Right now - if you don't
> >> pass in MAP_FIXED at all, the "addr" argument is used as a starting
> >> value for deciding where to find an unmapped area. But there is no way
> >> to specify the end. That would basically be what the process control
> >> thing would be (not per-system-call, but per-thread ).
> >>
> >
> > What I'm trying to say is: if we're going to do the route of 48-bit
> > limit unless a specific mmap call requests otherwise, can we at least
> > have an interface that doesn't suck?
>

I've got a set of patches that I've meant to send out as an RFC for a while
that tries to address userspace control of address space layout and covers
many of these ideas.

There is a new syscall and set of prctls for controlling the "mmap layout"
(i.e., get_unmapped_area search range) that look something like this:

struct mmap_layout {
	unsigned long start;
	unsigned long end;
	/*
	 * These are equivalent to mmap_legacy_base and mmap_base,
	 * but are not really needed in this proposal.
	 */
	unsigned long low_base;
	unsigned long high_base;
	unsigned long flags;
};

/* For flags */
#define MMAP_TOPDOWN 1

struct layout_mmap_args {
	unsigned long addr;
	unsigned long len;
	unsigned long prot;
	unsigned long flags;
	unsigned long fd;
	unsigned long off;
	struct mmap_layout layout;
};

void *layout_mmap(struct layout_mmap_args *args);

int prctl(PR_GET_MMAP_LAYOUT, struct mmap_layout *layout);
int prctl(PR_SET_MMAP_LAYOUT, struct mmap_layout *layout);

The prctls control the default range that mmap and friends will allocate.
For 56-bit user address space, it could default to [mmap_min_addr, 1<<47),
as Linus suggests. Applications that want the full address space can
increase it to cover the entire range.

The layout_mmap syscall allows one-off mappings that fall outside the
default layout, and nicely solves the "MAP_FIXED but don't unmap anything"
problem by passing an explicit range to check without actually setting
MAP_FIXED.

This idea is quite similar to the MAX_VADDR + default get_unmapped_area
behavior ideas, just more generalized to give userspace more control over
the ultimate behavior of get_unmapped_area.

PS. Apologies if my email client screwed up this message. I didn't have
this thread in my client and have tried to import it from another account.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
       [not found]           ` <CA+oaBQ+s5oXqu5TqddKs9LmUbaNNPGM7=gu5On4GYrkSDu0_XA@mail.gmail.com>
  2017-02-21  6:00             ` Michael Pratt
@ 2017-02-21  6:10             ` Michael Pratt
  1 sibling, 0 replies; 27+ messages in thread
From: Michael Pratt @ 2017-02-21  6:10 UTC (permalink / raw)
  To: luto
  Cc: torvalds, kirill.shutemov, akpm, x86, tglx, mingo, arnd, hpa, ak,
	dave.hansen, linux-arch, linux-mm, linux-kernel, catalin.marinas,
	linux-api

Sigh... apologies for the HTML. Trying again...

On Mon, Feb 20, 2017 at 9:21 PM, Michael Pratt <linux@pratt.im> wrote:
> On Fri, Feb 17, 2017 at 3:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Feb 17, 2017 at 1:01 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Fri, Feb 17, 2017 at 12:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>
>>>> At the very least, I'd want to see
>>>> MAP_FIXED_BUT_DONT_BLOODY_UNMAP_ANYTHING.  I *hate* the current
>>>> interface.
>>>
>>> That's unrelated, but I guess w could add a MAP_NOUNMAP flag, and then
>>> you can use MAP_FIXED | MAP_NOUNMAP or something.
>>>
>>> But that has nothing to do with the 47-vs-56 bit issue.
>>>
>>>> How about MAP_LIMIT where the address passed in is interpreted as an
>>>> upper bound instead of a fixed address?
>>>
>>> Again, that's a unrelated semantic issue. Right now - if you don't
>>> pass in MAP_FIXED at all, the "addr" argument is used as a starting
>>> value for deciding where to find an unmapped area. But there is no way
>>> to specify the end. That would basically be what the process control
>>> thing would be (not per-system-call, but per-thread ).
>>>
>>
>> What I'm trying to say is: if we're going to do the route of 48-bit
>> limit unless a specific mmap call requests otherwise, can we at least
>> have an interface that doesn't suck?

I've got a set of patches that I've meant to send out as an RFC for a
while that tries to address userspace control of address space layout
and covers many of these ideas.

There is a new syscall and set of prctls for controlling the "mmap
layout" (i.e., get_unmapped_area search range) that look something
like this:

struct mmap_layout {
	unsigned long start;
	unsigned long end;
	/*
	 * These are equivalent to mmap_legacy_base and mmap_base,
	 * but are not really needed in this proposal.
	 */
	unsigned long low_base;
	unsigned long high_base;
	unsigned long flags;
};

/* For flags */
#define MMAP_TOPDOWN 1

struct layout_mmap_args {
	unsigned long addr;
	unsigned long len;
	unsigned long prot;
	unsigned long flags;
	unsigned long fd;
	unsigned long off;
	struct mmap_layout layout;
};

void *layout_mmap(struct layout_mmap_args *args);

int prctl(PR_GET_MMAP_LAYOUT, struct mmap_layout *layout);
int prctl(PR_SET_MMAP_LAYOUT, struct mmap_layout *layout);

The prctls control the default range that mmap and friends will
allocate. For 56-bit user address space, it could default to
[mmap_min_addr, 1<<47), as Linus suggests. Applications that want the
full address space can increase it to cover the entire range.

The layout_mmap syscall allows one-off mappings that fall outside the
default layout, and nicely solves the "MAP_FIXED but don't unmap
anything" problem by passing an explicit range to check without
actually setting MAP_FIXED.

This idea is quite similar to the MAX_VADDR + default
get_unmapped_area behavior ideas, just more generalized to give
userspace more control over the ultimate behavior of
get_unmapped_area.
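
To make the intent concrete, here is a rough sketch of how a caller might
fill in these structures. Nothing here is an existing kernel interface:
layout_mmap is only proposed above, and the helper names exact_range_args()
and default_layout() are made up for illustration. The sketch also assumes
one possible reading of the proposal, namely that setting the search range
to exactly [addr, addr + len) lets the request succeed only at that address
without replacing existing mappings. It assumes a 64-bit unsigned long.

```c
#include <assert.h>
#include <string.h>

/* Structures copied from the proposal above. */
struct mmap_layout {
	unsigned long start;
	unsigned long end;
	unsigned long low_base;
	unsigned long high_base;
	unsigned long flags;
};

struct layout_mmap_args {
	unsigned long addr;
	unsigned long len;
	unsigned long prot;
	unsigned long flags;
	unsigned long fd;
	unsigned long off;
	struct mmap_layout layout;
};

/*
 * Hypothetical helper: describe an anonymous mapping that may only land
 * at [addr, addr + len). MAP_FIXED is deliberately not added to flags;
 * the narrow search range does the pinning instead.
 */
static struct layout_mmap_args exact_range_args(unsigned long addr,
						unsigned long len,
						unsigned long prot,
						unsigned long flags)
{
	struct layout_mmap_args args;

	memset(&args, 0, sizeof(args));
	args.addr = addr;
	args.len = len;
	args.prot = prot;
	args.flags = flags;
	args.fd = (unsigned long)-1;	/* no backing file */
	args.layout.start = addr;
	args.layout.end = addr + len;
	return args;
}

/*
 * Hypothetical helper: the default range suggested above for a 56-bit
 * user address space, [mmap_min_addr, 1 << 47).
 */
static struct mmap_layout default_layout(unsigned long mmap_min_addr)
{
	struct mmap_layout layout;

	memset(&layout, 0, sizeof(layout));
	layout.start = mmap_min_addr;
	layout.end = 1UL << 47;
	return layout;
}
```

An application that wants the full address space would instead pass a
layout whose end covers the whole range to PR_SET_MMAP_LAYOUT, as
described above.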

PS. Apologies if my email client screwed up this message. I didn't
have this thread in my client and have tried to import it from another
account.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 23:21           ` Linus Torvalds
@ 2017-02-21 10:34             ` Catalin Marinas
  2017-02-21 10:47               ` Kirill A. Shutemov
  0 siblings, 1 reply; 27+ messages in thread
From: Catalin Marinas @ 2017-02-21 10:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	linux-arch@vger.kernel.org, Linux API, the arch/x86 maintainers,
	Andi Kleen, Kirill A. Shutemov, Arnd Bergmann, Dave Hansen,
	Linux Kernel Mailing List, H. Peter Anvin, linux-mm

On Fri, Feb 17, 2017 at 03:21:27PM -0800, Linus Torvalds wrote:
> On Feb 17, 2017 3:02 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> >   What I'm trying to say is: if we're going to do the route of 48-bit
> >   limit unless a specific mmap call requests otherwise, can we at least
> >   have an interface that doesn't suck?
> 
> No, I'm not suggesting specific mmap calls at all. I'm suggesting the complete
> opposite: not having some magical "max address" at all in the VM layer. Keep
> all the existing TASK_SIZE defines as-is, and just make those be the new 56-bit
> limit.
> 
> But to then not make most processes use it, just make the default x86
> arch_get_free_area() return an address limited to the old 47-bit limit. So
> effectively all legacy programs work exactly the same way they always did.

arch_get_unmapped_area() changes would not cover STACK_TOP which is
currently defined as TASK_SIZE (on both x86 and arm64). I don't think it
matters much (normally such upper bits tricks are done on heap objects)
but you may find some weird user program that passes pointers to the
stack around and expects bits 48-63 to be masked out. If that's a real
issue, we could also limit STACK_TOP to 47-bit (48-bit on arm64).
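
As an illustration of the kind of upper-bits trick referred to above,
the sketch below packs a 16-bit tag into bits 48-63 of an address and
recovers the address by masking. The scheme and the names tag_pointer(),
untag_pointer() and survives_tagging() are hypothetical, not taken from
any particular program. It shows that an address which fits below bit 48
round-trips, while one with higher bits set does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TAG_SHIFT	48
#define ADDR_MASK	((UINT64_C(1) << TAG_SHIFT) - 1)

/* Store a 16-bit tag in bits 48-63; the address keeps bits 0-47 only. */
static uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
	return (addr & ADDR_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

/* Drop the tag again by masking out bits 48-63. */
static uint64_t untag_pointer(uint64_t tagged)
{
	return tagged & ADDR_MASK;
}

/* True if the address is unchanged after tagging and untagging. */
static bool survives_tagging(uint64_t addr, uint16_t tag)
{
	return untag_pointer(tag_pointer(addr, tag)) == addr;
}
```

With this scheme, survives_tagging(0x00007ffd12345678, 0xbeef) is true,
while an address with bit 52 set, which 5-level paging can hand out,
loses its upper bits and the check is false.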

-- 
Catalin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-21 10:34             ` Catalin Marinas
@ 2017-02-21 10:47               ` Kirill A. Shutemov
  2017-02-21 10:54                 ` Catalin Marinas
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2017-02-21 10:47 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, linux-arch@vger.kernel.org, Linux API,
	the arch/x86 maintainers, Andi Kleen, Kirill A. Shutemov,
	Arnd Bergmann, Dave Hansen, Linux Kernel Mailing List,
	H. Peter Anvin, linux-mm

On Tue, Feb 21, 2017 at 10:34:02AM +0000, Catalin Marinas wrote:
> On Fri, Feb 17, 2017 at 03:21:27PM -0800, Linus Torvalds wrote:
> > On Feb 17, 2017 3:02 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> > >   What I'm trying to say is: if we're going to do the route of 48-bit
> > >   limit unless a specific mmap call requests otherwise, can we at least
> > >   have an interface that doesn't suck?
> > 
> > No, I'm not suggesting specific mmap calls at all. I'm suggesting the complete
> > opposite: not having some magical "max address" at all in the VM layer. Keep
> > all the existing TASK_SIZE defines as-is, and just make those be the new 56-bit
> > limit.
> > 
> > But to then not make most processes use it, just make the default x86
> > arch_get_free_area() return an address limited to the old 47-bit limit. So
> > effectively all legacy programs work exactly the same way they always did.
> 
> arch_get_unmapped_area() changes would not cover STACK_TOP which is
> currently defined as TASK_SIZE (on both x86 and arm64). I don't think it
> matters much (normally such upper bits tricks are done on heap objects)
> but you may find some weird user program that passes pointers to the
> stack around and expects bits 48-63 to be masked out. If that's a real
> issue, we could also limit STACK_TOP to 47-bit (48-bit on arm64).

I've limited STACK_TOP to 47-bit in my implementation of Linus' proposal:

http://lkml.kernel.org/r/20170220131515.GA9502@node.shutemov.name

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-21 10:47               ` Kirill A. Shutemov
@ 2017-02-21 10:54                 ` Catalin Marinas
  0 siblings, 0 replies; 27+ messages in thread
From: Catalin Marinas @ 2017-02-21 10:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, linux-arch@vger.kernel.org, Linux API,
	the arch/x86 maintainers, Andi Kleen, Kirill A. Shutemov,
	Arnd Bergmann, Dave Hansen, Linux Kernel Mailing List,
	H. Peter Anvin, linux-mm

On Tue, Feb 21, 2017 at 01:47:36PM +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 21, 2017 at 10:34:02AM +0000, Catalin Marinas wrote:
> > On Fri, Feb 17, 2017 at 03:21:27PM -0800, Linus Torvalds wrote:
> > > On Feb 17, 2017 3:02 PM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> > > >   What I'm trying to say is: if we're going to do the route of 48-bit
> > > >   limit unless a specific mmap call requests otherwise, can we at least
> > > >   have an interface that doesn't suck?
> > > 
> > > No, I'm not suggesting specific mmap calls at all. I'm suggesting the complete
> > > opposite: not having some magical "max address" at all in the VM layer. Keep
> > > all the existing TASK_SIZE defines as-is, and just make those be the new 56-bit
> > > limit.
> > > 
> > > But to then not make most processes use it, just make the default x86
> > > arch_get_free_area() return an address limited to the old 47-bit limit. So
> > > effectively all legacy programs work exactly the same way they always did.
> > 
> > arch_get_unmapped_area() changes would not cover STACK_TOP which is
> > currently defined as TASK_SIZE (on both x86 and arm64). I don't think it
> > matters much (normally such upper bits tricks are done on heap objects)
> > but you may find some weird user program that passes pointers to the
> > stack around and expects bits 48-63 to be masked out. If that's a real
> > issue, we could also limit STACK_TOP to 47-bit (48-bit on arm64).
> 
> I've limited STACK_TOP to 47-bit in my implementation of Linus' proposal:
> 
> http://lkml.kernel.org/r/20170220131515.GA9502@node.shutemov.name

Ah, sorry for the noise then (still catching up with this thread; at
some point we'll need to add 52-bit VA support to arm64, though with 4
levels only).

-- 
Catalin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-17 16:50     ` Andy Lutomirski
@ 2017-02-21 11:54       ` Dmitry Safonov
  2017-02-21 12:42         ` Kirill A. Shutemov
  0 siblings, 1 reply; 27+ messages in thread
From: Dmitry Safonov @ 2017-02-21 11:54 UTC (permalink / raw)
  To: Andy Lutomirski, Kirill A. Shutemov
  Cc: Dmitry Safonov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-arch, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Catalin Marinas, Linux API

2017-02-17 19:50 GMT+03:00 Andy Lutomirski <luto@amacapital.net>:
> On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> This patch introduces two new prctl(2) handles to manage maximum virtual
>> address available to userspace to map.
...
> Anyway, can you and Dmitry try to reconcile your patches?

So, how can I help with that?
Is there a version of the patch on which I could rebase?
BTW, here are the latest patches, which I will resend with a trivial
ifdef fixup after the merge window:
http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com

-- 
             Dmitry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-21 11:54       ` Dmitry Safonov
@ 2017-02-21 12:42         ` Kirill A. Shutemov
  2017-03-06 14:00           ` Dmitry Safonov
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2017-02-21 12:42 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Andy Lutomirski, Kirill A. Shutemov, Dmitry Safonov,
	Linus Torvalds, Andrew Morton, X86 ML, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Catalin Marinas, Linux API

On Tue, Feb 21, 2017 at 02:54:20PM +0300, Dmitry Safonov wrote:
> 2017-02-17 19:50 GMT+03:00 Andy Lutomirski <luto@amacapital.net>:
> > On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> >> This patch introduces two new prctl(2) handles to manage maximum virtual
> >> address available to userspace to map.
> ...
> > Anyway, can you and Dmitry try to reconcile your patches?
> 
> So, how can I help that?
> Is there the patch's version, on which I could rebase?
> Here are BTW the last patches, which I will resend with trivial ifdef-fixup
> after the merge window:
> http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com

Could you check if this patch collides with anything you do:

http://lkml.kernel.org/r/20170220131515.GA9502@node.shutemov.name

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-20 13:15         ` Kirill A. Shutemov
@ 2017-02-21 20:46           ` Dave Hansen
       [not found]             ` <0d05ac45-a139-6f8e-f98b-71876fbb509d-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2017-02-21 20:46 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, the arch/x86 maintainers,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, linux-mm, Linux Kernel Mailing List, Catalin Marinas,
	Linux API

Let me make sure I'm grokking what you're trying to do here.

On 02/20/2017 05:15 AM, Kirill A. Shutemov wrote:
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;

At this point, we know MPX management is on and the hint is for memory
above DEFAULT_MAP_WINDOW?

> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;

... and if it's a MAP_FIXED request, fail it.

> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;

What is this case for?  If addr+len wraps?

> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}

Otherwise, blow away the hint, which we know is high and needs to
be discarded?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
       [not found]             ` <0d05ac45-a139-6f8e-f98b-71876fbb509d-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-02-22 13:04               ` Kirill A. Shutemov
  0 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2017-02-22 13:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Kirill A. Shutemov, Andrew Morton,
	the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, linux-mm,
	Linux Kernel Mailing List, Catalin Marinas, Linux API

On Tue, Feb 21, 2017 at 12:46:55PM -0800, Dave Hansen wrote:
> Let me make sure I'm grokking what you're trying to do here.
> 
> On 02/20/2017 05:15 AM, Kirill A. Shutemov wrote:
> > +/* MPX cannot handle addresses above 47-bits yet. */
> > +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> > +		unsigned long flags)
> > +{
> > +	if (!kernel_managing_mpx_tables(current->mm))
> > +		return addr;
> > +	if (addr + len <= DEFAULT_MAP_WINDOW)
> > +		return addr;
> 
> At this point, we know MPX management is on and the hint is for memory
> above DEFAULT_MAP_WINDOW?

Right.

> > +	if (flags & MAP_FIXED)
> > +		return -ENOMEM;
> 
> ... and if it's a MAP_FIXED request, fail it.

Yep.

> > +	if (len > DEFAULT_MAP_WINDOW)
> > +		return -ENOMEM;
> 
> What is this case for?  If addr+len wraps?

If len is too big to fit into DEFAULT_MAP_WINDOW, there's no point in
resetting the hint address, as we know we can't satisfy the request:
fail early.
> 
> > +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> > +	return 0;
> > +}
> 
> Otherwise, blow away the hint, which we know is high and needs to
> be discarded?

Right.
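
The cases walked through above can be summarised with a small userspace
model of the check. This is only a sketch: the model_ names are made up,
the mpx_managed parameter stands in for kernel_managing_mpx_tables(),
and 4 KiB pages and a 64-bit unsigned long are assumed. As in the quoted
patch, addr + len is not checked for overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed values for this model: 4 KiB pages, 47-bit default window. */
#define MODEL_PAGE_SIZE			4096UL
#define MODEL_DEFAULT_MAP_WINDOW	((1UL << 47) - MODEL_PAGE_SIZE)
#define MODEL_MAP_FIXED			0x10UL	/* MAP_FIXED on Linux */

/* Same idea as IS_ERR_VALUE(): the top 4095 values encode -errno. */
static bool model_is_err(unsigned long v)
{
	return v >= (unsigned long)-4095L;
}

/* Userspace model of mpx_unmapped_area_check() from the patch. */
static unsigned long model_mpx_check(bool mpx_managed, unsigned long addr,
				     unsigned long len, unsigned long flags)
{
	/* MPX not in use: leave the request alone. */
	if (!mpx_managed)
		return addr;
	/* The request already fits below the window: leave it alone. */
	if (addr + len <= MODEL_DEFAULT_MAP_WINDOW)
		return addr;
	/* A fixed mapping above the window cannot be relocated: fail. */
	if (flags & MODEL_MAP_FIXED)
		return (unsigned long)-ENOMEM;
	/* Too long to fit in the window at any address: fail early. */
	if (len > MODEL_DEFAULT_MAP_WINDOW)
		return (unsigned long)-ENOMEM;
	/* Discard the high hint and search within the window. */
	return 0;
}
```

For example, with MPX management on, a non-fixed request for 4096 bytes
with a hint of 1UL << 50 comes back as 0 (hint discarded), while the same
request with MODEL_MAP_FIXED set comes back as an error value.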

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-02-21 12:42         ` Kirill A. Shutemov
@ 2017-03-06 14:00           ` Dmitry Safonov
  2017-03-06 14:17             ` Kirill A. Shutemov
  0 siblings, 1 reply; 27+ messages in thread
From: Dmitry Safonov @ 2017-03-06 14:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Kirill A. Shutemov, Dmitry Safonov,
	Linus Torvalds, Andrew Morton, X86 ML, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Catalin Marinas, Linux API

2017-02-21 15:42 GMT+03:00 Kirill A. Shutemov <kirill@shutemov.name>:
> On Tue, Feb 21, 2017 at 02:54:20PM +0300, Dmitry Safonov wrote:
>> 2017-02-17 19:50 GMT+03:00 Andy Lutomirski <luto@amacapital.net>:
>> > On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
>> > <kirill.shutemov@linux.intel.com> wrote:
>> >> This patch introduces two new prctl(2) handles to manage maximum virtual
>> >> address available to userspace to map.
>> ...
>> > Anyway, can you and Dmitry try to reconcile your patches?
>>
>> So, how can I help that?
>> Is there the patch's version, on which I could rebase?
>> Here are BTW the last patches, which I will resend with trivial ifdef-fixup
>> after the merge window:
>> http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com
>
> Could you check if this patch collides with anything you do:
>
> http://lkml.kernel.org/r/20170220131515.GA9502@node.shutemov.name

Ok, sorry for the late reply - it was the merge window anyway and I've got
urgent work to do.

Let's see:

I'll need a minor merge fixup here:
>-#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
>+#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))
while in my patches:
>+#define __TASK_UNMAPPED_BASE(task_size)        (PAGE_ALIGN(task_size / 3))
>+#define TASK_UNMAPPED_BASE             __TASK_UNMAPPED_BASE(TASK_SIZE)

This should be just fine with my changes:
>- info.high_limit = end;
>+ info.high_limit = min(end, DEFAULT_MAP_WINDOW);

This will need another minor fixup:
>-#define MAX_GAP (TASK_SIZE/6*5)
>+#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5)
I've moved it from a macro into mmap_base() as a local variable,
which depends on the task_size parameter.
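
For illustration, a userspace model of what such a parameterised
mmap_base() could look like is sketched below. This is an assumption
about the shape of the code, not a copy of either patch set: the model_
names are made up, the stack gap is passed in instead of being read from
the stack rlimit, stack randomization is left out of the minimum gap,
and 4 KiB pages and a 64-bit unsigned long are assumed.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE		4096UL
#define MODEL_PAGE_ALIGN(x) \
	(((x) + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1))
#define MODEL_DEFAULT_MAP_WINDOW	((1UL << 47) - MODEL_PAGE_SIZE)
#define MODEL_MIN_GAP		(128UL * 1024 * 1024)

/* Parameterised form of TASK_UNMAPPED_BASE, as in the quoted macro. */
#define MODEL_TASK_UNMAPPED_BASE(task_size) \
	MODEL_PAGE_ALIGN((task_size) / 3)

/*
 * Model of mmap_base() with the maximum gap computed locally from
 * task_size instead of coming from a MAX_GAP macro.
 */
static unsigned long model_mmap_base(unsigned long task_size,
				     unsigned long stack_gap,
				     unsigned long rnd)
{
	unsigned long min_gap = MODEL_MIN_GAP;
	unsigned long max_gap = task_size / 6 * 5;
	unsigned long gap = stack_gap;

	/* Clamp the gap left for the stack to [min_gap, max_gap]. */
	if (gap < min_gap)
		gap = min_gap;
	else if (gap > max_gap)
		gap = max_gap;

	return MODEL_PAGE_ALIGN(task_size - gap - rnd);
}
```

Passing MODEL_DEFAULT_MAP_WINDOW as task_size gives the behaviour of the
DEFAULT_MAP_WINDOW-based macros quoted above; an 8 MiB stack gap is
raised to the 128 MiB minimum, so the base ends up 128 MiB below the
window.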

That's all, as far as I can see at the moment.
It does not seem hard to fix. So I suggest we send the patch sets
in parallel, and whichever is accepted second gets rebased.
Is that convenient for you?
If you have any questions about my patches, I'll be happy to
answer them.

-- 
             Dmitry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-03-06 14:17             ` Kirill A. Shutemov
@ 2017-03-06 14:15               ` Dmitry Safonov
  0 siblings, 0 replies; 27+ messages in thread
From: Dmitry Safonov @ 2017-03-06 14:15 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dmitry Safonov
  Cc: Andy Lutomirski, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, X86 ML, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, Dave Hansen,
	linux-arch, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Catalin Marinas, Linux API

On 03/06/2017 05:17 PM, Kirill A. Shutemov wrote:
> On Mon, Mar 06, 2017 at 05:00:28PM +0300, Dmitry Safonov wrote:
>> 2017-02-21 15:42 GMT+03:00 Kirill A. Shutemov <kirill@shutemov.name>:
>>> On Tue, Feb 21, 2017 at 02:54:20PM +0300, Dmitry Safonov wrote:
>>>> 2017-02-17 19:50 GMT+03:00 Andy Lutomirski <luto@amacapital.net>:
>>>>> On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
>>>>> <kirill.shutemov@linux.intel.com> wrote:
>>>>>> This patch introduces two new prctl(2) handles to manage maximum virtual
>>>>>> address available to userspace to map.
>>>> ...
>>>>> Anyway, can you and Dmitry try to reconcile your patches?
>>>>
>>>> So, how can I help that?
>>>> Is there the patch's version, on which I could rebase?
>>>> Here are BTW the last patches, which I will resend with trivial ifdef-fixup
>>>> after the merge window:
>>>> http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com
>>>
>>> Could you check if this patch collides with anything you do:
>>>
>>> http://lkml.kernel.org/r/20170220131515.GA9502@node.shutemov.name
>>
>> Ok, sorry for the late reply - it was the merge window anyway and I've got
>> urgent work to do.
>>
>> Let's see:
>>
>> I'll need minor merge fixup here:
>>> -#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
>>> +#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))
>> while in my patches:
>>> +#define __TASK_UNMAPPED_BASE(task_size)        (PAGE_ALIGN(task_size / 3))
>>> +#define TASK_UNMAPPED_BASE             __TASK_UNMAPPED_BASE(TASK_SIZE)
>>
>> This should be just fine with my changes:
>>> - info.high_limit = end;
>>> + info.high_limit = min(end, DEFAULT_MAP_WINDOW);
>>
>> This will need another minor fixup:
>>> -#define MAX_GAP (TASK_SIZE/6*5)
>>> +#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5)
>> I've moved it from macro to mmap_base() as local var,
>> which depends on task_size parameter.
>>
>> That's all, as far as I can see at this moment.
>> Does not seems hard to fix. So I suggest sending patches sets
>> in parallel, the second accepted will rebase the set.
>> Is it convenient for you?
>
> Works for me.
>
> In fact, I've just sent v4 of the patchset.
>

Ok, thanks.

-- 
              Dmitry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
  2017-03-06 14:00           ` Dmitry Safonov
@ 2017-03-06 14:17             ` Kirill A. Shutemov
  2017-03-06 14:15               ` Dmitry Safonov
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2017-03-06 14:17 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Andy Lutomirski, Kirill A. Shutemov, Dmitry Safonov,
	Linus Torvalds, Andrew Morton, X86 ML, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Catalin Marinas, Linux API

On Mon, Mar 06, 2017 at 05:00:28PM +0300, Dmitry Safonov wrote:
> 2017-02-21 15:42 GMT+03:00 Kirill A. Shutemov <kirill@shutemov.name>:
> > On Tue, Feb 21, 2017 at 02:54:20PM +0300, Dmitry Safonov wrote:
> >> 2017-02-17 19:50 GMT+03:00 Andy Lutomirski <luto@amacapital.net>:
> >> > On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
> >> > <kirill.shutemov@linux.intel.com> wrote:
> >> >> This patch introduces two new prctl(2) handles to manage maximum virtual
> >> >> address available to userspace to map.
> >> ...
> >> > Anyway, can you and Dmitry try to reconcile your patches?
> >>
> >> So, how can I help with that?
> >> Is there a version of the patch that I could rebase onto?
> >> BTW, here are my latest patches, which I will resend with a trivial ifdef fixup
> >> after the merge window:
> >> http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com
> >
> > Could you check if this patch collides with anything you do:
> >
> > http://lkml.kernel.org/r/20170220131515.GA9502@node.shutemov.name
> 
> Ok, sorry for the late reply - it was the merge window anyway and I had
> urgent work to do.
> 
> Let's see:
> 
> I'll need a minor merge fixup here:
> >-#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
> >+#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))
> while in my patches:
> >+#define __TASK_UNMAPPED_BASE(task_size)        (PAGE_ALIGN(task_size / 3))
> >+#define TASK_UNMAPPED_BASE             __TASK_UNMAPPED_BASE(TASK_SIZE)
> 
> This should be just fine with my changes:
> >- info.high_limit = end;
> >+ info.high_limit = min(end, DEFAULT_MAP_WINDOW);
> 
> This will need another minor fixup:
> >-#define MAX_GAP (TASK_SIZE/6*5)
> >+#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5)
> I've moved it from a macro into mmap_base() as a local variable,
> which depends on the task_size parameter.
> 
> That's all, as far as I can see at this moment.
> It does not seem hard to fix. So I suggest sending the patch sets
> in parallel; whichever is accepted second will be rebased.
> Is that convenient for you?

Works for me.

In fact, I've just sent v4 of the patchset.

-- 
 Kirill A. Shutemov

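For readers following the three fixups quoted above, here is a rough
userspace sketch of how they might fit together. It is not code from either
patch set. The function names (task_unmapped_base, clamp_high_limit,
mmap_base_sketch), the 4 KiB PAGE_SIZE, the simplified 128 MiB MIN_GAP and
the 56-bit TASK_SIZE_5LVL value are assumptions made for illustration; only
the formulas (task_size / 3, min(end, DEFAULT_MAP_WINDOW), task_size/6*5)
and the (1UL << 47) - PAGE_SIZE default come from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages. */
#define PAGE_SIZE          UINT64_C(4096)
#define PAGE_ALIGN(x)      (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Default limit from the patch description: (1UL << 47) - PAGE_SIZE. */
#define DEFAULT_MAP_WINDOW ((UINT64_C(1) << 47) - PAGE_SIZE)

/* Hypothetical 5-level TASK_SIZE (56-bit user space, per the description). */
#define TASK_SIZE_5LVL     ((UINT64_C(1) << 56) - PAGE_SIZE)

/* Assumed simplified MIN_GAP; the stack randomization term is left out. */
#define MIN_GAP            (UINT64_C(128) * 1024 * 1024)

/* Parametrized form of TASK_UNMAPPED_BASE, as in __TASK_UNMAPPED_BASE(). */
static uint64_t task_unmapped_base(uint64_t task_size)
{
	return PAGE_ALIGN(task_size / 3);
}

/* Mirrors: info.high_limit = min(end, DEFAULT_MAP_WINDOW); */
static uint64_t clamp_high_limit(uint64_t end)
{
	return end < DEFAULT_MAP_WINDOW ? end : DEFAULT_MAP_WINDOW;
}

/* MAX_GAP computed as a local variable from the task_size parameter. */
static uint64_t mmap_base_sketch(uint64_t rnd, uint64_t stack_rlimit,
				 uint64_t task_size)
{
	uint64_t max_gap = task_size / 6 * 5;
	uint64_t gap = stack_rlimit;

	if (gap < MIN_GAP)
		gap = MIN_GAP;
	else if (gap > max_gap)
		gap = max_gap;

	/* Guard against underflow in this userspace sketch. */
	assert(task_size > gap + rnd);
	return PAGE_ALIGN(task_size - gap - rnd);
}
```

Passing DEFAULT_MAP_WINDOW as task_size keeps the computed bases inside the
47-bit window even when the hardware limit (TASK_SIZE_5LVL here) is larger.
That appears to be the common goal of both patch sets.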

Thread overview: 27+ messages
     [not found] <20170217141328.164563-1-kirill.shutemov@linux.intel.com>
2017-02-17 14:13 ` [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR Kirill A. Shutemov
     [not found]   ` <20170217141328.164563-34-kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2017-02-17 16:50     ` Andy Lutomirski
2017-02-21 11:54       ` Dmitry Safonov
2017-02-21 12:42         ` Kirill A. Shutemov
2017-03-06 14:00           ` Dmitry Safonov
2017-03-06 14:17             ` Kirill A. Shutemov
2017-03-06 14:15               ` Dmitry Safonov
2017-02-17 17:19     ` Dave Hansen
2017-02-17 17:21       ` Andy Lutomirski
2017-02-17 20:02   ` Linus Torvalds
2017-02-17 20:12     ` Andy Lutomirski
2017-02-17 21:01       ` Linus Torvalds
2017-02-17 23:02         ` Andy Lutomirski
2017-02-17 23:11           ` hpa
2017-02-17 23:21           ` Linus Torvalds
2017-02-21 10:34             ` Catalin Marinas
2017-02-21 10:47               ` Kirill A. Shutemov
2017-02-21 10:54                 ` Catalin Marinas
     [not found]           ` <CA+oaBQ+s5oXqu5TqddKs9LmUbaNNPGM7=gu5On4GYrkSDu0_XA@mail.gmail.com>
2017-02-21  6:00             ` Michael Pratt
2017-02-21  6:10             ` Michael Pratt
2017-02-17 21:04     ` Dave Hansen
2017-02-17 21:10       ` Linus Torvalds
2017-02-17 21:50         ` hpa
2017-02-18  9:21     ` Kirill A. Shutemov
     [not found]       ` <20170218092133.GA17471-sVvlyX1904swdBt8bTSxpkEMvNT87kid@public.gmane.org>
2017-02-20 13:15         ` Kirill A. Shutemov
2017-02-21 20:46           ` Dave Hansen
     [not found]             ` <0d05ac45-a139-6f8e-f98b-71876fbb509d-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-02-22 13:04               ` Kirill A. Shutemov
