* [rfc][patch] Memory Binding Take 2 (0/1)
@ 2003-04-03 5:50 Matthew Dobson
2003-04-03 5:56 ` [rfc][patch] Memory Binding Take 2 (1/1) Matthew Dobson
2003-04-04 13:34 ` [rfc][patch] Memory Binding Take 2 (0/1) Christoph Hellwig
0 siblings, 2 replies; 9+ messages in thread
From: Matthew Dobson @ 2003-04-03 5:50 UTC (permalink / raw)
To: linux-kernel
Cc: Martin J. Bligh, Andrew Morton, Christoph Hellwig, Paolo Zeppegno,
Andi Kleen, lse-tech
[-- Attachment #1: Type: text/plain, Size: 342 bytes --]
Alrighty, let's give this another go...
This patch hasn't changed much. Just added bitmap_t, typedef'd to
unsigned long * for passing around bitmaps without breaking the
abstraction. I think it's good if we can keep the underlying data type
hidden to partially future-proof (protect? ;) the code.
Part 1/1 coming up...
Cheers!
-Matt
[-- Attachment #2: 00-pre_membind.patch --]
[-- Type: text/plain, Size: 1786 bytes --]
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-vanilla/include/linux/gfp.h linux-2.5.66-pre_membind/include/linux/gfp.h
--- linux-2.5.66-vanilla/include/linux/gfp.h Mon Mar 24 14:00:10 2003
+++ linux-2.5.66-pre_membind/include/linux/gfp.h Mon Mar 31 17:38:47 2003
@@ -52,7 +52,7 @@
if (unlikely(order >= MAX_ORDER))
return NULL;
- return __alloc_pages(gfp_mask, order, NODE_DATA(nid)->node_zonelists + (gfp_mask & GFP_ZONEMASK));
+ return __alloc_pages(gfp_mask, order, get_node_zonelist(nid, gfp_mask));
}
#define alloc_pages(gfp_mask, order) \
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-vanilla/include/linux/mmzone.h linux-2.5.66-pre_membind/include/linux/mmzone.h
--- linux-2.5.66-vanilla/include/linux/mmzone.h Mon Mar 24 14:00:45 2003
+++ linux-2.5.66-pre_membind/include/linux/mmzone.h Mon Mar 31 17:38:47 2003
@@ -326,6 +326,14 @@
#define num_online_memblks() 1
#endif /* CONFIG_DISCONTIGMEM || CONFIG_NUMA */
+
+static inline struct zonelist *get_node_zonelist(int nid, int gfp_mask)
+{
+ return NODE_DATA(nid)->node_zonelists + (gfp_mask & GFP_ZONEMASK);
+}
+
+#define get_zonelist(gfp_mask) get_node_zonelist(numa_node_id(), gfp_mask)
+
#endif /* !__ASSEMBLY__ */
#endif /* __KERNEL__ */
#endif /* _LINUX_MMZONE_H */
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-vanilla/include/linux/types.h linux-2.5.66-pre_membind/include/linux/types.h
--- linux-2.5.66-vanilla/include/linux/types.h Mon Mar 24 14:01:24 2003
+++ linux-2.5.66-pre_membind/include/linux/types.h Wed Apr 2 18:13:27 2003
@@ -10,6 +10,7 @@
unsigned long name[BITS_TO_LONGS(bits)]
#define CLEAR_BITMAP(name,bits) \
memset(name, 0, BITS_TO_LONGS(bits)*sizeof(unsigned long))
+typedef unsigned long * bitmap_t;
#endif
#include <linux/posix_types.h>
^ permalink raw reply [flat|nested] 9+ messages in thread
* [rfc][patch] Memory Binding Take 2 (1/1)
2003-04-03 5:50 [rfc][patch] Memory Binding Take 2 (0/1) Matthew Dobson
@ 2003-04-03 5:56 ` Matthew Dobson
2003-04-03 6:37 ` Andrew Morton
` (2 more replies)
2003-04-04 13:34 ` [rfc][patch] Memory Binding Take 2 (0/1) Christoph Hellwig
1 sibling, 3 replies; 9+ messages in thread
From: Matthew Dobson @ 2003-04-03 5:56 UTC (permalink / raw)
To: linux-kernel
Cc: Martin J. Bligh, Andrew Morton, Christoph Hellwig, Paolo Zeppegno,
Andi Kleen, lse-tech
[-- Attachment #1: Type: text/plain, Size: 865 bytes --]
Now for the good stuff! ;)
This one has had more changes... I've changed the syscall from
sys_membind to sys_mbind. I liked Paolo's suggestion of aligning the
naming. I've fixed up the way the bitmaps are passed. I pulled out all
the ZONE_* code, and now just have it use all the zones on the node. I
made sure that the binding pointers are not compiled in for non-NUMA
kernels. All that is added for non-NUMA kernels is the cond_syscall and
a small change in the page_cache_alloc callpath. Now page_cache_alloc
calls __page_cache_alloc, which is just the old page_cache_alloc for
non-NUMA. For NUMA, it's obviously a different function. I also
cleaned up the bitmask size issue, by just making sure userspace doesn't
pass in a bitmask that's way too large.
I guess that's it... As always, I'm looking forward to any comments!
Cheers!
-Matt
[-- Attachment #2: 01-membind.patch --]
[-- Type: text/plain, Size: 11322 bytes --]
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/arch/i386/kernel/entry.S linux-2.5.66-membind/arch/i386/kernel/entry.S
--- linux-2.5.66-pre_membind/arch/i386/kernel/entry.S Mon Mar 24 14:00:11 2003
+++ linux-2.5.66-membind/arch/i386/kernel/entry.S Wed Apr 2 10:46:20 2003
@@ -807,7 +807,7 @@
.long sys_getdents64 /* 220 */
.long sys_fcntl64
.long sys_ni_syscall /* reserved for TUX */
- .long sys_ni_syscall
+ .long sys_mbind
.long sys_gettid
.long sys_readahead /* 225 */
.long sys_setxattr
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/fs/inode.c linux-2.5.66-membind/fs/inode.c
--- linux-2.5.66-pre_membind/fs/inode.c Mon Mar 24 14:01:48 2003
+++ linux-2.5.66-membind/fs/inode.c Wed Apr 2 10:49:36 2003
@@ -144,6 +144,9 @@
mapping->dirtied_when = 0;
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
+#ifdef CONFIG_NUMA
+ mapping->binding = NULL;
+#endif
if (sb->s_bdev)
mapping->backing_dev_info = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
memset(&inode->u, 0, sizeof(inode->u));
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/include/asm-i386/unistd.h linux-2.5.66-membind/include/asm-i386/unistd.h
--- linux-2.5.66-pre_membind/include/asm-i386/unistd.h Mon Mar 24 14:00:54 2003
+++ linux-2.5.66-membind/include/asm-i386/unistd.h Wed Apr 2 10:52:18 2003
@@ -228,7 +228,7 @@
#define __NR_madvise1 219 /* delete when C lib stub is removed */
#define __NR_getdents64 220
#define __NR_fcntl64 221
-/* 223 is unused */
+#define __NR_mbind 223
#define __NR_gettid 224
#define __NR_readahead 225
#define __NR_setxattr 226
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/include/linux/fs.h linux-2.5.66-membind/include/linux/fs.h
--- linux-2.5.66-pre_membind/include/linux/fs.h Mon Mar 24 14:00:10 2003
+++ linux-2.5.66-membind/include/linux/fs.h Wed Apr 2 10:54:17 2003
@@ -19,6 +19,7 @@
#include <linux/cache.h>
#include <linux/radix-tree.h>
#include <linux/kobject.h>
+#include <linux/mbind.h>
#include <asm/atomic.h>
struct iovec;
@@ -329,6 +330,9 @@
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+#ifdef CONFIG_NUMA
+ struct binding *binding; /* for memory bindings */
+#endif
};
struct block_device {
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/include/linux/mbind.h linux-2.5.66-membind/include/linux/mbind.h
--- linux-2.5.66-pre_membind/include/linux/mbind.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.66-membind/include/linux/mbind.h Wed Apr 2 18:52:41 2003
@@ -0,0 +1,40 @@
+/*
+ * include/linux/mbind.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2003, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#ifndef _LINUX_MBIND_H
+#define _LINUX_MBIND_H
+
+#ifdef CONFIG_NUMA
+
+#include <linux/mmzone.h>
+
+/* Structure to keep track of memory segment (VMA) bindings */
+struct binding {
+ struct zonelist zonelist;
+};
+
+#endif /* CONFIG_NUMA */
+#endif /* _LINUX_MBIND_H */
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/include/linux/pagemap.h linux-2.5.66-membind/include/linux/pagemap.h
--- linux-2.5.66-pre_membind/include/linux/pagemap.h Mon Mar 24 13:59:54 2003
+++ linux-2.5.66-membind/include/linux/pagemap.h Wed Apr 2 19:49:42 2003
@@ -8,6 +8,7 @@
#include <linux/fs.h>
#include <linux/list.h>
#include <linux/highmem.h>
+#include <linux/mbind.h>
#include <asm/uaccess.h>
/*
@@ -27,14 +28,37 @@
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
+#ifndef CONFIG_NUMA
+
+static inline struct page *__page_cache_alloc(struct address_space *x, int gfp_mask)
+{
+ return alloc_pages(gfp_mask, 0);
+}
+
+#else /* CONFIG_NUMA */
+
+static inline struct page *__page_cache_alloc(struct address_space *x, int gfp_mask)
+{
+ struct zonelist *zonelist;
+
+ if (!x->binding)
+ zonelist = get_zonelist(gfp_mask);
+ else
+ zonelist = &x->binding->zonelist;
+
+ return __alloc_pages(gfp_mask, 0, zonelist);
+}
+
+#endif /* !CONFIG_NUMA */
+
static inline struct page *page_cache_alloc(struct address_space *x)
{
- return alloc_pages(x->gfp_mask, 0);
+ return __page_cache_alloc(x, x->gfp_mask);
}
static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return alloc_pages(x->gfp_mask|__GFP_COLD, 0);
+ return __page_cache_alloc(x, x->gfp_mask|__GFP_COLD);
}
typedef int filler_t(void *, struct page *);
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/kernel/sys.c linux-2.5.66-membind/kernel/sys.c
--- linux-2.5.66-pre_membind/kernel/sys.c Mon Mar 24 14:00:00 2003
+++ linux-2.5.66-membind/kernel/sys.c Wed Apr 2 11:00:44 2003
@@ -226,6 +226,7 @@
cond_syscall(sys_sendmsg)
cond_syscall(sys_recvmsg)
cond_syscall(sys_socketcall)
+cond_syscall(sys_mbind)
static int set_one_prio(struct task_struct *p, int niceval, int error)
{
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/mm/Makefile linux-2.5.66-membind/mm/Makefile
--- linux-2.5.66-pre_membind/mm/Makefile Mon Mar 24 14:00:51 2003
+++ linux-2.5.66-membind/mm/Makefile Wed Apr 2 10:50:59 2003
@@ -7,8 +7,10 @@
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
shmem.o vmalloc.o
-obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
+obj-y := bootmem.o fadvise.o filemap.o mempool.o oom_kill.o \
page_alloc.o page-writeback.o pdflush.o readahead.o \
slab.o swap.o truncate.o vcache.o vmscan.o $(mmu-y)
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
+
+obj-$(CONFIG_NUMA) += mbind.o
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/mm/mbind.c linux-2.5.66-membind/mm/mbind.c
--- linux-2.5.66-pre_membind/mm/mbind.c Wed Dec 31 16:00:00 1969
+++ linux-2.5.66-membind/mm/mbind.c Wed Apr 2 21:45:39 2003
@@ -0,0 +1,131 @@
+/*
+ * mm/mbind.c
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2003, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpatch@us.ibm.com>
+ */
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/mbind.h>
+#include <asm/string.h>
+#include <asm/topology.h>
+#include <asm/uaccess.h>
+
+/* Translate a cpumask to a nodemask */
+static inline void cpumask_to_nodemask(bitmap_t cpumask, bitmap_t nodemask)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++)
+ if (test_bit(i, cpumask))
+ set_bit(cpu_to_node(i), nodemask);
+}
+
+/*
+ * Adds the zones belonging to @pgdat to @zonelist. Returns the next
+ * index in @zonelist.
+ */
+static inline int add_node(pg_data_t *pgdat, struct zonelist *zonelist, int zone_num)
+{
+ int i;
+ struct zone *zone;
+
+ for (i = MAX_NR_ZONES-1; i >=0 ; i--) {
+ zone = pgdat->node_zones + i;
+ if (zone->present_pages)
+ zonelist->zones[zone_num++] = zone;
+ }
+ return zone_num;
+}
+
+/* Top-level function for allocating a binding for a region of memory */
+static inline struct binding *alloc_binding(bitmap_t nodemask)
+{
+ struct binding *binding;
+ int node, zone_num;
+
+ binding = (struct binding *)kmalloc(sizeof(struct binding), GFP_KERNEL);
+ if (!binding)
+ return NULL;
+ memset(binding, 0, sizeof(struct binding));
+
+ /* Build binding zonelist */
+ for (node = 0, zone_num = 0; node < MAX_NUMNODES; node++)
+ if (test_bit(node, nodemask) && node_online(node))
+ zone_num = add_node(NODE_DATA(node),
+ &binding->zonelist, zone_num);
+ binding->zonelist.zones[zone_num] = NULL;
+
+ if (zone_num == 0) {
+ /* No zones were added to the zonelist. Let the caller know. */
+ kfree(binding);
+ binding = NULL;
+ }
+ return binding;
+}
+
+
+/*
+ * membind - Bind a range of a process' VM space to a set of memory blocks according to
+ * a predefined policy.
+ * @start: beginning address of memory region to bind
+ * @len: length of memory region to bind
+ * @mask_ptr: pointer to bitmask of cpus
+ * @mask_len: length of the bitmask
+ * @policy: flag specifying the policy to use for the segment
+ */
+asmlinkage unsigned long sys_mbind(unsigned long start, unsigned long len,
+ unsigned long *mask_ptr, unsigned int mask_len, unsigned long policy)
+{
+ DECLARE_BITMAP(cpu_mask, NR_CPUS);
+ DECLARE_BITMAP(node_mask, MAX_NUMNODES);
+ struct vm_area_struct *vma = NULL;
+ struct address_space *mapping;
+ int copy_len, error = 0;
+
+ /* Deal with getting cpu_mask from userspace & translating to node_mask */
+ copy_len = min(mask_len, (unsigned int)NR_CPUS);
+ CLEAR_BITMAP(cpu_mask, NR_CPUS);
+ CLEAR_BITMAP(node_mask, MAX_NUMNODES);
+ if (copy_from_user(cpu_mask, mask_ptr, (copy_len+7)/8)) {
+ error = -EFAULT;
+ goto out;
+ }
+ cpumask_to_nodemask(cpu_mask, node_mask);
+
+ vma = find_vma(current->mm, start);
+ if (!(vma && vma->vm_file && vma->vm_ops &&
+ vma->vm_ops->nopage == shmem_nopage)) {
+ /* This isn't a shm segment. For now, we bail. */
+ error = -EINVAL;
+ goto out;
+ }
+
+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+ mapping->binding = alloc_binding(node_mask);
+ if (!mapping->binding)
+ error = -EFAULT;
+
+out:
+ return error;
+}
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/mm/swap_state.c linux-2.5.66-membind/mm/swap_state.c
--- linux-2.5.66-pre_membind/mm/swap_state.c Mon Mar 24 14:00:21 2003
+++ linux-2.5.66-membind/mm/swap_state.c Tue Apr 1 17:12:00 2003
@@ -47,6 +47,9 @@
.i_shared_sem = __MUTEX_INITIALIZER(swapper_space.i_shared_sem),
.private_lock = SPIN_LOCK_UNLOCKED,
.private_list = LIST_HEAD_INIT(swapper_space.private_list),
+#ifdef CONFIG_NUMA
+ .binding = NULL,
+#endif
};
#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [rfc][patch] Memory Binding Take 2 (1/1)
2003-04-03 5:56 ` [rfc][patch] Memory Binding Take 2 (1/1) Matthew Dobson
@ 2003-04-03 6:37 ` Andrew Morton
2003-04-03 23:30 ` Matthew Dobson
2003-04-03 12:20 ` Hugh Dickins
2003-04-04 13:40 ` Christoph Hellwig
2 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2003-04-03 6:37 UTC (permalink / raw)
To: colpatch; +Cc: linux-kernel, mbligh, hch, zeppegno.paolo, ak, lse-tech
Matthew Dobson <colpatch@us.ibm.com> wrote:
>
> +#define __NR_mbind 223
What was wrong with "membind"?
> +/* Translate a cpumask to a nodemask */
> +static inline void cpumask_to_nodemask(bitmap_t cpumask, bitmap_t nodemask)
> +{
> + int i;
> +
> + for (i = 0; i < NR_CPUS; i++)
> + if (test_bit(i, cpumask))
That's a bit weird. test_bit is only permitted on longs, so why introduce
bitmap_t?
> +/* Top-level function for allocating a binding for a region of memory */
> +static inline struct binding *alloc_binding(bitmap_t nodemask)
> +{
> + struct binding *binding;
> + int node, zone_num;
> +
> + binding = (struct binding *)kmalloc(sizeof(struct binding), GFP_KERNEL);
> + if (!binding)
> + return NULL;
> + memset(binding, 0, sizeof(struct binding));
> +
> + /* Build binding zonelist */
> + for (node = 0, zone_num = 0; node < MAX_NUMNODES; node++)
> + if (test_bit(node, nodemask) && node_online(node))
> + zone_num = add_node(NODE_DATA(node),
> + &binding->zonelist, zone_num);
> + binding->zonelist.zones[zone_num] = NULL;
> +
> + if (zone_num == 0) {
> + /* No zones were added to the zonelist. Let the caller know. */
> + kfree(binding);
> + binding = NULL;
> + }
> + return binding;
> +}
It looks like this function needs to be able to return a real errno (see
below).
> +asmlinkage unsigned long sys_mbind(unsigned long start, unsigned long len,
> + unsigned long *mask_ptr, unsigned int mask_len, unsigned long policy)
> +{
> + DECLARE_BITMAP(cpu_mask, NR_CPUS);
> + DECLARE_BITMAP(node_mask, MAX_NUMNODES);
Bah. Who cooked that up? It should be DEFINE_BITMAP. Oh well.
> + struct vm_area_struct *vma = NULL;
> + struct address_space *mapping;
> + int copy_len, error = 0;
> +
> + /* Deal with getting cpu_mask from userspace & translating to node_mask */
> + copy_len = min(mask_len, (unsigned int)NR_CPUS);
> + CLEAR_BITMAP(cpu_mask, NR_CPUS);
> + CLEAR_BITMAP(node_mask, MAX_NUMNODES);
> + if (copy_from_user(cpu_mask, mask_ptr, (copy_len+7)/8)) {
> + error = -EFAULT;
> + goto out;
> + }
> + cpumask_to_nodemask(cpu_mask, node_mask);
> +
> + vma = find_vma(current->mm, start);
> + if (!(vma && vma->vm_file && vma->vm_ops &&
> + vma->vm_ops->nopage == shmem_nopage)) {
> + /* This isn't a shm segment. For now, we bail. */
> + error = -EINVAL;
> + goto out;
> + }
> +
> + mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
> + mapping->binding = alloc_binding(node_mask);
> + if (!mapping->binding)
> + error = -EFAULT;
It returns EFAULT on memory exhaustion?
btw, can you remind me again why this is only available to tmpfs pagecache?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [rfc][patch] Memory Binding Take 2 (1/1)
2003-04-03 5:56 ` [rfc][patch] Memory Binding Take 2 (1/1) Matthew Dobson
2003-04-03 6:37 ` Andrew Morton
@ 2003-04-03 12:20 ` Hugh Dickins
2003-04-03 13:25 ` Paolo Zeppegno
2003-04-03 23:57 ` Matthew Dobson
2003-04-04 13:40 ` Christoph Hellwig
2 siblings, 2 replies; 9+ messages in thread
From: Hugh Dickins @ 2003-04-03 12:20 UTC (permalink / raw)
To: Matthew Dobson
Cc: linux-kernel, Martin J. Bligh, Andrew Morton, Christoph Hellwig,
Paolo Zeppegno, Andi Kleen, lse-tech
On Wed, 2 Apr 2003, Matthew Dobson wrote:
+/*
+ * membind - Bind a range of a process' VM space to a set of memory blocks according to
membind or mbind? Me, I like mbind (modulo remarks below), but you may
find Linus does not (he was rather caustic when I suggested that fremap
should be mseek, and it ended up as remap_file_pages instead).
+ * a predefined policy.
+ * @start: beginning address of memory region to bind
+ * @len: length of memory region to bind
Oh really? len is unused in the code below. If you were to use it,
you'd need to loop over vmas, splitting where necessary.
+ * @mask_ptr: pointer to bitmask of cpus
+ * @mask_len: length of the bitmask
+ * @policy: flag specifying the policy to use for the segment
I think you already remarked that policy is currently unused,
fair enough.
+ */
+asmlinkage unsigned long sys_mbind(unsigned long start, unsigned long len,
+ unsigned long *mask_ptr, unsigned int mask_len, unsigned long policy)
+{
+ DECLARE_BITMAP(cpu_mask, NR_CPUS);
+ DECLARE_BITMAP(node_mask, MAX_NUMNODES);
+ struct vm_area_struct *vma = NULL;
+ struct address_space *mapping;
+ int copy_len, error = 0;
+
+ /* Deal with getting cpu_mask from userspace & translating to node_mask */
+ copy_len = min(mask_len, (unsigned int)NR_CPUS);
+ CLEAR_BITMAP(cpu_mask, NR_CPUS);
+ CLEAR_BITMAP(node_mask, MAX_NUMNODES);
+ if (copy_from_user(cpu_mask, mask_ptr, (copy_len+7)/8)) {
+ error = -EFAULT;
+ goto out;
+ }
Shouldn't there be some capability restriction? Is it right that
anyone who can mmap a file for reading can determine its binding
(until the next does it differently)?
+ cpumask_to_nodemask(cpu_mask, node_mask);
+
+ vma = find_vma(current->mm, start);
You must not scan the vma list without at least
down_read(¤t->mm->mmap_sem).
+ if (!(vma && vma->vm_file && vma->vm_ops &&
+ vma->vm_ops->nopage == shmem_nopage)) {
+ /* This isn't a shm segment. For now, we bail. */
So you're allowing this on any file on tmpfs,
but on no file on any other filesystem: curious.
+ error = -EINVAL;
+ goto out;
+ }
+
+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+ mapping->binding = alloc_binding(node_mask);
Your NUMA machines clearly have more memory than is good for you:
nowhere is there an equivalent free_binding: which in particular
would need to be called first here if binding is already set (or
else old structure reused), and when inode is freed.
So... mapping->binding conditions every page_cache_alloc for that
inode. Hmm, what on earth does this have to do with mbind or membind?
It looks to me like fbind, except that you've dressed up the interface
to use an address in the caller's address space: presumably because you
couldn't get a file handle on SysV shared memory, and that's what you
were really wanting to bind, hence the shmem_nopage test?
I think this interface is confused (but it probably thinks I am).
+ if (!mapping->binding)
+ error = -EFAULT;
+
+out:
+ return error;
+}
diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/mm/swap_state.c linux-2.5.66-membind/mm/swap_state.c
--- linux-2.5.66-pre_membind/mm/swap_state.c Mon Mar 24 14:00:21 2003
+++ linux-2.5.66-membind/mm/swap_state.c Tue Apr 1 17:12:00 2003
@@ -47,6 +47,9 @@
.i_shared_sem = __MUTEX_INITIALIZER(swapper_space.i_shared_sem),
.private_lock = SPIN_LOCK_UNLOCKED,
.private_list = LIST_HEAD_INIT(swapper_space.private_list),
+#ifdef CONFIG_NUMA
+ .binding = NULL,
+#endif
};
#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
Please leave swap_state.c out of it: this patch does nothing but add
an ugly #ifdef to initialize something to 0 which would be 0 anyway.
Hugh
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [rfc][patch] Memory Binding Take 2 (1/1)
2003-04-03 12:20 ` Hugh Dickins
@ 2003-04-03 13:25 ` Paolo Zeppegno
2003-04-03 23:57 ` Matthew Dobson
1 sibling, 0 replies; 9+ messages in thread
From: Paolo Zeppegno @ 2003-04-03 13:25 UTC (permalink / raw)
To: Hugh Dickins
Cc: Matthew Dobson, linux-kernel, Martin J. Bligh, Andrew Morton,
Christoph Hellwig, Andi Kleen, lse-tech
I suggested mbind for consistency with mmap, munmap, mremap and msync,
that is IFF the mbind operation is in some ways related with these other
syscalls.
Hugh Dickins wrote:
>On Wed, 2 Apr 2003, Matthew Dobson wrote:
>+/*
>+ * membind - Bind a range of a process' VM space to a set of memory blocks according to
>
>membind or mbind? Me, I like mbind (modulo remarks below), but you may
>find Linus does not (he was rather caustic when I suggested that fremap
>should be mseek, and it ended up as remap_file_pages instead).
>
>+ * a predefined policy.
>+ * @start: beginning address of memory region to bind
>+ * @len: length of memory region to bind
>
>Oh really? len is unused in the code below. If you were to use it,
>you'd need to loop over vmas, splitting where necessary.
>
>+ * @mask_ptr: pointer to bitmask of cpus
>+ * @mask_len: length of the bitmask
>+ * @policy: flag specifying the policy to use for the segment
>
>I think you already remarked that policy is currently unused,
>fair enough.
>
>+ */
>+asmlinkage unsigned long sys_mbind(unsigned long start, unsigned long len,
>+ unsigned long *mask_ptr, unsigned int mask_len, unsigned long policy)
>+{
>+ DECLARE_BITMAP(cpu_mask, NR_CPUS);
>+ DECLARE_BITMAP(node_mask, MAX_NUMNODES);
>+ struct vm_area_struct *vma = NULL;
>+ struct address_space *mapping;
>+ int copy_len, error = 0;
>+
>+ /* Deal with getting cpu_mask from userspace & translating to node_mask */
>+ copy_len = min(mask_len, (unsigned int)NR_CPUS);
>+ CLEAR_BITMAP(cpu_mask, NR_CPUS);
>+ CLEAR_BITMAP(node_mask, MAX_NUMNODES);
>+ if (copy_from_user(cpu_mask, mask_ptr, (copy_len+7)/8)) {
>+ error = -EFAULT;
>+ goto out;
>+ }
>
>Shouldn't there be some capability restriction? Is it right that
>anyone who can mmap a file for reading can determine its binding
>(until the next does it differently)?
>
>+ cpumask_to_nodemask(cpu_mask, node_mask);
>+
>+ vma = find_vma(current->mm, start);
>
>You must not scan the vma list without at least
>down_read(¤t->mm->mmap_sem).
>
>+ if (!(vma && vma->vm_file && vma->vm_ops &&
>+ vma->vm_ops->nopage == shmem_nopage)) {
>+ /* This isn't a shm segment. For now, we bail. */
>
>So you're allowing this on any file on tmpfs,
>but on no file on any other filesystem: curious.
>
>+ error = -EINVAL;
>+ goto out;
>+ }
>+
>+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
>+ mapping->binding = alloc_binding(node_mask);
>
>Your NUMA machines clearly have more memory than is good for you:
>nowhere is there an equivalent free_binding: which in particular
>would need to be called first here if binding is already set (or
>else old structure reused), and when inode is freed.
>
>So... mapping->binding conditions every page_cache_alloc for that
>inode. Hmm, what on earth does this have to do with mbind or membind?
>It looks to me like fbind, except that you've dressed up the interface
>to use an address in the caller's address space: presumably because you
>couldn't get a file handle on SysV shared memory, and that's what you
>were really wanting to bind, hence the shmem_nopage test?
>
>I think this interface is confused (but it probably thinks I am).
>
>+ if (!mapping->binding)
>+ error = -EFAULT;
>+
>+out:
>+ return error;
>+}
>diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/mm/swap_state.c linux-2.5.66-membind/mm/swap_state.c
>--- linux-2.5.66-pre_membind/mm/swap_state.c Mon Mar 24 14:00:21 2003
>+++ linux-2.5.66-membind/mm/swap_state.c Tue Apr 1 17:12:00 2003
>@@ -47,6 +47,9 @@
> .i_shared_sem = __MUTEX_INITIALIZER(swapper_space.i_shared_sem),
> .private_lock = SPIN_LOCK_UNLOCKED,
> .private_list = LIST_HEAD_INIT(swapper_space.private_list),
>+#ifdef CONFIG_NUMA
>+ .binding = NULL,
>+#endif
> };
>
> #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
>
>Please leave swap_state.c out of it: this patch does nothing but add
>an ugly #ifdef to initialize something to 0 which would be 0 anyway.
>
>Hugh
>
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [rfc][patch] Memory Binding Take 2 (1/1)
2003-04-03 6:37 ` Andrew Morton
@ 2003-04-03 23:30 ` Matthew Dobson
0 siblings, 0 replies; 9+ messages in thread
From: Matthew Dobson @ 2003-04-03 23:30 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, mbligh, hch, zeppegno.paolo, ak, lse-tech,
Hugh Dickins
Andrew Morton wrote:
> Matthew Dobson <colpatch@us.ibm.com> wrote:
>
>>+#define __NR_mbind 223
>
>
> What was wrong with "membind"?
Well, there was nothing wrong with it, per se, I just liked Paolo's
suggestion to align the naming with mmap, munmap, mremap, etc. The
syscalls that manipulate a processes address space tend to be called
m(something). Right now it isn't as generic as I'd like it to be, but
all in good time.
>>+/* Translate a cpumask to a nodemask */
>>+static inline void cpumask_to_nodemask(bitmap_t cpumask, bitmap_t nodemask)
>>+{
>>+ int i;
>>+
>>+ for (i = 0; i < NR_CPUS; i++)
>>+ if (test_bit(i, cpumask))
>
>
> That's a bit weird. test_bit is only permitted on longs, so why introduce
> bitmap_t?
Erm... Good point. I really wanted to try and maintain the abstraction
of a bitmap type. I hoped that we could, via macros and typedefs, keep
the underlying data type obscured, and have a good facsimile of variable
length bitmaps. It's proving too difficult to hide the fact that
they're just unsigned long[]'s, so I'll give up the ghost and pass them
as unsigned long *'s.
>>+/* Top-level function for allocating a binding for a region of memory */
>>+static inline struct binding *alloc_binding(bitmap_t nodemask)
>>+{
>>+ struct binding *binding;
>>+ int node, zone_num;
>>+
>>+ binding = (struct binding *)kmalloc(sizeof(struct binding), GFP_KERNEL);
>>+ if (!binding)
>>+ return NULL;
>>+ memset(binding, 0, sizeof(struct binding));
>>+
>>+ /* Build binding zonelist */
>>+ for (node = 0, zone_num = 0; node < MAX_NUMNODES; node++)
>>+ if (test_bit(node, nodemask) && node_online(node))
>>+ zone_num = add_node(NODE_DATA(node),
>>+ &binding->zonelist, zone_num);
>>+ binding->zonelist.zones[zone_num] = NULL;
>>+
>>+ if (zone_num == 0) {
>>+ /* No zones were added to the zonelist. Let the caller know. */
>>+ kfree(binding);
>>+ binding = NULL;
>>+ }
>>+ return binding;
>>+}
>
> It looks like this function needs to be able to return a real errno (see
> below).
True. EFAULT is a sorta decent catchall, but not appropriate for
something like no memory, etc.
>>+ struct vm_area_struct *vma = NULL;
>>+ struct address_space *mapping;
>>+ int copy_len, error = 0;
>>+
>>+ /* Deal with getting cpu_mask from userspace & translating to node_mask */
>>+ copy_len = min(mask_len, (unsigned int)NR_CPUS);
>>+ CLEAR_BITMAP(cpu_mask, NR_CPUS);
>>+ CLEAR_BITMAP(node_mask, MAX_NUMNODES);
>>+ if (copy_from_user(cpu_mask, mask_ptr, (copy_len+7)/8)) {
>>+ error = -EFAULT;
>>+ goto out;
>>+ }
>>+ cpumask_to_nodemask(cpu_mask, node_mask);
>>+
>>+ vma = find_vma(current->mm, start);
>>+ if (!(vma && vma->vm_file && vma->vm_ops &&
>>+ vma->vm_ops->nopage == shmem_nopage)) {
>>+ /* This isn't a shm segment. For now, we bail. */
>>+ error = -EINVAL;
>>+ goto out;
>>+ }
>>+
>>+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
>>+ mapping->binding = alloc_binding(node_mask);
>>+ if (!mapping->binding)
>>+ error = -EFAULT;
>
>
> It returns EFAULT on memory exhaustion?
No longer... That'll be fixed in version 3.
> btw, can you remind me again why this is only available to tmpfs pagecache?
I can try! ;) I originally wanted to do just a shared memory binding
call, but people (correctly) suggested a more generic memory binding
would be more useful. So I've basically just set up a lot of the
infrastructure for a more generic call, but haven't fully implemented
it. This patch is intended to be a starting point, from which it will
be easy to incrementally add more functionality and power to the binding
call. The underlying code (syscalls, structures, .c files, allocator
changes) won't have to change too much. So this patch works for any
shared memory segment. It'd be straightforward to extend this to any
file-backed vma (because it already has a struct address_space, with a
struct binding in it), so I hope to grow this into something more.
Cheers!
-Matt
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [rfc][patch] Memory Binding Take 2 (1/1)
2003-04-03 12:20 ` Hugh Dickins
2003-04-03 13:25 ` Paolo Zeppegno
@ 2003-04-03 23:57 ` Matthew Dobson
1 sibling, 0 replies; 9+ messages in thread
From: Matthew Dobson @ 2003-04-03 23:57 UTC (permalink / raw)
To: Hugh Dickins
Cc: linux-kernel, Martin J. Bligh, Andrew Morton, Christoph Hellwig,
Paolo Zeppegno, Andi Kleen, lse-tech
Hugh Dickins wrote:
> On Wed, 2 Apr 2003, Matthew Dobson wrote:
> +/*
> + * membind - Bind a range of a process' VM space to a set of memory blocks according to
>
> membind or mbind? Me, I like mbind (modulo remarks below), but you may
> find Linus does not (he was rather caustic when I suggested that fremap
> should be mseek, and it ended up as remap_file_pages instead).
I'm a fan of mbind. s/mbind/whatever_linus_wants/ should be easy enough
if necessary. ;)
> + * a predefined policy.
> + * @start: beginning address of memory region to bind
> + * @len: length of memory region to bind
>
> Oh really? len is unused in the code below. If you were to use it,
> you'd need to loop over vmas, splitting where necessary.
No.. Not really! ;) As I mentioned in my reply to Andrew, one of the
main thrusts of this patch is to set up the infrastructure in a way that
allows it to be fairly easily expanded in the future. For now, as has
been noted several time, it just works on shared memory segments and
doesn't actually respect the length argument. Your point is well taken,
though. I imagine the code for looping over vmas will looks similar to
what is done in the mmap and mlock calls.
> + */
> +asmlinkage unsigned long sys_mbind(unsigned long start, unsigned long len,
> + unsigned long *mask_ptr, unsigned int mask_len, unsigned long policy)
> +{
> + DECLARE_BITMAP(cpu_mask, NR_CPUS);
> + DECLARE_BITMAP(node_mask, MAX_NUMNODES);
> + struct vm_area_struct *vma = NULL;
> + struct address_space *mapping;
> + int copy_len, error = 0;
> +
> + /* Deal with getting cpu_mask from userspace & translating to node_mask */
> + copy_len = min(mask_len, (unsigned int)NR_CPUS);
> + CLEAR_BITMAP(cpu_mask, NR_CPUS);
> + CLEAR_BITMAP(node_mask, MAX_NUMNODES);
> + if (copy_from_user(cpu_mask, mask_ptr, (copy_len+7)/8)) {
> + error = -EFAULT;
> + goto out;
> + }
>
> Shouldn't there be some capability restriction? Is it right that
> anyone who can mmap a file for reading can determine its binding
> (until the next does it differently)?
Probably. Once the binding call is expanded to encompass any mmap'd
file, then yes, we'll want capabilities. For now, with just shared
memory segments, it's probably not quite as imperative. Although I
guess it also works for any shmem_fs segments, too. I'll look into
finding an appropriate capability to check... Any suggestions?
> + cpumask_to_nodemask(cpu_mask, node_mask);
> +
> + vma = find_vma(current->mm, start);
>
> You must not scan the vma list without at least
> down_read(¤t->mm->mmap_sem).
OK.. Thanks for the heads up! ;)
> + if (!(vma && vma->vm_file && vma->vm_ops &&
> + vma->vm_ops->nopage == shmem_nopage)) {
> + /* This isn't a shm segment. For now, we bail. */
>
> So you're allowing this on any file on tmpfs,
> but on no file on any other filesystem: curious.
See comments near the top of this reply, and the comments in my reply to
Andrew.
> + error = -EINVAL;
> + goto out;
> + }
> +
> + mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
> + mapping->binding = alloc_binding(node_mask);
>
> Your NUMA machines clearly have more memory than is good for you:
> nowhere is there an equivalent free_binding: which in particular
> would need to be called first here if binding is already set (or
> else old structure reused), and when inode is freed.
Heh... erm, maybe I should retake CS-101. Version 3 awaits! ;)
> So... mapping->binding conditions every page_cache_alloc for that
> inode. Hmm, what on earth does this have to do with mbind or membind?
> It looks to me like fbind, except that you've dressed up the interface
> to use an address in the caller's address space: presumably because you
> couldn't get a file handle on SysV shared memory, and that's what you
> were really wanting to bind, hence the shmem_nopage test?
>
> I think this interface is confused (but it probably thinks I am).
The original RFC I sent out some time ago was for shared memory segment
binding only. This is to be expanded, as per comments above.
> + if (!mapping->binding)
> + error = -EFAULT;
> +
> +out:
> + return error;
> +}
> diff -Nur --exclude-from=/usr/src/.dontdiff linux-2.5.66-pre_membind/mm/swap_state.c linux-2.5.66-membind/mm/swap_state.c
> --- linux-2.5.66-pre_membind/mm/swap_state.c Mon Mar 24 14:00:21 2003
> +++ linux-2.5.66-membind/mm/swap_state.c Tue Apr 1 17:12:00 2003
> @@ -47,6 +47,9 @@
> .i_shared_sem = __MUTEX_INITIALIZER(swapper_space.i_shared_sem),
> .private_lock = SPIN_LOCK_UNLOCKED,
> .private_list = LIST_HEAD_INIT(swapper_space.private_list),
> +#ifdef CONFIG_NUMA
> + .binding = NULL,
> +#endif
> };
>
> #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
>
> Please leave swap_state.c out of it: this patch does nothing but add
> an ugly #ifdef to initialize something to 0 which would be 0 anyway.
Also in the anxiously awaited version 3. Or should I say *not* in it?
Cheers!
-Matt
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [rfc][patch] Memory Binding Take 2 (0/1)
2003-04-03 5:50 [rfc][patch] Memory Binding Take 2 (0/1) Matthew Dobson
2003-04-03 5:56 ` [rfc][patch] Memory Binding Take 2 (1/1) Matthew Dobson
@ 2003-04-04 13:34 ` Christoph Hellwig
1 sibling, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2003-04-04 13:34 UTC (permalink / raw)
To: Matthew Dobson
Cc: linux-kernel, Martin J. Bligh, Andrew Morton, Christoph Hellwig,
Paolo Zeppegno, Andi Kleen, lse-tech
On Wed, Apr 02, 2003 at 09:50:14PM -0800, Matthew Dobson wrote:
> Alrighty, let's give this another go...
>
> This patch hasn't changed much. Just added bitmap_t, typedef'd to
> unsigned long * for passing around bitmaps without breaking the
> abstraction. I think it's good if we can keep the underlying data type
> hidden to partially future-proof (protect? ;) the code.
Not a good idea. Se_bit & co are defined to work on unsigned longs arrays.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [rfc][patch] Memory Binding Take 2 (1/1)
2003-04-03 5:56 ` [rfc][patch] Memory Binding Take 2 (1/1) Matthew Dobson
2003-04-03 6:37 ` Andrew Morton
2003-04-03 12:20 ` Hugh Dickins
@ 2003-04-04 13:40 ` Christoph Hellwig
2 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2003-04-04 13:40 UTC (permalink / raw)
To: Matthew Dobson
Cc: linux-kernel, Martin J. Bligh, Andrew Morton, Paolo Zeppegno,
Andi Kleen, lse-tech
> +#ifndef _LINUX_MBIND_H
> +#define _LINUX_MBIND_H
> +
> +#ifdef CONFIG_NUMA
> +
> +#include <linux/mmzone.h>
> +
> +/* Structure to keep track of memory segment (VMA) bindings */
> +struct binding {
> + struct zonelist zonelist;
> +};
> +
> +#endif /* CONFIG_NUMA */
> +#endif /* _LINUX_MBIND_H */
Using CONFIG_ without explicitly including config.h is a bad idea.
But the ifdef is unessecary anyway, no one is hurt by having this
struct defintion in the non-numa case. Also I wonder how you can have
a copyright on a single trivial struct defintion :) What about just
moving it to mmzone.h, btw?
> +#include <linux/errno.h>
> +#include <linux/mm.h>
> +#include <linux/mbind.h>
> +#include <asm/string.h>
Use <linux/string.h> instead.
> + binding = (struct binding *)kmalloc(sizeof(struct binding), GFP_KERNEL);
The cast is superflous.
> + if (!(vma && vma->vm_file && vma->vm_ops &&
> + vma->vm_ops->nopage == shmem_nopage)) {
> + /* This isn't a shm segment. For now, we bail. */
> + error = -EINVAL;
> + goto out;
> + }
This check is just horrible. Please describe what kind of mappings
this doesn't work for and why and add a VM_ flag for those that
support memory binding.
> .private_list = LIST_HEAD_INIT(swapper_space.private_list),
> +#ifdef CONFIG_NUMA
> + .binding = NULL,
> +#endif
You don't need to initialize this at all.
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2003-04-04 13:36 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-04-03 5:50 [rfc][patch] Memory Binding Take 2 (0/1) Matthew Dobson
2003-04-03 5:56 ` [rfc][patch] Memory Binding Take 2 (1/1) Matthew Dobson
2003-04-03 6:37 ` Andrew Morton
2003-04-03 23:30 ` Matthew Dobson
2003-04-03 12:20 ` Hugh Dickins
2003-04-03 13:25 ` Paolo Zeppegno
2003-04-03 23:57 ` Matthew Dobson
2003-04-04 13:40 ` Christoph Hellwig
2003-04-04 13:34 ` [rfc][patch] Memory Binding Take 2 (0/1) Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox