* [PATCH 0/8] De-couple sysfs memory directories from memory sections
@ 2010-09-22 14:15 Nathan Fontenot
2010-09-22 14:28 ` [PATCH 1/8] Move find_memory_block() routine Nathan Fontenot
` (9 more replies)
0 siblings, 10 replies; 14+ messages in thread
From: Nathan Fontenot @ 2010-09-22 14:15 UTC
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
This set of patches removes the assumption that a single memory
section corresponds to a single directory in
/sys/devices/system/memory/. On systems
with large amounts of memory (1+ TB) there are performance issues
related to creating the large number of sysfs directories. For
a powerpc machine with 1 TB of memory we are creating 63,000+
directories. This results in boot times of around 45-50
minutes for systems with 1 TB of memory and 8 hours for systems
with 2 TB of memory. With this patch set applied I am now seeing
boot times of 5 minutes or less.
The root of this issue is in sysfs directory creation. Every time
a directory is created, a string compare is done against all sibling
directories to ensure we do not create duplicates. The list of
directory nodes in sysfs is kept as an unsorted list, so each
creation takes time proportional to the number of existing siblings
and the total time grows quadratically with the number of
directories created.
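To make the cost concrete, here is a minimal userspace sketch of the
pattern described above. It is not the sysfs code; struct dir_node,
add_dir() and create_dirs() are hypothetical names used only for this
illustration. Each insertion scans every existing entry, so creating N
entries performs N*(N-1)/2 string compares in total.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a sysfs directory node; not the kernel's type. */
struct dir_node {
	char name[32];
	struct dir_node *next;
};

/* Counts string compares performed across all insertions. */
static unsigned long comparisons;

/*
 * Add a name to an unsorted list, first scanning every existing
 * sibling to reject duplicates. Returns 0 on success, -1 on a
 * duplicate or allocation failure.
 */
static int add_dir(struct dir_node **head, const char *name)
{
	struct dir_node *n;

	assert(head && name);

	for (n = *head; n; n = n->next) {
		comparisons++;
		if (strcmp(n->name, name) == 0)
			return -1;
	}

	n = malloc(sizeof(*n));
	if (!n)
		return -1;
	snprintf(n->name, sizeof(n->name), "%s", name);
	n->next = *head;
	*head = n;
	return 0;
}

/* Create "memory0".."memory<count-1>" and return the total compares. */
static unsigned long create_dirs(unsigned int count)
{
	struct dir_node *head = NULL, *n;
	char name[32];
	unsigned int i;

	comparisons = 0;
	for (i = 0; i < count; i++) {
		snprintf(name, sizeof(name), "memory%u", i);
		(void)add_dir(&head, name);
	}

	/* Free the list so the sketch does not leak. */
	while (head) {
		n = head->next;
		free(head);
		head = n;
	}
	return comparisons;
}
```

Calling create_dirs(1000) and then create_dirs(2000) shows the compare
count roughly quadrupling when the directory count doubles, which is
the growth pattern that reducing the number of directories avoids.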
The solution implemented by this patch set is to allow a single
directory in sysfs to span multiple memory sections. This is
controlled by an optional architecture-defined function,
memory_block_size_bytes(). The default definition of this
routine returns a memory block size equal to the memory section
size, so the layout of the sysfs memory directories as seen by
userspace remains the same as it is today.
For architectures that define their own version of this routine,
as is done for powerpc in this patchset, the view in userspace
would change such that each memoryXXX directory would span
multiple memory sections. The number of sections spanned would
depend on the value reported by memory_block_size_bytes().
In both cases a new file 'end_phys_index' is created in each
memoryXXX directory. This file will contain the physical id
of the last memory section covered by the sysfs directory. For
the default case, the value in 'end_phys_index' will be the same
as in the existing 'phys_index' file.
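The relationship between a section number, the block size and the two
index files can be sketched in userspace as follows. This is only an
illustration of the arithmetic: the 16 MB section size is an assumed
value, and struct block_range, sections_per_block() and
block_for_section() are hypothetical helpers, not part of the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 16 MB sections (SECTION_SIZE_BITS = 24). */
#define SECTION_SIZE_BITS	24
#define MIN_MEMORY_BLOCK_SIZE	(UINT64_C(1) << SECTION_SIZE_BITS)

struct block_range {
	unsigned long start_phys_index;	/* first section in the block */
	unsigned long end_phys_index;	/* last section in the block */
};

/* Number of sections covered by one sysfs directory. */
static unsigned long sections_per_block(uint64_t block_size_bytes)
{
	return (unsigned long)(block_size_bytes / MIN_MEMORY_BLOCK_SIZE);
}

/* Find the block (directory) that a given section falls into. */
static struct block_range block_for_section(unsigned long section_nr,
					    uint64_t block_size_bytes)
{
	unsigned long per_block = sections_per_block(block_size_bytes);
	struct block_range r;

	assert(per_block > 0);

	/* Round the section number down to a multiple of per_block. */
	r.start_phys_index = (section_nr / per_block) * per_block;
	r.end_phys_index = r.start_phys_index + per_block - 1;
	return r;
}
```

With a block size equal to the section size the two indexes are equal,
matching the default case described above. With an assumed 256 MB
block, section 37 falls into the block covering sections 32 through 47.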
-Nathan Fontenot
* [PATCH 1/8] Move find_memory_block() routine
From: Nathan Fontenot @ 2010-09-22 14:28 UTC
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
Move the find_memory_block() routine up to avoid needing a forward
declaration in subsequent patches.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
drivers/base/memory.c | 62 +++++++++++++++++++++++++-------------------------
1 file changed, 31 insertions(+), 31 deletions(-)
Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c 2010-09-21 11:59:24.000000000 -0500
+++ linux-next/drivers/base/memory.c 2010-09-21 12:32:45.000000000 -0500
@@ -435,6 +435,37 @@ int __weak arch_get_memory_phys_device(u
return 0;
}
+/*
+ * For now, we have a linear search to go find the appropriate
+ * memory_block corresponding to a particular phys_index. If
+ * this gets to be a real problem, we can always use a radix
+ * tree or something here.
+ *
+ * This could be made generic for all sysdev classes.
+ */
+struct memory_block *find_memory_block(struct mem_section *section)
+{
+ struct kobject *kobj;
+ struct sys_device *sysdev;
+ struct memory_block *mem;
+ char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+
+ /*
+ * This only works because we know that section == sysdev->id
+ * slightly redundant with sysdev_register()
+ */
+ sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+
+ kobj = kset_find_obj(&memory_sysdev_class.kset, name);
+ if (!kobj)
+ return NULL;
+
+ sysdev = container_of(kobj, struct sys_device, kobj);
+ mem = container_of(sysdev, struct memory_block, sysdev);
+
+ return mem;
+}
+
static int add_memory_block(int nid, struct mem_section *section,
unsigned long state, enum mem_add_context context)
{
@@ -468,37 +499,6 @@ static int add_memory_block(int nid, str
return ret;
}
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all sysdev classes.
- */
-struct memory_block *find_memory_block(struct mem_section *section)
-{
- struct kobject *kobj;
- struct sys_device *sysdev;
- struct memory_block *mem;
- char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
-
- /*
- * This only works because we know that section == sysdev->id
- * slightly redundant with sysdev_register()
- */
- sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
-
- kobj = kset_find_obj(&memory_sysdev_class.kset, name);
- if (!kobj)
- return NULL;
-
- sysdev = container_of(kobj, struct sys_device, kobj);
- mem = container_of(sysdev, struct memory_block, sysdev);
-
- return mem;
-}
-
int remove_memory_block(unsigned long node_id, struct mem_section *section,
int phys_device)
{
* [PATCH 2/8] Update memory block struct to have start and end phys index
From: Nathan Fontenot @ 2010-09-22 14:29 UTC
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
Rename the 'phys_index' member of the memory_block struct to
'start_phys_index'; it holds the same value as the current 'phys_index'.
The property still appears as 'phys_index' in sysfs, but the memory_block
struct member name is updated to indicate the start and end values.
This also adds an 'end_phys_index' property to indicate the id of the
last section in the memory block.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
drivers/base/memory.c | 28 ++++++++++++++++++++--------
include/linux/memory.h | 3 ++-
2 files changed, 22 insertions(+), 9 deletions(-)
Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c 2010-09-21 12:32:45.000000000 -0500
+++ linux-next/drivers/base/memory.c 2010-09-21 12:34:04.000000000 -0500
@@ -109,12 +109,20 @@ unregister_memory(struct memory_block *m
* uses.
*/
-static ssize_t show_mem_phys_index(struct sys_device *dev,
+static ssize_t show_mem_start_phys_index(struct sys_device *dev,
struct sysdev_attribute *attr, char *buf)
{
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
- return sprintf(buf, "%08lx\n", mem->phys_index);
+ return sprintf(buf, "%08lx\n", mem->start_phys_index);
+}
+
+static ssize_t show_mem_end_phys_index(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ struct memory_block *mem =
+ container_of(dev, struct memory_block, sysdev);
+ return sprintf(buf, "%08lx\n", mem->end_phys_index);
}
/*
@@ -128,7 +136,7 @@ static ssize_t show_mem_removable(struct
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
- start_pfn = section_nr_to_pfn(mem->phys_index);
+ start_pfn = section_nr_to_pfn(mem->start_phys_index);
ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
return sprintf(buf, "%d\n", ret);
}
@@ -191,7 +199,7 @@ memory_block_action(struct memory_block
int ret;
int old_state = mem->state;
- psection = mem->phys_index;
+ psection = mem->start_phys_index;
first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
/*
@@ -264,7 +272,7 @@ store_mem_state(struct sys_device *dev,
int ret = -EINVAL;
mem = container_of(dev, struct memory_block, sysdev);
- phys_section_nr = mem->phys_index;
+ phys_section_nr = mem->start_phys_index;
if (!present_section_nr(phys_section_nr))
goto out;
@@ -296,7 +304,8 @@ static ssize_t show_phys_device(struct s
return sprintf(buf, "%d\n", mem->phys_device);
}
-static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
+static SYSDEV_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
+static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
@@ -476,16 +485,18 @@ static int add_memory_block(int nid, str
if (!mem)
return -ENOMEM;
- mem->phys_index = __section_nr(section);
+ mem->start_phys_index = __section_nr(section);
mem->state = state;
mutex_init(&mem->state_mutex);
- start_pfn = section_nr_to_pfn(mem->phys_index);
+ start_pfn = section_nr_to_pfn(mem->start_phys_index);
mem->phys_device = arch_get_memory_phys_device(start_pfn);
ret = register_memory(mem, section);
if (!ret)
ret = mem_create_simple_file(mem, phys_index);
if (!ret)
+ ret = mem_create_simple_file(mem, end_phys_index);
+ if (!ret)
ret = mem_create_simple_file(mem, state);
if (!ret)
ret = mem_create_simple_file(mem, phys_device);
@@ -507,6 +518,7 @@ int remove_memory_block(unsigned long no
mem = find_memory_block(section);
unregister_mem_sect_under_nodes(mem);
mem_remove_simple_file(mem, phys_index);
+ mem_remove_simple_file(mem, end_phys_index);
mem_remove_simple_file(mem, state);
mem_remove_simple_file(mem, phys_device);
mem_remove_simple_file(mem, removable);
Index: linux-next/include/linux/memory.h
===================================================================
--- linux-next.orig/include/linux/memory.h 2010-09-21 11:59:28.000000000 -0500
+++ linux-next/include/linux/memory.h 2010-09-21 12:34:04.000000000 -0500
@@ -21,7 +21,8 @@
#include <linux/mutex.h>
struct memory_block {
- unsigned long phys_index;
+ unsigned long start_phys_index;
+ unsigned long end_phys_index;
unsigned long state;
/*
* This serializes all state change requests. It isn't
* [PATCH 3/8] Add section count to memory_block struct
From: Nathan Fontenot @ 2010-09-22 14:30 UTC
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
Add a section count property to the memory_block struct to track the number
of memory sections currently present in a memory block. The count is
incremented when a section is added and decremented when one is removed.
This allows us to know when the last memory section of a memory block has
been removed so we can remove the memory block.
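The counting scheme can be sketched in userspace with C11 atomics. This
is only an illustration of the idea, not the kernel code: struct block
and its helpers are hypothetical, and atomic_fetch_sub() returning 1 is
used here to mirror what atomic_dec_and_test() reports in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct memory_block. */
struct block {
	atomic_int section_count;
	bool registered;	/* true while the sysfs directory would exist */
};

static void block_init(struct block *b)
{
	atomic_init(&b->section_count, 0);
	b->registered = false;
}

/* Called once per section added; the first section creates the block. */
static void block_add_section(struct block *b)
{
	if (atomic_fetch_add(&b->section_count, 1) == 0)
		b->registered = true;
}

/*
 * Called once per section removed. Only the call that drops the
 * count to zero tears the block down. Returns true if this call
 * removed the block.
 */
static bool block_remove_section(struct block *b)
{
	int prev = atomic_fetch_sub(&b->section_count, 1);

	assert(prev > 0);	/* removing from an empty block is a bug */
	if (prev == 1) {
		b->registered = false;
		return true;
	}
	return false;
}
```

In this sketch a block with three sections stays registered through the
first two removals and is torn down only by the third.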
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
drivers/base/memory.c | 18 +++++++++++-------
include/linux/memory.h | 2 ++
2 files changed, 13 insertions(+), 7 deletions(-)
Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c 2010-09-21 12:36:03.000000000 -0500
+++ linux-next/drivers/base/memory.c 2010-09-21 12:36:45.000000000 -0500
@@ -487,6 +487,7 @@ static int add_memory_block(int nid, str
mem->start_phys_index = __section_nr(section);
mem->state = state;
+ atomic_inc(&mem->section_count);
mutex_init(&mem->state_mutex);
start_pfn = section_nr_to_pfn(mem->start_phys_index);
mem->phys_device = arch_get_memory_phys_device(start_pfn);
@@ -516,13 +517,16 @@ int remove_memory_block(unsigned long no
struct memory_block *mem;
mem = find_memory_block(section);
- unregister_mem_sect_under_nodes(mem);
- mem_remove_simple_file(mem, phys_index);
- mem_remove_simple_file(mem, end_phys_index);
- mem_remove_simple_file(mem, state);
- mem_remove_simple_file(mem, phys_device);
- mem_remove_simple_file(mem, removable);
- unregister_memory(mem, section);
+
+ if (atomic_dec_and_test(&mem->section_count)) {
+ unregister_mem_sect_under_nodes(mem);
+ mem_remove_simple_file(mem, phys_index);
+ mem_remove_simple_file(mem, end_phys_index);
+ mem_remove_simple_file(mem, state);
+ mem_remove_simple_file(mem, phys_device);
+ mem_remove_simple_file(mem, removable);
+ unregister_memory(mem, section);
+ }
return 0;
}
Index: linux-next/include/linux/memory.h
===================================================================
--- linux-next.orig/include/linux/memory.h 2010-09-21 12:34:04.000000000 -0500
+++ linux-next/include/linux/memory.h 2010-09-21 12:36:45.000000000 -0500
@@ -19,11 +19,13 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/mutex.h>
+#include <asm/atomic.h>
struct memory_block {
unsigned long start_phys_index;
unsigned long end_phys_index;
unsigned long state;
+ atomic_t section_count;
/*
* This serializes all state change requests. It isn't
* held during creation because the control files are
* [PATCH 4/8] Add mutex for adding/removing memory blocks
From: Nathan Fontenot @ 2010-09-22 14:32 UTC
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
Add a new mutex to serialize adding and removing memory blocks. This
is needed to avoid race conditions in which the same memory block could
be added and removed at the same time.
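A minimal userspace sketch of the locking pattern, using a pthread mutex
in place of the kernel mutex. The table, its size and the function names
are hypothetical; the point is only that the lookup and the count update
happen under one lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_BLOCKS 8	/* arbitrary size for the illustration */

/* Hypothetical table standing in for the registered memory blocks. */
static int block_sections[MAX_BLOCKS];
static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Add one section to a block; returns the block's new section count. */
static int add_section(size_t block_id)
{
	int count;

	assert(block_id < MAX_BLOCKS);

	pthread_mutex_lock(&table_mutex);
	count = ++block_sections[block_id];
	pthread_mutex_unlock(&table_mutex);
	return count;
}

/*
 * Remove one section; returns 1 if this call emptied the block.
 * The lookup and the decrement happen under the same lock, so a
 * concurrent add_section() cannot slip in between them.
 */
static int remove_section(size_t block_id)
{
	int emptied;

	assert(block_id < MAX_BLOCKS);

	pthread_mutex_lock(&table_mutex);
	emptied = (--block_sections[block_id] == 0);
	pthread_mutex_unlock(&table_mutex);
	return emptied;
}
```

Holding a single lock across both paths is the simplest choice; it
trades some concurrency for not having to reason about partially
created or partially removed blocks.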
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
drivers/base/memory.c | 7 +++++++
1 file changed, 7 insertions(+)
Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c 2010-09-21 12:36:45.000000000 -0500
+++ linux-next/drivers/base/memory.c 2010-09-21 12:37:03.000000000 -0500
@@ -27,6 +27,8 @@
#include <asm/atomic.h>
#include <asm/uaccess.h>
+static DEFINE_MUTEX(mem_sysfs_mutex);
+
#define MEMORY_CLASS_NAME "memory"
static struct sysdev_class memory_sysdev_class = {
@@ -485,6 +487,8 @@ static int add_memory_block(int nid, str
if (!mem)
return -ENOMEM;
+ mutex_lock(&mem_sysfs_mutex);
+
mem->start_phys_index = __section_nr(section);
mem->state = state;
atomic_inc(&mem->section_count);
@@ -508,6 +512,7 @@ static int add_memory_block(int nid, str
ret = register_mem_sect_under_node(mem, nid);
}
+ mutex_unlock(&mem_sysfs_mutex);
return ret;
}
@@ -516,6 +521,7 @@ int remove_memory_block(unsigned long no
{
struct memory_block *mem;
+ mutex_lock(&mem_sysfs_mutex);
mem = find_memory_block(section);
if (atomic_dec_and_test(&mem->section_count)) {
@@ -528,6 +534,7 @@ int remove_memory_block(unsigned long no
unregister_memory(mem, section);
}
+ mutex_unlock(&mem_sysfs_mutex);
return 0;
}
* [PATCH 5/8] Allow a memory block to span multiple memory sections
From: Nathan Fontenot @ 2010-09-22 14:33 UTC
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
Update the memory sysfs code such that each sysfs memory directory is now
considered a memory block that can span multiple memory sections. The
default size of each memory block is one memory section
(1 << SECTION_SIZE_BITS bytes), which maintains the current behavior of
having a single memory section per memory block (i.e. one sysfs directory
per memory section).
Architectures that want memory blocks to span multiple memory sections
need only define their own memory_block_size_bytes() routine.
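The size check applied to the architecture-reported value can be
sketched in userspace as below. This is an illustration under assumed
values (16 MB sections), and is_power_of_two(), validated_block_size()
and sections_in_block() are hypothetical names; the patch itself warns
and falls back inside get_memory_block_size().

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE_BITS	24	/* assumed: 16 MB sections */
#define MIN_MEMORY_BLOCK_SIZE	(UINT32_C(1) << SECTION_SIZE_BITS)

/* True if the value is a nonzero power of two. */
static int is_power_of_two(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Accept a reported block size only if it is a power of two and at
 * least one section; otherwise fall back to the section size.
 */
static uint32_t validated_block_size(uint32_t reported)
{
	if (!is_power_of_two(reported) || reported < MIN_MEMORY_BLOCK_SIZE)
		return MIN_MEMORY_BLOCK_SIZE;
	return reported;
}

/* Sections per block after validation. */
static uint32_t sections_in_block(uint32_t reported)
{
	uint32_t block_sz = validated_block_size(reported);

	/* A valid size is always a whole number of sections. */
	assert(block_sz % MIN_MEMORY_BLOCK_SIZE == 0);
	return block_sz / MIN_MEMORY_BLOCK_SIZE;
}
```

Under these assumptions a reported 256 MB block yields 16 sections per
block, while a reported size of 0, 1 MB or 48 MB falls back to a single
section per block.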
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
drivers/base/memory.c | 148 ++++++++++++++++++++++++++++++++++----------------
1 file changed, 103 insertions(+), 45 deletions(-)
Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c 2010-09-21 12:37:03.000000000 -0500
+++ linux-next/drivers/base/memory.c 2010-09-21 12:37:30.000000000 -0500
@@ -30,6 +30,14 @@
static DEFINE_MUTEX(mem_sysfs_mutex);
#define MEMORY_CLASS_NAME "memory"
+#define MIN_MEMORY_BLOCK_SIZE (1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+ return (section_nr / sections_per_block) * sections_per_block;
+}
static struct sysdev_class memory_sysdev_class = {
.name = MEMORY_CLASS_NAME,
@@ -84,22 +92,21 @@ EXPORT_SYMBOL(unregister_memory_isolate_
* register_memory - Setup a sysfs device for a memory block
*/
static
-int register_memory(struct memory_block *memory, struct mem_section *section)
+int register_memory(struct memory_block *memory)
{
int error;
memory->sysdev.cls = &memory_sysdev_class;
- memory->sysdev.id = __section_nr(section);
+ memory->sysdev.id = memory->start_phys_index;
error = sysdev_register(&memory->sysdev);
return error;
}
static void
-unregister_memory(struct memory_block *memory, struct mem_section *section)
+unregister_memory(struct memory_block *memory)
{
BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
- BUG_ON(memory->sysdev.id != __section_nr(section));
/* drop the ref. we got in remove_memory_block() */
kobject_put(&memory->sysdev.kobj);
@@ -133,13 +140,16 @@ static ssize_t show_mem_end_phys_index(s
static ssize_t show_mem_removable(struct sys_device *dev,
struct sysdev_attribute *attr, char *buf)
{
- unsigned long start_pfn;
- int ret;
+ unsigned long i, pfn;
+ int ret = 1;
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
- start_pfn = section_nr_to_pfn(mem->start_phys_index);
- ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+ for (i = mem->start_phys_index; i <= mem->end_phys_index; i++) {
+ pfn = section_nr_to_pfn(i);
+ ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
+ }
+
return sprintf(buf, "%d\n", ret);
}
@@ -192,17 +202,14 @@ int memory_isolate_notify(unsigned long
* OK to have direct references to sparsemem variables in here.
*/
static int
-memory_block_action(struct memory_block *mem, unsigned long action)
+memory_section_action(unsigned long phys_index, unsigned long action)
{
int i;
- unsigned long psection;
unsigned long start_pfn, start_paddr;
struct page *first_page;
int ret;
- int old_state = mem->state;
- psection = mem->start_phys_index;
- first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
+ first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
/*
* The probe routines leave the pages reserved, just
@@ -215,8 +222,8 @@ memory_block_action(struct memory_block
continue;
printk(KERN_WARNING "section number %ld page number %d "
- "not reserved, was it already online? \n",
- psection, i);
+ "not reserved, was it already online?\n",
+ phys_index, i);
return -EBUSY;
}
}
@@ -227,18 +234,13 @@ memory_block_action(struct memory_block
ret = online_pages(start_pfn, PAGES_PER_SECTION);
break;
case MEM_OFFLINE:
- mem->state = MEM_GOING_OFFLINE;
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
ret = remove_memory(start_paddr,
PAGES_PER_SECTION << PAGE_SHIFT);
- if (ret) {
- mem->state = old_state;
- break;
- }
break;
default:
- WARN(1, KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
- __func__, mem, action, action);
+ WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
+ "%ld\n", __func__, phys_index, action, action);
ret = -EINVAL;
}
@@ -248,7 +250,7 @@ memory_block_action(struct memory_block
static int memory_block_change_state(struct memory_block *mem,
unsigned long to_state, unsigned long from_state_req)
{
- int ret = 0;
+ int i, ret = 0;
mutex_lock(&mem->state_mutex);
if (mem->state != from_state_req) {
@@ -256,8 +258,21 @@ static int memory_block_change_state(str
goto out;
}
- ret = memory_block_action(mem, to_state);
- if (!ret)
+ if (to_state == MEM_OFFLINE)
+ mem->state = MEM_GOING_OFFLINE;
+
+ for (i = mem->start_phys_index; i <= mem->end_phys_index; i++) {
+ ret = memory_section_action(i, to_state);
+ if (ret)
+ break;
+ }
+
+ if (ret) {
+ for (i = mem->start_phys_index; i <= mem->end_phys_index; i++)
+ memory_section_action(i, from_state_req);
+
+ mem->state = from_state_req;
+ } else
mem->state = to_state;
out:
@@ -270,20 +285,15 @@ store_mem_state(struct sys_device *dev,
struct sysdev_attribute *attr, const char *buf, size_t count)
{
struct memory_block *mem;
- unsigned int phys_section_nr;
int ret = -EINVAL;
mem = container_of(dev, struct memory_block, sysdev);
- phys_section_nr = mem->start_phys_index;
-
- if (!present_section_nr(phys_section_nr))
- goto out;
if (!strncmp(buf, "online", min((int)count, 6)))
ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
else if(!strncmp(buf, "offline", min((int)count, 7)))
ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
-out:
+
if (ret)
return ret;
return count;
@@ -460,12 +470,13 @@ struct memory_block *find_memory_block(s
struct sys_device *sysdev;
struct memory_block *mem;
char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+ int block_id = base_memory_block_id(__section_nr(section));
/*
* This only works because we know that section == sysdev->id
* slightly redundant with sysdev_register()
*/
- sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+ sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, block_id);
kobj = kset_find_obj(&memory_sysdev_class.kset, name);
if (!kobj)
@@ -477,26 +488,26 @@ struct memory_block *find_memory_block(s
return mem;
}
-static int add_memory_block(int nid, struct mem_section *section,
- unsigned long state, enum mem_add_context context)
+static int init_memory_block(struct memory_block **memory,
+ struct mem_section *section, unsigned long state)
{
- struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+ struct memory_block *mem;
unsigned long start_pfn;
int ret = 0;
+ mem = kzalloc(sizeof(*mem), GFP_KERNEL);
if (!mem)
return -ENOMEM;
- mutex_lock(&mem_sysfs_mutex);
-
- mem->start_phys_index = __section_nr(section);
+ mem->start_phys_index = base_memory_block_id(__section_nr(section));
+ mem->end_phys_index = mem->start_phys_index + sections_per_block - 1;
mem->state = state;
atomic_inc(&mem->section_count);
mutex_init(&mem->state_mutex);
start_pfn = section_nr_to_pfn(mem->start_phys_index);
mem->phys_device = arch_get_memory_phys_device(start_pfn);
- ret = register_memory(mem, section);
+ ret = register_memory(mem);
if (!ret)
ret = mem_create_simple_file(mem, phys_index);
if (!ret)
@@ -507,8 +518,29 @@ static int add_memory_block(int nid, str
ret = mem_create_simple_file(mem, phys_device);
if (!ret)
ret = mem_create_simple_file(mem, removable);
+
+ *memory = mem;
+ return ret;
+}
+
+static int add_memory_section(int nid, struct mem_section *section,
+ unsigned long state, enum mem_add_context context)
+{
+ struct memory_block *mem;
+ int ret = 0;
+
+ mutex_lock(&mem_sysfs_mutex);
+
+ mem = find_memory_block(section);
+ if (mem) {
+ atomic_inc(&mem->section_count);
+ kobject_put(&mem->sysdev.kobj);
+ } else
+ ret = init_memory_block(&mem, section, state);
+
if (!ret) {
- if (context == HOTPLUG)
+ if (context == HOTPLUG &&
+ atomic_read(&mem->section_count) == sections_per_block)
ret = register_mem_sect_under_node(mem, nid);
}
@@ -531,8 +563,10 @@ int remove_memory_block(unsigned long no
mem_remove_simple_file(mem, state);
mem_remove_simple_file(mem, phys_device);
mem_remove_simple_file(mem, removable);
- unregister_memory(mem, section);
- }
+ unregister_memory(mem);
+ kfree(mem);
+ } else
+ kobject_put(&mem->sysdev.kobj);
mutex_unlock(&mem_sysfs_mutex);
return 0;
@@ -544,7 +578,7 @@ int remove_memory_block(unsigned long no
*/
int register_new_memory(int nid, struct mem_section *section)
{
- return add_memory_block(nid, section, MEM_OFFLINE, HOTPLUG);
+ return add_memory_section(nid, section, MEM_OFFLINE, HOTPLUG);
}
int unregister_memory_section(struct mem_section *section)
@@ -555,6 +589,26 @@ int unregister_memory_section(struct mem
return remove_memory_block(0, section, 0);
}
+u32 __weak memory_block_size_bytes(void)
+{
+ return MIN_MEMORY_BLOCK_SIZE;
+}
+
+static u32 get_memory_block_size(void)
+{
+ u32 block_sz;
+
+ block_sz = memory_block_size_bytes();
+
+ /* Validate block_sz is a power of 2 and not less than section size */
+ if ((block_sz & (block_sz - 1)) || (block_sz < MIN_MEMORY_BLOCK_SIZE)) {
+ WARN_ON(1);
+ block_sz = MIN_MEMORY_BLOCK_SIZE;
+ }
+
+ return block_sz;
+}
+
/*
* Initialize the sysfs support for memory devices...
*/
@@ -563,12 +617,16 @@ int __init memory_dev_init(void)
unsigned int i;
int ret;
int err;
+ int block_sz;
memory_sysdev_class.kset.uevent_ops = &memory_uevent_ops;
ret = sysdev_class_register(&memory_sysdev_class);
if (ret)
goto out;
+ block_sz = get_memory_block_size();
+ sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
+
/*
* Create entries for memory sections that were found
* during boot and have been initialized
@@ -576,8 +634,8 @@ int __init memory_dev_init(void)
for (i = 0; i < NR_MEM_SECTIONS; i++) {
if (!present_section_nr(i))
continue;
- err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE,
- BOOT);
+ err = add_memory_section(0, __nr_to_section(i), MEM_ONLINE,
+ BOOT);
if (!ret)
ret = err;
}
* [PATCH 6/8] Update node sysfs code
From: Nathan Fontenot @ 2010-09-22 14:34 UTC
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
Update the node sysfs code to be aware of the new capability for a memory
block to span multiple memory sections. This requires an additional
parameter to unregister_mem_sect_under_nodes so that we know which memory
section of the memory block to unregister.
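The two page frame ranges involved can be sketched in userspace as
follows. The section and page sizes are assumed values chosen for the
illustration, the macros are local re-definitions rather than the
kernel's, and struct pfn_range, block_pfn_range() and
section_pfn_range() are hypothetical helpers.

```c
#include <assert.h>

/* Assumed for illustration: 16 MB sections and 64 KB pages. */
#define SECTION_SIZE_BITS	24
#define PAGE_SHIFT		16
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)

/* Local re-definition of the section-to-pfn conversion. */
static unsigned long section_nr_to_pfn(unsigned long section_nr)
{
	return section_nr << PFN_SECTION_SHIFT;
}

struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* inclusive */
};

/* Range scanned when linking a whole block under a node. */
static struct pfn_range block_pfn_range(unsigned long start_phys_index,
					unsigned long end_phys_index)
{
	struct pfn_range r;

	assert(start_phys_index <= end_phys_index);
	r.start_pfn = section_nr_to_pfn(start_phys_index);
	r.end_pfn = section_nr_to_pfn(end_phys_index) + PAGES_PER_SECTION - 1;
	return r;
}

/* Range scanned when unlinking a single section of a block. */
static struct pfn_range section_pfn_range(unsigned long phys_index)
{
	return block_pfn_range(phys_index, phys_index);
}
```

Registration walks the whole block, from the first page of the first
section to the last page of the last section, while unregistration
walks only the one section named by the new phys_index parameter.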
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
drivers/base/memory.c | 4 +++-
drivers/base/node.c | 12 ++++++++----
include/linux/node.h | 6 ++++--
3 files changed, 15 insertions(+), 7 deletions(-)
Index: linux-next/drivers/base/node.c
===================================================================
--- linux-next.orig/drivers/base/node.c 2010-09-21 11:59:24.000000000 -0500
+++ linux-next/drivers/base/node.c 2010-09-21 12:38:02.000000000 -0500
@@ -346,8 +346,10 @@ int register_mem_sect_under_node(struct
return -EFAULT;
if (!node_online(nid))
return 0;
- sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
- sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+
+ sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index);
+ sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index);
+ sect_end_pfn += PAGES_PER_SECTION - 1;
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int page_nid;
@@ -371,7 +373,8 @@ int register_mem_sect_under_node(struct
}
/* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+ unsigned long phys_index)
{
NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -383,7 +386,8 @@ int unregister_mem_sect_under_nodes(stru
if (!unlinked_nodes)
return -ENOMEM;
nodes_clear(*unlinked_nodes);
- sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
+
+ sect_start_pfn = section_nr_to_pfn(phys_index);
sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int nid;
Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c 2010-09-21 12:37:30.000000000 -0500
+++ linux-next/drivers/base/memory.c 2010-09-21 12:38:02.000000000 -0500
@@ -555,9 +555,9 @@ int remove_memory_block(unsigned long no
mutex_lock(&mem_sysfs_mutex);
mem = find_memory_block(section);
+ unregister_mem_sect_under_nodes(mem, __section_nr(section));
if (atomic_dec_and_test(&mem->section_count)) {
- unregister_mem_sect_under_nodes(mem);
mem_remove_simple_file(mem, phys_index);
mem_remove_simple_file(mem, end_phys_index);
mem_remove_simple_file(mem, state);
@@ -631,6 +631,7 @@ int __init memory_dev_init(void)
* Create entries for memory sections that were found
* during boot and have been initialized
*/
+ printk(KERN_ERR "Memory Start\n");
for (i = 0; i < NR_MEM_SECTIONS; i++) {
if (!present_section_nr(i))
continue;
@@ -639,6 +640,7 @@ int __init memory_dev_init(void)
if (!ret)
ret = err;
}
+ printk(KERN_ERR "Memory End\n");
err = memory_probe_init();
if (!ret)
Index: linux-next/include/linux/node.h
===================================================================
--- linux-next.orig/include/linux/node.h 2010-09-21 11:59:28.000000000 -0500
+++ linux-next/include/linux/node.h 2010-09-21 12:38:02.000000000 -0500
@@ -44,7 +44,8 @@ extern int register_cpu_under_node(unsig
extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int register_mem_sect_under_node(struct memory_block *mem_blk,
int nid);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
+extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+ unsigned long phys_index);
#ifdef CONFIG_HUGETLBFS
extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
@@ -72,7 +73,8 @@ static inline int register_mem_sect_unde
{
return 0;
}
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+ unsigned long phys_index)
{
return 0;
}
* [PATCH 7/8] Define memory_block_size_bytes() for powerpc/pseries
From: Nathan Fontenot @ 2010-09-22 14:35 UTC
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
Define a version of memory_block_size_bytes() for powerpc/pseries such that
a memory block spans an entire LMB (logical memory block).
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
arch/powerpc/platforms/pseries/hotplug-memory.c | 66 +++++++++++++++++++-----
1 file changed, 53 insertions(+), 13 deletions(-)
Index: linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c
===================================================================
--- linux-next.orig/arch/powerpc/platforms/pseries/hotplug-memory.c 2010-09-21 11:59:24.000000000 -0500
+++ linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c 2010-09-21 12:38:31.000000000 -0500
@@ -17,6 +17,54 @@
#include <asm/pSeries_reconfig.h>
#include <asm/sparsemem.h>
+static u32 get_memblock_size(void)
+{
+ struct device_node *np;
+ unsigned int memblock_size = 0;
+
+ np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+ if (np) {
+ const unsigned long *size;
+
+ size = of_get_property(np, "ibm,lmb-size", NULL);
+ memblock_size = size ? *size : 0;
+
+ of_node_put(np);
+ } else {
+ unsigned int memzero_size = 0;
+ const unsigned int *regs;
+
+ np = of_find_node_by_path("/memory@0");
+ if (np) {
+ regs = of_get_property(np, "reg", NULL);
+ memzero_size = regs ? regs[3] : 0;
+ of_node_put(np);
+ }
+
+ if (memzero_size) {
+ /* We now know the size of memory@0, use this to find
+ * the first memoryblock and get its size.
+ */
+ char buf[64];
+
+ sprintf(buf, "/memory@%x", memzero_size);
+ np = of_find_node_by_path(buf);
+ if (np) {
+ regs = of_get_property(np, "reg", NULL);
+ memblock_size = regs ? regs[3] : 0;
+ of_node_put(np);
+ }
+ }
+ }
+
+ return memblock_size;
+}
+
+u32 memory_block_size_bytes(void)
+{
+ return get_memblock_size();
+}
+
static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size)
{
unsigned long start, start_pfn;
@@ -127,30 +175,22 @@ static int pseries_add_memory(struct dev
static int pseries_drconf_memory(unsigned long *base, unsigned int action)
{
- struct device_node *np;
- const unsigned long *lmb_size;
+ unsigned long memblock_size;
int rc;
- np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
- if (!np)
+ memblock_size = get_memblock_size();
+ if (!memblock_size)
return -EINVAL;
- lmb_size = of_get_property(np, "ibm,lmb-size", NULL);
- if (!lmb_size) {
- of_node_put(np);
- return -EINVAL;
- }
-
if (action == PSERIES_DRCONF_MEM_ADD) {
- rc = memblock_add(*base, *lmb_size);
+ rc = memblock_add(*base, memblock_size);
rc = (rc < 0) ? -EINVAL : 0;
} else if (action == PSERIES_DRCONF_MEM_REMOVE) {
- rc = pseries_remove_memblock(*base, *lmb_size);
+ rc = pseries_remove_memblock(*base, memblock_size);
} else {
rc = -EINVAL;
}
- of_node_put(np);
return rc;
}
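A note on the regs[3] reads in get_memblock_size() above: the index only
selects the low 32 bits of the size if the "reg" property is laid out as two
address cells followed by two size cells. That layout is an assumption here,
the patch does not state it. The following is a minimal userspace sketch, not
kernel code, of decoding a size from raw big-endian cells; every helper name
is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Read one big-endian 32-bit device tree cell from a raw byte buffer. */
static inline uint32_t cell_at(const uint8_t *prop, unsigned int idx)
{
	const uint8_t *p = prop + 4u * idx;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Combine one or two consecutive cells into a single 64-bit value. */
static inline uint64_t read_cells(const uint8_t *prop, unsigned int first,
				  unsigned int ncells)
{
	uint64_t v = 0;
	unsigned int i;

	assert(ncells >= 1 && ncells <= 2);
	for (i = 0; i < ncells; i++)
		v = (v << 32) | cell_at(prop, first + i);
	return v;
}

/*
 * Size field of the first (address, size) pair in a "reg" property.
 * address_cells and size_cells stand in for #address-cells and
 * #size-cells of the parent node (both assumed to be 2 on pseries).
 */
static inline uint64_t reg_size(const uint8_t *reg, unsigned int address_cells,
				unsigned int size_cells)
{
	return read_cells(reg, address_cells, size_cells);
}
```

With a 2/2 layout, cell 3 alone matches the full size only while the size is
below 4GB; reading both size cells avoids depending on that.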
* [PATCH 8/8] Update memory hotplug documentation
2010-09-22 14:15 [PATCH 0/8] De-couple sysfs memory directories from memory sections Nathan Fontenot
` (6 preceding siblings ...)
2010-09-22 14:35 ` [PATCH 7/8] Define memory_block_size_bytes() for powerpc/pseries Nathan Fontenot
@ 2010-09-22 14:36 ` Nathan Fontenot
2010-09-22 15:20 ` [PATCH 0/8] De-couple sysfs memory directories from memory sections Dave Hansen
2010-09-23 18:40 ` Balbir Singh
9 siblings, 0 replies; 14+ messages in thread
From: Nathan Fontenot @ 2010-09-22 14:36 UTC (permalink / raw)
To: linux-kernel, linux-mm, linuxppc-dev
Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen
Update the memory hotplug documentation to describe the new behavior of
memory blocks as exposed in sysfs.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
---
Documentation/memory-hotplug.txt | 46 +++++++++++++++++++++++++--------------
1 file changed, 30 insertions(+), 16 deletions(-)
Index: linux-next/Documentation/memory-hotplug.txt
===================================================================
--- linux-next.orig/Documentation/memory-hotplug.txt 2010-09-21 11:59:22.000000000 -0500
+++ linux-next/Documentation/memory-hotplug.txt 2010-09-21 12:39:05.000000000 -0500
@@ -126,36 +126,50 @@ config options.
--------------------------------
4 sysfs files for memory hotplug
--------------------------------
-All sections have their device information under /sys/devices/system/memory as
+All sections have their device information in sysfs. Each section is part of
+a memory block under /sys/devices/system/memory as
/sys/devices/system/memory/memoryXXX
-(XXX is section id.)
+(XXX is the section id.)
-Now, XXX is defined as start_address_of_section / section_size.
+Now, XXX is defined as (start_address_of_section / section_size) of the first
+section contained in the memory block. The files 'phys_index' and
+'end_phys_index' under each directory report the first and last section ids
+for the memory block covered by the sysfs directory. It is expected that all
+memory sections in this range are present and no memory holes exist in the
+range. Currently there is no way to determine if there is a memory hole, but
+the existence of one should not affect the hotplug capabilities of the memory
+block.
For example, assume 1GiB section size. A device for a memory starting at
0x100000000 is /sys/device/system/memory/memory4
(0x100000000 / 1Gib = 4)
This device covers address range [0x100000000 ... 0x140000000)
-Under each section, you can see 4 files.
+Under each section, you can see 5 files.
-/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/end_phys_index
/sys/devices/system/memory/memoryXXX/phys_device
/sys/devices/system/memory/memoryXXX/state
/sys/devices/system/memory/memoryXXX/removable
-'phys_index' : read-only and contains section id, same as XXX.
-'state' : read-write
- at read: contains online/offline state of memory.
- at write: user can specify "online", "offline" command
-'phys_device': read-only: designed to show the name of physical memory device.
- This is not well implemented now.
-'removable' : read-only: contains an integer value indicating
- whether the memory section is removable or not
- removable. A value of 1 indicates that the memory
- section is removable and a value of 0 indicates that
- it is not removable.
+'phys_index' : read-only and contains section id of the first section
+ in the memory block, same as XXX.
+'end_phys_index' : read-only and contains section id of the last section
+ in the memory block.
+'state' : read-write
+ at read: contains online/offline state of memory.
+ at write: user can specify "online", "offline" command
+ which will be performed on all sections in the block.
+'phys_device' : read-only: designed to show the name of physical memory
+ device. This is not well implemented now.
+'removable' : read-only: contains an integer value indicating
+ whether the memory block is removable or not
+ removable. A value of 1 indicates that the memory
+ block is removable and a value of 0 indicates that
+ it is not removable. A memory block is removable only if
+ every section in the block is removable.
NOTE:
These directories/files appear after physical memory hotplug phase.
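The relationship between 'phys_index', 'end_phys_index' and 'removable'
described above can be sketched in a few lines of userspace C. This is only
an illustration of the documented semantics, not the kernel implementation;
the function names and the per-section flag array are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of sections covered by a block, from its two index files. */
static inline unsigned long sections_in_block(unsigned long phys_index,
					      unsigned long end_phys_index)
{
	assert(phys_index <= end_phys_index);
	return end_phys_index - phys_index + 1;
}

/*
 * Value the block's 'removable' file would report: 1 only if every
 * section in [phys_index, end_phys_index] is removable, else 0.
 * section_removable[] is a toy table indexed by section id.
 */
static inline int block_removable(const bool *section_removable,
				  unsigned long phys_index,
				  unsigned long end_phys_index)
{
	unsigned long nr;

	assert(phys_index <= end_phys_index);
	for (nr = phys_index; nr <= end_phys_index; nr++) {
		if (!section_removable[nr])
			return 0;
	}
	return 1;
}
```

In the default case, where a block is a single section, sections_in_block()
is 1 and block_removable() reduces to the per-section flag.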
* Re: [PATCH 0/8] De-couple sysfs memory directories from memory sections
2010-09-22 14:15 [PATCH 0/8] De-couple sysfs memory directories from memory sections Nathan Fontenot
` (7 preceding siblings ...)
2010-09-22 14:36 ` [PATCH 8/8] Update memory hotplug documentation Nathan Fontenot
@ 2010-09-22 15:20 ` Dave Hansen
2010-09-22 18:40 ` Nathan Fontenot
2010-09-23 18:40 ` Balbir Singh
9 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2010-09-22 15:20 UTC (permalink / raw)
To: Nathan Fontenot
Cc: linux-mm, Greg KH, linux-kernel, KAMEZAWA Hiroyuki, linuxppc-dev
On Wed, 2010-09-22 at 09:15 -0500, Nathan Fontenot wrote:
> For architectures that define their own version of this routine,
> as is done for powerpc in this patchset, the view in userspace
> would change such that each memoryXXX directory would span
> multiple memory sections. The number of sections spanned would
> depend on the value reported by memory_block_size_bytes.
>
> In both cases a new file 'end_phys_index' is created in each
> memoryXXX directory. This file will contain the physical id
> of the last memory section covered by the sysfs directory. For
> the default case, the value in 'end_phys_index' will be the same
> as in the existing 'phys_index' file.
Hi Nathan,
There's one bit missing here, I think.
"block_size_bytes" today means two things:
1. the SECTION_SIZE from sparsemem
2. the size covered by each memoryXXXX directory
SECTION_SIZE isn't exposed to userspace, but the memoryXXXX directories
are. You've done all of the heavy lifting here to make sure that the
memory directories are no longer bound to SECTION_SIZE, but you've also
broken the assumption that _each_ directory covers "block_size_bytes".
I think it's fairly simple to fix. block_size_bytes() needs to return
memory_block_size_bytes(), and phys_index's calculation needs to be:
mem->start_phys_index * SECTION_SIZE / memory_block_size_bytes()
That way, to userspace, it just looks like before, but with a larger
SECTION_SIZE. Doing that preserves the ABI pretty nicely, I believe.
-- Dave
* Re: [PATCH 0/8] De-couple sysfs memory directories from memory sections
2010-09-22 15:20 ` [PATCH 0/8] De-couple sysfs memory directories from memory sections Dave Hansen
@ 2010-09-22 18:40 ` Nathan Fontenot
2010-09-22 18:58 ` Dave Hansen
0 siblings, 1 reply; 14+ messages in thread
From: Nathan Fontenot @ 2010-09-22 18:40 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-mm, Greg KH, linux-kernel, KAMEZAWA Hiroyuki, linuxppc-dev
On 09/22/2010 10:20 AM, Dave Hansen wrote:
> On Wed, 2010-09-22 at 09:15 -0500, Nathan Fontenot wrote:
>> For architectures that define their own version of this routine,
>> as is done for powerpc in this patchset, the view in userspace
>> would change such that each memoryXXX directory would span
>> multiple memory sections. The number of sections spanned would
>> depend on the value reported by memory_block_size_bytes.
>>
>> In both cases a new file 'end_phys_index' is created in each
>> memoryXXX directory. This file will contain the physical id
>> of the last memory section covered by the sysfs directory. For
>> the default case, the value in 'end_phys_index' will be the same
>> as in the existing 'phys_index' file.
>
> Hi Nathan,
>
> There's one bit missing here, I think.
>
> "block_size_bytes" today means two things today:
> 1. the SECTION_SIZE from sparsemem
> 2. the size covered by each memoryXXXX directory
>
> SECTION_SIZE isn't exposed to userspace, but the memoryXXXX directories
> are. You've done all of the heavy lifting here to make sure that the
> memory directories are no longer bound to SECTION_SIZE, but you've also
> broken the assumption that _each_ directory covers "block_size_bytes".
>
> I think it's fairly simple to fix. block_size_bytes() needs to return
> memory_block_size_bytes(),
yes, missed that. I will update the patch set to include this.
> and phys_index's calculation needs to be:
>
> mem->start_phys_index * SECTION_SIZE / memory_block_size_bytes()
I'm not sure I follow where you suggest using this formula. Is this
instead of what is used now, the base_memory_block_id() calculation?
If so, then I'm not sure it would work. base_memory_block_id() uses its
current formula because the memory sections are not guaranteed to be added
to the memory block starting with the first section of the block.
If you meant somewhere else, let me know.
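To illustrate the point about sections arriving in any order: a block id can
be derived from whichever section is seen first by rounding its section
number down to a block boundary. The sketch below is a guess at that idea in
userspace C. The real base_memory_block_id() lives in patch 5/8, which is not
shown in this part of the thread, so the names and the integer-division
formula here are assumptions.

```c
#include <assert.h>

/* Block that a given section belongs to, whichever section arrives first. */
static inline unsigned long block_id_of_section(unsigned long section_nr,
						unsigned long sections_per_block)
{
	assert(sections_per_block > 0);
	return section_nr / sections_per_block;
}

/* Section number of the first section in the block containing section_nr. */
static inline unsigned long block_start_section(unsigned long section_nr,
						unsigned long sections_per_block)
{
	return block_id_of_section(section_nr, sections_per_block) *
	       sections_per_block;
}
```

With 4 sections per block, sections 4 through 7 all map to block 1, so adding
section 6 before section 4 still lands in the same block.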
-Nathan
>
> That way, to userspace, it just looks like before, but with a larger
> SECTION_SIZE. Doing that preserves the ABI pretty nicely, I believe.
>
> -- Dave
>
* Re: [PATCH 0/8] De-couple sysfs memory directories from memory sections
2010-09-22 18:40 ` Nathan Fontenot
@ 2010-09-22 18:58 ` Dave Hansen
0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2010-09-22 18:58 UTC (permalink / raw)
To: Nathan Fontenot
Cc: linux-mm, Greg KH, linux-kernel, KAMEZAWA Hiroyuki, linuxppc-dev
On Wed, 2010-09-22 at 13:40 -0500, Nathan Fontenot wrote:
> On 09/22/2010 10:20 AM, Dave Hansen wrote:
> > and phys_index's calculation needs to be:
> >
> > mem->start_phys_index * SECTION_SIZE / memory_block_size_bytes()
>
> I'm not sure if I follow where you suggest using this formula. Is this
> instead of what is used now, the base_memory_block_id() calculation?
>
> If so, then I'm not sure it would work. The formula used in base_memory_block_id()
> is done because the memory sections are not guaranteed to be added to the
> memory block starting with the first section of the block.
>
> If you meant somewhere else let me know.
My point was just that if we change the "block_size_bytes" contents,
then we have to scale down the "memoryXXXX/phys_index" by that same
amount.
It *used* to be in numbers of SECTION_SIZE units, and I think it still
is:
- mem->start_phys_index = __section_nr(section);
+ mem->start_phys_index = base_memory_block_id(__section_nr(section));
+ mem->end_phys_index = mem->start_phys_index + sections_per_block - 1;
but now it needs to be changed to be in memory_block_size_bytes() units,
*NOT* SECTION_SIZE units.
Let's say we have a system with 4 16MB sections starting at 0x0.
Before, we would have:
block_size_bytes: 16777216
memory0/phys_index: 0
memory1/phys_index: 1
memory2/phys_index: 2
memory3/phys_index: 3
Now, we change memory_block_size_bytes() to be 32MB instead. We cut the
number of memoryXXXX directories in half, and I think the right thing to get is:
block_size_bytes: 33554432
memory0/phys_index: 0
memory1/phys_index: 1
I think, with your code (as it stands in these patches, no fixes) that
we'd instead get this:
block_size_bytes: 16777216
memory0/phys_index: 0
memory1/phys_index: 2
Without consulting "end_phys_index" (which isn't and can't be a part of
the existing ABI), we'd think that we have two 16MB banks instead of
four.
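The example above can be checked with a little arithmetic. This is a minimal
sketch in plain userspace C, not kernel code: the helper names and the 16MB
SECTION_SIZE_BYTES constant are assumptions made for illustration, and only
the scaling formula is taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE_BYTES (16ULL << 20) /* assumed 16MB sections */

/* Id of the first section in block number block_nr (SECTION_SIZE units). */
static inline uint64_t first_section_of_block(uint64_t block_nr,
					      uint64_t block_size)
{
	assert(block_size >= SECTION_SIZE_BYTES);
	assert(block_size % SECTION_SIZE_BYTES == 0);
	return block_nr * (block_size / SECTION_SIZE_BYTES);
}

/* phys_index rescaled into block-size units, per the formula above. */
static inline uint64_t scaled_phys_index(uint64_t start_section_nr,
					 uint64_t block_size)
{
	assert(block_size > 0);
	return start_section_nr * SECTION_SIZE_BYTES / block_size;
}

/* What existing userspace computes: phys_index * block_size_bytes. */
static inline uint64_t block_start_addr(uint64_t phys_index,
					uint64_t block_size_bytes)
{
	return phys_index * block_size_bytes;
}
```

With 32MB blocks, the second block starts at section 2. Reporting 2 as its
phys_index makes userspace compute a 64MB start address, while the rescaled
value 1 gives the correct 32MB.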
-- Dave
* Re: [PATCH 0/8] De-couple sysfs memory directories from memory sections
2010-09-22 14:15 [PATCH 0/8] De-couple sysfs memory directories from memory sections Nathan Fontenot
` (8 preceding siblings ...)
2010-09-22 15:20 ` [PATCH 0/8] De-couple sysfs memory directories from memory sections Dave Hansen
@ 2010-09-23 18:40 ` Balbir Singh
2010-09-24 14:35 ` Nathan Fontenot
9 siblings, 1 reply; 14+ messages in thread
From: Balbir Singh @ 2010-09-23 18:40 UTC (permalink / raw)
To: Nathan Fontenot
Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
KAMEZAWA Hiroyuki
* Nathan Fontenot <nfont@austin.ibm.com> [2010-09-22 09:15:43]:
> This set of patches decouples the concept that a single memory
> section corresponds to a single directory in
> /sys/devices/system/memory/. On systems
> with large amounts of memory (1+ TB) there are performance issues
> related to creating the large number of sysfs directories. For
> a powerpc machine with 1 TB of memory we are creating 63,000+
> directories. This is resulting in boot times of around 45-50
> minutes for systems with 1 TB of memory and 8 hours for systems
> with 2 TB of memory. With this patch set applied I am now seeing
> boot times of 5 minutes or less.
>
> The root of this issue is in sysfs directory creation. Every time
> a directory is created a string compare is done against all sibling
> directories to ensure we do not create duplicates. The list of
> directory nodes in sysfs is kept as an unsorted list which results
> in this being an exponentially longer operation as the number of
> directories are created.
>
> The solution solved by this patch set is to allow a single
> directory in sysfs to span multiple memory sections. This is
> controlled by an optional architecturally defined function
> memory_block_size_bytes(). The default definition of this
> routine returns a memory block size equal to the memory section
> size. This maintains the current layout of sysfs memory
> directories as it appears to userspace to remain the same as it
> is today.
>
> For architectures that define their own version of this routine,
> as is done for powerpc in this patchset, the view in userspace
> would change such that each memoryXXX directory would span
> multiple memory sections. The number of sections spanned would
> depend on the value reported by memory_block_size_bytes.
>
> In both cases a new file 'end_phys_index' is created in each
> memoryXXX directory. This file will contain the physical id
> of the last memory section covered by the sysfs directory. For
> the default case, the value in 'end_phys_index' will be the same
> as in the existing 'phys_index' file.
>
What does this mean for memory hotplug or hotunplug?
--
Three Cheers,
Balbir
* Re: [PATCH 0/8] De-couple sysfs memory directories from memory sections
2010-09-23 18:40 ` Balbir Singh
@ 2010-09-24 14:35 ` Nathan Fontenot
0 siblings, 0 replies; 14+ messages in thread
From: Nathan Fontenot @ 2010-09-24 14:35 UTC (permalink / raw)
To: balbir
Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
KAMEZAWA Hiroyuki
On 09/23/2010 01:40 PM, Balbir Singh wrote:
> * Nathan Fontenot <nfont@austin.ibm.com> [2010-09-22 09:15:43]:
>
>> This set of patches decouples the concept that a single memory
>> section corresponds to a single directory in
>> /sys/devices/system/memory/. On systems
>> with large amounts of memory (1+ TB) there are performance issues
>> related to creating the large number of sysfs directories. For
>> a powerpc machine with 1 TB of memory we are creating 63,000+
>> directories. This is resulting in boot times of around 45-50
>> minutes for systems with 1 TB of memory and 8 hours for systems
>> with 2 TB of memory. With this patch set applied I am now seeing
>> boot times of 5 minutes or less.
>>
>> The root of this issue is in sysfs directory creation. Every time
>> a directory is created a string compare is done against all sibling
>> directories to ensure we do not create duplicates. The list of
>> directory nodes in sysfs is kept as an unsorted list which results
>> in this being an exponentially longer operation as the number of
>> directories are created.
>>
>> The solution solved by this patch set is to allow a single
>> directory in sysfs to span multiple memory sections. This is
>> controlled by an optional architecturally defined function
>> memory_block_size_bytes(). The default definition of this
>> routine returns a memory block size equal to the memory section
>> size. This maintains the current layout of sysfs memory
>> directories as it appears to userspace to remain the same as it
>> is today.
>>
>> For architectures that define their own version of this routine,
>> as is done for powerpc in this patchset, the view in userspace
>> would change such that each memoryXXX directory would span
>> multiple memory sections. The number of sections spanned would
>> depend on the value reported by memory_block_size_bytes.
>>
>> In both cases a new file 'end_phys_index' is created in each
>> memoryXXX directory. This file will contain the physical id
>> of the last memory section covered by the sysfs directory. For
>> the default case, the value in 'end_phys_index' will be the same
>> as in the existing 'phys_index' file.
>>
>
> What does this mean for memory hotplug or hotunplug?
>
Memory hotplug will function on a memory block size basis. For
architectures that do not define their own memory_block_size_bytes()
routine, they will get the default size and everything will work
the same as it does today.
For architectures that define their own memory_block_size_bytes()
routine and have multiple memory sections per memory block, hotplug
operations will add or remove all of the memory sections in the
memory block.
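As a rough picture of what "all of the memory sections in the memory block"
means, here is a minimal userspace sketch. It is not the kernel code path:
struct block_sketch is a hypothetical stand-in for struct memory_block, and
the per-section state table is a toy model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct memory_block; field names follow the thread. */
struct block_sketch {
	unsigned long start_phys_index; /* first section id in the block */
	unsigned long end_phys_index;   /* last section id in the block */
};

/*
 * Apply one online/offline request to every section the block spans.
 * section_online[] is a toy state table indexed by section id.
 * Returns the number of sections whose state actually changed.
 */
static inline size_t block_set_state(const struct block_sketch *blk,
				     bool *section_online, bool online)
{
	size_t changed = 0;
	unsigned long nr;

	assert(blk->start_phys_index <= blk->end_phys_index);
	for (nr = blk->start_phys_index; nr <= blk->end_phys_index; nr++) {
		if (section_online[nr] != online) {
			section_online[nr] = online;
			changed++;
		}
	}
	return changed;
}
```

When start_phys_index equals end_phys_index, which is the default case, this
touches exactly one section, matching today's behavior.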
-Nathan