* [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
@ 2015-05-08 16:44 Tony Luck
  2015-05-07 22:17 ` [PATCHv2 1/3] mm/memblock: Add extra "flags" to memblock to allow selection of memory based on attribute Tony Luck
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Tony Luck @ 2015-05-08 16:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm
Some high end Intel Xeon systems report uncorrectable memory errors
as a recoverable machine check. Linux has included code for some time
to process these and just signal the affected processes (or even
recover completely if the error was in a read only page that can be
replaced by reading from disk).
But we have no recovery path for errors encountered during kernel
code execution. Except for some very specific cases we are unlikely
to ever be able to recover.
Enter memory mirroring. Actually, the 3rd generation of memory mirroring.
Gen1: All memory is mirrored
	Pro: No s/w enabling - h/w just gets good data from other side of the mirror
	Con: Halves effective memory capacity available to OS/applications
Gen2: Partial memory mirror - just mirror memory behind some memory controllers
	Pro: Keep more of the capacity
	Con: Nightmare to enable. Have to choose between allocating from
	     mirrored memory for safety vs. NUMA local memory for performance
Gen3: Address range partial memory mirror - some mirror on each memory controller
	Pro: Can tune the amount of mirror and keep NUMA performance
	Con: I have to write memory management code to implement
The current plan is just to use mirrored memory for kernel allocations. This
has been broken into two phases:
1) This patch series - find the mirrored memory, use it for boot time allocations
2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
   mirrored memory from mm/memblock.c and only give it out to select kernel
   allocations (this is still being scoped because page_alloc.c is scary).
Tony Luck (3):
  mm/memblock: Add extra "flags" to memblock to allow selection of
    memory based on attribute
  mm/memblock: Allocate boot time data structures from mirrored memory
  x86, mirror: x86 enabling - find mirrored memory ranges
 arch/s390/kernel/crash_dump.c |   5 +-
 arch/sparc/mm/init_64.c       |   6 ++-
 arch/x86/kernel/check.c       |   3 +-
 arch/x86/kernel/e820.c        |   3 +-
 arch/x86/kernel/setup.c       |   3 ++
 arch/x86/mm/init_32.c         |   2 +-
 arch/x86/platform/efi/efi.c   |  21 ++++++++
 include/linux/efi.h           |   3 ++
 include/linux/memblock.h      |  49 +++++++++++------
 mm/cma.c                      |   6 ++-
 mm/memblock.c                 | 123 +++++++++++++++++++++++++++++++++---------
 mm/memtest.c                  |   3 +-
 mm/nobootmem.c                |  14 ++++-
 13 files changed, 188 insertions(+), 53 deletions(-)
-- 
2.1.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCHv2 1/3] mm/memblock: Add extra "flags" to memblock to allow selection of memory based on attribute
  2015-05-08 16:44 [PATCHv2 0/3] Find mirrored memory, use for boot time allocations Tony Luck
@ 2015-05-07 22:17 ` Tony Luck
  2015-05-07 22:18 ` [PATCHv2 2/3] mm/memblock: Allocate boot time data structures from mirrored memory Tony Luck
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Tony Luck @ 2015-05-07 22:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

No functional changes

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

v1->v2:
	Use name "flags" everywhere instead of mix of "flag" and "flags"
	Change type of flags from u32 to ulong (consistent with memblock_region.flags)
	Use enum for values of flags defining MEMBLOCK_NONE = 0 and use as argument

 arch/s390/kernel/crash_dump.c |  5 ++--
 arch/sparc/mm/init_64.c       |  6 +++--
 arch/x86/kernel/check.c       |  3 ++-
 arch/x86/kernel/e820.c        |  3 ++-
 arch/x86/mm/init_32.c         |  2 +-
 include/linux/memblock.h      | 41 ++++++++++++++++++++------------
 mm/cma.c                      |  6 +++--
 mm/memblock.c                 | 55 +++++++++++++++++++++++++++----------------
 mm/memtest.c                  |  3 ++-
 mm/nobootmem.c                |  6 +++--
 10 files changed, 83 insertions(+), 47 deletions(-)

diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index 9f73c8059022..120a18283483 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -33,11 +33,12 @@ static struct memblock_type oldmem_type = {
 };
 
 #define for_each_dump_mem_range(i, nid, p_start, p_end, p_nid)		\
-	for (i = 0, __next_mem_range(&i, nid, &memblock.physmem,	\
+	for (i = 0, __next_mem_range(&i, nid, MEMBLOCK_NONE,		\
+				     &memblock.physmem,			\
 				     &oldmem_type, p_start,		\
 				     p_end, p_nid);			\
 	     i != (u64)ULLONG_MAX;					\
-	     __next_mem_range(&i, nid, &memblock.physmem,		\
+	     __next_mem_range(&i, nid, MEMBLOCK_NONE, &memblock.physmem,\
 			      &oldmem_type,				\
 			      p_start, p_end, p_nid))
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 4ca0d6ba5ec8..6f662d1d92ae 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1952,7 +1952,8 @@ static phys_addr_t __init available_memory(void)
 	phys_addr_t pa_start, pa_end;
 	u64 i;
 
-	for_each_free_mem_range(i, NUMA_NO_NODE, &pa_start, &pa_end, NULL)
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &pa_start,
+				&pa_end, NULL)
 		available = available + (pa_end - pa_start);
 
 	return available;
@@ -1971,7 +1972,8 @@ static void __init reduce_memory(phys_addr_t limit_ram)
 	if (limit_ram >= avail_ram)
 		return;
 
-	for_each_free_mem_range(i, NUMA_NO_NODE, &pa_start, &pa_end, NULL) {
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &pa_start,
+				&pa_end, NULL) {
 		phys_addr_t region_size = pa_end - pa_start;
 		phys_addr_t clip_start = pa_start;
 
diff --git a/arch/x86/kernel/check.c b/arch/x86/kernel/check.c
index 83a7995625a6..58118e207a69 100644
--- a/arch/x86/kernel/check.c
+++ b/arch/x86/kernel/check.c
@@ -91,7 +91,8 @@ void __init setup_bios_corruption_check(void)
 
 	corruption_check_size = round_up(corruption_check_size, PAGE_SIZE);
 
-	for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL) {
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
+				NULL) {
 		start = clamp_t(phys_addr_t, round_up(start, PAGE_SIZE),
 				PAGE_SIZE, corruption_check_size);
 		end = clamp_t(phys_addr_t, round_down(end, PAGE_SIZE),
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index e2ce85db2283..c8dda42cb6a3 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1123,7 +1123,8 @@ void __init memblock_find_dma_reserve(void)
 		nr_pages += end_pfn - start_pfn;
 	}
 
-	for_each_free_mem_range(u, NUMA_NO_NODE, &start, &end, NULL) {
+	for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
+				NULL) {
 		start_pfn = min_t(unsigned long, PFN_UP(start), MAX_DMA_PFN);
 		end_pfn = min_t(unsigned long, PFN_DOWN(end), MAX_DMA_PFN);
 		if (start_pfn < end_pfn)
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index c8140e12816a..8340e45c891a 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -433,7 +433,7 @@ void __init add_highpages_with_active_regions(int nid,
 	phys_addr_t start, end;
 	u64 i;
 
-	for_each_free_mem_range(i, nid, &start, &end, NULL) {
+	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &start, &end, NULL) {
 		unsigned long pfn = clamp_t(unsigned long, PFN_UP(start),
 					    start_pfn, end_pfn);
 		unsigned long e_pfn = clamp_t(unsigned long, PFN_DOWN(end),
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 9497ec7c77ea..7aeec0cb4c27 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -21,7 +21,10 @@
 #define INIT_PHYSMEM_REGIONS	4
 
 /* Definition of memblock flags. */
-#define MEMBLOCK_HOTPLUG	0x1	/* hotpluggable region */
+enum {
+	MEMBLOCK_NONE		= 0x0,	/* No special request */
+	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
+};
 
 struct memblock_region {
 	phys_addr_t base;
@@ -61,7 +64,7 @@ extern bool movable_node_enabled;
 
 phys_addr_t memblock_find_in_range_node(phys_addr_t size, phys_addr_t align,
 					    phys_addr_t start, phys_addr_t end,
-					    int nid);
+					    int nid, ulong flags);
 phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
 				   phys_addr_t size, phys_addr_t align);
 phys_addr_t get_allocated_memblock_reserved_regions_info(phys_addr_t *addr);
@@ -85,11 +88,13 @@ int memblock_remove_range(struct memblock_type *type,
 			  phys_addr_t base,
 			  phys_addr_t size);
 
-void __next_mem_range(u64 *idx, int nid, struct memblock_type *type_a,
+void __next_mem_range(u64 *idx, int nid, ulong flags,
+		      struct memblock_type *type_a,
 		      struct memblock_type *type_b, phys_addr_t *out_start,
 		      phys_addr_t *out_end, int *out_nid);
 
-void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
+void __next_mem_range_rev(u64 *idx, int nid, ulong flags,
+			  struct memblock_type *type_a,
 			  struct memblock_type *type_b, phys_addr_t *out_start,
 			  phys_addr_t *out_end, int *out_nid);
 
@@ -100,16 +105,17 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
  * @type_a: ptr to memblock_type to iterate
  * @type_b: ptr to memblock_type which excludes from the iteration
  * @nid: node selector, %NUMA_NO_NODE for all nodes
+ * @flags: pick from blocks based on memory attributes
  * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
  * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
  * @p_nid: ptr to int for nid of the range, can be %NULL
  */
-#define for_each_mem_range(i, type_a, type_b, nid,			\
+#define for_each_mem_range(i, type_a, type_b, nid, flags,		\
 			   p_start, p_end, p_nid)			\
-	for (i = 0, __next_mem_range(&i, nid, type_a, type_b,		\
+	for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,	\
 				     p_start, p_end, p_nid);		\
 	     i != (u64)ULLONG_MAX;					\
-	     __next_mem_range(&i, nid, type_a, type_b,			\
+	     __next_mem_range(&i, nid, flags, type_a, type_b,		\
 			      p_start, p_end, p_nid))
 
 /**
@@ -119,17 +125,18 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
  * @type_a: ptr to memblock_type to iterate
  * @type_b: ptr to memblock_type which excludes from the iteration
  * @nid: node selector, %NUMA_NO_NODE for all nodes
+ * @flags: pick from blocks based on memory attributes
  * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
  * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
  * @p_nid: ptr to int for nid of the range, can be %NULL
  */
-#define for_each_mem_range_rev(i, type_a, type_b, nid,			\
+#define for_each_mem_range_rev(i, type_a, type_b, nid, flags,		\
 			       p_start, p_end, p_nid)			\
 	for (i = (u64)ULLONG_MAX,					\
-		     __next_mem_range_rev(&i, nid, type_a, type_b,	\
+		     __next_mem_range_rev(&i, nid, flags, type_a, type_b,\
 					 p_start, p_end, p_nid);	\
 	     i != (u64)ULLONG_MAX;					\
-	     __next_mem_range_rev(&i, nid, type_a, type_b,		\
+	     __next_mem_range_rev(&i, nid, flags, type_a, type_b,	\
 				  p_start, p_end, p_nid))
 
 #ifdef CONFIG_MOVABLE_NODE
@@ -181,13 +188,14 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
  * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
  * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
  * @p_nid: ptr to int for nid of the range, can be %NULL
+ * @flags: pick from blocks based on memory attributes
  *
  * Walks over free (memory && !reserved) areas of memblock. Available as
  * soon as memblock is initialized.
  */
-#define for_each_free_mem_range(i, nid, p_start, p_end, p_nid)		\
+#define for_each_free_mem_range(i, nid, flags, p_start, p_end, p_nid)	\
 	for_each_mem_range(i, &memblock.memory, &memblock.reserved,	\
-			   nid, p_start, p_end, p_nid)
+			   nid, flags, p_start, p_end, p_nid)
 
 /**
  * for_each_free_mem_range_reverse - rev-iterate through free memblock areas
@@ -196,13 +204,15 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
  * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
  * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
  * @p_nid: ptr to int for nid of the range, can be %NULL
+ * @flags: pick from blocks based on memory attributes
  *
  * Walks over free (memory && !reserved) areas of memblock in reverse
  * order. Available as soon as memblock is initialized.
  */
-#define for_each_free_mem_range_reverse(i, nid, p_start, p_end, p_nid)	\
+#define for_each_free_mem_range_reverse(i, nid, flags, p_start, p_end,	\
+					p_nid)				\
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
-			       nid, p_start, p_end, p_nid)
+			       nid, flags, p_start, p_end, p_nid)
 
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
@@ -273,7 +283,8 @@ static inline bool memblock_bottom_up(void) { return false; }
 #define MEMBLOCK_ALLOC_ACCESSIBLE	0
 
 phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
-					phys_addr_t start, phys_addr_t end);
+					phys_addr_t start, phys_addr_t end,
+					ulong flags);
 phys_addr_t memblock_alloc_base(phys_addr_t size, phys_addr_t align,
 				phys_addr_t max_addr);
 phys_addr_t __memblock_alloc_base(phys_addr_t size, phys_addr_t align,
diff --git a/mm/cma.c b/mm/cma.c
index 3a7a67b93394..3ba03d7ab169 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -316,13 +316,15 @@ int __init cma_declare_contiguous(phys_addr_t base,
 		 */
 		if (base < highmem_start && limit > highmem_start) {
 			addr = memblock_alloc_range(size, alignment,
-						    highmem_start, limit);
+						    highmem_start, limit,
+						    MEMBLOCK_NONE);
 			limit = highmem_start;
 		}
 
 		if (!addr) {
 			addr = memblock_alloc_range(size, alignment, base,
-						    limit);
+						    limit,
+						    MEMBLOCK_NONE);
 			if (!addr) {
 				ret = -ENOMEM;
 				goto err;
diff --git a/mm/memblock.c b/mm/memblock.c
index 9318b567ed79..b9ff2f4f0285 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -107,6 +107,7 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
  * @size: size of free area to find
  * @align: alignment of free area to find
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ * @flags: pick from blocks based on memory attributes
  *
  * Utility called from memblock_find_in_range_node(), find free area bottom-up.
  *
@@ -115,12 +116,13 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
  */
 static phys_addr_t __init_memblock
 __memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
-				phys_addr_t size, phys_addr_t align, int nid)
+				phys_addr_t size, phys_addr_t align, int nid,
+				ulong flags)
 {
 	phys_addr_t this_start, this_end, cand;
 	u64 i;
 
-	for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+	for_each_free_mem_range(i, nid, flags, &this_start, &this_end, NULL) {
 		this_start = clamp(this_start, start, end);
 		this_end = clamp(this_end, start, end);
 
@@ -139,6 +141,7 @@ __memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
  * @size: size of free area to find
  * @align: alignment of free area to find
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ * @flags: pick from blocks based on memory attributes
  *
  * Utility called from memblock_find_in_range_node(), find free area top-down.
  *
@@ -147,12 +150,14 @@ __memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
  */
 static phys_addr_t __init_memblock
 __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
-			       phys_addr_t size, phys_addr_t align, int nid)
+			       phys_addr_t size, phys_addr_t align, int nid,
+			       ulong flags)
 {
 	phys_addr_t this_start, this_end, cand;
 	u64 i;
 
-	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
+	for_each_free_mem_range_reverse(i, nid, flags, &this_start, &this_end,
+					NULL) {
 		this_start = clamp(this_start, start, end);
 		this_end = clamp(this_end, start, end);
 
@@ -174,6 +179,7 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
  * @start: start of candidate range
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ * @flags: pick from blocks based on memory attributes
  *
  * Find @size free area aligned to @align in the specified range and node.
  *
@@ -190,7 +196,7 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
  */
 phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
 					phys_addr_t align, phys_addr_t start,
-					phys_addr_t end, int nid)
+					phys_addr_t end, int nid, ulong flags)
 {
 	phys_addr_t kernel_end, ret;
 
@@ -215,7 +221,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
 
 		/* ok, try bottom-up allocation first */
 		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
-						      size, align, nid);
+						      size, align, nid, flags);
 		if (ret)
 			return ret;
 
@@ -233,7 +239,8 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
 			     "memory hotunplug may be affected\n");
 	}
 
-	return __memblock_find_range_top_down(start, end, size, align, nid);
+	return __memblock_find_range_top_down(start, end, size, align, nid,
+					      flags);
 }
 
 /**
@@ -253,7 +260,7 @@ phys_addr_t __init_memblock memblock_find_in_range(phys_addr_t start,
 					phys_addr_t align)
 {
 	return memblock_find_in_range_node(size, align, start, end,
-					    NUMA_NO_NODE);
+					    NUMA_NO_NODE, MEMBLOCK_NONE);
 }
 
 static void __init_memblock memblock_remove_region(struct memblock_type *type, unsigned long r)
@@ -782,6 +789,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
  * __next__mem_range - next function for for_each_free_mem_range() etc.
  * @idx: pointer to u64 loop variable
  * @nid: node selector, %NUMA_NO_NODE for all nodes
+ * @flags: pick from blocks based on memory attributes
  * @type_a: pointer to memblock_type from where the range is taken
  * @type_b: pointer to memblock_type which excludes memory from being taken
  * @out_start: ptr to phys_addr_t for start address of the range, can be %NULL
@@ -803,7 +811,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
  * As both region arrays are sorted, the function advances the two indices
  * in lockstep and returns each intersection.
  */
-void __init_memblock __next_mem_range(u64 *idx, int nid,
+void __init_memblock __next_mem_range(u64 *idx, int nid, ulong flags,
 				      struct memblock_type *type_a,
 				      struct memblock_type *type_b,
 				      phys_addr_t *out_start,
@@ -895,6 +903,7 @@ void __init_memblock __next_mem_range(u64 *idx, int nid,
  *
  * @idx: pointer to u64 loop variable
  * @nid: nid: node selector, %NUMA_NO_NODE for all nodes
+ * @flags: pick from blocks based on memory attributes
  * @type_a: pointer to memblock_type from where the range is taken
  * @type_b: pointer to memblock_type which excludes memory from being taken
  * @out_start: ptr to phys_addr_t for start address of the range, can be %NULL
@@ -903,7 +912,7 @@ void __init_memblock __next_mem_range(u64 *idx, int nid,
  *
  * Reverse of __next_mem_range().
  */
-void __init_memblock __next_mem_range_rev(u64 *idx, int nid,
+void __init_memblock __next_mem_range_rev(u64 *idx, int nid, ulong flags,
 					  struct memblock_type *type_a,
 					  struct memblock_type *type_b,
 					  phys_addr_t *out_start,
@@ -1050,14 +1059,15 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 
 static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 					phys_addr_t align, phys_addr_t start,
-					phys_addr_t end, int nid)
+					phys_addr_t end, int nid, ulong flags)
 {
 	phys_addr_t found;
 
 	if (!align)
 		align = SMP_CACHE_BYTES;
 
-	found = memblock_find_in_range_node(size, align, start, end, nid);
+	found = memblock_find_in_range_node(size, align, start, end, nid,
+					    flags);
 	if (found && !memblock_reserve(found, size)) {
 		/*
 		 * The min_count is set to 0 so that memblock allocations are
@@ -1070,26 +1080,30 @@ static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 }
 
 phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
-					phys_addr_t start, phys_addr_t end)
+					phys_addr_t start, phys_addr_t end,
+					ulong flags)
 {
-	return memblock_alloc_range_nid(size, align, start, end, NUMA_NO_NODE);
+	return memblock_alloc_range_nid(size, align, start, end, NUMA_NO_NODE,
+					flags);
 }
 
 static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
 					phys_addr_t align, phys_addr_t max_addr,
-					int nid)
+					int nid, ulong flags)
 {
-	return memblock_alloc_range_nid(size, align, 0, max_addr, nid);
+	return memblock_alloc_range_nid(size, align, 0, max_addr, nid, flags);
 }
 
 phys_addr_t __init memblock_alloc_nid(phys_addr_t size, phys_addr_t align, int nid)
 {
-	return memblock_alloc_base_nid(size, align, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+	return memblock_alloc_base_nid(size, align, MEMBLOCK_ALLOC_ACCESSIBLE,
+				       nid, MEMBLOCK_NONE);
 }
 
 phys_addr_t __init __memblock_alloc_base(phys_addr_t size, phys_addr_t align, phys_addr_t max_addr)
 {
-	return memblock_alloc_base_nid(size, align, max_addr, NUMA_NO_NODE);
+	return memblock_alloc_base_nid(size, align, max_addr, NUMA_NO_NODE,
+				       MEMBLOCK_NONE);
 }
 
 phys_addr_t __init memblock_alloc_base(phys_addr_t size, phys_addr_t align, phys_addr_t max_addr)
@@ -1173,13 +1187,14 @@ static void * __init memblock_virt_alloc_internal(
 
 again:
 	alloc = memblock_find_in_range_node(size, align, min_addr, max_addr,
-					    nid);
+					    nid, MEMBLOCK_NONE);
 	if (alloc)
 		goto done;
 
 	if (nid != NUMA_NO_NODE) {
 		alloc = memblock_find_in_range_node(size, align, min_addr,
-						    max_addr, NUMA_NO_NODE);
+						    max_addr, NUMA_NO_NODE,
+						    MEMBLOCK_NONE);
 		if (alloc)
 			goto done;
 	}
diff --git a/mm/memtest.c b/mm/memtest.c
index 1997d934b13b..0a1cc133f6d7 100644
--- a/mm/memtest.c
+++ b/mm/memtest.c
@@ -74,7 +74,8 @@ static void __init do_one_pass(u64 pattern, phys_addr_t start, phys_addr_t end)
 	u64 i;
 	phys_addr_t this_start, this_end;
 
-	for_each_free_mem_range(i, NUMA_NO_NODE, &this_start, &this_end, NULL) {
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &this_start,
+				&this_end, NULL) {
 		this_start = clamp(this_start, start, end);
 		this_end = clamp(this_end, start, end);
 		if (this_start < this_end) {
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 90b50468333e..ad3641dcdbe7 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -41,7 +41,8 @@ static void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
 	if (limit > memblock.current_limit)
 		limit = memblock.current_limit;
 
-	addr = memblock_find_in_range_node(size, align, goal, limit, nid);
+	addr = memblock_find_in_range_node(size, align, goal, limit, nid,
+					   MEMBLOCK_NONE);
 	if (!addr)
 		return NULL;
 
@@ -121,7 +122,8 @@ static unsigned long __init free_low_memory_core_early(void)
 
 	memblock_clear_hotplug(0, -1);
 
-	for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL)
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
+				NULL)
 		count += __free_memory_core(start, end);
 
 #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
-- 
2.1.4
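For anyone following the plumbing in patch 1/3, here is a minimal user-space sketch of an iterator macro that carries a flags argument. It is not kernel code: every name prefixed toy_ or TOY_ is hypothetical and only loosely modeled on __next_mem_range() and for_each_mem_range(). As in the patch, the flags value is threaded through but does not yet change which ranges are visited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the memblock flag values. */
enum toy_flags {
	TOY_NONE    = 0x0,	/* no special request, like MEMBLOCK_NONE */
	TOY_HOTPLUG = 0x1,	/* like MEMBLOCK_HOTPLUG */
};

struct toy_region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

struct toy_type {
	const struct toy_region *regions;
	size_t cnt;
};

/*
 * Report the region at *idx and advance *idx by one.  When no region is
 * left, *idx is set to UINT64_MAX to end the walk.  "flags" is accepted
 * but deliberately ignored here, matching "No functional changes".
 */
static void toy_next_range(uint64_t *idx, unsigned long flags,
			   const struct toy_type *type,
			   uint64_t *out_start, uint64_t *out_end)
{
	uint64_t i = *idx;

	(void)flags;
	if (i >= type->cnt) {
		*idx = UINT64_MAX;
		return;
	}
	if (out_start)
		*out_start = type->regions[i].base;
	if (out_end)
		*out_end = type->regions[i].base + type->regions[i].size;
	*idx = i + 1;
}

/* Same shape as the kernel macro: init, test against the sentinel, step. */
#define toy_for_each_range(i, type, flags, p_start, p_end)		\
	for (i = 0, toy_next_range(&i, flags, type, p_start, p_end);	\
	     i != UINT64_MAX;						\
	     toy_next_range(&i, flags, type, p_start, p_end))

/* Sum the size of every region; callers with no preference pass TOY_NONE. */
uint64_t toy_available(const struct toy_type *type)
{
	uint64_t i, start = 0, end = 0, total = 0;

	assert(type != NULL);
	toy_for_each_range(i, type, TOY_NONE, &start, &end)
		total += end - start;
	return total;
}
```

Defining the "no request" value as 0 means existing callers only need one extra argument, and a later patch can test individual bits without touching them again.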
* [PATCHv2 2/3] mm/memblock: Allocate boot time data structures from mirrored memory
  2015-05-08 16:44 [PATCHv2 0/3] Find mirrored memory, use for boot time allocations Tony Luck
  2015-05-07 22:17 ` [PATCHv2 1/3] mm/memblock: Add extra "flags" to memblock to allow selection of memory based on attribute Tony Luck
@ 2015-05-07 22:18 ` Tony Luck
  2015-05-07 22:19 ` [PATCHv2 3/3] x86, mirror: x86 enabling - find mirrored memory ranges Tony Luck
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Tony Luck @ 2015-05-07 22:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Try to allocate all boot time kernel data structures from mirrored
memory. If we run out of mirrored memory print warnings, but fall
back to using non-mirrored memory to make sure that we still boot.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

v1->v2:
	Better name for memblock_has_mirror() - now choose_memblock_flags()
	Print phys_addr_t with %pap instead of %lld
	Don't fall back by clearing flags. Use: flags &= ~MEMBLOCK_MIRROR;
	Keep to 80 columns

N.B just one checkpatch warning about initializing a static to 0:
+static bool system_has_some_mirror __initdata_memblock = false;
This fits with existing style in the file ... but zap the " = false"
if you like - I have no strong feelings about this.

 include/linux/memblock.h |  8 +++++
 mm/memblock.c            | 78 +++++++++++++++++++++++++++++++++++++++++-------
 mm/nobootmem.c           | 10 ++++++-
 3 files changed, 84 insertions(+), 12 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 7aeec0cb4c27..0215ffd63069 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -24,6 +24,7 @@
 enum {
 	MEMBLOCK_NONE		= 0x0,	/* No special request */
 	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
+	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
 };
 
 struct memblock_region {
@@ -78,6 +79,8 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
+int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
+ulong choose_memblock_flags(void);
 
 /* Low level functions */
 int memblock_add_range(struct memblock_type *type,
@@ -160,6 +163,11 @@ static inline bool movable_node_is_enabled(void)
 }
 #endif
 
+static inline bool memblock_is_mirror(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_MIRROR;
+}
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long *end_pfn);
diff --git a/mm/memblock.c b/mm/memblock.c
index b9ff2f4f0285..1b444c730846 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -54,10 +54,16 @@ int memblock_debug __initdata_memblock;
 #ifdef CONFIG_MOVABLE_NODE
 bool movable_node_enabled __initdata_memblock = false;
 #endif
+static bool system_has_some_mirror __initdata_memblock = false;
 static int memblock_can_resize __initdata_memblock;
 static int memblock_memory_in_slab __initdata_memblock = 0;
 static int memblock_reserved_in_slab __initdata_memblock = 0;
 
+ulong __init_memblock choose_memblock_flags(void)
+{
+	return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
+}
+
 /* inline so we don't get a warning when pr_debug is compiled out */
 static __init_memblock const char *
 memblock_type_name(struct memblock_type *type)
@@ -259,8 +265,21 @@ phys_addr_t __init_memblock memblock_find_in_range(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
 					phys_addr_t align)
 {
-	return memblock_find_in_range_node(size, align, start, end,
-					    NUMA_NO_NODE, MEMBLOCK_NONE);
+	phys_addr_t ret;
+	ulong flags = choose_memblock_flags();
+
+again:
+	ret = memblock_find_in_range_node(size, align, start, end,
+					    NUMA_NO_NODE, flags);
+
+	if (!ret && (flags & MEMBLOCK_MIRROR)) {
+		pr_warn("Could not allocate %pap bytes of mirrored memory\n",
+			&size);
+		flags &= ~MEMBLOCK_MIRROR;
+		goto again;
+	}
+
+	return ret;
 }
 
 static void __init_memblock memblock_remove_region(struct memblock_type *type, unsigned long r)
@@ -786,6 +805,21 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
 }
 
 /**
+ * memblock_mark_mirror - Mark mirrored memory with flag MEMBLOCK_MIRROR.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return 0 on succees, -errno on failure.
+ */
+int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
+{
+	system_has_some_mirror = true;
+
+	return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
+}
+
+
+/**
  * __next__mem_range - next function for for_each_free_mem_range() etc.
  * @idx: pointer to u64 loop variable
  * @nid: node selector, %NUMA_NO_NODE for all nodes
@@ -839,6 +873,10 @@ void __init_memblock __next_mem_range(u64 *idx, int nid, ulong flags,
 		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
 			continue;
 
+		/* if we want mirror memory skip non-mirror memory regions */
+		if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m))
+			continue;
+
 		if (!type_b) {
 			if (out_start)
 				*out_start = m_start;
@@ -944,6 +982,10 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid, ulong flags,
 		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
 			continue;
 
+		/* if we want mirror memory skip non-mirror memory regions */
+		if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m))
+			continue;
+
 		if (!type_b) {
 			if (out_start)
 				*out_start = m_start;
@@ -1096,8 +1138,18 @@ static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
 
 phys_addr_t __init memblock_alloc_nid(phys_addr_t size, phys_addr_t align, int nid)
 {
-	return memblock_alloc_base_nid(size, align, MEMBLOCK_ALLOC_ACCESSIBLE,
-				       nid, MEMBLOCK_NONE);
+	ulong flags = choose_memblock_flags();
+	phys_addr_t ret;
+
+again:
+	ret = memblock_alloc_base_nid(size, align, MEMBLOCK_ALLOC_ACCESSIBLE,
+				      nid, flags);
+
+	if (!ret && (flags & MEMBLOCK_MIRROR)) {
+		flags &= ~MEMBLOCK_MIRROR;
+		goto again;
+	}
+	return ret;
 }
 
 phys_addr_t __init __memblock_alloc_base(phys_addr_t size, phys_addr_t align, phys_addr_t max_addr)
@@ -1167,6 +1219,7 @@ static void * __init memblock_virt_alloc_internal(
 {
 	phys_addr_t alloc;
 	void *ptr;
+	ulong flags = choose_memblock_flags();
 
 	if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
 		nid = NUMA_NO_NODE;
@@ -1187,14 +1240,14 @@ static void * __init memblock_virt_alloc_internal(
 
 again:
 	alloc = memblock_find_in_range_node(size, align, min_addr, max_addr,
-					    nid, MEMBLOCK_NONE);
+					    nid, flags);
 	if (alloc)
 		goto done;
 
 	if (nid != NUMA_NO_NODE) {
 		alloc = memblock_find_in_range_node(size, align, min_addr,
 						    max_addr, NUMA_NO_NODE,
-						    MEMBLOCK_NONE);
+						    flags);
 		if (alloc)
 			goto done;
 	}
@@ -1202,10 +1255,16 @@ again:
 	if (min_addr) {
 		min_addr = 0;
 		goto again;
-	} else {
-		goto error;
 	}
 
+	if (flags & MEMBLOCK_MIRROR) {
+		flags &= ~MEMBLOCK_MIRROR;
+		pr_warn("Could not allocate %pap bytes of mirrored memory\n",
+			&size);
+		goto again;
+	}
+
+	return NULL;
 done:
 	memblock_reserve(alloc, size);
 	ptr = phys_to_virt(alloc);
@@ -1220,9 +1279,6 @@ done:
 	kmemleak_alloc(ptr, size, 0, 0);
 
 	return ptr;
-
-error:
-	return NULL;
 }
 
 /**
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index ad3641dcdbe7..5258386fa1be 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -37,12 +37,20 @@ static void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
 {
 	void *ptr;
 	u64 addr;
+	ulong flags = choose_memblock_flags();
 
 	if (limit > memblock.current_limit)
 		limit = memblock.current_limit;
 
+again:
 	addr = memblock_find_in_range_node(size, align, goal, limit, nid,
-					   MEMBLOCK_NONE);
+					   flags);
+	if (!addr && (flags & MEMBLOCK_MIRROR)) {
+		flags &= ~MEMBLOCK_MIRROR;
+		pr_warn("Could not allocate %pap bytes of mirrored memory\n",
+			&size);
+		goto again;
+	}
 	if (!addr)
 		return NULL;
 
-- 
2.1.4
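The "try mirrored, then fall back" pattern that patch 2/3 repeats in several allocators can be sketched in user space as follows. This is only an illustration under simplifying assumptions: the toy_ and TOY_ names are hypothetical, the region list is a flat array with no reserved ranges or alignment handling, and a return value of 0 means "nothing found".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values mirroring MEMBLOCK_NONE / MEMBLOCK_MIRROR. */
enum toy_flags {
	TOY_NONE   = 0x0,
	TOY_MIRROR = 0x2,
};

struct toy_region {
	uint64_t base;		/* must be nonzero in this toy model */
	uint64_t size;
	unsigned long flags;
};

/* Return the base of the first acceptable region that fits, else 0. */
static uint64_t toy_find(const struct toy_region *r, size_t cnt,
			 uint64_t size, unsigned long flags)
{
	for (size_t i = 0; i < cnt; i++) {
		/* if we want mirror memory skip non-mirror regions */
		if ((flags & TOY_MIRROR) && !(r[i].flags & TOY_MIRROR))
			continue;
		if (r[i].size >= size)
			return r[i].base;
	}
	return 0;
}

/*
 * Prefer mirrored memory when the system has any.  If that fails, drop
 * only the mirror bit and retry, so the request can still succeed.
 * *fell_back reports whether the retry path was taken.
 */
uint64_t toy_find_prefer_mirror(const struct toy_region *r, size_t cnt,
				uint64_t size, int have_mirror,
				int *fell_back)
{
	unsigned long flags = have_mirror ? TOY_MIRROR : TOY_NONE;
	uint64_t ret;

	assert(fell_back != NULL);
	*fell_back = 0;
again:
	ret = toy_find(r, cnt, size, flags);
	if (!ret && (flags & TOY_MIRROR)) {
		flags &= ~(unsigned long)TOY_MIRROR;
		*fell_back = 1;
		goto again;
	}
	return ret;
}
```

Clearing just the one bit, rather than resetting flags to "none", keeps any other attribute bits a caller may have set, which is the point of the v1->v2 note above.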
* [PATCHv2 3/3] x86, mirror: x86 enabling - find mirrored memory ranges
  2015-05-08 16:44 [PATCHv2 0/3] Find mirrored memory, use for boot time allocations Tony Luck
  2015-05-07 22:17 ` [PATCHv2 1/3] mm/memblock: Add extra "flags" to memblock to allow selection of memory based on attribute Tony Luck
  2015-05-07 22:18 ` [PATCHv2 2/3] mm/memblock: Allocate boot time data structures from mirrored memory Tony Luck
@ 2015-05-07 22:19 ` Tony Luck
  2015-05-08 20:03 ` [PATCHv2 0/3] Find mirrored memory, use for boot time allocations Andrew Morton
  2015-05-19  3:01 ` Xishi Qiu
  4 siblings, 0 replies; 11+ messages in thread
From: Tony Luck @ 2015-05-07 22:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
address ranges. See UEFI 2.5 spec pages 157-158:

  http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf

On EFI enabled systems scan the memory map and tell memblock about
any mirrored ranges.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
v1->v2:
	Use u64 instead of "unsigned long long"

Just one checkpatch warning for this patch:
WARNING: line over 80 characters
#86: FILE: include/linux/efi.h:100:
+				((u64)0x0000000000010000ULL)	/* higher reliability */
But this fits the style for all the other attribute definitions (all of
which are above 80 columns - one even longer than this new one)

 arch/x86/kernel/setup.c     |  3 +++
 arch/x86/platform/efi/efi.c | 21 +++++++++++++++++++++
 include/linux/efi.h         |  3 +++
 3 files changed, 27 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d74ac33290ae..ac85a1775661 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1103,6 +1103,9 @@ void __init setup_arch(char **cmdline_p)
 	memblock_set_current_limit(ISA_END_ADDRESS);
 	memblock_x86_fill();
 
+	if (efi_enabled(EFI_BOOT))
+		efi_find_mirror();
+
 	/*
 	 * The EFI specification says that boot service code won't be called
 	 * after ExitBootServices(). This is, in fact, a lie.
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 02744df576d5..8b1d8dfa3c5c 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -117,6 +117,27 @@ void efi_get_time(struct timespec *now)
 	now->tv_nsec = 0;
 }
 
+void __init efi_find_mirror(void)
+{
+	void *p;
+	u64 mirror_size = 0, total_size = 0;
+
+	for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
+		efi_memory_desc_t *md = p;
+		unsigned long long start = md->phys_addr;
+		unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
+
+		total_size += size;
+		if (md->attribute & EFI_MEMORY_MORE_RELIABLE) {
+			memblock_mark_mirror(start, size);
+			mirror_size += size;
+		}
+	}
+	if (mirror_size)
+		pr_info("Memory: %lldM/%lldM mirrored memory\n",
+			mirror_size>>20, total_size>>20);
+}
+
 /*
  * Tell the kernel about the EFI memory map.  This might include
  * more than the max 128 entries that can fit in the e820 legacy
diff --git a/include/linux/efi.h b/include/linux/efi.h
index af5be0368dec..8d4efdb9dfe9 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -96,6 +96,8 @@ typedef struct {
 #define EFI_MEMORY_WP		((u64)0x0000000000001000ULL)	/* write-protect */
 #define EFI_MEMORY_RP		((u64)0x0000000000002000ULL)	/* read-protect */
 #define EFI_MEMORY_XP		((u64)0x0000000000004000ULL)	/* execute-protect */
+#define EFI_MEMORY_MORE_RELIABLE \
+				((u64)0x0000000000010000ULL)	/* higher reliability */
 #define EFI_MEMORY_RUNTIME	((u64)0x8000000000000000ULL)	/* range requires runtime mapping */
 #define EFI_MEMORY_DESCRIPTOR_VERSION	1
 
@@ -864,6 +866,7 @@ extern void efi_enter_virtual_mode (void); /* switch EFI to virtual mode, if pos
 extern void efi_late_init(void);
 extern void efi_free_boot_services(void);
 extern efi_status_t efi_query_variable_store(u32 attributes, unsigned long size);
+extern void efi_find_mirror(void);
 #else
 static inline void efi_late_init(void) {}
 static inline void efi_free_boot_services(void) {}
-- 
2.1.4
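The memory map walk in efi_find_mirror() can be exercised outside the kernel. The following is a minimal user-space sketch, assuming a simplified descriptor layout: `struct sk_mem_desc`, `struct sk_mirror_stats` and `sk_scan_map()` are hypothetical names, not kernel or UEFI API. Only the attribute bit value and the 4KB EFI page size (shift of 12) are taken from the patch and from UEFI. The sketch only tallies sizes; where the patch calls memblock_mark_mirror(), it does nothing further.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_EFI_PAGE_SHIFT		12
#define SK_EFI_MEMORY_MORE_RELIABLE	((uint64_t)0x0000000000010000ULL)

/* Hypothetical, simplified memory descriptor. */
struct sk_mem_desc {
	uint32_t type;
	uint32_t pad;
	uint64_t phys_addr;
	uint64_t virt_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

struct sk_mirror_stats {
	uint64_t mirror_size;	/* bytes in ranges marked more reliable */
	uint64_t total_size;	/* bytes in all ranges */
};

/*
 * Walk `map_bytes` bytes of descriptors spaced `desc_size` bytes apart.
 * The stride is passed in rather than taken from sizeof(), because the
 * firmware-reported descriptor size may be larger than the structure.
 */
static struct sk_mirror_stats sk_scan_map(const void *map, size_t map_bytes,
					   size_t desc_size)
{
	struct sk_mirror_stats st = { 0, 0 };
	const unsigned char *p = map;
	const unsigned char *end = p + map_bytes;

	if (desc_size < sizeof(struct sk_mem_desc))
		return st;

	for (; (size_t)(end - p) >= desc_size; p += desc_size) {
		struct sk_mem_desc md;
		uint64_t size;

		/* Copy out the descriptor to avoid alignment assumptions. */
		memcpy(&md, p, sizeof(md));
		size = md.num_pages << SK_EFI_PAGE_SHIFT;

		st.total_size += size;
		if (md.attribute & SK_EFI_MEMORY_MORE_RELIABLE)
			st.mirror_size += size;	/* kernel: memblock_mark_mirror() */
	}
	return st;
}
```

Stepping by a caller-supplied stride mirrors the `p += memmap.desc_size` loop in the patch, so descriptors that carry extra trailing bytes are still walked correctly.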
* Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
  2015-05-08 16:44 [PATCHv2 0/3] Find mirrored memory, use for boot time allocations Tony Luck
                   ` (2 preceding siblings ...)
  2015-05-07 22:19 ` [PATCHv2 3/3] x86, mirror: x86 enabling - find mirrored memory ranges Tony Luck
@ 2015-05-08 20:03 ` Andrew Morton
  2015-05-08 20:38   ` Tony Luck
  2015-05-19  3:01 ` Xishi Qiu
  4 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2015-05-08 20:03 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel, linux-mm

On Fri,  8 May 2015 09:44:21 -0700 Tony Luck <tony.luck@intel.com> wrote:

> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
> 
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases were are unlikely
> to ever be able to recover.
> 
> Enter memory mirroring. Actually 3rd generation of memory mirroing.
> 
> Gen1: All memory is mirrored
> 	Pro: No s/w enabling - h/w just gets good data from other side of the mirror
> 	Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory begind some memory controllers
> 	Pro: Keep more of the capacity
> 	Con: Nightmare to enable. Have to choose between allocating from
> 	     mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory controller
> 	Pro: Can tune the amount of mirror and keep NUMA performance
> 	Con: I have to write memory management code to implement
> 
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>    mirrored memory from mm/memblock.c and only give it out to select kernel
>    allocations (this is still being scoped because page_alloc.c is scary).

Looks good to me.  What happens to these patches while ZONE_MIRROR is
being worked on?

I'm wondering about phase II.  What does "select kernel allocations"
mean?  I assume we can't say "all kernel allocations" because that can
sometimes be "almost all memory".  How are you planning on implementing
this?  A new __GFP_foo flag, then sprinkle that into selected sites?

Will surplus ZONE_MIRROR memory be available for regular old movable
allocations?

I suggest you run the design ideas by Mel before getting into
implementation.
* Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
  2015-05-08 20:03 ` [PATCHv2 0/3] Find mirrored memory, use for boot time allocations Andrew Morton
@ 2015-05-08 20:38   ` Tony Luck
  2015-05-08 20:49     ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Tony Luck @ 2015-05-08 20:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List, linux-mm@kvack.org

On Fri, May 8, 2015 at 1:03 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> Looks good to me.  What happens to these patches while ZONE_MIRROR is
> being worked on?

I think these patches can go into the kernel now while I figure
out the next phase - there is some value in just this part. We'll
have all memory <4GB mirrored to cover the kernel code/data. Adding
the boot time allocations mostly means the page structures (in terms
of total amount of memory).

> I'm wondering about phase II.  What does "select kernel allocations"
> mean?  I assume we can't say "all kernel allocations" because that can
> sometimes be "almost all memory".  How are you planning on implementing
> this?  A new __GFP_foo flag, then sprinkle that into selected sites?

Some of that is TBD - there are some clear places where we have bounded
amounts of memory that we'd like to pull into the mirror area. E.g.
loadable modules - on a specific machine an administrator can easily
see which modules are loaded, tally up the sizes, and then adjust the
amount of mirrored memory. I don't think we necessarily need to get to
100% ... if we can avoid 9/10 errors crashing the machine - that moves
the reliability needle enough to make a difference. Phase 2 may turn
into phase 2a, 2b, 2c etc. as we pick on certain areas.

Oh - there'll be some sysfs or debugfs stats too - so people can
check that they have the right amount of mirror memory under application
load. Too little and they'll be at risk because kernel allocations will
fall back to non-mirrored. Too much, and they are wasting memory.

> Will surplus ZONE_MIRROR memory be available for regular old movable
> allocations?

ZONE_MIRROR and ZONE_MOVABLE are pretty much opposites. We
only want kernel allocations in mirror memory, and we can't allow any
kernel allocations in movable (cause they'll pin it).

> I suggest you run the design ideas by Mel before getting into
> implementation.

Good idea - when I have something fit to be seen, I'll share
with Mel.

-Tony
* Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
  2015-05-08 20:38   ` Tony Luck
@ 2015-05-08 20:49     ` Andrew Morton
  2015-05-08 23:41       ` Tony Luck
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2015-05-08 20:49 UTC (permalink / raw)
  To: Tony Luck; +Cc: Linux Kernel Mailing List, linux-mm@kvack.org

On Fri, 8 May 2015 13:38:52 -0700 Tony Luck <tony.luck@gmail.com> wrote:

> > Will surplus ZONE_MIRROR memory be available for regular old movable
> > allocations?
> 
> ZONE_MIRROR and ZONE_MOVABLE are pretty much opposites. We
> only want kernel allocations in mirror memory, and we can't allow any
> kernel allocations in movable (cause they'll pin it).

What I mean is: allow userspace to consume ZONE_MIRROR memory because
we can snatch it back if it is needed for kernel memory.
* Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
  2015-05-08 20:49     ` Andrew Morton
@ 2015-05-08 23:41       ` Tony Luck
  0 siblings, 0 replies; 11+ messages in thread
From: Tony Luck @ 2015-05-08 23:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List, linux-mm@kvack.org

On Fri, May 8, 2015 at 1:49 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> What I mean is: allow userspace to consume ZONE_MIRROR memory because
> we can snatch it back if it is needed for kernel memory.

For suitable interpretations of "snatch it back" ... if there is none
free in a GFP_NOWAIT request, then we are doomed.  But we could maintain
some high/low watermarks to arrange the snatching when mirrored memory
is getting low, rather than all the way out.

It's worth a look - but perhaps at phase three. It would make life a bit
easier for people to get the right amount of mirror. If they guess too
high they are still wasting some memory because every mirrored page has
two pages in DIMM. But without this sort of trick all the extra mirrored
memory would be totally wasted.

-Tony
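The high/low watermark idea in the message above was only floated, not implemented, so the following is purely a hypothetical sketch of how such a trigger could behave. `struct sk_mirror_pool` and `sk_update_reclaim()` are invented names, and the thresholds are arbitrary. It assumes that reclaim of user pages from mirrored memory starts once free mirrored pages drop below the low mark, and keeps going until the high mark is reached again.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for the pool of mirrored pages. */
struct sk_mirror_pool {
	uint64_t free_pages;
	uint64_t low_wmark;	/* start snatching pages back below this */
	uint64_t high_wmark;	/* stop once this many pages are free */
	bool reclaiming;
};

/*
 * Re-evaluate whether pages lent to userspace should be pulled back.
 * Between the two marks the previous decision is kept, so the pool
 * does not flip between states on every allocation.
 */
static bool sk_update_reclaim(struct sk_mirror_pool *p)
{
	if (p->free_pages < p->low_wmark)
		p->reclaiming = true;
	else if (p->free_pages >= p->high_wmark)
		p->reclaiming = false;
	return p->reclaiming;
}
```

Starting early, while some mirrored pages are still free, is what would let a request that cannot wait (the GFP_NOWAIT case mentioned above) still find a page.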
* Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
  2015-05-08 16:44 [PATCHv2 0/3] Find mirrored memory, use for boot time allocations Tony Luck
                   ` (3 preceding siblings ...)
  2015-05-08 20:03 ` [PATCHv2 0/3] Find mirrored memory, use for boot time allocations Andrew Morton
@ 2015-05-19  3:01 ` Xishi Qiu
  2015-05-19  4:48   ` Tony Luck
  4 siblings, 1 reply; 11+ messages in thread
From: Xishi Qiu @ 2015-05-19  3:01 UTC (permalink / raw)
  To: Tony Luck; +Cc: Andrew Morton, linux-kernel, linux-mm, Hanjun Guo, Xiexiuqi

On 2015/5/9 0:44, Tony Luck wrote:

> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
> 
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases were are unlikely
> to ever be able to recover.
> 
> Enter memory mirroring. Actually 3rd generation of memory mirroing.
> 
> Gen1: All memory is mirrored
> 	Pro: No s/w enabling - h/w just gets good data from other side of the mirror
> 	Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory begind some memory controllers
> 	Pro: Keep more of the capacity
> 	Con: Nightmare to enable. Have to choose between allocating from
> 	     mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory controller
> 	Pro: Can tune the amount of mirror and keep NUMA performance
> 	Con: I have to write memory management code to implement
> 
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>    mirrored memory from mm/memblock.c and only give it out to select kernel
>    allocations (this is still being scoped because page_alloc.c is scary).
> 

Hi Tony,

In part2, does it mean the memory allocated from kernel should use mirrored memory?

I have heard of this feature(address range mirroring) before, and I changed some
code to test it(implement memory allocations in specific physical areas).

In my opinion, add a new zone(ZONE_MIRROR) to fill the mirrored memory is not a good
idea. If there are XX discontiguous mirrored areas in one numa node, there should be
XX ZONE_MIRROR zones in one pgdat, it is impossible, right?

I think add a new migrate type(MIGRATE_MIRROR) will be better, the following print
is from my changed kernel.

[root@localhost ~]# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      3
Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      1      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable     14      7      6      1      3      0      1      0      0      0      0
Node    0, zone    DMA32, type  Reclaimable     15      2      2      1      1      2      1      1      0      0      0
Node    0, zone    DMA32, type      Movable      3     24     52     58     31      2      1      1      1      3    231
Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable     80     12      6      7      3      1     67     58     23     11      0
Node    0, zone   Normal, type  Reclaimable      6      6      8     11      5      3      0      1      0      0      0
Node    0, zone   Normal, type      Movable      6    198    618    675    363     13      4      3      0      2   4074
Node    0, zone   Normal, type       Mirror      0      0      0      0      0      0      0      0      0      0   1024
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      Reserve          CMA      Isolate
Node 0, zone      DMA             1            0            6            0            1            0            0
Node 0, zone    DMA32             8           32          975            0            1            0            0
Node 0, zone   Normal           216          334        12760         2048            2            0            0
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    1, zone   Normal, type    Unmovable     18      2     19      3     21     28     13      0      1      1      0
Node    1, zone   Normal, type  Reclaimable      0      1      1      1      0      0      1      0      0      1      0
Node    1, zone   Normal, type      Movable      6     13      9      3      0      4      5      0      1      0   6970
Node    1, zone   Normal, type       Mirror      0      0      0      0      0      0      0      0      0      0   1024
Node    1, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    1, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    1, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      Reserve          CMA      Isolate
Node 1, zone   Normal           112            4        14218         2048            2            0            0

Also I add a new flag(GFP_MIRROR), then we can use the mirrored from both
kernel-space and user-space. If there is no mirrored memory, we will allocate
other types memory.

1) kernel-space(pcp, page buddy, slab/slub ...):
	-> use mirrored memory(e.g. /proc/sys/vm/mirrorable)
	-> __alloc_pages_nodemask()
	->gfpflags_to_migratetype()
	-> use MIGRATE_MIRROR list
2) user-space(syscall, madvise, mmap ...):
	-> add VM_MIRROR flag in the vma
	-> add GFP_MIRROR when page fault in the vma
	-> __alloc_pages_nodemask()
	-> use MIGRATE_MIRROR list

Thanks,
Xishi Qiu

> Tony Luck (3):
>   mm/memblock: Add extra "flags" to memblock to allow selection of
>     memory based on attribute
>   mm/memblock: Allocate boot time data structures from mirrored memory
>   x86, mirror: x86 enabling - find mirrored memory ranges
> 
>  arch/s390/kernel/crash_dump.c |   5 +-
>  arch/sparc/mm/init_64.c       |   6 ++-
>  arch/x86/kernel/check.c       |   3 +-
>  arch/x86/kernel/e820.c        |   3 +-
>  arch/x86/kernel/setup.c       |   3 ++
>  arch/x86/mm/init_32.c         |   2 +-
>  arch/x86/platform/efi/efi.c   |  21 ++++++++
>  include/linux/efi.h           |   3 ++
>  include/linux/memblock.h      |  49 +++++++++++------
>  mm/cma.c                      |   6 ++-
>  mm/memblock.c                 | 123 +++++++++++++++++++++++++++++++++---------
>  mm/memtest.c                  |   3 +-
>  mm/nobootmem.c                |  14 ++++-
>  13 files changed, 188 insertions(+), 53 deletions(-)
> 
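The two call chains above end at the same place: a flag check that picks the MIGRATE_MIRROR free list, with a fallback to other types when no mirrored memory is free. No patch was posted in this thread, so the following is only a hypothetical sketch of that selection step. The `SK_*` names and flag values are invented for illustration and do not match the kernel's GFP bits or its gfpflags_to_migratetype() implementation, and the free lists are reduced to simple page counters.

```c
/* Hypothetical migrate types; SK_MT_MIRROR stands in for MIGRATE_MIRROR. */
enum sk_migratetype {
	SK_MT_UNMOVABLE,
	SK_MT_RECLAIMABLE,
	SK_MT_MOVABLE,
	SK_MT_MIRROR,
	SK_MT_TYPES
};

/* Illustrative flag bits only; not the kernel's GFP values. */
#define SK_GFP_MOVABLE		0x1u
#define SK_GFP_RECLAIMABLE	0x2u
#define SK_GFP_MIRROR		0x4u

/* Map allocation flags to a free list; the mirror flag takes priority. */
static enum sk_migratetype sk_gfp_to_migratetype(unsigned int gfp)
{
	if (gfp & SK_GFP_MIRROR)
		return SK_MT_MIRROR;
	if (gfp & SK_GFP_MOVABLE)
		return SK_MT_MOVABLE;
	if (gfp & SK_GFP_RECLAIMABLE)
		return SK_MT_RECLAIMABLE;
	return SK_MT_UNMOVABLE;
}

/*
 * Take one page from the list selected by `gfp`. If the mirror list is
 * empty, drop the mirror flag and fall back to the ordinary type.
 * Returns the migrate type used, or -1 if that list is empty too.
 */
static int sk_alloc_page(unsigned long free_pages[SK_MT_TYPES],
			 unsigned int gfp)
{
	enum sk_migratetype mt = sk_gfp_to_migratetype(gfp);

	if (mt == SK_MT_MIRROR && free_pages[mt] == 0)
		mt = sk_gfp_to_migratetype(gfp & ~SK_GFP_MIRROR);
	if (free_pages[mt] == 0)
		return -1;

	free_pages[mt]--;
	return (int)mt;
}
```

In this form the caller decides per allocation whether to ask for mirrored memory, which is the part of the proposal that would require touching allocation sites or setting the flag centrally.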
* Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
  2015-05-19  3:01 ` Xishi Qiu
@ 2015-05-19  4:48   ` Tony Luck
  2015-05-19  6:37     ` Xishi Qiu
  0 siblings, 1 reply; 11+ messages in thread
From: Tony Luck @ 2015-05-19  4:48 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm@kvack.org,
	Hanjun Guo, Xiexiuqi

On Mon, May 18, 2015 at 8:01 PM, Xishi Qiu <qiuxishi@huawei.com> wrote:
> In part2, does it mean the memory allocated from kernel should use mirrored memory?

Yes. I want to use mirrored memory for all (or as many as
possible) kernel allocations.

> I have heard of this feature(address range mirroring) before, and I changed some
> code to test it(implement memory allocations in specific physical areas).
>
> In my opinion, add a new zone(ZONE_MIRROR) to fill the mirrored memory is not a good
> idea. If there are XX discontiguous mirrored areas in one numa node, there should be
> XX ZONE_MIRROR zones in one pgdat, it is impossible, right?

With current h/w implementations XX is at most 2, and is possibly only 1
on most nodes.  But we shouldn't depend on that.

> I think add a new migrate type(MIGRATE_MIRROR) will be better, the following print
> is from my changed kernel.

This sounds interesting.

> [root@localhost ~]# cat /proc/pagetypeinfo
> Page block order: 9
> Pages per block:  512
>
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
...
> Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
...
> Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0

I see all zero counts here ... which is fine.  I expect that systems
will mirror all memory below 4GB ... but we should probably
ignore the attribute for this range because we want to make
sure that the memory is still available for users that depend
on getting memory that legacy devices can access. On systems
that support address range mirror the <4GB area is <2% of even
a small system (128GB seems to be the minimum rational configuration
for a 4 socket machine ... you end up with that much if you populate
every channel with just one 4GB DIMM). On a big system (in the TB
range) <4GB area is a trivial rounding error.

> Also I add a new flag(GFP_MIRROR), then we can use the mirrored from both
> kernel-space and user-space. If there is no mirrored memory, we will allocate
> other types memory.

But I *think* I want all kernel and no users to allocate mirror
memory.  I'd like to not have to touch every place that allocates
memory to add/clear this flag.

> 1) kernel-space(pcp, page buddy, slab/slub ...):
> 	-> use mirrored memory(e.g. /proc/sys/vm/mirrorable)
> 	-> __alloc_pages_nodemask()
> 	->gfpflags_to_migratetype()
> 	-> use MIGRATE_MIRROR list

I think you are telling me that we can do this, but I don't understand
how the code would look.

> 2) user-space(syscall, madvise, mmap ...):
> 	-> add VM_MIRROR flag in the vma
> 	-> add GFP_MIRROR when page fault in the vma
> 	-> __alloc_pages_nodemask()
> 	-> use MIGRATE_MIRROR list

If we do let users have access to mirrored memory, then
madvise/mmap seem a plausible way to allow it.  Not sure
what access privileges are appropriate to allow it. I expect
mirrored memory to be in short supply (the whole point of
address range mirror is to make do with a minimal amount
of mirrored memory ... if you expect to want/have lots of
mirrored memory, then just take the 50% hit in capacity
and mirror everything and ignore all the s/w complexity).

Are your patches ready to be shared?

-Tony
* Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
  2015-05-19  4:48   ` Tony Luck
@ 2015-05-19  6:37     ` Xishi Qiu
  0 siblings, 0 replies; 11+ messages in thread
From: Xishi Qiu @ 2015-05-19  6:37 UTC (permalink / raw)
  To: Tony Luck
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm@kvack.org,
	Hanjun Guo, Xiexiuqi

On 2015/5/19 12:48, Tony Luck wrote:

> On Mon, May 18, 2015 at 8:01 PM, Xishi Qiu <qiuxishi@huawei.com> wrote:
>> In part2, does it mean the memory allocated from kernel should use mirrored memory?
> 
> Yes. I want to use mirrored memory for all (or as many as
> possible) kernel allocations.
> 
>> I have heard of this feature(address range mirroring) before, and I changed some
>> code to test it(implement memory allocations in specific physical areas).
>>
>> In my opinion, add a new zone(ZONE_MIRROR) to fill the mirrored memory is not a good
>> idea. If there are XX discontiguous mirrored areas in one numa node, there should be
>> XX ZONE_MIRROR zones in one pgdat, it is impossible, right?
> 
> With current h/w implementations XX is at most 2, and is possibly only 1
> on most nodes.  But we shouldn't depend on that.
> 
>> I think add a new migrate type(MIGRATE_MIRROR) will be better, the following print
>> is from my changed kernel.
> 
> This sounds interesting.
> 
>> [root@localhost ~]# cat /proc/pagetypeinfo
>> Page block order: 9
>> Pages per block:  512
>>
>> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
>> Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
> ...
>> Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
> 
> I see all zero counts here ... which is fine.  I expect that systems
> will mirror all memory below 4GB ... but we should probably
> ignore the attribute for this range because we want to make

Hi Tony,

I think 0-4G will be all mirrored, so I change nothing, just ignore the
mirror flag. (e.g. 4 socket machine, every socket has 32G memory, then
node0: 0-4G, 4-8G mirrored, node1: 32-36G mirrored, node2: 64-68G mirrored,
node3: 96-100G mirrored)

> sure that the memory is still available for users that depend
> on getting memory that legacy devices can access. On systems
> that support address range mirror the <4GB area is <2% of even
> a small system (128GB seems to be the minimum rational configuration
> for a 4 socket machine ... you end up with that much if you populate
> every channel with just one 4GB DIMM). On a big system (in the TB
> range) <4GB area is a trivial rounding error.
> 
>> Also I add a new flag(GFP_MIRROR), then we can use the mirrored from both
>> kernel-space and user-space. If there is no mirrored memory, we will allocate
>> other types memory.
> 
> But I *think* I want all kernel and no users to allocate mirror
> memory.  I'd like to not have to touch every place that allocates
> memory to add/clear this flag.
> 

If you only want the kernel to use the mirrored memory, it is much easier.
I have some patches, but they are a little ugly and implement both the
user and the kernel side.

>> 1) kernel-space(pcp, page buddy, slab/slub ...):
>> 	-> use mirrored memory(e.g. /proc/sys/vm/mirrorable)
>> 	-> __alloc_pages_nodemask()
>> 	->gfpflags_to_migratetype()
>> 	-> use MIGRATE_MIRROR list
> 
> I think you are telling me that we can do this, but I don't understand
> how the code would look.
> 
>> 2) user-space(syscall, madvise, mmap ...):
>> 	-> add VM_MIRROR flag in the vma
>> 	-> add GFP_MIRROR when page fault in the vma
>> 	-> __alloc_pages_nodemask()
>> 	-> use MIGRATE_MIRROR list
> 
> If we do let users have access to mirrored memory, then
> madvise/mmap seem a plausible way to allow it.  Not sure
> what access privileges are appropriate to allow it. I expect
> mirrored memory to be in short supply (the whole point of

I think allocations from some key processes (e.g. a database) are as
important as the kernel's, and in most cases MCE just kills them on a
memory failure, so letting users access the mirrored memory may be a
good way to solve the problem.

> address range mirror is to make do with a minimal amount
> of mirrored memory ... if you expect to want/have lots of
> mirrored memory, then just take the 50% hit in capacity
> and mirror everything and ignore all the s/w complexity).
> 
> Are your patches ready to be shared?

I'll rewrite and send them soon.

Thanks,
Xishi Qiu

> 
> -Tony
> 
> .
> 