* [PATCH 01/12] mm: Document /proc/pagetypeinfo
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-12 15:54 ` Christoph Lameter
2010-02-16 7:05 ` KOSAKI Motohiro
2010-02-12 12:00 ` [PATCH 02/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
` (10 subsequent siblings)
11 siblings, 2 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
The memory compaction patches add details to pagetypeinfo that are not
obvious and need to be documented. In preparation for this, document
what is already in /proc/pagetypeinfo.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/filesystems/proc.txt | 45 +++++++++++++++++++++++++++++++++++-
1 files changed, 44 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d07513..1829dfb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -430,6 +430,7 @@ Table 1-5: Kernel info in /proc
modules List of loaded modules
mounts Mounted filesystems
net Networking info (see text)
+ pagetypeinfo Additional page allocator information (see text) (2.5)
partitions Table of partitions known to the system
pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
decoupled by lspci (2.4)
@@ -584,7 +585,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ...
Node 0, zone Normal 1 0 0 1 101 8 ...
Node 0, zone HighMem 2 0 0 1 1 0 ...
-Memory fragmentation is a problem under some workloads, and buddyinfo is a
+External fragmentation is a problem under some workloads, and buddyinfo is a
useful tool for helping diagnose these problems. Buddyinfo will give you a
clue as to how big an area you can safely allocate, or why a previous
allocation failed.
@@ -594,6 +595,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc...
+More information relevant to external fragmentation can be found in
+pagetypeinfo.
+
+> cat /proc/pagetypeinfo
+Page block order: 9
+Pages per block: 512
+
+Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
+Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
+Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
+Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
+Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
+Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
+Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
+Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
+Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
+Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+
+Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
+Node 0, zone DMA 2 0 5 1 0
+Node 0, zone DMA32 41 6 967 2 0
+
+Fragmentation avoidance in the kernel works by grouping pages of the same
+migrate type into contiguous regions of memory called page blocks.
+A page block is typically the size of the default hugepage, e.g. 2MB on
+X86-64. By keeping pages grouped based on their ability to move, the kernel
+can reclaim pages within a page block to satisfy a high-order allocation.
+
+The pagetypeinfo begins with information on the size of a page block. It
+then gives the same type of information as buddyinfo except broken down
+by migrate-type and finishes with details on how many page blocks of each
+type exist.
+
+If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
+from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
+make an estimate of the likely number of huge pages that can be allocated
+at a given point in time. All the "Movable" blocks should be allocatable
+unless memory has been mlock()'d. Some of the Reclaimable blocks should
+also be allocatable although a lot of filesystem metadata may have to be
+reclaimed to achieve this.
+
..............................................................................
meminfo:
--
1.6.5
* Re: [PATCH 01/12] mm: Document /proc/pagetypeinfo
2010-02-12 12:00 ` [PATCH 01/12] mm: Document /proc/pagetypeinfo Mel Gorman
@ 2010-02-12 15:54 ` Christoph Lameter
2010-02-16 7:05 ` KOSAKI Motohiro
1 sibling, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2010-02-12 15:54 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, linux-kernel, linux-mm
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
* Re: [PATCH 01/12] mm: Document /proc/pagetypeinfo
2010-02-12 12:00 ` [PATCH 01/12] mm: Document /proc/pagetypeinfo Mel Gorman
2010-02-12 15:54 ` Christoph Lameter
@ 2010-02-16 7:05 ` KOSAKI Motohiro
1 sibling, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-02-16 7:05 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Andrea Arcangeli, Christoph Lameter, Adam Litke,
Avi Kivity, David Rientjes, linux-kernel, linux-mm
> The memory compaction patches add details to pagetypeinfo that are not
> obvious and need to be documented. In preparation for this, document
> what is already in /proc/pagetypeinfo.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Looks nicer.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
> Documentation/filesystems/proc.txt | 45 +++++++++++++++++++++++++++++++++++-
> 1 files changed, 44 insertions(+), 1 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 0d07513..1829dfb 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -430,6 +430,7 @@ Table 1-5: Kernel info in /proc
> modules List of loaded modules
> mounts Mounted filesystems
> net Networking info (see text)
> + pagetypeinfo Additional page allocator information (see text) (2.5)
> partitions Table of partitions known to the system
> pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
> decoupled by lspci (2.4)
> @@ -584,7 +585,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ...
> Node 0, zone Normal 1 0 0 1 101 8 ...
> Node 0, zone HighMem 2 0 0 1 1 0 ...
>
> -Memory fragmentation is a problem under some workloads, and buddyinfo is a
> +External fragmentation is a problem under some workloads, and buddyinfo is a
> useful tool for helping diagnose these problems. Buddyinfo will give you a
> clue as to how big an area you can safely allocate, or why a previous
> allocation failed.
> @@ -594,6 +595,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
> ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
> available in ZONE_NORMAL, etc...
>
> +More information relevant to external fragmentation can be found in
> +pagetypeinfo.
> +
> +> cat /proc/pagetypeinfo
> +Page block order: 9
> +Pages per block: 512
> +
> +Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
> +Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
> +Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
> +Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
> +Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
> +Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
> +Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
> +Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
> +Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
> +Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
> +Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
> +
> +Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
> +Node 0, zone DMA 2 0 5 1 0
> +Node 0, zone DMA32 41 6 967 2 0
> +
> +Fragmentation avoidance in the kernel works by grouping pages of different
> +migrate types into the same contiguous regions of memory called page blocks.
> +A page block is typically the size of the default hugepage size e.g. 2MB on
> +X86-64. By keeping pages grouped based on their ability to move, the kernel
> +can reclaim pages within a page block to satisfy a high-order allocation.
> +
> +The pagetypinfo begins with information on the size of a page block. It
> +then gives the same type of information as buddyinfo except broken down
> +by migrate-type and finishes with details on how many page blocks of each
> +type exist.
> +
> +If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
> +from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
> +make an estimate of the likely number of huge pages that can be allocated
> +at a given point in time. All the "Movable" blocks should be allocatable
> +unless memory has been mlock()'d. Some of the Reclaimable blocks should
> +also be allocatable although a lot of filesystem metadata may have to be
> +reclaimed to achieve this.
> +
> ..............................................................................
>
> meminfo:
> --
> 1.6.5
>
* [PATCH 02/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
2010-02-12 12:00 ` [PATCH 01/12] mm: Document /proc/pagetypeinfo Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-16 17:43 ` Rik van Riel
2010-02-12 12:00 ` [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
` (9 subsequent siblings)
11 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration are
only beneficial on NUMA, so the dependency makes sense.
As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/Kconfig | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 17b8947..b1c2781 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -168,17 +168,29 @@ config SPLIT_PTLOCK_CPUS
default "4"
#
+# support for memory compaction
+config COMPACTION
+ bool "Allow for memory compaction"
+ def_bool y
+ select MIGRATION
+ depends on EXPERIMENTAL && HUGETLBFS
+ help
+ Allows the compaction of memory for the allocation of huge pages.
+
+#
# support for page migration
#
config MIGRATION
bool "Page migration"
def_bool y
- depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
+ depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
help
Allows the migration of the physical location of pages of processes
- while the virtual addresses are not changed. This is useful for
- example on NUMA systems to put pages nearer to the processors accessing
- the page.
+ while the virtual addresses are not changed. This is useful in
+ two situations. The first is on NUMA systems to put pages nearer
+ to the processors accessing them. The second is when allocating huge
+ pages as migration can relocate pages to satisfy a huge page
+ allocation instead of reclaiming.
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
--
1.6.5
* Re: [PATCH 02/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
2010-02-12 12:00 ` [PATCH 02/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-02-16 17:43 ` Rik van Riel
0 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-02-16 17:43 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/12/2010 07:00 AM, Mel Gorman wrote:
> CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
> being able to hot-remove memory. The main users of page migration such as
> sys_move_pages(), sys_migrate_pages() and cpuset process migration are
> only beneficial on NUMA so it makes sense.
>
> As memory compaction will operate within a zone and is useful on both NUMA
> and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> user selects CONFIG_COMPACTION as an option.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter<cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
2010-02-12 12:00 ` [PATCH 01/12] mm: Document /proc/pagetypeinfo Mel Gorman
2010-02-12 12:00 ` [PATCH 02/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-16 7:03 ` KOSAKI Motohiro
` (2 more replies)
2010-02-12 12:00 ` [PATCH 04/12] Export fragmentation " Mel Gorman
` (8 subsequent siblings)
11 siblings, 3 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
Unusable free space index is a measure of external fragmentation that
takes the allocation size into account. For the most part, the huge page
size will be the size of interest, but that is not necessarily so, hence it
is exported on a per-order and per-zone basis via /proc/pagetypeinfo.
The index is normally calculated as a value between 0 and 1 which is
obviously unsuitable within the kernel. Instead, the first three decimal
places are used as a value between 0 and 1000 for an integer approximation.
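As a rough illustration of the arithmetic (the counts below are invented and
not taken from any real zone), a standalone sketch of the same integer
approximation for an order-9 request would be:

#include <stdio.h>

int main(void)
{
	/* Invented snapshot: 1000 free base pages in the zone, of which
	 * a single order-9 block (512 pages) is still intact. */
	unsigned long free_pages = 1000;
	unsigned long free_blocks_suitable = 1;
	unsigned int order = 9;

	unsigned long unusable = (free_pages -
			(free_blocks_suitable << order)) * 1000 / free_pages;

	/* Prints 488, i.e. 48.8% of the free memory could not back an
	 * order-9 allocation without reclaim or compaction. */
	printf("unusable free index at order %u: %lu\n", order, unusable);
	return 0;
}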
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/filesystems/proc.txt | 10 ++++
mm/vmstat.c | 99 ++++++++++++++++++++++++++++++++++++
2 files changed, 109 insertions(+), 0 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 1829dfb..0968a81 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -614,6 +614,10 @@ Node 0, zone DMA32, type Movable 169 152 113 91 77
Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+Unusable free space index at order
+Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
+Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
+
Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
Node 0, zone DMA 2 0 5 1 0
Node 0, zone DMA32 41 6 967 2 0
@@ -629,6 +633,12 @@ then gives the same type of information as buddyinfo except broken down
by migrate-type and finishes with details on how many page blocks of each
type exist.
+The unusable free space index measures how much of the available free
+memory cannot be used to satisfy an allocation of a given size and is a
+value between 0 and 1000. The higher the value, the more free memory is
+unusable and, by implication, the worse the external fragmentation is. The
+percentage of unusable free memory can be found by dividing this value by 10.
+
If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
make an estimate of the likely number of huge pages that can be allocated
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6051fba..d05d610 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
return 0;
}
+
+struct contig_page_info {
+ unsigned long free_pages;
+ unsigned long free_blocks_total;
+ unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+ unsigned int suitable_order,
+ struct contig_page_info *info)
+{
+ unsigned int order;
+
+ info->free_pages = 0;
+ info->free_blocks_total = 0;
+ info->free_blocks_suitable = 0;
+
+ for (order = 0; order < MAX_ORDER; order++) {
+ unsigned long blocks;
+
+ /* Count number of free blocks */
+ blocks = zone->free_area[order].nr_free;
+ info->free_blocks_total += blocks;
+
+ /* Count free base pages */
+ info->free_pages += blocks << order;
+
+ /* Count the suitable free blocks */
+ if (order >= suitable_order)
+ info->free_blocks_suitable += blocks <<
+ (order - suitable_order);
+ }
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(struct zone *zone,
+ unsigned int order,
+ struct contig_page_info *info)
+{
+ /* No free memory is interpreted as all free memory is unusable */
+ if (info->free_pages == 0)
+ return 1000;
+
+ /*
+ * Index should be a value between 0 and 1. Return a value to 3
+ * decimal places.
+ *
+ * 0 => no fragmentation
+ * 1 => high fragmentation
+ */
+ return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
+
+}
+
+static void pagetypeinfo_showunusable_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+
+ /* Alloc on stack as interrupts are disabled for zone walk */
+ struct contig_page_info info;
+
+ seq_printf(m, "Node %4d, zone %8s %19s",
+ pgdat->node_id,
+ zone->name, " ");
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ seq_printf(m, "%6d ", unusable_free_index(zone, order, &info));
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ seq_printf(m, "\nUnusable free space index at order\n");
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
+
+ return 0;
+}
+
static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
@@ -558,6 +656,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
seq_putc(m, '\n');
pagetypeinfo_showfree(m, pgdat);
+ pagetypeinfo_showunusable(m, pgdat);
pagetypeinfo_showblockcount(m, pgdat);
return 0;
--
1.6.5
* Re: [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-12 12:00 ` [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
@ 2010-02-16 7:03 ` KOSAKI Motohiro
2010-02-16 8:36 ` Mel Gorman
2010-02-16 18:28 ` Rik van Riel
2010-02-18 15:23 ` Minchan Kim
2 siblings, 1 reply; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-02-16 7:03 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Andrea Arcangeli, Christoph Lameter, Adam Litke,
Avi Kivity, David Rientjes, linux-kernel, linux-mm
> Unusuable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/pagetypeinfo.
Hmmm..
/proc/pagetypeinfo has a machine-unfriendly format. Perhaps some users have their own ugly
/proc/pagetypeinfo parsers, so there is a small risk of breaking the userland ABI.
I have a dumb question: why can't we use another file?
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
I think we can treat the internal representation and the /proc display style
separately. For example, load average has a fixed-point internal representation,
but /proc/loadavg hides it.
So, personally, I would like to keep this internal representation and change the
external representation to a 0.000-1.000 or 0.0%-100.0% range.
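Something like the following (an untested sketch on my side, assuming the
0-1000 integer stays as the internal value) is what I have in mind:

	int index = unusable_free_index(zone, order, &info);

	/* display the fixed-point value as 0.000 - 1.000 */
	seq_printf(m, "%d.%03d ", index / 1000, index % 1000);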
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> Documentation/filesystems/proc.txt | 10 ++++
> mm/vmstat.c | 99 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 109 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 1829dfb..0968a81 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -614,6 +614,10 @@ Node 0, zone DMA32, type Movable 169 152 113 91 77
> Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
> Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
>
> +Unusable free space index at order
> +Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
> +Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
> +
> Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
> Node 0, zone DMA 2 0 5 1 0
> Node 0, zone DMA32 41 6 967 2 0
> @@ -629,6 +633,12 @@ then gives the same type of information as buddyinfo except broken down
> by migrate-type and finishes with details on how many page blocks of each
> type exist.
>
> +The unusable free space index measures how much of the available free
> +memory cannot be used to satisfy an allocation of a given size and is a
> +value between 0 and 1000. The higher the value, the more of free memory is
> +unusable and by implication, the worse the external fragmentation is. The
> +percentage of unusable free memory can be found by dividing this value by 10.
> +
> If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
> from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
> make an estimate of the likely number of huge pages that can be allocated
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6051fba..d05d610 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
> return 0;
> }
>
> +
> +struct contig_page_info {
> + unsigned long free_pages;
> + unsigned long free_blocks_total;
> + unsigned long free_blocks_suitable;
> +};
> +
> +/*
> + * Calculate the number of free pages in a zone, how many contiguous
> + * pages are free and how many are large enough to satisfy an allocation of
> + * the target size. Note that this function makes to attempt to estimate
> + * how many suitable free blocks there *might* be if MOVABLE pages were
> + * migrated. Calculating that is possible, but expensive and can be
> + * figured out from userspace
> + */
> +static void fill_contig_page_info(struct zone *zone,
> + unsigned int suitable_order,
> + struct contig_page_info *info)
> +{
> + unsigned int order;
> +
> + info->free_pages = 0;
> + info->free_blocks_total = 0;
> + info->free_blocks_suitable = 0;
> +
> + for (order = 0; order < MAX_ORDER; order++) {
> + unsigned long blocks;
> +
> + /* Count number of free blocks */
> + blocks = zone->free_area[order].nr_free;
> + info->free_blocks_total += blocks;
> +
> + /* Count free base pages */
> + info->free_pages += blocks << order;
> +
> + /* Count the suitable free blocks */
> + if (order >= suitable_order)
> + info->free_blocks_suitable += blocks <<
> + (order - suitable_order);
> + }
> +}
> +
> +/*
> + * Return an index indicating how much of the available free memory is
> + * unusable for an allocation of the requested size.
> + */
> +static int unusable_free_index(struct zone *zone,
> + unsigned int order,
> + struct contig_page_info *info)
> +{
> + /* No free memory is interpreted as all free memory is unusable */
> + if (info->free_pages == 0)
> + return 1000;
> +
> + /*
> + * Index should be a value between 0 and 1. Return a value to 3
> + * decimal places.
> + *
> + * 0 => no fragmentation
> + * 1 => high fragmentation
> + */
I learned math in Japanese, so it took me a while to understand what
"3 decimal places" means.
Can't we simply write "Index should be a value between 0 and 1000"?
> + return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> +
> +}
> +
> +static void pagetypeinfo_showunusable_print(struct seq_file *m,
> + pg_data_t *pgdat, struct zone *zone)
> +{
> + unsigned int order;
> +
> + /* Alloc on stack as interrupts are disabled for zone walk */
> + struct contig_page_info info;
> +
> + seq_printf(m, "Node %4d, zone %8s %19s",
> + pgdat->node_id,
> + zone->name, " ");
> + for (order = 0; order < MAX_ORDER; ++order) {
> + fill_contig_page_info(zone, order, &info);
> + seq_printf(m, "%6d ", unusable_free_index(zone, order, &info));
> + }
> +
> + seq_putc(m, '\n');
> +}
> +
> +/*
> + * Display unusable free space index
> + * XXX: Could be a lot more efficient, but it's not a critical path
> + */
> +static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
> +{
> + pg_data_t *pgdat = (pg_data_t *)arg;
> +
> + seq_printf(m, "\nUnusable free space index at order\n");
> + walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
> +
> + return 0;
> +}
> +
> static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> @@ -558,6 +656,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
> seq_putc(m, '\n');
> pagetypeinfo_showfree(m, pgdat);
> + pagetypeinfo_showunusable(m, pgdat);
> pagetypeinfo_showblockcount(m, pgdat);
>
> return 0;
> --
> 1.6.5
>
* Re: [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-16 7:03 ` KOSAKI Motohiro
@ 2010-02-16 8:36 ` Mel Gorman
2010-02-16 8:41 ` KOSAKI Motohiro
0 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-16 8:36 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm
On Tue, Feb 16, 2010 at 04:03:29PM +0900, KOSAKI Motohiro wrote:
> > Unusuable free space index is a measure of external fragmentation that
> > takes the allocation size into account. For the most part, the huge page
> > size will be the size of interest but not necessarily so it is exported
> > on a per-order and per-zone basis via /proc/pagetypeinfo.
>
> Hmmm..
> /proc/pagetype have a machine unfriendly format. perhaps, some user have own ugly
> /proc/pagetype parser. It have a little risk to break userland ABI.
>
It's very low risk. I doubt there are machine parsers of
/proc/pagetypeinfo because there are very few machine-orientated actions
that can be taken based on the information. It's more informational for
a user investigating fragmentation problems.
> I have dumb question. Why can't we use another file?
>
I could. What do you suggest?
>
> > The index is normally calculated as a value between 0 and 1 which is
> > obviously unsuitable within the kernel. Instead, the first three decimal
> > places are used as a value between 0 and 1000 for an integer approximation.
>
> I think we can treat separately internal representaion and /proc displaing
> style. example, load-average have fixed point internal representaion. but
> /proc/loadavg hide it.
>
> So, I personally like to keep this internal representation and change external
> representaion to 0.000-1.000 or 0.0%-100.0% range.
>
I don't want to use a percentage for this value as there are two
indices, one of which can be expressed as a percentage and the other
not. I can represent it as a value between 0 and 1 if you prefer.
>
>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > Documentation/filesystems/proc.txt | 10 ++++
> > mm/vmstat.c | 99 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 109 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 1829dfb..0968a81 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -614,6 +614,10 @@ Node 0, zone DMA32, type Movable 169 152 113 91 77
> > Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
> > Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
> >
> > +Unusable free space index at order
> > +Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
> > +Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
> > +
> > Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
> > Node 0, zone DMA 2 0 5 1 0
> > Node 0, zone DMA32 41 6 967 2 0
> > @@ -629,6 +633,12 @@ then gives the same type of information as buddyinfo except broken down
> > by migrate-type and finishes with details on how many page blocks of each
> > type exist.
> >
> > +The unusable free space index measures how much of the available free
> > +memory cannot be used to satisfy an allocation of a given size and is a
> > +value between 0 and 1000. The higher the value, the more of free memory is
> > +unusable and by implication, the worse the external fragmentation is. The
> > +percentage of unusable free memory can be found by dividing this value by 10.
> > +
> > If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
> > from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
> > make an estimate of the likely number of huge pages that can be allocated
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6051fba..d05d610 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
> > return 0;
> > }
> >
> > +
> > +struct contig_page_info {
> > + unsigned long free_pages;
> > + unsigned long free_blocks_total;
> > + unsigned long free_blocks_suitable;
> > +};
> > +
> > +/*
> > + * Calculate the number of free pages in a zone, how many contiguous
> > + * pages are free and how many are large enough to satisfy an allocation of
> > + * the target size. Note that this function makes to attempt to estimate
> > + * how many suitable free blocks there *might* be if MOVABLE pages were
> > + * migrated. Calculating that is possible, but expensive and can be
> > + * figured out from userspace
> > + */
> > +static void fill_contig_page_info(struct zone *zone,
> > + unsigned int suitable_order,
> > + struct contig_page_info *info)
> > +{
> > + unsigned int order;
> > +
> > + info->free_pages = 0;
> > + info->free_blocks_total = 0;
> > + info->free_blocks_suitable = 0;
> > +
> > + for (order = 0; order < MAX_ORDER; order++) {
> > + unsigned long blocks;
> > +
> > + /* Count number of free blocks */
> > + blocks = zone->free_area[order].nr_free;
> > + info->free_blocks_total += blocks;
> > +
> > + /* Count free base pages */
> > + info->free_pages += blocks << order;
> > +
> > + /* Count the suitable free blocks */
> > + if (order >= suitable_order)
> > + info->free_blocks_suitable += blocks <<
> > + (order - suitable_order);
> > + }
> > +}
> > +
> > +/*
> > + * Return an index indicating how much of the available free memory is
> > + * unusable for an allocation of the requested size.
> > + */
> > +static int unusable_free_index(struct zone *zone,
> > + unsigned int order,
> > + struct contig_page_info *info)
> > +{
> > + /* No free memory is interpreted as all free memory is unusable */
> > + if (info->free_pages == 0)
> > + return 1000;
> > +
> > + /*
> > + * Index should be a value between 0 and 1. Return a value to 3
> > + * decimal places.
> > + *
> > + * 0 => no fragmentation
> > + * 1 => high fragmentation
> > + */
>
> I leraned math by japanese. probably then I couldn't understand what mean
> "3 decimal places" awhile.
> Simply can't we write "Index should be a value between 0 and 1000"?
>
I can do either that or make it really display between 0 and 1. I'd
prefer displaying between 0 and 1 myself.
>
> > + return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> > +
> > +}
> > +
> > +static void pagetypeinfo_showunusable_print(struct seq_file *m,
> > + pg_data_t *pgdat, struct zone *zone)
> > +{
> > + unsigned int order;
> > +
> > + /* Alloc on stack as interrupts are disabled for zone walk */
> > + struct contig_page_info info;
> > +
> > + seq_printf(m, "Node %4d, zone %8s %19s",
> > + pgdat->node_id,
> > + zone->name, " ");
> > + for (order = 0; order < MAX_ORDER; ++order) {
> > + fill_contig_page_info(zone, order, &info);
> > + seq_printf(m, "%6d ", unusable_free_index(zone, order, &info));
> > + }
> > +
> > + seq_putc(m, '\n');
> > +}
> > +
> > +/*
> > + * Display unusable free space index
> > + * XXX: Could be a lot more efficient, but it's not a critical path
> > + */
> > +static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
> > +{
> > + pg_data_t *pgdat = (pg_data_t *)arg;
> > +
> > + seq_printf(m, "\nUnusable free space index at order\n");
> > + walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
> > +
> > + return 0;
> > +}
> > +
> > static void pagetypeinfo_showfree_print(struct seq_file *m,
> > pg_data_t *pgdat, struct zone *zone)
> > {
> > @@ -558,6 +656,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> > seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
> > seq_putc(m, '\n');
> > pagetypeinfo_showfree(m, pgdat);
> > + pagetypeinfo_showunusable(m, pgdat);
> > pagetypeinfo_showblockcount(m, pgdat);
> >
> > return 0;
> > --
> > 1.6.5
> >
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-16 8:36 ` Mel Gorman
@ 2010-02-16 8:41 ` KOSAKI Motohiro
2010-02-16 8:50 ` Mel Gorman
0 siblings, 1 reply; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-02-16 8:41 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Andrea Arcangeli, Christoph Lameter, Adam Litke,
Avi Kivity, David Rientjes, linux-kernel, linux-mm
> On Tue, Feb 16, 2010 at 04:03:29PM +0900, KOSAKI Motohiro wrote:
> > > Unusuable free space index is a measure of external fragmentation that
> > > takes the allocation size into account. For the most part, the huge page
> > > size will be the size of interest but not necessarily so it is exported
> > > on a per-order and per-zone basis via /proc/pagetypeinfo.
> >
> > Hmmm..
> > /proc/pagetype have a machine unfriendly format. perhaps, some user have own ugly
> > /proc/pagetype parser. It have a little risk to break userland ABI.
> >
>
> It's very low risk. I doubt there are machine parsers of
> /proc/pagetypeinfo because there are very few machine-orientated actions
> that can be taken based on the information. It's more informational for
> a user if they were investigating fragmentation problems.
>
> > I have dumb question. Why can't we use another file?
>
> I could. What do you suggest?
I agree it's low risk. But personally I hope the fragmentation ABI stays very stable,
because I expect someone will write a userland compaction daemon (read the fragmentation
index from /proc and write /proc/compact_memory if necessary).
Then, if possible, I hope the fragmentation info gets an individual /proc file.
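Very roughly, I imagine such a daemon doing something like this (the proc file
names and the threshold are hypothetical, just to show the idea):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		int index = 0;
		/* hypothetical per-system fragmentation index file */
		FILE *f = fopen("/proc/sys/vm/fragmentation_index", "r");

		if (f) {
			fscanf(f, "%d", &index);
			fclose(f);
		}

		if (index > 500) {	/* arbitrary threshold */
			/* hypothetical compaction trigger */
			FILE *t = fopen("/proc/compact_memory", "w");
			if (t) {
				fputs("1", t);
				fclose(t);
			}
		}
		sleep(60);
	}
	return 0;
}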
* Re: [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-16 8:41 ` KOSAKI Motohiro
@ 2010-02-16 8:50 ` Mel Gorman
0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-16 8:50 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm
On Tue, Feb 16, 2010 at 05:41:39PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Feb 16, 2010 at 04:03:29PM +0900, KOSAKI Motohiro wrote:
> > > > Unusuable free space index is a measure of external fragmentation that
> > > > takes the allocation size into account. For the most part, the huge page
> > > > size will be the size of interest but not necessarily so it is exported
> > > > on a per-order and per-zone basis via /proc/pagetypeinfo.
> > >
> > > Hmmm..
> > > /proc/pagetype have a machine unfriendly format. perhaps, some user have own ugly
> > > /proc/pagetype parser. It have a little risk to break userland ABI.
> > >
> >
> > It's very low risk. I doubt there are machine parsers of
> > /proc/pagetypeinfo because there are very few machine-orientated actions
> > that can be taken based on the information. It's more informational for
> > a user if they were investigating fragmentation problems.
> >
> > > I have dumb question. Why can't we use another file?
> >
> > I could. What do you suggest?
>
> I agree it's low risk. but personally I hope fragmentation ABI keep very stable because
> I expect some person makes userland compaction daemon. (read fragmentation index
> from /proc and write /proc/compact_memory if necessary).
> then, if possible, I hope fragmentation info have individual /proc file.
>
I'd be somewhat surprised if there was an active userland compaction daemon
because I'd expect them to be depending on direct compaction. Userspace
compaction is more likely to be an all-or-nothing affair and confined to
NUMA nodes if they are being used as containers. If a compaction daemon was
to exist, I'd have expected it to be in-kernel because the triggers from
userspace are so coarse.
Still, I can break out the indices into separate files to cover all the
bases.
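Roughly along these lines (an untested sketch; the file name and fops names
are only illustrative, reusing the frag_start/frag_next/frag_stop walkers
already used for /proc/pagetypeinfo and the pagetypeinfo_showunusable helper
from this series):

static const struct seq_operations unusable_op = {
	.start	= frag_start,
	.next	= frag_next,
	.stop	= frag_stop,
	.show	= pagetypeinfo_showunusable,
};

static int unusable_open(struct inode *inode, struct file *file)
{
	return seq_open(file, &unusable_op);
}

static const struct file_operations unusable_file_ops = {
	.open		= unusable_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= seq_release,
};

	/* in the existing vmstat init path, next to the pagetypeinfo entry */
	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);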
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-12 12:00 ` [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
2010-02-16 7:03 ` KOSAKI Motohiro
@ 2010-02-16 18:28 ` Rik van Riel
2010-02-18 15:23 ` Minchan Kim
2 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-02-16 18:28 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/12/2010 07:00 AM, Mel Gorman wrote:
> Unusuable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* Re: [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-12 12:00 ` [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
2010-02-16 7:03 ` KOSAKI Motohiro
2010-02-16 18:28 ` Rik van Riel
@ 2010-02-18 15:23 ` Minchan Kim
2010-02-18 15:32 ` Mel Gorman
2 siblings, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2010-02-18 15:23 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, 2010-02-12 at 12:00 +0000, Mel Gorman wrote:
> Unusuable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
There are some comments below.
They're very trivial, so I don't mind either way.
> ---
> Documentation/filesystems/proc.txt | 10 ++++
> mm/vmstat.c | 99 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 109 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 1829dfb..0968a81 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -614,6 +614,10 @@ Node 0, zone DMA32, type Movable 169 152 113 91 77
> Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
> Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
>
> +Unusable free space index at order
> +Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
> +Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
> +
> Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
> Node 0, zone DMA 2 0 5 1 0
> Node 0, zone DMA32 41 6 967 2 0
> @@ -629,6 +633,12 @@ then gives the same type of information as buddyinfo except broken down
> by migrate-type and finishes with details on how many page blocks of each
> type exist.
>
> +The unusable free space index measures how much of the available free
> +memory cannot be used to satisfy an allocation of a given size and is a
> +value between 0 and 1000. The higher the value, the more of free memory is
> +unusable and by implication, the worse the external fragmentation is. The
> +percentage of unusable free memory can be found by dividing this value by 10.
> +
> If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
> from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
> make an estimate of the likely number of huge pages that can be allocated
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6051fba..d05d610 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
> return 0;
> }
>
> +
> +struct contig_page_info {
> + unsigned long free_pages;
> + unsigned long free_blocks_total;
> + unsigned long free_blocks_suitable;
> +};
> +
> +/*
> + * Calculate the number of free pages in a zone, how many contiguous
> + * pages are free and how many are large enough to satisfy an allocation of
> + * the target size. Note that this function makes to attempt to estimate
> + * how many suitable free blocks there *might* be if MOVABLE pages were
> + * migrated. Calculating that is possible, but expensive and can be
> + * figured out from userspace
> + */
> +static void fill_contig_page_info(struct zone *zone,
> + unsigned int suitable_order,
> + struct contig_page_info *info)
> +{
> + unsigned int order;
> +
> + info->free_pages = 0;
> + info->free_blocks_total = 0;
> + info->free_blocks_suitable = 0;
> +
> + for (order = 0; order < MAX_ORDER; order++) {
> + unsigned long blocks;
> +
> + /* Count number of free blocks */
> + blocks = zone->free_area[order].nr_free;
> + info->free_blocks_total += blocks;
> +
> + /* Count free base pages */
> + info->free_pages += blocks << order;
> +
> + /* Count the suitable free blocks */
> + if (order >= suitable_order)
> + info->free_blocks_suitable += blocks <<
> + (order - suitable_order);
> + }
> +}
> +
> +/*
> + * Return an index indicating how much of the available free memory is
> + * unusable for an allocation of the requested size.
> + */
> +static int unusable_free_index(struct zone *zone,
Did you keep the "zone" argument for later use?
Anyway, it's trivial.
> + unsigned int order,
> + struct contig_page_info *info)
> +{
> + /* No free memory is interpreted as all free memory is unusable */
> + if (info->free_pages == 0)
> + return 1000;
> +
> + /*
> + * Index should be a value between 0 and 1. Return a value to 3
> + * decimal places.
> + *
> + * 0 => no fragmentation
> + * 1 => high fragmentation
> + */
> + return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> +
> +}
> +
> +static void pagetypeinfo_showunusable_print(struct seq_file *m,
> + pg_data_t *pgdat, struct zone *zone)
> +{
> + unsigned int order;
> +
> + /* Alloc on stack as interrupts are disabled for zone walk */
Will contig_page_info grow larger later?
I think it's small data now, so I don't know why you added this comment.
Just out of curiosity. :)
> + struct contig_page_info info;
> +
> + seq_printf(m, "Node %4d, zone %8s %19s",
> + pgdat->node_id,
> + zone->name, " ");
> + for (order = 0; order < MAX_ORDER; ++order) {
> + fill_contig_page_info(zone, order, &info);
> + seq_printf(m, "%6d ", unusable_free_index(zone, order, &info));
> + }
> +
> + seq_putc(m, '\n');
> +}
> +
> +/*
> + * Display unusable free space index
> + * XXX: Could be a lot more efficient, but it's not a critical path
> + */
> +static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
> +{
> + pg_data_t *pgdat = (pg_data_t *)arg;
> +
> + seq_printf(m, "\nUnusable free space index at order\n");
> + walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
> +
> + return 0;
> +}
> +
> static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> @@ -558,6 +656,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
> seq_putc(m, '\n');
> pagetypeinfo_showfree(m, pgdat);
> + pagetypeinfo_showunusable(m, pgdat);
> pagetypeinfo_showblockcount(m, pgdat);
>
> return 0;
--
Kind regards,
Minchan Kim
* Re: [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-18 15:23 ` Minchan Kim
@ 2010-02-18 15:32 ` Mel Gorman
0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-18 15:32 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, Feb 19, 2010 at 12:23:59AM +0900, Minchan Kim wrote:
> On Fri, 2010-02-12 at 12:00 +0000, Mel Gorman wrote:
> > Unusuable free space index is a measure of external fragmentation that
> > takes the allocation size into account. For the most part, the huge page
> > size will be the size of interest but not necessarily so it is exported
> > on a per-order and per-zone basis via /proc/pagetypeinfo.
> >
> > The index is normally calculated as a value between 0 and 1 which is
> > obviously unsuitable within the kernel. Instead, the first three decimal
> > places are used as a value between 0 and 1000 for an integer approximation.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>
> There are comments at below.
> It's very trivial so I don't care.
>
Thanks
> > ---
> > Documentation/filesystems/proc.txt | 10 ++++
> > mm/vmstat.c | 99 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 109 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 1829dfb..0968a81 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -614,6 +614,10 @@ Node 0, zone DMA32, type Movable 169 152 113 91 77
> > Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
> > Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
> >
> > +Unusable free space index at order
> > +Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
> > +Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
> > +
> > Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
> > Node 0, zone DMA 2 0 5 1 0
> > Node 0, zone DMA32 41 6 967 2 0
> > @@ -629,6 +633,12 @@ then gives the same type of information as buddyinfo except broken down
> > by migrate-type and finishes with details on how many page blocks of each
> > type exist.
> >
> > +The unusable free space index measures how much of the available free
> > +memory cannot be used to satisfy an allocation of a given size and is a
> > +value between 0 and 1000. The higher the value, the more of free memory is
> > +unusable and by implication, the worse the external fragmentation is. The
> > +percentage of unusable free memory can be found by dividing this value by 10.
> > +
> > If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
> > from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
> > make an estimate of the likely number of huge pages that can be allocated
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6051fba..d05d610 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
> > return 0;
> > }
> >
> > +
> > +struct contig_page_info {
> > + unsigned long free_pages;
> > + unsigned long free_blocks_total;
> > + unsigned long free_blocks_suitable;
> > +};
> > +
> > +/*
> > + * Calculate the number of free pages in a zone, how many contiguous
> > + * pages are free and how many are large enough to satisfy an allocation of
> > + * the target size. Note that this function makes to attempt to estimate
> > + * how many suitable free blocks there *might* be if MOVABLE pages were
> > + * migrated. Calculating that is possible, but expensive and can be
> > + * figured out from userspace
> > + */
> > +static void fill_contig_page_info(struct zone *zone,
> > + unsigned int suitable_order,
> > + struct contig_page_info *info)
> > +{
> > + unsigned int order;
> > +
> > + info->free_pages = 0;
> > + info->free_blocks_total = 0;
> > + info->free_blocks_suitable = 0;
> > +
> > + for (order = 0; order < MAX_ORDER; order++) {
> > + unsigned long blocks;
> > +
> > + /* Count number of free blocks */
> > + blocks = zone->free_area[order].nr_free;
> > + info->free_blocks_total += blocks;
> > +
> > + /* Count free base pages */
> > + info->free_pages += blocks << order;
> > +
> > + /* Count the suitable free blocks */
> > + if (order >= suitable_order)
> > + info->free_blocks_suitable += blocks <<
> > + (order - suitable_order);
> > + }
> > +}
> > +
> > +/*
> > + * Return an index indicating how much of the available free memory is
> > + * unusable for an allocation of the requested size.
> > + */
> > +static int unusable_free_index(struct zone *zone,
>
> Did you remain "zone" argument for later?
> Anyway, It's trivial.
>
It's an oversight. I'll remove the parameter as unnecessary. It was
needed in an earlier iteration.
> > + unsigned int order,
> > + struct contig_page_info *info)
> > +{
> > + /* No free memory is interpreted as all free memory is unusable */
> > + if (info->free_pages == 0)
> > + return 1000;
> > +
> > + /*
> > + * Index should be a value between 0 and 1. Return a value to 3
> > + * decimal places.
> > + *
> > + * 0 => no fragmentation
> > + * 1 => high fragmentation
> > + */
> > + return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> > +
> > +}
> > +
> > +static void pagetypeinfo_showunusable_print(struct seq_file *m,
> > + pg_data_t *pgdat, struct zone *zone)
> > +{
> > + unsigned int order;
> > +
> > + /* Alloc on stack as interrupts are disabled for zone walk */
>
> contig_page_info would be larger than now?
> I think it's small data now. I don't know why you comment it.
> It's just out of curiosity. :)
>
It is small data. I don't remember why I felt the need to spell out why
I didn't use kmalloc(). I'll drop the comment as it's not very helpful.
>
> > + struct contig_page_info info;
> > +
> > + seq_printf(m, "Node %4d, zone %8s %19s",
> > + pgdat->node_id,
> > + zone->name, " ");
> > + for (order = 0; order < MAX_ORDER; ++order) {
> > + fill_contig_page_info(zone, order, &info);
> > + seq_printf(m, "%6d ", unusable_free_index(zone, order, &info));
> > + }
> > +
> > + seq_putc(m, '\n');
> > +}
> > +
> > +/*
> > + * Display unusable free space index
> > + * XXX: Could be a lot more efficient, but it's not a critical path
> > + */
> > +static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
> > +{
> > + pg_data_t *pgdat = (pg_data_t *)arg;
> > +
> > + seq_printf(m, "\nUnusable free space index at order\n");
> > + walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
> > +
> > + return 0;
> > +}
> > +
> > static void pagetypeinfo_showfree_print(struct seq_file *m,
> > pg_data_t *pgdat, struct zone *zone)
> > {
> > @@ -558,6 +656,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> > seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
> > seq_putc(m, '\n');
> > pagetypeinfo_showfree(m, pgdat);
> > + pagetypeinfo_showunusable(m, pgdat);
> > pagetypeinfo_showblockcount(m, pgdat);
> >
> > return 0;
>
>
> --
> Kind regards,
> Minchan Kim
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* [PATCH 04/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (2 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 03/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-16 7:59 ` KOSAKI Motohiro
` (2 more replies)
2010-02-12 12:00 ` [PATCH 05/12] Memory compaction core Mel Gorman
` (7 subsequent siblings)
11 siblings, 3 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
Fragmentation index is a value that makes sense when an allocation of a
given size would fail. The index indicates whether an allocation failure is
due to a lack of memory (values towards 0) or due to external fragmentation
(value towards 1). For the most part, the huge page size will be the size
of interest but not necessarily so it is exported on a per-order and per-zone
basis via /proc/pagetypeinfo.
The index is normally calculated as a value between 0 and 1 which is
obviously unsuitable within the kernel. Instead, the first three decimal
places are used as a value between 0 and 1000 for an integer approximation.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/filesystems/proc.txt | 11 ++++++
mm/vmstat.c | 63 ++++++++++++++++++++++++++++++++++++
2 files changed, 74 insertions(+), 0 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0968a81..06bf53c 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -618,6 +618,10 @@ Unusable free space index at order
Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
+Fragmentation index at order
+Node 0, zone DMA -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
+Node 0, zone DMA32 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
+
Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
Node 0, zone DMA 2 0 5 1 0
Node 0, zone DMA32 41 6 967 2 0
@@ -639,6 +643,13 @@ value between 0 and 1000. The higher the value, the more of free memory is
unusable and by implication, the worse the external fragmentation is. The
percentage of unusable free memory can be found by dividing this value by 10.
+The fragmentation index, is only meaningful if an allocation would fail and
+indicates what the failure is due to. A value of -1 such as in the example
+states that the allocation would succeed. If it would fail, the value is
+between 0 and 1000. A value tending towards 0 implies the allocation failed
+due to a lack of memory. A value tending towards 1000 implies it failed
+due to external fragmentation.
+
If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
make an estimate of the likely number of huge pages that can be allocated
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d05d610..e2d0cc1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -494,6 +494,35 @@ static void fill_contig_page_info(struct zone *zone,
}
/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(struct zone *zone,
+ unsigned int order,
+ struct contig_page_info *info)
+{
+ unsigned long requested = 1UL << order;
+
+ if (!info->free_blocks_total)
+ return 0;
+
+ /* Fragmentation index only makes sense when a request would fail */
+ if (info->free_blocks_suitable)
+ return -1;
+
+ /*
+ * Index is between 0 and 1 so return within 3 decimal places
+ *
+ * 0 => allocation would fail due to lack of memory
+ * 1 => allocation would fail due to fragmentation
+ */
+ return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
+}
+
+/*
* Return an index indicating how much of the available free memory is
* unusable for an allocation of the requested size.
*/
@@ -516,6 +545,39 @@ static int unusable_free_index(struct zone *zone,
}
+static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+
+ /* Alloc on stack as interrupts are disabled for zone walk */
+ struct contig_page_info info;
+
+ seq_printf(m, "Node %4d, zone %8s %19s",
+ pgdat->node_id,
+ zone->name, " ");
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ seq_printf(m, "%6d ", fragmentation_index(zone, order, &info));
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int pagetypeinfo_showfragmentation(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ seq_printf(m, "\nFragmentation index at order\n");
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showfragmentation_print);
+
+ return 0;
+}
+
static void pagetypeinfo_showunusable_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
@@ -657,6 +719,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
seq_putc(m, '\n');
pagetypeinfo_showfree(m, pgdat);
pagetypeinfo_showunusable(m, pgdat);
+ pagetypeinfo_showfragmentation(m, pgdat);
pagetypeinfo_showblockcount(m, pgdat);
return 0;
--
1.6.5
* Re: [PATCH 04/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-12 12:00 ` [PATCH 04/12] Export fragmentation " Mel Gorman
@ 2010-02-16 7:59 ` KOSAKI Motohiro
2010-02-16 8:41 ` Mel Gorman
2010-02-17 1:44 ` Rik van Riel
2010-02-18 15:37 ` Minchan Kim
2 siblings, 1 reply; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-02-16 7:59 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Andrea Arcangeli, Christoph Lameter, Adam Litke,
Avi Kivity, David Rientjes, linux-kernel, linux-mm
> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1). For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
Hmmm..
I don't understand why an admin needs to know two metrics (unusable-index
and fragmentation-index). They have very similar meanings and are easily
confused, imho.
Can we make do with just one user-friendly metric?
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> Documentation/filesystems/proc.txt | 11 ++++++
> mm/vmstat.c | 63 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 74 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 0968a81..06bf53c 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -618,6 +618,10 @@ Unusable free space index at order
> Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
> Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
>
> +Fragmentation index at order
> +Node 0, zone DMA -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
> +Node 0, zone DMA32 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
> +
> Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
> Node 0, zone DMA 2 0 5 1 0
> Node 0, zone DMA32 41 6 967 2 0
> @@ -639,6 +643,13 @@ value between 0 and 1000. The higher the value, the more of free memory is
> unusable and by implication, the worse the external fragmentation is. The
> percentage of unusable free memory can be found by dividing this value by 10.
>
> +The fragmentation index, is only meaningful if an allocation would fail and
> +indicates what the failure is due to. A value of -1 such as in the example
> +states that the allocation would succeed. If it would fail, the value is
> +between 0 and 1000. A value tending towards 0 implies the allocation failed
> +due to a lack of memory. A value tending towards 1000 implies it failed
> +due to external fragmentation.
> +
> If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
> from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
> make an estimate of the likely number of huge pages that can be allocated
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index d05d610..e2d0cc1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -494,6 +494,35 @@ static void fill_contig_page_info(struct zone *zone,
> }
>
> /*
> + * A fragmentation index only makes sense if an allocation of a requested
> + * size would fail. If that is true, the fragmentation index indicates
> + * whether external fragmentation or a lack of memory was the problem.
> + * The value can be used to determine if page reclaim or compaction
> + * should be used
> + */
> +int fragmentation_index(struct zone *zone,
> + unsigned int order,
> + struct contig_page_info *info)
> +{
> + unsigned long requested = 1UL << order;
> +
> + if (!info->free_blocks_total)
> + return 0;
> +
> + /* Fragmentation index only makes sense when a request would fail */
> + if (info->free_blocks_suitable)
> + return -1;
> +
> + /*
> + * Index is between 0 and 1 so return within 3 decimal places
> + *
> + * 0 => allocation would fail due to lack of memory
> + * 1 => allocation would fail due to fragmentation
> + */
> + return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> +}
Dumb question: I don't understand why this calculation represents the
fragmentation index. Does it have a theoretical background? If yes, can you
please point me to it?
> +
> +/*
> * Return an index indicating how much of the available free memory is
> * unusable for an allocation of the requested size.
> */
> @@ -516,6 +545,39 @@ static int unusable_free_index(struct zone *zone,
>
> }
>
> +static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
> + pg_data_t *pgdat, struct zone *zone)
> +{
> + unsigned int order;
> +
> + /* Alloc on stack as interrupts are disabled for zone walk */
> + struct contig_page_info info;
> +
> + seq_printf(m, "Node %4d, zone %8s %19s",
> + pgdat->node_id,
> + zone->name, " ");
> + for (order = 0; order < MAX_ORDER; ++order) {
> + fill_contig_page_info(zone, order, &info);
> + seq_printf(m, "%6d ", fragmentation_index(zone, order, &info));
> + }
> +
> + seq_putc(m, '\n');
> +}
> +
> +/*
> + * Display fragmentation index for orders that allocations would fail for
> + * XXX: Could be a lot more efficient, but it's not a critical path
> + */
> +static int pagetypeinfo_showfragmentation(struct seq_file *m, void *arg)
> +{
> + pg_data_t *pgdat = (pg_data_t *)arg;
> +
> + seq_printf(m, "\nFragmentation index at order\n");
> + walk_zones_in_node(m, pgdat, pagetypeinfo_showfragmentation_print);
> +
> + return 0;
> +}
> +
> static void pagetypeinfo_showunusable_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> @@ -657,6 +719,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> seq_putc(m, '\n');
> pagetypeinfo_showfree(m, pgdat);
> pagetypeinfo_showunusable(m, pgdat);
> + pagetypeinfo_showfragmentation(m, pgdat);
> pagetypeinfo_showblockcount(m, pgdat);
>
> return 0;
> --
> 1.6.5
>
* Re: [PATCH 04/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-16 7:59 ` KOSAKI Motohiro
@ 2010-02-16 8:41 ` Mel Gorman
2010-02-16 8:49 ` KOSAKI Motohiro
0 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-16 8:41 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm
On Tue, Feb 16, 2010 at 04:59:05PM +0900, KOSAKI Motohiro wrote:
> > Fragmentation index is a value that makes sense when an allocation of a
> > given size would fail. The index indicates whether an allocation failure is
> > due to a lack of memory (values towards 0) or due to external fragmentation
> > (value towards 1). For the most part, the huge page size will be the size
> > of interest but not necessarily so it is exported on a per-order and per-zone
> > basis via /proc/pagetypeinfo.
> >
> > The index is normally calculated as a value between 0 and 1 which is
> > obviously unsuitable within the kernel. Instead, the first three decimal
> > places are used as a value between 0 and 1000 for an integer approximation.
>
> Hmmm..
>
> I don't understand why an admin needs to know two metrics (unusable-index
> and fragmentation-index). They have very similar meanings and are easily
> confused, imho.
>
Because they have different meanings and are used for different things. Unusable
index describes the current system state and is the one that is most likely
to be of interest to an administrator monitoring this. Fragmentation index is
telling you "why" an allocation failed because arguably external fragmentation
does not exist until the time of allocation failure.
Fragmentation index is used, for example, to determine in advance whether
compaction is likely to work or not.
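To make the difference concrete, here is a quick throwaway user-space sketch
(the free-list counts are made up; the per-order accounting and the two index
calculations follow what patches 03 and 04 do):

#include <stdio.h>

#define MAX_ORDER 11

struct contig_page_info {
	unsigned long free_pages;		/* total free base pages */
	unsigned long free_blocks_total;	/* free blocks of any order */
	unsigned long free_blocks_suitable;	/* blocks usable for the request */
};

/* Mirror the per-order walk over the buddy free lists */
static void fill_info(const unsigned long *nr_free, unsigned int suitable_order,
			struct contig_page_info *info)
{
	unsigned int order;

	info->free_pages = 0;
	info->free_blocks_total = 0;
	info->free_blocks_suitable = 0;
	for (order = 0; order < MAX_ORDER; order++) {
		info->free_blocks_total += nr_free[order];
		info->free_pages += nr_free[order] << order;
		if (order >= suitable_order)
			info->free_blocks_suitable +=
				nr_free[order] << (order - suitable_order);
	}
}

/* Same expression as unusable_free_index() in patch 03 */
static int unusable_free_index(unsigned int order,
				const struct contig_page_info *info)
{
	if (!info->free_pages)
		return 1000;
	return ((info->free_pages -
		(info->free_blocks_suitable << order)) * 1000) /
		info->free_pages;
}

/* Same expression as fragmentation_index() in this patch */
static int fragmentation_index(unsigned int order,
				const struct contig_page_info *info)
{
	unsigned long requested = 1UL << order;

	if (!info->free_blocks_total)
		return 0;
	if (info->free_blocks_suitable)
		return -1;	/* the allocation would succeed */
	return 1000 - ((1000 + (info->free_pages * 1000 / requested)) /
			info->free_blocks_total);
}

int main(void)
{
	/* Lots of small free chunks, nothing free at order >= 4 */
	unsigned long nr_free[MAX_ORDER] = { 800, 40, 10, 2 };
	struct contig_page_info info;
	unsigned int order = 9;		/* huge-page-sized request on X86-64 */

	fill_info(nr_free, order, &info);
	printf("unusable free index at order %u: %d\n",
		order, unusable_free_index(order, &info));
	printf("fragmentation index at order %u: %d\n",
		order, fragmentation_index(order, &info));
	return 0;
}

For that state there are 936 free pages, so an order-9 (512 page) request
fails even though nearly two requests' worth of memory is free. The unusable
free space index reports 1000 (none of the free memory helps an order-9
allocation as things stand) while the fragmentation index reports 997, i.e.
the failure is down to external fragmentation rather than a lack of memory,
which is exactly the case compaction is aimed at.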
> Can we make do with just one user-friendly metric?
>
What do you suggest?
Unusable free space index is easier to understand and can be expressed
as a percentage but fragmentation index is what the kernel is using. I
could hide the fragmentation index altogether if you prefer? I intend to
use it myself but I can always use a debugging patch.
>
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > Documentation/filesystems/proc.txt | 11 ++++++
> > mm/vmstat.c | 63 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 74 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 0968a81..06bf53c 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -618,6 +618,10 @@ Unusable free space index at order
> > Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
> > Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
> >
> > +Fragmentation index at order
> > +Node 0, zone DMA -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
> > +Node 0, zone DMA32 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
> > +
> > Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
> > Node 0, zone DMA 2 0 5 1 0
> > Node 0, zone DMA32 41 6 967 2 0
> > @@ -639,6 +643,13 @@ value between 0 and 1000. The higher the value, the more of free memory is
> > unusable and by implication, the worse the external fragmentation is. The
> > percentage of unusable free memory can be found by dividing this value by 10.
> >
> > +The fragmentation index, is only meaningful if an allocation would fail and
> > +indicates what the failure is due to. A value of -1 such as in the example
> > +states that the allocation would succeed. If it would fail, the value is
> > +between 0 and 1000. A value tending towards 0 implies the allocation failed
> > +due to a lack of memory. A value tending towards 1000 implies it failed
> > +due to external fragmentation.
> > +
> > If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
> > from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
> > make an estimate of the likely number of huge pages that can be allocated
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index d05d610..e2d0cc1 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -494,6 +494,35 @@ static void fill_contig_page_info(struct zone *zone,
> > }
> >
> > /*
> > + * A fragmentation index only makes sense if an allocation of a requested
> > + * size would fail. If that is true, the fragmentation index indicates
> > + * whether external fragmentation or a lack of memory was the problem.
> > + * The value can be used to determine if page reclaim or compaction
> > + * should be used
> > + */
> > +int fragmentation_index(struct zone *zone,
> > + unsigned int order,
> > + struct contig_page_info *info)
> > +{
> > + unsigned long requested = 1UL << order;
> > +
> > + if (!info->free_blocks_total)
> > + return 0;
> > +
> > + /* Fragmentation index only makes sense when a request would fail */
> > + if (info->free_blocks_suitable)
> > + return -1;
> > +
> > + /*
> > + * Index is between 0 and 1 so return within 3 decimal places
> > + *
> > + * 0 => allocation would fail due to lack of memory
> > + * 1 => allocation would fail due to fragmentation
> > + */
> > + return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > +}
>
> Dumb question: I don't understand why this calculation represents the
> fragmentation index. Does it have a theoretical background? If yes, can you
> please point me to it?
>
Yes, there is a theoretical background. It's mostly described in
http://portal.acm.org/citation.cfm?id=1375634.1375641
I have a more up-to-date version, but unfortunately it's not published.
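(For what it's worth, the integer expression above is just
1 - (1 + free_pages/requested) / free_blocks_total scaled by 1000. It compares
the number of free blocks with the number of request-sized chunks the free
memory could form if it were contiguous: when the free pages are scattered
across far more blocks than that, the index heads towards 1 and compaction can
help; when the block count is already about as low as the amount of free
memory allows, rearranging it cannot help and the index heads towards 0, i.e.
more memory is needed.)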
>
>
>
> > +
> > +/*
> > * Return an index indicating how much of the available free memory is
> > * unusable for an allocation of the requested size.
> > */
> > @@ -516,6 +545,39 @@ static int unusable_free_index(struct zone *zone,
> >
> > }
> >
> > +static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
> > + pg_data_t *pgdat, struct zone *zone)
> > +{
> > + unsigned int order;
> > +
> > + /* Alloc on stack as interrupts are disabled for zone walk */
> > + struct contig_page_info info;
> > +
> > + seq_printf(m, "Node %4d, zone %8s %19s",
> > + pgdat->node_id,
> > + zone->name, " ");
> > + for (order = 0; order < MAX_ORDER; ++order) {
> > + fill_contig_page_info(zone, order, &info);
> > + seq_printf(m, "%6d ", fragmentation_index(zone, order, &info));
> > + }
> > +
> > + seq_putc(m, '\n');
> > +}
> > +
> > +/*
> > + * Display fragmentation index for orders that allocations would fail for
> > + * XXX: Could be a lot more efficient, but it's not a critical path
> > + */
> > +static int pagetypeinfo_showfragmentation(struct seq_file *m, void *arg)
> > +{
> > + pg_data_t *pgdat = (pg_data_t *)arg;
> > +
> > + seq_printf(m, "\nFragmentation index at order\n");
> > + walk_zones_in_node(m, pgdat, pagetypeinfo_showfragmentation_print);
> > +
> > + return 0;
> > +}
> > +
> > static void pagetypeinfo_showunusable_print(struct seq_file *m,
> > pg_data_t *pgdat, struct zone *zone)
> > {
> > @@ -657,6 +719,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> > seq_putc(m, '\n');
> > pagetypeinfo_showfree(m, pgdat);
> > pagetypeinfo_showunusable(m, pgdat);
> > + pagetypeinfo_showfragmentation(m, pgdat);
> > pagetypeinfo_showblockcount(m, pgdat);
> >
> > return 0;
> > --
> > 1.6.5
> >
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 04/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-16 8:41 ` Mel Gorman
@ 2010-02-16 8:49 ` KOSAKI Motohiro
0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-02-16 8:49 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Andrea Arcangeli, Christoph Lameter, Adam Litke,
Avi Kivity, David Rientjes, linux-kernel, linux-mm
> > Dumb question: I don't understand why this calculation represents the
> > fragmentation index. Does it have a theoretical background? If yes, can you
> > please point me to it?
> >
>
> Yes, there is a theoretical background. It's mostly described in
>
> http://portal.acm.org/citation.cfm?id=1375634.1375641
>
> I have a more up-to-date version, but unfortunately it's not published.
OK, thanks. I'll stop rushing in with dumb questions and read it first. I'll
resume the rest of the review in a few days.
Thanks.
* Re: [PATCH 04/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-12 12:00 ` [PATCH 04/12] Export fragmentation " Mel Gorman
2010-02-16 7:59 ` KOSAKI Motohiro
@ 2010-02-17 1:44 ` Rik van Riel
2010-02-18 15:37 ` Minchan Kim
2 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-02-17 1:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/12/2010 07:00 AM, Mel Gorman wrote:
> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1). For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* Re: [PATCH 04/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-12 12:00 ` [PATCH 04/12] Export fragmentation " Mel Gorman
2010-02-16 7:59 ` KOSAKI Motohiro
2010-02-17 1:44 ` Rik van Riel
@ 2010-02-18 15:37 ` Minchan Kim
2 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2010-02-18 15:37 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, 2010-02-12 at 12:00 +0000, Mel Gorman wrote:
> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1). For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> ---
> Documentation/filesystems/proc.txt | 11 ++++++
> mm/vmstat.c | 63 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 74 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 0968a81..06bf53c 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -618,6 +618,10 @@ Unusable free space index at order
> Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
> Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34
>
> +Fragmentation index at order
> +Node 0, zone DMA -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
> +Node 0, zone DMA32 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
> +
> Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
> Node 0, zone DMA 2 0 5 1 0
> Node 0, zone DMA32 41 6 967 2 0
> @@ -639,6 +643,13 @@ value between 0 and 1000. The higher the value, the more of free memory is
> unusable and by implication, the worse the external fragmentation is. The
> percentage of unusable free memory can be found by dividing this value by 10.
>
> +The fragmentation index, is only meaningful if an allocation would fail and
> +indicates what the failure is due to. A value of -1 such as in the example
> +states that the allocation would succeed. If it would fail, the value is
> +between 0 and 1000. A value tending towards 0 implies the allocation failed
> +due to a lack of memory. A value tending towards 1000 implies it failed
> +due to external fragmentation.
> +
> If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
> from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
> make an estimate of the likely number of huge pages that can be allocated
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index d05d610..e2d0cc1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -494,6 +494,35 @@ static void fill_contig_page_info(struct zone *zone,
> }
>
> /*
> + * A fragmentation index only makes sense if an allocation of a requested
> + * size would fail. If that is true, the fragmentation index indicates
> + * whether external fragmentation or a lack of memory was the problem.
> + * The value can be used to determine if page reclaim or compaction
> + * should be used
> + */
> +int fragmentation_index(struct zone *zone,
As with the previous [3/12], why do you keep the "zone" argument?
If you plan to use it in the future, I don't mind. It's just trivial.
--
Kind regards,
Minchan Kim
* [PATCH 05/12] Memory compaction core
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (3 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 04/12] Export fragmentation " Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-16 8:31 ` KOSAKI Motohiro
2010-02-18 16:58 ` Minchan Kim
2010-02-12 12:00 ` [PATCH 06/12] Add /proc trigger for memory compaction Mel Gorman
` (6 subsequent siblings)
11 siblings, 2 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.
A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas and consumes the free pages within, making them available for the
migration scanner. The pages isolated for migration are then migrated to
the newly isolated free pages.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/compaction.h | 8 +
include/linux/mm.h | 1 +
include/linux/swap.h | 5 +
include/linux/vmstat.h | 1 +
mm/Makefile | 1 +
mm/compaction.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 37 +++++
mm/vmscan.c | 5 -
mm/vmstat.c | 5 +
scripts/kconfig/conf.c | 1 -
10 files changed, 399 insertions(+), 6 deletions(-)
create mode 100644 include/linux/compaction.h
create mode 100644 mm/compaction.c
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..6201371
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,8 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE 0
+#define COMPACT_COMPLETE 1
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60c467b..c2a2ede 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -332,6 +332,7 @@ void put_page(struct page *page);
void put_pages_list(struct list_head *pages);
void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);
/*
* Compound pages have a destructor function. Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..7e7181b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
__lru_cache_add(page, LRU_ACTIVE_FILE);
}
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
+#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
+
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ee03bba..d7f7236 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/Makefile b/mm/Makefile
index 7a68d2a..ccb1f72 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_COMPACTION) += compaction.o
obj-$(CONFIG_SMP) += percpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..51ec864
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,341 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2009 Mel Gorman <mel@csn.ul.ie>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+ struct list_head freepages; /* List of free pages to migrate to */
+ struct list_head migratepages; /* List of pages being migrated */
+ unsigned long nr_freepages; /* Number of isolated free pages */
+ unsigned long nr_migratepages; /* Number of pages to migrate */
+ unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long migrate_pfn; /* isolate_migratepages search base */
+ struct zone *zone;
+};
+
+static int release_freepages(struct zone *zone, struct list_head *freelist)
+{
+ struct page *page, *next;
+ int count = 0;
+
+ list_for_each_entry_safe(page, next, freelist, lru) {
+ list_del(&page->lru);
+ __free_page(page);
+ count++;
+ }
+
+ return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+ unsigned long blockpfn,
+ struct list_head *freelist)
+{
+ unsigned long zone_end_pfn, end_pfn;
+ int total_isolated = 0;
+
+ /* Get the last PFN we should scan for free pages at */
+ zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ end_pfn = blockpfn + pageblock_nr_pages;
+ if (end_pfn > zone_end_pfn)
+ end_pfn = zone_end_pfn;
+
+ /* Isolate free pages. This assumes the block is valid */
+ for (; blockpfn < end_pfn; blockpfn++) {
+ struct page *page;
+ int isolated, i;
+
+ if (!pfn_valid_within(blockpfn))
+ continue;
+
+ page = pfn_to_page(blockpfn);
+ if (!PageBuddy(page))
+ continue;
+
+ /* Found a free page, break it into order-0 pages */
+ isolated = split_free_page(page);
+ total_isolated += isolated;
+ for (i = 0; i < isolated; i++) {
+ list_add(&page->lru, freelist);
+ page++;
+ }
+ blockpfn += isolated - 1;
+ }
+
+ return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int suitable_migration_target(struct page *page)
+{
+ /* If the page is a large free page, then allow migration */
+ if (PageBuddy(page) && page_order(page) >= pageblock_order)
+ return 1;
+
+ /* If the block is MIGRATE_MOVABLE, allow migration */
+ if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
+ return 1;
+
+ /* Otherwise skip the block */
+ return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from
+ */
+static void isolate_freepages(struct zone *zone,
+ struct compact_control *cc)
+{
+ struct page *page;
+ unsigned long high_pfn, low_pfn, pfn;
+ unsigned long flags;
+ int nr_freepages = cc->nr_freepages;
+ struct list_head *freelist = &cc->freepages;
+
+ pfn = cc->free_pfn;
+ low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+ high_pfn = low_pfn;
+
+ /*
+ * Isolate free pages until enough are available to migrate the
+ * pages on cc->migratepages. We stop searching if the migrate
+ * and free page scanners meet or enough free pages are isolated.
+ */
+ spin_lock_irqsave(&zone->lock, flags);
+ for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+ pfn -= pageblock_nr_pages) {
+ int isolated;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ /* Check for overlapping nodes/zones */
+ page = pfn_to_page(pfn);
+ if (page_zone(page) != zone)
+ continue;
+
+ /* Check the block is suitable for migration */
+ if (!suitable_migration_target(page))
+ continue;
+
+ /* Found a block suitable for isolating free pages from */
+ isolated = isolate_freepages_block(zone, pfn, freelist);
+ nr_freepages += isolated;
+
+ /*
+ * Record the highest PFN we isolated pages from. When next
+ * looking for free pages, the search will restart here as
+ * page migration may have returned some pages to the allocator
+ */
+ if (isolated)
+ high_pfn = max(high_pfn, pfn);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ cc->free_pfn = high_pfn;
+ cc->nr_freepages = nr_freepages;
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+ struct compact_control *cc)
+{
+ unsigned long low_pfn, end_pfn;
+ struct list_head *migratelist;
+ enum lru_list lru_src;
+
+ low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
+ migratelist = &cc->migratepages;
+
+ /* Do not scan outside zone boundaries */
+ if (low_pfn < zone->zone_start_pfn)
+ low_pfn = zone->zone_start_pfn;
+
+ /* Setup to scan one block but not past where we are migrating to */
+ end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+ cc->migrate_pfn = end_pfn;
+ VM_BUG_ON(end_pfn > cc->free_pfn);
+
+ if (!pfn_valid(low_pfn))
+ return 0;
+
+ migrate_prep();
+
+ /* Time to isolate some pages for migration */
+ spin_lock_irq(&zone->lru_lock);
+ for (; low_pfn < end_pfn; low_pfn++) {
+ struct page *page;
+ if (!pfn_valid_within(low_pfn))
+ continue;
+
+ /* Get the page and skip if free */
+ page = pfn_to_page(low_pfn);
+ if (PageBuddy(page)) {
+ low_pfn += (1 << page_order(page)) - 1;
+ continue;
+ }
+
+ if (!PageLRU(page) || PageUnevictable(page))
+ continue;
+
+ /* Try isolate the page */
+ lru_src = page_lru(page);
+ switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
+ case 0:
+ list_move(&page->lru, migratelist);
+ mem_cgroup_del_lru(page);
+ cc->nr_migratepages++;
+ break;
+
+ case -EBUSY:
+ /*
+ * else it is being freed elsewhere. The
+ * problem is that we are not really sure where
+ * it came from in the first place
+ * XXX: Verify the putback logic is ok. This was
+ * all written before LRU lists were split
+ */
+ list_move(&page->lru, &zone->lru[lru_src].list);
+ mem_cgroup_rotate_lru_list(page, page_lru(page));
+ continue;
+
+ default:
+ BUG();
+ }
+ }
+ spin_unlock_irq(&zone->lru_lock);
+
+ return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+ unsigned long data,
+ int **result)
+{
+ struct compact_control *cc = (struct compact_control *)data;
+ struct page *freepage;
+
+ VM_BUG_ON(cc == NULL);
+
+ /* Isolate free pages if necessary */
+ if (list_empty(&cc->freepages)) {
+ isolate_freepages(cc->zone, cc);
+
+ if (list_empty(&cc->freepages))
+ return NULL;
+ }
+
+ freepage = list_entry(cc->freepages.next, struct page, lru);
+ list_del(&freepage->lru);
+ cc->nr_freepages--;
+
+ return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+ int nr_migratepages = 0;
+ int nr_freepages = 0;
+ struct page *page;
+ list_for_each_entry(page, &cc->migratepages, lru)
+ nr_migratepages++;
+ list_for_each_entry(page, &cc->freepages, lru)
+ nr_freepages++;
+
+ cc->nr_migratepages = nr_migratepages;
+ cc->nr_freepages = nr_freepages;
+}
+
+static inline int compact_finished(struct zone *zone,
+ struct compact_control *cc)
+{
+ /* Compaction run completes if the migrate and free scanner meet */
+ if (cc->free_pfn <= cc->migrate_pfn)
+ return COMPACT_COMPLETE;
+
+ return COMPACT_INCOMPLETE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+ int ret = COMPACT_INCOMPLETE;
+
+ /*
+ * Setup to move all movable pages to the end of the zone
+ * XXX: This could be improved upon. In the event compaction
+ * is being successful quickly but called often, there
+ * is a likelihood of scanning the same blocks as sources
+ * and targets frequently. Might be worth caching the
+ * last migrate_pfn to reduce scan times.
+ */
+ cc->migrate_pfn = zone->zone_start_pfn;
+ cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+ cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+ unsigned long nr_migrate, nr_remaining;
+ if (!isolate_migratepages(zone, cc))
+ continue;
+
+ nr_migrate = cc->nr_migratepages;
+ migrate_pages(&cc->migratepages, compaction_alloc,
+ (unsigned long)cc, 0);
+ update_nr_listpages(cc);
+ nr_remaining = cc->nr_migratepages;
+
+ count_vm_event(COMPACTBLOCKS);
+ count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+ if (nr_remaining)
+ count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+ }
+
+ /* Release free pages and check accounting */
+ cc->nr_freepages -= release_freepages(zone, &cc->freepages);
+ VM_BUG_ON(cc->nr_freepages != 0);
+
+ /*
+ * Release LRU pages not migrated
+ * XXX: Page migration at this point tries fairly hard to move
+ * pages as it is but if migration fails, pages are left
+ * on cc->migratepages for more passes. This might cause
+ * multiple useless failures. Watch compact_pagemigrate_failed
+ * in /proc/vmstat. If it grows a lot, then putback should
+ * happen after each failed migration
+ */
+ if (!list_empty(&cc->migratepages))
+ putback_lru_pages(&cc->migratepages);
+
+ return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..6d57154 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1168,6 +1168,43 @@ void split_page(struct page *page, unsigned int order)
set_page_refcounted(page + i);
}
+/* Similar to split_page except the page is already free */
+int split_free_page(struct page *page)
+{
+ unsigned int order;
+ unsigned long watermark;
+ struct zone *zone;
+
+ BUG_ON(!PageBuddy(page));
+
+ zone = page_zone(page);
+ order = page_order(page);
+
+ /* Obey watermarks or the system could deadlock */
+ watermark = low_wmark_pages(zone) + (1 << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ return 0;
+
+ /* Remove page from free list */
+ list_del(&page->lru);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+ /* Split into individual pages */
+ set_page_refcounted(page);
+ split_page(page, order);
+
+ /* Set the migratetype on the assumption it's for migration */
+ if (order >= pageblock_order - 1) {
+ struct page *endpage = page + (1 << order) - 1;
+ for (; page < endpage; page += pageblock_nr_pages)
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ }
+
+ return 1 << order;
+}
+
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..47de19b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -803,11 +803,6 @@ keep:
return nr_reclaimed;
}
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
-#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
-
/*
* Attempt to remove the specified page from its LRU. Only take this page
* if it is of the appropriate PageActive status. Pages which are being
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e2d0cc1..f0930ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -852,6 +852,11 @@ static const char * const vmstat_text[] = {
"allocstall",
"pgrotated",
+
+ "compact_blocks_moved",
+ "compact_pages_moved",
+ "compact_pagemigrate_failed",
+
#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
"htlb_buddy_alloc_fail",
--
1.6.5
* Re: [PATCH 05/12] Memory compaction core
2010-02-12 12:00 ` [PATCH 05/12] Memory compaction core Mel Gorman
@ 2010-02-16 8:31 ` KOSAKI Motohiro
2010-02-16 8:48 ` Mel Gorman
2010-02-17 13:29 ` Mel Gorman
2010-02-18 16:58 ` Minchan Kim
1 sibling, 2 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-02-16 8:31 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Andrea Arcangeli, Christoph Lameter, Adam Litke,
Avi Kivity, David Rientjes, linux-kernel, linux-mm
> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
>
> A single compaction run involves a migration scanner and a free scanner.
> Both scanners operate on pageblock-sized areas in the zone. The migration
> scanner starts at the bottom of the zone and searches for all movable pages
> within each area, isolating them onto a private list called migratelist.
> The free scanner starts at the top of the zone and searches for suitable
> areas and consumes the free pages within, making them available for the
> migration scanner. The pages isolated for migration are then migrated to
> the newly isolated free pages.
>
Dumb question: this patch only does in-zone compaction. Why can't we support
inter-zone compaction? Usually the DMA zone is a scarcer resource than the
NORMAL zone.
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> include/linux/compaction.h | 8 +
> include/linux/mm.h | 1 +
> include/linux/swap.h | 5 +
> include/linux/vmstat.h | 1 +
> mm/Makefile | 1 +
> mm/compaction.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 37 +++++
> mm/vmscan.c | 5 -
> mm/vmstat.c | 5 +
> scripts/kconfig/conf.c | 1 -
> 10 files changed, 399 insertions(+), 6 deletions(-)
> create mode 100644 include/linux/compaction.h
> create mode 100644 mm/compaction.c
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> new file mode 100644
> index 0000000..6201371
> --- /dev/null
> +++ b/include/linux/compaction.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_COMPACTION_H
> +#define _LINUX_COMPACTION_H
> +
> +/* Return values for compact_zone() */
> +#define COMPACT_INCOMPLETE 0
> +#define COMPACT_COMPLETE 1
> +
> +#endif /* _LINUX_COMPACTION_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 60c467b..c2a2ede 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -332,6 +332,7 @@ void put_page(struct page *page);
> void put_pages_list(struct list_head *pages);
>
> void split_page(struct page *page, unsigned int order);
> +int split_free_page(struct page *page);
>
> /*
> * Compound pages have a destructor function. Provide a
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..7e7181b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
> __lru_cache_add(page, LRU_ACTIVE_FILE);
> }
>
> +/* LRU Isolation modes. */
> +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> +
> /* linux/mm/vmscan.c */
> extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ee03bba..d7f7236 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> KSWAPD_SKIP_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> + COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> #ifdef CONFIG_HUGETLB_PAGE
> HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> #endif
> diff --git a/mm/Makefile b/mm/Makefile
> index 7a68d2a..ccb1f72 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
> obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_COMPACTION) += compaction.o
> obj-$(CONFIG_SMP) += percpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> diff --git a/mm/compaction.c b/mm/compaction.c
> new file mode 100644
> index 0000000..51ec864
> --- /dev/null
> +++ b/mm/compaction.c
> @@ -0,0 +1,341 @@
> +/*
> + * linux/mm/compaction.c
> + *
> + * Memory compaction for the reduction of external fragmentation. Note that
> + * this heavily depends upon page migration to do all the real heavy
> + * lifting
> + *
> + * Copyright IBM Corp. 2009 Mel Gorman <mel@csn.ul.ie>
If my memory is correct, it is now 2010 :)
> + */
> +#include <linux/swap.h>
> +#include <linux/migrate.h>
> +#include <linux/compaction.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +/*
> + * compact_control is used to track pages being migrated and the free pages
> + * they are being migrated to during memory compaction. The free_pfn starts
> + * at the end of a zone and migrate_pfn begins at the start. Movable pages
> + * are moved to the end of a zone during a compaction run and the run
> + * completes when free_pfn <= migrate_pfn
> + */
> +struct compact_control {
> + struct list_head freepages; /* List of free pages to migrate to */
> + struct list_head migratepages; /* List of pages being migrated */
> + unsigned long nr_freepages; /* Number of isolated free pages */
> + unsigned long nr_migratepages; /* Number of pages to migrate */
> + unsigned long free_pfn; /* isolate_freepages search base */
> + unsigned long migrate_pfn; /* isolate_migratepages search base */
> + struct zone *zone;
> +};
> +
> +static int release_freepages(struct zone *zone, struct list_head *freelist)
> +{
> + struct page *page, *next;
> + int count = 0;
> +
> + list_for_each_entry_safe(page, next, freelist, lru) {
> + list_del(&page->lru);
> + __free_page(page);
> + count++;
> + }
> +
> + return count;
> +}
> +
> +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> +static int isolate_freepages_block(struct zone *zone,
> + unsigned long blockpfn,
> + struct list_head *freelist)
> +{
> + unsigned long zone_end_pfn, end_pfn;
> + int total_isolated = 0;
> +
> + /* Get the last PFN we should scan for free pages at */
> + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> + end_pfn = blockpfn + pageblock_nr_pages;
> + if (end_pfn > zone_end_pfn)
> + end_pfn = zone_end_pfn;
> +
> + /* Isolate free pages. This assumes the block is valid */
> + for (; blockpfn < end_pfn; blockpfn++) {
> + struct page *page;
> + int isolated, i;
> +
> + if (!pfn_valid_within(blockpfn))
> + continue;
> +
> + page = pfn_to_page(blockpfn);
> + if (!PageBuddy(page))
> + continue;
> +
> + /* Found a free page, break it into order-0 pages */
> + isolated = split_free_page(page);
> + total_isolated += isolated;
> + for (i = 0; i < isolated; i++) {
> + list_add(&page->lru, freelist);
> + page++;
> + }
> + blockpfn += isolated - 1;
> + }
> +
> + return total_isolated;
> +}
> +
> +/* Returns 1 if the page is within a block suitable for migration to */
> +static int suitable_migration_target(struct page *page)
> +{
> + /* If the page is a large free page, then allow migration */
> + if (PageBuddy(page) && page_order(page) >= pageblock_order)
> + return 1;
> +
> + /* If the block is MIGRATE_MOVABLE, allow migration */
> + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> + return 1;
> +
> + /* Otherwise skip the block */
> + return 0;
> +}
> +
> +/*
> + * Based on information in the current compact_control, find blocks
> + * suitable for isolating free pages from
> + */
> +static void isolate_freepages(struct zone *zone,
> + struct compact_control *cc)
> +{
> + struct page *page;
> + unsigned long high_pfn, low_pfn, pfn;
> + unsigned long flags;
> + int nr_freepages = cc->nr_freepages;
> + struct list_head *freelist = &cc->freepages;
> +
> + pfn = cc->free_pfn;
> + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> + high_pfn = low_pfn;
> +
> + /*
> + * Isolate free pages until enough are available to migrate the
> + * pages on cc->migratepages. We stop searching if the migrate
> + * and free page scanners meet or enough free pages are isolated.
> + */
> + spin_lock_irqsave(&zone->lock, flags);
> + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> + pfn -= pageblock_nr_pages) {
> + int isolated;
> +
> + if (!pfn_valid(pfn))
> + continue;
> +
> + /* Check for overlapping nodes/zones */
> + page = pfn_to_page(pfn);
> + if (page_zone(page) != zone)
> + continue;
> +
> + /* Check the block is suitable for migration */
> + if (!suitable_migration_target(page))
> + continue;
> +
> + /* Found a block suitable for isolating free pages from */
> + isolated = isolate_freepages_block(zone, pfn, freelist);
> + nr_freepages += isolated;
> +
> + /*
> + * Record the highest PFN we isolated pages from. When next
> + * looking for free pages, the search will restart here as
> + * page migration may have returned some pages to the allocator
> + */
> + if (isolated)
> + high_pfn = max(high_pfn, pfn);
> + }
> + spin_unlock_irqrestore(&zone->lock, flags);
> +
> + cc->free_pfn = high_pfn;
> + cc->nr_freepages = nr_freepages;
> +}
> +
> +/*
> + * Isolate all pages that can be migrated from the block pointed to by
> + * the migrate scanner within compact_control.
> + */
> +static unsigned long isolate_migratepages(struct zone *zone,
> + struct compact_control *cc)
> +{
> + unsigned long low_pfn, end_pfn;
> + struct list_head *migratelist;
> + enum lru_list lru_src;
> +
> + low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
> + migratelist = &cc->migratepages;
> +
> + /* Do not scan outside zone boundaries */
> + if (low_pfn < zone->zone_start_pfn)
> + low_pfn = zone->zone_start_pfn;
> +
> + /* Setup to scan one block but not past where we are migrating to */
> + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> + cc->migrate_pfn = end_pfn;
> + VM_BUG_ON(end_pfn > cc->free_pfn);
> +
> + if (!pfn_valid(low_pfn))
> + return 0;
> +
> + migrate_prep();
> +
> + /* Time to isolate some pages for migration */
> + spin_lock_irq(&zone->lru_lock);
> + for (; low_pfn < end_pfn; low_pfn++) {
Holding the spinlock across a whole pageblock_nr_pages scan seems too long.
Why can't we release the spinlock more frequently?
Plus, we need to prevent too many concurrent compactions; otherwise too
much isolation leads to strange OOM-killer behaviour.
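Something like the following is what I have in mind (untested sketch only,
and the batch size is arbitrary):

	for (; low_pfn < end_pfn; low_pfn++) {
		/*
		 * Drop the lock and re-enable IRQs every SWAP_CLUSTER_MAX
		 * pfns so a whole pageblock is never scanned with
		 * zone->lru_lock held continuously.
		 */
		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
			spin_unlock_irq(&zone->lru_lock);
			cond_resched();
			spin_lock_irq(&zone->lru_lock);
		}

		/* ... existing isolation code ... */
	}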
> + struct page *page;
> + if (!pfn_valid_within(low_pfn))
> + continue;
> +
> + /* Get the page and skip if free */
> + page = pfn_to_page(low_pfn);
> + if (PageBuddy(page)) {
> + low_pfn += (1 << page_order(page)) - 1;
> + continue;
> + }
> +
> + if (!PageLRU(page) || PageUnevictable(page))
> + continue;
> +
> + /* Try isolate the page */
> + lru_src = page_lru(page);
> + switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
I don't think __isolate_lru_page() is suitable, because it can't isolate
unevictable pages. Unevictable pages are undroppable, but they can still be
migrated.
This is a significant difference between lumpy reclaim and migration-based
compaction; please consider it.
Plus, can you please update the NR_ISOLATED_FILE/ANON stats in this place?
It helps to prevent strange OOM issues.
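i.e. something along these lines in the success path (untested, just to show
what I mean; the counters would also need to be decremented again when the
pages are migrated or put back):

		case 0:
			list_move(&page->lru, migratelist);
			mem_cgroup_del_lru(page);
			/* IRQs are already off under zone->lru_lock */
			__inc_zone_page_state(page, NR_ISOLATED_ANON +
						page_is_file_cache(page));
			cc->nr_migratepages++;
			break;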
> + case 0:
> + list_move(&page->lru, migratelist);
> + mem_cgroup_del_lru(page);
> + cc->nr_migratepages++;
> + break;
> +
> + case -EBUSY:
> + /*
> + * else it is being freed elsewhere. The
> + * problem is that we are not really sure where
> + * it came from in the first place
> + * XXX: Verify the putback logic is ok. This was
> + * all written before LRU lists were split
> + */
> + list_move(&page->lru, &zone->lru[lru_src].list);
> + mem_cgroup_rotate_lru_list(page, page_lru(page));
> + continue;
We don't need this rotation. You probably copied it from isolate_lru_pages(),
so let me explain why isolate_lru_pages() needs such a rotation.
isolate_lru_pages() isolates pages in LRU order, so if it put a page back at
the front of the LRU, the next isolate_lru_pages() call would find the same
page again, which obviously wastes CPU. That is why it puts the page back at
the LRU tail.
But this function isolates pages in pfn order, so we don't need that trick imho.
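i.e. the -EBUSY case could simply leave the page where it is, something like
(untested):

		case -EBUSY:
			/* Still on its LRU list; just skip it */
			continue;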
> +
> + default:
> + BUG();
> + }
> + }
> + spin_unlock_irq(&zone->lru_lock);
> +
> + return cc->nr_migratepages;
> +}
> +
> +/*
> + * This is a migrate-callback that "allocates" freepages by taking pages
> + * from the isolated freelists in the block we are migrating to.
> + */
> +static struct page *compaction_alloc(struct page *migratepage,
> + unsigned long data,
> + int **result)
> +{
> + struct compact_control *cc = (struct compact_control *)data;
> + struct page *freepage;
> +
> + VM_BUG_ON(cc == NULL);
> +
> + /* Isolate free pages if necessary */
> + if (list_empty(&cc->freepages)) {
> + isolate_freepages(cc->zone, cc);
> +
> + if (list_empty(&cc->freepages))
> + return NULL;
> + }
> +
> + freepage = list_entry(cc->freepages.next, struct page, lru);
> + list_del(&freepage->lru);
> + cc->nr_freepages--;
> +
> + return freepage;
> +}
> +
> +/*
> + * We cannot control nr_migratepages and nr_freepages fully when migration is
> + * running as migrate_pages() has no knowledge of compact_control. When
> + * migration is complete, we count the number of pages on the lists by hand.
> + */
> +static void update_nr_listpages(struct compact_control *cc)
> +{
> + int nr_migratepages = 0;
> + int nr_freepages = 0;
> + struct page *page;
> + list_for_each_entry(page, &cc->migratepages, lru)
> + nr_migratepages++;
> + list_for_each_entry(page, &cc->freepages, lru)
> + nr_freepages++;
> +
> + cc->nr_migratepages = nr_migratepages;
> + cc->nr_freepages = nr_freepages;
> +}
> +
> +static inline int compact_finished(struct zone *zone,
> + struct compact_control *cc)
> +{
> + /* Compaction run completes if the migrate and free scanner meet */
> + if (cc->free_pfn <= cc->migrate_pfn)
> + return COMPACT_COMPLETE;
> +
> + return COMPACT_INCOMPLETE;
> +}
> +
> +static int compact_zone(struct zone *zone, struct compact_control *cc)
> +{
> + int ret = COMPACT_INCOMPLETE;
> +
> + /*
> + * Setup to move all movable pages to the end of the zone
> + * XXX: This could be improved upon. In the event compaction
> + * is being successful quickly but called often, there
> + * is a likelihood of scanning the same blocks as sources
> + * and targets frequently. Might be worth caching the
> + * last migrate_pfn to reduce scan times.
> + */
> + cc->migrate_pfn = zone->zone_start_pfn;
> + cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> + cc->free_pfn &= ~(pageblock_nr_pages-1);
> +
> + for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> + unsigned long nr_migrate, nr_remaining;
> + if (!isolate_migratepages(zone, cc))
> + continue;
> +
> + nr_migrate = cc->nr_migratepages;
> + migrate_pages(&cc->migratepages, compaction_alloc,
> + (unsigned long)cc, 0);
> + update_nr_listpages(cc);
> + nr_remaining = cc->nr_migratepages;
> +
> + count_vm_event(COMPACTBLOCKS);
> + count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> + if (nr_remaining)
> + count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> + }
> +
> + /* Release free pages and check accounting */
> + cc->nr_freepages -= release_freepages(zone, &cc->freepages);
> + VM_BUG_ON(cc->nr_freepages != 0);
> +
> + /*
> + * Release LRU pages not migrated
> + * XXX: Page migration at this point tries fairly hard to move
> + * pages as it is but if migration fails, pages are left
> + * on cc->migratepages for more passes. This might cause
> + * multiple useless failures. Watch compact_pagemigrate_failed
> + * in /proc/vmstat. If it grows a lot, then putback should
> + * happen after each failed migration
> + */
> + if (!list_empty(&cc->migratepages))
> + putback_lru_pages(&cc->migratepages);
> +
> + return ret;
> +}
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8deb9d0..6d57154 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1168,6 +1168,43 @@ void split_page(struct page *page, unsigned int order)
> set_page_refcounted(page + i);
> }
>
> +/* Similar to split_page except the page is already free */
> +int split_free_page(struct page *page)
> +{
> + unsigned int order;
> + unsigned long watermark;
> + struct zone *zone;
> +
> + BUG_ON(!PageBuddy(page));
> +
> + zone = page_zone(page);
> + order = page_order(page);
> +
> + /* Obey watermarks or the system could deadlock */
> + watermark = low_wmark_pages(zone) + (1 << order);
> + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> + return 0;
> +
> + /* Remove page from free list */
> + list_del(&page->lru);
> + zone->free_area[order].nr_free--;
> + rmv_page_order(page);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> +
> + /* Split into individual pages */
> + set_page_refcounted(page);
> + split_page(page, order);
> +
> + /* Set the migratetype on the assumption it's for migration */
> + if (order >= pageblock_order - 1) {
> + struct page *endpage = page + (1 << order) - 1;
> + for (; page < endpage; page += pageblock_nr_pages)
> + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> + }
> +
> + return 1 << order;
> +}
> +
> /*
> * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
> * we cheat by calling it from here, in the order > 0 path. Saves a branch
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c26986c..47de19b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -803,11 +803,6 @@ keep:
> return nr_reclaimed;
> }
>
> -/* LRU Isolation modes. */
> -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> -
> /*
> * Attempt to remove the specified page from its LRU. Only take this page
> * if it is of the appropriate PageActive status. Pages which are being
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index e2d0cc1..f0930ae 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -852,6 +852,11 @@ static const char * const vmstat_text[] = {
> "allocstall",
>
> "pgrotated",
> +
> + "compact_blocks_moved",
> + "compact_pages_moved",
> + "compact_pagemigrate_failed",
> +
> #ifdef CONFIG_HUGETLB_PAGE
> "htlb_buddy_alloc_success",
> "htlb_buddy_alloc_fail",
> --
> 1.6.5
>
* Re: [PATCH 05/12] Memory compaction core
2010-02-16 8:31 ` KOSAKI Motohiro
@ 2010-02-16 8:48 ` Mel Gorman
2010-02-16 14:55 ` Christoph Lameter
2010-02-17 13:29 ` Mel Gorman
1 sibling, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-16 8:48 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm
On Tue, Feb 16, 2010 at 05:31:59PM +0900, KOSAKI Motohiro wrote:
> > This patch is the core of a mechanism which compacts memory in a zone by
> > relocating movable pages towards the end of the zone.
> >
> > A single compaction run involves a migration scanner and a free scanner.
> > Both scanners operate on pageblock-sized areas in the zone. The migration
> > scanner starts at the bottom of the zone and searches for all movable pages
> > within each area, isolating them onto a private list called migratelist.
> > The free scanner starts at the top of the zone and searches for suitable
> > areas and consumes the free pages within making them available for the
> > migration scanner. The pages isolated for migration are then migrated to
> > the newly isolated free pages.
> >
>
> Dumb question. This patch makes only in-zone comaction. why can't we support
> inter zone compaction? usually DMA zone is rare resource rather than NORMAL
> zone.
>
Because how do I tell in advance that the data I am migrating from DMA can
be safely relocated to the NORMAL zone? We don't save GFP flags. Granted,
for DMA, that will not matter as pages that must be in DMA will also not be
migratable. However, buffer pages should not get relocated to HIGHMEM, for
example, which is more likely to happen. It could be special-cased, but
I'm not aware of ZONE_DMA-related pressure problems that would make this
worthwhile; if there are, it should be handled as a separate patch series.
>
>
>
>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > include/linux/compaction.h | 8 +
> > include/linux/mm.h | 1 +
> > include/linux/swap.h | 5 +
> > include/linux/vmstat.h | 1 +
> > mm/Makefile | 1 +
> > mm/compaction.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 37 +++++
> > mm/vmscan.c | 5 -
> > mm/vmstat.c | 5 +
> > scripts/kconfig/conf.c | 1 -
> > 10 files changed, 399 insertions(+), 6 deletions(-)
> > create mode 100644 include/linux/compaction.h
> > create mode 100644 mm/compaction.c
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > new file mode 100644
> > index 0000000..6201371
> > --- /dev/null
> > +++ b/include/linux/compaction.h
> > @@ -0,0 +1,8 @@
> > +#ifndef _LINUX_COMPACTION_H
> > +#define _LINUX_COMPACTION_H
> > +
> > +/* Return values for compact_zone() */
> > +#define COMPACT_INCOMPLETE 0
> > +#define COMPACT_COMPLETE 1
> > +
> > +#endif /* _LINUX_COMPACTION_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 60c467b..c2a2ede 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -332,6 +332,7 @@ void put_page(struct page *page);
> > void put_pages_list(struct list_head *pages);
> >
> > void split_page(struct page *page, unsigned int order);
> > +int split_free_page(struct page *page);
> >
> > /*
> > * Compound pages have a destructor function. Provide a
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index a2602a8..7e7181b 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
> > __lru_cache_add(page, LRU_ACTIVE_FILE);
> > }
> >
> > +/* LRU Isolation modes. */
> > +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> > +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> > +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> > +
> > /* linux/mm/vmscan.c */
> > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > gfp_t gfp_mask, nodemask_t *mask);
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index ee03bba..d7f7236 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> > KSWAPD_SKIP_CONGESTION_WAIT,
> > PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > + COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > #ifdef CONFIG_HUGETLB_PAGE
> > HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > #endif
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 7a68d2a..ccb1f72 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
> > obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> > obj-$(CONFIG_FS_XIP) += filemap_xip.o
> > obj-$(CONFIG_MIGRATION) += migrate.o
> > +obj-$(CONFIG_COMPACTION) += compaction.o
> > obj-$(CONFIG_SMP) += percpu.o
> > obj-$(CONFIG_QUICKLIST) += quicklist.o
> > obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > new file mode 100644
> > index 0000000..51ec864
> > --- /dev/null
> > +++ b/mm/compaction.c
> > @@ -0,0 +1,341 @@
> > +/*
> > + * linux/mm/compaction.c
> > + *
> > + * Memory compaction for the reduction of external fragmentation. Note that
> > + * this heavily depends upon page migration to do all the real heavy
> > + * lifting
> > + *
> > + * Copyright IBM Corp. 2009 Mel Gorman <mel@csn.ul.ie>
>
> if my remember is correct, now is 2010 :)
>
Ah, it was 2009 when I last kicked this around heavily :) I'll update
it.
>
> > + */
> > +#include <linux/swap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/compaction.h>
> > +#include <linux/mm_inline.h>
> > +#include "internal.h"
> > +
> > +/*
> > + * compact_control is used to track pages being migrated and the free pages
> > + * they are being migrated to during memory compaction. The free_pfn starts
> > + * at the end of a zone and migrate_pfn begins at the start. Movable pages
> > + * are moved to the end of a zone during a compaction run and the run
> > + * completes when free_pfn <= migrate_pfn
> > + */
> > +struct compact_control {
> > + struct list_head freepages; /* List of free pages to migrate to */
> > + struct list_head migratepages; /* List of pages being migrated */
> > + unsigned long nr_freepages; /* Number of isolated free pages */
> > + unsigned long nr_migratepages; /* Number of pages to migrate */
> > + unsigned long free_pfn; /* isolate_freepages search base */
> > + unsigned long migrate_pfn; /* isolate_migratepages search base */
> > + struct zone *zone;
> > +};
> > +
> > +static int release_freepages(struct zone *zone, struct list_head *freelist)
> > +{
> > + struct page *page, *next;
> > + int count = 0;
> > +
> > + list_for_each_entry_safe(page, next, freelist, lru) {
> > + list_del(&page->lru);
> > + __free_page(page);
> > + count++;
> > + }
> > +
> > + return count;
> > +}
> > +
> > +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> > +static int isolate_freepages_block(struct zone *zone,
> > + unsigned long blockpfn,
> > + struct list_head *freelist)
> > +{
> > + unsigned long zone_end_pfn, end_pfn;
> > + int total_isolated = 0;
> > +
> > + /* Get the last PFN we should scan for free pages at */
> > + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > + end_pfn = blockpfn + pageblock_nr_pages;
> > + if (end_pfn > zone_end_pfn)
> > + end_pfn = zone_end_pfn;
> > +
> > + /* Isolate free pages. This assumes the block is valid */
> > + for (; blockpfn < end_pfn; blockpfn++) {
> > + struct page *page;
> > + int isolated, i;
> > +
> > + if (!pfn_valid_within(blockpfn))
> > + continue;
> > +
> > + page = pfn_to_page(blockpfn);
> > + if (!PageBuddy(page))
> > + continue;
> > +
> > + /* Found a free page, break it into order-0 pages */
> > + isolated = split_free_page(page);
> > + total_isolated += isolated;
> > + for (i = 0; i < isolated; i++) {
> > + list_add(&page->lru, freelist);
> > + page++;
> > + }
> > + blockpfn += isolated - 1;
> > + }
> > +
> > + return total_isolated;
> > +}
> > +
> > +/* Returns 1 if the page is within a block suitable for migration to */
> > +static int suitable_migration_target(struct page *page)
> > +{
> > + /* If the page is a large free page, then allow migration */
> > + if (PageBuddy(page) && page_order(page) >= pageblock_order)
> > + return 1;
> > +
> > + /* If the block is MIGRATE_MOVABLE, allow migration */
> > + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> > + return 1;
> > +
> > + /* Otherwise skip the block */
> > + return 0;
> > +}
> > +
> > +/*
> > + * Based on information in the current compact_control, find blocks
> > + * suitable for isolating free pages from
> > + */
> > +static void isolate_freepages(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + struct page *page;
> > + unsigned long high_pfn, low_pfn, pfn;
> > + unsigned long flags;
> > + int nr_freepages = cc->nr_freepages;
> > + struct list_head *freelist = &cc->freepages;
> > +
> > + pfn = cc->free_pfn;
> > + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> > + high_pfn = low_pfn;
> > +
> > + /*
> > + * Isolate free pages until enough are available to migrate the
> > + * pages on cc->migratepages. We stop searching if the migrate
> > + * and free page scanners meet or enough free pages are isolated.
> > + */
> > + spin_lock_irqsave(&zone->lock, flags);
> > + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> > + pfn -= pageblock_nr_pages) {
> > + int isolated;
> > +
> > + if (!pfn_valid(pfn))
> > + continue;
> > +
> > + /* Check for overlapping nodes/zones */
> > + page = pfn_to_page(pfn);
> > + if (page_zone(page) != zone)
> > + continue;
> > +
> > + /* Check the block is suitable for migration */
> > + if (!suitable_migration_target(page))
> > + continue;
> > +
> > + /* Found a block suitable for isolating free pages from */
> > + isolated = isolate_freepages_block(zone, pfn, freelist);
> > + nr_freepages += isolated;
> > +
> > + /*
> > + * Record the highest PFN we isolated pages from. When next
> > + * looking for free pages, the search will restart here as
> > + * page migration may have returned some pages to the allocator
> > + */
> > + if (isolated)
> > + high_pfn = max(high_pfn, pfn);
> > + }
> > + spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > + cc->free_pfn = high_pfn;
> > + cc->nr_freepages = nr_freepages;
> > +}
> > +
> > +/*
> > + * Isolate all pages that can be migrated from the block pointed to by
> > + * the migrate scanner within compact_control.
> > + */
> > +static unsigned long isolate_migratepages(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + unsigned long low_pfn, end_pfn;
> > + struct list_head *migratelist;
> > + enum lru_list lru_src;
> > +
> > + low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
> > + migratelist = &cc->migratepages;
> > +
> > + /* Do not scan outside zone boundaries */
> > + if (low_pfn < zone->zone_start_pfn)
> > + low_pfn = zone->zone_start_pfn;
> > +
> > + /* Setup to scan one block but not past where we are migrating to */
> > + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> > + cc->migrate_pfn = end_pfn;
> > + VM_BUG_ON(end_pfn > cc->free_pfn);
> > +
> > + if (!pfn_valid(low_pfn))
> > + return 0;
> > +
> > + migrate_prep();
> > +
> > + /* Time to isolate some pages for migration */
> > + spin_lock_irq(&zone->lru_lock);
> > + for (; low_pfn < end_pfn; low_pfn++) {
>
> pageblock_nr_pages seems too long spin lock holding. why can't we
> release spinlock more frequently?
>
> plus, we need prevent too many concurrent compaction. otherwise too
> many isolation makes strange oom killer.
>
I'll look into isolating in smaller groupings of pages.
>
> > + struct page *page;
> > + if (!pfn_valid_within(low_pfn))
> > + continue;
> > +
> > + /* Get the page and skip if free */
> > + page = pfn_to_page(low_pfn);
> > + if (PageBuddy(page)) {
> > + low_pfn += (1 << page_order(page)) - 1;
> > + continue;
> > + }
> > +
> > + if (!PageLRU(page) || PageUnevictable(page))
> > + continue;
> > +
> > + /* Try isolate the page */
> > + lru_src = page_lru(page);
> > + switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
>
> I don't think __isolate_lru_page() is suitable. because it can't isolate
> unevictable pages. unevictable pages mean it's undroppable but it can be
> migrated.
>
At the moment, I'm not handling unevictable pages at all. This is a very
heavy patchset as it is, and I would prefer to handle unevictable pages as a
follow-on patch if possible.
> This is significantly difference between lumpy reclaim and migrate based
> compaction. please consider it.
>
> plus, can you please change NR_ISOLATED_FILE/ANON stat in this place. it
> help to prevent strange oom issue.
>
Is this the significant difference you mean? If so, yes, I can look into
that and what is needed to prevent concurrent compactors.
As it is, the fact that the fragmentation index and the watermarks are
checked beforehand should prevent too many concurrent compactors, as it
would be determined in advance that compaction would fail. However, it
could still collide with lumpy reclaim, so they should be brought more in
line.
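On the watermark side, here is a loose illustration of the back-off that
split_free_page() in this patch performs before handing pages to compaction
(a userspace sketch with invented numbers, not the kernel's
zone_watermark_ok() logic): refuse to take 2^order free pages if doing so
would drop the zone below its low watermark, which naturally limits how much
several compactors can grab at once.

#include <stdio.h>
#include <stdbool.h>

/* toy zone state; these numbers are invented for illustration */
static unsigned long zone_free_pages = 1000;
static const unsigned long low_wmark = 800;

/* loose analogue of the watermark check in split_free_page() */
static bool may_take_free_pages(unsigned int order)
{
	unsigned long required = low_wmark + (1UL << order);

	return zone_free_pages >= required;
}

int main(void)
{
	for (unsigned int order = 0; order <= 9; order++)
		printf("order %u: %s\n", order,
		       may_take_free_pages(order) ?
		       "ok to isolate" : "would breach watermark, back off");
	return 0;
}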
>
> > + case 0:
> > + list_move(&page->lru, migratelist);
> > + mem_cgroup_del_lru(page);
> > + cc->nr_migratepages++;
> > + break;
> > +
> > + case -EBUSY:
> > + /*
> > + * else it is being freed elsewhere. The
> > + * problem is that we are not really sure where
> > + * it came from in the first place
> > + * XXX: Verify the putback logic is ok. This was
> > + * all written before LRU lists were split
> > + */
> > + list_move(&page->lru, &zone->lru[lru_src].list);
> > + mem_cgroup_rotate_lru_list(page, page_lru(page));
> > + continue;
>
> we don't need this rotation. probaby you copied it from isolate_lru_pages().
> then, I'd like to explain why isolate_lru_pages() need such rotation.
> isolate_lru_pages() isolate page by lru order, then if it put back the page
> to lru front, next isolate_lru_pages() found the same page. it's obviously
> cpu wasting. then we put back the page to lru tail.
>
> but this function isolate pages by pfn order, we don't need such trick imho.
>
>
>
> > +
> > + default:
> > + BUG();
> > + }
> > + }
> > + spin_unlock_irq(&zone->lru_lock);
> > +
> > + return cc->nr_migratepages;
> > +}
> > +
> > +/*
> > + * This is a migrate-callback that "allocates" freepages by taking pages
> > + * from the isolated freelists in the block we are migrating to.
> > + */
> > +static struct page *compaction_alloc(struct page *migratepage,
> > + unsigned long data,
> > + int **result)
> > +{
> > + struct compact_control *cc = (struct compact_control *)data;
> > + struct page *freepage;
> > +
> > + VM_BUG_ON(cc == NULL);
> > +
> > + /* Isolate free pages if necessary */
> > + if (list_empty(&cc->freepages)) {
> > + isolate_freepages(cc->zone, cc);
> > +
> > + if (list_empty(&cc->freepages))
> > + return NULL;
> > + }
> > +
> > + freepage = list_entry(cc->freepages.next, struct page, lru);
> > + list_del(&freepage->lru);
> > + cc->nr_freepages--;
> > +
> > + return freepage;
> > +}
> > +
> > +/*
> > + * We cannot control nr_migratepages and nr_freepages fully when migration is
> > + * running as migrate_pages() has no knowledge of compact_control. When
> > + * migration is complete, we count the number of pages on the lists by hand.
> > + */
> > +static void update_nr_listpages(struct compact_control *cc)
> > +{
> > + int nr_migratepages = 0;
> > + int nr_freepages = 0;
> > + struct page *page;
> > + list_for_each_entry(page, &cc->migratepages, lru)
> > + nr_migratepages++;
> > + list_for_each_entry(page, &cc->freepages, lru)
> > + nr_freepages++;
> > +
> > + cc->nr_migratepages = nr_migratepages;
> > + cc->nr_freepages = nr_freepages;
> > +}
> > +
> > +static inline int compact_finished(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + /* Compaction run completes if the migrate and free scanner meet */
> > + if (cc->free_pfn <= cc->migrate_pfn)
> > + return COMPACT_COMPLETE;
> > +
> > + return COMPACT_INCOMPLETE;
> > +}
> > +
> > +static int compact_zone(struct zone *zone, struct compact_control *cc)
> > +{
> > + int ret = COMPACT_INCOMPLETE;
> > +
> > + /*
> > + * Setup to move all movable pages to the end of the zone
> > + * XXX: This could be improved upon. In the event compaction
> > + * is being successful quickly but called often, there
> > + * is a likelihood of scanning the same blocks as sources
> > + * and targets frequently. Might be worth caching the
> > + * last migrate_pfn to reduce scan times.
> > + */
> > + cc->migrate_pfn = zone->zone_start_pfn;
> > + cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> > + cc->free_pfn &= ~(pageblock_nr_pages-1);
> > +
> > + for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> > + unsigned long nr_migrate, nr_remaining;
> > + if (!isolate_migratepages(zone, cc))
> > + continue;
> > +
> > + nr_migrate = cc->nr_migratepages;
> > + migrate_pages(&cc->migratepages, compaction_alloc,
> > + (unsigned long)cc, 0);
> > + update_nr_listpages(cc);
> > + nr_remaining = cc->nr_migratepages;
> > +
> > + count_vm_event(COMPACTBLOCKS);
> > + count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> > + if (nr_remaining)
> > + count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> > + }
> > +
> > + /* Release free pages and check accounting */
> > + cc->nr_freepages -= release_freepages(zone, &cc->freepages);
> > + VM_BUG_ON(cc->nr_freepages != 0);
> > +
> > + /*
> > + * Release LRU pages not migrated
> > + * XXX: Page migration at this point tries fairly hard to move
> > + * pages as it is but if migration fails, pages are left
> > + * on cc->migratepages for more passes. This might cause
> > + * multiple useless failures. Watch compact_pagemigrate_failed
> > + * in /proc/vmstat. If it grows a lot, then putback should
> > + * happen after each failed migration
> > + */
> > + if (!list_empty(&cc->migratepages))
> > + putback_lru_pages(&cc->migratepages);
> > +
> > + return ret;
> > +}
> > +
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8deb9d0..6d57154 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1168,6 +1168,43 @@ void split_page(struct page *page, unsigned int order)
> > set_page_refcounted(page + i);
> > }
> >
> > +/* Similar to split_page except the page is already free */
> > +int split_free_page(struct page *page)
> > +{
> > + unsigned int order;
> > + unsigned long watermark;
> > + struct zone *zone;
> > +
> > + BUG_ON(!PageBuddy(page));
> > +
> > + zone = page_zone(page);
> > + order = page_order(page);
> > +
> > + /* Obey watermarks or the system could deadlock */
> > + watermark = low_wmark_pages(zone) + (1 << order);
> > + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > + return 0;
> > +
> > + /* Remove page from free list */
> > + list_del(&page->lru);
> > + zone->free_area[order].nr_free--;
> > + rmv_page_order(page);
> > + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > +
> > + /* Split into individual pages */
> > + set_page_refcounted(page);
> > + split_page(page, order);
> > +
> > + /* Set the migratetype on the assumption it's for migration */
> > + if (order >= pageblock_order - 1) {
> > + struct page *endpage = page + (1 << order) - 1;
> > + for (; page < endpage; page += pageblock_nr_pages)
> > + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > + }
> > +
> > + return 1 << order;
> > +}
> > +
> > /*
> > * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
> > * we cheat by calling it from here, in the order > 0 path. Saves a branch
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c26986c..47de19b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -803,11 +803,6 @@ keep:
> > return nr_reclaimed;
> > }
> >
> > -/* LRU Isolation modes. */
> > -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> > -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> > -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> > -
> > /*
> > * Attempt to remove the specified page from its LRU. Only take this page
> > * if it is of the appropriate PageActive status. Pages which are being
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index e2d0cc1..f0930ae 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -852,6 +852,11 @@ static const char * const vmstat_text[] = {
> > "allocstall",
> >
> > "pgrotated",
> > +
> > + "compact_blocks_moved",
> > + "compact_pages_moved",
> > + "compact_pagemigrate_failed",
> > +
> > #ifdef CONFIG_HUGETLB_PAGE
> > "htlb_buddy_alloc_success",
> > "htlb_buddy_alloc_fail",
> > --
> > 1.6.5
> >
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 05/12] Memory compaction core
2010-02-16 8:48 ` Mel Gorman
@ 2010-02-16 14:55 ` Christoph Lameter
2010-02-16 14:59 ` Mel Gorman
0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2010-02-16 14:55 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Andrea Arcangeli, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm
On Tue, 16 Feb 2010, Mel Gorman wrote:
> Because how do I tell in advance that the data I am migrating from DMA can
> be safely relocated to the NORMAL zone? We don't save GFP flags. Granted,
> for DMA, that will not matter as pages that must be in DMA will also not by
> migratable. However, buffer pages should not get relocated to HIGHMEM for
> example which is more likely to happen. It could be special cased but
> I'm not aware of ZONE_DMA-related pressure problems that would make this
> worthwhile and if so, it should be handled as a separate patch series.
Oh there are numerous ZONE_DMA pressure issues if you have ancient /
screwed up hardware that can only operate on DMA or DMA32 memory.
Moving page cache pages out of the DMA zone would be good. A
write request will cause the page to bounce back to the DMA zone if the
device requires the page there.
But I also think that the patchset should be as simple as possible so that
it can be merged soon.
> Ah, it was 2009 when I last kicked this around heavily :) I'll update
> it.
But it was authored in 2009. May be important if patent or other
copyright claims arise. 2009-2010?
* Re: [PATCH 05/12] Memory compaction core
2010-02-16 14:55 ` Christoph Lameter
@ 2010-02-16 14:59 ` Mel Gorman
2010-02-18 19:37 ` Christoph Lameter
0 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-16 14:59 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, Andrea Arcangeli, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm
On Tue, Feb 16, 2010 at 08:55:46AM -0600, Christoph Lameter wrote:
> On Tue, 16 Feb 2010, Mel Gorman wrote:
>
> > Because how do I tell in advance that the data I am migrating from DMA can
> > be safely relocated to the NORMAL zone? We don't save GFP flags. Granted,
> > for DMA, that will not matter as pages that must be in DMA will also not by
> > migratable. However, buffer pages should not get relocated to HIGHMEM for
> > example which is more likely to happen. It could be special cased but
> > I'm not aware of ZONE_DMA-related pressure problems that would make this
> > worthwhile and if so, it should be handled as a separate patch series.
>
> Oh there are numerous ZONE_DMA pressure issues if you have ancient /
> screwed up hardware that can only operate on DMA or DMA32 memory.
>
I've never run into the issue. I was under the impression that the only
devices that might care these days are floppy disks.
> Moving page cache pages out of the DMA zone would be good. A
> write request will cause the page to bounce back to the DMA zone if the
> device requires the page there.
>
> But I also think that the patchset should be as simple as possible so that
> it can be merged soon.
>
Agreed.
> > Ah, it was 2009 when I last kicked this around heavily :) I'll update
> > it.
>
> But it was authored in 2009. May be important if patent or other
> copyright claims arise. 2009-2010?
>
2007-2010 in that case because 2007 was when I first prototyped this.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 05/12] Memory compaction core
2010-02-16 14:59 ` Mel Gorman
@ 2010-02-18 19:37 ` Christoph Lameter
2010-02-18 21:35 ` Mel Gorman
2010-02-19 0:04 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 50+ messages in thread
From: Christoph Lameter @ 2010-02-18 19:37 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Andrea Arcangeli, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm, KAMEZAWA Hiroyuki
On Tue, 16 Feb 2010, Mel Gorman wrote:
> > Oh there are numerous ZONE_DMA pressure issues if you have ancient /
> > screwed up hardware that can only operate on DMA or DMA32 memory.
> >
>
> I've never ran into the issue. I was under the impression that the only
> device that might care these days are floopy disks.
Kame-san had an issue a year or so ago.
* Re: [PATCH 05/12] Memory compaction core
2010-02-18 19:37 ` Christoph Lameter
@ 2010-02-18 21:35 ` Mel Gorman
2010-02-19 0:04 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-18 21:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, Andrea Arcangeli, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm, KAMEZAWA Hiroyuki
On Thu, Feb 18, 2010 at 01:37:35PM -0600, Christoph Lameter wrote:
> On Tue, 16 Feb 2010, Mel Gorman wrote:
>
> > > Oh there are numerous ZONE_DMA pressure issues if you have ancient /
> > > screwed up hardware that can only operate on DMA or DMA32 memory.
> > >
> >
> > I've never ran into the issue. I was under the impression that the only
> > device that might care these days are floopy disks.
>
> Kame-san had an issue a year or so ago.
>
I'll add it as a potential follow-on then. I consider unevictable pages more
important than zone-pressure issues; neither is going to be done in the
first pass. It's complex enough as it is, and I'm more concerned with
getting the page migration anon_vma snag ironed out. Can you please review
the patches related to page migration in v3 and see what you think?
You are much more familiar with the intent of that area than I am.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 05/12] Memory compaction core
2010-02-18 19:37 ` Christoph Lameter
2010-02-18 21:35 ` Mel Gorman
@ 2010-02-19 0:04 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-19 0:04 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, KOSAKI Motohiro, Andrea Arcangeli, Adam Litke,
Avi Kivity, David Rientjes, linux-kernel, linux-mm
On Thu, 18 Feb 2010 13:37:35 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:
> On Tue, 16 Feb 2010, Mel Gorman wrote:
>
> > > Oh there are numerous ZONE_DMA pressure issues if you have ancient /
> > > screwed up hardware that can only operate on DMA or DMA32 memory.
> > >
> >
> > I've never ran into the issue. I was under the impression that the only
> > device that might care these days are floopy disks.
>
> Kame-san had an issue a year or so ago.
>
Yes. But my customers don't use the newest things...
In the server area, recent (64bit) hardware and drivers tend not to cause the
issue. I'm not sure whether there are still drivers which set their DMA mask
wrong and require bounce buffers, but I guess I'll still be seeing DMA-zone
issues in customer support in (early) RHEL6.
Considering other areas, I hear of OOM issues from notebook/desktop users who
don't have swap configured. I think some devices are still 32bit where 64bit
isn't required of them, so I wonder whether lower-zone problems still exist
for users of 32bit devices.
Given that x86-32 kernels still support ZONE_DMA for the ISA bus... we
shouldn't assume there is no legacy hardware.
But yes, it may not be very important to implement inter-zone moving. It's not
for compaction, but just for memory reclaim. And it has some complications.
Thanks,
-Kame
* Re: [PATCH 05/12] Memory compaction core
2010-02-16 8:31 ` KOSAKI Motohiro
2010-02-16 8:48 ` Mel Gorman
@ 2010-02-17 13:29 ` Mel Gorman
2010-02-17 15:45 ` Rik van Riel
1 sibling, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-17 13:29 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, linux-kernel, linux-mm
On Tue, Feb 16, 2010 at 05:31:59PM +0900, KOSAKI Motohiro wrote:
> > <SNIP>
> >
> > +static unsigned long isolate_migratepages(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + unsigned long low_pfn, end_pfn;
> > + struct list_head *migratelist;
> > + enum lru_list lru_src;
> > +
> > + low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
> > + migratelist = &cc->migratepages;
> > +
> > + /* Do not scan outside zone boundaries */
> > + if (low_pfn < zone->zone_start_pfn)
> > + low_pfn = zone->zone_start_pfn;
> > +
> > + /* Setup to scan one block but not past where we are migrating to */
> > + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> > + cc->migrate_pfn = end_pfn;
> > + VM_BUG_ON(end_pfn > cc->free_pfn);
> > +
> > + if (!pfn_valid(low_pfn))
> > + return 0;
> > +
> > + migrate_prep();
> > +
> > + /* Time to isolate some pages for migration */
> > + spin_lock_irq(&zone->lru_lock);
> > + for (; low_pfn < end_pfn; low_pfn++) {
>
> pageblock_nr_pages seems too long spin lock holding. why can't we
> release spinlock more frequently?
>
I changed it to only isolate in quantities of SWAP_CLUSTER_MAX.
> plus, we need prevent too many concurrent compaction. otherwise too
> many isolation makes strange oom killer.
>
I think the fact that watermarks are checked, and that processes enter
direct reclaim where too_many_isolated() is checked, should prevent this.
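For reference, a minimal userspace sketch of the kind of throttle
too_many_isolated() provides (invented structure and numbers, not the actual
mm/vmscan.c code): scanners back off while the number of isolated pages
exceeds what is left on the inactive list, which bounds how much can be
isolated at any one time.

#include <stdio.h>
#include <stdbool.h>

struct toy_zone {
	unsigned long nr_inactive;	/* pages still on the inactive list */
	unsigned long nr_isolated;	/* pages currently isolated from it */
};

/* back off when isolation outpaces what is left on the list */
static bool too_many_isolated(const struct toy_zone *zone)
{
	return zone->nr_isolated > zone->nr_inactive;
}

int main(void)
{
	struct toy_zone zone = { .nr_inactive = 100, .nr_isolated = 0 };

	/* pretend a series of scanners each want to isolate 32 pages */
	for (int scanner = 1; scanner <= 5; scanner++) {
		if (too_many_isolated(&zone)) {
			printf("scanner %d: throttled, waiting for putback\n",
			       scanner);
			continue;
		}
		zone.nr_isolated += 32;
		zone.nr_inactive -= 32;
		printf("scanner %d: isolated 32 pages (isolated=%lu inactive=%lu)\n",
		       scanner, zone.nr_isolated, zone.nr_inactive);
	}
	return 0;
}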
>
> > + struct page *page;
> > + if (!pfn_valid_within(low_pfn))
> > + continue;
> > +
> > + /* Get the page and skip if free */
> > + page = pfn_to_page(low_pfn);
> > + if (PageBuddy(page)) {
> > + low_pfn += (1 << page_order(page)) - 1;
> > + continue;
> > + }
> > +
> > + if (!PageLRU(page) || PageUnevictable(page))
> > + continue;
> > +
> > + /* Try isolate the page */
> > + lru_src = page_lru(page);
> > + switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
>
> I don't think __isolate_lru_page() is suitable. because it can't isolate
> unevictable pages. unevictable pages mean it's undroppable but it can be
> migrated.
>
> This is significantly difference between lumpy reclaim and migrate based
> compaction. please consider it.
>
As unevictable pages are being ignored for the moment, I stuck with
__isolate_lru_page(). Unevictable page handling will be a follow-on patch if
we end up agreeing on this set.
> plus, can you please change NR_ISOLATED_FILE/ANON stat in this place. it
> help to prevent strange oom issue.
>
Good point. Also, I was breaking the counters for the LRU lists, so thanks
for pushing me to look closer at that.
>
> > + case 0:
> > + list_move(&page->lru, migratelist);
> > + mem_cgroup_del_lru(page);
> > + cc->nr_migratepages++;
> > + break;
> > +
> > + case -EBUSY:
> > + /*
> > + * else it is being freed elsewhere. The
> > + * problem is that we are not really sure where
> > + * it came from in the first place
> > + * XXX: Verify the putback logic is ok. This was
> > + * all written before LRU lists were split
> > + */
> > + list_move(&page->lru, &zone->lru[lru_src].list);
> > + mem_cgroup_rotate_lru_list(page, page_lru(page));
> > + continue;
>
> we don't need this rotation. probaby you copied it from isolate_lru_pages().
It was copied, yes.
> then, I'd like to explain why isolate_lru_pages() need such rotation.
> isolate_lru_pages() isolate page by lru order, then if it put back the page
> to lru front, next isolate_lru_pages() found the same page. it's obviously
> cpu wasting. then we put back the page to lru tail.
>
> but this function isolate pages by pfn order, we don't need such trick imho.
>
Good point. I've addressed this in the patch below. Does it address your
concerns?
==== CUT HERE ====
Fix concerns from Kosaki Motohiro (merge with compaction core)
o Fewer pages are isolated. Hence, cc->migrate_pfn in
isolate_migratepages() is updated slightly differently and the debug
checks change
o LRU lists are no longer rotated
o NR_ISOLATED_* is updated
o del_page_from_lru_list() is used instead of list_move() when a page is
isolated so that the counters get updated correctly.
o Pages that fail to migrate are put back on the LRU promptly to avoid
being isolated for too long.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/swap.h | 1 +
mm/compaction.c | 81 +++++++++++++++++++++++++++----------------------
2 files changed, 46 insertions(+), 36 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7e7181b..12566ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -151,6 +151,7 @@ enum {
};
#define SWAP_CLUSTER_MAX 32
+#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
#define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */
#define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */
diff --git a/mm/compaction.c b/mm/compaction.c
index 11934b3..9d6fd9f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -27,6 +27,10 @@ struct compact_control {
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
+ /* Account for isolated anon and file pages */
+ unsigned long nr_anon;
+ unsigned long nr_file;
+
struct zone *zone;
};
@@ -155,6 +159,23 @@ static void isolate_freepages(struct zone *zone,
cc->nr_freepages = nr_freepages;
}
+/* Update the number of anon and file isolated pages in the zone) */
+void update_zone_isolated(struct zone *zone, struct compact_control *cc)
+{
+ struct page *page;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+
+ list_for_each_entry(page, &cc->migratepages, lru) {
+ int lru = page_lru_base_type(page);
+ count[lru]++;
+ }
+
+ cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+}
+
/*
* Isolate all pages that can be migrated from the block pointed to by
* the migrate scanner within compact_control.
@@ -164,9 +185,8 @@ static unsigned long isolate_migratepages(struct zone *zone,
{
unsigned long low_pfn, end_pfn;
struct list_head *migratelist;
- enum lru_list lru_src;
- low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
+ low_pfn = cc->migrate_pfn;
migratelist = &cc->migratepages;
/* Do not scan outside zone boundaries */
@@ -175,11 +195,12 @@ static unsigned long isolate_migratepages(struct zone *zone,
/* Setup to scan one block but not past where we are migrating to */
end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
- cc->migrate_pfn = end_pfn;
- VM_BUG_ON(end_pfn > cc->free_pfn);
- if (!pfn_valid(low_pfn))
+ /* Do not cross the free scanner or scan within a memory hole */
+ if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+ cc->migrate_pfn = end_pfn;
return 0;
+ }
migrate_prep();
@@ -201,31 +222,22 @@ static unsigned long isolate_migratepages(struct zone *zone,
continue;
/* Try isolate the page */
- lru_src = page_lru(page);
- switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
- case 0:
- list_move(&page->lru, migratelist);
+ if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
+ del_page_from_lru_list(zone, page, page_lru(page));
+ list_add(&page->lru, migratelist);
mem_cgroup_del_lru(page);
cc->nr_migratepages++;
+ }
+
+ /* Avoid isolating too much */
+ if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
break;
+ }
- case -EBUSY:
- /*
- * else it is being freed elsewhere. The
- * problem is that we are not really sure where
- * it came from in the first place
- * XXX: Verify the putback logic is ok. This was
- * all written before LRU lists were split
- */
- list_move(&page->lru, &zone->lru[lru_src].list);
- mem_cgroup_rotate_lru_list(page, page_lru(page));
- continue;
+ update_zone_isolated(zone, cc);
- default:
- BUG();
- }
- }
spin_unlock_irq(&zone->lru_lock);
+ cc->migrate_pfn = low_pfn;
return cc->nr_migratepages;
}
@@ -318,24 +330,21 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
if (nr_remaining)
count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+
+ /* Release LRU pages not migrated */
+ if (!list_empty(&cc->migratepages)) {
+ putback_lru_pages(&cc->migratepages);
+ cc->nr_migratepages = 0;
+ }
+
+ mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
+ mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
}
/* Release free pages and check accounting */
cc->nr_freepages -= release_freepages(zone, &cc->freepages);
VM_BUG_ON(cc->nr_freepages != 0);
- /*
- * Release LRU pages not migrated
- * XXX: Page migration at this point tries fairly hard to move
- * pages as it is but if migration fails, pages are left
- * on cc->migratepages for more passes. This might cause
- * multiple useless failures. Watch compact_pagemigrate_failed
- * in /proc/vmstat. If it grows a lot, then putback should
- * happen after each failed migration
- */
- if (!list_empty(&cc->migratepages))
- putback_lru_pages(&cc->migratepages);
-
return ret;
}
* Re: [PATCH 05/12] Memory compaction core
2010-02-17 13:29 ` Mel Gorman
@ 2010-02-17 15:45 ` Rik van Riel
0 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-02-17 15:45 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Andrea Arcangeli, Christoph Lameter, Adam Litke,
Avi Kivity, David Rientjes, linux-kernel, linux-mm
On 02/17/2010 08:29 AM, Mel Gorman wrote:
> Fix concerns from Kosaki Motohiro (merge with compaction core)
>
> o Fewer pages are isolated. Hence, cc->migrate_pfn in
> isolate_migratepages() is updated slightly differently and the debug
> checks change
> o LRU lists are no longer rotated
> o NR_ISOLATED_* is updated
> o del_page_from_lru_list() is used instead list_move when isolated so
> that the counters get updated correctly.
> o Pages that fail to migrate are put back on the LRU promptly to avoid
> being isolated for too long.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Consider this an ack to your patch 5/12 with these fixes.
--
All rights reversed.
* Re: [PATCH 05/12] Memory compaction core
2010-02-12 12:00 ` [PATCH 05/12] Memory compaction core Mel Gorman
2010-02-16 8:31 ` KOSAKI Motohiro
@ 2010-02-18 16:58 ` Minchan Kim
2010-02-18 17:34 ` Mel Gorman
1 sibling, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2010-02-18 16:58 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, 2010-02-12 at 12:00 +0000, Mel Gorman wrote:
> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
>
> A single compaction run involves a migration scanner and a free scanner.
> Both scanners operate on pageblock-sized areas in the zone. The migration
> scanner starts at the bottom of the zone and searches for all movable pages
> within each area, isolating them onto a private list called migratelist.
> The free scanner starts at the top of the zone and searches for suitable
> areas and consumes the free pages within making them available for the
> migration scanner. The pages isolated for migration are then migrated to
> the newly isolated free pages.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> include/linux/compaction.h | 8 +
> include/linux/mm.h | 1 +
> include/linux/swap.h | 5 +
> include/linux/vmstat.h | 1 +
> mm/Makefile | 1 +
> mm/compaction.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 37 +++++
> mm/vmscan.c | 5 -
> mm/vmstat.c | 5 +
> scripts/kconfig/conf.c | 1 -
> 10 files changed, 399 insertions(+), 6 deletions(-)
> create mode 100644 include/linux/compaction.h
> create mode 100644 mm/compaction.c
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> new file mode 100644
> index 0000000..6201371
> --- /dev/null
> +++ b/include/linux/compaction.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_COMPACTION_H
> +#define _LINUX_COMPACTION_H
> +
> +/* Return values for compact_zone() */
> +#define COMPACT_INCOMPLETE 0
> +#define COMPACT_COMPLETE 1
> +
> +#endif /* _LINUX_COMPACTION_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 60c467b..c2a2ede 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -332,6 +332,7 @@ void put_page(struct page *page);
> void put_pages_list(struct list_head *pages);
>
> void split_page(struct page *page, unsigned int order);
> +int split_free_page(struct page *page);
>
> /*
> * Compound pages have a destructor function. Provide a
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..7e7181b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
> __lru_cache_add(page, LRU_ACTIVE_FILE);
> }
>
> +/* LRU Isolation modes. */
> +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> +
> /* linux/mm/vmscan.c */
> extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ee03bba..d7f7236 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> KSWAPD_SKIP_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> + COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> #ifdef CONFIG_HUGETLB_PAGE
> HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> #endif
> diff --git a/mm/Makefile b/mm/Makefile
> index 7a68d2a..ccb1f72 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
> obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_COMPACTION) += compaction.o
> obj-$(CONFIG_SMP) += percpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> diff --git a/mm/compaction.c b/mm/compaction.c
> new file mode 100644
> index 0000000..51ec864
> --- /dev/null
> +++ b/mm/compaction.c
> @@ -0,0 +1,341 @@
> +/*
> + * linux/mm/compaction.c
> + *
> + * Memory compaction for the reduction of external fragmentation. Note that
> + * this heavily depends upon page migration to do all the real heavy
> + * lifting
> + *
> + * Copyright IBM Corp. 2009 Mel Gorman <mel@csn.ul.ie>
> + */
> +#include <linux/swap.h>
> +#include <linux/migrate.h>
> +#include <linux/compaction.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +/*
> + * compact_control is used to track pages being migrated and the free pages
> + * they are being migrated to during memory compaction. The free_pfn starts
> + * at the end of a zone and migrate_pfn begins at the start. Movable pages
> + * are moved to the end of a zone during a compaction run and the run
> + * completes when free_pfn <= migrate_pfn
> + */
> +struct compact_control {
> + struct list_head freepages; /* List of free pages to migrate to */
> + struct list_head migratepages; /* List of pages being migrated */
> + unsigned long nr_freepages; /* Number of isolated free pages */
> + unsigned long nr_migratepages; /* Number of pages to migrate */
> + unsigned long free_pfn; /* isolate_freepages search base */
> + unsigned long migrate_pfn; /* isolate_migratepages search base */
> + struct zone *zone;
> +};
> +
> +static int release_freepages(struct zone *zone, struct list_head *freelist)
The "zone" argument looks unused here.
> +{
> + struct page *page, *next;
> + int count = 0;
> +
> + list_for_each_entry_safe(page, next, freelist, lru) {
> + list_del(&page->lru);
> + __free_page(page);
> + count++;
> + }
> +
> + return count;
> +}
> +
> +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> +static int isolate_freepages_block(struct zone *zone,
Return type 'int'? I don't think this can ever return a negative value, so a
signed type seems unnecessary.
> + unsigned long blockpfn,
> + struct list_head *freelist)
> +{
> + unsigned long zone_end_pfn, end_pfn;
> + int total_isolated = 0;
> +
> + /* Get the last PFN we should scan for free pages at */
> + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> + end_pfn = blockpfn + pageblock_nr_pages;
> + if (end_pfn > zone_end_pfn)
> + end_pfn = zone_end_pfn;
> +
> + /* Isolate free pages. This assumes the block is valid */
> + for (; blockpfn < end_pfn; blockpfn++) {
> + struct page *page;
> + int isolated, i;
> +
> + if (!pfn_valid_within(blockpfn))
> + continue;
> +
> + page = pfn_to_page(blockpfn);
> + if (!PageBuddy(page))
> + continue;
> +
> + /* Found a free page, break it into order-0 pages */
> + isolated = split_free_page(page);
> + total_isolated += isolated;
> + for (i = 0; i < isolated; i++) {
> + list_add(&page->lru, freelist);
> + page++;
> + }
> + blockpfn += isolated - 1;
> + }
> +
> + return total_isolated;
> +}
> +
> +/* Returns 1 if the page is within a block suitable for migration to */
> +static int suitable_migration_target(struct page *page)
> +{
> + /* If the page is a large free page, then allow migration */
> + if (PageBuddy(page) && page_order(page) >= pageblock_order)
> + return 1;
> +
> + /* If the block is MIGRATE_MOVABLE, allow migration */
> + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> + return 1;
> +
> + /* Otherwise skip the block */
> + return 0;
> +}
> +
> +/*
> + * Based on information in the current compact_control, find blocks
> + * suitable for isolating free pages from
> + */
> +static void isolate_freepages(struct zone *zone,
> + struct compact_control *cc)
> +{
> + struct page *page;
> + unsigned long high_pfn, low_pfn, pfn;
> + unsigned long flags;
> + int nr_freepages = cc->nr_freepages;
> + struct list_head *freelist = &cc->freepages;
> +
> + pfn = cc->free_pfn;
> + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> + high_pfn = low_pfn;
> +
> + /*
> + * Isolate free pages until enough are available to migrate the
> + * pages on cc->migratepages. We stop searching if the migrate
> + * and free page scanners meet or enough free pages are isolated.
> + */
> + spin_lock_irqsave(&zone->lock, flags);
> + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> + pfn -= pageblock_nr_pages) {
> + int isolated;
> +
> + if (!pfn_valid(pfn))
> + continue;
> +
> + /* Check for overlapping nodes/zones */
> + page = pfn_to_page(pfn);
> + if (page_zone(page) != zone)
> + continue;
We are progressing backwards in physical page (pfn) order within a zone.
If we hit a crossover between zones, why do we keep going backwards
continuously? Before that happens, the migration and free scanners would
meet. Am I missing something?
> +
> + /* Check the block is suitable for migration */
> + if (!suitable_migration_target(page))
> + continue;
Dumb question.
suitable_migration_target() considers three types of pages:
1. a free page whose order is >= pageblock_order
2. free pages whose order is < pageblock_order, in a MIGRATE_MOVABLE block
3. a used page in a MIGRATE_MOVABLE block
I can understand 1 and 2 but not 3. This function is for gathering free
pages. How can a used page be treated as a free one?
In addition, as far as I can see, isolate_freepages_block() doesn't consider
case 3 because of its PageBuddy check.
I am confused. Please correct me.
> +
> + /* Found a block suitable for isolating free pages from */
> + isolated = isolate_freepages_block(zone, pfn, freelist);
> + nr_freepages += isolated;
> +
> + /*
> + * Record the highest PFN we isolated pages from. When next
> + * looking for free pages, the search will restart here as
> + * page migration may have returned some pages to the allocator
> + */
> + if (isolated)
> + high_pfn = max(high_pfn, pfn);
> + }
> + spin_unlock_irqrestore(&zone->lock, flags);
> +
> + cc->free_pfn = high_pfn;
> + cc->nr_freepages = nr_freepages;
> +}
> +
> +/*
> + * Isolate all pages that can be migrated from the block pointed to by
> + * the migrate scanner within compact_control.
> + */
> +static unsigned long isolate_migratepages(struct zone *zone,
> + struct compact_control *cc)
> +{
> + unsigned long low_pfn, end_pfn;
> + struct list_head *migratelist;
> + enum lru_list lru_src;
> +
> + low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
> + migratelist = &cc->migratepages;
> +
> + /* Do not scan outside zone boundaries */
> + if (low_pfn < zone->zone_start_pfn)
> + low_pfn = zone->zone_start_pfn;
> +
> + /* Setup to scan one block but not past where we are migrating to */
> + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> + cc->migrate_pfn = end_pfn;
> + VM_BUG_ON(end_pfn > cc->free_pfn);
> +
> + if (!pfn_valid(low_pfn))
> + return 0;
> +
> + migrate_prep();
> +
> + /* Time to isolate some pages for migration */
> + spin_lock_irq(&zone->lru_lock);
> + for (; low_pfn < end_pfn; low_pfn++) {
> + struct page *page;
> + if (!pfn_valid_within(low_pfn))
> + continue;
> +
> + /* Get the page and skip if free */
> + page = pfn_to_page(low_pfn);
> + if (PageBuddy(page)) {
> + low_pfn += (1 << page_order(page)) - 1;
> + continue;
> + }
> +
> + if (!PageLRU(page) || PageUnevictable(page))
> + continue;
> +
> + /* Try isolate the page */
> + lru_src = page_lru(page);
> + switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
> + case 0:
> + list_move(&page->lru, migratelist);
> + mem_cgroup_del_lru(page);
> + cc->nr_migratepages++;
> + break;
> +
> + case -EBUSY:
> + /*
> + * else it is being freed elsewhere. The
> + * problem is that we are not really sure where
> + * it came from in the first place
> + * XXX: Verify the putback logic is ok. This was
> + * all written before LRU lists were split
> + */
> + list_move(&page->lru, &zone->lru[lru_src].list);
> + mem_cgroup_rotate_lru_list(page, page_lru(page));
> + continue;
> +
> + default:
> + BUG();
> + }
> + }
> + spin_unlock_irq(&zone->lru_lock);
> +
> + return cc->nr_migratepages;
> +}
> +
> +/*
> + * This is a migrate-callback that "allocates" freepages by taking pages
> + * from the isolated freelists in the block we are migrating to.
> + */
> +static struct page *compaction_alloc(struct page *migratepage,
> + unsigned long data,
> + int **result)
> +{
> + struct compact_control *cc = (struct compact_control *)data;
> + struct page *freepage;
> +
> + VM_BUG_ON(cc == NULL);
> +
> + /* Isolate free pages if necessary */
> + if (list_empty(&cc->freepages)) {
> + isolate_freepages(cc->zone, cc);
> +
> + if (list_empty(&cc->freepages))
> + return NULL;
> + }
> +
> + freepage = list_entry(cc->freepages.next, struct page, lru);
> + list_del(&freepage->lru);
> + cc->nr_freepages--;
> +
> + return freepage;
> +}
> +
> +/*
> + * We cannot control nr_migratepages and nr_freepages fully when migration is
> + * running as migrate_pages() has no knowledge of compact_control. When
> + * migration is complete, we count the number of pages on the lists by hand.
> + */
> +static void update_nr_listpages(struct compact_control *cc)
> +{
> + int nr_migratepages = 0;
> + int nr_freepages = 0;
> + struct page *page;
> + list_for_each_entry(page, &cc->migratepages, lru)
> + nr_migratepages++;
> + list_for_each_entry(page, &cc->freepages, lru)
> + nr_freepages++;
> +
> + cc->nr_migratepages = nr_migratepages;
> + cc->nr_freepages = nr_freepages;
> +}
> +
> +static inline int compact_finished(struct zone *zone,
> + struct compact_control *cc)
> +{
> + /* Compaction run completes if the migrate and free scanner meet */
> + if (cc->free_pfn <= cc->migrate_pfn)
> + return COMPACT_COMPLETE;
> +
> + return COMPACT_INCOMPLETE;
> +}
> +
> +static int compact_zone(struct zone *zone, struct compact_control *cc)
> +{
> + int ret = COMPACT_INCOMPLETE;
> +
> + /*
> + * Setup to move all movable pages to the end of the zone
> + * XXX: This could be improved upon. In the event compaction
> + * is being successful quickly but called often, there
> + * is a likelihood of scanning the same blocks as sources
> + * and targets frequently. Might be worth caching the
> + * last migrate_pfn to reduce scan times.
> + */
> + cc->migrate_pfn = zone->zone_start_pfn;
> + cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> + cc->free_pfn &= ~(pageblock_nr_pages-1);
> +
> + for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> + unsigned long nr_migrate, nr_remaining;
> + if (!isolate_migratepages(zone, cc))
> + continue;
> +
> + nr_migrate = cc->nr_migratepages;
> + migrate_pages(&cc->migratepages, compaction_alloc,
> + (unsigned long)cc, 0);
> + update_nr_listpages(cc);
> + nr_remaining = cc->nr_migratepages;
> +
> + count_vm_event(COMPACTBLOCKS);
> + count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> + if (nr_remaining)
> + count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> + }
> +
> + /* Release free pages and check accounting */
> + cc->nr_freepages -= release_freepages(zone, &cc->freepages);
> + VM_BUG_ON(cc->nr_freepages != 0);
> +
> + /*
> + * Release LRU pages not migrated
> + * XXX: Page migration at this point tries fairly hard to move
> + * pages as it is but if migration fails, pages are left
> + * on cc->migratepages for more passes. This might cause
> + * multiple useless failures. Watch compact_pagemigrate_failed
> + * in /proc/vmstat. If it grows a lot, then putback should
> + * happen after each failed migration
> + */
> + if (!list_empty(&cc->migratepages))
> + putback_lru_pages(&cc->migratepages);
> +
> + return ret;
> +}
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8deb9d0..6d57154 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1168,6 +1168,43 @@ void split_page(struct page *page, unsigned int order)
> set_page_refcounted(page + i);
> }
>
> +/* Similar to split_page except the page is already free */
Sometimes this function changes the pageblock's migratetype to MIGRATE_MOVABLE.
I'd like a comment added about that.
> +int split_free_page(struct page *page)
> +{
> + unsigned int order;
> + unsigned long watermark;
> + struct zone *zone;
> +
> + BUG_ON(!PageBuddy(page));
> +
> + zone = page_zone(page);
> + order = page_order(page);
> +
> + /* Obey watermarks or the system could deadlock */
> + watermark = low_wmark_pages(zone) + (1 << order);
> + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> + return 0;
> +
> + /* Remove page from free list */
> + list_del(&page->lru);
> + zone->free_area[order].nr_free--;
> + rmv_page_order(page);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> +
> + /* Split into individual pages */
> + set_page_refcounted(page);
> + split_page(page, order);
> +
> + /* Set the migratetype on the assumption it's for migration */
> + if (order >= pageblock_order - 1) {
> + struct page *endpage = page + (1 << order) - 1;
> + for (; page < endpage; page += pageblock_nr_pages)
> + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> + }
> +
> + return 1 << order;
> +}
> +
> /*
> * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
> * we cheat by calling it from here, in the order > 0 path. Saves a branch
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c26986c..47de19b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -803,11 +803,6 @@ keep:
> return nr_reclaimed;
> }
>
> -/* LRU Isolation modes. */
> -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> -
> /*
> * Attempt to remove the specified page from its LRU. Only take this page
> * if it is of the appropriate PageActive status. Pages which are being
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index e2d0cc1..f0930ae 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -852,6 +852,11 @@ static const char * const vmstat_text[] = {
> "allocstall",
>
> "pgrotated",
> +
> + "compact_blocks_moved",
> + "compact_pages_moved",
> + "compact_pagemigrate_failed",
> +
> #ifdef CONFIG_HUGETLB_PAGE
> "htlb_buddy_alloc_success",
> "htlb_buddy_alloc_fail",
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 05/12] Memory compaction core
2010-02-18 16:58 ` Minchan Kim
@ 2010-02-18 17:34 ` Mel Gorman
2010-02-19 1:21 ` Minchan Kim
0 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-18 17:34 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, Feb 19, 2010 at 01:58:44AM +0900, Minchan Kim wrote:
> On Fri, 2010-02-12 at 12:00 +0000, Mel Gorman wrote:
> > This patch is the core of a mechanism which compacts memory in a zone by
> > relocating movable pages towards the end of the zone.
> >
> > A single compaction run involves a migration scanner and a free scanner.
> > Both scanners operate on pageblock-sized areas in the zone. The migration
> > scanner starts at the bottom of the zone and searches for all movable pages
> > within each area, isolating them onto a private list called migratelist.
> > The free scanner starts at the top of the zone and searches for suitable
> > areas and consumes the free pages within making them available for the
> > migration scanner. The pages isolated for migration are then migrated to
> > the newly isolated free pages.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > include/linux/compaction.h | 8 +
> > include/linux/mm.h | 1 +
> > include/linux/swap.h | 5 +
> > include/linux/vmstat.h | 1 +
> > mm/Makefile | 1 +
> > mm/compaction.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 37 +++++
> > mm/vmscan.c | 5 -
> > mm/vmstat.c | 5 +
> > scripts/kconfig/conf.c | 1 -
> > 10 files changed, 399 insertions(+), 6 deletions(-)
> > create mode 100644 include/linux/compaction.h
> > create mode 100644 mm/compaction.c
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > new file mode 100644
> > index 0000000..6201371
> > --- /dev/null
> > +++ b/include/linux/compaction.h
> > @@ -0,0 +1,8 @@
> > +#ifndef _LINUX_COMPACTION_H
> > +#define _LINUX_COMPACTION_H
> > +
> > +/* Return values for compact_zone() */
> > +#define COMPACT_INCOMPLETE 0
> > +#define COMPACT_COMPLETE 1
> > +
> > +#endif /* _LINUX_COMPACTION_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 60c467b..c2a2ede 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -332,6 +332,7 @@ void put_page(struct page *page);
> > void put_pages_list(struct list_head *pages);
> >
> > void split_page(struct page *page, unsigned int order);
> > +int split_free_page(struct page *page);
> >
> > /*
> > * Compound pages have a destructor function. Provide a
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index a2602a8..7e7181b 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
> > __lru_cache_add(page, LRU_ACTIVE_FILE);
> > }
> >
> > +/* LRU Isolation modes. */
> > +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> > +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> > +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> > +
> > /* linux/mm/vmscan.c */
> > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > gfp_t gfp_mask, nodemask_t *mask);
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index ee03bba..d7f7236 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> > KSWAPD_SKIP_CONGESTION_WAIT,
> > PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > + COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > #ifdef CONFIG_HUGETLB_PAGE
> > HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > #endif
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 7a68d2a..ccb1f72 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
> > obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> > obj-$(CONFIG_FS_XIP) += filemap_xip.o
> > obj-$(CONFIG_MIGRATION) += migrate.o
> > +obj-$(CONFIG_COMPACTION) += compaction.o
> > obj-$(CONFIG_SMP) += percpu.o
> > obj-$(CONFIG_QUICKLIST) += quicklist.o
> > obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > new file mode 100644
> > index 0000000..51ec864
> > --- /dev/null
> > +++ b/mm/compaction.c
> > @@ -0,0 +1,341 @@
> > +/*
> > + * linux/mm/compaction.c
> > + *
> > + * Memory compaction for the reduction of external fragmentation. Note that
> > + * this heavily depends upon page migration to do all the real heavy
> > + * lifting
> > + *
> > + * Copyright IBM Corp. 2009 Mel Gorman <mel@csn.ul.ie>
> > + */
> > +#include <linux/swap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/compaction.h>
> > +#include <linux/mm_inline.h>
> > +#include "internal.h"
> > +
> > +/*
> > + * compact_control is used to track pages being migrated and the free pages
> > + * they are being migrated to during memory compaction. The free_pfn starts
> > + * at the end of a zone and migrate_pfn begins at the start. Movable pages
> > + * are moved to the end of a zone during a compaction run and the run
> > + * completes when free_pfn <= migrate_pfn
> > + */
> > +struct compact_control {
> > + struct list_head freepages; /* List of free pages to migrate to */
> > + struct list_head migratepages; /* List of pages being migrated */
> > + unsigned long nr_freepages; /* Number of isolated free pages */
> > + unsigned long nr_migratepages; /* Number of pages to migrate */
> > + unsigned long free_pfn; /* isolate_freepages search base */
> > + unsigned long migrate_pfn; /* isolate_migratepages search base */
> > + struct zone *zone;
> > +};
> > +
> > +static int release_freepages(struct zone *zone, struct list_head *freelist)
>
> "zone" argument.
>
I removed it. Thanks
> > +{
> > + struct page *page, *next;
> > + int count = 0;
> > +
> > + list_for_each_entry_safe(page, next, freelist, lru) {
> > + list_del(&page->lru);
> > + __free_page(page);
> > + count++;
> > + }
> > +
> > + return count;
> > +}
> > +
> > +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> > +static int isolate_freepages_block(struct zone *zone,
>
> return type 'int'?
> I think we can't return a signed value.
>
I don't understand your query. What's wrong with returning int?
> > + unsigned long blockpfn,
> > + struct list_head *freelist)
> > +{
> > + unsigned long zone_end_pfn, end_pfn;
> > + int total_isolated = 0;
> > +
> > + /* Get the last PFN we should scan for free pages at */
> > + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > + end_pfn = blockpfn + pageblock_nr_pages;
> > + if (end_pfn > zone_end_pfn)
> > + end_pfn = zone_end_pfn;
> > +
> > + /* Isolate free pages. This assumes the block is valid */
> > + for (; blockpfn < end_pfn; blockpfn++) {
> > + struct page *page;
> > + int isolated, i;
> > +
> > + if (!pfn_valid_within(blockpfn))
> > + continue;
> > +
> > + page = pfn_to_page(blockpfn);
> > + if (!PageBuddy(page))
> > + continue;
> > +
> > + /* Found a free page, break it into order-0 pages */
> > + isolated = split_free_page(page);
> > + total_isolated += isolated;
> > + for (i = 0; i < isolated; i++) {
> > + list_add(&page->lru, freelist);
> > + page++;
> > + }
> > + blockpfn += isolated - 1;
Incidentally, this line is wrong but will be fixed in line 3. If
split_free_page() fails, it causes an infinite loop.
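For illustration, a minimal guard along these lines would avoid the hang (a
sketch only, not necessarily the exact fix that ends up in the next version):

		/* Only rewind when pages were actually isolated. If
		 * split_free_page() failed (e.g. the watermark check),
		 * isolated is 0 and rewinding would stop blockpfn from
		 * ever advancing past this page. */
		if (isolated)
			blockpfn += isolated - 1;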
> > + }
> > +
> > + return total_isolated;
> > +}
> > +
> > +/* Returns 1 if the page is within a block suitable for migration to */
> > +static int suitable_migration_target(struct page *page)
> > +{
> > + /* If the page is a large free page, then allow migration */
> > + if (PageBuddy(page) && page_order(page) >= pageblock_order)
> > + return 1;
> > +
> > + /* If the block is MIGRATE_MOVABLE, allow migration */
> > + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> > + return 1;
> > +
> > + /* Otherwise skip the block */
> > + return 0;
> > +}
> > +
> > +/*
> > + * Based on information in the current compact_control, find blocks
> > + * suitable for isolating free pages from
> > + */
> > +static void isolate_freepages(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + struct page *page;
> > + unsigned long high_pfn, low_pfn, pfn;
> > + unsigned long flags;
> > + int nr_freepages = cc->nr_freepages;
> > + struct list_head *freelist = &cc->freepages;
> > +
> > + pfn = cc->free_pfn;
> > + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> > + high_pfn = low_pfn;
> > +
> > + /*
> > + * Isolate free pages until enough are available to migrate the
> > + * pages on cc->migratepages. We stop searching if the migrate
> > + * and free page scanners meet or enough free pages are isolated.
> > + */
> > + spin_lock_irqsave(&zone->lock, flags);
> > + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> > + pfn -= pageblock_nr_pages) {
> > + int isolated;
> > +
> > + if (!pfn_valid(pfn))
> > + continue;
> > +
> > + /* Check for overlapping nodes/zones */
> > + page = pfn_to_page(pfn);
> > + if (page_zone(page) != zone)
> > + continue;
>
> We are progressing backward through the zone in physical page order.
> If we cross over into another zone, why do we keep going backward?
> Before that happens, wouldn't the migration and free scanners meet?
> Am I missing something?
>
I was considering a situation like the following
Node-0        Node-1        Node-0
 DMA           DMA           DMA
0-1023      1024-2047     2048-4096
In that case, a PFN scanner can enter a new node and zone but the migrate
and free scanners have not necessarily met. This configuration is *extremely*
rare but it happens on messed-up LPAR configurations on POWER.
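As a purely illustrative sketch (the helper name is made up; the checks are
the ones the patch already makes): a foreign pageblock has to be skipped
rather than treated as the end of the zone, because with an interleaved
layout a later pageblock can still belong to the zone being compacted.

/* Sketch: a pfn inside the zone's span is not necessarily owned by that
 * zone when nodes interleave, so each pageblock is checked individually
 * and skipped on mismatch instead of terminating the scan. */
static bool pageblock_in_zone(struct zone *zone, unsigned long pfn)
{
	return pfn_valid(pfn) && page_zone(pfn_to_page(pfn)) == zone;
}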
> > +
> > + /* Check the block is suitable for migration */
> > + if (!suitable_migration_target(page))
> > + continue;
>
> Dumb question.
> suitable_migration_target considers three types of page:
>
> 1. a free page whose order >= pageblock_order
> 2. free pages whose order < pageblock_order in a movable block
> 3. used page with movable
>
> I can understand 1 and 2 but not 3. This function is for gathering
> free pages. How can a used page be handled as a free one?
>
> In addition, looking at isolate_freepages_block, it doesn't
> handle case 3 because of the PageBuddy check.
>
> I am confused. Please correct me.
>
I'm afraid I don't understand your question. At the point
suitable_migration_target() is called, the only concern is finding a pageblock
of pages that should be scanned for free pages by isolate_freepages_block().
What do you mean by "used page with movable" ?
> > +
> > + /* Found a block suitable for isolating free pages from */
> > + isolated = isolate_freepages_block(zone, pfn, freelist);
> > + nr_freepages += isolated;
> > +
> > + /*
> > + * Record the highest PFN we isolated pages from. When next
> > + * looking for free pages, the search will restart here as
> > + * page migration may have returned some pages to the allocator
> > + */
> > + if (isolated)
> > + high_pfn = max(high_pfn, pfn);
> > + }
> > + spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > + cc->free_pfn = high_pfn;
> > + cc->nr_freepages = nr_freepages;
> > +}
> > +
> > +/*
> > + * Isolate all pages that can be migrated from the block pointed to by
> > + * the migrate scanner within compact_control.
> > + */
> > +static unsigned long isolate_migratepages(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + unsigned long low_pfn, end_pfn;
> > + struct list_head *migratelist;
> > + enum lru_list lru_src;
> > +
> > + low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
> > + migratelist = &cc->migratepages;
> > +
> > + /* Do not scan outside zone boundaries */
> > + if (low_pfn < zone->zone_start_pfn)
> > + low_pfn = zone->zone_start_pfn;
> > +
> > + /* Setup to scan one block but not past where we are migrating to */
> > + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> > + cc->migrate_pfn = end_pfn;
> > + VM_BUG_ON(end_pfn > cc->free_pfn);
> > +
> > + if (!pfn_valid(low_pfn))
> > + return 0;
> > +
> > + migrate_prep();
> > +
> > + /* Time to isolate some pages for migration */
> > + spin_lock_irq(&zone->lru_lock);
> > + for (; low_pfn < end_pfn; low_pfn++) {
> > + struct page *page;
> > + if (!pfn_valid_within(low_pfn))
> > + continue;
> > +
> > + /* Get the page and skip if free */
> > + page = pfn_to_page(low_pfn);
> > + if (PageBuddy(page)) {
> > + low_pfn += (1 << page_order(page)) - 1;
> > + continue;
> > + }
> > +
> > + if (!PageLRU(page) || PageUnevictable(page))
> > + continue;
> > +
> > + /* Try isolate the page */
> > + lru_src = page_lru(page);
> > + switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
> > + case 0:
> > + list_move(&page->lru, migratelist);
> > + mem_cgroup_del_lru(page);
> > + cc->nr_migratepages++;
> > + break;
> > +
> > + case -EBUSY:
> > + /*
> > + * else it is being freed elsewhere. The
> > + * problem is that we are not really sure where
> > + * it came from in the first place
> > + * XXX: Verify the putback logic is ok. This was
> > + * all written before LRU lists were split
> > + */
> > + list_move(&page->lru, &zone->lru[lru_src].list);
> > + mem_cgroup_rotate_lru_list(page, page_lru(page));
> > + continue;
> > +
> > + default:
> > + BUG();
> > + }
> > + }
> > + spin_unlock_irq(&zone->lru_lock);
> > +
> > + return cc->nr_migratepages;
> > +}
> > +
> > +/*
> > + * This is a migrate-callback that "allocates" freepages by taking pages
> > + * from the isolated freelists in the block we are migrating to.
> > + */
> > +static struct page *compaction_alloc(struct page *migratepage,
> > + unsigned long data,
> > + int **result)
> > +{
> > + struct compact_control *cc = (struct compact_control *)data;
> > + struct page *freepage;
> > +
> > + VM_BUG_ON(cc == NULL);
> > +
> > + /* Isolate free pages if necessary */
> > + if (list_empty(&cc->freepages)) {
> > + isolate_freepages(cc->zone, cc);
> > +
> > + if (list_empty(&cc->freepages))
> > + return NULL;
> > + }
> > +
> > + freepage = list_entry(cc->freepages.next, struct page, lru);
> > + list_del(&freepage->lru);
> > + cc->nr_freepages--;
> > +
> > + return freepage;
> > +}
> > +
> > +/*
> > + * We cannot control nr_migratepages and nr_freepages fully when migration is
> > + * running as migrate_pages() has no knowledge of compact_control. When
> > + * migration is complete, we count the number of pages on the lists by hand.
> > + */
> > +static void update_nr_listpages(struct compact_control *cc)
> > +{
> > + int nr_migratepages = 0;
> > + int nr_freepages = 0;
> > + struct page *page;
> > + list_for_each_entry(page, &cc->migratepages, lru)
> > + nr_migratepages++;
> > + list_for_each_entry(page, &cc->freepages, lru)
> > + nr_freepages++;
> > +
> > + cc->nr_migratepages = nr_migratepages;
> > + cc->nr_freepages = nr_freepages;
> > +}
> > +
> > +static inline int compact_finished(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + /* Compaction run completes if the migrate and free scanner meet */
> > + if (cc->free_pfn <= cc->migrate_pfn)
> > + return COMPACT_COMPLETE;
> > +
> > + return COMPACT_INCOMPLETE;
> > +}
> > +
> > +static int compact_zone(struct zone *zone, struct compact_control *cc)
> > +{
> > + int ret = COMPACT_INCOMPLETE;
> > +
> > + /*
> > + * Setup to move all movable pages to the end of the zone
> > + * XXX: This could be improved upon. In the event compaction
> > + * is being successful quickly but called often, there
> > + * is a likelihood of scanning the same blocks as sources
> > + * and targets frequently. Might be worth caching the
> > + * last migrate_pfn to reduce scan times.
> > + */
> > + cc->migrate_pfn = zone->zone_start_pfn;
> > + cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> > + cc->free_pfn &= ~(pageblock_nr_pages-1);
> > +
> > + for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> > + unsigned long nr_migrate, nr_remaining;
> > + if (!isolate_migratepages(zone, cc))
> > + continue;
> > +
> > + nr_migrate = cc->nr_migratepages;
> > + migrate_pages(&cc->migratepages, compaction_alloc,
> > + (unsigned long)cc, 0);
> > + update_nr_listpages(cc);
> > + nr_remaining = cc->nr_migratepages;
> > +
> > + count_vm_event(COMPACTBLOCKS);
> > + count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> > + if (nr_remaining)
> > + count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> > + }
> > +
> > + /* Release free pages and check accounting */
> > + cc->nr_freepages -= release_freepages(zone, &cc->freepages);
> > + VM_BUG_ON(cc->nr_freepages != 0);
> > +
> > + /*
> > + * Release LRU pages not migrated
> > + * XXX: Page migration at this point tries fairly hard to move
> > + * pages as it is but if migration fails, pages are left
> > + * on cc->migratepages for more passes. This might cause
> > + * multiple useless failures. Watch compact_pagemigrate_failed
> > + * in /proc/vmstat. If it grows a lot, then putback should
> > + * happen after each failed migration
> > + */
> > + if (!list_empty(&cc->migratepages))
> > + putback_lru_pages(&cc->migratepages);
> > +
> > + return ret;
> > +}
> > +
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8deb9d0..6d57154 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1168,6 +1168,43 @@ void split_page(struct page *page, unsigned int order)
> > set_page_refcounted(page + i);
> > }
> >
> > +/* Similar to split_page except the page is already free */
>
> Sometimes this function changes the pageblock's migratetype to MIGRATE_MOVABLE.
> I'd like a comment added about that.
>
There is a comment within the function about it. Do you want it moved to
here?
> > +int split_free_page(struct page *page)
> > +{
> > + unsigned int order;
> > + unsigned long watermark;
> > + struct zone *zone;
> > +
> > + BUG_ON(!PageBuddy(page));
> > +
> > + zone = page_zone(page);
> > + order = page_order(page);
> > +
> > + /* Obey watermarks or the system could deadlock */
> > + watermark = low_wmark_pages(zone) + (1 << order);
> > + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > + return 0;
> > +
> > + /* Remove page from free list */
> > + list_del(&page->lru);
> > + zone->free_area[order].nr_free--;
> > + rmv_page_order(page);
> > + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > +
> > + /* Split into individual pages */
> > + set_page_refcounted(page);
> > + split_page(page, order);
> > +
> > + /* Set the migratetype on the assumption it's for migration */
> > + if (order >= pageblock_order - 1) {
> > + struct page *endpage = page + (1 << order) - 1;
> > + for (; page < endpage; page += pageblock_nr_pages)
> > + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > + }
> > +
> > + return 1 << order;
> > +}
> > +
> > /*
> > * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
> > * we cheat by calling it from here, in the order > 0 path. Saves a branch
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c26986c..47de19b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -803,11 +803,6 @@ keep:
> > return nr_reclaimed;
> > }
> >
> > -/* LRU Isolation modes. */
> > -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> > -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> > -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> > -
> > /*
> > * Attempt to remove the specified page from its LRU. Only take this page
> > * if it is of the appropriate PageActive status. Pages which are being
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index e2d0cc1..f0930ae 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -852,6 +852,11 @@ static const char * const vmstat_text[] = {
> > "allocstall",
> >
> > "pgrotated",
> > +
> > + "compact_blocks_moved",
> > + "compact_pages_moved",
> > + "compact_pagemigrate_failed",
> > +
> > #ifdef CONFIG_HUGETLB_PAGE
> > "htlb_buddy_alloc_success",
> > "htlb_buddy_alloc_fail",
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 05/12] Memory compaction core
2010-02-18 17:34 ` Mel Gorman
@ 2010-02-19 1:21 ` Minchan Kim
2010-02-19 14:33 ` Mel Gorman
0 siblings, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2010-02-19 1:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, Feb 19, 2010 at 2:34 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Fri, Feb 19, 2010 at 01:58:44AM +0900, Minchan Kim wrote:
>> On Fri, 2010-02-12 at 12:00 +0000, Mel Gorman wrote:
>> > +/* Isolate free pages onto a private freelist. Must hold zone->lock */
>> > +static int isolate_freepages_block(struct zone *zone,
>>
>> return type 'int'?
>> I think we can't return a signed value.
>>
>
> I don't understand your query. What's wrong with returning int?
It's just a nitpick. I mean this function doesn't return a negative value.
Never mind.
>
>> > + unsigned long blockpfn,
>> > + struct list_head *freelist)
>> > +{
>> > + unsigned long zone_end_pfn, end_pfn;
>> > + int total_isolated = 0;
>> > +
>> > + /* Get the last PFN we should scan for free pages at */
>> > + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
>> > + end_pfn = blockpfn + pageblock_nr_pages;
>> > + if (end_pfn > zone_end_pfn)
>> > + end_pfn = zone_end_pfn;
>> > +
>> > + /* Isolate free pages. This assumes the block is valid */
>> > + for (; blockpfn < end_pfn; blockpfn++) {
>> > + struct page *page;
>> > + int isolated, i;
>> > +
>> > + if (!pfn_valid_within(blockpfn))
>> > + continue;
>> > +
>> > + page = pfn_to_page(blockpfn);
>> > + if (!PageBuddy(page))
>> > + continue;
>> > +
>> > + /* Found a free page, break it into order-0 pages */
>> > + isolated = split_free_page(page);
>> > + total_isolated += isolated;
>> > + for (i = 0; i < isolated; i++) {
>> > + list_add(&page->lru, freelist);
>> > + page++;
>> > + }
>> > + blockpfn += isolated - 1;
>
> Incidentally, this line is wrong but will be fixed in line 3. If
> split_free_page() fails, it causes an infinite loop.
>
>> > + }
>> > +
>> > + return total_isolated;
>> > +}
>> > +
>> > +/* Returns 1 if the page is within a block suitable for migration to */
>> > +static int suitable_migration_target(struct page *page)
>> > +{
>> > + /* If the page is a large free page, then allow migration */
>> > + if (PageBuddy(page) && page_order(page) >= pageblock_order)
>> > + return 1;
>> > +
>> > + /* If the block is MIGRATE_MOVABLE, allow migration */
>> > + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
>> > + return 1;
>> > +
>> > + /* Otherwise skip the block */
>> > + return 0;
>> > +}
>> > +
>> > +/*
>> > + * Based on information in the current compact_control, find blocks
>> > + * suitable for isolating free pages from
>> > + */
>> > +static void isolate_freepages(struct zone *zone,
>> > + struct compact_control *cc)
>> > +{
>> > + struct page *page;
>> > + unsigned long high_pfn, low_pfn, pfn;
>> > + unsigned long flags;
>> > + int nr_freepages = cc->nr_freepages;
>> > + struct list_head *freelist = &cc->freepages;
>> > +
>> > + pfn = cc->free_pfn;
>> > + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
>> > + high_pfn = low_pfn;
>> > +
>> > + /*
>> > + * Isolate free pages until enough are available to migrate the
>> > + * pages on cc->migratepages. We stop searching if the migrate
>> > + * and free page scanners meet or enough free pages are isolated.
>> > + */
>> > + spin_lock_irqsave(&zone->lock, flags);
>> > + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
>> > + pfn -= pageblock_nr_pages) {
>> > + int isolated;
>> > +
>> > + if (!pfn_valid(pfn))
>> > + continue;
>> > +
>> > + /* Check for overlapping nodes/zones */
>> > + page = pfn_to_page(pfn);
>> > + if (page_zone(page) != zone)
>> > + continue;
>>
>> We are progressing backward through the zone in physical page order.
>> If we cross over into another zone, why do we keep going backward?
>> Before that happens, wouldn't the migration and free scanners meet?
>> Am I missing something?
>>
>
> I was considering a situation like the following
>
>
> > Node-0        Node-1        Node-0
> >  DMA           DMA           DMA
> > 0-1023      1024-2047     2048-4096
>
> In that case, a PFN scanner can enter a new node and zone but the migrate
> and free scanners have not necessarily met. This configuration is *extremely*
> rare but it happens on messed-up LPAR configurations on POWER.
I didn't know about such an architecture until now.
Thanks for telling me.
How about adding a comment about that?
>
>> > +
>> > + /* Check the block is suitable for migration */
>> > + if (!suitable_migration_target(page))
>> > + continue;
>>
>> Dumb question.
>> suitable_migration_target considers three types of page:
>>
>> 1. a free page whose order >= pageblock_order
>> 2. free pages whose order < pageblock_order in a movable block
>> 3. used page with movable
>>
>> I can understand 1 and 2 but not 3. This function is for gathering
>> free pages. How can a used page be handled as a free one?
>>
>> In addition, looking at isolate_freepages_block, it doesn't
>> handle case 3 because of the PageBuddy check.
>>
>> I am confused. Please correct me.
>>
>
> I'm afraid I don't understand your question. At the point
> suitable_migration_target() is called, the only concern is finding a pageblock
> of pages that should be scanned for free pages by isolate_freepages_block().
> What do you mean by "used page with movable" ?
After looking into the code, I understand it now.
Thanks.
<snip>
>>> +/* Similar to split_page except the page is already free */
>> Sometimes this function changes the pageblock's migratetype to MIGRATE_MOVABLE.
>> I'd like a comment added about that.
>>
>
> There is a comment within the function about it. Do you want it moved to
> here?
If you don't mind, I'd like that. :)
That's because the function description only says "except the page is already
free", so at first glance I thought that was the only difference from
split_page. I think the fact that it sets MIGRATE_MOVABLE is important.
Please treat it as just a nitpick.
Thanks, Mel.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 05/12] Memory compaction core
2010-02-19 1:21 ` Minchan Kim
@ 2010-02-19 14:33 ` Mel Gorman
0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-19 14:33 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, Feb 19, 2010 at 10:21:10AM +0900, Minchan Kim wrote:
> On Fri, Feb 19, 2010 at 2:34 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Fri, Feb 19, 2010 at 01:58:44AM +0900, Minchan Kim wrote:
> >> On Fri, 2010-02-12 at 12:00 +0000, Mel Gorman wrote:
> >> > +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> >> > +static int isolate_freepages_block(struct zone *zone,
> >>
> >> return type 'int'?
> >> I think we can't return a signed value.
> >>
> >
> > I don't understand your query. What's wrong with returning int?
>
> It's just a nitpick. I mean this function doesn't return a negative value.
> Never mind.
>
> >
> >> > + unsigned long blockpfn,
> >> > + struct list_head *freelist)
> >> > +{
> >> > + unsigned long zone_end_pfn, end_pfn;
> >> > + int total_isolated = 0;
> >> > +
> >> > + /* Get the last PFN we should scan for free pages at */
> >> > + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> >> > + end_pfn = blockpfn + pageblock_nr_pages;
> >> > + if (end_pfn > zone_end_pfn)
> >> > + end_pfn = zone_end_pfn;
> >> > +
> >> > + /* Isolate free pages. This assumes the block is valid */
> >> > + for (; blockpfn < end_pfn; blockpfn++) {
> >> > + struct page *page;
> >> > + int isolated, i;
> >> > +
> >> > + if (!pfn_valid_within(blockpfn))
> >> > + continue;
> >> > +
> >> > + page = pfn_to_page(blockpfn);
> >> > + if (!PageBuddy(page))
> >> > + continue;
> >> > +
> >> > + /* Found a free page, break it into order-0 pages */
> >> > + isolated = split_free_page(page);
> >> > + total_isolated += isolated;
> >> > + for (i = 0; i < isolated; i++) {
> >> > + list_add(&page->lru, freelist);
> >> > + page++;
> >> > + }
> >> > + blockpfn += isolated - 1;
> >
> > Incidentally, this line is wrong but will be fixed in line 3. If
> > split_free_page() fails, it causes an infinite loop.
> >
> >> > + }
> >> > +
> >> > + return total_isolated;
> >> > +}
> >> > +
> >> > +/* Returns 1 if the page is within a block suitable for migration to */
> >> > +static int suitable_migration_target(struct page *page)
> >> > +{
> >> > + /* If the page is a large free page, then allow migration */
> >> > + if (PageBuddy(page) && page_order(page) >= pageblock_order)
> >> > + return 1;
> >> > +
> >> > + /* If the block is MIGRATE_MOVABLE, allow migration */
> >> > + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> >> > + return 1;
> >> > +
> >> > + /* Otherwise skip the block */
> >> > + return 0;
> >> > +}
> >> > +
> >> > +/*
> >> > + * Based on information in the current compact_control, find blocks
> >> > + * suitable for isolating free pages from
> >> > + */
> >> > +static void isolate_freepages(struct zone *zone,
> >> > + struct compact_control *cc)
> >> > +{
> >> > + struct page *page;
> >> > + unsigned long high_pfn, low_pfn, pfn;
> >> > + unsigned long flags;
> >> > + int nr_freepages = cc->nr_freepages;
> >> > + struct list_head *freelist = &cc->freepages;
> >> > +
> >> > + pfn = cc->free_pfn;
> >> > + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> >> > + high_pfn = low_pfn;
> >> > +
> >> > + /*
> >> > + * Isolate free pages until enough are available to migrate the
> >> > + * pages on cc->migratepages. We stop searching if the migrate
> >> > + * and free page scanners meet or enough free pages are isolated.
> >> > + */
> >> > + spin_lock_irqsave(&zone->lock, flags);
> >> > + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> >> > + pfn -= pageblock_nr_pages) {
> >> > + int isolated;
> >> > +
> >> > + if (!pfn_valid(pfn))
> >> > + continue;
> >> > +
> >> > + /* Check for overlapping nodes/zones */
> >> > + page = pfn_to_page(pfn);
> >> > + if (page_zone(page) != zone)
> >> > + continue;
> >>
> >> We are progressing backward through the zone in physical page order.
> >> If we cross over into another zone, why do we keep going backward?
> >> Before that happens, wouldn't the migration and free scanners meet?
> >> Am I missing something?
> >>
> >
> > I was considering a situation like the following
> >
> >
> > Node-0        Node-1        Node-0
> >  DMA           DMA           DMA
> > 0-1023      1024-2047     2048-4096
> >
> > In that case, a PFN scanner can enter a new node and zone but the migrate
> > and free scanners have not necessarily met. This configuration is *extremely*
> > rare but it happens on messed-up LPAR configurations on POWER.
>
> I didn't know about such an architecture until now.
> Thanks for telling me.
> How about adding a comment about that?
>
Sure
> >
> >> > +
> >> > + /* Check the block is suitable for migration */
> >> > + if (!suitable_migration_target(page))
> >> > + continue;
> >>
> >> Dumb question.
> >> suitable_migration_target considers three types of page:
> >>
> >> 1. a free page whose order >= pageblock_order
> >> 2. free pages whose order < pageblock_order in a movable block
> >> 3. used page with movable
> >>
> >> I can understand 1 and 2 but not 3. This function is for gathering
> >> free pages. How can a used page be handled as a free one?
> >>
> >> In addition, looking at isolate_freepages_block, it doesn't
> >> handle case 3 because of the PageBuddy check.
> >>
> >> I am confused. Please correct me.
> >>
> >
> > I'm afraid I don't understand your question. At the point
> > suitable_migration_target() is called, the only concern is finding a pageblock
> > of pages that should be scanned for free pages by isolate_freepages_block().
> > What do you mean by "used page with movable" ?
>
> After looking into the code, I understand it now.
> Thanks.
>
> <snip>
> >>> +/* Similar to split_page except the page is already free */
> >> Sometimes this function changes the pageblock's migratetype to MIGRATE_MOVABLE.
> >> I'd like a comment added about that.
> >>
> >
> > There is a comment within the function about it. Do you want it moved to
> > here?
>
> If you don't mind, I'd like that. :)
>
Done.
> That's because the function description only says "except the page is already
> free", so at first glance I thought that was the only difference from
> split_page. I think the fact that it sets MIGRATE_MOVABLE is important.
>
> Please treat it as just a nitpick.
No, it's a fair point. I can see how it could trip someone up.
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 06/12] Add /proc trigger for memory compaction
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (4 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 05/12] Memory compaction core Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-12 18:34 ` Valdis.Kletnieks
` (3 more replies)
2010-02-12 12:00 ` [PATCH 07/12] Add /sys trigger for per-node " Mel Gorman
` (5 subsequent siblings)
11 siblings, 4 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
value is written to the file, all zones are compacted. The expected user
of such a trigger is a job scheduler that prepares the system before the
target application runs.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/sysctl/vm.txt | 11 ++++++++
include/linux/compaction.h | 5 +++
kernel/sysctl.c | 11 ++++++++
mm/compaction.c | 60 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 87 insertions(+), 0 deletions(-)
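As an illustration of the expected usage (not part of the patch), a job
scheduler could trigger a full compaction with a user-space helper along
these lines; only the proc path comes from this patch, the helper itself is
hypothetical:

#include <fcntl.h>
#include <unistd.h>

/* Compact all zones by writing an arbitrary value to the trigger file */
static int trigger_full_compaction(void)
{
	int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, "1", 1);	/* the written value is ignored */
	close(fd);
	return ret == 1 ? 0 : -1;
}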
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index fc5790d..92b5b00 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
Currently, these files are in /proc/sys/vm:
- block_dump
+- compact_memory
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
==============================================================
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When an arbitrary value
+is written to the file, all zones are compacted such that free memory
+is available in contiguous blocks where possible. This can be important
+for example in the allocation of huge pages although processes will also
+directly compact memory as required.
+
+==============================================================
+
dirty_background_bytes
Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6201371..facaa3d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -5,4 +5,9 @@
#define COMPACT_INCOMPLETE 0
#define COMPACT_COMPLETE 1
+#ifdef CONFIG_COMPACTION
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_COMPACTION */
+
#endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a68b24..a02c816 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -50,6 +50,7 @@
#include <linux/ftrace.h>
#include <linux/slow-work.h>
#include <linux/perf_event.h>
+#include <linux/compaction.h>
#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -80,6 +81,7 @@ extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
+extern int sysctl_compact_memory;
extern int percpu_pagelist_fraction;
extern int compat_log;
extern int latencytop_enabled;
@@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = drop_caches_sysctl_handler,
},
+#ifdef CONFIG_COMPACTION
+ {
+ .procname = "compact_memory",
+ .data = &sysctl_compact_memory,
+ .maxlen = sizeof(int),
+ .mode = 0200,
+ .proc_handler = sysctl_compaction_handler,
+ },
+#endif /* CONFIG_COMPACTION */
{
.procname = "min_free_kbytes",
.data = &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index 51ec864..c0b9dc9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -11,6 +11,7 @@
#include <linux/migrate.h>
#include <linux/compaction.h>
#include <linux/mm_inline.h>
+#include <linux/sysctl.h>
#include "internal.h"
/*
@@ -339,3 +340,62 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}
+/* Compact all zones within a node */
+static int compact_node(int nid)
+{
+ int zoneid;
+ pg_data_t *pgdat;
+ struct zone *zone;
+
+ if (nid < 0 || nid > nr_node_ids || !node_online(nid))
+ return -EINVAL;
+ pgdat = NODE_DATA(nid);
+
+ /* Flush pending updates to the LRU lists */
+ lru_add_drain_all();
+
+ for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+ struct compact_control cc;
+
+ zone = &pgdat->node_zones[zoneid];
+ if (!populated_zone(zone))
+ continue;
+
+ cc.nr_freepages = 0;
+ cc.nr_migratepages = 0;
+ cc.zone = zone;
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ compact_zone(zone, &cc);
+
+ VM_BUG_ON(!list_empty(&cc.freepages));
+ VM_BUG_ON(!list_empty(&cc.migratepages));
+ }
+
+ return 0;
+}
+
+/* Compact all nodes in the system */
+static int compact_nodes(void)
+{
+ int nid;
+
+ for_each_online_node(nid)
+ compact_node(nid);
+
+ return COMPACT_COMPLETE;
+}
+
+/* The written value is actually unused, all memory is compacted */
+int sysctl_compact_memory;
+
+/* This is the entry point for compacting all nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ if (write)
+ return compact_nodes();
+
+ return 0;
+}
--
1.6.5
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 06/12] Add /proc trigger for memory compaction
2010-02-12 12:00 ` [PATCH 06/12] Add /proc trigger for memory compaction Mel Gorman
@ 2010-02-12 18:34 ` Valdis.Kletnieks
2010-02-12 18:38 ` Mel Gorman
2010-02-17 16:30 ` Rik van Riel
` (2 subsequent siblings)
3 siblings, 1 reply; 50+ messages in thread
From: Valdis.Kletnieks @ 2010-02-12 18:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, 12 Feb 2010 12:00:53 GMT, Mel Gorman said:
> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file, all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
Argh. A global trigger in /proc, and a per-node trigger in /sys too. Can we
get by with just one or the other? Should the /proc one live in /sys too?
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 06/12] Add /proc trigger for memory compaction
2010-02-12 18:34 ` Valdis.Kletnieks
@ 2010-02-12 18:38 ` Mel Gorman
0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 18:38 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, Feb 12, 2010 at 01:34:40PM -0500, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 12 Feb 2010 12:00:53 GMT, Mel Gorman said:
> > This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> > value is written to the file, all zones are compacted. The expected user
> > of such a trigger is a job scheduler that prepares the system before the
> > target application runs.
>
> Argh. A global trigger in /proc, and a per-node trigger in /sys too. Can we
> get by with just one or the other? Should the /proc one live in /sys too?
>
The sysfs trigger is only visible on NUMA. The proc one is easier to use
when the requirement is "compact all memory". There doesn't appear to be a
suitable place in sysfs for the proc trigger, and it's already the case that
proc tunables in general are not reflected in sysfs.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 06/12] Add /proc trigger for memory compaction
2010-02-12 12:00 ` [PATCH 06/12] Add /proc trigger for memory compaction Mel Gorman
2010-02-12 18:34 ` Valdis.Kletnieks
@ 2010-02-17 16:30 ` Rik van Riel
2010-02-18 19:51 ` Christoph Lameter
2010-02-19 1:56 ` Minchan Kim
3 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-02-17 16:30 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/12/2010 07:00 AM, Mel Gorman wrote:
> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file, all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 06/12] Add /proc trigger for memory compaction
2010-02-12 12:00 ` [PATCH 06/12] Add /proc trigger for memory compaction Mel Gorman
2010-02-12 18:34 ` Valdis.Kletnieks
2010-02-17 16:30 ` Rik van Riel
@ 2010-02-18 19:51 ` Christoph Lameter
2010-02-19 1:56 ` Minchan Kim
3 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2010-02-18 19:51 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, linux-kernel, linux-mm
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 06/12] Add /proc trigger for memory compaction
2010-02-12 12:00 ` [PATCH 06/12] Add /proc trigger for memory compaction Mel Gorman
` (2 preceding siblings ...)
2010-02-18 19:51 ` Christoph Lameter
@ 2010-02-19 1:56 ` Minchan Kim
3 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2010-02-19 1:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, Feb 12, 2010 at 9:00 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file, all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 07/12] Add /sys trigger for per-node memory compaction
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (5 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 06/12] Add /proc trigger for memory compaction Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-17 16:30 ` Rik van Riel
2010-02-12 12:00 ` [PATCH 08/12] Direct compact when a high-order allocation fails Mel Gorman
` (4 subsequent siblings)
11 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
This patch adds a per-node sysfs file called compact. When the file is
written to, each zone in that node is compacted. The intention is that this
would be used by something like a job scheduler in a batch system before
a job starts so that the job can allocate the maximum number of
hugepages without significant start-up cost.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
drivers/base/node.c | 3 +++
include/linux/compaction.h | 16 ++++++++++++++++
mm/compaction.c | 23 +++++++++++++++++++++++
3 files changed, 42 insertions(+), 0 deletions(-)
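As with the previous patch, a hypothetical user-space sketch (not part of
the patch) of compacting a single node; the /sys/devices/system/node/nodeN/compact
path is assumed from the sysdev attribute registered below:

#include <stdio.h>

/* Compact every zone on one node by writing to its "compact" attribute */
static int compact_one_node(int nid)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/compact", nid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs("1\n", f);
	fclose(f);
	return 0;
}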
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7012279..2333c9d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -15,6 +15,7 @@
#include <linux/cpu.h>
#include <linux/device.h>
#include <linux/swap.h>
+#include <linux/compaction.h>
static struct sysdev_class node_class = {
.name = "node",
@@ -239,6 +240,8 @@ int register_node(struct node *node, int num, struct node *parent)
scan_unevictable_register_node(node);
hugetlb_register_node(node);
+
+ compaction_register_node(node);
}
return error;
}
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index facaa3d..6a2eefd 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -10,4 +10,20 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
#endif /* CONFIG_COMPACTION */
+#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int compaction_register_node(struct node *node);
+extern void compaction_unregister_node(struct node *node);
+
+#else
+
+static inline int compaction_register_node(struct node *node)
+{
+ return 0;
+}
+
+static inline void compaction_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_COMPACTION && CONFIG_SYSFS && CONFIG_NUMA */
+
#endif /* _LINUX_COMPACTION_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index c0b9dc9..f5bd5ed 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -12,6 +12,7 @@
#include <linux/compaction.h>
#include <linux/mm_inline.h>
#include <linux/sysctl.h>
+#include <linux/sysfs.h>
#include "internal.h"
/*
@@ -399,3 +400,25 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
return 0;
}
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+ssize_t sysfs_compact_node(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t count)
+{
+ compact_node(dev->id);
+
+ return count;
+}
+static SYSDEV_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node);
+
+int compaction_register_node(struct node *node)
+{
+ return sysdev_create_file(&node->sysdev, &attr_compact);
+}
+
+void compaction_unregister_node(struct node *node)
+{
+ return sysdev_remove_file(&node->sysdev, &attr_compact);
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
--
1.6.5
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 07/12] Add /sys trigger for per-node memory compaction
2010-02-12 12:00 ` [PATCH 07/12] Add /sys trigger for per-node " Mel Gorman
@ 2010-02-17 16:30 ` Rik van Riel
0 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-02-17 16:30 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/12/2010 07:00 AM, Mel Gorman wrote:
> This patch adds a per-node sysfs file called compact. When the file is
> written to, each zone in that node is compacted. The intention that this
> would be used by something like a job scheduler in a batch system before
> a job starts so that the job can allocate the maximum number of
> hugepages without significant start-up cost.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 08/12] Direct compact when a high-order allocation fails
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (6 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 07/12] Add /sys trigger for per-node " Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-18 3:57 ` Rik van Riel
2010-02-12 12:00 ` [PATCH 09/12] Do not compact within a preferred zone after a compaction failure Mel Gorman
` (3 subsequent siblings)
11 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
Ordinarily when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation. With this patch, it is determined if
an allocation failed due to external fragmentation instead of low memory
and if so, the calling process will compact until a suitable page is
freed. Compaction by moving pages in memory is considerably cheaper than
paging out to disk and works where there are locked pages or no swap. If
compaction fails to free a page of a suitable size, then reclaim will
still occur.
Direct compaction returns as soon as possible. As each block is compacted,
it is checked if a suitable page has been freed and if so, it returns.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/compaction.h | 16 +++++-
include/linux/vmstat.h | 1 +
mm/compaction.c | 119 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 26 ++++++++++
mm/vmstat.c | 16 +++++-
5 files changed, 174 insertions(+), 4 deletions(-)
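As a rough standalone illustration of the fragmentation index arithmetic used
below (the zone numbers are invented, and the -1 convention follows the comment
added to mm/compaction.c), a value near 1000 means plenty of pages are free but
scattered in small blocks, while a value near 0 means the zone is simply short
of memory:

/*
 * Standalone sketch, not kernel code: mirrors the formula in
 * __fragmentation_index() below with made-up zone numbers.
 */
#include <stdio.h>

static int frag_index(unsigned long free_pages,
		      unsigned long free_blocks_total,
		      unsigned long free_blocks_suitable,
		      unsigned int order)
{
	unsigned long requested = 1UL << order;

	if (!free_blocks_total)
		return 0;

	/* A suitable block already exists: success depends on watermarks */
	if (free_blocks_suitable)
		return -1;

	return 1000 - ((1000 + (free_pages * 1000 / requested))
			/ free_blocks_total);
}

int main(void)
{
	/* 10000 free pages as 10000 single-page blocks, order-4 request:
	 * plenty of memory but badly fragmented -> prints 938 */
	printf("fragmented: %d\n", frag_index(10000, 10000, 0, 4));

	/* 16 free pages in 2 blocks, order-4 request: the zone is simply
	 * low on memory -> prints 0, reclaim is the better response */
	printf("low memory: %d\n", frag_index(16, 2, 0, 4));
	return 0;
}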
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6a2eefd..1cf95e2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,13 +1,25 @@
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H
-/* Return values for compact_zone() */
+/* Return values for compact_zone() and try_to_compact_pages() */
#define COMPACT_INCOMPLETE 0
-#define COMPACT_COMPLETE 1
+#define COMPACT_PARTIAL 1
+#define COMPACT_COMPLETE 2
#ifdef CONFIG_COMPACTION
extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ return COMPACT_INCOMPLETE;
+}
+
#endif /* CONFIG_COMPACTION */
#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d7f7236..0ea7a38 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+ COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index f5bd5ed..2c88ca9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -29,6 +29,9 @@ struct compact_control {
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
+
+ unsigned int order; /* order a direct compactor needs */
+ int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
};
@@ -282,10 +285,31 @@ static void update_nr_listpages(struct compact_control *cc)
static inline int compact_finished(struct zone *zone,
struct compact_control *cc)
{
+ unsigned int order;
+ unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
/* Compaction run completes if the migrate and free scanner meet */
if (cc->free_pfn <= cc->migrate_pfn)
return COMPACT_COMPLETE;
+ /* Compaction run is not finished if the watermark is not met */
+ if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+ return COMPACT_INCOMPLETE;
+
+ if (cc->order == -1)
+ return COMPACT_INCOMPLETE;
+
+ /* Direct compactor: Is a suitable page free? */
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (order >= pageblock_order && zone->free_area[order].nr_free)
+ return COMPACT_PARTIAL;
+ }
+
return COMPACT_INCOMPLETE;
}
@@ -341,6 +365,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}
+static inline unsigned long compact_zone_order(struct zone *zone,
+ int order, gfp_t gfp_mask)
+{
+ struct compact_control cc = {
+ .nr_freepages = 0,
+ .nr_migratepages = 0,
+ .order = order,
+ .migratetype = allocflags_to_migratetype(gfp_mask),
+ .zone = zone,
+ };
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ int may_enter_fs = gfp_mask & __GFP_FS;
+ int may_perform_io = gfp_mask & __GFP_IO;
+ unsigned long watermark;
+ struct zoneref *z;
+ struct zone *zone;
+ int rc = COMPACT_INCOMPLETE;
+
+ /* Check whether it is worth even starting compaction */
+ if (order == 0 || !may_enter_fs || !may_perform_io)
+ return rc;
+
+ /*
+ * We will not stall if the necessary conditions are not met for
+ * migration but direct reclaim seems to account stalls similarly
+ */
+ count_vm_event(COMPACTSTALL);
+
+ /* Compact each zone in the list */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+ nodemask) {
+ int fragindex;
+ int status;
+
+ /*
+ * Watermarks for order-0 must be met for compaction. Note
+ * the 2UL. This is because during migration, copies of
+ * pages need to be allocated and for a short time, the
+ * footprint is higher
+ */
+ watermark = low_wmark_pages(zone) + (2UL << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ continue;
+
+ /*
+ * fragmentation index determines if allocation failures are
+ * due to low memory or external fragmentation
+ *
+ * index of -1 implies allocations might succeed depending
+ * on watermarks
+ * index < 500 implies alloc failure is due to lack of memory
+ *
+ * XXX: The choice of 500 is arbitrary. Reinvestigate
+ * appropriately to determine a sensible default
+ * and what it means when watermarks are also taken
+ * into account. Consider making it a sysctl
+ */
+ fragindex = fragmentation_index(zone, order);
+ if (fragindex >= 0 && fragindex <= 500)
+ continue;
+
+ if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+ rc = COMPACT_PARTIAL;
+ break;
+ }
+
+ status = compact_zone_order(zone, order, gfp_mask);
+ rc = max(status, rc);
+
+ if (zone_watermark_ok(zone, order, watermark, 0, 0))
+ break;
+ }
+
+ return rc;
+}
+
+
/* Compact all zones within a node */
static int compact_node(int nid)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d57154..1910b8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
#include <linux/debugobjects.h>
#include <linux/kmemleak.h>
#include <linux/memory.h>
+#include <linux/compaction.h>
#include <trace/events/kmem.h>
#include <asm/tlbflush.h>
@@ -1728,6 +1729,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
cond_resched();
+ /* Try memory compaction for high-order allocations before reclaim */
+ if (order) {
+ *did_some_progress = try_to_compact_pages(zonelist,
+ order, gfp_mask, nodemask);
+ if (*did_some_progress != COMPACT_INCOMPLETE) {
+ page = get_page_from_freelist(gfp_mask, nodemask,
+ order, zonelist, high_zoneidx,
+ alloc_flags, preferred_zone,
+ migratetype);
+ if (page) {
+ __count_vm_event(COMPACTSUCCESS);
+ return page;
+ }
+
+ /*
+ * It's bad if a compaction run occurs and fails.
+ * The most likely reason is that pages exist,
+ * but not enough to satisfy watermarks.
+ */
+ count_vm_event(COMPACTFAIL);
+
+ cond_resched();
+ }
+ }
+
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
p->flags |= PF_MEMALLOC;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f0930ae..8edbe38 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -500,7 +500,7 @@ static void fill_contig_page_info(struct zone *zone,
* The value can be used to determine if page reclaim or compaction
* should be used
*/
-int fragmentation_index(struct zone *zone,
+static int __fragmentation_index(struct zone *zone,
unsigned int order,
struct contig_page_info *info)
{
@@ -522,6 +522,15 @@ int fragmentation_index(struct zone *zone,
return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
}
+/* Same as __fragmentation_index() but allocates contig_page_info on the stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+ struct contig_page_info info;
+
+ fill_contig_page_info(zone, order, &info);
+ return __fragmentation_index(zone, order, &info);
+}
+
/*
* Return an index indicating how much of the available free memory is
* unusable for an allocation of the requested size.
@@ -558,7 +567,7 @@ static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
zone->name, " ");
for (order = 0; order < MAX_ORDER; ++order) {
fill_contig_page_info(zone, order, &info);
- seq_printf(m, "%6d ", fragmentation_index(zone, order, &info));
+ seq_printf(m, "%6d ", __fragmentation_index(zone, order, &info));
}
seq_putc(m, '\n');
@@ -856,6 +865,9 @@ static const char * const vmstat_text[] = {
"compact_blocks_moved",
"compact_pages_moved",
"compact_pagemigrate_failed",
+ "compact_stall",
+ "compact_fail",
+ "compact_success",
#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
--
1.6.5
* Re: [PATCH 08/12] Direct compact when a high-order allocation fails
2010-02-12 12:00 ` [PATCH 08/12] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-02-18 3:57 ` Rik van Riel
0 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-02-18 3:57 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/12/2010 07:00 AM, Mel Gorman wrote:
> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation. With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
>
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* [PATCH 09/12] Do not compact within a preferred zone after a compaction failure
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (7 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 08/12] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-18 4:09 ` Rik van Riel
2010-02-12 12:00 ` [PATCH 10/12] mm: Check for an empty VMA list in rmap_walk_anon Mel Gorman
` (2 subsequent siblings)
11 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
The fragmentation index may indicate that a failure is due to external
fragmentation, a compaction run may complete, and yet the allocation may
still fail. There are two obvious reasons why:
o Page migration cannot move all pages, so fragmentation remains
o A suitable page may exist, but watermarks are not met
When both compaction and the allocation fail, this patch prevents compaction
from happening again for a short interval. The deferral is only recorded on
the preferred zone, but that should be enough coverage. This could have been
implemented similarly to the zonelist_cache, but the increased size of the
zonelist did not appear to be justified.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/compaction.h | 29 +++++++++++++++++++++++++++++
include/linux/mmzone.h | 7 +++++++
mm/page_alloc.c | 5 ++++-
3 files changed, 40 insertions(+), 1 deletions(-)
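The deferral itself is just a timestamp comparison. A toy userspace analogue of
the pattern is sketched below, using seconds in place of jiffies; the names and
the one-second interval are invented for illustration (the patch itself uses
jiffies + HZ/50, roughly 20ms):

/* Toy analogue, not kernel code: record a resume time on failure and
 * skip the expensive operation until that time has passed. */
#include <stdio.h>
#include <time.h>

static time_t compact_resume;		/* role of zone->compact_resume */

static void defer_compaction(time_t resume)
{
	compact_resume = resume;
}

static int compaction_deferred(void)
{
	return time(NULL) < compact_resume;
}

int main(void)
{
	defer_compaction(time(NULL) + 1);	/* a compaction just failed */
	printf("deferred now? %d\n", compaction_deferred());	/* 1 */
	return 0;
}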
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 1cf95e2..1891bd1 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -13,6 +13,26 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* defer_compaction - Do not compact within a zone until a given time */
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+ /*
+ * This function is called when compaction fails to result in a page
+ * allocation success. This is somewhat unsatisfactory as the failure
+ * to compact has nothing to do with time and everything to do with
+ * the requested order, the number of free pages and watermarks. How
+ * to wait on that is more unclear, but the answer would apply to
+ * other areas where the VM waits based on time.
+ */
+ zone->compact_resume = resume;
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+ return time_before(jiffies, zone->compact_resume);
+}
+
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask)
@@ -20,6 +40,15 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
return COMPACT_INCOMPLETE;
}
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+ return 1;
+}
+
#endif /* CONFIG_COMPACTION */
#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..31fb38b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -328,6 +328,13 @@ struct zone {
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
+#ifdef CONFIG_COMPACTION
+ /*
+ * If a compaction fails, do not try compaction again until
+ * jiffies is after the value of compact_resume
+ */
+ unsigned long compact_resume;
+#endif
ZONE_PADDING(_pad1_)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1910b8b..7021c68 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1730,7 +1730,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
cond_resched();
/* Try memory compaction for high-order allocations before reclaim */
- if (order) {
+ if (order && !compaction_deferred(preferred_zone)) {
*did_some_progress = try_to_compact_pages(zonelist,
order, gfp_mask, nodemask);
if (*did_some_progress != COMPACT_INCOMPLETE) {
@@ -1750,6 +1750,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
*/
count_vm_event(COMPACTFAIL);
+ /* On failure, avoid compaction for a short time. */
+ defer_compaction(preferred_zone, jiffies + HZ/50);
+
cond_resched();
}
}
--
1.6.5
* Re: [PATCH 09/12] Do not compact within a preferred zone after a compaction failure
2010-02-12 12:00 ` [PATCH 09/12] Do not compact within a preferred zone after a compaction failure Mel Gorman
@ 2010-02-18 4:09 ` Rik van Riel
0 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-02-18 4:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/12/2010 07:00 AM, Mel Gorman wrote:
> The fragmentation index may indicate that a failure is due to external
> fragmentation, a compaction run may complete, and yet the allocation may
> still fail. There are two obvious reasons why:
>
> o Page migration cannot move all pages, so fragmentation remains
> o A suitable page may exist, but watermarks are not met
>
> When both compaction and the allocation fail, this patch prevents
> compaction from happening again for a short interval. The deferral is only
> recorded on the preferred zone, but that should be enough coverage. This
> could have been implemented similarly to the zonelist_cache, but the
> increased size of the zonelist did not appear to be justified.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* [PATCH 10/12] mm: Check for an empty VMA list in rmap_walk_anon
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (8 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 09/12] Do not compact within a preferred zone after a compaction failure Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-17 18:22 ` Mel Gorman
2010-02-12 12:00 ` [PATCH 11/12] mm: Take the RCU read lock " Mel Gorman
2010-02-12 12:00 ` [PATCH 12/12] mm: Check the anon_vma is still valid in rmap_walk_anon() Mel Gorman
11 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
There appears to be a race in rmap_walk_anon() that can be triggered by using
page migration under heavy load on pages that do not belong to the process
doing the migration - e.g. during memory compaction. The bug triggered is
a NULL pointer dereference in list_for_each_entry().
I believe what is happening is that rmap_walk() is called but the process
exits before the lock gets taken. rmap_walk_anon() by design is not holding
the mmap_sem which would have guaranteed its existence. There is a comment
explaining the locking (or lack thereof) decision but the reasoning is not
very clear.
This patch checks if the VMA list is empty after the lock is taken, which
avoids the race. It should be reviewed by people more familiar with
migration to confirm this is sufficient or whether the locking decisions
around rmap_walk() need to be revisited.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/rmap.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
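The shape of the fix is the familiar one of re-validating state once the lock
is finally held, because the owner may have emptied the structure in the window
before the lock was acquired. A toy userspace analogue (invented names, not
kernel code) is sketched below:

/* Toy analogue of the list_empty() check: even after finding the
 * object, re-check its contents under the lock before walking it. */
#include <pthread.h>
#include <stdio.h>

struct bucket {
	pthread_mutex_t lock;
	int nr_items;			/* role of the anon_vma VMA list */
};

static int walk(struct bucket *b)
{
	int walked;

	pthread_mutex_lock(&b->lock);
	if (b->nr_items == 0) {		/* the list became empty meanwhile */
		pthread_mutex_unlock(&b->lock);
		return 0;
	}
	walked = b->nr_items;		/* safe to walk the items here */
	pthread_mutex_unlock(&b->lock);
	return walked;
}

int main(void)
{
	struct bucket b = { PTHREAD_MUTEX_INITIALIZER, 0 };

	printf("items walked: %d\n", walk(&b));
	return 0;
}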
diff --git a/mm/rmap.c b/mm/rmap.c
index 278cd27..b468d5f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1237,6 +1237,14 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
if (!anon_vma)
return ret;
spin_lock(&anon_vma->lock);
+
+ /*
+ * While the anon_vma may still exist, there is no guarantee
+ * the VMAs still do.
+ */
+ if (list_empty(&anon_vma->head))
+ goto out_anon_unlock;
+
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
unsigned long address = vma_address(page, vma);
if (address == -EFAULT)
@@ -1245,6 +1253,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
if (ret != SWAP_AGAIN)
break;
}
+
+out_anon_unlock:
spin_unlock(&anon_vma->lock);
return ret;
}
--
1.6.5
* Re: [PATCH 10/12] mm: Check for an empty VMA list in rmap_walk_anon
2010-02-12 12:00 ` [PATCH 10/12] mm: Check for an empty VMA list in rmap_walk_anon Mel Gorman
@ 2010-02-17 18:22 ` Mel Gorman
0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-17 18:22 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm
On Fri, Feb 12, 2010 at 12:00:57PM +0000, Mel Gorman wrote:
> There appears to be a race in rmap_walk_anon() that can be triggered by using
> page migration under heavy load on pages that do not belong to the process
> doing the migration - e.g. during memory compaction. The bug triggered is
> a NULL pointer dereference in list_for_each_entry().
>
Incidentally, I hate patches 10-12 with the last one in particular being
outright flaky. I'm looking at copying the KSM approach instead and have
started tests using the following patch. Any comments on this approach
for ensuring the anon_vma exists for as long as it's needed? Obviously,
the refcounts for KSM and migrate can be collapsed to save space but this
is clearer to review.
=== CUT HERE ====
mm,migration: Take a reference to the anon_vma before migrating
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/rmap.h | 23 +++++++++++++++++++++++
mm/migrate.c | 12 ++++++++++++
mm/rmap.c | 10 +++++-----
3 files changed, 40 insertions(+), 5 deletions(-)
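The underlying idea is the usual pin-by-refcount pattern: take a reference
before starting an operation that may outlive the owner, and have the final put
free the structure if it emptied in the meantime. A toy userspace analogue
(C11 atomics, invented names, not kernel code) is sketched below:

/* Toy analogue of migrate_refcount: pin the object across a long
 * operation; the last put frees it once nothing keeps it alive. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
	atomic_int refcount;		/* role of anon_vma->migrate_refcount */
	int list_len;			/* role of the anon_vma VMA list */
};

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcount, 1);
}

static void obj_put(struct obj *o)
{
	/* Last reference gone and the list is empty: free it, mirroring
	 * the atomic_dec_and_lock()/list_empty() pair added to
	 * unmap_and_move() below. */
	if (atomic_fetch_sub(&o->refcount, 1) == 1 && o->list_len == 0) {
		printf("freeing object\n");
		free(o);
	}
}

int main(void)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (!o)
		return 1;
	atomic_init(&o->refcount, 0);
	o->list_len = 1;		/* one VMA maps the page */
	obj_get(o);			/* migration starts: pin it */
	o->list_len = 0;		/* the owning process exits meanwhile */
	obj_put(o);			/* migration done: last put frees it */
	return 0;
}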
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b019ae6..6b5a1a9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
#ifdef CONFIG_KSM
atomic_t ksm_refcount;
#endif
+#ifdef CONFIG_MIGRATION
+ atomic_t migrate_refcount;
+#endif
/*
* NOTE: the LSB of the head.next is set by
* mm_take_all_locks() _after_ taking the above lock. So the
@@ -61,6 +64,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
return 0;
}
#endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+ atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+ return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+ return 0;
+}
+#endif /* CONFIG_MIGRATION */
static inline struct anon_vma *page_anon_vma(struct page *page)
{
diff --git a/mm/migrate.c b/mm/migrate.c
index 9a0db5b..63addfa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -551,6 +551,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
int rcu_locked = 0;
int charge = 0;
struct mem_cgroup *mem = NULL;
+ struct anon_vma *anon_vma = NULL;
if (!newpage)
return -ENOMEM;
@@ -607,6 +608,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
if (PageAnon(page)) {
rcu_read_lock();
rcu_locked = 1;
+ anon_vma = page_anon_vma(page);
+ atomic_inc(&anon_vma->migrate_refcount);
}
/*
@@ -646,6 +649,15 @@ skip_unmap:
if (rc)
remove_migration_ptes(page, page);
rcu_unlock:
+
+ /* Drop an anon_vma reference if we took one */
+ if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+ int empty = list_empty(&anon_vma->head);
+ spin_unlock(&anon_vma->lock);
+ if (empty)
+ anon_vma_free(anon_vma);
+ }
+
if (rcu_locked)
rcu_read_unlock();
uncharge:
diff --git a/mm/rmap.c b/mm/rmap.c
index 278cd27..11ba74a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -172,7 +172,8 @@ void anon_vma_unlink(struct vm_area_struct *vma)
list_del(&vma->anon_vma_node);
/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
+ empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
+ !migrate_refcount(anon_vma);
spin_unlock(&anon_vma->lock);
if (empty)
@@ -185,6 +186,7 @@ static void anon_vma_ctor(void *data)
spin_lock_init(&anon_vma->lock);
ksm_refcount_init(anon_vma);
+ migrate_refcount_init(anon_vma);
INIT_LIST_HEAD(&anon_vma->head);
}
@@ -1228,10 +1230,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
/*
* Note: remove_migration_ptes() cannot use page_lock_anon_vma()
* because that depends on page_mapped(); but not all its usages
- * are holding mmap_sem, which also gave the necessary guarantee
- * (that this anon_vma's slab has not already been destroyed).
- * This needs to be reviewed later: avoiding page_lock_anon_vma()
- * is risky, and currently limits the usefulness of rmap_walk().
+ * are holding mmap_sem. Users without mmap_sem are required to
+ * take a reference count to prevent the anon_vma disappearing
*/
anon_vma = page_anon_vma(page);
if (!anon_vma)
* [PATCH 11/12] mm: Take the RCU read lock in rmap_walk_anon
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (9 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 10/12] mm: Check for an empty VMA list in rmap_walk_anon Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
2010-02-12 12:00 ` [PATCH 12/12] mm: Check the anon_vma is still valid in rmap_walk_anon() Mel Gorman
11 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma. One important difference between page_lock_anon_vma()
and rmap_walk_anon() is that page_lock_anon_vma() takes the RCU read lock
before the lookup so that the anon_vma does not disappear. There does not
appear to be locking in place that prevents the anon_vma disappearing before
the spinlock is taken.
This patch puts a rcu_read_lock() around the anon_vma lookup similar to
what page_lock_anon_vma() does to prevent an accidental use-after-free.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/rmap.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index b468d5f..fb695d3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1233,9 +1233,10 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* This needs to be reviewed later: avoiding page_lock_anon_vma()
* is risky, and currently limits the usefulness of rmap_walk().
*/
+ rcu_read_lock();
anon_vma = page_anon_vma(page);
if (!anon_vma)
- return ret;
+ goto out_rcu_unlock;
spin_lock(&anon_vma->lock);
/*
@@ -1256,6 +1257,10 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
out_anon_unlock:
spin_unlock(&anon_vma->lock);
+
+out_rcu_unlock:
+ rcu_read_unlock();
+
return ret;
}
--
1.6.5
* [PATCH 12/12] mm: Check the anon_vma is still valid in rmap_walk_anon()
2010-02-12 12:00 [PATCH 0/12] Memory Compaction v2r12 Mel Gorman
` (10 preceding siblings ...)
2010-02-12 12:00 ` [PATCH 11/12] mm: Take the RCU read lock " Mel Gorman
@ 2010-02-12 12:00 ` Mel Gorman
11 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-12 12:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Mel Gorman, linux-kernel, linux-mm
Despite the additional locking added around rmap_walk_anon, bad
references still manage to trigger on ppc64. The most likely cause is a
use-after-free but it's not clear if it's due to a locking problem or
something ppc64 specific. This patch somewhat works around the problem
by checking that the contents of the anon_vma make sense before using it,
but it needs review by eyes familiar with the page migration code to try
to spot where the real problem lies.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/rmap.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index fb695d3..462ac86 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1237,6 +1237,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
anon_vma = page_anon_vma(page);
if (!anon_vma)
goto out_rcu_unlock;
+ if (!anon_vma->head.next)
+ goto out_rcu_unlock;
spin_lock(&anon_vma->lock);
/*
--
1.6.5