From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 834746B005A for ; Wed, 27 Jun 2012 03:51:51 -0400 (EDT) From: Minchan Kim Subject: [PATCH 0/2 v2] fix livelock because of kswapd stop Date: Wed, 27 Jun 2012 16:51:52 +0900 Message-Id: <1340783514-8150-1-git-send-email-minchan@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: akpm@linux-foundation.org Cc: KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , Aaditya Kumar , LKML , linux-mm , Minchan Kim When hotplug offlining happens on zone A, it starts to mark freed page as MIGRATE_ISOLATE type in buddy for preventing further allocation. (MIGRATE_ISOLATE is very irony type because it's apparently on buddy but we can't allocate them). When the memory shortage happens during hotplug offlining, current task starts to reclaim, then wake up kswapd. Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe doesn't consider MIGRATE_ISOLATE freed page count. Current task continue to reclaim in direct reclaim path without kswapd's helping. The problem is that zone->all_unreclaimable is set by only kswapd so that current task would be looping forever like below. __alloc_pages_slowpath restart: wake_all_kswapd rebalance: __alloc_pages_direct_reclaim do_try_to_free_pages if global_reclaim && !all_unreclaimable return 1; /* It means we did did_some_progress */ skip __alloc_pages_may_oom should_alloc_retry goto rebalance; [1/2] factor out memory-isolation functions from page_alloc.c to mm/page_isolation.c This patch can be merged regardless of [2/2]. [2/2] fix this problem. Aaditya, Could you confirm this patch can solve your problem? Minchan Kim (2): mm: Factor out memory isolate functions memory-hotplug: fix kswapd looping forever problem drivers/base/Kconfig | 1 + include/linux/mmzone.h | 8 ++++ include/linux/page-isolation.h | 8 ++-- mm/Kconfig | 5 ++ mm/Makefile | 4 +- mm/page_alloc.c | 102 ++++++++++++---------------------------- mm/page_isolation.c | 96 +++++++++++++++++++++++++++++++++++++ 7 files changed, 147 insertions(+), 77 deletions(-) -- 1.7.9.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id D994D6B005C for ; Wed, 27 Jun 2012 03:51:53 -0400 (EDT) From: Minchan Kim Subject: [PATCH 1/2 v2] mm: Factor out memory isolate functions Date: Wed, 27 Jun 2012 16:51:53 +0900 Message-Id: <1340783514-8150-2-git-send-email-minchan@kernel.org> In-Reply-To: <1340783514-8150-1-git-send-email-minchan@kernel.org> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: akpm@linux-foundation.org Cc: KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , Aaditya Kumar , LKML , linux-mm , Minchan Kim , Andi Kleen , Marek Szyprowski Now mm/page_alloc.c has some memory isolation functions but they are used oly when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. So let's make it configurable by new CONFIG_MEMORY_ISOLATION so that it can reduce binary size and we can check it simple by CONFIG_MEMORY_ISOLATION, not if defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. This patch is based on next-20120626 * from v1 - rebase on next-20120626 Cc: Andi Kleen Cc: Marek Szyprowski Cc: KAMEZAWA Hiroyuki Signed-off-by: Minchan Kim --- Andrew, I know this patch is conflict with current mmotm totally but I don't know what tree I should use. I hope it's a git tree instead of quit-based version. Anyway, If you feel it's inconvenient, let me know it and I will resend it on tree you mention. Thanks. drivers/base/Kconfig | 1 + include/linux/page-isolation.h | 8 ++--- mm/Kconfig | 5 +++ mm/Makefile | 4 +-- mm/page_alloc.c | 71 ---------------------------------------- mm/page_isolation.c | 71 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 83 insertions(+), 77 deletions(-) diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 9b21469..08b4c52 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -196,6 +196,7 @@ config CMA bool "Contiguous Memory Allocator (EXPERIMENTAL)" depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK && EXPERIMENTAL select MIGRATION + select MEMORY_ISOLATION help This enables the Contiguous Memory Allocator which allows drivers to allocate big physically-contiguous blocks of memory for use with diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 3bdcab3..0569a3e 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -10,7 +10,7 @@ * free all pages in the range. test_page_isolated() can be used for * test it. */ -extern int +int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, unsigned migratetype); @@ -18,7 +18,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE. * target range is [start_pfn, end_pfn) */ -extern int +int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, unsigned migratetype); @@ -30,8 +30,8 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn); /* * Internal functions. Changes pageblock's migrate type. */ -extern int set_migratetype_isolate(struct page *page); -extern void unset_migratetype_isolate(struct page *page, unsigned migratetype); +int set_migratetype_isolate(struct page *page); +void unset_migratetype_isolate(struct page *page, unsigned migratetype); #endif diff --git a/mm/Kconfig b/mm/Kconfig index 82fed4e..d5c8019 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -140,9 +140,13 @@ config ARCH_DISCARD_MEMBLOCK config NO_BOOTMEM boolean +config MEMORY_ISOLATION + boolean + # eventually, we can have this option just 'select SPARSEMEM' config MEMORY_HOTPLUG bool "Allow for memory hot-add" + select MEMORY_ISOLATION depends on SPARSEMEM || X86_64_ACPI_NUMA depends on HOTPLUG && ARCH_ENABLE_MEMORY_HOTPLUG depends on (IA64 || X86 || PPC_BOOK3S_64 || SUPERH || S390) @@ -272,6 +276,7 @@ config MEMORY_FAILURE depends on MMU depends on ARCH_SUPPORTS_MEMORY_FAILURE bool "Enable recovery from hardware memory errors" + select MEMORY_ISOLATION help Enables code to recover from some memory failures on systems with MCA recovery. This allows a system to continue running diff --git a/mm/Makefile b/mm/Makefile index 25e8002..a0c7725 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -15,8 +15,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ maccess.o page_alloc.o page-writeback.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ - page_isolation.o mm_init.o mmu_context.o percpu.o \ - compaction.o $(mmu-y) + mm_init.o mmu_context.o percpu.o compaction.o $(mmu-y) obj-y += init-mm.o ifdef CONFIG_NO_BOOTMEM @@ -55,3 +54,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o +obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2c29b1c..c175fa9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -51,7 +51,6 @@ #include #include #include -#include #include #include #include @@ -5555,76 +5554,6 @@ bool is_pageblock_removable_nolock(struct page *page) return __count_immobile_pages(zone, page, 0); } -int set_migratetype_isolate(struct page *page) -{ - struct zone *zone; - unsigned long flags, pfn; - struct memory_isolate_notify arg; - int notifier_ret; - int ret = -EBUSY; - - zone = page_zone(page); - - spin_lock_irqsave(&zone->lock, flags); - - pfn = page_to_pfn(page); - arg.start_pfn = pfn; - arg.nr_pages = pageblock_nr_pages; - arg.pages_found = 0; - - /* - * It may be possible to isolate a pageblock even if the - * migratetype is not MIGRATE_MOVABLE. The memory isolation - * notifier chain is used by balloon drivers to return the - * number of pages in a range that are held by the balloon - * driver to shrink memory. If all the pages are accounted for - * by balloons, are free, or on the LRU, isolation can continue. - * Later, for example, when memory hotplug notifier runs, these - * pages reported as "can be isolated" should be isolated(freed) - * by the balloon driver through the memory notifier chain. - */ - notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg); - notifier_ret = notifier_to_errno(notifier_ret); - if (notifier_ret) - goto out; - /* - * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. - * We just check MOVABLE pages. - */ - if (__count_immobile_pages(zone, page, arg.pages_found)) - ret = 0; - - /* - * immobile means "not-on-lru" paes. If immobile is larger than - * removable-by-driver pages reported by notifier, we'll fail. - */ - -out: - if (!ret) { - set_pageblock_migratetype(page, MIGRATE_ISOLATE); - move_freepages_block(zone, page, MIGRATE_ISOLATE); - } - - spin_unlock_irqrestore(&zone->lock, flags); - if (!ret) - drain_all_pages(); - return ret; -} - -void unset_migratetype_isolate(struct page *page, unsigned migratetype) -{ - struct zone *zone; - unsigned long flags; - zone = page_zone(page); - spin_lock_irqsave(&zone->lock, flags); - if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) - goto out; - set_pageblock_migratetype(page, migratetype); - move_freepages_block(zone, page, migratetype); -out: - spin_unlock_irqrestore(&zone->lock, flags); -} - #ifdef CONFIG_CMA static unsigned long pfn_max_align_down(unsigned long pfn) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index c9f0477..1a9cb36 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -5,8 +5,79 @@ #include #include #include +#include #include "internal.h" +int set_migratetype_isolate(struct page *page) +{ + struct zone *zone; + unsigned long flags, pfn; + struct memory_isolate_notify arg; + int notifier_ret; + int ret = -EBUSY; + + zone = page_zone(page); + + spin_lock_irqsave(&zone->lock, flags); + + pfn = page_to_pfn(page); + arg.start_pfn = pfn; + arg.nr_pages = pageblock_nr_pages; + arg.pages_found = 0; + + /* + * It may be possible to isolate a pageblock even if the + * migratetype is not MIGRATE_MOVABLE. The memory isolation + * notifier chain is used by balloon drivers to return the + * number of pages in a range that are held by the balloon + * driver to shrink memory. If all the pages are accounted for + * by balloons, are free, or on the LRU, isolation can continue. + * Later, for example, when memory hotplug notifier runs, these + * pages reported as "can be isolated" should be isolated(freed) + * by the balloon driver through the memory notifier chain. + */ + notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg); + notifier_ret = notifier_to_errno(notifier_ret); + if (notifier_ret) + goto out; + /* + * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. + * We just check MOVABLE pages. + */ + if (__count_immobile_pages(zone, page, arg.pages_found)) + ret = 0; + + /* + * immobile means "not-on-lru" paes. If immobile is larger than + * removable-by-driver pages reported by notifier, we'll fail. + */ + +out: + if (!ret) { + set_pageblock_migratetype(page, MIGRATE_ISOLATE); + move_freepages_block(zone, page, MIGRATE_ISOLATE); + } + + spin_unlock_irqrestore(&zone->lock, flags); + if (!ret) + drain_all_pages(); + return ret; +} + +void unset_migratetype_isolate(struct page *page, unsigned migratetype) +{ + struct zone *zone; + unsigned long flags; + zone = page_zone(page); + spin_lock_irqsave(&zone->lock, flags); + if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) + goto out; + set_pageblock_migratetype(page, migratetype); + move_freepages_block(zone, page, migratetype); +out: + spin_unlock_irqrestore(&zone->lock, flags); +} + static inline struct page * __first_valid_page(unsigned long pfn, unsigned long nr_pages) { -- 1.7.9.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id F35486B0068 for ; Wed, 27 Jun 2012 03:51:56 -0400 (EDT) From: Minchan Kim Subject: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem Date: Wed, 27 Jun 2012 16:51:54 +0900 Message-Id: <1340783514-8150-3-git-send-email-minchan@kernel.org> In-Reply-To: <1340783514-8150-1-git-send-email-minchan@kernel.org> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: akpm@linux-foundation.org Cc: KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , Aaditya Kumar , LKML , linux-mm , Minchan Kim , Mel Gorman When hotplug offlining happens on zone A, it starts to mark freed page as MIGRATE_ISOLATE type in buddy for preventing further allocation. (MIGRATE_ISOLATE is very irony type because it's apparently on buddy but we can't allocate them). When the memory shortage happens during hotplug offlining, current task starts to reclaim, then wake up kswapd. Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe doesn't consider MIGRATE_ISOLATE freed page count. Current task continue to reclaim in direct reclaim path without kswapd's helping. The problem is that zone->all_unreclaimable is set by only kswapd so that current task would be looping forever like below. __alloc_pages_slowpath restart: wake_all_kswapd rebalance: __alloc_pages_direct_reclaim do_try_to_free_pages if global_reclaim && !all_unreclaimable return 1; /* It means we did did_some_progress */ skip __alloc_pages_may_oom should_alloc_retry goto rebalance; If we apply KOSAKI's patch[1] which doesn't depends on kswapd about setting zone->all_unreclaimable, we can solve this problem by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. It could be a problem still if other subsystem needs GFP_ATOMIC request. So kswapd should consider MIGRATE_ISOLATE when it calculate free pages BEFORE going sleep. This patch counts the number of MIGRATE_ISOLATE page block and zone_watermark_ok_safe will consider it if the system has such blocks (fortunately, it's very rare so no problem in POV overhead and kswapd is never hotpath). Copy/modify from Mel's quote " Ideal solution would be "allocating" the pageblock. It would keep the free space accounting as it is but historically, memory hotplug didn't allocate pages because it would be difficult to detect if a pageblock was isolated or if part of some balloon. Allocating just full pageblocks would work around this, However, it would play very badly with CMA. " [1] http://lkml.org/lkml/2012/6/14/74 * from v1 - add changelog - make functions simple - remove atomic variable - discard exact isolated free page accounting. - rebased on next-20120626 Suggested-by: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki Cc: Aaditya Kumar Cc: Mel Gorman Signed-off-by: Minchan Kim --- Aaditya, coul you confirm this patch solve your problem and make sure nr_pageblock_isolate is zero after hotplug end? Thanks! include/linux/mmzone.h | 8 ++++++++ mm/page_alloc.c | 31 +++++++++++++++++++++++++++++++ mm/page_isolation.c | 29 +++++++++++++++++++++++++++-- 3 files changed, 66 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dbc876e..6ee83b8 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -474,6 +474,14 @@ struct zone { * rarely used fields: */ const char *name; +#ifdef CONFIG_MEMORY_ISOLATION + /* + * the number of MIGRATE_ISOLATE *pageblock*. + * We need this for free page counting. Look at zone_watermark_ok_safe. + * It's protected by zone->lock + */ + int nr_pageblock_isolate; +#endif } ____cacheline_internodealigned_in_smp; typedef enum { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c175fa9..b12c8ec 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -218,6 +218,11 @@ EXPORT_SYMBOL(nr_online_nodes); int page_group_by_mobility_disabled __read_mostly; +/* + * NOTE: + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly. + * Instead, use {un}set_pageblock_isolate. + */ void set_pageblock_migratetype(struct page *page, int migratetype) { if (unlikely(page_group_by_mobility_disabled)) @@ -1614,6 +1619,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, return true; } +#ifdef CONFIG_MEMORY_ISOLATION +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) +{ + unsigned long nr_pages = 0; + + if (unlikely(zone->nr_pageblock_isolate)) { + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; + } + return nr_pages; +} +#else +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) +{ + return 0; +} +#endif + bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, int classzone_idx, int alloc_flags) { @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); + /* + * If the zone has MIGRATE_ISOLATE type free page, + * we should consider it. nr_zone_isolate_freepages is never + * accurate so kswapd might not sleep although she can. + * But it's more desirable for memory hotplug rather than + * forever sleep which cause livelock in direct reclaim path. + */ + free_pages -= nr_zone_isolate_freepages(z); return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, free_pages); } @@ -4407,6 +4437,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, lruvec_init(&zone->lruvec, zone); zap_zone_vm_stats(zone); zone->flags = 0; + zone->nr_pageblock_isolate = 0; if (!size) continue; diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 1a9cb36..c721ec0 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -8,6 +8,31 @@ #include #include "internal.h" +/* called by holding zone->lock */ +static void set_pageblock_isolate(struct zone *zone, struct page *page) +{ + BUG_ON(page_zone(page) != zone); + + if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE) + return; + + set_pageblock_migratetype(page, MIGRATE_ISOLATE); + zone->nr_pageblock_isolate++; +} + +/* called by holding zone->lock */ +static void restore_pageblock_isolate(struct zone *zone, struct page *page, + int migratetype) +{ + BUG_ON(page_zone(page) != zone); + if (WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE)) + return; + + BUG_ON(zone->nr_pageblock_isolate <= 0); + set_pageblock_migratetype(page, migratetype); + zone->nr_pageblock_isolate--; +} + int set_migratetype_isolate(struct page *page) { struct zone *zone; @@ -54,7 +79,7 @@ int set_migratetype_isolate(struct page *page) out: if (!ret) { - set_pageblock_migratetype(page, MIGRATE_ISOLATE); + set_pageblock_isolate(zone, page); move_freepages_block(zone, page, MIGRATE_ISOLATE); } @@ -72,8 +97,8 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype) spin_lock_irqsave(&zone->lock, flags); if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) goto out; - set_pageblock_migratetype(page, migratetype); move_freepages_block(zone, page, migratetype); + restore_pageblock_isolate(zone, page, migratetype); out: spin_unlock_irqrestore(&zone->lock, flags); } -- 1.7.9.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx177.postini.com [74.125.245.177]) by kanga.kvack.org (Postfix) with SMTP id BBE316B005A for ; Thu, 28 Jun 2012 03:13:38 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp (unknown [10.0.50.72]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 47C873EE0BB for ; Thu, 28 Jun 2012 16:13:37 +0900 (JST) Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id F0CED45DE56 for ; Thu, 28 Jun 2012 16:13:36 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id D7E0D45DE53 for ; Thu, 28 Jun 2012 16:13:36 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id C11751DB803F for ; Thu, 28 Jun 2012 16:13:36 +0900 (JST) Received: from m1001.s.css.fujitsu.com (m1001.s.css.fujitsu.com [10.240.81.139]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 7008A1DB803A for ; Thu, 28 Jun 2012 16:13:36 +0900 (JST) Message-ID: <4FEC039A.5060506@jp.fujitsu.com> Date: Thu, 28 Jun 2012 16:11:22 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [PATCH 1/2 v2] mm: Factor out memory isolate functions References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-2-git-send-email-minchan@kernel.org> In-Reply-To: <1340783514-8150-2-git-send-email-minchan@kernel.org> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , Aaditya Kumar , LKML , linux-mm , Andi Kleen , Marek Szyprowski (2012/06/27 16:51), Minchan Kim wrote: > Now mm/page_alloc.c has some memory isolation functions but they > are used oly when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. > So let's make it configurable by new CONFIG_MEMORY_ISOLATION so that it > can reduce binary size and we can check it simple by > CONFIG_MEMORY_ISOLATION, > not if defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. > > This patch is based on next-20120626 > > * from v1 > - rebase on next-20120626 > > Cc: Andi Kleen > Cc: Marek Szyprowski > Cc: KAMEZAWA Hiroyuki > Signed-off-by: Minchan Kim Acked-by: KAMEZAWA Hiroyuki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id A25246B005A for ; Thu, 28 Jun 2012 03:16:12 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp (unknown [10.0.50.72]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id 0B5BB3EE0C1 for ; Thu, 28 Jun 2012 16:16:11 +0900 (JST) Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id E689345DE5A for ; Thu, 28 Jun 2012 16:16:10 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id C402845DE55 for ; Thu, 28 Jun 2012 16:16:10 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 9E0871DB8043 for ; Thu, 28 Jun 2012 16:16:10 +0900 (JST) Received: from ml13.s.css.fujitsu.com (ml13.s.css.fujitsu.com [10.240.81.133]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 3574A1DB804D for ; Thu, 28 Jun 2012 16:16:10 +0900 (JST) Message-ID: <4FEC042C.5070509@jp.fujitsu.com> Date: Thu, 28 Jun 2012 16:13:48 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> In-Reply-To: <1340783514-8150-3-git-send-email-minchan@kernel.org> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , Aaditya Kumar , LKML , linux-mm , Mel Gorman (2012/06/27 16:51), Minchan Kim wrote: > When hotplug offlining happens on zone A, it starts to mark freed page > as MIGRATE_ISOLATE type in buddy for preventing further allocation. > (MIGRATE_ISOLATE is very irony type because it's apparently on buddy > but we can't allocate them). > When the memory shortage happens during hotplug offlining, > current task starts to reclaim, then wake up kswapd. > Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe > doesn't consider MIGRATE_ISOLATE freed page count. > Current task continue to reclaim in direct reclaim path without kswapd's helping. > The problem is that zone->all_unreclaimable is set by only kswapd > so that current task would be looping forever like below. > > __alloc_pages_slowpath > restart: > wake_all_kswapd > rebalance: > __alloc_pages_direct_reclaim > do_try_to_free_pages > if global_reclaim && !all_unreclaimable > return 1; /* It means we did did_some_progress */ > skip __alloc_pages_may_oom > should_alloc_retry > goto rebalance; > > If we apply KOSAKI's patch[1] which doesn't depends on kswapd > about setting zone->all_unreclaimable, we can solve this problem > by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. > It could be a problem still if other subsystem needs GFP_ATOMIC request. > So kswapd should consider MIGRATE_ISOLATE when it calculate free pages > BEFORE going sleep. > > This patch counts the number of MIGRATE_ISOLATE page block and > zone_watermark_ok_safe will consider it if the system has such blocks > (fortunately, it's very rare so no problem in POV overhead and kswapd is never > hotpath). > > Copy/modify from Mel's quote > " > Ideal solution would be "allocating" the pageblock. > It would keep the free space accounting as it is but historically, > memory hotplug didn't allocate pages because it would be difficult to > detect if a pageblock was isolated or if part of some balloon. > Allocating just full pageblocks would work around this, However, > it would play very badly with CMA. > " > > [1] http://lkml.org/lkml/2012/6/14/74 > > * from v1 > - add changelog > - make functions simple > - remove atomic variable > - discard exact isolated free page accounting. > - rebased on next-20120626 > > Suggested-by: KOSAKI Motohiro > Cc: KAMEZAWA Hiroyuki > Cc: Aaditya Kumar > Cc: Mel Gorman > Signed-off-by: Minchan Kim > --- > > Aaditya, coul you confirm this patch solve your problem and > make sure nr_pageblock_isolate is zero after hotplug end? > > Thanks! > > include/linux/mmzone.h | 8 ++++++++ > mm/page_alloc.c | 31 +++++++++++++++++++++++++++++++ > mm/page_isolation.c | 29 +++++++++++++++++++++++++++-- > 3 files changed, 66 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index dbc876e..6ee83b8 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -474,6 +474,14 @@ struct zone { > * rarely used fields: > */ > const char *name; > +#ifdef CONFIG_MEMORY_ISOLATION > + /* > + * the number of MIGRATE_ISOLATE *pageblock*. > + * We need this for free page counting. Look at zone_watermark_ok_safe. > + * It's protected by zone->lock > + */ > + int nr_pageblock_isolate; > +#endif > } ____cacheline_internodealigned_in_smp; > > typedef enum { > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c175fa9..b12c8ec 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -218,6 +218,11 @@ EXPORT_SYMBOL(nr_online_nodes); > > int page_group_by_mobility_disabled __read_mostly; > > +/* > + * NOTE: > + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly. > + * Instead, use {un}set_pageblock_isolate. > + */ > void set_pageblock_migratetype(struct page *page, int migratetype) > { > if (unlikely(page_group_by_mobility_disabled)) > @@ -1614,6 +1619,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, > return true; > } > > +#ifdef CONFIG_MEMORY_ISOLATION > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > +{ > + unsigned long nr_pages = 0; > + > + if (unlikely(zone->nr_pageblock_isolate)) { > + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; > + } > + return nr_pages; > +} > +#else > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > +{ > + return 0; > +} > +#endif > + > bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags) > { > @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > > + /* > + * If the zone has MIGRATE_ISOLATE type free page, > + * we should consider it. nr_zone_isolate_freepages is never > + * accurate so kswapd might not sleep although she can. > + * But it's more desirable for memory hotplug rather than > + * forever sleep which cause livelock in direct reclaim path. > + */ > + free_pages -= nr_zone_isolate_freepages(z); Here, free_pages could be < 0 ? Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx103.postini.com [74.125.245.103]) by kanga.kvack.org (Postfix) with SMTP id 3035B6B005A for ; Thu, 28 Jun 2012 19:34:30 -0400 (EDT) Message-ID: <4FECEA16.4060200@kernel.org> Date: Fri, 29 Jun 2012 08:34:46 +0900 From: Minchan Kim MIME-Version: 1.0 Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> <4FEC042C.5070509@jp.fujitsu.com> In-Reply-To: <4FEC042C.5070509@jp.fujitsu.com> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , Aaditya Kumar , LKML , linux-mm , Mel Gorman Hi Kame, On 06/28/2012 04:13 PM, Kamezawa Hiroyuki wrote: > (2012/06/27 16:51), Minchan Kim wrote: >> When hotplug offlining happens on zone A, it starts to mark freed page >> as MIGRATE_ISOLATE type in buddy for preventing further allocation. >> (MIGRATE_ISOLATE is very irony type because it's apparently on buddy >> but we can't allocate them). >> When the memory shortage happens during hotplug offlining, >> current task starts to reclaim, then wake up kswapd. >> Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe >> doesn't consider MIGRATE_ISOLATE freed page count. >> Current task continue to reclaim in direct reclaim path without kswapd's helping. >> The problem is that zone->all_unreclaimable is set by only kswapd >> so that current task would be looping forever like below. >> >> __alloc_pages_slowpath >> restart: >> wake_all_kswapd >> rebalance: >> __alloc_pages_direct_reclaim >> do_try_to_free_pages >> if global_reclaim && !all_unreclaimable >> return 1; /* It means we did did_some_progress */ >> skip __alloc_pages_may_oom >> should_alloc_retry >> goto rebalance; >> >> If we apply KOSAKI's patch[1] which doesn't depends on kswapd >> about setting zone->all_unreclaimable, we can solve this problem >> by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. >> It could be a problem still if other subsystem needs GFP_ATOMIC request. >> So kswapd should consider MIGRATE_ISOLATE when it calculate free pages >> BEFORE going sleep. >> >> This patch counts the number of MIGRATE_ISOLATE page block and >> zone_watermark_ok_safe will consider it if the system has such blocks >> (fortunately, it's very rare so no problem in POV overhead and kswapd is never >> hotpath). >> >> Copy/modify from Mel's quote >> " >> Ideal solution would be "allocating" the pageblock. >> It would keep the free space accounting as it is but historically, >> memory hotplug didn't allocate pages because it would be difficult to >> detect if a pageblock was isolated or if part of some balloon. >> Allocating just full pageblocks would work around this, However, >> it would play very badly with CMA. >> " >> >> [1] http://lkml.org/lkml/2012/6/14/74 >> >> * from v1 >> - add changelog >> - make functions simple >> - remove atomic variable >> - discard exact isolated free page accounting. >> - rebased on next-20120626 >> >> Suggested-by: KOSAKI Motohiro >> Cc: KAMEZAWA Hiroyuki >> Cc: Aaditya Kumar >> Cc: Mel Gorman >> Signed-off-by: Minchan Kim >> --- >> >> Aaditya, coul you confirm this patch solve your problem and >> make sure nr_pageblock_isolate is zero after hotplug end? >> >> Thanks! >> >> include/linux/mmzone.h | 8 ++++++++ >> mm/page_alloc.c | 31 +++++++++++++++++++++++++++++++ >> mm/page_isolation.c | 29 +++++++++++++++++++++++++++-- >> 3 files changed, 66 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index dbc876e..6ee83b8 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -474,6 +474,14 @@ struct zone { >> * rarely used fields: >> */ >> const char *name; >> +#ifdef CONFIG_MEMORY_ISOLATION >> + /* >> + * the number of MIGRATE_ISOLATE *pageblock*. >> + * We need this for free page counting. Look at zone_watermark_ok_safe. >> + * It's protected by zone->lock >> + */ >> + int nr_pageblock_isolate; >> +#endif >> } ____cacheline_internodealigned_in_smp; >> >> typedef enum { >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index c175fa9..b12c8ec 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -218,6 +218,11 @@ EXPORT_SYMBOL(nr_online_nodes); >> >> int page_group_by_mobility_disabled __read_mostly; >> >> +/* >> + * NOTE: >> + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly. >> + * Instead, use {un}set_pageblock_isolate. >> + */ >> void set_pageblock_migratetype(struct page *page, int migratetype) >> { >> if (unlikely(page_group_by_mobility_disabled)) >> @@ -1614,6 +1619,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, >> return true; >> } >> >> +#ifdef CONFIG_MEMORY_ISOLATION >> +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) >> +{ >> + unsigned long nr_pages = 0; >> + >> + if (unlikely(zone->nr_pageblock_isolate)) { >> + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; >> + } >> + return nr_pages; >> +} >> +#else >> +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) >> +{ >> + return 0; >> +} >> +#endif >> + >> bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, >> int classzone_idx, int alloc_flags) >> { >> @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, >> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) >> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); >> >> + /* >> + * If the zone has MIGRATE_ISOLATE type free page, >> + * we should consider it. nr_zone_isolate_freepages is never >> + * accurate so kswapd might not sleep although she can. >> + * But it's more desirable for memory hotplug rather than >> + * forever sleep which cause livelock in direct reclaim path. >> + */ >> + free_pages -= nr_zone_isolate_freepages(z); > > Here, free_pages could be < 0 ? It could but __zone_watermark_ok returns false so kswapd is going to work without sleep. It's not bad because nr_zone_isolate_freepages already isn't accurate and hotplug event is temporal so kswapd should work well after end hotplug. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id 20F456B0062 for ; Mon, 9 Jul 2012 09:31:12 -0400 (EDT) Received: by lbjn8 with SMTP id n8so20709932lbj.14 for ; Mon, 09 Jul 2012 06:31:09 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <1340783514-8150-3-git-send-email-minchan@kernel.org> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> Date: Mon, 9 Jul 2012 19:01:09 +0530 Message-ID: Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem From: Aaditya Kumar Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , LKML , linux-mm , Mel Gorman , tim.bird@am.sony.com, frank.rowand@am.sony.com, takuzo.ohara@ap.sony.com, kan.iibuchi@jp.sony.com, aaditya.kumar@ap.sony.com On Wed, Jun 27, 2012 at 1:21 PM, Minchan Kim wrote: > When hotplug offlining happens on zone A, it starts to mark freed page > as MIGRATE_ISOLATE type in buddy for preventing further allocation. > (MIGRATE_ISOLATE is very irony type because it's apparently on buddy > but we can't allocate them). > When the memory shortage happens during hotplug offlining, > current task starts to reclaim, then wake up kswapd. > Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe > doesn't consider MIGRATE_ISOLATE freed page count. > Current task continue to reclaim in direct reclaim path without kswapd's helping. > The problem is that zone->all_unreclaimable is set by only kswapd > so that current task would be looping forever like below. > > __alloc_pages_slowpath > restart: > wake_all_kswapd > rebalance: > __alloc_pages_direct_reclaim > do_try_to_free_pages > if global_reclaim && !all_unreclaimable > return 1; /* It means we did did_some_progress */ > skip __alloc_pages_may_oom > should_alloc_retry > goto rebalance; > > If we apply KOSAKI's patch[1] which doesn't depends on kswapd > about setting zone->all_unreclaimable, we can solve this problem > by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. > It could be a problem still if other subsystem needs GFP_ATOMIC request. > So kswapd should consider MIGRATE_ISOLATE when it calculate free pages > BEFORE going sleep. > > This patch counts the number of MIGRATE_ISOLATE page block and > zone_watermark_ok_safe will consider it if the system has such blocks > (fortunately, it's very rare so no problem in POV overhead and kswapd is never > hotpath). > > Copy/modify from Mel's quote > " > Ideal solution would be "allocating" the pageblock. > It would keep the free space accounting as it is but historically, > memory hotplug didn't allocate pages because it would be difficult to > detect if a pageblock was isolated or if part of some balloon. > Allocating just full pageblocks would work around this, However, > it would play very badly with CMA. > " > > [1] http://lkml.org/lkml/2012/6/14/74 > > * from v1 > - add changelog > - make functions simple > - remove atomic variable > - discard exact isolated free page accounting. > - rebased on next-20120626 > > Suggested-by: KOSAKI Motohiro > Cc: KAMEZAWA Hiroyuki > Cc: Aaditya Kumar > Cc: Mel Gorman > Signed-off-by: Minchan Kim > --- > > Aaditya, coul you confirm this patch solve your problem and > make sure nr_pageblock_isolate is zero after hotplug end? I am really sorry for the delay. I just tried this patch on my ARM setup. > > +#ifdef CONFIG_MEMORY_ISOLATION > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > +{ > + unsigned long nr_pages = 0; > + > + if (unlikely(zone->nr_pageblock_isolate)) { > + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; > + } > + return nr_pages; > +} > +#else > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > +{ > + return 0; > +} > +#endif > + > bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags) > { > @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > > + /* > + * If the zone has MIGRATE_ISOLATE type free page, > + * we should consider it. nr_zone_isolate_freepages is never > + * accurate so kswapd might not sleep although she can. > + * But it's more desirable for memory hotplug rather than > + * forever sleep which cause livelock in direct reclaim path. > + */ > + free_pages -= nr_zone_isolate_freepages(z); > return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > free_pages); For my test case, pages to be off lined span the whole node. With this setup the free_pages become negative. (As you and Kamezawa-san already expected.) BUT because of free_pages going negative the memory off lining still livelocks as __zone_watermark_ok() returns true. This is because in below if comparison, because of an unsigned value (z->lowmem_reserve[classzone_idx]) all the longs are converted to unsigned long. static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, int classzone_idx, int alloc_flags, long free_pages) { if (free_pages <= min + z->lowmem_reserve[classzone_idx]) return false; So, may be you can consider following also: As for the nr_pageblock_isolate going back to zero, yes it is going back to zero for my test case.(I tested after this change) --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1594,6 +1594,7 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, { /* free_pages my go negative - that's OK */ long min = mark; + long lowmem_res = z->lowmem_reserve[classzone_idx]; int o; free_pages -= (1 << order) - 1; @@ -1602,7 +1603,7 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, if (alloc_flags & ALLOC_HARDER) min -= min / 4; - if (free_pages <= min + z->lowmem_reserve[classzone_idx]) + if (free_pages <= min + lowmem_res) return false; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx192.postini.com [74.125.245.192]) by kanga.kvack.org (Postfix) with SMTP id 33E436B006E for ; Mon, 9 Jul 2012 10:29:47 -0400 (EDT) Received: by pbbrp2 with SMTP id rp2so23426086pbb.14 for ; Mon, 09 Jul 2012 07:29:46 -0700 (PDT) Date: Mon, 9 Jul 2012 23:29:36 +0900 From: Minchan Kim Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem Message-ID: <20120709142936.GB17314@barrios> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Aaditya Kumar Cc: Minchan Kim , akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , LKML , linux-mm , Mel Gorman , tim.bird@am.sony.com, frank.rowand@am.sony.com, takuzo.ohara@ap.sony.com, kan.iibuchi@jp.sony.com, aaditya.kumar@ap.sony.com Hi Aaditya, On Mon, Jul 09, 2012 at 07:01:09PM +0530, Aaditya Kumar wrote: > On Wed, Jun 27, 2012 at 1:21 PM, Minchan Kim wrote: > > When hotplug offlining happens on zone A, it starts to mark freed page > > as MIGRATE_ISOLATE type in buddy for preventing further allocation. > > (MIGRATE_ISOLATE is very irony type because it's apparently on buddy > > but we can't allocate them). > > When the memory shortage happens during hotplug offlining, > > current task starts to reclaim, then wake up kswapd. > > Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe > > doesn't consider MIGRATE_ISOLATE freed page count. > > Current task continue to reclaim in direct reclaim path without kswapd's helping. > > The problem is that zone->all_unreclaimable is set by only kswapd > > so that current task would be looping forever like below. > > > > __alloc_pages_slowpath > > restart: > > wake_all_kswapd > > rebalance: > > __alloc_pages_direct_reclaim > > do_try_to_free_pages > > if global_reclaim && !all_unreclaimable > > return 1; /* It means we did did_some_progress */ > > skip __alloc_pages_may_oom > > should_alloc_retry > > goto rebalance; > > > > If we apply KOSAKI's patch[1] which doesn't depends on kswapd > > about setting zone->all_unreclaimable, we can solve this problem > > by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. > > It could be a problem still if other subsystem needs GFP_ATOMIC request. > > So kswapd should consider MIGRATE_ISOLATE when it calculate free pages > > BEFORE going sleep. > > > > This patch counts the number of MIGRATE_ISOLATE page block and > > zone_watermark_ok_safe will consider it if the system has such blocks > > (fortunately, it's very rare so no problem in POV overhead and kswapd is never > > hotpath). > > > > Copy/modify from Mel's quote > > " > > Ideal solution would be "allocating" the pageblock. > > It would keep the free space accounting as it is but historically, > > memory hotplug didn't allocate pages because it would be difficult to > > detect if a pageblock was isolated or if part of some balloon. > > Allocating just full pageblocks would work around this, However, > > it would play very badly with CMA. > > " > > > > [1] http://lkml.org/lkml/2012/6/14/74 > > > > * from v1 > > - add changelog > > - make functions simple > > - remove atomic variable > > - discard exact isolated free page accounting. > > - rebased on next-20120626 > > > > Suggested-by: KOSAKI Motohiro > > Cc: KAMEZAWA Hiroyuki > > Cc: Aaditya Kumar > > Cc: Mel Gorman > > Signed-off-by: Minchan Kim > > --- > > > > Aaditya, coul you confirm this patch solve your problem and > > make sure nr_pageblock_isolate is zero after hotplug end? > > I am really sorry for the delay. > I just tried this patch on my ARM setup. No problem. > > > > > > +#ifdef CONFIG_MEMORY_ISOLATION > > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > > +{ > > + unsigned long nr_pages = 0; > > + > > + if (unlikely(zone->nr_pageblock_isolate)) { > > + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; > > + } > > + return nr_pages; > > +} > > +#else > > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > > +{ > > + return 0; > > +} > > +#endif > > + > > bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > > int classzone_idx, int alloc_flags) > > { > > @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > > if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > > free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > > > > + /* > > + * If the zone has MIGRATE_ISOLATE type free page, > > + * we should consider it. nr_zone_isolate_freepages is never > > + * accurate so kswapd might not sleep although she can. > > + * But it's more desirable for memory hotplug rather than > > + * forever sleep which cause livelock in direct reclaim path. > > + */ > > + free_pages -= nr_zone_isolate_freepages(z); > > return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > > free_pages); > > For my test case, pages to be off lined span the whole node. > With this setup the free_pages become negative. (As you and Kamezawa-san > already expected.) > > BUT because of free_pages going negative the memory off lining still livelocks > as __zone_watermark_ok() returns true. > > This is because in below if comparison, because of an unsigned value > (z->lowmem_reserve[classzone_idx]) > all the longs are converted to unsigned long. Oh my god. It seems you find new unknown BUG. Currently, free_pages passed into __zone_watermark_ok could be zero. In __zone_watemark_ok, it could be minus by this. free_pages -= (1 << order) - 1; and automatic type conversion could make samek result you have seen. I will fix it in next spin. Thanks for the spotting! > > static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags, long free_pages) > { > > if (free_pages <= min + z->lowmem_reserve[classzone_idx]) > return false; > > > > So, may be you can consider following also: > As for the nr_pageblock_isolate going back to zero, yes it is going back to zero > for my test case.(I tested after this change) Thanks for testing! May I add your tested-by in next spin which will include automatic type conversion problem ? > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1594,6 +1594,7 @@ static bool __zone_watermark_ok(struct zone *z, > int order, unsigned long mark, > { > /* free_pages my go negative - that's OK */ > long min = mark; > + long lowmem_res = z->lowmem_reserve[classzone_idx]; > int o; > > free_pages -= (1 << order) - 1; > @@ -1602,7 +1603,7 @@ static bool __zone_watermark_ok(struct zone *z, > int order, unsigned long mark, > if (alloc_flags & ALLOC_HARDER) > min -= min / 4; > > - if (free_pages <= min + z->lowmem_reserve[classzone_idx]) > + if (free_pages <= min + lowmem_res) > return false; > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx137.postini.com [74.125.245.137]) by kanga.kvack.org (Postfix) with SMTP id 2DA7A6B006C for ; Tue, 10 Jul 2012 01:35:29 -0400 (EDT) Received: by lbjn8 with SMTP id n8so21856370lbj.14 for ; Mon, 09 Jul 2012 22:35:26 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20120709142936.GB17314@barrios> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> <20120709142936.GB17314@barrios> Date: Tue, 10 Jul 2012 11:05:26 +0530 Message-ID: Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem From: Aaditya Kumar Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , LKML , linux-mm , Mel Gorman , tim.bird@am.sony.com, frank.rowand@am.sony.com, takuzo.ohara@ap.sony.com, kan.iibuchi@jp.sony.com, aaditya.kumar@ap.sony.com On Mon, Jul 9, 2012 at 7:59 PM, Minchan Kim wrote: Hello Minchan, > May I add your tested-by in next spin which will include automatic type conversion > problem ? Yeah sure. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756490Ab2F0Hvw (ORCPT ); Wed, 27 Jun 2012 03:51:52 -0400 Received: from LGEMRELSE7Q.lge.com ([156.147.1.151]:61190 "EHLO LGEMRELSE7Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756155Ab2F0Hvv (ORCPT ); Wed, 27 Jun 2012 03:51:51 -0400 X-AuditID: 9c930197-b7b43ae00000668c-c0-4feabb940848 From: Minchan Kim To: akpm@linux-foundation.org Cc: KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , Aaditya Kumar , LKML , linux-mm , Minchan Kim Subject: [PATCH 0/2 v2] fix livelock because of kswapd stop Date: Wed, 27 Jun 2012 16:51:52 +0900 Message-Id: <1340783514-8150-1-git-send-email-minchan@kernel.org> X-Mailer: git-send-email 1.7.9.5 X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When hotplug offlining happens on zone A, it starts to mark freed page as MIGRATE_ISOLATE type in buddy for preventing further allocation. (MIGRATE_ISOLATE is very irony type because it's apparently on buddy but we can't allocate them). When the memory shortage happens during hotplug offlining, current task starts to reclaim, then wake up kswapd. Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe doesn't consider MIGRATE_ISOLATE freed page count. Current task continue to reclaim in direct reclaim path without kswapd's helping. The problem is that zone->all_unreclaimable is set by only kswapd so that current task would be looping forever like below. __alloc_pages_slowpath restart: wake_all_kswapd rebalance: __alloc_pages_direct_reclaim do_try_to_free_pages if global_reclaim && !all_unreclaimable return 1; /* It means we did did_some_progress */ skip __alloc_pages_may_oom should_alloc_retry goto rebalance; [1/2] factor out memory-isolation functions from page_alloc.c to mm/page_isolation.c This patch can be merged regardless of [2/2]. [2/2] fix this problem. Aaditya, Could you confirm this patch can solve your problem? Minchan Kim (2): mm: Factor out memory isolate functions memory-hotplug: fix kswapd looping forever problem drivers/base/Kconfig | 1 + include/linux/mmzone.h | 8 ++++ include/linux/page-isolation.h | 8 ++-- mm/Kconfig | 5 ++ mm/Makefile | 4 +- mm/page_alloc.c | 102 ++++++++++++---------------------------- mm/page_isolation.c | 96 +++++++++++++++++++++++++++++++++++++ 7 files changed, 147 insertions(+), 77 deletions(-) -- 1.7.9.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756549Ab2F0Hvy (ORCPT ); Wed, 27 Jun 2012 03:51:54 -0400 Received: from LGEMRELSE7Q.lge.com ([156.147.1.151]:61190 "EHLO LGEMRELSE7Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756233Ab2F0Hvw (ORCPT ); Wed, 27 Jun 2012 03:51:52 -0400 X-AuditID: 9c930197-b7b43ae00000668c-c8-4feabb956e28 From: Minchan Kim To: akpm@linux-foundation.org Cc: KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , Aaditya Kumar , LKML , linux-mm , Minchan Kim , Andi Kleen , Marek Szyprowski Subject: [PATCH 1/2 v2] mm: Factor out memory isolate functions Date: Wed, 27 Jun 2012 16:51:53 +0900 Message-Id: <1340783514-8150-2-git-send-email-minchan@kernel.org> X-Mailer: git-send-email 1.7.9.5 In-Reply-To: <1340783514-8150-1-git-send-email-minchan@kernel.org> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now mm/page_alloc.c has some memory isolation functions but they are used oly when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. So let's make it configurable by new CONFIG_MEMORY_ISOLATION so that it can reduce binary size and we can check it simple by CONFIG_MEMORY_ISOLATION, not if defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. This patch is based on next-20120626 * from v1 - rebase on next-20120626 Cc: Andi Kleen Cc: Marek Szyprowski Cc: KAMEZAWA Hiroyuki Signed-off-by: Minchan Kim --- Andrew, I know this patch is conflict with current mmotm totally but I don't know what tree I should use. I hope it's a git tree instead of quit-based version. Anyway, If you feel it's inconvenient, let me know it and I will resend it on tree you mention. Thanks. drivers/base/Kconfig | 1 + include/linux/page-isolation.h | 8 ++--- mm/Kconfig | 5 +++ mm/Makefile | 4 +-- mm/page_alloc.c | 71 ---------------------------------------- mm/page_isolation.c | 71 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 83 insertions(+), 77 deletions(-) diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 9b21469..08b4c52 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -196,6 +196,7 @@ config CMA bool "Contiguous Memory Allocator (EXPERIMENTAL)" depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK && EXPERIMENTAL select MIGRATION + select MEMORY_ISOLATION help This enables the Contiguous Memory Allocator which allows drivers to allocate big physically-contiguous blocks of memory for use with diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 3bdcab3..0569a3e 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -10,7 +10,7 @@ * free all pages in the range. test_page_isolated() can be used for * test it. */ -extern int +int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, unsigned migratetype); @@ -18,7 +18,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE. * target range is [start_pfn, end_pfn) */ -extern int +int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, unsigned migratetype); @@ -30,8 +30,8 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn); /* * Internal functions. Changes pageblock's migrate type. */ -extern int set_migratetype_isolate(struct page *page); -extern void unset_migratetype_isolate(struct page *page, unsigned migratetype); +int set_migratetype_isolate(struct page *page); +void unset_migratetype_isolate(struct page *page, unsigned migratetype); #endif diff --git a/mm/Kconfig b/mm/Kconfig index 82fed4e..d5c8019 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -140,9 +140,13 @@ config ARCH_DISCARD_MEMBLOCK config NO_BOOTMEM boolean +config MEMORY_ISOLATION + boolean + # eventually, we can have this option just 'select SPARSEMEM' config MEMORY_HOTPLUG bool "Allow for memory hot-add" + select MEMORY_ISOLATION depends on SPARSEMEM || X86_64_ACPI_NUMA depends on HOTPLUG && ARCH_ENABLE_MEMORY_HOTPLUG depends on (IA64 || X86 || PPC_BOOK3S_64 || SUPERH || S390) @@ -272,6 +276,7 @@ config MEMORY_FAILURE depends on MMU depends on ARCH_SUPPORTS_MEMORY_FAILURE bool "Enable recovery from hardware memory errors" + select MEMORY_ISOLATION help Enables code to recover from some memory failures on systems with MCA recovery. This allows a system to continue running diff --git a/mm/Makefile b/mm/Makefile index 25e8002..a0c7725 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -15,8 +15,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ maccess.o page_alloc.o page-writeback.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ - page_isolation.o mm_init.o mmu_context.o percpu.o \ - compaction.o $(mmu-y) + mm_init.o mmu_context.o percpu.o compaction.o $(mmu-y) obj-y += init-mm.o ifdef CONFIG_NO_BOOTMEM @@ -55,3 +54,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o +obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2c29b1c..c175fa9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -51,7 +51,6 @@ #include #include #include -#include #include #include #include @@ -5555,76 +5554,6 @@ bool is_pageblock_removable_nolock(struct page *page) return __count_immobile_pages(zone, page, 0); } -int set_migratetype_isolate(struct page *page) -{ - struct zone *zone; - unsigned long flags, pfn; - struct memory_isolate_notify arg; - int notifier_ret; - int ret = -EBUSY; - - zone = page_zone(page); - - spin_lock_irqsave(&zone->lock, flags); - - pfn = page_to_pfn(page); - arg.start_pfn = pfn; - arg.nr_pages = pageblock_nr_pages; - arg.pages_found = 0; - - /* - * It may be possible to isolate a pageblock even if the - * migratetype is not MIGRATE_MOVABLE. The memory isolation - * notifier chain is used by balloon drivers to return the - * number of pages in a range that are held by the balloon - * driver to shrink memory. If all the pages are accounted for - * by balloons, are free, or on the LRU, isolation can continue. - * Later, for example, when memory hotplug notifier runs, these - * pages reported as "can be isolated" should be isolated(freed) - * by the balloon driver through the memory notifier chain. - */ - notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg); - notifier_ret = notifier_to_errno(notifier_ret); - if (notifier_ret) - goto out; - /* - * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. - * We just check MOVABLE pages. - */ - if (__count_immobile_pages(zone, page, arg.pages_found)) - ret = 0; - - /* - * immobile means "not-on-lru" paes. If immobile is larger than - * removable-by-driver pages reported by notifier, we'll fail. - */ - -out: - if (!ret) { - set_pageblock_migratetype(page, MIGRATE_ISOLATE); - move_freepages_block(zone, page, MIGRATE_ISOLATE); - } - - spin_unlock_irqrestore(&zone->lock, flags); - if (!ret) - drain_all_pages(); - return ret; -} - -void unset_migratetype_isolate(struct page *page, unsigned migratetype) -{ - struct zone *zone; - unsigned long flags; - zone = page_zone(page); - spin_lock_irqsave(&zone->lock, flags); - if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) - goto out; - set_pageblock_migratetype(page, migratetype); - move_freepages_block(zone, page, migratetype); -out: - spin_unlock_irqrestore(&zone->lock, flags); -} - #ifdef CONFIG_CMA static unsigned long pfn_max_align_down(unsigned long pfn) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index c9f0477..1a9cb36 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -5,8 +5,79 @@ #include #include #include +#include #include "internal.h" +int set_migratetype_isolate(struct page *page) +{ + struct zone *zone; + unsigned long flags, pfn; + struct memory_isolate_notify arg; + int notifier_ret; + int ret = -EBUSY; + + zone = page_zone(page); + + spin_lock_irqsave(&zone->lock, flags); + + pfn = page_to_pfn(page); + arg.start_pfn = pfn; + arg.nr_pages = pageblock_nr_pages; + arg.pages_found = 0; + + /* + * It may be possible to isolate a pageblock even if the + * migratetype is not MIGRATE_MOVABLE. The memory isolation + * notifier chain is used by balloon drivers to return the + * number of pages in a range that are held by the balloon + * driver to shrink memory. If all the pages are accounted for + * by balloons, are free, or on the LRU, isolation can continue. + * Later, for example, when memory hotplug notifier runs, these + * pages reported as "can be isolated" should be isolated(freed) + * by the balloon driver through the memory notifier chain. + */ + notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg); + notifier_ret = notifier_to_errno(notifier_ret); + if (notifier_ret) + goto out; + /* + * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. + * We just check MOVABLE pages. + */ + if (__count_immobile_pages(zone, page, arg.pages_found)) + ret = 0; + + /* + * immobile means "not-on-lru" paes. If immobile is larger than + * removable-by-driver pages reported by notifier, we'll fail. + */ + +out: + if (!ret) { + set_pageblock_migratetype(page, MIGRATE_ISOLATE); + move_freepages_block(zone, page, MIGRATE_ISOLATE); + } + + spin_unlock_irqrestore(&zone->lock, flags); + if (!ret) + drain_all_pages(); + return ret; +} + +void unset_migratetype_isolate(struct page *page, unsigned migratetype) +{ + struct zone *zone; + unsigned long flags; + zone = page_zone(page); + spin_lock_irqsave(&zone->lock, flags); + if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) + goto out; + set_pageblock_migratetype(page, migratetype); + move_freepages_block(zone, page, migratetype); +out: + spin_unlock_irqrestore(&zone->lock, flags); +} + static inline struct page * __first_valid_page(unsigned long pfn, unsigned long nr_pages) { -- 1.7.9.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756585Ab2F0Hv4 (ORCPT ); Wed, 27 Jun 2012 03:51:56 -0400 Received: from LGEMRELSE7Q.lge.com ([156.147.1.151]:61190 "EHLO LGEMRELSE7Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756497Ab2F0Hvy (ORCPT ); Wed, 27 Jun 2012 03:51:54 -0400 X-AuditID: 9c930197-b7b43ae00000668c-cb-4feabb95aa9f From: Minchan Kim To: akpm@linux-foundation.org Cc: KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , Aaditya Kumar , LKML , linux-mm , Minchan Kim , Mel Gorman Subject: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem Date: Wed, 27 Jun 2012 16:51:54 +0900 Message-Id: <1340783514-8150-3-git-send-email-minchan@kernel.org> X-Mailer: git-send-email 1.7.9.5 In-Reply-To: <1340783514-8150-1-git-send-email-minchan@kernel.org> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When hotplug offlining happens on zone A, it starts to mark freed page as MIGRATE_ISOLATE type in buddy for preventing further allocation. (MIGRATE_ISOLATE is very irony type because it's apparently on buddy but we can't allocate them). When the memory shortage happens during hotplug offlining, current task starts to reclaim, then wake up kswapd. Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe doesn't consider MIGRATE_ISOLATE freed page count. Current task continue to reclaim in direct reclaim path without kswapd's helping. The problem is that zone->all_unreclaimable is set by only kswapd so that current task would be looping forever like below. __alloc_pages_slowpath restart: wake_all_kswapd rebalance: __alloc_pages_direct_reclaim do_try_to_free_pages if global_reclaim && !all_unreclaimable return 1; /* It means we did did_some_progress */ skip __alloc_pages_may_oom should_alloc_retry goto rebalance; If we apply KOSAKI's patch[1] which doesn't depends on kswapd about setting zone->all_unreclaimable, we can solve this problem by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. It could be a problem still if other subsystem needs GFP_ATOMIC request. So kswapd should consider MIGRATE_ISOLATE when it calculate free pages BEFORE going sleep. This patch counts the number of MIGRATE_ISOLATE page block and zone_watermark_ok_safe will consider it if the system has such blocks (fortunately, it's very rare so no problem in POV overhead and kswapd is never hotpath). Copy/modify from Mel's quote " Ideal solution would be "allocating" the pageblock. It would keep the free space accounting as it is but historically, memory hotplug didn't allocate pages because it would be difficult to detect if a pageblock was isolated or if part of some balloon. Allocating just full pageblocks would work around this, However, it would play very badly with CMA. " [1] http://lkml.org/lkml/2012/6/14/74 * from v1 - add changelog - make functions simple - remove atomic variable - discard exact isolated free page accounting. - rebased on next-20120626 Suggested-by: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki Cc: Aaditya Kumar Cc: Mel Gorman Signed-off-by: Minchan Kim --- Aaditya, coul you confirm this patch solve your problem and make sure nr_pageblock_isolate is zero after hotplug end? Thanks! include/linux/mmzone.h | 8 ++++++++ mm/page_alloc.c | 31 +++++++++++++++++++++++++++++++ mm/page_isolation.c | 29 +++++++++++++++++++++++++++-- 3 files changed, 66 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dbc876e..6ee83b8 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -474,6 +474,14 @@ struct zone { * rarely used fields: */ const char *name; +#ifdef CONFIG_MEMORY_ISOLATION + /* + * the number of MIGRATE_ISOLATE *pageblock*. + * We need this for free page counting. Look at zone_watermark_ok_safe. + * It's protected by zone->lock + */ + int nr_pageblock_isolate; +#endif } ____cacheline_internodealigned_in_smp; typedef enum { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c175fa9..b12c8ec 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -218,6 +218,11 @@ EXPORT_SYMBOL(nr_online_nodes); int page_group_by_mobility_disabled __read_mostly; +/* + * NOTE: + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly. + * Instead, use {un}set_pageblock_isolate. + */ void set_pageblock_migratetype(struct page *page, int migratetype) { if (unlikely(page_group_by_mobility_disabled)) @@ -1614,6 +1619,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, return true; } +#ifdef CONFIG_MEMORY_ISOLATION +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) +{ + unsigned long nr_pages = 0; + + if (unlikely(zone->nr_pageblock_isolate)) { + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; + } + return nr_pages; +} +#else +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) +{ + return 0; +} +#endif + bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, int classzone_idx, int alloc_flags) { @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); + /* + * If the zone has MIGRATE_ISOLATE type free page, + * we should consider it. nr_zone_isolate_freepages is never + * accurate so kswapd might not sleep although she can. + * But it's more desirable for memory hotplug rather than + * forever sleep which cause livelock in direct reclaim path. + */ + free_pages -= nr_zone_isolate_freepages(z); return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, free_pages); } @@ -4407,6 +4437,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, lruvec_init(&zone->lruvec, zone); zap_zone_vm_stats(zone); zone->flags = 0; + zone->nr_pageblock_isolate = 0; if (!size) continue; diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 1a9cb36..c721ec0 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -8,6 +8,31 @@ #include #include "internal.h" +/* called by holding zone->lock */ +static void set_pageblock_isolate(struct zone *zone, struct page *page) +{ + BUG_ON(page_zone(page) != zone); + + if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE) + return; + + set_pageblock_migratetype(page, MIGRATE_ISOLATE); + zone->nr_pageblock_isolate++; +} + +/* called by holding zone->lock */ +static void restore_pageblock_isolate(struct zone *zone, struct page *page, + int migratetype) +{ + BUG_ON(page_zone(page) != zone); + if (WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE)) + return; + + BUG_ON(zone->nr_pageblock_isolate <= 0); + set_pageblock_migratetype(page, migratetype); + zone->nr_pageblock_isolate--; +} + int set_migratetype_isolate(struct page *page) { struct zone *zone; @@ -54,7 +79,7 @@ int set_migratetype_isolate(struct page *page) out: if (!ret) { - set_pageblock_migratetype(page, MIGRATE_ISOLATE); + set_pageblock_isolate(zone, page); move_freepages_block(zone, page, MIGRATE_ISOLATE); } @@ -72,8 +97,8 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype) spin_lock_irqsave(&zone->lock, flags); if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) goto out; - set_pageblock_migratetype(page, migratetype); move_freepages_block(zone, page, migratetype); + restore_pageblock_isolate(zone, page, migratetype); out: spin_unlock_irqrestore(&zone->lock, flags); } -- 1.7.9.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756295Ab2F1HNj (ORCPT ); Thu, 28 Jun 2012 03:13:39 -0400 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:52898 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753814Ab2F1HNi (ORCPT ); Thu, 28 Jun 2012 03:13:38 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.7.4 Message-ID: <4FEC039A.5060506@jp.fujitsu.com> Date: Thu, 28 Jun 2012 16:11:22 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:13.0) Gecko/20120614 Thunderbird/13.0.1 MIME-Version: 1.0 To: Minchan Kim CC: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , Aaditya Kumar , LKML , linux-mm , Andi Kleen , Marek Szyprowski Subject: Re: [PATCH 1/2 v2] mm: Factor out memory isolate functions References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-2-git-send-email-minchan@kernel.org> In-Reply-To: <1340783514-8150-2-git-send-email-minchan@kernel.org> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/06/27 16:51), Minchan Kim wrote: > Now mm/page_alloc.c has some memory isolation functions but they > are used oly when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. > So let's make it configurable by new CONFIG_MEMORY_ISOLATION so that it > can reduce binary size and we can check it simple by > CONFIG_MEMORY_ISOLATION, > not if defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. > > This patch is based on next-20120626 > > * from v1 > - rebase on next-20120626 > > Cc: Andi Kleen > Cc: Marek Szyprowski > Cc: KAMEZAWA Hiroyuki > Signed-off-by: Minchan Kim Acked-by: KAMEZAWA Hiroyuki From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932566Ab2F1HQN (ORCPT ); Thu, 28 Jun 2012 03:16:13 -0400 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:48622 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753851Ab2F1HQL (ORCPT ); Thu, 28 Jun 2012 03:16:11 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.7.4 Message-ID: <4FEC042C.5070509@jp.fujitsu.com> Date: Thu, 28 Jun 2012 16:13:48 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:13.0) Gecko/20120614 Thunderbird/13.0.1 MIME-Version: 1.0 To: Minchan Kim CC: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , Aaditya Kumar , LKML , linux-mm , Mel Gorman Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> In-Reply-To: <1340783514-8150-3-git-send-email-minchan@kernel.org> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/06/27 16:51), Minchan Kim wrote: > When hotplug offlining happens on zone A, it starts to mark freed page > as MIGRATE_ISOLATE type in buddy for preventing further allocation. > (MIGRATE_ISOLATE is very irony type because it's apparently on buddy > but we can't allocate them). > When the memory shortage happens during hotplug offlining, > current task starts to reclaim, then wake up kswapd. > Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe > doesn't consider MIGRATE_ISOLATE freed page count. > Current task continue to reclaim in direct reclaim path without kswapd's helping. > The problem is that zone->all_unreclaimable is set by only kswapd > so that current task would be looping forever like below. > > __alloc_pages_slowpath > restart: > wake_all_kswapd > rebalance: > __alloc_pages_direct_reclaim > do_try_to_free_pages > if global_reclaim && !all_unreclaimable > return 1; /* It means we did did_some_progress */ > skip __alloc_pages_may_oom > should_alloc_retry > goto rebalance; > > If we apply KOSAKI's patch[1] which doesn't depends on kswapd > about setting zone->all_unreclaimable, we can solve this problem > by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. > It could be a problem still if other subsystem needs GFP_ATOMIC request. > So kswapd should consider MIGRATE_ISOLATE when it calculate free pages > BEFORE going sleep. > > This patch counts the number of MIGRATE_ISOLATE page block and > zone_watermark_ok_safe will consider it if the system has such blocks > (fortunately, it's very rare so no problem in POV overhead and kswapd is never > hotpath). > > Copy/modify from Mel's quote > " > Ideal solution would be "allocating" the pageblock. > It would keep the free space accounting as it is but historically, > memory hotplug didn't allocate pages because it would be difficult to > detect if a pageblock was isolated or if part of some balloon. > Allocating just full pageblocks would work around this, However, > it would play very badly with CMA. > " > > [1] http://lkml.org/lkml/2012/6/14/74 > > * from v1 > - add changelog > - make functions simple > - remove atomic variable > - discard exact isolated free page accounting. > - rebased on next-20120626 > > Suggested-by: KOSAKI Motohiro > Cc: KAMEZAWA Hiroyuki > Cc: Aaditya Kumar > Cc: Mel Gorman > Signed-off-by: Minchan Kim > --- > > Aaditya, coul you confirm this patch solve your problem and > make sure nr_pageblock_isolate is zero after hotplug end? > > Thanks! > > include/linux/mmzone.h | 8 ++++++++ > mm/page_alloc.c | 31 +++++++++++++++++++++++++++++++ > mm/page_isolation.c | 29 +++++++++++++++++++++++++++-- > 3 files changed, 66 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index dbc876e..6ee83b8 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -474,6 +474,14 @@ struct zone { > * rarely used fields: > */ > const char *name; > +#ifdef CONFIG_MEMORY_ISOLATION > + /* > + * the number of MIGRATE_ISOLATE *pageblock*. > + * We need this for free page counting. Look at zone_watermark_ok_safe. > + * It's protected by zone->lock > + */ > + int nr_pageblock_isolate; > +#endif > } ____cacheline_internodealigned_in_smp; > > typedef enum { > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c175fa9..b12c8ec 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -218,6 +218,11 @@ EXPORT_SYMBOL(nr_online_nodes); > > int page_group_by_mobility_disabled __read_mostly; > > +/* > + * NOTE: > + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly. > + * Instead, use {un}set_pageblock_isolate. > + */ > void set_pageblock_migratetype(struct page *page, int migratetype) > { > if (unlikely(page_group_by_mobility_disabled)) > @@ -1614,6 +1619,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, > return true; > } > > +#ifdef CONFIG_MEMORY_ISOLATION > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > +{ > + unsigned long nr_pages = 0; > + > + if (unlikely(zone->nr_pageblock_isolate)) { > + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; > + } > + return nr_pages; > +} > +#else > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > +{ > + return 0; > +} > +#endif > + > bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags) > { > @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > > + /* > + * If the zone has MIGRATE_ISOLATE type free page, > + * we should consider it. nr_zone_isolate_freepages is never > + * accurate so kswapd might not sleep although she can. > + * But it's more desirable for memory hotplug rather than > + * forever sleep which cause livelock in direct reclaim path. > + */ > + free_pages -= nr_zone_isolate_freepages(z); Here, free_pages could be < 0 ? Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753973Ab2F1Xeb (ORCPT ); Thu, 28 Jun 2012 19:34:31 -0400 Received: from LGEMRELSE6Q.lge.com ([156.147.1.121]:52718 "EHLO LGEMRELSE6Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752467Ab2F1Xea (ORCPT ); Thu, 28 Jun 2012 19:34:30 -0400 X-AuditID: 9c930179-b7cceae000004195-45-4fecea04bae9 Message-ID: <4FECEA16.4060200@kernel.org> Date: Fri, 29 Jun 2012 08:34:46 +0900 From: Minchan Kim User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Kamezawa Hiroyuki CC: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , Aaditya Kumar , LKML , linux-mm , Mel Gorman Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> <4FEC042C.5070509@jp.fujitsu.com> In-Reply-To: <4FEC042C.5070509@jp.fujitsu.com> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Kame, On 06/28/2012 04:13 PM, Kamezawa Hiroyuki wrote: > (2012/06/27 16:51), Minchan Kim wrote: >> When hotplug offlining happens on zone A, it starts to mark freed page >> as MIGRATE_ISOLATE type in buddy for preventing further allocation. >> (MIGRATE_ISOLATE is very irony type because it's apparently on buddy >> but we can't allocate them). >> When the memory shortage happens during hotplug offlining, >> current task starts to reclaim, then wake up kswapd. >> Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe >> doesn't consider MIGRATE_ISOLATE freed page count. >> Current task continue to reclaim in direct reclaim path without kswapd's helping. >> The problem is that zone->all_unreclaimable is set by only kswapd >> so that current task would be looping forever like below. >> >> __alloc_pages_slowpath >> restart: >> wake_all_kswapd >> rebalance: >> __alloc_pages_direct_reclaim >> do_try_to_free_pages >> if global_reclaim && !all_unreclaimable >> return 1; /* It means we did did_some_progress */ >> skip __alloc_pages_may_oom >> should_alloc_retry >> goto rebalance; >> >> If we apply KOSAKI's patch[1] which doesn't depends on kswapd >> about setting zone->all_unreclaimable, we can solve this problem >> by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. >> It could be a problem still if other subsystem needs GFP_ATOMIC request. >> So kswapd should consider MIGRATE_ISOLATE when it calculate free pages >> BEFORE going sleep. >> >> This patch counts the number of MIGRATE_ISOLATE page block and >> zone_watermark_ok_safe will consider it if the system has such blocks >> (fortunately, it's very rare so no problem in POV overhead and kswapd is never >> hotpath). >> >> Copy/modify from Mel's quote >> " >> Ideal solution would be "allocating" the pageblock. >> It would keep the free space accounting as it is but historically, >> memory hotplug didn't allocate pages because it would be difficult to >> detect if a pageblock was isolated or if part of some balloon. >> Allocating just full pageblocks would work around this, However, >> it would play very badly with CMA. >> " >> >> [1] http://lkml.org/lkml/2012/6/14/74 >> >> * from v1 >> - add changelog >> - make functions simple >> - remove atomic variable >> - discard exact isolated free page accounting. >> - rebased on next-20120626 >> >> Suggested-by: KOSAKI Motohiro >> Cc: KAMEZAWA Hiroyuki >> Cc: Aaditya Kumar >> Cc: Mel Gorman >> Signed-off-by: Minchan Kim >> --- >> >> Aaditya, coul you confirm this patch solve your problem and >> make sure nr_pageblock_isolate is zero after hotplug end? >> >> Thanks! >> >> include/linux/mmzone.h | 8 ++++++++ >> mm/page_alloc.c | 31 +++++++++++++++++++++++++++++++ >> mm/page_isolation.c | 29 +++++++++++++++++++++++++++-- >> 3 files changed, 66 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index dbc876e..6ee83b8 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -474,6 +474,14 @@ struct zone { >> * rarely used fields: >> */ >> const char *name; >> +#ifdef CONFIG_MEMORY_ISOLATION >> + /* >> + * the number of MIGRATE_ISOLATE *pageblock*. >> + * We need this for free page counting. Look at zone_watermark_ok_safe. >> + * It's protected by zone->lock >> + */ >> + int nr_pageblock_isolate; >> +#endif >> } ____cacheline_internodealigned_in_smp; >> >> typedef enum { >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index c175fa9..b12c8ec 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -218,6 +218,11 @@ EXPORT_SYMBOL(nr_online_nodes); >> >> int page_group_by_mobility_disabled __read_mostly; >> >> +/* >> + * NOTE: >> + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly. >> + * Instead, use {un}set_pageblock_isolate. >> + */ >> void set_pageblock_migratetype(struct page *page, int migratetype) >> { >> if (unlikely(page_group_by_mobility_disabled)) >> @@ -1614,6 +1619,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, >> return true; >> } >> >> +#ifdef CONFIG_MEMORY_ISOLATION >> +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) >> +{ >> + unsigned long nr_pages = 0; >> + >> + if (unlikely(zone->nr_pageblock_isolate)) { >> + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; >> + } >> + return nr_pages; >> +} >> +#else >> +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) >> +{ >> + return 0; >> +} >> +#endif >> + >> bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, >> int classzone_idx, int alloc_flags) >> { >> @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, >> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) >> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); >> >> + /* >> + * If the zone has MIGRATE_ISOLATE type free page, >> + * we should consider it. nr_zone_isolate_freepages is never >> + * accurate so kswapd might not sleep although she can. >> + * But it's more desirable for memory hotplug rather than >> + * forever sleep which cause livelock in direct reclaim path. >> + */ >> + free_pages -= nr_zone_isolate_freepages(z); > > Here, free_pages could be < 0 ? It could but __zone_watermark_ok returns false so kswapd is going to work without sleep. It's not bad because nr_zone_isolate_freepages already isn't accurate and hotplug event is temporal so kswapd should work well after end hotplug. -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753716Ab2GINbM (ORCPT ); Mon, 9 Jul 2012 09:31:12 -0400 Received: from mail-lb0-f174.google.com ([209.85.217.174]:40395 "EHLO mail-lb0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753436Ab2GINbL (ORCPT ); Mon, 9 Jul 2012 09:31:11 -0400 MIME-Version: 1.0 In-Reply-To: <1340783514-8150-3-git-send-email-minchan@kernel.org> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> Date: Mon, 9 Jul 2012 19:01:09 +0530 Message-ID: Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem From: Aaditya Kumar To: Minchan Kim Cc: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , LKML , linux-mm , Mel Gorman , tim.bird@am.sony.com, frank.rowand@am.sony.com, takuzo.ohara@ap.sony.com, kan.iibuchi@jp.sony.com, aaditya.kumar@ap.sony.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 27, 2012 at 1:21 PM, Minchan Kim wrote: > When hotplug offlining happens on zone A, it starts to mark freed page > as MIGRATE_ISOLATE type in buddy for preventing further allocation. > (MIGRATE_ISOLATE is very irony type because it's apparently on buddy > but we can't allocate them). > When the memory shortage happens during hotplug offlining, > current task starts to reclaim, then wake up kswapd. > Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe > doesn't consider MIGRATE_ISOLATE freed page count. > Current task continue to reclaim in direct reclaim path without kswapd's helping. > The problem is that zone->all_unreclaimable is set by only kswapd > so that current task would be looping forever like below. > > __alloc_pages_slowpath > restart: > wake_all_kswapd > rebalance: > __alloc_pages_direct_reclaim > do_try_to_free_pages > if global_reclaim && !all_unreclaimable > return 1; /* It means we did did_some_progress */ > skip __alloc_pages_may_oom > should_alloc_retry > goto rebalance; > > If we apply KOSAKI's patch[1] which doesn't depends on kswapd > about setting zone->all_unreclaimable, we can solve this problem > by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. > It could be a problem still if other subsystem needs GFP_ATOMIC request. > So kswapd should consider MIGRATE_ISOLATE when it calculate free pages > BEFORE going sleep. > > This patch counts the number of MIGRATE_ISOLATE page block and > zone_watermark_ok_safe will consider it if the system has such blocks > (fortunately, it's very rare so no problem in POV overhead and kswapd is never > hotpath). > > Copy/modify from Mel's quote > " > Ideal solution would be "allocating" the pageblock. > It would keep the free space accounting as it is but historically, > memory hotplug didn't allocate pages because it would be difficult to > detect if a pageblock was isolated or if part of some balloon. > Allocating just full pageblocks would work around this, However, > it would play very badly with CMA. > " > > [1] http://lkml.org/lkml/2012/6/14/74 > > * from v1 > - add changelog > - make functions simple > - remove atomic variable > - discard exact isolated free page accounting. > - rebased on next-20120626 > > Suggested-by: KOSAKI Motohiro > Cc: KAMEZAWA Hiroyuki > Cc: Aaditya Kumar > Cc: Mel Gorman > Signed-off-by: Minchan Kim > --- > > Aaditya, coul you confirm this patch solve your problem and > make sure nr_pageblock_isolate is zero after hotplug end? I am really sorry for the delay. I just tried this patch on my ARM setup. > > +#ifdef CONFIG_MEMORY_ISOLATION > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > +{ > + unsigned long nr_pages = 0; > + > + if (unlikely(zone->nr_pageblock_isolate)) { > + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; > + } > + return nr_pages; > +} > +#else > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > +{ > + return 0; > +} > +#endif > + > bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags) > { > @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > > + /* > + * If the zone has MIGRATE_ISOLATE type free page, > + * we should consider it. nr_zone_isolate_freepages is never > + * accurate so kswapd might not sleep although she can. > + * But it's more desirable for memory hotplug rather than > + * forever sleep which cause livelock in direct reclaim path. > + */ > + free_pages -= nr_zone_isolate_freepages(z); > return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > free_pages); For my test case, pages to be off lined span the whole node. With this setup the free_pages become negative. (As you and Kamezawa-san already expected.) BUT because of free_pages going negative the memory off lining still livelocks as __zone_watermark_ok() returns true. This is because in below if comparison, because of an unsigned value (z->lowmem_reserve[classzone_idx]) all the longs are converted to unsigned long. static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, int classzone_idx, int alloc_flags, long free_pages) { if (free_pages <= min + z->lowmem_reserve[classzone_idx]) return false; So, may be you can consider following also: As for the nr_pageblock_isolate going back to zero, yes it is going back to zero for my test case.(I tested after this change) --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1594,6 +1594,7 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, { /* free_pages my go negative - that's OK */ long min = mark; + long lowmem_res = z->lowmem_reserve[classzone_idx]; int o; free_pages -= (1 << order) - 1; @@ -1602,7 +1603,7 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, if (alloc_flags & ALLOC_HARDER) min -= min / 4; - if (free_pages <= min + z->lowmem_reserve[classzone_idx]) + if (free_pages <= min + lowmem_res) return false; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754205Ab2GIO3s (ORCPT ); Mon, 9 Jul 2012 10:29:48 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:33066 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753705Ab2GIO3q (ORCPT ); Mon, 9 Jul 2012 10:29:46 -0400 Date: Mon, 9 Jul 2012 23:29:36 +0900 From: Minchan Kim To: Aaditya Kumar Cc: Minchan Kim , akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , LKML , linux-mm , Mel Gorman , tim.bird@am.sony.com, frank.rowand@am.sony.com, takuzo.ohara@ap.sony.com, kan.iibuchi@jp.sony.com, aaditya.kumar@ap.sony.com Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem Message-ID: <20120709142936.GB17314@barrios> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Aaditya, On Mon, Jul 09, 2012 at 07:01:09PM +0530, Aaditya Kumar wrote: > On Wed, Jun 27, 2012 at 1:21 PM, Minchan Kim wrote: > > When hotplug offlining happens on zone A, it starts to mark freed page > > as MIGRATE_ISOLATE type in buddy for preventing further allocation. > > (MIGRATE_ISOLATE is very irony type because it's apparently on buddy > > but we can't allocate them). > > When the memory shortage happens during hotplug offlining, > > current task starts to reclaim, then wake up kswapd. > > Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe > > doesn't consider MIGRATE_ISOLATE freed page count. > > Current task continue to reclaim in direct reclaim path without kswapd's helping. > > The problem is that zone->all_unreclaimable is set by only kswapd > > so that current task would be looping forever like below. > > > > __alloc_pages_slowpath > > restart: > > wake_all_kswapd > > rebalance: > > __alloc_pages_direct_reclaim > > do_try_to_free_pages > > if global_reclaim && !all_unreclaimable > > return 1; /* It means we did did_some_progress */ > > skip __alloc_pages_may_oom > > should_alloc_retry > > goto rebalance; > > > > If we apply KOSAKI's patch[1] which doesn't depends on kswapd > > about setting zone->all_unreclaimable, we can solve this problem > > by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. > > It could be a problem still if other subsystem needs GFP_ATOMIC request. > > So kswapd should consider MIGRATE_ISOLATE when it calculate free pages > > BEFORE going sleep. > > > > This patch counts the number of MIGRATE_ISOLATE page block and > > zone_watermark_ok_safe will consider it if the system has such blocks > > (fortunately, it's very rare so no problem in POV overhead and kswapd is never > > hotpath). > > > > Copy/modify from Mel's quote > > " > > Ideal solution would be "allocating" the pageblock. > > It would keep the free space accounting as it is but historically, > > memory hotplug didn't allocate pages because it would be difficult to > > detect if a pageblock was isolated or if part of some balloon. > > Allocating just full pageblocks would work around this, However, > > it would play very badly with CMA. > > " > > > > [1] http://lkml.org/lkml/2012/6/14/74 > > > > * from v1 > > - add changelog > > - make functions simple > > - remove atomic variable > > - discard exact isolated free page accounting. > > - rebased on next-20120626 > > > > Suggested-by: KOSAKI Motohiro > > Cc: KAMEZAWA Hiroyuki > > Cc: Aaditya Kumar > > Cc: Mel Gorman > > Signed-off-by: Minchan Kim > > --- > > > > Aaditya, coul you confirm this patch solve your problem and > > make sure nr_pageblock_isolate is zero after hotplug end? > > I am really sorry for the delay. > I just tried this patch on my ARM setup. No problem. > > > > > > +#ifdef CONFIG_MEMORY_ISOLATION > > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > > +{ > > + unsigned long nr_pages = 0; > > + > > + if (unlikely(zone->nr_pageblock_isolate)) { > > + nr_pages = zone->nr_pageblock_isolate * pageblock_nr_pages; > > + } > > + return nr_pages; > > +} > > +#else > > +static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) > > +{ > > + return 0; > > +} > > +#endif > > + > > bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > > int classzone_idx, int alloc_flags) > > { > > @@ -1629,6 +1651,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > > if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > > free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > > > > + /* > > + * If the zone has MIGRATE_ISOLATE type free page, > > + * we should consider it. nr_zone_isolate_freepages is never > > + * accurate so kswapd might not sleep although she can. > > + * But it's more desirable for memory hotplug rather than > > + * forever sleep which cause livelock in direct reclaim path. > > + */ > > + free_pages -= nr_zone_isolate_freepages(z); > > return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > > free_pages); > > For my test case, pages to be off lined span the whole node. > With this setup the free_pages become negative. (As you and Kamezawa-san > already expected.) > > BUT because of free_pages going negative the memory off lining still livelocks > as __zone_watermark_ok() returns true. > > This is because in below if comparison, because of an unsigned value > (z->lowmem_reserve[classzone_idx]) > all the longs are converted to unsigned long. Oh my god. It seems you find new unknown BUG. Currently, free_pages passed into __zone_watermark_ok could be zero. In __zone_watemark_ok, it could be minus by this. free_pages -= (1 << order) - 1; and automatic type conversion could make samek result you have seen. I will fix it in next spin. Thanks for the spotting! > > static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags, long free_pages) > { > > if (free_pages <= min + z->lowmem_reserve[classzone_idx]) > return false; > > > > So, may be you can consider following also: > As for the nr_pageblock_isolate going back to zero, yes it is going back to zero > for my test case.(I tested after this change) Thanks for testing! May I add your tested-by in next spin which will include automatic type conversion problem ? > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1594,6 +1594,7 @@ static bool __zone_watermark_ok(struct zone *z, > int order, unsigned long mark, > { > /* free_pages my go negative - that's OK */ > long min = mark; > + long lowmem_res = z->lowmem_reserve[classzone_idx]; > int o; > > free_pages -= (1 << order) - 1; > @@ -1602,7 +1603,7 @@ static bool __zone_watermark_ok(struct zone *z, > int order, unsigned long mark, > if (alloc_flags & ALLOC_HARDER) > min -= min / 4; > > - if (free_pages <= min + z->lowmem_reserve[classzone_idx]) > + if (free_pages <= min + lowmem_res) > return false; > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753330Ab2GJFf3 (ORCPT ); Tue, 10 Jul 2012 01:35:29 -0400 Received: from mail-lb0-f174.google.com ([209.85.217.174]:37990 "EHLO mail-lb0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752774Ab2GJFf2 (ORCPT ); Tue, 10 Jul 2012 01:35:28 -0400 MIME-Version: 1.0 In-Reply-To: <20120709142936.GB17314@barrios> References: <1340783514-8150-1-git-send-email-minchan@kernel.org> <1340783514-8150-3-git-send-email-minchan@kernel.org> <20120709142936.GB17314@barrios> Date: Tue, 10 Jul 2012 11:05:26 +0530 Message-ID: Subject: Re: [PATCH 2/2 v2] memory-hotplug: fix kswapd looping forever problem From: Aaditya Kumar To: Minchan Kim Cc: akpm@linux-foundation.org, KOSAKI Motohiro , Mel Gorman , KAMEZAWA Hiroyuki , LKML , linux-mm , Mel Gorman , tim.bird@am.sony.com, frank.rowand@am.sony.com, takuzo.ohara@ap.sony.com, kan.iibuchi@jp.sony.com, aaditya.kumar@ap.sony.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 9, 2012 at 7:59 PM, Minchan Kim wrote: Hello Minchan, > May I add your tested-by in next spin which will include automatic type conversion > problem ? Yeah sure.