[RFC] Checking for error code in __offline

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC] Checking for error code in __offline_pages
@ 2018-05-23  7:35 Oscar Salvador
  2018-05-23  7:52 ` Michal Hocko
  2018-05-23 14:51 ` David Hildenbrand
  0 siblings, 2 replies; 11+ messages in thread
From: Oscar Salvador @ 2018-05-23  7:35 UTC (permalink / raw)
  To: linux-mm; +Cc: mhocko, vbabka, pasha.tatashin, akpm

Hi,

This is something I spotted while testing offlining memory.

__offline_pages() calls do_migrate_range() to try to migrate a range,
but we do not actually check for the error code.
This, besides of ignoring underlying failures, can led to a situations
where we never break up the loop because we are totally unaware of
what is going on.

They way I spotted this was when trying to offline all memblocks belonging
to a node.
Due to an unfortunate setting with movablecore, memblocks containing bootmem
memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
So while trying to remove that memory, the system failed in:

do_migrate_range()
{
...
	if (PageLRU(page))
		ret = isolate_lru_page(page);
	else
		ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);

	if (!ret)
		// success: do something
	else
		if (page_count(page))
			ret = -EBUSY;
...
}

Since the pages from bootmem are not LRU, we call isolate_movable_page()
but we fail when checking for __PageMovable().
Since the page_count is more than 0 we return -EBUSY, but we do not check this
in our caller, so we keep trying to migrate this memory over and over:

repeat:
...
        pfn = scan_movable_pages(start_pfn, end_pfn);
        if (pfn) { /* We have movable pages */
                ret = do_migrate_range(pfn, end_pfn);
                goto repeat;
        }

But this is not only situation where we can get stuck.
For example, if we fail with -ENOMEM in
migrate_pages()->unmap_and_move()/unmap_and_move_huge_page(), we will keep trying as well.
I think we should really detect these cases and fail with "goto failed_removal".
Something like

--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1651,6 +1651,11 @@ static int __ref __offline_pages(unsigned long start_pfn,
        pfn = scan_movable_pages(start_pfn, end_pfn);
        if (pfn) { /* We have movable pages */
                ret = do_migrate_range(pfn, end_pfn);
+               if (ret) {
+                       if (ret != -ENOMEM)
+                               ret = -EBUSY;
+                       goto failed_removal;
+               }
                goto repeat;
        }

Now, unless I overlooked something
migrate_pages()->unmap_and_move()/unmap_and_move_huge_page() can return:
-ENOMEM
-EAGAIN
-EBUSY
-ENOSYS.

I am not sure if we should differentiate betweeen those errors.
For example, it is possible that in migrate_pages() we just get -EAGAIN,
and we return the number of "retry" we tried without having really failed.
Although, since we do 10 passes it might be considered as failed.

And I am not sure either if we want to propagate the error codes, or in case we fail
in migrate_pages(), whatever the error was (-ENOMEM, -EBUSY, etc.), we
just return -EBUSY.

What do you think?

Thanks
Oscar Salvador

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23  7:35 [RFC] Checking for error code in __offline_pages Oscar Salvador
@ 2018-05-23  7:52 ` Michal Hocko
  2018-05-23  8:16   ` Michal Hocko
  2018-05-23  8:16   ` Oscar Salvador
  2018-05-23 14:51 ` David Hildenbrand
  1 sibling, 2 replies; 11+ messages in thread
From: Michal Hocko @ 2018-05-23  7:52 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed 23-05-18 09:35:47, Oscar Salvador wrote:
> Hi,
> 
> This is something I spotted while testing offlining memory.
> 
> __offline_pages() calls do_migrate_range() to try to migrate a range,
> but we do not actually check for the error code.

Yes, this is intentional. do_migrate_range doesn't distinguish between
temporal and permanent migration failure. Getting EBUSY would be just
too easy and that is why we retry. We rely on start_isolate_page_range
to tell us about any non-migrateable pages and we consider all other
failures as temporal.

> This, besides of ignoring underlying failures, can led to a situations
> where we never break up the loop because we are totally unaware of
> what is going on.

This shouldn't happen. If it does then start_isolate_page_range should
handle those non-migrateable pages.

> They way I spotted this was when trying to offline all memblocks belonging
> to a node.
> Due to an unfortunate setting with movablecore, memblocks containing bootmem
> memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.

This is a bug as well. Zone movable shouldn't contain any
non-migrateable pages.

[...]

> Since the pages from bootmem are not LRU, we call isolate_movable_page()
> but we fail when checking for __PageMovable().
> Since the page_count is more than 0 we return -EBUSY, but we do not check this
> in our caller, so we keep trying to migrate this memory over and over:
> 
> repeat:
> ...
>         pfn = scan_movable_pages(start_pfn, end_pfn);
>         if (pfn) { /* We have movable pages */
>                 ret = do_migrate_range(pfn, end_pfn);
>                 goto repeat;
>         }
> 
> But this is not only situation where we can get stuck.
> For example, if we fail with -ENOMEM in
> migrate_pages()->unmap_and_move()/unmap_and_move_huge_page(), we will keep trying as well.

ENOMEM is highly unlikely because we are should be allocating only small
order pages and those do not fail unless the originator is killed by the
oom killer and we would break out of the loop in such a cace because of
signals pending.

> I think we should really detect these cases and fail with "goto failed_removal".
> Something like
> 
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1651,6 +1651,11 @@ static int __ref __offline_pages(unsigned long start_pfn,
>         pfn = scan_movable_pages(start_pfn, end_pfn);
>         if (pfn) { /* We have movable pages */
>                 ret = do_migrate_range(pfn, end_pfn);
> +               if (ret) {
> +                       if (ret != -ENOMEM)
> +                               ret = -EBUSY;
> +                       goto failed_removal;
> +               }
>                 goto repeat;
>         }

no, not really. As explained above this would allow to fail the
offlining way too easily. Yeah, the current code is far from optimal. We
used to have a retry count but that one was removed exactly because of
premature failures. There are three things here
1) zone_movable should contain any bootmem or otherwise non-migrateable
   pages
2) start_isolate_page_range should fail when seeing such pages - maybe
   has_unmovable_pages is overly optimistic and it should check all
   pages even in movable zones.
3) migrate_pages should really tell us whether the failure is temporal
   or permanent. I am not sure we can do that easily though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23  7:52 ` Michal Hocko
@ 2018-05-23  8:16   ` Michal Hocko
  2018-05-23  8:19     ` Oscar Salvador
                       ` (2 more replies)
  2018-05-23  8:16   ` Oscar Salvador
  1 sibling, 3 replies; 11+ messages in thread
From: Michal Hocko @ 2018-05-23  8:16 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed 23-05-18 09:52:39, Michal Hocko wrote:
[...]
> Yeah, the current code is far from optimal. We
> used to have a retry count but that one was removed exactly because of
> premature failures. There are three things here
> 1) zone_movable should contain any bootmem or otherwise non-migrateable
>    pages
> 2) start_isolate_page_range should fail when seeing such pages - maybe
>    has_unmovable_pages is overly optimistic and it should check all
>    pages even in movable zones.
> 3) migrate_pages should really tell us whether the failure is temporal
>    or permanent. I am not sure we can do that easily though.

2) should be the most simple one for now. Could you give it a try? Btw.
the exact configuration that led to boothmem pages in zone_movable would
be really appreciated:
--- 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23  8:16   ` Michal Hocko
@ 2018-05-23  8:19     ` Oscar Salvador
  2018-05-23  9:28     ` Oscar Salvador
  2018-05-23 10:26     ` Oscar Salvador
  2 siblings, 0 replies; 11+ messages in thread
From: Oscar Salvador @ 2018-05-23  8:19 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed, May 23, 2018 at 10:16:09AM +0200, Michal Hocko wrote:
> On Wed 23-05-18 09:52:39, Michal Hocko wrote:
> [...]
> > Yeah, the current code is far from optimal. We
> > used to have a retry count but that one was removed exactly because of
> > premature failures. There are three things here
> > 1) zone_movable should contain any bootmem or otherwise non-migrateable
> >    pages
> > 2) start_isolate_page_range should fail when seeing such pages - maybe
> >    has_unmovable_pages is overly optimistic and it should check all
> >    pages even in movable zones.
> > 3) migrate_pages should really tell us whether the failure is temporal
> >    or permanent. I am not sure we can do that easily though.
> 
> 2) should be the most simple one for now. Could you give it a try? Btw.
> the exact configuration that led to boothmem pages in zone_movable would
> be really appreciated:

I will try it out and I will paste the config.

> --- 
> From 6aa144a9b1c01255c89a4592221d706ccc4b4eea Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 23 May 2018 10:04:20 +0200
> Subject: [PATCH] mm, memory_hotplug: make has_unmovable_pages more robust
> 
> Oscar has reported:
> : Due to an unfortunate setting with movablecore, memblocks containing bootmem
> : memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
> : So while trying to remove that memory, the system failed in do_migrate_range
> : and __offline_pages never returned.
> 
> This is because we rely on start_isolate_page_range resp. has_unmovable_pages
> to do their jobb. The first one isolates the whole range to be offlined
> so that we do not allocate from it anymore and the later makes sure we
> are not stumbling over non-migrateable pages.
> 
> has_unmovable_pages is overly optimistic, however. It doesn't check all
> the pages if we are withing zone_movable because we rely that those
> pages will be always migrateable. As it turns out we are still not
> perfect there. While bootmem pages in zonemovable sound like a clear bug
> which should be fixed let's remove the optimization for now and warn if
> we encounter unmovable pages in zone_movable in the meantime. That
> should help for now at least.
> 
> Btw. this wasn't a real problem until 72b39cfc4d75 ("mm, memory_hotplug:
> do not fail offlining too early") because we used to have a small number
> of retries and then failed. This turned out to be too fragile though.
> 
> Reported-by: Oscar Salvador <osalvador@techadventures.net>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c6f4008ea55..b9a45753244d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7629,11 +7629,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  	unsigned long pfn, iter, found;
>  
>  	/*
> -	 * For avoiding noise data, lru_add_drain_all() should be called
> -	 * If ZONE_MOVABLE, the zone never contains unmovable pages
> +	 * TODO we could make this much more efficient by not checking every
> +	 * page in the range if we know all of them are in MOVABLE_ZONE and
> +	 * that the movable zone guarantees that pages are migratable but
> +	 * the later is not the case right now unfortunatelly. E.g. movablecore
> +	 * can still lead to having bootmem allocations in zone_movable.
>  	 */
> -	if (zone_idx(zone) == ZONE_MOVABLE)
> -		return false;
>  
>  	/*
>  	 * CMA allocations (alloc_contig_range) really need to mark isolate
> @@ -7654,7 +7655,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  		page = pfn_to_page(check);
>  
>  		if (PageReserved(page))
> -			return true;
> +			goto unmovable;
>  
>  		/*
>  		 * Hugepages are not in LRU lists, but they're movable.
> @@ -7704,9 +7705,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  		 * page at boot.
>  		 */
>  		if (found > count)
> -			return true;
> +			goto unmovable;
>  	}
>  	return false;
> +unmovable:
> +	WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
> +	return true;
>  }
>  
>  #if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || defined(CONFIG_CMA)
> -- 
> 2.17.0

Thanks

Oscar Salvador

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23  8:16   ` Michal Hocko
  2018-05-23  8:19     ` Oscar Salvador
@ 2018-05-23  9:28     ` Oscar Salvador
  2018-05-23 10:26     ` Oscar Salvador
  2 siblings, 0 replies; 11+ messages in thread
From: Oscar Salvador @ 2018-05-23  9:28 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed, May 23, 2018 at 10:16:09AM +0200, Michal Hocko wrote:
> On Wed 23-05-18 09:52:39, Michal Hocko wrote:
> [...]
> > Yeah, the current code is far from optimal. We
> > used to have a retry count but that one was removed exactly because of
> > premature failures. There are three things here
> > 1) zone_movable should contain any bootmem or otherwise non-migrateable
> >    pages
> > 2) start_isolate_page_range should fail when seeing such pages - maybe
> >    has_unmovable_pages is overly optimistic and it should check all
> >    pages even in movable zones.
> > 3) migrate_pages should really tell us whether the failure is temporal
> >    or permanent. I am not sure we can do that easily though.
> 
> 2) should be the most simple one for now. Could you give it a try? Btw.
> the exact configuration that led to boothmem pages in zone_movable would
> be really appreciated:
> --- 
> From 6aa144a9b1c01255c89a4592221d706ccc4b4eea Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 23 May 2018 10:04:20 +0200
> Subject: [PATCH] mm, memory_hotplug: make has_unmovable_pages more robust
> 
> Oscar has reported:
> : Due to an unfortunate setting with movablecore, memblocks containing bootmem
> : memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
> : So while trying to remove that memory, the system failed in do_migrate_range
> : and __offline_pages never returned.
> 
> This is because we rely on start_isolate_page_range resp. has_unmovable_pages
> to do their jobb. The first one isolates the whole range to be offlined
> so that we do not allocate from it anymore and the later makes sure we
> are not stumbling over non-migrateable pages.
> 
> has_unmovable_pages is overly optimistic, however. It doesn't check all
> the pages if we are withing zone_movable because we rely that those
> pages will be always migrateable. As it turns out we are still not
> perfect there. While bootmem pages in zonemovable sound like a clear bug
> which should be fixed let's remove the optimization for now and warn if
> we encounter unmovable pages in zone_movable in the meantime. That
> should help for now at least.
> 
> Btw. this wasn't a real problem until 72b39cfc4d75 ("mm, memory_hotplug:
> do not fail offlining too early") because we used to have a small number
> of retries and then failed. This turned out to be too fragile though.
> 
> Reported-by: Oscar Salvador <osalvador@techadventures.net>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c6f4008ea55..b9a45753244d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7629,11 +7629,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  	unsigned long pfn, iter, found;
>  
>  	/*
> -	 * For avoiding noise data, lru_add_drain_all() should be called
> -	 * If ZONE_MOVABLE, the zone never contains unmovable pages
> +	 * TODO we could make this much more efficient by not checking every
> +	 * page in the range if we know all of them are in MOVABLE_ZONE and
> +	 * that the movable zone guarantees that pages are migratable but
> +	 * the later is not the case right now unfortunatelly. E.g. movablecore
> +	 * can still lead to having bootmem allocations in zone_movable.
>  	 */
> -	if (zone_idx(zone) == ZONE_MOVABLE)
> -		return false;
>  
>  	/*
>  	 * CMA allocations (alloc_contig_range) really need to mark isolate
> @@ -7654,7 +7655,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  		page = pfn_to_page(check);
>  
>  		if (PageReserved(page))
> -			return true;
> +			goto unmovable;
>  
>  		/*
>  		 * Hugepages are not in LRU lists, but they're movable.
> @@ -7704,9 +7705,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  		 * page at boot.
>  		 */
>  		if (found > count)
> -			return true;
> +			goto unmovable;
>  	}
>  	return false;
> +unmovable:
> +	WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
> +	return true;
>  }
>  
>  #if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || defined(CONFIG_CMA)
> -- 
> 2.17.0

Tested-by: Oscar Salvador <osalvador@techadventures.net>

thanks!
Oscar Salvador

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23  8:16   ` Michal Hocko
  2018-05-23  8:19     ` Oscar Salvador
  2018-05-23  9:28     ` Oscar Salvador
@ 2018-05-23 10:26     ` Oscar Salvador
  2018-05-23 11:38       ` Michal Hocko
  2 siblings, 1 reply; 11+ messages in thread
From: Oscar Salvador @ 2018-05-23 10:26 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed, May 23, 2018 at 10:16:09AM +0200, Michal Hocko wrote:
> On Wed 23-05-18 09:52:39, Michal Hocko wrote:
> [...]
> > Yeah, the current code is far from optimal. We
> > used to have a retry count but that one was removed exactly because of
> > premature failures. There are three things here
> > 1) zone_movable should contain any bootmem or otherwise non-migrateable
> >    pages
> > 2) start_isolate_page_range should fail when seeing such pages - maybe
> >    has_unmovable_pages is overly optimistic and it should check all
> >    pages even in movable zones.
> > 3) migrate_pages should really tell us whether the failure is temporal
> >    or permanent. I am not sure we can do that easily though.
> 
> 2) should be the most simple one for now. Could you give it a try? Btw.
> the exact configuration that led to boothmem pages in zone_movable would
> be really appreciated:
 
Here is some information:

** Qemu cmdline:

# qemu-system-x86_64 -enable-kvm -smp 2  -monitor pty -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M ...
# Option movablecore=4G (cmdline)

** e820 map and some numa information:

linux kernel: BIOS-provided physical RAM map:
linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
linux kernel: NX (Execute Disable) protection: active
linux kernel: SMBIOS 2.8 present.
linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 
linux kernel: Hypervisor detected: KVM
linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000


linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]


** /proc/zoneinfo

Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 2107
      nr_active_anon 49560
      nr_inactive_file 25375
      nr_active_file 19038
      nr_unevictable 12
      nr_slab_reclaimable 5996
      nr_slab_unreclaimable 7236
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_refault 0
      workingset_activate 0
      workingset_nodereclaim 0
      nr_anon_pages 48910
      nr_mapped    13780
      nr_file_pages 46676
      nr_dirty     13
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     2263
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_anon_transparent_hugepages 50
      nr_unstable  0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   17749
      nr_written   17462
                   83328
  pages free     3961
        min      29
        low      36
        high     43
        spanned  4095
        present  3998
        managed  3977
        protection: (0, 2939, 2939, 3898, 3898)
      nr_free_pages 3961
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_page_table_pages 0
      nr_kernel_stack 0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     2
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   1
      numa_other   1
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 4
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 4
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone    DMA32
  pages free     724414
        min      5583
        low      6978
        high     8373
        spanned  1044480
        present  782304
        managed  758516
        protection: (0, 0, 0, 959, 959)
      nr_free_pages 724414
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 1697
      nr_zone_active_file 8915
      nr_zone_unevictable 0
      nr_zone_write_pending 12
      nr_mlock     0
      nr_page_table_pages 2976
      nr_kernel_stack 4000
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     281025
      numa_miss    0
      numa_foreign 0
      numa_interleave 8583
      numa_local   135392
      numa_other   145633
  pagesets
    cpu: 0
              count: 164
              high:  186
              batch: 31
  vm stats threshold: 24
    cpu: 1
              count: 32
              high:  186
              batch: 31
  vm stats threshold: 24
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 7677, 7677)
Node 0, zone  Movable
  pages free     160140
        min      1823
        low      2278
        high     2733
        spanned  262144
        present  262144
        managed  245670
        protection: (0, 0, 0, 0, 0)
      nr_free_pages 160140
      nr_zone_inactive_anon 2107
      nr_zone_active_anon 49560
      nr_zone_inactive_file 23678
      nr_zone_active_file 10123
      nr_zone_unevictable 12
      nr_zone_write_pending 1
      nr_mlock     12
      nr_page_table_pages 0
      nr_kernel_stack 0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     214370
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   214344
      numa_other   26
  pagesets
    cpu: 0
              count: 32
              high:  42
              batch: 7
  vm stats threshold: 16
    cpu: 1
              count: 26
              high:  42
              batch: 7
  vm stats threshold: 16
  node_unreclaimable:  0
  start_pfn:           1048576
Node 0, zone   Device
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0, 0)
Node 1, zone      DMA
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 2014, 2014)
Node 1, zone    DMA32
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 2014, 2014)
Node 1, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 16117, 16117)
Node 1, zone  Movable
  per-node stats
      nr_inactive_anon 524
      nr_active_anon 25734
      nr_inactive_file 28733
      nr_active_file 12316
      nr_unevictable 8
      nr_slab_reclaimable 0
      nr_slab_unreclaimable 0
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_refault 0
      workingset_activate 0
      workingset_nodereclaim 0
      nr_anon_pages 24656
      nr_mapped    16871
      nr_file_pages 41647
      nr_dirty     1
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     598
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_anon_transparent_hugepages 8
      nr_unstable  0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   125
      nr_written   98
                   0
  pages free     448427
        min      3827
        low      4783
        high     5739
        spanned  524288
        present  524288
        managed  515766
        protection: (0, 0, 0, 0, 0)
      nr_free_pages 448427
      nr_zone_inactive_anon 524
      nr_zone_active_anon 25734
      nr_zone_inactive_file 28733
      nr_zone_active_file 12316
      nr_zone_unevictable 8
      nr_zone_write_pending 1
      nr_mlock     8
      nr_page_table_pages 0
      nr_kernel_stack 0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     199599
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   199599
      numa_other   0
  pagesets
    cpu: 0
              count: 9
              high:  42
              batch: 7
  vm stats threshold: 20
    cpu: 1
              count: 2
              high:  42
              batch: 7
  vm stats threshold: 20
  node_unreclaimable:  0
  start_pfn:           1310720
Node 1, zone   Device
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0, 0)


I hope this is enough.

Thanks
Oscar Salvador

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23 10:26     ` Oscar Salvador
@ 2018-05-23 11:38       ` Michal Hocko
  2018-05-23 11:53         ` Oscar Salvador
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2018-05-23 11:38 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed 23-05-18 12:26:43, Oscar Salvador wrote:
> On Wed, May 23, 2018 at 10:16:09AM +0200, Michal Hocko wrote:
> > On Wed 23-05-18 09:52:39, Michal Hocko wrote:
> > [...]
> > > Yeah, the current code is far from optimal. We
> > > used to have a retry count but that one was removed exactly because of
> > > premature failures. There are three things here
> > > 1) zone_movable should contain any bootmem or otherwise non-migrateable
> > >    pages
> > > 2) start_isolate_page_range should fail when seeing such pages - maybe
> > >    has_unmovable_pages is overly optimistic and it should check all
> > >    pages even in movable zones.
> > > 3) migrate_pages should really tell us whether the failure is temporal
> > >    or permanent. I am not sure we can do that easily though.
> > 
> > 2) should be the most simple one for now. Could you give it a try? Btw.
> > the exact configuration that led to boothmem pages in zone_movable would
> > be really appreciated:
>  
> Here is some information:
> 
> ** Qemu cmdline:
> 
> # qemu-system-x86_64 -enable-kvm -smp 2  -monitor pty -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M ...
> # Option movablecore=4G (cmdline)
> 
> ** e820 map and some numa information:
> 
> linux kernel: BIOS-provided physical RAM map:
> linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
> linux kernel: NX (Execute Disable) protection: active
> linux kernel: SMBIOS 2.8 present.
> linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 
> linux kernel: Hypervisor detected: KVM
> linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
> linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
> 
> linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
> linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
> linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
> linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
> linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
> linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
> linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
> linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
> linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
> linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]

Could you also paste
"Zone ranges:"
and the follow up messages?

>From the zoneinfo it seems the movable zone got placed to both nodes.
And only Node0 is marked as hotplugable so early allocations can be
placed to Node1.
 
> ** /proc/zoneinfo
[...]
> Node 0, zone  Movable
>   pages free     160140
>         min      1823
>         low      2278
>         high     2733
>         spanned  262144
>         present  262144
>         managed  245670

it seems that 1G went to Node0

> Node 1, zone  Movable
[...]
>   pages free     448427
>         min      3827
>         low      4783
>         high     5739
>         spanned  524288
>         present  524288
>         managed  515766

and the rest to Node1. Guessing from spanned-managed it seems that used
memory is for memmaps (struct page arrays).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23 11:38       ` Michal Hocko
@ 2018-05-23 11:53         ` Oscar Salvador
  0 siblings, 0 replies; 11+ messages in thread
From: Oscar Salvador @ 2018-05-23 11:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed, May 23, 2018 at 01:38:57PM +0200, Michal Hocko wrote:
> On Wed 23-05-18 12:26:43, Oscar Salvador wrote:
> > On Wed, May 23, 2018 at 10:16:09AM +0200, Michal Hocko wrote:
> > > On Wed 23-05-18 09:52:39, Michal Hocko wrote:
> > > [...]
> > > > Yeah, the current code is far from optimal. We
> > > > used to have a retry count but that one was removed exactly because of
> > > > premature failures. There are three things here
> > > > 1) zone_movable should contain any bootmem or otherwise non-migrateable
> > > >    pages
> > > > 2) start_isolate_page_range should fail when seeing such pages - maybe
> > > >    has_unmovable_pages is overly optimistic and it should check all
> > > >    pages even in movable zones.
> > > > 3) migrate_pages should really tell us whether the failure is temporal
> > > >    or permanent. I am not sure we can do that easily though.
> > > 
> > > 2) should be the most simple one for now. Could you give it a try? Btw.
> > > the exact configuration that led to boothmem pages in zone_movable would
> > > be really appreciated:
> >  
> > Here is some information:
> > 
> > ** Qemu cmdline:
> > 
> > # qemu-system-x86_64 -enable-kvm -smp 2  -monitor pty -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M ...
> > # Option movablecore=4G (cmdline)
> > 
> > ** e820 map and some numa information:
> > 
> > linux kernel: BIOS-provided physical RAM map:
> > linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> > linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> > linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> > linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> > linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> > linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> > linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
> > linux kernel: NX (Execute Disable) protection: active
> > linux kernel: SMBIOS 2.8 present.
> > linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 
> > linux kernel: Hypervisor detected: KVM
> > linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> > linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
> > linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
> > 
> > linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
> > linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
> > linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
> > linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
> > linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
> > linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
> > linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
> > linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
> > linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
> > linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
> 
> Could you also paste
> "Zone ranges:"
> and the follow up messages?

Michal, here is the output about "Zone ranges:"

linux kernel: Zone ranges:
linux kernel:   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
linux kernel:   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
linux kernel:   Normal   [mem 0x0000000100000000-0x00000001bfffffff]
linux kernel:   Device   empty
linux kernel: Movable zone start for each node
linux kernel:   Node 0: 0x0000000100000000
linux kernel:   Node 1: 0x0000000140000000
linux kernel: Early memory node ranges
linux kernel:   node   0: [mem 0x0000000000001000-0x000000000009efff]
linux kernel:   node   0: [mem 0x0000000000100000-0x00000000bffdffff]
linux kernel:   node   0: [mem 0x0000000100000000-0x000000013fffffff]
linux kernel:   node   1: [mem 0x0000000140000000-0x00000001bfffffff]
linux kernel: Initmem setup node 0 [mem 0x0000000000001000-0x000000013fffffff]
linux kernel: On node 0 totalpages: 1048446
linux kernel:   DMA zone: 64 pages used for memmap
linux kernel:   DMA zone: 21 pages reserved
linux kernel:   DMA zone: 3998 pages, LIFO batch:0
linux kernel:   DMA32 zone: 12224 pages used for memmap
linux kernel:   DMA32 zone: 782304 pages, LIFO batch:31
linux kernel:   Movable zone: 4096 pages used for memmap
linux kernel:   Movable zone: 262144 pages, LIFO batch:31
linux kernel: Initmem setup node 1 [mem 0x0000000140000000-0x00000001bfffffff]
linux kernel: On node 1 totalpages: 524288
linux kernel:   Movable zone: 8192 pages used for memmap
linux kernel:   Movable zone: 524288 pages, LIFO batch:31
linux kernel: Reserved but unavailable: 98 pages

 
Oscar Salvador

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23  7:52 ` Michal Hocko
  2018-05-23  8:16   ` Michal Hocko
@ 2018-05-23  8:16   ` Oscar Salvador
  2018-05-23  8:32     ` Michal Hocko
  1 sibling, 1 reply; 11+ messages in thread
From: Oscar Salvador @ 2018-05-23  8:16 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed, May 23, 2018 at 09:52:39AM +0200, Michal Hocko wrote:
> On Wed 23-05-18 09:35:47, Oscar Salvador wrote:
> > Hi,
> > 
> > This is something I spotted while testing offlining memory.
> > 
> > __offline_pages() calls do_migrate_range() to try to migrate a range,
> > but we do not actually check for the error code.
> 
> Yes, this is intentional. do_migrate_range doesn't distinguish between
> temporal and permanent migration failure. Getting EBUSY would be just
> too easy and that is why we retry. We rely on start_isolate_page_range
> to tell us about any non-migrateable pages and we consider all other
> failures as temporal.
> 
> > This, besides of ignoring underlying failures, can led to a situations
> > where we never break up the loop because we are totally unaware of
> > what is going on.
> 
> This shouldn't happen. If it does then start_isolate_page_range should
> handle those non-migrateable pages.
> 
> > They way I spotted this was when trying to offline all memblocks belonging
> > to a node.
> > Due to an unfortunate setting with movablecore, memblocks containing bootmem
> > memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
> 
> This is a bug as well. Zone movable shouldn't contain any
> non-migrateable pages.
> 
> [...]
> 
> > Since the pages from bootmem are not LRU, we call isolate_movable_page()
> > but we fail when checking for __PageMovable().
> > Since the page_count is more than 0 we return -EBUSY, but we do not check this
> > in our caller, so we keep trying to migrate this memory over and over:
> > 
> > repeat:
> > ...
> >         pfn = scan_movable_pages(start_pfn, end_pfn);
> >         if (pfn) { /* We have movable pages */
> >                 ret = do_migrate_range(pfn, end_pfn);
> >                 goto repeat;
> >         }
> > 
> > But this is not only situation where we can get stuck.
> > For example, if we fail with -ENOMEM in
> > migrate_pages()->unmap_and_move()/unmap_and_move_huge_page(), we will keep trying as well.
> 
> ENOMEM is highly unlikely because we are should be allocating only small
> order pages and those do not fail unless the originator is killed by the
> oom killer and we would break out of the loop in such a cace because of
> signals pending.
> 
> > I think we should really detect these cases and fail with "goto failed_removal".
> > Something like
> > 
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1651,6 +1651,11 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >         pfn = scan_movable_pages(start_pfn, end_pfn);
> >         if (pfn) { /* We have movable pages */
> >                 ret = do_migrate_range(pfn, end_pfn);
> > +               if (ret) {
> > +                       if (ret != -ENOMEM)
> > +                               ret = -EBUSY;
> > +                       goto failed_removal;
> > +               }
> >                 goto repeat;
> >         }
> 
> no, not really. As explained above this would allow to fail the
> offlining way too easily. Yeah, the current code is far from optimal. We
> used to have a retry count but that one was removed exactly because of
> premature failures. There are three things here
> 1) zone_movable should contain any bootmem or otherwise non-migrateable
>    pages
> 2) start_isolate_page_range should fail when seeing such pages - maybe
>    has_unmovable_pages is overly optimistic and it should check all
>    pages even in movable zones.

I will see if I can work this out.

> 3) migrate_pages should really tell us whether the failure is temporal
>    or permanent. I am not sure we can do that easily though.

AFAIU, permament errors are things like -EBUSY, -ENOSYS, -ENOMEM,
and a temporary one would be -EAGAIN?
Maybe it is overcomplicated, but what about adding another parameter to
migrate_pages() where we set the real error.
something like:

int migrate_pages(struct list_head *from, new_page_t get_new_page,
		free_page_t put_new_page, unsigned long private,
		enum migrate_mode mode, int reason, int *error)

Now it is not possible to find out why did we fail there.
We just get the number of pages that were not migrated (unless it is -ENOMEM, 
which completely bails out and returns that)
For -EBUSY,-ENOSYS and -EAGAIN we just increment some value and return it.

Although as I said, this might be overcomplicating things.

Oscar Salvador

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23  8:16   ` Oscar Salvador
@ 2018-05-23  8:32     ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2018-05-23  8:32 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: linux-mm, vbabka, pasha.tatashin, akpm

On Wed 23-05-18 10:16:49, Oscar Salvador wrote:
[...]
> AFAIU, permament errors are things like -EBUSY, -ENOSYS, -ENOMEM,
> and a temporary one would be -EAGAIN?

It would be really great to have EBUSY as permanent and ENOMEM and
EAGAIN as temporary failures. But this is not so easy. The migration
code failes on the elevated ref count usually and we simply do not know
whether this is a short term pin or somebody holding the reference
basically for ever (from the migration POV). There was some discussion
about longterm pins on pages at LSFMM this year but it will take quite
some time before we will get some working solution.

> Maybe it is overcomplicated, but what about adding another parameter to
> migrate_pages() where we set the real error.
> something like:
> 
> int migrate_pages(struct list_head *from, new_page_t get_new_page,
> 		free_page_t put_new_page, unsigned long private,
> 		enum migrate_mode mode, int reason, int *error)

I am not sure we really need a new parameter. migrate_pages will tell us
the failure. We just do not know _which_ error to return currently.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Checking for error code in __offline_pages
  2018-05-23  7:35 [RFC] Checking for error code in __offline_pages Oscar Salvador
  2018-05-23  7:52 ` Michal Hocko
@ 2018-05-23 14:51 ` David Hildenbrand
  1 sibling, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2018-05-23 14:51 UTC (permalink / raw)
  To: Oscar Salvador, linux-mm; +Cc: mhocko, vbabka, pasha.tatashin, akpm

On 23.05.2018 09:35, Oscar Salvador wrote:
> Hi,
> 
> This is something I spotted while testing offlining memory.
> 
> __offline_pages() calls do_migrate_range() to try to migrate a range,
> but we do not actually check for the error code.
> This, besides of ignoring underlying failures, can led to a situations
> where we never break up the loop because we are totally unaware of
> what is going on.
> 
> They way I spotted this was when trying to offline all memblocks belonging
> to a node.
> Due to an unfortunate setting with movablecore, memblocks containing bootmem
> memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
> So while trying to remove that memory, the system failed in:
> 
> do_migrate_range()
> {
> ...
> 	if (PageLRU(page))
> 		ret = isolate_lru_page(page);
> 	else
> 		ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
> 
> 	if (!ret)
> 		// success: do something
> 	else
> 		if (page_count(page))
> 			ret = -EBUSY;
> ...
> }
> 
> Since the pages from bootmem are not LRU, we call isolate_movable_page()
> but we fail when checking for __PageMovable().
> Since the page_count is more than 0 we return -EBUSY, but we do not check this
> in our caller, so we keep trying to migrate this memory over and over:
> 
> repeat:
> ...
>         pfn = scan_movable_pages(start_pfn, end_pfn);
>         if (pfn) { /* We have movable pages */
>                 ret = do_migrate_range(pfn, end_pfn);
>                 goto repeat;
>         }
> 
> But this is not only situation where we can get stuck.
> For example, if we fail with -ENOMEM in
> migrate_pages()->unmap_and_move()/unmap_and_move_huge_page(), we will keep trying as well.
> I think we should really detect these cases and fail with "goto failed_removal".
> Something like
> 
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1651,6 +1651,11 @@ static int __ref __offline_pages(unsigned long start_pfn,
>         pfn = scan_movable_pages(start_pfn, end_pfn);
>         if (pfn) { /* We have movable pages */
>                 ret = do_migrate_range(pfn, end_pfn);
> +               if (ret) {
> +                       if (ret != -ENOMEM)
> +                               ret = -EBUSY;
> +                       goto failed_removal;
> +               }
>                 goto repeat;
>         }
> 
> Now, unless I overlooked something
> migrate_pages()->unmap_and_move()/unmap_and_move_huge_page() can return:
> -ENOMEM
> -EAGAIN
> -EBUSY
> -ENOSYS.
> 
> I am not sure if we should differentiate betweeen those errors.
> For example, it is possible that in migrate_pages() we just get -EAGAIN,
> and we return the number of "retry" we tried without having really failed.
> Although, since we do 10 passes it might be considered as failed.
> 
> And I am not sure either if we want to propagate the error codes, or in case we fail
> in migrate_pages(), whatever the error was (-ENOMEM, -EBUSY, etc.), we
> just return -EBUSY.
> 
> What do you think?

Hi,

While working on onlining/offlining of 4MB subsections I also stumbled
over the return value of offline_pages(). It would be nice if the
interface could actually indicate if an error is permanent or only
temporary.

For now I have to live with the assumption, that whenever this function
is not -EAGAIN or 0, that I simply have to retry later.

David

> 
> Thanks
> Oscar Salvador
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-05-23 14:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-05-23  7:35 [RFC] Checking for error code in __offline_pages Oscar Salvador
2018-05-23  7:52 ` Michal Hocko
2018-05-23  8:16   ` Michal Hocko
2018-05-23  8:19     ` Oscar Salvador
2018-05-23  9:28     ` Oscar Salvador
2018-05-23 10:26     ` Oscar Salvador
2018-05-23 11:38       ` Michal Hocko
2018-05-23 11:53         ` Oscar Salvador
2018-05-23  8:16   ` Oscar Salvador
2018-05-23  8:32     ` Michal Hocko
2018-05-23 14:51 ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).