linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC] Checking for error code in __offline_pages
@ 2018-05-23  7:35 Oscar Salvador
  2018-05-23  7:52 ` Michal Hocko
  2018-05-23 14:51 ` David Hildenbrand
  0 siblings, 2 replies; 11+ messages in thread
From: Oscar Salvador @ 2018-05-23  7:35 UTC (permalink / raw)
  To: linux-mm; +Cc: mhocko, vbabka, pasha.tatashin, akpm

Hi,

This is something I spotted while testing offlining memory.

__offline_pages() calls do_migrate_range() to try to migrate a range,
but we do not actually check for the error code.
This, besides of ignoring underlying failures, can led to a situations
where we never break up the loop because we are totally unaware of
what is going on.

They way I spotted this was when trying to offline all memblocks belonging
to a node.
Due to an unfortunate setting with movablecore, memblocks containing bootmem
memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
So while trying to remove that memory, the system failed in:

do_migrate_range()
{
...
	if (PageLRU(page))
		ret = isolate_lru_page(page);
	else
		ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);

	if (!ret)
		// success: do something
	else
		if (page_count(page))
			ret = -EBUSY;
...
}

Since the pages from bootmem are not LRU, we call isolate_movable_page()
but we fail when checking for __PageMovable().
Since the page_count is more than 0 we return -EBUSY, but we do not check this
in our caller, so we keep trying to migrate this memory over and over:

repeat:
...
        pfn = scan_movable_pages(start_pfn, end_pfn);
        if (pfn) { /* We have movable pages */
                ret = do_migrate_range(pfn, end_pfn);
                goto repeat;
        }

But this is not only situation where we can get stuck.
For example, if we fail with -ENOMEM in
migrate_pages()->unmap_and_move()/unmap_and_move_huge_page(), we will keep trying as well.
I think we should really detect these cases and fail with "goto failed_removal".
Something like

--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1651,6 +1651,11 @@ static int __ref __offline_pages(unsigned long start_pfn,
        pfn = scan_movable_pages(start_pfn, end_pfn);
        if (pfn) { /* We have movable pages */
                ret = do_migrate_range(pfn, end_pfn);
+               if (ret) {
+                       if (ret != -ENOMEM)
+                               ret = -EBUSY;
+                       goto failed_removal;
+               }
                goto repeat;
        }

Now, unless I overlooked something
migrate_pages()->unmap_and_move()/unmap_and_move_huge_page() can return:
-ENOMEM
-EAGAIN
-EBUSY
-ENOSYS.

I am not sure if we should differentiate betweeen those errors.
For example, it is possible that in migrate_pages() we just get -EAGAIN,
and we return the number of "retry" we tried without having really failed.
Although, since we do 10 passes it might be considered as failed.

And I am not sure either if we want to propagate the error codes, or in case we fail
in migrate_pages(), whatever the error was (-ENOMEM, -EBUSY, etc.), we
just return -EBUSY.

What do you think?

Thanks
Oscar Salvador

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-05-23 14:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-05-23  7:35 [RFC] Checking for error code in __offline_pages Oscar Salvador
2018-05-23  7:52 ` Michal Hocko
2018-05-23  8:16   ` Michal Hocko
2018-05-23  8:19     ` Oscar Salvador
2018-05-23  9:28     ` Oscar Salvador
2018-05-23 10:26     ` Oscar Salvador
2018-05-23 11:38       ` Michal Hocko
2018-05-23 11:53         ` Oscar Salvador
2018-05-23  8:16   ` Oscar Salvador
2018-05-23  8:32     ` Michal Hocko
2018-05-23 14:51 ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).