linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>
Cc: "dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com>,
	Oscar Salvador <osalvador@suse.de>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>
Subject: Re: [BUG ?] Offline Memory gets stuck in offline_pages()
Date: Mon, 1 Jul 2024 09:14:29 +0200
Message-ID: <68b4b249-ceac-4d5b-9d15-a2e33a1187e5@redhat.com>
In-Reply-To: <6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com>

On 01.07.24 03:25, Zhijian Li (Fujitsu) wrote:
> Hi all
> 
> 
> Overview:
> While testing CXL memory hot-remove, we noticed that `daxctl offline-memory dax0.0`
> sometimes gets stuck forever. `daxctl offline-memory dax0.0` writes "offline" to
> /sys/devices/system/memory/memoryNNN/state.

Hi,

See

Documentation/admin-guide/mm/memory-hotplug.rst

"
Further, when running into out of memory situations while migrating 
pages, or when still encountering permanently unmovable pages within 
ZONE_MOVABLE (-> BUG), memory offlining will keep retrying until it 
eventually succeeds.

When offlining is triggered from user space, the offlining context can 
be terminated by sending a signal. A timeout based offlining can easily 
be implemented via::

	% timeout $TIMEOUT offline_block | failure_handling
"
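
As a minimal sketch of such a timeout-based approach (this is not part of
the kernel documentation; the helper name offline_block_with_timeout and
its arguments are made up for illustration), the write to the state file
can be wrapped in timeout(1). timeout sends SIGTERM once the limit
expires, which is the kind of signal the text above says terminates the
offlining context:

```shell
# Hypothetical helper: ask the kernel to offline one memory block and give
# up after a timeout. Returns 0 on success, 124 if timeout(1) had to
# interrupt the write, or another non-zero status if the write itself failed.
offline_block_with_timeout() {
    state_file=$1      # e.g. /sys/devices/system/memory/memoryNNN/state
    timeout_secs=$2    # how long to wait before sending SIGTERM
    timeout "$timeout_secs" sh -c 'echo offline > "$1"' sh "$state_file"
}
```

It could be used along the lines of
`offline_block_with_timeout /sys/devices/system/memory/memoryNNN/state 60 || echo "offlining failed or timed out" >&2`,
with the failure branch deciding whether to retry.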

> 
> Workaround:
> When it happens, we can press Ctrl-C to abort it and then retry.
> The CXL memory can then be offlined successfully.
> 
> Where the kernel gets stuck:
> After digging into the kernel, we found that when the issue occurs, the kernel
> is stuck in the outer loop of offline_pages(). Below is a condensed excerpt of
> offline_pages() with the relevant parts annotated:
> 
> ```
> int __ref offline_pages()
> {
>     do { // outer loop
>         pfn = start_pfn;
>         do {
>             ret = scan_movable_pages(pfn, end_pfn, &pfn);  // returns -ENOENT
>             if (!ret)
>                 do_migrate_range(pfn, end_pfn);            // never reached
>         } while (!ret);
>         ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>     } while (ret);                                         // ret is -EBUSY
> }
> ```
> 
> In this case, we dumped the first page that cannot be isolated (see dump_page below); its
> content does not change between iterations:
> ```
> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...

Are you sure that's the problematic page?

refcount:0

Indicates that the page is free. But maybe it does not have PageBuddy() set.

It could also be that this is a "tail" page of a PageBuddy() page, and 
somehow we always end up on the tail in test_pages_isolated().
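
To double-check that the dumped pfn really lies in the memory block being
offlined, the pfn can be mapped to a memoryNNN index. The following is
only a sketch: the helper name pfn_to_memory_block is made up, and it
assumes 4 KiB base pages unless told otherwise. The block size is the
hex value exposed in /sys/devices/system/memory/block_size_bytes.

```shell
# Hypothetical helper: compute the memory block index (the NNN in memoryNNN)
# that contains a given pfn.
#   $1: pfn, e.g. 0x7980dd
#   $2: block size in hex without "0x", as printed by block_size_bytes
#   $3: page size in bytes (assumption: 4096 if omitted)
pfn_to_memory_block() {
    pfn=$1
    block_size_hex=$2
    page_size=${3:-4096}
    echo $(( $pfn * $page_size / 0x$block_size_hex ))
}
```

For example:
`pfn_to_memory_block 0x7980dd "$(cat /sys/devices/system/memory/block_size_bytes)"`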

Which kernel + architecture are you testing with?

> ```
> 
> Every time the issue occurs, the content of the page structure is similar.
> 
> Questions:
> Q1. Is this behavior expected? At least for an OS administrator, it should return
>       promptly (success or failure) instead of hanging indefinitely.

It's expected that it might take a long time (possibly forever) in 
corner cases. See the documentation quoted above.

But it's likely unexpected that we have some problematic page here.

> Q2. Regarding the offline_pages() function, encountering such a page indeed causes
>       an endless loop. Shouldn't another part of the kernel have changed the state
>       of this page in a timely manner?

There are various things that can go wrong. One issue might be that we 
try migrating a page but continuously fail to allocate memory to be used 
as a migration target. It seems unlikely with the page you dumped above, 
though.

Is that CXL memory maybe on a separate "fake" NUMA node, and does 
your workload mbind() itself to that NUMA node, possibly refusing to 
migrate somewhere else?
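
One rough way to check this from user space, assuming the policy of the
workload shows up in /proc/<pid>/numa_maps, is to list the mappings whose
policy field starts with "bind". This is only a sketch; the helper name
bound_mappings is made up for illustration:

```shell
# Hypothetical helper: print address and policy of every mapping in a
# numa_maps-style file whose memory policy (second field) is a bind policy.
bound_mappings() {
    awk '$2 ~ /^bind/ { print $1, $2 }' "$1"
}
```

Running `bound_mappings /proc/<pid>/numa_maps` with the pid of 
consume-memory would then show whether its mappings are bound to node 2.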

> 
>       When I use the workaround mentioned above (Ctrl-C and try offline again), I find
>       that the page state changes (see dump_page below):
> ```
> Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
> Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
> ```
> 
> What our test does:
> We have a CXL memory device, which is configured as kmem and onlined into the MOVABLE
> zone as NUMA node 2. We run two processes, consume-memory and offline-memory, in parallel;
> see the pseudo code below:
> 
> ```
> main()
> {
>       if (fork() == 0)
>           numactl -m 2 ./consume-memory

What exactly does "consume-memory" do? Does it involve hugetlb maybe?


-- 
Cheers,

David / dhildenb



