* [BUG ?] Offline Memory gets stuck in offline_pages()
@ 2024-07-01  1:25 Zhijian Li (Fujitsu)
  2024-07-01  7:14 ` David Hildenbrand
  2024-07-04  7:43 ` Zhijian Li (Fujitsu)
  0 siblings, 2 replies; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-01  1:25 UTC (permalink / raw)
  To: linux-mm@kvack.org, linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	david@redhat.com >> David Hildenbrand, Oscar Salvador,
	akpm@linux-foundation.org

Hi all


Overview:
While testing CXL memory hot-remove, we noticed that `daxctl offline-memory dax0.0`
sometimes hangs forever. `daxctl offline-memory dax0.0` writes "offline" to
/sys/devices/system/memory/memoryNNN/state.

Workaround:
When it happens, we can press Ctrl-C to abort the command and then retry it.
On the retry, the CXL memory goes offline successfully.

Where the kernel gets stuck:
After digging into the kernel, we found that when the issue occurs, the kernel
is stuck in the outer loop of offline_pages(). Below is a simplified excerpt of
offline_pages() with the relevant lines annotated:

```
int __ref offline_pages()
{
   do { // outer loop
     pfn = start_pfn;
     do {
       ret = scan_movable_pages(pfn, end_pfn, &pfn);  // returns -ENOENT
       if (!ret)
          do_migrate_range(pfn, end_pfn);             // not reached
     } while (!ret);
     ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
   } while (ret);                                     // ret is -EBUSY
}
```
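
To make the control flow concrete, here is a small user-space model of the loop above. It is only a sketch: `struct model` and the `model_*` functions are hypothetical stand-ins, not kernel code, and the iteration cap exists only so the model terminates. The signal check models the signal_pending() bail-out that mainline offline_pages() performs in the inner loop (omitted from the excerpt above), which would explain why Ctrl-C is able to abort the hang.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state the real loop depends on. */
struct model {
    bool page_stuck;        /* one page in the range never tests as isolated */
    int  signal_after;      /* outer iteration after which a signal is pending, -1 = never */
    int  outer_iterations;  /* counter, for inspection only */
};

/* Models scan_movable_pages(): nothing left to migrate. */
static int model_scan_movable_pages(const struct model *m)
{
    (void)m;
    return -ENOENT;
}

/* Models test_pages_isolated(): -EBUSY for as long as the page is stuck. */
static int model_test_pages_isolated(const struct model *m)
{
    return m->page_stuck ? -EBUSY : 0;
}

static int model_offline_pages(struct model *m, int max_iterations)
{
    int ret;

    do {
        if (m->outer_iterations >= max_iterations)
            return -ETIMEDOUT;  /* model-only bound; the excerpt has none */
        m->outer_iterations++;
        do {
            if (m->signal_after >= 0 &&
                m->outer_iterations > m->signal_after)
                return -EINTR;  /* models the signal_pending() check */
            ret = model_scan_movable_pages(m);
            /* ret == 0 would migrate pages here; never happens in this model */
        } while (!ret);
        ret = model_test_pages_isolated(m);
    } while (ret);

    return 0;
}
```

With `page_stuck` set and no signal, the model only stops at the artificial cap, which matches the observed behavior: nothing inside the loop changes the state that test_pages_isolated() is reporting.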

In this case, we dumped the first page that cannot be isolated (see dump_page below). Its
contents do not change from one iteration to the next:
```
Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
```

Every time the issue occurs, the content of the page structure is similar.

Questions:
Q1. Is this behavior expected? At least for an OS administrator, it should return
     promptly (success or failure) instead of hanging indefinitely.
Q2. In offline_pages(), encountering such a page does cause an endless loop.
     Shouldn't some other part of the kernel change the state of this page in a
     timely manner?

     When I use the workaround mentioned above (Ctrl-C and try offline again), I find
     that the page state changes (see dump_page below):
```
Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
```
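
The only difference between the two dumps is the second and third raw words. A minimal sketch of how they could be classified follows, under two assumptions that the dumps themselves do not confirm: that these words are the list_head (next/prev) embedded in struct page, and that this is x86_64, where LIST_POISON1/LIST_POISON2 are 0xdead000000000100 and 0xdead000000000122. The `MODEL_*` names and both helpers are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 values of LIST_POISON1 and LIST_POISON2. */
#define MODEL_LIST_POISON1 0xdead000000000100ULL
#define MODEL_LIST_POISON2 0xdead000000000122ULL

/* True if the two words carry list poison, i.e. list_del() unlinked the entry. */
static bool lru_is_poisoned(uint64_t next, uint64_t prev)
{
    return next == MODEL_LIST_POISON1 && prev == MODEL_LIST_POISON2;
}

/*
 * Heuristic: true if both words look like kernel virtual addresses
 * (top byte 0xff), i.e. the entry still appears to be linked on a list.
 */
static bool lru_is_linked(uint64_t next, uint64_t prev)
{
    const uint64_t kernel_min = 0xff00000000000000ULL;

    return next >= kernel_min && prev >= kernel_min;
}
```

Read this way, the page looks linked on some list while the offline is stuck (ffffdfbd9e603788 / ffffd4f0ffd97ef0) and unlinked after the retry (dead000000000100 / dead000000000122). Which list that is cannot be determined from the dump alone.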

What our test does:
We have a CXL memory device, which is configured as kmem and onlined into the MOVABLE
zone as NUMA node 2. We run two processes, consume-memory and offline-memory, in parallel;
see the pseudocode below:

```
main()
{
     if (fork() == 0)
         numactl -m 2 ./consume-memory
     else {
         daxctl offline-memory dax0.0
         wait()
     }
}
```
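
For anyone who wants to reproduce this, the pseudocode above could be written roughly as follows in C. This is a sketch rather than the actual test program: `spawn()`, `reap()` and `run_parallel()` are hypothetical helper names, and ./consume-memory is the test's own binary, which is not shown here.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork and exec argv; returns the child's pid, or -1 if fork() failed. */
static pid_t spawn(char *const argv[])
{
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }
    return pid;
}

/* Wait for pid; returns its exit status, or -1 if it did not exit normally. */
static int reap(pid_t pid)
{
    int status;

    if (pid < 0 || waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/*
 * Start the memory consumer, then the offline command, so that both run
 * in parallel. Returns the offline command's exit status.
 *
 * Intended arguments, taken from the pseudocode:
 *   consumer = { "numactl", "-m", "2", "./consume-memory", NULL }
 *   offline  = { "daxctl", "offline-memory", "dax0.0", NULL }
 */
static int run_parallel(char *const consumer[], char *const offline[])
{
    pid_t c = spawn(consumer);
    pid_t o = spawn(offline);
    int ret = reap(o);          /* blocks here if the offline hangs */

    reap(c);                    /* the consumer stays a zombie until this point */
    return ret;
}
```

As in the pseudocode, the consumer is reaped only after the offline command returns, which is consistent with it showing up as defunct while daxctl is stuck.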

Here is the process information while it is stuck:
```
root 25716 0.0 0.0 2460 1408 pts/0 S+ 15:28 0:00 ./main
root 25719 0.0 0.0 0 0 pts/0 Z+ 15:28 0:00 [consume-memory] <defunct>
root 25720 98.6 0.0 9476 3740 pts/0 R+ 15:28 0:26 daxctl offline-memory /dev/dax0.0
```

Feel free to let me know if you need more details.
Thank you for your attention to this issue. Looking forward to your insights.

Thanks
Zhijian
